#### 1. Import pandas library

In [1]:
import pandas as pd

#### 2. Import users table:

In [2]:
users=pd.read_csv('users_table.csv')

#### 3. Rename Id column to userId

In [3]:
users.rename(columns={'Id':'userId'}, inplace = True)

#### 4. Import posts table:

In [4]:
posts=pd.read_csv('posts_table.csv')

#### 5. Rename Id column to postId and OwnerUserId to userId

In [5]:
posts.rename(columns={'Id':'postId', 'OwnerUserId':'userId'}, inplace=True)

#### 6. Define new dataframes for users and posts with the following selected columns:
    **users columns**: userId, Reputation, Views, UpVotes, DownVotes
    **posts columns**: postId, Score, userId, ViewCount, CommentCount

In [6]:
users=users[['userId','Reputation','Views','UpVotes','DownVotes']]
posts=posts[['postId', 'Score','userId','ViewCount','CommentCount']]
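Steps 2 to 6 (read, rename, select) can also be condensed into one chained expression. The sketch below is only an illustration: it reads from an in-memory CSV string rather than `users_table.csv`, the rows are made up, and the `DisplayName` column is a hypothetical extra column that gets filtered out by `usecols`.

```python
import io

import pandas as pd

# Made-up CSV content standing in for users_table.csv (DisplayName is hypothetical).
csv_text = (
    "Id,Reputation,Views,UpVotes,DownVotes,DisplayName\n"
    "1,10,5,2,0,alice\n"
    "2,20,7,3,1,bob\n"
)

# usecols keeps only the listed columns while reading; rename then fixes the key name.
toy_users = (
    pd.read_csv(
        io.StringIO(csv_text),
        usecols=["Id", "Reputation", "Views", "UpVotes", "DownVotes"],
    )
    .rename(columns={"Id": "userId"})
)

print(toy_users.columns.tolist())
```

Selecting columns at read time avoids loading data that is discarded right afterwards, which can matter for large tables.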

#### 7. Merge both dataframes, users and posts.
Use [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) to combine the posts and users dataframes.

In [7]:
merged_df=pd.merge(left=users,
                   right=posts,
                   left_on='userId',
                   right_on='userId',
                   how='inner'
)
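An inner join silently drops users without posts and posts without a matching user. A quick way to see what would be lost is an outer join with `indicator=True`. The following is a minimal sketch on made-up toy data, not on the real tables; the names `toy_users`, `toy_posts` and `unmatched` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration).
toy_users = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [10, 20, 30]})
toy_posts = pd.DataFrame(
    {"postId": [100, 101, 102], "userId": [1, 1, 4], "Score": [5, 0, 2]}
)

# Inner join: only userIds present in both tables survive.
inner = pd.merge(toy_users, toy_posts, on="userId", how="inner")

# Outer join with indicator=True adds a `_merge` column telling where each row came from.
outer = pd.merge(toy_users, toy_posts, on="userId", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(len(inner))
print(unmatched[["userId", "_merge"]])
```

Since both key columns share the same name, `on="userId"` is equivalent to passing `left_on` and `right_on` separately.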

#### 8. How many missing values do you have in your merged dataframe? On which columns?

In [8]:
# To check for missing values we'll use the isnull function
merged_df.isnull().sum()

userId              0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
postId              0
Score               0
ViewCount       23572
CommentCount        0
dtype: int64

In [9]:
# There are 23572 missing values, all in column ViewCount
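Counts alone are hard to judge without the total number of rows, so it can help to report the count and the percentage side by side and to list only the affected columns. A small sketch on made-up data (the `toy` dataframe and the `n_missing` / `pct_missing` column names are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Invented data: ViewCount has two missing entries out of four.
toy = pd.DataFrame(
    {"Score": [1, 2, 3, 4], "ViewCount": [10.0, np.nan, np.nan, 40.0]}
)

# isnull() gives a boolean frame; sum() counts the True values, mean() gives their share.
missing = pd.DataFrame(
    {
        "n_missing": toy.isnull().sum(),
        "pct_missing": toy.isnull().mean() * 100,
    }
)

# Keep only the columns that actually have missing values.
only_missing = missing[missing["n_missing"] > 0]
print(only_missing)
```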

#### 9. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before moving on to the next step

In [10]:
# To answer this question we first compare the number of missing values with the total number of rows
# to get the percentage of null values.
# We also check the type of the values in this column and whether they could be filled with a statistic such as the mean.
print((merged_df['ViewCount'].isnull().sum()/merged_df.shape[0])*100)
print(merged_df['ViewCount'].dtypes)
merged_df['ViewCount'].value_counts()

60.499974333966435
float64


98.0       43
150.0      43
122.0      42
156.0      41
108.0      41
143.0      39
88.0       39
77.0       39
120.0      39
159.0      39
124.0      38
203.0      38
84.0       36
246.0      36
136.0      36
104.0      36
132.0      36
158.0      36
112.0      36
134.0      36
113.0      35
126.0      35
148.0      35
106.0      34
114.0      34
131.0      34
111.0      34
86.0       33
149.0      33
92.0       33
           ..
5903.0      1
2261.0      1
10503.0     1
1220.0      1
1752.0      1
10785.0     1
1245.0      1
4801.0      1
1803.0      1
2063.0      1
2028.0      1
2408.0      1
5620.0      1
2542.0      1
2961.0      1
2149.0      1
3159.0      1
11200.0     1
1966.0      1
1736.0      1
4470.0      1
21588.0     1
1498.0      1
1342.0      1
14004.0     1
4414.0      1
9772.0      1
1796.0      1
4976.0      1
2174.0      1
Name: ViewCount, Length: 3402, dtype: int64

In [11]:
# Around 60% of the values in this column are missing, so we'll drop the whole column
merged_df=merged_df.drop(columns=['ViewCount'])
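The same decision can be written as a rule: drop columns whose share of missing values exceeds a threshold and fill the remaining gaps. This is a sketch under assumptions, not the method used above: the 0.5 threshold, the median fill, and the toy data (including the missing value in `CommentCount`) are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Invented data: ViewCount is 60% missing, CommentCount is 20% missing.
toy = pd.DataFrame(
    {
        "Score": [1, 2, 3, 4, 5],
        "ViewCount": [10.0, np.nan, np.nan, np.nan, 50.0],
        "CommentCount": [0, 1, np.nan, 2, 3],
    }
)

MAX_MISSING_SHARE = 0.5  # hypothetical cut-off, adjust to the problem at hand

# Share of missing values per column.
missing_share = toy.isnull().mean()

# Columns above the cut-off are dropped entirely.
too_sparse = missing_share[missing_share > MAX_MISSING_SHARE].index
cleaned = toy.drop(columns=too_sparse)

# Remaining gaps are filled with each column's median.
cleaned = cleaned.fillna(cleaned.median())
print(cleaned)
```

The median is used here instead of the mean because count-like columns such as view counts tend to be heavily skewed, as the `value_counts` output above suggests.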

#### 10. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [12]:
merged_df.dtypes

userId          int64
Reputation      int64
Views           int64
UpVotes         int64
DownVotes       int64
postId          int64
Score           int64
CommentCount    int64
dtype: object

In [13]:
merged_df.head(2)

   userId  Reputation  Views  UpVotes  DownVotes  postId  Score  CommentCount
0      -1           1      0     5007       1920    2175      0             0
1      -1           1      0     5007       1920    8576      0             0


In [14]:
merged_df.nunique().sort_values(ascending=False)

postId          38962
userId           8138
Reputation        871
Views             336
UpVotes           296
Score             122
DownVotes          68
CommentCount       34
dtype: int64

In [24]:
# All columns are already numeric; Score may need to be a float later on, so we cast it now
merged_df=merged_df.astype({'Score':'float'})

In [25]:
merged_df.dtypes

userId            int64
Reputation        int64
Views             int64
UpVotes           int64
DownVotes         int64
postId            int64
Score           float64
CommentCount      int64
dtype: object
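As a side note on data types, `ViewCount` was reported as `float64` earlier because NumPy integer columns cannot hold `NaN`, so pandas falls back to float. If such a column has to stay integer-like, pandas' nullable `Int64` dtype is one option. A minimal sketch on made-up data (the `toy` dataframe is illustrative only):

```python
import numpy as np
import pandas as pd

# Invented data: the NaN forces ViewCount to float64 on creation.
toy = pd.DataFrame({"Score": [3, 0, -1], "ViewCount": [98.0, np.nan, 150.0]})
print(toy.dtypes)

# astype accepts a column -> dtype mapping; "Int64" (capital I) is the nullable integer dtype.
converted = toy.astype({"Score": "float", "ViewCount": "Int64"})
print(converted.dtypes)
```

The cast to `Int64` only succeeds when every non-missing value is a whole number, which holds for the toy values here.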