# Data Cleaning 

#### 1. Import the pandas library.

In [1]:
import pandas as pd

#### 2. Import the users table.

In [2]:
users = pd.read_csv("../data/users.csv")
users

Unnamed: 0.1,Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
...,...,...,...,...,...,...
40320,40320,55743,1,0,0,0
40321,40321,55744,6,1,0,0
40322,40322,55745,101,0,0,0
40323,40323,55746,106,1,0,0
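
An extra unnamed column like the one in the output above usually appears when a CSV was written with its index and then read back without declaring it. The following is a minimal sketch on a made-up two-row CSV (the `csv_text` string is an assumption, not the real `users.csv`), showing how `index_col=0` keeps that column out of the data:

```python
import io
import pandas as pd

# Toy CSV that mimics a file saved with DataFrame.to_csv() and its default index.
csv_text = ",userId,Reputation\n0,-1,1\n1,2,101\n"

# Read as-is: the saved index comes back as a regular 'Unnamed: 0' column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Declare the first column as the index so it does not pollute the data.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

Alternatively, the stray column can be removed after loading with `drop(columns=...)`.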


#### 3. Rename the Id column to userId.

In [3]:
users.rename(columns = {'Id':'userId'}, inplace = True)

#### 4. Import the posts table. 

In [4]:
posts = pd.read_csv("../data/posts.csv")

#### 5. Rename the Id column to PostId and OwnerUserId to userId.

In [5]:
posts.rename(columns = {'Id':'PostId', "OwnerUserId":"userId"}, inplace = True)
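
Note that `rename` skips any key that does not match an existing column, so a misspelled column name goes unnoticed. A small sketch on a toy frame (`posts_demo` and the misspelled key `OwnerId` are made up for illustration) shows the default behaviour and the stricter `errors="raise"` option:

```python
import pandas as pd

posts_demo = pd.DataFrame({"Id": [1, 2], "OwnerUserId": [10, 11]})

# By default, keys that match no column are skipped without any warning.
unchanged = posts_demo.rename(columns={"OwnerId": "userId"})
print(unchanged.columns.tolist())

# errors="raise" turns the same mistake into a KeyError.
try:
    posts_demo.rename(columns={"OwnerId": "userId"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```

Checking `posts.columns` after a rename is a cheap way to confirm that it did what was intended.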

#### 6. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts_sliced columns**: PostId, Score, userId, ViewCount, CommentCount

In [6]:
users_sliced = users[["userId", "Reputation", "Views", "UpVotes", "DownVotes"]].copy()
posts_sliced = posts[["PostId", "Score", "userId", "ViewCount", "CommentCount"]].copy()
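
The `.copy()` call makes each slice an independent dataframe, so later changes to it cannot affect the original table. A minimal sketch with made-up data (`users_demo` and its `Location` column are assumptions):

```python
import pandas as pd

users_demo = pd.DataFrame(
    {"userId": [1, 2], "Reputation": [101, 6], "Location": ["a", "b"]}
)

# Select two columns and take an explicit copy of the result.
sliced = users_demo[["userId", "Reputation"]].copy()

# Modifying the copy leaves the original dataframe untouched.
sliced["Reputation"] = sliced["Reputation"] * 2
print(users_demo["Reputation"].tolist())
print(sliced["Reputation"].tolist())
```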

In [7]:
users_sliced

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5
...,...,...,...,...,...
40320,55743,1,0,0,0
40321,55744,6,1,0,0
40322,55745,101,0,0,0
40323,55746,106,1,0,0


#### 7. Merge the two dataframes created in the step above (6), users_sliced and posts_sliced.
You will need to [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) the posts and users dataframes.

In [8]:
users_posts = users_sliced.merge(posts_sliced, how="inner", on="userId")
users_posts

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,PostId,Score,ViewCount,CommentCount
0,-1,1,0,5007,1920,2175,0,,0
1,-1,1,0,5007,1920,8576,0,,0
2,-1,1,0,5007,1920,8578,0,,0
3,-1,1,0,5007,1920,8981,0,,0
4,-1,1,0,5007,1920,8982,0,,0
...,...,...,...,...,...,...,...,...,...
90579,55734,1,0,0,0,115352,0,16.0,0
90580,55738,11,0,0,0,115360,2,40.0,4
90581,55742,6,0,0,0,115366,1,17.0,0
90582,55744,6,1,0,0,115370,1,13.0,2
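
An inner merge keeps only the users that have at least one post and repeats each user row once per post. The sketch below uses small made-up frames (`users_demo` and `posts_demo` are assumptions, not the real tables) to show that behaviour, plus two optional checks: `validate` to assert that `userId` is unique on the users side, and `indicator` to see which rows an inner merge would discard.

```python
import pandas as pd

users_demo = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [101, 6, 1]})
posts_demo = pd.DataFrame(
    {"PostId": [10, 11, 12], "userId": [1, 1, 4], "Score": [5, 0, 2]}
)

# Inner merge: user 1 has two posts, users 2 and 3 have none, post 12 has no user.
inner = users_demo.merge(
    posts_demo, how="inner", on="userId", validate="one_to_many"
)
print(len(inner))

# Outer merge with indicator=True labels each row by where its key was found.
outer = users_demo.merge(posts_demo, how="outer", on="userId", indicator=True)
print(outer["_merge"].value_counts())
```

Comparing the row counts before and after the merge is a quick sanity check on the result.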


#### 8. How many missing values do you have in your merged dataframe? On which columns?

In [9]:
users_posts.isnull().sum().sum()

48396

In [10]:
users_posts.isnull().sum()

userId              0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
PostId              0
Score               0
ViewCount       48396
CommentCount        0
dtype: int64
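
Absolute counts are easier to judge next to the share of rows they represent. As a hedged illustration on made-up data (`demo` is not the real dataframe), `isnull().mean()` gives the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {"Score": [0, 1, 2, 3], "ViewCount": [np.nan, 16.0, np.nan, 40.0]}
)

# Number of missing values per column, keeping only the affected columns.
missing = demo.isnull().sum()
print(missing[missing > 0])

# Fraction of missing values per column (mean of the boolean mask).
share = demo.isnull().mean()
print(share)
```

Applied to `users_posts`, the same call would express the missing `ViewCount` values as a fraction of all rows.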

#### 9. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [14]:
users_posts["ViewCount"] = users_posts["ViewCount"].fillna(0)
users_posts["ViewCount"]

0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
         ... 
90579    16.0
90580    40.0
90581    17.0
90582    13.0
90583     5.0
Name: ViewCount, Length: 90584, dtype: float64
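
Filling keeps every row but assumes that a missing view count can be read as zero views. Dropping avoids that assumption at the cost of losing rows. A minimal sketch of both options on made-up data (`demo` is an assumption, not the real dataframe):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"PostId": [1, 2, 3], "ViewCount": [np.nan, 16.0, np.nan]})

# Option A: fill missing view counts with 0 and keep every row.
filled = demo.assign(ViewCount=demo["ViewCount"].fillna(0))
print(len(filled), filled["ViewCount"].isnull().sum())

# Option B: drop the rows where ViewCount is missing.
dropped = demo.dropna(subset=["ViewCount"])
print(len(dropped))
```

When a large share of the rows is affected, dropping removes much of the data, which is one argument for filling instead.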

#### 10. Adjust the data types in order to avoid future issues. Which ones should be changed? 
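
One likely candidate is `ViewCount`: it holds whole numbers but is stored as `float64`, because NaN cannot be represented in a plain integer column. The sketch below, on made-up data, shows two possible conversions. Which one fits depends on the choice made in the previous step.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"ViewCount": [np.nan, 16.0, 40.0]})
print(demo.dtypes)

# After filling the gaps, the column can be cast to a regular integer type.
as_int = demo["ViewCount"].fillna(0).astype("int64")
print(as_int.dtype, as_int.tolist())

# If missing values should stay missing, pandas' nullable Int64 dtype keeps them as <NA>.
nullable = demo["ViewCount"].astype("Int64")
print(nullable.dtype, int(nullable.isna().sum()))
```

Calling `users_posts.dtypes` before and after the conversion confirms which columns actually changed.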