# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import the users table.

In [2]:
users = pd.read_csv("../data/users.csv")
users.head(5)

Unnamed: 0.1,Unnamed: 0,Id,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
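An `Unnamed: 0` column like the one in the output above typically appears when a CSV file was written together with its index. Assuming that is the case here, the following sketch shows how `index_col=0` in `pd.read_csv` reads that column as the index instead. The CSV text is a made-up stand-in, not the contents of `../data/users.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV that was saved with its index:
# the first header cell is empty.
csv_text = ",Id,Reputation\n0,-1,1\n1,2,101\n"

# Default read: the empty header becomes a column named "Unnamed: 0".
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column is used as the index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(plain.columns))
print(list(indexed.columns))
```

Whether this applies to the real files depends on how they were exported, so check the raw CSV first.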


#### 3. Rename the Id column to UserID.

In [3]:
users.rename(columns = {'Id' : 'UserID'}, inplace = True)
users.head(5)

Unnamed: 0.1,Unnamed: 0,UserID,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
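By default `rename` silently ignores keys that match no column, so a typo in the old name goes unnoticed. The sketch below, on a small made-up dataframe, uses assignment instead of `inplace = True` and passes `errors="raise"` so that a wrong column name fails loudly.

```python
import pandas as pd

# Small illustrative dataframe (not the real users table).
demo = pd.DataFrame({"Id": [-1, 2], "Reputation": [1, 101]})

# rename returns a new dataframe; errors="raise" checks that "Id" exists.
renamed = demo.rename(columns={"Id": "UserID"}, errors="raise")

# A misspelled source column now raises KeyError instead of doing nothing.
try:
    demo.rename(columns={"id": "UserID"}, errors="raise")
    raised = False
except KeyError:
    raised = True

print(list(renamed.columns), raised)
```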


#### 4. Import the posts table. 

In [4]:
posts = pd.read_csv("../data/posts.csv")
posts.head(5)

Unnamed: 0.1,Unnamed: 0,Id,OwnerUserId,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3


#### 5. Rename the Id column to PostID and OwnerUserId to UserID.

In [5]:
posts.rename(columns = {'Id' : 'PostID', 'OwnerUserId' : 'UserID'}, inplace = True)
posts.head(5)

Unnamed: 0.1,Unnamed: 0,PostID,UserID,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3


#### 6. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: UserID, Reputation, Views, UpVotes  
**posts_sliced columns**: PostID, Score, UserID, ViewCount

In [6]:
users_sliced = users.filter(['UserID','Reputation','Views', 'UpVotes'], axis=1)
users_sliced.head(5)

Unnamed: 0,UserID,Reputation,Views,UpVotes
0,-1,1,0,5007
1,2,101,25,3
2,3,101,22,19
3,4,101,11,0
4,5,6792,1145,662


In [7]:
posts_sliced = posts.filter(['PostID', 'Score', 'UserID', 'ViewCount'], axis =1)
posts_sliced.head(5)

Unnamed: 0,PostID,Score,UserID,ViewCount
0,1,23,8.0,1278.0
1,2,22,24.0,8198.0
2,3,54,18.0,3613.0
3,4,13,23.0,5224.0
4,5,81,23.0,
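One thing to be aware of when selecting columns with `filter`: labels that do not exist are dropped silently, whereas bracket selection raises a `KeyError`. The sketch below illustrates the difference on a small made-up dataframe with a deliberately misspelled column name.

```python
import pandas as pd

# Small illustrative dataframe (not the real users table).
users_demo = pd.DataFrame(
    {"UserID": [-1, 2], "Reputation": [1, 101], "Views": [0, 25]}
)

# filter keeps only the labels that exist; the typo "Reputaton" is ignored.
filtered = users_demo.filter(["UserID", "Reputaton"], axis=1)

# Bracket selection with the same list raises KeyError.
try:
    users_demo[["UserID", "Reputaton"]]
    raised = False
except KeyError:
    raised = True

print(list(filtered.columns), raised)
```

Either form works for the cells above; bracket selection is simply stricter about misspelled names.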


#### 7. Merge the two dataframes created in the step above (6), users_sliced and posts_sliced.
Use [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) to combine the users and posts dataframes.

In [8]:
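# UserID is the only column the two dataframes share, so name it explicitly as the key.
userpost_new = users_sliced.merge(posts_sliced, on='UserID')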
userpost_new.head(10)

Unnamed: 0,UserID,Reputation,Views,UpVotes,PostID,Score,ViewCount
0,-1,1,0,5007,2175,0,
1,-1,1,0,5007,8576,0,
2,-1,1,0,5007,8578,0,
3,-1,1,0,5007,8981,0,
4,-1,1,0,5007,8982,0,
5,-1,1,0,5007,9857,0,
6,-1,1,0,5007,9858,0,
7,-1,1,0,5007,9860,0,
8,-1,1,0,5007,10130,0,
9,-1,1,0,5007,10131,0,
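`merge` performs an inner join by default, so rows whose `UserID` appears in only one dataframe are dropped. As a rough sketch on made-up data, an outer join with `indicator=True` shows which rows have no match on the other side.

```python
import pandas as pd

# Illustrative data only: user 4 has a post but no users row,
# and users 2 and 3 have no posts.
users_demo = pd.DataFrame({"UserID": [1, 2, 3], "Reputation": [10, 20, 30]})
posts_demo = pd.DataFrame(
    {"PostID": [100, 101, 102], "UserID": [1, 1, 4], "Score": [5, 7, 2]}
)

# Default-style inner join: only UserID values present in both dataframes.
inner = users_demo.merge(posts_demo, on="UserID", how="inner")

# Outer join keeps everything; the _merge column records where each row came from.
outer = users_demo.merge(posts_demo, on="UserID", how="outer", indicator=True)

print(len(inner), len(outer))
print(outer["_merge"].value_counts())
```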


#### 8. How many missing values do you have in your merged dataframe? On which columns?

In [9]:
userpost_new.isnull()

Unnamed: 0,UserID,Reputation,Views,UpVotes,PostID,Score,ViewCount
0,False,False,False,False,False,False,True
1,False,False,False,False,False,False,True
2,False,False,False,False,False,False,True
3,False,False,False,False,False,False,True
4,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...
90579,False,False,False,False,False,False,False
90580,False,False,False,False,False,False,False
90581,False,False,False,False,False,False,False
90582,False,False,False,False,False,False,False


In [10]:
userpost_new.isnull().sum()

UserID            0
Reputation        0
Views             0
UpVotes           0
PostID            0
Score             0
ViewCount     48396
dtype: int64
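The per-column counts above can be condensed to answer both parts of the question directly. A minimal sketch on a made-up dataframe: keep only the columns that have missing values, sum for the total, and use the mean of the boolean mask for the share of missing rows.

```python
import numpy as np
import pandas as pd

# Illustrative dataframe with missing values in one column.
demo = pd.DataFrame(
    {"Score": [1, 2, 3, 4], "ViewCount": [10.0, np.nan, np.nan, 7.0]}
)

per_column = demo.isnull().sum()            # missing values per column
with_missing = per_column[per_column > 0]   # only the columns that have any
total_missing = int(per_column.sum())       # total across the dataframe
share_missing = demo.isnull().mean()        # fraction of rows missing per column

print(with_missing)
print(total_missing)
print(share_missing)
```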

#### 9. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [11]:
# All missing values are in the ViewCount column, so you could either drop that column or fill it, for example with the mean of the remaining values. The next cell fills the missing values with 0.

In [12]:
userpost_new = userpost_new.fillna(0)
userpost_new.isnull().sum()

UserID        0
Reputation    0
Views         0
UpVotes       0
PostID        0
Score         0
ViewCount     0
dtype: int64
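Filling with 0 removes the nulls, but it also changes column statistics because 0 is treated as a real view count. The sketch below compares three options on made-up data: filling with 0, filling with the column mean, and dropping the affected rows. Which one is appropriate depends on what the missing `ViewCount` values mean, which the data shown here does not settle.

```python
import numpy as np
import pandas as pd

# Illustrative dataframe (not the real merged table).
demo = pd.DataFrame(
    {"PostID": [1, 2, 3, 4], "ViewCount": [10.0, np.nan, 30.0, np.nan]}
)

# Option 1: fill with 0, which pulls the column mean down.
filled_zero = demo.fillna({"ViewCount": 0})

# Option 2: fill with the mean of the observed values, which preserves the mean.
filled_mean = demo.fillna({"ViewCount": demo["ViewCount"].mean()})

# Option 3: drop the rows where ViewCount is missing.
dropped = demo.dropna(subset=["ViewCount"])

print(filled_zero["ViewCount"].mean())
print(filled_mean["ViewCount"].mean())
print(len(dropped))
```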