# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import the users table.

In [9]:
users = pd.read_csv("../data/users.csv")
users

Unnamed: 0.1,Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
...,...,...,...,...,...,...
40320,40320,55743,1,0,0,0
40321,40321,55744,6,1,0,0
40322,40322,55745,101,0,0,0
40323,40323,55746,106,1,0,0
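
The `Unnamed: 0.1` and `Unnamed: 0` columns in the output above look like leftover index columns, which typically appear when a dataframe is written to CSV with its index and then read back without `index_col`. The sketch below reproduces the effect on a tiny in-memory CSV; the `raw` string is a made-up stand-in for `users.csv`, not the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for users.csv: the first column is a saved index
# with an empty header.
raw = ",userId,Reputation\n0,-1,1\n1,2,101\n"

# Default read: the header-less first column is kept as data
# and named "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(raw))
print(with_extra.columns.tolist())

# index_col=0 tells pandas to use that column as the index instead.
clean = pd.read_csv(io.StringIO(raw), index_col=0)
print(clean.columns.tolist())
```

Writing with `to_csv(..., index=False)` avoids saving the index in the first place.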


#### 3. Rename Id column to userId.

In [10]:
users_2 = users.rename({"userId":"Id"}, axis=1)
users_2

Unnamed: 0.1,Unnamed: 0,Id,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
...,...,...,...,...,...,...
40320,40320,55743,1,0,0,0
40321,40321,55744,6,1,0,0
40322,40322,55745,101,0,0,0
40323,40323,55746,106,1,0,0
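
The direction of the mapping passed to `rename` matters: keys are the existing column names and values are the new ones. A minimal sketch on a toy dataframe (the data is invented for illustration) shows a rename from `Id` to `userId`, and what happens when the key matches no column.

```python
import pandas as pd

# Toy frame with an "Id" column, as the exercise prompt assumes.
toy = pd.DataFrame({"Id": [1, 2], "Reputation": [101, 6]})

# {old_name: new_name}: "Id" becomes "userId".
renamed = toy.rename(columns={"Id": "userId"})
print(renamed.columns.tolist())

# A key that matches no column is ignored by default (errors="ignore"),
# so a mapping with a mistyped key leaves the columns untouched.
unchanged = toy.rename(columns={"userID": "Id"})
print(unchanged.columns.tolist())
```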


#### 4. Import the posts table. 

In [25]:
post = pd.read_csv("../data/posts.csv")
post

Unnamed: 0.1,Unnamed: 0,PostId,userId,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3
...,...,...,...,...,...,...
91971,91971,115374,805.0,2,,2
91972,91972,115375,49365.0,0,9.0,0
91973,91973,115376,55746.0,1,5.0,2
91974,91974,115377,805.0,0,,0


#### 5. Rename Id column to postId and OwnerUserId to userId.

In [24]:
post.rename({"PostId":"Id", "userId":"OwnerUserId"}, axis=1)

Unnamed: 0.1,Unnamed: 0,Id,OwnerUserId,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3
...,...,...,...,...,...,...
91971,91971,115374,805.0,2,,2
91972,91972,115375,49365.0,0,9.0,0
91973,91973,115376,55746.0,1,5.0,2
91974,91974,115377,805.0,0,,0
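
Note that `rename` returns a new dataframe and leaves the original untouched unless the result is assigned. The following sketch, on invented data, shows the difference; the column names `Id` and `OwnerUserId` are taken from the exercise prompt.

```python
import pandas as pd

toy_posts = pd.DataFrame({"Id": [1, 2], "OwnerUserId": [8.0, 24.0]})

# The result is discarded here, so toy_posts keeps its old column names.
toy_posts.rename(columns={"Id": "postId", "OwnerUserId": "userId"})
before = toy_posts.columns.tolist()

# Assigning the result keeps the renamed columns.
toy_posts = toy_posts.rename(columns={"Id": "postId", "OwnerUserId": "userId"})
after = toy_posts.columns.tolist()

print(before)
print(after)
```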


#### 6. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts_sliced columns**: postId, Score, userId, ViewCount, CommentCount

In [29]:
# Keep only the columns needed for the analysis
users_sliced = users[["userId","Reputation","Views","UpVotes","DownVotes"]]
users_sliced.head()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5


In [30]:
# Keep only the columns needed for the analysis
posts_sliced = post[["PostId","Score","userId","ViewCount","CommentCount"]]
posts_sliced.head()

Unnamed: 0,PostId,Score,userId,ViewCount,CommentCount
0,1,23,8.0,1278.0,1
1,2,22,24.0,8198.0,1
2,3,54,18.0,3613.0,4
3,4,13,23.0,5224.0,2
4,5,81,23.0,,3


#### 7. Merge the two dataframes created in step 6, users_sliced and posts_sliced.
You will need to [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) the posts and users dataframes.

In [38]:
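# Inner join on the shared key; this spells out what merge() does by default here
new_df = users_sliced.merge(posts_sliced, on="userId", how="inner")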
new_df

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,PostId,Score,ViewCount,CommentCount
0,-1,1,0,5007,1920,2175,0,,0
1,-1,1,0,5007,1920,8576,0,,0
2,-1,1,0,5007,1920,8578,0,,0
3,-1,1,0,5007,1920,8981,0,,0
4,-1,1,0,5007,1920,8982,0,,0
...,...,...,...,...,...,...,...,...,...
90579,55734,1,0,0,0,115352,0,16.0,0
90580,55738,11,0,0,0,115360,2,40.0,4
90581,55742,6,0,0,0,115366,1,17.0,0
90582,55744,6,1,0,0,115370,1,13.0,2


#### 8. How many missing values do you have in your merged dataframe? On which columns?

In [39]:
new_df.isnull()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,PostId,Score,ViewCount,CommentCount
0,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...
90579,False,False,False,False,False,False,False,False,False
90580,False,False,False,False,False,False,False,False,False
90581,False,False,False,False,False,False,False,False,False
90582,False,False,False,False,False,False,False,False,False
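
The boolean table above marks each missing cell but does not count them. One way to get the counts, sketched here on a toy dataframe rather than the real `new_df`, is to sum the booleans per column.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing ViewCount values.
toy = pd.DataFrame({"Score": [0, 2, 1], "ViewCount": [np.nan, 40.0, np.nan]})

# True counts as 1, so summing isnull() gives missing values per column.
per_column = toy.isnull().sum()

# Total number of missing cells, and the columns that contain any.
total = int(per_column.sum())
cols_with_missing = per_column[per_column > 0].index.tolist()

print(per_column)
print(total, cols_with_missing)
```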


#### 9. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [44]:
df = new_df.fillna(0)

df.isnull().sum()

userId          0
Reputation      0
Views           0
UpVotes         0
DownVotes       0
PostId          0
Score           0
ViewCount       0
CommentCount    0
dtype: int64
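
Filling and dropping are not interchangeable, because they change later statistics in different ways. The sketch below compares the two on invented data; whether 0 is a sensible fill value depends on what a missing `ViewCount` actually means in the source data, which this example does not settle.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"PostId": [1, 2, 3], "ViewCount": [np.nan, 40.0, 17.0]})

# Option A: fill only the affected column with 0 and keep every row.
filled = toy.fillna({"ViewCount": 0})

# Option B: drop the rows where ViewCount is missing.
dropped = toy.dropna(subset=["ViewCount"])

# The mean differs: the zeros pull it down in option A.
mean_filled = filled["ViewCount"].mean()
mean_dropped = dropped["ViewCount"].mean()
print(len(filled), mean_filled)
print(len(dropped), mean_dropped)
```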

#### 10. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [46]:
# Both columns hold whole numbers stored as floats; cast them to integers
df["ViewCount"] = df["ViewCount"].astype(int)
df["userId"] = df["userId"].astype(int)
df.dtypes

userId          int64
Reputation      int64
Views           int64
UpVotes         int64
DownVotes       int64
PostId          int64
Score           int64
ViewCount       int64
CommentCount    int64
dtype: object
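
The cast to `int` above works because the missing values were filled first. As a sketch on toy data, the example below shows that `astype(int)` raises when NaN is still present, and that pandas' nullable `Int64` dtype is one alternative that keeps missing values as `<NA>`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"userId": [8.0, 24.0], "ViewCount": [1278.0, np.nan]})

# A plain integer column cannot hold NaN, so this cast fails.
raised = False
try:
    toy["ViewCount"].astype(int)
except ValueError:
    raised = True

# The nullable "Int64" dtype (capital I) stores integers alongside <NA>.
nullable = toy["ViewCount"].astype("Int64")

# Columns without missing values can be cast directly.
user_ids = toy["userId"].astype("int64")

print(raised)
print(nullable.dtype, user_ids.dtype)
```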