# Data Cleaning 

#### 1. Import pandas library.

In [18]:
import numpy as np
import pandas as pd

#### 2. Import the users table.

In [19]:
users = pd.read_csv('../data/users.csv')
users.head()

Unnamed: 0.1,Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
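
The `Unnamed: 0` column in the output above usually means the CSV was written with its index included. A minimal sketch, assuming that is what happened here; it uses an in-memory stand-in for `../data/users.csv` (the `csv_text` string is hypothetical, with rows copied from the output above) and reads it with and without `index_col=0`:

```python
import io
import pandas as pd

# Hypothetical stand-in for ../data/users.csv; assumes the file was saved with its index.
csv_text = """,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
"""

# Default read: the saved index comes back as a regular column named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 reuses that first column as the index instead of keeping it as data.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Selecting only the needed columns, as step 6 does, removes the extra column as well; `index_col=0` just avoids carrying it around in the meantime.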


#### 3. Rename Id column to userId.

In [25]:
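# Note: the mapping is {old: new}, so this renames userId to Id, the reverse of the step title.
# The CSV already names the column userId, and later steps use `users`, not `users_renamed`.
users_renamed = users.rename(columns={'userId':'Id'})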
users_renamed.head()

Unnamed: 0.1,Unnamed: 0,Id,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
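
For reference, a small sketch of the rename in the direction the step title describes. The `toy_users` frame and its `Id` column are hypothetical, since the CSV loaded above already uses `userId`. It also shows that `rename` ignores keys that match no column unless `errors="raise"` is passed:

```python
import pandas as pd

# Hypothetical toy frame with an 'Id' column, as the step title assumes.
toy_users = pd.DataFrame({"Id": [-1, 2, 3], "Reputation": [1, 101, 101]})

# The mapping is {old_name: new_name}.
renamed = toy_users.rename(columns={"Id": "userId"})

# By default, keys that match no column are ignored silently,
# so repeating the call leaves the frame unchanged.
unchanged = renamed.rename(columns={"Id": "userId"})

# errors="raise" turns a missing key into a KeyError instead.
try:
    renamed.rename(columns={"Id": "userId"}, errors="raise")
    missing_key_detected = False
except KeyError:
    missing_key_detected = True

print(list(renamed.columns), missing_key_detected)
```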


#### 4. Import the posts table. 

In [28]:
posts = pd.read_csv('../data/posts.csv')
posts.head()

Unnamed: 0.1,Unnamed: 0,PostId,userId,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3


#### 5. Rename Id column to postId and OwnerUserId to userId.

In [30]:
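# Note: this renames PostId to Id. The loaded table has no Id or OwnerUserId column, and userId is already present.
# Later steps use `posts`, not `posts_renamed`.
posts_renamed = posts.rename(columns={'PostId':'Id'})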
posts_renamed.head()

Unnamed: 0.1,Unnamed: 0,Id,userId,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3


#### 6. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts_sliced columns**: postId, Score, userId, ViewCount, CommentCount

In [48]:
users_sliced = users[['userId','Reputation','Views','UpVotes','DownVotes']]
users_sliced

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5
...,...,...,...,...,...
40320,55743,1,0,0,0
40321,55744,6,1,0,0
40322,55745,101,0,0,0
40323,55746,106,1,0,0


In [49]:
posts_sliced = posts[['PostId','Score','userId','ViewCount','CommentCount']]
posts_sliced

Unnamed: 0,PostId,Score,userId,ViewCount,CommentCount
0,1,23,8.0,1278.0,1
1,2,22,24.0,8198.0,1
2,3,54,18.0,3613.0,4
3,4,13,23.0,5224.0,2
4,5,81,23.0,,3
...,...,...,...,...,...
91971,115374,2,805.0,,2
91972,115375,0,49365.0,9.0,0
91973,115376,1,55746.0,5.0,2
91974,115377,0,805.0,,0


#### 7. Merge the two dataframes created in step 6, users_sliced and posts_sliced.
Use [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) to combine the posts and users dataframes.

In [60]:
posts_sliced

Unnamed: 0,PostId,Score,userId,ViewCount,CommentCount
0,1,23,8.0,1278.0,1
1,2,22,24.0,8198.0,1
2,3,54,18.0,3613.0,4
3,4,13,23.0,5224.0,2
4,5,81,23.0,,3
...,...,...,...,...,...
91971,115374,2,805.0,,2
91972,115375,0,49365.0,9.0,0
91973,115376,1,55746.0,5.0,2
91974,115377,0,805.0,,0


In [58]:
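# Note: this joins the users' userId against the posts' PostId, so rows match only when the two IDs happen to be equal.
# Joining on userId from both tables is probably what the step intends; the outputs below reflect the call as run.
users_post = users_sliced.merge(right=posts_sliced, how='inner', left_on='userId', right_on='PostId')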
users_post.head(15)

Unnamed: 0,userId_x,Reputation,Views,UpVotes,DownVotes,PostId,Score,userId_y,ViewCount,CommentCount
0,2,101,25,3,0,2,22,24.0,8198.0,1
1,3,101,22,19,0,3,54,18.0,3613.0,4
2,4,101,11,0,0,4,13,23.0,5224.0,2
3,5,6792,1145,662,5,5,81,23.0,,3
4,6,457,114,47,0,6,152,5.0,29229.0,5
5,7,429,56,20,0,7,76,38.0,5808.0,3
6,8,6764,1089,604,25,8,0,37.0,288.0,2
7,10,121,20,2,0,10,23,24.0,21925.0,4
8,11,136,10,10,0,11,2,34.0,224.0,2
9,12,101,10,5,0,12,20,5.0,,1
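
A hedged sketch of the join this step most likely intends, matching each post to its author through `userId` on both sides. The `toy_users` and `toy_posts` frames are hypothetical stand-ins with illustrative values, not the real tables:

```python
import pandas as pd

# Hypothetical stand-ins for users_sliced and posts_sliced (values are illustrative).
toy_users = pd.DataFrame({"userId": [8, 23, 24], "Reputation": [6764, 150, 200]})
toy_posts = pd.DataFrame({
    "PostId": [1, 2, 4, 5],
    "userId": [8.0, 24.0, 23.0, 23.0],  # float, as in the posts table above
    "Score": [23, 22, 13, 81],
})

# Join each post to its author. validate="one_to_many" raises a MergeError
# if a userId appears more than once on the users side.
merged = toy_users.merge(toy_posts, on="userId", how="inner", validate="one_to_many")

print(merged)
```

Because the key has the same name in both frames, `on="userId"` yields a single `userId` column rather than the `userId_x` / `userId_y` pair seen above.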


#### 8. How many missing values do you have in your merged dataframe? On which columns?

In [59]:
users_post.isnull().sum()

userId_x            0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
PostId              0
Score               0
userId_y          772
ViewCount       19011
CommentCount        0
dtype: int64

#### 9. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [63]:
# I would drop the rows with missing values, because the values in these specific columns vary a lot. Otherwise I would fill the missing values with a summary statistic.
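
The cell above states the choice but runs no code. A minimal sketch of both options on a hypothetical `toy` frame (column names borrowed from the merged dataframe, values illustrative), with the median used as one possible fill statistic:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in the same two columns as users_post.
toy = pd.DataFrame({
    "PostId": [1, 2, 3, 4],
    "userId_y": [8.0, np.nan, 18.0, 23.0],
    "ViewCount": [1278.0, 8198.0, np.nan, np.nan],
})

# Share of missing values per column, useful for judging how much a drop would cost.
missing_share = toy.isna().mean()

# Option 1: drop every row with a missing value in either column.
dropped = toy.dropna(subset=["userId_y", "ViewCount"])

# Option 2: keep all rows and fill ViewCount with its median (an assumed choice of statistic).
filled = toy.assign(ViewCount=toy["ViewCount"].fillna(toy["ViewCount"].median()))

# Check the result before moving on, as the step asks.
print(missing_share)
print(dropped.isna().sum().sum(), filled["ViewCount"].isna().sum())
```

Going by the counts in step 8, dropping rows with a missing `ViewCount` would remove 19011 of the 32057 merged rows, so the trade-off between the two options is worth checking on the real data.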

#### 10. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [64]:
users_post.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32057 entries, 0 to 32056
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   userId_x      32057 non-null  int64  
 1   Reputation    32057 non-null  int64  
 2   Views         32057 non-null  int64  
 3   UpVotes       32057 non-null  int64  
 4   DownVotes     32057 non-null  int64  
 5   PostId        32057 non-null  int64  
 6   Score         32057 non-null  int64  
 7   userId_y      31285 non-null  float64
 8   ViewCount     13046 non-null  float64
 9   CommentCount  32057 non-null  int64  
dtypes: float64(2), int64(8)
memory usage: 2.7 MB


In [68]:
users_post[['userId_x','PostId']] = users_post[['userId_x','PostId']].astype('category')

In [69]:
users_post.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32057 entries, 0 to 32056
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   userId_x      32057 non-null  category
 1   Reputation    32057 non-null  int64   
 2   Views         32057 non-null  int64   
 3   UpVotes       32057 non-null  int64   
 4   DownVotes     32057 non-null  int64   
 5   PostId        32057 non-null  category
 6   Score         32057 non-null  int64   
 7   userId_y      31285 non-null  float64 
 8   ViewCount     13046 non-null  float64 
 9   CommentCount  32057 non-null  int64   
dtypes: category(2), float64(2), int64(6)
memory usage: 5.3 MB


In [None]:
# I believe userId_x and PostId should be categorical variables because we always use them as labels.
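
The `info()` output also shows `userId_y` and `ViewCount` stored as `float64`, which is how pandas represents integer columns that contain missing values. A sketch of one possible adjustment, assuming the non-missing values are whole numbers, using pandas' nullable integer dtype on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: integer-valued columns that became float64 because of NaN.
toy = pd.DataFrame({
    "userId_y": [24.0, np.nan, 23.0],
    "ViewCount": [8198.0, np.nan, 5224.0],
})

# "Int64" (capital I) is the nullable integer dtype; missing entries become <NA>.
converted = toy.astype({"userId_y": "Int64", "ViewCount": "Int64"})

print(converted.dtypes)
```

If the rows with missing values are dropped first, a plain `astype('int64')` would work as well.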