# Data Cleaning 

#### 1. Import the pandas library.

In [1]:
import pandas as pd

#### 2. Import the users table.

In [4]:
users = pd.read_csv('../data/users.csv')
users

Unnamed: 0.1,Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
...,...,...,...,...,...,...
40320,40320,55743,1,0,0,0
40321,40321,55744,6,1,0,0
40322,40322,55745,101,0,0,0
40323,40323,55746,106,1,0,0
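
`Unnamed: 0` columns like the ones above usually appear when a CSV was saved together with its dataframe index. A minimal sketch, assuming that is what happened with `users.csv` (the in-memory CSV and its values are made up for illustration), shows how `index_col=0` would read that column as the index instead:

```python
import io
import pandas as pd

# Made-up stand-in for users.csv: the first, unnamed column is a saved index.
csv_text = ",userId,Reputation\n0,-1,1\n1,2,101\n"

# Default read: the unnamed column becomes a regular 'Unnamed: 0' column.
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.tolist())
print(indexed_read.columns.tolist())
```

The notebook keeps the default read and removes these columns later by selecting only the columns it needs.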


#### 3. Rename the userId column to userID.

In [19]:
users = users.rename(columns={'userId': 'userID'})
users

Unnamed: 0.1,Unnamed: 0,userID,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
...,...,...,...,...,...,...
40320,40320,55743,1,0,0,0
40321,40321,55744,6,1,0,0
40322,40322,55745,101,0,0,0
40323,40323,55746,106,1,0,0
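
`rename` silently ignores keys that do not match any column, so a misspelled column name leaves the dataframe unchanged without any error. A small sketch on a made-up frame (not the real users table) uses the `errors='raise'` option of `DataFrame.rename` to surface such mistakes:

```python
import pandas as pd

# Made-up frame for illustration only.
demo = pd.DataFrame({'userId': [-1, 2], 'Reputation': [1, 101]})

# 'Id' is not a column here, so the default call changes nothing.
unchanged = demo.rename(columns={'Id': 'userID'})

# With errors='raise', the same typo fails loudly instead.
try:
    demo.rename(columns={'Id': 'userID'}, errors='raise')
    raised = False
except KeyError:
    raised = True

# The correct source name is renamed as expected.
renamed = demo.rename(columns={'userId': 'userID'}, errors='raise')
print(unchanged.columns.tolist(), raised, renamed.columns.tolist())
```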


#### 4. Import the posts table. 

In [11]:
posts = pd.read_csv('../data/posts.csv')
posts

Unnamed: 0.1,Unnamed: 0,PostId,userId,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3
...,...,...,...,...,...,...
91971,91971,115374,805.0,2,,2
91972,91972,115375,49365.0,0,9.0,0
91973,91973,115376,55746.0,1,5.0,2
91974,91974,115377,805.0,0,,0


#### 5. Rename the PostId column to postID and the userId column to userID.

In [24]:
posts = posts.rename(columns={'userId':'userID', 'PostId':'postID'})
posts

Unnamed: 0.1,Unnamed: 0,postID,userID,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3
...,...,...,...,...,...,...
91971,91971,115374,805.0,2,,2
91972,91972,115375,49365.0,0,9.0,0
91973,91973,115376,55746.0,1,5.0,2
91974,91974,115377,805.0,0,,0


#### 6. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: userID, Reputation, Views, UpVotes, DownVotes  
**posts_sliced columns**: postID, Score, userID, ViewCount, CommentCount

In [25]:
users_sliced = users[['userID', 'Reputation', 'Views', 'UpVotes', 'DownVotes']]
posts_sliced = posts[['postID', 'Score', 'userID', 'ViewCount', 'CommentCount']]

#### 7. Merge the two dataframes created in step 6, users_sliced and posts_sliced.
Use [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) to combine the posts and users dataframes.

In [41]:
user_posts = posts_sliced.merge(right=users_sliced, how='inner', on='userID')
user_posts

Unnamed: 0,postID,Score,userID,ViewCount,CommentCount,Reputation,Views,UpVotes,DownVotes
0,1,23,8.0,1278.0,1,6764,1089,604,25
1,16,16,8.0,,3,6764,1089,604,25
2,36,41,8.0,67396.0,7,6764,1089,604,25
3,65,14,8.0,,3,6764,1089,604,25
4,78,33,8.0,,4,6764,1089,604,25
...,...,...,...,...,...,...,...,...,...
90579,115366,1,55742.0,17.0,0,6,0,0,0
90580,115370,1,55744.0,13.0,2,6,1,0,0
90581,115371,0,35801.0,19.0,0,1,1,0,0
90582,115375,0,49365.0,9.0,0,1,0,0,0


In [42]:
user_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 90583
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   postID        90584 non-null  int64  
 1   Score         90584 non-null  int64  
 2   userID        90584 non-null  float64
 3   ViewCount     42188 non-null  float64
 4   CommentCount  90584 non-null  int64  
 5   Reputation    90584 non-null  int64  
 6   Views         90584 non-null  int64  
 7   UpVotes       90584 non-null  int64  
 8   DownVotes     90584 non-null  int64  
dtypes: float64(2), int64(7)
memory usage: 6.9 MB
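
An inner merge keeps only the posts whose `userID` also appears in the users table, which is consistent with the merged frame having fewer rows (90,584) than the posts table (91,975). As a sketch on made-up data (the `_demo` frames are hypothetical, not the real tables), a left merge with `indicator=True` shows which posts have no matching user, and `validate='many_to_one'` checks that each user appears only once on the right side:

```python
import pandas as pd

# Made-up stand-ins for posts_sliced and users_sliced.
posts_demo = pd.DataFrame({
    'postID': [1, 2, 3],
    'userID': [8.0, 8.0, 99.0],   # float keys, as in the posts table
    'Score': [23, 16, 5],
})
users_demo = pd.DataFrame({'userID': [8, 24], 'Reputation': [6764, 101]})

# Inner merge: the post by user 99 disappears without any warning.
inner = posts_demo.merge(users_demo, how='inner', on='userID')

# Left merge with an indicator column keeps every post and labels its origin.
checked = posts_demo.merge(
    users_demo, how='left', on='userID',
    indicator=True, validate='many_to_one',
)
unmatched = checked[checked['_merge'] == 'left_only']

print(len(inner), unmatched['postID'].tolist())
```

Inspecting `unmatched` before settling on `how='inner'` makes it explicit which rows the merge discards.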


#### 8. How many missing values do you have in your merged dataframe? On which columns?

In [43]:
user_posts.isnull().sum()

postID              0
Score               0
userID              0
ViewCount       48396
CommentCount        0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
dtype: int64

#### 9. Decide how to handle the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [46]:
user_posts.fillna({'ViewCount': 0}, inplace=True)
user_posts.isnull().sum()

postID          0
Score           0
userID          0
ViewCount       0
CommentCount    0
Reputation      0
Views           0
UpVotes         0
DownVotes       0
dtype: int64
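
Filling with 0 treats an unknown view count as zero views, which lowers any average computed on `ViewCount`. Dropping the rows instead would discard more than half of the merged data (48,396 of 90,584 rows). A minimal sketch on made-up values compares the options and keeps a flag column (`ViewCountMissing`, a hypothetical name) so the filled rows stay identifiable:

```python
import pandas as pd

# Made-up values for illustration only.
demo = pd.DataFrame({
    'postID': [1, 16, 36, 65],
    'ViewCount': [100.0, None, 300.0, None],
})

# Option A: leave the gaps; mean() skips NaN by default.
mean_ignoring_nan = demo['ViewCount'].mean()

# Option B: drop the rows with a missing ViewCount.
dropped = demo.dropna(subset=['ViewCount'])

# Option C: fill with 0, but remember which rows were filled.
filled = demo.assign(
    ViewCountMissing=demo['ViewCount'].isna(),
    ViewCount=demo['ViewCount'].fillna(0),
)
mean_after_fill = filled['ViewCount'].mean()

print(mean_ignoring_nan, len(dropped), mean_after_fill)
```

Which option fits depends on what the missing counts mean in the source data; the sketch only shows how each choice changes the numbers.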

#### 10. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [48]:
user_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 90583
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   postID        90584 non-null  int64  
 1   Score         90584 non-null  int64  
 2   userID        90584 non-null  float64
 3   ViewCount     90584 non-null  float64
 4   CommentCount  90584 non-null  int64  
 5   Reputation    90584 non-null  int64  
 6   Views         90584 non-null  int64  
 7   UpVotes       90584 non-null  int64  
 8   DownVotes     90584 non-null  int64  
dtypes: float64(2), int64(7)
memory usage: 6.9 MB


In [52]:
user_posts = user_posts.astype({'ViewCount': 'int64', 'userID': 'int64'})

user_posts.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 90583
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   postID        90584 non-null  int64
 1   Score         90584 non-null  int64
 2   userID        90584 non-null  int64
 3   ViewCount     90584 non-null  int64
 4   CommentCount  90584 non-null  int64
 5   Reputation    90584 non-null  int64
 6   Views         90584 non-null  int64
 7   UpVotes       90584 non-null  int64
 8   DownVotes     90584 non-null  int64
dtypes: int64(9)
memory usage: 6.9 MB
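
The cast to `int64` works only because the missing values were filled first: NumPy integer columns cannot hold NaN. A small sketch on made-up values shows the failure and pandas' nullable `Int64` dtype as an alternative when the gaps should be kept:

```python
import pandas as pd

# Made-up values with one gap, for illustration only.
views = pd.Series([1278.0, None, 67396.0], name='ViewCount')

# Casting NaN to a NumPy integer dtype raises a ValueError.
try:
    views.astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable extension dtype 'Int64' (capital I) stores the gap as <NA>.
nullable = views.astype('Int64')

print(cast_failed, nullable.dtype, int(nullable.isna().sum()))
```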
