# Data Cleaning 

#### 1. Import pandas library.

In [3]:
import pandas as pd

#### 2. Import the users table.

In [22]:
users_df = pd.read_csv('../data/users.csv')
users_df

Unnamed: 0.1,Unnamed: 0,Id,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
...,...,...,...,...,...,...
40320,40320,55743,1,0,0,0
40321,40321,55744,6,1,0,0
40322,40322,55745,101,0,0,0
40323,40323,55746,106,1,0,0

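A column named `Unnamed: 0` usually means the CSV was written with its index included. A minimal sketch of how `index_col=0` reads that column back as the index instead. It uses a small made-up CSV string, not the real `users.csv`:

```python
import io
import pandas as pd

# Hypothetical CSV content; the leading empty header is what to_csv() writes
# when the index is saved along with the data.
csv_text = ",Id,Reputation\n0,-1,1\n1,2,101\n"

# Default read: the saved index comes back as a column named "Unnamed: 0".
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
```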

#### 3. Rename Id column to userId.

In [24]:
users_df.rename(columns={'Id': 'userId'}, inplace=True)

#### 4. Import the posts table. 

In [25]:
posts_df = pd.read_csv('../data/posts.csv')

#### 5. Rename Id column to postId and OwnerUserId to userId.

In [34]:
posts_df.rename(columns = {'Id': 'postId', 'OwnerUserId': 'userId'}, inplace=True)
posts_df

Unnamed: 0.1,Unnamed: 0,postId,userId,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3
...,...,...,...,...,...,...
91971,91971,115374,805.0,2,,2
91972,91972,115375,49365.0,0,9.0,0
91973,91973,115376,55746.0,1,5.0,2
91974,91974,115377,805.0,0,,0

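As an alternative to `inplace=True`, `rename` can return a new dataframe, which keeps the original column names available for comparison. A small sketch on made-up data; `raw_posts` and `renamed_posts` are illustrative names:

```python
import pandas as pd

# Made-up stand-in for the posts table.
raw_posts = pd.DataFrame({"Id": [1, 2], "OwnerUserId": [8.0, 24.0], "Score": [23, 22]})

# rename() returns a new dataframe; raw_posts keeps its original column names.
renamed_posts = raw_posts.rename(columns={"Id": "postId", "OwnerUserId": "userId"})

print(renamed_posts.columns.tolist())
```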

#### 6. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts_sliced columns**: postId, Score, userId, ViewCount, CommentCount

In [35]:
users_sliced = users_df[['userId', 'Views', 'UpVotes', 'DownVotes']].copy()
users_sliced

Unnamed: 0,userId,Views,UpVotes,DownVotes
0,-1,0,5007,1920
1,2,25,3,0
2,3,22,19,0
3,4,11,0,0
4,5,1145,662,5
...,...,...,...,...
40320,55743,0,0,0
40321,55744,1,0,0
40322,55745,0,0,0
40323,55746,1,0,0

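The task lists `Reputation` among the `users_sliced` columns, while the cell above selects only four columns. A sketch of a selection that includes it, on a made-up frame with the same column names; `users_demo`, `wanted` and `missing` are illustrative names:

```python
import pandas as pd

# Made-up rows using the column names of users_df after the rename.
users_demo = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "userId": [-1, 2],
    "Reputation": [1, 101],
    "Views": [0, 25],
    "UpVotes": [5007, 3],
    "DownVotes": [1920, 0],
})

wanted = ["userId", "Reputation", "Views", "UpVotes", "DownVotes"]

# Fail early with a readable message if a requested column is absent.
missing = [col for col in wanted if col not in users_demo.columns]
assert not missing, f"columns not found: {missing}"

users_demo_sliced = users_demo[wanted].copy()
print(users_demo_sliced.columns.tolist())
```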

In [37]:
posts_sliced = posts_df[['postId', 'Score', 'userId', 'ViewCount', 'CommentCount']].copy()
posts_sliced

Unnamed: 0,postId,Score,userId,ViewCount,CommentCount
0,1,23,8.0,1278.0,1
1,2,22,24.0,8198.0,1
2,3,54,18.0,3613.0,4
3,4,13,23.0,5224.0,2
4,5,81,23.0,,3
...,...,...,...,...,...
91971,115374,2,805.0,,2
91972,115375,0,49365.0,9.0,0
91973,115376,1,55746.0,5.0,2
91974,115377,0,805.0,,0


#### 7. Merge the two dataframes created in the step above (6), users_sliced and posts_sliced.
You will need to [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) the posts and users dataframes.

In [40]:
merged_df = users_sliced.merge(right=posts_sliced, how="inner", on="userId")
merged_df

Unnamed: 0,userId,Views,UpVotes,DownVotes,postId,Score,ViewCount,CommentCount
0,-1,0,5007,1920,2175,0,,0
1,-1,0,5007,1920,8576,0,,0
2,-1,0,5007,1920,8578,0,,0
3,-1,0,5007,1920,8981,0,,0
4,-1,0,5007,1920,8982,0,,0
...,...,...,...,...,...,...,...,...
90579,55734,0,0,0,115352,0,16.0,0
90580,55738,0,0,0,115360,2,40.0,4
90581,55742,0,0,0,115366,1,17.0,0
90582,55744,1,0,0,115370,1,13.0,2

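To check what an inner merge keeps, `merge` accepts `validate` (to assert the key relationship) and `indicator` (to label where each row came from). A sketch on made-up frames, assuming each user appears only once in the users table:

```python
import pandas as pd

users_demo = pd.DataFrame({"userId": [-1, 2, 3], "Views": [0, 25, 22]})
posts_demo = pd.DataFrame({"postId": [10, 11, 12], "userId": [2.0, 2.0, 99.0]})

# An outer merge with indicator=True shows which rows an inner merge would drop.
check = users_demo.merge(
    posts_demo,
    how="outer",
    on="userId",
    validate="one_to_many",  # raises MergeError if a userId repeats in users_demo
    indicator=True,
)
print(check["_merge"].value_counts())

# Rows labelled "both" are the ones an inner merge returns.
inner = users_demo.merge(posts_demo, how="inner", on="userId")
print(len(inner), (check["_merge"] == "both").sum())
```

Rows labelled `left_only` are users without posts, and `right_only` rows are posts whose `userId` has no match in the users table.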

#### 8. How many missing values do you have in your merged dataframe? In which columns?

In [41]:
merged_df.isna().sum()

userId              0
Views               0
UpVotes             0
DownVotes           0
postId              0
Score               0
ViewCount       48396
CommentCount        0
dtype: int64

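Alongside the counts, the share of missing values per column helps decide between dropping and filling. A sketch on made-up data; `isna().mean()` gives the fraction of missing entries in each column:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "userId": [-1, -1, 5, 5],
    "ViewCount": [np.nan, np.nan, 29229.0, np.nan],
})

missing_count = demo.isna().sum()   # number of missing values per column
missing_share = demo.isna().mean()  # fraction of missing values per column

print(pd.DataFrame({"count": missing_count, "share": missing_share}))
```
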
#### 9. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [44]:
dropped_df = merged_df.dropna()
dropped_df

Unnamed: 0,userId,Views,UpVotes,DownVotes,postId,Score,ViewCount,CommentCount
211,5,1145,662,5,6,152,29229.0,5
219,5,1145,662,5,103,28,1990.0,6
221,5,1145,662,5,125,75,29261.0,2
233,5,1145,662,5,423,156,64481.0,7
238,5,1145,662,5,562,10,1005.0,1
...,...,...,...,...,...,...,...,...
90579,55734,0,0,0,115352,0,16.0,0
90580,55738,0,0,0,115360,2,40.0,4
90581,55742,0,0,0,115366,1,17.0,0
90582,55744,1,0,0,115370,1,13.0,2


Answer: All of the missing values are in ViewCount: 48,396 of the 90,584 merged rows. The rows shown for userId -1 are all missing ViewCount, which looks like a mistake in the data, but the problem is not limited to that user: posts_sliced also shows missing ViewCount for posts by userId 23 and 805. Dropping these rows is preferable to filling them, because filling more than half of the column with an estimated value would skew the results. After dropping, 42,188 rows remain.

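One way to test where the missing values come from is to count the missing rows per `userId`. A sketch on made-up data, which also uses `dropna(subset=...)` so that only `ViewCount` decides which rows are removed:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "userId": [-1, -1, 5, 5, 23],
    "postId": [2175, 8576, 6, 7, 5],
    "ViewCount": [np.nan, np.nan, 29229.0, np.nan, np.nan],
})

# Number of rows with a missing ViewCount, per user.
missing_by_user = demo.loc[demo["ViewCount"].isna(), "userId"].value_counts()
print(missing_by_user)

# Drop only the rows where ViewCount is missing; .copy() gives an independent frame.
demo_dropped = demo.dropna(subset=["ViewCount"]).copy()
print(len(demo), "->", len(demo_dropped))
```
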
#### 10. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [45]:
dropped_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42188 entries, 211 to 90583
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   userId        42188 non-null  int64  
 1   Views         42188 non-null  int64  
 2   UpVotes       42188 non-null  int64  
 3   DownVotes     42188 non-null  int64  
 4   postId        42188 non-null  int64  
 5   Score         42188 non-null  int64  
 6   ViewCount     42188 non-null  float64
 7   CommentCount  42188 non-null  int64  
dtypes: float64(1), int64(7)
memory usage: 2.9 MB


Answer: ViewCount could be converted to int: it is float64 only because the column originally contained NaN, and none of the remaining values has a fractional part. userId could be stored as a string, because it is an identifier and no calculations are needed on it. Otherwise there is no problem with the data types (for example, no column is stored as object).
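
The conversions discussed in this step could look like the following sketch, on made-up rows with the same column names. It assumes the missing `ViewCount` values have already been dropped, since `NaN` cannot be cast to `int64`:

```python
import pandas as pd

demo = pd.DataFrame({
    "userId": [5, 5, 55744],
    "postId": [6, 103, 115370],
    "ViewCount": [29229.0, 1990.0, 13.0],
})

# Confirm that no ViewCount has a fractional part before casting to integer.
assert (demo["ViewCount"] % 1 == 0).all()

# astype() returns a new dataframe; demo itself is left unchanged.
converted = demo.astype({"ViewCount": "int64", "userId": str})
print(converted.dtypes)
```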