# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd
import os

#### 2. Import the users table.

In [2]:
os.chdir("C:\\Users\\GiantsV3\\Documents\\Ironhack\\Week2\\Day4\\lab-data-cleaning")
users = pd.read_csv("data/users.csv", index_col=0)
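The cell above depends on a machine-specific working directory. As a hedged alternative, the sketch below reads the file through an explicit `pathlib` path instead of calling `os.chdir`. The temporary directory and the toy `users.csv` are stand-ins invented for illustration; they are not the lab data.

```python
from pathlib import Path
import tempfile

import pandas as pd

# Hypothetical project root: a temporary directory stands in for the lab folder.
project_root = Path(tempfile.mkdtemp())
data_dir = project_root / "data"
data_dir.mkdir()

# Toy stand-in for users.csv, written with its index as the first column.
pd.DataFrame({"Id": [-1, 2], "Reputation": [1, 101]}).to_csv(data_dir / "users.csv")

# Read through an explicit path; the working directory is never changed.
users_demo = pd.read_csv(data_dir / "users.csv", index_col=0)
```

With this approach only `project_root` has to change when the notebook moves to another machine.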

#### 3. Rename Id column to userId.

In [3]:
users.rename(columns={"Id":"userId"}, inplace=True)
users

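| | userId | Reputation | Views | UpVotes | DownVotes |
|---|---|---|---|---|---|
| 0 | -1 | 1 | 0 | 5007 | 1920 |
| 1 | 2 | 101 | 25 | 3 | 0 |
| 2 | 3 | 101 | 22 | 19 | 0 |
| 3 | 4 | 101 | 11 | 0 | 0 |
| 4 | 5 | 6792 | 1145 | 662 | 5 |
| ... | ... | ... | ... | ... | ... |
| 40320 | 55743 | 1 | 0 | 0 | 0 |
| 40321 | 55744 | 6 | 1 | 0 | 0 |
| 40322 | 55745 | 101 | 0 | 0 | 0 |
| 40323 | 55746 | 106 | 1 | 0 | 0 |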


#### 4. Import the posts table. 

In [4]:
posts = pd.read_csv("data/posts.csv", index_col=0)

#### 5. Rename Id column to postId and OwnerUserId to userId.

In [5]:
posts.rename(columns={"Id":"postId", "OwnerUserId":"userId"}, inplace=True)
posts

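| | PostId | userId | Score | ViewCount | CommentCount |
|---|---|---|---|---|---|
| 0 | 1 | 8.0 | 23 | 1278.0 | 1 |
| 1 | 2 | 24.0 | 22 | 8198.0 | 1 |
| 2 | 3 | 18.0 | 54 | 3613.0 | 4 |
| 3 | 4 | 23.0 | 13 | 5224.0 | 2 |
| 4 | 5 | 23.0 | 81 | NaN | 3 |
| ... | ... | ... | ... | ... | ... |
| 91971 | 115374 | 805.0 | 2 | NaN | 2 |
| 91972 | 115375 | 49365.0 | 0 | 9.0 | 0 |
| 91973 | 115376 | 55746.0 | 1 | 5.0 | 2 |
| 91974 | 115377 | 805.0 | 0 | NaN | 0 |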
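By default, `DataFrame.rename` ignores mapping keys that match no column, so a misspelled column name goes unnoticed. The sketch below shows this on a made-up frame (the values and the misspelling `OwnerUserID` are illustrative assumptions) and how `errors="raise"` turns the mistake into a `KeyError`.

```python
import pandas as pd

# Toy frame; column names mirror the posts table, values are made up.
demo = pd.DataFrame({"Id": [1, 2], "OwnerUserId": [8.0, 24.0]})

# Mapping with a deliberately misspelled key ("OwnerUserID").
mapping = {"Id": "postId", "OwnerUserID": "userId"}

# Default behaviour: the unknown key is skipped without any warning.
silent = demo.rename(columns=mapping)

# With errors="raise", the same misspelling surfaces as a KeyError.
try:
    demo.rename(columns=mapping, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
```

Checking the column list right after a rename, as the notebook does by displaying the frame, serves the same purpose.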


#### 6. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts_sliced columns**: postId, Score, userId, ViewCount, CommentCount

In [6]:
users_sliced = users[["userId", "Reputation", "Views", "UpVotes", "DownVotes"]]
posts_sliced = posts[["PostId", "Score", "userId", "ViewCount", "CommentCount"]]

#### 7. Merge the two dataframes created in the step above (6), users_sliced and posts_sliced. 
You will need to perform a [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.

In [7]:
merged_table = users_sliced.merge(right=posts_sliced, how="left", on="userId")
merged_table.head(2)

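| | userId | Reputation | Views | UpVotes | DownVotes | PostId | Score | ViewCount | CommentCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 1 | 0 | 5007 | 1920 | 2175.0 | 0.0 | NaN | 0.0 |
| 1 | -1 | 1 | 0 | 5007 | 1920 | 8576.0 | 0.0 | NaN | 0.0 |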
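A left merge keeps every user, including those with no post, which is where the missing values in the post columns come from. The sketch below uses small made-up frames (not the lab data) together with the `indicator` and `validate` parameters of `merge` to make that visible.

```python
import pandas as pd

# Made-up stand-ins for users_sliced and posts_sliced.
users_demo = pd.DataFrame({"userId": [-1, 2, 3], "Reputation": [1, 101, 101]})
posts_demo = pd.DataFrame(
    {"PostId": [10, 11, 12], "userId": [-1, -1, 3], "Score": [0, 0, 5]}
)

merged_demo = users_demo.merge(
    posts_demo,
    how="left",
    on="userId",
    validate="one_to_many",  # raises MergeError if userId repeats in users_demo
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Users that found no matching post keep NaN in every post column.
users_without_posts = merged_demo.loc[
    merged_demo["_merge"] == "left_only", "userId"
].tolist()
```

Note that `PostId` and `Score` become float columns after the merge, because integer columns cannot hold `NaN`.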


#### 8. How many missing values do you have in your merged dataframe? On which columns?

In [8]:
merged_table.isnull().sum()

userId              0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
PostId          18342
Score           18342
ViewCount       66738
CommentCount    18342
dtype: int64
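Relative to the 108926 rows of the merged table, these counts are roughly 16.8% missing for PostId, Score and CommentCount and roughly 61.3% for ViewCount. A minimal sketch, on a made-up frame, of reporting counts and shares side by side:

```python
import numpy as np
import pandas as pd

# Made-up frame with the same kind of gaps as the merged table.
demo = pd.DataFrame(
    {
        "userId": [1, 2, 3, 4],
        "PostId": [1.0, 2.0, np.nan, 4.0],
        "ViewCount": [np.nan, 30.0, np.nan, np.nan],
    }
)

# isnull().sum() counts missing values; isnull().mean() gives their share.
missing = pd.DataFrame(
    {"n_missing": demo.isnull().sum(), "share": demo.isnull().mean()}
)

# Names of the columns that contain at least one missing value.
cols_with_missing = missing.index[missing["n_missing"] > 0].tolist()
```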

#### 9. You will need to do something with the missing values. Will you drop them or fill them? Explain. 
**Remember** to check the results of your code before going to the next step.

In [9]:
merged_table.shape

(108926, 9)

In [10]:
merged_table["PostId"].value_counts(dropna=False)

NaN        18342
46679.0        1
14324.0        1
95794.0        1
71600.0        1
           ...  
6829.0         1
3860.0         1
5924.0         1
11888.0        1
90159.0        1
Name: PostId, Length: 90585, dtype: int64

In [11]:
merged_table["Score"].value_counts(dropna=False)

1.0      22901
0.0      19927
NaN      18342
2.0      15248
3.0       9909
         ...  
78.0         1
135.0        1
92.0         1
102.0        1
113.0        1
Name: Score, Length: 129, dtype: int64

In [12]:
merged_table["ViewCount"].value_counts(dropna=False)

NaN       66738
38.0        295
31.0        293
37.0        277
27.0        277
          ...  
4754.0        1
4034.0        1
6383.0        1
5089.0        1
2257.0        1
Name: ViewCount, Length: 3655, dtype: int64

In [13]:
merged_table["CommentCount"].value_counts(dropna=False)

0.0     38051
NaN     18342
1.0     14798
2.0     12527
3.0      7835
4.0      5560
5.0      3651
6.0      2601
7.0      1701
8.0      1198
9.0       835
10.0      552
11.0      357
12.0      270
13.0      173
14.0      127
15.0       83
16.0       75
17.0       54
19.0       28
18.0       24
20.0       14
22.0       12
21.0       12
24.0       12
30.0        5
28.0        4
23.0        4
33.0        3
25.0        3
27.0        2
37.0        2
31.0        2
41.0        2
35.0        2
34.0        1
32.0        1
26.0        1
45.0        1
29.0        1
Name: CommentCount, dtype: int64

In [14]:
merged_table.dropna(subset=["PostId", "Score", "ViewCount", "CommentCount"], how="all", inplace=True)

In [15]:
merged_table.shape

(90584, 9)

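I dropped every row in which all four columns coming from posts (PostId, Score, ViewCount, CommentCount) were missing, i.e. the users that found no match in posts_sliced. I didn't fill them in because the data was too ambiguous: there was no sensible value that could stand in for the missing entries.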
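The row counts are consistent with this: 108926 − 18342 = 90584. The choice of `how="all"` matters, because `how="any"` would also discard posts that only lack a ViewCount. A small sketch on made-up data contrasts the two:

```python
import numpy as np
import pandas as pd

post_cols = ["PostId", "Score", "ViewCount", "CommentCount"]

# Made-up rows: a complete post, a post without ViewCount, a user with no post.
demo = pd.DataFrame(
    {
        "userId": [1, 2, 3],
        "PostId": [10.0, 11.0, np.nan],
        "Score": [1.0, 0.0, np.nan],
        "ViewCount": [50.0, np.nan, np.nan],
        "CommentCount": [2.0, 0.0, np.nan],
    }
)

# how="all": drop a row only if every column in the subset is missing.
kept_all = demo.dropna(subset=post_cols, how="all")

# how="any": drop a row as soon as one column in the subset is missing.
kept_any = demo.dropna(subset=post_cols, how="any")
```

Here `kept_all` retains the post without a ViewCount, while `kept_any` keeps only the complete row.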

#### 10. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [18]:
merged_table.head(20)

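| | userId | Reputation | Views | UpVotes | DownVotes | PostId | Score | ViewCount | CommentCount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 1 | 0 | 5007 | 1920 | 2175.0 | 0.0 | NaN | 0.0 |
| 1 | -1 | 1 | 0 | 5007 | 1920 | 8576.0 | 0.0 | NaN | 0.0 |
| 2 | -1 | 1 | 0 | 5007 | 1920 | 8578.0 | 0.0 | NaN | 0.0 |
| 3 | -1 | 1 | 0 | 5007 | 1920 | 8981.0 | 0.0 | NaN | 0.0 |
| 4 | -1 | 1 | 0 | 5007 | 1920 | 8982.0 | 0.0 | NaN | 0.0 |
| 5 | -1 | 1 | 0 | 5007 | 1920 | 9857.0 | 0.0 | NaN | 0.0 |
| 6 | -1 | 1 | 0 | 5007 | 1920 | 9858.0 | 0.0 | NaN | 0.0 |
| 7 | -1 | 1 | 0 | 5007 | 1920 | 9860.0 | 0.0 | NaN | 0.0 |
| 8 | -1 | 1 | 0 | 5007 | 1920 | 10130.0 | 0.0 | NaN | 0.0 |
| 9 | -1 | 1 | 0 | 5007 | 1920 | 10131.0 | 0.0 | NaN | 0.0 |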


In [16]:
merged_table.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 108924
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   userId        90584 non-null  int64  
 1   Reputation    90584 non-null  int64  
 2   Views         90584 non-null  int64  
 3   UpVotes       90584 non-null  int64  
 4   DownVotes     90584 non-null  int64  
 5   PostId        90584 non-null  float64
 6   Score         90584 non-null  float64
 7   ViewCount     42188 non-null  float64
 8   CommentCount  90584 non-null  float64
dtypes: float64(4), int64(5)
memory usage: 6.9 MB


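In my opinion, the data types are mostly correct. The only change that comes to mind is converting the *float64* columns to *int64*. PostId, Score and CommentCount no longer contain **NaN** and could be cast directly; ViewCount still does, so a plain *int64* cast can't be done on it.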
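According to the `info()` output, only ViewCount still has missing values (42188 non-null out of 90584). A minimal sketch, on a made-up frame, of one possible conversion: plain `int64` for the complete columns and pandas' nullable `Int64` extension dtype for ViewCount. Whether the nullable dtype suits later steps is an assumption to check against the rest of the analysis.

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like the post columns of merged_table.
demo = pd.DataFrame(
    {
        "PostId": [2175.0, 8576.0],
        "Score": [0.0, 3.0],
        "ViewCount": [np.nan, 38.0],
        "CommentCount": [0.0, 1.0],
    }
)

# Columns without NaN go to int64; ViewCount goes to nullable Int64,
# which stores missing entries as <NA> instead of forcing a float dtype.
demo = demo.astype(
    {
        "PostId": "int64",
        "Score": "int64",
        "CommentCount": "int64",
        "ViewCount": "Int64",
    }
)
```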