# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import the users table.

In [2]:
users = pd.read_csv("../data/users.csv")
print(users)

       Unnamed: 0     Id  Reputation  Views  UpVotes  DownVotes
0               0     -1           1      0     5007       1920
1               1      2         101     25        3          0
2               2      3         101     22       19          0
3               3      4         101     11        0          0
4               4      5        6792   1145      662          5
...           ...    ...         ...    ...      ...        ...
40320       40320  55743           1      0        0          0
40321       40321  55744           6      1        0          0
40322       40322  55745         101      0        0          0
40323       40323  55746         106      1        0          0
40324       40324  55747           1      0        0          0

[40325 rows x 6 columns]


In [3]:
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40325 entries, 0 to 40324
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Unnamed: 0  40325 non-null  int64
 1   Id          40325 non-null  int64
 2   Reputation  40325 non-null  int64
 3   Views       40325 non-null  int64
 4   UpVotes     40325 non-null  int64
 5   DownVotes   40325 non-null  int64
dtypes: int64(6)
memory usage: 1.8 MB
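The `Unnamed: 0` column looks like a row index that was written out when the CSV was saved; this is an assumption based on its values matching the index, not something the file confirms. A minimal sketch on a small in-memory CSV (toy data, not the real `users.csv`) showing how `index_col=0` reads such a column as the index instead:

```python
import io
import pandas as pd

# Toy CSV mimicking the layout assumed for users.csv: the first header
# field is empty, which is what pandas writes when it saves the index.
csv_text = ",Id,Reputation\n0,-1,1\n1,2,101\n2,3,101\n"

# Default read: the unnamed first column appears as "Unnamed: 0".
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the same column becomes the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

The notebook instead leaves the column in place and drops it later by selecting columns in step 6, which works just as well.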


#### 3. Rename Id column to userId.

In [4]:
new_users = users.rename(columns={"Id": "userId"})
print(new_users)

       Unnamed: 0  userId  Reputation  Views  UpVotes  DownVotes
0               0      -1           1      0     5007       1920
1               1       2         101     25        3          0
2               2       3         101     22       19          0
3               3       4         101     11        0          0
4               4       5        6792   1145      662          5
...           ...     ...         ...    ...      ...        ...
40320       40320   55743           1      0        0          0
40321       40321   55744           6      1        0          0
40322       40322   55745         101      0        0          0
40323       40323   55746         106      1        0          0
40324       40324   55747           1      0        0          0

[40325 rows x 6 columns]


#### 4. Import the posts table. 

In [5]:
posts = pd.read_csv("../data/posts.csv")
print(posts)

       Unnamed: 0      Id  OwnerUserId  Score  ViewCount  CommentCount
0               0       1          8.0     23     1278.0             1
1               1       2         24.0     22     8198.0             1
2               2       3         18.0     54     3613.0             4
3               3       4         23.0     13     5224.0             2
4               4       5         23.0     81        NaN             3
...           ...     ...          ...    ...        ...           ...
91971       91971  115374        805.0      2        NaN             2
91972       91972  115375      49365.0      0        9.0             0
91973       91973  115376      55746.0      1        5.0             2
91974       91974  115377        805.0      0        NaN             0
91975       91975  115378       7250.0      0        NaN             0

[91976 rows x 6 columns]


#### 5. Rename Id column to postId and OwnerUserId to userId.

In [6]:
new_posts = posts.rename(columns={"Id": "postId", "OwnerUserId": "userId"})
print(new_posts)

       Unnamed: 0  postId   userId  Score  ViewCount  CommentCount
0               0       1      8.0     23     1278.0             1
1               1       2     24.0     22     8198.0             1
2               2       3     18.0     54     3613.0             4
3               3       4     23.0     13     5224.0             2
4               4       5     23.0     81        NaN             3
...           ...     ...      ...    ...        ...           ...
91971       91971  115374    805.0      2        NaN             2
91972       91972  115375  49365.0      0        9.0             0
91973       91973  115376  55746.0      1        5.0             2
91974       91974  115377    805.0      0        NaN             0
91975       91975  115378   7250.0      0        NaN             0

[91976 rows x 6 columns]
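A quick way to confirm that a rename took effect is to inspect the column labels of the returned dataframe rather than printing the whole table. The sketch below uses a toy stand-in for the posts table (`toy_posts` and its values are illustrative only) and also shows that `rename` returns a new dataframe and leaves the original unchanged:

```python
import pandas as pd

# Toy stand-in for the posts table; values are illustrative only.
toy_posts = pd.DataFrame(
    {"Id": [1, 2], "OwnerUserId": [8.0, 24.0], "Score": [23, 22]}
)

# rename returns a new DataFrame; toy_posts keeps its original labels.
toy_renamed = toy_posts.rename(columns={"Id": "postId", "OwnerUserId": "userId"})

print(list(toy_posts.columns))
print(list(toy_renamed.columns))
```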


#### 6. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: userId, Reputation, Views, UpVotes      
**posts_sliced columns**: postId, Score, userId, ViewCount

In [7]:
users_sliced = new_users.loc[:, ["userId", "Reputation", "Views", "UpVotes"]]
print(users_sliced)


       userId  Reputation  Views  UpVotes
0          -1           1      0     5007
1           2         101     25        3
2           3         101     22       19
3           4         101     11        0
4           5        6792   1145      662
...       ...         ...    ...      ...
40320   55743           1      0        0
40321   55744           6      1        0
40322   55745         101      0        0
40323   55746         106      1        0
40324   55747           1      0        0

[40325 rows x 4 columns]


In [8]:
posts_sliced = new_posts.loc[:, ["postId", "Score", "userId", "ViewCount"]]
print(posts_sliced)

       postId  Score   userId  ViewCount
0           1     23      8.0     1278.0
1           2     22     24.0     8198.0
2           3     54     18.0     3613.0
3           4     13     23.0     5224.0
4           5     81     23.0        NaN
...       ...    ...      ...        ...
91971  115374      2    805.0        NaN
91972  115375      0  49365.0        9.0
91973  115376      1  55746.0        5.0
91974  115377      0    805.0        NaN
91975  115378      0   7250.0        NaN

[91976 rows x 4 columns]
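Selecting with `.loc[:, columns]` and selecting with a plain list of column names should give the same result here. The sketch below checks that on toy data (names and values are illustrative) and prints the dtypes, which shows why `userId` and `ViewCount` appear as floats: a column that holds `NaN` cannot stay `int64`.

```python
import pandas as pd

nan = float("nan")

# Toy stand-in for new_posts; values are illustrative only.
toy_posts = pd.DataFrame(
    {
        "postId": [1, 2, 3],
        "Score": [23, 22, 54],
        "userId": [8.0, 24.0, nan],
        "ViewCount": [1278.0, nan, 3613.0],
        "CommentCount": [1, 1, 4],
    }
)

wanted = ["postId", "Score", "userId", "ViewCount"]

by_loc = toy_posts.loc[:, wanted]   # selection style used in the notebook
by_brackets = toy_posts[wanted]     # equivalent shorthand

print(by_loc.equals(by_brackets))
print(by_loc.dtypes)
```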


#### 7. Merge the two dataframes created in step 6, users_sliced and posts_sliced.
Use [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) to combine the posts and users dataframes.

In [9]:
list(new_users)  # Check which columns the two dataframes have in common


['Unnamed: 0', 'userId', 'Reputation', 'Views', 'UpVotes', 'DownVotes']

In [10]:
list(new_posts)

['Unnamed: 0', 'postId', 'userId', 'Score', 'ViewCount', 'CommentCount']

In [11]:
# Merge the two dataframes on userId, the column they have in common
merged_df = users_sliced.merge(posts_sliced, how="outer", on="userId")
merged_df

        userId  Reputation  Views  UpVotes    postId  Score  ViewCount
0         -1.0         1.0    0.0   5007.0    2175.0    0.0        NaN
1         -1.0         1.0    0.0   5007.0    8576.0    0.0        NaN
2         -1.0         1.0    0.0   5007.0    8578.0    0.0        NaN
3         -1.0         1.0    0.0   5007.0    8981.0    0.0        NaN
4         -1.0         1.0    0.0   5007.0    8982.0    0.0        NaN
...        ...         ...    ...      ...       ...    ...        ...
110313     NaN         NaN    NaN      NaN  114678.0    0.0       20.0
110314     NaN         NaN    NaN      NaN  114812.0    0.0       16.0
110315     NaN         NaN    NaN      NaN  114815.0    1.0       14.0
110316     NaN         NaN    NaN      NaN  115225.0    0.0        8.0
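An outer merge keeps rows that exist on only one side, which is where the `NaN` values come from, and it is also why the integer columns turn into floats. One way to see which side each row came from is the `indicator=True` option of `merge`. A minimal sketch on toy data (the frames and values below are illustrative, not taken from the real tables):

```python
import pandas as pd

# Toy users: userId is an integer column, as in users_sliced.
toy_users = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [10, 20, 30]})

# Toy posts: userId is float because one post has no owner (NaN).
toy_posts = pd.DataFrame(
    {"postId": [100, 101, 102], "userId": [1.0, 1.0, float("nan")]}
)

# indicator=True adds a "_merge" column: both, left_only or right_only.
toy_merged = toy_users.merge(toy_posts, how="outer", on="userId", indicator=True)

print(toy_merged)
print(toy_merged["_merge"].value_counts())
```

In this toy case, user 1 matches two posts, users 2 and 3 have no posts, and the ownerless post has no user, so `Reputation` and `postId` both end up as float columns containing `NaN`.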


#### 8. How many missing values do you have in your merged dataframe? On which columns?

In [12]:
merged_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110318 entries, 0 to 110317
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   userId      108926 non-null  float64
 1   Reputation  108926 non-null  float64
 2   Views       108926 non-null  float64
 3   UpVotes     108926 non-null  float64
 4   postId      91976 non-null   float64
 5   Score       91976 non-null   float64
 6   ViewCount   42921 non-null   float64
dtypes: float64(7)
memory usage: 6.7 MB


In [13]:
merged_df.isnull().sum()

userId         1392
Reputation     1392
Views          1392
UpVotes        1392
postId        18342
Score         18342
ViewCount     67397
dtype: int64
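Raw counts are easier to judge as a share of the total. The sketch below converts the counts reported above into percentages of the 110318 rows; on the real dataframe the same figures should come from `merged_df.isnull().mean() * 100`, shown here on a toy frame only.

```python
import pandas as pd

# Missing-value counts and the row total, copied from the outputs above.
missing = pd.Series(
    {
        "userId": 1392,
        "Reputation": 1392,
        "Views": 1392,
        "UpVotes": 1392,
        "postId": 18342,
        "Score": 18342,
        "ViewCount": 67397,
    }
)
total_rows = 110318

# Percentage of rows with a missing value, per column.
share = (missing / total_rows * 100).round(1)
print(share)

# On a dataframe, isna().mean() gives the same kind of figure (toy data).
toy = pd.DataFrame({"a": [1.0, float("nan")], "b": [1.0, 2.0]})
toy_share = toy.isna().mean() * 100
print(toy_share)
```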

#### 9. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [None]:
# The merged dataframe has 110318 rows. The column with the most NaNs is ViewCount, with 67397 missing values, which is about 61% of the rows. If we drop every row that has a missing value, the sample will shrink drastically.
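The two options can be compared on a small example before deciding. The sketch below is only an illustration on toy data: filling with 0 is an assumption, and whether 0 is a sensible value depends on what a missing `Score` or `ViewCount` actually means in this dataset.

```python
import pandas as pd

nan = float("nan")

# Toy frame with the same kind of gaps as merged_df; values are illustrative.
toy = pd.DataFrame(
    {
        "userId": [1.0, 2.0, nan, 4.0],
        "Score": [3.0, nan, 1.0, 0.0],
        "ViewCount": [nan, nan, 14.0, 8.0],
    }
)

# Option A: drop every row that has at least one NaN.
dropped = toy.dropna()

# Option B: drop only rows without a userId, then fill the remaining gaps.
# The fill value 0 is an assumption made for this sketch.
filled = toy.dropna(subset=["userId"]).fillna({"Score": 0, "ViewCount": 0})

print(len(toy), len(dropped), len(filled))
print(filled.isna().sum())
```

On the toy frame, option A keeps one row out of four while option B keeps three, which mirrors the concern about shrinking the sample.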
