# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import the users table.

In [2]:
users = pd.read_csv("../data/users.csv")   # path relative to the notebook's folder
users

Unnamed: 0.1,Unnamed: 0,Id,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
...,...,...,...,...,...,...
40320,40320,55743,1,0,0,0
40321,40321,55744,6,1,0,0
40322,40322,55745,101,0,0,0
40323,40323,55746,106,1,0,0
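
The `Unnamed: 0` column in the output above is most likely a row index that was written to the CSV when the file was saved. A minimal sketch of how `index_col=0` avoids loading it as data; the inline CSV text is a made-up stand-in for `users.csv`, not the real file:

```python
import io
import pandas as pd

# Toy CSV standing in for users.csv (assumption); the first, unnamed
# column is a saved row index.
csv_text = ",Id,Reputation\n0,-1,1\n1,2,101\n"

# Default read: the saved index comes back as a column named "Unnamed: 0".
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that first column as the row index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Either approach works for this exercise, since the later steps select columns by name and leave `Unnamed: 0` behind anyway.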


#### 3. Rename Id column to UserId.

In [3]:
users.columns

Index(['Unnamed: 0', 'Id', 'Reputation', 'Views', 'UpVotes', 'DownVotes'], dtype='object')

In [4]:
users = users.rename(columns={'Id':'UserId'})

In [5]:
# No-op: there is no 'userid' column, and rename ignores keys that match nothing.
users = users.rename(columns={'userid':'UserId'})

In [6]:
users.columns

Index(['Unnamed: 0', 'UserId', 'Reputation', 'Views', 'UpVotes', 'DownVotes'], dtype='object')
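
`rename` silently skips keys that do not match any column, so a typo in a column name goes unnoticed. A small sketch, on a made-up two-column frame, of how `errors="raise"` surfaces the mistake:

```python
import pandas as pd

# Toy frame (assumption) with the same original column name as the users table.
toy = pd.DataFrame({"Id": [1, 2], "Reputation": [101, 6792]})

# Keys that match no column are ignored by default, so nothing changes here.
unchanged = toy.rename(columns={"userid": "UserId"})

# errors="raise" turns a missing key into a KeyError instead.
try:
    toy.rename(columns={"userid": "UserId"}, errors="raise")
    raised = False
except KeyError:
    raised = True

# The intended rename, guarded the same way.
renamed = toy.rename(columns={"Id": "UserId"}, errors="raise")
print(list(unchanged.columns), raised, list(renamed.columns))
```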

#### 4. Import the posts table. 

In [7]:
posts = pd.read_csv("../data/posts.csv")
posts

Unnamed: 0.1,Unnamed: 0,Id,OwnerUserId,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3
...,...,...,...,...,...,...
91971,91971,115374,805.0,2,,2
91972,91972,115375,49365.0,0,9.0,0
91973,91973,115376,55746.0,1,5.0,2
91974,91974,115377,805.0,0,,0


#### 5. Rename Id column to PostId and OwnerUserId to UserId.

In [8]:
posts = posts.rename(columns={'Id':'PostId','OwnerUserId':'UserId'})
posts

Unnamed: 0.1,Unnamed: 0,PostId,UserId,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3
...,...,...,...,...,...,...
91971,91971,115374,805.0,2,,2
91972,91972,115375,49365.0,0,9.0,0
91973,91973,115376,55746.0,1,5.0,2
91974,91974,115377,805.0,0,,0


#### 6. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: UserId, Reputation, Views, UpVotes, DownVotes  
**posts_sliced columns**: PostId, Score, UserId, ViewCount, CommentCount

In [9]:
posts.columns

Index(['Unnamed: 0', 'PostId', 'UserId', 'Score', 'ViewCount', 'CommentCount'], dtype='object')

In [10]:
users_sliced = users.loc[:, ['UserId', 'Reputation', 'Views', 'UpVotes', 'DownVotes']]

In [11]:
posts_sliced = posts.loc[:, ['PostId', 'Score', 'UserId', 'ViewCount', 'CommentCount']]

#### 7. Merge the two dataframes created in step 6, users_sliced and posts_sliced.
Use [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) to combine the posts and users dataframes on their shared UserId column.

In [12]:
users_sliced.columns

Index(['UserId', 'Reputation', 'Views', 'UpVotes', 'DownVotes'], dtype='object')

In [13]:
posts_sliced.columns

Index(['PostId', 'Score', 'UserId', 'ViewCount', 'CommentCount'], dtype='object')

In [14]:
df = users_sliced.merge(right=posts_sliced, how="inner", on="UserId")
df

Unnamed: 0,UserId,Reputation,Views,UpVotes,DownVotes,PostId,Score,ViewCount,CommentCount
0,-1,1,0,5007,1920,2175,0,,0
1,-1,1,0,5007,1920,8576,0,,0
2,-1,1,0,5007,1920,8578,0,,0
3,-1,1,0,5007,1920,8981,0,,0
4,-1,1,0,5007,1920,8982,0,,0
...,...,...,...,...,...,...,...,...,...
90579,55734,1,0,0,0,115352,0,16.0,0
90580,55738,11,0,0,0,115360,2,40.0,4
90581,55742,6,0,0,0,115366,1,17.0,0
90582,55744,6,1,0,0,115370,1,13.0,2
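
An inner merge keeps only rows whose `UserId` appears in both tables, so posts without a matching user and users without posts are dropped. A hedged sketch of how to audit that with `indicator=True` and guard the key with `validate`; the two toy frames are invented for illustration and only mirror the column names used here:

```python
import pandas as pd

# Toy stand-ins (assumption) for users_sliced and posts_sliced.
users_toy = pd.DataFrame({"UserId": [1, 2, 3], "Reputation": [10, 20, 30]})
posts_toy = pd.DataFrame(
    {"PostId": [100, 101, 102, 103], "UserId": [1.0, 1.0, 2.0, None]}
)

# validate="one_to_many" raises if a UserId is duplicated on the users side.
inner = users_toy.merge(posts_toy, how="inner", on="UserId", validate="one_to_many")

# An outer merge with indicator=True labels where each row came from.
audit = users_toy.merge(posts_toy, how="outer", on="UserId", indicator=True)
counts = audit["_merge"].value_counts()

print(len(inner))
print(counts)
```

Rows labelled `left_only` are users with no posts, and `right_only` rows are posts whose owner is missing or unknown; both groups disappear in the inner merge.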


#### 8. How many missing values do you have in your merged dataframe? On which columns?

In [15]:
pd.DataFrame(df.isna().sum())

Unnamed: 0,0
UserId,0
Reputation,0
Views,0
UpVotes,0
DownVotes,0
PostId,0
Score,0
ViewCount,48396
CommentCount,0


In [16]:
df.shape

(90584, 9)
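
With the numbers above, 48396 of 90584 rows are missing `ViewCount`, roughly 53%. A minimal sketch of computing the count and the share in one table and keeping only the affected columns, shown on a made-up frame:

```python
import pandas as pd

# Toy frame (assumption) with one complete and one partly missing column.
toy = pd.DataFrame({"Score": [0, 2, 1, 1], "ViewCount": [None, 40.0, None, 13.0]})

# isna().sum() counts missing values; isna().mean() gives their share per column.
summary = pd.DataFrame({"missing": toy.isna().sum(), "share": toy.isna().mean()})

# Keep only the columns that actually have gaps.
only_missing = summary[summary["missing"] > 0]
print(only_missing)
```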

#### 9. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [19]:
df['ViewCount'] = df['ViewCount'].fillna(df['ViewCount'].mean())  # assign back: fillna returns a new Series
df['ViewCount']

0        556.656158
1        556.656158
2        556.656158
3        556.656158
4        556.656158
            ...    
90579     16.000000
90580     40.000000
90581     17.000000
90582     13.000000
90583      5.000000
Name: ViewCount, Length: 90584, dtype: float64
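
Dropping the rows with a missing `ViewCount` would discard about half of the merged data, which is one argument for filling instead. Note that `fillna` returns a new Series and leaves the dataframe untouched unless the result is assigned back. The sketch below, on a made-up column, checks that and uses the median as one possible fill value; whether mean or median suits the real data better is an assumption to verify, since view counts are often skewed by a few very popular posts:

```python
import pandas as pd

# Toy column (assumption) with one gap and one large outlier.
toy = pd.DataFrame({"ViewCount": [None, 16.0, 40.0, 1000.0]})

# Without assignment the result is discarded and the gap remains.
toy["ViewCount"].fillna(toy["ViewCount"].mean())
still_missing = int(toy["ViewCount"].isna().sum())

# Assign the result back; the median is less sensitive to the outlier.
filled = toy.copy()
filled["ViewCount"] = filled["ViewCount"].fillna(filled["ViewCount"].median())
remaining = int(filled["ViewCount"].isna().sum())

print(still_missing, remaining)
```

Re-running `isna().sum()` after the fill, as done here, is a quick way to check the result before moving on.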

#### 10. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [20]:
df.dtypes

UserId            int64
Reputation        int64
Views             int64
UpVotes           int64
DownVotes         int64
PostId            int64
Score             int64
ViewCount       float64
CommentCount      int64
dtype: object
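
In the output above, `ViewCount` is the only float column, most likely because pandas stores an integer column with missing values as `float64`. A hedged sketch of two ways to bring it back to an integer type, on a made-up frame; which one fits depends on whether the gaps were filled in the previous step:

```python
import pandas as pd

# Toy frame (assumption) mirroring the float ViewCount column.
toy = pd.DataFrame({"UserId": [1, 2], "ViewCount": [16.0, None]})

# Option A: the nullable "Int64" dtype keeps missing values as <NA>.
nullable = toy["ViewCount"].astype("Int64")

# Option B: fill first, round (a mean is rarely a whole number), then cast.
plain = toy["ViewCount"].fillna(toy["ViewCount"].mean()).round().astype("int64")

print(nullable.dtype, plain.dtype)
```

Casting a column that still contains `NaN` straight to `int64` raises an error, which is why Option B fills before casting.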