# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as you learnt in the lesson on importing and exporting data.


#### 3. Create a mysql engine to set the connection to the server. Check the connection details in [this link](https://relational.fit.cvut.cz/dataset/Stats).
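
A connection of this kind could be sketched as follows. The helper names `build_mysql_url` and `make_engine` are hypothetical, and every credential value below is a placeholder, not the real setting from the linked page. The `mysql+pymysql://` prefix is SQLAlchemy's URL scheme for the pymysql driver.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, database):
    """Build a SQLAlchemy URL for MySQL through the pymysql driver."""
    # quote_plus escapes characters such as '@' or ':' in the credentials.
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def make_engine(url):
    """Create the engine; this needs sqlalchemy and pymysql installed."""
    # Imported lazily so the URL helper also works without SQLAlchemy.
    from sqlalchemy import create_engine
    return create_engine(url)


# Placeholder values (assumptions): substitute the details from the linked page.
url = build_mysql_url("some_user", "some_p@ss", "db.example.org", 3306, "stats")
print(url)
```

Calling `make_engine(url)` with real connection details would return an engine that `pd.read_sql` accepts as its connection argument.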

#### 4. Import the users table.

In [2]:
users = pd.read_csv('../data/users.csv')
users

,Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,0,-1,1,0,5007,1920
1,1,2,101,25,3,0
2,2,3,101,22,19,0
3,3,4,101,11,0,0
4,4,5,6792,1145,662,5
...,...,...,...,...,...,...
40320,40320,55743,1,0,0,0
40321,40321,55744,6,1,0,0
40322,40322,55745,101,0,0,0
40323,40323,55746,106,1,0,0
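
The cell above reads a local CSV export. Reading the same table over a database connection might look like the sketch below. An in-memory SQLite database stands in for the MySQL server (an assumption made so the example runs anywhere), and the rows are the first three shown in the output above.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the MySQL server (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "Id INTEGER, Reputation INTEGER, Views INTEGER, "
    "UpVotes INTEGER, DownVotes INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?, ?)",
    [(-1, 1, 0, 5007, 1920), (2, 101, 25, 3, 0), (3, 101, 22, 19, 0)],
)
conn.commit()

# pd.read_sql runs the query and returns the result as a DataFrame.
users_from_sql = pd.read_sql(
    "SELECT Id, Reputation, Views, UpVotes, DownVotes FROM users", conn
)
conn.close()
print(users_from_sql)
```

With a SQLAlchemy engine for the real server, the engine would take the place of `conn` in the `pd.read_sql` call.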


#### 5. Rename Id column to userId.

In [3]:
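# The CSV already names this column userId, so no rename is needed here.
# Drop the leftover index column that was saved along with the data.
users_clean = users.drop(columns='Unnamed: 0')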
users_clean.head()

,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5


#### 6. Import the posts table. 

In [4]:
posts = pd.read_csv('../data/posts.csv')
posts

,Unnamed: 0,PostId,userId,Score,ViewCount,CommentCount
0,0,1,8.0,23,1278.0,1
1,1,2,24.0,22,8198.0,1
2,2,3,18.0,54,3613.0,4
3,3,4,23.0,13,5224.0,2
4,4,5,23.0,81,,3
...,...,...,...,...,...,...
91971,91971,115374,805.0,2,,2
91972,91972,115375,49365.0,0,9.0,0
91973,91973,115376,55746.0,1,5.0,2
91974,91974,115377,805.0,0,,0


#### 7. Rename Id column to postId and OwnerUserId to userId.

In [5]:
posts.columns[1]

'PostId'

In [6]:
# The CSV already has userId, so only PostId needs renaming; also drop the saved index column.
posts_clean = posts.rename(columns = {'PostId': 'postId'}).drop('Unnamed: 0', axis = 1)
posts_clean.head()

,postId,userId,Score,ViewCount,CommentCount
0,1,8.0,23,1278.0,1
1,2,24.0,22,8198.0,1
2,3,18.0,54,3613.0,4
3,4,23.0,13,5224.0,2
4,5,23.0,81,,3
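
If the table came straight from the database with the raw names given in the step heading (`Id` and `OwnerUserId`), both renames could be done in one call. This is a minimal sketch on a toy frame; `posts_raw` and `posts_renamed` are illustrative names, and the values are the first three rows shown above.

```python
import pandas as pd

# Toy frame using the raw column names mentioned in the step heading.
posts_raw = pd.DataFrame(
    {"Id": [1, 2, 3], "OwnerUserId": [8.0, 24.0, 18.0], "Score": [23, 22, 54]}
)

# rename returns a new DataFrame; columns missing from the mapping are kept as-is.
posts_renamed = posts_raw.rename(columns={"Id": "postId", "OwnerUserId": "userId"})
print(posts_renamed.columns.tolist())
```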


#### 8. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts_sliced columns**: postId, Score, userId, ViewCount, CommentCount
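
The notebook has no cell for this step, so here is one way the selection could be written. The toy frames stand in for `users_clean` and `posts_clean` and reuse a few rows from the outputs above.

```python
import pandas as pd

# Small stand-ins for the cleaned tables (values taken from the outputs above).
users_clean = pd.DataFrame(
    {"userId": [-1, 2, 3], "Reputation": [1, 101, 101], "Views": [0, 25, 22],
     "UpVotes": [5007, 3, 19], "DownVotes": [1920, 0, 0]}
)
posts_clean = pd.DataFrame(
    {"postId": [1, 2, 3], "userId": [8.0, 24.0, 18.0], "Score": [23, 22, 54],
     "ViewCount": [1278.0, 8198.0, 3613.0], "CommentCount": [1, 1, 4]}
)

# Indexing with a list of names selects those columns in that order;
# .copy() makes the slices independent of the original frames.
users_sliced = users_clean[
    ["userId", "Reputation", "Views", "UpVotes", "DownVotes"]
].copy()
posts_sliced = posts_clean[
    ["postId", "Score", "userId", "ViewCount", "CommentCount"]
].copy()
print(posts_sliced.columns.tolist())
```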

#### 9. Merge the two dataframes created in the step above (8), users_sliced and posts_sliced. 
Use [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) to combine the posts and users dataframes.
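
To see what a left merge does with posts that have no matching user, the sketch below uses toy frames and the `indicator=True` option of `merge`, which adds a `_merge` column recording where each row came from. The frame names and the third, user-less post are illustrative assumptions.

```python
import numpy as np
import pandas as pd

posts_demo = pd.DataFrame(
    {"postId": [1, 2, 3], "userId": [8.0, 24.0, np.nan], "Score": [23, 22, 5]}
)
users_demo = pd.DataFrame({"userId": [8.0, 24.0], "Reputation": [6764, 344]})

# how='left' keeps every post; user columns become NaN where no user matches.
merged_demo = posts_demo.merge(users_demo, how="left", on="userId", indicator=True)
print(merged_demo)
```

Rows flagged `left_only` are the posts whose user columns will show up as missing values after the merge.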

In [7]:
merged = posts_clean.merge(right=users_clean, how='left', on='userId')
merged.head()

,postId,userId,Score,ViewCount,CommentCount,Reputation,Views,UpVotes,DownVotes
0,1,8.0,23,1278.0,1,6764.0,1089.0,604.0,25.0
1,2,24.0,22,8198.0,1,344.0,48.0,36.0,1.0
2,3,18.0,54,3613.0,4,128.0,8.0,16.0,0.0
3,4,23.0,13,5224.0,2,308.0,52.0,34.0,1.0
4,5,23.0,81,,3,308.0,52.0,34.0,1.0


#### 10. How many missing values do you have in your merged dataframe? On which columns?

In [8]:
merged.info()
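# Missing values: userId, Reputation, Views, UpVotes and DownVotes lack 1,392 rows each (91,976 - 90,584);
# ViewCount lacks 49,055 (91,976 - 42,921). That is 56,015 missing values in total.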

<class 'pandas.core.frame.DataFrame'>
Int64Index: 91976 entries, 0 to 91975
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   postId        91976 non-null  int64  
 1   userId        90584 non-null  float64
 2   Score         91976 non-null  int64  
 3   ViewCount     42921 non-null  float64
 4   CommentCount  91976 non-null  int64  
 5   Reputation    90584 non-null  float64
 6   Views         90584 non-null  float64
 7   UpVotes       90584 non-null  float64
 8   DownVotes     90584 non-null  float64
dtypes: float64(6), int64(3)
memory usage: 7.0 MB
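
`info()` reports non-null counts, so the missing counts have to be derived by subtraction. A more direct count could be sketched with `isna().sum()`; the toy frame below is an illustrative stand-in for `merged`.

```python
import numpy as np
import pandas as pd

merged_demo = pd.DataFrame(
    {"postId": [1, 2, 3, 4],
     "userId": [8.0, 24.0, np.nan, 23.0],
     "ViewCount": [1278.0, np.nan, np.nan, 5224.0]}
)

# isna() marks each missing cell; summing counts them per column.
missing_per_column = merged_demo.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]
total_missing = int(missing_per_column.sum())

print(columns_with_missing)
print("Total missing values:", total_missing)
```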


#### 11. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [9]:
# Rows without a userId cannot be linked to a user, so drop them.
merged = merged.dropna(subset=['userId'])
merged.info()

# ViewCount is still missing in more than half of the rows, so fill it with the column mean instead of dropping.
merged['ViewCount'] = merged['ViewCount'].fillna(merged['ViewCount'].mean())
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 91975
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   postId        90584 non-null  int64  
 1   userId        90584 non-null  float64
 2   Score         90584 non-null  int64  
 3   ViewCount     42188 non-null  float64
 4   CommentCount  90584 non-null  int64  
 5   Reputation    90584 non-null  float64
 6   Views         90584 non-null  float64
 7   UpVotes       90584 non-null  float64
 8   DownVotes     90584 non-null  float64
dtypes: float64(6), int64(3)
memory usage: 6.9 MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 91975
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   postId        90584 non-null  int64  
 1   userId        90584 non-null  float64
 2   Score         90584 non-null  int64  
 3   ViewCount     90584 non-null  float64
 4 
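
The notebook fills `ViewCount` with the mean. Because a few very popular posts can pull the mean upward, the median is a common alternative; the sketch below compares both on the first few `ViewCount` values shown earlier. It illustrates the trade-off and is not the method used above.

```python
import numpy as np
import pandas as pd

view_count = pd.Series([1278.0, 8198.0, 3613.0, 5224.0, np.nan], name="ViewCount")

# mean() and median() skip NaN by default, so both use the four known values.
filled_mean = view_count.fillna(view_count.mean())
filled_median = view_count.fillna(view_count.median())

print("mean fill:  ", filled_mean.iloc[-1])
print("median fill:", filled_median.iloc[-1])
```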

#### 12. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [10]:
# All float64 columns hold whole-number quantities (views, votes, reputation), so cast them to integers.
# Note: astype(int) uses the default integer type of the NumPy build in use (int32 in the output below) and truncates the mean-filled ViewCount values.
merged = merged.astype(int)
merged.info()

# The Id columns are identifiers, not quantities; storing them as strings keeps them out of summary statistics such as mean and std.
merged[['postId','userId']] = merged[['postId','userId']].astype(str)
merged.info()
merged.head(50)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 91975
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   postId        90584 non-null  int32
 1   userId        90584 non-null  int32
 2   Score         90584 non-null  int32
 3   ViewCount     90584 non-null  int32
 4   CommentCount  90584 non-null  int32
 5   Reputation    90584 non-null  int32
 6   Views         90584 non-null  int32
 7   UpVotes       90584 non-null  int32
 8   DownVotes     90584 non-null  int32
dtypes: int32(9)
memory usage: 3.8 MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 91975
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   postId        90584 non-null  object
 1   userId        90584 non-null  object
 2   Score         90584 non-null  int32 
 3   ViewCount     90584 non-null  int32 
 4   CommentCount  90584 non-null  int32 
 

,postId,userId,Score,ViewCount,CommentCount,Reputation,Views,UpVotes,DownVotes
0,1,8,23,1278,1,6764,1089,604,25
1,2,24,22,8198,1,344,48,36,1
2,3,18,54,3613,4,128,8,16,0
3,4,23,13,5224,2,308,52,34,1
4,5,23,81,556,3,308,52,34,1
5,6,5,152,29229,5,6792,1145,662,5
6,7,38,76,5808,3,133,15,1,0
7,8,37,0,288,2,156,2,0,0
8,9,50,13,556,3,101,3,3,0
9,10,24,23,21925,4,344,48,36,1
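
A variation on the conversion above, as a sketch: naming `'int64'` explicitly gives the same width on every platform, and rounding first avoids silently truncating a mean-filled value. The toy frame and the fractional `ViewCount` value of 556.7 are illustrative assumptions.

```python
import pandas as pd

merged_demo = pd.DataFrame(
    {"postId": [1, 2, 5],
     "userId": [8.0, 24.0, 23.0],
     "ViewCount": [1278.0, 8198.0, 556.7]}
)

# Round before casting so 556.7 becomes 557 instead of being cut to 556.
merged_demo["ViewCount"] = merged_demo["ViewCount"].round().astype("int64")

# Cast ids to int first so the strings read '8', not '8.0'.
merged_demo["userId"] = merged_demo["userId"].astype("int64").astype(str)
merged_demo["postId"] = merged_demo["postId"].astype(str)

print(merged_demo.dtypes)
```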
