# Data Cleaning 

#### 1. Import pandas library.

In [8]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as you learned in the lesson on importing/exporting data.


In [12]:
import pymysql 
import sqlalchemy
import numpy as np

#### 3. Create a MySQL engine to connect to the server. Check the connection details in [this link](https://relational.fit.cvut.cz/dataset/Stats).

In [18]:
# Connection string format: mysql+pymysql://<user>:<password>@<host>/<database>
conn_string = 'mysql+pymysql://guest:relational@relational.fit.cvut.cz/stats'
conn = sqlalchemy.create_engine(conn_string)

#### 4. Import the users table.

In [70]:
users = pd.read_sql_query('SELECT * FROM users;', conn)
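
The query above uses `SELECT *`, which pulls every column even though only a few are used later. As a minimal sketch, the query can instead name the columns it needs and alias `Id` on the way in. The example uses an in-memory SQLite database as a stand-in for the remote MySQL server (an assumption, so that it runs offline), and the table contents and the names `demo_conn` and `demo_users` are invented for illustration.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the remote MySQL server (assumption for an offline demo).
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute(
    'CREATE TABLE users (Id INTEGER, Reputation INTEGER, Views INTEGER, AboutMe TEXT)'
)
demo_conn.executemany(
    'INSERT INTO users VALUES (?, ?, ?, ?)',
    [(1, 101, 25, 'hello'), (2, 8, 0, None)],  # toy rows, not real data
)

# Select only the needed columns and rename Id inside the query itself.
demo_users = pd.read_sql_query(
    'SELECT Id AS userId, Reputation, Views FROM users;', demo_conn
)
print(demo_users.columns.tolist())
demo_conn.close()
```

The same query text should work against the MySQL engine, but that has not been checked here.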

#### 5. Rename Id column to userId.

In [71]:
users.rename(columns={'Id':'userId'}, inplace=True)

#### 6. Import the posts table. 

In [72]:
posts = pd.read_sql_query('SELECT * FROM posts;', conn)

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [73]:
posts.rename(columns={'Id':'postId','OwnerUserId':'userId'}, inplace=True)

#### 8. Define new dataframes for users and posts with the following selected columns:
**users columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [83]:
usersdf = users[['userId','Reputation','Views','UpVotes','DownVotes']]
postsdf = posts[['postId','Score','userId','ViewCount','CommentCount']]
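
A column subset taken this way can later trigger `SettingWithCopyWarning` when it is modified, depending on the pandas version. A small sketch of one common way to avoid that is to call `.copy()` on the selection. The toy table and the names `toy_users` and `toy_usersdf` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the users table (values are invented).
toy_users = pd.DataFrame({
    'userId': [1, 2],
    'Reputation': [101, 8],
    'Location': ['Prague', None],
})

# .copy() makes the subset an independent DataFrame.
toy_usersdf = toy_users[['userId', 'Reputation']].copy()
toy_usersdf['Reputation'] = toy_usersdf['Reputation'] * 2

# The original table is unchanged by edits to the copy.
print(toy_users['Reputation'].tolist())    # [101, 8]
print(toy_usersdf['Reputation'].tolist())  # [202, 16]
```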

#### 8. Merge both dataframes, users and posts. 
You will need to [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) the posts and users dataframes.

In [88]:
# Inner join on the only shared column, userId (made explicit here).
userpost = usersdf.merge(postsdf, on='userId')
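
By default, `merge` performs an inner join on the column names the two frames share, so users without posts and posts without a matching user are silently dropped. The sketch below, on invented toy data, compares the default with an outer join that uses `indicator=True` to show where each row came from. All names in it are hypothetical.

```python
import pandas as pd

# Toy data: user 1 has two posts, users 2 and 3 have none,
# and post 102 belongs to a user (4) missing from the users table.
toy_users = pd.DataFrame({'userId': [1, 2, 3], 'Reputation': [10, 20, 30]})
toy_posts = pd.DataFrame({
    'postId': [100, 101, 102],
    'userId': [1, 1, 4],
    'Score': [5, 0, 2],
})

# Default behaviour: inner join, only matching userId values survive.
inner = toy_users.merge(toy_posts, on='userId')

# Outer join keeps everything; the _merge column records each row's origin.
outer = toy_users.merge(toy_posts, on='userId', how='outer', indicator=True)
origin_counts = outer['_merge'].astype(str).value_counts().to_dict()

print(len(inner))
print(origin_counts)
```

Comparing the row counts of the two joins is a quick way to see how many rows an inner join discards.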

#### 9. How many missing values do you have in your merged dataframe? On which columns?

In [86]:
userpost.isna().sum()

userId              0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
postId              0
Score               0
ViewCount       48396
CommentCount        0
dtype: int64
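
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on invented toy data: `isna().mean()` gives the fraction of missing values per column, and a boolean filter keeps only the affected columns.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in one column only (values are invented).
toy = pd.DataFrame({
    'postId': [1, 2, 3, 4],
    'Score': [3, 0, 7, 1],
    'ViewCount': [120.0, np.nan, np.nan, 45.0],
})

missing_count = toy.isna().sum()
missing_share = toy.isna().mean()  # fraction of rows that are NaN in each column

# Keep only the columns that actually have missing values.
affected = missing_count[missing_count > 0]
print(affected)
print(missing_share['ViewCount'])
```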

#### 10. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [106]:
# All the NaN values come from ViewCount, so I replace them with 0,
# i.e. I treat a missing view count as zero views.
userpost.fillna(0, inplace=True)
userpost.isna().sum()

userId          0
Reputation      0
Views           0
UpVotes         0
DownVotes       0
postId          0
Score           0
ViewCount       0
CommentCount    0
dtype: int64
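
Whether 0 is the right fill value depends on why `ViewCount` is missing, which the notebook does not establish. As a hedged illustration on invented toy data, the sketch below compares filling with dropping and shows that the choice changes summary statistics.

```python
import numpy as np
import pandas as pd

# Toy frame: one of three posts has no view count (values are invented).
toy = pd.DataFrame({
    'postId': [1, 2, 3],
    'ViewCount': [120.0, np.nan, 45.0],
})

# Option 1: fill, keeping every row but treating "missing" as zero views.
filled = toy.fillna({'ViewCount': 0})

# Option 2: drop the rows where ViewCount is missing.
dropped = toy.dropna(subset=['ViewCount'])

# The two choices give different averages.
print(filled['ViewCount'].mean())   # 55.0
print(dropped['ViewCount'].mean())  # 82.5
```

Filling keeps every row for analyses of the other columns, while dropping keeps `ViewCount` statistics free of artificial zeros.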

#### 11. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [116]:
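# Cast every column to int. ViewCount is the one that most likely needs it,
# since a numeric column holding NaN is stored as float.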
userpost = userpost.astype(int)
userpost.dtypes



userId          int32
Reputation      int32
Views           int32
UpVotes         int32
DownVotes       int32
postId          int32
Score           int32
ViewCount       int32
CommentCount    int32
dtype: object
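
The `int32` result above reflects the platform's default integer width, which `astype(int)` picks up on some setups (for example, Windows with NumPy 1.x). As a sketch on invented toy data, naming the dtype explicitly gives the same result everywhere, and the nullable `Int64` dtype is an alternative that keeps missing values instead of filling them.

```python
import numpy as np
import pandas as pd

# Toy frame with a float column caused by a missing value (values are invented).
toy = pd.DataFrame({
    'postId': [1, 2, 3],
    'ViewCount': [120.0, np.nan, 45.0],
})

# Explicit width: the result does not depend on the platform's default int.
as_int64 = toy.fillna({'ViewCount': 0}).astype({'ViewCount': 'int64'})

# Nullable integer dtype: missing values stay missing (<NA>) instead of becoming 0.
as_nullable = toy.astype({'ViewCount': 'Int64'})

print(as_int64['ViewCount'].dtype)
print(as_nullable['ViewCount'].dtype)
print(as_nullable['ViewCount'].isna().sum())
```

Passing a dict to `astype` also limits the cast to the columns that need it, rather than converting the whole frame.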