# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as you learned in the lesson on importing/exporting data.


In [2]:
import pymysql
from sqlalchemy import create_engine

#### 3. Create a MySQL engine to connect to the server. Check the connection details in [this link](https://relational.fit.cvut.cz/dataset/Stats).

In [3]:
engine = create_engine('mysql+pymysql://guest:relational@relational.fit.cvut.cz:3306/stats')

#### 4. Import the users table.

In [4]:
users_df = pd.read_sql_query('SELECT * FROM users', engine)
# users_df.head(60)
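
To try the query step without network access, the sketch below runs the same `pd.read_sql_query` call against an in-memory SQLite database. The rows are made up for illustration, and `demo_users` and `conn` are hypothetical names; only the `Id` and `Reputation` column names are borrowed from the real `users` table.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the remote MySQL server: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (Id INTEGER PRIMARY KEY, Reputation INTEGER)")
conn.executemany(
    "INSERT INTO users (Id, Reputation) VALUES (?, ?)",
    [(1, 101), (2, 57), (3, 8)],  # made-up rows
)
conn.commit()

# Same pandas call as in the cell above, only with a different connection object.
demo_users = pd.read_sql_query("SELECT * FROM users", conn)
conn.close()

print(demo_users.shape)
```

Checking `.shape` right after loading is a cheap way to confirm that the query returned what you expected before moving on.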

#### 5. Rename Id column to userId.

In [5]:
users_df = users_df.rename(columns={'Id':'userId'})
# users_df.head(60)

#### 6. Import the posts table. 

In [6]:
posts_df = pd.read_sql_query('SELECT * FROM posts', engine)
# posts_df.head(60)

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [7]:
posts_df = posts_df.rename(columns={'Id':'postId','OwnerUserId':'userId'})
# posts_df = posts_df.rename(columns={'OwnerUserId':'userId'}) - I tried the two renames separately first, then merged both into one call
# posts_df.head(10)
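
One thing to watch with `rename`: by default it silently ignores keys that do not match an existing column, so a typo in a column name goes unnoticed. The sketch below uses a made-up two-row frame (`demo_posts` is a hypothetical name) to show how `errors="raise"` surfaces that mistake.

```python
import pandas as pd

# Toy frame with the same column names as the posts table (values are made up).
demo_posts = pd.DataFrame({"Id": [10, 11], "OwnerUserId": [1, 2]})

# Default behaviour: the misspelled key (wrong case) is ignored, nothing is renamed.
unchanged = demo_posts.rename(columns={"OwnerUserID": "userId"})
print(list(unchanged.columns))

# With errors="raise" the same typo becomes a KeyError.
try:
    demo_posts.rename(columns={"OwnerUserID": "userId"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True
print(typo_caught)

# The correct mapping passes the check and renames both columns.
renamed = demo_posts.rename(
    columns={"Id": "postId", "OwnerUserId": "userId"}, errors="raise"
)
print(list(renamed.columns))
```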

#### 8. Define new dataframes for users and posts with the following selected columns:
**users columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [8]:
users_df_new = users_df[['userId', 'Reputation', 'Views', 'UpVotes', 'DownVotes']]
# users_df_new.head(10)
posts_df_new = posts_df[['postId', 'Score', 'userId', 'ViewCount', 'CommentCount']]
# posts_df_new.head(10)

#### 9. Merge the new dataframes you have created, of users and posts. 
You will need to do an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.

In [9]:
merged_inner = pd.merge(left=users_df_new, right=posts_df_new, on='userId', how='inner')
# merged_inner.head(30)
# Note: the merge is on userId, and a user can have several postIds, so each user's columns repeat once per post.
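
To see what an inner merge on `userId` does to the row count, here is a minimal sketch on made-up data. The frames `demo_users` and `demo_posts` are hypothetical. The `validate="one_to_many"` argument is an optional safety check that raises an error if `userId` is duplicated on the users side.

```python
import pandas as pd

# Made-up data: user 1 has two posts, user 2 has one, user 3 has none,
# and post 13 belongs to a user (99) that is not in the users frame.
demo_users = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [100, 50, 10]})
demo_posts = pd.DataFrame(
    {"postId": [10, 11, 12, 13], "userId": [1, 1, 2, 99], "Score": [5, 3, 0, 1]}
)

# Inner merge keeps only userIds present in both frames.
demo_merged = pd.merge(
    demo_users, demo_posts, on="userId", how="inner", validate="one_to_many"
)

print(len(demo_merged))                  # one row per matching post
print(sorted(demo_merged["userId"]))     # user 1 appears twice
```

User 3 (no posts) and post 13 (unknown user) both drop out, while user 1's `Reputation` is repeated on each of their two posts.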

#### 10. How many missing values do you have in your merged dataframe? On which columns?

In [10]:
null_col = merged_inner.isnull().sum()

print("We have the following missing values:", null_col[null_col > 0])

We have the following missing values: ViewCount    48396
dtype: int64
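
The raw count is easier to judge next to the share of rows it represents. The sketch below, on a made-up four-row frame (`demo` and `summary` are hypothetical names), puts both in one table: `isnull().sum()` counts missing values per column and `isnull().mean()` gives the fraction.

```python
import numpy as np
import pandas as pd

# Made-up frame: two of the four ViewCount values are missing.
demo = pd.DataFrame(
    {"Score": [1, 2, 3, 4], "ViewCount": [10.0, np.nan, np.nan, 40.0]}
)

missing_count = demo.isnull().sum()    # number of missing values per column
missing_share = demo.isnull().mean()   # fraction of missing values per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})

# Show only the columns that actually have missing values.
print(summary[summary["missing"] > 0])
```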


#### 11. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [11]:
# Taking a look at the number of rows
merged_inner.shape

# Taking a look at the ViewCount values
merged_inner[['ViewCount']].sort_values(by='ViewCount', ascending=False)

# 48396 of 90584 rows (roughly 53%) have no ViewCount, so dropping them would discard more than half of the data.
# I therefore keep the rows and fill the missing values instead.

# Replacing NaN with 0
merged_inner['ViewCount'] = merged_inner['ViewCount'].fillna(0)

# Checking the ViewCount values
merged_inner[['ViewCount']].sort_values(by='ViewCount')


|  | ViewCount |
|---|---|
| 0 | 0.0 |
| 43546 | 0.0 |
| 43547 | 0.0 |
| 43548 | 0.0 |
| 43549 | 0.0 |
| ... | ... |
| 22067 | 88129.0 |
| 28018 | 91848.0 |
| 31022 | 92612.0 |
| 16828 | 98109.0 |
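
Filling with 0 keeps every row, but it has a side effect worth checking: 0 now stands for "unknown", which pulls averages down. The sketch below compares dropping and filling on a made-up frame (`demo`, `dropped` and `filled` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Made-up frame: half of the ViewCount values are missing.
demo = pd.DataFrame(
    {"postId": [1, 2, 3, 4], "ViewCount": [100.0, np.nan, 300.0, np.nan]}
)

dropped = demo.dropna(subset=["ViewCount"])   # loses the rows with missing values
filled = demo.fillna({"ViewCount": 0})        # keeps all rows, NaN becomes 0

print(len(dropped), len(filled))

# mean() skips NaN by default, so the two means differ after filling.
print(demo["ViewCount"].mean(), filled["ViewCount"].mean())
```

Which option is better depends on what the column is used for later. If averages of `ViewCount` matter, the filled zeros should be kept in mind when reading them.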


#### 12. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [12]:
# Taking a look at the datatypes
merged_inner.dtypes

# I'm changing ViewCount to an integer type because all its values are whole numbers
merged_inner['ViewCount'] = merged_inner['ViewCount'].astype('int64')

# Checking the datatypes
merged_inner.dtypes

userId          int64
Reputation      int64
Views           int64
UpVotes         int64
DownVotes       int64
postId          int64
Score           int64
ViewCount       int64
CommentCount    int64
dtype: object
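
The cast to a plain integer type only works because the missing values were filled first: a float column that still contains NaN cannot be converted to `int64`. As a possible alternative, pandas also offers a nullable integer dtype, written `"Int64"` with a capital I, which keeps the gaps as `<NA>`. A minimal sketch on made-up values (`views`, `as_int` and `as_nullable` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Made-up ViewCount values with one gap.
views = pd.Series([12.0, np.nan, 7.0], name="ViewCount")

# Approach used in this notebook: fill first, then cast to a plain integer dtype.
as_int = views.fillna(0).astype("int64")

# Alternative: the nullable integer dtype keeps the missing value as <NA>.
as_nullable = views.astype("Int64")

print(as_int.dtype, as_nullable.dtype)
print(as_nullable.isna().sum())   # the gap is still visible
```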