# Data Cleaning 

#### 1. Import pandas library.

In [2]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as covered in the lesson on importing and exporting data.


In [3]:

from sqlalchemy import create_engine
import pymysql

#### 3. Create a MySQL engine to connect to the server. Check the connection details at [this link](https://relational.fit.cvut.cz/dataset/Stats).

In [4]:
driver = "mysql+pymysql"
user = "guest"
password = "relational"
ip = "relational.fit.cvut.cz"
database = "stats"

connection_string = f"{driver}://{user}:{password}@{ip}/{database}"

engine = create_engine(connection_string)
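
The f-string above works because the guest credentials contain only letters. As a side note, a password containing reserved characters such as `@` or `/` would break the URL unless it is escaped first. The sketch below shows one way to do that with the standard library's `urllib.parse.quote_plus`; the helper name `build_connection_string` and the password `p@ss/word` are hypothetical and only serve as illustration.

```python
from urllib.parse import quote_plus


def build_connection_string(driver, user, password, host, database):
    """Hypothetical helper: escape the credentials before building the URL."""
    return f"{driver}://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Same values as in the cell above: nothing needs escaping here.
plain = build_connection_string(
    "mysql+pymysql", "guest", "relational", "relational.fit.cvut.cz", "stats"
)
print(plain)

# Made-up password with reserved characters, to show the escaping.
tricky = build_connection_string(
    "mysql+pymysql", "guest", "p@ss/word", "relational.fit.cvut.cz", "stats"
)
print(tricky)
```

The resulting string can be passed to `create_engine` exactly like `connection_string`.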

#### 4. Import the users table.

In [5]:
query1 = "SELECT * FROM users"
users = pd.read_sql(query1, engine)


#### 5. Rename Id column to userId.

In [6]:
users = users.rename(columns={"Id": "userId"})
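
One thing to watch with `rename`: by default it silently ignores keys that do not match any column, so a typo in the old name leaves the dataframe unchanged. The sketch below, on a made-up toy dataframe rather than the real users table, shows how passing `errors="raise"` turns such a typo into a `KeyError`.

```python
import pandas as pd

# Toy frame standing in for the users table (values are made up).
toy = pd.DataFrame({"Id": [1, 2], "Reputation": [10, 20]})

# A misspelled key ("ID" instead of "Id") is ignored by default.
unchanged = toy.rename(columns={"ID": "userId"})
print(list(unchanged.columns))

# With errors="raise", the same typo raises KeyError instead.
try:
    toy.rename(columns={"ID": "userId"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True
print(typo_detected)

# The correct key renames the column as expected.
renamed = toy.rename(columns={"Id": "userId"}, errors="raise")
print(list(renamed.columns))
```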

#### 6. Import the posts table. 

In [7]:
query2 = "SELECT * FROM posts"
posts = pd.read_sql(query2, engine)
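
`SELECT *` downloads every column, although only a handful are used in the following steps. A possible alternative is to select and rename the columns directly in SQL. The sketch below illustrates the idea against an in-memory SQLite table, which is only a stand-in for the remote MySQL server; the table contents are made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for the remote server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (Id INTEGER, OwnerUserId INTEGER, Score INTEGER, Body TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [(1, 10, 3, "first post"), (2, 11, 0, "second post")],
)

# Select only the needed columns and rename them with AS.
query = "SELECT Id AS postId, OwnerUserId AS userId, Score FROM posts"
toy_posts = pd.read_sql(query, conn)
conn.close()

print(toy_posts)
```

With this approach the separate `rename` and column-selection steps become unnecessary, at the cost of a longer query.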

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [8]:
posts = posts.rename(columns={"Id": "postId", "OwnerUserId": "userId"})

#### 8. Define new dataframes for users and posts with the following selected columns:
**users columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [16]:
users_renamed = users[["userId", "Reputation", "Views", "UpVotes", "DownVotes"]]
posts_renamed = posts[["postId", "Score", "userId", "ViewCount", "CommentCount"]]


#### 8. Merge both dataframes, users and posts. 
You will need to [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) the posts and users dataframes.

In [40]:
merge_df = users_renamed.merge(posts_renamed, how = 'inner', on = "userId")
merge_df.tail()

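|       | userId | Reputation | Views | UpVotes | DownVotes | postId | Score | ViewCount | CommentCount |
|-------|--------|------------|-------|---------|-----------|--------|-------|-----------|--------------|
| 90579 | 55734  | 1          | 0     | 0       | 0         | 115352 | 0     | 16.0      | 0            |
| 90580 | 55738  | 11         | 0     | 0       | 0         | 115360 | 2     | 40.0      | 4            |
| 90581 | 55742  | 6          | 0     | 0       | 0         | 115366 | 1     | 17.0      | 0            |
| 90582 | 55744  | 6          | 1     | 0       | 0         | 115370 | 1     | 13.0      | 2            |
| 90583 | 55746  | 106        | 1     | 0       | 0         | 115376 | 1     | 5.0       | 2            |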
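
An inner merge silently drops users without posts and posts without a matching user. To see how many rows fall into each group, one option is an outer merge with `indicator=True`; `validate="one_to_many"` additionally checks that `userId` is unique on the users side. The sketch below uses made-up toy dataframes, not the real tables.

```python
import pandas as pd

# Toy data: user 2 and user 3 have no posts, post 102 has no matching user.
toy_users = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [10, 20, 30]})
toy_posts = pd.DataFrame({"postId": [100, 101, 102], "userId": [1, 1, 4]})

# validate raises MergeError if userId is duplicated in toy_users.
inner = toy_users.merge(toy_posts, how="inner", on="userId", validate="one_to_many")
print(len(inner))

# The _merge column records where each row of the outer merge came from.
outer = toy_users.merge(toy_posts, how="outer", on="userId", indicator=True)
counts = outer["_merge"].value_counts()
print(counts)
```

Rows marked `left_only` or `right_only` are the ones an inner merge would discard.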


#### 9. How many missing values do you have in your merged dataframe? On which columns?

In [25]:
merge_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 90583
Data columns (total 9 columns):
userId          90584 non-null int64
Reputation      90584 non-null int64
Views           90584 non-null int64
UpVotes         90584 non-null int64
DownVotes       90584 non-null int64
postId          90584 non-null int64
Score           90584 non-null int64
ViewCount       42188 non-null float64
CommentCount    90584 non-null int64
dtypes: float64(1), int64(8)
memory usage: 6.9 MB
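
`info()` reports non-null counts, so the number of missing values has to be worked out by subtraction. A more direct way is `isna().sum()`, sketched below on a small made-up dataframe that mimics the shape of the merged data.

```python
import numpy as np
import pandas as pd

# Toy frame: two of the four ViewCount values are missing.
toy = pd.DataFrame(
    {
        "postId": [1, 2, 3, 4],
        "ViewCount": [16.0, np.nan, np.nan, 5.0],
        "CommentCount": [0, 4, 0, 2],
    }
)

# Number of missing values per column, keeping only columns that have any.
missing = toy.isna().sum()
print(missing[missing > 0])

# Share of missing values in one column.
share = toy["ViewCount"].isna().mean()
print(share)
```

Applied to `merge_df`, the same two calls would list the affected columns and the fraction of rows involved.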


#### 10. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [36]:
# More than half of the ViewCount values are NaN, so dropping those rows would discard too much information.
# Instead we replace the NaNs with 0: they seem to indicate a view count of less than 2, since some of the NaN rows do have comments, but never more than 2 or 3.

merge_df["ViewCount"] = merge_df["ViewCount"].fillna(0)
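
Filling with 0 makes the originally missing rows indistinguishable from real zeros. A cautious variant, sketched below on made-up data, records which rows were missing in a flag column before filling and summarises `CommentCount` for both groups, which is one way to check the assumption behind the fill. The column name `ViewCountMissing` is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for merge_df (values are made up).
toy = pd.DataFrame(
    {
        "ViewCount": [16.0, np.nan, 40.0, np.nan],
        "CommentCount": [0, 2, 4, 3],
    }
)

# Remember which rows were missing before they are overwritten.
toy["ViewCountMissing"] = toy["ViewCount"].isna()

# Compare CommentCount between rows with and without a ViewCount.
print(toy.groupby("ViewCountMissing")["CommentCount"].describe())

# Fill the gaps, as in the cell above.
toy["ViewCount"] = toy["ViewCount"].fillna(0)
print(toy)
```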

#### 11. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [39]:
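# ViewCount was float64 only because it contained NaNs; with the NaNs filled, it can be stored as int64 like the other columns.
merge_df["ViewCount"] = merge_df["ViewCount"].astype("int64")
merge_df.info()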

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 90583
Data columns (total 9 columns):
userId          90584 non-null int64
Reputation      90584 non-null int64
Views           90584 non-null int64
UpVotes         90584 non-null int64
DownVotes       90584 non-null int64
postId          90584 non-null int64
Score           90584 non-null int64
ViewCount       90584 non-null int64
CommentCount    90584 non-null int64
dtypes: int64(9)
memory usage: 6.9 MB
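
The cast to `int64` only works once no NaNs are left, because NumPy integers cannot represent missing values. If the missing values should be kept rather than filled, pandas' nullable `Int64` dtype (capital I) is a possible alternative. The sketch below compares both routes on a made-up series.

```python
import numpy as np
import pandas as pd

# Toy series with one missing value.
views = pd.Series([16.0, np.nan, 40.0], name="ViewCount")

# Nullable integer dtype: whole numbers are stored as integers, NaN becomes <NA>.
nullable = views.astype("Int64")
print(nullable)

# Route used in this notebook: fill first, then cast to a plain NumPy integer.
filled = views.fillna(0).astype("int64")
print(filled)
```

Which route fits better depends on whether later steps need to tell a missing view count apart from a real zero.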
