# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as covered in the lesson on importing and exporting data.


In [2]:
from sqlalchemy import create_engine
import pymysql

#### 3. Create a MySQL engine to connect to the server. Check the connection details at [this link](https://relational.fit.cvut.cz/dataset/Stats).

In [4]:
driver = "mysql+pymysql"
user = "guest"
password = "relational"
ip = "relational.fit.cvut.cz"
database = "stats"

connection_string = f"{driver}://{user}:{password}@{ip}/{database}"
engine = create_engine(connection_string)
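
The connection string above works because the guest credentials contain no reserved URL characters. As a minimal sketch, assuming a hypothetical password that contains `@` and `/`, the password can be escaped with the standard library before it is placed in the URL:

```python
from urllib.parse import quote_plus

driver = "mysql+pymysql"
user = "guest"
password = "p@ss/word"  # hypothetical password, not the real guest credential
ip = "relational.fit.cvut.cz"
database = "stats"

# Escape the password so that "@" and "/" are not read as URL delimiters.
safe_password = quote_plus(password)
connection_string = f"{driver}://{user}:{safe_password}@{ip}/{database}"
print(connection_string)
```

The escaped string can then be passed to `create_engine` in the same way as before.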

#### 4. Import the users table.

In [12]:
query = """
        SELECT * FROM users
"""

df_user = pd.read_sql(query, engine)
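
To experiment with `pd.read_sql` without depending on the remote server, one option is an in-memory SQLite database. The sketch below is only an illustration: the table contents and the name `df_demo` are made up, and SQLite stands in for MySQL. It also names the columns in the query instead of using `SELECT *`, which keeps the transfer small when only a few columns are needed.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the MySQL server (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (Id INTEGER, Reputation INTEGER, DisplayName TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, 10, "alice"), (2, 25, "bob"), (3, 5, "carol")],
)
conn.commit()

# Request only the columns that are needed rather than SELECT *.
query = "SELECT Id, Reputation FROM users"
df_demo = pd.read_sql(query, conn)
conn.close()

print(df_demo.shape)
```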

#### 5. Rename Id column to userId.

In [16]:
df_user = df_user.rename(columns={"Id":"userId"})
df_user.columns

Index(['userId', 'Reputation', 'CreationDate', 'DisplayName', 'LastAccessDate',
       'WebsiteUrl', 'Location', 'AboutMe', 'Views', 'UpVotes', 'DownVotes',
       'AccountId', 'Age', 'ProfileImageUrl'],
      dtype='object')

#### 6. Import the posts table. 

In [14]:
query = """
        SELECT * FROM posts
"""

df_posts = pd.read_sql(query, engine)

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [25]:
df_posts.head()
df_posts = df_posts.rename(columns= {"Id":"postId", "OwnerUserId":"userId"})
df_posts.columns


Index(['postId', 'PostTypeId', 'AcceptedAnswerId', 'CreaionDate', 'Score',
       'ViewCount', 'Body', 'userId', 'LasActivityDate', 'Title', 'Tags',
       'AnswerCount', 'CommentCount', 'FavoriteCount', 'LastEditorUserId',
       'LastEditDate', 'CommunityOwnedDate', 'ParentId', 'ClosedDate',
       'OwnerDisplayName', 'LastEditorDisplayName'],
      dtype='object')
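
One thing to watch with `rename`: by default it silently ignores keys that do not match any column, so a typo in a column name goes unnoticed. A small sketch on a made-up frame (`df_demo` is hypothetical) shows how `errors="raise"` turns that into a `KeyError`:

```python
import pandas as pd

df_demo = pd.DataFrame({"Id": [1, 2], "OwnerUserId": [10, 20]})

# Default behaviour: the misspelled key "id" is ignored and nothing is renamed.
unchanged = df_demo.rename(columns={"id": "postId"})

# With errors="raise", the same typo is reported as a KeyError.
try:
    df_demo.rename(columns={"id": "postId"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

# Correct keys rename as expected.
renamed = df_demo.rename(
    columns={"Id": "postId", "OwnerUserId": "userId"}, errors="raise"
)
print(list(renamed.columns))
```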

#### 8. Define new dataframes for users and posts with the following selected columns:
**users columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [26]:
user = df_user[["userId","Reputation","Views","UpVotes","DownVotes"]]
posts = df_posts[["postId", "Score", "userId", "ViewCount", "CommentCount"]]
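
If the selected frames will be modified later, adding `.copy()` makes it explicit that they are independent of the originals, and in some pandas versions it avoids a `SettingWithCopyWarning` on later assignments. A minimal sketch on made-up data (`df_demo` and `subset` are hypothetical names):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"userId": [1, 2], "Reputation": [10, 25], "AboutMe": ["a", "b"]}
)

# Select the columns of interest and take an explicit copy.
subset = df_demo[["userId", "Reputation"]].copy()

# Changing the copy leaves the original frame untouched.
subset["Reputation"] = subset["Reputation"] * 2
print(subset)
```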

#### 8. Merge both dataframes, users and posts. 
You will need to [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) the posts and users dataframes.

In [47]:
user_posts = user.merge(posts, on="userId")
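
`merge` performs an inner join by default, so users without posts and posts without a matching user are dropped. The sketch below uses two made-up frames (`users_demo` and `posts_demo` are hypothetical) to show how `how="outer"` with `indicator=True` reveals which rows an inner join leaves out. `validate="one_to_many"` is an optional check that `userId` is unique on the users side.

```python
import pandas as pd

users_demo = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [10, 25, 5]})
posts_demo = pd.DataFrame(
    {"postId": [100, 101, 102], "userId": [1, 1, 4], "Score": [3, 0, 7]}
)

# Inner join (the default): only userIds present in both frames survive.
inner = users_demo.merge(posts_demo, on="userId", validate="one_to_many")

# Outer join with an indicator column showing where each row came from.
outer = users_demo.merge(posts_demo, on="userId", how="outer", indicator=True)

print(len(inner), len(outer))
print(outer["_merge"].value_counts())
```

Comparing the two row counts is a quick way to see how much data the inner join discards.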

#### 9. How many missing values do you have in your merged dataframe? In which columns?

In [48]:
user_posts.info()

# There are 48,396 missing values, all in the ViewCount column.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 90583
Data columns (total 9 columns):
userId          90584 non-null int64
Reputation      90584 non-null int64
Views           90584 non-null int64
UpVotes         90584 non-null int64
DownVotes       90584 non-null int64
postId          90584 non-null int64
Score           90584 non-null int64
ViewCount       42188 non-null float64
CommentCount    90584 non-null int64
dtypes: float64(1), int64(8)
memory usage: 6.9 MB
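
Reading the counts off `info()` means subtracting non-null counts from the total by hand. As an alternative sketch, `isna().sum()` gives the number of missing values per column directly and `isna().mean()` gives the share. The frame `demo` below is made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "postId": [1, 2, 3, 4],
        "ViewCount": [10.0, float("nan"), float("nan"), 7.0],
        "Score": [1, 0, 2, 5],
    }
)

# Number of missing values in each column.
missing_per_column = demo.isna().sum()

# Fraction of missing values in each column.
missing_share = demo.isna().mean()

# Show only the columns that actually have missing values.
print(missing_per_column[missing_per_column > 0])
```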


#### 10. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [49]:
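# About 53% of ViewCount is missing (48,396 of 90,584 rows), so dropping those rows
# would discard more than half of the data. I will fill the missing values with 0 instead.

user_posts.tail()
# Note: the string "0" turns ViewCount into an object column (see the output below);
# step 11 converts it back to an integer type.
user_posts = user_posts.fillna("0")
user_posts.info()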

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 90583
Data columns (total 9 columns):
userId          90584 non-null int64
Reputation      90584 non-null int64
Views           90584 non-null int64
UpVotes         90584 non-null int64
DownVotes       90584 non-null int64
postId          90584 non-null int64
Score           90584 non-null int64
ViewCount       90584 non-null object
CommentCount    90584 non-null int64
dtypes: int64(8), object(1)
memory usage: 6.9+ MB
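
The `object` dtype in the output above comes from filling with the string `"0"`. As a hedged alternative, filling with the number `0` keeps the column numeric, so the later cast is a plain float-to-int conversion. The series `views` below is made up for illustration.

```python
import pandas as pd

views = pd.Series([10.0, float("nan"), 7.0], name="ViewCount")

# Filling with a string mixes floats and text, so the result is an object column.
filled_str = views.fillna("0")

# Filling with a number keeps the column as float64.
filled_num = views.fillna(0)

print(filled_str.dtype, filled_num.dtype)

# The numeric version converts cleanly to integers.
views_int = filled_num.astype("int64")
```

Whether 0 is a sensible fill value depends on what a missing `ViewCount` means in this dataset, which the notebook does not establish.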


#### 11. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [50]:
# ViewCount became an object column after the fill in step 10, so cast it back to an integer type.
user_posts = user_posts.astype({"ViewCount":"int64"})
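
Casting floats to integers drops any fractional part, so it can be worth confirming that the values are whole numbers before converting. A minimal sketch on a made-up series (`views` is hypothetical):

```python
import pandas as pd

views = pd.Series([10.0, 0.0, 7.0], name="ViewCount")

# Check that no value has a fractional part that the cast would drop.
is_whole = bool((views % 1 == 0).all())

# Convert only when the cast is lossless; otherwise keep the float column.
views_int = views.astype("int64") if is_whole else views

print(views_int.dtype)
```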