# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as you learned in the lesson on importing/exporting data.


In [2]:
import pymysql
import sqlalchemy

#### 3. Create a MySQL engine to set up the connection to the server. Check the connection details in [this link](https://relational.fit.cvut.cz/dataset/Stats).

In [4]:
from sqlalchemy import create_engine
USER = 'guest'
PASSWORD = 'relational'
HOST = 'relational.fit.cvut.cz'
PORT = '3306'
DATABASE = 'stats'
db_connection_str = f'mysql+pymysql://{USER}:{PASSWORD}@{HOST}:{PORT}/{DATABASE}'
db_connection = create_engine(db_connection_str)
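The f-string above works because these public credentials contain only plain characters. If a user name or password contained characters such as `@` or `/`, the URL would no longer parse correctly. A minimal sketch, assuming a hypothetical helper `build_mysql_url` and a made-up password, that percent-encodes the credentials with the standard library before building the same kind of URL:

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, port, database):
    """Hypothetical helper: build a mysql+pymysql URL with escaped credentials."""
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Same public connection details as in the cell above.
url = build_mysql_url("guest", "relational", "relational.fit.cvut.cz", 3306, "stats")
print(url)

# Made-up password with special characters, only to show the escaping.
tricky = build_mysql_url("guest", "p@ss/word", "localhost", 3306, "stats")
print(tricky)
```

The resulting string can be passed to `create_engine` in the same way as `db_connection_str`.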

#### 4. Import the users table.

In [5]:
query = 'SELECT * FROM users'
df_users = pd.read_sql(query, con=db_connection)

#### 5. Rename Id column to userId.

In [6]:
df_users_renamed = df_users.rename(columns={"Id": "userId"})

#### 6. Import the posts table. 

In [7]:
query = 'SELECT * FROM posts'
df_posts = pd.read_sql(query, con=db_connection)

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [122]:
df_posts_renamed = df_posts.rename(columns={"Id": "postId", "OwnerUserId": "userId"})
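Two details of `rename` are easy to miss: it returns a new dataframe instead of modifying the original, and by default it silently ignores keys that do not match any column. A small sketch on a toy dataframe (the values are made up and only stand in for the posts table):

```python
import pandas as pd

# Toy stand-in for the posts table (values are made up).
posts = pd.DataFrame({"Id": [1, 2], "OwnerUserId": [10, 11], "Score": [5, 3]})

renamed = posts.rename(columns={"Id": "postId", "OwnerUserId": "userId"})

# rename() returns a new DataFrame; the original keeps its column names.
print(list(posts.columns))
print(list(renamed.columns))

# errors="raise" turns a misspelled source column into a KeyError
# instead of silently doing nothing.
try:
    posts.rename(columns={"ID": "postId"}, errors="raise")
except KeyError as exc:
    print("rename failed:", exc)
```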


#### 8. Define new dataframes for users and posts with the following selected columns:
**users columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [9]:
df_users_selected = df_users_renamed[["userId", "Reputation", "Views", "UpVotes", "DownVotes"]]
df_posts_selected = df_posts_renamed[["postId", "Score", "userId", "ViewCount", "CommentCount"]]

#### 9. Merge the new dataframes you have created, of users and posts. 
You will need to perform an inner [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) of the posts and users dataframes.

In [22]:
df_merged = pd.merge(df_users_selected, df_posts_selected, how="inner", on="userId")
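An inner merge keeps only the rows whose `userId` appears in both dataframes, so unmatched users or posts disappear without any warning. The sketch below, on made-up toy data, shows two optional checks from `pd.merge`: `validate` to assert the expected key relationship (assumed here to be one user to many posts) and `indicator` on an outer merge to see what an inner merge would drop:

```python
import pandas as pd

# Toy data (made up): user 2 and user 3 have no posts, post 102 has an unknown owner.
users = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [10, 20, 30]})
posts = pd.DataFrame({"postId": [100, 101, 102], "userId": [1, 1, 4], "Score": [5, 2, 7]})

# validate raises MergeError if userId is not unique on the users side.
inner = pd.merge(users, posts, how="inner", on="userId", validate="one_to_many")
print(inner)

# An outer merge with indicator=True labels each row as
# "both", "left_only" or "right_only" in the _merge column.
outer = pd.merge(users, posts, how="outer", on="userId", indicator=True)
unmatched = outer[outer["_merge"] != "both"]
print(unmatched)
```

Here `inner` has two rows (both posts of user 1), while `unmatched` lists the three rows that the inner merge leaves out.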

#### 10. How many missing values do you have in your merged dataframe? On which columns?

In [23]:
df_merged.info()
# info() shows that the dataframe has 90584 entries.
# The non-null count per column shows that every column is fully populated
# except ViewCount, where only 42188 rows have a value.

df_merged.isnull().sum(axis=0)
# isnull().sum() counts the missing values per column directly:
# ViewCount has 48396, all other columns have 0.


<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 90583
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   userId        90584 non-null  int64  
 1   Reputation    90584 non-null  int64  
 2   Views         90584 non-null  int64  
 3   UpVotes       90584 non-null  int64  
 4   DownVotes     90584 non-null  int64  
 5   postId        90584 non-null  int64  
 6   Score         90584 non-null  int64  
 7   ViewCount     42188 non-null  float64
 8   CommentCount  90584 non-null  int64  
dtypes: float64(1), int64(8)
memory usage: 6.9 MB


userId              0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
postId              0
Score               0
ViewCount       48396
CommentCount        0
dtype: int64
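With many columns it helps to show only the columns that actually have gaps, together with the share of rows affected. In the output above that share is 48396 / 90584, roughly 53% of `ViewCount`. A minimal sketch of the same idea on a made-up toy dataframe:

```python
import pandas as pd

nan = float("nan")

# Toy data (made up): half of the ViewCount values are missing.
df = pd.DataFrame({
    "postId": [1, 2, 3, 4],
    "ViewCount": [10.0, nan, nan, 7.0],
    "Score": [1, 0, 2, 3],
})

# Missing values per column, keeping only columns with at least one gap.
missing = df.isna().sum()
missing_only = missing[missing > 0]
print(missing_only)

# The mean of a boolean mask is the fraction of missing rows.
share = df["ViewCount"].isna().mean()
print(f"ViewCount missing: {share:.0%}")
```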

#### 11. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [123]:
# At first glance it seems illogical to have rows where Views is zero and ViewCount is not, or vice versa.
# However, one value seems to count the views of the post and the other seems to refer to the views of the user (profile), so they do not have to match.

# I considered whether the two could be related by aggregation at user level (over all of the user's posts), but the data
# shows that the user-level numbers are lower, so I tend to think that post numbers and user numbers do not have to agree.

# Looking only at the post values, it does not make sense to treat null values as zeros, because rows with a null ViewCount
# still have non-null values in the other post-related columns. Filling with 1 (as a minimum value) or with the average ViewCount per user (as an estimate) could be a solution,
# but in the raw posts table the null values in ViewCount seem to be associated with null values in other fields, which suggests poor data quality.

# Given all that, I decide to clean the column by dropping every row where ViewCount is null.

# There is another issue with id values equal to -1. I would also drop those rows, because their values in other columns are unusual (Score and CommentCount equal to zero),
# but they are already removed by the ViewCount null exclusion.

# This is the code:

df_filtered = df_merged[df_merged["ViewCount"].notnull()]
df_filtered.info()



<class 'pandas.core.frame.DataFrame'>
Int64Index: 42188 entries, 211 to 90583
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   userId        42188 non-null  int64  
 1   Reputation    42188 non-null  int64  
 2   Views         42188 non-null  int64  
 3   UpVotes       42188 non-null  int64  
 4   DownVotes     42188 non-null  int64  
 5   postId        42188 non-null  int64  
 6   Score         42188 non-null  int64  
 7   ViewCount     42188 non-null  float64
 8   CommentCount  42188 non-null  int64  
dtypes: float64(1), int64(8)
memory usage: 3.2 MB
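The boolean mask used above is equivalent to `dropna` restricted to one column. The sketch below checks that on made-up toy data and also shows one way to confirm that no rows with id -1 remain afterwards. It assumes, for illustration only, that the -1 values sit in `userId`, and the toy rows are constructed so that this holds:

```python
import pandas as pd

nan = float("nan")

# Toy data (made up): the row with userId -1 also has a missing ViewCount.
df = pd.DataFrame({
    "userId": [-1, 5, 5, 8],
    "postId": [1, 2, 3, 4],
    "ViewCount": [nan, 12.0, nan, 40.0],
})

by_mask = df[df["ViewCount"].notnull()]
by_dropna = df.dropna(subset=["ViewCount"])

# Both approaches keep exactly the same rows.
print(by_mask.equals(by_dropna))

# Count how many rows with userId -1 survive the filter.
remaining_minus_one = int((by_dropna["userId"] == -1).sum())
print(remaining_minus_one)
```

Running the same count on the real filtered dataframe is a quick way to verify the claim rather than rely on inspection.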


#### 12. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [120]:
df_typed = df_filtered.astype({
    "userId": "int64", "Reputation": "int64", "Views": "int64",
    "UpVotes": "int64", "DownVotes": "int64", "postId": "int64",
    "Score": "int64", "ViewCount": "int64", "CommentCount": "int64",
})
df_typed.info()
# All of the values are numeric: keys (userId, postId), counters (Views, UpVotes, DownVotes, ViewCount, CommentCount)
# or rank data (Reputation, Score), so it makes sense to set every column to an integer type.
# In practice only ViewCount changes (float64 to int64); the other columns were already int64.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42188 entries, 211 to 90583
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   userId        42188 non-null  int64
 1   Reputation    42188 non-null  int64
 2   Views         42188 non-null  int64
 3   UpVotes       42188 non-null  int64
 4   DownVotes     42188 non-null  int64
 5   postId        42188 non-null  int64
 6   Score         42188 non-null  int64
 7   ViewCount     42188 non-null  int64
 8   CommentCount  42188 non-null  int64
dtypes: int64(9)
memory usage: 3.2 MB
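The conversion to `int64` only works because the rows with a missing `ViewCount` were dropped first: `ViewCount` was `float64` precisely because plain integers cannot hold NaN. If the rows had been kept, pandas' nullable `Int64` dtype (capital I) would be an alternative. A small sketch on a made-up series, not part of the exercise solution:

```python
import pandas as pd

# Toy series (made up) with one missing value.
views = pd.Series([12.0, float("nan"), 40.0], name="ViewCount")

# Plain int64 cannot represent NaN, so this conversion raises.
try:
    views.astype("int64")
except (ValueError, TypeError) as exc:
    print("int64 conversion failed:", type(exc).__name__)

# The nullable Int64 dtype stores integers and keeps the gap as <NA>.
nullable = views.astype("Int64")
print(nullable.dtype)
print(nullable.isna().sum())
```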
