# Data Cleaning 

#### 1. Import pandas library.

In [16]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as covered in the lesson on importing/exporting data.


In [17]:
import pymysql
from sqlalchemy import create_engine



#### 3. Create a MySQL engine to connect to the server. The connection details are listed at [this link](https://relational.fit.cvut.cz/dataset/Stats).

In [18]:
driver = 'mysql+pymysql:'
user = 'guest'
password = "relational"
ip = 'relational.fit.cvut.cz'
database = 'stats'
connection_string = f'{driver}//{user}:{password}@{ip}/{database}'
engine = create_engine(connection_string)
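
The connection string interpolates the credentials as-is, which works for the public guest account. A minimal sketch, using only the standard library, of how credentials containing URL delimiters such as `@` or `/` could be escaped first. The helper name `build_connection_string` and the password `p@ss/word` are hypothetical and only serve the illustration.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host, database,
                            driver="mysql+pymysql"):
    """Hypothetical helper: build a SQLAlchemy-style URL with escaped credentials."""
    # quote_plus percent-encodes characters such as '@' or '/' that would
    # otherwise be read as URL delimiters.
    return f"{driver}://{quote_plus(user)}:{quote_plus(password)}@{host}/{database}"


# Same public guest credentials as in the cell above.
connection_string = build_connection_string(
    "guest", "relational", "relational.fit.cvut.cz", "stats"
)
print(connection_string)

# A made-up password containing '@' and '/' shows the escaping at work.
print(build_connection_string("guest", "p@ss/word", "localhost", "stats"))
```

The resulting string can be passed to `create_engine` exactly like the hand-built one.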

#### 4. Import the users table.

In [24]:
users_table = pd.read_sql_query('SELECT * FROM stats.users', engine)
users_table.head()

| | Id | Reputation | CreationDate | DisplayName | LastAccessDate | WebsiteUrl | Location | AboutMe | Views | UpVotes | DownVotes | AccountId | Age | ProfileImageUrl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 1 | 2010-07-19 06:55:26 | Community | 2010-07-19 06:55:26 | http://meta.stackexchange.com/ | on the server farm | `<p>Hi, I'm not really a person.</p>\n\n<p>I'm ...` | 0 | 5007 | 1920 | -1 | | |
| 1 | 2 | 101 | 2010-07-19 14:01:36 | Geoff Dalgas | 2013-11-12 22:07:23 | http://stackoverflow.com | Corvallis, OR | `<p>Developer on the StackOverflow team. Find ...` | 25 | 3 | 0 | 2 | 37.0 | |
| 2 | 3 | 101 | 2010-07-19 15:34:50 | Jarrod Dixon | 2014-08-08 06:42:58 | http://stackoverflow.com | New York, NY | `<p><a href="http://blog.stackoverflow.com/2009...` | 22 | 19 | 0 | 3 | 35.0 | |
| 3 | 4 | 101 | 2010-07-19 19:03:27 | Emmett | 2014-01-02 09:31:02 | http://minesweeperonline.com | San Francisco, CA | `<p>currently at a startup in SF</p>\n\n<p>form...` | 11 | 0 | 0 | 1998 | 28.0 | http://i.stack.imgur.com/d1oHX.jpg |
| 4 | 5 | 6792 | 2010-07-19 19:03:57 | Shane | 2014-08-13 00:23:47 | http://www.statalgo.com | New York, NY | `<p>Quantitative researcher focusing on statist...` | 1145 | 662 | 5 | 54503 | 35.0 | |


#### 5. Rename Id column to userId.

In [25]:
# Rename by label rather than by position, so the step still works if the column order changes
users_table = users_table.rename(columns={"Id": "userId"})

#### 6. Import the posts table. 

In [35]:
posts_table = pd.read_sql_query('SELECT * FROM stats.posts', engine)

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [36]:
# Rename by label rather than by position, so the step still works if the column order changes
posts_table = posts_table.rename(columns={"Id": "postId", "OwnerUserId": "userId"})
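
Renaming by column label can also be made to fail loudly when a label is missing, which catches typos early. A minimal sketch on a made-up frame (`toy_posts` and its values are illustrative, not taken from the database), using the `errors="raise"` option of `DataFrame.rename`:

```python
import pandas as pd

# Toy stand-in for the posts table (values are made up).
toy_posts = pd.DataFrame({
    "Id": [1, 2],
    "Score": [23, 22],
    "OwnerUserId": [8, 24],
})

# errors="raise" makes rename fail with a KeyError if a label is absent,
# instead of silently leaving the frame unchanged.
toy_posts = toy_posts.rename(
    columns={"Id": "postId", "OwnerUserId": "userId"},
    errors="raise",
)
print(list(toy_posts.columns))

try:
    toy_posts.rename(columns={"NoSuchColumn": "x"}, errors="raise")
except KeyError as exc:
    print("rename failed:", exc)
```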

#### 8. Define new dataframes for users and posts with the following selected columns:
**users columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [38]:
users_columns = users_table[["userId", "Reputation", "Views", "UpVotes", "DownVotes"]]
posts_columns = posts_table[["postId", "Score", "userId", "ViewCount", "CommentCount"]]
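
The column selection can alternatively be pushed into the SQL query, so that only the needed columns travel over the network. A minimal sketch, assuming an in-memory SQLite database as a stand-in for the remote MySQL server (the table contents are made up):

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the remote server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (Id INTEGER, Reputation INTEGER, Location TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, 101, "New York, NY"), (2, 457, None)],
)
conn.commit()

# Select and rename in one step: only two columns are fetched.
users_small = pd.read_sql_query(
    "SELECT Id AS userId, Reputation FROM users", conn
)
conn.close()
print(users_small)
```

Against the real server, the same query text would be passed together with the SQLAlchemy `engine` instead of the SQLite connection.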

#### 8. Merge both dataframes, users and posts. 
Use [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) to combine the posts and users dataframes.

In [52]:
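# Note: left_index/right_index pairs rows by their position in each table, not by userId,
# so both userId columns are kept and get the suffixes _x (users) and _y (posts)
merged_df = pd.merge(users_columns, posts_columns, left_index=True, right_index=True)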
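
An index-based merge pairs row *i* of one table with row *i* of the other, whereas a key-based merge matches each post to the user who wrote it. A minimal sketch contrasting the two on made-up frames (`toy_users`, `toy_posts` and their values are illustrative only):

```python
import pandas as pd

# Made-up rows standing in for users_columns and posts_columns.
toy_users = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [10, 20, 30]})
toy_posts = pd.DataFrame({
    "postId": [100, 101, 102],
    "userId": [2, 2, 5],
    "Score": [7, 1, 4],
})

# Key-based merge: each post is matched to its owner via userId.
# how="left" keeps posts whose owner is absent from the users table.
by_key = pd.merge(toy_posts, toy_users, on="userId", how="left")

# Index-based merge: rows are paired by position, regardless of userId,
# so the shared column is duplicated as userId_x / userId_y.
by_index = pd.merge(toy_users, toy_posts, left_index=True, right_index=True)

print(by_key)
print(list(by_index.columns))
```

In the key-based result the post owned by user 5 has a missing `Reputation`, because no such user exists in the toy users table.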

#### 9. How many missing values do you have in your merged dataframe? On which columns?

In [53]:
count_nan = len(merged_df) - merged_df.count()
print(count_nan)
# 25,307 missing values in total: 1,040 in userId_y and 24,267 in ViewCount


userId_x            0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
postId              0
Score               0
userId_y         1040
ViewCount       24267
CommentCount        0
dtype: int64
       userId_x  Reputation  Views  UpVotes  DownVotes  postId  Score  \
0            -1           1      0     5007       1920       1     23   
1             2         101     25        3          0       2     22   
2             3         101     22       19          0       3     54   
3             4         101     11        0          0       4     13   
4             5        6792   1145      662          5       5     81   
5             6         457    114       47          0       6    152   
6             7         429     56       20          0       7     76   
7             8        6764   1089      604         25       8      0   
8            10         121     20        2          0       9     13   
9            11         136     10   
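
The per-column count of missing values can also be obtained directly with `isna().sum()`. A minimal sketch on a made-up frame (the values are illustrative), which additionally counts how many rows contain at least one missing value:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two columns (values are made up).
toy = pd.DataFrame({
    "userId_y": [1.0, np.nan, 3.0, np.nan],
    "ViewCount": [np.nan, 5.0, 8.0, np.nan],
    "Score": [1, 2, 3, 4],
})

# Missing values per column, keeping only the columns that have any.
missing_per_column = toy.isna().sum()
columns_with_missing = missing_per_column[missing_per_column > 0]

# Rows that have at least one missing value.
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(columns_with_missing)
print(f"rows with at least one missing value: {rows_with_missing} of {len(toy)}")
```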

#### 10. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [55]:
merged_df = merged_df.fillna(0)
# Dropping the rows with missing values could remove more than 65% of the data, so it is much better to fill them

       userId_x  Reputation  Views  UpVotes  DownVotes  postId  Score  \
0            -1           1      0     5007       1920       1     23   
1             2         101     25        3          0       2     22   
2             3         101     22       19          0       3     54   
3             4         101     11        0          0       4     13   
4             5        6792   1145      662          5       5     81   
5             6         457    114       47          0       6    152   
6             7         429     56       20          0       7     76   
7             8        6764   1089      604         25       8      0   
8            10         121     20        2          0       9     13   
9            11         136     10       10          0      10     23   
10           12         101     10        5          0      11      2   
11           13         817    178       44          1      12     20   
12           15          11      8        0        
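
The drop-or-fill decision can be backed by measuring how many rows `dropna` would discard. A minimal sketch on a made-up frame; the fill value 0 is an assumption for illustration (note that -1 would clash with the Community account's Id in the users table), and a filled 0 in `ViewCount` can no longer be told apart from a genuine zero:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for merged_df (values are made up).
toy = pd.DataFrame({
    "userId_y": [1.0, np.nan, 3.0, 4.0],
    "ViewCount": [np.nan, 5.0, np.nan, 7.0],
})

# Share of rows that would be lost by dropping every incomplete row.
share_lost = 1 - len(toy.dropna()) / len(toy)
print(f"dropna would remove {share_lost:.0%} of the rows")

# Filling with a per-column dictionary keeps every row and makes the
# chosen placeholder explicit for each column.
filled = toy.fillna({"userId_y": 0, "ViewCount": 0})
print(filled)
```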

#### 11. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [57]:
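merged_df = merged_df.astype({"postId": 'int64', "userId_y": 'int64', "ViewCount": 'int64'})
# IDs and counts should never hold fractional values, so they can be stored as integers.
# userId_y and ViewCount became float64 because they contained NaN before being filled.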
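
Columns that held missing values are stored as `float64`, and they stay that way after `fillna` until they are cast explicitly. A minimal sketch on a made-up frame that inspects the dtypes before and after the cast, and shows the nullable `Int64` dtype as an alternative that keeps missing values instead of replacing them:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for merged_df (values are made up).
toy = pd.DataFrame({
    "postId": [1, 2, 3],
    "userId_y": [8.0, np.nan, 24.0],
    "ViewCount": [1278.0, 8198.0, np.nan],
})

toy = toy.fillna(0)
print(toy.dtypes)  # userId_y and ViewCount are still float64 here

# Cast the filled columns to plain integers.
toy = toy.astype({"userId_y": "int64", "ViewCount": "int64"})
print(toy.dtypes)

# Alternative: the nullable integer dtype keeps the gap as <NA>.
nullable_ids = pd.Series([8.0, np.nan, 24.0]).astype("Int64")
print(nullable_ids)
```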