# Data Cleaning 

#### 1. Import pandas library.

In [3]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy as you learnt in the lesson on importing/exporting data.


#### 3. Create a MySQL engine to set up the connection to the server. Check the connection details in [this link](https://relational.fit.cvut.cz/dataset/Stats).

#### 4. Import the users table.

In [4]:
users = pd.read_csv("../Data/users.csv", index_col = "Unnamed: 0")
users.head()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5


#### 5. Rename Id column to userId.

In [5]:
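# The CSV already uses userId, so this rename is a no-op here.
# rename returns a new DataFrame, so the result must be assigned.
users = users.rename(columns={'Id': 'userId'})
users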

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5
...,...,...,...,...,...
40320,55743,1,0,0,0
40321,55744,6,1,0,0
40322,55745,101,0,0,0
40323,55746,106,1,0,0
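
As a minimal sketch of why the rename has to be assigned: `DataFrame.rename` returns a new DataFrame by default and leaves the original untouched. The frame `toy_users` and its values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy frame standing in for the users table (values are invented).
toy_users = pd.DataFrame({"Id": [1, 2], "Reputation": [101, 6792]})

# Without assignment the renamed copy is discarded; toy_users keeps "Id".
toy_users.rename(columns={"Id": "userId"})
unchanged_columns = list(toy_users.columns)

# Assigning the result keeps the renamed column.
renamed = toy_users.rename(columns={"Id": "userId"})
renamed_columns = list(renamed.columns)

print(unchanged_columns)
print(renamed_columns)
```

Renaming a column that does not exist is silently ignored by default, which is why calling it on an already renamed table does no harm.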


#### 6. Import the posts table. 

In [6]:
posts = pd.read_csv("../Data/posts.csv", index_col="Unnamed: 0")
posts.head()

Unnamed: 0,PostId,userId,Score,ViewCount,CommentCount
0,1,8.0,23,1278.0,1
1,2,24.0,22,8198.0,1
2,3,18.0,54,3613.0,4
3,4,23.0,13,5224.0,2
4,5,23.0,81,,3


#### 7. Rename Id column to PostId and OwnerUserId to userId.

In [7]:
# The CSV already uses PostId and userId, so this rename is a no-op here.
# rename returns a new DataFrame, so the result must be assigned.
posts = posts.rename(columns={"Id": "PostId", "OwnerUserId": "userId"})
posts

Unnamed: 0,PostId,userId,Score,ViewCount,CommentCount
0,1,8.0,23,1278.0,1
1,2,24.0,22,8198.0,1
2,3,18.0,54,3613.0,4
3,4,23.0,13,5224.0,2
4,5,23.0,81,,3
...,...,...,...,...,...
91971,115374,805.0,2,,2
91972,115375,49365.0,0,9.0,0
91973,115376,55746.0,1,5.0,2
91974,115377,805.0,0,,0


#### 8. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts_sliced columns**: PostId, Score, userId, ViewCount, CommentCount

In [8]:
# Select the columns by name; .copy() makes the slices independent of the originals.
users_sliced = users[["userId", "Reputation", "Views", "UpVotes", "DownVotes"]].copy()
posts_sliced = posts[["PostId", "Score", "userId", "ViewCount", "CommentCount"]].copy()

print(users_sliced.head())
print(posts_sliced.head())

   userId  Reputation  Views  UpVotes  DownVotes
0      -1           1      0     5007       1920
1       2         101     25        3          0
2       3         101     22       19          0
3       4         101     11        0          0
4       5        6792   1145      662          5
   PostId  Score  userId  ViewCount  CommentCount
0       1     23     8.0     1278.0             1
1       2     22    24.0     8198.0             1
2       3     54    18.0     3613.0             4
3       4     13    23.0     5224.0             2
4       5     81    23.0        NaN             3
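
Two common ways of selecting a subset of columns behave differently when a name is misspelled, which matters here because the exercise text and the data disagree on the capitalisation of the post id column. The sketch below uses a made-up `toy_posts` frame to show the difference; it is an illustration, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the posts table (values are invented).
toy_posts = pd.DataFrame({"PostId": [1, 2], "Score": [23, 22]})

# Constructor form: the misspelled name "postId" is reindexed into
# a new column that contains only NaN, with no error raised.
via_constructor = pd.DataFrame(toy_posts, columns=["postId", "Score"])
all_missing = bool(via_constructor["postId"].isna().all())

# Bracket selection: the same typo raises KeyError immediately.
try:
    toy_posts[["postId", "Score"]]
    raised = False
except KeyError:
    raised = True

# Correct spelling, with .copy() so later edits do not touch toy_posts.
selected = toy_posts[["PostId", "Score"]].copy()

print(all_missing, raised, list(selected.columns))
```

Bracket selection is the safer choice when the column names are typed by hand, since a typo fails loudly instead of producing an empty column.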


#### 9. Merge the two dataframes created in the step above (8), users_sliced and posts_sliced. 
Use [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) to combine the posts and users dataframes.

In [18]:
merge = users_sliced.merge(right=posts_sliced, how="inner", on="userId")
merge.head()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,PostId,Score,ViewCount,CommentCount
0,-1,1,0,5007,1920,2175,0,,0
1,-1,1,0,5007,1920,8576,0,,0
2,-1,1,0,5007,1920,8578,0,,0
3,-1,1,0,5007,1920,8981,0,,0
4,-1,1,0,5007,1920,8982,0,,0
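
An inner merge keeps only the rows whose key appears in both tables, so it can be worth checking what it leaves out. The sketch below uses made-up `toy_users` and `toy_posts` frames; `validate` and `indicator` are existing parameters of `DataFrame.merge`, but their use here is a suggestion rather than part of the exercise.

```python
import pandas as pd

# Toy frames (values are invented). User 3 has no posts,
# and post 13 belongs to a user that is not in toy_users.
toy_users = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [101, 6792, 1]})
toy_posts = pd.DataFrame(
    {"PostId": [10, 11, 12, 13], "userId": [1, 1, 2, 99], "Score": [5, 0, 3, 1]}
)

# validate raises MergeError if userId is not unique in toy_users.
merged = toy_users.merge(
    toy_posts, how="inner", on="userId", validate="one_to_many"
)

# An outer merge with indicator=True labels rows found in only one table,
# which are exactly the rows the inner merge discards.
outer = toy_users.merge(toy_posts, how="outer", on="userId", indicator=True)
dropped_by_inner = outer[outer["_merge"] != "both"]

print(len(merged), len(dropped_by_inner))
```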


#### 10. How many missing values do you have in your merged dataframe? On which columns?

In [19]:
merge.isnull().sum()

# ViewCount is the only column with missing values: 48396 rows.
# ViewCount belongs to the post, so these are posts with no recorded view count, not users who have not viewed anything.

userId              0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
PostId              0
Score               0
ViewCount       48396
CommentCount        0
dtype: int64
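
Besides the raw counts, the share of missing values per column helps judge how much data a cleaning decision affects. A minimal sketch on a made-up frame `toy` (not the merged data):

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gap as the merged table (values invented).
toy = pd.DataFrame(
    {"Score": [23, 22, 54, 13], "ViewCount": [1278.0, np.nan, 3613.0, np.nan]}
)

# Number of missing values per column.
missing_counts = toy.isna().sum()

# Fraction of missing values per column (mean of a boolean mask).
missing_share = toy.isna().mean()

# Names of the columns that have at least one missing value.
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(missing_share)
print(columns_with_missing)
```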

#### 11. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [20]:
# Filling the gaps with 0 would claim that those posts were viewed zero times.
# Check how ViewCount correlates with the other variables before deciding whether to drop it or keep it.
merge.corr()
# In the correlation table, ViewCount correlates moderately with Score (0.53) and only weakly with every other variable.
# Assuming ViewCount is not closely related to the other variables, we drop the rows where it is missing.

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,PostId,Score,ViewCount,CommentCount
userId,1.0,-0.344814,-0.301128,-0.293232,-0.19985,0.704867,-0.233026,-0.181916,-0.032604
Reputation,-0.344814,1.0,0.906704,0.852445,0.558412,-0.076308,0.124807,0.057293,0.04173
Views,-0.301128,0.906704,1.0,0.805355,0.636178,-0.122016,0.128674,0.056347,0.049023
UpVotes,-0.293232,0.852445,0.805355,1.0,0.636087,-0.099961,0.130299,0.046272,0.02882
DownVotes,-0.19985,0.558412,0.636178,0.636087,1.0,-0.079508,0.073268,0.033729,0.002877
PostId,0.704867,-0.076308,-0.122016,-0.099961,-0.079508,1.0,-0.262381,-0.235046,-0.041256
Score,-0.233026,0.124807,0.128674,0.130299,0.073268,-0.262381,1.0,0.532106,0.148255
ViewCount,-0.181916,0.057293,0.056347,0.046272,0.033729,-0.235046,0.532106,1.0,0.044713
CommentCount,-0.032604,0.04173,0.049023,0.02882,0.002877,-0.041256,0.148255,0.044713,1.0


In [24]:
merge = merge.dropna()

In [25]:
merge.isnull().sum()

userId          0
Reputation      0
Views           0
UpVotes         0
DownVotes       0
PostId          0
Score           0
ViewCount       0
CommentCount    0
dtype: int64
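
To make the trade-off between dropping and filling concrete, the sketch below compares both options on a made-up frame `toy`. It only illustrates how each choice changes the row count and a summary statistic; it does not claim which option is right for the real data.

```python
import numpy as np
import pandas as pd

# Toy frame (values invented): two of four posts lack a view count.
toy = pd.DataFrame(
    {"Score": [23, 22, 54, 13], "ViewCount": [1000.0, np.nan, 3000.0, np.nan]}
)

# Option 1: drop the rows where ViewCount is missing.
dropped = toy.dropna(subset=["ViewCount"])

# Option 2: fill the gaps with 0, which asserts "zero views".
filled_zero = toy.fillna({"ViewCount": 0})

# Dropping shrinks the table; filling with 0 pulls the mean down.
mean_dropped = dropped["ViewCount"].mean()
mean_filled = filled_zero["ViewCount"].mean()

print(len(dropped), mean_dropped)
print(len(filled_zero), mean_filled)
```

Passing `subset` limits the drop to rows missing that one column, which is safer than a bare `dropna()` if other columns later gain missing values.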

#### 12. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [27]:
merge.dtypes

userId            int64
Reputation        int64
Views             int64
UpVotes           int64
DownVotes         int64
PostId            int64
Score             int64
ViewCount       float64
CommentCount      int64
dtype: object

In [28]:
# Only ViewCount is float64; cast it to an explicit 64-bit integer.
merge_updated = merge.astype({"ViewCount": "int64"})

In [29]:
merge_updated.dtypes

userId          int64
Reputation      int64
Views           int64
UpVotes         int64
DownVotes       int64
PostId          int64
Score           int64
ViewCount       int64
CommentCount    int64
dtype: object
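
The cast to integer only works here because the missing values were removed first. As a sketch of the alternative, pandas' nullable `Int64` dtype can hold integers alongside missing values. The series `view_counts` below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy series (values invented) with one missing view count.
view_counts = pd.Series([1278.0, np.nan, 3613.0], name="ViewCount")

# A plain integer cast fails while NaN is present.
try:
    view_counts.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable Int64 dtype keeps the gap as <NA>.
nullable = view_counts.astype("Int64")

# Dropping the gap first allows the plain int64 cast.
plain = view_counts.dropna().astype("int64")

print(cast_failed)
print(nullable.dtype, int(nullable.isna().sum()))
print(plain.tolist())
```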