# Data Cleaning 

#### 1. Import pandas library.

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as covered in the lesson on importing and exporting data.


#### 3. Create a MySQL engine to connect to the server. Check the connection details in [this link](https://relational.fit.cvut.cz/dataset/Stats).
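
The notebook shows no code for steps 2 and 3, so the following is only a minimal sketch of how the connection string might be assembled. The helper name `build_mysql_url` and every credential value are placeholders (assumptions, not the real connection details, which are listed on the linked page).

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database, port=3306):
    """Assemble a SQLAlchemy-style URL for the pymysql driver (hypothetical helper)."""
    # Percent-encode the credentials so characters such as '@' or '/' do not break the URL.
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Placeholder values only; substitute the details from the linked page.
url = build_mysql_url("some_user", "p@ss/word", "db.example.org", "stats")
print(url)
```

Passing this string to `sqlalchemy.create_engine`, and the resulting engine to `pandas.read_sql`, would then load a table such as `users`. The notebook itself reads local CSV exports instead.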

#### 4. Import the users table.

In [2]:
users = pd.read_csv('../Data/users.csv',index_col = 'Unnamed: 0')
users.head()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5


#### 5. Rename Id column to userId.

In [3]:
users.rename(columns = {'Id':'userId'},inplace = True)
# Set the userId in the first row (-1 in the raw data) to 1; .loc avoids chained assignment.
users.loc[0, "userId"] = 1
users.head()
users["userId"].min()

1
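
Because the CSV already contains a `userId` column, the `rename` call above matches nothing and silently does nothing. A small sketch on a made-up frame shows how `errors="raise"` could surface that, and how `.loc` performs a single-cell update without chained indexing. The frame `toy_users` and its values are illustrative assumptions, not the real data.

```python
import pandas as pd

# Toy frame that mimics the first rows of the users table (values are illustrative).
toy_users = pd.DataFrame({"userId": [-1, 2, 3], "Reputation": [1, 101, 101]})

# By default, rename ignores keys that are not present, so this is a no-op.
unchanged = toy_users.rename(columns={"Id": "userId"})
rename_was_noop = list(unchanged.columns) == list(toy_users.columns)

# With errors="raise", a missing key is reported instead of being ignored.
try:
    toy_users.rename(columns={"Id": "userId"}, errors="raise")
    missing_key_reported = False
except KeyError:
    missing_key_reported = True

# .loc updates the cell in one indexing step, avoiding chained assignment.
toy_users.loc[0, "userId"] = 1
print(rename_was_noop, missing_key_reported, toy_users["userId"].min())
```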

#### 6. Import the posts table. 

In [4]:
posts = pd.read_csv('../Data/posts.csv',index_col = 'Unnamed: 0')
# Flip negative user ids to positive; missing ids stay NaN.
posts["userId"] = posts["userId"].abs()
posts.userId.min()

1.0

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [5]:
# The columns are already named postId and userId in the CSV, so this rename changes nothing.
posts.rename(columns = {'PostId':'postId','OwnerUserId':'userId'},inplace = True)
posts.head()

Unnamed: 0,postId,userId,Score,ViewCount,CommentCount
0,1,8.0,23,1278.0,1
1,2,24.0,22,8198.0,1
2,3,18.0,54,3613.0,4
3,4,23.0,13,5224.0,2
4,5,23.0,81,,3


#### 8. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts_sliced columns**: postId, Score, userId, ViewCount, CommentCount

In [6]:
users_sliced = users[["userId","Reputation","Views","UpVotes","DownVotes"]]
posts_sliced = posts[["postId","Score","userId","ViewCount","CommentCount"]]

#### 9. Merge the two dataframes created in the step above (8), users_sliced and posts_sliced. 
You will need to [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) the posts and users dataframes.

In [7]:
merged = users_sliced.merge(posts_sliced,how='inner',on='userId')
merged.head()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,postId,Score,ViewCount,CommentCount
0,1,1,0,5007,1920,2175,0,,0
1,1,1,0,5007,1920,8576,0,,0
2,1,1,0,5007,1920,8578,0,,0
3,1,1,0,5007,1920,8981,0,,0
4,1,1,0,5007,1920,8982,0,,0
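
Before relying on an inner merge, it can help to see which rows it discards. The sketch below uses two made-up frames standing in for `users_sliced` and `posts_sliced` (the values are assumptions) and relies on the `indicator` and `validate` parameters of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for users_sliced and posts_sliced (values are illustrative).
toy_users = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [1, 101, 101]})
toy_posts = pd.DataFrame({"postId": [10, 11, 12], "userId": [1, 1, 4]})

# An outer merge with indicator=True labels where each row came from;
# validate="one_to_many" raises MergeError if userId repeats on the users side.
check = toy_users.merge(
    toy_posts, how="outer", on="userId", indicator=True, validate="one_to_many"
)
counts = check["_merge"].value_counts()
print(counts)

# Rows labelled "both" are exactly the rows an inner merge keeps.
inner = toy_users.merge(toy_posts, how="inner", on="userId")
print(len(inner))
```

Users without posts appear as `left_only` and posts without a matching user as `right_only`; both groups disappear under `how='inner'`.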


#### 10. How many missing values do you have in your merged dataframe? In which columns?

In [8]:
total = 0
for c in merged.columns:
    missing = merged[c].isnull().sum()
    total += missing
    print(c,missing)
print("Total:",total)
merged.info()

userId 0
Reputation 0
Views 0
UpVotes 0
DownVotes 0
postId 0
Score 0
ViewCount 48396
CommentCount 0
Total: 48396
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90584 entries, 0 to 90583
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   userId        90584 non-null  int64  
 1   Reputation    90584 non-null  int64  
 2   Views         90584 non-null  int64  
 3   UpVotes       90584 non-null  int64  
 4   DownVotes     90584 non-null  int64  
 5   postId        90584 non-null  int64  
 6   Score         90584 non-null  int64  
 7   ViewCount     42188 non-null  float64
 8   CommentCount  90584 non-null  int64  
dtypes: float64(1), int64(8)
memory usage: 6.9 MB
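
In the output above, `ViewCount` is missing in 48396 of 90584 rows, roughly 53%. The same per-column count can be obtained without an explicit loop; the sketch below shows one way on a made-up frame (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Toy frame with one incomplete column (values are illustrative).
toy = pd.DataFrame(
    {"Score": [0, 3, 5], "ViewCount": [float("nan"), 120.0, float("nan")]}
)

# isnull() gives a boolean frame; summing counts the True values per column.
missing_per_column = toy.isnull().sum()
total_missing = int(missing_per_column.sum())

# The mean of the boolean frame is the share of missing values per column.
share_missing = toy.isnull().mean()

print(missing_per_column[missing_per_column > 0])
print("Total:", total_missing)
```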


#### 11. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before going to the next step.

In [22]:
# ViewCount correlates moderately with Score (about 0.53) and only weakly with the other columns,
# so I drop the rows where it is missing rather than fill them. This removes 48396 of 90584 rows.
merged_drop = merged.dropna()
merged_drop.info()
merged.corr()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42188 entries, 211 to 90583
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   userId        42188 non-null  int64  
 1   Reputation    42188 non-null  int64  
 2   Views         42188 non-null  int64  
 3   UpVotes       42188 non-null  int64  
 4   DownVotes     42188 non-null  int64  
 5   postId        42188 non-null  int64  
 6   Score         42188 non-null  int64  
 7   ViewCount     42188 non-null  float64
 8   CommentCount  42188 non-null  int64  
dtypes: float64(1), int64(8)
memory usage: 3.2 MB


Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,postId,Score,ViewCount,CommentCount
userId,1.0,-0.344814,-0.301128,-0.293232,-0.199846,0.704867,-0.233026,-0.181916,-0.032605
Reputation,-0.344814,1.0,0.906704,0.852445,0.558412,-0.076308,0.124807,0.057293,0.04173
Views,-0.301128,0.906704,1.0,0.805355,0.636178,-0.122016,0.128674,0.056347,0.049023
UpVotes,-0.293232,0.852445,0.805355,1.0,0.636087,-0.099961,0.130299,0.046272,0.02882
DownVotes,-0.199846,0.558412,0.636178,0.636087,1.0,-0.079508,0.073268,0.033729,0.002877
postId,0.704867,-0.076308,-0.122016,-0.099961,-0.079508,1.0,-0.262381,-0.235046,-0.041256
Score,-0.233026,0.124807,0.128674,0.130299,0.073268,-0.262381,1.0,0.532106,0.148255
ViewCount,-0.181916,0.057293,0.056347,0.046272,0.033729,-0.235046,0.532106,1.0,0.044713
CommentCount,-0.032605,0.04173,0.049023,0.02882,0.002877,-0.041256,0.148255,0.044713,1.0
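
Dropping rows is one option and filling is the other. The sketch below contrasts the two on a made-up frame so the row counts can be compared before committing to either. Filling with the median is an assumption chosen for illustration, not what the notebook does.

```python
import pandas as pd

# Toy frame where half of the ViewCount values are missing (values are illustrative).
toy = pd.DataFrame(
    {"postId": [1, 2, 3, 4], "ViewCount": [100.0, float("nan"), 300.0, float("nan")]}
)

# Option A: drop rows where ViewCount is missing (only that column is checked).
dropped = toy.dropna(subset=["ViewCount"])

# Option B: keep every row and fill the gaps with the column median.
median_views = toy["ViewCount"].median()
filled = toy.fillna({"ViewCount": median_views})

print(len(dropped), len(filled), median_views)
```

Option A keeps only observed values at the cost of rows; option B keeps all rows but replaces every gap with the same number, which flattens the column's distribution.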


#### 12. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [20]:
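# ViewCount was float only because of the NaNs; reassigning the frame avoids SettingWithCopyWarning.
merged_drop = merged_drop.astype({"ViewCount": "int64"})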
merged_drop.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42188 entries, 211 to 90583
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   userId        42188 non-null  int64
 1   Reputation    42188 non-null  int64
 2   Views         42188 non-null  int64
 3   UpVotes       42188 non-null  int64
 4   DownVotes     42188 non-null  int64
 5   postId        42188 non-null  int64
 6   Score         42188 non-null  int64
 7   ViewCount     42188 non-null  int64
 8   CommentCount  42188 non-null  int64
dtypes: int64(9)
memory usage: 3.2 MB
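
The cast to `int64` works here only because the rows with missing `ViewCount` were dropped first. If those rows had been kept, pandas' nullable `Int64` extension dtype would be one alternative; a minimal sketch on a made-up series (values are assumptions):

```python
import pandas as pd

# Toy ViewCount series with one gap (values are illustrative).
views = pd.Series([1278.0, 8198.0, float("nan")], name="ViewCount")

# Plain int64 cannot hold NaN, so the cast only works after the gaps are removed.
as_int64 = views.dropna().astype("int64")

# The nullable "Int64" extension dtype stores integers and keeps the gap as <NA>.
as_nullable = views.astype("Int64")

print(as_int64.dtype, as_nullable.dtype, as_nullable.isna().sum())
```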
