# Data Cleaning 

#### 1. Import the pandas library.

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as covered in the lesson on importing and exporting data.


#### 3. Create a MySQL engine to connect to the server. The connection details are listed at [this link](https://relational.fit.cvut.cz/dataset/Stats).
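Steps 2 and 3 have no cell in this notebook. A minimal sketch of what they might look like follows. The user, password, host and database name are placeholders (assumptions, not the real credentials, which are on the linked page), and `build_mysql_url` and `make_engine` are hypothetical helper names.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database, port=3306):
    """Return a SQLAlchemy URL that selects the MySQL dialect with the pymysql driver."""
    # quote_plus protects special characters in the credentials.
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


def make_engine(url):
    """Create the engine; needs sqlalchemy and pymysql to be installed."""
    # Imported lazily so the URL helper above works without SQLAlchemy.
    from sqlalchemy import create_engine
    return create_engine(url)


# Placeholder values: replace them with the details from the linked page.
url = build_mysql_url("some_user", "some_password", "db.example.org", "stats")
print(url)
```

Calling `make_engine(url)` with real credentials would return an engine that `pd.read_sql` can use.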

#### 4. Import the users table.

In [29]:
df_users = pd.read_csv("../../Datasets as a CSV/users.csv")
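The cell above loads the table from a local CSV export rather than from the server. For comparison, a query-based load could look like the sketch below. An in-memory SQLite database stands in for the remote MySQL server (an assumption made so the example runs anywhere), and the two rows are copied from the output shown later in this notebook.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the remote server in this demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users "
    "(Id INTEGER, Reputation INTEGER, Views INTEGER, UpVotes INTEGER, DownVotes INTEGER)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?, ?)",
    [(-1, 1, 0, 5007, 1920), (2, 101, 25, 3, 0)],
)
conn.commit()

# read_sql runs the query and returns the result as a DataFrame.
df_users_sql = pd.read_sql("SELECT * FROM users", conn)
conn.close()
print(df_users_sql)
```

With a SQLAlchemy engine, the same `pd.read_sql` call would take the engine in place of `conn`.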

In [33]:
# read_csv adds its own index, and the CSV also stores an index column ("Unnamed: 0"), so one of them is redundant.
# Drop the "Unnamed: 0" column that came from the CSV; axis = 1 targets columns (axis = 0 targets rows).

df_users = df_users.drop("Unnamed: 0", axis = 1) 
df_users.head()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5
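An alternative to dropping the extra column afterwards is to tell `read_csv` that the first CSV column is the index. The sketch below uses a small inline CSV (an assumption standing in for `users.csv`) whose first header cell is empty, which is what produces the `Unnamed: 0` name.

```python
import io

import pandas as pd

# Inline stand-in for users.csv: the first header cell is empty.
csv_text = """,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
"""

# Default read: the unnamed first column becomes a regular column.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra.columns))

# index_col=0 uses the first column as the index instead.
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(without_extra.columns))
```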


#### 5. Rename Id column to userId.

In [34]:
df_users.columns

Index(['userId', 'Reputation', 'Views', 'UpVotes', 'DownVotes'], dtype='object')

In [35]:
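# The CSV export already names this column 'userId', so this rename is a no-op here;
# it only takes effect when the table still has its original 'Id' column.
df_users = df_users.rename(columns = {'Id':'userId'})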
df_users

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5
...,...,...,...,...,...
40320,55743,1,0,0,0
40321,55744,6,1,0,0
40322,55745,101,0,0,0
40323,55746,106,1,0,0
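`rename` silently ignores keys that match no column, so a rename can look like it worked when it did nothing. The sketch below, on a toy frame that still has an `Id` column (an assumption for illustration), shows the rename and how `errors="raise"` exposes a missing column.

```python
import pandas as pd

# Toy frame that still uses the original 'Id' column name.
raw = pd.DataFrame({"Id": [-1, 2, 3], "Reputation": [1, 101, 101]})

renamed = raw.rename(columns={"Id": "userId"})
print(list(renamed.columns))

# 'Id' no longer exists: by default rename does nothing and reports nothing.
unchanged = renamed.rename(columns={"Id": "userId"})

# errors="raise" turns the silent no-op into a KeyError.
try:
    renamed.rename(columns={"Id": "userId"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```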


#### 6. Import the posts table. 

In [36]:
df_posts = pd.read_csv("../../Datasets as a CSV/posts.csv")

#### 7. Rename Id column to postId and OwnerUserId to userId.

In [37]:
df_posts.columns

Index(['Unnamed: 0', 'PostId', 'userId', 'Score', 'ViewCount', 'CommentCount'], dtype='object')
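The columns listed above already contain `PostId` and `userId`, so the CSV export seems to have been renamed beforehand, but `Unnamed: 0` is still present. For a table that still carries the original names, the step could be sketched as below. The toy frame is an assumption, and the later cells in this notebook keep using `PostId` with a capital P.

```python
import pandas as pd

# Toy stand-in for a posts table with the original column names.
raw_posts = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "Id": [1, 2],
        "OwnerUserId": [8.0, 24.0],
        "Score": [23, 22],
    }
)

posts = (
    raw_posts
    .drop(columns="Unnamed: 0")  # redundant index column from the CSV
    .rename(columns={"Id": "postId", "OwnerUserId": "userId"})
)
print(list(posts.columns))
```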

#### 8. Define new dataframes for users and posts with the following selected columns:
**users_sliced columns**: userId, Reputation, Views, UpVotes, DownVotes  
**posts_sliced columns**: postId, Score, userId, ViewCount, CommentCount

In [46]:
column_order = ['userId', 'Reputation', 'Views', 'UpVotes', 'DownVotes']
users_sliced = df_users[column_order]
users_sliced

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5
...,...,...,...,...,...
40320,55743,1,0,0,0
40321,55744,6,1,0,0
40322,55745,101,0,0,0
40323,55746,106,1,0,0


In [47]:
column_order = ['PostId', 'Score', 'userId', 'ViewCount', 'CommentCount']
posts_sliced = df_posts[column_order]
posts_sliced

Unnamed: 0,PostId,Score,userId,ViewCount,CommentCount
0,1,23,8.0,1278.0,1
1,2,22,24.0,8198.0,1
2,3,54,18.0,3613.0,4
3,4,13,23.0,5224.0,2
4,5,81,23.0,,3
...,...,...,...,...,...
91971,115374,2,805.0,,2
91972,115375,0,49365.0,9.0,0
91973,115376,1,55746.0,5.0,2
91974,115377,0,805.0,,0
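When a sliced dataframe will be modified later, taking an explicit `.copy()` makes it independent of the source frame. A minimal sketch on toy data (the frame and the `Extra` column are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"userId": [1, 2], "Reputation": [10, 20], "Extra": ["a", "b"]}
)

wanted = ["userId", "Reputation"]

# Selecting with a list keeps only these columns, in this order;
# .copy() makes the result independent of df.
sliced = df[wanted].copy()

# Changing the slice leaves the original frame untouched.
sliced["Reputation"] = sliced["Reputation"] * 2
print(df["Reputation"].tolist(), sliced["Reputation"].tolist())
```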


#### 9. Merge the two dataframes created in the step above (8), users_sliced and posts_sliced. 
You will need to [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) the posts and users dataframes.

In [59]:
merged = pd.merge(users_sliced, posts_sliced, how = 'inner', on = 'userId')
merged

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,PostId,Score,ViewCount,CommentCount
0,-1,1,0,5007,1920,2175,0,,0
1,-1,1,0,5007,1920,8576,0,,0
2,-1,1,0,5007,1920,8578,0,,0
3,-1,1,0,5007,1920,8981,0,,0
4,-1,1,0,5007,1920,8982,0,,0
...,...,...,...,...,...,...,...,...,...
90579,55734,1,0,0,0,115352,0,16.0,0
90580,55738,11,0,0,0,115360,2,40.0,4
90581,55742,6,0,0,0,115366,1,17.0,0
90582,55744,6,1,0,0,115370,1,13.0,2
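An inner merge keeps only the rows whose `userId` appears in both frames, so some posts or users can drop out. The sketch below uses toy data (an assumption) to show how `indicator=True` reveals which rows match and how `validate` checks that each user appears only once on the left side.

```python
import pandas as pd

users = pd.DataFrame({"userId": [1, 2, 3], "Reputation": [10, 20, 30]})
posts = pd.DataFrame({"PostId": [100, 101, 102, 103], "userId": [1, 1, 2, 99]})

# validate raises MergeError if a userId is duplicated in `users`.
inner = pd.merge(users, posts, how="inner", on="userId", validate="one_to_many")
print(len(inner))

# An outer merge with indicator=True labels each row's origin.
outer = pd.merge(users, posts, how="outer", on="userId", indicator=True)
print(outer["_merge"].value_counts())
```

Here user 3 has no posts and post 103 has no matching user, so both are missing from `inner` and show up as `left_only` and `right_only` in `outer`.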


#### 10. How many missing values do you have in your merged dataframe? On which columns?

In [61]:
missing_values = merged.isnull().sum()
missing_values

userId              0
Reputation          0
Views               0
UpVotes             0
DownVotes           0
PostId              0
Score               0
ViewCount       48396
CommentCount        0
dtype: int64
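Raw counts are easier to judge as a share of all rows. Going by the figures in this notebook (48396 missing values, 42188 rows left after the later `dropna()`), `ViewCount` is missing in roughly 53% of the merged rows. A minimal sketch of how to compute that share, on toy data (an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Score": [1, 2, 3, 4], "ViewCount": [10.0, np.nan, np.nan, 40.0]}
)

missing_count = df.isnull().sum()   # number of NaN per column
missing_share = df.isnull().mean()  # fraction of NaN per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})

# Keep only the columns that actually have missing values.
only_missing = summary[summary["missing"] > 0]
print(only_missing)
```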

#### 11. You will need to do something about the missing values. Will you drop them or fill them? Explain. 
**Remember** to check the results of your code before going to the next step.

In [68]:
merged.corr()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,PostId,Score,ViewCount,CommentCount
userId,1.0,-0.344814,-0.301128,-0.293232,-0.19985,0.704867,-0.233026,-0.181916,-0.032604
Reputation,-0.344814,1.0,0.906704,0.852445,0.558412,-0.076308,0.124807,0.057293,0.04173
Views,-0.301128,0.906704,1.0,0.805355,0.636178,-0.122016,0.128674,0.056347,0.049023
UpVotes,-0.293232,0.852445,0.805355,1.0,0.636087,-0.099961,0.130299,0.046272,0.02882
DownVotes,-0.19985,0.558412,0.636178,0.636087,1.0,-0.079508,0.073268,0.033729,0.002877
PostId,0.704867,-0.076308,-0.122016,-0.099961,-0.079508,1.0,-0.262381,-0.235046,-0.041256
Score,-0.233026,0.124807,0.128674,0.130299,0.073268,-0.262381,1.0,0.532106,0.148255
ViewCount,-0.181916,0.057293,0.056347,0.046272,0.033729,-0.235046,0.532106,1.0,0.044713
CommentCount,-0.032604,0.04173,0.049023,0.02882,0.002877,-0.041256,0.148255,0.044713,1.0


In [69]:
merged = merged.dropna() # Drop every row that contains a NaN

In [70]:
merged

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,PostId,Score,ViewCount,CommentCount
211,5,6792,1145,662,5,6,152,29229.0,5
219,5,6792,1145,662,5,103,28,1990.0,6
221,5,6792,1145,662,5,125,75,29261.0,2
233,5,6792,1145,662,5,423,156,64481.0,7
238,5,6792,1145,662,5,562,10,1005.0,1
...,...,...,...,...,...,...,...,...,...
90579,55734,1,0,0,0,115352,0,16.0,0
90580,55738,11,0,0,0,115360,2,40.0,4
90581,55742,6,0,0,0,115366,1,17.0,0
90582,55744,6,1,0,0,115370,1,13.0,2
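The notebook drops the incomplete rows but does not record why. One possible reasoning (an assumption, not stated in the source) is that with about half of `ViewCount` missing, a filled column would consist largely of invented values. The sketch below compares the options on toy data so the trade-off can be inspected before choosing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"PostId": [1, 2, 3, 4], "ViewCount": [100.0, np.nan, 300.0, np.nan]}
)

# Option 1: drop rows where ViewCount is missing (fewer rows, no invented values).
dropped = df.dropna(subset=["ViewCount"])

# Option 2: fill with the column median (all rows kept, values are estimates).
filled = df.fillna({"ViewCount": df["ViewCount"].median()})

# Option 3: keep the NaN and add a flag column for later filtering.
flagged = df.assign(ViewCountMissing=df["ViewCount"].isna())

print(len(dropped), filled["ViewCount"].tolist())
```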


#### 12. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [72]:
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42188 entries, 211 to 90583
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   userId        42188 non-null  int64  
 1   Reputation    42188 non-null  int64  
 2   Views         42188 non-null  int64  
 3   UpVotes       42188 non-null  int64  
 4   DownVotes     42188 non-null  int64  
 5   PostId        42188 non-null  int64  
 6   Score         42188 non-null  int64  
 7   ViewCount     42188 non-null  float64
 8   CommentCount  42188 non-null  int64  
dtypes: float64(1), int64(8)
memory usage: 3.2 MB


In [73]:
merged.ViewCount = merged.ViewCount.astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value
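The message above is pandas' `SettingWithCopyWarning`. The conversion still takes effect here, as the next `info()` output shows. One way to avoid the warning is to let `astype` return a new dataframe and rebind the name instead of assigning into a column. A sketch on toy data (an assumption), which also shows the nullable `Int64` dtype as a way to get integers without dropping the missing rows first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"PostId": [1, 2, 3], "ViewCount": [1278.0, np.nan, 3613.0]}
)

# astype with a dict converts only the named column and returns a new frame.
clean = df.dropna().astype({"ViewCount": "int64"})
print(clean.dtypes)

# Nullable integer dtype: NaN becomes <NA>, so no rows have to be dropped.
nullable = df.astype({"ViewCount": "Int64"})
print(nullable["ViewCount"].tolist())
```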


In [74]:
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 42188 entries, 211 to 90583
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   userId        42188 non-null  int64
 1   Reputation    42188 non-null  int64
 2   Views         42188 non-null  int64
 3   UpVotes       42188 non-null  int64
 4   DownVotes     42188 non-null  int64
 5   PostId        42188 non-null  int64
 6   Score         42188 non-null  int64
 7   ViewCount     42188 non-null  int64
 8   CommentCount  42188 non-null  int64
dtypes: int64(9)
memory usage: 3.2 MB
