#### 1. Import pandas library

In [1]:
import pandas as pd
import numpy as np

#### 2. Import pymysql and sqlalchemy, as you learned in the lesson on importing/exporting data


In [2]:
import pymysql
import sqlalchemy

#### 3. Create a mysql engine to set the connection to the server. Check the connection details in [this link](https://relational.fit.cvut.cz/search?tableCount%5B%5D=0-10&tableCount%5B%5D=10-30&dataType%5B%5D=Numeric&databaseSize%5B%5D=KB&databaseSize%5B%5D=MB)

In [3]:
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://guest:relational@relational.fit.cvut.cz')

#### 4. Import the users table 

In [4]:
users_df = pd.read_sql_query('SELECT * FROM stats.users', engine)

#### 5. Rename the Id column to UserId

In [5]:
users_df = users_df.rename(columns={'Id':'UserId'})

#### 6. Import the posts table. 

In [6]:
posts_df = pd.read_sql_query('SELECT * FROM stats.posts', engine)

#### 7. Rename the Id column to PostId and OwnerUserId to UserId

In [7]:
posts_df = posts_df.rename(columns={'Id':'PostId','OwnerUserId':'UserId'})

#### 8. Define new dataframes for users and posts with the following selected columns:
    **users columns**: UserId, Reputation, Views, UpVotes, DownVotes
    **posts columns**: PostId, Score, UserId, ViewCount, CommentCount

In [8]:
users_df1 = users_df[['UserId','Reputation','Views','UpVotes','DownVotes']]
posts_df1 = posts_df[['PostId','Score','UserId','ViewCount','CommentCount']]

#### 8. Merge both dataframes, users and posts. 
You will need to [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) the posts and users dataframes.

In [9]:
new_df = pd.merge(users_df1, posts_df1, on='UserId')
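
`pd.merge` performs an inner join by default, so users without posts and posts without an owner are left out of the result. The sketch below uses made-up data with the same column names; the frames `toy_users` and `toy_posts` are hypothetical and only illustrate the `how` and `indicator` parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: user 3 has no posts, and one post has no owner.
toy_users = pd.DataFrame({'UserId': [1, 2, 3], 'Reputation': [10, 20, 30]})
toy_posts = pd.DataFrame({'PostId': [100, 101, 102, 103],
                          'UserId': [1.0, 1.0, 2.0, np.nan]})

# Inner join (the default): only UserIds present in both frames survive.
inner = pd.merge(toy_users, toy_posts, on='UserId')

# Outer join with indicator=True adds a `_merge` column that records
# which side each row came from.
outer = pd.merge(toy_users, toy_posts, on='UserId', how='outer', indicator=True)

print(len(inner), len(outer))
print(outer['_merge'].value_counts())
```

Comparing the two row counts is a quick way to see how many rows an inner join discards.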

#### 9. How many missing values do you have in your merged dataframe? On which columns?

In [10]:
null_cols = new_df.isnull().sum()
null_cols[null_cols > 0]

ViewCount    48396
dtype: int64
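
An absolute count is easier to judge next to the total number of rows. A small sketch, using a hypothetical toy frame rather than the real `new_df`, that reports both the count and the share of missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: half of ViewCount is missing.
toy = pd.DataFrame({'Score': [1, 2, 3, 4],
                    'ViewCount': [10.0, np.nan, np.nan, 40.0]})

missing = pd.DataFrame({
    'n_missing': toy.isnull().sum(),   # absolute count per column
    'share': toy.isnull().mean(),      # fraction of rows per column
})

# Keep only the columns that actually have missing values.
missing = missing[missing['n_missing'] > 0]
print(missing)
```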

#### 10. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before moving on to the next step

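I will drop the column. All the missing values are in ViewCount, and 48396 of the 90584 merged rows (about 53%) have no value there, so filling them would mean inventing more than half of the column.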

In [11]:
drop_cols = list(null_cols[null_cols > 10000].index)
new_df1 = new_df.drop(drop_cols, axis=1)
new_df1

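| | UserId | Reputation | Views | UpVotes | DownVotes | PostId | Score | CommentCount |
|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 1 | 0 | 5007 | 1920 | 2175 | 0 | 0 |
| 1 | -1 | 1 | 0 | 5007 | 1920 | 8576 | 0 | 0 |
| 2 | -1 | 1 | 0 | 5007 | 1920 | 8578 | 0 | 0 |
| 3 | -1 | 1 | 0 | 5007 | 1920 | 8981 | 0 | 0 |
| 4 | -1 | 1 | 0 | 5007 | 1920 | 8982 | 0 | 0 |
| 5 | -1 | 1 | 0 | 5007 | 1920 | 9857 | 0 | 0 |
| 6 | -1 | 1 | 0 | 5007 | 1920 | 9858 | 0 | 0 |
| 7 | -1 | 1 | 0 | 5007 | 1920 | 9860 | 0 | 0 |
| 8 | -1 | 1 | 0 | 5007 | 1920 | 10130 | 0 | 0 |
| 9 | -1 | 1 | 0 | 5007 | 1920 | 10131 | 0 | 0 |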
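
The fixed cutoff of 10000 used above depends on the size of the table. One alternative, sketched here on hypothetical toy data, is to drop columns whose share of missing values exceeds a threshold and fill the remaining gaps. The 0.5 threshold and the median fill are illustrative assumptions, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two differently affected columns.
toy = pd.DataFrame({
    'Score': [1, 2, 3, 4],
    'ViewCount': [10.0, np.nan, np.nan, np.nan],   # 75% missing
    'CommentCount': [0.0, 1.0, np.nan, 3.0],       # 25% missing
})

max_share = 0.5  # assumed threshold, chosen only for illustration

# Drop the columns where more than max_share of the rows are missing.
share_missing = toy.isnull().mean()
to_drop = share_missing[share_missing > max_share].index
cleaned = toy.drop(columns=to_drop)

# Fill what is left with each column's median.
cleaned = cleaned.fillna(cleaned.median())
print(cleaned)
```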


#### 11. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [12]:
users_df.dtypes

UserId                      int64
Reputation                  int64
CreationDate       datetime64[ns]
DisplayName                object
LastAccessDate     datetime64[ns]
WebsiteUrl                 object
Location                   object
AboutMe                    object
Views                       int64
UpVotes                     int64
DownVotes                   int64
AccountId                   int64
Age                       float64
ProfileImageUrl            object
dtype: object

In [13]:
posts_df.dtypes

PostId                            int64
PostTypeId                        int64
AcceptedAnswerId                float64
CreaionDate              datetime64[ns]
Score                             int64
ViewCount                       float64
Body                             object
UserId                          float64
LasActivityDate          datetime64[ns]
Title                            object
Tags                             object
AnswerCount                     float64
CommentCount                      int64
FavoriteCount                   float64
LastEditorUserId                float64
LastEditDate             datetime64[ns]
CommunityOwnedDate       datetime64[ns]
ParentId                        float64
ClosedDate               datetime64[ns]
OwnerDisplayName                 object
LastEditorDisplayName            object
dtype: object

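##### I would change the following types to prevent future issues:

In the users table:

- Age from float64 to Int64.

In the posts table:

- AcceptedAnswerId from float64 to Int64.
- ViewCount from float64 to Int64.
- UserId from float64 to Int64.
- AnswerCount from float64 to Int64.
- FavoriteCount from float64 to Int64.
- LastEditorUserId from float64 to Int64.
- ParentId from float64 to Int64.

These columns hold identifiers, counts, or ages but were read as float64, which usually means they contain missing values. A plain int64 cast fails on NaN, so the conversion targets the nullable integer type Int64 (capital I), assuming the stored values are whole numbers.

In [14]:
# Nullable integer dtype (pandas >= 0.24); unlike int, it can hold missing values
convert_u_dict = {'Age': 'Int64'}
convert_p_dict = {'AcceptedAnswerId': 'Int64',
                  'ViewCount': 'Int64',
                  'UserId': 'Int64',
                  'AnswerCount': 'Int64',
                  'FavoriteCount': 'Int64',
                  'LastEditorUserId': 'Int64',
                  'ParentId': 'Int64'}

In [15]:
# astype returns a new dataframe, so assign the result back
users_df = users_df.astype(convert_u_dict)
# users_df.dtypes

In [16]:
posts_df = posts_df.astype(convert_p_dict)
# posts_df.dtypes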
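
Columns that hold whole numbers plus missing values are read as float64, and casting them to a plain `int` raises an error. A minimal sketch on a hypothetical series, assuming pandas 0.24 or later for the nullable `Int64` dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical column: whole numbers with one missing value.
age = pd.Series([25.0, np.nan, 31.0], name='Age')

# A plain int cast cannot represent NaN and raises an error.
try:
    age.astype(int)
    plain_cast_ok = True
except (ValueError, TypeError):
    plain_cast_ok = False

# The nullable Int64 dtype keeps the gap as a missing value.
age_nullable = age.astype('Int64')

print(plain_cast_ok)
print(age_nullable.dtype)
```

The nullable dtype keeps the values as integers while still marking the gap as missing.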

#### Bonus: Identify extreme values in your merged dataframe, as you learned in class. Create a dataframe called outliers with the same columns as the data set and calculate the bounds. The outliers dataframe will hold the rows of the merged dataframe whose values fall outside those bounds. Save the outliers dataframe to a csv file in the your-code folder.

In [17]:
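low_variance = []

# Flag numeric columns whose 90th percentile equals the minimum,
# i.e. at least 90% of the values are the same.
for col in new_df1.select_dtypes(include='number').columns:
    minimum = new_df1[col].min()
    ninety_perc = np.percentile(new_df1[col], 90)
    if ninety_perc == minimum:
        low_variance.append(col)

print(low_variance)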

[]


In [18]:
stats = new_df1.describe().transpose()
stats['IQR'] = stats['75%'] - stats['25%']
stats

| | count | mean | std | min | 25% | 50% | 75% | max | IQR |
|---|---|---|---|---|---|---|---|---|---|
| UserId | 90584.0 | 16546.764727 | 15273.367108 | -1.0 | 3437.0 | 11032.0 | 27700.0 | 55746.0 | 24263.0 |
| Reputation | 90584.0 | 6282.395412 | 15102.26867 | 1.0 | 60.0 | 396.0 | 4460.0 | 87393.0 | 4400.0 |
| Views | 90584.0 | 1034.245176 | 2880.074012 | 0.0 | 5.0 | 45.0 | 514.25 | 20932.0 | 509.25 |
| UpVotes | 90584.0 | 734.315718 | 2050.869327 | 0.0 | 1.0 | 22.0 | 283.0 | 11442.0 | 282.0 |
| DownVotes | 90584.0 | 33.273249 | 134.936435 | 0.0 | 0.0 | 0.0 | 8.0 | 1920.0 | 8.0 |
| PostId | 90584.0 | 56539.080522 | 33840.307529 | 1.0 | 26051.75 | 57225.5 | 86145.25 | 115378.0 | 60093.5 |
| Score | 90584.0 | 2.780767 | 4.948922 | -19.0 | 1.0 | 2.0 | 3.0 | 192.0 | 2.0 |
| CommentCount | 90584.0 | 1.89465 | 2.638704 | 0.0 | 0.0 | 1.0 | 3.0 | 45.0 | 3.0 |


In [19]:
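frames = []

for col in stats.index:
    # 1.5 * IQR rule: flag values that lie beyond the quartiles by more than the cutoff
    iqr = stats.at[col, 'IQR']
    cutoff = iqr * 1.5
    lower = stats.at[col, '25%'] - cutoff
    upper = stats.at[col, '75%'] + cutoff
    results = new_df1[(new_df1[col] < lower) |
                      (new_df1[col] > upper)].copy()
    results['Outlier'] = col  # record which column triggered the flag
    if not results.empty:
        frames.append(results)

# DataFrame.append was removed in pandas 2.0, so concatenate once instead.
# sort_index(axis=1) keeps the alphabetical column order that append produced.
outliers = pd.concat(frames).sort_index(axis=1)

outliers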

| | CommentCount | DownVotes | Outlier | PostId | Reputation | Score | UpVotes | UserId | Views |
|---|---|---|---|---|---|---|---|---|---|
| 1155 | 0 | 126 | Reputation | 74 | 14082 | 25 | 4235 | 88 | 3320 |
| 1156 | 0 | 126 | Reputation | 94 | 14082 | 5 | 4235 | 88 | 3320 |
| 1157 | 1 | 126 | Reputation | 99 | 14082 | 7 | 4235 | 88 | 3320 |
| 1158 | 3 | 126 | Reputation | 119 | 14082 | 6 | 4235 | 88 | 3320 |
| 1159 | 0 | 126 | Reputation | 140 | 14082 | 7 | 4235 | 88 | 3320 |
| 1160 | 2 | 126 | Reputation | 143 | 14082 | 5 | 4235 | 88 | 3320 |
| 1161 | 1 | 126 | Reputation | 255 | 14082 | 8 | 4235 | 88 | 3320 |
| 1162 | 0 | 126 | Reputation | 265 | 14082 | 14 | 4235 | 88 | 3320 |
| 1163 | 0 | 126 | Reputation | 275 | 14082 | 5 | 4235 | 88 | 3320 |
| 1164 | 1 | 126 | Reputation | 309 | 14082 | 2 | 4235 | 88 | 3320 |
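
The bounds used above are Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. The same rule can also be written without a loop. The sketch below uses a hypothetical toy frame and a hypothetical helper named `iqr_bounds`; it is one possible way to express the rule, not the method required by the exercise.

```python
import pandas as pd

def iqr_bounds(df, k=1.5):
    """Return per-column (lower, upper) bounds using the k * IQR rule."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical toy data: one extreme Score value.
toy = pd.DataFrame({'Score': [1, 2, 2, 3, 3, 4, 50],
                    'CommentCount': [0, 1, 1, 2, 2, 3, 3]})

lower, upper = iqr_bounds(toy)

# Compare every column against its own bounds in one step.
is_outlier = toy.lt(lower, axis=1) | toy.gt(upper, axis=1)

# Keep the rows where at least one column is out of bounds.
toy_outliers = toy[is_outlier.any(axis=1)]
print(toy_outliers)
```

Unlike the loop, this version keeps each row once even if several columns flag it, and it does not record which column triggered the flag.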


In [23]:
outliers.to_csv('outliers.csv')