#### 1. Import pandas library

In [1]:
import pandas as pd

#### 2. Import pymysql and sqlalchemy, as covered in the lesson on importing/exporting data


In [2]:
# import pymysql
# from sqlalchemy import create_engine

#### 3. Create a MySQL engine to connect to the server. Check the connection details at [this link](https://relational.fit.cvut.cz/search?tableCount%5B%5D=0-10&tableCount%5B%5D=10-30&dataType%5B%5D=Numeric&databaseSize%5B%5D=KB&databaseSize%5B%5D=MB)

In [3]:
# username = 'guest'
# password = 'relational'
# hostname = 'relational.fit.cvut.cz'
# port     = '3306'
# database = 'stats'
# 
# engine = create_engine(f'mysql+pymysql://{username}:{password}@{hostname}:{port}/{database}')
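
The f-string above works for the guest credentials, but it would break if a password contained characters such as `@` or `/`. A minimal sketch of building the same URL with the credentials percent-encoded, assuming a hypothetical helper named `build_mysql_url` (it is not part of the notebook or of SQLAlchemy):

```python
from urllib.parse import quote_plus


def build_mysql_url(username, password, hostname, port, database):
    """Hypothetical helper: assemble a mysql+pymysql URL, escaping the credentials."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f'mysql+pymysql://{user}:{pwd}@{hostname}:{port}/{database}'


# Same connection details as in the cell above
url = build_mysql_url('guest', 'relational', 'relational.fit.cvut.cz', '3306', 'stats')

# A made-up password with reserved characters, only to show the escaping
tricky = build_mysql_url('guest', 'p@ss/word', 'relational.fit.cvut.cz', '3306', 'stats')
```

The resulting string could then be passed to `create_engine` in the same way as the f-string in the cell above.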

#### 4. Import the users table 

In [4]:
users = pd.read_csv('./data_sets/users.csv')
users.head()

Unnamed: 0,Id,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5


#### 5. Rename Id column to userId

In [5]:
users = users.rename(columns={'Id':'userId'})
users.head()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes
0,-1,1,0,5007,1920
1,2,101,25,3,0
2,3,101,22,19,0
3,4,101,11,0,0
4,5,6792,1145,662,5


#### 6. Import the posts table. 

In [6]:
posts = pd.read_csv('./data_sets/posts.csv')
posts.head()

Unnamed: 0,Id,OwnerUserId,Score,ViewCount,CommentCount
0,1,8.0,23,1278.0,1
1,2,24.0,22,8198.0,1
2,3,18.0,54,3613.0,4
3,4,23.0,13,5224.0,2
4,5,23.0,81,,3


#### 7. Rename Id column to postId and OwnerUserId to userId

In [7]:
posts = posts.rename(columns={'Id':'postId', 'OwnerUserId':'userId'})
posts.head()

Unnamed: 0,postId,userId,Score,ViewCount,CommentCount
0,1,8.0,23,1278.0,1
1,2,24.0,22,8198.0,1
2,3,18.0,54,3613.0,4
3,4,23.0,13,5224.0,2
4,5,23.0,81,,3


#### 8. Define new dataframes for users and posts with the following selected columns:

**users columns**: userId, Reputation, Views, UpVotes, DownVotes

**posts columns**: postId, Score, userId, ViewCount, CommentCount

In [8]:
# Already done: the CSV files loaded above contain only these columns

#### 8. Merge both dataframes, users and posts. 
You will need to [merge](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) the posts and users dataframes.

In [9]:
df = pd.merge(users, posts, on='userId')
df.head()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,postId,Score,ViewCount,CommentCount
0,-1,1,0,5007,1920,2175,0,,0
1,-1,1,0,5007,1920,8576,0,,0
2,-1,1,0,5007,1920,8578,0,,0
3,-1,1,0,5007,1920,8981,0,,0
4,-1,1,0,5007,1920,8982,0,,0
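
`pd.merge` defaults to an inner join, so users without posts and posts without a matching `userId` are silently dropped. A small sketch on made-up data (the `users_demo` and `posts_demo` frames are illustrative, not the lab data) that uses `how='outer'` with `indicator=True` to see what an inner join would leave out:

```python
import pandas as pd

# Toy frames: user 3 has no posts, and post 103 has no owner
users_demo = pd.DataFrame({'userId': [1, 2, 3], 'Reputation': [10, 20, 30]})
posts_demo = pd.DataFrame({
    'postId': [100, 101, 102, 103],
    'userId': [1.0, 1.0, 2.0, None],
    'Score': [5, 0, 2, 1],
})

# Default inner join: only rows whose userId appears on both sides
inner = pd.merge(users_demo, posts_demo, on='userId')

# Outer join with an indicator column showing where each row came from
outer = pd.merge(users_demo, posts_demo, on='userId', how='outer', indicator=True)
matched = int((outer['_merge'] == 'both').sum())
users_only = int((outer['_merge'] == 'left_only').sum())
posts_only = int((outer['_merge'] == 'right_only').sum())
```

Comparing `len(inner)` with the counts from the outer join is a quick way to check how many rows the merge discarded.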


#### 9. How many missing values do you have in your merged dataframe? On which columns?

In [10]:
null_cols = df.isnull().sum()
null_cols[null_cols > 0]

ViewCount    48396
dtype: int64
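
A raw count is easier to judge next to the share of rows it represents. Here 48,396 of the 90,584 merged rows lack a `ViewCount`, a little over half. A minimal sketch of reporting both numbers together, using a made-up frame named `demo` rather than the lab data:

```python
import numpy as np
import pandas as pd

# Toy frame: three of four ViewCount values are missing
demo = pd.DataFrame({
    'Score': [1, 2, 3, 4],
    'ViewCount': [10.0, np.nan, np.nan, np.nan],
})

null_counts = demo.isnull().sum()    # number of missing values per column
null_share = demo.isnull().mean()    # fraction of missing values per column

summary = pd.DataFrame({'missing': null_counts, 'share': null_share})
summary = summary[summary['missing'] > 0]   # keep only columns with gaps
```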

#### 10. You will need to do something about the missing values. Will you drop them or fill them? Explain.
**Remember** to check the results of your code before moving on to the next step

In [11]:
# We don't have enough information to fill the missing ViewCount values sensibly, so we drop the column

In [12]:
df = df.drop('ViewCount', axis=1)

In [13]:
df.head()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,postId,Score,CommentCount
0,-1,1,0,5007,1920,2175,0,0
1,-1,1,0,5007,1920,8576,0,0
2,-1,1,0,5007,1920,8578,0,0
3,-1,1,0,5007,1920,8981,0,0
4,-1,1,0,5007,1920,8982,0,0
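
Dropping the column is the choice made here. For comparison, a short sketch of the two usual alternatives, dropping only the affected rows or filling with the median, on a made-up frame named `demo`. These alternatives are shown for illustration and are not what the notebook does:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'postId': [1, 2, 3, 4],
    'ViewCount': [100.0, np.nan, 300.0, np.nan],
})

# Option used in the notebook: remove the whole column
dropped_column = demo.drop('ViewCount', axis=1)

# Alternative 1: keep the column, remove rows where it is missing
dropped_rows = demo.dropna(subset=['ViewCount'])

# Alternative 2: fill the gaps with the median of the observed values
filled_median = demo.fillna({'ViewCount': demo['ViewCount'].median()})
```

With more than half of the values missing, as in this dataset, either alternative would discard or invent a large share of the data, which supports dropping the column.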


#### 11. Adjust the data types in order to avoid future issues. Which ones should be changed? 

In [14]:
df.dtypes

userId          int64
Reputation      int64
Views           int64
UpVotes         int64
DownVotes       int64
postId          int64
Score           int64
CommentCount    int64
dtype: object

In [15]:
# We won't perform numeric operations on these columns, so cast them to strings
# https://stats.stackexchange.com/help/whats-reputation
df[['userId', 'postId', 'Reputation']] = df[['userId', 'postId', 'Reputation']].astype(str)

In [16]:
df.dtypes

userId          object
Reputation      object
Views            int64
UpVotes          int64
DownVotes        int64
postId          object
Score            int64
CommentCount     int64
dtype: object
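
One practical effect of the cast is that `describe()` skips the string columns by default, so summary statistics cover only the remaining numeric columns. A small sketch on a made-up frame named `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'userId': [1, 2, 3], 'Score': [5, 0, 2]})

# Cast the identifier to string, as done for userId, postId and Reputation above
demo['userId'] = demo['userId'].astype(str)

# By default describe() summarises numeric columns only
numeric_summary = demo.describe()
```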

#### Bonus: Identify extreme values in your merged dataframe as you have learned in class. Create a dataframe called outliers with the same columns as the data set and calculate the bounds. The outliers dataframe should contain the rows of the merged dataframe whose values fall outside those bounds. Save the outliers dataframe to a csv file in the your-code folder.

In [17]:
stats = df.describe().transpose()
stats['IQR'] = stats['75%'] - stats['25%']
stats

Unnamed: 0,count,mean,std,min,25%,50%,75%,max,IQR
Views,90584.0,1034.245176,2880.074012,0.0,5.0,45.0,514.25,20932.0,509.25
UpVotes,90584.0,734.315718,2050.869327,0.0,1.0,22.0,283.0,11442.0,282.0
DownVotes,90584.0,33.273249,134.936435,0.0,0.0,0.0,8.0,1920.0,8.0
Score,90584.0,2.780767,4.948922,-19.0,1.0,2.0,3.0,192.0,2.0
CommentCount,90584.0,1.89465,2.638704,0.0,0.0,1.0,3.0,45.0,3.0
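
The bounds for this step follow the common 1.5 × IQR rule: anything below Q1 − 1.5·IQR or above Q3 + 1.5·IQR is flagged. A minimal sketch that works through the arithmetic for `Views` using the figures in the table above, and then applies the same rule to a made-up series. The helper name `iqr_bounds` is hypothetical:

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Hypothetical helper: return (lower, upper) bounds using the k * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    cutoff = k * (q3 - q1)
    return q1 - cutoff, q3 + cutoff


# Worked arithmetic for Views, taken from the describe() table:
# Q1 = 5.0, Q3 = 514.25, IQR = 509.25
views_lower = 5.0 - 1.5 * 509.25      # -758.875
views_upper = 514.25 + 1.5 * 509.25   # 1278.125

# Same rule on a toy series with one obvious extreme value
sample = pd.Series([1, 2, 3, 4, 100])
lower, upper = iqr_bounds(sample)
flagged = sample[(sample < lower) | (sample > upper)]
```

Since the lower bound for `Views` is negative and the minimum is 0, only unusually high values can be flagged for that column.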


In [18]:
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
frames = []

for col in stats.index:
    # Bounds follow the 1.5 * IQR rule
    iqr = stats.at[col, 'IQR']
    cutoff = iqr * 1.5
    lower = stats.at[col, '25%'] - cutoff
    upper = stats.at[col, '75%'] + cutoff
    results = df[
        (df[col] < lower) | (df[col] > upper)
    ].copy()
    # Record which column flagged the row
    results['Outlier'] = col
    frames.append(results)

outliers = pd.concat(frames)

In [19]:
outliers.head()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,postId,Score,CommentCount,Outlier
1155,88,14082,3320,4235,126,74,25,0,Views
1156,88,14082,3320,4235,126,94,5,0,Views
1157,88,14082,3320,4235,126,99,7,1,Views
1158,88,14082,3320,4235,126,119,6,3,Views
1159,88,14082,3320,4235,126,140,7,0,Views


In [20]:
outliers.tail()

Unnamed: 0,userId,Reputation,Views,UpVotes,DownVotes,postId,Score,CommentCount,Outlier
90274,55135,13,0,0,0,114129,2,10,CommentCount
90326,55241,6,3,0,0,114339,1,8,CommentCount
90346,55302,13,0,0,0,114859,2,8,CommentCount
90408,55420,11,1,0,0,114719,2,9,CommentCount
90491,55557,6,3,0,0,115020,1,9,CommentCount


In [21]:
outliers.to_csv('./outliers.csv', index=False)
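
Two things are worth keeping in mind about the saved file: `index=False` discards the original row labels, and the string dtypes set in step 11 are not stored in the CSV, so `read_csv` will infer integers again unless told otherwise. A small round-trip sketch using an in-memory buffer and a made-up frame named `demo` instead of the real file:

```python
import io

import pandas as pd

demo = pd.DataFrame({
    'userId': ['88', '55135'],            # stored as strings, as in step 11
    'Outlier': ['Views', 'CommentCount'],
})

# Write to an in-memory buffer instead of ./outliers.csv
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default read: userId is inferred as an integer column
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Passing dtype keeps userId as strings
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={'userId': str})
```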