<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Functions" data-toc-modified-id="Functions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Functions</a></span><ul class="toc-item"><li><span><a href="#Dataset-description:" data-toc-modified-id="Dataset-description:-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Dataset description:</a></span></li></ul></li><li><span><a href="#Data-cleaning" data-toc-modified-id="Data-cleaning-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data cleaning</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Kristina:" data-toc-modified-id="Kristina:-2.0.1"><span class="toc-item-num">2.0.1&nbsp;&nbsp;</span>Kristina:</a></span><ul class="toc-item"><li><span><a href="#Informational" data-toc-modified-id="Informational-2.0.1.1"><span class="toc-item-num">2.0.1.1&nbsp;&nbsp;</span>Informational</a></span></li><li><span><a href="#Exit-Rates" data-toc-modified-id="Exit-Rates-2.0.1.2"><span class="toc-item-num">2.0.1.2&nbsp;&nbsp;</span>Exit Rates</a></span></li><li><span><a href="#Browser" data-toc-modified-id="Browser-2.0.1.3"><span class="toc-item-num">2.0.1.3&nbsp;&nbsp;</span>Browser</a></span></li><li><span><a href="#Revenue" data-toc-modified-id="Revenue-2.0.1.4"><span class="toc-item-num">2.0.1.4&nbsp;&nbsp;</span>Revenue</a></span></li></ul></li></ul></li></ul></li></ul></div>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression


# Functions

In [None]:
## Function to explain NA values in a column:

def NA_values(series):
    '''
    Takes a pandas Series and prints the column name, the total number of rows,
    the number of NA values and the percentage of NA values in the column.
    '''
    # len() counts every row; series.count() would leave the NAs out of the total
    print(f'Column name: {series.name}\nTotal values: {len(series)}\nNA values: {series.isna().sum()}\n% of NA values: {round(series.isna().mean() * 100, 2)}%')
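
To review every column at once rather than one Series at a time, the same counts can be collected in a single table. This is a minimal sketch on a made-up frame; `toy` and `na_summary` are illustrative names, not part of the project.

```python
import pandas as pd

# Hypothetical frame with a few missing values (not the project dataset)
toy = pd.DataFrame({"a": [1.0, None, 3.0, 4.0], "b": ["x", "y", None, None]})

# One row per column: absolute number of NAs and their share in percent
na_summary = pd.DataFrame({
    "na_count": toy.isna().sum(),
    "na_percent": (toy.isna().mean() * 100).round(2),
})
print(na_summary)
```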


In [None]:
## Function to get the information about the outliers of a column:

def iqr(dataset, series):
    """
    Takes a dataset and one of its columns and prints the 1.5 * IQR outlier
    bounds, the number of rows outside them and their share of the dataset.
    """
    # Series.quantile skips NaN, whereas np.percentile would return NaN
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    total_outliers = len(dataset.loc[(series > upper) | (series < lower)])
    percent_outliers = total_outliers / len(dataset) * 100

    print(f'Column: {series.name}\nLower outliers: all values lower than {round(lower, 3)}\nUpper outliers: all values higher than {round(upper, 3)}\nTotal number of rows with outliers: {total_outliers}\n% of outliers: {round(percent_outliers, 2)}%')
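
As a sanity check of the 1.5 * IQR rule used above, here is a small worked example on a made-up Series. The values and the name `toy` are illustrative only and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="toy")

q1 = toy.quantile(0.25)   # 3.0
q3 = toy.quantile(0.75)   # 7.0
iqr_width = q3 - q1       # 4.0

lower_fence = q1 - 1.5 * iqr_width   # -3.0
upper_fence = q3 + 1.5 * iqr_width   # 13.0

# Everything outside the two fences is flagged as an outlier
outlier_mask = (toy < lower_fence) | (toy > upper_fence)
print(toy[outlier_mask].tolist())    # [100]
```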

In [None]:
data = pd.read_csv('data/online_shoppers_intention_DATAPTDIC19.csv', sep=',')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.describe()

In [None]:
data.info()

## Dataset description:
 - Administrative: `float`. Administrative value. `yet to identify`.
 - Administrative_Duration: `object`. Duration on administrative pages. `Identify values and change dtype accordingly.`
 - Informational: `float`. Informational value. `yet to identify`
 - Informational_Duration: `object`. Duration on informational pages. `Identify values and change dtype accordingly.`
 - ProductRelated: `float`. Product-related value. `yet to identify`
 - ProductRelated_Duration: `object`. Duration on product-related pages. `Identify values and change dtype accordingly.`
 - BounceRates: `float`. Bounce rate of a web page, as a percentage: the share of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session.
 - ExitRates: `float`. Exit rate of a web page: of all pageviews of the page, the percentage that were the last in the session.
 - PageValues: `object`. Page value of each web page: the average value of a web page that a user visited before completing an e-commerce transaction. `Identify values and change dtype accordingly.`
 - SpecialDay: `float`. Closeness of the visit to a special day (Valentine's Day, etc.). `dtype correct`. For example, for Valentine's Day this value is nonzero between February 2 and February 12, zero before and after these dates unless the visit is close to another special day, and reaches its maximum of 1 on February 8.
 - Month: `object`. String identifying the month of the year. `clean`.
 - OperatingSystems: `float`. Operating system used. `Try to explain the values`.
 - Browser: `integer`. Browser used.
 - Region: `integer`. Region of the user.
 - TrafficType: `integer`. Traffic type.
 - VisitorType: `object`. Type of visitor.
 - Weekend: `bool`. Whether the session took place on a weekend.
 - Revenue: `object`. Whether the session generated revenue. `Should be bool`
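
For the `object` columns flagged above with "Identify values and change dtype accordingly", one possible approach is `pd.to_numeric` with `errors="coerce"`, listing whatever failed to parse before deciding how to handle it. This is a sketch under assumptions: the non-numeric token `"unknown"` is invented for illustration, and the real offending values in the dataset still have to be identified.

```python
import pandas as pd

# Hypothetical sample; the actual non-numeric tokens in the dataset are not known here
sample = pd.DataFrame({"Administrative_Duration": ["12.5", "0", "unknown", None, "80"]})

# errors="coerce" turns every value that cannot be parsed into NaN
converted = pd.to_numeric(sample["Administrative_Duration"], errors="coerce")

# Raw values that were present but failed to parse, for manual inspection
failed = converted.isna() & sample["Administrative_Duration"].notna()
unparsed = sample.loc[failed, "Administrative_Duration"]

print(converted.dtype)      # float64
print(unparsed.tolist())    # ['unknown']
```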
     

# Data cleaning

### Kristina:

 - **Informational**
  - Data type: categorical, float. No changes
  - Missing values: there are 14 missing values (0.11% of the data). They will be filled with the median
  - 78.7% of sessions fall in category 0.0 of the Informational pages
  - No other changes are needed
 - **ExitRates**
  - Data type: numerical, float. No changes
  - Missing values: there are 14 missing values (0.11% of the data). They will be filled with the median
  - Outliers: there are 1094 outliers (8.87% of the data). Most of them fall under the FALSE revenue category and have an exit rate of 0.2. An additional boolean column, `exitrates_outliers`, will be created so the outliers can be filtered out if needed.
 - **Browser**
  - Data type: categorical, integer. No changes
  - Missing values: there are no missing values
  - The most popular browser is 2. The usage shares are very similar to the standard usage shares of all browsers. More insights on this will be shown later, in the data visualization
  - No other changes are needed
 - **Revenue**
  - Data type: categorical, object. Will be changed to boolean
  - There are 4 categories; the data will be unified to contain only TRUE and FALSE
  - 84.53% of the data falls under the FALSE category of Revenue. Since it is the target column, the classes will need to be balanced.
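
One hedged way to balance the target is random undersampling of the majority class with `DataFrameGroupBy.sample`, sketched here on a made-up frame (`toy` and its `feature` column are hypothetical). It is only one option and not necessarily the method used later in this project; passing `class_weight='balanced'` to scikit-learn's `LogisticRegression` is another.

```python
import pandas as pd

# Hypothetical imbalanced frame: 3 TRUE sessions against 17 FALSE sessions
toy = pd.DataFrame({"feature": range(20), "Revenue": [True] * 3 + [False] * 17})

# Size of the smaller class
minority_size = toy["Revenue"].value_counts().min()

# Draw the same number of rows from each class (fixed seed for reproducibility)
balanced = toy.groupby("Revenue").sample(n=minority_size, random_state=0)
print(balanced["Revenue"].value_counts())
```

Undersampling discards rows of the majority class, so it trades data volume for balance.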

#### Informational

In [None]:
# Missing values

NA_values(data.Informational)

# Filling missing values

data.Informational = data.Informational.fillna(data.Informational.median())

# Distribution

print(data.Informational.value_counts())

plt.style.use('seaborn')
fig, ax = plt.subplots(1, figsize=(8,6))
ax1 = data.Informational.hist()
plt.title("Informational - distribution")
plt.show()

# % of sessions in category 0.0 

percent_info_0 = data.Informational.value_counts()[0]/len(data)*100  
print(f'% of sessions in the category 0.0 of the Informational pages: {round(percent_info_0,1)}')

#### Exit Rates

In [None]:
# Missing values

NA_values(data.ExitRates)

# Filling missing values

data.ExitRates = data.ExitRates.fillna(data.ExitRates.median())

# Distribution

fig, ax = plt.subplots(1, figsize=(8,6))
plt.style.use('seaborn')
ax1 = data.ExitRates.hist()
plt.title("ExitRates - distribution")
plt.show()

# Outliers

fig, ax = plt.subplots(1, figsize=(8,6))
plt.style.use('seaborn')
ax1 = data.boxplot('ExitRates')
plt.title("ExitRates - outliers")
plt.show()

iqr(data, data.ExitRates)

# Checking the distribution of outliers regarding the target column - Revenue

exitrates_outliers = data.loc[(data.ExitRates > 0.104) | (data.ExitRates < -0.039)]
print(exitrates_outliers.Revenue.value_counts())

# Checking the top exit rate values of the outliers:

print(exitrates_outliers.ExitRates.value_counts())

# Creating a new boolean column in the dataset to indicate exit rate outliers.
# Only the upper IQR bound is used because the lower bound is negative and there are no values lower than 0.

data['exitrates_outliers'] = data['ExitRates'] > 0.104
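
Once the boolean flag exists, the relation between outliers and the target can also be read from a single table with `pd.crosstab`. This is a sketch on made-up values; `toy` is hypothetical and only mirrors the two column names used in this notebook.

```python
import pandas as pd

# Hypothetical values, only to show the shape of the result
toy = pd.DataFrame({
    "exitrates_outliers": [True, True, True, True, False, False],
    "Revenue":            [False, False, False, True, True, False],
})

# normalize="index" gives, for each outlier flag, the share of each Revenue value
table = pd.crosstab(toy["exitrates_outliers"], toy["Revenue"], normalize="index")
print(table)
```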

#### Browser

In [None]:
# Missing values

NA_values(data.Browser)

# Distribution

print(data.Browser.value_counts(normalize = True)*100) 

fig, ax = plt.subplots(1, figsize=(8,6))
plt.style.use('seaborn')
ax1 = data.Browser.hist()
plt.title("Browser - distribution")
plt.show()

# Creating a dataset of the standard usage share of all browsers and of mobile browsers

usage_share_browsers = pd.DataFrame({'Chrome': 64.92, 'Safari': 15.97, 'Firefox': 4.33, 'Samsung_Internet': 3.29, 
                                     'UC': 2.94, 'Opera': 2.34, 'Edge': 2.05, 'IE': 1.98, 'AOSP': 0.59, 'Others': 1.59}, 
                                    index = [0]).T
colnames = ['standard_usage_all']
usage_share_browsers.columns = colnames
print(usage_share_browsers)

usage_share_browsers_mob = pd.DataFrame({'Chrome': 63.80, 'Safari': 19.70, 'Firefox': 0.35, 'Samsung_Internet': 6.27, 
                                     'UC': 5.33, 'Opera': 2.48, 'Others': 2.07}, 
                                    index = [0]).T
colnames = ['standard_usage_mobile']
usage_share_browsers_mob.columns = colnames
print(usage_share_browsers_mob)

#### Revenue

In [None]:
# Missing values

NA_values(data.Revenue)

# Distribution

print(data.Revenue.value_counts())

fig, ax = plt.subplots(1, figsize=(8,6))
plt.style.use('seaborn')
ax1 = data.Revenue.hist()
plt.title("Revenue - distribution")
plt.show()

# Changing 0 and 1 to FALSE and TRUE
# (assigning the result back instead of calling replace(inplace=True) on an attribute)

data['Revenue'] = data['Revenue'].replace({'0': 'FALSE', '1': 'TRUE'})

# Converting column to boolean

mapa = {'TRUE': True, 'FALSE': False}
data['Revenue'] = data['Revenue'].map(mapa)

# Looking up the share by the label False rather than by position 0
print(f' {round(data.Revenue.value_counts(normalize = True)[False]*100, 2)}% of data falls under FALSE category of Revenue')



### Pau:

 - **ProductRelated**
 - **SpecialDay**
 - **TrafficType**


In [None]:
# ProductRelated column

# 1.09% NA --> filling method: median?:

NA_values(data.ProductRelated) 

data.ProductRelated = data.ProductRelated.fillna(data.ProductRelated.median()) 

# Outliers:

# dtype transformation: float to int


In [None]:
# SpecialDay column

# 0.0% NA --> no filling needed:

NA_values(data.SpecialDay) 

# Outliers:

# dtype correct (float)


In [None]:
# TrafficType column

# 0.97% NA --> filling method: median?:

NA_values(data.TrafficType) 

data.TrafficType = data.TrafficType.fillna(data.TrafficType.median())

# Outliers:

# dtype transformation: float to int
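
The pending float-to-int transformation could look like the sketch below, shown on a made-up Series (`traffic` is a hypothetical stand-in for `data.TrafficType`). The median of an even number of values can end in .5, so the sketch checks that every value is whole before casting.

```python
import pandas as pd

# Hypothetical stand-in for data.TrafficType with one missing value
traffic = pd.Series([1.0, 2.0, None, 2.0, 13.0], name="TrafficType")

# Fill the NA with the median, as in the cells above
filled = traffic.fillna(traffic.median())

# astype(int) truncates fractions, so cast only when no fractional part exists
if (filled % 1 == 0).all():
    as_int = filled.astype(int)
else:
    raise ValueError("fractional values present; decide how to round before casting")

print(as_int.tolist())   # [1, 2, 2, 2, 13]
```

For a categorical code such as TrafficType, filling with the mode (`Series.mode()`) may be worth considering instead of the median; which fits better depends on what the codes mean.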