<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Functions" data-toc-modified-id="Functions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Functions</a></span></li><li><span><a href="#Import-and-Analysis" data-toc-modified-id="Import-and-Analysis-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import and Analysis</a></span><ul class="toc-item"><li><span><a href="#Dataset-description:" data-toc-modified-id="Dataset-description:-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Dataset description:</a></span></li></ul></li><li><span><a href="#Data-cleaning" data-toc-modified-id="Data-cleaning-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data cleaning</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Sosa:" data-toc-modified-id="Sosa:-3.0.1"><span class="toc-item-num">3.0.1&nbsp;&nbsp;</span>Sosa:</a></span></li></ul></li></ul></li></ul></div>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression


# Functions

In [None]:
## Function to explain NA values in a column:

def NA_values(series):
    '''
    Takes a Pandas Series and prints its non-NA count, its NA count and the % of NAs in the column.
    '''
    
    # series.count() only counts non-NA entries, so it is labelled accordingly.
    n_missing = series.isna().sum()
    pct_missing = round(series.isna().mean() * 100, 2)
    print(f'Column name: {series.name}\nNon-NA values: {int(series.count())}\nNA values: {n_missing}\n% of NA values: {pct_missing}%')


In [None]:
## Function to find the outliers of a column:

def iqr(dataset, series):
    """
    Takes a dataset and one of its columns and prints the IQR fences and the number of outlier rows.
    """
    # Series.quantile skips NaN; np.percentile would return nan if any value is missing.
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    total_outliers = len(dataset.loc[(series > upper) | (series < lower)])
    percent_outliers = total_outliers / len(dataset) * 100
    
    print(f'Column: {series.name}\nLower outliers: all values lower than {round(lower, 3)}\nUpper outliers: all values higher than {round(upper, 3)}\nTotal number of rows with outliers: {total_outliers}\n% of outliers: {round(percent_outliers, 2)}%')
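
As a quick check of the 1.5 × IQR rule used in `iqr`, here is a minimal sketch on a toy Series. The values and the names `toy`, `lower_fence` and `upper_fence` are made up for illustration and do not come from the dataset. It uses `Series.quantile`, which skips missing values, whereas `np.percentile` returns `nan` when the input contains one.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the shoppers dataset).
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="toy")

# Quartiles and interquartile range.
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr_width = q3 - q1

# Anything outside these fences counts as an outlier.
lower_fence = q1 - 1.5 * iqr_width
upper_fence = q3 + 1.5 * iqr_width

is_outlier = (toy < lower_fence) | (toy > upper_fence)
outlier_values = toy[is_outlier].tolist()

print(f"fences: [{lower_fence}, {upper_fence}], outliers: {outlier_values}")
```

Here the quartiles are 3 and 7, so the fences are -3 and 13 and only the value 100 falls outside them.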

# Import and Analysis

In [None]:
data = pd.read_csv('data/online_shoppers_intention_DATAPTDIC19.csv', sep=',', index_col=0)

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.describe()

In [None]:
data.info()

## Dataset description:
 - Administrative: `float`. Administrative value. `yet to identify`.
 - Administrative_Duration: `object`. Duration on Administrative pages. `Identify values and change dtype accordingly`.
 - Informational: `float`. Informational value. `yet to identify`.
 - Informational_Duration: `object`. Duration on Informational pages. `Identify values and change dtype accordingly`.
 - ProductRelated: `float`. Product Related value. `yet to identify`.
 - ProductRelated_Duration: `object`. Duration on Product Related pages. `Identify values and change dtype accordingly`.
 - BounceRates: `float`. Bounce rate of a web page, as a percentage. The "Bounce Rate" of a page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other requests to the analytics server during that session.
 - ExitRates: `float`. Exit rate of a web page. The "Exit Rate" of a page is, across all pageviews of that page, the percentage that were the last in the session.
 - PageValues: `object`. Page value of each web page. The "Page Value" feature represents the average value of a web page that a user visited before completing an e-commerce transaction. `Identify values and change dtype accordingly`.
 - SpecialDay: `float`. Closeness of the visit to a special day (Valentine's Day, etc.). `dtype correct`. For example, for Valentine's Day this value is nonzero between February 2 and February 12, reaches its maximum of 1 on February 8, and is zero before and after that window unless the date is close to another special day.
 - Month: `object`. String identifying the month of the year. `clean`.
 - OperatingSystems: `float`. Operating system used. `Try to explain the values`.
 - Browser: `integer`. Browser used.
 - Region: `integer`. Region of the user.
 - TrafficType: `integer`. Traffic type.
 - VisitorType: `object`. Type of visitor.
 - Weekend: `bool`. Whether the visit took place on a weekend.
 - Revenue: `object`. Whether the visit generated revenue. `Should be bool`.
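
Several columns above are flagged as `object` and need a dtype change. The sketch below shows one possible way to do that, assuming the offending entries are non-numeric strings and that `Revenue` holds the strings `'True'` and `'False'`. The toy values are hypothetical; the real contents have to be inspected first.

```python
import pandas as pd

# Toy frame with made-up values; the real column contents are an assumption.
toy = pd.DataFrame({
    "PageValues": ["0.0", "12.5", "#Nan", "3"],
    "Revenue": ["True", "False", "True", "False"],
})

# Strings that cannot be parsed as numbers become NaN instead of raising.
toy["PageValues"] = pd.to_numeric(toy["PageValues"], errors="coerce")

# Map the string labels explicitly, then cast to bool.
# (A label missing from the mapping would become NaN, so check for NaN before casting.)
revenue = toy["Revenue"].map({"True": True, "False": False})
assert revenue.notna().all()
toy["Revenue"] = revenue.astype(bool)

print(toy.dtypes)
```

Coercing first and counting the resulting NaNs is a cheap way to see how many entries were not valid numbers before deciding how to fill them.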
     

# Data cleaning

### Sosa:

    - Administrative:
        Due to the low number of missing values (0.28%), we'll fill them with the median, since the
        impact won't be noticeable. We can't use the mean because each value represents a category, so
        a filled value must be one that already exists in the column.
        The distribution of the values looks logarithmic and is heavily skewed, with most values at
        the low end.
        There are 100 values equal to 999. These are obvious errors, spread evenly through the column.
        To fix this, we'll turn the 999 values into NaNs and fill them with a forward fill (ffill), in
        order to preserve the actual distribution of the values.
        
    - ProductRelated_Duration:
        This column contains negative values. The time a person spends on a web page cannot be
        negative, so we assume there was an error when the value was read. We'll replace them with 0.
        There are just two pronounced outliers.
        Additionally, we'll assign the NAs (0.14% of the values) the mean of the column, since our
        continuous numeric values are evenly distributed.
        
        
    - Month:
        There are no NA values, but some months are labelled with inconsistent strings. We'll fix
        these with regex. No further transformation is needed.
        
    - VisitorType:
        We have four types of visitors: Returning, New, Other and More. In this case, we'll reduce the
        groups to three types (Returning, New and Other) by merging the More values into Other.
        For the NA values, we have no way to know if the visitor is Returning or New, so we'll also
        group them with the 'Other' values.
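
Two of the rules above (the 999 error codes and the negative durations) can be sketched on a toy frame as follows. The numbers are made up, and `clip(lower=0)` is just one way to replace negatives with 0; it is an assumption, not necessarily the approach used on the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values.
toy = pd.DataFrame({
    "Administrative": [0.0, 999.0, 2.0, 1.0, 999.0],
    "ProductRelated_Duration": [10.0, -1.0, 35.5, np.nan, 20.0],
})

# Treat 999 as missing in this column only, then carry the previous valid value forward.
toy["Administrative"] = toy["Administrative"].replace(999, np.nan).ffill()

# Durations cannot be negative, so floor them at 0. NaN is left untouched.
toy["ProductRelated_Duration"] = toy["ProductRelated_Duration"].clip(lower=0)

print(toy)
```

A forward fill has nothing to copy from if the very first row holds a 999, so that case would still be NaN afterwards and needs a separate check.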
    
Missing values
Outliers
Errors
Transformation

In [None]:
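## Administrative column

NA_values(data.Administrative)

# Fill the few missing values with the median.
data['Administrative'] = data['Administrative'].fillna(data['Administrative'].median())

# Turn the 999 error codes into NaN in this column only
# (data[mask] = None would blank every column of those rows).
data.loc[data['Administrative'] == 999, 'Administrative'] = np.nan

# Forward-fill the former 999s.
data['Administrative'] = data['Administrative'].ffill()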



In [None]:
## ProductRelated_Duration column

NA_values(data.ProductRelated_Duration)

# Fixing outliers: replace the two largest values with the column mean.
# Assigning through .loc changes data itself; assigning to the result of query() does not.

outliers = data.ProductRelated_Duration.sort_values(ascending=False)[:2]

data.loc[outliers.index, 'ProductRelated_Duration'] = data.ProductRelated_Duration.mean()

# Fixing the NA values:

data.ProductRelated_Duration = data.ProductRelated_Duration.fillna(data.ProductRelated_Duration.mean())
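
One way to overwrite the two largest values in place is to select them by index label and assign through `.loc`. The following is a minimal sketch on made-up numbers; `top_two` and `replacement` are illustrative names.

```python
import pandas as pd

# Toy column with two obviously extreme values (hypothetical numbers).
toy = pd.DataFrame({"ProductRelated_Duration": [10.0, 20.0, 30.0, 5000.0, 9000.0]})

# Index labels of the two largest values.
top_two = toy["ProductRelated_Duration"].nlargest(2).index

# Mean computed before the replacement, so it still includes the extremes.
replacement = toy["ProductRelated_Duration"].mean()

# .loc with row labels and a column name writes into the original frame.
toy.loc[top_two, "ProductRelated_Duration"] = replacement

print(toy)
```

Because the mean is taken before the replacement, it is pulled upwards by the very values being replaced (2812 in this toy case).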




In [None]:
## Month column

NA_values(data.Month)

# Fixing strings:

data.Month = data.Month.str.replace('MAY', 'May').str.replace('March', 'Mar')

data.Month.value_counts()
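
For reference, the same month fix can be written with explicit, anchored regular expressions. This is a sketch on a toy Series with hypothetical labels. `regex=True` is passed explicitly because the default of `Series.str.replace` changed to `regex=False` in pandas 2.0.

```python
import pandas as pd

# Toy month labels (hypothetical mix of clean and inconsistent strings).
months = pd.Series(["May", "MAY", "March", "Mar", "Nov"])

# ^ and $ anchor the pattern to the whole string, so only exact labels are rewritten.
fixed = (
    months
    .str.replace(r"^MAY$", "May", regex=True)
    .str.replace(r"^March$", "Mar", regex=True)
)

print(fixed.tolist())
```

Anchoring matters when a label could appear inside a longer string; an unanchored pattern would rewrite that substring as well.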

In [None]:
## VisitorType column

NA_values(data.VisitorType)

data.VisitorType = data.VisitorType.str.replace('More', 'Other').fillna('Other')

data.VisitorType.value_counts()
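
As an alternative sketch, `Series.replace` with a dictionary matches whole values rather than substrings, which avoids touching any label that merely contains the text "More". The labels below are hypothetical stand-ins for the real categories.

```python
import numpy as np
import pandas as pd

# Toy visitor labels (assumed names, not checked against the dataset).
visitors = pd.Series(["Returning_Visitor", "New_Visitor", "More", np.nan, "Other"])

# Exact-value replacement, then group the missing entries with 'Other'.
cleaned = visitors.replace({"More": "Other"}).fillna("Other")

print(cleaned.value_counts())
```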