<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Functions" data-toc-modified-id="Functions-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Functions</a></span></li><li><span><a href="#Import-and-Analysis" data-toc-modified-id="Import-and-Analysis-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Import and Analysis</a></span><ul class="toc-item"><li><span><a href="#Dataset-description:" data-toc-modified-id="Dataset-description:-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Dataset description:</a></span></li></ul></li><li><span><a href="#Data-cleaning" data-toc-modified-id="Data-cleaning-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data cleaning</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Sosa:" data-toc-modified-id="Sosa:-3.0.1"><span class="toc-item-num">3.0.1&nbsp;&nbsp;</span>Sosa:</a></span></li></ul></li></ul></li></ul></div>

In [129]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression


# Functions

In [117]:
## Function to explain NA values in a column:

def NA_values(series):
    '''
    Print the non-NA count, the NA count and the NA percentage of a pandas Series.
    '''
    # series.count() excludes NA values, so "Total values" is the non-NA count
    print(
        f'Column name: {series.name}\n'
        f'Total values: {int(series.count())}\n'
        f'NA values: {series.isna().sum()}\n'
        f'% of NA values: {round(series.isna().mean() * 100, 2)}%'
    )
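
To see the same information for every column at once, a small helper can return a table instead of printing one column at a time. This is only a sketch: `na_summary` and the `toy` frame are hypothetical names used for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def na_summary(df):
    """Return one row per column: non-NA count, NA count and NA percentage."""
    return pd.DataFrame({
        'non_na': df.count(),
        'na': df.isna().sum(),
        'na_pct': (df.isna().mean() * 100).round(2),
    })

# Toy frame standing in for `data` (values are made up for illustration)
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0], 'b': ['x', 'y', None, None]})
summary = na_summary(toy)
print(summary)
```

Calling `na_summary(data)` on the real DataFrame would give the per-column counts in a single view, which makes it easier to decide which columns need attention first.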


# Import and Analysis

In [118]:
data = pd.read_csv('data/online_shoppers_intention_DATAPTDIC19.csv', sep=';')

In [119]:
data.head()

Unnamed: 0,Administrative,Administrative_Duration,Informational,Informational_Duration,ProductRelated,ProductRelated_Duration,BounceRates,ExitRates,PageValues,SpecialDay,Month,OperatingSystems,Browser,Region,TrafficType,VisitorType,Weekend,Revenue
0,0.0,0,0.0,0,1.0,0,0.2,0.2,0,0.0,Feb,1.0,1,1,1,Returning_Visitor,False,False
1,0.0,0,0.0,0,2.0,64,0.0,0.1,0,0.0,Feb,2.0,2,1,2,Returning_Visitor,False,False
2,0.0,-1,0.0,-1,1.0,-1,0.2,0.2,0,0.0,Feb,4.0,1,9,3,Returning_Visitor,False,False
3,0.0,0,0.0,0,2.0,2.666.666.667,0.05,0.14,0,0.0,Feb,3.0,2,2,4,Returning_Visitor,False,False
4,0.0,0,0.0,0,10.0,627.5,0.02,0.05,0,0.0,Feb,3.0,3,1,4,Returning_Visitor,True,False


In [120]:
data.shape

(12330, 18)

In [121]:
data.describe()

Unnamed: 0,Administrative,Informational,ProductRelated,BounceRates,ExitRates,SpecialDay,OperatingSystems,Browser,Region,TrafficType
count,12315.0,12316.0,12315.0,12316.0,12316.0,12329.0,12329.0,12330.0,12330.0,12330.0
mean,2.317824,0.503979,31.765246,0.022152,0.043003,0.061432,2.273096,2.357097,3.147364,4.069586
std,3.322888,1.270701,44.491889,0.048427,0.048527,0.198925,3.907923,1.717277,2.401591,4.025169
min,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0
25%,0.0,0.0,7.0,0.0,0.014286,0.0,2.0,2.0,1.0,2.0
50%,1.0,0.0,18.0,0.003119,0.025124,0.0,2.0,2.0,3.0,2.0
75%,4.0,0.0,38.0,0.016684,0.05,0.0,3.0,2.0,4.0,4.0
max,27.0,24.0,705.0,0.2,0.2,1.0,99.0,13.0,9.0,20.0


In [122]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
Administrative             12315 non-null float64
Administrative_Duration    12316 non-null object
Informational              12316 non-null float64
Informational_Duration     12316 non-null object
ProductRelated             12315 non-null float64
ProductRelated_Duration    12316 non-null object
BounceRates                12316 non-null float64
ExitRates                  12316 non-null float64
PageValues                 12330 non-null object
SpecialDay                 12329 non-null float64
Month                      12325 non-null object
OperatingSystems           12329 non-null float64
Browser                    12330 non-null int64
Region                     12330 non-null int64
TrafficType                12330 non-null int64
VisitorType                12327 non-null object
Weekend                    12330 non-null bool
Revenue                    12330 non-null object
dtypes:

## Dataset description:
 - Administrative: `float`. Administrative value. `yet to identify`.
 - Administrative_Duration: `object`. Duration on administrative pages. `Identify values and change dtype accordingly.`
 - Informational: `float`. Informational value. `yet to identify`
 - Informational_Duration: `object`. Duration on informational pages. `Identify values and change dtype accordingly.`
 - ProductRelated: `float`. Product-related value. `yet to identify`
 - ProductRelated_Duration: `object`. Duration on product-related pages. `Identify values and change dtype accordingly.`
 - BounceRates: `float`. Bounce rate of a web page: the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other request to the analytics server during that session.
 - ExitRates: `float`. Exit rate of a web page: of all pageviews of the page, the percentage that were the last in the session.
 - PageValues: `object`. Page value of each web page: the average value of a web page that a user visited before completing an e-commerce transaction. `Identify values and change dtype accordingly.`
 - SpecialDay: `float`. Closeness of the visit to a special day (Valentine's Day, etc.). `dtype correct`. For example, for Valentine's Day this value is nonzero between February 2 and February 12, zero before and after those dates unless the visit is close to another special day, and reaches its maximum of 1 on February 8.
 - Month: `object`. String identifying the month of the year. `clean`.
 - OperatingSystems: `float`. Operating system used. `Try to explain the values`.
 - Browser: `integer`. Browser used.
 - Region: `integer`. Region of the user.
 - TrafficType: `integer`. Traffic type.
 - VisitorType: `object`. Type of visitor.
 - Weekend: `bool`. Whether the visit took place on a weekend.
 - Revenue: `object`. Whether the session generated revenue. `Should be bool`.
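
The dtype notes above ("change dtype accordingly", "Should be bool") could be approached roughly as follows. This is a sketch on a toy frame, not the notebook's actual cleaning: it assumes the `object` columns hold strings, and in particular that `Revenue` holds the strings `'True'` and `'False'`, which has not been verified here.

```python
import pandas as pd

# Toy frame standing in for `data`; the string values are assumptions
toy = pd.DataFrame({
    'Administrative_Duration': ['0', '-1', '627.5', '2.666.666.667'],
    'Revenue': ['False', 'True', 'False', 'True'],
})

# Strings that are not valid numbers become NaN instead of raising an error
toy['Administrative_Duration'] = pd.to_numeric(
    toy['Administrative_Duration'], errors='coerce'
)

# Map the textual flags explicitly; any unexpected value becomes NaN
toy['Revenue'] = toy['Revenue'].map({'True': True, 'False': False})

print(toy)
```

Coercing to NaN rather than guessing keeps malformed entries such as `2.666.666.667` visible as missing values, so they can be counted and handled deliberately later.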
     

# Data cleaning

### Sosa:

    - Administrative:
        Because so few values are missing (0.12%), we fill them with the median of the column; the
        impact won't be noticeable. We can't use the mean, since the value represents a category and
        the fill should be an existing value. There are no other issues in this column.
        
    - ProductRelated_Duration:
        First off, since the values are floats, we convert the column dtype to float. This is due to a modification on the original
        The column also contains negative values. The time a person spends on a web page cannot be negative, so
        we assume there was an error in the reading and replace those values with '0'.
        The readings we're given 
    - Month
    - VisitorType
    
Missing values
Outliers
Errors
Transformation
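
The reasoning for preferring the median over the mean on the Administrative column can be checked on a small example. The series below is made up for illustration. Note that with an even number of observations the median can fall between two values, so it is worth confirming that it matches an observed value (the mode would be an alternative that always does).

```python
import numpy as np
import pandas as pd

# Made-up page counts with one missing value
pages = pd.Series([0, 0, 1, 1, 2, 4, np.nan], name='Administrative')

median_value = pages.median()   # middle of the sorted non-NA values
mean_value = pages.mean()       # generally not an observed value

observed = set(pages.dropna())
print('median is an observed value:', median_value in observed)
print('mean is an observed value:', mean_value in observed)

# Fill the gap with the median, as described in the notes above
filled = pages.fillna(median_value)
```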

In [75]:
## Administrative column

# before processing: NA_values(data.Administrative)

data.Administrative = data.Administrative.fillna(data.Administrative.median())

In [115]:
## ProductRelated_Duration column

#NA values:

NA_values(data.ProductRelated_Duration)

# Fixing the negative values:
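# Use .loc with a column label so only this column is changed, not the whole row
data.loc[data['ProductRelated_Duration'] == '-1', 'ProductRelated_Duration'] = '0'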

data['ProductRelated_Duration'].value_counts()

Column name: ProductRelated_Duration
Total values: 12316
NA values: 14
% of NA values: 0.11%


0                752
17                21
8                 17
11                17
15                16
                ... 
1.195.583.333      1
2.487.061.777      1
3289.5             1
812.875            1
342.5              1
Name: ProductRelated_Duration, Length: 9524, dtype: int64
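
The counts above include strings such as `1.195.583.333`, which cannot be read as plain decimals. One possible way to flag them and to finish the conversion is sketched below on a toy series. The rule "more than one dot means malformed" is an assumption, and the sketch does not try to guess the intended magnitude of those values.

```python
import pandas as pd

# Toy values standing in for data['ProductRelated_Duration']
durations = pd.Series(
    ['0', '-1', '64', '627.5', '2.666.666.667', None],
    name='ProductRelated_Duration',
)

# A plain decimal has at most one dot; missing entries count as zero dots
malformed = durations.str.count(r'\.').fillna(0) > 1
print(durations[malformed])

# Convert to float; malformed strings and missing entries become NaN
numeric = pd.to_numeric(durations, errors='coerce')

# A duration cannot be negative, so replace negative values with 0
numeric = numeric.mask(numeric < 0, 0)
print(numeric)
```

Flagging the malformed strings first gives a count of how many rows are affected before deciding whether to repair, drop, or impute them.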

In [None]:
## Month column

NA_values(data.Month)

In [127]:
## VisitorType column

NA_values(data.VisitorType)

Column name: VisitorType
Total values: 12327
NA values: 3
% of NA values: 0.02%
