# Logistic Regression

This is a binary classification problem. In this notebook we'll use logistic regression.

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [2]:
df_model = pd.read_csv("./clicks_model.csv", index_col=False)

In [3]:
df_model.head()

Unnamed: 0.1,Unnamed: 0,adults,children,sale,weekend,lag_days,ndays_reserve,Click__Friday,Click__Monday,Click__Saturday,Click__Sunday,Click__Thursday,Click__Tuesday,Click__Wednesday,Geo__Market1,Geo__Market2,Geo__Market3,Geo__Market4,Geo__Market5
0,0,2,0,0,0,17.0,8.0,1,0,0,0,0,0,0,0,0,1,0,0
1,1,2,2,0,0,56.0,1.0,1,0,0,0,0,0,0,0,0,1,0,0
2,2,2,0,0,0,78.0,1.0,1,0,0,0,0,0,0,0,0,1,0,0
3,3,2,0,0,0,78.0,1.0,1,0,0,0,0,0,0,0,0,1,0,0
4,4,2,0,0,0,103.0,38.0,1,0,0,0,0,0,0,0,0,1,0,0


In [4]:
# Drop the leftover index column, which is not a feature.
df_model.drop(['Unnamed: 0'], axis=1, inplace=True)

In [5]:
df_model.dtypes

adults                int64
children              int64
sale                  int64
weekend               int64
lag_days            float64
ndays_reserve       float64
Click__Friday         int64
Click__Monday         int64
Click__Saturday       int64
Click__Sunday         int64
Click__Thursday       int64
Click__Tuesday        int64
Click__Wednesday      int64
Geo__Market1          int64
Geo__Market2          int64
Geo__Market3          int64
Geo__Market4          int64
Geo__Market5          int64
dtype: object

## Sklearn implementation: Logistic Regression

Import the necessary libraries.

In [6]:
from sklearn import linear_model
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [7]:
df_model.head()

Unnamed: 0,adults,children,sale,weekend,lag_days,ndays_reserve,Click__Friday,Click__Monday,Click__Saturday,Click__Sunday,Click__Thursday,Click__Tuesday,Click__Wednesday,Geo__Market1,Geo__Market2,Geo__Market3,Geo__Market4,Geo__Market5
0,2,0,0,0,17.0,8.0,1,0,0,0,0,0,0,0,0,1,0,0
1,2,2,0,0,56.0,1.0,1,0,0,0,0,0,0,0,0,1,0,0
2,2,0,0,0,78.0,1.0,1,0,0,0,0,0,0,0,0,1,0,0
3,2,0,0,0,78.0,1.0,1,0,0,0,0,0,0,0,0,1,0,0
4,2,0,0,0,103.0,38.0,1,0,0,0,0,0,0,0,0,1,0,0


### Creating the Logistic Regression Model

In [8]:
X = np.array(df_model.drop(['sale'], axis=1))  # pass axis by keyword; the positional form is removed in recent pandas
y = np.array(df_model['sale'])
X.shape

(158161, 17)

We create the model and fit it to the data.

In [9]:
model = linear_model.LogisticRegression()
model.fit(X,y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
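
For intuition about what the fitted model computes, the sketch below shows the logistic (sigmoid) function turning a linear score into a probability and then into a 0/1 label. The weights and intercept are made-up illustrative values, not the coefficients learned above, and the helper names are hypothetical.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, intercept):
    """Probability of class 1 for one record under a logistic model."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights for (adults, children, lag_days); NOT the fitted values.
example_weights = [0.1, -0.2, -0.01]
example_intercept = -3.0

p = predict_proba([2, 0, 17.0], example_weights, example_intercept)
label = int(p >= 0.5)  # a 0.5 cut-off on the probability gives the class label
print(p, label)
```

A strongly negative intercept, as assumed here, pushes every probability well below 0.5, which is one way a model can end up predicting 0 for all records.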

In [10]:
predictions = model.predict(X)
predictions[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

The accuracy of our predictions on the training data is approximately 98%.

In [11]:
model.score(X,y)

0.98045662331421779

### Validation of the Model

We split the dataset: 80% of the records are used to train the model and the remaining 20% to test it.

In [12]:
validation_size = 0.20
seed = 12345
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, y, test_size=validation_size, random_state=seed)

In [13]:
name='Logistic Regression'
kfold = model_selection.KFold(n_splits=10)  # random_state only applies when shuffle=True; recent scikit-learn rejects it otherwise
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)

Logistic Regression: 0.980408 (0.001210)
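
To make the 10-fold procedure concrete, this sketch reproduces how an unshuffled k-fold split partitions row indices into contiguous validation folds. The helper name `kfold_indices` is hypothetical, and the sample count assumes the training set holds 158161 rows minus 31633 validation rows.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop

n_samples = 126528  # assumed training-set size: 158161 total - 31633 validation
fold_sizes = [len(val_idx) for _, val_idx in kfold_indices(n_samples, 10)]
print(fold_sizes)
```

Each record is used for validation exactly once, and the mean and standard deviation printed above summarise the ten per-fold accuracies.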


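We now make predictions on the held-out validation set. Note that `model` was fitted on the full dataset in cell 9 (`cross_val_score` fits clones and leaves `model` unchanged), so this score is not a true out-of-sample estimate.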

In [14]:
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))

0.980653115417


The accuracy achieved by our logistic regression on the validation set is 98%.

#### Confusion Matrix

In [15]:
print(confusion_matrix(Y_validation, predictions))

[[31021     0]
 [  612     0]]
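
A useful sanity check is to compare the model's accuracy with a trivial baseline that always predicts the majority class. The sketch below does this using only the counts from the confusion matrix above; the variable names are illustrative.

```python
# Counts taken from the confusion matrix: rows are true classes 0 and 1,
# columns are predicted classes 0 and 1.
confusion = [[31021, 0],
             [612, 0]]

n_negative = sum(confusion[0])  # records whose true label is 0
n_positive = sum(confusion[1])  # records whose true label is 1
total = n_negative + n_positive

# Accuracy of a "model" that ignores the features and always predicts 0.
baseline_accuracy = n_negative / total

# Accuracy of the fitted model: correct predictions lie on the diagonal.
model_accuracy = (confusion[0][0] + confusion[1][1]) / total

print(round(baseline_accuracy, 6), round(model_accuracy, 6))
```

Both values come out the same because the second column of the matrix is all zeros: the model never predicts class 1.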


#### Classification Report

In [16]:
print(classification_report(Y_validation, predictions))

             precision    recall  f1-score   support

          0       0.98      1.00      0.99     31021
          1       0.00      0.00      0.00       612

avg / total       0.96      0.98      0.97     31633





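The support column shows that the validation set contains 31021 records with sale = 0 and 612 with sale = 1. The weighted-average F1-score is 0.97, but it is dominated by the majority class: the model predicts 0 for every record, so precision, recall and F1 for class 1 are all 0, and the 98% accuracy simply equals the share of class-0 records.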
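
The figures in the report can be re-derived by hand from the confusion matrix, which shows where each number comes from. This is a minimal sketch with hypothetical helper names; it assumes, as the report does here, that a 0/0 precision is reported as 0.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return safe_div(2 * precision * recall, precision + recall)

# Confusion matrix entries, treating class 1 (sale) as the positive class.
tn, fp = 31021, 0
fn, tp = 612, 0

precision_0 = safe_div(tn, tn + fn)  # of all predicted 0, how many are truly 0
recall_0 = safe_div(tn, tn + fp)     # of all true 0, how many were found
precision_1 = safe_div(tp, tp + fp)  # no record was predicted as 1
recall_1 = safe_div(tp, tp + fn)     # none of the 612 sales was found

f1_0 = f1_score_from(precision_0, recall_0)
f1_1 = f1_score_from(precision_1, recall_1)

# Support-weighted average, as in the "avg / total" row.
support_0, support_1 = tn + fp, fn + tp
weighted_f1 = (f1_0 * support_0 + f1_1 * support_1) / (support_0 + support_1)

print(round(f1_0, 2), round(f1_1, 2), round(weighted_f1, 2))
```

Because class 0 accounts for about 98% of the support, its F1 of 0.99 almost entirely determines the weighted average, even though class 1 scores zero.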

#### Example Prediction

In [17]:
# A simple prediction for an example record: 2 adults, 0 children, click on a Friday
# (not a weekend), Geo Market1, a lag of about 7 days and a 2-day reservation.
X_new = pd.DataFrame({'adults':[2],'children':[0],'weekend':[0],'lag_days':[7],'ndays_reserve':[2],
                     'Click__Friday':[1],'Click__Saturday':[0],'Click__Sunday':[0],'Click__Monday':[0],'Click__Tuesday':[0],
                     'Click__Wednesday':[0],'Click__Thursday':[0],'Geo__Market1':[1],'Geo__Market2':[0],
                     'Geo__Market3':[0],'Geo__Market4':[0],'Geo__Market5':[0],})
# Reorder the columns to match the training matrix before predicting.
feature_columns = df_model.drop(['sale'], axis=1).columns
model.predict(X_new[feature_columns].values)

array([0])