# Logistic Regression using sklearn

So far we have worked through regression problems, where the predicted value is a continuous variable. We now turn to classification problems, starting with a binary classification problem.

## Problem Statement
Predict whether a visitor to a website will click on an online advertisement.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score,confusion_matrix


In [2]:
# import data into a pandas dataframe
data = pd.read_csv("/Users/sylvia/Desktop/datasets/advertising.csv")


In [3]:
# check the shape of data
data.shape

(1000, 10)

In [4]:
data.head()

,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 02:31:19,0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 03:36:18,0


In [5]:
# check basic information of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Daily Time Spent on Site  1000 non-null   float64
 1   Age                       1000 non-null   int64  
 2   Area Income               1000 non-null   float64
 3   Daily Internet Usage      1000 non-null   float64
 4   Ad Topic Line             1000 non-null   object 
 5   City                      1000 non-null   object 
 6   Male                      1000 non-null   int64  
 7   Country                   1000 non-null   object 
 8   Timestamp                 1000 non-null   object 
 9   Clicked on Ad             1000 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB


As the output above shows, every column has 1000 non-null entries, so there are no missing values.
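
The same check can be made explicitly rather than read off `data.info()`. The sketch below is a minimal illustration on a hypothetical toy frame (`toy` and its values are made up, not taken from the advertising data); on the real data the same two calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the advertising data
toy = pd.DataFrame({
    "age": [35, 31, np.nan, 29],
    "area_income": [61833.90, 68441.85, 59785.94, np.nan],
    "clicked": [0, 0, 1, 1],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only when no cell in the whole frame is missing
has_no_missing = not toy.isna().any().any()
print(has_no_missing)
```

For the dataset used here, `data.isna().sum()` should report zero for every column, in line with the `info()` output.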

In [6]:
# drop the free-text string columns for simplicity
del_columns = ["Ad Topic Line", "City", "Country"]
data.drop(columns=del_columns, inplace=True)
data.head()

# All remaining columns are numerical except Timestamp, which is converted to numerical features later

,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male,Timestamp,Clicked on Ad
0,68.95,35,61833.9,256.09,0,2016-03-27 00:53:11,0
1,80.23,31,68441.85,193.77,1,2016-04-04 01:39:02,0
2,69.47,26,59785.94,236.5,0,2016-03-13 20:35:42,0
3,74.15,29,54806.18,245.89,1,2016-01-10 02:31:19,0
4,68.37,35,73889.99,225.58,0,2016-06-03 03:36:18,0


In [7]:
# rename columns (some column names are very lengthy although informative)

cols = ['daily_time', 'age', 'area_income', 'daily_usage', 'male', 'timestamp', 'clicked']
data.columns = cols
data.head()

,daily_time,age,area_income,daily_usage,male,timestamp,clicked
0,68.95,35,61833.9,256.09,0,2016-03-27 00:53:11,0
1,80.23,31,68441.85,193.77,1,2016-04-04 01:39:02,0
2,69.47,26,59785.94,236.5,0,2016-03-13 20:35:42,0
3,74.15,29,54806.18,245.89,1,2016-01-10 02:31:19,0
4,68.37,35,73889.99,225.58,0,2016-06-03 03:36:18,0


### <font color=blue>Feature Extraction from the timestamp column</font>

In [8]:
# convert the time stamp to features

# convert the timestamp column to datetime dtype
data['timestamp'] = pd.to_datetime(data['timestamp'])

# create feature of month
data['month'] = data['timestamp'].dt.month

# create feature of the day of the month
data['day_month'] = data['timestamp'].dt.day

# create feature of the day of the week
data['day_week'] = data['timestamp'].dt.dayofweek

# create feature of the hour of the day
data['hour'] = data['timestamp'].dt.hour

# drop the timestamp column
data.drop(columns = 'timestamp', inplace = True)

# preview the data
data.head()

,daily_time,age,area_income,daily_usage,male,clicked,month,day_month,day_week,hour
0,68.95,35,61833.9,256.09,0,0,3,27,6,0
1,80.23,31,68441.85,193.77,1,0,4,4,0,1
2,69.47,26,59785.94,236.5,0,0,3,13,6,20
3,74.15,29,54806.18,245.89,1,0,1,10,6,2
4,68.37,35,73889.99,225.58,0,0,6,3,4,3
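
To make the `day_week` encoding easier to read, the short sketch below applies the same `.dt` accessors to the first two timestamps from the preview. The variable name `ts` is introduced only for this illustration. In pandas, `dayofweek` counts from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# First two timestamps from the data preview
ts = pd.Series(pd.to_datetime(["2016-03-27 00:53:11", "2016-04-04 01:39:02"]))

# Numeric day of the week: Monday = 0, Sunday = 6
print(ts.dt.dayofweek.tolist())   # [6, 0]

# Human-readable day names for the same timestamps
print(ts.dt.day_name().tolist())  # ['Sunday', 'Monday']

# Hour of the day, as used for the 'hour' feature
print(ts.dt.hour.tolist())        # [0, 1]
```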


After pre-processing the data, we separate the independent features (`X`) from the dependent variable (`y`).

In [9]:

# create dependent and independent feature sets
X = data.drop(columns='clicked').copy()
y = data['clicked'].copy()


# split the data set into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
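
With `test_size = 0.25` and 1000 rows, the test set holds 250 rows. The sketch below checks that on hypothetical toy arrays (`X_toy` and `y_toy` are made up for illustration). It also passes the optional `stratify` argument, which the notebook does not use, to show how the class proportions can be kept identical in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, exactly half of them in the positive class
X_toy = np.arange(2000).reshape(1000, 2)
y_toy = np.array([0, 1] * 500)

# stratify=y_toy keeps the 50/50 class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=33, stratify=y_toy
)

print(X_tr.shape, X_te.shape)  # sizes of the training and test feature sets
print(y_te.mean())             # share of positive examples in the test set
```

Whether stratification matters depends on how balanced the target is. For a roughly balanced target, a plain random split usually gives similar proportions.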


In [10]:
# fit the logistic regression model using sklearn
logr = LogisticRegression()
logr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
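
The warning above means the solver stopped at its iteration limit before converging. It suggests two remedies: raise `max_iter` or scale the features. The sketch below shows one possible way to combine both, using a scikit-learn pipeline on synthetic data. The names `X_demo`, `y_demo` and `scaled_logr`, the choice of `max_iter=1000`, and the inflated column are all assumptions made for illustration. It is not the model fitted in this notebook, and its score says nothing about the advertising data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the advertising dataset)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=33)

# Inflate one column to mimic a large-valued feature such as area_income
X_demo[:, 0] *= 10000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=33
)

# Standardize the features, then fit logistic regression with a higher iteration cap
scaled_logr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logr.fit(X_tr, y_tr)

# Accuracy on the synthetic test split
print(scaled_logr.score(X_te, y_te))
```

Putting the scaler inside the pipeline means it is fitted on the training split only and then reused unchanged on the test split, which avoids leaking test-set statistics into training.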


In [11]:
# predict the values for test set using the trained model

y_pred = logr.predict(X_test)

In [12]:
# calculate the accuracy score
accuracy_score(y_test, y_pred)

0.892

In [13]:
# confusion matrix for the model
confusion_matrix(y_test, y_pred)

array([[118,   6],
       [ 21, 105]])
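
The confusion matrix also explains the accuracy score. In scikit-learn's layout, rows are the true classes and columns are the predicted classes, so for labels 0 and 1 the entries are `[[TN, FP], [FN, TP]]`. The sketch below recomputes accuracy from the matrix reported above and adds precision and recall for the "clicked" class. The variable names are chosen for this illustration.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
cm = np.array([[118, 6],
               [21, 105]])

# Unpack in scikit-learn's order for binary labels 0 and 1
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all test rows predicted correctly
precision = tp / (tp + fp)        # share of predicted clicks that were real clicks
recall = tp / (tp + fn)           # share of real clicks that the model found

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}")
# accuracy=0.892, precision=0.946, recall=0.833
```

The accuracy of 223/250 = 0.892 matches the score computed earlier. The lower recall reflects the 21 visitors who clicked but were predicted not to.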