# Logistic Regression Ex 1: Ad Click Prediction Student Notebook

In this exercise we will build a simple logistic regression model using the [Ad Click prediction dataset](https://www.kaggle.com/jahnveenarang/cvdcvd-vd).

We will work with some demographic data to predict whether a user purchased something after clicking on an ad.

### Importing the libraries and datasets

First we should import the required libraries and the dataset.

In [19]:
#Basics
import pandas as pd
import numpy as np

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns

#SKLearn ML


In [20]:
#Loading dataset


In [38]:
#display the loaded dataframe


### Exploring the dataset

It's important to understand the amount, size and types of data that we're working with.

In [44]:
#Dataset Shape


In [None]:
#Dataset First Few Rows


In [45]:
#Data Types


### Visual Exploration of the Data

We can take a look at our data by plotting a histogram using the [Seaborn](https://seaborn.pydata.org/) and [matplotlib](https://matplotlib.org/) packages.

In [None]:
#plot a histogram of count by age_range, with one series per gender
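# Keep only rows with a known age; copy so that adding a column does not trigger SettingWithCopyWarning
social_network_ads_bins = social_network_ads[social_network_ads.Age.notna()].copy()

# Bucket ages into 10-year ranges
bins = list(range(0, 120, 10))
social_network_ads_bins['age_range'] = pd.cut(social_network_ads_bins.Age, bins=bins)
chart = sns.catplot(x="age_range", kind="count", hue="Female", data=social_network_ads_bins)
# Rotate the x tick labels so the age ranges stay readable
for axes in chart.axes.flat:
    axes.tick_params(axis="x", labelrotation=90)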

We can see that the two genders appear in roughly equal numbers across the age ranges. In that sense this is a balanced dataset.

## Fitting the Logistic Regression Model Using SkLearn
We will be using [sklearn](https://scikit-learn.org/stable/) for fitting the first logistic regression model. We will split our dataframe into the following:

- A single column for the target variable (technically a Series)
- Remaining columns for the inputs (since there are multiple columns, this is a DataFrame)
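
The split described above can be sketched on a small, made-up DataFrame. This is only an illustration: the rows are hypothetical, the column names are assumed from the outputs shown later in this notebook, and the names `y_target` and `x_inputs` follow the statsmodels cell further down.

```python
import pandas as pd

# Hypothetical toy data: column names follow the later outputs, values are made up
toy = pd.DataFrame({
    "Female": [0, 1, 1, 0],
    "Age": [19, 35, 26, 47],
    "EstimatedSalary": [19000, 20000, 43000, 25000],
    "Purchased": [0, 0, 0, 1],
})

# Selecting a single column gives a Series (the target)
y_target = toy["Purchased"]

# Dropping the target leaves the remaining columns as a DataFrame (the inputs)
x_inputs = toy.drop(columns="Purchased")

print(type(y_target).__name__, type(x_inputs).__name__)
```

The same two lines should carry over to the real dataframe once it is loaded, provided the target column has the same name.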


In [26]:
#Separate x_inputs from y_target variables


In [37]:
#Split the data into training and test sets using the function


In [28]:
#Declare a logistic regression classifier


In [29]:
#Make predictions on test data


In [36]:
#Print out accuracy score on the test data


In [None]:
#Plot the confusion matrix


In later chapters we'll look more closely at how to interpret the confusion matrix.

## Fitting the Logistic Regression Model Using Statsmodels

To fit the model using [statsmodels](https://www.statsmodels.org/stable/index.html) we first separate our target variable (Y) and independent variables (X). As we want to predict the purchase, this is our target variable Y. The rest are our independent variables X.

## Using Statsmodels
Now we will look at using the statsmodels library to fit a logistic regression model. We first specify the model and then fit it in the next step.

In [32]:
#StatsModels ML
import statsmodels.api as sm

In [33]:
# Build the model (sm.Logit does not add an intercept column automatically)
log_reg = sm.Logit(endog=y_target, exog=x_inputs)

# Fit the model to the data
log_reg = log_reg.fit()

Optimization terminated successfully.
         Current function value: 0.677546
         Iterations 4


Now we will print the model results summary. The summary includes information on the fit process as well as the estimated coefficients.

In [34]:
log_reg.summary(xname=['Gender', 'Age', 'EstimatedSalary'])

| | | | |
|---|---|---|---|
| Dep. Variable: | Purchased | No. Observations: | 400 |
| Model: | Logit | Df Residuals: | 397 |
| Method: | MLE | Df Model: | 2 |
| Date: | Mon, 02 May 2022 | Pseudo R-squ.: | -0.03923 |
| Time: | 13:00:15 | Log-Likelihood: | -271.02 |
| converged: | True | LL-Null: | -260.79 |
| Covariance Type: | nonrobust | LLR p-value: | 1.0 |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Gender | -0.5371 | 0.201 | -2.666 | 0.008 | -0.932 | -0.142 |
| Age | -0.0002 | 0.006 | -0.035 | 0.972 | -0.012 | 0.011 |
| EstimatedSalary | 6.209e-07 | 2.82e-06 | 0.220 | 0.826 | -4.91e-06 | 6.15e-06 |


The top section shows summary statistics for the model fit. The coef column shows the estimated values of the $\beta$ coefficients in the linear predictor $\gamma = \beta_0 + \beta_1 x$ from our logistic regression formula.
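
As a quick check on the top section, the Pseudo R-squ. value can be reproduced from the two log-likelihoods in the summary. The sketch below assumes the McFadden definition, 1 - LL / LL-Null, which is the one statsmodels reports. The numbers are copied by hand from the summary, so the result is only as precise as the printed values.

```python
# Values copied from the model summary
log_likelihood = -271.02  # Log-Likelihood of the fitted model
ll_null = -260.79         # LL-Null: log-likelihood of the intercept-only model

# McFadden's pseudo R-squared
pseudo_r2 = 1 - log_likelihood / ll_null
print(round(pseudo_r2, 5))
```

A negative value means the fitted model has a lower log-likelihood than the intercept-only null model. A likely cause here is that `sm.Logit` fits no intercept unless a constant column is included in the inputs.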

The column P>|z| shows the p-values. Each p-value is the probability of seeing a z-statistic at least this extreme if the true coefficient were zero. If the p-value falls below an established threshold for statistical significance (commonly 0.05), we conclude that the corresponding variable is a statistically significant predictor. The z-statistic equals the coefficient divided by its standard error.
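
The relationship between the coef, std err, z and P>|z| columns can be checked by hand. The sketch below uses the rounded Gender values from the summary and assumes a two-sided test against the standard normal distribution, so the z value differs slightly from the printed one because the inputs are already rounded.

```python
import math

# Rounded values copied from the Gender row of the summary
coef = -0.5371
std_err = 0.201

# z-statistic: coefficient divided by its standard error
z = coef / std_err

# Two-sided p-value under a standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

print(round(z, 3), round(p_value, 3))
```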

Age and EstimatedSalary have p-values well above 0.05 (0.972 and 0.826), meaning they are not statistically significant in helping predict the output variable.

#### Log Odds and Interpretation

The Gender coefficient (x1) is β=-0.5371 and is statistically significant (p-value of 0.008). The coefficients are on the log-odds scale, so to interpret them we transform them into odds ratios. We do so by taking the exponential of the parameters. Recall that Gender is our x1 variable.

In [35]:
odds_ratios = pd.DataFrame({"Odds Ratio": log_reg.params})

odds_ratios = np.exp(odds_ratios)
print(odds_ratios)

                 Odds Ratio
Female             0.584419
Age                0.999796
EstimatedSalary    1.000001


Taking the exponential of the Gender coefficient (-0.5371), we can see that the odds ratio is 0.58. We can say that the odds of one gender (male) purchasing the item are about 0.58 times the odds for the other gender (female), with Age and EstimatedSalary held constant. Recall that 0 is the encoded label for female and 1 is the encoded label for male.
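
The odds ratio can be reproduced directly from the summary table. As a minimal sketch, the code below exponentiates the Gender coefficient printed in the summary (-0.5371) along with the bounds of its 95% confidence interval. The variable names are illustrative only.

```python
import math

# Gender row of the summary: coefficient and 95% confidence bounds (log-odds scale)
gender_coef = -0.5371
ci_low, ci_high = -0.932, -0.142

# Exponentiating moves from the log-odds scale to the odds-ratio scale
odds_ratio = math.exp(gender_coef)
odds_ratio_ci = (math.exp(ci_low), math.exp(ci_high))

print(round(odds_ratio, 4), odds_ratio_ci)
```

Because the whole interval lies below 1, the lower odds for the group encoded as 1 are consistent with the significant p-value for Gender.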

#### END OF NOTEBOOK