# Introduction

A click is a marketing metric that counts the number of customers who have interacted with your ads. The click-through rate expresses the number of clicks as a percentage of the ad's total views, and it is one measure of a marketing campaign's success. A higher number of clicks signals that your audience is seeing relevant ads, which for your company means a higher return on investment (ROI). The purpose of this analysis is to predict which customers will click on your ad, and why.
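
The click percentage described above can be sketched in a few lines. This is only an illustration: the function name `click_through_rate` and the numbers are hypothetical and do not come from the dataset.

```python
def click_through_rate(clicks: int, views: int) -> float:
    """Return clicks as a percentage of views (0.0 when there are no views)."""
    if views == 0:
        return 0.0
    return 100.0 * clicks / views


# Hypothetical numbers, for illustration only
ctr = click_through_rate(clicks=50, views=1000)
print(f"Click-through rate: {ctr:.1f}%")
```

The guard for zero views avoids a division error for an ad that has not been shown yet.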

The dataset is composed of:

    -Daily Time Spent on Site: Amount of time spent on the website
    -Age: Customer age
    -Area Income: Average income of the customer's geographical area
    -Daily Internet Usage: Daily average time spent on the internet
    -Ad Topic Line: Text of the advertisement
    -City: City of the customer
    -Male: Whether or not the user is male
    -Country: Country of the user
    -Timestamp: Time at which the consumer clicked on the ad or closed the window
    -Clicked on Ad: 0 or 1, indicating whether the customer clicked on the ad

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv("/kaggle/input/advertising/advertising.csv")
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.duplicated().sum()


In [None]:
df.isnull().sum()

# Exploratory Data Analysis


In [None]:
pd.crosstab(df['Male'],df['Clicked on Ad']).sort_values(1,ascending=False)
# The data is balanced and evenly distributed between genders

In [None]:
no_click = df[df['Clicked on Ad'] == 0]
click = df[df['Clicked on Ad'] == 1]

In [None]:

click['Age'].hist(bins=10, label='click', alpha=0.5)
no_click['Age'].hist(bins=10, label='no click', alpha=0.5)
plt.xlabel('Age')
plt.legend(loc='best')
plt.show()

In [None]:

click['Area Income'].hist(bins=10, label='click', alpha=0.5)
no_click['Area Income'].hist(bins=10, label='no click', alpha=0.5)
plt.xlabel('Area Income')
plt.legend(loc='best')
plt.show()

In [None]:

click['Daily Time Spent on Site'].hist(bins=10, label='click', alpha=0.5)
no_click['Daily Time Spent on Site'].hist(bins=10, label='no click', alpha=0.5)
plt.xlabel('Daily Time Spent on Site')
plt.legend(loc='best')
plt.show()

In [None]:

click['Daily Internet Usage'].hist(bins=10, label='click', alpha=0.5)
no_click['Daily Internet Usage'].hist(bins=10, label='no click', alpha=0.5)
plt.xlabel('Daily Internet Usage')
plt.legend(loc='best')
plt.show()

In [None]:
import datetime

df['Date'] = pd.to_datetime(df['Timestamp'], errors='coerce')

df['Hour']=df['Date'].dt.hour
df['Month']=df['Date'].dt.month
df['Weekdays']= df['Date'].dt.weekday

In [None]:
pd.crosstab(df['Month'],df['Clicked on Ad'])
# No clear seasonal pattern across months

In [None]:
sns.countplot(x='Month', hue='Clicked on Ad', data=df)

In [None]:
sns.countplot(x='Weekdays', hue='Clicked on Ad', data=df)

In [None]:

sns.countplot(x='Hour', hue='Clicked on Ad', data=df)

In [None]:
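# Restrict the correlation matrix to numeric columns (text columns cannot be correlated)
df.corr(numeric_only=True)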

In [None]:
df['Age_bins'] = pd.cut(df['Age'], bins=[0, 29, 35, 42, 70], labels=['Young','Adult','Mid', 'Elder'])
df['Salary_bins'] = pd.cut(df['Area Income'], bins=[0, 30000.00, 55000.00, 65000.00, 85000.00], labels=['Low Income','Below Average','Above Average', 'Wealth'])
df['Daily_bins'] = pd.cut(df['Daily Internet Usage'], bins=[0, 139, 183, 218, 300], labels=['Short Time','Below Average','Above Average', 'Addict'])
df['Website_bins'] = pd.cut(df['Daily Time Spent on Site'], bins=[0, 51, 68, 78, 100], labels=['Short time','Below Average','Above Average', 'Addict'])

In [None]:
a = df.groupby(['Age_bins', 'Salary_bins', 'Male'])['Clicked on Ad'].sum().unstack('Salary_bins')
a.fillna(0)

In [None]:
df.groupby(['Age_bins', 'Website_bins', 'Male'])['Clicked on Ad'].sum().unstack('Website_bins')

In [None]:
print('The number of cities is equal to {}.'.format(df['City'].nunique()))
print('The number of countries is equal to {}.'.format(df['Country'].nunique()))

# Prepare the Data

In [None]:
X = df.drop(['Date','Timestamp','Clicked on Ad', 'Ad Topic Line', 'Age_bins','City', 'Country', 'Salary_bins', 'Daily_bins', 'Website_bins'], axis=1)
y = df['Clicked on Ad']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)

In [None]:
import statsmodels.api as sm
from scipy import stats

X2 = sm.add_constant(X_train)
model = sm.Logit(y_train, X2)
model2 = model.fit()
print(model2.summary(xname=['Const','Daily Time Spent on Site', 'Age', 'Area Income','Daily Internet Usage', 'Male', 'Hour', 'Month', 'Weekdays']))

In [None]:
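# X_train and X_test were created before this drop, so it does not change them;
# the split and scaling must be rerun for the model to use the reduced feature set.
X.drop(['Male', 'Hour', 'Month', 'Weekdays'], axis=1, inplace=True)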

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

In [None]:
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred))
print(metrics.confusion_matrix(y_test, y_pred))
print(metrics.classification_report(y_test, y_pred))

**How to evaluate the model?**

In a binary classification, each prediction falls into one of four possible outcomes:

* True Positive (TP): the model predicted a click and the customer clicked
* False Positive (FP): the model predicted a click and the customer did not click
* True Negative (TN): the model predicted no click and the customer did not click
* False Negative (FN): the model predicted no click and the customer clicked


We arrange these outcomes in an N x N matrix, where N is the number of target classes. In our case, it is a 2 x 2 matrix, called the confusion matrix. From its counts, we compute the metrics used to evaluate the model. The standard metrics are accuracy, precision, recall, and F1-score.
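
As a minimal sketch of how the four outcomes are counted, the snippet below tallies them by hand for a small made-up set of labels. The helper `confusion_counts` and the label lists are hypothetical and are not taken from the dataset; the counts are laid out with rows as actual classes and columns as predicted classes, which is the layout `metrics.confusion_matrix` uses.

```python
def confusion_counts(y_true, y_pred):
    """Count TN, FP, FN, TP for binary labels (1 = click, 0 = no click)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tn, fp, fn, tp


# Made-up labels, for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true_demo, y_pred_demo)
# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```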

Definitions from scikit-learn:

* The precision is the ratio TP / (TP + FP), where TP is the number of true positives and FP the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
 
* The recall is the ratio TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

* The F1 score can be interpreted as a harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and its worst at 0.
 
* The accuracy is the fraction of predictions that match the true labels. In multilabel classification, the accuracy function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.
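
The ratios defined above can be computed directly from the four confusion-matrix counts. The sketch below assumes hypothetical counts chosen only for illustration (they are not the results of this notebook), and the helper name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, for illustration only
scores = classification_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The zero-denominator guards return 0.0 when a metric is undefined, for example precision when the model never predicts a click.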

**Interpretation of the confusion matrix:**

On the test set, the model correctly predicted 96% of customers' actions (accuracy).
According to the precision, when the model predicts that a customer will click on the ad, it is correct 93% of the time.


**Conclusion**

Our model is able to determine whether a target customer will click on the ad with 96% accuracy.
According to our previous analysis, people who are likely to click on the ad:

- tend to have a lower-to-medium income, between 40,000 and 50,000
- are generally older, 40 years old and above
- do not spend much time either on the website or on the internet