# Advertising Click Prediction

Your goal is to analyze an advertising dataset that indicates whether or not a particular internet user clicked on an advertisement.

The task is to predict whether a user will click on an advertisement based on the features of that user.

The dataset includes the following features:

* _Daily Time Spent on Site_: consumer time on site in minutes
* _Age_: customer age in years
* _Area Income_: Avg. income of geographical area of consumer
* _Daily Internet Usage_: Avg. minutes a day consumer is on the internet
* _Ad Topic Line_: Headline of the advertisement
* _City_: City of consumer
* _Male_: Whether or not consumer was male
* _Country_: Country of consumer
* _Timestamp_: Time at which consumer clicked on Ad or closed window
* _Clicked on Ad_: 0 or 1, indicating whether the consumer clicked on the ad

In [None]:
import numpy as np
import pandas as pd

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Importing the Data
Here we are going to import the data set and take a look inside:

In [None]:
data = pd.read_csv('advertising.csv') 

In [None]:
data.head(3)

In [None]:
data.info()

In [None]:
data.describe(include="all")

In [None]:
data["Clicked on Ad"].value_counts()

## Exploratory Data Analysis

We use a few visualizations to extract insights from the data. Let's first check the distribution of user ages:

In [None]:
plt.figure(figsize=(7,5))
sns.set_style('whitegrid')
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram with a KDE overlay
sns.histplot(data['Age'], bins=20, kde=True, edgecolor='k', linewidth=1)

We can see that the ages of the internet users in this dataset range roughly from 20 to 60, and that most users are in their thirties. Let's look at _Area Income_ versus _Age_:

In [None]:
sns.jointplot(x='Age', y='Area Income', data= data)

We also explore the daily time spent on the website versus the age of the users:

In [None]:
sns.jointplot(x='Age', y='Daily Time Spent on Site', data=data)

This shows that younger adults (aged 20-40) account for most of the time spent on the website. However, they are also the largest age group in the dataset, so part of this pattern may simply reflect how many of them there are.
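One way to separate group size from time spent is to tabulate, per age band, both the number of users and their average time on site. The sketch below is only illustrative: the helper name `time_by_age_group`, the bin edges, and the tiny `toy` frame are assumptions and not part of the dataset or the original analysis.

```python
import pandas as pd

def time_by_age_group(df, bins=(18, 30, 40, 50, 62)):
    """Count users and average their daily site time per age band."""
    # right=False gives half-open bands such as [18, 30)
    groups = pd.cut(df["Age"], bins=list(bins), right=False)
    return (
        df.groupby(groups, observed=False)["Daily Time Spent on Site"]
          .agg(["count", "mean"])
    )

# Tiny made-up sample, only to show the shape of the result
toy = pd.DataFrame({
    "Age": [22, 25, 34, 38, 45, 58],
    "Daily Time Spent on Site": [80.0, 70.0, 75.0, 65.0, 50.0, 40.0],
})
summary = time_by_age_group(toy)
print(summary)
```

On the real data, the same call would be `time_by_age_group(data)`. Comparing the `count` and `mean` columns shows whether a band stands out because of its size or because of its behavior.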

Next, we compare the daily time spent on the site with the user's total daily internet usage:

In [None]:
sns.jointplot(x='Daily Time Spent on Site', y='Daily Internet Usage', data= data)

We can see that users who spend more time on the internet tend to spend more time on the website too.
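A visual impression like this can be backed by a number. As a minimal sketch, the Pearson correlation between the two columns can be computed with `Series.corr`. The values in `toy` are made up for illustration, so the printed coefficient says nothing about the real dataset.

```python
import pandas as pd

# Made-up values that rise together, standing in for the two real columns
toy = pd.DataFrame({
    "Daily Time Spent on Site": [40.0, 50.0, 60.0, 70.0, 80.0],
    "Daily Internet Usage": [120.0, 150.0, 170.0, 200.0, 240.0],
})

# Pearson correlation: +1 is a perfect increasing linear relationship, 0 is none
r = toy["Daily Time Spent on Site"].corr(toy["Daily Internet Usage"])
print(round(r, 3))
```

On the notebook's data the equivalent call would be `data['Daily Time Spent on Site'].corr(data['Daily Internet Usage'])`.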

Finally, we take a quick look at the pairwise relationships between the numerical features, colored by whether the user clicked on the ad:

In [None]:
sns.pairplot(data, hue='Clicked on Ad')

## Cleaning the Data

We check to see if we have any missing data:

In [None]:
sns.heatmap(data.isnull(), yticklabels=False)

As the heatmap shows, we don't have any missing data.
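A heatmap can hide a handful of missing cells in a large table, so a numeric count per column is a useful cross-check. This is a minimal sketch: the helper name `missing_per_column` and the `toy` frame are assumptions used only for illustration.

```python
import numpy as np
import pandas as pd

def missing_per_column(df):
    """Return the number of missing values in each column."""
    return df.isnull().sum()

# Made-up frame with one gap in each column
toy = pd.DataFrame({"Age": [30, np.nan, 41], "City": ["A", "B", None]})
missing = missing_per_column(toy)
print(missing)
```

For the notebook's data, `missing_per_column(data)` should report zero for every column if the heatmap reading is correct.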

Some features are non-numerical and have to be handled before we can use them as inputs to our machine learning algorithm: 'Ad Topic Line', 'City', 'Country', and 'Timestamp'.

We drop 'Ad Topic Line' for now, although Natural Language Processing might extract interesting information from it.

'City' and 'Country' could be replaced by dummy variables with numerical values, but that would create too many new features.

Another approach is to treat them as categorical values and encode each of them as a single numeric feature.
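The trade-off between the two encodings can be seen on a small example. The sketch below uses a made-up `toy` column: `pd.get_dummies` adds one column per distinct value, while category codes keep a single integer column.

```python
import pandas as pd

# Made-up values, standing in for the real 'Country' column
toy = pd.DataFrame({"Country": ["Peru", "Italy", "Peru", "Japan"]})

# Dummy variables: one indicator column per distinct country
dummies = pd.get_dummies(toy["Country"], prefix="Country")

# Category codes: a single integer column, numbered in sorted category order
codes = toy["Country"].astype("category").cat.codes

print(dummies.shape)   # (rows, number of distinct countries)
print(codes.tolist())
```

The codes are compact, but their numeric order is arbitrary (alphabetical here). A linear model such as logistic regression will still treat them as an ordered quantity, which is worth keeping in mind when reading the results.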

Converting 'Timestamp' into numerical values is a bit more complicated. We can convert timestamps directly to numbers, or group them into time-of-day or day slots, treat those as categorical values, and then convert them to numbers.

Here we have chosen to take the month, the hour, and the day of the week from the timestamp as numerical features.

In [None]:
data['City Codes']= data['City'].astype('category').cat.codes

In [None]:
data['Country Codes'] = data['Country'].astype('category').cat.codes

In [None]:
data[['City Codes','Country Codes']].head(3)

In [None]:
data['Month'] = data['Timestamp'].apply(lambda x: x.split('-')[1])
data['Hour'] = data['Timestamp'].apply(lambda x: x.split(':')[0].split(' ')[1])

In [None]:
data[['Month','Hour']].head(3)

In [None]:
# Parse the strings into datetimes; no time zone is attached because only the day of week is used below
data['dateobj'] = pd.to_datetime(data['Timestamp'])

In [None]:
data['dow'] = data['dateobj'].dt.dayofweek  # Monday=0 ... Sunday=6
# Month and Hour were extracted as strings, so cast them to integers
data['Month'] = data['Month'].astype(int)
data['Hour'] = data['Hour'].astype(int)
data
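The month, hour, and day-of-week features built above by string splitting could also be derived in one pass with pandas' datetime accessors. This is a sketch under assumptions: the helper name `add_time_features` is hypothetical, the two example timestamps are made up, and the `YYYY-MM-DD HH:MM:SS` format is inferred from the splitting logic used in the cells above.

```python
import pandas as pd

def add_time_features(df, column="Timestamp"):
    """Return a copy of df with integer Month, Hour and dow columns."""
    ts = pd.to_datetime(df[column])
    out = df.copy()
    out["Month"] = ts.dt.month
    out["Hour"] = ts.dt.hour
    out["dow"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    return out

# Made-up timestamps in the assumed format
toy = pd.DataFrame({"Timestamp": ["2016-03-27 00:53:11", "2016-04-04 01:39:02"]})
features = add_time_features(toy)
print(features[["Month", "Hour", "dow"]])
```

Because the accessors already return integers, no separate `astype(int)` step is needed with this approach.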

## Training a Logistic Regression Model

Now we can select our features and the target, and then split our data into training and test sets.

For our features, we have already decided to drop 'Ad Topic Line', as it is textual data and at the moment we don't want to invest the time to extract information from it. We can also drop the original features for which we have created numerical replacements:

In [None]:
X = data.drop(labels=['Ad Topic Line','City','Country','Timestamp','dateobj','Clicked on Ad'], axis=1)

In [None]:
X.head(3)

In [None]:
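from sklearn.preprocessing import StandardScaler

# Standardize every feature to zero mean and unit variance.
# Note: the scaler is fitted on the full dataset here, so test-set statistics
# leak into the scaling; Xsc is only used by the commented-out split further down.
sc = StandardScaler()
Xsc = sc.fit_transform(X)
Xsc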

All our features are now numerical, so we can use them to train our model.

In [None]:
y = data['Clicked on Ad']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
#X_train, X_test, y_train, y_test = train_test_split(Xsc, y, test_size=0.3, random_state=101)

Now that we have our training data, we want to train a logistic regression model on it:

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
from sklearn.svm import SVC  # only needed for the commented-out alternative below
model = LogisticRegression()
#model = SVC()

In [None]:
model.fit(X_train, y_train)
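If the scaled features are used instead of the raw ones, one common pattern is to wrap the scaler and the classifier in a scikit-learn pipeline, so that the scaler is fitted on the training split only. The sketch below is not the notebook's own setup: `X_demo` and `y_demo` are synthetic stand-ins for `X` and `y`, and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y: three features on very different scales
rng = np.random.default_rng(101)
X_demo = rng.normal(size=(200, 3)) * [1.0, 50.0, 1000.0]
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 50.0 > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)

# fit() learns the scaling from X_tr only, then trains the classifier on the
# scaled training data; score() reuses that same scaling on X_te
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_tr, y_tr)
accuracy = scaled_model.score(X_te, y_te)
print(accuracy)
```

With the notebook's variables the equivalent would be `make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)`, which keeps test-set statistics out of the scaling step.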

### Predictions and Evaluations
Now we predict values for the testing data:

In [None]:
predictions = model.predict(X_test)

Let's evaluate the model based on precision, recall, and F1-score:

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, predictions, target_names=['Not Clicked','Clicked']))
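To make the three metrics in the report concrete, the sketch below derives them by hand from the confusion matrix for the positive class. The labels in `y_true` and `y_pred` are made up for illustration and are not the model's actual predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels: 4 negatives followed by 6 positives
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted clicks, how many were real clicks
recall = tp / (tp + fn)      # of the real clicks, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, f1)
```

These hand-computed values match `precision_score`, `recall_score`, and `f1_score` from scikit-learn, which is what the 'Clicked' row of `classification_report` shows.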

In [None]:
model.coef_
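The raw coefficient array is easier to read when each value is paired with its feature name. The helper `coefficient_table` below is a hypothetical convenience function, and `X_demo`, `y_demo`, and `demo_model` are synthetic stand-ins for the notebook's `X`, `y`, and `model`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def coefficient_table(fitted_model, feature_names):
    """Pair each coefficient of a binary logistic model with its feature name."""
    coefs = pd.Series(fitted_model.coef_[0], index=feature_names, name="coefficient")
    # Order by absolute size so the most influential terms come first
    return coefs.loc[coefs.abs().sort_values(ascending=False).index]

# Synthetic data: the target depends on 'strong' only
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "strong": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_demo = (X_demo["strong"] > 0).astype(int)

demo_model = LogisticRegression().fit(X_demo, y_demo)
table = coefficient_table(demo_model, X_demo.columns)
print(table)
```

For the notebook's model the call would be `coefficient_table(model, X.columns)`. Coefficient sizes are only comparable across features when the features are on similar scales, so this ranking is most meaningful for a model trained on standardized inputs.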