In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

In [None]:
df = pd.read_csv('../input/advertising.csv')

In [None]:
df.head()

In [None]:
df.describe()

Looks like the mean values differ by orders of magnitude across features. We'll probably need to scale the features later!
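To make the scaling point concrete, here's a minimal sketch on made-up numbers (the two toy columns and their values are assumptions for illustration, not values from the dataset). `StandardScaler` rescales each column to zero mean and unit variance, so features measured in very different units end up comparable.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales
# (think minutes per day vs. yearly income); not taken from advertising.csv.
toy = np.array([
    [60.0, 40000.0],
    [70.0, 55000.0],
    [80.0, 70000.0],
    [90.0, 85000.0],
])

scaled = StandardScaler().fit_transform(toy)

print(scaled.mean(axis=0))  # approximately 0 for each column
print(scaled.std(axis=0))   # approximately 1 for each column
```

This matters most for distance-based models such as KNN, where an unscaled large-valued column would dominate the distance.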

In [None]:
# Count missing values per column
df.isnull().sum()

Thankfully, no missing values!

### EDA

In [None]:
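# Restrict to numeric columns: recent pandas versions raise an error
# if corr() is called on a frame that still contains text columns
sns.heatmap(df.select_dtypes(include='number').corr());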

From the heatmap, the daily time spent on site and the daily internet usage appear to be the features that are most correlated. Let's plot a scatterplot to see what it looks like!

In [None]:
sns.scatterplot(x = 'Daily Time Spent on Site', y = 'Daily Internet Usage', data = df, hue = 'Clicked on Ad');

Interesting! Just by considering these two features, it looks like we have quite a clear separation of categories already! Not only that, the boundary appears to be quite linear as well. However, there are some outliers (the orange points) that sit in the blue region, but not vice versa. Classifying these outliers correctly will be a difficult task indeed.

In [None]:
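# distplot is deprecated in recent seaborn releases; histplot with kde=True is its replacement
sns.histplot(df.loc[df['Clicked on Ad'] == 0, 'Daily Internet Usage'], kde=True, stat='density', label='Did not click');
sns.histplot(df.loc[df['Clicked on Ad'] == 1, 'Daily Internet Usage'], kde=True, stat='density', label='Clicked');
plt.legend();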

In [None]:
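# Same comparison for the second feature, again using histplot instead of the deprecated distplot
sns.histplot(df.loc[df['Clicked on Ad'] == 0, 'Daily Time Spent on Site'], kde=True, stat='density', label='Did not click');
sns.histplot(df.loc[df['Clicked on Ad'] == 1, 'Daily Time Spent on Site'], kde=True, stat='density', label='Clicked');
plt.legend();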

The next most correlated feature is age. Let's use a box plot to see if there is a big difference between the two categories.

In [None]:
sns.boxplot(x = 'Clicked on Ad', y = 'Age', data = df);

There seems to be a correlation here as well: older people seem to click on the ad more often!


### Classification (Logistic Regression)

Let's now begin our classification task. We'll start with a simple logistic regression, which gives us a linear decision boundary.
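Why is the boundary linear? Logistic regression predicts class 1 whenever the linear score $w_1 x_1 + w_2 x_2 + b$ is positive, so the boundary is the straight line where that score equals zero. Here's a small sketch on synthetic data (the two Gaussian blobs and the names `X_toy`, `y_toy` are assumptions for illustration) that checks this against scikit-learn's own predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2)),
])
y_toy = np.array([0] * 50 + [1] * 50)

model = LogisticRegression().fit(X_toy, y_toy)
w = model.coef_[0]       # one weight per feature
b = model.intercept_[0]  # bias term

# The model predicts class 1 where w . x + b > 0,
# so the boundary w . x + b = 0 is a straight line in the feature plane.
linear_score = X_toy @ w + b
manual_pred = (linear_score > 0).astype(int)
print(np.array_equal(manual_pred, model.predict(X_toy)))
```

Inside a pipeline with a scaler, the same holds in the scaled feature space, and since standardization is itself a linear map, the boundary stays a straight line in the original units too.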

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

We'll work with just the 2 daily time features explored previously as they already provide us with quite a clear distinction.

In [None]:
x = df[['Daily Time Spent on Site', 'Daily Internet Usage']]
y = df['Clicked on Ad']

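# random_state=42 is an arbitrary choice; fixing it makes the split (and the results below) reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)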

In [None]:
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

In [None]:
pipe.fit(x_train,y_train)
y_pred = pipe.predict(x_test)

Let's see how our model does!

In [None]:
from sklearn.metrics import classification_report as cr, confusion_matrix as cm

In [None]:
print(cm(y_test,y_pred))
print(cr(y_test,y_pred))

Wow! This is actually pretty amazing already! This makes me wonder if the data is from real-life or not =O
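As a side note on reading those two printouts: the precision and recall in the classification report come straight from the confusion matrix counts. Here's a sketch with made-up labels (`y_true` and `y_hat` are hypothetical, not this notebook's results) showing the arithmetic.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration; not the notebook's actual results.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary labels, rows are the true classes and columns are the predictions,
# so ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted clicks, how many were real clicks
recall = tp / (tp + fn)     # of the real clicks, how many were found

print(tn, fp, fn, tp)
print(precision, recall)
```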

Nevertheless, as an exercise, let's try to plot out the decision boundary using just the 2 daily time features!

In [None]:
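def plotBoundary(x, classifier):
    # Pad the plotting range by 1 unit on each side of the data
    x1_max = np.max(x['Daily Time Spent on Site']) + 1
    x1_min = np.min(x['Daily Time Spent on Site']) - 1
    x2_max = np.max(x['Daily Internet Usage']) + 1
    x2_min = np.min(x['Daily Internet Usage']) - 1

    # Evaluate the classifier on a dense grid covering the feature plane
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, 0.1), np.arange(x2_min, x2_max, 0.1))
    # Keep the training column names so the fitted pipeline does not warn about missing feature names
    features = pd.DataFrame({
        'Daily Time Spent on Site': xx1.ravel(),
        'Daily Internet Usage': xx2.ravel(),
    })
    predictions = classifier.predict(features).reshape(xx1.shape)
    # Contour lines between the 0 and 1 regions trace the decision boundary
    plt.contour(xx1, xx2, predictions);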

In [None]:
plotBoundary(x,pipe)
sns.scatterplot(x = 'Daily Time Spent on Site', y = 'Daily Internet Usage', data = x_test, hue = y_test);

As expected, we managed to capture most of the points, except for the outliers mentioned previously. The linear decision boundary appears to be a pretty good fit!

### Classification (KNN)

Let's try to use the K Nearest Neighbors Classifier this time!

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
pipe2 = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', KNeighborsClassifier())
])

param_grid = {
    'clf__n_neighbors': list(range(5,55,5))
}

grid = GridSearchCV(pipe2, param_grid = param_grid, cv = 5)


In [None]:
grid.fit(x_train,y_train)

In [None]:
print(grid.best_params_)
print(grid.best_score_)

Looks like 25 neighbors was the best choice out of the given range in this run! (The exact value can shift with a different train/test split.)
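`best_params_` only reports the winner, so it can be worth checking how close the other candidates were. Here's a sketch that tabulates `cv_results_` for the same grid, run on synthetic stand-in data (`X_toy`, `y_toy` from `make_classification` are assumptions; in the notebook you would look at `grid.cv_results_` directly).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the advertising dataset).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

search = GridSearchCV(
    Pipeline([('scaler', StandardScaler()), ('clf', KNeighborsClassifier())]),
    param_grid={'clf__n_neighbors': list(range(5, 55, 5))},
    cv=5,
)
search.fit(X_toy, y_toy)

# One row per candidate, sorted so the best mean CV accuracy comes first.
results = (
    pd.DataFrame(search.cv_results_)
    [['param_clf__n_neighbors', 'mean_test_score', 'std_test_score']]
    .sort_values('mean_test_score', ascending=False)
)
print(results)
```

If several values of `n_neighbors` score within one standard deviation of each other, the "best" one is probably not meaningfully better than its neighbors in the table.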

In [None]:
y_pred2 = grid.predict(x_test)
print(cm(y_test,y_pred2))
print(cr(y_test,y_pred2))

In [None]:
plotBoundary(x,grid)
sns.scatterplot(x = 'Daily Time Spent on Site', y = 'Daily Internet Usage', data = x_test, hue = y_test);

That's a cool-looking boundary =) But unfortunately, this has caused us to miss some points near the boundary. Compare this against the logistic regression case and we see a few orange points very close to the boundary: they were classified correctly before, but now they are misclassified.
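Rather than eyeballing the two plots, we could also count the test points where the two models disagree and see which one got them right. Here's a rough sketch on synthetic data (the dataset, the split, and `n_neighbors=25` are assumptions for illustration; in the notebook the inputs would be `y_pred`, `y_pred2` and `y_test`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the advertising dataset).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)

pred_lr = LogisticRegression().fit(X_tr, y_tr).predict(X_te)
pred_knn = KNeighborsClassifier(n_neighbors=25).fit(X_tr, y_tr).predict(X_te)

# Indices of test points where the two models give different labels
disagree = np.flatnonzero(pred_lr != pred_knn)

# With two classes, exactly one model is right on each disagreement
lr_right = int(np.sum(pred_lr[disagree] == y_te[disagree]))
knn_right = int(np.sum(pred_knn[disagree] == y_te[disagree]))
print(len(disagree), lr_right, knn_right)
```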

Overall, I think this is a pretty kind dataset because we didn't need to do any data cleaning, and some features were already strongly correlated so it was pretty easy to pick out the important ones! Hope you enjoyed the read!