# <a id=1>Clustering for Click-Through-Rate using KMeans, Gaussian Mixture and Text Processing</a>

The goal of this kernel is to cluster the given sample of ad topics with KMeans and GMM, with and without text processing, and to assess the resulting clusters with various metrics and visual methods.<br>
The data consists of 10 features, *viz.*, 'Daily Time Spent on Site', 'Age', 'Area Income', 'Daily Internet Usage', 'Ad Topic Line', 'City', 'Male', 'Country', 'Timestamp' and 'Clicked on Ad'. The metadata of the features is as follows:<br>
 - **Daily Time Spent on Site**: Total time spent by the user on the site in minutes.
 - **Age**: User's age.
 - **Area Income**: Average income of the user's geographical area.
 - **Daily Internet Usage**: Average minutes per day the user is on the internet.
 - **Ad Topic Line**: Banner topic line of the advertisement.
 - **City**: 	City of the user.
 - **Male**: Gender of user (male=1,female=0)
 - **Country**: Country of the user.
 - **Timestamp**: Time at which the user clicked on an ad or, otherwise, closed the window.
 - **Clicked on Ad**: ***TARGET COLUMN***  This is a binary feature: 0 refers to the case where a user didn't click the advertisement, while 1 refers to the case when the advertisement is clicked.

This kernel is divided into the following sections:
 
* [Introduction](#1)
* [Importing Required Packages](#2)
* [Import dataset and get knowabouts](#3)
* [Data Visualization](#4)
* [Pre-processing](#5)
* [Clustering](#6)
  *  [Using KMeans](#6.1)
  *  [Using KModes](#6.2)
  *  [Using Gaussian Mixture Model](#6.3)
  *  [Using OPTICS](#6.4)
* [Evaluation of clusters](#7)
* [Re-cluster after text-processing](#8)

# <a id= 2> Importing Required Packages</a>
Here the imported packages are divided as per their usage. 
 * NumPy, Pandas, Matplotlib and Seaborn: For data handling and visualisation
 * re: RegEx package for text manipulation
 * sklearn packages: For pre-processing, clustering and validation
 * nltk.stem: For lemmatizing the words in ad topics so that inflected forms do not become duplicate features in the vectorizer.

In [None]:
import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib.pyplot as plt

import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

from sklearn.mixture import GaussianMixture as GMM
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

from sklearn.metrics import adjusted_rand_score,calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score,completeness_score,homogeneity_score,v_measure_score
from nltk.stem import WordNetLemmatizer

from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display, HTML  # public import path; IPython.core.display is deprecated for this
pd.options.display.max_rows = 500
InteractiveShell.ast_node_interactivity = "all"
warnings.filterwarnings("ignore")

# <a id = "3">Import dataset and get knowabouts</a>
This section is about loading the data and understanding its structure. Here we look at the skew of the data, null values and their occurrence with respect to other features, and the statistical summary of the data.

In [None]:
df= pd.read_csv('/kaggle/input/advertising/advertising.csv',parse_dates = ['Timestamp'])
display(HTML('<h2 id = "inf">Basic Information about data like data-type and count</h2>'))
df.info()
display(HTML('<h2>Statistical summary of data</h2>'))

df.describe()
display(HTML('<h2>Random Sample size of 5 observations</h2>'))

df.sample(5).T

As the [basic information of data](#inf) shows, all features have 1000 non-null values, which means there are no null values. Still, let us take a quick look at how many countries and cities there are and what time span the dataset covers.

In [None]:
display(HTML('<h2 id="uni">Unique Values in dataset</h2>'))
df.nunique()
display(HTML('<b> Besides these it has <i>'+str(df['Timestamp'].dt.year.nunique())+'</i> year values and <i>'
             +str(df['Timestamp'].dt.month.nunique())+'</i> month values in Timestamp feature</b>'))

For ease of operation let us remove spaces in feature names.

In [None]:
f_dct = {n : re.sub('[^A-Za-z0-9]+','_',n) for n in df.columns.values}
df.rename(columns = f_dct,inplace=True)

# <a id=4>Data Visualization</a>
Now let us go a little deeper into the data. To understand it better, we plot some graphs to identify the relationships between features and their distributions. First, we look for outliers with a box-and-whisker plot of daily time spent, age and internet usage. The other columns are categorical and are therefore not included. The area income column is also left out, as its much larger scale would dominate the graph.
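
As a numeric companion to the box plot, the usual whisker rule (points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers) can be sketched as follows. The values are hypothetical toy ages, not taken from the advertising dataset, and the variable names are only for illustration.

```python
import pandas as pd

# Toy values (hypothetical, not from the advertising dataset)
ages = pd.Series([19, 23, 25, 28, 31, 35, 38, 41, 44, 95], name="Age")

# Quartiles and interquartile range
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers.tolist())
```

The same rule could be applied column by column to the features shown in the box plot if a count of outliers is preferred over a visual check.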

In [None]:
p=df[['Daily_Time_Spent_on_Site', 'Age', 'Daily_Internet_Usage']].boxplot(figsize = (10,8),grid=True,fontsize=10)
plt.suptitle('Box plot for features',fontsize=15)

While checking for [unique values](#uni) we found that there are 237 countries and 966 cities in the dataset. Let's check how these affect click-through.

In [None]:
pd.crosstab(df.Country,df.Clicked_on_Ad).sort_values(1,ascending=False)

In [None]:
pd.crosstab(df.City,'count').sort_values('count',ascending=False)

Now we check the relationships among the features with a pairplot.

In [None]:
# pairplot creates its own figure, so no separate plt.figure() call is needed
p = sns.pairplot(df, hue='Clicked_on_Ad',
                 vars=['Daily_Time_Spent_on_Site', 'Age', 'Area_Income', 'Daily_Internet_Usage'],
                 diag_kind='kde', palette='bright')
plt.show()

# <a id=5>Pre-processing</a>
Under pre-processing, we get the data ready for modelling. The data is cleansed, redundant or useless features are removed, and new features are created as required.
<p> Here we do not have any NaN values, so there is no need for imputation. Further, we have 2 text features (City and Country) which will not be useful and are dropped. Also, the Timestamp feature is not meaningful by itself, but its parts such as hour, day of month, month and day of week are useful and hence are generated. So, let's start with feature creation.
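
To make the extracted parts concrete, here is a minimal sketch of the pandas `.dt` accessor on a single hypothetical timestamp (the date is made up for illustration). Note that `weekday` counts from Monday=0 to Sunday=6.

```python
import pandas as pd

# One hypothetical timestamp; 1 January 2024 was a Monday
ts = pd.Series(pd.to_datetime(["2024-01-01 13:45:00"]))

parts = pd.DataFrame({
    "hour": ts.dt.hour,        # 0-23
    "day": ts.dt.day,          # day of month
    "month": ts.dt.month,      # 1-12
    "weekday": ts.dt.weekday,  # Monday=0 ... Sunday=6
})
print(parts)
```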

In [None]:
df['hour']=df['Timestamp'].dt.hour
df['day'] = df['Timestamp'].dt.day
df['month'] = df['Timestamp'].dt.month
df['weekday'] = df['Timestamp'].dt.weekday

In [None]:
display(HTML("<h3>Dropping unusable features</h3> We are dropping here ['Timestamp','City','Country']"))
df.drop(columns=['Timestamp','City','Country'],inplace=True)
display(HTML('<b> Now shape of dataset is '+str(df.shape)+'</b>'))
display(HTML('<h3> New sample of data</h3>'))
df.sample(n=5)

## Scaling
Now, to avoid the effect of different scales across features, we scale the data (leaving out the text column and the target). Since we don't have outliers here, we can use StandardScaler. The standard scaler first calculates the mean $\mu$ and standard deviation $\sigma$ of each feature and then replaces each value with its z-score, which is defined as<p><font size = 12>
    $z = \frac{x-\mu}{\sigma}$</font>
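
As a quick sanity check of the formula, the z-score can be computed by hand and compared with `StandardScaler` on a small toy column. The numbers are arbitrary; the only point is that the scaler uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature data (arbitrary values)
x = np.array([[10.0], [20.0], [30.0], [40.0]])

mu = x.mean(axis=0)
sigma = x.std(axis=0)          # population standard deviation (ddof=0)
z_manual = (x - mu) / sigma    # z = (x - mu) / sigma

z_sklearn = StandardScaler().fit_transform(x)
print(np.allclose(z_manual, z_sklearn))
```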

In [None]:
X = df[ ['Daily_Time_Spent_on_Site', 'Age', 'Area_Income', 'Daily_Internet_Usage', 'Male','weekday','day','hour','month']]
Y = df[['Clicked_on_Ad']].to_numpy().ravel()
X_scaled = StandardScaler().fit_transform(X.copy())

Now our dataset is ready for clustering. We can pass the scaled ndarray **X_scaled** on for further processing.

# <a id =6>Clustering</a>
Here we first cluster on the features in **X_scaled**, i.e. the five original features ['Daily_Time_Spent_on_Site', 'Age', 'Area_Income', 'Daily_Internet_Usage', 'Male'] together with the four derived time features, and compare the result with the known labels **Y**.
<p>There are various clustering models; here we use KMeans, KModes, Gaussian Mixture Model and OPTICS.

## <a id="6.1">K-Means Clustering</a>
In K-Means, we divide the data into a pre-determined number of clusters and check the behaviour of the clusters later on. By default it uses Euclidean distance and tries to minimise the within-cluster sum of squared distances by moving the **means** of the clusters.
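
The quantity being minimised can be checked on toy data: the sketch below recomputes the sum of squared distances from each point to its assigned cluster mean and compares it with the `inertia_` attribute reported by scikit-learn. The two blobs and the names `pts` and `model` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy blobs (synthetic data)
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)

# Centre assigned to each point, then the sum of squared Euclidean distances
assigned = model.cluster_centers_[model.labels_]
sse = ((pts - assigned) ** 2).sum()

print(np.isclose(sse, model.inertia_))
```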

In [None]:
km = KMeans(n_clusters=2)  # K-Means model with two clusters
cluster_km = km.fit_predict(X_scaled)  # fit learns the cluster centres; predict returns the cluster labels

## <a id="6.2">K-Modes Clustering</a>
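K-Modes is similar to K-Means, but each cluster is represented by the **modes** (most frequent values) of its members instead of their means. Again, the data is divided into a pre-determined number of clusters whose behaviour is checked later on. Because modes are defined for categorical data, it uses a simple matching dissimilarity (the number of attributes on which two records differ) rather than Euclidean distance, and minimises it by updating the **modes** of the clusters. Note that the scaled continuous features passed below are therefore treated as categorical values.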
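
The two building blocks of K-Modes, the matching dissimilarity and the per-attribute mode, can be sketched in plain Python. This is an illustration of the idea, not the `kmodes` package's implementation; the helper names and the toy records are hypothetical.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(records):
    """Most frequent category in each attribute position."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*records)]

# Hypothetical categorical records
records = [
    ("male", "weekday", "mobile"),
    ("male", "weekend", "mobile"),
    ("female", "weekday", "mobile"),
]

mode = column_modes(records)
distances = [matching_dissimilarity(r, mode) for r in records]
print(mode, distances)
```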

In [None]:
from kmodes.kmodes import KModes
km1 = KModes(n_clusters=2,init='Cao')
cluster_km1 = km1.fit_predict(X_scaled)

## <a id="6.3">Gaussian Mixture Model</a>

Gaussian Mixture Models (GMMs) model the data as a mixture of Gaussian distributions and are flexible building blocks for other machine learning algorithms. They are popular both because they approximate general probability distributions well and because they remain somewhat interpretable even when the dataset gets very complex. A mixture model does not need to be told which subpopulation each observation belongs to; it learns this by fitting the component distributions to the data.
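
Unlike K-Means, a fitted mixture gives each point a probability of belonging to every component. A minimal sketch on synthetic blobs (the data and the name `toy_gmm` are assumptions for illustration) shows these soft assignments and how the hard labels follow from them.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic Gaussian blobs
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

toy_gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(pts)

# Soft assignment: one probability per component, each row sums to 1
resp = toy_gmm.predict_proba(pts)

# Hard labels are the most probable component for each point
hard = resp.argmax(axis=1)
print(resp[:3].round(3))
print(np.array_equal(hard, toy_gmm.predict(pts)))
```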

In [None]:
gmm = GMM(n_components=2, covariance_type='full', max_iter=100, n_init=10)
cluster_gmm = gmm.fit_predict(X_scaled)  # use the scaled data, consistent with the other models and the metrics below

## <a id="6.4">OPTICS Model</a>
OPTICS (Ordering Points To Identify the Clustering Structure) finds core samples of high density and expands clusters from them. It keeps the cluster hierarchy for a variable neighborhood radius.
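
One practical difference from the models above is that OPTICS does not take the number of clusters as input, and it labels samples that it does not assign to any cluster as `-1`. A small sketch on synthetic data (the points and the `min_samples` value are chosen only for illustration) shows how to count clusters and noise points from the labels.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Two dense synthetic blobs plus one isolated point
rng = np.random.default_rng(0)
pts = np.vstack([
    rng.normal(0, 0.3, (30, 2)),
    rng.normal(5, 0.3, (30, 2)),
    [[20.0, 20.0]],
])

labels = OPTICS(min_samples=5).fit_predict(pts)

# Label -1 marks noise; all other labels are cluster ids
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```

Because the number of clusters is not fixed at two, comparing OPTICS labels with a binary target needs some care.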

In [None]:
from sklearn.cluster import OPTICS
optics = OPTICS(min_samples=2)
cluster_optics = optics.fit_predict(X_scaled)

# <a id="7">Evaluation of clustering</a>
Clusters can be evaluated in two ways:
 - Using indices/metrics
    - **Internal Metrics** - Computed from the data and the cluster labels alone; usable when ground truth is unavailable.
    - **External Metrics** - Compare the cluster labels with known classes; usable when ground truth is available.
 - Using visual methods

In this kernel I have used the **Davies-Bouldin score** and **Calinski-Harabasz score** as internal metrics and **ARI, completeness score, homogeneity score and V-measure** as external metrics.
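
One point worth keeping in mind when reading the external metrics: cluster ids are arbitrary, so a perfect clustering may come out with the labels swapped. ARI is unaffected by this, whereas label-matching measures such as accuracy are not. The toy labels below are made up to show the effect.

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1]
clusters = [1, 1, 1, 0, 0, 0]  # same grouping, cluster ids swapped

ari = adjusted_rand_score(truth, clusters)
acc = accuracy_score(truth, clusters)
print(ari)  # 1.0
print(acc)  # 0.0
```

The same caveat applies to any report that matches cluster ids to class labels directly.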

For visual evaluation, I have used TSNE to check the performance of clustering.

In [None]:
models = [km,km1,gmm,optics]
clst = [cluster_km,cluster_km1,cluster_gmm,cluster_optics]
t ='<table border=1 color = "#000000"><tr><th>model</th><th>ARI</th><th>calinski_harabasz_score</th><th>davies_bouldin_score</th>'
t+='<th>completeness_score</th><th> homogeneity_score </th><th>v_measure_score</th></tr>'
for i in range(4):
    t = t+('<tr><td>'+str(models[i])+'</td>'+'<td>'+str(adjusted_rand_score(Y,clst[i]))+'</td>')
    t = t+('<td>'+str(calinski_harabasz_score(X_scaled,clst[i]))+'</td>')
    t = t+('<td>'+str(davies_bouldin_score(X_scaled,clst[i]))+'</td>')
    t = t+('<td>'+str(completeness_score(Y,clst[i]))+'</td>')
    t = t+('<td>'+str(homogeneity_score(Y,clst[i]))+'</td>')
    t = t+('<td>'+str(v_measure_score(Y,clst[i]))+'</td></tr>')
t+='</table>'    
display(HTML(t))
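
When reading the two internal metrics in the table, a higher Calinski-Harabasz score and a lower Davies-Bouldin score indicate better-separated clusters. The sketch below illustrates this on synthetic blobs by comparing a sensible labelling with a deliberately poor one; the data and label arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

# Two well-separated synthetic blobs of 30 points each
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (30, 2)), rng.normal(5, 0.5, (30, 2))])

good = np.repeat([0, 1], 30)  # one label per blob
bad = np.tile([0, 1], 30)     # alternating labels that mix both blobs

scores = {}
for name, lab in [("good", good), ("bad", bad)]:
    scores[name] = (calinski_harabasz_score(pts, lab), davies_bouldin_score(pts, lab))
    print(name, scores[name])
```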

In [None]:
from sklearn.metrics import classification_report
for i in range(4):
    display(HTML('<h4>'+str(models[i])+'</h4>'))
    print(classification_report(Y,clst[i]))

In [None]:
tsne = TSNE(n_components = 2)
tsne_out = tsne.fit_transform(X_scaled)
fig, axs = plt.subplots(2,2, figsize=(15, 15))
plt.suptitle('TSNE Visualisation for different cluster models',fontsize=15)
for i in range(4):
    p = axs[i//2][i%2].scatter(tsne_out[:, 0], tsne_out[:, 1],marker=10,s=10,linewidths=5,c=clst[i])
    axs[i//2][i%2].set_title(models[i])

# <a id=8>Re-cluster after Text Processing</a>

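The ad topic lines are converted to numeric features with TF-IDF weighting. With $f_{i,j}$ the number of times term $i$ occurs in document $j$, $N$ the number of documents and $df_i$ the number of documents containing term $i$, the textbook form of the weights is given below (scikit-learn's `TfidfVectorizer` applies a smoothed, L2-normalised variant by default, so its values differ from this formula):

<font size =5>$tf_{i,j} = \frac{f_{i,j}}{\max_k f_{k,j}}$
<br>
$w_{i,j}  =  tf_{i,j} \times \log_2\left(\frac{N}{df_i}\right)$
    </font>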
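
A worked example of these weights on a tiny made-up corpus: the term frequency is normalised by the largest count in the document and multiplied by $\log_2(N/df)$. This is a plain-Python sketch of the textbook formula, not of what `TfidfVectorizer` computes, and the documents and variable names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical ad-like documents
docs = ["cheap flights cheap hotels", "cheap hotels", "flights today"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: number of documents containing each term
df_counts = Counter(term for doc in tokenized for term in set(doc))

weights = []
for doc in tokenized:
    counts = Counter(doc)
    max_f = max(counts.values())  # largest term count in this document
    weights.append({
        term: (f / max_f) * math.log2(N / df_counts[term])
        for term, f in counts.items()
    })

print(weights[0])
```

In the first document "cheap" has the highest term frequency, while in the third document "today" gets the largest weight because it occurs in only one document.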

In [None]:
# WordNetLemmatizer needs the WordNet corpus; uncomment on first run:
#import nltk
#nltk.download('wordnet')
topics = []
lemmatizer = WordNetLemmatizer()
for raw_topic in df.Ad_Topic_Line:
    # Replace every non-word character with a space
    topic = re.sub(r'\W', ' ', raw_topic)

    # Remove all single characters surrounded by whitespace
    topic = re.sub(r'\s+[a-zA-Z]\s+', ' ', topic)

    # Remove a single character at the start (unescaped ^ anchors the match)
    topic = re.sub(r'^[a-zA-Z]\s+', ' ', topic)

    # Substitute multiple spaces with a single space
    topic = re.sub(r'\s+', ' ', topic)

    # Remove prefixed 'b'
    topic = re.sub(r'^b\s+', '', topic)

    # Convert to lowercase
    topic = topic.lower()

    # Lemmatize each word and join the words back into one string
    words = [lemmatizer.lemmatize(word) for word in topic.split()]
    topics.append(' '.join(words))



In [None]:
tfidfconverter = TfidfVectorizer(max_features=500, min_df=3, max_df=0.8, stop_words='english')
X = tfidfconverter.fit_transform(topics)

X.toarray()
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
len(tfidfconverter.get_feature_names_out())
tfidfconverter.get_feature_names_out()

In [None]:
df1 = pd.DataFrame(X.toarray(),columns=tfidfconverter.get_feature_names_out())
df1 = pd.concat([df,df1],axis='columns')
df1.drop(columns=['Ad_Topic_Line','Clicked_on_Ad'],inplace=True)
X1 = StandardScaler().fit_transform(df1)

In [None]:
clst_txt = [m.fit_predict(X1) for m in models]

In [None]:
# Same table as before, but scored on the re-clustered labels (clst_txt) and the text-augmented data (X1)
t ='<table border=1 color = "#000000"><tr><th>model</th><th>ARI</th><th>calinski_harabasz_score</th><th>davies_bouldin_score</th>'
t+='<th>completeness_score</th><th> homogeneity_score </th><th>v_measure_score</th></tr>'
for i in range(4):
    t = t+('<tr><td>'+str(models[i])+'</td>'+'<td>'+str(adjusted_rand_score(Y,clst_txt[i]))+'</td>')
    t = t+('<td>'+str(calinski_harabasz_score(X1,clst_txt[i]))+'</td>')
    t = t+('<td>'+str(davies_bouldin_score(X1,clst_txt[i]))+'</td>')
    t = t+('<td>'+str(completeness_score(Y,clst_txt[i]))+'</td>')
    t = t+('<td>'+str(homogeneity_score(Y,clst_txt[i]))+'</td>')
    t = t+('<td>'+str(v_measure_score(Y,clst_txt[i]))+'</td></tr>')
t+='</table>'
display(HTML(t))

In [None]:
for i in range(4):
    print(models[i])
    print(classification_report(Y,clst_txt[i]))

In [None]:
tsne_out = tsne.fit_transform(X1)
fig, axs = plt.subplots(2,2, figsize=(15, 15))
plt.suptitle('TSNE Visualisation for different cluster models',fontsize=15)
for i in range(4):
    p = axs[i//2][i%2].scatter(tsne_out[:, 0], tsne_out[:, 1],marker=10,s=10,linewidths=5,c=clst_txt[i])
    axs[i//2][i%2].set_title(models[i])