# About this data
### Source: https://www.kaggle.com/loveall/clicks-conversion-tracking

1.) ad_id: a unique ID for each ad.

2.) xyz_campaign_id: an ID associated with each ad campaign of XYZ company.

3.) fb_campaign_id: an ID associated with how Facebook tracks each campaign.

4.) age: age of the person to whom the ad is shown.

5.) gender: gender of the person to whom the ad is shown.

6.) interest: a code specifying the category to which the person’s interest belongs (interests are as mentioned in the person’s Facebook public profile).

7.) Impressions: the number of times the ad was shown.

8.) Clicks: number of clicks on that ad.

9.) Spent: Amount paid by company xyz to Facebook, to show that ad.

10.) Total conversion: Total number of people who enquired about the product after seeing the ad.

11.) Approved conversion: Total number of people who bought the product after seeing the ad.

In [None]:
#importing libraries

import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
# reading the file

dataset= pd.read_csv('../input/clicks-conversion-tracking/KAG_conversion_data.csv')

# Data Exploration

In [None]:
dataset.info()

In [None]:
dataset.head()

In [None]:
dataset.describe().T

### Separating 'Categorical' and 'Numerical' columns 

#### a) Categorical 

In [None]:
categorical_features=[ x for x in dataset.columns if dataset[x].dtype == 'O']
categorical_features

There are only two categorical columns: age and gender.
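The same split can also be made with `select_dtypes`, which does not depend on the exact string dtype pandas uses. A minimal sketch on a made-up frame (the `toy` values are illustrative, not rows from the dataset):

```python
import pandas as pd

# Toy frame with made-up values that mimics a few of the dataset's columns
toy = pd.DataFrame({
    'ad_id': [1, 2, 3],
    'age': ['30-34', '35-39', '30-34'],
    'gender': ['M', 'F', 'F'],
    'Clicks': [1, 0, 4],
})

# Everything that is not numeric is treated as categorical
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()
numerical_cols = toy.select_dtypes(include='number').columns.tolist()

print(categorical_cols)
print(numerical_cols)
```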

In [None]:
#finding count of different elements in the column

for x in categorical_features:
    abc = dataset[x].value_counts()
    print(abc)

Plotting the categorical variable to check the count of each element in it.

In [None]:
for i in categorical_features:
    # pass the column as x explicitly; newer seaborn versions expect keyword arguments here
    sns.countplot(x=dataset[i], hue=dataset[i])
    plt.ylim(0,650)
    plt.legend(loc='upper right', facecolor='yellow', framealpha=0.5, bbox_to_anchor=(1.2,1))
    plt.show()

In [None]:
categorical_features_with_nan = [x for x in categorical_features if dataset[x].isnull().sum()>0]
categorical_features_with_nan

The 'categorical_features_with_nan' list is empty, which shows that there are no null values in the categorical columns.

#### b) Numerical

In [None]:
numerical_features = [x for x in dataset.columns if x not in categorical_features]
numerical_features

In [None]:
numerical_features_with_nan = [x for x in numerical_features if dataset[x].isnull().sum()>0]
numerical_features_with_nan

The 'numerical_features_with_nan' list is empty as well, which shows that there are no null values in the numerical columns.

From the above code, we can conclude that this data has 'NO NULL' values.
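The per-column checks can also be done in a single pass with `isnull().sum()`. A small sketch on a made-up frame, where one missing value is planted on purpose so the check has something to find (the real dataset has none):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; 'Spent' has one missing entry on purpose
toy = pd.DataFrame({
    'age': ['30-34', '35-39', '40-44'],
    'gender': ['M', 'F', 'M'],
    'Clicks': [1, 0, 4],
    'Spent': [1.43, np.nan, 6.2],
})

# Count missing values per column, then keep only the columns that have any
null_counts = toy.isnull().sum()
columns_with_nan = null_counts[null_counts > 0].index.tolist()

print(columns_with_nan)
```

An empty list here would mean the frame has no null values at all.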

## DATA VISUALISATION

In [None]:
# DataFrame.plot opens its own figure, so the size is passed to the plot call
dataset.groupby(categorical_features)[['Clicks','Spent']].mean().plot.bar(figsize=(11,10))
plt.title('Average clicks/money spent vs Male/Female of different age groups', fontweight="bold")
plt.ylabel('average')

plt.show()

From the above graph, we can see that in every age group females clicked on the ads more, on average, than males of the same age group.

More money was also spent, on average, on ads shown to females in every age group.

In [None]:
fig= plt.figure(figsize=(15,10))

dataset.groupby('Total_Conversion')['Spent'].mean().plot.barh()
plt.title('Average Money spent vs Ads Enquiry', fontweight="bold")
plt.ylabel('Ads enquiry')

abc= round(dataset.groupby('Total_Conversion')['Spent'].mean(),2)

for index, value in enumerate(abc):
    plt.text(value, index, str(value))       

#### Insights from above graph:

When the company spent an average of:

1) $10-200 on ads, it got 120 ad enquiries in return.

2) $200-400 on ads, it got 319 ad enquiries in return.

3) more than $400 on ads, it got 114 ad enquiries in return.

We can conclude that spending between $200 and $400 on ads seems reasonable, because in this data that range yields the most ad enquiries.
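One way to tabulate range-based statements like these directly is to bucket `Spent` with `pd.cut` and sum the enquiries per bucket. This is a sketch of that idea, not necessarily how the figures above were obtained; the rows are made up and the bin edges are assumptions that mirror the ranges in the text:

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values
toy = pd.DataFrame({
    'Spent': [15.0, 120.0, 250.0, 390.0, 450.0, 610.0],
    'Total_Conversion': [1, 2, 5, 7, 3, 4],
})

# Assumed bin edges: (0, 200], (200, 400], (400, inf)
bins = [0, 200, 400, np.inf]
labels = ['0-200', '200-400', '>400']
toy['spend_range'] = pd.cut(toy['Spent'], bins=bins, labels=labels)

# Total enquiries that fall into each spend range
enquiries_per_range = toy.groupby('spend_range', observed=False)['Total_Conversion'].sum()
print(enquiries_per_range)
```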

In [None]:
fig= plt.figure(figsize=(9,10))

value=[dataset['Total_Conversion'].mean(), dataset['Approved_Conversion'].mean()]
labels= ['Enquiry of product', 'Buying the product']
plt.pie(value, labels = labels, autopct='%.2f%%')

plt.show()

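In the pie chart, enquiries make up around 3/4th of the combined total and purchases around 1/4th. Since the two slices are shares of their sum, this means roughly one purchase for every three enquiries.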
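The pie chart slices are shares of the sum of the two averages, which is not the same thing as the fraction of enquiries that end in a purchase. A small worked example with made-up averages (chosen only to give a 75/25 split) shows the difference:

```python
# Made-up averages, chosen so that the pie chart would show a 75% / 25% split
total_conversion_mean = 3.0
approved_conversion_mean = 1.0

# What the pie chart shows: each value as a share of the combined sum
pie_share_approved = approved_conversion_mean / (total_conversion_mean + approved_conversion_mean)

# Fraction of enquiries that turned into purchases
purchase_rate = approved_conversion_mean / total_conversion_mean

print(round(pie_share_approved, 2), round(purchase_rate, 2))
```

Here the pie share is 0.25 while the purchase rate is about 0.33.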

In [None]:
fig= plt.figure(figsize=(12,10))

dataset.groupby(['Approved_Conversion'])['Impressions'].mean().plot.barh()
xyz=round(dataset.groupby(['Approved_Conversion'])['Impressions'].mean(),2)
plt.title('Average Impressions vs Product Bought', fontweight="bold")

#to put the value above the bar
for index, value in enumerate(xyz):
    plt.text(value, index, str(value))

#### Insights:
When the ads were shown more than 1,300,000 times, a total of 68 products were bought, and

64 products were sold when the impressions were below 1,300,000.

i.e. in this data, more impressions go together with more purchases, which suggests that more publicity leads more people to buy.

#### Separating data into 2 parts (men and women)
 

In [None]:
women= dataset[dataset['gender']=='F']
men= dataset[dataset['gender']=='M']

men_abv_avg_Impression = men[men['Impressions']>dataset['Impressions'].mean()]
women_abv_avg_Impression = women[women['Impressions']>dataset['Impressions'].mean()]

men_below_avg_Impression = men[men['Impressions']<dataset['Impressions'].mean()]
women_below_avg_Impression = women[women['Impressions']<dataset['Impressions'].mean()]

In [None]:
fig, (ax1,ax2) = plt.subplots(2,1, figsize=(8,9))

sns.distplot(men_abv_avg_Impression.Clicks, bins=10, kde= False, label= 'men_Clicks', ax= ax1)
sns.distplot(women_abv_avg_Impression.Clicks, bins=10, kde= False, label= 'women_Clicks', ax= ax1)
ax1.legend()
ax1.set_ylabel('count')
ax1.set_title('Above average impression vs Clicks', fontweight="bold", fontname="Times New Roman", size=28)

sns.distplot(men_below_avg_Impression.Clicks, bins=10, kde= False, label= 'men_Clicks', ax= ax2)
sns.distplot(women_below_avg_Impression.Clicks, bins=10, kde= False, label= 'women_Clicks', ax= ax2)
ax2.set_ylabel('count')
ax2.set_title('Below average impression vs Clicks', fontweight="bold", fontname="Times New Roman", size=28)
ax2.legend()

plt.tight_layout()
plt.show()

When women were shown an above-average number of ads, they tended to click on them more often than men who were shown the same amount of ads.

When the number of ads shown was below average, the number of clicks fell drastically for both men and women, but women still clicked more than men did.
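The same comparison can be put into numbers with a boolean flag and a `groupby`. A minimal sketch on made-up rows (the `above_avg` column name is an assumption introduced for this example):

```python
import pandas as pd

# Toy rows with made-up values
toy = pd.DataFrame({
    'gender': ['F', 'M', 'F', 'M'],
    'Impressions': [100, 200, 900, 1000],
    'Clicks': [2, 1, 30, 20],
})

# Flag rows whose impressions are above the overall average
toy['above_avg'] = toy['Impressions'] > toy['Impressions'].mean()

# Mean clicks for each combination of flag and gender
summary = toy.groupby(['above_avg', 'gender'])['Clicks'].mean()
print(summary)
```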

In [None]:
#removing the 'id' columns from the dataset
numerical_features_new = [x for x in numerical_features if '_id' not in x]
numerical_features_new

#### Giving numerical value to the categorical values by using LabelEncoder


In [None]:
from sklearn.preprocessing import LabelEncoder
le= LabelEncoder()

for x in categorical_features:
    dataset[x]= le.fit_transform(dataset[x])
    print(x, le.classes_)
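
`LabelEncoder` sorts the distinct values and replaces each one with its position in that sorted list, so the printed `classes_` is the lookup table for the codes. A small sketch with made-up age groups:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up age groups, deliberately not in sorted order
ages = ['30-34', '45-49', '35-39', '30-34']

encoder = LabelEncoder()
codes = encoder.fit_transform(ages)

# classes_ holds the sorted distinct values; code i stands for classes_[i]
print(list(encoder.classes_))
print(list(codes))

# inverse_transform maps the codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```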

In [None]:
for i in numerical_features_new:
    sns.distplot(dataset[i])
    plt.title(i)
    plt.ylabel('Count')
    plt.show()

#### Bringing all the values to the same scale (zero mean and unit variance)

In [None]:
from sklearn.preprocessing import StandardScaler
scale= StandardScaler()
dataset_scaled = pd.DataFrame(scale.fit_transform(dataset) ,columns = dataset.columns)
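
Note that `StandardScaler` standardizes each column to zero mean and unit variance, so the scaled values are not confined to the range 0 to 1. If a strict 0 to 1 range is wanted, `MinMaxScaler` is the usual alternative. A short comparison on made-up values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up column with a large value at the end
values = np.array([[1.0], [2.0], [3.0], [10.0]])

standardized = StandardScaler().fit_transform(values)  # zero mean, unit variance
rescaled = MinMaxScaler().fit_transform(values)        # squeezed into [0, 1]

print(standardized.ravel().round(3))
print(rescaled.ravel().round(3))
```

The standardized column contains negative values and values above 1, while the min-max version stays inside 0 to 1.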

#### Finding the Correlation

In [None]:
fig= plt.figure(figsize=(12,10))

sns.heatmap(dataset_scaled.corr().abs() , annot= True)
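
Pearson correlation does not change when columns are standardized, so the heatmap would look the same on the unscaled data. Also, `.abs()` hides the sign, so a strong negative correlation would look like a strong positive one. A quick check of the first point on made-up values:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy rows with made-up values
toy = pd.DataFrame({
    'Impressions': [1000, 5000, 20000, 80000],
    'Clicks': [0, 1, 4, 15],
    'Spent': [0.0, 1.5, 6.0, 25.0],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Correlation matrices before and after scaling should match
same_correlation = np.allclose(toy.corr(), scaled.corr())
print(same_correlation)
```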

### Keeping only the related columns

In [None]:
data= dataset_scaled.copy()
data= data[['Impressions', 'Clicks', 'Spent']]

## Applying algorithm (kmeans)

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertia= []
k= np.arange(1,9)
for i in k:
    model = KMeans(n_clusters= i, max_iter=5)
    ab = model.fit(data.values)
    inertia.append(pd.Series({'k': i,
                              'inertia': model.inertia_
                             }))

In [None]:
inertias= pd.concat(inertia, axis=1).T.set_index('k')
inertias

The lower the inertia, the tighter the clusters.

Inertia always decreases as k increases, so what matters is where the drop slows down: the drops from k=1 to k=2 and from k=2 to k=3 are much larger than the later ones.

After that, inertia keeps decreasing with k, but only at a very low rate.
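Inertia is the sum of squared distances from each point to the centre of the cluster it was assigned to. A minimal sketch on four made-up points that recomputes it by hand and compares it with `inertia_` (`n_init` and `random_state` are set here only to make the run repeatable):

```python
import numpy as np
from sklearn.cluster import KMeans

# Four made-up points forming two obvious pairs
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Centre assigned to each point, then the sum of squared distances to it
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((points - assigned_centers) ** 2).sum()

print(manual_inertia, model.inertia_)
```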

In [None]:
# Plot k vs inertias

plt.figure(figsize=(12,6))
plt.plot(inertias.index, inertias['inertia'], '-*')
plt.title('Best k using Elbow method')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(k)
plt.show()

From the above elbow graph, we can see that inertia drops at a high rate up to k=2 and flattens out afterwards, so we can choose k=2, or k=3 to be on the safe side.

In [None]:
score=[]
k= np.arange(2,9)
for i in k:
    model = KMeans(n_clusters= i, max_iter=5)
    pred = model.fit_predict(data.values)
    sil_score= silhouette_score(data.values, pred)
    score.append(pd.Series({'k': i,
                            'Score': sil_score
                             }))

In [None]:
result= pd.concat(score, axis=1).T.set_index('k')
result

From the above table, it is clear that k=2 gives the highest silhouette score, so we'll choose it.
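Because KMeans starts from random centres, the scores can shift a little between runs. A sketch of the same selection on synthetic data (two made-up blobs, not the ads dataset), with `random_state` and `n_init` fixed as an assumption to make the choice of k repeatable:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

scores = {}
for k in range(2, 5):
    # Fixed random_state and n_init so repeated runs give the same labels
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Pick the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```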

In [None]:
model = KMeans(n_clusters= 2, max_iter=5)
model.fit(data.values)

In [None]:
data['Score']= model.predict(data.values)
data.sample(5)

A Silhouette score always lies between -1 and 1. A score:

1) close to -1 shows BAD CLUSTERING,

2) close to 0 shows CLUSTERS ARE OVERLAPPED,

3) close to 1 shows GOOD CLUSTERING.
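To get a feel for this scale, the sketch below scores two labelings of the same made-up points: one that follows the two obvious groups and one that mixes them.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Six made-up points in two tight groups far apart
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])

good_labels = [0, 0, 0, 1, 1, 1]   # follows the two groups
mixed_labels = [0, 1, 0, 1, 0, 1]  # mixes points from both groups

good_score = silhouette_score(points, good_labels)
mixed_score = silhouette_score(points, mixed_labels)

print(round(good_score, 2), round(mixed_score, 2))
```

The labeling that follows the groups scores close to 1, while the mixed labeling lands near 0.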

### Visualisation of the result

In [None]:
sns.lmplot(x='Impressions', y='Clicks', data= data, hue='Score',fit_reg=False, markers=["o", "^"] ,palette='Set1')
sns.lmplot(x='Spent', y='Clicks', data= data, hue='Score',fit_reg=False, markers=["o", "^"], palette="ocean" )
sns.lmplot(x='Impressions', y='Spent', data= data, hue='Score',fit_reg=False, markers=["o", "^"] ,palette="cool" )
plt.show()