# Stirring Minds Task:-

In [None]:
#importing important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#loading the dataset
data=pd.read_csv("../input/clicks-conversion-tracking/KAG_conversion_data.csv")

# Data Preparation:- 

In [None]:
data.head()

In [None]:
data.shape

**So the dataset has 1143 entries with 11 features.**

In [None]:
#checking if any missing value is there
data.isnull().sum()

**Hence no missing values are there. That's cool!**

In [None]:
#Checking the info of the data
data.info()

In [None]:
data.describe()

In [None]:
#Taking a copy of the dataset to do further operations
df=data.copy()

In [None]:
df.head()

**As we can see, 'age' is given as a range and 'gender' is a categorical column. So we have to convert both to numeric values for further calculations.**
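
As a rough illustration of this conversion, the sketch below derives the midpoint of each age range from its label instead of typing the numbers by hand, and encodes gender with a dictionary. The helper `range_midpoint` and the toy series are assumptions made for this example and are not part of the notebook.

```python
import pandas as pd


def range_midpoint(label: str) -> float:
    """Return the midpoint of a 'low-high' age label, e.g. '30-34' -> 32.0 (hypothetical helper)."""
    low, high = (int(part) for part in label.split("-"))
    return (low + high) / 2


# Toy data: the four age labels used in this notebook and a few gender codes
ages = pd.Series(["30-34", "35-39", "40-44", "45-49"])
midpoints = ages.map(range_midpoint)

# Encode gender as 0 (male) and 1 (female)
gender_codes = pd.Series(["M", "F", "M"]).map({"M": 0, "F": 1})

print(midpoints.tolist())
print(gender_codes.tolist())
```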

In [None]:
#Checking the unique elements of 'age' column
df['age'].unique()


In [None]:
#replace each age range with its midpoint
df['age']=df['age'].map({'30-34':32,'35-39':37,'40-44':42,'45-49':47})

In [None]:
df['age']

In [None]:
#replace 'M' (male) with 0 and 'F' (female) with 1
df['gender']=df['gender'].map({'M':0,'F':1})

In [None]:
df['gender']

In [None]:
#checking types of columns
df.dtypes

Looking at the dataset, we can choose the relevant columns to work on further.

In [None]:
#taking important columns for processing (copy() so new columns can be added without a SettingWithCopyWarning)
ds=df[['age','gender','interest','Impressions','Clicks','Spent','Total_Conversion','Approved_Conversion']].copy()

In [None]:
ds.head()

In [None]:
#creating some new calculated columns for our use
#Amount spent per click, i.e. SPC=Spent/Clicks (rows with zero clicks become NaN instead of dividing by zero)
ds["SPC"]=ds["Spent"]/ds["Clicks"].replace(0,np.nan)
#Share of impressions that turned into clicks, i.e. CPI%=(Clicks/Impressions)*100
ds["CPI"]=(ds["Clicks"]/ds["Impressions"])*100

In [None]:
#Checking the complete preprocessed dataset
ds.head()
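
Ratio columns such as spend per click need some care when the denominator can be zero. The sketch below, on made-up numbers, shows one possible way to handle this: ads with zero clicks get `NaN` for SPC rather than an infinite value. The `toy` frame and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: the second ad received impressions but no clicks
toy = pd.DataFrame({
    "Impressions": [1000, 2000, 500],
    "Clicks": [10, 0, 5],
    "Spent": [15.0, 0.0, 4.0],
})

# Spend per click: replace zero clicks with NaN so the division yields NaN, not inf
toy["SPC"] = toy["Spent"] / toy["Clicks"].replace(0, np.nan)

# Clicks per impression, as a percentage
toy["CPI"] = toy["Clicks"] / toy["Impressions"] * 100

print(toy)
```

Pandas skips `NaN` values by default in aggregations such as `mean()` and in `corr()`, so these rows do not distort later statistics.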

# Visualisations to get useful Insights:-

In [None]:
#Let's check the correlation between features using heatmap
f,ax = plt.subplots(figsize=(15, 10))
sns.heatmap(ds.corr(method='pearson'), annot=True, fmt= '.1f',ax=ax, cmap="BrBG")

**Observation:- As expected, the amount spent correlates strongly with both the number of clicks and the number of impressions. Impressions also correlate well with total conversions and approved conversions. Total conversions and approved conversions are strongly correlated too, which suggests that a large share of conversions end up approved. That's something very positive.**
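
For readers who want to see what each cell of such a heatmap contains, here is a small sketch that computes the Pearson correlation coefficient by hand and compares it with NumPy's built-in result. The function name `pearson_r` and the numbers are made up for this example.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both variables
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# Toy data: spend grows almost linearly with impressions
impressions = [1000, 2000, 3000, 4000]
spent = [2.0, 4.1, 5.9, 8.0]

r = pearson_r(impressions, spent)
print(round(r, 4))
print(np.isclose(r, np.corrcoef(impressions, spent)[0, 1]))
```

A value close to 1 means the two columns rise together almost linearly. Note that a heatmap formatted with `fmt='.1f'` rounds every coefficient to one decimal place.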

# Let's check if we get any insight from relation of age with other columns:-

In [None]:
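#Note: count() returns the number of rows (ads) in each age group; use sum() for totals or mean() for averages
cba=ds.groupby("age")["Clicks"].count() #Ads with a Clicks value per age group
Iba=ds.groupby("age")["Impressions"].count() #Ads with an Impressions value per age group
conv_age=ds.groupby("age")["Total_Conversion"].count() #Ads with a Total_Conversion value per age group
CPI_age=ds.groupby("age")["CPI"].count() #Ads with a CPI value per age group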
plt.subplot(221)
ax = cba.plot(kind='bar', figsize=(10,6), color="blue", fontsize=10)
ax.set_title("Clicks by age", fontsize=16)
ax.set_xlabel("Age", fontsize=12);
ax.set_ylabel("Clicks", fontsize=12);
plt.subplot(222)
ix = Iba.plot(kind='bar', figsize=(10,6), color="gray", fontsize=10)
ix.set_title("Impressions by age", fontsize=16)
ix.set_xlabel("Age", fontsize=12);
ix.set_ylabel("Impressions", fontsize=12);
plt.subplot(223)
bx = conv_age.plot(kind='bar', figsize=(10,6), color="green", fontsize=10)
bx.set_title("Conversion by age", fontsize=16)
bx.set_xlabel("Age", fontsize=12);
bx.set_ylabel("Conversion", fontsize=12);
plt.subplot(224)
cx = CPI_age.plot(kind='bar', figsize=(10,6), color="maroon", fontsize=10)
cx.set_title("CPI by age", fontsize=16)
cx.set_xlabel("Age", fontsize=12);
cx.set_ylabel("CPI", fontsize=12);
plt.tight_layout()
plt.show()

**Observation:- Since `.count()` returns the number of rows (ads) in each group, these charts show how many ads were aimed at each age group rather than the summed clicks, impressions or conversions. That number is highest for the 30-34 age group and lowest for the 40-44 age group, and the CPI chart follows the same pattern. Hence the campaign appears to be most focused on the 30-34 age group.**
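
The choice of aggregation matters for charts like these. The sketch below, on a made-up frame, contrasts `count`, `sum` and `mean` for the same grouping; the `toy` values are assumptions for illustration and not results from the dataset.

```python
import pandas as pd

# Toy data: two ads for age 32 and three ads for age 37
toy = pd.DataFrame({
    "age": [32, 32, 37, 37, 37],
    "Clicks": [10, 0, 1, 2, 3],
})

# count = number of rows, sum = total clicks, mean = average clicks per ad
by_age = toy.groupby("age")["Clicks"].agg(["count", "sum", "mean"])
print(by_age)
```

In this toy example age 37 has more ads (higher count) while age 32 has more clicks (higher sum), so the two aggregations can rank groups differently.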

# Let's check if we get any insight from relation of gender & age together with other columns:-

**Observation:- In each case the numbers are higher for males (0) than for females (1).**

In [None]:
#Now let's try to find some relation between age, gender and other important parameters
for column in ['Clicks','Impressions','Spent','Total_Conversion','CPI']:
    with sns.axes_style(style='ticks'):
        #recent seaborn versions require x, y and hue as keyword arguments
        g = sns.catplot(x="age", y=column, hue="gender", data=ds, height=8, kind="box")
        g.set_axis_labels("Age", column);

**Observations:- Here we get some interesting results. The first graph shows that we get more clicks from female customers than from male customers in each age group. In terms of impressions, however, we are not focusing as much on female customers, which should be addressed. Doing so may also increase conversion rates, especially in the 30-34 age group, from which we get the most clicks. The third graph shows that in some cases we spend more on males than on females, which should be reversed: in every age group, more should be invested in female customers.
The most interesting plot is total conversion. Here the 30-34, 35-39 and 45-49 age groups show similar trends, but in the 40-44 group females are ahead. If we improve the features mentioned above, we should also get more conversions from female customers. The clicks-per-impression (CPI) plot tells the same story of female dominance. These are very useful insights.**
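
Box plots compare per-ad distributions. As a complementary check, one could also compute the overall clicks-per-impression rate of each age and gender segment from the group totals. The sketch below uses invented numbers purely to show the calculation; the `toy` frame is an assumption, not data from the notebook.

```python
import pandas as pd

# Toy data: four ads in one age group, two per gender (0 = male, 1 = female)
toy = pd.DataFrame({
    "age": [32, 32, 32, 32],
    "gender": [0, 0, 1, 1],
    "Impressions": [1000, 3000, 500, 1500],
    "Clicks": [10, 20, 10, 20],
})

# Sum clicks and impressions per segment first, then divide the totals
totals = toy.groupby(["age", "gender"])[["Impressions", "Clicks"]].sum()
totals["CPI"] = totals["Clicks"] / totals["Impressions"] * 100
print(totals)
```

Dividing the totals weights each ad by its impressions, whereas averaging the per-ad CPI values would give every ad equal weight regardless of size.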

In [None]:
#Checking the unique Campaign IDs
df['xyz_campaign_id'].unique()

In [None]:
#Let's Check the amount spent on each Campaign
sns.boxplot(x='xyz_campaign_id', y='Spent',data=df, color='gray', width=1)

**Observation:- The highest amounts were spent on the campaign with ID 1178, so let us focus on that campaign separately.**
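
A box plot shows how spend is distributed across the ads of each campaign, not the campaign totals. If the total is what matters, a grouped sum gives it directly. The sketch below uses made-up figures, and the campaign IDs 1 and 2 are placeholders invented for this example.

```python
import pandas as pd

# Toy data: placeholder campaigns 1 and 2 alongside campaign 1178
toy = pd.DataFrame({
    "xyz_campaign_id": [1, 1, 2, 1178, 1178],
    "Spent": [1.0, 2.0, 3.0, 50.0, 150.0],
})

# Total and median spend per campaign
summary = toy.groupby("xyz_campaign_id")["Spent"].agg(["sum", "median"])

# Campaign with the largest total spend
top_campaign = summary["sum"].idxmax()
print(summary)
print(top_campaign)
```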

# Analysing the Campaign with Campaign_ID= 1178

In [None]:
#Storing the 1178 campaign stats in a different dataframe
df_1178=df.loc[df['xyz_campaign_id'] == 1178]

In [None]:
df_1178.shape

In [None]:
df_1178.head()

In [None]:
df_1178=df_1178.drop(['xyz_campaign_id'],axis=1)

In [None]:
#Check the correlation heatmap
f,ax = plt.subplots(figsize=(15, 10))
sns.heatmap(df_1178.corr(method='pearson'), annot=True, fmt= '.1f',ax=ax)

**This heatmap is nearly the same as the one for the main dataset.**

In [None]:
#Let us get the insights based on age
sns.boxplot(x='age', y='Clicks',data=df_1178, color='gray', width=1)

In [None]:
cbag=df_1178.groupby("age")["Clicks"].count()
ax = cbag.plot(kind='bar', figsize=(10,6), color="blue", fontsize=10)
ax.set_title("Clicks by age", fontsize=16)
ax.set_xlabel("Age", fontsize=12);
ax.set_ylabel("Clicks", fontsize=12);
plt.show()

**Observation:- The same pattern appears here: the 30-34 age group is ahead of all the others. This supports keeping it as our target audience.**

In [None]:
#Let's check the conversion for different age groups
conv_ages=df_1178.groupby("age")["Total_Conversion"].count() #count() gives the number of ad rows per age group
bx = conv_ages.plot(kind='bar', figsize=(10,6), color="lime", fontsize=10)
bx.set_title("Conversion by age", fontsize=16)
bx.set_xlabel("Age", fontsize=12);
bx.set_ylabel("Conversion", fontsize=12);

**The same trend is reflected here as well, which is consistent with our previous observation.**

# Now let's get the insights based on both age and gender in df_1178...

In [None]:
for column in ['Clicks','Impressions','Spent','Total_Conversion']:
    with sns.axes_style(style='ticks'):
        #recent seaborn versions require x, y and hue as keyword arguments
        g = sns.catplot(x="age", y=column, hue="gender", data=df_1178, height=8, kind="box")
        g.set_axis_labels("Age", column);

**Observation:- The graphs above show that female customers are ahead of male customers in every aspect. Hence, in each age group, we should focus more on female customers while also trying to increase our impact on male customers as much as we can. In the 30-34 age group in particular, conversion is good for both male and female customers, so we should emphasise that group. By increasing impressions we can increase clicks, and conversions should increase similarly.**

# Now at last we will try to find the optimum number of clusters from the data set!

In [None]:
ds1=df[['xyz_campaign_id','age','gender','interest','Impressions','Clicks','Spent','Total_Conversion','Approved_Conversion']]

In [None]:
#Using Elbow method!
from sklearn.cluster import KMeans
wcss=[]
K_rng=10

for i in range(1,K_rng):
    K=KMeans(n_clusters=i, n_init=10, random_state=0) #fixed seed so the curve is reproducible
    K.fit(ds1)
    w=K.inertia_
    wcss.append(w)
    
Clusters=range(1,K_rng)
plt.figure(figsize=(12,8))
plt.plot(Clusters,wcss)
plt.xlabel('Clusters')
plt.ylabel('WCSS Values') #Within Cluster Sum of Squares
plt.title('Elbow Method Visualisation')

**Observation:- The Elbow method graph suggests that the optimum number of clusters lies between 2 and 4, so we proceed with these values only.**
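
To make the WCSS value on the y-axis concrete, the sketch below computes the within-cluster sum of squares by hand for a tiny set of points under two different cluster assignments. The function `wcss` and the points are assumptions made for this illustration. For a fitted scikit-learn `KMeans` model, `inertia_` is this same quantity evaluated at the fitted cluster centres.

```python
import numpy as np


def wcss(points, labels):
    """Sum of squared distances from each point to the centroid of its own cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for k in np.unique(labels):
        members = points[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)


# Toy data: two tight pairs of points that are far apart from each other
points = [[0, 0], [0, 2], [10, 0], [10, 2]]

one_cluster = wcss(points, [0, 0, 0, 0])   # everything in a single cluster
two_clusters = wcss(points, [0, 0, 1, 1])  # one cluster per pair

print(one_cluster, two_clusters)
```

The sharp drop from one cluster to two is the kind of bend the elbow plot looks for; adding more clusters after that point reduces WCSS only slightly.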

# For K=2

In [None]:
#Fitting the model
K2= KMeans(n_clusters=2, n_init=10, random_state=0)
K2.fit(ds1)

In [None]:
#Prediction using the model
ds1_pred=ds1.copy()
ds1_pred['Predicted']=K2.fit_predict(ds1)

In [None]:
#Visualise the clusters (Clicks and Conversion) after prediction
plt.figure(figsize=(8,5))
plt.scatter(ds1_pred['Clicks'], ds1_pred['Total_Conversion'], c=ds1_pred['Predicted'], cmap = 'rainbow')
plt.xlabel('Clicks')
plt.ylabel('Conversion')
plt.title('Clicks VS Conversion(K=2)')

In [None]:
#Visualise the clusters (Impressions and Clicks) after prediction
plt.figure(figsize=(8,5))
plt.scatter(ds1_pred['Impressions'], ds1_pred['Clicks'], c=ds1_pred['Predicted'], cmap = 'jet')
plt.xlabel('Impressions')
plt.ylabel('Clicks')
plt.title('Impressions VS Clicks(K=2)')

# For K=3

In [None]:
#Fitting the model
K3= KMeans(n_clusters=3, n_init=10, random_state=0)
K3.fit(ds1)

In [None]:
#Prediction using the model
ds1_pred2=ds1.copy()
ds1_pred2['Predicted']=K3.fit_predict(ds1)

In [None]:
#Visualise the clusters (Clicks and Conversion) after prediction
plt.figure(figsize=(8,5))
plt.scatter(ds1_pred2['Clicks'], ds1_pred2['Total_Conversion'], c=ds1_pred2['Predicted'], cmap = 'rainbow')
plt.xlabel('Clicks')
plt.ylabel('Conversion')
plt.title('Clicks VS Conversion(K=3)')

In [None]:
#Visualise the clusters (Impressions and Clicks) after prediction
plt.figure(figsize=(8,5))
plt.scatter(ds1_pred2['Impressions'], ds1_pred2['Clicks'], c=ds1_pred2['Predicted'], cmap = 'jet')
plt.xlabel('Impressions')
plt.ylabel('Clicks')
plt.title('Impressions VS Clicks(K=3)')

# For K=4

In [None]:
#Fitting the model
K4= KMeans(n_clusters=4, n_init=10, random_state=0)
K4.fit(ds1)

In [None]:
#Prediction using the model
ds1_pred3=ds1.copy()
ds1_pred3['Predicted']=K4.fit_predict(ds1)

In [None]:
#Visualise the clusters (Clicks and Conversion) after prediction
plt.figure(figsize=(8,5))
plt.scatter(ds1_pred3['Clicks'], ds1_pred3['Total_Conversion'], c=ds1_pred3['Predicted'], cmap = 'rainbow')
plt.xlabel('Clicks')
plt.ylabel('Conversion')
plt.title('Clicks VS Conversion(K=4)')

In [None]:
#Visualise the clusters (Impressions and Clicks) after prediction
plt.figure(figsize=(8,5))
plt.scatter(ds1_pred3['Impressions'], ds1_pred3['Clicks'], c=ds1_pred3['Predicted'], cmap = 'jet')
plt.xlabel('Impressions')
plt.ylabel('Clicks')
plt.title('Impressions VS Clicks(K=4)')

**Observation:- From the analyses above, the optimum number of clusters appears to be 3. This segmentation can help the company focus its efforts. With 3 clusters, the Impressions VS Clicks plot separates well, whereas the Clicks VS Conversion plot is a bit confusing.**
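
Visual inspection can be supplemented with a numeric score. The sketch below is one possible approach and is not something the notebook does: it standardises the features with `StandardScaler` and compares K=2, 3 and 4 using the silhouette score on synthetic data. The synthetic blobs, the scaling step and the choice of silhouette score are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated blobs of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Standardise so that no single feature dominates the distance calculation
X_scaled = StandardScaler().fit_transform(X)

# Silhouette score for each candidate K (higher means better-separated clusters)
scores = {}
for k in range(2, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

Scaling is worth considering for data like this because K-means uses Euclidean distance, so a column with very large values (such as impressions) would otherwise dominate the clustering.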

# Final Observation & Conclusion:-

**The company should focus mainly on the 30-34 age group, as most of its revenue comes from that group. Besides that, steps should be taken to increase the impact of the ads on the other age groups. Gender is another interesting factor: in every case we have seen that female customers are more productive for the organisation than male customers, so the focus should be adjusted accordingly. Last but not least, the Elbow method suggests that the optimum number of clusters is 3. Keeping these points in mind, the company can build on these insights to progress further.
Thank you.**

# END