# Importing Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os, datetime
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler,LabelEncoder,StandardScaler
from sklearn.decomposition import PCA, non_negative_factorization
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from sklearn.cluster import KMeans, AgglomerativeClustering,MeanShift,DBSCAN,Birch
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, f1_score, roc_auc_score,recall_score


## About Dataset
This dataset is taken from **Kaggle**.
## Context
## Problem Statement

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only on that particular segment.

#### Link of the dataset 
https://www.kaggle.com/datasets/imakash3011/customer-personality-analysis?select=marketing_campaign.csv

### Content
### Attributes

#### People

* ID: Customer's unique identifier
* Year_Birth: Customer's birth year
* Education: Customer's education level
* Marital_Status: Customer's marital status
* Income: Customer's yearly household income
* Kidhome: Number of children in customer's household
* Teenhome: Number of teenagers in customer's household
* Dt_Customer: Date of customer's enrollment with the company
* Recency: Number of days since customer's last purchase
* Complain: 1 if the customer complained in the last 2 years, 0 otherwise

#### Products

* MntWines: Amount spent on wine in last 2 years
* MntFruits: Amount spent on fruits in last 2 years
* MntMeatProducts: Amount spent on meat in last 2 years
* MntFishProducts: Amount spent on fish in last 2 years
* MntSweetProducts: Amount spent on sweets in last 2 years
* MntGoldProds: Amount spent on gold in last 2 years

#### Promotion

* NumDealsPurchases: Number of purchases made with a discount
* AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
* AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
* AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
* AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
* AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
* Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

#### Place

* NumWebPurchases: Number of purchases made through the company’s website
* NumCatalogPurchases: Number of purchases made using a catalogue
* NumStorePurchases: Number of purchases made directly in stores
* NumWebVisitsMonth: Number of visits to company’s website in the last month

#### Purpose of this exercise
* Need to perform clustering to summarize customer segments.



# Main objective
In this project, we perform unsupervised clustering on customer data from a grocery firm's database. Customer segmentation is the practice of separating customers into groups so that customers within each cluster are similar to one another. We divide the customers into segments to understand how each group behaves, so that the business can adapt its products to the distinct needs of each segment.

# Stakeholders
Through this analysis, our stakeholders will learn about customer behaviour: who our most loyal customers are and which customers are visiting the store less often.
It also helps the business cater to the concerns of different types of customers.

In [None]:
data=pd.read_csv('/kaggle/input/customer-personality-analysis/marketing_campaign.csv', sep="\t")

In [None]:
pd.set_option('display.max_columns', None)

# Data Cleaning and Exploration

In [None]:
data.shape

In [None]:
data.info()

In [None]:
# Categorical Variables encoded
labels = LabelEncoder()
data.Marital_Status=labels.fit_transform(data.Marital_Status)
data.Education=labels.fit_transform(data.Education)
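
Reusing a single `LabelEncoder` for both columns works, but the fitted mapping for the first column is overwritten by the second fit. A minimal sketch, on a toy frame with illustrative values, of keeping one encoder per column so that integer codes can be mapped back to category names later. The `encoders` dict and `education_lookup` are hypothetical names introduced here, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "Education": ["PhD", "Master", "PhD", "Basic"],
    "Marital_Status": ["Single", "Married", "Married", "Single"],
})

encoders = {}  # hypothetical helper: column name -> fitted encoder
for col in ["Education", "Marital_Status"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# LabelEncoder sorts the categories, so code i corresponds to classes_[i]
education_lookup = dict(enumerate(encoders["Education"].classes_))
```

With the lookup in hand, cluster summaries can be reported with the original category names instead of bare integer codes.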

In [None]:
# year_birth converted to age
data["age"]=datetime.datetime.today().year-data.Year_Birth
data=data.drop("Year_Birth",axis=1)

In [None]:
# renaming column names for more meaning
data.columns=['ID', 'education', 'marital_status', 'income', 'kidhome',
       'teenhome', 'cust_enrol_dt', 'recency', 'amt_spent_wine_last_2_yr', 'amt_spent_fruits_last_2_yr',
       'amt_spent_meat_last_2_yr', 'amt_spent_fish_last_2_yr', 'amt_spent_sweet_last_2_yr',
       'amt_spent_gold_last_2_yr', 'num_deals_purchases', 'num_web_purchases',
       'num_catalog_purchases','num_store_purchases','num_web_visits_month',
       'accepted_cmp3', 'accepted_cmp4', 'accepted_cmp5', 'accepted_cmp1',
       'accepted_cmp2', 'complain', 'Z_cost_contact', 'Z_revenue', 'response',
       'age']
# cust_enrol_dt converted to datetime (dates in this file are day-first, dd-mm-yyyy)
data.cust_enrol_dt=pd.to_datetime(data.cust_enrol_dt, dayfirst=True)
# data.head()

In [None]:
data['amount_spent']=data[['amt_spent_wine_last_2_yr', 'amt_spent_fruits_last_2_yr','amt_spent_meat_last_2_yr', 'amt_spent_fish_last_2_yr',
                           'amt_spent_sweet_last_2_yr','amt_spent_gold_last_2_yr']].sum(axis=1)
data=data.drop(['amt_spent_wine_last_2_yr', 'amt_spent_fruits_last_2_yr','amt_spent_meat_last_2_yr', 'amt_spent_fish_last_2_yr',
                           'amt_spent_sweet_last_2_yr','amt_spent_gold_last_2_yr'],axis=1)
cols=data.columns

In [None]:
# Now we can use this dataset for clustering as none of the columns are categorical
data.info()

In [None]:
# Finding the unique values in every column
pd.DataFrame([[i, len(data[i].unique())] for i in data.columns],
             columns=['Variable', 'Unique Values']).set_index('Variable')

In [None]:
# Check for missing values in the dataset
# Missing income values are imputed with the column mean; all variables are numeric
data.income=data.income.fillna(data.income.mean())
print(data.isna().sum())
data.info()

In [None]:
# Check for duplicate customer ids; the assertion fails if any ID appears more than once
assert not data.ID.duplicated().any()

In [None]:
# A histogram of the raw ages shows some customers older than 120, while people rarely live much past 100
# So rows with age greater than 100 are removed before plotting
data=data[data.age<=100]
f, (ax1, ax2) = plt.subplots(1, 2)
data.age.hist(ax=ax1)
ax1.set_title("Histogram")
sns.boxplot(x=data.age,ax=ax2)
ax2.set_title("Boxplot")

In [None]:
# Checking the correlation between variables (the datetime column is excluded)
plt.figure(figsize = (12, 10))
sns.heatmap(data.drop(columns='cust_enrol_dt').corr(), annot = True, linewidths=0,fmt='.2f',annot_kws={"size": 8})

# Scale the data

In [None]:
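# Features to be considered: everything except the enrolment date and the ID
X=data.drop(['cust_enrol_dt','ID'],axis=1)
# Min-max scaling maps every feature to the [0, 1] range
mms=MinMaxScaler()
transformed_data=mms.fit_transform(X)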
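
For reference, `MinMaxScaler` with its default range rescales each column as `(x - min) / (max - min)`. A small sketch on a toy array (the values are made up for illustration) that checks the scaler against the formula computed by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix: two features on very different scales
toy = np.array([[1.0, 200.0],
                [3.0, 400.0],
                [5.0, 1000.0]])

scaled = MinMaxScaler().fit_transform(toy)

# Same result computed column-wise by hand
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)
```

Because distance-based methods such as KMeans treat all features equally, this rescaling keeps a large-valued feature like income from dominating the distance computation.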

In [None]:
# Use X.columns (not a set) so each scaled column keeps its correct name and order
scaled_data=pd.DataFrame(transformed_data, columns=X.columns, index=X.index)
scaled_data.head()

# Model Training and Predictions

# KMeans

In [None]:
# Elbow plot to find a reasonable K

inertia = []
list_num_clusters = list(range(1,15))
for num_clusters in list_num_clusters:
    # fixed seed so the curve is reproducible between runs
    km = KMeans(n_clusters=num_clusters, random_state=42)
    km.fit(transformed_data)
    inertia.append(km.inertia_)

plt.plot(list_num_clusters,inertia)
plt.scatter(list_num_clusters,inertia)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia');

In [None]:
# From the elbow plot above it is okay to create 4-5 clusters
kmeans=KMeans(n_clusters=5, random_state=42)
kmeans.fit(transformed_data)
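
The elbow in an inertia curve can be hard to read, so a second criterion is sometimes useful. A minimal sketch, assuming synthetic blob data in place of `transformed_data`, that picks K by the mean silhouette score (`silhouette_score` from `sklearn.metrics`). The names `X_demo`, `scores` and `best_k` are illustrative only.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled customer data: four tight, well-separated blobs
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)

scores = {}
for k in range(2, 8):  # silhouette is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# K with the highest mean silhouette score
best_k = max(scores, key=scores.get)
```

On real customer data the maximum is usually far less pronounced than on this toy example, so the score is best read alongside the elbow plot rather than instead of it.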

In [None]:
kmeans_labels=kmeans.labels_
set(kmeans.labels_)

# Agglomerative clustering

In [None]:
agc=AgglomerativeClustering(n_clusters=5)
agc.fit(transformed_data)

In [None]:
agc_labels=agc.labels_
set(agc.labels_)

# BIRCH

In [None]:
birch=Birch(n_clusters=5)
birch.fit(transformed_data)

In [None]:
# 4 clusters generated
birch_labels=birch.labels_
set(birch.labels_)
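
Each algorithm numbers its clusters arbitrarily, so the raw label values of two models cannot be compared directly. One permutation-invariant way to check how much two clusterings agree is the adjusted Rand index. The sketch below uses made-up toy labels, not the notebook's results:

```python
from sklearn.metrics import adjusted_rand_score

labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]  # same grouping as labels_a, different label names
labels_c = [0, 1, 0, 1, 0, 1]  # a genuinely different grouping

same = adjusted_rand_score(labels_a, labels_b)  # identical partitions
diff = adjusted_rand_score(labels_a, labels_c)  # no pair grouped together in both
```

In the notebook, the same call could be applied to pairs such as `kmeans_labels` and `agc_labels` to see how closely the models agree.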

# Aggregate all cluster labels into the original dataframe using the mode, then analyze the clusters

In [None]:
data['kmeans']=kmeans_labels
data['agc']=agc_labels
data['birch']=birch_labels

In [None]:
df_clusters=data[["kmeans","agc","birch"]]
data['voting_labels']=df_clusters.mode(axis=1)[0]
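
A row-wise mode is only meaningful if a given label value refers to the same group of customers in every model, which clustering algorithms do not guarantee. One possible remedy, sketched here on toy labels, is to relabel each model against a reference model before voting, using the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) on the overlap matrix. The helper `align_labels` is hypothetical and not part of the original notebook.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix


def align_labels(reference, other):
    """Relabel `other` so its cluster ids overlap with `reference` as much as possible."""
    reference = np.asarray(reference)
    other = np.asarray(other)
    ids = np.union1d(reference, other)
    # overlap[i, j] = number of points with reference id ids[i] and other id ids[j]
    overlap = confusion_matrix(reference, other, labels=ids)
    # One-to-one matching of ids that maximizes the total overlap
    ref_idx, other_idx = linear_sum_assignment(overlap, maximize=True)
    mapping = {ids[j]: ids[i] for i, j in zip(ref_idx, other_idx)}
    return np.array([mapping[label] for label in other])


# Toy example: same grouping, but the second model uses different label names
ref = np.array([0, 0, 1, 1, 2, 2])
oth = np.array([2, 2, 0, 0, 1, 1])
aligned = align_labels(ref, oth)
```

In the notebook this could be applied as `align_labels(kmeans_labels, agc_labels)` and `align_labels(kmeans_labels, birch_labels)` before taking the mode, treating KMeans as the reference. That choice of reference is itself an assumption.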

# Analyze Clusters

In [None]:
df_c1=data[data.voting_labels==1]
df_c1.describe()

In [None]:
df_c2=data[data.voting_labels==2]
df_c2.describe()

In [None]:
df_c3=data[data.voting_labels==3]
df_c3

In [None]:
df_c4=data[data.voting_labels==4]
df_c4

In [None]:
df_c5=data[data.voting_labels==0]
df_c5.describe()
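
Instead of inspecting one filtered dataframe per label, the clusters can be profiled side by side with a single `groupby`. A minimal sketch on a toy frame, where the column names mirror the notebook but the values are invented for illustration:

```python
import pandas as pd

# Toy data standing in for the labelled customer dataframe
toy = pd.DataFrame({
    "income": [20000, 25000, 80000, 90000, 50000],
    "amount_spent": [50, 70, 1500, 1700, 600],
    "voting_labels": [0, 0, 1, 1, 2],
})

# One row per cluster: size plus mean income and mean spend
cluster_profile = toy.groupby("voting_labels").agg(
    n_customers=("income", "size"),
    mean_income=("income", "mean"),
    mean_spent=("amount_spent", "mean"),
)
```

The resulting table makes it easy to see which segments are small, which spend the most, and how the segments differ on any other column added to the aggregation.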

# Summary and key insights

I created five models (logistic regression, KNN, decision tree, random forest and gradient boosting) and evaluated them with five validation metrics. Below is a summary of how they performed.
All the models were trained on the same training set and tested on the same test set, and almost all of them used the same parameters.
The **confusion matrix** shows that **Logistic Regression** performed very badly, with **0 precision and recall**.
The KNN and decision tree models did somewhat better: precision, recall and F1 score start to improve in these two models, at the cost of some accuracy.
I think gradient boosting performed best of all the models, with the highest accuracy, precision, recall and F1 score.



# Feature Importance

# Suggestions and next steps for revisiting the model

We could further optimize these models by:
1. Using **GridSearchCV** to find the best parameters for every model.
2. Using sampling, because the data is **unbalanced**; addressing the imbalance may improve the accuracy of the model.
3. Changing our model based on the **inputs received from our stakeholders** about the business.
4. Trying an XGBoost model.