# Introduction

This notebook is an exploratory data analysis (EDA) of the *customer-personality-analysis* dataset. The project task for this dataset is to perform clustering to summarize and understand customer segments and behavior.

The clustering technique used in this EDA can be found here: https://thecleverprogrammer.com/2021/02/08/customer-personality-analysis-with-python/

All other analyses of this dataset are my own.

Please feel free to provide suggestions and comments on this notebook below!

## Guiding Questions

Some questions we might ask to guide our analysis include:
* What are the main demographics of our customers?
* What are their spending habits?
* How can this analysis help our company's marketing campaign?

# Data Preparation and Cleaning

Let's first import our packages and our dataset.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style
%matplotlib inline
import seaborn as sns

style.use('seaborn-poster')
style.use("fivethirtyeight")
plt.rcParams['font.family'] = 'serif'

import warnings
warnings.filterwarnings('ignore')

In [None]:
customer_data = pd.read_csv('../input/customer-personality-analysis/marketing_campaign.csv', sep = '\t', index_col = 'ID')

Next we can take a look at our data before beginning the cleaning process.

In [None]:
customer_data.head()

In [None]:
customer_data.shape

Looks like this dataframe comprises 2240 customers with 28 features. We won't be using all 28 columns in this EDA, however, so we will select a subset of them later. Let's first determine if we have any missing values.

In [None]:
customer_data.isnull().sum()

Seems as if our missing values are only in the *Income* column. These could correspond to customers with no income, so imputation would not make sense here. The dataset is large enough that we can omit these rows.

In [None]:
customer_data_cleaned = customer_data.dropna()
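
As a hedged aside, the missing-value step can be made more explicit by dropping rows only where *Income* is missing and reporting how many rows were removed. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame(
    {"Income": [58138.0, np.nan, 71613.0, np.nan],
     "Kidhome": [0, 1, 0, 1]},
    index=pd.Index([101, 102, 103, 104], name="ID"),
)

# Count missing values per column and keep only the columns that have any.
missing = toy.isnull().sum()
missing = missing[missing > 0]

# Drop rows only where Income is missing, then report how many were removed.
toy_cleaned = toy.dropna(subset=["Income"]).copy()
rows_removed = len(toy) - len(toy_cleaned)
print(missing)
print(f"Removed {rows_removed} of {len(toy)} rows")
```

Restricting `dropna` to a named subset guards against silently losing rows if another column later turns out to contain missing values.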

Next we will want to get a better idea of the ages of our customers. We do not currently have an *Age* column in our dataframe, so let's create one using the *Year_Birth* feature.

In [None]:
from datetime import date

def get_age(birthyear):
    return date.today().year - birthyear

# Age in years as of the date the notebook is run.
customer_data_cleaned["Age"] = customer_data_cleaned.Year_Birth.map(get_age)

customer_data_cleaned.Age.describe()
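
Because `date.today()` changes every year, the ages (and any numbers quoted from them) shift each time the notebook is rerun. A minimal sketch of a reproducible variant is shown below; `REFERENCE_YEAR` is a hypothetical constant and the birth years are made-up examples.

```python
import pandas as pd

# Hypothetical reference year (an assumption): pin the analysis to one year.
REFERENCE_YEAR = 2021

def get_age(birthyear, reference_year=REFERENCE_YEAR):
    """Age in whole years as of reference_year (ignores month and day)."""
    return reference_year - birthyear

# Made-up birth years for illustration.
toy_birth_years = pd.Series([1957, 1984, 1893], name="Year_Birth")
toy_ages = toy_birth_years.map(get_age)
print(toy_ages.tolist())
```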

We seem to have some customers who are extremely old: the oldest customer is 128 years old! This data might be incorrect. Let's take a closer look.

In [None]:
customer_data_cleaned.sort_values(by = 'Year_Birth').head()

We observe that we have three customers who were born in the 19th century. Surely this cannot be correct. Let's drop these rows.

In [None]:
customer_data_cleaned.drop([11004, 1150, 7829], inplace = True)
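
Dropping by hard-coded ID works for this file, but a condition-based filter expresses the intent more directly. The sketch below is one way to do that on a toy frame; the cutoff `EARLIEST_PLAUSIBLE_BIRTH_YEAR` is an assumption chosen for illustration.

```python
import pandas as pd

# Toy frame with made-up birth years.
toy = pd.DataFrame(
    {"Year_Birth": [1893, 1899, 1965, 1990]},
    index=pd.Index([1, 2, 3, 4], name="ID"),
)

# Hypothetical cutoff: treat birth years at or before 1900 as data-entry errors.
EARLIEST_PLAUSIBLE_BIRTH_YEAR = 1900

plausible = toy["Year_Birth"] > EARLIEST_PLAUSIBLE_BIRTH_YEAR
toy_filtered = toy.loc[plausible].copy()
print(toy_filtered)
```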

Next let's create some new features in our dataframe. The first is the total of all spending for each customer, titled *Spending*. We will also create a feature, *Time_With_Company*, that holds the approximate number of months (counted as 30-day periods) each customer has been with the company. This will allow us to cluster the customers into groups of new and old as well as big and small spenders.

In [None]:
customer_data_cleaned['Spending'] = customer_data_cleaned.MntWines + customer_data_cleaned.MntFruits + customer_data_cleaned.MntMeatProducts + customer_data_cleaned.MntFishProducts + customer_data_cleaned.MntSweetProducts + customer_data_cleaned.MntGoldProds
customer_data_cleaned['Time_With_Company'] = pd.to_datetime(customer_data_cleaned.Dt_Customer, dayfirst = True, format = '%d-%m-%Y')
customer_data_cleaned['Time_With_Company'] = pd.to_numeric(customer_data_cleaned.Time_With_Company.dt.date.apply(lambda z: (date.today() - z)).dt.days, downcast = 'integer') / 30
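
The tenure calculation can also be written with plain datetime arithmetic. This is a sketch on made-up enrollment dates; `reference_date` is a hypothetical fixed date used so the result does not depend on when the notebook is run.

```python
import pandas as pd

# Made-up enrollment dates in the dataset's day-month-year format.
toy = pd.DataFrame({"Dt_Customer": ["04-09-2012", "02-12-2020", "21-08-2013"]})

# Hypothetical fixed reference date (an assumption for reproducibility).
reference_date = pd.Timestamp("2021-01-01")

enrolled = pd.to_datetime(toy["Dt_Customer"], format="%d-%m-%Y")
# Days since enrollment, converted to approximate months (30 days each).
toy["Time_With_Company"] = (reference_date - enrolled).dt.days / 30
print(toy)
```

Subtracting a datetime column from a `Timestamp` yields a timedelta column, so `.dt.days` is available without an intermediate `apply`.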

Let's now take a look at the *Education* feature. This will allow us to better understand the demographics of our customer base.

In [None]:
customer_data_cleaned.Education.unique()

There are only a few unique values for this feature, so we can leave it as is. Let's do the same with the *Marital_Status* feature now.

In [None]:
customer_data_cleaned.Marital_Status.unique()

Seems like we have a larger number of unique values for this column. Some of these values are similar in meaning, so let's group them together to make our analysis easier.

In [None]:
customer_data_cleaned.Marital_Status = customer_data_cleaned.Marital_Status.replace({"Divorced": "Single", "Together": "Partner","Married": "Partner", "Widow": "Single", "Alone": "Single", "Absurd": "Single","YOLO": "Single"})
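
A possible variant of this grouping uses `Series.map` with a complete lookup table, which makes unexpected categories visible. In the sketch below, `STATUS_GROUPS` is a hypothetical name, and the `"Single": "Single"` entry is added only because `map` needs every category listed.

```python
import pandas as pd

# Same grouping as the replace() call, written as a complete lookup table.
STATUS_GROUPS = {
    "Married": "Partner", "Together": "Partner",
    "Single": "Single", "Divorced": "Single", "Widow": "Single",
    "Alone": "Single", "Absurd": "Single", "YOLO": "Single",
}

# Made-up sample of raw statuses.
toy_status = pd.Series(["Married", "YOLO", "Together", "Widow", "Single"])

# map() returns NaN for any value missing from the table, which exposes
# unexpected categories instead of silently passing them through.
grouped = toy_status.map(STATUS_GROUPS)
unmapped = toy_status[grouped.isna()].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped statuses: {unmapped}")
print(grouped.value_counts())
```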

Let's create some new features regarding the children of the customers.

In [None]:
customer_data_cleaned["Children"] = customer_data_cleaned.Kidhome + customer_data_cleaned.Teenhome
customer_data_cleaned["Has_Child"] = np.where(customer_data_cleaned.Children > 0, "Has Child", "No Child")

Finally, let's rename some of our columns.

In [None]:
customer_data_cleaned = customer_data_cleaned.rename(columns = {"MntWines": "Wine",
                                                     "MntFruits": "Fruit",
                                                     "MntMeatProducts": "Meat",
                                                     "MntFishProducts": "Fish",
                                                     "MntSweetProducts" : "Sweets",
                                                     "MntGoldProds": "Gold"})

In [None]:
customer_data_cleaned = customer_data_cleaned.rename(columns = {"NumWebPurchases" : "Web",
                                                               "NumCatalogPurchases" : "Catalog",
                                                               "NumStorePurchases" : "Store",
                                                               "NumWebVisitsMonth" : "WebVisits"})

Let's take a look at the values for web purchases.

In [None]:
customer_data_cleaned.Web.describe()

In [None]:
customer_data_cleaned.Web.value_counts()

Seems like we have some outlier values, so let's remove them.

In [None]:
outlier_IDs = customer_data_cleaned.loc[customer_data_cleaned.Web > 20].index
customer_data_cleaned.drop(outlier_IDs, inplace = True)

We will check the *Catalog* and *Store* columns in the same way.

In [None]:
customer_data_cleaned.Catalog.describe()

In [None]:
customer_data_cleaned.Catalog.value_counts()

In [None]:
outlier_IDs = customer_data_cleaned.loc[customer_data_cleaned.Catalog > 20].index
customer_data_cleaned.drop(outlier_IDs, inplace = True)
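
Since the same threshold filter is applied to more than one column, it could be wrapped in a small helper. The function name `drop_above` and the toy values are hypothetical; the threshold of 20 mirrors the cells above.

```python
import pandas as pd

def drop_above(df, column, threshold):
    """Return a copy of df without rows where df[column] exceeds threshold."""
    return df.loc[~(df[column] > threshold)].copy()

# Made-up purchase counts for illustration.
toy = pd.DataFrame({"Web": [2, 27, 5, 4], "Catalog": [1, 0, 28, 3]})

cleaned = toy
for column in ["Web", "Catalog"]:
    cleaned = drop_above(cleaned, column, threshold=20)
print(cleaned)
```

Returning a new frame instead of dropping in place keeps the original data available for comparison.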

In [None]:
customer_data_cleaned.Store.describe()

Let's select the columns of our cleaned dataframe that we wish to analyze.

In [None]:
demographic_data = customer_data_cleaned[[ "Education", "Marital_Status", "Has_Child", "Children", "Age", "Income", "Spending", "Time_With_Company", "Wine", "Fruit", "Meat", "Sweets", "Gold", "Web", "Catalog", "Store", "WebVisits"]]

In [None]:
demographic_data.head()

In [None]:
demographic_data.Income.describe()

In [None]:
demographic_data = demographic_data[demographic_data.Income < 600000]

# Analysis

Let's begin to visualize our data. We will start with the *Education* feature.

In [None]:
ax = sns.countplot(data = demographic_data, x = 'Education', palette = ("Set2"))
ax.set(xlabel = None,
      title = "Education Level of Customers")

Seems like a majority of customers hold a Bachelor's-level degree (the *Graduation* category). The second-largest group is customers with PhDs. Let's look at marital status now.

In [None]:
ax = sns.countplot(data = demographic_data, x = "Marital_Status", palette = "husl")
ax.set(xlabel = None,
      title = "Marital Status of Customers")

Seems like the number of customers in a relationship is almost double the number of single customers. This could be useful for marketing purposes. Let's move on to the child status of our customers.

In [None]:
ax = sns.countplot(data = demographic_data, x = "Has_Child", palette = "flare")
ax.set(xlabel=None,
      title = "Child Status of Customers")

An overwhelming majority of customers have at least one child. Now what about the age of our customers?

In [None]:
ax = sns.histplot(data = demographic_data.Age, color = "midnightblue")
ax.set(title = "Ages of Customers",
      ylabel = None,
      xlabel = "Age")
ax.grid(axis = "x")

The age of our customers varies over a wide range. However, it seems like most of our customers are middle-aged (40-60 years).

In [None]:
demographic_data.Age.describe()

In [None]:
print("Average age of customers", np.round(demographic_data.Age.mean()), "years old")

Let's look at the income of our customers and how much they spend.

In [None]:
fig, ax = plt.subplots(1,2, figsize = (20,12))
sns.histplot(ax = ax[0], data = demographic_data.Income, color = "steelblue")
sns.histplot(ax = ax[1], data = demographic_data.Spending, color = "steelblue")

ax[0].set_title("Income of Customers", fontsize = 22, pad = 50)
ax[0].set_xlabel("Income [USD $]", fontsize = 20, labelpad = 35)

ax[1].set_title("Spending of Customers", fontsize = 22, pad = 50)
ax[1].set_xlabel("Amount Spent [USD $]", fontsize = 20, labelpad = 35)

for num in [0,1]:
    ax[num].grid(axis = "x")
    ax[num].set(ylabel = None)

In [None]:
demographic_data.Income.describe()

In [None]:
demographic_data.Spending.describe()

The income of our customers varies quite a bit and looks roughly normally distributed. The average income seems to be around 52,000 USD and average spending around 605 USD.

Finally, let's look at how long our customers have been with the company.
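
One hedged way to check the impression of normality numerically is sample skewness, which is close to 0 for a symmetric distribution and positive when there is a long right tail. The sketch below uses made-up series rather than the notebook's columns.

```python
import pandas as pd

# Made-up values: one symmetric series and one with a long right tail.
symmetric = pd.Series([10, 20, 30, 40, 50])
right_skewed = pd.Series([10, 12, 13, 15, 200])

# Series.skew() returns the sample skewness.
print("symmetric:", symmetric.skew())
print("right-skewed:", right_skewed.skew())
```

Applied to the real data, the same call on the income and spending columns would give a quick indication of how far each departs from symmetry.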

In [None]:
ax = sns.histplot(data = demographic_data.Time_With_Company, color = "darkslateblue")

ax.set_title("Time With Company", fontsize = 22, pad = 50)
ax.set_xlabel("Number of Months", fontsize = 20, labelpad = 35)
ax.set_ylabel(None)
ax.grid(axis = 'x')

There seems to be a fairly even distribution of old and new customers. This suggests that the company has been gaining customers at a steady rate.

Now let's observe how much our customers spend on specific products.

In [None]:
fig, ax = plt.subplots(2,2, figsize = (18,18))

sns.histplot(ax = ax[0,0], data = demographic_data.Wine, color = "darkorange")
sns.histplot(ax = ax[0,1], data = demographic_data.Fruit, color = "seagreen")
sns.histplot(ax = ax[1,0], data = demographic_data.Meat, color = "lightcoral")
sns.histplot(ax = ax[1,1], data = demographic_data.Sweets, color = "mediumturquoise")

fig.suptitle("Items bought by Customers", fontsize = 24)
ax[0,0].annotate('All amounts in USD [$]', xy = (0,1), xycoords = 'axes fraction', fontsize = 20)

ax[0,0].set_xlabel("Wine", fontsize = 22, labelpad = 25)
ax[0,1].set_xlabel("Fruit", fontsize = 22, labelpad = 25)
ax[1,0].set_xlabel("Meat", fontsize = 22, labelpad = 25)
ax[1,1].set_xlabel("Sweets", fontsize = 22, labelpad = 25)

for num in [0,1]:
    for num2 in [0,1]:
        ax[num, num2].set_ylabel(None)
        ax[num, num2].grid(axis = "x")

Seems like the customers are spending the most money on wine and meat products. This is consistent with expectations because wine and meat products typically cost more than fruit or sweets. Data regarding the number of units sold in each category could prove to be more useful.

# Creating Clusters

In this section we will create four clusters of customers based on their:
1. Income
2. Time with company
3. Spending

The four clusters are:

* **Stars**: Customers with high income, high spending, and a long time with the company
* **High Potential**: Customers with high income, high spending, and a short time with the company
* **Needs Attention**: Customers with low income, low spending, and a short time with the company
* **Leaky Buckets**: Customers with low income, low spending, and a long time with the company

Note: a further breakdown of this method of clustering is provided here: https://thecleverprogrammer.com/2021/02/08/customer-personality-analysis-with-python/

In [None]:
from sklearn.preprocessing import StandardScaler, normalize
from sklearn import metrics
from sklearn.mixture import GaussianMixture

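scaler = StandardScaler()
# Copy so that adding the Cluster column later does not touch demographic_data.
temp = demographic_data[["Income", "Time_With_Company", "Spending"]].copy()

# Standardize each feature, then scale each row to unit L2 norm.
X_std = scaler.fit_transform(temp)
X = normalize(X_std, norm = 'l2')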
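
To make the two preprocessing steps concrete, here is a NumPy-only sketch of what standardization followed by row-wise L2 normalization computes. The `features` matrix is made up, and the code illustrates the arithmetic only; it is not a replacement for the scikit-learn calls.

```python
import numpy as np

# Toy feature matrix (rows = customers; columns = Income,
# Time_With_Company, Spending). Values are made up.
features = np.array([
    [30000.0, 90.0, 50.0],
    [60000.0, 100.0, 600.0],
    [90000.0, 110.0, 1500.0],
])

# Step 1: standardize each column to zero mean and unit variance,
# using the population standard deviation (ddof=0).
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

# Step 2: scale each row to unit L2 norm.
row_norms = np.linalg.norm(standardized, axis=1, keepdims=True)
row_norms[row_norms == 0] = 1.0  # leave all-zero rows unchanged
unit_rows = standardized / row_norms

print(unit_rows)
```

After the second step every customer lies on the unit sphere, so the clustering is driven by the relative profile of the three features rather than by their overall magnitude.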

In [None]:
gmm = GaussianMixture(n_components = 4, covariance_type = "spherical", random_state = 0, max_iter = 1000).fit(X)
labels = gmm.predict(X)
temp["Cluster"] = labels
temp.head()

In [None]:
# Map the integer GMM labels to cluster names (only in the Cluster column).
temp["Cluster"] = temp["Cluster"].replace({0 : "Leaky Buckets",
                                           1 : "High Potential",
                                           2 : "Needs Attention",
                                           3 : "Stars"})
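
The integer labels returned by a mixture model carry no inherent meaning, so a hard-coded mapping like the one above has to be checked against the cluster summaries whenever the data or seed changes. As a sketch under stated assumptions, names could instead be derived from each cluster's mean features. The rule in `name_cluster` is hypothetical (it compares cluster means with the overall means), and the toy data and labels are made up.

```python
import pandas as pd

# Made-up customers with arbitrary integer cluster labels.
toy = pd.DataFrame({
    "Income":            [80000, 82000, 78000, 79000, 30000, 31000, 29000, 32000],
    "Spending":          [1500, 1400, 1300, 1350, 60, 80, 50, 70],
    "Time_With_Company": [110, 112, 90, 92, 91, 89, 111, 113],
    "Cluster":           [2, 2, 0, 0, 3, 3, 1, 1],
})
feature_cols = ["Income", "Spending", "Time_With_Company"]

def name_cluster(row, overall):
    """Hypothetical rule: name a cluster by comparing its means to the overall means."""
    high_value = (row["Income"] > overall["Income"]
                  and row["Spending"] > overall["Spending"])
    long_tenure = row["Time_With_Company"] > overall["Time_With_Company"]
    if high_value:
        return "Stars" if long_tenure else "High Potential"
    # Mixed cases (e.g. high income but low spending) also land here.
    return "Leaky Buckets" if long_tenure else "Needs Attention"

overall = toy[feature_cols].mean()
cluster_means = toy.groupby("Cluster")[feature_cols].mean()
names = cluster_means.apply(name_cluster, axis=1, overall=overall)
toy["Cluster_Name"] = toy["Cluster"].map(names)
print(names)
```

With real data, two clusters can end up with the same name under a rule like this, so the result should still be compared against the per-cluster summary.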

In [None]:
temp.head()

In [None]:
demographic_data = demographic_data.merge(temp.Cluster, left_index = True, right_index = True)

Let's look at a breakdown of our clusters and their features.

In [None]:
pd.options.display.float_format = "{:.0f}".format
summary = demographic_data[['Income','Spending','Time_With_Company','Cluster']]
summary.set_index("Cluster", inplace = True)
summary=summary.groupby('Cluster').describe().transpose()
summary

In [None]:
ax = sns.countplot(data = demographic_data, y = "Cluster", palette = "muted")
ax.bar_label(container = ax.containers[0], padding = -55, fontsize = 22)
ax.set_ylabel(None)
ax.set_xlabel(None)
ax.set_title("Cluster Distribution", pad = 20)

In [None]:
ax = sns.scatterplot(x = demographic_data.Income,
               y = demographic_data.Spending,
               hue = demographic_data.Cluster,
               palette = "muted")

ax.set_xlabel("Income [USD $]", fontsize = 20, labelpad = 20)
ax.set_ylabel("Amount Spent [USD $]", fontsize = 20, labelpad = 20)

In [None]:
ax = sns.scatterplot(x = demographic_data.Time_With_Company,
               y = demographic_data.Spending,
               hue = demographic_data.Cluster,
               palette = "muted")

ax.set_xlabel("Months With Company", fontsize = 20, labelpad = 20)
ax.set_ylabel("Amount Spent [USD $]", fontsize = 20, labelpad = 20)

In [None]:
ax = sns.scatterplot(x = demographic_data.Time_With_Company,
               y = demographic_data.Income,
               hue = demographic_data.Cluster,
               palette = "muted")

ax.set_xlabel("Months With Company", fontsize = 20, labelpad = 20)
ax.set_ylabel("Income [USD $]", fontsize = 20, labelpad = 20)

Let's look at some of the main features of our dataset again, now separated by cluster.

In [None]:
fig, ax = plt.subplots(2,2, figsize = (18, 18))

sns.countplot(ax = ax[0,0], data = demographic_data.loc[demographic_data.Cluster == "Stars"], y = "Education", order = ['PhD', "Master", "Graduation", "2n Cycle"], palette = "Set2")
sns.countplot(ax = ax[0,1], data = demographic_data.loc[demographic_data.Cluster == "High Potential"], y = "Education", order = ['PhD', "Master", "Graduation", "2n Cycle"], palette = "Set2")
sns.countplot(ax = ax[1,0], data = demographic_data.loc[demographic_data.Cluster == "Needs Attention"], y = "Education", order = ['PhD', "Master", "Graduation", "2n Cycle"], palette = "Set2")
sns.countplot(ax = ax[1,1], data = demographic_data.loc[demographic_data.Cluster == "Leaky Buckets"], y = "Education", order = ['PhD', "Master", "Graduation", "2n Cycle"], palette = "Set2")

ax[0,0].set_title("Stars", fontsize = 20, pad = 20)
ax[0,1].set_title("High Potential", fontsize = 20, pad = 20)
ax[1,0].set_title("Needs Attention", fontsize = 20, pad = 20)
ax[1,1].set_title("Leaky Buckets", fontsize = 20, pad = 20)

for i in [0,1]:
    for j in [0,1]:
        ax[i,j].set(xlabel = None, ylabel = None)

fig.suptitle("Education of Customers by Cluster", fontsize = 24)


Seems as if the education breakdown is similar across clusters.

Next let's look at the purchasing habits of each cluster.

In [None]:
fig, ax = plt.subplots(2,2, figsize = (18,18))

Stars = demographic_data.loc[demographic_data.Cluster == "Stars"]
HP = demographic_data.loc[demographic_data.Cluster == "High Potential"]
NA = demographic_data.loc[demographic_data.Cluster == "Needs Attention"]
LB = demographic_data.loc[demographic_data.Cluster == "Leaky Buckets"]

ax[0,0].set_title("Stars", fontsize = 20, pad = 20)
ax[0,1].set_title("High Potential", fontsize = 20, pad = 20)
ax[1,0].set_title("Needs Attention", fontsize = 20, pad = 20)
ax[1,1].set_title("Leaky Buckets", fontsize = 20, pad = 20)

sns.histplot(ax = ax[0,0], data = Stars.Web, color = "skyblue", label = "Web Purchases")
sns.histplot(ax = ax[0,0], data = Stars.Catalog, color = "red", label = "Catalog Purchases")
sns.histplot(ax = ax[0,0], data = Stars.Store, color = "gold", label = "Store Purchases")

sns.histplot(ax = ax[0,1], data = HP.Web, color = "skyblue")
sns.histplot(ax = ax[0,1], data = HP.Catalog, color = "red")
sns.histplot(ax = ax[0,1], data = HP.Store, color = "gold")

sns.histplot(ax = ax[1,0], data = NA.Web, color = "skyblue")
sns.histplot(ax = ax[1,0], data = NA.Catalog, color = "red")
sns.histplot(ax = ax[1,0], data = NA.Store, color = "gold")

sns.histplot(ax = ax[1,1], data = LB.Web, color = "skyblue")
sns.histplot(ax = ax[1,1], data = LB.Catalog, color = "red")
sns.histplot(ax = ax[1,1], data = LB.Store, color = "gold")

for i in [0,1]:
    for j in [0,1]:
        ax[i,j].set(xlabel = None, ylabel = None)
        ax[i,j].grid(axis = "x")
        
ax[0,0].legend()

fig.suptitle("Purchasing Habits of Customers by Cluster", fontsize = 24)

Focusing on the Stars and High Potential clusters, it seems as if the Stars customers tend to make more web purchases than the High Potential customers. The High Potential customers, on the other hand, tend to purchase heavily in stores. As expected, the Needs Attention and Leaky Buckets clusters have very low purchasing volume, with a large share of customers making zero catalog purchases.

Let's visualize the number of website visits in the past month from each cluster.

In [None]:
ax = sns.barplot(x = demographic_data.Cluster, y = demographic_data.WebVisits, palette = "muted")
ax.set_ylabel("Number of Website Visits", labelpad = 20)
ax.set_xlabel(None)
ax.set_title("Average Website Visits in the Last Month by Cluster")

Interestingly enough, the groups that make the most website visits are the groups that spend the least.

Let's look at more spending habits of our customers.

In [None]:
fig, ax = plt.subplots(1,2, figsize = (15,10))
sns.swarmplot(ax = ax[0], x = demographic_data.Has_Child, y = demographic_data.Spending)
sns.swarmplot(ax = ax[1], x = demographic_data.Marital_Status, y = demographic_data.Spending)

ax[0].set_ylabel("Spending [USD $]", labelpad = 20)
ax[1].set_ylabel(None)

for i in [0,1]:
    ax[i].set_xlabel(None)
    
plt.suptitle("Spending Habits by Customer Demographic", fontsize = 24)

It seems like on average, those without children spend more than those with children. What about the types of products the customers are buying?

In [None]:
temp = demographic_data.loc[:, ["Wine", "Fruit", "Meat", "Sweets", "Cluster"]]
temp = temp.groupby("Cluster").sum()
temp.head()

In [None]:
ax = temp.transpose().plot(kind = "barh", stacked = True, colormap = "Set2")
ax.set_xlabel("Spending [USD $]", fontsize = 20, labelpad = 25)
ax.set_title("Spending Habits by Cluster", fontsize = 22, pad = 25)

Seems like customers spend the most money on wine. Let's investigate this product category further.

In [None]:
ax = sns.scatterplot(data = demographic_data, x = "Income", y = "Wine", hue = "Has_Child")
ax.set_title("Income vs. Wine Spending", pad = 20)
ax.set_xlabel("Income [USD $]", fontsize = 20, labelpad = 20)
ax.set_ylabel("Wine Spending [USD $]", fontsize = 20, labelpad = 20)

In [None]:
demographic_data.loc[demographic_data.Has_Child == "No Child"].Wine.describe()

In [None]:
demographic_data.loc[demographic_data.Has_Child == "Has Child"].Wine.describe()

In [None]:
demographic_data.loc[demographic_data.Has_Child == "No Child"].Wine.sum()

In [None]:
demographic_data.loc[demographic_data.Has_Child == "Has Child"].Wine.sum()

As expected, we see a positive relationship between income and wine spending. However, it seems as if the customers without children spend more on average (per customer) than the customers with children, although those with children spend more on wine in total because there are more of them.
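
The per-customer versus total comparison can be read off a single grouped table. The sketch below uses made-up wine spending to show how a smaller group can have the higher mean while the larger group has the higher sum.

```python
import pandas as pd

# Made-up wine spending for two customers without and four with children.
toy = pd.DataFrame({
    "Has_Child": ["No Child", "No Child",
                  "Has Child", "Has Child", "Has Child", "Has Child"],
    "Wine": [600, 400, 200, 300, 250, 350],
})

# Count, per-customer mean, and total per group in one table.
wine_by_child = toy.groupby("Has_Child")["Wine"].agg(["count", "mean", "sum"])
print(wine_by_child)
```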

# Conclusions

Let's go back to our original guiding questions.

* **What are the main demographics of our customers?**

Our customers are primarily middle-aged, hold at least a bachelor's degree, and earn about 52k USD on average. A majority of them are in a relationship, and a majority have at least one child.
* **What are their spending habits?**

Our customers spend the most on meat and wine products. Those without children seem to spend more on average than those with children. It seems as if the highest spending customers prefer to shop in stores rather than online or from the catalog.
* **How can this analysis help our company's marketing campaign?**

Through this analysis we can better understand the spending habits and demographics of our customers. One area to focus on for marketing could be advertising the wine products with a focus on single / unmarried customers. Another focus could be on discounts and deals through the online store, since it seems like those who spend the least on products visit the website the most frequently.