# <center> $\color{darkblue}{\text{Customer Personality Analysis}}$ </center>

#### <center> $\color{darkblue}{\text{W. Touhami - Dec 2021}}$ </center>

______

__Problem Statement__

Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business to modify its product based on its target customers from different types of customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only on that particular segment.

__Target__

We need to perform clustering to summarize customer segments.

__Data__
We use a dataset from Kaggle.com


-----------

__Why is customer segmentation important__

All types of segmentation help the company to spend money efficiently. Segmentation helps to decide how resources should be allocated among the different marketing strategies.

__Improving products__

Customer segmentation helps the company to improve its products through a better understanding of who uses them.

Customer segmentation can also enable the company to craft more engaging messages. Sending customised messages to each segment generates more enthusiasm for the brand.



---------------------

### $\color{darkblue}{\text{Table of contents}}$

1. Data loading
1. Data Analysis
1. Data cleaning
1. Data Preprocessing
1. Clustering
1. Discussion
1. General conclusion and recommendations

_____________

## $\color{darkblue}{\text{Import Libraries}}$

In [4]:
pip install plotly



In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns 
import plotly.graph_objects as go
import plotly.express as px
import time


#  $\color{darkblue}{\text{1- Data loading}}$ 

Dataset of 2240 customers described with 29 features

__Features__

People
<ul>
<li> ID: Customer's unique identifier</li>
<li> Year_Birth: Customer's birth year</li>
<li> Education: Customer's education level</li>
<li> Marital_Status: Customer's marital status</li>
<li> Income: Customer's yearly household income</li>
<li> Kidhome: Number of children in customer's household</li>
<li> Teenhome: Number of teenagers in customer's household</li>
<li> Dt_Customer: Date of customer's enrollment with the company</li>
<li> Recency: Number of days since customer's last purchase</li>
<li>Complain: 1 if customer complained in the last 2 years, 0 otherwise</li>
 </ul>
Products
<ul>
<li>MntWines: Amount spent on wine in last 2 years</li>
<li>MntFruits: Amount spent on fruits in last 2 years</li>
<li>MntMeatProducts: Amount spent on meat in last 2 years</li>
<li>MntFishProducts: Amount spent on fish in last 2 years</li>
<li>MntSweetProducts: Amount spent on sweets in last 2 years</li>
<li>MntGoldProds: Amount spent on gold in last 2 years</li>
 </ul>
Promotion
<ul>
<li>NumDealsPurchases: Number of purchases made with a discount</li>
<li>AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise</li>
<li>AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise</li>
<li>AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise</li>
<li>AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise</li>
<li>AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise</li>
<li>Response: 1 if customer accepted the offer in the last campaign, 0 otherwise</li>
 </ul>
Place
<ul>
<li>NumWebPurchases: Number of purchases made through the company’s web site</li>
<li>NumCatalogPurchases: Number of purchases made using a catalogue</li>
<li>NumStorePurchases: Number of purchases made directly in stores</li>
<li>NumWebVisitsMonth: Number of visits to company’s web site in the last month</li>
 </ul>

In [None]:
df = pd.read_csv('marketing_campaign.csv',index_col=False,sep="\t")
print(f'Dataset with {df.shape[0]} entries and {df.shape[1]} columns')

# $\color{darkblue}{\text{2- Data analysis}}$

In [None]:
# Let us examine the data
#pd.set_option('display.max_columns', None)
data_name = 'Marketing Campaign'

print(f'Let us have a look at the dataset {data_name}:\n')
display(df.head())

  
print('Some stats:')
display(df.describe())
 

# Duplicates rows
duplicates = df.duplicated().sum()
if duplicates == 0:
    print("\n")
    print('There are no duplicated entries.')
else:
    print("\n")
    print(f'There are {duplicates} duplicates.')

   
# Missing values
data_missing = pd.DataFrame(df.isnull().sum())
if data_missing[0].sum() > 0:
    print('\n Missing values')
    data_missing.plot(kind='bar')
    plt.grid()
    plt.title('Missing values');
else:
    print(f'There are no missing values in "{data_name}".')


__Observations__ 

1. The dataset contains 2240 customers described with 29 features (including customer's ID)

1. Columns 'Z_CostContact' and 'Z_Revenue' each hold a single constant value, so they carry no relevant information; we will drop these 2 columns.

1. There are no duplicated rows.

1. There are some missing values (24) in column 'Income'.

1. There are some outliers in column 'Income' (for example, 666666) and in column 'Year_Birth', where the value 1893 also looks like an outlier.
1. 'Dt_Customer', which indicates the date a customer joined the database, is not parsed as DateTime.
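
The observations above can also be checked programmatically. The following is a minimal sketch on a toy frame with made-up values, not the real dataset; the helper names `quality_report` and `iqr_outliers` are hypothetical, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd


def quality_report(frame: pd.DataFrame) -> dict:
    """Summarize constant columns, missing values and duplicated rows."""
    constant_cols = [c for c in frame.columns if frame[c].nunique(dropna=False) == 1]
    missing = frame.isnull().sum()
    return {
        "constant_columns": constant_cols,
        "missing_per_column": missing[missing > 0].to_dict(),
        "duplicated_rows": int(frame.duplicated().sum()),
    }


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    mask = (series < q1 - k * iqr) | (series > q3 + k * iqr)
    return series[mask]


# Toy frame (hypothetical values) mimicking the issues listed above
toy = pd.DataFrame({
    "Income": [52000.0, 61000.0, np.nan, 48000.0, 666666.0, 57000.0],
    "Year_Birth": [1975, 1982, 1968, 1893, 1990, 1979],
    "Z_CostContact": [3, 3, 3, 3, 3, 3],
})

report = quality_report(toy)
income_outliers = iqr_outliers(toy["Income"].dropna())
print(report)
print(income_outliers)
```

On the real data, the same helpers could be called on `df` directly, for example `quality_report(df)`.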


# $\color{darkblue}{\text{3- Data cleaning}}$

In [None]:
# Copy dataframe
data = df.copy()
data.shape

In [None]:
# Drop missing data
data = data.dropna()
print("The total number of data-points after removing the rows with missing values are:", len(data))

In [None]:
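# Convert Dt_Customer to DateTime
# dayfirst=True assumes the dates are stored as dd-mm-yyyy; adjust if the file uses another format
data["Dt_Customer"] = pd.to_datetime(data["Dt_Customer"], dayfirst=True)
data["Dt_Customer"].dtype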

In [None]:
#Categorical features
print("Total categories in the feature Marital_Status:\n", data["Marital_Status"].value_counts(), "\n")
print("Total categories in the feature Education:\n", data["Education"].value_counts())

__Some modifications__

1. Extract the "Age" of a customer from "Year_Birth", the customer's birth year.
1. Create a feature "Spent" indicating the total amount spent by the customer across all product categories.
1. Modify feature "Marital_Status": <ol>
                                    <li> Single/Absurd/Divorced/Widow/Alone/YOLO --> Single:1 
                                    <li> Married/Together --> Married:2 
                                    </ol>
1. Create a feature "Children" to indicate the total number of children (kids and teenagers).
1. Create a feature "FamilySize" to indicate the total number of persons in the household (adults and children).

1. Simplify feature "Education": <ol>
                                 <li> Basic/2n Cycle --> Undergraduate 
                                 <li> Graduation --> Graduate 
                                 <li> Master/PhD --> Postgraduate 
                                 </ol>
1. Create feature "Enrollment": convert the enrollment date 'Dt_Customer' into the number of years since registration.
1. Drop the outlier rows where Age > 100 or Income > 600000.
1. Drop unnecessary columns: 'Z_CostContact', 'Z_Revenue', 'Year_Birth', 'Dt_Customer'.


In [None]:
# 1. Extract Age
data['Age'] = 2021 - data['Year_Birth']

In [None]:
# 2. Create 'Spent' feature
data['Spent'] = data['MntFishProducts'] + data['MntFruits'] + data['MntGoldProds'] + data['MntMeatProducts'] + data['MntSweetProducts'] + data['MntWines']

In [None]:
# 3. Modify 'Marital_Status' feature: 1 = single adult, 2 = couple
marital_codes = {'Married':2,'Together':2,'Alone':1,'YOLO':1,'Single':1,'Absurd':1,'Widow':1,'Divorced':1}
data['Marital_Status'] = data['Marital_Status'].map(marital_codes)

In [None]:
# 4. Add feature children
data['Children'] = data['Kidhome'] + data['Teenhome']

In [None]:
# 5. Add feature FamilySize: number of adults (Marital_Status is coded 1 or 2) plus children
data['FamilySize'] = data['Children'] + data['Marital_Status']

In [None]:
# 6. Simplify 'Education' and replace the column by the ordinal feature "EducationGrad"
# The grouping is applied to df as well, because the bar charts further down read df['Education']
education_groups = {'Basic':'Undergraduate','2n Cycle':'Undergraduate','Graduation':'Graduate','Master':'Postgraduate','PhD':'Postgraduate'}
df['Education'] = df['Education'].replace(education_groups)
# The assignment aligns on the index, so only the rows kept in data are used
data['EducationGrad'] = df['Education'].map({'Undergraduate':0,'Graduate':1,'Postgraduate':2})

data.drop(columns=['Education'],inplace=True)

In [None]:
# 7. Convert the enrollment date 'Dt_Customer' into the number of years since enrollment
data['Enrollment'] = 2021 - data['Dt_Customer'].dt.year

In [None]:
# 8. drop outliers examples
indexNames = data[ data['Age'] > 100 ].index
data.drop(indexNames , inplace=True)
data.drop(data[ data['Income'] > 600000 ].index , inplace=True)

In [None]:
# 9. drop non necessary columns
data.drop(columns=['Z_CostContact','Z_Revenue','Year_Birth', 'Dt_Customer'],axis=1,inplace=True)

In [None]:
print('The cleaned dataset:\n')
display(data.head())

# Some graphs to illustrate data

In [None]:
# Scatter plots of selected feature pairs, given as (x, y)
pairs = [('Income','Spent'), ('Age','Spent'), ('Marital_Status','Spent'),
         ('Income','MntWines'), ('Income','MntFruits'), ('Income','MntMeatProducts'),
         ('Income','MntFishProducts'), ('Income','MntSweetProducts'), ('FamilySize','Spent'),
         ('NumDealsPurchases','Spent'), ('NumDealsPurchases','Income'), ('NumDealsPurchases','FamilySize')]

plt.figure(figsize=(18,18))
for k, (x_col, y_col) in enumerate(pairs, start=1):
    plt.subplot(4,3,k)
    plt.scatter(data[x_col],data[y_col],c='blue',alpha=0.5,lw=.2)
    plt.xlabel(x_col)
    plt.ylabel(y_col)

In [None]:
# Correlation between features
plt.figure(figsize=(16,9))
ax = sns.heatmap(data.corr(),annot = True,cmap = 'YlGnBu')
plt.show()

In [None]:
def dist_feat(dataset, columns_list, rows, cols, suptitle, size=(16,8), y=0.5):
    """Plot one histogram per column, with the mean (red) and the median (green)."""
    fig, axs = plt.subplots(rows, cols,figsize=size)
    fig.suptitle(suptitle,y=y, size=16)
    axs = axs.flatten() 
    for i, col in enumerate(columns_list):
        mean, median = dataset[col].mean(), dataset[col].median()
        graph = sns.histplot(dataset[col], ax=axs[i])
        graph.axvline(mean, c='red',label='mean')
        graph.axvline(median, c='green',label='median')
        graph.legend()
    # Hide the axes left unused when the grid has more cells than columns
    for ax in axs[len(columns_list):]:
        ax.set_visible(False)

col_list = ['Income', 'Kidhome','MntWines', 'MntFruits','MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
            'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases','NumCatalogPurchases', 'NumStorePurchases',
             'NumWebVisitsMonth']     
dist_feat(dataset=data, columns_list=col_list, 
            rows=5, cols=3, suptitle='Distribution for each variable', size=(18,28), y=1.0)

#### $\color{blue}{\text{Observations}}$ 

1. Feature 'Income' has a nearly identical mean and median, unlike the other features.
2. Most features have a right-skewed distribution, except 'NumStorePurchases' and 'NumWebVisitsMonth'. Customers prefer buying on the web or in store rather than through the catalog.
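
The gap between mean and median can be quantified with the sample skewness. Below is a small sketch on synthetic data (not the marketing dataset) comparing a symmetric and a right-skewed variable; the column names and distribution parameters are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins: a roughly symmetric variable and a right-skewed one
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=50, scale=10, size=2000),
    "right_skewed": rng.exponential(scale=100, size=2000),
})

# For a right-skewed variable the mean sits above the median and skewness is positive
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skewness": toy.skew(),
})
print(summary.round(2))
```

Applied to the cleaned data, `data[col_list].skew()` would give the same kind of summary for the features plotted above.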

In [None]:
col_cat_list = ['Education', 'Marital_Status']

fig, axs = plt.subplots(1, 2,figsize=(14,8))
fig.suptitle('Bar chart for education types and marital status',size=16)
axs = axs.flatten() 
    
for i, col in enumerate(col_cat_list):
    # Number of customers per category (bars, left axis)
    df_c = df.groupby(col).count().reset_index()
    pal = sns.color_palette("Greens_d", len(df_c))
    rank = df_c['ID'].argsort().argsort() 
    bar_colors = [pal[::-1][r] for r in rank]
        
    g = sns.barplot(x=df_c[col], y=df_c['ID'], ax=axs[i], palette=bar_colors)
    for index, row in df_c.iterrows():
        g.text(row.name,row.ID, round(row.ID,2), color='black', ha="center")
            
    # Mean income per category (red line, right axis)
    df_c = df.groupby(col)['Income'].mean().reset_index()
    ax_income = axs[i].twinx()
    sns.lineplot(x=df_c[col], y=df_c['Income'], ax=ax_income, color='red', label='mean income')
        


### $\color{blue}{\text{Observations}}$

1. From the plot above we can see that customers with a Postgraduate degree (Master and PhD) have the highest income compared to the other education types. Clients with an 'Undergraduate' education have a lower income on average.

2. Marital status: customers with marital status "Absurd" have the highest income, which should be investigated carefully. The lowest income corresponds to "Alone" and "YOLO" customers: such clients may be too young or too old to earn much money.
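
One way to investigate a surprising group mean such as the "Absurd" category is to look at the group size next to it, since a mean computed on very few customers is not reliable. A minimal sketch with made-up values:

```python
import pandas as pd

# Toy data (hypothetical values): one rare category with a high income
toy = pd.DataFrame({
    "Marital_Status": ["Married", "Married", "Married", "Single", "Single", "Absurd"],
    "Income": [50000, 60000, 55000, 40000, 45000, 80000],
})

# Group size and mean income side by side, highest mean first
by_status = (
    toy.groupby("Marital_Status")["Income"]
       .agg(customers="count", mean_income="mean")
       .sort_values("mean_income", ascending=False)
)
print(by_status)
```

In this toy example the top category contains a single customer, so its mean says little about the category as a whole.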


# $\color{darkblue}{\text{4- Data preprocessing}}$

1. Data normalization
1. Reducing features dimensions using PCA method

In [None]:
# 1. Data Normalization
from sklearn.preprocessing import StandardScaler

# Note: data still contains the 'ID' column at this point, so it is scaled with the other features
scaler = StandardScaler()
scaled_data = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print("All features are now scaled")

In [None]:
from sklearn.decomposition import PCA

In [None]:
# 2. Reduce the features to 4 dimensions with PCA

pca = PCA(n_components=4)
pca.fit(scaled_data)
df_pca = pd.DataFrame(pca.transform(scaled_data), columns=["f1","f2", "f3","f4"])
print('Results after PCA method')
print(df_pca.head())
display(df_pca.describe().T)


# Plot the first three components in a 3D scatter plot (plotly)
x =df_pca["f1"]
y =df_pca["f2"]
z =df_pca["f3"]

fig = go.Figure(data=[go.Scatter3d(
    x=x,y=y,z=z,mode='markers',
    marker=dict(size=6,color=x,opacity=0.8))])

# tight layout
fig.update_layout( title={'text': "3D scatterplot of size-reduced data",'y':0.9,
        'x':0.5,'xanchor': 'center','yanchor': 'top'},
                  margin=dict(l=200, r=220, b=0, t=0))
# Hide the colour scale before rendering the figure
fig.update(layout_coloraxis_showscale=False)
fig.show()
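
A common way to judge how many components to keep is the cumulative explained variance. The sketch below uses synthetic data built from 3 hidden factors (an assumption made for illustration, not a property of the marketing dataset), and the 90% threshold is an arbitrary example value.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 10 observed features driven by 3 hidden factors plus a little noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and accumulate the share of variance explained
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3))
print("Components needed for 90% of the variance:", n_for_90)
```

For the notebook's own data, `pca.explained_variance_ratio_.cumsum()` on the fitted 4-component model would show how much variance the four retained components cover.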

# $\color{darkblue}{\text{5- Clustering}}$

Clustering is an __unsupervised learning__ problem. It is used as a technique for segmenting data into groups.

There are many clustering algorithms, for example KMeans, Agglomerative Clustering and DBSCAN.

__Elbow method__: in clustering, it means one should choose the number of clusters at which adding another cluster no longer gives a much better model of the data.

The __KElbowVisualizer__ implements the “elbow” method to help data scientists select the optimal number of clusters by fitting the model with a range of values for 𝐾.
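
To make the idea behind the elbow concrete, the quantity being tracked can be computed by hand. The sketch below uses plain scikit-learn on synthetic blobs with 4 known centres (made-up data) and KMeans inertia, the within-cluster sum of squared distances, as the score; it illustrates the principle and is not the visualizer's internal code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups (hypothetical centres)
X, _ = make_blobs(
    n_samples=600,
    centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
    cluster_std=0.8,
    random_state=0,
)

ks = list(range(1, 9))
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Relative drop in inertia when going from k to k+1 clusters
drops = [(inertias[i] - inertias[i + 1]) / inertias[i] for i in range(len(ks) - 1)]

for k, inertia in zip(ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

The elbow is the value of k after which the relative drop becomes small; in this toy example the drop from 3 to 4 clusters is large and the drop from 4 to 5 is small.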

In [None]:
 
import sklearn.cluster as cluster
from yellowbrick.cluster import KElbowVisualizer
from sklearn.metrics import silhouette_score 
from sklearn.metrics import calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score

In [None]:
# Instantiate the clustering model and visualizer
model = cluster.KMeans()
visualizer = KElbowVisualizer(model, k=10)

# Fit the data to the visualizer
visualizer.fit(df_pca) 

# Finalize and render the figure
visualizer.show()       


In [None]:
plot_kwds = {'alpha' : 0.25, 's' : 50, 'linewidths':0}
clusters_series = []
score_silh = []
score_hara = []
score_bould =[]

algorithms = [cluster.KMeans,cluster.AgglomerativeClustering,cluster.DBSCAN]
args = [ (), (), ()]
kwds = [{'n_clusters':4}, {'n_clusters':4, 'linkage':'ward'}, {'eps':0.605}]

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(16, 10), sharex=True, sharey=True)
#axs = axs.flatten()

for j, i in zip(range(len(algorithms)),axs.ravel()):
        algorithm = algorithms[j]
        start_time = time.time()
        labels = algorithm(*args[j], **kwds[j]).fit_predict(df_pca)
        end_time = time.time()
        clusters_series.append(labels)
        #plotting
        palette = sns.color_palette('deep', np.unique(labels).max() + 1)
        colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in labels]
        if j < len(algorithms):       
                i.scatter(df_pca.iloc[:,0], df_pca.iloc[:,1],c=colors,  **plot_kwds)
                i.set_title('Clusters found by {}'.format(str(algorithm.__name__)), fontsize=14)
                i.text(4, 6, 'Time: {:.2f} s'.format(end_time - start_time), fontsize=10)
        # Calculate cluster validation metrics  
        score_s = silhouette_score(df_pca, labels, metric='euclidean')
        score_silh.append(score_s)
        score_c = calinski_harabasz_score(df_pca, labels)
        score_hara.append(score_c)
        score_d = davies_bouldin_score(df_pca, labels)
        score_bould.append(score_d)

# Print the three validation scores for each algorithm
algo_names = ["KMeans", "Agglomerative", "DBSCAN"]
for name, s, c, d in zip(algo_names, score_silh, score_hara, score_bould):
    print("-----------------------------------------")
    print(f"{name} Algo:")
    print('Silhouette Score: %.4f' % s)
    print('Calinski Harabasz Score: %.4f' % c)
    print('Davies Bouldin Score: %.4f' % d)




# Store the labels of each algorithm next to the PCA components and the cleaned data
df_pca["clusters_kmeans"], df_pca["clusters_agglom"],df_pca["clusters_dbscan"] =   clusters_series[0],clusters_series[1],clusters_series[2]
data["clusters_kmeans"], data["clusters_agglom"],data["clusters_dbscan"] =   clusters_series[0],clusters_series[1],clusters_series[2]

We can see that __K-means__ outperforms the other clustering algorithms on all cluster validation metrics.
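
When comparing these metrics, keep in mind that they do not point in the same direction: a higher silhouette score and a higher Calinski-Harabasz score are better, while a lower Davies-Bouldin score is better. The sketch below illustrates this on synthetic blobs (made-up data) by comparing a KMeans labelling with a purely random one; `validation_scores` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic data with 4 well-separated groups (hypothetical centres)
X, _ = make_blobs(
    n_samples=400,
    centers=[[0, 0], [8, 0], [0, 8], [8, 8]],
    cluster_std=1.0,
    random_state=1,
)

good_labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)
random_labels = np.random.default_rng(1).integers(0, 4, size=len(X))


def validation_scores(X, labels):
    """Return the three cluster validation metrics used in this notebook."""
    return {
        "silhouette": silhouette_score(X, labels),                # higher is better, in [-1, 1]
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
    }


good = validation_scores(X, good_labels)
bad = validation_scores(X, random_labels)
print("KMeans labels:", good)
print("Random labels:", bad)
```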

# $\color{darkblue}{\text{6- Discussion}}$

Let us focus on KMeans and analyse the results.

In [None]:
plt.figure(figsize=(16,8))

sns.scatterplot(data=data, x="Spent", y="Income", hue="clusters_kmeans", palette='viridis_r')
plt.title("Kmeans Clustering")
plt.legend()
plt.suptitle("Cluster's Profile Based On Income And Spending", y=0.95);


## $\color{darkblue}{\text{Clusters analysis}}$ 

In [None]:
ct = data['clusters_kmeans'].value_counts()
ct.shape

In [None]:
plt.figure()
pl = sns.countplot(x=data["clusters_kmeans"],palette='viridis_r')
pl.set_title("Distribution Of The Clusters")

plt.show()

In [None]:
plt.figure(figsize=(18,10))
# Set each title on the subplot's own axes
ax = plt.subplot(2,2,1)
sns.countplot(x=data['Marital_Status'],hue=data['clusters_kmeans'],palette="viridis_r")
ax.set_title("Distribution Of Marital Status")
ax = plt.subplot(2,2,2)
sns.countplot(x=data['FamilySize'],hue=data['clusters_kmeans'],palette="viridis_r")
ax.set_title("Distribution Of Family Size")
ax = plt.subplot(2,2,3)
sns.countplot(x=data['Kidhome'],hue=data['clusters_kmeans'],palette="viridis_r")
ax.set_title("Distribution Of Kids")
ax = plt.subplot(2,2,4)
sns.countplot(x=data['Teenhome'],hue=data['clusters_kmeans'],palette="viridis_r")
ax.set_title("Distribution Of Teenagers")

### __Observations:__
1. Middle income, middle to high spending
1. Few offers are accepted
1. Catalog purchases are used less than web and store purchases

# $\color{darkblue}{\text{Clusters interpretation}}$

In [None]:
data.drop(columns=['ID','clusters_agglom','clusters_dbscan'],inplace=True)

In [None]:
data = data.rename(columns={'clusters_kmeans':'Cluster'})

In [None]:
# Clusters interpretation
sns.set(style="ticks",rc={"axes.spines.left": True,'axes.facecolor':'black', 'figure.facecolor':'black', 'axes.grid' : False, 'font.family': 'Dejavu Sans'})


# One figure per feature, with one histogram panel per cluster
for col in data.columns.drop('Cluster'):
    g = sns.FacetGrid(data, col = "Cluster", hue = "Cluster", palette = "Set2")
    g.map(plt.hist, col, bins=10, ec="k") 
    # Red reference line: mean of the feature over the whole dataset
    g.refline(x=data[col].mean(), color='red')
    g.set_xticklabels(rotation=30, color = 'white')
    g.set_yticklabels(color = 'white')
    g.set_xlabels(size=15, color = 'white')
    g.set_titles(size=15, color = '#FFC300', fontweight="bold")
    g.fig.set_figheight(5)


<table>
  <thead>
    <tr>
      <th>Cluster 0</th>
      <th>Cluster 1</th>
      <th>Cluster 2</th>
      <th>Cluster 3</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Low income, below the dataset's mean</td>
      <td>Middle income, cluster mean above the dataset's mean</td>
      <td>Middle income</td>
      <td>High income, well above the dataset's mean</td>
    </tr>
    <tr>
      <td>Low spending, below the dataset's mean</td>
      <td>Middle spending, cluster mean above the dataset's mean</td>
      <td>Middle spending</td>
      <td>High spending, well above the dataset's mean</td>
    </tr>
    <tr>
      <td>Rather young</td>
      <td>Uniformly distributed ages</td>
      <td>Rather old, mean age above the dataset's mean</td>
      <td>Uniformly distributed ages</td>
    </tr>
    <tr>
      <td>Majority Graduate and Postgraduate</td>
      <td>Majority Graduate</td>
      <td>Majority Postgraduate</td>
      <td>Majority Graduate and Postgraduate</td>
    </tr>
    <tr>
      <td>Mostly web/store purchases</td>
      <td>Mostly store purchases</td>
      <td>Mostly web/store purchases</td>
      <td>Mostly web/store purchases</td>
    </tr>
    <tr>
      <td>Few catalog purchases</td>
      <td>Few catalog purchases</td>
      <td>Few catalog purchases</td>
      <td>Few catalog purchases</td>
    </tr>
    <tr>
      <td>Few offers accepted</td>
      <td>Few offers accepted</td>
      <td>Few offers accepted</td>
      <td>Most offers accepted</td>
    </tr>
  </tbody>
</table>
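
Statements such as "below the dataset's mean" can be backed by numbers by dividing each cluster mean by the overall mean. A minimal sketch on a toy frame with made-up values (the column names follow the notebook, the numbers do not come from the dataset):

```python
import pandas as pd

# Toy frame (hypothetical values) with the cluster label and two features
toy = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 2, 2, 3, 3],
    "Income": [20000, 30000, 50000, 60000, 45000, 55000, 80000, 90000],
    "Spent": [50, 100, 600, 700, 500, 600, 1500, 1700],
})

features = ["Income", "Spent"]
cluster_means = toy.groupby("Cluster")[features].mean()
overall_means = toy[features].mean()

# Ratio above 1: cluster mean above the dataset mean; below 1: under it
relative = cluster_means / overall_means
print(relative.round(2))
```

The same three lines could be applied to `data` with any list of numeric features to fill in a table like the one above.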

## $\color{darkblue}{\text{Products distribution}}$

In [None]:
# Creating a new dataset 
clusters_products = data[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'Cluster']]    # Select variables
clusters_products = clusters_products.rename(columns={'MntWines':'Wines','MntFruits':'Fruits','MntMeatProducts':'Meat',
'MntFishProducts':'Fish','MntSweetProducts':'Sweet','MntGoldProds':'Gold'})

clusters_products1 = clusters_products.groupby(['Cluster'])
clusters_products2 = clusters_products1.agg({'Wines':'sum', 'Fruits':'sum', 'Meat':'sum', 'Fish':'sum', 'Sweet':'sum', 'Gold':'sum'})

clusters_products3 = clusters_products2.stack().reset_index(name='Count').rename(columns={'level_1':'Products'})   # Reshape from wide to long format (the opposite of pivoting)

clusters_products3['group'] = clusters_products3['Cluster']
clusters_products3['group'] = clusters_products3['group'].astype(str)

# Rename values
#clusters_products3['group'] = clusters_products3['group'].str.replace('0', 'Good Customers')
#clusters_products3['group'] = clusters_products3['group'].str.replace('1', 'Elite Customers')
#clusters_products3['group'] = clusters_products3['group'].str.replace('2', 'Economical Customers')
#clusters_products3['group'] = clusters_products3['group'].str.replace('3', 'Cheap Customers')

products = clusters_products3.copy()
products = products.assign(ratio=products.groupby('group').Count.transform(lambda x: x / x.sum()))

# Visualization
fig = px.bar(products, x='group', y='ratio', color='Products',
             labels={
                     "ratio": "Ratio",
                     "group": "Consumer's type"
                     },
             color_discrete_map={
                     'Gold': '#FFD700',
                     'Fish': '#87CEEB',
                     'Wines': '#b11226',
                     'Meat': '#f08080',
                     'Sweet': '#FF69B4',
                     'Fruits': 'lightgreen'},
                title="Products Distribution by Clusters")

fig.layout.yaxis.tickformat = ',.0%'

fig.update_traces(marker_line_color='white', marker_line_width=1, opacity=0.8)

fig.update_xaxes(showgrid=False)
fig.update_yaxes(showgrid=False)

fig.update_layout(
    {'plot_bgcolor': 'black',
    'paper_bgcolor': 'black'
    },
    font=dict(
        family="verdana",
        size=21,
        color="white"
    ),
    width=680,
    height=800,
    title_font_color="#FFC300",
    yaxis_title=None,
    xaxis_title=None
)
fig.update(layout_coloraxis_showscale=False)
fig.show()

__Observations__

1. Wine and meat are the most popular products amongst all the groups


# $\color{darkblue}{\text{7- General conclusion and recommendations}}$

We used a dataset of 2240 customers described by 28 features (plus the customer ID).

After some data analysis and cleaning, we used the PCA method to reduce the feature dimensions and several clustering methods to segment the customers into 4 clusters (the number recommended by the Elbow method).

We notice that, in general, the more customers earn, the more they spend. The deals offered by the company are not widely used, apart from the wealthy class. The majority of shopping is done either in store or online.
We also note that wine and meat remain the most consumed products across all classes.

To conclude this project, we offer some recommendations for the company based on our analysis:

It would be beneficial to offer a greater variety of the most purchased products (meat and wines) or special deals in these categories.

The company can also offer more promotions for products that do not sell well (fish, fruits, ...) to encourage consumers to buy more.

Most of the consumers have been enrolled in a company membership for a long period of time. To retain and reward loyal clients, the company can offer coupons to active, long-time customers.
