**Main Objectives:**

The goal of this project is to build several machine learning models that segment our customers by similar characteristics, and then to choose the best one based on how interpretable its clusters are and how well they represent our data. The work is broken down into the following milestones:  
1. Data Cleaning and Exploration.  
2. Modeling.  
3. Interpretation of clusters.  
4. Selection of best model.  

The best model could help the company target marketing strategies at specific groups of customers, improving their experience and satisfaction.

In [None]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt 
from sklearn.cluster import KMeans
import seaborn as sns
%matplotlib inline

In [None]:
df=pd.read_csv('../input/customer-personality-analysis/marketing_campaign.csv',header=0,sep = '\t')
df.head()

At first glance every feature looks suitable for modeling, but as we will see later, some features are wrongly encoded or typed in the raw file and are corrected during cleaning.

In [None]:
df.shape

In [None]:
df.isnull().sum()

Let's drop the rows containing null values:

In [None]:
df.dropna(inplace=True)

In [None]:
df.isnull().sum().sum()

In [None]:
df.dtypes

In [None]:
df.describe()

To see the distribution of each feature, a histogram is plotted for each of them:

In [None]:
df[['Year_Birth','Income','Kidhome','Teenhome', 'MntWines', 'MntFruits',
           'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
           'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
           'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
           'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
           'AcceptedCmp2', 'Complain', 'Z_CostContact', 'Z_Revenue']].hist(figsize=[13,12])

Although some features are highly skewed, all values were validated by the publisher. One value stands out, however: a customer whose income looks either mistyped or like an extreme outlier, while their spending is considerably low. This record will be dropped so that the data to be clustered is better represented.

In [None]:
df[df.Income>600000]

In [None]:
df.drop(index=2233,inplace=True)
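
Dropping by a hard-coded index label only works for this exact file. A minimal sketch of a label-independent alternative follows, using a small made-up frame: the `toy` data is hypothetical, and the 600000 cutoff simply mirrors the filter used above.

```python
import pandas as pd

# Toy data standing in for the real customer table (all values are made up).
toy = pd.DataFrame(
    {"Income": [52000.0, 61000.0, 666666.0, 38000.0],
     "Spending": [400, 900, 60, 150]},
    index=[10, 27, 2233, 41],
)

# Keep every row at or below the cutoff instead of dropping by index label.
income_cutoff = 600000
toy_clean = toy[toy["Income"] <= income_cutoff]

print(toy_clean.index.tolist())
```

The same boolean filter applied to `df` would remove the outlier regardless of which index label it carries.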

Dt_Customer must be converted to a date type so that we can later compute how long each person has been a customer:

In [None]:
# Assumption: the dates in this file are stored day-first (dd-mm-yyyy)
df['Dt_Customer']=pd.to_datetime(df['Dt_Customer'], dayfirst=True)

Let's see the education level of the customers displayed as a pie chart:

In [None]:
# value_counts() keeps the class names as the index, so they become the pie labels
pie1=df['Education'].value_counts()
pie1.plot(kind='pie', title='Pie chart of Education level',
          autopct='%1.1f%%', shadow=False, legend = False, fontsize=14, figsize=(12,12))

Let's see the marital status of the customers displayed as a pie chart:

In [None]:
# value_counts() keeps the class names as the index, so they become the pie labels
pie2=df['Marital_Status'].value_counts()
pie2.plot(kind='pie', title='Pie chart of Marital_Status',
          autopct='%1.1f%%', shadow=False, legend = False, fontsize=14, figsize=(12,12))

In [None]:
df.dtypes

The classes in categorical variables will be reduced as follows:

In [None]:
for x in ['Education','Marital_Status']:
    print(x,len(df[x].unique()))

In [None]:
for x in ['Education','Marital_Status']:
    print(x,df[x].unique())

The level of Education is encoded as an ordinal number, in ascending order:

In [None]:
# Ordinal encoding: one mapping instead of a chain of replace() calls
df.Education=df.Education.map({'Basic':0,'2n Cycle':1,'Graduation':2,'Master':3,'PhD':4})

In [None]:
df.Education.unique()

The classes in Marital_Status are summarized as Single or In couple:

In [None]:
df.Marital_Status=df.Marital_Status.replace(['Divorced','Widow','Alone','Absurd','YOLO'],'Single').replace(['Together','Married'],'In couple')

In [None]:
df.Marital_Status.unique()

In [None]:
df.head()

In [None]:
df.Dt_Customer.min()

In [None]:
df.Dt_Customer.max()

The customer's age is obtained by subtracting Year_Birth from 2014 (the year the analysis was done):

In [None]:
df.Year_Birth=(2014-df.Year_Birth)

In [None]:
df.rename(columns={'Year_Birth':'Age'},inplace=True)

Time as a customer is computed in months (approximated as 30 days each) by subtracting Dt_Customer from 2014-12-07 (the date the analysis was done):

In [None]:
df['Months being customer']=round(((pd.to_datetime('2014-12-07')-df.Dt_Customer).dt.days)/30,1)

In [None]:
df.drop(['Dt_Customer'],axis=1,inplace=True)

Number of children will be obtained by adding Kidhome and Teenhome:

In [None]:
df['N° children']=df['Kidhome']+df['Teenhome']

In [None]:
df.drop(['Kidhome','Teenhome'],axis=1,inplace=True)

Finally, Spending is obtained by adding MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts and MntGoldProds, which gives the total amount each customer has spent across all product types.

In [None]:
df['Spending']=df['MntWines']+df['MntFruits']+df['MntMeatProducts']+df['MntFishProducts']+df['MntSweetProducts']+df['MntGoldProds']

In [None]:
df.head()

Let's analyze the purchase channel so the supermarket knows how its customers buy and whether there is a pattern related to age. For this, a new feature called Age_class is created so we can analyze by group:

In [None]:
print('Number of web purchases: ',df.NumWebPurchases.sum())
print('Number of catalog purchases: ',df.NumCatalogPurchases.sum())
print('Number of store purchases: ',df.NumStorePurchases.sum())

In [None]:
df.Age.min(), df.Age.max()

In [None]:
bins = [18, 29, 39, 49, 59, 69, 130]
labels = ['18-29', '30-39', '40-49', '50-59', '60-69', '70+']
df['Age_class'] = pd.cut(df.Age, bins, labels = labels,include_lowest = True)

df[['Age','Age_class']].head()

In [None]:
df[['Age_class','NumWebPurchases','NumCatalogPurchases','NumStorePurchases','NumDealsPurchases']].groupby('Age_class').sum()

The table above shows no tendency for younger customers to purchase online: every age class shows the same preference, buying mostly in store, then on the web, and finally by catalog.
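
Raw sums are hard to compare across age classes of different sizes, so it can help to convert each row to shares. The sketch below does this on made-up totals; the `totals` frame is hypothetical and does not reproduce the real table.

```python
import pandas as pd

# Made-up purchase totals per age class (not the real results).
totals = pd.DataFrame(
    {"NumWebPurchases": [300, 900],
     "NumCatalogPurchases": [200, 600],
     "NumStorePurchases": [500, 1500]},
    index=["18-29", "50-59"],
)

# Divide each row by its own total so every age class sums to 1.
shares = totals.div(totals.sum(axis=1), axis=0)
print(shares.round(2))
```

If the shares look alike across rows, as in this toy case, the channel preference does not depend on age class.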

Another insight companies value is knowing more about the customers with the highest spending. Let's first define the group to study as those with spending above the third quartile; the following helps us find that value:

In [None]:
df['Spending'].describe()

In [None]:
sns.boxplot(x='Spending',data=df)

The threshold is $1048, so customers with spending above this value are described below:

In [None]:
df[df.Spending>1048].describe().T
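
The threshold can also be computed rather than typed in, which keeps the filter correct if the data changes. A minimal sketch on a made-up series (the values in `spending` are illustrative only):

```python
import pandas as pd

# Made-up spending values standing in for df['Spending'].
spending = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90], name="Spending")

# Third quartile (75th percentile) with pandas' default linear interpolation.
q3 = spending.quantile(0.75)

# Customers strictly above the third quartile.
top_spenders = spending[spending > q3]
print(q3, top_spenders.tolist())
```

On the real data, `df['Spending'].quantile(0.75)` should match the 75% row reported by `describe()`.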

The following plot will show the correlation of Spending with every other feature:

In [None]:
plt.figure(figsize=(5, 12))
heatmap = sns.heatmap(df[['Age', 'Education', 'Income', 'Recency',
                          'MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts',
                          'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases',
                          'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
                          'NumWebVisitsMonth', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5',
                          'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Z_CostContact',
                          'Z_Revenue', 'Months being customer', 'N° children', 'Spending']].corr()[['Spending']].sort_values(by='Spending', ascending=True), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Features Correlating with Spending', fontdict={'fontsize':18}, pad=22)
heatmap.set_ylim([0,24])

We can see a strong correlation with wines and meat products, but since all product amounts were already summed into Spending, that single feature represents them well. Income is also highly correlated, which makes sense: someone with a high income is expected to spend more on their purchases. As for the purchase channel (store, web or catalog, as we saw before), this information is crucial because it tells the supermarket how its customers prefer to buy, which helps improve their experience and simplify the process.  
Since this project aims to learn more about the customers, their needs and their behaviours, only some features are selected so that the generated clusters are easier to interpret.  
The dataframe is now cleaned and reduced to the following columns:

In [None]:
df[['Age','Education','Marital_Status','Income','N° children','Months being customer','Spending']].head()

The following 5 features are the most important for clustering customers by their characteristics, so we will look at their histograms before scaling:

In [None]:
# .copy() makes df2 independent of df, so adding cluster columns later is safe
df2=df[['Age','Education','Income','Months being customer','Spending']].copy()

In [None]:
df2.dtypes

In [None]:
cols=['Age','Education','Income','Months being customer','Spending']

In [None]:
df2[cols].hist(bins=30, figsize=(15,13))

In [None]:
df3=df2.copy(deep=True)

In [None]:
from sklearn.preprocessing import StandardScaler
ss=StandardScaler()
df3[cols]=ss.fit_transform(df2[cols])

In [None]:
df3[cols]
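
To make the scaling step concrete, the sketch below checks on a tiny made-up array that `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The array `X` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales (think age and income).
X = np.array([[20.0, 30000.0],
              [40.0, 50000.0],
              [60.0, 100000.0]])

# Manual z-score; np.std uses ddof=0 by default, as StandardScaler does.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```

After scaling, every column has mean 0 and unit variance, so no feature dominates the distance computations used by the clustering models.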

## Modeling

To compare the models, the number of clusters is set to 3, but this can be increased if a model can clearly differentiate more groups. The following models will be built and compared through the interpretation of their clusters.  
1. Hierarchical Agglomerative clustering with ward linkage.  
2. DBSCAN with suitable values for epsilon and min_samples.   
3. K-means.  

For the first model, the only hyperparameter set was ward linkage. The tables below show the centroid of each cluster, that is, the mean of each group. Early on we can say this model does not differentiate customers well by age, as all 3 cluster means fall between 43 and almost 47 years.

### Agglomerative clustering:

In [None]:
from sklearn.cluster import AgglomerativeClustering
ag = AgglomerativeClustering(n_clusters=3, linkage='ward', compute_full_tree=True)
ag = ag.fit(df3[cols])

In [None]:
# Reuse the labels from the fitted model instead of refitting for each frame
df2['agglom'] = ag.labels_
df3['agglom'] = ag.labels_

In [None]:
df3.groupby('agglom').mean()

In [None]:
df2.groupby('agglom').mean()

In [None]:
df2.agglom.value_counts()

In [None]:
import seaborn as sns
coloribm = {"Magenta 100":"2A0A16", "Magenta 90":"57002B", "Magenta 80":"760A3A", "Magenta 70":"A11950", "Magenta 60":"D12765", "Magenta 50":"EE538B", "Magenta 40":"FA75A6", "Magenta 30":"FFA0C2", "Magenta 20":"FFCFE1", "Magenta 10":"FFF0F6", "Purple 100":"1E1033", "Purple 90":"38146B", "Purple 80":"4F2196", "Purple 70":"6E32C9", "Purple 60":"8A3FFC", "Purple 50":"A66EFA", "Purple 40":"BB8EFF", "Purple 30":"D0B0FF", "Purple 20":"E6D6FF", "Purple 10":"F7F1FF", "Blue 100":"051243", "Blue 90":"061F80", "Blue 80":"0530AD", "Blue 70":"054ADA", "Blue 60":"0062FF", "Blue 50":"408BFC", "Blue 40":"6EA6FF", "Blue 30":"97C1FF", "Blue 20":"C9DEFF", "Blue 10":"EDF4FF", "Teal 100":"081A1C", "Teal 90":"003137", "Teal 80":"004548", "Teal 70":"006161", "Teal 60":"007D79", "Teal 50":"009C98", "Teal 40":"00BAB6", "Teal 30":"20D5D2", "Teal 20":"92EEEE", "Teal 10":"DBFBFB", "Gray 100":"171717", "Gray 90":"282828", "Gray 80":"3D3D3D", "Gray 70":"565656", "Gray 60":"6F6F6F", "Gray 50":"8C8C8C", "Gray 40":"A4A4A4", "Gray 30":"BEBEBE", "Gray 20":"DCDCDC", "Gray 10":"F3F3F3"} 
colors = []
colornum = 60
for i in [f'Blue {colornum}', f'Teal {colornum}', f'Magenta {colornum}', f'Purple {colornum}', f'Gray {colornum}']:
    colors.append(f'#{coloribm[i]}')
palette = sns.color_palette(colors)

In [None]:
from scipy.cluster import hierarchy

# Build the linkage matrix from the scaled data itself, not from ag.children_
Z = hierarchy.linkage(df3[cols], method='ward')

fig, ax = plt.subplots(figsize=(15,5))

# Some color setup
red = colors[2]
blue = colors[0]

hierarchy.set_link_color_palette([red, 'gray'])

den = hierarchy.dendrogram(Z, orientation='top', 
                           p=30, truncate_mode='lastp',
                           show_leaf_counts=True, ax=ax,
                           above_threshold_color=blue)
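
A linkage matrix like the one plotted above can also be cut into a fixed number of flat clusters with SciPy's `fcluster`. The sketch below shows this on made-up points; `points`, `Z_toy` and `flat` are hypothetical names, and the three tight groups are constructed on purpose.

```python
import numpy as np
from scipy.cluster import hierarchy

# Three tight, well-separated groups of five made-up 2-D points each.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(5, 2)),
    rng.normal(loc=10.0, scale=0.1, size=(5, 2)),
])

# Ward linkage on the observations; Z_toy has one row per merge (n - 1 rows).
Z_toy = hierarchy.linkage(points, method="ward")

# Cut the tree so that at most 3 flat clusters remain.
flat = hierarchy.fcluster(Z_toy, t=3, criterion="maxclust")
print(flat)
```

Cutting the tree at other values of `t` is a cheap way to explore whether more than 3 clusters would be interpretable, without refitting the model.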

### K-means:

First, the inertia was computed for each number of clusters from 1 to 19, giving the curve shown in the figure below: 

In [None]:
inertia = []
list_num_clusters = list(range(1,20))
for num_clusters in list_num_clusters:
    km = KMeans(n_clusters=num_clusters)
    km.fit(df3[cols])
    inertia.append(km.inertia_)
    
plt.plot(list_num_clusters,inertia)
plt.scatter(list_num_clusters,inertia)
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia')

There is not enough evidence to point to a clear elbow, because the inertia decreases gradually as the number of clusters increases.
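
When the elbow is ambiguous, the silhouette score is a common complementary check. This was not part of the original analysis; the sketch below only illustrates the idea on synthetic blobs, and `X_toy`, `scores` and `best_k` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (not the customer data).
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 10], [10, -10]],
    cluster_std=1.0,
    random_state=0,
)

# Silhouette needs at least 2 clusters, so start the scan at k=2.
scores = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels_k)

# The k with the highest mean silhouette is the best-separated choice.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Running the same loop on `df3[cols]` would give a second opinion on whether 3 clusters is a reasonable choice.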

In [None]:
km = KMeans(n_clusters=3, random_state=42)
km = km.fit(df3[cols])
df3['kmeans'] = km.predict(df3[cols])

In [None]:
km.inertia_

For three clusters the inertia was 6729. The mean and median of each feature per cluster are summarized in the second table below:

In [None]:
df3[['Age','Education','Income','Months being customer','Spending','kmeans']].groupby('kmeans').mean()

In [None]:
df2['kmeans'] = km.predict(df3[cols])

In [None]:
# String names avoid the deprecation warning raised when passing np.mean / np.median
df2[['Age','Education','Income','Months being customer','Spending','kmeans']].groupby('kmeans').agg(['mean','median'])

In [None]:
df3.kmeans.value_counts()

There is a notable difference between clusters, especially in age, spending, education and income. The number of records per cluster is also fairly balanced.

In the following figures the clusters can be clearly characterized because they are well separated from each other:

In [None]:
from mpl_toolkits.mplot3d import Axes3D

In [None]:
fig, ax = plt.subplots(figsize=(8,5))

# np.float was removed from NumPy; the built-in float is the replacement
scatter = ax.scatter(df2['Age'], df2['Income'], s=(df2['Spending']/12), c=km.labels_.astype(float), alpha=0.5)
legend1 = ax.legend(*scatter.legend_elements(), loc="upper right", title="Clusters")
#ax.add_artist(legend1)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(8,5))

scatter = ax.scatter(df2['Age'], df2['Income'], s=(df2['Education']**4), c=km.labels_.astype(float), alpha=0.5)
legend1 = ax.legend(*scatter.legend_elements(), loc="upper right", title="Clusters")
#ax.add_artist(legend1)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(8,5))

scatter = ax.scatter(df2['Income'], df2['Spending'], s=(df2['Education']**4), c=km.labels_.astype(float), alpha=0.5)
legend1 = ax.legend(*scatter.legend_elements(), loc="upper right", title="Clusters")
#ax.add_artist(legend1)
plt.xlabel('Income', fontsize=18)
plt.ylabel('Spending', fontsize=16)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(8,5))

scatter = ax.scatter(df2['Months being customer'], df2['Spending'], s=(df2['Income']/2000), c=km.labels_.astype(float), alpha=0.8)
legend1 = ax.legend(*scatter.legend_elements(), loc="upper right", title="Clusters")
#ax.add_artist(legend1)
plt.xlabel('Months being customer', fontsize=18)
plt.ylabel('Spending', fontsize=16)
plt.show()

In [None]:
fig = plt.figure(figsize = (8,9))
#ax = fig.add_subplot(1,1,1) 
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel('Months being customer', fontsize = 15)
ax.set_ylabel('Income', fontsize = 15)
ax.set_zlabel('Spending', fontsize = 15)
ax.set_title('3 component', fontsize = 20)
targets = [0, 1, 2]
colors = ['purple', 'c', 'y']
for target, color in zip(targets,colors):
    indicesToKeep = (df2['kmeans'] == target)
    ax.scatter(df2.loc[indicesToKeep, 'Months being customer']
               , df2.loc[indicesToKeep, 'Income']
               , df2.loc[indicesToKeep, 'Spending']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

Create a copy of the cleaned original dataset to analyze the clusters and their characteristics further:

In [None]:
df4=df.copy(deep=True)

In [None]:
df4['kmeans']=df2['kmeans']
df4.head()

One interesting insight about the clusters is which products they purchase. The table below shows the average amount spent on each product by cluster:

In [None]:
df4[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'kmeans']].groupby('kmeans').mean()

Wines and meat are still the top products, and cluster 1 alone accounts for 73% of total sales. As for the purchase channel, ‘In Store’ remains the most frequent, but cluster 1 also shows high rates for ‘Web’ and ‘Catalog’.
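
A share of total sales like the one quoted above can be obtained by dividing each cluster's summed spending by the grand total. The sketch below uses made-up numbers; the `toy` frame is hypothetical and does not reproduce the 73% figure.

```python
import pandas as pd

# Made-up cluster labels and spending (not the real results).
toy = pd.DataFrame({
    "kmeans": [0, 0, 1, 1, 2],
    "Spending": [100, 100, 700, 800, 300],
})

# Total spending per cluster divided by the overall total.
share = toy.groupby("kmeans")["Spending"].sum() / toy["Spending"].sum()
print(share)
```

The same expression on `df4` gives each K-means cluster's share of total sales.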

In [None]:
df4[['NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'Age', 'kmeans']].groupby('kmeans').mean()

The following tables show how many records in each cluster are single or in a couple, and how many children they have. Focusing on cluster 1, 94% of its customers have no children or only one, and 63% are in a couple.

In [None]:
pd.crosstab(df4['Marital_Status'], df4['kmeans'], rownames=['Marital_Status'], colnames=['kmeans'])

In [None]:
pd.crosstab(df4['N° children'], df4['kmeans'], rownames=['N° children'], colnames=['kmeans'])

### DBSCAN

For the last model, setting the hyperparameters was a major task, as it kept creating more than 5 clusters. The values finally used were epsilon=0.99 and min_samples=5. This produced five clusters plus 46 noise points, which are labeled as cluster ‘-1’, giving six segments in practice. Their centroids are in the table below: 

In [None]:
from sklearn.cluster import DBSCAN
from sklearn import metrics

In [None]:
# Compute DBSCAN
db = DBSCAN(eps=0.99, min_samples=5).fit(df3[cols])
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

In [None]:
print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
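
One common heuristic for choosing epsilon, not used in the original analysis, is to sort every point's distance to its k-th nearest neighbour and look for the bend in that curve. The sketch below computes those distances on random made-up data; `X_toy` and `k_dist` are hypothetical names.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random made-up data with 5 features, standing in for df3[cols].
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))

min_samples = 5

# Querying the training data returns each point as its own nearest neighbour
# (distance 0), which matches DBSCAN counting the point itself in min_samples.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_toy)
distances, _ = nn.kneighbors(X_toy)

# Sorted distance to the farthest of those neighbours, one value per point.
k_dist = np.sort(distances[:, -1])
print(np.percentile(k_dist, [50, 90, 95]))
```

Plotting `k_dist` and reading epsilon off the point where the curve bends sharply upward is usually quicker than trying values by hand.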

In [None]:
df2['dbscan']=labels
df3['dbscan']=labels

In [None]:
df3.dbscan.value_counts()

In [None]:
df2[['Age','Education','Income','Months being customer','Spending','dbscan']].groupby('dbscan').mean()

For this model we can say it differentiated a bit better than HAC by age and spending, but each cluster is still hard to explain. 

In [None]:
df3[['Age','Education','Income','Months being customer','Spending','dbscan']].groupby('dbscan').mean()

As in K-means let's see in the table below the average spending by type of product for each cluster:

In [None]:
df4['dbscan']=df2['dbscan']
df4[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'dbscan']].groupby('dbscan').mean()

The first two figures below plot age against income, with circle size encoding spending and education respectively. The clusters clearly can’t be told apart, and the last figure shows how tightly packed they actually are:

In [None]:
fig, ax = plt.subplots(figsize=(8,5))

# np.float was removed from NumPy; the built-in float is the replacement
scatter = ax.scatter(df2['Age'], df2['Income'], s=(df2['Spending']/25), c=labels.astype(float), alpha=0.7)
legend1 = ax.legend(*scatter.legend_elements(), loc="upper right", title="Clusters")
#ax.add_artist(legend1)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(8,5))

scatter = ax.scatter(df2['Age'], df2['Income'], s=(df2['Education']**3), c=labels.astype(float), alpha=0.7)
legend1 = ax.legend(*scatter.legend_elements(), loc="upper right", title="Clusters")
#ax.add_artist(legend1)
plt.xlabel('Age', fontsize=18)
plt.ylabel('Income', fontsize=16)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(8,5))

scatter = ax.scatter(df2['Income'], df2['Spending'], s=(df2['Education']**4), c=labels.astype(float), alpha=0.5)
legend1 = ax.legend(*scatter.legend_elements(), loc="upper right", title="Clusters")
#ax.add_artist(legend1)
plt.xlabel('Income', fontsize=18)
plt.ylabel('Spending', fontsize=16)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(8,5))

scatter = ax.scatter(df2['Months being customer'], df2['Spending'], s=(df2['Income']/2000), c=labels.astype(float), alpha=0.8)
legend1 = ax.legend(*scatter.legend_elements(), loc="upper right", title="Clusters")
#ax.add_artist(legend1)
plt.xlabel('Months being customer', fontsize=18)
plt.ylabel('Spending', fontsize=16)
plt.show()

In [None]:
fig = plt.figure(figsize = (8,9))
#ax = fig.add_subplot(1,1,1) 
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel('Months being customer', fontsize = 15)
ax.set_ylabel('Income', fontsize = 15)
ax.set_zlabel('Spending', fontsize = 15)
ax.set_title('3 component', fontsize = 20)
targets = [0, 1, 2, 3, 4, -1]
# One distinct color per label; 'c' was previously used for both clusters 1 and 3
colors = ['purple', 'c', 'y', 'g', 'm', 'k']
for target, color in zip(targets,colors):
    indicesToKeep = (df2['dbscan'] == target)
    ax.scatter(df2.loc[indicesToKeep, 'Months being customer']
               , df2.loc[indicesToKeep, 'Income']
               , df2.loc[indicesToKeep, 'Spending']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

To see how the records are distributed across the DBSCAN clusters, let's reduce the data to 3 components using PCA:

In [None]:
from sklearn.decomposition import PCA

PCA2=PCA(n_components=3)
PCA2=PCA2.fit(df3[cols])

In [None]:
PCA2.explained_variance_ratio_

In [None]:
PCA2.explained_variance_ratio_.sum()
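
The summed ratio above tells how much variance 3 components retain. A related check is the cumulative curve over all components, sketched below on random made-up data; `X_toy`, `pca_full` and the 0.80 target are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random made-up data with 5 features, standing in for df3[cols].
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 5))

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(X_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
n_needed = int(np.searchsorted(cumulative, 0.80) + 1)
print(cumulative.round(3), n_needed)
```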

In [None]:
pd.DataFrame(abs(PCA2.components_)*100)

In [None]:
PCA3=PCA(n_components=3)
PCA3=PCA3.fit_transform(df3[cols])

In [None]:
PCA3=pd.DataFrame(PCA3, columns = ['principal component 1', 'principal component 2', 'principal component 3'])

In [None]:
PCA3.head()

In [None]:
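# df3 lost rows during cleaning, so its index no longer matches PCA3's 0..n-1 index;
# reset it so the labels are concatenated row by row instead of by index label
PCA3_final = pd.concat([PCA3, df3['dbscan'].reset_index(drop=True)], axis = 1)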
PCA3_final.head()

In [None]:
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize = (8,9))
#ax = fig.add_subplot(1,1,1) 
ax = fig.add_subplot(111, projection='3d')
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_zlabel('Principal Component 3', fontsize = 15)
ax.set_title('3 component PCA', fontsize = 20)
targets = [0, 1, 2, 3, 4, -1]
colors = ['r', 'g', 'b', 'c', 'm', 'k']
for target, color in zip(targets,colors):
    indicesToKeep = (PCA3_final['dbscan'] == target)
    ax.scatter(PCA3_final.loc[indicesToKeep, 'principal component 1']
               , PCA3_final.loc[indicesToKeep, 'principal component 2']
               , PCA3_final.loc[indicesToKeep, 'principal component 3']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

Clearly the clusters can't be well differentiated, and it is also hard to explain what each one represents.

### Key Findings

The HAC and DBSCAN models did not represent our dataset evenly: their clusters were almost identical in some features and totally different in others. The goal here is to find the clearest differentiation between clusters, so that we can focus on the specific characteristics of the customers with the highest spending and on how to keep offering what low-income customers are looking for. The K-means figures shown above help us interpret the clusters and infer key information from them. For example, from the first and second K-means figures we can say:  
- Cluster 1 corresponds to the customers with the highest income and spending, who are also between 40 and 50 years old.  
- Cluster 2 represents people with middle income and spending who are over 50 years old.   
- Cluster 0 is made up of young people with low income and spending, who most likely have a 2n Cycle or undergraduate education.  

Another key insight is that the customers with spending above the third quartile, described in the earlier table, have characteristics considerably similar to cluster 1. Because of this, we could name this group ‘Golden’ or ‘Top’ customers.
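
The similarity between top-quartile spenders and cluster 1 could be quantified with a cross-tabulation. The sketch below shows the idea on made-up data; the `toy` frame and its `segment` column are hypothetical and the result does not describe the real clusters.

```python
import numpy as np
import pandas as pd

# Made-up spending and cluster labels (not the real results).
toy = pd.DataFrame({
    "Spending": [50, 80, 120, 300, 400, 1500, 1800, 2000],
    "kmeans":   [0, 0, 0, 2, 2, 1, 1, 1],
})

# Flag customers strictly above the third quartile of spending.
is_top = toy["Spending"] > toy["Spending"].quantile(0.75)
toy["segment"] = np.where(is_top, "top 25%", "rest")

# Row-normalized table: which clusters each segment falls into.
overlap = pd.crosstab(toy["segment"], toy["kmeans"], normalize="index")
print(overlap)
```

A value close to 1 in the "top 25%" row under cluster 1 would support naming that cluster the ‘Golden’ or ‘Top’ customers.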


### Suggestions

1. I personally found this to be, without a doubt, the most difficult subject in the program, so I would love to receive your feedback to improve it.    
2. It would be fascinating to see a better segmentation that is easier to represent, using features different from those selected in this project. I tried several combinations and this one gave the most notable difference.   
3. Spend more time on DBSCAN and on choosing the right hyperparameters to obtain fewer clusters, because it always created more than 5 and these were complex to explain.