<h1>Cluster Analysis Techniques for Mall Customers data</h1>

<h2>Kernel Plan</h2>
<ul>
    <li>Load the Dataset and Get Some Information About It</li>
    <li>Clean the Data</li>
    <li>EDA</li>
    <li>Preprocessing the Data</li>
    <li>Use Clustering Techniques</li>
    <li>Use KMeans Clustering</li>
</ul>

<h1>Load the Dataset and Get Some Information About It</h1>

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.preprocessing import LabelEncoder
import os
sns.set(context='notebook', style='darkgrid')
plt.style.use('ggplot')
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
seg_df=pd.read_csv('/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
seg_df.head()

In [None]:
seg_df.info()

In [None]:
seg_df.describe()

<h1>Clean the Data</h1>

In [None]:
# normalize column names: trim whitespace, lowercase, spaces -> underscores
seg_df.columns=seg_df.columns.str.strip().str.lower().str.replace(' ','_')
seg_df['gender']=seg_df['gender'].str.strip().str.lower()
customer_ids=seg_df['customerid']
seg_df=seg_df.drop('customerid', axis=1)

In [None]:
# cleaned dataframe
seg_df.head()

<h1>EDA</h1>

In [None]:
seg_df['gender'].value_counts().plot.bar()
plt.title('gender of customers')
plt.xlabel('gender')
plt.ylabel('frequency')
plt.show()

In [None]:
# empirical cumulative distribution function (ECDF):
# returns the sorted values and the fraction of observations <= each value
def cdf(lst):
    x=np.sort(lst)
    y=np.arange(1, len(x)+1)/len(x)
    return x, y
m_age=seg_df[seg_df['gender']=='male']['age']
f_age=seg_df[seg_df['gender']=='female']['age']
x_age, y_age=cdf(seg_df['age'])
x_m_age, y_m_age=cdf(m_age)
x_f_age, y_f_age=cdf(f_age)
fig, ax=plt.subplots(1,2)
ax[0].plot(x_age, y_age)
ax[0].set_title('cdf of ages')
ax[0].set_xlabel('age')

ax[1].plot(x_m_age, y_m_age, label='Male ages')
ax[1].plot(x_f_age, y_f_age, label='Female ages')
ax[1].set_title('cdf of male/female ages')
ax[1].set_xlabel('age')
ax[1].legend()

plt.show()

In [None]:
sns.set_palette('PRGn')
g=sns.boxplot(x='gender', y='age', data=seg_df)
g.set_title('Gender VS Age', y=1.03)

<p>The boxplot is much clearer than the CDF plot: the median, 75th percentile, and maximum age are higher for males than for females, the 25th percentile and minimum are higher for females, and there are no outliers in either group.</p>
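<p>The "no outliers" reading can also be checked numerically. The sketch below assumes the usual boxplot rule (points beyond 1.5 times the interquartile range from the quartiles count as outliers). The helper <code>iqr_outliers</code> and the toy DataFrame are hypothetical and are not the mall data; swap in <code>seg_df</code> to run it on the real columns.</p>

```python
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the values that fall outside the 1.5 * IQR boxplot whiskers."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical toy data for illustration only (NOT the mall dataset)
toy = pd.DataFrame({
    'gender': ['male'] * 4 + ['female'] * 4,
    'age':    [19, 35, 48, 70, 21, 30, 36, 68],
})

# Five-number summary per gender, the same quantities the boxplot draws
summary = toy.groupby('gender')['age'].describe()[['min', '25%', '50%', '75%', 'max']]
# Number of points beyond the whiskers per gender
n_outliers = toy.groupby('gender')['age'].apply(lambda s: len(iqr_outliers(s)))

print(summary)
print(n_outliers)
```

<p>With the real data, replacing <code>toy</code> by <code>seg_df</code> should give a count of zero for both genders if the boxplot reading above holds.</p>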

In [None]:
sns.set_palette('RdBu')
m_inc=seg_df[seg_df['gender']=='male']['annual_income_(k$)']
f_inc=seg_df[seg_df['gender']=='female']['annual_income_(k$)']
x_inc, y_inc=cdf(seg_df['annual_income_(k$)'])
x_m_inc, y_m_inc=cdf(m_inc)
x_f_inc, y_f_inc=cdf(f_inc)
fig, ax=plt.subplots(1,2)
ax[0].plot(x_inc, y_inc)
ax[0].set_title('cdf of annual income')
ax[0].set_xlabel('annual income')

ax[1].plot(x_m_inc, y_m_inc, label='Male annual incomes')
ax[1].plot(x_f_inc, y_f_inc, label='Female annual incomes')
ax[1].set_title('cdf of male/female annual incomes')
ax[1].set_xlabel('annual income')
ax[1].legend()

plt.show()

In [None]:
# let's get the mean and median spending score for males and females
# (string aggregator names avoid the pandas deprecation warning for np.mean/np.median)
pd.pivot_table(values='spending_score_(1-100)', index='gender', data=seg_df, aggfunc=['mean', 'median'])

<p>From the table above, the mean spending score of females is higher than their median, which suggests a tail of high values (above the upper limit) pulling the mean up. For males the mean is lower than the median, which suggests a tail of low values (below the lower limit) pulling the mean down. A gap between mean and median only hints at skew, so a boxplot is still needed to confirm actual outliers.</p>
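<p>The mean/median comparison can be backed by a skewness estimate. This is a minimal sketch on hypothetical toy scores (not the mall data), using pandas' built-in <code>skew</code> aggregation: a positive value goes with mean above median, a negative value with mean below median.</p>

```python
import pandas as pd

# Hypothetical toy scores for illustration only (NOT the mall dataset):
# the "female" group has one high value, the "male" group one low value
toy = pd.DataFrame({
    'gender': ['female'] * 5 + ['male'] * 5,
    'score':  [10, 20, 30, 40, 100, 1, 60, 70, 80, 90],
})

# Mean, median and sample skewness per group
stats = toy.groupby('gender')['score'].agg(['mean', 'median', 'skew'])
# Sign of this gap should agree with the sign of the skewness
stats['mean_minus_median'] = stats['mean'] - stats['median']

print(stats)
```

<p>On the real data the same aggregation can be run on <code>seg_df</code> with the <code>spending_score_(1-100)</code> column.</p>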

In [None]:
fig, ax=plt.subplots()
# sns.distplot is deprecated; kdeplot draws the same density curves
sns.kdeplot(x=m_inc, ax=ax, label='male')
sns.kdeplot(x=f_inc, ax=ax, label='female')
ax.set_title('distribution of female/male annual incomes')
ax.set_xlabel('annual income')
ax.legend()
plt.show()

In [None]:
# pearson correlation heatmap
# (restricted to numeric columns; 'gender' is still a string column here)
sns.heatmap(seg_df.select_dtypes('number').corr(),
            annot=True,
            cmap='YlGnBu',
            cbar=True)

In [None]:
# let's create age label column

def age_label(x):
    if x<=35:
        return 'youth'
    elif (x>35) and (x<=60):
        return 'old'
    else:
        return 'senior'
seg_df['age_label']=seg_df['age'].apply(age_label)
seg_df['age_label'].value_counts(normalize=True)

In [None]:
pd.pivot_table(values=['spending_score_(1-100)','annual_income_(k$)'], index='age_label', data=seg_df, aggfunc=['mean', 'median'])

In [None]:
sns.jointplot(data=seg_df, x='annual_income_(k$)', y='spending_score_(1-100)', kind='scatter')

<p>The scatter plot above shows five patterns:<br>
    annual income (40-60) ---> spending score (40-60)<br>
    annual income (20-40) ---> spending score (20-40)<br>
    annual income (20-40) ---> spending score (80-100)<br>
    annual income (80-140+) ---> spending score (20-40)<br>
    annual income (80-140+) ---> spending score (80-100)<br></p>

<h1>Preprocessing</h1>

In [None]:
df2=seg_df.copy()

In [None]:
cat_cols=['gender', 'age_label']
num_cols=['age', 'annual_income_(k$)', 'spending_score_(1-100)']

In [None]:
def create_dummies(df,column_name):
    # one-hot encode a categorical column (as floats) and drop the original
    dummies = pd.get_dummies(df[column_name],prefix=column_name,dtype=float)
    df=df.drop(column_name, axis=1)
    df = pd.concat([df,dummies],axis=1)
    return df
def normalize(df, col):
    # z-score standardization: zero mean, unit standard deviation
    df[col]=(df[col]-df[col].mean())/df[col].std()
    return df
for i in cat_cols:
    seg_df=create_dummies(seg_df, i)
for i in num_cols:
    seg_df=normalize(seg_df,i)

In [None]:
# dataframe after preprocessing
seg_df.head()

<h1>Use Clustering Techniques</h1>

<h2>Use Inertia to Measure Clustering Quality (KMeans Clustering)</h2>

In [None]:
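inertias=[]
ks=range(1,7)
for i in ks:
    # fixed seed and explicit n_init keep the curve reproducible across runs
    model=KMeans(n_clusters=i, n_init=10, random_state=42)
    model.fit(seg_df)
    inertias.append(model.inertia_)
# plot against the actual cluster counts, not the list index (0..5)
plt.plot(ks, inertias, marker='o')
plt.title('inertia measure for number of clusters')
plt.xlabel('number of clusters')
plt.ylabel('inertia')
plt.show()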

<p>I think 3 is the best choice for the number of clusters, judging by the elbow in the inertia plot.</p>
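<p>Reading an elbow is subjective, so a second criterion can help. The sketch below is an assumption on my part rather than part of the original analysis: it scores each candidate number of clusters with scikit-learn's <code>silhouette_score</code> (higher is better, range -1 to 1). It runs on synthetic blobs from <code>make_blobs</code>, not on the mall data; pass the preprocessed features instead of <code>X</code> to apply it here.</p>

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, well-separated blobs for illustration only (NOT the mall dataset)
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# The silhouette score is only defined for 2 or more clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Candidate with the highest average silhouette
best_k = max(scores, key=scores.get)
print(scores)
print('best k by silhouette:', best_k)
```

<p>If the silhouette and the elbow disagree on the real data, it is worth inspecting both candidate solutions before settling on one.</p>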

In [None]:
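# cluster on the feature columns only, so a re-run never uses old labels as input
features=seg_df.drop(columns='seg_labels', errors='ignore').astype(float)
model=KMeans(n_clusters=3, n_init=10, random_state=42)
seg_df['seg_labels']=model.fit_predict(features)
# project to 2-D with t-SNE (model_1), not with the KMeans model
model_1=TSNE(learning_rate=150, random_state=42)
trans=model_1.fit_transform(features)
x=trans[:,0]
y=trans[:,1]
plt.scatter(x,y, c=seg_df['seg_labels'])
plt.title('t-SNE projection colored by KMeans cluster')
plt.show()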

In [None]:
seg_df['seg_labels'].value_counts()