In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

Customer segmentation is the process of dividing customers into groups based on common characteristics so companies can market to each group more effectively. The goal of segmenting customers is to decide how to relate to customers in each segment in order to maximize the value of each customer to the business.

I will be using the Mall Customers data set from Kaggle, and seaborn visualizations to segment the customers.

In [None]:
customers = pd.read_csv('../input/Mall_Customers.csv')
customers.head()

As you can see, the dataset has five features:
1. CustomerID
2. Genre
3. Age
4. Annual Income (k$)
5. Spending Score (1-100)


The values Male and Female clearly describe gender, not genre. Let us fix this by renaming the `Genre` column to `Gender`.

In [None]:
customers.rename(columns={'Genre': 'Gender'}, inplace=True)

In [None]:
customers.describe()
#customers.head()

As the `count` row shows, the numeric columns have no missing values. The descriptive statistics also give us an idea of what we're dealing with. Now that everything is okay, let us **explore** the data using visualizations.
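
Since `describe()` only summarizes numeric columns, a direct count of nulls per column is a more explicit check. Below is a minimal sketch of that check. The frame `sample` is a small made-up stand-in for `customers` (its values are illustrative, not taken from the data set), with two gaps added on purpose so the output is not all zeros.

```python
import pandas as pd

# Made-up stand-in for the real `customers` frame (illustrative values only).
sample = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', None],
    'Age': [19, 21, None, 35],
    'Annual Income (k$)': [15, 15, 16, 17],
})

# Count missing entries per column; all zeros would mean nothing is missing.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

On the real data, the equivalent call would be `customers.isnull().sum()`, which also covers the text column `Gender`.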

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.countplot(x='Gender', data=customers);
plt.title('Distribution of Gender');

We can infer that there are more women than men in this data set. This may have an effect on our final result.
Let us see the distribution of **Age**.

In [None]:
customers.hist('Age', bins=35);
plt.title('Distribution of Age');
plt.xlabel('Age');

The ages are mostly between 30 and 40. The average age is 38.
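
A histogram reading like this can be cross-checked numerically. The sketch below computes the mean age and the share of customers aged 30 to 40. The series `ages` is made up for illustration and does not come from the data set.

```python
import pandas as pd

# Made-up ages standing in for customers['Age'] (illustrative values only).
ages = pd.Series([19, 31, 35, 38, 40, 52, 67, 33], name='Age')

# between() is inclusive on both ends by default, so 30 and 40 both count.
share_30_to_40 = ages.between(30, 40).mean()
mean_age = ages.mean()

print(f'Share aged 30-40: {share_30_to_40:.1%}')
print(f'Mean age: {mean_age:.1f}')
```

On the real data, `customers['Age']` would take the place of `ages`.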

In [None]:
plt.hist('Age', data=customers[customers['Gender'] == 'Male'], alpha=0.5, label='Male');
plt.hist('Age', data=customers[customers['Gender'] == 'Female'], alpha=0.5, label='Female');
plt.title('Distribution of Age by Gender');
plt.xlabel('Age');
plt.legend();

We can also see the distribution of Income. 

In [None]:
customers.hist('Annual Income (k$)');
plt.title('Annual Income Distribution in Thousands of Dollars');
plt.xlabel('Thousands of Dollars');

Most incomes fall between roughly 60,000 and 85,000 dollars. Does gender impact this?

In [None]:
plt.hist('Annual Income (k$)', data=customers[customers['Gender'] == 'Male'], alpha=0.5, label='Male');
plt.hist('Annual Income (k$)', data=customers[customers['Gender'] == 'Female'], alpha=0.5, label='Female');
plt.title('Distribution of Income by Gender');
plt.xlabel('Income (Thousands of Dollars)');
plt.legend();

We can see that the highest earners are men. Let's check the spending scores by gender.

In [None]:
male_customers = customers[customers['Gender'] == 'Male']
female_customers = customers[customers['Gender'] == 'Female']


print(male_customers['Spending Score (1-100)'].mean())
print(female_customers['Spending Score (1-100)'].mean())

We can infer that women have a higher average spending score than men even though they earn less. Let's draw a scatterplot to understand more about the data.
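
The two `print` calls compare one column at a time. A `groupby` can put income and spending score side by side for each gender. This is a sketch on a made-up frame `sample`, whose values are illustrative and not from the data set.

```python
import pandas as pd

# Made-up rows standing in for the real data (illustrative values only).
sample = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Annual Income (k$)': [80, 60, 70, 100],
    'Spending Score (1-100)': [40, 55, 65, 50],
})

# One row per gender, holding the mean of each selected column.
columns = ['Annual Income (k$)', 'Spending Score (1-100)']
summary = sample.groupby('Gender')[columns].mean()
print(summary)
```

Running the same `groupby` on `customers` would show whether both halves of the inference hold on average.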

In [None]:
sns.scatterplot(x='Age', y='Annual Income (k$)', hue='Gender', data=customers);
plt.title('Age to Income, Colored by Gender');

There's no obvious correlation here. Let's draw a heatmap to quantify the correlations between the variables.

In [None]:
# Gender is a text column, so restrict the correlation to numeric columns
sns.heatmap(customers.corr(numeric_only=True), annot=True);

You can see from the above plot that the only variables that are even somewhat correlated are spending score and age. It's a negative correlation, so the older a customer is in this data set, the lower their spending score tends to be.

In [None]:
sns.scatterplot(x='Age', y='Spending Score (1-100)', hue='Gender', data=customers);
plt.title('Age to Spending Score, Colored by Gender');

This correlation isn't strong. Let's draw separate heatmaps for male and female customers. Maybe we can infer more from them.

In [None]:
sns.heatmap(female_customers.corr(numeric_only=True), annot=True);
plt.title('Correlation Heatmap - Female');

In [None]:
sns.heatmap(male_customers.corr(numeric_only=True), annot=True);
plt.title('Correlation Heatmap - Male');

We didn't find much correlation, other than the fact that age has a greater impact on a woman's spending score than on a man's. Let's learn more about the relationship between women's spending scores and age.
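
The comparison read off the two heatmaps can also be computed directly as one Pearson coefficient per gender. This is a sketch on a made-up frame `sample`; the helper name `age_spending_corr` is hypothetical, and the values are illustrative rather than taken from the data set.

```python
import pandas as pd

# Made-up rows standing in for the real data (illustrative values only).
sample = pd.DataFrame({
    'Gender': ['Male'] * 4 + ['Female'] * 4,
    'Age': [20, 30, 40, 50, 20, 30, 40, 50],
    'Spending Score (1-100)': [60, 40, 55, 45, 80, 60, 40, 20],
})

def age_spending_corr(group):
    """Pearson correlation between age and spending score for one group."""
    return group['Age'].corr(group['Spending Score (1-100)'])

# One coefficient per gender; a more negative value means a stronger decline with age.
corr_by_gender = {
    gender: age_spending_corr(group)
    for gender, group in sample.groupby('Gender')
}
print(corr_by_gender)
```

Applied to `customers`, the two numbers would be the same Age and Spending Score cells shown in the male and female heatmaps.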

In [None]:
sns.lmplot(x='Age', y='Spending Score (1-100)', data=female_customers);
plt.title('Age to Spending Score, Female Only');

Lastly, we can visualize Income against Spending Score.

In [None]:
sns.scatterplot(x='Annual Income (k$)', y='Spending Score (1-100)', hue='Gender', data=customers);
plt.title('Annual Income to Spending Score, Colored by Gender');

There is some patterning here. You can think of these as customer segments:
1. Low income, low spending score
2. Low income, high spending score
3. Mid income, medium spending score
4. High income, low spending score
5. High income, high spending score
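
One simple way to attach labels like these to each row is to bin both columns with fixed cut points. The sketch below does that on a made-up frame `sample`. The cut points in `INCOME_BINS` and `SCORE_BINS` are assumptions chosen for illustration, not values derived from the plot, and would need to be adjusted by eye or replaced by a clustering method.

```python
import pandas as pd

# Made-up rows, one per segment listed above (illustrative values only).
sample = pd.DataFrame({
    'Annual Income (k$)': [20, 25, 55, 90, 95],
    'Spending Score (1-100)': [15, 85, 50, 10, 90],
})

# Hypothetical cut points; each bin includes its right edge.
INCOME_BINS = [0, 40, 70, float('inf')]
SCORE_BINS = [0, 40, 60, 100]

income_band = pd.cut(sample['Annual Income (k$)'], bins=INCOME_BINS,
                     labels=['Low', 'Mid', 'High'])
score_band = pd.cut(sample['Spending Score (1-100)'], bins=SCORE_BINS,
                    labels=['low', 'medium', 'high'])

# Combine the two bands into one readable segment label per customer.
sample['Segment'] = (income_band.astype(str) + ' income, '
                     + score_band.astype(str) + ' spending score')
print(sample)
```

Binning this way allows nine combinations in total, so on real data some rows may land outside the five segments listed above.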