# Arvato Customer Segmentation and Classification

In this notebook we will work on the following tasks:
- Dimensionality Reduction
- Customer Segmentation

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode, iplot
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingRegressor, AdaBoostRegressor
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
import xgboost as xgb
import time
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


/kaggle/input/arvato-cleaned/Customers_cleaned.csv
/kaggle/input/arvato-cleaned/Azdias_cleaned.csv
/kaggle/input/arvato-cleaned/Customer_Additional_cleaned.csv


In [2]:
df_azdias = pd.read_csv('../input/arvato-cleaned/Azdias_cleaned.csv')
df_customers = pd.read_csv('../input/arvato-cleaned/Customers_cleaned.csv')

Feature Scaling
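- First, we standardize the features. The scaler is fit on the general population (Azdias) and the same transformation is then applied to the customers.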

In [3]:
scaler = StandardScaler()
scaler.fit(df_azdias)
df_azdias = pd.DataFrame(scaler.transform(df_azdias), columns = df_azdias.columns)
df_customers = pd.DataFrame(scaler.transform(df_customers), columns = df_customers.columns)
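
The scaling step above reuses one fitted scaler for both tables. A minimal sketch of why that matters, using hypothetical toy data (the column names `a` and `b` and the distributions are assumptions, not taken from the Arvato files):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the population and customer tables
rng = np.random.default_rng(0)
population = pd.DataFrame(rng.normal(loc=10.0, scale=3.0, size=(1000, 2)), columns=["a", "b"])
customers = pd.DataFrame(rng.normal(loc=12.0, scale=3.0, size=(200, 2)), columns=["a", "b"])

# The mean and standard deviation come from the population only
scaler = StandardScaler().fit(population)
population_scaled = pd.DataFrame(scaler.transform(population), columns=population.columns)
customers_scaled = pd.DataFrame(scaler.transform(customers), columns=customers.columns)

# The population is centred at 0 by construction; the customers are not,
# because they are measured against the population's mean and spread
print(population_scaled.mean().round(3).to_dict())
print(customers_scaled.mean().round(3).to_dict())
```

Because both tables share one scale, a shift in the customer means reflects a real difference from the population rather than an artifact of separate standardization.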

Customer Segmentation Report
- We will use unsupervised machine learning to create clusters
- We will also investigate feature importance.
- We will also do dimensionality reduction

## Dimensionality Reduction

In [4]:
random_seed = 22

We will use principal component analysis (PCA) to reduce the number of dimensions.

In [5]:
def pca_model(data,n_components=None):
    pca = PCA(n_components, random_state=22)
    pca.fit(data)
    data_transformed = pca.transform(data)
    return pca, data_transformed

In [6]:
pca_azdias, _ = pca_model(df_azdias, None)

In [7]:
pca_customers,_ = pca_model(df_customers,None)

In [8]:
def plot_pca_variance(pca_data, cumulative=True, figsize=(8,10)):
    # figsize is kept in the signature for compatibility; plotly does not use it
    if cumulative:
        variance = np.cumsum(pca_data.explained_variance_ratio_)
        y_label = "Cumulative Explained Variance Ratio"
    else:
        variance = pca_data.explained_variance_ratio_
        y_label = "Explained Variance Ratio"

    x_label = "No. of components"
    # Number the components from 1 so the x-axis reads as a component count
    x = list(range(1, len(variance) + 1))
    trace1 = go.Scatter(x=x, y=variance, marker=dict(color='#ffdc51'), name='')
    layout = go.Layout(title="",
                       xaxis=dict(title=x_label),
                       yaxis=dict(title=y_label))
    fig = go.Figure(data=[trace1], layout=layout)
    iplot(fig)

In [9]:
plot_pca_variance(pca_azdias)

In [10]:
plot_pca_variance(pca_customers)

As the plots show, the first 200 principal components explain more than 90% of the variance.
So we will reduce the data to 200 dimensions and then run k-means clustering to segment the customers.
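
Reading the component count off the plot can also be done numerically. The sketch below is one way to do it; `components_for_variance` is a hypothetical helper, and the synthetic data (20 features driven by 5 latent factors) is an assumption used only to make the example runnable:

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(pca, threshold=0.90):
    """Smallest number of leading components whose cumulative explained
    variance ratio reaches `threshold` (hypothetical helper)."""
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted returns the first index where cumulative >= threshold
    count = np.searchsorted(cumulative, threshold) + 1
    # Clamp in case the threshold is never reached because of rounding
    return int(min(count, len(cumulative)))

# Synthetic data: 20 features driven by 5 latent factors plus small noise
rng = np.random.default_rng(22)
latent = rng.normal(size=(500, 5))
X = latent @ rng.normal(size=(5, 20)) + 0.1 * rng.normal(size=(500, 20))

pca = PCA(random_state=22).fit(X)
k = components_for_variance(pca, 0.90)
print(k, np.cumsum(pca.explained_variance_ratio_)[k - 1])
```

Applied to a fitted model such as `pca_azdias`, the same helper would return the component count for whichever variance threshold is chosen.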

In [11]:
pca_azdias_200, azdias_pca_200 = pca_model(df_azdias, n_components=200)

In [12]:
pca_cust_200, cust_pca_200 = pca_model(df_customers, n_components=200)

## K-means Clustering

In [13]:
def perform_kmeans(data, K_start, K_end, step=1):
    scores = []
    print("Performing K-Means clustering")
    print("Given range min:{}, max:{}, step:{}".format(K_start, K_end, step))

    for n in range(K_start, K_end+1, step):
        
        print("\nTraining for n_clusters: ", n)
        start = time.time()
        
        kmeans = KMeans(n, random_state=22)
        model = kmeans.fit(data)
        scores.append(abs(model.score(data)))
        
        print("Done! Score: ", scores[-1])
        print("Time elapsed: {:.2f} sec.".format(time.time()-start))
        
    return scores, range(K_start, K_end+1, step)

In [14]:
%%time
azdias_scores, azdias_range_ = perform_kmeans(azdias_pca_200, 1, 20, 1)

Performing K-Means clustering
Given range min:1, max:20, step:1

Training for n_clusters:  1
Done! Score:  240954669.1111473
Time elapsed: 12.91 sec.

Training for n_clusters:  2
Done! Score:  225294191.13016036
Time elapsed: 38.23 sec.

Training for n_clusters:  3
Done! Score:  217518090.1019049
Time elapsed: 61.16 sec.

Training for n_clusters:  4
Done! Score:  212977830.25994712
Time elapsed: 111.62 sec.

Training for n_clusters:  5
Done! Score:  209340726.34117094
Time elapsed: 164.85 sec.

Training for n_clusters:  6
Done! Score:  206572593.10150194
Time elapsed: 159.28 sec.

Training for n_clusters:  7
Done! Score:  204126703.52209368
Time elapsed: 188.25 sec.

Training for n_clusters:  8
Done! Score:  202061459.34842333
Time elapsed: 212.17 sec.

Training for n_clusters:  9
Done! Score:  200231737.16370913
Time elapsed: 240.07 sec.

Training for n_clusters:  10
Done! Score:  198888778.90791613
Time elapsed: 253.70 sec.

Training for n_clusters:  11
Done! Score:  197626530.793566

In [15]:
def plot_elbow(scores, range_):
    # Plot the k-means inertia (sum of squared distances) against the number of clusters
    y_label = "Sum of squared distances"
    x_label = "No. of Clusters"
    x = list(range_)
    trace1 = go.Scatter(x=x, y=scores, mode='lines+markers',
                        marker=dict(color='#ffdc51'), name='Inertia')
    layout = go.Layout(title="",
                       xaxis=dict(title=x_label),
                       yaxis=dict(title=y_label),
                       legend=dict(x=0.1, y=1.1, orientation="h"))
    fig = go.Figure(data=[trace1], layout=layout)
    iplot(fig)

In [16]:
plot_elbow(azdias_scores, azdias_range_)

In [19]:
%%time
cust_scores, cust_range_ = perform_kmeans(cust_pca_200, 1, 20, 1)

Performing K-Means clustering
Given range min:1, max:20, step:1

Training for n_clusters:  1
Done! Score:  42916972.977617815
Time elapsed: 2.30 sec.

Training for n_clusters:  2
Done! Score:  40840909.760685965
Time elapsed: 11.10 sec.

Training for n_clusters:  3
Done! Score:  39534081.83530515
Time elapsed: 14.48 sec.

Training for n_clusters:  4
Done! Score:  38539986.92214797
Time elapsed: 16.37 sec.

Training for n_clusters:  5
Done! Score:  38135219.84259325
Time elapsed: 18.53 sec.

Training for n_clusters:  6
Done! Score:  37595454.00571827
Time elapsed: 29.91 sec.

Training for n_clusters:  7
Done! Score:  37233197.247667596
Time elapsed: 34.45 sec.

Training for n_clusters:  8
Done! Score:  36867177.566820726
Time elapsed: 40.25 sec.

Training for n_clusters:  9
Done! Score:  36571935.92080732
Time elapsed: 51.47 sec.

Training for n_clusters:  10
Done! Score:  36351424.42941498
Time elapsed: 54.69 sec.

Training for n_clusters:  11
Done! Score:  36031279.48217281
Time elaps

In [20]:
plot_elbow(cust_scores, cust_range_)
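
As a rough numeric companion to the elbow plot, the sketch below computes how much the sum of squared distances falls with each extra cluster. It uses only the customer scores for K = 1 to 11 that are visible in the truncated output above; the percentage-drop measure is an illustrative choice, not something the original analysis computed.

```python
import numpy as np

# Customer inertia values for K = 1..11, copied from the truncated output above
cust_scores = [
    42916972.977617815, 40840909.760685965, 39534081.83530515,
    38539986.92214797, 38135219.84259325, 37595454.00571827,
    37233197.247667596, 36867177.566820726, 36571935.92080732,
    36351424.42941498, 36031279.48217281,
]

scores = np.asarray(cust_scores)
# Relative improvement gained by going from K-1 to K clusters, in percent
rel_drop = -np.diff(scores) / scores[:-1] * 100

for k, drop in zip(range(2, len(scores) + 1), rel_drop):
    print(f"K={k:2d}: inertia falls by {drop:.2f}%")
```

The gains shrink gradually as K grows rather than stopping at one sharp point, so the choice of K remains a judgment call informed by the plot.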

K-means partitions the data so that, for a given number of clusters, the within-cluster variance is minimized. There is no definitive way to choose the number of clusters, although both direct and statistical methods exist. Here we use the elbow method: we pick the value of K beyond which adding more clusters no longer reduces the sum of squared distances by much.
- Based on the elbow plots, we will use 8 clusters

We will now fit k-means with 8 clusters on both the customers and the Azdias data
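
This notebook fits separate PCA and k-means models for the population and the customers, so cluster label i from one model need not describe the same segment as label i from the other. A common alternative is to fit on the population and reuse the fitted models for the customers. The sketch below illustrates that alternative on synthetic data; the array shapes, `n_components=5` and `n_clusters=4` are arbitrary assumptions, not values from this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-ins for the scaled population and customer tables
rng = np.random.default_rng(22)
population = rng.normal(size=(2000, 10))
customers = rng.normal(loc=0.5, size=(400, 10))

# Fit both models on the population only
pca = PCA(n_components=5, random_state=22).fit(population)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=22).fit(pca.transform(population))

# Reuse the fitted models, so label i means the same region for both groups
population_labels = kmeans.predict(pca.transform(population))
customer_labels = kmeans.predict(pca.transform(customers))

# Share of each group that falls into each cluster
shares = pd.DataFrame({
    "Population": pd.Series(population_labels).value_counts(normalize=True),
    "Customers": pd.Series(customer_labels).value_counts(normalize=True),
}).sort_index().fillna(0.0)
print(shares.round(3))
```

With shared models, a per-cluster comparison of population and customer shares is a like-for-like comparison.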

In [22]:
kmeans_azdias = KMeans(8, random_state=22)
kmeans_azdias.fit(azdias_pca_200)

KMeans(random_state=22)

In [23]:
azdias_clusters = kmeans_azdias.predict(azdias_pca_200)

In [24]:
kmeans_cust = KMeans(8, random_state=22)
kmeans_cust.fit(cust_pca_200)

KMeans(random_state=22)

In [25]:
customer_clusters= kmeans_cust.predict(cust_pca_200)

In [26]:
print(azdias_clusters[:10], "\n",customer_clusters[:10])

[6 6 0 3 0 6 6 7 7 1] 
 [5 2 5 4 1 7 1 3 4 7]


In [27]:
customer_clusters = pd.Series(customer_clusters)
azdias_clusters = pd.Series(azdias_clusters)

In [28]:
customer_clusters.value_counts().sort_index()

0     3462
1    20802
2    20001
3    11130
4    24255
5    34114
6     2548
7    17934
dtype: int64

Let's look at the number of people in each cluster

In [29]:
cluster_info = pd.DataFrame([])
cluster_info["Population"] = azdias_clusters.value_counts().sort_index()
cluster_info["Customers"] = customer_clusters.value_counts().sort_index()
cluster_info.reset_index(inplace=True)
cluster_info.rename(columns={"index":"Cluster"}, inplace=True)

In [30]:
cluster_info

   Cluster  Population  Customers
0        0      114321       3462
1        1       95755      20802
2        2      116684      20001
3        3       64760      11130
4        4       90304      24255
5        5       74138      34114
6        6      112788       2548
7        7       68538      17934


In [31]:

trace1 = go.Bar(x=cluster_info["Cluster"],y=cluster_info["Population"],marker=dict(color='#ffdc51'),name='Azdias')
trace2 = go.Bar(x=cluster_info["Cluster"],y=cluster_info["Customers"],marker=dict(color='#9932CC'),name='Customer')
layout = go.Layout(title="Number of people in each cluster"
                   ,legend=dict(x=0.1, y=1.1)
                   ,xaxis=dict(title="Cluster"),
                   yaxis=dict(title="Population"))
fig = go.Figure(data=[trace1,trace2],layout=layout)
iplot(fig)

We will now compare the share of the population and the share of customers that fall into each cluster

In [32]:
cluster_info["Pop_proportion"] = (cluster_info["Population"]/cluster_info["Population"].sum()*100).round(2)
cluster_info["Cust_proportion"] = (cluster_info["Customers"]/cluster_info["Customers"].sum()*100).round(2)

In [33]:
cluster_info["Cust_over_Pop"] = cluster_info["Cust_proportion"] / cluster_info["Pop_proportion"]

In [34]:
trace1 = go.Bar(x=cluster_info["Cluster"],y=cluster_info["Cust_over_Pop"],marker=dict(color='#ffdc51'),name='Customer share / population share')
layout = go.Layout(title="Ratio of customer share to population share in each cluster"
                   ,legend=dict(x=0.1, y=1.1)
                   ,xaxis=dict(title="Cluster"),
                   yaxis=dict(title="Ratio"))
fig = go.Figure(data=[trace1],layout=layout)
iplot(fig)

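A ratio greater than 1 means that customers are over-represented in that cluster relative to the general population. People in the general population who fall into these clusters are therefore the most promising candidates to become future customers.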
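
The ratio can be reproduced directly from the cluster counts. The sketch below rebuilds the table from the counts shown in the `cluster_info` output and lists the clusters on each side of 1. Unlike the cells above, it divides unrounded shares, so the values may differ slightly in the last digit.

```python
import pandas as pd

# Cluster counts copied from the cluster_info output above
cluster_info = pd.DataFrame({
    "Cluster": list(range(8)),
    "Population": [114321, 95755, 116684, 64760, 90304, 74138, 112788, 68538],
    "Customers": [3462, 20802, 20001, 11130, 24255, 34114, 2548, 17934],
})

# Share of each group that falls into each cluster
pop_share = cluster_info["Population"] / cluster_info["Population"].sum()
cust_share = cluster_info["Customers"] / cluster_info["Customers"].sum()
cluster_info["Cust_over_Pop"] = (cust_share / pop_share).round(2)

over = cluster_info.loc[cluster_info["Cust_over_Pop"] > 1, "Cluster"].tolist()
under = cluster_info.loc[cluster_info["Cust_over_Pop"] < 1, "Cluster"].tolist()
print("Over-represented clusters:", over)
print("Under-represented clusters:", under)
```

With these counts, clusters 1, 4, 5 and 7 have a ratio above 1, while clusters 0, 2, 3 and 6 fall below it.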