# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned to a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to become customers of the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than following predetermined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

In [2]:
# import libraries here; add more as necessary
import sys
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
import pickle
import joblib
import time

from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

## Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link [here](http://www.kaggle.com/t/21e6d45d4c574c7fa2d868f0e8c83140), you'll be taken to the competition page where, if you have a Kaggle account, you can enter.

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual is to have become a customer; this need not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and scoring them with accuracy is not an appropriate performance evaluation method. Instead, the competition uses AUC to evaluate performance. The exact values in the "RESPONSE" column matter less than their ordering: higher values should capture as many of the actual customers as possible early in the ROC curve sweep.
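To make the point about ordering concrete, here is a minimal sketch that computes AUC directly from its rank-based definition (the chance that a randomly chosen positive is scored above a randomly chosen negative). The helper `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the project data or of the competition's scoring code.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and low, imbalanced-looking scores (hypothetical values)
y_true = [0, 0, 1, 0, 1, 0, 0, 0]
scores = [0.010, 0.012, 0.030, 0.015, 0.014, 0.011, 0.009, 0.013]

auc_raw = roc_auc(y_true, scores)
# A monotonic rescaling keeps the ordering, so the AUC is unchanged
auc_scaled = roc_auc(y_true, [100 * s + 5 for s in scores])
# Thresholding at 0.5 turns every score into 0 and discards the ordering
auc_hard = roc_auc(y_true, [int(s > 0.5) for s in scores])

print(auc_raw, auc_scaled, auc_hard)
```

In this toy case the hard labels collapse to a single value, so every pair is tied and the AUC falls to chance level, while the raw and rescaled scores give the same result.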

In [222]:
mailout_test = pd.read_csv('data/mailout_test.csv')



In [223]:
print(mailout_test.shape)
mailout_test.head()

(42833, 366)


Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,...,VHN,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,1754,2,1.0,7.0,,,,,6.0,2.0,...,4.0,5.0,6.0,3.0,6.0,9.0,3.0,3,1,4
1,1770,-1,1.0,0.0,,,,,0.0,20.0,...,1.0,5.0,2.0,1.0,6.0,9.0,5.0,3,1,4
2,1465,2,9.0,16.0,,,,,11.0,2.0,...,3.0,9.0,6.0,3.0,2.0,9.0,4.0,3,2,4
3,1470,-1,7.0,0.0,,,,,0.0,1.0,...,2.0,6.0,6.0,3.0,,9.0,2.0,3,2,4
4,1478,1,1.0,21.0,,,,,13.0,1.0,...,1.0,2.0,4.0,3.0,3.0,9.0,7.0,4,2,4


In [224]:
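# Take an explicit copy so a RESPONSE column can be added later without a SettingWithCopyWarning
test_identifier = mailout_test[['LNR']].copy()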

### Preprocess and Segment

#### Clean data

In [225]:
mailout_test_clean = clean_data(mailout_test)

CAMEO_DEUG_2015
[ 2.  5.  7.  9.  4.  6.  1. nan  3.  8.]
CAMEO_INTL_2015
[13. 31. 41. 51. 25. 43. 14. 15. 45. 24. nan 22. 35. 44. 34. 23. 12. 32.
 33. 55. 54. 52.]
CAMEO_DEU_2015
--DROP DUPLICATE--
Drop 0 duplicated rows, Number of rows after drop: 42833
--MISSING VALUES--
--CLEAN CATEGORICAL VARIABLES--
Dropped -45 columns, Number of columns after drop: 321


#### Segment

In [226]:
mailout_test_cluster = segment_clusters(mailout_test_clean, testset=True)

https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations


### Predict Customer Class

In [227]:
# One-hot encode the Cluster column
X_clustered = mailout_test_cluster[['Cluster']]
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_clustered)
X_clustered_encoded = pd.DataFrame(
    enc.transform(X_clustered).toarray(),
    columns=list(enc.get_feature_names_out()),
)

In [103]:
mailout_test_cluster.drop(columns=['Cluster'], inplace=True)
mailout_test_cluster = mailout_test_cluster.join(X_clustered_encoded)

In [104]:
print(mailout_test_cluster.shape)
mailout_test_cluster.head()

(42833, 332)


Unnamed: 0,DSL_FLAG,GREEN_AVANTGARDE,HH_DELTA_FLAG,KBA05_SEG6,KONSUMZELLE,SOHO_KZ,UNGLEICHENN_FLAG,VERS_TYP,ANREDE_KZ,OST_WEST_KZ_O,...,Cluster_0,Cluster_1,Cluster_2,Cluster_3,Cluster_4,Cluster_5,Cluster_6,Cluster_7,Cluster_8,Cluster_9
0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [166]:
# make prediction
# load model
# gbc_model = joblib.load('results/gbc_grids_obj.pkl')
# ada_model = joblib.load('results/ada_grid_obj.pkl')
ada_model = ada_grid_obj.best_estimator_

In [228]:
# mailout_test_proba = ada_clf.predict_proba(mailout_test_cluster)
mailout_test_proba = gbc_clf.predict_proba(X_clustered_encoded)
# mailout_test_proba = grid_obj.predict_proba(mailout_test_cluster)

In [229]:
mailout_test_proba

array([[0.98508734, 0.01491266],
       [0.98534021, 0.01465979],
       [0.98508734, 0.01491266],
       ...,
       [0.9892639 , 0.0107361 ],
       [0.98635439, 0.01364561],
       [0.9892639 , 0.0107361 ]])

In [230]:
proba_pred = pd.DataFrame(mailout_test_proba, columns=['proba_0', 'proba_1'])

In [231]:
# Count rows where the positive class is the more likely one
(proba_pred['proba_1'] > proba_pred['proba_0']).sum()

0
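The count of zero reflects the class imbalance: no row has a positive-class probability above 0.5, so hard labels carry almost no ranking information. Since the competition scores by AUC, one option is to submit the positive-class probability itself as "RESPONSE". The sketch below assumes that approach and uses only the first three rows shown in the outputs above; the names `ids`, `proba`, and `submission` are illustrative, not the notebook's variables.

```python
import pandas as pd

# First three LNR values and class probabilities, copied from the outputs above
ids = pd.DataFrame({'LNR': [1754, 1770, 1465]})
proba = pd.DataFrame(
    [[0.98508734, 0.01491266],
     [0.98534021, 0.01465979],
     [0.98508734, 0.01491266]],
    columns=['proba_0', 'proba_1'],
)

# Hard labels: no row has proba_1 > proba_0, so every prediction is 0
hard_labels = (proba['proba_1'] > proba['proba_0']).astype(int)

# Continuous score: keep the positive-class probability so rows remain ranked.
# assign() returns a new frame, which also avoids writing into a slice.
submission = ids.assign(RESPONSE=proba['proba_1'].to_numpy())
print(submission)
```

A frame built this way could be written out with `to_csv(..., index=False)` in the same way as the submission cell at the end of this part.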

In [190]:
# mailout_test_pred = ada_model.predict(mailout_test_cluster)
mailout_test_pred = gbc_model.predict(mailout_test_cluster)

In [191]:
mailout_test_pred.sum()

23

In [192]:
test_identifier['RESPONSE'] = mailout_test_pred



In [193]:
test_identifier['RESPONSE'].sum()

23

In [194]:
test_identifier

Unnamed: 0,LNR,RESPONSE
0,1754,0
1,1770,0
2,1465,0
3,1470,0
4,1478,0
...,...,...
42828,67615,0
42829,67938,0
42830,67942,0
42831,67949,0


#### Create Submission

In [195]:
test_identifier.to_csv('results/submision_10282021_2.csv', index=False)