# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

The versions of those two datasets used in this project will include many more features and has not been pre-cleaned. You are also free to choose whatever approach you'd like to analyzing the data rather than follow pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

# Importing Libraries

In [1]:
from google.cloud import storage
import os

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

# Loading Data

## Part 0: Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) is a top-level list of attributes and descriptions, organized by informational category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the `.csv` data files in this project that they're semicolon (`;`) delimited, so an additional argument in the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.

In [2]:
# set key credentials file path
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'datascience-capstone-project-05b1642f45c3.json'

In [3]:
storage_client = storage.Client()
bucket = storage_client.bucket('udacity-nanodegree-capstone-project')
blob = bucket.blob('azdias.csv')

In [4]:
path = "gs://udacity-nanodegree-capstone-project/azdias.csv"
azdias =  pd.read_csv(path, index_col=0)

path = "gs://udacity-nanodegree-capstone-project/customers.csv"
customers = pd.read_csv(path, index_col=0)

  azdias =  pd.read_csv(path, index_col=0)
  customers = pd.read_csv(path, index_col=0)


### Exploratory data analysis

#### Premisses:
> This project is meant to be a general approach to be used as a methodology to analyse customers in a robust way. Therefore, the data used should be as representative as possible, with the as few features as possible as well. This compromiss is one of the hardest parts of this project.

* **KBA and D19 Data:** The D19 and KBA data are going to be dropped, even though they could be useful on this analysis, they are considered hard to be obtained and beacuse of that, it would potentially harm the generalization capacity of this project.
* Based on the same premisses 35 columns were selected as possible important ones which should be able to explain well the customer behavior and not be so difficult to go into production.

In [5]:
pd.options.display.max_rows = 4000

In [6]:
azdias.head()

Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,...,VHN,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB
0,910215,-1,,,,,,,,,...,,,,,,,,3,1,2
1,910220,-1,9.0,0.0,,,,,21.0,11.0,...,4.0,8.0,11.0,10.0,3.0,9.0,4.0,5,2,1
2,910225,-1,9.0,17.0,,,,,17.0,10.0,...,2.0,9.0,9.0,6.0,3.0,9.0,2.0,5,2,3
3,910226,2,1.0,13.0,,,,,13.0,1.0,...,0.0,7.0,10.0,11.0,,9.0,7.0,3,2,4
4,910241,-1,1.0,20.0,,,,,14.0,3.0,...,2.0,3.0,5.0,4.0,2.0,9.0,3.0,4,1,3


In [7]:
azdias.describe()

Unnamed: 0,LNR,AGER_TYP,AKT_DAT_KL,ALTER_HH,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,ALTERSKATEGORIE_FEIN,ANZ_HAUSHALTE_AKTIV,...,VHN,VK_DHT4A,VK_DISTANZ,VK_ZG11,W_KEIT_KIND_HH,WOHNDAUER_2008,WOHNLAGE,ZABEOTYP,ANREDE_KZ,ALTERSKATEGORIE_GROB
count,891221.0,891221.0,817722.0,817722.0,81058.0,29499.0,6170.0,1205.0,628274.0,798073.0,...,770025.0,815304.0,815304.0,815304.0,783619.0,817722.0,798073.0,891221.0,891221.0,891221.0
mean,637263.0,-0.358435,4.421928,10.864126,11.745392,13.402658,14.476013,15.089627,13.700717,8.287263,...,2.417322,6.001214,7.53213,5.945972,3.933406,7.908791,4.052836,3.362438,1.522098,2.777398
std,257273.5,1.198724,3.638805,7.639683,4.09766,3.2433,2.712427,2.452932,5.079849,15.628087,...,1.166572,2.856091,3.247789,2.771464,1.964701,1.923137,1.949539,1.352704,0.499512,1.068775
min,191653.0,-1.0,1.0,0.0,2.0,2.0,4.0,7.0,0.0,0.0,...,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0
25%,414458.0,-1.0,1.0,0.0,8.0,11.0,13.0,14.0,11.0,1.0,...,2.0,3.0,5.0,4.0,2.0,8.0,3.0,3.0,1.0,2.0
50%,637263.0,-1.0,3.0,13.0,12.0,14.0,15.0,15.0,14.0,4.0,...,2.0,6.0,8.0,6.0,4.0,9.0,3.0,3.0,2.0,3.0
75%,860068.0,-1.0,9.0,17.0,15.0,16.0,17.0,17.0,17.0,9.0,...,3.0,9.0,10.0,8.0,6.0,9.0,5.0,4.0,2.0,4.0
max,1082873.0,3.0,9.0,21.0,18.0,18.0,18.0,18.0,25.0,595.0,...,4.0,11.0,13.0,11.0,6.0,9.0,8.0,6.0,2.0,9.0


In [8]:
missing_values = azdias.isna().mean()

missing_values.sort_values(ascending=False)

ALTER_KIND4                    0.998648
ALTER_KIND3                    0.993077
ALTER_KIND2                    0.966900
ALTER_KIND1                    0.909048
EXTSEL992                      0.733996
KK_KUNDENTYP                   0.655967
ALTERSKATEGORIE_FEIN           0.295041
D19_VERSAND_ONLINE_QUOTE_12    0.288495
D19_LOTTO                      0.288495
D19_BANKEN_ONLINE_QUOTE_12     0.288495
D19_LETZTER_KAUF_BRANCHE       0.288495
D19_SOZIALES                   0.288495
D19_GESAMT_ONLINE_QUOTE_12     0.288495
D19_KONSUMTYP                  0.288495
D19_VERSI_ONLINE_QUOTE_12      0.288495
D19_TELKO_ONLINE_QUOTE_12      0.288495
KBA05_DIESEL                   0.149597
KBA05_CCM4                     0.149597
KBA05_GBZ                      0.149597
KBA05_FRAU                     0.149597
KBA05_ZUL4                     0.149597
KBA05_HERST1                   0.149597
KBA05_HERST2                   0.149597
KBA05_HERST3                   0.149597
KBA05_CCM3                     0.149597


In [9]:
cols = azdias.columns

In [10]:
[col for col in cols if ('D19' in col or 'KBA' in col)]

['D19_BANKEN_ANZ_12',
 'D19_BANKEN_ANZ_24',
 'D19_BANKEN_DATUM',
 'D19_BANKEN_DIREKT',
 'D19_BANKEN_GROSS',
 'D19_BANKEN_LOKAL',
 'D19_BANKEN_OFFLINE_DATUM',
 'D19_BANKEN_ONLINE_DATUM',
 'D19_BANKEN_ONLINE_QUOTE_12',
 'D19_BANKEN_REST',
 'D19_BEKLEIDUNG_GEH',
 'D19_BEKLEIDUNG_REST',
 'D19_BILDUNG',
 'D19_BIO_OEKO',
 'D19_BUCH_CD',
 'D19_DIGIT_SERV',
 'D19_DROGERIEARTIKEL',
 'D19_ENERGIE',
 'D19_FREIZEIT',
 'D19_GARTEN',
 'D19_GESAMT_ANZ_12',
 'D19_GESAMT_ANZ_24',
 'D19_GESAMT_DATUM',
 'D19_GESAMT_OFFLINE_DATUM',
 'D19_GESAMT_ONLINE_DATUM',
 'D19_GESAMT_ONLINE_QUOTE_12',
 'D19_HANDWERK',
 'D19_HAUS_DEKO',
 'D19_KINDERARTIKEL',
 'D19_KONSUMTYP',
 'D19_KONSUMTYP_MAX',
 'D19_KOSMETIK',
 'D19_LEBENSMITTEL',
 'D19_LETZTER_KAUF_BRANCHE',
 'D19_LOTTO',
 'D19_NAHRUNGSERGAENZUNG',
 'D19_RATGEBER',
 'D19_REISEN',
 'D19_SAMMELARTIKEL',
 'D19_SCHUHE',
 'D19_SONSTIGE',
 'D19_SOZIALES',
 'D19_TECHNIK',
 'D19_TELKO_ANZ_12',
 'D19_TELKO_ANZ_24',
 'D19_TELKO_DATUM',
 'D19_TELKO_MOBILE',
 'D19_TELKO_OFFL

In [11]:
cols2drop = [col for col in cols if ('D19' in col or 'KBA' in col)]

In [12]:
df_azdias = azdias.drop(cols2drop, axis=1)
df_customers = customers.drop(cols2drop, axis=1)

In [13]:
columns2keep = ['AGER_TYP',
'ALTERSKATEGORIE_GROB',
'ANREDE_KZ',
'ANZ_HAUSHALTE_AKTIV',
'ANZ_PERSONEN',
'BALLRAUM',
'FINANZ_ANLEGER',
'FINANZ_HAUSBAUER',
'FINANZ_MINIMALIST',
'FINANZ_SPARER',
'FINANZ_UNAUFFAELLIGER',
'FINANZ_VORSORGER',
'HEALTH_TYP',
'HH_EINKOMMEN_SCORE',
'INNENSTADT',
'KKK',
'LP_FAMILIE_GROB',
'LP_LEBENSPHASE_GROB',
'LP_STATUS_GROB',
'ONLINE_AFFINITAET',
'ORTSGR_KLS9',
'SEMIO_DOM',
'SEMIO_FAM',
'SEMIO_KRIT',
'SEMIO_KULT',
'SEMIO_MAT',
'SEMIO_REL',
'SEMIO_SOZ',
'SHOPPER_TYP',
'WOHNDAUER_2008',
'WOHNLAGE',
'W_KEIT_KIND_HH',
'ZABEOTYP']

In [14]:
customer_cols = customers.columns[~np.isin(customers.columns, azdias.columns)]
print(customer_cols)

Index(['PRODUCT_GROUP', 'CUSTOMER_GROUP', 'ONLINE_PURCHASE'], dtype='object')


In [15]:
df_azdias = df_azdias[columns2keep]
df_customers = df_customers[list(columns2keep) + list(customer_cols)]

In [18]:
df_customers.isna().mean().sort_values(ascending=False)

KKK                      0.283117
W_KEIT_KIND_HH           0.280415
ORTSGR_KLS9              0.263373
BALLRAUM                 0.260676
INNENSTADT               0.260676
ANZ_HAUSHALTE_AKTIV      0.260509
WOHNLAGE                 0.260509
ANZ_PERSONEN             0.243128
WOHNDAUER_2008           0.243128
LP_STATUS_GROB           0.016765
LP_FAMILIE_GROB          0.016765
ONLINE_AFFINITAET        0.016765
LP_LEBENSPHASE_GROB      0.016765
HH_EINKOMMEN_SCORE       0.015486
SEMIO_MAT                0.000000
SEMIO_REL                0.000000
SHOPPER_TYP              0.000000
SEMIO_SOZ                0.000000
SEMIO_KRIT               0.000000
ZABEOTYP                 0.000000
PRODUCT_GROUP            0.000000
CUSTOMER_GROUP           0.000000
SEMIO_KULT               0.000000
AGER_TYP                 0.000000
SEMIO_FAM                0.000000
SEMIO_DOM                0.000000
ALTERSKATEGORIE_GROB     0.000000
HEALTH_TYP               0.000000
FINANZ_VORSORGER         0.000000
FINANZ_UNAUFFA

In [19]:
df_azdias.isna().mean().sort_values(ascending=False)

KKK                      0.135989
W_KEIT_KIND_HH           0.120735
ORTSGR_KLS9              0.109082
BALLRAUM                 0.105182
INNENSTADT               0.105182
ANZ_HAUSHALTE_AKTIV      0.104517
WOHNLAGE                 0.104517
ANZ_PERSONEN             0.082470
WOHNDAUER_2008           0.082470
HH_EINKOMMEN_SCORE       0.020587
ONLINE_AFFINITAET        0.005446
LP_STATUS_GROB           0.005446
LP_LEBENSPHASE_GROB      0.005446
LP_FAMILIE_GROB          0.005446
SEMIO_KULT               0.000000
SEMIO_MAT                0.000000
SEMIO_KRIT               0.000000
SEMIO_FAM                0.000000
SEMIO_SOZ                0.000000
SHOPPER_TYP              0.000000
SEMIO_REL                0.000000
AGER_TYP                 0.000000
SEMIO_DOM                0.000000
ALTERSKATEGORIE_GROB     0.000000
HEALTH_TYP               0.000000
FINANZ_VORSORGER         0.000000
FINANZ_UNAUFFAELLIGER    0.000000
FINANZ_SPARER            0.000000
FINANZ_MINIMALIST        0.000000
FINANZ_HAUSBAU

In [21]:
df_customers.shape

(191652, 36)

In [16]:
df.head()

NameError: name 'df' is not defined

In [None]:
df_prod[customer_cols]

Unnamed: 0,PRODUCT_GROUP,CUSTOMER_GROUP,ONLINE_PURCHASE
0,COSMETIC_AND_FOOD,MULTI_BUYER,0
1,FOOD,SINGLE_BUYER,0
2,COSMETIC_AND_FOOD,MULTI_BUYER,0
3,COSMETIC,MULTI_BUYER,0
4,FOOD,MULTI_BUYER,0
...,...,...,...
191647,COSMETIC_AND_FOOD,MULTI_BUYER,0
191648,COSMETIC,SINGLE_BUYER,0
191649,COSMETIC_AND_FOOD,MULTI_BUYER,0
191650,FOOD,SINGLE_BUYER,0


In [None]:
df.columns[np.isin(df.columns,columns2keep)]

Index(['AGER_TYP', 'ANZ_HAUSHALTE_AKTIV', 'ANZ_PERSONEN', 'BALLRAUM',
       'FINANZ_ANLEGER', 'FINANZ_HAUSBAUER', 'FINANZ_MINIMALIST',
       'FINANZ_SPARER', 'FINANZ_UNAUFFAELLIGER', 'FINANZ_VORSORGER',
       'HEALTH_TYP', 'HH_EINKOMMEN_SCORE', 'INNENSTADT', 'KKK',
       'LP_FAMILIE_FEIN', 'LP_FAMILIE_GROB', 'LP_LEBENSPHASE_GROB',
       'LP_STATUS_GROB', 'ONLINE_AFFINITAET', 'ORTSGR_KLS9', 'SEMIO_DOM',
       'SEMIO_FAM', 'SEMIO_KRIT', 'SEMIO_KULT', 'SEMIO_MAT', 'SEMIO_REL',
       'SEMIO_SOZ', 'SHOPPER_TYP', 'W_KEIT_KIND_HH', 'WOHNDAUER_2008',
       'WOHNLAGE', 'ZABEOTYP', 'ANREDE_KZ', 'ALTERSKATEGORIE_GROB'],
      dtype='object')

In [None]:
missing_values[missing_values > 0.7]

ALTER_KIND1    0.909048
ALTER_KIND2    0.966900
ALTER_KIND3    0.993077
ALTER_KIND4    0.998648
EXTSEL992      0.733996
dtype: float64

In [None]:
missing_values['W_KEIT_KIND_HH']

0.12073548536221655

In [None]:
azdias['W_KEIT_KIND_HH']

0         NaN
1         3.0
2         3.0
3         NaN
4         2.0
         ... 
891216    3.0
891217    6.0
891218    NaN
891219    1.0
891220    6.0
Name: W_KEIT_KIND_HH, Length: 891221, dtype: float64

In [None]:
azdias['W_KEIT_KIND_HH'].value_counts()

6.0    281966
4.0    128675
3.0    100170
2.0     84000
1.0     83706
5.0     64716
0.0     40386
Name: W_KEIT_KIND_HH, dtype: int64

In [None]:
azdias['ALTER_KIND1'].unique()

array([nan, 17., 10., 18., 13., 16., 11.,  6.,  8.,  9., 15., 14.,  7.,
       12.,  4.,  3.,  5.,  2.])

In [None]:
azdias['HH_EINKOMMEN_SCORE'].unique()

array([ 2.,  6.,  4.,  1.,  5.,  3., nan])

In [None]:
azdias['ALTER_KIND2'].unique()

array([nan, 13.,  8., 12., 10.,  7., 16., 15., 14., 17.,  5.,  9., 18.,
       11.,  6.,  4.,  3.,  2.])

In [None]:
azdias['ALTER_KIND3'].unique()

array([nan, 10., 18., 17., 16.,  8., 15.,  9., 12., 13., 14., 11.,  7.,
        5.,  6.,  4.])

In [None]:
azdias['ALTER_KIND4'].unique()

array([nan, 10.,  9., 16., 14., 13., 11., 18., 17., 15.,  8., 12.,  7.])

In [None]:
azdias['ALTERSKATEGORIE_GROB'].unique()

array([2, 1, 3, 4, 9], dtype=int64)

In [None]:
len(azdias.columns)

366

In [None]:
azdias[missing_values[missing_values > 0.7].index]

Unnamed: 0,ALTER_KIND1,ALTER_KIND2,ALTER_KIND3,ALTER_KIND4,EXTSEL992
0,,,,,
1,,,,,
2,,,,,14.0
3,,,,,31.0
4,,,,,
...,...,...,...,...,...
891216,,,,,
891217,,,,,
891218,,,,,
891219,17.0,,,,19.0


Check columns which are not in both datasets.

The columns that are not in both datasets are beeing dropped, since there is no way to use them to characterize the customers on both data.

In [None]:
customers.columns[~np.isin(customers.columns, azdias.columns)]

Index(['PRODUCT_GROUP', 'CUSTOMER_GROUP', 'ONLINE_PURCHASE'], dtype='object')

In [None]:
azdias.columns[~np.isin(azdias.columns, customers.columns)]

Index([], dtype='object')

In [None]:
np.isin(azdias.columns, customers.columns).mean()

1.0

## Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.