# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project, from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than following pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

# Overview and Table of Contents

The notebook will implement this workflow:

1. [Prepare jupyter lab notebook environment](#Prepare-jupyter-lab-notebook-environment)
1. [Download Data](#Download-Data)
1. [Data exploration and cleaning](#Data-exploration-and-cleaning)
    1. explore the data
    1. cleaning (handling null and empty values, unknown values, encode categorical values)
1. [Visualizations](#Visualizations)
    1. correlation studies
1. [Data preparation and transformation](#Data-preparation-and-transformation)
    1. Feature engineering (PCA)
1. [Model development and training](#Model-development-and-training)
    1. Develop a model
    1. Train a model
    1. Model validation and evaluation
    1. Hyperparameters tuning
    1. Select the best performing model based on the test results
1. [Deploy model](#Deploy-Model)

# Prepare jupyter lab notebook environment
---


## Auto Reload Modules 
Configure auto-reload of modules when they have been changed; this simplifies development and testing.

In [None]:
%load_ext autoreload

In [None]:
%autoreload 2

In [None]:
!python --version

## Update Conda Packages

In [None]:
! conda update -y conda

In [None]:
! conda update --all -y

In [None]:
! conda install -y pyarrow

In [None]:
! conda install -y -c anaconda progressbar2

In [None]:
! conda list

## Imports and global configs

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import progressbar

# magic word for producing visualizations in notebook
%matplotlib inline

In [4]:
# limit the display to at most 50 columns and 100 rows
pd.set_option('display.max_columns', 50)

pd.set_option('display.max_rows', 100)

## Activate intelex for scikit
see [activate intelex for scikit](https://intel.github.io/scikit-learn-intelex/index.html)

In [None]:
#! conda install -y scikit-learn-intelex

# Download Data
---
The four data sets

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

and the two description files

- `DIAS Attributes - Values 2017.xlsx`
- `DIAS Information Levels - Attributes 2017.xlsx`

can be downloaded from the Udacity project workspace.

# Data exploration and cleaning
---

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) is a top-level list of attributes and descriptions, organized by informational category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the `.csv` data files in this project that they're semicolon (`;`) delimited, so an additional argument in the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.

## Load Data from S3
The load script uses a local `data` directory if one exists; otherwise it assumes that the downloaded data has been transferred to S3.

Loading the AZDIAS data set takes more than a minute; the CUSTOMERS data set should load in less than 20 seconds.


In [None]:
import os
if os.path.exists('data') and os.path.isdir('data'):
    prefix = './data'
else:
    prefix = 's3://sagemaker-eu-central-1-292575554790/dsnd_arvato'

In [None]:
! aws s3 ls s3://sagemaker-eu-central-1-292575554790/dsnd_arvato/

In [None]:
%%time
df_azdias = pd.read_csv(f'{prefix}/Udacity_AZDIAS_052018.csv', sep=';', index_col='LNR')
# load in the data
#azdias = pd.read_csv('data/Udacity_AZDIAS_052018.csv', sep=';')
#customers = pd.read_csv('data/Udacity_CUSTOMERS_052018.csv', sep=';')

In [None]:
%%time
df_customers = pd.read_csv(f'{prefix}/Udacity_CUSTOMERS_052018.csv', sep=';', index_col='LNR')

In [None]:
df_azdias.head()

In [None]:
df_azdias.info()

In [None]:
df_customers.info()

## Checks 


### Data errors during load
During the load process we got two error messages, for columns 18 and 19. I will check these columns here.

In [None]:
# column 18 has 0-based index 17
df_azdias.iloc[:,17].unique()

In [None]:
# column 19 has 0-based index 18
df_customers.iloc[:,18].unique()

In [None]:
df_azdias.columns[17:19]


**Result:** The errors are caused by the string values "X" and "XX" in the columns CAMEO_DEUG_2015 and CAMEO_INTL_2015.

I will add some code to the Data Cleaner class to handle this.
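
The handling described above could look roughly like the following sketch. It is not the code of the Data Cleaner class (which lives in a separate module); the helper name `clean_cameo` and the toy values are assumptions made for illustration only. The idea is to blank out the placeholder strings and then convert the columns to numeric values.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the mixed-type CAMEO columns (values are illustrative only).
df_toy = pd.DataFrame({
    "CAMEO_DEUG_2015": [8, "4", "X", np.nan, 2.0],
    "CAMEO_INTL_2015": ["51", 24, "XX", np.nan, 13.0],
})


def clean_cameo(df):
    """Hypothetical helper: set the placeholders 'X'/'XX' to NaN and convert to float."""
    df = df.copy()  # leave the caller's frame untouched
    for col, placeholder in [("CAMEO_DEUG_2015", "X"), ("CAMEO_INTL_2015", "XX")]:
        # mask() replaces matching entries by NaN; to_numeric parses the rest
        df[col] = pd.to_numeric(df[col].mask(df[col] == placeholder))
    return df


df_toy_clean = clean_cameo(df_toy)
print(df_toy_clean.dtypes)
```

After this step both columns are plain numeric columns, so the later imputation and scaling steps can treat them like any other encoded categorical feature.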


### check for duplicates
Check whether the datasets contain duplicate records based on the ID column `LNR`.

In [None]:
df_azdias.index.duplicated().sum()

In [None]:
df_customers.index.duplicated().sum()

## Loading & Explore Metadata

### Load Metadata

In [None]:
# forward-fill the merged Excel cells so that every row carries its attribute name
df_metadata = pd.read_excel(f'{prefix}/DIAS Attributes - Values 2017.xlsx', usecols='B:E', dtype='str', header=1).ffill()
df_metadata.head()

### Explore Metadata

1. check nulls
2. check unknown values

In [None]:
df_metadata.isnull().sum()

In [None]:
f"Number of unique attributes: {df_metadata['Attribute'].unique().shape[0]}"

In [None]:
f"Number of Attributes that can have an unknown value: {df_metadata['Meaning'].str.contains('unknown').sum()}"

In [None]:
f"Number of Attributes that can be `no transaction known` value: {df_metadata['Meaning'].str.contains('no transaction.? known', regex=True).sum()}"

In [None]:
f"Total: {df_metadata['Meaning'].str.contains('unknown').sum() + df_metadata['Meaning'].str.contains('no transaction.? known', regex=True).sum()}"

### Compare dataset features (columns)

In [None]:
# columns that the customers dataset contains but azdias does not
set(df_customers.columns) - set(df_azdias.columns)

In [None]:
# columns that the azdias dataset contains but customers does not
set(df_azdias.columns) - set(df_customers.columns)

**Result**: `CUSTOMERS` dataset has 3 more columns {'CUSTOMER_GROUP', 'ONLINE_PURCHASE', 'PRODUCT_GROUP'}

### Metadata Columns compared to Dataset columns

Check for which columns of the dataset a metadata description exists.

In [None]:
df_metadata_cols = df_metadata['Attribute'].copy()
# some metadata attributes end in _RZ, whereas the same columns in the datasets do not
# therefore we strip this suffix before comparing
df_metadata_cols = df_metadata_cols.str.replace('_RZ', '', regex=False)

diff_set = set(df_azdias.columns) - set(df_metadata_cols)
print(f'number of cols in AZDIAS dataset but not described in Metadata: {len(diff_set)}')
pd.Series(list(diff_set)).sort_values().unique()

In [None]:
diff_set2 = set(df_metadata_cols) - set(df_azdias.columns)
print(f'number of cols in Metadata but not in AZDIAS dataset: {len(diff_set2)}')
diff_set2

In [None]:
df_azdias[list(diff_set)].head()

### Extract `Kinder` information and build new feature
We use the `ANZ_KINDER` and `ALTER_KIND(n)` columns to derive the number of children aged < 10 and >= 10.

In [None]:
num_moreThan4Children = df_azdias[df_azdias['ANZ_KINDER']>4].shape[0]
num_withChildren = df_azdias[df_azdias['ANZ_KINDER']>0].shape[0]
df_children5plus = df_azdias[(df_azdias['ANZ_KINDER']>4) & (df_azdias['ALTER_KIND4']<10)].filter(regex='(ANZ_KINDER)|(ALTER_KIND.?)')

print(f'number of records with more than 4 children: {num_moreThan4Children} of {df_azdias.shape[0]:,.0f} ({(num_moreThan4Children / df_azdias.shape[0] *100):6.5f} %)')
print(f'number of records with at least one child: {num_withChildren} of {df_azdias.shape[0]:,.0f} ({(num_withChildren / df_azdias.shape[0] *100):6.5f} %)')
print(f'number of records with ANZ_KINDER >= 5 and ALTER_KIND4 < 10: {df_children5plus.shape[0]}\n')
print('-'*80)

ax = df_azdias['ANZ_KINDER'].plot.hist()
ax.set_yscale('log')

**Result** The query above shows that there are just 9 records with 5 or more children and an age of child 4 (`ALTER_KIND4`) below 10. In addition, the `ALTER_KIND` column values are ordered, so we can assume that the fifth and any further children are aged 10 or older.

In [None]:
num_moreThan4Children = df_customers[df_customers['ANZ_KINDER']>4].shape[0]
num_withChildren = df_customers[df_customers['ANZ_KINDER']>0].shape[0]
df_children5plus = df_customers[(df_customers['ANZ_KINDER']>4) & (df_customers['ALTER_KIND4']<10)].filter(regex='(ANZ_KINDER)|(ALTER_KIND.?)')

print(f'number of records with more than 4 children: {num_moreThan4Children} of {df_customers.shape[0]:,.0f} ({(num_moreThan4Children / df_customers.shape[0] *100):6.5f} %)')
print(f'number of records with at least one child: {num_withChildren} of {df_customers.shape[0]:,.0f} ({(num_withChildren / df_customers.shape[0] *100):6.5f} %)')
print(f'number of records with ANZ_KINDER >= 5 and ALTER_KIND4 < 10: {df_children5plus.shape[0]}\n')
print('-'*80)

max_children = int(df_customers['ANZ_KINDER'].max())
ax = df_customers['ANZ_KINDER'].plot.hist(bins=range(max_children+2), width=0.8)
ax.set_yscale('log')
ax.set_xticks(range(0,max_children+1));
ax.set_title('Record distribution for number of children');
ax.set_xlabel('number of children')

### Metadata Summary

The value *"unknown"* will be treated like a missing value.

The value *"no transaction(s) known"* will be treated as if the customer has made no transaction.
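
As a rough illustration of the first rule, the sketch below derives the "unknown" codes per attribute from a metadata table and replaces them with NaN. It assumes a metadata layout with the columns `Attribute`, `Value` and `Meaning` (`Value` is an assumption; only the other two appear above), and the helper names `unknown_codes` and `replace_unknown` as well as the toy rows are hypothetical. It is not the implementation used in the cleaner module.

```python
import pandas as pd

# Toy metadata; the real file is much larger and the rows here are illustrative.
meta_toy = pd.DataFrame({
    "Attribute": ["AGER_TYP", "AGER_TYP", "ALTERSKATEGORIE_GROB",
                  "ALTERSKATEGORIE_GROB", "D19_TELKO_ANZ_12"],
    "Value": ["-1", "0", "-1, 0", "1", "0"],
    "Meaning": ["unknown", "no classification possible", "unknown",
                "< 30 years", "no transactions known"],
})


def unknown_codes(meta):
    """Map each attribute to the numeric codes whose meaning contains 'unknown'."""
    rows = meta[meta["Meaning"].str.contains("unknown", na=False)]
    # a Value cell may hold several codes separated by commas, e.g. "-1, 0"
    return {attr: [int(code) for code in value.split(",")]
            for attr, value in zip(rows["Attribute"], rows["Value"])}


def replace_unknown(df, codes):
    """Replace the unknown codes of every listed column by NaN."""
    df = df.copy()
    for col, values in codes.items():
        if col in df.columns:
            df[col] = df[col].mask(df[col].isin(values))
    return df


data_toy = pd.DataFrame({
    "AGER_TYP": [-1, 0, 2],
    "ALTERSKATEGORIE_GROB": [0, 3, -1],
    "D19_TELKO_ANZ_12": [0, 1, 6],
})
codes = unknown_codes(meta_toy)
data_clean = replace_unknown(data_toy, codes)
print(codes)
print(data_clean)
```

Note that `D19_TELKO_ANZ_12` keeps its value 0 in this sketch, in line with the second rule: "no transactions known" is read as "no transaction" rather than as missing.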


# Class Cleaner
----

The code for the ETL pipeline has been moved into a separate Python module; see the module `etl.processor`.

In [None]:
import python.etl.processor as etlp

# Data Cleaning 
---

The `DataCleaner` class will handle the following:

* replace `unknown` values (represented by -1, 0 or 9; see [Metadata Descriptions](#Loading-&-Explore-Metadata))
* handle the errors raised during the load
* handle categorical variables
* drop columns that are not needed

See the sections below for details.

In [None]:
TESTING = False
if TESTING:
    df_azdias_cleaned = df_azdias.iloc[:100,:].copy()
else:
    df_azdias_cleaned = df_azdias.copy()


In [None]:
df_azdias_cleaned.info()

## Handle Unknown / Missing Data

The dataset contains a lot of unknown values. These values are often encoded as -1, 0 or 9 (see the metadata files). I replace all unknown values with `np.nan` so that the standard pandas functions for imputing and dropping can be used.



In [None]:
df_azdias_cleaned.shape

In [None]:
# Assess missing data in columns
fig, ax = plt.subplots(figsize=(8, 4))
df_nulls = df_azdias.isnull().sum(axis=0)
ax.hist(df_nulls, bins =50, alpha=0.5)
ax_bis = ax.twinx()
ax_bis.hist(df_nulls, bins =50, cumulative=True, density=True, histtype='step', color='red', alpha=0.8, label='cum_line')
#ax_bis.hist(df_azdias_cleaned.isnull().sum(axis=0), bins =50, cumulative=-1, density=True, histtype='step', color='red', alpha=0.95)
plt.title('Distributions of missing data before cleaning')
ax.set_xlabel('# Missing Values (NaN)')
ax.set_ylabel('Columns');
ax_bis.set_ylabel('cumulative');
ax_bis.hlines(xmin=0, xmax=df_nulls.max(), y=0.9, linestyles='dashed', color='grey', label='0.9')
ax_bis.legend(bbox_to_anchor=(1.07, 1.0), loc='upper left');

plt.savefig('dist_of_missingdata_before_transformation.jpg')

Most columns have less than 25% missing values, but some columns have more than 50% missing data. Let's find them.

In [None]:
num_of_records = df_azdias_cleaned.shape[0]
s_missing_data = df_azdias_cleaned.isnull().sum(axis=0)
s_missing_data_pct = df_azdias_cleaned.isnull().sum(axis=0) / num_of_records 

df_missing_data = pd.DataFrame({'abs':s_missing_data,'pct':s_missing_data_pct})
df_missing_data.sort_values(by='pct', ascending=False)[:20]


**Results**: 
* There are 19 variables with more than 25% missing values; these are candidates to drop.
* Some variables all have the same number of missing values (257113, the D19_... variables).
* The variables `ALTER_KIND1` - `ALTER_KIND4` have a huge number of missing values. This is because they depend on `ANZ_KINDER` (number of children), so that for all records with `ANZ_KINDER`=0 the values for `ALTER_KIND1` - `ALTER_KIND4` are missing. We will handle this in the feature engineering part and build a new variable for these.


**Note**: The drop operation will be the last step, as the columns may be needed during the feature engineering process.


In [None]:
drop_level = 0.25
columns_to_drop = s_missing_data_pct.sort_values(ascending=False)
columns_to_drop = columns_to_drop[columns_to_drop>drop_level].index
columns_to_drop

## Investigate columns that throw an error
Info: this repeats the check from [Data errors during load](#Data-errors-during-load)

In [None]:
df_azdias_cleaned['CAMEO_DEUG_2015'].unique()

Evidently the value 'X' is causing the issue. I will replace it with `np.nan`.

In [None]:
df_azdias_cleaned['CAMEO_INTL_2015'].unique()

Evidently the value 'XX' is causing the issue. I will replace it with `np.nan`.

## Handle Categorical Values

The datasets have a huge number of categorical variables. Most of them are already encoded as ints or floats, e.g. `AGER_TYP` is encoded as follows:

|value  | meaning |
|-----  |---------|
|-1     |	unknown |
|0	    | no classification possible |
|1	    | passive elderly |
|2	    | cultural elderly |
|3	    | experience-driven elderly |

We keep this encoding because in many cases the categorical values are ordinal and only some are nominal. For example, the values of the variable `D19_TELKO_ANZ_12` are ordered from `very low activity` to `very high activity`:

|value  | meaning |
|-----  |---------|
|0      | no transactions known            |
|1      | very low activity                |
|2      | low activity                     |
|3      | slightly increased activity      |
|4      | increased activity               |
|5      | high activity                    |
|6      | very high activity               |


However, some columns have dtype `object`. These are investigated next.

In [None]:
df_azdias_cleaned.info()

In [None]:
df_azdias_cleaned.select_dtypes(include='object').head()

#### Results Categorical:

| variable | type | action |
|--|--|--|
| CAMEO_DEU_2015 | nominal | replace by one hot encoding |
| D19_LETZTER_KAUF_BRANCHE | nominal | replace by one hot encoding |
| EINGEFUEGT_AM | date | drop - this is just the date when the record has been added |
| OST_WEST_KZ | nominal | replace by binary 0 and 1 |

* `CAMEO_DEUG_2015`: encoded categorical variable - contains the invalid string 'X'
* `CAMEO_INTL_2015`: encoded categorical variable - contains the invalid string 'XX'
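
The actions listed above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `encode_categoricals`, the toy values and the choice of which `OST_WEST_KZ` value maps to 0 or 1 are not taken from the cleaner module.

```python
import pandas as pd

# Toy frame with three of the object columns (values are illustrative only).
cat_toy = pd.DataFrame({
    "CAMEO_DEU_2015": ["8A", "4C", None],
    "OST_WEST_KZ": ["W", "O", "W"],
    "EINGEFUEGT_AM": ["1992-02-10 00:00:00", "1992-02-12 00:00:00",
                      "1997-04-21 00:00:00"],
})


def encode_categoricals(df):
    """Hypothetical helper applying the actions from the table above."""
    # the insertion date carries no demographic information
    df = df.drop(columns=["EINGEFUEGT_AM"])
    # binary encoding; the direction of the mapping (W -> 0, O -> 1) is an assumption
    df["OST_WEST_KZ"] = df["OST_WEST_KZ"].map({"W": 0, "O": 1})
    # one-hot encode the nominal column; missing values yield all-zero dummy rows
    return pd.get_dummies(df, columns=["CAMEO_DEU_2015"], dtype=int)


encoded = encode_categoricals(cat_toy)
print(encoded)
```

One-hot encoding adds one column per category, which is why the number of columns grows noticeably after the cleaning step.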


In [None]:
pd.set_option('max_seq_items',450)
df_azdias_cleaned.columns

## Run the cleaning process

In [None]:
dfCleaner = etlp.PreDataCleaner(df_metadata)

df_azdias_cleaned = dfCleaner.transform(df_azdias_cleaned)
df_azdias_cleaned.shape

### Check distribution of Missing values again

In [None]:
# Assess missing data in columns
fig, ax = plt.subplots(figsize=(8, 4))
df_nulls = df_azdias.isnull().sum(axis=0)
ax.hist(df_nulls, bins =50, alpha=0.5)
ax_bis = ax.twinx()
ax_bis.hist(df_nulls, bins =50, cumulative=True, density=True, histtype='step', color='red', alpha=0.8, label='cum_line')
#ax_bis.hist(df_azdias_cleaned.isnull().sum(axis=0), bins =50, cumulative=-1, density=True, histtype='step', color='red', alpha=0.95)
plt.title('Distributions of missing data before cleaning')
ax.set_xlabel('# Missing Values (NaN)')
ax.set_ylabel('Columns');
ax.set_ylim(0, 200)
ax_bis.set_ylabel('cumulative');
ax_bis.hlines(xmin=0, xmax=df_nulls.max(), y=0.9, linestyles='dashed', color='grey', label='0.9')
ax_bis.legend(bbox_to_anchor=(1.07, 1.0), loc='upper left');

# this cell plots the uncleaned data, so it is saved under the "before" file name
plt.savefig('dist_of_missingdata_before_transformation.jpg')

In [None]:
# Assess missing data in columns
fig, ax = plt.subplots(figsize=(8, 4))
df_nulls = df_azdias_cleaned.isnull().sum(axis=0)
ax.hist(df_nulls, bins =50, alpha=0.5)
ax_bis = ax.twinx()
ax_bis.hist(df_nulls, bins =50, cumulative=True, density=True, histtype='step', color='red', alpha=0.8, label='cum_line')
#ax_bis.hist(df_azdias_cleaned.isnull().sum(axis=0), bins =50, cumulative=-1, density=True, histtype='step', color='red', alpha=0.95)
plt.title('Distributions of missing data after cleaning')
ax.set_xlabel('# Missing Values (NaN)')
ax.set_ylabel('Columns');
ax.set_ylim(0, 200)
ax_bis.set_ylabel('cumulative');
ax_bis.hlines(xmin=0, xmax=df_nulls.max(), y=0.9, linestyles='dashed', color='grey', label='0.9')
ax_bis.legend(bbox_to_anchor=(1.07, 1.0), loc='upper left');

plt.savefig('dist_of_missingdata_after_transformation.jpg')



### Comparison of distributions of missing data

![alt distribution-before-transformation](dist_of_missingdata_before_transformation.jpg)
![alt distribution-after-transformation](dist_of_missingdata_after_transformation.jpg)

**Results**: 

* There is a significant increase in the number of columns with no missing data.  
This is because the categorical features were transformed into one-hot encoded columns, which adds columns without missing values.
* The other changes occur because we replaced "unknown" values with `np.nan`.

## Save cleaned Datasets
Note: saving to feather requires resetting the index.

In [None]:
df_azdias_cleaned.reset_index().to_feather('df_azdias_cleaned_step1-cleaned')

# Feature Engineering
Many records have ANZ_KINDER (number of children) = 0. 
For these records the age-of-children columns (ALTER_KIND(N)) are always NaN. For records with a positive number 
of children, the ALTER_KIND columns contain the ages of the children. We will replace these columns by summarizing
them in two columns that indicate the number of children younger than 10 and aged 10 or older.

In [None]:
df  = df_azdias_cleaned
cols_to_investigate = ['ALTER_KIND1','ALTER_KIND2','ALTER_KIND3','ALTER_KIND4','ANZ_KINDER']
df_kinder = df[cols_to_investigate]


In [None]:
#df_kinder = df_azdias_cleaned.filter(regex='(ANZ_KINDER)|(ALTER_KIND.?)')

figure, ax_list = plt.subplots(1,5,figsize=(24,5))

for i, col in enumerate(cols_to_investigate):
    df_kinder[col].value_counts().plot(kind='bar',ax=ax_list[i], title=col)

**Results**: The majority has no children, so the dataset is quite imbalanced. Even the age of the children is heavily imbalanced: as the `ALTER_KIND1` chart shows, ages above 5 occur much more often than ages of 5 or below.

Based on the observations above I will build the new features `d_has_children` and `d_has_children_yte10`; the latter indicates that a person has children aged 10 or younger.
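
A minimal sketch of how such flags could be derived is shown below. The actual logic lives in the `FeatureBuilder` class and may differ; the helper name `add_children_flags`, the toy values and the exact threshold handling (age <= 10) are assumptions based on the feature names.

```python
import numpy as np
import pandas as pd

# Toy frame with the children-related columns (values are illustrative only).
kids_toy = pd.DataFrame({
    "ANZ_KINDER": [0, 1, 2, np.nan],
    "ALTER_KIND1": [np.nan, 12.0, 7.0, np.nan],
    "ALTER_KIND2": [np.nan, np.nan, 15.0, np.nan],
    "ALTER_KIND3": [np.nan] * 4,
    "ALTER_KIND4": [np.nan] * 4,
})


def add_children_flags(df):
    """Hypothetical helper adding the two binary children features."""
    df = df.copy()
    age_cols = [f"ALTER_KIND{i}" for i in range(1, 5)]
    # comparisons with NaN evaluate to False, so missing values count as "no"
    df["d_HAS_CHILDREN"] = (df["ANZ_KINDER"] > 0).astype(int)
    df["d_HAS_CHILDREN_YTE10"] = (df[age_cols] <= 10).any(axis=1).astype(int)
    return df


flags = add_children_flags(kids_toy)
print(flags[["ANZ_KINDER", "d_HAS_CHILDREN", "d_HAS_CHILDREN_YTE10"]])
```

Because the flags never contain NaN, they sidestep the very high share of missing values in the `ALTER_KIND` columns.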

In [None]:
df_azdias_cleaned.head()

## Run Feature Engineering Process

In [None]:
import python.etl.processor as etlp

featureBuilder = etlp.FeatureBuilder()
df_azdias_cleaned = featureBuilder.transform(df_azdias_cleaned)


In [None]:
df_azdias_cleaned.info()


## Save results
Saving the results here makes it easier to continue developing and testing the next steps.

**Info**: [Best way to save pandas Dataframe](https://towardsdatascience.com/the-best-format-to-save-pandas-data-414dca023e0d)


In [None]:
df_azdias_cleaned.reset_index().to_feather('df_azdias_cleaned_step2-feaEngineered')

## Loading DF
You can start here if you want to skip the previous steps.

In [None]:
df_azdias_cleaned = pd.read_feather('df_azdias_cleaned_step2-feaEngineered')

# restore LNR as the index, since feather stored it as a regular column
df_azdias_cleaned.set_index('LNR', inplace=True)
df_azdias_cleaned.head()

# Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

## Cluster algorithms
There are a number of popular algorithms for clustering. For the algorithm selection I focused on the ones that scikit-learn provides and on the article [clustering algorithms with python](https://machinelearningmastery.com/clustering-algorithms-with-python/).

## Feature Reduction and Selection

The dimensionality of the dataset is quite high (442 features), so it is worth considering a dimensionality reduction, which will improve the performance and in many cases the accuracy of the algorithm. In particular the popular K-means, which I will use, will benefit from it.

See e.g. [PCA with k-means](https://365datascience.com/tutorials/python-tutorials/pca-k-means/)

## Approach 

1. **Impute missing data**
1. **Standardize data**

1. **PCA - Principal Component Analysis**  
This algorithm is also provided by scikit-learn. It transforms the given feature space into a new space whose basis vectors are linear combinations of the given features, chosen so that the new vectors point in the directions of maximum variance.

1. **K-means** 


For the complete process I will use a scikit-learn pipeline to chain the steps.
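
To make the four steps concrete, here is a small sketch of such a chain on synthetic data. The data, the number of PCA components and the number of clusters are placeholders chosen for illustration only; they are not tuned values for the AZDIAS dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the demographics data: two shifted groups with some NaNs.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 12))
X_demo[:150] += 4.0
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# impute -> standardize -> PCA -> K-means, chained in one estimator
segmentation_pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    PCA(n_components=5, random_state=42),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)

labels = segmentation_pipeline.fit_predict(X_demo)
print(np.bincount(labels))  # number of records per cluster
```

Chaining the steps has the advantage that a pipeline fitted on the general population can later be applied unchanged to the customers data via `segmentation_pipeline.predict(...)`, so both datasets are imputed, scaled and projected in exactly the same way.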

## Build the pipeline steps

In [2]:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn import metrics

### Load Data

In [3]:
X_train = pd.read_feather('df_azdias_cleaned_step2-feaEngineered')

# restore LNR as the index, since feather stored it as a regular column
X_train.set_index('LNR', inplace=True)
X_train.head()



| LNR | AGER_TYP | AKT_DAT_KL | ALTER_HH | ALTER_KIND1 | ALTER_KIND2 | ALTER_KIND3 | ALTER_KIND4 | ALTERSKATEGORIE_FEIN | ANZ_HAUSHALTE_AKTIV | ANZ_HH_TITEL | ... | D19_LETZTER_KAUF_BRANCHE_D19_TELKO_MOBILE | D19_LETZTER_KAUF_BRANCHE_D19_TELKO_REST | D19_LETZTER_KAUF_BRANCHE_D19_TIERARTIKEL | D19_LETZTER_KAUF_BRANCHE_D19_UNBEKANNT | D19_LETZTER_KAUF_BRANCHE_D19_VERSAND_REST | D19_LETZTER_KAUF_BRANCHE_D19_VERSICHERUNGEN | D19_LETZTER_KAUF_BRANCHE_D19_VOLLSORTIMENT | D19_LETZTER_KAUF_BRANCHE_D19_WEIN_FEINKOST | d_HAS_CHILDREN | d_HAS_CHILDREN_YTE10 |
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
| 910215 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 910220 | NaN | 9.0 | NaN | NaN | NaN | NaN | NaN | 21.0 | 11.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 910225 | NaN | 9.0 | 17.0 | NaN | NaN | NaN | NaN | 17.0 | 10.0 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 910226 | 2.0 | 1.0 | 13.0 | NaN | NaN | NaN | NaN | 13.0 | 1.0 | 0.0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 910241 | NaN | 1.0 | 20.0 | NaN | NaN | NaN | NaN | 14.0 | 3.0 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [4]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891221 entries, 910215 to 825787
Columns: 442 entries, AGER_TYP to d_HAS_CHILDREN_YTE10
dtypes: float64(300), int64(142)
memory usage: 2.9 GB


### Imputation

In [6]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')

X_train = imputer.fit_transform(X_train)

MemoryError: Unable to allocate 2.93 GiB for an array with shape (442, 891221) and data type float64
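
The `MemoryError` above shows that the additional float64 array of roughly 2.9 GiB does not fit into memory. One possible mitigation, sketched here on synthetic data and not verified on the full AZDIAS dataset, is to work with float32 values, which halves the size of the array that has to be held. The variable names and the data below are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Synthetic stand-in for the cleaned dataset with about 10% missing values.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(1000, 20)),
                       columns=[f"f{i}" for i in range(20)])
df_demo = df_demo.mask(rng.random(df_demo.shape) < 0.1)

# float32 needs half the bytes of float64 for the same shape
X64 = df_demo.to_numpy(dtype=np.float64)
X32 = df_demo.to_numpy(dtype=np.float32)
print(X64.nbytes, X32.nbytes)

# copy=False asks the imputer to work in place where that is possible
imputer_demo = SimpleImputer(missing_values=np.nan, strategy="median", copy=False)
X32_imputed = imputer_demo.fit_transform(X32)
print(np.isnan(X32_imputed).any())
```

Whether this is sufficient depends on the available memory; other options would be to fit the imputer on a random sample of the rows or to use a machine with more RAM.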

### Standardization

An important preprocessing step for PCA is standardization (scaling) of the features. See [Importance of Feature Scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html) for more information.

In [None]:
# in order to generalize we define a new variable for the dataset that will be used in the next steps
df_segmentation = df_azdias_cleaned

scaler = StandardScaler()
X_train = scaler.fit_transform(df_segmentation)

###  PCA

In [None]:
pca = PCA()
# keep the projected data instead of discarding the result of fit_transform
X_pca = pca.fit_transform(X_train)
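
A fitted PCA with all components can be used to decide how many components to keep. The sketch below, on synthetic data, picks the smallest number of components whose cumulative explained variance reaches 90%. The 90% threshold and the data are assumptions for illustration, not a result from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 15 observed features driven by 3 latent factors plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 3))
X_demo = latent @ rng.normal(size=(3, 15)) + 0.1 * rng.normal(size=(500, 15))

X_scaled = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_scaled)

# cumulative share of variance explained by the first k components
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# searchsorted returns the first index whose cumulative value is >= 0.90
n_components_90 = int(np.searchsorted(cum_var, 0.90) + 1)
print(n_components_90, cum_var[n_components_90 - 1])
```

As an alternative, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` and then selects the number of components needed to explain that share of the variance.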

In [None]:
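from sklearn.naive_bayes import GaussianNB

# Draft: fit and predict using pipelined imputation, scaling, PCA and GNB.
# NOTE: y_train and X_test are not defined at this point in the notebook;
# they have to be created from labelled data before this cell can run.
std_clf = make_pipeline(SimpleImputer(missing_values=np.nan, strategy='median'), StandardScaler(), PCA(n_components=2), GaussianNB())
std_clf.fit(X_train, y_train)
pred_test_std = std_clf.predict(X_test)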

# Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

In [None]:
mailout_train = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')