# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project, from the unsupervised learning project. The versions of those two datasets used in this project will include many more features and has not been pre-cleaned. You are also free to choose whatever approach you'd like to analyzing the data rather than follow pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

## Part 0: Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) is a top-level list of attributes and descriptions, organized by informational category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the `.csv` data files in this project that they're semicolon (`;`) delimited, so an additional argument in the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.

# BUSINESS UNDERSTANDING

Marketing is crucial for the growth and sustainability of the business as it helps build company’s brand, engage customers, grow revenues and increase sales. One of the key painpoint of business is to understand customers and identify their needs in order to tailor campaigns to customer segments most likely to purchase products.
Customer segmentation  helps business plan marketing campaigns easier, focusing on certain customer groups instead of targeting the mass market, therefore more efficient in terms of time, money and other resources. 
* What are the relationship beween demographics of the company's existing customers and the general population of Germany?
* Which parts of the general population that are more likely to be part of the mail-order company's main customer bases, and which parts of the general population are less so
* How historical demographic data can help business to build prediction model, therefore be able to identify potential customers.<br> 

Fortunately, those business questions can be solved using analytics by involving appropriate data analytics tools and methodologies.

# DATA UNDERSTANDING

In [1]:
# import libraries here
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
import operator
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# magic word for producing visualizations in notebook
%matplotlib inline

In [2]:
# load in population data
azdias = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_AZDIAS_052018.csv', sep=';')


  interactivity=interactivity, compiler=compiler, result=result)


In [None]:
# load in customers data
customers = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_CUSTOMERS_052018.csv', sep=';')

In [None]:
azdias.head()

In [None]:
customers.head()

### Understanding column descriptions and domain values

* load Attributes and Values files
* merge those 2 files to create a data dictionary for Arvato files
* create a list of tuples which contains column name and its 'unknown' domain values
* create a generic function to display the % of missing values in each column

In [None]:
# load Attributes file and fill in missing values for column Information Level
attributes =  pd.read_excel('DIAS Information Levels - Attributes 2017.xlsx',sheet_name=None)
attb_desc = list(attributes.values())[0]
attb_desc['Information level'] = attb_desc['Information level'].fillna(method='ffill')
attb_desc.head()

In [None]:

print('attribute:  {}  description:  {}'.format(attb_desc['Attribute'],attb_desc['Description']))

In [None]:
# load Values file and filling missing values for column Attribute
values =  pd.read_excel('DIAS Attributes - Values 2017.xlsx', sheet_name=None)
attb_vals = list(values.values())[0]
attb_vals['Attribute'] = attb_vals['Attribute'].fillna(method='ffill')
attb_vals.head()

In [None]:
# join Attributes and Values files to form data dictionary for Arvato files

data_dictionary = pd.merge(attb_vals, attb_desc, on='Attribute')
data_dictionary = data_dictionary[['Information level','Attribute','Description_x','Value','Meaning','Additional notes']]
data_dictionary

In [None]:
data_dictionary[data_dictionary['Attribute']=='CJT_GESAMTTYP']

In [None]:
# display columns and associate values meaning 'unknown'

NaN_meanings = ['unknown','no classification possible','unknown / no main age detectable','no transaction known']
NaN_df = data_dictionary[data_dictionary.Meaning.isin(NaN_meanings)]
NaN_df


In [None]:
NaN_df['Value'].unique()

In [None]:
# create a list of tuples which contains column name and associate 'unknown' values
idx = NaN_df.index
unknown_list =[]
for i in idx:
    val = NaN_df['Value'][i]
    if val == '-1, 0' or val == '-1, 9':
        tupl = (NaN_df['Attribute'][i],val[0:2])
        unknown_list.append(tupl)
        tupl = (NaN_df['Attribute'][i],val[4:6])
        unknown_list.append(tupl)
    else:
        tupl = (NaN_df['Attribute'][i],str(val))
        unknown_list.append(tupl)

In [None]:
def check_unknown_value(col,value):
    '''
    This functions check if a value in a column match 'unknown' definition is data dictionary
    INPUT: - a column in a dataset
           - a value in the column
    OUTPUT:
           return true if the value of the column match 'unknown' definition in data dictionary, otherwise return false
    '''
    result = bool()
    tup = (col,value)
    if tup in unknown_list:
        result = True
    else:
        result = False

    return result

def percent_missing_values(df):
    '''
    This function calculate and display % of missing values in each column of dataset
    INPUT: pandas dataframe
    OUTPUT: dictionary with key is the column name and value is % of missing
    '''
    missing_dict = {}
    total_rec = df.shape[0]
    
    for col in df.columns.values:  #iterate columns
        # if column contain NaN values, calculate the mean of missing values
        if df[col].isnull().values.any():  
            missing_dict[col]= df[col].isnull().mean()
        # if column does not contain NaN values, map values to 'unknown' value in data dictionary
        else:
            s = df[col].value_counts(dropna=False)
            nan_count = 0
            for index, value in s.items():        
                if check_unknown_value(col,str(index)) :
                    nan_count = nan_count + value
            missing_dict[col]= round(nan_count/total_rec,4)
            
    # print dictionary in reverse order (highest-> lowest % missing values)
    sorted_d = dict( sorted(missing_dict.items(), key=operator.itemgetter(1),reverse=True))
        
    missing_dict = sorted_d
    
    return missing_dict



In [None]:
def top_percent_missing(df, percent):
    '''
    This function displays and return columns which have % missing greater than certain percentage
    INPUT: - input pandas dataframe
           - percent , enter 50 for 50%
    OUTPUT: a list of columns which have % missing greater than <percent>
    '''
    cols = []
    
    missing_dict = percent_missing_values(df)
    for key,value in missing_dict.items():
        if value > percent/100:
            cols.append(key)
    return cols

### Data Analysis - Population file

In [None]:
# display top5 rows in population file
azdias.head()

In [None]:
azdias.describe()

In [None]:
# display % of missing in population file
missing_pop = percent_missing_values(azdias)
for key,val in missing_pop.items():
    print('{} - {}'.format(key,val))

In [None]:
# display the columns which have % missing greater than 30%
cols_pop = top_percent_missing(azdias, 30)
cols_pop

In [None]:
def prepare_hist_plot(missing_dict):
    '''
    '''
    missing_list = []
    for value in missing_dict.values():
        missing_list.append(math.ceil(value*100))
    
    return missing_list
    

### Data Analysis - Customer file

In [None]:
# display top5 rows in customers file
customers.head()

In [None]:
customers['PRODUCT_GROUP'].value_counts()/customers.shape[0]

In [None]:
customers['CUSTOMER_GROUP'].value_counts()/customers.shape[0]

In [None]:
customers['ONLINE_PURCHASE'].value_counts()/customers.shape[0]

In [None]:
categorical_cols = list(customers.select_dtypes(exclude=['int64','float64']).columns)
categorical_cols

In [None]:
numeric_cols = list(customers.select_dtypes(include=['int64','float64']).columns)
len(numeric_cols)

In [None]:
customers.describe()

In [None]:
# display % of missing in customers file
missing_cust = percent_missing_values(customers)
for key,val in missing_cust.items():
    print('{} - {}'.format(key,val))

In [None]:
# display the columns which have % missing greater than 50%
cols_cust = top_percent_missing(customers, 50)
cols_cust

In [None]:
# compare % of missing data between Population and Customer files

fig, (ax1, ax2) = plt.subplots(1,2, sharey=True,figsize=(12,7))
ax1.set(xlabel = '% missing values', title = 'GENERAL POPULATION')
ax2.set(xlabel = '% missing values', title = 'CUSTOMERS')
sns.countplot(prepare_hist_plot(missing_pop), ax=ax1, palette ='husl')
sns.countplot(prepare_hist_plot(missing_cust), ax=ax2, palette ='husl')
plt.show()

In [None]:
print('popluation total records:' ,azdias.shape[0])
print('customer total records:' ,customers.shape[0])
azdias.shape[0]/customers.shape[0]

### Data Analysis - Summary

* The domain values represent 'unknown' are not consistent between columns. Different values meaning the same thing (ie. 0, -1,9, NaN)
* Volumn of general population file is nearly 5 times larger than that of customer file. This implies 20% of population is Arvato's customers
* Distribution of missing values are slightly difference between General Population and Customer fiels.Around 80+ columns have 0% missing values.Most columns have percentage missing values between 10% and 30%. The MODE % of missing values in Population file is 12%, whereas in Customer file is 27%
* Many columns have identical % of missing values, it is likely that those colums are relevant.
* Only 10% of customers purchase online, 30% are single buyers vs 70% multi buyers
* 8 variables are categorical, the rest are numeric variables.


# DATA PREPARATION


Create generic function to prepare data for machine learning, which includes:
* drop rows with more than 75% missing values
* drop columns with more than 70% missing values
* drop customer id column 
* drop categorical columns
* drop 3 columns exist in Customer file but not exist in Population file
* for numeric variables, replace NaN with values implying 'unknown' in data dictionary, in this case is -1


In [None]:
def drop_columns(df,percent):
    '''
    This function performs below:
    - drop columns with more than input percent
    - drop customer ID column
    - drop categorical columns
    INPUT: input pandas dataframe
    OUTPUT: pandas dataframe after columns being removed
    '''
    # drop columns with more than percent missing values
    #missing_dict = percent_missing_values(df)
    cols = top_percent_missing(df, percent)
    missing_df = df.drop(columns= cols)
    
    # drop customer ID column
    missing_df.drop(['LNR'], axis=1,inplace=True)
    
    # drop categorical columns
    categorical_cols = list(df.select_dtypes(exclude=['int64','float64']).columns)
    dropped_df = missing_df.drop(columns= categorical_cols)
      
    
    return dropped_df

def drop_rows(df,value_threshold):
    '''
    This function drops records with number of Nan greater than threshold
    INPUT: - input pandas dataframe , 
           - threshold of missing values in a row
    OUTPUT: pandas dataframe after rows being removed
    '''
    # drop rows which have more than threshold missing values
    nan_rows =  df.isnull().sum(axis=1)
    droped_rows = list(nan_rows[nan_rows.values >=value_threshold].index)
    dropped_df = df.drop(droped_rows)
    
    return dropped_df

def prepare_data(df,percent_threshold,value_threshold = 0):
    '''
    This function peforms below:
    - drop rows with number of missing values greater than threshold
    - drop columns not useful for machine learning
    - fill NaN values with -1

    INPUT: 
    - input pandas dataframe, 
    - acceptable percentage of missing values (columns with % > threshold will be removed )
    OUTPUT: cleaned pandas dataframe
    '''
    
    # drop rows 
    if value_threshold == 0:  #threshold not supplied, 
        row_df = df  # no row dropped
    else:
        row_df = drop_rows(df,value_threshold)
    
    # drop columns 
    col_df = drop_columns(row_df,percent_threshold)  

    
    # fill NaN with -1 values
    clean_df = col_df.fillna(-1)
    
   
    return clean_df



In [None]:
# check no of missing values in each rows in population file
nan_rows_pop = azdias.shape[1] - azdias.count(axis=1)
nan_rows_pop.describe()

In [None]:
# clean population file, remove columns with more than 70% Nan and rows whose NaN counts beyond 3rd ISQ (75%)
pop_cleaned_df = prepare_data(azdias,70,16)

In [None]:
# check no of missing values in each rows in customers file
nan_rows_cust = customers.shape[1] - customers.count(axis=1)
nan_rows_cust.describe()

In [None]:
# clean customers file, remove columns with more than 70% Nan and rows whose NaN counts beyond 3rd ISQ (75%)
cust_cleaned_df = prepare_data(customers,70,225)

### Feature Scaling

In [None]:
scaler = StandardScaler()
def scale_numeric_var(df):
    ''' This function scales numeric variables in df dataset
    INPUT:  pandas dataset
    OUTPUT: scaled dataset
    '''

    df_scaled = pd.DataFrame(scaler.fit_transform(df),
                              index=df.index, columns=df.columns)
    
    return df_scaled

In [None]:
# scale population data
df_scaled_pop = scale_numeric_var(pop_cleaned_df)

In [None]:
df_scaled_pop.head()

In [None]:
# scale customer data
df_scaled_cust = scale_numeric_var(cust_cleaned_df)
df_scaled_cust.shape

### Feature Reduction

In [None]:
pca = PCA()
pca_data = pca.fit(df_scaled_pop)

In [None]:
# draw PCA chart

num_components= len(pca.explained_variance_ratio_)
idx = np.arange(num_components)
ratio = pca.explained_variance_ratio_
 
plt.figure(figsize=(13, 9))
ax = plt.subplot(111)
cumvals = np.cumsum(ratio)
ax.bar(idx, ratio)
ax.plot(idx, cumvals)
for i in range(num_components):
    if(i%20 == 0 or i<6):
        ax.annotate(r"%s%%" % ((str(ratio[i]*100)[:4])), (idx[i]+0.2, ratio[i]), va="bottom", ha="center", fontsize=9)
 
    ax.xaxis.set_tick_params(width=0, gridOn=True)
    ax.yaxis.set_tick_params(width=2, length=10, gridOn=True)
 
ax.set_xlabel("Principal Components")
ax.set_ylabel("% Variance Explained")


In [None]:
# Initiate PCA with n =150 and apply to population 
pca = PCA(n_components=150)
pca_pop = pca.fit_transform(df_scaled_pop)


In [None]:
# Display 10 records for the first Principle Component
pca_map = pd.DataFrame({'weight': pca.components_[0],'name': df_scaled_pop.columns})        
pca_map = pca_map.sort_values(by='weight', ascending=False)
pca_map.iloc[:10,:]

In [None]:
# Display 10 records for the second Principle Component
pca_map = pd.DataFrame({'weight': pca.components_[1],'name': df_scaled_pop.columns})        
pca_map = pca_map.sort_values(by='weight', ascending=False)
pca_map.iloc[:10,:]

In [None]:
# Display 10 records for the third Principle Component
pca_map = pd.DataFrame({'weight': pca.components_[2],'name': df_scaled_pop.columns})        
pca_map = pca_map.sort_values(by='weight', ascending=False)
pca_map.iloc[:10,:]

### Find the optimum number of clusters

In [None]:
k_scores = [] 
range_values =  range(1, 12)
for i in range_values:
  kmeans = KMeans(n_clusters = i)
  kmeans.fit(pca_pop)
  k_scores.append(kmeans.inertia_) 


In [None]:
# plot k_scores
plt.plot(k_scores, 'bx-')
plt.title('Finding right number of clusters')
plt.xlabel('Clusters')
plt.ylabel('scores') 
plt.show()

In [None]:
i = 0
for k in k_scores:
    print(k-i)
    i = k

### Apply Kmeans method on population and customer file

In [None]:
# apply Kmeans with n=5
kmeans = KMeans(5)


In [None]:
# fit clustering model to Population file
clustering_model = kmeans.fit(pca_pop)

In [None]:
# assign clusters to general population and customer
pca_cust = pca.fit_transform(df_scaled_cust)

cluster_pop = clustering_model.predict(pca_pop)
cluster_cust = clustering_model.predict(pca_cust)

# SUMMARY

## Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

In [None]:
def create_cluster_df(df):
    '''
    This functions create a dictionary with key is cluster label and value contains % of customers in each cluster
    INPUT: cluster array
    OUTPUT: dictionary contains % of customers in each cluster
    '''
    total_records = len(df)
    cluster_dict = {}
    
    unique, counts = np.unique(df, return_counts=True)

    d = dict(zip(unique, counts))   
    for index, value in d.items():
        cluster_dict[index+1] = round(value/total_records,5)
        
    return cluster_dict

In [None]:
# plot customers and general population clusters

pop_clusters = create_cluster_df(cluster_pop)
cust_clusters = create_cluster_df(cluster_cust)
plt.rcParams["figure.figsize"] = (15,8)
#plt.figure(figsize=(40,30)) 
fig, (ax1, ax2) = plt.subplots(1,2, sharey=True)

ax1.set_title('General Population Clusters')
ax2.set_title('Customer Clusters')
sns.barplot(x=list(pop_clusters.keys()), y=list(pop_clusters.values()), ax=ax1)
sns.barplot(x=list(cust_clusters.keys()), y=list(cust_clusters.values()), ax=ax2)
plt.show()

In [None]:
print("CLUSTER DISTRIBUTION - GENERAL POPULATION vs CUSTOMERS")
print('-------------------------------------------------------')
for i in range(1,6):
    print("Cluster: {} -  Population: {} - Customer: {}".format(i,pop_clusters[i],cust_clusters[i]))

## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

In [None]:
mailout_train = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')

In [None]:
mailout_train.head()

In [None]:
mailout_test = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TEST.csv', sep=';')

In [None]:
mailout_test.head()

In [None]:
# display response distribution

response = mailout_train['RESPONSE']
unique, counts = np.unique(response, return_counts=True)
plt.rcParams["figure.figsize"] = (10,5)

ax = sns.countplot(x=response.index, data=response)

ax.set_title('Response Distribution')
plt.show()

In [None]:
unique, counts = np.unique(response, return_counts=True)
print('Response: {}  Count: {}'.format(unique,counts))
print('Response: {}  Percent: {}'.format(unique,counts/len(mailout_train['RESPONSE'])))


In [None]:
# calculate response1:response0 ratio
response_0 = len(response) / (2 * counts[0])
response_1 = len(response) / (2 * counts[1])
weights = {0:response_0,1:response_1}  # this will be used as a parameter of the model creation
print(weights)

### Data Preparation

In [None]:
# this step remove columns with more than 70% NaN and fill NaN
train_df = prepare_data(mailout_train,70)
test_df = prepare_data(mailout_test,70)

In [None]:
train_Y = train_df['RESPONSE']
train_Y[0:5]
unique, counts = np.unique(train_Y, return_counts=True)
print(unique,counts)

In [None]:
train_df = train_df.drop(['RESPONSE'], axis=1)


In [None]:
#Split train df to  train_X, train_Y
train_X = scale_numeric_var(train_df)

In [None]:
# scale test data
test_X = scale_numeric_var(test_df)

### Build model using RandomForestClassifier

In [None]:
dict_evaluation = {}

In [None]:
from sklearn.ensemble import RandomForestClassifier
model_RFC = RandomForestClassifier(class_weight=weights)

In [None]:
# fit the model
model_RFC.fit(train_X, train_Y)


In [None]:
y_pred_rfc = model_RFC.predict(train_X)

In [None]:
# capture AUC Score
from sklearn.metrics import roc_auc_score
dict_evaluation["Random Forest Classification"] = roc_auc_score(train_Y,y_pred_rfc)

### Build model using Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
model_LR = LogisticRegression(solver='lbfgs', class_weight=weights)

In [None]:
model_LR.fit(train_X,train_Y)

In [None]:
y_pred_lr = model_LR.predict(train_X)

In [None]:
# capture AUC Score
dict_evaluation["Logistis Regression"] = roc_auc_score(train_Y,y_pred_lr)


### Build model using Support Vector Classification

In [None]:
from sklearn.svm import SVC
model_svc = SVC( class_weight=weights)

In [None]:
%time model_svc.fit(train_X,train_Y)

In [None]:
%time y_pred_svc = model_svc.predict(train_X)

In [None]:
# capture AUC Score
dict_evaluation["Support Vector Classification"] = roc_auc_score(train_Y,y_pred_svc)


### Build model using Decision Tree Classification

In [None]:
from sklearn.tree import DecisionTreeClassifier
model_dtc = DecisionTreeClassifier(random_state=0,class_weight=weights)

In [None]:
%time model_dtc.fit(train_X,train_Y)

In [None]:
%time y_pred_dtc = model_dtc.predict(train_X)

In [None]:
# capture AUC Score
dict_evaluation["Decision Tree Classification"] = roc_auc_score(train_Y,y_pred_svc)

## MODEL EVALUATION

4 algorithms have been used to build model for mailout_train. The one with highest AUC score will be deployed on mailout_test data

In [None]:
# print the report
print('---------- MODEL EVALUATION -------------')
for key,val in dict_evaluation.items():
    print('Model: {} - AUC Score: {} '.format(key,val))


<span style="color:blue">Model Evaluation Summary:</span><br><br>
From the model evaluation results above, I can see 'Support Vector Classification' and 'Decision Tree Classification have the same highest score. So I compare their runtimes to determine which one is WINNER<br><br>
Support Vector Classifiction:<br>
    * Fit:  CPU times: user 14min 9s, sys: 715 ms, total: 14min 10s 
    * Predict:  user 43.1 ms, sys: 39.9 ms, total: 83 ms
Decision Tree Classification:<br>
    * Fit:   CPU times: user 4.64 s, sys: 47.9 ms, total: 4.69 s
    * Predict:  CPU times: user 43.1 ms, sys: 39.9 ms, total: 83 ms
Support Vector Classification run 168 times longer than Decision Tree Classification (14mins vs 5 secs) <br>
Obviously, DECISION TREE CLASSIFICATION is winner, therefore will be deployed
    



## MODEL DEPLOYMENT

In [None]:
# predict response for mailout_test
response_pred  = model_dtc.predict(test_X)

In [None]:
# predict response probability for mailout_test
response_probas = model_dtc.predict_proba(test_X)
    

In [None]:
# merge model prediction results with mailout_test
predictions = pd.Series(data=response_pred, index=test_X.index, name='predicted_response')
probabilities = pd.DataFrame(data=response_probas, index=test_X.index, columns=['prob_0','RESPONSE'])

results_test = mailout_test.join(predictions, how='left')
results_test = results_test.join(probabilities, how='left')

In [None]:
results_test.head()

## Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link [here](http://www.kaggle.com/t/21e6d45d4c574c7fa2d868f0e8c83140), you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual became a customer – this might not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy does not seem to be an appropriate performance evaluation method. Instead, the competition will be using AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: only that the higher values try to capture as many of the actual customers as possible, early in the ROC curve sweep.

### PREPARE CSV FILE FOR KAGGLE COMPETITION

In [None]:
results_test['RESPONSE'].value_counts()

In [None]:
mailout_response = results_test[['LNR','RESPONSE']]

In [None]:
mailout_response.set_index('LNR', inplace = True)

In [None]:
# Check data in the file before submission
print('Check file dimension: ',mailout_response.shape)
print('Check the random 10 rows:', mailout_response[58:68])
unique, counts = np.unique(mailout_response, return_counts=True)
print('Check counts in each probability class')
for unique,counts in zip(unique,counts):
    print('probability: {} - count: {} '.format(unique,counts))

In [None]:
# Create CSV file for Kaggle Competition
mailout_response.to_csv('avrvato_response.csv')

In [None]:
# final check to ensure the codes are error free after executing the whole workbook
print("Congratulations! arvato_project_workbook.ipynb execution completed successfully ")