# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than following pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import learning_curve

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# newer releases provide SimpleImputer in sklearn.impute instead.
try:
    from sklearn.preprocessing import Imputer
except ImportError:
    from sklearn.impute import SimpleImputer

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

from scipy import stats
import os, math, time, random, datetime

# magic word for producing visualizations in notebook
%matplotlib inline



## Part 0: Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) is a top-level list of attributes and descriptions, organized by informational category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the `.csv` data files in this project that they're semicolon (`;`) delimited, so an additional argument in the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.
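The warning on load is most likely pandas' `DtypeWarning`, which is raised when a column mixes numbers and strings. Below is a minimal sketch of one way to handle it, assuming the affected columns are codes such as `CAMEO_DEUG_2015` and `CAMEO_INTL_2015` that mix digits with `X`/`XX` placeholders. The tiny inline CSV is made up for illustration and stands in for the real files.

```python
import io
import pandas as pd

# Made-up, semicolon-delimited sample standing in for the real data files.
sample_csv = io.StringIO(
    "LNR;CAMEO_DEUG_2015;CAMEO_INTL_2015\n"
    "1;8;51\n"
    "2;X;XX\n"
    "3;;24\n"
)

# Read the mixed columns as strings so pandas does not have to guess a type.
sample = pd.read_csv(
    sample_csv,
    sep=';',
    dtype={'CAMEO_DEUG_2015': str, 'CAMEO_INTL_2015': str},
)

# Convert to numbers afterwards; placeholders such as 'X' become NaN.
for col in ['CAMEO_DEUG_2015', 'CAMEO_INTL_2015']:
    sample[col] = pd.to_numeric(sample[col], errors='coerce')

print(sample)
```

Turning the placeholders into `NaN` first keeps the later "fill unknown values" step in one place, rather than mixing string replacement and null handling.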

In [None]:
# load in the data
azdias = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_AZDIAS_052018.csv', sep=';')
customers = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_CUSTOMERS_052018.csv', sep=';')

In [None]:
azdias_df = azdias.copy()

In [None]:
customers_df = customers.copy()

In [None]:
len(azdias_df)

In [None]:
len(customers_df)

## Section 2: Analysis
### Data Exploration & Visualisation

#### Exploring AZDIAS dataset

In [None]:
#Looking at descriptive stats on the Azdias dataset
azdias_df.describe()

In [None]:
#checking nulls in AZDIAS
azdiasdf_null = azdias_df.isnull().sum()

In [None]:
#checking % of missing attributes in each column
azdiasdf_null_percent = azdiasdf_null / len(azdias_df) * 100

In [None]:
# visualise missing data
(azdiasdf_null.sort_values(ascending=False)[:50].plot(title = 'AZDIAS columns with missing data ranked', kind='bar', figsize=(20,8), fontsize=13))  

In [None]:
# getting distribution of empty data in fields by percentage
plt.figure(figsize=(10,5))
plt.hist(azdiasdf_null_percent, bins = np.linspace(10,100,19), facecolor='g', alpha=0.75)


plt.xlabel('% of missing value')
plt.ylabel('# of Columns')
plt.title('Missing data distribution in each AZDIAS column')
plt.grid(True)

plt.show()

# % of missing data in columns based on dataframe created earlier
print('% of missing data in AZDIAS columns','\n',azdiasdf_null_percent.sort_values(ascending=False))

#### Exploring CUSTOMERS dataset

In [None]:
#Looking at descriptive stats on the Customers dataset
customers_df.describe()

In [None]:
#checking nulls in Customers dataset
customersdf_null = customers_df.isnull().sum()

In [None]:
#checking % of missing attributes in each column
customersdf_null_percent = customersdf_null / len(customers_df) * 100

In [None]:
# visualise missing data
(customersdf_null.sort_values(ascending=False)[:50].plot(title = 'CUSTOMER columns with missing data ranked', kind='bar', figsize=(20,8), fontsize=12))  

In [None]:
# getting distribution of empty data in fields by percentage
plt.figure(figsize=(10,5))
plt.hist(customersdf_null_percent, bins = np.linspace(10,100,19), facecolor='g', alpha=0.75)


plt.xlabel('% of missing value')
plt.ylabel('# of Columns')
plt.title('Missing data distribution in each CUSTOMERS column')
plt.grid(True)

plt.show()


# % of missing data in columns based on dataframe created earlier
print('% of missing data in columns','\n',customersdf_null_percent.sort_values(ascending=False))
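The two exploration passes above repeat the same steps for AZDIAS and CUSTOMERS. A small helper could compute the missing-value counts and percentages for any dataframe in one call. This is only a sketch: `missing_summary` is a hypothetical name and the toy frame is invented.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, most missing first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': counts / len(df) * 100,
    })
    return summary.sort_values('percent', ascending=False)

# Invented toy data: 'a' is half empty, 'c' has one gap, 'b' is complete.
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, 4.0],
    'b': [1.0, 2.0, 3.0, 4.0],
    'c': [np.nan, 2.0, 3.0, 4.0],
})
summary = missing_summary(toy)
print(summary)
```

The same call would then work unchanged on `azdias_df` and `customers_df`.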

## Section 3: Methodology

### Data Preprocessing

In [None]:
# Dropping the columns from Customers that are not in Azdias

customers_df.drop(columns=['CUSTOMER_GROUP', 'ONLINE_PURCHASE', 'PRODUCT_GROUP'], inplace=True)

#### Removing columns with high null values

In [None]:
# Dropping the columns in AZDIAS that have over 65% of nulls in them 
az_column_nans = azdias_df.isnull().mean()
drop_cols = azdias_df.columns[az_column_nans > 0.65]
print('columns to drop: ', drop_cols)

In [None]:
# % of missing data in columns
print('% of missing data in columns','\n',azdiasdf_null_percent.sort_values(ascending=False))

In [None]:
print('# of columns in azdias before dropping: ', len(azdias_df.columns))
azdias = azdias_df.drop(drop_cols,axis=1)
print('# of columns in azdias after dropping: ', len(azdias.columns))

print('# of columns in customers before dropping: ', len(customers_df.columns))
customers = customers_df.drop(drop_cols,axis=1)
print('# of columns in customers after dropping: ', len(customers.columns))

In [None]:
#checking shape of each df
print(azdias.shape)
print(customers.shape)

#### Removing highly correlated columns, which would make the model overly sensitive, based on a correlation matrix

In [None]:
## AZDIAS df - correlation matrix

corr_matrix = azdias.select_dtypes(include='number').corr().abs()
sns.heatmap(corr_matrix)
plt.show()

In [None]:
# keep only the upper triangle of the correlation matrix so each pair is checked once
upper_limit = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
upper_limit

In [None]:
# identify columns to drop: those whose correlation with an earlier column exceeds 0.75
drop_columns = [column for column in upper_limit.columns if any(upper_limit[column] > .75)]

In [None]:
# drop columns from azdias
azdias = azdias.drop(drop_columns, axis=1)
len(azdias.columns)
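The upper-triangle technique used above can be hard to read inline. The sketch below wraps it in a function and runs it on invented toy data; `correlated_columns` is a hypothetical helper name, and the 0.75 threshold is the one used in this notebook.

```python
import numpy as np
import pandas as pd

def correlated_columns(df, threshold=0.75):
    """List numeric columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.select_dtypes(include='number').corr().abs()
    # Keep only entries above the diagonal so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Invented toy data: 'x_copy' is a linear function of 'x', 'noise' is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'x': x,
    'x_copy': 2 * x + 1,
    'noise': rng.normal(size=200),
})
to_drop = correlated_columns(toy)
print(to_drop)
```

Because only the upper triangle is inspected, the first column of each correlated pair is kept and the later one is dropped.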

In [None]:
## CUSTOMERS df - correlation matrix

cus_corr_matrix = customers.select_dtypes(include='number').corr().abs()
sns.heatmap(cus_corr_matrix)
plt.show()

In [None]:
# keep only the upper triangle of the CUSTOMERS correlation matrix
cus_upper_limit = cus_corr_matrix.where(np.triu(np.ones(cus_corr_matrix.shape), k=1).astype(bool))

In [None]:
# identify columns to drop based on threshold limit
cus_drop_columns = [column for column in cus_upper_limit.columns if any(cus_upper_limit[column] > .75)]

In [None]:
# drop columns from customers
customers = customers.drop(cus_drop_columns, axis=1)
len(customers.columns)

In [None]:
azdias.shape

In [None]:
customers.shape

#### Checking columns for removal if they have too many unique values

In [None]:
#checking unique values of 'EINGEFUEGT_AM'
len(azdias['EINGEFUEGT_AM'].unique())

In [None]:
# Attribute "EINGEFUEGT_AM" has too many unique values so it needs to be dropped

azdias = azdias.drop(['EINGEFUEGT_AM'],axis=1)
customers = customers.drop(['EINGEFUEGT_AM'],axis=1)

In [None]:
#checking unique values of 'D19_LETZTER_KAUF_BRANCHE'
len(azdias['D19_LETZTER_KAUF_BRANCHE'].unique())

In [None]:
azdias = azdias.drop(['D19_LETZTER_KAUF_BRANCHE'],axis=1)
customers = customers.drop(['D19_LETZTER_KAUF_BRANCHE'],axis=1)

In [None]:
len(azdias.columns)

In [None]:
len(customers.columns)

#### Imputing Null values

In [None]:
# Identifying categorical fields 

cols = azdias.columns
num_cols = azdias._get_numeric_data().columns
print('num_cols: ',num_cols)
print('categorical: ',list(set(cols) - set(num_cols)))

In [None]:
# Missing values are filled with -1 showing unknown

azdias[['CAMEO_DEUG_2015','CAMEO_INTL_2015']] = azdias[['CAMEO_DEUG_2015','CAMEO_INTL_2015']].replace(['X','XX'],-1)
customers[['CAMEO_DEUG_2015','CAMEO_INTL_2015']] = customers[['CAMEO_DEUG_2015','CAMEO_INTL_2015']].replace(['X','XX'],-1)
azdias[['CAMEO_DEUG_2015','CAMEO_INTL_2015']] = azdias[['CAMEO_DEUG_2015','CAMEO_INTL_2015']].fillna(-1)
customers[['CAMEO_DEUG_2015','CAMEO_INTL_2015']] = customers[['CAMEO_DEUG_2015','CAMEO_INTL_2015']].fillna(-1)
azdias[['CAMEO_DEUG_2015','CAMEO_INTL_2015']] = azdias[['CAMEO_DEUG_2015','CAMEO_INTL_2015']].astype(int)
customers[['CAMEO_DEUG_2015','CAMEO_INTL_2015']] = customers[['CAMEO_DEUG_2015','CAMEO_INTL_2015']].astype(int)
azdias[['CAMEO_DEU_2015','OST_WEST_KZ']]=azdias[['CAMEO_DEU_2015','OST_WEST_KZ']].fillna(-1)
customers[['CAMEO_DEU_2015','OST_WEST_KZ']]=customers[['CAMEO_DEU_2015','OST_WEST_KZ']].fillna(-1)

In [None]:
azdias.isnull().sum()

In [None]:
customers.isnull().sum()

In [None]:
# fillna with 9 for fields that have 9 marked as unknown

azdias[azdias.columns[(azdias==9).any()]] = azdias[azdias.columns[(azdias==9).any()]].fillna(9)
customers[customers.columns[(customers==9).any()]] = customers[customers.columns[(customers==9).any()]].fillna(9)

In [None]:
# fillna with 0 for fields that have 0 marked as unknown

azdias[azdias.columns[(azdias==0).any()]] = azdias[azdias.columns[(azdias==0).any()]].fillna(0)
customers[customers.columns[(customers==0).any()]] = customers[customers.columns[(customers==0).any()]].fillna(0)

In [None]:
# fillna with -1 for fields that have -1 marked as unknown

azdias[azdias.columns[(azdias==-1).any()]] = azdias[azdias.columns[(azdias==-1).any()]].fillna(-1)
customers[customers.columns[(customers==-1).any()]] = customers[customers.columns[(customers==-1).any()]].fillna(-1)
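Note that a test such as `(azdias==9).any()` selects every column in which the value 9 appears at all, including columns where 9 may be a regular category. A more explicit alternative is to fill each column with its own documented "unknown" code. The sketch below assumes such a mapping has been read from the `DIAS Attributes - Values 2017.xlsx` spreadsheet; the column names and codes shown here are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical columns and "unknown" codes, invented for this example.
unknown_codes = {'feature_a': 9, 'feature_b': -1, 'feature_c': 0}

toy = pd.DataFrame({
    'feature_a': [1.0, np.nan, 9.0],
    'feature_b': [np.nan, 2.0, -1.0],
    'feature_c': [3.0, 0.0, np.nan],
    'feature_d': [np.nan, 5.0, 6.0],   # no documented code, so left untouched
})

# fillna accepts a dict and fills each listed column with its own value.
toy_filled = toy.fillna(value=unknown_codes)
print(toy_filled)
```

Columns without a documented code keep their `NaN` values and can be handled by the mode imputation step later on.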

#### Encoding Features

In [None]:
azdias = pd.get_dummies(azdias)
customers = pd.get_dummies(customers)

In [None]:
azdias.shape

In [None]:
customers.shape

In [None]:
azdias_columns = azdias.columns
customers_columns = customers.columns

#### Imputing nans with mode

In [None]:
# Replace the remaining missing values with the most frequent value in each column.
# This strategy works with both string and numeric data.
# If several values tie for most frequent, only the smallest is used.
from sklearn.impute import SimpleImputer  # successor to the removed sklearn.preprocessing.Imputer

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

In [None]:
azdias = imputer.fit_transform(azdias)
azdias = pd.DataFrame(azdias)
azdias.head(5)

In [None]:
customers = imputer.fit_transform(customers)
customers = pd.DataFrame(customers)
customers.head(5)
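The imputer returns a plain NumPy array, so wrapping it with `pd.DataFrame(...)` alone discards the column names. The sketch below, on invented toy data, shows mode imputation with `SimpleImputer` while keeping the column labels and index, and illustrates the tie rule mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented toy data: in 'a' the values 1 and 2 tie, in 'b' the mode is 4.
toy = pd.DataFrame({
    'a': [1.0, 1.0, 2.0, 2.0, np.nan],
    'b': [9.0, np.nan, 4.0, 4.0, 4.0],
})

imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Re-attach the original labels so later steps can still refer to columns by name.
filled = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)
print(filled)
```

Keeping the labels here would avoid having to store the column lists separately and re-apply them after scaling.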

In [None]:
azdias.shape

In [None]:
customers.shape

In [None]:
# convert to int
azdias = azdias.astype(int)
customers = customers.astype(int)

#### Removing outliers

In [None]:
#removing all rows that have a value in a column that is 3 standard deviations away from the mean

azdias = azdias[(np.abs(stats.zscore(azdias)) < 3).all(axis=1)]

In [None]:
customers = customers[(np.abs(stats.zscore(customers)) < 3).all(axis=1)]
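One caveat with this filter: a column with zero variance yields `NaN` z-scores, and `NaN < 3` evaluates to `False`, so a single constant column can remove every row. The sketch below is one possible guard, not the method used in this notebook. `drop_outlier_rows` is a hypothetical helper that skips constant columns when computing z-scores, and the toy data is invented.

```python
import numpy as np
import pandas as pd

def drop_outlier_rows(df, threshold=3.0):
    """Drop rows with any value at least `threshold` standard deviations from its column mean."""
    std = df.std(ddof=0)               # population std, matching scipy.stats.zscore's default
    varying = std[std > 0].index       # constant columns cannot contain outliers
    z = (df[varying] - df[varying].mean()) / std[varying]
    return df[(z.abs() < threshold).all(axis=1)]

# Invented toy data: one extreme value plus a constant column.
toy = pd.DataFrame({
    'value': [0.0] * 20 + [100.0],
    'constant': [1.0] * 21,
})
kept = drop_outlier_rows(toy)
print(kept.shape)
```

Only the row holding the extreme value is removed; the constant column no longer affects the result.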

In [None]:
azdias.shape

In [None]:
customers.shape

#### Standardizing & scaling the data

In [None]:
sc_A = StandardScaler(copy=False)
az_scaled = sc_A.fit_transform(azdias)

In [None]:
az_df = pd.DataFrame(az_scaled,columns= azdias_columns)

In [None]:
az_df.shape

In [None]:
az_df = az_df.set_index('LNR')

In [None]:
az_df.to_pickle('azdias_scaled')

In [None]:
cus_scaled = sc_A.fit_transform(customers)

In [None]:
cus_df = pd.DataFrame(cus_scaled,columns= customers_columns)

In [None]:
cus_df = cus_df.set_index('LNR')

In [None]:
cus_df.shape

In [None]:
cus_df.to_pickle('customers_scaled')

In [None]:
az_df.head(2)

In [None]:
cus_df.head(2)

#### Based on the steps above, a preprocessing function is created so the cleaning can be rerun easily once the data is loaded

In [None]:
def data_preprocess_2(df, for_clustering, df_name=None):    
    """
    Runs the pre-processing steps tested above and returns cleaned data ready for clustering or modeling.
    
    INPUT:
    - df: the dataframe to clean (AZDIAS, CUSTOMERS or MAILOUT)
    - for_clustering: if True, also apply the clustering-only steps (row filtering and outlier removal)
    - df_name: 'azdias' or 'customers'; selects the dataset-specific step when for_clustering is True
    
    OUTPUT:
    A dataframe indexed by LNR with:
    - Low-value & highly correlated columns dropped
    - Nulls and missing values imputed
    - Features encoded & scaled
    """
    
    if for_clustering:
        if df_name == 'azdias':
            # Keep only rows with at most 27 missing values (about 73% of rows)
            df = df[df.isnull().sum(axis=1) <= 27].reset_index(drop=True)
        elif df_name == 'customers':            
            #dropped these columns as they weren't in AZDIAS
            df.drop(columns=['CUSTOMER_GROUP', 'ONLINE_PURCHASE', 'PRODUCT_GROUP'], inplace=True)
        
    #Dropping cols that have a high % of missing data 
    drop_cols = ['ALTER_KIND1', 'ALTER_KIND2', 'ALTER_KIND3', 'ALTER_KIND4', 'EXTSEL992','KK_KUNDENTYP']
    
    df = df.drop(drop_cols,axis=1)
    
    #Dropping cols that have too many unique values 
    df = df.drop(['EINGEFUEGT_AM'],axis=1)
    df = df.drop(['D19_LETZTER_KAUF_BRANCHE'],axis=1)


    # Correlation matrix (numeric columns only) to find highly correlated columns
    corr_matrix = df.select_dtypes(include='number').corr().abs()
    upper_limit = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    # Identify columns whose correlation with an earlier column exceeds 0.75
    drop_cols = [column for column in upper_limit.columns if any(upper_limit[column] > .75)]
    # Dropping cols
    df = df.drop(drop_cols, axis=1)
    print('shape after corr', df.shape)


    # Missing Values filled with -1 indicating unknown 
    df[['CAMEO_DEUG_2015','CAMEO_INTL_2015']] = df[['CAMEO_DEUG_2015','CAMEO_INTL_2015']].replace(['X','XX'],-1)
    df[['CAMEO_DEUG_2015','CAMEO_INTL_2015']] = df[['CAMEO_DEUG_2015','CAMEO_INTL_2015']].fillna(-1)
    df[['CAMEO_DEUG_2015','CAMEO_INTL_2015']] = df[['CAMEO_DEUG_2015','CAMEO_INTL_2015']].astype(int)
    df[['CAMEO_DEU_2015','OST_WEST_KZ']]=df[['CAMEO_DEU_2015','OST_WEST_KZ']].fillna(-1)

    # fillna with 9 for fields that has 9 marked as unknown
    df[df.columns[(df==9).any()]] = df[df.columns[(df==9).any()]].fillna(9)

    # fillna with 0 for fields that has 0 marked as unknown
    df[df.columns[(df==0).any()]] = df[df.columns[(df==0).any()]].fillna(0)

    # fillna with -1 for fields that have -1 marked as unknown
    df[df.columns[(df==-1).any()]] = df[df.columns[(df==-1).any()]].fillna(-1)

     

    #Encoding data via one hot encoding - required for model training and testing
    df = pd.get_dummies(df)
    print('shape after one-hot', df.shape)
    
    df_cols = list(df.columns.values)

    # Impute remaining NaNs with each column's most frequent value; this works for
    # string or numeric data, and ties resolve to the smallest value
    from sklearn.impute import SimpleImputer  # successor to the removed Imputer class
    imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
    df = imputer.fit_transform(df)
    df = pd.DataFrame(df)
    print('shape after impute', df.shape)
    
    # Convert to int
    df = df.astype(int)

    # Removing all rows that have a value in a column that is 3 standard deviations away from the mean
    if for_clustering:
        print('inside outliers if')
        df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)] 
        print('shape before scaling', df.shape)
        
    # Standardizing and scaling the data 
    scale = StandardScaler(copy=False)
    scaled = scale.fit_transform(df)
    df = pd.DataFrame(scaled,columns= df_cols)
    print('shape after scaling', df.shape)
        
    df = df.set_index('LNR')
    return df

In [None]:
#Applying preprocess on AZDIAS
azdias = data_preprocess_2(azdias, True, 'azdias')
print(azdias.shape)
print(azdias.head(5))

In [None]:
#Applying preprocess on CUSTOMERS
customers = data_preprocess_2(customers, True, 'customers')
print(customers.shape)
print(customers.head(5))

## Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

### Principal Component Analysis (PCA): the data has a large number of dimensions, so PCA is used to reduce them.

#### Using a scree plot to show the cumulative variance explained by the principal components.

In [None]:
pca = PCA().fit(azdias)
plt.figure(figsize=(15,10))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.title('Explained Variance Ratio - AZDIAS')
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

In [None]:
# Based on the scree plot, 200 components is a reasonable number to keep
def reduce_dims(df,n=200):
    pca = PCA(n_components=n).fit(df)
    reduced_dims = pca.transform(df)
    reduced_dims = pd.DataFrame(reduced_dims)
    print(pca.explained_variance_ratio_.sum())
    return reduced_dims


In [None]:
#Displays the variance in the data explained by the principal components after running PCA
reduced_azdias = reduce_dims(azdias)
reduced_azdias

In [None]:
reduced_customers = reduce_dims(customers)
reduced_customers
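Calling `reduce_dims` separately on each dataset fits two independent PCA models, so component 1 of the population is not necessarily the same direction as component 1 of the customers. An alternative is to fit PCA once on the general population and reuse it for the customers. The sketch below is an alternative under the assumption that both frames share the same columns in the same order; the data is randomly generated.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Randomly generated stand-ins for the preprocessed AZDIAS and CUSTOMERS frames.
rng = np.random.default_rng(42)
population_toy = pd.DataFrame(rng.normal(size=(500, 10)))
customers_toy = pd.DataFrame(rng.normal(loc=0.5, size=(120, 10)))

# Fit the projection once, on the general population only.
pca = PCA(n_components=4).fit(population_toy)

# Apply the same fitted projection to both datasets.
reduced_population_toy = pd.DataFrame(pca.transform(population_toy))
reduced_customers_toy = pd.DataFrame(pca.transform(customers_toy))

print(reduced_population_toy.shape, reduced_customers_toy.shape)
```

With a shared projection, distances and cluster assignments computed later refer to the same feature space for both groups.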

In [None]:
reduced_azdias.shape

In [None]:
reduced_customers.shape

### K-Means clustering is used to describe the required relationship, with the elbow method used to choose the number of clusters

In [None]:
sse = []
list_k = list(range(1, 25))

for k in list_k:
    km = KMeans(n_clusters=k, init='k-means++')
    km.fit(reduced_azdias.sample(20000))
    sse.append(km.inertia_)

In [None]:
# Plot sse against k
plt.figure(figsize=(20, 10))
plt.plot(list_k, sse, linestyle='-', marker='x', color='navy')
plt.title('k-means Clustering Elbow Plot')
plt.xlabel("Number of Clusters")
plt.ylabel("Within-cluster sum of squared distances (inertia)")
plt.xticks(list(range(1,25)));

#### The elbow plot shows the within-cluster distance flattening out near 12 clusters, so 12 clusters will be used

In [None]:
n_clusters = 12
kmeans = KMeans(n_clusters=n_clusters)

In [None]:
azdias_preds = kmeans.fit_predict(reduced_azdias)

In [None]:
azdias_clustered = pd.DataFrame(azdias_preds, columns = ['Cluster'])

azdias_clustered.to_pickle('azdias_clustered.pkl')

In [None]:
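# Assign customers to the clusters learned from the general population;
# calling fit_predict here would refit the model and make the cluster labels incomparable
customers_preds = kmeans.predict(reduced_customers)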

In [None]:
customers_clustered = pd.DataFrame(customers_preds, columns = ['Cluster'])

customers_clustered.to_pickle('customers_clustered.pkl')

### Comparing the clusters

In [None]:
# Count number in each population segment - AZDIAS
population_clusters = pd.Series(azdias_preds)
pc = population_clusters.value_counts().sort_index()

In [None]:
# Count number of predictions for each customer segment - CUSTOMERS
customer_clusters = pd.Series(customers_preds)
cc = customer_clusters.value_counts().sort_index()

In [None]:
# Create dataframe from population and customer segments

summary_df = pd.concat([pc, cc], axis=1).reset_index()

summary_df.columns = ['Cluster Number','General Population','Customer']

In [None]:
#adding percentage columns

summary_df['% of Total Pop'] = ( summary_df['General Population'] / (summary_df['General Population'].sum()) * 100 ).round(2)

summary_df['% of Total Customer'] = ( summary_df['Customer'] / (summary_df['Customer'].sum()) * 100 ).round(2)

In [None]:
fig = plt.figure(figsize=(20,10))

ax = fig.add_subplot(111)

ax = summary_df['% of Total Pop'].plot(x=summary_df['Cluster Number'], width=-0.3, align='edge', color='navy', kind='bar', position=0)
ax = summary_df['% of Total Customer'].plot(kind='bar', color='green', width = 0.3, align='edge', position=1)

ax.set_xlabel('Clusters', fontsize=15) 
ax.set_ylabel('% Ratio between Population & Customers', fontsize=15)

ax.set_xticklabels(summary_df['Cluster Number'])
ax.tick_params(axis = 'x', which = 'major', labelsize = 13)
ax.margins(x=0.5,y=0.1)

plt.legend(('General Population (AZDIAS)', 'Customer (CUSTOMERS)'), fontsize=15)
plt.title(('% of General Population & Customer in Each Cluster'))

plt.show()

#### The bar chart above shows how the clusters are distributed across the two datasets.

#### The population clusters differ only slightly in size, whereas the customer clusters are strongly imbalanced: Cluster 6 is over-represented and Cluster 9 is under-represented among customers.
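Over- and under-representation can also be expressed as a number: the customer share of a cluster divided by its population share. The sketch below uses invented two-dimensional data and assumes the clustering is fitted on the population and only applied to the customers. A ratio above 1 marks a cluster that is over-represented among customers.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Invented data: three well-separated groups of equal size in the population...
centers = np.array([[0, 0], [10, 10], [-10, 10]])
population_toy = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])
# ...and customers drawn mostly from the second group, none from the third.
customers_toy = np.vstack([
    centers[1] + rng.normal(size=(80, 2)),
    centers[0] + rng.normal(size=(20, 2)),
])

# Fit on the population, then assign customers to those same clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(population_toy)
pop_share = pd.Series(km.labels_).value_counts(normalize=True)
cust_share = pd.Series(km.predict(customers_toy)).value_counts(normalize=True)

# Clusters without any customers get a share of 0 instead of NaN.
comparison = pd.DataFrame({'population': pop_share, 'customers': cust_share}).fillna(0)
comparison['ratio'] = comparison['customers'] / comparison['population']
print(comparison.sort_values('ratio', ascending=False))
```

Sorting by the ratio gives a ranked list of segments to target or avoid, which is easier to act on than reading bar heights.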

## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

In [None]:
mailout_train = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')

In [None]:
X = mailout_train.drop('RESPONSE',axis=1)
y = mailout_train['RESPONSE']

In [None]:
# preprocess data
df_mailout_train  = data_preprocess_2(X, False)

In [None]:
df_mailout_train.shape

In [None]:
y.shape

In [None]:
df_mailout_train.head(5)

In [None]:
models = {'RandomForestClassifier': RandomForestClassifier(), 
          'GradientBoostingClassifier': GradientBoostingClassifier()
         }

In [None]:
def randomize(df):
    """
    Returns randomized X and y.
    
    Input: DataFrame
    Output: randomized X and y
    """
    
    df_randomized = df.sample(frac=1) #frac = 1 will take a random sample of the whole df
    y_rand = df_randomized['RESPONSE']
    X_rand = df_randomized.drop(['RESPONSE'],axis=1)
    return X_rand, y_rand

In [None]:
def vis_learning_curves(X, y, estimator, num_trainings):
    """
    Visualise a learning curve that shows the validation and training auc_score of an estimator 
    depending on the number of training samples.
    
    Input:
        X: array 
        y: array 
        estimator: object that implements the “fit” and “predict” methods
        num_trainings (int): number of training samples to plot
        
    Output:
        None
    """
    
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, scoring = 'roc_auc', train_sizes=np.linspace(.1, 1.0, num_trainings))

    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    print("AUC train score = {}".format(train_scores_mean[-1].round(2)))
    print("AUC validation score = {}".format(test_scores_mean[-1].round(2)))
    plt.grid()

    plt.title("Learning Curve")
    plt.xlabel("% of training set")
    plt.ylabel("Score")
    
    plt.plot(np.linspace(.1, 1.0, num_trainings)*100, train_scores_mean, 'o-', color="navy",
             label="Training score")
    plt.plot(np.linspace(.1, 1.0, num_trainings)*100, test_scores_mean, 'o-', color="orange",
             label="Cross-validation score")

    plt.yticks(np.arange(0.45, 1.02, 0.05))
    plt.xticks(np.arange(0., 100.05, 10))
    plt.legend(loc="best")
    print("")
    plt.show()

In [None]:
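# `mailout_train_clean` and `column_transformer` are not defined earlier in this notebook.
# Assumed definitions: the preprocessed features plus RESPONSE (row order is unchanged when
# for_clustering=False) and a pass-through transformer, since the data is already scaled.
mailout_train_clean = df_mailout_train.copy()
mailout_train_clean['RESPONSE'] = y.values
column_transformer = ColumnTransformer([('keep', 'passthrough', list(df_mailout_train.columns))])

for model_key in models.keys():
    print(model_key)
    ml_pipeline = Pipeline([
        ('transform', column_transformer),
        ('model', models[model_key])
    ])
    X, y = randomize(mailout_train_clean)
    vis_learning_curves(X, y, ml_pipeline, 10)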

#### The Random Forest Classifier shows a low validation score and a high training score, implying that the model overfits the training data (high variance).
#### The Gradient Boosting Classifier shows decreasing model bias as the sample size increases. Based on this and its high validation score, the GBC will be the estimator for the model.

In [None]:
#initialize with GradientBoostingClassifier
gbc_pipeline = Pipeline([
    ('transform', column_transformer),
    ('gbc', GradientBoostingClassifier(random_state=42))
])

parameters = {'gbc__learning_rate': [0.1], 'gbc__n_estimators': [100],
             'gbc__max_depth': [5], 'gbc__min_samples_split': [2]}        
        
grid_gbc = GridSearchCV(gbc_pipeline, parameters, scoring = 'roc_auc', verbose=50)

# Fit the grid search object to the training data and find the optimal parameters
grid_gbc.fit(X, y)

In [None]:
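# Get the best estimator and score it
print(grid_gbc.best_params_)
best_clf = grid_gbc.best_estimator_
# Note: these predictions are made on the same data the model was trained on
best_predictions = best_clf.predict_proba(X)[:, 1]

print("ROC AUC score (training data): {}".format(roc_auc_score(y, best_predictions)))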
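A score computed on the training data tends to be optimistic. The cross-validated score that `GridSearchCV` stores in `best_score_` is usually a more realistic estimate. The sketch below compares the two on a synthetic, imbalanced dataset; the data, the parameter grid and the use of `StratifiedKFold` are assumptions made for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the MAILOUT data: roughly 10% positive responses.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Stratified folds keep the share of positives similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    {'n_estimators': [50], 'max_depth': [3]},
    scoring='roc_auc',
    cv=cv,
)
search.fit(X_toy, y_toy)

# AUC on the data the model has already seen versus the mean AUC on held-out folds.
train_auc = roc_auc_score(y_toy, search.predict_proba(X_toy)[:, 1])
cv_auc = search.best_score_
print("training AUC: {:.3f}, cross-validated AUC: {:.3f}".format(train_auc, cv_auc))
```

In this notebook the equivalent value would be `grid_gbc.best_score_`, which is closer to the score to expect on the withheld "TEST" partition than the training AUC.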

In [None]:
feature_importances = best_clf.named_steps['gbc'].feature_importances_
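# `column_names` is not defined in this notebook; this assumes the transform step keeps the columns of X in order
pd.Series(feature_importances, index=X.columns).sort_values()[-20:].plot(kind='barh', figsize=(20,10))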
plt.xlabel('feature importance')