# Capstone Project: Create a Customer Segmentation Report for Arvato Financial Services

In this project, you will analyze demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. You'll use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, you'll apply what you've learned on a third dataset with demographics information for targets of a marketing campaign for the company, and use a model to predict which individuals are most likely to convert into becoming customers for the company. The data that you will use has been provided by our partners at Bertelsmann Arvato Analytics, and represents a real-life data science task.

If you completed the first term of this program, you will be familiar with the first part of this project, from the unsupervised learning project. The versions of those two datasets used in this project include many more features and have not been pre-cleaned. You are also free to choose whatever approach you'd like for analyzing the data rather than follow pre-determined steps. In your work on this project, make sure that you carefully document your steps and decisions, since your main deliverable for this project will be a blog post reporting your findings.

### Importing Libraries

In [1]:
# import libraries here; add more as necessary
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# import to help me plot my Venn diagrams
from matplotlib_venn import venn2
from pylab import rcParams

# import the util.py file where I define my functions
from utils import *

# sklearn
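from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, OneHotEncoder
# SimpleImputer replaces preprocessing.Imputer, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer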
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import confusion_matrix,precision_recall_fscore_support
from sklearn.utils.multiclass import unique_labels


# magic word for producing visualizations in notebook
%matplotlib inline

## Part 0: Get to Know the Data

There are four data files associated with this project:

- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 221 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person, but also includes information outside of individuals, including information about their household, building, and neighborhood. Use the information from the first two files to figure out how customers ("CUSTOMERS") are similar to or differ from the general population at large ("AZDIAS"), then use your analysis to make predictions on the other two files ("MAILOUT"), predicting which recipients are most likely to become a customer for the mail-order company.

The "CUSTOMERS" file contains three extra columns ('CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'), which provide broad information about the customers depicted in the file. The original "MAILOUT" file included one additional column, "RESPONSE", which indicated whether or not each recipient became a customer of the company. For the "TRAIN" subset, this column has been retained, but in the "TEST" subset it has been removed; it is against that withheld column that your final predictions will be assessed in the Kaggle competition.

Otherwise, all of the remaining columns are the same between the three data files. For more information about the columns depicted in the files, you can refer to two Excel spreadsheets provided in the workspace. [One of them](./DIAS Information Levels - Attributes 2017.xlsx) is a top-level list of attributes and descriptions, organized by informational category. [The other](./DIAS Attributes - Values 2017.xlsx) is a detailed mapping of data values for each feature in alphabetical order.

In the below cell, we've provided some initial code to load in the first two datasets. Note for all of the `.csv` data files in this project that they're semicolon (`;`) delimited, so an additional argument in the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) call has been included to read in the data properly. Also, considering the size of the datasets, it may take some time for them to load completely.

You'll notice when the data is loaded in that a warning message will immediately pop up. Before you really start digging into the modeling and analysis, you're going to need to perform some cleaning. Take some time to browse the structure of the data and look over the informational spreadsheets to understand the data values. Make some decisions on which features to keep, which features to drop, and if any revisions need to be made on data formats. It'll be a good idea to create a function with pre-processing steps, since you'll need to clean all of the datasets before you work with them.
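As a minimal sketch of the loading step described above, the snippet below reads a semicolon-delimited file with `sep=';'` and passes `low_memory=False`, which is the option the pandas warning itself suggests. The tiny in-memory CSV and its values are invented for illustration; only the column names come from the project.

```python
import io

import pandas as pd

# Toy stand-in for one of the project's CSV files (values are invented).
raw = io.StringIO(
    "LNR;CAMEO_DEUG_2015;CAMEO_INTL_2015\n"
    "1;8;51\n"
    "2;X;XX\n"
    "3;4.0;24\n"
)

# sep=';' handles the delimiter; low_memory=False lets pandas infer each
# column's dtype from the whole file instead of chunk by chunk.
demo = pd.read_csv(raw, sep=";", low_memory=False)
print(demo.dtypes)
```

The two CAMEO columns come back as non-numeric because of the placeholder strings, which is why they need cleaning before modeling.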

In [None]:
# load in the data
'''
There are 2 warnings when we read in the datasets:
DtypeWarning: Columns (19,20) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)

This warning happens when pandas has to guess the datatype of columns that hold mixed types. I will address this in
the pre-processing steps
'''
azdias = pd.read_csv(r"C:\Users\sousa\Desktop\github\Arvato\data\azdias.csv")
customers = pd.read_csv(r"C:\Users\sousa\Desktop\github\Arvato\data\customers.csv")
attributes = pd.read_csv(r"C:\Users\sousa\Desktop\github\Arvato\data\features.csv")

In [None]:
# I will now check what is the problem with the columns 19 and 20
# getting the name of these columns
print(azdias.iloc[:,19:21].columns)
print(customers.iloc[:,19:21].columns)

In [None]:
# checking the unique values in these columns for possible issues
print(azdias.CAMEO_DEUG_2015.unique())
print(azdias.CAMEO_INTL_2015.unique())
print(customers.CAMEO_DEUG_2015.unique())
print(customers.CAMEO_INTL_2015.unique())

It seems the mixed-type issue comes from the `X` placeholder that appears in these columns (`XX` in `CAMEO_INTL_2015`).
There are ints, floats and strings all in the mix.

In [None]:
cols = ['CAMEO_DEUG_2015', 'CAMEO_INTL_2015']
azdias = mixed_type_fixer(azdias, cols)
customers = mixed_type_fixer(customers, cols)
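`mixed_type_fixer` is defined in `utils.py` and is not shown in this notebook. A minimal sketch of what such a helper might do, assuming its job is to turn the placeholder strings into NaN and make the columns numeric, could look like this (the name `mixed_type_fixer_sketch` and the toy values are hypothetical):

```python
import numpy as np
import pandas as pd


def mixed_type_fixer_sketch(df, cols):
    """Hypothetical stand-in for utils.mixed_type_fixer."""
    df = df.copy()
    for col in cols:
        # Placeholder strings such as 'X' or 'XX' cannot be parsed, so
        # errors='coerce' turns them into NaN; everything else becomes float.
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df


# Toy data mixing ints, floats, numeric strings and placeholders.
toy = pd.DataFrame({
    "CAMEO_DEUG_2015": [8, "X", "4", 2.0],
    "CAMEO_INTL_2015": ["51", "XX", 24, np.nan],
})
fixed = mixed_type_fixer_sketch(toy, ["CAMEO_DEUG_2015", "CAMEO_INTL_2015"])
print(fixed.dtypes)
```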

#### Checking if values were fixed
#### Change this cell to code if you want to perform the checks

azdias.CAMEO_DEUG_2015.unique()
customers.CAMEO_INTL_2015.unique()

Given these mixed-type entries, I created a function to check the dtype of the different attributes.

This is useful in case some attributes have too many category values, which might fragment the clustering too much.

In [None]:
#doing a quick check of categorical features and see if some are too granular to be maintained
tst = categorical_checker(azdias, attributes)
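The check above can be sketched as follows. This is not the `categorical_checker` from `utils.py`, which is not shown; it only assumes that `attributes` has the `attribute` and `type` columns used elsewhere in this notebook. The function name and toy values are hypothetical.

```python
import pandas as pd


def categorical_checker_sketch(df, attributes):
    """Hypothetical stand-in: count distinct values per categorical column."""
    is_categorical = attributes["type"] == "categorical"
    categorical = attributes.loc[is_categorical, "attribute"]
    # Keep only the categorical attributes that exist in the dataframe.
    present = [col for col in categorical if col in df.columns]
    # nunique() ignores NaN by default.
    return df[present].nunique().sort_values(ascending=False)


toy = pd.DataFrame({
    "CAMEO_DEU_2015": ["1A", "1B", "2C", "1A"],
    "ANREDE_KZ": [1, 2, 1, 2],
    "ANZ_PERSONEN": [1, 3, 2, 4],
})
toy_attributes = pd.DataFrame({
    "attribute": ["CAMEO_DEU_2015", "ANREDE_KZ", "ANZ_PERSONEN"],
    "type": ["categorical", "categorical", "numeric"],
})
category_counts = categorical_checker_sketch(toy, toy_attributes)
print(category_counts)
```

Columns at the top of the result are the candidates for dropping or regrouping before one-hot encoding.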

Based on the categorical info, it might be a good idea to drop the CAMEO_DEU_2015 column: with 45 different category values it is far too fragmented. This is an idea to revisit after testing the models.

There is an extra column called Unnamed that looks like a duplicated index, so I will drop it now.

In [None]:
#dropping unnamed column
azdias = azdias.drop(azdias.columns[0], axis = 1)
customers = customers.drop(customers.columns[0], axis = 1)

We also have 3 columns that are different between azdias and customers:

'CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'

I will drop those to harmonize the 2 datasets

In [None]:
customers = customers.drop(['CUSTOMER_GROUP', 'ONLINE_PURCHASE', 'PRODUCT_GROUP'], inplace=False, axis=1)

#### I will now check the overall shapes of the datasets

#### Azdias Shape

In [None]:
# checking what the azdias dataframe looks like
print('Printing dataframe shape')
print(azdias.shape)
print('________________________________________________________')

azdias.head()

#### Customers Shape

In [None]:
# checking what the customers dataframe looks like
print('Printing dataframe shape')
print(customers.shape)
print('________________________________________________________')

customers.head()

#### Attributes shape

In [None]:
# Check the summary csv file
print(attributes.shape)
attributes.head()

### On the dataframe shapes:

#### For now, note that the 2 initial working dataframes are harmonized in terms of number of columns:
#### azdias: (891221, 366)
#### customers: (191652, 366)
#### attributes: (332, 5)

In [None]:
#saving the unique attribute names to lists
attributes_list = attributes.attribute.unique().tolist()
azdias_list = list(azdias.columns)
customers_list = list(customers.columns)

In [None]:
#establishing uniqueness of the attributes across the datasets in work
common_to_all = (set(attributes_list) & set(azdias_list) & set(customers_list))
unique_to_azdias = (set(azdias_list) - set(attributes_list) - set(customers_list))
unique_to_customers = (set(customers_list) - set(attributes_list) - set(azdias_list))
unique_to_attributes = (set(attributes_list) - set(customers_list) - set(azdias_list))
unique_to_attributes_vs_azdias = (set(attributes_list) - set(azdias_list))
# columns present in azdias but not documented in attributes
unique_to_azdias_vs_attributes = (set(azdias_list) - set(attributes_list))
common_azdias_attributes = (set(azdias_list) & set(attributes_list))

print("No of items common to all 3 dataframes: " + str(len(common_to_all)))
print("No of items exclusive to azdias: " + str(len(unique_to_azdias)))
print("No of items exclusive to customers: " + str(len(unique_to_customers)))
print("No of items exclusive to attributes: " + str(len(unique_to_attributes)))
print("No of items overlapping between azdias and attributes: " + str(len(common_azdias_attributes)))
print("No of items exclusive to attributes vs azdias: " + str(len(unique_to_attributes_vs_azdias)))
print("No of items exclusive to azdias vs attributes: " + str(len(unique_to_azdias_vs_attributes)))

In [None]:
rcParams['figure.figsize'] = 8, 8

ax = plt.axes()
ax.set_facecolor('lightgrey')
# pass the two sets so venn2 computes the exclusive and shared region sizes itself
v = venn2([set(azdias_list), set(attributes_list)],
          set_labels=('Azdias', 'Attributes'),
          set_colors=['cyan', 'grey'])

plt.title("Attribute presence on Azdias vs DIAS Attributes ")
plt.show()

### From this little exploration we got quite a bit of information:
#### - There are 3 extra features in the customers dataset; they correspond to the columns 'CUSTOMER_GROUP', 'ONLINE_PURCHASE', and 'PRODUCT_GROUP'

#### - All the datasets share 327 features between them

#### - The attributes file has 5 columns corresponding to feature information that does not exist in the other datasets

## Preprocessing
### Now that I have a bird's-eye view of the data, I will proceed with cleaning and handling missing values, re-encoding features (since the first portion of this project will involve unsupervised learning), and performing some feature engineering and scaling.

### Assessing missing data and replacing it with nan
### Before dealing with the missing and unknown data I will save a copy of the dataframes for the purpose of visualizing how much improvement was achieved

In [None]:
#making dataframes copies pre-cleanup
azdias_pre_cleanup = azdias.copy()
customers_pre_cleanup = customers.copy()

In [None]:
# I am using feat_fixer to use the information in the attributes dataframe to fill the information 
# regarding missing and unknown values
azdias = feat_fixer(azdias, attributes)
customers = feat_fixer(customers, attributes)
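`feat_fixer` is defined in `utils.py` and is not shown here. The sketch below illustrates the general idea of replacing documented "missing or unknown" codes with NaN. It assumes the codes are stored per attribute in a column called `missing_or_unknown`; that column name, the function name and the toy codes are all hypothetical and should be adapted to the real attributes file.

```python
import pandas as pd


def feat_fixer_sketch(df, attributes, codes_col="missing_or_unknown"):
    """Hypothetical stand-in for utils.feat_fixer."""
    df = df.copy()
    for _, row in attributes.iterrows():
        col = row["attribute"]
        if col in df.columns:
            # mask() sets a value to NaN wherever the condition is True.
            df[col] = df[col].mask(df[col].isin(list(row[codes_col])))
    return df


# Toy data: the codes listed for each attribute are invented.
toy = pd.DataFrame({"AGER_TYP": [-1, 2, 0, 1], "ANREDE_KZ": [1, 2, -1, 2]})
toy_attributes = pd.DataFrame({
    "attribute": ["AGER_TYP", "ANREDE_KZ"],
    "missing_or_unknown": [[-1, 0], [-1]],
})
cleaned = feat_fixer_sketch(toy, toy_attributes)
print(cleaned.isnull().sum())
```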

### Since the next step involves dropping columns whose share of missing data exceeds a threshold, it is important to check that the azdias and customers columns match before and after the cleanup process

### There is a chance that some columns are missing too much data in one dataframe and being dropped while they are abundant in the other, causing a discrepancy in the shape between the 2 dataframes

#### It is always hard to define a threshold on how much missing data is too much; my first approach will treat over 30% as too much
#### Based on model performance this is an idea to revisit and adjust

In [None]:
balance_checker(azdias, customers)

#### Prior to cleanup customers and azdias match

In [None]:
percent_missing_azdias_df = percentage_of_missing(azdias)
percent_missing_azdias_pc_df = percentage_of_missing(azdias_pre_cleanup)

percent_missing_customers_df = percentage_of_missing(customers)
percent_missing_customers_pc_df = percentage_of_missing(customers_pre_cleanup)
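`percentage_of_missing` comes from `utils.py` and is not shown. Since the later cells read the `column_name` and `percent_missing` columns from its result, a minimal sketch under that assumption might be (the function name and toy data are hypothetical):

```python
import numpy as np
import pandas as pd


def percentage_of_missing_sketch(df):
    """Hypothetical stand-in: percentage of NaN values per column."""
    # isnull().mean() is the fraction of missing rows in each column.
    percent = df.isnull().mean() * 100
    return pd.DataFrame({
        "column_name": percent.index,
        "percent_missing": percent.values,
    })


toy = pd.DataFrame({"a": [1, np.nan, np.nan, 4], "b": [1, 2, 3, 4]})
missing_df = percentage_of_missing_sketch(toy)
print(missing_df)
```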

In [None]:
print('Identified missing data in Azdias: ')
print('Pre-cleanup: ' + str(azdias_pre_cleanup.isnull().sum().sum()) + ' Post_cleanup: ' + str(azdias.isnull().sum().sum()))

print('Identified missing data in Customers: ')
print('Pre-cleanup: ' + str(customers_pre_cleanup.isnull().sum().sum()) + ' Post_cleanup: ' + str(customers.isnull().sum().sum()))

In [None]:
# the *_pc_df frames were computed on the pre-cleanup copies
print('Azdias columns with no missing values (count):')
print('Pre-cleanup: ', (percent_missing_azdias_pc_df['percent_missing'] == 0.0).sum())
print('Post-cleanup: ', (percent_missing_azdias_df['percent_missing'] == 0.0).sum())

print('Customers columns with no missing values (count):')
print('Pre-cleanup: ', (percent_missing_customers_pc_df['percent_missing'] == 0.0).sum())
print('Post-cleanup: ', (percent_missing_customers_df['percent_missing'] == 0.0).sum())

#### Deciding on what data to maintain based on the percentage missing

In [None]:
# missing more or less than 30% of the data
azdias_missing_over_30 = split_on_percentage(percent_missing_azdias_df, 30, '>')
azdias_missing_less_30 = split_on_percentage(percent_missing_azdias_df, 30, '<=')

customers_missing_over_30 = split_on_percentage(percent_missing_customers_df, 30, '>')
customers_missing_less_30 = split_on_percentage(percent_missing_customers_df, 30, '<=')
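The split above can be sketched as a filter on the `percent_missing` column, with the comparison chosen by the string argument. This is only an assumption about how `split_on_percentage` and `columns_to_delete` behave, since both live in `utils.py`; the names ending in `_sketch` and the toy percentages are hypothetical.

```python
import operator

import pandas as pd

# Map the comparator strings used in the notebook to Python operators.
_COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def split_on_percentage_sketch(missing_df, threshold, comparator):
    """Hypothetical stand-in: keep rows whose percent_missing passes the test."""
    compare = _COMPARATORS[comparator]
    return missing_df[compare(missing_df["percent_missing"], threshold)]


def columns_to_delete_sketch(missing_df):
    """Hypothetical stand-in: names of the columns flagged for dropping."""
    return missing_df["column_name"].tolist()


toy_missing = pd.DataFrame({
    "column_name": ["ALTER_KIND4", "KK_KUNDENTYP", "ANREDE_KZ"],
    "percent_missing": [45.0, 30.0, 0.0],
})
over_30 = split_on_percentage_sketch(toy_missing, 30, ">")
up_to_30 = split_on_percentage_sketch(toy_missing, 30, "<=")
print(columns_to_delete_sketch(over_30))
```

Using `>` and `<=` as the pair guarantees that every column lands in exactly one of the two groups, including those sitting exactly on the threshold.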

In [None]:
#plotting select features and their missing data percentages
figure, axes = plt.subplots(4, 1, figsize = (15,15), squeeze = False)

azdias_missing_over_30.sort_values(by = 'percent_missing', ascending = False).plot(kind = 'bar', x = 'column_name', y = 'percent_missing',
                                                                                ax = axes[0][0], color = 'red', title = 'Azdias percentage of missing values over 30%' )

#due to the sheer amount of data points to be plotted this does not make an appealing vis so I will restrict
#the number of plotted points to 40
azdias_missing_less_30.sort_values(by = 'percent_missing', ascending = False)[:40].plot(kind = 'bar', x = 'column_name', y = 'percent_missing',
                                                                                ax = axes[1][0], title = 'Azdias percentage of missing values less 30%' )

customers_missing_over_30.sort_values(by = 'percent_missing', ascending = False).plot(kind = 'bar', x = 'column_name', y = 'percent_missing',
                                                                                ax = axes[2][0], color = 'red', title = 'Customers percentage of missing values over 30%' )

#due to the sheer amount of data points to be plotted this does not make an appealing vis so I will restrict
#the number of plotted points to 40
customers_missing_less_30.sort_values(by = 'percent_missing', ascending = False)[:40].plot(kind = 'bar', x = 'column_name', y = 'percent_missing',
                                                                                ax = axes[3][0], title = 'Customers percentage of missing values less 30%' )

plt.tight_layout()
plt.show()

### The vast majority of the columns with missing values have a percent of missing under 30%
### Based on this information I will remove columns with more than 30% missing values

In [None]:
#extracting column names with more than 30% values missing so we can drop them from azdias df
azdias_col_delete = columns_to_delete(azdias_missing_over_30)

#extracting column names with more than 30% values missing so we can drop them from customers df
customers_col_delete = columns_to_delete(customers_missing_over_30)

In [None]:
#dropping the columns identified in the previous lists

azdias = azdias.drop(azdias_col_delete, axis = 1)
customers = customers.drop(customers_col_delete, axis = 1)

### Now that we dropped columns missing more than 30% of their data let's check if we should also drop rows based on a particular threshold

In [None]:
#plotting distribution of null values
row_hist(azdias, customers, 30)

#### Based on this visualization we can deduce 2 things
##### - most of the rows are missing values in fewer than 50 columns
##### - both customers and azdias contain rows, probably of the same kind, that are missing values in over 200 columns

In [None]:
#deleting rows based on the information acquired in the previous histogram 
azdias = row_dropper(azdias, 50)
customers = row_dropper(customers, 50)
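A minimal sketch of the row filter, assuming `row_dropper` keeps rows with at most the given number of missing values (whether the real helper in `utils.py` treats the threshold as inclusive is not shown, so that detail is an assumption):

```python
import numpy as np
import pandas as pd


def row_dropper_sketch(df, max_missing):
    """Hypothetical stand-in: keep rows with at most max_missing NaN values."""
    # Count missing values across the columns of each row.
    missing_per_row = df.isnull().sum(axis=1)
    return df[missing_per_row <= max_missing]


toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 5, np.nan],
    "c": [3, 6, np.nan],
})
kept = row_dropper_sketch(toy, 1)
print(kept.shape)
```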

In [None]:
#plotting null values distribution after cleanup
row_hist(azdias, customers, 30)

In [None]:
balance_checker(azdias, customers)

In [None]:
azdias.shape

In [None]:
customers.shape

Based on this information, the azdias df has a few extra columns compared to customers:
- 'KBA13_SEG_WOHNMOBILE', 'ORTSGR_KLS9', 'KBA13_SEG_SPORTWAGEN', 'KBA13_SEG_OBERKLASSE'
-  The KBA13_SEG columns refer to information on the type of car individuals own

The customers dataframe has a column not present in azdias: 
- 'AKT_DAT_KL'

So to finalize this step I will drop these columns

In [None]:
azdias = azdias.drop(['KBA13_SEG_WOHNMOBILE', 'ORTSGR_KLS9', 'KBA13_SEG_SPORTWAGEN', 'KBA13_SEG_OBERKLASSE'], inplace=False, axis=1)
customers = customers.drop(['AKT_DAT_KL'], inplace=False, axis=1)

In [None]:
balance_checker(azdias, customers)

## Feature Encoding

### As I previously saw using categorical_checker, many features need re-encoding for the unsupervised learning portion

- numerical features will be kept as is
- ordinal features will be kept as is
- categorical features and mixed type features will have to be re-encoded

In [None]:
#checking for mixed type features
attributes[attributes.type == 'mixed']

In [None]:
#retrieve a list of categorical features for future encoding
cats = attributes[attributes.type == 'categorical']
list(cats['attribute'])

#### At this point I already dealt with the CAMEO_INTL_2015 column by converting XX to nan

#### PRAEGENDE_JUGENDJAHRE has 3 dimensions: generation decade, whether people are mainstream or avant-garde, and whether they are from east or west. I will create new features out of this particular column

#### LP_LEBENSPHASE_GROB seems to encode the same information as the CAMEO column and it is divided between coarse (grob) and fine (fein)

In [None]:
azdias = special_feature_handler(azdias)
customers = special_feature_handler(customers)
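`special_feature_handler` is defined in `utils.py` and is not shown. One way to split a multi-dimensional code such as PRAEGENDE_JUGENDJAHRE into separate features is to map the original code through one dictionary per new feature. The sketch below assumes that approach; the function name, the new column names and the two-code mapping are illustrative only and must be checked against the DIAS Attributes - Values spreadsheet.

```python
import numpy as np
import pandas as pd


def split_coded_feature(df, column, mappings):
    """Hypothetical helper: derive one new column per mapping, then drop the source.

    mappings has the form {new_column_name: {original_code: new_value}}.
    """
    df = df.copy()
    for new_col, mapping in mappings.items():
        # Codes missing from the mapping, and NaN, end up as NaN.
        df[new_col] = df[column].map(mapping)
    return df.drop(columns=[column])


toy = pd.DataFrame({"PRAEGENDE_JUGENDJAHRE": [1, 2, np.nan]})

# Illustrative mapping for two codes only; not the full data dictionary.
toy_mappings = {
    "DECADE": {1: 40, 2: 40},
    "MOVEMENT": {1: 0, 2: 1},
}
split = split_coded_feature(toy, "PRAEGENDE_JUGENDJAHRE", toy_mappings)
print(split)
```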

In [None]:
azdias.select_dtypes('object').head()

## Feature engineering
#### Based on the previous exploration there are a few features that are good candidates for novel feature creation

In [None]:
azdias_eng = azdias.copy()
customers_eng = customers.copy()
feat_eng(azdias_eng)
feat_eng(customers_eng)

#### Now that I am done with creating new features and dealing with the most obvious columns I need to encode the remaining categorical features
#### Considering this post: https://stats.stackexchange.com/questions/224051/one-hot-vs-dummy-encoding-in-scikit-learn there are advantages and drawbacks to choosing one-hot encoding vs dummy encoding.
#### There are also concerns regarding using dummies altogether https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769 so I will keep this in mind while moving forward
#### For now I will go with dummy creation

In [None]:
#finally I will encode all the features that are left
cat_features = ['AGER_TYP','ANREDE_KZ','CAMEO_DEU_2015','CAMEO_DEUG_2015','CJT_GESAMTTYP','D19_BANKEN_DATUM','D19_BANKEN_OFFLINE_DATUM',
                'D19_BANKEN_ONLINE_DATUM','D19_GESAMT_DATUM','D19_GESAMT_OFFLINE_DATUM','D19_GESAMT_ONLINE_DATUM','D19_KONSUMTYP',
                'D19_TELKO_DATUM','D19_TELKO_OFFLINE_DATUM','D19_TELKO_ONLINE_DATUM','D19_VERSAND_DATUM','D19_VERSAND_OFFLINE_DATUM','D19_VERSAND_ONLINE_DATUM',
                'D19_VERSI_DATUM','D19_VERSI_OFFLINE_DATUM','D19_VERSI_ONLINE_DATUM','FINANZTYP','GEBAEUDETYP',
                'GFK_URLAUBERTYP','GREEN_AVANTGARDE','KBA05_BAUMAX','LP_FAMILIE_FEIN',
                'LP_FAMILIE_GROB','LP_STATUS_FEIN','LP_STATUS_GROB','NATIONALITAET_KZ','OST_WEST_KZ','PLZ8_BAUMAX',
                'SHOPPER_TYP','SOHO_KZ','TITEL_KZ','VERS_TYP','ZABEOTYP']

azdias_ohe = pd.get_dummies(azdias_eng, columns = cat_features)
customers_ohe = pd.get_dummies(customers_eng, columns = cat_features)
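Because `pd.get_dummies` is applied to each dataframe separately, a category that appears in only one of them produces a dummy column the other lacks. A possible safeguard, sketched here on toy data with invented values (`ALTER` is a made-up column name), is to reindex one result onto the other's columns:

```python
import pandas as pd

# Toy frames: category 3 of FINANZTYP only occurs in the general population.
azdias_toy = pd.DataFrame({"FINANZTYP": [1, 2, 3], "ALTER": [30, 40, 50]})
customers_toy = pd.DataFrame({"FINANZTYP": [1, 1, 2], "ALTER": [35, 45, 55]})

azdias_dummies = pd.get_dummies(azdias_toy, columns=["FINANZTYP"])
customers_dummies = pd.get_dummies(customers_toy, columns=["FINANZTYP"])

# Reindex customers onto the azdias columns; dummy columns that customers
# lacks are created and filled with 0.
customers_aligned = customers_dummies.reindex(
    columns=azdias_dummies.columns, fill_value=0
)
print(customers_aligned.columns.tolist())
```

Note that `reindex` also drops any column that exists only in customers, so it is worth checking for such columns first.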

## Feature scaling
### Before moving on to dimensionality reduction I need to apply feature scaling. This way the principal components won't be dominated by features that simply have larger numeric ranges

In [None]:
#dataframes using StandardScaler
azdias_SS = feature_scaling(azdias_ohe, 'StandardScaler')
customers_SS = feature_scaling(customers_ohe, 'StandardScaler')

In [None]:
#dataframes using RobustScaler
azdias_RS = feature_scaling(azdias_ohe, 'RobustScaler')
customers_RS = feature_scaling(customers_ohe, 'RobustScaler')

In [None]:
#dataframes using MinMaxScaler
azdias_MMS = feature_scaling(azdias_ohe, 'MinMaxScaler')
customers_MMS = feature_scaling(customers_ohe, 'MinMaxScaler')
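`feature_scaling` is defined in `utils.py` and is not shown. A minimal sketch, assuming it looks up a scikit-learn scaler by name, fits it on the dataframe and returns a dataframe with the same columns, might look like this (the function name is hypothetical and the sketch assumes no missing values remain):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Scaler classes keyed by the names used in the notebook.
SCALERS = {
    "StandardScaler": StandardScaler,
    "RobustScaler": RobustScaler,
    "MinMaxScaler": MinMaxScaler,
}


def feature_scaling_sketch(df, scaler_name):
    """Hypothetical stand-in for utils.feature_scaling."""
    scaler = SCALERS[scaler_name]()
    scaled = scaler.fit_transform(df)
    # Wrap the array back into a dataframe so column names are preserved.
    return pd.DataFrame(scaled, columns=df.columns, index=df.index)


toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
toy_ss = feature_scaling_sketch(toy, "StandardScaler")
toy_mms = feature_scaling_sketch(toy, "MinMaxScaler")
print(toy_ss.round(3))
```

One design point to keep in mind: fitting a separate scaler on customers puts the two datasets on different scales, so fitting on azdias and reusing that fitted scaler for customers may make the later comparison more consistent.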

## Dimensionality Reduction
### Finally I will use PCA (a linear technique) to keep only the components that capture most of the variance in the data

In [None]:
components_list_azdias = azdias_SS.columns.values
n_components_azdias = len(components_list_azdias)

components_list_customers = customers_SS.columns.values
n_components_customers = len(components_list_customers)

azdias_SS_pca = pca_model(azdias_SS, n_components_azdias)
customers_SS_pca = pca_model(customers_SS, n_components_customers)

azdias_RS_pca = pca_model(azdias_RS, n_components_azdias)
customers_RS_pca = pca_model(customers_RS, n_components_customers)

azdias_MMS_pca = pca_model(azdias_MMS, n_components_azdias)
customers_MMS_pca = pca_model(customers_MMS, n_components_customers)
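`pca_model` is defined in `utils.py` and is not shown. Since its result is later used with `explained_variance_ratio_`, the sketch below assumes it returns the fitted `PCA` object rather than the transformed data. The function name and the random toy data are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA


def pca_model_sketch(data, n_components):
    """Hypothetical stand-in: fit PCA and return the fitted object."""
    pca = PCA(n_components=n_components)
    pca.fit(data)
    return pca


# Random toy data standing in for a scaled dataframe.
rng = np.random.default_rng(0)
toy_data = rng.normal(size=(50, 4))

toy_pca = pca_model_sketch(toy_data, 4)
print(toy_pca.explained_variance_ratio_)
```

Keeping every component, as in the cell above, reduces nothing by itself; it only provides the full variance profile needed for the scree plots.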

In [None]:
scree_plots(azdias_SS_pca, azdias_RS_pca, azdias_MMS_pca, ' azdias')

In [None]:
scree_plots(customers_SS_pca, customers_RS_pca, customers_MMS_pca, ' customers')

Each principal component is a unit vector pointing in the direction of highest remaining variance. The further a feature's weight is from 0, the more strongly the component points toward that feature.

In [None]:
first_dimension = interpret_pca(azdias_SS, n_components_azdias, 1)
first_dimension

In [None]:
display_interesting_features(azdias_SS, azdias_SS_pca, 0)

In [None]:
display_interesting_features(azdias_RS, azdias_RS_pca, 0)

In [None]:
display_interesting_features(azdias_MMS, azdias_MMS_pca, 0)

#### On this first dimension most of the information seems to be related to household size, purchase power and types of purchases

#### Based on these plots:
- using standard scaler with 300 principal components 90% of the original variance can be represented
- using robust scaler with about 150 components we represent 90% of the original variance 
- using minmax scaler with 250 components we represent 90% of the original variance
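The component counts above were read off the scree plots. As a sketch of how the same number could be computed, the snippet below takes the cumulative sum of `explained_variance_ratio_` and finds the first position where it reaches 90%. The random data is a stand-in for the scaled dataframes, so the resulting count has nothing to do with the real datasets.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random toy data whose columns have increasing variance.
rng = np.random.default_rng(0)
toy_data = rng.normal(size=(200, 10)) * np.arange(1, 11)

toy_pca = PCA().fit(toy_data)
cumulative = np.cumsum(toy_pca.explained_variance_ratio_)

# searchsorted returns the first index whose cumulative value is >= 0.90;
# adding 1 converts that index into a number of components.
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_needed, cumulative[n_needed - 1])
```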


#### Moving on I will pick the robust scaler PCA and I will re-fit with 150 components

In [None]:
azdias_pca_refit = pca_model(azdias_RS, 150)
azdias_pca_refit.explained_variance_ratio_.sum()

In [None]:
display_interesting_features(azdias_RS, azdias_pca_refit, 0)

## Part 1: Customer Segmentation Report

The main bulk of your analysis will come in this part of the project. Here, you should use unsupervised learning techniques to describe the relationship between the demographics of the company's existing customers and the general population of Germany. By the end of this part, you should be able to describe parts of the general population that are more likely to be part of the mail-order company's main customer base, and which parts of the general population are less so.

### After a lot of data pre-processing we are finally getting to the analysis. I will start by attempting KMeans to find relevant clusters

#### Now that I have reduced the number of components to use, it is important to select the number of clusters to aim at for kmeans

In [None]:
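# Project the scaled data onto the 150 retained components; this assumes
# pca_model returns the fitted PCA object (as explained_variance_ratio_ suggests)
azdias_reduced = azdias_pca_refit.transform(azdias_RS)

# Over a number of different cluster counts...
results = []
# Append results in a list to ease plotting in the following cell
for nclusters in range(2, 15):
    print(nclusters)
    myclustering = KMeans(n_clusters=nclusters)
    # run k-means clustering on the data and...
    myclustering.fit(azdias_reduced)

    # store the SSE: score() returns the negative within-cluster sum of squares
    results.append(-myclustering.score(azdias_reduced))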

In [None]:
def kmeans_model(data, center):
    '''
    fits kmeans to the data and returns the SSE score, the labels and the unique clusters
    INPUT:
        data - the dataset you want to fit kmeans to
        center - the number of centers you want (the k value)
    OUTPUT:
        score - the SSE score for the kmeans model fit to the data
        kmeans_predict - array with the predicted cluster label of each row
        unique_clusters - array with the unique cluster labels
    '''
    #instantiate kmeans
    kmeans = KMeans(n_clusters=center)

    # Then fit the model to your data using the fit method
    model = kmeans.fit(data)
    
    # Obtain a score related to the model fit
    score = np.abs(model.score(data))
    
    # Array of predicted label values
    kmeans_predict = model.predict(data)
    
    # Get unique label clusters
    unique_clusters = np.unique(kmeans_predict)
    
    return score, kmeans_predict, unique_clusters

In [None]:
azdias_k_p = kmeans_model(azdias_RS, 5)
customers_k_p = kmeans_model(customers_RS, 5)

azdias_score = azdias_k_p[0]
customers_score = customers_k_p[0]

azdias_predict = azdias_k_p[1]
customers_predict = customers_k_p[1]


unique_clusters = azdias_k_p[2]

In [None]:
# the scores are single floats, so the lengths are taken from the label arrays
print('azdias kmeans score: ', azdias_score)
print(len(azdias_predict))
print('customers kmeans score: ', customers_score)
print(len(customers_predict))
print('Unique clusters: ', unique_clusters)
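To describe which parts of the population are over-represented among customers, one option is to fit KMeans on the general population only, assign both datasets to those clusters, and compare the share of rows per cluster. The sketch below shows that idea on synthetic two-dimensional data; it is not the analysis run in this notebook, and the variable names and data are invented.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic data: the general population has two groups, customers mostly one.
rng = np.random.default_rng(42)
general = np.vstack([
    rng.normal(0, 1, (300, 2)),
    rng.normal(6, 1, (300, 2)),
])
customers_toy = rng.normal(6, 1, (200, 2))

# Fit on the general population, then assign both datasets to its clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(general)
general_share = pd.Series(kmeans.predict(general)).value_counts(normalize=True)
customer_share = pd.Series(kmeans.predict(customers_toy)).value_counts(normalize=True)

comparison = pd.DataFrame(
    {"general": general_share, "customers": customer_share}
).fillna(0).sort_index()

# A ratio above 1 marks a cluster that is over-represented among customers.
comparison["ratio"] = comparison["customers"] / comparison["general"]
print(comparison)
```

Fitting once and predicting twice keeps the cluster labels comparable, which is not the case when each dataset gets its own KMeans fit.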


## Part 2: Supervised Learning Model

Now that you've found which parts of the population are more likely to be customers of the mail-order company, it's time to build a prediction model. Each of the rows in the "MAILOUT" data files represents an individual that was targeted for a mailout campaign. Ideally, we should be able to use the demographic information from each individual to decide whether or not it will be worth it to include that person in the campaign.

The "MAILOUT" data has been split into two approximately equal parts, each with almost 43 000 data rows. In this part, you can verify your model with the "TRAIN" partition, which includes a column, "RESPONSE", that states whether or not a person became a customer of the company following the campaign. In the next part, you'll need to create predictions on the "TEST" partition, where the "RESPONSE" column has been withheld.

In [None]:
mailout_train = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TRAIN.csv', sep=';')

## Part 3: Kaggle Competition

Now that you've created a model to predict which individuals are most likely to respond to a mailout campaign, it's time to test that model in competition through Kaggle. If you click on the link [here](http://www.kaggle.com/t/21e6d45d4c574c7fa2d868f0e8c83140), you'll be taken to the competition page where, if you have a Kaggle account, you can enter. If you're one of the top performers, you may have the chance to be contacted by a hiring manager from Arvato or Bertelsmann for an interview!

Your entry to the competition should be a CSV file with two columns. The first column should be a copy of "LNR", which acts as an ID number for each individual in the "TEST" partition. The second column, "RESPONSE", should be some measure of how likely each individual became a customer – this might not be a straightforward probability. As you should have found in Part 2, there is a large output class imbalance, where most individuals did not respond to the mailout. Thus, predicting individual classes and using accuracy does not seem to be an appropriate performance evaluation method. Instead, the competition will be using AUC to evaluate performance. The exact values of the "RESPONSE" column do not matter as much: only that the higher values try to capture as many of the actual customers as possible, early in the ROC curve sweep.

In [None]:
mailout_test = pd.read_csv('../../data/Term2/capstone/arvato_data/Udacity_MAILOUT_052018_TEST.csv', sep=';')