# **ELO COMPETITION**
by Ana Maria Cuciuc, Luise Schreiter and Adrian Villegas


# **Content of this Kernel**

1. [Introduction](#10)
2. [Loading Necessary Packages](#1)
3. [First Data Exploration](#2)
4. [Closer Look at the Merchants Data](#3)
5. [Preparing the Data](#4)
6. [Feature Engineering](#5)
7. [Merging the Data Sets](#6)
8. [Second Data Exploration](#7)
9. [Preparing the Prediction](#8)
10. [Testing different Prediction Models](#11)
11. [Ensemble Prediction](#9)
12. [Project Reflection](#12)

<a id="10"></a> <br>
# **1. Introduction**

The Elo competition was chosen as a group project for the data science class of the master's program Business Intelligence and Process Management. Groups of three worked together to apply the methods covered in class.

The goal of this Kaggle competition is to predict a loyalty score for given credit cards. The loyalty score should be based on the card owners' activities. For this purpose, data frames with historical and new transactions were provided, as well as a merchants data frame. As some columns are difficult to interpret, a data dictionary was also provided. Nevertheless, some columns remain unexplained and are only labeled as "anonymized measure/category".

This kernel is the result of the group project. It provides the code used, together with explanations, additional information and reflections. The kernel starts by loading the data and exploring it for the first time. Afterwards a closer look at the merchants data table is taken. Then the data is prepared and new features are created. In the next step the data frames are merged. The merged data frames are then explored a second time, as they contain a lot of new information. Before the prediction is made, the data gets some final adjustments. The kernel closes with a reflection on the project.

<a id="1"></a> <br>
# **2. Loading Necessary Packages and the Input Data**

First, the necessary modules and packages are imported and some configurations are set. The following packages are used in this kernel: pandas, matplotlib.pyplot, seaborn, numpy, scipy.stats, sklearn.ensemble, sklearn.model_selection, warnings, datetime, lightgbm and IPython.display. Most of these packages were already used in class. The warnings package is imported to suppress warnings in the output, which is done with *warnings.filterwarnings('ignore')*. The datetime module is used to manipulate dates and times and is needed in this kernel to create new features from existing ones. With *sns.set(style='darkgrid', palette='deep')* the seaborn style for graphics is set. The *%matplotlib inline* command is one of the IPython "magic functions" and renders figures directly in the notebook. The lightgbm package was not used in class. It provides the LightGBM model, which is used in many kernels of this competition, so it was included to see how it works.

In [None]:
# Import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
import warnings
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
import datetime
import lightgbm as lgb
from IPython.display import Image
sns.set(style='darkgrid', palette='deep')
warnings.filterwarnings('ignore')
%matplotlib inline

Afterwards the data sets are loaded with the pd.read_csv function, so each of them is a pandas data frame. The following files are imported:

* train.csv
* test.csv
* merchants.csv
* new_merchant_transactions.csv (loaded as new_trans)
* historical_transactions.csv (loaded as hist_trans)

In [None]:
# Load train and test data
train = pd.read_csv("../input/elo-merchant-category-recommendation/train.csv", parse_dates=["first_active_month"])
test = pd.read_csv("../input/elo-merchant-category-recommendation/test.csv", parse_dates=["first_active_month"])

In [None]:
# Load additional data
merchants = pd.read_csv("../input/elo-merchant-category-recommendation/merchants.csv")
new_trans = pd.read_csv("../input/elo-merchant-category-recommendation/new_merchant_transactions.csv", 
                        parse_dates=['purchase_date'])
hist_trans = pd.read_csv("../input/elo-merchant-category-recommendation/historical_transactions.csv", 
                         parse_dates=['purchase_date'])

<a id="2"></a> <br>
# **3. First Data Exploration**

Before the data is processed and prepared for the prediction, a first data exploration is carried out. It covers the size and shape of the different data frames, and the heads of the data frames are shown to get a first impression of the columns. The provided data dictionary was used to understand the columns. Besides shape and size, this first data exploration focuses on the train data frame first, then on the historical transactions and finally on the new transactions.

In [None]:
# Shape of data frames
print("Train set size: ", train.shape)
print("Test set size: ", test.shape)
print("New Merchant Transactions set size: ", new_trans.shape)
print("Historical Transactions set size: ", hist_trans.shape)
print("Merchants set size: ", merchants.shape)

In the following, the heads of the train and the test data frames are shown. These two data frames have the same columns, except that only the train data frame contains the target. The meaning of each column can be found in the data dictionary. Both the train and the test set contain unique card_ids and some related information. Nevertheless, the data dictionary does not explain what features 1, 2 and 3 are really about.

Train Data Frame:

In [None]:
train.head(5)

Test Data Frame:

In [None]:
test.head(5)

Now the transaction data frames are taken into focus. These two data frames also have the same columns. Most of them are explained in the data dictionary but also in this case there are some columns that remain unexplained. In these data frames the card_ids aren't unique as there can be more than one transaction per card_id.

New Transactions Data Frame:

In [None]:
new_trans.head(5)

Historical Transactions Data Frame:

In [None]:
hist_trans.head(5)

The last data frame is the merchants data frame. This data frame differs from the other four as there is no direct connection to the card_ids. It will be treated in more detail in the next chapter.

Merchants Data Frame:

In [None]:
merchants.head(5)

As a next step the target variable is explored and plotted. First the variable is described statistically. Different measures for the content of the target column are displayed. Afterwards the target column is plotted.

In [None]:
# Statistics regarding the target variable
train['target'].describe()

In [None]:
#Target Variable Exploration

fig, (ax1,ax2) = plt.subplots(1,2, figsize=(15,6))

# Left plot
ax1.scatter(x=range(train.shape[0]), y=np.sort(train.target.values), c='r')
ax1.set_ylabel('Loyalty Score')

# Right plot
ax2.hist(train.target, bins=50, color='red')
ax2.set_xlabel('Loyalty Score')

plt.show()

As the plots above show some outliers, a closer look at the outliers is taken. As the prediction will be based on tree algorithms, the outliers don't have to be dropped.

In [None]:
# Calculate number of outliers
outliers = train[train.target < -20]
print("Number of outliers {}".format(outliers.target.count()))
print("Percentage of the total number of data points {:}%".format((outliers.target.count()/len(train.target))*100))

Now a closer look at the train data frame is taken. The distribution of all three features is plotted. Unfortunately it isn't possible to gain a lot of information from these plots as there is no information about what these features mean. After this data exploration it has been decided to treat them as categorical values.

In [None]:
# feature 1
plt.figure(figsize=(8,4))
sns.violinplot(x="feature_1", y=train.target, data=train)
plt.xticks(rotation='vertical')
plt.xlabel('Feature 1', fontsize=12)
plt.ylabel('Loyalty score', fontsize=12)
plt.title("Feature 1 distribution")
plt.show()

# feature 2
plt.figure(figsize=(8,4))
sns.violinplot(x="feature_2", y=train.target, data=train)
plt.xticks(rotation='vertical')
plt.xlabel('Feature 2', fontsize=12)
plt.ylabel('Loyalty score', fontsize=12)
plt.title("Feature 2 distribution")
plt.show()

# feature 3
plt.figure(figsize=(8,4))
sns.violinplot(x="feature_3", y=train.target, data=train)
plt.xticks(rotation='vertical')
plt.xlabel('Feature 3', fontsize=12)
plt.ylabel('Loyalty score', fontsize=12)
plt.title("Feature 3 distribution")
plt.show()

In the next step the historical transactions will be taken into focus. Due to the limited RAM, the used outputs were created in another kernel and just imported in this notebook.
First of all it will be checked how many transactions each card_id has.

In [None]:
# Import of the dataset created in a third kernel
temp_hist_eda = pd.read_csv("../input/eda-project/temp_hist_eda.csv")

In [None]:
# Sort first, then show the cards with the most historical transactions
temp_hist_eda.sort_values(by='num_hist_transactions', ascending=False).head()

Next, the value of the historical transactions of the cards is related to the loyalty score. For this a boxplot is created.

In [None]:
Image("../input/boxplots/Boxplot_hist.png")

The Boxplot shows that there seems to be an increase of the loyalty score with more valuable historical transactions. Now the new transactions will be analyzed in a similar way as the historical data frame.

In [None]:
temp_new_eda = pd.read_csv("../input/eda-project/temp_new_eda.csv")

In [None]:
# Sort first, then show the cards with the most new merchant transactions
temp_new_eda.sort_values(by='num_merch_transactions', ascending=False).head()

Again Boxplots are created to get more information about the data frame.

In [None]:
Image("../input/boxplots/Boxplot_hist.png")

It seems the loyalty score decreases as the number of new merchant transactions increases. The last bin is an exception.

In conclusion, this first exploration of the train and test data frames yields little insight, as there is no information about what the features mean and the other data frames are not yet joined to them. While it is obvious that the two transaction data frames will be important for the loyalty score, nothing can be said yet about the merchants data frame.

As a last step from the first data exploration the data frames are checked for missing values.

In [None]:
#Checking missing values
train.isnull().any()

In [None]:
test.isnull().any()

In [None]:
hist_trans.isnull().any()

In [None]:
new_trans.isnull().any()

In [None]:
merchants.isnull().any()

Apart from the train data frame, all data frames contain missing values. This has to be taken care of. The missing values will be handled in chapter 5 "Preparing the Data".
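
The `isnull().any()` checks above only show whether a column has missing values at all. A minimal sketch of how the number and share of missing values per column could be inspected instead, using a hypothetical toy frame (the column values are made up for illustration and are not taken from the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for one of the transaction tables
toy = pd.DataFrame({
    "card_id": ["C_1", "C_2", "C_3", "C_4"],
    "category_2": [1.0, np.nan, 3.0, np.nan],
    "merchant_id": ["M_1", "M_2", None, "M_4"],
})

# Number and share of missing values per column
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()
print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

The share is helpful when deciding whether filling a column with a single value is reasonable.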

<a id="3"></a> <br>
# **4. Closer Look at the Merchants Data**

To find out whether the merchants data has any effect on the loyalty score, the five card_ids with the highest and the five with the lowest scores were filtered from the train data.

**5 highest scores:**
* C_ID_a4e600deef
* C_ID_1c8a5b9d44
* C_ID_b0f1d28bd3
* C_ID_700c15a07d
* C_ID_ecc4e2e188

**5 lowest scores:**
* C_ID_282d394cc6
* C_ID_ebbf8a7516
* C_ID_3e35c68b54
* C_ID_defab7ce82
* C_ID_fc7b761beb

The transactions of these ids were then searched for patterns in the merchants. The idea was to see whether any merchant had a special influence on the high or low scores. The merchant categories were checked in the same way. It turned out that both groups, low and high, had the same merchants and the same categories in their transactions. Hence it can be concluded that there are no special merchants or categories that should be examined more closely. All the other columns of the merchants data frame provide information specific to each merchant.

In [None]:
# Filtering data frames regarding highest and lowest top 5
top_five = ['C_ID_a4e600deef', 'C_ID_1c8a5b9d44', 'C_ID_b0f1d28bd3', 'C_ID_700c15a07d',
            'C_ID_ecc4e2e188']

low_five = ['C_ID_282d394cc6', 'C_ID_ebbf8a7516', 'C_ID_3e35c68b54', 'C_ID_defab7ce82',
            'C_ID_fc7b761beb']

In [None]:
# Extracting rows from historical and new transactions with the filtered card_ids

top_hist = hist_trans.loc[hist_trans['card_id'].isin(top_five)]
top_new = new_trans.loc[new_trans['card_id'].isin(top_five)]

low_hist = hist_trans.loc[hist_trans['card_id'].isin(low_five)]
low_new = new_trans.loc[new_trans['card_id'].isin(low_five)]

Each of those data frames was then grouped once by merchant_id and once by merchant_category_id, and the results for the lowest and highest scores were compared. In this kernel this is only shown as an example on the historical transactions. First it is done for the merchant_id.

In [None]:
top_hist.groupby('merchant_id').count().sort_values(by='authorized_flag', ascending=False).head()

In [None]:
low_hist.groupby('merchant_id').count().sort_values(by='authorized_flag', ascending=False).head()

No clear pattern emerges when the low and the high scores are compared. The merchant with the highest count in the high scores also appears in the table of the low scores. Next, the same is done for the merchant_category_id.

In [None]:
top_hist.groupby('merchant_category_id').count().sort_values(by='authorized_flag', ascending=False).head()

In [None]:
low_hist.groupby('merchant_category_id').count().sort_values(by='authorized_flag', ascending=False).head()

The merchant_category_id gives a similar result to the merchant_id: 705, 307 and 367 appear in both tables. Therefore it was decided not to include the merchants data frame, as it only provides additional information about the merchants and shows no specific pattern. It is deleted.

In [None]:
del merchants

<a id="4"></a> <br>
# **5. Preparing the Data**

In the next step the data is prepared. In this chapter columns are binarized, missing values are handled and categorical values are encoded. To binarize the columns a function was written in which *Y* is mapped to 1 and *N* to 0. The columns *authorized_flag* and *category_1* were identified as the columns to be binarized. At the end of the chapter the target column is extracted and stored separately. It nevertheless stays in the train data frame until the second data exploration is finished.

In [None]:
# Define binarize function to binarize some columns
def binarize(df):
    for col in ['authorized_flag', 'category_1']:
        df[col] = df[col].map({'Y':1, 'N':0})
    return df

In [None]:
# Binarize function is applied to new_trans and hist_trans
new_trans = binarize(new_trans)
hist_trans = binarize(hist_trans)
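
A minimal sketch of what the mapping does, on a hypothetical toy frame. It also illustrates one pitfall of `map` with a dictionary: values that are not keys of the dictionary become NaN, so applying the mapping a second time to already binarized columns would wipe them out.

```python
import pandas as pd

# Hypothetical toy frame with the two Y/N columns
toy = pd.DataFrame({
    "authorized_flag": ["Y", "N", "Y"],
    "category_1": ["N", "N", "Y"],
})

for col in ["authorized_flag", "category_1"]:
    toy[col] = toy[col].map({"Y": 1, "N": 0})

print(toy)

# Mapping again finds neither 'Y' nor 'N' and therefore returns only NaN
mapped_twice = toy["authorized_flag"].map({"Y": 1, "N": 0})
print(mapped_twice.isnull().all())
```

In a notebook this means the binarize cell should only be run once per freshly loaded data frame.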

As discovered, there are missing values in nearly all data frames. Only the train data does not contain any. The merchants data frame has been deleted and hence needs no adjustment. Therefore the focus is now on the other three data frames. The test set is covered first. It only contains missing values in one column.

In [None]:
# Handling missing values in the test set - filtering rows with missing values
missing_test = test[test.isnull().any(axis=1)]
missing_test

There is only one row with a missing value. As it is in the column *first_active_month*, it is checked when the first transaction occurred.
The card_id of this row is C_ID_c27b4f80f7.

In [None]:
missing_value_test = hist_trans.loc[hist_trans['card_id'].isin(['C_ID_c27b4f80f7'])]

In [None]:
missing_value_test.sort_values(by='purchase_date').head()

The first transaction has the timestamp '2017-03-09'. This timestamp is used to fill the missing value.

In [None]:
values = {'first_active_month': '2017-03-09'}
test = test.fillna(value=values)

In [None]:
del missing_value_test

As the train and test data frames are complete now, the focus is set on the new and historical transactions. Both data frames have missing values in the same columns, as seen in the first data exploration. The affected columns are category_2, category_3 and merchant_id.
To fill the gaps in the merchant_id column the most frequent merchant is used. As seen in the chapter focusing on the merchants table, the most frequent one is M_ID_00a6ca8a8a, both for the lower and for the higher scores. For the other two affected columns the gaps are also filled with the most frequent value.

In [None]:
# Finding most common value to fill in category_2
hist_trans['category_2'].value_counts()

In [None]:
# Finding most common value to fill in category_3
hist_trans['category_3'].value_counts()

In [None]:
# Handling missing values in new and historical transactions
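for df in [hist_trans, new_trans]:
    # Assign the result back instead of calling fillna(inplace=True) on a selected column
    df['category_2'] = df['category_2'].fillna(1.0)
    df['category_3'] = df['category_3'].fillna('A')
    df['merchant_id'] = df['merchant_id'].fillna('M_ID_00a6ca8a8a')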
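
The fill values above are hard-coded after reading them off the `value_counts()` output. As an alternative, a minimal sketch of how the most frequent value could be determined programmatically with `Series.mode()`; the toy data is hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in both categorical columns
toy = pd.DataFrame({
    "category_2": [1.0, 1.0, 3.0, np.nan],
    "category_3": ["A", "B", "A", None],
})

for col in ["category_2", "category_3"]:
    # mode() ignores missing values by default; take the first entry in case of ties
    most_common = toy[col].mode().iloc[0]
    toy[col] = toy[col].fillna(most_common)

print(toy)
```

This keeps the fill value in sync with the data if the input files ever change.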

Now that all missing values are filled, the categorical values are taken care of. The train and test set have three features. As feature_3 only contains 0 and 1, nothing has to be done. For the other two columns the pd.get_dummies function is used.

In [None]:
# Treating feature_1 and feature_2 as categorical values
train = pd.get_dummies(train, columns=['feature_1', 'feature_2'])
test = pd.get_dummies(test, columns=['feature_1', 'feature_2'])

In the historical and new transaction data frames the columns category_2 and category_3 were identified as categorical values. Therefore again the pd.get_dummies function is used.

In [None]:
# Handling categorical values in the new_transaction and historical_transactions
hist_trans = pd.get_dummies(hist_trans, columns=['category_2','category_3'])
new_trans = pd.get_dummies(new_trans, columns=['category_2','category_3'])
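
A minimal sketch of the column names that `pd.get_dummies` produces, on a hypothetical toy frame. The names matter because the aggregate functions defined later refer to columns such as `category_2_1.0` and `category_3_A`.

```python
import pandas as pd

# Hypothetical toy frame with a float-coded and a letter-coded category
toy = pd.DataFrame({
    "category_2": [1.0, 2.0, 1.0],
    "category_3": ["A", "B", "C"],
})

encoded = pd.get_dummies(toy, columns=["category_2", "category_3"])
print(list(encoded.columns))
```

Only categories that actually occur get a column, so a value missing from one of the two transaction data frames would lead to a missing dummy column there.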

To close this chapter the target column is extracted and stored as an independent variable.

In [None]:
# Get target column
target = train['target']

<a id="5"></a> <br>
# **6. Feature Engineering**

There are five separate data sets available. In this part new features are created in the different data sets.

First of all the pandas *to_datetime* function is used to convert the column *first_active_month* to datetime. From this afterwards the year and the month are extracted and stored in own columns. The days are extracted as well to calculate the elapsed time that is also stored in its own column. This is applied for the train set as well as for the test set.

In [None]:
# Adding columns start_year, start_month and elapsed_time
for dataframe in [train, test]:
    dataframe['first_active_month'] = pd.to_datetime(dataframe['first_active_month'])
    dataframe['start_year'] = dataframe['first_active_month'].dt.year
    dataframe['start_month'] = dataframe['first_active_month'].dt.month
    dataframe['elapsed_time'] = (datetime.datetime.today() - dataframe['first_active_month']).dt.days

The next feature that is created is the month difference. It is created in a new column for the new transactions as well as for the historical transactions. The difference between the current date and the purchase date is calculated in days and then integer-divided by 30. The result is the number of months that have passed since the purchase. The month lag of each transaction is then added to the result.

In [None]:
# Creating the new column month_diff from purchase_date and month_lag
hist_trans['month_diff'] = ((datetime.datetime.today() - hist_trans['purchase_date'])
                            .dt.days)//30
hist_trans['month_diff'] += hist_trans['month_lag']

new_trans['month_diff'] = ((datetime.datetime.today() - new_trans['purchase_date'])
                           .dt.days)//30
new_trans['month_diff'] += new_trans['month_lag']
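
Because `datetime.datetime.today()` changes from day to day, the resulting feature values differ between runs. A minimal sketch of the same calculation with a fixed reference date; `REFERENCE_DATE` and the toy rows are assumptions for illustration and are not part of the kernel:

```python
import pandas as pd

# Hypothetical fixed reference date that replaces today()
REFERENCE_DATE = pd.Timestamp("2018-05-01")

toy = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2018-01-01", "2018-03-02"]),
    "month_lag": [-3, -1],
})

# Days since the purchase, integer-divided by 30, plus the month lag
toy["month_diff"] = (REFERENCE_DATE - toy["purchase_date"]).dt.days // 30
toy["month_diff"] += toy["month_lag"]
print(toy)
```

With a fixed date the engineered features stay reproducible, whichever day the notebook is executed.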

The next features are time features for the historical and new transactions that are created from the column *purchase_date*. As in the beginning, to_datetime is used to convert the *purchase_date* column. Afterwards month, year, day of the week and hour are extracted and added as columns. Finally another column is added that indicates whether the purchase was made on a weekend or during the week.

In [None]:
# Adding the columns purchase_month, purchase_year, purchase_dayofweek, purchase_hour and purchase_weekend
for dataframe in [hist_trans, new_trans]:
    dataframe['purchase_date'] = pd.to_datetime(dataframe['purchase_date'])
    dataframe['purchase_month'] = dataframe['purchase_date'].dt.month
    dataframe['purchase_year'] = dataframe['purchase_date'].dt.year
    dataframe['purchase_dayofweek'] = dataframe['purchase_date'].dt.dayofweek
    dataframe['purchase_hour'] = dataframe['purchase_date'].dt.hour
    dataframe['purchase_weekend'] = (dataframe.purchase_date.dt.weekday >=5).astype(int)
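
A minimal sketch of the time features on two hypothetical purchases, one on a Saturday and one on a Monday. In pandas `dt.dayofweek` counts from Monday = 0 to Sunday = 6, so values of 5 and above mark the weekend.

```python
import pandas as pd

# Hypothetical purchases: 2018-01-06 is a Saturday, 2018-01-08 a Monday
toy = pd.DataFrame({
    "purchase_date": pd.to_datetime(["2018-01-06 14:30:00", "2018-01-08 09:00:00"]),
})

toy["purchase_dayofweek"] = toy["purchase_date"].dt.dayofweek
toy["purchase_hour"] = toy["purchase_date"].dt.hour
toy["purchase_weekend"] = (toy["purchase_dayofweek"] >= 5).astype(int)
print(toy)
```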


<a id="6"></a> <br>
# **7. Merging the Data Sets**

As the tables are still disconnected they have to be merged. As there are different amounts of rows for each card_id, aggregate functions for each column have to be defined in order to group the rows by card_id.

In [None]:
# Defining aggregate functions
aggregate_function = {
    'authorized_flag': ['sum', 'mean'],
    'card_id': ['size'],
    'category_1': ['sum', 'mean'],
    'category_2_1.0': ['mean', 'sum'],
    'category_2_2.0': ['mean', 'sum'],
    'category_2_3.0': ['mean', 'sum'],
    'category_2_4.0': ['mean', 'sum'],
    'category_2_5.0': ['mean', 'sum'],
    'category_3_A': ['mean', 'sum'],
    'category_3_B': ['mean', 'sum'],
    'category_3_C': ['mean', 'sum'],
    'merchant_id': ['nunique'],
    'merchant_category_id': ['nunique'],
    'state_id': ['nunique'],
    'city_id': ['nunique'],
    'subsector_id': ['nunique'],
    'purchase_amount': ['sum', 'mean', 'max', 'min', 'std', 'var'],
    'installments': ['sum', 'mean', 'max', 'min', 'std', 'var'],
    'month_lag': ['mean', 'max', 'min', 'std', 'var'],
    'month_diff': ['mean'],
    'purchase_date': ['max', 'min'],
    'purchase_month': ['nunique'],
    'purchase_year': ['nunique'],
    'purchase_dayofweek': ['nunique'],
    'purchase_hour': ['nunique'],
    'purchase_weekend': ['sum', 'mean']
}

Before the tables can be merged, new column names for the aggregated columns are created. For this a function was written that joins the prefix hist or new (depending on the data frame), the name of the column and the corresponding aggregate function. Then the data frames are grouped by card_id and the created names are assigned to the new columns.

In [None]:
def new_columns(name, aggregate_functions):
    # Build one column name per (column, aggregate function) pair, e.g. hist_purchase_amount_sum
    column = []
    for k in aggregate_functions.keys():
        for agg in aggregate_functions[k]:
            column.append(name + '_' + str(k) + '_' + str(agg))
    return column

new_hist_columns = new_columns('hist', aggregate_function)
new_new_columns = new_columns('new', aggregate_function)

In [None]:
# grouping the transaction on card_id by the above defined aggregate functions
grouped_hist = hist_trans.groupby(['card_id']).agg(aggregate_function)
grouped_hist.columns = new_hist_columns
grouped_hist.reset_index(drop=False,inplace=True)
del hist_trans

In [None]:
# grouping the transaction on card_id by the above defined aggregate functions
grouped_new = new_trans.groupby(['card_id']).agg(aggregate_function)
grouped_new.columns = new_new_columns
grouped_new.reset_index(drop=False,inplace=True)
del new_trans
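
The grouping and renaming can be checked on a small example. The sketch below uses a hypothetical toy frame and a reduced dictionary of aggregate functions (`toy_aggs`). Instead of a separately built list, it derives the flat names directly from the two-level column index that `agg` returns, which is one way to make sure that names and columns stay in the same order.

```python
import pandas as pd

# Hypothetical toy transactions for two cards
toy = pd.DataFrame({
    "card_id": ["C_1", "C_1", "C_2"],
    "purchase_amount": [1.0, 3.0, 5.0],
    "installments": [1, 2, 3],
})
toy_aggs = {"purchase_amount": ["sum", "mean"], "installments": ["max"]}

grouped = toy.groupby("card_id").agg(toy_aggs)

# Flatten the (column, function) pairs into names such as hist_purchase_amount_sum
grouped.columns = ["hist_" + "_".join(pair) for pair in grouped.columns]
grouped = grouped.reset_index()
print(grouped)
```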

The next step belongs to the feature engineering, but it requires the tables to be grouped by card_id already and was therefore moved to this chapter. Three new features are created for each of the grouped data frames. The column *purchase_date_diff* is the difference in days between the maximum and the minimum purchase date. The column *purchase_date_avg* divides this new column by the column card_id_size, which was created by the grouping. The last column, *purchase_date_uptonow*, is the difference in days between today's date and the maximum purchase date.

In [None]:
# Adding purchase_date_diff, purchase_date_avg and purchase_date_uptonow to both grouped data frames
for x, dataframe in [('hist', grouped_hist), ('new', grouped_new)]:
    dataframe[x+'_purchase_date_diff'] = (dataframe[x+'_purchase_date_max'] - dataframe[x+'_purchase_date_min']).dt.days
    dataframe[x+'_purchase_date_avg'] = dataframe[x+'_purchase_date_diff']/dataframe[x+'_card_id_size']
    dataframe[x+'_purchase_date_uptonow'] = (datetime.datetime.today() - dataframe[x+'_purchase_date_max']).dt.days

After grouping, the grouped historical and new data frames are merged into the train and test set on card_id in the form of a left join. The temporary data frames are removed afterwards. Then a function is used to reduce the memory usage.

In [None]:
# Join historical and new transactions into train and test
train = pd.merge(train, grouped_hist, on='card_id', how='left')
train = pd.merge(train, grouped_new, on='card_id', how='left')
test = pd.merge(test, grouped_new, on='card_id', how='left')
test = pd.merge(test, grouped_hist, on='card_id', how='left')
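
A minimal sketch of what the left join does for cards that have no rows in one of the grouped data frames, using hypothetical toy frames. The `indicator=True` argument of `pd.merge` adds a `_merge` column that shows where each row came from.

```python
import pandas as pd

# Hypothetical cards and a grouped frame that lacks card C_2
cards = pd.DataFrame({"card_id": ["C_1", "C_2", "C_3"]})
grouped_new_toy = pd.DataFrame({
    "card_id": ["C_1", "C_3"],
    "new_card_id_size": [4, 1],
})

merged = pd.merge(cards, grouped_new_toy, on="card_id", how="left", indicator=True)
print(merged)
```

All cards are kept, and the aggregated columns of cards without matching transactions are NaN. These are the missing values that have to be filled before the prediction.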

In [None]:
# Delete grouped dataframes
del grouped_hist
del grouped_new

The following function has been copied from the kernel [Elo World](https://www.kaggle.com/fabiendaniel/elo-world). It is a great function to reduce the memory usage. As this competition handles huge amounts of data, it is quite helpful.

In [None]:
# reduce_mem_usage was taken over from Elo World
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
# Reduce memory usage
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
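
The idea behind the function is downcasting: a column is stored in the smallest numeric type that can hold its values. A minimal sketch of the same idea with the pandas built-in `pd.to_numeric(..., downcast=...)` on a hypothetical toy frame. Note that the built-in does not go below float32, whereas the function above may choose float16, which cannot represent every integer above 2048 exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame stored in 64-bit types
toy = pd.DataFrame({
    "small_int": np.array([1, 2, 3], dtype=np.int64),
    "value": np.array([0.5, 1.5, 2.5], dtype=np.float64),
})
before = toy.memory_usage(deep=True).sum()

toy["small_int"] = pd.to_numeric(toy["small_int"], downcast="integer")
toy["value"] = pd.to_numeric(toy["value"], downcast="float")
after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(before, "->", after, "bytes")

# float16 precision limit: 2049 is rounded to the nearest representable value
rounded = float(np.float16(2049.0))
print(rounded)
```

Whether the loss of precision in float16 matters depends on the feature. For large sums or variances it may be worth keeping float32.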

<a id="7"></a> <br>
# **8. Second Data Exploration**

Now that the data frames are merged, there is a new base to explore and new connections to be made. To get a quick overview, a correlation matrix in the form of a heatmap is created, but no real conclusions are possible as it includes too many features.

In [None]:
# Correlation matrix
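# Only numeric and boolean columns can be correlated; card_id and the date columns are left out
corrmat = train.select_dtypes(include=[np.number, 'bool']).corr()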
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True);

As the first correlation matrix was unclear, a second one is created on a set of chosen columns. The selection was made randomly.

In [None]:
#Correlation matrix on a reduced data frame
columns = ['target', 'hist_month_diff_mean', 'hist_category_1_sum', 'hist_purchase_month_nunique',
           'hist_category_3_B_sum', 'start_year', 'elapsed_time', 'new_merchant_category_id_nunique',
           'hist_authorized_flag_mean']

colormap = plt.cm.RdBu
plt.figure(figsize=(12,12))
sns.heatmap(train[columns].corr(), linewidths=0.1, vmax=1.0, vmin=-1., square=True, cmap=colormap, linecolor='white', annot=True)
plt.title('Pair-wise correlation')

There is a strong negative linear relationship between start_year and elapsed_time. This can be traced back to the fact that both were created from the same column.

Another visualization is done in the form of scatterplots. The columns used are a subset of those in the previous correlation matrix.
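
A minimal sketch of why these two features are almost perfectly negatively correlated, using hypothetical activation dates and an assumed fixed reference date in place of `today()`:

```python
import pandas as pd

# Hypothetical activation dates and an assumed reference date
toy = pd.DataFrame({
    "first_active_month": pd.to_datetime(
        ["2015-01-01", "2016-01-01", "2017-01-01", "2018-01-01"]
    ),
})
reference_date = pd.Timestamp("2018-02-01")

toy["start_year"] = toy["first_active_month"].dt.year
toy["elapsed_time"] = (reference_date - toy["first_active_month"]).dt.days

# Later start years always mean fewer elapsed days
corr = toy["start_year"].corr(toy["elapsed_time"])
print(corr)
```

Both columns carry nearly the same information, so one of them adds little for a model.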

In [None]:
#scatterplots of the chosen columns
columns = ['target','hist_month_diff_mean','hist_category_1_sum','hist_category_3_B_sum',
           'start_year','elapsed_time','hist_authorized_flag_mean']

sns.set()
# The pairplot argument size was renamed to height in seaborn 0.9
sns.pairplot(train[columns], height=2.5)
plt.show()

The same results can be observed in the pairplots. The negative correlation is visible. A slight correlation is also observable between *count_hist_transactions* and *hist_category_3_B_sum*. This is also displayed in the correlation matrix.

In [None]:
del train['target']

<a id="8"></a> <br>
# **9. Preparing the Prediction**

Before the scores for the test set can be predicted, the train and test set need some final adjustments.
First the columns *first_active_month* and *card_id* are dropped as they are not relevant for the prediction.

In [None]:
# Dropping the columns first_active_month and card_id
x_train = train.drop(['first_active_month', 'card_id'], axis=1)
x_test = test.drop(['first_active_month', 'card_id'], axis=1)

In [None]:
x_train.to_csv("missing_values_train.csv", index=False)
x_test.to_csv("missing_values_test.csv", index=False)

As not all card_ids have transactions in the new transactions data frame, there are again columns with missing values, which have to be taken care of. Most of the columns can simply be filled with zero. Nevertheless there are some columns that might be better filled with the mean of the respective column.

In [None]:
# Define non zero value columns and fill them with the mean
# (currently disabled, so all missing values are filled with zero in the next cell)
#non_zero_value_columns = ['new_month_lag_mean', 'new_month_lag_max', 'new_month_lag_min',
#                          'new_month_lag_std', 'new_month_lag_var', 'new_purchase_date_diff',
#                          'new_purchase_date_avg', 'new_purchase_date_uptonow']
#
#for x in non_zero_value_columns:
#    x_train[x] = x_train[x].fillna(x_train[x].mean())
#    x_test[x] = x_test[x].fillna(x_train[x].mean())

In [None]:
# Fill the remaining missing values
x_train = x_train.fillna(0)
x_test = x_test.fillna(0)
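
A minimal sketch of the mean filling described above, on hypothetical toy frames with a single column. The mean is computed once on the train data and reused for the test data, and `mean()` has to be called with parentheses, otherwise the method object itself would be passed to `fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with one of the columns that could be mean-filled
toy_train = pd.DataFrame({"new_month_lag_mean": [1.0, 2.0, np.nan]})
toy_test = pd.DataFrame({"new_month_lag_mean": [np.nan, 1.5]})

for col in ["new_month_lag_mean"]:
    train_mean = toy_train[col].mean()  # NaN values are skipped by default
    toy_train[col] = toy_train[col].fillna(train_mean)
    toy_test[col] = toy_test[col].fillna(train_mean)

print(toy_train)
print(toy_test)
```

Using the train mean for both data frames ensures that a missing value is replaced by the same number in training and prediction.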

As there are some columns in formats that can't be used for the chosen prediction models, a list is filled with the fitting columns.

In [None]:
# Columns to use for train set (the raw purchase date columns are excluded)
date_columns = ['hist_purchase_date_min', 'hist_purchase_date_max',
                'new_purchase_date_max', 'new_purchase_date_min']
columns_to_use = [c for c in x_train.columns.values if c not in date_columns]

<a id="11"></a> <br>
# **10. Testing different Prediction Models**

In the course of the data science class different prediction models were introduced. The following prediction models are tested on a train test split and afterwards the scores are presented in a table:
* DecisionTreeRegressor
* RandomForestRegressor
* GradientBoostingRegressor
* LightGBM
* ElasticNetCV
* RidgeCV
* LassoPath

The first four are based on tree algorithms, while the other three perform linear regression with cross validation.
A train test split is created and later applied to the different models.

As the amount of data leads to problems with the RAM, these models were executed in a second kernel and the results were imported into this kernel afterwards.

In [None]:
# Train test split
X_train, X_test, Y_train, Y_test = train_test_split(x_train[columns_to_use], target, test_size=0.3, random_state=42)

In [None]:
Y_train = pd.DataFrame(Y_train)
Y_test = pd.DataFrame(Y_test)

In [None]:
# Exporting the prepared data frames to use them in a second kernel
X_train.to_csv("X_train.csv", index=False)
Y_train.to_csv("Y_train.csv", index=False)
X_test.to_csv("X_test.csv", index=False)
Y_test.to_csv("Y_test.csv", index=False)

In [None]:
# Importing the results from the second kernel
RMSE_table = pd.read_csv("../input/rmse-table/RMSE_table.csv")

In [None]:
RMSE_table

**Conclusion**: As the table above shows, LightGBM has the best score, closely followed by the GradientBoostingRegressor. The RandomForestRegressor reaches a similar score. Therefore a combination of tree algorithms was chosen.

<a id="9"></a> <br>
# **11. Ensemble Prediction**

Based on the results of the different prediction models, it was decided to use an ensemble prediction, that is, a combination of the three models LightGBM, RandomForestRegressor and GradientBoostingRegressor. LightGBM and GradientBoostingRegressor are both boosting tree algorithms; RandomForestRegressor is a tree algorithm as well, but it uses bootstrapping. The LightGBM parameters returned the best result in the previous chapter, so the parameters of the GradientBoostingRegressor and RandomForestRegressor were adapted accordingly.
First all three models are trained. The first model is LightGBM.

**LightGBM**

The parameters are defined and the model is trained.

In [None]:
# Dataset definition LGB
train_data = lgb.Dataset(X_train, label=Y_train)
test_data = lgb.Dataset(X_test, label=Y_test)

lgb_params = {
    "objective": "regression",
    "metric": "rmse",
    "max_depth": 8,
    "min_child_samples": 100,
    "reg_alpha": 1,
    "reg_lambda": 1,
    "num_leaves": 64,
    "learning_rate": 0.01,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "verbosity": -1,
}

# Model training: stop once the validation RMSE has not improved for 100 rounds.
# The callback replaces the early_stopping_rounds argument, which newer
# LightGBM versions no longer accept in lgb.train.
lgb_model = lgb.train(
    lgb_params,
    train_data,
    num_boost_round=100000,
    valid_sets=[test_data],
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)

**Gradient Boosting Regressor**

As above, the parameters are defined and the model is trained. As mentioned before, the parameters were adapted to those of the LightGBM model.

In [None]:
boost_reg = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01, subsample=0.8, max_depth=8)
boost_reg.fit(x_train[columns_to_use], target)

For the Gradient Boosting Regressor, the relative feature importance is printed in descending order.

In [None]:
# Plotting the feature_importance
feature_importance = boost_reg.feature_importances_
# make importances relative to max importance
feature_importance = 100.0 * (feature_importance / feature_importance.max())

rel_imp = pd.Series(feature_importance, index=x_train[columns_to_use].columns).sort_values(inplace=False, ascending=False)
print(rel_imp)

To visualize the printed values, the same information is plotted in a bar chart.

In [None]:
(pd.Series(feature_importance, index=x_train[columns_to_use].columns)
   .nlargest(20)
   .plot(kind='barh'))
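The scaling relative to the maximum importance can be illustrated with a small worked example. The feature names and raw importances below are hypothetical; only the arithmetic mirrors the cells above.

```python
import pandas as pd

# Hypothetical raw importances; feature names are illustrative only
raw_importance = pd.Series(
    {"feature_a": 0.50, "feature_b": 0.25, "feature_c": 0.05}
)

# Scale so that the most important feature equals 100
relative_importance = 100.0 * raw_importance / raw_importance.max()

# nlargest returns the top entries already sorted in descending order
top_two = relative_importance.nlargest(2)
print(relative_importance)
print(top_two)
```

After scaling, every value reads as a percentage of the strongest feature, which makes the bar chart easier to compare across models than the raw importances.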

**RandomForestRegressor**

The model is trained. As stated above, the parameters were adapted to those of the LightGBM model.

In [None]:
model_random = RandomForestRegressor(n_estimators=500, max_depth=8)
model_random.fit(x_train[columns_to_use], target)

Predictions are made with all three models, and a weighted mean of them is taken as the final prediction. In addition, four submission files are created, one for the ensemble and one for each single model, so that the results can be compared.

In [None]:
# Make predictions based on the RandomForestRegressor
random_prediction = model_random.predict(x_test[columns_to_use])

In [None]:
# Make predictions based on the boosting regressor
boost_prediction = boost_reg.predict(x_test[columns_to_use])

In [None]:
# Make predictions based on the lightgbm regressor
lgb_prediction = lgb_model.predict(x_test[columns_to_use])

In [None]:
# Blended predictions: weighted average of the three models
prediction = (0.33*boost_prediction) + (0.34*lgb_prediction) + (0.33*random_prediction)
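The effect of such a weighted average can be checked on held-out data before submitting. The sketch below uses invented targets and predictions, and the helpers `blend` and `rmse` are hypothetical names introduced here, not functions from the kernel.

```python
import numpy as np


def blend(predictions, weights):
    """Weighted average of equally long prediction arrays."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    return np.average(np.vstack(predictions), axis=0, weights=weights)


def rmse(y_true, y_pred):
    """Root mean squared error between two arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Hypothetical hold-out targets and model outputs
y_holdout = np.array([0.0, 1.0, -1.0, 2.0])
pred_a = np.array([0.5, 1.5, -0.5, 2.5])   # always 0.5 too high
pred_b = np.array([-0.5, 0.5, -1.5, 1.5])  # always 0.5 too low
pred_c = np.array([0.0, 1.0, -1.0, 2.0])   # matches the targets

blended = blend([pred_a, pred_b, pred_c], [0.33, 0.33, 0.34])
print(rmse(y_holdout, pred_a), rmse(y_holdout, blended))
```

In this constructed case the errors of the first two models cancel, so the blend is better than either of them alone. On real data the gain depends on how correlated the model errors are.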

Finally the submission files are created.

In [None]:
# Submission for ensemble prediction
x_test_id = test['card_id']
sub_df = pd.DataFrame({"card_id":x_test_id.values})
sub_df["target"] = prediction
sub_df.to_csv("submission_ensemble.csv", index=False)

In [None]:
# Submission file for RandomForestRegressor
sub_df = pd.DataFrame({"card_id":x_test_id.values})
sub_df["target"] = random_prediction
sub_df.to_csv("submission_random.csv", index=False)

In [None]:
# Submission file for GradientBoostingRegressor
sub_df = pd.DataFrame({"card_id":x_test_id.values})
sub_df["target"] = boost_prediction
sub_df.to_csv("submission_boost.csv", index=False)

In [None]:
# Submission file for LightGBM
sub_df = pd.DataFrame({"card_id":x_test_id.values})
sub_df["target"] = lgb_prediction
sub_df.to_csv("submission.csv", index=False)
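The four submission cells repeat the same three steps, so they could be wrapped in a small helper. The function `write_submission`, the ids and the output file name below are hypothetical and only show the pattern.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_submission(card_ids, predictions, path):
    """Build a card_id/target frame, write it to `path`, and return it."""
    sub = pd.DataFrame({"card_id": np.asarray(card_ids), "target": predictions})
    sub.to_csv(path, index=False)
    return sub


# Hypothetical ids and predictions, written to a temporary location
toy_path = os.path.join(tempfile.gettempdir(), "submission_toy.csv")
toy_sub = write_submission(
    ["C_ID_0001", "C_ID_0002"], np.array([0.1, -0.3]), toy_path
)
print(pd.read_csv(toy_path))
```

In the kernel, a call such as `write_submission(x_test_id.values, lgb_prediction, "submission.csv")` would then replace one of the cells above.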

<a id="12"></a> <br>
# **12. Project Reflection**

This project was a great opportunity to apply concepts learned in class. Nevertheless, there were many struggles along the way. Compared to the regression problems previously used in class, the Elo competition presented a more complicated scenario, because it involves five different data frames instead of just two. An understanding of the data itself therefore had to be built first. Once it was clear that these data frames needed to be combined, the first real struggle appeared: the data sets are so large that every operation on them was computationally expensive. First attempts on zeno were unsuccessful, as the amount of data crashed the server. As the availability of the zeno server was often limited anyway, it was decided to continue working in a Kaggle kernel, although Kaggle kernels also sometimes die when the available RAM is used up.
The next obstacle was the prediction model. Initial linear regression models returned poor results, so inspiration was sought in kernels from other users. After trying different approaches, it was decided to focus mainly on tree algorithms, as they appear to be more suitable.
The last barrier was joining the tables and creating new features. The join itself was not technically difficult, but an aggregate function had to be defined for each column. Creating new features was difficult because the content of the provided tables was hard to understand. The merchants table did not make this easier: attempts were made to include its data, but in the end no way was found to incorporate it so that it contributed to the final result.
After all the data had been joined, the last problem was that many card_ids did not have any new transactions, which resulted in many missing values. It had to be analyzed whether to ignore the new transactions or to include them and fill the gaps, and, if filling was chosen, which values to use. Surprisingly, the best result came from simply filling the missing values with zero, as done in this kernel. Nevertheless, there should be a better solution for this.