In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Contents
- [Introduction](#introduction)
    * [Problem Statement](#problem-statement)
- [Exploratory Data Analysis](#eda)
- [Feature Engineering](#feature-engineering)
- [Model Building](#model-building)
    * [Linear Regression](#linear-regression)
    * [Gradient Boosted Trees](#gradient-boosted-trees)
- [Model Analysis](#model-analysis)
- [Conclusion](#conclusion)




## Introduction<a class="anchor" id="introduction"></a>


Elo, one of Brazil's largest payment brands, has partnered with many merchants to offer promotions and discounts to its cardholders. Elo aimed to reduce marketing that is irrelevant to members and to offer them custom-tailored promotions, thereby providing an enjoyable experience and a beneficial service. To that end, Elo launched a Kaggle competition, enlisting the Kaggle community's help to produce a machine learning model that can find signal between transaction data and loyalty. Such a model will help Elo gauge customer loyalty and how promotional strategies affect it.

*The data provided is simulated and fictitious. It does not contain real customer data.*


### Problem Statement<a class="anchor" id="problem-statement"></a>
Build a machine learning model that can effectively predict customer loyalty scores using transaction data.

## Exploratory Data Analysis <a class="anchor" id="eda"></a>

The data consisted of five main tables: **train data**, **test data**, **historical transactions**, **new merchant transactions**, and **merchants**.

The **train data** table contained card_ids, loyalty scores, and 3 arbitrary features provided by Elo. The arbitrary features were not very useful, as they provided little signal for predicting loyalty scores.

The **test data** table contained card_ids and the same arbitrary features as **train data**, but no loyalty scores.

**historical transactions** contained details on purchases made by all the cardholders in the dataset. Details included purchase amount, merchant category, etc. Every card_id had at least 3 months of records.

**new merchant transactions** contained transaction data from new merchants that the cardholder had not yet purchased from in the historical transaction data. Every card_id had up to two months of new merchant data after a reference date (which differed for each card_id).

**merchants** contained details on the merchants seen in **historical transactions** and **new merchant transactions**.


The data was heavily anonymized. Non-numerical fields such as city of purchase and merchant category were replaced with arbitrary values, so it is difficult to apply real-world knowledge and find insights.





In [None]:
# load data

# train data
traindf = pd.read_csv('../input/elo-merchant-category-recommendation/train.csv')

# given test data with no loyalty score
giventestdf = pd.read_csv('../input/elo-merchant-category-recommendation/test.csv')
giventestdf['card_id'].nunique()

# historical transaction data
histtransdf = pd.read_csv('../input/elo-merchant-category-recommendation/historical_transactions.csv')

# new merchant transactional data
newtransdf = pd.read_csv('../input/elo-merchant-category-recommendation/new_merchant_transactions.csv')

# merchant data
merchdf = pd.read_csv('../input/elo-merchant-category-recommendation/merchants.csv')

In [None]:
# training dataset at a glance
traindf.head()

In [None]:
traindf.card_id.nunique()

In [None]:
histtransdf.head()

In [None]:
histtransdf['card_id'].nunique()

In [None]:
histtransdf.shape

In [None]:
newtransdf.card_id.nunique()

In [None]:
merchdf['merchant_id'].nunique()

There are 201,917 card ids in the training data set.

There are 123,623 card ids in the test data. This data cannot be used to train the model, as it does not have loyalty scores (the response variable).

The historical transaction data has 325,540 card ids with 29,112,361 transactions. Train and test card ids are both included in this historical data.

The new merchant transaction data has 290,001 card ids.

There are 334,633 different merchants in the merchant data.
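
The counts above are consistent with the train and test sets being disjoint and fully covered by the historical data, since 201,917 + 123,623 = 325,540. A minimal sketch of how that coverage could be checked with set operations follows. It uses small synthetic frames (`train`, `test`, `hist`, all hypothetical stand-ins for `traindf`, `giventestdf`, and `histtransdf`) rather than the competition files, so it only illustrates the idea.

```python
import pandas as pd

# Synthetic stand-ins for traindf, giventestdf and histtransdf (made-up ids)
train = pd.DataFrame({'card_id': ['C1', 'C2', 'C3']})
test = pd.DataFrame({'card_id': ['C4', 'C5']})
hist = pd.DataFrame({'card_id': ['C1', 'C1', 'C2', 'C3', 'C4', 'C5', 'C5']})

train_ids = set(train['card_id'])
test_ids = set(test['card_id'])
hist_ids = set(hist['card_id'])

# card ids in train or test that have no historical transactions
missing_from_hist = (train_ids | test_ids) - hist_ids

# card ids that appear in both train and test
shared = train_ids & test_ids

print(len(missing_from_hist), len(shared), len(hist_ids))
```

If both sets come back empty on the real data, every train and test card has history and the two sets do not overlap.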

In [None]:
# plotting loyalty score distribution
traindf['target'].hist()
plt.xlabel('Loyalty Score')
plt.ylabel('Count')
plt.title('Loyalty Score Distribution');

Loyalty scores are roughly normally distributed, mostly ranging from -10 to 10. There are some outliers around -33.
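
To see how much of the data sits in that outlier group, one option is to flag scores below a cutoff and count them. The sketch below runs on a tiny synthetic `scores` frame, and the cutoff of -30 is an assumption chosen only to separate the cluster near -33 from the main distribution; neither comes from the competition data.

```python
import pandas as pd

# Synthetic loyalty scores for illustration only
scores = pd.DataFrame({'target': [0.5, -1.2, 3.3, -33.2, 0.0, -33.2, 7.9]})

# Hypothetical cutoff: anything below -30 is treated as an outlier
OUTLIER_CUTOFF = -30
is_outlier = scores['target'] < OUTLIER_CUTOFF

n_outliers = int(is_outlier.sum())   # number of flagged rows
outlier_share = is_outlier.mean()    # fraction of rows flagged

print(n_outliers, outlier_share)
```

The same boolean mask could be applied to `traindf['target']` to count the outliers or to plot the histogram without them.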

## Feature Engineering<a class='anchor' id='feature-engineering'></a>

`NDR` - (New Dollar Ratio) Calculated by dividing amount of dollars

In [None]:
# concatenating all transaction data and adding indicator of new merchant
histtransdf['new'] = False
newtransdf['new'] = True
# ignore_index=True gives the combined frame a fresh, unique row index
alltransdf = pd.concat([histtransdf, newtransdf], ignore_index=True)

In [None]:
alltransdf['purchase_date'] = pd.to_datetime(alltransdf['purchase_date'])

In [None]:
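# convert the 'Y'/'N' authorized flag to a boolean
alltransdf['authorized_flag'] = alltransdf['authorized_flag'] == 'Y'
# city_id is an anonymized label, so store it as a string rather than a number
alltransdf['city_id'] = alltransdf['city_id'].astype(str)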

## Model Building<a class='anchor' id='model-building'></a>

### Linear Regression <a class='anchor' id='linear-regression'></a>

### Gradient Boosted Trees <a class="anchor" id='gradient-boosted-trees'></a>

## Conclusion <a class='anchor' id='conclusion'></a>