**Elo Merchant Category Recommendation**

Competition description:

> Imagine being hungry in an unfamiliar part of town and getting restaurant recommendations served up, based on your personal preferences, at just the right moment. The recommendation comes with an attached discount from your credit card provider for a local place around the corner!
> 
> Right now, Elo, one of the largest payment brands in Brazil, has built partnerships with merchants in order to offer promotions or discounts to cardholders. But do these promotions work for either the consumer or the merchant? Do customers enjoy their experience? Do merchants see repeat business? Personalization is key.
> 
> Elo has built machine learning models to understand the most important aspects and preferences in their customers’ lifecycle, from food to shopping. But so far none of them is specifically tailored for an individual or profile. This is where you come in.
> 
> In this competition, Kagglers will **develop algorithms to identify and serve the most relevant opportunities to individuals, by uncovering signal in customer loyalty**. Your input will improve customers’ lives and help Elo reduce unwanted campaigns, to create the right experience for customers.

When I read this description (and looked at the data provided), I found it pretty confusing (or maybe I'm just not used to this). So, let's try to figure out what exactly is going on in this competition and what we are predicting.

Firstly, from the competition description, we need to "develop algorithms to identify and serve the most relevant opportunities to individuals, by uncovering signal in customer loyalty". 

What is loyalty? According to the Data_Dictionary.xlsx, **loyalty is a numerical score calculated 2 months after historical and evaluation period**. Additionally, by looking at historical_transactions.csv and new_merchant_transactions.csv, we can see that the historical transactions are the ones that occurred before the "reference date", and the new merchant transactions are the ones that occurred after it (according to the 'month_lag' field, which is generously described as "month lag to reference date").

**So, here is my understanding of the situation:**
* Based on the data in historical_transactions.csv, Elo picked new merchants to recommend for each card holder.
* The date when Elo began providing recommendations is called the 'reference date'.
* The recommended merchant data is not provided (so we can't figure out the recommendation algorithm Elo uses).
* After the reference date, for each card Elo gathered transaction history for all new merchants that appeared on the card.
* By comparing each card's new merchant activity and the secret list of the merchants recommended by Elo, the loyalty score was calculated.
* **Our goal is to evaluate Elo's recommendation algorithm by trying to predict in which cases it's going to work well (yielding a high loyalty score) and in which cases it isn't (yielding a low loyalty score)**.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print('List of files:')
print(os.listdir("../input"))
data_folder = '../input'

# Any results you write to the current directory are saved as output.

**Let's compare the data distribution in test.csv and train.csv.**

train.csv and test.csv column descriptions:

**card_id:**	Unique card identifier

**first_active_month:**	 'YYYY-MM', month of first purchase

**feature_1:**	Anonymized card categorical feature

**feature_2:**	Anonymized card categorical feature

**feature_3:**	Anonymized card categorical feature

**target:** Loyalty numerical score calculated 2 months after historical and evaluation period

Well, that's not very informative. Let's look at the sizes of these tables:

In [None]:
train = pd.read_csv(os.path.join(data_folder, 'train.csv'))
test = pd.read_csv(os.path.join(data_folder, 'test.csv'))

plt.figure(figsize=[4,3])
plt.bar([0, 1], [train.shape[0], test.shape[0]], edgecolor=[0.2]*3, color=(1,0,0,0.5))
plt.xticks([0,1], ['train rows', 'test rows'], fontsize=13)
plt.title('Number of rows in train.csv and test.csv', fontsize=15)
plt.tight_layout()
plt.show()

And feature distributions:

In [None]:
%matplotlib inline

plt.figure(figsize=[15,5])
plt.suptitle('Feature distributions in train.csv and test.csv', fontsize=20, y=1.1)
for num, col in enumerate(['feature_1', 'feature_2', 'feature_3', 'target']):
    plt.subplot(2, 4, num+1)
    if col != 'target':
        v_c = train[col].value_counts() / train.shape[0]
        plt.bar(v_c.index, v_c, label=('train'), align='edge', width=-0.3, edgecolor=[0.2]*3)
        v_c = test[col].value_counts() / test.shape[0]
        plt.bar(v_c.index, v_c, label=('test'), align='edge', width=0.3, edgecolor=[0.2]*3)
        plt.title(col)
        plt.legend()
    else:
        plt.hist(train[col], bins = 100)
        plt.title(col)
    plt.tight_layout()
plt.tight_layout()
plt.show()
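To put a number on what the bar charts show, one option is to compare the category shares directly. The sketch below runs on small made-up series standing in for a column of train.csv and test.csv; `share_gap` is a hypothetical helper of mine, not part of the competition data or of pandas.

```python
import pandas as pd

def share_gap(train_col, test_col):
    """Largest absolute difference between the category shares of two columns."""
    train_share = train_col.value_counts(normalize=True)
    test_share = test_col.value_counts(normalize=True)
    # align on the union of categories; a category missing on one side counts as 0
    diff = train_share.subtract(test_share, fill_value=0).abs()
    return diff.max()

# toy stand-ins for train['feature_1'] and test['feature_1'] (made-up values)
toy_train = pd.Series([1, 2, 2, 3, 3, 3, 4, 5])
toy_test = pd.Series([1, 2, 3, 3, 4, 5, 5, 5])
gap = share_gap(toy_train, toy_test)
print('largest share gap: {:.3f}'.format(gap))
```

On the real tables, a gap close to zero for each of the three features would back up the visual impression.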

Looks like the test and train data are distributed similarly. Additionally, there are some outliers in the target column. Let's take a look at them:

In [None]:
outliers = train.loc[train['target'] < -30]
non_outliers = train.loc[train['target'] >= -30]
print('{:d} outliers found (target < -30)'.format(outliers.shape[0]))

plt.figure(figsize=[10,5])
plt.suptitle('Outlier vs. non-outlier feature distributions', fontsize=20, y=1.1)

for num, col in enumerate(['feature_1', 'feature_2', 'feature_3', 'target']):
    if col != 'target':
        plt.subplot(2, 3, num+1)
        v_c = non_outliers[col].value_counts() / non_outliers.shape[0]
        plt.bar(v_c.index, v_c, label=('non-outliers'), align='edge', width=-0.3, edgecolor=[0.2]*3)
        v_c = outliers[col].value_counts() / outliers.shape[0]
        plt.bar(v_c.index, v_c, label=('outliers'), align='edge', width=0.3, edgecolor=[0.2]*3)
        plt.title(col)
        plt.legend()

plt.tight_layout()
plt.show()

There are some slight differences between outliers and non-outliers, but they don't seem to be that big and they certainly can't explain the difference between the target values, at least based on the features in the train dataset. It's probably a good idea to remove them, unless we can find differences between outliers and non-outliers in other tables that will allow us to detect them.
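If the outliers are kept around for comparison against the other tables, a binary flag makes them easy to carry through joins. This is only a sketch on made-up target values, using the same -30 cut-off as above; the `is_outlier` column name is my own choice.

```python
import pandas as pd

# made-up rows standing in for train.csv
toy = pd.DataFrame({
    'feature_1': [1, 2, 3, 3, 5, 5],
    'target': [0.4, -33.2, 1.7, -0.9, -33.2, 2.1],
})
# same cut-off as in the outlier cell above
toy['is_outlier'] = (toy['target'] < -30).astype(int)

# share of outliers overall and per feature value
overall_share = toy['is_outlier'].mean()
share_by_feature = toy.groupby('feature_1')['is_outlier'].mean()
print(overall_share)
print(share_by_feature)
```

The same per-group share could later be computed against columns from the merchant or transaction tables once they are joined on card_id.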

Correlation coefficients for all variables in train.csv:

In [None]:
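# card_id and first_active_month are strings, so restrict the correlation to numeric columns
corrs = np.abs(train.select_dtypes(include='number').corr())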
np.fill_diagonal(corrs.values, 0)
plt.figure(figsize=[5,5])
plt.imshow(corrs, cmap='plasma', vmin=0, vmax=1)
plt.colorbar(shrink=0.7)
plt.xticks(range(corrs.shape[0]), list(corrs.columns))
plt.yticks(range(corrs.shape[0]), list(corrs.columns))
plt.title('Correlations between target and user\'s features', fontsize=17)
plt.show()

The correlations between the target and each of the features are quite low (so linear models on these features alone are unlikely to predict well), but feature_1 and feature_3 are fairly well correlated with each other. Let's take a look at the scatter plots:

In [None]:
from pandas.plotting import scatter_matrix
select_cols = ['feature_1', 'feature_2', 'feature_3', 'target']
scatter_matrix(train[select_cols], figsize=[10,10])
plt.suptitle('Pair-wise scatter plots for columns in train.csv', fontsize=15)
plt.show()

Again, not a lot of useful information. The only thing I can see is different target variances for different values of feature_1, but that's most likely due to the different amount of data corresponding to each feature value.
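Whether the spread really follows the sample size can be checked by putting the row count next to the variance and range for each feature value. A minimal sketch on made-up numbers; on the real data the same `groupby` would run on `train`.

```python
import pandas as pd

# made-up rows standing in for train[['feature_1', 'target']]
toy = pd.DataFrame({
    'feature_1': [1, 1, 2, 2, 2, 2],
    'target': [0.0, 2.0, -1.0, 0.0, 1.0, 4.0],
})
# count, sample variance and range of the target per feature value
stats = toy.groupby('feature_1')['target'].agg(['count', 'var', 'min', 'max'])
print(stats)
```

If the variance stays roughly flat while only the min/max range widens with the count, the visual difference in the scatter plots is a sample-size effect.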

Now, let's take a look at the merchants. From the dataset's description:

**merchant_id:**	Unique merchant identifier

**merchant_group_id:**	Merchant group (anonymized )

**merchant_category_id:**	Unique identifier for merchant category (anonymized )

**subsector_id:**	Merchant category group (anonymized )

**numerical_1:**	anonymized measure

**numerical_2:**	anonymized measure

**category_1:**	anonymized category

**most_recent_sales_range:**	Range of revenue (monetary units) in last active month --> A > B > C > D > E

**most_recent_purchases_range:**	Range of quantity of transactions in last active month --> A > B > C > D > E

**avg_sales_lag3:**	Monthly average of revenue in last 3 months divided by revenue in last active month

**avg_purchases_lag3:**	Monthly average of transactions in last 3 months divided by transactions in last active month

**active_months_lag3:**	Quantity of active months within last 3 months

**avg_sales_lag6:**	Monthly average of revenue in last 6 months divided by revenue in last active month

**avg_purchases_lag6:**	Monthly average of transactions in last 6 months divided by transactions in last active month

**active_months_lag6:**	Quantity of active months within last 6 months

**avg_sales_lag12:**	Monthly average of revenue in last 12 months divided by revenue in last active month

**avg_purchases_lag12:**	Monthly average of transactions in last 12 months divided by transactions in last active month

**active_months_lag12:**	Quantity of active months within last 12 months

**category_4:**	anonymized category

**city_id:**	City identifier (anonymized )

**state_id:**	State identifier (anonymized )

**category_2:**	anonymized category

In [None]:
merchants = pd.read_csv(os.path.join(data_folder, 'merchants.csv'))
# replacing inf values with nan
merchants.replace([-np.inf, np.inf], np.nan, inplace=True)

Several columns in merchants.csv have outliers that squeeze most of the data into one bin (you can check it yourself, I'm not showing the raw data for the sake of saving space). Let's fix that by removing those outliers:

In [None]:
clean_merchants = merchants.loc[(merchants['numerical_1'] < 0.1) &
                               (merchants['numerical_2'] < 0.1) &
                               (merchants['avg_sales_lag3'] < 5) &
                               (merchants['avg_purchases_lag3'] < 5) &
                               (merchants['avg_sales_lag6'] < 10) &
                               (merchants['avg_purchases_lag6'] < 10) &
                               (merchants['avg_sales_lag12'] < 10) &
                               (merchants['avg_purchases_lag12'] < 10)]
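The thresholds above were picked by eye. As an alternative that doesn't need per-column constants, the sketch below drops rows above a high quantile of each column. Note that this is a swapped-in technique, not what the cell above does; `drop_above_quantile` and the quantile level are assumptions of mine, and the data is made up.

```python
import numpy as np
import pandas as pd

def drop_above_quantile(df, cols, q=0.99):
    """Keep rows whose values in `cols` are all at or below the q-th quantile.
    Rows with NaN in any of `cols` are dropped too, as with the `<` filter above."""
    limits = df[cols].quantile(q)
    keep = (df[cols] <= limits).all(axis=1)
    return df.loc[keep].copy()

# made-up rows standing in for merchants.csv
toy = pd.DataFrame({
    'numerical_1': [0.01, 0.02, 0.03, 0.02, 50.0],
    'avg_sales_lag3': [1.0, 1.1, np.nan, 0.9, 1.2],
})
cleaned = drop_above_quantile(toy, ['numerical_1', 'avg_sales_lag3'], q=0.8)
print(cleaned)
```

Returning a copy also means later in-place edits on the cleaned frame don't touch the original table.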

Now, let's look at column histograms:

In [None]:
cat_cols = ['active_months_lag6','active_months_lag3','most_recent_sales_range', 'most_recent_purchases_range','category_1','active_months_lag12','category_4', 'category_2']
num_cols = ['numerical_1', 'numerical_2','merchant_group_id','merchant_category_id','avg_sales_lag3', 'avg_purchases_lag3', 'subsector_id', 'avg_sales_lag6', 'avg_purchases_lag6', 'avg_sales_lag12', 'avg_purchases_lag12']

plt.figure(figsize=[15, 15])
plt.suptitle('Merchants table histograms', y=1.02, fontsize=20)
ncols = 4
nrows = int(np.ceil((len(cat_cols) + len(num_cols))/4))
last_ind = 0
for col in sorted(list(clean_merchants.columns)):
    #print('processing column ' + col)
    if col in cat_cols:
        last_ind += 1
        plt.subplot(nrows, ncols, last_ind)
        vc = clean_merchants[col].value_counts()
        x = np.array(vc.index)
        y = vc.values
        inds = np.argsort(x)
        x = x[inds].astype(str)
        y = y[inds]
        plt.bar(x, y, color=(0, 0, 0, 0.7))
        plt.title(col, fontsize=15)
    if col in num_cols:
        last_ind += 1
        plt.subplot(nrows, ncols, last_ind)
        clean_merchants[col].hist(bins = 50, color=(0, 0, 0, 0.7))
        plt.title(col, fontsize=15)
    plt.tight_layout()

From the histograms you can see several things:
* merchant_group_id values are sorted in descending order
* merchant_category_id and subsector_id are not sorted
* numerical_1 and numerical_2 (ironically) seem to represent more of a categorical value as they take a discrete set of values
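One quick way to check the "discrete set of values" impression is to compare the number of distinct values with the number of rows. The sketch below uses made-up columns; `distinct_ratio` is a hypothetical helper, and on the real data it would be applied to `clean_merchants[['numerical_1', 'numerical_2']]`.

```python
import numpy as np
import pandas as pd

def distinct_ratio(col):
    """Number of distinct non-null values divided by the number of non-null rows."""
    return col.nunique() / col.count()

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    # takes only a handful of levels (made-up values)
    'discrete_like': rng.choice([0.1, 0.2, 0.3], size=1000),
    # a genuinely continuous measure for comparison
    'continuous': rng.normal(size=1000),
})
ratios = toy.apply(distinct_ratio)
print(ratios)
```

A ratio near zero points to a categorical or bucketed column, while a ratio near one points to a continuous measure.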

In [None]:
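# converting category names to numbers, so these columns
# can participate in the correlation coefficient heat map
# (working on an explicit copy avoids pandas' SettingWithCopyWarning on the filtered frame)
clean_merchants = clean_merchants.copy()
range_codes = {'A':4, 'B':3, 'C':2, 'D':1, 'E':0}
clean_merchants['most_recent_purchases_range'] = clean_merchants['most_recent_purchases_range'].map(range_codes)
clean_merchants['most_recent_sales_range'] = clean_merchants['most_recent_sales_range'].map(range_codes)
clean_merchants['category_1'] = clean_merchants['category_1'].map({'N':0, 'Y':1})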

Now, let's look at correlations between columns in merchants.csv:

In [None]:
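# string columns such as merchant_id can't be correlated, so keep numeric columns only
corrs = np.abs(clean_merchants.select_dtypes(include='number').corr())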
ordered_cols = (corrs).sum().sort_values().index
np.fill_diagonal(corrs.values, 0)
plt.figure(figsize=[10,10])
plt.imshow(corrs.loc[ordered_cols, ordered_cols], cmap='plasma', vmin=0, vmax=1)
plt.colorbar(shrink=0.7)
plt.xticks(range(corrs.shape[0]), list(ordered_cols), rotation=90)
plt.yticks(range(corrs.shape[0]), list(ordered_cols))
plt.title('Heat map of coefficients of correlation between merchant\'s features', fontsize=17)
plt.show()

The heat map above reveals some relationships:
* numerical_1 and numerical_2 are highly correlated
* Unsurprisingly, avg_sales and avg_purchases within the last 3, 6, and 12 months are highly correlated
* merchant_group_id is loosely correlated with numerical_1, numerical_2, city_id, and sales statistics. Interestingly, merchant_category_id shows little correlation with merchant_group_id, city_id, or really anything else.
* category_1 is loosely correlated with the merchant's location (city_id and state_id)
* category_1 and category_2 are never both non-NaN at the same time
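Reading pairs off a heat map by eye gets tedious, so here is a sketch that lists the strongly correlated pairs as a table instead. It runs on made-up data; `top_pairs` and the 0.8 threshold are my own assumptions, not something defined earlier in this notebook.

```python
import numpy as np
import pandas as pd

def top_pairs(df, threshold=0.8):
    """Column pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.select_dtypes(include='number').corr().abs()
    # keep the upper triangle only, so each pair is listed once and the diagonal is skipped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs >= threshold].sort_values(ascending=False)

rng = np.random.default_rng(1)
a = rng.normal(size=500)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=500),  # nearly a copy of a
    'c': rng.normal(size=500),            # unrelated to a and b
})
print(top_pairs(toy))
```

Called on `clean_merchants`, this should surface the same groups the heat map shows, such as the sales and purchases columns.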

Let's take a closer look at the groups of correlated variables:

Group 1:

In [None]:
scatter_matrix(clean_merchants[ordered_cols[-6:]], figsize=[15,15])
plt.show()

Nothing too interesting here, looks like regular correlated values. One interesting thing to look at is the average sales and purchases within the last 3, 6, and 12 months:


In [None]:
x = np.array([12, 6, 3]).astype(str)
# the column order has to match the labels in x (12, 6, 3)
sales_rates = clean_merchants[['avg_sales_lag12', 'avg_sales_lag6', 'avg_sales_lag3']].mean().values
purchase_rates = clean_merchants[['avg_purchases_lag12', 'avg_purchases_lag6', 'avg_purchases_lag3']].mean().values
plt.bar(x, sales_rates, width=0.3, align='edge', label='average sales', edgecolor=[0.2]*3)
plt.bar(x, purchase_rates, width=-0.3, align='edge', label='average purchases', edgecolor=[0.2]*3)
plt.legend()
plt.title('Average sales and number of purchases\nover the last 12, 6, and 3 months', fontsize=17)
plt.show()

It seems like businesses' sales grow over time, which is good.
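A note on reading these ratios: per the data dictionary, each value is a window's monthly average divided by the figure for the last active month. So a value below 1 means the last month beat the window average, and a value above 1 means it fell short. The sketch below works this through for a made-up, steadily growing merchant. `lag_ratio` is a hypothetical helper, and I'm assuming the window includes the last active month, which the dictionary doesn't spell out.

```python
import numpy as np

def lag_ratio(monthly_revenue, window):
    """Mean revenue over the last `window` months divided by the last month's revenue.
    ASSUMPTION: the window includes the last active month itself."""
    recent = np.asarray(monthly_revenue[-window:], dtype=float)
    return recent.mean() / recent[-1]

# twelve months of made-up revenue, growing by 10 units each month: 10, 20, ..., 120
growing = list(range(10, 130, 10))

ratios = {window: lag_ratio(growing, window) for window in (3, 6, 12)}
for window, ratio in ratios.items():
    print(window, round(ratio, 3))
```

For a growing merchant all three ratios come out below 1 and shrink as the window widens; for a shrinking merchant the pattern is reversed.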

Group 2:

In [None]:
scatter_matrix(clean_merchants[ordered_cols[-14:-8]], figsize=[15,15])
plt.tight_layout()
plt.show()

Now, this is interesting.
* numerical_1 serves as the upper limit for numerical_2.
* Both numerical_1 and numerical_2 increase with the number of active months.
* active_months_lag3 and active_months_lag6 are basically truncated versions of active_months_lag12, so they can be dropped.
* It seems that the merchants with fewer than 12 active months are the ones that only recently opened their business. I don't see any merchants that went out of business (they would show higher lag12 values compared to lag6 and lag3 while lag6 < 6 and lag3 < 3).
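The "truncated versions" claim can be tested directly rather than read off the scatter plots: if the shorter windows carry no extra information, capping active_months_lag12 at 3 and 6 should reproduce them. A sketch on made-up rows, where the last row imitates a merchant that stopped trading:

```python
import numpy as np
import pandas as pd

# made-up rows standing in for the three active_months columns
toy = pd.DataFrame({
    'active_months_lag3':  [3, 3, 2, 3, 1],
    'active_months_lag6':  [6, 5, 2, 6, 2],
    'active_months_lag12': [12, 5, 2, 9, 8],
})
# True where the shorter window equals lag12 capped at the window length
lag3_is_truncation = toy['active_months_lag3'] == np.minimum(toy['active_months_lag12'], 3)
lag6_is_truncation = toy['active_months_lag6'] == np.minimum(toy['active_months_lag12'], 6)
print(lag3_is_truncation.mean(), lag6_is_truncation.mean())
```

On the real table, a share of 1.0 for both checks would justify dropping the lag3 and lag6 columns; rows where the check fails are the candidates for merchants that went out of business.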

Group 3:

In [None]:
scatter_matrix(clean_merchants[ordered_cols[0:6]], figsize=[10,10])
plt.show()

Not much interesting in this last group.

Let's look at the new merchants' transactions. "New" here means new for the customer, i.e. the customer has never purchased anything from those vendors during the period corresponding to the data in historical_transactions.csv.

**Dataset description:**

**card_id:**	Card identifier

**month_lag:**	month lag to reference date

**purchase_date:**	Purchase date

**authorized_flag:**	'Y' if approved, 'N' if denied

**category_3:**	anonymized category

**installments:**	number of installments of purchase

**category_1:**	anonymized category

**merchant_category_id:**	Merchant category identifier (anonymized )

**subsector_id:**	Merchant category group identifier (anonymized )

**merchant_id:**	Merchant identifier (anonymized)

**purchase_amount:**	Normalized purchase amount

**city_id:**	City identifier (anonymized )

**state_id:**	State identifier (anonymized )

**category_2:**	anonymized category

In [None]:
new_merch = pd.read_csv(os.path.join(data_folder, 'new_merchant_transactions.csv'))
new_merch.info(verbose=True, show_counts=True)  # null_counts was renamed to show_counts in newer pandas

In [None]:
# converting purchase time string into datetime
from datetime import datetime
new_merch['purchase_date'] = pd.to_datetime(new_merch['purchase_date'], format='%Y-%m-%d %H:%M:%S')

In [None]:
# drawing histograms for each column
filtered_new_merch = new_merch.loc[new_merch['purchase_amount'] < 1]
cat_cols = ['authorized_flag', 'category_1', 'installments','category_3', 'month_lag','category_2', 'subsector_id']
num_cols = ['purchase_amount', 'purchase_date', 'merchant_category_id', 'subsector_id']

plt.figure(figsize=[15, 10])
plt.suptitle('New merchant transaction info', y=1.02, fontsize=20)
ncols = 4
nrows = int(np.ceil((len(cat_cols) + len(num_cols))/4))
last_ind = 0
for col in sorted(list(filtered_new_merch.columns)):
    #print('processing column ' + col)
    if col in cat_cols:
        last_ind += 1
        plt.subplot(nrows, ncols, last_ind)
        vc = filtered_new_merch[col].value_counts()
        x = np.array(vc.index)
        y = vc.values
        inds = np.argsort(x)
        x = x[inds].astype(str)
        y = y[inds]
        plt.bar(x, y, color=(0, 0, 0, 0.7))
        plt.title(col, fontsize=15)
        plt.xticks(rotation=90)
    if col in num_cols:
        last_ind += 1
        plt.subplot(nrows, ncols, last_ind)
        filtered_new_merch[col].hist(bins = 50, color=(0, 0, 0, 0.7))
        plt.title(col, fontsize=15)
        plt.xticks(rotation=90)
    plt.tight_layout()

In [None]:
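# converting category_1 and category_3 values to numeric ones, so we can use them in scatter plots and correlation coefficients
# (working on an explicit copy avoids pandas' SettingWithCopyWarning on the filtered frame)
filtered_new_merch = filtered_new_merch.copy()
filtered_new_merch['category_3'] = filtered_new_merch['category_3'].map({'A':0, 'B':1, 'C':2})
filtered_new_merch['category_1'] = filtered_new_merch['category_1'].map({'N':0, 'Y':1})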

* All transactions in this table are authorized, so we can safely drop this column.
* Based on the purchase_date column, the reference date is different for different cards, but most of them are in February - March 2018.
* All purchases here were made within 2 months after the reference date
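The reference date itself isn't given in any table, but it can be backed out of purchase_date and month_lag by stepping the purchase month back by the lag. A minimal sketch on made-up transactions; the `reference_month` column name is my own, and the approach assumes month_lag counts whole calendar months.

```python
import pandas as pd

# made-up rows standing in for new_merchant_transactions.csv
toy = pd.DataFrame({
    'card_id': ['C_1', 'C_1', 'C_2'],
    'purchase_date': pd.to_datetime(['2018-03-11 14:57:36',
                                     '2018-04-02 09:10:00',
                                     '2018-01-20 18:00:00']),
    'month_lag': [1, 2, 2],
})
# count months since year 0, step back by month_lag, then format as YYYY-MM
month_index = toy['purchase_date'].dt.year * 12 + toy['purchase_date'].dt.month - 1
ref_index = month_index - toy['month_lag']
toy['reference_month'] = (
    (ref_index // 12).astype(str) + '-' + (ref_index % 12 + 1).astype(str).str.zfill(2)
)
# every row of a card should agree, so the first value per card is enough
reference_by_card = toy.groupby('card_id')['reference_month'].first()
print(reference_by_card)
```

Counting the values of `reference_by_card` on the real data would show how the reference months are spread across cards.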

Finally, let's look at the historical transactions. This table has exactly the same columns as the new merchant transactions:

In [None]:
trns_history = pd.read_csv(os.path.join(data_folder, 'historical_transactions.csv'))
trns_history.info(verbose=True, show_counts=True)  # null_counts was renamed to show_counts in newer pandas

In [None]:
trns_history['purchase_date'] = pd.to_datetime(trns_history['purchase_date'], format='%Y-%m-%d %H:%M:%S')

In [None]:
filtered_trns_history = trns_history.loc[trns_history['purchase_amount'] < 1]
cat_cols = ['authorized_flag', 'category_1', 'installments','category_3', 'month_lag','category_2', 'subsector_id']
num_cols = ['purchase_amount', 'purchase_date', 'merchant_category_id', 'subsector_id']

plt.figure(figsize=[15, 10])
plt.suptitle('Transaction history info', y=1.02, fontsize=20)
ncols = 4
nrows = int(np.ceil((len(cat_cols) + len(num_cols))/4))
last_ind = 0
for col in sorted(list(filtered_trns_history.columns)):
    #print('processing column ' + col)
    if col in cat_cols:
        last_ind += 1
        plt.subplot(nrows, ncols, last_ind)
        vc = filtered_trns_history[col].value_counts()
        x = np.array(vc.index)
        y = vc.values
        inds = np.argsort(x)
        x = x[inds].astype(str)
        y = y[inds]
        plt.bar(x, y, color=(0, 0, 0, 0.7))
        plt.title(col, fontsize=15)
        plt.xticks(rotation=90)
    if col in num_cols:
        last_ind += 1
        plt.subplot(nrows, ncols, last_ind)
        filtered_trns_history[col].hist(bins = 50, color=(0, 0, 0, 0.7))
        plt.title(col, fontsize=15)
        plt.xticks(rotation=90)
    plt.tight_layout()

* Unlike in the new merchants' transactions, some transactions here were not authorized.
* The installments column either has bugs or is normalized (since it contains -1 and 999 values).
* All transactions here were made before the reference date (month_lag <= 0).
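Both observations suggest some light cleaning before building features. The sketch below, on made-up rows, turns the odd installments values into NaN and the authorization flag into 0/1. Treating -1 and 999 as placeholder codes is purely my assumption; the data dictionary doesn't say what they mean.

```python
import numpy as np
import pandas as pd

# made-up rows standing in for historical_transactions.csv
toy = pd.DataFrame({
    'authorized_flag': ['Y', 'Y', 'N', 'Y', 'N'],
    'installments': [0, 1, -1, 999, 3],
})
# ASSUMPTION: -1 and 999 are placeholder codes rather than real installment counts
toy['installments_clean'] = toy['installments'].replace([-1, 999], np.nan)
# numeric flag, so the share of authorized transactions is a simple mean
toy['authorized'] = (toy['authorized_flag'] == 'Y').astype(int)

n_placeholder = toy['installments_clean'].isna().sum()
authorized_share = toy['authorized'].mean()
print(n_placeholder, authorized_share)
```

Keeping the original installments column next to the cleaned one leaves room to test later whether the placeholder values carry any signal of their own.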

To be continued...