<h1><center> Elo - Data Engineering </center></h1> 

### Competition Description
Imagine being hungry in an unfamiliar part of town and getting restaurant recommendations served up, based on your personal preferences, at just the right moment. The recommendation comes with an attached discount from your credit card provider for a local place around the corner!

Right now, Elo, one of the largest payment brands in Brazil, has built partnerships with merchants in order to offer promotions or discounts to cardholders. But do these promotions work for either the consumer or the merchant? Do customers enjoy their experience? Do merchants see repeat business? Personalization is key.

Elo has built machine learning models to understand the most important aspects and preferences in their customers’ lifecycle, from food to shopping. But so far none of them is specifically tailored for an individual or profile. This is where you come in.

In this competition, Kagglers will develop algorithms to identify and serve the most relevant opportunities to individuals, by uncovering signal in customer loyalty. Your input will improve customers’ lives and help Elo reduce unwanted campaigns, to create the right experience for customers.

### What files do I need?
You will need, at a minimum, the train.csv and test.csv files. These contain the card_ids that we'll be using for training and prediction.

The historical_transactions.csv and new_merchant_transactions.csv files contain information about each card's transactions. 
- historical_transactions.csv contains up to 3 months' worth of transactions for every card at any of the provided merchant_ids. 
- new_merchant_transactions.csv contains the transactions at new merchants (merchant_ids that this particular card_id has not yet visited) over a period of two months.
- merchants.csv contains aggregate information for each merchant_id represented in the data set.

### What should I expect the data format to be?
The data is formatted as follows:

train.csv and test.csv contain card_ids and information about the card itself - the first month the card was active, etc. train.csv also contains the target.

historical_transactions.csv and new_merchant_transactions.csv are designed to be joined with train.csv, test.csv, and merchants.csv. They contain information about transactions for each card, as described above.

merchants.csv can be joined with the transaction sets on merchant_id to provide additional merchant-level information.
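As a rough illustration of these joins, the sketch below uses tiny made-up frames. All values are invented; only the key names `card_id` and `merchant_id` are taken from the description above, and `merchant_group_id` stands in for any merchant-level column.

```python
import pandas as pd

# Toy stand-ins for the real files; all values are invented for illustration.
cards = pd.DataFrame({'card_id': ['C_1', 'C_2'], 'target': [0.5, -1.2]})
transactions = pd.DataFrame({
    'card_id': ['C_1', 'C_1', 'C_2'],
    'merchant_id': ['M_1', 'M_2', 'M_1'],
    'purchase_amount': [-0.7, -0.6, -0.5],
})
merchants = pd.DataFrame({'merchant_id': ['M_1', 'M_2'], 'merchant_group_id': [10, 20]})

# Transactions link to cards through card_id and to merchants through merchant_id.
joined = (transactions
          .merge(cards, on='card_id', how='left')
          .merge(merchants, on='merchant_id', how='left'))
print(joined)
```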

### What am I predicting?
You are predicting a loyalty score for each card_id represented in test.csv and sample_submission.csv.

<hr>

### Goals of this notebook:
1. Load all necessary data
1. Optimize memory and adjust datasets
1. Engineer features and correlate them with the target
1. Export ready-to-use data for the machine learning model

<hr>

<h1><center> Import Libraries </center></h1>

#### Import all necessary Python libraries 

In [None]:
# data engineering libraries
import numpy as np 
import pandas as pd 

# plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# support python libraries
import warnings
import time
import sys
import datetime

# machine learning libraries
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

#### Change notebook options 

In [None]:
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.set_option('display.max_columns', 500)

<hr>

<a id="1"></a> <br>
<h1><center> Definitions </center></h1>

#### Define functions 

In [None]:
def reduce_mem_usage(df, verbose=True):
    """
    Reduce dataset memory usage by changing the dtypes within each dataset column
    :param df: (pd.DataFrame) dataset to be changed
    :param verbose: (bool) whether to print the memory reduction summary
    
    :return: (pd.DataFrame) optimized dataset
    """
    # list numeric datatypes
    numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    
    # set up the initial memory usage
    start_mem = df.memory_usage().sum() / 1024**2   
    
    # for each column in the dataset
    for col in df.columns:
        # get the column type
        col_type = df[col].dtypes
        
        # if the column type is within the numeric datatypes
        if col_type in numerics:
            # calculate the column value range
            c_min = df[col].min()
            c_max = df[col].max()
            
            # Change the column type based on its range
            if str(col_type)[:3] == 'int':
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.uint8).min and c_max <= np.iinfo(np.uint8).max:
                    df[col] = df[col].astype(np.uint8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.uint16).min and c_max < np.iinfo(np.uint16).max:
                    df[col] = df[col].astype(np.uint16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.uint32).min and c_max < np.iinfo(np.uint32).max:
                    df[col] = df[col].astype(np.uint32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
                elif c_min > np.iinfo(np.uint64).min and c_max < np.iinfo(np.uint64).max:
                    df[col] = df[col].astype(np.uint64) 
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64) 
        # if the column is not of numeric type
        elif col_type == object:
            # change it to categorical
            df[col] = df[col].astype('category')
                
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: 
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
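def dataset_overview(df):
    """
    Summarize each column: dtype, memory usage, fill rate, unique count and values
    :param df: (pd.DataFrame) dataset to be summarized
    :return: (pd.DataFrame) one row of summary information per column
    """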
    dataset_info = list()
    for c in df.columns:
        # create a list of column information to add
        column_info = list()
        
        # add the column name
        column_info.append(c)
        
        # add the column type
        column_info.append(df[c].dtypes)
        
        # add the column memory usage
        column_info.append('%.2f' % (df[c].memory_usage(index=False, deep=True)/(1024*1024)))
        
        # add the percentage of fill
        column_info.append('%.1f%%' % (100*df[c].count()/df.shape[0]))
        
        # add the number of unique values
        column_info.append(df[c].nunique())
        
        # add the example of values
        column_info.append(df[c].unique())
        
        # add the column info to the dataset info
        dataset_info.append(column_info)
        
    # return a dataframe with all information
    return pd.DataFrame(data=dataset_info, columns=['Column Name', 'Column Type', 'Memory Usage (MB)', '% of Fill', 'Unique Values', 'Examples'])
        

<hr>

<a id="1"></a> <br>
<h1><center>Database Loading</center></h1>

#### New transactions dataset 

In [None]:
rNewTransc = pd.read_csv('../input/new_merchant_transactions.csv', parse_dates=['purchase_date'])

In [None]:
dataset_overview(rNewTransc)

#### Historical transactions dataset 

In [None]:
rHistTransc = pd.read_csv('../input/historical_transactions.csv', parse_dates=['purchase_date'])

In [None]:
dataset_overview(rHistTransc)

#### Merchant data 

In [None]:
# merchants.csv has no first_active_month column, so no date parsing is requested here
rMerchant = pd.read_csv('../input/merchants.csv')

In [None]:
dataset_overview(rMerchant)

#### Train set 

In [None]:
rTrain = pd.read_csv('../input/train.csv', parse_dates=['first_active_month'])

In [None]:
dataset_overview(rTrain)

#### Test set 

In [None]:
rTest = pd.read_csv('../input/test.csv', parse_dates=['first_active_month'])

In [None]:
dataset_overview(rTest)

<hr>

<a id="2"></a> <br>
<h1><center>Data Manipulation</center></h1>

#### Binarize data
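The `authorized_flag` and `category_1` columns hold `'Y'`/`'N'` labels, which the cell below converts to 1/0. A minimal sketch of the same idea on a made-up Series, using `Series.map`:

```python
import pandas as pd

# Invented sample of a Y/N flag column.
flag = pd.Series(['Y', 'N', 'Y', 'Y'])

# Map the two labels to integers; any unexpected label would become NaN.
flag_bin = flag.map({'Y': 1, 'N': 0})
print(flag_bin.tolist())
```

Once the flag is numeric, its mean is simply the share of `'Y'` rows, which is convenient for per-card aggregation later on.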

In [None]:
# map the 'Y'/'N' labels to 1/0 so the columns become numeric
rHistTransc['authorized_flag'] = rHistTransc['authorized_flag'].map({'Y': 1, 'N': 0})
rHistTransc['category_1'] = rHistTransc['category_1'].map({'Y': 1, 'N': 0})
rNewTransc['authorized_flag'] = rNewTransc['authorized_flag'].map({'Y': 1, 'N': 0})
rNewTransc['category_1'] = rNewTransc['category_1'].map({'Y': 1, 'N': 0})

#### Add the train and test data to historical and new transactions 
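Because the merges use `how='inner'`, cards with no transactions in a given file drop out of the merged frame. The ratios computed a few cells below measure that coverage. A toy version with invented IDs:

```python
import pandas as pd

# Invented data: three training cards, transactions for only two of them.
train = pd.DataFrame({'card_id': ['C_1', 'C_2', 'C_3'], 'target': [0.1, 0.2, 0.3]})
transactions = pd.DataFrame({
    'card_id': ['C_1', 'C_1', 'C_2'],
    'purchase_amount': [-0.7, -0.6, -0.5],
})

# An inner merge keeps only cards present in both frames.
merged = transactions.merge(train, on='card_id', how='inner')

# Share of training cards that have at least one transaction.
coverage = merged['card_id'].nunique() / train['card_id'].nunique()
print(round(coverage, 3))
```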

In [None]:
tHistTrain = rHistTransc.merge(rTrain, on='card_id', how='inner')
tNewTrain = rNewTransc.merge(rTrain, on='card_id', how='inner')

tHistTest = rHistTransc.merge(rTest, on='card_id', how='inner')
tNewTest = rNewTransc.merge(rTest, on='card_id', how='inner')

In [None]:
rTrain.columns

In [None]:
(tHistTrain['card_id'].nunique())/rTrain['card_id'].nunique()

In [None]:
(tNewTrain['card_id'].nunique())/rTrain['card_id'].nunique()

#### Optimize memory
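`reduce_mem_usage` picks the narrowest dtype whose range covers each column. The sketch below shows the same idea on an invented frame using pandas' built-in `pd.to_numeric(..., downcast=...)`, which is a stand-in for illustration rather than the function defined above. Note that `reduce_mem_usage` may go as far as `float16`, which keeps only about three significant decimal digits, whereas `pd.to_numeric` stops at `float32`.

```python
import numpy as np
import pandas as pd

# Invented frame with default 64-bit dtypes.
df = pd.DataFrame({
    'month_lag': np.array([-3, -1, 0, 2], dtype=np.int64),
    'purchase_amount': np.array([-0.75, 0.25, 1.5, 12.0], dtype=np.float64),
})
before = df.memory_usage(index=False).sum()

# Downcast each column to the smallest integer / float type that fits.
df['month_lag'] = pd.to_numeric(df['month_lag'], downcast='integer')
df['purchase_amount'] = pd.to_numeric(df['purchase_amount'], downcast='float')
after = df.memory_usage(index=False).sum()

print(df.dtypes)
print(before, '->', after, 'bytes')
```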

In [None]:
tHistTrain = reduce_mem_usage(tHistTrain)
tHistTest = reduce_mem_usage(tHistTest)
tNewTrain = reduce_mem_usage(tNewTrain)
tNewTest = reduce_mem_usage(tNewTest)

#### Merge the merchant data 
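A left merge on `merchant_id` duplicates transaction rows whenever a `merchant_id` appears more than once on the merchant side, so it is worth checking whether that happens in `merchants.csv`. The sketch below, on invented data, drops duplicate keys first and passes `validate='many_to_one'` so that pandas raises an error if the assumption is violated. Keeping the first duplicate is an arbitrary choice made for the example.

```python
import pandas as pd

# Invented data: merchant M_1 is listed twice on the merchant side.
transactions = pd.DataFrame({'card_id': ['C_1', 'C_2'], 'merchant_id': ['M_1', 'M_2']})
merchants = pd.DataFrame({
    'merchant_id': ['M_1', 'M_1', 'M_2'],
    'merchant_group_id': [10, 11, 20],
})

# Keep one row per merchant_id (first occurrence; an arbitrary choice).
merchants_unique = merchants.drop_duplicates(subset='merchant_id', keep='first')

# validate='many_to_one' raises MergeError if merchant_id is not unique on the right.
merged = transactions.merge(merchants_unique, on='merchant_id', how='left',
                            validate='many_to_one')
print(len(transactions), len(merged))
```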

In [None]:
tNewTest.columns

In [None]:
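# errors='ignore' skips any listed name (such as 'c') that is not a column of rMerchant
tNewTest.merge(rMerchant.drop(['merchant_category_id', 'subsector_id', 'category_1', 'c'], axis=1, errors='ignore'), on=['merchant_id'], how='left')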

<a id="3"></a> <br>
<h1><center>Feature Engineering</center></h1>

#### Create dummy variables
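No code accompanies this step in the notebook. As one possible sketch, `pd.get_dummies` can one-hot encode a low-cardinality column. The column name `category_3` and its values are assumptions chosen for illustration.

```python
import pandas as pd

# Invented frame; 'category_3' and its values are illustrative assumptions.
df = pd.DataFrame({'card_id': ['C_1', 'C_2', 'C_3'], 'category_3': ['A', 'B', 'A']})

# One indicator column per category level, prefixed with the source column name.
dummies = pd.get_dummies(df, columns=['category_3'], prefix='category_3')
print(dummies.columns.tolist())
```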

#### Add the customer lifetime up to the latest activation month 
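Subtracting a datetime column from a single `Timestamp` yields a timedelta Series, and its `.dt.days` accessor gives whole days. A toy version with invented activation months, using the latest activation month as the reference date:

```python
import pandas as pd

# Invented activation months, parsed as datetimes.
df = pd.DataFrame({
    'first_active_month': pd.to_datetime(['2017-01-01', '2017-12-01', '2018-02-01'])
})

# Reference date: the most recent activation month in the frame.
reference = df['first_active_month'].max()

# Timestamp - datetime Series -> timedelta Series; .dt.days extracts whole days.
df['elapsed_time'] = (reference - df['first_active_month']).dt.days
print(df['elapsed_time'].tolist())
```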

In [None]:
# subtract the datetime column directly; mixing a Timestamp with .dt.date objects fails
tNewTrain['elapsed_time'] = (tNewTrain['first_active_month'].max() - tNewTrain['first_active_month']).dt.days
tHistTrain['elapsed_time'] = (tHistTrain['first_active_month'].max() - tHistTrain['first_active_month']).dt.days

#### Normalize difference to current date and time

In [None]:
dataset_overview(rHistTransc)

#### Get dummies of categorical data 

In [None]:
# the raw transaction frames in this notebook are rHistTransc and rNewTransc
rHistTransc = reduce_mem_usage(rHistTransc)
rNewTransc = reduce_mem_usage(rNewTransc)

# share of authorized transactions per card
agg_fun = {'authorized_flag': ['mean']}
auth_mean = rHistTransc.groupby(['card_id']).agg(agg_fun)
auth_mean.columns = ['_'.join(col).strip() for col in auth_mean.columns.values]
auth_mean.reset_index(inplace=True)

# split the historical transactions into authorized and non-authorized rows
authorized_transactions = rHistTransc[rHistTransc['authorized_flag'] == 1]
rHistTransc = rHistTransc[rHistTransc['authorized_flag'] == 0]

#### Add purchase month

The second function first aggregates on the two variables `card_id` and `month_lag`. Then a second grouping is performed to aggregate over time: