## About the Project

* Vesta Corporation provided the dataset for this competition. Founded in 1995, Vesta Corporation is the forerunner in guaranteed e-commerce payment solutions.
 
* In this competition, the aim is to benchmark machine learning models on a challenging large-scale dataset. 
* The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features. 
* The machine learning model flags fraudulent transactions for millions of people around the world, helping hundreds of thousands of businesses reduce their fraud losses and increase their revenue.
* The training dataset consists of more than 400 features and about 590,000 samples. This is a supervised binary classification problem, and the goal is to predict whether a credit card transaction is fraudulent based on the input features described below.

**Evaluation**
* The model is evaluated on the ROC AUC score. The notebook produces an output CSV file with TransactionID and predicted probabilities on the test set, which is automatically evaluated by Kaggle.
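
To make the metric concrete, here is a minimal sketch of ROC AUC computed as the probability that a randomly chosen fraud case receives a higher score than a randomly chosen non-fraud case. The function name `roc_auc` and the toy labels and scores are made up for illustration; the notebook itself relies on `sklearn.metrics.roc_auc_score`.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correctly ranked pair
    return wins / (len(pos) * len(neg))

# Toy labels and scores (hypothetical values, not competition data)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)
```

Because only the ranking of the scores matters, the metric does not depend on the 0.5 classification threshold used for accuracy, precision, and recall.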

### Transaction Table 
* TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
* TransactionAmt: transaction payment amount in USD
* ProductCD: product code, the product for each transaction
* card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
* addr: address
* dist: distance
* P_emaildomain and R_emaildomain: purchaser and recipient email domain
* C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
* D1-D15: timedelta, such as days between previous transaction, etc.
* M1-M9: match, such as names on card and address, etc.
* Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

<br>  **Categorical Features:**
* ProductCD
* card1 - card6
* addr1, addr2
* P_emaildomain
* R_emaildomain
* M1 - M9

### Identity Table 
* Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
* They're collected by Vesta’s fraud protection system and digital security partners.
* (The field names are masked, and a pairwise dictionary is not provided, for privacy protection and contract agreement reasons.)

<br> **Categorical Features:**
* DeviceType
* DeviceInfo
* id_12 - id_38



## Import Libraries

In [None]:
# Install a newer seaborn release, as the default version in the environment does not work properly
!pip install seaborn==0.11.0

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os, gc
print(os.listdir("../input"))

import seaborn as sns
sns.set_theme(style="darkgrid")

import matplotlib.pyplot as plt

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

#setting for plot fonts 
SMALL_SIZE = 14
MEDIUM_SIZE = 16
BIGGER_SIZE = 18

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=SMALL_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


## Constants

In [None]:
RANDOM_STATE = 42
DEBUG_MODE = False  # Load fewer samples to save time for quick testing
TARGET = 'isFraud'


## Read Data
https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203#latest-586800



In [None]:
%%time

# Load fewer samples to save time for quick testing
if DEBUG_MODE:
    nrows = 50000
else:
    nrows = None
        
data_path = '/kaggle/input/ieee-fraud-detection/'
train_identity = pd.read_csv(os.path.join(data_path, 'train_identity.csv'))
train_transaction = pd.read_csv(os.path.join(data_path, 'train_transaction.csv'), nrows = nrows)
test_identity = pd.read_csv(os.path.join(data_path, 'test_identity.csv'))
test_transaction =pd.read_csv(os.path.join(data_path, 'test_transaction.csv'), nrows = nrows)
print('Train Identity Data - rows:', train_identity.shape[0], 
      'columns:', train_identity.shape[1])
print('Train Transaction Data - rows:', train_transaction.shape[0], 
      'columns:', train_transaction.shape[1])
print('Test Identity Data - rows:', test_identity.shape[0], 
      'columns:', test_identity.shape[1])
print('Test Transaction Data - rows:', test_transaction.shape[0], 
      'columns:', test_transaction.shape[1])

### Transaction Data



In [None]:
train_transaction.head()

In [None]:

def column_properties(df):
    columns_prop = pd.DataFrame()
    columns_prop['column'] = df.columns.tolist()
    columns_prop['count_non_null'] = df.count().values
    columns_prop['count_null'] = df.isnull().sum().values
    columns_prop['perc_null'] = columns_prop['count_null'] * 100 / df.shape[0]

    # df.nunique() is memory intensive and slow and can kill the kernel,
    # so count unique values one column at a time instead
    unique_list = []
    for col in df.columns.tolist():
        unique_list.append(df[col].value_counts().shape[0])
    columns_prop['count_unique'] =  unique_list
    
    columns_prop['dtype'] = df.dtypes.values
    columns_prop.set_index('column', inplace = True)
    return columns_prop


In [None]:
column_properties(train_transaction).T

### Identity Data
* Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
* The column names of the training and test sets do not match, so we rename the test set columns using the training column names.
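
Assigning names by position silently assumes both files list their columns in the same order. If the mismatch is only a separator difference (for example `id-01` in the test file versus `id_01` in the training file, which is an assumption here and worth checking against the printed column lists), a pattern-based rename plus an explicit check is a safer sketch. The `*_demo` frames below are toy stand-ins, not the real tables.

```python
import pandas as pd

# Toy frames; the hyphenated test names are an assumption for illustration
train_identity_demo = pd.DataFrame(columns=['TransactionID', 'id_01', 'id_02', 'DeviceType'])
test_identity_demo = pd.DataFrame(columns=['TransactionID', 'id-01', 'id-02', 'DeviceType'])

# Rename by pattern instead of by position
test_identity_demo.columns = [c.replace('-', '_') for c in test_identity_demo.columns]

# Confirm that both frames now share the same names in the same order
same_columns = list(test_identity_demo.columns) == list(train_identity_demo.columns)
print(same_columns)
```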

In [None]:
# The column names of the training and test sets do not match, so rename the test set columns using the training column names
identity_col_names =  train_identity.columns.tolist()
test_identity.columns = identity_col_names
print(test_identity.columns.tolist())

In [None]:
test_identity.head()

In [None]:
column_properties(train_identity).T

## Merge Data
Join the transaction data and the identity data, which are connected by the key 'TransactionID'.
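
As a small illustration of what the left join does, the sketch below uses two toy frames (made-up values). Every transaction row is kept, and transactions without an identity record get missing values in the identity columns.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables
transactions = pd.DataFrame({'TransactionID': [1, 2, 3],
                             'TransactionAmt': [50.0, 20.5, 99.0]})
identity = pd.DataFrame({'TransactionID': [1, 3],
                         'DeviceType': ['mobile', 'desktop']})

# how='left' keeps all rows of the left (transaction) table
merged = pd.merge(transactions, identity, on='TransactionID', how='left')
print(merged)
```

Transaction 2 has no identity record, so its `DeviceType` is null after the merge.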

In [None]:
%%time
train = pd.merge(train_transaction, train_identity, on= 'TransactionID', how = 'left')
test = pd.merge(test_transaction, test_identity, on= 'TransactionID', how = 'left')
del train_transaction, train_identity, test_transaction, test_identity
gc.collect()
train.shape

## Categorical Columns
Create a list of categorical columns based on the description below
<br>https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203#latest-586800

In [None]:




cat_cols = ['DeviceType', 'DeviceInfo', 'ProductCD', 'addr1', 'addr2', 'P_emaildomain', 'R_emaildomain']
cat_cols +=  ['M' + str(i) for i in range(1,10)]
cat_cols += ['card' + str(i) for i in range(1,7)]
cat_cols += ['id_' + str(i) for i in range(12,39)]
column_properties(train[cat_cols]).T


In [None]:
train[cat_cols].head()

## Numeric Columns
* Remove the categorical columns, the target, and the ID from the list of all columns; the remaining columns are numeric.
* Display the statistical properties of numeric columns

In [None]:
%%time


all_cols = train.columns.tolist()
num_cols = [x for x in all_cols if x not in cat_cols]

num_cols.remove('TransactionID')
num_cols.remove(TARGET)
train[num_cols].describe()

## Feature Engineering
Create new features and analyze them.


In [None]:
def plot_numeric_data(df, col, target_col, remove_outliers = False):
   
    df = df[[col, target_col]].copy()
    df.dropna(subset=[col], inplace =True)
    
    #Remove Outliers: keep only the ones that are within +3 to -3 standard deviations in the column 
    if remove_outliers:       
       
        df = df[np.abs(df[col]-df[col].mean()) <= (3*df[col].std())]
       


    fig, (ax1, ax2,ax3)  =  plt.subplots(ncols = 3, figsize = (24,4))
    fig.suptitle('Plots for {}'.format(col))
    
    # Display density plot
    sns.distplot(df[col], color = 'b',  kde = True ,  ax = ax1 )
    # Set the label on ax1 explicitly; plt.ylabel would act on the last axes (ax3)
    ax1.set_ylabel('Density')

    # Display box plot for the feature
    sns.boxplot(x = col , data = df,ax = ax2)

    # Display density plot for Fraud vs NotFraud
    sns.distplot(df[df[target_col] == 0][col], color = 'b', label = 'NotFraud',ax = ax3)
    sns.distplot(df[df[target_col] == 1][col], color = 'r', label = 'Fraud',ax = ax3)
    ax3.legend(loc = 'best')
    ax3.set_ylabel('Density NotFraud vs Fraud')

    plt.show()
    



In [None]:
def plot_categorical_data(col, data, top_n = 10, display_data = False ):
    
    # Prepare a dataframe with the count and positive-class percent for the given column
    df_data = data[[col, TARGET]].copy()    
    df = df_data.groupby(col).agg({col:['count'], TARGET:['sum']})
    df.columns = ['count', 'fraud_count']

    df['fraud_perc'] = df['fraud_count'] * 100 / df['count']
    df['fraud_perc'] = df['fraud_perc'].round(2)
    
#    % missing values in the columns to be displayed in title
    null_perc = (df_data.shape[0]- df['count'].sum())  / df_data.shape[0]

    width = 18
    height = 6

#   select only top n categories
    df_disp = df.sort_values(by ='count', ascending= False).head(top_n )

    fig, (ax1, ax2)  =  plt.subplots(ncols = 2, figsize = (width,height))
    fig.suptitle('Plots for {} (Missing Values: {:.2%})'.format(col, null_perc))
    
#   Display Sort order should be by descending value of count
    plot_order = df_disp.sort_values(by='count', ascending=False).index.values

#   Display Bar chart for frequency count of top_n categories
    s = sns.barplot(ax = ax1,  y = df_disp.index, x = df_disp['count'], order=plot_order, orient = 'h'  )
    s.set_title('Count for {}'.format(col))
    
#   Display bar chart for percent of positive class for top categories
    s = sns.barplot(ax = ax2,  y = df_disp.index, x = df_disp['fraud_perc'], order=plot_order , orient = 'h'    )
    s.set(xlabel='Fraud Percent')
    s.set_title('% Fraud {}'.format(col))
    plt.show()
    if display_data:
        return df

### New Feature: Number of Nulls
* Count the number of null values in each row. As the plots show, this is an important feature.
* The joint distribution plot suggests that transactions with fewer data points available (more nulls) have a lower chance of fraud.
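
The row-wise null count can be checked on a toy frame (made-up values) as a minimal sketch:

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in numeric and string columns
demo = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [np.nan, np.nan, 'x'],
                     'c': [1, 2, 3]})

# axis=1 sums the null flags across the columns of each row
null_counts = demo.isnull().sum(axis=1)
print(null_counts.tolist())
```

Computing this count before any other derived column is added keeps it a measure of missingness in the raw data only.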

In [None]:
display_features = ['TransactionID']


col = 'nulls'
train[col] = train.isnull().sum(axis=1)
test[col] = test.isnull().sum(axis=1)

display_features.append(col)
plot_numeric_data(train, col, TARGET, remove_outliers = True)

### New Feature: Transaction Amount Decimal Part
* Get the decimal part of the transaction amount and multiply it by 1000.
* The decimal part is probably informative because fraudulent transactions often happen in a foreign currency, so the credit card is charged a fractional amount.
* We can also see that if the decimal part of the transaction amount is zero, the chance of fraud is lower.
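
A minimal sketch of the decimal-part computation on toy amounts follows. The amounts are chosen to be exactly representable in binary floating point; for other values, truncating with `astype(int)` can land one unit below the intended result because of floating-point error, so a rounded variant is shown alongside as an alternative (it is not what the notebook uses).

```python
import numpy as np
import pandas as pd

# Toy amounts (hypothetical values)
amounts = pd.Series([100.0, 25.5, 0.125])

# Truncating variant: subtract the integer part, scale, truncate
truncated = ((amounts - amounts.astype(int)) * 1000).astype(int)

# Rounded variant: less sensitive to floating-point error
rounded = np.round((amounts % 1) * 1000).astype(int)

print(truncated.tolist(), rounded.tolist())
```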


In [None]:

col = 'TAmt_decimal'
train[col] = ((train['TransactionAmt'] - train['TransactionAmt'].astype(int)) * 1000).astype(int)
test[col] = ((test['TransactionAmt'] - test['TransactionAmt'].astype(int)) * 1000).astype(int)

display_features.append('TransactionAmt')
display_features.append(col)
plot_numeric_data(train, col, TARGET, remove_outliers = True)


### New Feature: Frequency Counts
* Count the frequency of important categorical variables related to card, address, email domain, and product code.
* As we can see, credit cards that are used frequently have a lower chance of fraud.
* In general, frequency counts are a good way to convert a categorical column into a numeric column that an ML model can use more easily.
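
The sketch below shows frequency encoding on toy data (made-up card brands). As in the notebook, the counts are taken from the training set and mapped onto both sets, so a category that never appears in training ends up as a missing value in the test set.

```python
import pandas as pd

# Toy train and test columns
train_demo = pd.DataFrame({'card4': ['visa', 'visa', 'mastercard']})
test_demo = pd.DataFrame({'card4': ['visa', 'discover']})

# Count each category in the training data only
counts = train_demo['card4'].value_counts(dropna=False)

# Map the training counts onto both sets
train_demo['card4_count'] = train_demo['card4'].map(counts)
test_demo['card4_count'] = test_demo['card4'].map(counts)

print(train_demo)
print(test_demo)
```

Here `discover` is unseen in training, so its count in the test frame is null; LightGBM can handle that missing value directly.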


In [None]:
freq_cols = ['card1', 'addr1', 'addr2', 'card2', 'card3', 'card4', 'card5',
             'card6', 'P_emaildomain', 'ProductCD', 'R_emaildomain']
for col in freq_cols:
    display_features.append(col)
    
    train[col + '_count'] = train[col].map(train[col].value_counts(dropna=False) )
    test[col + '_count'] =  test[col].map(train[col].value_counts(dropna=False) )
    display_features.append(col + '_count')
    plot_numeric_data(train, col + '_count',  TARGET, remove_outliers = True)

### New Feature: Hour of the day
* From TransactionDT, extract the hour of day of the transaction, encoded as 0-23.
* TransactionDT is a time offset from a reference datetime (treated here as seconds), so we can extract time-related features from it.
* We can see that more frauds are committed between 1 AM and 11 AM. This is probably because the fraud originates in a different time zone.
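
A quick check of the hour arithmetic on a few toy offsets, assuming the offsets are in seconds. The helper name `hour_of_day` is made up for this sketch.

```python
import numpy as np

def hour_of_day(seconds):
    """Map second offsets to an hour bucket in the range 0-23."""
    return np.floor(np.asarray(seconds) / 3600) % 24

# Offsets: start, just under 1 hour, 1 hour, exactly 1 day, 1 day + 1 hour
offsets = [0, 3599, 3600, 86400, 90000]
hours = hour_of_day(offsets)
print(hours.tolist())
```

Since the reference datetime is not disclosed, hour 0 here is relative to that reference and need not correspond to local midnight.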


In [None]:
def make_hour_feature(df, tname='TransactionDT'):
    #Creates an hour of the day feature, encoded as 0-23.  
   
    hours = df[tname] / (3600)        
    encoded_hours = np.floor(hours) % 24
    return encoded_hours

In [None]:
display_features.append('TransactionDT')

col = 'hour'
train[col] = make_hour_feature(train)
test[col] = make_hour_feature(test)
display_features.append(col)

plot_numeric_data(train, col,  TARGET, remove_outliers = True)

## Display Fraud Transactions for New Features 
Display the new features and their source features for transactions that were fraudulent.


In [None]:
display_features.append(TARGET)
train[train[TARGET]==1][display_features].head(10)

## Data Pre-Processing
* Concatenate the training and test datasets by appending, so that the feature engineering and pre-processing steps can be applied to the combined set.
* Convert the categorical features from string to int using ordinal encoding. For example, convert ['A', 'B', 'C'] to [0, 1, 2].
* Create a data frame `sub` for the submission of test scores; we will later fill it with predictions on the test set.
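
The ordinal encoding step can be sketched on a toy series (made-up values). `pd.factorize` assigns codes in order of first appearance and marks missing values with -1, which the notebook then converts back to null.

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing value
values = pd.Series(['A', 'B', None, 'A', 'C'])

# codes are integers in order of first appearance; missing values get -1
codes, uniques = pd.factorize(values)

# Restore missing values so the model can treat them as null
encoded = pd.Series(codes).replace(-1, np.nan)
print(codes.tolist(), list(uniques))
```

Encoding the combined train and test data in one pass ensures that the same category receives the same code in both sets.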


In [None]:



# Concatenate the training and test datasets by appending
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
data_all = pd.concat([train, test], ignore_index = True, sort=False)

#Create submission pandas dataframe 
sub = pd.DataFrame()
sub['TransactionID'] = test.TransactionID



# free the memory which is not required as it can exceed the physical ram 
del train, test
gc.collect()


In [None]:



# Do ordinal encoding for categorical features
for col in cat_cols:
    data_all[col], uniques = pd.factorize(data_all[col])
    #the factorize sets null values to -1, so convert them back to null, as we want LGB to handle null values
    data_all[col] = data_all[col].replace(-1, np.nan)
    

    


## Create Test, Train and Validation sets
* From combined dataset split the training and test datasets and separate the target and features
* Split the training set into training and validation set. We will use first 80% of data as training set and last 20% as validation set.
* Since the data is sorted in time according to transaction timestamp, we should not use random split.
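
A minimal sketch of a time-ordered split on toy row indices: the earliest 80% of rows go to training and the latest 20% to validation, so validation rows always come after the training rows. The list `rows` is a stand-in for transactions already sorted by TransactionDT.

```python
# Toy stand-in for rows sorted by transaction time
rows = list(range(10))

# Size of the validation block, taken from the end of the data
n_valid = int(round(len(rows) * 0.2))
split_at = len(rows) - n_valid

train_rows = rows[:split_at]   # earliest rows
valid_rows = rows[split_at:]   # latest rows
print(train_rows, valid_rows)
```

A random split would mix future transactions into the training set, which tends to make validation scores look better than performance on genuinely later data.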


In [None]:
from sklearn.model_selection import train_test_split

#For test set target value will be null
X_train =  data_all[data_all[TARGET].notnull()]
X_test  =  data_all[data_all[TARGET].isnull()]
del data_all
gc.collect()

# Get the labels for the training set
y_train = X_train[TARGET]

# Remove ID and TARGET column from train and test set
X_train = X_train.drop(['TransactionID', TARGET], axis = 1)
X_test = X_test.drop(['TransactionID',   TARGET], axis = 1)

# Split the training set into training and validation sets.
# We use the first 80% of the data as the training set and the last 20% as the validation set.
# Since the data is sorted in time by transaction timestamp, we should not use a random split.
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size = 0.2, shuffle=False, 
                                                      random_state = RANDOM_STATE)

print('Train shape{} Valid Shape{}, Test Shape {}'.format(X_train.shape, X_valid.shape, X_test.shape))

## Train LightGBM Model With Validation
* LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient: https://lightgbm.readthedocs.io/en/latest/Features.html
* In this notebook we use LightGBM in random forest mode, which builds independent trees on bagged samples of the data instead of boosting. This is done by setting the parameter boosting_type = 'random_forest'.
* Train on the first 80% of the dataset and evaluate on the last 20%, as the data is sorted in time.
* The parameter scale_pos_weight handles the imbalanced nature of the dataset. It gives more weight to the minority (fraud) class, which can improve recall and the F1 score on that class.
* No imputation of missing values is necessary, as LightGBM handles missing values natively.
* The new features created in the sections above will also be used if they are in the Top N features list.
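
As a point of reference for choosing scale_pos_weight, a commonly cited heuristic is the ratio of negative to positive samples. The sketch below computes it on toy labels (made-up class balance); it is only a starting point, and this notebook uses a smaller fixed value instead.

```python
# Toy labels: 95 legitimate transactions and 5 fraudulent ones
labels = [0] * 95 + [1] * 5

n_pos = sum(labels)
n_neg = len(labels) - n_pos

# Ratio that would give both classes equal total weight
balanced_weight = n_neg / n_pos
print(balanced_weight)
```

Values below this ratio up-weight the fraud class less aggressively, which usually trades some recall for better precision.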


In [None]:
from sklearn.metrics import accuracy_score, roc_auc_score,f1_score, precision_score, recall_score,confusion_matrix

def validation_results(y_valid, y_prob, verbose = True):   
    scores = {}                      
    y_pred_class =  [0  if x < 0.5 else 1 for x in y_prob]
    scores['val_accuracy']  = accuracy_score(y_valid, y_pred_class)
    scores['val_auc']       = roc_auc_score(y_valid, y_prob)
    scores['val_f1']        =   f1_score(y_valid, y_pred_class, average = 'binary')
    scores['val_precision'] = precision_score(y_valid, y_pred_class)
    scores['val_recall']    = recall_score(y_valid, y_pred_class)
    
    cm = confusion_matrix(y_valid, y_pred_class)
    cm_df = pd.DataFrame(cm, columns=np.unique(y_valid), index = np.unique(y_valid))
    if verbose:
        print('\nValidation Accuracy      {:0.5f}'.format( scores['val_accuracy'] ))
        print('Validation   AUC         {:0.5f}'.format( scores['val_auc']   ))
        print('Validation Precision     {:0.5f}'.format(scores['val_precision']))
        print('Validation Recall        {:0.5f}'.format(scores['val_recall']))
        print('Validation  F1           %0.5f' %scores['val_f1'] )
    return scores , cm_df




In [None]:
%%time

import lightgbm as lgb

feature_imp = pd.DataFrame()

params = {}

params['learning_rate'] = 0.06
params['boosting_type'] = 'random_forest'
params['objective'] = 'binary'
params['seed'] =  RANDOM_STATE
params['metric'] =    'auc'
params['bagging_fraction'] = 0.7
params['bagging_freq'] = 5
params['feature_fraction'] = 0.7
# params['max_bin'] = 127
params['scale_pos_weight'] = 2
params['num_leaves'] = 256



print('Train shape{} Valid Shape{}, Test Shape {}\n'.format(X_train.shape, X_valid.shape, X_test.shape))

lgb_train = lgb.Dataset(X_train, y_train)
lgb_valid  = lgb.Dataset(X_valid, y_valid)
early_stopping_rounds = 200
lgb_results = {}

model = lgb.train(params,
                      lgb_train,
                      num_boost_round = 10000,
                      valid_sets =  [lgb_train,lgb_valid],
                      early_stopping_rounds = early_stopping_rounds,                    
    #                   categorical_feature = cat_cols,
                      evals_result = lgb_results,
                      verbose_eval = 100
                       )

print('\nPrinting Model Parameters\n')
print(params)

## Display Results

In [None]:
y_prob = model.predict(X_valid)
results, cm_df  = validation_results(y_valid, y_prob, verbose = True)


### Display Confusion Matrix

In [None]:

cm_df.index.name = 'Actual'
cm_df.columns.name = 'Predicted'
plt.figure(figsize = (10,7))
sns.set(font_scale=1.4)#for label size
sns.heatmap(cm_df, cmap="Blues", annot=True,annot_kws={"size": 16}, fmt='g')# font size
plt.show()

### Plot Training vs Validation scores

In [None]:

def plot_lgb_scores(lgb_results):
    train_res = lgb_results['training']['auc']
    valid_res = lgb_results['valid_1']['auc']
    ntrees = range(1, len(train_res) + 1)

    plt.figure(figsize = (12, 6))
    plt.plot(ntrees, train_res , 'b', label = 'Training')
    plt.plot(ntrees, valid_res, 'r', label = 'Validation')
    plt.xlabel('Number of Trees', fontsize = 14)
    plt.ylabel('AUC Score', fontsize = 14)
    plt.legend(fontsize = 14)
    plt.show()
    

    


In [None]:
plot_lgb_scores(lgb_results)

### Plot ROC Curve

In [None]:



import sklearn.metrics as metrics
fpr, tpr, threshold = metrics.roc_curve(y_valid, y_prob)
roc_auc = metrics.auc(fpr, tpr)

plt.figure(figsize = (12, 6))
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

### Plot Feature Importance
Display top 20 features 

In [None]:
def plot_feature_imp(model, top_n = 30):
    feature_imp = pd.DataFrame()
    feature_imp['feature'] = model.feature_name()
    feature_imp['importance']  = model.feature_importance()
    feature_imp = feature_imp.sort_values(['importance'], ascending = False)
    feature_imp_disp = feature_imp.head(top_n)
    plt.figure(figsize=(10, 12))
    sns.barplot(x="importance", y="feature", data=feature_imp_disp)
    plt.title('LightGBM Features')
    plt.show() 
#     return feature_imp

In [None]:
plot_feature_imp(model, top_n = 20)

## Predict on test set
Also write the results to a CSV file.
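
Before submitting, it can help to sanity-check the shape of the output. The sketch below builds a toy submission frame (made-up IDs and probabilities) and checks the header and the probability range.

```python
import pandas as pd

# Toy submission with the two expected columns
sub_demo = pd.DataFrame({'TransactionID': [1001, 1002],
                         'isFraud': [0.02, 0.91]})

# Render to CSV text without the index column
csv_text = sub_demo.to_csv(index=False)
header = csv_text.splitlines()[0]

# Predicted probabilities should lie in [0, 1]
probs_ok = bool(sub_demo['isFraud'].between(0, 1).all())
print(header, probs_ok)
```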

In [None]:
y_prob_test = model.predict(X_test)
sub['isFraud'] = y_prob_test
sub.to_csv('lgb_sub.csv', index=False)
sub.head()

## Summary

* Do basic data preprocessing steps to convert categorical variables to integers using ordinal encoding.
* Train a LightGBM model on the first 80% of the data and validate on the last 20%. We use Random Forest mode by setting the parameter boosting_type = 'random_forest'.
* Use metric AUC to evaluate the performance of model.
* Handle the imbalanced dataset with the built-in model parameter scale_pos_weight (set to 2 in the parameters above). Using this, we were able to improve the F1 score from 0.38244 (baseline model) to 0.56081 on the current model.
* The validation AUC score for Random Forest is 0.867972, while the AUC on the public test data is 0.891612. Random Forest performs worse than gradient-boosted LightGBM and XGBoost, which scored AUC 0.923118 and 0.918617 respectively on the public test data.
