# IEEE-CIS Fraud Detection 
This dataset contains about 590K records of online transactions. The goal is to detect whether a transaction is fraudulent, which makes this a binary classification problem.

Data: https://www.kaggle.com/c/ieee-fraud-detection

Reference Kernel: https://www.kaggle.com/plasticgrammer/ieee-cis-fraud-detection-playground/data


In [None]:
import numpy as np 
import pandas as pd 
from sklearn.feature_selection import SelectKBest, chi2,f_classif
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
from sklearn.model_selection import train_test_split

The following code reduces memory usage while loading the CSV files into pandas DataFrames.

In [None]:
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df


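def import_data(file):
    """Create a dataframe from a CSV file and optimize its memory usage."""
    # parse_dates=True only affects the index and keep_date_col is deprecated
    # in recent pandas versions, so the default reader is used instead.
    df = pd.read_csv(file)
    df = reduce_mem_usage(df)
    return df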
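
As a rough illustration of the downcasting idea, the sketch below uses `pd.to_numeric` with its `downcast` option on a made-up toy frame. The column names and values are hypothetical and not taken from the competition data; this is an alternative way to see the effect, not the function used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "small_int": np.array([1, 2, 3, 100], dtype=np.int64),
    "big_int": np.array([1, 2, 3, 100_000], dtype=np.int64),
    "ratio": np.array([0.5, 1.5, 2.5, 3.5], dtype=np.float64),
})

before = toy.memory_usage(deep=True).sum()

# pd.to_numeric picks the smallest dtype of the requested kind that holds every value.
toy["small_int"] = pd.to_numeric(toy["small_int"], downcast="integer")
toy["big_int"] = pd.to_numeric(toy["big_int"], downcast="integer")
toy["ratio"] = pd.to_numeric(toy["ratio"], downcast="float")

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print(f"{before} bytes -> {after} bytes")
```

Unlike `reduce_mem_usage`, `pd.to_numeric` never goes below `float32` for floating-point columns, which avoids the precision loss that `float16` can introduce.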

Let us see which files are available in the kernel's input directory.

In [None]:
! ls ../input/ieee-fraud-detection/

Importing the data

In [None]:
df_transacation = import_data('../input/ieee-fraud-detection/train_transaction.csv')
df_identity = import_data('../input/ieee-fraud-detection/train_identity.csv')


There are two training datasets: one with transaction data and one with identity data. Their shapes are shown below.

In [None]:
print(df_transacation.shape, df_identity.shape)

The shapes show that identity data is not available for every transaction. Let us look at both tables in more detail.

In [None]:
df_transacation.head()


There are 393 features and 1 target variable, `isFraud`.

In [None]:
df_identity.head()

The identity dataset has 40 features.

In [None]:
df_identity.set_index(['TransactionID'], inplace=True)

def get_id(x):
    """Return 1 if the transaction has a matching identity record, else 0."""
    return int(x in df_identity.index)

In [None]:
df_transacation['id_exists'] = df_transacation['TransactionID'].apply(get_id)

In [None]:
'Identity records exist for {:.2f}% of transactions'.format(df_transacation['id_exists'].mean() * 100)
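
The same membership check can be done without a per-row Python function. The sketch below is a minimal example on two hypothetical miniature tables (the IDs and the `DeviceType` values are made up) and uses `Series.isin` to flag every transaction at once.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
transactions = pd.DataFrame({"TransactionID": [1, 2, 3, 4, 5]})
identity = pd.DataFrame({
    "TransactionID": [2, 5],
    "DeviceType": ["mobile", "desktop"],
})

# isin checks membership for the whole column in one vectorised call.
transactions["id_exists"] = (
    transactions["TransactionID"].isin(identity["TransactionID"]).astype(int)
)

coverage = transactions["id_exists"].mean() * 100
print(f"Identity records exist for {coverage:.2f}% of transactions")
```

In this notebook `TransactionID` has been moved to the index of `df_identity`, so the equivalent call would pass `df_identity.index` to `isin`.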

# High level Data Analysis
Let us find out how many transactions are fraudulent.

In [None]:
n_fraud = (df_transacation['isFraud'] == 1).sum()
print('Fraudulent transactions are {:.2f}%'.format(n_fraud / df_transacation.shape[0] * 100))
print('with {} fraudulent transactions'.format(n_fraud))

This shows that the dataset is highly imbalanced.
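
One way to quantify the imbalance is the ratio of negative to positive examples. The sketch below uses made-up labels (2 fraudulent out of 50) purely to show the calculation; the numbers are not the dataset's. The XGBoost documentation suggests a ratio of this kind as a typical value for its `scale_pos_weight` parameter.

```python
import pandas as pd

# Made-up labels: 2 fraudulent out of 50, only to illustrate the calculation.
is_fraud = pd.Series([1] * 2 + [0] * 48, name="isFraud")

# Share of each class in the data.
class_share = is_fraud.value_counts(normalize=True)

# Ratio of negatives to positives.
n_neg = int((is_fraud == 0).sum())
n_pos = int((is_fraud == 1).sum())
imbalance_ratio = n_neg / n_pos

print(class_share)
print("negatives per positive:", imbalance_ratio)
```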

In [None]:

# Summarise each column: dtype, number of null values and percentage of null values.
# DataFrame.append was removed in pandas 2.0, so the table is built in one step.
null_counts = df_transacation.isnull().sum()
tab_info = pd.DataFrame({
    'column type': df_transacation.dtypes,
    'null values (nb)': null_counts,
    'null values (%)': null_counts / df_transacation.shape[0] * 100,
})
tab_info

Many of the features have more than 86% null values, so they can be removed. Let us count the columns with more than 30% null values.

In [None]:
'There are {} columns which have more than {}% null values'.format(tab_info[tab_info['null values (%)'] > 30].shape[0], 30)

Let us remove them from our dataset

In [None]:
# Drop every column whose share of null values exceeds 30%
cols_to_remove = tab_info[tab_info['null values (%)'] > 30].index.values
df_transacation = df_transacation.drop(cols_to_remove, axis=1)

Next, we handle the remaining null values by dropping every row that still contains one.


In [None]:
df  = df_transacation.dropna()
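
The two cleaning steps above (drop sparse columns, then drop incomplete rows) can be seen on a small scale in the sketch below. The toy frame and its column names are hypothetical; the 30% threshold mirrors the one used in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],        # no nulls
    "b": [np.nan, np.nan, 3.0, 4.0],  # 50% null
    "c": [1.0, np.nan, 3.0, 4.0],     # 25% null
})

# isnull().mean() gives the fraction of nulls per column.
null_pct = toy.isnull().mean() * 100

# Step 1: keep only columns at or below the threshold.
threshold = 30
kept = toy.loc[:, null_pct <= threshold]

# Step 2: drop the rows that still contain a null.
complete_rows = kept.dropna()

print(null_pct)
print(complete_rows)
```

Dropping rows is the simplest option, but it discards data; imputing the missing values would be an alternative worth comparing.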

# Feature encoding

Let us first identify the categorical features that need to be transformed.

In [None]:
y = df['isFraud']
X = df.drop(['isFraud'], axis=1)

In [None]:
categorical_feature_mask = X.dtypes=='category'
categorical_cols = X.columns[categorical_feature_mask].tolist()
categorical_cols

The features listed above are categorical and will be one-hot encoded.

In [None]:
X_cat = X[categorical_cols]
X_cat = pd.get_dummies(X_cat)
X = X.drop(categorical_cols,axis=1)
X = pd.concat([X,X_cat], axis=1)
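
To make the effect of one-hot encoding concrete, here is a minimal sketch on a toy frame. The values are made up; `ProductCD` is used only as an example of a categorical column name.

```python
import pandas as pd

toy = pd.DataFrame({
    "ProductCD": pd.Categorical(["W", "C", "W"]),
    "amount": [10.0, 20.0, 30.0],
})

# Passing `columns` encodes only the listed columns and keeps the rest,
# which combines the drop and concat steps into one call.
encoded = pd.get_dummies(toy, columns=["ProductCD"])
print(encoded)
```

Each category becomes its own indicator column, so a feature with many distinct values can widen the table considerably.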

In [None]:
# Keep the column names for later, then display the first rows
X_cols = X.columns
X.head()

# Feature analysis

## Correlation Analysis

Before scoring the features, let us rescale them to values between 0 and 1. The chi-squared test used below requires non-negative inputs.

In [None]:
scaler = MinMaxScaler()
# Keep the column names and index so the scaled frame stays readable
X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
X.head()
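
Min-max scaling maps each column to the range 0 to 1 via `(x - min) / (max - min)`. The sketch below checks this formula against `MinMaxScaler` on a small made-up array and then runs `chi2` on the scaled values, which is possible because they are no longer negative.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Made-up data: the first column contains a negative value.
X_toy = np.array([[-2.0, 10.0],
                  [0.0, 20.0],
                  [2.0, 40.0]])
y_toy = np.array([0, 0, 1])

X_scaled = MinMaxScaler().fit_transform(X_toy)

# The same result computed by hand, column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# chi2 accepts the scaled data because every value is now non-negative.
scores, p_values = chi2(X_scaled, y_toy)
print(X_scaled)
print(scores)
```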

In [None]:
# Score every feature against the target with the chi-squared test
bestfeatures = SelectKBest(chi2, k=20)
fit = bestfeatures.fit(X, y)
# Pair each score with its feature name and rank from highest to lowest
featureScores = pd.DataFrame({'Specs': X_cols, 'Score': fit.scores_})
featureScores = featureScores.sort_values(by='Score', ascending=False)
featureScores.head(20)

In [None]:
# Score every feature against the target with the ANOVA F-test
bestfeatures = SelectKBest(f_classif, k=20)
fit = bestfeatures.fit(X, y)
# Pair each score with its feature name and rank from highest to lowest
featureScores1 = pd.DataFrame({'Specs': X_cols, 'Score': fit.scores_})
featureScores1 = featureScores1.sort_values(by='Score', ascending=False)
featureScores1.head(20)

Features common to both the chi2 and ANOVA rankings:
```
155	V281	6528.623034
182	V308	5564.151940
154	V280	5471.553775
153	V279	5185.097715
169	V295	4804.288722
167	V293	4173.462409
191	V317	3029.784155
```

These features will be considered for further model building
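
The overlap between two rankings can also be computed instead of read off by eye. The sketch below is a minimal example with made-up scores and hypothetical feature names; it reuses the `Specs` and `Score` column names from the score tables above.

```python
import pandas as pd

# Made-up score tables with the same layout as the ranking tables above.
chi2_scores = pd.DataFrame({
    "Specs": ["V1", "V2", "V3", "V4"],
    "Score": [9.0, 7.0, 1.0, 5.0],
})
anova_scores = pd.DataFrame({
    "Specs": ["V1", "V2", "V3", "V4"],
    "Score": [2.0, 8.0, 6.0, 1.0],
})

k = 2  # number of top features to take from each ranking

top_chi2 = set(chi2_scores.nlargest(k, "Score")["Specs"])
top_anova = set(anova_scores.nlargest(k, "Score")["Specs"])

# Features that appear in the top k of both rankings.
common = sorted(top_chi2 & top_anova)
print(common)
```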

# Model Building
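Because this is credit card fraud data, missing a fraudulent transaction is typically costlier than raising a false alarm, so recall is the metric to focus on. The models are evaluated with an F-beta score using a high beta, which weights recall far more heavily than precision.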
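
The F-beta score combines precision P and recall R as `(1 + beta^2) * P * R / (beta^2 * P + R)`. The sketch below, on made-up labels and predictions, checks this formula against scikit-learn and shows that with `beta=100` the score is almost identical to recall. The helper `f_beta` is defined here only for illustration.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Made-up labels and predictions: 3 of 4 frauds caught, 3 false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)


def f_beta(p, r, beta):
    """F-beta from precision p and recall r (illustrative helper)."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)


f1 = fbeta_score(y_true, y_pred, beta=1)
f100 = fbeta_score(y_true, y_pred, beta=100)

print("precision:", precision, "recall:", recall)
print("F1:", f1, "F100:", f100)
```

With a beta this large the precision term has almost no influence, so the score behaves like recall and says little about false alarms.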

In [None]:
y = df['isFraud']
X = df.drop(['isFraud'], axis=1)

Let us keep only the features that were found to have an impact on the target.

In [None]:
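X1 = X[['V281', 'V308','V280','V279','V295','V293','V317']]

In [None]:
# Stratify so both splits keep the same fraud ratio; the seed is an arbitrary choice for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X1, y, test_size=0.2, stratify=y, random_state=42)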

In [None]:
import lightgbm as lgb

d_train = lgb.Dataset(X_train, label=y_train)
params = {
    'learning_rate': 0.02,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',   # 'binary' is an alias for this metric
    'feature_fraction': 0.99,     # canonical name for 'sub_feature'
    'num_leaves': 500,
    'min_data_in_leaf': 100,      # canonical name for 'min_data'
    'max_depth': 10000,           # effectively unlimited depth
    'is_unbalance': True,         # reweight the classes to counter the imbalance
}
clf = lgb.train(params, d_train, num_boost_round=1000)


In [None]:
from sklearn.metrics import fbeta_score

# Predictions are probabilities; rounding applies a 0.5 decision threshold
results = clf.predict(X_test)
score1 = fbeta_score(y_test, results.round(), beta=100)
print('F-beta score (beta=100):', score1)
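
Rounding the predicted probabilities is effectively a 0.5 cut-off. Since recall is the priority here, a lower cut-off may be worth trying. The sketch below uses made-up probabilities to show how the threshold trades precision for recall; the thresholds 0.5 and 0.25 are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted fraud probabilities.
y_true = np.array([1, 1, 1, 0, 0, 0])
probs = np.array([0.9, 0.6, 0.3, 0.4, 0.2, 0.1])

metrics = {}
for threshold in (0.5, 0.25):
    # Flag a transaction as fraud when its probability reaches the threshold.
    preds = (probs >= threshold).astype(int)
    metrics[threshold] = (
        precision_score(y_true, preds),
        recall_score(y_true, preds),
    )
    print(threshold, metrics[threshold])
```

Lowering the threshold catches the third fraud case at the cost of one extra false alarm in this toy example.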

## XGBoost
Let us try XGBoost, another tree-based boosting algorithm, to see whether it performs any better.

In [None]:
import xgboost
xgb = xgboost.XGBClassifier(n_estimators=800, learning_rate=0.02, gamma=0, subsample=0.2,
                           colsample_bytree=1, max_depth=100)
xgb.fit(X_train,y_train)
results=xgb.predict(X_test)


In [None]:

# XGBClassifier.predict already returns class labels, so no rounding is needed
score1 = fbeta_score(y_test, results, beta=100)
print('F-beta score (beta=100):', score1)