# IEEE-CIS Fraud Detection competition

This is a starter notebook to help you with the competition submissions. Submissions are evaluated on the area under the ROC curve between the predicted probability and the observed target.
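
As a rough illustration of the metric, the sketch below computes the ROC-AUC on four made-up labels and scores (toy values, not competition data). The AUC equals the share of (fraud, non-fraud) pairs in which the fraud row receives the higher score, so only the ranking of the predictions matters.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores (illustrative values only)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly, so AUC = 0.75
auc = roc_auc_score(y_true, y_score)
print(f"ROC-AUC: {auc:.2f}")

# Rescaling the scores keeps the ranking, so the AUC is unchanged
auc_rescaled = roc_auc_score(y_true, [s / 10 for s in y_score])
print(f"ROC-AUC after rescaling: {auc_rescaled:.2f}")
```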

### The plan

* work in groups
* apply the knowledge and skills learned in the course:
  - normalize the data
  - deal with class imbalance
  - handle categorical features
  - discretize continuous features
  - find optimal hyperparameters using grid/random search
  - try at least three different types of estimators
  - combine multiple estimators (use weighted averaging)
* evaluation metric: roc_auc_score
* get the best score you can by 14:00
* present the methods and estimators you tried: 15 min presentation, 15 min Q&A
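
The hyperparameter search item in the plan could look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the search space is a placeholder assumption rather than a set of tuned values; on the competition data you would pass your own training features and labels instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the competition features (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=42)

# Hypothetical search space; the values are placeholders, not recommendations
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 20],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                      # number of sampled parameter combinations
    scoring="roc_auc",             # same metric as the competition
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep the fraud rate similar in every fold, which matters when positives are rare.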


## Loading libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import resample
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import time
import matplotlib.patches as mpatches
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
import warnings

In [None]:
warnings.simplefilter("ignore")

In [None]:
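def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns of df to the smallest dtype that covers their value range.

    Note: float columns may end up as float16, which keeps only about 3 significant digits.
    """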
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
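
For comparison, here is a minimal sketch of the same downcasting idea using the built-in `pd.to_numeric(..., downcast=...)` on a made-up frame. This is an alternative to the helper above, not what the notebook uses. Unlike the helper, pandas does not go below float32. The last line illustrates why float16 can be lossy.

```python
import numpy as np
import pandas as pd

# Made-up frame: one integer column that fits in int8, one float64 column
demo = pd.DataFrame({
    "small_int": np.arange(1000, dtype="int64") % 100,
    "amount": np.arange(1000) * 0.5,
})
before = demo.memory_usage(deep=True).sum()

demo["small_int"] = pd.to_numeric(demo["small_int"], downcast="integer")
demo["amount"] = pd.to_numeric(demo["amount"], downcast="float")  # float32 at the smallest
after = demo.memory_usage(deep=True).sum()

print(demo.dtypes.to_dict(), f"{before} -> {after} bytes")

# float16 cannot represent 2049 exactly; it rounds to 2048
rounded = float(np.float16(2049.0))
print(rounded)
```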

## Loading data

### Load train data

In [None]:
%%time 
train_transactions=pd.read_csv('../input/train_transaction.csv')
train_identity=pd.read_csv('../input/train_identity.csv')
print('Train data set is loaded !')

In [None]:
train_transactions.head()

In [None]:
train_transactions.info()

In [None]:
train_identity.head()

In [None]:
train_identity.info()

In [None]:
sns.countplot(x=train_transactions["isFraud"]);

- There is clearly a class imbalance problem.
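
To put a number on the imbalance rather than reading it off the plot, something like the following could be used. The label counts here are invented for illustration; on the real data you would apply the same calls to `train_transactions["isFraud"]`.

```python
import pandas as pd

# Invented labels for illustration: 965 legitimate rows, 35 fraud rows
is_fraud = pd.Series([0] * 965 + [1] * 35, name="isFraud")

counts = is_fraud.value_counts()
fraud_rate = is_fraud.mean()               # share of rows labelled as fraud
imbalance_ratio = counts[0] / counts[1]    # legitimate rows per fraud row
print(f"fraud rate: {fraud_rate:.1%}, ratio: {imbalance_ratio:.1f}:1")
```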

### Merging training data

In [None]:
train_df = train_transactions.merge(train_identity, how="left", on="TransactionID")

print('Train shape',train_df.shape)

print("Data set merged ")

del train_transactions, train_identity
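
A left merge keeps every transaction row, including those with no identity record. The toy sketch below (made-up values) shows the effect, and uses the `indicator=True` option to count how many rows found a match.

```python
import pandas as pd

# Tiny stand-ins for the transaction and identity tables (made-up values)
transactions = pd.DataFrame({"TransactionID": [1, 2, 3],
                             "TransactionAmt": [10.0, 25.5, 7.0]})
identity = pd.DataFrame({"TransactionID": [2], "DeviceType": ["mobile"]})

merged = transactions.merge(identity, how="left", on="TransactionID", indicator=True)
print(merged)

# Rows without an identity record keep their transaction columns and get NaN elsewhere
share_with_identity = (merged["_merge"] == "both").mean()
print(f"share of transactions with identity data: {share_with_identity:.2f}")
```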

### Reducing training memory usage

In [None]:
%%time
train_df = reduce_mem_usage(train_df)

### Load test data

In [None]:
%%time 
test_transaction=pd.read_csv('../input/test_transaction.csv')
test_identity=pd.read_csv('../input/test_identity.csv')
sample_submission=pd.read_csv('../input/sample_submission.csv')
print('Test data set is loaded !')
                              

### Merging testing data

In [None]:
%%time
test_df = test_transaction.merge(test_identity, how="left", on="TransactionID")

print('Test shape',test_df.shape)

print("Data set merged ")

del test_transaction, test_identity

### Reducing testing memory usage

In [None]:
%%time
test_df = reduce_mem_usage(test_df)

In [None]:
# fix column names: replace '-' with '_' so the test columns match the train columns
test_df = test_df.rename(columns=lambda x: x.replace("-", "_"))

# set TransactionID as index
train_df.set_index('TransactionID', inplace=True)
test_df.set_index('TransactionID', inplace=True)

# Start working from here

### Replace missing values and use label encoder for categorical variables

In [None]:
%%time

# Replace missing values with -999
train_df = train_df.fillna(-999)
test_df = test_df.fillna(-999)

In [None]:
%%time

# Label Encoding
# Fit one encoder per column on the combined train and test values so that
# each category maps to the same integer code in both sets.
for f in test_df.columns:
    if train_df[f].dtype == 'object' or test_df[f].dtype == 'object':
        lbl = LabelEncoder()
        lbl.fit(list(train_df[f].values) + list(test_df[f].values))
        train_df[f] = lbl.transform(list(train_df[f].values))
        test_df[f] = lbl.transform(list(test_df[f].values))
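
`LabelEncoder` assigns integer codes by sorted category order, so codes are comparable between train and test only when both are transformed by an encoder fitted on the same set of categories. A toy sketch with invented category values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented category values for illustration
train_col = pd.Series(["visa", "mastercard", "visa"])
test_col = pd.Series(["mastercard", "discover"])

# Separate encoders: the code for a category depends on what else is in each set
code_in_train = LabelEncoder().fit(train_col).transform(["mastercard"])[0]
code_in_test = LabelEncoder().fit(test_col).transform(["mastercard"])[0]
print(code_in_train, code_in_test)  # the same category gets two different codes

# Shared encoder: fitted once on all values, then applied to both sets
shared = LabelEncoder().fit(pd.concat([train_col, test_col]))
train_codes = shared.transform(train_col)
test_codes = shared.transform(test_col)
print(train_codes, test_codes)
```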

- We will now split the train dataset into train and validation sets (20% of data for validation)

In [None]:
y = train_df['isFraud'].astype('uint8')
X_train, X_test, y_train, y_test = train_test_split(
    train_df.drop('isFraud', axis=1), y,
    test_size=.2, random_state=42, stratify=y)
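
The `stratify` argument keeps the fraud rate the same in both parts of the split, which is worth having when positives are rare. A small check on synthetic labels (assumed 5% positives):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 950 negatives and 50 positives (assumed ratio for illustration)
y_demo = np.array([0] * 950 + [1] * 50)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo)

# Both parts keep the 5% positive rate
print(y_tr.mean(), y_val.mean())
```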

In [None]:
df_majority = X_train[y_train == 0]
df_minority = X_train[y_train == 1]
size_minor = len(df_minority)

y_majority = y_train[y_train == 0]
y_minority = y_train[y_train == 1]

# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                   replace=False,    # sample without replacement
                   n_samples=size_minor,    # to match minority class
                   random_state=42)  # reproducible results

X_down_train = pd.concat([df_minority, df_majority_downsampled])
# print(X_down_train.shape)

y_majority_downsampled = y_majority[df_majority_downsampled.index]
y_down_train = pd.concat([y_minority, y_majority_downsampled])
# y_down_train.value_counts()

sns.countplot(x = y_down_train);
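
The downsampling step can also be written as one helper that picks the row labels to keep and then selects features and target together, which makes it harder for the two to get out of sync. This is a sketch under assumptions: `downsample_majority` is a hypothetical name, the index must be unique, label 0 is taken to be the majority class, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def downsample_majority(X, y, random_state=42):
    """Cut the majority class (label 0) down to the minority size (hypothetical helper)."""
    minority_idx = y.index[y == 1]
    majority_idx = y.index[y == 0]
    rng = np.random.default_rng(random_state)
    # Sample majority row labels without replacement
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = minority_idx.append(pd.Index(kept_majority))
    return X.loc[keep], y.loc[keep]

# Synthetic data: 10 positives, 90 negatives, with a non-trivial index
X_demo = pd.DataFrame({"f": range(100)}, index=range(1000, 1100))
y_demo = pd.Series([1] * 10 + [0] * 90, index=X_demo.index)

X_bal, y_bal = downsample_majority(X_demo, y_demo)
print(y_bal.value_counts().to_dict())
```

An option that avoids discarding rows altogether is to pass `class_weight="balanced"` to estimators that support it, such as `RandomForestClassifier` and `LogisticRegression`.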

## Fit the model

In [None]:
clf_rf_down = RandomForestClassifier(random_state=42)
model_rf_down = clf_rf_down.fit(X_down_train, y_down_train)
# y_pred = model_rf_down.predict(X_test)
y_prob_rf = model_rf_down.predict_proba(X_test)[:, 1]
print(f'ROC-AUC score: {roc_auc_score(y_test, y_prob_rf):.3f}')

predictions_rf = clf_rf_down.predict_proba(test_df)[:,1]

## Alternative model

In [None]:
from sklearn.linear_model import LogisticRegression
logit_model = LogisticRegression().fit(X_down_train, y_down_train)

y_prob_logit = logit_model.predict_proba(X_test)[:, 1]
print(f'ROC-AUC score: {roc_auc_score(y_test, y_prob_logit):.3f}')

predictions_logit = logit_model.predict_proba(test_df)[:,1]
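
Logistic regression is sensitive to feature scale, so the data normalization item from the plan is likely to matter most for this model. A minimal sketch of scaling inside a pipeline, on synthetic data with one deliberately inflated feature (all names and values here are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption), with one feature on a much larger scale
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_clusters_per_class=1,
    weights=[0.9, 0.1], random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

# The scaler is fitted on the training part only, then reused for validation data
scaled_logit = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logit.fit(X_tr, y_tr)

auc_scaled = roc_auc_score(y_val, scaled_logit.predict_proba(X_val)[:, 1])
print(f"ROC-AUC with scaling: {auc_scaled:.3f}")
```

Keeping the scaler inside the pipeline means the test set is transformed with statistics learned from the training data.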

## Prepare submission file

In [None]:
# weight different model predictions
weights = [0.1, 0.9]

y_prob_joined = (weights[0]*y_prob_logit + weights[1]*y_prob_rf)/sum(weights)
print(f'ROC-AUC score: {roc_auc_score(y_test, y_prob_joined):.3f}')
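
Instead of fixing the weights by hand, one could scan a few candidate weights on the validation predictions and keep the best one. The sketch below uses two synthetic score vectors as stand-ins for `y_prob_logit` and `y_prob_rf`. Weights chosen this way are tuned to the validation set, so the reported score will be slightly optimistic.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=500)

# Two noisy synthetic score vectors standing in for two models' validation predictions
p_a = np.clip(0.5 * y_val + rng.normal(0.25, 0.3, size=500), 0, 1)
p_b = np.clip(0.5 * y_val + rng.normal(0.25, 0.2, size=500), 0, 1)

# w is the weight of model A; model B gets 1 - w
candidates = np.linspace(0, 1, 11)
aucs = [roc_auc_score(y_val, w * p_a + (1 - w) * p_b) for w in candidates]

best_w = candidates[int(np.argmax(aucs))]
print(f"best weight for model A: {best_w:.1f}, ROC-AUC: {max(aucs):.3f}")
```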

In [None]:
predictions_joined = (weights[0]*predictions_logit + weights[1]*predictions_rf)/sum(weights)

submission = pd.DataFrame({'TransactionID':test_df.index,'isFraud':predictions_joined})
submission["TransactionID"]=submission["TransactionID"].astype(int)
submission.head()

In [None]:
filename = 'joined_model_submission.csv'
submission.to_csv(filename, index=False)
print(f'Saved file: {filename}')
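
Before uploading, it can be worth running a few sanity checks on the file. `check_submission` below is a hypothetical helper, and the frames are small stand-ins; it assumes the sample submission has the same `TransactionID` and `isFraud` columns as the frame built above.

```python
import numpy as np
import pandas as pd

def check_submission(submission, sample_submission):
    """Hypothetical helper: raise AssertionError if the file looks malformed."""
    assert list(submission.columns) == list(sample_submission.columns), "column names differ"
    assert len(submission) == len(sample_submission), "row count differs"
    assert submission["TransactionID"].is_unique, "duplicate TransactionID"
    # between() is False for NaN, so this also catches missing predictions
    assert submission["isFraud"].between(0, 1).all(), "probabilities outside [0, 1] or NaN"
    return True

# Stand-in frames with made-up values
sample_demo = pd.DataFrame({"TransactionID": [101, 102, 103], "isFraud": [0.5, 0.5, 0.5]})
good = pd.DataFrame({"TransactionID": [101, 102, 103], "isFraud": [0.02, 0.91, 0.10]})
bad = pd.DataFrame({"TransactionID": [101, 102, 103], "isFraud": [0.02, np.nan, 1.30]})

print(check_submission(good, sample_demo))
```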

## Make Submission

Once you have finished and titled your kernel, press [Save Version] > [Save & Run All (Commit)] in the top right corner of the editor screen. When your code has finished running, go to the viewer and select the data tab, where the saved files should be located. Select the relevant submission file (the csv file you saved) and press the [Submit] button. Once the submission file is scored, you can check the results under [My Submissions] and see how well you did relative to other participants on the [Leaderboard].

Your [Private Score] should be better than the score of this starter notebook, which is 0.854279.