### The 'IEEE-CIS' dataset uses a lot of memory, so this kernel explores how to handle high-memory datasets

 # 'IEEE-CIS' Fraud Detection Data
 
 
 
 #### The data is split into two files, identity and transaction, which are joined on TransactionID.
 
 #### Categorical Features - Transaction

1. ProductCD
2. emaildomain
3. card1 - card6
4. addr1, addr2
5. P_emaildomain
6. R_emaildomain
7. M1 - M9

#### Categorical Features - Identity

1. DeviceType
2. DeviceInfo
3. id_12 - id_38
 
 
 
 


In [None]:
import pandas as pd
import numpy as np
import gc
import os

In [None]:
# Kaggle input path
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Read train(transaction and identity) and test(transaction and identity) data

In [None]:
# Read train data
train_trans = pd.read_csv("/kaggle/input/ieee-fraud-detection/train_transaction.csv")
train_identity = pd.read_csv("/kaggle/input/ieee-fraud-detection/train_identity.csv")

# Read test data
test_trans = pd.read_csv("/kaggle/input/ieee-fraud-detection/test_transaction.csv")
test_identity = pd.read_csv("/kaggle/input/ieee-fraud-detection/test_identity.csv")
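
One way to lower memory already at load time is to pass explicit column types to `read_csv`. The sketch below uses a tiny in-memory CSV in place of the real files; the sample values and the dtype choices are illustrative assumptions, not settings checked against the competition data.

```python
import io
import pandas as pd

# Stand-in for a competition CSV (made-up values).
csv_text = "TransactionID,TransactionAmt,ProductCD\n1,10.5,W\n2,20.0,C\n3,7.25,W\n"

# Explicit dtypes: narrower numeric types and a categorical for repeated strings.
dtypes = {"TransactionID": "int32", "TransactionAmt": "float32", "ProductCD": "category"}
typed_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(typed_df.dtypes)
```

With the real files, the same `dtype` dictionary would be passed alongside the file path; columns not listed keep pandas' default types.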

### Combine 'transaction' and 'identity' data

In [None]:
# Train data (Combine 'train_identity' and 'train_trans')
# Join on the TransactionID column only; 'on' cannot be combined with left_index/right_index
df_train = train_trans.merge(train_identity, how='left', on='TransactionID')

# Test data (Combine 'test_identity' and 'test_trans')
df_test = test_trans.merge(test_identity, how='left', on='TransactionID')
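
A left join keeps every transaction row and fills the identity columns with NaN wherever no identity record exists. A minimal sketch on two made-up frames (the values are assumptions for illustration); `validate` is an optional pandas check that raises if TransactionID is duplicated on either side.

```python
import pandas as pd

trans = pd.DataFrame({"TransactionID": [1, 2, 3],
                      "TransactionAmt": [10.0, 20.0, 30.0]})
ident = pd.DataFrame({"TransactionID": [1, 3],
                      "DeviceType": ["mobile", "desktop"]})

# Every transaction is kept; transaction 2 has no identity row, so DeviceType is NaN.
merged = trans.merge(ident, how="left", on="TransactionID", validate="one_to_one")
print(merged)
```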

### Delete the intermediate DataFrames to free RAM, otherwise the session can crash when memory runs out

In [None]:
del train_trans, train_identity, test_trans, test_identity; x = gc.collect()

### 'train' and 'test' dataset Memory

In [None]:
print('train data memory in MB:', df_train.memory_usage().sum() / 1024**2) 
print('test data memory in MB:', df_test.memory_usage().sum() / 1024**2) 
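
Note that `memory_usage()` with default arguments counts only the 8-byte references for object (string) columns, so it can understate string-heavy frames. A small sketch on a made-up frame comparing it with `deep=True`:

```python
import pandas as pd

toy = pd.DataFrame({
    # dtype=object mirrors how read_csv traditionally stores strings
    "DeviceInfo": pd.Series(["Windows", "iOS Device", "MacOS"] * 1000, dtype=object),
    "TransactionAmt": [1.0, 2.0, 3.0] * 1000,
})

shallow_mb = toy.memory_usage().sum() / 1024**2           # references only for object columns
deep_mb = toy.memory_usage(deep=True).sum() / 1024**2     # also counts the string payloads
print(shallow_mb, deep_mb)
```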

### The train and test datasets are large and RAM is limited, so we will try to reduce their memory footprint

In [None]:
# Most often used function to reduce dataset memory

def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage of properties dataframe is :",start_mem_usg," MB")
    NAlist = [] # Keeps track of columns that have missing values filled in. 
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            
            # Print current column type
            #print("******************************")
            #print("Column: ",col)
            #print("dtype before: ",props[col].dtype)
            
            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()
            
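            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(props[col]).all(): 
                NAlist.append(col)
                # assign the result back; a chained inplace fill may leave props unchanged
                props[col] = props[col].fillna(mn - 1)
                # the fill value is now the column minimum, so a signed type is chosen when it is negative
                mn = mn - 1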
                   
            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True

            
            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64) 
            # Make float datatypes 32 bit
            else:
                props[col] = props[col].astype(np.float32)
            
            # Print new column type
            #print("dtype after: ",props[col].dtype)
            #print("******************************")
    
    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2 
    print("Memory usage is: ",mem_usg," MB")
    print("This is ",100*mem_usg/start_mem_usg,"% of the initial size")
    return props
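
An alternative that leaves missing values untouched is pandas' own `pd.to_numeric(..., downcast=...)`. The sketch below is not the function above: `downcast_numeric` is a hypothetical helper name, and the toy frame is made up. Integer columns shrink to the smallest integer type that holds their values, and float columns go to float32 while keeping NaN.

```python
import numpy as np
import pandas as pd

def downcast_numeric(frame):
    """Return a copy with each numeric column stored in a smaller dtype where possible."""
    out = frame.copy()
    for col in out.select_dtypes(include="number").columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        else:
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

toy = pd.DataFrame({
    "count": np.array([0, 5, 120], dtype="int64"),
    "amount": np.array([1.5, np.nan, 3.25], dtype="float64"),
})
small = downcast_numeric(toy)
print(small.dtypes)
```

Because NaN is preserved, float columns that are conceptually integers stay as floats here, which trades some memory for not inventing a fill value.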

In [None]:
# keep the returned frames instead of displaying them as cell output
df_train = reduce_mem_usage(df_train)
df_test = reduce_mem_usage(df_test)

### Conclusion: 
1. Train dataset memory reduced from 1955 MB to 546 MB
2. Test dataset memory reduced from 1673 MB to 459 MB

## Combine train and test data

In [None]:
df_test['isFraud'] = 'test'
# ignore_index=True gives the combined frame a fresh 0..n-1 index in one step
df = pd.concat([df_train, df_test], axis=0, sort=False, ignore_index=True)

In [None]:
del df_train, df_test; x = gc.collect()

In [None]:
df.head(2)

### Understanding the columns
1. transaction related columns
2. card related columns
3. addr,dist,domain related columns
4. C columns
5. D columns
6. M columns
7. V columns
8. others ('identity' columns along with 'device' information)

In [None]:
# Transaction columns
df.columns[0:5]

In [None]:
# Card related columns
df.columns[5:11]

In [None]:
#  addr, dist, emaildomain related columns
df.columns[11:17]

In [None]:
# C columns
df.columns[17:31]

In [None]:
# D columns
df.columns[31:46]

In [None]:
# M columns
df.columns[46:55]

In [None]:
# V columns
df.columns[55:394]

In [None]:
# Identity columns
df.columns[394:]

### Summary
**The V columns are by far the largest group (339 columns). We can either drop all of them or apply PCA to them in order to reduce the number of columns and the memory footprint.**

**In this kernel, we will apply PCA to the V columns so that most of their information is retained.**

#### Email Mappings

In [None]:
emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 
          'scranton.edu': 'other', 'optonline.net': 'other', 'hotmail.co.uk': 'microsoft',
          'comcast.net': 'other', 'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo',
          'yahoo.es': 'yahoo', 'charter.net': 'spectrum', 'live.com': 'microsoft', 
          'aim.com': 'aol', 'hotmail.de': 'microsoft', 'centurylink.net': 'centurylink',
          'gmail.com': 'google', 'me.com': 'apple', 'earthlink.net': 'other', 'gmx.de': 'other',
          'web.de': 'other', 'cfl.rr.com': 'other', 'hotmail.com': 'microsoft', 
          'protonmail.com': 'other', 'hotmail.fr': 'microsoft', 'windstream.net': 'other', 
          'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo', 'yahoo.de': 'yahoo',
          'servicios-ta.com': 'other', 'netzero.net': 'other', 'suddenlink.net': 'other',
          'roadrunner.com': 'other', 'sc.rr.com': 'other', 'live.fr': 'microsoft',
          'verizon.net': 'yahoo', 'msn.com': 'microsoft', 'q.com': 'centurylink', 
          'prodigy.net.mx': 'att', 'frontier.com': 'yahoo', 'anonymous.com': 'other', 
           'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att', 'frontiernet.net': 'yahoo', 
          'ymail.com': 'yahoo', 'outlook.com': 'microsoft', 'mail.com': 'other', 
          'bellsouth.net': 'other', 'embarqmail.com': 'centurylink', 'cableone.net': 'other', 
          'hotmail.es': 'microsoft', 'mac.com': 'apple', 'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 
          'yahoo.com': 'yahoo', 'live.com.mx': 'microsoft', 'ptd.net': 'other', 'cox.net': 'other',
          'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple'}

In [None]:
col = ['P_emaildomain', 'R_emaildomain']

for x in col:
    df[x + '_bin'] = df[x].map(emails)
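
`Series.map` with a dictionary returns NaN for any value that is not a key, so domains missing from `emails` (and rows with no domain at all) end up as NaN in the `_bin` columns. A small sketch with a two-entry subset of the mapping and a made-up domain `example.org`:

```python
import pandas as pd

emails_subset = {"gmail.com": "google", "hotmail.com": "microsoft"}  # subset for illustration
domains = pd.Series(["gmail.com", "hotmail.com", "example.org", None])

binned = domains.map(emails_subset)
print(binned)
```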

### Label Encoding

In [None]:
from sklearn import preprocessing

In [None]:
for col in df.drop('isFraud', axis = 1).columns:    
    if df[col].dtype == 'object':
        le = preprocessing.LabelEncoder()
        # fit and encode in one call; list() turns NaN into the string 'nan'
        df[col] = le.fit_transform(list(df[col].values))
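
`LabelEncoder` assigns integer codes in the sorted order of the distinct values. The sketch below shows this on a made-up column; giving missing values an explicit `'missing'` label first is an assumption of this example, not what the cell above does.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

device = pd.Series(["mobile", "desktop", None, "mobile"], dtype=object)
filled = device.fillna("missing")        # explicit label for missing values (assumption)

le = LabelEncoder()
codes = le.fit_transform(filled)

print(list(le.classes_))   # sorted distinct labels
print(codes.tolist())      # position of each value in classes_
```

The codes carry no ordering meaning beyond alphabetical position, which tree-based models such as XGBoost usually tolerate.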

In [None]:
df.memory_usage().sum() / 1024**2

In [None]:
df = reduce_mem_usage(df)

### I tried to apply PCA to the V columns on the entire dataset 'df' at once, but the session crashed due to low RAM.

### A Kaggle session has about 13 GB of RAM, so I need to split the data and perform PCA on each part
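
Another option when a full PCA does not fit in memory is scikit-learn's `IncrementalPCA`, which can be fitted batch by batch. This is only a sketch on random stand-in data; the shapes, batch size and number of components are arbitrary assumptions, and it is not what this kernel does.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
X = rng.random((1000, 50)).astype(np.float32)   # stand-in for the scaled V columns

ipca = IncrementalPCA(n_components=5)
batch_size = 200
for start in range(0, len(X), batch_size):
    # each batch must contain at least n_components rows
    ipca.partial_fit(X[start:start + batch_size])

X_reduced = ipca.transform(X)
print(X_reduced.shape)
```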

### Split the data back into train and test 

In [None]:
df_train, df_test = df[df['isFraud'] != 'test'], df[df['isFraud'] == 'test'].drop('isFraud', axis=1)

In [None]:
del df; x = gc.collect()

### Applying PCA for V columns

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import minmax_scale

In [None]:
v_columns = df_train.columns[55:394]
v_columns

In [None]:
# fill NaN values and scale each V column to the range [0, 1] with minmax_scale

# for train data
for col in v_columns:
    df_train[col] = df_train[col].fillna((df_train[col].min() - 2))
    df_train[col] = (minmax_scale(df_train[col], feature_range=(0,1)))

# for test data
for col in v_columns:
    df_test[col] = df_test[col].fillna((df_test[col].min() - 2))
    df_test[col] = (minmax_scale(df_test[col], feature_range=(0,1)))

In [None]:
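def func_pca(df, v_columns, prefix):
    
    pca = PCA(n_components = 30, random_state = 1)
    components = pca.fit_transform(df[v_columns])
    # reuse df's index so the concat below aligns row by row;
    # a default 0..n-1 index does not match the test rows, whose index starts after the train rows
    pca_df = pd.DataFrame(components, index=df.index)
    df.drop(v_columns, axis=1, inplace=True)
    pca_df.rename(columns=lambda x: str(prefix)+str(x), inplace=True)
    df = pd.concat([df, pca_df], axis=1)
    
    return df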

In [None]:
train = func_pca(df_train, v_columns = v_columns, prefix = 'PCA_V_')

In [None]:
del df_train; x= gc.collect()

In [None]:
test = func_pca(df_test, v_columns = v_columns, prefix = 'PCA_V_')

In [None]:
del df_test; x= gc.collect()
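
The cells above scale and fit PCA separately on train and test, so the resulting components are not guaranteed to describe the same directions in both frames. A common alternative, sketched here on random stand-in data, is to fit the scaler and the PCA on the training rows only and reuse them for the test rows. The frame names, `cols_v` and the helper `project` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(1)
cols_v = [f"V{i}" for i in range(1, 11)]     # stand-in for the V columns
toy_train = pd.DataFrame(rng.random((200, 10)), columns=cols_v)
toy_test = pd.DataFrame(rng.random((80, 10)), columns=cols_v, index=range(200, 280))

# Fit both transformers on the training rows only.
scaler = MinMaxScaler().fit(toy_train[cols_v])
pca = PCA(n_components=3, random_state=1).fit(scaler.transform(toy_train[cols_v]))

def project(frame):
    """Apply the already fitted scaler and PCA, keeping the frame's own index."""
    comps = pca.transform(scaler.transform(frame[cols_v]))
    names = [f"PCA_V_{i}" for i in range(comps.shape[1])]
    return pd.DataFrame(comps, columns=names, index=frame.index)

train_pca, test_pca = project(toy_train), project(toy_test)
print(train_pca.shape, test_pca.shape)
```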

In [None]:
train.info()

In [None]:
test.info()

### Summary

**The test dataset still takes about 1.2 GB of memory, so the function used earlier to reduce memory has not been effective here.**

**A simpler approach is to convert float64 columns to float32 and int64 columns to int32.**

In [None]:
for col in test.columns:
    if test[col].dtype=='float64': test[col] = test[col].astype('float32')
    if test[col].dtype=='int64': test[col] = test[col].astype('int32')

In [None]:
for col in train.columns:
    if train[col].dtype=='float64': train[col] = train[col].astype('float32')
    if train[col].dtype=='int64': train[col] = train[col].astype('int32')      
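
The same conversion can be written without a per-column loop by building one dtype mapping and calling `astype` once. A sketch on a made-up frame; note that int32 only holds values up to 2,147,483,647, so this assumes no column exceeds that range.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": np.arange(3, dtype="int64"),
    "b": np.array([0.5, 1.5, 2.5], dtype="float64"),
    "c": ["x", "y", "z"],                     # non-numeric columns are left alone
})

target = {c: "float32" for c in toy.select_dtypes(include="float64").columns}
target.update({c: "int32" for c in toy.select_dtypes(include="int64").columns})
toy = toy.astype(target)
print(toy.dtypes)
```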

In [None]:
test.info()

### Conclusion: The simple approach reduces the test dataset memory by almost 50%

In [None]:
train.head(2)

### Build the model

In [None]:
# split the train data into 'training' and 'validation' sets.

# train index
idxT = train.index[:3*len(train)//4]

# Validation index
idxV = train.index[3*len(train)//4:]

In [None]:
# only X columns
cols = train.columns.difference(['isFraud'])
cols

In [None]:
# Model
import xgboost as xgb

In [None]:
# xgb.XGBClassifier?

In [None]:
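# newer xgboost versions expect early_stopping_rounds on the estimator instead of in fit()
clf = xgb.XGBClassifier(n_estimators = 300, eval_metric = 'auc', early_stopping_rounds = 100)

In [None]:
X_train = train.loc[idxT, cols]
# isFraud became an object column when the 'test' marker was added, so cast it back to integers
y_train = train.loc[idxT, 'isFraud'].astype('int8')

X_val = train.loc[idxV, cols]
y_val = train.loc[idxV, 'isFraud'].astype('int8')

In [None]:
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=50)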

## Conclusion: 

#### Without any feature engineering or model tuning, we have achieved a validation AUC of around 0.90 (which is not bad). 
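
Fraud labels are typically heavily imbalanced, so accuracy and AUC can tell very different stories. A toy illustration with made-up labels (about 4% positives is an assumption for the example): a "model" that never flags fraud scores high on accuracy but no better than chance on AUC.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 96 + [1] * 4)      # made-up labels, 4% positives
constant_scores = np.zeros(100)            # every row gets the same fraud score

acc = accuracy_score(y_true, (constant_scores > 0.5).astype(int))
auc = roc_auc_score(y_true, constant_scores)
print(acc, auc)
```

This is why the model above is monitored with `eval_metric = 'auc'` rather than accuracy.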

#### Now the model is ready and we can predict the 'Y' for test data.

### References for feature engineering, EDA and some advanced techniques.

https://www.kaggle.com/cdeotte/xgb-fraud-with-magic-0-9600

https://www.kaggle.com/alijs1/ieee-transaction-columns-reference

https://www.kaggle.com/kabure/extensive-eda-and-modeling-xgb-hyperopt/comments

#### Thank you for reading the kernel. I hope you find it useful :)
