<a href="https://www.kaggle.com/code/samyamaryal1/ieee-cis?scriptVersionId=133908417" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/ieee-fraud-detection/sample_submission.csv
/kaggle/input/ieee-fraud-detection/test_identity.csv
/kaggle/input/ieee-fraud-detection/train_identity.csv
/kaggle/input/ieee-fraud-detection/test_transaction.csv
/kaggle/input/ieee-fraud-detection/train_transaction.csv


In [2]:
# Importing the datasets
train_identity = pd.read_csv('/kaggle/input/ieee-fraud-detection/train_identity.csv')
train_transaction = pd.read_csv('/kaggle/input/ieee-fraud-detection/train_transaction.csv')

The data is broken into two files: identity and transaction, which share a common column *TransactionID*. Not all transactions have corresponding identity information.

# Thorough inspection of all the datasets.

In [3]:
train_identity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144233 entries, 0 to 144232
Data columns (total 41 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   TransactionID  144233 non-null  int64  
 1   id_01          144233 non-null  float64
 2   id_02          140872 non-null  float64
 3   id_03          66324 non-null   float64
 4   id_04          66324 non-null   float64
 5   id_05          136865 non-null  float64
 6   id_06          136865 non-null  float64
 7   id_07          5155 non-null    float64
 8   id_08          5155 non-null    float64
 9   id_09          74926 non-null   float64
 10  id_10          74926 non-null   float64
 11  id_11          140978 non-null  float64
 12  id_12          144233 non-null  object 
 13  id_13          127320 non-null  float64
 14  id_14          80044 non-null   float64
 15  id_15          140985 non-null  object 
 16  id_16          129340 non-null  object 
 17  id_17          139369 non-nul

In [4]:
train_transaction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 394 entries, TransactionID to V339
dtypes: float64(376), int64(4), object(14)
memory usage: 1.7+ GB


The train_transaction dataset has 394 columns.
According to the data description by IEEE,
* TransactionDT - timedelta from reference timestamp
* TransactionAMT - txn amount
* ProductCD - Product code
* card1-6 - payment card information (card type, category, issue bank, country,
* addr
* dist - distances between billing and mailing address
* emaildomain (Purchaser and Recipient)
* C1-14 - "counting"
* D1-15 - timedelta
* M1-9 - match
* Vxxx - Vesta engineered features, over 300 of these. (V1-V339)

- id01 to id11 are numerical features for identity, which is collected by Vesta and security partners such as device rating, ip_domain rating, proxy rating, etc.
- id_12 - id_38 are categorical features.

# Rename the two columns card4 and card6, because we can infer its meaning from the above dataframe.

In [5]:
train_transaction.loc[:, 'card1':'card6']

Unnamed: 0,card1,card2,card3,card4,card5,card6
0,13926,,150.0,discover,142.0,credit
1,2755,404.0,150.0,mastercard,102.0,credit
2,4663,490.0,150.0,visa,166.0,debit
3,18132,567.0,150.0,mastercard,117.0,debit
4,4497,514.0,150.0,mastercard,102.0,credit
...,...,...,...,...,...,...
590535,6550,,150.0,visa,226.0,debit
590536,10444,225.0,150.0,mastercard,224.0,debit
590537,12037,595.0,150.0,mastercard,224.0,debit
590538,7826,481.0,150.0,mastercard,224.0,debit


In [6]:
# Renaming columns card4 and card6 as 'issuer' and 'type'
train_transaction.rename({'card4':'issuer', 'card6':'type'}, axis=1, inplace=True)

# Function to remove null columns

In [7]:
threshold=100
print(f"LETS try {threshold} raw strings")

LETS try 100 raw strings


In [8]:
def drop_missing_values(original_df, missing_df):
    threshold=85
    extra = missing_df[missing_df['value']>threshold]
    index = list(extra.reset_index()['index'])
    print(len(index), f"columns have over {threshold}% missing values")
    original_df.drop(index, axis=1, inplace=True)
    return original_df

# Find the percentage of missing values in the dataframes *train_transaction* and *train_identity*.

In [9]:
# Determining the number of missing values in the train_transaction column, except the 'V' columns
transaction_missing_values = pd.DataFrame(train_transaction.loc[:, :'M9'].isnull().sum() * 100 / train_transaction.shape[0], columns=['value'])
px.bar(transaction_missing_values.sort_values(by='value', ascending=False))

**DROP ALL COLUMNS WITH OVER 85% MISSING VALUES**

In [10]:
train_transaction = drop_missing_values(train_transaction, transaction_missing_values)
#Delete variable due to high memory consumption
del transaction_missing_values

8 columns have over 85% missing values


In [11]:
train_transaction.shape

(590540, 386)

Columns dropped as expected.

**This plot are for all columns of the train_transaction dataset except the *v* columns.**

In [12]:
identity_missing_values = pd.DataFrame((train_identity.isnull().sum() * 100 / train_identity.shape[0]).sort_values(ascending=False), columns=['value'])
px.bar(identity_missing_values)

In [13]:
train_identity = drop_missing_values(train_identity, identity_missing_values)
del identity_missing_values

9 columns have over 85% missing values


**Columns with over 85% of missing values may be dropped.**

Determining the ratio of missing values for every column in the dataframe to find out how to deal with it. I've excluded the 'V' column because those are engineered features and are extremely sparse.

The 'V' columns have a lot of NaN values. Let's find out.

In [14]:
v_columns = train_transaction.loc[:, 'V1':'V339']

In [15]:
def correlated_features(dataframe, correlation_threshold=0.85):
    # Generate the correlation matrix
    corr_matrix = dataframe.corr()

    # Get the column names from the correlation matrix
    columns = corr_matrix.columns

    # Initialize an empty list to store the lists of correlated columns
    correlated_columns = []

    # Initialize a set to keep track of visited columns
    visited_columns = set()

    # Iterate over the columns
    for i in range(len(columns)):
        # Skip the column if it has already been visited
        if columns[i] in visited_columns:
            continue

        # Create a new list for the current correlated group
        correlated_group = [columns[i]]

        # Iterate over the remaining columns
        for j in range(i + 1, len(columns)):
            # Check if the correlation between the columns exceeds the threshold
            if abs(corr_matrix.iloc[i, j]) > correlation_threshold:
                # Add the correlated column to the group
                correlated_group.append(columns[j])

                # Add the correlated column to the visited set
                visited_columns.add(columns[j])

        # Add the correlated group to the list
        correlated_columns.append(correlated_group)

    return correlated_columns

In [16]:
v_columns

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
590535,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,...,,,,,,,,,,
590536,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,,,,,,,,,,
590537,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,,,,,,,,,,
590538,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,0.0,...,,,,,,,,,,


Obviously, we can't print out each individual record and check them for NaN values. But based on the above output, this is an extremely sparse slice of dataframe. Let's determine the count of NaN values using the isnull() method.

In [17]:
v_null_values = v_columns.isnull().sum() * 100/ v_columns.shape[0]

In [18]:
px.bar(v_null_values.sort_values(ascending=False), labels='Null count')

**The columns have varying amounts of missing values, and V322-V339 have the most NaN values.**

### **The best way to deal with these missing values is by imputation. We'll replace missing values with the placeholder value: -999.** ###

In [19]:
# v_columns.loc[:, 'V1':'V339'].fillna(-999, inplace=True)

In [20]:
# v_columns

# Correlation

In [21]:
# correlation_matrix = v_columns.corr(numeric_only=True)
# px.imshow(correlation_matrix)

In [22]:
#t = correlated_features(correlation_matrix, 0.8)

In [24]:
temp_cols = v_columns.copy()

In [58]:
def correlated_features(corr_matrix, correlation_threshold):
    correlated_list_mine = list()
    # Get the column names from the correlation matrix
    columns = corr_matrix.columns

    # Initialize an empty list to store the lists of correlated columns
    correlated_columns = []

    # Initialize a set to keep track of visited columns
    visited_columns = set()

    # Iterate over the columns
    for i in range(len(columns)):
        # Skip the column if it has already been visited
        if columns[i] in visited_columns:
            continue

        # Create a new list for the current correlated group
        correlated_group = [columns[i]]

        # Iterate over the remaining columns
        for j in range(i+1, len(columns)):
            # Check if the correlation between the columns exceeds the threshold
            if corr_matrix.iloc[i, j] > correlation_threshold:
                # Add the correlated column to the group
                correlated_group.append(columns[j])
        
                # Add the correlated column to the visited set
                visited_columns.add(columns[j])

        # Add the correlated group to the list
        correlated_columns.append(correlated_group)
        print("correlation for ",i+1, correlated_group)
        print("V"+str(i+1))
        correlated_list_mine.append("V"+str(i+1))
    # Print the list of lists containing correlated columns
    #for group in correlated_columns:
     #   print(group)

    return correlated_list_mine, correlated_columns

In [61]:
listofcorr, v_new = correlated_features(temp_cols, 0.9)

correlation for  1 ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V12', 'V13', 'V14', 'V19', 'V20', 'V23', 'V24', 'V25', 'V26', 'V53', 'V54', 'V55', 'V56', 'V61', 'V62', 'V65', 'V66', 'V67', 'V75', 'V76', 'V77', 'V78', 'V86', 'V87', 'V88', 'V96', 'V102', 'V107', 'V108', 'V109', 'V110', 'V111', 'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V123', 'V124', 'V125', 'V127', 'V133', 'V282', 'V283', 'V290', 'V291', 'V292', 'V294', 'V305', 'V307', 'V317']
V1
correlation for  10 ['V10', 'V12', 'V13', 'V14', 'V19', 'V20', 'V23', 'V24', 'V25', 'V26', 'V35', 'V36', 'V37', 'V38', 'V41', 'V44', 'V45', 'V46', 'V47', 'V54', 'V55', 'V56', 'V61', 'V62', 'V65', 'V66', 'V67', 'V75', 'V76', 'V77', 'V78', 'V82', 'V83', 'V86', 'V87', 'V88', 'V107', 'V108', 'V109', 'V110', 'V111', 'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V123', 'V124', 'V125', 'V290', 'V291', 'V292', 'V305', 'V314']
V10
correlation for  11 ['V1

In [63]:
print(listofcorr, "\n", len(listofcorr))

['V1', 'V10', 'V11', 'V27', 'V28', 'V48', 'V49', 'V68', 'V89', 'V101', 'V103', 'V104', 'V105', 'V106', 'V135', 'V136', 'V137', 'V138', 'V139', 'V140', 'V141', 'V142', 'V146', 'V147', 'V161', 'V162', 'V163', 'V226', 'V269', 'V325', 'V326', 'V327', 'V328', 'V329', 'V330', 'V334', 'V335', 'V336'] 
 38


In [66]:
px.imshow(train_transaction[listofcorr].corr())

In [69]:
x, y = correlated_features(train_transaction[listofcorr], 0.9)

correlation for  1 ['V1']
V1
correlation for  2 ['V10']
V2
correlation for  3 ['V11']
V3
correlation for  4 ['V27', 'V101', 'V103']
V4
correlation for  5 ['V28']
V5
correlation for  6 ['V48']
V6
correlation for  7 ['V49']
V7
correlation for  8 ['V68']
V8
correlation for  9 ['V89']
V9
correlation for  12 ['V104']
V12
correlation for  13 ['V105']
V13
correlation for  14 ['V106']
V14
correlation for  15 ['V135']
V15
correlation for  16 ['V136']
V16
correlation for  17 ['V137']
V17
correlation for  18 ['V138']
V18
correlation for  19 ['V139']
V19
correlation for  20 ['V140']
V20
correlation for  21 ['V141']
V21
correlation for  22 ['V142']
V22
correlation for  23 ['V146']
V23
correlation for  24 ['V147']
V24
correlation for  25 ['V161']
V25
correlation for  26 ['V162']
V26
correlation for  27 ['V163']
V27
correlation for  28 ['V226']
V28
correlation for  29 ['V269']
V29
correlation for  30 ['V325']
V30
correlation for  31 ['V326']
V31
correlation for  32 ['V327']
V32
correlation for  33 ['

In [70]:
x

['V1',
 'V2',
 'V3',
 'V4',
 'V5',
 'V6',
 'V7',
 'V8',
 'V9',
 'V12',
 'V13',
 'V14',
 'V15',
 'V16',
 'V17',
 'V18',
 'V19',
 'V20',
 'V21',
 'V22',
 'V23',
 'V24',
 'V25',
 'V26',
 'V27',
 'V28',
 'V29',
 'V30',
 'V31',
 'V32',
 'V33',
 'V34',
 'V35',
 'V36',
 'V37',
 'V38']

In [31]:
correlation_matrix = v_columns.corr(numeric_only=True)
px.imshow(correlation_matrix)

In [None]:
px.imshow(v_new.corr())

after imputation, we'll merge columns that are highly correlated.

doing it programatially:

# Target Label Analysis

In [None]:
train_transaction['isFraud'].value_counts()

In [None]:
train_transaction['isFraud'].value_counts().plot(kind='bar')

This is a highly skewed dataset, with only 20k Fraud transactions present, compared to 569k non-fraud transactions.

# Email transactions & correlation to the 'isFraud' feature.

In [None]:
email_info = train_transaction[['P_emaildomain', 'R_emaildomain']]

In [None]:
email_info.isnull().sum()

**Most values in the Recipient email domain are null.**

In [None]:
train_transaction[train_transaction['P_emaildomain'] == train_transaction['R_emaildomain']].shape[0]

There are 102504 transactions where purchaser and recipient email domains are the same.

Now let's find out the number of columns where both are NOT NaN and both have DIFFERENT entries.

In [None]:
temp = train_transaction[~train_transaction['P_emaildomain'].isnull() & ~train_transaction['R_emaildomain'].isnull()]
temp[temp['P_emaildomain'] != temp['R_emaildomain']]

# Find the number of fraudulent transactions for occurrence of email domain as both purchaser and receiver.

In [None]:
# Calculate percentage of fraud transactions for every Purchaser email domain, and put the result into a dataframe.
purchaser_email_df = train_transaction.groupby(['P_emaildomain'])['isFraud'].value_counts(normalize=True).rename('counts').reset_index()

In [None]:
# train_transaction.groupby(['P_emaildomain'])['isFraud'].value_counts().reset_index()

In [None]:
purchaser_email_df

What proportion of emails from a given domain are fraud?

In [None]:
purchaser_email_df.groupby('isFraud')['counts'].value_counts()

In [None]:
sns.barplot(data=purchaser_email_df[purchaser_email_df['isFraud']==1].sort_values('counts', ascending=False), x='P_emaildomain', y='counts')
ticks = plt.xticks(rotation=90)

~ 40% emails from the domain 'protonmail.com' are fraudulent. 2nd highest fraud rate is for the domain 'mail.com', at just less than 20%

## Purchaser email is an important feature.

In [None]:
train_transaction.groupby(['P_emaildomain', 'R_emaildomain'])['isFraud'].value_counts()

In [None]:
# Grouping the number of fraudulent transactions per email domain, without dropping NaN values.
p_email_groupedby_fraud = train_transaction.groupby('isFraud', dropna=False)['P_emaildomain'].value_counts(sort=True, dropna=False)
r_email_groupedby_fraud = train_transaction.groupby('isFraud', dropna=False)['R_emaildomain'].value_counts(sort=True, dropna=False)

In [None]:
# Check fraudulent transactions for Recipient email
r_email_groupedby_fraud[1][:10]

In [None]:
# Checking the number of purchaser email domains with most fraud transactions. index 1 contains email with isFraud=1
p_email_groupedby_fraud[1][:10].plot(kind='bar')
# sns.barplot(p_email_groupedby_fraud)

In [None]:
r_email_groupedby_fraud[1][:10].plot(kind='bar')

what i'm trying to do here is find out the number of fraudulent transactions when a given email domain occurs in the purchaser column and receiver column. plot domain name in the x-axis, count in the y-axis and have 2 plots per category: one for receiver and one for purchaser.

In [None]:
p_email_groupedby_fraud[1]

In [None]:
train_transaction.rename({'card4':'issuer', 'card6':'type'}, axis=1, inplace=True)

# Sort isFraud transactions based on card issuer company

In [None]:
issuer_groupedby_fraud = train_transaction.groupby('isFraud', dropna=False)['issuer'].value_counts(sort=True, dropna=False)

In [None]:
issuer_groupedby_fraud[1]

In [None]:
issuer_groupedby_fraud[1].plot(kind='bar')

VISA card has the most fraudulent transactions, followed by mastercard

In [None]:
issuer_groupedby_fraud[0].plot(kind='bar')

Would be a good idea to plot percent of transactions per card issuer that are fraudulent.

In [None]:
issuer_groupedby_fraud

In [None]:
issuer_groupedby_fraud.groupby(['issuer', 'isFraud']).value_counts()

In [None]:
email_df = train_transaction[['isFraud', 'issuer']]

In [None]:
email_df.groupby(['issuer']).value_counts()

In [None]:
issuer_fraud_counts = email_df.groupby(['issuer']).value_counts(normalize=True).rename('percentage').reset_index()
sns.barplot(data=issuer_fraud_counts, x='issuer', y='percentage', hue='isFraud')

About 10% of Discover's transactions are fraudulent

In [None]:
train_transaction.columns

# Analysis of the TransactionAmt/ TransactionDT feature

In [None]:
sns.scatterplot(data=train_transaction, x='TransactionDT', y='TransactionAmt', hue='isFraud', alpha=0.5, palette='Set1')

Most transactions are in a similar range, so it's difficult to determine whether a transaction is fraudulent or not solely based on the transaction amount. There is one outlier transaction here, which has been labeled as non-fraudulent.

In [None]:
train_transaction['day'] = ((train_transaction['TransactionDT']//(3600*24)-1)%7)+1

In [None]:
train_day = train_transaction.groupby('isFraud')['day'].value_counts(normalize=True).rename('percentage').mul(100).reset_index().sort_values('day')
plt.figure(figsize=(10,6))
barplot = sns.barplot(x="day", y="percentage", hue="isFraud", data=train_day, palette = 'pastel')
plt.legend()
plt.ylabel('percentage of transaction frequency')
plt.xlabel('Day')
for p in barplot.patches:
    barplot.annotate(format(p.get_height(), '.2f'), (p.get_x() + p.get_width() / 2., p.get_height()), ha = 'center', va = 'center', xytext = (0, 5), textcoords = 'offset points')
plt.show()

In [None]:
train_transaction.columns

In [None]:
train_transaction

In [None]:
train_merged = train_transaction.merge(train_identity, how='left', on='TransactionID')

In [None]:
train_merged.columns

In [None]:
transaction_no_v.columns

# Count values

In [None]:
transaction_c = transaction_no_v.loc[:, 'C1':'C14']
transaction_c = pd.concat([transaction_c, transaction_no_v['isFraud']], axis=1)

In [None]:
def get_unique_values(df):
    for col, values in df.items():
        print("\n\n", col, values.dtype)
        print("No. of unique values: ", values.nunique(), "\n", "List of unique values", values.unique())
        if values.dtype in [int, float]:
            print("Min", values.min(), "\tMax", values.max(), "\n")

In [None]:
get_unique_values(transaction_c)

In [None]:
transaction_c.isnull().sum()

**Interestingly, there are no null values in the *C* columns.**

In [None]:
px.imshow(transaction_c.corr())

In [None]:
corr_matrix = transaction_c.corr()

In [None]:
def get_correlated_list(corr_matrix, threshold):
    correlation_threshold = threshold

    # Get the column names from the correlation matrix
    columns = corr_matrix.columns

    # Initialize an empty list to store the lists of correlated columns
    correlated_columns = []

    # Initialize a set to keep track of visited columns
    visited_columns = set()

    # Iterate over the columns
    for i in range(len(columns)):
        # Skip the column if it has already been visited
        #if columns[i] in visited_columns:
           # continue

        # Create a new list for the current correlated group
        correlated_group = [columns[i]]

        # Iterate over the remaining columns
        for j in range(1, len(columns)):
            # Check if the correlation between the columns exceeds the threshold
            if abs(corr_matrix.iloc[i, j]) > correlation_threshold:
                # Add the correlated column to the group
                correlated_group.append(columns[j])

                # Add the correlated column to the visited set
                visited_columns.add(columns[j])

        # Add the correlated group to the list
        correlated_columns.append(correlated_group)

    # Print the list of lists containing correlated columns
    for group in correlated_columns:
        print(group)


In [None]:
get_correlated_list(corr_matrix, 0.9)

**C columns are not LINEARLY correlated to isFraud. However, there may be a non-linear relationship here.**

**What we CAN deduce here is that many columns in the 'C' category are LINEARLY correlated. This means, we can group highly correlated features together, and select only one feature as a representative of the group.**

There are clear clusters in the above correlation matrix. We'll try and group highly correlated features, and select only one feature out of these groups.

In [None]:
pairplot_data = pd.DataFrame()

In [None]:
pairplot_data = pd.concat([train_transaction.loc[:, 'card1':'type'], train_transaction.loc[:, 'C1':'C14'], train_transaction['isFraud']], axis=1)

In [None]:
pairplot_data

In [None]:
train_transaction

In [None]:
c_data = train_transaction.loc[:, 'C1':'C14'].copy()
c_data = pd.concat([c_data, train_transaction['isFraud']], axis=1)

In [None]:
c_data.columns

In [None]:
def pair_plots(df, x_vars, y_vars, hue='isFraud'):
    sns.pairplot(df, kind='scatter', diag_kind='kde', x_vars=x_vars, y_vars=y_vars, hue=hue)

In [None]:
#pair_plots(c_data, x_vars=c_data.columns[0:5], y_vars=c_data.columns[0:5])

In [None]:
pairplot_data.columns[0:5]

In [None]:
pairplot_data[pairplot_data['isFraud']==1]

# M COLUMN ANALYSIS

In [None]:
m_columns = train_transaction.loc[:, 'M1':'M9']
m_columns = pd.concat([m_columns, train_transaction['isFraud']])

In [None]:
m_columns = pd.concat([m_columns, train_transaction['isFraud']], axis=1)

M columns are matches between names on card and address, and so on. Masked.

Mx is attribute of matching check, e.g. is phone areacode matched with billing zipcode, purchaser and recipient first/or last name match, etc.

In [None]:
for col, values in  m_columns.items():
    print(col, values.unique())

In [None]:
m = pd.DataFrame()
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col, value in m_columns.items():
    m[col] = le.fit_transform(m_columns[col])

In [None]:
m

No of categories = 10. So our dof is 9.
Assuming an alpha value of 0.05, we need to find out the chi square value.

In [None]:
y=m['isFraud']
X = m.drop('isFraud', axis=1)

In [None]:
X

In [None]:
X['M1'].value_counts().rename('Observed').reset_index()

In [None]:
from sklearn.feature_selection import chi2
chi_scores = chi2(X,y)

In [None]:
chi_scores

# Implement multivariate logistic regression

In [None]:
train_transaction.loc[:, :'type'].columns

In [None]:
sns.scatterplot(data=train_transaction, x='type', y='TransactionAmt', hue='isFraud', alpha=0.2)

In [None]:
sns.scatterplot(data=train_transaction[train_transaction['isFraud']==1], x='type', y='TransactionAmt')

In [None]:
train_transaction[train_transaction['type']=='debit or credit']['isFraud'].value_counts()

In [None]:
train_transaction[train_transaction['type']=='charge card']['isFraud'].value_counts()

**Apparently, there are no fraud transactions when 'type' = 'chargecard' and 'type' = 'debit or credit' - a total of 45 entries for both. Maybe we can drop these entries?**

In [None]:
train_transaction.loc[:, 'C1':'V1']

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Identity table EDA

In [None]:
train_identity.head()

Feeding a few features (except V columns) to the LR model, by min max scaling the values first. categorical variables will be one hot encoded.

id 12-38 are categorical variables
id 1-11 are numeric

In [None]:
pd.set_option("display.max_columns", 0)
i = train_identity.loc[:,'id_12':'id_38']

In [None]:
get_unique_values(i)

In [None]:
i['id_15'].unique()

In [None]:
train_identity

In [None]:
train = train_transaction.merge(train_identity, on='TransactionID', how='left')

In [None]:
train.columns

In [None]:
train_transaction.shape

In [None]:
train.head()