# IEEE-CIS Fraud Detection Data Set:
* Why Fraud Detection? 
    * Fraud detection is a set of activities undertaken to prevent money or property from being obtained through false pretenses. Fraud detection is applied to many industries such as banking or insurance. In banking, fraud may include forging checks or using stolen credit cards. 
* This competition is a binary classification problem: the target variable is a binary attribute (is the transaction fraudulent or not?), and our goal is to classify transactions as "fraudulent" or "not fraudulent" as accurately as possible. 

In this kernel I do a deep-dive Exploratory Data Analysis (EDA) of the IEEE-CIS Fraud Detection dataset to understand the patterns of fraudulent transactions. Don't forget to upvote if you find this kernel helpful. I suggest you also read the complete dataset overview and data description on the IEEE-CIS Fraud Detection page: https://www.kaggle.com/c/ieee-fraud-detection/overview 

I intentionally share all of my code. My aim is to build a reusable EDA code base that shortens development for the next projects in my backlog.

You can glance at my util function notebooks on my GitHub page to see how I streamlined the EDA process: https://github.com/ElifKarakutukDinc/IEEE-CIS-Fraud-Detection 

# Exploratory Data Analysis (EDA) 
Exploratory Data Analysis is the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

* To take a closer look at the data, use the `.head()` function of the pandas library, which returns the first five observations of the data set. Similarly, `.tail()` returns the last five observations.
* To find the total number of rows and columns in the data set, use `.shape`.
* To see the label of each column in the data set, use `.columns.values`.
* It is also good practice to check the columns and their data types, and whether they contain null values, with `.info()`.
* The `.describe()` function in pandas is very handy for getting summary statistics. It returns the count, mean, standard deviation, minimum and maximum values, and the quartiles of the data.
* A few key insights can be gained just by looking at the dependent variable with:
    * `.unique()`
    * `.value_counts()`
* To check for missing values in the data set's columns, use `.isnull().sum()`.
* To check for outliers, use a box plot (or box-and-whisker plot), which shows the distribution of quantitative data in a way that facilitates comparisons between variables. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution.
* To check the shape of each variable's distribution, it is good practice to plot a distribution graph and look for skewness. A kernel density estimate (KDE) is a useful tool for plotting the shape of a distribution.
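
The checklist above can be sketched on a small stand-in frame. This is only an illustration: the `toy` DataFrame and its values are made up, and the plotting steps (box plot, KDE) are left out.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a real data set (values are made up for illustration).
toy = pd.DataFrame(
    {
        "amount": [10.0, 25.5, np.nan, 300.0, 42.0, 18.0],
        "label": [0, 0, 1, 0, 1, 0],
    }
)

first_rows = toy.head()             # first five rows
last_rows = toy.tail()              # last five rows
n_rows, n_cols = toy.shape          # (row count, column count)
column_names = toy.columns.values   # column labels
toy.info()                          # data types and non-null counts (printed)
summary = toy.describe()            # count, mean, std, min, quartiles, max
label_values = toy["label"].unique()          # distinct values of the target
label_counts = toy["label"].value_counts()    # frequency of each target value
missing_per_column = toy.isnull().sum()       # missing values per column
```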

## Data Understanding & Wrangling:
### Transaction Table 
    * TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
    * TransactionAmt: transaction payment amount in USD
    * ProductCD: product code, the product for each transaction
    * card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
    * addr: address
    * dist: distance
    * P_emaildomain and R_emaildomain: purchaser and recipient email domain
    * C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
    * D1-D15: timedelta, such as days between previous transaction, etc.
    * M1-M9: match, such as names on card and address, etc.
    * Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.
#### Categorical Features:
    * ProductCD
    * card1 - card6
    * addr1, addr2
    * P_emaildomain
    * R_emaildomain
    * M1 - M9

### Identity Table
    * Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
    * They're collected by Vesta’s fraud protection system and digital security partners.
    * (The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)

    * “id01 to id11 are numerical features for identity, which is collected by Vesta and security partners such as device rating, ip_domain rating, proxy rating, etc. Also it recorded behavioral fingerprint like account login times/failed to login times, how long an account stayed on the page, etc. All of these are not able to elaborate due to security partner T&C. I hope you could get basic meaning of these features, and by mentioning them as numerical/categorical, you won't deal with them inappropriately.”
    
#### Categorical Features:
    * DeviceType
    * DeviceInfo
    * id_12 - id_38   

In [None]:
# This library is to work with Data Frames
import pandas as pd

# This library is to work with vectors
import numpy as np

# This library is to visualise statistical graphs
import seaborn as sns

# This library is to visualise graphs
import matplotlib.pyplot as plt

# To set some plotting parameters
from matplotlib import rcParams

# This library supplies classes for manipulating dates and times
import datetime

# Library to work with Regular Expressions
import re

# This library is to suppress warnings
import warnings

# This library is to create displays
from IPython.display import Image

"""
In my GitHub repository I store all the functions in the appropriate util packages. In the Kaggle submission I use those functions directly. 

# Calling reporting functions from util_reporting
from util_reporting import (
    df_first_look,
    df_descriptive_statistics,
    countplot_viz,
    boxplot_viz,
    histogram_multiple_viz,
    countplot_pointplot_viz,
)

# Calling reporting functions from util_data_cleaning
from util_data_cleaning import (
    missing_data_finder,
)

# Calling feature engineering functions from util_feature_engineering
from util_feature_engineering import (
    calculating_zscore,
    creating_date_columns,
)

"""

%matplotlib inline
warnings.filterwarnings("ignore")
%config Completer.use_jedi = False

# Setting a universal figure size
rcParams["figure.figsize"] = 8, 6

In [None]:
# util_reporting 
def df_first_look(df):
    """
    This function takes a pandas DataFrame and prints basic information about it.
    :param df: DataFrame to be analyzed.
    :return: This function doesn't return anything.
    """
    try:
        print("First 5 rows of dataframe:\n--------------------------\n", df.head())
        print("")
        print("Last 5 rows of dataframe:\n--------------------------\n", df.tail())
        print("")
        print(
            "Row count of dataframe:\n-----------------------\n",
            df.shape[0],
            "\nColumn count of dataframe:\n--------------------------\n",
            df.shape[1],
        )
        print("")
        print(
            "List of columns in the dataframe:\n---------------------------------\n",
            df.columns.values,
        )
        print("")
        print(
            "Looking NaN values and datatypes of columns in the dataframe:\n--------------------------------------------\n"
        )
        print(df.info())
        print("")

    except Exception as e:
        print("Error at df_first_look function: ", str(e))


def df_descriptive_statistics(df, column_list):
    """
    This function takes a pandas DataFrame and a list of columns and prints descriptive statistics for those columns.
    :param df: DataFrame to be analyzed.
    :param column_list: List of numeric columns to describe.
    :return: This function doesn't return anything.

    """
    try:
        dummy_df = df[column_list].copy()
        print(
            "Descriptive Statistics for column:\n--------------------------\n",
            dummy_df.describe(),
        )
        print("")
        print("Mode values for column:\n--------------------------\n", dummy_df.mode())
        print("")
    except Exception as e:
        print("Error at df_descriptive_statistics function: ", str(e))


def countplot_viz(
    data,
    xcolumn,
    xlabel,
    ylabel,
    title,
    hue=None,
    fontsize_label=16,
    fontsize_title=20,
    fontsize_text=12,
    rotation=45,
    figsize_x=12,
    figsize_y=5,
    palette="mako",
):
    """
    This function takes a pandas DataFrame and draws a countplot.
    :param data: DataFrame to be analyzed.
    :param xcolumn: Column plotted on the x axis.
    :param xlabel: Label of the x axis.
    :param ylabel: Label of the y axis.
    :param title: Title of the graph.
    :param hue: Optional column name used to split the bars by category.
    :param fontsize_label: Font size of the axis labels.
    :param fontsize_title: Font size of the title.
    :param fontsize_text: Font size of the count annotations above the bars.
    :param rotation: Rotation of the x tick labels, in degrees.
    :param figsize_x: Figure width.
    :param figsize_y: Figure height.
    :param palette: Color palette of the graph.
    :return: This function doesn't return anything.

    """
    plt.figure(figsize=(figsize_x, figsize_y))

    g = sns.countplot(x=xcolumn, data=data, hue=hue, palette=palette)
    # Use the font-size and rotation arguments instead of hard-coded values.
    g.set_title(title, fontsize=fontsize_title)
    g.set_xlabel(xlabel, fontsize=fontsize_label)
    g.set_ylabel(ylabel, fontsize=fontsize_label)
    g.set_xticklabels(g.get_xticklabels(), rotation=rotation, ha="right")
    plt.tight_layout()
    # Annotate each bar with its count.
    for p in g.patches:
        height = p.get_height()
        g.text(
            p.get_x() + p.get_width() / 2.0,
            height + 3,
            "{:1}".format(height),
            ha="center",
            fontsize=fontsize_text,
        )
    if hue is not None:
        g.legend(bbox_to_anchor=(1, 1), loc=1, borderaxespad=0)
        

def countplot_pointplot_viz(
    data,
    filter_list,
    xcolumn,
    ycolumn,
    ycolumn_point,
    xlabel,
    ylabel,
    title,
    hue=None,
    fontsize_label=16,
    fontsize_title=20,
    fontsize_text=12,
    rotation=45,
    figsize_x=12,
    figsize_y=5,
    palette="mako",
):
    """
    This function takes a pandas DataFrame and draws a countplot with a pointplot on a secondary y axis.
    :param data: DataFrame to be analyzed.
    :param filter_list: Values of ycolumn to keep; they also set the hue order.
    :param xcolumn: Column plotted on the x axis.
    :param ycolumn: Column that splits the countplot bars by category.
    :param ycolumn_point: Column plotted on the y axis of the pointplot.
    :param xlabel: Label of the x axis.
    :param ylabel: Label of the y axis.
    :param title: Title of the graph.
    :param hue: Currently unused; the countplot is split by ycolumn.
    :param fontsize_label: Font size of the axis labels.
    :param fontsize_title: Font size of the title.
    :param fontsize_text: Font size of the count annotations above the bars.
    :param rotation: Rotation of the x tick labels, in degrees.
    :param palette: Color palette of the graph.
    :return: This function doesn't return anything.

    """

    plt.figure(figsize=(figsize_x, figsize_y))

    # Keep only the rows whose ycolumn value is in filter_list.
    df2 = data[data[ycolumn].isin(filter_list)]

    ax1 = sns.countplot(
        x=xcolumn, hue=ycolumn, data=df2, hue_order=filter_list, palette=palette
    )
    # Annotate each bar with its count.
    for p in ax1.patches:
        height = p.get_height()
        ax1.text(
            p.get_x() + p.get_width() / 2.0,
            height + 3,
            "{:1}".format(height),
            ha="center",
            fontsize=fontsize_text,
        )
    # Use the font-size and rotation arguments instead of hard-coded values.
    ax1.set_title(title, fontsize=fontsize_title)
    ax1.set_xlabel(xlabel, fontsize=fontsize_label)
    ax1.set_ylabel(ylabel, fontsize=fontsize_label)
    ax1.set_xticklabels(ax1.get_xticklabels(), rotation=rotation, ha="right")

    # Draw the pointplot on a secondary y axis that shares the x axis.
    ax2 = ax1.twinx()
    sns.pointplot(x=xcolumn, y=ycolumn_point, data=df2, ax=ax2)

def boxplot_viz(
    data,
    xcolumn,
    xlabel,
    title,
    hue=None,
    fontsize_label=16,
    fontsize_title=20,
    rotation=45,
    palette="mako",
):
    """
    This function takes a pandas DataFrame and draws a boxplot.
    :param data: DataFrame to be analyzed.
    :param xcolumn: Column plotted on the x axis.
    :param xlabel: Label of the x axis.
    :param title: This column designates name of graph.
    :param hue: Name of variables in `data` or vector data, optional Inputs for plotting long-form data.
    :param fontsize_label: It designates label size.
    :param fontsize_title: It designates title size.
    :param rotation: It designates rotation of graph.
    :param palette: It designates colors of graph.
    :return: This function doesn't return anything.

    """
    plt.figure(1, figsize=(9, 6))

    sns.boxplot(x=xcolumn, data=data, hue=hue, palette=palette)
    plt.xlabel(xlabel, fontsize=fontsize_label) 
    plt.title(title, fontsize=fontsize_title)
    plt.xticks(rotation=rotation)
    plt.show()



def histogram_multiple_viz(
    data,
    column,
    separate_column,
    condition_1,
    condition_2,
    title1,
    title2,
    title3,
    title4,
    color1="blue",
    color2="darkorange",
):
    """
    Takes a pandas DataFrame and draws four histograms of a column: one per condition on the log scale (top row) and one per condition on the raw scale (bottom row).
    :param data: DataFrame to be analyzed.
    :param column: Column whose distribution is shown.
    :param separate_column: Column whose values split the data into two groups.
    :param condition_1: It designates condition of separate column.
    :param condition_2: It designates condition of separate column.
    :param title1: It designates title by graph1.
    :param title2: It designates title by graph2.
    :param title3: It designates title by graph3.
    :param title4: It designates title by graph4.
    :param color1: It designates color for condition_1.
    :param color2: It designates color for condition_2.
    :return: This function doesn't return anything.

    """    
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 6))
    data.loc[data[separate_column] == condition_1][column].apply(
        np.log
    ).plot(
        kind="hist",
        bins=100,
        title=title1,
        color=color1,
        xlim=(-3, 10),
        ax=ax1,
    )
    data.loc[data[separate_column] == condition_2][column].apply(
        np.log
    ).plot(
        kind="hist",
        bins=100,
        title=title2,
        color=color2,
        xlim=(-3, 10),
        ax=ax2,
    )
    data.loc[data[separate_column] == condition_1][column].plot(
        kind="hist", bins=100, title=title3, color=color1, ax=ax3
    )
    data.loc[data[separate_column] == condition_2][column].plot(
        kind="hist",
        bins=100,
        title=title4,
        color=color2,
        ax=ax4,
    )
    plt.show()     
    


In [None]:
# util_feature_engineering: 
def calculating_zscore(df, cols):
    """
    This function takes a pandas DataFrame, calculates the z-score of each column in the list, and adds a categorical column that flags each value as outlier or non-outlier.
    :param df: DataFrame to be analyzed.
    :param cols: List of columns to calculate z-scores for.
    :return: A copy of the DataFrame with the new columns.
    """
    try:
        df_dummy = df.copy()
        for col in cols:
            col_zscore = col + "_zscore"
            # Population z-score (ddof=0): (value - mean) / standard deviation.
            df_dummy[col_zscore] = (df_dummy[col] - df_dummy[col].mean()) / df_dummy[
                col
            ].std(ddof=0)

            # Flag values more than 3 standard deviations away from the mean.
            col_zscore_outlier = col_zscore + "_outlier"
            df_dummy[col_zscore_outlier] = np.where(
                (df_dummy[col_zscore] > 3) | (df_dummy[col_zscore] < -3),
                "outlier",
                "non-outlier",
            )

        return df_dummy

    except Exception as e:
        print("Error at calculating_zscore function: ", str(e))
        return df

def creating_date_columns(df, date_column, START_DATE):
    """
    This function takes a pandas DataFrame, converts the timedelta column date_column (in seconds) to dates, and adds the new columns Date, Weekdays, Hours and Days to df in place.
    :param df: DataFrame to be analyzed.
    :param date_column: Name of the timedelta column, in seconds since START_DATE.
    :param START_DATE: Reference date as a "YYYY-MM-DD" string.
    :return: This function doesn't return anything; df is modified in place.
    """
    startdate = datetime.datetime.strptime(START_DATE, "%Y-%m-%d")
    df["Date"] = df[date_column].apply(
        lambda x: (startdate + datetime.timedelta(seconds=x))
    )

    df["Weekdays"] = df["Date"].dt.dayofweek
    df["Hours"] = df["Date"].dt.hour
    df["Days"] = df["Date"].dt.day

In [None]:
# util_data_cleaning: 
    
def missing_data_finder(df):
    """
    This function takes a pandas DataFrame, finds the columns with missing values, and reports their missing-row counts and shares.
    :param df: DataFrame to be analyzed.
    :return: DataFrame with one row per column that has missing values, sorted by missing_row_count in descending order.
    
    """
    df_missing = df.isnull().sum().reset_index().rename(columns={'index': 'column_name', 0: 'missing_row_count'}).copy()
    df_missing_rows = df_missing[df_missing['missing_row_count'] > 0].sort_values(by='missing_row_count',ascending=False)
    df_missing_rows['missing_row_percent'] = (df_missing_rows['missing_row_count'] / df.shape[0]).round(4)
    return df_missing_rows

In [None]:
import os
print(os.listdir("../input/ieee-fraud-detection"))

### Loading the Data Sets:

In [None]:
# Transaction CSVs
train_transaction = pd.read_csv("../input/ieee-fraud-detection/train_transaction.csv")
test_transaction = pd.read_csv("../input/ieee-fraud-detection/test_transaction.csv")
# Identity CSVs - These will be merged onto the transactions to create additional features
train_identity = pd.read_csv("../input/ieee-fraud-detection/train_identity.csv")
test_identity = pd.read_csv("../input/ieee-fraud-detection/test_identity.csv")
# Sample Submissions
sample_submission = pd.read_csv("../input/ieee-fraud-detection/sample_submission.csv")

### First Look at the Data Sets:
* I called `df_first_look` from `util_reporting.py`. 
* This function prints:
    * First 5 rows of dataframe
    * Last 5 rows of dataframe
    * Row count of dataframe
    * Column count of dataframe
    * List of columns in the dataframe
    * Non-null counts and data types of the columns in the dataframe

In [None]:
df_first_look(train_transaction)

In [None]:
df_first_look(train_identity)

In [None]:
df_first_look(test_transaction)

In [None]:
df_first_look(test_identity)

* The data sets contain float, integer, and object data types.
* All data sets contain null/missing values. 

* I want to see whether enough TransactionIDs in the train_transaction table have a matching row in the train_identity table. If they do, we can look at the relationship between the identity table's columns and the isFraud column of the transaction table.  

In [None]:
# Count how many TransactionIDs in train_transaction have an associated row in train_identity.
print(
    np.sum(
        train_transaction["TransactionID"].isin(
            train_identity["TransactionID"].unique()
        )
    )
)

* 24.4% of the TransactionIDs in train_transaction have an associated row in train_identity. That is not enough coverage to study the relationship between the identity table's columns and the isFraud column of the transaction table.
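
The cell above prints a count, while the observation quotes a percentage. A minimal sketch of how such a share can be derived is shown here on made-up stand-in tables (`transactions` and `identity` are hypothetical names, not the competition files):

```python
import pandas as pd

# Toy stand-ins for train_transaction and train_identity (IDs are made up).
transactions = pd.DataFrame({"TransactionID": [1, 2, 3, 4, 5, 6, 7, 8]})
identity = pd.DataFrame({"TransactionID": [2, 5]})

# Boolean mask: True where the transaction has a matching identity row.
has_identity = transactions["TransactionID"].isin(identity["TransactionID"])

matched_count = int(has_identity.sum())
matched_share = has_identity.mean()  # mean of booleans = fraction of True values

print(
    f"{matched_count} of {len(transactions)} transactions "
    f"({matched_share:.1%}) have identity data"
)
```

Taking the mean of the boolean mask avoids dividing by the row count by hand.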

### To Check Missing Values:

* I called `missing_data_finder` from `util_data_cleaning.py`. 
* This function returns:
    * The columns that have missing values, with their missing-row counts and percentages. 

In [None]:
missing_data_finder(train_transaction).head()

In [None]:
missing_data_finder(test_transaction).head()

##### Observations: 
* We found NaN values in the columns of both dataframes. 
    * 374 of the 394 columns of the train_transaction dataframe have NaN values. 
    * 345 of the 393 columns of the test_transaction dataframe have NaN values. 
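
Column counts like these can be obtained without reading the full report. The sketch below uses a made-up `toy` frame to show one way to count the columns that contain at least one NaN:

```python
import numpy as np
import pandas as pd

# Toy frame (made up): columns "a" and "c" contain NaN, column "b" does not.
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [1, 2, 3],
        "c": [np.nan, np.nan, "x"],
    }
)

# isnull().any() is True for every column with at least one missing value.
cols_with_nan = toy.columns[toy.isnull().any()]
n_cols_with_nan = len(cols_with_nan)

print(f"{n_cols_with_nan} of {toy.shape[1]} columns have NaN values")
```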

### Changing a Categorical Column's Values: 
### isFraud Column:

* The isFraud column has two categories: 0 and 1. Because it is coded numerically, it is not obvious which value corresponds to which label. So I'm creating a labeled version: `1 = "Fraud", 0 = "Non-Fraud"`. 

In [None]:
train_transaction["isFraud_"] = np.where(
    train_transaction["isFraud"] == 1, "Fraud", "Non-Fraud"
)

In [None]:
train_transaction["isFraud_"].value_counts()

##### Observation:
* We relabeled the categories as Fraud and Non-Fraud.
* There are many more Non-Fraud transactions than Fraud transactions. 

## Understanding Variables:
### Categorical Features:

### ProductCD:
* `Definition:` ProductCD is the product code of each transaction. The data description post states that the product can be a service and not necessarily a physical product.  
* `Categories & Labels:` C, W, R, H, S

In [None]:
train_transaction.ProductCD.value_counts()

In [None]:
countplot_viz(
    train_transaction,
    "ProductCD",
    "Transactions Product Code",
    "Freq",
    "Product Code Distribution",
    palette="rocket_r",
)

##### How to Read The Graph: 
* This graph shows the count of transactions per ProductCD. 
* The x-axis shows the categories; the y-axis shows the count of each category.

##### Observation:
* Product code W has more transactions than any other code. 

### Card1-Card6:
* `Definition:` card1 - card6: Payment card information, such as card type, card category, issue bank, country, etc.  
* `Definition of Card4:` Card4 shows the card distributor. 
    * `Categories & Labels:` Visa, Mastercard, American Express, Discover 
* `Definition of Card6:` Card6 shows the card type. 
    * `Categories & Labels:` Debit, Credit, Debit or Credit, Charge card      

In [None]:
card_cols = [c for c in train_transaction.columns if "card" in c]
train_transaction[card_cols].head(3)

* Of all these columns, only card4 and card6 have values we can interpret. 

In [None]:
# For card4:
countplot_viz(
    train_transaction,
    "card4",
    "Card Distributers",
    "Freq",
    "Card Distributer Types Distribution",
    palette="rocket_r",
)

##### How to Read The Graph: 
* This graph shows the count of Card Distributor Types. 
* The x-axis shows the card distributors; the y-axis shows the count of each distributor.

In [None]:
# For card6:
countplot_viz(
    train_transaction,
    "card6",
    "Card Types",
    "Freq",
    "Card Types Distribution",
    palette="rocket_r",
)

##### How to Read The Graph: 
* This graph shows the count of Card Types. 
* The x-axis shows the card types; the y-axis shows the count of each card type.

##### Observation:
* Card4: 
    * Visa has more transactions than any other card distributor. 
* Card6: 
    * Debit has more transactions than any other card type. 

### addr1 - addr2:
* `Definition:` These columns show the address. Both addresses belong to the purchaser. 
    * addr1 as billing region
    * addr2 as billing country

### addr1:

In [None]:
# Unique count of Regions:
train_transaction.addr1.nunique()

In [None]:
# To find the top 10 regions by transaction count.
# value_counts() sorts in descending order, so head(10) keeps the top regions.
# Naming the counts column explicitly avoids relying on its default name,
# which differs between pandas versions.
addr1_df = (
    train_transaction["addr1"]
    .value_counts()
    .rename_axis("region")
    .reset_index(name="count")
    .head(10)
)
top_region_df = train_transaction[
    train_transaction["addr1"].isin(list(addr1_df["region"]))
]

In [None]:
countplot_viz(
    top_region_df,
    "addr1",
    "Transaction Regions",
    "Freq",
    "Regions Distribution",
    palette="rocket",
)

### addr2:

In [None]:
# Unique count of Country:
train_transaction.addr2.nunique()

In [None]:
# To find the top 5 countries by transaction count.
# value_counts() sorts in descending order, so head() keeps the top 5 countries.
# Naming the counts column explicitly avoids relying on its default name,
# which differs between pandas versions.
addr2_df = (
    train_transaction["addr2"]
    .value_counts()
    .rename_axis("country")
    .reset_index(name="count")
    .head()
)
top_country_df = train_transaction[
    train_transaction["addr2"].isin(list(addr2_df["country"]))
]

In [None]:
countplot_viz(
    top_country_df,
    "addr2",
    "Transaction Countries",
    "Freq",
    "Country Distribution",
    palette="rocket_r",
)

##### Observation:
* addr1: 
    * There are 332 unique regions in the dataframe. We show the 10 regions with the most transactions.  
* addr2: 
    * There are 74 unique countries in the dataframe. We show the 5 countries with the most transactions.

### P_emaildomain: 
* `Definition:` This column shows the purchaser's email domain.
* I will group all email domains by the respective enterprises.
* Also, I only include domains with more than 1,000 entries in the analysis.
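
One possible way to group domains by enterprise before filtering by frequency is sketched below. It is an assumption rather than the notebook's own method: the `PROVIDER_BY_PREFIX` mapping, the `email_provider` helper, the sample domains and the toy threshold are all hypothetical.

```python
import pandas as pd

# Hypothetical mapping from the first part of a domain to its provider.
PROVIDER_BY_PREFIX = {
    "gmail": "Google",
    "yahoo": "Yahoo",
    "hotmail": "Microsoft",
    "outlook": "Microsoft",
    "live": "Microsoft",
}


def email_provider(domain):
    """Map an email domain to a provider; unknown domains become 'Other'."""
    if pd.isna(domain):
        return "Missing"
    prefix = str(domain).split(".")[0]
    return PROVIDER_BY_PREFIX.get(prefix, "Other")


# Made-up sample of email domains, including a missing value.
domains = pd.Series(
    ["gmail.com", "yahoo.com.mx", "hotmail.es", "outlook.com", None, "example.org", "gmail.com"]
)

providers = domains.map(email_provider)
provider_counts = providers.value_counts()

# Keep only providers above a frequency threshold (2 here; the notebook uses 1000).
frequent = provider_counts[provider_counts >= 2]
print(frequent)
```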

In [None]:
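# Count transactions per purchaser email domain. Naming the counts column
# explicitly avoids relying on its default name, which differs between pandas versions.
P_email_df = (
    train_transaction["P_emaildomain"]
    .value_counts()
    .rename_axis("email")
    .reset_index(name="count")
)
# Keep only the domains with more than 1000 transactions.
P_email_df = P_email_df[P_email_df["count"] > 1000]
Pemail_df = train_transaction[
    train_transaction["P_emaildomain"].isin(list(P_email_df["email"]))
]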

In [None]:
countplot_viz(
    data=Pemail_df,
    xcolumn="P_emaildomain",
    xlabel="P_Email Domains",
    ylabel="Freq",
    title="P_Email Domain Distribution",
    palette="rocket",
    fontsize_text=9,
)

##### How to Read The Graph: 
* This graph shows the count of transactions per purchaser email domain. 
* The x-axis shows the email domains; the y-axis shows the count of transactions for each domain.

##### Observation:
* Transactions made with Gmail and Outlook addresses have the highest counts.

### R_emaildomain: 
* `Definition:` This column shows the recipient's email domain. Certain transactions don't need a recipient, so R_emaildomain is null for them.
* I will group all email domains by the respective enterprises.
* Also, I only include domains with more than 1,000 entries in the analysis.

In [None]:
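# Count transactions per recipient email domain. Naming the counts column
# explicitly avoids relying on its default name, which differs between pandas versions.
R_email_df = (
    train_transaction["R_emaildomain"]
    .value_counts()
    .rename_axis("email")
    .reset_index(name="count")
)
# Keep only the domains with more than 1000 transactions.
R_email_df = R_email_df[R_email_df["count"] > 1000]
Remail_df = train_transaction[
    train_transaction["R_emaildomain"].isin(list(R_email_df["email"]))
]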

In [None]:
countplot_viz(
    data=Remail_df,
    xcolumn="R_emaildomain",
    xlabel="R_Email Domains",
    ylabel="Freq",
    title="R_Email Domain Distribution",
    palette="rocket",
)

##### How to Read The Graph: 
* This graph shows the count of transactions per recipient email domain. 
* The x-axis shows the email domains; the y-axis shows the count of transactions for each domain.

##### Observation:
* Transactions made with Gmail and Hotmail addresses have the highest counts. 
* Here the second-highest transaction count belongs to Hotmail, whereas for P_emaildomain it belongs to Outlook. 

### M1- M9:
* `Definition:` These columns show matches, such as whether the names on the card and the address match, etc.
* `Categories & Labels:` T = True, F = False, NaN values
* `Categories & Labels for M4:` M0, M1, M2

In [None]:
M_cols = [m for m in train_transaction.columns if "M" in m]
train_transaction[M_cols].head(3)

In [None]:
train_transaction.M4.value_counts()

In [None]:
for col in ["M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8", "M9"]:
    countplot_viz(
        train_transaction,
        col,
        col,
        "Freq",
        "Distribution of " + col,
        palette="rocket",
        fontsize_label=11,
        fontsize_title=15,
        figsize_x=12,
        figsize_y=2,
    )

##### How to Read The Graph: 
* These graphs show the count of each match value. 
* The x-axis shows the match values of each M column; the y-axis shows the count of each value.

##### Observation:
* M5, M6, M7, and M8 have more false matches than true matches. We don't know what these columns represent. We'll look at whether false matches are associated with fraud. 

### Numerical Features:
### TransactionAmt:

* `Definition:` TransactionAmt shows transaction payment amount in USD.

* I called `df_descriptive_statistics` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and a list of columns and prints descriptive statistics for those columns.

In [None]:
list_of_column_descriptive = ["TransactionAmt"]
df_descriptive_statistics(train_transaction, list_of_column_descriptive)

* To find outliers in the dataset we can use two different approaches:  
    * Visualize a boxplot to see whether there are outliers.
    * Count the transaction amounts with |z-score| > 3. This gives us the number of outliers. 

* To Visualize The Boxplot:
    * I called `boxplot_viz` from `util_reporting.py`.
    * This function:
        * Draws a boxplot for a column.

In [None]:
boxplot_viz(
    train_transaction,
    "TransactionAmt",
    xlabel="TransactionAmt",
    title="Boxplot of TransactionAmt",
)

##### How to Read The Graph: 
* This graph shows a boxplot of the TransactionAmt column. 
* `The minimum` (the smallest number in the data set) is shown at the far left of the chart, at the end of the left "whisker".
* `First quartile`, Q1, is the left edge of the box (where the left whisker meets the box).
* `The median` is shown as a line inside the box.
* `Third quartile`, Q3, is the right edge of the box (where the right whisker meets the box).
* `The maximum` (the largest number in the data set) is shown at the far right of the chart, at the end of the right whisker.
* Data sets can sometimes contain `outliers` that are suspected to be anomalies (perhaps because of data collection errors or just plain old flukes). If outliers are present, the whisker on that side extends only as far as 1.5 * IQR from the box rather than to the data minimum or maximum. Small circles or unfilled dots are drawn on the chart to indicate where suspected outliers lie. In some conventions, filled circles are used for known outliers.
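
The 1.5 * IQR rule behind the whiskers can also be applied numerically. The sketch below uses a made-up `amounts` series, not TransactionAmt, and flags the values that a boxplot would draw as individual points:

```python
import pandas as pd

# Made-up amounts with one extreme value.
amounts = pd.Series([10, 12, 14, 15, 16, 18, 20, 22, 25, 400])

q1 = amounts.quantile(0.25)  # first quartile
q3 = amounts.quantile(0.75)  # third quartile
iqr = q3 - q1                # interquartile range

# Fences: values beyond these limits are suspected outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (amounts < lower_fence) | (amounts > upper_fence)
print(amounts[is_outlier])
```

Unlike the |z-score| > 3 rule, the fences are based on quartiles, so a few extreme values do not shift them much.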

* To Calculate The z_score Column:
    * I called `calculating_zscore` from `util_feature_engineering.py`.
    * This function:
        * Calculates the z-score for each column in the list. 
        * Creates a new categorical column that labels each value as outlier or non-outlier. 
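
For reference, the z-score that `calculating_zscore` computes for a value $x_i$ of a column can be written as follows, where $\mu$ is the column mean and $\sigma$ is the population standard deviation (the function uses `ddof=0`) over the $n$ rows:

$$
z_i = \frac{x_i - \mu}{\sigma}, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(x_j - \mu\right)^2}
$$

A value is labeled "outlier" when $|z_i| > 3$ and "non-outlier" otherwise.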

In [None]:
cols = ["TransactionAmt"]
train_transaction = calculating_zscore(train_transaction, cols)

In [None]:
# Total outlier and non-outlier count:

train_transaction["TransactionAmt_zscore_outlier"].value_counts()

##### Observation:
* Descriptive Statistics: 
    * The mean, median, and mode differ considerably, which indicates that the distribution is right skewed (positive skew). 
    * The mean shows that the average transaction amount is $135. 
* Box Plot Graph and z score Table:
    * The TransactionAmt column has a large number of outliers. 
    * We'll use the TransactionAmt_zscore_outlier column to see whether outlier transactions are fraudulent or not. 
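
Right skew can also be checked numerically. The sketch below uses made-up amounts, not the competition data, to compare the mean with the median and to compute the sample skewness before and after a log transform:

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed amounts: most values are small, a few are very large.
amounts = pd.Series([5, 8, 10, 12, 15, 20, 30, 60, 250, 1200], dtype="float64")

mean_amount = amounts.mean()
median_amount = amounts.median()

# Positive skewness means a longer tail on the right-hand side.
skewness = amounts.skew()

# A log transform usually pulls in the right tail of strictly positive data.
log_skewness = np.log(amounts).skew()

print(f"mean={mean_amount:.1f}, median={median_amount:.1f}")
print(f"skewness={skewness:.2f}, skewness after log={log_skewness:.2f}")
```

This is why `histogram_multiple_viz` plots the log of the column next to the raw values.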

### C1-C14: 
* `Definition:` Counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.

In [None]:
train_transaction[
    [
        "C1",
        "C2",
        "C3",
        "C4",
        "C5",
        "C6",
        "C7",
        "C8",
        "C9",
        "C10",
        "C11",
        "C12",
        "C13",
        "C14",
    ]
].describe()

##### Observation:
* The meaning of these columns is masked, so we are not reviewing them here. However, we'll look at the relationship between some of these columns and the isFraud column in the cells below.

### Time Series Features:
### TransactionDT:

* `Definition:` Timedelta from a given reference datetime (not an actual timestamp)

In [None]:
train_transaction["TransactionDT"].plot(
    kind="hist",
    figsize=(15, 5),
    label="train",
    bins=50,
    title="Train vs Test TransactionDT Distribution",
)
test_transaction["TransactionDT"].plot(kind="hist", label="test", bins=50)
plt.legend()

##### How to Read The Graph: 
* This graph shows the distributions of TransactionDT for the train and test datasets. 

* To convert the timedelta column to a date column, you can review the discussion at the link below. 

https://www.kaggle.com/c/ieee-fraud-detection/discussion/100400#latest-579480

* I called `creating_date_columns` from `util_feature_engineering.py`.
* This function:
    * Takes a pandas DataFrame, converts the timedelta date_column to dates, and creates new columns for the date, weekday, hour and day. 
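
The same conversion can be sketched without a row-by-row `apply`. This is a vectorized alternative under two assumptions: TransactionDT is measured in seconds, and the reference date (2017-12-01 here) is a guess, not something stated in the data description. The `toy` frame is made up.

```python
import pandas as pd

# Assumed reference date; the true reference datetime is not published.
START_DATE = pd.Timestamp("2017-12-01")

# Made-up timedeltas in seconds: 1 day, 1 day + 1 hour, 2 days + 13 hours.
toy = pd.DataFrame({"TransactionDT": [86400, 90000, 172800 + 13 * 3600]})

# Vectorized conversion: seconds -> Timedelta -> absolute datetime.
toy["Date"] = START_DATE + pd.to_timedelta(toy["TransactionDT"], unit="s")

toy["Weekdays"] = toy["Date"].dt.dayofweek  # Monday=0 ... Sunday=6
toy["Hours"] = toy["Date"].dt.hour
toy["Days"] = toy["Date"].dt.day

print(toy)
```

On a table with hundreds of thousands of rows, the vectorized form is typically much faster than calling `datetime.timedelta` once per row.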

In [None]:
creating_date_columns(train_transaction, "TransactionDT", START_DATE="2017-12-01")
train_transaction.head(3)

##### Observation:
* We converted the TransactionDT column. We'll use the new columns in "Bivariate Relationships".
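
A rough sketch of what such a conversion can look like. This is not the code of `creating_date_columns`; the function name `add_date_columns_sketch` is hypothetical, and it assumes that TransactionDT is measured in seconds from the start date and that weekdays are encoded with 0 = Sunday (the legend used later in this notebook). The output column names follow the ones used in the later cells (`Date`, `Weekdays`, `Hours`, `Days`).

```python
import pandas as pd


def add_date_columns_sketch(df, delta_col, start_date="2017-12-01"):
    """Hypothetical stand-in: derive date parts from a seconds-based timedelta column."""
    out = df.copy()
    start = pd.Timestamp(start_date)
    # Assumption: delta_col holds seconds elapsed since start_date.
    out["Date"] = start + pd.to_timedelta(out[delta_col], unit="s")
    # pandas uses Monday = 0; shift so that 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
    out["Weekdays"] = (out["Date"].dt.dayofweek + 1) % 7
    out["Hours"] = out["Date"].dt.hour
    out["Days"] = out["Date"].dt.day
    return out


# Toy rows: 1 day, 1 day + 1 hour, and 7 days after the start date.
toy = pd.DataFrame({"TransactionDT": [86400, 90000, 604800]})
result = add_date_columns_sketch(toy, "TransactionDT")
print(result)
```

The sketch returns a new DataFrame instead of modifying the input, which makes it easier to try out on a small sample first.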

### Understanding Target Variable
### isFraud Column
* `Definition:` Indicates whether a transaction is fraudulent.  
* `Categories & Labels:` 0 = Non-Fraud, 1 = Fraud.
* I created the "isFraud_" column to show the labels as category names. I'll use it to show bivariate relationships. 

In [None]:
train_transaction.isFraud_.value_counts()

In [None]:
countplot_viz(
    train_transaction,
    "isFraud_",
    "isFraud",
    "Freq",
    "isFraud Distribution",
    palette="rocket_r",
    figsize_x=12,
    figsize_y=4,
)

##### How to Read The Graph: 
* This graph shows the counts of fraud and non-fraud transactions. 
* The x-axis shows the fraud status; the y-axis shows the transaction count for each status.

##### Observation:
* Fraud transactions make up 3.4% of the dataset.
* 3.4% looks small by count, but the fraud share of the total transaction amount may be higher or lower than that. We'll see it later.
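
To make the difference between the two shares concrete, here is a small sketch on toy data (the numbers are made up, not taken from the dataset). It compares the fraud share by transaction count with the fraud share by transaction amount.

```python
import pandas as pd

# Toy data: 10 transactions, one of them fraudulent and larger than average.
toy = pd.DataFrame(
    {
        "isFraud": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        "TransactionAmt": [50.0, 20.0, 30.0, 40.0, 60.0, 10.0, 70.0, 80.0, 40.0, 100.0],
    }
)

# Share of fraud by number of transactions.
count_share = toy["isFraud"].mean()

# Share of fraud by total transaction amount.
fraud_amount = toy.loc[toy["isFraud"] == 1, "TransactionAmt"].sum()
amount_share = fraud_amount / toy["TransactionAmt"].sum()

print(f"fraud share by count:  {count_share:.1%}")
print(f"fraud share by amount: {amount_share:.1%}")
```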

## Bivariate Relationships:
### ProductCD & isFraud: 

In [None]:
pd.crosstab(
    train_transaction.isFraud_, train_transaction.ProductCD, margins=True
).style.background_gradient(cmap="mako")


* I called `countplot_pointplot_viz` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and draws a countplot and a pointplot. 

In [None]:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    train_transaction,
    filter_list,
    "ProductCD",
    "isFraud_",
    "isFraud",
    "Product Codes",
    "Freq",
    "ProductCD & isFraud",
    palette="rocket",
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by product code.   
* The pointplot shows the percentage of fraud among all transactions of each product code. 
* The x-axis shows product codes.
* The y-axis shows: 
    * Counts of transactions per product code, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each product code.

##### Observation:
* Most transactions belong to product code "W". 
* Product code "C" has the highest fraud rate, at 11.5%.
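
The numbers behind a countplot and pointplot pair like this one can also be checked in a table. The sketch below uses toy data and is not the code of `countplot_pointplot_viz`; it only shows one way to get the transaction count and the fraud rate per category with a pandas groupby.

```python
import pandas as pd

# Toy data (made up): product codes with a numeric 0/1 fraud label.
toy = pd.DataFrame(
    {
        "ProductCD": ["W", "W", "W", "W", "C", "C", "H", "H"],
        "isFraud": [0, 0, 0, 1, 1, 0, 0, 0],
    }
)

# Count, number of frauds and fraud rate per product code.
summary = (
    toy.groupby("ProductCD")["isFraud"]
    .agg(transactions="count", fraud_count="sum", fraud_rate="mean")
    .sort_values("fraud_rate", ascending=False)
)
print(summary)
```

The same pattern applies to the other categorical columns in this section (card4, card6, addr1 and so on) by changing the grouping column.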

### Card1-Card6 & isFraud: 

* We can use card4 and card6 for the analysis. 
    * card4 : Credit card distributors
    * card6 : Card types

### Card4 & isFraud:

In [None]:
pd.crosstab(
    train_transaction.isFraud_, train_transaction.card4, margins=True
).style.background_gradient(cmap="mako")

* I called `countplot_pointplot_viz` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and draws a countplot and a pointplot. 

In [None]:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    train_transaction,
    filter_list,
    "card4",
    "isFraud_",
    "isFraud",
    "Credit Card Distributors",
    "Freq",
    "Distributors & isFraud",
    palette="rocket",
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by card4.   
* The pointplot shows the percentage of fraud among all transactions of each credit card distributor. 
* The x-axis shows credit card distributors.
* The y-axis shows: 
    * Counts of transactions per credit card distributor, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each credit card distributor.

##### Observation:
* Most transactions were made with "visa", followed by "mastercard". As a result, the fraudulent transaction counts of these distributors are higher than the others. But we should also look at the fraud rate within each distributor's transactions. 
    * For visa, the fraud rate is 0.034 (3.4%). 
    * For mastercard, the fraud rate is 0.034 (3.4%). 
    * For discover, the fraud rate is 0.077 (7.7%). 
    * For american express, the fraud rate is 0.028 (2.8%). 
* Based on these results, "discover" is more exposed to fraudulent activity than the other distributors. 

### Card6 & isFraud:

In [None]:
pd.crosstab(
    train_transaction.isFraud_, train_transaction.card6, margins=True
).style.background_gradient(cmap="mako")

* I called `countplot_pointplot_viz` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and draws a countplot and a pointplot. 

In [None]:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    train_transaction,
    filter_list,
    "card6",
    "isFraud_",
    "isFraud",
    "Card Types",
    "Freq",
    "Card Types & isFraud",
    palette="rocket",
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by card6.   
* The pointplot shows the percentage of fraud among all transactions of each card type. 
* The x-axis shows card types.
* The y-axis shows: 
    * Counts of transactions per card type, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each card type.

##### Observation:
* Most transactions were made with "debit" cards, followed by "credit" cards. 
* For credit cards, the fraud rate is 0.066 (6.6%), the highest among all card types. Based on this result, "credit" cards are more exposed to fraudulent activity than the other card types. 

### addr1-addr2 & isFraud:
### addr1 & isFraud:

* I called `countplot_pointplot_viz` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and draws a countplot and a pointplot. 

In [None]:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    top_region_df,
    filter_list,
    "addr1",
    "isFraud_",
    "isFraud",
    "Regions",
    "Freq",
    "Regions & isFraud",
    palette="rocket",
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by addr1 (regions).   
* The pointplot shows the percentage of fraud among all transactions of each region. 
* The x-axis shows regions.
* The y-axis shows: 
    * Counts of transactions per region, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each region.

##### Observation:
* Most transactions belong to region 299.  
* In terms of fraud rate, regions 330, 272 and 204 have the highest fraudulent activity. 

### addr2 & isFraud:

* I called `countplot_pointplot_viz` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and draws a countplot and a pointplot. 

In [None]:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    top_country_df,
    filter_list,
    "addr2",
    "isFraud_",
    "isFraud",
    "Countries",
    "Freq",
    "Countries & isFraud",
    palette="rocket",
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by addr2 (countries).   
* The pointplot shows the percentage of fraud among all transactions of each country. 
* The x-axis shows countries.
* The y-axis shows: 
    * Counts of transactions per country, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each country.

##### Observation:
* Most transactions belong to country 87.  
* In terms of fraud rate, country 65 has the highest fraudulent activity. 

### P_emaildomain & isFraud:

* I called `countplot_pointplot_viz` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and draws a countplot and a pointplot. 

In [None]:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    Pemail_df,
    filter_list,
    "P_emaildomain",
    "isFraud_",
    "isFraud",
    "Purchaser Email Domain",
    "Freq",
    "Purchaser Email Domain & isFraud",
    fontsize_text=9,
    palette="rocket",
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by P_emaildomain (purchaser email domain).   
* The pointplot shows the percentage of fraud among all transactions of each purchaser email domain. 
* The x-axis shows purchaser email domains.
* The y-axis shows: 
    * Counts of transactions per purchaser email domain, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each purchaser email domain.

##### Observation:
* Most transactions belong to "gmail.com".  
* In terms of fraud rate, "outlook.com" has the highest fraudulent activity. 

### R_emaildomain & isFraud:

* I called `countplot_pointplot_viz` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and draws a countplot and a pointplot. 

In [None]:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    Remail_df,
    filter_list,
    "R_emaildomain",
    "isFraud_",
    "isFraud",
    "Recipient Email Domain",
    "Freq",
    "Recipient Email Domain & isFraud",
    fontsize_text=9,
    palette="rocket",
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by R_emaildomain (recipient email domain).   
* The pointplot shows the percentage of fraud among all transactions of each recipient email domain. 
* The x-axis shows recipient email domains.
* The y-axis shows: 
    * Counts of transactions per recipient email domain, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each recipient email domain.

##### Observation:
* Most transactions belong to "gmail.com". 
* In terms of fraud rate, "outlook.com" and "icloud.com" have the highest fraudulent activity. 

### M1-M9 & isFraud:

* I called `countplot_pointplot_viz` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and draws a countplot and a pointplot. 

In [None]:
for col in ["M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8", "M9"]:
    filter_list = ["Fraud", "Non-Fraud"]
    countplot_pointplot_viz(
        train_transaction,
        filter_list,
        col,
        "isFraud_",
        "isFraud",
        col,
        "Freq",
        "isFraud & " + col,
        fontsize_text=9,
        fontsize_title=15,
        figsize_x=12,
        figsize_y=3,
        palette="rocket",
    )

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by each matching column.   
* The pointplot shows the percentage of fraud among all transactions of each matching value. 
* The x-axis shows matching values.
* The y-axis shows: 
    * Counts of transactions per matching value, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each matching value.

##### Observation:
* As I said before, M5, M6, M7 and M8 have more false matches than true matches. In these matching columns:
    * Fraud activities in M5 and M7 mostly belong to "true matching". 
    * Fraud activities in M6 and M8 mostly belong to "false matching". For M6 and M8, a transaction with no match is more likely to be fraud. 
* In the M1, M2, M3, M4 and M9 matching columns:
    * Fraud activities in M1 mostly belong to "true matching".
    * Fraud activities in M2, M3 and M9 mostly belong to "false matching". For M2, M3 and M9, a transaction with no match is more likely to be fraud.
    * M4 has different matching values: M0, M1 and M2. Most transactions belong to "M0". Fraud activities in M4 mostly belong to "M2". 

### C1-C14 & isFraud:

* The data is masked, so we do not review these columns in detail. However, we'll look at the relationship between some of them and the isFraud column.

#### C1 & isFraud:

In [None]:
train_transaction.loc[
    train_transaction.C1.isin(
        train_transaction.C1.value_counts()[
            train_transaction.C1.value_counts() <= 600
        ].index
    ),
    "C1",
] = "Others"
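
The same rare-value grouping is applied to C1 and C2, so it can also be written as a small helper. The sketch below is one possible version under stated assumptions: the name `group_rare_values` is hypothetical, it returns a new Series instead of overwriting the column in place, and the toy threshold of 1 stands in for the 600 used in the cells of this notebook.

```python
import pandas as pd


def group_rare_values(values: pd.Series, min_count: int, other_label: str = "Others") -> pd.Series:
    """Replace values that occur at most min_count times with other_label (hypothetical helper)."""
    counts = values.value_counts()
    rare = counts[counts <= min_count].index
    # Cast to object first so numeric values and the string label can share one column.
    return values.astype(object).where(~values.isin(rare), other_label)


# Toy column: 1.0 and 2.0 are frequent, 15.0 and 16.0 each occur once.
toy = pd.Series([1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 15.0, 16.0])
grouped = group_rare_values(toy, min_count=1)
print(grouped.tolist())
```

Computing `value_counts()` once and reusing it avoids counting the column twice.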

* I called `countplot_pointplot_viz` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and draws a countplot and a pointplot. 

In [None]:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    train_transaction,
    filter_list,
    "C1",
    "isFraud_",
    "isFraud",
    "C1",
    "Freq",
    "C1 & isFraud",
    fontsize_text=8,
    palette="rocket",
    figsize_x=15,
    figsize_y=5,
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by C1 value.   
* The pointplot shows the percentage of fraud among all transactions of each C1 value. 
* The x-axis shows the values of the C1 column.
* The y-axis shows: 
    * Counts of transactions per C1 value, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each C1 value.

##### Observations: 
* The values 1.0, 2.0 and 3.0 account for most of the transactions. 
* The values 15.0, 16.0 and 17.0 have the highest fraud rates. 
* The highest fraud rate belongs to 16.0. 
* The highest fraud count belongs to 1.0. 

#### C2 & isFraud:

In [None]:
train_transaction.loc[
    train_transaction.C2.isin(
        train_transaction.C2.value_counts()[
            train_transaction.C2.value_counts() <= 600
        ].index
    ),
    "C2",
] = "Others"

In [None]:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    train_transaction,
    filter_list,
    "C2",
    "isFraud_",
    "isFraud",
    "C2",
    "Freq",
    "C2 & isFraud",
    fontsize_text=8,
    palette="rocket",
    figsize_x=15,
    figsize_y=5,
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by C2 value.   
* The pointplot shows the percentage of fraud among all transactions of each C2 value. 
* The x-axis shows the values of the C2 column.
* The y-axis shows: 
    * Counts of transactions per C2 value, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each C2 value.

##### Observations: 
* The values 1.0, 2.0 and 3.0 account for most of the transactions. 
* The values 15.0, 17.0 and 18.0 have the highest fraud rates. 
* The highest fraud rate belongs to 15.0. 
* The highest fraud count belongs to 1.0. 

### TransactionAmt & isFraud: 

* I called `histogram_multiple_viz` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and draws four histograms of a column, split by a condition column, on both the raw and the log scale. 

In [None]:
histogram_multiple_viz(
    train_transaction,
    "TransactionAmt",
    "isFraud_",
    "Fraud",
    "Non-Fraud",
    "Log Transaction Amt - Fraud",
    "Log Transaction Amt - Not Fraud",
    "Transaction Amt - Fraud",
    "Transaction Amt - Not Fraud",
    color1="skyblue",
    color2="tomato",
)

##### How to Read The Graph:  
* These graphs show the distributions of Fraud and Non-Fraud transactions. The top two graphs show log-scale histograms of TransactionAmt by the isFraud column. The bottom two graphs show raw histograms of TransactionAmt by the isFraud column.

##### Observations: 
* In both groups (Fraud and Non-Fraud), most transactions have small amounts. However, some Non-Fraud transactions have higher amounts than any Fraud transaction. 
* As I explained before, there are a lot of outliers in the TransactionAmt column. These graphs show that the outliers come from non-fraud transaction amounts rather than from fraud transactions. 
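
A small sketch of why the log scale helps with a column like this. The toy amounts are made up, and the natural logarithm is an assumption, since the transform used inside `histogram_multiple_viz` is not shown here. The table compares the median and maximum per class on the raw and the log scale.

```python
import numpy as np
import pandas as pd

# Toy data (made up): non-fraud amounts include one very large value.
toy = pd.DataFrame(
    {
        "isFraud_": ["Non-Fraud"] * 6 + ["Fraud"] * 3,
        "TransactionAmt": [10.0, 25.0, 50.0, 100.0, 400.0, 5000.0, 30.0, 80.0, 150.0],
    }
)

# Natural log of the amount; amounts must be strictly positive for this to be defined.
toy["LogTransactionAmt"] = np.log(toy["TransactionAmt"])

# Median and maximum per class, on both scales.
summary = toy.groupby("isFraud_")[["TransactionAmt", "LogTransactionAmt"]].agg(["median", "max"])
print(summary)
```

On the raw scale the largest non-fraud amount dominates the axis, while on the log scale both classes fit into a comparable range.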

### TransactionDT & isFraud: 

* I called `countplot_pointplot_viz` from `util_reporting.py`.
* This function:
    * Takes a pandas DataFrame and draws a countplot and a pointplot. 

In [None]:
# Fraud vs. non-fraud transactions by weekday:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    train_transaction,
    filter_list,
    "Weekdays",
    "isFraud_",
    "isFraud",
    "Weekdays",
    "Freq",
    "Weekdays & isFraud",
    fontsize_text=9,
    palette="rocket",
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by weekday.   
* The pointplot shows the percentage of fraud among all transactions of each weekday. 
* The x-axis shows weekdays.
* The y-axis shows: 
    * Counts of transactions per weekday, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each weekday.
* 1 = Monday, 2 = Tuesday, 3 = Wednesday, 4 = Thursday, 5 = Friday, 6 = Saturday, 0 = Sunday   

##### Observations: 
* The first days of the week show less fraud activity. The lowest fraud activity belongs to "Monday". 
* Thursday (4), Friday (5), Saturday (6) and Sunday (0) show more fraud activity than the first days of the week. 

In [None]:
# Fraud vs. non-fraud transactions by hour of day:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    train_transaction,
    filter_list,
    "Hours",
    "isFraud_",
    "isFraud",
    "Hours",
    "Freq",
    "Hours & isFraud",
    fontsize_text=9,
    palette="rocket",
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by hour of day.   
* The pointplot shows the percentage of fraud among all transactions of each hour of day. 
* The x-axis shows hours of day.
* The y-axis shows: 
    * Counts of transactions per hour of day, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each hour of day.

##### Observations: 
* Transaction counts are highest in the late evening hours. 
* Fraud rates are highest in the early morning hours, such as 5 am, 6 am, 7 am, 8 am and 9 am. 
* The highest fraud rate belongs to 7 am. 
* The highest fraud count belongs to 11 pm (hour 23). 

In [None]:
# Fraud vs. non-fraud transactions by day of month:
filter_list = ["Fraud", "Non-Fraud"]
countplot_pointplot_viz(
    train_transaction,
    filter_list,
    "Days",
    "isFraud_",
    "isFraud",
    "Days",
    "Freq",
    "Days & isFraud",
    fontsize_text=9,
    palette="rocket",
)

##### How to Read The Graph: 
* The countplot shows the counts of fraud and non-fraud transactions by day of month.   
* The pointplot shows the percentage of fraud among all transactions of each day of month. 
* The x-axis shows days of month.
* The y-axis shows: 
    * Counts of transactions per day of month, split by the isFraud_ column.
    * Percentage of fraud among all transactions of each day of month.

##### Observations: 
* The fraud rate is highest on the 1st, 29th and 31st days of the month.
* The highest fraud rate belongs to the 31st day of the month. 
* The highest fraud count belongs to the 3rd day of the month. 

## Multivariate Relationships:
### TransactionAmt- TransactionDT & isFraud:

In [None]:
# For fraudulent activities:
fraud_ts = train_transaction.copy()
fraud_ts.set_index("Date", inplace=True)
# Weekly totals of fraud amounts; .sum() is the built-in resample aggregation.
fraud_ts_week = (
    fraud_ts[fraud_ts["isFraud_"] == "Fraud"]["TransactionAmt"].resample("W").sum()
)
fraud_ts_week.plot(
    marker="o",
    markerfacecolor="blue",
    markersize=12,
    color="skyblue",
    title="Weekly Total Fraud Transaction Amount",
    xlabel="Week of Transaction",
    ylabel="Total Amount",
)
plt.show()

##### Observations:  
* The weekly total fraud transaction amount is highest in March and April. 

In [None]:
# For Non-Fraud activities:
fraud_ts = train_transaction.copy()
fraud_ts.set_index("Date", inplace=True)
# Weekly totals of non-fraud amounts; .sum() is the built-in resample aggregation.
fraud_ts_week = (
    fraud_ts[fraud_ts["isFraud_"] == "Non-Fraud"]["TransactionAmt"]
    .resample("W")
    .sum()
)
fraud_ts_week.plot(
    marker="o",
    markerfacecolor="blue",
    markersize=12,
    color="skyblue",
    title="Weekly Total Non-Fraud Transaction Amount",
    xlabel="Week of Transaction",
    ylabel="Total Amount",
)
plt.show()

##### Observations:  
* The weekly total non-fraud transaction amount is highest in December. 
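
The two weekly series above can also be computed in one pass, which makes them easier to compare side by side. This is a sketch on toy data (made-up dates and amounts), assuming the same `Date`, `isFraud_` and `TransactionAmt` columns as in the cells above. It groups by week with `pd.Grouper` and by class at the same time.

```python
import pandas as pd

# Toy data (made up): five transactions across two calendar weeks.
toy = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2017-12-04", "2017-12-05", "2017-12-06", "2017-12-12", "2017-12-13"]
        ),
        "isFraud_": ["Fraud", "Non-Fraud", "Non-Fraud", "Fraud", "Non-Fraud"],
        "TransactionAmt": [100.0, 20.0, 30.0, 50.0, 70.0],
    }
)

# Weekly totals per class: one row per week (labelled by its Sunday), one column per class.
weekly = (
    toy.groupby([pd.Grouper(key="Date", freq="W"), "isFraud_"])["TransactionAmt"]
    .sum()
    .unstack(fill_value=0.0)
)
print(weekly)
```

Calling `weekly.plot(marker="o")` on the result would draw both lines on one chart. Because non-fraud totals are much larger than fraud totals in the real data, separate charts or a secondary axis may still be easier to read.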