The purpose of this notebook is to perform exploratory analysis and summarize the characteristics of the data that will be used to build a model predicting the probability that an online transaction is fraudulent (as denoted by the binary target, "isFraud").

# Prepare Environment

Load packages needed for data exploration

In [None]:
import os
import pandas as pd 
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

import gc


In [None]:
pd.set_option('display.max_columns', 100)

The data for this project is broken into two files: identity and transaction, which are joined by "TransactionID". Not all transactions have corresponding identity information.

# Load Training Data

I will begin by importing the training sets for the "identity" and "transaction" files respectively.

In [None]:
train_id = pd.read_csv('../input/ieee-fraud-detection/train_identity.csv')
train_transaction = pd.read_csv('../input/ieee-fraud-detection/train_transaction.csv')

print(f'Size of train_id - rows: {train_id.shape[0]}, columns: {train_id.shape[1]}')
print(f'Size of train_transaction - rows: {train_transaction.shape[0]}, columns: {train_transaction.shape[1]}')

In [None]:
train_id.head()

**Description of the Identity Dataset**

*Indented = Categorical*

* *TransactionID* — Foreign key to the Transaction Dataset.
>* *id_01-id_38* — Masked features corresponding to the identity of the card holders.
>* *DeviceType* — Type of Device used to make the Transaction.
>* *DeviceInfo* — Information regarding the characteristics of the Device.

In [None]:
train_transaction.head()

**Description of Transaction Dataset**
  
  *Indented = Categorical*
   
* *TransactionID* — ID of the transaction; it is the foreign key in the Identity Dataset.
* *isFraud* — 0 or 1 signifying whether a transaction is fraudulent or not.
* *TransactionDT* — timedelta from a given reference datetime (not an actual timestamp)
* *TransactionAmt* — Transaction Payment Amount in USD.
>  * *ProductCD* — Product Code.
>  * *card1-card6* — Payment Card information, such as card type, card category, issue bank, country, etc.
>  * *addr1, addr2* — Address
* *dist* — Distance
>  * *P_emaildomain* — Purchaser Email Domain.
>  * *R_emaildomain* — Receiver Email Domain.
* *C1-C14* — counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
* *D1-D15* — timedelta, such as days between previous transactions, etc.
> * *M1-M9* — match, such as names on card and address, etc.
* *V1-V339* — Vesta engineered rich features, including ranking, counting, and other entity relations.
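
As a rough illustration of working with `TransactionDT`, the sketch below converts offsets into elapsed days. It assumes the offsets are measured in seconds, which the description above does not state, and it uses made-up values rather than the real data.

```python
import pandas as pd

# Hypothetical TransactionDT values; the unit (seconds) is an assumption.
transaction_dt = pd.Series([86400, 90000, 172800, 260000])

# Interpret the offsets as timedeltas relative to an unknown reference datetime.
elapsed = pd.to_timedelta(transaction_dt, unit='s')

# Whole days elapsed since the reference point, useful for grouping by day.
day_index = elapsed.dt.days
print(day_index.tolist())
```

Because the reference datetime is unknown, only relative quantities such as the day index can be derived this way, not calendar dates.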

Let's look at how much memory these dataframes are taking up.

In [None]:
trainID_GB = (train_id.memory_usage(deep = True).sum()/1024**3)
trainTR_GB = (train_transaction.memory_usage(deep = True).sum()/1024**3)

print(f'The train_id dataframe is taking up about {trainID_GB:.2f} GB of memory storage')
print(f'The train_transaction dataframe is taking up about {trainTR_GB:.2f} GB of memory storage')

**Merge Dataframes**

We will merge these files into one data set. 

In [None]:
train = train_transaction.merge(train_id,on=['TransactionID'],how='left') 
print(f'Size of train - rows : {train.shape[0]}, columns : {train.shape[1]}')
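
Since not all transactions have identity information, a left join keeps every transaction and leaves the identity columns empty where there is no match. The sketch below shows this on two tiny made-up tables; the `indicator=True` option adds a `_merge` column that records which rows found a match.

```python
import pandas as pd

# Toy stand-ins for the transaction and identity tables (values are made up).
transactions = pd.DataFrame({'TransactionID': [1, 2, 3, 4],
                             'TransactionAmt': [10.0, 25.5, 7.0, 99.9]})
identity = pd.DataFrame({'TransactionID': [2, 4],
                         'DeviceType': ['mobile', 'desktop']})

# A left join keeps every transaction; indicator=True records which rows found a match.
merged = transactions.merge(identity, on='TransactionID', how='left', indicator=True)

# Share of transactions that have identity information.
has_identity = (merged['_merge'] == 'both').mean()
print(f'{has_identity:.0%} of transactions have identity information')
```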

In [None]:
train.head()

I will now calculate the size (in GB) of the merged DataFrame.

In [None]:
train_GB = train.memory_usage(deep = True).sum()/1024**3 
print(f'train dataframe is using {train_GB:.2f} GB of memory storage')

We will be using the merged train dataframe moving forward and no longer have use for the train_id and train_transaction dataframes. We will delete these to free up some memory.

In [None]:
del train_id
del train_transaction
gc.collect()  # call the function; a bare gc.collect does nothing

# Reduce Memory Usage

In [None]:
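def reduce_mem_usage(df):
    """Downcast numeric columns to the smallest dtype that holds their range and
    convert object columns to 'category'. Modifies df in place and returns it."""
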
    start_mem = df.memory_usage(index=True, deep=True).sum() / 1024**2
    start_mem_GB = df.memory_usage(index=True, deep=True).sum() / 1024**3
    print(f'Initial memory usage of dataframe is {start_mem:.2f} MB/{start_mem_GB:.2f} GB')
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Use inclusive bounds so values equal to a dtype's limits still fit.
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                else:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage(index=True, deep=True).sum() / 1024**2
    end_mem_GB = df.memory_usage(index=True, deep=True).sum() / 1024**3
    reduction = 100 * (start_mem - end_mem) / start_mem
    print(f'Memory usage after optimization is: {end_mem:.2f} MB/{end_mem_GB:.2f} GB')
    print(f'Decreased by {reduction:.1f}%')
    
    return df

train = reduce_mem_usage(train)
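
As an alternative sketch (not the function used above), pandas can choose the smaller dtype itself through `pd.to_numeric` with the `downcast` argument. Its float downcast goes no lower than `float32`, which avoids the precision loss that `float16` can introduce. The frame below is made up for illustration.

```python
import pandas as pd

# Toy frame with default 64-bit dtypes (values are made up).
df = pd.DataFrame({'count': [1, 2, 3], 'amount': [10.5, 20.25, 30.0]})

# downcast='integer' picks the smallest integer dtype that holds the values.
df['count'] = pd.to_numeric(df['count'], downcast='integer')

# downcast='float' goes no lower than float32, so float16 rounding is avoided.
df['amount'] = pd.to_numeric(df['amount'], downcast='float')

print(df.dtypes)
```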

# Save Optimized Dataframe 

In [None]:
train.to_csv('train_IEEE',index=False, header=True) 
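
One thing to keep in mind is that CSV is a plain-text format, so the reduced dtypes are not stored in the file and are inferred again on reading. The sketch below compares a CSV round trip with a pickle round trip on a small made-up frame; the file names and the temporary directory are only for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame with reduced dtypes (values are made up).
df = pd.DataFrame({'flag': pd.Series([0, 1, 0], dtype='int8'),
                   'kind': pd.Series(['a', 'b', 'a'], dtype='category')})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'example.csv')
    pkl_path = os.path.join(tmp, 'example.pkl')

    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)        # dtypes are inferred again
    from_pickle = pd.read_pickle(pkl_path)  # dtypes are restored as saved

print(from_csv.dtypes.astype(str).tolist())
print(from_pickle.dtypes.astype(str).tolist())
```

If the memory savings should survive a reload, a dtype-preserving format such as pickle may be the more suitable choice.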

# Check for Missing Values

In [None]:
missing = train.isnull()
total_mv = missing.sum()                        # number of missing values per column
percent_mv = (missing.mean() * 100).round(2)    # percentage of rows missing, rounded to 2 dp

pd.concat([total_mv, percent_mv], axis=1, keys=['Total Missing Values', 'Percent']).transpose()
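
With several hundred columns, a wide table of missing-value counts is hard to scan. A possible follow-up, sketched here on made-up data, is to sort columns by their share of missing values and flag those above a cutoff. The 50% threshold is a hypothetical choice, not one taken from this project.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values (values are made up).
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                   'b': [np.nan, np.nan, np.nan, 1.0],
                   'c': [1, 2, 3, 4]})

# Fraction of missing values per column, largest first.
missing_share = df.isnull().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns with more than half of their values missing.
threshold = 0.5
mostly_missing = missing_share[missing_share > threshold].index.tolist()
print(mostly_missing)
```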

# Explore Target Variable 

### Label Distribution

In [None]:
target_count = train['isFraud'].value_counts()
target_percent = train['isFraud'].value_counts(normalize=True) * 100  # share of each class, in percent

print('Target Column : isFraud')
pd.concat([ target_count, target_percent], axis=1, keys=['Count', 'Percent'])

In [None]:
train['isFraud'].value_counts().plot(kind='bar', 
                                     figsize=(7, 5), 
                                     xlabel = "Fraudulent(Yes/No)",
                                     ylabel ="Count of Transactions",
                                     title= "Count of Fraudulent vs Non-Fraudulent Transactions")

In [None]:
def plot_count(feature, title, df, size=1):
    f, ax = plt.subplots(1,1, figsize=(4*size,4))
    total = float(len(df))
    g = sns.countplot(x=feature, data=df, order=df[feature].value_counts().index[:30], palette='Set3', ax=ax)
    g.set_title("Number and percentage of {}".format(title))
    if(size > 2):
        plt.xticks(rotation=90, size=8)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.1f}%'.format(100*height/total),
                ha="center") 
    plt.show()   

In [None]:
plot_count('isFraud',  'train: isFraud', df=train, size=1)


We can see from the chart above that the data is imbalanced: 96.5% of the transactions in the dataset are legitimate, while 3.5% are fraudulent.

We will need to address this class imbalance before or during modeling.
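
One common option, shown here only as a sketch and not necessarily the method this project will adopt, is to weight each class inversely to its frequency. The formula below, n_samples / (n_classes * class_count), matches the "balanced" heuristic used by scikit-learn. The labels are made up to roughly mimic the imbalance seen in isFraud.

```python
import pandas as pd

# Toy labels with roughly the same imbalance as isFraud (values are made up).
y = pd.Series([0] * 96 + [1] * 4)

counts = y.value_counts()

# Weight each class inversely to its frequency: n_samples / (n_classes * class_count).
class_weights = (len(y) / (len(counts) * counts)).to_dict()
print(class_weights)
```

The rare class receives a much larger weight, so errors on fraudulent transactions count for more during training.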

# Explore Other Features

### Distribution of "card4" (credit card company) feature

In [None]:
plot_count('card4',  'train: card4', df=train, size=2)

65.2% of transactions in our dataset used Visa cards, 32% used Mastercard, 1.4% used American Express and 1.1% used Discover cards.

### Distribution of "card6" (card type) feature

In [None]:
plot_count('card6',  'train: card6', df=train, size=2)

Most of the transactions in our dataset used debit cards. Charge cards are practically absent.

### Distribution of "id_07" feature 

In [None]:
plt.hist(train['id_07'].dropna());  # drop missing values before binning
plt.title('Distribution of id_07 variable');

### Distribution of Transactions by Device Type 

In [None]:
device_type_df = train.groupby(by='DeviceType').TransactionDT.count()
device_type_df.plot(kind='pie',colors=['r','g'],autopct='%1.1f%%', shadow=True, startangle=140)
plt.title('Number of transactions by device type')
print(device_type_df)

Most of the transactions in our dataset were made on desktop computers rather than mobile devices.

### Fraudulent Transactions according to Credit Card Company 

In [None]:
sns.barplot(x ="isFraud",y="card4",data=train)
plt.xlabel('Fraud rate (mean of isFraud)')
plt.ylabel('Card Company')
plt.title('Fraud Rate by Card Company')
plt.show()

Here, we look at the fraud rate by credit card company (the bars show the mean of isFraud for each company). Even though Discover has the lowest representation in the dataset, it appears to have the highest fraud rate.
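
Because `isFraud` is 0 or 1, its mean within a group is a fraud rate while its sum is a count of frauds, and the two can rank groups differently. `sns.barplot` plots the mean by default. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy data (values are made up): a rare card brand with a high fraud rate.
df = pd.DataFrame({'card4': ['visa'] * 8 + ['discover'] * 2,
                   'isFraud': [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]})

# For a 0/1 target, sum gives the number of frauds and mean gives the fraud rate.
summary = df.groupby('card4')['isFraud'].agg(transactions='count',
                                             frauds='sum',
                                             fraud_rate='mean')
print(summary)
```

In this toy example the larger group has more frauds in absolute terms, but the smaller group has the higher fraud rate.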

### Fraudulent Transactions by Card Type  

In [None]:
sns.barplot(x ="isFraud",y="card6",data=train)
plt.xlabel('isFraud')
plt.ylabel('Card Type')
plt.show()

While credit card transactions make up only about 26% of the whole data, they have the highest fraud rate.

### Fraudulent Transactions according to "ProductCD" feature

In [None]:
sns.barplot(x ="isFraud",y="ProductCD",data=train)
plt.xlabel('isFraud')
plt.ylabel('ProductCD')
plt.show()

In the chart above, we see the fraud rate for each ProductCD type.

In [None]:
sns.barplot(x ="isFraud",y="DeviceType",data=train)
plt.xlabel('isFraud')
plt.ylabel('Device Type')
plt.title('Fraud Rate by Device Type')
plt.show()

We saw earlier that most of the transactions in our dataset were made on desktop computers. However, the fraud rate is higher on mobile devices than on desktop computers.

In [None]:
train.dtypes.to_frame().T

In [None]:
train.describe(include='category')

The cell above shows the count, the number of unique values, the top value, and its frequency for each categorical column.

# Identify Categorical and Numerical Features

Our target variable is "isFraud". We save that as y_train.

In [None]:
y_train = train['isFraud']

We will create lists of the numerical and categorical features

In [None]:
cat_features = ['ProductCD', 'card1','card2','card3','card4','card5','card6', 'addr1','addr2', 'P_emaildomain', 'R_emaildomain',
                'M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9', 'DeviceType', 'DeviceInfo', 'id_12', 'id_13','id_14','id_15',
                'id_16','id_17','id_18','id_19','id_20','id_21','id_22','id_23','id_24','id_25','id_26','id_27','id_28','id_29','id_30','id_31',
                'id_32','id_33','id_34','id_35','id_36','id_37','id_38']

num_features = [x for x in train.columns.values[2:] if x not in cat_features]  #slicing from 2 onwards ( first 2 columns are identifier and target)

features = num_features + cat_features

print('Categorical features :', len(cat_features))
print('Numerical features : ',len(num_features))

# Create Dataframe of Only Numerical Features

In [None]:
train = train.drop(columns=cat_features)  # drop() returns a new DataFrame, so reassign it

print(train.shape)


In [None]:
train.head()

**Save Numeric Dataframe**

In [None]:
train.to_csv('train_num',index=False, header=True) 
