# Detailed exploration of IEEE-CIS 'Fraud Detection' dataframe
Here I examine and plot all features/variables from the training dataset, adding notes for all plots for later development

**This kernel is a bit long, so I'm continuing here with the missing values analysis:**  
https://www.kaggle.com/pabloinsente/ieee-missing-nan-values-analysis-and-imputation

## Description variables/features:
(https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203#latest-583068) 

### Transaction Table:
- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAmt: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
- addr: address
- dist: distance
- P_emaildomain and R_emaildomain: purchaser and recipient email domain
- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- D1-D15: timedelta, such as days between previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.


### Identity Table:
- Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions. 
- They're collected by Vesta’s fraud protection system and digital security partners.
- (The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)

In [None]:
import numpy as np 
import pandas as pd 

import seaborn as sns
import matplotlib.pyplot as plt

import os
print(os.listdir("../input"))

# I. Import data

In [None]:
# load data
train_tran = pd.read_csv('../input/train_transaction.csv', index_col='TransactionID')
train_iden = pd.read_csv('../input/train_identity.csv', index_col='TransactionID')
test_tran = pd.read_csv('../input/test_transaction.csv', index_col='TransactionID')
test_iden = pd.read_csv('../input/test_identity.csv', index_col='TransactionID')
sample_sub = pd.read_csv('../input/sample_submission.csv', index_col='TransactionID')

In [None]:
# Join training datasets
train = train_tran.merge(train_iden, how='left',left_index=True, right_index=True)
train.shape

In [None]:
train.head()

In [None]:
# Join testing datasets
test = test_tran.merge(test_iden, how='left',left_index=True, right_index=True)
test.shape

In [None]:
test.head()

In [None]:
# get target feature
y_train = train['isFraud'].copy()
y_train.shape

In [None]:
y_train.head()

In [None]:
# get features matrices
X_train = train.drop('isFraud', axis=1)
X_test = test.copy()

In [None]:
del test, train_tran, train_iden, test_tran, test_iden

# II. Explore data: describe single variables

## Plot univariate distributions

### Categorical variables according to dataset documentation
**Categorical Features - Transaction:**  
- ProductCD
- card1 - card6
- addr1, addr2
- P_emaildomain
- R_emaildomain
- M1 - M9

**Categorical Features - Identity:**
- DeviceType
- DeviceInfo
- id_12 - id_38

## Plot categorical variables

**Plot I: target, ProductCD, DeviceType**


In [None]:
f, axes = plt.subplots(1, 3, figsize=(12, 4))
isFraud = sns.countplot(x='isFraud', data=train, ax=axes[0])
ProductCD = sns.countplot(x='ProductCD', data=train, ax=axes[1])
DeviceType = sns.countplot(x='DeviceType', data=train, ax=axes[2])
plt.tight_layout()

**Plot I notes:**
- Very unbalanced target (isFraud)
- Very unbalanced product type purchases
- Most purchases are made on desktop devices
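
To put a number on the imbalance seen in the count plot, the class shares can be computed directly. This is a minimal sketch on a made-up toy series (`toy_target` is a stand-in for `train['isFraud']`, and its values are invented for illustration only):

```python
import pandas as pd

# Toy stand-in for train['isFraud']; the values are invented for illustration.
toy_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="isFraud")

counts = toy_target.value_counts()                # absolute count per class
shares = toy_target.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

On the real data, the same two calls on `train['isFraud']` should give the exact fraud share that the plot only shows visually.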

**Plot II: DeviceInfo**

In [None]:
# First create a dataframe with 2 cols: device info and the count by device
group = pd.DataFrame()
group['DeviceCount'] = train.groupby(['DeviceInfo'])['DeviceInfo'].count()
group['DeviceInfo'] = group.index

# There are too many Devices, so we will subset the top 20
group_top = group.sort_values(by='DeviceCount',ascending=False).head(20)

plt.figure(figsize=(25, 10))
sns.set(color_codes=True)
sns.set(font_scale = 1.3)
ax = sns.barplot(x="DeviceInfo", y="DeviceCount", data=group_top)
xt = plt.xticks(rotation=60)

**Plot II notes:**

The top devices are:
1. Windows
2. iOS
3. Trident
4. MacOS
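
The "count by value, keep the top 20" pattern is repeated for several features in this kernel. A possible shortcut, sketched here with a hypothetical helper named `top_counts` (not part of the original code) and a small invented series, is to lean on `value_counts`, which already sorts by frequency and skips missing values:

```python
import pandas as pd

def top_counts(series, n=20):
    """Return the n most frequent non-null values as a two-column DataFrame."""
    counts = series.value_counts().head(n)  # sorted descending, NaN dropped
    return pd.DataFrame({series.name: counts.index, "count": counts.to_numpy()})

# Toy stand-in for train['DeviceInfo']; values are invented for illustration.
toy = pd.Series(
    ["Windows", "iOS Device", "Windows", None, "MacOS", "Windows", "iOS Device"],
    name="DeviceInfo",
)
top2 = top_counts(toy, n=2)
print(top2)
```

The resulting frame has one column named after the feature and one named `count`, so it could be passed to `sns.barplot` in the same way as `group_top` above.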

**Plot III: cards 1,2,3, and 5**

In [None]:
# These cards are encoded as float64, and since there are too many values
# we will plot these as distributions (x axis is just an id)
f, axes = plt.subplots(4, 1, figsize=(25, 30))

c1 = sns.distplot(train.card1,kde=False, ax=axes[0])
c2 = sns.distplot(train.card2.dropna(),kde=False, ax=axes[1])
c3 = sns.distplot(train.card3.dropna(),kde=False, ax=axes[2])
c5 = sns.distplot(train.card5.dropna(),kde=False, ax=axes[3])

**Plot III notes:**  
- The bulk of the transactions are on card1 and card2  
- Not sure about identity of card3 and card5
- They may be dollars amount per transaction, or some sort of identifier 

In [None]:
# Plot IV: cards 4 and 6
f, axes = plt.subplots(1, 2, figsize=(18, 6))
sns.set(color_codes=True)
card4 = sns.countplot(x='card4', data=train, ax=axes[0])
card6 = sns.countplot(x='card6', data=train, ax=axes[1])

**Plot IV notes:**  
- card4 refers to the card brand; most transactions are on Visa and Mastercard 
- card6 refers to the type of card; most transactions are debit 

**Plot V: addr1**

In [None]:
# First create a dataframe with 2 cols: addr1 and the count by addr1
group = pd.DataFrame()
group['addr1Count'] = train.groupby(['addr1'])['addr1'].count()
group['addr1'] = group.index

# There are too many addr, so we will subset the top 20
group_top = group.sort_values(by='addr1Count',ascending=False).head(20)

plt.figure(figsize=(25, 10))
sns.set(color_codes=True)
sns.set(font_scale = 1.3)
ax = sns.barplot(x="addr1", y="addr1Count", data=group_top)
xt = plt.xticks(rotation=60)

**Plot VI: addr2**

In [None]:
# First create a dataframe with 2 cols: addr2 and the count by addr2
group = pd.DataFrame()
group['addr2Count'] = train.groupby(['addr2'])['addr2'].count()
group['addr2'] = group.index

# There are too many addr, so we will subset the top 20
group_top = group.sort_values(by='addr2Count',ascending=False).head(20)

plt.figure(figsize=(25, 10))
sns.set(color_codes=True)
sns.set(font_scale = 1.3)
ax = sns.barplot(x="addr2", y="addr2Count", data=group_top)
xt = plt.xticks(rotation=60)

In [None]:
# Let's examine addr2 values
train.addr2.value_counts().head(10)

**Plot V-VI notes:**
- Transactions on addr1 are more evenly distributed
- Transactions on addr2 have one big outlier: a single value dominates

**Plot VII: emaildomains**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(18, 12))

sns.set(color_codes=True)
p_email = sns.countplot(y='P_emaildomain', data=train, ax=axes[0])
r_email = sns.countplot(y='R_emaildomain', data=train, ax=axes[1])
plt.tight_layout()

In [None]:
print(train.addr1.nunique()) # 332 unique locations
print(train.addr2.nunique()) # 74 unique locations
print(train.P_emaildomain.nunique()) # 59 unique domains
print(train.R_emaildomain.nunique()) # 59 unique domains

**Plot VII notes:**
- As expected, gmail is at the top.
- There is one interesting 'anonymous.com' domain

In [None]:
# Plot VIII: M1 - M9 variables
M1_loc = train.columns.get_loc("M1")
M9_loc = train.columns.get_loc("M9")
df_m = train.iloc[:,M1_loc:M9_loc+1] #subset dataframe M1-M9

cols = df_m.columns
f, axes = plt.subplots(3, 3, figsize=(16, 12))
count = 0
for i in range(3): # rows loop
    for j in range(3): # cols loop
        mplot = sns.countplot(x=cols[count], data=df_m, ax=axes[i,j])
        count += 1 # to loop over col-names
plt.tight_layout()

**Plot VIII notes:**
- Identity of M1-M9 is still unclear
- Basically boolean variables; M4 seems to be different

In [None]:
#Exploration id_12 - id_38
id12_loc = train.columns.get_loc("id_12")
id38_loc = train.columns.get_loc("id_38")
df_id = train.iloc[:,id12_loc:id38_loc+1] #subset dataframe id_12-id_38

In [None]:
df_id.dtypes

In [None]:
df_id.head(15)

**Notes id12-id38**
- There is a mix of data types
- Mostly NaN values
- id30 is OS again
- id31 is browser


**Plot IX: id_30**

In [None]:
# First create a dataframe with 2 cols: id_30 (OS) and the count by id_30
group = pd.DataFrame()
group['id_30Count'] = df_id.groupby(['id_30'])['id_30'].count()
group['id_30'] = group.index

# There are too many OS values, so we will subset the top 20
group_top = group.sort_values(by='id_30Count',ascending=False).head(20)

plt.figure(figsize=(25, 10))
sns.set(color_codes=True)
sns.set(font_scale = 1.3)
ax = sns.barplot(x="id_30", y="id_30Count", data=group_top)
xt = plt.xticks(rotation=60)

**Plot X: id_31**

In [None]:
# First create a dataframe with 2 cols: id_31 (browser) and the count by id_31
group = pd.DataFrame()
group['id_31Count'] = df_id.groupby(['id_31'])['id_31'].count()
group['id_31'] = group.index

# There are too many browser values, so we will subset the top 20
group_top = group.sort_values(by='id_31Count',ascending=False).head(20)

plt.figure(figsize=(25, 10))
sns.set(color_codes=True)
sns.set(font_scale = 1.3)
ax = sns.barplot(x="id_31", y="id_31Count", data=group_top)
xt = plt.xticks(rotation=60)

**Plot IX-X notes:**
- Most transactions are done with Windows 7 and 10, and iOS
- Most transactions are done with chrome and safari

**Plot XI: ProductCD**

In [None]:
# ProductCD is a categorical variable (already shown in Plot I); plotted here on its own
plt.figure(figsize=(10, 5))
sns.set(color_codes=True)
sns.set(font_scale = 1.3)
ax = sns.countplot(x='ProductCD', data=train)

## Plot continuous variables 

In [None]:
# Plot XII: TransactionDT, TransactionAmt
f, axes = plt.subplots(2, 1, figsize=(15, 10))

dt = sns.distplot(train.TransactionDT,kde=False, ax=axes[0])
am = sns.distplot(train.TransactionAmt,kde=False, hist_kws={'log':True}, ax=axes[1])

**Plot XII notes:**
- TransactionDT is evenly distributed, unclear identity
- Transaction amount follows a log distribution, with a few large outliers

**Plot XIII: C7 - C14**

In [None]:
C7_loc = train.columns.get_loc("C7")
C14_loc = train.columns.get_loc("C14")
df_c = train.iloc[:,C7_loc:C14_loc+1] #subset dataframe

cols = df_c.columns

rows = 8
f, axes = plt.subplots(rows, 1, figsize=(15, 20))
for i in range(rows):
    dp = sns.distplot(df_c[cols[i]],kde=False, hist_kws={'log':True}, ax=axes[i])
plt.tight_layout()

**Plot XIII notes:**
- All variables follow roughly a log distribution 
- Identity is unclear

**Plot XIV: D1 - D15**

In [None]:
D1_loc = train.columns.get_loc("D1")
D15_loc = train.columns.get_loc("D15")
df_d = train.iloc[:,D1_loc:D15_loc+1] #subset dataframe

cols = df_d.columns
rows = 15
f, axes = plt.subplots(rows, 1, figsize=(15, 25))
for i in range(rows):
    #d = sns.distplot(df_d[cols[i]].dropna(),kde=False, hist_kws={'log':True}, ax=axes[i])
    d = sns.distplot(df_d[cols[i]].dropna(), ax=axes[i])
plt.tight_layout()

**Plot XIV notes:**
- D11-D15 have negative values, which may say something about the identity of the feature
- D9 has a different distribution, kinda binomial
- The rest roughly a log distribution

**Exploration V1 - V339**

In [None]:
V1_loc = train.columns.get_loc("V1")
V339_loc = train.columns.get_loc("V339")
df = train.iloc[:,V1_loc:V339_loc+1] #subset dataframe

df.head(20)

**Notes:**
- From V1-V305 & V322-V339 seems to be mostly 0 - 1 values, which may indicate that they are actually a categorical feature 
- From V306-V321 seems to be true continuous variables 
- More interesting insights may come from computing averages by target feature
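
One way to check the "mostly 0 - 1 values" impression without scanning `head(20)` by eye is to count distinct values per column. This is a rough sketch on an invented frame (`toy_v`); the cutoff of two distinct values is my own assumption for what counts as "binary-like":

```python
import numpy as np
import pandas as pd

# Toy stand-in for the V columns; values are invented for illustration.
toy_v = pd.DataFrame({
    "V1": [1.0, 0.0, 1.0, np.nan, 1.0],
    "V2": [0.0, 0.0, 1.0, 1.0, 0.0],
    "V306": [0.0, 117.5, 59.0, 1758.0, 30.9],
})

n_unique = toy_v.nunique()  # distinct non-null values per column
binary_like = n_unique[n_unique <= 2].index.tolist()  # assumed cutoff: 2

print(n_unique)
print(binary_like)
```

Columns that land in `binary_like` on the real data would be candidates for categorical treatment; the rest would need a closer look.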

**Plot XV: id_01 - id_11**

In [None]:
id_01_loc = train.columns.get_loc("id_01")
id_11_loc = train.columns.get_loc("id_11")
df = train.iloc[:,id_01_loc:id_11_loc+1] #subset dataframe

cols = df.columns
rows = 11
f, axes = plt.subplots(rows, 1, figsize=(15, 25))
for i in range(rows):
    #d = sns.distplot(df[cols[i]].dropna(),kde=False, ax=axes[i])
    d = sns.distplot(df[cols[i]].dropna(), ax=axes[i])
plt.tight_layout()

**Plot XV: id_01 - id_11 / SAME as LOG distributions**

In [None]:
id_01_loc = train.columns.get_loc("id_01")
id_11_loc = train.columns.get_loc("id_11")
df = train.iloc[:,id_01_loc:id_11_loc+1] #subset dataframe

cols = df.columns
rows = 11
f, axes = plt.subplots(rows, 1, figsize=(15, 25))
for i in range(rows):
    d = sns.distplot(df[cols[i]].dropna(),kde=False, hist_kws={'log':True}, ax=axes[i])
plt.tight_layout()

**Plot XV notes:**
- id_02 may be dollar amounts, with log distribution
- id_01 - id_10 have negative values, but it is unlikely to indicate debt given values
- id_07 - id_08 are kinda normally distributed

# III. Explore data: describe variables by target (Fraud / Not Fraud)

## Plot/Explore bivariate relationships

**Plot I: target, ProductCD, DeviceType by Fraud status**

In [None]:
f, axes = plt.subplots(1, 3, figsize=(15, 8))
isFraud = sns.countplot(x='isFraud', data=train, ax=axes[0])
ProductCD = sns.countplot(x='ProductCD', hue="isFraud", data=train, ax=axes[1])
DeviceType = sns.countplot(x='DeviceType', hue="isFraud", data=train, ax=axes[2])
plt.tight_layout()

**Plot I as percentage**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(15, 8))

props = train.groupby("ProductCD")['isFraud'].value_counts(normalize=True).unstack()
p = props.plot(kind='bar', stacked=True, ax=axes[0])

props = train.groupby("DeviceType")['isFraud'].value_counts(normalize=True).unstack()
p = props.plot(kind='bar', stacked=True, ax=axes[1])

**Notes:**
- ProductCD: C and S types have the highest number AND proportion of Fraud Transactions
- DeviceType: mobile has the highest number AND proportion of Fraud Transactions (not by much though)
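
Since these notes compare the NUMBER and the PROPORTION of fraud per category, it may help to have both in one table. The following is a small sketch on invented data (`toy`), using named aggregation on the grouped `isFraud` column; because `isFraud` is 0/1, its sum is the fraud count and its mean is the fraud rate:

```python
import pandas as pd

# Toy stand-in for train[['ProductCD', 'isFraud']]; values are invented.
toy = pd.DataFrame({
    "ProductCD": ["W", "W", "W", "W", "C", "C", "S", "S"],
    "isFraud":   [0,   0,   0,   1,   1,   0,   0,   0],
})

by_product = toy.groupby("ProductCD")["isFraud"].agg(
    n_transactions="size",  # rows per category
    n_fraud="sum",          # fraud count (isFraud is 0/1)
    fraud_rate="mean",      # fraud share within the category
)

print(by_product)
```

The same call with `"DeviceType"`, `"card4"`, or `"card6"` as the grouping column should reproduce the numbers behind the percentage plots.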

**Plot II: Fraud transactions by OS**

In [None]:
# Subset dataframe
fraud = pd.DataFrame()
is_fraud = train[train['isFraud']==1]
fraud['DeviceCount'] = is_fraud.groupby(['DeviceInfo'])['DeviceInfo'].count()
fraud['DeviceInfo'] = fraud.index

# There are too many Devices, so we will subset the top 20
group_top = fraud.sort_values(by='DeviceCount',ascending=False).head(20)

plt.figure(figsize=(25, 10))
sns.set(color_codes=True)
sns.set(font_scale = 1.3)
ax = sns.barplot(x="DeviceInfo", y="DeviceCount", data=group_top)

font_size= {'size': 'x-large'}
ax.set_title("Fraud transactions by OS", **font_size)
xt = plt.xticks(rotation=60)

**Notes**
- Fraud cases come mostly from Windows and iOS devices. This is expected, since the vast majority of all transactions come from those systems. Still, this feature may signal to the model that Windows/iOS Device transactions are more likely to be fraud relative to other systems
- Trident drops 5 places (3rd overall, 8th among Fraud transactions)

**Plot III: cards 1,2,3, and 5**

In [None]:
# These cards are encoded as float64, and since there are too many values
# we will plot this as distributions (x axis is just a class identifier)

is_fraud = train[train['isFraud']==1]
no_fraud = train[train['isFraud']==0]

f, axes = plt.subplots(4, 1, figsize=(25, 30))

d1 = sns.distplot(no_fraud.card1, color="fuchsia", label="No fraud", ax=axes[0])
l1 = d1.legend()
c1 = sns.distplot(is_fraud.card1, color="black", label = "Fraud", ax=axes[0])
l2 = c1.legend()

d2 = sns.distplot(no_fraud.card2.dropna(), color="fuchsia", label="No fraud", ax=axes[1])
l3 = d2.legend()
c2 = sns.distplot(is_fraud.card2.dropna(), color="black",  label = "Fraud", ax=axes[1])
l4 = c2.legend()

d3 = sns.distplot(no_fraud.card3.dropna(), color="fuchsia", label="No fraud", ax=axes[2])
l5 = d3.legend()
c3 = sns.distplot(is_fraud.card3.dropna(), color="black",  label = "Fraud", ax=axes[2])
l6 = c3.legend()

d5 = sns.distplot(no_fraud.card5.dropna(), color="fuchsia", label="No fraud", ax=axes[3])
l7 = d5.legend()
c5 = sns.distplot(is_fraud.card5.dropna(), color="black", label = "Fraud", ax=axes[3])
l8 = c5.legend()

**Notes:**
- card1, card2 and card3 show similar distribution patterns fraud/no_fraud
- card5: the fraud/no_fraud proportions reverse around the value '225' 

**Plot IV: cards 4 and 6**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(18, 10))
sns.set(color_codes=True)
card4 = sns.countplot(x='card4', hue="isFraud", data=train, ax=axes[0])
card6 = sns.countplot(x='card6', hue="isFraud", data=train, ax=axes[1])

**Plot IV as percentage**

In [None]:
f, axes = plt.subplots(1, 2, figsize=(18, 10))

props = train.groupby("card4")['isFraud'].value_counts(normalize=True).unstack()
p = props.plot(kind='bar', stacked=True, ax=axes[0])

props = train.groupby("card6")['isFraud'].value_counts(normalize=True).unstack()
p = props.plot(kind='bar', stacked=True, ax=axes[1])

**Notes:**
- Visa has the highest NUMBER of Fraud transactions, but that number is a small proportion of all Visa transactions
- Discover has very few Fraud transactions, yet as a percentage of all Discover transactions it is a bit higher
- Most Fraud transactions are made with debit cards, but the proportion of Fraud is higher within credit 

**Plot V: addr1**

In [None]:
# Subset fraud dataset
addr = 'addr1'
addrC = 'addr1Count'
fraud = pd.DataFrame()
is_fraud = train[train['isFraud']==1]
fraud[addrC] = is_fraud.groupby([addr])[addr].count()
fraud[addr] = fraud.index

# Subset NOT fraud dataset
NOfraud = pd.DataFrame()
no_fraud = train[train['isFraud']==0]
NOfraud[addrC] = no_fraud.groupby([addr])[addr].count()
NOfraud[addr] = NOfraud.index

# There are too many addr, so we will subset the top 20
group_top_f = fraud.sort_values(by=addrC,ascending=False).head(20)
order_f = group_top_f.sort_values(by=addrC,ascending=False)[addr]

group_top_l = NOfraud.sort_values(by=addrC,ascending=False).head(20)
order_l = group_top_l.sort_values(by=addrC,ascending=False)[addr]

f, axes = plt.subplots(4, 1, figsize=(18, 20))

sns.set(color_codes=True)
sns.set(font_scale = 1.3)
ax = sns.barplot(x=addr, y=addrC, data=group_top_f, order = order_f, ax=axes[0])
bx = sns.barplot(x=addr, y=addrC, data=group_top_l, order = order_l, ax=axes[1])

az = sns.barplot(x=addr, y=addrC, data=group_top_f, ax=axes[2])
bz = sns.barplot(x=addr, y=addrC, data=group_top_l, ax=axes[3])

font_size= {'size': 'x-large'}
ax.set_title("Fraud transactions by addr1 (ranked)", **font_size)
bx.set_title("Legit transactions by addr1 (ranked)", **font_size)

az.set_title("Fraud transactions by addr1", **font_size)
bz.set_title("Legit transactions by addr1", **font_size)

xt = plt.xticks(rotation=60)
plt.tight_layout()

**Notes:**
- Most fraud transactions come from addr1 204; most not fraud come from 299
- First 5 addr1 are the same, but in different rank-order

**Plot VI: addr2**

In [None]:
# Subset fraud dataset
addr = 'addr2'
addrC = 'addr2Count'
fraud = pd.DataFrame()
is_fraud = train[train['isFraud']==1]
fraud[addrC] = is_fraud.groupby([addr])[addr].count()
fraud[addr] = fraud.index

# Subset NOT fraud dataset
NOfraud = pd.DataFrame()
no_fraud = train[train['isFraud']==0]
NOfraud[addrC] = no_fraud.groupby([addr])[addr].count()
NOfraud[addr] = NOfraud.index

# There are too many addr, so we will subset the top 20
group_top_f = fraud.sort_values(by=addrC,ascending=False).head(20)
order_f = group_top_f.sort_values(by=addrC,ascending=False)[addr]

group_top_l = NOfraud.sort_values(by=addrC,ascending=False).head(20)
order_l = group_top_l.sort_values(by=addrC,ascending=False)[addr]

f, axes = plt.subplots(4, 1, figsize=(18, 20))

sns.set(color_codes=True)
sns.set(font_scale = 1.3)
ax = sns.barplot(x=addr, y=addrC, data=group_top_f, order = order_f, ax=axes[0])
bx = sns.barplot(x=addr, y=addrC, data=group_top_l, order = order_l, ax=axes[1])

az = sns.barplot(x=addr, y=addrC, data=group_top_f, ax=axes[2])
bz = sns.barplot(x=addr, y=addrC, data=group_top_l, ax=axes[3])

font_size= {'size': 'x-large'}
ax.set_title("Fraud transactions by addr2 (ranked)", **font_size)
bx.set_title("Legit transactions by addr2 (ranked)", **font_size)

az.set_title("Fraud transactions by addr2", **font_size)
bz.set_title("Legit transactions by addr2", **font_size)

xt = plt.xticks(rotation=60)
plt.tight_layout()

**Plot VII: emaildomains by Fraud status**

In [None]:
# Get top 10 
order_p=train.P_emaildomain.value_counts().iloc[:10].index
order_r=train.R_emaildomain.value_counts().iloc[:10].index

f, axes = plt.subplots(1, 2, figsize=(16, 8))

sns.set(color_codes=True)
p_email = sns.countplot(y='P_emaildomain',  hue="isFraud", data=train, order = order_p, ax=axes[0])
r_email = sns.countplot(y='R_emaildomain',  hue="isFraud", data=train, order = order_r, ax=axes[1])
plt.tight_layout()

**Plot VII: emaildomains by Fraud status as percentage**

In [None]:
f, axes = plt.subplots(2, 1, figsize=(12, 20))

props = train.groupby("P_emaildomain")['isFraud'].value_counts(normalize=True).unstack()
p = props.plot(kind='barh', stacked=True, ax=axes[0])

props = train.groupby("R_emaildomain")['isFraud'].value_counts(normalize=True).unstack()
p = props.plot(kind='barh', stacked=True, ax=axes[1])

plt.tight_layout()

**Notes:**  
- **'Protonmail.com'**,  **'mail.com'**, **'outlook.es'**, and **'net.zero'** have a high proportion of Fraud transactions, yet they account for a small total number of fraud transactions

**Plot VIII: M1 - M9 by Fraud status variables**

In [None]:
M1_loc = train.columns.get_loc("M1")
M9_loc = train.columns.get_loc("M9")
df_m = train.iloc[:,M1_loc:M9_loc+1].copy() #subset dataframe M1-M9 (copy so a column can be added safely)
df_m['isFraud'] = train.isFraud 

cols = df_m.columns
f, axes = plt.subplots(3, 3, figsize=(16, 12))
count = 0
for i in range(3): # rows loop
    for j in range(3): # cols loop
        mplot = sns.countplot(x=cols[count], hue = 'isFraud', data=df_m, ax=axes[i,j])
        count += 1 # to loop over col-names
plt.tight_layout()

**Plot VIII: M1 - M9 variables by Fraud status as percentage**

In [None]:
ms = df_m.columns.tolist()
ms.pop()
rows = len(ms)
f, axes = plt.subplots(rows, 1, figsize=(12, 20))
for i,m in enumerate(ms): 
    props = train.groupby(m)['isFraud'].value_counts(normalize=True).unstack()
    p = props.plot(kind='barh', stacked='True', ax=axes[i])
plt.tight_layout()

**Notes:**
- As raw frequencies, it is hard to spot meaningful differences between classes
- As percentages, there are some interesting patterns: in 'M4', class M2 has the highest proportion of Fraud transactions, and in 'M1', 'F' has no Fraud cases

**Plot IX: id_30 (OS) by Fraud status**

In [None]:
# Subset fraud dataset
addr = 'id_30'
addrC = 'id_30Count'
fraud = pd.DataFrame()
is_fraud = train[train['isFraud']==1]
fraud[addrC] = is_fraud.groupby([addr])[addr].count()
fraud[addr] = fraud.index

# Subset NOT fraud dataset
NOfraud = pd.DataFrame()
no_fraud = train[train['isFraud']==0]
NOfraud[addrC] = no_fraud.groupby([addr])[addr].count()
NOfraud[addr] = NOfraud.index

# There are too many OS, so we will subset the top 20
group_top_f = fraud.sort_values(by=addrC,ascending=False).head(20)
order_f = group_top_f.sort_values(by=addrC,ascending=False)[addr]

group_top_l = NOfraud.sort_values(by=addrC,ascending=False).head(20)
order_l = group_top_l.sort_values(by=addrC,ascending=False)[addr]

f, axes = plt.subplots(2, 1, figsize=(18, 20))

sns.set(color_codes=True)
sns.set(font_scale = 1.3)

ax = sns.barplot(y=addr, x=addrC, data=group_top_f, order = order_f, ax=axes[0])
bx = sns.barplot(y=addr, x=addrC, data=group_top_l, order = order_l, ax=axes[1])

font_size= {'size': 'x-large'}
ax.set_title("Fraud transactions by OS (ranked)", **font_size)
bx.set_title("Legit transactions by OS (ranked)", **font_size)

plt.tight_layout()

**Plot IX as percentage**

In [None]:
f, axes = plt.subplots(2, 1, figsize=(12, 20))

props = train.groupby("id_30")['isFraud'].value_counts(normalize=True).unstack()
props = props.sort_values(by=1, ascending = False).head(20) # sort by fraud and get top 20
p = props.plot(kind='barh', stacked=True, ax=axes[0])

props = train.groupby("id_31")['isFraud'].value_counts(normalize=True).unstack()
props = props.sort_values(by=1, ascending = False).head(20) # sort by fraud and get top 20
p = props.plot(kind='barh', stacked=True, ax=axes[1])

plt.tight_layout()

In [None]:
# Let's get the frequency of the cases with higher proportion of Fraud
props_30 = train.groupby("id_30")['isFraud'].value_counts(normalize=True).unstack()
props_30 = props_30.sort_values(by=1, ascending = False).head(20) # sort by fraud and get top 20
id_30_top = props_30.index.tolist()
props_30_c = train.groupby("id_30")['isFraud'].value_counts()
props_30_c.loc[id_30_top]

In [None]:
props_31 = train.groupby("id_31")['isFraud'].value_counts(normalize=True).unstack()
props_31 = props_31.sort_values(by=1, ascending = False).head(20) # sort by fraud and get top 20
id_31_top = props_31.index.tolist()
props_31_c = train.groupby("id_31")['isFraud'].value_counts()
props_31_c.loc[id_31_top]

**Notes**
- **id_30**: Other and Android 5.1.1 have the highest proportion of Fraud, **BUT** negligible frequency: Other has 6 cases and Android 5.1.1 has 101 cases
- **id_31**: Lanix, Mozilla, comodo, and lanix have really high proportions of 'Fraud', **BUT** negligible frequency: Lanix/Ilium 1 fraud case, Mozilla/Firefox 5 fraud cases, comodo 2, lanix 1
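
Because the highest fraud proportions come from categories with only a handful of rows, one option is to rank by fraud rate only among categories with a minimum number of transactions. This is a sketch on invented data; `MIN_COUNT` is a hypothetical threshold that I picked for the toy example, not a value from the analysis above:

```python
import pandas as pd

# Toy stand-in for train[['id_31', 'isFraud']]; values are invented.
toy = pd.DataFrame({
    "id_31": ["chrome"] * 6 + ["safari"] * 4 + ["comodo"] * 2,
    "isFraud": [0, 0, 0, 0, 0, 1,  0, 0, 0, 1,  1, 1],
})

MIN_COUNT = 4  # hypothetical minimum support, chosen for this toy example

stats = toy.groupby("id_31")["isFraud"].agg(n="size", fraud_rate="mean")
ranked_all = stats.sort_values("fraud_rate", ascending=False)
ranked_supported = ranked_all[ranked_all["n"] >= MIN_COUNT]

print(ranked_all)        # the rare category tops the list
print(ranked_supported)  # rare categories filtered out
```

In the toy data the two-row category leads the unfiltered ranking but disappears once the support filter is applied, which is the same effect described in the notes for Lanix and comodo.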

**Plot XI: ProductCD by Fraud status**

In [None]:
# ProductCD is a categorical variable; plotted here on its own by fraud status
plt.figure(figsize=(10, 5))
sns.set(color_codes=True)
sns.set(font_scale = 1.3)
ax = sns.countplot(x='ProductCD', hue ="isFraud", data=train)

**Plot XI: ProductCD by fraud status as percentage**

In [None]:
props = train.groupby("ProductCD")['isFraud'].value_counts(normalize=True).unstack()
p = props.plot(kind='barh', stacked=True)

**Notes**  
  
**ProductCD is 'Product code'**   
- 'C': has both the highest number AND the highest proportion of Fraud transactions
- 'W': has a similar frequency of Fraud transactions for a minor proportion of the W class

**Plot XII: TransactionDT, TransactionAmt by Fraud status**

In [None]:
is_fraud = train[train['isFraud']==1]
no_fraud = train[train['isFraud']==0]

f, axes = plt.subplots(2, 1, figsize=(15, 10))

d1 = sns.distplot(no_fraud.TransactionDT, color="fuchsia", label="No fraud", ax=axes[0])
l1 = d1.legend()
d2 = sns.distplot(is_fraud.TransactionDT, color="black", label = "Fraud", ax=axes[0])
l2 = d2.legend()

t1 = sns.distplot(no_fraud.TransactionAmt.apply(np.log2), color="fuchsia", label="No fraud", ax=axes[1])
l3 = t1.legend()
t2 = sns.distplot(is_fraud.TransactionAmt.apply(np.log2), color="black", label = "Fraud", ax=axes[1])
l4 = t2.legend()

plt.tight_layout()

**Notes:**
- **TransactionDT (time delta from some reference time)**: Not-Fraud transactions tend to be closer to the 'Time zero reference' for the transactions; Fraud transactions tend to be a bit more evenly distributed. There is a peak around 0.55
- **TransactionAmt (in dollars)**: Not-Fraud transactions are concentrated in the middle of the distribution, while Fraud transactions are a bit more concentrated in the tails (really small or really big). This makes a lot of intuitive sense: micro-frauds and large-amount frauds are more likely. 

**Plot XIII: C7 - C14 by Fraud status**

In [None]:
C7_loc = train.columns.get_loc("C7")
C14_loc = train.columns.get_loc("C14")
df_c = train.iloc[:,C7_loc:C14_loc+1].copy() #subset dataframe (copy so it can be modified safely)
cols = df_c.columns

# replace zeros so np.log works: log(0) is -inf
df_c.replace(0, 0.000000001, inplace = True) 

df_c['isFraud'] = train.isFraud 

is_fraud = df_c[train['isFraud']==1]
no_fraud = df_c[train['isFraud']==0]

rows = 8
f, axes = plt.subplots(rows, 1, figsize=(15, 30))

for i in range(rows):
    dp = sns.distplot(no_fraud[cols[i]].apply(np.log), color="fuchsia", ax=axes[i])
    dp = sns.distplot(is_fraud[cols[i]].apply(np.log), color="black", ax=axes[i])
plt.tight_layout()

**Notes:**
- This is all supposed to be 'counting' data, yet the plots show a bunch of negative values. These come from the log transform: zeros were replaced by 0.000000001, whose log is about -20.7, so they are not negative counts
- The main pattern is that not-fraud transactions have higher values, more tightly concentrated (high kurtosis), while fraud transactions are more evenly spread out (low kurtosis), which means more outliers
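
The negative values on the log-scaled x axis can be traced to the zero replacement used before `np.log`. The sketch below, on an invented array of counts, shows where a replaced zero lands, and compares it with `np.log1p` (log of 1 + x), which I am suggesting as an alternative and which is not what the cell above uses:

```python
import numpy as np

# Toy count values, invented for illustration.
counts = np.array([0.0, 1.0, 2.0, 10.0])

# What the plotting cell does: swap zeros for a tiny constant, then take the log.
replaced = np.where(counts == 0, 1e-9, counts)
log_replaced = np.log(replaced)  # the zero ends up far out on the negative side

# Alternative (my suggestion): log(1 + x) keeps zeros at 0 and stays non-negative.
log1p_vals = np.log1p(counts)

print(log_replaced)
print(log1p_vals)
```

With `log1p`, zero counts stay at zero instead of forming a separate spike around -20.7, which may make the fraud/not-fraud comparison easier to read.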

**Plot XIV: D1 - D15 by Fraud status**

In [None]:
D1_loc = train.columns.get_loc("D1")
D15_loc = train.columns.get_loc("D15")
df_d = train.iloc[:,D1_loc:D15_loc+1].copy() #subset dataframe (copy so it can be modified safely)
cols = df_d.columns

# replace zeros so np.log works: log(0) is -inf
df_d.replace(0, 0.000000001, inplace = True) 

df_d['isFraud'] = train.isFraud 

# log transform for visualization
is_fraud = df_d[train['isFraud']==1].apply(np.log)
no_fraud = df_d[train['isFraud']==0].apply(np.log)

rows = 15
f, axes = plt.subplots(rows, 1, figsize=(15, 30))
for i in range(rows):
    dp = sns.distplot(no_fraud[cols[i]].dropna(), color="fuchsia", ax=axes[i])
    dp = sns.distplot(is_fraud[cols[i]].dropna(), color="black", ax=axes[i])
plt.tight_layout()

**Notes** 
- **Main insight**: Fraud transactions tend to be **more spread out over time**, while Not-Fraud transactions tend to be **more clustered around shorter time periods** (from the 0 time-point reference) 

**Plot XV: id_01 - id_11 by fraud status**

In [None]:
id_01_loc = train.columns.get_loc("id_01")
id_11_loc = train.columns.get_loc("id_11")
df = train.iloc[:,id_01_loc:id_11_loc+1].copy() #subset dataframe (copy so it can be modified safely)
cols = df.columns

# replace zeros so np.log works: log(0) is -inf
df.replace(0, 0.000000001, inplace = True) 

df['isFraud'] = train.isFraud 

# log transform for visualization
is_fraud = df[train['isFraud']==1].apply(np.log)
no_fraud = df[train['isFraud']==0].apply(np.log)

# drop 'isFraud' after the log transform (log(0) = -inf for the Not-Fraud rows)
is_fraud.drop(columns=['isFraud'], inplace=True)
no_fraud.drop(columns=['isFraud'], inplace=True)

rows = 11
f, axes = plt.subplots(rows, 1, figsize=(15, 25))
for i in range(rows):
    dp = sns.distplot(no_fraud[cols[i]].dropna(), color="fuchsia", ax=axes[i])
    dp = sns.distplot(is_fraud[cols[i]].dropna(), color="black", ax=axes[i])
plt.tight_layout()

**Notes:**
- In the cases where Fraud/Not-Fraud differ, the pattern is the same: **Fraud more clustered with a higher peak**, and **Not-Fraud more spread out with longer/heavier tails**

**Explore 'V' Features**

In [None]:
# Here I subset the dataset by the % difference between Fraud and Not-Fraud transactions
from sklearn import preprocessing

#subset dataframe
V1_loc = train.columns.get_loc("V1")
V339_loc = train.columns.get_loc("V339")
df = train.iloc[:,V1_loc:V339_loc+1].copy() 
cols = df.columns

#scale values
scaler = preprocessing.MinMaxScaler()
scaled_array = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_array, index=df.index, columns=df.columns)
scaled_df['isFraud'] = train.isFraud 

# compute percentage difference between Fraud/Not-fraud transactions
group_means=scaled_df.groupby('isFraud').mean()
group_means_t = group_means.transpose()
group_means_t['delta_percentage'] = ((group_means_t.iloc[:,1] - group_means_t.iloc[:,0]) / ((group_means_t.iloc[:,1] + group_means_t.iloc[:,0]) / 2)) * 100

In [None]:
# Let's limit the plots to the cases where the symmetric percentage difference is >= 100%
# i.e., the Fraud mean is at least 3x the Not-Fraud mean
plus_100 = group_means_t[group_means_t["delta_percentage"] >= 100]
plus_100_index = plus_100.index.tolist()
len(plus_100)

In [None]:
# This will plot and format 52 barplots, so it may take a while to run (a few minutes)
df['isFraud'] = train.isFraud 
cols = plus_100_index
rows = 13
columns = 4
f, axes = plt.subplots(rows, columns, figsize=(20, 35))
count = 0
for i in range(rows): # rows loop
    for j in range(columns): # cols loop
        mplot = sns.barplot(x="isFraud", y=cols[count], data=df, ax=axes[i,j])
        count += 1 # to loop over col-names
plt.tight_layout()

**Notes:**
- There are so many features with no identity info that it is hard to get a clear insight. It is clear though that there are A LOT of features where Fraud transactions have higher means, which means that these variables are going to be valuable for the model to learn to capture Fraud cases
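
A note on the `delta_percentage` used to pick these features: it divides the difference of the two group means by their midpoint, so it is a symmetric percentage difference rather than a plain percent change. The sketch below restates that formula as a hypothetical helper (`symmetric_pct_diff` is my name for it) to show what the 100% cutoff corresponds to:

```python
def symmetric_pct_diff(fraud_mean, legit_mean):
    """Difference of the two means relative to their midpoint, in percent."""
    midpoint = (fraud_mean + legit_mean) / 2
    return (fraud_mean - legit_mean) / midpoint * 100

doubled = symmetric_pct_diff(2.0, 1.0)   # Fraud mean is 2x the Not-Fraud mean
tripled = symmetric_pct_diff(3.0, 1.0)   # Fraud mean is 3x the Not-Fraud mean
ceiling = symmetric_pct_diff(5.0, 0.0)   # Not-Fraud mean is zero

print(doubled, tripled, ceiling)
```

Under this definition a doubling gives about 66.7%, a tripling gives exactly 100%, and the value tops out at 200% when the Not-Fraud mean is zero, so the `>= 100` filter keeps features whose Fraud mean is at least three times the Not-Fraud mean.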

**This kernel is a bit long, so I'm continuing here with the missing values analysis:**  
https://www.kaggle.com/pabloinsente/ieee-missing-nan-values-analysis-and-imputation