### Sairam
### Machine Learning Nanodegree
### Capstone Project: Finding Fraud Payments



### Exploring Data
The data file, payments.csv, contains payment transactions. It has 11 columns, one of which, isFraud, is the target label. isFraud indicates whether a payment transaction is fraudulent or not.


In [5]:
# Import required libraries
import pandas as pd
import numpy as np
from time import time
from IPython.display import display # Allows the use of display() for DataFrames

# Graphs display library
%matplotlib inline

# Load the payments dataset
start = time() # Get start time
data = pd.read_csv("data/payments.csv")
end = time() # Get end time
    
# Report the data load time
print("Time to load the data file:{:.2f} seconds".format(end-start))

# Display the first 10 records
display(data.head(10))

Time to load the data file:19.59 seconds


| | step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | PAYMENT | 9839.64 | C1231006815 | 170136.0 | 160296.36 | M1979787155 | 0.0 | 0.0 | 0 | 0 |
| 1 | 1 | PAYMENT | 1864.28 | C1666544295 | 21249.0 | 19384.72 | M2044282225 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | TRANSFER | 181.0 | C1305486145 | 181.0 | 0.0 | C553264065 | 0.0 | 0.0 | 1 | 0 |
| 3 | 1 | CASH_OUT | 181.0 | C840083671 | 181.0 | 0.0 | C38997010 | 21182.0 | 0.0 | 1 | 0 |
| 4 | 1 | PAYMENT | 11668.14 | C2048537720 | 41554.0 | 29885.86 | M1230701703 | 0.0 | 0.0 | 0 | 0 |
| 5 | 1 | PAYMENT | 7817.71 | C90045638 | 53860.0 | 46042.29 | M573487274 | 0.0 | 0.0 | 0 | 0 |
| 6 | 1 | PAYMENT | 7107.77 | C154988899 | 183195.0 | 176087.23 | M408069119 | 0.0 | 0.0 | 0 | 0 |
| 7 | 1 | PAYMENT | 7861.64 | C1912850431 | 176087.23 | 168225.59 | M633326333 | 0.0 | 0.0 | 0 | 0 |
| 8 | 1 | PAYMENT | 4024.36 | C1265012928 | 2671.0 | 0.0 | M1176932104 | 0.0 | 0.0 | 0 | 0 |
| 9 | 1 | DEBIT | 5337.77 | C712410124 | 41720.0 | 36382.23 | C195600860 | 41898.0 | 40348.79 | 0 | 0 |


### Implementation: Data Exploration
In this section, we review the dataset and determine how many records are identified as fraud transactions compared to the total count. We compute the following:

- The total number of records, n_records
- The total number of columns, n_columns
- The total number of features, n_features
- The number of fraud transactions, n_fraud_count
- The percentage of fraud transactions, n_fraud_percentage


In [34]:
# File dimensions
print("Data file contains {} rows and {} columns".format(data.shape[0], data.shape[1]))

# Number of records
n_records = len(data)

# Number of columns
n_columns = len(data.columns)

# Number of features: number of columns minus the one target column
n_features = n_columns - 1

# Number of fraud payment records
n_fraud_count = len(data[data['isFraud'] == 1])

# Fraud percentage
n_fraud_percentage = (n_fraud_count/n_records)*100




print("Total Number of Records in the datafile:{}".format(n_records))
print("Total Number of features in the datafile:{}".format(n_features))
print("Total Number of fraud records in the datafile:{}".format(n_fraud_count))
print("Fraud percentage compared to the total number of records:{:.4f}%".format(n_fraud_percentage))

Data file contains 6362620 rows and 11 columns
Total Number of Records in the datafile:6362620
Total Number of features in the datafile:10
Total Number of fraud records in the datafile:8213
Fraud percentage compared to the total number of records:0.1291%
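
The counts above show that fraud records make up only a small fraction of the data. As a minimal sketch of another way to get the same class breakdown, the snippet below uses `value_counts` on a toy column. The `toy` frame and its values are illustrative assumptions, not taken from payments.csv.

```python
import pandas as pd

# Toy stand-in for the isFraud column: 2 fraud rows out of 10 (illustrative values only).
toy = pd.DataFrame({"isFraud": [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]})

# Absolute number of records per class (0 = not fraud, 1 = fraud).
class_counts = toy["isFraud"].value_counts()

# Share of each class in percent.
class_percent = toy["isFraud"].value_counts(normalize=True) * 100

# For a 0/1 column, the mean is the fraction of fraud records.
fraud_rate = toy["isFraud"].mean() * 100

print(class_counts)
print(class_percent)
print("Fraud rate: {:.4f}%".format(fraud_rate))
```

On the real dataset the same calls would be applied to `data["isFraud"]`.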


In [29]:
###Statistical Info
data.info()
display(data.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6362620 entries, 0 to 6362619
Data columns (total 11 columns):
step              int64
type              object
amount            float64
nameOrig          object
oldbalanceOrg     float64
newbalanceOrig    float64
nameDest          object
oldbalanceDest    float64
newbalanceDest    float64
isFraud           int64
isFlaggedFraud    int64
dtypes: float64(5), int64(3), object(3)
memory usage: 534.0+ MB


| | step | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
|---|---|---|---|---|---|---|---|---|
| count | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 | 6362620.0 |
| mean | 243.3972 | 179861.9 | 833883.1 | 855113.7 | 1100702.0 | 1224996.0 | 0.00129082 | 2.514687e-06 |
| std | 142.332 | 603858.2 | 2888243.0 | 2924049.0 | 3399180.0 | 3674129.0 | 0.0359048 | 0.001585775 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 156.0 | 13389.57 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 239.0 | 74871.94 | 14208.0 | 0.0 | 132705.7 | 214661.4 | 0.0 | 0.0 |
| 75% | 335.0 | 208721.5 | 107315.2 | 144258.4 | 943036.7 | 1111909.0 | 0.0 | 0.0 |
| max | 743.0 | 92445520.0 | 59585040.0 | 49585040.0 | 356015900.0 | 356179300.0 | 1.0 | 1.0 |
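
The `data.info()` output reports a memory footprint of over 534 MB, with `type` stored as a generic object column and the small-range integer columns stored as int64. The sketch below shows one possible way to shrink such columns, using a toy frame whose values are illustrative assumptions. It is an optional optimization, not a step taken in this notebook.

```python
import pandas as pd

# Toy frame mimicking three of the columns above (values are illustrative only).
n = 10_000
toy = pd.DataFrame({
    "step": [1 + i % 743 for i in range(n)],
    "type": ["PAYMENT", "TRANSFER", "CASH_OUT", "DEBIT", "CASH_IN"] * (n // 5),
    "isFraud": [0] * n,
})

before = toy.memory_usage(deep=True).sum()

compact = toy.copy()
# A few distinct strings repeated many times: store them once as categories.
compact["type"] = compact["type"].astype("category")
# Small integer ranges fit in narrower integer types than the default.
compact["step"] = pd.to_numeric(compact["step"], downcast="integer")
compact["isFraud"] = pd.to_numeric(compact["isFraud"], downcast="integer")

after = compact.memory_usage(deep=True).sum()

print(compact.dtypes)
print("Memory: {} -> {} bytes".format(before, after))
```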


## Featureset Analysis
- **Step:** Maps a unit of time in the real world. In this case, step 1 represents the first hour of transactions.
- **Type:** Transaction type: CASH_IN, CASH_OUT, DEBIT, PAYMENT, or TRANSFER.
- **Amount:** Transaction amount in local currency.
- **nameOrig:** The customer who initiated the transaction.
- **oldbalanceOrg:** The originator's initial balance before the transaction.
- **newbalanceOrig:** The originator's new balance after the transaction is processed.
- **nameDest:** The customer who is the recipient of the payment.
- **oldbalanceDest:** The initial balance in the recipient account before the transaction. Note that there is no information for recipients whose IDs start with M (merchants).
- **newbalanceDest:** The new balance in the recipient account after the transaction is processed. Note that there is no information for recipients whose IDs start with M (merchants).
- **isFlaggedFraud:** The business model flags a single transaction as an “illegal attempt” when the transfer amount is more than 200,000.
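
The feature descriptions above can be spot-checked against the ten records displayed earlier. The sketch below makes three assumptions that the dataset description does not confirm: every step is exactly one hour, the originator's balance is floored at zero, and the flagging rule applies only to TRANSFER rows. Ten rows cannot validate these rules for the full dataset.

```python
import io

import numpy as np
import pandas as pd

# The ten records shown by data.head(10), with nameOrig and isFraud omitted.
csv_text = """step,type,amount,oldbalanceOrg,newbalanceOrig,nameDest,oldbalanceDest,newbalanceDest,isFlaggedFraud
1,PAYMENT,9839.64,170136.0,160296.36,M1979787155,0.0,0.0,0
1,PAYMENT,1864.28,21249.0,19384.72,M2044282225,0.0,0.0,0
1,TRANSFER,181.0,181.0,0.0,C553264065,0.0,0.0,0
1,CASH_OUT,181.0,181.0,0.0,C38997010,21182.0,0.0,0
1,PAYMENT,11668.14,41554.0,29885.86,M1230701703,0.0,0.0,0
1,PAYMENT,7817.71,53860.0,46042.29,M573487274,0.0,0.0,0
1,PAYMENT,7107.77,183195.0,176087.23,M408069119,0.0,0.0,0
1,PAYMENT,7861.64,176087.23,168225.59,M633326333,0.0,0.0,0
1,PAYMENT,4024.36,2671.0,0.0,M1176932104,0.0,0.0,0
1,DEBIT,5337.77,41720.0,36382.23,C195600860,41898.0,40348.79,0
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Step -> day and hour of day, assuming step 1 is hour 0 of day 1.
sample["day"] = (sample["step"] - 1) // 24 + 1
sample["hour"] = (sample["step"] - 1) % 24

# Merchant recipients (IDs starting with "M") should carry zero balances.
is_merchant = sample["nameDest"].str.startswith("M")
merchant_balances_empty = bool(
    sample.loc[is_merchant, ["oldbalanceDest", "newbalanceDest"]].eq(0).all().all()
)

# Assumed originator rule: new balance = old balance - amount, floored at zero.
expected_new = (sample["oldbalanceOrg"] - sample["amount"]).clip(lower=0)
balance_consistent = np.isclose(expected_new, sample["newbalanceOrig"], atol=0.01)

# Flagging rule as described above: a TRANSFER of more than 200,000.
rule_flag = ((sample["type"] == "TRANSFER") & (sample["amount"] > 200_000)).astype(int)

print("Merchant balances all zero:", merchant_balances_empty)
print("Originator balances consistent:", bool(balance_consistent.all()))
print("Rule matches isFlaggedFraud:", bool((rule_flag == sample["isFlaggedFraud"]).all()))
```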



## Target Column/Label
- **isFraud:** Values are either 0 or 1. A value of 1 indicates that the transaction was created by a fraudulent agent inside the simulator.
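
As a minimal sketch of how the target label can be separated from the features and summarized per transaction type, the snippet below uses the type, amount, and isFraud values of the ten records displayed earlier. The variable names `features`, `target`, and `fraud_by_type` are illustrative, and ten rows are not representative of the full dataset.

```python
import pandas as pd

# type, amount and isFraud for the ten records shown by data.head(10).
sample = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT",
             "PAYMENT", "PAYMENT", "PAYMENT", "PAYMENT", "DEBIT"],
    "amount": [9839.64, 1864.28, 181.0, 181.0, 11668.14,
               7817.71, 7107.77, 7861.64, 4024.36, 5337.77],
    "isFraud": [0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
})

# Separate the target label from the feature columns.
target = sample["isFraud"]
features = sample.drop(columns=["isFraud"])

# Fraud count and fraud share per transaction type.
fraud_by_type = sample.groupby("type")["isFraud"].agg(["sum", "mean"])

print(features.shape, target.shape)
print(fraud_by_type)
```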