# 1 Getting the Data


- The main goal is to predict which transactions are fraudulent and which ones are genuine. You will work with training data that spans 8 days and make predictions for test data that spans the following 2 days.

# Recommendations 


- Bear in mind that the dataset is imbalanced. The dataset provided to illustrate the Feedzai use case has a positive-class (fraud) rate of about 10%. Though this is higher than typical fraud rates at Feedzai (which can be 1% or lower), it already lets you explore some strategies adapted to imbalanced datasets.


- Note that the dataset contains time dependencies, so you will have to be careful about how you split your dataset for training and validation of the model (hint, hint: sorting on the timestamp sorts on time).


- You may have high cardinality categoricals. 


- There are categorical values that may exist in the test set, but not in the train set. You’ll need to be clever in how you deal with this. 


- If at any moment it looks like your computer is about to fly, you might find it useful to work with samples. You may also find that heavier operations take really long (or even crash your machine) on such a big dataset, so be smart about how you use your resources.


- Remember: “weeks of programming can save hours of planning”, so work with your team to plan and distribute work before diving in! 


- Focus on feature engineering and data understanding/exploration: think about which types of features you can build to characterize a user's past behavior.


- Make sure that you get to and submit a baseline ASAP! Then work on improving it.
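
The hint about time dependencies can be sketched as follows. This is a minimal illustration on made-up toy data, not the split used for the competition: the helper name `time_ordered_split` and the 80/20 `train_fraction` are assumptions chosen for the example.

```python
import pandas as pd

def time_ordered_split(frame, time_col="timestamp", train_fraction=0.8):
    """Split so that no validation row is earlier than any training row."""
    ordered = frame.sort_values(time_col)
    cutoff = int(len(ordered) * train_fraction)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy data standing in for the real transactions (values are made up).
toy = pd.DataFrame({
    "timestamp": [50, 10, 40, 20, 30, 60, 80, 70, 100, 90],
    "isfraud":   [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})
train_part, valid_part = time_ordered_split(toy)
print(train_part["timestamp"].max(), valid_part["timestamp"].min())
```

Splitting on position after sorting, rather than shuffling, mimics the real setting where the test data comes from the days after the training data.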

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import category_encoders as ce
from sklearn.pipeline import make_pipeline
from sklearn.base import TransformerMixin
%matplotlib inline

In [2]:
df = pd.read_csv('train.csv')

# Sorting and Setting column ID as Index

In [3]:
df = df.sort_values('timestamp')

In [4]:
df = df.set_index('id')

In [5]:
# The steps above do not need to be included in the pipeline
df.head()

| id | timestamp | product_id | product_department | product_category | card_id | user_id | C15 | C16 | C17 | C18 | C19 | C20 | C21 | amount | isfraud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96186 | 1413849604595 | c4e18dd6 | 85f751fd | 50e219e0 | 92e72531 | a99f214a | 320 | 50 | 2480 | 3 | 297 | 100111 | 61 | 191.77 | 0 |
| 114679 | 1413849611766 | c4e18dd6 | 85f751fd | 50e219e0 | e71aba61 | a99f214a | 320 | 50 | 1722 | 0 | 35 | -1 | 79 | 191.77 | 1 |
| 60688 | 1413849613367 | dd7026ee | 15d93b0b | 50e219e0 | ecad2386 | 5c7c1b02 | 320 | 50 | 2495 | 2 | 167 | 100173 | 23 | 227.63 | 0 |
| 45825 | 1413849619068 | c4e18dd6 | 85f751fd | 50e219e0 | 5e3f096f | ba2d210a | 320 | 50 | 2161 | 0 | 35 | 100051 | 157 | 191.77 | 0 |
| 87991 | 1413849625209 | c4e18dd6 | 85f751fd | 50e219e0 | 39947756 | 0ddad6d9 | 320 | 50 | 1955 | 3 | 163 | 100192 | 71 | 191.77 | 0 |


# Variables Description

- id - an anonymous id unique to a given transaction **(this column will represent the index)**


- timestamp - timestamp in unix ms of the transaction **(this variable is likely to be important)**


- product_id - product id of the product present in the transaction **(Need to encode this variable)**


- product_department - product department of the product present in the transaction 


- product_category - product category of the product present in the transaction


- card_id - card id of the card used in the transaction


- user_id - user id of the user that did the transaction


- {C15, C16, C17, C18, C19, C20, C21} - anonymized categorical variables that characterize the transaction


- amount - amount of the transaction


- isfraud - binary variable that marks a transaction as fraud or not
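
Since `timestamp` is stored in unix milliseconds, it can be converted to a datetime before deriving time features. The sketch below uses two timestamps taken from the sample rows above; interpreting them in UTC and the derived column names (`event_time`, `hour`, `day_of_week`) are assumptions made for illustration.

```python
import pandas as pd

# Two timestamps copied from the sample rows shown earlier.
toy = pd.DataFrame({"timestamp": [1413849604595, 1413849611766]})

# unit="ms" tells pandas the integers are milliseconds since the epoch.
toy["event_time"] = pd.to_datetime(toy["timestamp"], unit="ms", utc=True)

# Simple calendar features; Monday is 0 in dayofweek.
toy["hour"] = toy["event_time"].dt.hour
toy["day_of_week"] = toy["event_time"].dt.dayofweek
print(toy)
```

If the transactions were recorded in another time zone, the hour feature would be shifted, so the UTC assumption is worth checking against any documentation that comes with the data.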


## Undersampling Unbalanced Dataset with TimeSeries

In [14]:
# This result is in percentages
round((df['isfraud'].value_counts()/len(df))*100,2)

0    89.43
1    10.57
Name: isfraud, dtype: float64
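
One way to undersample the majority class while keeping the time order is sketched below. It is an illustration on toy data, not the notebook's final approach: the helper `undersample_majority` and its `negatives_per_positive` parameter are hypothetical names, and the idea is to apply it to the training window only so that validation keeps the natural fraud rate.

```python
import pandas as pd

def undersample_majority(frame, target="isfraud", time_col="timestamp",
                         negatives_per_positive=1.0, seed=42):
    """Keep every fraud row plus a random sample of the genuine rows."""
    positives = frame[frame[target] == 1]
    negatives = frame[frame[target] == 0]
    n_keep = min(len(negatives), int(len(positives) * negatives_per_positive))
    kept = negatives.sample(n=n_keep, random_state=seed)
    # Restore chronological order after concatenating the two groups.
    return pd.concat([positives, kept]).sort_values(time_col)

# Toy training window with a 10% fraud rate (values are made up).
toy_train = pd.DataFrame({
    "timestamp": range(20),
    "isfraud": [0] * 9 + [1] + [0] * 9 + [1],
})
balanced = undersample_majority(toy_train)
print(balanced["isfraud"].value_counts())
```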

# 2 Data Analysis and Preparation

In [None]:
df.info()

1. We have a Dataframe with 522 412 observations and 16 columns.

2. The target column is **"isfraud"**.

3. We have **12** categorical variables.
    - The variables with the prefix C are stored as int64, but they actually represent categorical variables.
    
4. We have **1** ID variable

5. We have **1** datetime variable, which is stored as int64

6. **Amount** is the only true numerical variable that we have
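
Because the C-prefixed columns are integers that encode categories, one option is to cast them explicitly so that numeric summaries do not treat them as quantities. This is a small sketch on toy values taken from the sample rows; only two of the seven C columns are shown, and the list `categorical_cols` is an assumption for the example.

```python
import pandas as pd

toy = pd.DataFrame({
    "C15": [320, 320, 320],
    "C18": [3, 0, 2],
    "amount": [191.77, 191.77, 227.63],
})

# Cast the integer-coded columns to the pandas "category" dtype.
categorical_cols = ["C15", "C18"]
toy = toy.astype({col: "category" for col in categorical_cols})

# Only the genuinely numerical column is left under the "number" selector.
numeric_cols = list(toy.select_dtypes("number").columns)
print(toy.dtypes)
print(numeric_cols)
```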

### 2.1 Data Analysis

1. Check for missing values
2. Check for outliers 
3. Check for correlations
4. Check Categorical Variables
5. Check fraud rate per variable
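
The last item, the fraud rate per variable, can be sketched with a group-by. The helper `fraud_rate_by` is a hypothetical name and the toy values are made up; reporting the transaction count next to the rate helps avoid over-reading rates that come from very few rows.

```python
import pandas as pd

def fraud_rate_by(frame, column, target="isfraud"):
    """Transaction count and fraud rate for each value of a categorical column."""
    summary = frame.groupby(column)[target].agg(
        n_transactions="count",
        fraud_rate="mean",
    )
    return summary.sort_values("n_transactions", ascending=False)

toy = pd.DataFrame({
    "C18":     [3, 0, 2, 0, 3, 0],
    "isfraud": [0, 1, 0, 0, 0, 1],
})
rates = fraud_rate_by(toy, "C18")
print(rates)
```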

In [None]:
# Split the data into a feature DataFrame X and a target Series y
X = df.drop(['isfraud'],axis=1)
y = df['isfraud']

In [None]:
def number_of_uniques(dataframe):
    dictionary = {column : dataframe[column].nunique() for column in dataframe.columns}
    return dictionary

In [None]:
uniques = number_of_uniques(X)
uniques

#### 2.1.1 Missing Values 

In [None]:
df.info()

- There are no missing values.
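
A more direct check than reading `df.info()` is to count nulls per column. The sketch below also counts a possible sentinel value: the sample rows shown earlier contain `-1` in `C20`, and treating that as an "unknown" marker is an assumption, not something the data description states.

```python
import pandas as pd

# C20 and amount values copied from the first four sample rows.
toy = pd.DataFrame({
    "C20": [100111, -1, 100173, 100051],
    "amount": [191.77, 191.77, 227.63, 191.77],
})

# Explicit nulls per column.
null_counts = toy.isna().sum()

# Share of rows carrying the assumed sentinel value.
sentinel_share = (toy["C20"] == -1).mean()
print(null_counts)
print(sentinel_share)
```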

#### 2.1.2 Outliers 
- We only have one numeric variable, **amount**.

- We should not remove observations that are flagged as outliers in this type of dataset, as they may be potential fraud cases.

In [None]:
df['amount'].describe()

In [None]:
df.groupby('isfraud')['amount'].hist()
plt.show()

- In general, the actual fraudsters tend to follow the same spending distribution as the other individuals in this sample.
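
Because the two classes differ a lot in size, overlaid histograms can be hard to compare by eye. A per-class numerical summary is one complementary check; the sketch below uses made-up amounts purely to show the call.

```python
import pandas as pd

toy = pd.DataFrame({
    "amount":  [10.0, 20.0, 30.0, 40.0, 100.0, 200.0],
    "isfraud": [0, 0, 0, 0, 1, 1],
})

# One row per class with count, mean, std and quartiles of the amount.
summary = toy.groupby("isfraud")["amount"].describe()
print(summary)
```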

#### 2.1.3 Correlation 

- Since we only have one numerical variable, there are no correlations between numerical variables to compute.

#### 2.1.4 Checking Categorical variables

In [None]:
# product_id 
# df.groupby('isfraud')['product_id'].hist()
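
For a high-cardinality column such as `product_id`, a histogram is not very informative. A sketch of an alternative check is shown below: it measures how many rows in a later time window carry a category that never appeared in the earlier window, and maps those to a placeholder. The helper names, the placeholder string, and the ids `ffffffff` and `eeeeeeee` are all made up for illustration.

```python
import pandas as pd

def unseen_share(train_values, later_values):
    """Fraction of later rows whose category never appears in the training rows."""
    known = set(train_values.unique())
    return (~later_values.isin(known)).mean()

def collapse_unseen(values, known, placeholder="__unseen__"):
    """Replace every category outside `known` with a single placeholder."""
    return values.where(values.isin(known), placeholder)

train_ids = pd.Series(["c4e18dd6", "dd7026ee", "c4e18dd6"])
later_ids = pd.Series(["c4e18dd6", "ffffffff", "dd7026ee", "eeeeeeee"])

share = unseen_share(train_ids, later_ids)
collapsed = collapse_unseen(later_ids, set(train_ids.unique()))
print(share)
print(collapsed.tolist())
```

Running the same check between the training and validation windows gives a rough idea of how often unseen values will show up in the test set.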

# Creating a Pipeline 