### Installing the Snap ML library

##### We are going to use the Python API offered by Snap Machine Learning (Snap ML), a high-performance library from IBM. It provides highly efficient CPU/GPU implementations of linear models and tree-based models. Here we are going to use Decision Tree and Support Vector Machine models.

### About the data set

##### For this detection task I am going to use a real data set that contains information about credit card transactions made by European cardholders in September 2013. Each row represents one credit card transaction, and the Class column is the target we have to predict. The data set is highly unbalanced: the two values of the target variable are far from equally distributed.

In [2]:
%pip install snapml

Note: you may need to restart the kernel to use updated packages.


### Import the needed libraries

In [3]:
import pandas as pd # Load the file and work with DataFrames
import numpy as np # Array-level calculations
import matplotlib.pyplot as plt # Visualize the results
from sklearn.model_selection import train_test_split as tts # Split the data into training and test sets
from sklearn.preprocessing import StandardScaler as sds, normalize as norm # Standardize and normalize the raw features
from sklearn.utils.class_weight import compute_class_weight as ccw # Compute class weights to handle the imbalanced data
from sklearn.metrics import roc_auc_score as ras # ROC AUC, a popular metric for evaluating classification models,
# particularly in binary classification tasks.

### Read the dataset using pandas library

In [4]:
cc_dataset = pd.read_csv("creditcard.csv") 
cc_dataset.head() # Show the first 5 rows of the dataset
cc_dataset.shape # Check the number of rows and columns (dimensions)

(284807, 31)

#### To simulate big data, we make the data set 10 times bigger than the original, because in practice a financial institution may have access to a much larger data set.

In [5]:
n_replication = 10

big_cc_dataset = pd.DataFrame(np.repeat(cc_dataset.values,n_replication,axis=0),columns=cc_dataset.columns)
# pd.DataFrame converts the repeated array back into a DataFrame
# np.repeat repeats each row of the data n_replication times along axis=0 and returns an array
# This data extension is known as inflation
big_cc_dataset.head(11)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
2,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
3,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
4,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
5,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
6,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
7,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
8,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
9,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
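
To see what the inflation step does, here is a minimal sketch on a toy DataFrame. The `toy` frame, its two rows, and `n_replication = 3` are made up for illustration and are not part of the credit card data. `np.repeat` with `axis=0` repeats each row consecutively, which is why the first rows of the output above are identical.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame standing in for cc_dataset
toy = pd.DataFrame({"Amount": [149.62, 2.69], "Class": [0.0, 1.0]})
n_replication = 3

# Repeat every row n_replication times along axis=0, then rebuild the DataFrame
big_toy = pd.DataFrame(
    np.repeat(toy.values, n_replication, axis=0),
    columns=toy.columns,
)

print(big_toy.shape)  # the row count grows by a factor of n_replication
print(big_toy)
```

Because every row is copied the same number of times, the class proportions stay exactly the same as in the original data.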


#### Analyse the distribution of the target variable

In [9]:
Target_variable = big_cc_dataset.Class.unique()
# .unique() returns the distinct values of the column
# The target variable only takes the values 0 and 1, so this is a binary classification problem

sizes = big_cc_dataset.Class.value_counts()
# Gives the number of transactions per class {1 (fraud) and 0 (legitimate)}

sizes

Class
0.0    2843150
1.0       4920
Name: count, dtype: int64
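
The counts above can be turned into class shares and class weights. The sketch below is only an illustration: it rebuilds `sizes` from the printed counts instead of reading the file, and it applies by hand the "balanced" heuristic `n_samples / (n_classes * class_count)` that scikit-learn documents for `class_weight="balanced"`. Whether these exact weights are used later in the notebook is an assumption.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
sizes = pd.Series({0.0: 2843150, 1.0: 4920}, name="count")

# Share of each class in the inflated data set
shares = sizes / sizes.sum()
print(f"Fraud share: {shares.loc[1.0]:.4%}")

# "Balanced" class weights: n_samples / (n_classes * class_count)
balanced_weights = sizes.sum() / (len(sizes) * sizes)
print(balanced_weights)
```

The rare fraud class receives a much larger weight than the legitimate class, so that errors on fraud cases count more during training.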

### Data Preprocessing

#### We rescale every feature to zero mean and unit standard deviation, which is known as standardizing, and prepare the dataset for training

In [18]:

big_cc_dataset.iloc[:,1:30] = sds().fit_transform(big_cc_dataset.iloc[:,1:30])
# Standardize columns 1 to 29 (V1..V28 and Amount) to mean 0 and std 1;
# Time (column 0) and the target Class (column 30) are left out.
# StandardScaler() from sklearn is fitted to these features and then transforms them

big_cc_dataset_matrix = big_cc_dataset.values
# A common approach is to convert the data set into a NumPy array before feeding it
# to a machine learning model, because we are going to work with the sklearn library

X=big_cc_dataset_matrix[:,1:30]
Y=big_cc_dataset_matrix[:,30]
# Select X (feature set) and Y (target)

X=norm(X,norm="l1")
# Normalize the feature set X with L1 normalization:
# each sample (row) is scaled so that the sum of the absolute values of its features is 1

print("X_shape:",X.shape,"Y_shape:",Y.shape)
# Print the shape of both sets


X_shape: (2848070, 29) Y_shape: (2848070,)
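
The two preprocessing steps can be checked on a small example. This is a minimal sketch with plain NumPy on a made-up matrix `X_toy`; it assumes the default behaviour of `StandardScaler` (per-column mean 0, population standard deviation 1) and of `normalize(..., norm="l1")` (per-row scaling), and reproduces both by hand.

```python
import numpy as np

# Hypothetical 3x2 feature matrix (not taken from the credit card data)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [6.0, 60.0]])

# Step 1: standardize each column to mean 0 and (population) std 1
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Step 2: L1-normalize each row so the absolute values in a row sum to 1
X_l1 = X_std / np.abs(X_std).sum(axis=1, keepdims=True)

print(X_std.mean(axis=0), X_std.std(axis=0))  # per-column mean and std
print(np.abs(X_l1).sum(axis=1))               # per-row L1 norm
```

Note that standardization works column by column, while L1 normalization works row by row, so after the second step the columns no longer have exactly zero mean and unit variance.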


### Dataset Train/Test split
#### The preprocessed dataset is now ready for building a classification model. Before that, we split it into two subsets: a training set (used to train the model) and a test set (used to evaluate the quality of the model).

In [22]:
X_train,X_test,Y_train,Y_test = tts(X,Y,test_size=0.3,random_state=42,stratify=Y)
# I am going to use the train/test split function from the sklearn library
# test_size=0.3 holds out 30% of the data for testing; the rest is used for training
# random_state sets the random seed for reproducibility, so the split is the same on every run
# stratify keeps the class distribution the same in both the training and the test set
# It is useful when we deal with a classification problem on an unbalanced dataset
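
The effect of `stratify` can be seen on a small synthetic example. This is a sketch under stated assumptions: the labels `y_toy` (900 legitimate, 100 fraud) and the dummy feature `X_toy` are invented for illustration; only `train_test_split` and its parameters come from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical unbalanced labels: 900 legitimate (0) and 100 fraud (1)
y_toy = np.array([0] * 900 + [1] * 100)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # dummy single feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# With stratify, the fraud share is (nearly) the same in both subsets
print("train size:", len(y_tr), "fraud share:", y_tr.mean())
print("test size:", len(y_te), "fraud share:", y_te.mean())
```

Without `stratify`, a random split of such rare positives could leave the test set with noticeably more or fewer fraud cases than the training set.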

