In [1]:
#Import Libraries
import numpy as np
import matplotlib.pyplot as mp
import pandas as pd

In [2]:
#Import Dataset
dataset=pd.read_csv("/home/shashank/data_mining/dm-la-1-fraud-detecion-dt-creditcard-data.csv")
dataset=dataset.sample(frac=0.1,random_state=1)
print(dataset.shape)
x=dataset.iloc[:,1:30].values
y=dataset.iloc[:,30].values

(28479, 31)


In [3]:
print(x.shape)

(28479, 29)


The output (28479, 29) is the shape of the NumPy array x, which has 28479 rows and 29 columns.

In more detail, 28479 is the number of observations (rows) in the 10% sample, and 29 is the number of feature columns selected by iloc[:,1:30]. The full DataFrame dataset has 31 columns: column 0 is left out of x, columns 1-29 are the features, and column 30 holds the labels stored in y.
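The column selection can be checked on a small stand-in frame. The sketch below is only illustrative: the frame `toy` and its column names `c0`..`c30` are hypothetical and merely mimic the 31-column layout of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame with 31 columns named c0..c30 (not the real data)
toy = pd.DataFrame(np.arange(4 * 31).reshape(4, 31),
                   columns=[f"c{i}" for i in range(31)])

# iloc uses half-open ranges: 1:30 selects columns 1..29, the stop index is excluded
x_toy = toy.iloc[:, 1:30].values
# a single integer selects exactly one column and yields a 1-D array
y_toy = toy.iloc[:, 30].values

print(x_toy.shape)  # (4, 29)
print(y_toy.shape)  # (4,)
```

The same slicing on the real data is what turns the 31-column DataFrame into a 29-column feature array plus a 1-D label array.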

In [4]:
print(y.shape)

(28479,)


In [5]:
#Missing values
from sklearn.impute import SimpleImputer
imputer=SimpleImputer(missing_values=np.nan,strategy='mean')
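#x already holds only the 29 feature columns, so fit and transform all of them
#(slicing x[:,1:30] here would skip the first feature column)
imputer.fit(x)
x=imputer.transform(x)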

This code is using the SimpleImputer class from the scikit-learn library to handle missing values in the dataset.

First, the SimpleImputer is initialized with two arguments:
- missing_values: the type of value in the dataset that represents a missing value. In this case, it is numpy.nan, which is a special value that represents not-a-number or missing value in NumPy arrays.
- strategy: the strategy to use to replace the missing values. In this case, the 'mean' strategy is used, which replaces the missing values with the mean of the non-missing values in the same column.

Next, the fit() method of the imputer object is called to compute the mean of each column in the dataset, which will be used to replace the missing values.

Then, the transform() method of the imputer object is called to replace the missing values in the dataset with the mean of each column. The transform() method returns a new array with the missing values replaced.

Finally, the transformed data is assigned back to the same variable x to update the original dataset with the imputed values.
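As a rough illustration of mean imputation, the sketch below applies SimpleImputer to a tiny made-up array. The names `toy`, `toy_imputer` and `toy_filled` are hypothetical and are not part of the notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up 3x2 array with one missing value in each column
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit_transform = fit (learn the column means) followed by transform (fill the gaps)
toy_filled = toy_imputer.fit_transform(toy)

# statistics_ holds the per-column means computed from the non-missing values
print(toy_imputer.statistics_)
print(toy_filled)
```

Each missing entry is replaced by the mean of the remaining values in its own column (2.0 for the first column, 15.0 for the second).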

In [6]:
print(x.shape)

(28479, 29)


In [7]:
#Splitting dataset into training and test set
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25, random_state = 0)

This code is splitting the preprocessed dataset `x` and corresponding labels `y` into training and testing sets for model building and evaluation purposes. 

The `train_test_split` function is imported from the `sklearn.model_selection` module, which allows for easy splitting of datasets. 

The function is called here with four arguments:

- `x`: the input dataset
- `y`: the corresponding labels for the input dataset
- `test_size`: the proportion of the dataset to include in the testing set (here 0.25, so a quarter of the rows are held out)
- `random_state`: a seed value for the random number generator used by the function to ensure reproducibility of the split

The function returns four arrays:
- `xtrain`: the training set input data
- `xtest`: the testing set input data
- `ytrain`: the training set labels
- `ytest`: the testing set labels

These sets can be used to train a model on the training set, and then evaluate its performance on the testing set.
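The split can be tried on a small made-up dataset, as sketched below. The arrays `x_toy` and `y_toy` are hypothetical. The second call uses the optional `stratify` argument, which the notebook does not use; it is shown only as one possible way to keep the class ratio the same in both parts when one class is rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 samples, 2 features, 4 of the 20 labels are class 1
x_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 16 + [1] * 4)

# Plain random split, same arguments as in the notebook
a_train, a_test, b_train, b_test = train_test_split(
    x_toy, y_toy, test_size=0.25, random_state=0)
print(a_train.shape, a_test.shape)  # (15, 2) (5, 2)

# Assumption, not in the notebook: stratify=y_toy preserves the class proportions
s_train, s_test, t_train, t_test = train_test_split(
    x_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy)
print(t_train.sum(), t_test.sum())  # 3 1
```

With 25% held out, 5 of the 20 rows land in the test set; with stratification, exactly one of the four class-1 rows is among them.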

In [8]:
print("xtrain.shape : ", xtrain.shape)
print("xtest.shape  : ", xtest.shape)
print("ytrain.shape : ", ytrain.shape)
print("ytest.shape  : ", ytest.shape)

xtrain.shape :  (21359, 29)
xtest.shape  :  (7120, 29)
ytrain.shape :  (21359,)
ytest.shape  :  (7120,)


In [9]:
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
xtrain = sc.fit_transform(xtrain)
xtest = sc.transform(xtest)

This code performs feature scaling on the training and testing data using the StandardScaler from the sklearn.preprocessing module. StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with zero mean and unit variance on the training data and no feature dominates merely because of its numeric range.

First, the StandardScaler object is created. Then fit_transform is called on the training data: it learns the per-feature mean and standard deviation and applies them, and the scaled result is stored in xtrain. Finally, transform is called on the testing data so that it is scaled with the mean and standard deviation learned from the training data rather than its own statistics, which keeps information from the test set out of the preprocessing. The scaled testing data is stored in xtest.
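A minimal sketch of the fit-on-train, transform-on-test pattern, using made-up numbers. The names `train_toy`, `test_toy` and `toy_sc` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
train_toy = np.array([[1.0, 100.0],
                      [2.0, 200.0],
                      [3.0, 300.0]])
test_toy = np.array([[2.0, 400.0]])

toy_sc = StandardScaler()
train_scaled = toy_sc.fit_transform(train_toy)  # learns mean_ and scale_ from the training rows
test_scaled = toy_sc.transform(test_toy)        # reuses the training mean_ and scale_

print(toy_sc.mean_)               # per-feature training means
print(train_scaled.mean(axis=0))  # approximately 0 for each feature
print(train_scaled.std(axis=0))   # approximately 1 for each feature
print(test_scaled)                # test row expressed in training-set units
```

The test row is not centred on zero: its second feature lies well above the training mean, and the scaled value reflects that.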

In [10]:
print("Standardised Training Set : \n", xtrain[0])

Standardised Training Set : 
 [ 0.1755644  -0.71248176 -0.31583312  0.85960078  0.01465336  0.49792678
  0.55263113 -0.06872708 -0.0259695  -0.24648807 -1.23350634  0.25705763
  0.98690252  0.22150697  1.47365492  0.53144905 -0.88433372 -0.4011437
 -0.38480945  1.19318005 -0.07850928 -1.61777284 -0.74765221 -2.18692915
  0.46867959 -1.268875   -0.11169069  0.23662413  1.91620047]


In [11]:
#Decision Tree Classification
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(xtrain, ytrain)

This code is performing Decision Tree Classification using the `DecisionTreeClassifier` class from the `sklearn.tree` module. 

Here's a breakdown of the parameters used:
- `criterion = 'entropy'`: This specifies the criterion to be used for splitting. In this case, it's entropy, which is a measure of the impurity of a node.
- `random_state = 0`: This sets the random seed for reproducibility.

The `fit` method is then used to train the decision tree classifier on the training set (`xtrain` and `ytrain`).
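To make the entropy criterion more concrete, the sketch below computes the entropy of a node and the information gain of one candidate split by hand. The helper `entropy` and the label arrays are hypothetical and written for illustration only; this is not scikit-learn's internal implementation.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Made-up parent node with 4 samples of each class, split into two children
parent = np.array([0, 0, 0, 1, 0, 1, 1, 1])
left, right = parent[:4], parent[4:]   # left: three 0s and one 1, right: one 0 and three 1s

# Information gain = parent entropy minus the size-weighted entropy of the children
weighted_children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
gain = entropy(parent) - weighted_children

print(entropy(parent))  # a 50/50 node has the maximum entropy for two classes
print(gain)
```

A tree grown with `criterion = 'entropy'` prefers, at each node, the split with the largest such gain.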

In [12]:
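#Evaluate the classifier on the test set
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred=classifier.predict(xtest)
#Rows of cm are the true classes, columns are the predicted classes
cm = confusion_matrix(ytest, y_pred)
print(cm)
#accuracy_score gives the same accuracy as a fraction; it is kept here but not printed
acc_fraction = accuracy_score(ytest, y_pred)
#Accuracy (%) = correct predictions on the diagonal / all predictions
Accuracy_Decison = ((cm[0][0] + cm[1][1]) / cm.sum()) *100
print("Accuracy_Decison    : ", Accuracy_Decison)
#Error rate (%) = off-diagonal predictions / all predictions
Error_rate = ((cm[0][1] + cm[1][0]) / cm.sum()) *100
print("Error_rate  : ", Error_rate)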

[[7103    4]
 [   7    6]]
Accuracy_Decison    :  99.84550561797752
Error_rate  :  0.1544943820224719


Decision Tree Accuracy (these figures differ from the In [12] output above and appear to come from a different run):
Accuracy_Decison : 99.87361325656508 (A better approach to follow)
Error_rate : 0.12638674343491083
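Because fraud cases are rare, accuracy alone says little about how many frauds are actually caught. As a rough sketch, the confusion matrix printed by In [12] can be re-entered by hand to derive precision and recall for the fraud class. This assumes class 1 means fraud, consistent with the prediction check in In [15].

```python
import numpy as np

# Confusion matrix copied from the In [12] output: rows = true class, columns = predicted class
cm = np.array([[7103, 4],
               [7, 6]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # share of predicted frauds that really are fraud
recall = tp / (tp + fn)          # share of real frauds that the model catches
baseline = (tn + fp) / cm.sum()  # accuracy of always predicting "not fraud"

print("precision :", precision)
print("recall    :", recall)
print("baseline  :", baseline)
```

On this test split only 6 of the 13 fraudulent transactions are detected, and a model that never predicts fraud would still reach roughly 99.8% accuracy, so precision and recall are the more informative figures here.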

# new input with output

10, 1.44904378114715, -1.17633882535966, 0.913859832832795, -1.37566665499943, -1.97138316545323, -0.62915213889734, -1.4232356010359, 0.048455887908856, -1.72040839292037, 1.62665905834133, 1.1996439495421, -0.671439778462005, -0.513947152539479, -0.095045045399955, 0.230930409124119, 0.031967466786208, 0.253414715863197, 0.854343814324194, -0.221365413645481, -0.387226474431156, -0.009301896524901, 0.313894410791098, 0.027740158017025, 0.500512287104917, 0.25136735874921, -0.129477953726618, 0.042849870938146, 0.016253261937552, 7.8 = This input is not fraud.

In [13]:
# create a new data point
new_data_point = np.array([1.44904378114715, -1.17633882535966, 0.913859832832795, -1.37566665499943, -1.97138316545323, -0.62915213889734, -1.4232356010359, 0.048455887908856, -1.72040839292037, 1.62665905834133, 1.1996439495421, -0.671439778462005, -0.513947152539479, -0.095045045399955, 0.230930409124119, 0.031967466786208, 0.253414715863197, 0.854343814324194, -0.221365413645481, -0.387226474431156, -0.009301896524901, 0.313894410791098, 0.027740158017025, 0.500512287104917, 0.25136735874921, -0.129477953726618, 0.042849870938146, 0.016253261937552, 7.8])

In [14]:
# new_data_point is already a NumPy array; reshape it to one row with 29 columns
new_data_point_2d = new_data_point.reshape(1, -1)
# standardize it with the StandardScaler already fitted on the training set
# (transform only: fit_transform would refit the scaler on this single row)
new_data_point_std = sc.transform(new_data_point_2d)
# predict the class of the standardized data point
prediction = classifier.predict(new_data_point_std)

In [15]:
# print the prediction
if prediction == 0:
    print("This transaction is not fraudulent...")
else:
    print("This transaction is fraud...")

This transaction is not fraudulent...
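When a single new observation is scaled, it matters that the scaler fitted on the training data is reused through transform. The sketch below, with made-up numbers and hypothetical names, shows what happens if a scaler is instead fitted on the single row itself.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data and one new observation
train_toy = np.array([[1.0, 100.0],
                      [2.0, 200.0],
                      [3.0, 300.0]])
new_point = np.array([2.5, 150.0]).reshape(1, -1)

# Reuse the statistics learned from the training data
toy_sc = StandardScaler().fit(train_toy)
reused = toy_sc.transform(new_point)

# Fitting on the single row: its mean is the row itself, so every feature becomes 0
refit = StandardScaler().fit_transform(new_point)

print(reused)  # non-zero values relative to the training mean
print(refit)   # [[0. 0.]]
```

A row of zeros carries no information about the transaction, so the classifier would return the same prediction for any input scaled that way.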
