# Credit Card Fraud Detection

# Predict whether the Credit Card Transaction is Fraud or not?

# Context :
It is important that credit card companies are able to recognize fraudulent credit card transactions
so that customers are not charged for items that they did not purchase.

# The Problem :
The challenge is to recognize fraudulent credit card transactions so that the customers of credit
card companies are not charged for items that they did not purchase.

# **Content**

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. 'Amount' is the transaction amount, which can be used for example-dependent cost-sensitive learning. 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC). Confusion matrix accuracy is not meaningful for unbalanced classification.
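
To make the AUPRC recommendation concrete, here is a minimal plain-Python sketch of average precision, one common way to summarize the precision-recall curve. The labels and scores are made-up toy values, and `average_precision` is a hypothetical helper written for illustration, not a library function.

```python
def average_precision(y_true, scores):
    """Mean of the precision values measured at each rank where a fraud is found.

    Assumes binary labels (1 = fraud), at least one fraud, and no tied scores.
    """
    # Rank transactions from most to least suspicious.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(y_true)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` items
    return precision_sum / total_pos


# Toy data (assumed for illustration): 2 frauds among 8 transactions.
y_true = [0, 0, 1, 0, 1, 0, 0, 0]
scores = [0.10, 0.40, 0.90, 0.20, 0.35, 0.05, 0.30, 0.15]

ap = average_precision(y_true, scores)

# Accuracy of a useless model that always predicts "not fraud".
accuracy_all_negative = y_true.count(0) / len(y_true)

print("Average precision:", round(ap, 3))
print("Accuracy of always predicting 0:", accuracy_all_negative)
```

The always-negative model reaches 75% accuracy on this toy data without finding a single fraud, which is why accuracy alone is misleading on unbalanced data.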



**Main challenges involved in credit card fraud detection are:**
* An enormous volume of data is processed every day, and the model must be fast enough to respond to the scam in time.
* Imbalanced data, i.e. most of the transactions (99.8%) are not fraudulent, which makes the fraudulent ones hard to detect.
* Data availability, as the data is mostly private.
* Misclassified data can be another major issue, as not every fraudulent transaction is caught and reported.
* Adaptive techniques used against the model by the scammers.

**How to tackle these challenges?**
* The model used must be simple and fast enough to detect the anomaly and classify it as a fraudulent transaction as quickly as possible.
* Imbalance can be dealt with using resampling methods, which are discussed later in this notebook.
* To protect the privacy of the user, the dimensionality of the data can be reduced.
* A more trustworthy source that double-checks the data must be used, at least for training the model.
* We can keep the model simple and interpretable so that, when the scammer adapts to it, a new model can be up and running with just a few tweaks.

# **Data** **Analysis**

**PANDAS**:

Pandas provides high-performance, easy-to-use data structures and data analysis tools for manipulating numeric data and time series. It is built on the NumPy library and written in Python, Cython, and C. With pandas, we can import data from various sources such as CSV, JSON, SQL, and Microsoft Excel.

**NUMPY**:

NumPy is the fundamental Python library for scientific computing. It provides high-performance multidimensional arrays and tools to work with them. A NumPy array is a grid of values of the same type, indexed by a tuple of non-negative integers. NumPy arrays are fast, easy to understand, and let users perform calculations across whole arrays.


# **DATA** **VISUALIZATION**:

Data visualization is the graphic representation of data. It condenses a huge dataset into a few graphs and thus aids data analysis and prediction.

**MATPLOTLIB**:

Matplotlib is a Python library used for plotting graphs, often together with libraries like NumPy and pandas. It is a powerful tool for visualizing data in Python, used for drawing statistical inferences and plotting 2D graphs of arrays.

**SEABORN**:
Seaborn is also a Python library used for plotting graphs, with the help of Matplotlib, pandas, and NumPy. It is built on top of Matplotlib and provides a higher-level interface to it. It helps in visualizing univariate and bivariate data.

# Importing the required Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# Reading the Data Set

In [None]:
df = pd.read_csv('../input/credit-card-fraud-detection-data/Credit_Card_Fraud_Detection.csv')

# Examining the Data

In [None]:
df.head(3)

In [None]:
df.tail(3)

# EDA

# Exploratory Data Analysis



* **head()** Understand your data using the head() function to look at the first few rows.
* **shape** Review the dimensions of your data with the shape property.
* **info()** Get a summary of the data: column dtypes, non-null counts and memory usage.
* **dtypes** Look at the data types for each attribute with the dtypes property.
* **describe()** Review the distribution of your data with the describe() function.
* **Correlation** Calculate pairwise correlation between your variables using the corr() function.




In [None]:
df.shape

# info()

This method prints information about a DataFrame including the index dtype and column dtypes, non-null values and memory usage

In [None]:
df.info()

### Interpretation:
* From the info() output, all the columns are of float type except the Class column.
* No null values are present, and the memory usage is 31.4 MB.

In [None]:
print("There are {} rows and {} columns in the data set".format(df.shape[0], df.shape[1]))

# Describe():

In [None]:
df.describe()

# Outlier Detection

In [None]:
df.select_dtypes("number").head()

In [None]:
# Visualize the outliers using Box Plot

plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (15, 4)

plt.subplot(1, 2, 1)
# Pass the column explicitly as x= so the call does not rely on positional arguments
sns.boxplot(x=df['Amount'], color = "blue")

plt.subplot(1, 2, 2)
sns.boxplot(x=df['Class'], color = "green")

plt.suptitle('Outliers Present in the Data')
plt.show()

In [None]:
# Removing outliers from the Amount column

# Shape before removing outliers
print("Before Removing Outliers ", df.shape)

# Keep only the rows whose Amount is below 16000
df = df[df["Amount"] < 16000]

# Shape after removing outliers
print("After Removing Outliers ", df.shape)

# **Handling Missing Values**
There are broadly two ways to treat missing values:

* 1. Delete --> Delete the rows (or columns) that contain missing values.
* 2. Impute -->
     * Imputing by a simple statistic: replace the missing values with another value such as the MEAN, MEDIAN or MODE (in pandas this is typically done with fillna).
     * Predictive techniques: use statistical models such as K-NN, SVM etc. to predict and replace the missing values.
* Otherwise, deletion is often safer and recommended. You may lose data, but you will not make false predictions.
* Caution: always keep a backup of the original data if you are deleting missing values.
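
As a rough illustration of imputing by a simple statistic, the sketch below fills missing entries (represented here as `None`) with the mean or median of the observed values. The `impute` helper and the toy amounts are assumptions made for this example; they are not part of the dataset or of pandas.

```python
from statistics import mean, median


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Toy column with two missing entries and one very large value.
amounts = [10.0, None, 30.0, 1000.0, None]

mean_filled = impute(amounts, "mean")
median_filled = impute(amounts, "median")

print("Mean imputation:  ", mean_filled)
print("Median imputation:", median_filled)
```

With a skewed column such as this one, the median is much less affected by the single large value than the mean, so the two strategies can fill in very different numbers.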

In [None]:
df.isnull().sum()

In [None]:
!pip install missingno

In [None]:
# plot missing values as a bar graph
import missingno as msno
msno.bar(df)
plt.show()

There are no missing values in this dataset.

In [None]:
df.columns

In [None]:
# column names stored in a list
lst=['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class']

# **Histogram**

A histogram is a graphical representation of the distribution of data given by the user.
Its appearance is similar to Bar-Graph except it is continuous.
The towers or bars of a histogram are called bins.
The height of each bin shows how many values from that data fall into that range.
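
The binning described above can be sketched in a few lines of plain Python. This is only an illustration with made-up values and a hypothetical `histogram_counts` helper; plotting libraries do the same bookkeeping internally, with more options.

```python
def histogram_counts(values, bins):
    """Count how many values fall into each of `bins` equal-width bins.

    Assumes max(values) > min(values).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # width of each bin
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of opening a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return width, counts


width, counts = histogram_counts([1, 2, 2, 3, 5, 8, 9, 10, 11], bins=5)
print("Bin width:", width)
print("Counts per bin:", counts)
```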

### **Skewed**

* These distributions are sometimes called asymmetric or asymmetrical distributions as they don’t show any kind of symmetry.
* Symmetry means that one half of the distribution is a mirror image of the other half.
* For example, the normal distribution is a symmetric distribution with no skew. The tails are exactly the same.

### **Normal** **Distribution**
* A normal distribution is sometimes called the bell curve.
* The curve is symmetric around its center (i.e. around the mean, μ): half of the values fall to the left of the mean and half fall to the right.
* The mean, mode and median are all equal.
* The total area under the curve is 1.
* The standard normal model is a normal distribution with a mean of 0 and a standard deviation of 1.

### **Left side skewed**

* A left-skewed distribution has a long left tail.
* Left-skewed distributions are also called negatively-skewed distributions.
* That’s because there is a long tail in the negative direction on the number line.
* The mean is also to the left of the peak.

### **Right side skewed**

* A right-skewed distribution has a long right tail.
* Right-skewed distributions are also called positive-skew distributions.
* That’s because there is a long tail in the positive direction on the number line.
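
The direction of the skew can also be checked numerically. The sketch below computes the population skewness (the mean of the cubed z-scores) on small made-up lists; the `skewness` helper is written for this illustration and assumes the values are not all identical.

```python
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


right_tailed = [1, 1, 2, 2, 3, 20]          # long tail towards large values
left_tailed = [-v for v in right_tailed]    # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5]

print("Right-tailed:", skewness(right_tailed))  # positive value
print("Left-tailed: ", skewness(left_tailed))   # negative value
print("Symmetric:   ", skewness(symmetric))     # zero
```

A positive result indicates a right-skewed (positively skewed) distribution, a negative result a left-skewed one, and a value near zero a roughly symmetric one.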

In [None]:
for i in lst[1:]:  # iterate over every column except 'Time'
    # Draw a histogram of the column with 50 equal-width bins.
    # Width of each bin = (max value of data - min value of data) / total number of bins
    df[i].hist(bins=50,figsize=(10,6))

    # Use a logarithmic y-axis so that sparsely populated bins stay visible
    plt.yscale('log')
    plt.title(i)

    plt.show()

### **Interpretation**:

* V1 ---> Left-skewed: the tail extends to the left, the mean lies to the left of the peak, and very few outliers occur.
* V2 ---> Looks like a bell curve, i.e. approximately normal, with a slight left skew; very few outliers occur.
* V3 ---> Left-skewed: the tail extends to the left, the mean lies to the left of the peak, and very few outliers occur.
* V4 ---> Looks like a bell curve, i.e. approximately normal, with a slight right skew; very few outliers occur, in 2 bins.
* V5 ---> Looks roughly symmetric, with one outlying bin far away from the rest of the data.
* V6 ---> Looks roughly symmetric, with one outlying bin far away from the rest of the data.
* V7 ---> Looks roughly symmetric, with one outlying bin far away from the rest of the data.
* V8 ---> Looks like a bell curve, i.e. approximately normal, with a slight right skew; very few outliers occur, in 2 bins.
* V9 ---> Looks like a bell curve, i.e. approximately normal, with a slight left skew; very few outliers occur.
* V10 ---> Looks roughly symmetric, with a small amount of data outside the main body.
* V11 ---> A symmetric, bell-shaped distribution throughout.
* V12 ---> Looks like a bell curve, i.e. approximately normal, with a slight right skew; very few outliers occur, in 1 bin.
* V13 ---> A symmetric, bell-shaped distribution throughout.
* V14 ---> Looks like a bell curve, i.e. approximately normal, with a slight right skew; very few outliers occur, on the positive as well as the negative side.
* V15 ---> Looks roughly symmetric, with one outlying bin far away from the rest of the data.
* V16 ---> Looks roughly symmetric, with one outlying bin far away from the rest of the data.
* V17 ---> Looks like a bell curve, i.e. approximately normal, with a slight left skew; a very small amount of data lies on the left, but it forms a group.
* V18 ---> Looks like a bell curve, i.e. approximately normal, with a slight right skew; a very small amount of data lies on the right, but it forms a group.
* V19 ---> A symmetric, bell-shaped distribution throughout.
* V20 ---> A symmetric, bell-shaped distribution throughout.
* V21 ---> A symmetric, bell-shaped distribution throughout.
* V22 ---> A symmetric, bell-shaped distribution throughout.
* V23 ---> A symmetric, bell-shaped distribution throughout.
* V24 ---> Looks like a bell curve, i.e. approximately normal, with a slight right skew; very few outliers occur, on the positive side.
* V25 ---> Looks roughly symmetric, with a very small amount of data far from the rest.
* V26 ---> Looks roughly symmetric, with a very small amount of data far from the rest.
* V27 ---> Looks roughly symmetric, with a very small amount of data far from the rest.
* V28 ---> Looks roughly symmetric, with a very small amount of data far from the rest.
* Amount ---> Right-skewed: most data points sit at the low (left) end, the tail extends to the right, and very few outliers occur.
* Class ---> The vast majority of transactions are not fraud (0); only a few hundred are fraud (1).

In [None]:
df['Class'].value_counts()   

In [None]:
# df.drop(columns=['Time','Class']) 

This shows that the data is highly unbalanced.

# **Correlation** :

In [None]:
df.corr()

In [None]:
cor=df.corr()
plt.figure(figsize=(16,10))
sns.heatmap(cor)

# Heat Map

**HeatMap** - A heatmap is a graphical representation of data in which data values are represented as colors. That is, it uses color to communicate a value to the reader. This is a great tool for guiding the audience towards the areas that matter most when you have a large volume of data.

* Each cell shows the relationship between two columns or variables.
* A correlation of 0 means the variables are not (linearly) correlated.
* A correlation of 1 means the variables are perfectly correlated.
* A correlation between 0 and 0.45 is a small positive correlation.
* A correlation between 0.5 and 0.9 is a large positive correlation.
* A correlation between 0 and -0.45 is a small negative correlation.
* A correlation between -0.5 and -0.9 is a large negative correlation.
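
For reference, the value in each heatmap cell is a Pearson correlation coefficient. The sketch below computes it by hand on small made-up lists; the `pearson` helper is an illustration and assumes neither input is constant.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations (numerator of the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
up = [2, 4, 6, 8, 10]      # rises with x: perfect positive correlation
down = [10, 8, 6, 4, 2]    # falls as x rises: perfect negative correlation
flat = [1, -1, 0, -1, 1]   # no linear relationship with x

print("x vs up:  ", pearson(x, up))
print("x vs down:", pearson(x, down))
print("x vs flat:", pearson(x, flat))
```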

In [None]:
plt.figure(figsize=(20,15))
#plotting the figure size based on width and height

sns.heatmap(df.corr(),cmap='PiYG',annot=True,linewidths=1,fmt='0.2f')

**Interpretation**:

* V1, V3, V5, V8, V9, V10, V12, V13, V15, V16, V19, V22, V23, V25, V26 have a small negative correlation with Amount.
* V2 has a large negative correlation with Amount.
* V11 is not correlated with Amount, i.e. 0.0.
* V4, V6, V7, V14, V17, V18, V20, V21, V27, V28 have a small positive correlation with Amount.

# **countplot**():

The countplot() method shows the count of observations in each categorical bin using bars.
### Counter
Counter is a container that stores element counts in a dictionary format, where each element is a key and its value corresponds to its count.

In [None]:
from collections import Counter

sns.countplot(x='Class', data=df)
counter = Counter(df['Class'])
print(counter)

There are 284315 transactions that are not fraud and the remaining 492 are fraud.

Separate the dataset into legitimate and fraudulent transactions

In [None]:
# separating the data for analysis
legit = df[df.Class == 0]
fraud = df[df.Class == 1]

In [None]:
print(legit.shape)
print(fraud.shape)

In [None]:
legit.Amount.describe()

In [None]:
# statistical measures of the data
legit.describe()

In [None]:
fraud.Amount.describe()

In [None]:
# compare the values for both transactions
df.groupby('Class').mean()

Under-Sampling

Build a sample dataset containing similar distribution of normal transactions and Fraudulent Transactions

Number of Fraudulent Transactions --> 492

In [None]:
legit_sample = legit.sample(n=492)

Concatenating two DataFrames

In [None]:
new_dataset = pd.concat([legit_sample, fraud], axis=0)
new_dataset.head()

In [None]:
new_dataset['Class'].value_counts()

In [None]:
new_dataset.groupby('Class').mean()

Splitting the data into Features & Targets

In [None]:
x = new_dataset.drop(columns='Class')
y = new_dataset['Class']
print(x)

In [None]:
print(y)

# Split the data into Training data & Testing Data

# Train_Test_Split

Train-Test Split Evaluation

The train-test split is a technique for evaluating the performance of a machine learning algorithm.
It can be used for classification or regression problems and for any supervised learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, its input features are provided to the model, predictions are made, and these are compared to the expected values. This second subset is referred to as the test dataset.

Train Dataset: Used to fit the machine learning model.

Test Dataset: Used to evaluate the fitted machine learning model.

The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.

A common split percentage is:

Train: 80%, Test: 20%
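
A stratified 80/20 split keeps the class proportions the same in both subsets. The following plain-Python sketch shows the idea on made-up labels; `stratified_split` is a hypothetical helper for illustration and is not how scikit-learn implements its splitter.

```python
import random


def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) so that each class keeps roughly the same share."""
    rng = random.Random(seed)
    # Group the row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # test rows taken from this class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10 legitimate (0) and 10 fraudulent (1) transactions.
labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(labels)

print("Train size:", len(train_idx), "Test size:", len(test_idx))
print("Frauds in test set:", sum(labels[i] for i in test_idx))
```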


In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, stratify=y, random_state=2)

In [None]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

In [None]:
print(x.shape, x_train.shape, x_test.shape)

# Model Training

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

In [None]:
# training the Logistic Regression Model with Training Data
model.fit(x_train, y_train)

# Model Evaluation

# Accuracy Score

In [None]:
from sklearn.metrics import accuracy_score
# accuracy on training data (accuracy_score expects the true labels first, then the predictions)
x_train_prediction = model.predict(x_train)
training_data_accuracy = accuracy_score(y_train, x_train_prediction)

In [None]:
print('Accuracy on Training data : ', training_data_accuracy)

In [None]:
# accuracy on test data (true labels first, then the predictions)
x_test_prediction = model.predict(x_test)
test_data_accuracy = accuracy_score(y_test, x_test_prediction)

In [None]:
print('Accuracy score on Test Data : ', test_data_accuracy)

# Classification Algorithm

**KNeighborsClassifier**

* The K-Nearest Neighbors classifier (KNN) is one of the simplest yet most commonly used classifiers in supervised machine learning.
* KNN is often considered a lazy learner: it doesn’t technically train a model to make predictions.
* Instead, an observation is predicted to belong to the class that makes up the largest proportion of its k nearest observations. For example, if most of the k observations closest to an observation with an unknown class are of class 1, then the observation is classified as class 1.
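
The majority-vote idea can be sketched without any library. The points, labels, and the `knn_predict` helper below are made up for illustration and use Euclidean distance; this is not scikit-learn's implementation.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority class among the k training points nearest to `query`."""
    # Sort the training points by Euclidean distance to the query.
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Two small, well separated groups of points (toy data).
train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.8, 5.2)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (0.3, 0.3)))  # near the class-0 group
print(knn_predict(train_points, train_labels, (4.9, 5.0)))  # near the class-1 group
```

Because the prediction depends on raw distances, features on very different scales (such as 'Time' and 'Amount' next to the PCA components) can dominate the vote unless they are scaled first.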

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
model = KNeighborsClassifier(n_neighbors=5,metric='minkowski',p=2)
model.fit(x_train,y_train)

# Evaluation

In [None]:
y_pred=model.predict(x_test)

# predicted probability of the positive class (column 1 of predict_proba)
y_pred_proba=model.predict_proba(x_test)[:,1]

# Confusion Matrix

**Confusion** **matrix**

A confusion matrix is a table that is often used to describe the performance of a classification model. Here "yes" means the transaction is predicted to be fraud.

**true positives (TP)**: These are cases in which we predicted yes (the transaction is fraud), and it is fraud.

**true negatives (TN):** We predicted no, and the transaction is not fraud.

**false positives (FP):** We predicted yes, but the transaction is not actually fraud. (Also known as a "Type I error.")

**false negatives (FN):** We predicted no, but the transaction actually is fraud. (Also known as a "Type II error.")

**precision** - What proportion of positive identifications was actually correct?

**recall** - What proportion of actual positives was identified correctly?

**F1 Score**

* F1 Score is the harmonic mean of Precision and Recall.
* F1 is usually more useful than accuracy, especially if you have an uneven class distribution.
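
The definitions above translate directly into a few lines of plain Python. The labels below are made up, and `binary_metrics` is a hypothetical helper that assumes at least one actual and one predicted positive.

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts plus precision, recall, F1 and accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / len(y_true)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy example: 4 frauds among 10 transactions, 2 of them caught, 1 false alarm.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, "=", value)
```

Here the accuracy looks reasonable even though half of the frauds are missed, while recall and F1 make the weakness visible.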

In [None]:
import numpy as np  

np.random.seed(1001)
# np.random.seed fixes NumPy's global random state so that the same random numbers are generated on every run of the code

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

#Importing cohen_kappa_score and roc_auc_score metrices
from sklearn.metrics import cohen_kappa_score, roc_auc_score
from sklearn.metrics import roc_curve, auc

#importing visualizing library
import matplotlib.pyplot as plt
import seaborn as sns

#logloss to check is there loss or difference
from sklearn.metrics import log_loss

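#Creating a function called classification_metric
def classification_metric(y_test,y_pred,y_prob,label,n=1,verbose=False):
    """
    Note: only for binary classification.
    y_test: true labels, y_pred: predicted labels,
    y_prob: predicted probability of the positive class,
    label: names of the two classes, e.g. ['No','Yes'],
    n: annotate every n-th ROC threshold when verbose is True.
    """
    # confusion matrix with an extra 'Total' row and column
    cm = confusion_matrix(y_test,y_pred)
    col_totals = cm.sum(axis=0)   # one total per predicted class
    cm = np.append(cm,col_totals.reshape(1,-1),axis=0)
    row_totals = cm.sum(axis=1)   # one total per actual class
    cm = np.append(cm,row_totals.reshape(-1,1),axis=1)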

    labels = label+['Total']
    
    plt.figure(figsize=(10,6))
    #plotting a fig size as 10 width and 6 height
    
    
    sns.heatmap(cm,annot=True,cmap='summer',fmt='d',xticklabels=labels,
                yticklabels=labels,linewidths=3,cbar=False)
    #create a heatmap using the seaborn library; fmt='d' prints the counts as integers

    plt.xlabel('Predicted Values')
    #ploting the values on x- axis as Predicted values
    
    plt.ylabel('Actual Values')
    #ploting the values on y- axis as actual values
    
    plt.title('Confusion Matrix')
    # Mentioning the title of the figure
    
    plt.show()
    #show the image
    
    print('*'*30+'Classification Report'+'*'*30+'\n\n')
    #the asterisks draw a separator line around the heading
    
    #created classification report
    cr = classification_report(y_test,y_pred)
    
    #print the classifiaction report
    print(cr)
    
    print('\n'+'*'*36+'Kappa Score'+'*'*36+'\n\n')
    
    
    # Kappa score
    kappa = cohen_kappa_score(y_test,y_pred) # Kappa Score
    print('Kappa Score =',kappa)
    
    print('\n'+'*'*30+'Area Under Curve Score'+'*'*30+'\n\n')
    # AUC score, computed from the predicted probabilities so that it matches the ROC curve below
    roc_a = roc_auc_score(y_test,y_prob)
    print('AUC Score =',roc_a)
    
    # ROC
    
    
    plt.figure(figsize=(8,5))
    #plot the figuare based on width and height sizes
    
    fpr,tpr, thresh = roc_curve(y_test,y_prob)
    #fpr false positive rate
    #tpr true positive rate
    
    plt.plot(fpr,tpr,'r')
    print('Number of probabilities to build ROC =',len(fpr))
    if verbose:
        for i in range(len(thresh)):
            if i%n == 0:
                plt.text(fpr[i],tpr[i],'%0.2f'%thresh[i])
                plt.plot(fpr[i],tpr[i],'v')


    plt.xlabel('False Positive Rate')
    #fpr on x -axis 
    
    plt.ylabel('True Positive Rate')
    #tpr on y axis
    
    plt.title('Receiver Operating Characteristic')
    #mentioning the title of the figure
    
    plt.legend(['AUC = {}'.format(roc_a)])
    #assign the legend to the figuare
    
    plt.plot([0,1],[0,1],'b--',linewidth=2.0)
    #mentioning then line width as 2.0
    
    plt.grid()
    # show the grid lines to the image
    
    plt.show()
    #display the image
    
# Helper that turns a predicted probability into a class label using a cut-off of 0.5
class threshold():
    '''
    Setting up the threshold points
    '''
    def __init__(self):
        self.th = 0.5
        
    def predict_threshold(self,y):
        if y >= self.th:
            return 1
        else:
            return 0

In [None]:
classification_metric(y_test,y_pred,y_pred_proba,['no','yes'],n=1,verbose=True)

# Kappa Score :

Cohen's kappa coefficient (κ) is a statistic originally used to measure inter-rater reliability.
It can also be used to assess the performance of a classification model, and it is a useful evaluation metric when dealing with imbalanced data.
Cohen’s kappa corrects the evaluation bias by taking into account the agreement that would be expected from a random guess.
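
To illustrate how kappa discounts chance agreement, here is a small sketch that computes it from the four confusion-matrix counts. The counts are made up, and `cohen_kappa` is a hypothetical helper, not the scikit-learn function.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa from binary confusion-matrix counts."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # observed agreement, i.e. accuracy
    # Agreement expected by chance, from the marginal frequencies.
    chance_yes = ((tp + fn) / n) * ((tp + fp) / n)
    chance_no = ((tn + fp) / n) * ((tn + fn) / n)
    expected = chance_yes + chance_no
    return (observed - expected) / (1 - expected)


# A model that never predicts fraud on 1000 transactions with 5 frauds:
# accuracy is 99.5%, but kappa shows no agreement beyond chance.
kappa_always_negative = cohen_kappa(tp=0, fp=0, fn=5, tn=995)

# A balanced case where the model is right 80% of the time.
kappa_balanced = cohen_kappa(tp=40, fp=10, fn=10, tn=40)

print("Always negative:", kappa_always_negative)
print("Balanced case:  ", kappa_balanced)
```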

## AUC

It provides an aggregate measure of performance across all possible classification thresholds.
It can be read as the probability that the model ranks a randomly chosen positive example above a randomly chosen negative one.
AUC ranges from 0 to 1.
The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.

## ROC (Receiver Operating Characteristic curve):

It shows the performance of the model across all thresholds.
The curve plots two parameters against each other: TPR (sensitivity) and FPR (1 - specificity).
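
As a rough sketch of how the ROC curve and its area are obtained, the code below lowers the threshold one score at a time, records the (FPR, TPR) pairs, and integrates them with the trapezoid rule. The labels, scores, and helper names are assumptions for illustration; both classes must be present.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold one score at a time."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: 3 frauds and 3 legitimate transactions with model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
auc_value = auc_trapezoid(points)
print("ROC points:", points)
print("AUC:", round(auc_value, 4))
```

In this toy data 8 of the 9 possible fraud/legitimate pairs are ranked in the right order, which matches the computed area.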

* From the output above, the Kappa score is 0.116, which is very low.
* The AUC score is 0.53, which is barely better than random guessing (0.5), so we want to increase it.
* **Reason**: The dataset is imbalanced, so the kappa score and AUC score are low. We can balance it using some of the techniques below.

## Solution for unbalanced dataset:
* One solution is an oversampling technique, the Synthetic Minority Oversampling Technique (SMOTE).
* It works based on K-NN.

# Synthetic Minority Oversampling Technique

* **SMOTE:** Synthetic Minority Oversampling Technique
* SMOTE is an oversampling technique in which synthetic samples are generated for the minority class.
* This algorithm helps to overcome the overfitting problem posed by random oversampling.
* It works in feature space, generating new instances by interpolating between positive instances that lie close together.
* The general idea is to bring the number of minority class samples (either 0 or 1) up to a number comparable to the other class, in other words to match the size of the other class.
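
The interpolation step can be sketched as follows. This is a simplified illustration of the idea, not the reference SMOTE algorithm or any library's implementation; `smote_sketch`, its parameters, and the toy points are all assumptions.

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy minority class with two features per point.
fraud_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(fraud_points, n_new=6)

for point in new_points:
    print(point)
```

Every synthetic point lies on a line segment between two existing minority points, so the new samples stay inside the region already occupied by the minority class instead of duplicating rows exactly.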

# **Linear Discriminant Analysis**

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

#Initialize the Linear Discriminant Analysis Classifier
model = LinearDiscriminantAnalysis()

#Train the model using Training Dataset
model.fit(x_train, y_train)

# Prediction using test data
y_pred = model.predict(x_test)

# Calculate Model accuracy by comparing y_test and y_pred
acc_lda = round( accuracy_score(y_test, y_pred) * 100, 2 )
print( 'Accuracy of Linear Discriminant Analysis Classifier: ', acc_lda )

# **GaussianNB**

In [None]:
from sklearn.naive_bayes import GaussianNB

#Initialize the Gaussian Naive Bayes Classifier
model = GaussianNB()

#Train the model using Training Dataset
model.fit(x_train, y_train)

# Prediction using test data
y_pred = model.predict(x_test)

# Calculate Model accuracy by comparing y_test and y_pred
acc_ganb = round( accuracy_score(y_test, y_pred) * 100, 2 )
print( 'Accuracy of Gaussian Naive Bayes : ', acc_ganb )

# **Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier

#Initialize the Decision Tree Classifier
model = DecisionTreeClassifier()

#Train the model using Training Dataset
model.fit(x_train, y_train)

# Prediction using test data
y_pred = model.predict(x_test)

# Calculate Model accuracy by comparing y_test and y_pred
acc_dtree = round( accuracy_score(y_test, y_pred) * 100, 2 )
print( 'Accuracy of Decision Tree Classifier : ', acc_dtree )


# **Random Forest**

In [None]:
#Import Library for Random Forest
from sklearn.ensemble import RandomForestClassifier

#Initialize the Random Forest
model = RandomForestClassifier()

#Train the model using Training Dataset
model.fit(x_train, y_train)

# Prediction using test data
y_pred = model.predict(x_test)

# Calculate Model accuracy by comparing y_test and y_pred
acc_rf = round( accuracy_score(y_test, y_pred) * 100, 2 )
print( 'Accuracy of Random Forest : ', acc_rf )