# Task 03: Prediction using Decision Tree Algorithm

## Submitted By: Yashuv Baskota
### Language: Python
### Level: Intermediate
### Dataset: https://bit.ly/3kXTdox

#### Description:
The task is to build a `Decision Tree classifier` and visualize it graphically. A *decision tree classifier* is a supervised learning algorithm used for classification tasks. It learns a tree of decision rules from the features of the training data: each internal node tests one feature against a threshold, and each leaf assigns a class.
The goal is that, when new data is fed to the classifier, it predicts the correct class.
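
To make the idea of a "tree-like model of decisions" concrete, here is a minimal sketch of a hand-written tree expressed as nested conditions. The function name `toy_tree_predict` and both thresholds are hypothetical and chosen only for illustration; a fitted tree learns its own features and thresholds from the training data.

```python
def toy_tree_predict(petal_length_cm, petal_width_cm):
    """Classify a flower with two hand-picked (hypothetical) rules."""
    # Root node: test the first feature
    if petal_length_cm < 2.5:
        return "setosa"
    # Second node: test another feature on the remaining samples
    if petal_width_cm < 1.75:
        return "versicolor"
    # Final leaf
    return "virginica"


print(toy_tree_predict(1.4, 0.2))
print(toy_tree_predict(4.5, 1.4))
print(toy_tree_predict(6.0, 2.2))
```

Each `if` corresponds to an internal node and each `return` to a leaf.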

## 1. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import plot_tree
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split

from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.model_selection import learning_curve

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

## 2. Load Dataset

In [None]:
# Load the data
df = pd.read_csv("data/Iris.csv")

## 3. Exploratory Data Analysis

In [None]:
df.shape

In [None]:
df.head()

### Basic information about the data

In [None]:
df.info()

### Summary Statistics

In [None]:
df.describe()

### Unique Species

In [None]:
print(df['Species'].nunique())
print(df['Species'].unique())

### Frequency distribution of each Species

In [None]:
df['Species'].value_counts()

### Check for missing values

In [None]:
df.isnull().sum()

### Check for duplicate rows

In [None]:
df.duplicated().sum()

### Data Visualization:
### Histogram

In [None]:
# Drop the row identifier; it carries no information about the species
df = df.drop(columns=['Id'])

df.hist(bins=50, figsize=(20,15))
plt.show()

In [None]:
sns.histplot(x='Species',data = df)

### Box Plot to Check for outliers

In [None]:
plt.figure(figsize=(15,15))
plt.subplot(2,2,1)
sns.boxplot(x="Species", y="SepalLengthCm", data=df)
plt.subplot(2,2,2)
sns.boxplot(x="Species", y="SepalWidthCm", data=df)
plt.subplot(2,2,3)
sns.boxplot(x="Species", y="PetalLengthCm", data=df)
plt.subplot(2,2,4)
sns.boxplot(x="Species", y="PetalWidthCm", data=df)

### A pair plot of all the numeric columns, colored by species

In [None]:
sns.pairplot(df, hue="Species")

### Heatmap of the correlation between all the numeric columns

In [None]:
# numeric_only=True skips the non-numeric Species column
sns.heatmap(df.corr(numeric_only=True), annot=True)

In [None]:
# Check for outliers in SepalWidthCm using the IQR rule
Q1 = df['SepalWidthCm'].quantile(0.25, interpolation='midpoint')
Q3 = df['SepalWidthCm'].quantile(0.75, interpolation='midpoint')
IQR = Q3 - Q1

print("Old Shape: ", df.shape)

# Fences: values at or beyond these bounds are treated as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Removing the outliers: keep only the rows strictly inside the fences
inside = (df['SepalWidthCm'] > lower_bound) & (df['SepalWidthCm'] < upper_bound)
df = df[inside].copy()

print("New Shape: ", df.shape)

sns.boxplot(x='SepalWidthCm', data=df)
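
The IQR rule used above can be traced by hand on a small sample. The sketch below is an assumption-laden illustration, not the notebook's exact computation: the widths are made-up values, the helper `iqr_fences` is hypothetical, and the quartiles come from the standard library's `statistics.quantiles` with linear interpolation rather than the midpoint method.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up sepal widths in cm, for illustration only
widths = [2.0, 2.8, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4]

lower, upper = iqr_fences(widths)

# Keep only the values strictly inside the fences
kept = [w for w in widths if lower < w < upper]
print(kept)
```

Here the smallest and largest values fall outside the fences and are dropped, while the seven values in between are kept.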

## 4. Model Building

In [None]:
# input variables
X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]

# target variable
y = df['Species']

In [None]:
# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Create the decision tree model
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)  

# Train the model on the training data
classifier.fit(X_train, y_train)
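
With `criterion='entropy'`, candidate splits are scored by how much they reduce the entropy of the class labels (the information gain). The sketch below shows that calculation in plain Python on a made-up set of labels; the helpers `entropy` and `information_gain` are hypothetical names for illustration and are not scikit-learn's internal implementation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((c / total) * log2(c / total) for c in counts)


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Made-up labels: a node with two classes in equal proportion
parent = ["setosa"] * 3 + ["versicolor"] * 3

# A split that separates the two classes perfectly
left = ["setosa"] * 3
right = ["versicolor"] * 3

print(entropy(parent))
print(information_gain(parent, left, right))
```

A split that leaves both children pure removes all of the parent's entropy, so its information gain equals the parent's entropy.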

### Bar plot of the feature importances

In [None]:
# Bar plot of feature importances
importances = classifier.feature_importances_
sns.barplot(x=importances, y=X.columns)

### Learning curve: showing the relationship between the size of the training set and the model's performance

In [None]:
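# Plot the learning curve (5-fold cross-validation)
# shuffle=True: the rows are assumed to be ordered by species, so small
# unshuffled training subsets could contain a single class
train_sizes, train_scores, test_scores = learning_curve(
    classifier, X, y, cv=5, shuffle=True, random_state=42)
sns.lineplot(x=train_sizes, y=train_scores.mean(axis=1), label="Training score")
sns.lineplot(x=train_sizes, y=test_scores.mean(axis=1), label="Cross-validation score")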

### Visualize the decision tree

In [None]:
fig = plt.figure(figsize=(25,20))
fn=['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']
cn=['setosa', 'versicolor', 'virginica']
plot_tree(classifier,
               feature_names = fn, 
               class_names=cn,
               filled = True);
fig.savefig('Tree.png')

In [None]:
# Use the model to make predictions on the test data
y_pred = classifier.predict(X_test)
y_pred

## 5. Performance Evaluation 

### Confusion matrix showing the performance of the model on the test set

In [None]:
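# Plot the confusion matrix, with rows and columns ordered as classifier.classes_
cm = confusion_matrix(y_test, y_pred, labels=classifier.classes_)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=classifier.classes_, yticklabels=classifier.classes_)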

# Add labels to the plot
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')

# Show the plot
plt.show()

### Classification Report of Performance

In [None]:
print(classification_report(y_test, y_pred))

In [None]:
# Calculate the accuracy of the model on the test set
# (equivalent to classifier.score(X_test, y_test))
accuracy_score(y_test, y_pred)
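
Accuracy can also be read directly off a confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total number of samples. The sketch below uses a made-up 3x3 matrix, not this notebook's results, and the helper name `accuracy_from_confusion` is hypothetical.

```python
def accuracy_from_confusion(matrix):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Made-up counts for illustration: rows are true labels, columns are predictions
example_cm = [
    [10, 0, 0],
    [0, 8, 1],
    [0, 1, 10],
]

print(accuracy_from_confusion(example_cm))
```

In this example 28 of the 30 samples lie on the diagonal, so the accuracy is 28/30.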

## 6. Make Prediction (Takes Input dimensions, predicts Species)

In [None]:
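# prediction function
def make_prediction(new_data_point):
    # wrap the point in a one-row DataFrame so the feature names match training
    new_row = pd.DataFrame([new_data_point], columns=X.columns)

    # predict the species for the new data point
    prediction = classifier.predict(new_row)[0]
    print(f"Predicted Species -> {prediction}")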

In [None]:
# new data points: [SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm]
new_data_point1 = [5.6, 3.9, 1.8, 0.5]
new_data_point2 = [6.5, 3.5, 4.8, 1.0]
new_data_point3 = [7.1, 3.1, 5.8, 1.8]

# predict species
make_prediction(new_data_point1)
make_prediction(new_data_point2)
make_prediction(new_data_point3)

<br>

__Thank You!__