<a href="https://www.kaggle.com/code/zmkalila/iris-dataset-practice?scriptVersionId=199500657" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Iris Dataset: EDA, Data Visualization, and Classification

This notebook is based on the code from this YouTube lesson:  
https://youtu.be/Op3019SFYzI?si=DQ4VwN5I5t8Vk6jX (credit goes to the video's creator).

# Exploratory Data Analysis (EDA)

## Import modules

In [None]:
import pandas as pd
import seaborn as sns

Import `warnings` and silence all warnings so that running the `sns.get_dataset_names()` cell does not print any. Note that this filter stays active for every later cell as well.

In [None]:
import warnings
warnings.filterwarnings('ignore')
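If silencing warnings for the whole notebook feels too broad, a narrower option is to suppress them only around a specific call. The sketch below uses the standard library's `warnings.catch_warnings()` context manager; `noisy_helper` is a hypothetical stand-in for any call that emits a warning and is not part of this notebook.

```python
import warnings


def noisy_helper():
    # Hypothetical function standing in for any call that emits a warning.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42


# Warnings are ignored only inside this block;
# the previous filter settings are restored when the block exits.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_helper()

# Record warnings in a second block to confirm that the same call still warns
# when the "ignore" filter is not in effect.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_helper()

print(result, len(caught))
```

The global `warnings.filterwarnings('ignore')` used in this notebook is simpler, but it can also hide warnings that are worth reading, such as deprecation notices.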

## Load dataset

In [None]:
sns.get_dataset_names()

In [None]:
df = sns.load_dataset('iris') # load dataset
df

## Identify the shape of the dataset

In [None]:
df.shape # dataset dimension (number_of_rows, number_of_columns)

## Get the list of all the column names

In [None]:
df.columns

## Identify data types for each column

In [None]:
df.dtypes

## Get basic dataset information

In [None]:
df.info()

## Identify missing values

In [None]:
df.isna().values.any()
# can also be written as: df.isnull().values.any()

## Identify duplicated entries/rows

In [None]:
df[df.duplicated(keep=False)] # to display all rows that are duplicated

In [None]:
df.duplicated().value_counts()

## Drop duplicated entries/rows

In [None]:
df.drop_duplicates(inplace=True)
df.shape

## Describe the dataset

In [None]:
df.describe()

Note: all measurements above are in centimeters (cm).

## Correlation matrix

In [None]:
df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].corr()

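The numbers above show how strongly each pair of variables is linearly correlated (Pearson correlation, the default for `corr()`).  
- **Positive value**: the two variables tend to increase together.<br>(the closer the value is to 1, the stronger the correlation)
- **Negative value**: one variable tends to decrease as the other increases.<br>(the closer the value is to -1, the stronger the inverse correlation)
- **Value near 0**: the two variables have weak to no linear correlation.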
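To make the sign of a correlation coefficient concrete, here is a minimal sketch on synthetic data (not the iris measurements). The arrays `rising`, `falling`, and `unrelated` are made up for illustration, and `np.corrcoef` computes the same Pearson coefficient that `DataFrame.corr()` uses by default.

```python
import numpy as np

x = np.arange(1, 11, dtype=float)

rising = 2 * x + 1    # grows as x grows
falling = -3 * x + 5  # shrinks as x grows
# A parabola centered on the mean of x has no linear trend in either direction.
unrelated = (x - x.mean()) ** 2

r_rising = np.corrcoef(x, rising)[0, 1]
r_falling = np.corrcoef(x, falling)[0, 1]
r_flat = np.corrcoef(x, unrelated)[0, 1]

print(r_rising, r_falling, r_flat)
```

The first coefficient is close to 1, the second close to -1, and the third close to 0, even though `unrelated` depends on `x` in a clear but nonlinear way.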

# Data Visualization

## Import modules

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set() # apply Seaborn's default theme to all subsequent plots

## Heatmap

In [None]:
sns.heatmap(data=df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].corr(),
           cmap='rocket_r')

## Bar Plot

In [None]:
df['species'].value_counts()

In [None]:
df['species'].value_counts().plot.bar()
plt.ylabel('Frequency')
plt.tight_layout()
plt.xticks(rotation=0)

plt.show()

In [None]:
sns.countplot(x='species', data=df)
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

## Pie Chart

In [None]:
df['species'].value_counts().plot.pie(autopct='%1.1f%%', labels=None, legend=True)
plt.tight_layout()

## Line Plot

In [None]:
fig,ax = plt.subplots(nrows=2, ncols=2, figsize=(10,8))
plt.suptitle('Iris Sepal & Petal Length Width')

df['sepal_length'].plot.line(ax=ax[0][0])
ax[0][0].set_title('Sepal Length')

df['sepal_width'].plot.line(ax=ax[0][1])
ax[0][1].set_title('Sepal Width')

df.petal_length.plot.line(ax=ax[1][0])
ax[1][0].set_title('Petal Length')

df.petal_width.plot.line(ax=ax[1][1])
ax[1][1].set_title('Petal Width')

plt.tight_layout()
plt.show()

In [None]:
df.plot()
plt.tight_layout()
plt.show()

## Histogram

In [None]:
df.hist(bins=10)
plt.tight_layout()

In [None]:
df.plot.hist(bins=10)
plt.tight_layout()

## Boxplot

In [None]:
df.plot.box()

In [None]:
df.plot.box()
plt.tight_layout()

In [None]:
df.boxplot(by='species', figsize=(6,6))
plt.tight_layout()

## Scatter Plot

In [None]:
sns.scatterplot(x='sepal_length', y='sepal_width', data=df, hue='species')
plt.tight_layout()

## Pair Plot

In [None]:
sns.pairplot(df, hue='species', markers='*')
plt.tight_layout()

## Violin Plot

In [None]:
sns.violinplot(data=df, y='species', x='sepal_length', inner='quartile')
plt.tight_layout()

# Machine Learning: Classification Model

## Import modules
The utilities below come from Scikit-Learn.

In [None]:
from sklearn.model_selection import train_test_split # to split dataset into 2 parts: training & testing set
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report # to evaluate performance of the model

## Dataset: Features & Class Label

**Features** are the inputs to the Machine Learning model,
whereas the **class label**, as the name suggests, is the output: the category the model assigns to each sample.

During training, the model learns patterns in the features so that it can classify new samples into the existing categories, i.e. the class labels.

In [None]:
# assign features to variable X
# only the columns with numerical data become the features, so the 'species' column is dropped

X = df.drop(columns='species')
X.head() # to show the first 5 rows

In [None]:
# the 'species' column acts as the class label (target)
# assign label (target) to variable y

y = df['species']
y.head().to_frame() # to show the first 5 rows in the form of dataframe

## Split the dataset into a training set and testing set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=10)

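`test_size` is the proportion of the dataset assigned to the testing set.  
0.4 means 40% goes to the testing set and the remaining 60% to the training set.

`random_state` is the seed for the random shuffling applied before the split. Passing a fixed integer makes the split reproducible across runs.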
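As a small illustration of both parameters, the sketch below splits a toy array twice with the same `random_state` and checks that the two splits are identical. `X_demo` and `y_demo` are made-up arrays, not the iris data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, plus one label per sample.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.4, random_state=10)
split_b = train_test_split(X_demo, y_demo, test_size=0.4, random_state=10)

# Each split is [X_train, X_test, y_train, y_test]; compare them piece by piece.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

print(same_split)
print(split_a[0].shape, split_a[1].shape)  # training and testing feature shapes
```

With 10 samples and `test_size=0.4`, 6 rows go to the training set and 4 to the testing set. Changing `random_state` to another integer generally produces a different but equally reproducible split.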

In [None]:
print('training dataset')
print(X_train.shape)
print(y_train.shape)
print()
print('testing dataset')
print(X_test.shape)
print(y_test.shape)

## K Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
k_range = list(range(1,26))
scores = [] # empty list

for k in k_range:

    # configure the algorithm with the KNeighborsClassifier class and set the number of neighbors
    model_knn = KNeighborsClassifier(n_neighbors=k)

    # train the classifier by calling .fit() on X_train and y_train
    model_knn.fit(X_train, y_train)

    # predict the labels for X_test and assign them to y_pred
    y_pred = model_knn.predict(X_test)

    # finally, measure performance as the accuracy of the predictions (y_pred) against the true labels (y_test)
    scores.append(accuracy_score(y_test, y_pred))

In [None]:
plt.plot(k_range, scores)
plt.xlabel('Value of k for KNN')
plt.ylabel('Accuracy Score')
plt.title('Accuracy Scores for k values of KNN')
plt.tight_layout()

As we can see above, the accuracy score shows an increasing trend.
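A trend read from a single train/test split can depend on how that split happened to fall. One common way to check it is k-fold cross-validation, sketched below under some assumptions: it uses Scikit-Learn's bundled copy of iris from `load_iris` (not the Seaborn DataFrame with duplicates dropped), 5 folds, and the hypothetical names `X_cv`, `y_cv`, `k_values`, and `mean_scores`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Assumption: Scikit-Learn's bundled iris data, loaded as NumPy arrays.
X_cv, y_cv = load_iris(return_X_y=True)

k_values = list(range(1, 26))
mean_scores = []

for k in k_values:
    # Accuracy on each of 5 folds for this k, averaged into a single score.
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_cv, y_cv, cv=5)
    mean_scores.append(fold_scores.mean())

# k with the highest mean cross-validated accuracy (first one if several tie).
best_k = k_values[mean_scores.index(max(mean_scores))]
print(best_k)
```

Plotting `mean_scores` against `k_values` in the same way as the cell above gives a curve that is less sensitive to one particular split.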

In [None]:
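# Optional: refit with a specific k before evaluating.
# If this stays commented out, model_knn and y_pred below come from the last loop iteration (k=25).
# model_knn = KNeighborsClassifier(n_neighbors=3)
# model_knn.fit(X_train, y_train)
# y_pred = model_knn.predict(X_test)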

### Accuracy Score

In [None]:
print(accuracy_score(y_test, y_pred)) # to evaluate the accuracy of the classification model

### Confusion Matrix

In [None]:
print(confusion_matrix(y_test, y_pred))

### Classification Report

In [None]:
print(classification_report(y_test, y_pred))

## Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# 'multi_class' is deprecated in recent scikit-learn versions; the default already handles multiclass targets
model_logreg = LogisticRegression(solver='lbfgs')
model_logreg.fit(X_train, y_train)
y_pred = model_logreg.predict(X_test)

## Support Vector Classifier

In [None]:
from sklearn.svm import SVC

In [None]:
model_svc = SVC(gamma='scale')
model_svc.fit(X_train, y_train)
y_pred = model_svc.predict(X_test)

## Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
model_dt = DecisionTreeClassifier()
model_dt.fit(X_train, y_train)
y_pred = model_dt.predict(X_test)

## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model_rf = RandomForestClassifier(n_estimators=100)
model_rf.fit(X_train, y_train)
y_pred = model_rf.predict(X_test) # same variable name as the other models, for consistency

## Accuracy comparison of various classifier models

In [None]:
models = [model_knn, model_logreg, model_svc, model_dt, model_rf]
accuracy_scores = []

for model in models:
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

print(accuracy_scores)

In [None]:
plt.plot(['KNN', 'Logistic Regression', 'SVC', 'Decision Tree', 'Random Forest'], accuracy_scores)
plt.title('Accuracy Score Comparison for Various Classifier Models', fontweight='bold')
plt.xlabel('Model', fontweight='bold', color='b')
plt.ylabel('Accuracy Score', fontweight='bold', color='b')
plt.tight_layout()