# Assignment 1 - The Bean Project

## Task 1

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

df = pd.read_csv('Dry_Bean_Dataset.csv')


In [None]:
X = df.drop(columns=['Class']).values
y = df['Class'].values

df.shape

In [None]:
print(f'Duplicate rows:{df.duplicated().sum()}\n')

print('Bean type of duplicated rows:')
print(df[df.duplicated()].Class.value_counts())

df[df.duplicated(keep=False)]

In [None]:
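df = df.drop_duplicates()
# Rebuild X and y so the duplicates are removed from the modelling data too
X, y = df.drop(columns=['Class']).values, df['Class'].values
df.shape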

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then apply the same scaling to the test data
sc_X_train = StandardScaler()
X_train = sc_X_train.fit_transform(X_train)
X_test = sc_X_train.transform(X_test)

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
tree = DecisionTreeClassifier()
lr = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier()

knn.fit(X_train, y_train)
tree.fit(X_train, y_train)
lr.fit(X_train, y_train)
rf.fit(X_train, y_train)

y_pred_knn = knn.predict(X_test)
y_pred_tree = tree.predict(X_test)
y_pred_lr = lr.predict(X_test)
y_pred_rf = rf.predict(X_test)

In [None]:
cm_knn = confusion_matrix(y_test,y_pred_knn)
cm_tree = confusion_matrix(y_test,y_pred_tree)
cm_lr = confusion_matrix(y_test,y_pred_lr)
cm_rf = confusion_matrix(y_test,y_pred_rf)

accuracy_knn = accuracy_score(y_test, y_pred_knn)
accuracy_tree = accuracy_score(y_test, y_pred_tree)
accuracy_lr = accuracy_score(y_test, y_pred_lr)
accuracy_rf = accuracy_score(y_test, y_pred_rf)

report_knn = classification_report(y_test, y_pred_knn)
report_tree = classification_report(y_test, y_pred_tree)
report_lr = classification_report(y_test, y_pred_lr)
report_rf = classification_report(y_test, y_pred_rf)


In [None]:
print("confusion matrix of k-Nearest Neighbour:")
print(cm_knn)
print("confusion matrix of Decision Tree:")
print(cm_tree)

In [None]:
print("confusion matrix of Logistic Regression:")
print(cm_lr)
print("confusion matrix of Random Forest:")
print(cm_rf)

In [None]:
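for name, acc in [('KNN', accuracy_knn), ('Decision Tree', accuracy_tree), ('Logistic Regression', accuracy_lr), ('Random Forest', accuracy_rf)]:
    print(f'{name}: {acc:.4f}')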

In [None]:
print(report_knn)
print(report_tree)

In [None]:
print(report_lr)
print(report_rf)

#### Accuracy:

- KNN achieved an accuracy of approximately 91.92%.
- Decision Tree achieved an accuracy of about 88.47%.
- Logistic Regression achieved an accuracy of approximately 92.54%.
- Random Forest achieved an accuracy of about 92.03%.

#### Precision, Recall, and F1-score:
- For all algorithms, the "BOMBAY" class has precision, recall, and F1-score close to 1.0, so every model identifies "BOMBAY" beans almost perfectly.
- For the classes "DERMASON", "HOROZ", "SEKER", and "CALI", precision, recall, and F1-score are relatively high for all algorithms, which indicates good performance on these classes.
- The class "SIRA" has lower precision, recall, and F1-score across all algorithms, so it appears to be the hardest class to classify accurately.
- Overall, the "weighted avg" metrics show Logistic Regression and Random Forest slightly ahead of KNN and clearly ahead of Decision Tree in precision, recall, and F1-score.
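
The per-class figures in a classification report are derived from the confusion matrix. The sketch below shows that relationship on a small, made-up 3-class matrix (the counts are illustrative assumptions, not results from this notebook).

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
labels = ["BOMBAY", "DERMASON", "SIRA"]
cm = np.array([
    [50,  0,  0],
    [ 0, 90, 10],
    [ 0, 20, 80],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)  # column sum = everything predicted as that class
recall = true_pos / cm.sum(axis=1)     # row sum = everything that truly is that class
f1 = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:10s} precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy matrix, "SIRA" loses recall because 20 of its 100 samples are predicted as "DERMASON", while "DERMASON" loses precision for the same reason.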

#### Analysis:

- KNN, Logistic Regression, and Random Forest all reach accuracy scores above 90%, so they classify most instances correctly.
- Decision Tree has the lowest accuracy of the four algorithms, which suggests it is not the best choice for this dataset.
- Logistic Regression and Random Forest perform well across all evaluation metrics, making them strong candidates for this classification task.
- The per-class precision, recall, and F1-scores give a more detailed view than accuracy alone. Which algorithm is preferable depends on which classes, and which kinds of errors, matter most for the task.

In summary, Logistic Regression and Random Forest appear to be the top-performing algorithms based on the metrics above.
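
The ranking above rests on a single 80/20 split. One way to cross-check it is to put the scaler and the model into a pipeline and evaluate with cross-validation, so that the scaler is fitted only on the training part of each fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the bean features; the names `X_demo`, `y_demo`, and `cv_results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bean features (assumption: not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

cv_results = {}
for name, model in models.items():
    # The scaler is re-fitted on the training part of every fold.
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="f1_macro")
    cv_results[name] = scores
    print(f"{name}: mean macro-F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Reporting the spread across folds helps judge whether a gap of well under one percentage point between two models is meaningful.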

## Task 2

In [None]:
print(report_rf)

In the Random Forest classification report, 'SIRA' has a precision of 0.88, a recall of 0.87, and an F1-score of 0.88. These metrics indicate that the model is only moderately accurate at classifying 'SIRA' beans as poisonous. The other classes score higher, and the overall model accuracy is 0.92.
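
To see where the 'SIRA' errors come from, the confusion matrix can be labelled and read by row and by column. The sketch below uses a few hand-written labels (purely illustrative, not the notebook's predictions) to show the idea.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["DERMASON", "SEKER", "SIRA"]
# Hypothetical true and predicted labels, for illustration only.
y_true_demo = ["SIRA", "SIRA", "SIRA", "SIRA", "DERMASON", "DERMASON", "SEKER", "SEKER"]
y_pred_demo = ["SIRA", "SIRA", "DERMASON", "DERMASON", "DERMASON", "SIRA", "SEKER", "SEKER"]

cm_demo = pd.DataFrame(
    confusion_matrix(y_true_demo, y_pred_demo, labels=labels),
    index=labels,    # rows: true class
    columns=labels,  # columns: predicted class
)

missed = cm_demo.loc["SIRA"].drop("SIRA")    # true SIRA predicted as another class (hurts recall)
false_alarms = cm_demo["SIRA"].drop("SIRA")  # other classes predicted as SIRA (hurts precision)

print(cm_demo)
print("SIRA is most often mistaken for:", missed.idxmax())
```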

In [None]:
df["Class"].value_counts().plot(kind='bar', color='blue')
df['Class'].value_counts()

- Given the class imbalance in the dataset, most visibly the small 'BOMBAY' class, the goal is to improve classification performance for the 'SIRA' class without degrading the other classes.
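
For reference, scikit-learn's `class_weight='balanced'` option weights each class by `n_samples / (n_classes * class_count)`, so rarer classes receive larger weights. The sketch below reproduces that formula on made-up label counts (the counts are assumptions, not the bean dataset's).

```python
from collections import Counter

# Illustrative labels only; the counts are made up.
y_demo = ["DERMASON"] * 6 + ["SIRA"] * 3 + ["BOMBAY"] * 1

counts = Counter(y_demo)
n_samples = len(y_demo)
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * class_count)
balanced_weights = {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

for cls, weight in balanced_weights.items():
    print(f"{cls:10s} count={counts[cls]} weight={weight:.3f}")
```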

### Exploratory Data Analysis

#### Plot the stripplot

In [None]:
numerical_cols = df.drop(columns=['Class']).columns

fig, ax = plt.subplots(4, 4, figsize=(20, 15))
for variable, subplot in zip(numerical_cols, ax.flatten()):
    g=sns.stripplot(x=df[variable],y=df.Class ,ax=subplot) 
plt.tight_layout()

- The stripplots show many SIRA outliers in AspectRation, roundness, Compactness, and ShapeFactor2 to ShapeFactor4.

#### Plot the histograms
- Histograms help visualise the frequency of the values in each column.

In [None]:
numerical_cols = df.drop(columns=['Class']).columns

fig, ax = plt.subplots(4, 4, figsize=(15, 15))
for variable, subplot in zip(numerical_cols, ax.flatten()):
    g=sns.histplot(data=df,x=variable ,ax=subplot) #working with default bin size
    g.axvline(x=df[variable].mean(), color='y', label='Mean', linestyle='--', linewidth=2)
plt.tight_layout()

In [None]:
numerical_cols = df.drop(columns=['Class']).columns

fig, ax = plt.subplots(4, 4, figsize=(15, 12))

for variable, subplot in zip(numerical_cols, ax.flatten()):
    g = sns.histplot(data=df, x=variable, ax=subplot, hue='Class', kde=True)
    g.set_title(f'Distribution of {variable}')
    subplot.axvline(x=df[df['Class'] == 'SIRA'][variable].mean(), color='y', label='SIRA Mean', linestyle='--', linewidth=2)
    
plt.tight_layout()
plt.show()

#### Boxplot

In [None]:
fig, ax = plt.subplots(8, 2, figsize=(15, 25))

for variable, subplot in zip(numerical_cols, ax.flatten()):
    sns.boxplot(x=df['Class'], y= df[variable], ax=subplot,palette='pastel')
plt.tight_layout()

- The boxplots show many SIRA outliers in AspectRation, MajorAxisLength, roundness, Compactness, ShapeFactor2, and ShapeFactor4.
- The next cell uses the Isolation Forest algorithm to detect these outliers and replace them with the feature median.

In [None]:
from sklearn.ensemble import IsolationForest

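# Work on a copy so the .loc assignment below does not raise SettingWithCopyWarning.
# Note: the replacement only changes sira_data, not df or X_train.
sira_data = df[df['Class'] == 'SIRA'].copy()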
features_to_process = ['AspectRation', 'MajorAxisLength', 'roundness', 'Compactness', 'ShapeFactor2', 'ShapeFactor4']

clf = IsolationForest(contamination=0.05, random_state=42)
clf.fit(sira_data[features_to_process])
outliers = clf.predict(sira_data[features_to_process]) == -1
for feature in features_to_process:
    median_value = sira_data[feature].median()
    sira_data.loc[outliers, feature] = median_value

In [None]:
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

report = classification_report(y_test, y_pred)
print(report)

- After this data processing step, the scores for the 'SIRA' class are unchanged. This is expected: the medians were written to `sira_data`, while the model above was retrained on the `X_train` split created earlier, so the replacement never reached the training data.
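
For the outlier replacement to influence a model, the cleaned values have to end up in the frame from which the feature matrix is built. The sketch below shows one way to write the medians back by index. It runs on synthetic data, and the frame `demo`, its two feature columns, and the distribution parameters are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (assumption: not the real bean measurements).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Class": ["SIRA"] * 200 + ["SEKER"] * 100,
    "roundness": rng.normal(0.88, 0.02, 300),
    "Compactness": rng.normal(0.80, 0.03, 300),
})
features = ["roundness", "Compactness"]

sira_rows = demo[demo["Class"] == "SIRA"]
medians = sira_rows[features].median()

# Flag roughly 5% of the SIRA rows as outliers.
iso = IsolationForest(contamination=0.05, random_state=42)
is_outlier = iso.fit_predict(sira_rows[features]) == -1
outlier_index = sira_rows.index[is_outlier]

# Write the medians back into the original frame, not into a temporary slice.
for feature in features:
    demo.loc[outlier_index, feature] = medians[feature]

print(f"Replaced {len(outlier_index)} SIRA rows")
```

After this step, the feature matrix and the train/test split would need to be rebuilt from `demo`. To avoid leaking test information, the detector and medians could instead be fitted on the training rows only.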

In [None]:
from sklearn.model_selection import cross_val_score

clf = RandomForestClassifier(random_state=42)

cv_scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')

for i, score in enumerate(cv_scores):
    print(f'Cross-Validation Fold {i + 1}: {score}')

average_accuracy = cv_scores.mean()
print(f'Average Accuracy: {average_accuracy}')

In [None]:
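# Integer-encode 'Class' so df.corr() runs; the codes are arbitrary labels, not an ordering
df['Class'] = df['Class'].map({'SEKER': 0, 'SIRA': 1, 'BOMBAY': 2, 'DERMASON': 3, 'BARBUNYA': 4, 'HOROZ': 5, 'CALI': 6})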
df.head()

In [None]:
plt.figure(figsize=(18,12))
sns.heatmap(df.corr(), yticklabels='auto', annot=True, cmap='coolwarm')
plt.show()

* There are many highly correlated attributes in the correlation matrix above, for example:

    *   **Area & ConvexArea**: 1
    *   **ShapeFactor3 & Compactness**: 1
    *   **AspectRation & Compactness**: -0.99
    *   **Area & Perimeter**: 0.97
    *   **Perimeter & ShapeFactor1**: -0.87
    *   **AspectRation & Eccentricity**: 0.92


* Some attribute pairs show little correlation:
    *   **Extent & EquivDiameter**: 0.029
    *   **Solidity & Eccentricity**: -0.3
    *   **Compactness & Area**: -0.27
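
Strongly correlated pairs can also be listed programmatically instead of being read off the heatmap. The sketch below does this for a small made-up frame. The column names mirror the dataset, but the values and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up.
demo = pd.DataFrame({
    "Area": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "ConvexArea": [1.1, 2.0, 3.2, 4.1, 5.0, 6.3],
    "Extent": [0.7, 0.2, 0.9, 0.1, 0.8, 0.3],
})

corr = demo.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().reset_index()
pairs.columns = ["feature_a", "feature_b", "r"]

strong = pairs[pairs["r"].abs() >= 0.9]
print(strong)
```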

In [None]:
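# Apply the 1.5 * IQR rule to the feature columns only (not the encoded 'Class')
features = df.drop(columns=['Class'])
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)

iqr = q3 - q1

df1 = df[~((features < (q1 - 1.5 * iqr)) | (features > (q3 + 1.5 * iqr))).any(axis=1)]

df1.describe().T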

#### Synthetic Minority Over-sampling Technique

In [None]:
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='auto', random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)


In [None]:
clf = RandomForestClassifier(class_weight='balanced', random_state=42)
clf.fit(X_train_resampled, y_train_resampled)

In [None]:
y_pred = clf.predict(X_test)

report = classification_report(y_test, y_pred)
print(report)

#### Class Weighted Random Forest model

In [None]:
class_weights = {
    'DERMASON': 1.0,
    'SIRA': 3.0,
    'SEKER': 1.0,
    'HOROZ': 1.0,
    'CALI': 1.0,
    'BARBUNYA': 1.0,
    'BOMBAY': 1.0
}

In [None]:
clf = RandomForestClassifier(class_weight=class_weights, random_state=42)
clf.fit(X_train, y_train)

In [None]:
y_pred = clf.predict(X_test)

report = classification_report(y_test, y_pred)
print(report)

#### GridSearchCV tuned Random Forest model

In [None]:
from sklearn.model_selection import GridSearchCV

# The grid search fits its own clones of this estimator, so no separate fit is needed here
clf = RandomForestClassifier(class_weight='balanced', random_state=42)

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30]
}

grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='f1_macro')
grid_search.fit(X_train_resampled, y_train_resampled)

best_clf = grid_search.best_estimator_
y_pred = best_clf.predict(X_test)

report = classification_report(y_test, y_pred)
print(report)

- In the GridSearchCV-tuned Random Forest model, the recall of 0.90 for the 'SIRA' class means that the model correctly identifies 90% of the 'SIRA' samples in the test set.

In summary, the GridSearchCV-tuned Random Forest model shows slightly better performance across most classes, with improved precision, recall, and F1-scores in some cases. The class-weighted model, however, may keep slightly better precision for some classes at the cost of slightly lower recall for the 'SIRA' class.
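
Because the goal of this task is the 'SIRA' class, the grid search could also be scored on the 'SIRA' F1 alone rather than on `f1_macro`. The sketch below shows one possible setup with `make_scorer`. It runs on synthetic data with three made-up class names, and the small grid is chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with string labels (assumption: not the real dataset).
X_demo, y_int = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
y_demo = np.array(["DERMASON", "SEKER", "SIRA"])[y_int]

# Restricting `labels` to one class makes the macro average equal to that class's F1.
sira_f1 = make_scorer(f1_score, labels=["SIRA"], average="macro")

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [25, 50], "max_depth": [None, 5]},
    cv=3,
    scoring=sira_f1,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

Optimising for a single class can lower the scores of the others, so the full classification report is still worth checking afterwards.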