# Dataset Information

Given a set of features extracted from the shapes of beans in images, the task is to predict the class of a bean from those shape features.
There are 7 bean types in this dataset.

**Data fields**
- ID - an ID for this instance
- Area - (A), The area of a bean zone and the number of pixels within its boundaries.
- Perimeter - (P), Bean circumference is defined as the length of its border.
- MajorAxisLength - (L), The distance between the ends of the longest line that can be drawn from a bean.
- MinorAxisLength - (l), The longest line that can be drawn from the bean while standing perpendicular to the main axis.
- AspectRatio - (K), Defines the relationship between L and l.
- Eccentricity - (Ec), Eccentricity of the ellipse having the same moments as the region.
- ConvexArea - (C), Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
- EquivDiameter - (Ed), The diameter of a circle having the same area as a bean seed area.
- Extent - (Ex), The ratio of the pixels in the bounding box to the bean area.
- Solidity - (S), Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.
- Roundness - (R), Calculated with the following formula: (4piA)/(P^2)
- Compactness - (CO), Measures the roundness of an object: Ed/L
- ShapeFactor1 - (SF1)
- ShapeFactor2 - (SF2)
- ShapeFactor3 - (SF3)
- ShapeFactor4 - (SF4)
- y - the class of the bean. It can be any of BARBUNYA, SIRA, HOROZ, DERMASON, CALI, BOMBAY, and SEKER.
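Several of the fields above are derived from the basic measurements by simple formulas. As a minimal sketch for sanity-checking them: the function name `derived_shape_features` is hypothetical, the AspectRatio formula (L / l) is an assumption based on the description above, and EquivDiameter follows from the definition of a circle with the same area.

```python
import math

def derived_shape_features(area, perimeter, major_axis, minor_axis):
    """Recompute derived shape features from the basic measurements."""
    equiv_diameter = math.sqrt(4.0 * area / math.pi)  # diameter of a circle with the same area
    return {
        "AspectRatio": major_axis / minor_axis,             # assumed to be L / l
        "EquivDiameter": equiv_diameter,
        "Roundness": 4.0 * math.pi * area / perimeter ** 2,  # (4*pi*A) / P^2
        "Compactness": equiv_diameter / major_axis,          # Ed / L
    }

# A perfect circle of radius 10 should have roundness and compactness equal to 1.
r = 10.0
circle = derived_shape_features(math.pi * r ** 2, 2 * math.pi * r, 2 * r, 2 * r)
print(circle)
```

Values far from those stored in the dataset would suggest that a column uses a different definition than the one assumed here.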


<img src="https://www.thespruceeats.com/thmb/eeIti36pfkoNBaipXrTHLjIv5YA=/1888x1416/smart/filters:no_upscale()/DriedBeans-56f6c2c43df78c78418c3b46.jpg" alt="Dried beans" style='width: 800px;height:400px'>

# 1: Import Libraries

In [118]:
# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')

In [119]:
# for basic mathematical operations
import numpy as np
import pandas as pd
from pandas import plotting

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
from sklearn.metrics import ConfusionMatrixDisplay
from mlxtend.plotting import plot_confusion_matrix

# for evaluation metrics
# (sklearn.metrics.plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it)
from sklearn import metrics
from sklearn.metrics import accuracy_score, r2_score, confusion_matrix, classification_report, f1_score, make_scorer

# for learning models
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score,train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler, LabelEncoder,RobustScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, VotingClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb
from xgboost import XGBClassifier, XGBRFClassifier
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMClassifier

# for path
import os

# 2: Reading the Dataset

In [120]:
dataset_path = '../input/dry-beans-classification-iti-ai-pro-intake02'
df = pd.read_csv(os.path.join(dataset_path, 'train.csv'))
print("The shape of the dataset is {}.\n\n".format(df.shape))

In [121]:
df.head(10)

# 3- Exploratory Data Analysis - EDA

In [122]:
# Shape or Size
df.shape

### 3.1 Check data types

In [123]:
#Dataset information
df.info()

**All features are numerical except the target (y, the bean class).**
<br>

### 3.2 Checking for null values

In [124]:
df.isna().sum()

**No missing values.**

### 3.3 Checking for duplicated values

In [125]:
df.duplicated().sum()

**No duplicated rows.**

In [126]:
df['y'].unique()

### 3.4 Display beans per type

In [127]:
print(df['y'].value_counts())
_ = sns.countplot(x='y', data=df)

**Number of instances per class: DERMASON has the most.**
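Because the classes are not equally frequent, it can help to look at shares and at the ratio between the largest and smallest class, not only raw counts. A small sketch, where the helper name `class_balance` is hypothetical and the toy labels are made up for illustration (they are not the real class counts):

```python
import pandas as pd

def class_balance(labels):
    """Return per-class counts and shares, plus the majority/minority count ratio."""
    counts = pd.Series(labels).value_counts()
    summary = pd.DataFrame({"count": counts, "share": counts / counts.sum()})
    imbalance_ratio = counts.max() / counts.min()
    return summary, imbalance_ratio

# Toy labels for illustration only.
toy = ["DERMASON"] * 6 + ["SIRA"] * 3 + ["BOMBAY"] * 1
summary, ratio = class_balance(toy)
print(summary)
print("imbalance ratio:", ratio)
```

On the notebook's data the same helper could be called as `class_balance(df['y'])`.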

# 4- Data Visualization
**Heatmap**
### 4.1 Check correlation between features

In [128]:
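# numeric_only=True skips the string column 'y'; recent pandas versions raise an error without it
corr_matrix = df.corr(numeric_only=True)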

plt.figure(figsize=(15,15))
plt.title('Correlation Heatmap of Beans Dataset')
a = sns.heatmap(corr_matrix, cmap='Blues', square=True, annot=True, fmt='.2f', linecolor='black')
a.set_xticklabels(a.get_xticklabels(), rotation=30)
a.set_yticklabels(a.get_yticklabels(), rotation=30)
plt.show()

From this correlation matrix we can extract the features that are strongly correlated with each other:
- Area
- Perimeter
- MajorAxisLength
- MinorAxisLength
- ConvexArea
- EquivDiameter
- ShapeFactor1

Features to be dropped:

- ShapeFactor3
- Compactness
- AspectRation
- Area
- MajorAxisLength
- MinorAxisLength
- ConvexArea
- EquivDiameter
- ShapeFactor1
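Strongly correlated features like those listed above can also be found programmatically rather than by reading the heatmap. A minimal sketch, assuming an all-numeric DataFrame; the function name `highly_correlated_pairs` and the 0.9 threshold are illustrative choices, not taken from the notebook:

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: "b" is almost a multiple of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
pairs = highly_correlated_pairs(demo)
print(pairs)
```

For each reported pair, one of the two features is a candidate for removal; which one to keep remains a judgment call.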

### 4.2 Pair Plot of values in each feature

In [129]:
Strongly_corr_features = df[["Area","Perimeter","AspectRation","Eccentricity","roundness","Compactness","y"]]
Strongly_corr_features.head()
sns.set_theme(style="whitegrid")
sns.pairplot(Strongly_corr_features, hue="y")

**From the plots above, linear and logarithmic relationships between features can be seen.**
<br>
**The next step is to look at how the bean classes are affected by the individual features.**

### 4.3 Display distribution of values in each feature

In [130]:
Numeric_cols = df.drop(columns=['y', 'ID']).columns

fig, ax = plt.subplots(4, 4, figsize=(15, 12))
for variable, subplot in zip(Numeric_cols, ax.flatten()):
    g=sns.histplot(df[variable],bins=30, kde=True, ax=subplot)
    g.lines[0].set_color('crimson')
    g.axvline(x=df[variable].mean(), color='m', label='Mean', linestyle='--', linewidth=2)
plt.tight_layout()

### 4.4  Check for outliers

In [131]:
Numeric_cols = df.drop(columns=['y', 'ID']).columns
fig, ax = plt.subplots(8, 2, figsize=(15, 25))

for variable, subplot in zip(Numeric_cols, ax.flatten()):
    sns.boxplot(x=df['y'], y= df[variable], ax=subplot)
plt.tight_layout()

- A perimeter is a path that surrounds a shape, or the length of that path (Wikipedia).
- The box plots above show that BOMBAY has the highest perimeter.

In [132]:
fig, ax = plt.subplots(4, 4, figsize=(15, 12))

for variable, subplot in zip(Numeric_cols, ax.flatten()):
    sns.boxplot(y= df[variable], ax=subplot)
plt.tight_layout()
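The points drawn beyond the whiskers in box plots are conventionally those outside 1.5 times the interquartile range. The same rule can be applied numerically to count outliers per feature. A small sketch, where `iqr_outlier_mask` is a hypothetical helper and the sample values are made up:

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

sample = np.array([10, 11, 12, 13, 14, 15, 100])
mask = iqr_outlier_mask(sample)
print(sample[mask])  # only the value 100 lies outside the fences
```

Applied to the notebook's data, something like `iqr_outlier_mask(df['Area']).sum()` would give the number of flagged rows for one column.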

# 5- Feature Engineering

In [133]:
df.describe(percentiles=[.25, .5, .75, 0.995]).T

**Features such as** Eccentricity, Extent, Solidity, roundness, Compactness, and ShapeFactor1-4 **range between 0 and 1.**

**Other features have much larger ranges:**
- Area ranges between 20420 and 254616
- ConvexArea ranges between 20684 and 263261

When columns are on very different scales, trends and patterns become hard to compare, and many models are sensitive to feature magnitude. The columns should therefore be transformed so that all values fall on a comparable scale. This process is called scaling.
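To make the idea concrete, here is a sketch of two common scaling formulas written out with NumPy: min-max scaling, and median/IQR scaling (the idea behind scikit-learn's `RobustScaler`). The function names are hypothetical; the two endpoint values come from the Area range quoted above, and the middle values are made up.

```python
import numpy as np

def min_max_scale(x):
    """Map values linearly onto [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def robust_scale(x):
    """Center on the median and divide by the interquartile range."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return (x - med) / (q3 - q1)

# Endpoints from the Area range above; the two middle values are invented.
area = np.array([20420.0, 40000.0, 60000.0, 254616.0])
print(min_max_scale(area))
print(robust_scale(area))
```

Min-max scaling is strongly affected by the single largest value, while the median/IQR version is less sensitive to extreme points.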

### 5.1 Data Splitting

Now it's time to split the dataset for the training step. Typically a dataset is split into three subsets: training, validation, and test. In our case the test set is predefined, so we split the provided training data into training and validation sets. The code below holds out 6% of the rows for validation (`test_size=0.06`), stratified by class.


In [134]:
features = df.drop(columns=['y', 'ID', 'Extent']).values
labels = df['y'].values
features_train, features_val, labels_train, labels_val = train_test_split(features, labels, test_size=0.06, random_state=0, stratify = labels, shuffle=True)

In [135]:
print(features_train.shape, labels_train.shape, features_val.shape, labels_val.shape)
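The `stratify` argument keeps class proportions roughly equal in both splits, which matters when classes are imbalanced. A small illustration on toy data (the labels and the 80/20 class mix are invented for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 80% class "A", 20% class "B".
y = np.array(["A"] * 80 + ["B"] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

share_b_train = np.mean(y_tr == "B")
share_b_val = np.mean(y_va == "B")
print(share_b_train, share_b_val)  # both splits keep the 20% share of "B"
```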

### 5.2 Display beans class distribution

In [136]:
plt.figure(figsize = (7, 5))
plt.bar((pd.DataFrame(labels_train).groupby(0).size()).index, (pd.DataFrame(labels_train).groupby(0).size()))
plt.xlabel("Bean type")
plt.ylabel("Count")
plt.title("Number of beans per type in training data")
plt.show()

In [137]:
plt.figure(figsize = (7, 5))
plt.bar((pd.DataFrame(labels_val).groupby(0).size()).index, (pd.DataFrame(labels_val).groupby(0).size()))
plt.xlabel("Bean type")
plt.ylabel("Count")
plt.title("Number of beans per type in validation data")
plt.show()

### 5.3  Normalize feature values

In [138]:
# sc = MinMaxScaler()
# df_scaled = sc.fit_transform(features_train)
# features_train = pd.DataFrame(df_scaled , columns= df.columns.difference(['ID','y']))
# #features_train = features_train.drop(columns=['ShapeFactor1','ShapeFactor3','EquivDiameter','Area','Perimeter','AspectRation'])

# df_scaled = sc.fit_transform(features_val)
# features_val = pd.DataFrame(df_scaled , columns= df.columns.difference(['ID','y']))
# #features_val = features_val.drop(columns=['ShapeFactor1','ShapeFactor3','EquivDiameter','Area','Perimeter','AspectRation'])

In [139]:
#features_train.min()

In [140]:
#features_train.max()

In [141]:
#features_train.describe(percentiles=[.25, .5, .75, 0.995]).T

In [142]:
#features_val.describe(percentiles=[.25, .5, .75, 0.995]).T

In [143]:
#features_val.min()

In [144]:
#features_val.max()
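If manual scaling like the commented-out cells above is used, one common approach is to fit the scaler on the training split only and reuse its parameters for the validation split, so that no information from validation data leaks into preprocessing. A minimal sketch with toy arrays standing in for `features_train` and `features_val`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy arrays standing in for features_train / features_val.
train = np.array([[0.0, 100.0], [5.0, 200.0], [10.0, 300.0]])
val = np.array([[2.5, 150.0], [12.0, 400.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learn min/max from training data only
val_scaled = scaler.transform(val)          # reuse the training min/max

print(train_scaled)
print(val_scaled)  # values outside [0, 1] can occur for unseen ranges
```

Wrapping the scaler in a pipeline, as the models in the next section do, achieves the same separation automatically.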

### 5.4 Encode labels (for Ensemble algorithm)

In [145]:
le = LabelEncoder()
labels_train_encoded = le.fit_transform(labels_train)
labels_val_encoded = le.transform(labels_val)  # reuse the mapping learned from the training labels
print(labels_train_encoded)
print(labels_val_encoded)
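A label encoder maps class names to integers in sorted order. Fitting it once and only transforming other splits keeps the mapping consistent, and `inverse_transform` recovers the names later. A short illustration with a made-up subset of the class names:

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["SIRA", "BOMBAY", "SEKER", "BOMBAY"]
val_labels = ["SEKER", "SIRA"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_labels)  # learns the sorted class list
val_codes = encoder.transform(val_labels)          # reuses the same mapping

print(list(encoder.classes_))
print(train_codes, val_codes)
print(list(encoder.inverse_transform(val_codes)))
```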

# 6- Model Training

### 6.1 Logistic Regression

In [146]:
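# NOTE: this assignment rebinds the name LogisticRegression from the class to the fitted pipeline,
# so re-running this cell without re-importing the class will fail.
LogisticRegression = make_pipeline(RobustScaler(),
                    PCA(n_components=13, whiten=True),
                    LogisticRegression(max_iter=4000000,C=15,solver='sag')).fit(features_train, labels_train_encoded)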

print("Logistic Regression Training F1-scores:", f1_score(labels_train_encoded, LogisticRegression.predict(features_train), average='micro'))
print("Logistic Regression Validation F1-scores:", f1_score(labels_val_encoded, LogisticRegression.predict(features_val), average='micro'))
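For single-label multiclass problems, the micro-averaged F1 score used throughout this section equals plain accuracy, while the macro average weights every class equally and so reacts more to errors on rare classes. A small check on made-up labels:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: class 0 is frequent, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2]

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true, y_pred)
print(micro, macro, acc)
```

Reporting the macro average alongside the micro average can reveal whether a model does poorly on the smaller classes.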

### 6.2 Random Forest Classifier

In [147]:
randomClassifier =  make_pipeline(RobustScaler(),
                    PCA(n_components=13, whiten=True),
                    RandomForestClassifier(n_estimators = 200)).fit(features_train, labels_train_encoded)
#RandomForestClassifier(max_depth = 10, n_estimators = 100, random_state = 42).fit(features_train, labels_train_encoded)
print("Random Forest Training F1-scores:", f1_score(labels_train_encoded, randomClassifier.predict(features_train), average='micro'))
print("Random Forest Validation F1-scores:", f1_score(labels_val_encoded, randomClassifier.predict(features_val), average='micro'))

### 6.3 AdaBoost Classifier

In [148]:
# params_ada = {
#     "n_estimators": [5, 10, 15, 20],
#     "learning_rate": [0.4, 0.6, 0.8, 1.0]
# }

In [149]:
#gs_ada = GridSearchCV(AdaBoostClassifier(base_estimator = RandomForestClassifier(max_depth = 10, n_estimators = 100, random_state = 42)), param_grid = params_ada, scoring = make_scorer(f1_score , average = "micro"), cv = 4, n_jobs = -1).fit(features_train, labels_train_encoded)

In [150]:
#gs_ada.cv_results_

In [151]:
#gs_ada.best_params_

In [152]:
# Note: scikit-learn 1.2 and later rename `base_estimator` to `estimator`.
adaClassifier =  make_pipeline(RobustScaler(),
                               PCA(n_components=13,whiten=True),
                               AdaBoostClassifier(base_estimator = RandomForestClassifier(n_estimators = 200), n_estimators = 200, learning_rate = 0.01)).fit(features_train, labels_train_encoded)
#AdaBoostClassifier(base_estimator = RandomForestClassifier(max_depth = 10, n_estimators = 100, random_state = 42), n_estimators = 20, learning_rate = 1.0, random_state = 42).fit(features_train, labels_train_encoded)
train_predicted = adaClassifier.predict(features_train)
val_predicted = adaClassifier.predict(features_val)
print("AdaBoost Training F1-scores:", f1_score(labels_train_encoded, train_predicted, average='micro'))
print("AdaBoost Validation F1-scores:", f1_score(labels_val_encoded, val_predicted, average='micro'))

### 6.4 XGBoost Classifier

In [153]:
xgbClassifier = make_pipeline(RobustScaler(),
                              PCA(n_components=13,whiten=True),
                              XGBClassifier(n_estimators=200,learning_rate=0.07)).fit(features_train, labels_train_encoded)
#XGBClassifier(learning_rate=0.07, random_state =42, objective='multi:softproba', max_depth=5, reg_alpha = 0.002, gamma=0.01, verbosity=0).fit(features_train, labels_train_encoded)
train_predicted = xgbClassifier.predict(features_train)
val_predicted = xgbClassifier.predict(features_val)
print("XGB Training F1-scores:", f1_score(labels_train_encoded, train_predicted, average='micro'))
print("XGB Validation F1-scores:", f1_score(labels_val_encoded, val_predicted, average='micro'))

### 6.5 Voting Classifier

In [154]:
votingClassifier = VotingClassifier(estimators=[('logist', LogisticRegression), ('rf', randomClassifier), ('ada', adaClassifier), ('xgb', xgbClassifier)], voting='hard').fit(features_train, labels_train_encoded)
train_predicted = votingClassifier.predict(features_train)
val_predicted = votingClassifier.predict(features_val)
print("Voting Training F1-scores:", f1_score(labels_train_encoded, train_predicted, average='micro'))
print("Voting Validation F1-scores:", f1_score(labels_val_encoded, val_predicted, average='micro'))
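Hard voting means each model casts one vote per sample and the most frequent class wins. The idea can be sketched with NumPy as follows; `hard_vote` is a hypothetical helper, the predictions are made up, and ties are broken here in favor of the lowest class index:

```python
import numpy as np

def hard_vote(predictions):
    """Majority vote over integer class predictions (one row per model)."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # Count votes per class for each sample (column); result has shape (n_classes, n_samples).
    votes = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    # argmax returns the first maximum, so ties go to the lowest class index.
    return votes.argmax(axis=0)

model_preds = [
    [0, 1, 2, 2],  # model A
    [0, 1, 1, 2],  # model B
    [1, 1, 2, 0],  # model C
]
print(hard_vote(model_preds))
```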

### 6.6 Validation Data Confusion Matrix

In [155]:
plt.figure(figsize = (8, 7))
sns.heatmap(confusion_matrix(labels_val_encoded, val_predicted),
            annot = True,
            fmt = ".0f",
            cmap = "coolwarm",
            linewidths = 2, 
            linecolor = "white",
            xticklabels = votingClassifier.classes_,
            yticklabels = votingClassifier.classes_)
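# confusion_matrix(y_true, y_pred) puts actual classes on rows (y-axis) and predictions on columns (x-axis)
plt.xlabel("Predicted")
plt.ylabel("Actual")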
plt.title("Confusion matrix on the validation data")
plt.show()
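In scikit-learn's `confusion_matrix(y_true, y_pred)`, rows correspond to the true classes and columns to the predicted ones. Building the matrix by hand makes that layout explicit and shows how per-class recall falls out of it. A sketch with made-up labels, where `confusion_counts` is a hypothetical helper:

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Entry [i, j] counts samples of true class i that were predicted as class j."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1
    return matrix

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_counts(y_true, y_pred, n_classes=3)
recall_per_class = cm.diagonal() / cm.sum(axis=1)  # row sums are the actual class counts
print(cm)
print(recall_per_class)
```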

# 7- Model Prediction 

We have built a model and we'd like to submit our predictions on the test set! In order to do that, we'll load the test set, predict the class and save the submission file. 


### 7.1  Read Test Data

In [158]:
dataset_path = '../input/dry-beans-classification-iti-ai-pro-intake02/'
df_test = pd.read_csv(os.path.join(dataset_path, 'test.csv'))
df_test= df_test.drop(columns=['Extent'])
df_test.head()

### 7.2 Prepare test features (scaling is applied inside the pipeline)

In [159]:
features_test = df_test.drop(columns = ['ID'])
#features_test = features_test.drop(columns=['ShapeFactor1','ShapeFactor3','EquivDiameter','Area','Perimeter','AspectRation', 'MajorAxisLength'])

### 7.3 Predict test labels

In [160]:
features_test_predicted = adaClassifier.predict(features_test)

# add y column to the test data
df_test['y'] = le.inverse_transform(features_test_predicted)

df_test.head()

# 8- Submission File Generation

In [161]:
df_test[['ID', 'y']].to_csv('/kaggle/working/submission.csv', index=False)