# Dry Bean

https://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset

## Data Set Information:

Seven different types of dry beans were used in this research, taking into account features such as form, shape, type, and structure, as determined by the market situation. A computer vision system was developed to distinguish seven registered varieties of dry beans with similar features in order to obtain uniform seed classification. For the classification model, images of 13,611 grains of the 7 registered varieties were taken with a high-resolution camera. The images were then subjected to segmentation and feature extraction, yielding a total of 16 features per grain: 12 dimensions and 4 shape forms.


## Attribute Information:

01. Area (A): The area of a bean zone and the number of pixels within its boundaries.
02. Perimeter (P): Bean circumference is defined as the length of its border.
03. Major axis length (L): The distance between the ends of the longest line that can be drawn from a bean.
04. Minor axis length (l): The longest line that can be drawn from the bean while standing perpendicular to the main axis.
05. Aspect ratio (K): Defines the relationship between L and l.
06. Eccentricity (Ec): Eccentricity of the ellipse having the same moments as the region.
07. Convex area (C): Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
08. Equivalent diameter (Ed): The diameter of a circle having the same area as a bean seed area.
09. Extent (Ex): The ratio of the bean area to the number of pixels in its bounding box.
10. Solidity (S): Also known as convexity. The ratio of the bean area to the number of pixels in its convex hull.
11. Roundness (R): Calculated with the following formula: $R = 4\pi A / P^2$
12. Compactness (CO): Measures the roundness of an object: $CO = Ed / L$
13. ShapeFactor1 (SF1)
14. ShapeFactor2 (SF2)
15. ShapeFactor3 (SF3)
16. ShapeFactor4 (SF4)
17. Class (Seker, Barbunya, Bombay, Cali, Dermason, Horoz and Sira)
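
The ratio-based attributes above can be derived from the four basic measurements. The sketch below is illustrative only: the function name `shape_descriptors` is hypothetical, and it assumes that the aspect ratio is `L / l` and that the equivalent diameter follows from the area of a circle, `Ed = sqrt(4A / pi)`. A perfect circle is used as a sanity check because its roundness and compactness should both equal 1.

```python
import math


def shape_descriptors(area, perimeter, major_axis, minor_axis):
    """Derive ratio-based shape features from the basic measurements."""
    # Ed: diameter of a circle that has the same area as the region
    equivalent_diameter = math.sqrt(4.0 * area / math.pi)
    return {
        "aspect_ratio": major_axis / minor_axis,             # K = L / l (assumed)
        "equivalent_diameter": equivalent_diameter,
        "roundness": 4.0 * math.pi * area / perimeter ** 2,  # R = 4*pi*A / P^2
        "compactness": equivalent_diameter / major_axis,     # CO = Ed / L
    }


# Sanity check on a circle of radius 10 (made-up values, not from the dataset)
r = 10.0
circle = shape_descriptors(
    area=math.pi * r ** 2,
    perimeter=2 * math.pi * r,
    major_axis=2 * r,
    minor_axis=2 * r,
)
print(circle)
```

Elongated shapes push roundness and compactness below 1, which is why these features help separate long varieties from round ones.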

In [1]:
import re  # Regex
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split, cross_val_score
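# `plot_confusion_matrix` is not imported: newer scikit-learn releases no longer
# provide it, and this notebook only uses `ConfusionMatrixDisplay`.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, accuracy_score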
from sklearn import preprocessing
import seaborn as sns
from matplotlib import pyplot as plt

In [2]:
# @formatter:off
%matplotlib inline
# @formatter:on

"""
Operations that involve randomness (e.g. sampling or data splitting) yield a
different result on each run. To keep results reproducible, pass a constant
value (`randomness_id`) as the `random_state` argument wherever it is accepted.
"""
randomness_id = 42
np.random.seed(randomness_id)
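
As a small illustration of the note above, the following sketch splits a toy array twice with the same `random_state` and checks that both splits match. The array `data` and the seed values are arbitrary and only serve the demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(20)  # toy data, stands in for any feature table

# The same random_state produces the same split on every call
first_train, first_test = train_test_split(data, random_state=42)
second_train, second_test = train_test_split(data, random_state=42)
same_split = np.array_equal(first_test, second_test)

# A different seed generally produces a different split
_, other_test = train_test_split(data, random_state=7)

print(same_split, first_test, other_test)
```

With the default `test_size`, a quarter of the rows go to the test split.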

In [None]:
original_df = pd.read_excel("./data/Dry_Bean_Dataset.xlsx")
original_df.shape[0]

In [None]:
original_df.head()

In [None]:
# Convert CamelCase column names to snake_case, e.g. "MajorAxisLength" -> "major_axis_length"
original_df.columns = [re.sub("(?!^)([A-Z]+)", r"_\1", name).lower() for name in original_df.columns]

In [None]:
original_df.info()

In [None]:
original_df.isnull().sum()

In [None]:
# sns.pairplot(original_df.sample(frac=0.1, random_state=randomness_id), hue="class")

In [None]:
plt.figure(figsize=(15, 8))
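# "class" still holds string labels at this point, so exclude it before
# computing correlations (recent pandas versions raise on non-numeric columns).
sns.heatmap(original_df.drop(columns="class").corr(), annot=True)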

In [None]:
# Remove redundant columns.
# Note: "aspect_ration" is the column name as spelled in the dataset.
original_df.drop(['convex_area', 'solidity', 'roundness', 'shape_factor4', 'extent', 'aspect_ration'], axis=1,
                 inplace=True)
# original_df = original_df[
#     ["area", "aspect_ration", "extent", "solidity", "roundness", "shape_factor1", "shape_factor2", "shape_factor3",
#      "shape_factor4", "class"]]

# Encode the string class labels as integers
label_encoder = preprocessing.LabelEncoder()
original_df["class"] = label_encoder.fit_transform(original_df["class"])

In [None]:
original_df.describe()

In [None]:
original_df["class"].value_counts()

In [None]:
# df = original_df.groupby('class', as_index=False).apply(lambda x: x.sample(original_df["class"].value_counts().min(), random_state=randomness_id)).reset_index(drop=True)
df = original_df.sample(frac=0.1, random_state=randomness_id)
# df = original_df
df["class"].value_counts()

In [None]:
X = df.drop("class", axis=1).copy()
y = df["class"].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=randomness_id)

# from sklearn.preprocessing import StandardScaler
# scaler_x = StandardScaler().fit(X_train)  # fit on the training split only
# X_train = pd.DataFrame(scaler_x.transform(X_train), index=X_train.index, columns=X_train.columns)
# X_test = pd.DataFrame(scaler_x.transform(X_test), index=X_test.index, columns=X_test.columns)

len(X_train), len(X_test)

In [None]:
X_train.describe()

In [None]:
suboptimal_clf_dt = DecisionTreeClassifier(random_state=randomness_id).fit(X_train, y_train)

# plt.figure(figsize=(15, 10))
# plot_tree(suboptimal_clf_dt, filled=True, rounded=True, feature_names=X.columns)

In [None]:
y_suboptimal_pred = suboptimal_clf_dt.predict(X_test)
fig, ax = plt.subplots(figsize=(10, 10))
ConfusionMatrixDisplay.from_predictions(y_test, y_suboptimal_pred, ax=ax)

In [None]:
accuracy_score(y_test, y_suboptimal_pred)

In [None]:
path = suboptimal_clf_dt.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas[:-1]

dts = []
for ccp_alpha in ccp_alphas:
    dt = DecisionTreeClassifier(random_state=randomness_id, ccp_alpha=ccp_alpha)
    dt.fit(X_train, y_train)
    dts.append(dt)

In [None]:
train_scores = [dt.score(X_train, y_train) for dt in dts]
test_scores = [dt.score(X_test, y_test) for dt in dts]

fig, ax = plt.subplots(figsize=(10, 7))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.plot(ccp_alphas, train_scores, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker="o", label="test", drawstyle="steps-post")
ax.legend()

In [None]:
# Cross-validate a single candidate alpha to see how much accuracy varies between folds
cv_dt = DecisionTreeClassifier(random_state=randomness_id, ccp_alpha=0.046)
scores = cross_val_score(cv_dt, X_train, y_train, cv=5)
tree_df = pd.DataFrame(data={"tree": range(5), "accuracy": scores})
tree_df.plot(x="tree", y="accuracy", marker="o", linestyle="--")

In [None]:
alpha_loop_values = []
for ccp_alpha in ccp_alphas:
    dt = DecisionTreeClassifier(random_state=randomness_id, ccp_alpha=ccp_alpha)
    scores = cross_val_score(dt, X_train, y_train, cv=5)
    alpha_loop_values.append([ccp_alpha, np.mean(scores), np.std(scores)])

alpha_results = pd.DataFrame(alpha_loop_values, columns=["alpha", "mean_accuracy", "std"])

alpha_results.plot(x="alpha", y="mean_accuracy", yerr="std", marker="o", linestyle="--")
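
Rather than reading the best alpha off the plot, it can also be picked from the cross-validation table. The sketch below shows one way to do that. It runs on synthetic data from `make_classification` instead of the bean dataset, and the names `X_demo`, `y_demo`, `results` and `best_alpha_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bean features (assumption: not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=42)

# Candidate alphas from the pruning path; the last one prunes down to the root
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)
# Clip to guard against tiny negative values caused by floating-point error
alphas = np.clip(path.ccp_alphas[:-1], 0.0, None)

rows = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    scores = cross_val_score(tree, X_tr, y_tr, cv=5)
    rows.append({"alpha": alpha, "mean_accuracy": scores.mean(), "std": scores.std()})
results = pd.DataFrame(rows)

# idxmax returns the first maximum, i.e. the smallest alpha among ties
best_row = results.loc[results["mean_accuracy"].idxmax()]
best_alpha_demo = float(best_row["alpha"])
print(best_alpha_demo, best_row["mean_accuracy"])
```

When several alphas tie, preferring the largest one instead would give a smaller tree at the same cross-validated accuracy.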

In [None]:
best_alpha = 0.01
pruned_dt = DecisionTreeClassifier(random_state=randomness_id, ccp_alpha=best_alpha).fit(X_train, y_train)

y_predicted = pruned_dt.predict(X_test)

fig, ax = plt.subplots(figsize=(10, 10))
ConfusionMatrixDisplay.from_predictions(y_test, y_predicted, ax=ax)

In [None]:
plt.figure(figsize=(15, 10))
plot_tree(pruned_dt, filled=True, rounded=True, feature_names=X.columns)
# plt.savefig('tree.eps', format='eps', bbox_inches="tight")

In [None]:
accuracy_score(y_test, y_predicted)

In [None]:
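# Evaluate on the full dataset. Note that it includes the rows the tree was
# trained on, so this accuracy is not a clean held-out estimate.
original_X_test = original_df.drop("class", axis=1).copy()
original_y_test = original_df["class"].copy()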

original_y_predicted = pruned_dt.predict(original_X_test)

fig, ax = plt.subplots(figsize=(10, 10))
ConfusionMatrixDisplay.from_predictions(original_y_test, original_y_predicted, ax=ax)

In [None]:
accuracy_score(original_y_test, original_y_predicted)