# Dry Beans Classification
Images of 13,611 grains of 7 different registered dry bean varieties were taken with a high-resolution camera. A total of 16 features (12 dimensions and 4 shape forms) were obtained from the grains.

The original study is from KOKLU, M. and OZKAN, I.A., (2020), “Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques.” Computers and Electronics in Agriculture, 174, 105507. DOI: https://doi.org/10.1016/j.compag.2020.105507. 

1. Main objective.
2. Dataset description.
3. Data exploration.
4. Clustering models.
5. Key findings.
6. Future steps.

## 1. Main objective

The analysis aims to group the beans into clusters and check whether those clusters match the original classes.

## 2. Dataset description

The dataset contains the following variables:

1. Area (A): The area of a bean zone and the number of pixels within its boundaries.
2. Perimeter (P): Bean circumference is defined as the length of its border.
3. Major axis length (L): The distance between the ends of the longest line that can be drawn from a bean.
4. Minor axis length (l): The longest line that can be drawn from the bean while standing perpendicular to the main axis.
5. Aspect ratio (K): Defines the relationship between L and l.
6. Eccentricity (Ec): Eccentricity of the ellipse having the same moments as the region.
7. Convex area (C): Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
8. Equivalent diameter (Ed): The diameter of a circle having the same area as a bean seed area.
9. Extent (Ex): The ratio of the pixels in the bounding box to the bean area.
10. Solidity (S): Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.
11. Roundness (R): Calculated with the following formula: (4πA)/(P^2)
12. Compactness (CO): Measures the roundness of an object: Ed/L
13. ShapeFactor1 (SF1)
14. ShapeFactor2 (SF2)
15. ShapeFactor3 (SF3)
16. ShapeFactor4 (SF4)
17. Class (Seker, Barbunya, Bombay, Cali, Dermason, Horoz and Sira)
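
The ratio-type features above can be recomputed from the four basic measurements. The sketch below is only an illustration: it assumes that the aspect ratio is K = L/l (the list only says it relates L and l) and that Ed = sqrt(4A/π), which follows from the definition of the equivalent diameter. The function name `shape_features` is hypothetical and is not part of the dataset or the original study.

```python
import math

def shape_features(area, perimeter, major_axis, minor_axis):
    """Derive ratio-type features from the four basic measurements."""
    # Ed: diameter of the circle that has the same area as the bean
    equiv_diameter = math.sqrt(4 * area / math.pi)
    return {
        "aspect_ratio": major_axis / minor_axis,            # K = L / l (assumed)
        "equiv_diameter": equiv_diameter,                   # Ed = sqrt(4A / pi)
        "roundness": 4 * math.pi * area / perimeter ** 2,   # R = 4*pi*A / P^2
        "compactness": equiv_diameter / major_axis,         # CO = Ed / L
    }

# Sanity check on a perfect circle of radius 10 (made-up input)
r = 10.0
circle = shape_features(math.pi * r ** 2, 2 * math.pi * r, 2 * r, 2 * r)
print(circle)
```

For a perfect circle the aspect ratio, roundness and compactness all come out at 1 (up to floating-point error), which is why these features are useful for telling round beans from elongated ones.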

## 3. Data exploration

### 3.1. Importing data

In [None]:
# Import python packages to be used
from IPython.display import display, HTML
display(HTML("<style>.container { width:70% !important; }</style>"))

import os
from pprint import pprint

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
# Import the data
dataset = pd.read_csv('../input/beans-classification/Dry_Bean_Dataset.csv')
dataset

### 3.2. Checking for ranges and invalid values

In [None]:
# Report the number of missing values in each column
COLUMNS = dataset.columns.tolist()
for c in COLUMNS:
    if dataset[c].isnull().values.any():
        print('{0}: {1} missing values found'.format(c, dataset[c].isnull().sum()))
    else:
        print('{0}: ok'.format(c))

In [None]:
dataset.dtypes

In [None]:
dataset.describe()

No rows are incomplete. The columns have very different scales, so they will have to be standardized before clustering.

Let's see if the categories are balanced.

In [None]:
dataset.Class.value_counts()

The classes are not well balanced.

In [None]:
# Plot a histogram of every feature (all columns except Class)
feature_cols = COLUMNS[:-1]
sns.set(style='darkgrid')
fig, ax_list = plt.subplots(nrows=4, ncols=4, sharey=False, figsize=(36,24))
ax_list = ax_list.flatten()
for name, ax in zip(feature_cols, ax_list):
    sns.histplot(dataset, x=name, bins=10, ax=ax).set(title=name)

There appear to be a few individuals whose solidity is considerably larger than the rest.

In [None]:
dataset.sort_values(['Solidity'], ascending=False)

### 3.3. Encoding

Let's encode our class labels.

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
dataset['Class'] = le.fit_transform(dataset.Class)
dataset['Class']

Let us see if there is any correlation between variables.

### 3.4. Correlation

In [None]:
# Calculate the correlation values
corr_values = dataset[feature_cols].corr()

# Simplify by emptying all the data below the diagonal
tril_index = np.tril_indices_from(corr_values)

# Make the unused values NaNs
for coord in zip(*tril_index):
    corr_values.iloc[coord[0], coord[1]] = np.nan
    
# Stack the data and convert to a data frame
corr_values = (corr_values
               .stack()
               .to_frame()
               .reset_index()
               .rename(columns={'level_0':'feature1',
                                'level_1':'feature2',
                                0:'correlation'}))

# Get the absolute values for sorting
corr_values['abs_correlation'] = corr_values.correlation.abs()

In [None]:
sns.set_context('talk')
sns.set_style('white')

ax = corr_values.abs_correlation.hist(bins=10, figsize=(12, 5))
ax.set(xlabel='Absolute Correlation', ylabel='Frequency');

# The most highly correlated values
corr_values.sort_values('abs_correlation', ascending=False).query('abs_correlation>0.8')

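Before continuing, we will reduce the skew of the feature columns (Yeo-Johnson for selected long-tailed columns, log1p for the columns with skewness above 0.8) and then standardize them to zero mean and unit variance with the standard scaler.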

In [None]:
feature_columns = [x for x in dataset.columns if x not in ['Class']]

skew_columns = (dataset[feature_columns].skew().sort_values(ascending=False))
skew_columns = skew_columns.loc[skew_columns > 0.8]
skew_columns

In [None]:
from scipy.stats import yeojohnson

# Apply the Yeo-Johnson transformation to the long-tailed columns
yeoj = dict()
yeoj_fields = ['ShapeFactor4', 'Solidity', 'Eccentricity', 'roundness']
for f in yeoj_fields:
    # yeojohnson returns the transformed data and the fitted lambda
    yeoj[f] = yeojohnson(dataset[f])
    dataset[f] = yeoj[f][0]
    print("{0} transformed with lambda {1}".format(f, yeoj[f][1]))

In [None]:
from sklearn.preprocessing import StandardScaler

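# Log-transform the columns flagged earlier as right-skewed (skewness > 0.8)
for col in skew_columns.index.tolist():
    dataset[col] = np.log1p(dataset[col])

# Standardize every feature to zero mean and unit variance
sc = StandardScaler()
dataset[feature_columns] = sc.fit_transform(dataset[feature_columns])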
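
To make the two preprocessing steps concrete, here is a minimal pure-Python sketch of the Yeo-Johnson formula for a fixed lambda, followed by z-score standardization. It is an illustration only, not the SciPy or scikit-learn implementation: `scipy.stats.yeojohnson` additionally estimates lambda by maximum likelihood when none is given. The sample values and the helper names `yeo_johnson` and `standardize` are made up for the example.

```python
import math
import statistics

def yeo_johnson(x, lmbda):
    """Yeo-Johnson power transform of a single value for a fixed lambda."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)
        return ((x + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log1p(-x)
    return -((-x + 1) ** (2 - lmbda) - 1) / (2 - lmbda)

def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

values = [1.0, 2.0, 4.0, 8.0, 100.0]                # made-up right-skewed sample
transformed = [yeo_johnson(v, 0) for v in values]   # lambda = 0 equals log1p for x >= 0
z = standardize(transformed)
print(min(z), max(z))
```

Two points are worth noting. For non-negative data, log1p is the special case lambda = 0 of Yeo-Johnson. Also, standardized values are centred on zero with unit spread, so they are not confined to the interval between zero and one.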

In [None]:
%matplotlib inline
plt.close()
sns.set_style('whitegrid')
sns.pairplot(dataset, hue='Class', height=3)
plt.show()

At first sight, it would seem that it will be possible to split our beans into different classes from their attributes alone. 

In [None]:
# Plot the feature histograms again after transformation and scaling
fig, ax_list = plt.subplots(nrows=4, ncols=4, sharey=False, figsize=(36,36))
ax_list = ax_list.flatten()
for name, ax in zip(feature_cols, ax_list):
    sns.histplot(dataset, x=name, bins=10, ax=ax).set(title=name)

In [None]:
dataset

## 4. Clustering models

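Cluster labels are arbitrary, so they do not line up with the class labels. The function below uses the Hungarian algorithm to find the one-to-one pairing of rows and columns in a confusion matrix that maximizes the number of matched observations, and then reorders the columns so that the matched counts lie on the diagonal.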

In [None]:
!pip install munkres

import sys
from munkres import Munkres

def rearrange_confusion_matrix(df):
    """Reorder the columns of a square confusion matrix so that the best
    one-to-one match between rows and columns lies on the diagonal."""
    matrix = df.to_numpy().tolist()

    # Munkres minimizes cost, so convert counts to costs to maximize matches
    cost_matrix = [[sys.maxsize - value for value in row] for row in matrix]
    indexes = Munkres().compute(cost_matrix)

    # Sum the counts of the matched (row, column) pairs
    total = sum(matrix[row][column] for row, column in indexes)
    print('Correctly classified =', total)

    # Place the column matched to row i at position i
    col_order = [column for _, column in sorted(indexes)]
    rearranged = pd.DataFrame(matrix)[col_order]

    print('Accuracy: {0}'.format(total / rearranged.stack().sum()))
    return rearranged
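
To show what the assignment step optimizes, the sketch below solves the same problem by brute force on a made-up 3x3 matrix: it tries every column order and keeps the one with the largest diagonal sum. This is not the Hungarian algorithm itself, only an exhaustive search over the same objective, and the name `best_assignment` is hypothetical. It is practical only for small matrices (a 7x7 matrix has 5,040 column orders).

```python
from itertools import permutations

def best_assignment(counts):
    """Return (column order, matched total) that maximizes the diagonal sum.

    Exhaustive search over all column permutations of a square matrix.
    """
    n = len(counts)
    best_order, best_total = None, -1
    for order in permutations(range(n)):
        total = sum(counts[row][order[row]] for row in range(n))
        if total > best_total:
            best_order, best_total = order, total
    return list(best_order), best_total

# Made-up example: rows are true classes, columns are arbitrary cluster ids
counts = [
    [5, 90, 5],
    [80, 10, 10],
    [3, 7, 90],
]
order, matched = best_assignment(counts)
rearranged = [[row[c] for c in order] for row in counts]
accuracy = matched / sum(map(sum, counts))
print(order, matched, round(accuracy, 3))
```

Here class 0 is paired with cluster 1, class 1 with cluster 0, and class 2 with cluster 2, so 260 of the 300 observations fall on the diagonal after reordering.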

### 4.1. K-means clustering model
- Fit a K-means clustering model with seven clusters, one per class.
- Examine the clusters by counting each class in each cluster.

In [None]:
from sklearn.cluster import KMeans
method = 'kmeans'

km = KMeans(n_clusters=7, random_state=42)
km = km.fit(dataset[feature_columns])
dataset[method] = km.predict(dataset[feature_columns])
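
For intuition about what the fit above does, here is a minimal pure-Python sketch of the basic K-means (Lloyd) iteration on made-up 2-D points. It is a simplified illustration under stated assumptions, not scikit-learn's implementation: the starting centroids are given by hand, whereas `KMeans` uses k-means++ initialization by default. The function name `lloyd_kmeans` is hypothetical.

```python
import math

def lloyd_kmeans(points, centroids, n_iter=10):
    """Alternate between assigning points to the nearest centroid and
    recomputing each centroid as the mean of its assigned points."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: mean of the members of each cluster
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # leave an empty cluster's centroid as is
        if new_centroids == centroids:
            break  # converged: the centroids no longer move
        centroids = new_centroids
    return labels, centroids

# Two well-separated made-up groups and hand-picked starting centroids
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = lloyd_kmeans(points, [(0, 0), (10, 10)])
print(labels, centroids)
```

Because every step uses Euclidean distance, features on larger scales would dominate the result, which is the reason for standardizing the columns beforehand.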

In [None]:
# Pair each bean's true class with its assigned cluster
confusion = pd.DataFrame({'Class': dataset['Class'], 'Clusters': dataset[method]})
# Cross-tabulate classes (rows) against clusters (columns), matching the axis labels below
ct_km = pd.crosstab(confusion['Class'], confusion['Clusters'])
ct_km = rearrange_confusion_matrix(ct_km)

# Plot confusion matrix
_, ax = plt.subplots(figsize=(5,5))
ax = sns.heatmap(ct_km, annot=True, fmt='d')
ax.set_ylabel('Classes', fontsize=20)
ax.set_xlabel('Clusters', fontsize=20);

Classes 5 and 6 get partly confused.

### 4.2. Agglomerative clustering model (ward)
- Fit an agglomerative clustering model with seven clusters, one per class.
- Examine the clusters by counting each class in each cluster.

In [None]:
from sklearn.cluster import AgglomerativeClustering
method = 'agglom-ward'

# fit_predict fits the model and returns the cluster labels in one step
ag = AgglomerativeClustering(n_clusters=7, linkage='ward', compute_full_tree=True)
dataset[method] = ag.fit_predict(dataset[feature_columns])

In [None]:
# Pair each bean's true class with its assigned cluster
confusion = pd.DataFrame({'Class': dataset['Class'], 'Clusters': dataset[method]})
# Cross-tabulate classes (rows) against clusters (columns)
ct_ag = pd.crosstab(confusion['Class'], confusion['Clusters'])
ct_ag = rearrange_confusion_matrix(ct_ag)

# Plot confusion matrix
_, ax = plt.subplots(figsize=(5,5))
ax = sns.heatmap(ct_ag, annot=True, fmt='d')  
ax.set_ylabel('Classes', fontsize=20);
ax.set_xlabel('Clusters', fontsize=20)

Classes 0 and 3, as well as classes 0 and 2, remain partly confused.

### 4.3. Agglomerative clustering model (complete)
- Fit an agglomerative clustering model with seven clusters, one per class.
- Examine the clusters by counting each class in each cluster.

In [None]:
method = 'agglom-complete'

# fit_predict fits the model and returns the cluster labels in one step
agc = AgglomerativeClustering(n_clusters=7, linkage='complete', compute_full_tree=True)
dataset[method] = agc.fit_predict(dataset[feature_columns])

In [None]:
# Pair each bean's true class with its assigned cluster
confusion = pd.DataFrame({'Class': dataset['Class'], 'Clusters': dataset[method]})
# Cross-tabulate classes (rows) against clusters (columns)
ct_agc = pd.crosstab(confusion['Class'], confusion['Clusters'])
ct_agc = rearrange_confusion_matrix(ct_agc)

# Plot confusion matrix
_, ax = plt.subplots(figsize=(5,5))
ax = sns.heatmap(ct_agc, annot=True, fmt='d')  
ax.set_ylabel('Classes', fontsize=20);
ax.set_xlabel('Clusters', fontsize=20)

Classes 6 and 3 are almost merged in the same cluster, while cluster 2 has a mixture of classes 0, 2, and 6.

### 4.4. Agglomerative clustering model (single)
- Fit an agglomerative clustering model with seven clusters, one per class.
- Examine the clusters by counting each class in each cluster.

In [None]:
method = 'agglom-single'

# fit_predict fits the model and returns the cluster labels in one step
ags = AgglomerativeClustering(n_clusters=7, linkage='single', compute_full_tree=True)
dataset[method] = ags.fit_predict(dataset[feature_columns])

In [None]:
# Pair each bean's true class with its assigned cluster
confusion = pd.DataFrame({'Class': dataset['Class'], 'Clusters': dataset[method]})
# Cross-tabulate classes (rows) against clusters (columns)
ct_ags = pd.crosstab(confusion['Class'], confusion['Clusters'])
ct_ags = rearrange_confusion_matrix(ct_ags)

# Plot confusion matrix
_, ax = plt.subplots(figsize=(5,5))
ax = sns.heatmap(ct_ags, annot=True, fmt='d')  
ax.set_ylabel('Classes', fontsize=20);
ax.set_xlabel('Clusters', fontsize=20)

Almost all observations were merged into a single cluster, so this method is clearly not appropriate for this dataset.
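
One plausible explanation, not verified on this dataset, is the chaining behaviour of single linkage: it measures the distance between two clusters by their closest pair of points, so a large cluster keeps absorbing near neighbours while distant points end up as tiny clusters. The sketch below compares the single and complete criteria on made-up 1-D points. The helper names are hypothetical and the code only evaluates the two distance definitions; it does not run a full agglomerative clustering.

```python
def single_linkage(a, b):
    """Cluster distance = smallest pairwise distance."""
    return min(abs(x - y) for x in a for y in b)

def complete_linkage(a, b):
    """Cluster distance = largest pairwise distance."""
    return max(abs(x - y) for x in a for y in b)

# Made-up 1-D data: two halves of a chain of points, plus one distant point
chain_left = [0.0, 1.0, 2.0]
chain_right = [3.0, 4.0, 5.0]
outlier = [7.5]

for name, linkage in [("single", single_linkage), ("complete", complete_linkage)]:
    print(name,
          "left-right:", linkage(chain_left, chain_right),
          "right-outlier:", linkage(chain_right, outlier))
```

Under single linkage the two halves of the chain are the closest pair (1.0 against 2.5), so they merge first and the distant point stays on its own. Under complete linkage the order flips (5.0 against 4.5), which keeps clusters more compact.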

## 5. Key findings

1. The K-means method was the most accurate of those attempted.
2. Each algorithm numbers its clusters in its own way, so cluster labels do not line up with class labels. By rearranging each confusion matrix and finding the diagonal with the highest values, it was possible to see which clusters represented which classes and to compute the accuracy of each model.

## 6. Future steps

1. The mean shift method was attempted, but the computation time was very long. Adding this method would be more likely to succeed if PCA is applied first to reduce the number of features.

2. A check for overfitting should be included, for example with train/test splits or some other method.
