# **STINTSY Machine Learning Project: Pumpkin Seeds Dataset**

**STINTSY S11 - Milky Way** \
*Group Members:*
- Gutierrez, Mark Daniel
- Refuerzo, Lloyd Dominic
- Romblon, Kathleen Mae
- Stinson, Audrey Lauren

## **1** | **Introduction**

The pumpkin plant belongs to the Cucurbitaceae family and has seasonal varieties. Confectionery pumpkins, grown in Turkey, are usually produced from the pumpkin species, Cucurbita pepo L and sometimes from the Cucurbita moschata Duchesne type. Pumpkin seeds are considered as important for human health because it contains 37 percent of carbohydrate, 35 percent to 40 percent of fat and protein along with calcium, potassium, phosphorus, magnesium, iron, and zinc. Pumpkins are divided into many types, and one of these species is known as “Urgup Sivrisi”. Urgup Sivrisi is a type of pumpkin seed that has a long, white, very bright, thin, and hardly distinguishable shell with a pointed tip. The other type of pumpkin seeds is “Cercevelik”. It is a particular species grown in Turkey, Nevsehir, Karacaoren, and known as “Topak” in Turkey. <span style="color:#42adf5">(*taken directly from* Details *section of the Pumpkin Seeds Dataset pdf*)</span>

The target task for this dataset is to correctly classify whether an image of a pumpkin seed is of the species type "Urgup Sivrisi" or "Cercevelik". The dataset then offers a <span style="color:#f5b942">classification problem</span> that the group will address through the use of various machine learning models, namely **k-Nearest Neighbors**, **Decision Trees**, and **Convolutional Neural Networks (CNNs) / Logistic Regression**. <span style="color:red">(note/to delete/todo: pick between logreg and cnns)</span>

## **2** | **About the Dataset**

The dataset, collected by Koklu et al. (2021), contains features extracted from 2500 images of two varieties of pumpkin seeds, Urgup Sivrisi and Cercevelik. The images were taken inside a product shooting box to keep outside light from casting shadows. To process the original RGB images, they were converted to gray-toned images and then to binary images, which simplifies the value of each pixel. Because the features are measured on these binary images, shadows can make the acquired size and shape of the seed appear smaller.

From the image binarization of each of the 2500 images, 12 features were extracted for each instance. The extracted features are based on the shape of the pumpkin seeds, where each pixel in the image was calculated while considering the values of other nearby pixels.

As such, the extracted features are as follows:

1. <span style="color:#f5b942">Area:</span> Number of pixels within the borders of a pumpkin seed
2. <span style="color:#f5b942">Perimeter:</span> Circumference in pixels of a pumpkin seed
3. <span style="color:#f5b942">Major Axis Length:</span> Large axis distance of a pumpkin seed
4. <span style="color:#f5b942">Minor Axis Length:</span> Small axis distance of a pumpkin seed
5. <span style="color:#f5b942">Convex Area:</span> Number of pixels of the smallest convex shell at the region formed by the pumpkin seed
6. <span style="color:#f5b942">Equiv Diameter:</span> Computed as $\sqrt{4a/\pi}$, where *a* is the area of the pumpkin seed
7. <span style="color:#f5b942">Eccentricity:</span> Eccentricity of a pumpkin seed
8. <span style="color:#f5b942">Solidity:</span> Convex condition of the pumpkin seeds
9. <span style="color:#f5b942">Extent:</span> Ratio of a pumpkin seed area to the bounding box pixels
10. <span style="color:#f5b942">Roundness:</span> Ovality of pumpkin seeds without considering the distortion of the edges
11. <span style="color:#f5b942">Aspect Ration:</span> Aspect ratio of the pumpkin seeds
12. <span style="color:#f5b942">Compactness:</span> Proportion of the area of the pumpkin seed relative to the area of the circle with the same circumference
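The Equiv Diameter definition in the list above can be checked numerically. The sketch below uses a hypothetical helper, `equiv_diameter` (our own name, not part of the dataset or its tooling), that applies $\sqrt{4a/\pi}$ to an area given in pixels. For a perfect disc, the result should equal the diameter of the disc.

```python
import math

def equiv_diameter(area: float) -> float:
    """Diameter of the circle that has the same area as the seed region."""
    return math.sqrt(4.0 * area / math.pi)

# Sanity check on a made-up shape: a disc of radius 50 px has area
# pi * 50^2, so its equivalent diameter should be (approximately) 100 px.
disc_area = math.pi * 50 ** 2
print(equiv_diameter(disc_area))
```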

## **3** | **List of Requirements**

The following cell imports the libraries needed to run the notebook:

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

import matplotlib.pyplot as plt
import seaborn as sns

## **4** | **Data Preprocessing**

### *Importing the dataset*

For the following cell, we will be using the `read_csv` function to import the pumpkin seeds dataset to our notebook.

In [None]:
data = pd.read_csv('pumpkin_seeds.csv')

To check that we have imported the data, we can take a look at the first and last 10 instances in the dataset.

In [None]:
data.head(10)

In [None]:
data.tail(10)

In [None]:
# (no. of instances, no. of columns [features + target label])
data.shape

The dataset consists of 13 columns, where the first 12 columns are the input features and the last column is the target label. There are 2500 instances in total, and the shape of the dataset is (2500, 13).

Optionally, we can see the statistical summary of the dataset by calling the `describe()` function.

In [None]:
data.describe()

### *Data Cleaning*

As we can see, one of the features, `Aspect_Ration`, is spelled incorrectly. To fix this, we will rename that column to `Aspect_Ratio`.

In [None]:
#renaming and reformatting the features
data.columns = ['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length', 'Convex_Area', 'Equiv_Diameter',
                'Eccentricity', 'Solidity', 'Extent', 'Roundness', 'Aspect_Ratio', 'Compactness', 'Class']

The class labels can be renamed so that they only include letters from the English alphabet. This is done by running the following cell block.

In [None]:
# Trim whitespace, normalize capitalization, then replace the Turkish
# spellings with ASCII-only class names in a single pass.
data['Class'] = data['Class'].str.strip().str.title().replace({
    'Çerçevelik': 'Cercevelik',
    'Ürgüp Sivrisi': 'Urgup Sivrisi',
})

Next, we can check whether the `Class` column contains any other spellings of the labels by calling the `unique()` function on it. Since there are only two unique values, we do not need to make any further changes to this column for now.

In [None]:
data['Class'].unique()

Using the `info()` function, we can check whether any feature has an incorrect datatype. A column whose values have inconsistent types is likely to be assigned the `object` datatype. Note that columns we would usually expect to hold strings are also commonly stored with the `object` datatype in pandas.

In [None]:
data.info()

Only one column has the `object` datatype, which is the `Class` column. Since we have already queried its unique values in the previous step, we know that there are no inconsistencies in this column, so we can keep it as is.

Next, we need to check if there are any missing values or instances where a default value has been assigned. In a pandas dataframe, these are usually represented as `None` or `NaN`, so we can query for any null values in our dataset. Additionally, we can also check if there are any duplicated instances.

In [None]:
display(data.isnull().sum())
display(data.duplicated().sum())

Now that we have finished checking the data, we can now split the data into features (X) and the target label (y).

In [None]:
#split features and label
X = data.drop(columns=['Class']).values
y = data['Class'].values

Some models may require that our features are normalized, so we'll define a normalized variable, `X_norm`, using the `fit_transform()` method of the `MinMaxScaler` class.

In [None]:
#normalize
scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)
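To make the effect of min-max scaling concrete, here is a small sketch on made-up values (not taken from the pumpkin seeds dataset). It compares the output of `MinMaxScaler` with the formula it applies per column by default, $(x - x_{min}) / (x_{max} - x_{min})$. The names `X_toy`, `X_scaled`, and `X_manual` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix (made-up values): two columns on very different scales.
X_toy = np.array([[100.0, 0.2],
                  [150.0, 0.4],
                  [200.0, 0.8]])

# Result from the library, using the default feature range of [0, 1].
X_scaled = MinMaxScaler().fit_transform(X_toy)

# Manual result: (x - column minimum) / (column maximum - column minimum).
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```

After scaling, every column spans the same range, so features measured in large units (such as pixel counts) no longer dominate distance computations.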

Our class labels can be represented numerically. To do this, we use the `fit_transform()` method of the `LabelEncoder` class on our target label `y` to represent **Cercevelik** as class `0` and **Urgup Sivrisi** as class `1`.

In [None]:
#Encode Cercevelik as 0, Urgup Sivrisi as 1
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

#Check if y is properly transformed
np.unique(y)
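The mapping of Cercevelik to `0` and Urgup Sivrisi to `1` follows from how `LabelEncoder` works: it sorts the distinct labels and numbers them in that order. The sketch below illustrates this on a short, made-up label array. The variable names are illustrative and separate from the notebook's own `label_encoder`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up label array, deliberately starting with the "later" class.
labels = np.array(["Urgup Sivrisi", "Cercevelik", "Cercevelik", "Urgup Sivrisi"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted labels; the position of a label is its integer code.
print(list(encoder.classes_))
print(encoded.tolist())

# inverse_transform maps integer codes back to the original class names.
print(list(encoder.inverse_transform([0, 1])))
```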

Similarly, for the purposes of our exploratory data analysis, we will also transform the `Class` column in our dataframe so that **Cercevelik** is represented as `0` and **Urgup Sivrisi** is represented as class `1`.

In [None]:
# Reuse the encoder fitted on y so both encodings share the same mapping
data['Class'] = label_encoder.transform(data['Class'])

data['Class'].unique()

## **5** | **Exploratory Data Analysis**

### Correlation Checking

Let’s explore the data and look for associations among the variables. First, we check the correlation between each feature and the class label.

In [None]:
data.corr()['Class'].sort_values()

As shown above, the four features most strongly correlated with Class are <b>Aspect_Ratio, Eccentricity, Major_Axis_Length and Perimeter</b>.

Next, we display the pairwise correlations between all variables.

In [None]:
corr = data.corr().round(2)
sns.heatmap(corr,cmap="rocket",annot=True)

With the `rocket` colormap, brighter cells indicate higher (more positive) correlation values, while the darkest cells indicate strongly negative ones.<br>
Notably, three features show a correlation of 1.00 (after rounding) with each other: Area, Convex Area, and Equiv Diameter. Other closely correlated features are Aspect Ratio and Eccentricity with 0.95 correlation, then Perimeter with Area, Convex Area, and Equiv Diameter with 0.93 correlation. Meanwhile, Compactness and Aspect Ratio are highly inversely correlated at -0.99, implying that compactness decreases as aspect ratio increases.<br>
Let’s plot some interesting patterns.
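Strongly correlated pairs can also be listed programmatically instead of being read off the heatmap. The following is a minimal sketch on a made-up toy dataframe (not the pumpkin seeds data). The 0.9 threshold and the names `toy`, `pairs`, and `strong_pairs` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy data: b is an exact linear function of a, c is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr_toy = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal) so that each
# pair of columns appears once; the other cells become NaN.
mask = np.triu(np.ones(corr_toy.shape, dtype=bool), k=1)
pairs = corr_toy.where(mask).stack()

# Keep pairs whose absolute correlation exceeds the chosen threshold.
# NaN entries, if any remain, compare as False and are filtered out here.
strong_pairs = pairs[pairs.abs() > 0.9]
print(strong_pairs)
```

Applied to the notebook's own correlation matrix, the same pattern would list pairs such as Compactness and Aspect_Ratio, since the filter uses the absolute value and therefore catches strong negative correlations as well.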

### Number of instances in each class

In [None]:
g = sns.catplot(data=data,x='Class',kind='count')
g.set_axis_labels("", "The number of seeds")
g.set_xticklabels(["Cercevelik", "Urgup Sivrisi"])

We can see that in the dataset, the number of instances classified as Çerçevelik seeds is slightly more than that of Ürgüp Sivrisi.

### Boxplot

We display the relationship between Class and the four features most strongly correlated with it.

In [None]:
# Boxplot
f = plt.figure(figsize=(12,8))

plt.subplot(2,2,1)
# Aspect_Ratio vs Class
a=sns.boxplot(data=data,x='Class',y='Aspect_Ratio')
a.set_xticklabels(["Cercevelik", "Urgup Sivrisi"])

plt.subplot(2,2,2)
# Eccentricity vs Class
b=sns.boxplot(data=data,x='Class',y='Eccentricity')
b.set_xticklabels(["Cercevelik", "Urgup Sivrisi"])

plt.subplot(2,2,3)
# Major_Axis_Length vs Class
c=sns.boxplot(data=data,x='Class',y='Major_Axis_Length')
c.set_xticklabels(["Cercevelik", "Urgup Sivrisi"])

plt.subplot(2,2,4)
# Perimeter vs Class
d=sns.boxplot(data=data,x='Class',y='Perimeter')
d.set_xticklabels(["Cercevelik", "Urgup Sivrisi"])

Ürgüp Sivrisi has a higher median in all 4 features. Since these features are related to shape and size, this may imply that Ürgüp Sivrisi seeds are generally bigger and more elongated than Çerçevelik seeds.

### Scatterplot

As we can see from the correlation plot, some other combinations of the variables also show strong relationships (around 0.95). Let’s have a look at four of them.

In [None]:
#The relationships among other features
f = plt.figure(figsize=(16,12))
#Roundness vs. Compactness
plt.subplot(2,2,1)
sns.scatterplot(data=data,x='Compactness', y='Roundness',hue='Class')
plt.grid()

#Perimeter vs. Major_Axis_Length
plt.subplot(2,2,2)
sns.scatterplot(data=data,x='Perimeter', y='Major_Axis_Length',hue='Class')
plt.grid()

#Perimeter vs. Area
plt.subplot(2,2,3)
sns.scatterplot(data=data,x='Perimeter', y='Area',hue='Class')
plt.grid()

#Perimeter vs. Convex_Area
plt.subplot(2,2,4)
sns.scatterplot(data=data,x='Perimeter', y='Convex_Area',hue='Class')
plt.grid()

The points in each scatterplot are colored by class. As you can see, the two features in each panel are very strongly related to each other. However, they do not have a strong relationship with **Class**: the orange and blue points, which represent the two seed classes, overlap rather than form distinct clusters.

## **6** | **Initial Model Training**

### Model 1: K-Nearest Neighbors

First, we import the needed libraries for KNN model training and evaluation metrics.

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

We'll split the data into a training set (70%) and a testing set (30%).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_norm, y, test_size=0.3, random_state=1)
print("X train: \n" + str(X_train))
print("y train: \n" + str(y_train))
print("X test: \n" + str(X_test))
print("y test: \n" + str(y_test))

Possible values for the hyperparameter k are the integers from 1 to 20.<br>For each k, we store the mean 10-fold cross-validation accuracy in `accuracy_scores`. We will use these scores to determine the best k later on.

In [None]:
k_choices = range(1, 21)
accuracy_scores = []
for k in k_choices:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X_train, y_train, cv=10).mean()
    accuracy_scores.append(score)
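One caveat with the search above is that `X_norm` was scaled using the full dataset, so the held-out folds (and the test set) influenced the scaling. A possible alternative, sketched below, is to put the scaler and the classifier in a scikit-learn pipeline so that the scaler is refitted on the training folds of each split. This is a sketch under assumptions, not the notebook's method: it runs on synthetic data from `make_classification`, and the names `X_demo`, `y_demo`, `cv_scores`, and `best_k_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (NOT the pumpkin seeds dataset): 300 samples, 12 features.
X_demo, y_demo = make_classification(n_samples=300, n_features=12, random_state=1)

cv_scores = {}
for k in range(1, 21):
    # Inside cross_val_score, the whole pipeline is fitted per split, so the
    # scaler only ever sees the training folds of that split.
    model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=10).mean()

# Pick the k with the highest mean cross-validation accuracy.
best_k_demo = max(cv_scores, key=cv_scores.get)
print("Best k on the synthetic data:", best_k_demo)
```

To use this pattern with the seeds data, the unscaled `X` would be split first and the pipeline fitted on the training portion only.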

Let's visualize the cross-validation accuracy for each k.

In [None]:
for i in range(len(accuracy_scores)):
    plt.scatter(k_choices[i], accuracy_scores[i])
plt.xlabel("k")
plt.ylabel("Cross-validation accuracy")
plt.title("Cross-validation on k")
plt.grid()
plt.show()

The best k appears to lie between 13 and 15, with a cross-validation accuracy higher than 0.87. Now let's compute it directly from the accuracy scores.

In [None]:
best_k = k_choices[np.argmax(accuracy_scores)]
print("Best k:", best_k)
print("Accuracy:", max(accuracy_scores))

Let's train the model using the best k (14).

In [None]:
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(X_train, y_train)

Let's test the model on the testing data.

In [None]:
y_pred = knn.predict(X_test)
y_pred

Let's evaluate the model. First, let's print out the confusion matrix.

In [None]:
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Note that Çerçevelik is encoded as 0 and Ürgüp Sivrisi as 1, so Ürgüp Sivrisi is treated as the positive class.<br>
<b>True positives:</b> 342 instances were labeled correctly as Ürgüp Sivrisi.<br>
<b>True negatives:</b> 309 instances were labeled correctly as Çerçevelik.<br>
<b>False negatives:</b> 37 Ürgüp Sivrisi instances were labeled incorrectly as Çerçevelik.<br>
<b>False positives:</b> 62 Çerçevelik instances were labeled incorrectly as Ürgüp Sivrisi.<br>
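The four counts above are enough to work out the usual metrics by hand. The sketch below does the arithmetic with Ürgüp Sivrisi (class 1) as the positive class; the variable names are illustrative.

```python
# Counts as reported in the confusion-matrix discussion.
tp, tn, fn, fp = 342, 309, 37, 62

total = tp + tn + fn + fp          # size of the test set
accuracy = (tp + tn) / total       # fraction of all predictions that are correct
precision = tp / (tp + fp)         # of predicted Urgup Sivrisi, fraction correct
recall = tp / (tp + fn)            # of actual Urgup Sivrisi, fraction recovered
f1 = 2 * precision * recall / (precision + recall)

print(f"total={total}, accuracy={accuracy:.3f}")
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

The total of 750 matches the 30% test split of the 2500 instances, which is a quick consistency check on the counts.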

Let's get the evaluation metrics per class using `classification_report()`, and then the overall test accuracy.

In [None]:
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

print("Test Accuracy:", accuracy_score(y_test, y_pred))

## **7** | **Error Analysis**

## **8** | **Improving Model Performance**

## **9** | **Model Performance Summary**

## **10** | **Insights and Conclusions**

## **11** | **References**