## SECTION 1 
### Introduction to the problem/task and dataset

PumpkinSeed-ML-Insights is a repository dedicated to exploring and analyzing Turkish pumpkin seed varieties, with a focus on classifying whether a seed belongs to the Urgup Sivrisi or the Cercevelik species. The project showcases the authors' data science and machine learning skills.

Within this repository, you will find a Jupyter Notebook that serves as a self-explanatory document, guiding you through the entire process. The repository also contains three Python files, each implementing a different machine learning model: `knn_pumpkinseed.py`, `logistic_regression_pumkinseed.py`, and `neural_network_pumpkinseed.py`. It also includes `pumpkin_seeds.csv`, which contains the data, and `Pumpkin_seeds.pdf`, which describes the dataset.



---
## SECTION 2
### Description of the dataset

`pumpkin_seeds.csv` is a CSV file containing information about pumpkin seeds found in Turkey.

This dataset comes from the 2021 study "The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.)" by Koklu, M., Sarigil, S., & Ozbek, O. In their paper, the authors used a product shooting box to obtain quality images of the pumpkin seeds, converted the images to gray tone, and then converted them to binary images. From these images they extracted 12 morphological features, which make up the CSV file.

Overall, the CSV file has 13 columns and 2500 rows, each row representing a single seed used in the study. The first 12 columns hold the extracted morphological features, while the last column records whether the seed belongs to the Urgup Sivrisi or Cercevelik species. There are 1200 Urgup Sivrisi and 1300 Cercevelik seeds.

The features found in this CSV file are as follows:
1. Area – Number of pixels within the borders of a pumpkin seed
2. Perimeter – Circumference in pixels of a pumpkin seed
3. Major_Axis_Length – Large axis distance of a pumpkin seed
4. Minor_Axis_Length – Small axis distance of a pumpkin seed
5. Convex_Area – Number of pixels of the smallest convex shell at the region formed by the pumpkin seed.
6. Equiv_Diameter – Computed as $\sqrt{4a/\pi}$, where $a$ is the area of the pumpkin seed.
7. Eccentricity – Eccentricity of a pumpkin seed
8. Solidity – Convex condition of the pumpkin seeds
9. Extent – Ratio of a pumpkin seed area to the bounding box pixels
10. Roundness – Ovality of pumpkin seeds without considering the distortion of the edges.
11. Aspect_Ration – Aspect ratio of the pumpkin seeds
12. Compactness – Proportion of the area of the pumpkin seed relative to the area of the circle with the same circumference
13. Class – Either Cercevelik or Urgup Sivrisi
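
Several of these features are derived from the others. As a small sketch of the Equiv_Diameter definition (the diameter of a circle with the same area as the seed), the snippet below applies the formula to the `Area` of the first seed in the dataset, 56276 pixels, as shown in the preprocessing output further down. The function name `equiv_diameter` is ours for illustration and does not come from the paper.

```python
import math

def equiv_diameter(area: float) -> float:
    """Diameter of the circle whose area equals `area` (in pixels)."""
    return math.sqrt(4.0 * area / math.pi)

# Area of the first row of pumpkin_seeds.csv
d = equiv_diameter(56276)
print(round(d, 2))  # prints 267.68
```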



---
## SECTION 3

### List of Requirements

---
## SECTION 4

### Data preprocessing and cleaning



In [1]:
import numpy as np
import pandas as pd
import csv


In [2]:
data = []

with open('pumpkin_seeds.csv', 'r', encoding='utf-8', errors='replace') as csv_file:
    raw_data = csv.reader(csv_file)

    #Skip headers
    next(raw_data)

    #Store data into data array
    for row in raw_data:
        row_data = []
        for i in range(13): #Convert the garbled class names into 0 or 1 (depending on their species)
            if i == 12 and row[i] == '�er�evelik':
                row_data.append(int(0))
            elif i == 12 and row[i] == '�rg�p Sivrisi':
                row_data.append(int(1))
            else:
                row_data.append(row[i])

        data.append(row_data)

#Convert data into numpy array
np_data = np.array(data)
np_data



array([['56276', '888.242', '326.1485', ..., '1.4809', '0.8207', '0'],
       ['76631', '1068.146', '417.1932', ..., '1.7811', '0.7487', '0'],
       ['71623', '1082.987', '435.8328', ..., '2.0651', '0.6929', '0'],
       ...,
       ['87994', '1210.314', '507.22', ..., '2.2828', '0.6599', '1'],
       ['80011', '1182.947', '501.9065', ..., '2.4513', '0.6359', '1'],
       ['84934', '1159.933', '462.8951', ..., '1.9735', '0.7104', '1']],
      dtype='<U11')

The first step is to address the encoding issues in the "Class" column. Its two unique values, which name the two species of pumpkin seeds, are not valid UTF-8: the raw bytes show up as escape characters like \x82, and reading the file with `errors='replace'` turns them into the replacement character �. Rather than repairing the strings, we replaced the class names with integer labels: 0 for Cercevelik and 1 for Urgup Sivrisi. Because the NumPy array stores every value as a string, these labels appear as '0' and '1' in the output above.
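
If the file's real encoding were known, the class names could be decoded directly instead of being matched in their garbled form. The sketch below illustrates that idea on a two-row, in-memory stand-in for the CSV. It assumes the Turkish Windows codec `cp1254`; the actual encoding of `pumpkin_seeds.csv` is not stated here, so treat that choice as a guess to verify against the file.

```python
import io
import pandas as pd

# Two-row stand-in for pumpkin_seeds.csv, encoded as cp1254 (assumed encoding).
raw_bytes = "Area,Class\n56276,Çerçevelik\n87994,Ürgüp Sivrisi\n".encode("cp1254")

# Decoding with the matching codec preserves the Turkish characters.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="cp1254")

# Map the class names to the same integer labels used in this notebook.
label_map = {"Çerçevelik": 0, "Ürgüp Sivrisi": 1}
df["Class"] = df["Class"].map(label_map)
print(df)
```

With a wrong codec the names would not match the keys of `label_map`, and `.map` would produce `NaN` for those rows, which makes a bad guess easy to spot.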

### Data Scaling

The numerical features are scaled using `StandardScaler` from `sklearn.preprocessing`. This gives every feature a mean of 0 and a standard deviation of 1, which matters for algorithms that are sensitive to feature magnitudes, such as kNN and logistic regression.

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Column names to use for the DataFrame
original_column_names = ['area', 'perimeter', 'major_axis_length', 'minor_axis_length',
                         'convex_area', 'equiv_diameter', 'eccentricity', 'solidity',
                         'extent', 'roundness', 'aspect_Ration', 'compactness', 'class']

# Convert the numpy array to a pandas DataFrame using the original column names
pumpkin_seeds_data = pd.DataFrame(np_data, columns=original_column_names)

# Selecting only the numerical features for scaling
numerical_features = pumpkin_seeds_data.iloc[:, :-1]

# Initializing the Standard Scaler
scaler = StandardScaler()

# Scaling the numerical features
scaled_numerical_features = scaler.fit_transform(numerical_features)

# Creating a new DataFrame with scaled values using the original column names (excluding 'Class')
scaled_numerical_df = pd.DataFrame(scaled_numerical_features, columns=original_column_names[:-1])

# Adding the non-numerical column ('Class') back to the DataFrame
scaled_pumpkin_seeds_data = pd.concat([scaled_numerical_df, pumpkin_seeds_data['class']], axis=1)

scaled_pumpkin_seeds_data

# Creating the feature set 'X' and the target 'y'
X = scaled_pumpkin_seeds_data.iloc[:, :-1]  # All columns except the last one
y = scaled_pumpkin_seeds_data['class']      # Only the last column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
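
Note that the cell above fits the scaler on all 2500 rows before splitting, so the test rows contribute to the mean and standard deviation. A common alternative is to fit the scaler on the training split only and reuse its statistics for the test split. The following is a minimal sketch of that variation on synthetic numbers (not the pumpkin seed data); the names `X_demo`, `y_demo`, and `demo_scaler` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 200 rows, 3 features on very different scales.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[50000.0, 900.0, 0.8], scale=[10000.0, 100.0, 0.05], size=(200, 3))
y_demo = np.array([0, 1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42, stratify=y_demo)

# Learn the mean and standard deviation from the training rows only ...
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
# ... and reuse those statistics for the test rows.
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 for every feature
print(X_tr_scaled.std(axis=0).round(6))   # approximately 1 for every feature
```

The test split will generally not have a mean of exactly 0 after this transform, which is expected: it is scaled with the training statistics.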



---
## SECTION 5 - Exploratory Data Analysis (EDA)

---
## SECTION 6 - Model training

### kNN

In [4]:
print(np_data.shape)


(2500, 13)


### Logistic Regression

### Naive Bayes

In [5]:
from sklearn.naive_bayes import GaussianNB


In [6]:
seed_nb = GaussianNB()


In [7]:
seed_nb.fit(X_train, y_train)


In [8]:
predictions = seed_nb.predict(X_train)
predictions


array(['1', '1', '0', ..., '0', '1', '0'], dtype='<U1')

In [9]:
def compute_accuracy(predictions, actual):
    return (np.sum(predictions == actual)/len(actual)) * 100
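
As a quick sanity check, this helper should agree with `accuracy_score` from `sklearn.metrics` multiplied by 100. The sketch below compares the two on a few hand-made labels; the arrays `demo_pred` and `demo_true` are made up for illustration and use string labels to mirror the notebook's '0'/'1' classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score

def compute_accuracy(predictions, actual):
    # Same helper as in the notebook: percentage of matching labels.
    return (np.sum(predictions == actual) / len(actual)) * 100

# Hand-made labels (strings, like the notebook's '0'/'1' classes).
demo_pred = np.array(['1', '0', '0', '1', '1'])
demo_true = np.array(['1', '0', '1', '1', '0'])

manual = compute_accuracy(demo_pred, demo_true)
library = accuracy_score(demo_true, demo_pred) * 100
print(manual, library)  # 3 of 5 labels match, so both are 60%
```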


In [10]:
print("Training accuracy: ", compute_accuracy(predictions, y_train), "%")


Training accuracy:  87.36 %


In [11]:
predictions = seed_nb.predict(X_test)
predictions


array(['0', '0', '1', '1', '0', '0', '0', '0', '0', '0', '1', '0', '1',
       '1', '1', '0', '1', '1', '1', '1', '1', '0', '1', '0', '1', '1',
       '0', '1', '0', '0', '0', '1', '0', '0', '1', '1', '1', '1', '0',
       '1', '0', '0', '1', '0', '0', '0', '0', '0', '1', '1', '0', '0',
       '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '1', '0', '1',
       '0', '0', '0', '1', '0', '0', '1', '1', '0', '1', '0', '1', '0',
       '0', '0', '1', '0', '0', '1', '1', '0', '1', '0', '0', '1', '0',
       '1', '1', '1', '0', '1', '1', '0', '1', '0', '0', '0', '1', '0',
       '0', '1', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1',
       '0', '1', '1', '1', '0', '1', '1', '0', '1', '0', '1', '1', '0',
       '1', '1', '0', '1', '0', '1', '1', '0', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '0', '0', '0', '0', '0', '0', '1',
       '0', '0', '0', '1', '0', '1', '0', '1', '0', '0', '0', '1', '1',
       '0', '0', '1', '0', '1', '1', '1', '1', '0', '0', '0', '0

In [12]:
print("Test accuracy: ", compute_accuracy(predictions, y_test), "%")


Test accuracy:  86.08 %


In [13]:
seed_nb.class_count_


array([975., 900.])

In [14]:
seed_nb.class_prior_


array([0.52, 0.48])
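
When `priors` is left as `None`, `GaussianNB` estimates the class priors from the class frequencies in the training data. The short check below reproduces the values above by dividing the reported class counts by their total.

```python
import numpy as np

# Class counts reported by seed_nb.class_count_ above.
class_count = np.array([975.0, 900.0])

# Empirical prior = class count / total number of training rows.
class_prior = class_count / class_count.sum()
print(class_prior)
```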

In [15]:
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)


In [16]:
X_train, X_validation, y_train, y_validation = train_test_split(X_train_val, y_train_val, test_size=0.25, stratify=y_train_val, random_state=42)
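
Taken together, the two calls above produce a 60/20/20 train/validation/test split: the second call takes 25% of the remaining 80%, which is 20% of the full dataset. The sketch below confirms the resulting sizes using placeholder arrays with the dataset's size and class balance; `X_demo` and `y_demo` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels with the dataset's size and class balance.
X_demo = np.arange(2500).reshape(-1, 1)
y_demo = np.array([0] * 1300 + [1] * 1200)

# Step 1: hold out 20% of all rows as the test set.
X_tv, X_te, y_tv, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)
# Step 2: hold out 25% of the remaining 80% (20% overall) for validation.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, stratify=y_tv, random_state=42)

print(len(X_tr), len(X_va), len(X_te))  # 1500 500 500
```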


In [17]:
from sklearn.model_selection import ParameterGrid


In [18]:
seed_nb = GaussianNB()


In [19]:
seed_nb.get_params()


{'priors': None, 'var_smoothing': 1e-09}

In [20]:
hyperparameters = [{
    'var_smoothing': [1e-12, 1e-11, 1e-10, 1e-9, 1e-8, 1e-7, 1e-6],
    'priors': [None, [0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.4, 0.6], [0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.8, 0.2], [0.9, 0.1]]
}]


In [21]:
list(ParameterGrid(hyperparameters))


[{'priors': None, 'var_smoothing': 1e-12},
 {'priors': None, 'var_smoothing': 1e-11},
 {'priors': None, 'var_smoothing': 1e-10},
 {'priors': None, 'var_smoothing': 1e-09},
 {'priors': None, 'var_smoothing': 1e-08},
 {'priors': None, 'var_smoothing': 1e-07},
 {'priors': None, 'var_smoothing': 1e-06},
 {'priors': [0.1, 0.9], 'var_smoothing': 1e-12},
 {'priors': [0.1, 0.9], 'var_smoothing': 1e-11},
 {'priors': [0.1, 0.9], 'var_smoothing': 1e-10},
 {'priors': [0.1, 0.9], 'var_smoothing': 1e-09},
 {'priors': [0.1, 0.9], 'var_smoothing': 1e-08},
 {'priors': [0.1, 0.9], 'var_smoothing': 1e-07},
 {'priors': [0.1, 0.9], 'var_smoothing': 1e-06},
 {'priors': [0.2, 0.8], 'var_smoothing': 1e-12},
 {'priors': [0.2, 0.8], 'var_smoothing': 1e-11},
 {'priors': [0.2, 0.8], 'var_smoothing': 1e-10},
 {'priors': [0.2, 0.8], 'var_smoothing': 1e-09},
 {'priors': [0.2, 0.8], 'var_smoothing': 1e-08},
 {'priors': [0.2, 0.8], 'var_smoothing': 1e-07},
 {'priors': [0.2, 0.8], 'var_smoothing': 1e-06},
 {'priors': [

In [22]:
best_score = 0
for g in ParameterGrid(hyperparameters):
    print(g)

    seed_nb.set_params(**g)

    seed_nb.fit(X_train, y_train)
    predictions = seed_nb.predict(X_train)
    train_acc = compute_accuracy(predictions, y_train)

    predictions = seed_nb.predict(X_validation)
    val_acc = compute_accuracy(predictions, y_validation)

    print(f"Train acc: {train_acc}% \t Val acc: {val_acc}%", end="\n\n")

    if val_acc > best_score:
        best_score = val_acc
        best_grid = g

print("Best accuracy: ", best_score, "%")
print("Best grid: ", best_grid)


{'priors': None, 'var_smoothing': 1e-12}
Train acc: 88.13333333333333% 	 Val acc: 84.6%

{'priors': None, 'var_smoothing': 1e-11}
Train acc: 88.13333333333333% 	 Val acc: 84.6%

{'priors': None, 'var_smoothing': 1e-10}
Train acc: 88.13333333333333% 	 Val acc: 84.6%

{'priors': None, 'var_smoothing': 1e-09}
Train acc: 88.13333333333333% 	 Val acc: 84.6%

{'priors': None, 'var_smoothing': 1e-08}
Train acc: 88.13333333333333% 	 Val acc: 84.6%

{'priors': None, 'var_smoothing': 1e-07}
Train acc: 88.13333333333333% 	 Val acc: 84.6%

{'priors': None, 'var_smoothing': 1e-06}
Train acc: 88.13333333333333% 	 Val acc: 84.6%

{'priors': [0.1, 0.9], 'var_smoothing': 1e-12}
Train acc: 87.6% 	 Val acc: 83.6%

{'priors': [0.1, 0.9], 'var_smoothing': 1e-11}
Train acc: 87.6% 	 Val acc: 83.6%

{'priors': [0.1, 0.9], 'var_smoothing': 1e-10}
Train acc: 87.6% 	 Val acc: 83.6%

{'priors': [0.1, 0.9], 'var_smoothing': 1e-09}
Train acc: 87.6% 	 Val acc: 83.6%

{'priors': [0.1, 0.9], 'var_smoothing': 1e-08}
Tr
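
The accuracies printed above do not change across `var_smoothing` values. One plausible reason is that `GaussianNB` adds `var_smoothing` times the largest feature variance to every variance, and with standardized features (variance close to 1) values between 1e-12 and 1e-6 barely alter the model.

The manual loop can also be written with `GridSearchCV` from `sklearn.model_selection`. This is a different technique from the one used in this notebook: it scores each grid point with k-fold cross-validation instead of a single validation split. The sketch below runs it on synthetic data from `make_classification` with a reduced grid, so its numbers say nothing about the pumpkin seed results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data standing in for the scaled pumpkin seed features.
X_demo, y_demo = make_classification(n_samples=400, n_features=12, random_state=42)

# A reduced version of the notebook's grid, to keep the example quick.
param_grid = {
    'var_smoothing': [1e-12, 1e-9, 1e-6],
    'priors': [None, [0.3, 0.7], [0.5, 0.5], [0.7, 0.3]],
}

# 5-fold cross-validation replaces the single train/validation split.
search = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print("Best CV accuracy:", search.best_score_)
print("Best grid:", search.best_params_)
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, which plays the same role as refitting on `X_train_val` in the notebook.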

In [23]:
seed_nb = GaussianNB(**best_grid)
seed_nb.get_params()


{'priors': [0.7, 0.3], 'var_smoothing': 1e-12}

In [24]:
seed_nb.fit(X_train_val, y_train_val)


In [25]:
predictions = seed_nb.predict(X_test)


In [26]:
print("Test accuracy: ", compute_accuracy(predictions, y_test), "%")


Test accuracy:  85.2 %
