## SECTION 1 
### Introduction to the problem/task and dataset

PumpkinSeed-ML-Insights is a repository dedicated to the exploration and analysis of Turkish pumpkin seed varieties, with a focus on classifying whether a seed belongs to the Urgup Sivrisi or the Cercevelik species. The project showcases the authors' knowledge of data science and machine learning.

Within this repository, you will find a Jupyter Notebook that walks through the entire process. The repository also contains three Python files, each implementing a different machine learning model: `knn_pumpkinseed.py`, `logistic_regression_pumkinseed.py`, and `neural_network_pumpkinseed.py`. The data is stored in `pumpkin_seeds.csv`, and `Pumpkin_seeds.pdf` provides a description of the dataset.



---
## SECTION 2
### Description of the dataset

`pumpkin_seeds.csv` is a CSV file containing measurements of pumpkin seeds found in Turkey.

This dataset comes from the study *The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.)* by Koklu, M., Sarigil, S., & Ozbek, O. (2021). In their paper, the authors used a product shooting box to obtain quality images of the pumpkin seeds, converted the images to grayscale and then to binary images, and extracted 12 morphological features from them. These features are what the CSV file stores.
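The image-processing steps described above (grayscale, then binary, then measurements) can be illustrated on a toy image. This is a minimal sketch in plain Python, not the procedure from the paper: the 5×5 pixel grid, the threshold of 128, and the helper names `to_binary` and `pixel_area` are all assumptions made for illustration.

```python
# Toy grayscale image: higher values are brighter (hypothetical data).
gray = [
    [10, 12, 11, 9, 10],
    [11, 200, 210, 12, 10],
    [10, 220, 230, 205, 9],
    [12, 215, 225, 11, 10],
    [10, 11, 12, 10, 9],
]


def to_binary(image, threshold=128):
    """Mark pixels brighter than the (assumed) threshold as seed (1), others as background (0)."""
    return [[1 if px > threshold else 0 for px in row] for row in image]


def pixel_area(binary_image):
    """Area feature: number of foreground pixels in the binary image."""
    return sum(sum(row) for row in binary_image)


binary = to_binary(gray)
area = pixel_area(binary)
print(area)  # 7
```

Once the seed is a set of foreground pixels, features such as the area reduce to simple counts, while the others in the dataset require more geometry.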

Overall, the CSV file has 13 columns and 2500 rows. The first 12 columns hold the extracted morphological features, while the last column indicates whether the seed belongs to the Urgup Sivrisi or the Cercevelik species. Each row represents a single seed used in the study. There are 1200 Urgup Sivrisi and 1300 Cercevelik seeds.

The features found in this CSV file are as follows:
1. Area – Number of pixels within the borders of a pumpkin seed
2. Perimeter – Circumference in pixels of a pumpkin seed
3. Major_Axis_Length – Large axis distance of a pumpkin seed
4. Minor_Axis_Length – Small axis distance of a pumpkin seed
5. Convex_Area – Number of pixels of the smallest convex shell at the region formed by the pumpkin seed.
6. Equiv_Diameter – Computed as $\sqrt{4a/\pi}$, where $a$ is the area of the pumpkin seed.
7. Eccentricity – Eccentricity of a pumpkin seed
8. Solidity – Convex condition of the pumpkin seeds
9. Extent – Ratio of a pumpkin seed area to the bounding box pixels
10. Roundness – Ovality of pumpkin seeds without considering the distortion of the edges.
11. Aspect_Ration – Aspect ratio of the pumpkin seeds
12. Compactness – Proportion of the area of the pumpkin seed relative to the area of the circle with the same circumference
13. Class – Either Cercevelik or Urgup Sivrisi
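A few of the features listed above are simple functions of the others. The sketch below shows how they could be computed. The helper names are hypothetical, and the definition of the aspect ratio as major axis over minor axis is an assumption, since the list does not spell it out.

```python
import math


def equiv_diameter(area):
    """Diameter of the circle whose area equals the seed area: sqrt(4a / pi)."""
    return math.sqrt(4.0 * area / math.pi)


def aspect_ratio(major_axis_length, minor_axis_length):
    """Assumed definition: long axis divided by short axis."""
    return major_axis_length / minor_axis_length


def extent(area, bbox_width, bbox_height):
    """Share of the bounding box that is covered by the seed."""
    return area / (bbox_width * bbox_height)


# Hypothetical measurements, in pixels
print(round(equiv_diameter(56276), 2))
print(aspect_ratio(300.0, 200.0))  # 1.5
print(extent(50.0, 10.0, 10.0))    # 0.5
```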



---
## SECTION 3

### List of Requirements

---
## SECTION 4

### Data preprocessing and cleaning



In [1]:
import numpy as np
import pandas as pd
import csv

In [2]:
data = []

with open('pumpkin_seeds.csv', 'r', encoding='utf-8', errors='replace') as csv_file:
    raw_data = csv.reader(csv_file)

    #Skip headers
    next(raw_data)

    #Store data into data array
    for row in raw_data:
        row_data = []
        for i in range(13): #Map the garbled class names to 0 (Cercevelik) or 1 (Urgup Sivrisi)
            if i == 12 and row[i] == '�er�evelik':
                row_data.append(int(0))
            elif i == 12 and row[i] == '�rg�p Sivrisi':
                row_data.append(int(1))
            else:
                row_data.append(row[i])

        data.append(row_data)

#Convert data into numpy array
np_data = np.array(data)
np_data


array([['56276', '888.242', '326.1485', ..., '1.4809', '0.8207', '0'],
       ['76631', '1068.146', '417.1932', ..., '1.7811', '0.7487', '0'],
       ['71623', '1082.987', '435.8328', ..., '2.0651', '0.6929', '0'],
       ...,
       ['87994', '1210.314', '507.22', ..., '2.2828', '0.6599', '1'],
       ['80011', '1182.947', '501.9065', ..., '2.4513', '0.6359', '1'],
       ['84934', '1159.933', '462.8951', ..., '1.9735', '0.7104', '1']],
      dtype='<U11')

The first step is to address the encoding issues in the "Class" column. The Turkish characters in the two class names do not decode correctly: they show up as escape characters like \x82 or, when the file is read with `errors='replace'`, as the replacement character `�`. These values are intended to represent the two species of pumpkin seeds. To correct this, we replace the garbled strings with integer labels: 0 for Cercevelik and 1 for Urgup Sivrisi. Because every value is stored in a single NumPy string array, the labels appear as `'0'` and `'1'` in the output above.
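To see where the `�` characters come from, the sketch below reproduces the effect on a single class name. It assumes the name was written in a single-byte Turkish encoding such as cp1254; the actual encoding of `pumpkin_seeds.csv` is not stated here, so treat the codec as an illustration only. `CLASS_TO_LABEL` is a hypothetical name.

```python
# Assumption: the class names were saved in a single-byte encoding such as cp1254.
raw_bytes = "Çerçevelik".encode("cp1254")

# Decoding as UTF-8 with errors="replace" turns each undecodable byte into U+FFFD.
garbled = raw_bytes.decode("utf-8", errors="replace")
print(garbled == "\ufffder\ufffdevelik")  # True

# Decoding with the matching encoding recovers the original name.
recovered = raw_bytes.decode("cp1254")
print(recovered == "Çerçevelik")  # True

# Map the recovered names to the integer labels used in this notebook.
CLASS_TO_LABEL = {"Çerçevelik": 0, "Ürgüp Sivrisi": 1}
label = CLASS_TO_LABEL[recovered]
print(label)  # 0
```

If the file's real encoding is known, passing it to `open(..., encoding=...)` would avoid comparing against garbled strings altogether.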

### Data Scaling

The numerical features are scaled using `StandardScaler` from `sklearn.preprocessing`. After scaling, every feature has a mean of 0 and a standard deviation of 1, which matters for distance-based models such as kNN and for models trained with gradient descent, since features with large ranges would otherwise dominate.

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Column names, in the order they appear in the CSV file
original_column_names = ['area', 'perimeter', 'major_axis_length', 'minor_axis_length', 
                         'convex_area', 'equiv_diameter', 'eccentricity', 'solidity', 
                         'extent', 'roundness', 'aspect_Ration', 'compactness', 'class']

# Convert the numpy array to a pandas DataFrame using the original column names
pumpkin_seeds_data = pd.DataFrame(np_data, columns=original_column_names)

# Selecting only the numerical features for scaling
numerical_features = pumpkin_seeds_data.iloc[:, :-1]

# Initializing the Standard Scaler
scaler = StandardScaler()

# Scaling the numerical features
scaled_numerical_features = scaler.fit_transform(numerical_features)

# Creating a new DataFrame with scaled values using the original column names (excluding 'Class')
scaled_numerical_df = pd.DataFrame(scaled_numerical_features, columns=original_column_names[:-1])

# Adding the non-numerical column ('Class') back to the DataFrame
scaled_pumpkin_seeds_data = pd.concat([scaled_numerical_df, pumpkin_seeds_data['class']], axis=1)

scaled_pumpkin_seeds_data

# Creating the feature set 'X' and the target 'y'
X = scaled_pumpkin_seeds_data.iloc[:, :-1]  # All columns except the last one
y = scaled_pumpkin_seeds_data['class']      # Only the last column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
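
The cell above fits the scaler on all 2500 rows before splitting, so the test rows contribute to the mean and standard deviation. A common alternative is to compute those statistics on the training rows only and reuse them for the test rows. The sketch below illustrates the idea with NumPy on made-up numbers; `fit_standardizer` and `apply_standardizer` are hypothetical helpers, not part of this notebook or of scikit-learn.

```python
import numpy as np


def fit_standardizer(X_train):
    """Compute the per-column mean and population standard deviation (ddof=0)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns
    return mean, std


def apply_standardizer(X, mean, std):
    """Standardize X with statistics computed elsewhere (here: on the training rows)."""
    return (X - mean) / std


# Made-up data: 4 training rows and 2 test rows with 2 features each
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_demo = np.array([[2.5, 25.0], [5.0, 50.0]])

mean, std = fit_standardizer(X_train_demo)
X_train_scaled = apply_standardizer(X_train_demo, mean, std)
X_test_scaled = apply_standardizer(X_test_demo, mean, std)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
```

With scikit-learn, the same pattern is `scaler.fit(...)` on the training split followed by `scaler.transform(...)` on each split.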



---------------------------------------------------------------------------
## SECTION 5 - Exploratory Data Analysis (EDA)

---------------------------------------------------------------------------
## SECTION 6 - Models

### kNN

In [4]:
print(np_data.shape)


(2500, 13)


### Logistic Regression

In [5]:
# Instantiate the Logistic Regression model
from sklearn.linear_model import SGDClassifier

logi_model = SGDClassifier(loss = 'log_loss', eta0 = 0.001, max_iter = 200, 
                            learning_rate = 'constant', random_state = 42, verbose = 0)


In [6]:
# Convert back to numpy array since dataloader (for minibatch gradient descent) expects numpy array 
X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)

In [7]:
# Train the model and get the predictions
logi_model.fit(X_train, y_train)
predictions_train = logi_model.predict(X_train)
predictions_test = logi_model.predict(X_test)

In [8]:
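def compute_accuracy(predictions, actual):
    """Return the percentage of predictions that match the actual labels."""
    # Element-wise comparison of two NumPy arrays, then count the matches
    correct = np.sum(predictions == actual)

    accuracy = correct / len(predictions) * 100.0

    return accuracy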

In [9]:
print("Training accuracy: ", compute_accuracy(predictions_train, y_train),"%")
print("Testing accuracy: ", compute_accuracy(predictions_test, y_test),"%")

Training accuracy:  88.16000000000001 %
Testing accuracy:  85.11999999999999 %


In [10]:
# Improving the model by using minibatch gradient descent

logi_model_bgd = SGDClassifier(loss = 'log_loss', eta0 = 0.001, learning_rate = 'constant', random_state = 42, verbose = 0)

In [11]:
# Use data_loader python file from our past notebooks
from data_loader import DataLoader
data_loader = DataLoader(X_train, y_train, 10)
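
`data_loader.py` is not included in this section, so its implementation is not shown here. Judging only from how it is used in this notebook (a constructor taking features, labels, and a batch size, and a `get_batch()` method returning two parallel lists of mini-batches), a stand-in could look like the sketch below. The class name `SimpleDataLoader`, the shuffling, and the `seed` parameter are assumptions and may differ from the original file.

```python
import numpy as np


class SimpleDataLoader:
    """Hypothetical stand-in for the DataLoader used in this notebook."""

    def __init__(self, X, y, batch_size, seed=None):
        self.X = np.asarray(X)
        self.y = np.asarray(y)
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)

    def get_batch(self):
        """Shuffle the rows and return (list of X batches, list of y batches)."""
        indices = self.rng.permutation(len(self.X))
        X_batches, y_batches = [], []
        for start in range(0, len(indices), self.batch_size):
            batch_idx = indices[start:start + self.batch_size]
            X_batches.append(self.X[batch_idx])
            y_batches.append(self.y[batch_idx])
        return X_batches, y_batches


# Usage on made-up data: 25 rows, 3 features, batch size 10
X_demo = np.arange(75, dtype=float).reshape(25, 3)
y_demo = np.array(['0', '1'] * 12 + ['0'])
loader = SimpleDataLoader(X_demo, y_demo, 10, seed=42)
X_batches, y_batches = loader.get_batch()
print([len(b) for b in X_batches])  # [10, 10, 5]
```

Reshuffling on every call means each epoch visits the rows in a different order, which is the usual choice for mini-batch gradient descent.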

In [12]:
from sklearn.metrics import log_loss

max_epochs = 200
e = 0
is_converged = False
previous_loss = 0
labels = np.unique(y_train)

# For each epoch
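while e < max_epochs and not is_converged:

    loss = 0

    # Get the list of mini-batches for this epoch
    X_batch, y_batch = data_loader.get_batch()

    # X_mb and y_mb avoid overwriting the full feature set X and target y
    for X_mb, y_mb in zip(X_batch, y_batch):

        # Partial fit the model on the current mini-batch
        logi_model_bgd.partial_fit(X_mb, y_mb, labels)

        # Compute the loss on the whole training set
        y_pred = logi_model_bgd.predict_proba(X_train)
        loss += log_loss(y_train, y_pred)

    # Display the average loss per epoch
    print('Epoch:', e + 1, '\tLoss:', (loss / len(X_batch)))

    # Stop once the summed loss changes by less than 0.005 between epochs
    if abs(previous_loss - loss) < 0.005:
        is_converged = True
    else:
        previous_loss = loss
        e += 1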

Epoch: 1 	Loss: 0.4595718714356607
Epoch: 2 	Loss: 0.35516606210237706
Epoch: 3 	Loss: 0.3357882392212421
Epoch: 4 	Loss: 0.3279783392659519
Epoch: 5 	Loss: 0.32360040382474337
Epoch: 6 	Loss: 0.321005672417068
Epoch: 7 	Loss: 0.3195198723298882
Epoch: 8 	Loss: 0.3185109131186593
Epoch: 9 	Loss: 0.3177632775753124
Epoch: 10 	Loss: 0.3169705934716691
Epoch: 11 	Loss: 0.3166404296483876
Epoch: 12 	Loss: 0.3161968519421993
Epoch: 13 	Loss: 0.3159738371374772
Epoch: 14 	Loss: 0.315747827507336
Epoch: 15 	Loss: 0.3155693458584706
Epoch: 16 	Loss: 0.315457750447972
Epoch: 17 	Loss: 0.3153374591525945
Epoch: 18 	Loss: 0.31519492246815084
Epoch: 19 	Loss: 0.31512172143226175
Epoch: 20 	Loss: 0.31505562212541716
Epoch: 21 	Loss: 0.31498762217018716
Epoch: 22 	Loss: 0.31491607914036507
Epoch: 23 	Loss: 0.31486102299157664
Epoch: 24 	Loss: 0.3148125730302644
Epoch: 25 	Loss: 0.31478546543035113
Epoch: 26 	Loss: 0.31470970250015495
Epoch: 27 	Loss: 0.31466784649149004
Epoch: 28 	Loss: 0.3146458481

In [13]:
predictions_train = logi_model_bgd.predict(X_train)
predictions_test = logi_model_bgd.predict(X_test)

In [14]:
print("Training accuracy: ", compute_accuracy(predictions_train, y_train),"%")
print("Testing accuracy: ", compute_accuracy(predictions_test, y_test),"%")

Training accuracy:  88.32 %
Testing accuracy:  85.92 %


#### Hyperparameter Tuning

In [15]:
hyperparameters = {
    'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'l1_ratio': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'tol': [0.0001, 0.001, 0.01, 0.1],
    'eta0': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['none', 'l2', 'l1', 'elasticnet']
}

In [16]:
logi_model_hyper = SGDClassifier(loss='log_loss', learning_rate='constant', random_state=42, verbose=0)

In [17]:
from sklearn.model_selection import RandomizedSearchCV

random_search_logi_model = RandomizedSearchCV(estimator=logi_model_hyper, param_distributions=hyperparameters, n_iter=100, cv=5, random_state=42)

In [18]:
random_search_logi_model.fit(X_train, y_train)

90 fits failed out of a total of 500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
90 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Bryce Salvador\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Bryce Salvador\anaconda3\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\Bryce Salvador\anaconda3\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\Bryce Salvador\anaconda3\lib\site-packages\sklearn\utils\_param_validation.py", line 95, in validate_parameter_c
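
The traceback above is cut off before the final error message, so the exact cause is not visible here. One plausible explanation is the `'none'` entry in the `penalty` list: the parameter check in recent scikit-learn releases expects the Python value `None` for an unpenalized `SGDClassifier` rather than the string `'none'`. Under that assumption, a search space that avoids the failed fits might look like the sketch below. `hyperparameters_fixed` is a hypothetical name, and rerunning the search with it could select a different best estimator than the one reported in this notebook.

```python
# Assumed fix: use None (no penalty) instead of the string 'none'.
hyperparameters_fixed = {
    'alpha': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'l1_ratio': [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
    'tol': [0.0001, 0.001, 0.01, 0.1],
    'eta0': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': [None, 'l2', 'l1', 'elasticnet'],
}

# Number of distinct combinations the randomized search samples from
n_combinations = 1
for values in hyperparameters_fixed.values():
    n_combinations *= len(values)
print(n_combinations)  # 7392
```

With `n_iter=100`, the randomized search evaluates only a small fraction of these combinations, which is why the selected estimator depends on `random_state`.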

In [19]:
print(random_search_logi_model.best_estimator_)

SGDClassifier(alpha=0.001, eta0=0.001, l1_ratio=0.4, learning_rate='constant',
              loss='log_loss', penalty='l1', random_state=42, tol=0.0001)


In [20]:
logi_model_besthyper = SGDClassifier(alpha=0.001, eta0=0.001, l1_ratio=0.4, learning_rate='constant', loss='log_loss', penalty='l1', random_state=42, tol=0.0001)

In [21]:
logi_model_besthyper.fit(X_train, y_train)

In [22]:
predictions_train = logi_model_besthyper.predict(X_train)
predictions_test = logi_model_besthyper.predict(X_test)

In [23]:
print("Training accuracy: ", compute_accuracy(predictions_train, y_train),"%")
print("Testing accuracy: ", compute_accuracy(predictions_test, y_test),"%")

Training accuracy:  88.21333333333334 %
Testing accuracy:  85.28 %


In [24]:
logi_model_besthyper_bgd = SGDClassifier(alpha=0.001, eta0=0.001, l1_ratio=0.4, learning_rate='constant', loss='log_loss', penalty='l1', random_state=42, tol=0.0001)

In [25]:
from sklearn.metrics import log_loss

max_epochs = 200
e = 0
is_converged = False
previous_loss = 0
labels = np.unique(y_train)

# For each epoch
while e < max_epochs and not is_converged:

    loss = 0

    # Get the list of mini-batches for this epoch
    X_batch, y_batch = data_loader.get_batch()

    # X_mb and y_mb avoid overwriting the full feature set X and target y
    for X_mb, y_mb in zip(X_batch, y_batch):

        # Partial fit the model on the current mini-batch
        logi_model_besthyper_bgd.partial_fit(X_mb, y_mb, labels)

        # Compute the loss on the whole training set
        y_pred = logi_model_besthyper_bgd.predict_proba(X_train)
        loss += log_loss(y_train, y_pred)

    # Display the average loss per epoch
    print('Epoch:', e + 1, '\tLoss:', (loss / len(X_batch)))

    # Stop once the summed loss changes by less than 0.005 between epochs
    if abs(previous_loss - loss) < 0.005:
        is_converged = True
    else:
        previous_loss = loss
        e += 1

Epoch: 1 	Loss: 0.4614656530193545
Epoch: 2 	Loss: 0.3574764300866157
Epoch: 3 	Loss: 0.3361700942131213
Epoch: 4 	Loss: 0.3287530326963546
Epoch: 5 	Loss: 0.32362751212378377
Epoch: 6 	Loss: 0.3212425572964651
Epoch: 7 	Loss: 0.31968856564839115
Epoch: 8 	Loss: 0.3185586716237862
Epoch: 9 	Loss: 0.3179358987085836
Epoch: 10 	Loss: 0.31726491661695755
Epoch: 11 	Loss: 0.31683570866980926
Epoch: 12 	Loss: 0.3164973433041607
Epoch: 13 	Loss: 0.3161556643726129
Epoch: 14 	Loss: 0.3159659857940005
Epoch: 15 	Loss: 0.3157938604376914
Epoch: 16 	Loss: 0.31565034641279255
Epoch: 17 	Loss: 0.31558557319229946
Epoch: 18 	Loss: 0.315429836158283
Epoch: 19 	Loss: 0.3153470983209677
Epoch: 20 	Loss: 0.3152556356553599
Epoch: 21 	Loss: 0.31524052887584375


In [26]:
predictions_train = logi_model_besthyper_bgd.predict(X_train)
predictions_test = logi_model_besthyper_bgd.predict(X_test)

In [27]:
print("Training accuracy: ", compute_accuracy(predictions_train, y_train),"%")
print("Testing accuracy: ", compute_accuracy(predictions_test, y_test),"%")

Training accuracy:  88.21333333333334 %
Testing accuracy:  85.44 %


### Naive Bayes