## SECTION 1 
### Introduction to the problem/task and dataset

PumpkinSeed-ML-Insights is a repository dedicated to the exploration and analysis of Turkish pumpkin seed varieties, with a focus on classifying whether a seed belongs to the Urgup Sivrisi or the Cercevelik species. The project showcases the authors' work in data science and machine learning.

Within this repository, you will find a Jupyter Notebook that walks through the entire process. The repository also contains three Python files, each implementing a different machine learning model: `knn_pumpkinseed.py`, `logistic_regression_pumkinseed.py`, and `neural_network_pumpkinseed.py`. The data is stored in `pumpkin_seeds.csv`, and `Pumpkin_seeds.pdf` provides a description of the dataset.



---
## SECTION 2
### Description of the dataset

`pumpkin_seeds.csv` is a CSV file containing information about pumpkin seeds from Turkey.

This dataset comes from the study *The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.)* by Koklu, M., Sarigil, S., & Ozbek, O. (2021). In their paper, the authors used a product shooting box to obtain quality images of the pumpkin seeds, then converted the images to grayscale and then to binary images. To turn the image data into a CSV file, they extracted 12 morphological features from each image.

Overall, the CSV file has 13 columns and 2500 rows. The first 12 columns hold the extracted morphological features, while the last column records whether the seed belongs to the Urgup Sivrisi or Cercevelik species. Each row represents a single seed used in the study: 1200 are Urgup Sivrisi seeds and 1300 are Cercevelik seeds.

The features found in this CSV file are as follows:
1. Area – Number of pixels within the borders of a pumpkin seed
2. Perimeter – Circumference in pixels of a pumpkin seed
3. Major_Axis_Length – Large axis distance of a pumpkin seed
4. Minor_Axis_Length – Small axis distance of a pumpkin seed
5. Convex_Area – Number of pixels of the smallest convex shell at the region formed by the pumpkin seed.
6. Equiv_Diameter – Computed as $\sqrt{4a/\pi}$, where $a$ is the area of the pumpkin seed.
7. Eccentricity – Eccentricity of a pumpkin seed
8. Solidity – Convex condition of the pumpkin seeds
9. Extent – Ratio of a pumpkin seed area to the bounding box pixels
10. Roundness – Ovality of pumpkin seeds without considering the distortion of the edges.
11. Aspect_Ration – Aspect ratio of the pumpkin seeds
12. Compactness – Proportion of the area of the pumpkin seed relative to the area of the circle with the same circumference
13. Class – Either Cercevelik or Urgup Sivrisi
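
As a quick check on the feature definitions, feature 6 can be recomputed from the area alone. The snippet below is a minimal sketch, not code from the repository: the function name `equiv_diameter` is hypothetical, and the area value is the one shown for the first seed in the preprocessing output further down.

```python
import math

def equiv_diameter(area: float) -> float:
    """Diameter of the circle whose area equals `area` (hypothetical helper)."""
    return math.sqrt(4.0 * area / math.pi)

# Area, in pixels, of the first seed in pumpkin_seeds.csv
first_area = 56276
print(round(equiv_diameter(first_area), 2))
```

Because the formula depends only on `Area`, the two columns are perfectly related, which is worth keeping in mind when interpreting model coefficients.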



---
## SECTION 3

### List of Requirements

---
## SECTION 4

### Data preprocessing and cleaning



In [1]:
import numpy as np
import pandas as pd
import csv

In [2]:
data = []

with open('pumpkin_seeds.csv', 'r', encoding='utf-8', errors='replace') as csv_file:
    raw_data = csv.reader(csv_file)

    #Skip headers
    next(raw_data)

    #Store data into data array
    for row in raw_data:
        row_data = []
        for i in range(13): #Map the mis-decoded class labels to 0 or 1 (depending on their species)
            if i == 12 and row[i] == '�er�evelik':
                row_data.append(int(0))
            elif i == 12 and row[i] == '�rg�p Sivrisi':
                row_data.append(int(1))
            else:
                row_data.append(row[i])

        data.append(row_data)

#Convert data into numpy array
np_data = np.array(data)
np_data


array([['56276', '888.242', '326.1485', ..., '1.4809', '0.8207', '0'],
       ['76631', '1068.146', '417.1932', ..., '1.7811', '0.7487', '0'],
       ['71623', '1082.987', '435.8328', ..., '2.0651', '0.6929', '0'],
       ...,
       ['87994', '1210.314', '507.22', ..., '2.2828', '0.6599', '1'],
       ['80011', '1182.947', '501.9065', ..., '2.4513', '0.6359', '1'],
       ['84934', '1159.933', '462.8951', ..., '1.9735', '0.7104', '1']],
      dtype='<U11')

The first step is to address the encoding issues in the "Class" column. Its unique values are not decoded correctly: the non-ASCII letters in the two species names show up as escape characters such as `\x82` (or as replacement characters once the file is read with `errors='replace'`). Since these values only identify the two species of pumpkin seeds, we replace the mis-decoded strings with integer labels: 0 for Cercevelik and 1 for Urgup Sivrisi. Note that the resulting NumPy array has a string dtype, so the labels are stored as `'0'` and `'1'`.
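
The cell above compares against the exact mis-decoded strings, which ties it to one particular decoding of the file. One possible alternative, sketched here under the assumption that the column contains only these two classes, is to match on the ASCII tail of each name. The helper `encode_class` is hypothetical and not part of the repository.

```python
def encode_class(label: str) -> int:
    """Map a (possibly mis-decoded) class name to 0 or 1 (hypothetical helper)."""
    cleaned = label.strip().lower()
    if cleaned.endswith("evelik"):    # Cercevelik, however its first letters were decoded
        return 0
    if cleaned.endswith("sivrisi"):   # Urgup Sivrisi
        return 1
    raise ValueError(f"Unexpected class label: {label!r}")

# Labels as they look after reading with errors='replace'
print(encode_class("\ufffder\ufffdevelik"), encode_class("\ufffdrg\ufffdp Sivrisi"))
```

Raising on an unknown label makes an unexpected third value fail loudly instead of slipping into the data as a string.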

### Data Scaling

The numerical features are scaled using `StandardScaler` from `sklearn.preprocessing`. This ensures that all features have a mean of 0 and a standard deviation of 1, which is particularly important for many machine learning algorithms.
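
Standardization replaces each value $x$ with $z = (x - \mu)/\sigma$, where $\mu$ and $\sigma$ are the mean and the population standard deviation of that feature. The following is a minimal standard-library sketch of the same calculation on the three area values visible in the array output above; the function name `standardize` is hypothetical.

```python
import statistics

def standardize(values):
    """Z-score a list of numbers using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Three area values taken from the printed array
areas = [56276, 76631, 71623]
scaled = standardize(areas)
print([round(z, 3) for z in scaled])
```

After the transformation the values have mean 0 and standard deviation 1, which is the property `StandardScaler` provides per column.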

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Column names, in the order they appear in pumpkin_seeds.csv
original_column_names = ['area', 'perimeter', 'major_axis_length', 'minor_axis_length', 
                         'convex_area', 'equiv_diameter', 'eccentricity', 'solidity', 
                         'extent', 'roundness', 'aspect_Ration', 'compactness', 'class']

# Convert the numpy array to a pandas DataFrame using the original column names
pumpkin_seeds_data = pd.DataFrame(np_data, columns=original_column_names)

# Selecting only the numerical features for scaling
numerical_features = pumpkin_seeds_data.iloc[:, :-1]

# Initializing the Standard Scaler
scaler = StandardScaler()

# Scaling the numerical features
scaled_numerical_features = scaler.fit_transform(numerical_features)

# Creating a new DataFrame with scaled values using the original column names (excluding 'Class')
scaled_numerical_df = pd.DataFrame(scaled_numerical_features, columns=original_column_names[:-1])

# Adding the non-numerical column ('Class') back to the DataFrame
scaled_pumpkin_seeds_data = pd.concat([scaled_numerical_df, pumpkin_seeds_data['class']], axis=1)

scaled_pumpkin_seeds_data

# Creating the feature set 'X' and the target 'y'
X = scaled_pumpkin_seeds_data.iloc[:, :-1]  # All columns except the last one
y = scaled_pumpkin_seeds_data['class']      # Only the last column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)
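
The cell above fits the scaler on all 2500 rows before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order of operations on synthetic data standing in for the 12 features; all variable names ending in `_demo` are hypothetical, and the numbers it produces are not results from this project.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 12 morphological features and the binary class
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 12))
y_demo = np.repeat([0, 1], 100)

# Split first, so the test rows play no part in fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # statistics come from the training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse those statistics

print(X_tr_scaled.shape, X_te_scaled.shape)
```

With this ordering the test set stays unseen until evaluation, at the cost of test features no longer having exactly zero mean and unit variance.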



---
## SECTION 5 - Exploratory Data Analysis (EDA)

---
## SECTION 6 - Models

### kNN

In [4]:
print(np_data.shape)


(2500, 13)


### Logistic Regression

In [5]:
from sklearn.linear_model import SGDClassifier

logi_model = SGDClassifier(loss = 'log_loss', eta0 = 0.001, max_iter = 200, 
                      learning_rate = 'constant', random_state = 1, verbose = 0)

In [6]:
X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)
X_train, X_test, y_train, y_test

(array([[ 0.35572087,  0.66646474,  0.87137989, ..., -0.87695115,
          0.9175903 , -0.96916952],
        [ 0.01630675,  0.49562138,  0.63304054, ..., -1.19888124,
          0.91220943, -0.98236306],
        [-0.66200911, -0.85867523, -0.90766268, ...,  0.82570131,
         -0.78276657,  0.76106895],
        ...,
        [ 0.58570559,  0.37418546,  0.29899195, ...,  0.37142219,
         -0.2082789 ,  0.10516156],
        [-0.57351403,  0.00906293,  0.46316962, ..., -1.36700029,
          1.47783447, -1.38005403],
        [-0.99066328, -1.41294119, -1.91190717, ...,  1.68239305,
         -1.98016977,  2.51204012]]),
 array([[-0.47930959, -0.86551373, -0.99813754, ...,  1.36940546,
         -1.0951739 ,  1.185147  ],
        [-0.37617513, -0.35504343, -0.30522732, ...,  0.09062761,
         -0.13547881,  0.02788511],
        [ 1.2058299 ,  1.64731728,  1.90923253, ..., -1.44032881,
          1.62596683, -1.5176438 ],
        ...,
        [ 1.58674667,  1.01361676,  0.60362985, ...,  

In [7]:
logi_model.fit(X_train, y_train)

In [8]:
predictions_train = logi_model.predict(X_train)
predictions_test = logi_model.predict(X_test)

In [9]:
def compute_accuracy(predictions, actual):
    
    correct = np.sum(predictions == actual)
    
    accuracy = correct/len(predictions) * 100.0
    
    return accuracy

In [10]:
print("Training accuracy: ", compute_accuracy(predictions_train, y_train),"%")
print("Testing accuracy: ", compute_accuracy(predictions_test, y_test),"%")

Training accuracy:  88.16000000000001 %
Testing accuracy:  85.11999999999999 %


In [11]:
logi_model_bgd = SGDClassifier(loss = 'log_loss', eta0 = 0.001, learning_rate = 'constant', random_state = 1, verbose = 0)

In [12]:
max_epochs = 200

In [13]:
from data_loader import DataLoader
data_loader = DataLoader(X_train, y_train, 10)
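
The `data_loader` module is not among the files listed for this repository, so its implementation is not shown here. Judging only from how it is used in the training loop (`get_batch()` returns two sequences that are zipped together), a stand-in could look like the sketch below. Everything in it, including the shuffling and the `seed` parameter, is an assumption rather than the authors' code.

```python
import numpy as np

class DataLoader:
    """Hypothetical stand-in for the notebook's `data_loader.DataLoader`."""

    def __init__(self, X, y, batch_size, seed=None):
        self.X = np.asarray(X)
        self.y = np.asarray(y)
        self.batch_size = batch_size
        self.rng = np.random.default_rng(seed)

    def get_batch(self):
        """Shuffle the data and return (list of X batches, list of y batches)."""
        order = self.rng.permutation(len(self.X))
        X_shuffled, y_shuffled = self.X[order], self.y[order]
        starts = range(0, len(order), self.batch_size)
        X_batches = [X_shuffled[s:s + self.batch_size] for s in starts]
        y_batches = [y_shuffled[s:s + self.batch_size] for s in starts]
        return X_batches, y_batches

# 25 samples with a batch size of 10 give batches of 10, 10 and 5 rows
demo_loader = DataLoader(np.arange(50).reshape(25, 2), np.arange(25), 10, seed=0)
X_batches, y_batches = demo_loader.get_batch()
print([len(b) for b in X_batches])
```

Reshuffling on every call means each epoch sees the mini-batches in a different order, which is the usual choice for mini-batch gradient descent.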

In [14]:
from sklearn.metrics import log_loss

e = 0
is_converged = False
previous_loss = 0
labels = np.unique(y_train)

# For each epoch
while e < max_epochs and is_converged is not True:
    
    loss = 0
    
    # Get the mini-batches for this epoch
    X_batch, y_batch = data_loader.get_batch()
    
    # For each mini-batch (named X_mb, y_mb to avoid shadowing the global X and y)
    for X_mb, y_mb in zip(X_batch, y_batch):
        
        # Partially fit the model on this mini-batch;
        # partial_fit needs the full set of class labels via `classes`
        logi_model_bgd.partial_fit(X_mb, y_mb, classes=labels)
        
        # Compute the loss on the full training set after this update
        y_pred = logi_model_bgd.predict_proba(X_train)
        loss += log_loss(y_train, y_pred)
        
    # Display the average loss per epoch
    print('Epoch:', e + 1, '\tLoss:', (loss / len(X_batch)))
    
    # Stop once the summed loss changes by less than 0.005 between epochs
    if abs(previous_loss - loss) < 0.005:
        is_converged = True
    else:
        previous_loss = loss
        e += 1

Epoch: 1 	Loss: 0.4595580688557542
Epoch: 2 	Loss: 0.35516975350191476
Epoch: 3 	Loss: 0.33578917902910055
Epoch: 4 	Loss: 0.3279767319217446
Epoch: 5 	Loss: 0.32359851704401693
Epoch: 6 	Loss: 0.32100495262883827
Epoch: 7 	Loss: 0.3195191929069395
Epoch: 8 	Loss: 0.3185104974313403
Epoch: 9 	Loss: 0.3177635524181088
Epoch: 10 	Loss: 0.31697117309753303
Epoch: 11 	Loss: 0.3166406538222048
Epoch: 12 	Loss: 0.3161972442281314
Epoch: 13 	Loss: 0.3159744285119822
Epoch: 14 	Loss: 0.31574850802950377
Epoch: 15 	Loss: 0.31556985149789435
Epoch: 16 	Loss: 0.3154582922351599
Epoch: 17 	Loss: 0.3153380350933481
Epoch: 18 	Loss: 0.31519502472595645
Epoch: 19 	Loss: 0.3151218321962104
Epoch: 20 	Loss: 0.31505597658987405
Epoch: 21 	Loss: 0.31498764551413183
Epoch: 22 	Loss: 0.31491652797790737
Epoch: 23 	Loss: 0.31486072911640256
Epoch: 24 	Loss: 0.31481247938957624
Epoch: 25 	Loss: 0.3147852765719127
Epoch: 26 	Loss: 0.3147099247755563
Epoch: 27 	Loss: 0.31466817383729706
Epoch: 28 	Loss: 0.3146

In [15]:
predictions_train = logi_model_bgd.predict(X_train)
predictions_test = logi_model_bgd.predict(X_test)

In [16]:
print("Training accuracy: ", compute_accuracy(predictions_train, y_train),"%")
print("Testing accuracy: ", compute_accuracy(predictions_test, y_test),"%")

Training accuracy:  88.32 %
Testing accuracy:  85.92 %


## Hyperparameter Tuning

In [18]:
logi_model_bgd.get_params

<bound method BaseEstimator.get_params of SGDClassifier(eta0=0.001, learning_rate='constant', loss='log_loss',
              random_state=1)>

In [None]:
hyperparameters = [
    {
      'min_impurity_decrease': [0.001, 0.01, 0.05, 0.1, 0.3, 0.5],
      'max_depth': [5, 10, 20, 30],
      'min_samples_split': [2, 4, 6, 10, 15, 20],
      'max_leaf_nodes': [3, 5, 10, 20, 50, 100]
    }
]

### Naive Bayes