# Problem Session 01
## Classifying Pumpkin Seeds

In this notebook we will utilize the following tools we learned about in lecture `01_supervised_learning`:

- Obtaining data
- Data cleaning
- Exploratory Data Analysis
- Modeling
- Pipelines
- Basic model evaluation.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")

#### 1. Load the data

##### a.

First load the data stored in `Pumpkin_Seeds_Dataset.xlsx` in the `data` folder.

Note you will want to use the `read_excel` function from `pandas`, <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html?highlight=read_excel">https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html?highlight=read_excel</a>. Print a random sample of five rows.

In [None]:
# load data
seeds = 

In [None]:
# print 5 random rows

##### b.

Create a new column of the `DataFrame` called `y` where `y=1` if `Class=Ürgüp Sivrisi` and `y=0` if `Class=Çerçevelik`.
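
As a minimal sketch of one way to build such an indicator column, the snippet below uses a small hypothetical `toy` frame in place of the real data. The `Class` strings are copied from the text above, so check that they match the values in your loaded `DataFrame` exactly.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data; only the Class column matters here.
toy = pd.DataFrame(
    {"Class": ["Çerçevelik", "Ürgüp Sivrisi", "Ürgüp Sivrisi", "Çerçevelik"]}
)

# The comparison yields booleans; astype(int) maps True -> 1 and False -> 0.
toy["y"] = (toy["Class"] == "Ürgüp Sivrisi").astype(int)

print(toy)
```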

In [None]:
# Create column here

#### 2. Learn about the data

##### a.

These data are measurements of pumpkin seeds extracted from high-quality photos of the seeds. The data were provided as supplementary material to <a href="https://link.springer.com/article/10.1007/s10722-021-01226-0">The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.)</a> by Koklu, Sarigil and Ozbek (2021).

In this work the researchers demonstrated how various algorithms could be used to predict whether a pumpkin seed was a Ürgüp Sivrisi seed or a Çerçevelik seed. These data were generated by engineering features from special photos of seeds like so:
<br>
<br>
<img src="problem_session_assets/pumpkin_seeds.jpg" width="55%"></img>

As you can see, these two types of seed can be quite difficult for the human eye to tell apart, hence the appeal to machine learning algorithms.

A PDF of this paper is provided here, <a href="problem_session_assets/pumpkin_seed_paper.pdf">pumpkin_seed_paper.pdf</a>. Scroll down to Figure 5 and Table 1 and read about the features of this data set.

#### 3. Train test split

##### a.

Look at how the data is split between the two classes. Does this appear to be imbalanced data? <i>Recall that we say data is imbalanced if one of the classes has a very small presence in the data set.</i>
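
One possible way to check the class split is `value_counts(normalize=True)`, sketched here on a hypothetical `toy` frame whose `y` column stands in for the target. The 52/48 split is made up for illustration and says nothing about the real data.

```python
import pandas as pd

# Hypothetical target column: 52 rows of class 0 and 48 rows of class 1.
toy = pd.DataFrame({"y": [0] * 52 + [1] * 48})

# normalize=True returns the fraction of rows in each class instead of raw counts.
class_share = toy["y"].value_counts(normalize=True)

print(class_share)
```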

In [None]:
# Check percentage of targets of each class.

This data set seems pretty well balanced.

##### b.

Make a train test split, setting aside $20\%$ of the data as the test set.  You should stratify with respect to the target so that both sets keep the same class balance.
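
A stratified split can be sketched with `train_test_split` from `sklearn.model_selection`. The `toy` frame and its `feature` column are hypothetical placeholders; the point is only to show the `stratify` argument and to confirm that the class proportions survive the split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data set with 100 rows.
toy = pd.DataFrame({"feature": range(100), "y": [0] * 50 + [1] * 50})

# stratify=toy["y"] keeps the class proportions (nearly) equal in both pieces.
toy_train, toy_test = train_test_split(
    toy,
    test_size=0.2,
    shuffle=True,
    stratify=toy["y"],
    random_state=123,
)

print(len(toy_train), len(toy_test))
print(toy_test["y"].value_counts(normalize=True))
```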

In [None]:
# import train_test_split

In [None]:
# Create stratified split.  Use 123 as random seed to agree with solutions.
# Name them seeds_train and seeds_test

To compare different candidate models we will use a single *validation set*.  Further split the training set into training and validation sets.

In [None]:
# Create another stratified split of seeds_train using 123 as random seed.
# Name them seeds_tt (short for "train_train") and seeds_val (short for "validation").


#### 4. Exploratory data analysis (EDA)

Before building any models you will do some EDA.

##### a. 

One way to try and identify key features for classification algorithms is to plot histograms of the feature values for each of the classes.

Below is an example of such a histogram for the `Area` column made using `plt.hist`.

In [None]:
plt.figure(figsize=(9,5))


plt.hist(seeds_tt.loc[seeds_tt.y==0].Area,
            color='blue',
            alpha=.8,
            label="$y=0$")

plt.hist(seeds_tt.loc[seeds_tt.y==1].Area,
            color='red',
            alpha=.4,
            hatch = '\\',
            edgecolor='black',
            label="$y=1$")

plt.xlabel("Area", fontsize=14)
plt.legend(fontsize=14)

plt.show()

In this plot we can see that the two histograms are right on top of one another, indicating that the two classes of pumpkin seeds tend to have similar areas. This suggests that `Area` may not be a useful variable for discerning the seed class.

Use a `for` loop or some comparable method to produce similar histograms for each of the features. Write down the features that look like they may be useful for classification.
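
One way to structure such a loop is sketched below. The helper name `plot_class_histograms`, the `toy` frame, and the feature names `feat_a` and `feat_b` are all hypothetical; with the real data you would pass `seeds_tt` and its feature column names instead.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_class_histograms(df, features, target="y"):
    """Draw one overlaid histogram per feature, split by the binary target."""
    figs = []
    for feature in features:
        fig, ax = plt.subplots(figsize=(9, 5))
        ax.hist(df.loc[df[target] == 0, feature],
                color="blue", alpha=0.8, label="$y=0$")
        ax.hist(df.loc[df[target] == 1, feature],
                color="red", alpha=0.4, hatch="\\",
                edgecolor="black", label="$y=1$")
        ax.set_xlabel(feature, fontsize=14)
        ax.legend(fontsize=14)
        figs.append(fig)
    return figs


# Hypothetical data: feat_a differs between the classes, feat_b does not.
rng = np.random.default_rng(123)
y = np.repeat([0, 1], 50)
toy = pd.DataFrame({
    "y": y,
    "feat_a": rng.normal(loc=2.0 * y, scale=1.0),
    "feat_b": rng.normal(loc=0.0, scale=1.0, size=y.size),
})

figs = plot_class_histograms(toy, ["feat_a", "feat_b"])
```

In a notebook the figures are displayed when the cell finishes; call `plt.show()` after the function if you run this as a script.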

In [None]:
# Write for loop here.

These features seem like they may be useful in classifying the seeds.
- `Major_Axis_Length`
- `Eccentricity`
- `Roundness`
- `Aspect_Ration`
- `Compactness`

##### b.

Now try making a `seaborn` `pairplot` using the variables you identified in part <i>a.</i> as the arguments for `x_vars` and `y_vars`. Use `y` as the argument to `hue`. The main goal with this question is to see if you can identify any pairs of variables that seem to separate the two classes. You will use these plots later in the notebook.

In [None]:
# Make pairplot here

Discuss anything interesting you see in the plots here.

#### 5. Metric selection

Before building any models, you will decide how to measure their performance.

##### a.

Now that you have read about the data and looked at the split between the two classes, what seems like a reasonable performance metric for this problem? Explain your answer.

##### b.

Recalling that `y=1` implies that the seed is of the Ürgüp Sivrisi class and `y=0` implies that the seed is of the Çerçevelik class, what do the following metrics measure in the context of this classification problem:
- recall
- precision
- false positive rate.
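
As a reminder of the general definitions, the sketch below computes all three metrics from a confusion matrix. The labels `y_true` and `y_pred` are made-up values for illustration only; interpreting the numbers in terms of the two seed classes is left to you.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 4 actual positives (y=1) and 6 actual negatives (y=0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)                # share of actual positives that were found
precision = tp / (tp + fp)             # share of predicted positives that are correct
false_positive_rate = fp / (fp + tn)   # share of actual negatives flagged as positive

print(recall, precision, false_positive_rate)
```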

#### 6. Initial modeling attempts

In the remainder of this notebook you will make some initial models.

##### a.

You will train each model using `seeds_tt`.  You will then evaluate the accuracy of each model on `seeds_val`.  We don't touch the final test set until we are satisfied with the performance of one of our models.

Since [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) applies L2 regularization by default, it *is* sensitive to the scale of the data.  We could just turn that off with `penalty=None` (`penalty='none'` in scikit-learn versions before 1.2).  To give you some practice with pipelines, instead put the logistic regression models in a [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) with [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

Also compare to the baseline [DummyClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html) model which just predicts the most frequent class.
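
The overall pattern (a scaled logistic regression pipeline compared against a most-frequent baseline on a single validation set) can be sketched as follows. The synthetic arrays `X` and `y` and the dictionary names are assumptions for illustration; in the exercise you would use columns of `seeds_tt` and `seeds_val` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data with two well-separated features.
rng = np.random.default_rng(123)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(3, 1, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
X[:, 1] *= 1000  # put the two features on very different scales

X_tt, X_val, y_tt, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "logreg": Pipeline([
        ("scale", StandardScaler()),      # fit on the training data only
        ("logreg", LogisticRegression()),
    ]),
}

# Fit each model on the training portion and score it on the validation set.
val_accuracy = {}
for name, model in models.items():
    model.fit(X_tt, y_tt)
    val_accuracy[name] = accuracy_score(y_val, model.predict(X_val))

best_model = max(val_accuracy, key=val_accuracy.get)
print(val_accuracy, best_model)
```

Because the scaler lives inside the pipeline, it is fit only on the data passed to `fit`, so no information from the validation set leaks into the scaling step.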

In [None]:
# Import everything you will need here

In [None]:
# Define a model accuracy array or dictionary to hold the accuracies.  

# Now write your loop over the candidate models.  Instantiate each model inside the loop.  Remember to write pipelines!
# Fit on the "seeds_tt" data and evaluate on "seeds_val" data.
# Record the validation accuracy in the appropriate place in your accuracy array or dictionary.


In [None]:
# Determine which of the models had the best validation accuracy.

##### c.

Compare these models to the logistic regression model that incorporates all of the features you identified with your histogram exploration.  Which will you choose as your final model and why?

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).