This is the template for the image recognition exercise. <Br>
Some **general instructions**, read these carefully:
 - The final assignment is returned as a clear and understandable *report*
    - briefly define the concepts and explain the phases you use
    - use the Markdown feature of the notebook for larger explanations
 - return your output as a *working* Jupyter notebook
 - name your file Exercise_MLPR2023_Partx_uuid.ipynb
    - use the uuid code determined below
    - use this same code for each part of the assignment
 - write easily readable code with comments
     - if you use code from the web, provide a reference
 - it is OK to discuss the assignment with a friend, but it is not OK to copy someone else's work. Everyone must submit their own implementation
     - if two submissions are identical, both fail

**Deadlines:**
- Part 1: **Mon 6.2 at 23:59**
- Part 2: **Mon 20.2 at 23:59**
- Part 3: **Mon 6.3 at 23:59**

**No extensions for the deadlines** <br>
- after each deadline, example results are given, and it is not possible to submit anymore

**If you encounter problems, Google first and if you can’t find an answer, ask for help**
- Moodle area for questions
- pekavir@utu.fi
- teacher available for questions
    - Monday 30.1 at 14:00-15:00 room 407B Honka (Agora 4th floor)
    - Monday 13.2 at 14:00-15:00 room 407B Honka (Agora 4th floor)
    - Thursday 2.3 at lecture 10:15-12:00 

**Grading**

The exercise counts toward the course grade. The course exam has 5 questions worth 6 points each. The exercise gives up to 6 points, so the maximum total score is 36 points.

From the template below, you can see how many exercise points can be acquired from each task. Exam points are given according to the table below: <br>
<br>
7 exercise points: 1 exam point <br>
8 exercise points: 2 exam points <br>
9 exercise points: 3 exam points <br>
10 exercise points: 4 exam points <br>
11 exercise points: 5 exam points <br>
12 exercise points: 6 exam points <br>
<br>
To pass the exercise, you need at least 7 exercise points, and at least 1 exercise point from each Part.
    
Each student grades one peer submission and their own submission. After each Part deadline, example results are given. Study them carefully and grade according to the given instructions. The mean of the peer grade and the self-grade determines the final points.

In [7]:
import uuid
# Run this cell only once and save the code. Use the same id code for each Part.
# The id was generated once with uuid1() and is kept here as a fixed string so it never changes
print("The id code is: ", end="")
print("52875c60-a63d-11ed-874c-00155d8198c4")

The id code is: 52875c60-a63d-11ed-874c-00155d8198c4


# Part 1

Read the original research article:

İ. Çınar and M. Koklu. Identification of rice varieties using machine learning algorithms. Journal of Agricultural Sciences, 28(2):307–325, 2022. doi: 10.15832/ankutbd.862482.

https://dergipark.org.tr/en/download/article-file/1513632

## Introduction

Will be written in Part 3

## Preparations of the data (1 p)

Make three folders in your working folder: "notebooks", "data" and "training_data". Save this notebook in the "notebooks" folder.
<br> <br>
Perform preparations for the data
- import all the packages needed for this notebook in one cell
- import the images. The data can be downloaded from https://www.muratkoklu.com/datasets/vtdhnd09.php (the download starts as soon as you click the link) <br>
    - save the data folders "Arborio", "Basmati" and "Jasmine" in the "data" folder
- take a random sample of 100 images from each of the Arborio, Basmati and Jasmine rice species (i.e. 300 images in total)
- determine the contour of each rice grain (you can use e.g. *findContours* from OpenCV)
- plot one example image of each rice species, including the contour

## Feature extraction (2 p)

Gather the feature data <br>
<br>
Color features (15) <br>
- Calculate the following color features for each image, including only the pixels within the contour (you can use e.g. *pointPolygonTest* from OpenCV)
    - Mean for each RGB color channel
    - Variance for each RGB color channel
    - Skewness for each RGB color channel
    - Kurtosis for each RGB color channel
    - Entropy for each RGB color channel
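
One way to compute the five statistics per channel is sketched below. It assumes the pixels inside the contour are already selected by a boolean mask (how that mask is built, e.g. with *pointPolygonTest*, is left open), that images are in OpenCV's BGR channel order, and that entropy means the Shannon entropy of the 256-bin intensity histogram in bits. The function names are hypothetical and the article's own definitions should take precedence.

```python
import numpy as np
from scipy import stats


def channel_features(values):
    """Mean, variance, skewness, kurtosis and histogram entropy of one channel."""
    values = np.asarray(values, dtype=np.float64)
    # Histogram over the 256 possible 8-bit intensities.
    counts = np.bincount(values.astype(np.int64), minlength=256)
    return {
        "mean": values.mean(),
        "variance": values.var(),
        "skewness": stats.skew(values),
        "kurtosis": stats.kurtosis(values),  # Fisher definition: a normal distribution gives 0
        "entropy": stats.entropy(counts, base=2),  # counts are normalized internally
    }


def color_features(image_bgr, mask):
    """Return the 15 color features for the pixels where mask is True."""
    pixels = image_bgr[mask]  # shape (n_pixels, 3), channels in B, G, R order
    features = {}
    for index, channel in enumerate("BGR"):
        for name, value in channel_features(pixels[:, index]).items():
            features[f"{name}_{channel}"] = float(value)
    return features


# Small random image and a rectangular mask, purely for illustration.
rng = np.random.default_rng(0)
demo_image = rng.integers(0, 256, size=(20, 20, 3), dtype=np.uint8)
demo_mask = np.zeros((20, 20), dtype=bool)
demo_mask[5:15, 5:15] = True
demo_features = color_features(demo_image, demo_mask)
```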
    
Dimension features (6) <br>
- Fit an ellipse to the contour points (you can use e.g. *fitEllipse* from OpenCV)
- Plot one example image of each rice species including the fitted ellipse
- Calculate the following features for each image (for details, see the original article)
    - the major axis length of the ellipse
    - the minor axis length of the ellipse
    - area inside the contour (you can use e.g. *contourArea* from OpenCV)
    - perimeter of the contour (you can use e.g. *arcLength* from OpenCV)
    - roundness
    - aspect ratio
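
The two derived features can be computed from the four measured ones. The sketch below assumes roundness is defined as 4πA/P² and aspect ratio as major axis divided by minor axis; these are common definitions, but check them against the original article before relying on them. The function name `shape_features` is hypothetical, and the two axis lengths are sorted explicitly so that their order in the input does not matter.

```python
import math


def shape_features(axis_a, axis_b, area, perimeter):
    """Return the six dimension features from two ellipse axis lengths, area and perimeter."""
    # Sort the axes so the result does not depend on the order they are passed in.
    major_axis, minor_axis = max(axis_a, axis_b), min(axis_a, axis_b)
    return {
        "major_axis": major_axis,
        "minor_axis": minor_axis,
        "area": area,
        "perimeter": perimeter,
        # Assumed definition: 1.0 for a perfect circle, smaller for elongated shapes.
        "roundness": 4.0 * math.pi * area / perimeter**2,
        # Assumed definition: always >= 1, larger for elongated grains.
        "aspect_ratio": major_axis / minor_axis,
    }


# A circle of radius 3 has roundness 1 and aspect ratio 1 under these definitions.
circle = shape_features(6.0, 6.0, math.pi * 3.0**2, 2.0 * math.pi * 3.0)
```

Under these definitions roundness never exceeds 1 and aspect ratio is never below 1. The feature table shown in Part 2 of this notebook has roundness values above 1 and aspect ratios below 1, which is what the reciprocal conventions would produce.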
    
Gather all the features in one array or dataframe: one data point per row, with the feature values in columns. <br>
For each data point, also include a reference to the original image and the label (rice species). Save the data in the "training_data" folder.

# Part 2

## Data exploration (2 p)

- Standardize the data
- Plot a boxplot of each feature
- Plot a histogram of each feature, using a different color for each class
- Plot a pairplot (each feature against each feature and the label against each feature)
- Discuss your findings from the above figures, e.g. can you spot features which might be very useful in predicting the correct class? 
- Fit PCA using two components
- Plot the PCA figure with two components, color the data points according to their species
- Can you see any clusters in the PCA plot? Does this figure give you any clues about how well you will be able to classify the image types? Explain.
- How many PCA components are needed to cover 99% of the variance?
- Make clear figures, use titles and legends for clarification

## Model selection (2 p)

Select the best model for each classifier. Use 5-fold repeated cross-validation with 3 repetitions (*RepeatedKFold* from sklearn). Where the hyperparameter ranges are not stated below, you can choose them yourself (i.e. the values from which the best hyperparameters are selected). <br>

- k Nearest Neighbors classifier: hyperparameter k
- random forest: hyperparameters max_depth and max_features
- MLP: use one hidden layer and early stopping. Hyperparameters:
    - number of neurons in the hidden layer
    - activation function: logistic sigmoid function and rectified linear unit function
    - solver: stochastic gradient descent and adam
    - validation_fraction: 0.1 and 0.5

For each classifier:
- Report the best hyperparameter or the best combination of hyperparameters. <br>
- Plot the accuracy versus the hyperparameter/hyperparameter combination and highlight the best value. <br>

For the random forest model, report the feature importance for each feature. Which features seem to be the most important? Does this correspond to the observations you made in the data exploration? <br>
Reflect on the model selection process. What should be considered when selecting the model to be used?
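
The selection procedure could look roughly like this for the kNN classifier. It is a sketch under assumptions: synthetic data from `make_classification` stands in for the rice features, the range of k values is arbitrary, and `GridSearchCV` with a scaling pipeline is one possible way to run the search. The function name `select_knn` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def select_knn(X, y, k_values=range(1, 16, 2), seed=0):
    """Grid-search k with 5-fold cross-validation repeated 3 times."""
    cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=seed)
    # Scaling inside the pipeline is refit on each training fold, avoiding leakage.
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
    grid = {"kneighborsclassifier__n_neighbors": list(k_values)}
    search = GridSearchCV(pipeline, grid, cv=cv, scoring="accuracy")
    return search.fit(X, y)


# Synthetic stand-in: 300 samples, 21 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=21, n_informative=6, n_classes=3, random_state=0
)
search = select_knn(X_demo, y_demo)
best_k = search.best_params_["kneighborsclassifier__n_neighbors"]
mean_accuracy = search.cv_results_["mean_test_score"]  # one value per candidate k
```

`mean_accuracy` can be plotted against the candidate k values to highlight the best one. The random forest and MLP searches would follow the same pattern with a different estimator and parameter grid.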

In [13]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
import numpy as np

In [10]:
df = pd.read_parquet("./training_data/rice_feature_data.parquet")

In [14]:
df.sample(5)

Unnamed: 0,Full path,File name,Species,Mean B,Mean G,Mean R,Variance B,Variance G,Variance R,Skew B,...,kurtosis R,entropy B,entropy G,entropy R,Major axis,Minor axis,Area,Perimeter,Roundness,Aspect ratio
267,./Rice_Image_Dataset/data/Jasmine/Jasmine (136...,Jasmine (13651).jpg,Jasmine,127.252,121.74359,121.74359,122,224,224,-0.244651,...,-1.908758,8.641166,8.63378,8.63378,63.808964,202.088409,9945.5,468.801077,1.758493,0.315748
210,./Rice_Image_Dataset/data/Jasmine/Jasmine (786...,Jasmine (7863).jpg,Jasmine,101.451191,96.20907,96.20907,234,89,89,0.249342,...,-1.920471,7.767942,7.756532,7.756532,47.689842,146.179672,5353.0,341.421353,1.732903,0.326241
3,./Rice_Image_Dataset/data/Arborio/Arborio (277...,Arborio (2776).jpg,Arborio,144.965049,142.314827,142.314827,56,243,243,-0.60699,...,-1.578551,8.50137,8.496563,8.496563,72.970932,133.909424,7593.0,356.391916,1.331167,0.544928
214,./Rice_Image_Dataset/data/Jasmine/Jasmine (884...,Jasmine (8840).jpg,Jasmine,126.860804,120.969046,120.969046,96,163,163,-0.248986,...,-1.886292,7.997342,7.988523,7.988523,49.607388,137.599854,5209.0,324.877197,1.612405,0.360519
288,./Rice_Image_Dataset/data/Jasmine/Jasmine (736...,Jasmine (7363).jpg,Jasmine,191.300079,185.004562,185.004562,54,170,170,-2.352079,...,3.695892,8.42216,8.414696,8.414696,47.699261,144.892746,5174.0,328.960456,1.664375,0.329204


- Standardize the data

In [27]:
new_df = df.drop(columns=["Full path", "File name", "Species"])  # drop the non-numeric columns before scaling
new_df

Unnamed: 0,Species,Mean B,Mean G,Mean R,Variance B,Variance G,Variance R,Skew B,Skew G,Skew R,...,kurtosis R,entropy B,entropy G,entropy R,Major axis,Minor axis,Area,Perimeter,Roundness,Aspect ratio
0,Arborio,163.935239,160.837732,160.837732,48,30,30,-0.941987,-0.938905,-0.938905,...,-1.012412,8.705178,8.699617,8.699617,74.493927,146.274567,8343.5,379.404108,1.372922,0.509275
1,Arborio,141.885234,138.871421,138.871421,234,90,90,-0.497980,-0.493298,-0.493298,...,-1.706574,8.535238,8.529326,8.529326,74.299797,140.718155,8127.5,372.492421,1.358526,0.528004
2,Arborio,138.714764,132.149394,132.149394,235,180,180,-0.627457,-0.623434,-0.623434,...,-1.559273,8.336205,8.328647,8.328647,65.685539,127.541199,6418.0,333.764500,1.381245,0.515014
3,Arborio,144.965049,142.314827,142.314827,56,243,243,-0.606990,-0.604705,-0.604705,...,-1.578551,8.501370,8.496563,8.496563,72.970932,133.909424,7593.0,356.391916,1.331167,0.544928
4,Arborio,138.763231,137.686920,137.686920,97,1,1,-0.411155,-0.405394,-0.405394,...,-1.799149,8.536842,8.531288,8.531288,74.780594,147.199463,8460.0,375.078208,1.323316,0.508022
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,Jasmine,87.889934,84.000800,84.000800,241,26,26,0.522350,0.524273,0.524273,...,-1.710700,7.574416,7.563671,7.563671,43.512386,158.146530,5151.0,354.735061,1.944047,0.275140
296,Jasmine,161.720974,152.693706,152.693706,5,71,71,-1.097680,-1.094952,-1.094952,...,-0.743252,8.409119,8.394948,8.394948,48.630066,165.142944,5979.0,370.901586,1.830960,0.294473
297,Jasmine,71.654098,70.055389,70.055389,156,208,208,0.759720,0.759100,0.759100,...,-1.414350,8.162082,8.160696,8.160696,71.459953,217.993317,10916.0,495.279220,1.788244,0.327808
298,Jasmine,165.849721,157.775233,157.775233,97,59,59,-1.379652,-1.378129,-1.378129,...,-0.010946,8.374257,8.366791,8.366791,48.329914,149.392639,5507.5,342.031526,1.690316,0.323509


In [28]:
scaler = StandardScaler()  # instantiate the scaler; calling fit_transform on the class itself raised the TypeError

scaled_df = scaler.fit_transform(new_df)  # returns a NumPy array of standardized features

- Plot a boxplot of each feature

- Plot a histogram of each feature, using a different color for each class

- Plot a pairplot (each feature against each feature and the label against each feature)

- Discuss your findings from the above figures, e.g. can you spot features which might be very useful in predicting the correct class? 
- Fit PCA using two components
- Plot the PCA figure with two components, color the data points according to their species
- Can you see any clusters in the PCA plot? Does this figure give you any clues about how well you will be able to classify the image types? Explain.
- How many PCA components are needed to cover 99% of the variance?
- Make clear figures, use titles and legends for clarification
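
The PCA steps in the list above could be sketched as follows. The helper names are hypothetical and random data stands in for the feature table; the idea is to standardize, project onto two components for plotting, and read the number of components for 99% of the variance from the cumulative explained variance ratio.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_2d(X):
    """Standardize X and return its projection onto the first two principal components."""
    X_std = StandardScaler().fit_transform(X)
    return PCA(n_components=2).fit_transform(X_std)


def components_for_variance(X, threshold=0.99):
    """Return the smallest number of components whose cumulative variance reaches threshold."""
    X_std = StandardScaler().fit_transform(X)
    cumulative = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
    # searchsorted finds the first index where cumulative >= threshold.
    n_components = int(np.searchsorted(cumulative, threshold)) + 1
    return min(n_components, len(cumulative)), cumulative


# Random stand-in data: 300 samples with 21 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 21))
points_2d = project_2d(X_demo)
n_needed, cumulative = components_for_variance(X_demo)
```

`points_2d` can be passed to a scatter plot colored by species, and `cumulative` can be plotted against the component index to show where the 99% line is crossed.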