# Assignment 1: Fruit Classification with a KNN Model

Submission instructions:

 - Make sure the notebook you submit is cleanly and **fully executed**.
 - Submit your work **as a .ipynb notebook file** on Blackboard, in the same place where you downloaded this file.

*Assignment Overview*

- #### Goal:

The objective of this assignment is to apply our understanding of Python and basic machine learning concepts using the "fruit_data_with_colours.csv" dataset. This dataset includes various fruits characterized by features such as mass, width, height, and color score. 

Our goal is to build and evaluate a K-Nearest Neighbors (KNN) model to classify fruits into four categories: **1 = apple, 2 = mandarin, 3 = orange, 4 = lemon**.

- #### Assignment Description:

Questions 1-4 review and apply the Python skills we learned in the previous 522 course. They cover fundamental data analysis steps: exploring the dataset, visualizing the data, and computing summary statistics. These exercises will help you become familiar with the dataset and prepare for the machine learning model.


Questions 5-6 focus on building our first machine learning model. The code for these steps is provided. Your task is to understand and explain each step in the process of creating and training a KNN model: data preparation, model creation, training, and evaluation.



#### Import required libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import sklearn
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

### Question 1: Understand the Dataset Structure

After examining the dataset, answer the following questions:
- What are the features in this dataset? List them.
- Which variable is the target variable in this dataset?


In [3]:
df = pd.read_csv("fruit_data_with_colours.csv")
df


### Question 2: Explore the Dataset

In this question, you will write your own code to explore different characteristics of the dataset. Please answer the following questions by writing the appropriate Python code:

- **Check the Sample Size**: Write code to determine how many entries (rows) are in the dataset.

- **Dataset Information**: Use the `.info()` method to get information about the dataset. What insights can you gather about the data types and the presence of null values in each column from this output?
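
As a rough sketch of the two checks above, here is how they might look on a small made-up DataFrame. The `toy` frame and its values are hypothetical and are not taken from the assignment's CSV; the point is only to show what `.shape` and `.info()` report.

```python
import pandas as pd

# Hypothetical toy data, NOT the assignment dataset.
toy = pd.DataFrame({
    "fruit_label": [1, 1, 3, 4],
    "mass": [180.0, 172.0, None, 120.0],   # one missing value on purpose
    "width": [8.0, 7.4, 7.1, 6.0],
})

# shape is a (rows, columns) tuple; len(toy) gives the same row count.
n_rows, n_cols = toy.shape
print("Rows:", n_rows, "Columns:", n_cols)

# info() prints each column's dtype and its non-null count.
toy.info()

# An explicit per-column count of missing values.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

In the `.info()` output, a non-null count lower than the number of rows signals missing values in that column.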


### Question 3: Visualize the Data

In this question, use data visualization techniques to explore the dataset further. Please write the Python code for each visualization and provide a brief explanation of your findings.

- **Bar Chart of Fruit Types**: Create a bar chart showing the count of each fruit type (`fruit_label`) in the dataset. What does this chart tell you about the frequency of different fruits in the dataset?

- **Pair Plot**: Create a pair plot to visualize the relationships between all numerical variables in the dataset. You may use Seaborn or any other suitable library for this task. What insights can you gather about the relationships between these variables?
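
One possible shape for these two plots is sketched below on made-up data. The `toy` frame is hypothetical, and `pandas.plotting.scatter_matrix` is used here as a stand-in for a Seaborn pair plot; `sns.pairplot` would be an equally reasonable choice.

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical toy data, NOT the assignment dataset.
toy = pd.DataFrame({
    "fruit_label": [1, 1, 2, 3, 3, 3],
    "mass": [180, 172, 80, 150, 160, 155],
    "width": [8.0, 7.4, 5.9, 7.1, 7.2, 7.0],
    "height": [6.8, 7.0, 4.3, 7.5, 7.8, 7.4],
})

# Bar chart: count the rows per label, then draw one bar per label.
counts = toy["fruit_label"].value_counts().sort_index()
fig, ax = plt.subplots()
counts.plot(kind="bar", ax=ax)
ax.set_xlabel("fruit_label")
ax.set_ylabel("count")

# Pairwise scatter plots of the numeric features (histograms on the diagonal).
axes = scatter_matrix(toy[["mass", "width", "height"]], figsize=(6, 6))
```

Bars of very different heights would indicate that some classes are under-represented, which matters later when interpreting accuracy.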

### Question 4: Summary Statistics Analysis
Generate the descriptive statistics of the numerical variables in the dataset. 
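
A minimal sketch of descriptive statistics on made-up numbers (the `toy` frame is hypothetical, not the assignment data) shows what the summary table contains:

```python
import pandas as pd

# Hypothetical toy data, NOT the assignment dataset.
toy = pd.DataFrame({
    "mass": [100.0, 200.0, 300.0],
    "width": [6.0, 7.0, 8.0],
})

# describe() reports count, mean, std, min, quartiles, and max per numeric column.
stats = toy.describe()
print(stats)
```

Comparing the mean and the median (the `50%` row), or the ranges of different columns, is a quick way to spot skew and features measured on very different scales.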


## Question 5: Building a KNN Model

### Step 1: Data Preparation

In [None]:
# Importing the function for splitting data into train and test sets
from sklearn.model_selection import train_test_split

# Selecting features and target variable
X = df[['mass', 'width', 'height', 'color_score']]
y = df['fruit_label']

# Splitting the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)


# Printing the shapes of the training and test sets for both features and target
print("Training set shape (features):", X_train.shape)
print("Training set shape (target):", y_train.shape)
print("Test set shape (features):", X_test.shape)
print("Test set shape (target):", y_test.shape)

#### Questions:

- What does `X` represent in this context?
- What does `y` represent?
- Explain why we need to split the data into training and test sets.
- What are the roles of `test_size=0.25` and `random_state=0` in this context?
- What do the shapes of `X_train` and `y_train` indicate about the training dataset?
- What do the shapes of `X_test` and `y_test` tell you about the test dataset?
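
To experiment with the split parameters in isolation, a small sketch on made-up arrays can help. The names `X_toy` and `y_toy` are hypothetical; with 20 samples the arithmetic is easy to check by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, 2 classes.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# test_size=0.25 holds out a quarter of the rows for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)  # 15 training rows and 5 test rows

# Repeating the call with the same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
same_split = np.array_equal(X_te, X_te_again)
print("Identical split:", same_split)
```

Changing `random_state` to another integer shuffles the rows differently, so the held-out samples (and possibly the accuracy) change while the shapes stay the same.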

### Step 2: Creating and Training the KNN Model

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Creating an instance of KNN classifier
knn = KNeighborsClassifier(n_neighbors=1)

# Training the model
knn.fit(X_train, y_train)

#### Questions:

- Is the problem we are addressing with our KNN model a classification or a regression problem? Explain the basis for your determination.
- Describe how the KNN (K-Nearest Neighbors) classifier works. For example: how does the KNN algorithm determine the class of a new data point?
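
The idea behind KNN can be illustrated with a short from-scratch sketch. This is not scikit-learn's implementation; the function `knn_predict` and the toy points are hypothetical, and the sketch assumes Euclidean distance with an unweighted majority vote.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=1):
    """Return the majority label among the k training points closest to x_new."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    x_new = np.asarray(x_new, dtype=float)

    # Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]

    # Majority vote among the labels of those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters with labels 1 and 3.
X_toy = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8]]
y_toy = [1, 1, 1, 3, 3]

print(knn_predict(X_toy, y_toy, [2, 2], k=1))  # closest point belongs to label 1
print(knn_predict(X_toy, y_toy, [8, 9], k=3))  # two of the three neighbors have label 3
```

With `k=1` the prediction is simply the label of the single closest training point, which is the setting used by `KNeighborsClassifier(n_neighbors=1)` above.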

### Step 3: Model Evaluation

In [None]:
# Evaluating the model
accuracy = knn.score(X_test, y_test)
print("Model accuracy on the test set:", accuracy)

#### Questions:

- Explain the significance of this evaluation step.
- How would you interpret the accuracy score obtained?
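
For a classifier, `score` returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label. A tiny sketch with made-up labels (both arrays are hypothetical, not model output) makes the arithmetic concrete:

```python
import numpy as np

# Hypothetical labels, NOT results from the assignment's model.
y_true = np.array([1, 3, 4, 3, 2])
y_pred = np.array([1, 3, 4, 1, 2])

# Element-wise comparison gives True/False; the mean of that is the accuracy.
accuracy = np.mean(y_true == y_pred)
print(accuracy)  # 4 of 5 predictions match, so 0.8
```

Accuracy alone does not show which classes are confused with each other, so it is worth reading alongside the class counts from Question 3.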

### Question 6: Making Predictions with the KNN Model

Now that we have built our KNN model, let's see how it performs on new data. Consider the following two examples:

- Example 1: A small fruit with mass 18 g, width 3.3 cm, height 4.5 cm, and a color score of 0.59.
- Example 2: A larger, elongated fruit with mass 110 g, width 7.8 cm, height 9.3 cm, and a color score of 0.71.

Use the code below to predict the type of fruit for each example using our trained KNN model, and observe the results:

In [None]:
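# Each example lists its values in the same order as the training columns:
# mass, width, height, color_score.
# predict() expects a 2D input (one row per sample), hence the outer list.

# Prediction for the first example
first_example = [18, 3.3, 4.5, 0.59]
first_prediction = knn.predict([first_example])

# Prediction for the second example
second_example = [110, 7.8, 9.3, 0.71]
second_prediction = knn.predict([second_example])

# Each prediction is an array holding one numeric fruit label (1-4)
print("Prediction for the first example:", first_prediction)
print("Prediction for the second example:", second_prediction)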

#### Question:

- What are the model's predictions for the first and second examples? 
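
Because the model outputs numeric labels, a small lookup can make the predictions easier to read. The mapping below comes from the label definitions in the assignment overview; the helper `label_to_name` and the sample input are hypothetical and do not represent the model's actual output.

```python
# Label-to-name mapping as given in the assignment overview.
label_names = {1: "apple", 2: "mandarin", 3: "orange", 4: "lemon"}


def label_to_name(predictions):
    """Translate an iterable of numeric fruit labels into fruit names."""
    return [label_names[int(label)] for label in predictions]


# Hypothetical input for illustration only, not real model predictions.
print(label_to_name([1, 3]))  # ['apple', 'orange']
```

Passing `first_prediction` or `second_prediction` to this helper would return the corresponding fruit name in a one-element list.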