# Introduction to Python tools for data science

## NumPy
NumPy is the fundamental package for scientific computing with Python. It contains, among other things:
- a powerful N-dimensional array object
- sophisticated (broadcasting) functions
- tools for integrating C/C++ and Fortran code
- useful linear algebra, Fourier transform, and random number capabilities
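To make two items from the list above concrete, here is a minimal sketch of broadcasting and of the difference between element-wise multiplication, the dot product, and the cross product. The array names (`col`, `row`, `a`, `b`) are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# Broadcasting: a (3, 1) column and a (4,) row are stretched to a common (3, 4) shape
col = np.array([[0], [10], [20]])
row = np.array([1, 2, 3, 4])
grid = col + row
print(grid.shape)
print(grid)

# Three different "products" of two 3-element vectors
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
elementwise = a * b      # multiplies matching entries, result has the same shape
dot = np.dot(a, b)       # sum of the element-wise products, a single number
cross = np.cross(a, b)   # vector perpendicular to both a and b
print(elementwise, dot, cross)
```

Note that `np.dot` on two 1D arrays returns a scalar, whereas `np.cross` returns a vector, so the two operations are not interchangeable.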


In [None]:
import numpy as np

# create a simple array
z = np.array([1, 2, 3, 4, 5])
print(f"z:{z}")

# create a simple 2D array
y = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"y:\n{y}")

# perform some operations on the arrays
print(f"z + 1: {z + 1}")
print(f"y * 2:\n{y * 2}")

# some complex operations
y_1d = y.flatten() # reshape y to a 1D array
print(f"y_1d:\n{y_1d}")

# use only the first 5 elements of y_1d and multiply them with z (element-wise)
print(f"y_1d * z:\n{y_1d[:5] * z}")

# use only the first 5 elements of y_1d and compute their dot product (inner product) with z
print(f"y_1d dot z:\n{np.dot(y_1d[:5], z)}")

## matplotlib
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Generate 100 evenly spaced numbers from -10 to 10 (both endpoints included)
x = np.linspace(-10, 10, 100)
# Create a second array using sine
y = np.sin(x)
# The plot function makes a line chart of one array against another
plt.plot(x, y, marker="x")

We can also create more advanced graphs with matplotlib and add some interactivity.

In [None]:
# Generate some data
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)

# Create the figure and axis objects
fig, ax = plt.subplots()

# Plot the data and fix the y-limits so that amplitudes up to 2 remain visible
line, = ax.plot(x, y, color='blue', lw=2)
ax.set_ylim(-2.2, 2.2)

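# Create a function to update the plot based on user input
# Note: with the inline backend the displayed figure may not refresh on its own;
# an interactive backend (for example `%matplotlib widget` from ipympl) may be needed for live updates
def update_plot(amplitude, frequency):
    line.set_ydata(amplitude * np.sin(frequency * x))
    fig.canvas.draw()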

# Create interactive widgets
from ipywidgets import interactive
import ipywidgets as widgets

amplitude_slider = widgets.FloatSlider(value=1.0, min=0.1, max=2.0, step=0.1, description='Amplitude:')
frequency_slider = widgets.FloatSlider(value=1.0, min=0.1, max=5.0, step=0.1, description='Frequency:')

# Define the interactive function
interactive_plot = interactive(update_plot, amplitude=amplitude_slider, frequency=frequency_slider)

# Display the interactive plot
display(interactive_plot)


## pandas
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Its features include:
- easy handling of missing data
- size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
- powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- intuitive merging and joining data sets
- flexible reshaping and pivoting of data sets
- hierarchical labeling of axes
- robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
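As a small illustration of two of the features listed above (automatic data alignment and handling of missing data), the following sketch adds two Series whose labels only partly overlap. The Series names and values are made up for this example.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on the index labels, not on position.
# Labels present in only one Series ("a" and "d") produce NaN.
total = s1 + s2
print(total)

# Series.add with fill_value treats a missing partner as 0 instead
filled = s1.add(s2, fill_value=0)
print(filled)
```

The result index is the union of both indexes, which is why no data is silently dropped or misaligned.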

In [None]:
import pandas as pd

# create a simple dataset of people
data = {'Name': ["John", "Anna", "Peter", "Linda"],
        'Location' : ["New York", "Paris", "Berlin", "London"],
        'Age' : [24, 13, 53, 33]
       }

data_pandas = pd.DataFrame(data)
# IPython.display allows "pretty printing" of dataframes in the Jupyter notebook
display(data_pandas)

In [None]:
# Select all rows where the Age column is greater than 30
display(data_pandas[data_pandas.Age > 30])

# perform some (a bit more) complex operation on the dataframe
data_pandas['Age'] = data_pandas['Age'].apply(lambda x: x + 1)
print("After applying the lambda function:")
display(data_pandas)

In [None]:
# 2. Basic data inspection
print("First 5 rows of the dataframe:")
display(data_pandas.head())

print("Last 5 rows of the dataframe:")
display(data_pandas.tail())

print("Information about the dataframe:")
data_pandas.info()

print("Statistical summary of the dataframe:")
display(data_pandas.describe())

# 3. Filtering data based on multiple conditions
print("Rows where Age is greater than 25 and Location is not 'Paris':")
filtered_data = data_pandas[(data_pandas.Age > 25) & (data_pandas.Location != 'Paris')]
display(filtered_data)

# 4. Grouping data and calculating summary statistics
print("Group by Location and calculate the mean age:")
grouped_data = data_pandas.groupby('Location').Age.mean()
display(grouped_data)

# 5. Merging two DataFrames
# Create another DataFrame to demonstrate merging
data2 = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Salary': [50000, 60000, 55000, 65000]
}
data_pandas2 = pd.DataFrame(data2)

print("First dataframe:")
display(data_pandas)

print("Second dataframe:")
display(data_pandas2)

merged_data = pd.merge(data_pandas, data_pandas2, on='Name')
print("Merged dataframe:")
display(merged_data)

# 6. Handling missing data
# Introduce missing data
data_pandas.loc[1, 'Age'] = None
print("Dataframe with missing data:")
display(data_pandas)

print("Drop rows with missing values:")
data_pandas_dropped = data_pandas.dropna()
display(data_pandas_dropped)

print("Fill missing values with the mean of the column:")
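# restrict the fill to the Age column so that no other column is ever filled with an age value
data_pandas_filled = data_pandas.fillna({'Age': data_pandas['Age'].mean()})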
display(data_pandas_filled)

In [None]:
# 7. Sorting data
print("Data sorted by Age:")
sorted_data = data_pandas.sort_values(by='Age')
display(sorted_data)

# 8. Value Counts
print("Counts of unique values in the 'Location' column:")
location_counts = data_pandas['Location'].value_counts()
display(location_counts)

# 9. Pivot Tables
print("Pivot table of mean Age by Location:")
pivot_table = data_pandas.pivot_table(values='Age', index='Location', aggfunc='mean')
display(pivot_table)

# 10. Datetime Manipulation
print("Adding a Date of Birth column and extracting year:")
data_pandas['Date of Birth'] = pd.to_datetime(['1998-05-01', '2009-07-15', '1968-12-22', '1987-03-11'])
data_pandas['Birth Year'] = data_pandas['Date of Birth'].dt.year
display(data_pandas)

# 11. Crosstab
print("Crosstab of Age and Location:")
crosstab = pd.crosstab(data_pandas['Age'], data_pandas['Location'])
display(crosstab)

# 12. String Methods
print("Converting Location to uppercase:")
data_pandas['Location'] = data_pandas['Location'].str.upper()
display(data_pandas)

# 13. Apply with Functions
print("Applying a custom function to the Age column:")
def custom_function(x):
    return x * 2

data_pandas['Age Doubled'] = data_pandas['Age'].apply(custom_function)
display(data_pandas)

# 14. Conditional Column Creation
print("Creating a new column based on a condition:")
data_pandas['Is Adult'] = data_pandas['Age'] >= 18
display(data_pandas)

# 15. Sampling Data
print("Random sample of 2 rows:")
sampled_data = data_pandas.sample(2)
display(sampled_data)

# 16. Visualisation with Pandas
print("Plotting Age distribution:")
data_pandas['Age'].plot(kind='hist')

In [None]:
# Create a more complex dataset
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda', 'John', 'Anna', 'Peter', 'Linda'],
    'Year': [2020, 2020, 2020, 2020, 2021, 2021, 2021, 2021],
    'Maths': [85, 90, 78, 92, 88, 91, 82, 94],
    'Science': [88, 94, 80, 95, 90, 96, 83, 97],
    'English': [90, 85, 85, 88, 91, 86, 87, 89]
}

df = pd.DataFrame(data)

# Display the original dataframe
print("Original DataFrame:")
display(df)

# Melting the DataFrame
melted_df = pd.melt(df, id_vars=['Name', 'Year'], value_vars=['Maths', 'Science', 'English'],
                    var_name='Subject', value_name='Score')

print("Melted DataFrame:")
display(melted_df)

# Pivoting the DataFrame back to wide format
pivoted_df = melted_df.pivot_table(index=['Name', 'Year'], columns='Subject', values='Score').reset_index()

print("Pivoted DataFrame:")
display(pivoted_df)

# Pandas Exercises

In [None]:
# Pandas Exercises Notebook

# Exercise 1: Creating and Inspecting DataFrames

# Task: Create a DataFrame from the given dictionary and inspect its first few rows, last few rows, and basic information.

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Location': ['New York', 'Paris', 'Berlin', 'London'],
    'Age': [24, 13, 53, 33]
}

# Create a DataFrame
# Your code here

# Display the first few rows
# Your code here

# Display the last few rows
# Your code here

# Display basic information
# Your code here

# Exercise 2: Filtering Data

# Task: Filter the DataFrame to include only rows where Age is greater than 30.

# Your code here

# Exercise 3: Sorting Data

# Task: Sort the DataFrame by Age in descending order.

# Your code here

# Exercise 4: Grouping Data and Calculating Summary Statistics

# Task: Group the DataFrame by Location and calculate the mean Age for each group.

# Your code here

# Exercise 5: Handling Missing Data

# Task: Introduce missing data into the DataFrame and demonstrate dropping and filling missing values.

# Introduce missing value
# Your code here

# Display the DataFrame with missing data
# Your code here

# Drop rows with missing values
# Your code here

# Fill missing values with the mean of the column
# Your code here

# Exercise 6: Melting Data

# Task: Melt the DataFrame to transform it from wide format to long format.

# Your code here

# Exercise 7: Pivoting Data

# Task: Pivot the melted DataFrame back to wide format.

# Your code here

# Exercise 8: Value Counts

# Task: Calculate the value counts for the Location column.

# Your code here

# Exercise 9: Apply with Functions

# Task: Apply a custom function to the Age column to create a new column 'Age Doubled'.

# Define a custom function
# Your code here

# Apply the function to the Age column
# Your code here

# Exercise 10: Conditional Column Creation

# Task: Create a new column 'Is Adult' based on the condition that Age is greater than or equal to 18.

# Your code here

# Advanced Pandas Exercises

# Exercise 11: Merging DataFrames

# Task: Create another DataFrame and merge it with the original DataFrame on the 'Name' column.

data2 = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Salary': [50000, 60000, 55000, 65000]
}

# Create a second DataFrame
# Your code here

# Merge the two DataFrames on the 'Name' column
# Your code here

# Exercise 12: Datetime Manipulation

# Task: Add a 'Date of Birth' column and extract the year, month, and day into separate columns.

dob_data = ['1998-05-01', '2009-07-15', '1968-12-22', '1987-03-11']

# Add 'Date of Birth' column
# Your code here

# Extract year, month, and day into separate columns
# Your code here

# Exercise 13: Pivot Tables

# Task: Create a pivot table to calculate the mean Age for each Location and Name combination.

# Your code here

# Exercise 14: Handling Duplicate Data

# Task: Introduce duplicate rows into the DataFrame and demonstrate how to identify and remove duplicates.

# Introduce duplicate rows
# Your code here

# Identify duplicate rows
# Your code here

# Remove duplicate rows
# Your code here

# Exercise 15: String Methods

# Task: Convert all Location names to uppercase and create a new column 'Initial' with the first letter of each Name.

# Convert Location names to uppercase
# Your code here

# Create 'Initial' column
# Your code here

# Exercise 16: Crosstab

# Task: Create a crosstab to show the frequency of Age groups by Location.

# Your code here

# Exercise 17: Sampling Data

# Task: Take a random sample of 2 rows from the DataFrame.

# Your code here

# Exercise 18: Visualisation with Pandas

# Task: Plot the distribution of Ages using a histogram.

# Your code here

# Exercise 19: Multi-index DataFrames

# Task: Create a multi-index DataFrame and demonstrate basic indexing and slicing.

multi_index_data = {
    'State': ['California', 'California', 'New York', 'New York'],
    'City': ['Los Angeles', 'San Francisco', 'New York City', 'Buffalo'],
    'Population': [4000000, 870000, 8400000, 260000]
}

# Step 1: Create a DataFrame from the given dictionary.
# Your code here

# Step 2: Set 'State' and 'City' as the multi-index of the DataFrame.
# Your code here

# Task: Demonstrate basic indexing and slicing
# Step 1: Select all rows corresponding to 'California'.
# Your code here

# Step 2: Select a specific row corresponding to ('California', 'Los Angeles').
# Your code here

# Step 3: Select a cross-section of all rows corresponding to 'New York City' across all states.
# Your code here

## A First Application: Classifying Iris Species
We are going to take a look at a simple machine learning application that will cover the tools we have discussed. We will use data about iris flowers (https://en.wikipedia.org/wiki/Iris_flower_data_set) to classify them into three species: Iris setosa, Iris virginica and Iris versicolor.

Below is an image of the characteristics we will use to classify the flowers:

![sepal_petal](images/iris_petal_sepal.png)


### Meet the Data
The data we will use for this example is the Iris dataset, a classical dataset in machine learning and statistics. It is included in scikit-learn in the datasets module. We can load it by calling the load_iris function:

In [None]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

# print the type of iris_dataset
print(f"Type of iris_dataset: {type(iris_dataset)}")

# Python's dir function returns a list of the valid attributes of the object
print(f"Attributes of iris_dataset: {dir(iris_dataset)}")

The iris_dataset object that is returned by load_iris is a Bunch object, which is very similar to a dictionary. It contains keys and values:

In [None]:
print("Keys of iris_dataset:\n", iris_dataset.keys())

The value of the key DESCR is a short description of the dataset.

In [None]:
print(iris_dataset['DESCR'][:193] + "\n...")

The value of the key target_names is an array of strings, containing the species of flower that we want to predict:

In [None]:
print("Target names:", iris_dataset['target_names'])

The value of feature_names is a list of strings, giving the description of each feature:

In [None]:
print("Feature names:\n", iris_dataset['feature_names'])

The data itself is contained in the target and data fields. data contains the numeric measurements of sepal length, sepal width, petal length, and petal width in a NumPy array:

In [None]:
print("Type of data:", type(iris_dataset['data']))

In [None]:
print("Shape of data:", iris_dataset['data'].shape)

We see that the array contains measurements for *150 different flowers*, with four features each. Remember that the individual items are called samples in machine learning, and their properties are called features. The shape of the data array is (number of samples, number of features). This is a convention in scikit-learn, and your data will always be assumed to be in this shape.

Here are the feature values for the first five samples:

In [None]:
print("First five rows of data:\n", iris_dataset['data'][:5])

From this data, we can see that all of the first five flowers have a petal width of 0.2 cm and that, among these five, the first flower has the longest sepal, at 5.1 cm.
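Observations like these can also be checked programmatically instead of by eye. The sketch below hard-codes the first five rows of the iris data (an assumption: the values are copied by hand and should match the output of the cell above) and uses column slicing to inspect one feature at a time.

```python
import numpy as np

# First five rows of the iris data, copied by hand (assumed to match the printed output)
# Columns: sepal length, sepal width, petal length, petal width (all in cm)
first_five = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [4.6, 3.1, 1.5, 0.2],
    [5.0, 3.6, 1.4, 0.2],
])

petal_width = first_five[:, 3]    # one column corresponds to one feature
sepal_length = first_five[:, 0]

print("All petal widths equal 0.2:", np.all(petal_width == 0.2))
print("Row with the longest sepal:", sepal_length.argmax(), "length:", sepal_length.max())
```

In the notebook itself, `first_five` could be replaced by `iris_dataset['data'][:5]`.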

The target array contains the species of each of the flowers that were measured, also as a NumPy array:

In [None]:
print("Type of target:", type(iris_dataset['target']))

In [None]:
print("Shape of target:", iris_dataset['target'].shape)

In [None]:
print("Target:\n", iris_dataset['target'])

### Measuring Success: Training and Testing Data

We aim to construct a machine learning model using the given dataset that can accurately predict the species of iris based on new measurements. However, before we can trust the predictions of our model on new data, it is crucial to evaluate its performance.

Unfortunately, we cannot evaluate the model using the same data that was used for training. If we were to do so, the model could simply memorize the entire training set and always provide correct predictions for those points. However, this ability to remember the training set does not indicate how well the model will generalize to new, unseen data.

To assess the model's performance, we need to present it with new data for which we already know the correct labels. This is typically accomplished by splitting the labeled data we have collected (in this case, the 150 flower measurements) into two parts. One part, called the training data or training set, is used to build the machine learning model. The remaining part, known as the test data, test set, or hold-out set, is used to evaluate the model's performance.

To facilitate this process, scikit-learn provides the `train_test_split` function. By default, this function shuffles the dataset and divides it into a training set (75% of the data) and a test set (25% of the data), along with their corresponding labels. Although the choice of how much data to allocate for training and testing is somewhat arbitrary, a common guideline is to use a test set containing 25% of the data.

In scikit-learn, the convention is to represent the data as a two-dimensional array (matrix) denoted by a capital X, while the labels are represented as a one-dimensional array (vector) denoted by a lowercase y. This notation is inspired by the mathematical formulation f(x) = y, where x represents the input to a function and y represents the corresponding output.

Let's apply the `train_test_split` function to our data and assign the outputs accordingly, following this naming convention.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

Before splitting the dataset, the `train_test_split` function shuffles it to ensure randomness in the resulting subsets. Without shuffling, if we simply took the last 25% of the data as the test set, all of its data points would have the label 2, because the data points are sorted by label, as shown in the output for `iris_dataset['target']` previously. *It is important to have a test set that includes data from all classes to properly evaluate the model's generalization performance.*
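The effect of shuffling can be sketched with plain NumPy. This is an illustration of the idea only, not scikit-learn's actual implementation; the label array is synthetic (50 samples per class, sorted by class) and the variable names are made up for this example.

```python
import numpy as np

# Labels sorted by class, 50 per species, mimicking the layout of the iris targets
y_sorted = np.repeat([0, 1, 2], 50)
n_test = int(np.ceil(0.25 * len(y_sorted)))   # size of a 25% test set

# Without shuffling: the last 25% of the data contains a single class
y_test_unshuffled = y_sorted[-n_test:]
print("Classes in unshuffled test set:", np.unique(y_test_unshuffled))

# With shuffling: permute the indices first, then cut the permutation in two
rng = np.random.default_rng(0)
shuffled_idx = rng.permutation(len(y_sorted))
test_idx, train_idx = shuffled_idx[:n_test], shuffled_idx[n_test:]
print("Classes in shuffled test set:  ", np.unique(y_sorted[test_idx]))
```

Splitting a permutation of the indices guarantees that every sample ends up in exactly one of the two sets.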

To ensure reproducibility of the results, we can set a fixed seed for the pseudorandom number generator using the random_state parameter. This means that every time we run the function with the same seed value, we will obtain the same output. Fixing the random_state is essential when using randomized procedures to ensure consistency and comparability of the results.
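The following sketch checks this reproducibility claim and also shows the `stratify` parameter of `train_test_split`, which is not used in this notebook but is one way to keep class proportions similar in both subsets. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Two calls with the same seed produce exactly the same four arrays
split_a = train_test_split(X, y, test_size=0.25, random_state=0)
split_b = train_test_split(X, y, test_size=0.25, random_state=0)
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits:", same)

# stratify=y asks for class proportions that mirror those of the full dataset
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
print("Test-set class counts:", np.bincount(y_test_strat))
```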

The output of the train_test_split function consists of four NumPy arrays: X_train, X_test, y_train, and y_test. The X_train array contains 75% of the rows from the original dataset, while the X_test array contains the remaining 25%. These arrays correspond to the features (input data) for the training and test sets, respectively.

To complete the evaluation process, we also have y_train and y_test, which are the corresponding labels for the training and test sets, respectively. These arrays contain the target values (in this case, the iris species) associated with each data point in X_train and X_test.

By splitting the data in this manner, we can train our machine learning model on the training set (X_train and y_train) and assess its performance on the test set (X_test and y_test). This allows us to estimate how well the model will generalize to unseen data and make predictions for new measurements.

In [None]:
print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

In [None]:
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

### Look at Your Data

Before constructing a machine learning model, it is beneficial to examine the data to determine if the task can be easily solved without machine learning or if the desired information is not present in the data.

Furthermore, data inspection helps identify abnormalities and peculiarities. It is not uncommon to encounter inconsistencies and unexpected measurements in real-world datasets.

Visualization is one of the most effective methods for data inspection. One approach is to use scatter plots, where one feature is plotted along the x-axis and another feature along the y-axis, with each data point represented by a dot. However, since computer screens are limited to two dimensions, it becomes challenging to visualize datasets with more than two or three features simultaneously.

To address this limitation, a pair plot can be employed, which examines all possible pairs of features. For datasets with a small number of features, such as the four features we have here, this approach is reasonable. However, it's important to note that a pair plot does not capture the interaction among all the features simultaneously, and certain interesting aspects of the data may not be revealed through this visualization.

The figure below displays a pair plot of the features in the training set, where each data point is color-coded based on the corresponding iris species. To create this plot, we convert the NumPy array into a pandas DataFrame. The pandas library provides a function called scatter_matrix specifically designed for creating pair plots. The diagonal of the matrix in the pair plot is filled with histograms of each individual feature, providing additional insights into their distributions.

In [None]:
# create dataframe from data in X_train
# label the columns using the strings in iris_dataset.feature_names
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)
# create a scatter matrix from the dataframe, color by y_train
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
               marker='o', hist_kwds={'bins': 20}, s=60,
               alpha=.8, cmap='viridis')

# Display the scatter matrix plot
plt.show()

The scatter plot visualizes the relationship between different pairs of features in the iris dataset using a matrix of subplots. Each subplot represents a scatter plot of two features against each other. The diagonal subplots display histograms of individual features.

In this specific scatter plot, we are using the training data (X_train) to create the scatter matrix. The data is organized in a DataFrame called iris_dataframe, where each column corresponds to a specific feature from the iris dataset.

The scatter_matrix function from pandas plotting is used to generate the scatter plot matrix. The c parameter is set to y_train, which assigns colors to the data points based on their corresponding target labels.

By coloring the data points based on the target labels (y_train), we can visually distinguish the iris species within the scatter plot matrix. The cmap parameter is set to 'viridis', so the three label values 0, 1, and 2 are mapped to three clearly distinguishable colors from that colormap.

Each off-diagonal subplot shows a combination of two features, with the column determining the x-axis and the row determining the y-axis. For example, the subplot in the first row and second column shows sepal width on the x-axis and sepal length on the y-axis. The subplots on the diagonal, including the top-left and bottom-right ones, show histograms of a single feature.

The size of the scatter points is set to s=60, which determines the marker size, while the transparency is set to alpha=.8, providing a balance between visibility and transparency for overlapping data points.

Overall, the scatter plot matrix offers a comprehensive visualization of the relationships between the iris dataset's features. It allows us to identify patterns, correlations, and separations between different feature pairs, providing valuable insights into the data's structure and potential discriminatory power for iris species classification.

From the plots, we can see that the three classes seem to be relatively well separated using the sepal and petal measurements. This means that a machine learning model will likely be able to learn to separate them.
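A numeric complement to the visual impression is to compare the per-species mean of each feature. The sketch below is one possible way to do this with a pandas `groupby`; as an assumption made for simplicity, it uses the full dataset rather than only the training set, and the names `iris`, `df`, and `class_means` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
# Map the integer labels 0, 1, 2 to the species names
df["species"] = iris.target_names[iris.target]

# Mean of every feature within each species
class_means = df.groupby("species").mean()
print(class_means.round(2))
```

Features whose class means lie far apart, such as the petal measurements, are the ones where the classes separate most clearly in the pair plot.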

### Building Your First Machine Learning Model: k-Nearest Neighbors

Now, we can proceed with building our machine learning model. Scikit-learn offers various classification algorithms that we can employ for our task. In this case, we will utilize the k-nearest neighbors (KNN) classifier due to its simplicity and ease of understanding.

Constructing the KNN model primarily involves storing the training set. To make predictions for new data points, the algorithm locates the point in the training set that is closest to the new point and assigns the label of this nearest training point to the new data point.

The "k" in k-nearest neighbors signifies that, instead of considering only the closest neighbor, we can take into account a fixed number "k" of neighbors from the training set. For instance, we can consider the three or five nearest neighbors. By evaluating the majority class among these neighbors, we can make predictions.

In this instance, we will focus on using just a single neighbor. Further details about this approach will be explored in subsequent sessions.
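To make the description above concrete, here is a minimal from-scratch sketch of the prediction step. It assumes Euclidean distance and a simple majority vote; it is not how scikit-learn implements the classifier, and the function name `knn_predict` and the toy data are made up for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_known, y_known, x_new, k=1):
    """Predict the label of x_new from its k nearest labeled points."""
    # Euclidean distance from x_new to every stored training point
    distances = np.linalg.norm(X_known - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(y_known[nearest].tolist())
    return votes.most_common(1)[0][0]


X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=1))
print(knn_predict(X_toy, y_toy, np.array([4.0, 4.0]), k=3))
# With k equal to the size of the training set, the majority class always wins
print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=5))
```

The last call shows why the choice of k matters: the point sits next to the class-0 examples, yet with k=5 it is outvoted by the three class-1 examples.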

In scikit-learn, each machine learning model is implemented within its dedicated class, known as an Estimator class. The K-nearest neighbors classification algorithm is implemented in the KNeighborsClassifier class, which is located in the neighbors module. Before utilizing the model, we need to instantiate this class to create an object. During this instantiation, we can specify and configure any desired parameters. For the KNeighborsClassifier, the most crucial parameter is the number of neighbors, which we will set to 1:

In [None]:
# import the class
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model (with the default parameters and k=1)
knn = KNeighborsClassifier(n_neighbors=1)

The knn object encapsulates the algorithm that will be used to build the model from the training data, as well as the algorithm to make predictions on new data points. It will also hold the information that the algorithm has extracted from the training data. In the case of KNeighborsClassifier, it will just store the training set.

To build the model on the training set, we call the `fit` method of the knn object, which takes as arguments the NumPy array X_train containing the training data and the NumPy array y_train of the corresponding training labels:

In [None]:
knn.fit(X_train, y_train)

The fit method modifies the knn object in place and returns the object itself. In a notebook, this return value is displayed as a representation of our classifier, which shows the parameters used to create the model. Depending on the scikit-learn version, the representation lists either all parameters or only those that differ from their defaults; in both cases it includes n_neighbors=1, the parameter we passed when instantiating the classifier.

scikit-learn models often have many parameters, but most of them are either speed optimizations or meant for very specific use cases. A long model representation is therefore nothing to worry about: we will cover the important parameters in upcoming lessons.
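Because `fit` returns the estimator itself, construction and fitting can be chained, and the configured parameters can be read back with `get_params`. The sketch below demonstrates both on a tiny made-up dataset (`X_toy`, `y_toy`) rather than on the iris data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One feature, four samples, two classes
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

knn_toy = KNeighborsClassifier(n_neighbors=1)
returned = knn_toy.fit(X_toy, y_toy)
print("fit returns the same object:", returned is knn_toy)

# Chained form: instantiate and fit in one expression
chained = KNeighborsClassifier(n_neighbors=1).fit(X_toy, y_toy)
print("n_neighbors:", chained.get_params()["n_neighbors"])
print("Predictions:", chained.predict([[0.4], [2.6]]))
```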

### Making Predictions

Now, we can utilize our trained model to make predictions on new data where the correct labels are unknown to us.

Let's consider an example scenario: We have come across an iris flower in the wild with the following measurements:

- Sepal length: 5 cm
- Sepal width: 2.9 cm
- Petal length: 1 cm
- Petal width: 0.2 cm

To determine the species of this iris flower, we organize the measurements into a NumPy array of shape (number of samples, number of features), here (1, 4): one sample (the iris flower) with four features (sepal length, sepal width, petal length, and petal width).

The following code creates this array and prints its shape. We can then use the array to predict the species with our trained k-nearest neighbors classifier.

In [None]:
import numpy as np

# Create a NumPy array with the new data point
X_new = np.array([[5.0, 2.9, 1.0, 0.2]])

# Determine the shape of the array
shape = X_new.shape

print("Shape of the new data array:", shape)

Note that we made the measurements of this single flower into a row in a two-dimensional NumPy array, as scikit-learn always expects two-dimensional arrays for the data.
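If the measurements arrive as a flat, one-dimensional array, they can be turned into a single-row two-dimensional array with `reshape`. A minimal sketch, with illustrative variable names:

```python
import numpy as np

flat = np.array([5.0, 2.9, 1.0, 0.2])   # shape (4,): one-dimensional
as_row = flat.reshape(1, -1)            # shape (1, 4): one sample, four features
as_row_alt = flat[np.newaxis, :]        # equivalent, using explicit axis insertion

print(flat.shape, as_row.shape, as_row_alt.shape)
```

The `-1` tells NumPy to infer that dimension from the number of elements, so the same call works for any number of features.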

To make a prediction, we call the predict method of the knn object:

In [None]:
prediction = knn.predict(X_new)
print("Prediction:", prediction)
print("Predicted target name:",
       iris_dataset['target_names'][prediction])

Our model predicts that this new iris belongs to the class 0, meaning its species is setosa. But how do we know whether we can trust our model? We don't know the correct species of this sample, which is the whole point of building the model! Because no true label is available for this flower, the reliability of the prediction has to be assessed in another way.

To evaluate the trustworthiness of our model, we typically employ techniques such as model evaluation, validation, and testing. Here are a few approaches we can consider:

- Test-Set Validation: We can reserve a portion of the labeled data for testing purposes, separate from the data used for training the model. By comparing the model's predictions on the test set with the known true labels, we can gauge its performance.
- Cross-Validation: Cross-validation involves dividing the data into multiple subsets or folds. The model is trained on a combination of these folds and then tested on the remaining fold. This process is repeated several times, rotating the fold used for testing each time. By aggregating the performance across multiple iterations, we can obtain a more robust evaluation.
- Metrics and Performance Measures: Various metrics can quantify the performance of our model, such as accuracy, precision, recall, or F1 score. These metrics provide insights into the model's predictive capability, allowing us to assess its effectiveness.
- Comparisons with Baseline Models: We can compare our model's performance against simple baseline models or other established models to determine its relative effectiveness and identify any areas for improvement.

By employing these evaluation techniques and comparing our model's predictions against known labels or benchmark models, we can gain confidence in its reliability and understand its limitations. It's important to remember that no model is perfect, and ongoing evaluation and refinement are necessary for robust and trustworthy predictions.
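Two of the approaches listed above, cross-validation and comparison with a baseline, can be sketched in a few lines. The example assumes 5 folds and a most-frequent-class baseline; both choices are illustrative and not prescribed by this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Accuracy of 1-nearest-neighbor on each of 5 folds
knn_scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5)

# Baseline that always predicts the most frequent class of its training fold
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y, cv=5)

print("kNN accuracy per fold:     ", knn_scores.round(2))
print("Baseline accuracy per fold:", baseline_scores.round(2))
print("Mean kNN:", knn_scores.mean().round(2),
      "Mean baseline:", baseline_scores.mean().round(2))
```

A model is only interesting if it clearly beats such a trivial baseline; with three equally sized classes, the baseline cannot do much better than one in three.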

### Evaluating the Model

To assess the reliability of our model, we will employ a straightforward technique called test set evaluation.

The test set we created earlier plays a crucial role here. This set of data was specifically reserved and not used during the model building process. However, we do possess the correct species labels for each iris in the test set.

Using the trained model, we can now make predictions for each iris in the test data and compare those predictions to their known labels. By measuring the accuracy, we can evaluate how well our model performs. Accuracy is calculated as the fraction of flowers for which the correct species was predicted.

To compute the accuracy, we can follow these steps:

1. Use the trained k-nearest neighbors classifier to predict the species of each iris in the test set.
2. Compare the predicted species labels against the true species labels.
3. Count the irises for which the predicted species matches the true species.
4. Divide this count by the number of irises in the test set; the resulting fraction is the model's accuracy.

The accuracy score summarizes how well the model performs on data it has not seen during training. A higher accuracy means a larger share of the test irises was classified correctly, while a lower accuracy points to room for improvement. This gives us an informed estimate of how effective the model is at predicting the species of new irises.

In [None]:
y_pred = knn.predict(X_test)
print("Test set predictions:\n", y_pred)

In [None]:
print("Test set score: {:.2f}".format(np.mean(y_pred == y_test)))

We can also use the score method of the knn object, which will compute the test set accuracy for us:

In [None]:
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

For this model, the test set accuracy is about 0.97, which means the model predicted the correct species for 97% of the irises in the test set.

Under certain *mathematical assumptions*, this high accuracy suggests that we can expect our model to be correct about 97% of the time when making predictions for new, unseen irises. Chiefly, we assume that the test set is representative of the population of irises we want to predict on, and that the statistical properties of the test set are similar to those of the unseen data.

However, it's important to note that accuracy alone may not provide a complete picture of the model's performance. Other metrics and considerations specific to the application and dataset should be taken into account. For example, precision, recall, or the use of confidence intervals can provide additional insights into the reliability and variability of the accuracy estimate.
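To make precision and recall concrete, here is a minimal sketch that derives both from a confusion matrix using NumPy only. The labels in `toy_true` and `toy_pred` are hypothetical values chosen for illustration; they are not the predictions produced by our classifier.

```python
import numpy as np

# Hypothetical true and predicted labels for three classes (0, 1, 2);
# illustrative values only, not the notebook's actual test-set output.
toy_true = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
toy_pred = np.array([0, 0, 0, 1, 1, 2, 1, 2, 2, 1])

n_classes = 3
# confusion[i, j] counts samples of true class i that were predicted as class j
confusion = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(confusion, (toy_true, toy_pred), 1)

true_positives = np.diag(confusion)
# precision: of all samples predicted as a class, the fraction that truly belong to it
precision = true_positives / confusion.sum(axis=0)
# recall: of all samples that truly belong to a class, the fraction that was found
recall = true_positives / confusion.sum(axis=1)
# accuracy: correctly classified samples over all samples
accuracy = true_positives.sum() / confusion.sum()

print("Confusion matrix:\n", confusion)
print("Precision per class:", precision)
print("Recall per class:", recall)
print("Accuracy: {:.2f}".format(accuracy))
```

Unlike the single accuracy number, the per-class values show which species the errors are concentrated in.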

For our hobby botanist application, this high level of accuracy indicates that the model may be trustworthy enough to use. In subsequent chapters, we will explore techniques for improving model performance and discuss important caveats to keep in mind when fine-tuning a model, so that we can both improve the model and better understand its performance characteristics.

#### Mathematical Assumptions

The *mathematical assumptions* mentioned above refer to the statistical framework in which the accuracy estimate is interpreted:

- *Assumption of Independent and Identically Distributed Data*: The test-set accuracy generalizes only if the test irises and the new, unseen irises are drawn independently from the same distribution. Under that assumption, the test set is a representative sample of the population we want to make predictions on, and the model's performance on the test set carries over to unseen data.
- *Assumption of Generalizability*: Our model's high accuracy on the test set suggests that it has learned the underlying patterns and relationships in the iris dataset effectively. Therefore, assuming that the test set is representative of future unseen data, we can expect our model to generalize well and make correct predictions for new, unseen irises.
- *Confidence Interval*: The accuracy score of 97% represents a point estimate of the model's performance. However, it's important to consider the associated uncertainty. By calculating a confidence interval, we can express the range within which we are confident the true accuracy of our model lies. This interval provides a measure of the reliability and variability of the accuracy estimate.
- *Limitations and Model Assumptions*: It's essential to acknowledge that accuracy alone may not provide a comprehensive evaluation of model performance. Depending on the specific application and dataset, other metrics, such as precision, recall, or F1 score, might be more appropriate. Additionally, the accuracy of the model assumes that the features used for training and testing are sufficient and relevant for accurately predicting the iris species. If the features are not representative of the underlying patterns in the data, the model may not perform well.
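One way to attach a confidence interval to the accuracy is the Wilson score interval for a proportion, sketched below with the standard library only. The counts are an assumption: 37 correct predictions out of 38 test irises, which is consistent with the reported 0.97 but is not taken from the output above. The helper `wilson_interval` is a hypothetical name introduced for this example, and other interval methods would be equally valid.

```python
import math

def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p_hat = n_correct / n_total
    denom = 1 + z**2 / n_total
    center = (p_hat + z**2 / (2 * n_total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n_total + z**2 / (4 * n_total**2)) / denom
    return center - half_width, center + half_width

# Assumed counts: 37 of 38 test irises classified correctly (about 0.97)
ci_low, ci_high = wilson_interval(37, 38)
print("Accuracy: {:.2f}".format(37 / 38))
print("Approximate 95% interval: [{:.3f}, {:.3f}]".format(ci_low, ci_high))
```

With only a few dozen test samples the interval is more than ten percentage points wide, which is why the 97% figure is best read as a point estimate rather than a guarantee.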

## Summary and Outlook
In this notebook we have covered the following topics:
- The machine learning workflow
- The Python data science ecosystem
- Loading the iris dataset
- Exploratory data analysis
- Building a machine learning model
- Evaluating the model

We have also discussed the following concepts:
- The importance of data representation and feature engineering
- The importance of model evaluation and validation

The following code includes all the steps needed to build the model and evaluate it on the test set.

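In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load the data and hold out a test set
iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# build the model, then evaluate it on the held-out test set
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))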