# [NML-24] Exercise 1: Pandas, NumPy, and Scikit-learn basics

TAs: [William Cappelletti](https://people.epfl.ch/william.cappelletti) and [Abdellah Rahmani](https://people.epfl.ch/abdellah.rahmani)

## Instructions

**!! Read carefully before starting !!**

Exercises will have the same format as assignments, so use them to get familiar with the submission format.

**Expected output:**

You will have coding and theoretical questions. Coding exercises shall be solved within the specified space:
```python
# Your solution here ###########################################################
...
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
```
Sometimes we provide variable names, such as `x = ...`; do not change the names and stick to the hinted types, as they will be reused later.
Within the solution space, you can declare any other variable or function that you might need, but anything outside these lines shall not be changed, or it will invalidate your answers.

Theoretical questions shall be answered in the following markdown cell. The first line will be:
```markdown
**Your answer here:**
...
```

**Solutions:**
* Your submission is self-contained in the `.ipynb` file.

* Code has to be clean and readable. Provide meaningful variable names and comment where needed.

* Textual answers in [markdown cells][md_cells] shall be short: one to two
  sentences. Math shall be written in [LaTeX][md_latex].
  **NOTE**: handwritten notes pasted in the notebook are ignored.

* You cannot import any library other than those we imported, unless explicitly stated.

* Make sure all cells are executed before submitting. I.e., if you open the notebook again it should show numerical results and plots. Cells not run are ignored.

* Execute your notebook from a blank state before submission, to make sure it is reproducible. You can click "Kernel" then "Restart Kernel and Run All Cells" in Jupyter. We might re-run cells to ensure that the code is working and corresponds to the results.

[md_cells]: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html
[md_latex]: https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html#LaTex-equations

## Objective

This exercise session explores some basic Python libraries that we will use multiple times during the assignments and projects, namely
* [Pandas](https://pandas.pydata.org/docs/)
* [NumPy](https://numpy.org/devdocs/user/index.html)
* [Scikit-learn](https://scikit-learn.org/stable/user_guide.html)

If you know those libraries, most tasks in this notebook will be easy. If you struggle with some aspects, take time to review the libraries' documentation, as the next sessions will rely on these tools.

## Dataset

We will use the [Palmer Archipelago (Antarctica) penguin data](https://github.com/allisonhorst/palmerpenguins/tree/main) for this exercise session.

We provide a simplified version of the data in `penguins_size.csv`.

Dataset reference: https://doi.org/10.5281/zenodo.3960218

In [None]:
# Plotting functions
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme()

## Part 1: Pandas, to manipulate tabular data

In [None]:
import pandas as pd

### Question 1.1: Data loading and examination

**1.1.1** Read the `penguins_size.csv` file into a Pandas DataFrame, using the `read_csv` function.

In [None]:
# Your solution here ###########################################################
penguins: pd.DataFrame = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.1.2** Extract the first five rows of the data frame and 10 random ones, then concatenate and display them. You can use the built-in `display` function.

In [None]:
# Your solution here ###########################################################

display(...)

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.1.3** Look at the fourth entry: it is missing information, which is filled with `NaN` (not a number) values.
Let's drop all rows with missing values, then display the first 10 rows.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.1.4** Compute and display the mean and std of `culmen_length_mm` and `body_mass_g`.

In [None]:
# Your solution here ###########################################################

print("Mean values:")
...

print("Standard deviation:")
...

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.1.5** Examine statistics of all columns with the `describe` method.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**1.1.6** Plot a histogram of `body_mass_g`, split by `sex`.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
plt.show()

**1.1.7** The [seaborn](https://seaborn.pydata.org/tutorial.html) library provides nicer visualization functionalities. Let's produce the same histogram with it.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
plt.show()

### Question 1.2: Indexing and manipulation

**1.2.1** Use the `loc` property to remove penguins without sex.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
display(penguins.loc[[13, 26, 39, 52]])

**1.2.2** Make `sex` a boolean property, with value `True` for `"FEMALE"` and `False` for `"MALE"`.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
display(penguins.loc[[13, 26, 39, 52]])

**1.2.3** In the next questions we will numerically encode the `island` and `species` properties. Let's start by identifying the island and species names.

In [None]:
# Your solution here ###########################################################
islands: list[str] = ...
species: list[str] = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("islands:", islands)
print("species:", species)

**1.2.4** For each island and species, add a column with boolean values, indicating whether the penguin comes from said island or belongs to said species.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
display(penguins.loc[[13, 26, 39, 52]])

**1.2.5** In some cases, we might want to encode species as integers.
Use the `map` method and a dictionary mapping species to numbers to get a `y_species` vector.
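
As a reminder of how `Series.map` behaves with a dictionary, here is a minimal sketch on made-up data. The `fruits` series and the `fruit_codes` dictionary are hypothetical and unrelated to the penguin dataset.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the penguin dataset.
fruits = pd.Series(["apple", "kiwi", "apple", "banana"])
fruit_codes = {"apple": 0, "banana": 1, "kiwi": 2}

# `map` looks up every element of the series in the dictionary.
# Values that are missing from the dictionary would become NaN.
encoded = fruits.map(fruit_codes)
print(encoded.tolist())
```

The result keeps the index of the original series, so it stays aligned with the rows it was computed from.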

In [None]:
# Your solution here ###########################################################
y_species: pd.Series = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
y_species

**1.2.6** Drop the `island` and `species` columns, since they are not needed anymore.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
display(penguins.loc[[13, 26, 39, 52]])

In [None]:
penguins.head(10)

## Part 2: NumPy, scientific computing in Python

NumPy allows manipulating vectors, matrices and higher order tensors as `arrays`. For instance vectors are 1d arrays, and matrices have two dimensions (rows and columns).
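
The following sketch illustrates the idea on toy arrays. The names `vector`, `matrix` and `tensor` are only illustrative and are not reused later in the notebook.

```python
import numpy as np

# Toy data, not taken from the penguin dataset.
vector = np.array([1.0, 2.0, 3.0])           # 1-d array
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2-d array: rows and columns
tensor = np.zeros((2, 3, 4))                 # 3-d array

# `ndim` is the number of dimensions, `shape` the size along each of them.
for name, arr in [("vector", vector), ("matrix", matrix), ("tensor", tensor)]:
    print(name, "ndim:", arr.ndim, "shape:", arr.shape)
```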

In [None]:
import numpy as np

**2.1.1** Pandas is built on top of NumPy. We can access the underlying array through the `values` attribute of a DataFrame. Let's put all columns but `body_mass_g` in the *design matrix* `x`, and the `body_mass_g` one in the *target vector* `y_mass`.

In [None]:
# Your solution here ###########################################################
x: np.ndarray = ...
y_mass: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.1.2** Let's inspect the `shape` of these two arrays.

In [None]:
# Your solution here ###########################################################
print("x shape:", ...)
print("y shape:", ...)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.1.3** Then let's check the first five rows of `x`.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.1.4** Notice that the `dtype` of `x` is `object`, which means that it contains multiple types. Let's convert it to `float` and check the first rows again.
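
To see what an `object` array looks like in isolation, here is a small sketch on a made-up array that mixes Python booleans and floats. The array `mixed` is hypothetical, and we force `dtype=object` explicitly to mimic what happens when a data frame holds columns of different types.

```python
import numpy as np

# Toy array mixing booleans and floats, stored as generic Python objects.
mixed = np.array([[True, 1.5], [False, 2.0]], dtype=object)
print(mixed.dtype)

# `astype` returns a new array with the requested type.
# Booleans become 1.0 (True) and 0.0 (False).
as_float = mixed.astype(float)
print(as_float.dtype)
print(as_float)
```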

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

### Question 2.2: Array manipulation


**2.2.1** Extract the values of `Dream` and `Gentoo` columns into two vectors. Convert them to boolean.

In [None]:
dream: np.ndarray
gentoo: np.ndarray
# Your solution here ###########################################################
dream, gentoo = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.2.2** Count how many penguins come from the Dream island using the `sum` method. Repeat for the Gentoo species.

In [None]:
# Your solution here ###########################################################
print("Dream's penguins:", ...)
print("Gentoo's penguins:", ...)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.2.3** You can use a boolean mask to extract values from an array.
Compute the average mass and std of Dream's penguins using the corresponding NumPy functions.
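
For reference, this is how boolean masking works on a toy array. The `values` and `mask` arrays are made up for illustration.

```python
import numpy as np

# Made-up values and a boolean mask of the same length.
values = np.array([10.0, 20.0, 30.0, 40.0])
mask = np.array([True, False, True, False])

# Indexing with a boolean mask keeps only the entries where the mask is True.
selected = values[mask]
print(selected)

# Summing a boolean array counts its True entries.
print(mask.sum())
```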

In [None]:
# Your solution here ###########################################################
print("Average Dream's mass:", ...)
print("Dream's mass std:", ...)
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.2.4** Now, compute the average mass of Dream's penguins using the scalar product between the mass vector `y_mass` and the Dream boolean mask.

In [None]:
# Your solution here ###########################################################
dream_avg_mass: float = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("Average Dream's mass:", dream_avg_mass)

**2.2.5** Compute the standard deviation using inner products too, and check it against the previous answer.

In [None]:
# Your solution here ###########################################################
dream_std_mass: float = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("Dream's mass std:", dream_std_mass)

### Question 2.3: Linear regression

Linear regression aims to find a weight vector $\mathbf w$ such that the target value $y_i$ can be retrieved as a weighted sum of the corresponding features $\mathbf x_i$, or in matrix notation
$$ \mathbf{X w} = \mathbf y. $$

Most of the time we cannot find an exact solution to this problem, therefore we introduce an error function and look for weights that minimize it.
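
To make this concrete, here is a minimal sketch on a made-up overdetermined system with four equations and two unknowns. It uses `np.linalg.lstsq`, which is a different routine from the ones requested in the questions below, and takes the squared error $\|\mathbf{X w} - \mathbf y\|^2$ as the error function. The numbers in `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np

# Toy system: 4 equations, 2 unknowns (made-up numbers), so no exact solution.
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([1.1, 1.9, 3.2, 3.8])

# lstsq returns the weights that minimize the squared error ||X w - y||^2.
w_toy, _, rank, _ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print("weights:", w_toy)
print("norm of the remaining error:", np.linalg.norm(X_toy @ w_toy - y_toy))
```

At the minimum, the remaining error is orthogonal to the columns of the design matrix, which is a quick way to check a candidate solution.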

**2.3.1** Find a solution for `w` by solving a linear system with `np.linalg.solve`.
For this method to work you need as many equations as variables, so choose them randomly with `np.random.choice`.

When selecting random rows, the resulting square system might be singular. You can use `try` to rerun until it works.

In [None]:
n_samples, n_features = x.shape

while True:
    try:
        # Your solution here ###################################################

        w_solve: np.ndarray = ...

        # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        break
    except np.linalg.LinAlgError:
        pass

print("Weights:", w_solve)

**2.3.2** Define a function to compute the mean squared error (MSE) between the real `y` and the predicted one.

In [None]:
def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Your solution here #######################################################
    return ...
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.3.3** Use the weights to predict the penguins masses, then compute their MSE.

In [None]:
# Your solution here ###########################################################
mse_solve: float = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("MSE of random subproblem:", mse_solve)

**2.3.4** Let's look for a solution that uses all the data by using the pseudoinverse of `x`.

In [None]:
# Your solution here ###########################################################
w_pinv: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.3.5** Compute the MSE of this solution prediction.

In [None]:
# Your solution here ###########################################################
mse_pinv: float = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("MSE of pseudoinverse solution:", mse_pinv)

### Question 2.4: Broadcasting

In this section, we focus on a convenient way to manipulate arrays which allows parallelizing operations over the same input.

**2.4.1** Extract all island one-hot encodings from the data frame and convert them to boolean.

In [None]:
# Your solution here ###########################################################
islands_oh: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("islands shape", islands_oh.shape)

**2.4.2** Multiply the penguin masses by `islands_oh` to mask them in parallel.

In [None]:
try:
    # Your solution here ###########################################################
    masked_mass: np.ndarray = ...
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
except ValueError as err:
    print("There's an ERROR:", err)

**2.4.3** Arrays with incompatible shapes cannot be broadcast automatically! Using `np.newaxis`, add a dimension to `y_mass` and try again.
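
The rule behind this can be sketched on toy arrays: NumPy compares shapes starting from the last axis, and two sizes are compatible when they are equal or when one of them is 1. The arrays `column` and `table` below are made up for illustration.

```python
import numpy as np

column = np.array([1.0, 2.0, 3.0])  # shape (3,)
table = np.ones((3, 2))             # shape (3, 2)

# Last axes are compared first: 3 (from `column`) vs 2 (from `table`) clash.
try:
    column * table
except ValueError as err:
    print("Error:", err)

# With a trailing axis of size 1, shape (3, 1) broadcasts against (3, 2).
result = column[:, np.newaxis] * table
print(result.shape)
```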

In [None]:
# Your solution here ###########################################################
masked_mass: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**2.4.4** Compute the average mass on each island by summing the masked masses along the corresponding axis.

In [None]:
# Your solution here ###########################################################
avg_masses: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("Average masses:")
print(dict(zip(islands, avg_masses)))

**2.4.5** Use broadcasting and masking to compute standard deviations for each island.

In [None]:
# Your solution here ###########################################################
std_masses: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("Mass standard deviations:")
print(dict(zip(islands, std_masses)))

### Question 2.5: K-Nearest neighbors

Let's implement a k-nearest neighbors (kNN) classifier to identify penguin species from their physical attributes.

For a query data point, kNN predicts its label as the most frequent one among those of the k closest samples from the training dataset. In this setting, we will use the Euclidean distance between points:
$$ d(\mathbf x_i, \mathbf x_j) = \sqrt{\sum_{d=1}^D (x_{id} - x_{jd})^2} $$
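
As an illustration of the procedure for a single query, here is a minimal sketch on made-up 2-d points. All names and numbers (`x_train_toy`, `y_train_toy`, `query`, `k_toy`) are hypothetical, and the exercises below ask for a version that handles all queries at once.

```python
import numpy as np

# Made-up 2-d training points with labels 0 and 1, and one query point.
x_train_toy = np.array([[0.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y_train_toy = np.array([0, 0, 1, 1, 1])
query = np.array([2.5, 3.0])
k_toy = 3

# Euclidean distance from the query to every training point.
distances = np.sqrt(((x_train_toy - query) ** 2).sum(axis=1))

# Indices of the k closest training points, then a majority vote on their labels.
nearest = np.argsort(distances)[:k_toy]
votes = np.bincount(y_train_toy[nearest])
prediction = votes.argmax()

print("nearest indices:", nearest)
print("predicted label:", prediction)
```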

In the next cell, we prepare to split the data into training and test sets.

In [None]:
physical_attributes = [
    "culmen_length_mm",
    "culmen_depth_mm",
    "flipper_length_mm",
    "body_mass_g",
]

# We count the number of samples and set the amount of training data to 70% of that
n_samples = len(penguins)
samples_tr = int(n_samples * 0.7)

# We shuffle the indices of the data and take the first 70% to be in training
# Working with indices allows us to recognise which features go with which labels
rng = np.random.default_rng(11)
shuffled = rng.permutation(np.arange(n_samples))
idx_tr = shuffled[:samples_tr]
idx_te = shuffled[samples_tr:]

**2.5.1** Extract training and test features from the `penguins` data frame using the indices defined above and split the `y_species` series (computed in 1.2.5) accordingly.

You can use `DataFrame.iloc` to work with integer indexing in pandas.
Remember to extract arrays from data frames.

In [None]:
# Your solution here ###########################################################
x_tr: np.ndarray = ...
x_te: np.ndarray = ...
y_tr: np.ndarray = ...
y_te: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
print("Training features:", x_tr.shape)
print("Test features:", x_te.shape)

**2.5.2** Write a function that computes pairwise distances between all query points and training ones. Use broadcasting and sum along corresponding axes to get an efficient implementation.

In [None]:
def pairwise_distances(x_query: np.ndarray, x_tr: np.ndarray) -> np.ndarray:
    """Compute pairwise distances

    Args:
        x_query (np.ndarray): Array of shape (n_queries, n_features)
        x_tr (np.ndarray): Array of shape (n_samples, n_features)

    Returns:
        np.ndarray: Distances in array of shape (n_queries, n_samples)
    """
    # Your solution here #######################################################
    return ...
    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


pdists_te = pairwise_distances(x_te, x_tr)
print("Distances shape:", pdists_te.shape)

**2.5.3** For each query, identify the closest training samples using `np.argsort`. Use the `axis` argument to avoid iterating over the matrix.

In [None]:
# Your solution here ###########################################################
nearest_ngbs_te: np.ndarray = ...
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

print("5 nearest neighbors for query 4:", nearest_ngbs_te[4, :5])

**2.5.4** Write a function that takes an array of nearest neighbors for each query, together with training labels and the predefined $k$ and returns the predicted labels for queries.

In [None]:
def predict_labels(nearest_ngbs: np.ndarray, y_tr: np.ndarray, k: int) -> np.ndarray:
    """Predict labels from k-nearest neighbors

    Args:
        nearest_ngbs (np.ndarray): Array of neighbors indices, sorted by distance.
            Shape: (n_queries, n_samples)
        y_tr (np.ndarray): Training labels of shape (n_samples,)
        k (int): number of nearest neighbors to consider

    Returns:
        np.ndarray: Predicted labels of shape (n_queries,)
    """
    # Your solution here #######################################################
    # Extract nearest ngbs labels
    ...

    # Count label occurrences for each query
    ...

    # Return most frequent occurrence
    return ...

    # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


print(
    "Predicted labels of the first 10 queries:",
    predict_labels(nearest_ngbs_te[:10], y_tr, 5),
)
print("Real labels:                             ", y_te[:10])

**2.5.5** Compute the precision of kNN's predictions for both the training and test datasets for all k between 1 and 30.

In [None]:
# Your solution here ###########################################################
precisions_te: list[float] = ...

precisions_tr: list[float] = ...

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

# Plot
fig, ax = plt.subplots(figsize=(8, 3), dpi=100)
ax.plot(precisions_tr, label="Train")
ax.plot(precisions_te, label="Test")
ax.set(title="My kNN implementation", ylabel="precision", xlabel="k value")
plt.legend()
plt.show()

## Part 3: Scikit-learn, machine learning toolbox

Scikit-learn provides implementations for many machine learning algorithms, which all share the basic common interface of `fit` and `predict` methods. Let's compare it to our kNN implementation.
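
The sketch below shows the `fit` and `predict` pattern on made-up data. It deliberately uses a different estimator, `DecisionTreeClassifier`, rather than the classifier requested in the questions below. The arrays `x_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up 1-d features: small values have label 0, large values label 1.
x_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0, 0, 1, 1])

# Every estimator is first created, then fitted on training data...
model = DecisionTreeClassifier(random_state=0)
model.fit(x_toy, y_toy)

# ...and finally used to predict labels for new points.
predictions = model.predict(np.array([[0.5], [2.5]]))
print(predictions)
```

Other estimators follow the same two steps, so swapping one model for another usually only changes the import and the constructor call.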

**3.1** Import the k nearest neighbor classifier from Scikit-learn.

In [None]:
# Your solution here ###########################################################
from sklearn ...

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**3.2** Import the precision function.

In [None]:
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**3.3** Compute train and test precision for all k between 1 and 30. Use the "micro" option for the precision score.

In [None]:
# Your solution here ###########################################################

sk_prec_tr: list[float] = ...
sk_prec_te: list[float] = ...

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

**3.4** Plot scores and compare to your implementation.

In [None]:
fig, ax = plt.subplots(figsize=(8, 3), dpi=100)
# Your solution here ###########################################################

# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

ax.set(title="SkLearn kNN implementation", ylabel="precision", xlabel="k value")
plt.legend()
plt.show()