<a href="https://colab.research.google.com/github/tinarobfar/eccdum_assignments/blob/main/assignments/3-Machine-Learning-basics-with-Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook overview
The objective of this notebook is to get you familiar with the most popular tools used for Machine Learning in Python:

* NumPy
* pandas
* scikit-learn (imported as `sklearn`)

In [1]:
import numpy as np
import pandas as pd

from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [2]:
SEED = 2024 # Seeds are used to guarantee reproducibility. Make sure to use this seed ALWAYS!
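
As a small illustration of why the seed matters, the sketch below splits a toy array twice with the same `random_state` and checks that both splits pick the same rows. The array `values` and the names `test_a`, `test_b` and `same_split` are made up for this example and are not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 2024  # same value as the notebook's seed

values = np.arange(10)  # toy data, only for illustration

# Two splits with the same random_state select exactly the same rows.
_, test_a = train_test_split(values, test_size=0.2, random_state=SEED)
_, test_b = train_test_split(values, test_size=0.2, random_state=SEED)

same_split = np.array_equal(test_a, test_b)
print("Same test rows with the same seed:", same_split)
```

Without a fixed `random_state`, repeated runs would generally shuffle the rows differently, so results such as accuracies would not be comparable between runs.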

# Exploring the IRIS dataset

In [3]:
iris_dataset = load_iris() # This returns a dictionary-like object (a Bunch) with the attributes of the dataset. Let's explore it.

In [4]:
iris_dataset.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [5]:
iris_dataset["data"]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In [6]:
iris_dataset["target"]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [7]:
print(iris_dataset["frame"])

None
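
The `frame` entry is `None` because `load_iris()` was called with its default arguments. As a hedged aside, scikit-learn can also build the DataFrame itself when `as_frame=True` is passed. The variable names below are illustrative, and the exercise still asks you to build the DataFrame by hand.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays.
iris_as_frame = load_iris(as_frame=True)

# DataFrame with the four feature columns plus a numeric "target" column.
frame = iris_as_frame["frame"]
print(type(frame).__name__, frame.shape)  # DataFrame (150, 5)
```

Note that the `target` column of this frame holds the encoded numbers (0, 1, 2), not the species names.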


In [8]:
iris_dataset["target_names"]

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [9]:
print(iris_dataset["DESCR"])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [10]:
iris_dataset["feature_names"]

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [11]:
iris_dataset["filename"]

'iris.csv'

Implement a function called `build_dataframe` that takes as input a dictionary such as `iris_dataset` and returns a pandas DataFrame in which each column has the proper feature name.
The target value is also a column of this DataFrame, named `target`. It should contain the names of the targets (`setosa`, etc.) and not simply the encoded numbers.

In [12]:
def build_dataframe(dataset: dict) -> pd.DataFrame:
    # Write your code here
    data = dataset["data"]
    feature_names = dataset["feature_names"]

    # Create a DataFrame from the data
    df = pd.DataFrame(data, columns=feature_names)

    # Extract the labels (target)
    target = dataset["target"]
    target_names = dataset["target_names"]

    # Map the target indices to their corresponding names
    df["target"] = [target_names[i] for i in target]

    return df

In [13]:
df = build_dataframe(iris_dataset)
assert df.shape == (150, 5)
answer_columns = sorted(df.columns)
answer_unique_targets = sorted(df["target"].unique())

print("Columns", answer_columns)
print("Targets", answer_unique_targets)

Columns ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)', 'target']
Targets ['setosa', 'versicolor', 'virginica']
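
The mapping from encoded numbers to names can also be written without a Python loop. The sketch below is one possible variant, not the required solution: it indexes the `target_names` array directly with the integer codes. The function name `build_dataframe_vectorized` and the variable `df_vec` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris


def build_dataframe_vectorized(dataset) -> pd.DataFrame:
    """Hypothetical variant of build_dataframe that uses NumPy indexing."""
    df = pd.DataFrame(dataset["data"], columns=dataset["feature_names"])
    # Indexing the names array with the integer codes maps every row at once.
    df["target"] = dataset["target_names"][dataset["target"]]
    return df


df_vec = build_dataframe_vectorized(load_iris())
print(df_vec["target"].value_counts().to_dict())
```

Each of the three species should appear 50 times, matching the class distribution given in the dataset description.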


# Preparing the dataset for training
Now that we have our dataset (`df`) ready, we can proceed to prepare it for Machine Learning. For this we will:

* Split it into two sets: training and testing.
* Create a pipeline to normalize our dataset and use an SVM for classification.
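
To make the normalization step concrete, here is a minimal sketch of what `StandardScaler` computes: each column is shifted to zero mean and divided by its (population) standard deviation. The toy matrix `X_small` and the other names are invented for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with two columns on very different scales.
X_small = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaled = StandardScaler().fit_transform(X_small)

# Same computation by hand: z = (x - mean) / std, column by column.
# np.std uses the population standard deviation (ddof=0) by default.
manual = (X_small - X_small.mean(axis=0)) / X_small.std(axis=0)

matches = np.allclose(scaled, manual)
print("StandardScaler matches the manual formula:", matches)
```

Scaling matters for SVMs because they rely on distances between samples, so a feature with a large numeric range would otherwise dominate the others.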

In [14]:
y = df.pop("target")
X = df.copy()

## Splitting the dataset into train and test

Split the dataset into train and test using the function `train_test_split` (remember the seed!).
Make sure that the test dataset represents 20% of the total rows (look at the parameter `test_size`).

In [15]:
# Write your code here
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=SEED)

# Check the shapes of the resulting sets
print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (120, 4)
Shape of X_test: (30, 4)
Shape of y_train: (120,)
Shape of y_test: (30,)


In [16]:
assert X_train.shape == (120, 4)
assert X_test.shape == (30, 4)
assert y_train.shape == (120,)
assert y_test.shape == (30,)

answer_y_test = sorted(y_test.index)
print("y_test index", answer_y_test)

y_test index [10, 14, 15, 17, 21, 24, 30, 34, 35, 37, 46, 49, 51, 68, 76, 78, 80, 81, 83, 98, 106, 107, 113, 118, 121, 126, 132, 137, 138, 145]
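
The split above is purely random, so the three species are not guaranteed to be equally represented in the test set. As a hedged aside, `train_test_split` also accepts a `stratify` argument that preserves the class proportions. The sketch below is not what the exercise asks for, and using it would change the rows selected and therefore the expected answers. The names `X_demo`, `y_demo`, `y_test_strat` and `strat_counts` are illustrative.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

SEED = 2024  # same value as the notebook's seed

X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the 1/3, 1/3, 1/3 class balance in both subsets.
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=SEED, stratify=y_demo
)

strat_counts = Counter(y_test_strat.tolist())
print(strat_counts)  # each of the three classes appears 10 times
```

Stratification tends to be most useful for small or imbalanced datasets, where a random split can leave a class under-represented in the test set.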


## Generate Sklearn Pipeline
Before proceeding, you should take a closer look at [Sklearn pipelines](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

Create a pipeline where the first step is a `StandardScaler` (use the name 'scaler') and the second one
is an SVM classifier `SVC` (use the name 'model' and remember the SEED!).

In [17]:
# Create the pipeline
pipe = Pipeline([
    ('scaler', StandardScaler()),  # Step 1: StandardScaler
    ('model', SVC(random_state=SEED))  # Step 2: SVM Classifier
])

In [18]:
assert pipe.steps[0][0] == "scaler"
assert pipe.steps[1][0] == "model"

assert isinstance(pipe.steps[0][1], StandardScaler)
assert isinstance(pipe.steps[1][1], SVC)
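
To show what the pipeline does under the hood, the sketch below compares it with the equivalent manual steps: fit the scaler on the training data only, then train and predict on the scaled features. It reloads the data so that it runs on its own, and all names ending in `_demo`, as well as `scaler`, `model` and `manual_pred`, are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 2024  # same value as the notebook's seed

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=SEED
)

# Pipeline version: scaling and classification in one object.
pipe_demo = Pipeline([("scaler", StandardScaler()), ("model", SVC(random_state=SEED))])
pipe_demo.fit(X_tr, y_tr)

# Manual version: the scaler learns mean and std from the training data only,
# and the same transformation is then applied to the test data.
scaler = StandardScaler().fit(X_tr)
model = SVC(random_state=SEED).fit(scaler.transform(X_tr), y_tr)
manual_pred = model.predict(scaler.transform(X_te))

same_predictions = np.array_equal(pipe_demo.predict(X_te), manual_pred)
print("Pipeline and manual steps agree:", same_predictions)
```

The main benefit of the pipeline is that it makes it hard to accidentally fit the scaler on test data, which would leak information from the test set into training.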

## Training the model
Now it is time to train the model!

Use the training dataset to fit the pipeline, then use the pipeline to predict the test dataset.
Store the predictions for the test dataset in the variable `y_pred`.
Also, calculate the accuracy of the model on both the train and test sets and save the results
as `acc_train` and `acc_test`.

In [19]:
# Write your code here
# Fit the pipeline to the training data
pipe.fit(X_train, y_train)

# Make predictions on the test dataset
y_pred = pipe.predict(X_test)

# Calculate accuracy
acc_train = accuracy_score(y_train, pipe.predict(X_train))
acc_test = accuracy_score(y_test, y_pred)

# Display accuracy results
print(f"Training Accuracy: {acc_train}")
print(f"Testing Accuracy: {acc_test}")

Training Accuracy: 0.9833333333333333
Testing Accuracy: 0.9333333333333333
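
Accuracy is simply the fraction of predictions that match the true labels. For example, a test accuracy of 0.9333 on 30 rows corresponds to 28 correct predictions (28 / 30). The sketch below checks this definition on a tiny made-up example; the arrays `y_true_demo` and `y_pred_demo` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels: three of the four predictions are correct.
y_true_demo = np.array(["setosa", "virginica", "versicolor", "virginica"])
y_pred_demo = np.array(["setosa", "virginica", "virginica", "virginica"])

# Fraction of positions where prediction and truth agree.
manual_acc = np.mean(y_true_demo == y_pred_demo)
sk_acc = accuracy_score(y_true_demo, y_pred_demo)

print(manual_acc, sk_acc)  # 0.75 0.75
```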


In [20]:
assert np.allclose(acc_train, 0.9833333333333333)
assert np.allclose(acc_test, 0.9333333333333333)
answer_predictions = Counter(y_pred)

print("Prediction count", answer_predictions)

Prediction count Counter({'setosa': 12, 'virginica': 10, 'versicolor': 8})


In [21]:
print(str(answer_columns))
print(str(answer_predictions))
print(str(answer_y_test))
print(str(answer_unique_targets))

['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)', 'target']
Counter({'setosa': 12, 'virginica': 10, 'versicolor': 8})
[10, 14, 15, 17, 21, 24, 30, 34, 35, 37, 46, 49, 51, 68, 76, 78, 80, 81, 83, 98, 106, 107, 113, 118, 121, 126, 132, 137, 138, 145]
['setosa', 'versicolor', 'virginica']
