# Exercise Guide
NumPy exercises applied to feature engineering and linear regression.

#### Exercise #1: Normalization
Many machine learning algorithms need centered, normalized input data. A common normalization is the z-score, which subtracts the mean of each feature in the dataset and divides the result by that feature's standard deviation.

Given a dataset X with n samples and m features, implement a NumPy function that normalizes it with the z-score. You may use np.mean() and np.std().


In [1]:
import numpy as np

def z_score(dataset):
    """
    This function returns the z-score of each feature (column) of the input dataset
    
    :param dataset: the input dataset
    :return: the z-score of the input dataset
    """
    
    # Normalize and return
    return (dataset - np.mean(dataset, axis=0)) / np.std(dataset, axis=0)

In [2]:
# Test the previous function

dataset = np.array([[0, 10, 100], [1, 2, 3]])
print(z_score(dataset))

[[-1.  1.  1.]
 [ 1. -1. -1.]]
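
One case the function above does not cover is a feature with zero standard deviation (a constant column), where the division produces NaN. Below is a minimal sketch of one possible guard, assuming that constant columns should simply map to 0. The name `z_score_safe` is hypothetical and not part of the exercise.

```python
import numpy as np

def z_score_safe(dataset):
    """Z-score per column; constant columns are mapped to 0 (an assumption)."""
    dataset = np.asarray(dataset, dtype=float)
    mean = dataset.mean(axis=0)
    std = dataset.std(axis=0)
    # Replace zero deviations by 1 so the division is defined; the numerator
    # of a constant column is already 0, so its z-score becomes 0
    safe_std = np.where(std == 0, 1.0, std)
    return (dataset - mean) / safe_std

# The third column is constant, so plain z_score would divide 0 by 0
data = np.array([[0, 10, 5], [1, 2, 5]])
print(z_score_safe(data))
```

Whether a constant feature should become 0 or be dropped altogether depends on the model that consumes the data, so treat the choice here as illustrative.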



#### Exercise #2: Remove rows and columns with NaNs from a dataset
Given a dataset, write a function that uses NumPy to filter out the rows and columns that contain NaNs.



In [3]:
def filter_nan(dataset, rows=True):
    """
    This function removes the rows (or columns) which contain NaNs
    
    :param dataset: the dataset which has to be cleaned up
    :param rows: if True, rows are removed. Otherwise, columns are removed
    :return: a copy of the dataset without the rows (or columns) that contain NaNs
    """
    
    # Boolean mask marking the position of every NaN
    nan_mask = np.isnan(dataset)
    
    # Keep only the rows (or columns) without any NaN
    if rows:
        return dataset[~nan_mask.any(axis=1)]
    return dataset[:, ~nan_mask.any(axis=0)]

In [4]:
# Test the previous function

dataset = np.array([[0, 10, np.nan], [1, 2, 3], [3, 2, np.nan]])
print(filter_nan(dataset, rows=True))
print(filter_nan(dataset, rows=False))

[[1. 2. 3.]]
[[ 0. 10.]
 [ 1.  2.]
 [ 3.  2.]]


#### Exercise #3: Replace NaNs with the column mean
Given a dataset, write a function that uses NumPy to replace the NaNs with the mean of their column.



In [5]:
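def nan_by_mean(dataset):
    """
    This function replaces the NaNs by the mean of the column.
    The input dataset is modified in place and also returned.
    
    :param dataset: the dataset which has to be cleaned up (float array)
    :return: the dataset with every NaN replaced by its column mean
    """
    
    # Column means computed while ignoring the NaNs
    means = np.nanmean(dataset, axis=0)
    
    # np.where returns a (row_indexes, column_indexes) tuple, so index with
    # both arrays at once instead of iterating over the tuple itself
    rows, cols = np.where(np.isnan(dataset))
    dataset[rows, cols] = means[cols]
    return dataset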

In [6]:
# Test the previous function

dataset = np.array([[0, 10, np.nan], [1, 2, 3], [3, 2, np.nan]])
print(nan_by_mean(dataset))


[[ 0. 10.  3.]
 [ 1.  2.  3.]
 [ 3.  2.  3.]]


#### Exercise #4: Split a dataset X into 70 / 20 / 10
As we saw in the integrative exercise, in machine learning problems it is essential to split a dataset of n samples into 3 datasets:

* Training dataset: the data used to train the models. E.g.: 70% of the samples.
* Validation dataset: the data used to compute metrics and tune the models' hyperparameters. E.g.: 20% of the samples.
* Testing dataset: once the models are trained and their best hyperparameters found, the testing dataset is used to compute the final metrics and to assess how well the models generalize. E.g.: 10% of the samples.

Using np.random.permutation, write a function that, given a dataset, returns the 3 datasets as new NumPy arrays.



In [23]:
def split_dataset(dataset, train=70, validate=20):
    """
    This function splits a dataset into training, validation and testing
    
    :param dataset: the dataset to be split
    :param train: the share of the samples to be used for training, expressed as a percentage
    :param validate: the share of the samples to be used for validation, expressed as a percentage.
        Whatever is left after training and validation is used for testing
    :return: None if train >= 100, otherwise a tuple with
        - Training set
        - Validation set
        - Testing set (might be empty)
    """
    
    # Validate input: with train >= 100 there is nothing left to split
    if train >= 100:
        return None
    # Clip the validation percentage so that train + validate never exceeds 100
    if train + validate > 100:
        validate = 100 - train
    
    # Get the indexes for splitting the dataset
    n_elements = dataset.shape[0]
    train_lim = int(n_elements * train / 100)
    validate_lim = int(n_elements * (train + validate) / 100)
    
    # Permute the rows, then split the dataset and return it
    dataset = np.random.permutation(dataset)
    return dataset[:train_lim], dataset[train_lim:validate_lim], dataset[validate_lim:]

In [24]:
# Test the previous function

dataset = np.array([[0, 10], [1, 2], [3, 2], [1, 2], [3, 2], [1, 2], [3, 2], [1, 2], [3, 2], [1, 2]])
print(split_dataset(dataset))


(array([[3, 2],
       [3, 2],
       [1, 2],
       [3, 2],
       [1, 2],
       [1, 2],
       [3, 2]]), array([[1, 2],
       [1, 2]]), array([[ 0, 10]]))
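
When features and targets live in separate arrays, shuffling each array on its own would break the pairing between samples. A possible variant, sketched under the assumption that a reproducible split is also wanted, permutes the row indexes once and reuses them for every array. `split_indexes` and its `seed` parameter are hypothetical names, not part of the exercise.

```python
import numpy as np

def split_indexes(n_samples, train=70, validate=20, seed=None):
    """Return shuffled index arrays for the train / validation / test sets."""
    rng = np.random.default_rng(seed)
    indexes = rng.permutation(n_samples)
    train_lim = int(n_samples * train / 100)
    validate_lim = int(n_samples * (train + validate) / 100)
    return indexes[:train_lim], indexes[train_lim:validate_lim], indexes[validate_lim:]

# Toy data in which row i of x pairs with y[i] (x[i, 0] == 2 * y[i])
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

train_idx, val_idx, test_idx = split_indexes(len(x), seed=0)

# The same indexes select matching rows from both arrays
x_train, y_train = x[train_idx], y[train_idx]
x_val, y_val = x[val_idx], y[val_idx]
x_test, y_test = x[test_idx], y[test_idx]
```

Passing a fixed `seed` makes the split repeatable between runs, which helps when comparing models.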


#### Exercise #5: Apply the concepts of linear regression to the assignment dataset.
1. Build a class that loads the [dataset](data/income.csv) into a structured ndarray, as done in exercise 10 of Class 1.
2. Add a split method to the class to obtain the training and test sets.
3. Create a base metric class and an MSE (mean squared error) class that inherits from it.
4. Create a base model class, plus linear regression and affine regression classes that inherit from it. Use the theory covered in class.
5. Fit the regressions on the training data.
6. Run predict on the test data and report the MSE in each case.
7. Plot the resulting curve.


In [25]:
# 1 and 2

import tempfile
import zipfile
import os
import csv
import pickle as pkl


def get_data(file_path):
    """
    Generator for loading a csv row by row
    """
    
    # Define the generator
    with open(file_path, 'r') as opened_file:
        reader = csv.reader(opened_file, delimiter=',')
        for i, row in enumerate(reader):
            if i == 0:
                # Skip first line with headers
                continue
            yield tuple(row)


class Dataset():
    """
    This class holds a dataset as a singleton
    """
    # Initialize the instance with None
    instance = None
    
    
    def __new__(cls, folder_path, file_name):
        if Dataset.instance is None:
            Dataset.instance = super(Dataset, cls).__new__(cls)
            return Dataset.instance
        else:
            return Dataset.instance
    
    
    def __init__(self, folder_path, file_name):
        """
        Class constructor. Loads dataset into self.data
        """
        
        # Load the instance from the pickle if available
        pickle_path = os.path.join(folder_path, 'data.pkl')
        if os.path.isfile(pickle_path):
            with open(pickle_path, 'rb') as pkl_file:
                self.data = pkl.load(pkl_file)
            return
                
        # Generate the instance and load the structured array using a generator
        csv_path = os.path.join(folder_path, file_name)
        dtypes = [
            ('id', np.uint32),
            ('income', np.float32),
            ('happiness', np.float32),
        ]
        self.data = np.fromiter(get_data(csv_path), dtype=dtypes)
        
        # Save the data as a pickle
        with open(pickle_path, 'wb') as pkl_file:
            pkl.dump(self.data, pkl_file)

            
    def split(self, train=70, validate=20):
        """
        Wrapper of the split_dataset method
        """
        
        return split_dataset(self.data, train, validate)

In [26]:
# Test the previous implementation
folder_path = '/home/rodolfo/Desktop/especializacion/IA'
file_name = 'income.csv'
ds = Dataset(folder_path, file_name)

print(ds.split())


(array([(306, 2.3071978, 3.5096047 ), (191, 6.6280365, 4.402663  ),
       (474, 1.5873153, 1.3127589 ), (254, 6.4607425, 3.4269116 ),
       (448, 2.8684187, 1.7917535 ), (377, 4.381923 , 3.550337  ),
       ( 27, 3.90041  , 3.5652244 ), (422, 6.635954 , 4.7600145 ),
       (101, 6.5012746, 4.374832  ), ( 80, 5.3374605, 3.703438  ),
       ( 41, 5.061758 , 3.3580716 ), (104, 4.748859 , 4.902992  ),
       (384, 4.851314 , 3.8355777 ), (482, 7.225192 , 4.985255  ),
       (495, 3.4717987, 2.5350022 ), (397, 5.49259  , 4.1052246 ),
       (369, 6.392483 , 3.6963904 ), (420, 6.475625 , 5.368041  ),
       (235, 2.0705624, 0.6289421 ), (195, 3.670377 , 3.476499  ),
       (320, 2.8090909, 2.9468703 ), (289, 5.644714 , 3.7543013 ),
       (252, 3.5405042, 3.552737  ), (187, 6.633619 , 5.38007   ),
       (  5, 7.196409 , 5.5963984 ), (316, 4.420312 , 4.391593  ),
       (411, 4.087462 , 3.2178473 ), (446, 2.220404 , 0.68890923),
       ( 61, 6.9560795, 5.498147  ), ( 96, 5.9322696, 3.96621

In [27]:
# 3

class Metric():
    """
    This is the base class for all the metrics.
    The keys 'truth' and 'predictions' are required in the kwargs.
    """
    
    def __init__(self, **kwargs):
        """
        Class constructor
        """
        
        # Define a private member which will be True if there's any issue with the input
        self._error = False
        
        # Store the truth in a private member
        # TODO it would be better to raise errors when the required keys aren't present
        if 'truth' not in kwargs:
            print('The key `truth` is missing from the kwargs.')
            self._error = True
        else:
            self._truth = kwargs.pop('truth')
            
        # Store the predictions in a private member
        # TODO it would be better to raise errors when the required keys aren't present
        if 'predictions' not in kwargs:
            print('The key `predictions` is missing from the kwargs.')
            self._error = True
        else:
            self._predictions = kwargs.pop('predictions')

        # Check that truth and predictions have the same shape
        # (skipped when a key is missing, since the members would not exist)
        if not self._error and self._truth.shape != self._predictions.shape:
            print('The shape of truth ({}) and the one of the predictions ({}) differ'.format(
                self._truth.shape,
                self._predictions.shape,
            ))
            self._error = True
            
        # Store the rest of the kwargs in a separate private member
        self._additionals = kwargs
        
        
class MSE(Metric):
    """
    Class for obtaining the Mean Squared Error
    """
    
    def __call__(self):
        """
        This method calculates and returns the mean squared error
        """

        # If there was any issue with the input parameters, return None
        if self._error is True:
            return None
        
        # Calculate and return the MSE
        return np.mean((self._truth - self._predictions) ** 2)

In [28]:
# Test the previous class

truth = np.array([1, 2, 3])
predictions = np.array([-1, 0, 1])

mse = MSE(truth=truth, predictions=predictions)
mse()

4.0
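
For reference, this is the quantity `MSE` computes for n samples with ground truth y_i and predictions ŷ_i, worked out on the test values used above:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
= \frac{(1-(-1))^2 + (2-0)^2 + (3-1)^2}{3}
= \frac{4 + 4 + 4}{3} = 4
```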

In [135]:
# 4

# Define a Base Model class
class Model():
    """
    Base class from which all the model classes inherit
    """
    
    def __init__(self, x, y):
        self._x = np.atleast_2d(x)
        self._y = np.atleast_2d(y)
        self._params = None
        
        
class LinearRegression(Model):
    def calculate(self):
        # Calculate and return the slopes for the linear regression
        self._params = np.linalg.inv(self._x.T.dot(self._x)).dot(self._x.T).dot(self._y).T
        return self._params
    
    def predict(self, sample):
        return np.dot(self._params, sample.T)
    

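class AffineRegression(Model):
    def calculate(self):
        # Calculate and return the slopes and the bias for the affine regression.
        # The column of ones is added to a local copy so that calling calculate()
        # more than once does not keep appending columns to self._x
        ones = np.ones((len(self._x), 1), dtype=self._x.dtype)
        x_biased = np.concatenate((self._x, ones), axis=1)
        lr = LinearRegression(x_biased, self._y)
        self._params = lr.calculate()
        return self._params
    
    def predict(self, sample):
        # Append the bias term to every sample (one sample per row)
        sample = np.atleast_2d(sample)
        sample_biased = np.concatenate((sample, np.ones((len(sample), 1))), axis=1)
        return np.dot(self._params, sample_biased.T)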


In [136]:
# Test the previous class

# Define a test
x = np.array([[11, 0], [4, 1], [300, 2], [-15, 3], [4, 4], [-33, 5], [40, 6]])
y = np.array([[-10, 0, 10, 20, 30, 40, 50]]).T


lr = LinearRegression(x, y)
print('linear regression model: ', lr.calculate())
print('linear regression prediction: ', lr.predict(np.array([[15, 7]])))

ar = AffineRegression(x, y)
print('affine regression model: ', ar.calculate())
print('affine regression prediction: ', ar.predict(np.array([[15, 7]])))

linear regression model:  [[-0.0182077   7.82236267]]
linear regression prediction:  [[54.48342324]]
affine regression model:  [[ 1.38777878e-17  1.00000000e+01 -1.00000000e+01]]
affine regression prediction:  [[60.]]
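
The classes above invert X^T X explicitly, which fails or loses precision when the features are collinear or nearly so. As a hedged alternative, not necessarily the approach intended by the course, the same least-squares problem can be handed to `np.linalg.lstsq`. The sketch below reuses the test data and adds the bias column by hand.

```python
import numpy as np

x = np.array([[11, 0], [4, 1], [300, 2], [-15, 3], [4, 4], [-33, 5], [40, 6]], dtype=float)
y = np.array([[-10, 0, 10, 20, 30, 40, 50]], dtype=float).T

# Append a column of ones so that the last parameter acts as the bias
x_biased = np.hstack((x, np.ones((len(x), 1))))

# lstsq minimizes ||x_biased @ params - y||^2 without forming the inverse
params, residuals, rank, singular_values = np.linalg.lstsq(x_biased, y, rcond=None)

# Predict for the sample [15, 7], with the bias term appended
sample = np.array([[15, 7, 1]], dtype=float)
prediction = sample @ params

print('parameters:', params.ravel())
print('prediction:', prediction)
```

Up to floating-point error, the parameters should agree with the affine regression model printed above (slopes close to 0 and 10, bias close to -10).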


In [137]:
# 5

# Prepare the dataset
folder_path = '/home/rodolfo/Desktop/especializacion/IA'
file_name = 'income.csv'
ds = Dataset(folder_path, file_name)
train, validate, test = ds.split()

# Train the model using the training set.
# Fields of a structured array are selected by name; slicing with [:-1]
# would drop the last row instead of the last field
x_train = train['income'].reshape(-1, 1)
y_train = train['happiness'].reshape(-1, 1)
ar = AffineRegression(x_train, y_train)
params = ar.calculate()
print(params)
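
Steps 6 and 7 could be sketched along the following lines. Since income.csv is not reproduced here, the example uses synthetic data with the same field names. The generated values, the 80 / 20 split, the seed and the helper names `fit_affine`, `predict_affine` and `plot_fit` are all assumptions made for illustration, and the plot requires matplotlib.

```python
import numpy as np

# Synthetic stand-in for income.csv (assumption: same field names, made-up values)
rng = np.random.default_rng(42)
n = 200
dtypes = [('id', np.uint32), ('income', np.float32), ('happiness', np.float32)]
data = np.zeros(n, dtype=dtypes)
data['id'] = np.arange(n)
data['income'] = rng.uniform(1.5, 7.5, size=n)
data['happiness'] = 0.7 * data['income'] + 0.2 + rng.normal(0, 0.5, size=n)

# Shuffle the rows through a permutation of indexes, then split 80 / 20
data = data[rng.permutation(n)]
limit = int(n * 0.8)
train, test = data[:limit], data[limit:]

def fit_affine(x, y):
    """Least-squares fit of y = slope * x + bias; returns (slope, bias)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_biased = np.column_stack((x, np.ones(len(x))))
    params, *_ = np.linalg.lstsq(x_biased, y, rcond=None)
    return params[0], params[1]

def predict_affine(x, slope, bias):
    """Evaluate the fitted line on x."""
    return slope * np.asarray(x, dtype=float) + bias

def plot_fit(x, y, slope, bias):
    """Scatter the samples and draw the fitted line (requires matplotlib)."""
    import matplotlib.pyplot as plt
    grid = np.linspace(x.min(), x.max(), 100)
    plt.scatter(x, y, s=10, label='test samples')
    plt.plot(grid, predict_affine(grid, slope, bias), color='red', label='affine fit')
    plt.xlabel('income')
    plt.ylabel('happiness')
    plt.legend()
    plt.show()

# Step 5: fit on the training set. Step 6: predict on the test set and report the MSE
slope, bias = fit_affine(train['income'], train['happiness'])
predictions = predict_affine(test['income'], slope, bias)
mse = np.mean((test['happiness'] - predictions) ** 2)
print('slope: {:.3f}, bias: {:.3f}, test MSE: {:.3f}'.format(slope, bias, mse))
```

For step 7, calling `plot_fit(test['income'], test['happiness'], slope, bias)` draws the test samples together with the fitted line. With the real dataset, `train` and `test` would come from `ds.split()` instead of the synthetic array.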