# Exercises for the first class

In [1]:
import numpy as np

#### Exercise #1:    Matrix Operations
Given a matrix as a NumPy array, where each row of the matrix represents a mathematical vector:
* Compute the l0, l1, l2, and l-infinity norms
    * l0: number of nonzero elements in the vector
    * l1, l2: the p-norm with p = 1 and p = 2:
    ![](https://latex.codecogs.com/svg.latex?%7B%5Ccolor%7BOrange%7D%20%5Cleft%20%5C%7C%20x%20%5Cright%20%5C%7C_%7Bp%7D%20%3D%20%5Cleft%20%28%20%5Csum_%7B1%7D%5E%7Bn%7D%20%5Cleft%20%7C%20x_%7Bi%7D%20%5Cright%20%7C%5Ep%20%5Cright%20%29%5E%7B%5Ctfrac%7B1%7D%7Bp%7D%7D%7D)
    * l-infinity:
     ![](https://latex.codecogs.com/svg.latex?%7B%5Ccolor%7BOrange%7D%20%5Cleft%20%5C%7C%20x%20%5Cright%20%5C%7C_%7B%5Cinfty%7D%20%3D%20max_%7Bi%7D%20%5Cleft%20%7C%20x_%7Bi%7D%20%5Cright%20%7C%7D)

In [2]:
def norm_p(input, p):
    """
    Calculate and return the p-norm of each row of a numpy matrix. For p < 0 the infinity norm is used
    :param input: 2-d matrix
    :param p: norm order
    :return: 1-d array with the p-norm of each row of the input matrix
    """

    # Handle case of norm 0
    if p == 0:
        return np.count_nonzero(input, axis=1)

    # Handle case of infinite norm
    if p < 0:
        return np.max(np.abs(input), axis=1)

    # Handle the rest of the cases (the absolute value keeps odd p correct for negative entries)
    return np.sum(np.abs(input) ** p, axis=1) ** (1 / p)


In [3]:
# Test the function norm_p
matrix = np.array([[1, 2, 3], [4, 5, 6], [0, 7, 8]])

print('norm 0: expected: [3 3 2], obtained:', norm_p(matrix, 0))
print('norm 1: expected: [6 15 15], obtained:', norm_p(matrix, 1))
print('norm 2: expected: [3.74  8.77 10.63], obtained:', norm_p(matrix, 2))
print('norm infinite: expected: [3 6 8], obtained:', norm_p(matrix, -1))


norm 0: expected: [3 3 2], obtained: [3 3 2]
norm 1: expected: [6 15 15], obtained: [ 6. 15. 15.]
norm 2: expected: [3.74  8.77 10.63], obtained: [ 3.74165739  8.77496439 10.63014581]
norm infinite: expected: [3 6 8], obtained: [3 6 8]
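
As a cross-check, the same row-wise norms can be computed with `np.linalg.norm`, which accepts `ord` values of 0, 1, 2 and `np.inf` for vectors. The sketch below is an independent check, not part of the exercise solution. The matrix `signed_matrix` is a made-up example with negative entries, chosen to show why the absolute value matters for odd `p`.

```python
import numpy as np

# Hypothetical test matrix with negative entries (not from the exercise)
signed_matrix = np.array([[1, -2, 3], [4, 5, -6], [0, 7, 8]])

# Row-wise reference norms from NumPy's built-in implementation
reference = {
    order: np.linalg.norm(signed_matrix, ord=order, axis=1)
    for order in (0, 1, 2, np.inf)
}

# Manual l1 norm: without np.abs the first row would sum to 2 instead of 6
manual_l1 = np.sum(np.abs(signed_matrix), axis=1)

for order, values in reference.items():
    print('norm', order, ':', values)
print('manual l1:', manual_l1)
```

Comparing a hand-written norm against `np.linalg.norm` on inputs with mixed signs is a cheap way to catch a missing absolute value.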


#### Exercise #2:    Sorting
Given a matrix as a NumPy array, where each row of the matrix represents a mathematical vector, compute the l2 norm of each vector.
Once the norms are obtained, sort them from largest to smallest. Finally, obtain the original matrix with its rows ordered according to the l2 norm.

_All operations must be vectorized._

In [4]:
def sorting(input):
    """
    Given a 2-d numpy array, calculate the 2-norm of each row, sort the norms
    and return the original array with its rows reordered according to
    their 2-norm

    :param input: 2-d numpy array
    :return: 2-d numpy array with the rows of the input reordered
    """

    # Apply the norm to all rows
    norms = norm_p(input, 2)

    # Get the indexes that sort the norms from largest to smallest
    sorted_indexes = np.argsort(norms)[::-1]

    # Apply those indexes to the input matrix and return
    return input[sorted_indexes, :]

In [5]:
# Test the function sorting
matrix = np.array([[3, 4, 5], [0, 1, 2], [6, 7, 8]])
print('expected: [[6 7 8], [3 4 5], [0 1 2]], obtained:', sorting(matrix))

expected: [[6 7 8], [3 4 5], [0 1 2]], obtained: [[6 7 8]
 [3 4 5]
 [0 1 2]]


#### Exercise #3:    Indexing
The goal is to build an index for user identifiers, that is, _id2idx_ and _idx2id_.
To do so, create a class where the index is generated in the constructor. Implement the methods _get_users_id_ and _get_users_idx_.

* User identifiers: users_id = [15, 12, 14, 10, 1, 2, 1]
* User indexes: users_idx = [0, 1, 2, 3, 4, 5, 4]

```
id2idx =  [-1     4     5    -1    -1    -1     -1    -1    -1    -1     3     -1      1    -1     2     0]
          [ 0     1     2     3     4     5      6     7     8     9    10     11     12    13    14    15]

id2idx[15] -> 0 ; id2idx[12] -> 1 ; id2idx[3] -> -1
idx2id[0] -> 15 ; idx2id[4] -> 1

```


In [6]:
class Indexing():
    def __init__(self, ids):
        self._ids = ids

    def get_users_id(self, idx):
        # Return -1 if the index is out of range
        if 0 <= idx < len(self._ids):
            return self._ids[idx]
        return -1

    def get_users_idx(self, id):
        # Return -1 if there's no user with the given ID
        location = np.where(self._ids == id)[0]
        if len(location) > 0:
            return location[0]
        return -1


In [7]:
# Test the indexing class
indexing_obj = Indexing(np.array([15, 12, 14, 10, 1, 2]))
print('expected: 0, obtained:', indexing_obj.get_users_idx(15))
print('expected: 1, obtained:', indexing_obj.get_users_idx(12))
print('expected: -1, obtained:', indexing_obj.get_users_idx(3))
print('expected: 15, obtained:', indexing_obj.get_users_id(0))
print('expected: 1, obtained:', indexing_obj.get_users_id(4))


expected: 0, obtained: 0
expected: 1, obtained: 1
expected: -1, obtained: -1
expected: 15, obtained: 15
expected: 1, obtained: 1
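
The class above searches the stored identifiers with `np.where` on every call. The lookup-table layout described in the exercise statement (an `id2idx` array filled with -1 and an `idx2id` array) can be sketched as follows. This is one possible approach under the assumption that identifiers are small non-negative integers; the function name `build_index` is hypothetical.

```python
import numpy as np

def build_index(users_id):
    """Build id2idx / idx2id lookup arrays (hypothetical helper).

    Assumes the identifiers are non-negative integers small enough
    to be used directly as positions in an array.
    """
    users_id = np.asarray(users_id)

    # Unique identifiers in order of first appearance
    _, first_position = np.unique(users_id, return_index=True)
    idx2id = users_id[np.sort(first_position)]

    # -1 marks identifiers that do not exist
    id2idx = np.full(idx2id.max() + 1, -1, dtype=np.int64)
    id2idx[idx2id] = np.arange(idx2id.size)
    return id2idx, idx2id

users_id = np.array([15, 12, 14, 10, 1, 2, 1])
id2idx, idx2id = build_index(users_id)

print('id2idx[15]:', id2idx[15])
print('id2idx[3]:', id2idx[3])
print('idx2id[4]:', idx2id[4])
print('users_idx:', id2idx[users_id])  # vectorized lookup of all ids at once
```

The table costs memory proportional to the largest identifier, but every lookup becomes a plain array access, and a whole array of identifiers can be translated in one indexing operation.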


#### Exercise #4:    Precision, Recall, Accuracy
In classification problems, there are two arrays: the **truth** (ground truth) and the **prediction**.
Each element of the arrays can take two values: _True_ (represented by 1) and _False_ (represented by 0).
Therefore, four variables can be defined:
* True Positive (TP): the truth is 1 and the prediction is 1.
* True Negative (TN): the truth is 0 and the prediction is 0.
* False Negative (FN): the truth is 1 and the prediction is 0.
* False Positive (FP): the truth is 0 and the prediction is 1.

From these four variables, the following metrics are defined:
* Precision = TP / (TP + FP)
* Recall = TP / (TP + FN)
* Accuracy = (TP + TN) / (TP + TN + FP + FN)

For the following arrays, representing the **truth** and the **prediction**,
compute the metrics above with vectorized operations in NumPy.
* truth = [1,1,0,1,1,1,0,0,0,1]
* prediction = [1,1,1,1,0,0,1,1,0,0]


In [8]:
def precision_recall_accuracy(truth, prediction):
    """
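    This function returns the precision, recall and accuracy calculated for the input arrays
    :param truth: 1-d numpy array with the ground truth (1 for True, 0 for False)
    :param prediction: 1-d numpy array with the predictions, same shape as truth
    :return: tuple (precision, recall, accuracy)
    """

    # Calculate true positives, false positives, true negatives and false negatives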
    tp = np.sum(truth * prediction)
    fp = np.sum(np.logical_not(truth) * prediction)
    tn = np.sum(np.logical_not(truth) * np.logical_not(prediction))
    fn = np.sum(truth * np.logical_not(prediction))

    # Calculate the precision, recall and accuracy
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)

    # Return all calculated values
    return precision, recall, accuracy


In [9]:
# Test the function precision_recall_accuracy
print(precision_recall_accuracy(np.array([1,1,0,1,1,1,0,0,0,1]), np.array([1,1,1,1,0,0,1,1,0,0])))

(0.5, 0.5, 0.4)
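
An alternative way to obtain the four counts in a single pass is to encode each (truth, prediction) pair as an integer and count the codes with `np.bincount`. The sketch below is only an illustration of that idea on the arrays from the exercise; the encoding `2 * truth + prediction` is a choice made here, not something the exercise prescribes.

```python
import numpy as np

truth = np.array([1, 1, 0, 1, 1, 1, 0, 0, 0, 1])
prediction = np.array([1, 1, 1, 1, 0, 0, 1, 1, 0, 0])

# Encode each (truth, prediction) pair as 2 * truth + prediction:
# 0 -> TN, 1 -> FP, 2 -> FN, 3 -> TP
codes = 2 * truth + prediction
tn, fp, fn, tp = np.bincount(codes, minlength=4)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print('TP, TN, FP, FN:', tp, tn, fp, fn)
print('precision, recall, accuracy:', precision, recall, accuracy)
```

For these arrays the counts are TP = 3, TN = 1, FP = 3 and FN = 3, which gives the same (0.5, 0.5, 0.4) as the function above.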


#### Exercise #5:    Average Query Precision
In information retrieval or search engines, we generally have queries “q” and, for each “q”, a list of documents that are truly relevant.
To evaluate a search engine, it is common to use the **average query precision** metric.
Using the following example as a reference, compute the metric with NumPy using vectorized operations.
```
q_id =             [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4]
predicted_rank =   [0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1, 2, 3]
truth_relevance =  [T, F, T, F, T, T, T, F, F, F, F, F, T, F, F, T] 
```
* Precision for q_id 1 = 2 / 4
* Precision for q_id 2 = 3 / 3
* Precision for q_id 3 = 0 / 5
* Precision for q_id 4 = 2 / 4

**_average query precision_** = ((2/4) + (3/3) + (0/5) + (2/4)) / 4


In [10]:
def average_query_precision(q_id, predicted_rank, truth_relevance):
    # Get the precision for each of the unique values of q_id
    unique = np.unique(q_id)
    precisions = np.empty(len(unique))
    for i, value in enumerate(unique):
        mask = q_id == value
        n_values = np.sum(mask)
        n_true = np.sum(mask * truth_relevance)
        precisions[i] = n_true / n_values

    # Return the average of the precisions
    return np.average(precisions)

In [11]:
# Test the function average_query_precision
q_id = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4])
predicted_rank = np.array([0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 0, 1, 2, 3])
truth_relevance = np.array([1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1])
print('expected: 0.5, obtained:', average_query_precision(q_id, predicted_rank, truth_relevance))

expected: 0.5, obtained: 0.5
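
The solution above loops over the unique query ids. A loop-free variant can be sketched with `np.bincount`, which counts the documents per query and, with the `weights` argument, sums the relevant ones per query. This sketch assumes that `q_id` contains non-negative integers; the function name `average_query_precision_vectorized` is hypothetical, and `predicted_rank` is left out because the metric as defined in the example does not use it.

```python
import numpy as np

def average_query_precision_vectorized(q_id, truth_relevance):
    """Loop-free average query precision (hypothetical helper).

    Assumes q_id holds non-negative integers.
    """
    # Number of documents returned for each query id
    totals = np.bincount(q_id)
    # Number of relevant documents for each query id
    hits = np.bincount(q_id, weights=truth_relevance)

    # Ignore ids that never appear (for example id 0 in the example data)
    present = totals > 0
    return np.mean(hits[present] / totals[present])

q_id = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4])
truth_relevance = np.array([1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1])

result = average_query_precision_vectorized(q_id, truth_relevance)
print('average query precision:', result)
```

The per-query precisions are 2/4, 3/3, 0/5 and 2/4, so the result agrees with the 0.5 obtained by the loop-based version.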


#### Exercise #6:    Distance to Centroids
Given a point cloud _X_ and centroids _C_, obtain the distance between
each vector in _X_ and the centroids using vectorized operations and broadcasting in NumPy.
Use the following values as a reference:
```
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
C = [[1, 0, 0], [0, 1, 1]]   
```

In [12]:
def centroids_distance(points, centroids):
    """
    Calculates the Euclidean distance from each point to each centroid.
    The returned array has shape (len(centroids), len(points))
    :param points: 2-d numpy array, one point per row
    :param centroids: 2-d numpy array, one centroid per row
    :return: 2-d numpy array where element [i, j] is the distance from centroid i to point j
    """

    # Flatten the points into an array of shape (1, n_points * n_dims) and subtract
    # the centroids tiled n_points times, which gives (n_centroids, n_points * n_dims)
    points_flat = points.reshape(1, points.shape[0] * points.shape[1])
    subtraction = points_flat - np.tile(centroids, (1, points.shape[0]))
    column = np.reshape(subtraction ** 2, (len(centroids) * points.shape[0], points.shape[1]))
    squares_sum = np.sum(column, axis=1)

    return np.reshape(np.sqrt(squares_sum), (len(centroids), len(points)))

In [13]:
# Test function centroids_distance
points = np.array([[1, 0.5, 0.5], [0.5, 1, 1], [7, 8, 9]])
centroids = np.array([[1, 0, 0], [0, 1, 1]])
print('expected: [[0.707 1.5 13.45], [1.22  0.5  12.73]], obtained:', centroids_distance(points, centroids))

expected: [[0.707 1.5 13.45], [1.22  0.5  12.73]], obtained: [[ 0.70710678  1.5        13.45362405]
 [ 1.22474487  0.5        12.72792206]]
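
The same distances can also be obtained with broadcasting alone, without `reshape` or `np.tile`. The sketch below adds a new axis to each array so that NumPy pairs every centroid with every point. The function name `centroids_distance_broadcast` is hypothetical, and the inputs are the reference values _X_ and _C_ from the exercise statement.

```python
import numpy as np

def centroids_distance_broadcast(points, centroids):
    """Distance from each centroid to each point via broadcasting (hypothetical helper)."""
    # (n_centroids, 1, n_dims) - (1, n_points, n_dims) -> (n_centroids, n_points, n_dims)
    differences = centroids[:, np.newaxis, :] - points[np.newaxis, :, :]
    # Sum the squared differences over the coordinate axis and take the root
    return np.sqrt(np.sum(differences ** 2, axis=2))

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
C = np.array([[1, 0, 0], [0, 1, 1]])

distances = centroids_distance_broadcast(X, C)
print(distances.shape)
print(distances)
```

The result keeps the layout used by `centroids_distance`: one row per centroid and one column per point.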


#### Exercise #7:    Cluster Labeling
For each row in _X_, obtain the index of the row in _C_ with the smallest Euclidean distance.
That is, for each row in _X_, determine which cluster in _C_ it belongs to.


_Hint_: use np.argmin.


In [14]:
def cluster_tag(points, centroids):
    """
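    Calculates which is the closest centroid to each point
    :param points: 2-d numpy array, one point per row
    :param centroids: 2-d numpy array, one centroid per row
    :return: 1-d numpy array with the index of the closest centroid for each point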
    """

    return np.argmin(centroids_distance(points, centroids), axis=0)

In [15]:
# Test function cluster_tag
print('expected: [0 1 1], obtained:', cluster_tag(points, centroids))

expected: [0 1 1], obtained: [0 1 1]


#### Exercise #8:   Basic K-means Implementation
K-means is one of the most basic algorithms in unsupervised Machine Learning.
It is a clustering algorithm that groups data sharing similar characteristics.
Recall that we understand data as _n_ realizations of the random vector _X_.

The algorithm works as follows:
1. The user selects the number of clusters to create, _n_.
2. _n_ random elements of _X_ are selected as the initial positions of the centroids _C_.
3. The distance between all points in _X_ and all points in _C_ is computed.
4. For each point in _X_, the closest centroid in _C_ is selected.
5. The centroids _C_ are recomputed from the rows of _X_ that belong to each centroid (their mean).
6. Steps 3 to 5 are repeated a fixed number of times or until the position of the centroids stops changing within a given tolerance.

Therefore, implement the function k_means(X, n) so that, when it finishes, it returns the position of the centroids
and which cluster each row of _X_ belongs to.

_Hint_: for (2) use functions from np.random, for (3) and (4) use the previous exercises,
for (5) it is valid to use a for loop. Iterate 10 times between (3) and (5).
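
Steps 3 to 5 can be sketched on a toy dataset as follows. This is a minimal illustration of a single iteration, not the exercise solution: the four points in `X_toy` are made up, and the initial centroids are fixed to the first and third rows (instead of the random choice of step 2) so that the result is reproducible.

```python
import numpy as np

# Made-up toy data: two points near the origin and two near (10, 10)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# Fixed initial centroids (assumption made for reproducibility)
C_toy = X_toy[[0, 2]].copy()

# Step 3: distance from every point to every centroid, shape (n_points, n_centroids)
distances = np.linalg.norm(X_toy[:, np.newaxis, :] - C_toy[np.newaxis, :, :], axis=2)

# Step 4: index of the closest centroid for each point
labels = np.argmin(distances, axis=1)

# Step 5: each centroid becomes the mean of the points assigned to it
new_C = np.array([X_toy[labels == j].mean(axis=0) for j in range(len(C_toy))])

print('labels:', labels)
print('updated centroids:', new_C)
```

After this single iteration the centroids move to the middle of each pair of points, and repeating the three steps would leave them unchanged.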


In [16]:
def kmeans(data, n, max_iter, tol=1e-3):
    """
    k-means implementation
    :param data: data to be classified
    :param n: amount of clusters
    :param max_iter: maximum amount of iterations
    :param tol: the tolerance used for the centroids displacements:
        if the centroids move less than the tolerance, return them
    :return:
        - centroids
        - membership table
    """
    
    # Take n random elements from the data
    indexes = np.random.choice(data.shape[0], n, replace=False)
    centroids = data[indexes]
    
    # Initialize the membership table for the initial values
    membership = cluster_tag(data, centroids)
    # Iterate until the maximum amount is reached
    for i in range(max_iter):
        # Iterate over all the clusters and update their centroids
        total_displacement = 0
        for j in range(n):
            # Calculate the new centroid for the cluster
            mask = membership == j
            if not np.any(mask):
                # An empty cluster keeps its previous centroid
                continue
            points_included = data[mask]
            new_centroid = np.average(points_included, axis=0)
            
            # Calculate the displacement between the old and the new centroid and add it to the total displacement
            displacement = centroids_distance(
                centroids[j].reshape((1, len(new_centroid))),
                new_centroid.reshape((1, len(new_centroid))),
            )[0, 0]
            total_displacement += displacement
            
            # Update the centroid
            centroids[j] = new_centroid
            
        # Update the distribution of the data considering the new centroids
        membership = cluster_tag(data, centroids)
    
        # If the total displacement was below the tolerance, return
        if total_displacement < tol:
            break
    
    # Return the found centroids and the membership table
    return centroids, membership


In [17]:
# Test kmeans

# Generate a dataset
k = 1  # Distance modulator
n = 100000  # Number of samples
data = k * np.array([[1, 0, 0, 0], [0, 1, 0, 0]])
data = np.repeat(data, n // 2, axis=0)  # integer repeat count
data = data + np.random.normal(0, 1, data.shape)

# True membership table: 1 for samples around the first center, 0 for the second
index = np.array([[1], [0]])
index = np.repeat(index, n // 2, axis=0)

# Call kmeans
centroids, membership = kmeans(data, 2, 10)
print('centroids: ', centroids)
precision, recall, accuracy = precision_recall_accuracy(index.T[0], membership)
print('precision:', precision)
print('recall:', recall)
print('accuracy:', accuracy)


centroids:  [[ 1.21516465 -0.17271159  0.01284632 -0.01393214]
 [-0.2186828   1.18785086 -0.01274398  0.0238751 ]]
precision: 0.23803036980648262
recall: 0.23764
accuracy: 0.23846


#### Observation:

The values obtained for precision, recall and accuracy can be around 0.76 or 0.24, depending on whether k-means
numbers the clusters in the same order in which they were defined or in the opposite one.
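
For the two-cluster case, one simple way to remove this ambiguity before computing the metrics is to flip the predicted labels whenever the flipped version agrees better with the truth. The sketch below assumes binary 0/1 labels; the function name `align_binary_labels` and the small arrays are made up for illustration.

```python
import numpy as np

def align_binary_labels(truth, predicted):
    """Flip 0/1 cluster labels when the flipped version agrees better with the truth.

    Hypothetical helper, valid only for two clusters labeled 0 and 1.
    """
    truth = np.asarray(truth)
    predicted = np.asarray(predicted)
    agreement = np.mean(truth == predicted)
    return predicted if agreement >= 0.5 else 1 - predicted

# Made-up example where the clusters were numbered in the opposite order
truth = np.array([1, 1, 1, 0, 0, 0])
predicted = np.array([0, 0, 1, 1, 1, 1])

aligned = align_binary_labels(truth, predicted)
print('agreement before:', np.mean(truth == predicted))
print('agreement after:', np.mean(truth == aligned))
```

With more than two clusters a flip is not enough, and the labels would need to be matched with a permutation instead.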

#### Exercise #9:   Computing Metrics with \_\_call__
In machine learning problems, it is very common that, for each prediction obtained on our validation and evaluation dataset, we store the result of that prediction in NumPy arrays, together with the true value and auxiliary parameters (such as the ranking of the prediction and the query id).

After obtaining all the predictions, we can use the information stored in the NumPy arrays to calculate all the metrics we want to measure in our system.

A good practice for implementing this in Python is to create classes that inherit from a “base” Metric class, with each metric implementing the \_\_call__ method.

Use inheritance, the \_\_call__ operator and _kwargs_ to write a program that calculates all the metrics from the previous exercises with a for loop.


In [18]:
class Metric():
    """
    This is the base class for all the metrics.
    The keys 'truth' and 'predictions' are required in the kwargs.
    """
    
    def __init__(self, **kwargs):
        """
        Class constructor
        """
        
        # Define a private member which will be True if there's any issue with the input
        self._error = False
        
        # Store the truth in a private member
        # TODO it would be better to raise errors when the required keys aren't present
        if 'truth' not in kwargs:
            print('The key `truth` is missing from the kwargs.')
            self._error = True
        else:
            self._truth = kwargs.pop('truth')
            
        # Store the predictions in a private member
        # TODO it would be better to raise errors when the required keys aren't present
        if 'predictions' not in kwargs:
            print('The key `predictions` is missing from the kwargs.')
            self._error = True
        else:
            self._predictions = kwargs.pop('predictions')

        # Check that truth and predictions have the same shape (only when both were provided)
        if not self._error and self._truth.shape != self._predictions.shape:
            print('The shape of truth ({}) and the one of the predictions ({}) differ'.format(
                self._truth.shape,
                self._predictions.shape,
            ))
            self._error = True
            
        # Store the rest of the kwargs in a separate private member
        self._additionals = kwargs
        
        
class Precision(Metric):
    """
    Class for obtaining the precision
    """
    
    def __call__(self):
        """
        This method calculates and returns the precision
        """

        # If there was any issue with the input parameters, return None
        if self._error is True:
            return None
        
        # Calculate and return the precision
        tp = np.sum(self._truth * self._predictions)
        fp = np.sum(np.logical_not(self._truth) * self._predictions)
        return tp / (tp + fp)


class Recall(Metric):
    """
    Class for obtaining the recall
    """
    
    def __call__(self):
        """
        This method calculates and returns the recall
        """

        # If there was any issue with the input parameters, return None
        if self._error is True:
            return None
        
        # Calculate and return the recall
        tp = np.sum(self._truth * self._predictions)
        fn = np.sum(self._truth * np.logical_not(self._predictions))
        return tp / (tp + fn)
    
    
class Accuracy(Metric):
    """
    Class for obtaining the accuracy
    """
    
    def __call__(self):
        """
        This method calculates and returns the accuracy
        """

        # If there was any issue with the input parameters, return None
        if self._error is True:
            return None
        
        # Calculate and return the accuracy
        tp = np.sum(self._truth * self._predictions)
        tn = np.sum(np.logical_not(self._truth) * np.logical_not(self._predictions))
        fp = np.sum(np.logical_not(self._truth) * self._predictions)
        fn = np.sum(self._truth * np.logical_not(self._predictions))
        return (tp + tn) / (tp + tn + fp + fn)


In [19]:
# Test the Metric class
params = {
    'truth': np.array([1,1,0,1,1,1,0,0,0,1]),
    'predictions': np.array([1,1,1,1,0,0,1,1,0,0]),
}

print(Precision(**params)())
print(Recall(**params)())
print(Accuracy(**params)())

0.5
0.5
0.4
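
The exercise statement asks for the metrics to be computed with a for loop. Because every metric shares the same constructor and is invoked through `__call__`, the loop only needs the list of classes and the common kwargs. The sketch below is self-contained and uses hypothetical class names (`BaseMetric`, `PrecisionMetric`, `RecallMetric`, `AccuracyMetric`) so that it does not clash with the classes defined above; input validation is omitted for brevity.

```python
import numpy as np

class BaseMetric:
    """Hypothetical base class: stores the common inputs taken from kwargs."""

    def __init__(self, **kwargs):
        self.truth = kwargs['truth']
        self.predictions = kwargs['predictions']

    def __call__(self):
        raise NotImplementedError

class PrecisionMetric(BaseMetric):
    def __call__(self):
        tp = np.sum((self.truth == 1) & (self.predictions == 1))
        fp = np.sum((self.truth == 0) & (self.predictions == 1))
        return tp / (tp + fp)

class RecallMetric(BaseMetric):
    def __call__(self):
        tp = np.sum((self.truth == 1) & (self.predictions == 1))
        fn = np.sum((self.truth == 1) & (self.predictions == 0))
        return tp / (tp + fn)

class AccuracyMetric(BaseMetric):
    def __call__(self):
        return np.mean(self.truth == self.predictions)

params = {
    'truth': np.array([1, 1, 0, 1, 1, 1, 0, 0, 0, 1]),
    'predictions': np.array([1, 1, 1, 1, 0, 0, 1, 1, 0, 0]),
}

# Instantiate and call every metric in a single loop
results = {}
for metric_class in (PrecisionMetric, RecallMetric, AccuracyMetric):
    results[metric_class.__name__] = metric_class(**params)()
    print(metric_class.__name__, results[metric_class.__name__])
```

Adding a new metric then only requires a new subclass and one more entry in the tuple.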


#### Alternative implementation of the Metric class

In [20]:
class Metric():
    """
    This is the base class for all the metrics.
    """
    
    def __init__(self, truth, predictions, **kwargs):
        """
        Class constructor
        
        :param truth: numpy array with the true values.
        :param predictions: numpy array with the predictions.
            The shape must be equal to truth's.
        """
        
        # Check that truth and predictions have the same shape
        if truth.shape != predictions.shape:
            raise ValueError(
                'The shape of truth ({}) and the one of the predictions ({}) differ'.format(
                    truth.shape,
                    predictions.shape,
                ))
        
        # Store the truth and predictions in private members
        self._truth = truth
        self._predictions = predictions

        # Store the rest of the kwargs in a separate private member
        self._additionals = kwargs
        
        # Calculate the true positives and store them in a private member
        # This is done only with true positives because they're used in all metrics
        self._tp = np.sum(self._truth * self._predictions)
        
        # Initialize the rest of the intermediate results with None
        self._tn = None
        self._fp = None
        self._fn = None
        
        
    def precision(self):
        """
        This method calculates and returns the precision
        """

        # Update intermediate values if required and return the precision
        if self._fp is None:
            self._fp = np.sum(np.logical_not(self._truth) * self._predictions)
        return self._tp / (self._tp + self._fp)


    def recall(self):
        """
        This method calculates and returns the recall
        """

        # Update intermediate values if required and return the recall
        if self._fn is None:
            self._fn = np.sum(self._truth * np.logical_not(self._predictions))
        return self._tp / (self._tp + self._fn)

    
    def accuracy(self):
        """
        This method calculates and returns the accuracy
        """

        # Update intermediate values if required and return the accuracy
        if self._tn is None:
            self._tn = np.sum(np.logical_not(self._truth) * np.logical_not(self._predictions))
        if self._fp is None:
            self._fp = np.sum(np.logical_not(self._truth) * self._predictions)
        if self._fn is None:
            self._fn = np.sum(self._truth * np.logical_not(self._predictions))
        return (self._tp + self._tn) / (self._tp + self._tn + self._fp + self._fn)


In [21]:
# Test the Metric class
metric = Metric(np.array([1,1,0,1,1,1,0,0,0,1]), np.array([1,1,1,1,0,0,1,1,0,0]))

print(metric.precision())
print(metric.recall())
print(metric.accuracy())


0.5
0.5
0.4


#### Exercise #10:   Dataset to Structured NumPy Array - Singleton Design Pattern
For this exercise we are going to download a dataset from Kaggle. Creating an account is recommended, since it is a site from which we will potentially download many resources.

You can download the dataset from [here](https://www.kaggle.com/rounakbanik/the-movies-dataset/data?select=ratings.csv).

The goal of the exercise is to create a class that supports the following operations on the dataset:
* Create the structure of a structured NumPy array for the dataset.
* Read the csv and store the information in the structured array.
* Save the structured array in .pkl format.
* Create a singleton instance of the structured array (using \_\_new__ and \_\_init__).
* When creating the instance, load from the .pkl if it is found. If the .pkl is not there, start by transforming the .csv into a .pkl and then load the information.
* Find a way to optimize the operation using generators [optional].
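
The interplay between `__new__`, `__init__` and a structured array can be sketched on a tiny in-memory CSV. This is a minimal illustration under several assumptions: the class name `RatingsStore`, the helper `parse_rows` and the two sample rows are made up, the header is assumed to be `userId,movieId,rating,timestamp`, and the pickle caching step is left out to keep the example self-contained.

```python
import csv
import io
import numpy as np

# Field layout assumed for the ratings file
RATINGS_DTYPE = np.dtype([
    ('userId', np.uint32),
    ('movieId', np.uint32),
    ('rating', np.float32),
    ('timestamp', np.uint32),
])

def parse_rows(text_file):
    """Generator that yields one typed tuple per csv row (hypothetical helper)."""
    reader = csv.reader(text_file)
    next(reader)  # skip the header line
    for user_id, movie_id, rating, timestamp in reader:
        yield int(user_id), int(movie_id), float(rating), int(timestamp)

class RatingsStore:
    """Hypothetical singleton: __new__ caches the object, __init__ loads only once."""
    _instance = None

    def __new__(cls, text_file):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._loaded = False
        return cls._instance

    def __init__(self, text_file):
        # __init__ runs on every call, so skip the work after the first load
        if self._loaded:
            return
        self.data = np.array(list(parse_rows(text_file)), dtype=RATINGS_DTYPE)
        self._loaded = True

# Made-up sample rows, used only to exercise the code
sample_csv = (
    "userId,movieId,rating,timestamp\n"
    "1,110,1.0,1425941529\n"
    "1,147,4.5,1425942435\n"
)

first = RatingsStore(io.StringIO(sample_csv))
second = RatingsStore(io.StringIO(sample_csv))
print('same object:', first is second)
print('ratings column:', first.data['rating'])
```

Here the generator output is materialized with `list` before building the array, which is simple but holds every row in memory at once; streaming the rows into the array is the optimization that the optional item points to.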
 

In [25]:
import tempfile
import zipfile
import os
import csv
import pickle as pkl


def get_data(file_path):
    """
    Generator for loading a csv row by row
    """
    
    # Define the generator
    with open(file_path, 'r') as opened_file:
        reader = csv.reader(opened_file, delimiter=',')
        for i, row in enumerate(reader):
            if i == 0:
                # Skip first line with headers
                continue
            yield tuple(row)


class Dataset():
    """
    This class holds a dataset as a singleton
    """
    # Initialize the instance with None
    instance = None
    
    
    def __new__(cls, folder_path, zip_file_name):
        """
        Return the instance of the dataset. Define it if needed.
        """
        
        # Return the instance if it already exists
        if cls.instance is not None:
            return cls.instance
                
        # Load the instance from the pickle if available
        pickle_path = os.path.join(folder_path, 'data.pkl')
        if os.path.isfile(pickle_path):
            with open(pickle_path, 'rb') as pkl_file:
                cls.instance = pkl.load(pkl_file)
            return cls.instance            
                
        # Generate the instance
        csv_path = os.path.join(folder_path, zip_file_name)
        with tempfile.TemporaryDirectory() as temporary_dir:
            # Unzip the csv in a temporary folder
            with zipfile.ZipFile(csv_path, 'r') as zip_ref:
                zip_ref.extractall(temporary_dir)
                
            # Load the structured array using a generator
            dtypes = [
                ('userId', np.uint32),
                ('movieId', np.uint32),
                ('rating', np.float32),
                ('timestamp', np.uint32),
            ]
            file_path = os.path.join(temporary_dir, 'ratings.csv')
            cls.instance = np.fromiter(get_data(file_path), dtype=dtypes)
        
        # Save the instance as a pickle and return it
        with open(pickle_path, 'wb') as pkl_file:
            pkl.dump(cls.instance, pkl_file)
        return cls.instance


In [None]:
# Test the previous implementation
folder_path = '/home/rodolfo/Desktop/especializacion/IA'
zip_file_name = 'movies.zip'
ds = Dataset(folder_path, zip_file_name)
print(ds)
