# Machine Learning Course
PCIC, UNAM

Machine Learning

Rodrigo S. Cortez Madrigal

<img src="https://pcic.posgrado.unam.mx/wp-content/uploads/Ciencia-e-Ingenieria-de-la-Computacion_color.png" alt="Logo PCIC" width="128" />   

### Assignment 2: Linear regression and classification

Using the Automobile Dataset, fit regression models for car prices with the following variants:

- a. Least squares with polynomial expansion of different degrees.
- b. Least squares with polynomial expansion of degree 20 and l1 and l2 norm penalties for different values of λ.
- c. Least squares with polynomial expansion of degree 2 and feature selection.

Plot the mean squared error on training and validation against the polynomial degree,
the value of λ, and the number of features. All models must be evaluated with 10 repetitions of
5-fold cross-validation. Select one of the models and report its performance on the
test set.

In [1]:
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import RepeatedKFold, cross_val_score

import plotly.express as px
import plotly.graph_objects as go

from tqdm import tqdm
import numpy as np
from joblib import Parallel, delayed

In [2]:
# Disable warnings
import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn")

## Automobile Dataset

The Automobile Dataset contains information about different cars, including their price. The goal is to predict the price of a car from its characteristics.

In [3]:
# Load the data

data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data', header=None, sep=',', engine='python', na_values='?')

data.columns = ['symboling', 'normalized_losses', 'maker', 'fuel_type', 'aspiration', 'num_doors', 'body_style', 'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type', 'num_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']

In [4]:
data.describe()

| | symboling | normalized_losses | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 164.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 201.0 | 201.0 | 205.0 | 203.0 | 203.0 | 205.0 | 205.0 | 201.0 |
| mean | 0.834146 | 122.0 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 3.329751 | 3.255423 | 10.142537 | 104.256158 | 5125.369458 | 25.219512 | 30.75122 | 13207.129353 |
| std | 1.245307 | 35.442168 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 0.273539 | 0.316717 | 3.97204 | 39.714369 | 479.33456 | 6.542142 | 6.886443 | 7947.066342 |
| min | -2.0 | 65.0 | 86.6 | 141.1 | 60.3 | 47.8 | 1488.0 | 61.0 | 2.54 | 2.07 | 7.0 | 48.0 | 4150.0 | 13.0 | 16.0 | 5118.0 |
| 25% | 0.0 | 94.0 | 94.5 | 166.3 | 64.1 | 52.0 | 2145.0 | 97.0 | 3.15 | 3.11 | 8.6 | 70.0 | 4800.0 | 19.0 | 25.0 | 7775.0 |
| 50% | 1.0 | 115.0 | 97.0 | 173.2 | 65.5 | 54.1 | 2414.0 | 120.0 | 3.31 | 3.29 | 9.0 | 95.0 | 5200.0 | 24.0 | 30.0 | 10295.0 |
| 75% | 2.0 | 150.0 | 102.4 | 183.1 | 66.9 | 55.5 | 2935.0 | 141.0 | 3.59 | 3.41 | 9.4 | 116.0 | 5500.0 | 30.0 | 34.0 | 16500.0 |
| max | 3.0 | 256.0 | 120.9 | 208.1 | 72.3 | 59.8 | 4066.0 | 326.0 | 3.94 | 4.17 | 23.0 | 288.0 | 6600.0 | 49.0 | 54.0 | 45400.0 |


In [5]:
# Missing values by column

pd.DataFrame(data.isnull().sum())

| column | missing values |
|---|---|
| symboling | 0 |
| normalized_losses | 41 |
| maker | 0 |
| fuel_type | 0 |
| aspiration | 0 |
| num_doors | 2 |
| body_style | 0 |
| drive_wheels | 0 |
| engine_location | 0 |
| wheel_base | 0 |


In [6]:
total_faltantes = len(data) - len(data.dropna())  # rows with at least one missing value
print("Rows with missing data:", total_faltantes)
porcentaje_faltantes = total_faltantes/len(data)*100
print("Percentage of rows with missing data:", porcentaje_faltantes, "%")

x = [porcentaje_faltantes, 100-porcentaje_faltantes]
fig = px.pie(values=x, names=["Rows with missing data", "Complete rows"], title="Percentage of rows with missing data")
fig.show()

Rows with missing data: 46
Percentage of rows with missing data: 22.439024390243905 %


In [7]:
# Strategies for handling missing data

# 1. Drop the rows with missing data (.copy() avoids SettingWithCopyWarning on later assignments)
dataWN = data.dropna().copy()
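
Dropping rows discards about 22% of the dataset. An alternative that this notebook does not use is imputation. The following is a minimal sketch on a made-up toy frame (the values and the name `toy` are illustrative assumptions), filling numeric gaps with the median and categorical gaps with the most frequent value:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the automobile data; the values are made up.
toy = pd.DataFrame({
    "horsepower": [111.0, np.nan, 154.0, 102.0],
    "num_doors": ["two", "four", np.nan, "four"],
})

imputed = toy.copy()
# Numeric column: replace NaN with the column median.
imputed["horsepower"] = toy["horsepower"].fillna(toy["horsepower"].median())
# Categorical column: replace NaN with the most frequent value (the mode).
imputed["num_doors"] = toy["num_doors"].fillna(toy["num_doors"].mode()[0])

print(imputed)
```

If imputation were combined with cross-validation, the fill values would ideally be computed on the training folds only, so that no information from the validation fold leaks into training.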

In [8]:
# Correct Types

dataWN['price'] = pd.to_numeric(dataWN['price'])

categorical_columns = ['maker', 'fuel_type', 'aspiration', 'num_doors', 'body_style', 
                       'drive_wheels', 'engine_location', 'engine_type', 'num_cylinders', 'fuel_system']

for col in categorical_columns:
    dataWN[col] = dataWN[col].astype('category')




In [9]:
# Build X and y

X = dataWN[['symboling', 'normalized_losses', 'maker', 'fuel_type', 'aspiration', 'num_doors', 'body_style', 
            'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 
            'engine_type', 'num_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke', 
            'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg']].copy()
y = dataWN['price']

# Encode the categorical columns in X as integer codes
for col in X.select_dtypes(include=['category']).columns:
    X[col] = X[col].cat.codes






## Feature selection

In [10]:
# Compute the correlation between each feature and the price
correlation_matrix = X.corrwith(y).sort_values(ascending=False)
correlation_matrix = pd.DataFrame(correlation_matrix).reset_index()
correlation_matrix.columns = ['feature', 'correlation']

# Keep the features whose absolute correlation exceeds 0.1
correlation_matrix = correlation_matrix[correlation_matrix['correlation'].abs() > 0.1]
correlation_matrix = correlation_matrix.sort_values(by='correlation', ascending=False)

fig = px.bar(correlation_matrix, x='feature', y='correlation', title='Correlation with price')
fig.update_layout(xaxis_title='Feature', yaxis_title='Correlation')
fig.show()

# Select the 5 features with the highest (signed) correlation
selected_features = correlation_matrix.head(5)['feature'].tolist()
print("Top 5 features:", selected_features)

Top 5 features: ['curb_weight', 'width', 'engine_size', 'length', 'horsepower']
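
Because the ranking above sorts by signed correlation, a feature with a strong negative correlation would land at the bottom of the list. The sketch below uses synthetic data (the column names `pos`, `neg`, and `noise` are hypothetical) to show how ranking by absolute value changes the selection. It is an illustration, not the selection used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = pd.Series(rng.normal(size=n), name="target")
features = pd.DataFrame({
    "pos": target + 0.1 * rng.normal(size=n),    # strongly positive relation
    "neg": -target + 0.1 * rng.normal(size=n),   # strongly negative relation
    "noise": rng.normal(size=n),                 # unrelated to the target
})

corr = features.corrwith(target)

# Signed ranking: the strongly negative feature is pushed to the end.
top_signed = corr.sort_values(ascending=False).head(2).index.tolist()
# Absolute ranking: both informative features are kept.
top_abs = corr.abs().sort_values(ascending=False).head(2).index.tolist()

print("signed:  ", top_signed)
print("absolute:", top_abs)
```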


In [11]:
# Plot price against three of the most correlated features

fig = px.box(dataWN, x='curb_weight', y='price', title='Price by curb_weight')
fig.show()
fig = px.box(dataWN, x='width', y='price', title='Price by width')
fig.show()
fig = px.box(dataWN, x='horsepower', y='price', title='Price by horsepower')
fig.show()

### Polynomial regression

$Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon$

Polynomial expansion transforms the independent variable X into a set of polynomial terms, which lets a linear regression model nonlinear relationships.
- With a polynomial of degree 1, we obtain a plain linear regression.
- With a polynomial of degree 2 (quadratic), the model can capture curvature in the data.
- As the degree increases, the model becomes more flexible, but also more prone to overfitting.


In [12]:

def PolynomialTranAndEval(degree, X_train, X_test, y_train, y_test, model):
    """
    Train and evaluate a polynomial regression model.

    Args:
        degree (int): Polynomial degree.
        X_train (DataFrame): Training data.
        X_test (DataFrame): Test data.
        y_train (Series): Training targets.
        y_test (Series): Test targets.
        model: Base model.
    Returns:
        dict: Dictionary with the evaluation results.

    Scores (each reported with a _train and a _test suffix):
        mse: Mean squared error.
        r2: Coefficient of determination R^2.
        cross_val_mean: Mean of the cross-validation scores.
        cross_val_std: Standard deviation of the cross-validation scores.

    Cross-validation details:
        rkf: RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
        n_splits=5: Splits the data into 5 folds in each repetition.
        n_repeats=10: Repeats the split 10 times, with a different partition in each repetition.
        random_state=42: Fixes the seed so the splits are reproducible.
        cross_val_train: Array with one score per fold and repetition. No scoring argument is passed,
            so the estimator's default scorer is used (R^2 for regressors); these are R^2 values, not MSE.
        model: The model being evaluated (here, a pipeline of PolynomialFeatures and the base model).
        X_ and y_: The data used to train and validate the model.
        cv=rkf: Uses rkf to define how the data is split in each cross-validation iteration.
    """
    model = make_pipeline(PolynomialFeatures(degree), model)
    model.fit(X_train, y_train)

    rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42) # Repeated K-fold cross-validation splitter.

    # Train
    y_train_pred = model.predict(X_train)
    mse_train = mean_squared_error(y_train, y_train_pred)
    r2_train = r2_score(y_train, y_train_pred) 
    cross_val_train = cross_val_score(model, X_train, y_train, cv=rkf) # Cross-validated R^2 (default scorer); a clone is refit on each fold.
    cross_val_mean_train = cross_val_train.mean()
    cross_val_std_train = cross_val_train.std()

    # Test
    y_test_pred = model.predict(X_test)
    mse_test = mean_squared_error(y_test, y_test_pred)
    r2_test = r2_score(y_test, y_test_pred)
    cross_val_test = cross_val_score(model, X_test, y_test, cv=rkf) # Same procedure, refit on folds of the test set only.
    cross_val_mean_test = cross_val_test.mean()
    cross_val_std_test = cross_val_test.std()

    return {
        'degree': degree,
        'mse_train': mse_train,
        'r2_train': r2_train,
        'cross_val_mean_train': cross_val_mean_train,
        'cross_val_std_train': cross_val_std_train,
        'mse_test': mse_test,
        'r2_test': r2_test,
        'cross_val_mean_test': cross_val_mean_test,
        'cross_val_std_test': cross_val_std_test
    }
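
The function above reports cross-validated R^2, while the assignment asks for mean squared error on training and validation. One possible way to get both from the same folds is `cross_validate` with `scoring="neg_mean_squared_error"` and `return_train_score=True`. The helper `cv_mse` and the synthetic data below are assumptions made for this sketch and are not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def cv_mse(degree, X, y, base_model):
    """Mean train/validation MSE over 10 repeats of 5-fold CV (hypothetical helper)."""
    pipeline = make_pipeline(PolynomialFeatures(degree), base_model)
    rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
    scores = cross_validate(
        pipeline, X, y, cv=rkf,
        scoring="neg_mean_squared_error",  # scikit-learn negates MSE so that higher is better
        return_train_score=True,
    )
    return {
        "degree": degree,
        "mse_train": -scores["train_score"].mean(),
        "mse_val": -scores["test_score"].mean(),
    }


# Synthetic data with a quadratic relationship, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(100, 2))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + 0.1 * rng.normal(size=100)

for d in (1, 2, 3):
    print(cv_mse(d, X_demo, y_demo, LinearRegression()))
```

Running this only on the training split and keeping the test split for a single final evaluation would match the protocol described in the assignment.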

### A. Least squares with polynomial expansion of different degrees.

In [13]:
# Polynomial degrees to evaluate
degrees = [1, 2, 3, 4, 5]

X_train, X_test, y_train, y_test = train_test_split(X[selected_features], y, test_size=0.2, random_state=42)

results = Parallel(n_jobs=-1)(
    delayed(PolynomialTranAndEval)(degree, X_train, X_test, y_train, y_test, LinearRegression())
    for degree in degrees
)

results_df = pd.DataFrame(results)

print(results_df)

   degree     mse_train  r2_train  cross_val_mean_train  cross_val_std_train  \
0       1  5.789976e+06  0.848955              0.797972             0.093727   
1       2  3.226951e+06  0.915817              0.712689             0.226291   
2       3  8.131799e+05  0.978786            -18.711205            26.880194   
3       4  1.306889e+05  0.996591         -49489.420688        119669.912248   
4       5  1.145006e+05  0.997013         -82179.649822        206067.593725   

       mse_test     r2_test  cross_val_mean_test  cross_val_std_test  
0  6.466301e+06    0.636615             0.538396            0.813883  
1  4.710297e+06    0.735297           -19.467551           49.541306  
2  3.421764e+07   -0.922919           -45.876961           87.562067  
3  1.009126e+09  -55.709600           -65.718306          140.385630  
4  2.426913e+09 -135.384554           -99.119472          252.279476  


In [15]:
# Plot the results
# Columns in results_df: degree, mse_train, r2_train, cross_val_mean_train, cross_val_std_train,
# mse_test, r2_test, cross_val_mean_test, cross_val_std_test

train_fig = go.Figure()
train_fig.add_trace(go.Scatter(x=results_df['degree'], y=results_df['mse_train'], mode='lines+markers', name='MSE Train'))
train_fig.add_trace(go.Scatter(x=results_df['degree'], y=results_df['r2_train'], mode='lines+markers', name='R2 Train'))
train_fig.add_trace(go.Scatter(x=results_df['degree'], y=results_df['cross_val_mean_train'], mode='lines+markers', name='Cross Val Mean Train'))
train_fig.add_trace(go.Scatter(x=results_df['degree'], y=results_df['cross_val_std_train'], mode='lines+markers', name='Cross Val Std Train'))
train_fig.update_layout(title='Train Results', xaxis_title='Degree', yaxis_title='Score')
train_fig.show()

test_fig = go.Figure()
test_fig.add_trace(go.Scatter(x=results_df['degree'], y=results_df['mse_test'], mode='lines+markers', name='MSE Test'))
test_fig.add_trace(go.Scatter(x=results_df['degree'], y=results_df['r2_test'], mode='lines+markers', name='R2 Test'))
test_fig.add_trace(go.Scatter(x=results_df['degree'], y=results_df['cross_val_mean_test'], mode='lines+markers', name='Cross Val Mean Test'))
test_fig.add_trace(go.Scatter(x=results_df['degree'], y=results_df['cross_val_std_test'], mode='lines+markers', name='Cross Val Std Test'))
test_fig.update_layout(title='Test Results', xaxis_title='Degree', yaxis_title='Score')
test_fig.show()

### B. Least squares with polynomial expansion of degree 20 and l1 and l2 norm penalties for different values of λ.

- The L1 penalty (Lasso) adds a term proportional to the sum of the absolute values of the coefficients.
- The L2 penalty (Ridge) adds a term proportional to the sum of the squared coefficients.
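
Written out for a linear model with coefficients $\beta_j$, the two penalized least-squares problems take the following textbook form. This is a standard formulation, not taken from the notebook, and it assumes the intercept $\beta_0$ is left unpenalized:

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|
$$

$$
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
$$

In scikit-learn, `alpha` plays the role of λ. `Lasso` additionally scales the squared-error term by 1/(2n), so the same numeric value of `alpha` does not correspond to the same penalty strength in `Ridge` and `Lasso`.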

In [16]:
from sklearn.linear_model import Ridge, Lasso

X_train, X_test, y_train, y_test = train_test_split(X[selected_features], y, test_size=0.2, random_state=42)

# Values of lambda (alpha in scikit-learn)
lambdas = [0.01, 0.1, 1, 10, 100]

ridge_results = []
lasso_results = []
degree = 20

ridge_results = Parallel(n_jobs=-1)(
    delayed(PolynomialTranAndEval)(degree, X_train, X_test, y_train, y_test, Ridge(alpha=alpha))
    for alpha in lambdas
)

lasso_results = Parallel(n_jobs=-1)(
    delayed(PolynomialTranAndEval)(degree, X_train, X_test, y_train, y_test, Lasso(alpha=alpha, max_iter=10000))
    for alpha in lambdas
)

ridge_results = pd.DataFrame(ridge_results)
lasso_results = pd.DataFrame(lasso_results)

# Save the results to CSV
ridge_results.to_csv('ridge_results_20Degree.csv', index=False)
lasso_results.to_csv('lasso_results_20Degree.csv', index=False)


In [18]:
ridge_results

| | degree | mse_train | r2_train | cross_val_mean_train | cross_val_std_train | mse_test | r2_test | cross_val_mean_test | cross_val_std_test |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | 6470020.0 | 0.831215 | -1654.463597 | 4269.997097 | 11250040.0 | 0.367784 | -10410.81292 | 34587.360838 |
| 1 | 20 | 6470020.0 | 0.831215 | -1654.463597 | 4269.997097 | 11250040.0 | 0.367784 | -10410.81292 | 34587.360838 |
| 2 | 20 | 6470020.0 | 0.831215 | -1654.463597 | 4269.997097 | 11250040.0 | 0.367784 | -10410.81292 | 34587.360838 |
| 3 | 20 | 6470020.0 | 0.831215 | -1654.463597 | 4269.997097 | 11250040.0 | 0.367784 | -10410.81292 | 34587.360838 |
| 4 | 20 | 6470020.0 | 0.831215 | -1654.463597 | 4269.997097 | 11250040.0 | 0.367784 | -10410.81292 | 34587.360838 |


In [19]:
lasso_results

| | degree | mse_train | r2_train | cross_val_mean_train | cross_val_std_train | mse_test | r2_test | cross_val_mean_test | cross_val_std_test |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | 1001422.0 | 0.973876 | -53871.575289 | 365609.421287 | 10650480.0 | 0.401478 | -661.882104 | 3327.372369 |
| 1 | 20 | 1001530.0 | 0.973873 | -53811.306311 | 365191.042002 | 10644720.0 | 0.401802 | -664.834666 | 3344.672286 |
| 2 | 20 | 1002505.0 | 0.973847 | -53293.308219 | 361599.489134 | 10596710.0 | 0.404499 | -669.878245 | 3396.812464 |
| 3 | 20 | 1002025.0 | 0.97386 | -54783.027379 | 372066.522211 | 10676680.0 | 0.400005 | -659.640611 | 3356.213608 |
| 4 | 20 | 1000782.0 | 0.973892 | -55202.322218 | 375056.53623 | 10633220.0 | 0.402448 | -677.025335 | 3460.615109 |
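
The Ridge rows above are identical for every value of λ. One possible explanation, offered as an assumption and not verified against this run, is that the unscaled degree-20 terms are so large that a penalty between 0.01 and 100 has no visible effect. A common remedy is to standardize the features before and after the expansion. The sketch below uses synthetic data and a low degree (both chosen for illustration) to show the coefficient norm responding to `alpha` once the features are scaled.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
# Synthetic feature on a large scale, loosely mimicking curb_weight.
X_demo = rng.uniform(1500, 4000, size=(120, 1))
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=500, size=120)

coef_norms = {}
for alpha in [0.01, 1, 100]:
    model = make_pipeline(
        StandardScaler(),                                  # center and scale the raw inputs
        PolynomialFeatures(degree=3, include_bias=False),  # expand the scaled inputs
        StandardScaler(),                                  # put the polynomial terms on one scale
        Ridge(alpha=alpha),
    )
    model.fit(X_demo, y_demo)
    # Euclidean norm of the fitted Ridge coefficients (last pipeline step).
    coef_norms[alpha] = float(np.linalg.norm(model[-1].coef_))

print(coef_norms)
```

With scaled features, a larger `alpha` yields a smaller coefficient norm, which is the behavior the penalty is meant to produce.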


Recall that cross-validation evaluates the performance of a machine learning model by splitting the dataset into several parts (or "folds").

In each iteration, the model is trained on all folds but one and evaluated on the held-out fold. Averaging over the iterations gives a more robust estimate of the model's performance and helps detect overfitting.
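
As a small illustration of the splitting scheme used in this notebook (5 folds, 10 repetitions), the sketch below runs `RepeatedKFold` on a made-up array of 10 samples and inspects the index splits it generates.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Ten made-up samples with two features each.
X_demo = np.arange(20).reshape(10, 2)

rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=42)
splits = list(rkf.split(X_demo))

print(len(splits))  # 50 train/validation splits: 5 folds x 10 repetitions

train_idx, val_idx = splits[0]
print(len(train_idx), len(val_idx))  # 8 2

# Within one repetition, every sample is held out for validation exactly once.
first_repeat_val = np.concatenate([val for _, val in splits[:5]])
print(sorted(first_repeat_val.tolist()))
```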

In [40]:
# Plot the results using px

ridge_results['lambda'] = [str(alpha) for alpha in lambdas]
lasso_results['lambda'] = [str(alpha) for alpha in lambdas]

fig = px.bar(ridge_results, x='lambda', y=['mse_train', 'r2_train', 'cross_val_mean_train', 'cross_val_std_train'], title='Ridge Results')
fig.update_layout(title='Ridge Results', xaxis_title='Lambda', yaxis_title='Score')
fig.show()
