## Used car price predictor


Obtain data by web scraping a car sales site, where each listing includes a set of features and a final price.

How can we help a seller decide what price to ask for a vehicle?

In [1]:
import pandas as pd

In [2]:
cars = pd.read_csv('cars.csv')
cars.head()

Unnamed: 0,maker,model,year,transmission,mileage,fuelType,tax,mpg,engineSize,price
0,cclass,C Class,2020,Automatic,1200,Diesel,,,2.0,30495
1,cclass,C Class,2020,Automatic,1000,Petrol,,,1.5,29989
2,cclass,C Class,2020,Automatic,500,Diesel,,,2.0,37899
3,cclass,C Class,2019,Automatic,5000,Diesel,,,2.0,30399
4,cclass,C Class,2019,Automatic,4500,Diesel,,,2.0,29899


## EDA

In [3]:
# pip install ydata-profiling

In [4]:
# Always start with an exploratory data analysis

# Import the current library (ydata_profiling replaces pandas_profiling)
from ydata_profiling import ProfileReport

# Create a data profiling report
profile = ProfileReport(cars, title="Reporte de Perfil CARS", explorative=True)

# Write the report to an HTML file
profile.to_file("reporte_perfil_cars.html")




## Feature Engineering

In [5]:
len(cars)

108540

In [6]:
cars = cars.drop_duplicates(keep='first')

In [7]:
len(cars)

106267

## Split the dataset


In [8]:
from sklearn.model_selection import train_test_split
import numpy as np

rest, test = train_test_split(cars, test_size=0.2, shuffle=True) # 20% of the total
train, val = train_test_split(rest, test_size=0.25, shuffle=True) # 25% of 80% = 20% of the total

distributions = np.array([len(train), len(val), len(test)])

print(distributions)
print(distributions/len(cars))



[63759 21254 21254]
[0.59998871 0.20000565 0.20000565]
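
The two-step split above gives a 60/20/20 partition because the second call takes 25% of the remaining 80%. The sketch below reproduces the printed sizes with plain arithmetic. It assumes that a fractional `test_size` is rounded up to a whole number of rows, which matches the numbers shown. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_frac=0.2, val_frac_of_rest=0.25):
    """Return (train, val, test) row counts for a two-step split."""
    n_test = math.ceil(test_frac * n_rows)        # first split: hold out the test set
    n_rest = n_rows - n_test
    n_val = math.ceil(val_frac_of_rest * n_rest)  # second split: validation from the rest
    n_train = n_rest - n_val
    return n_train, n_val, n_test

sizes = split_sizes(106267)  # row count after drop_duplicates
print(sizes)
print([round(s / 106267, 4) for s in sizes])
```

Neither call sets `random_state`, so the rows that land in each partition change on every run even though the sizes stay the same.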


Always encode categorical variables.
With pandas we can use get_dummies to one-hot encode a categorical column.

In [9]:
train[['maker']]

Unnamed: 0,maker
54332,vauxhall
91600,hyundi
29621,skoda
84141,vw
83913,vw
...,...
41180,ford
13411,audi
54660,vauxhall
103046,merc


In [10]:
pd.get_dummies(train[['maker']]) # Builds an indicator matrix from the column; handy for quick data analysis, but not recommended for training a model

Unnamed: 0,maker_audi,maker_bmw,maker_cclass,maker_focus,maker_ford,maker_hyundi,maker_merc,maker_skoda,maker_toyota,maker_vauxhall,maker_vw
54332,False,False,False,False,False,False,False,False,False,True,False
91600,False,False,False,False,False,True,False,False,False,False,False
29621,False,False,False,False,False,False,False,True,False,False,False
84141,False,False,False,False,False,False,False,False,False,False,True
83913,False,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...
41180,False,False,False,False,True,False,False,False,False,False,False
13411,True,False,False,False,False,False,False,False,False,False,False
54660,False,False,False,False,False,False,False,False,False,True,False
103046,False,False,False,False,False,False,True,False,False,False,False


## One Hot Encoding

In [11]:
# Alternative to get_dummies
from sklearn.preprocessing import OneHotEncoder
maker_encoder = OneHotEncoder()

maker_encoder.fit(train[['maker']])

In [12]:
mkr = maker_encoder.transform(train[['maker']])
mkr.shape

(63759, 11)

In [13]:
maker_encoder.categories_

[array(['audi', 'bmw', 'cclass', 'focus', 'ford', 'hyundi', 'merc',
        'skoda', 'toyota', 'vauxhall', 'vw'], dtype=object)]
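
A fitted encoder is preferable to get_dummies for training because it remembers the categories seen during `fit` and always emits the same columns, whereas get_dummies derives its columns from whatever data it is given. The hand-rolled sketch below illustrates the fit-then-transform idea only. `TinyOneHot` is a hypothetical class, not the scikit-learn implementation.

```python
class TinyOneHot:
    """Hand-rolled illustration of fit-then-transform one-hot encoding."""

    def fit(self, values):
        # Remember the sorted set of categories seen during fitting.
        self.categories_ = sorted(set(values))
        return self

    def transform(self, values):
        # One column per remembered category; unseen values become all zeros.
        return [[1.0 if v == c else 0.0 for c in self.categories_] for v in values]

enc = TinyOneHot().fit(["vw", "ford", "audi", "ford"])
print(enc.categories_)
print(enc.transform(["ford", "kia"]))
```

In this sketch an unseen value such as "kia" becomes a row of zeros. scikit-learn's `OneHotEncoder` raises an error on unseen categories by default, and behaves like the sketch only when created with `handle_unknown="ignore"`.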

In [14]:
# Assuming 'train' is the DataFrame and 'maker' is the column holding the car makes

# Step 1: Fit the encoder and transform the data
encoder = OneHotEncoder(sparse_output=False)  # Use sparse_output=False rather than sparse=False
mkr = encoder.fit_transform(train[['maker']])  # Encode the 'maker' column

# Step 2: Take the categories to use as column names
categories = encoder.categories_[0]  # The array of categories found in 'maker'

# Step 3: Build the DataFrame using the categories as column names
df = pd.DataFrame(data=mkr, columns=categories, index=train.index)

# Step 4: Add the 'actual' column with the original makes
df["actual"] = train["maker"].values

# Show a sample of the DataFrame
df.sample(10)

Unnamed: 0,audi,bmw,cclass,focus,ford,hyundi,merc,skoda,toyota,vauxhall,vw,actual
53563,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,vauxhall
44673,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,ford
44928,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,ford
24873,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,toyota
88598,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,vw
80172,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,vw
37819,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,ford
34485,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,ford
95224,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,hyundi
57101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,vauxhall


In [15]:
# Check that the indices match before assigning
print(df.index.equals(train.index))  # This should be True

True


In [16]:
# Assign the original DataFrame's index if needed
df.index = train.index

In [17]:
# Add the "actual" column
df["actual"] = train["maker"].values  # .values assigns by position rather than by index alignment

In [18]:
print("Shape de mkr:", mkr.shape)
print("Shape de train['maker']:", train[["maker"]].shape)

Shape de mkr: (63759, 11)
Shape de train['maker']: (63759, 1)


In [19]:
print(type(train.index))
print(type(train[["maker"]].index))

<class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.indexes.base.Index'>


## Feature Scaling (compressing values into a standard range)

In [20]:
from sklearn.preprocessing import RobustScaler, StandardScaler, MinMaxScaler, MaxAbsScaler
scaler = MaxAbsScaler()

In [21]:
scaler.fit(train[["mileage"]])

In [22]:
scaled= scaler.transform(train[["mileage"]])
scaled

array([[0.04385333],
       [0.02472667],
       [0.10647333],
       ...,
       [0.11572667],
       [0.04780667],
       [0.04236333]])

In [23]:
values = pd.DataFrame({'mileage': train['mileage'].values, 'scaled': scaled.squeeze()})
values.sample(5)

Unnamed: 0,mileage,scaled
47059,11285,0.037617
63701,4500,0.015
11001,16570,0.055233
3029,6405,0.02135
33578,100,0.000333


In [24]:
values.describe()

Unnamed: 0,mileage,scaled
count,63759.0,63759.0
mean,23241.649085,0.077472
std,21089.031948,0.070297
min,1.0,3e-06
25%,7773.0,0.02591
50%,17532.0,0.05844
75%,32456.5,0.108188
max,300000.0,1.0


In [25]:
scaler.inverse_transform([[1.]])

array([[300000.]])

In [26]:
scaler.inverse_transform([[0.058824]])

array([[17647.2]])
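
`MaxAbsScaler` divides each value by the largest absolute value seen during `fit`, and `inverse_transform` multiplies it back. The minimal sketch below checks this against the numbers above. The `scale` and `inverse` helpers are hypothetical names used only for illustration.

```python
import math

# Sample mileage values from the table above, plus the training maximum.
mileage = [4500, 16570, 100, 300000]

max_abs = max(abs(m) for m in mileage)  # the only statistic this scaler needs

def scale(x):
    return x / max_abs

def inverse(s):
    return s * max_abs

print([round(scale(m), 6) for m in mileage])
print(inverse(0.058824))
```

Because the maximum is 300000 while the median is about 17500, most scaled values end up below 0.11, which is the compression visible in the `describe()` output.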

In [27]:
scaler = StandardScaler()
scaler.fit(train[["mileage"]])

In [28]:
scaled= scaler.transform(train[["mileage"]])
scaled

array([[-0.47824519],
       [-0.75033189],
       [ 0.41255659],
       ...,
       [ 0.54419003],
       [-0.42200699],
       [-0.49944121]])

In [29]:
values = pd.DataFrame({'mileage': train['mileage'].values, 'scaled': scaled.squeeze()})
values.sample(5)

Unnamed: 0,mileage,scaled
10220,47541,1.152236
55993,2798,-0.969405
2852,10248,-0.616138
9989,10305,-0.613435
29074,24427,0.056207


In [30]:
values.describe()

Unnamed: 0,mileage,scaled
count,63759.0,63759.0
mean,23241.649085,8.126904000000001e-17
std,21089.031948,1.000008
min,1.0,-1.102034
25%,7773.0,-0.7334984
50%,17532.0,-0.2707423
75%,32456.5,0.4369534
max,300000.0,13.12343
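
The scaled column reports a standard deviation of 1.000008 rather than exactly 1. A likely explanation is that `StandardScaler` divides by the population standard deviation (dividing by n), while `describe()` reports the sample standard deviation (dividing by n - 1). The sketch below shows the effect on the five mileage values from `cars.head()`, then computes the same ratio for 63759 rows.

```python
import math
import statistics

mileage = [1200, 1000, 500, 5000, 4500]  # mileage values from cars.head()

mean = statistics.fmean(mileage)
pop_std = statistics.pstdev(mileage)     # divides by n, as StandardScaler does

z = [(m - mean) / pop_std for m in mileage]

print(statistics.fmean(z))    # close to 0
print(statistics.pstdev(z))   # close to 1 (population std)
print(statistics.stdev(z))    # sqrt(n / (n - 1)): the sample std that describe() reports

# The same ratio for the 63759 training rows.
print(round(math.sqrt(63759 / 63758), 6))
```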


## Artifacts (the tools needed to make a prediction)

Saving artifacts requires a serializer, which lets us store Python objects on disk.

In [31]:
import pickle

with open('scaler.pickle', 'wb') as wb:
    pickle.dump(scaler, wb)

with open('maker_encoder.pickle', 'wb') as wb:
    pickle.dump(maker_encoder, wb)
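
Loading an artifact back is the mirror image of saving it. The sketch below does a full round trip with a plain dict standing in for the fitted scaler, so it runs without scikit-learn. The dict and the temporary file path are assumptions made for illustration. The mean and standard deviation are the ones reported by `describe()` above.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted artifact: plain parameters instead of a scikit-learn object.
artifact = {"name": "mileage_scaler", "mean": 23241.649085, "std": 21089.031948}

path = os.path.join(tempfile.mkdtemp(), "artifact.pickle")

with open(path, "wb") as f:
    pickle.dump(artifact, f)

with open(path, "rb") as f:      # "rb": pickles are binary files
    restored = pickle.load(f)

print(restored == artifact)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code. For scikit-learn objects it is also safest to load with the same library version that saved them.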

## Pipelines

In [32]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn import set_config

In [33]:
one_hot_encoder = ColumnTransformer([
    (
        'maker-transmission-fuelType', # Name of the transformation
        OneHotEncoder(sparse_output=False),  # The transformer
        ['maker', 'transmission', 'fuelType'] # Columns to transform
    )
])

In [34]:
one_hot_encoder.fit(train)

In [35]:
one_hot_encoder_var = one_hot_encoder.transform(train)

In [36]:
one_hot_encoder_var.shape

(63759, 20)

In [37]:
# Robust scaling
robust_encoding = ColumnTransformer([
    (
        'mileage', # Name of the transformation
        RobustScaler(),  # The transformer
        ['mileage'] # Columns to transform
    )
])
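
With its default settings, `RobustScaler` centers on the median and divides by the interquartile range, so a single extreme mileage does not squash the other values the way max-abs scaling does. The sketch below compares the two on made-up values with one outlier. The data is an assumption, chosen only to make the contrast visible.

```python
import statistics

mileage = [5000, 10000, 20000, 30000, 300000]  # made-up values with one extreme outlier

median = statistics.median(mileage)
q1, _, q3 = statistics.quantiles(mileage, n=4, method="inclusive")
iqr = q3 - q1

robust = [(m - median) / iqr for m in mileage]     # median/IQR scaling
max_abs = [m / max(mileage) for m in mileage]      # max-abs scaling, for comparison

print(robust)
print([round(v, 3) for v in max_abs])
```

The typical cars keep a usable spread under robust scaling, while under max-abs scaling they are all crowded near zero.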

In [38]:
# Impute and min-max scale mpg and tax
impute_and_scale = Pipeline([
    (
        'impute', # Name of the step
        SimpleImputer(strategy='mean')
    ),
    (
        'scale', # Name of the step
        MinMaxScaler()
    ),
])


standard_scaling = ColumnTransformer([
    (
        'mpg-tax', # Name of the transformation
        impute_and_scale,  # The transformer
        ['mpg', 'tax'] # Columns to transform
    )
])
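
The order of the two steps matters: missing values have to be filled before the column can be rescaled. The pure-Python sketch below mimics mean imputation followed by min-max scaling on a made-up `tax` column. `impute_mean` and `min_max` are hypothetical helpers, and the values are invented for illustration.

```python
def impute_mean(column):
    """Replace None with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

def min_max(column):
    """Rescale to [0, 1] using the column's own minimum and maximum."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

tax = [None, 145, 20, None, 30]   # made-up values; None marks a missing entry
filled = impute_mean(tax)
print(filled)
print(min_max(filled))
```

Unlike this sketch, the scikit-learn pipeline learns the mean, minimum and maximum from the training data during `fit` and reuses them when transforming validation and test data.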

In [39]:
# Pass only year and engineSize through unchanged
passthrough = ColumnTransformer([
    ('pass', 'passthrough', ['year', 'engineSize'])
])

In [40]:
# Assemble the pipelines

pipel = Pipeline([
    (
        'features',
        FeatureUnion([
            ('one_hot_encode', one_hot_encoder),
            ('robust_encoding', robust_encoding),
            ('passth', passthrough),
            ('scale_and_impute', standard_scaling)
        ])
    )
])

In [41]:
from sklearn import set_config

set_config(display="diagram")
pipel

In [42]:
pipel.fit(train)

In [43]:
train_x = pipel.transform(train)

In [44]:
train_x.shape

(63759, 25)
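
The 25 columns are the outputs of the four `FeatureUnion` branches placed side by side in the order they were declared. A quick tally, using the one-hot width reported earlier and the column lists passed to each `ColumnTransformer`:

```python
# Output width of each FeatureUnion branch, in declaration order.
branch_widths = {
    "one_hot_encode": 20,    # shape reported by one_hot_encoder.transform(train)
    "robust_encoding": 1,    # mileage
    "passth": 2,             # year, engineSize
    "scale_and_impute": 2,   # mpg, tax
}

total = sum(branch_widths.values())
print(total)
```

This ordering also means the last two columns of the transformed array are the scaled `mpg` and `tax`, preceded by `engineSize`.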

In [45]:
pipel.transform(val)



array([[0.        , 0.        , 0.        , ..., 1.2       , 0.12561209,
        0.03448276],
       [0.        , 0.        , 0.        , ..., 1.6       , 0.10708963,
        0.25      ],
       [0.        , 0.        , 0.        , ..., 1.6       , 0.07196083,
        0.25      ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.8       , 0.16478603,
        0.        ],
       [0.        , 0.        , 0.        , ..., 1.        , 0.13434107,
        0.25      ],
       [0.        , 0.        , 0.        , ..., 1.2       , 0.10708963,
        0.21551724]])

In [46]:
pipel.transform(test)

array([[0.        , 0.        , 0.        , ..., 1.4       , 0.11326379,
        0.25      ],
       [0.        , 1.        , 0.        , ..., 1.5       , 0.17457952,
        0.        ],
       [0.        , 0.        , 0.        , ..., 2.        , 0.09027039,
        0.25      ],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.13434107,
        0.03448276],
       [0.        , 0.        , 1.        , ..., 2.        , 0.1152843 ,
        0.20667868],
       [0.        , 0.        , 0.        , ..., 2.        , 0.08601235,
        0.35344828]])

In [47]:
pipel.transform(val).shape

(21254, 25)

In [48]:
pipel.transform(test).shape

(21254, 25)

In [49]:
from sklearn.linear_model import LinearRegression

In [50]:
lr = LinearRegression()

In [51]:
predicting_pipeline = Pipeline([
    ('feature_engineering', pipel),
    ('estimador_precios', lr)
])

In [52]:
set_config(display="diagram")
predicting_pipeline

In [53]:
_ = predicting_pipeline.fit(train, train['price'])

In [54]:
train_pred = predicting_pipeline.predict(train)
val_pred = predicting_pipeline.predict(val)

In [55]:
pd.DataFrame({'real': val['price'], 'predicted': val_pred})

Unnamed: 0,real,predicted
84705,11500,12481.18750
97938,16722,19234.21875
91198,15498,18255.65625
89951,18876,18173.34375
18327,10999,11744.28125
...,...,...
100934,24202,24601.65625
29700,16890,19052.31250
23453,11500,12195.12500
29013,7495,8609.15625


In [56]:
# Evaluate the metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error

In [57]:
train_mae = mean_absolute_error(train['price'], train_pred)
val_mae = mean_absolute_error(val['price'], val_pred)

# mean_absolute_error returns the MAE (in price units), not the MSE
print(f"Training MAE: {train_mae:.2f}\n"
      f"Validation MAE: {val_mae:.2f}")


Training MAE: 2922.47
Validation MAE: 2933.04


In [58]:
test_pred = predicting_pipeline.predict(test)
test_mae = mean_absolute_error(test['price'], test_pred)

print(f"Test MAE: {test_mae:.2f}")

Test MAE: 2971.30
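
The figures above come from `mean_absolute_error`, so they are expressed in price units. The sketch below contrasts MAE with MSE and RMSE on the first five rows of the validation output shown earlier. It is a hand computation for illustration, not a replacement for the scikit-learn metrics.

```python
import math

# First five rows of the real/predicted table above.
real = [11500, 16722, 15498, 18876, 10999]
predicted = [12481.18750, 19234.21875, 18255.65625, 18173.34375, 11744.28125]

errors = [p - r for p, r in zip(predicted, real)]

mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error, in price units
mse = sum(e * e for e in errors) / len(errors)    # mean squared error, in squared price units
rmse = math.sqrt(mse)                             # back in price units, weights large misses more

print(f"MAE:  {mae:.2f}")
print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```

RMSE is never smaller than MAE, and the gap between them grows when a few predictions are far off.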


## Saving Pipelines

In [59]:
from joblib import dump, load

dump(predicting_pipeline, 'car-prices.model')

['car-prices.model']

In [63]:
saved_pipeline = load('car-prices.model')

In [60]:
maker = 'ford'
model = 'focus'
year = 2020
transmission = 'Manual'
mileage = 50
fuelType = 'Petrol'
tax = 100
mpg = 30
engineSize = 1.5

mi_automovil = pd.DataFrame({
    'maker' : [maker],
    'model' : [model],
    'year' : [year],
    'transmission' : [transmission],
    'mileage' : [mileage],
    'fuelType' : [fuelType],
    'tax' : [tax],
    'mpg' : [mpg],
    'engineSize' : [engineSize],
})

In [61]:
mi_automovil

Unnamed: 0,maker,model,year,transmission,mileage,fuelType,tax,mpg,engineSize
0,ford,focus,2020,Manual,50,Petrol,100,30,1.5


In [64]:
price = saved_pipeline.predict(mi_automovil).squeeze()
print(price)

22238.03125
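
The saved pipeline selects its inputs by column name, so a prediction request needs every column the transformers reference. `model` is not among them. The sketch below is a hypothetical pre-check, not part of the notebook, that lists any missing columns before a record is turned into a DataFrame.

```python
# Columns selected by the ColumnTransformers in the pipeline above.
REQUIRED_COLUMNS = [
    "maker", "transmission", "fuelType",   # one-hot encoded
    "mileage",                             # robust-scaled
    "year", "engineSize",                  # passed through
    "mpg", "tax",                          # imputed and min-max scaled
]

def missing_columns(record):
    """Return the required columns that are absent from a car description."""
    return [c for c in REQUIRED_COLUMNS if c not in record]

car = {"maker": "ford", "model": "focus", "year": 2020, "transmission": "Manual",
       "mileage": 50, "fuelType": "Petrol", "tax": 100, "mpg": 30, "engineSize": 1.5}

print(missing_columns(car))
print(missing_columns({"maker": "ford", "year": 2020}))
```

The check covers column names only. The `OneHotEncoder` instances in the pipeline use the default settings, so a category value that never appeared in training would still raise an error at prediction time.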
