<h1 align=center><font size=6>Vehicle price prediction🚗</font></h1>

In [1]:
import pandas as pd
import numpy as np

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
cars = pd.read_csv(url, header=None)  

In [3]:
# List of column names
headers = ["symboling","normalized-losses","make","fuel-type","aspiration","num-of-doors","body-style",
"drive-wheels","engine-location","wheel-base","length","width","height","curb-weight", "engine-type",
"num-of-cylinders","engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
"peak-rpm","city-mpg","highway-mpg","price"]

cars.columns = headers

In [4]:
cars.head(5)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


<div class="alert alert-danger alertdanger">
<h2> Data cleaning</h2>

### Identifying missing values

Some columns contain missing entries that appear as "?". These should be replaced with NaN, the default missing-value marker in pandas, which is used for convenience and computational speed.

In [5]:
cars.replace("?", np.nan, inplace=True)
cars.head(5)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
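
As an alternative to replacing "?" after loading, the marker can be declared when the file is read. The sketch below is not the approach used in this notebook. It runs on a hypothetical three-row excerpt with only four columns, and `sample` and `demo` are illustrative names.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the same layout as the raw file
# (no header row, "?" for missing values); not the real data.
sample = io.StringIO(
    "3,?,alfa-romero,13495\n"
    "2,164,audi,13950\n"
    "1,?,audi,?\n"
)

# na_values="?" tells read_csv to parse "?" as NaN while loading,
# so columns that hold only numbers and gaps are inferred as float64.
demo = pd.read_csv(
    sample,
    header=None,
    names=["symboling", "normalized-losses", "make", "price"],
    na_values="?",
)

print(demo.dtypes)
print(demo.isna().sum())
```

With this option the numeric columns arrive already typed as floats, so a later cast from string to float would not be needed.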


In [7]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  164 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       203 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non

According to the output above, the dataset has 205 rows, with missing values in the following columns:

- "normalized-losses": 41 missing values
- "num-of-doors": 2 missing values
- "bore": 4 missing values
- "stroke": 4 missing values
- "horsepower": 2 missing values
- "peak-rpm": 2 missing values
- "price": 4 missing values

We can also see that some columns have been assigned the wrong data type.
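
The per-column counts can also be computed directly instead of being read off `info()`. This is a minimal sketch on a toy frame with illustrative values; `toy` and `missing` are names introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `cars` after "?" has been replaced with NaN;
# the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "num-of-doors": ["two", np.nan, "four", "four"],
    "bore": ["3.47", "2.68", np.nan, np.nan],
    "curb-weight": [2548, 2823, 2337, 2824],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing = toy.isna().sum()

# Keep only the columns that actually have gaps.
print(missing[missing > 0])

# Numeric-looking columns stored as strings report a non-numeric dtype
# (shown as "object" in this notebook's output).
print(toy.dtypes)
```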

### Replacing or dropping missing values
First, we can get a description of the data to see how it behaves.

In [8]:
cars.describe(include="all")

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,205.0,164.0,205,205,205,203,205,205,205,205.0,...,205.0,205,201.0,201.0,205.0,203.0,203.0,205.0,205.0,201.0
unique,,51.0,22,2,2,2,5,3,2,,...,,8,38.0,36.0,,59.0,23.0,,,186.0
top,,161.0,toyota,gas,std,four,sedan,fwd,front,,...,,mpfi,3.62,3.4,,68.0,5500.0,,,8921.0
freq,,11.0,32,185,168,114,96,120,202,,...,,94,23.0,20.0,,19.0,37.0,,,2.0
mean,0.834146,,,,,,,,,98.756585,...,126.907317,,,,10.142537,,,25.219512,30.75122,
std,1.245307,,,,,,,,,6.021776,...,41.642693,,,,3.97204,,,6.542142,6.886443,
min,-2.0,,,,,,,,,86.6,...,61.0,,,,7.0,,,13.0,16.0,
25%,0.0,,,,,,,,,94.5,...,97.0,,,,8.6,,,19.0,25.0,
50%,1.0,,,,,,,,,97.0,...,120.0,,,,9.0,,,24.0,30.0,
75%,2.0,,,,,,,,,102.4,...,141.0,,,,9.4,,,30.0,34.0,


We can then decide the following:

**Replace with the mean:**
- "normalized-losses"
- "bore"
- "stroke"
- "horsepower"
- "peak-rpm"

**Replace with the most frequent value:**
- "num-of-doors": replace with "four", since it is the most frequent value

**Drop rows:**
- "price": rows with missing values are dropped, since this is the variable we want to predict.

**Note:** These columns are stored with the wrong type (object), so they must be cast to "float" before the mean is computed.

In [10]:
# drop the rows with a missing price
cars.dropna(subset=["price"], inplace=True)

# replace with the mean (assign the result back instead of using a chained inplace call)
avr_norm_loss = cars["normalized-losses"].astype("float64").mean()
cars["normalized-losses"] = cars["normalized-losses"].replace(np.nan, avr_norm_loss)

avr_bore = cars["bore"].astype("float64").mean()
cars["bore"] = cars["bore"].replace(np.nan, avr_bore)

avr_stroke = cars["stroke"].astype("float64").mean()
cars["stroke"] = cars["stroke"].replace(np.nan, avr_stroke)

avr_horsepower = cars["horsepower"].astype("float64").mean()
cars["horsepower"] = cars["horsepower"].replace(np.nan, avr_horsepower)

avr_rpm = cars["peak-rpm"].astype("float64").mean()
cars["peak-rpm"] = cars["peak-rpm"].replace(np.nan, avr_rpm)

# replace with the most frequent value
freq_num_doors = cars['num-of-doors'].value_counts().idxmax()
cars["num-of-doors"] = cars["num-of-doors"].replace(np.nan, freq_num_doors)
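
The same imputation can be written more compactly as a loop. The sketch below runs on a toy frame with illustrative values and, unlike the cell above, converts each column to a numeric type during imputation using `pd.to_numeric`. The names `toy`, `numeric`, and `most_common` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `cars`; values are illustrative only.
toy = pd.DataFrame({
    "bore": ["3.0", np.nan, "4.0"],
    "horsepower": ["100", "120", np.nan],
    "num-of-doors": ["four", np.nan, "four"],
})

# Mean imputation: convert to numbers first, then fill the gaps.
for col in ["bore", "horsepower"]:
    numeric = pd.to_numeric(toy[col])  # strings -> float64, NaN preserved
    toy[col] = numeric.fillna(numeric.mean())

# Mode imputation: fill with the most frequent category.
most_common = toy["num-of-doors"].value_counts().idxmax()
toy["num-of-doors"] = toy["num-of-doors"].fillna(most_common)

print(toy)
```

Assigning the result back to the column, rather than calling an in-place method on `toy[col]`, avoids depending on whether the column selection is a view or a copy.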


In [11]:
# reset the index because rows were dropped
cars.reset_index(drop=True, inplace=True)

### Correcting the data format:
As seen earlier, some columns have the wrong data type and must be corrected:
- "normalized-losses": object -> int64
- "bore": object -> float64
- "stroke": object -> float64
- "horsepower": object -> int64
- "peak-rpm": object -> int64
- "price": object -> int64


In [12]:
cars["normalized-losses"] = cars["normalized-losses"].astype("int64")
cars["bore"] = cars["bore"].astype("float64")
cars["stroke"] = cars["stroke"].astype("float64")
cars["horsepower"] = cars["horsepower"].astype("int64")
cars["peak-rpm"] = cars["peak-rpm"].astype("int64")
cars["price"] = cars["price"].astype("int64")
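
One detail to keep in mind is that casting to "int64" discards the fractional part of a value, which matters when a gap has been filled with a non-integer mean. The sketch below uses made-up numbers to compare plain casting with rounding first. It is only an illustration of the difference, not a change to the notebook's method.

```python
import pandas as pd

# Illustrative values: two observed readings plus a fractional mean
# of the kind produced by mean imputation.
hp = pd.Series([111.0, 154.0, 103.6])

truncated = hp.astype("int64")        # drops the fractional part: 103.6 -> 103
rounded = hp.round().astype("int64")  # rounds to nearest first: 103.6 -> 104

print(truncated.tolist())
print(rounded.tolist())
```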

We verify that the types were assigned correctly and that the null values were replaced:

In [53]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          201 non-null    int64  
 1   normalized-losses  201 non-null    int64  
 2   make               201 non-null    object 
 3   fuel-type          201 non-null    object 
 4   aspiration         201 non-null    object 
 5   num-of-doors       201 non-null    object 
 6   body-style         201 non-null    object 
 7   drive-wheels       201 non-null    object 
 8   engine-location    201 non-null    object 
 9   wheel-base         201 non-null    float64
 10  length             201 non-null    float64
 11  width              201 non-null    float64
 12  height             201 non-null    float64
 13  curb-weight        201 non-null    int64  
 14  engine-type        201 non-null    object 
 15  num-of-cylinders   201 non-null    object 
 16  engine-size        201 non

We now have a total of 201 rows with no null values.
We examine the behavior of the data again:

In [13]:
cars.describe(include="all")

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,201.0,201.0,201,201,201,201,201,201,201,201.0,...,201.0,201,201.0,201.0,201.0,201.0,201.0,201.0,201.0,201.0
unique,,,22,2,2,2,5,3,2,,...,,8,,,,,,,,
top,,,toyota,gas,std,four,sedan,fwd,front,,...,,mpfi,,,,,,,,
freq,,,32,181,165,115,94,118,198,,...,,92,,,,,,,,
mean,0.840796,122.0,,,,,,,,98.797015,...,126.875622,,3.330711,3.256904,10.164279,103.393035,5117.58209,25.179104,30.686567,13207.129353
std,1.254802,31.99625,,,,,,,,6.066366,...,41.546834,,0.268072,0.316048,4.004965,37.365623,478.113182,6.42322,6.81515,7947.066342
min,-2.0,65.0,,,,,,,,86.6,...,61.0,,2.54,2.07,7.0,48.0,4150.0,13.0,16.0,5118.0
25%,0.0,101.0,,,,,,,,94.5,...,98.0,,3.15,3.11,8.6,70.0,4800.0,19.0,25.0,7775.0
50%,1.0,122.0,,,,,,,,97.0,...,120.0,,3.31,3.29,9.0,95.0,5117.0,24.0,30.0,10295.0
75%,2.0,137.0,,,,,,,,102.4,...,141.0,,3.58,3.41,9.4,116.0,5500.0,30.0,34.0,16500.0


Finally, we have a clean dataset with no missing values and with correct data types.
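
A programmatic check can complement the visual inspection of `info()`. The helper below, `check_clean`, is a hypothetical function written for this sketch and is not part of pandas. It is shown on a toy frame with illustrative values.

```python
import pandas as pd

def check_clean(df, expected_dtypes):
    """Return a list of problems found; an empty list means the frame passed."""
    problems = []
    # Any remaining NaN counts as a failure.
    for col, n in df.isna().sum().items():
        if n > 0:
            problems.append(f"{col}: {n} missing")
    # Compare the actual dtypes with the expected ones.
    for col, expected in expected_dtypes.items():
        actual = str(df[col].dtype)
        if actual != expected:
            problems.append(f"{col}: dtype {actual}, expected {expected}")
    return problems

# Toy frame with illustrative values, standing in for the cleaned `cars`.
toy = pd.DataFrame({"bore": [3.47, 2.68], "price": [13495, 16500]})
toy = toy.astype({"bore": "float64", "price": "int64"})

print(check_clean(toy, {"bore": "float64", "price": "int64"}))
```

For the real frame, the expected types would be the ones listed in the data format section, for example "float64" for "bore" and "int64" for "price".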