# Task for Today  

***

## Used Car Price Prediction  

Given *data about used cars*, let's try to predict the **price** of a given car.  
  
We will use linear regression and gradient boosting (LightGBM) to make our predictions.

# Getting Started

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
import lightgbm as lgb

from sklearn.metrics import mean_squared_error

In [2]:
data = pd.read_csv('../input/craigslist-carstrucks-data/vehicles.csv')

In [3]:
data

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,condition,cylinders,...,drive,size,type,paint_color,image_url,description,county,state,lat,long
0,7184791621,https://duluth.craigslist.org/ctd/d/duluth-200...,duluth / superior,https://duluth.craigslist.org,6995,2000.0,gmc,new sierra 1500,excellent,8 cylinders,...,4wd,,,red,https://images.craigslist.org/00n0n_f06ykBMcdh...,2000 *** GMC New Sierra 1500 Ext Cab 157.5 WB...,,mn,46.8433,-92.2550
1,7184773187,https://duluth.craigslist.org/cto/d/saginaw-20...,duluth / superior,https://duluth.craigslist.org,8750,2013.0,hyundai,sonata,excellent,4 cylinders,...,fwd,,,grey,https://images.craigslist.org/00d0d_kgZ6xoeRw2...,For Sale: 2013 Hyundai Sonata GLS - $8750. O...,,mn,46.9074,-92.4638
2,7193375964,https://newhaven.craigslist.org/cto/d/stratfor...,new haven,https://newhaven.craigslist.org,10900,2013.0,toyota,prius,good,4 cylinders,...,fwd,,,blue,https://images.craigslist.org/00d0d_3sHGxPbY2O...,2013 Prius V Model Two. One owner—must sell my...,,ct,41.1770,-73.1336
3,7195108810,https://albuquerque.craigslist.org/cto/d/albuq...,albuquerque,https://albuquerque.craigslist.org,12500,2003.0,mitsubishi,lancer,good,4 cylinders,...,4wd,mid-size,sedan,grey,https://images.craigslist.org/00m0m_4a8Pb6JbMG...,"2003 Mitsubishi Lancer Evolution, silver. Abo...",,nm,35.1868,-106.6650
4,7184712241,https://duluth.craigslist.org/ctd/d/rush-city-...,duluth / superior,https://duluth.craigslist.org,16995,2007.0,gmc,sierra classic 2500hd,good,8 cylinders,...,4wd,full-size,truck,white,https://images.craigslist.org/01414_g093aPtSMW...,"**Bad Credit, No Credit... No Problem!**2007 G...",,mn,45.6836,-92.9648
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
423852,7184919761,https://duluth.craigslist.org/cto/d/duluth-200...,duluth / superior,https://duluth.craigslist.org,1600,2006.0,hyundai,sonata,fair,6 cylinders,...,fwd,,sedan,blue,https://images.craigslist.org/00E0E_8o5RKLUz3o...,Motor runs and drives good. Transmission shift...,,mn,46.8348,-92.0742
423853,7184844576,https://duluth.craigslist.org/cto/d/duluth-200...,duluth / superior,https://duluth.craigslist.org,9000,2003.0,toyota,sequoia limited,excellent,8 cylinders,...,4wd,full-size,SUV,green,https://images.craigslist.org/00G0G_BT0Ha3X736...,"2 owner 0 rust not from here... Leather ,roof ...",,mn,46.9369,-91.9325
423854,7184805809,https://duluth.craigslist.org/cto/d/duluth-94-...,duluth / superior,https://duluth.craigslist.org,700,1994.0,ford,f-150,fair,6 cylinders,...,rwd,,,green,https://images.craigslist.org/00L0L_2MgECwYWhp...,I'm selling this beautiful old pickup that I j...,,mn,46.7715,-92.1279
423855,7184791927,https://duluth.craigslist.org/ctd/d/duluth-199...,duluth / superior,https://duluth.craigslist.org,3800,1999.0,lincoln,town car,excellent,8 cylinders,...,rwd,,sedan,,https://images.craigslist.org/00q0q_6msyGUIqK3...,1999 *** Lincoln Town Car 4dr Sdn Signature Se...,,mn,46.8433,-92.2550


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423857 entries, 0 to 423856
Data columns (total 25 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            423857 non-null  int64  
 1   url           423857 non-null  object 
 2   region        423857 non-null  object 
 3   region_url    423857 non-null  object 
 4   price         423857 non-null  int64  
 5   year          328743 non-null  float64
 6   manufacturer  313242 non-null  object 
 7   model         325384 non-null  object 
 8   condition     176719 non-null  object 
 9   cylinders     197679 non-null  object 
 10  fuel          327214 non-null  object 
 11  odometer      270585 non-null  float64
 12  title_status  327759 non-null  object 
 13  transmission  328065 non-null  object 
 14  vin           184420 non-null  object 
 15  drive         231119 non-null  object 
 16  size          102627 non-null  object 
 17  type          241157 non-null  object 
 18  pain

# Preprocessing

In [5]:
data.isna().sum()

id                   0
url                  0
region               0
region_url           0
price                0
year             95114
manufacturer    110615
model            98473
condition       247138
cylinders       226178
fuel             96643
odometer        153272
title_status     96098
transmission     95792
vin             239437
drive           192738
size            321230
type            182700
paint_color     201654
image_url        94196
description      94203
county          423857
state                0
lat              99453
long             99453
dtype: int64

In [6]:
# Collect the columns where more than 25% of the values are missing
null_columns = data.columns[data.isna().mean() > 0.25]

data = data.drop(null_columns, axis=1)
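
To see how the 25% threshold behaves, here is a minimal sketch on a made-up toy frame (the names `toy`, `missing_fraction`, and `toy_reduced` are illustrative and not part of the notebook). In the real data, `manufacturer` is missing in 110615 of 423857 rows (about 26%) and gets dropped, while `model` is missing in 98473 rows (about 23%) and stays.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): 'b' is 50% missing, 'c' is exactly 25% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 3.0, 4.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# isna().mean() gives the fraction of missing values in each column.
missing_fraction = toy.isna().mean()

# The comparison is strict, so a column at exactly 0.25 is kept.
toy_null_columns = toy.columns[missing_fraction > 0.25]
toy_reduced = toy.drop(toy_null_columns, axis=1)

print(missing_fraction)
print(list(toy_reduced.columns))
```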

In [7]:
data

Unnamed: 0,id,url,region,region_url,price,year,model,fuel,title_status,transmission,image_url,description,state,lat,long
0,7184791621,https://duluth.craigslist.org/ctd/d/duluth-200...,duluth / superior,https://duluth.craigslist.org,6995,2000.0,new sierra 1500,gas,clean,automatic,https://images.craigslist.org/00n0n_f06ykBMcdh...,2000 *** GMC New Sierra 1500 Ext Cab 157.5 WB...,mn,46.8433,-92.2550
1,7184773187,https://duluth.craigslist.org/cto/d/saginaw-20...,duluth / superior,https://duluth.craigslist.org,8750,2013.0,sonata,gas,clean,automatic,https://images.craigslist.org/00d0d_kgZ6xoeRw2...,For Sale: 2013 Hyundai Sonata GLS - $8750. O...,mn,46.9074,-92.4638
2,7193375964,https://newhaven.craigslist.org/cto/d/stratfor...,new haven,https://newhaven.craigslist.org,10900,2013.0,prius,hybrid,clean,automatic,https://images.craigslist.org/00d0d_3sHGxPbY2O...,2013 Prius V Model Two. One owner—must sell my...,ct,41.1770,-73.1336
3,7195108810,https://albuquerque.craigslist.org/cto/d/albuq...,albuquerque,https://albuquerque.craigslist.org,12500,2003.0,lancer,gas,clean,manual,https://images.craigslist.org/00m0m_4a8Pb6JbMG...,"2003 Mitsubishi Lancer Evolution, silver. Abo...",nm,35.1868,-106.6650
4,7184712241,https://duluth.craigslist.org/ctd/d/rush-city-...,duluth / superior,https://duluth.craigslist.org,16995,2007.0,sierra classic 2500hd,diesel,clean,automatic,https://images.craigslist.org/01414_g093aPtSMW...,"**Bad Credit, No Credit... No Problem!**2007 G...",mn,45.6836,-92.9648
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
423852,7184919761,https://duluth.craigslist.org/cto/d/duluth-200...,duluth / superior,https://duluth.craigslist.org,1600,2006.0,sonata,gas,clean,automatic,https://images.craigslist.org/00E0E_8o5RKLUz3o...,Motor runs and drives good. Transmission shift...,mn,46.8348,-92.0742
423853,7184844576,https://duluth.craigslist.org/cto/d/duluth-200...,duluth / superior,https://duluth.craigslist.org,9000,2003.0,sequoia limited,gas,clean,automatic,https://images.craigslist.org/00G0G_BT0Ha3X736...,"2 owner 0 rust not from here... Leather ,roof ...",mn,46.9369,-91.9325
423854,7184805809,https://duluth.craigslist.org/cto/d/duluth-94-...,duluth / superior,https://duluth.craigslist.org,700,1994.0,f-150,gas,clean,manual,https://images.craigslist.org/00L0L_2MgECwYWhp...,I'm selling this beautiful old pickup that I j...,mn,46.7715,-92.1279
423855,7184791927,https://duluth.craigslist.org/ctd/d/duluth-199...,duluth / superior,https://duluth.craigslist.org,3800,1999.0,town car,gas,clean,automatic,https://images.craigslist.org/00q0q_6msyGUIqK3...,1999 *** Lincoln Town Car 4dr Sdn Signature Se...,mn,46.8433,-92.2550


In [8]:
# Identifiers, links, and free text that we won't use as features
unneeded_columns = ['id', 'url', 'region_url', 'image_url', 'description']

data = data.drop(unneeded_columns, axis=1)

In [9]:
data

Unnamed: 0,region,price,year,model,fuel,title_status,transmission,state,lat,long
0,duluth / superior,6995,2000.0,new sierra 1500,gas,clean,automatic,mn,46.8433,-92.2550
1,duluth / superior,8750,2013.0,sonata,gas,clean,automatic,mn,46.9074,-92.4638
2,new haven,10900,2013.0,prius,hybrid,clean,automatic,ct,41.1770,-73.1336
3,albuquerque,12500,2003.0,lancer,gas,clean,manual,nm,35.1868,-106.6650
4,duluth / superior,16995,2007.0,sierra classic 2500hd,diesel,clean,automatic,mn,45.6836,-92.9648
...,...,...,...,...,...,...,...,...,...,...
423852,duluth / superior,1600,2006.0,sonata,gas,clean,automatic,mn,46.8348,-92.0742
423853,duluth / superior,9000,2003.0,sequoia limited,gas,clean,automatic,mn,46.9369,-91.9325
423854,duluth / superior,700,1994.0,f-150,gas,clean,manual,mn,46.7715,-92.1279
423855,duluth / superior,3800,1999.0,town car,gas,clean,automatic,mn,46.8433,-92.2550


In [10]:
# Number of unique values (NaN counts as one) in each text column
{column: len(data[column].unique()) for column in data.columns if data.dtypes[column] == 'object'}

{'region': 404,
 'model': 27043,
 'fuel': 6,
 'title_status': 7,
 'transmission': 4,
 'state': 51}

In [11]:
# model has 27043 unique values, which is too many to one-hot encode here
data = data.drop('model', axis=1)

In [12]:
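def onehot_encode(df, columns, prefixes):
    # Replace each categorical column with prefixed 0/1 indicator columns
    df = df.copy()
    for column, prefix in zip(columns, prefixes):
        # dtype=int keeps the indicators as 0/1 (newer pandas defaults to booleans)
        dummies = pd.get_dummies(df[column], prefix=prefix, dtype=int)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    return df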

In [13]:
data = onehot_encode(
    data,
    ['region', 'fuel', 'title_status', 'transmission', 'state'],
    ['reg', 'fuel', 'title', 'trans', 'state']
)
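
Let's check what the encoder does on a tiny made-up frame. The function is repeated so the snippet runs on its own, with `dtype=int` passed so the indicators print as 0/1 on any recent pandas version; the `toy` data is an assumption for illustration only. Note the third row: a missing category produces no `NaN`, just zeros in every indicator for that column.

```python
import numpy as np
import pandas as pd

def onehot_encode(df, columns, prefixes):
    # Replace each categorical column with prefixed 0/1 indicator columns
    df = df.copy()
    for column, prefix in zip(columns, prefixes):
        dummies = pd.get_dummies(df[column], prefix=prefix, dtype=int)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    return df

# Toy listings (made-up): the last row has no fuel value.
toy = pd.DataFrame({
    "price": [6995, 8750, 10900],
    "fuel": ["gas", "hybrid", np.nan],
    "state": ["mn", "mn", "ct"],
})

encoded = onehot_encode(toy, ["fuel", "state"], ["fuel", "state"])
print(encoded)
```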

In [14]:
data

Unnamed: 0,price,year,lat,long,reg_SF bay area,reg_abilene,reg_akron / canton,reg_albany,reg_albuquerque,reg_altoona-johnstown,...,state_sd,state_tn,state_tx,state_ut,state_va,state_vt,state_wa,state_wi,state_wv,state_wy
0,6995,2000.0,46.8433,-92.2550,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,8750,2013.0,46.9074,-92.4638,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,10900,2013.0,41.1770,-73.1336,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,12500,2003.0,35.1868,-106.6650,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,16995,2007.0,45.6836,-92.9648,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
423852,1600,2006.0,46.8348,-92.0742,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
423853,9000,2003.0,46.9369,-91.9325,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
423854,700,1994.0,46.7715,-92.1279,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
423855,3800,1999.0,46.8433,-92.2550,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
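# Only year, lat, and long still contain NaNs here (missing categories became
# all-zero indicator columns), so this fills those three with their column means
for column in data.columns:
    data[column] = data[column].fillna(data[column].mean())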

In [16]:
data.isna().sum().sum()

0
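
As a quick check of the mean fill, here is the same loop on a made-up two-column frame (`toy` is illustrative only). `mean()` skips `NaN` by default, so each gap is filled with the average of the values that are present.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with one gap per column.
toy = pd.DataFrame({
    "year": [2000.0, np.nan, 2010.0],
    "lat": [46.0, 41.0, np.nan],
})

# Same pattern as the notebook cell: fill each column with its own mean.
for column in toy.columns:
    toy[column] = toy[column].fillna(toy[column].mean())

print(toy)
```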

# Splitting and Scaling

In [17]:
y = data.loc[:, 'price']
X = data.drop('price', axis=1)

In [18]:
scaler = StandardScaler()

X = scaler.fit_transform(X)

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=34)
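
One thing to be aware of: the scaler above is fit on all of `X` before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the `*_demo` names and the random data are assumptions for illustration, and this is not what the cells above ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the car data).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.normal(size=100)

# Split first...
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=34
)

# ...then learn the scaling statistics from the training rows only
# and reuse them, unchanged, on the test rows.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0))  # close to 0 by construction
print(X_te_scaled.mean(axis=0))  # near 0, but not exactly
```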

# Training

In [20]:
lin_model = LinearRegression()

lin_model.fit(X_train, y_train)

lin_y_preds = lin_model.predict(X_test)

In [21]:
lgb_model = lgb.LGBMRegressor(
    boosting_type='gbdt',
    num_leaves=31,
    n_estimators=100,
    reg_lambda=1.0
)

lgb_model.fit(X_train, y_train)

lgb_y_preds = lgb_model.predict(X_test)

In [22]:
# RMSE: square root of the mean squared error, in the same units as price
lin_loss = np.sqrt(mean_squared_error(y_test, lin_y_preds))
lgb_loss = np.sqrt(mean_squared_error(y_test, lgb_y_preds))
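
To make the metric concrete, here is the same RMSE calculation on three made-up values (`y_true_demo` and `y_pred_demo` are illustrative). The errors are 1, 0, and -2, so the mean squared error is 5/3 and the RMSE is its square root. Because the errors are squared, a single very large miss can dominate the result.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions: errors are 1, 0, and -2.
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

mse_demo = mean_squared_error(y_true_demo, y_pred_demo)  # (1 + 0 + 4) / 3
rmse_demo = np.sqrt(mse_demo)

print(mse_demo, rmse_demo)
```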

In [23]:
print("Linear Regression RMSE:", lin_loss)
print("Gradient Boosted RMSE:", lgb_loss)

Linear Regression RMSE: 1.860799712066682e+19
Gradient Boosted RMSE: 4160061.000556011


In [24]:
print("Linear Regression R^2 Score:", lin_model.score(X_test, y_test))
print("Gradient Boosted R^2 Score:", lgb_model.score(X_test, y_test))

Linear Regression R^2 Score: -2.8471229700375163e+25
Gradient Boosted R^2 Score: -0.4230047920159523
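
Both R^2 scores are negative. Since R^2 = 1 - SS_res / SS_tot, a negative score means the model's squared error on the test set is larger than that of always predicting the mean test price. The small sketch below shows this on made-up numbers (the `*_demo` names are illustrative). The RMSE in the millions suggests that a few extreme `price` values may be dominating the error, though this notebook does not check that.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true_demo = np.array([1.0, 2.0, 3.0])

# Always predicting the mean gives SS_res == SS_tot, so R^2 is 0.
mean_preds_demo = np.full(3, y_true_demo.mean())
r2_mean = r2_score(y_true_demo, mean_preds_demo)

# Predictions worse than the mean: SS_res = 8, SS_tot = 2, so R^2 = 1 - 8/2.
bad_preds_demo = np.array([3.0, 2.0, 1.0])
r2_bad = r2_score(y_true_demo, bad_preds_demo)

print(r2_mean, r2_bad)
```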
