# Task for Today  

***

## Used Car Price Prediction  

Given *data about used cars*, let's try to predict the **price** of a given car.  
  
We will use linear regression and gradient boosting (LightGBM) to make our predictions.

# Getting Started

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression
import lightgbm as lgb

from sklearn.metrics import mean_squared_error

In [1]:
data = pd.read_csv('../input/craigslist-carstrucks-data/vehicles.csv')

In [1]:
data

In [1]:
data.info()

# Preprocessing

In [1]:
data.isna().sum()

In [1]:
# Columns where more than 25% of the values are missing
null_columns = data.columns[data.isna().mean() > 0.25]

data = data.drop(null_columns, axis=1)
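
The threshold above works because `isna()` returns a boolean frame, and the column-wise mean of booleans is the fraction of missing values. A minimal sketch on a toy frame (the column names and values here are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'b' is 50% missing, 'c' is exactly 25% missing
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [np.nan, np.nan, 1.0, 2.0],
    'c': [np.nan, 5.0, 6.0, 7.0],
})

# Fraction of missing values per column
missing_fraction = toy.isna().mean()

# The comparison is strict, so a column at exactly 25% missing is kept
toy_null_columns = toy.columns[missing_fraction > 0.25]
toy_reduced = toy.drop(toy_null_columns, axis=1)

print(missing_fraction)
print(toy_reduced.columns.tolist())
```

Only `b` is dropped here; `c` sits exactly on the threshold and survives.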

In [1]:
data

In [1]:
unneeded_columns = ['id', 'url', 'region_url', 'image_url', 'description']

data = data.drop(unneeded_columns, axis=1)

In [1]:
data

In [1]:
# Number of distinct values (NaN counts as one) in each object-dtype column
{column: len(data[column].unique()) for column in data.columns if data.dtypes[column] == 'object'}

In [1]:
data = data.drop('model', axis=1)

In [1]:
def onehot_encode(df, columns, prefixes):
    df = df.copy()
    for column, prefix in zip(columns, prefixes):
        dummies = pd.get_dummies(df[column], prefix=prefix)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(column, axis=1)
    return df
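
To see what one pass of the loop in `onehot_encode` does, here is a small sketch on a toy frame. The values are hypothetical; only the `pd.get_dummies`, `pd.concat`, and `drop` steps mirror the function above.

```python
import pandas as pd

# Hypothetical toy data, including one missing category
toy = pd.DataFrame({
    'fuel': ['gas', 'diesel', 'gas', None],
    'price': [5000, 12000, 7500, 3000],
})

# Build one indicator column per category, attach them, then drop the original
dummies = pd.get_dummies(toy['fuel'], prefix='fuel')
encoded = pd.concat([toy, dummies], axis=1).drop('fuel', axis=1)

print(encoded.columns.tolist())

# With the default dummy_na=False, a missing value gets no indicator set
indicator_sums = encoded[['fuel_diesel', 'fuel_gas']].astype(int).sum(axis=1)
print(indicator_sums.tolist())
```

Note that a row whose category is missing ends up with all of its indicator columns unset, so it will not show up as a missing value after encoding.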

In [1]:
data = onehot_encode(
    data,
    ['region', 'fuel', 'title_status', 'transmission', 'state'],
    ['reg', 'fuel', 'title', 'trans', 'state']
)

In [1]:
data

In [1]:
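# Fill remaining missing values with the column mean (only columns that have any)
for column in data.columns[data.isna().any()]:
    data[column] = data[column].fillna(data[column].mean())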

In [1]:
data.isna().sum().sum()
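
As a quick illustration of the mean imputation used above, the sketch below fills a single gap in a toy column. The column name `odometer` and its values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value
toy = pd.DataFrame({'odometer': [10000.0, np.nan, 30000.0]})

# mean() skips NaN by default, so the fill value is (10000 + 30000) / 2
fill_value = toy['odometer'].mean()
toy['odometer'] = toy['odometer'].fillna(fill_value)

print(toy['odometer'].tolist())  # [10000.0, 20000.0, 30000.0]
```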

# Splitting and Scaling

In [1]:
y = data.loc[:, 'price']
X = data.drop('price', axis=1)

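In [1]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=34)

In [1]:
# Fit the scaler on the training split only, so test-set statistics do not leak into training
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)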
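
For reference, `StandardScaler` standardizes each column to zero mean and unit variance. A minimal sketch on a made-up array, checking the result against the same computation done by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix with two columns on very different scales
toy_X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])

scaled = StandardScaler().fit_transform(toy_X)

# By hand: subtract the column mean, divide by the population std (ddof=0)
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

Putting the columns on a common scale mainly matters for comparing linear regression coefficients; tree-based models such as LightGBM are largely insensitive to it.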

# Training

In [1]:
lin_model = LinearRegression()

lin_model.fit(X_train, y_train)

lin_y_preds = lin_model.predict(X_test)

In [1]:
lgb_model = lgb.LGBMRegressor(
    boosting_type='gbdt',
    num_leaves=31,
    n_estimators=100,
    reg_lambda=1.0
)

lgb_model.fit(X_train, y_train)

lgb_y_preds = lgb_model.predict(X_test)

In [1]:
lin_loss = np.sqrt(mean_squared_error(y_test, lin_y_preds))
lgb_loss = np.sqrt(mean_squared_error(y_test, lgb_y_preds))

In [1]:
print("Linear Regression RMSE:", lin_loss)
print("Gradient Boosted RMSE:", lgb_loss)

In [1]:
print("Linear Regression R^2 Score:", lin_model.score(X_test, y_test))
print("Gradient Boosted R^2 Score:", lgb_model.score(X_test, y_test))
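
Both metrics can be worked out by hand, which helps when reading the numbers above. The sketch below uses made-up targets and predictions: RMSE is the square root of the mean squared error, and the R^2 returned by `.score()` on a scikit-learn style regressor is 1 minus the ratio of residual to total sum of squares.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# RMSE via scikit-learn and by hand
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_pred) ** 2)           # 1 + 0 + 1 + 0 = 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 9 + 1 + 1 + 9 = 20
r2_manual = 1 - ss_res / ss_tot                   # 0.9

print("RMSE:", rmse, rmse_manual)
print("R^2:", r2_score(y_true, y_pred), r2_manual)
```

RMSE is in the same units as the target (here, price), while R^2 is unitless, so the two are best read together.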

# Data Every Day  

This notebook is featured on Data Every Day, a YouTube series where I train models on a new dataset each day.  

***

Check it out!  
https://youtu.be/bFKuw3JlvCI