# Task 1: Identify Features

Assemble a dataset consisting of features and target (for example in a dataframe or in two arrays X and y). What features are relevant for the prediction task?
Are there any features that should be excluded because they leak the target information? Show visualizations or statistics to support your selection.
You are not required to use the description column, but you can try to come up with relevant features using it. Please don’t use bag-of-words approaches for now, as we’ll discuss these later in the class.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
df_orig = pd.read_csv('/kaggle/input/craigslist-carstrucks-data/vehicles.csv')

In [None]:
df_orig

In [None]:
# Get subsample of data
df = df_orig.sample(frac=0.5,axis=0)

In [None]:
df.columns

In [None]:
# Paint color vs. price
import matplotlib.pyplot as plt
import seaborn as sb
ax = sb.scatterplot(x="paint_color", y="price", data=df_orig)
plt.xticks(rotation=90)  # rotate after plotting so the category labels are affected

In [None]:
# Latitude correlation
ax = sb.scatterplot(x="lat", y="price", data=df_orig)

In [None]:
# Longitude correlation
ax = sb.scatterplot(x="long", y="price", data=df_orig)

In [None]:
# State vs. price
ax = sb.scatterplot(x="state", y="price", data=df_orig)
plt.xticks(rotation=90)  # rotate after plotting so the state labels are affected

In [None]:
# Region correlation
ax = sb.scatterplot(x="region", y="price", data=df_orig)

We decided to cut several columns:
* id/VIN: These columns are unique identifiers, so the model may just learn to associate id/VIN with price. New values will throw off the model.
* url, image_url, region_url: The URL text is not relevant for predicting the car price.
* model: There are too many unique values, so once preprocessed the column would not carry any meaningful information.
* size: There are too many null values.
* title_status: Almost all (95%) of the cars had a 'clean' title status.
* paint_color: As seen in the plots above, there did not appear to be a meaningful correlation with price.
* region, lat, long, county, state: We dropped all location data. As seen in the plots above, there is essentially no correlation between location and price.
* description: This column leaks the target, because the price is often included in the description.
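
The reasons in the list above (null share, cardinality, one dominant value, price appearing in the description) can each be backed by a small statistic. The following is a minimal sketch on a made-up toy table, not on the Craigslist data; the helper names `screen_columns` and `price_in_description` are hypothetical, and the substring test for leakage is deliberately crude (it would also match a price that happens to equal a model year, for example).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (all values are invented for illustration).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "size": [np.nan, np.nan, "mid-size", np.nan, np.nan, np.nan],
    "title_status": ["clean", "clean", "clean", "clean", "clean", "salvage"],
    "description": ["Great car, asking $4,500", "runs well", "8000 obo",
                    "clean truck", "must sell", "2000 firm"],
    "price": [4500, 12000, 8000, 15000, 3000, 2000],
})

def screen_columns(frame):
    """Per-column null fraction, number of unique values, and share of the most common value."""
    rows = []
    for col in frame.columns:
        counts = frame[col].value_counts(dropna=True)
        rows.append({
            "column": col,
            "null_frac": frame[col].isna().mean(),
            "n_unique": frame[col].nunique(dropna=True),
            "top_share": counts.iloc[0] / len(frame) if len(counts) else 0.0,
        })
    return pd.DataFrame(rows).set_index("column")

def price_in_description(frame):
    """Fraction of rows whose description contains the listing price as digits."""
    hits = [
        str(price) in str(text).replace(",", "")
        for price, text in zip(frame["price"], frame["description"])
    ]
    return sum(hits) / len(hits)

summary = screen_columns(toy)
leak_share = price_in_description(toy)
print(summary)
print("Share of descriptions that contain the price:", leak_share)
```

On the real data, the same calls would be made on `df_orig` instead of `toy`; a high `null_frac` supports dropping a column like size, a high `top_share` supports dropping one like title_status, and `n_unique` close to the row count flags identifier-like columns.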

In [None]:

df = df.drop(columns = ['id',
                           'url', 
                           'region', 
                           'region_url',
                           'title_status', 
                           'size', 
                           'description', 
                           'vin', 
                           'lat', 
                           'long', 
                           'image_url',
                           'county',
                           'state',
                           'model',
                           'paint_color'])
df

# Task 2: Preprocessing and Baseline Model

Create a simple minimum viable model by doing an initial selection of features, doing appropriate preprocessing and cross-validating a linear model. Feel free to exclude features or do simplified preprocessing for this task. As mentioned before, you don’t need to validate the model on the whole dataset.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

In [None]:
# Drop where nan is present
df = df.dropna()

In [None]:
# Cut price outliers
df = df[df['price'] > 1000]
df = df[df['price'] < 40000]

In [None]:
# Feature names
continuous_features = ['year', 'odometer']
categorical_features = ['drive', 'type', 'fuel', 'transmission', 'manufacturer', 'condition', 'cylinders']

In [None]:
# X and y
y = df['price']
X = df.drop(columns=['price'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [None]:
# Minimum Preprocessing
def train_and_eval_classifier(classifier, cname):
    categorical_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('cat', categorical_transformer, categorical_features)])
    
    pipe = make_pipeline(preprocessor, classifier)
    scores = cross_val_score(pipe, X_train, y_train, scoring="r2")
    print("CV Score for {}: {}".format(cname, np.mean(scores)))

In [None]:
# Minimum viable models
# Try a couple of basic linear regressors:
train_and_eval_classifier(LinearRegression(), "LinearRegression")
train_and_eval_classifier(Lasso(max_iter=6000), "Lasso")
train_and_eval_classifier(ElasticNet(), "ElasticNet")

# Task 3: Additional Feature Engineering

In [None]:
df = df_orig.sample(frac=0.5,axis=0)
df = df.drop(columns = ['id',
                   'url', 
                   'region', 
                   'region_url',
                   'title_status', 
                   'size', 
                   'description', 
                   'vin', 
                   'lat', 
                   'long', 
                   'image_url',
                   'county',
                   'state',
                   'model',
                   'paint_color'])
df

In [None]:
# Keep rows with at least 10 non-null values (thresh=10) instead of dropping every row that has any null
# Remaining nulls are handled by imputation later
df = df.dropna(thresh=10)

In [None]:
# Cut continuous outliers on quantiles instead

pl = df.price.quantile(0.1)
pu = df.price.quantile(0.99)

ol = df.odometer.quantile(0.1)
ou = df.odometer.quantile(0.99)

yl = df.year.quantile(0.1)
yu = df.year.quantile(0.99)

df = df[df.price > pl]
df = df[df.price < pu]

df = df[df.odometer > ol]
df = df[df.odometer < ou]

df = df[df.year > yl]
df = df[df.year < yu]

In [None]:
# Feature names
continuous_features = ['year', 'odometer']
categorical_features = ['drive', 'type', 'fuel', 'transmission', 'manufacturer', 'condition', 'cylinders']

In [None]:
# Extract X and y
y = df['price']
X = df.drop(columns=['price'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [None]:
# Preprocessing
# Added Polynomial features
# Added StandardScaler
# Added imputation for continuous and categorical

def train_and_eval_classifier(classifier, cname):
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('polyfeatures', PolynomialFeatures()),
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='UNK')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, continuous_features),
            ('cat', categorical_transformer, categorical_features)])
    
    pipe = make_pipeline(preprocessor, classifier)
    scores = cross_val_score(pipe, X_train, y_train, scoring="r2")
    print("CV Score for {}: {}".format(cname, np.mean(scores)))

In [None]:
# Try a couple of regressors to see if we actually improved
train_and_eval_classifier(LinearRegression(), "LinearRegression")
train_and_eval_classifier(ElasticNet(), "ElasticNet")

After creating derived features, adding standard scaling and imputation, and performing more in-depth preprocessing and data cleaning,
our models improved by a significant amount. Our Linear Regression model increased in score from 0.35 to 0.76, and our ElasticNet improved from 0.17 to 0.61.

# Task 4: Any Model

In [None]:
import xgboost as xgb

In [None]:
# XGBoost
def train_and_eval_xgb():
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('polyfeatures', PolynomialFeatures()),
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='UNK')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, continuous_features),
            ('cat', categorical_transformer, categorical_features)])
    
    
    pipe = make_pipeline(preprocessor, xgb.XGBRegressor(objective='reg:squarederror')) 
    
    # Parameter tuning
    param_grid = {'xgbregressor__n_estimators': [100, 120, 140],
                  'xgbregressor__learning_rate': [0.01, 0.1],
                  'xgbregressor__max_depth': [5, 7]}
    
    xgb_grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, return_train_score=True, n_jobs=-1)
    xgb_grid.fit(X_train, y_train) 
    print("Best score: %0.3f" % xgb_grid.best_score_) 
    print("Best parameters set:", xgb_grid.best_params_)

In [None]:
train_and_eval_xgb()

# Task 5: Feature Selectors

In [None]:
from sklearn.feature_selection import SelectFromModel

In [None]:
def find_influential_features():
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='UNK')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, continuous_features),
            ('cat', categorical_transformer, categorical_features)])
    
    x = xgb.XGBRegressor(objective='reg:squarederror', learning_rate=0.1, max_depth=7, n_estimators=140)
    pipe = make_pipeline(preprocessor, x)
    
    scores = cross_val_score(pipe, X_train, y_train, scoring="r2")
    print("CV Score for {}: {}".format("XGBoost", np.mean(scores)))
    
    feature_sel = SelectFromModel(pipe, threshold=1e-5)
    feature_sel.fit(X_train, y_train)
    
    return feature_sel, pipe

In [None]:
# Inf features, baseline score for XGBoost
inf_features, pipe = find_influential_features()

In [None]:
# Get feature importances
important_features = inf_features.estimator_.named_steps['xgbregressor'].feature_importances_

In [None]:
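# Get categorical feature names from the one-hot encoder fitted inside the pipeline
# (get_feature_names_out requires scikit-learn >= 1.0; older versions used get_feature_names)
onehot = inf_features.estimator_.named_steps['columntransformer'].named_transformers_['cat'].named_steps['onehot']
cat_feature_names = onehot.get_feature_names_out(categorical_features)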

In [None]:
# Add continuous feature names
feature_names = np.concatenate((continuous_features, cat_feature_names))
feature_names

In [None]:
important_features

In [None]:
# Twenty most important features
top_twenty = important_features.argsort()[-20:][::-1]
tt_fv = [important_features[i] for i in top_twenty]
tt_fn = [feature_names[i] for i in top_twenty]

In [None]:
# Display 20 most important features
import matplotlib.pyplot as plt
plt.scatter(tt_fn, tt_fv)
plt.xticks(tt_fn, tt_fn, rotation='vertical')
plt.show()

We can see that 4_cylinders, diesel, fwd, year, 4wd, gas, and 8_cylinders are some of the most significant features, whereas the others can most likely be removed without a large decrease in score.

In [None]:
# Create Preprocessor
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='UNK')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, continuous_features),
        ('cat', categorical_transformer, categorical_features)])

preprocessor.fit(X_train, y_train)

We can select only the 50 most important features, and see what happens to our performance!

In [None]:
# Run preprocessing, and only select dataset of 50 top features.
top_50 = important_features.argsort()[-50:][::-1]
X_train_trans = pd.DataFrame(preprocessor.transform(X_train).toarray())
X_train_trans = X_train_trans[top_50]
X_train_trans
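
As a side sketch, the same top-k selection can also be written with `SelectFromModel` placed inside a pipeline, so that the selection is refit within each cross-validation fold. This is an illustration on synthetic data from `make_regression`, with a `RandomForestRegressor` swapped in for XGBoost to keep it dependency-light; the names `X_toy`, `y_toy`, `selector`, and `pipe_toy` are hypothetical and the numbers it prints say nothing about the car data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression problem: 20 features, of which 5 are informative.
X_toy, y_toy = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0
)

# threshold=-np.inf disables the importance threshold, so max_features alone
# decides how many features are kept (the 5 with the highest importance).
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0),
    threshold=-np.inf,
    max_features=5,
)

# Selection happens inside the pipeline, so each CV fold selects on its own training part.
pipe_toy = make_pipeline(
    selector, RandomForestRegressor(n_estimators=50, random_state=0)
)
scores_toy = cross_val_score(pipe_toy, X_toy, y_toy, scoring="r2", cv=3)
print("Mean CV R^2 with 5 selected features:", scores_toy.mean())

# Fit the selector once on all toy data to inspect how many features it keeps.
n_kept = int(selector.fit(X_toy, y_toy).get_support().sum())
print("Features kept:", n_kept)
```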

In [None]:
# Retrain and evaluate model on new dataset.
x = xgb.XGBRegressor(objective='reg:squarederror', learning_rate=0.1, max_depth=7, n_estimators=140)
np.mean(cross_val_score(x, X_train_trans, y_train, scoring="r2"))

We can see that the performance decreased very slightly, from 0.861 to 0.860. In this case, cutting the least important features (that is, keeping only the most important ones) does not improve the score, but it does help us create a smaller, more explainable model: we were able to cut 20 features with only a tiny decrease in r2 score. For a very explainable model, we can probably cut many more features with only small losses in score. This is explored next.

In [None]:
# Best score, from Task 4
best_score = 0.862

# Compute the marginal score of each column: refit without it and measure the drop
total_features = continuous_features + categorical_features
for feature in total_features:
    # Keep the continuous/categorical split, minus the column being removed
    cont_n = [f for f in continuous_features if f != feature]
    cat_n = [f for f in categorical_features if f != feature]
    total_features_n = cont_n + cat_n
    
    X_train_cut_col = X_train[total_features_n]

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, cont_n),
            ('cat', categorical_transformer, cat_n)])

    preprocessor.fit(X_train_cut_col, y_train)
    X_train_cut_col_trans = pd.DataFrame(preprocessor.transform(X_train_cut_col).toarray())

    x = xgb.XGBRegressor(objective='reg:squarederror', learning_rate=0.1, max_depth=7, n_estimators=140)
    score = np.mean(cross_val_score(x, X_train_cut_col_trans, y_train, scoring="r2"))
    
    marginal_score = best_score - score
    print("Marginal score of column {}: {}".format(feature, marginal_score))

We can see that cutting the columns year, cylinders, odometer, fuel, and manufacturer creates the largest drops in score (i.e. the largest marginal scores). Hence, we can reduce our feature set to the features derived from these columns. These will be our selected features.
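
The marginal-score idea (remove one column, re-score, compare to the score with all columns) can be sketched in isolation as follows. This is an illustration under stated assumptions: the data are synthetic, `LinearRegression` stands in for XGBoost, the helper `drop_column_importance` is hypothetical, and the baseline is computed inside the function rather than taken from a fixed `best_score`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic data: "strong" drives the target, "weak" barely matters, "noise" not at all.
X_toy = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_toy = 5.0 * X_toy["strong"] + 0.5 * X_toy["weak"] + rng.normal(scale=0.5, size=n)

def drop_column_importance(model, X, y, cv=5):
    """Return the drop in mean CV R^2 caused by removing each column, largest first."""
    baseline = cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    drops = {}
    for col in X.columns:
        reduced = X.drop(columns=[col])
        score = cross_val_score(model, reduced, y, cv=cv, scoring="r2").mean()
        drops[col] = baseline - score
    return pd.Series(drops).sort_values(ascending=False)

marginal = drop_column_importance(LinearRegression(), X_toy, y_toy)
print(marginal)
```

Computing the baseline with the same model and folds as the reduced fits keeps the differences comparable; a hard-coded baseline from a different pipeline can shift every marginal score by the same offset.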

# Task 6: Explainable Model

In [None]:
# From previous exploration, we can see that certain columns are less relevant
X_train_cut_col = X_train[["cylinders", "fuel", "year", "odometer"]]

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, ['year', 'odometer']),
        ('cat', categorical_transformer, ["cylinders", "fuel"])])

preprocessor.fit(X_train_cut_col, y_train)
X_train_cut_col_trans = pd.DataFrame(preprocessor.transform(X_train_cut_col).toarray())

x = xgb.XGBRegressor(objective='reg:squarederror', learning_rate=0.1, max_depth=7, n_estimators=140)
np.mean(cross_val_score(x, X_train_cut_col_trans, y_train, scoring="r2"))

In [None]:
minimum_model = X_train_cut_col_trans.shape[1]
print("MINIMUM IMPORTANT FEATURES/COEFFICIENTS NEEDED: {}".format(minimum_model))

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Attempt a random forest regressor whose trees are capped at minimum_model (15) leaves.
def evaluate_minimum_random_forests():
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='UNK')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, continuous_features),
            ('cat', categorical_transformer, categorical_features)])


    pipe = make_pipeline(preprocessor, RandomForestRegressor(max_leaf_nodes=minimum_model))    
    
    scores = cross_val_score(pipe, X_train, y_train, scoring="r2")
    print("CV Score for {}: {}".format("Minimum Random Forests Regressor", np.mean(scores)))

In [None]:
evaluate_minimum_random_forests()

From the calculations above, we found that our performance really depends on about 15 very important features. We went from a model with 78 features to a random forest whose trees are limited to 15 leaves, which is a large gain in explainability, while still maintaining a high score of 0.69. Our best models operate at ~0.86, but these are more sophisticated models (XGBoost). Compared to some of the simpler regression models used earlier, we are able to achieve similar scores with very few leaves.
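
To show what a leaf-limited model looks like when read directly, here is a minimal sketch that fits a single `DecisionTreeRegressor` with `max_leaf_nodes=15` and prints its rules with `export_text`. It is not the random forest evaluated above: a forest averages many such trees, so a single tree is the more readable object. The data are synthetic and the feature names passed to `export_text` are hypothetical labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)

# Synthetic data: the target depends only on the first column.
X_toy = rng.uniform(0, 1, size=(500, 3))
y_toy = 10 * X_toy[:, 0] + rng.normal(scale=0.1, size=500)

# A single tree with at most 15 leaves, i.e. at most 15 distinct predicted values.
tree = DecisionTreeRegressor(max_leaf_nodes=15, random_state=0)
tree.fit(X_toy, y_toy)

# Hypothetical feature labels, used only to make the printed rules readable.
rules = export_text(tree, feature_names=["year_scaled", "odometer_scaled", "other"])
print(rules)
print("Number of leaves:", tree.get_n_leaves())
```

Each printed path from the root to a leaf is one if-then rule, which is what makes a tree of this size easy to explain.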