## Let's build a recommender 
I want to build a recommendation system that looks at a customer's trade-in car and suggests a make and model that they might be interested in.

First, let's get reacquainted with the data.

In [None]:
import pandas as pd

# Load the cleaned modeling data
modeling_data = pd.read_csv("../data/modeling.csv")

# Drop the first column (assumed here to be a leftover index column from the CSV export)
modeling_data.drop(modeling_data.columns[0], axis = 1, inplace = True)


In [None]:
df = modeling_data.copy()

df.head(10)

Unnamed: 0,price,appraisal_offer,region,days_since_offer,make_appraisal,model_appraisal,trim_level_premium_appraisal,mileage_appraisal,engine_appraisal,mpg_city_appraisal,...,full_size_appraisal,large_suv_appraisal,luxury_appraisal,medium_suv_appraisal,mid_size_appraisal,pickup_appraisal,small_suv_appraisal,sports_car_appraisal,van_appraisal,color_grouped_appraisal
0,24000,9000,Midwest,0,Ford,Escape,True,39300,1.6,22.0,...,False,False,False,False,False,False,True,False,False,White
1,33000,14600,West,1,Toyota,Tacoma,True,105800,3.5,19.0,...,False,False,False,False,False,True,False,False,False,Gray
2,25500,3400,Midwest,0,Chevrolet,Cruze,False,97300,1.4,28.0,...,False,False,False,False,False,False,False,False,False,White
3,18700,1100,South,0,Chevrolet,Impala,True,145600,3.9,17.0,...,True,False,False,False,False,False,False,False,False,White
4,19500,15000,West,0,GMC,Yukon,False,51600,5.3,15.0,...,False,True,False,False,False,False,False,False,False,Black
5,39700,27000,South,5,Toyota,4Runner,False,31800,4.0,17.0,...,False,False,False,True,False,False,False,False,False,White
6,12700,900,Midwest,0,Honda,Pilot,False,100100,3.5,17.0,...,False,False,False,True,False,False,False,False,False,Silver
7,39000,15000,West,2,Mazda,CX-5,True,57900,2.5,23.0,...,False,False,False,False,False,False,True,False,False,White
8,42700,9800,South,0,Honda,Civic,True,69000,1.5,32.0,...,False,False,False,False,False,False,False,False,False,White
9,12000,900,West,0,Dodge,Charger,False,118900,3.5,17.0,...,True,False,False,False,False,False,False,False,False,Silver


Originally, I wanted to use the appraisal features to predict the price of the car the customer would eventually purchase. That approach is naive: predicting only the purchase price ignores signals such as brand loyalty and common switches between makes.

I'll still start with a simple price prediction model. After that, I want to work on a collaborative filtering system that recommends a brand of car based on the customer's appraised vehicle.
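
As a rough sketch of where the brand recommender could start, the simplest version is a table of how often each appraised make leads to each purchased make. The `make_purchase` column and the toy rows below are hypothetical; only `make_appraisal` appears in the data shown here. This is a co-occurrence baseline, not full collaborative filtering.

```python
import pandas as pd

# Toy data: `make_purchase` is a hypothetical column name for the purchased make
toy = pd.DataFrame({
    "make_appraisal": ["Ford", "Ford", "Ford", "Toyota", "Toyota", "Honda"],
    "make_purchase":  ["Ford", "Ford", "Toyota", "Toyota", "Honda", "Honda"],
})

# Row-normalized counts: share of each purchased make given the appraised make
transitions = pd.crosstab(toy["make_appraisal"], toy["make_purchase"], normalize="index")

def recommend_makes(appraised_make, k=2):
    """Return the k most common purchased makes for a given appraised make."""
    return transitions.loc[appraised_make].nlargest(k).index.tolist()

print(recommend_makes("Ford"))
```

A table like this captures loyalty (the diagonal) and common switches (the off-diagonal cells) directly, which is exactly what a price-only model misses.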

In [22]:
print(f"The data has {df.shape[0]} rows and {df.shape[1]} columns")

The data has 111543 rows and 27 columns


In [None]:
df.dtypes

price                             int64
appraisal_offer                   int64
region                           object
days_since_offer                  int64
make_appraisal                   object
model_appraisal                  object
trim_level_premium_appraisal       bool
mileage_appraisal                 int64
engine_appraisal                float64
mpg_city_appraisal              float64
mpg_highway_appraisal             int64
horsepower_appraisal              int64
fuel_capacity_appraisal         float64
online_appraisal_flag              bool
cylinders_even_appraisal           bool
cylinders_high_appraisal           bool
compact_appraisal                  bool
full_size_appraisal                bool
large_suv_appraisal                bool
luxury_appraisal                   bool
medium_suv_appraisal               bool
mid_size_appraisal                 bool
pickup_appraisal                   bool
small_suv_appraisal                bool
sports_car_appraisal               bool


I'll need to encode the region, make, model, and color before running a random forest regressor.

In [34]:
df.region.unique() # 4 categories is reasonable for one-hot encoding

array(['Midwest', 'West', 'South', 'Northeast'], dtype=object)

In [44]:
df.make_appraisal.unique(), df.make_appraisal.nunique() # 40 categories seems like too many to one-hot encode

(array(['Ford', 'Toyota', 'Chevrolet', 'GMC', 'Honda', 'Mazda', 'Dodge',
        'Land Rover', 'Nissan', 'Jeep', 'Hyundai', 'Chrysler', 'Kia',
        'Lexus', 'Buick', 'Ram', 'Volvo', 'Volkswagen', 'Cadillac',
        'Subaru', 'Saturn', 'Mitsubishi', 'Mercedes-Benz', 'Mercury',
        'Audi', 'Fiat', 'Pontiac', 'Genesis', 'Mini', 'Smart', 'Porsche',
        'Infiniti', 'Jaguar', 'Lincoln', 'BMW', 'Suzuki', 'Scion',
        'Oldsmobile', 'Alfa Romeo', 'Saab'], dtype=object),
 40)

In [None]:
df.model_appraisal.unique(), df.model_appraisal.nunique() # 342 categories is definitely too many to one-hot encode

(array(['Escape', 'Tacoma', 'Cruze', 'Impala', 'Yukon', '4Runner', 'Pilot',
        'CX-5', 'Civic', 'Charger', 'Accord', 'Focus',
        'Range Rover Evoque', 'Sentra', 'Patriot', 'Grand Cherokee',
        'Cherokee', 'Compass', 'Altima', 'Prius', 'Fusion', 'Elantra',
        'Impala Limited', 'Envoy', 'Range Rover Sport', 'Maxima',
        'Town and Country', 'Malibu', 'Sonic', 'Camry', 'F150', 'Forte',
        'Corolla', 'Edge', 'Silverado 1500', 'Prius c', 'Rogue', 'Optima',
        'Challenger', 'Thunderbird', 'Explorer', 'Armada', 'Sonata',
        'Mustang', 'Versa', 'Santa Fe', 'IS 300', 'Equinox', 'NX 200t',
        'Mazda3', 'Dart', 'Taurus', 'Tucson', 'Encore', 'Renegade',
        'Terrain', 'RAV4', 'Grand Caravan', 'Murano', 'Wrangler', 'Tahoe',
        '200', 'Titan', 'Sportage', 'Pacifica', 'Soul', 'LS 460',
        'Discovery Sport', 'Yukon XL 1500', 'Liberty', 'Sorento', '1500',
        'XC90', 'Traverse', 'Fiesta', 'Passat', 'Escalade', 'Odyssey',
        'Fusion Hybr

In [46]:
df.color_grouped_appraisal.unique(), df.color_grouped_appraisal.nunique()

(array(['White', 'Gray', 'Black', 'Silver', 'Red', 'Other', 'Blue'],
       dtype=object),
 7)

As a first pass, let's just drop the make and model columns and encode the region and color columns.

In [47]:
df_encoded = pd.get_dummies(df.copy(), columns = ["color_grouped_appraisal", "region"])

In [48]:
df_encoded.columns

Index(['price', 'appraisal_offer', 'days_since_offer', 'make_appraisal',
       'model_appraisal', 'trim_level_premium_appraisal', 'mileage_appraisal',
       'engine_appraisal', 'mpg_city_appraisal', 'mpg_highway_appraisal',
       'horsepower_appraisal', 'fuel_capacity_appraisal',
       'online_appraisal_flag', 'cylinders_even_appraisal',
       'cylinders_high_appraisal', 'compact_appraisal', 'full_size_appraisal',
       'large_suv_appraisal', 'luxury_appraisal', 'medium_suv_appraisal',
       'mid_size_appraisal', 'pickup_appraisal', 'small_suv_appraisal',
       'sports_car_appraisal', 'van_appraisal',
       'color_grouped_appraisal_Black', 'color_grouped_appraisal_Blue',
       'color_grouped_appraisal_Gray', 'color_grouped_appraisal_Other',
       'color_grouped_appraisal_Red', 'color_grouped_appraisal_Silver',
       'color_grouped_appraisal_White', 'region_Midwest', 'region_Northeast',
       'region_South', 'region_West'],
      dtype='object')

In [49]:
# Drop non encoded columns
df_encoded.drop(["make_appraisal", "model_appraisal"], axis = 1, inplace = True)
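
Instead of dropping make and model outright, one option is to keep the most frequent levels and lump the rest into an "Other" bucket before one-hot encoding. The sketch below is only illustrative: `lump_rare` and the `min_count` threshold are hypothetical helpers, not something used elsewhere in this notebook.

```python
import pandas as pd

def lump_rare(series, min_count):
    """Replace levels seen fewer than min_count times with 'Other'."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; overwrite the rest with "Other"
    return series.where(~series.isin(rare), "Other")

# Toy example with two rare makes
makes = pd.Series(["Ford", "Ford", "Ford", "Toyota", "Toyota", "Saab", "Smart"])
lumped = lump_rare(makes, min_count=2)
print(lumped.value_counts())
```

The lumped column can then go through `pd.get_dummies` like region and color, with the number of dummy columns controlled by the threshold.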

I skipped this check in my earlier analysis, but there is likely strong multicollinearity among these features. Let's calculate the variance inflation factor (VIF) for each numeric feature.

In [65]:
df_encoded.dtypes

price                               int64
appraisal_offer                     int64
days_since_offer                    int64
trim_level_premium_appraisal         bool
mileage_appraisal                   int64
engine_appraisal                  float64
mpg_city_appraisal                float64
mpg_highway_appraisal               int64
horsepower_appraisal                int64
fuel_capacity_appraisal           float64
online_appraisal_flag                bool
cylinders_even_appraisal             bool
cylinders_high_appraisal             bool
compact_appraisal                    bool
full_size_appraisal                  bool
large_suv_appraisal                  bool
luxury_appraisal                     bool
medium_suv_appraisal                 bool
mid_size_appraisal                   bool
pickup_appraisal                     bool
small_suv_appraisal                  bool
sports_car_appraisal                 bool
van_appraisal                        bool
color_grouped_appraisal_Black     

In [55]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [71]:
# Numeric (int/float) columns only; select_dtypes with "number" excludes the bool dummies
numeric_cols = df_encoded.select_dtypes(include = "number").columns.tolist()

In [72]:
numeric_cols

['price',
 'appraisal_offer',
 'days_since_offer',
 'mileage_appraisal',
 'engine_appraisal',
 'mpg_city_appraisal',
 'mpg_highway_appraisal',
 'horsepower_appraisal',
 'fuel_capacity_appraisal']

In [74]:
feature_matrix = df_encoded[numeric_cols].drop(columns = "price")

In [75]:
# Note: variance_inflation_factor does not add an intercept, so these are uncentered VIFs
vif_df = pd.DataFrame()
vif_df["variable"] = feature_matrix.columns
vif_df["vif"] = [variance_inflation_factor(feature_matrix.values, col) for col in range(feature_matrix.shape[1])]

In [76]:
vif_df # Very high VIF values, especially for the engine and mpg features

Unnamed: 0,variable,vif
0,appraisal_offer,8.001158
1,days_since_offer,1.327628
2,mileage_appraisal,10.888925
3,engine_appraisal,46.622739
4,mpg_city_appraisal,135.472536
5,mpg_highway_appraisal,148.872556
6,horsepower_appraisal,56.464608
7,fuel_capacity_appraisal,44.650622


Maybe we should run PCA on the engine, fuel economy, and odometer mileage features and use the resulting components in place of the raw columns in our random forest? First, let's try keeping appraisal_offer and days_since_offer as they are.

In [78]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [79]:
pca_features = feature_matrix.drop(columns=["appraisal_offer", "days_since_offer"])

In [81]:
scaler = StandardScaler()
scaler.fit(pca_features)

In [82]:
# The scaler was already fit in the previous cell, so only transform here
pca_features_scaled = scaler.transform(pca_features)

In [83]:
# Initialize and fit PCA, keeping enough components to explain 95% of the variance
principal = PCA(n_components = 0.95)
principal.fit(pca_features_scaled)

In [84]:
# Project the scaled features onto the principal components
components = principal.transform(pca_features_scaled)

In [85]:
components.shape

(111543, 4)

In [87]:
print(principal.explained_variance_ratio_)

[0.69599732 0.16969905 0.06401881 0.03888921]
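
A quick check on why PCA kept four components: the ratios printed above can be summed cumulatively to see where the 95% threshold is crossed. This is just arithmetic on the values from the output.

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([0.69599732, 0.16969905, 0.06401881, 0.03888921])
cumulative = np.cumsum(ratios)
print(cumulative.round(4))

# Smallest number of components whose cumulative share reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_needed)
```

Three components cover about 93% of the variance, so a fourth is needed to pass 95%; together the four explain about 96.9%.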


In [90]:
components_df = pd.DataFrame(components)

In [102]:
components_df.rename(columns = {0 : "component_1",
                                1 : "component_2",
                                2 : "component_3",
                                3 : "component_4"},
                     inplace = True
                     )

components_df.head()

Unnamed: 0,component_1,component_2,component_3,component_4
0,-0.896257,-0.834579,-1.007772,0.258177
1,1.79822,0.466361,0.017922,-0.015119
2,-2.905041,0.47446,0.339243,-0.150217
3,1.282849,1.425559,-0.257809,-0.712288
4,3.798908,-0.809424,0.497478,0.263624


In [107]:
#Create new feature matrix
feature_matrix_new = feature_matrix[["appraisal_offer", "days_since_offer"]]
feature_matrix_new = feature_matrix_new.join(components_df)

feature_matrix_new.head()

Unnamed: 0,appraisal_offer,days_since_offer,component_1,component_2,component_3,component_4
0,9000,0,-0.896257,-0.834579,-1.007772,0.258177
1,14600,1,1.79822,0.466361,0.017922,-0.015119
2,3400,0,-2.905041,0.47446,0.339243,-0.150217
3,1100,0,1.282849,1.425559,-0.257809,-0.712288
4,15000,0,3.798908,-0.809424,0.497478,0.263624


In [108]:
#Look at VIF again
vif_df_pca = pd.DataFrame()
vif_df_pca["variable"] = feature_matrix_new.columns
vif_df_pca["vif"] = [variance_inflation_factor(feature_matrix_new.values, col) for col in range(feature_matrix_new.shape[1]) ]

In [109]:
vif_df_pca # Much better: every VIF is now below 2

Unnamed: 0,variable,vif
0,appraisal_offer,1.755058
1,days_since_offer,1.29739
2,component_1,1.063421
3,component_2,1.347064
4,component_3,1.00868
5,component_4,1.00003
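
One thing to watch if these components are fed to a model: the scaler and PCA above were fit on all rows, before any train/test split. Fitting them on the training rows only avoids leaking test-set statistics into the features. The sketch below uses synthetic data and hypothetical variable names to show the pattern with a scikit-learn pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 2))
# Six synthetic, correlated columns standing in for the engine/mpg features
mixed = base @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(500, 4))
X_toy = np.hstack([base, mixed])

X_tr, X_te = train_test_split(X_toy, test_size=0.33, random_state=100)

reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95))
Z_tr = reducer.fit_transform(X_tr)  # fit scaler and PCA on training rows only
Z_te = reducer.transform(X_te)      # reuse the fitted scaler and PCA on test rows
print(Z_tr.shape, Z_te.shape)
```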


In [31]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

In [50]:

# Note: this first pass uses the full encoded feature set, not the PCA features built above
X = df_encoded.drop(columns = "price")
# Use a 1-D Series for the target; a single-column DataFrame triggers scikit-learn's DataConversionWarning
y = df_encoded["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 100)

In [51]:
rf_regressor = RandomForestRegressor()

rf_regressor.fit(X_train, y_train)


In [54]:
rf_regressor.score(X = X_test, y = y_test) # R^2 on the held-out test set

0.16173777395131883
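
To put an R² of about 0.16 in context, it can help to compare against a trivial baseline and to report the error in dollars as well. The sketch below does this on synthetic data; the toy features, target, and model settings are assumptions for illustration and do not reproduce the score above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(600, 3))
# Noisy synthetic "price": one weak signal plus a lot of noise
y_toy = 20000 + 3000 * X_toy[:, 0] + rng.normal(scale=5000, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.33, random_state=100)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)  # always predicts the training mean
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

for name, model in [("baseline", baseline), ("forest", forest)]:
    r2 = model.score(X_te, y_te)
    mae = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name}: R^2={r2:.3f}, MAE={mae:,.0f}")
```

A mean-only baseline scores at or below zero on R², so any positive score is an improvement, but the MAE makes it easier to judge whether the improvement is useful in practice.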