Lambda School Data Science

*Unit 2, Sprint 1, Module 4*

---

# Logistic Regression


## Assignment 🌯

You'll use a [**dataset of 400+ burrito reviews**](https://srcole.github.io/100burritos/). How accurately can you predict whether a burrito is rated 'Great'?

> We have developed a 10-dimensional system for rating the burritos in San Diego. ... Generate models for what makes a burrito great and investigate correlations in its dimensions.

- [ ] Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.
- [ ] Begin with baselines for classification.
- [ ] Use scikit-learn for logistic regression.
- [ ] Get your model's validation accuracy. (Multiple times if you try multiple iterations.)
- [ ] Get your model's test accuracy. (One time, at the end.)
- [ ] Commit your notebook to your fork of the GitHub repo.


## Stretch Goals

- [ ] Add your own stretch goal(s) !
- [ ] Make exploratory visualizations.
- [ ] Do one-hot encoding.
- [ ] Do [feature scaling](https://scikit-learn.org/stable/modules/preprocessing.html).
- [ ] Get and plot your coefficients.
- [ ] Try [scikit-learn pipelines](https://scikit-learn.org/stable/modules/compose.html).

In [534]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Linear-Models/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [535]:
# Load data downloaded from https://srcole.github.io/100burritos/
import pandas as pd
df = pd.read_csv(DATA_PATH+'burritos/burritos.csv')

In [536]:
# Derive binary classification target:
# We define a 'Great' burrito as having an
# overall rating of 4 or higher, on a 5 point scale.
# Drop unrated burritos.
df = df.dropna(subset=['overall'])
df['Great'] = df['overall'] >= 4

In [537]:
# Clean/combine the Burrito categories
df['Burrito'] = df['Burrito'].str.lower()

california = df['Burrito'].str.contains('california')
asada = df['Burrito'].str.contains('asada')
surf = df['Burrito'].str.contains('surf')
carnitas = df['Burrito'].str.contains('carnitas')

df.loc[california, 'Burrito'] = 'California'
df.loc[asada, 'Burrito'] = 'Asada'
df.loc[surf, 'Burrito'] = 'Surf & Turf'
df.loc[carnitas, 'Burrito'] = 'Carnitas'
df.loc[~california & ~asada & ~surf & ~carnitas, 'Burrito'] = 'Other'

In [538]:
# Drop some high cardinality categoricals
df = df.drop(columns=['Notes', 'Location', 'Reviewer', 'Address', 'URL', 'Neighborhood'])

In [539]:
# Drop some columns to prevent "leakage"
df = df.drop(columns=['Rec', 'overall'])

In [540]:
df.shape

(421, 59)

In [541]:
# EDA
df.head()


Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
0,California,1/18/2016,3.5,4.2,,6.49,3.0,,,,,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
1,California,1/24/2016,3.5,3.3,,5.45,3.5,,,,,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,,,x,x,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
2,Carnitas,1/24/2016,,,,4.85,1.5,,,,,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,,,,x,x,,,,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
3,Asada,1/24/2016,,,,5.25,2.0,,,,,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,,,x,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False
4,California,1/27/2016,4.0,3.8,x,6.59,4.0,,,,,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,,,x,x,,x,x,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True


In [542]:
# Parse the Date column into datetime values
df['Date'] = pd.to_datetime(df['Date'], infer_datetime_format=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421 entries, 0 to 422
Data columns (total 59 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Burrito         421 non-null    object        
 1   Date            421 non-null    datetime64[ns]
 2   Yelp            87 non-null     float64       
 3   Google          87 non-null     float64       
 4   Chips           26 non-null     object        
 5   Cost            414 non-null    float64       
 6   Hunger          418 non-null    float64       
 7   Mass (g)        22 non-null     float64       
 8   Density (g/mL)  22 non-null     float64       
 9   Length          283 non-null    float64       
 10  Circum          281 non-null    float64       
 11  Volume          281 non-null    float64       
 12  Tortilla        421 non-null    float64       
 13  Temp            401 non-null    float64       
 14  Meat            407 non-null    float64       
 15  Fillin

In [543]:
df.shape

(421, 59)

Convert the ingredient columns, which mark presence with an 'x', into 0/1 indicators

In [544]:
def convert_to_bool(value):
    """Return 1 if the cell is marked with 'x', otherwise 0 (including NaN)."""
    if value == 'x':
        return 1
    else:
        return 0

# Flag and ingredient columns that use 'x' to mark presence
col = ['Unreliable', 'NonSD', 'Beef', 'Pico', 'Guac',
       'Cheese', 'Fries', 'Sour cream', 'Pork', 'Chicken', 'Shrimp', 'Fish',
       'Rice', 'Beans', 'Lettuce', 'Tomato', 'Bell peper', 'Carrots',
       'Cabbage', 'Sauce', 'Salsa.1', 'Cilantro', 'Onion', 'Taquito',
       'Pineapple', 'Ham', 'Chile relleno', 'Nopales', 'Lobster',
       'Egg', 'Mushroom', 'Bacon', 'Sushi', 'Avocado', 'Corn', 'Zucchini']
for x in col:
    df[x] = df[x].apply(convert_to_bool)
df.Beef.value_counts()

0    284
1    137
Name: Beef, dtype: int64

In [545]:
df.Beef

0      1
1      1
2      0
3      1
4      1
      ..
418    0
419    0
420    0
421    0
422    0
Name: Beef, Length: 421, dtype: int64
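
The element-wise conversion above can also be written as a single vectorized comparison. The sketch below is a minimal illustration on a made-up two-column frame (`toy` and its values are assumptions, not rows from the burrito data). Like `convert_to_bool`, it matches only a lowercase 'x', so an uppercase 'X' would become 0.

```python
import pandas as pd

# Toy frame standing in for the ingredient columns (values are made up).
toy = pd.DataFrame({
    'Beef': ['x', None, 'x', None],
    'Pico': [None, 'x', 'X', None],
})

# Comparing with 'x' yields booleans; missing values compare as False,
# so they become 0 after the integer cast.
indicators = toy.eq('x').astype(int)
print(indicators)
```

On the real data this would presumably be applied as `df[col] = df[col].eq('x').astype(int)`, which avoids the Python-level loop.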

## Do train/validate/test split. Train on reviews from 2016 & earlier. Validate on 2017. Test on 2018 & later.

In [546]:
# train, val and test split
test_cond = (df['Date'].dt.year >= 2018)
val_cond = (df['Date'].dt.year == 2017)
train_cond = (df['Date'].dt.year <= 2016)
train = df[train_cond]
val = df[val_cond]
test = df[test_cond]
assert len(df) == (len(train) + len(val) + len(test))
train.shape, val.shape, test.shape

((298, 59), (85, 59), (38, 59))

## Begin with baselines for classification.

In [547]:
# Class distribution of the target in the training set
train['Great'].value_counts()

False    176
True     122
Name: Great, dtype: int64

The majority class is False (176 of 298 training reviews, about 59%), so the baseline predicts False for every burrito.

In [548]:
# Select the feature columns and the target, then build
# X_train, y_train, X_val, y_val, X_test, y_test
features = ['Burrito', 'Cost', 'Hunger', 'Volume', 'Tortilla', 'Temp', 'Meat', 'Fillings', 'Meat:filling', 'Uniformity',
            'Salsa', 'Synergy', 'Wrap', 'Beef', 'Pico', 'Guac', 'Cheese', 'Fries']
target = ['Great']

X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

X_test = test[features]
y_test = test[target]

In [549]:
# Baseline: always predict the majority class of the training target
baseline = train['Great'].mode()[0]
print(f'Baseline prediction (majority class): {baseline}')

# Accuracy of the majority-class baseline on the validation set
from sklearn.metrics import accuracy_score
y_pred = [baseline] * len(y_val)
print('Validation accuracy of the majority-class baseline:')
accuracy_score(y_val, y_pred)

Baseline prediction (majority class): False
Validation accuracy of the majority-class baseline:


0.5529411764705883
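
The majority-class baseline computed by hand above can also be expressed with scikit-learn's `DummyClassifier`. The following is a minimal sketch on made-up labels (the `*_toy` arrays are assumptions for illustration, not the burrito data); it shows that the `most_frequent` strategy predicts the training mode for every row.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 6 of 10 training rows are False, so False is the majority class.
y_train_toy = np.array([False] * 6 + [True] * 4)
y_val_toy = np.array([False, True, False, True, True])

# This strategy ignores the features, so a column of zeros is enough.
X_train_toy = np.zeros((len(y_train_toy), 1))
X_val_toy = np.zeros((len(y_val_toy), 1))

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_toy, y_train_toy)
val_pred = dummy.predict(X_val_toy)

# Every prediction is False; 2 of the 5 validation labels are False.
baseline_accuracy = accuracy_score(y_val_toy, val_pred)
print(baseline_accuracy)
```

Note that the baseline class is learned from the training labels only; the validation accuracy then depends on how common that class is in the validation set.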

## Use scikit-learn for logistic regression.

In [550]:
X_train

Unnamed: 0,Burrito,Cost,Hunger,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Beef,Pico,Guac,Cheese,Fries
0,California,6.49,3.0,,3.0,5.0,3.0,3.5,4.0,4.0,4.0,4.0,4.0,1,1,1,1,1
1,California,5.45,3.5,,2.0,3.5,2.5,2.5,2.0,4.0,3.5,2.5,5.0,1,1,1,1,1
2,Carnitas,4.85,1.5,,3.0,2.0,2.5,3.0,4.5,4.0,3.0,3.0,5.0,0,1,1,0,0
3,Asada,5.25,2.0,,3.0,2.0,3.5,3.0,4.0,5.0,4.0,4.0,5.0,1,1,1,0,0
4,California,6.59,4.0,,4.0,5.0,4.0,3.5,4.5,5.0,2.5,4.5,4.0,1,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
296,California,5.65,3.0,0.75,4.0,1.5,2.0,3.0,4.2,4.0,3.0,2.0,4.5,0,0,0,0,0
297,Other,5.49,3.0,0.64,4.5,5.0,2.0,2.0,2.5,3.5,3.0,2.5,3.0,0,0,0,0,0
298,California,7.75,4.0,0.70,3.5,2.5,3.0,3.3,1.4,2.3,2.2,3.3,4.5,0,0,0,0,0
299,Asada,7.75,4.0,0.68,4.0,4.5,2.0,2.0,3.5,3.5,2.0,2.0,4.0,0,0,0,0,0


In [551]:
X_val

Unnamed: 0,Burrito,Cost,Hunger,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Beef,Pico,Guac,Cheese,Fries
301,California,6.60,,0.77,4.0,4.5,4.0,3.5,3.5,5.0,1.5,3.50,4.5,0,0,0,0,0
302,Other,6.60,,0.75,4.0,2.0,,4.0,,4.6,4.2,3.75,5.0,0,0,0,0,0
303,Other,8.50,3.9,0.74,3.0,4.5,4.1,3.0,3.7,4.0,4.3,4.20,5.0,0,0,0,0,0
304,Other,7.90,4.0,0.72,3.5,4.0,4.0,3.0,4.0,4.5,4.0,3.80,4.8,0,0,0,0,0
305,Other,4.99,3.5,0.75,2.5,4.5,3.0,2.5,3.0,3.0,2.0,2.00,4.0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
381,California,6.85,3.5,0.91,3.0,4.5,3.8,3.8,4.0,3.5,3.5,4.00,3.0,0,0,0,0,0
382,California,6.85,3.5,0.89,3.0,4.5,4.0,4.0,4.5,3.0,4.0,4.00,3.5,0,0,0,0,0
383,Other,11.50,3.5,0.75,2.0,2.0,4.0,3.5,3.0,4.5,3.5,4.00,2.0,0,0,0,0,0
384,California,7.89,4.0,0.80,4.0,3.0,4.0,4.0,3.0,4.0,3.5,4.30,4.5,0,0,0,0,0


In [552]:
# importing libraries
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV

# Summarize the features to find the categorical columns that need one-hot encoding
X_train.describe(include = 'all')

Unnamed: 0,Burrito,Cost,Hunger,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Beef,Pico,Guac,Cheese,Fries
count,298,292.0,297.0,174.0,298.0,283.0,288.0,297.0,292.0,296.0,278.0,296.0,296.0,298.0,298.0,298.0,298.0,298.0
unique,5,,,,,,,,,,,,,,,,,
top,California,,,,,,,,,,,,,,,,,
freq,118,,,,,,,,,,,,,,,,,
mean,,6.896781,3.445286,0.77092,3.472315,3.70636,3.551215,3.519024,3.52887,3.395946,3.32464,3.540203,3.955068,0.436242,0.385906,0.338926,0.40604,0.325503
std,,1.211412,0.85215,0.137833,0.797606,0.991897,0.869483,0.850348,1.040457,1.089044,0.971226,0.922426,1.167341,0.496752,0.487627,0.474141,0.491918,0.469351
min,,2.99,0.5,0.4,1.4,1.0,1.0,1.0,0.5,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,,6.25,3.0,0.6625,3.0,3.0,3.0,3.0,3.0,2.5,2.5,3.0,3.5,0.0,0.0,0.0,0.0,0.0
50%,,6.85,3.5,0.75,3.5,4.0,3.5,3.5,4.0,3.5,3.5,3.75,4.0,0.0,0.0,0.0,0.0,0.0
75%,,7.5,4.0,0.87,4.0,4.5,4.0,4.0,4.0,4.0,4.0,4.0,5.0,1.0,1.0,1.0,1.0,1.0


One-hot encoding

In [553]:
# Burrito is the only categorical (object) column in X_train; it has 5 categories

# One-hot encode: fit the encoder on the training set only, then reuse it for validation
encoder = ce.OneHotEncoder(use_cat_names=True)
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)
X_val_encoded



Unnamed: 0,Burrito_California,Burrito_Carnitas,Burrito_Asada,Burrito_Other,Burrito_Surf & Turf,Cost,Hunger,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Beef,Pico,Guac,Cheese,Fries
301,1,0,0,0,0,6.60,,0.77,4.0,4.5,4.0,3.5,3.5,5.0,1.5,3.50,4.5,0,0,0,0,0
302,0,0,0,1,0,6.60,,0.75,4.0,2.0,,4.0,,4.6,4.2,3.75,5.0,0,0,0,0,0
303,0,0,0,1,0,8.50,3.9,0.74,3.0,4.5,4.1,3.0,3.7,4.0,4.3,4.20,5.0,0,0,0,0,0
304,0,0,0,1,0,7.90,4.0,0.72,3.5,4.0,4.0,3.0,4.0,4.5,4.0,3.80,4.8,0,0,0,0,0
305,0,0,0,1,0,4.99,3.5,0.75,2.5,4.5,3.0,2.5,3.0,3.0,2.0,2.00,4.0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
381,1,0,0,0,0,6.85,3.5,0.91,3.0,4.5,3.8,3.8,4.0,3.5,3.5,4.00,3.0,0,0,0,0,0
382,1,0,0,0,0,6.85,3.5,0.89,3.0,4.5,4.0,4.0,4.5,3.0,4.0,4.00,3.5,0,0,0,0,0
383,0,0,0,1,0,11.50,3.5,0.75,2.0,2.0,4.0,3.5,3.0,4.5,3.5,4.00,2.0,0,0,0,0,0
384,1,0,0,0,0,7.89,4.0,0.80,4.0,3.0,4.0,4.0,3.0,4.0,3.5,4.30,4.5,0,0,0,0,0
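
The fit-on-train, transform-on-validation pattern used above matters because the encoder's columns are fixed by the categories seen during fitting. The sketch below illustrates this with scikit-learn's own `OneHotEncoder` rather than `category_encoders` (a swapped-in library, chosen here only to keep the example dependency-light), on made-up rows.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up rows; 'Carnitas' appears only in the validation frame.
train_toy = pd.DataFrame({'Burrito': ['California', 'Asada', 'Other', 'California']})
val_toy = pd.DataFrame({'Burrito': ['Asada', 'Carnitas']})

# handle_unknown='ignore' encodes unseen categories as an all-zero row
# instead of raising an error.
encoder_toy = OneHotEncoder(handle_unknown='ignore')
train_encoded = encoder_toy.fit_transform(train_toy).toarray()
val_encoded = encoder_toy.transform(val_toy).toarray()

print(encoder_toy.categories_[0])  # categories learned from the training rows
print(val_encoded)
```

Both encoded arrays have the same three columns, so a model fitted on `train_encoded` can score `val_encoded` directly.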


Imputer

In [554]:
# find the null values first
X_train_encoded.isnull().sum()

Burrito_California       0
Burrito_Carnitas         0
Burrito_Asada            0
Burrito_Other            0
Burrito_Surf & Turf      0
Cost                     6
Hunger                   1
Volume                 124
Tortilla                 0
Temp                    15
Meat                    10
Fillings                 1
Meat:filling             6
Uniformity               2
Salsa                   20
Synergy                  2
Wrap                     2
Beef                     0
Pico                     0
Guac                     0
Cheese                   0
Fries                    0
dtype: int64

In [555]:
# Fill missing values with each column's mean, learned from the training set only
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_encoded)
X_val_imputed = imputer.transform(X_val_encoded)

In [556]:
# Confirm that no missing values remain after imputation
pd.DataFrame(X_train_imputed, columns = X_train_encoded.columns).isnull().sum()

Burrito_California     0
Burrito_Carnitas       0
Burrito_Asada          0
Burrito_Other          0
Burrito_Surf & Turf    0
Cost                   0
Hunger                 0
Volume                 0
Tortilla               0
Temp                   0
Meat                   0
Fillings               0
Meat:filling           0
Uniformity             0
Salsa                  0
Synergy                0
Wrap                   0
Beef                   0
Pico                   0
Guac                   0
Cheese                 0
Fries                  0
dtype: int64
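
To make the imputation step concrete, here is a minimal sketch on a made-up 3x2 array (the `*_toy` names and values are assumptions). It shows that `SimpleImputer(strategy='mean')` stores the per-column training means and reuses them for any other data it transforms.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train_toy = np.array([[1.0, 10.0],
                        [3.0, np.nan],
                        [np.nan, 30.0]])
X_val_toy = np.array([[np.nan, np.nan]])

imputer_toy = SimpleImputer(strategy='mean')
X_train_filled = imputer_toy.fit_transform(X_train_toy)
X_val_filled = imputer_toy.transform(X_val_toy)

# Means are computed from the non-missing training values: 2.0 and 20.0.
print(imputer_toy.statistics_)
# The validation gaps receive those same training means.
print(X_val_filled)
```

Because the validation row is filled with training statistics, no information from the validation set leaks into preprocessing.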

Scaler

In [557]:
# Standardize every column using the mean and standard deviation of the training set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)
X_val_scaled = scaler.transform(X_val_imputed)


In [558]:
X_train_scaled

array([[ 1.23508045, -0.22202652, -0.36480111, ...,  1.39660125,
         1.20946679,  1.43950163],
       [ 1.23508045, -0.22202652, -0.36480111, ...,  1.39660125,
         1.20946679,  1.43950163],
       [-0.80966385,  4.50396651, -0.36480111, ...,  1.39660125,
        -0.82681063, -0.69468487],
       ...,
       [ 1.23508045, -0.22202652, -0.36480111, ..., -0.71602399,
        -0.82681063, -0.69468487],
       [-0.80966385, -0.22202652,  2.74121975, ..., -0.71602399,
        -0.82681063, -0.69468487],
       [-0.80966385, -0.22202652, -0.36480111, ..., -0.71602399,
        -0.82681063, -0.69468487]])
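
The values in this array can be checked by hand. The first column is `Burrito_California`, which equals 1 for 118 of the 298 training rows (the `freq` reported by `describe` earlier). The sketch below rebuilds such a column (its row order is an assumption and does not affect the result) and compares `StandardScaler` with a manual z-score.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Rebuild a column like Burrito_California: 118 ones among 298 training rows.
column = np.array([1.0] * 118 + [0.0] * 180).reshape(-1, 1)

scaler_toy = StandardScaler().fit(column)

# Manual z-score: subtract the column mean, then divide by the population
# standard deviation (NumPy's default ddof=0, which StandardScaler also uses).
mean = column.mean()
std = column.std()
manual_one = (1.0 - mean) / std
manual_zero = (0.0 - mean) / std

scaled_one, scaled_zero = scaler_toy.transform(np.array([[1.0], [0.0]])).ravel()
print(round(scaled_one, 8), round(scaled_zero, 8))
```

The two results should agree with the 1.23508045 and -0.80966385 entries in the first column of the array above.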

Fitting

In [559]:
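# Fit logistic regression with built-in cross-validation over the regularization strength C
model = LogisticRegressionCV()
# y_train is a one-column DataFrame; flatten it to the 1-D array scikit-learn expects
model.fit(X_train_scaled, y_train.values.ravel())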


LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='auto', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

Predict and compute accuracy on the training and validation sets

- Score on train data


In [560]:
y_pred = model.predict(X_train_scaled)
train_accuracy = accuracy_score(y_train, y_pred)

- Score on validation data


In [561]:
y_pred = model.predict(X_val_scaled)
val_accuracy = accuracy_score(y_val, y_pred)

In [562]:
print(f'Train Accuracy {train_accuracy} \nValidation Accuracy {val_accuracy} ')

Train Accuracy 0.889261744966443 
Validation Accuracy 0.8705882352941177 
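
The encode, impute, scale, and fit steps performed one by one above can be chained into a single scikit-learn pipeline, which is one of the stretch goals. The following is a minimal sketch under stated assumptions: it runs on synthetic stand-in data (not the burrito reviews), and it uses scikit-learn's `OneHotEncoder` inside a `ColumnTransformer` in place of `category_encoders`. The step names and `*_toy` variables are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: one categorical column, two numeric columns,
# and a target that depends only on 'Synergy'.
rng = np.random.default_rng(0)
n = 120
X_toy = pd.DataFrame({
    'Burrito': rng.choice(['California', 'Asada', 'Other'], size=n),
    'Synergy': rng.uniform(1, 5, size=n),
    'Cost': rng.uniform(4, 10, size=n),
})
y_toy = (X_toy['Synergy'] > 3.5).to_numpy()
X_toy.loc[X_toy.index % 15 == 0, 'Cost'] = np.nan  # introduce some gaps

X_train_toy, X_val_toy = X_toy.iloc[:90], X_toy.iloc[90:]
y_train_toy, y_val_toy = y_toy[:90], y_toy[90:]

pipeline = Pipeline([
    # One-hot encode 'Burrito'; pass the numeric columns through unchanged.
    ('encode', ColumnTransformer(
        [('onehot', OneHotEncoder(handle_unknown='ignore'), ['Burrito'])],
        remainder='passthrough',
        sparse_threshold=0,  # always return a dense array
    )),
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
    ('model', LogisticRegressionCV(cv=3)),
])

# fit() learns every preprocessing statistic from the training rows only.
pipeline.fit(X_train_toy, y_train_toy)
val_accuracy_toy = pipeline.score(X_val_toy, y_val_toy)
print(f'Toy validation accuracy: {val_accuracy_toy:.2f}')
```

With a pipeline, `fit` and `score` replace the separate `fit_transform`/`transform` calls, which makes it harder to accidentally fit a preprocessing step on validation or test data.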


Put the validation predictions in a DataFrame to compare them manually with the true labels

In [563]:
pd.DataFrame(y_pred, columns=y_val.columns)

Unnamed: 0,Great
0,False
1,False
2,True
3,True
4,False
...,...
80,True
81,True
82,False
83,True


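Check the predictions against the validation features (`X_val`) and the full validation rows (`val`), whose last column holds the true `Great` labels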

In [564]:
X_val

Unnamed: 0,Burrito,Cost,Hunger,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Beef,Pico,Guac,Cheese,Fries
301,California,6.60,,0.77,4.0,4.5,4.0,3.5,3.5,5.0,1.5,3.50,4.5,0,0,0,0,0
302,Other,6.60,,0.75,4.0,2.0,,4.0,,4.6,4.2,3.75,5.0,0,0,0,0,0
303,Other,8.50,3.9,0.74,3.0,4.5,4.1,3.0,3.7,4.0,4.3,4.20,5.0,0,0,0,0,0
304,Other,7.90,4.0,0.72,3.5,4.0,4.0,3.0,4.0,4.5,4.0,3.80,4.8,0,0,0,0,0
305,Other,4.99,3.5,0.75,2.5,4.5,3.0,2.5,3.0,3.0,2.0,2.00,4.0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
381,California,6.85,3.5,0.91,3.0,4.5,3.8,3.8,4.0,3.5,3.5,4.00,3.0,0,0,0,0,0
382,California,6.85,3.5,0.89,3.0,4.5,4.0,4.0,4.5,3.0,4.0,4.00,3.5,0,0,0,0,0
383,Other,11.50,3.5,0.75,2.0,2.0,4.0,3.5,3.0,4.5,3.5,4.00,2.0,0,0,0,0,0
384,California,7.89,4.0,0.80,4.0,3.0,4.0,4.0,3.0,4.0,3.5,4.30,4.5,0,0,0,0,0


In [565]:
val

Unnamed: 0,Burrito,Date,Yelp,Google,Chips,Cost,Hunger,Mass (g),Density (g/mL),Length,Circum,Volume,Tortilla,Temp,Meat,Fillings,Meat:filling,Uniformity,Salsa,Synergy,Wrap,Unreliable,NonSD,Beef,Pico,Guac,Cheese,Fries,Sour cream,Pork,Chicken,Shrimp,Fish,Rice,Beans,Lettuce,Tomato,Bell peper,Carrots,Cabbage,Sauce,Salsa.1,Cilantro,Onion,Taquito,Pineapple,Ham,Chile relleno,Nopales,Lobster,Queso,Egg,Mushroom,Bacon,Sushi,Avocado,Corn,Zucchini,Great
301,California,2017-01-04,,,,6.60,,,,23.0,20.5,0.77,4.0,4.5,4.0,3.5,3.5,5.0,1.5,3.50,4.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,False
302,Other,2017-01-04,,,,6.60,,,,20.5,21.5,0.75,4.0,2.0,,4.0,,4.6,4.2,3.75,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,False
303,Other,2017-01-07,,,,8.50,3.9,,,21.0,21.0,0.74,3.0,4.5,4.1,3.0,3.7,4.0,4.3,4.20,5.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,False
304,Other,2017-01-07,,,,7.90,4.0,,,20.5,21.0,0.72,3.5,4.0,4.0,3.0,4.0,4.5,4.0,3.80,4.8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,False
305,Other,2017-01-10,,,,4.99,3.5,,,18.5,22.5,0.75,2.5,4.5,3.0,2.5,3.0,3.0,2.0,2.00,4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
381,California,2017-09-05,,,,6.85,3.5,,,22.5,22.5,0.91,3.0,4.5,3.8,3.8,4.0,3.5,3.5,4.00,3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,True
382,California,2017-09-05,,,,6.85,3.5,,,22.2,22.5,0.89,3.0,4.5,4.0,4.0,4.5,3.0,4.0,4.00,3.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,True
383,Other,2017-12-16,4.0,4.5,Yes,11.50,3.5,,,15.0,25.0,0.75,2.0,2.0,4.0,3.5,3.0,4.5,3.5,4.00,2.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,False
384,California,2017-12-29,,,,7.89,4.0,,,19.0,23.0,0.80,4.0,3.0,4.0,4.0,3.0,4.0,3.5,4.30,4.5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,0,0,0,True


## Scores for train, validation and test

In [566]:
# Apply the already-fitted encoder, imputer, and scaler to the test features (transform only, no refitting)
X_test = test[features]
X_test_encoded = encoder.transform(X_test)
X_test_imputed = imputer.transform(X_test_encoded)
X_test_scaled = scaler.transform(X_test_imputed)

In [567]:
# Evaluate on the test set only once, after model selection is finished
y_pred = model.predict(X_test_scaled)
test_accuracy = accuracy_score(y_test, y_pred)

In [568]:
# print accuracy results
print(f'Train Accuracy {train_accuracy} \nValidation Accuracy {val_accuracy}')
print(f'Test Accuracy {test_accuracy}')

Train Accuracy 0.889261744966443 
Validation Accuracy 0.8705882352941177
Test Accuracy 0.7631578947368421
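
When reading these three numbers, it helps to convert them back into counts, because the splits differ a lot in size (298, 85, and 38 rows). The sketch below does that arithmetic and adds a rough binomial standard error, sqrt(p(1 - p)/n). Treating reviews as independent draws is an assumption made here for illustration, so the error values are only an approximate guide.

```python
import math

# Accuracies and split sizes reported in this notebook.
results = {
    'train': (0.889261744966443, 298),
    'validation': (0.8705882352941177, 85),
    'test': (0.7631578947368421, 38),
}

summary = {}
for name, (accuracy, n_rows) in results.items():
    n_correct = round(accuracy * n_rows)
    # Binomial standard error of a proportion (assumes independent reviews).
    std_error = math.sqrt(accuracy * (1 - accuracy) / n_rows)
    summary[name] = (n_correct, std_error)
    print(f'{name}: {n_correct}/{n_rows} correct, standard error ~ {std_error:.3f}')
```

The test accuracy corresponds to 29 correct predictions out of 38, so a single review changes it by about 2.6 percentage points; the gap between validation and test accuracy should be interpreted with that in mind.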
