# D.C. Properties - Condition Prediction

This notebook loads a subset of the DC Properties dataset to build a classification model that predicts the condition of a building. The columns in the data are:

 * **NUM_UNITS** - Number of Units
 * **ROOMS** - Number of Rooms
 * **BEDRM** - Number of Bedrooms
 * **BATHRM** - Number of Full Bathrooms
 * **HF_BATHRM** - Number of Half Bathrooms (no bathtub or shower)
 * **KITCHENS** - Number of kitchens
 * **STORIES** - Number of stories in primary dwelling
 * **HEAT** - Heating
 * **AC** - Cooling
 * **FIREPLACES** - Number of fireplaces
 * **ROOF** - Roof type
 * **EXTWALL** - Exterior wall
 * **AYB** - The earliest time the main portion of the building was built
 * **EYB** - The year an improvement was built more recent than actual year built
 * **YR_SALE** - Year of most recent sale
 * **CNDTN** - Condition
 * **GBA** - Gross building area in square feet
 * **LANDAREA** - Land area of property in square feet
 * **WARD** - Ward (District is divided into eight wards, each with approximately 75,000 residents)
 * **PRICE** - Price of most recent sale

## Imports and Config setting

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math

In [2]:
pd.set_option('display.max_columns', None)

## Data loading and Selection

Define the parameters that will be used throughout the notebook.

In [3]:
# Params
input_data_path = '2_dc_properties_processed_zipped.csv'
numerical_cols = ['NUM_UNITS', 'ROOMS', 'BEDRM', 'BATHRM', 'HF_BATHRM', 'KITCHENS',
                   'STORIES', 'FIREPLACES', 'AYB', 'EYB', 'GBA', 'LANDAREA', 'X', 'Y', 'PRICE', 'YR_SALE']
categorical_cols = ['HEAT', 'AC', 'ROOF', 'EXTWALL', 'CNDTN', 'WARD']

Load the data and preview it.

In [4]:
data_df = pd.read_csv(input_data_path, low_memory=False, index_col=0, compression='zip')
data_df

Unnamed: 0,NUM_UNITS,ROOMS,BEDRM,BATHRM,HF_BATHRM,KITCHENS,STORIES,HEAT,AC,FIREPLACES,ROOF,EXTWALL,AYB,EYB,YR_SALE,CNDTN,GBA,LANDAREA,WARD,X,Y,PRICE
0,2.0,8,4,4,0,2.0,3.0,Warm Cool,Y,5,Metal- Sms,Common Brick,1910.0,1972,2003.0,3.0,2522.0,1680,2,-77.040429,38.914881,1095000.0
1,2.0,11,5,3,1,2.0,3.0,Warm Cool,Y,4,Built Up,Common Brick,1898.0,1972,2000.0,3.0,2567.0,1680,2,-77.040429,38.914881,
2,2.0,9,5,3,1,2.0,3.0,Hot Water Rad,Y,4,Built Up,Common Brick,1910.0,1984,2016.0,4.0,2522.0,1680,2,-77.040429,38.914881,2100000.0
3,2.0,8,5,3,1,2.0,3.0,Hot Water Rad,Y,3,Built Up,Common Brick,1900.0,1984,2006.0,3.0,2484.0,1680,2,-77.040429,38.914881,1602000.0
4,1.0,11,3,2,1,1.0,3.0,Warm Cool,Y,0,Neopren,Common Brick,1913.0,1985,,3.0,5255.0,2032,2,-77.040429,38.914881,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106153,2.0,8,4,2,0,2.0,2.0,Forced Air,N,0,Built Up,Common Brick,1953.0,1962,,2.0,1600.0,6337,8,-77.006347,38.821799,
106154,2.0,10,5,2,0,2.0,2.0,Forced Air,N,0,Built Up,Common Brick,1953.0,1962,2012.0,2.0,1600.0,5348,8,-77.006347,38.821799,100000.0
106155,2.0,10,4,2,0,2.0,2.0,Forced Air,N,0,Built Up,Common Brick,1953.0,1953,2009.0,2.0,1600.0,3466,8,-77.006347,38.821799,
106156,2.0,10,4,2,0,2.0,2.0,Forced Air,N,0,Comp Shingle,Common Brick,1953.0,1971,2017.0,3.0,1600.0,3046,8,-77.006347,38.821799,215000.0


## Split data

How the data is split into training and test sets matters a great deal. In this case, we want to keep things simple and use 2/3 of the data for training and 1/3 for testing. How would you do it?

In [5]:
# Split data to train and test

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_df.drop('CNDTN', axis=1), 
                                                    data_df['CNDTN'], 
                                                    stratify=data_df['CNDTN'], 
                                                    test_size=0.33, 
                                                    random_state=0)
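
The `stratify` argument in the split above asks scikit-learn to keep the class proportions of `CNDTN` roughly equal in both splits. The following is a minimal sketch of how that could be checked, using synthetic labels rather than the notebook's data; the `toy_*` names and the class probabilities are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for CNDTN (hypothetical data)
rng = np.random.default_rng(0)
toy_y = pd.Series(rng.choice([2.0, 3.0, 4.0], size=3000, p=[0.55, 0.35, 0.10]),
                  name='CNDTN')
toy_X = pd.DataFrame({'GBA': rng.normal(1800, 400, size=3000)})

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, stratify=toy_y, test_size=0.33, random_state=0)

# Share of each class per split; with stratify they should be nearly identical
split_shares = pd.DataFrame({
    'train': toy_y_train.value_counts(normalize=True),
    'test': toy_y_test.value_counts(normalize=True),
}).sort_index()
max_share_gap = (split_shares['train'] - split_shares['test']).abs().max()
print(split_shares)
```

Without `stratify`, rare classes can end up over- or under-represented in the test set purely by chance, which makes per-class metrics harder to interpret.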

In [6]:
# Preview the training features
X_train

Unnamed: 0,NUM_UNITS,ROOMS,BEDRM,BATHRM,HF_BATHRM,KITCHENS,STORIES,HEAT,AC,FIREPLACES,ROOF,EXTWALL,AYB,EYB,YR_SALE,GBA,LANDAREA,WARD,X,Y,PRICE
7066,1.0,6,4,2,0,1.0,2.0,Forced Air,N,0,Built Up,Common Brick,1925.0,1964,2014.0,1088.0,1038,6,-77.001118,38.904749,557000.0
77712,1.0,6,2,1,1,1.0,1.5,Forced Air,Y,0,Comp Shingle,Common Brick,1955.0,1973,2016.0,1835.0,7121,5,-76.973150,38.939045,398000.0
83667,1.0,6,3,1,0,1.0,2.0,Hot Water Rad,N,0,Metal- Sms,Vinyl Siding,1900.0,1954,,1408.0,2405,6,-76.982857,38.898478,
27870,1.0,8,5,3,1,1.0,2.0,Warm Cool,Y,1,Comp Shingle,Common Brick,1944.0,1986,2015.0,1650.0,6160,3,-77.089724,38.942936,1190000.0
3538,2.0,6,2,2,0,2.0,2.0,Hot Water Rad,Y,1,Metal- Sms,Common Brick,1900.0,1982,2010.0,1178.0,592,6,-77.024206,38.910767,505000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1398,1.0,11,5,4,1,2.0,2.0,Hot Water Rad,Y,4,Built Up,Common Brick,1895.0,1972,2017.0,3041.0,1734,1,-77.024969,38.914676,2325000.0
44449,1.0,6,3,4,1,1.0,2.0,Hot Water Rad,Y,1,Metal- Sms,Common Brick,1923.0,1986,2013.0,2450.0,2816,4,-77.034830,38.942555,399000.0
70753,1.0,6,3,1,0,1.0,2.0,Hot Water Rad,N,0,Comp Shingle,Vinyl Siding,1913.0,1954,,1340.0,4587,5,-76.985259,38.931133,
12806,1.0,9,3,2,1,1.0,2.0,Hot Water Rad,Y,2,Metal- Sms,Common Brick,1924.0,1964,2014.0,1690.0,2109,6,-76.992200,38.892800,830000.0


In [7]:
# Preview the training labels
y_train

7066     3.0
77712    3.0
83667    2.0
27870    4.0
3538     2.0
        ... 
1398     4.0
44449    3.0
70753    2.0
12806    3.0
25088    4.0
Name: CNDTN, Length: 71125, dtype: float64

In [8]:
# Preview the test features
X_test

Unnamed: 0,NUM_UNITS,ROOMS,BEDRM,BATHRM,HF_BATHRM,KITCHENS,STORIES,HEAT,AC,FIREPLACES,ROOF,EXTWALL,AYB,EYB,YR_SALE,GBA,LANDAREA,WARD,X,Y,PRICE
48744,1.0,7,3,3,0,1.0,2.0,Hot Water Rad,Y,0,Metal- Sms,Common Brick,1912.0,1967,2004.0,1372.0,1434,1,-77.024626,38.927407,330000.0
67385,1.0,8,3,2,1,1.0,1.0,Warm Cool,Y,1,Comp Shingle,Common Brick,1965.0,1973,,1562.0,8325,4,-77.004015,38.962180,
63609,4.0,16,4,4,0,4.0,2.0,Hot Water Rad,N,0,Built Up,Common Brick,1936.0,1943,1997.0,2772.0,2437,4,-77.016860,38.947051,142000.0
62180,1.0,9,3,2,0,1.0,2.0,Hot Water Rad,Y,1,Slate,Brick/Siding,1936.0,1947,2001.0,2064.0,4988,4,-77.014772,38.963238,229000.0
35883,1.0,9,5,3,1,1.0,2.5,Hot Water Rad,Y,1,Clay Tile,Stucco,1924.0,1988,2016.0,3391.0,5500,3,-77.065544,38.928214,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
43968,1.0,7,4,3,1,1.0,2.0,Hot Water Rad,Y,0,Metal- Sms,Common Brick,1915.0,1986,2013.0,1600.0,1660,4,-77.033713,38.945601,654000.0
65774,1.0,5,3,1,1,1.0,2.0,Hot Water Rad,N,0,Built Up,Common Brick,1947.0,1947,,1088.0,1903,4,-77.010439,38.967161,
95500,1.0,6,3,2,1,1.0,1.5,Hot Water Rad,N,1,Comp Shingle,Common Brick,1952.0,1964,2001.0,1539.0,6940,7,-76.952876,38.868919,200000.0
87022,1.0,8,4,1,0,1.0,2.0,Hot Water Rad,N,0,Metal- Sms,Shingle,1890.0,1943,2004.0,1404.0,3366,7,-76.936150,38.904842,


In [9]:
# Preview the test labels
y_test

48744    2.0
67385    3.0
63609    2.0
62180    3.0
35883    4.0
        ... 
43968    3.0
65774    2.0
95500    2.0
87022    2.0
7266     2.0
Name: CNDTN, Length: 35033, dtype: float64

## Fix nulls

Our data still has a few nulls. Let's take a look at them and see what we can do about them.

In [10]:
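# Check the fraction of nulls per column in the training features.
# Note: num_rows is the row count of the full dataset (data_df), so these
# fractions are relative to all rows rather than to X_train alone.
num_rows = data_df.shape[0]
pd.DataFrame(X_train.isnull().sum() \
             .apply(lambda x: x/num_rows), columns=['null_perc']) \
             .sort_values(by='null_perc', ascending=False)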

Unnamed: 0,null_perc
PRICE,0.305997
YR_SALE,0.141195
EXTWALL,0.0
Y,0.0
X,0.0
WARD,0.0
LANDAREA,0.0
GBA,0.0
EYB,0.0
AYB,0.0


There are multiple ways to handle missing data. To keep things simple, we will fill the missing values with the column means.

In [11]:
# Impute the YR_SALE and PRICE missing values with their means in the train data

from sklearn.impute import SimpleImputer

imp_cols = ['YR_SALE', 'PRICE']

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(X_train[imp_cols])
imp_train = pd.DataFrame(imp_mean.transform(X_train[imp_cols]), columns=imp_cols)

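# Append the imputed columns; X_train now holds two columns per name in imp_cols
X_train = pd.concat([X_train.reset_index(drop=True), imp_train], axis=1)
# Caution: drop() removes every column with a matching name, so both the original
# and the imputed YR_SALE and PRICE columns are dropped (see the preview below)
X_train = X_train.drop(imp_cols, axis=1)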
X_train


Unnamed: 0,NUM_UNITS,ROOMS,BEDRM,BATHRM,HF_BATHRM,KITCHENS,STORIES,HEAT,AC,FIREPLACES,ROOF,EXTWALL,AYB,EYB,GBA,LANDAREA,WARD,X,Y
0,1.0,6,4,2,0,1.0,2.0,Forced Air,N,0,Built Up,Common Brick,1925.0,1964,1088.0,1038,6,-77.001118,38.904749
1,1.0,6,2,1,1,1.0,1.5,Forced Air,Y,0,Comp Shingle,Common Brick,1955.0,1973,1835.0,7121,5,-76.973150,38.939045
2,1.0,6,3,1,0,1.0,2.0,Hot Water Rad,N,0,Metal- Sms,Vinyl Siding,1900.0,1954,1408.0,2405,6,-76.982857,38.898478
3,1.0,8,5,3,1,1.0,2.0,Warm Cool,Y,1,Comp Shingle,Common Brick,1944.0,1986,1650.0,6160,3,-77.089724,38.942936
4,2.0,6,2,2,0,2.0,2.0,Hot Water Rad,Y,1,Metal- Sms,Common Brick,1900.0,1982,1178.0,592,6,-77.024206,38.910767
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71120,1.0,11,5,4,1,2.0,2.0,Hot Water Rad,Y,4,Built Up,Common Brick,1895.0,1972,3041.0,1734,1,-77.024969,38.914676
71121,1.0,6,3,4,1,1.0,2.0,Hot Water Rad,Y,1,Metal- Sms,Common Brick,1923.0,1986,2450.0,2816,4,-77.034830,38.942555
71122,1.0,6,3,1,0,1.0,2.0,Hot Water Rad,N,0,Comp Shingle,Vinyl Siding,1913.0,1954,1340.0,4587,5,-76.985259,38.931133
71123,1.0,9,3,2,1,1.0,2.0,Hot Water Rad,Y,2,Metal- Sms,Common Brick,1924.0,1964,1690.0,2109,6,-76.992200,38.892800


In [12]:
# Impute the YR_SALE and PRICE missing values with their means in the test data

imp_test = pd.DataFrame(imp_mean.transform(X_test[imp_cols]), columns=imp_cols)

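# Append the imputed columns; X_test now holds two columns per name in imp_cols
X_test = pd.concat([X_test.reset_index(drop=True), imp_test], axis=1)
# Caution: drop() removes every column with a matching name, so both the original
# and the imputed YR_SALE and PRICE columns are dropped (see the preview below)
X_test = X_test.drop(imp_cols, axis=1)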
X_test

Unnamed: 0,NUM_UNITS,ROOMS,BEDRM,BATHRM,HF_BATHRM,KITCHENS,STORIES,HEAT,AC,FIREPLACES,ROOF,EXTWALL,AYB,EYB,GBA,LANDAREA,WARD,X,Y
0,1.0,7,3,3,0,1.0,2.0,Hot Water Rad,Y,0,Metal- Sms,Common Brick,1912.0,1967,1372.0,1434,1,-77.024626,38.927407
1,1.0,8,3,2,1,1.0,1.0,Warm Cool,Y,1,Comp Shingle,Common Brick,1965.0,1973,1562.0,8325,4,-77.004015,38.962180
2,4.0,16,4,4,0,4.0,2.0,Hot Water Rad,N,0,Built Up,Common Brick,1936.0,1943,2772.0,2437,4,-77.016860,38.947051
3,1.0,9,3,2,0,1.0,2.0,Hot Water Rad,Y,1,Slate,Brick/Siding,1936.0,1947,2064.0,4988,4,-77.014772,38.963238
4,1.0,9,5,3,1,1.0,2.5,Hot Water Rad,Y,1,Clay Tile,Stucco,1924.0,1988,3391.0,5500,3,-77.065544,38.928214
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35028,1.0,7,4,3,1,1.0,2.0,Hot Water Rad,Y,0,Metal- Sms,Common Brick,1915.0,1986,1600.0,1660,4,-77.033713,38.945601
35029,1.0,5,3,1,1,1.0,2.0,Hot Water Rad,N,0,Built Up,Common Brick,1947.0,1947,1088.0,1903,4,-77.010439,38.967161
35030,1.0,6,3,2,1,1.0,1.5,Hot Water Rad,N,1,Comp Shingle,Common Brick,1952.0,1964,1539.0,6940,7,-76.952876,38.868919
35031,1.0,8,4,1,0,1.0,2.0,Hot Water Rad,N,0,Metal- Sms,Shingle,1890.0,1943,1404.0,3366,7,-76.936150,38.904842
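
In the previews above, `YR_SALE` and `PRICE` no longer appear after the imputation cells: the `concat` creates duplicate column names, and `drop(imp_cols, axis=1)` then removes both copies. If the intent is to keep the imputed features, one possible alternative is to write the transformed values back into the existing columns. The sketch below shows this on hypothetical toy frames; it is not the notebook's code, and the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

imp_cols = ['YR_SALE', 'PRICE']

# Hypothetical train/test frames with missing values
toy_train = pd.DataFrame({'GBA': [1000.0, 1500.0, 2000.0],
                          'YR_SALE': [2010.0, np.nan, 2014.0],
                          'PRICE': [300000.0, 500000.0, np.nan]})
toy_test = pd.DataFrame({'GBA': [1200.0, 1800.0],
                         'YR_SALE': [np.nan, 2016.0],
                         'PRICE': [np.nan, 450000.0]})

# Fit on the training split only, so test rows do not influence the means
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit(toy_train[imp_cols])

# Overwrite the columns in place; the index and column order are kept
toy_train[imp_cols] = imp_mean.transform(toy_train[imp_cols])
toy_test[imp_cols] = imp_mean.transform(toy_test[imp_cols])
print(toy_test)
```

Because the values are assigned in place, no `reset_index` is needed and the features stay aligned with the label series.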


## Encode categorical variables

There are two main ways to encode categorical values in your data: one-hot encoding and ordinal (label) encoding. Let's decide which strategy to use for each column and transform the categorical variables into numeric values.
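
To make the difference concrete, here is a minimal sketch on a hypothetical three-row frame. One-hot encoding produces one 0/1 column per category and implies no order, while ordinal encoding maps each category to a single number in a chosen order. The `toy` frame and the `['N', 'Y']` ordering are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

toy = pd.DataFrame({'ROOF': ['Slate', 'Built Up', 'Slate'],
                    'AC': ['N', 'Y', 'Y']})

# One-hot: one 0/1 column per category, no order implied.
# The default output is a sparse matrix, so convert it explicitly.
one_hot = OneHotEncoder(handle_unknown='ignore')
roof_encoded = one_hot.fit_transform(toy[['ROOF']]).toarray()

# Ordinal: a single numeric column following the given category order
ordinal = OrdinalEncoder(categories=[['N', 'Y']])
ac_encoded = ordinal.fit_transform(toy[['AC']])

print(one_hot.categories_[0], roof_encoded, ac_encoded, sep='\n')
```

One-hot encoding tends to suit columns such as roof type, where categories have no natural order; ordinal encoding is a compact choice when an order exists or when the column is binary.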

### One Hot Encoder

In [13]:
# One hot encoding of the training data

from sklearn.preprocessing import OneHotEncoder

one_hot_cols = ['HEAT', 'ROOF', 'EXTWALL']

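# Written for scikit-learn < 1.2; newer releases rename `sparse` to `sparse_output`
# and replace get_feature_names() with get_feature_names_out()
one_hot_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)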
one_hot_encoder.fit(X_train[one_hot_cols])
encoded_train = pd.DataFrame(one_hot_encoder.transform(X_train[one_hot_cols]), columns=one_hot_encoder.get_feature_names())

X_train = pd.concat([X_train.reset_index(drop=True), encoded_train], axis=1)
X_train = X_train.drop(one_hot_cols, axis=1)
X_train



Unnamed: 0,NUM_UNITS,ROOMS,BEDRM,BATHRM,HF_BATHRM,KITCHENS,STORIES,AC,FIREPLACES,AYB,EYB,GBA,LANDAREA,WARD,X,Y,x0_Air Exchng,x0_Air-Oil,x0_Elec Base Brd,x0_Electric Rad,x0_Evp Cool,x0_Forced Air,x0_Gravity Furnac,x0_Hot Water Rad,x0_Ht Pump,x0_Ind Unit,x0_No Data,x0_Wall Furnace,x0_Warm Cool,x0_Water Base Brd,x1_Built Up,x1_Clay Tile,x1_Comp Shingle,x1_Composition Ro,x1_Concrete,x1_Concrete Tile,x1_Metal- Cpr,x1_Metal- Pre,x1_Metal- Sms,x1_Neopren,x1_Shake,x1_Shingle,x1_Slate,x1_Typical,x1_Water Proof,x1_Wood- FS,x2_Adobe,x2_Aluminum,x2_Brick Veneer,x2_Brick/Siding,x2_Brick/Stone,x2_Brick/Stucco,x2_Common Brick,x2_Concrete,x2_Concrete Block,x2_Default,x2_Face Brick,x2_Hardboard,x2_Metal Siding,x2_Plywood,x2_Rustic Log,x2_Shingle,x2_Stone,x2_Stone Veneer,x2_Stone/Siding,x2_Stone/Stucco,x2_Stucco,x2_Stucco Block,x2_Vinyl Siding,x2_Wood Siding
0,1.0,6,4,2,0,1.0,2.0,N,0,1925.0,1964,1088.0,1038,6,-77.001118,38.904749,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,6,2,1,1,1.0,1.5,Y,0,1955.0,1973,1835.0,7121,5,-76.973150,38.939045,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,6,3,1,0,1.0,2.0,N,0,1900.0,1954,1408.0,2405,6,-76.982857,38.898478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,8,5,3,1,1.0,2.0,Y,1,1944.0,1986,1650.0,6160,3,-77.089724,38.942936,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2.0,6,2,2,0,2.0,2.0,Y,1,1900.0,1982,1178.0,592,6,-77.024206,38.910767,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71120,1.0,11,5,4,1,2.0,2.0,Y,4,1895.0,1972,3041.0,1734,1,-77.024969,38.914676,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
71121,1.0,6,3,4,1,1.0,2.0,Y,1,1923.0,1986,2450.0,2816,4,-77.034830,38.942555,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
71122,1.0,6,3,1,0,1.0,2.0,N,0,1913.0,1954,1340.0,4587,5,-76.985259,38.931133,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
71123,1.0,9,3,2,1,1.0,2.0,Y,2,1924.0,1964,1690.0,2109,6,-76.992200,38.892800,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# One hot encoding of the test data

encoded_test = pd.DataFrame(one_hot_encoder.transform(X_test[one_hot_cols]), columns=one_hot_encoder.get_feature_names())

X_test = pd.concat([X_test.reset_index(drop=True), encoded_test], axis=1)
X_test = X_test.drop(one_hot_cols, axis=1)
X_test

Unnamed: 0,NUM_UNITS,ROOMS,BEDRM,BATHRM,HF_BATHRM,KITCHENS,STORIES,AC,FIREPLACES,AYB,EYB,GBA,LANDAREA,WARD,X,Y,x0_Air Exchng,x0_Air-Oil,x0_Elec Base Brd,x0_Electric Rad,x0_Evp Cool,x0_Forced Air,x0_Gravity Furnac,x0_Hot Water Rad,x0_Ht Pump,x0_Ind Unit,x0_No Data,x0_Wall Furnace,x0_Warm Cool,x0_Water Base Brd,x1_Built Up,x1_Clay Tile,x1_Comp Shingle,x1_Composition Ro,x1_Concrete,x1_Concrete Tile,x1_Metal- Cpr,x1_Metal- Pre,x1_Metal- Sms,x1_Neopren,x1_Shake,x1_Shingle,x1_Slate,x1_Typical,x1_Water Proof,x1_Wood- FS,x2_Adobe,x2_Aluminum,x2_Brick Veneer,x2_Brick/Siding,x2_Brick/Stone,x2_Brick/Stucco,x2_Common Brick,x2_Concrete,x2_Concrete Block,x2_Default,x2_Face Brick,x2_Hardboard,x2_Metal Siding,x2_Plywood,x2_Rustic Log,x2_Shingle,x2_Stone,x2_Stone Veneer,x2_Stone/Siding,x2_Stone/Stucco,x2_Stucco,x2_Stucco Block,x2_Vinyl Siding,x2_Wood Siding
0,1.0,7,3,3,0,1.0,2.0,Y,0,1912.0,1967,1372.0,1434,1,-77.024626,38.927407,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,8,3,2,1,1.0,1.0,Y,1,1965.0,1973,1562.0,8325,4,-77.004015,38.962180,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,4.0,16,4,4,0,4.0,2.0,N,0,1936.0,1943,2772.0,2437,4,-77.016860,38.947051,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,9,3,2,0,1.0,2.0,Y,1,1936.0,1947,2064.0,4988,4,-77.014772,38.963238,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,9,5,3,1,1.0,2.5,Y,1,1924.0,1988,3391.0,5500,3,-77.065544,38.928214,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35028,1.0,7,4,3,1,1.0,2.0,Y,0,1915.0,1986,1600.0,1660,4,-77.033713,38.945601,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
35029,1.0,5,3,1,1,1.0,2.0,N,0,1947.0,1947,1088.0,1903,4,-77.010439,38.967161,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
35030,1.0,6,3,2,1,1.0,1.5,N,1,1952.0,1964,1539.0,6940,7,-76.952876,38.868919,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
35031,1.0,8,4,1,0,1.0,2.0,N,0,1890.0,1943,1404.0,3366,7,-76.936150,38.904842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Ordinal encoding

In [16]:
# Ordinal encoding of the training data

from sklearn.preprocessing import OrdinalEncoder

ordinal_cols = ['AC']

ordinal_encoder = OrdinalEncoder(categories=[['0', 'N', 'Y']], handle_unknown='use_encoded_value', unknown_value=-1)
ordinal_encoder.fit(X_train[ordinal_cols])
encoded_train = pd.DataFrame(ordinal_encoder.transform(X_train[ordinal_cols]), columns=['AC'])
X_train = X_train.drop(ordinal_cols, axis=1).reset_index(drop=True)

X_train = pd.concat([X_train, encoded_train], axis=1)
X_train

Unnamed: 0,NUM_UNITS,ROOMS,BEDRM,BATHRM,HF_BATHRM,KITCHENS,STORIES,FIREPLACES,AYB,EYB,GBA,LANDAREA,WARD,X,Y,x0_Air Exchng,x0_Air-Oil,x0_Elec Base Brd,x0_Electric Rad,x0_Evp Cool,x0_Forced Air,x0_Gravity Furnac,x0_Hot Water Rad,x0_Ht Pump,x0_Ind Unit,x0_No Data,x0_Wall Furnace,x0_Warm Cool,x0_Water Base Brd,x1_Built Up,x1_Clay Tile,x1_Comp Shingle,x1_Composition Ro,x1_Concrete,x1_Concrete Tile,x1_Metal- Cpr,x1_Metal- Pre,x1_Metal- Sms,x1_Neopren,x1_Shake,x1_Shingle,x1_Slate,x1_Typical,x1_Water Proof,x1_Wood- FS,x2_Adobe,x2_Aluminum,x2_Brick Veneer,x2_Brick/Siding,x2_Brick/Stone,x2_Brick/Stucco,x2_Common Brick,x2_Concrete,x2_Concrete Block,x2_Default,x2_Face Brick,x2_Hardboard,x2_Metal Siding,x2_Plywood,x2_Rustic Log,x2_Shingle,x2_Stone,x2_Stone Veneer,x2_Stone/Siding,x2_Stone/Stucco,x2_Stucco,x2_Stucco Block,x2_Vinyl Siding,x2_Wood Siding,AC
0,1.0,6,4,2,0,1.0,2.0,0,1925.0,1964,1088.0,1038,6,-77.001118,38.904749,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1.0,6,2,1,1,1.0,1.5,0,1955.0,1973,1835.0,7121,5,-76.973150,38.939045,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
2,1.0,6,3,1,0,1.0,2.0,0,1900.0,1954,1408.0,2405,6,-76.982857,38.898478,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,1.0,8,5,3,1,1.0,2.0,1,1944.0,1986,1650.0,6160,3,-77.089724,38.942936,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
4,2.0,6,2,2,0,2.0,2.0,1,1900.0,1982,1178.0,592,6,-77.024206,38.910767,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71120,1.0,11,5,4,1,2.0,2.0,4,1895.0,1972,3041.0,1734,1,-77.024969,38.914676,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
71121,1.0,6,3,4,1,1.0,2.0,1,1923.0,1986,2450.0,2816,4,-77.034830,38.942555,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
71122,1.0,6,3,1,0,1.0,2.0,0,1913.0,1954,1340.0,4587,5,-76.985259,38.931133,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
71123,1.0,9,3,2,1,1.0,2.0,2,1924.0,1964,1690.0,2109,6,-76.992200,38.892800,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0


In [18]:
# Ordinal encoding of the test data

encoded_test = pd.DataFrame(ordinal_encoder.transform(X_test[ordinal_cols]), columns=['AC'])
X_test = X_test.drop(ordinal_cols, axis=1).reset_index(drop=True)

X_test = pd.concat([X_test, encoded_test], axis=1)
X_test

Unnamed: 0,NUM_UNITS,ROOMS,BEDRM,BATHRM,HF_BATHRM,KITCHENS,STORIES,FIREPLACES,AYB,EYB,GBA,LANDAREA,WARD,X,Y,x0_Air Exchng,x0_Air-Oil,x0_Elec Base Brd,x0_Electric Rad,x0_Evp Cool,x0_Forced Air,x0_Gravity Furnac,x0_Hot Water Rad,x0_Ht Pump,x0_Ind Unit,x0_No Data,x0_Wall Furnace,x0_Warm Cool,x0_Water Base Brd,x1_Built Up,x1_Clay Tile,x1_Comp Shingle,x1_Composition Ro,x1_Concrete,x1_Concrete Tile,x1_Metal- Cpr,x1_Metal- Pre,x1_Metal- Sms,x1_Neopren,x1_Shake,x1_Shingle,x1_Slate,x1_Typical,x1_Water Proof,x1_Wood- FS,x2_Adobe,x2_Aluminum,x2_Brick Veneer,x2_Brick/Siding,x2_Brick/Stone,x2_Brick/Stucco,x2_Common Brick,x2_Concrete,x2_Concrete Block,x2_Default,x2_Face Brick,x2_Hardboard,x2_Metal Siding,x2_Plywood,x2_Rustic Log,x2_Shingle,x2_Stone,x2_Stone Veneer,x2_Stone/Siding,x2_Stone/Stucco,x2_Stucco,x2_Stucco Block,x2_Vinyl Siding,x2_Wood Siding,AC
0,1.0,7,3,3,0,1.0,2.0,0,1912.0,1967,1372.0,1434,1,-77.024626,38.927407,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
1,1.0,8,3,2,1,1.0,1.0,1,1965.0,1973,1562.0,8325,4,-77.004015,38.962180,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
2,4.0,16,4,4,0,4.0,2.0,0,1936.0,1943,2772.0,2437,4,-77.016860,38.947051,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,1.0,9,3,2,0,1.0,2.0,1,1936.0,1947,2064.0,4988,4,-77.014772,38.963238,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
4,1.0,9,5,3,1,1.0,2.5,1,1924.0,1988,3391.0,5500,3,-77.065544,38.928214,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35028,1.0,7,4,3,1,1.0,2.0,0,1915.0,1986,1600.0,1660,4,-77.033713,38.945601,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
35029,1.0,5,3,1,1,1.0,2.0,0,1947.0,1947,1088.0,1903,4,-77.010439,38.967161,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
35030,1.0,6,3,2,1,1.0,1.5,1,1952.0,1964,1539.0,6940,7,-76.952876,38.868919,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
35031,1.0,8,4,1,0,1.0,2.0,0,1890.0,1943,1404.0,3366,7,-76.936150,38.904842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
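
The imputation and encoding steps above all follow the same pattern: fit on the training split, then apply the fitted transformer to both splits. scikit-learn's `ColumnTransformer` can bundle such steps into one object. The following is only a sketch of that idea on hypothetical miniature frames, not the approach this notebook takes; the `toy_*` frames and the step names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical miniature frames reusing a few of the notebook's column names
toy_train = pd.DataFrame({
    'HEAT': ['Forced Air', 'Hot Water Rad', 'Warm Cool', 'Forced Air'],
    'AC': ['N', 'Y', 'Y', '0'],
    'PRICE': [557000.0, np.nan, 398000.0, 505000.0],
    'GBA': [1088.0, 1835.0, 1408.0, 1650.0],
})
toy_test = pd.DataFrame({
    'HEAT': ['Ht Pump', 'Warm Cool'],   # 'Ht Pump' is unseen in toy_train
    'AC': ['Y', 'N'],
    'PRICE': [np.nan, 330000.0],
    'GBA': [1372.0, 1562.0],
})

preprocess = ColumnTransformer(
    transformers=[
        ('impute', SimpleImputer(strategy='mean'), ['PRICE']),
        ('one_hot', OneHotEncoder(handle_unknown='ignore'), ['HEAT']),
        ('ordinal', OrdinalEncoder(categories=[['0', 'N', 'Y']]), ['AC']),
    ],
    remainder='passthrough',   # GBA passes through unchanged
    sparse_threshold=0.0,      # always return a dense array
)

train_matrix = preprocess.fit_transform(toy_train)  # statistics are learned here
test_matrix = preprocess.transform(toy_test)        # and only reused here
print(train_matrix.shape, test_matrix.shape)
```

The output columns follow the order of the transformers, with the passthrough columns last. The unseen `'Ht Pump'` category is encoded as all zeros because of `handle_unknown='ignore'`.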


## Train model

There are countless training algorithms that we could use. Take a moment to think about which ones would be a good fit for this case, and why.

In [19]:
# Train a simple model

from sklearn import tree

clf = tree.DecisionTreeClassifier(max_depth=15, random_state=0)
clf = clf.fit(X_train, y_train)

'finished training'

'finished training'
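
One way to compare candidate algorithms without touching the test set is cross-validation on the training data. The sketch below illustrates the idea on synthetic data generated with `make_classification`; the candidate list, the `toy_*` names, and the hyperparameters are assumptions for illustration, not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the encoded training set (hypothetical)
toy_X, toy_y = make_classification(n_samples=600, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)

candidates = {
    'shallow_tree': DecisionTreeClassifier(max_depth=3, random_state=0),
    'deep_tree': DecisionTreeClassifier(max_depth=15, random_state=0),
    'logistic_regression': LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated accuracy per candidate, on training data only
cv_scores = {name: cross_val_score(model, toy_X, toy_y, cv=5).mean()
             for name, model in candidates.items()}
for name, score in cv_scores.items():
    print(f'{name}: {score:.3f}')
```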

## Test model

In order to evaluate the performance of the model we want to obtain the predictions on the unseen data (test set).

In [20]:
# Predict the labels on the test set

y_pred = clf.predict(X_test)

y_pred

array([3., 3., 2., ..., 2., 2., 2.])

## Evaluate model

Finally, we can measure the performance of the model by comparing the predictions with the true labels of the test set.

In [21]:
# Find the classification metrics on the test set

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.18      0.05      0.08        57
         1.0       0.19      0.07      0.11       433
         2.0       0.81      0.83      0.82     19120
         3.0       0.66      0.70      0.68     12314
         4.0       0.68      0.47      0.56      2672
         5.0       0.82      0.75      0.78       437

    accuracy                           0.75     35033
   macro avg       0.56      0.48      0.50     35033
weighted avg       0.74      0.75      0.74     35033
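
The two average rows in the report differ in how they weight the classes. The macro average treats every class equally, so the rare and poorly predicted classes 0.0 and 1.0 pull it down. The weighted average weights each class by its support, so it is dominated by classes 2.0 and 3.0. As a worked check, both f1 averages can be approximately recomputed from the rounded per-class values printed above:

```python
# Per-class f1-score and support, copied from the report above
f1_scores = [0.08, 0.11, 0.82, 0.68, 0.56, 0.78]
supports = [57, 433, 19120, 12314, 2672, 437]

# Macro average: every class counts equally, regardless of its size
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f * s for f, s in zip(f1_scores, supports)) / sum(supports)

print(f'macro f1 ~ {macro_f1:.3f}, weighted f1 ~ {weighted_f1:.3f}')
```

Because the inputs are already rounded to two decimals, the results only approximate the report's 0.50 and 0.74.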

