<a href="https://colab.research.google.com/github/cboyda/MachineLearning/blob/main/PA4_Team1_W23.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Assignment #4: Decision Tree**

Team member names:

*  Brett Adams
*  Cailenys Leslie
*  Clinton Boyda 
*  Tanvir Hossain
*  Ram Dershan

Dataset: 
[New York City Airbnb Open Data](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data)

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
from  sklearn import neighbors
import plotly.graph_objects as go
import math
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

In [2]:
# Connect to Dataset

#original filename = "https://raw.githubusercontent.com/cboyda/MachineLearning/main/AB_NYC_2019.csv"
#df = pd.read_csv(filename)

# load both data sets in
original = "https://raw.githubusercontent.com/cboyda/MachineLearning/main/AB_NYC_2019.csv"
df_original = pd.read_csv(original)
additional = "https://raw.githubusercontent.com/cboyda/MachineLearning/main/full_nyc_dataset_cleaned_table-1.csv"
df_additional = pd.read_csv(additional)

In [3]:
# Merge the two datasets with an inner join, validate that no duplicate id values exist for a one to one join
df = pd.merge(df_original, df_additional, how = "inner", on = "id", validate="one_to_one", suffixes=("_original","_additional"))
df.shape

(16005, 22)

In [4]:
df.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type_original', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'property_type', 'room_type_additional',
       'accommodates', 'bathrooms_text', 'bedrooms', 'beds'],
      dtype='object')

# **Data Cleaning**

In [5]:
# check value counts for property_type
df['property_type'].value_counts()

Entire rental unit                    6975
Private room in rental unit           5153
Private room in home                   844
Entire home                            513
Entire condo                           418
Private room in townhouse              352
Entire loft                            326
Entire townhouse                       297
Private room in condo                  180
Shared room in rental unit             178
Private room in loft                   149
Entire guest suite                     133
Entire serviced apartment               98
Room in boutique hotel                  68
Room in hotel                           56
Private room in guest suite             37
Entire place                            33
Room in serviced apartment              24
Shared room in loft                     19
Entire guesthouse                       19
Private room                            18
Private room in resort                  17
Private room in bed and breakfast       14
Shared room

There are property types that we do not want to consider in our analysis (Boats, Caves and Villa's) so we will remove these examples.

In [6]:
# Check shape before dropping examples
df.shape

(16005, 22)

In [7]:
df = df.drop(df[(df['property_type'] == 'Cave') | (df['property_type'] == 'Boat') | 
                (df['property_type'] == 'Floor') | (df['property_type'] == 'Private room in farm stay') |
                (df['property_type'] == 'Entire villa') | (df['property_type'] == 'Private room in houseboat') |
                (df['property_type'] == 'Private room in villa') | (df['property_type'] == 'Private room in tent') |
                (df['property_type'] == 'Houseboat')].index)

In [8]:
# Check shape after dropping examples
df.shape

(15986, 22)

In [9]:
# assess new value counts for property_type
df['property_type'].value_counts()

Entire rental unit                    6975
Private room in rental unit           5153
Private room in home                   844
Entire home                            513
Entire condo                           418
Private room in townhouse              352
Entire loft                            326
Entire townhouse                       297
Private room in condo                  180
Shared room in rental unit             178
Private room in loft                   149
Entire guest suite                     133
Entire serviced apartment               98
Room in boutique hotel                  68
Room in hotel                           56
Private room in guest suite             37
Entire place                            33
Room in serviced apartment              24
Entire guesthouse                       19
Shared room in loft                     19
Private room                            18
Private room in resort                  17
Private room in bed and breakfast       14
Shared room

In [10]:
# extract the numerical values from the bathroom_text column for consideration 
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Half-bath', 0.5, inplace=True)
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Shared half-bath', 0.5, inplace=True)
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Private half-bath', 0.5, inplace=True)
df['bathrooms'] = df['bathrooms_text'].str.extract(r'\b([\d.]+)\b')

In [11]:
# let's look closer at the property_type values, perhaps this can be simplified
print(df['property_type'].unique())
print("Number of property_type unique values:",df['property_type'].nunique())

['Entire rental unit' 'Private room in rental unit'
 'Private room in townhouse' 'Entire guest suite' 'Entire loft'
 'Private room in home' 'Entire condo' 'Private room in condo'
 'Private room in loft' 'Entire home' 'Entire townhouse'
 'Private room in bed and breakfast' 'Entire guesthouse'
 'Private room in guest suite' 'Room in boutique hotel'
 'Shared room in rental unit' 'Shared room in home' 'Private room'
 'Entire place' 'Entire serviced apartment' 'Private room in guesthouse'
 'Room in serviced apartment' 'Entire cottage' 'Shared room in loft'
 'Private room in serviced apartment' 'Entire bungalow' 'Room in hotel'
 'Shared room in townhouse' 'Private room in hostel'
 'Private room in bungalow' 'Shared room in condo'
 'Private room in resort' 'Shared room in floor' 'Private room in floor'
 'Tiny home' 'Entire home/apt' 'Shared room in guest suite'
 'Room in resort' 'Room in aparthotel' 'Shared room in guesthouse'
 'Room in bed and breakfast']
Number of property_type unique value

In [12]:
df['property_type'] = df.property_type.str.replace(r'(^.*Private room.*$)', 'Private Room')
#df.property_type.replace(['Private room in rental unit', 'female'], [1, 0], inplace=True)
#replace_property_values = {'Small' : 1, 'Medium' : 2, 'High' : 3 }
#replace_property_values = df.loc[df['property_type'].str.contains('Private room', case=False), 'property_type'] = 'Private Room'

In [13]:
df['property_type'] = df.property_type.str.replace(r'(^.*Entire.*$)', 'Entire Unit')

In [14]:
df['property_type'] = df.property_type.str.replace(r'(^.*Shared room.*$)', 'Shared Room')

In [15]:
df['property_type'] = df.property_type.str.replace(r'(^.*Room in.*$)', 'Room In')

In [16]:
print(df['property_type'].unique())
print("Number of property_type unique values:",df['property_type'].nunique())

['Entire Unit' 'Private Room' 'Room In' 'Shared Room' 'Tiny home']
Number of property_type unique values: 5


In [17]:
# Convert bathroom to float type
df['bathrooms'] = df['bathrooms'].astype(float)

In [18]:
# drop bathroom_text, beds, and duplicated room_type column
df.drop(['bathrooms_text', 'room_type_additional', 'beds'], axis = 1, inplace = True)

In [19]:
# drop suffix from room_type_original
df = df.rename(columns = {'room_type_original' : 'room_type'})

In [20]:
# check for null values
df.isnull().sum()

id                                   0
name                                11
host_id                              0
host_name                           10
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       3010
reviews_per_month                 3010
calculated_host_listings_count       0
availability_365                     0
property_type                        0
accommodates                         0
bedrooms                          1562
bathrooms                           52
dtype: int64

For bedrooms and bathrooms with null values, fill with zero as properties can have no bedrooms or bathrooms

In [21]:
df[['bedrooms', 'bathrooms']] = df[['bedrooms', 'bathrooms']].fillna(value=0)

In [22]:
# Check null values again to confirm
df.isnull().sum()

id                                   0
name                                11
host_id                              0
host_name                           10
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       3010
reviews_per_month                 3010
calculated_host_listings_count       0
availability_365                     0
property_type                        0
accommodates                         0
bedrooms                             0
bathrooms                            0
dtype: int64

All other columns with null values are not important for this analysis as these columns will be dropped.

In [23]:
df.duplicated().any()

False

In [24]:
# any duplicates in the data?
duplicate_rows = df.duplicated()
df_no_dups = df[~duplicate_rows]
print ("There are " + str(duplicate_rows.sum()) + " duplicate rows in our dataframe that need to be considered.")

There are 0 duplicate rows in our dataframe that need to be considered.


In [25]:
df.shape

(15986, 20)

In [26]:
# really only needed if duplicate_rows > 0
df = df_no_dups
df.reset_index(inplace=True)

In [27]:
df.shape

(15986, 21)

In [28]:
df_no_dups

Unnamed: 0,index,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,...,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,property_type,accommodates,bedrooms,bathrooms
0,0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,...,1,45,2019-05-21,0.38,2,355,Entire Unit,1,0.0,1.0
1,1,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68688,-73.95596,Private room,...,45,49,2017-10-05,0.40,1,0,Private Room,2,1.0,0.0
2,2,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Hell's Kitchen,40.76489,-73.98493,Private room,...,2,430,2019-06-24,3.47,1,220,Private Room,2,1.0,1.0
3,3,5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen,Manhattan,Upper West Side,40.80178,-73.96723,Private room,...,2,118,2017-07-21,0.99,1,0,Private Room,1,1.0,1.0
4,4,5803,"Lovely Room 1, Garden, Best Area, Legal rental",9744,Laurie,Brooklyn,South Slope,40.66829,-73.98779,Private room,...,4,167,2019-06-24,1.34,3,314,Private Room,2,1.0,1.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15981,16000,36457832,"❥NYC Apt: 4min/subway, 25m/city, 20m/LGA,JFK❥",63272360,Annie Lawrence,Queens,Woodhaven,40.69482,-73.86618,Entire home/apt,...,3,0,,,6,300,Entire Unit,2,1.0,1.0
15982,16001,36471896,Private Bedroom & PRIVATE BATHROOM in Manhattan,23548340,Sarah,Manhattan,Upper East Side,40.77192,-73.95369,Private room,...,1,0,,,1,2,Private Room,2,1.0,1.0
15983,16002,36477307,Brooklyn paradise,241945355,Clement & Rose,Brooklyn,Flatlands,40.63116,-73.92616,Entire home/apt,...,1,0,,,2,363,Entire Unit,6,2.0,1.0
15984,16003,36481615,"Peaceful space in Greenpoint, BK",274298453,Adrien,Brooklyn,Greenpoint,40.72585,-73.94001,Private room,...,6,0,,,1,15,Private Room,2,1.0,1.0


# **Feature Scaling**


In [29]:
df.columns

Index(['index', 'id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'property_type', 'accommodates', 'bedrooms',
       'bathrooms'],
      dtype='object')

In [30]:
# drop all columns not necessary
# over simplifying for our first iteration

df.drop(['index','neighbourhood','name','host_name','number_of_reviews','last_review','reviews_per_month',
         'calculated_host_listings_count','id','host_id','latitude','longitude'], axis=1, inplace = True)
# df.drop('a', inplace=True, axis=1)

In [31]:
#define clean as duplicate
df_clean = df.copy()

In [32]:
df_clean

Unnamed: 0,neighbourhood_group,room_type,price,minimum_nights,availability_365,property_type,accommodates,bedrooms,bathrooms
0,Manhattan,Entire home/apt,225,1,355,Entire Unit,1,0.0,1.0
1,Brooklyn,Private room,60,45,0,Private Room,2,1.0,0.0
2,Manhattan,Private room,79,2,220,Private Room,2,1.0,1.0
3,Manhattan,Private room,79,2,0,Private Room,1,1.0,1.0
4,Brooklyn,Private room,89,4,314,Private Room,2,1.0,1.5
...,...,...,...,...,...,...,...,...,...
15981,Queens,Entire home/apt,85,3,300,Entire Unit,2,1.0,1.0
15982,Manhattan,Private room,95,1,2,Private Room,2,1.0,1.0
15983,Brooklyn,Entire home/apt,170,1,363,Entire Unit,6,2.0,1.0
15984,Brooklyn,Private room,54,6,15,Private Room,2,1.0,1.0


In [33]:
df_clean.shape

(15986, 9)

In [34]:
zero_availability = df_clean.loc[df_clean.availability_365 == 0, 'availability_365'].index
# zero availability means unit is NOT available so best drop from out model
df_clean.drop(zero_availability,axis=0,inplace=True)

DROP units that are simply not able to be rented. This includes availability = 0

In [35]:
df_clean.shape

(8624, 9)

In [36]:
# dropping availability_365 feature at this stage since it was a filter not a feature
df_clean.drop(['availability_365'], axis=1, inplace = True)

In [37]:
df_clean.shape

(8624, 8)

In [38]:
numeric_data = df_clean.select_dtypes(include=[np.number])
categorical_data = df_clean.select_dtypes(exclude=[np.number])

In [39]:
df_clean['neighbourhood_group'] = df_clean['neighbourhood_group'].astype('category')

In [40]:
numeric_data

Unnamed: 0,price,minimum_nights,accommodates,bedrooms,bathrooms
0,225,1,1,0.0,1.0
2,79,2,2,1.0,1.0
4,89,4,2,1.0,1.5
5,140,2,3,0.0,1.0
6,215,2,4,1.0,1.0
...,...,...,...,...,...
15981,85,3,2,1.0,1.0
15982,95,1,2,1.0,1.0
15983,170,1,6,2.0,1.0
15984,54,6,2,1.0,1.0


In [41]:
categorical_data

Unnamed: 0,neighbourhood_group,room_type,property_type
0,Manhattan,Entire home/apt,Entire Unit
2,Manhattan,Private room,Private Room
4,Brooklyn,Private room,Private Room
5,Brooklyn,Entire home/apt,Entire Unit
6,Brooklyn,Entire home/apt,Entire Unit
...,...,...,...
15981,Queens,Entire home/apt,Entire Unit
15982,Manhattan,Private room,Private Room
15983,Brooklyn,Entire home/apt,Entire Unit
15984,Brooklyn,Private room,Private Room


In [42]:
# any null values? 0 means none found == no need to fix nulls
df_clean.isna().sum()

neighbourhood_group    0
room_type              0
price                  0
minimum_nights         0
property_type          0
accommodates           0
bedrooms               0
bathrooms              0
dtype: int64

In [43]:
# what are the unique values for each column?
# label can be category but others should be binary for simplicity
for col in df_clean:
    print(col, df_clean[col].unique(), df_clean[col].nunique() )

neighbourhood_group ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']
Categories (5, object): ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'] 5
room_type ['Entire home/apt' 'Private room' 'Shared room'] 3
price [  225    79    89   140   215   120   150    52    70    68   130   110
    80   228   144   180   375   200    99   230    65   105    98   175
   500   220   100   170   185   115    77    76   135   195    69   125
   475   165   350   265    64   159   250   305   155    60    92   285
    90   390    95    75   190   212   124   122   575   229    59   113
   179    71   349   249   169   599    55   189   260    97   495   259
   451   129   300    72    88   450    37    85    91   255    50   160
   248   145   199    42   400    96   299   325    45    34    56   402
   800   275   219   178   119    87   395    49   142   174   235   311
    39   102   209   104    82   118    36    93   295   107   151   700
   331   149   128   136  1000   

In [44]:
# how many of each unique value exists in our cleaned data?
for col in df_clean:
  print("\nFor column", col)
  print(df_clean[col].value_counts(sort=True))



For column neighbourhood_group
Brooklyn         3556
Manhattan        3308
Queens           1353
Bronx             303
Staten Island     104
Name: neighbourhood_group, dtype: int64

For column room_type
Entire home/apt    4980
Private room       3523
Shared room         121
Name: room_type, dtype: int64

For column price
150     361
100     349
50      222
200     220
125     212
       ... 
995       1
337       1
429       1
2800      1
393       1
Name: price, Length: 440, dtype: int64

For column minimum_nights
2      2144
1      1811
3      1497
30     1134
4       562
       ... 
23        1
62        1
265       1
185       1
85        1
Name: minimum_nights, Length: 80, dtype: int64

For column property_type
Entire Unit     4990
Private Room    3364
Room In          144
Shared Room      120
Tiny home          6
Name: property_type, dtype: int64

For column accommodates
2     3884
4     1358
1     1193
3      893
6      473
5      418
8      154
7      119
10      41
9       24

In [45]:
df_clean.dtypes

neighbourhood_group    category
room_type                object
price                     int64
minimum_nights            int64
property_type            object
accommodates              int64
bedrooms                float64
bathrooms               float64
dtype: object

CLINTONS QUESTIONS TO DISCUSS ???? Look at counts above AND histograms/whisker boxes below...
1.   Should we drop bedrooms = 0? If they have no bedrooms do they still have a unit to rent?
2.   Above: Why aren't these in proper order, 0.0 bathrooms should be first.
3.   for room_type there are only 112 'Shared room", is this significant, drop?
4.   for property_typ there are only 6 'Timy home', is this significant, drop?
5.   PRICE: should we log_price or categorize?
6.   minimum_nights: same with this value



In [46]:
#for column in features:
for column in df_clean.columns:
  fig = px.histogram(df_clean, x=column, marginal="box")
  fig.show()

Consider how to manage extreme values.

In [47]:
extreme_values = []
for column in numeric_data.columns:
  # Select the first quantile
  q1 = df[column].quantile(0.25)

  # Select the third quantile
  q3 = df[column].quantile(0.75)

  max = df[column].quantile(1)

  # Create a mask inbetween q1 & q3
  IQR = q3 - q1

  # Filtering the initial dataframe with a mask
  #filtered = df.query('(@q1 - 1.5 * @IQR) <= [column] <= (@q3 + 1.5 * @IQR)')
  # Filtering Values between Q1-1.5IQR and Q3+1.5IQR  

  #maximum outliers
  bottom_fence = 0 if (q1 - 1.5 * IQR) < 0 else q1 - 1.5 * IQR
  upper_fence = max if (q3 + 1.5 * IQR) > max else (q3 + 1.5 * IQR)
  #display(column, bottom_fence, upper_fence)
  extreme_values.append([column, bottom_fence, upper_fence])


In [48]:
  extreme_values

[['price', 0, 332.5],
 ['minimum_nights', 0, 11.0],
 ['accommodates', 0, 7.0],
 ['bedrooms', 1.0, 1.0],
 ['bathrooms', 1.0, 1.0]]

In [49]:
# lookup in extreme_values UPPER/LOWER FENCE values
def get_upperfence(name=''):
  for i in range(len(extreme_values)):
    if extreme_values[i][0] == name:
      return extreme_values[i][2]
    else:
      continue

def get_lowerfence(name=''):
  for i in range(len(extreme_values)):
    if extreme_values[i][0] == name:
      return extreme_values[i][1]
    else:
      continue

In [50]:
# calculate percentage of values over our extreme, if under 5% consider dropping
display ('Pricing percentage over extreme:')
(df_clean.loc[df_clean.price > get_upperfence('price'), 'price'].count() / df_clean.price.count()) * 100 

'Pricing percentage over extreme:'

6.8181818181818175

In [51]:
# drop upperfence extreme prices
df_clean.drop(df_clean[df_clean['price'] > get_upperfence('price')].index, inplace = True)


In [52]:
# calculate percentage of values over our extreme, if under 5% consider dropping
display ('Minimum nights percentage over extreme:')
(df_clean.loc[df_clean.minimum_nights > get_upperfence('minimum_nights'), 'minimum_nights'].count() / df_clean.minimum_nights.count()) * 100 

'Minimum nights percentage over extreme:'

19.051767048282727

In [53]:
# NOT DROPPING minimum_nights because of high percentage
# drop upperfence extreme minimum nights
# df_clean.drop(df_clean[df_clean['minimum_nights'] > get_upperfence('minimum_nights')].index, inplace = True)

In [54]:
# calculate percentage of values over our extreme, if under 5% consider dropping
display ('Accommodates percentage over extreme:')
(df_clean.loc[df_clean.accommodates > get_upperfence('accommodates'), 'accommodates'].count() / df_clean.accommodates.count()) * 100 

'Accommodates percentage over extreme:'

1.991040318566451

In [55]:
# drop upperfence extreme accomodations
df_clean.drop(df_clean[df_clean['accommodates'] > get_upperfence('accommodates')].index, inplace = True)

In [56]:
# after extreme values dropped, how do histograms look now?
for column in df_clean.columns:
  fig = px.histogram(df_clean, x=column, marginal="box")
  fig.show()

In [57]:
# ??? we can now drop price column since we have price group

In [58]:
# log of zero fails so we count how many have zero, if small, then drop
df_clean.loc[df_clean.price  == 0, 'price'].count()

3

In [59]:
zero_price = df_clean.loc[df_clean.price  == 0, 'price'].index

In [60]:
df_clean.shape

(7876, 8)

In [61]:
# zero price rows is low, dropping
df_clean.drop(zero_price,axis=0,inplace=True)

In [62]:
df_clean.shape

(7873, 8)

In [63]:
# add log of price to dataframe
df_clean['log_price'] = np.log(df_clean['price'])

Minimum_nights needs log to get gaussian graph.

In [64]:
# log of zero fails so we count how many have zero, if small, then drop
df_clean.loc[df_clean.minimum_nights  == 0, 'minimum_nights'].count()

0

In [65]:
zero_minimum_nights = df_clean.loc[df_clean.minimum_nights  == 0, 'minimum_nights'].index

In [66]:
# zero price rows is low, dropping
df_clean.drop(zero_minimum_nights,axis=0,inplace=True)

In [67]:
# add log of price to dataframe
df_clean['log_minimum_nights'] = np.log(df_clean['minimum_nights'])

In [68]:
# after price and minimum_nights LOGGED, how do histograms look now?
for column in df_clean.columns:
  fig = px.histogram(df_clean, x=column, marginal="box")
  fig.show()

Should we keep minimum_nights logged ??? Still doesn't look very gaussian.

Consider dropping minimum_nights original features now... Then choose between Price_Group Price or Log_price???

In [69]:
df_clean.dtypes

neighbourhood_group    category
room_type                object
price                     int64
minimum_nights            int64
property_type            object
accommodates              int64
bedrooms                float64
bathrooms               float64
log_price               float64
log_minimum_nights      float64
dtype: object

In [70]:
q1

1.0

In [71]:
df_clean.head()

Unnamed: 0,neighbourhood_group,room_type,price,minimum_nights,property_type,accommodates,bedrooms,bathrooms,log_price,log_minimum_nights
0,Manhattan,Entire home/apt,225,1,Entire Unit,1,0.0,1.0,5.4161,0.0
2,Manhattan,Private room,79,2,Private Room,2,1.0,1.0,4.369448,0.693147
4,Brooklyn,Private room,89,4,Private Room,2,1.0,1.5,4.488636,1.386294
5,Brooklyn,Entire home/apt,140,2,Entire Unit,3,0.0,1.0,4.941642,0.693147
6,Brooklyn,Entire home/apt,215,2,Entire Unit,4,1.0,1.0,5.370638,0.693147


In [72]:
df_clean.dtypes

neighbourhood_group    category
room_type                object
price                     int64
minimum_nights            int64
property_type            object
accommodates              int64
bedrooms                float64
bathrooms               float64
log_price               float64
log_minimum_nights      float64
dtype: object

In [73]:
q1 = df_clean['price'].quantile(0.25)
q1

70.0

In [74]:
mean = df_clean['price'].quantile(0.5)
mean

101.0

In [75]:
q3 = df_clean['price'].quantile(0.75)
q3

159.0

??? should the price group be based on **PRICE or on LOG_PRICE**?



In [76]:
# break apart price groups by <q1, q1>median, median>q3, q3>
feature_name = 'log_price' # or 'price'

q1 = df_clean[feature_name].quantile(0.25)
mean = df_clean[feature_name].quantile(0.5)
q3 = df_clean[feature_name].quantile(0.75)


# Categorizing Price Group

# we could use price category names
df_clean.loc[ df_clean[feature_name] <= q1, 'price_group'] = 'budget'
df_clean.loc[(df_clean[feature_name] > q1) & (df_clean[feature_name] <= mean), 'price_group'] = 'standard'
df_clean.loc[(df_clean[feature_name] > mean) & (df_clean[feature_name] <= q3), 'price_group'] = 'premium'
df_clean.loc[(df_clean[feature_name] > q3), 'price_group'] = 'luxury'

# OR we could clearly articulate our data group definitions, works well if feature_name is PRICE
#df_clean.loc[ df_clean[feature_name] <= q1, 'price_group'] = "Price Less than "+ str(q1)
#df_clean.loc[(df_clean[feature_name] > q1) & (df_clean[feature_name] <= mean), 'price_group'] = 'Price Between ' + str(q1) + ' and ' + str(mean)
#df_clean.loc[(df_clean[feature_name] > mean) & (df_clean[feature_name] <= q3), 'price_group'] = 'Price Between ' + str(mean) + ' and ' + str(q3)
#df_clean.loc[(df_clean[feature_name] > q3), 'price_group'] = "Price Greater than " + str(q3)

# OR we could call them numerical as well to be explicit, but this overwrites price feature
#df_clean.loc[ df_clean[feature_name] <= q1, feature_name] = 0
#df_clean.loc[(df_clean[feature_name] > q1) & (df_clean[feature_name'] <= mean), 'price'] = 1
#df_clean.loc[(df_clean[feature_name] > mean) & (df_clean[feature_name] <= q3), 'price'] = 2
#df_clean.loc[(df_clean[feature_name] > q3), feature_name] = 3



In [77]:
df_clean['price'].value_counts()

150    349
100    348
50     222
200    217
125    208
      ... 
256      1
277      1
22       1
323      1
223      1
Name: price, Length: 280, dtype: int64

In [78]:
df_clean['price_group'].value_counts()

budget      2090
premium     1987
luxury      1944
standard    1852
Name: price_group, dtype: int64

In [79]:
for col in df_clean:
    print(col, df_clean[col].unique(), df_clean[col].nunique() )

neighbourhood_group ['Manhattan', 'Brooklyn', 'Queens', 'Staten Island', 'Bronx']
Categories (5, object): ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island'] 5
room_type ['Entire home/apt' 'Private room' 'Shared room'] 3
price [225  79  89 140 215 120 150  52  70  68 130 110  80 228 144 180 200  99
 230  65 105  98 175 220 100 170 185 115  77  76 135 195  69 125 165  64
 159 250 305 155  60  92 285  90  95  75 190 212 124 122 229  59 113 179
  71 249 169  55 260  97 259 129  72  88  37  85  91 189 300 255  50 160
 145 199  42  96 299 325  45  34  56 275 219 178 265 119  87  49 142 174
 235 311  39 102 209 104  82 118  36  93 295 107 151 331 149 128 136 263
  61 234 109 197 127 167  54 134  62  73 240 210 171 103  81  57 121  51
 131 166  44 108  35  53  78 191 187 172  38  46 139  83  40 182 158 133
  47  94 152  41 290 147 269 188  67 111 217 112  66  84  31 226  74  29
 143 184 193 106 320 221 162  63 176 117 218 116 288 316 146 318 148 216
  58  30  86 198 245 239 247 205 

In [80]:
column_names= df_clean.columns
features = column_names[1:]
label = column_names[0]
display(features, label)
# set our label to type category to be explicit
df_clean['neighbourhood_group'] = df_clean['neighbourhood_group'].astype('category')

Index(['room_type', 'price', 'minimum_nights', 'property_type', 'accommodates',
       'bedrooms', 'bathrooms', 'log_price', 'log_minimum_nights',
       'price_group'],
      dtype='object')

'neighbourhood_group'

In [81]:
df_clean.dtypes

neighbourhood_group    category
room_type                object
price                     int64
minimum_nights            int64
property_type            object
accommodates              int64
bedrooms                float64
bathrooms               float64
log_price               float64
log_minimum_nights      float64
price_group              object
dtype: object

## Normalization and Scaling of Data

In [82]:
Example_Count = len(df_clean)
Feature_Count = len(df_clean.columns) - 1

print("Number of Examples:", Example_Count)
print("Number Features:", Feature_Count)

Number of Examples: 7873
Number Features: 10


In [83]:
fig = px.scatter_matrix(df_clean, dimensions=features, color=label)

fig.update_layout(width=(Feature_Count + 1) * 200,
                 height=(Feature_Count + 1) * 200,
                 margin=dict(l=0, r=0, t=0, b=0))

fig.show()

## Convert Strings to Numerical 

In [84]:
df_clean.dtypes

neighbourhood_group    category
room_type                object
price                     int64
minimum_nights            int64
property_type            object
accommodates              int64
bedrooms                float64
bathrooms               float64
log_price               float64
log_minimum_nights      float64
price_group              object
dtype: object

For room_type, price_group and property_type, from objects to mode/numerics.

In [85]:
# features need to be numerical for decision trees, only really needed for correlation graph/comparison
# but maybe NOT needed for decision tree?
df_clean = pd.get_dummies(df_clean, columns=["room_type","property_type","price_group"], prefix='mode')
#df_clean = pd.get_dummies(df_clean, columns=["room_type","property_type"], prefix='mode')


In [86]:
# leaving features as objects is a problem when it comes to calculating precision & classification_reports
# if get_dummies is not used ERROR is
#ValueError                                Traceback (most recent call last)
#
#<ipython-input-565-2fe0e0137cb5> in <module>
#      2 print(classification_report(y_train, yhat_train))
#      3 print()
#----> 4 yhat_test = dtree.predict(X_test)
#      5 
#      6 print("Results on test data:")
#
#4 frames
#
#/usr/local/lib/python3.8/dist-packages/sklearn/utils/_array_api.py in _asarray_with_order(array, dtype, order, copy, xp)
#    183     if xp.__name__ in {"numpy", "numpy.array_api"}:
#    184         # Use NumPy API to support order
#--> 185         array = numpy.asarray(array, order=order, dtype=dtype)
#    186         return xp.asarray(array, copy=copy)
#    187     else:
#ValueError: could not convert string to float: 'Private room'

In [87]:
df_clean.dtypes

neighbourhood_group     category
price                      int64
minimum_nights             int64
accommodates               int64
bedrooms                 float64
bathrooms                float64
log_price                float64
log_minimum_nights       float64
mode_Entire home/apt       uint8
mode_Private room          uint8
mode_Shared room           uint8
mode_Entire Unit           uint8
mode_Private Room          uint8
mode_Room In               uint8
mode_Shared Room           uint8
mode_Tiny home             uint8
mode_budget                uint8
mode_luxury                uint8
mode_premium               uint8
mode_standard              uint8
dtype: object

In [88]:
# overly simplified correlation chart, using another version
#fig = px.imshow(df_clean.corr())
#fig.show()

In [89]:
# Correlation Graph Enhanced
df_corr = df_clean.corr().round(1)  
# Mask to matrix
mask = np.zeros_like(df_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Viz
df_corr_viz = df_corr.mask(mask).dropna(how='all').dropna(how='all')
# colour variable https://plotly.com/python/colorscales/
fig = px.imshow(df_corr_viz, text_auto=True, color_continuous_scale=[(0.00, "black"),   (0.33, "black"),
                                                     (0.33, "white"), (0.66, "white"),
                                                     (0.66, "blue"),  (1.00, "blue")])
fig.show()


In [90]:
df_corr

Unnamed: 0,price,minimum_nights,accommodates,bedrooms,bathrooms,log_price,log_minimum_nights,mode_Entire home/apt,mode_Private room,mode_Shared room,mode_Entire Unit,mode_Private Room,mode_Room In,mode_Shared Room,mode_Tiny home,mode_budget,mode_luxury,mode_premium,mode_standard
price,1.0,0.0,0.5,0.2,0.1,1.0,0.1,0.6,-0.6,-0.1,0.6,-0.6,0.1,-0.1,0.0,-0.6,0.8,0.1,-0.3
minimum_nights,0.0,1.0,-0.0,-0.0,0.0,0.0,0.6,0.1,-0.1,0.0,0.1,-0.1,-0.0,0.0,-0.0,-0.0,0.0,0.0,-0.0
accommodates,0.5,-0.0,1.0,0.5,0.2,0.5,-0.0,0.5,-0.5,-0.1,0.5,-0.5,0.0,-0.1,0.0,-0.4,0.4,0.1,-0.1
bedrooms,0.2,-0.0,0.5,1.0,0.3,0.2,-0.0,0.1,-0.1,-0.0,0.1,-0.1,-0.0,-0.0,-0.0,-0.1,0.2,-0.0,-0.1
bathrooms,0.1,0.0,0.2,0.3,1.0,0.0,0.0,-0.1,0.1,-0.0,-0.0,0.1,-0.0,-0.0,-0.0,0.1,0.1,-0.1,-0.1
log_price,1.0,0.0,0.5,0.2,0.0,1.0,0.1,0.6,-0.6,-0.1,0.6,-0.6,0.1,-0.1,0.0,-0.8,0.7,0.2,-0.2
log_minimum_nights,0.1,0.6,-0.0,-0.0,0.0,0.1,1.0,0.2,-0.2,-0.1,0.2,-0.2,-0.1,-0.1,-0.0,-0.1,0.1,0.1,-0.1
mode_Entire home/apt,0.6,0.1,0.5,0.1,-0.1,0.6,0.2,1.0,-1.0,-0.1,0.9,-0.9,-0.1,-0.1,0.0,-0.6,0.4,0.3,-0.1
mode_Private room,-0.6,-0.1,-0.5,-0.1,0.1,-0.6,-0.2,-1.0,1.0,-0.1,-0.9,0.9,0.1,-0.1,-0.0,0.5,-0.4,-0.3,0.1
mode_Shared room,-0.1,0.0,-0.1,-0.0,-0.0,-0.1,-0.1,-0.1,-0.1,1.0,-0.1,-0.1,-0.0,0.9,-0.0,0.1,-0.0,-0.0,-0.0


This correlation chart shows ZERO reasons to keep log_minimum_nights that log was NOT necessary, dropping now before building tree.

In [91]:
df_clean.drop(['log_minimum_nights'], axis=1, inplace = True)

Dropping price as well, keeping log_price and price groups.

In [92]:
df_clean.drop(['price'], axis=1, inplace = True)

Before continuing need to decide which features to keep, if get_dummies was necessary?

# **Decision Tree**

The objective of this assignment is for you to perform a complete implementation of a decision
tree classifier using your team’s project dataset.
0. Prior to building the ML model:

*   Split your data into testing and training.
*   Determine whether your label data needs to be discretized (if you have a numerical label).

In [93]:
df_clean.shape

(7873, 18)

In [94]:
df_clean.reset_index(inplace=True, drop=True)

In [95]:
df_clean

Unnamed: 0,neighbourhood_group,minimum_nights,accommodates,bedrooms,bathrooms,log_price,mode_Entire home/apt,mode_Private room,mode_Shared room,mode_Entire Unit,mode_Private Room,mode_Room In,mode_Shared Room,mode_Tiny home,mode_budget,mode_luxury,mode_premium,mode_standard
0,Manhattan,1,1,0.0,1.0,5.416100,1,0,0,1,0,0,0,0,0,1,0,0
1,Manhattan,2,2,1.0,1.0,4.369448,0,1,0,0,1,0,0,0,0,0,0,1
2,Brooklyn,4,2,1.0,1.5,4.488636,0,1,0,0,1,0,0,0,0,0,0,1
3,Brooklyn,2,3,0.0,1.0,4.941642,1,0,0,1,0,0,0,0,0,0,1,0
4,Brooklyn,2,4,1.0,1.0,5.370638,1,0,0,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7868,Queens,3,2,1.0,1.0,4.442651,1,0,0,1,0,0,0,0,0,0,0,1
7869,Manhattan,1,2,1.0,1.0,4.553877,0,1,0,0,1,0,0,0,0,0,0,1
7870,Brooklyn,1,6,2.0,1.0,5.135798,1,0,0,1,0,0,0,0,0,1,0,0
7871,Brooklyn,6,2,1.0,1.0,3.988984,0,1,0,0,1,0,0,0,1,0,0,0


In [96]:
df_clean.columns

Index(['neighbourhood_group', 'minimum_nights', 'accommodates', 'bedrooms',
       'bathrooms', 'log_price', 'mode_Entire home/apt', 'mode_Private room',
       'mode_Shared room', 'mode_Entire Unit', 'mode_Private Room',
       'mode_Room In', 'mode_Shared Room', 'mode_Tiny home', 'mode_budget',
       'mode_luxury', 'mode_premium', 'mode_standard'],
      dtype='object')

In [97]:
#move label to last column
cols = list(df_clean.columns.values) #Make a list of all of the columns in the df
cols.pop(cols.index(label)) #Remove b from list
df_clean = df_clean[cols+[label]] #Create new dataframe with columns in the order you want

In [98]:
#move columns so label is at end
#df_clean.insert(-1,"neightbourhood_group",df_clean.pop("neighbourhood_group"))

# TEST TEST TEST what if we drop bedrooms and bathrooms and log price at first ???
#new_cols = ['room_type','minimum_nights','property_type','accommodates','bedrooms','bathrooms','price_group','log_price','neighbourhood_group']

#this is used when get_dummies is NOT used
#new_cols = ['room_type','minimum_nights','property_type','accommodates','price_group','neighbourhood_group']

new_cols = df_clean.columns

df_clean = df_clean.reindex(columns=new_cols)


In [99]:
df_clean.columns

Index(['minimum_nights', 'accommodates', 'bedrooms', 'bathrooms', 'log_price',
       'mode_Entire home/apt', 'mode_Private room', 'mode_Shared room',
       'mode_Entire Unit', 'mode_Private Room', 'mode_Room In',
       'mode_Shared Room', 'mode_Tiny home', 'mode_budget', 'mode_luxury',
       'mode_premium', 'mode_standard', 'neighbourhood_group'],
      dtype='object')

In [100]:
df_clean

Unnamed: 0,minimum_nights,accommodates,bedrooms,bathrooms,log_price,mode_Entire home/apt,mode_Private room,mode_Shared room,mode_Entire Unit,mode_Private Room,mode_Room In,mode_Shared Room,mode_Tiny home,mode_budget,mode_luxury,mode_premium,mode_standard,neighbourhood_group
0,1,1,0.0,1.0,5.416100,1,0,0,1,0,0,0,0,0,1,0,0,Manhattan
1,2,2,1.0,1.0,4.369448,0,1,0,0,1,0,0,0,0,0,0,1,Manhattan
2,4,2,1.0,1.5,4.488636,0,1,0,0,1,0,0,0,0,0,0,1,Brooklyn
3,2,3,0.0,1.0,4.941642,1,0,0,1,0,0,0,0,0,0,1,0,Brooklyn
4,2,4,1.0,1.0,5.370638,1,0,0,1,0,0,0,0,0,1,0,0,Brooklyn
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7868,3,2,1.0,1.0,4.442651,1,0,0,1,0,0,0,0,0,0,0,1,Queens
7869,1,2,1.0,1.0,4.553877,0,1,0,0,1,0,0,0,0,0,0,1,Manhattan
7870,1,6,2.0,1.0,5.135798,1,0,0,1,0,0,0,0,0,1,0,0,Brooklyn
7871,6,2,1.0,1.0,3.988984,0,1,0,0,1,0,0,0,1,0,0,0,Brooklyn


In [101]:
# set variable names
feature_names = new_cols[:-1]
label_name = new_cols[-1]


In [102]:
label_name

'neighbourhood_group'

In [103]:
feature_names

Index(['minimum_nights', 'accommodates', 'bedrooms', 'bathrooms', 'log_price',
       'mode_Entire home/apt', 'mode_Private room', 'mode_Shared room',
       'mode_Entire Unit', 'mode_Private Room', 'mode_Room In',
       'mode_Shared Room', 'mode_Tiny home', 'mode_budget', 'mode_luxury',
       'mode_premium', 'mode_standard'],
      dtype='object')

## Spliting Data (before preprocessing)

In [104]:
from sklearn.model_selection import train_test_split
X_values = df_clean[new_cols].values
y_values = df_clean[label].values
display(X_values,y_values)

array([[1, 1, 0.0, ..., 0, 0, 'Manhattan'],
       [2, 2, 1.0, ..., 0, 1, 'Manhattan'],
       [4, 2, 1.0, ..., 0, 1, 'Brooklyn'],
       ...,
       [1, 6, 2.0, ..., 0, 0, 'Brooklyn'],
       [6, 2, 1.0, ..., 0, 0, 'Brooklyn'],
       [7, 2, 1.0, ..., 0, 1, 'Manhattan']], dtype=object)

['Manhattan', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Brooklyn', ..., 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Manhattan']
Length: 7873
Categories (5, object): ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']

In [108]:
# code from blog https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50
# recommends this
# want identical distributions, for better validation? = use stratify=df_clean[label]
# want the indices of each set separately? = use all_indices
## WORKS only with 2 sets, not 4
#all_indices = list(range(len(df_clean)))
# train, test = train_test_split(all_indices, test_size=0.2,stratify=df_clean[label])

# but this doesn't work when we need 4 sets like
# X_train, X_test, y_train, y_test

In [110]:
# code from labs
X_train, X_test, y_train, y_test = train_test_split(X_values, y_values, test_size=0.2,stratify=df_clean[label])


In [112]:
# from https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50
# importance of using stratify argument with train_test_split
def get_class_counts(df_split):
    print(df_split)
    grp = df_split.groupby(label).nunique()
    return {key: grp[key] for key in list(grp.keys())}

def get_class_proportions(df_split):
    class_counts = get_class_counts(df_split)
    return {val[0]: round(val[1]/df_split.shape[0],4) for val in class_counts.items()}


In [113]:
# ABOVE def code only works with
#train, test = train_test_split(df_clean, test_size=0.2)
 
  # NOT 4
  #X_train, X_test, y_train, y_test = train_test_split(X_values, y_values, test_size=0.2)

#train_class_proportions = get_class_proportions(train)
#test_class_proportions = get_class_proportions(test)

#print("\nX_Train data class proportions", train_class_proportions)
#print("\nX_Test data class proportions", test_class_proportions)

We can see variations in the distributions. Although small here, it can easily be quite large enough for smaller and more skewed datasets. The solution to this problem is something called stratification which will lock the distribution of classes in train and test sets.

In order to get a similar distribution, we need to stratify the dataset based on the class values. So we pass our class labels "stratify=df_clean[label]" part of the data to this parameter and check what happens.

In [114]:
#train,test = train_test_split(df_clean, test_size=0.2, stratify=df_clean[label])

#train_class_proportions = get_class_proportions(train)
#test_class_proportions = get_class_proportions(test)

#print("\nTrain data class proportions", train_class_proportions)
#print("\nTest data class proportions", test_class_proportions)

In [115]:
# continue with object values for our decision tree L3-4 prepocess features with _transformer

## Preprocessing Data

**You should not use a preprocessing method that is fitted on the whole dataset, to transform the test or train data.**

In [116]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

X_preprocess = make_column_transformer((OrdinalEncoder(), feature_names), 
                                       remainder='drop')
y_preprocess = LabelEncoder()

In [117]:
X_preprocess

In [118]:
y_preprocess

In [119]:
X_train = X_preprocess.fit_transform(df_clean[feature_names])
y_train = y_preprocess.fit_transform(df_clean[label_name])

In [120]:
display(X_train,y_train)

array([[0., 0., 0., ..., 1., 0., 0.],
       [1., 1., 1., ..., 0., 0., 1.],
       [3., 1., 1., ..., 0., 0., 1.],
       ...,
       [0., 5., 2., ..., 1., 0., 0.],
       [5., 1., 1., ..., 0., 0., 0.],
       [6., 1., 1., ..., 0., 0., 1.]])

array([2, 2, 1, ..., 1, 1, 2])

In [121]:
# how are we going to define X_new ????


**Exploring decision tree construction:**
Vary the following hyperparameters to build your decision tree classifier and report the evaluation metrics for both your training and testing data.

**1. Vary the criterion hyperparameter:**


a. Create a DT using the criterion parameter “gini” and report the accuracy,
precision, recall and F1 score.


b. Create a DT using the criterion parameter “entropy” and report the accuracy,
precision, recall and F1 score.

**2. Vary the splitter hyperparameter:**

a. Create a DT using the splitter parameter “best” and report the accuracy,
precision, recall and F1 score.

b. Create a DT using the splitter parameter “random” and report the accuracy,
precision, recall and F1 score.

**3. Vary the min_samples_split hyperparameter:**

a. Choose value 1 as your min_samples_split and report the accuracy, precision, recall and F1 score.


In [122]:
from sklearn.tree import DecisionTreeClassifier

In [123]:
dtree = DecisionTreeClassifier(min_samples_split = 1)

In [124]:
dtree.fit(X_train, y_train)

In [125]:
# how are we going to define X_new ????

In [126]:
from sklearn.tree import export_graphviz

In [138]:
import graphviz
from sklearn.tree import export_graphviz
dot_data = export_graphviz(dtree,
                           out_file=None, 
                           class_names=class_names.tolist(),
                           feature_names=feature_names,  
                           filled=True,
                           rounded=True,  
                           special_characters=True,
                           rotate=True)  

display(graphviz.Source(dot_data))

NameError: ignored

Evaluation

In [131]:
from sklearn.metrics import confusion_matrix

In [132]:
yhat_train = dtree.predict(X_train)

In [133]:
train_3a = confusion_matrix(y_train, yhat_train)

In [134]:
display(train_3a)

array([[ 185,   69,   15,   22,    0],
       [  35, 3054,  136,   96,    0],
       [  13,  473, 2341,   55,    0],
       [  24,  351,  100,  811,    0],
       [   2,   25,    7,    6,   53]])

In [135]:
from sklearn.metrics import classification_report

In [137]:
print("Results on training data:")
print(classification_report(y_train, yhat_train))
print()
yhat_test = dtree.predict(X_test)

print("Results on test data:")
print(classification_report(y_test, yhat_test))

Results on training data:
              precision    recall  f1-score   support

           0       0.71      0.64      0.67       291
           1       0.77      0.92      0.84      3321
           2       0.90      0.81      0.85      2882
           3       0.82      0.63      0.71      1286
           4       1.00      0.57      0.73        93

    accuracy                           0.82      7873
   macro avg       0.84      0.71      0.76      7873
weighted avg       0.83      0.82      0.82      7873




ValueError: ignored

In [None]:
df_clean

b. Choose value 2 as your min_samples_split and report the accuracy, precision, recall and F1 score.


**4. Vary the min_samples_leaf hyperparameter:**

a. Choose value 1 as your min_samples_leaf and report the accuracy, precision, recall and F1 score.

b. Choose value 2 as your min_samples_leaf and report the accuracy, precision, recall and F1 score.


**5. Vary the max_depth hyperparameter:**


a. Assign a limiting depth, e.g. 4, for our hyperparameter and report the accuracy, precision, recall and F1 score.


b. Assign a 2nd limiting depth, e.g. 8, for our hyperparameter and report the accuracy, precision, recall and F1 score.


**6. Hyperparameter overview:**

Provide a 2–3 paragraph summary of the results of your hyperparameter exploration. How did your ML model improve or depreciate with these variations?


# **Final Decision Tree & Evaluation**

**1. Which feature was used for the first split?**

**2. How many leaves are in the optimal classifier/ML model?**

**3. Produce a confusion_matrix and describe your ML model’s accuracy in terms of the number of true and false positives and negatives.** (Cailenys)

**4. Using scikit-learn’s classification_report method, generate the accuracy, precision, recall, and F1 score for your model and describe your ML model’s accuracy.** (Cailenys)


# **Visualize the structure of your final ML model:**


**5. Plot your tree. [Hint: using scikit-learn’s tree.plot_tree**

**6. Plot the decision surface of your tree using paired features. [See the following for help implementing:**
https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html#sphx-glr-auto-examples-tree-plot-iris-dtc-py]

# **Decision tree path:** 

**7. Provide a description of the potential path along your tree that a given new data point might take and provide its final result. The idea being that we want to know what decisions would be made along the way for that data point to end up at a particular label.**

# **ML model Accuracy**

Perform a comparison of our decision tree model vs. k-NN model: provide a comparison table of accuracy for your various DT ML models and your k-NN ML models. This will be a tool for comparison for you as a technician, but it will also serve as a communication tool to summarize to stakeholders what you tried, what worked best, and why.


# **Business Evaluation** (cailenys)


One of the key objectives of this course is to learn how to implement ML algorithms to tackle business problems and objectives. Please provide us with a complete scenario of how the results of your decision tree classifier might be used.

**Note:** you’ve previously considered some of these questions, the intent with reconsidering them is to iterate on our problem after obtaining results from our ML model:


**1. What might be the motivation for a decision tree classifier?**


**2. What is the “action” that should be taken given the results of this prediction?**

**3. Who is the best immediate person(s) to make use of the results of your prediction?**

**4. What is the potential payoff of this prediction for an organization? (e.g., costs or efficiency).**


**5. Do your ML models’ results change your problem? If so, how and why? If not, please explain.**