<a href="https://colab.research.google.com/github/cboyda/MachineLearning/blob/main/PA4_Team1_W23.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Assignment #4: Decision Tree**

Team member names:

*  Brett Adams
*  Cailenys Leslie
*  Clinton Boyda 
*  Tanvir Hossain
*  Ram Dershan

Dataset: 
[New York City Airbnb Open Data](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data)

In [3]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn import model_selection
from  sklearn import neighbors
import plotly.graph_objects as go
import math
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")

In [4]:
# Connect to Dataset

#filename = "https://raw.githubusercontent.com/cboyda/MachineLearning/main/AB_NYC_2019.csv"

#df = pd.read_csv(filename)

# we can work on merging the new dataset in the code below = Brett?

# load both data sets in
original = "https://raw.githubusercontent.com/cboyda/MachineLearning/main/AB_NYC_2019.csv"
df_original = pd.read_csv(original)
additional = "https://raw.githubusercontent.com/cboyda/MachineLearning/main/full_nyc_dataset_cleaned_table-1.csv"
df_additional = pd.read_csv(additional)

In [5]:
# Merge the two datasets with an inner join, validate that no duplicate id values exist for a one to one join
df = pd.merge(df_original, df_additional, how = "inner", on = "id", validate="one_to_one", suffixes=("_original","_additional"))
df.shape

(16005, 22)

In [6]:
df.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type_original', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'property_type', 'room_type_additional',
       'accommodates', 'bathrooms_text', 'bedrooms', 'beds'],
      dtype='object')

# **Data Cleaning**

In [7]:
# check value counts for property_type
df['property_type'].value_counts()

Entire rental unit                    6975
Private room in rental unit           5153
Private room in home                   844
Entire home                            513
Entire condo                           418
Private room in townhouse              352
Entire loft                            326
Entire townhouse                       297
Private room in condo                  180
Shared room in rental unit             178
Private room in loft                   149
Entire guest suite                     133
Entire serviced apartment               98
Room in boutique hotel                  68
Room in hotel                           56
Private room in guest suite             37
Entire place                            33
Room in serviced apartment              24
Shared room in loft                     19
Entire guesthouse                       19
Private room                            18
Private room in resort                  17
Private room in bed and breakfast       14
Shared room

There are property types that we do not want to consider in our analysis (Boats, Caves and Villa's) so we will remove these examples.

In [8]:
# Check shape before dropping examples
df.shape

(16005, 22)

In [9]:
df = df.drop(df[(df['property_type'] == 'Cave') | (df['property_type'] == 'Boat') | 
                (df['property_type'] == 'Floor') | (df['property_type'] == 'Private room in farm stay') |
                (df['property_type'] == 'Entire villa') | (df['property_type'] == 'Private room in houseboat') |
                (df['property_type'] == 'Private room in villa') | (df['property_type'] == 'Private room in tent') |
                (df['property_type'] == 'Houseboat')].index)

In [10]:
# Check shape after dropping examples
df.shape

(15986, 22)

In [11]:
# assess new value counts for property_type
df['property_type'].value_counts()

Entire rental unit                    6975
Private room in rental unit           5153
Private room in home                   844
Entire home                            513
Entire condo                           418
Private room in townhouse              352
Entire loft                            326
Entire townhouse                       297
Private room in condo                  180
Shared room in rental unit             178
Private room in loft                   149
Entire guest suite                     133
Entire serviced apartment               98
Room in boutique hotel                  68
Room in hotel                           56
Private room in guest suite             37
Entire place                            33
Room in serviced apartment              24
Entire guesthouse                       19
Shared room in loft                     19
Private room                            18
Private room in resort                  17
Private room in bed and breakfast       14
Shared room

In [12]:
# extract the numerical values from the bathroom_text column for consideration 
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Half-bath', 0.5, inplace=True)
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Shared half-bath', 0.5, inplace=True)
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Private half-bath', 0.5, inplace=True)
df['bathrooms'] = df['bathrooms_text'].str.extract(r'\b([\d.]+)\b')

In [13]:
# Convert bathroom to float type
df['bathrooms'] = df['bathrooms'].astype(float)

In [14]:
# drop bathroom_text, beds, and duplicated room_type column
df.drop(['bathrooms_text', 'room_type_additional', 'beds'], axis = 1, inplace = True)

In [15]:
# drop suffix from room_type_original
df = df.rename(columns = {'room_type_original' : 'room_type'})

In [16]:
# check for null values
df.isnull().sum()

id                                   0
name                                11
host_id                              0
host_name                           10
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       3010
reviews_per_month                 3010
calculated_host_listings_count       0
availability_365                     0
property_type                        0
accommodates                         0
bedrooms                          1562
bathrooms                           52
dtype: int64

For bedrooms and bathrooms with null values, fill with zero as properties can have no bedrooms or bathrooms

In [17]:
df[['bedrooms', 'bathrooms']] = df[['bedrooms', 'bathrooms']].fillna(value=0)

In [18]:
# Check null values again to confirm
df.isnull().sum()

id                                   0
name                                11
host_id                              0
host_name                           10
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       3010
reviews_per_month                 3010
calculated_host_listings_count       0
availability_365                     0
property_type                        0
accommodates                         0
bedrooms                             0
bathrooms                            0
dtype: int64

All other columns with null values are not important for this analysis as these columns will be dropped.

# **Feature Scaling**


In [19]:
df.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'property_type', 'accommodates', 'bedrooms',
       'bathrooms'],
      dtype='object')

In [20]:
# drop all columns not necessary
# over simplifying for our first iteration

df.drop(['neighbourhood','name','host_name','number_of_reviews','last_review','reviews_per_month',
         'calculated_host_listings_count','id','host_id','latitude','longitude'], axis=1, inplace = True)
# df.drop('a', inplace=True, axis=1)

In [21]:
numeric_data = df.select_dtypes(include=[np.number])
categorical_data = df.select_dtypes(exclude=[np.number])


In [22]:
#define clean as duplicate
df_clean = df.copy()

# **Decision Tree**

The objective of this assignment is for you to perform a complete implementation of a decision
tree classifier using your team’s project dataset.
0. Prior to building the ML model:

*   Split your data into testing and training.
*   Determine whether your label data needs to be discretized (if you have a numerical label).

**Exploring decision tree construction:**
Vary the following hyperparameters to build your decision tree classifier and report the evaluation metrics for both your training and testing data.

**1. Vary the criterion hyperparameter:**


a. Create a DT using the criterion parameter “gini” and report the accuracy,
precision, recall and F1 score.


b. Create a DT using the criterion parameter “entropy” and report the accuracy,
precision, recall and F1 score.

**2. Vary the splitter hyperparameter:**

a. Create a DT using the splitter parameter “best” and report the accuracy,
precision, recall and F1 score.

b. Create a DT using the splitter parameter “random” and report the accuracy,
precision, recall and F1 score.

**3. Vary the min_samples_split hyperparameter:**

a. Choose value 1 as your min_samples_split and report the accuracy, precision, recall and F1 score.


b. Choose value 2 as your min_samples_split and report the accuracy, precision, recall and F1 score.


**4. Vary the min_samples_leaf hyperparameter:**

a. Choose value 1 as your min_samples_leaf and report the accuracy, precision, recall and F1 score.

b. Choose value 2 as your min_samples_leaf and report the accuracy, precision, recall and F1 score.


**5. Vary the max_depth hyperparameter:**


a. Assign a limiting depth, e.g. 4, for our hyperparameter and report the accuracy, precision, recall and F1 score.


b. Assign a 2nd limiting depth, e.g. 8, for our hyperparameter and report the accuracy, precision, recall and F1 score.


**6. Hyperparameter overview:**

Provide a 2–3 paragraph summary of the results of your hyperparameter exploration. How did your ML model improve or depreciate with these variations?


# **Final Decision Tree & Evaluation**

**1. Which feature was used for the first split?**

**2. How many leaves are in the optimal classifier/ML model?**

**3. Produce a confusion_matrix and describe your ML model’s accuracy in terms of the number of true and false positives and negatives.** (Cailenys)

**4. Using scikit-learn’s classification_report method, generate the accuracy, precision, recall, and F1 score for your model and describe your ML model’s accuracy.** (Cailenys)


# **Visualize the structure of your final ML model:**


**5. Plot your tree. [Hint: using scikit-learn’s tree.plot_tree**

**6. Plot the decision surface of your tree using paired features. [See the following for help implementing:**
https://scikit-learn.org/stable/auto_examples/tree/plot_iris_dtc.html#sphx-glr-auto-examples-tree-plot-iris-dtc-py]

# **Decision tree path:** 

**7. Provide a description of the potential path along your tree that a given new data point might take and provide its final result. The idea being that we want to know what decisions would be made along the way for that data point to end up at a particular label.**

# **ML model Accuracy**

Perform a comparison of our decision tree model vs. k-NN model: provide a comparison table of accuracy for your various DT ML models and your k-NN ML models. This will be a tool for comparison for you as a technician, but it will also serve as a communication tool to summarize to stakeholders what you tried, what worked best, and why.


# **Business Evaluation** (cailenys)


One of the key objectives of this course is to learn how to implement ML algorithms to tackle business problems and objectives. Please provide us with a complete scenario of how the results of your decision tree classifier might be used.

**Note:** you’ve previously considered some of these questions, the intent with reconsidering them is to iterate on our problem after obtaining results from our ML model:


**1. What might be the motivation for a decision tree classifier?**


**2. What is the “action” that should be taken given the results of this prediction?**

**3. Who is the best immediate person(s) to make use of the results of your prediction?**

**4. What is the potential payoff of this prediction for an organization? (e.g., costs or efficiency).**


**5. Do your ML models’ results change your problem? If so, how and why? If not, please explain.**