# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. Finally, you will implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, load the data with `pd.read_csv()`, as you have been doing throughout this program, and save it to DataFrame `df`.

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(airbnbDataSet_filename, header=0)

df.head()

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7


## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. I have chosen the Airbnb listings data set.
2. I will be predicting review_scores_value, the review score that guests give a listing for its value. (The overall score is stored separately in review_scores_rating.)
3. This is a supervised learning problem, since the data is already labeled with the review scores we want to predict. It is a regression problem, since review_scores_value is a continuous numerical value.
4. Features that I plan to look at include name, description, neighborhood_overview, etc. For now, I will keep all of them and remove some features while testing.
5. Predicting review_scores_value can help hosts improve the quality of their listings and the overall user experience. Airbnb itself can use these predictions to identify areas for improvement and maintain high standards.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.
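
The inspection techniques described above can be sketched on a small synthetic frame. This is only an illustration under stated assumptions: the frame `toy` and all of its values are made up, and the 1.5 × IQR rule is one common outlier heuristic rather than a method this lab prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few listing columns (all values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "price": np.append(rng.normal(150, 30, size=99), 5000.0),  # one extreme value
    "room_type": rng.choice(["Entire home/apt", "Private room"], size=100),
    "review_scores_value": rng.uniform(3.5, 5.0, size=100).round(2),
})

# Key statistics and the data type of each column.
summary = toy.describe()
types = toy.dtypes

# Flag outliers in price with the 1.5 * IQR rule.
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

# Correlation of each numeric column with the label.
label_corr = toy.select_dtypes(include="number").corr()["review_scores_value"]

print(summary)
print(types)
print(f"{len(outliers)} outlier row(s) in price")
print(label_corr)
```

The same calls can be pointed at `df` once a data set has been chosen.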

In [3]:
# head is a method, so it must be called with parentheses
df.head()

In [4]:
df['review_scores_rating']

0        4.70
1        4.45
2        5.00
3        4.21
4        4.91
         ... 
28017    5.00
28018    5.00
28019    1.00
28020    5.00
28021    5.00
Name: review_scores_rating, Length: 28022, dtype: float64

In [5]:
# Preview the first 15 overall ratings
df['review_scores_rating'].head(15)

0     4.70
1     4.45
2     5.00
3     4.21
4     4.91
5     4.70
6     4.56
7     4.88
8     4.86
9     4.87
10    4.86
11    4.76
12    4.52
13    4.70
14    4.89
Name: review_scores_rating, dtype: float64

In [6]:
# Get rid of unnecessary features
calc_colnames = [col for col in df.columns if 'calculated' in col] 
calc_colnames.append('name')
calc_colnames

['calculated_host_listings_count',
 'calculated_host_listings_count_entire_homes',
 'calculated_host_listings_count_private_rooms',
 'calculated_host_listings_count_shared_rooms',
 'name']

In [7]:
df.drop(columns=calc_colnames, inplace=True)
df.columns

Index(['description', 'neighborhood_overview', 'host_name', 'host_location',
       'host_about', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_listings_count', 'host_total_listings_count',
       'host_has_profile_pic', 'host_identity_verified',
       'neighbourhood_group_cleansed', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights',
       'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'number_of_reviews', 'number_of_reviews_ltm',
       'number_of_reviews_l30d', 'review_scores_rating',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'instan

In [8]:
# YOUR CODE HERE
# look at missing values
missing_values = df.isnull().sum()
print(missing_values)

data_types = df.dtypes
print(data_types)

print(df.shape)

description                       570
neighborhood_overview            9816
host_name                           0
host_location                      60
host_about                      10945
host_response_rate              11843
host_acceptance_rate            11113
host_is_superhost                   0
host_listings_count                 0
host_total_listings_count           0
host_has_profile_pic                0
host_identity_verified              0
neighbourhood_group_cleansed        0
room_type                           0
accommodates                        0
bathrooms                           0
bedrooms                         2918
beds                             1354
amenities                           0
price                               0
minimum_nights                      0
maximum_nights                      0
minimum_minimum_nights              0
maximum_minimum_nights              0
minimum_maximum_nights              0
maximum_maximum_nights              0
minimum_nigh

In [9]:
# Flag the columns that contain at least one missing value
nan_count = df.isnull().sum()
nan_detected = nan_count > 0

nan_detected

description                      True
neighborhood_overview            True
host_name                       False
host_location                    True
host_about                       True
host_response_rate               True
host_acceptance_rate             True
host_is_superhost               False
host_listings_count             False
host_total_listings_count       False
host_has_profile_pic            False
host_identity_verified          False
neighbourhood_group_cleansed    False
room_type                       False
accommodates                    False
bathrooms                       False
bedrooms                         True
beds                             True
amenities                       False
price                           False
minimum_nights                  False
maximum_nights                  False
minimum_minimum_nights          False
maximum_minimum_nights          False
minimum_maximum_nights          False
maximum_maximum_nights          False
minimum_nigh

In [10]:
numerical_cols = ['bedrooms', 'beds']
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].mean())
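
The cell above fills missing values with the column mean. As a small sketch of how that choice compares with median imputation, the made-up `beds` series below contains one unusually large value; the numbers are illustrative and do not come from the Airbnb data.

```python
import numpy as np
import pandas as pd

# Toy column with a skewed distribution and two missing entries (made-up values).
beds = pd.Series([1, 1, 1, 2, 2, 3, 12, np.nan, np.nan], name="beds")

# The mean is pulled upward by the single large value; the median is not.
mean_filled = beds.fillna(beds.mean())
median_filled = beds.fillna(beds.median())

print("mean used for filling:  ", round(beds.mean(), 3))
print("median used for filling:", beds.median())
```

When a column is strongly skewed, the median may be the more representative fill value; for roughly symmetric columns the two choices are close.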

In [11]:
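# Fill the remaining columns with their most frequent value (mode).
# Note: host_response_rate and host_acceptance_rate hold numeric rates but are mode-imputed here.
categorical_cols = ['description', 'neighborhood_overview', 'host_location', 'host_about', 
                    'host_response_rate', 'host_acceptance_rate']
df[categorical_cols] = df[categorical_cols].fillna(df[categorical_cols].mode().iloc[0])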
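
One possible alternative to listing columns by hand is to split them by data type, so that numeric rates receive a numeric fill value and text columns receive the mode. The sketch below assumes a made-up two-column frame `toy`; using the median for numeric columns is an assumption of this example, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one text column, each missing a value.
toy = pd.DataFrame({
    "host_response_rate": [0.8, np.nan, 1.0, 1.0],
    "host_location": ["New York", None, "Brooklyn", "New York"],
})

# Separate the columns by type instead of naming them one by one.
numeric_cols = toy.select_dtypes(include="number").columns
text_cols = toy.columns.difference(numeric_cols)

# Numeric columns: fill with the median. Text columns: fill with the mode.
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())
toy[text_cols] = toy[text_cols].fillna(toy[text_cols].mode().iloc[0])

print(toy)
```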

In [12]:
missing_values_after = df.isnull().sum()
missing_values_after

description                     0
neighborhood_overview           0
host_name                       0
host_location                   0
host_about                      0
host_response_rate              0
host_acceptance_rate            0
host_is_superhost               0
host_listings_count             0
host_total_listings_count       0
host_has_profile_pic            0
host_identity_verified          0
neighbourhood_group_cleansed    0
room_type                       0
accommodates                    0
bathrooms                       0
bedrooms                        0
beds                            0
amenities                       0
price                           0
minimum_nights                  0
maximum_nights                  0
minimum_minimum_nights          0
maximum_minimum_nights          0
minimum_maximum_nights          0
maximum_maximum_nights          0
minimum_nights_avg_ntm          0
maximum_nights_avg_ntm          0
has_availability                0
availability_3

In [13]:
# One-hot encode categorical features
df_encoded = pd.get_dummies(df, columns=['host_is_superhost', 'instant_bookable', 'n_host_verifications'])
df_encoded.drop(columns=['description', 'neighborhood_overview', 'host_name', 'host_location', 
                         'host_about'], inplace=True)
df_encoded.head()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,neighbourhood_group_cleansed,room_type,accommodates,bathrooms,...,n_host_verifications_4,n_host_verifications_5,n_host_verifications_6,n_host_verifications_7,n_host_verifications_8,n_host_verifications_9,n_host_verifications_10,n_host_verifications_11,n_host_verifications_12,n_host_verifications_13
0,0.8,0.17,8.0,8.0,True,True,Manhattan,Entire home/apt,1,1.0,...,0,0,0,0,0,1,0,0,0,0
1,0.09,0.69,1.0,1.0,True,True,Brooklyn,Entire home/apt,3,1.0,...,0,0,1,0,0,0,0,0,0,0
2,1.0,0.25,1.0,1.0,True,True,Brooklyn,Entire home/apt,4,1.5,...,0,0,0,0,0,0,0,0,0,0
3,1.0,1.0,1.0,1.0,True,True,Manhattan,Private room,2,1.0,...,1,0,0,0,0,0,0,0,0,0
4,1.0,1.0,1.0,1.0,True,True,Manhattan,Private room,1,1.0,...,0,0,0,1,0,0,0,0,0,0
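
The encoded frame above still contains string columns such as `room_type`, which most scikit-learn estimators cannot consume directly. A minimal sketch of encoding such a column follows, assuming a made-up three-row frame `toy`; boolean flags are simply cast to integers because they already carry a 0/1 meaning.

```python
import pandas as pd

# Made-up rows that mimic three of the listing columns.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "host_is_superhost": [True, False, True],
    "accommodates": [1, 3, 4],
})

# One-hot encode the string column; each category becomes its own 0/1 column.
encoded = pd.get_dummies(toy, columns=["room_type"], dtype=int)

# A boolean flag needs only a cast, not a pair of dummy columns.
encoded["host_is_superhost"] = encoded["host_is_superhost"].astype(int)

print(encoded.columns.tolist())
```

A count such as `n_host_verifications` can arguably stay numeric as well, since one-hot encoding it discards the ordering of the counts.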


## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

- Yes, I do have a new feature list. I've chosen to remove name, description, neighborhood_overview, host_name, host_location, and host_about, along with the calculated_* host listing count columns.
- The data preparation techniques I used address missing values. More specifically, missing values in the numerical columns (bedrooms, beds) were replaced with the column mean, and missing values in the remaining columns were replaced with the most frequent value. I also applied one-hot encoding to the categorical features.
- I plan to use a Linear Regression model to predict the review value of the Airbnb listing.
- To train my model and analyze its performance, I plan to split the data first, using an 80-20 train-test split. I will fit a linear regression model on the training set and evaluate it on the test set with the mean squared error and R^2. To improve performance, I can try more complex models if necessary and tune their hyperparameters.
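
The plan above can be sketched end to end on synthetic data. This is an illustration only: the features, coefficients, and noise level are made up, and using 5-fold cross-validation on the training split for model selection is one reasonable way to keep the test set for a final check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (made up): three features and a noisy linear label.
rng = np.random.default_rng(123)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# 80-20 split; the test set is held back until the end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Estimate generalization with 5-fold cross-validation on the training set only.
cv_r2 = cross_val_score(LinearRegression(), X_train, y_train, cv=5, scoring="r2")

# Fit on the full training set and check once against the held-out test set.
model = LinearRegression().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print("CV R^2 per fold:", cv_r2.round(3))
print("Test R^2:", round(test_r2, 3))
```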

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [14]:
# YOUR CODE HERE
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [15]:
# Select the features and the label from the imputed DataFrame
feature_list = ['host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
                'review_scores_communication', 'review_scores_location', 'instant_bookable', 'n_host_verifications']
X = df[feature_list]
y = df['review_scores_value']

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [17]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 0.12391867163796871
R-squared: 0.5549155176109803


This seems like a reasonable starting point. The mean squared error is 0.1239 (lower is better), and the R-squared of 0.5549 means that about 55% of the variance in review_scores_value is explained by the model. We can try to improve on this.
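
To make the two metrics concrete, the sketch below computes them by hand with NumPy on four made-up predictions, and then converts the reported MSE into a root mean squared error, which is in the same units as the review score. The arrays `y_true` and `y_hat` are illustrative, not model output.

```python
import numpy as np

# Made-up true scores and predictions.
y_true = np.array([4.5, 5.0, 4.0, 4.8])
y_hat = np.array([4.6, 4.7, 4.2, 4.9])

# Mean squared error: the average squared residual.
mse = np.mean((y_true - y_hat) ** 2)
rmse = np.sqrt(mse)

# R-squared: 1 minus the ratio of residual variation to total variation.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# RMSE implied by the MSE reported for the linear regression model.
reported_rmse = np.sqrt(0.12391867163796871)

print("toy MSE:", mse, "toy RMSE:", rmse, "toy R^2:", r2)
print("reported RMSE:", round(reported_rmse, 3))  # about 0.352
```

An RMSE of about 0.35 means a typical prediction is off by roughly a third of a point on the review scale.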

In [18]:
# We can try a decision tree regressor, tuned with a grid search, to see if things improve.
model = DecisionTreeRegressor()
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
print('Running Grid Search...')
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Best Parameters:", grid_search.best_params_)
print("R-squared:", r2)

print('Done')

Running Grid Search...
Mean Squared Error: 0.12435780907835524
Best Parameters: {'max_depth': 5, 'min_samples_leaf': 2, 'min_samples_split': 5}
R-squared: 0.5533382471498903
Done


The tuned decision tree regressor did not make much of a difference: its MSE (0.1244) and R-squared (0.5533) are nearly identical to those of linear regression. We can try another model.

In [19]:
from sklearn.ensemble import RandomForestRegressor

rf_model = RandomForestRegressor()
param_grid_rf = {
    'n_estimators': [10, 50],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

In [20]:
print('Running Grid Search...')
grid_search_rf = GridSearchCV(rf_model, param_grid_rf, cv=5, scoring='neg_mean_squared_error')
grid_search_rf.fit(X_train, y_train)
best_rf_model = grid_search_rf.best_estimator_
y_pred_rf = best_rf_model.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred_rf))
print("R-squared:", r2_score(y_test, y_pred_rf))
print("Best Parameters:", grid_search_rf.best_params_)
print('Done')

Running Grid Search...
Mean Squared Error: 0.1196056210213966
R-squared: 0.5704069030157803
Best Parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 50}
Done
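
A fitted random forest also exposes `feature_importances_`, which can help with the feature selection step of the plan. The sketch below uses synthetic data with one informative column and one noise column; the column names are hypothetical and the data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (made up): the label depends only on the "informative" column.
rng = np.random.default_rng(1)
X = pd.DataFrame({
    "informative": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 2 * X["informative"] + rng.normal(scale=0.1, size=300)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances sum to 1; larger values mean more influence on splits.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

print(importances)
```

In the notebook, the same pattern could be applied to `best_rf_model` with `feature_list` as the index.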


In [23]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 20]}
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
best_knn = grid_search.best_estimator_

y_pred = best_knn.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print('Mean Squared Error:', mse)
print('R-squared:', r2)
print('Best Parameters:', grid_search.best_params_)

Mean Squared Error: 0.1386847431086471
R-squared: 0.5018795288404838
Best Parameters: {'n_neighbors': 11}
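
K-Nearest Neighbors relies on distances, so features on larger scales can dominate the neighbor search. One possible follow-up, sketched below on synthetic data, is to standardize the features inside a `Pipeline` so that scaling is refit on each cross-validation fold. Whether this would help on the Airbnb features is untested here; the data and the parameter grid are made up.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up): two features on very different scales.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 1, 200), rng.uniform(0, 1000, 200)])
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Scaling happens inside the pipeline, so each fold is scaled on its own training part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsRegressor()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
grid = GridSearchCV(
    pipe,
    {"knn__n_neighbors": [3, 5, 11]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

print("Best Parameters:", grid.best_params_)
print("CV Mean Squared Error:", -grid.best_score_)
```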


In [25]:
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor()
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.01, 0.1],
    'max_depth': [3, 5]
}

In [26]:
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
print('R-squared:', r2_score(y_test, y_pred))
print('Best Parameters:', grid_search.best_params_)

Mean Squared Error: 0.1186110380907948
R-squared: 0.5739791929944298
Best Parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}


I have fit a linear regression model and several other regressors to the training data: a Decision Tree, a Random Forest, K-Nearest Neighbors, and Gradient Boosting. Comparing each model at its best hyperparameters, the Gradient Boosting Regressor performed best at predicting review_scores_value (MSE 0.1186, R-squared 0.5740), with learning_rate = 0.1, max_depth = 3, and n_estimators = 100. The Random Forest was a close second (MSE 0.1196, R-squared 0.5704).

Linear Regression and Decision Tree have similar performance, with R-squared values around 0.55.
K-Nearest Neighbors performed worse than the other models, with an R-squared of 0.5019, which indicates that it explained less of the variance in the target variable. The differences among the top models are small, but Gradient Boosting had the lowest error overall.
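
The test-set metrics printed by the cells above can be gathered into one table for a side-by-side view. The sketch below copies the reported numbers; the column names `test_mse` and `test_r2` are chosen for this example.

```python
import pandas as pd

# Test-set metrics as printed by the model cells in this notebook.
results = pd.DataFrame(
    [
        ("Linear Regression", 0.12391867163796871, 0.5549155176109803),
        ("Decision Tree", 0.12435780907835524, 0.5533382471498903),
        ("Random Forest", 0.1196056210213966, 0.5704069030157803),
        ("K-Nearest Neighbors", 0.1386847431086471, 0.5018795288404838),
        ("Gradient Boosting", 0.1186110380907948, 0.5739791929944298),
    ],
    columns=["model", "test_mse", "test_r2"],
)

# Lowest error first.
results = results.sort_values("test_mse").reset_index(drop=True)

print(results.round(4))
```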