# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [2]:
# YOUR CODE HERE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestClassifier

## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the three data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The airbnb NYC "listings" data set is located in file  `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, use the same method you have been using to load your data using `pd.read_csv()` and save it to DataFrame `df`.

In [3]:
# YOUR CODE HERE
#load data into dataframe
filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
df = pd.read_csv(filename)

## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-drown menu.

In [4]:
#view rows in data frame
df.head(20)

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7
5,"Lovely Room 1, Garden, Best Area, Legal rental","Beautiful house, gorgeous garden, patio, cozy ...",Neighborhood is amazing!<br />Best subways to ...,Laurie,"New York, New York, United States","Hello, \r\nI will be welcoming and helpful, w...",1.0,1.0,True,3.0,...,4.82,4.87,4.73,False,3,1,2,0,1.48,7
6,Only 2 stops to Manhattan studio,Comfortable studio apartment with super comfor...,,Allen & Irina,"New York, New York, United States",We love to travel. When we travel we like to s...,1.0,1.0,True,1.0,...,4.8,4.67,4.57,True,1,1,0,0,1.24,7
7,UES Beautiful Blue Room,Beautiful peaceful healthy home<br /><br /><b>...,"Location: Five minutes to Central Park, Museum...",Cyn,"New York, New York, United States",Capturing the Steinbeck side of life in its Fi...,1.0,1.0,True,3.0,...,4.95,4.84,4.84,True,1,0,1,0,1.82,5
8,"Amazing location! Wburg. Large, bright & tranquil","Large, private loft-like room in a spacious 2-...","- One stop from the East Village, Lower East S...",Joelle,"New York, New York, United States",I have lived in the same apartment in Brooklyn...,1.0,0.0,True,2.0,...,5.0,5.0,5.0,False,2,0,2,0,0.07,5
9,Perfect for Your Parents: Privacy + Garden,"Parents/grandparents coming to town, or are yo...","Residential, village-like atmosphere. Lots of ...",Jane,"New York, New York, United States",I have been an Airbnb host since 2009 -- just ...,1.0,0.99,True,1.0,...,4.91,4.93,4.78,True,2,1,1,0,3.05,8


In [5]:
df.shape
#help visualize data

(28022, 50)

In [6]:
#get non null counts and datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28022 entries, 0 to 28021
Data columns (total 50 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   name                                          28017 non-null  object 
 1   description                                   27452 non-null  object 
 2   neighborhood_overview                         18206 non-null  object 
 3   host_name                                     28022 non-null  object 
 4   host_location                                 27962 non-null  object 
 5   host_about                                    17077 non-null  object 
 6   host_response_rate                            16179 non-null  float64
 7   host_acceptance_rate                          16909 non-null  float64
 8   host_is_superhost                             28022 non-null  bool   
 9   host_listings_count                           28022 non-null 

In [7]:
#get insight into key statistics
df.describe()

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,...,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
count,16179.0,16909.0,28022.0,28022.0,28022.0,28022.0,25104.0,26668.0,28022.0,28022.0,...,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0,28022.0
mean,0.906901,0.791953,14.554778,14.554778,2.874491,1.142174,1.329708,1.629556,154.228749,18.689387,...,4.8143,4.808041,4.750393,4.64767,9.5819,5.562986,3.902077,0.048283,1.758325,5.16951
std,0.227282,0.276732,120.721287,120.721287,1.860251,0.421132,0.700726,1.097104,140.816605,25.569151,...,0.438603,0.464585,0.415717,0.518023,32.227523,26.121426,17.972386,0.442459,4.446143,2.028497
min,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,29.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.01,1.0
25%,0.94,0.68,1.0,1.0,2.0,1.0,1.0,1.0,70.0,2.0,...,4.81,4.81,4.67,4.55,1.0,0.0,0.0,0.0,0.13,4.0
50%,1.0,0.91,1.0,1.0,2.0,1.0,1.0,1.0,115.0,30.0,...,4.96,4.97,4.88,4.78,1.0,1.0,0.0,0.0,0.51,5.0
75%,1.0,1.0,3.0,3.0,4.0,1.0,1.0,2.0,180.0,30.0,...,5.0,5.0,5.0,5.0,3.0,1.0,1.0,0.0,1.83,7.0
max,1.0,1.0,3387.0,3387.0,16.0,8.0,12.0,21.0,1000.0,1250.0,...,5.0,5.0,5.0,5.0,421.0,308.0,359.0,8.0,141.0,13.0


In [8]:
#Check if a given value in any data cell is missing, 
#and sum up the resulting values (True/False) by columns. 
#Assign the results to variable nan_count. Print the results.
nan_count = np.sum(df.isnull(), axis = 0)
nan_count
#price has 0 null values which is good because
#that is what I will be using as my y value


name                                                5
description                                       570
neighborhood_overview                            9816
host_name                                           0
host_location                                      60
host_about                                      10945
host_response_rate                              11843
host_acceptance_rate                            11113
host_is_superhost                                   0
host_listings_count                                 0
host_total_listings_count                           0
host_has_profile_pic                                0
host_identity_verified                              0
neighbourhood_group_cleansed                        0
room_type                                           0
accommodates                                        0
bathrooms                                           0
bedrooms                                         2918
beds                        

In [9]:
condition = nan_count != 0 # look for all columns with missing values

col_names = nan_count[condition].index # get the column names
print(col_names)

nan_cols = list(col_names) # convert column names to list
print(nan_cols)

#this gives me a better understanding of missing values and which values i need to drop

Index(['name', 'description', 'neighborhood_overview', 'host_location',
       'host_about', 'host_response_rate', 'host_acceptance_rate', 'bedrooms',
       'beds'],
      dtype='object')
['name', 'description', 'neighborhood_overview', 'host_location', 'host_about', 'host_response_rate', 'host_acceptance_rate', 'bedrooms', 'beds']


In [10]:
nan_col_types = df[nan_cols].dtypes
nan_col_types
#this displays the columns with data type labels

name                      object
description               object
neighborhood_overview     object
host_location             object
host_about                object
host_response_rate       float64
host_acceptance_rate     float64
bedrooms                 float64
beds                     float64
dtype: object

In [11]:
#compute mean val for missing float nums
mean_response_rate = df['host_response_rate'].mean()
mean_response_rate = df['host_acceptance_rate'].mean()
mean_response_rate = df['bedrooms'].mean()
mean_response_rate = df['beds'].mean()

#fill missing values with mean
df['host_response_rate'].fillna(value=mean_response_rate, inplace=True)
df['host_acceptance_rate'].fillna(value=mean_response_rate, inplace=True)
df['bedrooms'].fillna(value=mean_response_rate, inplace=True)
df['beds'].fillna(value=mean_response_rate, inplace=True)

In [12]:
df = df.drop(columns=['name'], axis=1)
df.drop(columns=['description'], inplace=True)
df.drop(columns=['neighborhood_overview'], inplace=True)
df.drop(columns=['host_location'], inplace=True)
df.drop(columns=['host_about'], inplace=True)
df.drop(columns=['host_name'], inplace=True)
#df.drop(columns=['neighbourhood_group_cleansed'], inplace=True)
# df.drop(columns=['room_type'], inplace=True)
df.drop(columns=['amenities'], inplace=True)
#dropping irrelevant columns and those with null values


In [13]:
# def Encoder(df):
#           columnsToEncode = list(df.select_dtypes(include=['category','object']))
#           le = LabelEncoder()
#           for feature in columnsToEncode:
#               try:
#                   df[feature] = le.fit_transform(df[feature])
#               except:
#                   print('Error encoding '+feature)
#           return df
    


In [14]:
# df['room_type'] = pd.to_numeric(df['room_type'],errors='coerce')
# df['neighbourhood_group_cleansed'] = pd.to_numeric(df['neighbourhood_group_cleansed'],errors='coerce')
# df['amenities'] = pd.to_numeric(df['amenities'],errors='coerce')

In [15]:
# Find columns containing string values
df.dtypes

# Add all column names whose values are of type 'object' to a list named to_encode
to_encode = list(df.select_dtypes(include=['object']).columns)

# Take a closer look at the candidates for one-hot encoding
df[to_encode].nunique()

# # Create an instance of OneHotEncoder
# encoder = OneHotEncoder()

# # Apply one-hot encoding to the data
# encoded_data = encoder.fit_transform(df[to_encode])









neighbourhood_group_cleansed    5
room_type                       4
dtype: int64

In [16]:
# #fill missing object data type cols with 'missing'
df[to_encode] = df[to_encode].fillna('missing')

In [17]:
neighbor_dummies = pd.get_dummies(df.neighbourhood_group_cleansed)
neighbor_dummies

Unnamed: 0,Bronx,Brooklyn,Manhattan,Queens,Staten Island
0,0,0,1,0,0
1,0,1,0,0,0
2,0,1,0,0,0
3,0,0,1,0,0
4,0,0,1,0,0
...,...,...,...,...,...
28017,0,0,0,1,0
28018,0,1,0,0,0
28019,0,1,0,0,0
28020,0,1,0,0,0


In [18]:
# merged = pd.concat([df, neighbor_dummies], axis='columns')
# merged

In [19]:
# amenities_dummies = pd.get_dummies(df.amenities)
# amenities_dummies

In [20]:
room_type_dummies = pd.get_dummies(df.room_type)
room_type_dummies

Unnamed: 0,Entire home/apt,Hotel room,Private room,Shared room
0,1,0,0,0
1,1,0,0,0
2,1,0,0,0
3,0,0,1,0
4,0,0,1,0
...,...,...,...,...
28017,0,0,1,0
28018,1,0,0,0
28019,0,0,1,0
28020,1,0,0,0


In [21]:
merged = pd.concat([df, neighbor_dummies, room_type_dummies], axis='columns')
merged

Unnamed: 0,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,neighbourhood_group_cleansed,room_type,accommodates,...,n_host_verifications,Bronx,Brooklyn,Manhattan,Queens,Staten Island,Entire home/apt,Hotel room,Private room,Shared room
0,0.800000,0.170000,True,8.0,8.0,True,True,Manhattan,Entire home/apt,1,...,9,0,0,1,0,0,1,0,0,0
1,0.090000,0.690000,True,1.0,1.0,True,True,Brooklyn,Entire home/apt,3,...,6,0,1,0,0,0,1,0,0,0
2,1.000000,0.250000,True,1.0,1.0,True,True,Brooklyn,Entire home/apt,4,...,3,0,1,0,0,0,1,0,0,0
3,1.000000,1.000000,True,1.0,1.0,True,True,Manhattan,Private room,2,...,4,0,0,1,0,0,0,0,1,0
4,1.629556,1.629556,True,1.0,1.0,True,True,Manhattan,Private room,1,...,7,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28017,1.000000,1.000000,True,8.0,8.0,True,True,Queens,Private room,2,...,2,0,0,0,1,0,0,0,1,0
28018,0.910000,0.890000,True,0.0,0.0,True,True,Brooklyn,Entire home/apt,6,...,5,0,1,0,0,0,1,0,0,0
28019,0.990000,0.990000,True,6.0,6.0,True,True,Brooklyn,Private room,2,...,2,0,1,0,0,0,0,0,1,0
28020,0.900000,1.000000,True,3.0,3.0,True,True,Brooklyn,Entire home/apt,3,...,7,0,1,0,0,0,1,0,0,0


In [22]:
final = merged.drop(['room_type', 'neighbourhood_group_cleansed'], axis='columns')
final

Unnamed: 0,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,accommodates,bathrooms,bedrooms,...,n_host_verifications,Bronx,Brooklyn,Manhattan,Queens,Staten Island,Entire home/apt,Hotel room,Private room,Shared room
0,0.800000,0.170000,True,8.0,8.0,True,True,1,1.0,1.629556,...,9,0,0,1,0,0,1,0,0,0
1,0.090000,0.690000,True,1.0,1.0,True,True,3,1.0,1.000000,...,6,0,1,0,0,0,1,0,0,0
2,1.000000,0.250000,True,1.0,1.0,True,True,4,1.5,2.000000,...,3,0,1,0,0,0,1,0,0,0
3,1.000000,1.000000,True,1.0,1.0,True,True,2,1.0,1.000000,...,4,0,0,1,0,0,0,0,1,0
4,1.629556,1.629556,True,1.0,1.0,True,True,1,1.0,1.000000,...,7,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28017,1.000000,1.000000,True,8.0,8.0,True,True,2,1.0,1.000000,...,2,0,0,0,1,0,0,0,1,0
28018,0.910000,0.890000,True,0.0,0.0,True,True,6,1.0,2.000000,...,5,0,1,0,0,0,1,0,0,0
28019,0.990000,0.990000,True,6.0,6.0,True,True,2,2.0,1.000000,...,2,0,1,0,0,0,0,0,1,0
28020,0.900000,1.000000,True,3.0,3.0,True,True,3,1.0,1.000000,...,7,0,1,0,0,0,1,0,0,0


## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [23]:
#create features and labels
#label(output) for my training data is: airbnb price (y)
#features(input):(x)
#perform stepwise feature selection
#these are the object type cols that need to have 0 and 1 values to train

In [24]:
model = LinearRegression()

In [25]:
y = final.price
X = final.drop(columns = 'price', axis=1)

In [26]:
#label(output) for my training data is: will room be booked (y)
#features(input): host response rate, host acceptance rate, review scores val, instant bookable (x)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=1234)


In [27]:
#fit the data
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [28]:
model.score(X_test, y_test)

0.4678613519095418

In [29]:
model_pred = model.predict(X_test)

In [30]:
#evaluate linear regression model
lr_rmse = mean_squared_error(y_test, model_pred, squared=False)
lr_r2 = r2_score(y_test, model_pred)
#end solution
print('[LR] Root Mean Squared Error: {0}'.format(lr_rmse))
print('[LR] R2: {0}'.format(lr_r2))

[LR] Root Mean Squared Error: 105.72375208900178
[LR] R2: 0.4678613519095418


In [31]:
#use random forest to find the best possible features
forest = RandomForestRegressor()
forest.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)

In [32]:
forest.score(X_test, y_test)

0.5891455858706676

In [33]:
param_grid = {
    "n_estimators": [1, 10, 30],
    "max_features": [2, 4, 6, 8]
}

grid_search = GridSearchCV(forest, param_grid, cv=5,
                          scoring="neg_mean_squared_error",
                          return_train_score=True)

grid_search.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestRegressor(bootstrap=True, ccp_alpha=0.0,
                                             criterion='mse', max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             max_samples=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators=100, n_jobs=None,
                                             oob_score=False, random_state=None,
                                             verbose=0, warm_start=False),
             iid='deprecated', n_jo

In [34]:
best = grid_search.best_estimator_

In [35]:
best.score(X_test, y_test)

0.5951944938581586