<a href="https://colab.research.google.com/github/cboyda/MachineLearning/blob/main/PA5_Team1_W23%20(Cai).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Assignment #5: Linear Regression**

Team member names:

*  Brett Adams
*  Cailenys Leslie
*  Clinton Boyda 
*  Tanvir Hossain
*  Ram Dershan

Dataset: 
[New York City Airbnb Open Data](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data)

# Data Initialization

In [196]:
# import the libraries we will use
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
import warnings
warnings.filterwarnings("ignore") # disable warnings when making remote calls

In [112]:
# load both data sets in
original = "https://raw.githubusercontent.com/cboyda/MachineLearning/main/AB_NYC_2019.csv"
df_original = pd.read_csv(original)
additional = "https://raw.githubusercontent.com/cboyda/MachineLearning/main/full_nyc_dataset_cleaned_table-1.csv"
df_additional = pd.read_csv(additional)

In [113]:
# Merge the two datasets with an inner join, validate that no duplicate id values exist for a one to one join
df = pd.merge(df_original, df_additional, how = "inner", on = "id", validate="one_to_one", suffixes=("_original","_additional"))
df.shape

(16005, 22)
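The `validate="one_to_one"` argument is what guards the join above: pandas raises a `MergeError` if either side has repeated keys. A minimal sketch on made-up toy frames (the `left`/`right` names and values are illustrative, not taken from the Airbnb data):

```python
import pandas as pd

# Toy frames standing in for the two listing tables (values are made up).
left = pd.DataFrame({"id": [1, 2, 3], "price": [100, 80, 120]})
right = pd.DataFrame({"id": [2, 3, 3], "beds": [1, 2, 4]})  # id 3 is duplicated

# With a duplicated key on the right, the one-to-one check fails.
try:
    pd.merge(left, right, how="inner", on="id", validate="one_to_one")
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True

# After removing the duplicate key, the same call succeeds.
right_unique = right.drop_duplicates(subset="id", keep="first")
merged = pd.merge(left, right_unique, how="inner", on="id", validate="one_to_one")
print(merge_failed, merged.shape)
```

Because the merge in this notebook completed without an error, each `id` appears at most once in both source tables.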

In [114]:
df

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type_original,price,...,last_review,reviews_per_month,calculated_host_listings_count,availability_365,property_type,room_type_additional,accommodates,bathrooms_text,bedrooms,beds
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,...,2019-05-21,0.38,2,355,Entire rental unit,Entire home/apt,1,1 bath,,1.0
1,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,...,2017-10-05,0.40,1,0,Private room in rental unit,Private room,2,,1.0,1.0
2,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Hell's Kitchen,40.76489,-73.98493,Private room,79,...,2019-06-24,3.47,1,220,Private room in rental unit,Private room,2,1 bath,1.0,1.0
3,5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen,Manhattan,Upper West Side,40.80178,-73.96723,Private room,79,...,2017-07-21,0.99,1,0,Private room in rental unit,Private room,1,1 shared bath,1.0,1.0
4,5803,"Lovely Room 1, Garden, Best Area, Legal rental",9744,Laurie,Brooklyn,South Slope,40.66829,-73.98779,Private room,89,...,2019-06-24,1.34,3,314,Private room in townhouse,Private room,2,1.5 baths,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16000,36457832,"❥NYC Apt: 4min/subway, 25m/city, 20m/LGA,JFK❥",63272360,Annie Lawrence,Queens,Woodhaven,40.69482,-73.86618,Entire home/apt,85,...,,,6,300,Entire home,Entire home/apt,2,1 bath,1.0,3.0
16001,36471896,Private Bedroom & PRIVATE BATHROOM in Manhattan,23548340,Sarah,Manhattan,Upper East Side,40.77192,-73.95369,Private room,95,...,,,1,2,Private room in rental unit,Private room,2,1 private bath,1.0,1.0
16002,36477307,Brooklyn paradise,241945355,Clement & Rose,Brooklyn,Flatlands,40.63116,-73.92616,Entire home/apt,170,...,,,2,363,Entire rental unit,Entire home/apt,6,1 bath,2.0,2.0
16003,36481615,"Peaceful space in Greenpoint, BK",274298453,Adrien,Brooklyn,Greenpoint,40.72585,-73.94001,Private room,54,...,,,1,15,Private room in rental unit,Private room,2,1 shared bath,1.0,1.0


# Data Cleaning

## Data Preprocessing
1. Dropping examples to distill data into higher quality
2. Adding and dropping of features required for linear regression

### 1. Dropping examples

In [115]:
# check value counts for property_type
df['property_type'].value_counts()

Entire rental unit                    6975
Private room in rental unit           5153
Private room in home                   844
Entire home                            513
Entire condo                           418
Private room in townhouse              352
Entire loft                            326
Entire townhouse                       297
Private room in condo                  180
Shared room in rental unit             178
Private room in loft                   149
Entire guest suite                     133
Entire serviced apartment               98
Room in boutique hotel                  68
Room in hotel                           56
Private room in guest suite             37
Entire place                            33
Room in serviced apartment              24
Shared room in loft                     19
Entire guesthouse                       19
Private room                            18
Private room in resort                  17
Private room in bed and breakfast       14
Shared room

There are property types that we do not want to consider in our analysis (such as boats, caves, and villas), so we will remove these examples.

In [116]:
# Check shape before dropping examples
df.shape

(16005, 22)

In [117]:
# drop rare property types that we treat as outliers
outlier_types = ['Cave', 'Boat', 'Floor', 'Private room in farm stay', 'Entire villa',
                 'Private room in houseboat', 'Private room in villa',
                 'Private room in tent', 'Houseboat']
df = df.drop(df[df['property_type'].isin(outlier_types)].index)

In [118]:
# distill all similar property types into main type
# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
df['property_type'] = df.property_type.str.replace(r'(^.*Private room.*$)', 'Private Room', regex=True)
df['property_type'] = df.property_type.str.replace(r'(^.*Entire.*$)', 'Entire Unit', regex=True)
df['property_type'] = df.property_type.str.replace(r'(^.*Shared room.*$)', 'Shared Room', regex=True)
df['property_type'] = df.property_type.str.replace(r'(^.*Room in.*$)', 'Room In', regex=True)

In [119]:
# Check shape after dropping examples
df.shape

(15986, 22)

In [120]:
# assess new value counts for property_type
df['property_type'].value_counts()

Entire Unit     8826
Private Room    6780
Shared Room      220
Room In          154
Tiny home          6
Name: property_type, dtype: int64
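The consolidation above can be sketched on a handful of sample labels. The pattern-to-label mapping mirrors the four replacements in the notebook; the `patterns` dictionary and the sample series are only illustrative. The order matters: the `Room in` pattern is case-sensitive, so it does not touch values already rewritten to `Private Room` or `Shared Room`.

```python
import pandas as pd

# A few sample property types (taken from the value counts shown earlier).
types = pd.Series([
    "Entire rental unit",
    "Private room in townhouse",
    "Shared room in loft",
    "Room in boutique hotel",
    "Tiny home",
])

# Same patterns as the notebook, applied in the same order.
patterns = {
    r"^.*Private room.*$": "Private Room",
    r"^.*Entire.*$": "Entire Unit",
    r"^.*Shared room.*$": "Shared Room",
    r"^.*Room in.*$": "Room In",
}

consolidated = types
for pattern, label in patterns.items():
    # regex=True makes pandas interpret the pattern as a regular expression.
    consolidated = consolidated.str.replace(pattern, label, regex=True)

print(consolidated.tolist())
```

Labels that match none of the patterns, such as `Tiny home`, pass through unchanged, which is why that category still appears in the counts.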

In [121]:
# drop suffix from room_type_original
df = df.rename(columns = {'room_type_original' : 'room_type'})

In [122]:
# extract the numerical values from the bathrooms_text column
# note: the half-bath labels contain no digits for the regex below to match
# (the 0.5 substituted here is a float, not a string), so those rows end up as NaN in 'bathrooms'
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Half-bath', 0.5, inplace=True)
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Shared half-bath', 0.5, inplace=True)
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Private half-bath', 0.5, inplace=True)
df['bathrooms'] = df['bathrooms_text'].str.extract(r'\b([\d.]+)\b')

In [123]:
# Convert bathroom to float type
df['bathrooms'] = df['bathrooms'].astype(float)
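One possible alternative for parsing the bathroom text, shown as a sketch on made-up sample strings rather than as the notebook's method: extract the number first, then set half-bath entries to 0.5 from a boolean mask so they are not lost as missing values. The names `bath_text` and `is_half` are illustrative.

```python
import pandas as pd

# Sample strings in the style of the bathrooms_text column (None = missing).
bath_text = pd.Series(
    ["1 bath", "1.5 baths", "Shared half-bath", "Half-bath", "2 shared baths", None]
)

# Pull out the number where one exists; rows without digits become NaN.
bathrooms = bath_text.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

# Entries that mention "half-bath" carry no digits, so set them to 0.5 explicitly.
is_half = bath_text.str.contains("half-bath", case=False, na=False)
bathrooms = bathrooms.mask(is_half, 0.5)

print(bathrooms.tolist())
```

Only the genuinely missing entry stays `NaN` here, so a later `fillna(0)` would affect just that row.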

In [124]:
# check for null values
df.isnull().sum()

id                                   0
name                                11
host_id                              0
host_name                           10
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       3010
reviews_per_month                 3010
calculated_host_listings_count       0
availability_365                     0
property_type                        0
room_type_additional                 0
accommodates                         0
bathrooms_text                      22
bedrooms                          1562
beds                               109
bathrooms                           52
dtype: int64

In [125]:
# For bedrooms and bathrooms with null values, fill with zero as properties can have no bedrooms or bathrooms
df[['bedrooms', 'bathrooms']] = df[['bedrooms', 'bathrooms']].fillna(value=0)

In [126]:
# Check null values again to confirm
df.isnull().sum()

id                                   0
name                                11
host_id                              0
host_name                           10
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       3010
reviews_per_month                 3010
calculated_host_listings_count       0
availability_365                     0
property_type                        0
room_type_additional                 0
accommodates                         0
bathrooms_text                      22
bedrooms                             0
beds                               109
bathrooms                            0
dtype: int64

All other columns with null values are not important for this analysis as these columns will be dropped.

In [127]:
# Drop bathrooms_text, beds, and the duplicated room_type feature
df.drop(['bathrooms_text', 'room_type_additional', 'beds'], axis = 1, inplace = True)

In [128]:
df.shape

(15986, 20)

In [129]:
df.columns

Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365', 'property_type', 'accommodates', 'bedrooms',
       'bathrooms'],
      dtype='object')

### 2. Feature Adding / Dropping

In [160]:
# set price as the label
label = df['price']
# create new dataframe for linear regression 
df_regression = df.copy()
df_regression

Unnamed: 0,neighbourhood_group,room_type,price,minimum_nights,availability_365,property_type,accommodates,bedrooms,bathrooms,log_price
0,Manhattan,Entire home/apt,225,1,355,Entire Unit,1,0.0,1.0,5.416100
2,Manhattan,Private room,79,2,220,Private Room,2,1.0,1.0,4.369448
4,Brooklyn,Private room,89,4,314,Private Room,2,1.0,1.5,4.488636
5,Brooklyn,Entire home/apt,140,2,46,Entire Unit,3,0.0,1.0,4.941642
6,Brooklyn,Entire home/apt,215,2,321,Entire Unit,4,1.0,1.0,5.370638
...,...,...,...,...,...,...,...,...,...,...
16000,Queens,Entire home/apt,85,3,300,Entire Unit,2,1.0,1.0,4.442651
16001,Manhattan,Private room,95,1,2,Private Room,2,1.0,1.0,4.553877
16002,Brooklyn,Entire home/apt,170,1,363,Entire Unit,6,2.0,1.0,5.135798
16003,Brooklyn,Private room,54,6,15,Private Room,2,1.0,1.0,3.988984


In [131]:
# add log of price to dataframe
df['log_price'] = np.log(df['price'])
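A small sketch of the log transform and its inverse, using three prices that appear in the tables of this notebook. `np.log` is the natural logarithm, so predictions made on the log scale can be mapped back to dollars with `np.exp`. Note that a price of 0 would produce `-inf`; whether any such rows exist is not shown here, so that check is left as an assumption to verify.

```python
import numpy as np

# Three nightly prices taken from the rows displayed in the notebook.
prices = np.array([54.0, 79.0, 225.0])

# Natural log compresses the long right tail of the price distribution.
log_prices = np.log(prices)

# exp is the inverse, so a log-scale prediction converts back to dollars.
recovered = np.exp(log_prices)

print(log_prices.round(6), recovered)
```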

In [132]:
# drop all columns which are not necessary for this analysis

df.drop(['neighbourhood','name','host_name','number_of_reviews','last_review','reviews_per_month',
         'calculated_host_listings_count','id','host_id','latitude','longitude'], axis=1, inplace = True)

In [133]:
# drop units that are not able to be rented (availability_365 = 0)
zero_availability = df.loc[df.availability_365 == 0, 'availability_365'].index
# zero availability means the unit is NOT available, so it is best to drop it from our model
df.drop(zero_availability,axis=0,inplace=True)

In [134]:
df.shape

(8624, 10)

In [178]:
column_names= df_regression.columns
features = column_names[1:]
label = column_names[9]
display(features, label)

Index(['room_type', 'price', 'minimum_nights', 'availability_365',
       'property_type', 'accommodates', 'bedrooms', 'bathrooms', 'log_price'],
      dtype='object')

'log_price'

In [179]:
df_regression.dtypes

neighbourhood_group     object
room_type               object
price                    int64
minimum_nights           int64
availability_365         int64
property_type           object
accommodates             int64
bedrooms               float64
bathrooms              float64
log_price              float64
dtype: object

In [135]:
numeric_data = df.select_dtypes(include=[np.number])

In [180]:
# convert "neighbourhood_group" to numerical

In [137]:
numeric_data

Unnamed: 0,price,minimum_nights,availability_365,accommodates,bedrooms,bathrooms,log_price
0,225,1,355,1,0.0,1.0,5.416100
2,79,2,220,2,1.0,1.0,4.369448
4,89,4,314,2,1.0,1.5,4.488636
5,140,2,46,3,0.0,1.0,4.941642
6,215,2,321,4,1.0,1.0,5.370638
...,...,...,...,...,...,...,...
16000,85,3,300,2,1.0,1.0,4.442651
16001,95,1,2,2,1.0,1.0,4.553877
16002,170,1,363,6,2.0,1.0,5.135798
16003,54,6,15,2,1.0,1.0,3.988984


In [138]:
# any null values? 0 means none found == no need to fix nulls
df.isna().sum()

neighbourhood_group    0
room_type              0
price                  0
minimum_nights         0
availability_365       0
property_type          0
accommodates           0
bedrooms               0
bathrooms              0
log_price              0
dtype: int64

In [185]:
feature_names = column_names[:-1]
label_name = column_names[-1]

X_preprocess = make_column_transformer((OrdinalEncoder(), feature_names), 
                                       remainder='drop')
y_preprocess = LabelEncoder()

In [190]:
X = X_preprocess.fit_transform(df_regression[feature_names])
y = y_preprocess.fit_transform(df_regression[label_name])

In [191]:
display(X, y)

array([[  2.,   0., 200., ...,   0.,   0.,   1.],
       [  2.,   1.,  58., ...,   1.,   1.,   1.],
       [  1.,   1.,  68., ...,   1.,   1.,   2.],
       ...,
       [  1.,   0., 148., ...,   5.,   2.,   1.],
       [  1.,   1.,  33., ...,   1.,   1.,   1.],
       [  2.,   1.,  69., ...,   1.,   1.,   3.]])

array([200,  58,  68, ..., 148,  33,  69])
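`OrdinalEncoder` assigns integer codes to every column it is given, which implies an ordering among nominal categories such as `neighbourhood_group`. A commonly used alternative is to one-hot encode only the categorical columns and pass the numeric ones through unchanged. The sketch below swaps in that technique on a made-up four-row frame; it reuses column names from this notebook, but the `toy` data and the `categorical` list are illustrative.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up frame with two nominal and two numeric columns.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens", "Brooklyn"],
    "room_type": ["Entire home/apt", "Private room", "Private room", "Entire home/apt"],
    "accommodates": [1, 2, 2, 6],
    "bedrooms": [0.0, 1.0, 1.0, 2.0],
})

categorical = ["neighbourhood_group", "room_type"]

# One-hot encode the nominal columns; keep the numeric columns as they are.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categorical),
    remainder="passthrough",
    sparse_threshold=0,
)

X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)  # 3 + 2 indicator columns plus 2 numeric columns
```

With this layout, each category gets its own coefficient in a linear model instead of sharing one slope across arbitrary integer codes.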

# So our business question for this Linear Regression problem: Predict the rental price of a property in a specific neighbourhood with the type of accommodation the user is looking for (bedrooms/accommodates/bathrooms/property_type/room_type)

# **Project Assignment: Linear Regression**
The objective of this assignment is for you to perform a complete implementation of linear
regression using your group’s chosen dataset.


 **1. Use scikit-learn’s *sklearn.linear_model.LinearRegression* to implement a linear surface for your dataset.**

In [213]:
## LinearRegression

# Setup random seed 
np.random.seed(42)

linear_regressor = LinearRegression()
linear_regressor.fit(X, y)

print("Estimated intercept coefficient (bias term) of the linear regression model (b) : ", linear_regressor.intercept_)
# `intercept_` represents the value of the target variable 'y' when all the features have zero values.
print("\n")
print("Estimated coefficients for the linear regression model (w):", linear_regressor.coef_)
# the coefficients 'w' represent the slope or gradient of the linear regression line. Specifically, each coefficient w[i] 
# represents the change in the predicted output y for a one-unit change in the corresponding feature X[i], 
# while holding all other features constant.

Estimated intercept coefficient (bias term) of the linear regression model (b) :  -2.842170943040401e-14


Estimated coefficients for the linear regression model (w): [ 2.49929164e-15 -4.44644321e-14  1.00000000e+00 -6.48786580e-16
  2.22044605e-16  7.93635990e-16  1.64365049e-16  4.25007252e-17
 -8.14832143e-16]


In [202]:
linear_regressor.score(X, y)

1.0
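A score of exactly 1.0 deserves a second look. In the fit above, the third coefficient is 1.0 and the rest are close to zero, and the third feature column is the encoded `price`. Since the label is the encoded `log_price` and the logarithm preserves ordering, both encodings yield the same integer codes (200, 58, 68, ... appear in both `X` and `y`). The sketch below reproduces this kind of target leakage on synthetic data; all values and names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200

# Synthetic data: one genuine predictor plus noise.
bedrooms = rng.integers(0, 4, size=n).astype(float)
noise = rng.normal(0.0, 0.5, size=n)
log_price = 4.0 + 0.3 * bedrooms + noise  # synthetic label

# Leaky design: a copy of the label sits among the features.
X_leaky = np.column_stack([bedrooms, log_price])
r2_leaky = LinearRegression().fit(X_leaky, log_price).score(X_leaky, log_price)

# Honest design: only the genuine predictor.
X_clean = bedrooms.reshape(-1, 1)
r2_clean = LinearRegression().fit(X_clean, log_price).score(X_clean, log_price)

print(round(r2_leaky, 6), round(r2_clean, 3))
```

Dropping `price` from the feature list before encoding would give a score that reflects how well the remaining features explain the label.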

 **2. Use scikit-learn’s *sklearn.linear_model.Ridge* to implement linear least squares with L2 regularization for your dataset using the default parameters.**


In [195]:
## Ridge

# Setup random seed 
np.random.seed(42)

# Split the data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Instantiate and fit the model (on the training set)

ridge_model = Ridge()

ridge_model.fit(X_train, y_train)

# Predict the response for test dataset

y_pred = ridge_model.predict(X_test)

# Check the score of the model (on the test set)
ridge_model.score(X_test, y_test) # near-perfect R^2: the encoded `price` is among the features and the label is derived from it

0.9999999999999994
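With default parameters, `Ridge` uses `alpha=1.0`, which adds an L2 penalty on the coefficients. A minimal sketch on synthetic data (the data, the true weights, and the `alpha=100.0` setting are made up for illustration) shows how the coefficient norm shrinks as the penalty grows, compared with plain least squares:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression problem with known weights.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=300)

# random_state makes the split reproducible without relying on the global seed.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

ols = LinearRegression().fit(X_tr, y_tr)
ridge_default = Ridge().fit(X_tr, y_tr)            # alpha=1.0 by default
ridge_strong = Ridge(alpha=100.0).fit(X_tr, y_tr)  # much heavier penalty

# L2 norm of the coefficient vector for each model.
norms = [np.linalg.norm(m.coef_) for m in (ols, ridge_default, ridge_strong)]
print([round(v, 4) for v in norms])
print(round(ridge_default.score(X_te, y_te), 4))
```

On well-conditioned data like this, the default penalty changes the fit only slightly, which is consistent with the Ridge and LinearRegression scores in this notebook being nearly identical.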