<a href="https://colab.research.google.com/github/cboyda/MachineLearning/blob/main/PA5_Team1_W23.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Assignment #5: Linear Regression**

Team member names:

*  Brett Adams
*  Cailenys Leslie
*  Clinton Boyda 
*  Tanvir Hossain
*  Ram Dershan

Dataset: 
[New York City Airbnb Open Data](https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data)

# Data Initialization

In [195]:
# import the libraries we will use
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso
import warnings
warnings.filterwarnings("ignore") # disable warnings when making remote calls

In [149]:
# load both data sets in
original = "https://raw.githubusercontent.com/cboyda/MachineLearning/main/AB_NYC_2019.csv"
df_original = pd.read_csv(original)
additional = "https://raw.githubusercontent.com/cboyda/MachineLearning/main/full_nyc_dataset_cleaned_table-1.csv"
df_additional = pd.read_csv(additional)

In [150]:
# Merge the two datasets with an inner join, validate that no duplicate id values exist for a one to one join
df = pd.merge(df_original, df_additional, how = "inner", on = "id", validate="one_to_one", suffixes=("_original","_additional"))
df.shape

(16005, 22)
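
The `validate="one_to_one"` argument makes `pd.merge` raise an error when either side has duplicate keys. A minimal sketch of that behaviour on toy frames (the frames `left`, `right`, `right_dup` and their values are made up for illustration and are not taken from the Airbnb data):

```python
import pandas as pd

# Toy frames; values are illustrative only
left = pd.DataFrame({"id": [1, 2, 3], "price": [100, 80, 60]})
right = pd.DataFrame({"id": [2, 3, 4], "beds": [1, 2, 1]})
right_dup = pd.DataFrame({"id": [2, 2, 3], "beds": [1, 1, 2]})

# Inner join keeps only the ids present in both frames
merged = pd.merge(left, right, how="inner", on="id", validate="one_to_one")

# Duplicate ids on either side violate the one-to-one check
try:
    pd.merge(left, right_dup, how="inner", on="id", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
```

In this sketch `merged` holds only ids 2 and 3, and the second merge is rejected because id 2 appears twice in `right_dup`.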

In [151]:
df

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type_original,price,...,last_review,reviews_per_month,calculated_host_listings_count,availability_365,property_type,room_type_additional,accommodates,bathrooms_text,bedrooms,beds
0,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,...,2019-05-21,0.38,2,355,Entire rental unit,Entire home/apt,1,1 bath,,1.0
1,5121,BlissArtsSpace!,7356,Garon,Brooklyn,Bedford-Stuyvesant,40.68688,-73.95596,Private room,60,...,2017-10-05,0.40,1,0,Private room in rental unit,Private room,2,,1.0,1.0
2,5178,Large Furnished Room Near B'way,8967,Shunichi,Manhattan,Hell's Kitchen,40.76489,-73.98493,Private room,79,...,2019-06-24,3.47,1,220,Private room in rental unit,Private room,2,1 bath,1.0,1.0
3,5203,Cozy Clean Guest Room - Family Apt,7490,MaryEllen,Manhattan,Upper West Side,40.80178,-73.96723,Private room,79,...,2017-07-21,0.99,1,0,Private room in rental unit,Private room,1,1 shared bath,1.0,1.0
4,5803,"Lovely Room 1, Garden, Best Area, Legal rental",9744,Laurie,Brooklyn,South Slope,40.66829,-73.98779,Private room,89,...,2019-06-24,1.34,3,314,Private room in townhouse,Private room,2,1.5 baths,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16000,36457832,"❥NYC Apt: 4min/subway, 25m/city, 20m/LGA,JFK❥",63272360,Annie Lawrence,Queens,Woodhaven,40.69482,-73.86618,Entire home/apt,85,...,,,6,300,Entire home,Entire home/apt,2,1 bath,1.0,3.0
16001,36471896,Private Bedroom & PRIVATE BATHROOM in Manhattan,23548340,Sarah,Manhattan,Upper East Side,40.77192,-73.95369,Private room,95,...,,,1,2,Private room in rental unit,Private room,2,1 private bath,1.0,1.0
16002,36477307,Brooklyn paradise,241945355,Clement & Rose,Brooklyn,Flatlands,40.63116,-73.92616,Entire home/apt,170,...,,,2,363,Entire rental unit,Entire home/apt,6,1 bath,2.0,2.0
16003,36481615,"Peaceful space in Greenpoint, BK",274298453,Adrien,Brooklyn,Greenpoint,40.72585,-73.94001,Private room,54,...,,,1,15,Private room in rental unit,Private room,2,1 shared bath,1.0,1.0


# Data Cleaning

1. Dropping examples to distill data into higher quality
2. Adding and dropping of features required for linear regression

### 1. Dropping examples

In [152]:
# check value counts for property_type
df['property_type'].value_counts()

Entire rental unit                    6975
Private room in rental unit           5153
Private room in home                   844
Entire home                            513
Entire condo                           418
Private room in townhouse              352
Entire loft                            326
Entire townhouse                       297
Private room in condo                  180
Shared room in rental unit             178
Private room in loft                   149
Entire guest suite                     133
Entire serviced apartment               98
Room in boutique hotel                  68
Room in hotel                           56
Private room in guest suite             37
Entire place                            33
Room in serviced apartment              24
Shared room in loft                     19
Entire guesthouse                       19
Private room                            18
Private room in resort                  17
Private room in bed and breakfast       14
Shared room

Some property types (for example boats, caves and villas) are not relevant to our analysis, so we will remove these examples.

In [153]:
# Check shape before dropping examples
df.shape

(16005, 22)

In [154]:
# drop property types which are outliers
excluded_types = ['Cave', 'Boat', 'Floor', 'Private room in farm stay', 'Entire villa',
                  'Private room in houseboat', 'Private room in villa', 'Private room in tent',
                  'Houseboat']
df = df.drop(df[df['property_type'].isin(excluded_types)].index)

In [155]:
# distill all similar property types into main type
# regex=True is passed explicitly because newer pandas versions treat the pattern as a literal string by default
df['property_type'] = df.property_type.str.replace(r'(^.*Private room.*$)', 'Private Room', regex=True)
df['property_type'] = df.property_type.str.replace(r'(^.*Entire.*$)', 'Entire Unit', regex=True)
df['property_type'] = df.property_type.str.replace(r'(^.*Shared room.*$)', 'Shared Room', regex=True)
df['property_type'] = df.property_type.str.replace(r'(^.*Room in.*$)', 'Room In', regex=True)
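
The effect of these four patterns can be checked on a handful of labels. The sketch below uses a toy Series (`types` is a hypothetical name, and the labels are a small sample rather than the full column):

```python
import pandas as pd

# Toy labels; the real column has many more variants
types = pd.Series(["Private room in loft", "Entire condo",
                   "Shared room in rental unit", "Room in hotel", "Tiny home"])

# Matching is case-sensitive, so "Room in" (capital R) does not catch
# "Private room in ..." or "Shared room in ..."
patterns = [(r"^.*Private room.*$", "Private Room"),
            (r"^.*Entire.*$", "Entire Unit"),
            (r"^.*Shared room.*$", "Shared Room"),
            (r"^.*Room in.*$", "Room In")]
for pattern, replacement in patterns:
    types = types.str.replace(pattern, replacement, regex=True)
```

Labels that match none of the patterns, such as `Tiny home`, are left unchanged.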

In [156]:
# Check shape after dropping examples
df.shape

(15986, 22)

In [157]:
# assess new value counts for property_type
df['property_type'].value_counts()

Entire Unit     8826
Private Room    6780
Shared Room      220
Room In          154
Tiny home          6
Name: property_type, dtype: int64

In [158]:
# drop suffix from room_type_original
df = df.rename(columns = {'room_type_original' : 'room_type'})

In [159]:
# extract the numerical values from the bathrooms_text column
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Half-bath', 0.5, inplace=True)
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Shared half-bath', 0.5, inplace=True)
df['bathrooms_text'].mask(df['bathrooms_text'] == 'Private half-bath', 0.5, inplace=True)
df['bathrooms'] = df['bathrooms_text'].str.extract(r'\b([\d.]+)\b')

In [160]:
# Convert bathroom to float type
df['bathrooms'] = df['bathrooms'].astype(float)
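
A minimal sketch of the extraction on a toy Series. One assumption differs from the cell above: the half-bath labels are mapped to the string `"0.5"` rather than the float `0.5`, since `.str` methods generally return NaN for non-string entries. The names `text`, `half_labels` and `bathrooms` are illustrative only.

```python
import pandas as pd

# Toy values mimicking bathrooms_text; None stands for a missing entry
text = pd.Series(["1 bath", "1.5 baths", "1 shared bath", "Half-bath", None])

# Assumption: half-bath labels become the string "0.5" so that the
# .str accessor can still read them
half_labels = ["Half-bath", "Shared half-bath", "Private half-bath"]
text = text.mask(text.isin(half_labels), "0.5")

# Pull out the first number and convert to float; missing entries stay NaN
bathrooms = text.str.extract(r"\b([\d.]+)\b", expand=False).astype(float)
```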

In [161]:
# drop units that are not able to be rented (availability_365 = 0)
zero_availability = df.loc[df.availability_365 == 0, 'availability_365'].index
# zero availability means the unit is NOT available, so it is best dropped from our model
df.drop(zero_availability,axis=0,inplace=True)

In [162]:
# check for null values
df.isnull().sum()

id                                   0
name                                 3
host_id                              0
host_name                            2
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       1035
reviews_per_month                 1035
calculated_host_listings_count       0
availability_365                     0
property_type                        0
room_type_additional                 0
accommodates                         0
bathrooms_text                       8
bedrooms                           808
beds                                79
bathrooms                           23
dtype: int64

In [163]:
# For bedrooms and bathrooms with null values, fill with zero as properties can have no bedrooms or bathrooms
df[['bedrooms', 'bathrooms']] = df[['bedrooms', 'bathrooms']].fillna(value=0)

In [164]:
# Check null values again to confirm
df.isnull().sum()

id                                   0
name                                 3
host_id                              0
host_name                            2
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       1035
reviews_per_month                 1035
calculated_host_listings_count       0
availability_365                     0
property_type                        0
room_type_additional                 0
accommodates                         0
bathrooms_text                       8
bedrooms                             0
beds                                79
bathrooms                            0
dtype: int64

All other columns with null values are not important for this analysis as these columns will be dropped.

In [165]:
df.shape

(8624, 23)

### 2. Feature Adding / Dropping

In [147]:
# create new dataframe for linear regression 
df_regression = df.copy()

In [166]:
# drop all columns which are not necessary for this analysis
df_regression.drop(['neighbourhood','name','host_name','number_of_reviews','last_review','reviews_per_month',
         'calculated_host_listings_count','id','host_id','latitude','longitude','bathrooms_text',
         'room_type_additional', 'beds'], axis=1, inplace = True)

In [167]:
df_regression.shape

(8624, 9)

In [168]:
numeric_data = df_regression.select_dtypes(include = ['int64' , 'float64'])
numeric_data

Unnamed: 0,price,minimum_nights,availability_365,accommodates,bedrooms,bathrooms
0,225,1,355,1,0.0,1.0
2,79,2,220,2,1.0,1.0
4,89,4,314,2,1.0,1.5
5,140,2,46,3,0.0,1.0
6,215,2,321,4,1.0,1.0
...,...,...,...,...,...,...
16000,85,3,300,2,1.0,1.0
16001,95,1,2,2,1.0,1.0
16002,170,1,363,6,2.0,1.0
16003,54,6,15,2,1.0,1.0


In [169]:
for column in numeric_data.columns:
  fig = px.histogram(numeric_data, x=column, marginal="box")
  fig.show()

Consider how to manage extreme values.

In [170]:
extreme_values_numeric = []
for column in numeric_data.columns:
  # first quartile
  q1 = numeric_data[column].quantile(0.25)

  # third quartile
  q3 = numeric_data[column].quantile(0.75)

  # column maximum (named col_max to avoid shadowing the built-in max)
  col_max = numeric_data[column].quantile(1)

  # interquartile range
  IQR = q3 - q1

  # fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR,
  # clipped to 0 at the bottom and to the column maximum at the top
  bottom_fence = 0 if (q1 - 1.5 * IQR) < 0 else q1 - 1.5 * IQR
  upper_fence = col_max if (q3 + 1.5 * IQR) > col_max else (q3 + 1.5 * IQR)
  extreme_values_numeric.append([column, bottom_fence, upper_fence])


In [171]:
extreme_values_numeric

[['price', 0, 339.0],
 ['minimum_nights', 0, 12.0],
 ['availability_365', 0, 365.0],
 ['accommodates', 0, 7.0],
 ['bedrooms', 1.0, 1.0],
 ['bathrooms', 1.0, 1.0]]
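
The fence calculation can be written as a small helper. This is a sketch only: `tukey_fences` and the toy series are hypothetical names and values, with the same clipping to 0 and to the observed maximum as in the loop above.

```python
import pandas as pd

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences, clipped to 0 and to the observed maximum."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = max(0, q1 - k * iqr)
    upper = min(values.max(), q3 + k * iqr)
    return lower, upper

# Toy series with one obvious extreme value
toy = pd.Series([1, 2, 3, 4, 100])
lower, upper = tukey_fences(toy)
```

Here the quartiles are 2 and 4, so the upper fence is 4 + 1.5 * 2 = 7 and the value 100 falls outside it. When the first and third quartiles are equal, as for `bedrooms` and `bathrooms` above, both fences collapse to that single value and every other value is flagged.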

In [172]:
# look up UPPER/LOWER fence values in extreme_values_numeric
def get_upperfence(name=''):
  for column, _, upper_fence in extreme_values_numeric:
    if column == name:
      return upper_fence

def get_lowerfence(name=''):
  for column, bottom_fence, _ in extreme_values_numeric:
    if column == name:
      return bottom_fence

In [173]:
# calculate percentage of values over our extreme, if under 5% consider dropping
display ('Minimum nights percentage over extreme:')
(numeric_data.loc[numeric_data.minimum_nights > get_upperfence('minimum_nights'), 'minimum_nights'].count() / numeric_data.minimum_nights.count()) * 100 

'Minimum nights percentage over extreme:'

19.237012987012985

In [174]:
# calculate percentage of values over our extreme, if under 5% consider dropping
display ('Accommodates percentage over extreme:')
(numeric_data.loc[numeric_data.accommodates > get_upperfence('accommodates'), 'accommodates'].count() / numeric_data.accommodates.count()) * 100 

'Accommodates percentage over extreme:'

3.316326530612245

In [175]:
# calculate percentage of values over our extreme, if under 5% consider dropping
display ('Bedrooms percentage over extreme:')
(numeric_data.loc[numeric_data.bedrooms > get_upperfence('bedrooms'), 'bedrooms'].count() / numeric_data.bedrooms.count()) * 100 

'Bedrooms percentage over extreme:'

23.017161410018552

In [176]:
# calculate percentage of values over our extreme, if under 5% consider dropping
display ('Bathrooms percentage over extreme:')
(numeric_data.loc[numeric_data.bathrooms > get_upperfence('bathrooms'), 'bathrooms'].count() / numeric_data.bathrooms.count()) * 100 

'Bathrooms percentage over extreme:'

15.248144712430426

In [177]:
# calculate percentage of values over our extreme, if under 5% consider dropping
display ('availability_365 percentage over extreme:')
(numeric_data.loc[numeric_data.availability_365 > get_upperfence('availability_365'), 'availability_365'].count() / numeric_data.availability_365.count()) * 100 

'availability_365 percentage over extreme:'

0.0
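
The five percentage cells above repeat the same calculation. A sketch of a reusable helper, assuming the columns contain no missing values; `pct_over`, `toy` and `toy_fences` are hypothetical names, and only the fence values 7.0 and 12.0 are taken from the output above:

```python
import pandas as pd

def pct_over(values, upper_fence):
    """Percentage of entries strictly greater than upper_fence."""
    return (values > upper_fence).mean() * 100

# Toy data; the fences reuse the values reported above
toy = pd.DataFrame({"accommodates": [1, 2, 2, 4, 10],
                    "minimum_nights": [1, 2, 3, 30, 30]})
toy_fences = {"accommodates": 7.0, "minimum_nights": 12.0}

summary = {col: pct_over(toy[col], fence) for col, fence in toy_fences.items()}
```

One of five toy `accommodates` values and two of five toy `minimum_nights` values exceed their fences.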

In [178]:
# drop rows whose accommodates value is above the upper fence
numeric_data.drop(numeric_data[numeric_data['accommodates'] > get_upperfence('accommodates')].index, inplace = True)

In [179]:
numeric_data.shape

(8338, 6)

In [180]:
# after extreme values dropped, how do histograms look now?
for column in numeric_data.columns:
  fig = px.histogram(numeric_data, x=column, marginal="box")
  fig.show()

1. Remove extreme values from df_regression
2. One-hot encode the categorical features
3. Split the data

In [236]:
df_regression.dtypes

neighbourhood_group     object
room_type               object
price                    int64
minimum_nights           int64
availability_365         int64
property_type           object
accommodates             int64
bedrooms               float64
bathrooms              float64
dtype: object

In [237]:
from pandas.core.arrays import categorical
n = df_regression["neighbourhood_group"].unique()
neighbourhood_groups = pd.CategoricalDtype(categories = n)
df_regression['neighbourhood_group'] = df_regression['neighbourhood_group'].astype(neighbourhood_groups)

In [238]:
room_type = pd.CategoricalDtype(categories = ["Private room", "Entire home/apt", "Shared room"])
df_regression["room_type"] = df_regression['room_type'].astype(room_type)

In [239]:
m = df_regression["property_type"].unique()
property_type = pd.CategoricalDtype(categories = m)
df_regression["property_type"] = df_regression['property_type'].astype(property_type)

In [234]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder

feature = df_regression.loc[:, df_regression.columns != 'price']
label = df_regression['price']

# split the feature columns by dtype (price is excluded because it is the label)
numeric_features = feature.select_dtypes(include = ['int64', 'float64']).columns.tolist()
categorical_features = feature.select_dtypes(exclude = ['int64', 'float64']).columns.tolist()

# one-hot encode the categorical columns and min-max scale the numeric ones;
# make_column_transformer takes each (transformer, columns) pair as its own argument
X_preprocess = make_column_transformer((OneHotEncoder(), categorical_features),
                                       (MinMaxScaler(), numeric_features))
y_preprocess = LabelEncoder()

In [230]:
X_preprocess

In [231]:
y_preprocess

In [232]:
X = X_preprocess.fit_transform(feature)
# y = y_preprocess.fit_transform(label)
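
A minimal sketch of how `OneHotEncoder` and `MinMaxScaler` can be combined with `make_column_transformer`, shown on a toy frame. The names `toy`, `toy_preprocess` and `X_toy` are hypothetical; the column names simply mirror two of the columns in `df_regression`.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

toy = pd.DataFrame({"room_type": ["Private room", "Entire home/apt", "Shared room"],
                    "accommodates": [1, 2, 3]})

# Each transformer is passed as its own (transformer, columns) tuple;
# sparse_threshold=0 asks for a dense array back
toy_preprocess = make_column_transformer(
    (OneHotEncoder(), ["room_type"]),
    (MinMaxScaler(), ["accommodates"]),
    sparse_threshold=0,
)
X_toy = toy_preprocess.fit_transform(toy)
```

The result has three one-hot columns (categories in sorted order) followed by the scaled `accommodates` column.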

# Data Preprocessing # 4

In [None]:
from sklearn.model_selection import train_test_split
X_values = df_clean[features].values # this was wrong with new_cols instead of features previously
y_values = df_clean[label].values
display(X_values,y_values)

array([['Entire home/apt', 'Entire Unit', 1, 0.0, 1.0, 'luxury'],
       ['Private room', 'Private Room', 2, 1.0, 1.0, 'standard'],
       ['Private room', 'Private Room', 2, 1.0, 1.5, 'standard'],
       ...,
       ['Entire home/apt', 'Entire Unit', 6, 2.0, 1.0, 'luxury'],
       ['Private room', 'Private Room', 2, 1.0, 1.0, 'budget'],
       ['Private room', 'Private Room', 2, 1.0, 2.0, 'standard']],
      dtype=object)

['Manhattan', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Brooklyn', ..., 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Manhattan']
Length: 7873
Categories (5, object): ['Bronx', 'Brooklyn', 'Manhattan', 'Queens', 'Staten Island']

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder

#OrdinalEncoder assumes EVERYTHING is categorical
# purpose here is convert strings (categorical) to numbers

X_preprocess = make_column_transformer((OrdinalEncoder(), non_numerical_features), remainder='passthrough')
y_preprocess = LabelEncoder()

In [None]:
display(X_preprocess, y_preprocess)

In [None]:
# Split the data
from sklearn.model_selection import train_test_split

#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#df_train, df_test = train_test_split(df_clean, test_size=0.2)
#X_train, X_test, y_train, y_test = train_test_split(df_train, df_test, test_size=0.2)

X_train = df_train[features]
y_train = df_train[label]

X_test = df_test[features]
y_test = df_test[label]

In [None]:
#df_clean.head()
#X = X_preprocess.fit_transform(df_clean[features]) # causes data leakage = BAD
X_train = X_preprocess.fit_transform(X_train)
X_train

array([[1. , 1. , 0. , 1. , 1. , 1. ],
       [1. , 1. , 2. , 2. , 1. , 1. ],
       [1. , 1. , 0. , 2. , 1. , 1.5],
       ...,
       [0. , 0. , 1. , 4. , 2. , 2. ],
       [0. , 0. , 3. , 5. , 2. , 1. ],
       [1. , 1. , 0. , 2. , 1. , 1. ]])

In [None]:
X_test

Unnamed: 0,room_type,property_type,accommodates,bedrooms,bathrooms,price_group
3476,Private room,Private Room,2,1.0,2.0,budget
7215,Private room,Private Room,1,1.0,2.0,budget
3603,Entire home/apt,Entire Unit,2,1.0,1.0,standard
2075,Entire home/apt,Entire Unit,2,1.0,1.0,standard
213,Private room,Private Room,1,1.0,1.0,premium
...,...,...,...,...,...,...
5292,Private room,Private Room,2,1.0,1.0,standard
1705,Entire home/apt,Entire Unit,3,1.0,1.0,premium
5820,Entire home/apt,Entire Unit,2,1.0,1.0,premium
5410,Private room,Private Room,2,1.0,2.0,standard


In [None]:
X_train.dtype

dtype('float64')

In [None]:
X_test = X_preprocess.transform(X_test)
X_test

array([[1., 1., 0., 2., 1., 2.],
       [1., 1., 0., 1., 1., 2.],
       [0., 0., 3., 2., 1., 1.],
       ...,
       [0., 0., 2., 2., 1., 1.],
       [1., 1., 3., 2., 1., 2.],
       [0., 0., 1., 3., 0., 1.]])
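
The fit-on-train, transform-on-test pattern used above can be sketched on toy data. All names and values below are illustrative; only the use of `OrdinalEncoder` with `remainder="passthrough"` follows the cells above.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

toy_train = pd.DataFrame({"room_type": ["Private room", "Entire home/apt", "Private room"],
                          "bedrooms": [1.0, 2.0, 1.0]})
toy_test = pd.DataFrame({"room_type": ["Entire home/apt"], "bedrooms": [3.0]})

toy_preprocess = make_column_transformer((OrdinalEncoder(), ["room_type"]),
                                         remainder="passthrough")

# fit_transform learns the category-to-integer mapping from the training rows only
toy_train_enc = toy_preprocess.fit_transform(toy_train)
# transform reuses that mapping, so nothing is learned from the test rows
toy_test_enc = toy_preprocess.transform(toy_test)
```

The encoded column comes first and the passthrough column follows it, which matches the column order of the arrays printed above.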

In [None]:
#y = y_preprocess.fit_transform(df_clean[label])
y_train = y_preprocess.fit_transform(y_train) # standard method
#y_train = y_preprocess.fit_transform(df_clean[label]) # ensures ALL labels included

y_train

array([1, 1, 2, ..., 2, 3, 3])

In [None]:
y_train.shape

(6298,)

In [None]:
#y_test = y_preprocess.transform(y_test) breaks because unseen labels
#y_test = y_preprocess.transform(df_clean[label]) # ensures ALL labels included
#y_test = y_preprocess.transform(y_train) # ensures ALL labels included
y_test = y_preprocess.transform(df_test[label]) # ensures ALL labels included
y_test

array([1, 3, 1, ..., 1, 2, 2])
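
The "unseen labels" problem mentioned in the comments can be reproduced in a few lines. This is a sketch with made-up labels; `toy_encoder` and the other names are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
# classes_ are learned from whatever is passed to fit / fit_transform
toy_train_codes = toy_encoder.fit_transform(["Brooklyn", "Manhattan", "Brooklyn"])

try:
    toy_encoder.transform(["Queens"])  # label never seen during fit
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
```

If a rare class happens to land only in the test split, `transform` fails in the same way, which is one reason to consider a stratified split.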

In [None]:
y_test.shape

(1575,)

In [None]:
# from Mohammad
#y_preprocess.fit(df_clean[label])
#y_train=y_preprocess.transform(df_clean[label])
#y_test=y_preprocess.transform(df_clean[label])

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
from sklearn.metrics import classification_report

In [None]:
listdf=%who_ls DataFrame # list all dataframes
listdf

['categorical_data',
 'df',
 'df_additional',
 'df_clean',
 'df_corr',
 'df_corr_viz',
 'df_no_dups',
 'df_original',
 'df_test',
 'df_train',
 'extreme_values',
 'min_nights_7',
 'min_nights_more_7',
 'numeric_data',
 'zero_beds']

In [None]:
# does df_variant exist?
if 'df_variant' in listdf: print("df_variant already exists!")
else:
  #declare new dataframe to record variant accuracy results
  df_variant = pd.DataFrame({'hyperparameter': ['1a_criterion_gini','1b_criterion_entropy','2a_splitter_best','2b_splitter_random','3a_minsamplessplit_one','3b_minsamplessplitone_two','4a_minsamplesleaf_one','4b_minsamplesleaf_two','5a_maxdepth_four','5b_maxdepth_eight'],recording:[0,0,0,0,0,0,0,0,0,0]})


In [None]:
df_variant[recording]=np.nan


In [None]:
# output current df_variant before calculations, note as we change recording variable this will
# populate new columns accuracy_v1 accuracy_v2 etc to correlate with recording constant above
df_variant

Unnamed: 0,hyperparameter,accuracy_v5
0,1a_criterion_gini,
1,1b_criterion_entropy,
2,2a_splitter_best,
3,2b_splitter_random,
4,3a_minsamplessplit_one,
5,3b_minsamplessplitone_two,
6,4a_minsamplesleaf_one,
7,4b_minsamplesleaf_two,
8,5a_maxdepth_four,
9,5b_maxdepth_eight,


# PA #5 - Linear Regression

The objective of this assignment is for you to perform a complete implementation of linear regression using your group’s chosen dataset.

## Question 1:

Use scikit-learn’s sklearn.linear_model.LinearRegression to implement a linear surface for your dataset.
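
A minimal sketch of the `LinearRegression` workflow, assuming synthetic data in place of the preprocessed Airbnb features and `price`. `X_demo`, `y_demo` and the true coefficients 3, -2 and 5 are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and the price label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

linreg = LinearRegression().fit(X_tr, y_tr)
r2_test = linreg.score(X_te, y_te)  # R^2 on the held-out rows
```

`linreg.coef_` and `linreg.intercept_` should land close to the values used to generate the data; on the real dataset they would instead describe how each feature relates to price.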

## Question 2:
Use scikit-learn’s sklearn.linear_model.Ridge to implement linear least squares
with L2 regularization for your dataset using the default parameters.

## Question 3:
Use scikit-learn’s sklearn.linear_model.Ridge to implement linear least squares
with L2 regularization for your dataset; tweak the value of alpha and report your findings.
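
One way to explore alpha is to refit `Ridge` over a grid and watch the coefficients shrink. The sketch below assumes synthetic data; `X_demo`, `y_demo` and the alpha grid are illustrative choices, not values from the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in for the preprocessed features and the price label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

alphas = [0.1, 1.0, 10.0, 100.0]
# L2 norm of the fitted coefficients for each alpha
coef_norms = [np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_) for a in alphas]
```

The default is `alpha=1.0`. Larger values penalise the squared coefficient norm more heavily, so `coef_norms` decreases along the grid.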

## Question 4:
Use scikit-learn’s sklearn.linear_model.Lasso to implement linear least squares
with L1 regularization for your dataset using the default parameters.

## Question 5:
Use scikit-learn’s sklearn.linear_model.Lasso to implement linear least squares
with L1 regularization for your dataset; tweak the value of alpha and report your findings.
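
A comparable sketch for `Lasso`, again assuming synthetic data in which only the first two of five features carry signal. All names and the alpha grid are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Only the first two features carry signal
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

alphas = [0.01, 0.1, 1.0, 10.0]
# Number of non-zero coefficients that survive at each alpha
n_nonzero = [int(np.count_nonzero(Lasso(alpha=a).fit(X_demo, y_demo).coef_))
             for a in alphas]
```

Unlike the L2 penalty, the L1 penalty can set coefficients exactly to zero, so counting the non-zero coefficients per alpha is a simple way to report how aggressively features are being dropped.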