# Lab 2

#### Alan Abadzic, John Girard, Eric Laigaie, Garrett Shankel

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv("NY_Listings_Validated.csv")

### Data Preparation Part 1

*Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.*

In [5]:
# Create the binary Grade variable: 1 if the review score is above 89, else 0
# (rows with a missing rating also fall into 0, since NaN > 89 is False)
def categorise(row):
    if row['Review Scores Rating'] > 89:
        return 1
    else:
        return 0

df['Grade'] = df.apply(categorise, axis=1)


# Filter to only useful columns (done after Grade is created so it can be selected)
data = df[['Host Response Rate', 'Host Is Superhost', 'Host total listings count', 'City', 'Room type',
          'Accommodates', 'Bathrooms', 'Bedrooms', 'Price', 'Minimum nights', 'Maximum nights', 'Availability 365',
          'Number of reviews', 'Reviews per month', 'Grade']].copy()

df['Grade'].value_counts(normalize=True)

1    0.626667
0    0.373333
Name: Grade, dtype: float64

In [6]:
# One-hot Encode
city_one_hot = pd.get_dummies(data['City'])
room_one_hot = pd.get_dummies(data['Room type'])

data = data.drop('City',axis = 1)
data = data.drop('Room type',axis = 1)

data = data.join(city_one_hot)
data = data.join(room_one_hot)


# Map boolean to integer
data["Host Is Superhost"] = data["Host Is Superhost"].astype(int)

In [7]:
# Scale Data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

grade = data['Grade']
to_scale = data.drop("Grade", axis = 1)
cols = to_scale.columns

scaled = scaler.fit_transform(to_scale)

# Keep the original index so that Grade is re-attached to the correct rows
data = pd.DataFrame(scaled, columns = cols, index = to_scale.index)
data['Grade'] = grade
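
One caveat with scaling the full dataset before splitting is that the minimum and maximum of the test rows influence the scaled training values. A minimal sketch of an alternative is shown below, assuming a small made-up frame (`toy`) in place of the real listings: the scaler is fit on the training rows only and then applied to the test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frame standing in for the listing features
toy = pd.DataFrame({
    "Price": [50.0, 100.0, 150.0, 200.0, 250.0],
    "Bedrooms": [1.0, 1.0, 2.0, 3.0, 4.0],
})

# Split first, then learn the min/max from the training rows only
train, test = train_test_split(toy, test_size=0.2, random_state=0)

scaler = MinMaxScaler()
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=toy.columns, index=train.index
)
# The test rows reuse the training min/max, so they may fall outside [0, 1]
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=toy.columns, index=test.index
)
```

Passing `index=` keeps each scaled row aligned with its original row label.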

In [14]:
data.head()

|  | Host Response Rate | Host Is Superhost | Host total listings count | Accommodates | Bathrooms | Bedrooms | Price | Minimum nights | Maximum nights | Availability 365 | ... | Reviews per month | Bronx | Brooklyn | Manhattan | Queens | Staten Island | Entire home/apt | Private room | Shared room | Grade |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 1.0 | 0.0 | 0.004086 | 0.0 | 0.064516 | 0.0 | 0.043043 | 0.000801 | 1.350418e-08 | 0.756164 | ... | 0.072157 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 1 | 1.0 | 0.0 | 0.001021 | 0.0 | 0.064516 | 0.0 | 0.028028 | 0.000801 | 1.396984e-08 | 0.945205 | ... | 0.06278 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0 |
| 2 | 1.0 | 0.0 | 0.016343 | 0.2 | 0.193548 | 0.076923 | 0.08008 | 0.001601 | 1.396984e-08 | 0.972603 | ... | 0.156135 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |
| 3 | 0.7 | 0.0 | 0.001021 | 0.2 | 0.064516 | 0.0 | 0.14014 | 0.000801 | 5.234033e-07 | 0.980822 | ... | 0.027313 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0 |
| 4 | 1.0 | 0.0 | 0.001021 | 0.066667 | 0.064516 | 0.0 | 0.06006 | 0.0 | 5.234033e-07 | 0.986301 | ... | 0.150836 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1 |


### Data Preparation Part 2

*Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).*

Below is a table that describes each variable in the dataset. This includes our two response variables, Price and Grade. All variables are of type float64, except for Grade, which is int64.

#### Variable Table
| Variable | Description |
| :-- | :-- |
| Host Response Rate | The rate that hosts respond to potential customers. |
| Host total listings count | The total number of Airbnb rentals the host owns. |
| Accommodates | The number of people this rental can accommodate. |
| Bathrooms | The number of bathrooms in this rental. |
| Bedrooms | The number of bedrooms in this rental. |
| Price | The base price for one night in this rental. |
| Minimum nights | The minimum number of nights this rental can be booked for. |
| Maximum nights | The maximum number of nights this rental can be booked for. |
| Availability 365 | The proportion of nights in a year that the rental is available. |
| Number of reviews | The total number of reviews the rental has received. |
| Reviews per month | The average number of reviews a rental receives in a month. |
| Host Is Superhost | Binary variable: A "1" indicates the rental is operated by a Superhost. |
| Bronx | One-hot variable: A "1" indicates a Bronx-based rental. |
| Brooklyn | One-hot variable: A "1" indicates a Brooklyn-based rental. |
| Manhattan | One-hot variable: A "1" indicates a Manhattan-based rental. |
| Queens | One-hot variable: A "1" indicates a Queens-based rental. |
| Staten Island | One-hot variable: A "1" indicates a Staten Island-based rental. |
| Entire home/apt | One-hot variable: A "1" indicates this rental is an entire home or apartment. |
| Private room | One-hot variable: A "1" indicates this rental is a private room. |
| Shared room | One-hot variable: A "1" indicates this rental is a shared room. |
| Grade | Binary variable (newly created): A "1" indicates this rental has a Review Scores Rating above 89 (i.e., 90 or higher). |

### Modeling and Evaluation 1

Choose and explain your evaluation metrics that you will use (i.e., accuracy,
precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.

#### Grade
OPTION A:
To measure a model's performance when predicting the Grade classification, we will use the F1 score. We decided on this for two reasons. First, there isn't a large difference in cost between a false negative and a false positive. In the case of a false negative, customers will likely get a rental that exceeds their expectations. In the case of a false positive, customers will have to deal with a worse rental than they might be expecting, but this rental could still have a passing review grade (70+). Second, the imbalanced class distribution (63% A's, 37% Non-A's) leads us to use F1 instead of plain accuracy due to F1's strength when dealing with imbalanced classes.

OPTION B:
To measure a model's performance when predicting the Grade classification, we will use precision (specificity would serve a similar purpose). This is mainly due to the potentially large cost of false positives: when a rental is given a false positive prediction, customers could end up with a rental that falls drastically short of their expectations. Sensitivity (recall) would not capture this cost, since it depends only on true positives and false negatives.
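
As a rough illustration of how the candidate metrics differ, the sketch below computes precision, recall (sensitivity), and F1 with scikit-learn on a small set of made-up labels. The label vectors are hypothetical and are not taken from the listings data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 1 = Grade A (rating above 89), 0 = not an A
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]  # 3 TP, 2 FN, 1 FP, 2 TN

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 5
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here the single false positive lowers precision, the two false negatives lower recall, and F1 reflects both.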

#### Price
To measure a model's performance when predicting Price, we will use mean absolute error (MAE). We chose this over MSE and RMSE for interpretability. MSE squares each error, so it is expressed in squared price units rather than in the units of Price itself. While MSE is just as easy to compute as MAE, keeping the error on the same scale as Price is a nice benefit. RMSE is on the same scale as Price and MAE, so we could realistically use either with no issues, but MAE won out simply because it is the simpler calculation.
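
To make the scale difference concrete, here is a small sketch that computes MAE, MSE, and RMSE on a handful of made-up prices. The values are hypothetical and only serve to show the units of each metric.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical nightly prices and predictions
price_true = np.array([100.0, 150.0, 200.0, 80.0])
price_pred = np.array([110.0, 140.0, 260.0, 80.0])

mae = mean_absolute_error(price_true, price_pred)  # same units as Price
mse = mean_squared_error(price_true, price_pred)   # squared units of Price
rmse = np.sqrt(mse)                                # back on the scale of Price

print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f}")
```

The absolute errors are 10, 10, 60, and 0, so MAE is 20 while MSE is 950. RMSE (about 30.8) is on the same scale as MAE but weights the single large error more heavily.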

### Modeling and Evaluation 2

Choose the method you will use for dividing your data into training and
testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why
your chosen method is appropriate or use more than one method as appropriate. For example, if you are using time series data then you should be using continuous training and testing sets across time.

#### Grade
On top of an 80% train, 20% test split, we will use stratified 10-fold cross validation here. While non-stratified CV could also work, our imbalanced dataset (63% A's, 37% Non-A's) could yield folds that contain very few Non-A's. Stratification avoids this by holding the class proportions constant across folds, leading to more representative training and validation sets.
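
A minimal sketch of this setup follows, assuming synthetic features and a synthetic label vector with roughly the same class balance as Grade. The array names and the random seed are illustrative choices, not part of our pipeline.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data: 1000 rows, 63% positive class
rng = np.random.default_rng(0)
X = rng.random((1000, 5))
y = np.array([1] * 630 + [0] * 370)

# 80/20 split that preserves the class proportions in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified 10-fold CV on the training portion only
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_pos_rates = [
    y_train[val_idx].mean() for _, val_idx in skf.split(X_train, y_train)
]
print(np.round(fold_pos_rates, 3))
```

Each validation fold should show a positive rate close to the overall 63%, which is the property stratification is meant to provide.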

#### Price
On top of an 80% train, 20% test split, we will use 10-fold cross validation here. We will not use stratification because we're not dealing with imbalanced class labels whose proportions should be kept consistent. Since our response is continuous, we do not necessarily have to worry about certain outcomes not being selected for each fold.
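
The sketch below shows one way this could look for a regression task, assuming synthetic data and a plain `LinearRegression` as a placeholder model (not necessarily one of our final models). scikit-learn reports MAE as a negative score, so the sign is flipped.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data with a continuous response
rng = np.random.default_rng(0)
X = rng.random((500, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.1, 500)

# Plain 80/20 split, no stratification for a continuous target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Ordinary (non-stratified) 10-fold CV scored by MAE
kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=kf, scoring="neg_mean_absolute_error",
)
mae_per_fold = -scores  # flip sign: scikit-learn maximizes scores
print(f"mean CV MAE = {mae_per_fold.mean():.3f}")
```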

### Modeling and Evaluation 3

Create three different classification/regression models for each task (e.g., random forest, KNN, and SVM for task one and the same or different algorithms for task two). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric. You must investigate different parameters of the algorithms!

In [17]:
# My thoughts here: For Grade - we can use LogReg, RandomForest, and either KNN/NB
# I'll work on this later, but we could use gridsearch for each algo while swapping out a couple parameters

# For Price - switch to SVM, RandomForest, and Linear Reg?

# Feature selection (lasso, ridge, or elastic net) could be used for Price. Not sure if that would be exceptional work, but
# it's worth a try
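
As a starting point for the grid search idea, here is a hedged sketch for the Grade task. It assumes synthetic data from `make_classification`, logistic regression as the example model, and an arbitrary grid over `C`. The real search would use our prepared features and whichever parameters we decide to vary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the Grade task (about 63% positive class)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.37, 0.63], random_state=0
)

# Search over the regularization strength, scored by F1 with stratified 10-fold CV
grid = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid only
    scoring="f1",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
grid.fit(X, y)

best_c = grid.best_params_["C"]
best_f1 = grid.best_score_
print(f"best C = {best_c}, mean CV F1 = {best_f1:.3f}")
```

The same pattern should carry over to the other candidate models by swapping the estimator and its `param_grid`, and to the Price task by switching to a regressor with `scoring="neg_mean_absolute_error"` and a plain `KFold`.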

### Modeling and Evaluation 4

Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.

### Modeling and Evaluation 5

Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods. You must use statistical comparison techniques—be sure they are appropriate for your chosen method of validation as discussed in unit 7 of the course.

### Modeling and Evaluation 6

Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.

### Deployment

How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?

### Exceptional Work

You have free rein to provide additional analyses. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?