# Dataset
In this homework, we will continue working with the *New York City Airbnb Open Data* dataset. You can get it from Kaggle, or download it from the GitHub URL used in the loading cell below if you don't want to sign up for Kaggle.

We'll keep working with the `price` variable, and we'll transform it to a classification task.

In [32]:
import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error, mutual_info_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, Ridge

In [3]:
data = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/AB_NYC_2019.csv')
data.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


# Features
For the rest of the homework, you'll need the features from the previous homework plus two additional ones: `neighbourhood_group` and `room_type`. The full feature set is:

* `neighbourhood_group`,
* `room_type`,
* `latitude`,
* `longitude`,
* `price`,
* `minimum_nights`,
* `number_of_reviews`,
* `reviews_per_month`,
* `calculated_host_listings_count`,
* `availability_365`

Select only them and fill in the missing values with 0.

In [4]:
features = ['neighbourhood_group',
            'room_type',
            'latitude',
            'longitude',
            'minimum_nights',
            'number_of_reviews',
            'reviews_per_month',
            'calculated_host_listings_count',
            'availability_365',
            'price']

# select features
data = data.loc[:, features].fillna(0)

# string data normalization
categorical_columns = list(data.dtypes[data.dtypes == 'object'].index)
for c in categorical_columns:
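    # regex=True is required in recent pandas versions for ' |/' to be treated as a pattern
    data[c] = data[c].str.lower().str.replace(' |/', '_', regex=True)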

data.head()

Unnamed: 0,neighbourhood_group,room_type,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365,price
0,brooklyn,private_room,40.64749,-73.97237,1,9,0.21,6,365,149
1,manhattan,entire_home_apt,40.75362,-73.98377,1,45,0.38,2,355,225
2,manhattan,private_room,40.80902,-73.9419,3,0,0.0,1,365,150
3,brooklyn,entire_home_apt,40.68514,-73.95976,1,270,4.64,1,194,89
4,manhattan,entire_home_apt,40.79851,-73.94399,10,9,0.1,1,0,80


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 10 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   neighbourhood_group             48895 non-null  object 
 1   room_type                       48895 non-null  object 
 2   latitude                        48895 non-null  float64
 3   longitude                       48895 non-null  float64
 4   minimum_nights                  48895 non-null  int64  
 5   number_of_reviews               48895 non-null  int64  
 6   reviews_per_month               48895 non-null  float64
 7   calculated_host_listings_count  48895 non-null  int64  
 8   availability_365                48895 non-null  int64  
 9   price                           48895 non-null  int64  
dtypes: float64(3), int64(5), object(2)
memory usage: 3.7+ MB


# Question 1
What is the most frequent observation (mode) for the column `neighbourhood_group`?

In [6]:
data['neighbourhood_group'].mode()[0]

'manhattan'

In [7]:
data['neighbourhood_group'].value_counts().index[0]

'manhattan'

# Split the data
Split your data into train/val/test sets with a 60%/20%/20% distribution.
Use `Scikit-Learn` for that (the `train_test_split` function) and set the seed (`random_state`) to 42.
Make sure that the target value (`price`) is not in your dataframe.

In [8]:
full_train, test = train_test_split(data, test_size=0.2, random_state=42)
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

len(train), len(val), len(test)

(29337, 9779, 9779)
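
The second split uses `test_size=0.25` because 25% of the remaining 80% is 20% of the full dataset. A minimal sketch of the same two-step split on a toy list of 100 row ids (the `rows` list and the `*_rows` names are illustrative, not part of the homework) makes the resulting 60/20/20 proportions easy to check:

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for the dataframe: 100 row ids (hypothetical data)
rows = list(range(100))

# First split: hold out 20% of all rows for the test set
full_train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

# Second split: 25% of the remaining 80% gives 20% of the total for validation
train_rows, val_rows = train_test_split(full_train_rows, test_size=0.25, random_state=42)

sizes = (len(train_rows), len(val_rows), len(test_rows))
print(sizes)
```

With 100 rows this gives 60, 20 and 20 rows respectively.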

In [9]:
train = train.reset_index(drop=True)
val = val.reset_index(drop=True)
test = test.reset_index(drop=True)

y_train = train.price.values
y_val = val.price.values
y_test = test.price.values

del train['price']
del val['price']
del test['price']

# Question 2
* Create the correlation matrix for the numerical features of your train dataset.
    * In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
* What are the two features that have the biggest correlation in this dataset?

In [10]:
# 'number' selects every numeric dtype regardless of platform-specific integer widths
numeric = list(train.select_dtypes(include='number').columns)
train[numeric].corr()

Unnamed: 0,latitude,longitude,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
latitude,1.0,0.080301,0.027441,-0.006246,-0.007159,0.019375,-0.005891
longitude,0.080301,1.0,-0.06066,0.055084,0.134642,-0.117041,0.083666
minimum_nights,0.027441,-0.06066,1.0,-0.07602,-0.120703,0.118647,0.138901
number_of_reviews,-0.006246,0.055084,-0.07602,1.0,0.590374,-0.073167,0.174477
reviews_per_month,-0.007159,0.134642,-0.120703,0.590374,1.0,-0.048767,0.165376
calculated_host_listings_count,0.019375,-0.117041,0.118647,-0.073167,-0.048767,1.0,0.225913
availability_365,-0.005891,0.083666,0.138901,0.174477,0.165376,0.225913,1.0


In [11]:
train[numeric].corr().abs().unstack().sort_values(ascending = False)[len(numeric):len(numeric)+1].index[0]

('number_of_reviews', 'reviews_per_month')
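
The one-liner above skips the diagonal by slicing past the first `len(numeric)` entries, which assumes that only the diagonal holds correlations of exactly 1. As a hedged alternative sketch, the diagonal and the mirrored lower triangle can be masked out explicitly before taking the maximum. The `toy` dataframe and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'a' and 'b' move together, 'c' does not
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.1, 6.2, 7.9, 10.0],
    'c': [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr().abs()

# Keep only the strict upper triangle: no diagonal, no duplicated pairs
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# idxmax skips missing values, so masked cells are never selected
top_pair = pairs.idxmax()
print(top_pair)
```

On the homework data, the same steps would be applied to `train[numeric].corr()` instead of `toy.corr()`.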

# Make price binary
We need to turn the `price` variable from numeric into binary.
Let's create a variable `above_average` which is 1 if the `price` is above (or equal to) 152.

In [12]:
above_average = (y_train >= 152).astype(int)

# Question 3
* Calculate the mutual information score with the (binarized) `price` for the two categorical variables that we have. Use the training set only.
* Which of these two variables has the bigger score?
* Round it to 2 decimal digits using `round(score, 2)`

In [13]:
categoric = list(train.select_dtypes(include='object').columns)
for col in categoric:
    print(f'{col:<20} - {mutual_info_score(train[col], above_average):.2f}')

neighbourhood_group  - 0.05
room_type            - 0.14
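
To build intuition for these scores, here is a small sketch on invented data: a categorical variable that fully determines a balanced binary target reaches a mutual information of ln 2 (about 0.69), while a variable that is independent of the target scores 0. The lists below are hypothetical and unrelated to the Airbnb data.

```python
import math
from sklearn.metrics import mutual_info_score

# Hypothetical binary target and two categorical variables
target = [0, 0, 1, 1]
informative = ['x', 'x', 'y', 'y']      # determines the target completely
uninformative = ['x', 'y', 'x', 'y']    # independent of the target

mi_informative = mutual_info_score(informative, target)
mi_uninformative = mutual_info_score(uninformative, target)

print(round(mi_informative, 2), round(mi_uninformative, 2))
print(round(math.log(2), 2))  # upper bound for a balanced binary target
```

Against that scale, the scores of 0.05 and 0.14 above indicate that `room_type` tells us more about the binarized price than `neighbourhood_group` does, though neither determines it.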


# Question 4
* Now let's train a logistic regression
* Remember that we have two categorical variables in the data. Include them using one-hot encoding.
* Fit the model on the training dataset.
    * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    * `model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

In [15]:
# binarized target for the validation set (same threshold as for training)
above_average_val = (y_val >= 152).astype(int)

def get_accuracy(features):
    # one-hot encode the categorical features; numerical ones pass through unchanged
    dv = DictVectorizer(sparse=False)

    train_dict = train[features].to_dict(orient='records')
    val_dict = val[features].to_dict(orient='records')

    X_train = dv.fit_transform(train_dict)
    X_val = dv.transform(val_dict)

    # train logistic regression on the binarized training target
    model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
    model.fit(X_train, above_average)

    # accuracy = share of correct predictions on the validation set
    predictions = model.predict(X_val)
    accuracy = (predictions == above_average_val).mean()

    return accuracy

original_accuracy = get_accuracy(categoric + numeric)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [16]:
print(f'Accuracy on validation dataset: {original_accuracy:.2f}')

Accuracy on validation dataset: 0.79
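
The convergence warning above suggests either raising `max_iter` or scaling the data. The following is only a sketch of the scaling option on synthetic data, not the homework's prescribed setup: it wraps the same `LogisticRegression` parameters in a pipeline with `StandardScaler`. The arrays `X_toy` and `y_toy` are invented, and on the real data scaling changes the model, so the accuracy may differ from the expected answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): one informative column and one
# large-scale noise column, mimicking features on very different scales
rng = np.random.default_rng(42)
n = 200
signal = rng.normal(size=n)
noise = rng.normal(scale=1000.0, size=n)
X_toy = np.column_stack([signal, noise])
y_toy = (signal > 0).astype(int)

# Standardize every column before it reaches the lbfgs solver
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', C=1.0, random_state=42),
)
scaled_model.fit(X_toy, y_toy)

toy_accuracy = scaled_model.score(X_toy, y_toy)
print(f'{toy_accuracy:.2f}')
```

In the notebook, the same pipeline could be fitted on the `X_train` matrix produced by `DictVectorizer`, with the caveat that the one-hot columns would then be standardized as well.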


# Question 5
* We have 9 features: 7 numerical features and 2 categorical.
* Let's find the least useful one using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
* Which of the following features has the smallest difference?
    * `neighbourhood_group`
    * `room_type`
    * `number_of_reviews`
    * `reviews_per_month`

> **note**: the difference doesn't have to be positive

In [30]:
features = categoric + numeric
differences = pd.Series(dtype='float64')

for feature in features:
    features_copy = features.copy()
    features_copy.remove(feature)
    accuracy = get_accuracy(features_copy)
    difference = original_accuracy - accuracy
    differences[feature] = difference

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [31]:
differences.sort_values().index[0]

'number_of_reviews'

# Question 6
* For this question, we'll see how to use a linear regression model from `Scikit-Learn`
* We'll need to use the original column `price`. Apply the logarithmic transformation to this column.
* Fit the `Ridge` regression model on the training data.
* This model has a parameter `alpha`. Let's try the following values: `[0, 0.01, 0.1, 1, 10]`
* Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest `alpha`.

In [43]:
features = categoric + numeric

dv = DictVectorizer(sparse=False)

train_dict = train[features].to_dict(orient='records')
val_dict = val[features].to_dict(orient='records')

X_train = dv.fit_transform(train_dict)
X_val = dv.transform(val_dict)
y_train_log = np.log1p(y_train)

alphas = [0, 0.01, 0.1, 1, 10]

scores = pd.Series(dtype='float64')

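for alpha in alphas:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train_log)
    predictions = model.predict(X_val)
    # RMSE on the original price scale: predictions are mapped back with expm1.
    # np.sqrt(MSE) replaces squared=False, which is deprecated in recent Scikit-Learn releases.
    score = np.sqrt(mean_squared_error(y_val, np.expm1(predictions)))
    scores[str(alpha)] = score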

In [45]:
scores.sort_values().index[0]

'0.01'
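
The cell above measures RMSE on the original price scale by mapping predictions back with `expm1`. RMSE can also be measured directly on the log scale, by comparing predictions with `np.log1p(y_val)`. The two give numbers of very different magnitude and may rank the alphas differently. A small sketch with invented prices (the arrays below are hypothetical) shows both computations side by side:

```python
import numpy as np

# Hypothetical true prices and model predictions on the log1p scale
y_true = np.array([100.0, 200.0, 50.0])
pred_price = np.array([110.0, 180.0, 55.0])
pred_log = np.log1p(pred_price)

# RMSE on the log scale: compare log1p(target) with the raw predictions
rmse_log = np.sqrt(np.mean((np.log1p(y_true) - pred_log) ** 2))

# RMSE on the price scale: undo the transformation first
rmse_price = np.sqrt(np.mean((y_true - np.expm1(pred_log)) ** 2))

print(round(rmse_log, 3), round(rmse_price, 3))
```

Which variant the question intends is not stated outright. Rounding to 3 decimal digits is more natural for the log-scale value, so it may be worth comparing both.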
