# Categorical and Missing Data

In this session we will work with *Airbnb* listings data. The goal is to predict the overall review score (`review_scores_rating`).

Many entries (i.e., rows) in our dataset have missing attributes.

We will work around this issue using two approaches:
1. *Remove rows with missing values*
2. *Single imputation with median*
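
Before applying them to the real data, the two approaches can be compared on a tiny example. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the Airbnb listings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the Airbnb listings)
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 100.0],
    "b": [10.0, np.nan, 30.0, 40.0, 50.0],
})

# Approach 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Approach 2: replace each missing value with the median of its column
imputed = toy.fillna(toy.median())

print("rows before:", len(toy))
print("rows after dropna:", len(dropped))
print("rows after imputation:", len(imputed))
print(imputed)
```

Dropping rows shrinks the dataset, while imputation keeps every row at the cost of introducing values that were never observed.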

In [None]:
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import cross_val_predict
from sklearn.impute import SimpleImputer

import matplotlib.pyplot as plt
import os

## Load Raw Data

In [None]:
# Load data
if os.path.exists('abnb_listings.csv'):
    df = pd.read_csv('abnb_listings.csv')
else:
    # df = pd.read_csv('http://data.insideairbnb.com/spain/comunidad-de-madrid/madrid/2018-01-17/data/listings.csv.gz', 
    #                   compression='gzip')
    df = pd.read_csv('https://raw.githubusercontent.com/InfoTUNI/joda2022/master/koodiesimerkit/data.csv')
    # Cache a local copy; index=False avoids writing the row index as an extra column
    df.to_csv('abnb_listings.csv', index=False)

df.info()

### First Approach - Removing rows with missing values

In [None]:
# We will focus on three attributes only
df_rem = df[['host_response_time','host_response_rate','review_scores_rating']].copy()

print(df_rem.head())
print(df_rem.host_response_time.unique())

In [None]:
df_rem.host_response_rate = df_rem.host_response_rate.str.strip('%')
df_rem.host_response_rate = pd.to_numeric(df_rem.host_response_rate)

# info() prints its summary directly and returns None, so it is not wrapped in print()
df_rem.info()
print()
print(df_rem.head())

In [None]:
# Remove all rows with null values
df_rem = df_rem.dropna()

In [None]:
# Convert the host_response_time strings to integer codes.

# LabelEncoder assigns one integer per distinct label, in sorted label order
le = preprocessing.LabelEncoder()

arr = le.fit_transform(df_rem.host_response_time)

df_rem.host_response_time = arr

In [None]:
print(arr)
df_rem.head()
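
Linear regression treats the encoded column as an ordinary number, so it helps to know which code stands for which label. The sketch below shows one way to inspect the mapping through the encoder's `classes_` attribute; the labels `slow`, `medium` and `fast` are hypothetical stand-ins, not the actual values of `host_response_time`.

```python
from sklearn import preprocessing

# Hypothetical labels, used only to illustrate the mapping
labels = ["slow", "fast", "medium", "fast"]

toy_le = preprocessing.LabelEncoder()
codes = toy_le.fit_transform(labels)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(toy_le.classes_)}

print(codes)
print(mapping)

# inverse_transform recovers the original labels from the codes
print(list(toy_le.inverse_transform(codes)))
```

The same check on the real data would be `le.classes_`. If the sorted order does not reflect a meaningful ranking of the categories, an explicit mapping or one-hot encoding may be a better fit for a linear model.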

In [None]:
# Perform Linear Regression
lr = linear_model.LinearRegression()

# define labels and data (i.e y and X)
y = df_rem.review_scores_rating
X = df_rem.drop(columns='review_scores_rating')

prediction = cross_val_predict(lr, X, y, cv=10)

fig, ax = plt.subplots(figsize=(20, 10))
ax.scatter(y, prediction, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()


In [None]:
# Print the pairwise Pearson correlation coefficients (the default method of DataFrame.corr)
print(df_rem.corr())


In [None]:
# Print Mean Squared Error and Mean Absolute Error (arguments are y_true, y_pred)
mse = mean_squared_error(y, prediction)
mae = mean_absolute_error(y, prediction)
print("Mean Squared Error {:.2f} \nMean Absolute Error {:.2f}".format(mse, mae))

### Second Approach - Single imputation with median

In [None]:
# Select a subset of attributes
df_imp = df[['review_scores_accuracy','review_scores_cleanliness',
         'review_scores_checkin','review_scores_communication',
         'review_scores_location','review_scores_value',
         'review_scores_rating']].copy()

print(df_imp.isnull().sum())

In [None]:
# Drop rows where all attributes are nan
df_imp.dropna(axis=0, how='all', inplace=True)

print(df_imp.isnull().sum())

In [None]:
# Replace the remaining missing values with the median of each column
df_imp = df_imp.fillna(df_imp.median())

print(df_imp.isnull().sum())
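
Single imputation with the median can also be done with `SimpleImputer`, which is imported at the top of this notebook. The sketch below runs it on a made-up `toy` DataFrame (an assumption for illustration, not the Airbnb data) and wraps the result back into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data (not from the Airbnb listings)
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 10.0],
    "y": [np.nan, 2.0, 4.0, 6.0],
})

# strategy="median" fills each column with the median of its observed values
imputer = SimpleImputer(strategy="median")
filled = pd.DataFrame(imputer.fit_transform(toy),
                      columns=toy.columns, index=toy.index)

# statistics_ stores the per-column medians learned during fit
print(imputer.statistics_)
print(filled)
```

Because the imputer learns its medians in `fit` and applies them in `transform`, it can be fitted on training data only and reused on unseen data, which `fillna` on the whole DataFrame does not separate.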

#### Predicting review scores rating 

In [None]:
# Run Linear Regression
lr = linear_model.LinearRegression()

y = df_imp.review_scores_rating
X = df_imp.drop(columns='review_scores_rating')

prediction = cross_val_predict(lr, X, y, cv=10)

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
ax.scatter(y, prediction, edgecolors=(0, 0, 0))
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

In [None]:
print(df_imp.corr()['review_scores_rating'])

In [None]:
# Print Mean Squared Error and Mean Absolute Error (arguments are y_true, y_pred)
mse = mean_squared_error(y, prediction)
mae = mean_absolute_error(y, prediction)
print("Mean Squared Error {:.2f} \nMean Absolute Error {:.2f}".format(mse, mae))

#### TODO TASK: 
Predict the review scores using the Random Forest Regressor model and evaluate the predictive performance.

Hint: check out the scikit-learn implementation (`sklearn.ensemble.RandomForestRegressor`).
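
As a starting point, here is a minimal sketch of the general pattern on synthetic data. The arrays `X_toy` and `y_toy` and the hyperparameters are assumptions chosen for illustration; they are not the task solution and would need to be replaced with the `X` and `y` built from `df_imp`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_predict

# Synthetic regression data (hypothetical, not the Airbnb listings)
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 3))
y_toy = 2.0 * X_toy[:, 0] + X_toy[:, 1] + rng.normal(0, 0.5, size=200)

# n_estimators and random_state are illustrative choices, not tuned values
rf = RandomForestRegressor(n_estimators=100, random_state=0)

# Same evaluation scheme as for linear regression: 10-fold out-of-fold predictions
rf_prediction = cross_val_predict(rf, X_toy, y_toy, cv=10)

rf_mse = mean_squared_error(y_toy, rf_prediction)
rf_mae = mean_absolute_error(y_toy, rf_prediction)
print("Mean Squared Error {:.2f} \nMean Absolute Error {:.2f}".format(rf_mse, rf_mae))
```

Using the same `cross_val_predict` setup keeps the error figures comparable with those of the linear model.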