This project uses data from Seattle's Airbnb listings.
The three main questions I would like to answer are the following:
1) Which features matter most in determining the rating a listing receives?
2) My assumption is that price is a major factor in how someone rates a property, followed by bed type and cancellation policy. Does this appear to be true?
3) Is there any predictive value in "NaN" values? Should rows with missing values simply be omitted, or do certain columns hurt the model because they contain many NaN values?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import cross_val_score

In [4]:
listings = pd.read_csv('listings.csv')

The following are the variables I chose to include in the model, based on my own judgment of what might help predict a listing's rating.

The target ("y") variable is therefore the review rating score, review_scores_rating.

In [6]:
listing_vars = ['host_response_time','property_type','room_type','bed_type','price','cleaning_fee','review_scores_rating','cancellation_policy']
yvars = ['review_scores_rating']
# Take an explicit copy so later column assignments do not trigger SettingWithCopyWarning
df = listings[listing_vars].copy()

In [7]:
df.dtypes

host_response_time       object
property_type            object
room_type                object
bed_type                 object
price                    object
cleaning_fee             object
review_scores_rating    float64
cancellation_policy      object
dtype: object

In [8]:
# Strip '$' and thousands separators, then convert to float (missing values stay NaN)
df['price'] = pd.to_numeric(df['price'].str.replace(r'[$,]', '', regex=True))


In [9]:
# Same cleanup as for price: drop '$' and ',' and convert to float
df['cleaning_fee'] = pd.to_numeric(df['cleaning_fee'].str.replace(r'[$,]', '', regex=True))

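
The two cells above apply the same cleanup to price and cleaning_fee, so it could be factored into one helper. The following is a minimal sketch, not the code this notebook ran: `parse_currency` is a hypothetical name, and the toy frame holds made-up values standing in for listings.csv.

```python
import numpy as np
import pandas as pd


def parse_currency(series: pd.Series) -> pd.Series:
    """Turn strings such as '$1,250.00' into floats; missing values stay NaN."""
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    return pd.to_numeric(cleaned)


# Toy stand-in for the listings data (values are made up for illustration)
toy = pd.DataFrame({
    "price": ["$85.00", "$1,250.00", np.nan],
    "cleaning_fee": ["$40.00", np.nan, "$15.00"],
})

# Apply the same cleanup to every currency-formatted column
for col in ["price", "cleaning_fee"]:
    toy[col] = parse_currency(toy[col])

print(toy.dtypes)
```

Because the toy frame is built directly rather than sliced from another DataFrame, assigning to its columns does not raise the copy-of-a-slice warning.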

In [10]:
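# Keep only listings that actually have a rating to predict
df = df.dropna(subset=['review_scores_rating'], axis=0)

# Fill missing numeric values (price, cleaning_fee) with the column mean
intvars = df.select_dtypes(include=['int','float'])
# No effect here: there are fewer than 14 numeric columns, so nothing is dropped
intvars = intvars.drop(intvars.columns[14:], axis=1)
intvars = intvars.fillna(intvars.mean())
df[intvars.columns] = intvars

# Fill missing boolean values with the column median (none of the selected columns is boolean)
boolvars = df.select_dtypes(include = ['bool'])
boolvars = boolvars.fillna(boolvars.median())
df[boolvars.columns] = boolvars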
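
Filling a numeric column with its mean erases the information that the value was missing, which matters for question 3. One possible way to keep that information is to record an indicator column before filling. This is a sketch on made-up values, and the `cleaning_fee_missing` column name is my own assumption, not something in the dataset.

```python
import numpy as np
import pandas as pd

# Toy column with two missing cleaning fees (values are made up)
toy = pd.DataFrame({"cleaning_fee": [40.0, np.nan, 20.0, np.nan]})

# Record which rows were missing BEFORE filling, so the model can still use that fact
toy["cleaning_fee_missing"] = toy["cleaning_fee"].isna().astype(int)

# Mean imputation, as in the cell above
toy["cleaning_fee"] = toy["cleaning_fee"].fillna(toy["cleaning_fee"].mean())

print(toy)
```

If the indicator column ends up with a coefficient near zero, that would suggest missingness in that column carries little signal for the rating.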

In [11]:
# One-hot encode every categorical column: drop the first level of each column
# and add a "<column>_nan" indicator for missing values
catvars = df.select_dtypes(include = ['object'])
df = df.drop(catvars.columns, axis=1)
catvars = pd.get_dummies(catvars, prefix_sep='_', drop_first=True, dummy_na=True)
df = pd.concat([df,catvars],axis=1,sort=False)
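
To make the effect of drop_first and dummy_na concrete, here is a small illustration on a made-up bed_type column (the values are not taken from listings.csv).

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing value (made up for illustration)
toy = pd.DataFrame({"bed_type": ["Real Bed", "Futon", np.nan, "Real Bed"]})

# drop_first removes the first level ("Futon", alphabetically first), which
# becomes the baseline; dummy_na adds an indicator column for missing values
dummies = pd.get_dummies(toy, prefix_sep="_", drop_first=True, dummy_na=True)

print(dummies.columns.tolist())
```

Each remaining coefficient is then read relative to the dropped baseline level, and a column with no missing values still gets a NaN indicator that is zero in every row.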



In [12]:
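y = df[yvars]
X = df.drop(yvars,axis=1)
# The `normalize` argument was removed in scikit-learn 1.2, so it is omitted here
lm_model = LinearRegression()
lm_model.fit(X,y)
# R-squared on the same data the model was fit on (no held-out set)
lm_model.score(X,y)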

0.03911700718836231

The R-squared of this model is quite low: about 0.04, meaning the model explains roughly 4% of the variance in ratings, and that figure is measured on the same data the model was fit on. The model could likely be improved, perhaps by simplifying it. Even so, it may still be useful in a general sense to see which features appear most salient.
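
Since train_test_split and cross_val_score are imported but not used above, a held-out check might look like the following. This is only a sketch: X_demo and y_demo are synthetic stand-ins with a deliberately weak signal, not the listings data, and the split size and random seeds are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for X and y: one weakly informative feature, lots of noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=5.0, size=500)

# Hold out 30% of the rows so the score is not measured on the training data
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# 5-fold cross-validated R-squared as a second, less split-dependent estimate
cv_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")

print(f"train R2: {train_r2:.3f}  test R2: {test_r2:.3f}  CV mean R2: {cv_r2.mean():.3f}")
```

A test or cross-validated R-squared well below the training value would indicate that even the small amount of fit seen above partly reflects overfitting.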

In [13]:
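# Collect each feature's coefficient and rank features by absolute coefficient size
coefs_df = pd.DataFrame()
coefs_df['est_int'] = X.columns
# coef_ has shape (1, n_features) because y is a one-column DataFrame
coefs_df['coefs'] = lm_model.coef_.ravel()
coefs_df['abs_coefs'] = coefs_df['coefs'].abs()
coefs_df.sort_values('abs_coefs', ascending=False)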

                                  est_int      coefs  abs_coefs
11                   property_type_Chalet  -11.13608   11.13608
 4      host_response_time_within an hour   7.666778   7.666778
 5                 host_response_time_nan    6.63866    6.63866
 2        host_response_time_within a day   6.596815   6.596815
 3  host_response_time_within a few hours   6.084845   6.084845
20                     property_type_Yurt   5.099455   5.099455
13                     property_type_Dorm  -3.334618   3.334618
 8                 property_type_Bungalow   3.242935   3.242935
19                property_type_Treehouse   2.888348   2.888348
16                    property_type_Other   2.718404   2.718404


The variables that seem most predictive are host response time and property type, followed by room type (not among the rows shown above).

This is the opposite of my hypothesis, which put price, bed type, and cancellation policy first.
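
One caveat when ranking raw coefficients: price and cleaning_fee are measured in dollars while the dummy variables are 0/1, so their coefficients are not on the same scale. The sketch below uses synthetic data (the distributions, the `is_yurt` dummy, and the effect sizes are all made up) to show how standardizing features first can change the ranking. It is an illustration of the scale issue, not a result about the Airbnb data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
price = rng.normal(loc=100.0, scale=50.0, size=n)  # dollars (made-up distribution)
is_yurt = rng.binomial(1, 0.5, size=n)             # 0/1 dummy (made-up)
rating = 0.02 * price + 1.0 * is_yurt + rng.normal(scale=1.0, size=n)

X_demo = np.column_stack([price, is_yurt])

# Raw coefficients: effect per dollar vs. effect of switching the dummy on
raw_coefs = LinearRegression().fit(X_demo, rating).coef_

# Standardized coefficients: effect per one standard deviation of each feature
X_scaled = StandardScaler().fit_transform(X_demo)
std_coefs = LinearRegression().fit(X_scaled, rating).coef_

print("raw:", raw_coefs)
print("standardized:", std_coefs)
```

In this toy setup the dummy has the larger raw coefficient, but price has the larger standardized one, because a typical change in price spans many dollars.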



Answers to questions: 

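1) Host response time and property type.
2) The opposite: price and the cleaning fee seem to do very little to predict a rating.
3) Most of the NaN columns created by pd.get_dummies seem to have zero or extremely small coefficients, meaning they are perhaps not necessary to the model. A coefficient of exactly zero would be expected for any column with no missing values, since its NaN dummy is then zero in every row. The clear exception in the rows above is host_response_time_nan, with a coefficient of about 6.64, in line with the other host_response_time levels. Perhaps more work could be done on cleaning up NaN values in the categorical variables to improve the model.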
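
To follow up on question 3, two quick checks could be run: the share of missing values per column, and the coefficients of just the NaN indicator columns. The sketch below assumes a coefficient table laid out like coefs_df above (an `est_int` name column and a `coefs` column). All values in both toy frames are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw feature columns (values are made up)
toy = pd.DataFrame({
    "host_response_time": ["within an hour", np.nan, np.nan, "within a day"],
    "bed_type": ["Real Bed", "Real Bed", "Futon", "Real Bed"],
    "cleaning_fee": [40.0, np.nan, 15.0, 25.0],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Toy coefficient table in the same layout as coefs_df (coefficients are made up)
toy_coefs = pd.DataFrame({
    "est_int": ["price", "bed_type_nan", "host_response_time_nan"],
    "coefs": [0.01, 0.0, 6.6],
})

# Keep only the NaN indicator columns produced by dummy_na=True
nan_coefs = toy_coefs[toy_coefs["est_int"].str.endswith("_nan")]
print(nan_coefs)
```

Comparing the two outputs side by side would show whether the NaN indicators with non-trivial coefficients are also the ones whose source columns have many missing values.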