## Welcome! This notebook is divided into the following sections:

1. Data importation, cleaning and preparation.
2. Data visualization/exploratory data analysis.
3. Machine Learning/Price Prediction using a Random forest model.

### The goal is to create a model that can predict Airbnb listing prices in the San Francisco market.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing


from sklearn import set_config
set_config(display='diagram')

pd.set_option('display.float_format', lambda x: '%.3f' % x)

from sklearn.ensemble import RandomForestRegressor

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
# # Import the CSV file as a dataframe:

df = pd.read_csv('/kaggle/input/san-francisco-airbnb-listings/listings.csv')

## Data Cleaning and Preparation

In [None]:
# # Display basic dataframe info (number of columns/rows & data types):

df.info()

In [None]:
# # Take a look at the first 5 rows:

pd.set_option('display.max_columns', None)
df.head()

#### The dataframe contains 106 columns. For price prediction, we're going to utilize only these 4 features:
* Property type
* Room type
* Amenities
* Neighborhood

In [None]:
# # Create a new dataframe that omits all the unneeded columns:

mldf = df[['city', 'neighbourhood_cleansed', 'property_type', 'room_type', 'price', 'amenities']].copy()

In [None]:
# # Show the number of null values in each column:

mldf.isnull().sum()

* #### 10 of the 8111 records contain null values for ***city***.
* #### City is a categorical variable, so it cannot be imputed with a numerical statistic such as the mean.
* #### Because so few records are affected, we will simply drop them.

In [None]:
# # Drop null values; check what datatype each column contains:

mldf.dropna(inplace = True)

mldf.info()

In [None]:
# # Show summary stats for the dataframe:

mldf.describe()

#### There are 8 unique values for ***city***. Let's see what they are.

In [None]:
mldf['city'].value_counts()

#### We want only listings located in San Francisco. 

#### Consolidate the SF name-variant classes into the 'San Francisco' class, and eliminate the non-SF entries:

In [None]:
# # Check whether the 'San Francisco' class with only 3 records contains an extra space in its name:

mldf.loc[mldf['city'].str.contains('San Francisco ')]

In [None]:
# # Correct the 'San Francisco' name variants to read only 'San Francisco':

mldf.loc[mldf['city'] == 'Noe Valley - San Francisco', 'city'] = 'San Francisco'
mldf.loc[mldf['city'] == 'San Francisco, Hayes Valley', 'city'] = 'San Francisco'
mldf.loc[mldf['city'] == '旧金山', 'city'] = 'San Francisco'
mldf.loc[mldf['city'] == 'San Francisco ', 'city'] = 'San Francisco'

In [None]:
# # Confirm the corrections using value_counts method:

mldf['city'].value_counts()

In [None]:
# # Get the index labels of the non-SF records and count them:

index_names = mldf[mldf['city'].isin(['Daly City', 'San Jose', 'Brisbane'])].index
len(index_names)

In [None]:
# # drop non-SF row indices from dataFrame and confirm their removal:

mldf.drop(index_names, inplace = True)

mldf['city'].value_counts()
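The name-variant clean-up above can also be expressed as a single replacement. The sketch below is only an illustration on a toy Series; the values are made up for the example, and it assumes the list of variants is known in advance:

```python
import pandas as pd

# Toy data standing in for the 'city' column (illustrative values only).
city = pd.Series(['San Francisco', 'San Francisco ', 'Noe Valley - San Francisco',
                  '旧金山', 'Daly City'])

# Map every known San Francisco variant to the canonical name.
sf_variants = ['San Francisco ', 'Noe Valley - San Francisco',
               'San Francisco, Hayes Valley', '旧金山']
city_clean = city.replace(sf_variants, 'San Francisco')

# Keep only the San Francisco rows.
sf_only = city_clean[city_clean == 'San Francisco']
print(sf_only.value_counts())
```

Keeping the variants in one list makes it easier to add a new spelling later without writing another `.loc` assignment.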

In [None]:
# # Convert the values in price column to floats:

mldf['price'] = mldf['price'].replace(r'[\$,]', '', regex=True).astype(float)

In [None]:
# # Show summary stats for price column:
mldf['price'].describe()

The min price value is zero. This makes no sense. Let's check for prices < 1:

In [None]:
mldf[mldf['price'] < 1]

In [None]:
# # Drop the row with index 3752 (found above) and confirm its removal:

mldf.drop([3752], inplace = True)
mldf[mldf['price'] < 1]
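Dropping by index label works, but it depends on knowing the label beforehand. As an alternative, here is a minimal sketch that filters by a boolean mask instead. The toy frame and its values are invented for illustration:

```python
import pandas as pd

# Toy frame with one zero-priced listing (illustrative values only).
toy = pd.DataFrame({'price': [120.0, 0.0, 89.0]}, index=[10, 3752, 42])

# Keep only rows with a positive price; no index label needs to be known.
toy = toy[toy['price'] > 0]
print(toy)
```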

In [None]:
# # We can now drop the city column from the dataframe:

mldf = mldf[['neighbourhood_cleansed', 'property_type', 'room_type', 'amenities', 'price']]

#### With that done, let's check the proportion of listings that belong to each room type class:

In [None]:
mldf['room_type'].value_counts(normalize=True)

The room_type variable specifies four classes:
* 'Entire home/apt' and 'Private room', which account for ~ 59% and ~ 36% of observations respectively. 
* 'Shared room' and 'Hotel room', which individually account for 3% or less.

To help simplify and optimize our model, we will discard the small number of records that correspond to the 'Shared room' and 'Hotel room' classes.

In [None]:
# # Keep only the 'Entire home/apt' and 'Private room' listings:

mldf = mldf[mldf['room_type'].isin(['Entire home/apt', 'Private room'])].copy()
mldf['room_type'].value_counts(normalize=True)

The amenities text must be *tokenized* before being fed into the ML model. This will occur later, at the pre-processing step.

The vectorizer's default tokenizer splits text on spaces and punctuation, so a multi-word amenity name would be broken into several tokens. We remove the spaces so that each amenity is counted as a single token:

In [None]:
# # Remove every space; regex=False treats ' ' as a literal character:
mldf['amenities'] = mldf['amenities'].str.replace(' ', '', regex=False)
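To see why the spaces matter, we can inspect the tokens that `TfidfVectorizer` produces with its default settings. This is a small sketch; the amenities string below is illustrative, and its exact format in the dataset is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative amenities string; the exact format in the dataset is an assumption.
amenities = '{TV,"Air conditioning",Wifi}'

# build_analyzer() returns the preprocessing + tokenizing function the vectorizer uses.
analyze = TfidfVectorizer().build_analyzer()

tokens_with_spaces = analyze(amenities)
tokens_without_spaces = analyze(amenities.replace(' ', ''))

print(tokens_with_spaces)     # 'air' and 'conditioning' are separate tokens
print(tokens_without_spaces)  # 'airconditioning' is a single token
```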

In [None]:
# # Compute the proportion of listings that belong to each property type category:

print(mldf['property_type'].value_counts(normalize=True))
print(mldf['property_type'].describe())

The property_type variable specifies 26 unique values. Most of these occur with low frequency and are of little business interest or predictive value because they are so rare or peculiar (e.g., 'Earth house' and 'Dome house').

Moreover, all these classes will have to be one-hot encoded for use in our predictive model. One-hot encoding adds one column per class, so rare classes inflate the feature matrix while contributing very few observations.

To simplify our model and improve its predictive utility, we will discard property types with freq < 5%.

In [None]:
# # Add a column that shows the proportion of listings represented by each property type:

# count the rows in each property type, then divide by the total number of rows
mldf["property_type_freq"] = mldf.groupby('property_type')['property_type'].transform('count').div(len(mldf))
mldf.head()

In [None]:
# # Drop property types with normalized frequencies less than 5% and confirm the change:

mldf = mldf[mldf['property_type_freq'] >= 0.05].copy()

print(mldf['property_type'].value_counts(normalize=True))
print('\n')
print(mldf['property_type'].describe())

The change leaves us with 4 property types.

In [None]:
# drop property_type_freq from dataframe and confirm its removal:

mldf.drop('property_type_freq', axis=1, inplace=True)
mldf.dtypes
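The same filtering can be done without a temporary helper column. The function below is a hypothetical helper (`keep_frequent` is not part of pandas or of this notebook), shown on toy data as one possible approach:

```python
import pandas as pd

def keep_frequent(df, column, min_share):
    """Return the rows whose value in `column` accounts for at least `min_share` of all rows."""
    shares = df[column].value_counts(normalize=True)
    frequent = shares[shares >= min_share].index
    return df[df[column].isin(frequent)].copy()

# Toy data (illustrative): 'Apartment' 60%, 'House' 30%, 'Dome house' 10%.
toy = pd.DataFrame({'property_type': ['Apartment'] * 6 + ['House'] * 3 + ['Dome house']})
filtered = keep_frequent(toy, 'property_type', min_share=0.25)
print(filtered['property_type'].value_counts())
```

A helper like this could be reused for any categorical column that needs a frequency cut-off.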

In [None]:
# # Compute the normalized frequency of each neighborhood in the dataframe:

print(mldf['neighbourhood_cleansed'].value_counts(normalize=True))
print('\n')
print(mldf['neighbourhood_cleansed'].describe())

There are 36 unique neighborhoods. This is a categorical variable, so all the values in this column will have to be one-hot encoded for the predictive model. This raises the same issue as was noted for the property type variable. 

To help our model's performance and predictive utility, we will discard neighborhoods with freq < 2%.

In [None]:
# # Add a column that shows the proportion of listings represented by each neighborhood:

mldf["neigh_freq"] = mldf.groupby('neighbourhood_cleansed')['neighbourhood_cleansed'].transform('count').div(len(mldf))
mldf.head()

In [None]:
mldf = mldf[mldf['neigh_freq'] >= 0.02].copy()
print(mldf['neighbourhood_cleansed'].value_counts(normalize=True))
print('\n')
print(mldf['neighbourhood_cleansed'].describe())

The change leaves us with 20 neighborhoods.

In [None]:
# drop neigh_freq from dataframe and confirm its removal:

mldf.drop('neigh_freq', axis=1, inplace=True)
mldf.dtypes

## Data Visualization

Generate a strip plot representation of the prices for each room type:

In [None]:
fig = plt.figure(figsize=(11,8))
fig.subplots_adjust(hspace=1, wspace=0.75)

plt.subplot(2,1,1)
sns.stripplot(y=mldf[mldf["room_type"] == "Entire home/apt"]['price'], 
              x=mldf[mldf["room_type"] == "Entire home/apt"]['neighbourhood_cleansed'],
              alpha=.5)
plt.title(label='Entire home prices')
plt.xticks(rotation=45, size=8, ha='right')
plt.xlabel("")

plt.subplot(2,1,2)
sns.stripplot(y=mldf[mldf["room_type"] == "Private room"]['price'], 
              x=mldf[mldf["room_type"] == "Private room"]['neighbourhood_cleansed'],
              alpha=.5)
plt.title(label='Private room prices')
plt.xticks(rotation=45, size=8, ha='right')

plt.show()

In each plot, most of the data points are clustered toward the bottom; ***a few extreme values make the overall price range higher than it would otherwise be.***

Plot the distribution of prices with and without log transformation to examine the skewness:

In [None]:
price_skew = mldf['price'].skew(axis = 0, skipna = True)
logprice_skew = (np.log(mldf.price)).skew(axis = 0, skipna = True)

fig, ax = plt.subplots(1, 2, figsize=(16, 4))
# histplot replaces the deprecated distplot; stat='density' keeps the y-axis as a density
chart1 = sns.histplot(mldf.price, kde=True, stat='density', ax=ax[0], color='b')
chart1.set_xlabel('Price', fontsize=12)
chart1.annotate(f'skew: {price_skew:.2f}', xy=(300, 200), xycoords='axes points')
chart2 = sns.histplot(np.log(mldf.price), kde=True, stat='density', ax=ax[1], color='g')
chart2.set_xlabel('log Price', fontsize=12)
chart2.annotate(f'log price skew: {logprice_skew:.2f}', xy=(300, 200), xycoords='axes points')

plt.show()

**Left panel**: The prices are strongly right-skewed, with skewness = 12.02. A symmetric distribution, such as the normal distribution, has skewness = 0.

**Right panel**: Log transformation lowers the skew value, but it is still considerably greater than 0.
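For intuition about how a log transformation affects skewness, here is a small sketch on synthetic data. The "prices" are drawn from a log-normal distribution and are not real listing prices:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" drawn from a log-normal distribution (not real listing data).
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=10_000))

raw_skew = prices.skew()
log_skew = np.log(prices).skew()

print(f'skew of raw prices: {raw_skew:.2f}')
print(f'skew of log prices: {log_skew:.2f}')
```

For exactly log-normal data the log values are normally distributed, so their skew is close to 0. Real prices are unlikely to be exactly log-normal, which is one reason some skew can remain after the transformation.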

Let's see the price ranges for the bottom 98% and top 2% for each room class:

In [None]:
print('Top: price range of the bottom 98% of listings.')
print('Bottom: price range of the top 2% of listings.')
print('\n')
      

print('Private room:')
print('\n')
print(pd.qcut(mldf[mldf["room_type"] == "Private room"]['price'], q=[0, .98, 1]).value_counts(normalize=False))
print('\n')
print('Entire home:')
print('\n')
print(pd.qcut(mldf[mldf["room_type"] == "Entire home/apt"]['price'], q=[0, .98, 1]).value_counts(normalize=False))

Extremely high price values are skewing the distribution. These rare/extraordinary prices are of little predictive value. 

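Trim away roughly the top 2% of each room type to eliminate the high-price extremes, i.e. private rooms priced above US$350 and entire homes priced above US$1,000: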

In [None]:
mldf = mldf[((mldf['price'] <= 350) & (mldf["room_type"] == "Private room")) | ((mldf['price'] <= 1000) & (mldf["room_type"] == "Entire home/apt"))].copy()
mldf.head()
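The cut-offs in the cell above are hard-coded. As a sketch of an alternative, the 98th percentile can be computed per room type and applied directly. The toy data below is made up for illustration:

```python
import pandas as pd

# Toy data (illustrative): 100 private-room prices and 100 entire-home prices.
toy = pd.DataFrame({
    'room_type': ['Private room'] * 100 + ['Entire home/apt'] * 100,
    'price': list(range(1, 101)) + list(range(101, 201)),
})

# 98th-percentile price within each room type, broadcast back to every row.
cutoff = toy.groupby('room_type')['price'].transform(lambda s: s.quantile(0.98))

# Keep listings at or below their room type's cutoff.
trimmed = toy[toy['price'] <= cutoff]
print(trimmed.groupby('room_type')['price'].max())
```

This way the thresholds follow the data if the listings file is updated.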

Regenerate the strip plots:

In [None]:
fig = plt.figure(figsize=(11,8))
fig.subplots_adjust(hspace=1, wspace=0.75)

plt.subplot(2,1,1)
sns.stripplot(y=mldf[mldf["room_type"] == "Entire home/apt"]['price'], 
              x=mldf[mldf["room_type"] == "Entire home/apt"]['neighbourhood_cleansed'],
              alpha=.5)
plt.title(label='Entire home prices')
plt.xticks(rotation=45, size=8, ha='right')
plt.xlabel("")

plt.subplot(2,1,2)
sns.stripplot(y=mldf[mldf["room_type"] == "Private room"]['price'], 
              x=mldf[mldf["room_type"] == "Private room"]['neighbourhood_cleansed'],
              alpha=.5)
plt.title(label='Private room prices')
plt.xticks(rotation=45, size=8, ha='right')

plt.show()

The data points are now more spread out along the price axis and cover it more evenly.

Re-plot the distributions and re-calculate the skew values:

In [None]:
price_skew = mldf['price'].skew(axis = 0, skipna = True)
logprice_skew = (np.log(mldf.price)).skew(axis = 0, skipna = True)

fig, ax = plt.subplots(1, 2, figsize=(16, 4))
# histplot replaces the deprecated distplot; stat='density' keeps the y-axis as a density
chart1 = sns.histplot(mldf.price, kde=True, stat='density', ax=ax[0], color='b')
chart1.set_xlabel('Price', fontsize=12)
chart1.annotate(f'skew: {price_skew:.2f}', xy=(300, 200), xycoords='axes points')
chart2 = sns.histplot(np.log(mldf.price), kde=True, stat='density', ax=ax[1], color='g')
chart2.set_xlabel('log Price', fontsize=12)
chart2.annotate(f'log price skew: {logprice_skew:.2f}', xy=(300, 200), xycoords='axes points')

plt.show()

Eliminating extreme prices from the data set lowered the skew values considerably. 

The log skew value is below 0.5, so we will apply a log transformation to the prices before running our ML algorithm.

With the extreme high prices eliminated, let's graph the median prices by room type and neighborhood:

In [None]:
y1 = mldf[mldf["room_type"] == "Entire home/apt"].groupby('neighbourhood_cleansed')['price'].median()
y2 = mldf[mldf["room_type"] == "Private room"].groupby('neighbourhood_cleansed')['price'].median()

# combine the median prices into a single dataframe
y_df = pd.concat([y1, y2], axis=1)
y_df.columns = ['Entire Hm', 'Pvt Rm']


y_df.index.name = 'index'

# # melt the data frame so it has a "tidy" data format
y_df=y_df.reset_index().melt(id_vars=['index'], var_name="room_type",value_name="Median Price (US$)")
y_df.head()

In [None]:
y_df_hm = y_df[y_df['room_type'] == 'Entire Hm']
y_df_rm = y_df[y_df['room_type'] == 'Pvt Rm']

f, ax = plt.subplots(figsize=(12, 3))
plt.xticks(rotation=50, size=10, ha='right')

# plot each room type from its own subset so the two series do not overlap in one call
plt.bar(height="Median Price (US$)", x="index", data=y_df_hm, label="Entire Home/Apt", alpha = 0.5, color="darkorange")
plt.bar(height="Median Price (US$)", x="index", data=y_df_rm, label="Private Room", alpha = 0.5, color="deepskyblue")

sns.despine(left=True, bottom=True)
plt.legend(['Entire Home/Apt', 'Private Room'], loc=1)
plt.title(label='Median Listing Price x Neighborhood')
plt.xlabel("Neighborhood")
plt.ylabel("Median Price in US$")
plt.show()

As we would expect, the median listing price for a private room is lower than for an entire home, regardless of neighborhood.

Neighborhoods with high median listing prices include: Pacific Heights, Marina, Potrero Hill, Castro, South of Market, Western Addition, and Russian Hill.

Let's graph the data a different way to show the rank of each neighborhood by median listing price:

In [None]:
fig = plt.figure(figsize=(10,8))
fig.subplots_adjust(hspace=1, wspace=0.75)

plt.subplot(1,2,1)
mldf[mldf["room_type"] == "Entire home/apt"].groupby(["neighbourhood_cleansed"])['price'].median().sort_values(ascending=True).plot.barh(color="skyblue")
plt.xticks(rotation=50, size=10, ha='right')
plt.xlabel("Median Price in US$")
plt.title(label='Entire home/apt Median Prices')

plt.subplot(1,2,2)
mldf[mldf["room_type"] == "Private room"].groupby(["neighbourhood_cleansed"])['price'].median().sort_values(ascending=True).plot.barh()
plt.xticks(rotation=50, size=10, ha='right')
plt.xlabel("Median Price in US$")
plt.ylabel("")
plt.title(label='Private room Median Prices')

plt.show()

For both room types, the Marina district commands the highest median price.

Next, graph number of listings per neighborhood, per room type:

In [None]:
fig = plt.figure(figsize=(10,5))
# use one neighborhood order for both plots so the bars line up with the tick labels
neigh_order = mldf['neighbourhood_cleansed'].value_counts().index

ax = sns.countplot(x=mldf[mldf["room_type"] == "Entire home/apt"]['neighbourhood_cleansed'],
                   order=neigh_order,
                   alpha=0.4,
                   color='blue')
plt.xticks(rotation=50, size=10, ha='right')

sns.countplot(x=mldf[mldf["room_type"] == "Private room"]['neighbourhood_cleansed'],
              order=neigh_order,
              ax=ax,
              alpha=0.3,
              color='green')

plt.legend(['Entire Home/Apt', 'Private Room'], loc=1)
plt.title(label='Number of listings per neighborhood')
plt.show()

We can see sharp disparities/imbalances in terms of the number of listings per room type: 
* Inner Sunset has the highest number of entire home listings but comparatively few private room listings.
* Mission has the third highest number of entire home listings but the highest number of private room listings.

## ML/Prediction - Random Forest

In [None]:
# # Log transform the price data to reduce skew:

mldf['price'] = np.log(mldf['price'])
display(mldf)

### Let's try to predict price:

In [None]:
X = mldf.drop('price', axis=1)
y = mldf['price']

#### The predictors must be preprocessed before they can be fed into the ML model:
* The categorical features must be one-hot encoded.
* The amenities text must be vectorized.

#### To accomplish this, we will use two transformers:
* #### One-hot encoder
* #### Term-frequency times inverse document-frequency (TF-IDF) vectorizer

#### **These will convert the categorical and text data to numerical representations that can be used by the algorithm.**

In [None]:
categorical_features = ['neighbourhood_cleansed', 'property_type', 'room_type']
# a single column name (string, not list) so that TfidfVectorizer receives 1-D text
text_feature = 'amenities'

preprocessor = ColumnTransformer(
    transformers=[
        ('text', TfidfVectorizer(), text_feature),
        ('category', OneHotEncoder(handle_unknown='ignore'), categorical_features)])

rfr = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', RandomForestRegressor())])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rfr.fit(X_train, y_train)

### Compare the first 10 predictions from the model to the first 10 observations in the test data:

In [None]:
y_pred = rfr.predict(X_test)

print('First 10 predictions vs first 10 observations:')

Prediction_Vs_Observation = {'pred': y_pred[0:10],
                            'obs': y_test[0:10]}

PvOdf = pd.DataFrame(Prediction_Vs_Observation).reset_index(drop=True)
display(PvOdf)
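Because the target was log-transformed, the predictions and observations in this comparison are log prices. The sketch below shows how such values could be mapped back to dollars with `np.exp`. The numbers are hypothetical and are not output from the model:

```python
import numpy as np

# Hypothetical log-price predictions and observations (illustrative numbers).
log_pred = np.array([4.8, 5.3, 5.0])
log_obs = np.array([5.0, 5.2, 5.1])

# np.exp inverts np.log, returning values on the original dollar scale.
usd_pred = np.exp(log_pred)
usd_obs = np.exp(log_obs)

# Mean absolute error in dollars.
mae_usd = np.mean(np.abs(usd_pred - usd_obs))
print(np.round(usd_pred, 2), np.round(usd_obs, 2), round(mae_usd, 2))
```

An error expressed in dollars is often easier to interpret than one expressed in log units.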

In [None]:
# # Compute the R² scores the model achieved for the training and test data sets:

print(f'The R² score of our Random Forest model on the training data: {rfr.score(X_train, y_train):.3f}.')
print('\n')
print(f'The R² score of our Random Forest model on the test data: {rfr.score(X_test, y_test):.3f}.')
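For a regressor, scikit-learn's `score` method returns the coefficient of determination (R²) rather than a classification accuracy. The sketch below checks this on synthetic data; the data and the names `X_demo`, `y_demo`, and `model` are illustrative and unrelated to the Airbnb listings:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (not the Airbnb listings).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.5, size=300)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xtr, ytr)

# For regressors, .score() is the coefficient of determination R^2.
score_test = model.score(Xte, yte)
r2_test = r2_score(yte, model.predict(Xte))
print(f'train R^2: {model.score(Xtr, ytr):.3f}')
print(f'test  R^2: {score_test:.3f} (r2_score gives {r2_test:.3f})')
```

A gap between the training and test R² is a common sign of overfitting in random forests grown without depth limits.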

### Remarks
#### The model's score on the test data (R², which is what `score` returns for a regressor) is considerably lower than on the training data. This suggests the model overfit the training data.

#### However, it's important to keep in mind that this model is attempting to make numerical price predictions based on numerous combinations of categorical variables:
* 20 neighborhoods
* 4 property types
* 2 room types
* many different amenities

#### This makes for a very challenging prediction task. 
#### One possible strategy for improving the accuracy and usefulness of the model would be to make the target, price, a categorical variable.
#### This could be accomplished by binning the price data into bands/ranges. There would be a loss of resolution, but predicting a price band may be an easier task than predicting an exact price.
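As a rough sketch of the binning idea, `pd.cut` can assign each price to a band. The prices, band edges, and labels below are hypothetical; real edges would need to be chosen from the data:

```python
import pandas as pd

# Toy nightly prices in US$ (illustrative values only).
prices = pd.Series([60, 95, 120, 180, 240, 310, 450, 900])

# Hypothetical band edges and labels; real edges would be chosen from the data.
bins = [0, 100, 200, 400, float('inf')]
labels = ['<=100', '101-200', '201-400', '>400']

price_band = pd.cut(prices, bins=bins, labels=labels)
print(price_band.value_counts().sort_index())
```

The resulting band could then serve as the target for a classifier such as `RandomForestClassifier`, with the same preprocessing applied to the predictors.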
