# Airbnb Recommender System

## Overview

# Business Understanding

With an overabundance of information at our fingertips, decision-making can be overwhelming, especially when planning a vacation. Recommender systems help consumers find products tailored to their individual tastes. In this project, I will build a recommender system for Airbnb listings in Asheville, North Carolina.

The stakeholders for this project are the engineering team at Airbnb, who would put the recommender system into production, and the executives who would approve the project.

# Data Understanding

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [2]:
# Import listings data and preview first 5 rows
listings_df = pd.read_csv('data/listings.csv')
listings_df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,155305,https://www.airbnb.com/rooms/155305,20240621025915,2024-06-21,city scrape,Cottage! BonPaul + Sharky's Hostel,,"We are within easy walk of pubs, breweries, mu...",https://a0.muscache.com/pictures/8880711/cf38d...,746673,...,4.75,4.92,4.58,,f,8,2,2,4,2.78
1,156805,https://www.airbnb.com/rooms/156805,20240621025915,2024-06-21,city scrape,"Private Room ""Ader"" at BPS Hostel",,"Easy walk to pubs, cafes, bakery, breweries, l...",https://a0.muscache.com/pictures/23447d55-fa7e...,746673,...,4.61,4.84,4.46,,t,8,2,2,4,0.43
2,156926,https://www.airbnb.com/rooms/156926,20240621025915,2024-06-21,city scrape,"Mixed Dorm ""Top Bunk #1"" at BPS Hostel",This is a top bunk in the mixed dorm room<br /...,,https://a0.muscache.com/pictures/5fa7178e-c514...,746673,...,4.77,4.78,4.78,,t,8,2,2,4,2.17
3,197263,https://www.airbnb.com/rooms/197263,20240621025915,2024-06-21,city scrape,Tranquil Room & Private Bath,"This is a comfy, peaceful and clean room with ...",,https://a0.muscache.com/pictures/miso/Hosting-...,961396,...,4.93,4.85,4.98,,f,2,1,1,0,0.57
4,209068,https://www.airbnb.com/rooms/209068,20240621025915,2024-06-21,city scrape,Terrace Cottage,,Our beautiful Grove Park Historic District clo...,https://a0.muscache.com/pictures/1829924/9f3bf...,1029919,...,4.98,4.94,4.79,,f,1,1,0,0,0.42


In [3]:
# View the overall shape, dtypes and null counts for each column in the listings data
listings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3031 entries, 0 to 3030
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            3031 non-null   int64  
 1   listing_url                                   3031 non-null   object 
 2   scrape_id                                     3031 non-null   int64  
 3   last_scraped                                  3031 non-null   object 
 4   source                                        3031 non-null   object 
 5   name                                          3031 non-null   object 
 6   description                                   2939 non-null   object 
 7   neighborhood_overview                         2205 non-null   object 
 8   picture_url                                   3031 non-null   object 
 9   host_id                                       3031 non-null   i

In [4]:
# Import reviews data and preview first 5 rows
reviews_df = pd.read_csv('data/reviews.csv')
reviews_df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,155305,409437,2011-07-31,844309,Jillian,We had a wonderful time! The cottage was very ...
1,155305,469775,2011-08-23,343443,Katie,Place was great! Can't really speak to the ins...
2,155305,548257,2011-09-19,1152025,Katie,We had a great time! The cabin was nice and a...
3,155305,671470,2011-10-28,1245885,Jason,Clean and comfortable room with everything you...
4,155305,1606327,2012-07-01,1891395,Craig,The cabin was solid for an overnight stay. It ...


In [5]:
# View an example review
reviews_df.comments[1]

"Place was great! Can't really speak to the inside as we only went inside to check in, but looked nice and friendly. We stayed in the cabin in the back and it was lovely. Private and cute, user-friendly. Hot tub was also really good. Spoke with other guests who were all really friendly and interesting people. Great area of town - easy walk to pizza, bars, shops, and coffee. Really loved staying here. \r<br/>Suggestion: Bring an extra towel, plates or something to use if you're eating in (there's a small kitchen). Maybe another fan if it's really hot. Bed is not really that comfortable if you're really tall. Thanks!!"

In [6]:
# View the overall shape, dtypes and null counts for each column in the reviews data
reviews_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 331718 entries, 0 to 331717
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     331718 non-null  int64 
 1   id             331718 non-null  int64 
 2   date           331718 non-null  object
 3   reviewer_id    331718 non-null  int64 
 4   reviewer_name  331717 non-null  object
 5   comments       331653 non-null  object
dtypes: int64(3), object(3)
memory usage: 15.2+ MB


In [18]:
# View value counts for reviewer ids to see if users reviewed multiple listings
reviews_df['reviewer_id'].value_counts()

reviewer_id
151320586    19
54752689     19
20741182     18
418200641    17
7502408      17
             ..
176014109     1
183451179     1
324469879     1
204718147     1
478551791     1
Name: count, Length: 295604, dtype: int64
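
The counts above cover 331,718 reviews from 295,604 distinct reviewers, roughly 1.12 reviews per reviewer, so most reviewers have rated only one listing. A minimal sketch of how that share could be measured follows; `toy_reviews` and `repeat_reviewer_share` are hypothetical names, and the data is invented to stand in for `reviews_df`.

```python
import pandas as pd

# Toy stand-in for reviews_df; these ids are invented for illustration.
toy_reviews = pd.DataFrame({
    "reviewer_id": [1, 1, 1, 2, 3, 3, 4],
    "listing_id": [10, 11, 12, 10, 11, 13, 12],
})


def repeat_reviewer_share(df, user_col="reviewer_id"):
    """Return the fraction of distinct reviewers with more than one review."""
    counts = df[user_col].value_counts()
    return float((counts > 1).mean())


share = repeat_reviewer_share(toy_reviews)
print(f"{share:.0%} of reviewers left more than one review")
```

Running the same function on `reviews_df` would show how sparse the reviewer-listing interactions are, which matters for any approach that relies on users having reviewed several listings.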

# Data Preparation

## Data Cleaning

### Listings DataFrame

In [7]:
# Create a new dataframe with only necessary columns
columns_to_keep = ['id', 'listing_url', 'name', 'description', 'picture_url', 'room_type', 'price']
recommender_cols_df = listings_df[columns_to_keep]

In [8]:
# Check nulls for new dataframe
recommender_cols_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3031 entries, 0 to 3030
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           3031 non-null   int64 
 1   listing_url  3031 non-null   object
 2   name         3031 non-null   object
 3   description  2939 non-null   object
 4   picture_url  3031 non-null   object
 5   room_type    3031 non-null   object
 6   price        2892 non-null   object
dtypes: int64(1), object(6)
memory usage: 165.9+ KB


The `description` and `price` columns contain nulls (92 and 139 rows, respectively), but we will leave them as they are. This dataframe is used only to show the user information about the listings that are recommended. Every listing has a `listing_url`, so the user can follow the link to see additional details and book the listing.
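
The display step described above could look something like the following sketch. It assumes the recommender returns an ordered list of listing ids; the function name `describe_recommendations`, the placeholder strings, and the toy data (including the example.com URLs and the price format) are all invented for illustration.

```python
import pandas as pd

# Toy stand-in for recommender_cols_df; all values are invented.
toy_listings = pd.DataFrame({
    "id": [101, 102, 103],
    "listing_url": [
        "https://example.com/rooms/101",
        "https://example.com/rooms/102",
        "https://example.com/rooms/103",
    ],
    "name": ["Cabin", "Loft", "Cottage"],
    "description": ["Cozy cabin", None, "Quiet cottage"],
    "price": ["$120.00", None, "$95.00"],
})


def describe_recommendations(listings, recommended_ids):
    """Return display rows for the recommended ids, in recommendation order."""
    order = pd.DataFrame({"id": recommended_ids})
    # A left merge keeps the order of the recommended ids.
    details = order.merge(listings, on="id", how="left")
    # Fill display-only gaps; the listing URL still leads to full details.
    return details.fillna({
        "description": "No description provided",
        "price": "See listing",
    })


shown = describe_recommendations(toy_listings, [103, 102])
print(shown[["id", "name", "price", "listing_url"]])
```

Filling the gaps at display time, rather than dropping rows, keeps every listing available to be recommended.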

### Reviews DataFrame

In [9]:
# Check rows with no comments
reviews_df[reviews_df['comments'].isnull()]

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
7883,13112074,328260078,2018-09-25,121268935,Tiffany,
22461,5696919,441087924,2019-04-21,140470172,Michael,
24521,14352724,557206634,2019-10-31,40817729,Jessica,
30131,6145162,759901200,2021-05-18,37832155,Kayla,
33003,6234618,311153650,2018-08-20,74602012,Eileen,
...,...,...,...,...,...,...
278579,49232318,454954076455043203,2021-09-19,420262398,Carrie,
284092,51843113,790633688751657337,2022-12-26,363935612,Ruby,
292236,554287486404930872,902919400570368908,2023-05-30,227046934,Bryan,
305037,53810860,749943288289375768,2022-10-31,90622309,Gerald,


In [10]:
# Check duplicates
reviews_df.duplicated('id').value_counts()

False    331718
Name: count, dtype: int64

We can delete the rows with no comments. They don't provide any information on what these users thought about the particular listings and there are only 65 of them.

In [11]:
# Drop rows with missing comments
reviews_df.dropna(subset='comments', inplace=True)

In [12]:
# Drop unnecessary columns
reviews_df.drop(columns=['id', 'date', 'reviewer_name'], inplace=True)

## Feature Engineering

In [13]:
# Instantiate analyzer
analyzer = SentimentIntensityAnalyzer()

# Create new column `compound_scores` holding the VADER compound sentiment score of each comment
reviews_df['compound_scores'] = [analyzer.polarity_scores(x)['compound'] for x in reviews_df['comments']]

In [15]:
# Preview results
reviews_df.head()

Unnamed: 0,listing_id,reviewer_id,comments,compound_scores
0,155305,844309,We had a wonderful time! The cottage was very ...,0.883
1,155305,343443,Place was great! Can't really speak to the ins...,0.9954
2,155305,1152025,We had a great time! The cabin was nice and a...,0.9819
3,155305,1245885,Clean and comfortable room with everything you...,0.9775
4,155305,1891395,The cabin was solid for an overnight stay. It ...,0.8807
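
Some comments contain markup residue, such as the `\r<br/>` in the example review shown earlier. Whether this noticeably changes the VADER scores has not been tested here. If it does, one option is to clean the text before scoring. The helper below is a hypothetical sketch using only the standard library, not part of the notebook's pipeline.

```python
import html
import re


def clean_comment(text):
    """Strip <br> tags, decode HTML entities, and collapse whitespace."""
    text = html.unescape(text)
    text = re.sub(r"<br\s*/?>", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = "Really loved staying here. \r<br/>Suggestion: Bring an extra towel."
cleaned = clean_comment(sample)
print(cleaned)
```

The cleaned strings could then be passed to `analyzer.polarity_scores` in place of the raw comments.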


In [20]:
# View statistical distribution of compound scores
reviews_df['compound_scores'].describe()

count    331653.000000
mean          0.853374
std           0.197392
min          -0.993900
25%           0.826800
50%           0.922600
75%           0.963400
max           0.999700
Name: compound_scores, dtype: float64

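The compound scores are heavily skewed toward the positive end of VADER's -1 to 1 range: the mean is 0.85, the median is 0.92, and three quarters of the reviews score at or above 0.83. Strongly negative reviews exist (the minimum is -0.99), but the quartiles show that they fall within the lowest quarter of the data. This narrow spread may limit how well the scores separate one listing from another.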
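
To complement the summary statistics, the scores can be grouped into sentiment classes. The sketch below uses the cutoffs conventionally suggested for VADER (compound >= 0.05 is positive, <= -0.05 is negative, anything between is neutral). The function name `label_sentiment` and the toy scores are assumptions for illustration, not results from this dataset.

```python
import pandas as pd


def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Invented scores standing in for reviews_df['compound_scores'].
toy_scores = pd.Series([0.883, 0.9954, -0.62, 0.0, 0.03])
shares = toy_scores.map(label_sentiment).value_counts(normalize=True)
print(shares)
```

Applied to the real column, the class shares would show directly how few reviews are neutral or negative.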

# Modeling

## Baseline Understanding

## Modeling Iterations

## Final Model

# Conclusion

## Recommendations

## Limitations

1. Cold start: new users have not left any reviews, so the system has no baseline for their listing preferences.
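
One common mitigation for this limitation, offered here only as a possibility and not as part of the project's model, is to fall back to a non-personalized ranking for users with no review history. The sketch below ranks listings by mean compound score, subject to a minimum review count. The function name `popular_listings`, its defaults, and the toy data are hypothetical.

```python
import pandas as pd

# Toy stand-in for reviews_df after feature engineering; values are invented.
toy_scored = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2, 3],
    "compound_scores": [0.9, 0.8, 0.7, 0.95, 0.99, 1.0],
})


def popular_listings(reviews, min_reviews=2, top_n=2):
    """Return ids of the best-scored listings that have enough reviews."""
    stats = reviews.groupby("listing_id")["compound_scores"].agg(["mean", "count"])
    # Ignore listings whose average rests on too few reviews.
    eligible = stats[stats["count"] >= min_reviews]
    ranked = eligible.sort_values("mean", ascending=False)
    return ranked.head(top_n).index.tolist()


fallback = popular_listings(toy_scored)
print(fallback)
```

The minimum review count keeps a listing with a single glowing review from outranking listings with a longer track record.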

## Next Steps