# CS 109A/AC 209A/STAT 121A Data Science: Final Project
**Harvard University**<br>
**Fall 2016**<br>
**Instructors:** W. Pan, P. Protopapas, K. Rader<br>
**Members:** Shawn Pan, Xinyuan (Amy) Wang, Ming-long Wu

# Project Summary #
## Introduction

Peer-to-peer rental networks such as Airbnb have become an alternative to hotels and other traditional accommodations for travelers seeking cheaper prices or a more personal experience. Renters often have specific preferences such as location, dates, cleanliness, or number of guests, but quantifying how these preferences affect the price is a challenge. Renters on a budget would like to know how various preferences affect the price, so they can make informed choices and compromises when planning their travel. The objective of our project is to predict the price of an Airbnb rental in the New York area based on user preferences.

# Dataset
- Data from Airbnb listings in 2015 are used.
### listings
- Data that contain location, physical properties, and review scores.
### calendar
- Data that contain listing price on different days of 2015.
### reviews
- Data that contain review texts from users.

# Workflow
<img src="Figures/Workflow.png">

# 01. Load data
- **File**: 010_load_data.ipynb
- We loaded the three datasets and corrected any incorrectly formatted data for later analysis.
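
As an illustration of the kind of format correction this step involves, the sketch below parses a currency-formatted price column and a date column with pandas. The column names (`listing_id`, `date`, `price`), the `$1,200.00` string format, and the inline sample rows are assumptions made for illustration; they are not taken from the project notebooks.

```python
import io

import pandas as pd

# Hypothetical sample standing in for a calendar CSV; the real files and
# their exact column layout are not shown in this summary.
raw_csv = io.StringIO(
    "listing_id,date,price\n"
    "1,2015-01-01,$99.00\n"
    "1,2015-01-02,\"$1,200.00\"\n"
    "2,2015-01-01,\n"
)

calendar = pd.read_csv(raw_csv)

# Strip the currency symbol and thousands separator, then convert to float.
# Missing prices stay as NaN.
calendar["price"] = (
    calendar["price"]
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
)

# Parse date strings into proper datetime values.
calendar["date"] = pd.to_datetime(calendar["date"])
```

Doing the conversion once at load time means later notebooks can treat `price` as numeric and `date` as a datetime without repeating the cleanup.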

# 02. Data exploration

## Listing data
- **File**: 021_visualize_listings_1.ipynb, 021_visualize_listings_2.ipynb
- We explored the listing data to better understand the overall distribution of listings.
- Correlation between predictors
    - <img src="Figures/Fig_021.png">
- Most properties are apartments (> 85%).
- Most listings rent out the whole property (> 50%).
- Most prices are below 200.
- Most review scores are high (> 8).

## Calendar data
- **File**: 022_visualize_calendar.ipynb
- There are temporal trends in price, both local (weekend spikes) and global (monthly fluctuation).
    - <img src="Figures/Fig_022.png">
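
A minimal sketch of how the two trends could be quantified, using synthetic prices rather than the project data. Treating Friday and Saturday nights as the "weekend" and the 3% premium baked into the fake series are both assumptions made for illustration.

```python
import pandas as pd

# Synthetic daily prices for one hypothetical listing over four weeks:
# a base of 100 with a 3% premium on Friday and Saturday nights.
dates = pd.date_range("2015-06-01", periods=28, freq="D")  # starts on a Monday
is_weekend = dates.dayofweek.isin([4, 5])  # 4 = Friday, 5 = Saturday
prices = pd.Series(100.0, index=dates).where(~is_weekend, 103.0)

# Local trend: average weekend price relative to average weekday price.
weekday_mean = prices[~is_weekend].mean()
weekend_mean = prices[is_weekend].mean()
weekend_premium = weekend_mean / weekday_mean - 1.0

# Global trend: average price per calendar month.
monthly_mean = prices.groupby(prices.index.month).mean()

print(f"Weekend premium: {weekend_premium:.1%}")
```

On real calendar data the same two aggregations would be computed across all listings, which gives one number for the weekend effect and one curve for the monthly fluctuation.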

## Review data
- **File**: 023_visualize_reviews.ipynb
- We removed the non-English review texts (about 10% of the reviews).
- We analyzed review texts with VADER (Valence Aware Dictionary and sEntiment Reasoner) to get sentiment scores for reviews.
- We checked the correlation between price and sentiment scores; it turned out to be low.
    - <img src="Figures/Fig_023.png">
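
The correlation check can be sketched as follows, assuming the sentiment scores have already been computed and averaged per listing (for example, VADER's compound score, which lies in [-1, 1]). The arrays below are made-up values for illustration only and do not reproduce the project's result.

```python
import numpy as np

# Made-up example values: one average sentiment score per listing
# (for instance a VADER compound score in [-1, 1]) and the listing price.
sentiment = np.array([0.92, 0.85, 0.40, 0.95, 0.78, 0.88, 0.10, 0.97])
price = np.array([120.0, 80.0, 150.0, 95.0, 200.0, 60.0, 110.0, 130.0])

# Pearson correlation coefficient between sentiment and price.
r = np.corrcoef(sentiment, price)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A coefficient close to zero indicates little linear relationship between the two variables, which is the situation described above.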

## Location clustering of price
- **File**: 024_location_clustering.ipynb
- We checked how price clusters by location using KNN (k-nearest neighbors).
- **This feature can be used for feature engineering in later steps.**
    - <img src="Figures/Fig_024.png">
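
One way such a location feature could be built is with a k-nearest-neighbors regressor on latitude and longitude. This is a sketch under assumptions: the coordinates, prices, and the choice of `n_neighbors=5` are made up, and the notebook may use a different setup.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic listings: an expensive area and a cheaper area, each a small
# cloud of (latitude, longitude) points. Coordinates and prices are made up.
expensive = rng.normal(loc=[40.75, -73.98], scale=0.005, size=(50, 2))
cheap = rng.normal(loc=[40.65, -73.95], scale=0.005, size=(50, 2))
coords = np.vstack([expensive, cheap])
price = np.concatenate([rng.normal(250, 20, 50), rng.normal(90, 10, 50)])

# Predict a listing's price from the average price of its k nearest neighbors.
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(coords, price)

# The prediction can serve as an engineered "neighborhood price" feature.
query = np.array([[40.75, -73.98], [40.65, -73.95]])
neighborhood_price = knn.predict(query)
```

When the prediction is used as a feature, the neighbors should come from training listings only and exclude the listing itself, otherwise the target leaks into the predictors.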

## Conclusion from data exploration
- **We mainly use the listing data for price prediction because of its abundance of features.**
- Although the calendar data include information on the temporal effect on price, we decided not to include it because: **(1) Predicting a standard price based on the criteria of the rental is our primary goal.** (2) On average, the weekend price surge is only about 3%. (3) We checked other Airbnb calendar datasets and found that the global price fluctuation we saw varies between datasets.
- The review text sentiment scores are not correlated with price, so we decided not to include them in our model. In addition, the listing data already contain various numerical review scores.

# 03. Baseline model

- **File**: 030_baseline.ipynb
- We fit the data with a linear regression model.
- Train score: 0.283973128565
- Test score: 0.312829854813
- We performed cross-validation, and the R^2 values are stable.
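
A baseline of this kind can be sketched with scikit-learn as below. The data are synthetic, and the split ratio and number of folds are assumptions, so the printed scores will not match the ones reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the listing predictors and price.
X = rng.normal(size=(500, 5))
y = X @ np.array([30.0, 20.0, 0.0, 10.0, 0.0]) + rng.normal(scale=50.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)  # R^2 on the training split
test_r2 = model.score(X_test, y_test)     # R^2 on the held-out split

# 5-fold cross-validation; a small spread across folds suggests a stable R^2.
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
print(f"CV R^2: mean = {cv_r2.mean():.3f}, std = {cv_r2.std():.3f}")
```

Reporting the fold-to-fold standard deviation alongside the mean is a simple way to back up a claim that the scores are stable.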


# 04. Preprocessing

- **File**: 041_preprocessing.ipynb
- We checked the distribution of price (y) and found it is right-skewed.
- With a log transformation, we obtained a response distribution that is close to symmetric (which suits linear regression models).
    - <img src="Figures/Fig_041.png">
- We checked for missing data; most of it is in the review scores.
- We filled missing numerical data with the mean and the very few missing categorical values with the mode.
    - <img src="Figures/Fig_042.png">

- We one-hot encoded the categorical data.
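
The preprocessing steps above can be sketched on a tiny made-up table. The column names follow the predictor names listed in this summary where possible, but the values and the use of `pandas.get_dummies` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up listing table with a skewed price, a missing review score,
# and a missing categorical value.
listings = pd.DataFrame(
    {
        "price": [50.0, 80.0, 120.0, 95.0, 1500.0],
        "review_scores_value": [9.0, np.nan, 10.0, 8.0, 9.0],
        "room_type": ["Entire home/apt", "Private room", None,
                      "Entire home/apt", "Shared room"],
    }
)

# Log-transform the right-skewed response.
listings["log_price"] = np.log(listings["price"])

# Numerical columns: fill missing values with the column mean.
mean_score = listings["review_scores_value"].mean()
listings["review_scores_value"] = listings["review_scores_value"].fillna(mean_score)

# Categorical columns: fill missing values with the most frequent category.
mode_room = listings["room_type"].mode()[0]
listings["room_type"] = listings["room_type"].fillna(mode_room)

# One-hot encode the categorical column.
encoded = pd.get_dummies(listings, columns=["room_type"])
```

In a train/test setting, the mean and mode should be computed on the training split only and then applied to the test split.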


# 04. Data reduction

- **File**: 042_predictor_selection.ipynb, 043_reduce_dim.ipynb
- We used recursive feature selection and LASSO for feature selection.
- We compared the results from the two methods and selected a set of 27 important predictors.
- Selected features = ["accommodates", "bathrooms", "bedrooms", "review_scores_checkin", "review_scores_communication", "latitude", "longitude", "property_type_0", "property_type_1", "property_type_2", "property_type_3", "property_type_4", "property_type_5", "room_type_0", "room_type_1", "room_type_2", "bed_type_0", "bed_type_1", "bed_type_2", "bed_type_3", "bed_type_4", "beds", "review_scores_value", "host_listing_count", "review_scores_cleanliness", "review_scores_accuracy", "minimum_nights"]
- We performed PCA, keeping enough components to explain 99.9% of the variance, and fit a linear regression to the PCA-reduced data.
    - R^2 is 0.118438992869. We therefore decided not to use PCA, also because our number of predictors is not large.
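
The sketch below shows one possible way to run both selection methods and the PCA step with scikit-learn on synthetic data. The summary does not say how the two selection results were reconciled, so taking their intersection is an assumption, as are the `alpha` value and the number of features requested from RFE.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: only the first three of eight predictors affect the response.
X = rng.normal(size=(200, 8))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)
X_std = StandardScaler().fit_transform(X)

# Method 1: recursive feature elimination around a linear regression.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X_std, y)
rfe_selected = set(np.flatnonzero(rfe.support_))

# Method 2: LASSO; predictors with nonzero coefficients are kept.
lasso = Lasso(alpha=0.1).fit(X_std, y)
lasso_selected = set(np.flatnonzero(lasso.coef_))

# Keep the predictors that both methods agree on (an assumed combination rule).
agreed = sorted(int(i) for i in rfe_selected & lasso_selected)
print("Selected predictor indices:", agreed)

# PCA: a float n_components keeps the smallest number of components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.999).fit(X_std)
print("Components kept:", pca.n_components_)
```

When predictors are few and not strongly correlated, PCA needs almost all components to reach 99.9% of the variance, so it reduces little while making the inputs harder to interpret.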


# 05. Linear models

- **File**: 051_linear_regression.ipynb
- **Models**: Linear regression, Lasso, Ridge, Elastic Net
    - <img src="Figures/Fig_051.png" width="70%" height="70%">
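
A comparison of the four linear models can be sketched as below. The data are synthetic and the regularization strengths are arbitrary placeholders; in practice they would be tuned, for example by cross-validation.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: ten predictors, of which only three carry signal.
X = rng.normal(size=(400, 10))
coef = np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ coef + rng.normal(scale=1.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Placeholder hyperparameters, chosen only for illustration.
models = {
    "Linear regression": LinearRegression(),
    "Lasso": Lasso(alpha=0.05),
    "Ridge": Ridge(alpha=1.0),
    "Elastic Net": ElasticNet(alpha=0.05, l1_ratio=0.5),
}

# Fit each model on the training split and record R^2 on the test split.
test_r2 = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
for name, score in test_r2.items():
    print(f"{name}: test R^2 = {score:.3f}")
```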


- **File**: 052_quadratic_regression.ipynb
- Quadratic transformation of numerical predictors.
- **Models**: Quadratic regression, Quadratic Lasso, Quadratic Ridge, Quadratic Elastic Net
    - <img src="Figures/Fig_052.png"  width="80%" height="80%">
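
A minimal sketch of a quadratic transformation, assuming it means appending the square of each numerical predictor (whether the notebook also adds interaction terms is not stated here). The data are synthetic and the R^2 values are computed in-sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic numerical predictors and a response with a quadratic component.
X = rng.uniform(-2.0, 2.0, size=(300, 2))
y = 1.0 + 2.0 * X[:, 0] + 1.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=300)


def add_squares(X):
    """Append the square of every column to the original columns."""
    return np.hstack([X, X ** 2])


def fit_r2(X, y):
    """Fit ordinary least squares with an intercept and return in-sample R^2."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ beta
    return 1.0 - residual.var() / y.var()


r2_linear = fit_r2(X, y)
r2_quadratic = fit_r2(add_squares(X), y)
print(f"linear R^2 = {r2_linear:.3f}, quadratic R^2 = {r2_quadratic:.3f}")
```

The squared terms let an otherwise linear model capture curvature, at the cost of doubling the number of numerical predictors, which is why regularized variants (Lasso, Ridge, Elastic Net) are natural companions.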


- **File**: 053_tree_methods.ipynb
- **Models**: Random forest, Adaboost

- **File**: 054_SVR.ipynb
- **Model**: SVR

- **File**: 055_ensemble.ipynb
- **Model**: Gradient Boosting

- **File**: 056_neural_network.ipynb
- **Model**: Neural Network (four layers fully connected model)
- <img src="Figures/Fig_055.png" width="90%" height="90%">

- **File**: 057_chosen_model.ipynb
- **Model: Random forest**
- **Error analysis**
    - <img src="Figures/Fig_053.png">
    - <img src="Figures/Fig_054.png">
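
A sketch of fitting a random forest and inspecting its errors, on synthetic data. It assumes the response is the log of price, as in the preprocessing step; the number of trees and the split ratio are placeholders, not the settings used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic predictors and a nonlinear log-price response.
X = rng.uniform(0.0, 1.0, size=(600, 4))
log_price = (
    4.0 + 1.5 * X[:, 0] + np.sin(6.0 * X[:, 1]) + rng.normal(scale=0.2, size=600)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, log_price, test_size=0.25, random_state=0
)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Residuals on the log scale, and absolute errors back on the price scale.
pred = forest.predict(X_test)
residual = y_test - pred
abs_error_price = np.abs(np.exp(y_test) - np.exp(pred))

test_r2 = forest.score(X_test, y_test)
print(f"test R^2 = {test_r2:.3f}")
print(f"median absolute error (price units) = {np.median(abs_error_price):.2f}")

# Feature importances hint at which predictors drive the predictions.
importances = forest.feature_importances_
```

Plotting `residual` against `pred` or against individual predictors is a common way to see where the model systematically over- or under-predicts.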

## Discussion


## Conclusion

