# New York City AirBnB Data Modeling and Analysis

Made for the University of San Diego
Course: AAI-510 (Machine Learning: Fundamentals and Applications)
Professor: Wesley Pasfield, MS

By - Doug Code (dcode15), Subhabrata Ganguli (suvoganguli), Jeffrey Lehrer (J-Lehrer)

# Problem statement and justification for the proposed approach.


Business understanding - What does the business need?





## Introduction

Airbnb, an online marketplace for lodging, has transformed the way people travel and find accommodations. In major cities like New York City, Airbnb listings provide a wide variety of options for travelers, ranging from entire apartments and homes to private rooms in shared apartments. This flexibility has made Airbnb a popular choice among both tourists and business travelers.

In this notebook, we explore the Airbnb dataset for New York City. This dataset provides detailed information on listings available on Airbnb, including prices, locations, types of properties, and reviews. By analyzing this data, we can gain insights into the rental market in New York City, understand pricing strategies, and identify popular neighborhoods.

## Dataset Description

The dataset used in this analysis is obtained from [Inside Airbnb](http://insideairbnb.com/get-the-data.html), a website that provides publicly available data on Airbnb listings. The New York City dataset contains various attributes for each listing, including:

- **Listing ID**: A unique identifier for each Airbnb listing.
- **Name**: The name of the listing.
- **Host ID**: A unique identifier for the host.
- **Host Name**: The name of the host.
- **Neighborhood Group**: The general area or borough where the listing is located (e.g., Manhattan, Brooklyn).
- **Neighborhood**: The specific neighborhood within the borough.
- **Latitude**: The latitude coordinate of the listing.
- **Longitude**: The longitude coordinate of the listing.
- **Room Type**: The type of room being offered (e.g., entire home/apt, private room, shared room).
- **Price**: The price per night for the listing.
- **Minimum Nights**: The minimum number of nights a guest must stay.
- **Number of Reviews**: The total number of reviews for the listing.
- **Last Review**: The date of the last review.
- **Reviews per Month**: The average number of reviews per month.
- **Calculated Host Listings Count**: The total number of listings by the host.
- **Availability 365**: The number of days the listing is available in a year.
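
As a quick sanity check, the attributes above can be compared against the columns of a loaded DataFrame. The sketch below is illustrative only: the snake_case names in `EXPECTED_COLUMNS` are an assumption about how these attributes appear in the CSV header, `missing_columns` is a hypothetical helper, and the sample frame is made up rather than taken from the dataset.

```python
import pandas as pd

# Assumed snake_case column names for the attributes listed above;
# verify against the actual CSV header before relying on them.
EXPECTED_COLUMNS = [
    "id", "name", "host_id", "host_name",
    "neighbourhood_group", "neighbourhood",
    "latitude", "longitude", "room_type", "price",
    "minimum_nights", "number_of_reviews", "last_review",
    "reviews_per_month", "calculated_host_listings_count",
    "availability_365",
]


def missing_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from df, in order."""
    return [col for col in expected if col not in df.columns]


# Tiny made-up frame for illustration; not real listing data.
sample = pd.DataFrame({
    "id": [1, 2],
    "price": [120, 85],
    "room_type": ["Private room", "Entire home/apt"],
})

absent = missing_columns(sample)
print(absent)
```

An empty result would mean every described attribute is present under the assumed name.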

## Objectives

In this analysis, we aim to achieve the following objectives:

1. **Data Exploration**: Understand the structure and contents of the dataset through summary statistics and visualizations.
2. **Price Analysis**: Analyze the pricing strategies of different types of listings and identify factors influencing prices.
3. **Geographical Analysis**: Examine the geographical distribution of listings and identify popular neighborhoods.
4. **Review Analysis**: Investigate the review patterns and their correlation with listing popularity and price.
5. **Availability Analysis**: Analyze the availability of listings and identify trends related to booking frequency.



# Data preparation.


Data preparation - How do we organize the data for modeling?


In [15]:
from typing import List

import optuna
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.impute import SimpleImputer
import sys
import os

# Add path to src
current_dir = os.getcwd()
src_path = os.path.join(current_dir, 'src')
print(src_path)
sys.path.append(src_path)

from evaluation.ModelEvaluator import ModelEvaluator
from preprocessing.ColumnEncoder import ColumnEncoder
from preprocessing.ColumnSelector import ColumnSelector
from preprocessing.DataCleaner import DataCleaner
from preprocessing.DataImputer import DataImputer
from tuners.HistGradientBoostingRegressorTuner import HistGradientBoostingRegressorTuner

/Users/suvo/Documents/MS-USD/AAI510-Machine Learning/AAI510_FinalProject/src


ModuleNotFoundError: No module named 'evaluation.ModelEvaluator'

# Feature engineering – data pre-processing – missing values, outliers, etc.


In [None]:
print("Preprocessing data.")
data_path: str = "../data/listings-full.csv"

# Load the raw listings, apply base cleaning, and drop price outliers using the IQR rule.
data: pd.DataFrame = pd.read_csv(data_path)
data = DataCleaner.perform_base_cleaning(data)
data = DataImputer.remove_outliers_iqr(data, ["price"])

# Split into train, validation, and test sets before encoding and imputing.
train_data, val_data, test_data = DataCleaner.split_train_val_test(data)

# Mean-encode each categorical feature against price.
train_data = ColumnEncoder.mean_encode_columns(train_data, ColumnSelector.get_categorical_features(train_data), "price")
val_data = ColumnEncoder.mean_encode_columns(val_data, ColumnSelector.get_categorical_features(val_data), "price")
test_data = ColumnEncoder.mean_encode_columns(test_data, ColumnSelector.get_categorical_features(test_data), "price")

# Fill remaining missing values with the column median.
train_data = DataImputer.impute_missing_values(train_data, data.columns, SimpleImputer(strategy="median"))
val_data = DataImputer.impute_missing_values(val_data, data.columns, SimpleImputer(strategy="median"))
test_data = DataImputer.impute_missing_values(test_data, data.columns, SimpleImputer(strategy="median"))

# Separate the predictors from the target for each subset.
x_train, y_train, x_val, y_val, x_test, y_test = DataCleaner.perform_x_y_split(train_data, val_data, test_data)
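
The helper classes used in this cell live in the project's `src` folder and are not shown here, so the following is only a plain pandas/scikit-learn sketch of the same three ideas: IQR outlier filtering, mean (target) encoding, and median imputation. All function names and the toy data are hypothetical. One deliberate choice in this sketch, which may differ from the project code, is that the encoding map and the imputer are learned from the training rows only and then applied to the validation rows, so that validation prices do not leak into the features.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def remove_outliers_iqr(df, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]


def fit_mean_encoding(train, column, target):
    """Learn category -> mean(target) from the training rows only."""
    return train.groupby(column)[target].mean(), train[target].mean()


def apply_mean_encoding(df, column, mapping, fallback):
    """Encode with the training mapping; unseen categories get the fallback."""
    out = df.copy()
    out[column] = out[column].map(mapping).fillna(fallback)
    return out


# Made-up rows for illustration; not real listing data.
prices = pd.DataFrame({"price": [50, 60, 70, 80, 5000]})
filtered = remove_outliers_iqr(prices, "price")

train = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Entire home/apt", "Entire home/apt"],
    "minimum_nights": [1.0, np.nan, 3.0, 2.0],
    "price": [80.0, 100.0, 200.0, 220.0],
})
val = pd.DataFrame({
    "room_type": ["Private room", "Shared room"],
    "minimum_nights": [np.nan, 5.0],
    "price": [95.0, 40.0],
})

# Encoding: the map comes from train and is reused unchanged on val.
mapping, global_mean = fit_mean_encoding(train, "room_type", "price")
train_enc = apply_mean_encoding(train, "room_type", mapping, global_mean)
val_enc = apply_mean_encoding(val, "room_type", mapping, global_mean)

# Imputation: medians are fitted on train and applied to val.
feature_cols = ["room_type", "minimum_nights"]
imputer = SimpleImputer(strategy="median")
train_enc[feature_cols] = imputer.fit_transform(train_enc[feature_cols])
val_enc[feature_cols] = imputer.transform(val_enc[feature_cols])

print(filtered)
print(val_enc)
```

In the toy data, "Shared room" never appears in the training rows, so it is encoded with the overall training mean price rather than with its own validation price.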


# Data understanding (EDA) – a graphical and non-graphical representation of relationships between the response variable and predictor variables.


Data understanding - What data do we have/need? Is it clean?


In [None]:
# Read data

df = pd.read_csv("data/listings-full.csv")
df.head()
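
Two non-graphical checks that speak to the questions in this section are the share of missing values per column and the correlation of each numeric predictor with the response. The sketch below assumes `price` is the response variable and uses a small made-up frame; `missing_fraction` and `price_correlations` are hypothetical helper names, not functions from the project.

```python
import numpy as np
import pandas as pd


def missing_fraction(df: pd.DataFrame) -> pd.Series:
    """Fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)


def price_correlations(df: pd.DataFrame, target: str = "price") -> pd.Series:
    """Pearson correlation of each numeric column with the target."""
    numeric = df.select_dtypes(include="number")
    return numeric.corr()[target].drop(target).sort_values(ascending=False)


# Made-up rows for illustration; not real listing data.
toy = pd.DataFrame({
    "price": [100.0, 150.0, 200.0, 250.0],
    "minimum_nights": [1, 2, 3, 4],
    "reviews_per_month": [np.nan, 2.0, 1.0, np.nan],
    "room_type": ["Private room", "Private room", "Entire home/apt", "Entire home/apt"],
})

missing = missing_fraction(toy)
corr = price_correlations(toy)
print(missing)
print(corr)
```

On the real data, the same two summaries could be run on `df` to decide which columns need imputation and which predictors merit a closer graphical look.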


# Feature Selection – how were the features selected based on the data analysis?


# Modeling – selection, comparison, tuning, and analysis – consider ensembles.



Modeling - What modeling techniques should we apply?



# Evaluation – performance measures, results, and conclusions.


Evaluation - Which model best meets the business objectives?

# Discussion and conclusions – address the problem statement and recommendation.

Deployment - How to get the model in production and ensure it works?

# References and Sources

GitHub link: https://github.com/suvoganguli/AAI510_FinalProject

