# Airbnb Data Analysis for New York City

## Introduction

Airbnb, an online marketplace for lodging, has transformed the way people travel and find accommodations. In major cities like New York City, Airbnb listings provide a wide variety of options for travelers, ranging from entire apartments and homes to private rooms in shared apartments. This flexibility has made Airbnb a popular choice among both tourists and business travelers.

In this notebook, we explore the Airbnb dataset for New York City. The dataset provides detailed information on listings available on Airbnb, including prices, locations, room types, and reviews. By analyzing this data, we can gain insight into the city's short-term rental market, compare prices across room types and boroughs, and identify popular neighborhoods.

## Dataset Description

The dataset used in this analysis is obtained from [Inside Airbnb](http://insideairbnb.com/get-the-data.html), a website that provides publicly available data on Airbnb listings. The New York City dataset contains various attributes for each listing, including:

- **Listing ID**: A unique identifier for each Airbnb listing.
- **Name**: The name of the listing.
- **Host ID**: A unique identifier for the host.
- **Host Name**: The name of the host.
- **Neighborhood Group**: The general area or borough where the listing is located (e.g., Manhattan, Brooklyn).
- **Neighborhood**: The specific neighborhood within the borough.
- **Latitude**: The latitude coordinate of the listing.
- **Longitude**: The longitude coordinate of the listing.
- **Room Type**: The type of room being offered (e.g., entire home/apt, private room, shared room).
- **Price**: The price per night for the listing.
- **Minimum Nights**: The minimum number of nights a guest must stay.
- **Number of Reviews**: The total number of reviews for the listing.
- **Last Review**: The date of the last review.
- **Reviews per Month**: The average number of reviews per month.
- **Calculated Host Listings Count**: The total number of listings by the host.
- **Availability 365**: The number of days in the next 365 days on which the listing is available for booking.
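
The attribute names above are display labels; the code in this notebook refers to the CSV columns by snake_case names such as `neighbourhood_group`, `room_type`, and `availability_365`. A minimal sketch of a check that the expected columns are present is shown below. The column list is an assumption inferred from the attributes described above and from the names used later in this notebook, and `missing_columns` is a hypothetical helper; adjust both to match the header of the downloaded file.

```python
import pandas as pd

# Assumed column names, inferred from the attribute list above and from the
# columns referenced later in this notebook; the actual file may differ.
EXPECTED_COLUMNS = [
    "id", "name", "host_id", "host_name",
    "neighbourhood_group", "neighbourhood",
    "latitude", "longitude", "room_type", "price",
    "minimum_nights", "number_of_reviews", "last_review",
    "reviews_per_month", "calculated_host_listings_count",
    "availability_365",
]


def missing_columns(frame: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected column names that are absent from `frame`."""
    return [col for col in expected if col not in frame.columns]


# Tiny illustrative frame that lacks most columns on purpose
demo = pd.DataFrame({"id": [1], "price": [120], "room_type": ["Private room"]})
print(missing_columns(demo))
```

With the real data, calling `missing_columns(df)` right after loading the CSV would list any attribute that the later cells expect but the file does not contain.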

## Objectives

In this analysis, we aim to achieve the following objectives:

1. **Data Exploration**: Understand the structure and contents of the dataset through summary statistics and visualizations.
2. **Price Analysis**: Analyze the pricing strategies of different types of listings and identify factors influencing prices.
3. **Geographical Analysis**: Examine the geographical distribution of listings and identify popular neighborhoods.
4. **Review Analysis**: Investigate the review patterns and their correlation with listing popularity and price.
5. **Availability Analysis**: Analyze the availability of listings and identify trends related to booking frequency.




In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
# Read data

df = pd.read_csv("data/listings.csv")
df.head()

In [None]:
# df.info() prints its summary directly and returns None, so no print() is needed
df.info()

# Exploratory Data Analysis

In [None]:
# Remove columns that are not needed for the analysis
df.drop(columns=['last_review', 'license', 'host_id'], inplace=True)
df.info()

In [None]:
import missingno as msno

# Check for missing values
print("Missing values per column before cleaning:")
print(df.isnull().sum())

print('Shape of data before cleaning')
print(df.shape)

print('Number of unique neighborhoods before cleaning')
print(df['neighbourhood'].nunique())

# Visualize missing data (missingno creates its own figure, so pass figsize directly)
msno.matrix(df, figsize=(12, 6))
plt.show()

# Visualize the nullity correlation between columns as a heatmap
msno.heatmap(df, figsize=(12, 6), cmap='coolwarm')
plt.show()


## Removing rows with missing data

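Dropping every row that has a missing value can remove some neighborhoods from the data entirely. The cell below identifies those neighborhoods and adds each one back as a single placeholder row whose numeric columns hold the mean values of the cleaned data.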

In [None]:
# Clean up whitespace or empty strings
df.replace(r'^\s*$', np.nan, regex=True, inplace=True)

# Store the original unique neighborhoods
original_neighbourhoods = df['neighbourhood'].unique()

# Remove rows with any missing data
df_cleaned = df.dropna()

# Check for missing values after dropping rows with missing data
print("Missing values per column after cleaning:")
print(df_cleaned.isnull().sum())

print('Shape of data after cleaning:')
print(df_cleaned.shape)

print('Number of unique neighborhoods after cleaning:')
print(df_cleaned['neighbourhood'].nunique())

# Verify there are no more missing values with a matrix plot
msno.matrix(df_cleaned, figsize=(12, 6))
plt.show()

# Find which neighborhoods were removed
cleaned_neighbourhoods = df_cleaned['neighbourhood'].unique()
removed_neighbourhoods = set(original_neighbourhoods) - set(cleaned_neighbourhoods)
print(f"Removed neighborhoods: {removed_neighbourhoods}")

# Calculate mean values for numerical columns in the cleaned dataset
mean_values = df_cleaned.mean(numeric_only=True)

# Build one placeholder row per removed neighborhood from the numeric means
# (non-numeric columns such as room_type stay NaN in these rows)
rows_to_add = []
for neighborhood in removed_neighbourhoods:
    mean_row = mean_values.copy()
    mean_row['neighbourhood'] = neighborhood
    rows_to_add.append(mean_row)

# Convert the list of Series to a DataFrame
mean_rows_df = pd.DataFrame(rows_to_add)

# Append the new rows to the cleaned DataFrame
df_cleaned = pd.concat([df_cleaned, mean_rows_df], ignore_index=True)

# Verify the updated dataset; the placeholder rows are appended at the end
print("Data after adding back removed neighborhoods:")
print(df_cleaned.tail())
print("Missing values per column after adding back removed neighborhoods:")
print(df_cleaned.isnull().sum())

# Verify the total number of unique neighborhoods
print(f"Total number of unique neighborhoods after adding back: {df_cleaned['neighbourhood'].nunique()}")

# Updating df with df_cleaned
df = df_cleaned
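
To make the effect of this step easier to see, the sketch below repeats the same drop-and-restore idea on a tiny synthetic table. The data and the names `toy`, `placeholders`, and `toy_restored` are made up for illustration and are not part of the Airbnb dataset.

```python
import numpy as np
import pandas as pd

# Synthetic listings: the only row for neighbourhood "B" has a missing value
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B"],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "price": [100.0, 200.0, 150.0],
    "reviews_per_month": [1.0, 3.0, np.nan],
})

toy_cleaned = toy.dropna()
removed = set(toy["neighbourhood"]) - set(toy_cleaned["neighbourhood"])

# One placeholder row per removed neighbourhood, filled with numeric means
means = toy_cleaned.mean(numeric_only=True)
placeholders = pd.DataFrame(
    [{**means.to_dict(), "neighbourhood": name} for name in sorted(removed)]
)
toy_restored = pd.concat([toy_cleaned, placeholders], ignore_index=True)

print(toy_restored)
# Non-numeric columns such as room_type are NaN in the placeholder rows
print(toy_restored.isnull().sum())
```

The placeholder rows keep every neighborhood represented, but they are not real listings, and their text columns remain empty, which is worth keeping in mind when reading the counts and plots that follow.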

In [None]:
# Set styles for the plots
sns.set(style="whitegrid")
plt.style.use("fivethirtyeight")

# Histogram for price
plt.figure(figsize=(10, 6))
sns.histplot(df['price'])
plt.title('Distribution of Airbnb Prices in NYC')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.xlim(0, df['price'].quantile(0.99))  # To limit the x-axis to the 99th percentile to avoid extreme outliers
plt.show()


In [None]:

# Split the data into two groups
df_below_1000 = df[df['price'] <= 1000]
df_above_1000 = df[df['price'] > 1000]

# Plot the data
plt.figure(figsize=(12, 8))

# Scatter plot for prices <= 1000
sc = plt.scatter(df_below_1000['longitude'], df_below_1000['latitude'], c=df_below_1000['price'], cmap='viridis', alpha=0.5, label='Price <= 1000')

# Scatter plot for prices > 1000
plt.scatter(df_above_1000['longitude'], df_above_1000['latitude'], color='red', alpha=0.5, label='Price > 1000')

# Add color bar for the price <= 1000 points
cbar = plt.colorbar(sc, label='Price (<= 1000)')

# Add titles and labels
plt.title('Airbnb Prices in NYC')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend()
plt.show()

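The scatter plot above maps each listing by longitude and latitude, colored by its nightly price. Listings priced above $1000 are drawn in red so that they do not stretch the color scale used for the remaining listings.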
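
The same split can also be summarized numerically. The sketch below computes, for each borough, the share of listings priced above a threshold. The sample data is synthetic and `share_above` is a hypothetical helper name.

```python
import pandas as pd


def share_above(frame: pd.DataFrame, threshold: float = 1000) -> pd.Series:
    """Fraction of listings per borough priced above `threshold`."""
    is_expensive = frame["price"] > threshold
    return is_expensive.groupby(frame["neighbourhood_group"]).mean()


# Synthetic example; with the real data, pass the notebook's `df` instead
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn"],
    "price": [1500, 250, 90, 120],
})
shares = share_above(sample)
print(shares)
```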

In [None]:
# Count the occurrences of each room type
room_type_counts = df['room_type'].value_counts()

# Get the viridis colormap (matplotlib.cm.get_cmap was removed in Matplotlib 3.9)
cmap = plt.get_cmap('viridis')

# Normalize the color range
norm = plt.Normalize(room_type_counts.min(), room_type_counts.max())
colors = cmap(norm(room_type_counts.values))

# Plot the counts
plt.figure(figsize=(10, 6))
plt.bar(room_type_counts.index, room_type_counts.values, color=colors)

# Add titles and labels
plt.title('Count of Room Types in Airbnb Listings')
plt.xlabel('Room Type')
plt.ylabel('Count')

# Add value labels on the bars
for index, value in enumerate(room_type_counts.values):
    plt.text(index, value, str(value), ha='center', va='bottom')

plt.show()

In [None]:
# Count the occurrences of each neighborhood group
neighborhood_group_counts = df['neighbourhood_group'].value_counts()

# Plot the counts
plt.figure(figsize=(10, 6))
plt.bar(neighborhood_group_counts.index, neighborhood_group_counts.values, color='skyblue')

# Add titles and labels
plt.title('Count of Neighborhood Groups in Airbnb Listings')
plt.xlabel('Neighborhood Group')
plt.ylabel('Count')

# Add value labels on the bars
for index, value in enumerate(neighborhood_group_counts.values):
    plt.text(index, value, str(value), ha='center', va='bottom')

plt.show()


In [None]:
# Plot the histogram for availability_365
plt.figure(figsize=(10, 6))
plt.hist(df['availability_365'], bins=30, edgecolor='black', color='skyblue')

# Add titles and labels
plt.title('Histogram of Availability (365 days) in Airbnb Listings')
plt.xlabel('Availability in 365 days')
plt.ylabel('Frequency')

plt.show()

In [None]:
df.describe()

# Regression

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Identify categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical columns: {categorical_columns}")

# Encode categorical variables
df = pd.get_dummies(df, columns=categorical_columns, drop_first=True)

# Ensure all data is numeric
print(df.dtypes)

# Define features and target
X = df.drop('price', axis=1)
y = df['price']

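# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features; fit the scaler on the training set only so that
# test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)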
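
When scaling is combined with cross-validation, one option is to wrap the scaler and the model in a scikit-learn `Pipeline`, so that the scaler is re-fit on the training portion of each fold. The sketch below uses synthetic data and a `DecisionTreeRegressor` purely for illustration; the step names `scaler` and `model` (and therefore the `model__` prefix in the grid) are choices made here, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the listing features
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is a pipeline step, so each CV fold fits it on its own training part
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", DecisionTreeRegressor(random_state=42)),
])

search = GridSearchCV(pipe, {"model__max_depth": [3, 5]}, cv=3, scoring="r2")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(f"Test R-squared: {search.score(X_te, y_te):.3f}")
```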


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 5]
}

grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='r2',
    verbose=3  # Set verbosity to show progress
)
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

# Train model with best parameters
model = DecisionTreeRegressor(**best_params, random_state=42)
model.fit(X_train, y_train)  # Fit the model

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'MAE: {mae}')
print(f'MSE: {mse}')
print(f'RMSE: {mse**0.5}')
print(f'R-squared: {r2}')
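
By default, `GridSearchCV` refits the best parameter combination on the whole training set (`refit=True`), so the fitted model is also available as `best_estimator_` without a separate training step. A minimal sketch on synthetic data, with made-up variable names:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends mainly on the first feature
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(150, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.5, size=150)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    {"max_depth": [2, 4]},
    cv=3,
    scoring="r2",
)
search.fit(X_demo[:120], y_demo[:120])

# best_estimator_ is already fitted on all 120 training rows
best_tree = search.best_estimator_
pred = best_tree.predict(X_demo[120:])

rmse = np.sqrt(mean_squared_error(y_demo[120:], pred))
print(f"RMSE: {rmse:.3f}")
```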


In [None]:
from sklearn.ensemble import RandomForestRegressor

# Hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 5]
}

grid_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=5,
    scoring='r2',
    verbose=3  # Set verbosity to show progress
)
grid_search.fit(X_train, y_train)

# Best parameters
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

# Train model with best parameters
model = RandomForestRegressor(**best_params, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'MAE: {mae}')
print(f'MSE: {mse}')
print(f'RMSE: {mse**0.5}')
print(f'R-squared: {r2}')
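
To relate a fitted model back to the price-analysis objective, a random forest exposes `feature_importances_`. The sketch below uses synthetic data with illustrative feature names; with the notebook's data, the column names of `X` would label the importances instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
feature_names = ["minimum_nights", "number_of_reviews", "availability_365"]
X_demo = pd.DataFrame(rng.uniform(size=(300, 3)), columns=feature_names)
# Synthetic target driven almost entirely by the last feature
y_demo = 100 * X_demo["availability_365"] + rng.normal(scale=1.0, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking rather than a precise measure of influence on price.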