# **Project Name**    - AirBnb Bookings Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**
Since 2008, guests and hosts have used Airbnb to expand travel possibilities and offer a more unique, personalized way of experiencing the world.
This project focuses on analyzing Airbnb listings to uncover key insights about the rental market, with a particular emphasis on pricing, availability, and other crucial factors influencing listing performance. The dataset used in this analysis contains 48,895 rows and 16 columns, covering listings in various neighborhoods. Each row represents a specific Airbnb listing, and the columns contain attributes like listing ID, host information, location, property type, price, and booking availability.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


We need to analyze Airbnb listing data to uncover key trends, such as which neighborhoods command the highest prices, how room type affects pricing, and how the number of bookings relates to customer reviews.


#### **Define Your Business Objective?**

Analyze Airbnb listing data to identify key factors that drive pricing and bookings, helping hosts optimize their rental strategies in a competitive market.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams


### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/AlmaBetter Projects/Airbnb NYC 2019.csv'
df = pd.read_csv(file_path)


### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

In [None]:
# Convert columns to appropriate data types
df['host_id'] = df['host_id'].astype(int)
df['price'] = df['price'].astype(float)
df['last_review'] = pd.to_datetime(df['last_review'])

# Convert categorical columns to 'category' data type
df['room_type'] = df['room_type'].astype('category')
df['neighbourhood_group'] = df['neighbourhood_group'].astype('category')

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(df[df.duplicated()])


In the dataframe, we want to handle rows with duplicate values in the specified columns (host_id, host_name, neighbourhood_group, neighbourhood, and room_type) and keep only one row per duplicate group based on the most recent 'last_review' date.

In [None]:
# Dataset Replacing Duplicated Hosts based on last review date

# Step 1: Sort the DataFrame by 'last_review' to get the most recent dates first
df = df.sort_values(by='last_review', ascending=False)

# Step 2: Drop duplicates, keeping the first occurrence (most recent 'last_review')
df = df.drop_duplicates(subset=['host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'room_type'], keep='first')
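
The sort-then-drop pattern above can be checked on a small example. The following is a minimal sketch on made-up rows (the values and the reduced column set are assumptions for illustration, not taken from the dataset); it shows that the most recently reviewed row of each duplicate group is the one that survives.

```python
import pandas as pd

# Toy data (hypothetical values): host 1 has two rows for the same room type
toy = pd.DataFrame({
    'host_id': [1, 1, 2],
    'room_type': ['Private room', 'Private room', 'Shared room'],
    'last_review': pd.to_datetime(['2019-01-05', '2019-06-20', None]),
    'price': [80, 95, 40],
})

# Sort newest first; rows with a missing date (NaT) are placed last by default
toy_sorted = toy.sort_values(by='last_review', ascending=False)

# Keep the first (most recent) row of each duplicate group
latest = toy_sorted.drop_duplicates(subset=['host_id', 'room_type'], keep='first')
print(latest)
```

Because missing dates sort last, a row without any review is only kept when its group has no reviewed row at all.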


#### Missing Values/Null Values

In [None]:
# Replace empty strings with NaN
df = df.replace("", np.nan)
# Fill missing host names with a placeholder (e.g., 'Unknown Host')
# (assign the result back instead of calling fillna(inplace=True) on a column selection)
df['host_name'] = df['host_name'].fillna('Unknown Host')
# Fill missing reviews_per_month with 0
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)
# Missing Values/Null Values Count
print(df.isnull().sum())
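
A count of nulls is easier to judge next to the share of rows it represents. The sketch below uses a toy frame with made-up values (an assumption for illustration) and builds a small summary table; the column names `missing_count` and `missing_pct` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) containing an empty string and real nulls
toy = pd.DataFrame({
    'host_name': ['Ann', '', None, 'Raj'],
    'reviews_per_month': [0.5, np.nan, np.nan, 2.0],
})

# Treat empty strings as missing, as in the cell above
toy = toy.replace('', np.nan)

# Count and percentage of missing values per column
missing = pd.DataFrame({
    'missing_count': toy.isnull().sum(),
    'missing_pct': (toy.isnull().mean() * 100).round(1),
})
print(missing)
```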


In [None]:
# Visualizing the missing values
# Checking Null Value by plotting Heatmap
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

The dataset contains detailed information about Airbnb listings in New York City. The goal of this analysis is to gain insights into the characteristics of the listings and to understand the factors that may influence pricing, availability, and reviews in the Airbnb market.
The raw dataset comprises 48,895 rows and 16 columns, capturing a wide range of attributes for each listing.
Missing values in host_name and reviews_per_month were filled in the steps above, and listings that duplicated a host, location, and room type were reduced to the most recently reviewed one. Any nulls still reported by the count above (for example, in last_review for listings that have never been reviewed) are left as they are.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description

*   **id :** listing ID.
*   **name :** the name or title of the listing.
*   **host_id :** unique identifier for the host.
*   **host_name :** the name of the host.
*   **neighbourhood_group :** borough-level group the neighbourhood belongs to (e.g., Manhattan, Brooklyn).
*   **neighbourhood :** area of the listing.
*   **latitude :** latitude coordinate.
*   **longitude :** longitude coordinate.
*   **room_type :** listing space type.
*   **price :** price in dollars.
*   **minimum_nights :** minimum number of nights.
*   **number_of_reviews :** total number of reviews received.
*   **last_review :** date of the latest review.
*   **reviews_per_month :** number of reviews per month.
*   **calculated_host_listings_count :** total number of listings per host.
*   **availability_365 :** number of days in a year when the listing is available for booking.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in df.columns.tolist():
  print(f'No. of unique values in {i} is {df[i].nunique()}.')

## 3. ***Data Wrangling***

### Data Wrangling Code

We get a summary of categorical columns like room_type, neighbourhood_group

In [None]:
# Write your code to make your dataset analysis ready.
# Value counts for 'room_type'
df['room_type'].value_counts()


Value counts for 'neighbourhood_group'

In [None]:
df['neighbourhood_group'].value_counts()

We detect the listings with 0 in the 'price' column.


In [None]:
# Count the rows whose 'price' is 0
zero_prices = (df['price'] == 0).sum()
print(f'Listings with a price of 0: {zero_prices}')


We perform Feature Engineering to create a new feature that categorizes listings based on their price range (i.e, low, medium, high, very high and extremely high).

In [None]:
# Create a new feature for price category
bins = [0, 100, 300, 1000, 2000, 10000]
labels = ['Low', 'Medium', 'High', 'Very High', 'Extremely High']
# include_lowest=True keeps listings priced at exactly 0 in the 'Low' bin instead of NaN
df['price_category'] = pd.cut(df['price'], bins=bins, labels=labels, include_lowest=True)
df['price_category'].value_counts()
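
How `pd.cut` treats bin edges matters for listings priced at exactly 0 or at a boundary such as 100. The sketch below uses a handful of made-up prices (an assumption for illustration) with the same bin edges as above and compares the default behaviour with `include_lowest=True`.

```python
import pandas as pd

# Hypothetical prices chosen to sit on or next to the bin edges
prices = pd.Series([0, 50, 100, 101, 10000])
bins = [0, 100, 300, 1000, 2000, 10000]
labels = ['Low', 'Medium', 'High', 'Very High', 'Extremely High']

# Default: intervals are open on the left, so a price of 0 falls outside (0, 100]
default_cut = pd.cut(prices, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left, so 0 lands in 'Low'
inclusive_cut = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)

print(pd.DataFrame({'price': prices, 'default': default_cut, 'inclusive': inclusive_cut}))
```

With the default settings the zero-price row gets no category, while 100 still counts as 'Low' because each interval includes its right edge.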


We will check for extreme outliers in 'price' column.

In [None]:
# Handling outliers
# Visualize outliers in the 'price' column
sns.boxplot(x=df['price'])
plt.show()
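
The points a boxplot draws beyond its whiskers follow, by default, the 1.5 x IQR rule. The following minimal sketch applies that rule by hand to a few made-up prices (an assumption for illustration), which makes it possible to count flagged listings rather than only look at them.

```python
import pandas as pd

# Hypothetical prices with one extreme value
prices = pd.Series([40, 60, 80, 100, 120, 150, 180, 10000])

# Interquartile range and the usual 1.5 * IQR fences
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones a boxplot shows as individual points
flagged = prices[(prices < lower_fence) | (prices > upper_fence)]
print(f'Fences: [{lower_fence}, {upper_fence}], flagged: {flagged.tolist()}')
```

Flagging a value this way does not mean it should be dropped; it only marks the listings that deserve a closer look.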

Top 10 Neighbourhoods by Listings and their mean prices

In [None]:
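# Top 10 neighbourhoods ranked by the summed 'calculated_host_listings_count', with their mean price
# Note: 'calculated_host_listings_count' is a per-host total repeated on every listing of that host,
# so this sum is not the same as the number of listing rows in the neighbourhood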
top_10_neighbourhoods = df.groupby('neighbourhood').agg({
    'calculated_host_listings_count': 'sum',
    'price': 'mean'
}).sort_values(by='calculated_host_listings_count', ascending=False).head(10)

# Display the result
print(top_10_neighbourhoods)
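
Summing `calculated_host_listings_count` and counting rows can give different rankings. The sketch below uses a toy frame with made-up values, assuming (as the variable description suggests) that the column stores each host's total on every one of that host's listings; the output column names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): host 10 owns three listings in neighbourhood A
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'neighbourhood': ['A', 'A', 'A', 'B'],
    'host_id': [10, 10, 10, 20],
    'calculated_host_listings_count': [3, 3, 3, 1],
    'price': [100, 120, 140, 90],
})

# Compare the row count with the summed per-host total for each neighbourhood
summary = toy.groupby('neighbourhood').agg(
    listing_count=('id', 'count'),
    summed_host_count=('calculated_host_listings_count', 'sum'),
    mean_price=('price', 'mean'),
)
print(summary)
```

Neighbourhood A has three listings, but the summed column reports nine because the host's total of three is added once per listing.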


We calculate the mean prices based on the 'room_types'

In [None]:
# Group by 'room_type' and calculate the mean price
room_type_avg_price = df.groupby('room_type')['price'].mean()

# Display the result
print(room_type_avg_price)

### What all manipulations have you done and insights you found?

In this project, we performed several key data wrangling tasks to prepare the dataset for analysis and uncover insights into the Airbnb listings. Below is a summary of the manipulations we performed and the insights derived from them:

*   **Summary of Categorical Columns:**
We analyzed categorical variables such as room_type and neighbourhood_group using value counts to understand the distribution of listings across different room types and neighbourhood groups.

***Insight:*** The majority of listings fall into "Entire home/apt" and are concentrated in popular neighbourhood groups, such as Manhattan and Brooklyn.

*   **Listings with Zero Price**:
We detect the listings with a price of 0.

***Insight:*** There are a few listings with a price of 0, which may be due to dynamic pricing or to hosts not sharing the price with Airbnb.

*   **Feature Engineering:**
A new feature was created to categorize listings based on their price ranges, dividing them into categories such as low, medium, high, very high, and extremely high price groups.

***Insight:*** The majority of listings fall within the low and medium price categories, with fewer listings in the very high and extremely high categories. This tells us that the 'price' data is right skewed.

*   **Prices Outlier Detection:**
We plot a boxplot to see how the price is spread out, especially at the high end, irrespective of region.

***Insight:*** The bulk of the prices lies roughly in the $0-180 range, but the maximum price is $10,000. We do not discard these high values as outliers because price varies with factors such as location, room type, neighbourhood, and season.

*   **Top 10 Neighbourhoods by Listings and Mean Prices:**
We identified the top 10 neighbourhoods with the highest number of listings and calculated their mean prices.

***Insight:*** Among the top 10, 'Williamsburg' ranked first and 'Chelsea' ranked last. Within the same 10 neighbourhoods, 'Midtown' had the highest average price and 'Bushwick' the lowest.

*   **Mean Prices by Room Type:**
We calculated the mean prices for different room types (e.g., Entire home/apt, Private room, Shared room).

***Insight:*** Entire homes or apartments tend to have the highest average prices, followed by private rooms, while shared rooms have the lowest average prices. This reinforces the idea that privacy and space are highly valued in the short-term rental market.

These data wrangling tasks helped clean and structure the dataset, providing us with clear insights into the factors affecting Airbnb listings, such as location, room type, and pricing strategies.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1: Scatter Plot to show the neighbourhood group based on Latitude and Longitude.

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(12,8))
sns.scatterplot(x=df.longitude,y=df.latitude,hue=df.neighbourhood_group)
plt.show()

##### 1. Why did you pick the specific chart?

To clearly visualize the neighbourhood group as a cluster based on Latitude and Longitude.

##### 2. What is/are the insight(s) found from the chart?

The higher density of listings in Manhattan and Brooklyn suggests these areas are popular destinations for visitors.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*   **Positive impact-** The insights gained from this chart will allow Airbnb hosts and the platform to focus marketing efforts on popular areas like Manhattan and Brooklyn, where demand is high.
*   **Negative impact**- The high density of listings in areas like Manhattan and Brooklyn may lead to over-saturation, causing increased competition between hosts.



#### Chart - 2: Room Types

In [None]:
# Chart - 2 visualization code
df['room_type'].value_counts().plot(kind='bar',color=['r','b','y'])
plt.show()

##### 1. Why did you pick the specific chart?



To provide a visual representation of the distribution of Airbnb listings by room type (Entire home/apt, Private room, and Shared room)

##### 2. What is/are the insight(s) found from the chart?

From the chart, it is clear that entire homes/apartments and private rooms are listed far more often than shared rooms. In general, shared rooms cost less and can be useful for travellers who move from one city to another quite frequently.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*   **Positive impact-** The insights suggest that Airbnb hosts should focus on offering "Entire home/apt" and "Private room" listings to cater to the preferences of most travelers.
*   **Negative impact**- The significantly lower number of "Shared room" listings implies that this category might not be as profitable to hosts.

#### Chart - 3: Neighbourhood Group

In [None]:
# Chart - 3 visualization code
df['neighbourhood_group'].value_counts().plot(kind='bar',color=['r','b','y','g','m'])
plt.show()

##### 1. Why did you pick the specific chart?

This bar chart is chosen to provide a clear comparison of the number of AirBNB listings across different neighborhood groups.

##### 2. What is/are the insight(s) found from the chart?

From the chart, it is evident that Manhattan and Brooklyn have more listings than Queens, the Bronx, and Staten Island.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


*   **Positive impact**- Business owners or property managers can focus their investments in Manhattan and Brooklyn, where there is a large number of listings and likely high demand.
*   **Negative impact**- Highly concentrated listings imply a very competitive market, which could lead to lower occupancy rates for some hosts or reduced profits due to pricing competition.

#### Chart - 4: Average room rent for locality

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(12,8))
# Keep only listings with a one-night minimum stay
df_per_night = df[df['minimum_nights']==1]
# Mean price for every (room type, neighbourhood group) pair, sorted from cheapest to most expensive
df1 = df_per_night.groupby(['room_type','neighbourhood_group'])['price'].mean().sort_values(ascending=True)
df1.plot(kind='bar')
plt.title('Average Price by Room Type and Neighbourhood Group')
plt.ylabel('Average Daily Price')
plt.xlabel('Room Type, Neighbourhood Group')
plt.show()

##### 1. Why did you pick the specific chart?

This chart shows the average price per night for each combination of room type and neighbourhood group. It was chosen because it makes the relationship between room type, location, and price clear, which is useful for understanding market dynamics in the Airbnb sector.
It is also useful for tourists, who can plan their stay based on their budget.

##### 2. What is/are the insight(s) found from the chart?

Staying in an entire home/apartment is more expensive than staying in a shared or private room in every location. Entire homes are typically rented by families who want a comfortable stay, where privacy is a major factor. Shared rooms, in contrast, are preferred by travellers who do not stay long in one place and move on quickly.

Looking at the plot, the following is clear:

a. A shared room on Staten Island is the cheapest stay per night, whereas renting an entire home/apartment in Manhattan is the most expensive.

b. The average price of a private room is considerably higher in Manhattan than in the other neighbourhood groups. This indicates that Manhattan offers a more expensive stay than any other locality.

c. The Bronx is the cheapest neighbourhood group when room types are compared across groups.

d. Although shared rooms on Staten Island are the cheapest, entire apartments there are not the cheapest in their category. This may be because Staten Island serves as a getaway from the rush of the city for quality time and family get-togethers.
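
Comparisons like the ones listed above are easier to read from a table with one row per neighbourhood group and one column per room type. The following is a minimal sketch on made-up prices (the values are assumptions for illustration, not results from the dataset) that reshapes the grouped means with `unstack`.

```python
import pandas as pd

# Toy data (hypothetical prices) for two neighbourhood groups
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room',
                  'Private room', 'Shared room'],
    'neighbourhood_group': ['Manhattan', 'Bronx', 'Manhattan', 'Bronx', 'Bronx'],
    'price': [250, 120, 110, 60, 40],
})

# Mean price per (neighbourhood group, room type), pivoted so room types become columns
table = (
    toy.groupby(['neighbourhood_group', 'room_type'])['price']
       .mean()
       .unstack('room_type')
)
print(table)
```

Combinations with no listings appear as NaN, which makes gaps in the data visible instead of silently missing from a bar chart.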

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*   **Positive impact-** Hosts in lower-priced areas like Staten Island or the Bronx can attract budget-conscious travelers by emphasizing their lower prices for shared and private rooms.
*   **Negative impact-** High prices in competitive markets could turn off potential renters, especially during periods of low demand.

#### Chart - 5: Expensive Neighbourhood

In [None]:
# Chart - 5 visualization code
# Group by 'neighbourhood' and calculate the mean price
top_10_expensive_neighbourhood = df.groupby('neighbourhood')['price'].mean().nlargest(10)

plt.figure(figsize=(12, 8))
sns.barplot(x=top_10_expensive_neighbourhood.values, y=top_10_expensive_neighbourhood.index, palette='magma')

plt.title('Top 10 Most Expensive Localities in Airbnb Listings (Based on Average Price)', fontsize=16)
plt.xlabel('Average Price', fontsize=12)
plt.ylabel('Neighbourhood', fontsize=12)

plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen to visually represent the average prices of different neighborhoods in Airbnb listings. This format is effective for comparing multiple values and easily identifying the highest and lowest priced areas.

We have plotted only the top 10 neighbourhoods by average price. This helps travellers choose an appropriate neighbourhood based on their budget.

##### 2. What is/are the insight(s) found from the chart?

Fort Wadsworth is the most expensive neighborhood, significantly outpacing the others in average price.
Woodrow and Sea Gate also command premium prices.
NoHo has the lowest average price among these top 10 neighborhoods.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*   **Positive impact-** This data can help Airbnb hosts in these neighborhoods set competitive prices based on the local market.
*   **Negative impact-** If hosts in lower-priced neighborhoods significantly increase their rates based on the chart, it could deter potential guests and negatively impact occupancy.

#### Chart - 6: Top 10 neighbourhood locality based on listings

In [None]:
# Chart - 6 visualization code
# Count listings per neighbourhood (non-null host_name entries) and sort in descending order
df5 = df.groupby('neighbourhood')['host_name'].count().sort_values(ascending=False).rename('Listing Count')

# Plot the 10 neighbourhoods with the most listings
df5.head(10).plot(kind='barh')
plt.xlabel('Listing Count')
plt.show()

##### 1. Why did you pick the specific chart?

A horizontal bar chart was chosen to visually represent the relative popularity of different neighborhoods in the dataset, as indicated by the count of listings.

##### 2. What is/are the insight(s) found from the chart?

Williamsburg has the most listings, followed by Bedford-Stuyvesant.
Midtown has the fewest listings among the top 10.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*   **Positive impact-** This data can help businesses target specific neighborhoods with higher foot traffic and potential customer bases.
*   **Negative impact-** If too many businesses concentrate in the most popular neighborhoods, it could lead to increased competition and decreased profitability.

#### Chart - 7: Location and Review Count

In [None]:
# Chart - 7 visualization code
fig = plt.figure(figsize=(12,4))
# Keep listings that have received at least 50 reviews
review_50 = df[df['number_of_reviews']>=50]
# Count those listings per neighbourhood group
df2 = review_50['neighbourhood_group'].value_counts()
df2.plot(kind='bar',color=['r','b','g','y','m'])
plt.title('Listings with at Least 50 Reviews by Neighbourhood Group')
plt.ylabel('Number of Listings')
plt.xlabel('Neighbourhood Group')
plt.show()

##### 1. Why did you pick the specific chart?

This chart shows the number of listings with at least 50 reviews in each neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

According to the plot, Brooklyn has more listings with at least 50 reviews than Manhattan, which is an interesting find. Staten Island, which is cheaper, has fewer such listings than the other neighbourhood groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Reviews are one of the most important criteria for online bookings these days. They give tourists a lot of insight into a particular place and can sway their decision. A cheap place with bad reviews can put a tourist off booking, while an expensive place with excellent reviews can lead a tourist to spend more than initially planned.

#### Chart - 8: Top 5 host

In [None]:
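# Chart - 8 visualization code
plt.figure(figsize=(12,6))
# Top 5 host names among listings with at least 50 reviews (review_50 is defined in the Chart - 7 cell)
# Note: host_name is not unique, so different hosts sharing a name are counted together
review_50['host_name'].value_counts()[:5].plot(kind='bar',color=['r','b','g','y','m'])
plt.title('Top 5 Host Names by Listings with at Least 50 Reviews')
plt.show()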

##### 1. Why did you pick the specific chart?

A bar chart was chosen to visually compare hosts by their number of listings with at least 50 reviews. A well-reviewed host increases a tourist's confidence before booking.

##### 2. What is/are the insight(s) found from the chart?

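Michael is the most common host name among well-reviewed listings, followed by Alex.
John has the fewest listings among the top 5. Because host names are not unique, each bar may combine several hosts who share the same name.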


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*   **Positive impact-** This data can help identify top-performing hosts who might be eligible for special recognition or incentives.
*   **Negative impact-**  If a few hosts dominate the market, it could create barriers for new hosts to enter the platform.

#### Chart - 9: Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(12, 8))
sns.heatmap(df[['price', 'minimum_nights', 'number_of_reviews','reviews_per_month', 'calculated_host_listings_count', 'availability_365']].corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

##### 1. Why did you pick the specific chart?

A correlation heatmap was chosen to visually represent the relationships between different numerical features in the dataset. This format is effective for quickly identifying positive and negative correlations between variables.

##### 2. What is/are the insight(s) found from the chart?

Listings with longer minimum stays might be associated with hosts who have more listings. Most other pairs of variables exhibit weak or no correlations, indicating minimal relationships between them.
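
Weak Pearson correlations can partly reflect skew and extreme values rather than a lack of relationship. As a hedged illustration, and not something the heatmap above computes, the sketch below compares Pearson with Spearman rank correlation on made-up numbers in which one extreme price distorts an otherwise perfectly increasing pattern.

```python
import pandas as pd

# Toy data (hypothetical values): price rises with reviews, but the last price is extreme
toy = pd.DataFrame({
    'number_of_reviews': [1, 2, 3, 4, 5],
    'price': [50, 60, 70, 80, 10000],
})

# Pearson measures linear association and is pulled around by the extreme value
pearson = toy.corr(method='pearson').loc['number_of_reviews', 'price']

# Spearman works on ranks, so it only reflects the monotonic ordering
spearman = toy.corr(method='spearman').loc['number_of_reviews', 'price']

print(f'Pearson: {pearson:.2f}, Spearman: {spearman:.2f}')
```

Passing `method='spearman'` to the `.corr()` call used for the heatmap would be one way to check whether the weak correlations hold up on ranks.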

#### Chart - 10: Pair Plot

In [None]:
# Pair Plot visualization code
# Pair Plot for selected numeric variables
sns.pairplot(df[['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']])
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot was chosen to visually represent the relationships between different numerical features in the dataset. This format is effective for quickly exploring the distribution of each variable and identifying potential correlations between pairs of variables.

##### 2. What is/are the insight(s) found from the chart?

The pair plot reveals the distribution of each variable, including skewness and outliers.
Visual patterns in the scatter plots suggest potential correlations between certain pairs of variables, such as a positive correlation between number_of_reviews and reviews_per_month.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?


Solution to Achieve Business Objective:

*   Define the specific objectives(e.g., revenue growth, market expansion, customer satisfaction).
*   Evaluate existing data to identify areas of improvement.
*   Categorize customers based on demographics, preferences, and behaviors
*   Regularly assess and improve the service.
*   Gather feedback, conduct surveys, and understand customers.
*   Leverage Technology to streamline operations and improve efficiency
*   Stay Competitive with Pricing.
*   Continuously refine marketing strategies.
*   Build partnerships with other businesses.
*   Regularly track performance against business goals.

# **Conclusion**

*   Manhattan and Brooklyn dominate in terms of the number of listings, indicating high demand and competition in these boroughs.
*   Staten Island and the Bronx have fewer listings, representing potential growth opportunities for hosts willing to invest in less saturated markets.
*   Entire homes/apartments in Manhattan command the highest daily rates, making it a lucrative area for property owners with premium offerings.
*   Shared rooms across all neighborhoods are priced more affordably, with Staten Island offering the cheapest options.
*  Pricing segmentation by neighborhood and room type offers hosts insights on how to strategically price their properties to maximize returns.
*   While Manhattan and Brooklyn offer higher returns, hosts should be cautious about market saturation, which may affect occupancy rates and profits.
*   Lower-priced areas such as the Bronx and Staten Island can attract budget-conscious travelers through targeted marketing, which could improve occupancy rates.
*   These insights empower AirBNB hosts and investors to make data-driven decisions on property pricing, investment locations, and marketing strategies to enhance profitability.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***