<a href="https://colab.research.google.com/github/vishalgokak/Projects/blob/main/AirBnb_booking_analysis_EDA_project_(1).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - AirBnb Booking Analysis EDA



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**           - Vishal gokak

# **Project Summary -**

Airbnb is a popular online marketplace that allows people to rent out their homes or apartments to travelers. The platform provides a convenient and affordable alternative to traditional hotels, and hosts can earn extra income by renting out their properties.

This project can help aspiring Airbnb hosts identify the features a listing needs in order to charge a higher price without losing customers. Travellers, in turn, can see which factors to look at to get the lowest possible price while keeping the features they prefer.

In short, this analysis aims to help both Airbnb hosts and travellers make informed decisions based on the factors and features that most affect listing price.

**Approach used:**

The approach used in this project follows these steps:

1) **Loading our data**: We load the dataset into a Colab notebook by reading the CSV file.

2) **Data Cleaning and Processing**: We check for null values and, for some columns, replace them with appropriate values based on reasonable assumptions.

3) **Analysis and Visualization**: We explore the variables that can play an important role in the analysis, then examine how they affect one another, and finally answer our hypothetical questions.

4) **Future scope of Further Analysis**: Many apartments have an availability of 0 and a very old last_review date, which may mean that they have stopped operating. Relating these apartments to their neighbourhoods could reveal micro-trends that we could not cover in the time available. Other columns, such as number_of_reviews and reviews_per_month, could also play an important role in further analysis, either on their own or grouped with other factors.
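
The idea in point 4 can be sketched in a few lines. This is a minimal illustration, not part of the analysis itself: the helper name `flag_inactive`, the cutoff date of 2018-01-01 and the toy rows are all assumptions chosen for the example; only the column names come from the dataset.

```python
import pandas as pd


def flag_inactive(listings: pd.DataFrame, cutoff: str = "2018-01-01") -> pd.DataFrame:
    """Return listings with availability_365 == 0 whose last review predates `cutoff`.

    The cutoff is an assumed threshold for "very old"; listings with no
    last_review at all are not flagged, because NaT compares as False.
    """
    last_review = pd.to_datetime(listings["last_review"], errors="coerce")
    mask = (listings["availability_365"] == 0) & (last_review < pd.Timestamp(cutoff))
    return listings[mask]


# Hypothetical toy rows, for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens", "Bronx"],
    "availability_365": [0, 0, 120, 0],
    "last_review": ["2015-06-01", "2019-05-20", "2016-01-01", None],
})

inactive = flag_inactive(toy)
# Count possibly inactive listings per neighbourhood group
inactive_by_group = inactive["neighbourhood_group"].value_counts()
print(inactive_by_group)
```

Applied to the real data, the same call would be `flag_inactive(df)`, and the resulting counts per neighbourhood group would be a starting point for the follow-up analysis described above.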

**Types of graphs used for data visualization:**

1) Count Plot

2) Bar Plot

3) Scatter Plot

4) Heatmap

5) Box plot

**Python Libraries used for graphs:**

1) Matplotlib

2) Seaborn

3) Numpy

4) Pandas

**What is EDA?**

Exploratory Data Analysis (EDA) is a very important step in machine learning. Whenever we start work on a project, we must first study the data and the factors in it closely. Doing so raises hypothetical questions, and answering those questions uncovers facts hidden in the data. This cycle of questioning and exploring is known as EDA. It involves the following steps:

1) Acquiring and loading data

2) Understanding the variables

3) Cleaning dataset

4) Exploring and Visualizing Data

5) Analyzing relationships between variables

**Business Context:**

The business context for Airbnb is rooted in the current state and projected future of the vacation rental industry. As the tourism industry recovers from the COVID-19 pandemic, short-term rental platforms like Airbnb are seeing increased demand and revenue. In particular, the US market is experiencing a surge in listings and high occupancy rates, especially during peak vacation months. This presents an opportunity for investors and property owners to capitalize on the market trends and optimize their rental strategies to meet the needs of travelers.

However, with the potential for fluctuations in the market and varying demand across different locations, it is important for businesses to conduct data analysis to inform their decision-making and stay competitive. Through data analysis, businesses can gain insights into market trends, occupancy rates, pricing strategies, and customer preferences that can help them optimize their listings and increase their revenue.

# **GitHub Link -**

https://github.com/vishalgokak/Projects/blob/main/AirBnb_booking_analysis_EDA_project_(1)%20(2).ipynb

# **Problem Statement**



The aim of this project is to analyze Airbnb data and provide actionable insights to improve the hosting experience and overall satisfaction of both hosts and guests. By exploring the dataset, we seek to address the following key challenges:

1. **Demand and Pricing Analysis**: Analyze historical booking data to understand the factors influencing demand for Airbnb accommodations in different locations. Determine how various attributes, such as location, property type, amenities, and seasonality, impact pricing strategies.

2. **Customer Experience Assessment**: Examine customer reviews and ratings to identify patterns and trends that impact guest satisfaction. Identify common pain points and areas for improvement in order to enhance the overall experience of guests staying at Airbnb listings.

3. **Host Performance Evaluation**: Evaluate the performance of hosts based on key metrics such as occupancy rate, average ratings, and response time. Identify successful hosting practices and provide recommendations for hosts to improve their listing's performance and attract more bookings. Explore the impact of factors like pricing, availability, and responsiveness on host ratings.

4. **Trend Analysis and Forecasting**: Identify emerging trends in the Airbnb market, such as new popular neighborhoods, changing guest preferences, or shifts in demand during specific seasons.

Overall, this project aims to leverage Airbnb data to empower hosts and improve the guest experience by providing valuable insights into pricing, customer preferences, host performance, competition, and future trends. The findings will help hosts make data-driven decisions to optimize their offerings and enhance the overall Airbnb ecosystem.

# **General Guidelines** : -

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.

     The additional credits will have advantages over other students during Star Student selection.

             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that the following questions are answered for each chart.


```
# Chart visualization code
```


*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





**Table of content**

1. Loading Data

2. Checking for NaN values

3. Handling NaNs

4. Analysis

# **Questions for Analysis:**



**1) What are the top 10 neighbourhoods contributing the most number of apartments for AirBnB?**

**2) What is the average price of bookings based on neighbourhood group as per category of listing?**

**3) How does the cost vary with respect to the neighbourhood in each neighbourhood group?**

**4) How does the range of room type differ according to each neighbourhood group?**

**5) How is neighbourhood related to reviews?**

**6) How is the price column distributed over room_type?**

**7) What is the average price preferred by the customers according to each neighbourhood group for each category of room type?**

**8) What is the distribution of the room type across the locations?**

**9) Which hosts have the highest number of apartments?**

**10) Which are the top 5 hosts that have obtained the highest number of reviews?**

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import math  # numpy.math is a deprecated alias of the standard-library math module
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams


import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path = "/content/drive/MyDrive/Airbnb NYC 2019.csv"
data_set = pd.read_csv(path)
data_set.head(5)

### Dataset First View

In [None]:
# let's check out the columns having NULL values
data_set.isna().sum()

In [None]:
data_set.loc[:,data_set.isna().sum()!=0]

In [None]:
# Dataset First Look
data_set.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

data_set.shape

### Dataset Information

In [None]:
# Dataset Info

data_set.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
len(data_set[data_set.duplicated()])

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(data_set.isnull().sum())

In [None]:
# Visualizing the missing values

sns.heatmap(data_set.isnull(), cbar=False)
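
The heatmap shows where values are missing, but exact counts and shares are easier to read from a small table. The sketch below is one possible way to build such a table; the helper name `missing_summary` and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy rows, for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny room", "Studio"],
    "last_review": ["2019-01-01", None, None, "2018-07-04"],
    "price": [100, 80, 60, 150],
})

summary = missing_summary(toy)
print(summary)
```

In the notebook, the equivalent call would be `missing_summary(data_set)`.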

**About the dataset:**

This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values.

Airbnb is an online marketplace connecting travelers with local hosts. On one side, the platform enables people to list their available space and earn extra income in the form of rent. On the other, Airbnb enables travelers to book unique homestays from local hosts, saving them money and giving them a chance to interact with locals. Catering to the on-demand travel industry, Airbnb is present in over 190 countries across the world.

The data we are going to analyse is the Airbnb NYC (2019) dataset. Our main objectives are the four problem statements above, which can be summarised as learnings about hosts, areas, prices, reviews and locations. We are not limited to these, however, and will also try to explore further insights.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data_set.columns

In [None]:
# Dataset Describe
data_set.describe(include = 'all')


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in data_set.columns.tolist():
  print("no of unique values in",i,"is",data_set[i].nunique(),".")

## 3. ***Data Wrangling***

In [None]:
# Write your code to make your dataset analysis ready.
# Create a copy of the current dataset and assigning to df
df= data_set.copy()





In [None]:
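# Count the unique listing ids (comparing 'id' to True only matches id == 1;
# the dataset has no customer column, so each row is one listing)
print("No of listings on Airbnb : - ", df['id'].nunique())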

In [None]:
# Listings without any review have NaN in reviews_per_month; treat these as 0 reviews per month
df.fillna({'reviews_per_month':0},inplace=True)
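
If the null check above also reports gaps in text or date columns, the same step can be extended. The sketch below shows one way to do it, under stated assumptions: the helper name `clean_listings`, the placeholder value `"Unknown"` and the toy rows are illustrative choices, not part of the original notebook.

```python
import pandas as pd


def clean_listings(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input frame is left unchanged."""
    cleaned = frame.copy()
    # No reviews means zero reviews per month
    cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)
    # Assumed placeholder for missing text fields
    for column in ("name", "host_name"):
        cleaned[column] = cleaned[column].fillna("Unknown")
    # Parse dates; missing or unparseable values become NaT
    cleaned["last_review"] = pd.to_datetime(cleaned["last_review"], errors="coerce")
    return cleaned


# Hypothetical toy rows, for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "name": ["Cozy loft", None],
    "host_name": ["Ana", None],
    "last_review": ["2019-05-21", None],
    "reviews_per_month": [0.4, None],
})

cleaned = clean_listings(toy)
print(cleaned)
```

Leaving `last_review` as `NaT` rather than inventing a date keeps "never reviewed" distinguishable from "reviewed long ago".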

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

 **Question 1: What are the top 10 neighbourhoods contributing the most number of apartments for AirBnB?**

In [None]:
df['neighbourhood'].value_counts().head(10)

In [None]:
#Plotting the top 10 neighbourhoods contributing the most number of apartments for AirBnB
# (Series.value_counts replaces the deprecated top-level pd.value_counts)
df['neighbourhood'].value_counts().head(10).plot.bar()

In [None]:
df['neighbourhood_group'].value_counts()

In [None]:
# number of listings in each neighbourhood group
df['neighbourhood_group'].value_counts().plot.bar()

**Insights**: Manhattan contributes the largest number of apartments, whereas Staten Island contributes the fewest. The least-represented areas can therefore be considered as potential locations for future expansion.

**Question 2: What is the average price of bookings based on neighbourhood group as per category of listing?**

In [None]:
# applying groupby over 'neighbourhood_groups' and 'room_type'
# then applying mean of price  and unstacking for clear visualization

average_price = df.groupby(['neighbourhood_group', 'room_type'])['price'].mean().unstack()
average_price

In [None]:
#plotting for the average price vs neighbourhood group

average_price.plot.bar(figsize=(20,8), ylabel = 'Average Booking price')

**Insights**: Manhattan has the highest average prices of all the neighbourhood groups.

**Question 3: How does the cost vary with respect to the neighbourhood in each neighbourhood group?**

In [None]:
# The top 3 neighbourhoods in each neighbourhood group having maximum price
df_manhattan=df[df['neighbourhood_group']=='Manhattan']
df_queens=df[df['neighbourhood_group']=='Queens']
df_brooklyn=df[df['neighbourhood_group']=='Brooklyn']
df_bronx=df[df['neighbourhood_group']=='Bronx']
df_staten=df[df['neighbourhood_group']=='Staten Island']

# top 3 neighbourhood in Manhattan which are having maximum prices
print('Top 3 neighbourhood in Manhattan which are having maximum prices ')
df_manhattan.groupby(['neighbourhood'])['price'].max().sort_values(ascending=False).reset_index().head(3)

In [None]:
# top 3 neighbourhood in Staten Island which are having maximum prices
print('Top 3 neighbourhood in Staten Island which are having maximum prices')
df_staten.groupby(['neighbourhood'])['price'].max().sort_values(ascending=False).reset_index().head(3)

In [None]:
# top 3 neighbourhoods in the Bronx which are having maximum prices
print('Top 3 neighbourhoods in the Bronx which are having maximum prices')
df_bronx.groupby(['neighbourhood'])['price'].max().sort_values(ascending=False).reset_index().head(3)

In [None]:
# top 3 neighbourhoods in Queens which are having maximum prices
print('Top 3 neighbourhoods in Queens which are having maximum prices')
df_queens.groupby(['neighbourhood'])['price'].max().sort_values(ascending=False).reset_index().head(3)

In [None]:
# top 3 neighbourhoods in Brooklyn which are having maximum prices
print('Top 3 neighbourhoods in Brooklyn which are having maximum prices')
df_brooklyn.groupby(['neighbourhood'])['price'].max().sort_values(ascending=False).reset_index().head(3)
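
The five per-group cells above repeat the same logic, so the same result can be obtained in a single pass. The sketch below is one way to do that; the helper name `top_neighbourhoods_by_max_price` and the toy rows are hypothetical, while the column names match the dataset.

```python
import pandas as pd


def top_neighbourhoods_by_max_price(frame: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """Top `n` neighbourhoods per neighbourhood_group, ranked by maximum price."""
    max_price = frame.groupby(
        ["neighbourhood_group", "neighbourhood"], as_index=False
    )["price"].max()
    # Sort so the most expensive neighbourhoods come first within each group
    ranked = max_price.sort_values(
        ["neighbourhood_group", "price"], ascending=[True, False]
    )
    # head(n) on a groupby keeps the first n rows of every group
    return ranked.groupby("neighbourhood_group").head(n).reset_index(drop=True)


# Hypothetical toy rows, for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 5 + ["Bronx"] * 2,
    "neighbourhood": ["Harlem", "Harlem", "Chelsea", "SoHo", "Midtown",
                      "Fordham", "Riverdale"],
    "price": [200, 500, 300, 900, 100, 80, 120],
})

top = top_neighbourhoods_by_max_price(toy, n=2)
print(top)
```

In the notebook, `top_neighbourhoods_by_max_price(df)` would return all five neighbourhood groups in one table, which makes them easier to compare side by side.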

**Question 4: How does the range of room type differ according to each neighbourhood group?**

In [None]:
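plt.figure(figsize=(10,5))
# fixed room_type order: value_counts() sorts by frequency, which differs between
# neighbourhood groups, so each series is reindexed to line up with the tick labels
room_types = ['Entire home/apt', 'Private room', 'Shared room']
ind = np.arange(len(room_types))
width = 0.15  # bar width; five bars are drawn side by side per room type

# storing the listing counts by room_type for each neighbourhood_group
bronx_values=df_bronx['room_type'].value_counts().reindex(room_types, fill_value=0).values
brooklyn_values=df_brooklyn['room_type'].value_counts().reindex(room_types, fill_value=0).values
manhattan_values=df_manhattan['room_type'].value_counts().reindex(room_types, fill_value=0).values
queen_values=df_queens['room_type'].value_counts().reindex(room_types, fill_value=0).values
staten_values=df_staten['room_type'].value_counts().reindex(room_types, fill_value=0).values
# plotting the values, shifting each group by one bar width so bars do not overlap
plt.bar(ind,bronx_values,width,label='Bronx')
plt.bar(ind+width,brooklyn_values,width,label='Brooklyn')
plt.bar(ind+2*width,manhattan_values,width,label='Manhattan')
plt.bar(ind+3*width,queen_values,width,label='Queens')
plt.bar(ind+4*width,staten_values,width,label='Staten Island')
plt.xlabel('Room type')
plt.ylabel('Number of listings')
plt.title('Distribution of Room type over the different Neighbourhood Group')

# centre the tick labels under each cluster of five bars
plt.xticks(ind + 2*width, room_types)

plt.legend(loc='best')
plt.show()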

**Insights**: On examination, the mix of room types appears broadly similar across the neighbourhood groups.

**Question 5: How is neighbourhood related to reviews?**

In [None]:
#Top 5 neighbourhoods by the highest reviews_per_month recorded for any single listing
df.groupby(['neighbourhood'])['reviews_per_month'].max().sort_values(ascending=False).reset_index().head(5)


In [None]:
#Top 5 Neighbourhoods contributing the highest number of reviews
df.groupby(['neighbourhood'])['number_of_reviews'].sum().sort_values(ascending=False).reset_index().head(5)

**Question 6: How is the price column distributed over room_type?**

In [None]:
# let's check which listings have the highest price of all
# and look at their reviews per month, last review and availability_365 in order to judge them

df[df['price']==df['price'].max()][['host_name','reviews_per_month','last_review','availability_365','price','neighbourhood_group']]

**Insights from the analysis:**

1) Despite being the highest-priced listings, those of 'Kathrine' and 'Erin' have an availability of 0.

2) The hosts 'Kathrine' and 'Erin' were last reviewed in 2019

3) In general, affordability has a strong effect on how likely a listing is to be preferred by customers, which is evident in this case.
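
To look at the distribution of price over room_type directly, a per-room-type summary of the quartiles is a simple complement to the cell above; these are the same statistics a box plot would draw. This is a minimal sketch: the helper name `price_by_room_type` and the toy rows are hypothetical.

```python
import pandas as pd


def price_by_room_type(frame: pd.DataFrame) -> pd.DataFrame:
    """Quartiles, maximum and row count of price for each room_type."""
    stats = frame.groupby("room_type")["price"].describe()
    return stats[["count", "25%", "50%", "75%", "max"]]


# Hypothetical toy rows, for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 3 + ["Private room"] * 2 + ["Shared room"],
    "price": [100, 200, 300, 50, 70, 30],
})

stats = price_by_room_type(toy)
print(stats)
```

In the notebook, `price_by_room_type(df)` gives the table for the full dataset, and `sns.boxplot(x='room_type', y='price', data=df)` would show the same quartiles graphically. Because a few listings carry very high prices, the median (`50%`) is likely to be a more representative figure than the mean.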

**Question 7: What is the average price preferred by the customers according to each neighbourhood group for each category of room type?**

In [None]:
# applying groupby over 'neighbourhood_groups' and 'room_type'
# then applying mean of price  and unstacking for clear visualization

avg_price_df = df.groupby(['neighbourhood_group','room_type'])['price'].mean().unstack()
avg_price_df

In [None]:
avg_price_df.plot.bar(figsize=(15,5),ylabel='Average Price calculated')

**Insights:** Manhattan offers listings in the premium range, whereas listings in the Bronx are comparatively budget friendly for each room type.

**Question 8: What is the distribution of the room type across the locations?**

In [None]:
plt.figure(figsize=(8,5))
df['room_type'].value_counts().plot(kind='bar',color=['r','b','y'])

**Insights:**

1) Most listings are of type Entire home/apt or Private room; there are only a few Shared rooms.

2) Hosts probably prefer to offer an Entire home/apt or a Private room rather than a Shared room.

In [None]:
plt.figure(figsize=(15,8))
sns.scatterplot(x=df['longitude'],y=df['latitude'], hue=df['room_type'])
plt.show()

**Question 9: Which hosts have the highest number of apartments?**

In [None]:
df['host_name'].value_counts()

In [None]:
df["host_id"].value_counts()

**Michael appears 417 times in the host_name column, but the host_id with the most appearances occurs only 327 times, which clearly indicates that several different hosts share the same name.**



In [None]:
df[df['host_id']==219517861]['host_name'].unique()

In [None]:
df_sonder=df[df['host_name']=='Sonder (NYC)']
df_sonder[['host_name','neighbourhood_group','neighbourhood','latitude','longitude']].head(6)

**Therefore, Sonder (NYC) has the highest number of apartments, with several of them in the same buildings, spread across different neighbourhoods.**
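
Because different hosts can share a name, counting listings by host_id is the safer way to rank hosts. The sketch below illustrates the difference; the helper name `top_hosts_by_listings` and the toy rows are hypothetical.

```python
import pandas as pd


def top_hosts_by_listings(frame: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Rank hosts by number of listings, keyed on host_id rather than host_name."""
    counts = (
        frame.groupby("host_id")
        .agg(host_name=("host_name", "first"), listings=("id", "count"))
        .sort_values("listings", ascending=False)
    )
    return counts.head(n)


# Hypothetical toy rows, for illustration only: two different hosts are both named "Michael".
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "host_id": [10, 10, 10, 20, 30, 30],
    "host_name": ["Sonder", "Sonder", "Sonder", "Michael", "Michael", "Michael"],
})

top = top_hosts_by_listings(toy)
print(top)
```

In the toy data, counting by name would put "Michael" level with "Sonder" at three listings each, while counting by host_id shows that no single "Michael" has more than two.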

**Question 10: Which are the top 5 hosts that have obtained the highest number of reviews?**

In [None]:
host_highest_df=df.groupby(['host_id','host_name'],as_index=False)['number_of_reviews'].sum().sort_values(['number_of_reviews'],ascending = False)
host_highest_df.head(5)

**Insights:** By analysing the hosts that contribute the most reviews on the platform, we can frame a strategy to increase the review count for other listings as well, eventually gathering more data to better understand the customer.

# **Conclusion**

Based on our analysis of the Airbnb booking data, we can conclude that Manhattan is the neighborhood group with the largest number of Airbnb listings, albeit with higher prices. Staten Island, on the other hand, has the fewest listings and is an opportunity for potential expansion.

The mix of room types is broadly similar across the neighborhood groups. Some of the highest-priced listings, such as those of the hosts "Kathrine" and "Erin," show low availability, which may be affecting customer choices. It is also worth noting that, in general, affordability plays a significant role in customers' preferences and decision-making.

In terms of room types, entire homes/apartments and private rooms are the most popular, with just a few shared rooms available. Hosts may prefer to offer entire homes/apartments or private rooms, reflecting the customer's preference for privacy and a home-like experience.

Our analysis also reveals an opportunity to increase reviews on some of the less popular listings, which would be useful for gathering more data and improving the customer experience. Overall, it suggests that Airbnb could focus its expansion efforts on Staten Island, while also considering pricing and customer preferences to meet their needs better.

Of all the neighborhood groups in New York City, Manhattan has the highest number of available Airbnb listings. This suggests that it is the most popular area for tourists visiting the city. However, listings here are generally priced higher than in other areas. Despite the high prices, Manhattan listings appear to remain in demand, indicating the desire of visitors to experience what the area has to offer.

Staten Island, on the other hand, has the fewest listings. While it may not be as popular as Manhattan, it presents an opportunity for Airbnb expansion. It may suit travelers looking to avoid the busy city experience and enjoy a quiet, relaxing atmosphere.

An interesting yet concerning observation from the analysis is that some highly priced listings, such as those of the hosts "Kathrine" and "Erin," have low availability. This may lead to a loss in revenue for Airbnb as customers opt for other available options. Hosts should consider pricing their listings in a way that is attractive to potential customers, which would bring them more frequent bookings and increased revenue.

Most customers prefer entire homes/apartments or private rooms to shared rooms. This indicates that hosts may consider offering this type of room accommodation more to satisfy the customer's preferences. Airbnb can use this information to encourage more hosts to offer entire homes/apartments or private rooms to attract more bookings and meet customer demands.

Finally, the review count on Airbnb listings plays a significant role in attracting customers. This analysis presents an opportunity to enhance the review count for less popular sites. Ultimately, this would lead to increased data, fostering understanding and empathy for the customer, thus increasing overall customer satisfaction.

In conclusion, analyzing data is an integral part of growing any business. Based on the findings from this analysis, Airbnb can develop a framework for future expansion, pricing, and customer satisfaction that aligns with the preferences and demands of the customer base.