<a href="https://colab.research.google.com/github/sdshastri/Airbnb-EDA/blob/main/Airbnb_EDA_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name**    - Shubham Dilip Shastri

# **Project Summary -**

The Exploratory Data Analysis (EDA) of Airbnb booking analysis is a critical component of the overall project aimed at understanding the various factors that influence customer booking decisions on the Airbnb platform. The EDA will provide insights into the patterns and trends in the booking data and identify any underlying relationships between variables that may impact customer behavior.

The EDA will begin with data collection from the Airbnb platform, including information on the listings, their availability, prices, and customer reviews. The data will then be cleaned to remove any missing, inconsistent, or irrelevant information. The next phase of the EDA will involve data analysis to uncover patterns and trends in customer behavior, such as the most popular listing types, the most sought-after amenities, and the factors that impact booking rates.

Data visualization will be used to effectively communicate the insights and findings of the data analysis. A combination of graphs, charts, and maps will be used to showcase the trends and patterns in the data. This will help to identify areas of opportunity and highlight areas that require improvement.

The final phase of the EDA will involve summarizing the findings and providing recommendations on how the insights can be leveraged to improve the customer experience and increase booking rates. This will include recommendations on optimizing pricing strategies, enhancing the listing descriptions, and improving the overall customer experience. The findings and recommendations will be presented in a comprehensive report, detailing the methodology, findings, and recommendations.

In conclusion, the EDA of Airbnb booking analysis will provide valuable insights into the patterns and relationships in the booking data and inform the development of predictive models that can be used to optimize the platform. The EDA will play a crucial role in ensuring the success of the overall Airbnb booking analysis project by providing a foundation for further analysis and optimization. The insights and recommendations generated by the EDA will help to drive growth and success for the Airbnb platform and ensure its continued evolution and adaptation to changing customer needs and preferences.

# **GitHub Link -**

Name : Shubham Dilip Shastri


Link : https://github.com/sdshastri/Airbnb-EDA

# **Problem Statement**


To explore, analyse and visualize the dataset and find important insights such as:

1- Which area has the highest number of bookings

2- Who the busiest hosts are

3- Which neighbourhood group has the highest cost

4- The types of room and the number of listings of each




#### **Define Your Business Objective?**

The business objective of Airbnb is to provide quality service at an affordable cost, so that customers have no need to look for another rental platform. Alongside this, Airbnb seeks growth by expanding its network to increase revenue.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that for each chart the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Load Dataset
dataset = pd.read_csv("/content/drive/MyDrive/AIRBNB/Copy of AIRBNB DATASET.csv")
# Work on a copy so that the raw data in `dataset` stays unchanged
df = dataset.copy()
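The guidelines above ask for exception handling so that the notebook runs in one go. A minimal sketch of a defensive loader follows. The function name `load_airbnb_csv` and the required-column check are illustrative assumptions and are not part of the original notebook; the demonstration uses a tiny temporary file instead of the real dataset.

```python
import os
import tempfile
import pandas as pd

def load_airbnb_csv(path, required_columns=("price", "room_type")):
    """Read a CSV and fail early with a clear message (hypothetical helper)."""
    if not os.path.exists(path):
        raise FileNotFoundError(f"Dataset not found at: {path}")
    frame = pd.read_csv(path)
    missing = [col for col in required_columns if col not in frame.columns]
    if missing:
        raise ValueError(f"Dataset is missing expected columns: {missing}")
    return frame

# Demonstration on a made-up two-row file, not the Airbnb data.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    pd.DataFrame(
        {"price": [100, 150], "room_type": ["Private room", "Shared room"]}
    ).to_csv(demo_path, index=False)
    demo_df = load_airbnb_csv(demo_path)

print(demo_df.shape)
```

Failing early with a readable message makes a wrong Drive path easier to diagnose than a traceback deeper in the notebook.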

### Dataset First View

In [None]:
# Dataset First Look
df.head(10)


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

In [None]:
# Check the data types of each attribute
df.dtypes

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
df.duplicated().value_counts()


There are no duplicate rows in the dataset.

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

df.isnull()


In [None]:
df.isnull().sum()

In [None]:
# Visualizing the missing values

plt.figure(figsize=(15, 25))
sns.heatmap(df.isnull(), cmap='viridis')
plt.show()

In [None]:
# Visualization of missing values using missingno module
import missingno as mnso
mnso.matrix(df)

### What did you know about your dataset?

The given dataset is about the Airbnb business.

*   There are 48895 entries (rows) in the dataset.
*   The dataset contains 16 columns.

*   The 16 columns hold 3 different data types: int64, float64 and object.
*   Only 4 columns have missing values. Of these, last_review and reviews_per_month have the most, with 10052 missing values each.
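The figures in this summary can be collected in one place. The sketch below uses a small made-up frame (the column values are assumptions, not the real data) to show how row count, column count, duplicate rows and per-column missing values might be computed together.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Loft"],
    "price": [120, 80, 200, 120],
    "reviews_per_month": [0.5, np.nan, np.nan, 0.5],
})

n_rows, n_cols = toy.shape
n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
missing_counts = toy.isnull().sum()          # missing values per column
missing_share = (missing_counts / n_rows * 100).round(1)

print(n_rows, n_cols, n_duplicates)
print(missing_share[missing_counts > 0])     # only columns with gaps
```

Reporting the missing share as a percentage alongside the raw count makes it easier to judge whether a column should be filled or dropped.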







## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

The variables in the Airbnb dataset are as follows:

- *id*: A unique identifier for each listing.
- *name*: The name of the listing.
- *host_id*: A unique identifier for the host of the listing.
- *host_name*: The name of the host of the listing.
- *neighbourhood_group*: The neighbourhood group that the listing is in.
- *neighbourhood*: The neighbourhood that the listing is in.
- *latitude*: The latitude of the listing.
- *longitude*: The longitude of the listing.
- *room_type*: The type of room that is being listed.
- *price*: The price per night for the listing.
- *minimum_nights*: The minimum number of nights that a guest can book for the listing.
- *number_of_reviews*: The number of reviews that the listing has received.
- *last_review*: The date of the last review that the listing received.
- *reviews_per_month*: The number of reviews that the listing receives per month on average.
- *calculated_host_listings_count*: The number of listings that the host has on Airbnb.
- *availability_365*: The number of days that the listing is available for booking in a year.


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = df.nunique()
unique_values

## 3. ***Data Wrangling***

### Data Wrangling Code

###  Handling of Missing Value


In [None]:
# Write your code to make your dataset analysis ready.
# Replacing missing values in the reviews_per_month column
df.fillna({'reviews_per_month':0}, inplace = True)

# Replacing missing values in the name column
df.fillna({'name':'Not Mentioned'}, inplace =True)



In [None]:
# Removing the attributes which are not required for the analysis
df.drop(['id','host_name','last_review'], axis = 1, inplace = True)
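Because the cells above modify `df` in place, running the drop cell a second time raises an error once the columns are gone. A hedged alternative is to wrap the same steps in a function that returns a new frame. The helper name `clean_listings` and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def clean_listings(raw):
    """Return a cleaned copy; the input frame is left untouched."""
    cleaned = raw.fillna({"reviews_per_month": 0, "name": "Not Mentioned"})
    # errors="ignore" keeps the call safe if a column was already dropped.
    return cleaned.drop(columns=["id", "host_name", "last_review"], errors="ignore")

# Made-up rows with the same column names as the dataset.
toy = pd.DataFrame({
    "id": [1, 2],
    "name": ["Loft", None],
    "host_name": ["A", "B"],
    "last_review": ["2019-01-01", None],
    "reviews_per_month": [0.3, np.nan],
})
toy_clean = clean_listings(toy)
print(toy_clean)
```

With this shape the cleaning step can be re-run safely, which helps keep the notebook executable from top to bottom.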

In [None]:

df.isna().sum()


In [None]:
df

### What all manipulations have you done and insights you found?


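The manipulations I have done on the dataset:

*   Handled missing values: filled missing reviews_per_month with 0 and missing name with 'Not Mentioned'.
*   Checked for duplicate rows (none were found).
*   Dropped the id, host_name and last_review columns, which are not required for the analysis.

Insights:

*   The median listing price is 106 USD and the maximum is 10,000 USD.
*   On average, listings require a minimum stay of about 7 nights.
*   There are only three types of rental rooms available.
*   Some Airbnb rentals are available 365 days a year.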
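Insights of this kind come from simple summary statistics. The sketch below shows one way they might be computed; the toy values are invented so that the arithmetic is easy to follow and do not come from the real dataset.

```python
import pandas as pd

# Made-up listings; only the column names match the dataset.
toy = pd.DataFrame({
    "price": [50, 90, 106, 150, 10000],
    "minimum_nights": [1, 2, 3, 5, 24],
    "room_type": ["Private room", "Entire home/apt", "Shared room",
                  "Private room", "Entire home/apt"],
    "availability_365": [0, 365, 120, 365, 30],
})

median_price = toy["price"].median()             # the 50% row of describe()
max_price = toy["price"].max()
mean_min_nights = toy["minimum_nights"].mean()   # average minimum stay
room_types = sorted(toy["room_type"].unique())
n_full_year = int((toy["availability_365"] == 365).sum())

print(median_price, max_price, mean_min_nights, room_types, n_full_year)
```

The median is far less sensitive to the single very expensive listing than the mean would be, which is why it is the more useful figure for price.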








## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Distribution curve for price

In [None]:
# Chart - 1 visualization code

sns.kdeplot(data = df['price'])
plt.xlim([0,2000])
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.grid(True, alpha=0.5, linestyle="--")
plt.show()

##### 1. Why did you pick the specific chart?

This chart shows which prices occur most frequently in the data.
A Kernel Density Estimate (KDE) plot estimates the probability density function of a continuous variable from the data without assuming a parametric form. It can be drawn in one or more dimensions, and several samples can be plotted on a single graph, which makes the visualization more efficient.
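To make the idea of a kernel density estimate concrete, here is a minimal NumPy sketch that averages Gaussian bumps centred on each sample. The function name `gaussian_kde_1d`, the toy prices and the fixed bandwidth of 30 are all assumptions for illustration; seaborn's `kdeplot` chooses its bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample (1-D KDE sketch)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z has shape (len(grid), len(samples)): scaled distance to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernel.mean(axis=1) / bandwidth

toy_prices = [60, 80, 100, 120, 150, 400]   # made-up prices
grid = np.linspace(-200, 700, 1801)         # evaluation points
density = gaussian_kde_1d(toy_prices, grid, bandwidth=30.0)

# A density should integrate to roughly 1 over a wide enough grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
peak_price = float(grid[density.argmax()])
print(round(area, 3), peak_price)
```

A smaller bandwidth gives a spikier curve and a larger one a smoother curve, so the visible peak of a KDE plot depends partly on that choice.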

##### 2. What is/are the insight(s) found from the chart?

Most property rental prices are between 0 and 250 USD.

#### Chart - 2 :  Distribution of different neighbourhood groups across the city

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(20,15))
sns.scatterplot(x=df.longitude, y=df.latitude, hue=df.neighbourhood_group)
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot shows the distribution of latitude and longitude for the different neighbourhood groups. The hue parameter colour-codes the neighbourhood groups. This helps us understand how the neighbourhood groups are distributed across the city.

##### 2. What is/are the insight(s) found from the chart?

Manhattan is more densely covered with listings than the other neighbourhood groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, insights gained from scatter plots can help businesses make better decisions.
From the above graph we can conclude that listings in Manhattan and Brooklyn are denser than in the other groups, which suggests that most people look for service there. These areas therefore offer good investment opportunities, and we can expand our Airbnb network in them.

Staten Island is the least dense area, so investing there is unlikely to be worthwhile.

#### Chart - 3 :  Distribution of different types of room in neighbourhood groups

In [None]:
# Chart 3: visualization code
f,ax = plt.subplots(figsize=(10,6))
ax = sns.scatterplot(y=df.latitude, x=df.longitude,hue=df.room_type , palette='bright')
ax.set_title("Distribution of different types of room in neighbourhood groups")
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot of latitude and longitude shows where the different types of rooms are located across the neighbourhood groups. The hue parameter is used to differentiate between room types.

##### 2. What is/are the insight(s) found from the chart?

From the latitude and longitude visualization, almost every type of room is present in every area, but shared rooms are very few.

#### Chart - 4 :  Different types of room with Number

In [None]:
# Chart 4: visualization code
df['room_type'].value_counts().plot(kind='bar',color=['r','b','y'])
plt.show()


##### 1. Why did you pick the specific chart?

Bar plots are a good choice for showing the number of listings for each room type because they allow you to easily compare the number of listings for each room type.

##### 2. What is/are the insight(s) found from the chart?

The most common type of room is the entire home/apt, followed by private rooms; shared rooms are the least used by customers.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.



Yes, the insights can help create a positive business impact. By knowing the most common type of room in the dataset, we can make decisions about what type of room to rent out or to look for when booking a place to stay. This can help us increase our revenue and improve our customer satisfaction.




#### Chart - 5 : Percentage of listings  in different locations

In [None]:
# Chart - 5: visualization code
#Visualise number of listings  in different locations with help of pie chart.
plt.figure(figsize=(13,7))
plt.pie(df.neighbourhood_group.value_counts(), labels=df.neighbourhood_group.value_counts().index,autopct='%1.1f%%', startangle=180)
plt.show()

##### 1. Why did you pick the specific chart?





A pie chart shows the share of listings in each neighbourhood group as a percentage.

##### 2. What is/are the insight(s) found from the chart?

From the above pie chart, the largest share of listings in New York is in Manhattan (44.3% of total listings).
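The percentages drawn on the pie chart can also be computed directly, which makes them easy to quote. The sketch below uses `value_counts(normalize=True)` on a made-up series; the counts are assumptions chosen for round numbers and do not reflect the real dataset.

```python
import pandas as pd

# Ten made-up listings spread over four neighbourhood groups.
toy_groups = pd.Series(
    ["Manhattan"] * 4 + ["Brooklyn"] * 4 + ["Queens"] + ["Bronx"],
    name="neighbourhood_group",
)

# Share of listings per group, in percent.
share = (toy_groups.value_counts(normalize=True) * 100).round(1)

# Combined share of the two largest groups.
top_two_share = share.nlargest(2).sum()

print(share)
print(top_two_share)
```

The same two lines applied to `df.neighbourhood_group` would give the exact figures behind the chart labels.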

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact. Knowing which neighbourhood group is most used by customers, we can try to expand our listings in that group, which will help increase our revenue and improve customer satisfaction.




#### Chart - 6 : Number of room types in each Neighborhood Group

In [None]:
# Chart - 6: visualization code
plt.figure(figsize=(13,7))
ax = sns.countplot(data=df,x='neighbourhood_group' ,hue = 'room_type' , palette="muted")

# Label the bars of every room type (one container per hue level).
for container in ax.containers:
    ax.bar_label(container)
plt.show()

##### 1. Why did you pick the specific chart?

I am using countplot to plot the number of listings of each room type in each neighbourhood group. The x parameter specifies the column for the x-axis, the hue parameter specifies the column used for colour encoding, and the palette parameter specifies the colour palette.

##### 2. What is/are the insight(s) found from the chart?

The graph shows that entire homes/apartments are listed most in Manhattan, that in Brooklyn the numbers of private rooms and entire homes/apartments are nearly equal, and that very few rooms are available in Staten Island.

#### Chart - 7 : Average Price for Neighbourhood group





In [None]:
# Chart - 7 :visualization code
# Let us visualise price_vs_location .
# let us find relation between location i.e.neighbourhood_group and price.
price_vs_location = df.groupby(['neighbourhood_group'])['price'].mean()
price_vs_location


In [None]:
ax= price_vs_location.plot.bar(figsize=(10,5),fontsize=14,color=['r','b','y','c','g'])
ax.set_title(' Average Price for Neighbourhood group',fontsize=20)
ax.set_xlabel('Neighbourhood group',fontsize=15)
ax.set_ylabel('Price',fontsize=15)
ax.bar_label(ax.containers[0])

##### 1. Why did you pick the specific chart?

A bar chart presents categorical data as rectangular bars. Here each bar is a neighbourhood group, and its height is proportional to the average price of that group.


##### 2. What is/are the insight(s) found from the chart?

From the above plot, Manhattan is the most expensive location, at around 200 USD on average, and Bronx is the least expensive, at around 88 USD on average in the given dataset.
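The comparison between the most and least expensive groups can be read off the grouped means directly. The following sketch uses invented prices (an assumption, chosen so the averages are easy to verify by hand) to show the pattern.

```python
import pandas as pd

# Made-up prices for three neighbourhood groups.
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Bronx", "Bronx", "Queens"],
    "price": [250, 150, 80, 100, 110],
})

# Average price per group, most expensive first.
avg_price = (
    toy.groupby("neighbourhood_group")["price"]
    .mean()
    .sort_values(ascending=False)
)

most_expensive = avg_price.idxmax()
least_expensive = avg_price.idxmin()
ratio = avg_price.max() / avg_price.min()   # how many times more expensive

print(avg_price)
print(most_expensive, least_expensive, round(ratio, 2))
```

Sorting the means before plotting also makes the bar chart easier to read, since the bars then appear in price order.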


#### Chart - 8 : Top 10 busy hosts

In [None]:
# Chart - 8 : visualization code
# Let us find the busiest host using host_id and minimum nights column in our dataset.
Busy_host=df.groupby(['host_id']).minimum_nights.mean()
Busy_host=Busy_host.sort_values(ascending=True)
Busy_host

In [None]:
# Let us find top 10 busy hosts.
Top_busy_hosts=Busy_host.tail(10)
Top_busy_hosts

In [None]:
# Let us visualise the top 10 busy hosts using a bar plot.
ax = Top_busy_hosts.plot(kind='bar', figsize=(10,5))
# Use the Axes setters; assigning to plt.title/plt.xlabel would overwrite the pyplot functions.
ax.set_title('Top_busy_host')
ax.set_ylabel('minimum_nights')
ax.set_xlabel('host_id')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart presents categorical data as rectangular bars. Here each bar is a host_id, and its height is proportional to that host's mean minimum_nights.




##### 2. What is/are the insight(s) found from the chart?

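From the above bar plot, the host with host_id 17550546 is the busiest host in the given dataset by this measure: the mean minimum_nights of the listings belonging to this host is above 1200. Note that minimum_nights is the minimum stay a listing requires, so it serves here as a proxy for how busy a host is.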
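The sort-then-tail pattern used for this chart can be written more compactly with `nlargest`. The sketch below applies it to made-up hosts (the ids and values are assumptions) and returns the same hosts, ordered from highest to lowest.

```python
import pandas as pd

# Made-up hosts; host 22 has a single listing with a long minimum stay.
toy = pd.DataFrame({
    "host_id": [11, 11, 22, 33, 33, 44],
    "minimum_nights": [2, 4, 30, 1, 1, 7],
})

mean_min_nights = toy.groupby("host_id")["minimum_nights"].mean()

# Same hosts as sort_values().tail(3), but already in descending order.
top_hosts = mean_min_nights.nlargest(3)

print(top_hosts)
```

In this toy data a host with one listing tops the ranking, which shows how a mean over very few listings can dominate the result.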

#### Chart - 9 : Room types in number for neighbourhood group

In [None]:
# Chart - 9: visualization code
# Let us visualise Room types in number for neighbourhood group

g = sns.catplot(x='room_type',kind='count',hue='neighbourhood_group',data=df)
for ax in g.axes.flat:
    # Label every hue group (one container per neighbourhood group), not just the first three.
    for container in ax.containers:
        ax.bar_label(container)
plt.show()

##### 1. Why did you pick the specific chart?

sns.catplot() is a Seaborn function for drawing categorical plots. It is normally used to show the relationship between a numerical variable and one or more categorical variables; with kind='count' it draws a count plot, used here to show the number of listings of each room type in each neighbourhood group.

##### 2. What is/are the insight(s) found from the chart?

Most entire homes/apartments are located in Manhattan.

The largest share of private rooms is located in Brooklyn.

The chart shows the relationship between neighbourhood group and the availability of each room type.

#### Chart -10 : Count of Review v/s neighbourhood group

In [None]:
# Chart - 10 visualization code
review_50 = dataset[dataset['number_of_reviews']>50]
df2 = review_50['neighbourhood_group'].value_counts()
df2.plot(kind='bar',color=['r','b','g','y','m'])
plt.show()

print(' Count of Review v/s neighbourhood group')
pd.DataFrame(df2)

##### 1. Why did you pick the specific chart?

A bar plot compares, across neighbourhood groups, the number of listings that have more than 50 reviews.

##### 2. What is/are the insight(s) found from the chart?

Location and Review Count

Reviews are one of the most important criteria in online booking these days. They give tourists a lot of insight into a particular place and can sway their decision. A cheap place with bad reviews can put a tourist off booking, while an expensive place with excellent reviews can lead a tourist to spend more than initially planned. So we try to work out how each neighbourhood group is doing with respect to reviews. Since the review data is limited, we extract as much as we can.

As a first criterion, we consider only listings with more than 50 reviews, so that the comparison rests on well-reviewed listings.

According to the above plot, Brooklyn has more such listings than Manhattan, which is an interesting find. Staten Island, which is cheaper, has fewer than the other neighbourhood groups. We cannot investigate the reason further since the data is limited.
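The filter behind this chart is a boolean mask followed by a count per group. A small sketch with invented review counts (an assumption, not the real data) shows that the threshold is strict, so a listing with exactly 50 reviews is excluded.

```python
import pandas as pd

# Made-up listings around the threshold of 50 reviews.
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Manhattan",
                            "Manhattan", "Staten Island"],
    "number_of_reviews": [120, 75, 51, 50, 10],
})

well_reviewed = toy[toy["number_of_reviews"] > 50]   # strictly more than 50
counts = well_reviewed["neighbourhood_group"].value_counts()

print(counts)
```

Groups with no listing above the threshold simply disappear from the result, so a missing bar in the chart means zero rather than missing data.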



In [None]:
#storing all the different neighbourhood groups in different data frames
brooklyn_group_df=df.loc[df['neighbourhood_group']== 'Brooklyn']
manhattan_group_df=df.loc[df['neighbourhood_group']== 'Manhattan']
Queens_group_df=df.loc[df['neighbourhood_group']== 'Queens']
Staten_Island_group_df=df.loc[df['neighbourhood_group']== 'Staten Island']
Bronx_group_df=df.loc[df['neighbourhood_group']== 'Bronx']
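The five per-group frames could also be built in one step with `groupby`, which avoids repeating the filter for every group. This is only a sketch: the dictionary `frames_by_group`, the helper `top_neighbourhoods` and the toy rows are hypothetical names and data introduced for illustration.

```python
import pandas as pd

# Made-up rows with the same two columns as the dataset.
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Brooklyn", "Queens", "Queens"],
    "neighbourhood": ["Williamsburg", "Williamsburg", "Bushwick",
                      "Astoria", "Flushing"],
})

# One sub-frame per neighbourhood group, keyed by the group name.
frames_by_group = {name: frame for name, frame in toy.groupby("neighbourhood_group")}

def top_neighbourhoods(frame, n=10):
    """Listing counts for the n most common neighbourhoods (hypothetical helper)."""
    return frame["neighbourhood"].value_counts().head(n)

brooklyn_top = top_neighbourhoods(frames_by_group["Brooklyn"], n=2)
print(brooklyn_top)
```

The index of the returned series can be passed as the `order` argument of `sns.countplot`, which is what the chart cells below do by hand.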

#### Chart -11 : Top 10 areas in Brooklyn with most booking





In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
sns.countplot( data=brooklyn_group_df, x="neighbourhood",
              order=brooklyn_group_df.neighbourhood.value_counts().iloc[:10].index).set_title('top 10 brooklyn neighbourhood value count')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45,horizontalalignment='right' )
plt.show()

##### 1. Why did you pick the specific chart?

To show the top 10 areas in Brooklyn by number of listings, which this analysis uses as a proxy for bookings.

##### 2. What is/are the insight(s) found from the chart?

Williamsburg, Bedford-Stuyvesant and Bushwick have the most listings in the Brooklyn neighbourhood group.

#### Chart -12 : Top  10 areas in Manhattan with most booking

In [None]:

# top 10 areas in manhattan with most booking
#creating a count plot
sns.countplot(y="neighbourhood", data=manhattan_group_df,
              order=manhattan_group_df.neighbourhood.value_counts().iloc[:10].index).set_title('top 10 manhattan neighbourhood value count')


##### 1. Why did you pick the specific chart?

To show the top 10 areas in Manhattan by number of listings.

##### 2. What is/are the insight(s) found from the chart?

Harlem, Upper West Side and Hell's Kitchen have the most listings in the Manhattan neighbourhood group.

#### Chart -13 : Top 10 areas in Queens with most booking

In [None]:
# top 10 areas in Queens with most booking
#creating a count plot

sns.countplot(y="neighbourhood", data=Queens_group_df, palette="rainbow",
              order=Queens_group_df.neighbourhood.value_counts().iloc[:10].index).set_title('top 10 Queens neighbourhood value count')


##### 1. Why did you pick the specific chart?

To show the top 10 areas in Queens by number of listings.

##### 2. What is/are the insight(s) found from the chart?

Astoria, Long Island City and Flushing have the most listings in the Queens neighbourhood group.

#### Chart -14 : Top 10 areas in Staten Island group with most booking

In [None]:

# top 10 areas in Staten_Island with most booking
#creating a count plot
sns.countplot(y="neighbourhood", data=Staten_Island_group_df, palette="rocket_r",
              order=Staten_Island_group_df.neighbourhood.value_counts().iloc[:10].index).set_title('top 10 Staten_Island neighbourhood value count')



##### 1. Why did you pick the specific chart?

To show the top 10 areas in the Staten Island group by number of listings.

##### 2. What is/are the insight(s) found from the chart?

St. George, Tompkinsville and Concord have the most listings in the Staten Island neighbourhood group.

#### Chart -15 : Top 10 areas in Bronx neighbourhood group with most booking

In [None]:
# top 10 areas in Bronx with most booking
#creating a count plot
sns.countplot(y="neighbourhood", data=Bronx_group_df, palette="magma",
              order=Bronx_group_df.neighbourhood.value_counts().iloc[:10].index).set_title('top 10 Bronx neighbourhood value count')



##### 1. Why did you pick the specific chart?

To show a count plot of the top 10 Bronx neighbourhoods by number of listings.

##### 2. What is/are the insight(s) found from the chart?

In the Bronx neighbourhood group, the top neighbourhoods have almost equal numbers of listings, with very small differences in count.

#### Chart -16 : Availability of room

In [None]:

#considering rows above 0 for availability of room
df2 = df.loc[df["availability_365"] > 0 ]

#box plot for availability room
plt.style.use('classic')
plt.figure(figsize=(13,7))

sns.boxplot(data=df2, x='neighbourhood_group',y='availability_365',palette="dark")
plt.show()

##### 1. Why did you pick the specific chart?

This boxplot shows the distribution of room availability (availability_365) for each neighbourhood group in the dataset.



##### 2. What is/are the insight(s) found from the chart?

The median availability (the centre line of each box) is highest for Staten Island, at above 250 days. Manhattan, Queens and Bronx are similar, at above 170 days, and Brooklyn has the lowest.
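A boxplot draws the median rather than the mean, and the two can differ noticeably for skewed data. The sketch below computes both per group on invented availability values (an assumption for illustration) after applying the same "greater than 0" filter as the chart.

```python
import pandas as pd

# Made-up availability values for two neighbourhood groups.
toy = pd.DataFrame({
    "neighbourhood_group": ["Staten Island"] * 3 + ["Brooklyn"] * 3,
    "availability_365": [200, 260, 365, 10, 40, 340],
})

# Same filter as the chart: keep listings with at least one available day.
available = toy[toy["availability_365"] > 0]

summary = (
    available.groupby("neighbourhood_group")["availability_365"]
    .agg(["median", "mean"])
)
print(summary)
```

In the toy Brooklyn group a single highly available listing pulls the mean well above the median, which is why the two figures should not be used interchangeably.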

#### Chart - 17 : Correlation Heatmap




In [None]:
# Correlation Heatmap visualization code
# Selecting only numerical features for correlation analysis
numerical_df = df.select_dtypes(include=np.number)

corr = numerical_df.corr(method='kendall')
plt.figure(figsize=(10,8))
ax = sns.heatmap(corr,annot=True)
# Set the title on the Axes; assigning to plt.title would overwrite the pyplot function.
ax.set_title('correlation between location, price,reviews\n')
plt.show()



##### 1. Why did you pick the specific chart?

Heatmaps visualize data in a 2D format. Here the heatmap shows the correlation between variables.


##### 2. What is/are the insight(s) found from the chart?

From the above correlation plot, there is no strong correlation between any pair of variables; calculated_host_listings_count and availability_365 are only weakly correlated.
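The heatmap uses Kendall's rank correlation, which compares the ordering of pairs of observations rather than their values. The following pure-Python sketch computes the simplest form (tau-a, valid when there are no ties) on made-up numbers; the function name and data are assumptions, and the pandas implementation additionally corrects for ties.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau for paired data without ties (tau-a)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1      # both variables move the same way
        elif direction < 0:
            discordant += 1      # the variables move in opposite ways
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Made-up values: one pair out of six is out of order.
toy_listing_counts = [1, 2, 3, 4]
toy_availability = [10, 30, 20, 40]
tau = kendall_tau(toy_listing_counts, toy_availability)
print(round(tau, 3))
```

A value near 0 means the orderings are unrelated, while values near 1 or -1 mean the two variables rank the listings almost identically or in reverse.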

#### Chart - 18 : Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(df)
plt.show()

##### 1. Why did you pick the specific chart?

Pairplot is a function in the Seaborn library that is used to plot pairwise relationships in a dataset.

##### 2. What is/are the insight(s) found from the chart?

From the above plot, no strong relationship between variables was found.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?



*   Most of the properties hosted on Airbnb are priced between 0 and 250 USD, so lowering prices within this range could attract a larger number of customers.



*   Shared rooms are the least used by customers, so Airbnb should promote these rooms by offering discounts.

*   Room availability in the Brooklyn group is low, so the company needs to find hosts who offer more availability.


*   The average price of staying in Manhattan is more than twice that in Bronx, so the company needs to adjust prices there to attract a larger number of customers.

*   Staten Island has the fewest hosts, so the company needs to grow its network there.






# **Conclusion**



*   Customers prefer an entire home or a private room for staying; they avoid booking shared rooms.

*   Brooklyn and Manhattan are the most booked areas.

*   Brooklyn and Manhattan together account for about 85% of the listings in the dataset.

*   The room type 'Entire home' is booked most in Manhattan and least in the Staten Island neighbourhood group.
*   The room type 'Private room' is booked most in Brooklyn and least in the Staten Island neighbourhood group.


*   The room type 'Shared room' is booked most in Manhattan and least in the Staten Island neighbourhood group.

*   The host with host_id 17550546 is the busiest host.

*   The Brooklyn and Manhattan neighbourhood groups have received the most reviews.



*   Median room availability is highest for Staten Island.


*   Bronx is the least expensive and Manhattan the most expensive group for staying, based on average price.





