<a href="https://colab.research.google.com/github/rishavdg123/Data-analysis/blob/main/Copy_of_Sample_EDA_Submission_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - AirBnB Booking analysis


##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

This project analyzes Airbnb listings in New York City for 2019, using a dataset of approximately 49,000 observations across 16 columns. The dataset covers listing names, host information, geographical coordinates, room types, pricing, and review metrics.
The primary objective is to explore this dataset and surface insights that can inform Airbnb's business decisions, enhance security, improve understanding of customers (guests) and providers (hosts), guide marketing initiatives, and support the rollout of additional services.

The business context is Airbnb's evolution since 2008 into a global platform that has changed how people travel. The data generated by millions of listings is a strategic asset for the company. By examining patterns in pricing, availability, review trends, and geographical distribution, the project aims to produce actionable findings that support Airbnb's growth and operational efficiency.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

The primary business objective of this data analysis project is to leverage the detailed Airbnb listing and review data from NYC 2019 to generate actionable insights that support and optimize key aspects of Airbnb's operations and strategic growth.
More specifically, this encompasses:

* Enhancing Business Decision-Making: Providing data-driven intelligence to inform pricing strategies, identify market trends, optimize inventory management, and make informed choices about resource allocation.
* Deepening Customer and Host Understanding: Gaining insights into user behavior, preferences, and pain points for both guests and hosts, which can lead to improved platform features, better user experience, and increased satisfaction.
* Guiding Marketing Initiatives: Identifying opportunities for targeted marketing campaigns by understanding demand patterns, popular neighborhoods, and specific room types that attract guests.
* Improving Security and Trust: Analyzing review data and listing characteristics to identify potential risks or areas for improvement in platform security and to foster a more trustworthy environment for users.
* Facilitating Service Expansion and Innovation: Pinpointing unmet needs or emerging trends that could lead to the development of new services, features, or partnerships that add value to the Airbnb ecosystem.

In essence, the project aims to translate raw data into strategic knowledge that empowers Airbnb to operate more efficiently, understand its market better, and continue its trajectory of expanding and personalizing travel possibilities globally.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
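# Load Dataset
# Mounting Drive is only needed when the CSV is stored in Google Drive;
# the path below reads from the Colab session storage under /content.
from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('/content/Airbnb NYC 2019 (1).csv')
df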
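
The guidelines above ask for exception handling, so the load step could be wrapped as in the following minimal sketch. The helper name `load_listings` is hypothetical and not part of the original notebook; the path is the one used in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_listings(csv_path):
    """Read the listings CSV, raising a clear error if the file is missing or has no rows."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path}; check the upload or Drive path.")
    frame = pd.read_csv(path)
    if frame.empty:
        raise ValueError(f"{path} was read but contains no rows.")
    return frame


# Usage: report the problem instead of failing with a long traceback
try:
    df = load_listings("/content/Airbnb NYC 2019 (1).csv")
except (FileNotFoundError, ValueError) as err:
    print(f"Could not load dataset: {err}")
```

Failing early with a readable message keeps the rest of the notebook from running on a missing or empty frame.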



### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_values=df.duplicated().sum()
duplicate_values


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(),cbar=False)
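
Raw null counts are easier to compare as a share of all rows. The sketch below shows one way to do that; `missing_summary` is a hypothetical helper and the toy frame holds made-up values, not rows from the dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame with made-up values, only to show the shape of the output
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "price": [150, 80, 200, 60],
    "reviews_per_month": [1.2, np.nan, np.nan, 0.4],
})
print(missing_summary(toy))
```

On the real data the same call would be `missing_summary(df)`.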

### What did you know about your dataset?

Answer Here

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

Answer Here

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
print(df.isnull().sum())  # count missing values per column before cleaning

# Fill missing reviews_per_month with 0 (assign back instead of a chained inplace fillna)
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)

# Make sure price is numeric; non-numeric entries become NaN
df['price'] = pd.to_numeric(df['price'], errors='coerce')
# Fill any missing prices with the mean price
df['price'] = df['price'].fillna(df['price'].mean())

# Drop duplicate rows
df.drop_duplicates(inplace=True)

# Drop rows that still contain nulls in any column.
# Caution: if last_review is null for listings without reviews, those listings are removed too.
df.dropna(inplace=True)
print(df.isnull().sum())  # confirm that no missing values remain


#Treating Outliers and handling them

# We Identify outliers using IQR
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# We Filter out outliers
df_filtered = df[(df['price'] >= lower_bound) & (df['price'] <= upper_bound)].copy()

# Alternatively, we cap the outliers at the IQR bounds instead of dropping them
df['price_capped'] = df['price'].clip(lower=lower_bound, upper=upper_bound)

# Visualizing the data before and after handling outliers
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
sns.boxplot(y=df['price'])
plt.title('Price before outlier handling')

plt.subplot(1, 2, 2)
sns.boxplot(y=df_filtered['price']) # Or use df['price_capped']
plt.title('Price after outlier handling')
plt.tight_layout()
plt.show()
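
To make the IQR rule concrete, here is a small worked sketch on made-up prices. The helper name `iqr_bounds` and the sample values are illustrative assumptions; the fences follow the same Q1 - 1.5 * IQR and Q3 + 1.5 * IQR formulas used in the cell above.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices; 1000 plays the role of an extreme listing
prices = pd.Series([50, 60, 70, 80, 90, 100, 110, 1000], dtype="float64")
lower, upper = iqr_bounds(prices)

filtered = prices[prices.between(lower, upper)]  # drop values outside the fences
capped = prices.clip(lower=lower, upper=upper)   # keep every row, pull values to the fences

print(f"fences: {lower} to {upper}")
print(f"rows kept by filtering: {len(filtered)} of {len(prices)}")
print(f"largest value after capping: {capped.max()}")
```

Filtering shrinks the dataset, while capping keeps every listing but flattens the tail; which one suits a chart depends on whether the extreme listings themselves are of interest.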


### What all manipulations have you done and insights you found?

We cleaned the Airbnb NYC 2019 dataset in the following steps: loading and inspecting the data, filling missing reviews_per_month values with zero, coercing price to a numeric type and filling any gaps with the mean price, removing duplicate rows, dropping rows that still contained nulls, and treating price outliers with the IQR rule (both as a filtered copy, df_filtered, and as a capped column, price_capped). Converting last_review to datetime, standardizing categorical columns, and engineering features such as has_reviews and days_since_last_review are not part of the wrangling cell above; they are natural follow-up steps.
Insights: the NYC Airbnb market is largely concentrated in Manhattan and Brooklyn, with "Entire home/apt" and "Private room" as the dominant room types. Prices vary significantly by location and room type, with Manhattan generally highest. There is a mix of individual hosts and "power hosts" managing multiple listings, and the raw data contains a segment of listings with no reviews.
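
A minimal sketch of how features such as has_reviews and days_since_last_review could be derived. It assumes rows without a last_review are kept rather than dropped, and it uses the latest review date in the data as the reference point, which is an assumption rather than a choice stated in this notebook. The toy rows are made up.

```python
import pandas as pd

# Made-up rows; column names follow the dataset, values are illustrative only
toy = pd.DataFrame({
    "number_of_reviews": [12, 0, 3],
    "last_review": ["2019-06-30", None, "2019-01-15"],
})

# Parse dates; missing or unparseable entries become NaT instead of raising
toy["last_review"] = pd.to_datetime(toy["last_review"], errors="coerce")

# Flag listings that have at least one review
toy["has_reviews"] = toy["number_of_reviews"] > 0

# Measure recency against the latest review in the data (assumed reference date)
reference_date = toy["last_review"].max()
toy["days_since_last_review"] = (reference_date - toy["last_review"]).dt.days

print(toy)
```

Listings without any review keep a missing days_since_last_review, so they stay distinguishable from listings that were reviewed long ago.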

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code


plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='neighbourhood_group')
plt.title('Distribution of Listings by Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Number of Listings')
plt.show()

##### 1. Why did you pick the specific chart?

We chose a count plot (a bar chart of category counts) because it is ideal for comparing counts of categorical data.

##### 2. What is/are the insight(s) found from the chart?

The chart shows which neighbourhood groups have the highest concentration of Airbnb listings; listings are heavily concentrated in Manhattan and Brooklyn.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can inform marketing efforts and resource allocation in popular areas.
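
The bar heights can also be expressed as shares of all listings, which is often easier to quote in a report. This is a sketch on made-up rows; only the column name neighbourhood_group comes from the dataset.

```python
import pandas as pd

# Made-up rows, chosen only so the shares are easy to check by hand
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 3 + ["Queens"] * 2 + ["Bronx"],
})

# normalize=True returns fractions; convert to percentages for readability
share = (
    toy["neighbourhood_group"]
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
    .rename("percent_of_listings")
)
print(share)
```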

#### Chart - 2

In [None]:
# Chart - 2 visualization code

plt.figure(figsize=(8, 8))
df['room_type'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Room Types')
plt.ylabel('') # Remove default y-label for pie chart
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart is effective for visualizing the proportion of each category within a whole.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the most common types of rooms offered on Airbnb in NYC; "Entire home/apt" and "Private room" are the most prevalent.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding room type distribution helps in tailoring marketing messages to different guest segments, optimizing search filters, and identifying supply gaps or excesses for specific room types.

#### Chart - 3

In [None]:
# Chart - 3 visualization code


plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='price', bins=50, kde=True)
plt.title('Distribution of Listing Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is suitable for visualizing the distribution of a continuous variable like price.

##### 2. What is/are the insight(s) found from the chart?

The chart reveals the typical price range of Airbnb listings and highlights any concentration of very low or very high prices.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the price distribution is crucial for setting competitive pricing strategies, identifying pricing trends, and developing pricing recommendations for hosts.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='room_type', y='price')
plt.title('Price Distribution by Room Type')
plt.xlabel('Room Type')
plt.ylabel('Price')
plt.show()


##### 1. Why did you pick the specific chart?

A box plot is excellent for comparing the distribution and variability of a numerical variable (price) across different categories (room type), showing the median, quartiles, and potential outliers.

##### 2. What is/are the insight(s) found from the chart?

This will clearly show how the typical price varies between "Entire home/apt", "Private room", and "Shared room", and the spread of prices within each type.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight directly supports pricing strategies for different room types, helping hosts and Airbnb understand expected price ranges and market value.
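
The quantities a box plot draws (quartiles, median, and spread) can also be tabulated per room type. The sketch below does this on made-up prices; the column names match the dataset, the values do not.

```python
import pandas as pd

# Made-up prices per room type, for illustration only
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "price": [200.0, 150.0, 250.0, 70.0, 90.0, 40.0],
})

# One row per room type, one column per quartile
quartiles = (
    toy.groupby("room_type")["price"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "q1", 0.5: "median", 0.75: "q3"})
)
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]  # box height in the plot
quartiles = quartiles.sort_values("median", ascending=False)

print(quartiles)
```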

#### Chart - 5

In [None]:
# Chart - 5 visualization code


plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='number_of_reviews', y='price', alpha=0.5)
plt.title('Price vs. Number of Reviews')
plt.xlabel('Number of Reviews')
plt.ylabel('Price')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is used to visualize the relationship between two continuous variables.

##### 2. What is/are the insight(s) found from the chart?

This will help you understand if listings with more reviews tend to have higher or lower prices, or if there is no clear relationship.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can inform strategies to encourage reviews, or prompt a closer look at why some highly reviewed listings have lower prices.
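
A scatter plot gives a visual impression; a correlation coefficient puts a number on it. The sketch below computes Pearson and a rank-based (Spearman) correlation on made-up values. Spearman is obtained here as the Pearson correlation of the ranks, which avoids extra dependencies; this is an illustrative choice, not something the notebook itself does.

```python
import pandas as pd

# Made-up pairs of review counts and prices, for illustration only
toy = pd.DataFrame({
    "number_of_reviews": [0, 5, 20, 50, 120],
    "price": [300.0, 180.0, 150.0, 90.0, 95.0],
})

# Pearson measures linear association
pearson = toy["number_of_reviews"].corr(toy["price"])

# Spearman measures monotonic association: Pearson correlation of the ranks
spearman = toy["number_of_reviews"].rank().corr(toy["price"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

With heavily skewed variables such as price and review counts, the rank-based value is usually less sensitive to a few extreme listings.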

#### Chart - 6

In [None]:
# Chart - 6 visualization code


plt.figure(figsize=(12, 6))
top_neighbourhoods = df['neighbourhood'].value_counts().nlargest(10)
sns.barplot(x=top_neighbourhoods.index, y=top_neighbourhoods.values)
plt.title('Top 10 Neighbourhoods by Number of Listings')
plt.xlabel('Neighbourhood')
plt.ylabel('Number of Listings')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is effective for displaying the ranking of categorical data.

##### 2. What is/are the insight(s) found from the chart?

This will pinpoint the most popular neighborhoods for Airbnb listings in NYC.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This information is valuable for targeted marketing towards hosts and guests in high-density areas, understanding market saturation, and identifying areas with potential for expansion.

#### Chart - 7

In [None]:
# Chart - 7 visualization code


plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='availability_365', bins=50, kde=True)
plt.title('Distribution of Listing Availability (Days per Year)')
plt.xlabel('Availability (Days)')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

A histogram is useful for visualizing the distribution of a numerical variable, allowing you to see how many listings are available for different numbers of days throughout the year.



##### 2. What is/are the insight(s) found from the chart?

This will show if listings are typically available for short periods, long periods, or spread across the year.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding availability patterns can help in optimizing booking algorithms, identifying opportunities for encouraging hosts to increase availability, and predicting seasonal trends.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

plt.figure(figsize=(12, 7))
# errorbar=None replaces the deprecated ci=None (requires seaborn 0.12 or newer)
sns.barplot(data=df, x='neighbourhood_group', y='price', hue='room_type', errorbar=None)
plt.title('Average Price by Neighbourhood Group and Room Type')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Average Price')
plt.show()

##### 1. Why did you pick the specific chart?

A grouped bar chart is effective for comparing the average of a numerical variable across two categorical variables.

##### 2. What is/are the insight(s) found from the chart?

The chart shows how the average price varies not only by neighbourhood group but also by room type within each neighbourhood group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight is valuable for pricing recommendations, helping hosts set competitive prices based on their location and room type.
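
The averages behind a grouped bar chart can be read off a pivot table, and a matching count table shows how many listings each average rests on. This is a sketch on made-up rows; the column names follow the dataset.

```python
import pandas as pd

# Made-up listings, for illustration only
toy = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [250.0, 210.0, 110.0, 170.0, 70.0, 80.0],
})

# Mean price for every (neighbourhood group, room type) combination
avg_price = toy.pivot_table(
    index="neighbourhood_group", columns="room_type", values="price", aggfunc="mean"
).round(2)

# Number of listings behind each mean; small cells make the mean less reliable
listing_counts = pd.crosstab(toy["neighbourhood_group"], toy["room_type"])

print(avg_price)
print(listing_counts)
```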

#### Chart - 9

In [None]:
# Chart - 9 visualization code


plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='minimum_nights', y='price', alpha=0.5)
plt.title('Price vs. Minimum Nights')
plt.xlabel('Minimum Nights')
plt.ylabel('Price')
plt.xlim(0, 30) # Limit x-axis for better visibility, adjust as needed
plt.show()


##### 1. Why did you pick the specific chart?

A scatter plot is useful for examining the relationship between two numerical variables.

##### 2. What is/are the insight(s) found from the chart?

This will show you if listings with a higher minimum night requirement tend to have higher or lower prices, or if there's no clear pattern.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight can inform pricing strategies for hosts who set minimum night stays. It can also help Airbnb understand how minimum night requirements might affect booking patterns and guest preferences, potentially leading to features or recommendations related to stay duration.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
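# Correlation Heatmap visualization code


plt.figure(figsize=(12, 8))
# Identifier columns carry no meaningful correlation. 'id' and 'host_id' are assumed
# column names; errors='ignore' makes the drop a no-op if they are absent.
numerical_df = df.select_dtypes(include=np.number).drop(columns=['id', 'host_id'], errors='ignore')
corr_matrix = numerical_df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap of Numerical Variables')
plt.show()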

##### 1. Why did you pick the specific chart?

A heatmap is an effective way to visualize the correlation matrix, allowing you to quickly identify the strength and direction of linear relationships between multiple numerical variables. The color intensity and annotations make it easy to interpret.

##### 2. What is/are the insight(s) found from the chart?

This chart shows how the numerical variables (such as price, minimum_nights, and number_of_reviews) are correlated with each other.
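
Reading a large heatmap cell by cell is error-prone, so it can help to list the strongest pairs directly. The helper `strongest_pairs` below is a hypothetical sketch, and the toy frame contains made-up values in which reviews_per_month is deliberately a scaled copy of number_of_reviews.

```python
import numpy as np
import pandas as pd


def strongest_pairs(frame, top=5):
    """List variable pairs ordered by absolute Pearson correlation, strongest first."""
    corr = frame.corr(numeric_only=True)
    # Keep the upper triangle only, so each pair appears once and the diagonal is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.loc[order].head(top)


# Made-up values, for illustration only
toy = pd.DataFrame({
    "number_of_reviews": [1, 2, 3, 4, 5],
    "reviews_per_month": [0.1, 0.2, 0.3, 0.4, 0.5],
    "price": [300.0, 180.0, 150.0, 90.0, 95.0],
})
top_pairs = strongest_pairs(toy)
print(top_pairs)
```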

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code



numerical_cols_subset = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'availability_365']
sns.pairplot(df[numerical_cols_subset])
plt.suptitle('Pair Plot of Selected Numerical Variables', y=1.02) # Add a title for the whole plot
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is excellent for exploring the relationships between multiple numerical variables simultaneously. It provides a matrix of scatter plots for every pair, making it easy to visually detect patterns, clusters, or correlations.

##### 2. What is/are the insight(s) found from the chart?

Pair plots help in understanding the underlying structure of the data and identifying potential relationships that might not be obvious from simple summary statistics or individual plots. This can inform further analysis, feature engineering for machine learning models, or identification of segments within the data.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the exploratory data analysis of the Airbnb NYC 2019 dataset, here are some suggestions to help the client achieve their business objectives:

1. Optimize Pricing Strategies:

   * Develop dynamic pricing tools for hosts that consider location, room type, availability, and demand patterns in specific neighborhoods.
   * Provide hosts with data-driven recommendations on optimal pricing based on comparable listings in their area.
   * For "Entire home/apt" and "Private room" types, which tend to have higher prices, offer guidance on maximizing occupancy through competitive pricing.
   * For "Shared room" types, explore strategies to increase bookings through competitive pricing or highlighting unique selling points.

2. Enhance Customer and Host Understanding:

   * Implement features that encourage guests to leave reviews, such as reminders or incentives.
   * Analyze the sentiment of reviews (requires further text analysis) to identify common guest pain points and areas for improvement in listing quality or host services.
   * Develop a host dashboard that provides insights into their listing's performance, including review trends, booking rates, and pricing suggestions.
   * Identify listings with no reviews and offer support or resources to hosts to help them get their first bookings and build trust.

3. Guide Marketing Initiatives:

   * Target marketing campaigns towards popular neighborhood groups (e.g., Manhattan and Brooklyn) to attract more guests.
   * Identify less saturated neighborhood groups with potential for growth and launch marketing initiatives to encourage hosts to list properties in those areas.
   * Tailor marketing messages to specific room types and their corresponding target audiences (e.g., highlight the benefits of "Entire home/apt" for families or groups).
   * Leverage insights from the relationship between number of reviews and price to showcase highly reviewed listings in marketing materials.

4. Improve Security and Trust:

   * Conduct sentiment analysis on review text to identify potential red flags or recurring issues related to safety or host behavior.
   * Develop an automated system to flag listings with a high number of negative reviews or suspicious patterns.
   * Utilize data on calculated_host_listings_count to identify "power hosts" and implement specific verification processes or support structures for them.
   * Consider incorporating additional data sources (if available) like user verification status to build a more comprehensive trust score for hosts and guests.

5. Facilitate Service Expansion and Innovation:

   * Analyze minimum_nights requirements and their impact on bookings to understand guest preferences for stay duration. This could inform the development of features that allow guests to filter by flexible minimum stays or highlight listings suitable for different trip lengths.
   * Explore the potential for offering ancillary services (e.g., cleaning services, photography services for listings) based on the distribution of listings and host needs in different areas.
   * Identify potential market segments or niches based on unique listing characteristics or locations that are currently underserved.

By implementing these data-driven strategies, the client can make more informed decisions, improve user experience, optimize operations, and ultimately achieve their business objectives in the competitive short-term rental market.
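
As an illustration of the "power host" suggestion, the sketch below flags listings whose host manages several properties, using calculated_host_listings_count. The threshold `POWER_HOST_MIN_LISTINGS` is a hypothetical cut-off chosen for the example, not a value derived from this analysis, and the toy counts are made up.

```python
import pandas as pd

POWER_HOST_MIN_LISTINGS = 3  # hypothetical cut-off, not taken from the analysis

# Made-up per-listing host counts, for illustration only
toy = pd.DataFrame({
    "calculated_host_listings_count": [1, 1, 2, 5, 5, 5, 5, 5, 1, 2],
})

# A listing counts as power-host inventory when its host runs at least the threshold
toy["is_power_host_listing"] = (
    toy["calculated_host_listings_count"] >= POWER_HOST_MIN_LISTINGS
)

# Share of all listings that belong to power hosts
power_share = toy["is_power_host_listing"].mean() * 100
print(f"{power_share:.1f}% of listings belong to hosts with "
      f"{POWER_HOST_MIN_LISTINGS}+ listings")
```

Varying the threshold shows how sensitive the "power host" segment is to its definition, which matters before building verification rules on top of it.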

# **Conclusion**

The exploratory data analysis of the Airbnb NYC 2019 dataset has provided valuable insights into the characteristics of listings, host behavior, pricing dynamics, and guest engagement within the New York City market. Key findings from the analysis include:

* Geographic Concentration: Airbnb listings are heavily concentrated in Manhattan and Brooklyn, indicating these are the primary areas of operation and demand.
* Room Type Dominance: "Entire home/apt" and "Private room" are the most prevalent room types, suggesting a focus on individual or small group accommodations.
* Price Variability: Listing prices vary significantly based on neighborhood group and room type, with Manhattan generally having the highest average prices.
* Review Patterns: A considerable number of listings have a low count of reviews, and some have none, indicating varying levels of booking activity or guest engagement. The reviews_per_month distribution further supports this, showing many listings with infrequent reviews.
* Host Activity: The calculated_host_listings_count reveals a mix of individual hosts and those managing multiple properties.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***