# **Project Name**    -



##### **Project Type**    - **AirBnb Booking Analysis**
##### **Contribution**    - Individual


# **Project Summary -**

The task is to explore and analyze the Airbnb dataset to extract insights that support several goals: improving user engagement, optimizing pricing strategies, understanding customer preferences, strengthening security, understanding customer and host behavior, and guiding marketing strategies. Together these contribute to the company's success and to a better Airbnb experience for both guests and hosts.

# **GitHub Link -**


https://github.com/verma123mani/Data_analysis_project

# **Problem Statement**


The problem statement outlines a scenario related to Airbnb, a platform that connects guests with hosts offering unique accommodations. The company has been in operation since 2008, and its success relies on the data generated by millions of listings on the platform. The data is considered crucial for various purposes such as security, making informed business decisions, understanding customer and host behavior, evaluating performance, guiding marketing strategies, and implementing additional services.

#### **Define Your Business Objective?**


The objective of this project is to analyze the data provided by Airbnb to uncover valuable insights that can inform strategic decisions and enhance the overall user experience on the platform.


# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
AirbnbDataset = pd.read_csv('/content/drive/MyDrive/mani_data/Airbnb NYC 2019.csv')
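
The guidelines above ask for exception handling, so the loading step can be wrapped in a small helper. The following is a minimal sketch, not part of the original notebook: the function name `load_listings` is hypothetical, and it assumes the CSV path is passed in by the caller.

```python
from pathlib import Path

import pandas as pd


def load_listings(csv_path):
    """Read the listings CSV, failing with a clear message if the file is absent.

    `load_listings` is a hypothetical helper introduced for illustration.
    """
    path = Path(csv_path)
    if not path.is_file():
        # A missing file usually means Drive is not mounted or the path is wrong.
        raise FileNotFoundError(
            f"Dataset not found at '{path}'. Check that the drive is mounted "
            "and that the path is correct."
        )
    return pd.read_csv(path)


# Example call with the path used in this notebook (left commented so the
# sketch does not fail outside Colab):
# AirbnbDataset = load_listings('/content/drive/MyDrive/mani_data/Airbnb NYC 2019.csv')
```

Checking the path before reading keeps the error message specific to the most likely cause, instead of surfacing a long pandas traceback.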

### Dataset First View

In [None]:
# Dataset First Look
AirbnbDataset.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
AirbnbDataset.shape

### Dataset Information

In [None]:
# Dataset Info
# Check data types and null values
print(AirbnbDataset.info())


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# Count the total number of duplicate rows in the entire DataFrame
duplicate_count = AirbnbDataset.duplicated().sum()

# Display the count of duplicate rows
print(f'Total Duplicate Rows: {duplicate_count}')

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Count the total number of missing values in each column
missing_values_count = AirbnbDataset.isnull().sum()

# Display the count of missing values for each column
print(missing_values_count)
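
Raw counts are easier to judge next to the share of rows they represent. The sketch below is an illustration under assumptions: `missing_summary` is a hypothetical helper, and the toy frame stands in for `AirbnbDataset`.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column, worst first.

    Hypothetical helper for illustration; columns without gaps are omitted.
    """
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "reviews_per_month": [0.5, np.nan, np.nan, 1.2],
    "price": [100, 150, 80, 60],
})
print(missing_summary(toy))
```

On the real data the same call would be `missing_summary(AirbnbDataset)`.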

In [None]:
# Visualizing the missing values
missing_values_count = AirbnbDataset.isnull().sum()
plt.figure(figsize=(9, 6))  # Adjust the figure size if necessary
missing_values_count.plot(kind='bar', color='yellow')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.title('Missing Values in Airbnb Dataset')
plt.xticks(rotation=90)  # Rotate x-axis labels for better visibility
plt.tight_layout()  # Adjust layout to prevent clipping labels
plt.show()

### What did you know about your dataset?


The dataset describes Airbnb listings in New York City for 2019, as indicated by the file name.
It has approximately 49,000 observations and 16 columns covering location, pricing, host details, and review activity.
Some columns contain missing values, but the dataset has no duplicate rows.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
AirbnbDataset.columns

In [None]:
# Dataset Describe
AirbnbDataset.describe()

### Variables Description


In [None]:
variable_names = AirbnbDataset.columns.tolist()
data_types = AirbnbDataset.dtypes.tolist()
variable_description = pd.DataFrame({'Variable Name': variable_names, 'Data Type': data_types})
print(variable_description)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Count unique values in all columns
unique_counts = AirbnbDataset.nunique()

print("Number of unique values in each column:")
print(unique_counts)


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Drop rows where all values are missing
AirbnbDataset.dropna(how='all', inplace=True)


In [None]:
AirbnbDataset.info()

In [None]:
# Results are assigned back to the column rather than using inplace=True on a
# column selection, which newer pandas versions warn about.
# Fill missing values in 'host_name' column with 'Unknown'
AirbnbDataset['host_name'] = AirbnbDataset['host_name'].fillna('Unknown')
# Fill missing values in 'name' column with 'Unknown'
AirbnbDataset['name'] = AirbnbDataset['name'].fillna('Unknown')
# Replace missing values with a placeholder datetime value
AirbnbDataset['last_review'] = AirbnbDataset['last_review'].fillna(pd.Timestamp.min)
# Convert 'last_review' column to datetime
AirbnbDataset['last_review'] = pd.to_datetime(AirbnbDataset['last_review'], errors='coerce')
# Impute missing values with median
median_reviews_per_month = AirbnbDataset['reviews_per_month'].median()
AirbnbDataset['reviews_per_month'] = AirbnbDataset['reviews_per_month'].fillna(median_reviews_per_month)
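
An alternative to a placeholder date is to leave unreviewed listings as `NaT` and record the absence in a separate flag. This is a sketch of that option, not the approach used above: `parse_last_review` and the `has_review` column are hypothetical names, and the toy frame stands in for the real data.

```python
import pandas as pd


def parse_last_review(df):
    """Parse 'last_review' to datetime and flag rows that have a review date.

    Hypothetical helper; works on a copy so the input frame is left unchanged.
    """
    out = df.copy()
    # Unparseable or missing entries become NaT instead of a sentinel date.
    out["last_review"] = pd.to_datetime(out["last_review"], errors="coerce")
    # Boolean indicator, usable as a feature in its own right.
    out["has_review"] = out["last_review"].notna()
    return out


# Toy data (values are made up).
toy = pd.DataFrame({"last_review": ["2019-05-21", None, "2018-10-19"]})
parsed = parse_last_review(toy)
print(parsed)
```

With this variant, year and month extracted through `.dt` are missing for unreviewed listings, so they drop out of date-based summaries rather than appearing under an artificial date.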

In [None]:
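# Create new fields (feature engineering)
# 'length_of_stay' is availability minus minimum nights: a proxy for the bookable days beyond the minimum stay
AirbnbDataset['length_of_stay'] = AirbnbDataset['availability_365'] - AirbnbDataset['minimum_nights']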
# Define price ranges and corresponding categories
price_ranges = [(0, 100), (101, 200), (201, 300), (301, float('inf'))]
categories = ['Budget', 'Mid-range', 'High-end', 'Luxury']

# Function to categorize prices
def categorize_price(price):
    for i, (lower, upper) in enumerate(price_ranges):
        if lower <= price <= upper:
            return categories[i]

# Apply categorize_price function to create 'price_category' column
AirbnbDataset['price_category'] = AirbnbDataset['price'].apply(categorize_price)
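
The same binning can be sketched with `pd.cut`, which is vectorized and also assigns non-integer prices such as 100.5 to a category (the loop above returns `None` for them). The names `PRICE_BINS`, `PRICE_LABELS`, and `categorize_prices` are hypothetical, and the bin edges assume the ranges defined above.

```python
import numpy as np
import pandas as pd

# Bin edges matching the ranges above: [0, 100], (100, 200], (200, 300], (300, inf)
PRICE_BINS = [0, 100, 200, 300, np.inf]
PRICE_LABELS = ["Budget", "Mid-range", "High-end", "Luxury"]


def categorize_prices(prices):
    """Map prices to categories; include_lowest=True keeps a price of 0 in 'Budget'."""
    return pd.cut(prices, bins=PRICE_BINS, labels=PRICE_LABELS, include_lowest=True)


# Toy prices covering each boundary (values are made up).
toy_prices = pd.Series([0, 100, 101, 200.5, 300, 301])
print(categorize_prices(toy_prices).tolist())
```

The result is an ordered categorical, so charts and group-bys follow the Budget to Luxury order without extra sorting. Negative prices fall outside the bins and come back as missing.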

In [None]:
AirbnbDataset.info()

In [None]:
# Extract year and month from 'last_review'
AirbnbDataset['last_review_year'] = AirbnbDataset['last_review'].dt.year
AirbnbDataset['last_review_month'] = AirbnbDataset['last_review'].dt.month

In [None]:
AirbnbDataset.info()

In [None]:
# Check for duplicates
duplicate_count = AirbnbDataset.duplicated().sum()
print(f'Total Duplicate Rows: {duplicate_count}')

In [None]:
# Correlation matrix for numeric columns
# Specify the numerical columns you want to include in the correlation matrix
numerical_columns = ['price', 'latitude', 'longitude', 'minimum_nights', 'number_of_reviews',
                     'reviews_per_month', 'calculated_host_listings_count', 'availability_365',
                     'length_of_stay']

# Compute correlation matrix for the selected numerical columns
correlation_matrix = AirbnbDataset[numerical_columns].corr()
print(correlation_matrix)
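
A full matrix is hard to scan, so it can help to list only the strongest pairs. The following is a minimal sketch under assumptions: `top_correlations` is a hypothetical helper, and the toy frame mimics how `length_of_stay` is derived in this notebook.

```python
import numpy as np
import pandas as pd


def top_correlations(df, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the upper triangle only, so each pair appears once and the diagonal is dropped.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)


# Toy data (values are made up); 'length_of_stay' is derived as in the notebook.
toy = pd.DataFrame({
    "availability_365": [10, 100, 200, 365],
    "minimum_nights": [1, 2, 3, 30],
    "room_type": ["Private room", "Shared room", "Private room", "Entire home/apt"],
})
toy["length_of_stay"] = toy["availability_365"] - toy["minimum_nights"]
print(top_correlations(toy, n=3))
```

On the real data, `top_correlations(AirbnbDataset[numerical_columns])` would rank the same pairs that the printed matrix contains.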

### What all manipulations have you done and insights you found?







1. Dropped rows where all values are missing.
2. Filled missing values in 'host_name' and 'name' columns with 'Unknown'.
3. Filled missing values in 'last_review' column with a placeholder datetime value and converted it to datetime.
4. Imputed missing values in 'reviews_per_month' column with the median.
5. Created a new feature 'length_of_stay': This feature is the difference between 'availability_365' and 'minimum_nights'. It is a rough proxy for the number of bookable days beyond the minimum stay, not an observed stay duration.
6. Categorized prices into different categories: This categorization helps in understanding the distribution of prices in different ranges ('Budget', 'Mid-range', 'High-end', 'Luxury').
7. Extracted year and month from the 'last_review' column.
8. Computed correlation matrix for numerical columns: This provides insights into the relationships between numerical features in the dataset. For example, you can see if there is a correlation between the price of a listing and its location (latitude and longitude), or the number of reviews it has received.


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 (univariate plot)

In [None]:
# Chart - 1 The frequency distribution of different price categories in an Airbnb dataset.
# Set the size of the figure (width, height) in inches
plt.figure(figsize=(10, 6))
sns.countplot(x='price_category', data=AirbnbDataset, order=AirbnbDataset['price_category'].value_counts().index, palette='Set3', edgecolor='black')
plt.title('Number of Airbnb Listings by Price Category')
plt.xlabel('Price Category')
plt.ylabel('Frequency')
plt.show()


##### 1. Why did you pick the specific chart?


Countplot automatically counts the number of occurrences of each category and plots the corresponding bars. This simplifies the code compared to manually aggregating counts.

##### 2. What is/are the insight(s) found from the chart?


1. Observing that most listings are in the "Budget" and "Mid-range" categories may indicate the competitive landscape within the Airbnb platform.
2. The prevalence of listings in certain price categories reflects customer preferences and demand patterns.
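
To put a number on "most listings", the share of each category can be computed directly. This is an illustrative sketch: `category_shares` is a hypothetical helper and the toy series stands in for `AirbnbDataset['price_category']`.

```python
import pandas as pd


def category_shares(series):
    """Return each category's share of rows as a percentage, largest first."""
    return (series.value_counts(normalize=True) * 100).round(1)


# Toy data (values are made up).
toy_categories = pd.Series(["Budget", "Budget", "Mid-range", "Luxury"])
print(category_shares(toy_categories))
```

Reporting the percentages alongside the chart makes the insight checkable without reading bar heights.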

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


1. Competitive Analysis: Hosts may need to adjust their pricing strategies or differentiate their offerings to stand out in a crowded market.
2. Customer Preferences: Hosts and Airbnb itself can use this information to optimize pricing strategies, offer competitive rates, and align offerings with customer preferences to attract more bookings.
3. This visualization provides valuable insights into the distribution of listings across different price categories, which can inform various strategic decisions within the Airbnb platform, including pricing strategies, marketing efforts, and customer segmentation initiatives.

#### Chart - 2 (univariate plot)

In [None]:
# Chart - 2  The number of Airbnb listings in each neighborhood group.
plt.figure(figsize=(10, 6))
sns.countplot(data=AirbnbDataset, x='neighbourhood_group', palette='Set3',edgecolor='black')
plt.title('Number of Airbnb Listings by Neighbourhood Group')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?


Countplot automatically counts the number of occurrences of each category and plots the corresponding bars. This simplifies the code compared to manually aggregating counts.

##### 2. What is/are the insight(s) found from the chart?


1. Manhattan and Brooklyn are the most popular areas for Airbnb listings, indicating high demand and activity in these neighborhoods.
2. This data helps Airbnb understand market dynamics in different areas, revealing popular destinations and potential travel trends.
3. Travelers are likely drawn to Manhattan and Brooklyn for reasons like proximity to attractions and amenities, and easy access to public transportation.
4. High listing concentrations suggest competitive markets, impacting pricing, listing quality, and the overall customer experience.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Insights from this visualization can inform strategic decisions related to resource allocation, investment priorities, and market expansion strategies. For example, Airbnb may prioritize marketing efforts or service improvements in areas with lower listing counts to attract more hosts and diversify its inventory.

#### Chart - 3 (univariate plot)

In [None]:
# Chart - 3 The distribution of different room types in an Airbnb dataset.
# Count the occurrences of each room type
room_type_counts = AirbnbDataset['room_type'].value_counts()

# Plotting the pie chart
plt.figure(figsize=(7, 7))
plt.pie(room_type_counts, labels=room_type_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Room Types')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()


##### 1. Why did you pick the specific chart?


Pie charts are effective for quickly understanding the relative sizes of different categories and their contribution to the total.

##### 2. What is/are the insight(s) found from the chart?


1. Private rooms and entire home/apartments are the most common, with shared rooms being less prevalent.
2. Because listing counts reflect supply, this suggests that hosts mostly offer private accommodations, likely in response to guest demand.
3. It enables targeted marketing and service customization based on customer preferences.
4. Airbnb can prioritize promoting popular room types, optimize pricing, and allocate resources effectively, which enhances the user experience on the platform.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The insights from the pie chart help in understanding customer preferences, segmenting the market, making strategic decisions, and enhancing the overall user experience on the Airbnb platform, all of which are essential aspects of addressing the problem statement and optimizing the Airbnb service.

#### Chart - 4 (multivariate plot)





In [None]:
# Chart - 4 The average length of stay for different price categories and room types in an Airbnb dataset.
plt.figure(figsize=(10, 6))
sns.barplot(data=AirbnbDataset, x='price_category',hue='room_type', y='length_of_stay', palette='Set3', edgecolor='black' )
plt.title('Length of Stay by Price Category')
plt.xlabel('Price Category')
plt.ylabel('Length of Stay')
plt.legend(title='Room Type')
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
plt.show()


##### 1. Why did you pick the specific chart?


Bar charts are commonly used to display and compare the values of different categories or groups.

##### 2. What is/are the insight(s) found from the chart?


1. Private rooms and entire home/apartments are popular across price categories, while shared rooms are less favored.
2. The graph suggests varying customer preferences based on price, indicating potential differences in stay durations.
3. Understanding this behavior is key for pricing, inventory management, and meeting customer needs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


Insights from the visualization can inform strategic decisions related to inventory management, pricing optimization, and customer segmentation. For example, Airbnb can use this information to adjust pricing strategies, optimize inventory allocation, and tailor marketing efforts to attract customers with specific length of stay preferences.

#### Chart - 5 (multivariate plot)

In [None]:
# Chart - 5 The distribution of hosts across different neighborhood groups in an Airbnb dataset, segmented by room type.
plt.figure(figsize=(12, 8))
# Create the bar plot with hue (color) representing room_type
sns.barplot(data=AirbnbDataset, x='neighbourhood_group', y='host_id', hue='room_type', palette='Set3')
plt.title('Distribution of Hosts Across Neighbourhood Groups by Room Type')
plt.ylabel('Host ID')
plt.xlabel('Neighbourhood Group')
plt.legend(title='Room Type', loc='upper right')
plt.show()
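
`sns.barplot` aggregates the `y` values with their mean by default, so a bar built from `host_id` shows an average of identifier values rather than a number of hosts. A sketch of one way to count distinct hosts per group follows; `count_unique_hosts` and the `host_count` column are hypothetical names, and the toy frame stands in for `AirbnbDataset`.

```python
import pandas as pd


def count_unique_hosts(df):
    """Count distinct host_id values per neighbourhood group and room type."""
    return (
        df.groupby(["neighbourhood_group", "room_type"])["host_id"]
        .nunique()
        .reset_index(name="host_count")
    )


# Toy data (values are made up).
toy = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 4],
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan",
                            "Brooklyn", "Brooklyn", "Brooklyn"],
    "room_type": ["Private room", "Private room", "Entire home/apt",
                  "Private room", "Shared room", "Private room"],
})
host_counts = count_unique_hosts(toy)
print(host_counts)
```

The resulting table can be passed to `sns.barplot` with `y='host_count'` and `hue='room_type'`, so that bar heights are host counts.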

##### 1. Why did you pick the specific chart?


Bar charts are commonly used to display and compare the values of different categories or groups.

##### 2. What is/are the insight(s) found from the chart?


1. The bar plot shows the distribution of room types among hosts in different neighborhood groups, revealing spatial differences in accommodation options.
2. It indicates varying host preferences across neighborhoods, with shared rooms more popular in Bronx, Queens, and Brooklyn, while Staten Island, Bronx, and Queens have a more even distribution of private rooms and entire home/apartments.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


This graph helps address the problem statement by providing insights into the distribution of hosts and their preferences across different neighbourhood groups. It allows stakeholders to understand the Airbnb market landscape better and make informed decisions regarding listing management, pricing strategies, and market targeting.

#### Chart - 6 (multivariate plot)

In [None]:
# Chart - 6 The availability of listings (in days) in each neighborhood group for different price categories in an Airbnb dataset.
# Set the size of the plot
plt.figure(figsize=(12, 8))
# Create the bar plot with hue (color) representing price category
sns.barplot(data=AirbnbDataset, x='neighbourhood_group', y='availability_365', hue='price_category', palette='Set3')
# Set title and labels
plt.title('Availability by Neighbourhood Group and Price Category')
plt.xlabel('Neighbourhood Group')
plt.ylabel('Availability (days per year)')
plt.legend(title='Price Category', loc='upper right')
# Show plot
plt.show()

##### 1. Why did you pick the specific chart?


Bar charts are commonly used to display and compare the values of different categories or groups.

##### 2. What is/are the insight(s) found from the chart?


The plot illustrates differences in listing availability among price categories within each neighborhood group. In Bronx, Queens, and Staten Island, luxury, budget, and high-end listings generally have higher availability.
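
The bar heights in this chart are group means, which can also be tabulated to compare exact values. The sketch below is an illustration under assumptions: `mean_availability_table` is a hypothetical helper and the toy frame stands in for `AirbnbDataset`.

```python
import pandas as pd


def mean_availability_table(df):
    """Mean availability_365 per neighbourhood group (rows) and price category (columns)."""
    return df.pivot_table(
        index="neighbourhood_group",
        columns="price_category",
        values="availability_365",
        aggfunc="mean",
    )


# Toy data (values are made up).
toy = pd.DataFrame({
    "neighbourhood_group": ["Bronx", "Bronx", "Bronx", "Queens"],
    "price_category": ["Budget", "Budget", "Luxury", "Budget"],
    "availability_365": [100, 200, 300, 50],
})
table = mean_availability_table(toy)
print(table)
```

Combinations with no listings appear as missing values in the table, which is worth checking before comparing categories with few listings.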

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


This insight into availability by price category and neighbourhood group can inform pricing strategies, listing management decisions, and market targeting efforts for hosts and guests on the Airbnb platform.

#### Chart - 7(multivariate plot)

In [None]:
# Chart - 7 The geographical distribution of Airbnb listings based on latitude and longitude.
plt.figure(figsize=(12, 8))
# Create scatter plot of latitude and longitude
sns.scatterplot(data=AirbnbDataset, x='longitude', y='latitude', hue='neighbourhood_group', palette='Set3')
plt.title('Geographical Distribution of Airbnb Listings')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(title='Neighbourhood Group', loc='upper right')
plt.show()

##### 1. Why did you pick the specific chart?


A scatter plot of 'longitude' against 'latitude' places each listing at its geographic position, so the chart reads as a rough map of the city. Colouring the points by neighbourhood group shows where listings cluster.

##### 2. What is/are the insight(s) found from the chart?


Manhattan and Brooklyn stand out with the highest concentration of Airbnb listings, consistent with the listing counts in Chart 2, indicating their popularity among hosts and travelers. This could be due to factors like affordability, amenities, and attractions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


1. This graph contributes to the understanding of the geographical distribution of Airbnb listings, which is relevant to the problem statement. It helps identify areas with high listing concentrations, aiding in market analysis, property investment decisions, and strategic planning for hosts and Airbnb management.
2. The concentration of listings in certain neighbourhood groups can provide insights into market demand and preferences. Hosts and stakeholders can use this information to optimize their listing strategies, targeting areas with high demand and adjusting pricing or amenities accordingly.

#### Chart - 8 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# The correlation between different numerical features in an Airbnb dataset.
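# Compute the correlation matrix on numeric columns only
# (pandas 2.0 and later raise an error when corr() meets non-numeric columns)
corr = AirbnbDataset.select_dtypes(include='number').corr()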
# Set up the matplotlib figure
plt.figure(figsize=(10, 8))
# Create the heatmap
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt=".2f" )
plt.title('Correlation Heatmap')
plt.show()

##### 1. Why did you pick the specific chart?


A correlation heatmap is particularly useful when exploring relationships between numerical variables.

##### 2. What is/are the insight(s) found from the chart?


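Notable positive correlations:
* Number of reviews and reviews per month (0.568), a moderate relationship.
* Availability and length of stay (0.988). This is expected by construction, because 'length_of_stay' is computed as 'availability_365' minus 'minimum_nights'.

Weak negative correlations:
* Number of reviews and last review year (-0.268)
* Reviews per month and last review year (-0.176)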



#### Chart - 9 - Pair Plot

In [None]:
# A pair plot showing pairwise relationships between selected numerical columns in an Airbnb dataset.
# Select the numerical columns for the pair plot
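# 'minimum_nights' is included so its relationship with availability is visible in the plot
numerical_columns = ['price', 'minimum_nights', 'number_of_reviews',
                     'availability_365', 'length_of_stay']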
# Create the pair plot
sns.pairplot(AirbnbDataset[numerical_columns] )

# Show plot
plt.show()


##### 1. Why did you pick the specific chart?


Pair plots are useful for identifying patterns, trends, and correlations between variables in multivariate data.

##### 2. What is/are the insight(s) found from the chart?


Longer minimum nights correlate with higher availability, indicating that listings requiring longer stays are more likely to be available.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. Focus marketing efforts on Manhattan and Brooklyn due to high demand.
2. Promote private rooms and entire home/apartments for better booking rates.
3. Adjust pricing slightly higher for private rooms and entire home/apartments.
4. Ensure high-quality listings in all areas and categories for a better customer experience.
5. Offer flexible minimum night stays to accommodate short-term guests.
6. Keep pricing competitive in the budget and mid-range categories to meet demand.
7. Luxury, budget, and high-end listings are more available in Bronx, Queens, and Staten Island, indicating an opportunity to adjust pricing or offer special deals to increase occupancy in these categories.
8. Incentivize hosts to list more in Manhattan and Brooklyn, especially luxury and high-end options.
9. Stay updated with market trends to adapt pricing and offerings.
10. Regularly maintain listings for positive reviews and increased demand.


# **Conclusion**


Overall, the analysis of the Airbnb dataset provides valuable insights for optimizing pricing strategy, enhancing listing availability, and improving customer experience. By focusing on popular neighborhoods like Manhattan and Brooklyn, promoting private rooms and entire home/apartments, and diversifying price categories, Airbnb can attract more guests and increase revenue. Monitoring market trends and ensuring listing quality are also crucial for staying competitive in the dynamic Airbnb market.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***