# **Project Name**    - **Airbnb NYC 2019**



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Name -** Mustufa S Galagnath


# **Project Summary -**

Summary of Airbnb Dataset Exploratory Data Analysis (EDA) Project:

**Objective:**
The primary objective of this project was to perform an Exploratory Data Analysis (EDA) on an Airbnb dataset to gain meaningful insights into the factors that influence property listings, prices, and availability. These insights help in understanding the needs and preferences of both hosts and customers and in making informed business decisions that support the growth of the business.

**Understanding the Data:**
The analysis began by examining the dataset's structure, size, and variable types using methods like shape, describe, and info. Duplicate values were checked and none were found.

**Data Wrangling:**
Data wrangling began with creating a copy of the dataset to maintain the integrity of the original data. Incorrect data types were addressed, and irrelevant identifier columns were dropped. Missing values were strategically handled, and outliers in certain columns were identified and replaced.
The dataset was further refined by eliminating rows with zero prices and grouping prices for better analysis.

**Price Analysis:**
An in-depth analysis of prices revealed that most listings fell within the 10-200 USD range. Specific price groups were created for better analysis, uncovering patterns such as shared rooms dominating the 10-50 USD range and entire home/apartment listings peaking at 300 USD. A detailed breakdown of room types and neighborhood contributions was presented through visualizations.

**Room Preferences:**
Despite being the costliest option, Entire home/apartment listings were the most preferred, constituting over half of the total listings. Private rooms followed, while shared rooms were the least favored. This suggested a preference for privacy and spacious options, with customers willing to pay more for such listings.

**Factors Influencing Prices:**
The analysis highlighted that room type and neighborhood groups were the primary factors influencing prices. The flow of prices for different room types was explored, revealing distinct ranges for shared and private rooms compared to the broader range for entire home/apartment listings.

**Neighborhood Analysis:**
Manhattan emerged as the costliest neighborhood, followed by Brooklyn, Staten Island, Queens, and Bronx. Scatter plots demonstrated the lack of a clear correlation between latitude/longitude values and listing prices, emphasizing the neighborhood's significance.

**Minimum Nights Preferences**:
Hosts predominantly preferred guests to book for a minimum of 1-3 nights, with a gradual decline in preference for longer minimums. An unusual rise at 11 days marked an interesting observation in the minimum nights density plot.

**Availability Analysis:**
The availability of listings throughout the year exhibited an inverse relationship with the number of listings in an area. Staten Island had the highest mean availability, while Manhattan and Brooklyn had the least. Similar patterns were observed across different room types.

In conclusion, this EDA provided insights into the factors influencing Airbnb listings, prices, and availability. The patterns and preferences it surfaced are useful to both hosts and potential guests and contribute to a better understanding of the market dynamics captured in the dataset.

# **GitHub Link -**

https://github.com/raza209/AlmaBetterCapstoneProject2_Airbnb_EDA-Mustufa-/tree/main

# **Problem Statement**


**BUSINESS PROBLEM STATEMENT**

Business: Airbnb, Inc. is a San Francisco-based American company that operates a digital marketplace facilitating short- and long-term homestays and experiences. Acting as an intermediary, the company earns commissions from each booking.

The vast Airbnb ecosystem consists of millions of listings, generating substantial data encompassing crucial insights from both hosts and guests. The provided dataset, comprising 16 columns and 48895 rows, contains a mix of numerical and categorical information. Key facets include prices, room types, and neighborhood details.

Given Airbnb's dual customer base—guests and hosts—it is imperative to comprehend the distinct requirements of both to thrive in this dynamic industry. Our objective is to analyze the provided dataset, gaining insights into what influences customers when booking rooms. This involves understanding preferred room types, desired price ranges, and preferred neighborhoods. Simultaneously, we aim to uncover the needs and offerings hosts have for their business, such as availability preferences and other relevant factors.

Through a comprehensive Exploratory Data Analysis (EDA) on this dataset, we seek to enhance our understanding of the nuanced requirements of both customers and hosts. The insights derived will guide critical business decisions. This includes understanding customer and host behaviors, optimizing marketing initiatives, implementing innovative services, and more. The ultimate goal is to leverage these insights for the betterment of the Airbnb platform and its stakeholders.

#### **Define Your Business Objective?**

The objective of the EDA process for the Airbnb dataset is to extract meaningful insights that enhance the business's understanding of hosts' and customers' preferences and needs. This gathered information aims to facilitate informed and strategic business decision-making.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
# Importing all the required libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Dataset Loading

In [None]:
# # Load Dataset
# from google.colab import drive
# drive.mount('/content/drive/')

In [None]:
# Importing the data set
# Please upload 'Airbnb NYC 2019.csv' to this Colab session before running this cell
try:
  Airbnb_df = pd.read_csv('/content/Airbnb NYC 2019.csv')
except FileNotFoundError:
  # Without the file, Airbnb_df is never created, so report the problem and stop here
  print("Please upload the dataset to Colab; the file was not found at the expected path.")
  raise

### Dataset First View

In [None]:
# Dataset First Look
Airbnb_df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
Airbnb_df.shape

### Dataset Information

In [None]:
# Dataset Info
Airbnb_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
# duplicated() flags repeated rows; summing the flags gives the number of duplicate rows
Airbnb_df.duplicated().sum()

From the above, we can see that there are no duplicate rows.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
Airbnb_df.isna().sum()

In [None]:
# Visualizing the missing values
# defining a function which visually shows the percentage of missing values in each column

def missing_value_plot(df):
    # Creating a DataFrame to store the percentage of missing values for each column
    missing = pd.DataFrame((df.isnull().sum())*100/df.shape[0]).reset_index()

    # Setting up the plot
    plt.figure(figsize = (16,8))

    # Creating a point plot to visualize the percentage of missing values
    ax = sns.pointplot(x = 'index',y = 0, data = missing)

    # Setting plot title and axis labels
    ax.set_title('Point plot showing percentage of missing values per column', fontweight = 'bold')
    ax.set_xlabel('Column names')
    ax.set_ylabel('Percentage % of missing values')

    # Rotating x-axis labels for better readability
    plt.xticks(rotation = 90, fontsize = 10)

    # Adding grid lines to the plot
    plt.grid(True)

    # Displaying the plot
    plt.show()

In [None]:
# calling the missing_value_plot function to show the graph
missing_value_plot(Airbnb_df)

It seems that both the 'last_review' and 'reviews_per_month' columns contain around 20% missing values. As these two columns are not crucial for our current analysis, and removing 20% of the rows would distort the overall dataset, I have decided to disregard this information for the time being.
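
As a rough sketch of the reasoning above, the percentage of missing values can also be read directly from the mean of the `isna()` flags, and the affected columns can stay in the frame because pandas aggregations skip `NaN` by default. The `toy` frame and its values are made up for illustration and are not taken from the Airbnb data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the listings data (values are invented)
toy = pd.DataFrame({
    "price": [100, 80, 150, 60, 200],
    "reviews_per_month": [0.5, np.nan, 1.2, 0.3, 2.0],
})

# isna() gives a boolean frame; the mean of booleans is the fraction missing
missing_pct = toy.isna().mean() * 100
print(missing_pct)

# All rows are kept; an aggregation on the incomplete column simply skips NaN
mean_reviews = toy["reviews_per_month"].mean()
print(mean_reviews)
```

Keeping the rows means columns such as price and room type remain complete for the rest of the analysis.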

### **More about the data** ###

In [None]:
Airbnb_df.name.value_counts()

In [None]:
Airbnb_df.query('name == "Hillside Hotel"')

### What did you know about your dataset?

The dataset originates from Airbnb, a player in the hospitality industry. Our task involves analyzing the provided data and extracting meaningful insights from it. The focal points of interest include room type, price, and neighborhood. The objective is to comprehensively study the data, derive valuable insights, and leverage these findings for informed business decisions and understanding customer preferences.

The dataset comprises 48,895 rows and 16 columns. While there are no duplicate entries, it is noteworthy that around 20% of the data in two specific columns (last_review and reviews_per_month) is missing.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
Airbnb_df.columns

In [None]:
# Dataset Describe
Airbnb_df.describe(include = 'all')

### Variables Description



*   **id** : unique ID number of the listing
*   **name** : name of the listing
*   **host_id** : unique ID number of the host
*   **host_name** : name of the host
*   **neighbourhood_group** : neighbourhood group (borough) in which the listing is located
*   **neighbourhood** : name of the neighbourhood
*   **latitude** : latitude coordinate of the listing
*   **longitude** : longitude coordinate of the listing
*   **room_type** : type of listing (Entire home/apt, Private room, or Shared room)
*   **price** : price of the listing (USD)
*   **minimum_nights** : minimum number of nights that must be booked
*   **number_of_reviews** : total number of reviews received
*   **last_review** : date of the last review
*   **reviews_per_month** : number of reviews received per month
*   **calculated_host_listings_count** : total number of listings the host has
*   **availability_365** : number of days in the year the listing is available



### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for i in Airbnb_df.columns.tolist():
  print(f'No. of unique value in {i} is {Airbnb_df[i].nunique()}')

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Making a copy of the dataset with variable name as df
df = Airbnb_df.copy()

In [None]:
df.head()

In [None]:
df.columns

#### Data Types of the columns

In [None]:
# using df.info() to find out the data type of each column
df.info()

All the columns appear to have appropriate data types corresponding to the nature of the information they store, except for the "last_review" column. This column, which records the date of the last review, currently has the data type 'object' instead of the expected 'datetime'. Therefore, there is a need to modify the data type for this specific column.

In [None]:
# Converting the last_review column's data type from object to datetime
df.last_review = pd.to_datetime(df.last_review)

In [None]:
df.info()

Successfully changed the data type of the last_review column.
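
A minimal sketch of the same conversion on a made-up series, assuming the dates are stored as `YYYY-MM-DD` strings. The `errors="coerce"` argument is an optional addition that is not used in the notebook; it turns values that cannot be parsed into `NaT` instead of raising an error.

```python
import pandas as pd

# Invented sample values: two valid dates, one missing value, one unparseable string
raw = pd.Series(["2019-05-21", "2018-10-19", None, "not a date"])

# Missing and unparseable entries both become NaT
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```

Once the column is a datetime type, the `.dt` accessor (for example `parsed.dt.year`) becomes available for date-based grouping.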

#### Dealing with irrelevant columns

The 'id', 'host_id', and 'host_name' columns are identifiers rather than descriptive features of a listing. As they do not substantially contribute to the analysis, I have decided to eliminate these columns from the dataset.
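
Before dropping identifier-like columns, it can help to check how many distinct values each one actually holds. The sketch below uses a made-up frame in which one host owns two listings; the column names mirror the dataset, but the values are invented.

```python
import pandas as pd

# Invented rows: host 10 ("Ann") appears twice
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "host_id": [10, 10, 20, 30],
    "host_name": ["Ann", "Ann", "Raj", "Lee"],
})

# Series.is_unique is True only when no value repeats in the column
is_unique = {col: toy[col].is_unique for col in toy.columns}
print(is_unique)

# nunique() compared with the row count gives the same information as numbers
print(toy.nunique(), len(toy))
```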

In [None]:
# The below code drops the given columns
df.drop(columns = ['id', 'host_id', 'host_name'], inplace = True)
df.columns

In [None]:
# checking the shape of the dataset after dropping unnecessary columns
df.shape

#### Dealing with Missing values

In [None]:
# this code lists each column name with its percentage of missing (NaN) values
missing = pd.DataFrame((df.isnull().sum()*100 / df.shape[0]).reset_index())
missing

In [None]:
# total number of missing values in each column
df.isnull().sum()

From the above, it is evident that only about 0.03% of the values in the 'name' column are missing, which we can ignore.
The 'last_review' and 'reviews_per_month' columns contain around 20% missing values. As these two columns are not crucial for our current analysis, and removing 20% of the rows would distort the overall dataset, I have decided to disregard this information.

#### Dealing with Outliers

From the table below, obtained from the pandas describe() function, we can observe that for both price and minimum number of nights there is a huge difference between the mean and the maximum value. This suggests that these columns contain many outliers, which need to be dealt with.

In [None]:
df.describe()

In the above table we can see that the minimum price is 0. Since a listing cannot be free, all rows with a price of 0 are removed.

In [None]:
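# Deleting all the rows having price as Zero
# .copy() makes df an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning
df = df[df['price'] != 0].copy()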

# printing describe after deleting the rows with 0 as price
print('After Deleting the rows having Zero as price')
df.describe()

In [None]:
df.info()

In [None]:
# to see the outliers range we need visualization with boxplot
# writing a function to plot boxplot

# Function to plot boxplots for specific columns in a DataFrame
def boxplot_price_nights(df_1):
    # Create a subplot with 1 row and 2 columns, setting the figure size
    fig, ax = plt.subplots(1, 2, figsize = (10, 5))

    # Boxplot for the 'price' column (passed as y so the box is drawn vertically)
    sns.boxplot(y = df_1['price'], color = 'skyblue', ax = ax[0])
    ax[0].set_title('Box plot for Price', fontsize = 14, fontweight = 'bold')
    ax[0].set_xlabel('Price Column', fontsize = 10, fontweight = 'bold')
    ax[0].set_ylabel('Price', fontsize = 10, fontweight = 'bold')

    # Boxplot for the 'minimum_nights' column
    sns.boxplot(y = df_1['minimum_nights'], color = 'skyblue', ax = ax[1])
    ax[1].set_title('Box plot for Minimum nights', fontsize = 14, fontweight = 'bold')
    ax[1].set_xlabel('minimum nights', fontsize = 10, fontweight = 'bold')
    ax[1].set_ylabel('number of nights', fontsize = 10, fontweight = 'bold')

    # Display the plots
    plt.show()

In [None]:
# calling the above function with our Airbnb dataset as input
boxplot_price_nights(df)

In [None]:
# defining a function which takes a column name and a dataframe, and caps the outliers in that column
def deal_with_outliers(col_name, df):
  """
  This function takes a column name and a dataframe as input, identifies the outliers
  in that column using the IQR rule, and replaces them in place with the lower or upper bound values
  """
  # Calculating the Interquartile Range (IQR)
  Q1 = df[col_name].quantile(0.25)
  Q3 = df[col_name].quantile(0.75)
  IQR = Q3 - Q1

  # Define the upper and lower limit
  lower_bound = Q1 - 1.5*IQR
  upper_bound = Q3 + 1.5*IQR

  # Replace outliers with the upper or lower bound
  # values below the lower bound are replaced with the lower bound,
  # and values above the upper bound are replaced with the upper bound
  df[col_name] = df[col_name].clip(lower = lower_bound, upper = upper_bound)

  # Visualize the distribution after handling the outliers
  plt.figure(figsize = (5, 4))
  sns.boxplot(x = col_name, data = df)
  plt.title('Box plot after handling the outliers', fontweight = 'bold')
  plt.xlabel(col_name)
  plt.ylabel(col_name)
  plt.grid(False)
  plt.show()

In [None]:
# Calling our outlier-handling function to deal with outliers of the price column
deal_with_outliers('price', df)

In [None]:
# Calling our outlier-handling function to deal with outliers of the minimum nights column
deal_with_outliers('minimum_nights', df)
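
To make the IQR rule concrete, here is a small worked example on invented numbers (not values from the dataset). It uses `Series.clip` as one way to cap values at the bounds.

```python
import pandas as pd

# Invented values with one obvious outlier
s = pd.Series([10, 20, 30, 40, 1000])

# Quartiles and interquartile range
q1, q3 = s.quantile(0.25), s.quantile(0.75)   # 20.0 and 40.0
iqr = q3 - q1                                 # 20.0

# Bounds at 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # -10.0 and 70.0

# Values outside [lower, upper] are replaced by the nearest bound
capped = s.clip(lower=lower, upper=upper)
n_capped = int((s != capped).sum())

print(capped.tolist())
print(n_capped)
```

Note that every capped value lands exactly on a bound, so a column with many outliers can show a small pile-up at that bound in later distribution plots.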

In [None]:
df.describe()

In [None]:
# Printing the Maximum, minimum and mean price
print(f"Maximum price = {max(df['price'])}")
print(f"Minimum price = {min(df['price'])}")
print(f"Average price = {round(df['price'].mean(), 2)}")

#### Making price groups column for better visualisation

In [None]:
# firstly finding the minimum and maximum values of the price
print(f"minimum price = {df['price'].min()}")
print(f"Maximum price = {df['price'].max()}")
print(f"Average price = {round(df['price'].mean(), 2)}")

In [None]:
# Creating labels for Price range
labels = [f"{i} - {i +49}" for i in range(51, 350, 50)]

# manually adding the first price group from 10 -50, because 10 is the minimum value
labels.insert(0, '10 - 50')

# assign the proper price group for each price
# pd.cut bins are right-inclusive, so edges at 0, 50, ..., 350 give (0, 50], (50, 100], ... and match the labels
df['price_group'] = pd.cut(df['price'], bins = list(range(0, 351, 50)), labels = labels)
df
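
Because `pd.cut` treats each bin as right-inclusive by default, the bin edges decide which label a boundary price such as 50 or 51 receives. The sketch below, on invented prices, shows edges placed at multiples of 50 so that labels like "51 - 100" cover exactly those values. The names `edges` and `groups` are illustrative only.

```python
import pandas as pd

# Invented prices chosen to sit on the group boundaries
prices = pd.Series([10, 50, 51, 100, 101, 334])

# Edges 0, 50, ..., 350 produce the intervals (0, 50], (50, 100], ..., (300, 350]
edges = list(range(0, 351, 50))

# One label per interval: "10 - 50", "51 - 100", ..., "301 - 350"
labels = ["10 - 50"] + [f"{lo + 1} - {lo + 50}" for lo in edges[1:-1]]

groups = pd.cut(prices, bins=edges, labels=labels)
print(groups.astype(str).tolist())
```

Checking a few boundary values this way is a cheap guard against off-by-one shifts between the labels and the intervals they name.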

### What all manipulations have you done and insights you found?

1. Altered the data type of the 'last_review' column from object to datetime, as this column contained date values.

2. Eliminated the unique identifier columns, namely 'id', 'host_id', and 'host_name'.

3. Computed the percentage of missing values in each column.

4. Identified numerous outliers in the 'price' and 'minimum_nights' columns, and substituted those outliers with lower and upper bound values.

5. Removed rows with a price value of 0 (zero), as it is not feasible for the price of an Airbnb stay to be zero.

6. Due to a wide range of prices in the 'price' column that might hinder visualization, categorized prices into groups with intervals of 50.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 : Density plot for Distribution of prices

In [None]:
# Chart - 1 visualization code

# Creating a kernel density estimate (KDE) plot to illustrate the relationship between prices and their distribution in the dataset.
def kde_plot(df_, col_name):
    # Plotting a KDE plot for the specified column in the DataFrame
    ax = df_[col_name].plot(kind = 'kde', figsize = (10,5), color = 'red')

    # Setting the title with bold text and specified font size
    ax.set_title('Distribution of Prices VS Density', fontweight = 'bold', fontsize = 14)

    # Setting the x-axis label
    ax.set_xlabel(col_name)

    # Displaying the plot
    plt.show()

In [None]:
# calling the kde plot function, to plot the density graph
kde_plot(df, 'price')

##### 1. Why did you pick the specific chart?

I opted for a KDE plot to effectively depict the distribution of prices in the dataset, offering a smooth representation of the density of price values.

##### 2. What is/are the insight(s) found from the chart?

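The density plot revealed that a significant majority of Airbnb listing prices lie between 10 and 200 USD, with the 20-100 USD range contributing almost half of the total density. Additionally, a subtle anomalous peak was observed around the 300 USD mark; since outliers in the price column were replaced with the upper bound during data wrangling, this peak may partly reflect those capped values.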
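
Shares read off a density curve can be cross-checked numerically. A minimal sketch, using invented prices rather than the actual column: `Series.between` flags values inside a closed range, and the mean of those flags is the fraction of listings in that range.

```python
import pandas as pd

# Invented prices for illustration only
prices = pd.Series([35, 60, 75, 90, 120, 150, 180, 250])

# between() is inclusive on both ends by default
share_20_100 = prices.between(20, 100).mean()
share_10_200 = prices.between(10, 200).mean()

print(f"20-100 USD: {share_20_100:.1%}")
print(f"10-200 USD: {share_10_200:.1%}")
```

Applied to the real price column, the same two lines would give the exact proportions behind the visual estimate.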

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the density plot of Airbnb room rents can potentially lead to a positive business impact. Here's how:

The density plot insights, revealing a concentration of Airbnb listing prices between 10 and 200 USD, with a noteworthy peak near 300 USD, offer valuable guidance for pricing and marketing strategies. Businesses can optimize pricing within the prevalent 20-100 USD range, examine the anomaly near 300 USD as a candidate for targeted promotions, and enhance competitive positioning. These insights empower businesses to align their offerings with market trends, potentially resulting in a positive impact on customer engagement and overall business performance.

##### An alternative method of visualizing the same data: employing a histogram plot.

In [None]:
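# Chart - 1 visualization code (alternative: histogram)
def hist_plot(df_1, col_name):
    # Plotting a histogram for the specified column in the DataFrame
    ax = df_1[col_name].plot(kind='hist', figsize=(10, 5))
    ax.set_title('Distribution of Prices', fontweight='bold')
    ax.set_xlabel(col_name)
    # Labelling each bar with its count
    for bars in ax.containers:
        ax.bar_label(bars)
    # Adding horizontal grid lines and displaying the plot
    plt.grid(axis='y')
    plt.show()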

In [None]:
# calling the hist plot function
hist_plot(df, 'price')

#### Chart - 2 : Pie plot for different neighbourhood groups

In [None]:
# Distribution of the Airbnb listings in the different neighbourhood groups

# Displaying the percentage of each group using a pie chart
def plot_pie_chart(df_, col_name, title):
    """
    Display the percentage distribution of each group in a specified column using a pie chart.

    Parameters:
    - df_ (DataFrame): The DataFrame containing the data.
    - col_name (str): The name of the column for which the pie chart is generated.
    - title (str): The title for the chart.

    Returns:
    None (Displays the pie chart as a plot).
    """
    # Counting occurrences of each category in the specified column
    counts = df_[col_name].value_counts()

    # Extracting labels and sizes for the pie chart
    labels = counts.index.tolist()
    sizes = counts.values.tolist()

    # Creating a subplot for the pie chart with a specified size
    fig, ax = plt.subplots(figsize=(8, 6))

    # Plotting the pie chart with labels, percentages, and shadow
    ax.pie(sizes, labels=labels, autopct=lambda p: '{:.1f}%'.format(p) if p > 1 else '', shadow=True)

    # Ensuring an equal aspect ratio for a circular pie chart
    ax.axis('equal')

    # Setting the title for the pie chart
    plt.title(title, y=1.1, fontweight = 'bold', fontsize = 16)

    # Adding a legend to the chart for better interpretation
    ax.legend(loc='upper left', bbox_to_anchor=(1, 0.8))

    # Displaying the pie chart
    plt.show()


In [None]:
# calling the function plot_pie_chart, with required dataframe, column name and title
title = "Distribution of Airbnb's in different Neighborhood Groups"
plot_pie_chart(df, 'neighbourhood_group', title)

##### 1. Why did you pick the specific chart?

I opted for the Pie chart because it efficiently illustrates the proportional distribution of categories using distinct colors. It also labels each category with its respective percentage, facilitating a clear visualization of each group's contribution to the overall dataset.

##### 2. What is/are the insight(s) found from the chart?

The preceding Pie chart illustrates the percentage distribution of listings among various neighborhood groups. Key insights derived from the chart include:

- Manhattan boasts the highest number of listings, constituting 44.3% of the total, closely trailed by Brooklyn at 41.1%.
- Combined, Manhattan and Brooklyn account for over 80% of Airbnb room listings.
- Queens and Bronx contribute 11.6% and 2.2% to the total listings, respectively.
- Staten Island has the fewest listings, making up less than 1% of the total.
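
The percentages shown on the pie chart can also be obtained as a table. A minimal sketch on an invented series of group names (the counts are made up and do not match the dataset): `value_counts(normalize=True)` returns each category's share of the total.

```python
import pandas as pd

# Invented sample: 4 Manhattan, 4 Brooklyn, 1 Queens, 1 Bronx
groups = pd.Series(["Manhattan"] * 4 + ["Brooklyn"] * 4 + ["Queens"] + ["Bronx"])

# Share of each category as a percentage, rounded to one decimal
share = groups.value_counts(normalize=True).mul(100).round(1)
print(share)

# Combined share of the two largest groups
top_two = share.loc[["Manhattan", "Brooklyn"]].sum()
print(top_two)
```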

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can potentially lead to a positive business impact. Understanding the distribution of Airbnb listings among different neighborhood groups provides valuable information for strategic decision-making. Specifically:

1. **Optimizing Marketing Strategies:** Knowing that Manhattan and Brooklyn dominate the listings allows for targeted marketing efforts in these high-demand areas.

2. **Resource Allocation:** With over 80% of listings concentrated in Manhattan and Brooklyn, businesses can allocate resources, such as advertising or property management services, more efficiently to cater to the majority of the market.

3. **Pricing Strategies:** Insights into the contribution of each neighborhood group help hosts and Airbnb itself make informed decisions regarding pricing, taking into account the demand and supply dynamics in different areas.

4. **Market Expansion:** Recognizing the lower contribution of Queens, Bronx, and Staten Island could guide business strategies to explore opportunities for market expansion or targeted promotional activities in these areas.

It's important to note that the insights do not inherently suggest negative growth. Instead, they provide valuable information for informed decision-making. Negative growth could occur if the business solely focuses on Manhattan and Brooklyn, neglecting potential opportunities or failing to adapt strategies in other neighborhoods. The key is to leverage the insights for well-informed, diversified business strategies that encompass both high-performing and emerging markets.








#### Chart - 3 : Pie chart of room types

In [None]:
# Chart - 3 visualization code
# since the Pie chart is only used to represent the percentage of each room type, I will call the already defined pie chart function
title = "Airbnb Distribution Across Various Room Types"
plot_pie_chart(df, 'room_type', title)

##### 1. Why did you pick the specific chart?

I opted for the Pie chart because it efficiently illustrates the proportional distribution of categories using distinct colors. It also labels each category with its respective percentage, facilitating a clear visualization of each group's contribution to the overall dataset.

##### 2. What is/are the insight(s) found from the chart?

The above Pie chart illustrates distribution of Room Types in Overall Airbnb Listings. Key insights derived from the chart include:

*   Entire home/apt is the most preferred room type with 52% of total listings, followed by Private room, which constitutes 45.7% of total listings.
*   Shared room constitutes only 2.4% of total listings, making it the least preferred type of room booking on Airbnb.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights gained from the pie chart can help the business make informed decisions that have a positive impact. Let's analyze the potential positive impact and consider any insights that may lead to negative growth:

**Positive Business Impact**:

1. **Understanding Preferred Room Types:**
   - **Positive Impact:** Knowing that "Entire home/apt" is the most preferred room type with 52% of total listings suggests that there is a high demand for entire homes or apartments. This information can guide business strategies, such as focusing marketing efforts on promoting entire home rentals.

2. **Private Room Demand:**
   - **Positive Impact:** The insight that "Private room" constitutes 45.7% of total listings indicates a significant demand for this type of accommodation. Airbnb hosts could optimize their listings and tailor services to cater to the preferences of guests seeking private rooms.

**Negative Growth Considerations**:

1. **Low Demand for Shared Rooms:**
   - **Negative Impact:** The insight that "Shared room" only constitutes 2.4% of total listings may indicate a low demand for this room type. Hosts offering shared rooms might need to reconsider their offerings or adjust pricing strategies to attract more guests. This could be a potential area of negative growth if resources are invested without a corresponding increase in demand.

2. **Market Saturation for Entire Home/Apt:**
   - **Negative Growth Concern:** While "Entire home/apt" is the most preferred type, the high percentage (52%) may also suggest a competitive market. Hosts in this category might face intense competition, and standing out becomes crucial. Additionally, if the market is saturated, new hosts may find it challenging to enter this segment and experience growth.

**Overall Justification:**

The gained insights are generally positive for the business, emphasizing the popularity of entire homes/apartments and private rooms. However, caution is needed regarding shared rooms, which show low demand. Additionally, if the market for entire homes/apartments is highly competitive, hosts in this category should focus on differentiation and providing unique value to guests. Overall, a comprehensive strategy considering both positive and negative insights is essential for sustained and positive business impact.

#### Chart - 4 : Count plot depicting the distribution of listings across price groups, with room type as the distinguishing factor.

In [None]:
# Chart - 4 visualization code

# A countplot is presented, illustrating the descending order of customer preferences for different price ranges.
# The plot includes the specific count for each price range, providing a detailed representation of customer choices.

def pricegroup_barplot(df_, col_name, title):
    """
    Generate a countplot to illustrate the distribution of Airbnb listings based on price ranges.

    Parameters:
    - df_ (DataFrame): The DataFrame containing the data.
    - col_name (str): The name of the column representing price ranges.
    - title (str): The title displayed above the plot.

    Returns:
    None (Displays the countplot as a plot).
    """
    # Extracting the top 10 price ranges based on count for better visualization
    order = df_[col_name].value_counts().nlargest(10).index

    # Creating a figure with a specified size
    plt.figure(figsize=(14, 6))

    # Generating a countplot based on price ranges, ordering by the specified 'order' and differentiating by 'room_type'
    ax = sns.countplot(data=df_, x=col_name, order=order, hue='room_type', palette = 'muted')

    # Adding labels to the bars representing the count of each price range
    for bars in ax.containers:
        ax.bar_label(bars)

    # Setting the title for the plot
    ax.set_title(title ,fontsize = 16, fontweight = 'bold')

    # Displaying the plot
    plt.show()


In [None]:
# calling the function to plot the countplot for number of listings in each price group separately for each room type
title = 'Number of Listings as per Price Range'
pricegroup_barplot(df, 'price_group', title)

##### 1. Why did you pick the specific chart?

I opted to utilize a Countplot with room types as the hue to precisely display the customers' most preferred price groups. This approach adds precision by considering room types as a variable in the analysis.

##### 2. What is/are the insight(s) found from the chart?

The detailed insights obtained from the aforementioned Countplot graph are as follows:


- The highest booking frequency is observed in the 51 to 100 USD price range, primarily for private rooms.
- The 101 to 150 USD price group ranks second in cumulative preference and is the most favored range for the Entire home/apt room type.
- Shared rooms are most commonly booked in the most economical price group (10 to 50 USD).
- Cumulatively, the overall pattern shows frequency rising from the minimum price of 10 USD, peaking at 150 USD, and then gradually declining. An unusual minor peak is noted in the 301 to 350 USD price range, driven by Entire home/apt bookings.
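
The counts behind a grouped countplot can be tabulated with `pd.crosstab`, which makes it easy to read off the most common price group for each room type. The frame below is invented for illustration; its column names mirror the dataset, but the rows do not come from it.

```python
import pandas as pd

# Invented listings: price group and room type for seven rows
toy = pd.DataFrame({
    "price_group": ["10 - 50", "10 - 50", "51 - 100", "51 - 100",
                    "51 - 100", "101 - 150", "101 - 150"],
    "room_type": ["Shared room", "Private room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt", "Entire home/apt"],
})

# Rows are price groups, columns are room types, cells are listing counts
counts = pd.crosstab(toy["price_group"], toy["room_type"])
print(counts)

# Most frequent price group for each room type
top_group_per_type = counts.idxmax()

# Total listings per price group across all room types
total_per_group = counts.sum(axis=1)
```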

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**

1. **Identified Popular Price Ranges:**
   - **Positive Impact:** Knowing that the highest booking frequency is in the price range of (51 to 100USD) allows hosts to tailor promotions and strategies for increased bookings and revenue.

2. **Optimizing Entire Home/Apt Rentals:**
   - **Positive Impact:** The insight that the 101 to 150USD range is the most favored for Entire home/apt types guides hosts to optimize offerings in this range, potentially enhancing profitability.

**Negative Growth Considerations:**

1. **Shared Rooms in Most Economical Range:**
   - **Negative Impact:** While shared rooms are common in the most economical range (10 to 50USD), this might lead to challenges in revenue generation, as shared rooms generally have lower rates. Hosts may need to explore ways to attract bookings in higher-priced categories.

2. **Unusual Peak in 301 to 350USD Range:**
   - **Negative Growth Concern:** The unusual minor peak in the price range of 301 to 350 USD might indicate a specific market segment (luxury homes) or demand that is not in line with the overall pattern. Hosts should carefully evaluate this anomaly to avoid potential negative growth in this range.

**Overall Assessment:**

The gained insights provide opportunities for positive business impact, especially in optimizing popular price ranges and Entire home/apt rentals. However, considerations for shared rooms in the most economical range and the unusual peak in the higher range require strategic adjustments to avoid negative growth and optimize overall revenue.

##### Alternate way: Cumulatively for all the room types, number of listings against price group

In [None]:
# Chart - 4 visualization code

# a Cumulative graph showing overall frequency
order = df['price_group'].value_counts().nlargest(10).index
plt.figure(figsize = (10,6))
ax = sns.countplot(data = df, x = 'price_group', order = order)
for bars in ax.containers:
  ax.bar_label(bars)
ax.set_title('Number of listing as per price range', fontsize = 16, fontweight = 'bold')
plt.grid(axis = 'y')
plt.show()


#### Chart - 5 : Kernel density plot illustrating the distribution of the minimum number of nights required to be paid.

In [None]:
# Chart - 5 visualization code

# Creating a kernel density estimate (KDE) plot for the 'minimum_nights' column
ax = df['minimum_nights'].plot(kind='kde', figsize=(10, 5), color='red')

# Setting the title and  x-axis label
ax.set_title("Density graph showing host's demand for minimum nights to be paid for", fontsize = 14, fontweight = 'bold')
ax.set_xlabel('Minimum number of Nights')

# Displaying the KDE plot
plt.show()


##### 1. Why did you pick the specific chart?

I opted for a KDE plot to effectively depict the distribution of minimum nights to be paid for in the dataset, offering a smooth representation of the density of minimum-night values.

##### 2. What is/are the insight(s) found from the chart?

The graph above represents the distribution of frequencies for the minimum number of nights guests are required to pay. The pattern varies noticeably. It is evident that the most commonly requested minimum number of nights is 1 and 2, gradually decreasing thereafter. There is a slight increase observed at 11 nights, followed by a decline towards the end of the range.
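
Because a KDE smooths over what is really a discrete count of nights, the peaks can be cross-checked with a plain frequency table. This is a minimal sketch, assuming the `minimum_nights` column used above; the helper name `minimum_nights_share` and the sample values are hypothetical.

```python
import pandas as pd

def minimum_nights_share(df_, col="minimum_nights", top=5):
    """Return the percentage share of the `top` most common values in `col`."""
    share = df_[col].value_counts(normalize=True).mul(100).round(1)
    return share.head(top)

# Made-up sample values, not real Airbnb data
sample = pd.DataFrame({"minimum_nights": [1, 1, 1, 2, 2, 3, 11, 1, 2, 5]})
share = minimum_nights_share(sample)
print(share)
```

Running `minimum_nights_share(df)` on the project dataframe would show whether 1 and 2 nights really dominate and how large the smaller bump in the KDE curve is in terms of actual listings.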

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the chart suggest several potential implications for the business:

1. **Positive Business Impact:**
   - **Optimal Minimum Nights:** Knowing that the most commonly requested minimum number of nights is 1 and 2 can be valuable. The business can optimize pricing, promotions, or marketing strategies to attract guests looking for short stays, potentially increasing overall booking rates.
   - **Targeted Marketing:** Understanding the gradual decrease in demand beyond 2 nights allows for targeted marketing efforts. The business can focus on promoting longer stays to specific customer segments, potentially increasing revenue from extended bookings.
   - **Special Offers:** The observed slight increase at 11 nights could indicate a potential opportunity. Offering special promotions or incentives for an 11-night stay might attract more customers, contributing positively to the business.

2. **Negative Growth Potential:**
   - **Decline in Longer Stays:** The decline in demand for stays beyond 2 nights may indicate a challenge in attracting guests for extended periods. This could lead to negative growth if the business heavily relies on longer bookings. It may be essential to assess the reasons behind this decline and explore strategies to encourage longer stays.
   - **Competitive Analysis:** If competitors in the industry successfully cater to guests seeking longer stays, the business might face negative growth by not capitalizing on this segment. Analyzing competitor strategies can provide insights into potential areas for improvement.

In summary, while there are opportunities for positive business impact, the decline in demand for longer stays raises concerns. It is crucial for the business to strategize ways to address this decline, potentially through targeted marketing, promotions, or enhancing services for guests interested in extended stays.

#### Chart - 6 : Barplot : neighbourhood group vs Mean price

In [None]:

# Creating a function to generate a bar plot depicting mean prices based on a specified column.
def mean_price_barplot(df_, xcol_name, ycol_name, title):
    """
    Generate a bar plot to visualize the mean room rent prices based on specified columns.

    Parameters:
    - df_ (DataFrame): The DataFrame containing the data.
    - xcol_name (str): The name of the column to be plotted on the x-axis.
    - ycol_name (str): The name of the column representing the mean prices to be plotted on the y-axis.
    - title (str): The title for the bar plot.

    Returns:
    None (Displays the bar plot as a plot).
    """
    # Calculating the mean room rent prices
    mean_price = df_.groupby(xcol_name, as_index=False)[ycol_name].mean().sort_values(by=ycol_name, ascending=False)

    # Set the style of seaborn
    sns.set(style="whitegrid")

    # Create a bar plot using Seaborn
    plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
    ax = sns.barplot(x=xcol_name, y=ycol_name, data=mean_price, palette="viridis")

    # Adding labels to the bars representing the mean prices
    for bars in ax.containers:
        ax.bar_label(bars)

    # Customize the plot
    plt.title(title, fontsize=16, fontweight='bold', y = 1.1)
    plt.xlabel(xcol_name, fontweight='bold')
    plt.ylabel('Mean Room Rent Price', fontweight='bold')
    plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

    # Show the plot
    plt.show()


In [None]:
# calling the mean_price_barplot function with df, neighbourhood_group and price as the dataframe, x-column and y-column respectively
title = 'Mean Room Rent Prices vs Neighbourhood groups'
mean_price_barplot(df, 'neighbourhood_group', 'price', title)

##### 1. Why did you pick the specific chart?

Given that the plot involves a numerical variable plotted against a categorical variable, utilizing a bar plot is deemed suitable for illustrating the mean prices of Airbnb listings across various neighborhood groups. In this representation, the bar plot effectively displays the mean prices for each neighborhood group, aligning with the intended objective of the visualization.

##### 2. What is/are the insight(s) found from the chart?

From the above barplot, the following insights were gathered:
*   Manhattan is the costliest neighbourhood group for Airbnb listings, with a mean price of 164.5 USD.
*   Manhattan is followed by Brooklyn, which has the second-highest mean price at 113.7 USD. Brooklyn is followed by Staten Island, Queens and the Bronx, making the Bronx the cheapest neighbourhood group for Airbnb, with a mean price of 82.3 USD.
*   It is also observed that the mean prices for Staten Island and Queens are almost the same, differing by less than 1 USD.
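
The mean can be pulled upward by a few expensive listings, so it may help to view it next to the median and the number of listings in each group. The sketch below assumes the `price` column and a grouping column such as `neighbourhood_group`; the helper name `price_summary` and the sample rows are hypothetical.

```python
import pandas as pd

def price_summary(df_, by_col, price_col="price"):
    """Return mean, median and listing count of prices for each category of `by_col`."""
    summary = (
        df_.groupby(by_col)[price_col]
        .agg(["mean", "median", "count"])
        .round(1)
        .sort_values("mean", ascending=False)
    )
    return summary

# Made-up sample rows, not real Airbnb data
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Manhattan", "Bronx", "Bronx"],
    "price": [100, 200, 300, 50, 70],
})
summary = price_summary(sample, "neighbourhood_group")
print(summary)
```

The same helper can be reused with `'room_type'` as `by_col` for the room type comparison.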










##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Impact:**
1. **Diverse Marketing Strategies:** Insights on neighborhood pricing can facilitate targeted marketing, attracting customers to different areas based on their preferences and budgets.
  
2. **Strategic Positioning:** Highlighting premium options in Manhattan and affordable choices in the Bronx creates a strategic market positioning, appealing to a broader customer base.

3. **Competitive Strategies:** Variations in mean prices allow for targeted promotions, enhancing competitiveness. Offering specials in areas like Staten Island or Queens with similar prices can attract more bookings.

**Potential Negative Growth:**
1. **Overemphasis on Manhattan:** Exclusive focus on high-priced Manhattan might limit opportunities in other boroughs, hindering overall growth potential.
  
2. **Saturation Risk in Manhattan:** Relying solely on Manhattan's premium pricing may lead to market saturation. Exploring and promoting other boroughs is crucial to avoid dependence on a single location.

3. **Overlooking Nuances:** Ignoring differences in customer preferences between Staten Island and Queens, despite similar prices, may result in missed opportunities. Tailoring services to specific neighborhood needs is essential for sustained growth.

#### Chart - 7 : Barplot : Mean room rents for different room types

In [None]:
# calling the mean_price_barplot function with df, room_type and price as the dataframe, x-column name and y-column name respectively
title = "Mean room rents for different types of rooms in listings"
mean_price_barplot(df, 'room_type', 'price', title)

##### 1. Why did you pick the specific chart?

Given that the plot involves a numerical variable plotted against a categorical variable, utilizing a bar plot is deemed suitable for illustrating the mean prices of Airbnb listings across various room types. In this representation, the bar plot effectively displays the mean prices for each room type, aligning with the intended objective of the visualization.

##### 2. What is/are the insight(s) found from the chart?

The bar plot above reveals the following insights:

- Listings categorized as "Entire Home or Apartment" exhibit the highest mean prices, displaying a substantial difference compared to other listing types. This can be attributed to the luxurious nature of the entire living space provided.

- "Private rooms" represent the second-highest mean prices among the three listing types, with an average price of 82.8 USD.

- "Shared rooms" have the lowest mean prices, given their shared nature, making them the most affordable type of listing.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
1. **Pricing Strategy Optimization:** Understanding that "Entire Home or Apartment" listings command the highest mean prices can guide the business in optimizing pricing strategies. Emphasizing the luxury and spaciousness of such accommodations may attract customers willing to pay a premium.

2. **Targeted Marketing:** Recognizing that "Private rooms" have the second-highest mean prices allows for targeted marketing efforts. Tailoring promotions or features that highlight the value of private spaces can attract customers seeking a balance between affordability and comfort.

**Potential Negative Growth:**
1. **Limited Emphasis on Shared Rooms:** While "Shared rooms" have the lowest mean prices due to their nature, neglecting these listings entirely may result in missed opportunities. There could be a market segment seeking budget-friendly options, and failing to address this may limit overall growth.

2. **Overemphasis on Luxury Listings:** Depending solely on the high prices of "Entire Home or Apartment" listings might exclude budget-conscious travelers. It's essential to diversify offerings to cater to a broader customer base and avoid limiting growth to a specific market segment.

In conclusion, while the insights offer opportunities for positive business impact through targeted marketing and pricing optimization, there is a potential risk of negative growth if there's too much focus on high-end listings and neglect of budget-friendly options like shared rooms. Striking a balance in catering to various customer preferences can contribute to sustained and inclusive growth.

#### Chart - 8 : KDE plot : density of listing prices for each room type

In [None]:
# Creating a duplicate of the dataframe.
df1 = df.copy()

# Create dummy variables for categorical columns
df_dummy = pd.get_dummies(df1)
df_dummy.head()

In [None]:
# Chart - 8 visualization code

# Create a figure for the kernel density plot
plt.figure(figsize=(10, 6))

# Plot kernel density for price based on different room types
# (fill=True replaces shade=True, which is deprecated in seaborn 0.11 and later)
ax1 = sns.kdeplot(df_dummy.price[(df_dummy['room_type_Entire home/apt'] == 1)], color='Red', fill=True, label='Entire home/apt')
ax1 = sns.kdeplot(df_dummy.price[(df_dummy['room_type_Private room'] == 1)], ax=ax1, color='Blue', fill=True, label='Private room')
ax1 = sns.kdeplot(df_dummy.price[(df_dummy['room_type_Shared room'] == 1)], ax=ax1, color='Green', fill=True, label='Shared room')

# Set the title for the plot
ax1.set_title('Price Distribution for Different Room Types', fontsize = 16, fontweight = 'bold')

# Add legend for better interpretation (uses the labels passed to kdeplot above)
ax1.legend(loc='upper right')

# Set x-axis and y-axis labels
ax1.set_xlabel('Price')
ax1.set_ylabel('Density')

# Display the kernel density plot
plt.show()


##### 1. Why did you pick the specific chart?

In this representation, a Density plot is employed to visually depict multiple variables, specifically room types and listing prices, in relation to their respective densities. This choice is driven by the clarity with which the plot illustrates the variations in different types of listings across various price ranges, providing an ideal visualization for showcasing the density of bookings.

##### 2. What is/are the insight(s) found from the chart?

The Density plot above reveals the following insights:

- Both Shared Room and Private Room types exhibit a pronounced peak in density at around 40 USD and 60 USD, respectively. The density then gradually decreases as the price increases.

- In contrast, Entire Home/Apt listings demonstrate a gradual increase in density from 75 USD. The plot indicates a relatively constant density from 100 USD to 200 USD, followed by a gradual decline beyond 200 USD. There is a minor peak observed at 325 USD, likely attributable to some luxury listings in this price range.
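
Peak positions read off a density curve are approximate, so they can be cross-checked with a simple histogram. This is a minimal sketch that reports the most populated fixed-width price bin for each room type; the helper name `modal_price_bin`, the 25 USD bin width and the sample rows are assumptions for illustration, and it expects strictly positive prices (zero-price rows were removed during wrangling).

```python
import numpy as np
import pandas as pd

def modal_price_bin(df_, type_col="room_type", price_col="price", bin_width=25):
    """For each room type, return the (left, right) edges of the most populated price bin."""
    result = {}
    for room_type, prices in df_.groupby(type_col)[price_col]:
        # Fixed-width bin edges covering the observed prices of this room type
        edges = np.arange(0, prices.max() + bin_width, bin_width)
        counts, edges = np.histogram(prices, bins=edges)
        top = counts.argmax()
        result[room_type] = (int(edges[top]), int(edges[top + 1]))
    return result

# Made-up sample rows, not real Airbnb data
sample = pd.DataFrame({
    "room_type": ["Shared room"] * 4 + ["Private room"] * 4,
    "price": [35, 40, 45, 90, 55, 60, 65, 120],
})
modes = modal_price_bin(sample)
print(modes)
```

A narrower `bin_width` gives a finer estimate of where each room type peaks, at the cost of noisier counts.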

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
1. **Pricing Optimization and Targeted Marketing:** Insights from the Density plot provide opportunities for pricing optimization and targeted marketing, enabling the business to set competitive prices and tailor promotions to specific price ranges with high demand.

**Potential Negative Growth:**
1. **Risk of Overpricing for Entire Home/Apt:** There's a potential risk of negative growth if Entire Home/Apt listings are overpriced beyond the observed density peak, as customers may be less inclined to book in higher price ranges.

2. **Underestimating Luxury Demand:** Neglecting the demand indicated by the small peak at 325 USD for Entire Home/Apt listings could result in missed opportunities to attract customers seeking premium accommodations, potentially leading to negative growth.

In summary, while there are positive opportunities for optimization and targeted marketing, it's crucial to avoid overpricing and to recognize and capitalize on demand for luxury offerings to ensure sustained growth.

##### Alternative plots

In [None]:
# Violin plot to compare the distribution of 'price' for different 'room_type'
plt.figure(figsize=(12, 8))
sns.violinplot(x='room_type', y='price', data=df, palette='Set3')
plt.title('Price Distribution by Room Type')
plt.show()


In [None]:

# Box plot to compare the distribution of 'price' for different 'room_type'
plt.figure(figsize=(12, 8))
sns.boxplot(x='room_type', y='price', data=df)
plt.title('Price Distribution by Room Type')
plt.show()


#### Chart - 9 :  violin plot : neighbourhood group vs price

In [None]:
# Violin plot to compare the distribution of 'price' for different 'neighbourhood groups'

# Set the size of the plot
plt.figure(figsize=(12, 8))

# 'neighbourhood_group' is on the x-axis, 'price' is on the y-axis, and 'Set3' is the color palette
sns.violinplot(x='neighbourhood_group', y='price', data=df, palette='Set3')

# Set the title of the plot
plt.title('Price Distribution for different Neighbourhood Groups', fontsize = 16, fontweight = 'bold')

# Display the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a violin plot because it effectively compares price distributions across neighborhood groups, offering insights into the density, central tendency, and skewness of prices within each category. Its capacity to handle multimodal distributions and provide a visual summary of key statistics makes it a valuable tool for comparative analysis.

##### 2. What is/are the insight(s) found from the chart?

1. In all neighbourhood groups except Manhattan, there is a distinct peak in listing density around the 50 to 60 USD price range, indicating the highest concentration of listings, followed by a gradual decline as prices increase.

2. Conversely, the Manhattan Neighbourhood exhibits a peak in the higher price range of 80 to 100 USD, with a gradual decline in density from 100 to 200 USD. Subsequently, there is a decrease in booking density, followed by a slight peak at 330 USD, possibly associated with luxury listings.

3. This reaffirms our earlier findings that Manhattan is a comparatively expensive neighborhood for Airbnb bookings.
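
The shape of each violin can be summarized numerically with price quartiles per neighbourhood group. The following is a minimal sketch, assuming the `neighbourhood_group` and `price` columns used above; the helper name `price_quantiles` and the sample rows are hypothetical.

```python
import pandas as pd

def price_quantiles(df_, by_col, price_col="price", qs=(0.25, 0.5, 0.75)):
    """Return the requested price quantiles for each category of `by_col`, one column per quantile."""
    return df_.groupby(by_col)[price_col].quantile(list(qs)).unstack()

# Made-up sample rows, not real Airbnb data
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 5 + ["Bronx"] * 3,
    "price": [80, 100, 120, 200, 330, 40, 50, 60],
})
quartiles = price_quantiles(sample, "neighbourhood_group")
print(quartiles)
```

On the project dataframe, a higher median and a wider gap between the 0.25 and 0.75 columns for Manhattan would be consistent with the wider, higher violin seen in the chart.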

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
1. **Pricing Optimization:** Insights on distinct price peaks and density trends across neighborhoods enable businesses to optimize pricing strategies, tailoring them to the specific preferences and affordability ranges of each area.

2. **Targeted Marketing:** Understanding the concentration of listings at certain price ranges allows for targeted marketing efforts, attracting customers seeking accommodation within those popular price brackets.

**Potential Negative Growth:**
1. **Risk of Overpricing in Manhattan:** The peak at 80 to 100 USD in Manhattan suggests a higher price range for popular listings. However, overpricing beyond this range may lead to decreased booking density, potentially limiting overall growth.

2. **Overlooking Affordability Peaks:** Focusing solely on higher-priced ranges may overlook opportunities in the 50 to 60 USD range, where there is a peak in listing density across all neighborhoods. Neglecting this affordable segment could result in missed bookings and negative growth.

In conclusion, while the insights offer opportunities for pricing optimization and targeted marketing, there is a potential risk of negative growth if pricing strategies in Manhattan are not carefully aligned with market demand, and if the business neglects the affordable price range that shows high listing density. A balanced approach is essential for positive business impact.

In [None]:
df.columns


##### Alternate way of presenting

In [None]:
# Chart - 9 alternate visualization code
# Create a boxplot to visualize the distribution of 'price' for different 'neighbourhood_group'
plt.figure(figsize=(12, 8))

# 'neighbourhood_group' is on the x-axis, 'price' is on the y-axis
# 'data=df' specifies the DataFrame containing the data
sns.boxplot(x='neighbourhood_group', y='price', data=df)

# Set the title of the plot
plt.title('Price Distribution by Neighbourhood Group', fontsize = 16, fontweight = 'bold')

# Display the plot
plt.show()


#### Chart - 10 : Scatterplot : latitude vs longitude, representing the location of the different neighbourhood groups

In [None]:
# Chart - 10 visualization code

# Scatter plot to visualize the relationship between 'latitude' and 'longitude'
plt.figure(figsize=(10, 6))

# Scatter plot: 'latitude' vs 'longitude' with color-coded points for 'neighbourhood_group'
sns.scatterplot(x='longitude', y='latitude', data=df, hue='neighbourhood_group', palette='deep')

plt.title('Airbnb Listings by Latitude and Longitude', fontsize = 16, fontweight = 'bold')
plt.show()

##### 1. Why did you pick the specific chart?

I opted for a scatter plot to visually display the spatial distribution of various neighborhood groups on a map. In this representation, each listing is depicted as a point, making the scatter plot the clear and logical choice for illustrating the geographical layout of the neighborhoods.

##### 2. What is/are the insight(s) found from the chart?

The only information derived from the above graph pertains to the distinct geographic locations of each listing, with a unique color assigned to each neighborhood group.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There appears to be no information in the above graph that could either positively or negatively impact the business. This graph is exclusively utilized for visually representing the locations of each listing based on latitude and longitude, without conveying any specific business-related insights.

#### Chart - 11 : Barplot : neighbourhood group vs average number of days available

In [None]:
# Chart - 11 visualization code
# Bar plot of the mean yearly availability for each category of a column

def barplot_mean_availability(df_, col_name, title):
    """Plot the mean of 'availability_365' for each category of `col_name`."""
    # Calculating the mean availability per category, highest first
    mean_availability = df_.groupby(col_name, as_index=False)['availability_365'].mean().sort_values(by='availability_365', ascending=False)

    # Set the style of seaborn
    sns.set(style="whitegrid")

    # Create a bar plot using Seaborn
    plt.figure(figsize=(10, 6))  # Adjust the figure size as needed
    ax = sns.barplot(x=col_name, y='availability_365', data=mean_availability, palette="viridis")

    # Adding labels to the bars representing the mean availability
    for bars in ax.containers:
        ax.bar_label(bars)

    # Customize the plot
    plt.title(title, fontsize=16, fontweight='bold')
    plt.xlabel(col_name)
    plt.ylabel('Availability (days per year)')
    plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

    # Show the plot
    plt.show()


In [None]:
# calling the barplot function with neighbourhood group as the column variable to plot the mean availability for each neighbourhood group
title = "Average Availability throughout the year for each neighbourhood group"
barplot_mean_availability(df, 'neighbourhood_group', title)

In [None]:
# calling the function plot_pie_chart, with required dataframe, column name and title
title = "Distribution of Airbnb's in different Neighborhood Groups"
plot_pie_chart(df, 'neighbourhood_group', title)

##### 1. Why did you pick the specific chart?

Given that the plot involves a numerical variable plotted against a categorical variable, a bar plot is suitable for illustrating the mean duration of availability throughout the year of Airbnb listings across the neighborhood groups. The bar plot displays the mean availability for each neighborhood group, which matches the objective of the visualization.

##### 2. What is/are the insight(s) found from the chart?

The barplot above reveals the following insights:

*   Listings in Staten Island exhibit the highest average availability, at about 200 days per year. The Bronx follows with an average availability of 165 days.
*   Queens, Manhattan, and Brooklyn come next, with Brooklyn showing the lowest average availability among them.
*   Staten Island has the highest availability despite constituting less than 1% of the total listings. Similarly, the Bronx, with the second-highest availability, comprises only 2.2% of the total listings. This trend persists across the other neighborhood groups, with minor variations in Manhattan.
*   The pattern suggests that neighborhood groups with fewer listings, such as Staten Island and the Bronx, tend to have higher availability throughout the year. Conversely, areas with more listings, such as Manhattan and Brooklyn, tend to have comparatively lower availability.
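
The relationship between the share of listings and availability can be checked by placing both figures in one table. This is a minimal sketch, assuming the `availability_365` column used above; the helper name `availability_vs_share` and the sample rows are hypothetical, and the table only places the two figures side by side without establishing that one causes the other.

```python
import pandas as pd

def availability_vs_share(df_, by_col, avail_col="availability_365"):
    """Return mean availability and percentage share of listings for each category of `by_col`."""
    grouped = df_.groupby(by_col)[avail_col]
    table = pd.DataFrame({
        "mean_availability": grouped.mean().round(1),
        "listing_share_pct": (grouped.size() / len(df_) * 100).round(1),
    })
    return table.sort_values("mean_availability", ascending=False)

# Made-up sample rows, not real Airbnb data
sample = pd.DataFrame({
    "neighbourhood_group": ["Staten Island", "Manhattan", "Manhattan", "Manhattan"],
    "availability_365": [200, 100, 120, 80],
})
table = availability_vs_share(sample, "neighbourhood_group")
print(table)
```

The same helper can be called with `'room_type'` as `by_col` to compare availability and listing share across room types.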






##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
- The insights can potentially have a positive business impact by guiding strategic decisions. For example, the information highlights that despite constituting a small percentage of total listings, neighborhoods like Staten Island and the Bronx have high availability. This knowledge could inform marketing strategies to capitalize on the higher availability, attracting more customers and potentially increasing revenue.

**Negative Growth Implications:**
- The insights don't inherently suggest negative growth, but they do point out that areas with more listings, such as Manhattan and Brooklyn, tend to have lower availability. This could pose challenges in meeting customer demand, potentially leading to missed booking opportunities. Addressing this imbalance in supply and demand may be necessary to avoid negative growth implications in these areas.

#### Chart - 12 : Barplot : room type vs average number of days available

In [None]:
# Chart - 12 : Plotting the mean availability for each room type

# calling the function barplot_mean_availability
title = "Average Availability throughout the year for each room type"
barplot_mean_availability(df, 'room_type', title)

##### 1. Why did you pick the specific chart?

Given that the plot involves a numerical variable plotted against a categorical variable, a bar plot is suitable for illustrating the mean duration of availability throughout the year of Airbnb listings for the different room types. The bar plot displays the mean availability for each room type, which matches the objective of the visualization.

##### 2. What is/are the insight(s) found from the chart?

From the above graph, the following insights were obtained:


*   The Shared room type has the highest average availability, at 161 days.
*   Both Private room and Entire home/apt listings have a similar average availability of 111 days.
*   As with neighbourhood groups vs availability, the pattern suggests that a lower number of listings, as for the Shared room type (2.4%), goes with higher availability throughout the year. Conversely, the room types with more listings, "Private room" and "Entire home/apt", tend to have comparatively lower availability.





##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact:**
- The insights can contribute to a positive business impact by guiding strategic decisions. Knowing that the Shared Room type, despite constituting a smaller percentage of total listings (2.4%), has higher availability, businesses could tailor their marketing efforts to highlight this room type. This might attract more customers looking for increased availability, potentially leading to higher occupancy rates and revenue.

**Negative Growth Implications:**
- While the insights don't inherently suggest negative growth, they do indicate that room types with more listings, such as "Private Rooms" and "Entire Home/Apt," tend to have comparatively lower availability. This could pose challenges in meeting customer demand for these popular room types, potentially resulting in missed booking opportunities and lower overall revenue. It may be necessary for businesses to address this imbalance in supply and demand to avoid negative growth implications in these room categories.

#### Chart - 13 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
df.describe()

In [None]:
# Select relevant columns for correlation analysis
df_corr = df[['latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365']].corr()

# Create a heatmap to visualize the correlation matrix
sns.heatmap(df_corr, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

# Display the heatmap
plt.show()


##### 1. Why did you pick the specific chart?

A correlation matrix is a tabular representation displaying correlation coefficients between variables, with each cell indicating the correlation between two specific variables. It serves the purpose of summarizing data, providing input for more sophisticated analyses, and acting as a diagnostic tool for advanced analytical procedures. The correlation coefficients range from -1 to 1.

To ascertain the correlation among all variables, including the correlation coefficients, I employed a correlation heatmap.

##### 2. What is/are the insight(s) found from the chart?

From the correlation heatmap above, it is evident that there are not many significant relationships among these numeric variables. However, there are some minor correlations, as highlighted below:

- Prices exhibit a negative correlation with longitude, and minimum nights show a negative correlation with reviews per month, both approximately -0.3.

- Reviews per month and the number of reviews, 'host listing counts' and 'availability 365', and 'host listing counts' and 'minimum nights' are all slightly positively correlated, with values of 0.55, 0.23, and 0.24, respectively.
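
Instead of reading individual cells off the heatmap, the strongest pairs can be listed directly from the correlation matrix. The following is a minimal sketch; the helper name `strongest_correlations` and the sample columns `a`, `b`, `c` are hypothetical, and for the project dataframe the `cols` argument would be the same column list passed to `.corr()` above.

```python
import numpy as np
import pandas as pd

def strongest_correlations(df_, cols, top=5):
    """Return the `top` column pairs with the largest absolute correlation."""
    corr = df_[cols].corr()
    # Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by magnitude but keep the sign in the returned values
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.loc[order].head(top)

# Made-up sample columns, not real Airbnb data
sample = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 1, 3, 2],
})
top_pairs = strongest_correlations(sample, ["a", "b", "c"])
print(top_pairs)
```

Sorting by absolute value keeps strong negative correlations near the top, which a plain descending sort would push to the bottom.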

#### Chart - 14 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(data = df, vars = ['latitude', 'longitude', 'price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365'], hue = 'room_type')
plt.show()

##### 1. Why did you pick the specific chart?

A pair plot is used to understand which set of features best explains a relationship between two variables or forms the most separated clusters. It also helps in forming simple classification models by drawing lines or linear separations in the dataset.

Thus, I used a pair plot to analyse the patterns in the data and the relationships between the features. It covers the same variable pairs as the correlation heatmap, but shows each relationship graphically as a scatter plot rather than as a single coefficient.

##### 2. What is/are the insight(s) found from the chart?

The pair plot above shows no clear linear relationships between most variable pairs; most panels display clustered values and offer limited insight. However, one variable stands out from the rest: price. Shared rooms show lower prices across almost all variables, while Entire home/apt listings show higher prices across the board. Private rooms fall between these extremes.
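The price ordering by room type seen in the pair plot can be checked numerically with a grouped summary. The following is a minimal sketch under stated assumptions: the `sample` rows are invented for illustration, so the printed numbers are not results from the Airbnb dataset. On the real data the same `groupby` could be applied to `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only; the prices are invented,
# not taken from the Airbnb dataset.
sample = pd.DataFrame({
    "room_type": ["Shared room", "Shared room", "Private room", "Private room",
                  "Private room", "Entire home/apt", "Entire home/apt", "Entire home/apt"],
    "price": [35, 45, 70, 85, 90, 150, 180, 320],
})

# Count, median and mean price per room type, cheapest first
price_by_room = (
    sample.groupby("room_type")["price"]
    .agg(["count", "median", "mean"])
    .sort_values("median")
)
print(price_by_room)
```

The median is reported alongside the mean because a few very expensive listings can pull the mean upward.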

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Based on the insights gained from the Airbnb dataset analysis, several recommendations can be made to help the client achieve their business objectives:

1. **Optimize Pricing Strategy:**
   - Capitalize on the popularity of Entire home/apartment listings by potentially offering additional amenities or services to enhance their perceived value.
   - Consider dynamic pricing strategies based on room type, ensuring competitive rates for shared rooms while maximizing revenue from more luxurious options.

2. **Enhance Minimum Nights Offerings:**
   - Tailor promotions or incentives for bookings that align with the prevalent trend of 1-3 nights, catering to the majority of host preferences.
   - Investigate the unusual rise observed at 11 days to understand if there's a market demand for longer stays and potentially offer special packages for such durations.

3. **Geographic Targeting:**
   - Focus marketing efforts on Manhattan and Brooklyn, which collectively represent a significant portion of Airbnb listings.
   - Explore opportunities to increase listings in areas with high availability, like Staten Island, to attract guests seeking more flexible booking options.

4. **Room Type Optimization:**
   - Emphasize the advantages of Entire home/apartment stays in marketing materials, given their higher average prices and popularity.
   - Consider promotional campaigns or discounts for shared rooms to increase their desirability and address the lower preference observed in the analysis.

5. **Quality Assurance for Manhattan Listings:**
   - Investigate the reasons behind the wide price range in Manhattan listings and consider implementing quality assurance measures to ensure consistency in offerings.

6. **Diversify Offerings Based on Availability:**
   - Acknowledge the correlation between listing numbers and availability. In areas with higher availability, consider introducing diverse offerings or special promotions to attract a broader audience.

7. **Continuous Monitoring and Adaptation:**
   - Regularly monitor trends in customer preferences, pricing dynamics, and availability to adapt strategies accordingly.
   - Stay informed about changes in the competitive landscape and adjust offerings to remain competitive in the Airbnb market.

By implementing these recommendations, the client can align their business strategy with the observed patterns in the Airbnb dataset, enhancing their competitiveness, attracting a broader range of guests, and maximizing revenue opportunities.

# **Conclusion**

In conclusion, the Exploratory Data Analysis (EDA) conducted on Airbnb's extensive dataset has provided valuable insights into the complex dynamics of the platform, catering to both guests and hosts. The primary objective of understanding customer and host behaviors, preferences, and influencing factors has been successfully achieved through a systematic and comprehensive approach.

**Key Findings:**

1. **Room Type Preferences:**
   - The analysis revealed a clear preference among customers for Entire home/apartment listings (52%) even though it was the costliest choice, emphasizing a desire for privacy and spacious accommodations.
   - Shared rooms, despite being the most economical option, constituted only 2.4% of total listings, indicating a lower preference among guests.

2. **Pricing Dynamics:**
   - Prices were found to be primarily influenced by the type of room, with Entire home/apartment listings being the costliest on average (mean price of 180 USD).
   - The price distribution analysis unveiled distinct ranges for each room type, emphasizing the uniqueness in pricing dynamics.
   - A significant majority of Airbnb room rents fall within the range of 10 to 200 USD. Notably, the 20-100 USD range contributed almost half of the total density.
   - Specific price ranges were associated with certain room types, with shared rooms dominating the 10-50 USD range and private rooms leading in the 51-100 USD range. An intriguing anomaly was observed at the 300 USD mark, primarily attributed to Entire home/apartment listings, indicative of a demand for more luxurious stays within the 100-150 USD range.
   - Examining the distribution of prices for different room types, density plots showed that shared rooms and private rooms have concentrated price ranges, mainly 10-75 USD and 40-100 USD, respectively. In contrast, Entire home/apartment listings exhibited a broad price spectrum, from 75 to 250 USD, with a discernible peak at 325 USD. This peak is attributed to the spaciousness and luxurious features of these entire home/apartment stays.

3. **Neighborhood Influence:**
   - Manhattan and Brooklyn emerged as the dominant contributors, together constituting 85% of total listings. They are followed by Queens, the Bronx, and Staten Island; Staten Island contributes less than 1% of total listings.
   - Manhattan stood out as the costliest neighborhood, with a mean price of 165 USD, followed by Brooklyn, Staten Island, Queens, and the Bronx.
   - No significant correlation was observed between latitude/longitude values and listing prices, emphasizing the importance of neighborhood-specific factors.

4. **Minimum Nights Booking Preferences:**
   - The analysis of minimum-night requirements showed that hosts predominantly require minimum stays of 1 to 3 nights. An unexpected slight rise was noted at 11 days, indicating a unique pattern in booking behavior.

5. **Availability Trends:**
   - Availability was inversely related to the number of listings in an area, with Staten Island having the highest mean availability (200 days) and Manhattan and Brooklyn exhibiting lower availability.
   - A similar pattern was observed across room types, with shared rooms having the highest mean availability.
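The relationship between listing counts and availability described above can be sketched as a per-area summary. The `sample` rows below are hypothetical, invented only to illustrate the calculation, and the column name `neighbourhood_group` is an assumption about how the area is stored in the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only; values are invented,
# not taken from the Airbnb dataset.
sample = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 4 + ["Brooklyn"] * 3
                           + ["Queens"] * 2 + ["Staten Island"],
    "availability_365": [90, 120, 60, 110, 100, 130, 85, 150, 140, 200],
})

# Listing count and mean availability per area
by_area = sample.groupby("neighbourhood_group").agg(
    listings=("availability_365", "size"),
    mean_availability=("availability_365", "mean"),
)
by_area["share_of_listings"] = by_area["listings"] / by_area["listings"].sum()

# A negative value means areas with more listings tend to have lower availability
area_corr = by_area["listings"].corr(by_area["mean_availability"])

print(by_area.sort_values("listings", ascending=False))
print(round(area_corr, 2))
```

With only five areas in the real dataset, a correlation computed this way rests on very few points and is best read as descriptive rather than conclusive.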

**Recommendations for Business Strategy:**

1. **Tailored Marketing Strategies:**
   - Customize marketing initiatives to highlight the popularity of Entire home/apartment stays and encourage longer stays through targeted promotions.

2. **Diversification of Offerings:**
   - Explore opportunities to increase shared room listings by introducing unique features or cost-effective packages, aligning with the market trend favoring privacy and spacious options.

3. **Neighborhood-Specific Campaigns:**
   - Develop targeted campaigns to promote listings in underrepresented neighborhoods, leveraging the unique attractions of each area.

4. **Quality Assurance Measures:**
   - Implement quality assurance measures for Manhattan listings to ensure a consistent and desirable experience, addressing the wide price range observed in the neighborhood.

5. **Dynamic Pricing Strategy:**
   - Implement a dynamic pricing strategy considering room types and neighborhood demand to maximize revenue.

The culmination of these findings and recommendations aims to guide Airbnb in making informed decisions for the enhancement of its platform, ensuring a better experience for both guests and hosts. The strategic implementation of these insights is pivotal for the continued success and growth of Airbnb in the competitive landscape of short-term rentals and experiences.
