##**AIRBNB ANALYSIS**

In [None]:
# Loading important libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
Airbnb_df = pd.read_csv('Airbnb_data.csv')
Airbnb_df

##**About the Dataset – Airbnb Bookings**

*   This Airbnb dataset contains nearly 49,000 observations from New York City, with 16 columns of data.

*   The data includes both categorical and numeric values, providing a diverse range of information about the listings.

*   The dataset can be used to analyze trends and patterns in the New York Airbnb market and to gain insight into the preferences and behavior of Airbnb users in the area.

*   This dataset contains information about Airbnb bookings in New York City in 2019, so the trends and patterns found here describe Airbnb use in NYC for that year.
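
The row count, column count, and mix of categorical and numeric columns described above can be checked directly from the frame. The sketch below is a minimal illustration: `summarize_frame` is a hypothetical helper, and the toy rows are made up so the example runs without `Airbnb_data.csv`.

```python
import pandas as pd

def summarize_frame(df):
    """Return row/column counts and the names of numeric vs. non-numeric columns."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    other_cols = [c for c in df.columns if c not in numeric_cols]
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "numeric": numeric_cols,
        "categorical": other_cols,
    }

# Made-up rows for illustration only; the real data comes from Airbnb_data.csv
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens"],
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "price": [150, 80, 45],
    "minimum_nights": [1, 3, 2],
})

summary = summarize_frame(toy_df)
print(summary)
```

Calling `summarize_frame(Airbnb_df)` after loading the CSV should report the figures quoted in the bullets, assuming the file matches that description.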

##**UNDERSTAND THE GIVEN VARIABLES**

**Listing_id :-** This is a unique identifier for each listing in the dataset.

**Listing_name :-** This is the name or title of the listing, as it appears on the Airbnb website.

**Host_id :-** This is a unique identifier for each host in the dataset.

**Host_name :-** This is the name of the host as it appears on the Airbnb website.

**Neighbourhood_group :-** This is a grouping of neighborhoods in New York City, such as Manhattan or Brooklyn.

**Neighbourhood :-** This is the specific neighborhood in which the listing is located.

**Latitude :-** This is the geographic latitude of the listing.

**Longitude :-** This is the geographic longitude of the listing.

**Room_type :-** This is the type of room or property being offered, such as an entire home, private room, or shared room.

**Price :-** This is the nightly price for the listing, in US dollars.

**Minimum_nights :-** This is the minimum number of nights that a guest must stay at the listing.

**Total_reviews :-** This is the total number of reviews that the listing has received.

**Reviews_per_month :-** This is the average number of reviews that the listing receives per month.

**Last_review :-** This is the date of the most recent review received for the listing.

**Host_listings_count :-** This is the total number of listings that the host has on Airbnb.

**Availability_365 :-** This is the number of days in the next 365 days that the listing is available for booking.

In [None]:
Airbnb_df["neighbourhood_group"].unique()

In [None]:
Airbnb_df["room_type"].unique()


**Steps to be performed in this case study:**


EDA (exploratory data analysis)

Visual Analysis

Statistical Analysis

In [None]:
# EDA (exploratory data analysis)

In [None]:
rename_col = {'id':'listing_id','name':'listing_name','number_of_reviews':'total_reviews','calculated_host_listings_count':'host_listings_count'}

In [None]:
# use a pandas function to rename the current columns
Airbnb_df = Airbnb_df.rename(columns = rename_col)
Airbnb_df

In [None]:
Airbnb_df.isnull().sum()

In [None]:
Airbnb_df.info()

In [None]:
Airbnb_df=Airbnb_df.drop(["last_review"],axis=1)

In [None]:
Airbnb_df.isnull().sum()

In [None]:
Airbnb_df["reviews_per_month"]=Airbnb_df["reviews_per_month"].fillna(0)

In [None]:
Airbnb_df.isnull().sum()


In [None]:
Airbnb_df.dropna(inplace=True)

In [None]:
Airbnb_df.isnull().sum()


In [None]:
Airbnb_df

In [None]:
# duplicate values
Airbnb_df.duplicated().sum()

In [None]:
Airbnb_df["price"].value_counts()

In [None]:
# Outlier analysis for the price column
plt.boxplot(Airbnb_df["price"])
plt.show()

In [None]:
Q1=Airbnb_df["price"].quantile(0.25)
Q3=Airbnb_df["price"].quantile(0.75)
IQR=Q3-Q1

In [None]:
uf=Q3+1.5*IQR
lf=Q1-1.5*IQR
print(uf,lf)


In [None]:
Airbnb_df=Airbnb_df[(Airbnb_df["price"]<=uf) & (Airbnb_df["price"]>=lf)]
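
The filter above keeps only prices inside the 1.5 × IQR fences. As a worked illustration of the same rule, the sketch below applies it to a handful of made-up prices; `iqr_fences` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower_fence, upper_fence) using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up prices; 1000 is a deliberate outlier
toy_prices = pd.Series([40, 50, 60, 70, 80, 90, 100, 1000])

lower, upper = iqr_fences(toy_prices)
# between() is inclusive on both ends, matching the <= and >= used above
kept = toy_prices[toy_prices.between(lower, upper)]
print(lower, upper, len(kept))
```

For a non-negative column such as price, the lower fence can fall below zero, in which case only the upper fence actually removes rows.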


In [None]:
Airbnb_df

## **Visual Analysis**

In [None]:
# Create a figure with a custom size
plt.figure(figsize=(12, 5))


# Create a histogram of the 'price' column of the Airbnb_df dataframe
# with a density curve, using sns.histplot (sns.distplot is deprecated)
# and specifying the color as red
sns.histplot(Airbnb_df['price'], kde=True, stat='density', color='r')

# Add labels to the x-axis and y-axis
plt.xlabel('Price', fontsize=14)
plt.ylabel('Density', fontsize=14)

# Add a title to the plot
plt.title('Distribution of Airbnb Prices',fontsize=15)

**Inferences**

-> Most prices fall between 40 and 100.

-> As the price increases, the density decreases.
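
The first inference can be backed by a number rather than read off the plot. A minimal sketch, assuming made-up prices and a hypothetical helper `share_in_band`:

```python
import pandas as pd

def share_in_band(prices, low, high):
    """Fraction of values with low <= price <= high."""
    return prices.between(low, high).mean()

# Made-up prices for illustration only
toy_prices = pd.Series([35, 45, 60, 75, 90, 100, 120, 150, 200, 250])

share = share_in_band(toy_prices, 40, 100)
print(f"{share:.0%} of the toy listings cost between 40 and 100")
```

On the real data the equivalent call would be `share_in_band(Airbnb_df['price'], 40, 100)`.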


In [None]:
# Set the figure size
plt.figure(figsize=(12, 8))

# Create a countplot of the neighbourhood group data
sns.countplot(data=Airbnb_df, x='neighbourhood_group')

# Set the title of the plot
plt.title('Neighbourhood_group Listing Counts in NYC', fontsize=15)

# Set the x-axis label
plt.xlabel('Neighbourhood_Group', fontsize=14)

# Set the y-axis label
plt.ylabel('Total Listings Counts', fontsize=14)

# Show the plot
plt.show()

**Inferences**

-> Brooklyn and Manhattan have the most listings.

-> Staten Island is the least represented, with the fewest listings.

In [None]:
# Set the figure size
plt.figure(figsize=(10, 6))

# Get the room type counts
room_type_counts = Airbnb_df['room_type'].value_counts()

# Set the labels and sizes for the pie chart
labels = room_type_counts.index
sizes = room_type_counts.values

# Create the pie chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%')

# Add a legend to the chart
plt.legend(title='Room Type')

# Show the plot
plt.show()


In [None]:
room_type_counts

**Inferences**

-> Entire home and private room are the most preferred room types.

-> Shared room is the least preferred.

In [None]:

# Create the point plot
sns.pointplot(x = 'neighbourhood_group', y='price', data=Airbnb_df, estimator = np.mean)

# Add axis labels and a title
plt.xlabel('Neighbourhood Group',fontsize=14)
plt.ylabel('Average Price',fontsize=14)
plt.title('Average Price by Neighbourhood Group',fontsize=15)

**Inferences**

-> Manhattan has the most expensive listings on average.

-> The Bronx has the cheapest listings on average.
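
The ranking shown in the point plot can also be tabulated. The sketch below uses made-up prices to show one way of doing it with `groupby`; the median and count are added alongside the mean only as context for the comparison.

```python
import pandas as pd

# Made-up rows for illustration only
toy_df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn",
                            "Brooklyn", "Bronx", "Bronx"],
    "price": [200, 160, 120, 100, 70, 50],
})

# Mean, median and listing count per group, most expensive first
price_summary = (
    toy_df.groupby("neighbourhood_group")["price"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
print(price_summary)
```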



## **Statistical Analysis**





***Is there a significant difference in the average room prices between different neighborhood groups?***

Null Hypothesis (H0): There is no significant difference in the average room prices between different neighborhood groups.

Alternate Hypothesis (H1): There is a significant difference in the average room prices between different neighborhood groups.

In [None]:
from scipy.stats import f_oneway

data = Airbnb_df[['price', 'neighbourhood_group']]

# Perform ANOVA
neighbourhood_groups = data['neighbourhood_group'].unique()
grouped_data = [data['price'][data['neighbourhood_group'] == group] for group in neighbourhood_groups]


In [None]:
data

In [None]:
grouped_data

In [None]:
neighbourhood_groups

In [None]:
# Perform one-way ANOVA
statistic, p_value = f_oneway(*grouped_data)

# Display results
print("One-way ANOVA results:")
print(f"Statistic: {statistic}")
print(f"P-value: {p_value}")



In [None]:
# Interpret the results
if p_value < 0.05:
    print("There is a significant difference in the average room prices between different neighborhood groups.")
else:
    print("There is no significant difference in the average room prices between different neighborhood groups.")
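
To see what the statistic from `f_oneway` measures, it can be reproduced from the between-group and within-group sums of squares. This is a sketch on made-up numbers; `anova_f_by_hand` is a hypothetical helper written for illustration.

```python
import numpy as np
from scipy.stats import f, f_oneway

def anova_f_by_hand(groups):
    """Return (F, p) for a one-way ANOVA computed from sums of squares."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)          # number of groups
    n = len(all_values)      # total number of observations
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    p = f.sf(f_stat, k - 1, n - k)   # right-tail probability
    return f_stat, p

# Made-up price samples for three groups
toy_groups = [
    np.array([100.0, 120.0, 140.0]),
    np.array([80.0, 90.0, 100.0]),
    np.array([50.0, 60.0, 70.0]),
]

manual_f, manual_p = anova_f_by_hand(toy_groups)
scipy_f, scipy_p = f_oneway(*toy_groups)
print(manual_f, scipy_f)
```

A large F means the group means differ by more than the spread inside each group would suggest.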

***Is there a significant difference in the average room prices between Airbnb rentals in Brooklyn and Manhattan?***

Null Hypothesis (H0): There is no significant difference in the average room prices between Brooklyn and Manhattan.

Alternate Hypothesis (H1): There is a significant difference in the average room prices between Brooklyn and Manhattan.

In [None]:
brooklyn_prices = Airbnb_df[Airbnb_df['neighbourhood_group'] == 'Brooklyn']['price']
manhattan_prices = Airbnb_df[Airbnb_df['neighbourhood_group'] == 'Manhattan']['price']

In [None]:
brooklyn_prices

In [None]:
manhattan_prices

In [None]:
brooklyn_prices.mean()

In [None]:
manhattan_prices.mean()

In [None]:
# Before running the two-sample t-test, compare the variances of the two samples

In [None]:
# let's apply the f test to check the variance of the samples
from scipy.stats import f

In [None]:
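# F statistic: ratio of the two sample variances (not the means)
F=brooklyn_prices.var()/manhattan_prices.var()

In [None]:
F

In [None]:
df1=len(brooklyn_prices)-1
df2=len(manhattan_prices)-1

In [None]:
# f.cdf takes the statistic first, then the numerator and denominator degrees of freedom;
# double the smaller tail for a two-sided test
p_value=2*min(f.cdf(F,df1,df2),f.sf(F,df1,df2))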

In [None]:
p_value

In [None]:
# Interpret the results
if p_value < 0.05:
    print("Variances of both the samples are not equal.")
else:
    print("Variances of both the samples are equal.")
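
The variance-ratio F test is sensitive to departures from normality, so it can be worth cross-checking it against Levene's test (`scipy.stats.levene`), which is a different technique from the one used above. The sketch below runs both on made-up samples; `variance_ratio_test` is a hypothetical helper.

```python
import numpy as np
from scipy.stats import f, levene

def variance_ratio_test(sample_a, sample_b):
    """Two-sided F test for equal variances; returns (F, p_value)."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    f_stat = a.var(ddof=1) / b.var(ddof=1)   # ratio of sample variances
    dfn, dfd = len(a) - 1, len(b) - 1
    # Double the smaller tail so the result does not depend on sample order
    p = 2 * min(f.cdf(f_stat, dfn, dfd), f.sf(f_stat, dfn, dfd))
    return f_stat, min(p, 1.0)

# Made-up samples: toy_b is far more spread out than toy_a
toy_a = [70, 75, 80, 85, 90, 95, 100, 105]
toy_b = [20, 60, 100, 140, 180, 220, 260, 300]

f_stat, p_f = variance_ratio_test(toy_a, toy_b)
levene_stat, p_levene = levene(toy_a, toy_b)
print(f_stat, p_f, p_levene)
```

If the two tests disagree on real data, the Welch form of the t-test (`equal_var=False`) is the safer choice because it does not assume equal variances.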

In [None]:
from scipy.stats import ttest_ind

In [None]:
t_stats, p_value=ttest_ind(brooklyn_prices,manhattan_prices, equal_var=False)

In [None]:
p_value

In [None]:
if p_value < 0.05:
    print("Reject the null hypothesis. There is a significant difference in average room prices between Brooklyn and Manhattan.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in average room prices between Brooklyn and Manhattan.")
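
With `equal_var=False`, `ttest_ind` performs Welch's t-test. As a check on what that computes, the sketch below rebuilds the statistic, the Welch-Satterthwaite degrees of freedom, and the two-sided p-value on made-up samples; `welch_t_by_hand` is a hypothetical helper.

```python
import numpy as np
from scipy.stats import t, ttest_ind

def welch_t_by_hand(sample_a, sample_b):
    """Return (t_stat, dof, p_value) for Welch's unequal-variance t-test."""
    a = np.asarray(sample_a, dtype=float)
    b = np.asarray(sample_b, dtype=float)
    # Squared standard error of each sample mean
    va = a.var(ddof=1) / len(a)
    vb = b.var(ddof=1) / len(b)
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    p = 2 * t.sf(abs(t_stat), dof)   # two-sided p-value
    return t_stat, dof, p

# Made-up price samples
toy_a = [100, 120, 140, 160, 180]
toy_b = [80, 90, 100, 110, 120, 130]

manual_t, manual_dof, manual_p = welch_t_by_hand(toy_a, toy_b)
scipy_t, scipy_p = ttest_ind(toy_a, toy_b, equal_var=False)
print(manual_t, scipy_t)
```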

***Is there a significant association between the neighborhood group and the room type chosen by Airbnb guests?***

Null Hypothesis (H0): Neighborhood group and room type are independent (no association).

Alternate Hypothesis (H1): Neighborhood group and room type are dependent (there is an association between them).


In [None]:
from scipy.stats import chi2_contingency


In [None]:
contingency_table = pd.crosstab(Airbnb_df['neighbourhood_group'], Airbnb_df['room_type'])

In [None]:
contingency_table

In [None]:
chi2_stat, p_value, dof, expected = chi2_contingency(contingency_table)

In [None]:
p_value

In [None]:
if p_value < 0.05:
    print("Reject the null hypothesis. There is a significant association between neighborhood group and room type.")
else:
    print("Fail to reject the null hypothesis. There is no significant association between neighborhood group and room type.")
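
The `expected` array returned by `chi2_contingency` holds the counts that would be seen if the two variables were independent: row total × column total ÷ grand total. The sketch below verifies this on a made-up 2×2 table. `correction=False` is set only because SciPy applies Yates' continuity correction by default when there is a single degree of freedom, which would otherwise make the hand-computed statistic differ.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts for illustration only
toy_table = pd.DataFrame(
    [[30, 10], [20, 40]],
    index=["Manhattan", "Brooklyn"],
    columns=["Entire home/apt", "Private room"],
)

observed = toy_table.to_numpy()
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected counts under independence
expected_by_hand = np.outer(row_totals, col_totals) / grand_total
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()

toy_chi2, toy_p, toy_dof, toy_expected = chi2_contingency(toy_table, correction=False)
print(expected_by_hand)
print(chi2_by_hand, toy_chi2)
```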