<a href="https://colab.research.google.com/github/suchismita-priya/airbnb_booking_analysis/blob/main/airbnb_booking_analysis_final_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Airbnb Booking Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member 1 -** suchismita priyadarsinee

# **Project Summary -**

Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique,
personalized way of experiencing the world. Airbnb has since become a one-of-a-kind service that is used and recognized around the world.
Analysis of the millions of listings provided through Airbnb is crucial for the company.
These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers'
and providers' (hosts') behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.
This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values.

# **GitHub Link -**

https://github.com/suchismita-priya/airbnb_booking_analysis

# **Problem Statement**



* What can we learn about different hosts and areas?
* What can we learn from predictions? (ex: locations, prices, reviews, etc)
* Which hosts are the busiest and why?
* Is there any noticeable difference of traffic among different areas and what could be the reason for it?



# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive/')

In [None]:
df = pd.read_csv('/content/drive/My Drive/Airbnb NYC 2019.csv')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
import missingno as msno

In [None]:
msno.bar(df)

### What did you know about your dataset?

The dataset has 48,895 rows and 16 columns.

**id :** a unique id identifying an Airbnb listing

**name :** name of the listing

**host_id :** a unique id identifying an Airbnb host

**host_name :** name under which the host is registered

**neighbourhood_group :** a group of neighbourhoods (for example, Manhattan)

**neighbourhood :** the area within the neighbourhood_group where the listing is located

**latitude :** latitude coordinate of the listing

**longitude :** longitude coordinate of the listing

**room_type :** category of the listing's room

**price :** price of the listing

**minimum_nights :** the minimum number of nights required for a single stay

**number_of_reviews :** total number of reviews given by guests

**last_review :** date of the most recent review

**reviews_per_month :** average number of reviews per month

**calculated_host_listings_count :** total number of listings registered under the host

**availability_365 :** the number of days in a year for which the listing is available

Four columns contain null values:

**name, host_name, last_review, reviews_per_month.**

We will fill these null values using `fillna(0)`.
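
A minimal sketch of one way to handle these nulls, using a tiny made-up frame (`sample`) in place of the real dataset. Instead of a blanket `fillna(0)`, it swaps in column-specific fill values so that text columns do not end up holding the number 0. The placeholder `'Unknown'` and the choice to leave `last_review` as a missing date are assumptions made for illustration, not steps taken in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample rows standing in for the real dataset (illustration only)
sample = pd.DataFrame({
    'name': ['Cozy loft', None, 'Sunny room'],
    'host_name': ['Ann', 'Bob', None],
    'last_review': ['2019-05-01', None, '2019-06-15'],
    'reviews_per_month': [0.5, np.nan, 1.2],
})

# Column-specific fill values (assumed placeholders)
fill_values = {'name': 'Unknown', 'host_name': 'Unknown', 'reviews_per_month': 0}
cleaned = sample.fillna(value=fill_values)

# Parse the review dates; rows without a review stay missing (NaT)
cleaned['last_review'] = pd.to_datetime(cleaned['last_review'])

# Remaining null count per column
print(cleaned.isnull().sum())
```

A listing with no reviews has no meaningful "last review" date, so leaving that value missing may be more honest than filling it with 0.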

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
column_to_count = df.columns
column_to_count

In [None]:
# Dataset Describe
df.describe()

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for x in column_to_count:
  print(df[x].unique())

In [None]:
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Keep only the columns needed for the analysis
new_df = df[['id','name','host_id','host_name','neighbourhood_group','neighbourhood','room_type','price','minimum_nights',
             'number_of_reviews','calculated_host_listings_count','availability_365']]
new_df.head()

### What all manipulations have you done and insights you found?

## 1. What can we learn about different hosts and areas?

In [None]:
host_areas = new_df.groupby(['host_name','neighbourhood_group'])['calculated_host_listings_count'].max().reset_index()
host_areas.sort_values(by='calculated_host_listings_count',ascending = False)

The output above shows that the host with the most listings is Sonder, and those listings are in Manhattan.
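
Host names are not guaranteed to be unique, so grouping by `host_name` alone can merge different hosts. The sketch below, on made-up rows (`sample`), counts listings per host and area while keeping `host_id` in the grouping. The column name `listing_count` is a hypothetical name chosen for this example.

```python
import pandas as pd

# Made-up rows (illustration only); two different hosts share the name 'Alex'
sample = pd.DataFrame({
    'host_id': [1, 1, 1, 2, 3, 3],
    'host_name': ['Alex', 'Alex', 'Alex', 'Alex', 'Sam', 'Sam'],
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Brooklyn',
                            'Queens', 'Brooklyn', 'Brooklyn'],
})

# Count listings per host and area; host_id keeps same-named hosts apart
listings_per_host = (
    sample.groupby(['host_id', 'host_name', 'neighbourhood_group'])
          .size()
          .reset_index(name='listing_count')
          .sort_values('listing_count', ascending=False)
)
print(listings_per_host)
```

Counting rows directly also avoids relying on the precomputed `calculated_host_listings_count` column.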

## 2. What can we learn from predictions? (ex: locations, prices, reviews, etc)



In [None]:
areas_reviews = new_df.groupby(['neighbourhood_group'])['number_of_reviews'].max().reset_index()
areas_reviews

In [None]:
areas = areas_reviews['neighbourhood_group']
reviews = areas_reviews['number_of_reviews']

In [None]:
areas

In [None]:
reviews

In [None]:
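# Share of listings in each neighbourhood group (value_counts counts listings, not reviews)
area_counts = new_df.neighbourhood_group.value_counts()
plt.pie(area_counts, labels=area_counts.index, autopct='%1.1f%%', startangle=180)
plt.title('Share of listings by neighbourhood group')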

In [None]:
prices_reviews = new_df.groupby(['price'])['number_of_reviews'].max().reset_index()
prices_reviews

In [None]:
new_price = prices_reviews['price']
new_review = prices_reviews['number_of_reviews']

In [None]:
plt.scatter(new_price , new_review)
plt.title('Reviews according to price')
plt.xlabel('Price')
plt.ylabel('Maximum number of reviews')

The scatter plot shows that the highest review counts occur at lower prices, which suggests that most guests prefer lower-priced stays.
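
The scatter plot above uses the maximum review count at each price, which a single heavily reviewed listing can dominate. As a rough cross-check, the sketch below bins prices into bands and takes the median review count per band. The rows in `sample` are made up and the band edges are arbitrary assumptions, so the numbers illustrate the method only.

```python
import pandas as pd

# Made-up rows (illustration only)
sample = pd.DataFrame({
    'price': [40, 60, 90, 120, 180, 250, 400, 900],
    'number_of_reviews': [120, 80, 95, 40, 30, 12, 5, 1],
})

# Price bands are arbitrary assumptions chosen for this example
bins = [0, 100, 200, 500, 10000]
labels = ['0-100', '101-200', '201-500', '500+']
sample['price_band'] = pd.cut(sample['price'], bins=bins, labels=labels)

# Median is less sensitive to one heavily reviewed listing than max
reviews_by_band = (
    sample.groupby('price_band', observed=True)['number_of_reviews'].median()
)
print(reviews_by_band)
```

If the medians fall steadily as the price band rises on the real data, that would support the reading that lower-priced listings attract more reviews.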

## 3. Which hosts are the busiest and why?

In [None]:
busy_host = new_df.groupby(['host_id','host_name','room_type'])['number_of_reviews'].max().reset_index()
busy_host.sort_values(by='number_of_reviews',ascending=False).head(10)

In [None]:
# Sort once, then take the 25 rows with the most reviews
top_hosts = busy_host.sort_values(by='number_of_reviews',ascending=False).head(25)
new_hostname = top_hosts['host_name']
new_noreview = top_hosts['number_of_reviews']

In [None]:
new_hostname

In [None]:
new_noreview

In [None]:
import warnings

In [None]:
warnings.filterwarnings("ignore")  # silence library warnings raised while plotting
plt.figure(figsize=(20,6))
plt.bar(new_hostname , new_noreview)
plt.xticks(rotation=45,ha='right')
plt.title('Reviews according to host name')
plt.xlabel('Host name')
plt.ylabel('Maximum number of reviews')

From the chart above, the busiest hosts, measured by the highest review count on a single listing, are:
1.   Dona
2.   Ji
3.   Maya
4.   Carol
5.   Danielle
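
The ranking above is based on the maximum review count of a single listing. A host with many moderately reviewed listings could be busier overall, so the sketch below compares both views on made-up rows (`sample`). The host names and the output column names (`total_reviews`, `top_listing_reviews`, `listings`) are hypothetical and chosen for this example.

```python
import pandas as pd

# Made-up rows (illustration only)
sample = pd.DataFrame({
    'host_id': [10, 10, 20, 30, 30, 30],
    'host_name': ['Host A', 'Host A', 'Host B', 'Host C', 'Host C', 'Host C'],
    'number_of_reviews': [300, 250, 400, 100, 90, 80],
})

# Per host: total reviews, best single listing, and number of listings
host_activity = (
    sample.groupby(['host_id', 'host_name'])['number_of_reviews']
          .agg(total_reviews='sum', top_listing_reviews='max', listings='count')
          .reset_index()
          .sort_values('total_reviews', ascending=False)
)
print(host_activity)
```

In this sample, Host B has the most-reviewed single listing, but Host A has more reviews in total, so the two measures can rank hosts differently.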







## 4. Is there any noticeable difference of traffic among different areas and what could be the reason for it?

In [None]:
traffic_areas = new_df.groupby(['neighbourhood_group','room_type'])['minimum_nights'].count().reset_index()
traffic_areas = traffic_areas.sort_values(by='minimum_nights', ascending=False)
traffic_areas

In [None]:
room_type = traffic_areas['room_type']
stayed = traffic_areas['minimum_nights']

# count() above gives the number of listings per group, not nights stayed
plt.bar(room_type, stayed)
plt.xlabel("Room Type")
plt.ylabel("Number of listings")
plt.title("Listings by Room Type")

The chart above shows that entire homes/apartments and private rooms account for most listings, which suggests that people prefer these room types.
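
Because each room type appears once per neighbourhood group, a single bar per room type hides the per-area breakdown. A cross-tabulation keeps both dimensions visible. The sketch below uses made-up rows (`sample`), and the share calculation is an added illustration rather than a step from this notebook.

```python
import pandas as pd

# Made-up rows (illustration only)
sample = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Manhattan',
                            'Brooklyn', 'Brooklyn', 'Queens'],
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Private room',
                  'Private room', 'Entire home/apt', 'Shared room'],
})

# Rows: neighbourhood group, columns: room type, values: number of listings
listing_counts = pd.crosstab(sample['neighbourhood_group'], sample['room_type'])

# Share of each room type within a neighbourhood group (each row sums to 1)
room_type_share = listing_counts.div(listing_counts.sum(axis=1), axis=0).round(2)

print(listing_counts)
print(room_type_share)
```

Calling `listing_counts.plot(kind='bar')` on such a table would draw one group of bars per neighbourhood group, which makes differences between areas easier to compare.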

# **Conclusion**

1. Guests who choose an entire home or apartment tend to stay somewhat longer in that neighbourhood.
2. Guests who choose a private room tend to stay for a shorter time than those in an entire home or apartment.
3. Most guests prefer lower-priced listings.
4. A neighbourhood group with a higher number of reviews is likely a tourist destination.
5. Guests who do not stay more than one night are likely travellers.
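
Points 1, 2 and 5 can be checked against the `minimum_nights` column. The sketch below shows one possible check on made-up rows (`sample`). Note that `minimum_nights` records the minimum stay a listing requires, so it is only a proxy for how long guests stay.

```python
import pandas as pd

# Made-up rows (illustration only)
sample = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Entire home/apt',
                  'Private room', 'Private room', 'Private room'],
    'minimum_nights': [3, 5, 30, 1, 2, 3],
})

# Typical minimum stay per room type; the median resists extreme values like 30
nights_by_room = sample.groupby('room_type')['minimum_nights'].agg(['median', 'mean'])

# Fraction of listings that accept one-night stays
one_night_share = (sample['minimum_nights'] == 1).mean()

print(nights_by_room)
print(round(one_night_share, 2))
```

Comparing the median with the mean shows how much a few long-stay listings pull the average upward.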

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***