# **Problem Statement**

## <b> Since 2008, guests and hosts have used Airbnb to expand travel possibilities and to offer a more unique, personalized way of experiencing the world. Today, Airbnb is a one-of-a-kind service that is used and recognized around the world. Analysis of the millions of listings on Airbnb is crucial for the company. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers' and hosts' behavior and performance on the platform, guiding marketing initiatives, implementing innovative additional services and much more. </b>

## <b>This dataset has around 49,000 observations and 16 columns, with a mix of categorical and numeric values. </b>

## <b> Explore and analyze the data to discover key insights (not limited to these), such as:
* What can we learn about different hosts and areas?
* What can we learn from predictions (e.g. locations, prices, reviews)?
* Which hosts are the busiest and why?
* Is there any noticeable difference in traffic among different areas, and what could be the reason for it? </b>

# **About the dataset:**

Airbnb is an online marketplace connecting travelers with local hosts. On one side, the platform enables people to list their available space and earn extra income in the form of rent. On the other, Airbnb enables travelers to book unique homestays from local hosts, saving them money and giving them a chance to interact with locals. Catering to the on-demand travel industry, Airbnb is present in over 190 countries across the world.

The data we analyse is the Airbnb NYC (2019) listings data. Our main objectives are the four questions above, which can be summarised as learning about hosts, areas, prices, reviews and locations. We are not limited to these and will also try to uncover further insights.

# **Approach used:**
The approach used in this project is as follows:

**1) Loading our data:** In this section we load the dataset into a Colab notebook by reading the CSV file.
 
**2) Data Cleaning and Processing:** In this section we remove null values where appropriate and, for some columns, replace them with suitable values based on reasonable assumptions.
 
**3) Analysis and Visualization:** In this section we first explore each variable that can play an important role in the analysis, then examine how the variables affect one another, and finally try to answer our hypothetical questions.


**4) Future Scope for Further Analysis:**
Many apartments have an availability of 0 and a very old last_review date, which may mean that they have stopped operating. With more time we could relate these apartments to their neighbourhoods and unearth various micro trends that we could not cover efficiently in this short duration. Several columns, such as number of reviews and reviews per month, could also play an important role in further analysis through their relation to other factors or grouped factors.
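
The idea in point 4 can be sketched roughly as follows. This is only an illustration on a made-up four-row sample: the column names match the dataset, but the rows, the `CUTOFF` date and the definition of "possibly inactive" are assumptions, not results from the notebook.

```python
import pandas as pd

# Tiny illustrative sample; the values are made up, only the column names
# match the Airbnb NYC 2019 dataset.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "neighbourhood": ["Harlem", "Harlem", "Bushwick", "Midtown"],
    "availability_365": [0, 0, 120, 0],
    "last_review": ["2015-03-01", "2019-06-20", "2019-07-01", None],
})
sample["last_review"] = pd.to_datetime(sample["last_review"])

# Assumption: a listing is "possibly inactive" when it has no availability
# and its last review is older than a chosen cutoff date.
CUTOFF = pd.Timestamp("2018-01-01")  # hypothetical threshold
possibly_inactive = sample[
    (sample["availability_365"] == 0) & (sample["last_review"] < CUTOFF)
]

# Count such listings per neighbourhood to see where they concentrate.
inactive_by_neighbourhood = possibly_inactive.groupby("neighbourhood").size()
print(inactive_by_neighbourhood)
```

Listings that were never reviewed have a missing `last_review`, so the date comparison is false for them and they are excluded here; whether they should count as inactive is a separate decision.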

## **Types of graphs used for data visualization:**
*  Count Plot
*  Bar Plot
*  Scatter Plot
*  Heatmap
*  Box plot
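
As a rough illustration of two of these plot types, the sketch below draws a count plot and a box plot with seaborn. The six rows are made up for the example; only the column names `room_type` and `price` come from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sample rows; only the column names match the dataset.
sample = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room",
                  "Private room", "Entire home/apt", "Entire home/apt"],
    "price": [70, 225, 40, 90, 180, 300],
})

fig, (ax_count, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Count plot: number of listings in each room type.
sns.countplot(data=sample, x="room_type", ax=ax_count)
ax_count.set_title("Listings per room type")

# Box plot: spread of prices within each room type.
sns.boxplot(data=sample, x="room_type", y="price", ax=ax_box)
ax_box.set_title("Price by room type")

fig.tight_layout()
```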


## **Python Libraries used for graphs:**
*  Matplotlib
*  Seaborn
*  klib
*  NumPy
*  pandas
*  Folium


# **What is EDA?**
**“Exploratory Data Analysis”** is a very important step in machine learning. Whenever we start work on a project, we must analyse the relevant factors in depth. Doing so raises hypothetical questions, and those questions lead to hidden facts about the data. This combined process is known as EDA. The following steps are involved in the **process of EDA**:

*  Acquiring and loading data
*  Understanding the variables
*  Cleaning the dataset
*  Exploring and visualizing data
*  Analyzing relationships between variables


## **Understanding the Variables**
In this section we give an overview of the variables in our dataset: what each feature means, how it is distributed, and what type of data it holds. The Airbnb dataset has 16 columns in total, which we can see from a basic inspection of the dataset. Some columns are not significant for our analysis and can be left out. Now let’s look at some of the useful columns in our dataset.

**ID** 

*   Unique listing ID



**Host Id**
*  Unique host ID
*  This is a numerical variable associated with each host.
*  There are 37457 unique values in the dataset.
*  Multiple listings can correspond to a single host ID.

**Host Name**
*  Host names are the names of the individuals or organisations who rent out rooms/apartments on the Airbnb website.
*  There are about 11453 unique values out of 48895 observations.
*  This is a categorical variable; one host can have multiple apartments.

**Neighbourhood**
*  When searching for accommodation in a city, guests can filter by neighbourhood attributes and explore layers of professional-quality content, including neighbourhood maps, custom local photography and localized editorial, details on public transportation and parking, and tips from Airbnb’s host community.
*  In the Airbnb dataset, neighbourhood is a categorical variable.

**Neighbourhood groups:**
*  Neighbourhood groups are clusters of neighbourhoods in the area.
*  There are 5 boroughs in New York City.
*  It is a categorical variable.
 
**Room type:**
Airbnb has 3 categories for types of spaces:
*  Entire home/apartment
*  Private room
*  Shared room

**Price**
*  The total price of an Airbnb reservation is based on the rate set by the host, plus fees or costs determined by either the host or Airbnb.
*  This is a continuous variable.

**Other relevant variables**
*  Reviews per month: gives insight into how frequently the listing is visited
*  Minimum nights: the minimum stay length, to be used together with the number of monthly reviews
*  Availability 365: the total number of days in the year for which the listing is available
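
Facts like the unique-value counts quoted above can be obtained with a few pandas calls. The sketch below shows the general pattern on a small hand-made sample (the rows are illustrative, not taken from the full data), so the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Small illustrative sample; only the column names match the dataset.
sample = pd.DataFrame({
    "host_id": [2787, 2845, 2787, 4869],
    "host_name": ["John", "Jennifer", "John", "LisaRoxanne"],
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn", "Brooklyn"],
    "room_type": ["Private room", "Entire home/apt",
                  "Shared room", "Entire home/apt"],
    "price": [149, 225, 60, 89],
})

# Number of distinct hosts, and how many listings each host has.
unique_hosts = sample["host_id"].nunique()
listings_per_host = sample["host_id"].value_counts()

# Distinct categories of a categorical column.
room_types = sorted(sample["room_type"].unique())

# Names of the numeric columns.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

print(unique_hosts, room_types, numeric_cols)
print(listings_per_host)
```

On the full DataFrame the same calls would be made on `df` instead of `sample`.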


# **Table of contents**

*  Loading Data

*  Checking for NaN values 

*  Handling NaNs

*  Analysis





## **Questions for analysis are as follows**
**1) Which hosts have the highest number of apartments?**

**2) Which are the top 10 neighbourhoods with the maximum number of apartments on Airbnb?**

**3) Which neighbourhoods have the maximum prices in their respective neighbourhood_group?**

**4) How is the neighbourhood related to reviews?**

**5) What can we learn from predictions (e.g. locations, prices, reviews)?**

**6) What is the distribution of room_type, and how is it distributed over location?**

**7) How is room_type distributed over neighbourhood_group, and are the ratios of the respective room_types more or less the same in each neighbourhood_group?**

**8) How is the price column distributed over room_type, and are there any surprising items in the price column?**

**9) Which are the top 5 hosts that have obtained the highest number of reviews?**

**10) What is the average price preferred by customers in each neighbourhood_group for each category of room_type?**

**11) What is the average price preferred for getting a good number_of_reviews in each neighbourhood_group?**

**12) Which hosts are the busiest? (Most important)**
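
Several of these questions reduce to a group-and-aggregate step. As a hedged sketch of that pattern (not the analysis actually performed later in the notebook), the example below answers questions 1, 9 and 10 on a made-up six-row sample.

```python
import pandas as pd

# Made-up sample rows; only the column names match the dataset.
sample = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 3],
    "neighbourhood_group": ["Brooklyn", "Brooklyn", "Manhattan",
                            "Manhattan", "Manhattan", "Queens"],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt",
                  "Private room", "Entire home/apt", "Private room"],
    "price": [60, 120, 200, 100, 300, 50],
    "number_of_reviews": [10, 5, 40, 0, 7, 3],
})

# Q1: hosts with the most listings.
top_hosts = (
    sample.groupby("host_id").size().sort_values(ascending=False).head(10)
)

# Q9: hosts with the highest total number of reviews.
reviews_by_host = (
    sample.groupby("host_id")["number_of_reviews"]
    .sum()
    .sort_values(ascending=False)
    .head(5)
)

# Q10: average price per neighbourhood_group and room_type, as a table.
avg_price = (
    sample.groupby(["neighbourhood_group", "room_type"])["price"]
    .mean()
    .unstack()
)

print(top_hosts, reviews_by_host, avg_price, sep="\n\n")
```

Combinations that do not occur in the data (for example an entire home in Queens in this sample) show up as `NaN` in the `avg_price` table.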



In [None]:
# installing Klib library
!pip install klib

# importing libraries 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
import folium
from folium import plugins
from folium.plugins import MarkerCluster
from folium.plugins import FastMarkerCluster
import klib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting klib
  Downloading klib-1.0.1-py3-none-any.whl (20 kB)
Collecting Jinja2<4.0.0,>=3.0.3
  Downloading Jinja2-3.1.2-py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 5.2 MB/s
Installing collected packages: Jinja2, klib
  Attempting uninstall: Jinja2
    Found existing installation: Jinja2 2.11.3
    Uninstalling Jinja2-2.11.3:
      Successfully uninstalled Jinja2-2.11.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
flask 1.1.4 requires Jinja2<3.0,>=2.10.1, but you have jinja2 3.1.2 which is incompatible.
Successfully installed Jinja2-3.1.2 klib-1.0.1


In [None]:
# importing drive and mounting it to colab
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Path to the dataset on the mounted Drive
file_path = '/content/drive/MyDrive/Almabetter/capstone projects/EDA/Airbnb NYC 2019.csv'
df = pd.read_csv(file_path)

In [None]:
# Alternative path to the same file; running this cell reloads df from it
file_path = '/content/drive/MyDrive/AlmaBetter/1) Python for Data Science/Airbnb Bookings Analysis(Bhavik Verma)/Airbnb NYC 2019.csv'
df = pd.read_csv(file_path)

In [None]:
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [None]:
df.tail()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
48890,36484665,Charming one bedroom - newly renovated rowhouse,8232441,Sabrina,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,36485057,Affordable room in Bushwick/East Williamsburg,6570630,Marisol,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,36485431,Sunny Studio at Historical Neighborhood,23492952,Ilgar & Aysel,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,36485609,43rd St. Time Square-cozy single bed,30985759,Taz,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2
48894,36487245,Trendy duplex in the very heart of Hell's Kitchen,68119814,Christophe,Manhattan,Hell's Kitchen,40.76404,-73.98933,Private room,90,7,0,,,1,23


In [None]:
df.isna().sum() # checking number of null values in each feature column

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

Let's look at the columns that contain **null values**.

In [None]:
df.loc[:, df.isna().any()]  # keep only the columns with at least one null value

Unnamed: 0,name,host_name,last_review,reviews_per_month
0,Clean & quiet apt home by the park,John,2018-10-19,0.21
1,Skylit Midtown Castle,Jennifer,2019-05-21,0.38
2,THE VILLAGE OF HARLEM....NEW YORK !,Elisabeth,,
3,Cozy Entire Floor of Brownstone,LisaRoxanne,2019-07-05,4.64
4,Entire Apt: Spacious Studio/Loft by central park,Laura,2018-11-19,0.10
...,...,...,...,...
48890,Charming one bedroom - newly renovated rowhouse,Sabrina,,
48891,Affordable room in Bushwick/East Williamsburg,Marisol,,
48892,Sunny Studio at Historical Neighborhood,Ilgar & Aysel,,
48893,43rd St. Time Square-cozy single bed,Taz,,
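
The null counts above show `last_review` and `reviews_per_month` missing for the same number of rows, which suggests they are missing for listings with no reviews. One possible way to handle these columns is sketched below on a made-up three-row sample. The fill values (`0` and `"Unknown"`) are assumptions for illustration, not necessarily the choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample that mimics the null pattern; column names match the dataset.
sample = pd.DataFrame({
    "name": ["Skylit Midtown Castle", None, "Sunny Studio"],
    "host_name": ["Jennifer", "Elisabeth", None],
    "number_of_reviews": [45, 0, 0],
    "last_review": ["2019-05-21", None, None],
    "reviews_per_month": [0.38, np.nan, np.nan],
})

# Check the assumption: reviews_per_month is missing only where there are no reviews.
missing_only_when_unreviewed = (
    sample.loc[sample["reviews_per_month"].isna(), "number_of_reviews"] == 0
).all()

cleaned = sample.copy()
# No reviews means zero reviews per month.
cleaned["reviews_per_month"] = cleaned["reviews_per_month"].fillna(0)
# Hypothetical placeholder for missing text fields.
cleaned["name"] = cleaned["name"].fillna("Unknown")
cleaned["host_name"] = cleaned["host_name"].fillna("Unknown")
# Parse dates; listings without reviews keep a missing date (NaT).
cleaned["last_review"] = pd.to_datetime(cleaned["last_review"])

print(missing_only_when_unreviewed)
print(cleaned.isna().sum())
```

Leaving `last_review` as a missing date avoids inventing a review date for listings that were never reviewed.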


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     