# Case Study - Storytelling AIRBNB-NYC 

### Problem Statement
- For the past few months, Airbnb has seen a major decline in revenue. Now that the restrictions have started lifting and people have started to travel more, Airbnb wants to make sure that it is fully prepared for this change.
- The different leaders at Airbnb want to understand some important insights based on various attributes in the dataset so as to increase the revenue. Our responsibility is to provide valuable insights to aid in decision making.

In [None]:
# Importing Necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Read and understand the dataset and check the first five rows
Airbnb_data = pd.read_csv('AB_NYC_2019.csv')
Airbnb_data.head()

In [None]:
Airbnb_data.shape 

- The dataset has 48,895 rows and 16 columns.

In [None]:
# printing all the columns in total dataset
print(Airbnb_data.columns)

# Finding Data types for each column

In [None]:
# To see Non-Null counts before adding new columns to dataframe 
Airbnb_data.info()

# 1. Creating features

Categorizing continuous columns makes the relationships between variables easier to understand and the findings easier to communicate.

## 1.1 Categorizing the "availability_365" column into 5 categories

In [None]:
def availability_365_categories_function(row):
    """
    Categorizes the "availability_365" column into 5 categories
    """
    if row <= 1:
        return 'very Low'
    elif row <= 100:
        return 'Low'
    elif row <= 200 :
        return 'Medium'
    elif (row <= 300):
        return 'High'
    else:
        return 'very High'

In [None]:
Airbnb_data['availability_365_categories'] = Airbnb_data.availability_365.map(availability_365_categories_function)
Airbnb_data['availability_365_categories']

In [None]:
Airbnb_data['availability_365_categories'].value_counts()
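
The same binning can be expressed without a row-wise function. The sketch below assumes a small made-up sample in place of `Airbnb_data.availability_365` and uses `pd.cut` with right-closed bins, which should reproduce the `<=` comparisons of `availability_365_categories_function`. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for Airbnb_data.availability_365
availability = pd.Series([0, 1, 50, 100, 101, 200, 250, 300, 365])

# Right-closed bins: (-inf, 1], (1, 100], (100, 200], (200, 300], (300, inf)
bins = [-np.inf, 1, 100, 200, 300, np.inf]
labels = ['very Low', 'Low', 'Medium', 'High', 'very High']

availability_categories = pd.cut(availability, bins=bins, labels=labels)
print(availability_categories.value_counts())
```

A possible advantage of `pd.cut` is that the result is an ordered categorical, so the categories sort from 'very Low' to 'very High' rather than alphabetically.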

## 1.2 Categorizing the "minimum_nights" column into 5 categories

In [None]:
def minimum_night_categories_function(row):
    """
    Categorizes the "minimum_nights" column into 5 categories
    """
    if row <= 1:
        return 'very Low'
    elif row <= 3:
        return 'Low'
    elif row <= 5 :
        return 'Medium'
    elif (row <= 7):
        return 'High'
    else:
        return 'very High'

In [None]:
Airbnb_data['minimum_night_categories'] = Airbnb_data.minimum_nights.map(minimum_night_categories_function)
Airbnb_data['minimum_night_categories']

In [None]:
Airbnb_data.minimum_night_categories.value_counts()

## 1.3 Categorizing the "number_of_reviews" column into 5 categories

In [None]:
def number_of_reviews_categories_function(row):
    """
    Categorizes the "number_of_reviews" column into 5 categories
    """
    if row <= 1:
        return 'very Low'
    elif row <= 5:
        return 'Low'
    elif row <= 10 :
        return 'Medium'
    elif (row <= 30):
        return 'High'
    else:
        return 'very High'

In [None]:
Airbnb_data['number_of_reviews_categories'] = Airbnb_data.number_of_reviews.map(number_of_reviews_categories_function)
Airbnb_data['number_of_reviews_categories']

In [None]:
Airbnb_data.number_of_reviews_categories.value_counts()

## 1.4 Categorizing the "price" column into 5 categories

In [None]:
Airbnb_data.price.describe()

In [None]:
plt.figure(figsize=(10,5))
plt.title('Price Distribution')
sns.boxenplot(Airbnb_data.price, orient="h")
plt.show()

In [None]:
Airbnb_data[Airbnb_data.price == 0].shape

- There are 11 rows in which the price is 0.

In [None]:
def price_categories_function(row):
    """
    Categorizes the "price" column into 5 categories
    """
    if row <= 1:
        return 'very Low'
    elif row <= 4:
        return 'Low'
    elif row <= 15 :
        return 'Medium'
    elif (row <= 100):
        return 'High'
    else:
        return 'very High'

In [None]:
Airbnb_data['price_categories'] = Airbnb_data.price.map(price_categories_function)
Airbnb_data['price_categories']

In [None]:
Airbnb_data.price_categories.value_counts()
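
Fixed thresholds are one way to bin a skewed column such as price. A different technique, not the one used in this notebook, is quantile-based binning with `pd.qcut`, which puts roughly the same number of listings in each category. The sketch below runs on made-up prices. The data and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical nightly prices (not taken from AB_NYC_2019.csv)
prices = pd.Series([40, 55, 70, 85, 100, 120, 150, 200, 350, 900])

labels = ['very Low', 'Low', 'Medium', 'High', 'very High']

# q=5 splits the values at the 20th, 40th, 60th and 80th percentiles
price_quintiles = pd.qcut(prices, q=5, labels=labels)
print(price_quintiles.value_counts().sort_index())
```

With quantile bins the category boundaries follow the data, so an extreme value such as 900 lands in 'very High' without stretching the other bins.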

# 2. Fixing columns

In [None]:
# Check the datatypes of all the columns of the dataframe after categorizing the columns in data
Airbnb_data.info()

- The last_review column has object dtype. datetime64 is a better data type for this column.

In [None]:
Airbnb_data.last_review = pd.to_datetime(Airbnb_data.last_review)
Airbnb_data.last_review

- There are no more data types to fix, and the data does not contain inconsistencies such as shifted columns that would need realigning. The columns necessary for further analysis have also been derived.
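
As a minimal sketch of the conversion step, the example below parses a made-up string column into `datetime64`. It assumes the dates follow the `YYYY-MM-DD` pattern and uses `errors='coerce'` so that missing or malformed entries become `NaT` instead of raising an error. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical raw values: one valid date, one missing, one malformed
raw_dates = pd.Series(['2019-05-21', None, 'not a date'])

# Entries that do not match the format are turned into NaT
parsed = pd.to_datetime(raw_dates, format='%Y-%m-%d', errors='coerce')

print(parsed.dtype)
print(parsed.isna().sum())
```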

# 3. Data types

In [None]:
# printing all the columns in the dataset
print(Airbnb_data.columns)

## 3.1 Categorical

In [None]:
# Categorical nominal
categorical_columns = Airbnb_data.columns[[0,1,3,4,5,8,16,17,18,19]]
categorical_columns

In [None]:
# To see the first few rows of categorical columns
Airbnb_data[categorical_columns].head()

## 3.2 Numerical

In [None]:
numerical_columns = Airbnb_data.columns[[9,10,11,13,14,15]]
numerical_columns

In [None]:
Airbnb_data[numerical_columns].head()             

In [None]:
Airbnb_data[numerical_columns].describe()             

## 3.3 Coordinates and date

In [None]:
# latitude, longitude and last_review
coordinates = Airbnb_data.columns[[6,7,12]]
Airbnb_data[coordinates]

# 4. Missing value Treatment
- The first step in data cleaning is to check for missing values.
- Check the number of null (missing) values in each column.
- A missing value means that the value is not present in the data.

In [None]:
Airbnb_data.isnull()

In [None]:
# To see the percentage of missing values for each column
Airbnb_data.isnull().mean()*100

##### Insights:
- The last_review and reviews_per_month columns each have around 20.56% missing values.
- name and host_name have 0.03% and 0.04% missing values, respectively.
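
The percentages above come from taking the mean of a boolean mask: `isnull()` returns `True`/`False`, and the mean of booleans is the fraction of `True` values. A small sketch on a made-up frame (the data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical 4-row frame with a few gaps
toy = pd.DataFrame({
    'name': ['a', 'b', None, 'd'],
    'reviews_per_month': [0.5, np.nan, np.nan, 1.2],
    'price': [100, 80, 150, 60],
})

# Fraction of nulls per column, scaled to a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```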

In [None]:
Airbnb_data[Airbnb_data.last_review.isnull()].head()

In [None]:
# Selecting the data with no missing values for 'last_review' feature
Airbnb_data1 = Airbnb_data[~Airbnb_data.last_review.isnull()]
Airbnb_data1.head()

In [None]:
Airbnb_data1.isnull().mean()*100

- After dropping the rows with a missing last_review, most of the missing values are removed.

In [None]:
Airbnb_data1 = Airbnb_data1[~Airbnb_data1.host_name.isnull()]
Airbnb_data1 = Airbnb_data1[~Airbnb_data1.name.isnull()]
Airbnb_data1.head()

In [None]:
Airbnb_data1.isnull().mean()*100

In [None]:
# Count of 'neighbourhood_group' with missing values
Airbnb_data.neighbourhood_group.value_counts(dropna=False)

In [None]:
# Count of 'neighbourhood_group'
Airbnb_data1.neighbourhood_group.value_counts(dropna=False)

In [None]:
# Fraction of rows dropped for each neighbourhood_group
1-Airbnb_data1.neighbourhood_group.value_counts(dropna=False)/Airbnb_data.neighbourhood_group.value_counts(dropna=False)

In [None]:
plt.figure(figsize=(10,5))
plt.title("Missing Value for Each Neighbourhood")
(1-Airbnb_data1.neighbourhood_group.value_counts(dropna=False)/Airbnb_data.neighbourhood_group.value_counts(dropna=False)).plot.bar()
plt.show()

In [None]:
(1-Airbnb_data1.neighbourhood_group.value_counts(dropna=False)/Airbnb_data.neighbourhood_group.value_counts(dropna=False)).mean()

##### Insights:
- Each neighbourhood_group has about 19% missing values in the 'last_review' feature.

In [None]:
# Percentage of rows dropped for each 'room_type'
(1-Airbnb_data1.room_type.value_counts(dropna=False)/Airbnb_data.room_type.value_counts(dropna=False))*100

In [None]:
plt.figure(figsize=(10,5))
plt.title("Missing Value for Each Room Type")
(1-Airbnb_data1.room_type.value_counts(dropna=False)/Airbnb_data.room_type.value_counts(dropna=False)).plot.bar()
plt.show()

In [None]:
(1-Airbnb_data1.room_type.value_counts(dropna=False)/Airbnb_data.room_type.value_counts(dropna=False)).mean()

##### Insights:
- Each room_type has about 22% missing values in the 'last_review' feature.

In [None]:
print('Mean and Median of prices with last_review feature missing')
print('Mean   = ', Airbnb_data[Airbnb_data['last_review'].isnull()].price.mean())
print('Median = ', Airbnb_data[Airbnb_data['last_review'].isnull()].price.median())

print('\nMean and Median of prices with last_review feature not missing')
print('Mean   = ', Airbnb_data[Airbnb_data['last_review'].notnull()].price.mean())
print('Median = ', Airbnb_data[Airbnb_data['last_review'].notnull()].price.median())

##### Insights:
- Prices are higher when the 'last_review' feature is missing.
- Reviews are less likely to be given for shared rooms.
- When prices are high, reviews are less likely to be given.
- The above analysis suggests that the missing values here are not MCAR (missing completely at random).
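
One way to probe whether missingness depends on other columns is to turn it into a boolean flag and group by it. The sketch below uses made-up rows, so the numbers are illustrative only. If the missing rate differs clearly between groups, or the price distribution differs between flagged and unflagged rows, that is a hint against MCAR rather than a proof.

```python
import pandas as pd

# Hypothetical listings; values are invented for illustration
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Entire home/apt', 'Shared room', 'Shared room'],
    'price': [150, 300, 90, 200],
    'last_review': ['2019-01-01', None, None, None],
})

# Flag rows where the review date is missing
toy['review_missing'] = toy['last_review'].isnull()

# Share of missing review dates within each room type
missing_rate_by_room = toy.groupby('room_type')['review_missing'].mean()

# Median price for rows with and without a review date
median_price_by_missing = toy.groupby('review_missing')['price'].median()

print(missing_rate_by_room)
print(median_price_by_missing)
```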

# 5. Univariate Analysis

In [None]:
Airbnb_data1.head()

## 5.1 host_id

In [None]:
Airbnb_data1.host_id.value_counts()

## 5.2 name

In [None]:
Airbnb_data1.name.value_counts()

## 5.3 host_name

In [None]:
Airbnb_data1.host_name.value_counts()

In [None]:
Airbnb_data1.host_name.value_counts().index[:10]

In [None]:
# Top 10 hosts
plt.figure(figsize=(10,5))
plt.title("Top 10 hosts")
sns.barplot(x = Airbnb_data1.host_name.value_counts().index[:10] , y = Airbnb_data1.host_name.value_counts().values[:10])
plt.show()

## 5.4 neighbourhood_group

In [None]:
Airbnb_data1.neighbourhood_group.value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(8,8))
plt.title("Neighbourhood Group Distribution")
plt.pie(x = Airbnb_data1.neighbourhood_group.value_counts(normalize= True) * 100, labels = Airbnb_data1.neighbourhood_group.value_counts(normalize= True).index)
plt.legend()
plt.show()

##### Insights:
- Which neighbourhoods should be targeted?
- About 81% of the listings are in the Manhattan and Brooklyn neighbourhood groups.

## 5.5 neighbourhood

In [None]:
Airbnb_data1.neighbourhood.value_counts()

## 5.6 Price

In [None]:
Airbnb_data1.price.value_counts()

In [None]:
plt.figure(figsize=(10,10))
plt.subplot(2, 1, 1)
plt.title("Price Distribution")
sns.histplot(data = Airbnb_data1.price, kde = True)
plt.subplot(2, 1, 2)
sns.boxenplot(data = Airbnb_data1.price,
    orient ="h")
plt.show()

## 5.7 room_type

In [None]:
Airbnb_data1.room_type.value_counts(normalize=True)

In [None]:
plt.figure(figsize=(8,8))
plt.title("Room Type")
plt.pie(x = Airbnb_data1.room_type.value_counts(normalize=True),
        labels = Airbnb_data1.room_type.value_counts(normalize=True).index)
plt.legend()
plt.show()

## 5.8 minimum_nights

In [None]:
Airbnb_data1.minimum_nights.value_counts()

In [None]:
Airbnb_data1.minimum_nights.describe()

In [None]:
plt.figure(figsize=(10,10))

plt.subplot(2,1,1)
plt.title("Minimum Nights Distribution")
sns.boxenplot(data = Airbnb_data1.minimum_nights, orient="h")
plt.subplot(2,1,2)
plt.hist(data = Airbnb_data1, x = 'minimum_nights', bins = 80,range=(0,35) )
plt.show()

## 5.9 number_of_reviews

In [None]:
Airbnb_data1.number_of_reviews.describe()

In [None]:
plt.figure(figsize=(10,10))
plt.subplot(2,1,1)
plt.title("Number of Reviews Distribution")
sns.boxenplot(data = Airbnb_data1.number_of_reviews,orient="h")
plt.subplot(2,1,2)
sns.histplot(data = Airbnb_data1, x = 'number_of_reviews',bins=100,binrange=(0,300))
plt.show()

## 5.10 reviews_per_month

In [None]:
Airbnb_data1.reviews_per_month.describe()

In [None]:
plt.figure(figsize=(10,10))
plt.subplot(2,1,1)
plt.title("Reviews per Month Distribution")
sns.boxenplot(data = Airbnb_data1.reviews_per_month,orient="h")
plt.subplot(2,1,2)
sns.histplot(data = Airbnb_data1, x = 'reviews_per_month',bins=100,binrange=(0,30))
plt.show()

## 5.11 availability_365

In [None]:
Airbnb_data1.availability_365.describe()

In [None]:
plt.figure(figsize = (10,5))
plt.title("365 days Availability Distribution")
sns.histplot(data = Airbnb_data1, x = 'availability_365',bins=50,binrange=(0,365), kde=True)
plt.show()

## 5.12 calculated_host_listings_count

In [None]:
Airbnb_data1.calculated_host_listings_count.describe()

In [None]:
plt.figure(figsize = (10,10))
plt.subplot(2,1,1)
plt.title("Host Listing Distribution")
sns.boxenplot(data = Airbnb_data1 , x = 'calculated_host_listings_count')
plt.subplot(2,1,2)
sns.histplot(data = Airbnb_data1, x = 'calculated_host_listings_count',bins=20,binrange=(0,20))
plt.show()

## 5.13 minimum_night_categories

In [None]:
Airbnb_data1.minimum_night_categories.value_counts(normalize= True)*100

In [None]:
plt.figure(figsize=(8,8))
plt.title('Minimum night categories')
plt.pie(x = Airbnb_data1.minimum_night_categories.value_counts(), labels=Airbnb_data1.minimum_night_categories.value_counts().index)
plt.legend()
plt.show()

## 5.14 price_categories

In [None]:
Airbnb_data1['price_categories'].value_counts(normalize=True)*100

In [None]:
Airbnb_data1['price_categories'].describe()

In [None]:
plt.figure(figsize=(8,8))
plt.title('Price categories')
plt.pie(x = Airbnb_data1.price_categories.value_counts(),labels=Airbnb_data1.price_categories.value_counts().index)
plt.legend()
plt.show()


##### Insights:
- Which price ranges do customers prefer?
- Customers prefer the 'Low' price range, followed by the 'very Low' price range.

## 5.15 number_of_reviews_categories

In [None]:
Airbnb_data1.number_of_reviews_categories.value_counts(normalize=True)*100

In [None]:
plt.figure(figsize=(8,8))
plt.title('number_of_reviews_categories Distribution')
plt.pie(x = Airbnb_data1.number_of_reviews_categories.value_counts(),labels=Airbnb_data1.number_of_reviews_categories.value_counts().index)
plt.legend()
plt.show()

# 6. Bivariate and Multivariate Analysis

## 6.1 Finding the correlations

In [None]:
numerical_columns = Airbnb_data1.columns[[9,10,11,13,14,15]]
Airbnb_data1[numerical_columns].head()

In [None]:
Airbnb_data1[numerical_columns].corr()

In [None]:
plt.figure(figsize=(10,10))
plt.title("Correlations between Different Features")
sns.heatmap(data = Airbnb_data1[numerical_columns].corr(), annot=True, cmap="Reds", linecolor="White", linewidths=5)
plt.show()

## 6.2 Finding Top correlations

In [None]:
corr_matrix = Airbnb_data1[numerical_columns].corr().abs()

# The matrix is symmetric, so keep only the upper triangle without the diagonal (k = 1)

data = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
                  .stack()
                  .sort_values(ascending=False))

In [None]:
corr_matrix

In [None]:
data

In [None]:
# Top meaningful correlations
data[1:8]
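
The upper-triangle idea can also be sketched with `np.triu_indices_from`, which returns the row and column positions above the diagonal directly. This is an alternative formulation on a made-up frame, not the notebook's own code. The column names `a`, `b`, `c` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns; b is exactly 2 * a
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
})

corr = toy.corr().abs()

# Positions strictly above the diagonal, so each pair appears once
rows, cols = np.triu_indices_from(corr.values, k=1)

pairs = pd.Series(
    corr.values[rows, cols],
    index=pd.MultiIndex.from_arrays([corr.index[rows], corr.columns[cols]]),
).sort_values(ascending=False)

print(pairs)
```

Because only positions above the diagonal are selected, no NaN entries are created and nothing needs to be dropped afterwards.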

## 6.3 number_of_reviews_categories and prices

In [None]:
# Number of listings in each of the number_of_reviews_categories
y1 = Airbnb_data1.number_of_reviews_categories.value_counts()
y1

In [None]:
plt.figure(figsize=(8,5))
sns.barplot(x = y1.index,y = y1.values)
plt.show()

In [None]:
plt.figure(figsize=(10,5))
sns.boxenplot(y = Airbnb_data1.number_of_reviews_categories , x = Airbnb_data1.price)
plt.show()

In [None]:
Airbnb_data1.groupby('number_of_reviews_categories').price.mean().sort_values()

In [None]:
Airbnb_data1.groupby('number_of_reviews_categories').price.median().sort_values()

##### Insights:
- Which price ranges do customers prefer?
- Prices are high for listings in the 'Low' and 'very Low' number_of_reviews_categories.

## 6.4 ('room_type' and 'number_of_reviews_categories')

In [None]:
Airbnb_data1.room_type.value_counts()

In [None]:
pd.crosstab(Airbnb_data1['room_type'], Airbnb_data1['number_of_reviews_categories'])

In [None]:
Airbnb_data1.groupby('room_type').number_of_reviews.sum() 

In [None]:
Airbnb_data1.groupby('room_type').number_of_reviews.sum()/Airbnb_data.room_type.value_counts()

##### Insights:
- What kinds of properties exist with respect to customer preferences?
- 'Entire home/apt' listings have more reviews than 'Shared room' listings.
- 'Shared room' listings are less likely to receive reviews (only 16 %).

## 6.5 'room_type' and 'price_categories'

In [None]:
pd.crosstab(Airbnb_data1['room_type'], Airbnb_data1['price_categories'])
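
Raw counts in a crosstab are hard to compare when the room types differ a lot in size. A possible refinement is `normalize='index'`, which converts each row to shares that sum to 1. The sketch below uses made-up rows. The category values are hypothetical.

```python
import pandas as pd

# Hypothetical listings with a room type and a price category
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Entire home/apt',
                  'Private room', 'Private room', 'Private room',
                  'Shared room'],
    'price_categories': ['High', 'very High',
                         'High', 'High', 'very High',
                         'High'],
})

# Each row shows how one room type is split across price categories
shares = pd.crosstab(toy['room_type'], toy['price_categories'], normalize='index')
print(shares)
```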

## 6.6 'room_type' and 'reviews_per_month'

In [None]:
Airbnb_data1.room_type.value_counts()

In [None]:
Airbnb_data1.groupby('room_type').reviews_per_month.mean()

In [None]:
Airbnb_data1.groupby('room_type').reviews_per_month.median()

In [None]:
Airbnb_data1.groupby('room_type').reviews_per_month.sum()

In [None]:
plt.figure(figsize=(10,5))
plt.title("Room Type Distribution vs Reviews Per Month")
sns.boxplot(data = Airbnb_data1, y = 'room_type' ,x = 'reviews_per_month')
plt.show()

##### Insights:
- For each 'room_type' there are ~1.4 reviews per month on average.

## 6.7 minimum_night_categories and reviews_per_month

In [None]:
Airbnb_data1.groupby('minimum_night_categories').reviews_per_month.sum().sort_values()

In [None]:
plt.figure(figsize=(10,5))
plt.title("Minimum Nights vs Reviews Per Month")
sns.boxplot(data = Airbnb_data1, y = 'minimum_night_categories' ,x = 'reviews_per_month')
plt.show()

##### Insights
- Customers are more likely to leave reviews for listings with a low number of minimum nights.
- What adjustments to existing properties would make them more customer-oriented?
- minimum_nights should be kept on the lower side to make properties more customer-oriented.

## 6.8 'availability_365_categories', 'price_categories' and 'reviews_per_month'


In [None]:
Airbnb_data1.availability_365_categories.value_counts()

In [None]:
pd.DataFrame(Airbnb_data1.groupby(['availability_365_categories','price_categories']).reviews_per_month.mean())

##### Insights
- If availability and price are both very high, reviews_per_month is low on average.
- Listings with very high availability and very low price are likely to get more reviews.

In [None]:
Airbnb_data1.to_csv('AirBnb_NYC_processed.csv')