# New-York Airbnb Dataset Exploration

# Analysis Part-I

In this part we analyze the dataset to identify features that are useful for modelling.
We start by importing the necessary packages.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
plt.rcParams["figure.figsize"] = [10,8]

First, we load the data and gather some basic information about it.

In [None]:
df=pd.read_csv('../input/airbnb-dataset-for-internship-project-showcase/newyork_airbnb.csv')

In [None]:
print(df.shape)

# Feature Engineering 

Now we look through the data to pick out candidate features for the analysis and judge whether each is likely to play an important role in our model.
We first look for features with few missing values; afterwards we will see how to handle the missing values themselves.

In [None]:
# count the missing values per column once and reuse the result
missing_counts = df.isnull().sum()

# features with missing values
print(missing_counts[missing_counts > 0])

print('----------------------------------------------------------------------------------')

# features with no missing values
print(missing_counts[missing_counts == 0])

In [None]:
# We create a list with the features we want to keep.
selected_features = ['name', 'neighbourhood_cleansed', 'room_type', 'guests_included', 'minimum_nights',
                     'number_of_reviews', 'review_scores_rating', 'amenities', 'property_type',
                     'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'price']
# Select the columns first and then copy, so only the columns we keep are duplicated.
selected_df = df[selected_features].copy()
selected_df = selected_df.rename(columns={'neighbourhood_cleansed': 'neighbourhood'})
selected_df.head(2)

Now we check the state of our selected features in terms of missing values.

In [None]:
selected_df.isnull().sum()

The columns above look good, except for review_scores_rating, which has a large number of null values. We will need to decide how to manage these.
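
One common way to manage such nulls is to fill the column with its median and keep a flag marking which rows were originally missing. The block below is only a sketch on a toy frame with made-up values; whether imputing or dropping rows suits this dataset is an assumption that the analysis here does not settle, and `rating_missing` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Toy data standing in for selected_df; the values are illustrative only.
toy = pd.DataFrame({'review_scores_rating': [95.0, np.nan, 80.0, np.nan, 100.0]})

# Remember which rows were missing before imputing (hypothetical helper column).
toy['rating_missing'] = toy['review_scores_rating'].isnull()

# Fill the missing ratings with the median of the observed ratings.
median_rating = toy['review_scores_rating'].median()
toy['review_scores_rating'] = toy['review_scores_rating'].fillna(median_rating)

print(toy)
```

Keeping the flag lets a model distinguish real ratings from filled-in ones.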

Next, we take a brief look at the data types.

In [None]:
selected_df.info()

Above, price has dtype object, but it should be float, so we convert it.

In [None]:
# look at a few price values first
print(selected_df['price'][:2])
print('\n\n\n')

# price values contain '$' and ',' so we remove both and then convert to float
selected_df['price'] = selected_df['price'].apply(lambda x: x.replace('$', ''))
selected_df['price'] = selected_df['price'].apply(lambda x: float(x.replace(',', '')))

# summary statistics for price
print(selected_df['price'].describe())
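
The conversion above assumes every price is a non-missing string. As a hedged alternative, the sketch below does the same cleaning with vectorised string methods and `pd.to_numeric`, which leaves missing or unparseable entries as NaN instead of raising. The sample strings are made up for illustration.

```python
import pandas as pd

# Toy price strings in the same shape as the dataset: '$' prefix, ',' thousands separator.
raw = pd.Series(['$1,250.00', '$99.00', None, '$40.00'])

# Strip '$' and ',' in one vectorised pass; missing entries stay missing.
cleaned = raw.str.replace(r'[$,]', '', regex=True)

# Convert to numbers; errors='coerce' turns anything unparseable into NaN.
price = pd.to_numeric(cleaned, errors='coerce')
print(price)
```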

In [None]:
# distribution of price, drawn from a random sample of 300 listings
sns.swarmplot(y=selected_df['price'].sample(300))
plt.show()

# Exploration and Visualization


Now we explore and visualize how price relates to the predictors.

First, we look at the relationship between price and neighbourhood, since the neighbourhood can be expected to affect the price of any property.

In [None]:
# First, we get the median values of the price per neighbourhood and sort them descending 
# to use them as the graph index.
price_neighbourhood = selected_df.query('price <= 500')\
                    .groupby('neighbourhood')['price']\
                    .median()\
                    .sort_values(ascending=False)\
                    .index

# Then we filter the data to use only the prices within a certain range to avoid outliers         
data = selected_df.query('price <= 500')

# We use seaborn boxplot to generate the graph, passing as parameters the target variable and
# the feature we want to associate, and an additional parameter "order" to plot them in
# descending order.
plt.figure(figsize=(15,10))
sns.boxplot(y=data['price'], x=data['neighbourhood'], order=price_neighbourhood)

# Now we add a title and rotate the x-axis labels so the neighbourhood names stay readable
plt.title('Price V/S Neighbourhood',fontsize=12)
plt.xticks(rotation=90, ha='right')
plt.show()

Next, we look at the same relationship between price and room type to check how much our model can depend on this feature. Intuitively, room type is also a major factor in the price of a property.

In [None]:

# First, we get the median values of the price per room type and sort them descending
# to use them as the plotting order.
room_price = selected_df.query('price <= 500')\
                    .groupby('room_type')['price']\
                    .median()\
                    .sort_values(ascending=False)\
                    .index
                
# Then we filter the data to use only the prices within a certain range to avoid outliers        
data = selected_df.query('price <= 500')

# We use seaborn boxplot to generate the graph, passing as parameters the target variable and
# the feature we want to associate, and an additional parameter "order" to plot them descending.
plt.figure(figsize=(15,10))
sns.boxplot(y=data['price'], x=data['room_type'], order=room_price)
plt.title('Price V/S Room Type',fontsize=12)
plt.show()
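
The same steps (filter by price, group, take the median, sort descending) are repeated for each feature. They could be collected in a small helper, sketched below on toy data. `median_order` is a hypothetical name, not part of pandas or seaborn, and the room type labels and prices are made up for illustration.

```python
import pandas as pd

def median_order(frame, feature, target='price', cap=500):
    """Return the categories of `feature` sorted by median `target`, highest first.

    Rows with `target` above `cap` are dropped first, mirroring the
    'price <= 500' filter used for the plots.
    """
    capped = frame[frame[target] <= cap]
    return (capped.groupby(feature)[target]
                  .median()
                  .sort_values(ascending=False)
                  .index)

# Toy data for illustration only.
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Shared room',
                  'Entire home/apt', 'Private room', 'Entire home/apt'],
    'price': [200.0, 80.0, 50.0, 150.0, 90.0, 900.0],
})

order = median_order(toy, 'room_type')
print(list(order))
```

The returned index can be passed directly as the `order` argument of `sns.boxplot`.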

Now we look at the same relationship between price and the number of bedrooms.

In [None]:
# We filter the data to use only the prices within a certain range to avoid outliers
data = selected_df.query('price <= 500')

# We use seaborn boxplot to generate the graph, passing as parameters the target variable and
# the feature we want to associate; no explicit "order" is passed here.
plt.figure(figsize=(15,10))
sns.boxplot(y=data['price'], x=data['bedrooms'])
plt.title('Price V/S Bedrooms',fontsize=12)
plt.show()

# Conclusion

From the swarm plot, we can see that the prices are concentrated in the \$0 - \$200 interval.

The boxplot of price against neighbourhood shows which neighbourhoods are the most affordable and which are the most premium. The distributions are also consistent, which suggests that this feature could play an important role in our model.

The boxplot of room_type against price shows a realistic pattern: an entire apartment or property is more expensive than a private or shared room. There is also not much difference in price between a private and a shared room, so a private room is the better choice most of the time.

Lastly, in the bedrooms/price boxplot, the prices are split in a consistent way. This could be another interesting feature for the model because, for example, decision trees find the best feature splits to model the price distribution according to the predictor features. The graph indicates that the price of a property increases with the number of bedrooms (a fairly obvious observation), but with somewhat different behaviour around 7 to 10 bedrooms, probably because there are only a limited number of listings with that many bedrooms.
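
The guess about the 7 to 10 bedroom range could be checked by counting the listings behind each box. The sketch below does this on toy data with made-up values; in the notebook the same aggregation would be applied to `selected_df`.

```python
import pandas as pd

# Toy data for illustration only; the real counts come from selected_df.
toy = pd.DataFrame({
    'bedrooms': [1, 1, 1, 2, 2, 3, 7, 10],
    'price': [80.0, 100.0, 90.0, 150.0, 170.0, 220.0, 400.0, 300.0],
})

# Number of listings and median price for each bedroom count,
# using the same 'price <= 500' filter as the plots.
summary = (toy.query('price <= 500')
              .groupby('bedrooms')['price']
              .agg(['count', 'median']))
print(summary)
```

A low `count` for a bedroom value means the corresponding box is drawn from very few listings and should be read with caution.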
