<a id='TOC'></a>

# Project: Investigate the Airbnb Datasets of Boston and Seattle

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#business_understanding">Business Understanding</a></li>
<li><a href="#data_understanding">Data Understanding</a></li>
<li><a href="#data_preparation">Data Preparation</a></li>
<li><a href="#modeling">Modeling</a></li>    
<li><a href="#results_evaluation">Results Evaluation</a></li>
<li><a href="#deploy_solution">Deployment</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

Airbnb is a publicly listed company in the lodging industry. It is headquartered in San Francisco and operates in multiple countries. 
<br>
It runs an online marketplace for rentals.
<br><br>
The dataset covers two cities: Boston and Seattle. Each city has three CSV (comma-separated values) files containing the rental availability calendar, the available listings, and the reviews.
<br><br>
The objective is to discover actionable insights from the available data so that stakeholders can use them to inform business decisions.


<li><a href="#TOC">Back To Table Of Contents</a></li>

Import the general-purpose packages and plotting libraries that will be used across all datasets.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn


<a id='business_understanding'></a>
## Business Understanding

Airbnb's business model is based on earning commission for providing rental listings to end users. Hosts who want to rent out a property put their listings on the Airbnb website, and customers book them through Airbnb. 
<br><br>
Based on the available data, the objective of this analysis is to discover the factors behind several situations. For example: which neighborhoods are always overbooked, and what causes that? Is it the surrounding area, or are the rental facilities very good? Knowing this can tell Airbnb how to approach the situation. Maybe the listings are not very good, or the service provided is not up to the mark. Whatever the reason, understanding the environmental factors is the first step towards improvement.

<li><a href="#TOC">Back To Table Of Contents</a></li>

<a id='data_understanding'></a>
## Data Understanding

Among the many objectives of this stage, the prime one is to get familiar with the data. That familiarity helps in discovering features of interest for mining activities.


Import the data

In [4]:
df_boston_calendar = pd.read_csv('Boston-calendar.csv')  
df_boston_listings =  pd.read_csv('Boston-listings.csv')
df_boston_reviews = pd.read_csv('Boston-reviews.csv')
df_seattle_calendar = pd.read_csv('Seattle-calendar.csv')
df_seattle_listings = pd.read_csv('Seattle-listings.csv')
df_seattle_reviews = pd.read_csv('Seattle-reviews.csv')

First we will focus on the Boston dataframes.


Check out sample rows of each dataframe.

In [7]:
df_boston_calendar.sample(7)

Unnamed: 0,listing_id,date,available,price
1548,7651065,2017-03-25,t,$79.00
952848,1185034,2016-12-24,t,$163.00
1272908,14487262,2016-11-29,f,
1042191,12366845,2017-02-24,f,
1215492,12628940,2017-06-13,f,
897788,13435185,2017-07-23,f,
744480,2747654,2016-11-20,t,$101.00


In [8]:
df_boston_calendar.available.value_counts()

f    665853
t    643037
Name: available, dtype: int64

In [9]:
df_boston_listings.sample(7)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
1260,9857,https://www.airbnb.com/rooms/9857,20160906204935,2016-09-07,[1280] Luxury 2BR Suite-Back Bay,,Enjoy architecturally unique suites in a Back ...,Enjoy architecturally unique suites in a Back ...,none,,...,9.0,f,,,f,super_strict_30,f,t,79,0.35
519,5967936,https://www.airbnb.com/rooms/5967936,20160906204935,2016-09-07,Apt close to downtown and backbay,A furnished one bed room apt with functional k...,,A furnished one bed room apt with functional k...,none,,...,,f,,,f,moderate,f,f,2,
1919,11906382,https://www.airbnb.com/rooms/11906382,20160906204935,2016-09-07,Gorgeous new remodel in Beacon Hill 2BR/2BA,"Lovely newly renovated 2 bedroom, 2 bath condo...",Stunning new contemporary renovation with open...,"Lovely newly renovated 2 bedroom, 2 bath condo...",none,"Walk to everything: food, transportation, tour...",...,10.0,f,,,f,strict,f,f,1,0.93
1936,7854872,https://www.airbnb.com/rooms/7854872,20160906204935,2016-09-07,"Temple Street By Maverick, Twelve",Studio offers the luxury of home furnishings w...,Our professional managed apartments are great ...,Studio offers the luxury of home furnishings w...,none,Beacon Hill is one of Boston’s most historic n...,...,9.0,f,,,f,strict,f,f,50,0.19
568,14573486,https://www.airbnb.com/rooms/14573486,20160906204935,2016-09-07,Brand New Luxury 1 bedroom Downtown Boston,Come stay at our brand new luxury apartment in...,Come stay in our luxurious (660 sq ft) apartme...,Come stay at our brand new luxury apartment in...,none,Living on the greenway! The building borders F...,...,,f,,,t,strict,f,f,3,
78,11159585,https://www.airbnb.com/rooms/11159585,20160906204935,2016-09-07,Mark P.Coleman,Quiet Residential Boston Neighborhood.Nearby a...,,Quiet Residential Boston Neighborhood.Nearby a...,none,,...,10.0,f,,,f,flexible,f,f,1,0.21
2724,7497047,https://www.airbnb.com/rooms/7497047,20160906204935,2016-09-07,Clean and quiet close to T,The Space The house Is in a great family neigh...,,The Space The house Is in a great family neigh...,none,,...,8.0,f,,,f,flexible,f,f,6,0.08


In [10]:
df_boston_listings.shape

(3585, 95)

In [12]:
df_boston_reviews.shape

(68275, 6)

In [13]:
df_boston_reviews.sample(7)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
11011,994138,67253969,2016-03-27,62182124,Brandon,Andrew and Katie were great hosts for my 5 nig...
60416,8224214,69656202,2016-04-12,63783247,Tim,Andy could not have been a better host and his...
8543,4479140,56120849,2015-12-09,47169398,Ward,We had a great stay in Iain's apartment. The l...
54236,12308161,95244407,2016-08-19,70570778,Taylor,Great location!
25010,197146,68124915,2016-04-02,28896680,Roman,My family and I stayed two nights. Great locat...
17238,6006121,33586871,2015-05-30,22345930,Marcella,Jake and Danny were very hospitable and friend...
22571,1692573,32491522,2015-05-19,13505816,Martha,Everything went smooth with our stay. Communic...


In [11]:
df_seattle_listings.shape

(3818, 92)

In [19]:
df_boston_listings.shape

(3585, 95)

In [18]:
df_seattle_calendar.shape

(1393570, 4)

In [17]:
df_boston_calendar.shape

(1308890, 4)

In [15]:
df_seattle_reviews.shape

(84849, 6)

Exploring column differences between listing dataframes of Boston and Seattle.

In [27]:
bost_list_cols = set(df_boston_listings.columns.values)

In [28]:
seatt_list_cols = set(df_seattle_listings.columns.values)

In [30]:
len(bost_list_cols.intersection(seatt_list_cols))

92

In [31]:
len(bost_list_cols.union(seatt_list_cols))

95

In [33]:
len(bost_list_cols | seatt_list_cols)

95

In [26]:
len(set(df_boston_listings.columns.values).intersection(set(df_seattle_listings.columns.values)))

92

Columns that are present in Boston but not in Seattle.

In [34]:
len(bost_list_cols - seatt_list_cols)

3

In [36]:
(bost_list_cols - seatt_list_cols)

{'access', 'house_rules', 'interaction'}

In [35]:
len(seatt_list_cols - bost_list_cols)

0
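The set operations above can be wrapped in a small helper so the same comparison is reusable for other pairs of dataframes. This is a minimal sketch: `compare_columns` and the toy dataframes are hypothetical names introduced for illustration, not part of the original notebook.

```python
import pandas as pd


def compare_columns(left: pd.DataFrame, right: pd.DataFrame) -> dict:
    """Return shared and exclusive column names of two dataframes (hypothetical helper)."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return {
        "shared": sorted(left_cols & right_cols),
        "only_left": sorted(left_cols - right_cols),
        "only_right": sorted(right_cols - left_cols),
    }


# Toy stand-ins for the two listings dataframes (illustrative columns only).
toy_boston = pd.DataFrame(columns=["id", "name", "access"])
toy_seattle = pd.DataFrame(columns=["id", "name"])

column_report = compare_columns(toy_boston, toy_seattle)
print(column_report)
```

On the real data the call would presumably be `compare_columns(df_boston_listings, df_seattle_listings)`.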

In [38]:
df_boston_listings.columns.values

array(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name',
       'summary', 'space', 'description', 'experiences_offered',
       'neighborhood_overview', 'notes', 'transit', 'access',
       'interaction', 'house_rules', 'thumbnail_url', 'medium_url',
       'picture_url', 'xl_picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode',
       'market', 'smart_location', 'country_code', 'country', 'latitude',
       'longitude', 'is_location_exact', 'property_type', 'room_type',
       'accommodates', 'bath

<li><a href="#TOC">Back To Table Of Contents</a></li>

### Assess

###### Perform rudimentary data assessment

In [46]:
pd.set_option('display.max_rows', 500)

In [47]:
df_boston_listings.isna().sum()

id                                     0
listing_url                            0
scrape_id                              0
last_scraped                           0
name                                   0
summary                              143
space                               1057
description                            0
experiences_offered                    0
neighborhood_overview               1415
notes                               1975
transit                             1290
access                              1489
interaction                         1554
house_rules                         1192
thumbnail_url                        599
medium_url                           599
picture_url                            0
xl_picture_url                       599
host_id                                0
host_url                               0
host_name                              0
host_since                             0
host_location                         11
host_about      

In [39]:
df_boston_calendar.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1308890 entries, 0 to 1308889
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   listing_id  1308890 non-null  int64 
 1   date        1308890 non-null  object
 2   available   1308890 non-null  object
 3   price       643037 non-null   object
dtypes: int64(1), object(3)
memory usage: 39.9+ MB


In [40]:
df_boston_calendar.describe()

Unnamed: 0,listing_id
count,1308890.0
mean,8442118.0
std,4500149.0
min,3353.0
25%,4679319.0
50%,8578710.0
75%,12796030.0
max,14933460.0


In [41]:
df_boston_calendar.corr()

Unnamed: 0,listing_id
listing_id,1.0


In [52]:
len(df_boston_calendar[(df_boston_calendar.listing_id==994138)])

365

In [56]:
len(df_boston_calendar[(df_boston_calendar.listing_id==994138) & (df_boston_calendar.available=='t')])

293

In [57]:
len(df_boston_calendar[df_boston_calendar.listing_id==197146])

365

In [58]:
len(df_boston_calendar[(df_boston_calendar.listing_id==197146) & (df_boston_calendar.available=='t')])

350
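The per-listing checks above (365 calendar rows each, with 293 and 350 of them available) can be generalized to every listing at once with a `groupby`. The following is a sketch on toy data that mimics the `listing_id` and `available` columns; the values are made up for illustration.

```python
import pandas as pd

# Toy calendar with the same 't'/'f' availability flags as the real file.
toy_calendar = pd.DataFrame({
    "listing_id": [10, 10, 10, 20, 20, 20],
    "available": ["t", "t", "f", "f", "f", "t"],
})

# Turn the flag into a boolean, then count available days per listing.
available_days = (
    toy_calendar.assign(is_available=toy_calendar["available"].eq("t"))
    .groupby("listing_id")["is_available"]
    .sum()
)
print(available_days)
```

Replacing `toy_calendar` with `df_boston_calendar` should give one availability count per listing, assuming the column layout shown in the `info()` output above.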

In [62]:
df_boston_reviews.sample(7)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
16943,1131634,12273844,2014-04-28,12029530,Alvaro,Richard´s apartment has a perfect location.\r\...
23799,4196643,22588995,2014-11-10,21844938,Yusaku,We spent really good time there. As Shawn told...
8410,2636365,26192997,2015-02-06,6298496,Michelle,We had a really nice stay at Dean's South End ...
22282,8973362,53006303,2015-11-03,15338468,Caleb,Kate's apartment was just what we needed. The ...
39908,8228903,87310334,2016-07-19,84145243,Bryan,Great place to stay and close to public transp...
35448,1811776,41002151,2015-08-03,9371540,John,The location is great; the apartment was great...
20104,1115394,17495852,2014-08-13,10679682,Niklas,Wir hatten einen perfekten Aufenthalt bei Megh...


Explore the listings dataframe to understand the rating of individual listings.

In [67]:
df_boston_listings.columns[df_boston_listings.columns.str.contains('ating')]

Index(['review_scores_rating'], dtype='object')

In [68]:
df_boston_listings.columns[df_boston_listings.columns.str.contains('eview')]

Index(['number_of_reviews', 'first_review', 'last_review',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'reviews_per_month'],
      dtype='object')

In [74]:
df_boston_listings.columns[df_boston_listings.columns.str.contains('core')]

Index(['review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value'],
      dtype='object')

In [71]:
df_boston_listings.columns[df_boston_listings.columns.str.contains('vailab')]

Index(['has_availability', 'availability_30', 'availability_60',
       'availability_90', 'availability_365'],
      dtype='object')

In [76]:
df_boston_listings[['review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value']].sample(5)

Unnamed: 0,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value
598,80.0,8.0,7.0,8.0,8.0,9.0,8.0
3084,,,,,,,
1649,80.0,10.0,6.0,10.0,10.0,6.0,6.0
3329,,,,,,,
697,92.0,9.0,9.0,10.0,10.0,10.0,9.0


In [72]:
df_boston_listings[['id','has_availability', 'availability_30', 'availability_60',
       'availability_90', 'availability_365']]

Unnamed: 0,id,has_availability,availability_30,availability_60,availability_90,availability_365
0,12147973,,0,0,0,0
1,3075044,,26,54,84,359
2,6976,,19,46,61,319
3,1436513,,6,16,26,98
4,7651065,,13,34,59,334
...,...,...,...,...,...,...
3580,8373729,,21,51,81,356
3581,14844274,,29,59,89,364
3582,14585486,,0,15,40,40
3583,14603878,,5,5,5,253


###### List of issues you identified using rudimentary assessment

- Issue 1
- Issue 2
- Issue 3
- Issue 4

For the columns identified in the earlier step, run value_counts on them to get a sense of outliers.

In [None]:
df_name.column_name.value_counts()
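As a concrete variant of the template above, the sketch below runs `value_counts` on a toy `cancellation_policy` column (a column name that appears in the listings sample; the values here are illustrative). `dropna=False` keeps missing values in the tally, and `normalize=True` returns shares instead of raw counts.

```python
import pandas as pd

# Toy listings column with one missing value.
toy_listings = pd.DataFrame({
    "cancellation_policy": ["strict", "flexible", "strict", None, "moderate"],
})

# Raw counts, including missing values as their own bucket.
policy_counts = toy_listings["cancellation_policy"].value_counts(dropna=False)

# Shares among the non-missing values only (the default behaviour).
policy_shares = toy_listings["cancellation_policy"].value_counts(normalize=True)

print(policy_counts)
print(policy_shares)
```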

Plot histograms of the dataframes to identify the general distribution of features.

In [None]:
df_name.hist()

Plot a scatter matrix of the dataframes to identify correlations among variables. This gives a first sense of which features could be useful for further analysis.

In [None]:
pd.plotting.scatter_matrix(df_name)
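A numeric companion to the scatter matrix is a correlation table restricted to numeric columns. The sketch below uses toy data with two of the review score columns; the values are made up so that the two columns are perfectly correlated.

```python
import pandas as pd

# Toy listings with one text column and two numeric score columns.
toy_listings = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "review_scores_rating": [80.0, 90.0, 100.0, 70.0],
    "review_scores_value": [8.0, 9.0, 10.0, 7.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric_part = toy_listings.select_dtypes(include="number")
correlations = numeric_part.corr()
print(correlations)
```

Selecting the numeric columns first also keeps the scatter matrix small enough to read when a dataframe has many columns.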

###### List of issues you identified using visual assessment

- Issue 1
- Issue 2
- Issue 3
- Issue 4

NaN value detection

The following code fragments can be run to identify the presence of NaN/null values in a dataframe.

In [None]:
df_name.isnull()

The following command tells us which columns have at least one NaN value.

In [None]:
df_name.isnull().any(axis=0)

The following command tells us which rows have at least one NaN value.

In [None]:
df_name.isnull().any(axis=1)
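Beyond a yes/no flag per column, the share of missing values per column helps decide whether to drop or impute. This is a sketch on toy data; the column names are borrowed from the listings dataframe, the values are illustrative.

```python
import numpy as np
import pandas as pd

# Toy listings with missing values in two of three columns.
toy_listings = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "summary": ["a", None, "c", "d"],
    "review_scores_rating": [80.0, np.nan, np.nan, 92.0],
})

# Fraction of missing values per column, largest first.
missing_share = toy_listings.isna().mean().sort_values(ascending=False)

# Names of the columns that contain at least one missing value.
columns_with_nan = missing_share[missing_share > 0].index.tolist()

print(missing_share)
print(columns_with_nan)
```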

Checking for duplicate rows

In [None]:
df_name.duplicated().sum()
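Full-row duplicates are not the only kind worth checking. In a calendar-style table, two rows with the same listing and date but different values would also be suspicious. The sketch below assumes that `listing_id` plus `date` should identify a row uniquely, which is an assumption and not something the notebook verifies; the data is illustrative.

```python
import pandas as pd

# Toy calendar: rows 0 and 1 share listing_id and date but differ in 'available'.
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "date": ["2017-01-01", "2017-01-01", "2017-01-01"],
    "available": ["t", "f", "f"],
})

# Rows that repeat an earlier row in every column.
n_full_duplicates = toy_calendar.duplicated().sum()

# Rows that repeat an earlier row on the assumed key columns only.
n_key_duplicates = toy_calendar.duplicated(subset=["listing_id", "date"]).sum()

print(n_full_duplicates, n_key_duplicates)
```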

Check for and note incorrect datatypes. Prime examples to look for are date columns stored as strings and units embedded in numeric value columns.

In [None]:
df_name.info()
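The calendar dataframe shows both problems: `date` and `price` are stored as `object`, and `price` carries a `$` prefix. A minimal conversion sketch on toy rows follows. Stripping commas as thousands separators is an assumption, since no such value appears in the sample shown earlier.

```python
import pandas as pd

# Toy rows in the same format as the calendar sample (illustrative values).
toy_calendar = pd.DataFrame({
    "date": ["2017-03-25", "2016-11-29"],
    "price": ["$79.00", None],
})

# Parse the date strings into proper datetime values.
toy_calendar["date"] = pd.to_datetime(toy_calendar["date"])

# Strip the currency symbol (and any commas, by assumption), then cast to float.
toy_calendar["price"] = (
    toy_calendar["price"]
    .str.replace(r"[$,]", "", regex=True)
    .astype(float)
)

print(toy_calendar.dtypes)
```

Missing prices stay missing (NaN) after the conversion, which matches the calendar rows where the listing is not available.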

###### List of issues you identified using programmatic assessment

- Issue 1
- Issue 2
- Issue 3
- Issue 4

<li><a href="#TOC">Back To Table Of Contents</a></li>

<a id='data_preparation'></a>
## Data Preparation

##### Clean the data, drop data that is not useful, replace missing values, and do feature engineering.

In [None]:
# After discussing the structure of the data and any problems that need to be
#   cleaned, perform those cleaning steps in the second part of this section.


<a id='modeling'></a>
## Modeling

> **Tip**: Modeling of the data starts here. Depending on the targeted business goals and insights, one or more modeling techniques are chosen, the relevant model is trained, and predictions are made. The performance of the model is also evaluated in this step using several built-in functions.
<br><br>Not all proposed questions need data mining techniques; in such cases descriptive and inferential statistics are used to get the needed answers.


### Research Question 1 (Replace this header name!)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<li><a href="#TOC">Back To Table Of Contents</a></li>

<a id='results_evaluation'></a>
## Results Evaluation

> **Tip**: Evaluation in this step concerns the business value that this analysis and modeling provide. The analysis is therefore written from the stakeholder's point of view.
<br>This section should make sense to both non-technical and technical audiences.

<li><a href="#TOC">Back To Table Of Contents</a></li>

<a id='deploy_solution'></a>
## Deployment

> **Tip**: In this stage a deployment plan is made, and a monitoring and maintenance plan is drafted as well.

<li><a href="#TOC">Back To Table Of Contents</a></li>

<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been obtained. Make sure that you are clear about the limitations of your exploration. If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it satisfies all the areas of the rubric (found on the project submission page at the end of the lesson). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.



<li><a href="#TOC">Back To Table Of Contents</a></li>