# Airbnb - Predict the price

- Task: Predict the price 
- Challenge: Assess the impact of the neighbourhood 
 
-----

1. Quality of the data science pipeline
2. The methods
3. Completeness of the discussion of the results
4. Code quality (SUPER IMPORTANT - 50%)

 



In [1]:
import pandas as pd

## Reading the dataset

In [13]:
DATA = "data/AB_NYC_2019.csv.zip"

df = pd.read_csv(DATA)
print(df.shape)
df.head(n=2)

(48895, 16)


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355


## Exploring the dataset

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

### Missing data

In [15]:
# Count the missing values in each column
df.isnull().sum()

id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

`last_review` and `reviews_per_month` have the same number of missing values (10052), so I explore how they relate to each other.

In [23]:
last_review_null = round(df['last_review'].isnull().mean() * 100, 2)
print(f"Percentage of null values in the last review column is {last_review_null} %")


Percentage of null values in the last review column is 20.56 %


In [18]:
# Inspect the rows with missing reviews_per_month next to number_of_reviews and last_review
df[df['reviews_per_month'].isnull()][['number_of_reviews', 'last_review', 'reviews_per_month']]

Unnamed: 0,number_of_reviews,last_review,reviews_per_month
2,0,,
19,0,,
26,0,,
36,0,,
38,0,,
...,...,...,...
48890,0,,
48891,0,,
48892,0,,
48893,0,,


Dealing with the missing data:
- `host_name`: can be dropped for ethical reasons; it is also irrelevant for predicting the price of a listing
- `name`: TODO - likely insignificant for this analysis
- `last_review`: a date; if a listing has no reviews, the date simply does not exist
    - the column is treated as irrelevant here, so imputing those values is not needed
    
- `reviews_per_month`: replace missing values with `0.0`; they occur in listings where `number_of_reviews` is 0, and no reviews in total means no reviews per month
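
The handling listed above can be sketched on a small stand-in table. This is a minimal sketch, not the notebook's actual pipeline: the `sample` frame and its values are illustrative assumptions, and only `host_name` is dropped because the decision on `name` is still open.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (illustrative values, not the real data).
sample = pd.DataFrame({
    "host_name": ["John", None, "Ann"],
    "number_of_reviews": [9, 0, 0],
    "last_review": ["2018-10-19", None, None],
    "reviews_per_month": [0.21, np.nan, np.nan],
})

# Both review columns should be missing exactly where there are no reviews.
no_reviews = sample["number_of_reviews"] == 0
assert (sample["last_review"].isnull() == no_reviews).all()
assert (sample["reviews_per_month"].isnull() == no_reviews).all()

# Drop host_name and fill reviews_per_month with 0.0, returning a new frame
# instead of mutating the input.
cleaned = (
    sample
    .drop(columns=["host_name"])
    .assign(reviews_per_month=lambda d: d["reviews_per_month"].fillna(0.0))
)
print(cleaned)
```

Running the two assertions against the full `df` before filling would confirm whether the pattern seen in the displayed rows holds for every listing.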


## Data preprocessing

Replace the null values in the `reviews_per_month` column with 0.

In [19]:
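# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['reviews_per_month'] = df['reviews_per_month'].fillna(0)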

TODO:
- explore features like name (https://www.kaggle.com/sidjain1611/explanation-eda-airbnb-ny)
- neighbourhood
- room type
- price per neighbourhood (https://www.kaggle.com/dgomonov/data-exploration-on-nyc-airbnb)
- scatterplot of neighbourhood (https://www.kaggle.com/chirag9073/airbnb-analysis-visualization-and-prediction)
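
As a possible starting point for the "price per neighbourhood" item, prices could be summarised per `neighbourhood_group`. This is only a sketch under assumptions: the `listings` frame is a hypothetical stand-in with illustrative prices, and the median is used because listing prices are likely skewed, which the notebook has not yet checked.

```python
import pandas as pd

# Hypothetical stand-in for df[["neighbourhood_group", "price"]]; values are illustrative.
listings = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Manhattan", "Brooklyn"],
    "price": [149, 225, 150, 89],
})

# Count, median and mean price per neighbourhood group, most expensive first.
price_by_group = (
    listings.groupby("neighbourhood_group")["price"]
    .agg(["count", "median", "mean"])
    .sort_values("median", ascending=False)
)
print(price_by_group)
```

The same call with `"neighbourhood"` as the grouping key would give the finer-grained view, although small neighbourhoods may have too few listings for a stable median.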