# Exploratory Data Analysis for predicting Short-term Rental Prices in NYC

### 1) Problem:

#### Assumption:
We are working for a property management company renting rooms and properties for short periods of time on various platforms. 
#### Task:
We need to estimate the typical price for a given property based on the price of similar properties.

As a first step, we explore the data provided to us: we analyse its components, the distributions of the features, and the correlations between them, and we check for abnormalities and fix them before using the data for model development.

### 2) Import relevant libraries:
- wandb: client library for Weights & Biases (W&B), a platform for tracking experiments and artifacts so you can compare models and keep a record of the findings.
- pandas: Python library for data manipulation.
- pandas_profiling: Python library that performs an automated exploratory data analysis and produces a detailed report.

In [None]:
import wandb
import pandas as pd

### 3) Import Data
We connect our project to Weights & Biases (W&B) and retrieve the data from its artifact section.
- We first initialize the run with save_code set to True so that the notebook is uploaded and versioned by W&B.
- We download the artifact file and get its local path.
- We use pandas to read the file from the local path.

In [2]:
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

[34m[1mwandb[0m: Currently logged in as: [33msneha_kumari[0m. Use [1m`wandb login --relogin`[0m to force relogin


Let's take a look at the data.

In [3]:
# Check the first 2 rows
df.head(2)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,9138664,Private Lg Room 15 min to Manhattan,47594947,Iris,Queens,Sunnyside,40.74271,-73.92493,Private room,74,2,6,2019-05-26,0.13,1,5
1,31444015,TIME SQUARE CHARMING ONE BED IN HELL'S KITCHEN...,8523790,Johlex,Manhattan,Hell's Kitchen,40.76682,-73.98878,Entire home/apt,170,3,0,,,1,188


### 4) Automatic EDA using Pandas Profiling
We use pandas profiling to generate an automatic preliminary analysis report for the data.
- We first create a profile report for the input dataframe
- We display the report

In [4]:
import pandas_profiling
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()


- We export the report to HTML so that it is easy to view, analyse, and share.

In [9]:
profile.to_file('profile_report.html')


### 5) Analysis:
- There are no duplicate rows.
- About 2.6% of all values are missing; last_review in particular has many missing values.
- The price and number_of_reviews features have heavily skewed distributions, suggesting the presence of outliers.
- The last_review feature is stored as a string but should be a date.

- Correlation: number_of_reviews, reviews_per_month, and calculated_host_listings_count are negatively correlated with price. This suggests that higher-priced properties tend to have fewer reviews, both overall and per month, and a lower calculated_host_listings_count.
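
These observations can also be cross-checked with plain pandas, independently of the profiling report. The following is a minimal sketch on a tiny made-up frame: the column names mirror the dataset, but the values in `toy` are illustrative assumptions, not taken from the data. On the real data the same calls would be run on `df`.

```python
import pandas as pd

# Illustrative toy data only: column names mirror the dataset, values are made up.
toy = pd.DataFrame(
    {
        "price": [50, 60, 70, 80, 2000],
        "number_of_reviews": [40, 30, 20, 10, 0],
        "last_review": ["2019-05-26", "2019-06-01", None, "2019-01-15", None],
    }
)

# Number of fully duplicated rows
n_duplicates = int(toy.duplicated().sum())

# Fraction of missing cells overall, and per column
missing_overall = toy.isna().to_numpy().mean()
missing_per_column = toy.isna().mean()

# Skewness: a large positive value indicates a long right tail
price_skew = toy["price"].skew()

# Pearson correlation of each numeric feature with price
corr_with_price = toy.select_dtypes("number").corr()["price"].drop("price")

print(n_duplicates, missing_overall, price_skew)
print(missing_per_column)
print(corr_with_price)
```

A negative entry in `corr_with_price` corresponds to the kind of negative relationship with price described above.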

### 6) Data Cleanup
We address a few of the problems listed above.

- We drop price outliers by keeping only the listings whose price lies within a reasonable range for our analysis.

In [5]:
# Drop outliers for price
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
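
Note that `Series.between` includes both endpoints by default, so listings priced at exactly `min_price` or `max_price` are kept. A small sketch on made-up prices (the values in `prices` are illustrative, not from the dataset) shows this behaviour and one way to count the rows removed:

```python
import pandas as pd

# Illustrative prices only, not taken from the dataset
prices = pd.DataFrame({"price": [0, 10, 74, 170, 350, 351, 9999]})

min_price = 10
max_price = 350

# between() is inclusive on both ends by default
idx = prices["price"].between(min_price, max_price)
kept = prices[idx].copy()

# Count how many rows the filter removed
n_dropped = len(prices) - len(kept)
print(f"kept {len(kept)} rows, dropped {n_dropped}")
```

The filtered frame keeps its original index labels, which is why `df.info()` later reports 19001 entries with labels running from 0 to 19999; `reset_index(drop=True)` would renumber them if needed.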

- We convert the last_review feature to datetime, its appropriate datatype.

In [6]:
# Convert last review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
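
Missing last_review entries (such as the second row shown earlier) do not raise an error during conversion; they become `NaT`. A minimal sketch with illustrative values, where `None` stands in for a missing review date:

```python
import pandas as pd

# Illustrative values; None stands in for a missing last_review
s = pd.Series(["2019-05-26", None, "2019-01-15"])

converted = pd.to_datetime(s)

print(converted.dtype)         # a datetime64 dtype
print(converted.isna().sum())  # missing entries are preserved as NaT
```

Because the missing entries survive the conversion, the count of non-null last_review values stays the same before and after.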

We verify that the changes are reflected.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  nu

### 7) End the W&B run.

In [None]:
run.finish()