# EDA for Predicting Short-term Rental Price in NYC

## 1. Import libraries

In [1]:
import wandb 
import pandas as pd
import pandas_profiling

In [2]:
# Initialize W&B run
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)

[34m[1mwandb[0m: Currently logged in as: [33mwif0n[0m. Use [1m`wandb login --relogin`[0m to force relogin


## 2. Import data

In [3]:
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

## 3. Run `pandas_profiling` on raw data

In [4]:
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()


First, the generated EDA report shows that the `price` variable contains outliers. Going forward, we will remove them.

Second, since we are predicting short-term rental prices, we will keep only listings whose `minimum_nights` falls within a short-term range.

Lastly, the `last_review` column is currently typed as `categorical`. Let's convert it to `datetime`.

## 4. Preprocessing

### 4.1 Drop outliers for `price`

In [5]:
# Drop outliers for price
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
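
As a small illustration of how the filter above behaves at the boundaries, the sketch below applies `Series.between` to a toy frame. The `toy` values are hypothetical and are not taken from the dataset; the point is only that both bounds are inclusive by default, so listings priced exactly at `min_price` or `max_price` are kept.

```python
import pandas as pd

# Hypothetical toy prices, not taken from the real dataset
toy = pd.DataFrame({"price": [5, 10, 120, 350, 900]})

# Series.between is inclusive on both ends by default,
# so the boundary values 10 and 350 survive the filter
mask = toy["price"].between(10, 350)
kept = toy[mask].copy()

print(kept["price"].tolist())  # [10, 120, 350]
```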

### 4.2 Cap minimum nights

In [6]:
min_minimum_nights = 1
max_minimum_nights = 31
idx = df['minimum_nights'].between(min_minimum_nights, max_minimum_nights)
df = df[idx].copy()
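
Because both filters silently discard rows, it can help to see how many are lost at each step. The helper below is a minimal sketch of one way to do that; `filter_range` is a hypothetical name introduced here for illustration, and the toy values are made up. Note that `between` evaluates to `False` for missing values, so rows with a missing value in the filtered column are dropped as well.

```python
import pandas as pd


def filter_range(frame, column, low, high):
    """Keep rows whose `column` lies in [low, high] and report how many were dropped.

    Hypothetical helper, not part of the original notebook.
    """
    mask = frame[column].between(low, high)
    dropped = int((~mask).sum())
    print(f"{column}: dropped {dropped} of {len(frame)} rows")
    return frame[mask].copy(), dropped


# Hypothetical toy values, not taken from the real dataset
toy = pd.DataFrame({"minimum_nights": [1, 3, 31, 60, 365]})
toy_filtered, n_dropped = filter_range(toy, "minimum_nights", 1, 31)
```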

### 4.3 Convert `last_review` to datetime

In [7]:
df['last_review'] = pd.to_datetime(df['last_review'])
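
To show what this conversion does to missing entries, here is a short sketch on a toy column. The dates are hypothetical, and the missing entry is included on the assumption that a listing might have no `last_review` value. `pd.to_datetime` turns such entries into `NaT` rather than raising.

```python
import pandas as pd

# Hypothetical toy values; the None stands in for a listing with no review date
toy = pd.DataFrame({"last_review": ["2019-05-21", None, "2018-11-19"]})

toy["last_review"] = pd.to_datetime(toy["last_review"])

# The column is now a datetime dtype, and the missing entry became NaT
is_datetime = pd.api.types.is_datetime64_any_dtype(toy["last_review"])
n_missing = int(toy["last_review"].isna().sum())
print(is_datetime, n_missing)
```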

## 5. Run profiling again (to see the impact of preprocessing changes)

In [8]:
profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()


After preprocessing, the distribution of `price` looks more reasonable.

In [9]:
# finish W&B run
run.finish()
