# Fetch Artifact
We first fetch the `sample.csv` artifact from W&B and read it with pandas. Passing `save_code=True` to `wandb.init` uploads the notebook so that W&B versions it together with the run.

In [1]:
import wandb
import pandas as pd

run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

[34m[1mwandb[0m: Currently logged in as: [33ms_a[0m. Use [1m`wandb login --relogin`[0m to force relogin


# Exploratory Data Analysis
In this section, we explore the data with the `pandas_profiling` tool to learn as much as possible from it. The report helps us with tasks such as:
- Understanding what each feature means
- Univariate analysis, to verify that our expectations about each feature match reality
- Bivariate analysis, to look for correlations
- Anomaly detection
- Handling missing values

In [2]:
import pandas_profiling

profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()
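
The profiling report bundles several checks that can also be run by hand. The following is a minimal sketch of the univariate, bivariate, and missing-value checks using plain pandas. It assumes a small made-up DataFrame named `toy` (not the real `sample.csv`), with column names borrowed from the dataset and invented values.

```python
import pandas as pd

# Toy data standing in for the real sample (values are invented for illustration).
toy = pd.DataFrame(
    {
        "price": [50, 120, 80, 9000, 65],
        "minimum_nights": [1, 3, 2, 30, 1],
        "last_review": ["2019-05-21", None, "2019-06-02", "2018-11-19", None],
    }
)

# Univariate: summary statistics make the extreme price easy to spot.
price_summary = toy["price"].describe()

# Bivariate: pairwise Pearson correlation between the numeric columns.
correlations = toy[["price", "minimum_nights"]].corr()

# Missing values: number of nulls in each column.
missing = toy.isna().sum()

print(price_summary)
print(correlations)
print(missing)
```

These manual checks cover only a fraction of what the report shows, but they are handy when the widget output is not available.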


After looking at the data, we decided to do the following:
- Convert the `last_review` column from string to datetime.
- Drop outliers in the `price` column by keeping only prices between `$10` and `$350` (inclusive).

In [3]:
# Drop outliers
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
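
To see what these two steps do at the boundaries, here is a small sketch on made-up data (the `toy` and `kept` names and all values are assumptions for illustration). By default, `Series.between` includes both endpoints, and `pd.to_datetime` turns missing strings into `NaT`.

```python
import pandas as pd

# Invented rows that sit just outside, exactly on, and inside the price bounds.
toy = pd.DataFrame(
    {
        "price": [5, 10, 200, 350, 351],
        "last_review": ["2019-05-21", "2019-06-02", None, "2018-11-19", "2019-01-01"],
    }
)

min_price, max_price = 10, 350

# between() is inclusive on both ends by default, so 10 and 350 are kept.
kept = toy[toy["price"].between(min_price, max_price)].copy()

# Missing review dates become NaT instead of raising an error.
kept["last_review"] = pd.to_datetime(kept["last_review"])

print(kept)
print(kept.dtypes)
```

The `.copy()` call mirrors the cell above: it makes `kept` an independent DataFrame, so the later column assignment does not trigger a `SettingWithCopyWarning`.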

We check the dataset to make sure all obvious problems have been solved.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  nu
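
Besides reading the `df.info()` output, the same verification can be expressed as explicit checks. The sketch below assumes a hypothetical helper, `check_cleaned`, which is not part of the original notebook, and runs it on two tiny invented DataFrames.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def check_cleaned(frame, min_price, max_price):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if not frame["price"].between(min_price, max_price).all():
        problems.append("price outside the allowed range")
    if not is_datetime64_any_dtype(frame["last_review"]):
        problems.append("last_review is not a datetime column")
    return problems


# Invented examples: one frame that satisfies both checks, one that fails both.
good = pd.DataFrame(
    {"price": [10, 350], "last_review": pd.to_datetime(["2019-05-21", None])}
)
bad = pd.DataFrame(
    {"price": [5, 100], "last_review": ["2019-05-21", "2019-06-02"]}
)

print(check_cleaned(good, 10, 350))
print(check_cleaned(bad, 10, 350))
```

On the notebook's data, the equivalent call would be `check_cleaned(df, min_price, max_price)`.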

In [5]:
run.finish()
