In [None]:
!pip install wandb==0.16.3
!pip install ydata-profiling==4.13.0
!pip install pandas==2.2.0

1. Fetch the artifact we just created (`sample.csv`) from W&B and read it with pandas:

In [None]:
import wandb
import pandas as pd
# save_code=True in wandb.init uploads this notebook to W&B, so it is versioned together with the run
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

2. Explore the data in `df`:

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()

3. What do you notice in the data? Look around and see what you can find.

> For example, a few columns have missing values, and the column `last_review` holds dates stored as strings. Look also at the `price` column and note the outliers: there are some zeros and some very high prices. After talking to your stakeholders, you decide to keep only listings priced between `$10` and `$350` per night.
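
The observations above can be quantified with a few pandas calls. The sketch below runs on a tiny made-up frame (`toy`, a hypothetical stand-in for `df`); only the column names `price` and `last_review` and the `$10` to `$350` bounds come from the text.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "price": [0, 45, 120, 350, 9999],
        "last_review": ["2019-05-21", None, "2018-11-03", None, "2019-01-15"],
    }
)

# Number of missing values in each column
missing = toy.isnull().sum()

# False while last_review is still stored as text instead of datetime64
last_review_is_datetime = pd.api.types.is_datetime64_any_dtype(toy["last_review"])

# Rows whose price falls outside the agreed range (between() includes both bounds)
min_price, max_price = 10, 350
n_outliers = int((~toy["price"].between(min_price, max_price)).sum())

print(missing)
print("last_review is datetime:", last_review_is_datetime)
print("rows outside price range:", n_outliers)
```

Running the same three checks on the real `df` gives concrete numbers to discuss with stakeholders before choosing the price bounds.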

4. Fix some of the small problems found in the data with the following code:

In [None]:
# Drop price outliers (between() includes both bounds)
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])

Note that we did not impute missing values here. We will do that in the inference pipeline, so that missing values can also be handled in production.
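
One way to picture imputation inside the inference pipeline: the fill values are learned once from training data and then reused on every incoming batch, so the same code path handles missing values in production. The sketch below is a pandas-only illustration under that assumption, not the pipeline used in this project; `fit_fill_values`, `impute`, and the `reviews_per_month` column are hypothetical names chosen for the example.

```python
import pandas as pd


def fit_fill_values(train: pd.DataFrame, columns: list[str]) -> dict[str, float]:
    """Learn one fill value (the median) per column from training data."""
    return {col: float(train[col].median()) for col in columns}


def impute(batch: pd.DataFrame, fill_values: dict[str, float]) -> pd.DataFrame:
    """Return a copy of batch with missing values replaced by the learned values."""
    return batch.fillna(value=fill_values)


# Hypothetical training data with one missing value
train = pd.DataFrame({"reviews_per_month": [0.5, 1.5, None, 4.0]})
fill_values = fit_fill_values(train, ["reviews_per_month"])

# Hypothetical batch arriving in production, also with a missing value
production_batch = pd.DataFrame({"reviews_per_month": [None, 2.0]})
imputed = impute(production_batch, fill_values)
print(imputed)
```

Had the missing values been filled in this notebook instead, the deployed model would never see the imputation step and would fail or misbehave on incomplete production rows.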

5. Check with `df.info()` that all obvious problems have been solved:

In [None]:
df.info()
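
The output of `df.info()` is checked by eye; the same conditions can also be asserted in code. A minimal sketch, using a hypothetical helper `check_cleaned` and a small made-up frame that goes through the same two cleaning steps as the real data:

```python
import pandas as pd


def check_cleaned(frame: pd.DataFrame, min_price: float, max_price: float) -> None:
    """Raise AssertionError if either cleaning step was not applied."""
    assert frame["price"].between(min_price, max_price).all(), "price out of range"
    assert pd.api.types.is_datetime64_any_dtype(
        frame["last_review"]
    ), "last_review is not datetime"


# Hypothetical data; the values are invented for illustration.
toy = pd.DataFrame(
    {
        "price": [0, 45, 350, 9999],
        "last_review": ["2019-05-21", None, "2018-11-03", "2019-01-15"],
    }
)

# Same cleaning steps as in the notebook
cleaned = toy[toy["price"].between(10, 350)].copy()
cleaned["last_review"] = pd.to_datetime(cleaned["last_review"])

check_cleaned(cleaned, 10, 350)  # passes silently
print(cleaned.dtypes)
```

Missing dates survive the conversion as `NaT`, so the non-null count that `df.info()` reports for `last_review` stays below the row count. That is expected, since imputation is deferred.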

6. Terminate the run by running `run.finish()`

In [None]:
run.finish()

7. Save the notebook.