In [1]:
!pip install wandb==0.16.0
!pip install ydata-profiling==4.12.1
!pip install pandas==2.1.3

Collecting ydata-profiling==4.12.1
  Downloading ydata_profiling-4.12.1-py2.py3-none-any.whl.metadata (20 kB)
Collecting scipy<1.14,>=1.4.1 (from ydata-profiling==4.12.1)
  Downloading scipy-1.13.1-cp310-cp310-macosx_10_9_x86_64.whl.metadata (60 kB)
Collecting matplotlib<3.10,>=3.5 (from ydata-profiling==4.12.1)
  Downloading matplotlib-3.9.4-cp310-cp310-macosx_10_12_x86_64.whl.metadata (11 kB)
Collecting pydantic>=2 (from ydata-profiling==4.12.1)
  Using cached pydantic-2.11.9-py3-none-any.whl.metadata (68 kB)
Collecting visions<0.7.7,>=0.7.5 (from visions[type_image_path]<0.7.7,>=0.7.5->ydata-profiling==4.12.1)
  Downloading visions-0.7.6-py3-none-any.whl.metadata (11 kB)
Collecting htmlmin==0.1.12 (from ydata-profiling==4.12.1)
  Downloading htmlmin-0.1.12.tar.gz (19 kB)
  Preparing metadata (setup.py) ... done
Collecting phik<0.13,>=0

1. Fetch the artifact we just created (sample.csv) from W&B and read it with pandas:

In [2]:
import wandb
import pandas as pd
# Note that we use save_code=True in the call to wandb.init so the notebook is uploaded and versioned by W&B
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

  import pkg_resources
[34m[1mwandb[0m: Currently logged in as: [33myannicknkongolo7[0m ([33myannicknkongolo7-wgu[0m). Use [1m`wandb login --relogin`[0m to force relogin


2. Explore the data in df

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20000 non-null  int64  
 1   name                            19993 non-null  object 
 2   host_id                         20000 non-null  int64  
 3   host_name                       19992 non-null  object 
 4   neighbourhood_group             20000 non-null  object 
 5   neighbourhood                   20000 non-null  object 
 6   latitude                        20000 non-null  float64
 7   longitude                       20000 non-null  float64
 8   room_type                       20000 non-null  object 
 9   price                           20000 non-null  int64  
 10  minimum_nights                  20000 non-null  int64  
 11  number_of_reviews               20000 non-null  int64  
 12  last_review                     

In [None]:
df.describe()

In [None]:
df.head()

3. What do you notice in the data? Look around and see what you can find.

> For example, a few columns have missing values, and the `last_review` column holds dates stored as strings. Look also at the `price` column and note the outliers: some prices are zero and some are very high. After talking to your stakeholders, you decide to keep only listings priced between `$10` and `$350` per night.
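
The observations above can also be checked programmatically. The following is a minimal sketch that runs on a tiny made-up frame, not the real sample; the name `toy` and all values in it are assumptions for illustration. On the notebook's data, the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical miniature stand-in for the sample; the values are made up.
toy = pd.DataFrame({
    "name": ["Cozy room", None, "Loft", "Studio"],
    "price": [0, 120, 9999, 80],
    "last_review": ["2019-05-21", None, "2018-11-02", "2019-01-15"],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()

# Flag prices outside the agreed range of $10 to $350 (bounds included).
out_of_range = ~toy["price"].between(10, 350)
n_out_of_range = int(out_of_range.sum())

print(missing_per_column)
print("rows outside the price range:", n_out_of_range)
print(toy["price"].describe())
```

`describe()` shows the minimum and maximum at a glance, which is usually enough to spot zeros and extreme prices before deciding on bounds.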

4. Fix some of the small problems we found in the data with the following code:

In [4]:
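# Drop price outliers: keep rows with min_price <= price <= max_price
# (Series.between includes both bounds by default)
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
# .copy() avoids a SettingWithCopyWarning when we assign to a column below
df = df[idx].copy()
# Convert last_review from string to datetime (missing values become NaT)
df['last_review'] = pd.to_datetime(df['last_review'])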

Note that we did not impute missing values. We will do that in the inference pipeline, so that missing values can be handled in production as well.
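
To illustrate why imputation belongs in the inference pipeline, here is a minimal pandas-only sketch. It is not the pipeline used later in this project: the helper names `learn_fill_values` and `apply_fill_values`, the choice of columns, and the data are all hypothetical. The idea is that fill values are learned once from training data and then reused unchanged on any production batch.

```python
import pandas as pd


def learn_fill_values(train: pd.DataFrame) -> dict:
    """Learn one fill value per column from the training data only."""
    return {
        "minimum_nights": float(train["minimum_nights"].median()),
        "name": "",
    }


def apply_fill_values(batch: pd.DataFrame, fill_values: dict) -> pd.DataFrame:
    """Fill missing entries in a batch with the previously learned values."""
    return batch.fillna(value=fill_values)


# Made-up training data and a made-up production batch with missing entries.
train = pd.DataFrame({
    "name": ["Loft", None, "Studio"],
    "minimum_nights": [1.0, 3.0, 30.0],
})
fill_values = learn_fill_values(train)

production_batch = pd.DataFrame({
    "name": [None, "Penthouse"],
    "minimum_nights": [float("nan"), 2.0],
})
imputed = apply_fill_values(production_batch, fill_values)
print(imputed)
```

Because the fill values travel with the pipeline, a production record with a missing field is handled the same way as a training record, which would not be the case if imputation had been done once in this notebook.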

5. Check with `df.info()` that all obvious problems have been solved:

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  number_
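
Besides reading the `df.info()` output, the same checks can be expressed as assertions. This is a minimal sketch on a made-up frame (the name `toy` and its values are assumptions); it repeats the cleaning step from above so that it runs on its own.

```python
import pandas as pd

# Hypothetical data covering both bounds, both outlier kinds, and missing dates.
toy = pd.DataFrame({
    "price": [0, 10, 120, 350, 9999],
    "last_review": ["2019-05-21", None, "2018-11-02", "2019-01-15", None],
})

# Same cleaning as in the notebook cell: filter on price, then parse dates.
cleaned = toy[toy["price"].between(10, 350)].copy()
cleaned["last_review"] = pd.to_datetime(cleaned["last_review"])

# Every remaining price lies within the bounds (both bounds are inclusive).
price_in_range = bool(cleaned["price"].between(10, 350).all())
# The column now has a datetime dtype instead of strings.
is_datetime = pd.api.types.is_datetime64_any_dtype(cleaned["last_review"])
# Missing dates are kept as NaT rather than imputed.
n_missing_dates = int(cleaned["last_review"].isna().sum())

assert price_in_range and is_datetime
print(len(cleaned), "rows kept,", n_missing_dates, "missing last_review")
```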

6. Terminate the run by running `run.finish()`

In [6]:
run.finish()

7. Save the notebook.