In [1]:
import wandb
import pandas as pd
import ydata_profiling

# Exploratory Data Analysis

Here we use a Jupyter notebook to perform a simple EDA. In a real scenario we would spend much more time on this phase; here we do the bare minimum to illustrate how Jupyter notebooks can be used in conjunction with mlflow.


### Data retrieval

In [2]:
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: spadawan. Use `wandb login --relogin` to force relogin


### Automatic preliminary analysis

We use ydata_profiling (previously pandas_profiling) to automatically generate a meaningful report.

In [3]:
profile = ydata_profiling.ProfileReport(df)

In [4]:
profile.to_widgets()


In [5]:
profile.to_notebook_iframe()


The missing values section of the report shows that the last_review and reviews_per_month columns contain missing values. We will therefore have to check whether missing values can also occur in production in order to decide whether the pipeline needs an imputation step.

The price column also contains outliers (prices of up to 10,000 dollars and prices equal to 0), which suggests selecting a range of reasonable prices and dropping the rows outside it. Here we decide to keep only prices from 10 to 350 dollars per night.

Finally, last_review is stored as a string; we convert it to a datetime, which is easier to handle with pandas built-in functions.

The next section contains the code for this basic cleaning.
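
The observations above can also be checked directly with pandas, independently of the profiling report. The following is a minimal sketch on a small hypothetical DataFrame: the values are made up for illustration and only the column names come from the dataset. It counts missing values per column and the rows whose price falls outside the 10 to 350 range.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names match the real sample.
toy = pd.DataFrame(
    {
        "price": [0, 50, 120, 10000, 350],
        "last_review": ["2019-05-21", None, "2018-11-03", None, "2019-01-02"],
        "reviews_per_month": [0.4, np.nan, 1.2, np.nan, 2.0],
    }
)

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Rows whose price is outside the chosen range (between() includes both bounds)
min_price, max_price = 10, 350
out_of_range = ~toy["price"].between(min_price, max_price)
n_outliers = int(out_of_range.sum())

print(missing_counts)
print(f"{n_outliers} rows outside [{min_price}, {max_price}]")
```

On the real sample, the same two expressions applied to `df` would give the counts behind the report's missing values and price sections.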

### Basic cleaning code

In [6]:
# Basic cleaning: keep only rows whose price lies within [min_price, max_price]
min_price = 10
max_price = 350
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()
# Convert last_review to datetime
df['last_review'] = pd.to_datetime(df['last_review'])
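
As a quick sanity check of this cleaning step, the sketch below applies the same two operations to a hypothetical toy frame (made-up values, not the real sample) and verifies the result. It illustrates two behaviors worth keeping in mind: `between` keeps rows equal to either bound, and `pd.to_datetime` turns missing strings into `NaT` rather than dropping them, so the missing values in last_review are still present after the conversion.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame(
    {
        "price": [0, 10, 200, 350, 10000],
        "last_review": ["2019-05-21", None, "2018-11-03", "2019-01-02", None],
    }
)

# Same steps as the cleaning cell, applied to the toy frame
kept = toy[toy["price"].between(10, 350)].copy()
kept["last_review"] = pd.to_datetime(kept["last_review"])

# Prices equal to the bounds (10 and 350) are kept; 0 and 10000 are dropped
kept_prices = kept["price"].tolist()

# The column is now a datetime dtype, and the missing entry became NaT
is_datetime = pd.api.types.is_datetime64_any_dtype(kept["last_review"])
n_missing_dates = int(kept["last_review"].isna().sum())

print(kept_prices, is_datetime, n_missing_dates)
```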

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19001 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              19001 non-null  int64         
 1   name                            18994 non-null  object        
 2   host_id                         19001 non-null  int64         
 3   host_name                       18993 non-null  object        
 4   neighbourhood_group             19001 non-null  object        
 5   neighbourhood                   19001 non-null  object        
 6   latitude                        19001 non-null  float64       
 7   longitude                       19001 non-null  float64       
 8   room_type                       19001 non-null  object        
 9   price                           19001 non-null  int64         
 10  minimum_nights                  19001 non-null  int64         
 11  nu

### Finish
We terminate the run so that the code is saved.

In [8]:
run.finish()
