# Perform EDA

### Fetch the artifact from W&B and load it

In [1]:
import wandb
import pandas as pd

# Use save_code=True so the notebook is uploaded and versioned by W&B
run = wandb.init(project="nyc_airbnb", group="eda", save_code=True)
local_path = wandb.use_artifact("sample.csv:latest").file()
df = pd.read_csv(local_path)

[34m[1mwandb[0m: Currently logged in as: [33mucaiado[0m (use `wandb login --relogin` to force relogin)
[34m[1mwandb[0m: wandb version 0.12.10 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              20000 non-null  int64  
 1   name                            19993 non-null  object 
 2   host_id                         20000 non-null  int64  
 3   host_name                       19992 non-null  object 
 4   neighbourhood_group             20000 non-null  object 
 5   neighbourhood                   20000 non-null  object 
 6   latitude                        20000 non-null  float64
 7   longitude                       20000 non-null  float64
 8   room_type                       20000 non-null  object 
 9   price                           20000 non-null  int64  
 10  minimum_nights                  20000 non-null  int64  
 11  number_of_reviews               20000 non-null  int64  
 12  last_review                     

### Profiling data

In [3]:
import pandas_profiling

profile = pandas_profiling.ProfileReport(df)
profile.to_widgets()


We can see that:

1. There are almost 4000 missing values in the features `last_review` and `reviews_per_month`
2. The variable `last_review` is a date, but it is encoded as a string
3. Instead of missing values, the variable `number_of_reviews` has almost 4000 zeros
4. The variable `price` has $\mu=153.27$, but ranges from 0 to 10000
5. The variable `minimum_nights` has $\mu=7$, but ranges from 1 to 1250
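
The observations above can also be checked directly with pandas, without the profile report. The following is a minimal sketch: the helper `summarize_issues` and the toy frame are hypothetical and its values are made up, so only the column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (assumption: values are invented).
toy = pd.DataFrame({
    "price": [0, 80, 150, 10000],
    "minimum_nights": [1, 2, 30, 1250],
    "number_of_reviews": [0, 12, 0, 3],
    "last_review": [np.nan, "2019-05-21", np.nan, "2019-01-02"],
    "reviews_per_month": [np.nan, 0.4, np.nan, 0.1],
})


def summarize_issues(frame):
    """Collect the quick checks behind the observations listed above."""
    return {
        # Missing values per review-related column
        "missing": frame[["last_review", "reviews_per_month"]].isna().sum().to_dict(),
        # Listings that have never been reviewed
        "zero_reviews": int((frame["number_of_reviews"] == 0).sum()),
        # False while the dates are still stored as strings
        "last_review_is_datetime": pd.api.types.is_datetime64_any_dtype(frame["last_review"]),
        # Minimum and maximum of the two skewed numeric columns
        "price_range": (int(frame["price"].min()), int(frame["price"].max())),
        "minimum_nights_range": (
            int(frame["minimum_nights"].min()),
            int(frame["minimum_nights"].max()),
        ),
    }


issues = summarize_issues(toy)
print(issues)
```

Calling `summarize_issues(df)` on the real DataFrame should reproduce the counts reported by the profile, assuming the columns have the names shown by `df.info()`.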

### Fix some problems

Missing-value imputation should be done in the inference pipeline so
 that it is also handled in production.
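
One way to keep imputation inside the inference pipeline is to make it a step of a scikit-learn `Pipeline`, so the same fill logic runs at training and at prediction time. This is only a sketch under assumptions: the notebook does not say which imputation strategy or model is used, so the constant fill, the `LinearRegression` model, and the toy data are all illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy training data with missing values (assumption: values are invented).
train = pd.DataFrame({
    "reviews_per_month": [0.5, np.nan, 2.0, np.nan],
    "minimum_nights": [1, 2, 3, 4],
})
target = [100, 120, 90, 150]

# The imputer is part of the pipeline, so it is fitted and stored with the model.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=0)),  # assumed strategy
    ("model", LinearRegression()),  # placeholder model
])
pipe.fit(train, target)

# At inference time, rows with missing values are handled by the same step.
new_rows = pd.DataFrame({"reviews_per_month": [np.nan], "minimum_nights": [2]})
pred = pipe.predict(new_rows)
print(pred)
```

Because the imputer travels with the exported pipeline, production inputs with missing values do not need a separate cleaning step.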

#### Drop outliers in price

In [4]:
df['price'].describe(percentiles=(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.95, 0.99))

count    20000.000000
mean       153.269050
std        243.325609
min          0.000000
1%          30.000000
5%          40.000000
10%         49.000000
25%         69.000000
50%        105.000000
75%        175.000000
95%        350.000000
99%        800.000000
max      10000.000000
Name: price, dtype: float64

In [5]:
# Drop outliers in prices
min_price = 10  # a reasonable lower bound
max_price = 350  # 95th percentile of price (see the summary above)
idx = df['price'].between(min_price, max_price)
df = df[idx].copy()

#### Drop outliers in minimum_nights

In [6]:
df['minimum_nights'].describe(percentiles=(0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 0.95, 0.99))

count    19001.000000
mean         6.906900
std         21.456544
min          1.000000
1%           1.000000
5%           1.000000
10%          1.000000
25%          1.000000
50%          2.000000
75%          5.000000
95%         30.000000
99%         39.000000
max       1250.000000
Name: minimum_nights, dtype: float64

In [7]:
# Drop outliers in minimum_nights
min_days = 1
max_days = 35
idx = df['minimum_nights'].between(min_days, max_days)
df = df[idx].copy()
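
The two filters above follow the same pattern, which could be factored into a small helper. The sketch below uses a hypothetical function name, `keep_in_range`, and a made-up toy frame. Note that `Series.between` includes both bounds by default, so rows equal to the minimum or maximum are kept.

```python
import pandas as pd


def keep_in_range(frame, column, low, high):
    """Keep rows where `column` lies in [low, high]; both bounds are inclusive."""
    mask = frame[column].between(low, high)
    return frame[mask].copy()


# Toy data (assumption: values are invented to exercise the bounds).
toy = pd.DataFrame({
    "price": [0, 10, 120, 350, 900],
    "minimum_nights": [1, 40, 2, 35, 3],
})

step1 = keep_in_range(toy, "price", 10, 350)
step2 = keep_in_range(step1, "minimum_nights", 1, 35)

# Report how many rows survive each filter.
print(len(toy), len(step1), len(step2))
```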

#### Convert last_review to datetime

In [8]:
df['last_review'] = pd.to_datetime(df['last_review'])
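
Since `last_review` contains missing values, it helps to know how `pd.to_datetime` treats them: missing entries become `NaT` rather than raising an error. A small sketch with made-up date strings (the actual date format in the dataset is not shown in this notebook):

```python
import numpy as np
import pandas as pd

# Toy series with one missing entry (assumption: invented dates).
raw = pd.Series(["2019-05-21", np.nan, "2018-10-19"], name="last_review")

parsed = pd.to_datetime(raw)

# The column now has a datetime dtype and the missing entry is NaT.
print(parsed.dtype)
print(parsed.isna().sum())
```

The conversion therefore fixes the type but leaves the missing values in place.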

#### Check the changes

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 18809 entries, 0 to 19999
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype         
---  ------                          --------------  -----         
 0   id                              18809 non-null  int64         
 1   name                            18802 non-null  object        
 2   host_id                         18809 non-null  int64         
 3   host_name                       18801 non-null  object        
 4   neighbourhood_group             18809 non-null  object        
 5   neighbourhood                   18809 non-null  object        
 6   latitude                        18809 non-null  float64       
 7   longitude                       18809 non-null  float64       
 8   room_type                       18809 non-null  object        
 9   price                           18809 non-null  int64         
 10  minimum_nights                  18809 non-null  int64         
 11  nu

### Terminate the run

Close the notebook using File -> Close and Halt. Then, on
 the main Jupyter notebook page, click Quit in the upper
 right to stop Jupyter. This will also terminate the mlflow run.
 **Do not use Ctrl + C.**

In [10]:
run.finish()
