# 1.0 An end-to-end regression problem (Part I)



## 1.1 Dataset description


In this document, we analyze and predict the **price of Airbnb listings** in the city of **Rio de Janeiro** from listing attributes such as room type, capacity, and neighbourhood.


You can download the data from [Ivanovitch Silva, Google Drive](https://drive.google.com/file/d/16zF4MHEP_bBxAEWpQgVocPupTjRRAgfP/view).

Let's take the following steps:

1. Load Libraries
2. Fetch Data, including EDA
3. Pre-processing
4. Data Segregation

<center><img width="600" src="https://drive.google.com/uc?export=view&id=1a-nyAPNPiVh-Xb2Pu2t2p-BhSvHJS0pO"></center>

## 1.2 Load libraries

In [14]:
!pip install seaborn
!pip install scikit-learn
!pip install pandas_profiling
!pip install ipywidgets



In [1]:
import wandb
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import tempfile
import os
from sklearn.model_selection import train_test_split
from pandas_profiling import ProfileReport

## 1.3 Get data & Exploratory Data Analysis (EDA)

### 1.3.1 Create the raw_data artifact

In [88]:
# importing the dataset
airbnb = pd.read_csv("listings.csv")
airbnb.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,17878,https://www.airbnb.com/rooms/17878,20211224070558,2021-12-25,"Very Nice 2Br in Copacabana w. balcony, fast WiFi",Discounts for long term stays. <br />- Large b...,This is the one of the bests spots in Rio. Bec...,https://a0.muscache.com/pictures/65320518/3069...,68997,https://www.airbnb.com/users/show/68997,...,4.9,4.76,4.66,,f,1,1,0,0,1.92
1,24480,https://www.airbnb.com/rooms/24480,20211224070558,2021-12-25,"Nice and cozy near Ipanema Beach, w/ home office","My studio is located in the best of Ipanema, t...","The beach, the lagoon, Ipanema is a great loca...",https://a0.muscache.com/pictures/11955612/b28e...,99249,https://www.airbnb.com/users/show/99249,...,4.9,4.97,4.58,,f,1,1,0,0,0.62
2,25026,https://www.airbnb.com/rooms/25026,20211224070558,2021-12-25,Beautiful Modern Decorated Studio in Copa,"Our apartment is a little gem, everyone loves ...",Copacabana is a lively neighborhood and the ap...,https://a0.muscache.com/pictures/3003965/68ebb...,102840,https://www.airbnb.com/users/show/102840,...,4.9,4.81,4.56,,f,1,1,0,0,1.69
3,35636,https://www.airbnb.com/rooms/35636,20211224070558,2021-12-26,Cosy flat close to Ipanema beach,This cosy apartment is just a few steps away ...,The apartment street is very quiet and safe ....,https://a0.muscache.com/pictures/20009355/38b6...,153232,https://www.airbnb.com/users/show/153232,...,4.88,4.9,4.65,,f,1,1,0,0,1.82
4,35764,https://www.airbnb.com/rooms/35764,20211224070558,2021-12-25,COPACABANA SEA BREEZE - RIO - 20 X Superhost,Our newly renovated studio is located in the b...,Our guests will experience living with a local...,https://a0.muscache.com/pictures/23782972/1d3e...,153691,https://www.airbnb.com/users/show/153691,...,4.94,4.94,4.87,,f,1,1,0,0,2.76


In [89]:
airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24549 entries, 0 to 24548
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            24549 non-null  int64  
 1   listing_url                                   24549 non-null  object 
 2   scrape_id                                     24549 non-null  int64  
 3   last_scraped                                  24549 non-null  object 
 4   name                                          24528 non-null  object 
 5   description                                   23336 non-null  object 
 6   neighborhood_overview                         13212 non-null  object 
 7   picture_url                                   24549 non-null  object 
 8   host_id                                       24549 non-null  int64  
 9   host_url                                      24549 non-null 
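
The non-null counts above already hint that some columns (for example `neighborhood_overview`) are sparsely filled. As a minimal sketch, assuming a small made-up frame in place of the real listings, the share of missing values per column can be computed like this:

```python
import pandas as pd

# Toy frame standing in for the listings data; all values are made up for illustration.
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "description": ["x", None, None, "w"],
    "host_id": [1, 2, 3, 4],
})

# isna() marks missing cells; the column-wise mean is the fraction missing.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```

Sorting in descending order puts the sparsest columns first, which can help when deciding which features to keep.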

In [11]:
airbnb.to_csv("raw_data.csv",index=False)

In [90]:
# Login to Weights & Biases
wandb.login(relogin=True)

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········································


wandb: Appending key for api.wandb.ai to your netrc file: /home/vitor/.netrc


True

In [19]:
# Upload raw_data.csv to W&B, storing it as an artifact
!wandb artifact put \
      --name airbnb/raw_data.csv \
      --type raw_data \
      --description "The raw data from Airbnb - Rio de Janeiro" raw_data.csv

wandb: Uploading file raw_data.csv to: "vitor-gabo/airbnb/raw_data.csv:latest" (raw_data)
wandb: Currently logged in as: vitor-gabo. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.12.21
wandb: Run data is saved locally in /home/vitor/UFRN/Tecnologia da Informação/2022.1/MLOps/Airbnb/wandb/run-20220711_195448-282v2bj0
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run wise-serenity-5
wandb: ⭐️ View project at https://wandb.ai/vitor-gabo/airbnb
wandb: 🚀 View run at https://wandb.ai/vitor-gabo/airbnb/runs/282v2bj0
Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("vitor-gabo/airbnb/raw_data.csv:latest")

wandb: Waiting for W&B process to finish... (success).

### 1.3.2 Download raw_data artifact from Wandb

In [2]:
# save_code=True tracks changes to the notebook and syncs them with W&B
run = wandb.init(project="airbnb", save_code=True)

wandb: Currently logged in as: vitor-gabo. Use `wandb login --relogin` to force relogin


In [6]:
# download the latest version of the raw_data.csv artifact
artifact = run.use_artifact("airbnb/raw_data.csv:latest")

# create a dataframe from the artifact
airbnb = pd.read_csv(artifact.file())

In [7]:
airbnb.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,17878,https://www.airbnb.com/rooms/17878,20211224070558,2021-12-25,"Very Nice 2Br in Copacabana w. balcony, fast WiFi",Discounts for long term stays. <br />- Large b...,This is the one of the bests spots in Rio. Bec...,https://a0.muscache.com/pictures/65320518/3069...,68997,https://www.airbnb.com/users/show/68997,...,4.9,4.76,4.66,,f,1,1,0,0,1.92
1,24480,https://www.airbnb.com/rooms/24480,20211224070558,2021-12-25,"Nice and cozy near Ipanema Beach, w/ home office","My studio is located in the best of Ipanema, t...","The beach, the lagoon, Ipanema is a great loca...",https://a0.muscache.com/pictures/11955612/b28e...,99249,https://www.airbnb.com/users/show/99249,...,4.9,4.97,4.58,,f,1,1,0,0,0.62
2,25026,https://www.airbnb.com/rooms/25026,20211224070558,2021-12-25,Beautiful Modern Decorated Studio in Copa,"Our apartment is a little gem, everyone loves ...",Copacabana is a lively neighborhood and the ap...,https://a0.muscache.com/pictures/3003965/68ebb...,102840,https://www.airbnb.com/users/show/102840,...,4.9,4.81,4.56,,f,1,1,0,0,1.69
3,35636,https://www.airbnb.com/rooms/35636,20211224070558,2021-12-26,Cosy flat close to Ipanema beach,This cosy apartment is just a few steps away ...,The apartment street is very quiet and safe ....,https://a0.muscache.com/pictures/20009355/38b6...,153232,https://www.airbnb.com/users/show/153232,...,4.88,4.9,4.65,,f,1,1,0,0,1.82
4,35764,https://www.airbnb.com/rooms/35764,20211224070558,2021-12-25,COPACABANA SEA BREEZE - RIO - 20 X Superhost,Our newly renovated studio is located in the b...,Our guests will experience living with a local...,https://a0.muscache.com/pictures/23782972/1d3e...,153691,https://www.airbnb.com/users/show/153691,...,4.94,4.94,4.87,,f,1,1,0,0,2.76


### Pre-processing

After analyzing which variables in the dataset are most relevant, the *features* shown below were selected.

In [8]:
# columns used 
columns = ['latitude', 'longitude', 'room_type', 'accommodates','neighbourhood_cleansed',
          'bedrooms','beds','price','review_scores_rating']

airbnb = airbnb[columns]
airbnb.head()

Unnamed: 0,latitude,longitude,room_type,accommodates,neighbourhood_cleansed,bedrooms,beds,price,review_scores_rating
0,-22.96599,-43.1794,Entire home/apt,5,Copacabana,2.0,2.0,$350.00,4.68
1,-22.98405,-43.20189,Entire home/apt,2,Ipanema,1.0,1.0,$296.00,4.73
2,-22.97735,-43.19105,Entire home/apt,3,Copacabana,1.0,1.0,$387.00,4.68
3,-22.98839,-43.19232,Entire home/apt,2,Ipanema,1.0,1.0,$172.00,4.72
4,-22.98107,-43.19136,Entire home/apt,2,Copacabana,1.0,1.0,$260.00,4.89


Pre-processing of the target column: price

In [9]:
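# Work on a copy so the assignments below do not write into a slice of the original frame
airbnb = airbnb.copy()
# Strip the currency symbol and thousands separator; regex=False treats '$' literally
airbnb['price'] = airbnb['price'].str.replace('$', '', regex=False)
airbnb['price'] = airbnb['price'].str.replace(',', '', regex=False)
airbnb['price'] = airbnb['price'].astype('float')
# Out-of-range prices become NaN (index-aligned assignment) and are removed by dropna
airbnb['price'] = airbnb['price'].drop(airbnb[(airbnb.price < 1) | (airbnb.price > 5000)].index)
airbnb = airbnb.dropna(axis=0)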
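
To see what this cleaning does, here is a minimal sketch on a hand-made sample of price strings (the values are hypothetical, not taken from the dataset). It uses `Series.between` for the range filter, which is an alternative to the drop-then-dropna pattern in the cell above and is intended to give the same result.

```python
import pandas as pd

# Hypothetical raw price strings in the same "$1,234.00" style as the dataset.
raw = pd.Series(["$350.00", "$1,250.00", "$0.00", "$12,000.00", None])

# regex=False makes "$" a literal character instead of a regex end-of-string anchor.
price = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
       .astype("float")
)

# Keep prices in the closed interval [1, 5000]; NaN fails the comparison and is dropped.
kept = price[price.between(1, 5000)]
print(kept.tolist())
```

Only the first two entries survive: the zero price, the very large price, and the missing value are all filtered out.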


Treating the variables:

In [93]:
airbnb.to_csv("dataset_preprocessed1.csv",index=False)

In [94]:
airbnb.describe()

| | latitude | longitude | accommodates | bedrooms | beds | price | review_scores_rating |
|---|---|---|---|---|---|---|---|
| count | 14221.0 | 14221.0 | 14221.0 | 14221.0 | 14221.0 | 14221.0 | 14221.0 |
| mean | -22.968192 | -43.240559 | 4.080304 | 1.609029 | 2.592293 | 631.12812 | 4.624606 |
| std | 0.033429 | 0.091894 | 2.306755 | 0.946316 | 2.047916 | 662.690229 | 0.868589 |
| min | -23.07267 | -43.68972 | 1.0 | 1.0 | 1.0 | 33.0 | 0.0 |
| 25% | -22.98462 | -43.24412 | 2.0 | 1.0 | 1.0 | 231.0 | 4.67 |
| 50% | -22.9736 | -43.19381 | 4.0 | 1.0 | 2.0 | 425.0 | 4.89 |
| 75% | -22.95971 | -43.1855 | 5.0 | 2.0 | 3.0 | 786.0 | 5.0 |
| max | -22.74969 | -43.10527 | 16.0 | 20.0 | 50.0 | 5000.0 | 5.0 |


In [95]:
# Upload dataset_preprocessed1.csv to W&B, storing it as an artifact
!wandb artifact put \
      --name airbnb/dataset_preprocessed1.csv \
      --type raw_data \
      --description "Pre-processing 1 - Airbnb, Rio de Janeiro" dataset_preprocessed1.csv

wandb: Uploading file dataset_preprocessed1.csv to: "vitor-gabo/airbnb/dataset_preprocessed1.csv:latest" (raw_data)
wandb: Currently logged in as: vitor-gabo. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.12.21
wandb: Run data is saved locally in /home/vitor/UFRN/Tecnologia da Informação/2022.1/MLOps/Airbnb/wandb/run-20220722_201747-3bttkhy1
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run wandering-moon-7
wandb: ⭐️ View project at https://wandb.ai/vitor-gabo/airbnb
wandb: 🚀 View run at https://wandb.ai/vitor-gabo/airbnb/runs/3bttkhy1
Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("vitor-gabo/airbnb/dataset_preprocessed1.csv:latest")

wandb: Waiting for W&B process to finish... (success).

### 1.3.3 Pandas Profiling

In [96]:
ProfileReport(airbnb, title="Pandas Profiling Report", explorative=True)




In [10]:
# There are duplicated rows
airbnb.duplicated().sum()

2

In [11]:
# Delete duplicated rows
airbnb.drop_duplicates(inplace=True)
airbnb.duplicated().sum()

0
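
As a small illustration of the two cells above, the following sketch builds a toy frame (values made up) with one repeated row and shows how `duplicated()` and `drop_duplicates()` behave on it.

```python
import pandas as pd

# Toy frame with one exact duplicate of row 0; values are illustrative only.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "price": [350.0, 120.0, 350.0],
})

# duplicated() flags rows that repeat an earlier row, so the first copy is not counted.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence by default and returns a new frame.
deduped = toy.drop_duplicates()
print(n_dupes, len(deduped))
```

Note that the original index labels are preserved after dropping, which is why the cleaned notebook frame later shows a non-contiguous index.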

### Preprocessing 2

Processing the columns *accommodates*, *bedrooms*, *beds*, *neighbourhood_cleansed*, and *room_type*.

In [12]:
columns = ['room_type','accommodates','neighbourhood_cleansed','bedrooms','beds','price']

airbnb = airbnb[columns]

In [13]:
airbnb.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14219 entries, 0 to 24545
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   room_type               14219 non-null  object 
 1   accommodates            14219 non-null  int64  
 2   neighbourhood_cleansed  14219 non-null  object 
 3   bedrooms                14219 non-null  float64
 4   beds                    14219 non-null  float64
 5   price                   14219 non-null  float64
dtypes: float64(3), int64(1), object(2)
memory usage: 777.6+ KB


In [14]:
airbnb.describe()

| | accommodates | bedrooms | beds | price |
|---|---|---|---|---|
| count | 14219.0 | 14219.0 | 14219.0 | 14219.0 |
| mean | 4.080385 | 1.609115 | 2.592376 | 631.172938 |
| std | 2.306899 | 0.946355 | 2.048014 | 662.722128 |
| min | 1.0 | 1.0 | 1.0 | 33.0 |
| 25% | 2.0 | 1.0 | 1.0 | 231.0 |
| 50% | 4.0 | 1.0 | 2.0 | 425.0 |
| 75% | 5.0 | 2.0 | 3.0 | 786.0 |
| max | 16.0 | 20.0 | 50.0 | 5000.0 |


In [15]:
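# Series.drop removes the out-of-range entries; assigning the shorter Series back
# aligns on the index, so those rows become NaN (hence the differing counts below)
airbnb['accommodates'] = airbnb['accommodates'].drop(airbnb[(airbnb.accommodates < 1) | (airbnb.accommodates > 10)].index)
airbnb['bedrooms'] = airbnb['bedrooms'].drop(airbnb[(airbnb.bedrooms < 1) | (airbnb.bedrooms > 10)].index)
airbnb['beds'] = airbnb['beds'].drop(airbnb[(airbnb.beds < 1) | (airbnb.beds > 15)].index)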
airbnb.describe()

| | accommodates | bedrooms | beds | price |
|---|---|---|---|---|
| count | 13962.0 | 14212.0 | 14186.0 | 14219.0 |
| mean | 3.901733 | 1.602308 | 2.546031 | 631.172938 |
| std | 1.894639 | 0.892279 | 1.755809 | 662.722128 |
| min | 1.0 | 1.0 | 1.0 | 33.0 |
| 25% | 2.0 | 1.0 | 1.0 | 231.0 |
| 50% | 4.0 | 1.0 | 2.0 | 425.0 |
| 75% | 5.0 | 2.0 | 3.0 | 786.0 |
| max | 10.0 | 10.0 | 15.0 | 5000.0 |
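
The per-column counts differ here because the outlier filter does not remove rows directly. A minimal sketch on a toy frame (values made up) shows the mechanism: dropping entries from a single column and assigning the result back leaves NaN in the affected rows, which a later `dropna()` then removes.

```python
import pandas as pd

# Toy frame with one implausible 'beds' value; values are illustrative only.
toy = pd.DataFrame({"beds": [1.0, 2.0, 50.0], "price": [100.0, 200.0, 300.0]})

# Index labels of the rows whose 'beds' value falls outside [1, 15].
out_of_range = toy[(toy.beds < 1) | (toy.beds > 15)].index

# The dropped Series is shorter; assignment aligns on the index and fills NaN.
toy["beds"] = toy["beds"].drop(out_of_range)
counts = toy.count().to_dict()

# dropna() removes every row that now contains a NaN.
cleaned = toy.dropna()
print(counts, len(cleaned))
```

After the assignment, `beds` has one fewer non-null value than `price`, mirroring the unequal counts in the summary above.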


In [16]:
airbnb = airbnb.dropna()

In [17]:
airbnb.isnull().sum()

room_type                 0
accommodates              0
neighbourhood_cleansed    0
bedrooms                  0
beds                      0
price                     0
dtype: int64

In [114]:
airbnb.to_csv("dataset_preprocessed1.csv",index=False)

In [115]:
# Upload dataset_preprocessed1.csv to W&B, storing it as an artifact
!wandb artifact put \
      --name airbnb/dataset_preprocessed1.csv \
      --type raw_data \
      --description "Pre-processing - Airbnb, Rio de Janeiro" dataset_preprocessed1.csv

wandb: Uploading file dataset_preprocessed1.csv to: "vitor-gabo/airbnb/dataset_preprocessed1.csv:latest" (raw_data)
wandb: Currently logged in as: vitor-gabo. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.12.21
wandb: Run data is saved locally in /home/vitor/UFRN/Tecnologia da Informação/2022.1/MLOps/Airbnb/wandb/run-20220722_211112-2cyabjds
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run clean-sponge-8
wandb: ⭐️ View project at https://wandb.ai/vitor-gabo/airbnb
wandb: 🚀 View run at https://wandb.ai/vitor-gabo/airbnb/runs/2cyabjds
Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("vitor-gabo/airbnb/dataset_preprocessed1.csv:latest")

wandb: Waiting for W&B process to finish... (success).

## 1.4 Train & Split

In [18]:
splits = {}
splits["train"], splits["test"] = train_test_split(airbnb,
                                                   test_size=0.30,
                                                   random_state=41)
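
As a rough illustration of this split, the sketch below applies the same `test_size` and `random_state` to a ten-row toy frame (made-up values) and repeats the call to show that a fixed seed selects the same rows each time.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten-row toy frame; values are illustrative only.
toy = pd.DataFrame({"accommodates": range(10), "price": range(100, 110)})

# Same parameters as the notebook cell: 30% held out, fixed seed.
train, test = train_test_split(toy, test_size=0.30, random_state=41)

# Repeating the call with the same seed reproduces the same partition.
train_again, test_again = train_test_split(toy, test_size=0.30, random_state=41)
print(train.shape, test.shape)
```

With ten rows, three go to the test split and seven to the training split, and no row appears in both.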

In [19]:
# Save the artifacts. We use a temporary directory so we do not leave
# any trace behind

with tempfile.TemporaryDirectory() as tmp_dir:

    for split, df in splits.items():

        # Make the artifact name from the provided root plus the name of the split
        artifact_name = f"data_{split}.csv"

        # Get the path on disk within the temp directory
        temp_path = os.path.join(tmp_dir, artifact_name)

        # Save then upload to W&B
        df.to_csv(temp_path,index=False)

        artifact = wandb.Artifact(
            name=artifact_name,
            type="raw_data",
            description=f"{split} split of dataset airbnb/raw_data.csv:latest",
        )
        artifact.add_file(temp_path)

        run.log_artifact(artifact)

        # This waits for the artifact to be uploaded to W&B. If you
        # do not add this, the temp directory might be removed before
        # W&B had a chance to upload the datasets, and the upload
        # might fail
        artifact.wait()
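
The comment about `artifact.wait()` depends on how `tempfile.TemporaryDirectory` behaves. The following sketch leaves W&B out and uses a made-up frame to show only that files written inside the `with` block are gone once the block exits.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for one of the splits; values are illustrative only.
df = pd.DataFrame({"price": [100.0, 200.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    temp_path = os.path.join(tmp_dir, "data_train.csv")
    df.to_csv(temp_path, index=False)
    # The file is available while the with-block is still open.
    existed_inside = os.path.exists(temp_path)

# On exit the directory and everything in it are deleted.
exists_after = os.path.exists(temp_path)
print(existed_inside, exists_after)
```

Any upload that reads from the temporary path therefore has to complete before the block ends, which is why the notebook waits on each artifact inside the loop.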

### 1.4.1 Download the train and test artifacts

In [20]:
# download the latest version of the data_test.csv and data_train.csv artifacts
artifact_train = run.use_artifact("airbnb/data_train.csv:latest")
artifact_test = run.use_artifact("airbnb/data_test.csv:latest")

# create a dataframe from each artifact
df_train = pd.read_csv(artifact_train.file())
df_test  = pd.read_csv(artifact_test.file())

In [21]:
print("Train: {}".format(df_train.shape))
print("Test: {}".format(df_test.shape))

Train: (9767, 6)
Test: (4187, 6)


In [22]:
run.finish()
