# Smart Pricing for Airbnb

Using our customers' data, we will build a machine learning model for **smart pricing: based on the characteristics of a property, the model automatically suggests a price**. This adds real value, because customers can use the model to price their listings quickly and in line with the market, since it is trained on data from other customers. The same business problem arises in many other markets, for example pricing products in wholesale and retail.

<img src="https://img.freepik.com/vetores-gratis/os-funcionarios-do-departamento-financeiro-estao-calculando-as-despesas-dos-negocios-da-empresa_1150-41782.jpg?w=740&t=st=1659997600~exp=1659998200~hmac=18a8593ddb566559e2bb87a1d54ffce6d5719471875595b2195a5e15d45973bc" width="30%">

## Data collection

The [data](http://insideairbnb.com/get-the-data/) come from Inside Airbnb, which publishes data scraped from Airbnb listings. Check it out!

In [1]:
# Import the required packages
import warnings
warnings.simplefilter(action='ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the CSV into a pandas DataFrame
df = pd.read_csv('listings.csv.gz')

### A quick look at the data structure

In [2]:
# Check the first 5 rows of the dataset
pd.set_option("display.max_columns", None)
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,15965441,https://www.airbnb.com/rooms/15965441,20220620202144,2022-06-20,Quarto de casal com vista para a Baía de Guana...,"Meu espaço é bom para casais, aventuras indivi...",,https://a0.muscache.com/pictures/76550464-7859...,103691209,https://www.airbnb.com/users/show/103691209,José,2016-11-14,"Rio de Janeiro, State of Rio de Janeiro, Brazil",,within an hour,83%,0%,f,https://a0.muscache.com/im/pictures/user/6e013...,https://a0.muscache.com/im/pictures/user/6e013...,,3.0,3.0,"['email', 'phone']",t,t,,Cocotá,,-22.8063,-43.17788,Private room in rental unit,Private room,2,,1 bath,1.0,1.0,"[""TV"", ""Washer"", ""Kitchen"", ""Lock on bedroom d...",$150.00,1,1125,1,1,1125,1125,1.0,1125.0,,t,30,60,90,365,2022-06-20,0,0,0,,,,,,,,,,,f,3,0,3,0,
1,47908784,https://www.airbnb.com/rooms/47908784,20220620202144,2022-06-20,"Apartamento bem localizado, bonito e familiar!",,,https://a0.muscache.com/pictures/f44537ff-72f1...,83985216,https://www.airbnb.com/users/show/83985216,Raquel,2016-07-15,"Rio de Janeiro, State of Rio de Janeiro, Brazil",,within a day,50%,100%,f,https://a0.muscache.com/im/pictures/user/4b75c...,https://a0.muscache.com/im/pictures/user/4b75c...,,1.0,1.0,['phone'],t,f,,Freguesia (Jacarepaguá),,-22.93633,-43.34907,Entire condo,Entire home/apt,4,,2 baths,2.0,2.0,"[""Body soap"", ""Dedicated workspace"", ""Cleaning...",$450.00,7,1125,7,7,1125,1125,7.0,1125.0,,t,0,0,0,52,2022-06-20,0,0,0,,,,,,,,,,,f,1,1,0,0,
2,52239613,https://www.airbnb.com/rooms/52239613,20220620202144,2022-06-20,Apartamento com varanda e linda vista,"Condomínio com porteiro 24 horas , piscina, sa...",O condomínio fica em frente ao portão 2 do PRO...,https://a0.muscache.com/pictures/miso/Hosting-...,422870631,https://www.airbnb.com/users/show/422870631,Fabio,2021-09-13,"Petrópolis, State of Rio de Janeiro, Brazil",,within a day,100%,89%,f,https://a0.muscache.com/im/pictures/user/e5a15...,https://a0.muscache.com/im/pictures/user/e5a15...,,0.0,0.0,"['email', 'phone']",t,f,"Jacarepaguá, Rio de Janeiro, Brazil",Curicica,,-22.96253,-43.40291,Entire rental unit,Entire home/apt,5,,1 bath,2.0,5.0,"[""Microwave"", ""Extra pillows and blankets"", ""F...",$350.00,2,365,2,2,365,365,2.0,365.0,,t,29,59,80,347,2022-06-20,9,9,0,2021-10-31,2022-05-15,5.0,5.0,5.0,4.89,4.78,4.78,4.89,,f,1,1,0,0,1.16
3,10445855,https://www.airbnb.com/rooms/10445855,20220620202144,2022-06-20,"Campo dos Afonsos, Sulacap",Casa com vista para as instalações do Parque ...,"Bairro suburbano, tranquilo, seguro, casas bem...",https://a0.muscache.com/pictures/0f42e026-0955...,1647571,https://www.airbnb.com/users/show/1647571,Márcio,2012-01-24,"Rio de Janeiro, State of Rio de Janeiro, Brazil",Agora trabalhando com Experiencias do AirBnB. =-),within an hour,100%,100%,f,https://a0.muscache.com/im/pictures/user/6d332...,https://a0.muscache.com/im/pictures/user/6d332...,,2.0,2.0,"['email', 'phone']",t,t,"Rio de Janeiro, Brazil",Vila Militar,,-22.87969,-43.40361,Entire home,Entire home/apt,3,,1 bath,1.0,2.0,"[""Carbon monoxide alarm"", ""Cleaning before che...",$145.00,3,300,3,3,1125,1125,3.0,1125.0,,t,19,49,79,79,2022-06-20,46,1,0,2016-08-10,2021-07-18,4.78,4.91,4.89,4.93,4.87,4.7,4.57,,f,1,1,0,0,0.64
4,565405043878669885,https://www.airbnb.com/rooms/565405043878669885,20220620202144,2022-06-20,Pousada completa: 2 quartos com muita natureza!,Este lugar único e cheio de estilo é o cenário...,,https://a0.muscache.com/pictures/miso/Hosting-...,24596747,https://www.airbnb.com/users/show/24596747,Júlio,2014-12-07,"Rio de Janeiro, State of Rio de Janeiro, Brazil","Sou músico, professor, empreendedor, fotógraf...",within an hour,100%,67%,f,https://a0.muscache.com/im/users/24596747/prof...,https://a0.muscache.com/im/users/24596747/prof...,,0.0,0.0,"['email', 'phone']",t,t,,Del Castilho,,-22.88168,-43.26803,Private room in bed and breakfast,Private room,4,,1.5 baths,2.0,3.0,"[""Body soap"", ""Dedicated workspace"", ""Backyard...",$180.00,2,60,2,2,60,60,2.0,60.0,,t,30,60,90,365,2022-06-20,0,0,0,,,,,,,,,,,f,2,0,2,0,


In [3]:
# Check the size, dtypes and null counts of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24881 entries, 0 to 24880
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            24881 non-null  int64  
 1   listing_url                                   24881 non-null  object 
 2   scrape_id                                     24881 non-null  int64  
 3   last_scraped                                  24881 non-null  object 
 4   name                                          24860 non-null  object 
 5   description                                   23975 non-null  object 
 6   neighborhood_overview                         13370 non-null  object 
 7   picture_url                                   24881 non-null  object 
 8   host_id                                       24881 non-null  int64  
 9   host_url                                      24881 non-null 

In [4]:
# Check the fraction of null values per column (0 = none, 1 = all)
pd.set_option("display.max_rows", None)
missing_much_values = (df.isnull().sum()/df.shape[0])
print(missing_much_values)

id                                              0.000000
listing_url                                     0.000000
scrape_id                                       0.000000
last_scraped                                    0.000000
name                                            0.000844
description                                     0.036413
neighborhood_overview                           0.462642
picture_url                                     0.000000
host_id                                         0.000000
host_url                                        0.000000
host_name                                       0.004702
host_since                                      0.004702
host_location                                   0.007275
host_about                                      0.510952
host_response_time                              0.209718
host_response_rate                              0.209718
host_acceptance_rate                            0.193320
host_is_superhost              

Given their share of null values, we will drop the columns "neighborhood_overview", "host_about", "host_neighbourhood", "neighbourhood", "bathrooms", "neighbourhood_group_cleansed", "calendar_updated" and "license". It makes little sense to work with columns where so many values are missing: filling them in would introduce a very strong bias.
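
As a minimal sketch, the same selection can be automated by comparing each column's null fraction against a cutoff. The helper name `columns_above_null_threshold`, the toy data and the 0.4 threshold are assumptions for illustration; the notebook picks its columns by inspection and does not state a cutoff.

```python
import pandas as pd


def columns_above_null_threshold(frame: pd.DataFrame, threshold: float = 0.4) -> list:
    """Return the columns whose fraction of null values exceeds `threshold`."""
    # isnull().mean() is the same quantity as isnull().sum() / frame.shape[0]
    null_fraction = frame.isnull().mean()
    return null_fraction[null_fraction > threshold].index.tolist()


# Toy data standing in for the real listings (hypothetical values)
toy = pd.DataFrame({
    "price": [150.0, 450.0, 350.0, 145.0],
    "host_about": [None, None, None, "musician"],
    "license": [None, None, None, None],
})

sparse_columns = columns_above_null_threshold(toy, threshold=0.4)
print(sparse_columns)
```

On the real data, the returned list could be passed straight to `DataFrame.drop(columns=...)`, after checking that no column we care about falls above the cutoff.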

We will also drop the metadata columns, that is, columns that exist only to link this table to others, such as the "ID" columns.

Continuing the selection, we go through the columns one by one to see what else can be discarded because it adds no value to the analysis. Remember that the goal is to predict the price of a listing, so we need informative columns that help us predict that price accurately.

In [5]:
# Inspect the first entries of each column to map what may be relevant and what is not
for column in df.columns:
    values = df[column].head()
    print(values, '\n')

0              15965441
1              47908784
2              52239613
3              10445855
4    565405043878669885
Name: id, dtype: int64 

0              https://www.airbnb.com/rooms/15965441
1              https://www.airbnb.com/rooms/47908784
2              https://www.airbnb.com/rooms/52239613
3              https://www.airbnb.com/rooms/10445855
4    https://www.airbnb.com/rooms/565405043878669885
Name: listing_url, dtype: object 

0    20220620202144
1    20220620202144
2    20220620202144
3    20220620202144
4    20220620202144
Name: scrape_id, dtype: int64 

0    2022-06-20
1    2022-06-20
2    2022-06-20
3    2022-06-20
4    2022-06-20
Name: last_scraped, dtype: object 

0    Quarto de casal com vista para a Baía de Guana...
1       Apartamento bem localizado, bonito e familiar!
2                Apartamento com varanda e linda vista
3                           Campo dos Afonsos, Sulacap
4      Pousada completa: 2 quartos com muita natureza!
Name: name, dtype: object 

0   

Based on our business knowledge, we list the columns that will be dropped because we do not expect them to add value when modeling this problem. They are: "id", "listing_url", "scrape_id", "last_scraped", "name", "description", "picture_url", "host_id", "host_url", "host_name", "host_location", "host_thumbnail_url", "host_picture_url", "host_listings_count", "host_total_listings_count", "host_verifications", "neighbourhood_cleansed", "property_type", "amenities", "minimum_minimum_nights", "maximum_minimum_nights", "minimum_maximum_nights", "maximum_maximum_nights", "minimum_nights_avg_ntm", "maximum_nights_avg_ntm", "calendar_last_scraped", "number_of_reviews_ltm", "number_of_reviews_l30d", "first_review", "last_review", "calculated_host_listings_count", "calculated_host_listings_count_entire_homes", "calculated_host_listings_count_private_rooms".

## ETL pipeline

In this step we apply a few transformations to the dataset so that only what really matters is carried into the data science project.

In [6]:
# Create a copy of the dataset
df_clean = df.copy()

In [7]:
# 1. Drop the columns we selected
df_clean.drop(["neighborhood_overview", "host_about", "host_neighbourhood",
               "neighbourhood", "bathrooms", "neighbourhood_group_cleansed", 
               "calendar_updated", "license", "id", "listing_url", "scrape_id", 
               "last_scraped", "name", "description", "picture_url", "host_id", 
               "host_url", "host_name", "host_location", "host_thumbnail_url", 
               "host_picture_url", "host_listings_count", "host_total_listings_count", 
               "host_verifications", "neighbourhood_cleansed", "property_type", 
               "amenities", "minimum_minimum_nights", "maximum_minimum_nights", 
               "minimum_maximum_nights", "maximum_maximum_nights", "minimum_nights_avg_ntm", 
               "maximum_nights_avg_ntm", "calendar_last_scraped", "number_of_reviews_ltm", 
               "number_of_reviews_l30d", "first_review", "last_review", "calculated_host_listings_count", 
               "calculated_host_listings_count_entire_homes", "calculated_host_listings_count_private_rooms"], 
               axis=1, inplace=True)

With this first step done, we now clean the entries of a few columns so that they are compatible with the project and can be explored in more depth once we reach the data science pipeline.

In [8]:
# 2. Convert the "host_since" column to datetime
df_clean.host_since = pd.to_datetime(df_clean.host_since)

In [9]:
# 3. Remove unwanted characters from some columns
# regex=False makes '$' a literal character instead of a regex end-of-string anchor
df_clean.host_response_rate = df_clean.host_response_rate.str.replace('%', '', regex=False).astype(float) / 100
df_clean.host_acceptance_rate = df_clean.host_acceptance_rate.str.replace('%', '', regex=False).astype(float) / 100
df_clean.price = df_clean.price.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
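
To illustrate what this cleaning does, here is a small sketch on toy values, including a missing entry and a price with a thousands separator. The series `raw_price` and `raw_rate` are hypothetical, and `pd.to_numeric` is used here in place of `astype(float)` purely as an alternative way to do the final conversion.

```python
import pandas as pd

# Hypothetical raw values in the same format as the listings file
raw_price = pd.Series(["$150.00", "$1,250.00", None])
raw_rate = pd.Series(["83%", "100%", None])

# Strip the currency symbol and thousands separator as literal characters
price = pd.to_numeric(
    raw_price.str.replace("$", "", regex=False).str.replace(",", "", regex=False)
)

# Strip the percent sign and rescale to the 0-1 range
rate = pd.to_numeric(raw_rate.str.replace("%", "", regex=False)) / 100

print(price)
print(rate)
```

Missing entries stay missing after the conversion, so they can still be handled later in the data science pipeline.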

In [10]:
# 4. Remove unwanted characters from the "bathrooms_text" column
mapping_dict_values = {
                      ' baths': '',
                      ' bath': '',
                      ' private bath': '',
                      ' private': '',
                      ' shared baths': '',
                      'Half-bath': '0.5',
                      'Shared half-bath': '0.5',
                      ' shared bath': '',
                      ' shared': '',
                      'Private half-bath': '0.5'
                      }
df_clean.bathrooms_text = df_clean.bathrooms_text.replace(mapping_dict_values, regex=True).astype(float)
df_clean.rename(columns={'bathrooms_text': 'bathrooms'}, inplace=True)  # rename the column
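
The replacement dictionary above depends on the order in which its patterns are applied. As an alternative sketch, not the approach used in this notebook, the same values can be parsed by extracting the first number from the text and treating any "half-bath" variant as 0.5. The function name `parse_bathrooms` is hypothetical.

```python
import re
from typing import Optional

# First integer or decimal number in the text, e.g. "1.5" in "1.5 shared baths"
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_bathrooms(text: Optional[str]) -> Optional[float]:
    """Convert a `bathrooms_text` value into a number of bathrooms."""
    if not isinstance(text, str):
        return None  # missing values (None or NaN) stay missing
    if "half-bath" in text.lower():
        return 0.5  # "Half-bath", "Shared half-bath", "Private half-bath"
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


for raw in ["1 bath", "2 baths", "1.5 shared baths", "Private half-bath"]:
    print(raw, "->", parse_bathrooms(raw))
```

A function like this could be applied to the column with `Series.map` before the rename, and it does not break when a label not listed in the dictionary appears.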

Transformations done! Let's split the dataset into training and test sets to feed the data science pipeline.

## Creating a test set

In [11]:
# Split the dataset into training and test sets
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df_clean, test_size = 0.2, random_state = 42)
train_set_size = len(train_set)
test_set_size = len(test_set)

print('Training set size:', train_set_size)
print('Test set size:', test_set_size)

Training set size: 19904
Test set size: 4977
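
As a quick sanity check, the sizes above can be reproduced by hand. The sketch below assumes that the test share is rounded up and that the training set receives the remainder, which is consistent with the output printed above.

```python
import math

n_samples = 24881  # number of rows reported by df.info()
test_size = 0.2    # fraction passed to train_test_split

# 24881 * 0.2 = 4976.2, rounded up to a whole number of rows
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)
```

The two numbers add up to the full dataset, so no row is lost or duplicated by the split.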


In [13]:
# # Export the training and test files to feed the DS pipeline
# train_set.to_csv('/Users/Vitor/Desktop/InovaMed/Teste Modelo ML/train_set.csv', index=False)  
# test_set.to_csv('/Users/Vitor/Desktop/InovaMed/Teste Modelo ML/test_set.csv', index=False) 