# Airbnb data analysis - Rio de Janeiro

Data source: http://insideairbnb.com/get-the-data <br>
Data dictionary: https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=1322284596



In [23]:
# Import the libraries.

In [24]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
#warnings.filterwarnings("ignore")
%matplotlib inline

# The source files are in the "datasets/raw" folder
import os
# os.path.join builds the path with the separator of the current operating system
for dirname, _, filenames in os.walk(os.path.join('datasets', 'raw')):
    for filename in filenames:
        print(os.path.join(dirname, filename))

datasets\raw\calendar.csv.gz
datasets\raw\listings.csv
datasets\raw\listings.csv.gz
datasets\raw\neighbourhoods.csv
datasets\raw\neighbourhoods.geojson
datasets\raw\reviews.csv
datasets\raw\reviews.csv.gz


# Contents of each table

*@@@ We could say a little about where the spreadsheets come from and give some context on the site that publishes them (Inside Airbnb, an independent project rather than a site run by Airbnb itself), which splits the information into a few categories*

| File | Dataset date | Original description | Description |
|---|---|---|---|
| calendar.csv.gz | June 20, 2022 | *Detailed Calendar data* | Dataset with detailed information on XXX / XXX |
| listings.csv.gz | June 20, 2022 | *Detailed Listings data* | Dataset with detailed information on XXX / XXX |
| reviews.csv.gz | June 20, 2022 | *Detailed Review data* | Dataset with detailed information on XXX / XXX |
| listings.csv | June 20, 2022 | *Summary information and metrics for<br> listings in Rio de Janeiro (good for<br> visualisations).* | Summarized dataset of XXX / XXX |
| reviews.csv | June 20, 2022 | *Summary Review data and Listing ID <br>(to facilitate time based analytics and <br>visualisations linked to a listing).* | Summarized dataset of XXX / XXX |
| neighbourhoods.csv | Not stated | *Neighbourhood list for geo filter. <br>Sourced from city or open source GIS files.* | Summarized dataset of XXX / XXX |
| neighbourhoods.geojson | Not stated | *GeoJSON file of neighbourhoods of the city.* | Summarized dataset of XXX / XXX |


*"We understand that the data source provides more detailed files, with XXX information, about XXX, and with XXX, XXX, XXX, ..., XXX as the most important fields.
@@@ We can write a bit more here about what we can find and what questions we still have; this is the first part of the work"*

In [25]:
# After listing the files, we use pandas to create
# the dataframes used in the analysis,
# one for each downloaded table
calendario = pd.read_csv("datasets/raw/calendar.csv.gz")
reservas = pd.read_csv("datasets/raw/listings.csv.gz")
reviews = pd.read_csv("datasets/raw/reviews.csv.gz")
resumo_reservas = pd.read_csv("datasets/raw/listings.csv")
resumo_reviews = pd.read_csv("datasets/raw/reviews.csv")
geoloc = pd.read_csv("datasets/raw/neighbourhoods.csv")
# geojson = pd.read_csv("datasets/raw/neighbourhoods.geojson")
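
GeoJSON is a JSON format, so `pd.read_csv` would not parse `neighbourhoods.geojson` into useful columns, which may be why that line is commented out. A minimal sketch of one alternative, using the standard `json` module plus pandas, follows. The helper name `geojson_properties_to_frame` and the inline two-feature sample are hypothetical, and the property names in the real file are an assumption that should be checked against the file itself.

```python
import json

import pandas as pd


def geojson_properties_to_frame(geojson: dict) -> pd.DataFrame:
    """Return one row per GeoJSON feature, built from each feature's properties."""
    features = geojson.get("features", [])
    rows = [feature.get("properties") or {} for feature in features]
    return pd.DataFrame(rows)


# Hypothetical sample standing in for neighbourhoods.geojson (assumed property names).
sample_text = """
{"type": "FeatureCollection",
 "features": [
   {"type": "Feature", "geometry": null,
    "properties": {"neighbourhood": "Abolição", "neighbourhood_group": null}},
   {"type": "Feature", "geometry": null,
    "properties": {"neighbourhood": "Acari", "neighbourhood_group": null}}
 ]}
"""
bairros = geojson_properties_to_frame(json.loads(sample_text))
print(bairros)

# With the real file, the same helper could be used like this:
# with open("datasets/raw/neighbourhoods.geojson", encoding="utf-8") as f:
#     bairros = geojson_properties_to_frame(json.load(f))
```

This sketch keeps only the feature properties and ignores the geometries; plotting the neighbourhood shapes would need a geospatial library, which is outside what the notebook currently imports.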



## calendario

In [26]:
# .head() shows the first rows of the file,
# which helps map out the information we have
calendario.head()


Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,28053241,2022-06-21,t,"$1,850.00","$1,850.00",2,1125
1,1174007,2022-06-21,t,$107.00,$107.00,2,30
2,1174007,2022-06-22,t,$107.00,$107.00,2,30
3,1174007,2022-06-23,t,$107.00,$107.00,2,30
4,1174007,2022-06-24,t,$107.00,$107.00,2,30


In [27]:
# .info() shows information about the dataframe,
# such as the number of columns, their names and their types
calendario.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9081565 entries, 0 to 9081564
Data columns (total 7 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   listing_id      int64 
 1   date            object
 2   available       object
 3   price           object
 4   adjusted_price  object
 5   minimum_nights  int64 
 6   maximum_nights  int64 
dtypes: int64(3), object(4)
memory usage: 485.0+ MB
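
The `info()` output shows `date`, `available`, `price` and `adjusted_price` stored as `object`, that is, as text. A minimal sketch of how these columns could be converted before any numeric or time-based analysis is given here. The sample frame and the helper name `convert_calendar_types` are hypothetical, and the sketch assumes that `available` only takes the values `"t"` and `"f"` (only `"t"` appears in the rows shown) and that prices always follow the `$1,850.00` pattern.

```python
import pandas as pd

# Hypothetical three-row sample mirroring the columns shown by calendario.head().
sample = pd.DataFrame({
    "listing_id": [28053241, 1174007, 1174007],
    "date": ["2022-06-21", "2022-06-21", "2022-06-22"],
    "available": ["t", "t", "f"],
    "price": ["$1,850.00", "$107.00", "$107.00"],
    "adjusted_price": ["$1,850.00", "$107.00", "$107.00"],
})


def convert_calendar_types(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the date, flag and price columns parsed from text."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"], format="%Y-%m-%d")
    # Assumption: the flag is "t" or "f"; any other value would become missing.
    out["available"] = out["available"].map({"t": True, "f": False})
    for col in ("price", "adjusted_price"):
        # Strip the currency symbol and the thousands separator before casting.
        out[col] = out[col].str.replace(r"[$,]", "", regex=True).astype(float)
    return out


converted = convert_calendar_types(sample)
print(converted.dtypes)
```

On the full table the call would be `convert_calendar_types(calendario)`; with roughly nine million rows, it may be worth checking the memory usage again afterwards.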


## reservas

In [28]:
reservas.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,15965441,https://www.airbnb.com/rooms/15965441,20220620202144,2022-06-20,Quarto de casal com vista para a Baía de Guana...,"Meu espaço é bom para casais, aventuras indivi...",,https://a0.muscache.com/pictures/76550464-7859...,103691209,https://www.airbnb.com/users/show/103691209,...,,,,,f,3,0,3,0,
1,47908784,https://www.airbnb.com/rooms/47908784,20220620202144,2022-06-20,"Apartamento bem localizado, bonito e familiar!",,,https://a0.muscache.com/pictures/f44537ff-72f1...,83985216,https://www.airbnb.com/users/show/83985216,...,,,,,f,1,1,0,0,
2,52239613,https://www.airbnb.com/rooms/52239613,20220620202144,2022-06-20,Apartamento com varanda e linda vista,"Condomínio com porteiro 24 horas , piscina, sa...",O condomínio fica em frente ao portão 2 do PRO...,https://a0.muscache.com/pictures/miso/Hosting-...,422870631,https://www.airbnb.com/users/show/422870631,...,4.78,4.78,4.89,,f,1,1,0,0,1.16
3,10445855,https://www.airbnb.com/rooms/10445855,20220620202144,2022-06-20,"Campo dos Afonsos, Sulacap",Casa com vista para as instalações do Parque ...,"Bairro suburbano, tranquilo, seguro, casas bem...",https://a0.muscache.com/pictures/0f42e026-0955...,1647571,https://www.airbnb.com/users/show/1647571,...,4.87,4.7,4.57,,f,1,1,0,0,0.64
4,565405043878669885,https://www.airbnb.com/rooms/565405043878669885,20220620202144,2022-06-20,Pousada completa: 2 quartos com muita natureza!,Este lugar único e cheio de estilo é o cenário...,,https://a0.muscache.com/pictures/miso/Hosting-...,24596747,https://www.airbnb.com/users/show/24596747,...,,,,,f,2,0,2,0,


In [29]:
reservas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24881 entries, 0 to 24880
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            24881 non-null  int64  
 1   listing_url                                   24881 non-null  object 
 2   scrape_id                                     24881 non-null  int64  
 3   last_scraped                                  24881 non-null  object 
 4   name                                          24860 non-null  object 
 5   description                                   23975 non-null  object 
 6   neighborhood_overview                         13370 non-null  object 
 7   picture_url                                   24881 non-null  object 
 8   host_id                                       24881 non-null  int64  
 9   host_url                                      24881 non-null 

## reviews

In [30]:
reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,1174007,8837512,2013-11-20,9438993,Chloe,Definitely worth the walk up the hill to get t...
1,556930647599893392,565830572071983778,2022-02-19,68918783,Samila,Uma experiência incrível é única! <br/>Andréia...
2,1174007,9452466,2013-12-29,7418078,Julien,Thiago and his family were the best hosts ever...
3,1174007,9806285,2014-01-11,3738399,Jazmin,Muy lindo el hostel. La vista es excelente y l...
4,1174007,9986412,2014-01-23,124357,Valeria,Really Neat and Clean. Great Place with amazin...


In [31]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418469 entries, 0 to 418468
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     418469 non-null  int64 
 1   id             418469 non-null  int64 
 2   date           418469 non-null  object
 3   reviewer_id    418469 non-null  int64 
 4   reviewer_name  418469 non-null  object
 5   comments       418459 non-null  object
dtypes: int64(3), object(3)
memory usage: 19.2+ MB


## resumo_reservas

In [32]:
resumo_reservas.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,17878,"Very Nice 2Br in Copacabana w. balcony, fast WiFi",68997,Matthias,,Copacabana,-22.96599,-43.1794,Entire home/apt,350,5,272,2022-04-23,1.87,1,311,8,
1,556930647599893392,Venha passar uma pernoite em um veleiro na Urca!,198206849,Andréia,,Urca,-22.94781,-43.16351,Private room,280,1,1,2022-02-19,0.25,1,365,1,
2,1174007,100% Best View In Copa In Suite 2,3962758,Thiago Luiz,,Copacabana,-22.97277,-43.17966,Private room,107,2,177,2022-06-07,1.69,6,357,29,
3,8410797,Ipanema(Arpoador) 100mdo Mar/ Jan & Carnaval +Fev,42038091,Sheila,,Ipanema,-22.98871,-43.19334,Private room,1000,3,1,2017-09-28,0.02,3,362,0,
4,28053241,Navegar a Bordo de um Veleiro Francês no Rio!,193860988,Luciano,,Urca,-22.95056,-43.17175,Private room,1850,2,0,,,1,180,0,


In [33]:
resumo_reservas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24881 entries, 0 to 24880
Data columns (total 18 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              24881 non-null  int64  
 1   name                            24860 non-null  object 
 2   host_id                         24881 non-null  int64  
 3   host_name                       24764 non-null  object 
 4   neighbourhood_group             0 non-null      float64
 5   neighbourhood                   24881 non-null  object 
 6   latitude                        24881 non-null  float64
 7   longitude                       24881 non-null  float64
 8   room_type                       24881 non-null  object 
 9   price                           24881 non-null  int64  
 10  minimum_nights                  24881 non-null  int64  
 11  number_of_reviews               24881 non-null  int64  
 12  last_review                     

## resumo_reviews

In [34]:
resumo_reviews.head()

Unnamed: 0,listing_id,date
0,556930647599893392,2022-02-19
1,1174007,2013-11-20
2,1174007,2013-12-29
3,1174007,2014-01-11
4,1174007,2014-01-23


In [35]:
resumo_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418469 entries, 0 to 418468
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   listing_id  418469 non-null  int64 
 1   date        418469 non-null  object
dtypes: int64(1), object(1)
memory usage: 6.4+ MB
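
Both `reviews` and `resumo_reviews` report 418469 rows, and the first rows of the summary file look like the `listing_id` and `date` columns of the detailed file in a different order. A minimal sketch of how that impression could be checked is shown here. It assumes, without confirmation from the source, that the summary is a column subset of the detailed file; the sample frames and the helper name `same_rows_ignoring_order` are hypothetical.

```python
import pandas as pd

# Hypothetical samples built from values visible in the head() outputs.
detailed = pd.DataFrame({
    "listing_id": [1174007, 556930647599893392, 1174007],
    "id": [8837512, 565830572071983778, 9452466],
    "date": ["2013-11-20", "2022-02-19", "2013-12-29"],
})
summary = pd.DataFrame({
    "listing_id": [556930647599893392, 1174007, 1174007],
    "date": ["2022-02-19", "2013-11-20", "2013-12-29"],
})


def same_rows_ignoring_order(left: pd.DataFrame, right: pd.DataFrame, cols: list) -> bool:
    """Compare two frames on `cols` after sorting, so row order does not matter."""
    a = left[cols].sort_values(cols).reset_index(drop=True)
    b = right[cols].sort_values(cols).reset_index(drop=True)
    return a.equals(b)


print(same_rows_ignoring_order(detailed, summary, ["listing_id", "date"]))
```

If the same check returned `True` for `reviews` and `resumo_reviews`, the summary file would add nothing beyond the detailed one and could be left out of the analysis.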


## geoloc

In [37]:
geoloc.head()

Unnamed: 0,neighbourhood_group,neighbourhood
0,,Abolição
1,,Acari
2,,Água Santa
3,,Alto da Boa Vista
4,,Anchieta


In [38]:
geoloc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 160 entries, 0 to 159
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   neighbourhood_group  0 non-null      float64
 1   neighbourhood        160 non-null    object 
dtypes: float64(1), object(1)
memory usage: 2.6+ KB


## *@@@ We now start to understand the state of the data in each table*
We need to know whether there are fields with null values, how many rows there are, which columns we will use, XXX, ..., XXX

In [39]:
calendario.isnull().sum()

listing_id        0
date              0
available         0
price             0
adjusted_price    0
minimum_nights    0
maximum_nights    0
dtype: int64

In [40]:
reservas.isnull().sum()

id                                                 0
listing_url                                        0
scrape_id                                          0
last_scraped                                       0
name                                              21
                                                ... 
calculated_host_listings_count                     0
calculated_host_listings_count_entire_homes        0
calculated_host_listings_count_private_rooms       0
calculated_host_listings_count_shared_rooms        0
reviews_per_month                               7668
Length: 74, dtype: int64

In [41]:
reviews.isnull().sum()

listing_id        0
id                0
date              0
reviewer_id       0
reviewer_name     0
comments         10
dtype: int64

In [42]:
resumo_reservas.isnull().sum()

id                                    0
name                                 21
host_id                               0
host_name                           117
neighbourhood_group               24881
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                        7668
reviews_per_month                  7668
calculated_host_listings_count        0
availability_365                      0
number_of_reviews_ltm                 0
license                           24881
dtype: int64

In [43]:
resumo_reviews.isnull().sum()

listing_id    0
date          0
dtype: int64

In [44]:
geoloc.isnull().sum()

neighbourhood_group    160
neighbourhood            0
dtype: int64
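
The per-table counts above are easier to compare when gathered in one place, together with the share of rows they represent. A minimal sketch of such a report follows; the helper name `null_report` and the small sample frames are hypothetical and only stand in for the real dataframes.

```python
import pandas as pd


def null_report(frames: dict) -> pd.DataFrame:
    """List every column that has nulls, with its count and share of rows."""
    parts = []
    for name, df in frames.items():
        counts = df.isnull().sum()
        part = pd.DataFrame({
            "table": name,
            "column": counts.index,
            "nulls": counts.to_numpy(),
            "pct": (counts.to_numpy() / len(df) * 100).round(2),
        })
        # Keep only the columns that actually contain nulls.
        parts.append(part[part["nulls"] > 0])
    return pd.concat(parts, ignore_index=True)


# Hypothetical small frames standing in for the real tables.
sample_frames = {
    "geoloc": pd.DataFrame({"neighbourhood_group": [None, None],
                            "neighbourhood": ["Abolição", "Acari"]}),
    "reviews": pd.DataFrame({"id": [1, 2, 3, 4],
                             "comments": ["ok", None, "bom", "great"]}),
}
report = null_report(sample_frames)
print(report)

# With the notebook's dataframes the call could look like:
# null_report({"calendario": calendario, "reservas": reservas, "reviews": reviews})
```

The sketch assumes every frame has at least one row, since the percentage divides by the row count.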

# Data cleaning
### *@@@ Do the data cleaning and organization part*
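
One possible first step, given the null counts above: `neighbourhood_group` and `license` are null in all 24881 rows of `resumo_reservas`, and `neighbourhood_group` is null in all 160 rows of `geoloc`, so those columns carry no information. A minimal sketch of removing such columns with `DataFrame.dropna` follows. The sample frame is hypothetical, and whether these columns should really be dropped is a decision for the analysis, not something the source states.

```python
import pandas as pd

# Hypothetical sample with two columns that are entirely null.
sample = pd.DataFrame({
    "id": [17878, 1174007],
    "neighbourhood_group": [None, None],
    "neighbourhood": ["Copacabana", "Copacabana"],
    "license": [None, None],
})

# how="all" drops a column only when every value in it is missing.
cleaned = sample.dropna(axis="columns", how="all")
print(list(cleaned.columns))
```

Columns that are only partly null, such as `last_review` and `reviews_per_month`, are left untouched by this call and need a separate decision.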