# Exercises: Part 06
The file ```AB_NYC_2019.csv``` contains information about spaces listed for booking on Airbnb in New York City.
Based on the ```DataFrame``` created from this file, answer the following:
1. Which columns have at least one missing value?
2. Drop the rows that have at least one missing value.
3. Which columns could be dropped because they offer no relevant information for a statistical analysis? (Hint: use [```Series.nunique()```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nunique.html) to count the number of distinct values in a column.)
4. Drop the columns identified in the previous step using [```DataFrame.drop(lista_de_colunas, axis=1, inplace=True)```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html), where ```lista_de_colunas``` is the list of column names.
5. Change the column data types to reduce memory usage.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('dados/AB_NYC_2019.csv')

### Solution - 1
To answer this question, we use the ```DataFrame.info()``` command, which reports the number of non-null values in each column.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
id                                48895 non-null int64
name                              48879 non-null object
host_id                           48895 non-null int64
host_name                         48874 non-null object
neighbourhood_group               48895 non-null object
neighbourhood                     48895 non-null object
latitude                          48895 non-null float64
longitude                         48895 non-null float64
room_type                         48895 non-null object
price                             48895 non-null int64
minimum_nights                    48895 non-null int64
number_of_reviews                 48895 non-null int64
last_review                       38843 non-null object
reviews_per_month                 38843 non-null float64
calculated_host_listings_count    48895 non-null int64
availability_365                  48895 non-null int64

The columns ```name```, ```host_name```, ```last_review``` and ```reviews_per_month``` have fewer non-null values than the total number of rows in the ```DataFrame``` (48895), so each of them contains at least one missing value.
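Reading the counts off ```info()``` works, but the same answer can be computed directly. The sketch below is one possible way to do it, using ```isna()``` on a small invented ```DataFrame``` (the name ```toy``` and its values are assumptions made for illustration, not part of the exercise data).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio"],
    "price": [100, 150, 80],
    "reviews_per_month": [0.5, np.nan, np.nan],
})

# Count missing values per column, then keep only columns with at least one.
missing_counts = toy.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_missing)  # ['name', 'reviews_per_month']
```

Applied to ```df```, the same two lines should list the four columns identified above.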

### Solution - 2
We use the ```DataFrame.dropna()``` command, setting ```axis=0``` (the default) to drop rows and ```inplace=True``` so that the change is applied to ```df``` itself instead of being returned as a new object.

In [6]:
df.dropna(axis=0, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38821 entries, 0 to 48852
Data columns (total 16 columns):
id                                38821 non-null int64
name                              38821 non-null object
host_id                           38821 non-null int64
host_name                         38821 non-null object
neighbourhood_group               38821 non-null object
neighbourhood                     38821 non-null object
latitude                          38821 non-null float64
longitude                         38821 non-null float64
room_type                         38821 non-null object
price                             38821 non-null int64
minimum_nights                    38821 non-null int64
number_of_reviews                 38821 non-null int64
last_review                       38821 non-null object
reviews_per_month                 38821 non-null float64
calculated_host_listings_count    38821 non-null int64
availability_365                  38821 non-null int64
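Note that the index in the output above still runs up to 48852, because the surviving rows keep their original labels. As a hedged alternative to ```inplace=True```, the sketch below assigns the result of ```dropna()``` to a new variable and then renumbers the rows; ```toy``` and ```clean``` are hypothetical names with invented values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio", "Room"],
    "reviews_per_month": [0.5, 1.2, np.nan, 2.0],
})

# dropna() returns a new DataFrame by default; rows with any NaN are removed.
clean = toy.dropna(axis=0)
# The surviving rows keep their original labels.
print(clean.index.tolist())  # [0, 3]

# Optionally renumber the rows from 0 so the index is contiguous again.
clean = clean.reset_index(drop=True)
print(clean.index.tolist())  # [0, 1]
```

Keeping the original frame untouched makes it easy to compare row counts before and after the cleanup.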

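### Solution - 3
The columns ```id```, ```name```, ```host_id``` and ```host_name``` are identifiers or free-text labels. ```id``` is unique in every row, and the other three take thousands of distinct values, as the counts below show. Such columns label individual listings or hosts rather than describe them, so they add little to a statistical analysis and can be dropped.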

In [7]:
df.id.nunique()

38821

In [8]:
df.name.nunique()

38253

In [9]:
df.host_id.nunique()

30232

In [10]:
df.host_name.nunique()

9885
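Instead of calling ```nunique()``` one column at a time, the check can be run over the whole ```DataFrame``` at once. The sketch below is a rough screening aid, not a rule: the ```THRESHOLD``` value and the toy data are assumptions chosen for illustration, and the final decision about which columns to drop still depends on what each column means.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "host_name": ["Ana", "Bob", "Ana", "Cid"],
    "room_type": ["Private room", "Entire home/apt",
                  "Private room", "Private room"],
})

# Fraction of distinct values per column: 1.0 means every row is different.
unique_ratio = toy.nunique() / len(toy)

# THRESHOLD is an arbitrary cut-off chosen for this illustration.
THRESHOLD = 0.7
candidates = unique_ratio[unique_ratio >= THRESHOLD].index.tolist()
print(candidates)  # ['id', 'host_name']
```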

### Solution - 4

In [11]:
df.drop(['id', 'name', 'host_id', 'host_name'], axis=1, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38821 entries, 0 to 48852
Data columns (total 12 columns):
neighbourhood_group               38821 non-null object
neighbourhood                     38821 non-null object
latitude                          38821 non-null float64
longitude                         38821 non-null float64
room_type                         38821 non-null object
price                             38821 non-null int64
minimum_nights                    38821 non-null int64
number_of_reviews                 38821 non-null int64
last_review                       38821 non-null object
reviews_per_month                 38821 non-null float64
calculated_host_listings_count    38821 non-null int64
availability_365                  38821 non-null int64
dtypes: float64(3), int64(5), object(4)
memory usage: 3.9+ MB
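For reference, ```DataFrame.drop()``` also accepts a ```columns=``` keyword, which some readers find clearer than ```axis=1```. The sketch below shows that form on an invented ```toy``` frame and returns a new object instead of modifying the original.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2],
    "host_name": ["Ana", "Bob"],
    "price": [100, 150],
})

# columns= is equivalent to passing the labels together with axis=1.
trimmed = toy.drop(columns=["id", "host_name"])
print(trimmed.columns.tolist())  # ['price']
```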


### Solution - 5
The ```DataFrame.describe()``` command gives a sense of the range of values in each numeric column, which tells us which narrower data types can hold the data.

In [12]:
df.describe()

| | latitude | longitude | price | minimum_nights | number_of_reviews | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|
| count | 38821.0 | 38821.0 | 38821.0 | 38821.0 | 38821.0 | 38821.0 | 38821.0 | 38821.0 |
| mean | 40.728129 | -73.951149 | 142.332526 | 5.86922 | 29.290255 | 1.373229 | 5.166611 | 114.886299 |
| std | 0.054991 | 0.046693 | 196.994756 | 17.389026 | 48.1829 | 1.680328 | 26.302954 | 129.52995 |
| min | 40.50641 | -74.24442 | 0.0 | 1.0 | 1.0 | 0.01 | 1.0 | 0.0 |
| 25% | 40.68864 | -73.98246 | 69.0 | 1.0 | 3.0 | 0.19 | 1.0 | 0.0 |
| 50% | 40.72171 | -73.95481 | 101.0 | 2.0 | 9.0 | 0.72 | 1.0 | 55.0 |
| 75% | 40.76299 | -73.93502 | 170.0 | 4.0 | 33.0 | 2.02 | 2.0 | 229.0 |
| max | 40.91306 | -73.71299 | 10000.0 | 1250.0 | 629.0 | 58.5 | 327.0 | 365.0 |
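The maximum values reported by ```describe()``` can be compared against the limits of each integer type. The sketch below does this with ```np.iinfo```; the helper ```smallest_signed_int``` is a hypothetical function written for this illustration, and the dictionary simply copies the ```max``` row for the integer columns.

```python
import numpy as np

# Maximum values copied from the describe() output for the integer columns.
observed_max = {
    "price": 10000,
    "minimum_nights": 1250,
    "number_of_reviews": 629,
    "calculated_host_listings_count": 327,
    "availability_365": 365,
}

def smallest_signed_int(max_value, min_value=0):
    """Return the name of the narrowest signed integer dtype for the range."""
    for dtype in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(dtype)
        if info.min <= min_value and max_value <= info.max:
            return np.dtype(dtype).name
    raise ValueError("range does not fit in int64")

suggested = {col: smallest_signed_int(m) for col, m in observed_max.items()}
print(suggested)  # every column maps to 'int16'
```

Under these assumptions, all five integer columns fit in ```int16```, not only ```price```.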


In [15]:
df.latitude = df.latitude.astype('float16')
df.longitude = df.longitude.astype('float16')
df.price = df.price.astype('int16')
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38821 entries, 0 to 48852
Data columns (total 12 columns):
neighbourhood_group               38821 non-null object
neighbourhood                     38821 non-null object
latitude                          38821 non-null float16
longitude                         38821 non-null float16
room_type                         38821 non-null object
price                             38821 non-null int16
minimum_nights                    38821 non-null int64
number_of_reviews                 38821 non-null int64
last_review                       38821 non-null object
reviews_per_month                 38821 non-null float64
calculated_host_listings_count    38821 non-null int64
availability_365                  38821 non-null int64
dtypes: float16(2), float64(1), int16(1), int64(4), object(4)
memory usage: 3.2+ MB
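One caveat about the conversion above: ```float16``` keeps only about three significant decimal digits. Between 32 and 64 it can represent values only in steps of 1/32 (about 0.031), which for latitude corresponds to roughly 3 km, so neighbouring listings collapse onto the same coordinate. The sketch below measures that rounding error and shows one possible alternative: ```float32``` for coordinates, ```category``` for repeated text labels, and ```pd.to_numeric(..., downcast=...)``` for integers. The ```toy``` frame and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# The mean latitude from the describe() output, used as a sample coordinate.
lat = 40.728129

# Rounding error introduced by each floating-point type.
err16 = abs(float(np.float16(lat)) - lat)
err32 = abs(float(np.float32(lat)) - lat)
print(err16, err32)  # err16 is several orders of magnitude larger than err32

# Toy stand-in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "latitude": [40.64749, 40.75362, 40.80902],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "minimum_nights": [1, 1, 3],
})

# float32 halves the memory of float64 while keeping coordinates usable.
toy["latitude"] = toy["latitude"].astype("float32")
# Text columns with few distinct values are stored compactly as categories.
toy["room_type"] = toy["room_type"].astype("category")
# downcast="integer" picks the smallest signed integer type that fits.
toy["minimum_nights"] = pd.to_numeric(toy["minimum_nights"], downcast="integer")

print(toy.dtypes)
# deep=True also counts the memory held by Python string objects.
print(toy.memory_usage(deep=True))
```

Whether the precision of ```float32``` is sufficient depends on how the coordinates will be used, so it is worth checking the error against the needs of the analysis before converting.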
