## 2. Data Understanding
📒 `1.0-rc-data-understanding.ipynb`
- Load the raw dataset
- Schema overview (columns, types, unique values)
- Missing values and outliers
- Data distributions
- Correlations between the features and the target (price)
- Initial plots: histograms, boxplots, heatmaps

### 📘 Data Dictionary

About the dataset: This dataset contains around 13,000 apartments for sale and for rent in the city of São Paulo, Brazil. The data comes from multiple sources, especially real estate classifieds websites.


|   id | Variable Name      | Role    | Type         | Description                                                              |
|------|--------------------|---------|--------------|--------------------------------------------------------------------------|
|   00 | Price              | Target  | Numeric      | Listing price in R$: monthly rent or sale price, per Negotiation Type    |
|   01 | Condo              | Feature | Numeric      | Condominium fee (in R$)                                                  |
|   02 | Size               | Feature | Numeric      | Usable area of the property in square meters                             |
|   03 | Rooms              | Feature | Numeric      | Number of rooms                                                          |
|   04 | Toilets            | Feature | Numeric      | Number of bathrooms                                                      |
|   05 | Suites             | Feature | Numeric      | Number of suites                                                         |
|   06 | Parking            | Feature | Numeric      | Number of parking spaces                                                 |
|   07 | Elevator           | Feature | Binary       | 1 if the building has an elevator, 0 otherwise                           |
|   08 | Furnished          | Feature | Binary       | 1 if the property is furnished, 0 otherwise                              |
|   09 | Swimming Pool      | Feature | Binary       | 1 if there is a swimming pool, 0 otherwise                               |
|   10 | New                | Feature | Binary       | 1 if the property is new, 0 otherwise                                    |
|   11 | District           | Feature | Categorical  | Name of the neighborhood and city where the property is located          |
|   12 | Negotiation Type   | Feature | Categorical  | Type of negotiation (rent or sale)                                       |
|   13 | Property Type      | Feature | Categorical  | Type of property (e.g. apartment, house)                                 |
|   14 | Latitude           | Feature | Numeric      | Geographic latitude of the property                                      |
|   15 | Longitude          | Feature | Numeric      | Geographic longitude of the property                                     |

---

In [3]:
import pandas as pd
import numpy as np
#from _utils import carregar_dados

# Load the data
path = "../data/raw/sao-paulo-real-state.csv"
df = pd.read_csv(path)

# View the top 5 rows of the dataset
df.head()

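|   | Price | Condo | Size | Rooms | Toilets | Suites | Parking | Elevator | Furnished | Swimming Pool | New | District | Negotiation Type | Property Type | Latitude | Longitude |
|---|-------|-------|------|-------|---------|--------|---------|----------|-----------|---------------|-----|----------|------------------|---------------|----------|-----------|
| 0 | 930 | 220 | 47 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.543138 | -46.479486 |
| 1 | 1000 | 148 | 45 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.550239 | -46.480718 |
| 2 | 1000 | 100 | 48 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.542818 | -46.485665 |
| 3 | 1000 | 200 | 48 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.547171 | -46.483014 |
| 4 | 1300 | 410 | 55 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.525025 | -46.482436 |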


In [4]:
# View the dataset shape
print("Nº Rows:", df.shape[0])
print("Nº Cols:", df.shape[1])

Nº Rows: 13640
Nº Cols: 16


In [5]:
# Rename the columns to short snake_case names (order must match the CSV)
cols_name = [
    "price", "condo", "size", "rooms", "toilets",
    "suites", "parking", "elevator", "furnished",
    "swim_pool", "new", "district", "negotiation_type",
    "property_type", "lat", "long",
]

df.columns = cols_name
df.head()


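|   | price | condo | size | rooms | toilets | suites | parking | elevator | furnished | swim_pool | new | district | negotiation_type | property_type | lat | long |
|---|-------|-------|------|-------|---------|--------|---------|----------|-----------|-----------|-----|----------|------------------|---------------|-----|------|
| 0 | 930 | 220 | 47 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.543138 | -46.479486 |
| 1 | 1000 | 148 | 45 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.550239 | -46.480718 |
| 2 | 1000 | 100 | 48 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.542818 | -46.485665 |
| 3 | 1000 | 200 | 48 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.547171 | -46.483014 |
| 4 | 1300 | 410 | 55 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | Artur Alvim/São Paulo | rent | apartment | -23.525025 | -46.482436 |
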
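
Assigning a list to `df.columns` renames by position, so it silently mislabels columns if the CSV column order ever changes. A more defensive sketch renames by name with an explicit mapping. The small `sample` frame and the `rename_map` dictionary below are made up for illustration and cover only a subset of the columns:

```python
import pandas as pd

# Hypothetical two-row sample with a subset of the original column names
sample = pd.DataFrame({
    "Price": [930, 1000],
    "Swimming Pool": [0, 0],
    "Negotiation Type": ["rent", "rent"],
    "Latitude": [-23.543138, -23.550239],
    "Longitude": [-46.479486, -46.480718],
})

# Explicit old -> new mapping; columns not listed here are left untouched
rename_map = {
    "Price": "price",
    "Swimming Pool": "swim_pool",
    "Negotiation Type": "negotiation_type",
    "Latitude": "lat",
    "Longitude": "long",
}

renamed = sample.rename(columns=rename_map)
print(renamed.columns.tolist())
```

With a mapping, a missing or reordered source column leaves the other names correct instead of shifting every label by one.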


In [6]:
# Check the data types: every column has the expected type
df.dtypes

price                 int64
condo                 int64
size                  int64
rooms                 int64
toilets               int64
suites                int64
parking               int64
elevator              int64
furnished             int64
swim_pool             int64
new                   int64
district             object
negotiation_type     object
property_type        object
lat                 float64
long                float64
dtype: object
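
The binary flags are stored as `int64` and the categorical columns as `object`. If more explicit types are wanted for later steps, one possible sketch is to cast them with `astype`. The `sample` frame and the two column lists are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Hypothetical sample with two binary flags and one categorical column
sample = pd.DataFrame({
    "elevator": [0, 1, 0],
    "furnished": [0, 0, 1],
    "negotiation_type": ["rent", "sale", "rent"],
})

binary_cols = ["elevator", "furnished"]
categorical_cols = ["negotiation_type"]

# Build one dtype mapping: 0/1 flags -> bool, string labels -> category
dtype_map = {col: "bool" for col in binary_cols}
dtype_map.update({col: "category" for col in categorical_cols})

typed = sample.astype(dtype_map)
print(typed.dtypes)
```

Whether this cast helps depends on the modelling step; many estimators expect numeric input, in which case keeping the 0/1 integers is simpler.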

In [7]:
# Check for null values: none found
df.isnull().sum()

price               0
condo               0
size                0
rooms               0
toilets             0
suites              0
parking             0
elevator            0
furnished           0
swim_pool           0
new                 0
district            0
negotiation_type    0
property_type       0
lat                 0
long                0
dtype: int64

In [8]:
df.describe()

|       | price | condo | size | rooms | toilets | suites | parking | elevator | furnished | swim_pool | new | lat | long |
|-------|-------|-------|------|-------|---------|--------|---------|----------|-----------|-----------|-----|-----|------|
| count | 13640.0 | 13640.0 | 13640.0 | 13640.0 | 13640.0 | 13640.0 | 13640.0 | 13640.0 | 13640.0 | 13640.0 | 13640.0 | 13640.0 | 13640.0 |
| mean | 287737.8 | 689.882331 | 84.3739 | 2.312023 | 2.07368 | 0.980792 | 1.393182 | 0.354179 | 0.146774 | 0.51217 | 0.015616 | -22.077047 | -43.597088 |
| std | 590821.4 | 757.649363 | 58.435676 | 0.777461 | 0.961803 | 0.834891 | 0.829932 | 0.478281 | 0.353894 | 0.49987 | 0.123988 | 5.866633 | 11.487288 |
| min | 480.0 | 0.0 | 30.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -46.749039 | -58.364352 |
| 25% | 1858.75 | 290.0 | 50.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | -23.594552 | -46.681671 |
| 50% | 8100.0 | 500.0 | 65.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | -23.552813 | -46.637255 |
| 75% | 360000.0 | 835.0 | 94.0 | 3.0 | 2.0 | 1.0 | 2.0 | 1.0 | 0.0 | 1.0 | 0.0 | -23.51764 | -46.56004 |
| max | 10000000.0 | 9500.0 | 880.0 | 10.0 | 8.0 | 6.0 | 9.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |

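
The summary statistics hint at coordinate problems: `lat` has a minimum of about -46.75 (which looks like a longitude) and both `lat` and `long` reach 0.0, even though no nulls were reported. A minimal sketch for flagging such rows checks each point against a bounding box. The box limits below are rough assumed values for the city of São Paulo, and the `sample` rows are invented for illustration:

```python
import pandas as pd

# Hypothetical sample: two plausible points, one zero-filled, one swapped
sample = pd.DataFrame({
    "lat": [-23.543138, 0.0, -46.749039, -23.550239],
    "long": [-46.479486, 0.0, -23.594552, -46.480718],
})

# Rough bounding box around São Paulo (assumed values, adjust as needed)
LAT_RANGE = (-24.1, -23.3)
LONG_RANGE = (-46.9, -46.3)

# A row is valid only if both coordinates fall inside the box
valid = sample["lat"].between(*LAT_RANGE) & sample["long"].between(*LONG_RANGE)
suspect = sample.loc[~valid]

print(f"{len(suspect)} of {len(sample)} rows have suspect coordinates")
```

Rows flagged this way can then be inspected before deciding whether to drop them, swap the coordinates, or treat them as missing.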


In [11]:
# Print the first 10 unique values of each column
for col in df.columns:
    unique_vals = df[col].unique()
    print(f"Name Col: {col}: \n Unique Values: {unique_vals[:10]} \n")

Name Col: price: 
 Unique Values: [ 930 1000 1300 1170  900  760  800 1800 1600 1500] 

Name Col: condo: 
 Unique Values: [220 148 100 200 410   0 180 150 160 130] 

Name Col: size: 
 Unique Values: [ 47  45  48  55  50  52  40  65 100  38] 

Name Col: rooms: 
 Unique Values: [ 2  1  3  4  5 10  6  7] 

Name Col: toilets: 
 Unique Values: [2 3 4 1 5 6 7 8] 

Name Col: suites: 
 Unique Values: [1 3 2 4 0 5 6] 

Name Col: parking: 
 Unique Values: [1 2 3 4 5 6 8 9 0 7] 

Name Col: elevator: 
 Unique Values: [0 1] 

Name Col: furnished: 
 Unique Values: [0 1] 

Name Col: swim_pool: 
 Unique Values: [0 1] 

Name Col: new: 
 Unique Values: [0 1] 

Name Col: district: 
 Unique Values: ['Artur Alvim/São Paulo' 'Belém/São Paulo' 'Cangaíba/São Paulo'
 'Carrão/São Paulo' 'Cidade Líder/São Paulo' 'Cidade Tiradentes/São Paulo'
 'Ermelino Matarazzo/São Paulo' 'Iguatemi/São Paulo'
 'Itaim Paulista/São Paulo' 'Itaquera/São Paulo'] 

Name Col: negotiation_type: 
 Unique Values: ['rent' 'sale'] 

Name 

In [23]:
df["negotiation_type"].unique()

array(['rent', 'sale'], dtype=object)
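
Because `negotiation_type` contains both `rent` and `sale`, the `price` column mixes monthly rents with sale prices, which explains the wide spread between the median (8,100) and the maximum (10,000,000) in the summary statistics. A small sketch of summarising `price` per negotiation type, using an invented `sample` frame:

```python
import pandas as pd

# Hypothetical sample mixing rent and sale listings
sample = pd.DataFrame({
    "price": [930, 1000, 1300, 360000, 420000],
    "negotiation_type": ["rent", "rent", "rent", "sale", "sale"],
})

# Summarise price separately for each negotiation type
summary = (
    sample.groupby("negotiation_type")["price"]
    .agg(["count", "median", "min", "max"])
)
print(summary)
```

Analysing or modelling the two groups separately avoids treating a sale price as if it were an extreme rent.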

In [24]:
df["property_type"].unique()

array(['apartment'], dtype=object)
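
Since `property_type` has a single value, it carries no information for a model. One way to detect such constant columns in general is `DataFrame.nunique`. The `sample` frame below is a made-up illustration:

```python
import pandas as pd

# Hypothetical sample where property_type never varies
sample = pd.DataFrame({
    "price": [930, 1000, 360000],
    "negotiation_type": ["rent", "rent", "sale"],
    "property_type": ["apartment", "apartment", "apartment"],
})

# Columns with at most one distinct value are constant
n_unique = sample.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()
print(constant_cols)

# Drop them before modelling
reduced = sample.drop(columns=constant_cols)
```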

In [29]:
df["new"].unique()

array([0, 1])

In [34]:
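# Use a flat list of column names; a tuple nested inside the selection
# is looked up as one single column label and raises a KeyError
numeric_cols = ["condo", "size", "rooms", "toilets", "suites", "price"]

df[numeric_cols].corr()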

KeyError: "None of [Index([('condo', 'size', 'rooms', 'toilets', 'suites', 'price')], dtype='object')] are in the [columns]"
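
The `KeyError` message lists all the names inside a single tuple, which is what pandas reports when the selection is a list containing one tuple rather than a flat list of strings. The sketch below reproduces that situation on an invented `sample` frame and then shows the flat-list selection that `corr()` needs. How the original cell ended up with a tuple is an assumption:

```python
import pandas as pd

# Hypothetical sample with three numeric columns
sample = pd.DataFrame({
    "condo": [220, 148, 100, 410],
    "size": [47, 45, 48, 55],
    "price": [930, 1000, 1000, 1300],
})

numeric_cols = ["condo", "size", "price"]

# A list holding one tuple asks for a single column whose label is the
# whole tuple; no such column exists, so pandas raises KeyError
try:
    sample[[tuple(numeric_cols)]]
    tuple_failed = False
except KeyError:
    tuple_failed = True

# A flat list of names selects several columns, so corr() works
corr = sample[numeric_cols].corr()

print(tuple_failed)
print(corr.round(2))
```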