# Creating a Data Lake

## 2. Processing the data in the "trusted" layer
The data in the "raw" layer must be processed before being saved in Parquet format.

### 2.1 Configuration

In [0]:
username = 'thais'

northwind = f'/letscode/{username}/northwind/'

spark.sql(f'CREATE DATABASE IF NOT EXISTS letscode_{username}_trusted')
spark.sql(f'USE letscode_{username}_trusted')

Out[365]: DataFrame[]

### 2.2 Loading the data
A dictionary stores one Spark DataFrame for each CSV table loaded.

In [0]:
dict_df_northwind = {}
path_raw = northwind + 'raw/'
# Each table sits in its own folder under "raw/"; the code relies on folder.name ending with "/".
for folder in dbutils.fs.ls(path_raw):
    for file in dbutils.fs.ls(path_raw + folder.name):
        # Use the file name without its extension as the dictionary key.
        file_without_extension = file.name.rsplit('.', 1)[0]
        dict_df_northwind[file_without_extension] = spark.read.csv(
            path_raw + folder.name + file.name, header=True, inferSchema=True)
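
The dictionary key is the file name with its final extension removed, and the read path is built by joining the base path, the folder and the file. The sketch below isolates those two steps in plain Python so they can be checked without a cluster. The helper names `table_key` and `raw_file_path` are hypothetical, and the `orders/orders.csv` layout is an assumption, since the source does not list the folder contents.

```python
from pathlib import PurePosixPath


def table_key(file_name: str) -> str:
    """Return the file name without its final extension."""
    return file_name.rsplit(".", 1)[0]


def raw_file_path(base: str, folder_name: str, file_name: str) -> str:
    """Join the raw-layer base path, a folder name and a file name."""
    # PurePosixPath drops redundant separators, so a trailing "/" is harmless.
    return str(PurePosixPath(base) / folder_name / file_name)


# Assumed layout: one folder per table, holding one CSV file.
print(table_key("order_details.csv"))
print(raw_file_path("/letscode/thais/northwind/raw/", "orders/", "orders.csv"))
```

Only the last dot is treated as the extension separator, so a name containing several dots keeps everything before the final one.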

### 2.3 Exploring the data
Since the goal of the project is not to analyze the data, I run only a simple check on each table to see whether the data types are correct and whether there are null or duplicated values.

#### 2.3.1 Table "categories"

In [0]:
dict_df_northwind['categories'].toPandas()

Unnamed: 0,category_id,category_name,description,picture
0,1,Beverages,"Soft drinks, coffees, teas, beers, and ales",binary data
1,2,Condiments,"Sweet and savory sauces, relishes, spreads, an...",binary data
2,3,Confections,"Desserts, candies, and sweet breads",binary data
3,4,Dairy Products,Cheeses,binary data
4,5,Grains/Cereals,"Breads, crackers, pasta, and cereal",binary data
5,6,Meat/Poultry,Prepared meats,binary data
6,7,Produce,Dried fruit and bean curd,binary data
7,8,Seafood,Seaweed and fish,binary data


In [0]:
dict_df_northwind['categories'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   category_id    8 non-null      int32 
 1   category_name  8 non-null      object
 2   description    8 non-null      object
 3   picture        8 non-null      object
dtypes: int32(1), object(3)
memory usage: 352.0+ bytes


In [0]:
dict_df_northwind['categories'].toPandas().duplicated().sum()

Out[369]: 0

#### 2.3.2 Table "customers"

In [0]:
dict_df_northwind['customers'].toPandas().head()

Unnamed: 0,customer_id,company_name,contact_name,contact_title,address,city,region,postal_code,country,phone,fax
0,ALFKI,Alfreds Futterkiste,Maria Anders,Sales Representative,Obere Str. 57,Berlin,,12209,Germany,030-0074321,030-0076545
1,ANATR,Ana Trujillo Emparedados y helados,Ana Trujillo,Owner,Avda. de la Constitución 2222,México D.F.,,05021,Mexico,(5) 555-4729,(5) 555-3745
2,ANTON,Antonio Moreno Taquería,Antonio Moreno,Owner,Mataderos 2312,México D.F.,,05023,Mexico,(5) 555-3932,
3,AROUT,Around the Horn,Thomas Hardy,Sales Representative,120 Hanover Sq.,London,,WA1 1DP,UK,(171) 555-7788,(171) 555-6750
4,BERGS,Berglunds snabbköp,Christina Berglund,Order Administrator,Berguvsvägen 8,Luleå,,S-958 22,Sweden,0921-12 34 65,0921-12 34 67


In [0]:
# The output above shows empty values in "region", yet the summary below reports no missing entries.
# Since this has little impact on the goal of this Big Data project, I will not treat the null values.

dict_df_northwind['customers'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91 entries, 0 to 90
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customer_id    91 non-null     object
 1   company_name   91 non-null     object
 2   contact_name   91 non-null     object
 3   contact_title  91 non-null     object
 4   address        91 non-null     object
 5   city           91 non-null     object
 6   region         91 non-null     object
 7   postal_code    91 non-null     object
 8   country        91 non-null     object
 9   phone          91 non-null     object
 10  fax            91 non-null     object
dtypes: object(11)
memory usage: 7.9+ KB


In [0]:
dict_df_northwind['customers'].toPandas().duplicated().sum()

Out[372]: 0
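
The comment in the `info()` cell above points out that "region" looks empty in the preview while `info()` reports 91 non-null values. The source does not establish the cause. One possibility, stated here only as an assumption, is that the column holds empty strings or a text placeholder rather than true nulls, which `info()` would count as non-null. The sketch below uses a hypothetical helper `quality_report` and made-up sample rows to count real nulls and blank strings separately.

```python
import pandas as pd


def quality_report(pdf: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, null count and blank-string count for each column."""

    def blank_count(col: pd.Series) -> int:
        # Count values that are strings containing nothing but whitespace.
        return int(col.map(lambda v: isinstance(v, str) and v.strip() == "").sum())

    return pd.DataFrame({
        "dtype": pdf.dtypes.astype(str),
        "nulls": pdf.isna().sum(),
        "blanks": pdf.apply(blank_count),
    })


# Made-up sample: one blank region, one true null, one real value.
sample = pd.DataFrame({
    "customer_id": ["A", "B", "C"],
    "region": ["", None, "RJ"],
})

report = quality_report(sample)
print(report)
print("duplicated rows:", int(sample.duplicated().sum()))
```

In the notebook, the same helper could be applied to `dict_df_northwind['customers'].toPandas()` to see whether "region" contains blanks rather than nulls.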

#### 2.3.3 Table "employee_territories"

In [0]:
dict_df_northwind['employee_territories'].toPandas().head()

Unnamed: 0,employee_id,territory_id
0,1,6897
1,1,19713
2,2,1581
3,2,1730
4,2,1833


In [0]:
dict_df_northwind['employee_territories'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   employee_id   49 non-null     int32
 1   territory_id  49 non-null     int32
dtypes: int32(2)
memory usage: 520.0 bytes


In [0]:
dict_df_northwind['employee_territories'].toPandas().duplicated().sum()

Out[375]: 0

#### 2.3.4 Table "employees"

In [0]:
dict_df_northwind['employees'].toPandas().head()

Unnamed: 0,employee_id,last_name,first_name,title,title_of_courtesy,birth_date,hire_date,address,city,region,postal_code,country,home_phone,extension,photo,notes,reports_to,photo_path
0,1,Davolio,Nancy,Sales Representative,Ms.,1948-12-08,1992-05-01,507 - 20th Ave. E.\nApt. 2A,Seattle,WA,98122,USA,(206) 555-9857,5467,binary data,Education includes a BA in psychology from Col...,2.0,http://accweb/emmployees/davolio.bmp
1,2,Fuller,Andrew,"Vice President, Sales",Dr.,1952-02-19,1992-08-14,908 W. Capital Way,Tacoma,WA,98401,USA,(206) 555-9482,3457,binary data,Andrew received his BTS commercial in 1974 and...,,http://accweb/emmployees/fuller.bmp
2,3,Leverling,Janet,Sales Representative,Ms.,1963-08-30,1992-04-01,722 Moss Bay Blvd.,Kirkland,WA,98033,USA,(206) 555-3412,3355,binary data,Janet has a BS degree in chemistry from Boston...,2.0,http://accweb/emmployees/leverling.bmp
3,4,Peacock,Margaret,Sales Representative,Mrs.,1937-09-19,1993-05-03,4110 Old Redmond Rd.,Redmond,WA,98052,USA,(206) 555-8122,5176,binary data,Margaret holds a BA in English literature from...,2.0,http://accweb/emmployees/peacock.bmp
4,5,Buchanan,Steven,Sales Manager,Mr.,1955-03-04,1993-10-17,14 Garrett Hill,London,,SW1 8JR,UK,(71) 555-4848,3453,binary data,Steven Buchanan graduated from St. Andrews Uni...,2.0,http://accweb/emmployees/buchanan.bmp


In [0]:
dict_df_northwind['employees'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9 entries, 0 to 8
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   employee_id        9 non-null      int32         
 1   last_name          9 non-null      object        
 2   first_name         9 non-null      object        
 3   title              9 non-null      object        
 4   title_of_courtesy  9 non-null      object        
 5   birth_date         9 non-null      datetime64[ns]
 6   hire_date          9 non-null      datetime64[ns]
 7   address            9 non-null      object        
 8   city               9 non-null      object        
 9   region             9 non-null      object        
 10  postal_code        9 non-null      object        
 11  country            9 non-null      object        
 12  home_phone         9 non-null      object        
 13  extension          9 non-null      int32         
 14  photo         

In [0]:
dict_df_northwind['employees'].toPandas().duplicated().sum()

Out[378]: 0

#### 2.3.5 Table "order_details"

In [0]:
dict_df_northwind['order_details'].toPandas().head()

Unnamed: 0,order_id,product_id,unit_price,quantity,discount
0,10248,11,14.0,12,0.0
1,10248,42,9.8,10,0.0
2,10248,72,34.8,5,0.0
3,10249,14,18.6,9,0.0
4,10249,51,42.4,40,0.0


In [0]:
dict_df_northwind['order_details'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2155 entries, 0 to 2154
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   order_id    2155 non-null   int32  
 1   product_id  2155 non-null   int32  
 2   unit_price  2155 non-null   float64
 3   quantity    2155 non-null   int32  
 4   discount    2155 non-null   float64
dtypes: float64(2), int32(3)
memory usage: 59.1 KB


In [0]:
dict_df_northwind['order_details'].toPandas().duplicated().sum()

Out[381]: 0

#### 2.3.6 Table "orders"

In [0]:
dict_df_northwind['orders'].toPandas().head()

Unnamed: 0,order_id,customer_id,employee_id,order_date,required_date,shipped_date,ship_via,freight,ship_name,ship_address,ship_city,ship_region,ship_postal_code,ship_country
0,10248,VINET,5,1996-07-04,1996-08-01,1996-07-16,3,32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France
1,10249,TOMSP,6,1996-07-05,1996-08-16,1996-07-10,1,11.61,Toms Spezialitäten,Luisenstr. 48,Münster,,44087,Germany
2,10250,HANAR,4,1996-07-08,1996-08-05,1996-07-12,2,65.83,Hanari Carnes,"Rua do Paço, 67",Rio de Janeiro,RJ,05454-876,Brazil
3,10251,VICTE,3,1996-07-08,1996-08-05,1996-07-15,1,41.34,Victuailles en stock,"2, rue du Commerce",Lyon,,69004,France
4,10252,SUPRD,4,1996-07-09,1996-08-06,1996-07-11,2,51.3,Suprêmes délices,"Boulevard Tirou, 255",Charleroi,,B-6000,Belgium


In [0]:
dict_df_northwind['orders'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 14 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   order_id          830 non-null    int32         
 1   customer_id       830 non-null    object        
 2   employee_id       830 non-null    int32         
 3   order_date        830 non-null    datetime64[ns]
 4   required_date     830 non-null    datetime64[ns]
 5   shipped_date      830 non-null    object        
 6   ship_via          830 non-null    int32         
 7   freight           830 non-null    float64       
 8   ship_name         830 non-null    object        
 9   ship_address      830 non-null    object        
 10  ship_city         830 non-null    object        
 11  ship_region       830 non-null    object        
 12  ship_postal_code  830 non-null    object        
 13  ship_country      830 non-null    object        
dtypes: datetime64[ns](2), floa

In [0]:
dict_df_northwind['orders'].toPandas().duplicated().sum()

Out[384]: 0

#### 2.3.7 Table "products"

In [0]:
dict_df_northwind['products'].toPandas().head()

Unnamed: 0,product_id,product_name,supplier_id,category_id,quantity_per_unit,unit_price,units_in_stock,units_on_order,reorder_level,discontinued
0,1,Chai,8,1,10 boxes x 30 bags,18.0,39,0,10,1
1,2,Chang,1,1,24 - 12 oz bottles,19.0,17,40,25,1
2,3,Aniseed Syrup,1,2,12 - 550 ml bottles,10.0,13,70,25,0
3,4,Chef Anton's Cajun Seasoning,2,2,48 - 6 oz jars,22.0,53,0,0,0
4,5,Chef Anton's Gumbo Mix,2,2,36 boxes,21.35,0,0,0,1


In [0]:
dict_df_northwind['products'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   product_id         77 non-null     int32  
 1   product_name       77 non-null     object 
 2   supplier_id        77 non-null     int32  
 3   category_id        77 non-null     int32  
 4   quantity_per_unit  77 non-null     object 
 5   unit_price         77 non-null     float64
 6   units_in_stock     77 non-null     int32  
 7   units_on_order     77 non-null     int32  
 8   reorder_level      77 non-null     int32  
 9   discontinued       77 non-null     int32  
dtypes: float64(1), int32(7), object(2)
memory usage: 4.0+ KB


In [0]:
dict_df_northwind['products'].toPandas().duplicated().sum()

Out[387]: 0

#### 2.3.8 Table "region"

In [0]:
dict_df_northwind['region'].toPandas().head()

Unnamed: 0,region_id,region_description
0,1,Eastern
1,2,Western
2,3,Northern
3,4,Southern


In [0]:
dict_df_northwind['region'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   region_id           4 non-null      int32 
 1   region_description  4 non-null      object
dtypes: int32(1), object(1)
memory usage: 176.0+ bytes


In [0]:
dict_df_northwind['region'].toPandas().duplicated().sum()

Out[390]: 0

#### 2.3.9 Table "shippers"

In [0]:
dict_df_northwind['shippers'].toPandas().head()

Unnamed: 0,shipper_id,company_name,phone
0,1,Speedy Express,(503) 555-9831
1,2,United Package,(503) 555-3199
2,3,Federal Shipping,(503) 555-9931
3,4,Alliance Shippers,1-800-222-0451
4,5,UPS,1-800-782-7892


In [0]:
dict_df_northwind['shippers'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   shipper_id    6 non-null      int32 
 1   company_name  6 non-null      object
 2   phone         6 non-null      object
dtypes: int32(1), object(2)
memory usage: 248.0+ bytes


In [0]:
dict_df_northwind['shippers'].toPandas().duplicated().sum()

Out[393]: 0

#### 2.3.10 Table "suppliers"

In [0]:
dict_df_northwind['suppliers'].toPandas().head()

Unnamed: 0,supplier_id,company_name,contact_name,contact_title,address,city,region,postal_code,country,phone,fax,homepage
0,1,Exotic Liquids,Charlotte Cooper,Purchasing Manager,49 Gilbert St.,London,,EC1 4SD,UK,(171) 555-2222,,
1,2,New Orleans Cajun Delights,Shelley Burke,Order Administrator,P.O. Box 78934,New Orleans,LA,70117,USA,(100) 555-4822,,#CAJUN.HTM#
2,3,Grandma Kelly's Homestead,Regina Murphy,Sales Representative,707 Oxford Rd.,Ann Arbor,MI,48104,USA,(313) 555-5735,(313) 555-3349,
3,4,Tokyo Traders,Yoshi Nagase,Marketing Manager,9-8 Sekimai Musashino-shi,Tokyo,,100,Japan,(03) 3555-5011,,
4,5,Cooperativa de Quesos 'Las Cabras',Antonio del Valle Saavedra,Export Administrator,Calle del Rosal 4,Oviedo,Asturias,33007,Spain,(98) 598 76 54,,


In [0]:
dict_df_northwind['suppliers'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   supplier_id    29 non-null     int32 
 1   company_name   29 non-null     object
 2   contact_name   29 non-null     object
 3   contact_title  29 non-null     object
 4   address        29 non-null     object
 5   city           29 non-null     object
 6   region         29 non-null     object
 7   postal_code    29 non-null     object
 8   country        29 non-null     object
 9   phone          29 non-null     object
 10  fax            29 non-null     object
 11  homepage       29 non-null     object
dtypes: int32(1), object(11)
memory usage: 2.7+ KB


In [0]:
dict_df_northwind['suppliers'].toPandas().duplicated().sum()

Out[396]: 0

#### 2.3.11 Table "territories"

In [0]:
dict_df_northwind['territories'].toPandas().head()

Unnamed: 0,territory_id,territory_description,region_id
0,1581,Westboro,1
1,1730,Bedford,1
2,1833,Georgetow,1
3,2116,Boston,1
4,2139,Cambridge,1


In [0]:
dict_df_northwind['territories'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53 entries, 0 to 52
Data columns (total 3 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   territory_id           53 non-null     int32 
 1   territory_description  53 non-null     object
 2   region_id              53 non-null     int32 
dtypes: int32(2), object(1)
memory usage: 976.0+ bytes


In [0]:
dict_df_northwind['territories'].toPandas().duplicated().sum()

Out[399]: 0

#### 2.3.12 Table "us_states"

In [0]:
dict_df_northwind['us_states'].toPandas().head()

Unnamed: 0,state_id,state_name,state_abbr,state_region
0,1,Alabama,AL,south
1,2,Alaska,AK,north
2,3,Arizona,AZ,west
3,4,Arkansas,AR,south
4,5,California,CA,west


In [0]:
dict_df_northwind['us_states'].toPandas().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   state_id      51 non-null     int32 
 1   state_name    51 non-null     object
 2   state_abbr    51 non-null     object
 3   state_region  51 non-null     object
dtypes: int32(1), object(3)
memory usage: 1.5+ KB


In [0]:
dict_df_northwind['us_states'].toPandas().duplicated().sum()

Out[402]: 0

### 2.4 Transforming the data

#### 2.4.1 Adding the "order_year" and "order_month" columns to the DF of the "orders" table
These columns were created so they can be used to build aggregated tables in the Data Lakehouse.

In [0]:
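# Import only the functions used here; a wildcard import would shadow built-ins such as sum() and max().
from pyspark.sql.functions import month, year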

df_orders_processed = dict_df_northwind['orders'].select(
    'order_id',
    'customer_id',
    'employee_id',
    'order_date',
    year(dict_df_northwind['orders'].order_date).alias('order_year'),
    month(dict_df_northwind['orders'].order_date).alias('order_month'),
    'required_date',
    'shipped_date',
    'ship_via',
    'freight',
    'ship_name',
    'ship_address',
    'ship_city',
    'ship_region',
    'ship_postal_code',
    'ship_country')

In [0]:
display(df_orders_processed.toPandas().head())

order_id,customer_id,employee_id,order_date,order_year,order_month,required_date,shipped_date,ship_via,freight,ship_name,ship_address,ship_city,ship_region,ship_postal_code,ship_country
10248,VINET,5,1996-07-04T00:00:00.000+0000,1996,7,1996-08-01T00:00:00.000+0000,1996-07-16,3,32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France
10249,TOMSP,6,1996-07-05T00:00:00.000+0000,1996,7,1996-08-16T00:00:00.000+0000,1996-07-10,1,11.61,Toms Spezialitäten,Luisenstr. 48,Münster,,44087,Germany
10250,HANAR,4,1996-07-08T00:00:00.000+0000,1996,7,1996-08-05T00:00:00.000+0000,1996-07-12,2,65.83,Hanari Carnes,"Rua do Paço, 67",Rio de Janeiro,RJ,05454-876,Brazil
10251,VICTE,3,1996-07-08T00:00:00.000+0000,1996,7,1996-08-05T00:00:00.000+0000,1996-07-15,1,41.34,Victuailles en stock,"2, rue du Commerce",Lyon,,69004,France
10252,SUPRD,4,1996-07-09T00:00:00.000+0000,1996,7,1996-08-06T00:00:00.000+0000,1996-07-11,2,51.3,Suprêmes délices,"Boulevard Tirou, 255",Charleroi,,B-6000,Belgium
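
The source does not show the aggregated tables themselves. As a plain-Python illustration of the grouping that "order_year" and "order_month" make possible, the sketch below counts orders and sums freight per (year, month) for the five rows displayed above. In the notebook this would presumably be a Spark aggregation over the same two columns; the variable names here are hypothetical.

```python
from collections import defaultdict
from datetime import date

# (order_id, order_date, freight) copied from the five rows displayed above.
orders = [
    (10248, date(1996, 7, 4), 32.38),
    (10249, date(1996, 7, 5), 11.61),
    (10250, date(1996, 7, 8), 65.83),
    (10251, date(1996, 7, 8), 41.34),
    (10252, date(1996, 7, 9), 51.30),
]

# Group by (order_year, order_month), counting orders and summing freight.
summary = defaultdict(lambda: {"orders": 0, "freight": 0.0})
for _order_id, order_date, freight in orders:
    key = (order_date.year, order_date.month)
    summary[key]["orders"] += 1
    summary[key]["freight"] += freight

for (order_year, order_month), totals in sorted(summary.items()):
    print(order_year, order_month, totals["orders"], round(totals["freight"], 2))
```

Storing the year and month as their own columns means such groupings do not have to re-derive them from "order_date" on every query.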


### 2.5 Replacing the DataFrame
The "orders" entry in the dictionary of DataFrames is replaced by "df_orders_processed". In other words, the DF of the "orders" table is swapped for the processed DF with the new columns ("order_year" and "order_month").

In [0]:
dict_df_northwind.update({'orders': df_orders_processed})

In [0]:
# Viewing the DF of the "orders" table with the two new columns ("order_year" and "order_month")
dict_df_northwind['orders'].toPandas().head()

Unnamed: 0,order_id,customer_id,employee_id,order_date,order_year,order_month,required_date,shipped_date,ship_via,freight,ship_name,ship_address,ship_city,ship_region,ship_postal_code,ship_country
0,10248,VINET,5,1996-07-04,1996,7,1996-08-01,1996-07-16,3,32.38,Vins et alcools Chevalier,59 rue de l'Abbaye,Reims,,51100,France
1,10249,TOMSP,6,1996-07-05,1996,7,1996-08-16,1996-07-10,1,11.61,Toms Spezialitäten,Luisenstr. 48,Münster,,44087,Germany
2,10250,HANAR,4,1996-07-08,1996,7,1996-08-05,1996-07-12,2,65.83,Hanari Carnes,"Rua do Paço, 67",Rio de Janeiro,RJ,05454-876,Brazil
3,10251,VICTE,3,1996-07-08,1996,7,1996-08-05,1996-07-15,1,41.34,Victuailles en stock,"2, rue du Commerce",Lyon,,69004,France
4,10252,SUPRD,4,1996-07-09,1996,7,1996-08-06,1996-07-11,2,51.3,Suprêmes délices,"Boulevard Tirou, 255",Charleroi,,B-6000,Belgium


### 2.6 Cleanup
Removing the "trusted" directory so that this notebook can be run more than once.

In [0]:
dbutils.fs.rm(northwind + 'trusted', recurse=True)

Out[407]: False

### 2.7 Writing the files to the "trusted" layer

In [0]:
for key, value in dict_df_northwind.items():
    value.write.mode("overwrite").format("parquet").save(northwind + f'trusted/{key}')

### 2.8 Registering the tables in the Metastore

In [0]:
path_trusted = northwind + 'trusted/'

for folder in dbutils.fs.ls(path_trusted):
    # Strip the trailing "/" from the directory name to get the table name.
    folder_name = folder.name.rstrip('/')
    spark.sql(
        f"""
        DROP TABLE IF EXISTS {folder_name}
        """
    )
    spark.sql(
        f"""
        CREATE TABLE {folder_name}
        USING PARQUET
        LOCATION "{northwind}trusted/{folder.name}"
        """
    )
    spark.sql(
        f"""
        REFRESH TABLE {folder_name}
        """
    )
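
The loop above issues three statements per folder: drop, create and refresh. The sketch below builds the same three statements as plain strings so they can be inspected before being run. It is a variation rather than the author's code: the helper `register_statements` is hypothetical, and it qualifies each table with the database name instead of relying on the earlier `USE` statement.

```python
from typing import List


def register_statements(database: str, base_path: str, folder_name: str) -> List[str]:
    """Build the DROP / CREATE / REFRESH statements for one trusted-layer folder."""
    table = folder_name.rstrip("/")  # "orders/" -> "orders"
    qualified = f"{database}.{table}"  # does not depend on the current database
    location = f"{base_path.rstrip('/')}/trusted/{table}"
    return [
        f"DROP TABLE IF EXISTS {qualified}",
        f'CREATE TABLE {qualified} USING PARQUET LOCATION "{location}"',
        f"REFRESH TABLE {qualified}",
    ]


statements = register_statements(
    "letscode_thais_trusted", "/letscode/thais/northwind/", "categories/"
)
for statement in statements:
    print(statement)
```

In the notebook, each returned string could be passed to `spark.sql(...)` in order.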

### 2.9 Checking the number of records in the first table ("categories")

In [0]:
tb_trusted_categories = spark.read.table('categories')
tb_trusted_categories.count()

Out[410]: 8

### 2.10 Displaying the attributes of the "categories" table created in the "trusted" layer

In [0]:
%sql
DESCRIBE EXTENDED categories;

col_name,data_type,comment
category_id,int,
category_name,string,
description,string,
picture,string,
,,
# Detailed Table Information,,
Database,letscode_thais_trusted,
Table,categories,
Owner,root,
Created Time,Tue May 17 08:38:26 UTC 2022,
