## Data Lake vs Data Lakehouse: Practical Implementation with Spark

![cover](images/capa.png)

### Evolution of Data Architectures

![data_architectures](images/data_architectures.png)

> # **Data Lakehouse** is a new, open data management architecture that implements data structures and data management features similar to those of a Data Warehouse, directly on the kind of low-cost storage used for Data Lakes.

![acid](images/acid.png)

In [1]:
!pip install pyspark==3.0.0


In [None]:
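import pyspark

# Build a session with the Delta Lake package, SQL extension, and catalog enabled.
# delta-core 0.8.0 is the release line built for Spark 3.0.x (Scala 2.12).
# The optimizeWrite settings originate in the Databricks runtime and may have
# no effect on open-source Delta Lake.
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.databricks.delta.schema.autoMerge.enabled","true") \
    .config("spark.databricks.delta.autoOptimize.optimizeWrite","true") \
    .config("spark.databricks.delta.optimizeWrite.enabled","true") \
    .config("spark.databricks.delta.vacuum.parallelDelete.enabled","true") \
    .getOrCreate()

# Import only after the session exists: the delta Python module is made
# available by the package loaded through spark.jars.packages above.
from delta.tables import *
from pyspark.sql.functions import *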

In [None]:
spark

In [None]:
path = 'tmp/sample.parquet'

In [None]:
!rm -rf tmp/

**Creating a Delta table**

To create a Delta table, write a DataFrame in the delta format. You can reuse your existing Spark SQL code and change the format from parquet, csv, json, and so on to delta.
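As a minimal sketch of that format swap, the helper below wraps the standard `DataFrameWriter` chain so that only the format string differs between, say, a Parquet write and a Delta write. The name `save_as` and its parameters are hypothetical, introduced here purely for illustration; it assumes a Spark DataFrame and a session configured with the Delta package.

```python
def save_as(df, path, fmt="delta", mode="overwrite"):
    """Write a Spark DataFrame to `path` in the given format.

    `save_as` is a hypothetical helper, not part of Spark or Delta Lake.
    """
    # The writer chain is the same one used for parquet, csv, or json;
    # switching to a Delta table only changes the format name.
    df.write.format(fmt).mode(mode).save(path)
    return path
```

With the objects defined in this notebook, `save_as(df1, path)` would write `df1` as a Delta table, and `spark.read.format("delta").load(path)` should read it back.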

In [None]:
df1 = spark.createDataFrame(
    [
        (1,    100.0, 'registro da primeira linha', 'batata'),
        (2,    150.0, 'registro da segunda linha', 'arroz'),
    ],
        ['id', 'number', 'txt', 'etc']
)

df1.show()

In [None]:
df_star_wars = spark.createDataFrame(
    [
        (1, 'Luke Skywalker', 1.72,'azul','19BBY','masculino','Tatooine','Humano'),
        (2, 'C-3PO',1.67,'amarelo','112BBY','NA','Tatooine','Droid'),
        (3, 'R2-D2', 0.67, 'vermelho','33BBY','NA','Naboo','Droid'),
        (4, 'Anakin Skywalker', 1.88, 'azul','41.9BBY','masculino','Tatooine','Humano'),
        (5, 'Leia Organa', 1.50,'castanho','19BBY','feminino','Alderaan','Humano'),
        (6, 'Han Solo', 1.80, 'castanho', '29BBY', 'masculino', 'Corellia', 'Humano')
          
    ],
        # 'especie' (species) is an assumed name for the eighth field of each row
        ['id', 'nome', 'altura', 'cor_dos_olhos', 'data_nascimento', 'sexo', 'planeta', 'especie']
)

#df_star_wars.show()

### References:

[1] [Data Lakehouse](https://databricks.com/glossary/data-lakehouse)

[2] [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://databricks.com/wp-content/uploads/2020/12/cidr_lakehouse.pdf)

[3] [Building a Data Lakehouse on GCP](https://services.google.com/fh/files/misc/building-a-data-lakehouse.pdf)

[4] [Lakehouse: unindo o Data Lake e o Data Warehouse](https://medium.com/data-hackers/lakehouse-unindo-o-data-lake-e-o-data-warehouse-1428be2dda21)

[5] [Construindo Data Lakehouse e muito mais, no Grupo Boticário — Data Hackers Podcast 44](https://medium.com/data-hackers/construindo-data-lakehouse-e-muito-mais-no-grupo-botic%C3%A1rio-data-hackers-podcast-44-20d67f05cfa4)

[6] [What Is a Lakehouse?](https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html)