# PySpark: Data Processing and Big Data

The goal of this project is to demonstrate knowledge of PySpark.
For this project, I used a public-domain dataset of YouTube statistics: https://www.kaggle.com/datasets/advaypatil/youtube-statistics/

In [2]:
# Installing PySpark
! pip install pyspark



In [3]:
# Importing PySpark and SparkSession
import pyspark
from pyspark.sql import SparkSession

In [4]:
# Creating a Spark session
spark = SparkSession.builder.getOrCreate()

In [5]:
# Reading data from the 'videos-stats.csv' file, treating the first line as the header
df = spark.read.option('header', 'true').csv('videos-stats.csv')

In [6]:
# Viewing the first 8 records of the file
df.show(8)

+---+--------------------+-----------+------------+-------+--------+--------+---------+
|_c0|               Title|   Video ID|Published At|Keyword|   Likes|Comments|    Views|
+---+--------------------+-----------+------------+-------+--------+--------+---------+
|  0|Apple Pay Is Kill...|wAZZ-UWGVHI|  2022-08-23|   tech|  3407.0|   672.0| 135612.0|
|  1|The most EXPENSIV...|b3x28s61q3c|  2022-08-24|   tech| 76779.0|  4306.0|1758063.0|
|  2|My New House Gami...|4mgePWWCAmA|  2022-08-23|   tech| 63825.0|  3338.0|1564007.0|
|  3|Petrol Vs Liquid ...|kXiYSI7H2b0|  2022-08-23|   tech| 71566.0|  1426.0| 922918.0|
|  4|Best Back to Scho...|ErMwWXQxHp0|  2022-08-08|   tech| 96513.0|  5155.0|1855644.0|
|  5|Brewmaster Answer...|18fwz9Itbvo|  2021-11-05|   tech| 33570.0|  1643.0| 943119.0|
|  6|Tech Monopolies: ...|jXf04bhcjbg|  2022-06-13|   tech|135047.0|  9367.0|5937790.0|
|  7|I bought the STRA...|2TqOmtTAMRY|  2022-08-07|   tech|216935.0| 12605.0|4782514.0|
+---+--------------------+-----------+------------+-------+--------+--------+---------+

In [7]:
# Viewing the file schema
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- Title: string (nullable = true)
 |-- Video ID: string (nullable = true)
 |-- Published At: string (nullable = true)
 |-- Keyword: string (nullable = true)
 |-- Likes: string (nullable = true)
 |-- Comments: string (nullable = true)
 |-- Views: string (nullable = true)



For the purposes of this project, inferSchema will be used, but here is an example of how I would define the schema manually:



```
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType, LongType

# Each column is declared with StructField(name, type, nullable)
schema = StructType([
  StructField("_c0", IntegerType(), True),
  StructField("Title", StringType(), True),
  StructField("Video ID", StringType(), True),
  StructField("Published At", DateType(), True),
  StructField("Keyword", StringType(), True),
  # Note: the file stores the counts below as decimals (e.g. 3407.0),
  # so these columns may need DoubleType for the values to parse
  StructField("Likes", IntegerType(), True),
  StructField("Comments", IntegerType(), True),
  StructField("Views", LongType(), True),
])
```



In [8]:
# Reading the file again with schema inference enabled, then viewing the schema
df = spark.read.option('header', 'true').option('inferSchema', 'true').csv('videos-stats.csv')
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- Title: string (nullable = true)
 |-- Video ID: string (nullable = true)
 |-- Published At: date (nullable = true)
 |-- Keyword: string (nullable = true)
 |-- Likes: double (nullable = true)
 |-- Comments: double (nullable = true)
 |-- Views: double (nullable = true)



In [9]:
# Saving the DataFrame in Parquet format under 'output/videos-parquet'
# (Parquet stores column names and types itself, so the CSV options 'header' and 'inferSchema' are not needed)
df.write.parquet('output/videos-parquet')

In [10]:
# Reading and viewing the 'videos-parquet' data (the schema is read from the Parquet files)
df = spark.read.parquet('output/videos-parquet')
df.show(10)

+---+--------------------+-----------+------------+-------+--------+--------+---------+
|_c0|               Title|   Video ID|Published At|Keyword|   Likes|Comments|    Views|
+---+--------------------+-----------+------------+-------+--------+--------+---------+
|  0|Apple Pay Is Kill...|wAZZ-UWGVHI|  2022-08-23|   tech|  3407.0|   672.0| 135612.0|
|  1|The most EXPENSIV...|b3x28s61q3c|  2022-08-24|   tech| 76779.0|  4306.0|1758063.0|
|  2|My New House Gami...|4mgePWWCAmA|  2022-08-23|   tech| 63825.0|  3338.0|1564007.0|
|  3|Petrol Vs Liquid ...|kXiYSI7H2b0|  2022-08-23|   tech| 71566.0|  1426.0| 922918.0|
|  4|Best Back to Scho...|ErMwWXQxHp0|  2022-08-08|   tech| 96513.0|  5155.0|1855644.0|
|  5|Brewmaster Answer...|18fwz9Itbvo|  2021-11-05|   tech| 33570.0|  1643.0| 943119.0|
|  6|Tech Monopolies: ...|jXf04bhcjbg|  2022-06-13|   tech|135047.0|  9367.0|5937790.0|
|  7|I bought the STRA...|2TqOmtTAMRY|  2022-08-07|   tech|216935.0| 12605.0|4782514.0|
|  8|15 Emerging Techn...|wLlL46

In [11]:
# Saving the DataFrame from the previous cell as a table named 'tb_videos' in the Spark catalog's default database
df.write.saveAsTable('tb_videos')

In [12]:
# Listing the Spark catalog tables to verify that the table was created
spark.catalog.listTables()

[Table(name='tb_videos', catalog='spark_catalog', namespace=['default'], description=None, tableType='MANAGED', isTemporary=False)]

In [13]:
# Using Spark SQL to read the 'tb_videos' table
tab_df = spark.sql('SELECT * FROM tb_videos')
tab_df.show()

+---+--------------------+-----------+------------+-------+--------+--------+-----------+
|_c0|               Title|   Video ID|Published At|Keyword|   Likes|Comments|      Views|
+---+--------------------+-----------+------------+-------+--------+--------+-----------+
|  0|Apple Pay Is Kill...|wAZZ-UWGVHI|  2022-08-23|   tech|  3407.0|   672.0|   135612.0|
|  1|The most EXPENSIV...|b3x28s61q3c|  2022-08-24|   tech| 76779.0|  4306.0|  1758063.0|
|  2|My New House Gami...|4mgePWWCAmA|  2022-08-23|   tech| 63825.0|  3338.0|  1564007.0|
|  3|Petrol Vs Liquid ...|kXiYSI7H2b0|  2022-08-23|   tech| 71566.0|  1426.0|   922918.0|
|  4|Best Back to Scho...|ErMwWXQxHp0|  2022-08-08|   tech| 96513.0|  5155.0|  1855644.0|
|  5|Brewmaster Answer...|18fwz9Itbvo|  2021-11-05|   tech| 33570.0|  1643.0|   943119.0|
|  6|Tech Monopolies: ...|jXf04bhcjbg|  2022-06-13|   tech|135047.0|  9367.0|  5937790.0|
|  7|I bought the STRA...|2TqOmtTAMRY|  2022-08-07|   tech|216935.0| 12605.0|  4782514.0|
|  8|15 Em

In [14]:
# Reading the 'comments.csv' file with schema inference enabled and viewing it
df = spark.read.option('header', 'true').option('inferSchema', 'true').csv('comments.csv')
df.show()

+--------------+-----------+--------------------+------+---------+
|           _c0|   Video ID|             Comment| Likes|Sentiment|
+--------------+-----------+--------------------+------+---------+
|             0|wAZZ-UWGVHI|Let's not forget ...|  95.0|      1.0|
|             1|wAZZ-UWGVHI|Here in NZ 50% of...|  19.0|      0.0|
|             2|wAZZ-UWGVHI|I will forever ac...| 161.0|      2.0|
|             3|wAZZ-UWGVHI|Whenever I go to ...|   8.0|      0.0|
|             4|wAZZ-UWGVHI|Apple Pay is so c...|  34.0|      2.0|
|             5|wAZZ-UWGVHI|We’ve been houndi...|   8.0|      1.0|
|             6|wAZZ-UWGVHI|We only got Apple...|  29.0|      2.0|
|             7|wAZZ-UWGVHI|For now, I need b...|   7.0|      1.0|
|             8|wAZZ-UWGVHI|In the United Sta...|   2.0|      2.0|
|             9|wAZZ-UWGVHI|In Cambodia, we h...|  28.0|      1.0|
|            10|b3x28s61q3c|Wow, you really w...|1344.0|      2.0|
|            11|b3x28s61q3c|The lab is the mo...| 198.0|      

In [15]:
# Saving the DataFrame in Parquet format under 'comments-parquet'
df.write.parquet('comments-parquet')