Since Spark 2.0, the Structured API (DataFrames, Datasets, Spark SQL) lets you work with structured or semi-structured data that has a defined schema.
The Catalyst Optimizer uses this schema to optimize query plans, for example by pruning unused columns and pushing filters down to the data source.

In [1]:
!pip install pyspark



In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.master("local[*]").appName("Spark df").getOrCreate()

In [5]:
sc = spark.sparkContext

Create a DataFrame from an RDD

In [6]:
# Build an RDD of (number, square) pairs
rdd = sc.parallelize(range(10)).map(lambda x: (x, x**2))
rdd.collect()

[(0, 0),
 (1, 1),
 (2, 4),
 (3, 9),
 (4, 16),
 (5, 25),
 (6, 36),
 (7, 49),
 (8, 64),
 (9, 81)]

In [7]:
df = rdd.toDF(['numero','cuadrado'])

In [8]:
df.printSchema()

root
 |-- numero: long (nullable = true)
 |-- cuadrado: long (nullable = true)



In [9]:
df.show()

+------+--------+
|numero|cuadrado|
+------+--------+
|     0|       0|
|     1|       1|
|     2|       4|
|     3|       9|
|     4|      16|
|     5|      25|
|     6|      36|
|     7|      49|
|     8|      64|
|     9|      81|
+------+--------+



Create a DataFrame from an RDD with an explicit schema

In [10]:
rdd1 = sc.parallelize([(1, 'Abel', 35.5), (2, 'Julian', 54.3), (3, 'Marcos', 12.7)])

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType


In [11]:
# Option 1: build the schema with StructType and StructField
esquema1 = StructType(
    [
        StructField('id',IntegerType(), True),
        StructField('nombre',StringType(), True),
        StructField('saldo',DoubleType(), True),
    ]
)

In [27]:
# Option 2: describe the schema with a DDL-formatted string
esquema2 = "`id` INT, `nombre` STRING, `saldo` DOUBLE"

In [22]:
df1 = spark.createDataFrame(rdd1,schema=esquema1)
df1.printSchema()

root
 |-- id: integer (nullable = true)
 |-- nombre: string (nullable = true)
 |-- saldo: double (nullable = true)



In [23]:
df1.show()

+---+------+-----+
| id|nombre|saldo|
+---+------+-----+
|  1|  Abel| 35.5|
|  2|Julian| 54.3|
|  3|Marcos| 12.7|
+---+------+-----+



In [28]:
df2 = spark.createDataFrame(rdd1, schema = esquema2)
df2.printSchema()

root
 |-- id: integer (nullable = true)
 |-- nombre: string (nullable = true)
 |-- saldo: double (nullable = true)



In [29]:
df2.show()

+---+------+-----+
| id|nombre|saldo|
+---+------+-----+
|  1|  Abel| 35.5|
|  2|Julian| 54.3|
|  3|Marcos| 12.7|
+---+------+-----+



Create DataFrames from data sources

In [32]:
# From a plain-text file: each line becomes one row in a single `value` column
df = spark.read.text('/content/data/dataTXT.txt')

In [34]:
df.show(truncate = False)

+-----------------------------------------------------------------------+
|value                                                                  |
+-----------------------------------------------------------------------+
|Estamos en el curso de pyspark                                         |
|En este capítulo estamos estudiando el API SQL de Saprk                |
|En esta sección estamos creado dataframes a partir de fuentes de datos,|
|y en este ejemplo creamos un dataframe a partir de un texto plano      |
+-----------------------------------------------------------------------+



In [37]:
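# From a CSV file. Without the header option, the first line is read as data
# and the columns get the default names _c0, _c1, ...
df_csv = spark.read.csv('/content/data/dataCSV.csv')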
df_csv.show()

+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+--------------------+--------------------+
|        _c0|          _c1|                 _c2|                 _c3|        _c4|                 _c5|                 _c6|    _c7|   _c8|     _c9|         _c10|                _c11|             _c12|            _c13|                _c14|                _c15|
+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+--------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|        publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|vid

In [38]:
# With header=true, the first line supplies the column names
df_csv = spark.read.option('header', 'true').csv('/content/data/dataCSV.csv')
df_csv.show()

+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|        publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13T17:13:...|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           Fal

From a text file with a different delimiter

In [40]:
df2 = spark.read.option('header','true').option('delimiter','|').csv('/content/data/dataTab.txt')
df2.show()

+----+----+----------+-----+
|pais|edad|     fecha|color|
+----+----+----------+-----+
|  MX|  23|2021-02-21| rojo|
|  CA|  56|2021-06-10| azul|
|  US|  32|2020-06-02|verde|
+----+----+----------+-----+



From a JSON file with an explicit schema

In [41]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType
json_schema = StructType(
    [
        StructField('color',StringType(),True),
        StructField('edad',IntegerType(),True),
        StructField('fecha',DateType(),True),
        StructField('pais',StringType(),True),
    ]
)

In [42]:
df_json = spark.read.schema(json_schema).json('/content/data/dataJSON.json')
df_json.show()

+-----+----+----------+----+
|color|edad|     fecha|pais|
+-----+----+----------+----+
| rojo|NULL|2021-02-21|  MX|
| azul|NULL|2021-06-10|  CA|
|verde|NULL|2020-06-02|  US|
+-----+----+----------+----+



In [43]:
df_json.printSchema()

root
 |-- color: string (nullable = true)
 |-- edad: integer (nullable = true)
 |-- fecha: date (nullable = true)
 |-- pais: string (nullable = true)



From a Parquet file

In [45]:
df_parquet = spark.read.parquet('/content/data/dataPARQUET.parquet')
df_parquet.printSchema()

root
 |-- video_id: string (nullable = true)
 |-- trending_date: string (nullable = true)
 |-- title: string (nullable = true)
 |-- channel_title: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- tags: string (nullable = true)
 |-- views: string (nullable = true)
 |-- likes: string (nullable = true)
 |-- dislikes: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- thumbnail_link: string (nullable = true)
 |-- comments_disabled: string (nullable = true)
 |-- ratings_disabled: string (nullable = true)
 |-- video_error_or_removed: string (nullable = true)
 |-- description: string (nullable = true)



In [46]:
df_parquet.show(4)

+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|        publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13T17:13:...|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           Fal

In [48]:
# Equivalent generic form: name the format, then load the path
df5 = spark.read.format('parquet').load('/content/data/dataPARQUET.parquet')
df5.show()

+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|   video_id|trending_date|               title|       channel_title|category_id|        publish_time|                tags|  views| likes|dislikes|comment_count|      thumbnail_link|comments_disabled|ratings_disabled|video_error_or_removed|         description|
+-----------+-------------+--------------------+--------------------+-----------+--------------------+--------------------+-------+------+--------+-------------+--------------------+-----------------+----------------+----------------------+--------------------+
|2kyS6SvSYSE|     17.14.11|WE WANT TO TALK A...|        CaseyNeistat|         22|2017-11-13T17:13:...|     SHANtell martin| 748374| 57527|    2966|        15954|https://i.ytimg.c...|            False|           Fal