# DataFrames Basics

## Prerequisites

Install Spark and Java in the VM.

In [11]:
# install Java8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download spark 3.5.0
!wget -q https://apache.osuosl.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

In [12]:
ls -l # check the .tgz is there

total 782032
drwxr-xr-x  1 root root      4096 Feb 14 14:28 sample_data/
drwxr-xr-x 13 1000 1000      4096 Sep  9 02:08 spark-3.5.0-bin-hadoop3/
-rw-r--r--  1 root root 400395283 Sep  9 02:10 spark-3.5.0-bin-hadoop3.tgz
-rw-r--r--  1 root root 400395283 Sep  9 02:10 spark-3.5.0-bin-hadoop3.tgz.1


In [13]:
# extract the archive
!tar xf spark-3.5.0-bin-hadoop3.tgz

In [14]:
!pip install -q findspark

In [15]:

!pip install py4j

# For maps
!pip install folium
!pip install plotly



Define the environment

In [16]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"

Start Spark Session

---

In [17]:
import findspark
findspark.init("spark-3.5.0-bin-hadoop3")# SPARK_HOME

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("DataFrames Basics") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.5.0'

In [18]:
spark

In [19]:
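# For Pandas conversion optimization
# (Spark 3.x uses the "arrow.pyspark.enabled" key; "spark.sql.execution.arrow.enabled" is deprecated)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")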

In [20]:
# Import sql functions
from pyspark.sql.functions import *

## Examples

# Create the datasets from the CSV files

In [22]:
info_episodesDF = spark.read.option("header", "true").option("delimiter", ",").csv("/content/friends_info.csv")
info_emotionsDF = spark.read.option("header", "true").option("delimiter", ",").csv("/content/Friends_emotions.csv")

# Show the datasets

In [23]:
info_episodesDF.show()
info_emotionsDF.show()

+------+-------+--------------------+---------------+--------------------+----------+-----------------+-----------+
|season|episode|               title|    directed_by|          written_by|  air_date|us_views_millions|imdb_rating|
+------+-------+--------------------+---------------+--------------------+----------+-----------------+-----------+
|     1|      1|           The Pilot|  James Burrows|David Crane & Mar...|1994-09-22|             21.5|        8.3|
|     1|      2|The One with the ...|  James Burrows|David Crane & Mar...|1994-09-29|             20.2|        8.1|
|     1|      3|The One with the ...|  James Burrows|Jeffrey Astrof & ...|1994-10-06|             19.5|        8.2|
|     1|      4|The One with Geor...|  James Burrows|         Alexa Junge|1994-10-13|             19.7|        8.1|
|     1|      5|The One with the ...|  Pamela Fryman|Jeff Greenstein &...|1994-10-20|             18.6|        8.5|
|     1|      6|The One with the ...| Arlene Sanford|Adam Chase & Ira ..

# What schema do the datasets have?

In [27]:
info_episodesDF.printSchema()
info_emotionsDF.printSchema()

root
 |-- season: string (nullable = true)
 |-- episode: string (nullable = true)
 |-- title: string (nullable = true)
 |-- directed_by: string (nullable = true)
 |-- written_by: string (nullable = true)
 |-- air_date: string (nullable = true)
 |-- us_views_millions: string (nullable = true)
 |-- imdb_rating: string (nullable = true)

root
 |-- Season: string (nullable = true)
 |-- Episode: string (nullable = true)
 |-- Scene: string (nullable = true)
 |-- Utterance: string (nullable = true)
 |-- Emotion: string (nullable = true)



# Which season had the most viewers?
To answer this, we sum the us_views_millions field and group by season.

In [24]:
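viewers_per_seasonDF = (info_episodesDF.groupBy("Season")
                          .agg(sum("us_views_millions").alias("total_views"))
                          # Sort on the numeric total: format_number returns a string,
                          # and strings would be ordered lexicographically
                          .orderBy(desc("total_views"))
                          .select("Season", format_number("total_views", 2).alias("Total_Temporada"))
                       )
# Drop the rows containing NULLs to clean up the result
viewers_per_seasonDF2 = viewers_per_seasonDF.na.drop()
viewers_per_seasonDF2.show()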

+------+---------------+
|Season|Total_Temporada|
+------+---------------+
|     2|         761.30|
|     3|         657.70|
|     8|         641.29|
|     1|         595.00|
|     5|         593.90|
|     9|         574.33|
|     4|         573.30|
|     6|         565.40|
|     7|         529.23|
|    10|         470.33|
+------+---------------+
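Every column in these datasets was read as a string, and `format_number` also returns a string, so it helps to remember that text sorts character by character rather than by value. The following plain-Python sketch illustrates the difference using the totals from the table above; the `1,200.00` entry is a hypothetical value added only to expose the effect and does not come from the dataset.

```python
# Season totals copied from the table above, kept as formatted strings.
totals = ["761.30", "657.70", "641.29", "595.00", "593.90",
          "574.33", "573.30", "565.40", "529.23", "470.33"]

# Hypothetical extra value (NOT in the dataset) with four integer digits,
# written the way format_number would render it.
with_large = totals + ["1,200.00"]

# Sorting the strings compares them character by character.
as_text = sorted(with_large, reverse=True)

# Sorting on the parsed number gives the intended order.
as_number = sorted(with_large,
                   key=lambda s: float(s.replace(",", "")),
                   reverse=True)

print(as_text[0])    # largest when compared as text
print(as_number[0])  # largest when compared as a number
```

With the real data all totals have three integer digits, so both orderings happen to agree; the difference only appears once the magnitudes vary.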



# Which director has directed the most episodes?

To answer this, we group by director and count the episodes for each one. We also drop null values to clean up the final result.

In [26]:
directors_episode_countDF = info_episodesDF.groupBy("directed_by").agg(count("episode").alias("Total_Episodios")).orderBy(desc("Total_Episodios"))

# Drop the rows containing NULLs to clean up the result
directors_episode_countDF2= directors_episode_countDF.na.drop()
directors_episode_countDF2.show()

+--------------------+---------------+
|         directed_by|Total_Episodios|
+--------------------+---------------+
|      Gary Halvorson|             54|
|     Kevin S. Bright|             53|
|     Michael Lembeck|             24|
|       James Burrows|             15|
|        Gail Mancuso|             14|
|        Peter Bonerz|             12|
|           Ben Weiss|             10|
|     David Schwimmer|             10|
|        Robby Benson|              6|
|        Terry Hughes|              5|
|      Shelley Jensen|              5|
|        Sheldon Epps|              3|
|       Pamela Fryman|              2|
|     Thomas Schlamme|              2|
|     Steve Zuckerman|              2|
|  Roger Christiansen|              2|
|Dana DeValley Piazza|              2|
|        Alan Myerson|              2|
|Kevin S. Bright &...|              1|
|      Arlene Sanford|              1|
+--------------------+---------------+
only showing top 20 rows



# What is the tone of each season?

We count the occurrences of each "Emotion" value per season, then order the result by season.

In [32]:
emotions_count_per_seasonDF = info_emotionsDF.groupBy("Season").pivot("Emotion").count()
ordered_emotions_count_per_seasonDF = emotions_count_per_seasonDF.orderBy("Season")
ordered_emotions_count_per_seasonDF.show()

+------+------+---+-------+--------+--------+---+------+
|Season|Joyful|Mad|Neutral|Peaceful|Powerful|Sad|Scared|
+------+------+---+-------+--------+--------+---+------+
|     1|   599|309|   1241|     112|     125|155|   357|
|     2|   567|317|    985|     249|     189|207|   386|
|     3|   810|426|    896|     386|     259|276|   443|
|     4|   779|280|    654|     444|     490|206|   459|
+------+------+---+-------+--------+--------+---+------+
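As a rough mental model of what `groupBy("Season").pivot("Emotion").count()` produces, the plain-Python sketch below builds the same kind of table by hand. The input rows are hypothetical and not taken from the dataset. A combination that never occurs is represented here as `None`, mirroring the null that Spark places in such a cell.

```python
from collections import Counter

# Hypothetical (season, emotion) rows, one per utterance; NOT from the dataset.
rows = [("1", "Joyful"), ("1", "Neutral"), ("1", "Neutral"),
        ("2", "Mad"), ("2", "Joyful")]

# Count how often each (season, emotion) pair occurs.
counts = Counter(rows)

# Distinct emotions become the columns, distinct seasons the rows.
emotions = sorted({emotion for _, emotion in rows})
seasons = sorted({season for season, _ in rows})

# One entry per season, one key per emotion; missing pairs stay None.
pivoted = {s: {e: counts.get((s, e)) for e in emotions} for s in seasons}

for season, row in pivoted.items():
    print(season, row)
```

In Spark, any arithmetic on a null cell also yields null, so if some season lacked an emotion entirely, filling the gaps first (for example with `na.fill(0)`) would be one way to keep later totals intact.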



**Distribution in %**

In [42]:
# Add a 'Total' column holding the sum of all emotion counts for each season
emotions_count_per_seasonDF = emotions_count_per_seasonDF.withColumn('Total', lit(0))
for emotion in [c for c in emotions_count_per_seasonDF.columns if c not in ['Season', 'Total']]:
    emotions_count_per_seasonDF = emotions_count_per_seasonDF.withColumn('Total', col('Total') + col(emotion))
# Replace each count with its share of the season total, as a whole-number percentage
for emotion in [c for c in emotions_count_per_seasonDF.columns if c not in ['Season', 'Total']]:
    emotions_count_per_seasonDF = emotions_count_per_seasonDF.withColumn(emotion, format_number((col(emotion) / col('Total')) * 100, 0))
emotions_count_per_seasonDF = emotions_count_per_seasonDF.orderBy(col('Season'))
# Show every column except the helper 'Total' column
columns_to_show = [c for c in emotions_count_per_seasonDF.columns if c != 'Total']
emotions_count_per_seasonDF.select(*columns_to_show).orderBy(col('Season')).show()

+------+------+---+-------+--------+--------+---+------+
|Season|Joyful|Mad|Neutral|Peaceful|Powerful|Sad|Scared|
+------+------+---+-------+--------+--------+---+------+
|     1|    21| 11|     43|       4|       4|  5|    12|
|     2|    20| 11|     34|       9|       7|  7|    13|
|     3|    23| 12|     26|      11|       7|  8|    13|
|     4|    24|  8|     20|      13|      15|  6|    14|
+------+------+---+-------+--------+--------+---+------+
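The percentages can be checked by hand. The sketch below redoes the calculation for season 1 in plain Python, using the counts from the earlier table. It calls `builtins.sum` and `builtins.round` explicitly, because the notebook's `from pyspark.sql.functions import *` replaces the built-in names with Spark column functions.

```python
import builtins

# Season 1 counts copied from the emotion count table above.
counts = {"Joyful": 599, "Mad": 309, "Neutral": 1241, "Peaceful": 112,
          "Powerful": 125, "Sad": 155, "Scared": 357}

# Total number of labelled utterances in the season.
total = builtins.sum(counts.values())

# Share of each emotion, rounded to a whole-number percentage.
percentages = {emotion: builtins.round(n / total * 100)
               for emotion, n in counts.items()}

print(total)
print(percentages)
```

The rounded values match the season 1 row of the percentage table. Rounded shares are not guaranteed to add up to exactly 100 in general, although they do for this row.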



# Relating tone to director
For this case, we join the previous table with the general episode information in order to relate the tone of the series to the director who directed the most episodes in each season.

In [46]:
from pyspark.sql import Window  # not covered by the pyspark.sql.functions import

# Count the episodes each director directed within each season
director_counts = info_episodesDF.groupBy("Season", "directed_by").agg(count("episode").alias("episodes_count"))
# Rank directors within each season by episode count, highest first
windowSpec = Window.partitionBy("Season").orderBy(col("episodes_count").desc())
# Keep only the top-ranked director of each season
top_director_per_season = director_counts.withColumn("rank", row_number().over(windowSpec)) \
                                          .filter(col("rank") == 1).drop("rank")
top_director_per_season = top_director_per_season.withColumnRenamed("directed_by", "Top_Director")
# Join the top director onto the per-season emotion percentages
final_df = emotions_count_per_seasonDF.join(top_director_per_season, "Season")
columns_to_show = ['Season', 'Top_Director'] + [c for c in emotions_count_per_seasonDF.columns if c not in ['Season', 'Total']]
final_df = final_df.select(*columns_to_show)
final_df = final_df.orderBy("Season")
final_df.show()

+------+---------------+------+---+-------+--------+--------+---+------+
|Season|   Top_Director|Joyful|Mad|Neutral|Peaceful|Powerful|Sad|Scared|
+------+---------------+------+---+-------+--------+--------+---+------+
|     1|  James Burrows|    21| 11|     43|       4|       4|  5|    12|
|     2|Michael Lembeck|    20| 11|     34|       9|       7|  7|    13|
|     3|   Robby Benson|    23| 12|     26|      11|       7|  8|    13|
|     4|   Peter Bonerz|    24|  8|     20|      13|      15|  6|    14|
+------+---------------+------+---+-------+--------+--------+---+------+
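The "top row per group" pattern behind `row_number()` can be sketched in plain Python. The rows below are hypothetical and not taken from the dataset. The sketch also shows one detail worth knowing: when two directors tie on episode count, `row_number()` still returns a single row, and which one it returns is not defined unless the ordering includes a tie-breaker. Here the director name serves as that tie-breaker, which is an assumption of this example rather than something the notebook does.

```python
from itertools import groupby

# Hypothetical (season, director, episodes) rows; NOT from the dataset.
director_counts = [
    ("1", "Director A", 5), ("1", "Director B", 5), ("1", "Director C", 2),
    ("2", "Director B", 7), ("2", "Director A", 3),
]

# Order by season, then episodes descending, then name as an explicit tie-breaker.
ordered = sorted(director_counts, key=lambda r: (r[0], -r[2], r[1]))

# Keep the first row of each season, which is what filtering on rank == 1 selects.
top_per_season = {
    season: next(rows)[1]
    for season, rows in groupby(ordered, key=lambda r: r[0])
}

print(top_per_season)
```

The Spark equivalent of the tie-breaker would be a second ordering column in the window specification, for example `orderBy(col("episodes_count").desc(), col("directed_by"))`.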

