# DataFrames Basics

## Prerequisites

Install Spark and Java in the VM

In [1]:
# install Java8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download spark 3.5.0
!wget -q https://apache.osuosl.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

In [2]:
ls -l # check the .tgz is there

total 391016
drwxr-xr-x 1 root root      4096 Jan 11 17:02 sample_data/
-rw-r--r-- 1 root root 400395283 Sep  9 02:10 spark-3.5.0-bin-hadoop3.tgz


In [3]:
# extract the archive
!tar xf spark-3.5.0-bin-hadoop3.tgz

In [4]:
!pip install -q findspark

In [5]:

!pip install py4j

# For maps
!pip install folium
!pip install plotly



Define the environment

In [6]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"
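
Before starting the session it can help to confirm that both variables point at real directories, since a wrong path tends to show up only later, when the session fails to start. A minimal sketch using only the standard library; `check_spark_env` is a hypothetical helper, not part of findspark or PySpark:

```python
import os

def check_spark_env(env=os.environ):
    """Map each required variable to True if it names an existing directory."""
    required = ("JAVA_HOME", "SPARK_HOME")
    # env.get(name, "") yields "" for an unset variable, and isdir("") is False
    return {name: os.path.isdir(env.get(name, "")) for name in required}

status = check_spark_env()
for name, ok in status.items():
    print(f"{name}: {'ok' if ok else 'missing or not a directory'}")
```

If either line reports a problem, the paths set in the cell above are the first thing to recheck.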

Start Spark Session

---

In [7]:
import findspark
findspark.init("spark-3.5.0-bin-hadoop3")  # relative path to SPARK_HOME

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("DataFrames Basics") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.5.0'

In [8]:
spark

In [9]:
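# Enable Arrow-based conversion between Spark and pandas DataFrames.
# Since Spark 3.0 the key is spark.sql.execution.arrow.pyspark.enabled;
# the older spark.sql.execution.arrow.enabled is deprecated.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")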

In [10]:
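# Import SQL functions.
# Note: the wildcard import shadows Python built-ins such as max, min and sum
# with their Spark column equivalents for the rest of the notebook.
from pyspark.sql.functions import *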

The chosen dataset contains ATP tennis matches. Each match records the tournament, date, tournament series, whether it was played indoors or outdoors, court surface, round, maximum number of sets in the match, the two players, the winner, the players' rankings, the odds for each player, and the match score.

EXERCISES

1. Read the atp_tennis.csv dataset and select two columns of our choice.

In [13]:
TennisDF = spark.read.option("header", "true").option("delimiter", ",").csv("/content/atp_tennis.csv")
TennisDF.show(20)

+--------------------+----------+------+-------+-------+---------+-------+------------+-------------+-------------+------+------+-----+-----+-----+-----+-------------+
|          Tournament|      Date|Series|  Court|Surface|    Round|Best of|    Player_1|     Player_2|       Winner|Rank_1|Rank_2|Pts_1|Pts_2|Odd_1|Odd_2|        score|
+--------------------+----------+------+-------+-------+---------+-------+------------+-------------+-------------+------+------+-----+-----+-----+-----+-------------+
|Brisbane Internat...|2012-12-31|ATP250|Outdoor|   Hard|1st Round|      3|    Mayer F.|   Giraldo S.|     Mayer F.|    28|    57| 1215|  778| 1.36|  3.0|   6-4 6-4   |
|Brisbane Internat...|2012-12-31|ATP250|Outdoor|   Hard|1st Round|      3|Benneteau J.|  Nieminen J.|  Nieminen J.|    35|    41| 1075|  927|  2.2| 1.61|3-6 6-2 1-6  |
|Brisbane Internat...|2012-12-31|ATP250|Outdoor|   Hard|1st Round|      3|Nishikori K.| Matosevic M.| Nishikori K.|    19|    49| 1830|  845| 1.25| 3.75|   7-5 

In [14]:
TennisDF.select(
    TennisDF.Tournament,
    col("Surface"),
).show(3)

+--------------------+-------+
|          Tournament|Surface|
+--------------------+-------+
|Brisbane Internat...|   Hard|
|Brisbane Internat...|   Hard|
|Brisbane Internat...|   Hard|
+--------------------+-------+
only showing top 3 rows



Count the distinct court surfaces that appear in the dataset.


In [30]:
TennisDF.select(countDistinct(TennisDF.Surface)).show()

+-----------------------+
|count(DISTINCT Surface)|
+-----------------------+
|                      3|
+-----------------------+
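
The cell above returns only how many distinct surfaces there are, not which ones. The difference, sketched in plain Python on a small hypothetical sample (in Spark, `TennisDF.select("Surface").distinct().show()` would be the analogous call to list the values):

```python
# Hypothetical sample of the Surface column; Hard, Clay and Grass are the
# values that appear in this notebook's outputs, the list itself is made up.
surfaces = ["Hard", "Hard", "Clay", "Grass", "Clay", "Hard"]

distinct_surfaces = sorted(set(surfaces))  # which values occur
n_distinct = len(distinct_surfaces)        # how many, like countDistinct

print(distinct_surfaces)  # ['Clay', 'Grass', 'Hard']
print(n_distinct)         # 3
```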



Show the maximum number of sets (the Best of column) per player.

In [39]:
# Highest "Best of" value per player when listed as Player_1
max_sets_player_1 = TennisDF.groupBy("Player_1").agg(max("Best of").alias("max_sets_player_1"))
max_sets_player_1.show()

# Same aggregation for players listed as Player_2
max_sets_player_2 = TennisDF.groupBy("Player_2").agg(max("Best of").alias("max_sets_player_2"))
max_sets_player_2.show()

+-----------------+-----------------+
|         Player_1|max_sets_player_1|
+-----------------+-----------------+
|     Agamenone F.|                5|
|       Aguilar J.|                3|
|        Ahouda A.|                3|
|     Ajdukovic D.|                3|
|         Albot R.|                5|
|       Alcaraz C.|                5|
|       Almagro N.|                5|
|    Altamirano C.|                5|
|      Altmaier D.|                5|
|         Alund M.|                5|
|Alvarez Varona N.|                3|
|      Amritraj P.|                3|
|      Anderson K.|                5|
|       Andreev A.|                3|
|     Andreozzi G.|                5|
|       Andujar P.|                5|
|       Aragone J.|                5|
|      Arguello F.|                5|
|     Arnaboldi A.|                5|
|       Arnaldi M.|                3|
+-----------------+-----------------+
only showing top 20 rows

+------------+-----------------+
|    Player_2|max_sets_player
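
The two results above are computed separately for the `Player_1` and `Player_2` columns, so a player who appears on both sides ends up with two partial maxima. A plain-Python sketch of combining them, on hypothetical matches with placeholder player names; the explicit `int(...)` conversion is there because, without the `inferSchema` option, `spark.read.csv` loads every column as a string:

```python
# Hypothetical matches: (Player_1, Player_2, Best of), with "Best of" as text
matches = [
    ("Player A", "Player B", "3"),
    ("Player C", "Player A", "5"),
    ("Player B", "Player C", "3"),
]

max_sets = {}
for player_1, player_2, best_of in matches:
    sets = int(best_of)  # compare as numbers, not as text
    for player in (player_1, player_2):
        # Keep the largest value seen so far, whichever side the player was on
        if sets > max_sets.get(player, 0):
            max_sets[player] = sets

print(max_sets)  # {'Player A': 5, 'Player B': 3, 'Player C': 5}
```

Here Player A's only best-of-5 match is the one where they are listed second, so a per-column view of `Player_1` alone would report 3.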

Number of victories by surface and player

In [40]:
# Keep the matches won by the player listed as Player_1
victories_df = TennisDF.filter(col("Winner") == col("Player_1"))

# Group by surface and player and count the number of victories
victories_by_surface_and_player = victories_df.groupBy("Surface", "Player_1").agg(count("Winner").alias("num_victories"))

# Show the result
victories_by_surface_and_player.show()

+-------+----------------+-------------+
|Surface|        Player_1|num_victories|
+-------+----------------+-------------+
|   Clay|     Janowicz J.|            7|
|   Hard|   Gabashvili T.|           10|
|   Clay|       Quinzi G.|            1|
|   Hard|      Krueger M.|            4|
|   Clay|    Tsitsipas S.|           46|
|   Clay|      Galovic V.|            1|
|   Hard|      Bhambri Y.|            5|
|  Grass|        Falla A.|            4|
|   Clay|       Ahouda A.|            1|
|   Hard|      Almagro N.|           15|
|   Clay|     Delbonis F.|           64|
|   Hard|       Quiroz R.|            1|
|  Grass|       Sugita Y.|           10|
|   Hard|        Fritz T.|           64|
|  Grass|Bautista Agut R.|           20|
|   Hard|  Gunneswaran P.|            1|
|   Clay|   Kecmanovic M.|           13|
|   Hard|      Bolelli S.|           13|
|   Hard|       Broady L.|            3|
|   Hard|     Tsonga J.W.|           75|
+-------+----------------+-------------+
only showing top
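
Filtering on `Winner == Player_1` keeps only the matches in which the winner happens to be listed first, so wins recorded in the `Player_2` position are not counted. A plain-Python sketch of the difference, using the first three rows of the `TennisDF.show()` output above. Grouping by the `Winner` column instead (in Spark, something like `TennisDF.groupBy("Surface", "Winner").count()`) is one possible alternative, offered as an assumption about the intended result rather than as the author's method:

```python
from collections import Counter

# (Surface, Player_1, Player_2, Winner), copied from the first rows shown earlier
rows = [
    ("Hard", "Mayer F.", "Giraldo S.", "Mayer F."),
    ("Hard", "Benneteau J.", "Nieminen J.", "Nieminen J."),
    ("Hard", "Nishikori K.", "Matosevic M.", "Nishikori K."),
]

# Same logic as the cell above: only matches won by the player listed first
wins_as_player_1 = Counter(
    (surface, p1) for surface, p1, _, winner in rows if winner == p1
)

# Alternative: key on the Winner column, whichever side the winner was listed on
wins_by_winner = Counter((surface, winner) for surface, _, _, winner in rows)

print(len(wins_as_player_1), len(wins_by_winner))  # 2 3
```

Nieminen J. won while listed as `Player_2`, so that win appears only in the second counter.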

Show whether the matches each player won were played indoors or outdoors.

In [41]:
# Keep the matches won by the player listed as Player_1
victories_df = TennisDF.filter(col("Winner") == col("Player_1"))

# Add a column indicating whether the match was played indoors
victories_df = victories_df.withColumn("Indoor", when(col("Court") == "Indoor", True).otherwise(False))

# Group by player and count the victories indoors and outdoors
victories_by_indoor = victories_df.groupBy("Player_1", "Indoor").agg(count("Winner").alias("num_victories"))

# Show the result
victories_by_indoor.show()

+--------------+------+-------------+
|      Player_1|Indoor|num_victories|
+--------------+------+-------------+
|   Karlovic I.| false|           66|
|    Pouille L.| false|           45|
|   Forejtek J.|  true|            1|
|      Lacko L.|  true|            4|
|Trungelliti M.|  true|            1|
|  De Minaur A.|  true|           16|
|    Millman J.|  true|           13|
|   Harrison R.| false|           35|
|      Hajek J.| false|            3|
|   Marterer M.| false|           14|
|  Khachanov K.|  true|           21|
|      Souza J.| false|            2|
|    Galovic V.| false|            1|
|         Wu Y.| false|            2|
|      Evans D.|  true|           13|
|      Marti Y.| false|            1|
|     Tabilo A.|  true|            2|
|       Sock J.| false|           71|
|     Berrer M.|  true|            2|
|   Berankis R.|  true|           12|
+--------------+------+-------------+
only showing top 20 rows



Count how many victories each player has in each tournament, ordered by tournament.


In [43]:
# Keep the matches won by the player listed as Player_1
victories_df = TennisDF.filter(col("Winner") == col("Player_1"))

# Group by tournament and player and count the number of victories
victories_by_tournament_and_player = victories_df.groupBy("Tournament", "Player_1").agg(count("Winner").alias("num_victories"))

# Order by tournament and player
victories_by_tournament_and_player = victories_by_tournament_and_player.orderBy("Tournament", "Player_1")

# Show the result
victories_by_tournament_and_player.show()

+--------------------+--------------------+-------------+
|          Tournament|            Player_1|num_victories|
+--------------------+--------------------+-------------+
|ABN AMRO World Te...|  Auger-Aliassime F.|            4|
|ABN AMRO World Te...|        Bachinger M.|            1|
|ABN AMRO World Te...|        Baghdatis M.|            2|
|ABN AMRO World Te...|          Barrere G.|            1|
|ABN AMRO World Te...|    Bautista Agut R.|            1|
|ABN AMRO World Te...|         Bautista R.|            3|
|ABN AMRO World Te...|           Bedene A.|            1|
|ABN AMRO World Te...|        Benneteau J.|            2|
|ABN AMRO World Te...|          Berdych T.|            3|
|ABN AMRO World Te...|           Berrer M.|            1|
|ABN AMRO World Te...|          Brouwer G.|            1|
|ABN AMRO World Te...|           Bublik A.|            2|
|ABN AMRO World Te...|           Chardy J.|            2|
|ABN AMRO World Te...|            Chung H.|            1|
|ABN AMRO Worl