# Use case

Disease outbreaks may be inevitable, but large-scale pandemics are not. With better understanding, resources, and effort, the world can respond quickly and effectively to future pandemic risks.

To avoid suffering another major pandemic, we need to take pandemic risk seriously. Despite warnings that another pandemic was likely, COVID-19 killed more than 27 million people.¹

We must build the capacity to test for pathogens and understand them: which pathogens put us at greatest risk, how they spread, and how to tackle them.

We know it is possible to greatly reduce the risk of infectious diseases. Throughout history we have learned to reduce their impact with vaccines, public health efforts, and medicine.

Beyond the old risks, we face new threats from factory farming, genetic engineering, climate change, and antimicrobial resistance. With more attention and effort, we can reduce these risks too.

In [1]:
# install Java8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download spark 3.5.0
!wget -q https://apache.osuosl.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

In [2]:
ls -l # check the .tgz is there

total 391016
drwxr-xr-x 1 root root      4096 Feb 14 14:28 sample_data/
-rw-r--r-- 1 root root 400395283 Sep  9 02:10 spark-3.5.0-bin-hadoop3.tgz


In [3]:
# extract the archive
!tar xf spark-3.5.0-bin-hadoop3.tgz

In [4]:
!pip install -q findspark

In [5]:
!pip install py4j

# For maps
!pip install folium
!pip install plotly



Environment

In [6]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"
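Before starting Spark it can help to confirm that both variables point to directories that actually exist, since a wrong path otherwise only surfaces later as a less obvious error. The helper below is a minimal sketch; `missing_paths` is a hypothetical name introduced here for illustration, not part of findspark or PySpark.

```python
import os


def missing_paths(var_names, environ=os.environ):
    """Return the variables that are unset or do not point to an existing directory."""
    missing = []
    for name in var_names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing


# Report problems up front instead of failing later inside findspark or the JVM.
problems = missing_paths(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these variables:", ", ".join(problems))
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```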

In [7]:
import findspark
findspark.init()  # picks up the SPARK_HOME set in the previous cell

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("Joins") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.5.0'

In [8]:
spark

In [9]:
# Import only the SQL functions used below (max and min shadow the Python builtins)
from pyspark.sql.functions import col, max, min

In [10]:
!mkdir -p /dataset
!wget -q https://raw.githubusercontent.com/paponsro/spark_edem_2324/master/dataset/titles.csv -P /dataset
!ls /dataset

titles.csv


--------------------------------------------------------

READ DATAFRAMES

--------------------------------------------------------------

In [11]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [12]:
ls /gdrive/MyDrive/EDEM/Spark/notebooks/e2e_spark/archive/

'1- the-number-of-cases-of-infectious-diseases.csv'
'2- the-worlds-number-of-vaccinated-one-year-olds.csv'
'3- annual-mortality-rate-from-seasonal-influenza-ages-65.csv'
'4- excess-deaths-cumulative-economist-single-entity.csv'
'5- number-of-reported-cholera-deaths.csv'


Of the 5 datasets, we will use only the first 2.

# First DataFrame

In [13]:
Infeccion = spark.read.option("header", "true").option("delimiter", ",").csv("/gdrive/MyDrive/EDEM/Spark/notebooks/e2e_spark/archive/1- the-number-of-cases-of-infectious-diseases.csv")
Infeccion.show(3)
Infeccion.printSchema()
Infeccion.count()

+-----------+----+----+------------------------------------------+-----------------------------+-----------------------------------------------+----------------------------------------------------------+-----------------------------------------------------------+------------------------------------------------------------+----------------------------------------------------------------+---------------------------------+----------------------+
|     Entity|Code|Year|Indicator:Number of cases of yaws reported|Total (estimated) polio cases|Reported cases of guinea worm disease in humans|Number of new cases of rabies, in both sexes aged all ages|Number of new cases of malaria, in both sexes aged all ages|Number of new cases of hiv/aids, in both sexes aged all ages|Number of new cases of tuberculosis, in both sexes aged all ages|Number of reported smallpox cases|Reported cholera cases|
+-----------+----+----+------------------------------------------+-----------------------------+----------

10521

In [39]:
valores_unicos_entity = Infeccion.select("Entity").distinct()
valores_unicos_entity.show(100)

+--------------------+
|              Entity|
+--------------------+
|                Chad|
|            Paraguay|
|              Russia|
|               Macao|
|Serbia and Monten...|
|               World|
|               Yemen|
|             Senegal|
|              Sweden|
|             Tokelau|
|            Kiribati|
|              Guyana|
|             Eritrea|
|         Philippines|
|            Djibouti|
|European Region (...|
|               Tonga|
|            Malaysia|
|           Singapore|
|                Fiji|
|              Turkey|
|United States Vir...|
|              Malawi|
|                Iraq|
|              Europe|
|             Germany|
|Northern Mariana ...|
|             Comoros|
|         Afghanistan|
|            Cambodia|
|              Jordan|
|            Maldives|
|              Rwanda|
|        Saint Helena|
|     Western Pacific|
|               Sudan|
|               Palau|
|              France|
|              Greece|
|African Region (WHO)|
|          

In [33]:

# Cast the numeric columns used below from string to integer
Infeccion = Infeccion.withColumn("Year", col("Year").cast("int"))
Infeccion = Infeccion.withColumn("Indicator:Number of cases of yaws reported", col("Indicator:Number of cases of yaws reported").cast("int"))
Infeccion = Infeccion.withColumn("Total (estimated) polio cases", col("Total (estimated) polio cases").cast("int"))
Infeccion = Infeccion.withColumn("Number of new cases of tuberculosis, in both sexes aged all ages", col("Number of new cases of tuberculosis, in both sexes aged all ages").cast("int"))

# Check that the types have changed
Infeccion.printSchema()

root
 |-- Entity: string (nullable = true)
 |-- Code: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Indicator:Number of cases of yaws reported: integer (nullable = true)
 |-- Total (estimated) polio cases: integer (nullable = true)
 |-- Reported cases of guinea worm disease in humans: string (nullable = true)
 |-- Number of new cases of rabies, in both sexes aged all ages: string (nullable = true)
 |-- Number of new cases of malaria, in both sexes aged all ages: string (nullable = true)
 |-- Number of new cases of hiv/aids, in both sexes aged all ages: string (nullable = true)
 |-- Number of new cases of tuberculosis, in both sexes aged all ages: integer (nullable = true)
 |-- Number of reported smallpox cases: string (nullable = true)
 |-- Reported cholera cases: string (nullable = true)



In [34]:
Infeccion_por_pais = Infeccion.groupBy(col("Entity")).sum("Total (estimated) polio cases")
Infeccion_por_pais_ordenado = Infeccion_por_pais.orderBy(col("sum(Total (estimated) polio cases)").desc())
# Show the resulting DataFrame
Infeccion_por_pais_ordenado.show()

+--------------------+----------------------------------+
|              Entity|sum(Total (estimated) polio cases)|
+--------------------+----------------------------------+
|               World|                           3439520|
|     South-East Asia|                           2057819|
|               India|                           1964243|
|Eastern Mediterra...|                            469554|
|     Western Pacific|                            450333|
|              Africa|                            372798|
|               China|                            265030|
|            Pakistan|                            172339|
|             Nigeria|                             95181|
|             Vietnam|                             93831|
|               Egypt|                             86013|
|         Afghanistan|                             74904|
|            Americas|                             68274|
|               Kenya|                             47299|
|            C

In [35]:
Infeccion_por_pais = Infeccion.groupBy(col("Entity")).sum("Total (estimated) polio cases")
Infeccion_por_pais_ordenado = Infeccion_por_pais.orderBy(col("sum(Total (estimated) polio cases)").asc())
# Show the resulting DataFrame (in ascending order, NULL sums come first)
Infeccion_por_pais_ordenado.show()

+--------------------+----------------------------------+
|              Entity|sum(Total (estimated) polio cases)|
+--------------------+----------------------------------+
|     Low Income (WB)|                              NULL|
|             Bermuda|                              NULL|
|         Puerto Rico|                              NULL|
|Serbia and Monten...|                              NULL|
|Latin America & C...|                              NULL|
|European Region (...|                              NULL|
|Sub-Saharan Afric...|                              NULL|
|Northern Mariana ...|                              NULL|
|       North America|                              NULL|
|               Wales|                              NULL|
|  Middle Income (WB)|                              NULL|
|  North America (WB)|                              NULL|
|Region of the Ame...|                              NULL|
|Lower Middle Inco...|                              NULL|
|South-East As

Maximum and minimum

In [17]:
Infeccion.select(min('Total (estimated) polio cases'), max('Total (estimated) polio cases')).show()

+----------------------------------+----------------------------------+
|min(Total (estimated) polio cases)|max(Total (estimated) polio cases)|
+----------------------------------+----------------------------------+
|                                 0|                            460159|
+----------------------------------+----------------------------------+



# Second DataFrame

In [18]:
ninos_vacunados = spark.read.option("header", "true").option("delimiter", ",").csv("/gdrive/MyDrive/EDEM/Spark/notebooks/e2e_spark/archive/2- the-worlds-number-of-vaccinated-one-year-olds.csv")
ninos_vacunados.show(3)
ninos_vacunados.printSchema()
ninos_vacunados.count()

+-----------+----+----+---------------------------------------------+------------------------------------------------------------------------+-------------------------------------------------------+---------------------------------------------------+----------------------------------------------------------------------------+--------------------------------------------+----------------------------------------------------------------------------+------------------------------------------------------------+-------------------------------------------+
|     Entity|Code|Year|Number of one-year-olds vaccinated with HepB3|Number of one-year-olds vaccinated with DTP containing vaccine, 3rd dose|Number of one-year-olds vaccinated with polio, 3rd dose|Population - Sex: all - Age: 0 - Variant: estimates|Number of one-year-olds vaccinated with measles-containing vaccine, 1st dose|Number of one-year-olds vaccinated with Hib3|Number of one-year-olds vaccinated with rubella-containing vaccine, 1st d

10668

In [19]:
# using SQL
ninos_vacunados.createOrReplaceTempView("ninos_vacunados")
spark.sql("select * from ninos_vacunados")
#spark.sql("select * from ninos_vacunados").show()

DataFrame[Entity: string, Code: string, Year: string, Number of one-year-olds vaccinated with HepB3: string, Number of one-year-olds vaccinated with DTP containing vaccine, 3rd dose: string, Number of one-year-olds vaccinated with polio, 3rd dose: string, Population - Sex: all - Age: 0 - Variant: estimates: string, Number of one-year-olds vaccinated with measles-containing vaccine, 1st dose: string, Number of one-year-olds vaccinated with Hib3: string, Number of one-year-olds vaccinated with rubella-containing vaccine, 1st dose: string, Number of one-year-olds vaccinated with rotavirus, last dose: string, Number of one-year-olds vaccinated with BCG: string]

# JOIN

In [24]:
Infeccion = spark.read.option("header", "true").option("delimiter", ",").csv("/gdrive/MyDrive/EDEM/Spark/notebooks/e2e_spark/archive/1- the-number-of-cases-of-infectious-diseases.csv")
nuevos_Infeccion = Infeccion.select("Entity", "Reported cases of guinea worm disease in humans")
nuevos_Infeccion_no_null = nuevos_Infeccion.filter(col("Reported cases of guinea worm disease in humans").isNotNull())


In [27]:
ninos_vacunados = spark.read.option("header", "true").option("delimiter", ",").csv("/gdrive/MyDrive/EDEM/Spark/notebooks/e2e_spark/archive/2- the-worlds-number-of-vaccinated-one-year-olds.csv")
nuevos_ninos_vacunados = ninos_vacunados.select("Entity", "Number of one-year-olds vaccinated with HepB3")
nuevos_ninos_vacunados_no_null = nuevos_ninos_vacunados.filter(col("Number of one-year-olds vaccinated with HepB3").isNotNull())
nuevos_ninos_vacunados_no_null.show(1)

+-----------+---------------------------------------------+
|     Entity|Number of one-year-olds vaccinated with HepB3|
+-----------+---------------------------------------------+
|Afghanistan|                                       645329|
+-----------+---------------------------------------------+
only showing top 1 row



In [44]:
joinCondition = nuevos_Infeccion_no_null.Entity == nuevos_ninos_vacunados_no_null.Entity
DF = nuevos_Infeccion_no_null.join(nuevos_ninos_vacunados_no_null, joinCondition, "inner")
# withColumnRenamed renames every column called "Entity", so both join keys end up as "Entity1"
DF = DF.withColumnRenamed("Entity", "Entity1")
DF.show(1)


+-----------+-----------------------------------------------+-----------+---------------------------------------------+
|    Entity1|Reported cases of guinea worm disease in humans|    Entity1|Number of one-year-olds vaccinated with HepB3|
+-----------+-----------------------------------------------+-----------+---------------------------------------------+
|Afghanistan|                                              0|Afghanistan|                                       645329|
+-----------+-----------------------------------------------+-----------+---------------------------------------------+
only showing top 1 row

