# DataFrames Basics

## Prerequisites

Install Spark and Java in the VM

In [None]:
# install Java8
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# download spark 3.5.0
!wget -q https://apache.osuosl.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

In [None]:
ls -l # check the .tgz is there

total 391016
drwxr-xr-x 1 root root      4096 Jan 10 14:23 sample_data/
-rw-r--r-- 1 root root 400395283 Sep  9 02:10 spark-3.5.0-bin-hadoop3.tgz


In [None]:
# extract the archive
!tar xf spark-3.5.0-bin-hadoop3.tgz

In [None]:
!pip install -q findspark

In [None]:

!pip install py4j

# For maps
!pip install folium
!pip install plotly



Define the environment

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.5.0-bin-hadoop3"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"
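
Before starting Spark it can help to confirm that the directories named in these variables actually exist, since a wrong path tends to surface only later, as a less obvious startup error. The sketch below is a plain-Python check; `missing_dirs` is a hypothetical helper name, not part of Spark or findspark.

```python
import os

def missing_dirs(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names whose value is unset or not an existing directory."""
    return [n for n in names if not os.path.isdir(env.get(n, ""))]

# In the notebook, pass the real environment:
#   missing_dirs(os.environ)
# An empty list means both paths exist.
```
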

Start Spark Session

---

In [None]:
import findspark
findspark.init(os.environ["SPARK_HOME"])  # use the SPARK_HOME defined above

from pyspark.sql import SparkSession

# create the session
spark = SparkSession \
        .builder \
        .appName("DataFrames Basics") \
        .master("local[*]") \
        .getOrCreate()

spark.version

'3.5.0'

In [None]:
spark

In [None]:
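# Enable Arrow to speed up conversions between Spark and pandas DataFrames
# (this key replaces the deprecated "spark.sql.execution.arrow.enabled")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")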

In [None]:
# Import sql functions
from pyspark.sql.functions import *

Download datasets

In [None]:
!mkdir -p dataset
!wget -q https://raw.githubusercontent.com/paponsro/spark_edem_2324/master/dataset/cars.json -P /dataset
!wget -q https://raw.githubusercontent.com/paponsro/spark_edem_2324/master/dataset/movies.json -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/bank.csv -P /dataset
!wget -q https://github.com/masfworld/datahack_docker/raw/master/zeppelin/data/vehicles.csv -P /dataset
!ls /dataset

bank.csv  cars.json  movies.json  vehicles.csv


In [None]:
ls -l /dataset

total 1784
-rw-r--r-- 1 root root  461474 Jan 12 16:41 bank.csv
-rw-r--r-- 1 root root   74910 Jan 12 16:41 cars.json
-rw-r--r-- 1 root root 1274347 Jan 12 16:41 movies.json
-rw-r--r-- 1 root root    4370 Jan 12 16:41 vehicles.csv


**WE READ THE CSV**

In [None]:
drugDeathsDF = spark.read.option('header', 'true').option('delimiter', ',').option('inferSchema', 'true').csv('dataset/drug_deaths.csv')
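
Spark's CSV reader does not raise an error for option keys it does not recognise. An unknown key is simply not used, so a misspelled option silently falls back to its default. A small plain-Python check such as the sketch below can catch these typos before the read. `KNOWN_CSV_OPTIONS` is a partial, hand-picked list and `suspicious_options` is a hypothetical helper, not a Spark API.

```python
import difflib

# Partial list of option names accepted by Spark's CSV reader (not exhaustive).
KNOWN_CSV_OPTIONS = ["header", "sep", "delimiter", "inferSchema", "multiLine",
                     "quote", "escape", "encoding", "nullValue"]

def suspicious_options(options):
    """Map each unrecognised option key to its closest known name (or None)."""
    known_lower = {name.lower(): name for name in KNOWN_CSV_OPTIONS}
    report = {}
    for key in options:
        if key.lower() in known_lower:  # Spark option keys are case-insensitive
            continue
        close = difflib.get_close_matches(key.lower(), list(known_lower), n=1)
        report[key] = known_lower[close[0]] if close else None
    return report

print(suspicious_options({"header": "true", "delimitter": ","}))
```
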

In [None]:
drugDeathsDF.show()

+----------+-------------+--------------------+--------+----+------+------------+-------------+---------------+--------------+---------+-----------+---------+---------------+--------------------+-----------+----------+------------+-----------+--------------------+---------------+------+-------+--------+-----------------+---------+-----------+-------+-----------+--------------+---------+------+------+------------------+-------------+-----+---------+---------+-------------+-------------+----------------+-------------+
|       _c0|           ID|                Date|DateType| Age|   Sex|        Race|ResidenceCity|ResidenceCounty|ResidenceState|DeathCity|DeathCounty| Location|LocationifOther| DescriptionofInjury|InjuryPlace|InjuryCity|InjuryCounty|InjuryState|                 COD|OtherSignifican|Heroin|Cocaine|Fentanyl|Fentanyl_Analogue|Oxycodone|Oxymorphone|Ethanol|Hydrocodone|Benzodiazepine|Methadone|Amphet|Tramad|Morphine_NotHeroin|Hydromorphone|Other|OpiateNOS|AnyOpioid|MannerofDeath| D

**GROUP BY RACE AND SHOW THE NUMBER OF DEATHS IN EACH GROUP, SORTED FROM HIGHEST TO LOWEST**

In [None]:
DF1 = drugDeathsDF.groupBy('Race').agg(count('Sex').alias('Cantidad')).orderBy(desc('Cantidad'))
DF1.show()

+--------------------+--------+
|                Race|Cantidad|
+--------------------+--------+
|               White|    4002|
|     Hispanic, White|     560|
|               Black|     433|
|     Hispanic, Black|      24|
|             Unknown|      23|
|        Asian, Other|      18|
|        Asian Indian|      14|
|               Other|      11|
|                NULL|      10|
|             Chinese|       2|
|            Hawaiian|       1|
|Native American, ...|       1|
+--------------------+--------+
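
Note that `count('Sex')` counts only the rows in each group whose `Sex` value is not null, whereas `count('*')` counts every row. The plain-Python sketch below mimics both behaviours on a few made-up rows (the data is illustrative, not taken from the CSV) to show how the two can differ.

```python
from collections import Counter

# Made-up rows; None stands in for a SQL NULL.
rows = [
    {"Race": "White", "Sex": "Male"},
    {"Race": "White", "Sex": None},
    {"Race": "Black", "Sex": "Female"},
    {"Race": None,    "Sex": None},
]

# Equivalent of count('*'): every row in the group is counted.
row_counts = Counter(r["Race"] for r in rows)

# Equivalent of count('Sex'): rows with a null Sex are skipped.
non_null_sex = Counter(r["Race"] for r in rows if r["Sex"] is not None)

for race in row_counts:
    print(race, row_counts[race], non_null_sex[race])
```
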



**NEXT, WE GROUP BY CITY OF RESIDENCE AND COMPUTE THE AVERAGE AGE FOR EACH CITY**

In [None]:
DF2 = drugDeathsDF.select('Age', 'ResidenceCity')
EdadDF = DF2.groupBy('ResidenceCity').agg(avg('Age').alias('media_edad')).orderBy(desc('media_edad'))

EdadDF.show()


+-----------------+----------+
|    ResidenceCity|media_edad|
+-----------------+----------+
|ARLINGTON HEIGHTS|      72.0|
|   ALFRED STATION|      65.0|
|        WELLESLEY|      64.0|
|     NORTH WINDAM|      64.0|
|            SAKEM|      63.0|
|          SEBRING|      62.0|
|    OLD GREENWICH|      59.0|
|           NAPLES|      59.0|
|       SOUTH LYME|      59.0|
|          JACKSON|      59.0|
|   EAST WOODSTOCK|      58.0|
|           ROSCOE|      58.0|
|        CHEPACHET|      58.0|
|       WASHINGTON|      57.5|
|        SOUTHPORT|      57.0|
|           NUTLEY|      57.0|
|          CHELSEA|      57.0|
|        BLANDFORD|      57.0|
|         ROCKFALL|      57.0|
|            CHASE|      56.0|
+-----------------+----------+
only showing top 20 rows



**AS THE LAST GROUP BY, WE GROUP BY SEX AND COUNT THE DEATHS IN EACH GROUP**

In [None]:
personaDF = drugDeathsDF.groupBy('Sex').agg(count('ID').alias('Sex_count')).orderBy(desc('Sex_count'))
personaDF.show()

+-------+---------+
|    Sex|Sex_count|
+-------+---------+
|   NULL|    15150|
|   Male|     3773|
| Female|     1325|
|Unknown|        1|
+-------+---------+
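
The large `NULL` group is worth investigating before trusting these counts. One possible cause, stated here as an assumption that has not been checked against the file, is free-text fields containing line breaks inside quotes: a reader that splits on newlines turns one record into several fragments with mostly empty columns. Spark's CSV reader has a `multiLine` option for such files. The plain-Python sketch below, on made-up data, shows the difference between naive line splitting and quote-aware parsing.

```python
import csv
import io

# Made-up CSV text: the Note field of record 1 contains a newline inside quotes.
text = 'ID,Note,Sex\n1,"found at home\nno witnesses",Male\n2,unknown,Female\n'

# Naive approach: every physical line is treated as a record.
naive_records = text.splitlines()[1:]

# Quote-aware approach: the quoted newline stays inside one field.
parsed_records = list(csv.reader(io.StringIO(text)))[1:]

print(len(naive_records))   # number of physical lines after the header
print(len(parsed_records))  # number of logical records after the header
```
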



In [22]:
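personaDF2 = drugDeathsDF.groupBy('Location').agg(count('ID'))
personaDF2.show()

# Window spec partitioned by location (defined here but not used below)
from pyspark.sql.window import Window
byLocation = Window.partitionBy('Location')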

+-----------------+---------+
|         Location|count(ID)|
+-----------------+---------+
|             NULL|    15168|
|            Other|      773|
|Convalescent Home|        3|
|          Hospice|        1|
|         Hospital|     1626|
|     Nursing Home|        1|
|        Residence|     2677|
+-----------------+---------+



**THERE IS NOT MUCH MORE WE CAN DO WITH THIS CSV, SO WE DECIDED TO IMPORT TWO MORE**

In [24]:
organizationsDF = spark.read.option('header', 'true').option('delimiter', ',').csv('dataset/organizations-100.csv')
organizationsDF.show()

+-----+---------------+--------------------+--------------------+--------------------+--------------------+-------+--------------------+-------------------+
|Index|Organization Id|                Name|             Website|             Country|         Description|Founded|            Industry|Number of employees|
+-----+---------------+--------------------+--------------------+--------------------+--------------------+-------+--------------------+-------------------+
|    1|FAB0d41d5b5d22c|         Ferrell LLC|  https://price.net/|    Papua New Guinea|Horizontal empowe...|   1990|            Plastics|               3498|
|    2|6A7EdDEA9FaDC52|Mckinney, Riley a...|http://www.hall-b...|             Finland|User-centric syst...|   2015|Glass / Ceramics ...|               4952|
|    3|0bFED1ADAE4bcC1|          Hester Ltd|http://sullivan-r...|               China|Switchable scalab...|   1971|       Public Safety|               5287|
|    4|2bFC1Be8a4ce42f|      Holder-Sellers| https://becke

**WE DEFINE A WINDOW PARTITIONED BY FOUNDING YEAR AND KEEP THE THREE ROWS WITH THE MOST EMPLOYEES IN EACH**

In [32]:
from pyspark.sql.window import Window
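# Cast to int: read without inferSchema, the column is a string and would sort lexicographically
byFounded = Window.partitionBy('Founded').orderBy(col('Number of employees').cast('int').desc())
countryEmployeesDF = organizationsDF.withColumn("rank_employees", row_number().over(byFounded)).filter(col("rank_employees") <= 3)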
countryEmployeesDF.show()

+-----+---------------+--------------------+--------------------+--------------------+--------------------+-------+--------------------+-------------------+--------------+
|Index|Organization Id|                Name|             Website|             Country|         Description|Founded|            Industry|Number of employees|rank_employees|
+-----+---------------+--------------------+--------------------+--------------------+--------------------+-------+--------------------+-------------------+--------------+
|   75|f3C365f0c1A0623|           Hicks LLC| http://alvarez.biz/|            Pakistan|Quality-focused c...|   1970|Computer Software...|               8480|             1|
|    8|ccc93DCF81a31CD|       Mcintosh-Mora|https://www.brook...|Heard Island and ...|Centralized attit...|   1970|     Import / Export|               4389|             2|
|   28|88b1f1cDcf59a37|        Prince-David|http://thompson.com/|    Christmas Island|Virtual holistic ...|   1970|  Banking / Mortgage|    
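
Two details of this step are easy to get wrong. First, `row_number()` numbers rows 1, 2, 3, ... within each partition in the window's order, so filtering on `<= 3` keeps at most three rows per founding year. Second, the order depends on the column type: if the employee count is held as a string, which is what a CSV read without `inferSchema` produces, values compare character by character rather than numerically. The plain-Python sketch below uses illustrative values to show both points; `top_n_per_group` is a hypothetical helper that only mimics the window logic.

```python
from itertools import groupby

# String comparison vs numeric comparison of the same illustrative values.
as_text = ["8480", "999", "4389"]
print(sorted(as_text, reverse=True))            # '999' sorts first as text
print(sorted(map(int, as_text), reverse=True))  # 8480 sorts first as numbers

# Top-N per group, mimicking row_number() over a partition ordered descending.
orgs = [("1970", 8480), ("1970", 4389), ("1970", 999), ("1970", 120), ("1990", 3498)]

def top_n_per_group(rows, n):
    """Return (group, value, row_number) for the n largest values in each group."""
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    result = []
    for _, members in groupby(ordered, key=lambda r: r[0]):
        for position, (group, value) in enumerate(members, start=1):
            if position <= n:
                result.append((group, value, position))
    return result

print(top_n_per_group(orgs, 3))
```
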

**TO FINISH, AND TO AVOID DRAGGING ON, WE JOIN WITH ANOTHER ORGANIZATIONS TABLE**

In [33]:
organizationsDF2 = spark.read.option('header', 'true').option('delimiter', ',').csv('dataset/organizations-1000.csv')
organizationsDF2.show()

+-----+---------------+--------------------+--------------------+--------------------+--------------------+-------+--------------------+-------------------+
|Index|Organization Id|                Name|             Website|             Country|         Description|Founded|            Industry|Number of employees|
+-----+---------------+--------------------+--------------------+--------------------+--------------------+-------+--------------------+-------------------+
|    1|E84A904909dF528|          Liu-Hoover|http://www.day-ha...|      Western Sahara|Ergonomic zero ad...|   1980|   Online Publishing|               6852|
|    2|AAC4f9aBF86EAeF|       Orr-Armstrong|https://www.chapm...|             Algeria|Ergonomic radical...|   1970|     Import / Export|               7994|
|    3|ad2eb3C8C24DB87|           Gill-Lamb|     http://lin.com/|       Cote d'Ivoire|Programmable inte...|   2005|   Apparel / Fashion|               5105|
|    4|D76BB12E5eE165B|         Bauer-Weiss|https://gilles

**THE JOIN IS NOT VERY MEANINGFUL; THE POINT WAS JUST TO TRY ONE**

In [39]:
joinedDF = organizationsDF2.join(organizationsDF, 'Country', 'left').filter(col('Country') == 'Pakistan').select(organizationsDF2.Index, 'Country', organizationsDF2.Name)
joinedDF.show()

+-----+--------+--------------+
|Index| Country|          Name|
+-----+--------+--------------+
|  192|Pakistan|   Harding Inc|
|  192|Pakistan|   Harding Inc|
|  217|Pakistan|    Morrow Inc|
|  217|Pakistan|    Morrow Inc|
|  323|Pakistan|Donovan-Carson|
|  323|Pakistan|Donovan-Carson|
|  495|Pakistan|Cisneros-Parks|
|  495|Pakistan|Cisneros-Parks|
+-----+--------+--------------+
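
Each organization appears twice because a join emits one output row per matching pair: every left-side row is repeated once for each right-side row with the same `Country`. The doubled rows above are consistent with the smaller table containing two organizations in Pakistan, though that is an inference from the output rather than something checked here. A plain-Python sketch with partly made-up rows (`Solo Ltd`, `Org A` and `Org B` are invented, and `left_join_on_country` is a hypothetical helper):

```python
# Stand-ins for the two tables: (name, country).
left = [("Harding Inc", "Pakistan"), ("Morrow Inc", "Pakistan"), ("Solo Ltd", "Peru")]
right = [("Org A", "Pakistan"), ("Org B", "Pakistan")]

def left_join_on_country(left_rows, right_rows):
    """Left join: one output row per matching pair, or one row with None if no match."""
    joined = []
    for l_name, country in left_rows:
        matches = [r_name for r_name, r_country in right_rows if r_country == country]
        for r_name in matches or [None]:
            joined.append((l_name, country, r_name))
    return joined

result = left_join_on_country(left, right)
pakistan_names = [l_name for l_name, country, _ in result if country == "Pakistan"]
print(pakistan_names)  # each left-side name appears once per right-side match
```

If one row per organization is wanted, deduplicating the right-hand table on `Country` before joining, or calling `dropDuplicates()` on the result, are two common options.
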

