

# Propaganda, start the `spark` session

> For SQL users, Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore.

> For Spark users, Spark SQL becomes the narrow-waist for manipulating (semi-) structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It truly unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.

> For open source hackers, Spark SQL proposes a novel, elegant way of building query planners. It is incredibly easy to add new optimizations under this framework.

> Internally, a structured query is a Catalyst tree of (logical and physical) relational operators and expressions.




In [91]:
# import the usual suspects
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import os
from pathlib import Path
import sys
import timeit

%matplotlib inline
import seaborn as sns

sns.set_context("notebook", font_scale=1.2)

During the session, we will use classes and functions exported by `pyspark`.


In [2]:
# spark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.functions import col
import pyspark.sql.functions as fn
from pyspark.sql.catalog import Catalog
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import IntegerType, StringType

Start the `SparkSession`

In [3]:
conf = SparkConf().setAppName("Spark SQL Illustrations")
sc = SparkContext(conf=conf)

spark = (SparkSession
    .builder
    .appName("Spark SQL")
    .getOrCreate()
)

US Baby Names 1880-2017
=======================


Description
: US baby names provided by the SSA. 

This dataset contains all names used
for at least 5 children of either sex during a year. 


The file is made of `1924665` lines and  4 columns.

```
|-- name: string (nullable = true)
|-- n: integer (nullable = true)
|-- sex: string (nullable = true)
|-- year: integer (nullable = true)
```

Each row gives, for a given name, sex, and year, the number of babies 
of that sex who were given that name during that year. Names 
with fewer than 5 occurrences during the year were not recorded. 

|    name|  n|sex|year|
|:--------|:---:|:---:|:----:|
|  Emilia|112|  F|1985|
|   Kelsi|112|  F|1985|
|  Margot|112|  F|1985|
|  Mariam|112|  F|1985|
|Scarlett|112|  F|1985|

First, we download the data if it's not there yet

In [4]:
import requests, zipfile, io
from pathlib import Path

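path = Path('babynames_short.csv')
if not path.exists():
    # Download the zipped CSV and extract it into the working directory
    url = "https://stephanegaiffas.github.io/big_data_course/data/babynames_short.csv.zip"
    r = requests.get(url)
    r.raise_for_status()  # stop here if the download failed
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(path='./')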

Load `babynames` from a `csv` file

In [5]:
df_sp = spark.read\
             .format('csv')\
             .option("header", "true")\
             .option("mode", "FAILFAST")\
             .option("inferSchema", "true")\
             .option("sep", ",")\
             .load("babynames_short.csv")

df_sp.printSchema()

root
 |-- name: string (nullable = true)
 |-- n: integer (nullable = true)
 |-- sex: string (nullable = true)
 |-- year: double (nullable = true)




Ensure that the dataframe has the following schema:

    root
        |-- name: string (nullable = true)
        |-- n: integer (nullable = true)
        |-- sex: string (nullable = true)
        |-- year: integer (nullable = true)




SQL versus spark-Dataframe API
=================================

>  Dataset API vs SQL

> Spark SQL supports two "modes" to write structured queries: Dataset API and SQL. SQL Mode is used to express structured queries using SQL statements using SparkSession.sql operator, expr standard function and spark-sql command-line tool.

> Some structured queries can be expressed much easier using Dataset API, but there are some that are only possible in SQL. In other words, you may find mixing Dataset API and SQL modes challenging yet rewarding.

> What is important, and one of the reasons why Spark SQL has been so successful, is that there is no performance difference between the modes. Whatever mode you use to write your structured queries, they all end up as a tree of Catalyst relational data structures. And, yes, you could consider writing structured queries using Catalyst directly, but that could quickly become unwieldy for maintenance (i.e. finding Spark SQL developers who could be comfortable with it as well as being fairly low-level and therefore possibly too dependent on a specific Spark SQL version).

Warmup:  compute the 10 most popular names given to babies in year 2000.
======================================================================

## Using `spark.sql()`

In order to use mode `sql`, create a temporary view from the `DataFrame`.

1. What are temporary views made of?
1. Are there other kinds of views in Spark's world?

In [6]:
Catalog(spark).listTables()

[]

In [12]:
df_sp.createOrReplaceTempView("temp_view")
Catalog(spark).listTables()

[Table(name='temp_view', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]


## A query is a plain SQL query embodied in a string.



In [13]:
query = """TODO: """

#spark.sql(query)


> This phrasing is not consistent with the DRY principle. Fix this using formatted strings.
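
A minimal sketch of the formatted-string idea: the query text is written once, and the year, sex, and number of rows become parameters. The helper name `top_names_query` and its arguments are hypothetical; interpolating values into SQL this way is only reasonable for trusted constants such as the ones used here.

```python
def top_names_query(view, year, sex, k=10):
    """Build the 'k most popular names' query for one year and one sex.

    Hypothetical helper: the view is expected to expose the columns
    name, n, sex and year.
    """
    return f"""
    SELECT *
    FROM {view}
    WHERE year = {int(year)} AND sex = '{sex}'
    ORDER BY n DESC
    LIMIT {int(k)}
    """

# The same template serves both sexes
query_f = top_names_query("temp_view", 2000, "F")
query_m = top_names_query("temp_view", 2000, "M")
print(query_f)
```

Each string can then be passed to `spark.sql(...)`, for example `spark.sql(query_f).show()`.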

## Using the dataframe/dataset API

The same result can be obtained with the DataFrame API of Spark SQL.

### Pedestrian approach

1. First select the `10` most popular names for girls in year `2000` and store them in a Spark dataframe
`top10_2000_f`.
1. Does the definition of `top10_2000_f` involve _transformations_, _actions_, or both?
1. What is the type of the result returned by `top10_2000_f.take(2)`? What is the type of its elements?



In [29]:
top10_2000_f = df_sp.where('year = 2000').where("sex == 'F'").orderBy(df_sp.n.desc()).limit(10)
top10_2000_f.show()

+---------+-----+---+------+
|     name|    n|sex|  year|
+---------+-----+---+------+
|    Emily|25953|  F|2000.0|
|   Hannah|23080|  F|2000.0|
|  Madison|19967|  F|2000.0|
|   Ashley|17997|  F|2000.0|
|    Sarah|17697|  F|2000.0|
|   Alexis|17629|  F|2000.0|
| Samantha|17266|  F|2000.0|
|  Jessica|15709|  F|2000.0|
|Elizabeth|15094|  F|2000.0|
|   Taylor|15078|  F|2000.0|
+---------+-----+---+------+




1. Do the same thing for boys.



In [30]:
top10_2000_m = df_sp.where('year = 2000').where("sex == 'M'").orderBy(df_sp.n.desc()).limit(10)
top10_2000_m.show()

+-----------+-----+---+------+
|       name|    n|sex|  year|
+-----------+-----+---+------+
|      Jacob|34471|  M|2000.0|
|    Michael|32035|  M|2000.0|
|    Matthew|28572|  M|2000.0|
|     Joshua|27538|  M|2000.0|
|Christopher|24931|  M|2000.0|
|   Nicholas|24652|  M|2000.0|
|     Andrew|23639|  M|2000.0|
|     Joseph|22825|  M|2000.0|
|     Daniel|22312|  M|2000.0|
|      Tyler|21503|  M|2000.0|
+-----------+-----+---+------+




1. Compute the _union_ of the two spark dataframes. Store the result in
dataframe `top10_2000`



In [31]:
top10_2000 = top10_2000_m.union(top10_2000_f)
top10_2000.show()

+-----------+-----+---+------+
|       name|    n|sex|  year|
+-----------+-----+---+------+
|      Jacob|34471|  M|2000.0|
|    Michael|32035|  M|2000.0|
|    Matthew|28572|  M|2000.0|
|     Joshua|27538|  M|2000.0|
|Christopher|24931|  M|2000.0|
|   Nicholas|24652|  M|2000.0|
|     Andrew|23639|  M|2000.0|
|     Joseph|22825|  M|2000.0|
|     Daniel|22312|  M|2000.0|
|      Tyler|21503|  M|2000.0|
|      Emily|25953|  F|2000.0|
|     Hannah|23080|  F|2000.0|
|    Madison|19967|  F|2000.0|
|     Ashley|17997|  F|2000.0|
|      Sarah|17697|  F|2000.0|
|     Alexis|17629|  F|2000.0|
|   Samantha|17266|  F|2000.0|
|    Jessica|15709|  F|2000.0|
|  Elizabeth|15094|  F|2000.0|
|     Taylor|15078|  F|2000.0|
+-----------+-----+---+------+




### Do it again, complying with the DRY principle



In [35]:
query = """
    SELECT *
    FROM temp_view
    WHERE year = 2000 AND sex = 'F'
    ORDER BY n DESC
    LIMIT 10;
"""

spark.sql(query).show()

+---------+-----+---+------+
|     name|    n|sex|  year|
+---------+-----+---+------+
|    Emily|25953|  F|2000.0|
|   Hannah|23080|  F|2000.0|
|  Madison|19967|  F|2000.0|
|   Ashley|17997|  F|2000.0|
|    Sarah|17697|  F|2000.0|
|   Alexis|17629|  F|2000.0|
| Samantha|17266|  F|2000.0|
|  Jessica|15709|  F|2000.0|
|Elizabeth|15094|  F|2000.0|
|   Taylor|15078|  F|2000.0|
+---------+-----+---+------+





Name portfolio through ages
===========================

1. Compute for each year and sex the number of distinct names given that year.



In [41]:
nb_names_year_sex = df_sp.groupby(df_sp['year'], df_sp['sex']).count()
nb_names_year_sex.show()

+------+---+-----+
|  year|sex|count|
+------+---+-----+
|1930.0|  M| 4541|
|1935.0|  M| 4145|
|1903.0|  F| 2083|
|1956.0|  F| 6885|
|1892.0|  M| 1260|
|1995.0|  M|10327|
|1966.0|  M| 4536|
|2006.0|  M|14032|
|1970.0|  F| 9350|
|1889.0|  M| 1111|
|1924.0|  M| 4970|
|1973.0|  M| 5876|
|2008.0|  F|20457|
|1911.0|  M| 1999|
|1951.0|  M| 4251|
|1921.0|  M| 4986|
|1934.0|  F| 4973|
|1898.0|  M| 1289|
|1910.0|  M| 1839|
|1953.0|  F| 6499|
+------+---+-----+
only showing top 20 rows




1. Plot the evolution of the number of distinct names as a function of `year`.
Use some aesthetics to distinguish sexes.




In [97]:
nb_distinct_name_year = df_sp.groupBy('year', 'sex').count().orderBy('year')
nb_distinct_name_year.show()

+------+---+-----+
|  year|sex|count|
+------+---+-----+
|1880.0|  M| 1058|
|1880.0|  F|  942|
|1881.0|  F|  938|
|1881.0|  M|  997|
|1882.0|  F| 1028|
|1882.0|  M| 1099|
|1883.0|  M| 1030|
|1883.0|  F| 1054|
|1884.0|  F| 1172|
|1884.0|  M| 1125|
|1885.0|  F| 1197|
|1885.0|  M| 1097|
|1886.0|  F| 1282|
|1886.0|  M| 1110|
|1887.0|  M| 1067|
|1887.0|  F| 1306|
|1888.0|  F| 1474|
|1888.0|  M| 1177|
|1889.0|  M| 1111|
|1889.0|  F| 1479|
+------+---+-----+
only showing top 20 rows



In [106]:
df_pand = nb_distinct_name_year.toPandas()
fig = px.line(df_pand, x='year', y='count', color='sex',
              title='Number of distinct names per year and sex, 1880 to 2017',
              labels={'count': 'Number of distinct names', 'year': 'Year', 'sex': 'Sex'})
fig.show()



Assessing popularity through time
=================================

1. For each year and sex, compute the total number of births
1. Plot the evolution of the sex ratio over time
1. For each year, sex, and name compute the percentage of newborns
given that name for that given year.


> Use `Window` functions.



In [126]:
window = Window.partitionBy('year', 'sex')
df_tot_births = df_sp.withColumn('total births', fn.sum('n').over(window))

df_tot_births = df_tot_births.select('year', 'sex', 'total births').distinct()
df_tot_births.show()

+------+---+------------+
|  year|sex|total births|
+------+---+------------+
|1970.0|  F|     1748175|
|1977.0|  M|     1643766|
|1900.0|  F|      299800|
|1962.0|  M|     2068669|
|2005.0|  F|     1846258|
|1887.0|  F|      145981|
|1960.0|  M|     2132359|
|2013.0|  M|     1886989|
|1935.0|  F|     1048493|
|1966.0|  F|     1691945|
|1890.0|  M|      111025|
|2016.0|  M|     1889052|
|1999.0|  M|     1919391|
|1952.0|  M|     1944564|
|1885.0|  M|      107799|
|1991.0|  F|     1874620|
|1988.0|  M|     1913203|
|1881.0|  F|       91953|
|1920.0|  F|     1198290|
|2004.0|  F|     1834856|
+------+---+------------+
only showing top 20 rows



In [154]:
# Total number of births per year, for each sex
df_tot_f = df_sp.where(df_sp['sex'] == 'F').groupBy('year', 'sex').sum('n').drop('sex').withColumnRenamed('sum(n)', 'n_f')
df_tot_m = df_sp.where(df_sp['sex'] == 'M').groupBy('year', 'sex').sum('n').drop('sex').withColumnRenamed('sum(n)', 'n_m')

# Share of female births among all recorded births of the year
df_tot_ratio = df_tot_f.join(df_tot_m, on=['year']).orderBy('year')
df_tot_ratio = df_tot_ratio.withColumn('ratio', df_tot_ratio['n_f'] / (df_tot_ratio['n_f'] + df_tot_ratio['n_m']))

fig = px.line(df_tot_ratio.toPandas(), x='year', y='ratio',
              title='Share of female births from 1880 to 2017',
              labels={'ratio': 'Share of female births', 'year': 'Year'})
fig.show()

# Only the female share is plotted: the male share is its complement (1 - ratio), so plotting it would be redundant.


1. Compute for each year, sex and name  the `row_number`, `rank`, and `dense_rank`
of the name within that year and sex category, when names are sorted by increasing popularity.
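
The three functions differ only when counts are tied, which does happen in this dataset (several names share `n = 112` in the sample rows above). The plain-Python sketch below mimics their semantics on a made-up list of counts, ordered from most to least popular; the helper `rankings` is hypothetical and is not part of `pyspark`.

```python
def rankings(counts):
    """Return (ordered, row_number, rank, dense_rank) for counts sorted in descending order."""
    ordered = sorted(counts, reverse=True)
    row_number, rank, dense_rank = [], [], []
    for i, value in enumerate(ordered, start=1):
        # row_number: always increases by one, ties are broken arbitrarily
        row_number.append(i)
        if i > 1 and value == ordered[i - 2]:
            # tie with the previous value: rank and dense_rank repeat
            rank.append(rank[-1])
            dense_rank.append(dense_rank[-1])
        else:
            # rank jumps to the row position, dense_rank increases by one
            rank.append(i)
            dense_rank.append(dense_rank[-1] + 1 if dense_rank else 1)
    return ordered, row_number, rank, dense_rank

ordered, rn, rk, drk = rankings([112, 250, 300, 250])  # made-up counts
print(ordered)  # [300, 250, 250, 112]
print(rn)       # [1, 2, 3, 4]
print(rk)       # [1, 2, 2, 4]
print(drk)      # [1, 2, 2, 3]
```

After a tie, `rank` leaves a gap (2, 2, 4) whereas `dense_rank` does not (2, 2, 3).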



In [170]:
# Within each (year, sex) group, order names from most to least popular, so that rank 1 is the most popular name
window = Window.partitionBy('year', 'sex').orderBy(df_sp['n'].desc())

df_fn = df_sp.withColumn('row_number', fn.row_number().over(window))\
        .withColumn('rank', fn.rank().over(window))\
        .withColumn('dense_rank', fn.dense_rank().over(window))

df_fn.show()

+-------+-----+---+------+----------+----+----------+
|   name|    n|sex|  year|row_number|rank|dense_rank|
+-------+-----+---+------+----------+----+----------+
| Robert|62147|  M|1930.0|         1|   1|         1|
|  James|53944|  M|1930.0|         2|   2|         2|
|   John|52432|  M|1930.0|         3|   3|         3|
|William|47259|  M|1930.0|         4|   4|         4|
|Richard|32178|  M|1930.0|         5|   5|         5|
|Charles|31863|  M|1930.0|         6|   6|         6|
| Donald|29046|  M|1930.0|         7|   7|         7|
| George|22779|  M|1930.0|         8|   8|         8|
| Joseph|20981|  M|1930.0|         9|   9|         9|
| Edward|17347|  M|1930.0|        10|  10|        10|
| Thomas|17013|  M|1930.0|        11|  11|        11|
|   Paul|12960|  M|1930.0|        12|  12|        12|
|  Frank|12539|  M|1930.0|        13|  13|        13|
|   Jack|12431|  M|1930.0|        14|  14|        14|
|  David|12272|  M|1930.0|        15|  15|        15|
|Raymond|11715|  M|1930.0|  




Evolution of top popular names through the century
==================================================


1. For each sex, select the ten most popular names in year 2000, and plot the proportion
of newborns given that name over time. Take into account that some names might have
zero occurrence during certain years.



In [198]:
# Ten most popular names of year 2000, for each sex
most_pop_2000_m = df_sp.where(df_sp['year'] == 2000).where(df_sp['sex'] == 'M').orderBy(df_sp['n'].desc()).limit(10)
most_pop_2000_f = df_sp.where(df_sp['year'] == 2000).where(df_sp['sex'] == 'F').orderBy(df_sp['n'].desc()).limit(10)

# Keep only the names
most_pop_2000_m = most_pop_2000_m.drop('n', 'sex', 'year')
most_pop_2000_f = most_pop_2000_f.drop('n', 'sex', 'year')

# Restrict the full history to these names
df_prop_m = df_sp.where(df_sp['sex'] == 'M').join(most_pop_2000_m, on=['name'])
df_prop_f = df_sp.where(df_sp['sex'] == 'F').join(most_pop_2000_f, on=['name'])

df_total_m = df_sp.where(df_sp['sex'] == 'M').groupBy('year', 'sex').sum('n').drop('sex')
df_total_f = df_sp.where(df_sp['sex'] == 'F').groupBy('year', 'sex').sum('n').drop('sex')

df_prop_m = df_prop_m.join(df_total_m, on=['year'])
df_prop_f = df_prop_f.join(df_total_f, on=['year'])

df_prop_m = df_prop_m.withColumn('ratio', df_prop_m['n'] / df_prop_m['sum(n)']).orderBy('year').drop('n')
df_prop_f = df_prop_f.withColumn('ratio', df_prop_f['n'] / df_prop_f['sum(n)']).orderBy('year').drop('n')

In [202]:
fig = px.line(df_prop_m.toPandas(), x='year', y='ratio', color='name',
              title='Proportion of newborn boys given each of the 10 most popular male first names of 2000',
              labels={'ratio':'Ratio', 'year':'Year', 'name':'Firstname'})
fig.show()

In [203]:
fig = px.line(df_prop_f.toPandas(), x='year', y='ratio', color='name',
              title='Proportion of newborn girls given each of the 10 most popular female first names of 2000',
              labels={'ratio':'Ratio', 'year':'Year', 'name':'Firstname'})
fig.show()


1. Use `explain()` to determine the joining strategy used by spark.


Plot the popularity of each of the top ten achievers from year 2000 with respect to time
==================================================================================




In [None]:
# %%
# TODO:
# %%



Plot the total popularity of the top ten achievers from year 2000 with respect to time
==================================================================================




In [None]:
# %%
# TODO:
# %%



Plot Lorenz curves
==================

Every year, the name counts define a discrete probability distribution.
This distribution, just as income or wealth distribution,
is (usually) far from being uniform. We want to assess how uneven it is.
We use the tools developed in econometrics.

Without loss of generality, we assume that the distribution is over $\{1, \ldots, n\}$,
where $n$ is the number of distinct names given during a year,
and that the frequencies $p_1, p_2, \ldots, p_n$ are given in ascending order.

The Lorenz function $L$ maps $[0, 1]$ to $[0, 1]$ and is defined by
$$L(x) = \sum_{i=1}^{\lfloor nx \rfloor} p_i.$$

1. Design a query that adds a column `lorenz` to the dataframe and, for each
row, computes the value of the Lorenz function for the distribution defined by `year` and `sex`.
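
Before writing the Spark query, it may help to check the definition on a small example. The plain-Python sketch below evaluates $L(x)$ directly from raw counts; the function name `lorenz` and the toy counts are assumptions, and the code is meant as a reference for testing, not as the distributed solution.

```python
import math

def lorenz(counts, x):
    """Evaluate L(x) = sum of the floor(n * x) smallest frequencies."""
    total = sum(counts)
    p = sorted(c / total for c in counts)  # frequencies in ascending order
    k = math.floor(len(p) * x)             # number of terms in the sum
    return sum(p[:k])

counts = [1, 1, 2, 4]  # made-up name counts for one year and one sex
print(lorenz(counts, 0.0))  # 0
print(lorenz(counts, 0.5))  # 0.25
print(lorenz(counts, 1.0))  # 1.0
```

Here the two least popular names out of four account for a quarter of the births, so $L(0.5) = 0.25$; for a uniform distribution the value would be $0.5$.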




In [None]:
# TODO:



1. Design a function that takes as input a `year` and plots the Lorenz curve
for that year for both sexes.



In [None]:
# %%
# TODO:
# %%


Gini index
==========

The [Gini index](https://en.wikipedia.org/wiki/Gini_coefficient) is twice the area enclosed between the curves $y=x$
and $y=L(x)$.

Choose a formula that allows you to compute it efficiently.

$$G={\frac {2\sum _{i=1}^{n}iy_{i}}{n\sum _{i=1}^{n}y_{i}}}-{\frac {n+1}{n}}.$$
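
In this formula, $y_1 \le \ldots \le y_n$ are the counts sorted in ascending order, which is the convention of the linked Wikipedia article. A plain-Python sketch of the formula, useful for checking a Spark implementation on small inputs, could look as follows; the function name `gini` and the toy inputs are assumptions.

```python
def gini(counts):
    """Gini index computed from raw counts with the rank-weighted formula."""
    y = sorted(counts)  # ascending order, as the formula requires
    n = len(y)
    weighted = sum(i * v for i, v in enumerate(y, start=1))  # sum of i * y_i
    return 2 * weighted / (n * sum(y)) - (n + 1) / n

print(gini([5, 5, 5, 5]))    # 0.0    (perfectly even)
print(gini([1, 1, 2, 4]))    # 0.3125
print(gini([0, 0, 0, 10]))   # 0.75   (maximal for n = 4, i.e. (n - 1) / n)
```

In Spark, the index $i$ would typically come from a window function ordered by `n` within each `year` and `sex`, followed by an aggregation.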

1. Design a query that computes the Gini index of the `babynames` distribution
for every `year` and `sex`.

1. Plot Gini index over time




In [None]:
# TODO:

In [None]:
# %%
# TODO:
# %%




Close the door, leave work area clean
=====================================


In [None]:
spark.stop()