# Intro to Apache Spark

* [Intro to Spark slides](https://github.com/databricks/tech-talks/blob/master/2020-04-29%20%7C%20Intro%20to%20Apache%20Spark/Intro%20to%20Spark.pdf)
* What is a Spark DataFrame?
  * Read in the [NYT data set](https://github.com/nytimes/covid-19-data) 
* How to perform a distributed count?
* Transformations vs. Actions
* Spark SQL

[Spark docs](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html)

In [2]:
%fs ls databricks-datasets/COVID/covid-19-data/

path,name,size
dbfs:/databricks-datasets/COVID/covid-19-data/.git/,.git/,0
dbfs:/databricks-datasets/COVID/covid-19-data/.github/,.github/,0
dbfs:/databricks-datasets/COVID/covid-19-data/.gitignore,.gitignore,10
dbfs:/databricks-datasets/COVID/covid-19-data/LICENSE,LICENSE,1289
dbfs:/databricks-datasets/COVID/covid-19-data/NYT-readme.md,NYT-readme.md,1748
dbfs:/databricks-datasets/COVID/covid-19-data/PROBABLE-CASES-NOTE.md,PROBABLE-CASES-NOTE.md,3162
dbfs:/databricks-datasets/COVID/covid-19-data/README.md,README.md,26125
dbfs:/databricks-datasets/COVID/covid-19-data/excess-deaths/,excess-deaths/,0
dbfs:/databricks-datasets/COVID/covid-19-data/live/,live/,0
dbfs:/databricks-datasets/COVID/covid-19-data/mask-use/,mask-use/,0


## How do we represent this data?

![Unified Engine](https://files.training.databricks.com/images/105/unified-engine.png)


#### At first there were RDDs...
* **R**esilient: Fault-tolerant
* **D**istributed: Across multiple nodes
* **D**ataset: Collection of partitioned data

RDDs are immutable once created and keep track of their lineage to enable failure recovery.

#### ... and then there were DataFrames
* Higher-level APIs
* User friendly
* Optimizations and performance improvements

![RDD vs DataFrames](https://files.training.databricks.com/images/105/rdd-vs-dataframes.png)

### Create a DataFrame from the NYT COVID data

In [6]:
covid_df = spark.read.csv("dbfs:/databricks-datasets/COVID/covid-19-data/us-counties.csv")
covid_df.show()

Without any options, the reader treats the header row as data and reads every column as a string. Let's look at the [Spark docs](https://spark.apache.org/docs/latest/index.html) to see which options we can pass to the CSV reader.

In [8]:
covid_df = spark.read.csv("dbfs:/databricks-datasets/COVID/covid-19-data/us-counties.csv", header=True, inferSchema=True)
covid_df.show()

In [9]:
covid_df.count()
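
The outline asks how a distributed count is performed. Here is a rough plain-Python sketch of the idea, not Spark's actual implementation: each partition counts its own rows, and only the small per-partition totals are combined. The names `partitions`, `count_partition`, and `distributed_count` are hypothetical, and the sample rows are just a handful of values for illustration.

```python
# Hypothetical data: three "partitions", each a list of rows held by a different worker.
partitions = [
    [("2020-01-21", "Snohomish"), ("2020-01-22", "Snohomish")],
    [("2020-01-24", "Cook")],
    [("2020-01-25", "Orange"), ("2020-01-25", "Cook"), ("2020-01-26", "Maricopa")],
]

def count_partition(rows):
    """Count the rows of one partition; on a cluster this would run on each worker."""
    return sum(1 for _ in rows)

def distributed_count(parts):
    """Add up the per-partition counts; only these small numbers need to be combined."""
    partial_counts = [count_partition(p) for p in parts]
    return sum(partial_counts)

total = distributed_count(partitions)
print(total)
```

The point of the sketch is that the rows themselves never have to be gathered in one place; only one number per partition does.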

### Let's write some Spark code!

* What does the data look like for Los Angeles, the county where all COVID-19 tests are free as of August 6?
* I want the most recent information at the top

In [11]:
(covid_df
 .sort(covid_df["date"].desc()) 
 .filter(covid_df["county"] == "Los Angeles")) 

**...nothing happened. Why?**

## Transformations vs Actions

There are two types of operations in Spark: transformations and actions.

Fundamental to Apache Spark are the notions that
* Transformations are **LAZY**
* Actions are **EAGER**

In [14]:
# same operations as above
(covid_df
 .sort(covid_df["date"].desc()) 
 .filter(covid_df["county"] == "Los Angeles")) 

Why isn't it showing me results? **Sort** and **filter** are `transformations`, which are lazily evaluated in Spark.

Laziness has a number of benefits:
* Spark is not forced to load all the data in the first step
  * Loading everything up front is technically impossible with **REALLY** large datasets.
* Operations are easier to parallelize
  * N different transformations can be processed on a single data element, on a single thread, on a single machine.
* Most importantly, it allows the framework to automatically apply various optimizations
  * This is also why we use DataFrames!
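
As an analogy only (this is plain Python, not Spark), generators behave in a similarly lazy way: building the pipeline touches no data, and the work happens only when something consumes the result. The names `read_rows`, `log`, and `la_rows` are hypothetical.

```python
log = []  # records each row at the moment it is actually read

def read_rows():
    """Yield rows one at a time, logging each read."""
    source = [
        ("2020-01-26", "Los Angeles", 1),
        ("2020-01-26", "Maricopa", 1),
        ("2020-01-27", "Los Angeles", 1),
    ]
    for row in source:
        log.append(row)
        yield row

# "Transformation": this only describes the filter; no row is read yet.
la_rows = (row for row in read_rows() if row[1] == "Los Angeles")
rows_read_before_action = len(log)

# "Action": consuming the pipeline forces the rows to be read and filtered.
result = list(la_rows)
rows_read_after_action = len(log)

print(rows_read_before_action, rows_read_after_action, len(result))
```

Spark goes further than this analogy: because it sees the whole chain of transformations before running anything, it can also rearrange them.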
  
There's a lot Spark's **Catalyst** optimizer can do. For now, let's focus only on what it does with the query above. For more information, read [this blog!](https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html)
  
![Catalyst](https://files.training.databricks.com/images/105/catalyst-diagram.png)

In [16]:
(covid_df
 .sort(covid_df["date"].desc()) 
 .filter(covid_df["county"] == "Los Angeles") 
 .show())  #action!

### We can see the optimizations in action!
* Go to the Spark UI
* Click on the SQL query associated with your Spark job
* See the logical and physical plans!
  * The filter and sort have been swapped: the filter runs first, so only the Los Angeles rows need to be sorted
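
To see why swapping the two steps is safe and cheaper, here is a small plain-Python sketch (an illustration of the idea, not of Catalyst itself). Both orderings give the same rows, but filtering first leaves fewer rows to sort. The helper names `by_date` and `is_la` and the sample rows are hypothetical.

```python
rows = [
    ("2020-01-26", "Maricopa"),
    ("2020-01-26", "Los Angeles"),
    ("2020-01-27", "Los Angeles"),
    ("2020-01-25", "Orange"),
    ("2020-01-28", "Los Angeles"),
]

def by_date(row):
    return row[0]  # ISO dates sort correctly as strings

def is_la(row):
    return row[1] == "Los Angeles"

# Plan as written in the notebook: sort all 5 rows, then filter.
sort_then_filter = [r for r in sorted(rows, key=by_date, reverse=True) if is_la(r)]

# Reordered plan: filter first, so only 3 rows are sorted.
la_only = [r for r in rows if is_la(r)]
filter_then_sort = sorted(la_only, key=by_date, reverse=True)

print(sort_then_filter == filter_then_sort, len(rows), len(la_only))
```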

## Now, Dive into Spark SQL

In [19]:
covid_df.createOrReplaceTempView("covid")

In [20]:
%sql

SELECT * 
FROM covid

-- keys = date, grouping = county, values = cases

date,county,state,fips,cases,deaths
2020-01-21,Snohomish,Washington,53061.0,1,0
2020-01-22,Snohomish,Washington,53061.0,1,0
2020-01-23,Snohomish,Washington,53061.0,1,0
2020-01-24,Cook,Illinois,17031.0,1,0
2020-01-24,Snohomish,Washington,53061.0,1,0
2020-01-25,Orange,California,6059.0,1,0
2020-01-25,Cook,Illinois,17031.0,1,0
2020-01-25,Snohomish,Washington,53061.0,1,0
2020-01-26,Maricopa,Arizona,4013.0,1,0
2020-01-26,Los Angeles,California,6037.0,1,0


In [21]:
%sql

SELECT * 
FROM covid 
WHERE county = "Los Angeles"

-- keys = date, grouping = county, values = cases, deaths

date,county,state,fips,cases,deaths
2020-01-26,Los Angeles,California,6037,1,0
2020-01-27,Los Angeles,California,6037,1,0
2020-01-28,Los Angeles,California,6037,1,0
2020-01-29,Los Angeles,California,6037,1,0
2020-01-30,Los Angeles,California,6037,1,0
2020-01-31,Los Angeles,California,6037,1,0
2020-02-01,Los Angeles,California,6037,1,0
2020-02-02,Los Angeles,California,6037,1,0
2020-02-03,Los Angeles,California,6037,1,0
2020-02-04,Los Angeles,California,6037,1,0


In [22]:
%sql

SELECT max(cases) AS max_cases, max(deaths) AS max_deaths, county 
FROM covid 
GROUP BY county 
ORDER BY max_cases DESC
LIMIT 10

max_cases,max_deaths,county
230964,23027,New York City
195614,4758,Los Angeles
124758,1724,Miami-Dade
121789,2153,Maricopa
107744,4902,Cook
79543,1367,Harris
58953,765,Broward
52131,722,Dallas
44997,716,Clark
43468,2044,Suffolk


### Get more data into the analysis

**This is census data taken from census.gov**
* It has enough information to construct a fips code column that corresponds to the NYT data

In [25]:
%sh wget https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv && cp co-est2019-alldata.csv /dbfs/tmp

In [26]:
census_df = spark.read.csv("dbfs:/tmp/co-est2019-alldata.csv", header=True, inferSchema=True)

# display() is a Databricks-only function. Like show(), it displays the data, but it also offers the visualization options we saw in the SQL section above
display(census_df)

SUMLEV,REGION,DIVISION,STATE,COUNTY,STNAME,CTYNAME,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,POPESTIMATE2016,POPESTIMATE2017,POPESTIMATE2018,POPESTIMATE2019,NPOPCHG_2010,NPOPCHG_2011,NPOPCHG_2012,NPOPCHG_2013,NPOPCHG_2014,NPOPCHG_2015,NPOPCHG_2016,NPOPCHG_2017,NPOPCHG_2018,NPOPCHG_2019,BIRTHS2010,BIRTHS2011,BIRTHS2012,BIRTHS2013,BIRTHS2014,BIRTHS2015,BIRTHS2016,BIRTHS2017,BIRTHS2018,BIRTHS2019,DEATHS2010,DEATHS2011,DEATHS2012,DEATHS2013,DEATHS2014,DEATHS2015,DEATHS2016,DEATHS2017,DEATHS2018,DEATHS2019,NATURALINC2010,NATURALINC2011,NATURALINC2012,NATURALINC2013,NATURALINC2014,NATURALINC2015,NATURALINC2016,NATURALINC2017,NATURALINC2018,NATURALINC2019,INTERNATIONALMIG2010,INTERNATIONALMIG2011,INTERNATIONALMIG2012,INTERNATIONALMIG2013,INTERNATIONALMIG2014,INTERNATIONALMIG2015,INTERNATIONALMIG2016,INTERNATIONALMIG2017,INTERNATIONALMIG2018,INTERNATIONALMIG2019,DOMESTICMIG2010,DOMESTICMIG2011,DOMESTICMIG2012,DOMESTICMIG2013,DOMESTICMIG2014,DOMESTICMIG2015,DOMESTICMIG2016,DOMESTICMIG2017,DOMESTICMIG2018,DOMESTICMIG2019,NETMIG2010,NETMIG2011,NETMIG2012,NETMIG2013,NETMIG2014,NETMIG2015,NETMIG2016,NETMIG2017,NETMIG2018,NETMIG2019,RESIDUAL2010,RESIDUAL2011,RESIDUAL2012,RESIDUAL2013,RESIDUAL2014,RESIDUAL2015,RESIDUAL2016,RESIDUAL2017,RESIDUAL2018,RESIDUAL2019,GQESTIMATESBASE2010,GQESTIMATES2010,GQESTIMATES2011,GQESTIMATES2012,GQESTIMATES2013,GQESTIMATES2014,GQESTIMATES2015,GQESTIMATES2016,GQESTIMATES2017,GQESTIMATES2018,GQESTIMATES2019,RBIRTH2011,RBIRTH2012,RBIRTH2013,RBIRTH2014,RBIRTH2015,RBIRTH2016,RBIRTH2017,RBIRTH2018,RBIRTH2019,RDEATH2011,RDEATH2012,RDEATH2013,RDEATH2014,RDEATH2015,RDEATH2016,RDEATH2017,RDEATH2018,RDEATH2019,RNATURALINC2011,RNATURALINC2012,RNATURALINC2013,RNATURALINC2014,RNATURALINC2015,RNATURALINC2016,RNATURALINC2017,RNATURALINC2018,RNATURALINC2019,RINTERNATIONALMIG2011,RINTERNATIONALMIG2012,RINTERNATIONALMIG2013,RINTERNATIONALMIG2014,RINTERNATIONALMIG2015,RINTERNATIONALMIG2016,RINTERNATIONALMIG2017,RINTERNATIONALMIG2018,RINTERNATIONALMIG2019,RDOMESTICMIG2011,RDOMESTICMIG2012,RDOMESTICMIG2013,RDOMESTICMIG2014,RDOMESTICMIG2015,RDOMESTICMIG2016,RDOMESTICMIG2017,RDOMESTICMIG2018,RDOMESTICMIG2019,RNETMIG2011,RNETMIG2012,RNETMIG2013,RNETMIG2014,RNETMIG2015,RNETMIG2016,RNETMIG2017,RNETMIG2018,RNETMIG2019
40,3,6,1,0,Alabama,Alabama,4779736,4780125,4785437,4799069,4815588,4830081,4841799,4852347,4863525,4874486,4887681,4903185,5312,13632,16519,14493,11718,10548,11178,10961,13195,15504,14226,59690,59067,57929,58903,59647,59389,58961,58271,57313,11075,48833,48366,50851,49712,51876,51710,53195,53665,53879,3151,10857,10701,7078,9191,7771,7679,5766,4606,3434,924,4665,5817,5046,3684,4580,5777,3011,3379,2772,1244,-1893,-114,2297,-959,-1544,-2157,2298,5279,9387,2168,2772,5703,7343,2725,3036,3620,5309,8658,12159,-7,3,115,72,-198,-259,-121,-114,-69,-89,116185,116246,115180,115793,116932,119032,119972,118619,117094,116576,116625,12.455519356,12.286865772,12.011401179,12.180258647,12.305777115,12.225150764,12.109454384,11.938128082,11.707442426,10.189987883,10.060889328,10.543799502,10.279697432,10.702541513,10.644438296,10.925228982,10.994485138,11.005972301,2.2655314734,2.2259764441,1.467601677,1.9005612146,1.6032356022,1.5807124672,1.1842254029,0.9436429432,0.7014701253,0.9734461014,1.2100275652,1.0462726847,0.7617960521,0.944900149,1.1891881655,0.6184014374,0.6922643302,0.5662420464,-0.395012534,-0.023713794,0.4762759328,-0.198306844,-0.318542758,-0.44401573,0.4719649629,1.0815221661,1.9175014754,0.5784335677,1.1863137707,1.5225486174,0.5634892079,0.6263573914,0.7451724354,1.0903664003,1.7737864964,2.4837435218
50,3,6,1,1,Alabama,Autauga County,54571,54597,54773,55227,54954,54727,54893,54864,55243,55390,55533,55869,176,454,-273,-227,166,-29,379,147,143,336,150,638,615,571,640,651,666,676,631,624,157,514,560,582,573,584,547,573,518,541,-7,124,55,-11,67,67,119,103,113,83,25,4,-14,12,7,13,-3,-12,-7,-16,147,327,-329,-226,101,-107,266,59,37,270,172,331,-343,-214,108,-94,263,47,30,254,11,-1,15,-2,-9,-2,-3,-3,0,-1,455,455,455,455,455,455,455,455,455,455,455,11.6,11.163449234,10.41201302,11.676701332,11.86256913,12.097323513,12.220585178,11.377261704,11.202671406,9.3454545455,10.165091985,10.612594706,10.454296661,10.641690279,9.9357897318,10.358572939,9.3398123022,9.7125724852,2.2545454545,0.9983572485,-0.200581687,1.2224046707,1.2208788506,2.1615337808,1.8620122387,2.0374494018,1.490098921,0.0727272727,-0.2541273,0.2188163857,0.1277139208,0.2368869412,-0.054492448,-0.216933465,-0.12621368,-0.287247985,5.9454545455,-5.971991541,-4.121041931,1.8427294289,-1.949761746,4.8316637453,1.0665895348,0.6671294502,4.8473097431,6.0181818182,-6.226118841,-3.902225545,1.9704433498,-1.712874805,4.777171297,0.8496560701,0.5409157704,4.5600617583
50,3,6,1,3,Alabama,Baldwin County,182265,182265,183112,186558,190145,194885,199183,202939,207601,212521,217855,223234,847,3446,3587,4740,4298,3756,4662,4920,5334,5379,516,2189,2093,2160,2212,2257,2300,2300,2310,2304,533,1829,1885,1900,1987,2098,2022,2099,2312,2326,-17,360,208,260,225,159,278,201,-2,-22,36,177,239,204,113,131,180,86,97,80,782,2899,3055,4176,3864,3433,4188,4619,5224,5297,818,3076,3294,4380,3977,3564,4368,4705,5321,5377,46,10,85,100,96,33,16,14,15,24,2307,2307,2263,2240,2296,2331,2337,2275,2193,2170,2170,11.842995104,11.112202451,11.219904942,11.226488829,11.225448993,11.204754713,10.94920047,10.734799338,10.446871266,9.8953120351,10.007884195,9.869360829,10.084553935,10.434644212,9.8504408827,9.9923355597,10.744093537,10.546624377,1.9476830687,1.104318256,1.3505441134,1.1419348945,0.7908047806,1.3543138306,0.9568649107,-0.009294199,-0.099753111,0.9576108421,1.26890415,1.059657689,0.5735050803,0.6515435614,0.8768938471,0.4094048872,0.4507686302,0.3627385856,15.684258934,16.219674385,21.691816222,19.610828588,17.074420201,20.402396843,21.988850858,24.276446642,24.017828601,16.641869776,17.488578535,22.751473911,20.184333668,17.725963762,21.27929069,22.398255745,24.727215272,24.380567187
50,3,6,1,5,Alabama,Barbour County,27457,27455,27327,27341,27169,26937,26755,26283,25806,25157,24872,24686,-128,14,-172,-232,-182,-472,-477,-649,-285,-186,71,331,300,282,264,271,276,280,263,256,131,323,286,295,308,332,280,295,329,312,-60,8,14,-13,-44,-61,-4,-15,-66,-56,-1,-5,-12,-10,4,13,17,12,12,13,-69,13,-176,-210,-142,-430,-492,-649,-231,-141,-70,8,-188,-220,-138,-417,-475,-637,-219,-128,2,-2,2,1,0,6,2,3,0,-2,3193,3193,3380,3390,3388,3352,3193,2975,2817,2813,2812,12.109460745,11.007154651,10.423982553,9.8338672428,10.219088201,10.59724702,10.988364107,10.513901937,10.331328948,11.816784956,10.493487433,10.904520756,11.472845117,12.519325766,10.75083031,11.57702647,13.152371624,12.591307155,0.2926757884,0.513667217,-0.480538203,-1.638977874,-2.300237566,-0.15358329,-0.588662363,-2.638469688,-2.259978207,-0.182922368,-0.440286186,-0.369644771,0.1489979885,0.4902145631,0.6527289831,0.4709298903,0.4797217614,0.5246377981,0.4755981561,-6.457530728,-7.762540199,-5.289428593,-16.2147894,-18.89074469,-25.46945823,-9.234643907,-5.690302272,0.2926757884,-6.897816914,-8.13218497,-5.140430604,-15.72457483,-18.2380157,-24.99852834,-8.754922145,-5.165664474
50,3,6,1,7,Alabama,Bibb County,22915,22915,22870,22745,22667,22521,22553,22566,22586,22550,22367,22394,-45,-125,-78,-146,32,13,20,-36,-183,27,44,264,246,258,253,251,276,291,232,240,32,276,236,275,247,265,241,252,263,252,12,-12,10,-17,6,-14,35,39,-31,-12,0,10,19,20,14,13,14,10,10,10,-59,-124,-105,-151,16,17,-30,-83,-164,31,-59,-114,-86,-131,30,30,-16,-73,-154,41,2,1,-2,2,-4,-3,1,-2,2,-2,2224,2224,2224,2228,2224,2245,2255,2204,2150,2146,2148,11.575139757,10.834140756,11.418960786,11.225983938,11.126133115,12.225372077,12.894363701,10.330164526,10.723621009,12.101282473,10.39372853,12.171372931,10.959755069,11.746714245,10.675053154,11.166253102,11.710488234,11.25980206,-0.526142716,0.4404122258,-0.752412145,0.2662288681,-0.62058113,1.5503189227,1.7281105991,-1.380323708,-0.53618105,0.4384522635,0.8367832291,0.8851907586,0.6212006922,0.5762539063,0.6201275691,0.4431052818,0.4452657123,0.4468175421,-5.436808068,-4.624328371,-6.683190227,0.7099436482,0.7535628006,-1.328844791,-3.677773839,-7.302357682,1.3851343804,-4.998355804,-3.787545142,-5.797999469,1.3311443404,1.3298167069,-0.708717222,-3.234668557,-6.85709197,1.8319519224
50,3,6,1,9,Alabama,Blount County,57322,57322,57376,57560,57580,57619,57526,57526,57494,57787,57771,57826,54,184,20,39,-93,0,-32,293,-16,55,184,744,712,647,619,716,700,660,679,651,131,569,587,583,587,633,650,721,688,657,53,175,125,64,32,83,50,-61,-9,-6,-2,-16,5,45,40,13,22,-1,5,6,9,28,-100,-65,-158,-90,-102,358,-9,59,7,12,-95,-20,-118,-77,-80,357,-4,65,-6,-3,-10,-5,-7,-6,-2,-3,-3,-4,489,489,489,489,489,489,489,489,489,489,489,12.946335352,12.367552545,11.232736395,10.751660949,12.44654591,12.171796209,11.450282354,11.751674484,11.263268078,9.901162386,10.196282786,10.121615639,10.195840028,11.003720057,11.302382194,12.508566026,11.90744042,11.367077,3.0451729658,2.1712697586,1.1111207563,0.5558209214,1.4428258527,0.869414015,-1.058283672,-0.155765936,-0.103808922,-0.278415814,0.0868507903,0.7812567817,0.6947761518,0.2259847721,0.3825421666,-0.017348913,0.086536631,0.1038089224,0.4872276745,-1.737015807,-1.128482018,-2.7443658,-1.564509961,-1.773604591,6.2109107312,-0.155765936,1.0207877367,0.2088118605,-1.650165017,-0.347225236,-2.049589648,-1.338525189,-1.391062424,6.1935618185,-0.069229305,1.1245966591
50,3,6,1,11,Alabama,Bullock County,10914,10911,10876,10675,10606,10549,10663,10400,10389,10176,10174,10101,-35,-201,-69,-57,114,-263,-11,-213,-2,-73,39,169,122,130,124,125,143,136,104,109,53,133,116,120,115,133,130,144,92,109,-14,36,6,10,9,-8,13,-8,12,0,1,17,7,18,6,0,8,8,13,-1,-24,-254,-81,-85,96,-259,-31,-213,-28,-72,-23,-237,-74,-67,102,-259,-23,-205,-15,-73,2,0,-1,0,3,4,-1,0,1,0,1690,1690,1690,1779,1717,1755,1660,1728,1660,1663,1663,15.683726973,11.465626615,12.290238714,11.69149538,11.869154441,13.757275482,13.226355458,10.221130221,10.75215783,12.342814719,10.901743339,11.344835736,10.842919102,12.628780326,12.506614075,14.004376368,9.0417690418,10.75215783,3.3409122547,0.5638832762,0.945402978,0.8485762776,-0.759625884,1.2506614075,-0.778020909,1.1793611794,0.0,1.5776530091,0.6578638222,1.7017253604,0.5657175184,0.0,0.7696377892,0.7780209093,1.2776412776,-0.09864365,-23.57199202,-7.612424228,-8.035925313,9.0514802942,-24.592888,-2.982346433,-20.71480671,-2.751842752,-7.102342787,-21.99433901,-6.954560406,-6.334199953,9.6171978126,-24.592888,-2.212708644,-19.9367858,-1.474201474,-7.200986436
50,3,6,1,13,Alabama,Butler County,20947,20940,20932,20866,20670,20356,20327,20162,20012,19888,19631,19448,-8,-66,-196,-314,-29,-165,-150,-124,-257,-183,65,274,240,241,251,239,230,244,216,213,66,264,274,262,287,276,253,269,274,272,-1,10,-34,-21,-36,-37,-23,-25,-58,-59,0,1,6,7,20,27,28,22,17,18,-4,-77,-170,-304,-11,-153,-154,-121,-216,-141,-4,-76,-164,-297,9,-126,-126,-99,-199,-123,-3,0,2,4,-2,-2,-1,0,0,-1,333,333,333,333,333,333,333,333,333,333,333,13.110675152,11.55624037,11.748647199,12.339306344,11.805675616,11.450191666,12.230576441,10.931450695,10.90099542,12.632183358,13.193374422,12.772388242,14.109087334,13.63333251,12.595210833,13.483709273,13.86674764,13.920519972,0.4784917939,-1.637134052,-1.023741042,-1.76978099,-1.827656894,-1.145019167,-1.253132832,-2.935296946,-3.019524553,0.0478491794,0.2889060092,0.3412470141,0.9832116609,1.3336955716,1.3939363768,1.1027568922,0.8603456565,0.9212108805,-3.684386813,-8.185670262,-14.81987033,-0.540766413,-7.557608239,-7.666650072,-6.065162907,-10.93145069,-7.216151897,-3.636537633,-7.896764253,-14.47862331,0.4424452474,-6.223912668,-6.272713695,-4.962406015,-10.07110504,-6.294941017
50,3,6,1,15,Alabama,Calhoun County,118572,118526,118408,117744,117190,116471,115917,115469,114973,114710,114331,113605,-118,-664,-554,-719,-554,-448,-496,-263,-379,-726,318,1385,1356,1309,1315,1388,1382,1324,1299,1269,311,1325,1359,1410,1395,1455,1475,1393,1616,1532,7,60,-3,-101,-80,-67,-93,-69,-317,-263,-4,26,67,45,66,66,102,69,103,14,-113,-752,-606,-659,-534,-438,-502,-259,-159,-475,-117,-726,-539,-614,-468,-372,-400,-190,-56,-461,-8,2,-12,-4,-6,-9,-3,-4,-6,-2,2933,2933,2882,2958,2814,2798,2775,2761,2743,2830,2833,11.729733392,11.543667583,11.204266009,11.317279722,11.997268633,11.994341309,11.528933356,11.342947333,11.134704478,11.221586097,11.569206671,12.06876629,12.005783431,12.576387508,12.80148584,12.129761454,14.111010692,13.442369788,0.508147295,-0.025539088,-0.86450028,-0.688503709,-0.579118875,-0.807144531,-0.600828098,-2.76806336,-2.30766531,0.2201971612,0.5703729558,0.3851733922,0.5680155602,0.5704753097,0.885255292,0.6008280979,0.8994022904,0.1228414994,-6.36877943,-5.15889569,-5.640650344,-4.59576226,-3.7858816,-4.356844672,-2.25528228,-1.38839771,-4.167836586,-6.148582269,-4.588522734,-5.255476952,-4.027746699,-3.215406291,-3.47158938,-1.654454183,-0.48899542,-4.044995086
50,3,6,1,17,Alabama,Chambers County,34215,34169,34122,34033,34104,34139,33977,33996,33745,33707,33600,33254,-47,-89,71,35,-162,19,-251,-38,-107,-346,83,401,394,405,425,421,391,378,359,354,80,442,476,454,454,445,444,485,414,441,3,-41,-82,-49,-29,-24,-53,-107,-55,-87,6,27,31,29,29,14,13,7,6,6,-54,-74,122,57,-158,31,-209,65,-57,-265,-48,-47,153,86,-129,45,-196,72,-51,-259,-2,-1,0,-2,-4,-2,-2,-3,-1,0,458,458,458,458,458,458,458,458,458,458,458,11.767295136,11.564935351,11.869349237,12.478712784,12.387271417,11.543968941,11.207970112,10.667538295,10.590241422,12.970435038,13.97185083,13.30539396,13.330201421,13.093434158,13.108752454,14.380596572,12.301840819,13.192927873,-1.203139902,-2.406915479,-1.436044723,-0.851488637,-0.706162741,-1.564783514,-3.17262646,-1.634302524,-2.602686451,0.7923116426,0.9099314616,0.8499040195,0.851488637,0.4119282656,0.3838148241,0.2075550021,0.1782875481,0.1794956173,-2.171520798,3.5810205909,1.6705010038,-4.639144988,0.9121268739,-6.170561403,1.9272964478,-1.693731707,-7.927723098,-1.379209156,4.4909520525,2.5204050232,-3.787656351,1.3240551395,-5.786746579,2.1348514499,-1.515444159,-7.748227481


Let's tweak the DataFrame above to have a fips column that matches the NYT data. Here's the documentation on [user-defined functions (UDFs)](https://docs.databricks.com/spark/latest/spark-sql/udf-python.html).

In [28]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def make_fips(state_code, county_code):
  # Left-pad the county code with zeros to three digits, then prepend the state code
  if len(str(county_code)) == 1:
    return str(state_code) + "00" + str(county_code)
  elif len(str(county_code)) == 2:
    return str(state_code) + "0" + str(county_code)
  else:
    return str(state_code) + str(county_code)

make_fips_udf = udf(make_fips, StringType())
  
census_df = census_df.withColumn("fips", make_fips_udf(census_df.STATE, census_df.COUNTY))
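
The padding logic can be checked in plain Python without a cluster. The sketch below is an equivalent formulation using `str.zfill`, assuming the county code is a non-negative integer. Note that the state code is left unpadded: the NYT `fips` values shown earlier (for example `6037` for Los Angeles) were read as numbers, so their leading zero is already gone. The `examples` dictionary is a hypothetical helper for illustration.

```python
def make_fips(state_code, county_code):
    """Concatenate the state code with the county code left-padded to three digits."""
    return str(state_code) + str(county_code).zfill(3)

# (state code, county code) pairs taken from rows shown in this notebook
examples = {
    "Autauga, Alabama": make_fips(1, 1),
    "Los Angeles, California": make_fips(6, 37),
    "Snohomish, Washington": make_fips(53, 61),
}

for name, fips in examples.items():
    print(name, fips)
```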

Now that both the census and the COVID data have a matching `fips` column, let's join the two DataFrames.

In [30]:
covid_with_census = (covid_df
                     .na.drop(subset=["fips"])
                     .join(census_df.drop("COUNTY", "STATE"), on=['fips'], how='inner'))
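
For readers new to joins, here is a plain-Python sketch of what this cell asks for, under the assumption that `fips` is compared as a string. Rows with no `fips` are dropped, and only rows with a match on both sides survive the inner join. The COVID rows below are hypothetical; the population figures are the `POPESTIMATE2019` values from the census output above. The function name `inner_join_on_fips` is also hypothetical.

```python
covid_rows = [
    {"county": "Autauga", "fips": "1001", "cases": 5},      # hypothetical case counts
    {"county": "Baldwin", "fips": "1003", "cases": 7},
    {"county": "Unknown", "fips": None, "cases": 2},         # no fips: dropped before the join
    {"county": "Elsewhere", "fips": "99999", "cases": 3},    # no census match: dropped by the join
]
census_rows = [
    {"fips": "1001", "POPESTIMATE2019": 55869},
    {"fips": "1003", "POPESTIMATE2019": 223234},
    {"fips": "1005", "POPESTIMATE2019": 24686},              # no COVID match: dropped by the join
]

def inner_join_on_fips(left, right):
    """Keep only left rows that have a fips and a matching right row, merging their columns."""
    right_by_fips = {row["fips"]: row for row in right}
    joined = []
    for row in left:
        if row["fips"] is None:  # mirrors .na.drop(subset=["fips"])
            continue
        match = right_by_fips.get(row["fips"])
        if match is not None:
            joined.append({**row, **match})
    return joined

joined = inner_join_on_fips(covid_rows, census_rows)
print(joined)
```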

What do the cases look like for the most populous counties?

In [32]:
display(covid_with_census.filter("POPESTIMATE2019 > 2000000").select("county", "cases", "date"))

# keys = date, grouping = county, values = cases

county,cases,date
Cook,1,2020-01-24
Orange,1,2020-01-25
Cook,1,2020-01-25
Maricopa,1,2020-01-26
Los Angeles,1,2020-01-26
Orange,1,2020-01-26
Cook,1,2020-01-26
Maricopa,1,2020-01-27
Los Angeles,1,2020-01-27
Orange,1,2020-01-27


Since the NYT dataset has a new row for every day, with cases increasing each day, let's grab only the most recent numbers for each county.
* Below we're using the `col` function to refer to columns. It's equivalent to something like `df["column_name"]`
* To get the most recent row per county, we'll use a [window function](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=window#pyspark.sql.Window)

In [34]:
from pyspark.sql.functions import row_number, col
from pyspark.sql import Window

w = Window.partitionBy("fips").orderBy(col("date").desc())
current_covid_rates = (covid_with_census
                       .withColumn("row_num", row_number().over(w))
                       .filter(col("row_num") == 1)
                       .drop("row_num"))
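
Conceptually, partitioning by `fips`, ordering by `date` descending, and keeping `row_num == 1` selects one row per county: the one with the latest date. A plain-Python sketch of that result (not of how Spark executes it) might look like this. The function name `latest_per_key` is hypothetical, and dates are assumed to be ISO strings so that string comparison matches date order.

```python
rows = [
    {"fips": "6037", "date": "2020-01-26", "cases": 1},
    {"fips": "6037", "date": "2020-01-27", "cases": 1},
    {"fips": "17031", "date": "2020-01-24", "cases": 1},
    {"fips": "17031", "date": "2020-01-25", "cases": 1},
]

def latest_per_key(rows, key="fips", order_by="date"):
    """Return, for each distinct key, the row with the largest order_by value."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[order_by] > current[order_by]:
            latest[row[key]] = row
    return list(latest.values())

result = latest_per_key(rows)
print(result)
```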

Which counties are hardest hit when cases are scaled by population?

In [36]:
current_covid_rates = (current_covid_rates
                       .withColumn("case_rates_percent", 100*(col("cases")/col("POPESTIMATE2019")))
                       .sort(col("case_rates_percent").desc()))

#Look at the top 10 counties
display(current_covid_rates.select("county", "state", "cases", "POPESTIMATE2019", "case_rates_percent").limit(10))

county,state,cases,POPESTIMATE2019,case_rates_percent
Trousdale,Tennessee,1576,11284,13.966678482807517
Lake,Tennessee,756,7016,10.775370581527936
Lee,Arkansas,890,8857,10.048549170147906
Dakota,Nebraska,1904,20026,9.507640067911714
Lincoln,Arkansas,1195,13024,9.17536855036855
Buena Vista,Iowa,1786,19620,9.10295616717635
Nobles,Minnesota,1749,21629,8.08636552776365
Bristol Bay Borough,Alaska,66,836,7.894736842105263
East Carroll,Louisiana,510,6861,7.433318758198514
Colfax,Nebraska,695,10709,6.489868335045289
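
As a quick sanity check on the `case_rates_percent` column, the same arithmetic can be reproduced in plain Python from the `cases` and `POPESTIMATE2019` values in the table above. The function name `case_rate_percent` is hypothetical.

```python
def case_rate_percent(cases, population):
    """Cases as a percentage of the population, matching 100 * (cases / POPESTIMATE2019)."""
    return 100 * (cases / population)

# Values taken from the first and eighth rows of the output above
trousdale = case_rate_percent(1576, 11284)
bristol_bay = case_rate_percent(66, 836)

print(round(trousdale, 2), round(bristol_bay, 2))
```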
