## Welcome to Spark!

Spark is an extremely powerful cluster computing framework used for data workloads that require more resources than a single node can provide. There is a wealth of information available online about Spark, as it is one of the most in-demand data engineering tools today. In general, Spark is the de facto tool of choice for working with large data sets, streaming data, or high-concurrency workloads.

In this lab, you'll read in some datasets and do a few basic tasks. The beauty of Spark is that you can write your code on a local machine, then deploy it to a massive cluster and scale your workload seamlessly.

For now, you'll use Databricks to run the lab. Databricks is a company that provides a managed Spark solution. Although you can run Spark anywhere, using Databricks lets you jump right into learning Spark without spending any time configuring the environment.

You may notice that parts of Spark feel a bit like pandas or SQL. This is because much of Spark's feel has been inherited from those tools. In general, Spark is a little more verbose than pandas, but in return for that verbosity you get the ability to scale far beyond a single machine.

If you'd like to read up in great depth, this [e-book](https://runawayhorse001.github.io/LearningApacheSpark/pyspark.pdf) is excellent.

We'll use a popular Kaggle dataset for this lab. Head over [here](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data) to download the data, and then let's get started!

In [1]:
#read in the data set

housing_data = spark.read.csv('FileStore/tables/AB_NYC_2019.csv',
                    inferSchema=True, header="true")

NameError: name 'spark' is not defined
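
The NameError above is what this cell produces outside Databricks, where no spark session is predefined. Below is a minimal sketch for that case, assuming pyspark is installed locally. The helper name get_spark and the app name are made up for illustration.

```python
def get_spark(app_name="airbnb-lab"):
    """Return the active SparkSession, creating a local one if none exists."""
    # Imported lazily so the helper can be defined even where pyspark is missing.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")   # assumption: run on all local cores
        .appName(app_name)    # hypothetical app name
        .getOrCreate()
    )

# Usage outside Databricks (not needed inside a Databricks notebook):
# spark = get_spark()
```

In a Databricks notebook you can skip this, since spark already exists there.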

Datatypes can be a little tricky in Spark. Keep this in mind when operating on tables. Let's give the dtypes a quick check below.

In [None]:
housing_data.dtypes

As you can see above, just about everything was read in as a string. This is somewhat akin to things being read in as 'object' in pandas. For now, we'll leave this as is, but keep it in mind. If you ever get results that don't make sense, do a quick check of the dtypes you're working with.
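
Since .dtypes returns a plain list of (column name, type name) pairs, a quick check like the one described above can be done with ordinary Python. This is a small sketch: the helper name string_columns and the toy list are made up for illustration, and in the lab you would pass housing_data.dtypes instead.

```python
def string_columns(dtypes):
    """Return the names of columns whose Spark type is 'string'."""
    return [name for name, dtype in dtypes if dtype == "string"]

# Toy stand-in for housing_data.dtypes (hypothetical values).
sample_dtypes = [("id", "string"), ("price", "string"), ("latitude", "double")]
print(string_columns(sample_dtypes))  # prints ['id', 'price']
```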

In [None]:
#show the first few rows

housing_data.show(5)

In [None]:
#check how many rows there are in this dataset

housing_data.count()

In [None]:
#display summary stats for the data

housing_data.describe().show()

You may notice the default view for Spark dataframes is not as clean as pandas. For summary stats and other small dataframes, there's a nifty method you can use to display things in a cleaner way. .toPandas() will be your friend.

Careful though: .toPandas() pulls the entire dataframe onto a single machine, so only use it when you need a cleaner display of something small. It's best to not mix up Spark and pandas too much :)
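
One way to stay on the safe side is to cap the number of rows before converting. This is only a sketch: the helper name preview_as_pandas and the default of 20 rows are assumptions, not part of the lab.

```python
def preview_as_pandas(df, max_rows=20):
    """Convert at most max_rows rows of a Spark dataframe to pandas."""
    # limit() is applied by Spark first, so only a small result is collected.
    return df.limit(max_rows).toPandas()

# Usage in the lab (assumes housing_data from the cells above):
# preview_as_pandas(housing_data, 5)
```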

In [None]:
#display summary stats using .toPandas()

housing_data.describe().toPandas()

Unnamed: 0,summary,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,count,49079,49047,48894,48873,48894,48894,48894,48894,48894,48894,48894,48874.0,38845.0,38864,48892.0,48737.0
1,mean,1.9017143236179568E7,1.02037532075E8,6.749591589946438E7,4.357369725E7,1.960586395E8,40.49195828025478,40.36302551723428,437.1107369000375,148.10106579268293,152.22296299343384,7.1286126280910596,23.25829574459354,2.6292321379310346,1.3743823665654686,7.655045918471702,112.59808769518024
2,stddev,1.0983108385610068E7,8.709090084371349E7,7.855358174017523E7,7.931788496149102E7,1897565.595041315,3.0923424874796805,6.467103784584827,112820.32836636381,507.0239464524172,238.54148640283228,20.828534365347032,44.55795124559937,8.964786212322723,1.694376217315016,34.82254748680071,131.60972881440694
3,min,"12 mins Manhattan""",1 Bed Apt in Utopic Williamsburg,"Heart of Greenwich Village""","very clean studio app""",194716858,2,-73.72247,-73.71299,-73.90783,-73.99986,0,0.0,-73.94134,0,0.0,0.0
4,max,"獨一無二的紐約閣樓""","ﾏﾝﾊｯﾀﾝ､駅から徒歩4分でどこに行くのにも便利な場所!女性の方希望,ｷﾚｲなお部屋｡",呈刚,현선,Woodside,Woodside,West Village,Shared room,Shared room,Private room,Private room,99.0,9.66,Private room,99.0,365.0


In a Databricks notebook, you can also call display() to format tables nicely.

In [None]:
#use display() to show summary stats

display(housing_data.describe())

summary,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
count,49079,49047,48894,48873,48894,48894,48894,48894,48894,48894,48894,48874.0,38845.0,38864,48892.0,48737.0
mean,1.9017143236179568E7,1.02037532075E8,6.749591589946438E7,4.357369725E7,1.960586395E8,40.49195828025478,40.36302551723428,437.1107369000375,148.10106579268293,152.22296299343384,7.1286126280910596,23.25829574459354,2.6292321379310346,1.3743823665654686,7.655045918471702,112.59808769518024
stddev,1.0983108385610068E7,8.709090084371349E7,7.855358174017523E7,7.931788496149102E7,1897565.595041315,3.0923424874796805,6.467103784584827,112820.32836636381,507.0239464524172,238.54148640283228,20.828534365347032,44.55795124559937,8.964786212322723,1.694376217315016,34.82254748680071,131.60972881440694
min,"12 mins Manhattan""",1 Bed Apt in Utopic Williamsburg,"Heart of Greenwich Village""","very clean studio app""",194716858,2,-73.72247,-73.71299,-73.90783,-73.99986,0,0.0,-73.94134,0,0.0,0.0
max,"獨一無二的紐約閣樓""","ﾏﾝﾊｯﾀﾝ､駅から徒歩4分でどこに行くのにも便利な場所!女性の方希望,ｷﾚｲなお部屋｡",呈刚,현선,Woodside,Woodside,West Village,Shared room,Shared room,Private room,Private room,99.0,9.66,Private room,99.0,365.0


In [None]:
#filter the df to find only listings in Brooklyn using Spark's .filter() method

housing_data.filter(housing_data['neighbourhood_group'] == 'Brooklyn').show(5)

Another really cool feature of Spark is that you can query data as if it were a table in a SQL database. This makes finding and selecting data super easy!

In [None]:
#create a temp view and return only apartments in Brooklyn using SQL

housing_data.createOrReplaceTempView('sql_view')

brooklyn = spark.sql("SELECT * FROM sql_view WHERE neighbourhood_group='Brooklyn'")
brooklyn.show(5)

If you've done the above correctly, both query methods will return the same data!
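
If you'd rather check this than eyeball it, one option is to compare the two results row by row. This is a sketch rather than part of the lab: same_rows and brooklyn_filtered are hypothetical names, and it relies on exceptAll, which keeps duplicate rows when subtracting one dataframe from another.

```python
def same_rows(df_a, df_b):
    """Return True if both dataframes hold the same rows, duplicates included."""
    # Rows left over in either direction mean the results differ.
    only_in_a = df_a.exceptAll(df_b).count()
    only_in_b = df_b.exceptAll(df_a).count()
    return only_in_a == 0 and only_in_b == 0

# Usage in the lab (assumes the .filter() result was stored first):
# brooklyn_filtered = housing_data.filter(housing_data['neighbourhood_group'] == 'Brooklyn')
# same_rows(brooklyn_filtered, brooklyn)
```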

In [None]:
#select just the host name column

housing_data.select('host_name').show()

In [None]:
#find the most common host names using .groupby() and .count()

housing_data.groupby('host_name').count().orderBy('count', ascending=False).show()

In [None]:
#find the most common host names in Manhattan using SQL

manhattan = spark.sql("SELECT host_name, neighbourhood_group, COUNT(host_name) \
                     FROM sql_view \
                     WHERE neighbourhood_group = 'Manhattan' \
                     GROUP BY host_name, neighbourhood_group \
                     ORDER BY count(host_name) DESC \
                     LIMIT 10")
manhattan.show(10)

In [None]:
#find the most common host names in Queens

queens = spark.sql("SELECT host_name, neighbourhood_group, COUNT(host_name) \
                     FROM sql_view \
                     WHERE neighbourhood_group = 'Queens' \
                     GROUP BY host_name, neighbourhood_group \
                     ORDER BY count(host_name) DESC \
                     LIMIT 10")
queens.show(10)

In [None]:
#find average availability by borough (remember there are 5 boroughs in NYC)

averages = spark.sql("SELECT neighbourhood_group, AVG(availability_365) as average_availability \
                     FROM sql_view \
                     GROUP BY neighbourhood_group \
                     ORDER BY average_availability DESC")
averages.show(5)

Let's see if there's a pattern between average availability across boroughs and average prices. You might think prices are highest where availability is lowest.

In [None]:
#find average availability and price by borough


averages_and_price = spark.sql("SELECT neighbourhood_group, AVG(price) as average_price, AVG(availability_365) as average_availability \
                     FROM sql_view \
                     GROUP BY neighbourhood_group \
                     ORDER BY average_price DESC")
averages_and_price.show(5)

It seems the price/availability relationship doesn't hold in the averages, but let's take a closer look using a scatterplot. Databricks has built-in plotting functionality, which you're free to use. That said, you may find it's easier to use matplotlib, seaborn, or a similar library.

Two hints for the plotting below:

- You'll need to make sure your columns are some sort of number dtype.
- Matplotlib/seaborn won't like Spark dataframes directly. You can use toPandas() to solve this problem.
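
The first hint can be sketched as a small helper that casts several columns in one go. The name cast_columns is made up for illustration, and whether availability_365 also needs casting depends on what your dtypes check shows.

```python
def cast_columns(df, type_by_column):
    """Return df with each listed column cast to the given Spark type name."""
    for name, spark_type in type_by_column.items():
        # withColumn with an existing name replaces that column.
        df = df.withColumn(name, df[name].cast(spark_type))
    return df

# Usage in the lab (assumes both columns were read in as strings):
# housing_data = cast_columns(housing_data, {"price": "int", "availability_365": "int"})
```

Values that can't be parsed as the target type become null after the cast, so it's worth checking the dtypes and a few rows afterwards.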

In [None]:
#check the dtypes

housing_data.dtypes

In [None]:
#change price to a "number like" dtype

housing_data = housing_data.withColumn("price", housing_data.price.cast("int"))

In [None]:
#check dtypes again to make sure 

housing_data.dtypes

In [None]:
#scatterplot the relationship between price and availability

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#convert only the two plotted columns to pandas, and do it once
plot_data = housing_data.select('availability_365', 'price').toPandas()

plt.figure(figsize=(10,8), dpi=80)
sns.scatterplot(data=plot_data, x='availability_365', y='price', s=10)
plt.ylim(0, 500)
plt.show()

So there you have it. We've read in some data, done some basic selecting and aggregating, and even tied in the visualization tools we learned about earlier. It would be a great idea to continue reading about the Spark API and thinking of use cases for it.