# ADS2 - Tutorial 2 - PySpark Basics
Learning Outcomes:
*   Importing data into Spark dataframes from .csv files
*   Exploring and manipulating data tables with Spark SQL
*   Writing data to a file

**Methods and Functions:**


```
spark
    .read
    .sql

dataframe
    .show()
    .printSchema()
```



To begin: Colab doesn't come with PySpark available by default, so you will need to run the following blocks of code to install it.


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Apache Spark uses Java, so first we must install that
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

In [None]:
# Unpack Spark from google drive
!tar xzf /content/drive/MyDrive/spark-3.3.0-bin-hadoop3.tgz

In [None]:
# Set up environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.3.0-bin-hadoop3"

In [None]:
# Install findspark, which helps Python locate the pyspark module files
!pip install -q findspark
import findspark
findspark.init()

In [None]:
# Finally, we initialise a "SparkSession", which handles the computations
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

# Exercise 1

For this tutorial, you will explore a dataset of house price data from California. The .csv for this data is available on Canvas, as well as from [Kaggle](https://www.kaggle.com/camnugent/california-housing-prices). It contains the following columns:

 * longitude
 * latitude
 * housing_median_age
 * total_rooms
 * total_bedrooms
 * population
 * households
 * median_income
 * median_house_value
 * ocean_proximity

Download the data, then upload it in the Files panel on the left of the Colab window. Once it has uploaded, you can copy the path to the file by right-clicking it.

Begin by loading the dataset into a Spark DataFrame. Reader options are set with the `.read.option(key, value)` method. A full list of the options for .csv files can be found here: [CSV Files](https://spark.apache.org/docs/latest/sql-data-sources-csv.html). Set the separator option to a comma, and the header option to True.

Finally, you need to tell the Reader where the .csv file is.

When the file is loaded, show the first 5 rows of the data and print the schema.

In [None]:
### Load the California Housing Prices Dataset
# .read, .option, .csv


### Show the first 5 rows of data and print the schema
# .show, .printSchema


The schema is the blueprint of a DataFrame: it specifies each column's name, its data type, whether the field is nullable, and any extra metadata. In PySpark, a schema is a StructType made up of StructFields. You should have found that the DataFrame you loaded has only strings as the data types. This isn't useful for numerical data. Fortunately, there are a number of ways to set the schema of a DataFrame when you load it.

The first, and simplest, way is to set the `inferSchema` option to `True` in the read call. Reload the DataFrame with this option set, and print the schema.

In [None]:
### Load the DataFrame again, this time with the inferSchema option enabled


For very large tables, inferring the schema can be computationally costly, as PySpark must run an additional pass over the dataset. Instead, you can set a predefined schema. One way to do this is by defining a StructType with a list of individual StructFields, one for each column.

```
schema = StructType([StructField_1, StructField_2, ...])
```

The first StructField is provided below; complete the list for all the columns in the dataset.

Reload the DataFrame, this time replacing the inferSchema `.option()` call with `.schema(userDefinedSchema)`. Print the new schema and check that it is correct.

In [None]:
### Load the DataFrame using a schema defined with StructType and StructField
from pyspark.sql.types import DoubleType, StringType, StructType, StructField

# Complete the schema

### Reload the DataFrame with the new schema, then printSchema to check it
# .schema
# Load .csv with header, user defined schema and ',' separators
# Show the first 5 rows of data and print the schema


Finally, you can define the schema with a DDL (Data Definition Language) string. In this case, the string defines each column name and data type pair, and can be fed into the same `.schema()` method as before. Try this now, and print the schema.

In [None]:
### Load the DataFrame using a DDL string formatted schema
DDLSchema = "longitude double, latitude double, "

### Reload the DataFrame with the new schema, then printSchema to check it


# Exercise 2

To save data stored in a DataFrame, use its `.write` interface. You can save your data in a number of formats with PySpark. In addition to saving the data as a new .csv, popular formats include [Parquet files](https://parquet.apache.org/documentation/latest/) and JSON files.

Save your DataFrame as a .csv, .parquet, and .json file. For the parquet file, set the `.option()` `'compression', 'snappy'`. For the csv file, set the options `'header', 'True'` and `'delimiter', ','`.

In [None]:
### Save the DataFrame as a .csv, .parquet and .JSON
# .write, .option, .csv, .parquet, .json


An alternative way to save the DataFrame is to specify the `.format()` of the file and use the `.save()` method. Repeat the previous operations, but this time use the `.format('string')` method, the `.mode('overwrite')` method, and the `.save('/path/to/file')` method.

In [None]:
### Use the .save() method to save the DataFrame as a .csv, .parquet and .JSON
# .write, .format, .option, .mode, .save


# Exercise 3

DataFrames can be manipulated using the built-in SQL API. Its methods can be used to select columns from the DataFrame, apply filters and masks, sort or group data, and much more. In this exercise, you will need the following methods:



```
.select() # one or more column names
.where() # boolean expression
.groupBy() # column name
.count()
.orderBy() # column name, ascending=True/False
```

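These operations are transformations: they aren't evaluated immediately, and each returns a new DataFrame. Appending an action such as `.show()` triggers the calculation and displays the result. Note that `.count()` called directly on a DataFrame is itself an action and returns an integer.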

In [None]:
### Example: Select the median income and house value columns, sort by income
# .select, .orderBy/.sort



In [None]:
### Select the median house age and house value columns and order by total
### bedrooms in descending order
# .select, .orderBy/.sort


In [None]:
### Select median income and house value, where ocean proximity is 'NEAR BAY'
# .select, .filter/.where


In [None]:
### Count the number of entries where population > 500
# .filter/.where, .count


In [None]:
### Group by ocean proximity and count the number of entries in each category
# .groupBy, .count


# Exercise 4

The `Column` class is another way to access and manipulate the data within the DataFrame. You can use Columns to form complex expressions, such as:
```
col('total_bedrooms') / col('total_rooms')
col('median_house_value').desc()
(col('median_income')*1000).cast('int')
```
For the following tasks, use Column objects in the DataFrame transformations. Create a new DataFrame for each task, and show the contents.

In [None]:
### Example: Create a DataFrame with only rows where population > 500, include
### a column with the number of bedrooms / total number of rooms, and sort by
### descending house value
# col, .filter/.where, .withColumn, .orderBy/.sort, desc
from pyspark.sql.functions import col

# Column expression to calculate ratio of bedrooms and rooms

# Filter by population, add new column, sort DF


In [None]:
### Select the population and median house value where
### the median house age is < 20, store the result as a new DataFrame
# col, .select, .where

In [None]:
### Create a new DataFrame where the ocean proximity column has been dropped.
# col, .drop

In [None]:
### Create a DataFrame which includes a new column for population per household,
### sort by that column, and rename 'ocean_proximity' to 'location'
# col, .withColumn, .withColumnRenamed, .sort/.orderBy

In [None]:
### Create a DataFrame with the null values in total_bedrooms removed
# col, .isNotNull, .where/.filter

In [None]:
### Using the collect method to return the listed values of a column, create a
### scatter plot in matplotlib with 'longitude' vs 'latitude', coloured by the
### 'median_house_value'. Include appropriate axis labels, and a labelled
### colour bar.
# .select, .collect, .scatter, .colorbar
import matplotlib.pyplot as plt

longitude = usersDF.select(col('longitude')).collect()
latitude = []
house_value = []