# ADS2 - Tutorial 3 - Aggregations

Learning Outcomes:

1.   Use Aggregation functions to explore the properties of a DataFrame
2.   Use GroupedData to perform multiple aggregations at once, over specific subsets of data




In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Apache Spark uses Java, so first we must install that
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Unpack Spark from google drive
!tar xzf /content/drive/MyDrive/spark-3.3.0-bin-hadoop3.tgz

# Set up environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.3.0-bin-hadoop3"

# Install findspark, which helps Python locate the pyspark module files
!pip install -q findspark
import findspark
findspark.init()

In [None]:
# Finally, we initialise a "SparkSession", which handles the computations
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

from pyspark.sql import functions as F

# Exercise 1

Upload the `all-weeks-countries.csv` file from the Canvas page and read it into a DataFrame. The dataset is described [here](https://www.kaggle.com/datasets/dhruvildave/netflix-top-10-tv-shows-and-films).

In [None]:
### Read in the .csv data, ensure the schema is appropriate

CsvPath = '/content/all-weeks-countries.csv'

# Load .csv with header, ',' separators and inferred schema
NetflixDF =

# Print Schema to check


In [None]:
### Split the dataframe into two separate tables, one for films and one for TV
### Display the two tables
# .filter



# Exercise 2

Aggregate [functions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions) can be accessed through `pyspark.sql.functions`, which has been imported as `F` for ease of use. To perform a simple aggregation, call the function on a column name and pass the result to the `.select` method.

In [None]:
### EXAMPLE: Find the TV show with the most weeks in the top 10

# Find max weeks in top 10, select that column
tvDF.select(F.max('cumulative_weeks_in_top_10')).show()

# To access the number in the DataFrame, use .first()[0]
tvDF.select(F.max('cumulative_weeks_in_top_10')).first()[0]


In [None]:
### Find the mean number of weeks in the top 10 for all films in the dataset
### Then use the .filter method to find the mean rating for
### films in the UK and Germany
# .select, .mean, .filter



In [None]:
### Use the .count_distinct aggregate function to find the number of unique
### films, TV shows and TV seasons in the dataset



In [None]:
### You aren't limited to selecting a single aggregate column
### Using the .count_distinct function, find the
### number of unique TV shows and the number of unique seasons


In [None]:
### Use the .collect_set function to get a list of all the
### unique films in the dataset, in alphabetical order
# .select, .sort_array, .collect_set


# Exercise 3

The `.groupBy()` method produces a GroupedData object, which can in turn be used to perform aggregations. You can even group over multiple columns.

In [None]:
### EXAMPLE: Find the mean and max weeks in top 10 per country

# group by country, feed aggregations into .agg, use .alias to rename new columns
tvDF.groupBy('country_name')\
       .agg(F.mean('cumulative_weeks_in_top_10').alias('mean_weeks_in_top_10'),
            F.max('cumulative_weeks_in_top_10').alias('max_weeks_in_top_10'))\
       .show()

In [None]:
### Use the "where" method to create a new dataframe containing the data for
### the show Stranger Things in the United Kingdom. Call this dataframe STDF.
# .where()


### Using the "groupBy" method and the "F.count_distinct" function, find the total number of weeks
### Stranger Things spent in the top 10 of the UK, across all seasons. Show the
### result.
# .groupBy(), .agg(), F.count_distinct()



In [None]:
### Produce a dataframe containing the top 25 seasons by number of weeks in the
### top 10 of the United Kingdom, sorted by number of weeks.
# .where(), .groupBy(), .max(), .sort(), .limit()



In [None]:
# The column below finds the number of weeks a show spent at the number 1 spot
# in the Top 10.

# count_distinct ignores nulls, so only weeks where the rank was 1 are counted
weeks_at_1 = F.count_distinct(F.when(F.col('weekly_rank') == 1, F.col('week')))\
                    .alias('weeks_at_1')

### Group by country name and show title, and use the .agg method and the new
### column to find the number of weeks each film spent in the top spot for each
### country.
# .groupBy(), .agg(), .sort()

### Produce a dataframe grouped by country name that contains the show title and
### number of weeks at the number 1 spot of the top performing film in each
### country.
# .groupBy(), .agg(), F.first()

