# Sort, Count, Group By and Aggregate
In this notebook we will go over how sorting, counting, grouping and aggregating work in PySpark.



We will start by importing the pyspark machinery and a helper function that creates the table we will be using.

In [1]:
# Spark related machinery
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session if there is one, otherwise create it (with Hive support)
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

In [2]:
from pyspark_functions import create_sp_table3

In the block of code below we create the table ```data``` and show its first 5 rows. This table has 20 rows and 3 columns: **id**, **team**, and **score**. The values that each column can take are listed below.

* id: Numbers between 1 and 5
* team: Numbers between 1 and 3
* score: Numbers between 1 and 100

In [3]:
data = create_sp_table3()
data.show(5)

+---+----+-----+
| id|team|score|
+---+----+-----+
|  1|   3|   53|
|  3|   2|   48|
|  3|   3|   98|
|  1|   2|   65|
|  3|   2|   23|
+---+----+-----+
only showing top 5 rows



# Sort
You can sort a table by the values of one of its columns using the ```sort``` method. In the block of code below we create a new table called ```sorted_scores```, which contains the rows of ```data``` sorted by the values of the column **score**.

In [4]:
sorted_scores = data.sort("score")
sorted_scores.show()

+---+----+-----+
| id|team|score|
+---+----+-----+
|  2|   1|    5|
|  5|   1|    7|
|  1|   1|    8|
|  3|   1|   10|
|  2|   1|   12|
|  5|   2|   17|
|  3|   2|   23|
|  3|   3|   26|
|  2|   3|   27|
|  5|   3|   29|
|  5|   2|   43|
|  3|   2|   48|
|  1|   3|   53|
|  2|   3|   54|
|  1|   2|   65|
|  2|   2|   67|
|  4|   3|   70|
|  1|   3|   79|
|  5|   1|   80|
|  3|   3|   98|
+---+----+-----+



As you can see, the sorting took place in ascending order. We can also sort it in descending order by setting ```ascending=False```. See below:

In [5]:
sorted_scores_desc = data.sort("score", ascending=False)
sorted_scores_desc.show()

+---+----+-----+
| id|team|score|
+---+----+-----+
|  3|   3|   98|
|  5|   1|   80|
|  1|   3|   79|
|  4|   3|   70|
|  2|   2|   67|
|  1|   2|   65|
|  2|   3|   54|
|  1|   3|   53|
|  3|   2|   48|
|  5|   2|   43|
|  5|   3|   29|
|  2|   3|   27|
|  3|   3|   26|
|  3|   2|   23|
|  5|   2|   17|
|  2|   1|   12|
|  3|   1|   10|
|  1|   1|    8|
|  5|   1|    7|
|  2|   1|    5|
+---+----+-----+



# Counting

Counting the number of times a value shows up in a column can be very useful. We can find out how many times each value in a column shows up using the methods ```groupby``` and ```count```. As an example, in the block of code below we find out how many times each team (1, 2, or 3) shows up in the column **team**.

In [6]:
teams_counts = data.groupby("team").count()
teams_counts.show()

+----+-----+
|team|count|
+----+-----+
|   1|    6|
|   3|    8|
|   2|    6|
+----+-----+



# Aggregation

In the table ```data``` the same id shows up in several rows, each time with a different score. Now, let's say we want to obtain some basic statistics: for each id, the minimum, the maximum and the average score. In the block of code below we show you how to do it using the methods ```groupby``` and ```agg```.

In [7]:
quick_stats = data.groupby("id").agg(F.min("score"), F.max("score"), F.mean("score"))
quick_stats.show()

+---+----------+----------+----------+
| id|min(score)|max(score)|avg(score)|
+---+----------+----------+----------+
|  5|         7|        80|      35.2|
|  1|         8|        79|     51.25|
|  3|        10|        98|      41.0|
|  2|         5|        67|      33.0|
|  4|        70|        70|      70.0|
+---+----------+----------+----------+



Let's break down the line of code above where the **quick_stats** table is created (```data.groupby("id").agg(F.min("score"), F.max("score"), F.mean("score"))```)

* ```groupby("id")```: Here we are grouping by the column **id**.
* ```agg(F.min("score"), F.max("score"), F.mean("score"))```: Here we aggregate the data and calculate the minimum, maximum and average value of the column **score** in each group (id).


## Final Words

We have gone over the basic tools to sort, count and aggregate using PySpark. Now it is time for you to start coding. Start with the following:

* Find out how many times each id shows up in the table.
* Use the ```agg``` method grouping by a different column.