### Ranking

Ranking in Spark assigns each row a number that reflects its position under a given ordering. It relies on the same kind of sort that an ORDER BY clause performs, but it does not filter anything the way a WHERE clause does: no rows are removed. Instead, each row receives a numerical label based on the specified condition or column, and later steps can filter on that label.

While ordering helps you sort data based on a specific column, ranking lets you assign a number or rank to each row. This rank can then be used in various ways downstream, such as selecting the top results or applying further transformations based on these assigned labels.

A common ranking function in Spark is row_number(). It assigns a unique, sequential number (starting at 1) to each row, either across the entire dataset or within each partition of the data, based on a defined ordering. This is useful for "top n" analysis, where you want to identify and work with the first few rows of each group.
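
To make "unique" concrete, the sketch below mimics in plain Python (no Spark involved) how row_number() numbers rows that are already sorted, next to how rank() treats ties. The helper name `number_and_rank` and the sample ages are made up for illustration; in Spark, which of two tied rows receives the lower row number is not guaranteed.

```python
def number_and_rank(sorted_values):
    """Return (row_numbers, ranks) for values that are already sorted."""
    row_numbers, ranks = [], []
    for position, value in enumerate(sorted_values, start=1):
        # row_number: always the position, so every row gets a distinct number.
        row_numbers.append(position)
        if position > 1 and value == sorted_values[position - 2]:
            # rank: a tie with the previous row repeats the previous rank.
            ranks.append(ranks[-1])
        else:
            # rank: a new value takes the row position, leaving a gap after ties.
            ranks.append(position)
    return row_numbers, ranks


# Hypothetical ages containing a tie, sorted in descending order.
ages = sorted([15, 8, 8, 1], reverse=True)
row_numbers, ranks = number_and_rank(ages)
print(row_numbers)  # [1, 2, 3, 4]
print(ranks)        # [1, 2, 2, 4]
```

Because row_number() never repeats a value, filtering on `<= 2` returns at most two rows per group, which is why it suits "top n" queries.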

Ranking is most valuable when you need to prioritize or filter data by position, for example keeping only the first few rows of each group.

#### In this Exercise

We aim to find the top 2 cats and the top 2 dogs by age (the two oldest of each) using ranking. While it's possible to answer this question without ranking, ranking simplifies the process and makes the results easier to read.

The rank is attached with **df.withColumn()**, which adds a new column to a DataFrame or replaces an existing column of the same name. It's commonly used to apply logic to a specific column. A related method, **df.withColumnRenamed()**, only renames an existing column; it does not apply any logic.

In [1]:
# Locate the local Spark installation so that pyspark can be imported
import findspark
findspark.init()
findspark.find()

'C:\\Spark\\sparkhome'

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('Vamsi_App_6').getOrCreate()

The remaining cells walk through the full workflow: import the window utilities, create an RDD and a DataFrame, register a temporary table view, define a window, apply the **row_number()** function, and filter the results.

In [4]:
# Import relevant Spark libraries
from pyspark.sql.window import Window
from pyspark.sql import functions as F

These imports give us the two key components used below: **the window specification (Window) and the row_number ranking function (F.row_number())**.

In [5]:
# Create a list of tuples, each containing a nickname, type, age, and color
my_previous_pets = [
    ("fido", "dog", 4, "brown"),
    ("annabelle", "cat", 15, "white"),
    ("fred", "bear", 29, "brown"),
    ("daisy", "cat", 8, "black"),
    ("jerry", "cat", 1, "white"),
    ("fred", "parrot", 1, "brown"),
    ("gus", "fish", 1, "gold"),
    ("gus", "dog", 11, "black"),
    ("daisy", "iguana", 2, "green"),
    ("rufus", "dog", 10, "gold")
]

In [6]:
# Use parallelize() to create an RDD from the list
petsRDD = spark.sparkContext.parallelize(my_previous_pets)

In [11]:
# Create a DataFrame from the RDD, supplying the column names
petsDF = spark.createDataFrame(petsRDD, ['nickname', 'type', 'age', 'color'])

In [12]:
# Create a temporary table view of the data in Spark SQL
petsDF.createOrReplaceTempView('pets')

#### Create a Window that is partitioned by 'type' and orders by 'age' in descending order

Let's break down the components of this operation:

+ **Window:** In Spark, a window is a logical construct used for defining a set of rows over which a function is applied. It specifies the range or scope of rows that should be considered when applying functions like **row_number(), rank(), or aggregation functions**.

+ **Partitioning:** The term "partitioned by 'type'" means that the data is divided into groups or partitions based on the values in the 'type' column. Each partition is treated as a separate group for window functions. For example, if you have 'cat' and 'dog' as values in the 'type' column, the window function will operate separately within these two groups.

+ **Ordering:** The window is ordered by the 'age' column, and it specifies "in descending order." This means that within each partition (group), the rows are sorted by the 'age' column in descending order. In other words, the rows with the highest 'age' values come first within each group.
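
The partition-then-order idea can be sketched in plain Python before handing it to Spark. This is only an illustration of the semantics, not how Spark executes a window; the variable names are made up for the example, and the data is the same pet list used above.

```python
from collections import defaultdict

pets = [
    ("fido", "dog", 4, "brown"),
    ("annabelle", "cat", 15, "white"),
    ("fred", "bear", 29, "brown"),
    ("daisy", "cat", 8, "black"),
    ("jerry", "cat", 1, "white"),
    ("fred", "parrot", 1, "brown"),
    ("gus", "fish", 1, "gold"),
    ("gus", "dog", 11, "black"),
    ("daisy", "iguana", 2, "green"),
    ("rufus", "dog", 10, "gold"),
]

# Partitioning: group the rows by their 'type' value.
partitions = defaultdict(list)
for pet in pets:
    partitions[pet[1]].append(pet)

# Ordering and numbering: sort each group by age (descending), then number
# the rows 1, 2, 3, ... independently within each group.
ranked = {}
for pet_type, rows in partitions.items():
    by_age_desc = sorted(rows, key=lambda row: row[2], reverse=True)
    ranked[pet_type] = [
        (row[0], row[2], position)
        for position, row in enumerate(by_age_desc, start=1)
    ]

print(ranked["cat"])  # [('annabelle', 15, 1), ('daisy', 8, 2), ('jerry', 1, 3)]
print(ranked["dog"])  # [('gus', 11, 1), ('rufus', 10, 2), ('fido', 4, 3)]
```

The numbering restarts at 1 in every group, which is the behavior the window specification asks Spark to reproduce.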

In [13]:
window = Window.partitionBy("type").orderBy(F.col("age").desc())

In [14]:
print(window)

<pyspark.sql.window.WindowSpec object at 0x000002417E16E790>


In [15]:
# Use withColumn() and row_number() to apply the window and create a 'rank' column
# Filter down to the top two ranks of each group: cats and dogs
# Sort the output by type and rank (the next cell prints it)
resultDF = petsDF.withColumn("rank", F.row_number().over(window))\
    .where("rank <= 2 and (type = 'dog' or type = 'cat')")\
    .orderBy("type", "rank")

This cell applies the row_number() function over the window, keeps the top 2 ranks for cats and dogs, and sorts the result by type and rank. The next cell prints the result.

In [16]:
resultDF.show()

+---------+----+---+-----+----+
| nickname|type|age|color|rank|
+---------+----+---+-----+----+
|annabelle| cat| 15|white|   1|
|    daisy| cat|  8|black|   2|
|      gus| dog| 11|black|   1|
|    rufus| dog| 10| gold|   2|
+---------+----+---+-----+----+

