### Introduction

**Ordering** (also called sorting) arranges the rows of a data set by the values in one or more columns. It is most often applied at the end of an analysis, so that the final output is presented in a clear, organized way that is easier to understand. But remember, ordering is helpful at any stage of working with data!

In Spark, you can sort a DataFrame with **.sort()** or with **.orderBy()**, which is an alias of **.sort()** whose name will feel familiar to those who like SQL. You can also run regular SQL queries against your Spark DataFrames to do sorting, as you'll find out below.

#### Getting Ready for the Exercise

In the earlier task labeled "4.6: How to Aggregate Data," we discussed a set of example data and used aggregation (grouping) to answer questions like:

+ Which cat is the youngest among our data?
+ Which cat is the oldest among our data?

Now, we're going to explore another way to answer these same questions using sorting or ordering in Spark. We'll aim to find the youngest and oldest cats in the data again, but this time, we'll take advantage of sorting to achieve our goal.

In [1]:
import findspark
findspark.init()
findspark.find()

'C:\\Spark\\sparkhome'

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName("Vamsi_App_5").getOrCreate()

In [4]:
from pyspark.sql.functions import col

Create a list of tuples, each containing a nickname, type, age, and color, using the following code:

In [5]:
my_previous_pets = [("fido", "dog", 4, "brown"),
                    ("annabelle", "cat", 15, "white"),
                    ("fred", "bear", 29, "brown"),
                    ("daisy", "cat", 8, "black"),
                    ("jerry", "cat", 1, "white"),
                    ("fred", "parrot", 1, "brown"),
                    ("gus", "fish", 1, "gold"),
                    ("gus", "dog", 11, "black"),
                    ("daisy", "iguana", 2, "green"),
                    ("rufus", "dog", 10, "gold")]

Use the **parallelize()** function of Spark to turn that List into an RDD as shown in the following code:

In [6]:
petsRDD = spark.sparkContext.parallelize(my_previous_pets)

Create a DataFrame from the RDD and a list of column names (Spark infers the column types) using the following code:

In [7]:
petsDF = spark.createDataFrame(petsRDD, ['nickname', 'type','age','color'])

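Next, let's make the data available to SQL queries. We'll register the DataFrame as a temporary view named 'pets' using the code below (**createOrReplaceTempView()** replaces the deprecated **registerTempTable()**):

In [8]:
petsDF.createOrReplaceTempView('pets')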



In [9]:
petsDF.printSchema()

root
 |-- nickname: string (nullable = true)
 |-- type: string (nullable = true)
 |-- age: long (nullable = true)
 |-- color: string (nullable = true)



There are two ways to achieve the goal: writing SQL queries, or chaining DataFrame functions.

### SQL Approach

In this approach, you write SQL queries to find the youngest and oldest cats and display the results with **.show()**:

In [11]:
spark.sql("SELECT nickname AS youngest_cat, "
          "MIN(age) AS age "
          "FROM pets "
          "WHERE type = 'cat' "
          "GROUP BY nickname "
          "ORDER BY age ASC "
          "LIMIT 1")\
    .show()

+------------+---+
|youngest_cat|age|
+------------+---+
|       jerry|  1|
+------------+---+



In [12]:
spark.sql("SELECT nickname AS oldest_cat, "
          "MAX(age) AS age "
          "FROM pets "
          "WHERE type = 'cat' "
          "GROUP BY nickname "
          "ORDER BY age DESC "
          "LIMIT 1")\
    .show()

+----------+---+
|oldest_cat|age|
+----------+---+
| annabelle| 15|
+----------+---+



### Function-Chaining Approach

Alternatively, chain DataFrame functions to accomplish the same thing. Print the results to the console using the following code:

In [13]:
petsDF.where("type = 'cat'").sort("age").limit(1).show()

+--------+----+---+-----+
|nickname|type|age|color|
+--------+----+---+-----+
|   jerry| cat|  1|white|
+--------+----+---+-----+



In [14]:
petsDF.where("type = 'cat'").sort(col("age").desc()).limit(1).show()

+---------+----+---+-----+
| nickname|type|age|color|
+---------+----+---+-----+
|annabelle| cat| 15|white|
+---------+----+---+-----+



The previous results show that the youngest cat is Jerry, a white cat aged 1 year. The oldest cat, Annabelle, is 15 years old and also white.
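
For a data set this small, the Spark results can be cross-checked in plain Python. The sketch below is only a sanity check, not a replacement for the Spark approach: it reuses the `my_previous_pets` list from the setup and sorts the cats by age with the built-in `sorted()`. The names `cats`, `cats_by_age`, `youngest_cat`, and `oldest_cat` are introduced here for illustration.

```python
# Same data as in the setup step of this exercise.
my_previous_pets = [("fido", "dog", 4, "brown"),
                    ("annabelle", "cat", 15, "white"),
                    ("fred", "bear", 29, "brown"),
                    ("daisy", "cat", 8, "black"),
                    ("jerry", "cat", 1, "white"),
                    ("fred", "parrot", 1, "brown"),
                    ("gus", "fish", 1, "gold"),
                    ("gus", "dog", 11, "black"),
                    ("daisy", "iguana", 2, "green"),
                    ("rufus", "dog", 10, "gold")]

# Keep only the cats (the type is the second field of each tuple).
cats = [pet for pet in my_previous_pets if pet[1] == "cat"]

# Sort ascending by age (the third field).
cats_by_age = sorted(cats, key=lambda pet: pet[2])

youngest_cat = cats_by_age[0]   # first row after an ascending sort
oldest_cat = cats_by_age[-1]    # last row after an ascending sort

print(youngest_cat)  # ('jerry', 'cat', 1, 'white')
print(oldest_cat)    # ('annabelle', 'cat', 15, 'white')
```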

In this exercise, we sorted our final output so that the rows appear in a meaningful order, which makes the data easier to interpret.

In the upcoming section, we'll delve into the concept of Ranking. A hint for you: it involves utilizing sorting and ordering techniques!