# Introduction

### Having

**"HAVING"** is similar to filtering, which you may know from functions like **"filter()"** in programming or the **"WHERE"** clause in SQL queries. However, their use cases differ subtly. Filtering sets conditions on non-aggregated columns to restrict the result set, whereas **"HAVING"** imposes conditions on aggregate functions or aggregated columns.
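
As a plain-Python analogy for this distinction, the sketch below filters rows first (the role of "WHERE"), then sums ages per type, then filters the totals (the role of "HAVING"). The row-level condition `age > 1` and the variable names are hypothetical choices made for illustration, and no Spark is involved.

```python
from collections import defaultdict

# Illustrative sample data: (name, type, age) tuples.
pets = [("fido", "dog", 4), ("annabelle", "cat", 15), ("fred", "bear", 29),
        ("daisy", "cat", 8), ("jerry", "cat", 1), ("fred", "parrot", 1),
        ("gus", "fish", 1), ("gus", "dog", 11), ("daisy", "iguana", 2),
        ("rufus", "dog", 10)]

# Step 1, row-level filter (the role of WHERE): keep individual rows.
# The condition age > 1 is a hypothetical example.
kept_rows = [pet for pet in pets if pet[2] > 1]

# Step 2, aggregation (the role of GROUP BY + SUM): total age per type.
total_age = defaultdict(int)
for _name, pet_type, age in kept_rows:
    total_age[pet_type] += age

# Step 3, aggregate-level filter (the role of HAVING): keep whole groups.
having = {pet_type: total for pet_type, total in total_age.items() if total > 10}

# Sort the surviving groups from highest to lowest total age.
ranked = sorted(having.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

Note that the row-level filter changes what the aggregate sees: the one-year-old cat is dropped before summing, so the cat total is computed from the remaining rows only.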

Both methods limit your result set; the key distinction lies in when they are applied. In summary, **"WHERE"** filters individual rows before aggregation, while **"HAVING"** filters groups after aggregation. Consequently, a **"HAVING"** clause can often simplify or even eliminate the need for certain sub-queries or Common Table Expressions (CTEs).
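
To make the sub-query point concrete, here is a minimal sketch that applies the same aggregate filter twice: once with a sub-query plus "WHERE", and once with "HAVING". It assumes Python's built-in `sqlite3` module as a stand-in SQL engine (not Spark) so that it runs without a Spark installation; the table and variable names are illustrative.

```python
import sqlite3

# Illustrative sample data: (type, age) pairs. sqlite3 is only a stand-in
# SQL engine for this sketch; it is an assumption, not part of Spark.
rows = [("dog", 4), ("cat", 15), ("bear", 29), ("cat", 8), ("cat", 1),
        ("parrot", 1), ("fish", 1), ("dog", 11), ("iguana", 2), ("dog", 10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (type TEXT, age INTEGER)")
conn.executemany("INSERT INTO pets VALUES (?, ?)", rows)

# Without HAVING: aggregate in a sub-query, then filter its result with WHERE.
with_subquery = conn.execute("""
    SELECT type, total_age
    FROM (SELECT type, SUM(age) AS total_age FROM pets GROUP BY type) AS totals
    WHERE total_age > 10
    ORDER BY total_age DESC
""").fetchall()

# With HAVING: the same condition is applied directly to the aggregate.
with_having = conn.execute("""
    SELECT type, SUM(age) AS total_age
    FROM pets
    GROUP BY type
    HAVING SUM(age) > 10
    ORDER BY total_age DESC
""").fetchall()

conn.close()
print(with_subquery == with_having)
```

Both queries return the same rows; the "HAVING" version simply avoids the extra nesting level.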

Let's consider an example:

In our previous section, we used filtering to exclude animals from a dataset if their names didn't start with the letter "C." We also utilized filtering to eliminate animals categorized as pets from our dataset.

"HAVING" becomes particularly valuable when you want to compute a metric within your dataset and then further refine your dataset based on that aggregated metric. In essence, it functions like a "WHERE" clause for aggregate functions.

### In this Exercise Set:

#### What are the top three animal categories with the highest total combined age?

Put more simply: "If we add up the ages of all animals in each category, which three categories have the highest total age?" Fortunately, Spark provides an easy way to answer this.

In [1]:
import findspark
findspark.init()
findspark.find()

'C:\\Spark\\sparkhome'

In [2]:
from pyspark.sql import SparkSession

In [3]:
spark = SparkSession.builder.appName('Vamsi_App_7').getOrCreate()

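The cells below create an RDD from a list, convert it into a DataFrame, register it as a temporary view, and then run SQL-like queries on it using Spark. With this data, exactly three categories have a total age greater than 10, so the `HAVING total_age > 10` condition returns the top three animal categories. If you need exactly three rows regardless of the data, you can add `LIMIT 3` to the SQL query.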

In [5]:
my_previous_pets = [("fido", "dog", 4, "brown"),
("annabelle", "cat", 15, "white"),
("fred", "bear", 29, "brown"),
("daisy", "cat", 8, "black"),
("jerry", "cat", 1, "white"),
("fred", "parrot", 1, "brown"),
("gus", "fish", 1, "gold"),
("gus", "dog", 11, "black"),
("daisy", "iguana", 2, "green"),
("rufus", "dog", 10, "gold")]

# Distribute the local list as an RDD, then convert it to a DataFrame with named columns.
petsRDD = spark.sparkContext.parallelize(my_previous_pets)
petsDF = spark.createDataFrame(petsRDD, ['name', 'type', 'age', 'color'])

In [6]:
# Create a temporary view of the DataFrame for Spark SQL operations.
petsDF.createOrReplaceTempView('pets')

### Spark SQL

In [7]:
# Sum ages per type, keep only types whose total exceeds 10, highest total first.
result = spark.sql("""
    SELECT type, SUM(age) AS total_age
    FROM pets
    GROUP BY type
    HAVING total_age > 10
    ORDER BY total_age DESC
""")
result.show()

+----+---------+
|type|total_age|
+----+---------+
|bear|       29|
| dog|       25|
| cat|       24|
+----+---------+



### Programmatic Approach

In [8]:
from pyspark.sql.functions import col

In [9]:
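# Same logic as the SQL query: group, sum, rename, filter on the aggregate, sort.
results = (
    petsDF.groupBy("type")
    .sum("age")
    .withColumnRenamed("sum(age)", "total_age")
    .where(col("total_age") > 10)  # acts like HAVING: filters the aggregated column
    .orderBy(col("total_age").desc())
)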

In [10]:
results.show()

+----+---------+
|type|total_age|
+----+---------+
|bear|       29|
| dog|       25|
| cat|       24|
+----+---------+

