Think of filtering data as clearing out what you don't need from a big pile. Just as you sort through your things and keep only what matters, filtering removes irrelevant or messy records from your data.

In Spark, you do this with the **.filter()** method. **.where()** is an alias for .filter() and behaves identically. Just as a coffee machine's filter holds back the grounds, a DataFrame filter holds back the rows you don't want. Filtering is usually part of preparing data for analysis, and it's an important step. Think of it as tidying up your room before a friend visits!

In this walkthrough, we will learn how to filter data in a Spark DataFrame step by step, using Python.

You start by creating a list of animals along with their categories:

In [2]:
categorized_animals = [("dog", "pet"), ("cat", "pet"), ("bear", "wild")]

#### Turning the List into an RDD and a DataFrame

In [7]:
import findspark
findspark.init()
findspark.find()

'C:\\Spark\\sparkhome'

In [9]:
from pyspark.sql import SparkSession

In [10]:
spark = SparkSession.builder.appName("Vamsi_App_1").getOrCreate()

In [11]:
animalDataRDD = spark.sparkContext.parallelize(categorized_animals)

In [12]:
animalDFs = spark.createDataFrame(animalDataRDD,['name','category'])

In [13]:
animalDFs.show()

+----+--------+
|name|category|
+----+--------+
| dog|     pet|
| cat|     pet|
|bear|    wild|
+----+--------+



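Now filter the DataFrame. The SQL-style condition name like 'c%' keeps only the rows whose name starts with "c":

In [15]:
cNames = animalDFs.filter("name like 'c%'")
cNames.show()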

+----+--------+
|name|category|
+----+--------+
| cat|     pet|
+----+--------+



In this exercise, you learned how to use Spark's .filter() method to keep only the rows of a DataFrame that match a condition. It's like sorting your belongings into what you want to keep and what you don't, so you end up with a tidy room.