# Ex1 - Filtering and Sorting Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

spark = (
    SparkSession.builder
    .appName("MyApp")
    .getOrCreate()
)

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv).

In [2]:
!wget -O chipotle.csv https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv

--2025-12-01 18:50:57--  https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 364975 (356K) [text/plain]
Saving to: ‘chipotle.csv’


2025-12-01 18:50:57 (10.7 MB/s) - ‘chipotle.csv’ saved [364975/364975]



### Step 3. Assign it to a variable called chipo.

In [3]:
chipo = spark.read.csv('chipotle.csv', header=True, sep="\t", inferSchema=True)

In [4]:
chipo.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)
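
`item_price` is inferred as a string because every value carries a `$` prefix, and the `sep="\t"` argument is needed because the file is tab-separated despite the `.csv` name it was saved under. As a rough standard-library illustration of both points (the sample line below is illustrative and only follows the column layout of the schema above):

```python
import csv
import io

# One illustrative line in the same tab-separated layout as the schema above.
sample = (
    "order_id\tquantity\titem_name\tchoice_description\titem_price\n"
    "1\t1\tIzze\t[Clementine]\t$3.39 \n"
)

# delimiter="\t" plays the role of sep="\t" in spark.read.csv.
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
row = next(reader)

print(row["item_name"])
# The price is still text, including the "$" and a trailing space.
print(repr(row["item_price"]))
```

This is why a later step has to strip the `$` before the column can be compared with a number.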



### Step 4. How many products cost more than $10.00?

In [44]:
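# Strip the "$" from item_price, cast to float, and divide by quantity to get a per-unit price.
chipo = chipo.withColumn(
    'item_price_float',
    F.regexp_replace('item_price', '[$]', '')
    .cast(FloatType()) / F.col('quantity')
)

# Count distinct (item_name, choice_description) pairs whose unit price exceeds $10.
expensive_items = (
    chipo.filter(F.col('item_price_float') > 10)
    .select(F.col('item_name'), F.col('choice_description'))
    .distinct()
    .count()
)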

expensive_items


707
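
The cell above turns strings such as `$16.98 ` into a per-unit price by stripping the dollar sign, casting, and dividing by `quantity`. As a rough plain-Python illustration of that arithmetic for a single row (the helper name `unit_price` and the sample values are hypothetical, not part of the notebook), `Decimal` keeps the cents exact:

```python
from decimal import Decimal


def unit_price(item_price: str, quantity: int) -> Decimal:
    """Mirror the Step 4 logic for one row: drop '$', parse, divide by quantity."""
    # strip() removes the trailing space seen in values such as "$2.39 ".
    cleaned = item_price.replace("$", "").strip()
    return Decimal(cleaned) / quantity


# Hypothetical rows, used only to illustrate the calculation.
print(unit_price("$16.98 ", 2))
print(unit_price("$2.39 ", 1))
```

Under this definition a two-item line priced at $16.98 falls below the $10 threshold, which is the effect of filtering on the unit price rather than on the raw `item_price`.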

### Step 5. What is the price of each item?
###### print a data frame with the columns item_name, choice_description and item_price

In [48]:
chipo.select(F.col('item_name'), F.col('choice_description'), F.col('item_price')).show()

+--------------------+--------------------+----------+
|           item_name|  choice_description|item_price|
+--------------------+--------------------+----------+
|Chips and Fresh T...|                NULL|    $2.39 |
|                Izze|        [Clementine]|    $3.39 |
|    Nantucket Nectar|             [Apple]|    $3.39 |
|Chips and Tomatil...|                NULL|    $2.39 |
|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       Side of Chips|                NULL|    $1.69 |
|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
| Chips and Guacamole|                NULL|    $4.45 |
|Chicken Crispy Tacos|[Roasted Chili Co...|    $8.75 |
|  Chicken Soft Tacos|[Roasted Chili Co...|    $8.75 |
|        Chicken Bowl|[Fresh Tomato Sal...|   $11.25 |
| Chips and Guacamole|                NULL|    $4.45 |
|Chips and

### Step 6. Sort by the name of the item

In [45]:
chipo.orderBy(F.col('item_name')).show()

+--------+--------+-----------------+------------------+----------+-----------------+
|order_id|quantity|        item_name|choice_description|item_price| item_price_float|
+--------+--------+-----------------+------------------+----------+-----------------+
|     511|       1|6 Pack Soft Drink|            [Coke]|    $6.49 |6.489999771118164|
|    1253|       1|6 Pack Soft Drink|        [Lemonade]|    $6.49 |6.489999771118164|
|     520|       1|6 Pack Soft Drink|          [Sprite]|    $6.49 |6.489999771118164|
|     148|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |6.489999771118164|
|     566|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |6.489999771118164|
|     168|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |6.489999771118164|
|     708|       1|6 Pack Soft Drink|            [Coke]|    $6.49 |6.489999771118164|
|     230|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |6.489999771118164|
|     709|       1|6 Pack Soft Drink|       [Diet Coke
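
The `item_price_float` values print as 6.489999771118164 rather than 6.49 because `FloatType` is a 32-bit float, and 6.49 has no exact binary representation at that width. A small standard-library sketch of the same effect, independent of Spark (the helper `to_float32` is a hypothetical name):

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and widen it back."""
    return struct.unpack("f", struct.pack("f", value))[0]


price = 6.49
as_float32 = to_float32(price)

# The 32-bit value differs slightly from 6.49 once widened back to 64 bits.
print(as_float32)
print(abs(as_float32 - price) < 1e-6)
```

If exact cents matter, casting to `DoubleType` or a `DecimalType` instead of `FloatType` is a common alternative, though the values displayed in this notebook would then differ.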

### Step 7. What was the quantity of the most expensive item ordered?

In [None]:
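# Assumption: "most expensive" means the highest unit price (item_price_float from Step 4); ties are possible.
chipo.orderBy(F.col('item_price_float').desc()).select('item_name', 'quantity', 'item_price_float').show(1)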
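
The question can be read two ways: the row with the highest line total, or the row with the highest per-unit price. On the hypothetical rows below (made up for illustration, not taken from the dataset), the two readings pick different rows and therefore report different quantities:

```python
# Hypothetical order lines: (item_name, quantity, line_total_in_dollars).
rows = [
    ("Side of Chips", 10, 16.90),
    ("Steak Burrito", 1, 11.75),
    ("Chicken Bowl", 2, 16.98),
]

# Reading 1: most expensive line, by total price.
by_total = max(rows, key=lambda r: r[2])

# Reading 2: most expensive item, by price per unit.
by_unit = max(rows, key=lambda r: r[2] / r[1])

print(by_total[0], by_total[1])
print(by_unit[0], by_unit[1])
```

In this notebook, sorting on `item_price_float` corresponds to the second reading, since that column holds the price divided by `quantity`.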

### Step 8. How many times was a Veggie Salad Bowl ordered?

In [89]:
filtered = chipo.filter(F.col('item_name') == 'Veggie Salad Bowl')
filtered.count()

18
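
Note that `filtered.count()` counts matching order lines, not units: a single line with `quantity` 3 still counts once. A plain-Python sketch of the difference on hypothetical rows (not taken from the dataset):

```python
# Hypothetical order lines: (order_id, item_name, quantity).
rows = [
    (1, "Veggie Salad Bowl", 1),
    (2, "Veggie Salad Bowl", 3),
    (2, "Canned Soda", 2),
]

matching = [r for r in rows if r[1] == "Veggie Salad Bowl"]

line_count = len(matching)                # what a row count reports
unit_count = sum(r[2] for r in matching)  # total units ordered

print(line_count, unit_count)
```

In PySpark the unit total would be an aggregation such as `filtered.agg(F.sum('quantity'))`; which number answers the question depends on whether "ordered" is taken to mean lines or units.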

### Step 9. How many times did someone order more than one Canned Soda?

In [90]:
filtered = chipo.filter((F.col('item_name') == 'Canned Soda') & (F.col('quantity') > 1))
filtered.count()

20