# ESIEE Paris — Data Engineering I — Assignment 1
> Author: Badr TAJINI

**Academic year:** 2025–2026  
**Program:** Data & Applications - Engineering - (FD)   
**Course:** Data Engineering I  

---

In this assignment, you'll make sure that you've correctly set up your local Spark environment.
You'll then complete a classic "Word Count" task on the `description` column of the `a1-brand.csv` file.

You can think of "Word Count" as the "Hello World!" of Hadoop, Spark, etc.
The task is simple: We want to count the total number of times each word occurs (in a potentially large collection of text).
Typically, we want to sort by the counts in descending order so we can examine the most frequently occurring words.
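
To make the idea concrete before bringing in Spark, here is a minimal pure-Python sketch of counting and sorting. The two toy sentences are made-up data, and `collections.Counter` stands in for what Spark does at scale; it is an illustration of the task, not the solution expected below.

```python
from collections import Counter

# Toy input standing in for a much larger text collection (hypothetical data).
docs = [
    "spark makes big data simple",
    "big data needs big tools",
]

# Count every whitespace-separated word across all documents.
counts = Counter(word for doc in docs for word in doc.split())

# most_common(n) returns (word, count) pairs sorted by count, descending.
top3 = counts.most_common(3)
print(top3)
```

On a single machine this fits in memory; the Spark versions in this assignment distribute the same count-then-sort pattern across partitions.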

## Learning goals
- Confirm local Spark environment in JupyterLab.
- Implement word-count using **RDD** and **DataFrame** APIs.
- Produce top-10 tokens with and without stopwords.
- Record brief performance notes and environment details.


## 1. Setup

The following code snippet should "just work" to initialize Spark.
If it doesn't, consult the **helper and the Lab 0 installation and setup guide**.

In [None]:
import findspark, os
os.environ["SPARK_HOME"] = "/path/to/spark-4.0.0-bin-hadoop3"
findspark.init()

Edit the path below to point to your local copy of `a1-brand.csv`. 

Examples:
- macOS/Linux: `/Users/yourname/data/a1-brand.csv`
- Windows: `C:\\Users\\yourname\\data\\a1-brand.csv`
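
A quick sanity check on the path can save a confusing Spark error later. The sketch below is optional, and the helper name `check_data_path` is hypothetical (it is not part of the assignment template); it only confirms that the string points to an existing file.

```python
from pathlib import Path

def check_data_path(path_str):
    """Return the resolved path if it is an existing file, else raise."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No file at {path}; edit DATA_PATH and retry.")
    return path.resolve()

# On Windows, a raw string avoids doubling every backslash (both spellings are equal).
assert r"C:\Users\yourname" == "C:\\Users\\yourname"
```

Once `DATA_PATH` is set in the next cell, calling `check_data_path(DATA_PATH)` either returns the absolute path or raises a clear error.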

In [1]:
# TODO: Set the path to a1-brand.csv
DATA_PATH = "/path/to/a1-brand.csv"

Import PySpark:

In [2]:
import sys, re
from pyspark.sql import SparkSession, functions as F, types as T
from pyspark.sql.functions import col

Set up a cell magic to measure wall time and memory. (Don't worry about the details; just run the cell.)


In [None]:
from IPython.core.magic import register_cell_magic
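import time, os, platform
import psutil

try:
    import resource  # Unix-only module; not available on Windows
except ImportError:
    resource = None

def _rss_bytes():
    # Current resident set size of this Python process, in bytes
    return psutil.Process(os.getpid()).memory_info().rss

def _ru_maxrss_bytes():
    if resource is None:
        # Windows fallback: peak working set size reported by psutil
        return psutil.Process(os.getpid()).memory_info().peak_wset
    # ru_maxrss: bytes on macOS; kilobytes on Linux
    ru = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if platform.system() == "Darwin":
        return int(ru)  # bytes
    else:
        return int(ru) * 1024  # KB -> bytes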

@register_cell_magic
def timemem(line, cell):
    """
    Measure wall time and memory around the execution of this cell.
    Usage:
        %%timemem
        <your code>
    """
    ip = get_ipython()
    rss_before = _rss_bytes()
    peak_before = _ru_maxrss_bytes()
    t0 = time.perf_counter()

    # Execute the cell body
    result = ip.run_cell(cell)

    t1 = time.perf_counter()
    rss_after = _rss_bytes()
    peak_after = _ru_maxrss_bytes()

    wall = t1 - t0
    rss_delta_mb = (rss_after - rss_before) / (1024*1024)
    peak_delta_mb = (peak_after - peak_before) / (1024*1024)

    print("======================================")
    print(f"Wall time: {wall:.3f} s")
    print(f"RSS Δ: {rss_delta_mb:+.2f} MB")
    print(f"Peak memory Δ: {peak_delta_mb:+.2f} MB (OS-dependent)")
    print("======================================")

    return result

Start a local Spark session (a `SparkSession`, which wraps a `SparkContext`):

In [None]:
%%timemem

spark = (
    SparkSession.builder
    .appName("Assignment1")
    .master("local[*]")            # Use all local cores
    .config("spark.ui.showConsoleProgress", "true")
    .getOrCreate()
)

spark

If you've gotten to here, congrats! Everything seems to have been set up and initialized properly!

## 2. Word Count with RDDs

First, let's read the `a1-brand.csv` file into an RDD.

**write some code here**

**Hints:**

- You'll want to fetch the `SparkContext` from the `SparkSession`.
- There's a method of the `SparkContext` for reading in text files.
- This simple exercise should only take two lines. If you find yourself writing more code, you're doing something wrong...

In [None]:
%%timemem

# TODO: Write your code below, but do not remove any lines already in this cell.



# By the time we get to here, "lines" should refer to an RDD with the brand file loaded.
# Let's count the lines.

lines.count()

Next, clean and tokenize text, and then find the 10 most common words.

**write some code here**

**Required Steps:**

- Lowercase all text.
- Replace non-letter characters (`[^a-z]`) with spaces.
- Split on whitespace into tokens.
- Remove tokens with length < 2.

**Hints:**

- You _must_ use `flatMap` and other RDD operations in this step. If you're not, you're doing something wrong...
- At the end, you'll need to `collect` the output.
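
The four required steps, and the flattening that `flatMap` performs, can be previewed in plain Python. This is a sketch on two made-up lines, not the RDD solution: the function name `tokenize` and the sample text are assumptions, and `itertools.chain.from_iterable` is used here only as a local analogue of flattening one list of tokens per line into a single stream.

```python
import re
from itertools import chain

NON_LETTERS = re.compile(r"[^a-z]")

def tokenize(line):
    """Apply the four required steps to one line of text."""
    lowered = line.lower()                         # 1. lowercase
    letters_only = NON_LETTERS.sub(" ", lowered)   # 2. non-letters -> spaces
    tokens = letters_only.split()                  # 3. split on whitespace
    return [t for t in tokens if len(t) >= 2]      # 4. drop tokens shorter than 2

sample_lines = ["A T-shirt, 100% cotton!", "Eco-friendly & durable."]

# A plain map yields one list per line; a flatMap-style step merges them.
nested = [tokenize(line) for line in sample_lines]
flat = list(chain.from_iterable(nested))
print(nested)
print(flat)
```

Note that lowercasing must happen before the `[^a-z]` replacement; otherwise uppercase letters would be turned into spaces.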


In [None]:
%%timemem

# TODO: Write your code below, but do not remove any lines already in this cell.



# By the time we get to here "word_counts" already has the collected output, sorted by frequency in descending order.
# So we just print out the top-10.

for word, count in word_counts[:10]:
    print(f"{word}: {count}")

## 3. Word Count with DataFrames

### 3.1 Again, Just with DataFrames

Now, we're going to do the same thing, but with DataFrames instead of RDDs.

What's the difference, you ask? We'll cover it in lecture soon enough!

**write some code here**

**Hints:**

- Here, you'll use the `SparkSession`.
- Loading a DataFrame is a single method call. If you find yourself writing more code, you're doing something wrong...
- When loading the CSV file, be aware of your escape character; use something like `.option("escape", ...)`.
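
Why the escape character matters: many CSV files represent a literal quote inside a quoted field by doubling it, while Spark's CSV reader uses a backslash as its default escape character. The sketch below uses Python's standard `csv` module on a made-up two-column sample to show how a doubled quote is meant to be read. Whether `a1-brand.csv` follows this convention is an assumption you should verify by inspecting the file.

```python
import csv
import io

# Hypothetical sample; the description contains a comma and embedded quotes.
raw = 'brand,description\nAcme,"The ""original"" anvil, since 1920"\n'

# The csv module's default dialect treats a doubled quote as one literal quote.
rows = list(csv.reader(io.StringIO(raw)))
print(rows[0])
print(rows[1])
```

If a reader is configured with a different escape convention than the file uses, fields like this one can be split at the wrong place, which typically shows up as shifted or extra columns.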

In [None]:
%%timemem

# TODO: Write your code below, but do not remove any lines already in this cell.



# By the time we get to here, the file should have already been loaded into a DataFrame.
# Here, we just inspect it.

print("Rows:", df.count())
df.printSchema()
df.select("description").show(5, truncate=80)

Next, clean and tokenize text, and then find the 10 most common (i.e., frequently occurring) words.
This attempts the same processing as word count with RDDs above, except here you're using a DataFrame.

**write some code here**

**Required Steps:** (Exactly the same as above.)

- Lowercase all text.
- Replace non-letter characters (`[^a-z]`) with spaces.
- Split on whitespace into tokens.
- Remove tokens with length < 2.

**Hints:**

- You _must_ use `explode` and other Spark DataFrame operations in this exercise.
- This exercise shouldn't take more than (roughly) a dozen lines. If you find yourself writing more code, you're doing something wrong...

In [None]:
%%timemem

# TODO: Write your code below, but do not remove any lines already in this cell.



# By the time we get to here "word_counts" is a DataFrame that already has the word counts sorted in descending order.
# So we just print out the top-10.

top10 = word_counts.limit(10)
top10.show()

**Questions to reflect on**:

- What is conceptually different about how Spark executes `flatMap` and `explode`?
- What are the advantages or disadvantages of using each of them? 
- Are there cases where you may prefer one over the other?

(No need to write answers in the assignment submission. Just think about it...)

**Question to actually answer**:

Do the RDD approach and the DataFrame approach give the same answers? Explain why or why not.

**Write your answer to the above question!**

### 3.2 Removing Stopwords

You've probably noticed that many of the most frequently occurring words tell us little about the content because they are words like "in", "the", "for", etc.
These are called stopwords.

Let's remove stopwords and count again!

**write some code here**

**Hints:**

- Filter out all stopwords from the DataFrame before counting.
- Use `StopWordsRemover` from `pyspark.ml.feature`.
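
As a plain-Python illustration of what stopword removal does to the ranking, the sketch below filters a made-up token list against a tiny hypothetical stopword set. It is not the `StopWordsRemover` solution; keep in mind that `StopWordsRemover` operates on a column holding arrays of tokens, so it fits before the tokens are exploded into rows.

```python
from collections import Counter

# Tiny hypothetical stopword set; real lists are much longer.
STOPWORDS = {"in", "the", "for", "and", "of"}

tokens = ["the", "best", "shoes", "for", "running", "in", "the", "rain", "shoes"]

# Keep only tokens that are not stopwords, then compare the rankings.
kept = [t for t in tokens if t not in STOPWORDS]
print(Counter(tokens).most_common(2))
print(Counter(kept).most_common(2))
```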

In [None]:
%%timemem

# TODO: Write your code below, but do not remove any lines already in this cell.

import numpy  # pyspark.ml depends on NumPy; this fails early if it is missing
from pyspark.ml.feature import StopWordsRemover



# By the time we get to here "word_counts_noStopWords" is a DataFrame that already has the word counts sorted in descending order.
# So we just print out the top-10.

top10_noStopWords = word_counts_noStopWords.limit(10)
top10_noStopWords.show()

### 3.3 Saving Results to CSV

+ Save the results of the top-10 most frequently occurring words _with stopwords_, as a CSV file, to `top10_words.csv`.
+ Save the results of the top-10 most frequently occurring words _discarding stopwords_, as a CSV file, to `top10_noStopWords.csv`.
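
One thing to be aware of: Spark's `DataFrame.write.csv(path)` produces a directory of part files rather than a single file. Since a top-10 result is tiny, one possible approach is to bring the rows to the driver and write them with ordinary Python. The sketch below shows only the writing half on made-up rows; the helper name `write_counts_csv` and the temporary output directory are assumptions for illustration.

```python
import csv
import os
import tempfile

def write_counts_csv(rows, path):
    """Write (word, count) pairs to a single CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["word", "count"])
        writer.writerows(rows)

# Hypothetical results; in the notebook these would come from the top-10 DataFrame.
rows = [("shoes", 42), ("running", 17)]
out_path = os.path.join(tempfile.mkdtemp(), "top10_words.csv")
write_counts_csv(rows, out_path)
print(out_path)
```

Check the helper for the expected output layout before choosing between a single file and Spark's directory output.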

**write some code here**

In [None]:
%%timemem

# TODO: Write your code below, but do not remove any lines already in this cell.




## 4. Assignment Submission and Cleanup

Details about submitting this assignment are outlined in the helper. Please read the instructions carefully.

Finally, clean up!

In [None]:
spark.stop()

## Performance notes

- Prefer DataFrame built-ins; avoid Python UDFs for tokenization where possible.
- Keep shuffle partitions modest on local runs.
- Cache wisely and avoid unnecessary actions.
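
For the shuffle-partition note, the relevant setting is `spark.sql.shuffle.partitions`, which defaults to 200 and is usually more than a small local job needs. The sketch below keeps such settings in a dictionary and applies them to any builder that exposes `.config(key, value)`. The value `"8"` and the helper name `apply_conf` are illustrative assumptions, not tuned recommendations.

```python
# Hypothetical settings for a small local run; values are illustrative only.
LOCAL_CONF = {
    "spark.sql.shuffle.partitions": "8",   # Spark's default is 200
}

def apply_conf(builder, conf):
    """Call builder.config(key, value) for each setting and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder

# Usage in the notebook (requires PySpark):
#   spark = apply_conf(SparkSession.builder.master("local[*]"), LOCAL_CONF).getOrCreate()
```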


## Reproducibility checklist

- Record Python/Java/Spark versions.
- Fix timezone to UTC.
- Provide exact run command and paths to input/output files.
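
The checklist can be partly automated. The sketch below gathers version details and pins the Python process to UTC; the function name `collect_env` is an assumption, and the Java lookup simply reports whatever `java -version` prints first. For Spark SQL itself, the session time zone is controlled separately by the `spark.sql.session.timeZone` setting.

```python
import os
import platform
import shutil
import subprocess
import sys
import time

def collect_env():
    """Gather version details worth recording alongside the results."""
    info = {
        "python": sys.version.split()[0],
        "os": platform.platform(),
    }
    try:
        import pyspark
        info["pyspark"] = pyspark.__version__
    except ImportError:
        info["pyspark"] = "not installed"
    if shutil.which("java"):
        try:
            # `java -version` prints its banner to stderr, not stdout.
            out = subprocess.run(
                ["java", "-version"], capture_output=True, text=True, timeout=30
            )
            lines = out.stderr.strip().splitlines()
            info["java"] = lines[0] if lines else "unknown"
        except (OSError, subprocess.TimeoutExpired):
            info["java"] = "unknown"
    else:
        info["java"] = "not found on PATH"
    return info

# Pin the Python process to UTC; time.tzset() exists only on Unix.
os.environ["TZ"] = "UTC"
if hasattr(time, "tzset"):
    time.tzset()

for key, value in collect_env().items():
    print(f"{key}: {value}")
```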
