# Big Data Analytics — Lab 0 Starter (v2)
> Author: Badr TAJINI - Big Data Analytics - ESIEE 2025-2026

Verify your PySpark setup end‑to‑end and capture evidence.

In [8]:
pip install Py

Collecting Py
  Downloading py-1.11.0-py2.py3-none-any.whl.metadata (2.8 kB)
Downloading py-1.11.0-py2.py3-none-any.whl (98 kB)
Installing collected packages: Py
Successfully installed Py-1.11.0
Note: you may need to restart the kernel to use updated packages.
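
The `py` package installed above is a separate library and does not provide PySpark, whose PyPI distribution is named `pyspark`. A minimal sketch for confirming that PySpark is importable from the current kernel follows; the helper name `check_pyspark` is hypothetical.

```python
from importlib import metadata, util


def check_pyspark():
    """Return the installed PySpark version, or None if it is not importable."""
    if util.find_spec("pyspark") is None:
        return None
    try:
        return metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        # Importable but not pip-installed (for example, provided via SPARK_HOME).
        return "unknown"


version = check_pyspark()
if version is None:
    print("PySpark not found; install it with: pip install pyspark")
else:
    print("PySpark available, version:", version)
```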


## 1. Environment bootstrap

In [1]:
from datetime import datetime
print("Run timestamp (UTC):", datetime.utcnow().isoformat())

try:
    from pyspark.sql import SparkSession
    import pyspark, sys, platform, os
    spark = (
        SparkSession.builder
        .appName("BDA-Lab0")
        .config("spark.sql.session.timeZone","UTC")
        .config("spark.sql.shuffle.partitions","8")
        .getOrCreate()
    )
    print("Spark:", spark.version)
    print("PySpark:", pyspark.__version__)
    print("Python:", sys.version.split()[0], "|", platform.platform())
    print("SPARK_HOME:", os.environ.get("SPARK_HOME", "<pip-only>"))
except Exception as e:
    print("Spark init failed:", e)
    spark = None


Run timestamp (UTC): 2025-11-12T09:35:59.855973


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/11/12 10:36:02 WARN Utils: Your hostname, LAPTOP-ED8D06VN, resolves to a loopback address: 127.0.1.1; using 172.19.238.66 instead (on interface eth0)
25/11/12 10:36:02 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/12 10:36:05 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark: 4.0.1
PySpark: 4.0.1
Python: 3.10.19 | Linux-6.6.87.2-microsoft-standard-WSL2-x86_64-with-glibc2.39
SPARK_HOME: <pip-only>
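
The hostname warning in the log above is harmless for a single-machine run. If you prefer to silence it, one option is to set `SPARK_LOCAL_IP` before the first `SparkSession` is created, since the JVM reads it at launch. The sketch below assumes local mode, where binding to the loopback address `127.0.0.1` is acceptable; that address is an assumption, not something the lab prescribes.

```python
import os

# Assumption: Spark runs in local mode on this machine only, so loopback is fine.
# setdefault keeps any value that is already exported in the shell.
os.environ.setdefault("SPARK_LOCAL_IP", "127.0.0.1")

print("SPARK_LOCAL_IP:", os.environ["SPARK_LOCAL_IP"])
```

Run this in a fresh kernel before Section 1; it has no effect on a session that is already running.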


## 2. DataFrame sanity check

In [2]:
if spark is None:
    raise SystemExit("Spark not available. Fix setup and re-run Section 1.")

data = [("a",1),("b",2),("c",3),("a",2)]
df = spark.createDataFrame(data, ["key","val"])
df.show()
df.groupBy("key").count().show()

print("\n--- formatted plan ---")
df.groupBy("key").count().explain(mode="formatted")



+---+---+
|key|val|
+---+---+
|  a|  1|
|  b|  2|
|  c|  3|
|  a|  2|
+---+---+

+---+-----+
|key|count|
+---+-----+
|  a|    2|
|  b|    1|
|  c|    1|
+---+-----+


--- formatted plan ---
== Physical Plan ==
AdaptiveSparkPlan (6)
+- HashAggregate (5)
   +- Exchange (4)
      +- HashAggregate (3)
         +- Project (2)
            +- Scan ExistingRDD (1)


(1) Scan ExistingRDD
Output [2]: [key#0, val#1L]
Arguments: [key#0, val#1L], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)

(2) Project
Output [1]: [key#0]
Input [2]: [key#0, val#1L]

(3) HashAggregate
Input [1]: [key#0]
Keys [1]: [key#0]
Functions [1]: [partial_count(1)]
Aggregate Attributes [1]: [count#26L]
Results [2]: [key#0, count#27L]

(4) Exchange
Input [2]: [key#0, count#27L]
Arguments: hashpartitioning(key#0, 8), ENSURE_REQUIREMENTS, [plan_id=71]

(5) HashAggregate
Input [2]: [key#0, count#27L]
Keys [1]: [key#0]
Functions [1]: [count(1)]
Aggregate Attributes [1]: [count(1)#25L]
Results [2]: [key#0, count(1)#25L AS count#22L]

(6) AdaptiveSparkPlan
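
The plan above aggregates in two phases: a partial count inside each input partition (operator 3), a hash-partitioned shuffle on `key` (operator 4), and a final count that merges the partial results (operator 5). The pure-Python sketch below mimics that flow on the same four rows. It is an illustration only: the split of rows into two input partitions is an assumption, and Python's `hash` stands in for Spark's own hash function.

```python
from collections import Counter

rows = [("a", 1), ("b", 2), ("c", 3), ("a", 2)]

# Assumption: the rows happen to sit in two input partitions.
input_partitions = [rows[:2], rows[2:]]
num_shuffle_partitions = 8  # matches spark.sql.shuffle.partitions in Section 1

# Phase 1 (partial_count): count keys inside each input partition.
partials = [Counter(key for key, _ in part) for part in input_partitions]

# Exchange: route each (key, partial count) to a shuffle partition by hash of the key.
shuffled = [[] for _ in range(num_shuffle_partitions)]
for partial in partials:
    for key, n in partial.items():
        shuffled[hash(key) % num_shuffle_partitions].append((key, n))

# Phase 2 (count): sum the partial counts that landed in each shuffle partition.
final = Counter()
for part in shuffled:
    for key, n in part:
        final[key] += n

print(sorted(final.items()))  # [('a', 2), ('b', 1), ('c', 1)]
```

Because every occurrence of a key hashes to the same shuffle partition, the final counts match the `groupBy("key").count()` output shown earlier.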


## 3. Spark UI metrics (screenshot)
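Open http://localhost:4040 while the Spark session is still running and after at least one action has executed, then record (with a screenshot) Files Read, Input Size, and Shuffle Read/Write. The UI is served by the driver only while the application is alive. Port 4040 is the default; if it is already taken, Spark tries 4041, 4042, and so on.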
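
If port 4040 is not where the UI ended up, the driver can report the actual address through `spark.sparkContext.uiWebUrl`. The stage-level numbers shown in the UI are also exposed by Spark's monitoring REST API under `/api/v1`. The sketch below is one possible way to total them; the helper names `fetch_stages` and `summarize_stages` are hypothetical, and the field names (`inputBytes`, `shuffleReadBytes`, `shuffleWriteBytes`) are taken from the stage objects that API returns, so check them against your Spark version. Files Read is not covered here; read it from the UI.

```python
import json
from urllib.request import urlopen


def fetch_stages(ui_url):
    """Fetch stage metrics for the first application listed by the Spark REST API."""
    with urlopen(f"{ui_url}/api/v1/applications") as resp:
        app_id = json.load(resp)[0]["id"]
    with urlopen(f"{ui_url}/api/v1/applications/{app_id}/stages") as resp:
        return json.load(resp)


def summarize_stages(stages):
    """Sum input and shuffle byte counters over a list of stage dictionaries."""
    fields = ("inputBytes", "shuffleReadBytes", "shuffleWriteBytes")
    return {f: sum(stage.get(f, 0) for stage in stages) for f in fields}


# Usage, assuming a live session named `spark` as created in Section 1:
#   ui_url = spark.sparkContext.uiWebUrl   # for example http://<host>:4040
#   print(summarize_stages(fetch_stages(ui_url)))
```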

## 4. Optional: RDD quick check (for Hadoop+Spark profile)

In [4]:
rdd = spark.sparkContext.parallelize([1,2,3,4,5])
print(rdd.map(lambda x: x*2).collect())


[2, 4, 6, 8, 10]



## 5. Save evidence

In [11]:
from contextlib import redirect_stdout
from io import StringIO
from pathlib import Path

from pyspark.sql import SparkSession, functions as F

# 1. Create the SparkSession if needed (getOrCreate reuses the one already running in the kernel)
spark = SparkSession.builder.appName("lab0").getOrCreate()

# 2. Build a DataFrame with a computed column 'id_mod_2' = id % 2
df = spark.range(10).withColumn("id_mod_2", F.expr("id % 2"))

# 3. Group by the computed column and count
result = df.groupBy("id_mod_2").count()

# 4. Capture the execution plan (formatted mode) and write it to a file as evidence.
#    explain() prints to Python's stdout, so redirect it into a buffer.
buf = StringIO()
with redirect_stdout(buf):
    result.explain(mode="formatted")
Path("lab0_plan.txt").write_text(buf.getvalue(), encoding="utf-8")
print("Saved lab0_plan.txt")


25/11/12 10:54:11 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Saved lab0_plan.txt
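
Alongside the plan, it can help to keep the environment details from Section 1 in a file. The sketch below is a minimal, standard-library-only way to do that; the file name `lab0_env.json` and the helper `collect_env` are hypothetical choices, not part of the lab requirements.

```python
import json
import os
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def collect_env():
    """Gather interpreter, platform, and PySpark version details as a dict."""
    try:
        pyspark_version = metadata.version("pyspark")
    except metadata.PackageNotFoundError:
        pyspark_version = None  # PySpark is not pip-installed in this environment
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "pyspark": pyspark_version,
        "spark_home": os.environ.get("SPARK_HOME", "<pip-only>"),
    }


env = collect_env()
Path("lab0_env.json").write_text(json.dumps(env, indent=2), encoding="utf-8")
print("Saved lab0_env.json")
```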
