# Download from GitHub and upload to the Lakehouse

In [13]:
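import requests

raw_url = "https://raw.githubusercontent.com/rritec/Microsoft-Fabric/main/Labdata/titanic.parquet"
local_path = "/tmp/titanic.parquet"
lakehouse_path = "Files/titanic.parquet"

# Download the file; fail fast on HTTP errors or a stalled connection.
response = requests.get(raw_url, timeout=30)
response.raise_for_status()

# Stage the bytes on the driver's local disk.
with open(local_path, "wb") as f:
    f.write(response.content)

# Copy from local disk into the Files area of the attached Lakehouse.
# mssparkutils is predefined in Fabric notebooks, so no import is needed.
mssparkutils.fs.cp(
    f"file:{local_path}",
    lakehouse_path
)

print("✅ Parquet file successfully copied to Lakehouse")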



✅ Parquet file successfully copied to Lakehouse
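A download can succeed with HTTP 200 and still deliver something other than Parquet, for example an HTML error page. The sketch below is a cheap sanity check, not a full validation: it relies on the fact that an unencrypted Parquet file begins and ends with the 4-byte marker `PAR1`. The helper name `looks_like_parquet` is hypothetical and not part of any library.

```python
import os
from pathlib import Path

PARQUET_MAGIC = b"PAR1"  # marker at both the start and the end of a Parquet file


def looks_like_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes."""
    data_path = Path(path)
    # Smallest plausible file: leading magic + 4-byte footer length + trailing magic.
    if data_path.stat().st_size < 12:
        return False
    with data_path.open("rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Calling `looks_like_parquet("/tmp/titanic.parquet")` before the copy step would catch a bad download before it reaches the Lakehouse.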


# Reading `titanic.parquet` from Lakehouse (Best Performance Approach)

## ✅ Recommended Way (Spark + Lakehouse)

**Parquet is a columnar format** that stores its schema inside the file, so Spark in Microsoft Fabric reads it efficiently when you:
- Read the Parquet file directly
- Rely on the stored schema instead of inferring one
- Select only the required columns (column pruning)
- Apply filters early (predicate pushdown)
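To make predicate pushdown less abstract, here is a toy model in plain Python. It is not how Spark or Parquet are implemented. It only mimics one idea: Parquet keeps min/max statistics per row group, so a reader can skip a whole group when the statistics prove that no row can match. The `RowGroup` class and the grouping of rows are hypothetical; the fares are the five values from the sample rows.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Toy stand-in for a Parquet row group holding only Fare values."""
    fares: list

    @property
    def max_fare(self):
        # A real Parquet file stores this statistic in its metadata;
        # here it is computed on demand to keep the sketch short.
        return max(self.fares)


def scan_fares_above(row_groups, threshold):
    """Return the matching fares and the number of row groups actually read."""
    matches, groups_read = [], 0
    for group in row_groups:
        # If even the largest value fails the filter, no row in this
        # group can match, so the group is skipped without reading rows.
        if group.max_fare <= threshold:
            continue
        groups_read += 1
        matches.extend(fare for fare in group.fares if fare > threshold)
    return matches, groups_read


groups = [
    RowGroup([7.25, 7.925, 8.05]),  # skipped: max is 8.05
    RowGroup([71.2833, 53.1]),      # read: max is 71.2833
]
matches, groups_read = scan_fares_above(groups, 30)
print(matches, groups_read)
```

With the filter `Fare > 30`, only one of the two groups is read. The same principle lets Spark avoid reading data that a filter would discard anyway.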

In [14]:
titanic_df = spark.read.format("parquet").load("Files/titanic.parquet")
titanic_df.show(5)


+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|PassengerId|Survived|Pclass|                Name|   Sex| Age|SibSp|Parch|          Ticket|   Fare|Cabin|Embarked|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+
|          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|    1|    0|       A/5 21171|   7.25| NULL|       S|
|          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|    1|    0|        PC 17599|71.2833|  C85|       C|
|          3|       1|     3|Heikkinen, Miss. ...|female|26.0|    0|    0|STON/O2. 3101282|  7.925| NULL|       S|
|          4|       1|     1|Futrelle, Mrs. Ja...|female|35.0|    1|    0|          113803|   53.1| C123|       S|
|          5|       0|     3|Allen, Mr. Willia...|  male|35.0|    0|    0|          373450|   8.05| NULL|       S|
+-----------+--------+------+--------------------+------+----+-----+-----+----------------+-------+-----+--------+

# Performance-Optimized Read
- Select only the required columns, so Spark scans just those columns from the Parquet file (column pruning)

In [18]:
titanic_df = spark.read.parquet(
    "Files/titanic.parquet"
).select("PassengerId", "Survived", "Pclass", "Sex", "Fare")

titanic_df.show(5)



# Apply Filters Early (Predicate Pushdown)

In [17]:
titanic_df = spark.read.parquet(
    "Files/titanic.parquet"
).filter("Survived = 1 AND Fare > 30")

titanic_df.show(5)
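Column selection and filtering can be combined in a single read. The sketch below wraps both in one function; the name `read_high_fare_survivors` is hypothetical, and it assumes `spark` is the session that Fabric notebooks predefine.

```python
def read_high_fare_survivors(spark, path="Files/titanic.parquet"):
    """Read only the needed columns and rows from the Parquet file."""
    return (
        spark.read.parquet(path)
        # Column pruning: only these five columns are requested.
        .select("PassengerId", "Survived", "Pclass", "Sex", "Fare")
        # Filter stated right after the read, so Spark can push it to the scan.
        .filter("Survived = 1 AND Fare > 30")
    )


# In a Fabric notebook, where `spark` is predefined:
# survivors_df = read_high_fare_survivors(spark)
# survivors_df.explain()  # prints the physical plan
```

In the output of `explain()`, the Parquet scan node normally lists `PushedFilters` and `ReadSchema` entries, which is one way to check whether the filter and the column selection actually reached the reader.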



# Cache Only If Reused

In [19]:
titanic_df.cache()



DataFrame[PassengerId: bigint, Survived: bigint, Pclass: bigint, Sex: string, Fare: double]
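Caching pays off only when the same DataFrame feeds several actions, and the cached data occupies executor memory until it is released. One possible pattern, sketched here under the assumption that the reuse is confined to one block of code, is a small context manager that pairs `cache()` with `unpersist()`. The helper name `cached` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache a DataFrame for the duration of a block, then release it."""
    df.cache()  # lazy: the data is stored when the first action runs
    try:
        yield df
    finally:
        df.unpersist()  # free the cached data once the reuse is over


# In a Fabric notebook:
# with cached(titanic_df) as df:
#     df.count()                           # first action fills the cache
#     df.groupBy("Pclass").count().show()  # later actions reuse it
```

If a DataFrame is used for a single action, skipping the cache entirely is usually the cheaper choice.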

# Q&A


1. Which format provides the best read performance in Microsoft Fabric Spark?

    A. CSV

    B. JSON

    C. XML

    D. Parquet

Answer: D

2. Why is Parquet faster than CSV in Fabric?

    A. It stores data as text

    B. It is row-based

    C. It is columnar and supports column pruning

    D. It requires schema inference

Answer: C

3. Which Spark operation improves performance by reading only required columns?

    A. Caching

    B. Column pruning using select()

    C. Repartition

    D. Collect

Answer: B

4. Applying filters while reading Parquet helps because of:

    A. Lazy evaluation

    B. Predicate pushdown

    C. Broadcast join

    D. Shuffle reduction

Answer: B

5. Where should titanic.parquet be stored for Spark processing in Fabric?

    A. SQL Analytics tables

    B. Warehouse

    C. Lakehouse Files

    D. Power BI Dataset

Answer: C