# Final Exam - Data Analytics with Apache Spark

## Rules
1. Do not rename the placeholder variables
2. The required results / outputs must be assigned to the corresponding placeholder variables. Otherwise, no points will be awarded.
3. Variables prefixed with `df_` are DataFrames. You must assign a DataFrame to each of those variables.

## Init codes - No changes required

##### Init Spark session using the below cell

In [21]:
from pyspark.sql import SparkSession
# the following Spark session creation contains jars which support writing to Delta format
# note: check this link to see which version of Delta is compatible with your version of PySpark: https://docs.delta.io/latest/releases.html
spark = SparkSession.builder \
    .appName("Final Exam Application")  \
    .config("spark.sql.shuffle.partitions", "2")    \
    .config("spark.sql.streaming.forceDeleteTempCheckpointLocation", "true")    \
    .config("spark.sql.streaming.checkpointLocation", "final_exam/streaming_checkpoints")    \
	.config('spark.jars.packages', 'io.delta:delta-spark_2.12:3.2.0,graphframes:graphframes:0.8.4-spark3.5-s_2.12')	\
	.config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')	\
	.config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')	\
    .getOrCreate()

In [22]:
sc = spark.sparkContext
sc.setCheckpointDir('final_exam/spark_checkpoint')

In [23]:
input_path = "final_exam/12345678/data/input"
input_streams_path = "final_exam/12345678/data/input/streams"
output_path = "final_exam/12345678/data/output"

## Data Engineering, Preprocessing, and Cleansing with Apache Spark

### Question 1: DataFrame Creation
- Create a DataFrame from each of the CSV files listed below, starting with the user data.
- The CSV files are located under `input_path`.
	- User data: "user_data.csv"
	- Sales data: "sales_data.csv"
	- Transactions data: "transactions.csv"
	- Orders data: "orders.csv"
	- Customers data: "customer_data.csv"
	- and others ...

- Ensure that the DataFrame includes the header from the CSV file.


In [24]:
# Read the CSV file into a DataFrame
user_data_file_path = f"{input_path}/user_data.csv"
sales_data_file_path = f"{input_path}/sales_data.csv"
transaction_data_file_path = f"{input_path}/transactions.csv"
orders_data_file_path = f"{input_path}/orders.csv"
customer_data_file_path = f"{input_path}/customer_data.csv"
# and others ...

df_user = spark.read.option("header", "true").option("inferSchema", "true").csv(user_data_file_path)
df_sales = spark.read.option("header", "true").option("inferSchema", "true").csv(sales_data_file_path)
df_transactions = spark.read.option("header", "true").option("inferSchema", "true").csv(transaction_data_file_path)
df_orders = spark.read.option("header", "true").option("inferSchema", "true").csv(orders_data_file_path)
df_customer_data = spark.read.option("header", "true").option("inferSchema", "true").csv(customer_data_file_path)
# and others ...

# Show the DataFrames
df_user.show(10)
# and others ...

+------+------------+--------------------+
|userId|        name|               email|
+------+------------+--------------------+
|     1|     JohnDoe| johndoe@example.com|
|     2|   JaneSmith|janesmith@example...|
|     3|    BobBrown|bobbrown@example.com|
|     4|AliceJohnson|alicejohnson@exam...|
|     5|CharlieDavis|charliedavis@exam...|
|     6|  DianaMoore|dianamoore@exampl...|
|     7| EthanTaylor|ethantaylor@examp...|
|     8| GraceWilson|gracewilson@examp...|
|     9|HankAnderson|hankanderson@exam...|
|    10| IvyThompson|ivythompson@examp...|
+------+------------+--------------------+



### Question 2: Data Cleaning
In `df_sales`, some entries in the `price` column are null. 
Your task is to remove rows where the `price` is null and display the cleaned DataFrame.

In [25]:
# Remove rows with null values in the price column
df_cleaned_sales = df_sales.filter(df_sales.price.isNotNull())

# Show the cleaned DataFrame
df_cleaned_sales.show()

+-------+----------+----------+------+-----+--------+--------+----------------+-------------------+
|OrderID|      Date|CustomerID|ItemID|price|quantity|s_market|f_original_price|s_original_currency|
+-------+----------+----------+------+-----+--------+--------+----------------+-------------------+
|OR12345|15.10.2015|  CUS54343|IT8798|   45|       1|      HU|        14234.76|                HUF|
|OR12345|15.10.2015|  CUS54343|IT2235|   30|       1|      HU|         9546.23|                HUF|
|OR12345|15.10.2015|  CUS54343|IT8840|   25|       1|      HU|         3452.45|                HUF|
|OR12346|16.10.2015|  CUS54344|IT8799|   50|       2|      HU|         15000.0|                HUF|
|OR12346|16.10.2015|  CUS54344|IT2240|   20|      -1|      HU|         -7500.0|                HUF|
|OR12347|17.10.2015|  CUS54345|IT8800|  -30|      -1|      HU|         -9000.0|                HUF|
|OR12348|18.10.2015|  CUS54346|IT8801|  -25|      -1|      HU|         -5000.0|                HUF|


### Question 3: Grouping and Aggregation
Given a DataFrame `df_transactions` containing transaction records, group the data by `storeId` and calculate the total sales for each store. Display the results in descending order of total sales.

In [26]:
# Group by storeId and calculate total sales
df_total_sales = df_transactions.groupBy("storeId").agg({"sales": "sum"}).orderBy("sum(sales)", ascending=False)

# Show the results
df_total_sales.show()

+-------+----------+
|storeId|sum(sales)|
+-------+----------+
|     25|       278|
|   NULL|        85|
|     26|        51|
|      3|        48|
|      2|        29|
|     27|        10|
+-------+----------+
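
As a sanity check on the totals above, the same grouping can be reproduced in plain Python from the ten rows that `df_transactions.show(10)` prints under Question 9. This is only an illustrative sketch, not part of the graded answer: the `rows` list is copied by hand from that output, and `None` stands in for the NULL `storeId`.

```python
from collections import defaultdict

# (storeId, sales) pairs copied by hand from the df_transactions output; None = NULL
rows = [(25, 17), (2, 29), (25, 93), (3, 48), (None, 85),
        (25, 57), (25, 56), (25, 55), (26, 51), (27, 10)]

totals = defaultdict(int)
for store_id, sales in rows:
    totals[store_id] += sales  # rows with a NULL key end up in their own group

# Sort by total sales, descending
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

The NULL group appears in the Spark result as well, so whether it should be dropped before aggregating is a judgment call that the question leaves open.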



### Question 4: Handling Duplicate Records
You have a DataFrame `df_orders` with potential duplicate entries based on the `orderId`. Write code to remove duplicates while keeping the first occurrence of each order.


In [27]:
# Remove duplicate orders based on orderId
df_unique_order = df_orders.dropDuplicates(["orderId"])

# Show unique orders
df_unique_order.show()

+-------+----------+----------+-------+
|orderId|      date|customerId| itemId|
+-------+----------+----------+-------+
|OR10001|2024-01-01|  CUS10001|ITEM001|
|OR10002|2024-01-02|  CUS10002|ITEM002|
|OR10003|2024-01-03|  CUS10003|ITEM003|
|OR10004|2024-01-04|  CUS10004|ITEM004|
|OR10005|2024-01-05|  CUS10005|ITEM005|
|OR10006|2024-01-06|  CUS10006|ITEM006|
|OR10007|2024-01-07|  CUS10007|ITEM007|
|OR10008|2024-01-08|  CUS10008|ITEM008|
|OR10009|2024-01-09|  CUS10009|ITEM009|
|OR10010|2024-01-10|  CUS10010|ITEM010|
+-------+----------+----------+-------+
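
`dropDuplicates` keeps one row per `orderId`, but as far as I know Spark does not promise which of the duplicates survives once the data is spread across partitions. The plain-Python sketch below only illustrates the intended "keep the first occurrence" semantics. The `keep_first` helper and the sample rows are hypothetical and are not taken from `orders.csv`.

```python
# Hypothetical rows (orderId, date); the third row duplicates OR10001
orders = [("OR10001", "2024-01-01"), ("OR10002", "2024-01-02"),
          ("OR10001", "2024-01-05"), ("OR10003", "2024-01-03")]

def keep_first(rows, key_index=0):
    """Return rows with duplicates removed, keeping the first row seen per key."""
    seen = set()
    unique = []
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

unique_orders = keep_first(orders)
print(unique_orders)
```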



### Question 5: Complex Filtering
Using a DataFrame `df_customer_data`, filter out customers who have not made any purchases in the last year. Assume there is a column `last_purchase_date`. Display only customer IDs and names.

In [28]:
from pyspark.sql.functions import current_date, datediff

# Filter customers based on last purchase date
df_active_customers = df_customer_data.filter(datediff(current_date(), df_customer_data.last_purchase_date) <= 365).select("customerId", "name")

# Show active customers
df_active_customers.show()

+----------+----+
|customerId|name|
+----------+----+
+----------+----+
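
The empty result is consistent with every `last_purchase_date` in the file being more than 365 days before the run date, although the notebook does not show the data to confirm this. The sketch below mirrors the `datediff(current_date(), last_purchase_date) <= 365` condition in plain Python. The `is_active` helper, the fixed `today`, and the sample dates are assumptions chosen only to make the example reproducible.

```python
from datetime import date

def is_active(last_purchase, today, max_days=365):
    """Mirror datediff(today, last_purchase) <= max_days using calendar days."""
    return (today - last_purchase).days <= max_days

today = date(2024, 6, 1)  # hypothetical, fixed "current date"

recent = is_active(date(2023, 12, 15), today)  # purchase within the last year
stale = is_active(date(2022, 1, 10), today)    # purchase more than a year ago
print(recent, stale)
```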



## Feature Engineering

### Question 6: Feature Selection
You have a DataFrame `df_training_data` with multiple features. Select only the features feature1, feature2, and label for your model training.


In [29]:
training_data_file_path = f"{input_path}/training_data.csv"
df_training_data = spark.read.option("header", "true").option("inferSchema", "true").csv(training_data_file_path)
df_training_data.show(10)

+--------+-----+-----+
|feature1|value|label|
+--------+-----+-----+
|     0.1| NULL|    1|
|     0.2| NULL|    0|
|    0.15| NULL|    1|
|    0.25| NULL|    0|
|    0.35| NULL|    1|
|    0.45| NULL|    0|
|    0.55| NULL|    1|
|    0.65| NULL|    0|
|    0.75| NULL|    1|
|    0.85| NULL|    0|
+--------+-----+-----+



In [30]:

# Select the features for model training; the data shown above has no feature2 column, so only feature1 and label are kept
df_selected_features = df_training_data.select("feature1", "label")

# Show selected features
df_selected_features.show()

+--------+-----+
|feature1|label|
+--------+-----+
|     0.1|    1|
|     0.2|    0|
|    0.15|    1|
|    0.25|    0|
|    0.35|    1|
|    0.45|    0|
|    0.55|    1|
|    0.65|    0|
|    0.75|    1|
|    0.85|    0|
+--------+-----+



### Question 7: Creating New Features
From a DataFrame `df_employee`, create a new feature called years_of_experience by subtracting the start_year from the current year. Display the updated DataFrame with this new feature.

In [31]:
employee_data_file_path = f"{input_path}/employee_data.csv"
df_employee = spark.read.option("header", "true").option("inferSchema", "true").csv(employee_data_file_path)
df_employee.show(10)

+----------+----------+
|employeeId|start_year|
+----------+----------+
|    EMP001|      2018|
|    EMP002|      2019|
|    EMP003|      2020|
|    EMP004|      2017|
|    EMP005|      2016|
|    EMP006|      2018|
|    EMP007|      2019|
|    EMP008|      2020|
|    EMP009|      2021|
|    EMP010|      2022|
+----------+----------+



In [32]:
from pyspark.sql.functions import year, current_date, col


# Add years_of_experience feature
df_employee_with_experience = df_employee.withColumn("years_of_experience", year(current_date()) - col("start_year"))

# Show updated employee data
df_employee_with_experience.show()

+----------+----------+-------------------+
|employeeId|start_year|years_of_experience|
+----------+----------+-------------------+
|    EMP001|      2018|                  6|
|    EMP002|      2019|                  5|
|    EMP003|      2020|                  4|
|    EMP004|      2017|                  7|
|    EMP005|      2016|                  8|
|    EMP006|      2018|                  6|
|    EMP007|      2019|                  5|
|    EMP008|      2020|                  4|
|    EMP009|      2021|                  3|
|    EMP010|      2022|                  2|
+----------+----------+-------------------+



### Question 8: One-Hot Encoding
You have a categorical feature `categories` in your DataFrame `df_products`. Perform one-hot encoding on this feature and display the resulting DataFrame.

In [33]:
products_data_file_path = f"{input_path}/products.csv"
df_products = spark.read.option("header", "true").option("inferSchema", "true").csv(products_data_file_path)
df_products.show(10)

+---------+----------+--------------+
|productId|      name|    categories|
+---------+----------+--------------+
|     P001|    Laptop|   Electronics|
|     P002|Smartphone|   Electronics|
|     P003|    Tablet|   Electronics|
|     P004|     Shoes|      Footwear|
|     P005|     Shirt|      Clothing|
|     P006|     Pants|      Clothing|
|     P007|       Hat|   Accessories|
|     P008|     Watch|   Accessories|
|     P009|  Backpack|Bags & Luggage|
|     P010|Sunglasses|   Accessories|
+---------+----------+--------------+



In [34]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer


# String indexer for category feature
indexer = StringIndexer(inputCol="categories", outputCol="category_index")
model = indexer.fit(df_products)
df_indexed = model.transform(df_products)

# One-hot encoding for indexed category feature
encoder = OneHotEncoder(inputCols=["category_index"], outputCols=["category_ohe"])
ohe_model = encoder.fit(df_indexed)
df_encoded = ohe_model.transform(df_indexed)

# Show encoded features
df_encoded.show()

+---------+----------+--------------+--------------+-------------+
|productId|      name|    categories|category_index| category_ohe|
+---------+----------+--------------+--------------+-------------+
|     P001|    Laptop|   Electronics|           1.0|(4,[1],[1.0])|
|     P002|Smartphone|   Electronics|           1.0|(4,[1],[1.0])|
|     P003|    Tablet|   Electronics|           1.0|(4,[1],[1.0])|
|     P004|     Shoes|      Footwear|           4.0|    (4,[],[])|
|     P005|     Shirt|      Clothing|           2.0|(4,[2],[1.0])|
|     P006|     Pants|      Clothing|           2.0|(4,[2],[1.0])|
|     P007|       Hat|   Accessories|           0.0|(4,[0],[1.0])|
|     P008|     Watch|   Accessories|           0.0|(4,[0],[1.0])|
|     P009|  Backpack|Bags & Luggage|           3.0|(4,[3],[1.0])|
|     P010|Sunglasses|   Accessories|           0.0|(4,[0],[1.0])|
+---------+----------+--------------+--------------+-------------+
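
The index and vector columns above can be reproduced by hand. The sketch below assumes the default settings of both transformers: `StringIndexer` orders labels by descending frequency (ties broken alphabetically), and `OneHotEncoder` drops the last category so that it encodes as an all-zero vector. It uses dense Python lists rather than Spark's sparse vectors, and the `one_hot_drop_last` helper is a hypothetical name.

```python
from collections import Counter

# The categories column, copied from the df_products output above
categories = ["Electronics", "Electronics", "Electronics", "Footwear", "Clothing",
              "Clothing", "Accessories", "Accessories", "Bags & Luggage", "Accessories"]

# Index labels by descending frequency, breaking ties alphabetically
counts = Counter(categories)
labels = sorted(counts, key=lambda label: (-counts[label], label))
index = {label: i for i, label in enumerate(labels)}

def one_hot_drop_last(label):
    """Dense one-hot vector of length len(labels) - 1; the last index maps to all zeros."""
    vec = [0.0] * (len(labels) - 1)
    if index[label] < len(vec):
        vec[index[label]] = 1.0
    return vec

print(index)
print(one_hot_drop_last("Electronics"))
print(one_hot_drop_last("Footwear"))
```

This explains why `Footwear` (index 4.0) shows up as `(4,[],[])`: it is the dropped last category, not a missing value.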



### Question 9: Normalization of Features
Normalize the amount feature (the `sales` column) in your DataFrame `df_transactions`. Use Min-Max scaling to achieve this. Display the transformed DataFrame with normalized values.


In [35]:
df_transactions.show(10)

+-------------+---------+-----+-------+---------+-----+
|transactionId|predError|value|storeId|productId|sales|
+-------------+---------+-----+-------+---------+-----+
|            1|     NULL| NULL|     25|        1|   17|
|            2|        6|    7|      2|        2|   29|
|            3|        3| NULL|     25|        3|   93|
|            4|     NULL| NULL|      3|        2|   48|
|            5|     NULL| NULL|   NULL|        2|   85|
|            6|        3|    2|     25|        2|   57|
|            7|        4|    4|     25|        3|   56|
|            8|        5| NULL|     25|        4|   55|
|            9|        6| NULL|     26|        5|   51|
|           10|        7|    8|     27|        6|   10|
+-------------+---------+-----+-------+---------+-----+



In [36]:
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.feature import VectorAssembler

vector_assembler = VectorAssembler(inputCols=["sales"], outputCol="sales_vector")
vectorized_transactions = vector_assembler.transform(df_transactions)

scaler = MinMaxScaler(inputCol="sales_vector", outputCol="scaled_sales")
scaler_model = scaler.fit(vectorized_transactions)
scaled_transactions = scaler_model.transform(vectorized_transactions)

# Show scaled amounts
scaled_transactions.select("scaled_sales").show()

+--------------------+
|        scaled_sales|
+--------------------+
|[0.08433734939759...|
|[0.22891566265060...|
|               [1.0]|
|[0.45783132530120...|
|[0.9036144578313253]|
|[0.5662650602409639]|
|[0.5542168674698795]|
|[0.5421686746987953]|
|[0.49397590361445...|
|               [0.0]|
+--------------------+
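
Min-Max scaling to the default range [0, 1] maps each value x to (x - min) / (max - min). The short check below recomputes the scaled values in plain Python from the `sales` column shown earlier. It is a sketch for verification only, with the values copied by hand from the output of `df_transactions.show(10)`.

```python
# sales values copied by hand from the df_transactions output
sales = [17, 29, 93, 48, 85, 57, 56, 55, 51, 10]

lo, hi = min(sales), max(sales)
scaled = [(x - lo) / (hi - lo) for x in sales]

print([round(s, 4) for s in scaled])
```

The largest value (93) maps to 1.0 and the smallest (10) maps to 0.0, matching the third and last rows of the Spark output.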



### Question 10: Preparing Training and Test Datasets

In [37]:
training_data_file_path = f"{input_path}/training_data.csv"
df_training_data = spark.read.option("header", "true").option("inferSchema", "true").csv(training_data_file_path)
df_training_data.show(10)

+--------+-----+-----+
|feature1|value|label|
+--------+-----+-----+
|     0.1| NULL|    1|
|     0.2| NULL|    0|
|    0.15| NULL|    1|
|    0.25| NULL|    0|
|    0.35| NULL|    1|
|    0.45| NULL|    0|
|    0.55| NULL|    1|
|    0.65| NULL|    0|
|    0.75| NULL|    1|
|    0.85| NULL|    0|
+--------+-----+-----+



In [38]:
test_data_file_path = f"{input_path}/test_data.csv"
df_test_data = spark.read.option("header", "true").option("inferSchema", "true").csv(test_data_file_path)
df_test_data.show(10)

+--------+-----+-----+
|feature1|value|label|
+--------+-----+-----+
|    0.12| NULL|  1.0|
|    0.22| NULL|  0.0|
|    0.18| NULL|  1.0|
|    0.28| NULL|  0.0|
|    0.38| NULL|  1.0|
|    0.48| NULL|  0.0|
|    0.58| NULL|  1.0|
|    0.68| NULL|  0.0|
|    0.78| NULL|  1.0|
|    0.88| NULL|  0.0|
+--------+-----+-----+



## Train New Models, Evaluate Models, Select Model Performance

### Question 11: Model Training
Train a linear regression model using the DataFrame `df_training_data`, which includes features and labels. Display the coefficients of the trained model.


In [39]:
from pyspark.ml.regression import LinearRegression

vector_assembler = VectorAssembler(inputCols=["feature1"], outputCol="features")
df_vectorized_training = vector_assembler.transform(df_training_data)

# Train a linear regression model
lr = LinearRegression(featuresCol='features', labelCol='label')
lr_model = lr.fit(df_vectorized_training)

# Display model coefficients
print(lr_model.coefficients)

[-0.40257648953301134]


### Question 12: Model Evaluation Metrics
Evaluate your trained linear regression model using RMSE (Root Mean Square Error) on test data. Display the RMSE value obtained from evaluation.

In [40]:
from pyspark.ml.evaluation import RegressionEvaluator

df_vectorized_test = vector_assembler.transform(df_test_data)

# Evaluate model performance on test data
predictions = lr_model.transform(df_vectorized_test)
evaluator = RegressionEvaluator(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse_value = evaluator.evaluate(predictions)

print(f"RMSE: {rmse_value}")

RMSE: 0.49014973017959534
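
For reference, RMSE is the square root of the mean squared difference between labels and predictions. The sketch below implements that formula in plain Python on hypothetical values; the `rmse` helper and both lists are made up for illustration and are not the model's actual predictions.

```python
import math

def rmse(labels, predictions):
    """Root mean square error between two equally long sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

labels = [1.0, 0.0, 1.0, 0.0]       # hypothetical, balanced 0/1 labels
predictions = [0.5, 0.5, 0.5, 0.5]  # hypothetical constant prediction

baseline = rmse(labels, predictions)
print(baseline)
```

If the test labels are balanced between 0 and 1, as the ten displayed rows suggest, a constant prediction of 0.5 would score an RMSE of 0.5, so the reported 0.49 is only a small improvement over that baseline.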


### Question 13: Cross-Validation for Hyperparameter Tuning
Set up cross-validation for hyperparameter tuning of your linear regression model using a parameter grid for regularization parameter (regParam). Display best parameters after fitting.

In [41]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

paramGrid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01]).build()
crossval = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3)

cv_model = crossval.fit(df_vectorized_training)
best_model_params = cv_model.bestModel.extractParamMap()

print(best_model_params)

{Param(parent='LinearRegression_511695632ff9', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, Param(parent='LinearRegression_511695632ff9', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0, Param(parent='LinearRegression_511695632ff9', name='epsilon', doc='The shape parameter to control the amount of robustness. Must be > 1.0. Only valid when loss is huber'): 1.35, Param(parent='LinearRegression_511695632ff9', name='featuresCol', doc='features column name.'): 'features', Param(parent='LinearRegression_511695632ff9', name='fitIntercept', doc='whether to fit an intercept term.'): True, Param(parent='LinearRegression_511695632ff9', name='labelCol', doc='label column name.'): 'label', Param(parent='LinearRegression_511695632ff9', name='loss', doc='The loss function to be optimized. Supported options: squaredError, huber.'): 'squaredError', Par

### Question 14: Model Selection Based on Tuned Hyperparameter
Train the model with the `best_model_params`

In [42]:
best_model = lr.fit(df_vectorized_training, params=best_model_params)

### Question 15: Predictions Using Selected Model
Using your best-performing model from previous steps, make predictions on new data stored in new_data. Display predictions alongside actual labels if available.


In [43]:
new_data_file_path = f"{input_path}/new_data.csv"
df_new_data = spark.read.option("header", "true").option("inferSchema", "true").csv(new_data_file_path)
df_vectorized_new_data = vector_assembler.transform(df_new_data)
df_vectorized_new_data.show(10)

+--------+-----+-----+--------+
|feature1|value|label|features|
+--------+-----+-----+--------+
|       0|     |     |   [0.0]|
|       1|     |     |   [1.0]|
|       2|     |     |   [2.0]|
|       3|     |     |   [3.0]|
|       4|     |     |   [4.0]|
|       5|     |     |   [5.0]|
|       6|     |     |   [6.0]|
|       7|     |     |   [7.0]|
|       8|     |     |   [8.0]|
|       9|     |     |   [9.0]|
+--------+-----+-----+--------+



In [44]:
predictions_new_data = best_model.transform(df_vectorized_new_data)
predictions_new_data.select("prediction", "label").show()

+--------------------+-----+
|          prediction|label|
+--------------------+-----+
|  0.6442565754159957|     |
|  0.3087761674718197|     |
|-0.02670424047235631|     |
| -0.3621846484165323|     |
| -0.6976650563607083|     |
| -1.0331454643048843|     |
| -1.3686258722490603|     |
| -1.7041062801932365|     |
| -2.0395866881374123|     |
|  -2.375067096081588|     |
+--------------------+-----+



## Spark Structured Streaming

### Question 16: Basic Streaming Setup
- Set up a basic streaming job that reads data from a CSV source
- Leave trigger mode as default (runs micro-batch as soon as it can)
- Write the stream to a memory table named `tbl_streaming_01`
- Output mode is set to `update`
- Print the results using Spark SQL
- Stop the Streaming query

In [45]:
# Create streaming context and read from csv source
stream_path = "final_exam/12345678/data/input/streams"
schema = 'timestamp timestamp, value double'

In [46]:
df_streaming_data = spark.readStream.format('csv').option('header','true').schema(schema).load(stream_path)

In [47]:
query = df_streaming_data.writeStream.format('memory').outputMode('update').queryName('tbl_streaming_01').start()

AnalysisException: This query does not support recovering from checkpoint location. Delete final_exam/streaming_checkpoints/tbl_streaming_01/offsets to start over.
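
The `AnalysisException` above points at a checkpoint left behind by an earlier run of `tbl_streaming_01`. One way to get past it, following the hint in the message, is to delete that query's checkpoint directory and then start the query again. The helper below is a sketch under that assumption: `reset_checkpoint` is a hypothetical name, and the path is taken from the error message and the `spark.sql.streaming.checkpointLocation` setting.

```python
import os
import shutil

def reset_checkpoint(checkpoint_root, query_name):
    """Delete the checkpoint directory of one streaming query, if it exists.

    Returns True when a directory was removed, False when there was nothing to delete.
    """
    query_dir = os.path.join(checkpoint_root, query_name)
    if os.path.isdir(query_dir):
        shutil.rmtree(query_dir)
        return True
    return False

# Path assumed from the error message above
removed = reset_checkpoint("final_exam/streaming_checkpoints", "tbl_streaming_01")
print(f"Stale checkpoint removed: {removed}")
```

Run this before re-executing the `writeStream` cell; the cells that query `tbl_streaming_01` depend on that query having started successfully.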

In [None]:
df_tbl_streaming_01 = spark.sql("SELECT * FROM tbl_streaming_01")

df_tbl_streaming_01.show()

In [None]:
query.stop()

### Question 17: Filtering Streaming Data
- Filter incoming streaming data to include only messages with value >= 22.5. 
- Trigger mode is default.
- Output mode is `append`
- Write to an in-memory table named `tbl_streaming_02`
- Print these filtered messages using Spark SQL.
- Stop the Streaming query


In [None]:
# Keep only messages with value >= 22.5, as required by the question
filteredStreamDF = df_streaming_data.filter('value >= 22.5')

In [None]:
# Write the filtered stream (not the raw stream) to the in-memory table
query_filtered = filteredStreamDF.writeStream.format('memory').outputMode('append').queryName('tbl_streaming_02').start()

In [None]:
df_tbl_streaming_02 = spark.sql("SELECT * FROM tbl_streaming_02")

df_tbl_streaming_02.show()

In [None]:
query_filtered.stop()

### Question 18: Aggregating Streaming Data
- Aggregate streaming data to get the average value every 5 seconds and print results to console.
- Trigger mode: `availableNow`
- Output mode: `complete`
- Output format: `console`
- Stop the Streaming query after use


In [None]:
from pyspark.sql.functions import window

In [None]:
df_aggregated_stream = df_streaming_data.groupBy(window('timestamp', '5 second')).avg('value')

In [None]:

query_agg = df_aggregated_stream.writeStream \
	.format("console")	\
	.outputMode("complete")	\
	.trigger(availableNow=True)	\
	.start()

In [None]:
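# With the availableNow trigger the query ends on its own once the available data is processed;
# wait for that before stopping so the results are not cut off
query_agg.awaitTermination()
query_agg.stop()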
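
To make the windowing concrete, the sketch below computes the same kind of 5-second tumbling-window average in plain Python. It is only an illustration: the `tumbling_average` helper is a hypothetical name, the events are made up, and timestamps are simplified to integer seconds since the epoch. Each event is assigned to the window whose start is its timestamp rounded down to a multiple of the window length.

```python
from collections import defaultdict

def tumbling_average(events, window_seconds=5):
    """Average value per fixed, non-overlapping window.

    events: iterable of (epoch_seconds, value) pairs.
    Returns {window_start_seconds: average_value}.
    """
    buckets = defaultdict(list)
    for ts, value in events:
        window_start = ts - ts % window_seconds  # round down to the window boundary
        buckets[window_start].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Hypothetical events: (seconds since epoch, value)
events = [(0, 20.0), (3, 24.0), (5, 22.5), (9, 23.5), (12, 30.0)]
averages = tumbling_average(events)
print(averages)
```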

### Question 19: Writing Streaming Output to File Sink
- Write the streaming output of `filteredStreamDF` to a file sink in Parquet format.
- Trigger mode is `availableNow`
- Output mode is `append`
- Stop the query after use

In [None]:
stream_output_path = "final_exam/12345678/data/output/streams"

In [None]:
query_file_sink = filteredStreamDF.writeStream.format('parquet').outputMode('append').trigger(availableNow=True).start(f'{stream_output_path}/parquet')

In [None]:
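# Wait until the availableNow run has written all available data before stopping
query_file_sink.awaitTermination()
query_file_sink.stop()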

### Question 20: Read and verify the written stream data
- Read the written data into a DataFrame named `df_written_stream`
- Show top 10 rows of the dataframe

In [None]:
df_written_stream = spark.read.parquet(f'{stream_output_path}/parquet')
df_written_stream.show(10)

## Graph Analytics with GraphFrames

### Question 21: Creating GraphFrames from DataFrames
Create a GraphFrame using vertices and edges stored in two separate DataFrames named `vertices` and `edges`. Display basic information about the graph created.


In [None]:
# %pip install graphframes

In [None]:
from graphframes import GraphFrame

v_path = 'final_exam/12345678/data/input/vertices.csv'
e_path = 'final_exam/12345678/data/input/edges.csv'

df_vertices = spark.read.csv(path=v_path, header=True, inferSchema=True)
df_edges = spark.read.csv(path=e_path, header=True,inferSchema=True)

In [None]:
g = GraphFrame(df_vertices, df_edges)

In [None]:
# show() prints the table itself and returns None, so it should not be wrapped in print()
g.vertices.show()
g.edges.show()

### Question 22: Finding Shortest Paths in Graphs
Using your GraphFrame g, find shortest paths from vertex ID 'V002' to all other vertices within a maximum path length of 3. Display results as paths found.


In [None]:
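from pyspark.sql.functions import col

# shortestPaths returns, for every vertex, a map of hop distances to each landmark
results = g.shortestPaths(landmarks=["V002"])
# Keep only the vertices whose path to 'V002' has length 3 or less
results.filter(col("distances")["V002"] <= 3).show()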
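
Note that GraphFrames' `shortestPaths` measures the distance from every vertex to the landmark, whereas the question asks for paths starting at 'V002'. The plain-Python sketch below shows what a hop-limited search from a single source looks like, using breadth-first search over directed edges. The `hops_within` helper and the edge list are hypothetical and are not read from `edges.csv`.

```python
from collections import deque

def hops_within(edges, source, max_hops):
    """Breadth-first search over directed edges.

    Returns {vertex: hop count} for every vertex reachable from source
    in at most max_hops steps (the source itself has distance 0).
    """
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    distances = {source: 0}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if distances[vertex] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(vertex, []):
            if neighbour not in distances:
                distances[neighbour] = distances[vertex] + 1
                queue.append(neighbour)
    return distances

# Hypothetical directed edges
edges = [("V002", "V003"), ("V003", "V004"), ("V004", "V005"),
         ("V005", "V006"), ("V001", "V002")]
reachable = hops_within(edges, "V002", 3)
print(reachable)
```

In this toy graph `V006` is four hops away and `V001` only points into `V002`, so neither appears in the result.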

### Question 23: PageRank Algorithm Implementation
Implement PageRank algorithm on your GraphFrame to find important vertices based on their connectivity.

In [None]:
results_pagerank = g.pageRank(resetProbability=0.15, maxIter=10)
results_pagerank.vertices.select("id", "pagerank").show()

### Question 24: Connected Components Analysis
Perform connected components analysis on your GraphFrame to identify clusters of connected vertices.


In [None]:
result_cc = g.connectedComponents()
result_cc.select("id", "component").show()
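
GraphFrames' default connected-components algorithm needs a Spark checkpoint directory, which is why `sc.setCheckpointDir` is called in the init cells. To illustrate what the result means, the sketch below labels components with a small union-find in plain Python. The helper, vertices, and edges are hypothetical, and labelling each component by its smallest vertex ID is a choice made for this example only; GraphFrames assigns its own numeric component IDs.

```python
def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex ID in its component (edges treated as undirected)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the root, shortening the path along the way
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one so labels are deterministic
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {v: find(v) for v in vertices}

# Hypothetical graph with two clusters
vertices = ["V001", "V002", "V003", "V004", "V005"]
edges = [("V001", "V002"), ("V002", "V003"), ("V004", "V005")]
components = connected_components(vertices, edges)
print(components)
```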

### Question 25: Triangle Count in Graphs
Count triangles formed by vertices in your GraphFrame using triangle counting functionality.


In [None]:
triangle_count_result = g.triangleCount()
triangle_count_result.select("id", "count").show()
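
The `count` column reports, for each vertex, how many triangles pass through it. The sketch below computes the same quantity in plain Python for a small hypothetical graph, ignoring edge direction: a vertex belongs to a triangle whenever two of its neighbours are also connected to each other. The `triangle_counts` helper and the edge list are assumptions for illustration.

```python
from itertools import combinations

def triangle_counts(edges):
    """Count, for every vertex, the triangles it belongs to (edge direction ignored)."""
    neighbours = {}
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)

    counts = {v: 0 for v in neighbours}
    for v, adjacent in neighbours.items():
        # Every pair of neighbours that is itself connected closes a triangle at v
        for a, b in combinations(sorted(adjacent), 2):
            if b in neighbours[a]:
                counts[v] += 1
    return counts

# Hypothetical graph: one triangle A-B-C plus a dangling edge C-D
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
counts = triangle_counts(edges)
print(counts)
```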

## Complete & Cleanup

In [None]:
spark.stop()

In [None]:
### Clean up Checkpoints & Outputs
import os
import shutil

# Define the path to the checkpoint directory
spark_checkpoint_dir = "final_exam/spark_checkpoint/"
stream_checkpoint_dir = "final_exam/streaming_checkpoints/"
directories = [spark_checkpoint_dir, stream_checkpoint_dir]

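# Remove each checkpoint directory if it exists
# (the loop variable is named `directory` to avoid shadowing the built-in `dir`)
for directory in directories:
	if os.path.exists(directory):
		shutil.rmtree(directory)
		print(f"Directory {directory} cleaned up.")
	else:
		print(f"Directory {directory} does not exist.")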

Directory final_exam/spark_checkpoint/* does not exist.
Directory final_exam/streaming_checkpoints/ does not exist.
