# Project Template

## Introduction

To run the code in the same environment as the one provided for the lab session (TP) in class, follow these steps:

1. Copy `Project_template.ipynb`, `Kafka-Producer-for-project.ipynb`, and `stocks.csv` to the `/local` directory.

2. Start the required Kafka services by running `docker-compose -f docker-compose.kafka.yml up`.

3. Run the Kafka producer notebook before executing the code below.


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create (or reuse) the Spark session.
# spark.jars.packages pulls in the Kafka source for Structured Streaming;
# the artifact version should match the installed Spark and Scala versions.
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars.packages", 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0') \
    .config("spark.sql.repl.eagerEval.enabled", True) \
    .getOrCreate()


Be sure to start the stream on Kafka!

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType, TimestampType, DateType

schema = StructType(
      [
        StructField("name", StringType(), False),
        StructField("price", DoubleType(), False),
        StructField("timestamp", TimestampType(), False),
      ]
    )
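
To make the schema concrete, the sketch below decodes one message into the same three fields in plain Python. The sample payload and the helper name `parse_stock_message` are assumptions for illustration; the actual message format is whatever the producer notebook sends.

```python
import json
from datetime import datetime

# Hypothetical payload: the real producer's message format may differ.
raw_message = b'{"name": "AAPL", "price": 189.5, "timestamp": "2023-10-20T14:03:27"}'

def parse_stock_message(raw):
    """Decode one message value into the three fields declared in the schema."""
    record = json.loads(raw.decode("utf-8"))
    return {
        "name": str(record["name"]),                                # StringType
        "price": float(record["price"]),                            # DoubleType
        "timestamp": datetime.fromisoformat(record["timestamp"]),   # TimestampType
    }

parsed = parse_stock_message(raw_message)
print(parsed)
```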

In [None]:
from pyspark.sql.functions import from_json

kafka_server = "kafka1:9092"

lines = (spark.readStream                           # Get the DataStreamReader
  .format("kafka")                                  # Specify the source format as "kafka"
  .option("kafka.bootstrap.servers", kafka_server)  # Configure the Kafka server name and port
  .option("subscribe", "stock")                     # Subscribe to the "stock" Kafka topic
  .option("startingOffsets", "earliest")            # The start point when a query is started
  .option("maxOffsetsPerTrigger", 100)              # Rate limit on max offsets per trigger interval
  .load()                                           # Load the streaming DataFrame
  .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))  # Parse the JSON payload
)

# Flatten the parsed struct into top-level columns: name, price, timestamp
df = lines.select("parsed_value.*")


## Select the N most valuable stocks in a window

## Summary of query 1:
- **Time Window Duration:** The `window_duration` variable is set to "5 minutes," which is the length of each time window used for the analysis.

- **Calculate Total Value:** A new column, "total_value," is added to the DataFrame `df`. It is a copy of the "price" column, so each row carries the stock's price at that timestamp.

- **Group and Aggregate:** The data is grouped by two criteria: the time window (derived from the timestamp) and the stock's name. The code sums "total_value" within each group, which gives one total per stock and per time window.

- **Sort Results:** The aggregated `windowed_df` DataFrame is sorted in descending order of "total_value," so the most valuable stocks appear at the top of the results.

- **Display Results:** The sorted DataFrame is written to the console with Structured Streaming. The output mode is "complete," meaning that the entire updated result table is written on every trigger.



In [None]:
# Import sum under an alias so that it does not shadow Python's built-in sum
from pyspark.sql.functions import col, window, sum as spark_sum

# Define the time window duration
window_duration = "5 minutes"

# Copy the price into a "total_value" column
df_with_total_value = df.withColumn("total_value", col("price"))

# Group the data by time window and stock name, summing the values in each group
windowed_df = df_with_total_value.groupBy(
    window(col("timestamp"), window_duration),
    col("name")
).agg(
    spark_sum(col("total_value")).alias("total_value")
)

# Sort the aggregated DataFrame by total value in descending order
sorted_df = windowed_df.orderBy(col("total_value").desc())

# Write the full result table to the console on every trigger
query = sorted_df.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
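
To see what the grouping, summing, and sorting steps of the query above compute, the sketch below reproduces them in plain Python on a few hand-made records. The records, the helper `window_start`, and the value of `N` are assumptions for illustration; the sketch aligns windows to the Unix epoch and is not the Spark implementation.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
EPOCH = datetime(1970, 1, 1)

def window_start(ts, size=WINDOW):
    """Return the start of the tumbling window that contains ts."""
    return EPOCH + ((ts - EPOCH) // size) * size

# Hypothetical records: (name, price, timestamp)
records = [
    ("AAA", 10.0, datetime(2023, 10, 20, 12, 0, 30)),
    ("AAA", 12.0, datetime(2023, 10, 20, 12, 3, 0)),
    ("BBB", 30.0, datetime(2023, 10, 20, 12, 1, 0)),
    ("AAA", 11.0, datetime(2023, 10, 20, 12, 6, 0)),
    ("BBB", 5.0, datetime(2023, 10, 20, 12, 7, 0)),
]

# Group by (window, name) and sum the prices in each group
totals = defaultdict(float)
for name, price, ts in records:
    totals[(window_start(ts), name)] += price

# Sort by total value, highest first
ranked = sorted(
    ((start, name, total) for (start, name), total in totals.items()),
    key=lambda row: row[2],
    reverse=True,
)

N = 2  # hypothetical choice of N
top_n = ranked[:N]
for start, name, total in top_n:
    print(start, name, total)
```

Keeping only the first `N` rows of the sorted result is what turns the ranking into "the N most valuable stocks".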


## Select the stocks that lost value between two windows

## Summary of query 2:
- **Group and Aggregate by Time Windows:** The code groups the data by time window (set by the `window_duration` variable) and stock name. Within each window, it computes the price difference as the last price minus the first price of each stock, so a negative value means the stock lost value during that window.

- **Define the Batch Function:** A function, `display_price_difference`, is defined. It takes two parameters: `batch_df`, the result DataFrame for a micro-batch, and `batch_id`, the identifier of that batch. It filters the DataFrame to keep the stocks that lost value (price difference < 0) within a window.

- **Apply the Function with foreachBatch:** The function is registered with the `foreachBatch` operation of Structured Streaming, which calls it once per micro-batch. Because the query runs in "complete" output mode, each call receives the full, updated result table. The matching rows are displayed with `show`.

- **Start the Streaming Query:** `start` launches the streaming query, and `awaitTermination` blocks until the query is stopped.


In [None]:
from pyspark.sql.functions import col, window, expr
window_duration = "1 minute"


# Group the data by time window and stock name.
# max_by/min_by (Spark SQL 3.0+) pick the price at the latest/earliest timestamp,
# so price_difference = last price - first price, which is negative on a loss.
windowed_df = df.groupBy(
    window("timestamp", window_duration),
    col("name")
).agg(
    expr("max_by(price, timestamp) - min_by(price, timestamp)").alias("price_difference")
)

# Function called by foreachBatch on every micro-batch
def display_price_difference(batch_df, batch_id):
    # Keep only the stocks that lost value within their window
    loss_df = batch_df.filter(col("price_difference") < 0)
    loss_df.show(truncate=False)

# Register the function with foreachBatch and start the query
query = windowed_df.writeStream \
    .outputMode("complete") \
    .foreachBatch(display_price_difference) \
    .start()

query.awaitTermination()
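
A loss can only be detected with a signed measure: the maximum price minus the minimum price of a window is never negative, whereas the last price minus the first price is negative exactly when the stock ends the window lower than it started. The plain-Python sketch below contrasts the two on hypothetical ticks; the helper names `spread` and `signed_change` are illustrative.

```python
from datetime import datetime

# Hypothetical ticks for one stock inside a single window: (timestamp, price)
ticks = [
    (datetime(2023, 10, 20, 12, 0, 10), 100.0),
    (datetime(2023, 10, 20, 12, 0, 40), 104.0),
    (datetime(2023, 10, 20, 12, 0, 55), 97.0),
]

def spread(ticks):
    """Maximum price minus minimum price: always >= 0."""
    prices = [price for _, price in ticks]
    return max(prices) - min(prices)

def signed_change(ticks):
    """Price at the latest timestamp minus price at the earliest timestamp."""
    first = min(ticks, key=lambda tick: tick[0])
    last = max(ticks, key=lambda tick: tick[0])
    return last[1] - first[1]

print(spread(ticks))         # 7.0
print(signed_change(ticks))  # -3.0
```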

## Find the stocks that gained the most between windows

## Summary of query 3

1. Define the time window duration (1 minute).
2. Group the data by time window and stock name, calculating the average price within each window.
3. Define a function that displays the average price for each window.
4. Register the function with `foreachBatch`, so that it runs on every micro-batch.
5. Start the streaming query and wait for new data.
6. The query runs in complete output mode, so each batch shows the full result table.


In [None]:
from pyspark.sql.functions import col, window, avg

# Define the time window duration
window_duration = "1 minute"

# Group the data by time window and stock name and compute the average price
windowed_df = df.groupBy(
    window("timestamp", window_duration),
    col("name")
).agg(
    avg(col("price")).alias("average_price")
)

# Function called by foreachBatch on every micro-batch
def display_average_price(batch_df, batch_id):
    batch_df.show(truncate=False)

# Register the function with foreachBatch and start the query
query = windowed_df.writeStream \
    .outputMode("complete") \
    .foreachBatch(display_average_price) \
    .start()
query.awaitTermination()
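
The query above reports the average price per window; ranking stocks by gain requires comparing each window with the previous one. The sketch below shows that comparison in plain Python, assuming the per-window averages are already available and ordered by window start. The data and the helper `gains_between_windows` are hypothetical.

```python
# Hypothetical average prices per window, ordered by window start time
average_prices = {
    "AAA": [10.0, 12.0, 11.0],
    "BBB": [50.0, 50.5, 53.5],
}

def gains_between_windows(series):
    """Difference between each window's average and the previous window's."""
    return [current - previous for previous, current in zip(series, series[1:])]

# Gain of each stock between the two most recent windows
latest_gain = {
    name: gains_between_windows(series)[-1]
    for name, series in average_prices.items()
}

# Rank the stocks by gain, largest first
ranking = sorted(latest_gain.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```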

## Implement a control that checks that a stock does not lose too much value over a period of time (feel free to choose the value you prefer).

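## Summary of query 4

1. Define the time window duration (5 minutes) and the maximum allowed loss (`max_loss`, here -10.0).
2. Group the data by time window and stock name, computing the price difference as the last price minus the first price within each window.
3. Define a function that keeps the stocks whose price difference is below `max_loss`.
4. Register the function with `foreachBatch` and start the streaming query in complete output mode.

In [None]:
from pyspark.sql.functions import col, window, expr


# Define the time window duration
window_duration = "5 minutes"

# Define the maximum allowed loss (a negative price change)
max_loss = -10.0

# Group the data by time window and stock name.
# max_by/min_by (Spark SQL 3.0+) pick the price at the latest/earliest timestamp,
# so price_difference = last price - first price, which is negative on a loss.
windowed_df = df.groupBy(
    window("timestamp", window_duration),
    col("name")
).agg(
    expr("max_by(price, timestamp) - min_by(price, timestamp)").alias("price_difference")
)

# Function called by foreachBatch on every micro-batch
def display_large_losses(batch_df, batch_id):
    # Keep only the stocks whose loss exceeds the allowed maximum
    large_loss_df = batch_df.filter(col("price_difference") < max_loss)
    large_loss_df.show(truncate=False)

# Register the function with foreachBatch and start the query
query = windowed_df.writeStream \
    .outputMode("complete") \
    .foreachBatch(display_large_losses) \
    .start()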

query.awaitTermination()
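
An alternative control, shown here only as a sketch, is the maximum drawdown: the largest drop from a running peak to a later price within the period. It is a different measure from the price difference used in the query and also catches a stock that falls sharply and then partly recovers. The helper names, the threshold, and the prices below are hypothetical.

```python
def max_drawdown(prices):
    """Largest peak-to-trough drop in a chronologically ordered price list."""
    peak = prices[0]
    worst = 0.0
    for price in prices:
        peak = max(peak, price)            # highest price seen so far
        worst = max(worst, peak - price)   # largest drop from that peak
    return worst

def loses_too_much(prices, max_loss=10.0):
    """True when the drawdown exceeds the allowed loss (hypothetical threshold)."""
    return max_drawdown(prices) > max_loss

volatile = [100.0, 105.0, 92.0, 98.0]  # drops 13.0 from the peak of 105.0
calm = [100.0, 101.0, 99.5]            # drops 1.5 from the peak of 101.0

print(max_drawdown(volatile), loses_too_much(volatile))
print(max_drawdown(calm), loses_too_much(calm))
```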

## Compute how your asset changes with the fluctuation of the market

## Summary of query 5

1. Define a 10-minute time window for data aggregation.
2. Group data by stock name and time window, computing the average price.
3. In a `foreachBatch` function, pivot the batch so that each row is a time window and each column is a stock.
4. Fill missing values with 0.
5. Use a vector assembler to combine the stock columns into a single vector.
6. Calculate the correlations between stocks.
7. Extract the correlation matrix from the result and display it.
8. Start the streaming query in complete output mode, so that the matrix is recomputed on the full result table for every batch.


In [None]:
#query 5
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler
from pyspark.sql.functions import col, window, avg, first

window_duration = "10 minutes"

# Streaming part: average price per stock and per time window
windowed_df = df.groupBy(
    window(col("timestamp"), window_duration),
    col("name")
).agg(avg("price").alias("avg_price"))

# Pivoting and Correlation.corr need a static DataFrame,
# so they run inside foreachBatch on each micro-batch result.
def display_correlation(batch_df, batch_id):
    # Pivot the data: one row per window, one column per stock
    pivoted_df = batch_df.groupBy("window").pivot("name").agg(first("avg_price"))
    stock_columns = [c for c in pivoted_df.columns if c != "window"]
    # A correlation needs at least two stocks and two windows
    if len(stock_columns) < 2 or pivoted_df.count() < 2:
        return
    pivoted_df = pivoted_df.na.fill(0.0, subset=stock_columns)  # Fill null values with 0

    # Assemble the stock columns into a single vector column
    assembler = VectorAssembler(inputCols=stock_columns, outputCol="features")
    vectorized_df = assembler.transform(pivoted_df)

    # Calculate the correlation between stocks and display the matrix
    correlation_matrix = Correlation.corr(vectorized_df, "features").head()[0]
    print(stock_columns)
    print(correlation_matrix.toArray())

# Register the function with foreachBatch and start the query
correlation_query = windowed_df.writeStream \
    .outputMode("complete") \
    .foreachBatch(display_correlation) \
    .start()

# Block until the streaming query is stopped
correlation_query.awaitTermination()
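
The section title can also be read more directly: track the value of a set of holdings as prices move from one window to the next. The plain-Python sketch below illustrates that reading; the holdings, the per-window prices, and the helper `portfolio_value` are all hypothetical and are not part of the notebook.

```python
# Hypothetical holdings: number of shares owned per stock
holdings = {"AAA": 10, "BBB": 2}

# Hypothetical price per stock for consecutive windows, oldest first
window_prices = [
    {"AAA": 10.0, "BBB": 50.0},
    {"AAA": 11.0, "BBB": 53.5},
]

def portfolio_value(holdings, prices):
    """Total value of the holdings at the given prices."""
    return sum(quantity * prices[name] for name, quantity in holdings.items())

# Value of the holdings in each window
values = [portfolio_value(holdings, prices) for prices in window_prices]

# Change in value from each window to the next
changes = [current - previous for previous, current in zip(values, values[1:])]

print(values)   # [200.0, 217.0]
print(changes)  # [17.0]
```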