
# Assignment 3 – Using Apache Spark to create streams and make predictions in a live setting

In [None]:
import threading

# Helper thread that keeps the Spark StreamingContext from blocking Jupyter

class StreamingThread(threading.Thread):
    def __init__(self, ssc):
        super().__init__()
        self.ssc = ssc
    def run(self):
        self.ssc.start()
        self.ssc.awaitTermination()
    def stop(self):
        print('----- Stopping... this may take a few seconds -----')
        self.ssc.stop(stopSparkContext=False, stopGraceFully=True)
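
The helper works because `awaitTermination()` blocks its caller until the context is stopped, so running it in a background thread leaves the notebook responsive. A minimal sketch of that pattern without Spark, assuming a hypothetical `FakeStreamingContext` stand-in (not part of PySpark) that only imitates the blocking behaviour:

```python
import threading


class FakeStreamingContext:
    """Hypothetical stand-in for StreamingContext; it only imitates blocking."""

    def __init__(self):
        self._stopped = threading.Event()
        self.started = False

    def start(self):
        self.started = True

    def awaitTermination(self):
        # Blocks the calling thread until stop() is called
        self._stopped.wait()

    def stop(self, stopSparkContext=False, stopGraceFully=True):
        self._stopped.set()


class StreamingThread(threading.Thread):
    def __init__(self, ssc):
        super().__init__()
        self.ssc = ssc

    def run(self):
        self.ssc.start()
        self.ssc.awaitTermination()

    def stop(self):
        self.ssc.stop(stopSparkContext=False, stopGraceFully=True)


fake_ssc = FakeStreamingContext()
worker = StreamingThread(fake_ssc)
worker.start()                      # returns immediately; blocking happens in the thread
still_running = worker.is_alive()   # the thread is parked in awaitTermination()
worker.stop()                       # releases the blocked thread
worker.join(timeout=5)
finished = not worker.is_alive()
```

Calling `fake_ssc.awaitTermination()` directly in a cell would instead hang that cell until something else called `stop()`.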

In [None]:
sc

In [None]:
spark

In [None]:
import os
from pyspark.streaming import StreamingContext

## Our code for this assignment

### 1 - make sure you collect a historical set of data

This pre-written script from Prof. Seppe reads historic articles from the socket and saves them to file. It also records whether each article ends up on the front page.

In [None]:
# Save less frequently and create fewer subdirectories, i.e. one batch every 120 seconds
ssc = StreamingContext(sc, 120)



In [None]:
lines = ssc.socketTextStream("seppe.net", 7778)

In [None]:
out_dir = f"{os.path.abspath('')}{os.path.sep}saved_stories"
lines.saveAsTextFiles(f"file:///{out_dir}")
print("Saving to", out_dir)

Saving to /Users/reuben/temp/big_data_course_assignment3/spark/notebooks/saved_stories


In [None]:
ssc_t = StreamingThread(ssc)
ssc_t.start()

In [None]:
# Don't run this cell unless you want to stop. You should see subdirectories appear in the out_dir
ssc_t.stop()

----- Stopping... this may take a few seconds -----


24/04/29 13:43:20 WARN SocketReceiver: Error receiving data
java.net.SocketException: Socket closed
	at java.base/sun.nio.ch.NioSocketImpl.endRead(NioSocketImpl.java:253)
	at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:332)
	at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
	at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
	at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:270)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:313)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:188)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:177)
	at java.base/java.io.BufferedReader.fill(BufferedReader.java:162)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:329)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:396)
	at org.apache.spark.streaming.dstream.SocketReceiver$

24/04/29 14:39:35 WARN HeartbeatReceiver: Removing executor driver with no recent heartbeats: 1021647 ms exceeds timeout 120000 ms
24/04/29 14:39:36 WARN SparkContext: Killing executors is not supported by current scheduler.
24/04/29 14:39:42 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exception thrown in awaitResult: 
	at org.apache.spark.util.SparkThreadUtils$.awaitResult(SparkThreadUtils.scala:56)
	at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:310)
	at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:102)
	at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:110)
	at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:36)
	at org.apache.spark.storage.BlockManagerMasterEndpoint.driverEndpoint$lzycompute(BlockManagerMasterEndpoint.scala:124)
	at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$

#### Filter out duplicate entries

Find and drop duplicate entries (same `aid`) before the train/test split. Rows with missing values are dropped as well.

In [None]:
import os
from pyspark.sql import functions as F

root_path = os.path.abspath('')


df = spark.read.json(root_path + "/saved_stories-*")

df = df.dropDuplicates(["aid"])
df = df.dropna()

# Show the schema and some data to confirm duplicates are removed
# df.printSchema()
df.show(5)





+--------+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------+-----+
|     aid|comments|              domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|       user|votes|
+--------+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------+-----+
|39975364|       0|     theguardian.com|    false|2024-04-09 01:08:03|Silver coin boom ...|Silver coin boom ...|Silver coin boom ...|https://www.thegu...|   zeristor|    2|
|39975395|       0|           axios.com|     true|2024-04-09 01:12:59|Just a moment...\...|    Just a moment...|Tesla settles fat...|https://www.axios...|       rurp|    3|
|39975415|       0|        thedrive.com|    false|2024-04-09 01:17:25|Transporting Gira...|Transporting Gira...|Transporting Gira...|ht

                                                                                

In [None]:
# count where frontpage is true vs false
df.groupBy("frontpage").count().show()



+---------+-----+
|frontpage|count|
+---------+-----+
|     true|  885|
|    false| 4194|
+---------+-----+
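
The counts show a clear class imbalance, which matters when judging any later score. A minimal sketch of the arithmetic, using only the two counts printed above; the balanced class weights are one common heuristic and are an assumption here, not something the notebook applies:

```python
# Counts taken from the groupBy output above
counts = {True: 885, False: 4194}

total = sum(counts.values())

# Share of stories that reached the front page
positive_rate = counts[True] / total

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = max(counts.values()) / total

# Balanced weights: each class contributes the same total weight
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print(f"total={total}, positive_rate={positive_rate:.3f}, "
      f"majority_baseline={majority_baseline:.3f}")
print("class_weights:", class_weights)
```

A model therefore has to beat roughly 83% accuracy to improve on always predicting "not front page". Weights like these could be supplied to `LogisticRegression` through its `weightCol` parameter if reweighting were desired.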



                                                                                

### 2 - construct a predictive model that can predict whether an article will end up on the frontpage based on what we can observe during 1h of monitoring

When deciding on the model design, these resources may help:
- https://www.linkedin.com/advice/1/what-most-effective-text-classification-algorithms
- https://arxiv.org/pdf/1009.4582

In [None]:
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, Tokenizer, VectorAssembler
from pyspark.ml.classification import LogisticRegression


tokenizer1 = Tokenizer(inputCol="source_title", outputCol="tokenised_title")
hashingTF1 = HashingTF(inputCol=tokenizer1.getOutputCol(), outputCol="title_features")

tokenizer2 = Tokenizer(inputCol="source_text", outputCol="tokenised_text")
hashingTF2 = HashingTF(inputCol=tokenizer2.getOutputCol(), outputCol="text_features")

# Assemble features
assembler = VectorAssembler(
    inputCols=["title_features", "text_features"],
    outputCol="features"
)

lr = LogisticRegression(maxIter=10, regParam=0.001)

pipeline = Pipeline(stages=[tokenizer1, hashingTF1, tokenizer2, hashingTF2, assembler, lr])
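
The pipeline turns each text column into a fixed-length count vector with the hashing trick and then concatenates the two vectors. The sketch below illustrates the idea in plain Python. It is an illustration only: `md5` stands in for the MurmurHash3 function Spark uses, 16 buckets stand in for the `HashingTF` default of 262144, the helper names are hypothetical, and the second input string is a toy example.

```python
import hashlib
from collections import Counter


def tokenize(text):
    # Roughly what Tokenizer does: lowercase, then split on whitespace
    return text.lower().split()


def hashed_term_frequencies(tokens, num_features=16):
    """Map tokens to a sparse {bucket_index: count} dict via the hashing trick."""
    counts = Counter()
    for token in tokens:
        # md5 is a stand-in hash chosen for determinism, not what Spark uses
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        counts[int(digest, 16) % num_features] += 1
    return dict(counts)


def assemble(first, second, num_features=16):
    """Concatenate two sparse vectors, as VectorAssembler does for dense ones."""
    combined = dict(first)
    for index, count in second.items():
        combined[index + num_features] = count  # second block is offset
    return combined


title_vec = hashed_term_frequencies(tokenize("Testing Fork Bomb"))
text_vec = hashed_term_frequencies(
    tokenize("Testing fork bomb behaviour on a fork of the kernel")
)
features = assemble(title_vec, text_vec)
print(features)
```

No vocabulary has to be fitted, which is why the same saved pipeline can transform unseen stories later. The trade-off is that different words can collide in the same bucket, more often when `num_features` is small.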


In [None]:
# cast labels as 1s and 0s (1 for True, 0 for False)
df = df.withColumn("label", col("frontpage").cast("integer"))

# Split the data into training and test sets
(train_df, test_df) = df.randomSplit([0.8, 0.2])

df_train = train_df.drop("frontpage")  # drop the original boolean column; the integer "label" column is kept for training


In [None]:

# Fit the pipeline to the data
model = pipeline.fit(df_train)

# # Make predictions
# predictions = model.transform(df_train)

# # Show the predictions
# predictions.show()


24/04/29 19:31:32 WARN DAGScheduler: Broadcasting large task binary with size 32.2 MiB
24/04/29 19:31:46 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/04/29 19:31:46 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.VectorBLAS
24/04/29 19:31:48 WARN DAGScheduler: Broadcasting large task binary with size 32.2 MiB
24/04/29 19:32:00 WARN DAGScheduler: Broadcasting large task binary with size 32.2 MiB
24/04/29 19:32:03 WARN DAGScheduler: Broadcasting large task binary with size 32.2 MiB
24/04/29 19:32:06 WARN DAGScheduler: Broadcasting large task binary with size 32.2 MiB
24/04/29 19:32:09 WARN DAGScheduler: Broadcasting large task binary with size 32.2 MiB
24/04/29 19:32:12 WARN DAGScheduler: Broadcasting large task binary with size 32.2 MiB
24/04/29 19:32:15 WARN DAGScheduler: Broadcasting large task binary with size 32.2 MiB
24/04/29 19:32:18 WARN DAGScheduler: Broadcasting large task binary with size 32.2 MiB


#### Make predictions on test data to evaluate the model

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Make predictions on the test data
test_predictions = model.transform(test_df)

# Evaluate the model. BinaryClassificationEvaluator reports the area under
# the ROC curve by default, not accuracy, so the metric is named explicitly.
evaluator = BinaryClassificationEvaluator(
    rawPredictionCol="rawPrediction", labelCol="label", metricName="areaUnderROC"
)
test_auc = evaluator.evaluate(test_predictions)
print("Test AUC (area under ROC): ", test_auc)


24/04/29 19:35:37 WARN DAGScheduler: Broadcasting large task binary with size 34.7 MiB

Test AUC (area under ROC):  0.5516779297617111
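
`BinaryClassificationEvaluator` reports the area under the ROC curve (AUC) by default, which is a ranking metric and not the same thing as accuracy. An AUC of 0.5 corresponds to random ranking. The sketch below computes both metrics on made-up toy scores to show how they can differ; the labels and scores are illustrative assumptions, not results from this notebook.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def area_under_roc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy data: four negatives and one positive, all scored below 0.5
labels = [0, 0, 0, 0, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.35]

toy_accuracy = accuracy(labels, scores)
toy_auc = area_under_roc(labels, scores)
print("accuracy:", toy_accuracy, "AUC:", toy_auc)
```

On this toy set every example is predicted negative, so accuracy is 0.8 purely because of the imbalance, while the AUC is 0.75 because the positive outranks three of the four negatives.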


In [None]:
# Save the model, overwriting any copy left over from a previous run
model.write().overwrite().save(root_path + "/model")


### 3 - show that your model can make predictions in a "deployed" setting on new stories


In [None]:
import os
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Model Prediction Stream").getOrCreate()
root_path = os.path.abspath('')

# Load the saved pipeline once, when this cell runs, so that every
# micro-batch reuses the same model instead of reloading it from disk
my_model = PipelineModel.load(root_path + "/model")

def process(time, rdd):
    # Skip empty micro-batches
    if rdd.isEmpty():
        return

    print("========= %s =========" % str(time))

    # Each line in the stream is a JSON document; convert the RDD to a DataFrame
    df = spark.read.json(rdd)
    df.show()

    # Predict using the loaded model
    df_result = my_model.transform(df)
    df_result.show()

# The streaming context and the stream itself are created and started in the cells below
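
The per-batch logic in `process` can be pictured without Spark: skip empty batches, parse each JSON line into a record, and attach a prediction. A minimal sketch, assuming a hypothetical `toy_predict` rule in place of the trained pipeline and made-up records in place of real stories:

```python
import json


def process_batch(lines, predict):
    """Parse one micro-batch of JSON lines and attach a prediction to each record."""
    if not lines:
        return []  # nothing arrived during this interval
    results = []
    for line in lines:
        record = json.loads(line)
        record["prediction"] = predict(record)
        results.append(record)
    return results


def toy_predict(record):
    # Hypothetical rule standing in for the trained pipeline
    return 1.0 if record.get("votes", 0) >= 5 else 0.0


# Made-up records, only shaped like the stream's JSON lines
batch = [
    '{"aid": 1, "title": "First toy story", "votes": 9}',
    '{"aid": 2, "title": "Second toy story", "votes": 1}',
]
scored = process_batch(batch, toy_predict)
for record in scored:
    print(record["aid"], record["title"], record["prediction"])
```

In the notebook the same roles are played by `spark.read.json(rdd)` for parsing and `transform` on the loaded pipeline for scoring.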


                                                                                

In [None]:
ssc = StreamingContext(sc, 10)



In [None]:
lines = ssc.socketTextStream("seppe.net", 7778)
lines.foreachRDD(process)

In [None]:
ssc_t = StreamingThread(ssc)
ssc_t.start()

24/04/29 20:17:26 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:17:26 WARN BlockManager: Block input-0-1714414645800 replicated to only 0 peer(s) instead of 1 peers
24/04/29 20:17:29 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:17:29 WARN BlockManager: Block input-0-1714414648800 replicated to only 0 peer(s) instead of 1 peers
                                                                                



                                                                                

+--------+--------+--------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+
|     aid|comments|        domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|     user|votes|
+--------+--------+--------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+
|40195043|       0|     sifted.eu|    false|2024-04-29 06:17:42|IBM sues a Zurich...|IBM sues a Zurich...|IBM sues a Zurich...|https://sifted.eu...|FredericJ|    1|
|40195059|       0|purpleidea.com|    false|2024-04-29 06:19:55|Running `make` fr...|Running `make` fr...|Running `Make` fr...|https://purpleide...| diginova|    1|
+--------+--------+--------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+



24/04/29 20:17:33 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:17:33 WARN BlockManager: Block input-0-1714414653000 replicated to only 0 peer(s) instead of 1 peers
24/04/29 20:17:36 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:17:36 WARN BlockManager: Block input-0-1714414655800 replicated to only 0 peer(s) instead of 1 peers
                                                                                

+--------+--------+--------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|     aid|comments|        domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|     user|votes|     tokenised_title|      title_features|      tokenised_text|       text_features|            features|       rawPrediction|         probability|prediction|
+--------+--------+--------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|40195043|       0|     sifted.

                                                                                



24/04/29 20:17:41 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:17:41 WARN BlockManager: Block input-0-1714414661000 replicated to only 0 peer(s) instead of 1 peers


+--------+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+----------+-----+
|     aid|comments|              domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|      user|votes|
+--------+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+----------+-----+
|40195061|       0|          kotaku.com|    false|2024-04-29 06:20:07|Xbox Console Sale...|Xbox Console Sale...|Xbox Console Sale...|https://kotaku.co...|jay_kyburz|    1|
|40195074|       0|twitter.com/geoff...|    false|2024-04-29 06:20:50|X\n\nDon’t miss w...|                   X|Good UIs for vers...|https://twitter.c...|      tosh|    1|
+--------+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+-------

                                                                                

+--------+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+----------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|     aid|comments|              domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|      user|votes|     tokenised_title|      title_features|      tokenised_text|       text_features|            features|       rawPrediction|         probability|prediction|
+--------+--------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+----------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|40195061|

24/04/29 20:17:45 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:17:45 WARN BlockManager: Block input-0-1714414665000 replicated to only 0 peer(s) instead of 1 peers
24/04/29 20:17:46 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:17:46 WARN BlockManager: Block input-0-1714414666000 replicated to only 0 peer(s) instead of 1 peers
24/04/29 20:17:50 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:17:50 WARN BlockManager: Block input-0-1714414670000 replicated to only 0 peer(s) instead of 1 peers


+--------+--------+----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+
|     aid|comments|          domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|        user|votes|
+--------+--------+----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+
|40195102|       0|austinhenley.com|    false|2024-04-29 06:24:36|Mistakes that dat...|Mistakes that dat...|Mistakes that dat...|https://austinhen...|       ingve|    1|
|40195104|       0| privatdozent.co|    false|2024-04-29 06:24:49|The Birthplace of...|The Birthplace of...|The Birth of AI (...|https://www.priva...|privatdozent|    1|
|40195115|       0|      hcrypt.net|    false|2024-04-29 06:26:46|Testing Fork Bomb...|   Testing Fork Bomb|   Testing Fork Bomb|https://hcrypt.ne...|

                                                                                

+--------+--------+----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|     aid|comments|          domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|        user|votes|     tokenised_title|      title_features|      tokenised_text|       text_features|            features|       rawPrediction|         probability|prediction|
+--------+--------+----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|40195102|      

24/04/29 20:17:55 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:17:55 WARN BlockManager: Block input-0-1714414675000 replicated to only 0 peer(s) instead of 1 peers
24/04/29 20:17:59 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:17:59 WARN BlockManager: Block input-0-1714414679000 replicated to only 0 peer(s) instead of 1 peers
                                                                                



24/04/29 20:18:01 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:01 WARN BlockManager: Block input-0-1714414681000 replicated to only 0 peer(s) instead of 1 peers


+--------+--------+----------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+
|     aid|comments|    domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|     user|votes|
+--------+--------+----------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+
|40195164|       0|    cbc.ca|     true|2024-04-29 06:34:10|www.cbc.ca\n\n# T...|          www.cbc.ca|London Drugs clos...|https://www.cbc.c...|      nvy|    9|
|40195169|       0| canva.dev|    false|2024-04-29 06:34:28|Scaling to Count ...|Scaling to Count ...|Scaling to Count ...|https://www.canva...|kiyanwang|    2|
|40195171|       2|ntietz.com|    false|2024-04-29 06:34:58|The only two log ...|The only two log ...|The only two log ...|https://www.ntiet...|kiyanwang|    6|
+--------+--------+----------+----

24/04/29 20:18:02 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:02 WARN BlockManager: Block input-0-1714414682000 replicated to only 0 peer(s) instead of 1 peers
                                                                                

+--------+--------+----------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|     aid|comments|    domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|     user|votes|     tokenised_title|      title_features|      tokenised_text|       text_features|            features|       rawPrediction|         probability|prediction|
+--------+--------+----------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+---------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|40195164|       0|    cbc.ca|     true|202

24/04/29 20:18:05 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:05 WARN BlockManager: Block input-0-1714414685200 replicated to only 0 peer(s) instead of 1 peers
24/04/29 20:18:08 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:08 WARN BlockManager: Block input-0-1714414688000 replicated to only 0 peer(s) instead of 1 peers
                                                                                



                                                                                

+--------+--------+------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+
|     aid|comments|            domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|        user|votes|
+--------+--------+------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+
|40195174|       0|         twitch.tv|    false|2024-04-29 06:35:19|startupschool's V...|startupschool Pas...|Old YC Videos Onl...|https://www.twitc...|jdcampolargo|    1|
|40195181|       2|      futurism.com|    false|2024-04-29 06:36:10|SpaceX Employees ...|SpaceX Employees ...|SpaceX employees ...|https://futurism....|     Gaishan|    2|
|40195187|       0|franciscomelojr.ca|    false|2024-04-29 06:37:32|PIOSEE Decision M...|PIOSEE Decision M...|Piosee Decision M...|https://f

24/04/29 20:18:13 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:13 WARN BlockManager: Block input-0-1714414693200 replicated to only 0 peer(s) instead of 1 peers
                                                                                

+--------+--------+------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|     aid|comments|            domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|        user|votes|     tokenised_title|      title_features|      tokenised_text|       text_features|            features|       rawPrediction|         probability|prediction|
+--------+--------+------------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|40195174|

24/04/29 20:18:15 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:15 WARN BlockManager: Block input-0-1714414695200 replicated to only 0 peer(s) instead of 1 peers
                                                                                



24/04/29 20:18:20 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:20 WARN BlockManager: Block input-0-1714414700200 replicated to only 0 peer(s) instead of 1 peers


+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+
|     aid|comments|           domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|        user|votes|
+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+
|40195195|       3|github.com/automq|    false|2024-04-29 06:38:58|AutoMQ vs Other S...|AutoMQ vs Other S...|Streaming Platfor...|https://github.co...| jackbauer24|   20|
|40195199|       0|        twitch.tv|    false|2024-04-29 06:39:37|startupschool - T...|              Twitch|Forgotten Mark Zu...|https://www.twitc...|jdcampolargo|    1|
+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+---------------

                                                                                

+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|     aid|comments|           domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|        user|votes|     tokenised_title|      title_features|      tokenised_text|       text_features|            features|       rawPrediction|         probability|prediction|
+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+------------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|40195195|   

24/04/29 20:18:25 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:25 WARN BlockManager: Block input-0-1714414705200 replicated to only 0 peer(s) instead of 1 peers
24/04/29 20:18:30 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:30 WARN BlockManager: Block input-0-1714414710200 replicated to only 0 peer(s) instead of 1 peers


+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------+-----+
|     aid|comments|           domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|    user|votes|
+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------+-----+
|40195209|       0|         aclu.org|    false|2024-04-29 06:41:20|How is One of Ame...|How is One of Ame...|ACLU is suing NSA...|https://www.aclu....| skilled|    1|
|40195210|       0|exploit.education|    false|2024-04-29 06:41:24|Exploit Education...|Exploit Education...|   Exploit.education|https://exploit.e...|udev4096|    1|
+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------+-----

24/04/29 20:18:32 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:32 WARN BlockManager: Block input-0-1714414712200 replicated to only 0 peer(s) instead of 1 peers
                                                                                

+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|     aid|comments|           domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|    user|votes|     tokenised_title|      title_features|      tokenised_text|       text_features|            features|       rawPrediction|         probability|prediction|
+--------+--------+-----------------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+
|40195209|       0|      

24/04/29 20:18:35 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:35 WARN BlockManager: Block input-0-1714414715200 replicated to only 0 peer(s) instead of 1 peers
24/04/29 20:18:36 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:36 WARN BlockManager: Block input-0-1714414716200 replicated to only 0 peer(s) instead of 1 peers
24/04/29 20:18:40 WARN RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
24/04/29 20:18:40 WARN BlockManager: Block input-0-1714414720200 replicated to only 0 peer(s) instead of 1 peers


+--------+--------+-----------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------+-----+
|     aid|comments|     domain|frontpage|          posted_at|         source_text|        source_title|               title|                 url|    user|votes|
+--------+--------+-----------+---------+-------------------+--------------------+--------------------+--------------------+--------------------+--------+-----+
|40195235|      19|    noyb.eu|     true|2024-04-29 06:44:41|ChatGPT provides ...|ChatGPT provides ...|ChatGPT provides ...|https://noyb.eu/e...| skilled|   23|
|40195236|       0|cpponsea.uk|    false|2024-04-29 06:44:45|Workshops and spe...|Workshops and spe...|     C++ on Sea 2024|https://cpponsea....|  pcw888|    1|
|40195238|       0|nytimes.com|    false|2024-04-29 06:44:52|A.I. Start-Ups Fa...|A.I. Start-Ups Fa...|A.I. Startups Fac...|https://www.nytim...|  marban|    6|
|40195239|       0|nytimes.com|   


In [None]:
ssc_t.stop()

----- Stopping... this may take a few seconds -----


24/04/29 20:20:23 WARN SocketReceiver: Error receiving data
java.net.SocketException: Socket closed
	at java.base/sun.nio.ch.NioSocketImpl.endRead(NioSocketImpl.java:253)
	at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:332)
	at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
	at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
	at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
	at java.base/sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:270)
	at java.base/sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:313)
	at java.base/sun.nio.cs.StreamDecoder.read(StreamDecoder.java:188)
	at java.base/java.io.InputStreamReader.read(InputStreamReader.java:177)
	at java.base/java.io.BufferedReader.fill(BufferedReader.java:162)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:329)
	at java.base/java.io.BufferedReader.readLine(BufferedReader.java:396)
	at org.apache.spark.streaming.dstream.SocketReceiver$

24/04/29 21:51:59 WARN TransportChannelHandler: Exception in connection from /10.45.44.215:57758
java.io.IOException: Operation timed out
	at java.base/sun.nio.ch.SocketDispatcher.read0(Native Method)
	at java.base/sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:47)
	at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:330)
	at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:284)
	at java.base/sun.nio.ch.IOUtil.read(IOUtil.java:259)
	at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:417)
	at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:254)
	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimi

---

## Challenges

List all challenges here

- Challenge 1, etc.