In an interactive notebook, the `spark` object is already created.
Instructors tested with 1 driver and 6 executors of type small e4 (24 cores, 192 GB memory).

### Launch Spark environment

In [15]:
spark

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 33, 20, Finished, Available)

In [16]:
%%configure -f \
{"conf": {"spark.jars.packages": "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.2"}}

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, -1, Finished, Available)

Unrecognized options: 

### Set up data configuration

In [17]:
blob_account_name = "marckvnonprodblob"
blob_container_name = "bigdata"
# read only
blob_sas_token = "?sv=2021-10-04&st=2023-10-04T01%3A42%3A59Z&se=2024-01-02T02%3A42%3A00Z&sr=c&sp=rlf&sig=w3CH9MbCOpwO7DtHlrahc7AlRPxSZZb8MOgS6TaXLzI%3D"

wasbs_base_url = (
    f"wasbs://{blob_container_name}@{blob_account_name}.blob.core.windows.net/"
)
spark.conf.set(
    f"fs.azure.sas.{blob_container_name}.{blob_account_name}.blob.core.windows.net",
    blob_sas_token,
)

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 6, Finished, Available)
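
The URL and the configuration key in the cell above both follow a fixed pattern. A minimal sketch that makes the pattern explicit is shown here; `wasbs_url` and `sas_conf_key` are hypothetical helper names introduced for illustration and are not part of any Azure or Spark API.

```python
def wasbs_url(container: str, account: str, path: str = "") -> str:
    """Build a wasbs:// URL for a path inside a blob container."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def sas_conf_key(container: str, account: str) -> str:
    """Configuration key under which the container's SAS token is stored."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"


print(wasbs_url("bigdata", "marckvnonprodblob", "reddit-parquet/comments/"))
print(sas_conf_key("bigdata", "marckvnonprodblob"))
```

The key returned by `sas_conf_key` is the one passed to `spark.conf.set` together with the SAS token.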

#### Setting the Parquet paths and target subreddits

In [18]:
comments_path = "reddit-parquet/comments/"
submissions_path = "reddit-parquet/submissions/"

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 7, Finished, Available)

In [19]:
topic = ["Tetris","pokemon","SuperMario","GTA","CallOfDuty","FIFA","legostarwars",
"assassinscreed","thesims","FinalFantasy"] 

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 8, Finished, Available)

### Reading in all of the Reddit data

In [20]:
comments_df = spark.read.parquet(f"{wasbs_base_url}{comments_path}")
submissions_df = spark.read.parquet(f"{wasbs_base_url}{submissions_path}")

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 9, Finished, Available)

In [21]:
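from pyspark.sql.functions import col, length

# Keep submissions whose body text is non-empty and neither deleted nor removed,
# then restrict to the target subreddits.
sub_filtered = submissions_df.filter(
    (length(col("selftext")) > 0)
    & (col("selftext") != "[deleted]")
    & (col("selftext") != "[removed]")
).filter(col("subreddit").isin(topic))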

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 10, Finished, Available)
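
To make the filter conditions easier to reason about, here is a plain-Python analogue of the two chained filters. It is only a sketch for checking the logic on single rows, not a replacement for the Spark code; `keep_submission` is a hypothetical name.

```python
from typing import Optional

TOPICS = {
    "Tetris", "pokemon", "SuperMario", "GTA", "CallOfDuty", "FIFA",
    "legostarwars", "assassinscreed", "thesims", "FinalFantasy",
}
PLACEHOLDERS = {"[deleted]", "[removed]"}


def keep_submission(subreddit: Optional[str], selftext: Optional[str]) -> bool:
    """Plain-Python analogue of the two chained Spark filters."""
    if not selftext:  # None or empty string; the Spark filter also drops null selftext
        return False
    if selftext in PLACEHOLDERS:
        return False
    return subreddit in TOPICS  # exact, case-sensitive match, like isin


print(keep_submission("pokemon", "Gen 1: Charizard"))
print(keep_submission("Pokemon", "Gen 1: Charizard"))
print(keep_submission("FIFA", "[removed]"))
```

Because the match is case-sensitive, the subreddit names in `topic` have to be spelled exactly as they appear in the data.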

In [22]:
df_save = sub_filtered.select("subreddit", "title", "selftext","year","month").cache()
df_save.show()

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 11, Finished, Available)

+--------------+--------------------+--------------------+----+-----+
|     subreddit|               title|            selftext|year|month|
+--------------+--------------------+--------------------+----+-----+
|       pokemon|the PokemonTogeth...|So several days a...|2023|    2|
|       pokemon|Who's a non-villa...|For me, Tyme insp...|2023|    2|
|       pokemon|i have a realization|&amp;#x200B;\n\n[...|2023|    2|
|          FIFA|Is there any reas...|For the past 10 d...|2023|    2|
|           GTA|What should I buy...|I have around 5 m...|2023|    2|
|           GTA|what is the name ...|I know the Nero i...|2023|    2|
|       pokemon|Name any Bug type...|Ok now we’re doin...|2023|    2|
|       pokemon|My starters for e...|Gen 1: Charizard ...|2023|    2|
|       thesims|The Victoria Chal...|\n\nI made my own...|2023|    2|
|       pokemon|I really fucking ...|I feel like it's ...|2023|    2|
|       thesims|The sim 4 build m...|So whenever I pla...|2023|    2|
|          FIFA|  fl

## Using TF-IDF to identify the key points for each game

In [23]:
!pip install spark-nlp

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 12, Finished, Available)



In [24]:
import pandas as pd
import numpy as np
import json
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from sparknlp.annotator import *
from sparknlp.base import *
import sparknlp
from sparknlp.pretrained import PretrainedPipeline
from pyspark.ml.feature import HashingTF, IDF, Tokenizer as tot, StopWordsRemover

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 13, Finished, Available)

In [25]:
# Define the Spark NLP components.
# tokenizer_nlp and stop_words are defined here but are not used in pipeline1 below.
tokenizer_nlp = (
    Tokenizer()
    .setInputCols(["document"])
    .setOutputCol("tokens_nlp")
)
stop_words = (
    StopWordsCleaner.pretrained("stopwords_iso", "en")
    .setInputCols(["tokens_nlp"])
    .setOutputCol("cleanTokens")
)

# Wrap the raw selftext column into Spark NLP document annotations
documentAssembler = (
    DocumentAssembler()
    .setInputCol("selftext")
    .setOutputCol("document")
)

# Sentence embeddings from the pretrained Universal Sentence Encoder
use = (
    UniversalSentenceEncoder.pretrained(name="tfhub_use", lang="en")
    .setInputCols(["document"])
    .setOutputCol("sentence_embeddings")
)

# Pretrained sentiment classifier that consumes the sentence embeddings
sentimental = (
    SentimentDLModel.pretrained(lang="en")
    .setInputCols(["sentence_embeddings"])
    .setOutputCol("sentiment")
)

# Create the sentiment pipeline
pipeline1 = Pipeline(stages=[documentAssembler, use, sentimental])

# Fit the pipeline on the data
model = pipeline1.fit(df_save)

# Transform the data to add sentence embeddings and sentiment labels
result = model.transform(df_save)
result.cache()

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 14, Finished, Available)

stopwords_iso download started this may take some time.
Approximate size to download 2.1 KB
[OK!]
tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]
sentimentdl_use_imdb download started this may take some time.
Approximate size to download 12 MB
[OK!]


DataFrame[subreddit: string, title: string, selftext: string, year: int, month: int, document: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, sentence_embeddings: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, sentiment: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>]

In [29]:
# Spark ML tokenizer (imported as `tot` so it does not clash with Spark NLP's Tokenizer)
tokenizer = tot(inputCol="selftext", outputCol="tokens")

# StopWordsRemover
stopwords_remover = StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens")

# HashingTF and IDF
hashing_tf = HashingTF(inputCol="filtered_tokens", outputCol="rawFeatures")
idf = IDF(inputCol="rawFeatures", outputCol="features")

# Pipeline
pipeline2 = Pipeline(stages=[tokenizer, stopwords_remover, hashing_tf, idf])

# Fit and transform the data
model = pipeline2.fit(result)
result = model.transform(result)



StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 18, Finished, Available)
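
As a worked illustration of the weights that the `HashingTF` and `IDF` stages produce, the following sketch computes TF-IDF in plain Python on a toy corpus. It uses raw term counts for TF and the smoothed formula `log((N + 1) / (df + 1))` that Spark ML documents for `IDF`. It works on the tokens themselves rather than on hash buckets, and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using raw counts for TF."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (df + 1)) for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


docs = [["charizard", "gen", "gen"], ["gen", "fifa"], ["fifa", "career"]]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

With this formula, a term that occurs in every document gets a weight of zero, and rarer terms get larger weights.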

In [30]:
result.cache().show()

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 19, Finished, Available)

+--------------+--------------------+--------------------+----+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|     subreddit|               title|            selftext|year|month|            document| sentence_embeddings|           sentiment|              tokens|     filtered_tokens|         rawFeatures|            features|
+--------------+--------------------+--------------------+----+-----+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|       pokemon|the PokemonTogeth...|So several days a...|2023|    2|[{document, 0, 13...|[{sentence_embedd...|[{category, 0, 13...|[so, several, day...|[several, days, a...|(262144,[3888,840...|(262144,[3888,840...|
|       pokemon|Who's a non-villa...|For me, Tyme insp...|2023|    2|[{document, 0, 66...|[{sentence_embedd...|[{category, 0, 66...|

In [36]:
result = result.sample(fraction=0.2, seed=42)

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 25, Finished, Available)

In [37]:
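from pyspark.sql import functions as f
from pyspark.sql.types import MapType, StringType, ArrayType, IntegerType

# One row per token: explode filtered_tokens into expwords, then re-wrap each token as a 1-element array
ndf = result.select(
    "subreddit", f.explode("filtered_tokens").name("expwords"),
    "rawFeatures", "year", "month", "sentiment",
).withColumn("filtered_tokens", f.array("expwords"))
# UDF returning the non-zero hash indices of a sparse vector
hashudf = f.udf(lambda vector: vector.indices.tolist(), ArrayType(IntegerType()))
# row_index is assigned per exploded token row; the ids are unique but not necessarily consecutive
ndf = ndf.withColumn("row_index", f.monotonically_increasing_id())
ndf.show()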

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 26, Finished, Available)

+---------+----------+--------------------+----+-----+--------------------+---------------+---------+
|subreddit|  expwords|         rawFeatures|year|month|           sentiment|filtered_tokens|row_index|
+---------+----------+--------------------+----+-----+--------------------+---------------+---------+
|  pokemon|       gen|(262144,[4959,129...|2023|    2|[{category, 0, 30...|          [gen]|        0|
|  pokemon|        1:|(262144,[4959,129...|2023|    2|[{category, 0, 30...|           [1:]|        1|
|  pokemon| charizard|(262144,[4959,129...|2023|    2|[{category, 0, 30...|    [charizard]|        2|
|  pokemon|     🔥🕊️|(262144,[4959,129...|2023|    2|[{category, 0, 30...|        [🔥🕊️]|        3|
|  pokemon|       gen|(262144,[4959,129...|2023|    2|[{category, 0, 30...|          [gen]|        4|
|  pokemon|        2:|(262144,[4959,129...|2023|    2|[{category, 0, 30...|           [2:]|        5|
|  pokemon|typhlosion|(262144,[4959,129...|2023|    2|[{category, 0, 30...|   [typhlos

In [38]:
# Non-zero hash indices per document; row_index is assigned per document, before the explode
wordtf = result.select("rawFeatures").withColumn('wordhash', hashudf(col('rawFeatures')))
wordtf = wordtf.withColumn("row_index", f.monotonically_increasing_id())
# One row per (document, hash index)
wordtf = wordtf.withColumn('wordhash',f.explode("wordhash"))
wordtf.cache()


StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 27, Finished, Available)

DataFrame[rawFeatures: vector, wordhash: int, row_index: bigint]
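
An alternative way to connect words to hash indices is to hash each distinct token directly and build a lookup table. The sketch below shows the idea in plain Python. It uses CRC32 as a stand-in hash, which is an assumption made so the example runs without Spark; `HashingTF` uses MurmurHash3, so the indices here will not match the notebook's. In PySpark 3.0 and later, `HashingTF.indexOf(term)` should return the actual index for a term. `bucket` and `build_lookup` are hypothetical names.

```python
import zlib

NUM_FEATURES = 262144  # vector size seen in the output above (2**18, the HashingTF default)


def bucket(term: str, num_features: int = NUM_FEATURES) -> int:
    """Stand-in hash bucket (CRC32), NOT Spark's MurmurHash3."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def build_lookup(tokens):
    """Map each bucket index to the set of distinct tokens that hash into it."""
    lookup = {}
    for tok in tokens:
        lookup.setdefault(bucket(tok), set()).add(tok)
    return lookup


lookup = build_lookup(["gen", "charizard", "typhlosion", "gen"])
for index, words in sorted(lookup.items()):
    print(index, sorted(words))
```

A bucket can hold more than one word when hashes collide, which is why each index maps to a set rather than to a single token.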

In [39]:
result_df = ndf.join(wordtf,["row_index"],"left_outer")

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 28, Finished, Available)
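
One thing to check in this join is that `row_index` means different things on the two sides: in `ndf` it numbers exploded token rows, while in `wordtf` it numbers documents and is then repeated by the explode. The plain-Python sketch below illustrates the effect on a toy example. It assumes consecutive ids for simplicity (Spark's `monotonically_increasing_id` only guarantees unique, increasing ids), uses tokens as stand-ins for hash indices, and uses an inner join where the notebook uses a left outer join.

```python
docs = [["gen", "charizard"], ["fifa", "career", "mode"]]

# ndf-style: explode first, then number every token row.
token_rows = list(enumerate(tok for doc in docs for tok in doc))

# wordtf-style: number every document, then explode.
doc_rows = [(doc_id, tok) for doc_id, doc in enumerate(docs) for tok in doc]

# Joining on the id pairs token row 1 with every entry of document 1.
joined = [(i, a, b) for i, a in token_rows for j, b in doc_rows if i == j]
for row in joined:
    print(row)
```

Here the token `charizard` (token row 1) is paired with the entries of the second document. Assigning one id per document before either explode, and joining on that id, would keep the two sides aligned.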

In [42]:
result_df.cache()
result_df.select("subreddit","expwords","year","month","sentiment","wordhash")

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 31, Finished, Available)

DataFrame[subreddit: string, expwords: string, year: int, month: int, sentiment: array<struct<annotatorType:string,begin:int,end:int,result:string,metadata:map<string,string>,embeddings:array<float>>>, wordhash: int]

In [47]:
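# Map each non-zero hash index to its TF-IDF weight; keys and values are declared as strings
udf1 = f.udf(lambda vec : dict(zip(vec.indices.tolist(),vec.values.tolist())),MapType(StringType(),StringType()))
# One row per (document, hash index, weight)
valuedf = result.select('subreddit',"filtered_tokens","year","month","sentiment",f.explode(udf1(f.col('features'))).name('wordhash','value'))
# Flatten the Spark NLP sentiment annotations to their label strings
valuedf = valuedf.withColumn("sentiment",f.explode("sentiment.result"))
valuedf.show()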

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 36, Finished, Available)

+---------+--------------------+----+-----+---------+--------+------------------+
|subreddit|     filtered_tokens|year|month|sentiment|wordhash|             value|
+---------+--------------------+----+-----+---------+--------+------------------+
|  pokemon|[gen, 1:, chariza...|2023|    2|      pos|  139265| 9.740140549187382|
|  pokemon|[gen, 1:, chariza...|2023|    2|      pos|   88005| 5.463474430171326|
|  pokemon|[gen, 1:, chariza...|2023|    2|      pos|  215686| 6.332122839048911|
|  pokemon|[gen, 1:, chariza...|2023|    2|      pos|  114628|10.992903517682748|
|  pokemon|[gen, 1:, chariza...|2023|    2|      pos|   63750| 7.041659799101322|
|  pokemon|[gen, 1:, chariza...|2023|    2|      pos|   80646| 20.59951267424561|
|  pokemon|[gen, 1:, chariza...|2023|    2|      pos|   12999|10.992903517682748|
|  pokemon|[gen, 1:, chariza...|2023|    2|      pos|  186058| 5.152261860309351|
|  pokemon|[gen, 1:, chariza...|2023|    2|      pos|  113673|2.5955078897221076|
|  pokemon|[gen,

In [49]:
valuedf = valuedf.drop_duplicates(subset=["subreddit","wordhash"])

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 38, Finished, Available)
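
Since the same hash index can carry a different TF-IDF weight in each document, dropping duplicates on `("subreddit", "wordhash")` keeps whichever row Spark happens to retain. If a reproducible value per pair is wanted, one option is to aggregate instead, for example by taking the maximum weight. The sketch below shows that idea in plain Python; the rows and the choice of `max` are assumptions for illustration, not the notebook's method.

```python
# Toy rows of (subreddit, wordhash, tf-idf value); the values are made up.
rows = [
    ("pokemon", 139265, 9.74),
    ("pokemon", 139265, 4.87),
    ("FIFA", 139265, 2.10),
]

# Keep the largest weight seen for each (subreddit, wordhash) pair.
best = {}
for subreddit, wordhash, value in rows:
    key = (subreddit, wordhash)
    best[key] = max(value, best.get(key, float("-inf")))

for key, value in best.items():
    print(key, value)
```

In Spark, the corresponding pattern would be a `groupBy("subreddit", "wordhash")` followed by an aggregation.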

In [50]:
valuedf.cache()
import os
CSV_DIR = os.path.join("Users/yc1063/fall-2023-reddit-project-team-11/data", "csv")
valuedf.toPandas().to_csv(f"{CSV_DIR}/analysis-2.csv")

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 39, Submitted, Running)

In [44]:
result_df = result_df.join(valuedf,['subreddit','wordhash'],"right_outer").cache()
result_df.show()

StatementMeta(3c67b279-1d53-4b7a-b0d9-41cb8b4b6723, 34, 33, Cancelled, Waiting)

In [None]:
# The joined DataFrame is named result_df above; joined_df is not defined in this notebook
result_without_duplicates = result_df.dropDuplicates()

# Show the resulting DataFrame without duplicates
result_without_duplicates.cache().show()

StatementMeta(, , , Cancelled, )

### Saving intermediate data

The intermediate outputs go into the Azure ML workspace's attached storage using the URI `azureml://datastores/workspaceblobstore/paths/<PATH-TO_STORE>`. This URI pattern is the same for all workspaces. To reload the data, use the same URI.
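
A small helper can fill in the path portion of that URI. This is a minimal sketch; `datastore_uri` is a hypothetical name, and the example path is made up.

```python
def datastore_uri(path: str, datastore: str = "workspaceblobstore") -> str:
    """Build an azureml:// datastore URI for a path inside the datastore."""
    return f"azureml://datastores/{datastore}/paths/{path.lstrip('/')}"


print(datastore_uri("intermediate/sentiment_tfidf"))
```

The same string can then be used for both writing and reading the intermediate data.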

In [None]:
import os
CSV_DIR = os.path.join("Users/yc1063/fall-2023-reddit-project-team-11/data", "csv")
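# Write the joined DataFrame (result_df above) as Parquet; use a .parquet suffix to match the format
result_df.write.parquet(f"{CSV_DIR}/sentiment_tfidf.parquet")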

StatementMeta(, , , Cancelled, )