# IMDB Reviews Keyword Extraction Using Scala

Author: You-Jun Chen (yc7093)

## 1. Introduction

This Zeppelin notebook demonstrates the ingestion and processing of the [IMDB Review Dataset](https://www.kaggle.com/datasets/ebiswas/imdb-review-dataset?resource=download), sourced from Kaggle.
The dataset contains user reviews of movies and TV shows. It includes fields such as `review_id`, `review_summary`, `review_detail`, `rating`, and more. The dataset is split into six JSON files, each approximately 1.5GB in size. This notebook processes a single file, `part-01.json`, as a demonstration. The methods shown can be applied to the remaining files.

The primary objective of the data ingestion process is to partition the dataset by emotion, rating, and keyword. This allows efficient access to subsets of the data for targeted analysis. For instance, records with a specific emotion, rating, and keyword can be read directly from a single partition path:

```scala
val specificPartitionPath = "/user/yc7093_nyu_edu/imdb_partitioned_by_emotion_rating_word/emotion=anger/rating=9.0/word=hidden_gem"
```

### Challenges
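Initially, I encountered issues loading the JSON files directly into Spark. The error was likely due to the formatting of the JSON file. One plausible cause is that Spark's JSON reader expects one JSON object per line (JSON Lines) by default, whereas pandas' `read_json` parses a whole-file JSON document by default.
I attempted debugging but could not resolve the issue before the deadline.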


### Workaround

To overcome this challenge, I used Pandas to preprocess the data:
1. Converted the problematic JSON file into parquet format.
2. Used the resulting parquet file (`part-01.parquet`) as input to the Spark pipeline in this notebook.

Here is the Python code used for preprocessing:

```python
import pandas as pd

df = pd.read_json("part-01.json")

df.to_parquet("~/part-01.parquet", index=False)
```

I will investigate further to identify and resolve the issues with the JSON format so that the pipeline can scale to the remaining files. For now, the Parquet file serves as a clean and manageable input for data ingestion and transformation.


## 2. Load Data



In [2]:
val basePath = "/user/yc7093_nyu_edu/imdb-reviews-w-emotion/part"
val fileSuffixes = List("-01-all") //, "-02-all", "-03-all", "-04-all")

val initialPath = s"$basePath${fileSuffixes.head}"
var rawDF = spark.read.parquet(initialPath)

for (suffix <- fileSuffixes.tail) {
  val fullPath = s"$basePath$suffix" 
  val part_df = spark.read.parquet(fullPath) 
  rawDF = rawDF.union(part_df) 
}


rawDF.show(5)
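
The loop above reads and unions one DataFrame per suffix. As a minimal sketch of an alternative, the full list of input paths can be built up front. The extra suffixes are taken from the commented-out part of the list above, and `allPaths` is a hypothetical name. Since `DataFrameReader.parquet` accepts several paths, the commented call at the end could then replace the loop inside the notebook.

```scala
// Base path and suffixes as in the cell above (extra suffixes come from its commented-out list).
val basePath = "/user/yc7093_nyu_edu/imdb-reviews-w-emotion/part"
val fileSuffixes = List("-01-all", "-02-all", "-03-all", "-04-all")

// Build every input path up front instead of reading inside a loop.
val allPaths: List[String] = fileSuffixes.map(suffix => s"$basePath$suffix")

allPaths.foreach(println)

// In the notebook, where `spark` is available, all parts could be read in one call:
// val rawDF = spark.read.parquet(allPaths: _*)
```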


In [3]:
rawDF.count 

In [4]:
rawDF.select("movie").distinct().count()

## 3. Drop redundant columns

In [6]:
val df = rawDF.drop("spoiler_tag", "helpful", "predicted_emotion")

df.show(5)


## 4. Process the ratings
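Remove records with invalid ratings. Only ratings in the inclusive range 0 to 10 are kept; rows with a null rating are dropped by the same filter.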

In [8]:
val distinctRatings = df.select("rating").distinct()

distinctRatings.show()


In [9]:
val ratingsDf = df.filter(col("rating").between(0, 10.0))

ratingsDf.show()

In [10]:
ratingsDf.count()

In [11]:
ratingsDf.select("rating").distinct().show()

 
## 5. Process the keywords in the reviews

### Unigrams analysis
Objective: Identify the most common words in `review_summary` and `review_detail` while removing stop words.

In [13]:
import org.apache.spark.sql.functions._

val stopWords = Set("a", "an", "the", "and", "or", "of", "to", "in", "is", "on", "with", "for", "it", "at", "by", "this", "that", "i", "you", "he", "she", "we", "they", "be", "was", "were", "been", "but", "if", "then", "so", "no", "yes", "not", "am", "are", "as", "do", "does", "did", "my", "your", "our", "their", "who", "what", "which", "how", "me", "us", "them", "about", "movie", "film", "films", "from", "one", "all", "have", "his", "her", "just", "more", "very", "t", "s", "story", "show", "out", "can", "than", "much", "don", "its", "ever", "too", "series", "will", "see", "when", "episode", "would", "get", "even", "only", "still", "movies", "into", "characters", "review", "make", "seen", "plot", "character", "after", "why", "also", "another", "end", "watching", "man", "over", "drama", "because", "should", "time", "watch", "has", "there", "here", "some", "made", "where", "him", "tv", "could", "many", "m", "1", "way", "ve", "2", "3" )

val broadcastStopWords = spark.sparkContext.broadcast(stopWords)


val tokenizeAndFilter = udf { (text: String) =>
  if (text == null) Array.empty[String]
  else {
    text.toLowerCase
      .split("\\W+") // Split by non-word characters
      .filter(word => word.nonEmpty && !broadcastStopWords.value.contains(word)) // Remove stop words and empty strings
  }
}
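
To see what the tokenizer does to a single string, here is a minimal plain-Scala sketch of the same steps, without Spark. The function name `tokenize` and the tiny stop-word set `demoStopWords` are hypothetical; the notebook's real list is much longer.

```scala
// A tiny, hypothetical stop-word set; the notebook's real list is much longer.
val demoStopWords = Set("this", "is", "a", "movie", "it", "t")

// Same steps as the UDF: lowercase, split on non-word characters, drop empties and stop words.
def tokenize(text: String, stopWords: Set[String]): List[String] =
  if (text == null) Nil
  else text.toLowerCase
    .split("\\W+")
    .filter(word => word.nonEmpty && !stopWords.contains(word))
    .toList

// "isn't" splits into "isn" and "t" because the apostrophe is a non-word character.
println(tokenize("This is a GREAT movie, isn't it?", demoStopWords)) // prints List(great, isn)
```

This also shows why fragments such as "t", "s", "m", "ve", and "don" appear in the stop-word list: they are the leftovers of contractions after splitting on the apostrophe.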



Show top 100 words in `review_summary`

In [15]:
val tokenizedSummaryDf = ratingsDf
  .withColumn("summary_tokens", explode(tokenizeAndFilter(col("review_summary")))) // Tokenize and explode

val summaryWordCounts = tokenizedSummaryDf
  .groupBy("summary_tokens")
  .count()
  .orderBy(desc("count")) 
  .withColumnRenamed("summary_tokens", "word") 

println("Top Frequent Words in review_summary:")
summaryWordCounts.show(100, truncate = false)
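
The explode, group, and count steps can be illustrated on a small in-memory collection. This is only a plain-Scala sketch of the logic; the sample token lists and the names `tokenizedSummaries` and `wordCounts` are made up for illustration.

```scala
// Hypothetical, already-tokenized summaries standing in for the `summary_tokens` arrays.
val tokenizedSummaries = List(
  List("great", "fun"),
  List("great", "cast"),
  List("fun")
)

// flatten plays the role of explode; groupBy + size plays the role of groupBy().count().
val wordCounts: List[(String, Int)] = tokenizedSummaries
  .flatten
  .groupBy(identity)
  .map { case (word, occurrences) => (word, occurrences.size) }
  .toList
  .sortBy { case (word, count) => (-count, word) } // descending count, ties broken alphabetically

wordCounts.foreach { case (word, count) => println(s"$word\t$count") }
```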

Show top 100 words in `review_detail`

In [17]:
// Tokenize and flatten `review_detail`
val tokenizedDetailDf = ratingsDf
  .withColumn("detail_tokens", explode(tokenizeAndFilter(col("review_detail")))) 


val detailWordCounts = tokenizedDetailDf
  .groupBy("detail_tokens")
  .count()
  .orderBy(desc("count")) 
  .withColumnRenamed("detail_tokens", "word") 


println("Top Frequent Words in review_detail:")
detailWordCounts.show(100, truncate = false)

In [18]:
val unigramTargetWords = List(
    "good", "great", "best", "love", "bad", "funny", "fun", "amazing", "worst", "comedy", "excellent", "boring", "horror", "entertaining",
    "beautiful", "masterpiece", "brilliant", "classic", "interesting", "awesome", "terrible", "perfect", "enjoyable", "original", "fantastic", "wonderful", "horrible",
    "disappointing", "underrated", "family"
)

    
unigramTargetWords.size

 
### Bigrams Analysis

In my initial exploration, individual words (unigrams) did not provide enough context to be useful. Many words appeared frequently but said little about the relationships or patterns within the reviews.

To address this, I used Spark ML's **NGram** transformer to generate **bigrams** (two-word sequences). Bigrams capture relationships between adjacent tokens and reveal common phrases used in the reviews. For example, phrases like "good acting" or "bad acting" are more informative than the individual words "good" or "bad". Because stop words are removed before the bigrams are built, a bigram can join words that were separated by a stop word in the original text: "better than expected" becomes "better expected".
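
As a rough illustration of what an n-gram step produces, here is a minimal plain-Scala sketch based on a sliding window. It is not Spark's implementation; the helper name `ngrams` and the sample tokens are hypothetical.

```scala
// Build n-grams from a token list by sliding a window of size n over it.
// Lists shorter than n yield no n-grams.
def ngrams(tokens: List[String], n: Int): List[String] =
  if (tokens.size < n) Nil
  else tokens.sliding(n).map(_.mkString(" ")).toList

val tokens = List("great", "acting", "bad", "plot")
println(ngrams(tokens, 2)) // prints List(great acting, acting bad, bad plot)
```

Note that "acting bad" is produced even though it is not a meaningful phrase, which is why the bigram counts still need manual review before keywords are chosen.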


Show top 100 bigrams in `review_summary`

In [21]:
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions._

// Define stop words
val stopWords = Set("a", "an", "the", "and", "or", "of", "to", "in", "is", "on", "with", "for", "it", "at", "by", "this", "that", "i", "you", "he", "she", "we", "they", "be", "was", "were", "been", "but", "if", "then", "no", "yes", "not", "am", "are", "as", "do", "does", "did", "my", "your", "our", "their", "who", "what", "which", "how", "me", "us", "them", "every", "set", "up", "there", "each", "feel like", "felt like", "feels like", "m", "has", "look like", "seems like", "could ve")

val broadcastStopWords = sc.broadcast(stopWords)


val tokenizedSummaryDf = ratingsDf.withColumn("summary_tokens", tokenizeAndFilter(col("review_summary")))

val nGramSummary = new NGram()
  .setN(2)
  .setInputCol("summary_tokens")
  .setOutputCol("summary_bigrams")

val bigramSummaryDf = nGramSummary.transform(tokenizedSummaryDf)

val explodedSummaryBigrams = bigramSummaryDf.withColumn("summary_bigram", explode(col("summary_bigrams")))

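val filteredSummaryBigrams = explodedSummaryBigrams.filter { row =>
  val bigram = row.getString(row.fieldIndex("summary_bigram"))
  val words = bigram.split(" ")
  // Drop the bigram if it is itself a stop phrase (e.g. "feel like") or contains any stop word
  !broadcastStopWords.value.contains(bigram) &&
    words.forall(word => !broadcastStopWords.value.contains(word))
}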

val summaryBigramCounts = filteredSummaryBigrams
  .groupBy("summary_bigram")
  .count()
  .orderBy(desc("count"))

println("Top Frequent Bigrams in review_summary:")
summaryBigramCounts.show(100, truncate = false)
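
The stop list in this cell mixes single words with two-word phrases such as "feel like". A phrase entry only takes effect when the bigram is compared as a whole, because no single word can equal a two-word phrase. A minimal plain-Scala sketch of the difference, using a shortened stop list and hypothetical helper names:

```scala
// Stop list mixing single words and two-word phrases (shortened for the example).
val bigramStopWords = Set("the", "every", "feel like", "look like")

// Word-by-word check only: two-word entries such as "feel like" can never match a single word.
def keepByWords(bigram: String, stop: Set[String]): Boolean =
  bigram.split(" ").forall(word => !stop.contains(word))

// Also test the bigram as a whole so that phrase entries take effect.
def keepByWordsAndPhrase(bigram: String, stop: Set[String]): Boolean =
  !stop.contains(bigram) && keepByWords(bigram, stop)

println(keepByWords("feel like", bigramStopWords))           // prints true (the phrase slips through)
println(keepByWordsAndPhrase("feel like", bigramStopWords))  // prints false
println(keepByWordsAndPhrase("hidden gem", bigramStopWords)) // prints true
```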



Show top 100 bigrams in `review_detail`

In [23]:
import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions._

val tokenizedDetailDf = ratingsDf.withColumn("detail_tokens", tokenizeAndFilter(col("review_detail")))

// Generate bigrams for `review_detail`
val nGramDetail = new NGram()
  .setN(2)
  .setInputCol("detail_tokens")
  .setOutputCol("detail_bigrams")

val bigramDetailDf = nGramDetail.transform(tokenizedDetailDf)

val explodedDetailBigrams = bigramDetailDf.withColumn("detail_bigram", explode(col("detail_bigrams")))

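val filteredDetailBigrams = explodedDetailBigrams.filter { row =>
  val bigram = row.getString(row.fieldIndex("detail_bigram"))
  val words = bigram.split(" ")
  // Drop the bigram if it is itself a stop phrase (e.g. "feel like") or contains any stop word
  !broadcastStopWords.value.contains(bigram) &&
    words.forall(word => !broadcastStopWords.value.contains(word))
}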

val detailBigramCounts = filteredDetailBigrams
  .groupBy("detail_bigram")
  .count()
  .orderBy(desc("count"))

println("Top Frequent Bigrams in review_detail:")
detailBigramCounts.show(100, truncate = false)


In [24]:
val bigramTargetWords = List(
  "sci fi", "well done", "really good", "better expected", "low budget", "feel good", "surprisingly good", "thought provoking", "really bad", "let down", "good acting", "bad acting", "worth seeing", "waste money", "well worth", "science fiction",
  "well acted", "action packed", "mind blowing", "romantic comedy", "great cast", "special effects", "good fun", "nothing special", "really enjoyed",
  "action flick", "good idea", "rip off", "wow wow", "best horror", "rom com", "cult classic", "nothing new", "above average", "soap opera", "high school",
  "heart warming", "top notch", "definitely worth", "visually stunning", "best action", "horror flick", "die hard", "pleasantly surprised",
  "absolutely amazing", "hidden gem", "great family", "highly recommend"
)


    
bigramTargetWords.size

### Trigrams

Upon analyzing the trigrams, I found that they do not provide additional meaningful keywords beyond what is already captured by bigrams. Most of the significant phrases are sufficiently represented in the bigrams, making trigrams redundant for this analysis. Therefore, I chose to focus on bigrams for extracting useful insights. 







In [26]:
val stopWords = Set("a", "an", "the", "and", "or", "of", "to", "in", "is", "on", "with", "for", "it", "at", "by", "this", "that", "i", "you", "he", "she", "we", "they", "be", "was", "were", "been", "but", "if", "then", "no", "yes", "not", "am", "are", "as", "do", "does", "did", "my", "your", "our", "their", "who", "what", "which", "how", "me", "us", "them", "every", "set", "up", "there", "each", "feel like", "felt like", "feels like", "m", "has", "look like", "seems like", "could ve")

val broadcastStopWords = spark.sparkContext.broadcast(stopWords)

val tokenizedDf = ratingsDf.withColumn("tokens", tokenizeAndFilter(col("review_detail")))

val nGram = new NGram()
  .setN(3) // Set n=3 for trigrams
  .setInputCol("tokens")
  .setOutputCol("trigrams")

val trigramDf = nGram.transform(tokenizedDf)

val explodedTrigrams = trigramDf.withColumn("trigram", explode(col("trigrams")))

val filteredTrigrams = explodedTrigrams.filter { row =>
  val trigram = row.getString(row.fieldIndex("trigram"))
  val words = trigram.split(" ")
  words.forall(word => !broadcastStopWords.value.contains(word)) // All words must not be stop words
}

val trigramCounts = filteredTrigrams
  .groupBy("trigram")
  .count()
  .orderBy(desc("count"))

trigramCounts.show(100, truncate = false)

## 6. Save data as Parquet Partitioned by Emotion, Rating, and Word

From the analysis above, I identified a set of keywords that appear frequently in the reviews. These keywords represent various sentiments or aspects of the movies and TV shows, such as positive adjectives (e.g., "good", "great", "amazing"), negative adjectives (e.g., "bad", "terrible", "horrible"), and thematic phrases (e.g., "special effects", "sci fi", "low budget").

### Target Keywords
The following list of target keywords was used to flag reviews based on their content.


In [28]:
val targetWords = unigramTargetWords ++ bigramTargetWords

targetWords

In [29]:
targetWords.size

Each review is checked for the presence of every keyword in both `review_summary` and `review_detail`, using a case-insensitive substring match.
For each keyword, a Boolean flag column records whether it was found. Spaces in the keyword are replaced with underscores so that the column name, and later the partition directory name, contains no spaces.
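
One caveat of substring matching on raw text: a short keyword such as "fun" also matches inside "funny", and a bigram such as "better expected" was derived from tokens with stop words removed, so it does not occur verbatim in "better than expected". The following is a minimal plain-Scala sketch of an alternative that matches whole tokens on normalized text. It is not what the notebook does; `normalize`, `containsKeyword`, and the small stop-word set are hypothetical.

```scala
// Hypothetical mini stop-word set; the notebook's unigram list is much longer.
val demoStops = Set("than", "of", "a", "was", "i", "it")

// Normalize text like the n-gram analysis: lowercase, split on non-word characters, drop stop words.
def normalize(text: String, stopWords: Set[String]): String =
  text.toLowerCase.split("\\W+").filter(w => w.nonEmpty && !stopWords.contains(w)).mkString(" ")

// Whole-token match: pad with spaces so that "fun" does not match inside "funny".
def containsKeyword(normalized: String, keyword: String): Boolean =
  s" $normalized ".contains(s" $keyword ")

val review = "Better than expected, a funny sci-fi!"
val normalized = normalize(review, demoStops)

println(normalized)                                     // prints better expected funny sci fi
println(review.toLowerCase.contains("better expected")) // prints false
println(containsKeyword(normalized, "better expected")) // prints true
println(review.toLowerCase.contains("fun"))             // prints true (matches inside "funny")
println(containsKeyword(normalized, "fun"))             // prints false
```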

In [31]:
import org.apache.spark.sql.functions._

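// Start from ratingsDf so that rows with invalid ratings (removed in section 4) stay excluded.
// Adds one Boolean column per keyword; both text columns are lowercased before the substring check.
val dfWithWordFlags = targetWords.foldLeft(ratingsDf) { (tempDf, word) =>
  tempDf.withColumn(word.replaceAll(" ", "_"),
    lower(col("review_detail")).contains(word) || lower(col("review_summary")).contains(word)
  )
}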

In [32]:
val explodedDf = dfWithWordFlags.selectExpr(
  "review_id",
  "movie",
  "review_summary",
  "review_detail",
  "emotion",
  "rating",
  "stack(" + targetWords.length + ", " +
  targetWords.map(word => s"'$word', ${word.replaceAll(" ", "_")}").mkString(", ") +
  ") as (word, is_present)"
).filter(col("is_present"))

explodedDf.show(5)
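
To make the generated `stack(...)` expression easier to read, here is a small plain-Scala sketch that builds it for just two keywords. `demoTargetWords`, `stackArgs`, and `stackExpr` are hypothetical names; the notebook uses the full `targetWords` list.

```scala
// Two hypothetical target keywords; the notebook uses the full `targetWords` list.
val demoTargetWords = List("good", "sci fi")

// Each keyword contributes a pair: its label as a string literal and the name of its Boolean flag column.
val stackArgs = demoTargetWords
  .map(word => s"'$word', ${word.replaceAll(" ", "_")}")
  .mkString(", ")

val stackExpr = s"stack(${demoTargetWords.length}, $stackArgs) as (word, is_present)"

println(stackExpr) // prints stack(2, 'good', good, 'sci fi', sci_fi) as (word, is_present)
```

In Spark SQL, `stack(n, ...)` spreads its arguments over `n` rows, so each review yields one `(word, is_present)` row per keyword, and the subsequent filter keeps only the rows whose flag is true.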



In [33]:
// explodedDf.write.mode("overwrite").parquet("/user/yc7093_nyu_edu/imdb-all-w-emotion-keyword-part1-4")

Saving the data in parquet format

In [35]:
// Save the DataFrame as Parquet, partitioned by `emotion`, `rating`, and the underscore-formatted `word`
explodedDf
    .withColumn("word", regexp_replace(col("word"), " ", "_")) // Replace spaces with underscores in `word`
    .write
    .mode("overwrite") // Overwrite existing data
    .partitionBy("emotion", "rating", "word") // One directory level per partition column
    .parquet("/user/yc7093_nyu_edu/imdb_partitioned_by_emotion_rating_word")


 

## 7. Load the partition for testing
After partitioning the dataset by `emotion`, `rating`, and `word`, individual partitions can be loaded selectively for further analysis.
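
Partition directories follow the `column=value` naming that `partitionBy` writes, one level per column. A minimal plain-Scala sketch of a helper that assembles such a path is shown below; `partitionPath` is a hypothetical name, and the values are the ones used in this notebook.

```scala
// Build a Hive-style partition path from ordered (column, value) pairs.
def partitionPath(root: String, partitions: Seq[(String, String)]): String =
  (root +: partitions.map { case (column, value) => s"$column=$value" }).mkString("/")

val root = "/user/yc7093_nyu_edu/imdb_partitioned_by_emotion_rating_word"
val path = partitionPath(root, Seq("emotion" -> "anger", "rating" -> "9.0", "word" -> "hidden_gem"))

println(path)
```

When a single partition directory is read directly, the `emotion`, `rating`, and `word` columns are not part of the resulting DataFrame, because their values live only in the directory names. Setting the reader's `basePath` option to the table root is one way to keep them.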



In [37]:
import org.apache.hadoop.fs.{FileSystem, Path}

// Specify the path you want to list
val hdfsPath = new Path("/user/yc7093_nyu_edu/imdb_partitioned_by_emotion_rating_word/emotion=anger/rating=9.0")

// Get the FileSystem object
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// List files and directories in the specified path
val files = fs.listStatus(hdfsPath)
files.foreach(file => println(file.getPath.toString))


Below is an example of loading a specific partition, here `emotion=anger`, `rating=9.0`, and `word=hidden_gem`.

In [39]:
val specificPartitionPath = "/user/yc7093_nyu_edu/imdb_partitioned_by_emotion_rating_word/emotion=anger/rating=9.0/word=hidden_gem"
val specificPartitionDf = spark.read.parquet(specificPartitionPath)

// Show the data
specificPartitionDf.show() // Use truncate = false to see full text



In [40]:
val outputPath = "/user/yc7093_nyu_edu/rating_9_fun_watch_partition"

// Write the DataFrame to HDFS as Parquet
specificPartitionDf.write
  .mode("overwrite") // Overwrite if the path already exists
  .parquet(outputPath)

println(s"Partition saved successfully to HDFS at $outputPath")
