Problem Statement:

We have a dataset (google_files) with a content column of text data, where each row represents a document. The goal is to:

Extract individual words from the content column.
Count the occurrences of specific target words (e.g., 'SQL' and 'PySpark') across the entire dataset.
Output the word and its corresponding count.

In [0]:
# Create the data for the table
data = [
    ("file1.txt", "Google Uses SQL"),
    ("file2.txt", "Google Uses SQL and PySpark to fetch the Data"),
    ("file3.txt", "Google Uses NoSQL DataBase and PySpark for processing of Data"),
    ("file4.txt", "Writing code in PySpark is very easy"),
]

# Define the schema
columns = ["file_name", "content"]

# Create a DataFrame
google_files_df = spark.createDataFrame(data, schema=columns)

# Display the DataFrame (display() is Databricks-specific; use show() in plain PySpark)
google_files_df.display()

file_name,content
file1.txt,Google Uses SQL
file2.txt,Google Uses SQL and PySpark to fetch the Data
file3.txt,Google Uses NoSQL DataBase and PySpark for processing of Data
file4.txt,Writing code in PySpark is very easy


In [0]:
# Register the DataFrame as a temporary view so the %sql cells can query it by name
google_files_df.createOrReplaceTempView("google_files")

In [0]:
from pyspark.sql.functions import col, split, explode, count

# Explode the content into words
words_df = google_files_df.withColumn("word", explode(split(col("content"), " ")))

# Filter the words of interest ('SQL' and 'PySpark') and count them
result_df = (
    words_df.filter(col("word").isin("SQL", "PySpark"))
    .groupBy("word")
    .agg(count("word").alias("word_count"))
)

# Show the result
result_df.display()

word,word_count
SQL,2
PySpark,3
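
One detail worth checking is why SQL is counted 2 times even though file3.txt contains "NoSQL": isin compares whole tokens, not substrings. The plain-Python sketch below contrasts the two behaviours. It does not use Spark; the docs list simply repeats the sample rows, and the variable names are illustrative.

```python
# Plain-Python illustration (not Spark): exact token match vs. substring match.
docs = [
    "Google Uses SQL",
    "Google Uses SQL and PySpark to fetch the Data",
    "Google Uses NoSQL DataBase and PySpark for processing of Data",
    "Writing code in PySpark is very easy",
]

# Same tokenization as split(content, ' '): break each document on single spaces.
tokens = [word for doc in docs for word in doc.split(" ")]

# Exact match, analogous to col("word").isin("SQL")
exact_sql = sum(1 for word in tokens if word == "SQL")

# Substring match, which would also pick up "NoSQL"
substring_sql = sum(1 for word in tokens if "SQL" in word)

print(exact_sql, substring_sql)  # 2 3
```

If substring matches were wanted instead, the filter would have to change; the exact-match version is what produces the counts shown above.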


In [0]:
%sql
-- Preview the exploded words: one row per word
SELECT
  EXPLODE(SPLIT(content, ' ')) AS word
FROM
  google_files

word
Google
Uses
SQL
Google
Uses
SQL
and
PySpark
to
fetch


In [0]:
%sql
WITH cte AS (
  SELECT
    EXPLODE(SPLIT(content, ' ')) AS word
  FROM
    google_files
)
SELECT
  word,
  COUNT(word) AS word_count
FROM
  cte
WHERE
  word IN ('SQL', 'PySpark')
GROUP BY
  word

word,word_count
SQL,2
PySpark,3


Explanation:

split(content, ' '): Splits each content string on the space character, producing an array of words.

explode(split(...)): Turns that array into individual rows, one word per row.

WHERE word IN (...): Keeps only the rows whose word is exactly 'SQL' or 'PySpark'. By default the comparison is an exact, case-sensitive match, so 'NoSQL' is not counted as 'SQL'.

GROUP BY word: Groups the remaining rows by word, so that COUNT(word) returns the number of occurrences of each target word.
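
The four steps above can be mirrored in plain Python to see what each stage produces. This is only a sketch for intuition, assuming the same four sample rows; count_target_words is a hypothetical helper, not part of PySpark or Spark SQL.

```python
from collections import Counter


def count_target_words(documents, targets):
    """Count exact occurrences of the target words across all documents."""
    # split + explode: one flat list with one word per entry
    words = [word for doc in documents for word in doc.split(" ")]
    # WHERE word IN (...): keep only the target words
    matches = [word for word in words if word in targets]
    # GROUP BY word + COUNT(word): tally each remaining word
    return dict(Counter(matches))


documents = [
    "Google Uses SQL",
    "Google Uses SQL and PySpark to fetch the Data",
    "Google Uses NoSQL DataBase and PySpark for processing of Data",
    "Writing code in PySpark is very easy",
]

counts = count_target_words(documents, {"SQL", "PySpark"})
print(counts)  # {'SQL': 2, 'PySpark': 3}
```

As with GROUP BY, a target word that never appears produces no entry at all rather than a count of 0.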