# Topic Modeling on Twitter Data with LDA using PySparkML

In [None]:
# !pip3 install nb_black==1.0.7 nltk==3.6.6 altair==4.1.0

In [None]:
%load_ext lab_black
%load_ext autoreload
%autoreload 2

In [None]:
import os
import time
from datetime import datetime
from functools import reduce
from typing import List

import boto3
import nltk
import sagemaker_pyspark
from pandas import Series as pd_Series
from pyspark import SparkConf, keyword_only
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.clustering import LDA
from pyspark.ml.feature import (
    CountVectorizer,
    CountVectorizerModel,
    IDF,
    RegexTokenizer,
    StopWordsRemover,
    Tokenizer,
    VectorAssembler,
)
from pyspark.sql import Column, SparkSession, functions as F, types as T
from pyspark.sql.dataframe import DataFrame as pdf
from pyspark.sql.window import Window

In [None]:
# used for display purposes only
from pandas import DataFrame as pd_DataFrame, option_context as pd_option_context

# only used to verify that data containing ML model predictions, which was exported to S3, can
# be re-loaded (into a Dask or Pandas DataFrame)
import dask.dataframe as dd
from pandas import read_parquet as pd_read_parquet

In [None]:
%aimport src.model_interpretation
from src.model_interpretation import interpret_model as mih, import_export_models as iem

%aimport src.nlp
from src.nlp import clean_text as ch

%aimport src.s3
from src.s3 import bucket_contents as s3h

%aimport src.visualization
from src.visualization import visualize as vh

## About

In this notebook, PySpark ML will be used to perform topic modeling on the streamed tweets data stored in the CSV files (prepared using `3_combine_raw_data.ipynb`) in an S3 bucket.

**Notes**
1. This is an initial attempt at unsupervised learning with this data. The objective is to build up a minimum reliable end-to-end workflow here consisting of **both**
   - complete data processing
     - with an emphasis on processing text data
   - unsupervised ML
     - with manual hyper-parameter adjustments

   each of which can be iteratively improved in the future.
2. As the requirement for this project is to use big-data tools only, we will restrict ourselves to using
   - PySpark for manipulating the data in a PySpark `DataFrame`
     - even though the dataset used here does fit in memory and could be analysed using in-memory tools
   - PySpark ML for implementing topic modeling (unsupervised machine learning)
3. An AWS SageMaker notebook hosted on a T3 XLarge instance ([specifications](https://aws.amazon.com/ec2/instance-types/)) was used for running this notebook.
4. To ensure repeatability in
   - applying NLP on the tweet text
     - either the most recently trained vectorizer model will be loaded from a sub-folder in the same S3 bucket containing the CSV files
       - here, this was done after reasonable choices for the `minDF` and `maxDF` hyper-parameters of the vectorizer model were found (by tuning these two hyper-parameters manually) and only the topic modeling algorithm (LDA) was left to be optimized
     - or a newly defined vectorizer model will be trained and saved (with a timestamp in the suffix of its filename) to a sub-folder in the same S3 bucket (this choice is controlled by `save_count_vectorizer` in the **User Inputs** section)
   - applying LDA [randomness in LDA is due to the probabilistic aspect of this algorithm ([1](https://stackoverflow.com/a/60661226/4057186))]
     - the `seed` hyper-parameter in the PySparkML `LDA` object will be set to a fixed value
       - fixing the seed initializes the random number generator to the same state, so it yields the same sequence of random numbers on every run
     - the LDA hyper-parameter for the maximum number of iterations (`maxIter`, [link](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.LDA.html#pyspark.ml.clustering.LDA.maxIter)) will need to be increased sufficiently (from its default value of 20) such that the learned topics do converge
       - a best effort was made to optimize this choice manually
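
The role of a fixed seed can be illustrated with Python's standard `random` module. This is only an analogy: Spark's `LDA` uses its own random number generator, and the helper name `draw` and the seed values below are chosen purely for illustration.

```python
import random


def draw(seed: int, n: int = 3) -> list:
    """Return n pseudo-random numbers from a generator initialized with seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


first_run = draw(seed=88)
second_run = draw(seed=88)
other_seed = draw(seed=89)

# Same seed gives the same sequence; a different seed gives a different one
print(first_run == second_run)
print(first_run == other_seed)
```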

**Requirements**
1. This notebook must be run on an AWS SageMaker instance.
2. Required Python libraries can be installed by un-commenting the first cell of this notebook.
3. Two environment variables
   - `AWS_S3_BUCKET_NAME`
      - the name of the S3 bucket containing the hourly CSVs of streamed Twitter data
   - `AWS_REGION`

   must be accessible to this SageMaker instance.
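
A minimal sketch of a guard that could be run before the rest of the notebook to confirm both variables are set. The helper name `find_missing_env_vars` is hypothetical and is not part of this project's `src` package.

```python
import os

REQUIRED_ENV_VARS = ("AWS_S3_BUCKET_NAME", "AWS_REGION")


def find_missing_env_vars(names=REQUIRED_ENV_VARS, environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = find_missing_env_vars()
if missing:
    # Report instead of failing later with a less obvious S3 error
    print(f"Missing environment variables: {', '.join(missing)}")
```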

The Python package requirements to run this notebook differ from those listed in the `requirements.txt` file (used by notebook `3_combine_raw_data.ipynb`) for this project.

In [None]:
%%time
!pip3 freeze | grep -E 'boto3|s3fs|black==|jupyter-server|sagemaker|sagemaker_pyspark|pandas|pyspark|dask|nltk|altair|plotly'

## User Inputs

In [None]:
# S3
path_to_folder = "/datasets/twitter/kinesis-demo/"

# Data Loading (from hourly CSV files)
num_files_to_use = 25
# number of rows (streamed tweets) to load into PySpark DataFrame
nrows = 800_000

# Data processing
all_cols_to_process = [
    "document",  # 'document' is optional
    "created_at",
    "user_joined",
    "in_reply_to_screen_name",
    "source_text",
    "place_country",
    "user_followers",
    "user_friends",
    "user_listed",
    "user_favourites",
    "user_statuses",
    "user_protected",
    "user_verified",
    "user_location",
    "reviewText",
]

# Vectorizer
count_vectorizer_filename = "count_vec_model"
save_count_vectorizer = False
s3_models_subfolder = "models"  # note: change this from 'predictions' to 'models'

# LDA
# this should be tuned iteratively (discussed later)
num_topics = 4

# Model Evaluation (for reading tweets predicted to belong to the same topic)
num_top_terms_per_topic = 15
num_top_docs_to_read = 8

# Output file (containing data with predicted topic)
output_file_name = "processed_with_predictions"

In [None]:
s3_bucket_name = os.getenv("AWS_S3_BUCKET_NAME")
aws_region = os.getenv("AWS_REGION")

In [None]:
topic_cols = [f"topic_{i}" for i in range(num_topics)]
cols_to_show_when_reading = (
    ["document", "reviewText"] + topic_cols + ["dominant_prob", "dominant_topic"]
)

In [None]:
def show_pyspark_df(df: pdf, nrows: int = 5) -> pd_DataFrame:
    """Display the first n rows of a PySpark DataFrame as a Pandas DataFrame."""
    return df.limit(nrows).toPandas()

Download NLTK stopwords, if not previously done

In [None]:
%%time
if not os.path.isdir(
    os.path.join(os.path.expanduser("~"), "nltk_data", "corpora", "stopwords")
):
    nltk.download("stopwords")

from nltk.corpus import stopwords

all_stopwords = set(stopwords.words("english"))

Append custom stopwords to the list of stopwords provided by the NLTK library

In [None]:
manual_stop_words = [
    # specific to crypto mining
    "crypto",
    "token",
    "koistarter",
    "daostarter",
    "decentralized",
    "services",
    "pancakeswap",
    "eraxnft",
    "browsing",
    "kommunitas",
    "hosting",
    "internet",
    "exipofficial",
    "servers",
    "wallet",
    "liquidity",
    "rewards",
    "floki",
    "10000000000000linkstelegram",
    "dogecoin",
    "czbinance",
    "watch",
    "binance",
    "dogelonmars",
    "cryptocurrency",
    "hbomax",
    "money",
    "danheld",
    "dogelon",
    "bitcoin",
    "nftart",
    "bvbtc",
    # inappropriate
    "fuckkk",
    "fucking",
    # general words that won't be useful to analysis here (subjective choices)
    "provides",
    "crazy",
    "marketing",
    "locked",
    "happy",
    "first",
    "would",
    "always",
    "still",
    "could",
    "right",
    "thank",
    "project",
    "great",
    "really",
    "think",
    "check",
    "supply",
    "going",
    "completed",
    "still",
    "people",
    "years",
    "matter",
    "never",
    "always",
    "things",
    "amazing",
    "around",
    "better",
    "another",
    "please",
    "looking",
    "today",
    "since",
    "thing",
    "every",
    "something",
    "future",
    "thanks",
    "youre",
    "don't",
    "don't",
    "someone",
    "ready",
    "taken",
    "using",
    "enough",
    "maybe",
    "believe",
    "making",
    "stuff",
    "might",
    "point",
    "makes",
    "family",
    "everyone",
    "thats",
    "actually",
    "everything",
    "little",
    "change",
    "without",
    "gonna",
    "already",
    "getting",
    "theres",
    "looks",
    "can't",
    "didn't",
    "called",
    "found",
    "nothing",
    "though",
    "literally",
    "bring",
    # not words
    "aaaaaaaaaa",
    "aaaaaand",
    "aaaaand",
    "aaaahhhh",
    "aaaand",
    "aaaaaa",
    "aaahh",
    "aaare",
]
# Manually add to stop words
for manual_stop_word in manual_stop_words:
    all_stopwords.add(manual_stop_word)
all_stopwords = list(all_stopwords)

**Notes**
1. This custom list was iteratively built up by training the topic model algorithm, inspecting the top words in each learned topic and removing any occurrences of commonly occurring words that were (subjectively) determined to not add value to the topic. Such words should not contribute the highest weight to a topic and so should be removed from the text vocabulary that is used here.
2. Since we're trying to learn topics related to tweets about *space* news, tweets about cryptocurrency (which are likely being picked up due to their connection to the company SpaceX) are not useful and should be removed from the vocabulary. An alternative approach would be to keep those words in and choose a number of topics that (if possible) confines tweets about cryptocurrency to a single topic. This alternative approach was not used here.

## PySpark Setup (on AWS SageMaker)

In [None]:
%%time
conf = (SparkConf()
        .set("spark.driver.extraClassPath", ":".join(sagemaker_pyspark.classpath_jars())))

Start a Spark session

In [None]:
%%time
spark = (
    SparkSession
    .builder
    .config(conf=conf)
    .appName("schema_test")
    .getOrCreate()
)

## Load Data

### Get List of S3 CSV Data Files

Get a list of all the CSV files containing the tweets data (files with a prefix `tweets_*.csv`), and not the metadata (prefix `tweets_metadata_*.csv`), from `csvs/` folder in the S3 bucket path at `<bucket-name>/datasets/twitter/kinesis-demo/`

In [None]:
%%time
existing_csv_files_list = s3h.get_existing_csv_files_list(
    s3_bucket_name, path_to_folder[1:] + "csvs/tweets_"
)
files_csvs_list = [f for f in existing_csv_files_list if "metadata" not in f]
files_csvs_list

### Load all CSV Files into Single PySpark `DataFrame`

Read all CSV files from the `csvs/` folder in the S3 bucket path at `<bucket-name>/datasets/twitter/kinesis-demo/` into a PySpark `DataFrame`

In [None]:
schema = T.StructType(
    [
        T.StructField("id", T.StringType()),
        T.StructField("contributors", T.StringType()),
        T.StructField("created_at", T.StringType()),
        T.StructField("source", T.StringType()),
        T.StructField("in_reply_to_screen_name", T.StringType()),
        T.StructField("source_text", T.StringType()),
        T.StructField("place_id", T.StringType()),
        T.StructField("place_url", T.StringType()),
        T.StructField("place_place_type", T.StringType()),
        T.StructField("place_country_code", T.StringType()),
        T.StructField("place_country", T.StringType()),
        T.StructField("user_name", T.StringType()),
        T.StructField("user_screen_name", T.StringType()),
        T.StructField("user_followers", T.IntegerType()),
        T.StructField("user_friends", T.IntegerType()),
        T.StructField("user_listed", T.IntegerType()),
        T.StructField("user_favourites", T.IntegerType()),
        T.StructField("user_statuses", T.IntegerType()),
        T.StructField("user_protected", T.BooleanType()),
        T.StructField("user_verified", T.BooleanType()),
        T.StructField("user_joined", T.StringType()),
        T.StructField("user_location", T.StringType()),
        T.StructField("retweeted_tweet", T.StringType()),
        T.StructField("text", T.StringType()),
        T.StructField("file_name", T.StringType()),
    ]
)

In [None]:
%%time
df = spark.read.csv(
    [f's3a://{s3_bucket_name}' + f"/{f}" for f in files_csvs_list],
    header=True,
    schema=schema,
    # inferSchema=True
).withColumnRenamed("text", "reviewText")
df = df.limit(nrows)

(Optional) Add a row number (row counter) column to the data

In [None]:
%%time
w = Window().orderBy(F.lit('A'))
df = df.withColumn("document", F.row_number().over(w))
with pd_option_context("display.max_columns", 50):
    display(show_pyspark_df(df))

Get the number of rows (retrieved tweets) in the data, the number of PySpark `DataFrame` partitions and the number of CPUs available on the host (single-node) cluster

In [None]:
%%time
print(
    f"Raw data contains {df.count():,} rows and {len(df.columns):,} columns "
    f"in {df.rdd.getNumPartitions()} partitions, on a host with "
    f"{len(os.sched_getaffinity(0))} CPUs"
)

Show the first 4 rows from the PySpark `DataFrame`

In [None]:
%%time
with pd_option_context("display.max_columns", 100):
    display(show_pyspark_df(df, 4))

Get a `DataFrame` version of the Spark Schema (`df.printSchema()`) for the PySpark `DataFrame`

In [None]:
df_dtypes_pyspark = pd_DataFrame.from_records(
    [
        {"name": field.name, "dtype": field.dataType, "nullable": field.nullable}
        for field in df.schema.fields
    ]
).set_index("name")
df_dtypes_pyspark

Cache the data

In [None]:
df.cache()

## Data Processing

For processing the data, text and non-text columns will be treated separately.

### Processing Non-Text Columns

We'll define a PySparkML pipeline ([v2.4.0](https://spark.apache.org/docs/2.4.0/ml-pipeline.html#pipeline), [latest version](https://spark.apache.org/docs/latest/ml-pipeline.html), [API for latest version](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.Pipeline.html)) to process all useful non-text columns from the data.

This pipeline will accept a list of all useful columns and return a DataFrame with the same input columns plus their processed versions. A suffix will be added to the name of each processed column to
- indicate that it has been processed
- keep it separate from the raw data

As an example, the column `created_at` will be converted from a string into a datetime datatype, and the converted version of this column will be returned as `created_at_dt`.

The following processing steps will be applied
- drop rows with duplicated tweets (retweets) since the text of the tweet (which will be used in NLP) is repeated
  - for analysing text data with topic modeling, we don't need multiple observations (rows) with the same text (re-tweets)
    - such rows are only useful when exploring the data after the text-based analysis has been completed
      - if a topic is learned for a specific tweet, then that same topic applies to all re-tweets
    - so, for NLP analysis, the duplicated rows (re-tweets) in the text column can be dropped
  - based on how data was collected using `twitter_s3.py`, an example of a retweet is
    - (row 10) `retweeted_tweet = 'no'` and text column = `'text here'`
    - (row 91) `retweeted_tweet = 'yes'` and text column = `'text here'`
    - (row 201) `retweeted_tweet = 'yes'` and text column = `'text here'`

    where we only need the first tweet (row 10) and so we can drop all rows corresponding to retweets
- convert the following columns from the `string` datatype to `datetime`s
  - `created_at` (date and time when the tweet was posted)
  - `user_joined` (date and time when user joined Twitter)
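
The string-to-datetime conversion can be sketched in plain Python. This assumes the raw strings follow the `yyyy-MM-dd HH:mm:ss` layout passed to `to_timestamp` in this notebook, whose `strptime` equivalent is `%Y-%m-%d %H:%M:%S`. The helper name and the sample values are hypothetical.

```python
from datetime import datetime

# strptime equivalent of the Spark pattern "yyyy-MM-dd HH:mm:ss"
PYTHON_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(value):
    """Parse a timestamp string; return None if it is missing or malformed."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, PYTHON_FORMAT)
    except ValueError:
        return None


parsed = parse_timestamp("2021-12-31 23:59:59")
print(parsed)
print(parse_timestamp("not a date"))
```

Returning `None` for unparseable values keeps a single bad row from stopping the whole conversion; such rows can be counted and inspected afterwards.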

Apply the non-text processing pipeline to process all the useful non-text columns from the data

In [None]:
%%time
# Remove retweets
df = df.filter(df.retweeted_tweet != 'yes')

# Select the columns to be processed, including the text column
df_processed = df.select(all_cols_to_process)

# Drop rows with a missing value in the text column
df_processed = df_processed.na.drop(subset=["reviewText"])

# Apply datetime formatting for the two datetime columns
for c in ["created_at", "user_joined"]:
    df_processed = df_processed.withColumn(
        f"{c}_dt",
        F.to_timestamp(F.col(c), "yyyy-MM-dd HH:mm:ss"),
    )

print(f"Number of rows in processed data = {df_processed.count():,}")
with pd_option_context("display.max_colwidth", 1_000):
    display(show_pyspark_df(df_processed, 7))

The non-text data processing is now ready and we can proceed to preparing the text data column for quantitative analysis.

### Processing Text Data

We'll now process the text data column. All the processed data columns from the previous section, including the text column, will be retained. However, here, we will only be processing the text column from this data.

#### Cleaning Text Data

Since we are looking to build up a useful text vocabulary on which to perform NLP tasks, we'll first perform the following text cleaning steps
- replace multiple whitespaces with a single whitespace from the text of the tweet
- remove leading and trailing whitespace from the text of the tweet
- drop rows where the tweet is missing or contains an empty string (if any)
- change text to lowercase
- remove numbers
- remove punctuation

These steps will help with [tokenization](https://neptune.ai/blog/tokenization-in-nlp) during the NLP data preparation step (done in the next sub-section).
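
The cleaning steps listed above can be sketched on a single string with Python's `re` module. This is an illustration only: the regular expressions are assumptions about what helpers such as `replace_multiple_spaces` and `remove_punctuation` in `src.nlp` do, and the final whitespace collapse is a choice made for this sketch.

```python
import re


def clean_text(text: str) -> str:
    """Apply whitespace, case, number and punctuation cleaning to one string."""
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    text = text.strip()  # remove leading and trailing whitespace
    text = text.lower()  # change to lowercase
    text = re.sub(r"[^a-z ]", " ", text)  # replace numbers and punctuation
    # Replacing characters with spaces can create new runs of whitespace
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("  Launch  at 10:30, T-minus 5!  "))
```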

In [None]:
%%time
# Replace multiple whitespaces with a single whitespace
df_processed = df_processed.select(
    all_cols_to_process
    + ["created_at_dt", "user_joined_dt"]
    + [
        ch.replace_multiple_spaces(F.col("reviewText")).alias(
            "reviewText_processed"
        )
    ]
)

# Remove leading and trailing spaces
df_processed = df_processed.select(
    all_cols_to_process
    + [
        ch.remove_lead_trail_spaces(F.col("reviewText_processed")).alias(
            "reviewText_processed"
        )
    ]
)

# Change text to lowercase
df_processed = df_processed.select(
    all_cols_to_process
    + [F.lower(F.col("reviewText_processed")).alias("reviewText_processed")]
)

# Remove special characters
df_processed = df_processed.withColumn(
    'reviewText_processed', F.regexp_replace('reviewText_processed', r"[^a-zA-Z]", " ")
)

# Remove numbers
df_processed = df_processed.select(
    all_cols_to_process
    + [
        F.regexp_replace(F.col("reviewText_processed"), r"\d+", "").alias(
            "reviewText_processed"
        )
    ]
)

# Remove punctuation
df_processed = df_processed.select(
    all_cols_to_process
    + [ch.remove_punctuation(F.col("reviewText_processed")).alias("reviewText_processed")]
)

# Drop rows where the tweet text is a blank string
df_processed_no_blanks = df_processed.filter(df_processed["reviewText"] != '')

# Get words from the raw text (used as crude filter for tweets based on their length,
# to remove short tweets)
df_processed_no_blanks = df_processed_no_blanks.withColumn(
    "reviewText_trimmed", F.trim(F.col("reviewText"))
).withColumn("words", F.split("reviewText_trimmed", r"\s+"))

print(f"Number of rows in processed data = {df_processed_no_blanks.count():,}")
show_pyspark_df(df_processed_no_blanks, 7)

**Notes**
1. In the Twitter streaming script `twitter_s3.py`, hashtags and usernames were removed from the text of the tweet and stored in a separate variable in the raw data. For this version of the analysis, we will not use hashtags and usernames but these can be combined with the tweet text in future iterations of this analysis.
2. The last processing step above gets a crude count of the number of words in each tweet. This count is used to filter out short tweets, which helps the effectiveness of the LDA algorithm (discussed next).

As mentioned earlier, LDA can be sub-optimal for topic modeling with short texts. We will now apply the following filter to the data to remove short tweets
- for the purposes of this analysis, we will only keep raw tweets that have more than 25 words
  - as more tweets are streamed and the size of the processed data (shown immediately above) increases, we can raise this threshold above 25
  - more is discussed about this length and about LDA for short texts at the end of this notebook
  - earlier, we split the raw tweets into words; we will use this `words` column to filter out short tweets
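
A plain-Python sketch of the length filter, mirroring the `F.size("words") > 25` condition applied in the next cell. The function names are hypothetical. Note one difference: Python's `str.split()` returns an empty list for a blank string, whereas splitting a blank string in Spark yields a single empty token.

```python
WORD_COUNT_THRESHOLD = 25


def count_words(text: str) -> int:
    """Crude word count: number of whitespace-separated pieces."""
    return len(text.split())


def is_long_enough(text: str, threshold: int = WORD_COUNT_THRESHOLD) -> bool:
    """Keep a tweet only if it has strictly more words than the threshold."""
    return count_words(text) > threshold


tweets = [" ".join(["word"] * n) for n in (5, 25, 26)]
kept = [t for t in tweets if is_long_enough(t)]
print(len(kept))
```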

In [None]:
df_processed_no_blanks = df_processed_no_blanks.filter(F.size("words") > 25)
print(
    "Number of rows in processed data, after filtering out tweets based on "
    f"length of text = {df_processed_no_blanks.count():,}"
)
show_pyspark_df(df_processed_no_blanks, 5)

**Notes**
1. This significantly reduces the size of the dataset passed to LDA, but it removes tweets that will be challenging to use with LDA.
2. It is better to apply this filter here, before data is passed to the NLP pipeline, since those are expensive computing tasks that won't benefit from having these shorter texts.

#### NLP on Cleaned Text Data

Now, we'll apply an NLP pipeline to extract features from the cleaned text data. This pipeline will consist of the following three steps
- tokenization
  - here we will restrict the minimum token length that we will accept using the `minTokenLength` parameter
    - this is a hyperparameter of the NLP pipeline that can be tuned during future versions of this analysis
- removal of stop words
  - these are frequently occurring words that won't offer any useful information
- vectorization
  - this is the process of associating words or phrases from a text vocabulary to a real-valued vector
  - there are several approaches to vectorization, but we will restrict ourselves to TFIDF vectorization ([1](https://openclassrooms.com/en/courses/6532301-introduction-to-natural-language-processing/7067116-apply-the-tf-idf-vectorization-approach), [2](https://monkeylearn.com/blog/what-is-tf-idf/))
    - in PySpark this can be done using a combination of a `CountVectorizer` ([link](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.CountVectorizer.html)) and `IDF` ([link](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.IDF.html)) classes from the `pyspark.ml` module
      - `CountVectorizer` has three particularly useful hyperparameters `minDF`, `maxDF` and `vocabSize` that could be extensively tuned in future versions of this analysis
    - disadvantage of the TFIDF technique
      - The length of each document vector equals the size of the vocabulary learned from the text corpus (capped by `vocabSize`): a corpus with a vocabulary of 10 terms produces vectors of length 10, while one with a vocabulary of 10,000 terms produces vectors of length 10,000. This means that the size and words in the vocabulary depend completely on the text corpus. The same words in two different vocabularies will produce different vector representations depending on the corpus being analysed. An alternative to this form of vectorization is to use word embeddings such as `Word2Vec` ([1](https://stackoverflow.com/questions/62749877/word2vec-in-short-text-clustering#comment110985077_62749877)), which was not used in this iteration of the analysis.
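
To make the corpus dependence concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It uses the smoothed form `log((m + 1) / (df + 1))`, which is the formula documented for Spark's `IDF`; the toy documents and variable names are made up for illustration, and no `minDF`/`maxDF` filtering is applied.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (illustrative only)
docs = [
    ["rocket", "launch", "orbit"],
    ["rocket", "landing"],
    ["orbit", "station", "orbit"],
]

vocab = sorted({term for doc in docs for term in doc})
m = len(docs)
# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in docs for term in set(doc))
idf = {term: math.log((m + 1) / (doc_freq[term] + 1)) for term in vocab}


def tfidf(doc):
    """Return one TF-IDF weight per vocabulary term for a document."""
    tf = Counter(doc)
    return [tf[term] * idf[term] for term in vocab]


vectors = [tfidf(doc) for doc in docs]
# Every vector has one entry per vocabulary term, not per document
print(len(vocab), [len(v) for v in vectors])
```

Adding a document that introduces new terms would lengthen every vector and change the IDF weights of existing terms, which is the corpus dependence described above.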

In [None]:
# Tokenization
tokenizer = RegexTokenizer(
    minTokenLength=5,
    inputCol="reviewText_processed",
    outputCol="tokens",
    toLowercase=True,
    pattern="\\s+",  # default (meaning: https://stackoverflow.com/a/13750765/4057186)
    # pattern="\\W",  # other options to try to keep words: '[\\W_]+' or "\\W"
)

# Removal of Stop Words
remover = StopWordsRemover(
    inputCol="tokens", outputCol="tokens_no_stopwords", stopWords=all_stopwords
)

# TFIDF Vectorization
count_vec_params = dict(
    inputCol="tokens_no_stopwords",
    outputCol="rawFeatures",
    # vocabSize: default = 262144
    vocabSize=262144,
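    # minDF: an integer >= 1 is the minimum number of documents a term must appear in;
    # a float in [0, 1) is the minimum fraction of documents (default = 1.0)
    minDF=5,
    # maxDF: an integer >= 1 is the maximum number of documents a term may appear in;
    # a float in [0, 1) is the maximum fraction (default = 9223372036854775807)
    maxDF=0.75,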
)
if save_count_vectorizer:
    # Create new count vectorizer model
    count_vectorizer = CountVectorizer(**count_vec_params)
    print("Defined new CountVectorizer object")
else:
    # Get all paths (excluding prefix with protocol and bucket name) to previously saved
    # count vectorizer models (sorted in ascending order by timestamp suffix in their filenames)
    vectorizer_fpaths = iem.get_all_saved_vectorizer_models_from_s3(
        s3_bucket_name,
        aws_region,
        f"{path_to_folder[1:]}{s3_models_subfolder}/",
        "count_vec",
    )
    # Get full filepath to latest count vectorizer model
    count_vectorizer_filepath = f"s3a://{s3_bucket_name}/{vectorizer_fpaths[-1][:-1]}"
    # Load latest count vectorizer model from folder in S3 bucket
    count_vectorizer = CountVectorizerModel.load(count_vectorizer_filepath)
    print(
        f"Loaded CountVectorizer from folder {vectorizer_fpaths[-1][:-1]} in S3 bucket"
    )
idf = IDF(minDocFreq=0, inputCol="rawFeatures", outputCol="features")
tfidf_vectorizer = Pipeline(stages=[count_vectorizer, idf])

assembler = VectorAssembler(inputCols=["1gram_idf"], outputCol="features")

# Combined text processing pipeline
pipe = Pipeline(stages=[tokenizer, remover, tfidf_vectorizer])

Apply the text processing pipeline to process the text column from the data

In [None]:
%%time
pipe_trained = pipe.fit(df_processed_no_blanks)
df_text_processed_no_blanks = pipe_trained.transform(df_processed_no_blanks)
print(f"Number of rows in processed data = {df_text_processed_no_blanks.count():,}")
show_pyspark_df(df_text_processed_no_blanks, 5)

Save the trained `CountVectorizer` to a folder in S3 bucket (if specified in **User Inputs** section)

In [None]:
%%time
if save_count_vectorizer:
    # Assemble S3 filepath with timestamp in filename
    timestr = time.strftime("%Y%m%d_%H%M%S")
    count_vectorizer_filepath = (
        f"s3a://{s3_bucket_name}{path_to_folder}{s3_models_subfolder}/{count_vectorizer_filename}_{timestr}"
    )
    # Save the trained CountVectorizer in the relevant stage from the PySpark NLP pipeline
    pipe_trained.stages[-1].stages[0].save(count_vectorizer_filepath)
    print(f"Saved newly trained CountVectorizer to path {count_vectorizer_filepath} in S3 bucket")

We'll now drop duplicates based on the processed text (before stopwords were removed)

In [None]:
%%time
df_text_processed = df_text_processed_no_blanks.dropDuplicates(
    subset=["reviewText_processed"]
)

**Notes**
1. All the texts are now in lowercase and have been cleaned, so tweets that previously differed from each other only in whitespace or letter case (which made such duplicates difficult or impossible to identify) can now be identified as duplicates and dropped.

Check if cached

In [None]:
df_text_processed.storageLevel.useMemory

Cache the processed data, which will be queried and then used for LDA modeling

In [None]:
df_text_processed_cached = df_text_processed.cache()

#### Dropping Irrelevant Tweets

Although duplicates and retweets have been dropped, some leftover tweets might differ from others by only a few characters or words. We'll refer to these as *leftover duplicates*. We'll now get a random sample of the processed data to visually inspect and identify any such leftover duplicates, or inappropriate / irrelevant tweets, that can be manually dropped for the analysis to be performed here.

Get the number of rows in the processed data as a Python variable

In [None]:
%%time
n_rows_proc = df_text_processed_cached.count()

In [None]:
%%time
print(
    "Number of rows in processed data, after filtering tweets by length and"
    f" removing duplicates = {n_rows_proc:,}"
)
show_pyspark_df(df_text_processed_cached, 5)

Get a random sample of the processed data

In [None]:
%%time
# Aim for roughly 50 rows; cap the fraction at 1.0 in case fewer than 50 rows remain
df_sample = df_text_processed_cached.sample(
    withReplacement=False, fraction=min(1.0, 50 / n_rows_proc)
).toPandas()
with pd_option_context("display.max_rows", 200):
    with pd_option_context("display.max_colwidth", 1_000):
        display(df_sample[["document", "reviewText"]])

## Topic Model Training

We'll now perform the quantitative analysis, which will be limited to unsupervised ML using the Latent Dirichlet Allocation (LDA) algorithm. It is implemented in PySpark ML and that implementation will be used here. Limitations of this choice are discussed in the **Conclusions and Future Work** section.

### LDA

We'll define a dictionary of LDA hyper-parameters to be used to train the LDA model

In [None]:
lda_params_dict = dict(
    featuresCol="features",  # features or tokens_no_stopwords
    optimizer="em",  # 'online' or 'em'
    maxIter=85,
    k=num_topics,
    seed=88,
)
lda = LDA(**lda_params_dict)

**Notes**
1. The `maxIter` and `k` (number of topics) hyper-parameters were tuned by manually changing their values in this hyper-parameter dictionary, inspecting a chart of the top words (by weight) per topic (discussed shortly) and adjusting the values until coherent terms appeared in a given topic. With better data processing to remove leftover duplicated tweets (identified in the last sub-section of **Data Processing** above), these choices would be further tuned by reading the top documents per topic. All hyper-parameters in the NLP step were kept fixed while this manual tuning of the LDA hyper-parameters was performed.
2. The same approach was used for brief hyper-parameter tuning of the `minDF` and `maxDF` hyper-parameters of the TFIDF Vectorization step from the NLP sub-section earlier. During this brief tuning process, the `maxIter` value in the LDA algorithm was kept fixed.

Train the LDA model

In [None]:
%%time
print(f"Starting time = {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}...", end="")
model = lda.fit(df_text_processed_cached)
print(f"Done at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}.")

## ML Model Interpretation

### Terms per Topic

Get the size of the vocabulary

In [None]:
model.vocabSize()

Get vocabulary

In [None]:
%%time
vocabList = pipe_trained.stages[-1].stages[0].vocabulary
vocabList[:7]

Get topic-terms matrix with the top `n` term (token) weights for each topic ([1](https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/mllib/clustering/LDAModel.html#describeTopics()), [2](https://spark.apache.org/docs/latest/ml-clustering.html#latent-dirichlet-allocation-lda), [3](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.mllib.clustering.LDAModel.html#pyspark.mllib.clustering.LDAModel.describeTopics))

In [None]:
%%time
df_topic_terms = model.describeTopics(maxTermsPerTopic=num_top_terms_per_topic)
show_pyspark_df(df_topic_terms)

In order to interpret the LDA model's topics, the term names are of use to us and not the indices. Term names come from the vocabulary of the `CountVectorizer` ([1](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.feature.CountVectorizer.html#pyspark.ml.feature.CountVectorizer.binary)) pre-processing step, which then allows us to convert the `termIndices` column to `termNames` that can be interpreted.
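
The index-to-name lookup can be sketched in plain Python. The toy vocabulary and topic rows below are made up; they only imitate the `topic`, `termIndices` and `termWeights` columns returned by `describeTopics`.

```python
# Toy vocabulary: position i holds the term for term index i
vocab = ["rocket", "launch", "orbit", "station"]

# Toy rows shaped like the output of describeTopics()
topics = [
    {"topic": 0, "termIndices": [2, 0], "termWeights": [0.4, 0.3]},
    {"topic": 1, "termIndices": [3, 1], "termWeights": [0.5, 0.2]},
]

# One row per (topic, term) pair, with the index replaced by its term name
rows = [
    {
        "topic": f"topic_{t['topic']}",
        "termName": vocab[index],
        "termWeight": weight,
    }
    for t in topics
    for index, weight in zip(t["termIndices"], t["termWeights"])
]

for row in rows:
    print(row)
```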

We will now get the term names for each topic using the vocabulary we retrieved above

In [None]:
%%time
dff = df_topic_terms.toPandas().apply(
    lambda s: s.apply(pd_Series).stack().reset_index(drop=True, level=1)
).reset_index(drop=True)
# Map term indices to vocabulary terms, in order to get the term (token)
# corresponding to each term index
dff["termNames"] = dff["termIndices"].map(vocabList.__getitem__)
dff['topic'] = dff['topic'].map('topic_{}'.format)

Plot the top `n` terms for each topic

In [None]:
%%time
vh.altair_plot_grid_by_column(
    dff,
    xvar="termWeights",
    yvar="termNames",
    col2grid="topic",
    space_between_plots=10,
    row_size=1,
    fig_size=(150, 200),
)

**Notes**
1. This was the chart used to manually tune hyper-parameters, as mentioned earlier.

We'll now export this small `DataFrame` with the weights of the terms for each topic to a CSV file in the sub-folder named by `s3_models_subfolder` (see **User Inputs**), in the same S3 bucket containing the sub-folder with the hourly CSV files that were loaded earlier

In [None]:
%%time
print(f"Starting time = {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}...", end="")
timestr = time.strftime("%Y%m%d_%H%M%S")
full_file_path = (
    f"s3://{s3_bucket_name}{path_to_folder}"
    f"{s3_models_subfolder}/term_weights_{timestr}.csv"
)
dff.to_csv(full_file_path)
print(f"done at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}.")

### Topics per Document

Get the topics per document, i.e. make predictions with the trained LDA model

In [None]:
%%time
df_topics_matrix = model.transform(df_text_processed_cached)
with pd_option_context('display.max_colwidth', 1_000):
    display(
        show_pyspark_df(
            df_topics_matrix.select(
                ["document", "created_at", "user_joined", "reviewText", "topicDistribution"]
            ),
            3,
        )
    )

**Notes**
1. The LDA model's prediction is a vector of topic probabilities (that add up to 1.0) for each row (document) in the data. This vector is shown in the `topicDistribution` column above and is referred to as the distribution of topics for each document ([1](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.clustering.LDA.html#pyspark.ml.clustering.LDA.topicDistributionCol), [2](https://stackoverflow.com/questions/33072449/extract-document-topic-matrix-from-pyspark-lda-model), [3](https://stackoverflow.com/questions/49740675/how-to-get-the-topic-probability-for-each-document-for-topic-modeling-using-lda), [4](https://www.mathworks.com/help/textanalytics/ref/ldamodel.html#d123e21088)). LDA's predictions don't give a single discrete topic but, instead, return this vector, and the user must decide how to interpret the values in each vector.
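
One possible way to interpret such a vector, sketched below in plain Python with made-up values, is to keep every topic whose probability reaches a chosen threshold instead of committing to a single topic. The threshold of 0.3 and the helper name `topics_above_threshold` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import math
from typing import Dict, Sequence


def topics_above_threshold(
    distribution: Sequence[float], threshold: float = 0.3
) -> Dict[str, float]:
    """Return the topics (by placeholder name) whose probability is at least `threshold`."""
    if not math.isclose(sum(distribution), 1.0, abs_tol=1e-6):
        raise ValueError("topic probabilities are expected to add up to 1.0")
    return {f"topic_{i}": p for i, p in enumerate(distribution) if p >= threshold}


# Hypothetical topic distribution for one document, with four topics
example_distribution = [0.05, 0.45, 0.40, 0.10]
selected = topics_above_threshold(example_distribution)
```

With these values two topics pass the threshold, which shows why a single label can hide a document that is split almost evenly between topics.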

Check if cached

In [None]:
df_topics_matrix.storageLevel.useMemory

Cache the LDA model's predictions since they will be used to explode the vectors of topic distribution into separate columns and then query those columns

In [None]:
%%time
df_topics_matrix_cached = df_topics_matrix.cache()

In [None]:
# %%time
# IGNORE
# df_topics_matrix_pandas = df_topics_matrix_cached.toPandas()

We'll now extract the vector of topic probabilities into separate columns

In [None]:
%%time
ith = F.udf(mih.ith_, T.DoubleType())
df_processed_topics_matrix = df_topics_matrix_cached.select(
    df_topics_matrix_cached.columns
    + [
        ith("topicDistribution", F.lit(i)).alias("topic_" + str(i))
        for i in range(num_topics)
    ]
)
show_pyspark_df(df_processed_topics_matrix)

Finally, we'll extract a new column (`dominant_topic`) with the topic with the highest probability predicted by the LDA model (`dominant_prob`)
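
The idea behind these two columns can be sketched in plain Python for a single document. This is only an illustration under assumed values: `dominant_topic` is a hypothetical helper, and the implementation of `mih.get_max_val_name` is not shown in this notebook and may differ.

```python
from typing import Dict, Tuple


def dominant_topic(topic_probs: Dict[str, float]) -> Tuple[float, str]:
    """Return (highest probability, name of the topic holding it) for one document."""
    name = max(topic_probs, key=topic_probs.get)
    return topic_probs[name], name


# Hypothetical per-topic probabilities for one document (one value per topic column)
row = {"topic_0": 0.05, "topic_1": 0.45, "topic_2": 0.40, "topic_3": 0.10}
dominant_prob, dominant_topic_name = dominant_topic(row)
```

When two topics tie, `max` returns the first one it encounters, so the column order decides the winner in this sketch.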

In [None]:
%%time
df_processed_topics_matrix = mih.get_max_val_name(
    df_processed_topics_matrix, topic_cols, ["dominant_prob", "dominant_topic"]
)
show_pyspark_df(df_processed_topics_matrix)

Check if cached

In [None]:
df_processed_topics_matrix.storageLevel.useMemory

Cache the data with these two new columns added, since it will be used in multiple queries (next) to print out the raw tweets (texts) for reading

In [None]:
%%time
df_processed_topics_matrix_cached = df_processed_topics_matrix.cache()

A random sample of the documents (tweets) predicted with the highest probability to belong to each topic is printed out below for reading
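
The "top documents, then random sample" step can be sketched in plain Python as follows. The document ids, probabilities and the helper `sample_top_documents` are hypothetical. Note that Spark's `DataFrame.sample(fraction=...)` returns approximately the requested fraction of rows, so the number of sampled tweets can vary between runs, whereas this sketch draws an exact number.

```python
import random
from typing import List, Sequence, Tuple

Document = Tuple[str, float]  # (document id, predicted probability for the topic)


def sample_top_documents(
    docs: Sequence[Document], top_n: int = 50, k: int = 5, seed: int = 42
) -> List[Document]:
    """Keep the `top_n` documents by probability, then draw up to `k` of them at random."""
    top_docs = sorted(docs, key=lambda doc: doc[1], reverse=True)[:top_n]
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    return rng.sample(top_docs, min(k, len(top_docs)))


# Hypothetical documents with probabilities 0.00, 0.01, ..., 0.99
docs = [(f"doc_{i}", i / 100) for i in range(100)]
sampled = sample_top_documents(docs, top_n=50, k=5)
```

Capping the sample size at the number of available documents avoids an error for topics with very few documents.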

In [None]:
%%time
visual_sep = '=' * 25
for q, topic_col in enumerate([f"topic_{n}" for n in range(num_topics)]):
    # Get top 50 documents based on predicted probability, within a single topic
    df_single_topic = df_processed_topics_matrix_cached.select(
        cols_to_show_when_reading
    ).filter(
        f"dominant_topic == '{topic_col}'"
    ).orderBy(topic_col, ascending=[False]).limit(50)
    # Get a random sample of the top 50 documents (tweets)
    n_rows_proc = df_single_topic.count()
    df_single_topic = df_single_topic.sample(
        withReplacement=False,
        # guard against an empty topic (division by zero) and a fraction above 1.0
        fraction=min(1.0, num_top_docs_to_read / max(n_rows_proc, 1)),
    )
    # Convert sample to pandas (this will be a small DataFrame of < 10 rows)
    df_single_topic_pandas = df_single_topic.toPandas()

    # Create string of top n terms and term weights
    term_weights_names_str = ", ".join(
        [
            f"{row['termNames']} = {row['termWeights']:.4f}"
            for _, row in (
                dff.query(f"topic == '{topic_col}'").iloc[:, -2:].iterrows()
            )
        ]
    )
    if q > 0:
        print("\n")
    print(f"{visual_sep} topic = {topic_col} {visual_sep}\n{term_weights_names_str}\n")
    # Print random sample of the text of the tweet from the n documents with the highest
    # predicted probability of belonging to a given topic
    # - this involves iterating over the rows of the small pandas DataFrame created above
    for idx, row in df_single_topic_pandas.iterrows():
        topic_probs_str = ", ".join(
            [
                f"{topic_name_str} = {topic_prob_value:.3f}"
                for topic_name_str, topic_prob_value in row[topic_cols]
                .sort_values(ascending=False)
                .items()
            ]
        )
        print(f"document = {row['document']}: {topic_probs_str}\n{row['reviewText'].strip()}")
        if idx < len(df_single_topic_pandas) - 1:
            print("\n")

**Notes**
1. These should allow for fine-tuning the choices of topic names determined by inspecting the top terms in each topic.

**Observations**
1. When 10-12 topics were chosen, a number of different topics contained tweets about the same subject. This is an indicator that the number of topics must be reduced. When 5-7 topics were picked, the overlap reduced sufficiently that it seemed like there was some distinction between the topics. **From the runs of this notebook with varying `num_topics` (number of topics passed to the LDA algorithm), the optimal choice for the number of topics in the streamed twitter data was determined to be 4, based on the best combination of (a) reading the content of the tweets with the highest predicted probability of belonging to a topic and (b) interpreting the top terms (by weight) per topic.**
2. It is possible that there are two choices for the optimal number of topics
   - choosing a smaller number that reveals the high-level topics (this appears to be the case here)
   - choosing a larger number that reveals the low-level topics, possibly sub-topics within each of the high-level topics
     - from the preliminary tuning of `num_topics` done here, using this approach, it was not possible to extract meaningful topics that were also able to capture the same subject (in the text of the tweets) as the subject suggested by the top terms (by weight) in each topic
3. It is clear that further pre-processing is required to filter out the *leftover duplicated tweets*. Without doing this, it can be difficult to use the output printed above to fine-tune / verify the name of the topic found using the top words in each topic, since multiple (duplicated) versions of the same tweet appear in the random sample of the top `n` tweets (documents) within each topic. We want these printed tweets to be unique since such text will help to verify the name assigned to the topics. Further filtering to remove these duplicates (by extending the list `unwanted_partial_strings_list` used to filter tweets in `3_combined_data.ipynb`) does appear necessary before reading tweets will be useful for naming topics and picking an appropriate number of topics.
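
One complementary way to reduce such leftover duplicates, sketched here in plain Python, is to compare tweets after a light normalization (dropping URLs, mentions and a leading `RT`, then collapsing whitespace and case). This is an assumption about what makes two tweets "the same" and is not the filtering done with `unwanted_partial_strings_list`; the helper names and example tweets are hypothetical.

```python
import re
from typing import Iterable, List

URL_PATTERN = re.compile(r"https?://\S+")
MENTION_PATTERN = re.compile(r"@\w+")
RETWEET_PREFIX_PATTERN = re.compile(r"^\s*rt\b[\s:]*", flags=re.IGNORECASE)
WHITESPACE_PATTERN = re.compile(r"\s+")


def normalize_tweet(text: str) -> str:
    """Reduce a tweet to a comparison key: no URLs, mentions or RT prefix, lower case."""
    text = URL_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    text = RETWEET_PREFIX_PATTERN.sub("", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip().lower()


def drop_near_duplicates(tweets: Iterable[str]) -> List[str]:
    """Keep the first tweet seen for every normalized key, preserving the input order."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = normalize_tweet(tweet)
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique


# Hypothetical tweets: the first three differ only by URL, mention and case
tweets = [
    "RT @nasa: Launch today! https://t.co/abc",
    "Launch today!   https://t.co/xyz",
    "launch TODAY! @esa",
    "Mars rover update",
]
unique_tweets = drop_near_duplicates(tweets)
```

The same normalization could be expressed as a derived Spark column followed by `dropDuplicates` on that column, if this notion of a duplicate suits the data.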

### Naming the Topics

Based on the top terms per topic and reading the tweets predicted (with a high probability) to belong to a given topic, the following are the names assigned to the topics
- topic 3
  - All about Satellites and Telescopes for Space Exploration
    - Space Mission updates
    - space research competition
    - space explorer comic book
- topic 2
  - Activities related to Space Research
    - People and companies associated with advancing research in the space sector
    - shuttle launch tests
    - opening of space research facility
- topic 1
  - [Astrology](https://undsci.berkeley.edu/article/astrology_checklist) predictions
    > Astrology uses a set of rules about the relative positions and movements of heavenly bodies to generate predictions and explanations for events on Earth and human personality traits.
- topic 0
  - Astronomy
    - satellite images of the Earth and Moon
    - people involved in astronomy research
    - satellite centers associated with facilities also appear here

Use a Python dictionary to replace placeholder values in the `dominant_topic` column of the cached `DataFrame` above with the names assigned above

In [None]:
mapping = {
    "topic_3": "Satellites and Telescopes",
    "topic_2": "Activities related to Space Research",
    "topic_1": "Astrology",
    "topic_0": "Astronomy",
}

Show the number of missing values in the mapped `dominant_topic` that would result when applying this mapping (there should be no missing values)

In [None]:
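# a value is treated as missing if it is null or was not replaced by a name from the mapping
df_processed_topics_matrix_cached.withColumn(
    "dominant_topic_named", F.col("dominant_topic")
).replace(to_replace=mapping, subset=["dominant_topic_named"]).where(
    F.col("dominant_topic_named").isNull()
    | ~F.col("dominant_topic_named").isin(list(mapping.values()))
).count()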

**Observations**
1. There are no missing values after applying the mapping.
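
A dictionary-based replacement leaves values without a matching key unchanged instead of turning them into nulls, so it can also be useful to look for placeholder names that the mapping does not cover. The plain-Python sketch below shows that check on made-up labels; `find_unmapped_labels` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Optional


def find_unmapped_labels(
    labels: Iterable[Optional[str]], mapping: Dict[str, str]
) -> List[str]:
    """Return the distinct non-null labels that have no entry in `mapping`."""
    return sorted({label for label in labels if label is not None and label not in mapping})


# Hypothetical mapping and labels; "topic_7" has no assigned name
example_mapping = {"topic_0": "Astronomy", "topic_1": "Astrology"}
example_labels = ["topic_0", "topic_1", "topic_0", "topic_7", None]
unmapped = find_unmapped_labels(example_labels, example_mapping)
```

An empty result means every placeholder name would be replaced by an assigned topic name.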

Apply the mapping

In [None]:
%%time
df_processed_topics_matrix_cached = df_processed_topics_matrix_cached.withColumn(
    "dominant_topic_named",
    F.col("dominant_topic")
).replace(
    to_replace=mapping,
    subset=['dominant_topic_named']
)

We'll also re-generate the above plot with the new topic names

In [None]:
%%time
vh.altair_plot_grid_by_column(
    dff.replace({"topic": mapping}),
    xvar="termWeights",
    yvar="termNames",
    col2grid="topic",
    space_between_plots=10,
    row_size=1,
    fig_size=(150, 200),
)

## Merge with Processed Data and Export to S3 `predictions/` sub-folder

### Merge

The processed data is shown below (this was before duplicates in the `reviewText_processed` column were dropped and after tweets were filtered by their length)

In [None]:
%%time
show_pyspark_df(df_text_processed_no_blanks.select(all_cols_to_process), 3)

**Notes**
1. The predicted topics are only valid for the tweets (texts) that are sufficiently long. For this reason, we can only merge with the processed data that was prepared by removing the short tweets and cannot merge with the raw data since there would be a lot of tweets without a predicted topic.

We'll now `LEFT JOIN` this with the filtered data that was used for LDA analysis, so that we can get the topic assigned to all duplicates of a particular tweet. The join will be performed on the processed version of the tweet text column
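
The effect of this join can be sketched in plain Python with a dictionary lookup on the processed text. The rows below are made up and `left_join_topics` is a hypothetical helper: every duplicate of a tweet receives the same topic, and a row whose processed text was never seen by the model receives `None`.

```python
from typing import Dict, List, Optional


def left_join_topics(
    processed_rows: List[Dict[str, str]], topic_by_processed_text: Dict[str, str]
) -> List[Dict[str, Optional[str]]]:
    """Attach a topic to every row, or None when the processed text has no topic."""
    return [
        {**row, "dominant_topic_named": topic_by_processed_text.get(row["reviewText_processed"])}
        for row in processed_rows
    ]


# Hypothetical processed rows: the first two are duplicates after text processing
processed_rows = [
    {"reviewText": "Launch today! https://t.co/a", "reviewText_processed": "launch today"},
    {"reviewText": "Launch today! https://t.co/b", "reviewText_processed": "launch today"},
    {"reviewText": "Good morning", "reviewText_processed": "good morning"},
]
topic_by_processed_text = {"launch today": "Activities related to Space Research"}
merged = left_join_topics(processed_rows, topic_by_processed_text)
```

Rows on the left side are never dropped by a left join, which is why missing topic names have to be counted after the merge.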

In [None]:
%%time
df_processed_with_topics = df_text_processed_no_blanks.select(all_cols_to_process + ["reviewText_processed"]).alias("left").join(
    df_processed_topics_matrix_cached.select(["reviewText", "reviewText_processed", "dominant_topic_named", "dominant_prob"]).alias("right"),
    on=["reviewText_processed"],
    how="left"
)
show_pyspark_df(df_processed_with_topics, 3)

Since the raw tweet text column (`reviewText`) was not involved in the `JOIN` but is present on both the LHS and RHS, both versions will appear in the merged data. Find all rows where these two columns do not agree with each other after the `JOIN` was performed

In [None]:
%%time
with pd_option_context("display.max_colwidth", 5000):
    display(
        show_pyspark_df(
            df_processed_with_topics
            .withColumn("d", F.col("left.reviewText") == F.col("right.reviewText"))
            .where(F.col("d") == False)
            .select(
                [
                    F.col("left.reviewText").alias("reviewText_left"),
                    F.col("right.reviewText").alias("reviewText_right"),
                ]
            ),
            100
        ).sample(10)
    )

**Notes**
1. Above is a random sample of 10 rows where these two columns do not agree with each other.

**Observations**
1. Above is a comparison of the `reviewText` columns on either side of the `LEFT JOIN` that do not match each other. It is not clear why most of these rows do not match. The column on the left comes from the processed data and the column on the right from the processed and de-duplicated data that was passed through the NLP pipeline and LDA. Since there don't appear to be clearly visible differences in most of these rows, we'll
   - use the left version in EDA (next section)
   - keep both versions
   - rename the columns by adding the `_left` and `_right` suffix respectively
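
When two strings look identical on screen, the difference is sometimes an invisible character or trailing whitespace. The sketch below is one possible way to investigate such rows: it reports the position and the Unicode names of the first characters that differ. It is only a diagnostic idea with hypothetical helper names and made-up strings, not an explanation of the mismatches seen above.

```python
import unicodedata
from typing import Optional, Tuple


def describe_char(text: str, index: int) -> str:
    """Name the character at `index`, or report that the string has already ended."""
    if index >= len(text):
        return "<end of string>"
    char = text[index]
    return unicodedata.name(char, repr(char))


def first_difference(left: str, right: str) -> Optional[Tuple[int, str, str]]:
    """Return (index, left character, right character) at the first mismatch, else None."""
    for index in range(max(len(left), len(right))):
        if left[index : index + 1] != right[index : index + 1]:
            return index, describe_char(left, index), describe_char(right, index)
    return None


# Hypothetical pair that differs only by a no-break space
difference = first_difference("moon landing", "moon\u00a0landing")
```

Applying such a helper to a handful of the mismatched pairs collected into pandas could show whether the two versions differ by whitespace, encoding or genuinely different content.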

Rename `reviewText` columns by appending a suffix

In [None]:
%%time
col_renaming_dict = {'left.reviewText': 'reviewText_left', 'right.reviewText': 'reviewText_right'}
for k, v in col_renaming_dict.items():
    df_processed_with_topics = df_processed_with_topics.withColumn(v, F.col(k)).drop(F.col(k))

Check if cached

In [None]:
df_processed_with_topics.storageLevel.useMemory

Cache the merged data, which will be used to count missing values next

In [None]:
%%time
df_processed_with_topics_cached = df_processed_with_topics.cache()

Show the number of rows in the processed data (LHS of the `LEFT JOIN`)

In [None]:
%%time
df_text_processed_no_blanks.count()

Show the number of rows in the merged data (result of the `LEFT JOIN`)

In [None]:
%%time
df_processed_with_topics_cached.count()

Verify that there are no missing values in the merged data

In [None]:
%%time
# count missing values in the tweet text of the processed data (LHS of the LEFT JOIN)
df_text_processed_no_blanks.where(F.col("reviewText").isNull()).count()

In [None]:
%%time
# count missing values in the left-hand version of the tweet text in the merged data
df_processed_with_topics_cached.where(F.col("reviewText_left").isNull()).count()

In [None]:
%%time
# count missing values in the right-hand version of the tweet text in the merged data
df_processed_with_topics_cached.where(F.col("reviewText_right").isNull()).count()

Count the number of rows with a missing value in the predicted topic name column

In [None]:
%%time
df_processed_with_topics_cached.where(F.col("dominant_topic_named").isNull()).count()

### Export to `predictions/` sub-folder in S3 Bucket

Show the number of PySpark `DataFrame` partitions in the merged data

In [None]:
df_processed_with_topics_cached.rdd.getNumPartitions()

We'll now export this merged data to the `predictions/` sub-folder in the same S3 bucket containing the sub-folder with the hourly CSV files that were loaded earlier

In [None]:
%%time
print(f"Starting time = {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}...", end="")
timestr = time.strftime("%Y%m%d_%H%M%S")
full_file_path = (
    f"s3a://{s3_bucket_name}{path_to_folder}predictions/"
    f"{output_file_name}_{timestr}.parquet.gzip"
)
(
    df_processed_with_topics_cached
    .write.mode("overwrite")
    .option('compression', 'gzip')
    .parquet(full_file_path)
)
print(f"Done at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}.")

### (Optional) Read Exported Data from the `predictions/` sub-folder in S3 Bucket

Get path to pre-existing Parquet file saved in the `predictions/` sub-folder in S3 bucket
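
Spark writes a Parquet dataset as a directory that contains a `_SUCCESS` marker, so the dataset path can be derived from the marker's key. When doing so, the suffix should be removed as a whole string, since `str.rstrip` treats its argument as a set of characters and can remove more than intended. The sketch below illustrates the difference with a made-up key; `dataset_path_from_marker` is a hypothetical helper.

```python
SUCCESS_SUFFIX = "/_SUCCESS"


def dataset_path_from_marker(key: str) -> str:
    """Return the dataset directory for a key that ends with the _SUCCESS marker."""
    if not key.endswith(SUCCESS_SUFFIX):
        raise ValueError(f"{key!r} does not end with {SUCCESS_SUFFIX!r}")
    return key[: -len(SUCCESS_SUFFIX)]


# Hypothetical key whose directory name ends in characters found in "/_SUCCESS"
marker_key = "data/TOPICS/_SUCCESS"
stripped_as_character_set = marker_key.rstrip("/_SUCCESS")  # also removes the trailing "CS"
stripped_as_suffix = dataset_path_from_marker(marker_key)
```

For directory names ending in `.parquet.gzip` both approaches give the same result, because `p` is not one of the stripped characters, but the suffix-based version does not depend on how the directory is named.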

In [None]:
%%time
parquet_files_list = s3h.get_existing_csv_files_list(
    s3_bucket_name, path_to_folder[1:] + f"predictions/{output_file_name}_"
)
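# str.rstrip() strips a set of characters, not a suffix, so slice the suffix off instead
success_suffix = "/_SUCCESS"
main_parquet_files_list = [
    f[: -len(success_suffix)] for f in parquet_files_list if f.endswith(success_suffix)
]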
print(main_parquet_files_list)

Assemble dictionary with full filepath to parquet file

In [None]:
full_file_paths_dict = {
    "pyspark": f"s3a://{s3_bucket_name}/{main_parquet_files_list[0]}",
    "pandas": f"s3://{s3_bucket_name}/{main_parquet_files_list[0]}",
}

#### PySpark

Check that we can read the exported data back with PySpark

In [None]:
%%time
df_reloaded = spark.read.parquet(full_file_paths_dict["pyspark"])
show_pyspark_df(df_reloaded, 3)

Check if cached

In [None]:
df_reloaded.storageLevel.useMemory

Cache the reloaded data

In [None]:
%%time
df_reloaded_cached = df_reloaded.cache()

Show (a `DataFrame` version of) the Spark Schema

In [None]:
df_dtypes_pyspark = pd_DataFrame.from_records(
    [
        {"name": field.name, "dtype": field.dataType, "nullable": field.nullable}
        for field in df_reloaded.schema.fields
    ]
).set_index("name")
df_dtypes_pyspark

Show the number of rows in the re-loaded PySpark `DataFrame`

In [None]:
%%time
df_reloaded_cached.count()

Show the number of PySpark `DataFrame` partitions in the reloaded data

In [None]:
%%time
df_reloaded_cached.rdd.getNumPartitions()

#### `dask.DataFrame`

Check that we can read the exported data back with Dask `DataFrame`

In [None]:
import dask.dataframe as dd

In [None]:
%%time
ddf = dd.read_parquet(full_file_paths_dict["pandas"], engine="auto")
ddf.head(3)

Show the datatypes

In [None]:
ddf.dtypes.rename("dtype").rename_axis("name").to_frame()

Show the number of rows in the re-loaded `dask.DataFrame` (DDF)

In [None]:
%%time
len(ddf)

Show the number of `dask.DataFrame` partitions in the reloaded data

In [None]:
%%time
ddf.npartitions

Get length of each partition in the re-loaded DDF

In [None]:
%%time
ddf_partition_sizes = (
    ddf.map_partitions(len)
    .compute()
    .rename("num_rows_in_partition")
    .reset_index()
    .rename(columns={"index": "partition_index"})
)

In [None]:
with pd_option_context("display.max_rows", 200):
    display(ddf_partition_sizes.sample(n=10))

#### `pandas`

Check that we can read this back with Pandas (this will only be possible if the data is small enough to fit into memory; if not, then `dask` or `pyspark` are required)

In [None]:
%%time
df_reloaded_pandas = pd_read_parquet(full_file_paths_dict["pandas"], engine="auto")
df_reloaded_pandas.head(3)

Show the datatypes

In [None]:
df_reloaded_pandas.dtypes.rename("dtype").rename_axis("name").to_frame()

Show the number of rows in the re-loaded `pandas.DataFrame`

In [None]:
len(df_reloaded_pandas)

## EDA of Data with Assigned Topic Names

This section will include a brief exploration of the data with the predicted topic names using the data reloaded into the `PySpark` `DataFrame`.

### Pre-Processing

Drop rows with a missing topic name in the `dominant_topic_named` column (see discussion about this in the **Merge with Processed Data and Export to S3** section)

In [None]:
%%time
# # Pandas
# import pandas as pd
# df_reloaded_pandas = df_reloaded_pandas.dropna(subset=["dominant_topic_named"])

# PySpark
df_reloaded_cached = df_reloaded_cached.dropna(subset=["dominant_topic_named"])

Convert timestamp columns to `datetime` format
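
The derived columns can be illustrated for a single timestamp in plain Python. This sketch assumes that `created_at` is a string formatted as `YYYY-MM-DD HH:MM:SS`, which may not match the stored data, and `datetime_features` is a hypothetical helper. The abbreviated weekday mirrors the `"E"` pattern of `date_format`, which gives values such as `Mon`.

```python
from datetime import datetime
from typing import Dict, Union

WEEKDAY_ABBREVIATIONS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def datetime_features(
    created_at: str, fmt: str = "%Y-%m-%d %H:%M:%S"
) -> Dict[str, Union[int, str]]:
    """Derive the hour, abbreviated weekday and date from one timestamp string."""
    dt = datetime.strptime(created_at, fmt)
    return {
        "created_at_hour": dt.hour,
        "created_at_weekday": WEEKDAY_ABBREVIATIONS[dt.weekday()],
        "created_at_date": dt.date().isoformat(),
    }


# Hypothetical timestamp (3 January 2022 was a Monday)
features = datetime_features("2022-01-03 14:05:00")
```

A lookup list is used for the weekday instead of `strftime("%a")` so that the result does not depend on the locale.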

In [None]:
%%time
# # Pandas
# df_reloaded_pandas["created_at_hour"] = pd.to_datetime(
#     df_reloaded_pandas["created_at"]
# ).dt.hour
# df_reloaded_pandas["created_at_weekday"] = pd.to_datetime(
#     df_reloaded_pandas["created_at"]
# ).dt.day_name()
# df_reloaded_pandas['created_at_dt'] = pd.to_datetime(df_reloaded_pandas['created_at']).dt.date

# PySpark
df_reloaded_cached = df_reloaded_cached.withColumn(
    "created_at_dt", F.date_format(df_reloaded_cached.created_at, "yyyy-MM-dd HH:mm:ss")
)
df_reloaded_cached = df_reloaded_cached.withColumn(
    "created_at_hour", F.hour("created_at_dt")
).withColumn("created_at_weekday", F.date_format("created_at_dt", "E"))
df_reloaded_cached = df_reloaded_cached.withColumn("created_at_date", F.to_date("created_at_dt"))

### Business Questions about the Data

**1. Get the 10 most common Twitter user screen names who received a reply to any of their tweets**

In [None]:
%%time
# # Pandas
# df_reloaded_pandas["in_reply_to_screen_name"].value_counts().nlargest(10).to_frame()

# PySpark
df_most_replied_to_user_toPandas = (
    df_reloaded_cached.groupBy(["in_reply_to_screen_name"])
    .count()
    .orderBy(["count"], ascending=False)
    .toPandas()
)
df_most_replied_to_user_toPandas.head(10)

**Observations**
1. The majority of the tweets were not posted as a reply to another user.

**2. What are the 20 Twitter clients that were most frequently used to post a tweet?**

In [None]:
%%time
# # Pandas
# df_reloaded_pandas["source_text"].value_counts().nlargest(20).to_frame()

# PySpark
df_top_clients_toPandas = (
    df_reloaded_cached.groupBy(["source_text"])
    .count()
    .orderBy(["count"], ascending=False)
    .toPandas()
)
df_top_clients_toPandas.head(20)

**3. For the top 7 most frequently used Twitter clients, show an appropriate chart of the frequency (number of tweets) for each of the named topics in this dataset.**

Get a list with the top seven most frequently used Twitter clients (using result of above question)

In [None]:
sources_wanted = df_top_clients_toPandas["source_text"].tolist()[:7]
sources_wanted

Count the number of tweets posted from each of these top 7 clients, by topic

In [None]:
%%time
# # Pandas
# df_reloaded_pandas[df_reloaded_pandas["source_text"].isin(sources_wanted)].groupby(
#     ["source_text", "dominant_topic_named"], as_index=False
# ).size()

# PySpark
df_sources_grouped = df_reloaded_cached.filter(F.col("source_text").isin(sources_wanted)).groupBy(
    ["source_text", "dominant_topic_named"]
).count().orderBy(["source_text", "dominant_topic_named"], ascending=[True, True])
show_pyspark_df(df_sources_grouped)

**Observations**
1. Twitter for iPhone seems to be the preferred platform for users posting tweets about Astrology. The absolute number of tweets, within this topic, is the highest for the twitter app on the iPhone. On all other platforms, the fewest tweets posted fall under the Astrology topic. However, on the iPhone app, Astrology is among the highest.

Pivot the clientwise grouped data keeping client along the rows and topics along the columns
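
As a plain-Python illustration of this reshaping, the sketch below turns long-format rows of `(client, topic, count)` into one dictionary of topic counts per client. The rows are made up and `pivot_counts` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pivot_counts(rows: Iterable[Tuple[str, str, int]]) -> Dict[str, Dict[str, int]]:
    """Pivot (client, topic, count) rows into {client: {topic: summed count}}."""
    table: Dict[str, Dict[str, int]] = defaultdict(dict)
    for client, topic, count in rows:
        table[client][topic] = table[client].get(topic, 0) + count
    return {client: dict(topics) for client, topics in table.items()}


# Hypothetical grouped counts in long format
grouped_rows = [
    ("Twitter for iPhone", "Astrology", 120),
    ("Twitter for iPhone", "Astronomy", 80),
    ("Twitter Web App", "Astronomy", 95),
]
pivoted = pivot_counts(grouped_rows)
```

A client with no tweets in a topic simply has no entry for it here, which corresponds to a null cell in the pivoted Spark `DataFrame`.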

In [None]:
%%time
distinct_column_vals = [
    v.asDict()["dominant_topic_named"]
    for v in df_reloaded_cached.select("dominant_topic_named").distinct().collect()
]
df_sources_pivotted_toPandas = df_sources_grouped.groupBy("source_text").pivot(
    "dominant_topic_named", distinct_column_vals
).sum("count").withColumnRenamed(
    "Activities related to Space Research", "Space Research"
).withColumnRenamed(
    "Satellites and Telescopes", "Satellites / Telescopes"
).toPandas().set_index("source_text")
df_sources_pivotted_toPandas

Re-shape data into format suitable for Plotly `go.Heatmap()`

In [None]:
df_annot = df_sources_pivotted_toPandas.copy()
for col in df_annot:
    df_annot[col] = df_annot[col].map("{:,}".format)
data_dict = vh.convert_df_to_format_for_plotly_heatmap(
    df_sources_pivotted_toPandas, df_annot, True,
)

Plot the heatmap from the reshaped data

In [None]:
%%time
vh.plot_plotly_heatmap(
    data_dict=data_dict,
    annotation_text=data_dict["annotation_text"],
    margin_dict=dict(l=30, r=0, b=0, t=0, pad=0),
    fig_width=900,
)

**4. Show descriptive statistics (min, mean, median and max) about the number of user followers, by topic**

In [None]:
%%time
# # Pandas
# df_reloaded_pandas.groupby(["dominant_topic_named"], as_index=False)[
#     "user_followers"
# ].agg(["min", "mean", "median", "max"])

# PySpark
df_stats_by_topic_toPandas = df_reloaded_cached.groupby(["dominant_topic_named"]).agg(
    F.min(F.col('user_followers')).alias('user_followers_min'),
    F.avg(F.col('user_followers')).alias('user_followers_mean'),
    # F.percentile_approx("user_followers", 0.5).alias("user_followers_median"),  # pyspark>=3.1.0
    F.expr('percentile(user_followers, array(0.5))')[0].alias('user_followers_median'),  # pyspark==2.4.0
    F.max(F.col('user_followers')).alias('user_followers_max'),
).toPandas()
df_stats_by_topic_toPandas

**Observations**
1. Users who tweeted about Astrology have the fewest followers on average.
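
Follower counts tend to be heavily skewed, so the mean and the median can tell different stories. The plain-Python sketch below computes the same four statistics per topic on made-up follower counts; `followers_stats_by_topic` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, Iterable, List, Tuple


def followers_stats_by_topic(
    rows: Iterable[Tuple[str, int]]
) -> Dict[str, Dict[str, float]]:
    """Compute min, mean, median and max of follower counts for every topic."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for topic, followers in rows:
        grouped[topic].append(followers)
    return {
        topic: {
            "min": min(values),
            "mean": mean(values),
            "median": median(values),
            "max": max(values),
        }
        for topic, values in grouped.items()
    }


# Hypothetical (topic, user_followers) pairs with one very large account
rows = [
    ("Astrology", 10),
    ("Astrology", 30),
    ("Astrology", 1000),
    ("Astronomy", 5),
    ("Astronomy", 15),
]
stats = followers_stats_by_topic(rows)
```

In this example a single large account pulls the mean far above the median, which is a reason to report both.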

**5. Show a heatmap of the number of tweets by hour of the day and day of the week, for the most popular topic during every combination of hour and weekday on which Twitter data was streamed. Create a chart from this.**
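
Picking the most popular topic per time slot amounts to keeping, within every group, the row with the largest count. The plain-Python sketch below shows this on made-up rows with a hypothetical helper. For simplicity it groups only by weekday and hour.

```python
from typing import Dict, Iterable, Tuple

Slot = Tuple[str, int]  # (weekday, hour)


def most_tweeted_topic_per_slot(
    rows: Iterable[Tuple[str, int, str, int]]
) -> Dict[Slot, Tuple[str, int]]:
    """Keep the (topic, count) pair with the largest count for every (weekday, hour)."""
    best: Dict[Slot, Tuple[str, int]] = {}
    for weekday, hour, topic, count in rows:
        slot = (weekday, hour)
        # On ties the first topic seen is kept
        if slot not in best or count > best[slot][1]:
            best[slot] = (topic, count)
    return best


# Hypothetical (weekday, hour, topic, count) rows
rows = [
    ("Mon", 9, "Astronomy", 40),
    ("Mon", 9, "Astrology", 55),
    ("Mon", 10, "Astronomy", 70),
]
top_topics = most_tweeted_topic_per_slot(rows)
```

With a window function, ties inside a group are broken arbitrarily by `row_number()` unless a further ordering column is added.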

In [None]:
%%time
# # Pandas
# df_reloaded_pandas.groupby(
#     ["created_at_weekday", "created_at_hour", "dominant_topic_named"], as_index=False
# )["document"].count().sort_values("document", ascending=False).groupby(
#     ["created_at_weekday", "created_at_hour"]
# ).first().reset_index()

# PySpark
df_reloaded_cached_dt_agg = df_reloaded_cached.groupBy(
    ["created_at_date", "created_at_weekday", "created_at_hour", "dominant_topic_named"]
).count().orderBy(["count"], ascending=False)
w = Window.partitionBy(
    ["created_at_date", "created_at_weekday", "created_at_hour"]
).orderBy(F.desc("count"))
df_most_tweeted_topics = df_reloaded_cached_dt_agg.withColumn(
    "row_number", F.row_number().over(w)
).where("row_number = 1").drop("row_number")
df_most_tweeted_topics_toPandas = df_most_tweeted_topics.toPandas()
df_most_tweeted_topics_toPandas

Convert the `created_at_date` column to a string, which Altair can serialize to JSON

In [None]:
from pandas import to_datetime as pd_to_datetime

date_col_name = "created_at_date"
df_most_tweeted_topics_toPandas[date_col_name] = pd_to_datetime(
    df_most_tweeted_topics_toPandas[date_col_name]
).dt.strftime("%Y-%m-%d")

Plot the heatmap

In [None]:
%%time
import altair as alt

chart = vh.plot_altair_heatmap(
    data=df_most_tweeted_topics_toPandas,
    legend=alt.Legend(
        title="Average Number of Tweets",
        orient="none",
        legendX=250,
        titleAnchor="start",
        direction="vertical",
    ),
    tooltip=[
        alt.Tooltip("created_at_hour:O", title="Hour"),
        alt.Tooltip("created_at_weekday:N", title="Weekday"),
        alt.Tooltip("created_at_date:N", title="Date"),
        alt.Tooltip("dominant_topic_named:N", title="Topic"),
        alt.Tooltip("mean(count):Q", title="Avg. Number of Tweets", format=","),
    ],
    agg="mean",
    xvar="created_at_weekday",
    yvar="created_at_hour",
    color_by_col="count",
    ptitle="Tweets During the Day",
    sort_x=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    sort_y=list(range(0, 23 + 1)),
    marker_linewidth=1,
    cmap="yelloworangered",
    scale="log",
    show_x_labels=True,
    show_y_labels=True,
    fig_size=(240, 750),
).configure_view(strokeWidth=0).configure_axis(
    domain=False, labelFontSize=14, titleFontSize=16
).configure_title(
    anchor="middle", fontSize=16
).configure_legend(labelFontSize=14)
display(chart)

**6. Did any verified or protected Twitter users post tweets?**

In [None]:
%%time
df_reloaded_cached.where(F.col("user_protected") == True).count()

In [None]:
%%time
df_reloaded_cached.where(F.col("user_verified") == True).count()

**Observations**
1. There were no tweets by protected users.
2. There were approximately 4,000 tweets by verified users, out of a total of approximately 240,000 tweets with a topic.

**7. In how many tweets is the user's location available? In how many is it missing?**

Count the non-missing and missing values in the `place_country` column

In [None]:
%%time
df_reloaded_cached.where(~F.col("place_country").isNull()).count()

In [None]:
%%time
df_reloaded_cached.where(F.col("place_country").isNull()).count()

Count the non-missing and missing values in the `user_location` column

In [None]:
%%time
df_reloaded_cached.where(F.col("user_location").isNull()).count()

In [None]:
%%time
df_reloaded_cached.where(F.col("user_location") == "None").count()

In [None]:
%%time
df_reloaded_cached.where(~(F.col("user_location") == "None")).count()

**Observations**
1. Out of approximately 240,000 tweets
   - approximately 1,000 tweets are not missing a value in the `place_country` column
   - approximately 154,000 tweets are not missing a value in the `user_location` column

**8. What are the top 50 user locations (by number of tweets) from which tweets were posted?**

In [None]:
%%time
# # Pandas
# df_reloaded_pandas["user_location"].value_counts().nlargest(50).to_frame()

# PySpark
df_top_locations_toPandas = (
    df_reloaded_cached
    .where(~(F.col("user_location").isin(["Earth", "Planet Earth", "she/her", "None"])))
    .groupBy(["user_location"])
    .count()
    .orderBy(["count"], ascending=False)
    .toPandas()
)
print(f"Tweets were posted from {df_top_locations_toPandas['user_location'].nunique():,} unique locations")
df_top_locations_toPandas.head(50)

In [None]:
top_50_user_locations = df_top_locations_toPandas["user_location"].head(50).tolist()

**9. Find the name of the country containing each of the top 50 user locations from question 8.**
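
The rule-based lookup used for this question can be sketched in plain Python as an ordered list of rules, where the first matching rule wins. A SQL pattern such as `"% CA"` matches any text that ends with `" CA"`, which corresponds to `str.endswith` below. The rules shown are a small, made-up subset, and `location_to_country` is a hypothetical helper.

```python
from typing import List, Set, Tuple

# (country, exact location names, location suffixes); only a small illustrative subset
Rule = Tuple[str, Set[str], Tuple[str, ...]]
RULES: List[Rule] = [
    ("USA", {"USA", "United States"}, (" CA", " NY", " TX", " USA")),
    ("UK", {"UK", "United Kingdom", "London"}, (" United Kingdom", " England")),
    ("India", {"India"}, (" India",)),
]


def location_to_country(location: str, default: str = "Other") -> str:
    """Return the country of the first rule that matches, else `default`."""
    for country, exact_names, suffixes in RULES:
        if location in exact_names or location.endswith(suffixes):
            return country
    return default


# Hypothetical free-text user locations
countries = [location_to_country(loc) for loc in ["Houston, TX", "London", "Mumbai, India", "Mars"]]
```

Keeping the rules in a data structure makes duplicated or overlapping patterns easier to spot than in a long chain of conditions.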

In [None]:
%%time
df_reloaded_cached_with_country = df_reloaded_cached.withColumn(
    "country",
    F.when(
        df_reloaded_cached.user_location.like("% CA")
        | df_reloaded_cached.user_location.like("% DC")
        | df_reloaded_cached.user_location.like("% NY")
        | df_reloaded_cached.user_location.like("% WA")
        | df_reloaded_cached.user_location.like("% TX")
        | df_reloaded_cached.user_location.like("% IL")
        | df_reloaded_cached.user_location.like("% FL")
        | df_reloaded_cached.user_location.like("% MD")
        | df_reloaded_cached.user_location.like("% PA")
        | df_reloaded_cached.user_location.like("%, OR")
        | df_reloaded_cached.user_location.like("% MA")
        | df_reloaded_cached.user_location.like("% AZ")
        | df_reloaded_cached.user_location.like("% GA")
        | df_reloaded_cached.user_location.like("% USA")
        | (df_reloaded_cached.user_location == "United States")
        | (df_reloaded_cached.user_location.like("Los Angeles%"))
        | (df_reloaded_cached.user_location == "USA")
        | (df_reloaded_cached.user_location.like("Texas"))
        | (df_reloaded_cached.user_location == "Los Angeles"),
        "USA",
    )
    .when(
        (df_reloaded_cached.user_location == "United Kingdom")
        | df_reloaded_cached.user_location.like("% United Kingdom")
        | df_reloaded_cached.user_location.like("% England")
        | (df_reloaded_cached.user_location == "London")
        | (df_reloaded_cached.user_location == "UK"),
        "UK",
    )
    .when(
        df_reloaded_cached.user_location.like("% India")
        | (df_reloaded_cached.user_location == "India"),
        "India",
    )
    .when(
        df_reloaded_cached.user_location == "Australia",
        "Australia",
    )
    .when(
        (df_reloaded_cached.user_location == "Canada")
        | df_reloaded_cached.user_location.like("% Ontario"),
        "Canada",
    )
    .when(
        (df_reloaded_cached.user_location == "Republic of the Philippines")
        | (df_reloaded_cached.user_location == "Philippines"),
        "Philippines",
    )
    .when(
        df_reloaded_cached.user_location == "Indonesia",
        "Indonesia",
    )
    .when(
        df_reloaded_cached.user_location == "None",
        "None",
    )
    .when(
        (df_reloaded_cached.user_location == "France")
        | (df_reloaded_cached.user_location.like("% France")),
        "France",
    )
    .when(
        (df_reloaded_cached.user_location == "Germany")
        | (df_reloaded_cached.user_location == "Deutschland"),
        "Germany",
    )
    .when(
        (df_reloaded_cached.user_location.like("% Kenya")),
        "Kenya",
    )
    .when(
        df_reloaded_cached.user_location.like("%xico%"),
        "Mexico",
    )
    .otherwise("Other"),
)
show_pyspark_df(df_reloaded_cached_with_country, 5)

Verify `Other` is not present in the derived country name column

In [None]:
%%time
assert (
    df_reloaded_cached_with_country
    .filter(F.col("user_location").isin(top_50_user_locations))
    .filter(F.col("country") == "Other")
    .select(["user_location", "country"])
    .count()
) == 0

**10. Count the number of tweets by country, for those tweets originating from any of the top 50 locations (by number of tweets) found in question 8.**

In [None]:
%%time
df_top_50_countries_toPandas = (
    df_reloaded_cached_with_country
    .filter(F.col("user_location").isin(top_50_user_locations))
    .groupBy(["country"])
    .count()
    .orderBy(["count"], ascending=False)
    .toPandas()
)
df_top_50_countries_toPandas

**11. For each of the countries containing the top 50 user locations (by number of tweets posted) from question 8, count the number of tweets by topic. Create a chart from this.**

For the top 50 user locations, count the number of tweets by country and topic

In [None]:
%%time
df_top_50_countries_grouped = (
    df_reloaded_cached_with_country
    .filter(F.col("user_location").isin(top_50_user_locations))
    .groupBy(["country", "dominant_topic_named"])
    .count()
    .orderBy(["country", "count"], ascending=[True, False])
)
show_pyspark_df(df_top_50_countries_grouped)

Pivot the country-wise grouped data, keeping countries along the rows and topics along the columns

In [None]:
%%time
df_countries_pivotted_toPandas = df_top_50_countries_grouped.groupBy("country").pivot(
    "dominant_topic_named", distinct_column_vals
).sum("count").withColumnRenamed(
    "Activities related to Space Research", "Space Research"
).withColumnRenamed(
    "Satellites and Telescopes", "Satellites / Telescopes"
).toPandas().set_index("country")
df_countries_pivotted_toPandas
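
To make the reshaping concrete, here is a small plain-Python sketch of what a pivot with an explicit list of column values does to `(country, topic, count)` rows. It is an illustration of the idea only, not the Spark implementation. The function name `pivot_counts` and the toy rows are made up for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def pivot_counts(
    rows: Iterable[Tuple[str, str, int]], columns: List[str]
) -> Dict[str, Dict[str, Optional[int]]]:
    """Turn long (country, topic, count) rows into one wide row per country.

    Only topics listed in `columns` become output columns. A country/topic
    combination with no rows stays None, similar to a null after a pivot.
    """
    table: Dict[str, Dict[str, Optional[int]]] = defaultdict(
        lambda: {c: None for c in columns}
    )
    for country, topic, count in rows:
        wide_row = table[country]  # every country gets a row, even if all-None
        if topic in columns:
            current = wide_row[topic]
            wide_row[topic] = count if current is None else current + count
    return dict(table)


# Toy data (made up): long format, one row per country and topic
toy_rows = [("Canada", "A", 3), ("Canada", "B", 2), ("France", "A", 5)]
toy_wide = pivot_counts(toy_rows, ["A", "B"])
```

In this toy case, `toy_wide["France"]["B"]` is `None` because no row exists for that combination.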

Re-shape data into a format suitable for Plotly `go.Heatmap()`

In [None]:
%%time
df_annot = df_countries_pivotted_toPandas.copy()
for col in df_annot:
    df_annot[col] = df_annot[col].map("{:,}".format)
data_dict = vh.convert_df_to_format_for_plotly_heatmap(
    df_countries_pivotted_toPandas, df_annot, True,
)
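
One thing to watch for in the annotation step: if a country/topic combination has no tweets, the pivoted cell is empty, and after conversion to pandas such a column may hold floats with `NaN`. In that case `"{:,}".format` produces strings such as `1,234.0` or `nan`. Whether this occurs here depends on the data. The helper below is a hypothetical sketch (the name `format_count` is not part of the notebook) of a formatter that tolerates both cases.

```python
import math
from typing import Optional, Union


def format_count(value: Optional[Union[int, float]]) -> str:
    """Format a count with thousands separators; empty string if missing."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    # int() drops the trailing ".0" that a float column would otherwise show
    return "{:,}".format(int(value))
```

Such a helper could be passed to `.map()` in place of `"{:,}".format` if missing cells turn out to be present.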

Plot the heatmap from the reshaped data

In [None]:
%%time
vh.plot_plotly_heatmap(
    data_dict=data_dict,
    annotation_text=data_dict["annotation_text"],
    margin_dict=dict(l=30, r=0, b=0, t=0, pad=0),
    fig_width=900,
)

## Conclusions and Future Work

### Conclusions
In this preliminary attempt to learn topics from Twitter data with PySpark ML, the LDA algorithm has at least suggested topics that can be named based on the top `n` terms within each topic. Manual hyper-parameter tuning was performed to help with this, and the number of topics was also chosen by inspecting the top `n` words (by weight) for each topic.

### Difficulties and Recommendations for Future Work
#### ML Modeling
Using LDA for topic modeling with shorter texts can be challenging ([1](https://towardsdatascience.com/topic-modeling-with-latent-dirichlet-allocation-e7ff75290f8)). The *Latent* part of LDA refers to the hidden nature of the topics in the documents (tweet texts) ([1](https://towardsdatascience.com/short-text-topic-modeling-70e50a57c883), [2](https://www.analyticsvidhya.com/blog/2021/06/part-2-topic-modeling-and-latent-dirichlet-allocation-lda-using-gensim-and-sklearn/)). LDA assumes that each document is made up of a distribution of topics and that each topic is itself a distribution of words. Given these documents and words, LDA learns the hidden link (layer) between them ([1](https://towardsdatascience.com/short-text-topic-modelling-lda-vs-gsdmm-20f1db742e14)), i.e. it learns the topics. A short text generally has room for only one topic, so there can be a larger error in the learned topic probabilities ([1](https://stackoverflow.com/a/29789165)).
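
The generative assumption behind LDA can be sketched in a few lines of plain Python. This is an illustration of the assumed data-generating process only, not of how Spark ML fits the model. The vocabulary, the two topic-word distributions and all function names below are made up for this example.

```python
import random
from typing import List, Tuple


def sample_dirichlet(alphas: List[float], rng: random.Random) -> List[float]:
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(
    topic_word_dists: List[List[float]],
    vocab: List[str],
    doc_alpha: float,
    n_words: int,
    rng: random.Random,
) -> Tuple[List[float], List[str]]:
    """Generate one document under the LDA assumptions.

    1. Draw the document's topic mixture (theta) from a Dirichlet prior.
    2. For each word position, draw a topic from theta, then draw a word
       from that topic's distribution over the vocabulary.
    """
    n_topics = len(topic_word_dists)
    theta = sample_dirichlet([doc_alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        topic = rng.choices(range(n_topics), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word_dists[topic])[0])
    return theta, words


# Toy setup (made up): two topics over a six-word vocabulary
VOCAB = ["rocket", "launch", "orbit", "telescope", "galaxy", "star"]
TOPIC_WORD_DISTS = [
    [0.4, 0.4, 0.2, 0.0, 0.0, 0.0],  # a "launch"-like topic
    [0.0, 0.0, 0.1, 0.3, 0.3, 0.3],  # an "astronomy"-like topic
]
theta, words = generate_document(
    TOPIC_WORD_DISTS, VOCAB, doc_alpha=0.5, n_words=8, rng=random.Random(0)
)
```

With only a handful of words per document, as in a tweet, there is little evidence from which to recover `theta`, which is one way to see why short texts are harder for LDA.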

Both the length of the texts and the number of documents in the data influence how well LDA performs ([1](https://dl.acm.org/doi/10.5555/3044805.3044828)). For streaming Twitter data, the number of documents (tweets) is less of a problem since we can simply collect more tweets for analysis. However, the tweets need to be long enough for LDA to model topics well. The overall domain in which the documents fall will also have some influence on these two variables.

Recommendations for the minimum number of words that LDA requires in a single document range from 50-200 to a few thousand ([1](https://towardsdatascience.com/short-text-topic-modeling-70e50a57c883), [2](https://towardsdatascience.com/lda-topic-modeling-with-tweets-deff37c0e131), [3](https://www.researchgate.net/post/What_would_be_considered_the_minimum_length_of_document_minimum_number_of_words_for_training_an_LDA_SLDA_topic_model), [4](https://link.springer.com/article/10.1007/s11135-020-00976-w), [5](https://www.frontiersin.org/articles/10.3389/frai.2020.00042/full)). These are approximations, not hard limits. For the current analysis (tweets related to *space*), even the lower end of this range requires filtering out many tweets. This significantly reduces the size of the dataset, from one that requires a big-data ML tool such as Spark ML (as would be the case with our raw streamed tweets data) to one that does not. This opens up the possibility of using other algorithms that are better equipped to handle shorter texts, such as the Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM) ([1](https://www.semanticscholar.org/paper/A-dirichlet-multinomial-mixture-model-based-for-Yin-Wang/d03ca28403da15e75bc3e90c21eab44031257e80?p2df)), which is [implemented in Python](https://github.com/rwalk/gsdmm) with example uses available ([1](https://stackoverflow.com/a/62331941/4057186), [2](https://towardsdatascience.com/short-text-topic-modelling-lda-vs-gsdmm-20f1db742e14), [3](https://gist.github.com/rrpelgrim/ef88b94f32dff78af4ef3253c93b6436)). Future work should either explore the use of this technique for this dataset or assess more references comparing LDA to other unsupervised ML techniques for short texts.
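
To give a feel for how GSDMM differs from LDA, below is a minimal, unoptimized sketch of collapsed Gibbs sampling for the Dirichlet Multinomial Mixture model, written from our reading of the sampling rule in the paper linked above. It is not the linked Python implementation and has not been validated against it. The function name `gsdmm_sketch`, the default hyper-parameter values and the toy documents are assumptions made for this example. The key difference from LDA is that each document is assigned to exactly one cluster (topic).

```python
import math
import random
from collections import Counter
from typing import List


def gsdmm_sketch(
    docs: List[List[str]],
    K: int,
    alpha: float = 0.1,
    beta: float = 0.1,
    n_iters: int = 10,
    seed: int = 0,
) -> List[int]:
    """Assign each tokenised document to one of at most K clusters."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})  # vocabulary size
    m = [0] * K  # number of documents per cluster
    n = [0] * K  # number of words per cluster
    nw = [Counter() for _ in range(K)]  # per-cluster word counts
    z = []
    for d in docs:  # random initial assignment
        k = rng.randrange(K)
        z.append(k)
        m[k] += 1
        n[k] += len(d)
        nw[k].update(d)
    for _ in range(n_iters):
        for i, d in enumerate(docs):
            k = z[i]  # remove the document from its current cluster
            m[k] -= 1
            n[k] -= len(d)
            nw[k].subtract(d)
            counts = Counter(d)
            log_p = []
            for c in range(K):
                # cluster popularity term (constant denominator omitted)
                lp = math.log(m[c] + alpha)
                # word-overlap term: product over repeated words in the doc
                for w, cnt in counts.items():
                    for j in range(cnt):
                        lp += math.log(nw[c][w] + beta + j)
                for j in range(len(d)):
                    lp -= math.log(n[c] + V * beta + j)
                log_p.append(lp)
            top = max(log_p)  # subtract the max for numerical stability
            weights = [math.exp(lp - top) for lp in log_p]
            k = rng.choices(range(K), weights=weights)[0]
            z[i] = k  # add the document to the sampled cluster
            m[k] += 1
            n[k] += len(d)
            nw[k].update(d)
    return z


# Toy documents (made up), already tokenised
toy_docs = [
    ["rocket", "launch", "today"],
    ["rocket", "launch", "delayed"],
    ["telescope", "galaxy", "image"],
    ["galaxy", "image", "star"],
]
labels = gsdmm_sketch(toy_docs, K=4, n_iters=5)
```

`K` acts as an upper bound: clusters that end up with no documents are effectively unused, so the number of non-empty clusters can be smaller than `K`.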

#### Text Data Processing
In the current work, the topics were named using the top words per topic. Removing the *leftover duplicated* tweets from the Twitter data will help confirm or reject these names. As we saw earlier, such duplicates can come in the form of tweets that differ by as little as one word. From reading the documents within each topic, we can see that further processing is needed to remove such tweets before LDA. The top `n` (unique) documents printed for reading contain several of these *leftover duplicates*, which makes it difficult to decide, by reading the documents (tweets) in a topic, whether that topic is appropriately named or whether it should exist at all. Regardless of the ML algorithm used, iteratively removing such *leftover duplicates* will help in
- validating the names of the topics by reading the top (unique) tweets within each topic (without unnecessary duplication)
- fine-tuning the number of topics to be learned
- reducing ML model training time
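
As a starting point for that future work, the sketch below shows one possible way to detect tweets that differ by at most one word (one word substituted, inserted or removed) and to keep only the first of each such group. The function names and toy tweets are made up for this example, and the pairwise comparison is quadratic in the number of tweets, so it is only suitable for small samples, such as the top `n` documents printed per topic. For the full dataset, an approximate similarity approach in Spark (for example `pyspark.ml.feature.MinHashLSH`) may be worth exploring instead.

```python
from typing import List


def within_one_word(a: str, b: str) -> bool:
    """True if the texts differ by at most one word (case-insensitive)."""
    x, y = a.lower().split(), b.lower().split()
    if abs(len(x) - len(y)) > 1:
        return False
    if len(x) > len(y):
        x, y = y, x  # make x the shorter (or equal-length) token list
    i = j = edits = 0
    while i < len(x) and j < len(y):
        if x[i] == y[j]:
            i += 1
            j += 1
            continue
        edits += 1
        if edits > 1:
            return False
        if len(x) == len(y):
            i += 1  # substitution: advance both lists
        j += 1  # insertion: skip the extra word in the longer list
    return True


def drop_near_duplicates(texts: List[str]) -> List[str]:
    """Keep the first tweet of each group of near-duplicates."""
    kept: List[str] = []
    for text in texts:
        if not any(within_one_word(text, k) for k in kept):
            kept.append(text)
    return kept


# Toy tweets (made up)
toy_tweets = [
    "nasa launches new rocket today",
    "nasa launches new rocket tomorrow",
    "the telescope captured a distant galaxy",
    "NASA launches new rocket today",
]
unique_tweets = drop_near_duplicates(toy_tweets)
```

Here the second and fourth toy tweets are dropped, since each differs from the first by at most one word once case is ignored.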

## Summary of Assumptions and Limitations
### Assumptions
1. Re-tweets (identical text in tweet) are not useful to the topic modeling algorithm (LDA). A single version of each tweet is sufficient.
2. Tweets with more than 25 words in their text are long enough to be used with LDA.

### Limitations
1. After replacing double whitespace with a single space, converting to lowercase and trimming the text, the majority of duplicated tweets (which are not retweets) can be removed. Some duplicates, which differ from the original tweet by a single word, remain and (per assumption 1 above) future work should focus on removing these.
2. Only preliminary hyper-parameter tuning was performed for TF-IDF vectorization and LDA.