<div class="alert alert-block alert-danger">

# FIT5196 Task 2 in Assessment 1
    
#### Student Name: Yehezkiel Efraim Darmadi, Yogi Sarumaha
#### Student ID: 34078215, 34087672

Date: 22 August 2024

Environment: Python3

### Libraries used:

* **os** (for interacting with the operating system, included in Python 3 from Colab)
* **pandas 1.1.0** (for working with dataframes, installed and imported)
* **multiprocessing** (for performing processes on multiple cores, included in Python 3.6.9 package)
* **itertools** (for performing operations on iterables, included in Python 3.x.x)
* **nltk 3.5** (Natural Language Toolkit, installed and imported, for text processing tasks)
  - **nltk.tokenize** (for tokenization, installed and imported)
  - **nltk.stem** (for stemming the tokens, installed and imported)
  - **nltk.probability** (for working with frequency distributions, included in the Natural Language Toolkit)
  - **nltk.util** (for generating n-grams, included in the Natural Language Toolkit)
* **re** (for defining and using regular expressions, included in Python 3.x.x)
* **matplotlib.pyplot** (for generating visualizations and plots, installed and imported)
* **collections** (specifically `defaultdict`, for creating dictionary-like collections with default values, included in Python 3.x.x)

    </div>

<div class="alert alert-block alert-info">
    
## Table of Contents

</div>

[1. Introduction](#Intro) <br>
[2. Importing Libraries](#libs) <br>
[3. Examining Input File](#examine) <br>
[4. Loading and Parsing Files](#load) <br>
$\;\;\;\;$[4.1. Tokenization](#tokenize) <br>
$\;\;\;\;$[4.2. Removing Context-Independent Stopwords](#context_independent) <br>
$\;\;\;\;$[4.3. Filtering Tokens](#filtering_tokens) <br>
$\;\;\;\;$[4.4. Stemming](#stemming) <br>
$\;\;\;\;$[4.5. Removing Context-Dependent Stopwords](#context_dependent) <br>
$\;\;\;\;$[4.6. Removing Rare Tokens](#rare_tokens) <br>
$\;\;\;\;$[4.7. Generating Bigrams](#bigrams) <br>
$\;\;\;\;$[4.8. Matrix Representation](#Matrix) <br>
[5. Writing Output Files](#write) <br>
$\;\;\;\;$[5.1. Vocabulary List](#write-vocab) <br>
$\;\;\;\;$[5.2. Sparse Matrix](#write-sparseMat) <br>
[6. Summary](#summary) <br>
[7. References](#Ref) <br>

<div class="alert alert-block alert-success">
    
## 1.  Introduction  <a class="anchor" name="Intro"></a>

This assessment concerns textual data: the aim is to extract the data, process it, and transform it into a format suitable for analysis. The input used here consists of the CSV and JSON files produced in Task 1 (task1_111.csv and task1_111.json), which contain Google Maps reviews of businesses in California. The task is to pre-process the review text and convert it into a numerical representation suitable for downstream modelling. Specifically, the assignment requires us to extract the review text of businesses with at least 70 text reviews, remove context-independent and context-dependent stopwords, stem the tokens, and generate a vocabulary list. The processed text is then converted into a sparse numerical representation and exported in a specified format, so that it can later be combined with the metadata for exploratory data analysis.

<div class="alert alert-block alert-success">
    
## 2.  Importing Libraries  <a class="anchor" name="libs"></a>

In this assessment, various Python packages were utilized to accomplish the required tasks, including but not limited to:

* **os**: For interacting with the operating system, such as navigating directories and handling file operations.
* **re**: For defining and using regular expressions, which are essential for pattern matching in strings.
* **pandas**: To work with dataframes, providing powerful data manipulation capabilities.
* **multiprocessing**: To perform processes on multiple cores, improving the performance of computationally intensive tasks.
* **itertools**: For efficient looping constructs, specifically for performing operations on iterables.
* **nltk**: The Natural Language Toolkit, used for various text processing tasks.
  - **nltk.probability**: For working with probability distributions and frequency distributions.
  - **nltk.tokenize**: For tokenizing text into words or phrases.
  - **nltk.stem**: For stemming tokens to their root forms.
  - **nltk.util**: For working with n-grams, which are contiguous sequences of tokens.
* **matplotlib.pyplot**: For creating visualizations and plots.
* **collections**: Specifically, `defaultdict` is used to handle dictionary-like collections that provide a default value for non-existent keys.

Additional libraries may be used as needed throughout the assignment to handle specific tasks related to data processing and analysis.

In [46]:
pip install langid



In [47]:
import os
import re
import langid
import pandas as pd
import multiprocessing
from itertools import chain
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
from nltk.util import ngrams
import matplotlib.pyplot as plt
from collections import defaultdict

-------------------------------------

<div class="alert alert-block alert-success">
    
## 3.  Examining Input File <a class="anchor" name="examine"></a>

Connect the notebook to Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Let's examine the content of the input files. We first load them and inspect their format, so that we can determine the steps needed to extract the information required for further processing.

In [None]:
# NOTE for the tutor: change this path to the directory containing the Task 1 files
input_files_dir = "/content/drive/MyDrive/Data Wrangling/assignment 1/Task 1"

In [None]:
df_csv = pd.read_csv(input_files_dir + "/task1_111.csv")
df_json = pd.read_json(input_files_dir + "/task1_111.json")

In [None]:
df_csv.head()

In [None]:
df_json.head()

The df_csv DataFrame contains summary statistics for each business, such as gmap_id, review_count, review_text_count, and response_count. In contrast, df_json holds the detailed, nested data: its rows are reviews, earliest_review_date, and latest_review_date, and each column corresponds to a unique gmap_id. Each cell in the reviews row contains the individual reviews of that business, possibly including user IDs and timestamps.

<div class="alert alert-block alert-success">
    
## 4.  Loading and Parsing Files <a class="anchor" name="load"></a>

In this section, we filter the businesses by review_text_count: we only analyse the gmap_ids that have at least 70 text reviews.

We first work with the DataFrames and then convert the result into a dictionary for pre-processing.

In [None]:
# filter the df_csv file
df_csv_filtered = df_csv[df_csv["review_text_count"] >= 70]

# Extract the gmap_ids as a list
gmap_ids_to_filter = df_csv_filtered["gmap_id"].tolist()

# Use the list to select columns from df_json
df_json_filtered = df_json.loc[:, gmap_ids_to_filter]
df_json_filtered.head()

Let's look at one example review.

In [None]:
df_json_filtered["0x14e4bcd95f3c0451:0x7ccf04478a4d59af"]["reviews"][0]

Since we want to analyse the Google reviews of each business, it makes more sense to combine the review texts into one string per gmap_id. Our goal is therefore to concatenate the "review_text" of every user_id into a single review text for each gmap_id, which we can store directly in a dictionary.

In [None]:
review_list = []

gmap_id_list = df_csv_filtered["gmap_id"].tolist()

# loop over each gmap_id and concatenate its review texts
for gmap_id in gmap_id_list:
  review_combine = ""
  for review in df_json_filtered[gmap_id].loc["reviews"]:
    review_combine += review["review_text"] + " "
  review_list.append(review_combine)

# make it into df
df_review = pd.DataFrame(
    {
        "gmap_id": gmap_id_list,
        "review": review_list
    }
)

# transform it into dict
gmap_id_review = dict(
    zip(df_review["gmap_id"].tolist(), df_review["review"].tolist())
)

# let's see one gmap_id
gmap_id_review["0x14e4bcd95f3c0451:0x7ccf04478a4d59af"]

The above operation results in a dictionary keyed by gmap_id, where each value is a single string containing all review texts of that business concatenated together.
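As a minimal sketch of the same idea on toy data, the concatenation can also be written without the intermediate DataFrame. The dictionary `toy_reviews` and the helper `combine_reviews` are hypothetical names used only for illustration; the field name `review_text` mirrors the notebook, and the values are made up.

```python
# Toy stand-in for the "reviews" entries of two businesses (made-up values).
toy_reviews = {
    "gmap_a": [{"review_text": "great service"}, {"review_text": "friendly staff"}],
    "gmap_b": [{"review_text": "long wait"}],
}


def combine_reviews(reviews_by_id):
    """Join all review texts of each business into one space-separated string."""
    return {
        gmap_id: " ".join(review["review_text"] for review in reviews)
        for gmap_id, reviews in reviews_by_id.items()
    }


combined = combine_reviews(toy_reviews)
print(combined["gmap_a"])  # great service friendly staff
```

Unlike the loop in the cell above, `" ".join` does not leave a trailing space, which makes no difference once the text is tokenized.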

Below are the steps used to pre-process the review text:
1. Tokenization

2. Removing Context-Independent Stopwords

3. Filtering Tokens

4. Stemming

5. Removing Context-Dependent Stopwords

6. Removing Rare Tokens

7. Generating Bigrams

8. Matrix Representation

<div class="alert alert-block alert-warning">
    
### 4.1. Tokenization <a class="anchor" name="tokenize"></a>

Tokenization is the first step in text processing and produces the unigrams. It breaks the continuous stream of text into tokens, the basic units of text analysis, which makes it easier to analyse the frequency, context, and relationships of words. It also enables the subsequent steps: stopword removal, stemming, and the creation of numerical representations for modelling. Because our regex-based approach keeps only alphabetic sequences, numbers and punctuation are discarded at this stage.

In this section, we tokenize the text using the regular expression required by the assignment, '[a-zA-Z]+'. This expression captures sequences of alphabetic characters, effectively isolating words from the text.

In [None]:
# define the tokenizer
tokenizer = RegexpTokenizer(r"[a-zA-Z]+")

def tokenize_data(gmap_id):
    """
    Tokenizes the review text associated with a given Google Map ID.

    The function retrieves the review text corresponding to the provided
    `gmap_id`, then tokenizes the text using a regular expression tokenizer
    that extracts only alphabetic words (ignoring numbers, punctuation, etc.).
    The result is returned as a tuple containing the `gmap_id` and the list of tokens.

    Args:
        gmap_id (str): The Google Map ID for which the review text should be tokenized.

    Returns:
        tuple: A tuple containing the `gmap_id` and a list of tokenized words from the review.
    """
    review = gmap_id_review[gmap_id]
    tokenised_review = tokenizer.tokenize(review)
    return gmap_id, tokenised_review

gmap_id_token = dict(tokenize_data(gmap_id) for gmap_id in gmap_id_review.keys())

# keep a copy of the raw tokens for the bigram section (4.7)
gmap_id_bigram = gmap_id_token.copy()

Let's look at the first ten tokens of five sample gmap_ids.

In [None]:
count = 0
for key, value in gmap_id_token.items():
    print(f"{key}: {value[:10]}")
    count += 1
    if count == 5:
        break

At this stage, all reviews for each gmap_id are tokenized and are stored as a value in the new dictionary.

<div class="alert alert-block alert-warning">
    
### 4.2. Removing Context-Independent Stopwords <a class="anchor" name="context_independent"></a>

Looking at the sample tokens printed in the previous section, we can see many stopwords that carry little meaning, such as 'i' and 'a'. First, let's load the stopword list from Google Drive.

In [None]:
# NOTE for the tutor: change this path to the directory containing stopwords_en.txt
stop_word_dir = "/content/drive/Shareddrives/FIT5196_S2_2024/GroupAssessment1"
with open(stop_word_dir + '/stopwords_en.txt', 'r') as file:
    # previously context_independent_stopwords
    stop_words = set(file.read().splitlines())

for i in range(0, 30, 10):
    print(list(stop_words)[i:i+10])

It can be seen that the stopwords have not been stemmed or lemmatized. Thus, it is preferable to remove the stopwords before we do any stemming.

The stopword list also contains words such as "i'll", whereas our regex only captures sequences of alphabetic characters. As a result, contractions are split at the apostrophe: the sample tokens in the previous section include 'don' and 't', which come from the word "don't". These fragments do not match the stopword list, which is a problem we need to address later.
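The effect can be reproduced on a single sentence. The sketch below uses `re.findall` from the standard library as a stand-in for `RegexpTokenizer` (for this pattern the two should yield the same tokens); the sentence and the three-word stopword set are made up for illustration and are not taken from stopwords_en.txt.

```python
import re

TOKEN_PATTERN = re.compile(r"[a-zA-Z]+")  # same pattern as the tokenizer above

sample = "i don't think i'll wait"
tokens = TOKEN_PATTERN.findall(sample)
print(tokens)  # ['i', 'don', 't', 'think', 'i', 'll', 'wait']

# Toy stopword set (assumed for illustration only).
toy_stop_words = {"i", "don't", "i'll"}
remaining = [token for token in tokens if token not in toy_stop_words]
print(remaining)  # ['don', 't', 'think', 'll', 'wait']
```

The fragments 'don', 't' and 'll' survive because the stopword entries still contain the apostrophe that the regex has split on.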

In [None]:
# removing stopwords
for key, value in gmap_id_token.items():
    gmap_id_token[key] = [word for word in value if word not in stop_words]

In [None]:
# let's see the sample result
for i in range(0, 100, 10):
    print(gmap_id_token["0x14e4bcd95f3c0451:0x7ccf04478a4d59af"][i:i+10])

The token "don" is still present in the dictionary. We do not treat it separately: the combination of stemming and removing context-dependent stopwords will remove such tokens, as explained in the later sections.

<div class="alert alert-block alert-warning">
    
### 4.3. Filtering Tokens <a class="anchor" name="filtering_tokens"></a>

Filtering out unusable tokens is crucial, and in this context, we assume that tokens with fewer than three letters are not significant. Short tokens like “la” and “pa,” as seen in the previous section, may not contribute meaningfully to the analysis and could introduce noise, potentially skewing results and reducing accuracy.

By removing these short tokens, we can focus on more substantial words that carry meaningful information, enhancing the quality and relevance of the data. However, a more thorough analysis is necessary to determine which words should be removed to optimize the process.

In [None]:
# (Monash University, 2024)
words = list(chain.from_iterable(gmap_id_token.values()))
freq_dist_1 = FreqDist(words)
filtered_freq_dist = FreqDist({word: freq for word, freq in freq_dist_1.items() if len(word) < 3})

print(f"The number of distinct tokens shorter than 3 letters is: {len(filtered_freq_dist)}")
filtered_freq_dist.keys()

In [None]:
less_common_words = filtered_freq_dist.hapaxes()

less_common_words[:10]

Let's look at the contexts in which some of these tokens appear.

In [None]:
nltk.Text(words).concordance('pa')

In [None]:
nltk.Text(words).concordance('dr')

In [None]:
nltk.Text(words).concordance('perlman')

In [None]:
nltk.Text(words).concordance('ll')

In [None]:
nltk.Text(words).concordance('tv')

In [None]:
nltk.Text(words).concordance('television')

Looking at the values, most of the tokens shorter than three letters are fragments of stopwords (such as "ll" from "i'll") that were not removed earlier because of the regex limitation.

Many of the remaining short tokens are very rare and therefore have no significant impact on the analysis, so they should be removed as well.

Finally, some tokens such as "dr" are abbreviations: "dr" stands for "Doctor" and is used before a doctor's name, as in "Dr. Lin". The name "Lin" will itself be removed in a later section because of its low frequency, so "dr" should be removed as well.

Let's remove all tokens that have fewer than three letters.

In [None]:
# filtering tokens
threshold = 3
for key, value in gmap_id_token.items():
    gmap_id_token[key] = [word for word in value if len(word) >= threshold]

Let's check how many unique tokens were removed and which ones.

In [None]:
# (Monash University, 2024)
vocab = set(words)
words_filtered = list(chain.from_iterable(gmap_id_token.values()))
vocab_filtered = set(words_filtered)
freq_dist_filtered = FreqDist(words_filtered)

# unique tokens that were dropped by the length filter
removed_vocab = sorted(vocab - vocab_filtered)
print(f"The number of unique removed tokens : {len(removed_vocab)}")

for i in range(0, len(removed_vocab), 10):
    print(removed_vocab[i:i+10])

Let's see the sample output now.

In [None]:
count = 0
for key, value in gmap_id_token.items():
    print(f"{key}: {value[:20]}")
    count += 1
    if count == 5:
        break

<div class="alert alert-block alert-warning">
    
### 4.4. Stemming <a class="anchor" name="stemming"></a>

Stemming is an essential step in text processing that reduces words to their root forms, thereby standardising variations of the same word. Looking at the sample results, we see many words that have similar meanings but different forms, such as "feel" and "feeling," or "doctor" and "doctors." Without stemming, these variations would be treated as distinct entities, which could dilute the analysis and lead to redundant or misleading insights. By applying stemming, we consolidate these variations into a single representative form, such as "feel" and "doctor," which makes the text data more consistent. Note that a rule-based stemmer such as Porter only strips suffixes, so an irregular form such as "felt" is not mapped to "feel". Stemming helps tasks like sentiment analysis, keyword extraction, and text classification by grouping semantically similar words together.

For this assignment, we use NLTK's PorterStemmer.

In [None]:
# initiate the PorterStemmer
stemmer = PorterStemmer()

# stem every token of every gmap_id
for key, value in gmap_id_token.items():
    gmap_id_token[key] = [stemmer.stem(word) for word in value]

Let's see the result.

In [None]:
# let's see the sample result
for i in range(0, 200, 10):
    print(gmap_id_token["0x14e4bcd95f3c0451:0x7ccf04478a4d59af"][i:i+10])

After stemming, some tokens may be shorter than three letters. We keep these tokens because they are stems of longer, potentially meaningful words, but we still need to check later whether they have a significant impact on the analysis.

<div class="alert alert-block alert-warning">
    
### 4.5. Removing Context-Dependent Stopwords <a class="anchor" name="context_dependent"></a>

In this section, we are going to remove words that are very frequent throughout the business reviews but do not contribute significant meaning to the analysis. These context-dependent stopwords, though common within the dataset, do not help in distinguishing between different reviews or businesses. By filtering out these high-frequency words, we can reduce the noise in the data and enhance the focus on more meaningful terms that provide valuable insights into customer feedback and business performance.

One consequence of removing context-dependent stopwords after stemming is that more tokens may be removed than if the stopwords were eliminated before stemming, because different words with distinct meanings can share the same stem. For example, "manage" and "management" are reduced to the same stem even though they convey different nuances, so their frequencies are pooled. However, such cases are few, and the opposite risk is worse: before stemming, the variants of a frequent word are counted separately and each may fall below the frequency threshold, so the word would be retained. Therefore, we remove these stopwords after stemming, which reduces noise and improves the quality of the data.

First, let's compute the document frequency of each token, i.e. the number of gmap_ids in which it appears.
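To make the frequency used in this section concrete, here is a small sketch on made-up data. It counts each token at most once per business (its document frequency) and flags the tokens that occur in more than 95% of the businesses, the same ratio as the threshold used in this section. The names `toy_tokens`, `doc_freq` and `too_common` are hypothetical.

```python
from collections import Counter

# Made-up stemmed tokens for four businesses.
toy_tokens = {
    "biz_1": ["staff", "staff", "wait", "dentist"],
    "biz_2": ["staff", "wait", "pizza"],
    "biz_3": ["staff", "pizza", "pizza"],
    "biz_4": ["staff", "wait", "park"],
}

# set(tokens) ensures a token is counted once per business, however often it occurs.
doc_freq = Counter(token for tokens in toy_tokens.values() for token in set(tokens))

upper = 0.95 * len(toy_tokens)  # 3.8 businesses
too_common = {token for token, df in doc_freq.items() if df > upper}

print(doc_freq["staff"], doc_freq["pizza"])  # 4 2
print(too_common)  # {'staff'}
```

Only "staff" appears in all four businesses, so it is the only token above the threshold; "pizza" occurs three times in total but in only two businesses.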

In [None]:
# get distinct token per gmap_id
words_set = list(chain.from_iterable([set(value) for value in gmap_id_token.values()]))
fd_set = FreqDist(words_set)

for word in fd_set.most_common(10):
    print(word)

Let's identify the context-dependent stopwords, which we define as the tokens that appear in more than 95% of the gmap_ids.

In [None]:
threshold = 0.95 * len(gmap_id_token)
# use a set so that the membership test in the removal loop is fast
context_dependent = {word for word, freq in fd_set.items() if freq > threshold}
print(f"The number of context dependent stopwords in the gmap_id: {len(context_dependent)}")
print(context_dependent)

# remove the context dependent stopwords
for key, value in gmap_id_token.items():
    gmap_id_token[key] = [word for word in value if word not in context_dependent]

Let's reconfirm how many tokens are removed and how many unique tokens are removed.

In [None]:
words_con_dependent = list(chain.from_iterable(gmap_id_token.values()))
vocab_con_dependent = set(words_con_dependent)
freq_dist_con_dependent = FreqDist(words_con_dependent)

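# stemmed vocabulary before this step (fd_set was built from the stemmed tokens);
# comparing against the pre-stemming vocabulary would also count stemmed-away forms
vocab_stemmed = set(fd_set.keys())
removed_vocab = sorted(vocab_stemmed - vocab_con_dependent)

print(f"The number of removed tokens : {len(words_filtered) - len(words_con_dependent)}")
print(f"The number of unique removed tokens : {len(removed_vocab)}")

# sample removed vocab
for i in range(0, 100, 10):
    print(removed_vocab[i:i+10])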

Let's see the sample result

In [None]:
count = 0
for key, value in gmap_id_token.items():
    print(f"{key}: {value[:10]}")
    count += 1
    if count == 5:
        break

<div class="alert alert-block alert-warning">
    
### 4.6. Removing Rare Tokens <a class="anchor" name="rare_tokens"></a>

In this section, we focus on removing rare tokens from the dataset. As discussed earlier, certain tokens, like doctor names such as “Lin,” add little value to the analysis and can introduce noise. By filtering out these infrequent tokens, we can enhance the quality of the data, keeping it focused on more impactful words.

It’s more efficient to remove rare tokens after stemming, rather than before, to avoid redundant processing. Rare tokens, by definition, have minimal impact compared to context-dependent stopwords. Additionally, removing tokens prematurely might lead to the loss of potentially valuable counterparts later on. Therefore, it’s better to remove all rare tokens at once after stemming, ensuring that we retain as much useful information as possible.

In this assignment, rare tokens are defined as tokens that appear in fewer than 5% of the businesses (gmap_ids).

We reuse the document frequencies (fd_set) computed in the previous section.

In [None]:
threshold = 0.05 * len(gmap_id_token)

# use a set so that the membership test in the removal loop is fast
rare_tokens = {word for word, freq in fd_set.items() if freq < threshold}
print(f"The number of rare tokens in the gmap_id: {len(rare_tokens)}")

# remove the rare tokens
for key, value in gmap_id_token.items():
    gmap_id_token[key] = [word for word in value if word not in rare_tokens]

Let's reconfirm how many tokens are removed and how many unique tokens are removed.

In [None]:
words_rare = list(chain.from_iterable(gmap_id_token.values()))
vocab_rare = set(words_rare)
freq_dist_rare = FreqDist(words_rare)

print(f"The number of removed tokens : {len(words_con_dependent) - len(words_rare)}")
print(f"The number of unique removed tokens : {len(list(vocab_con_dependent - set(freq_dist_rare.keys())))}")
print(f"Check whether the doctor name 'lin' is among the rare tokens : {('lin' in rare_tokens)}")

for i in range(0, 20, 10):
    print(list(vocab_con_dependent - set(freq_dist_rare.keys()))[i:i+10])

Let's see sample result

In [None]:
count = 0
for key, value in gmap_id_token.items():
    print(f"{key}: {value[:10]}")
    count += 1
    if count == 5:
        break

<div class="alert alert-block alert-warning">
    
### 4.7. Generating Bigrams <a class="anchor" name="bigrams"></a>

Creating bigrams is an essential step in text processing that allows us to capture pairs of words that often appear together, providing context that individual tokens might not fully convey. Bigrams can reveal important relationships between words that single-word tokens might miss, such as "New York" or "artificial intelligence," which carry specific meanings when combined.

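It’s generally better to generate bigrams before performing data preprocessing steps like stopword removal. For example, a common stopword like "at" might be insignificant on its own, but in combination with other words like "rate", it forms meaningful phrases such as "at rate," which we wouldn't want to lose by removing "at" too early in the process. Note, however, that the word filter applied below still discards bigrams containing a token shorter than three letters, so a pair such as "at rate" would itself be excluded here.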

In this section, we create the bigrams before the tokens are stemmed, because we want to preserve the context of the tokens. Stemming reduces tokens to their root forms, which can make certain bigrams lose their context.

As for frequency, it doesn’t pose a significant concern in this context, as we only select the top significant bigrams. This ensures that the most relevant and contextually important word pairs are retained, making our analysis more accurate and insightful.

In this assignment, we use the PMI (pointwise mutual information) measure and take the top 200 bigrams.
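As a rough illustration of what PMI measures, the sketch below computes it by hand for adjacent word pairs, using the textbook definition $\log_2 \frac{p(x, y)}{p(x)\,p(y)}$ with probabilities estimated from raw counts. The helper `pmi_scores` and the toy token list are hypothetical, and the counting conventions are an assumption: this is not NLTK's implementation, and its scores may differ slightly in detail.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return {(w1, w2): PMI} for every adjacent pair in a token list."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {
        (w1, w2): math.log2(count * n / (word_counts[w1] * word_counts[w2]))
        for (w1, w2), count in pair_counts.items()
    }


toy = ["new", "york", "pizza", "is", "good", "new", "york",
       "is", "big", "pizza", "is", "good"]
scores = pmi_scores(toy)
print(scores[("is", "good")])  # 2.0
print(scores[("york", "is")])  # 1.0
print(scores[("new", "york")] == scores[("big", "pizza")])  # True
```

In this toy example the pair ("big", "pizza"), which occurs only once, scores as high as ("new", "york"), which occurs twice: PMI rewards pairs whose words rarely appear apart, regardless of how often the pair itself occurs.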

Let's start by collecting all of the tokens from the original token dictionary, which we created in section 4.1.

In [None]:
all_tokens = list(chain.from_iterable(gmap_id_bigram.values()))

Now let's use NLTK's built-in BigramAssocMeasures and BigramCollocationFinder.

In [None]:
# (Monash University, 2024)
# initiate the BigramAssocMeasures
token_bigram_measures = nltk.collocations.BigramAssocMeasures()
token_bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(all_tokens)
# exclude bigrams that contain a token shorter than 3 letters
token_bigram_finder.apply_word_filter(lambda w: len(w) < 3)
token_top_200_bigrams = token_bigram_finder.nbest(token_bigram_measures.pmi, 200)
token_top_200_bigrams[:10]

Let's merge the token pairs in gmap_id_bigram that match token_top_200_bigrams into single bigram tokens using the MWETokenizer.
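Conceptually, this merging step behaves like the pure-Python sketch below: adjacent tokens that form a known bigram are joined with an underscore, which is also the default separator of `MWETokenizer`. The helper `merge_bigrams` and the toy bigram set are hypothetical and are not NLTK's implementation.

```python
def merge_bigrams(tokens, bigrams, separator="_"):
    """Greedily join adjacent tokens that form a known bigram."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + separator + tokens[i + 1])
            i += 2  # skip the second half of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


toy_bigrams = {("root", "canal")}
result = merge_bigrams(["had", "a", "root", "canal", "today"], toy_bigrams)
print(result)  # ['had', 'a', 'root_canal', 'today']
```

Because the regex tokenizer only emits alphabetic tokens, any token containing an underscore after this step must be a merged bigram, which is what the later filtering on '_' relies on.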

In [None]:
gmap_mwetokenizer = MWETokenizer(token_top_200_bigrams)
# merge the top-200 bigrams in the raw token list of each gmap_id
review_bigram_tokenized = {
    gmap_id: gmap_mwetokenizer.tokenize(tokens)
    for gmap_id, tokens in gmap_id_bigram.items()
}

all_word_collect = list(chain.from_iterable(review_bigram_tokenized.values()))
total_vocab = list(set(all_word_collect))

for i in range(0, 100, 10):
    print(total_vocab[i:i+10])


We do not need to stem the bigrams because the purpose of stemming is to group words with similar meanings together. However, in the context of bigrams, where the focus is on capturing meaningful phrases, stemming may not be as beneficial. Stemming could alter one or both words in a bigram, potentially distorting the phrase’s original meaning and reducing the accuracy of the analysis. For instance, stemming might convert “running late” to “run late,” which could change the nuance of the phrase. Therefore, it is more effective to keep bigrams intact without applying stemming, ensuring that the full context and meaning of the phrases are preserved.

Now we need to add only the bigrams to the token dictionary that we created in the previous sections.

Let's also check the total number of bigrams and the number of unique bigrams.

In [None]:
bigrams_dict = {}

for gmap_id, tokens in review_bigram_tokenized.items():
    bigrams = [token for token in tokens if '_' in token]
    bigrams_dict[gmap_id] = bigrams

count = 0
for key, bigrams in bigrams_dict.items():
    print(f"{key}: {bigrams[:10]}")
    count += 1
    if count == 5:
        break

bigrams_word_collect = list(chain.from_iterable(bigrams_dict.values()))
total_bigrams = list(set(bigrams_word_collect))

print(f"The total number of bigrams: {len(bigrams_word_collect)}")
print(f"The total number of unique bigrams: {len(total_bigrams)}")

# append the tokens
for key, bigrams in bigrams_dict.items():
    gmap_id_token[key].extend(bigrams)

Let's check the result.

In [None]:
words_final = list(chain.from_iterable(gmap_id_token.values()))
vocab_final = set(words_final)
freq_dist_final = FreqDist(words_final)

bigram_words = list(chain.from_iterable(bigrams_dict.values()))
bigram_vocab = set(bigram_words)
bigram_freq_dist = FreqDist(bigram_words)

print(f"The total number of tokens before : {len(words_rare)}")
print(f"The total number of tokens after : {len(words_final)}")
print(f"The total number of bigram tokens that we add: {len(bigram_words)}")
print(f"The total number of unique tokens before : {len(vocab_rare)}")
print(f"The total number of unique tokens after : {len(vocab_final)}")
print(f"The total number of unique bigram tokens that we add: {len(bigram_vocab)}")

Let's see sample result.

In [None]:
count = 0
for key, value in gmap_id_token.items():
    print(f"{key}: {value[:10]}")
    count += 1
    if count == 5:
        break

<div class="alert alert-block alert-warning">
    
### 4.8. Matrix Representation<a class="anchor" name="Matrix"></a>

One of the tasks is to generate a numerical representation of all tokens, which involves converting the text data into a matrix in which each row represents a document (here, the combined reviews of one gmap_id) and each column corresponds to a token or bigram. This matrix quantifies the frequency of each token in each document, enabling further analysis such as machine learning or clustering. It is also the basis of the sparse count vectors written out in the next section.
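A minimal sketch of such a count matrix on made-up data, using only the standard library, is given below. The names `toy_tokens`, `vocab` and `matrix` are hypothetical; the notebook itself stores the matrix in a pandas DataFrame.

```python
from collections import Counter

# Made-up processed tokens (unigrams and one bigram) for two businesses.
toy_tokens = {
    "biz_1": ["staff", "wait", "staff"],
    "biz_2": ["pizza", "root_canal"],
}

# Alphabetically sorted vocabulary defines the column order.
vocab = sorted({token for tokens in toy_tokens.values() for token in tokens})
counts = {gmap_id: Counter(tokens) for gmap_id, tokens in toy_tokens.items()}

# One row per business, one column per vocabulary entry; a Counter returns 0
# for tokens that do not occur in that business.
matrix = [[counts[gmap_id][token] for token in vocab] for gmap_id in toy_tokens]

print(vocab)   # ['pizza', 'root_canal', 'staff', 'wait']
print(matrix)  # [[0, 0, 2, 1], [1, 1, 0, 0]]
```

Most entries of a real matrix of this kind are zero, which is why only the non-zero counts are written to the output file.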

In [None]:
all_tokens = set()
for tokens in gmap_id_token.values():
    all_tokens.update(tokens)

all_tokens = sorted(all_tokens)

matrix_representation = pd.DataFrame(0, index=gmap_id_token.keys(), columns=all_tokens)

for key, tokens in gmap_id_token.items():
    token_counts = defaultdict(int)
    for token in tokens:
        token_counts[token] += 1
    for token, count in token_counts.items():
        matrix_representation.loc[key, token] = count

print(matrix_representation.shape)
matrix_representation.head()

<div class="alert alert-block alert-success">
    
## 5. Writing Output Files <a class="anchor" name="write"></a>

Two files need to be generated:
* Vocabulary list
* Sparse matrix (count_vectors)

This is performed in the following sections.

<div class="alert alert-block alert-warning">
    
### 5.1. Vocabulary List <a class="anchor" name="write-vocab"></a>

The vocabulary must be written to a file, sorted alphabetically, with each entry paired with its reference code. The sparse matrix file in the next section refers to these codes. We therefore generate a vocabulary list in which each token or bigram is assigned a unique index; the same indices identify the columns of the sparse matrix, which keeps the vocabulary file and the matrix representation consistent.

In [None]:
# make a new column called "gmap_id" to store the gmap_id
matrix_representation.insert(0, 'gmap_id', matrix_representation.index)
matrix_representation.reset_index(drop=True, inplace=True)

# Create a dictionary that maps column names to indices
column_index_mapping = {col: idx for idx, col in enumerate(matrix_representation.columns)}

# Remove the 'gmap_id' entry from the dictionary
column_index_mapping.pop('gmap_id', None)

# Reassign the indices starting from 0
column_index_mapping = {col: new_idx for new_idx, (col, _) in enumerate(column_index_mapping.items())}

# Export the modified dictionary to a text file
with open('/content/drive/MyDrive/Data Wrangling/assignment 1/Task 2/111_vocab.txt', 'w') as file:
    for key, value in column_index_mapping.items():
        file.write(f"{key}: {value}\n")

<div class="alert alert-block alert-warning">
    
### 5.2. Sparse Matrix <a class="anchor" name="write-sparseMat"></a>

To write the sparse matrix, we use the token counts of each business stored in the matrix representation. Each gmap_id is written on its own line, followed by comma-separated index:count pairs for the tokens with non-zero counts, where the index is the reference code of the token or bigram in the vocabulary list.
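The line format can be sketched on toy data as follows. The helper `to_sparse_line`, the business id and the three-word vocabulary are hypothetical; the sketch only mirrors the `gmap_id,index:count,...` layout produced by the cell below.

```python
def to_sparse_line(gmap_id, tokens, vocab_index):
    """Format one business as 'gmap_id,index:count,...', omitting zero counts."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    # sort by vocabulary index so the pairs follow the column order
    pairs = sorted((vocab_index[token], n) for token, n in counts.items())
    return gmap_id + "," + ",".join(f"{idx}:{n}" for idx, n in pairs)


# Indices are assigned in alphabetical order: pizza -> 0, staff -> 1, wait -> 2.
vocab_index = {token: i for i, token in enumerate(sorted(["wait", "staff", "pizza"]))}
line = to_sparse_line("biz_1", ["staff", "wait", "staff"], vocab_index)
print(line)  # biz_1,1:2,2:1
```

Here "pizza" (index 0) does not occur in the business, so it is simply left out of the line.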

In [None]:
# Open a text file for writing the output
with open('/content/drive/MyDrive/Data Wrangling/assignment 1/Task 2/111_countvec.txt', 'w') as file:
    # Iterate over each row in the DataFrame to create the desired output
    for _, row in matrix_representation.iterrows():
        gmap_id = row['gmap_id']  # Get the gmap_id
        row_string = gmap_id + ","  # Start with the gmap_id

        # Add index:value pairs for non-zero values
        non_zero_values = [
            f"{column_index_mapping[col]}:{row[col]}"
            for col in matrix_representation.columns if row[col] != 0 and col != 'gmap_id'
        ]

        row_string += ",".join(non_zero_values)

        # Write the row string to the file
        file.write(row_string + "\n")

-------------------------------------

<div class="alert alert-block alert-success">
    
## 6. Summary <a class="anchor" name="summary"></a>

In this task, we systematically processed a dataset containing Google Map reviews from various businesses in California. The process began with loading and examining the input files, followed by a series of preprocessing steps to clean and prepare the data for further analysis. These steps included:

1.	Tokenization: Breaking down the text into individual tokens, allowing us to work with manageable units of text.

2.	Removing Context-Independent Stopwords: Eliminating common words like “the” and “is” that do not add significant meaning, thus reducing noise in the data.

3.	Filtering Tokens: Removing tokens that are less than three letters long, focusing on substantial words that contribute meaning.

4.	Stemming: Reducing words to their base form (e.g., “running” becomes “run”) to group similar words together, ensuring consistency in the analysis.

5.	Removing Context-Dependent Stopwords: Filtering out words that, although frequent across the dataset, do not distinguish between different reviews, such as “great” or “lot.”

6.	Removing Rare Tokens: Excluding infrequent tokens that are unlikely to add significant insights, thereby reducing noise in the dataset.

7.	Generating Bigrams: Creating pairs of words (e.g., “New York”) that frequently appear together to capture meaningful phrases that might be missed if words were only considered individually.

8.	Matrix Representation: Converting the processed tokens into a numerical matrix format, where each row represents a document and each column corresponds to a token or bigram, enabling further analysis like machine learning or clustering.

The final output included a vocabulary list and a sparse matrix, both of which were exported to files for future use in analytical tasks. These preprocessing steps ensured that the dataset was thoroughly cleaned, organized, and ready for advanced processing, ultimately enabling more accurate and insightful analysis.

-------------------------------------

<div class="alert alert-block alert-success">
    
## 7. References <a class="anchor" name="Ref"></a>

[1] GeeksforGeeks. Pandas dataframe.drop_duplicates(). Accessed August 13, 2022, from https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/

[2] Monash University. (2024). Week 5 Applied Session: Text Pre-Processing [Jupyter Notebook]. Monash University, FIT5196. Accessed August 24, 2024, from https://colab.research.google.com/drive/1OSc-vgcp3rZv9dRUjXLEUCgejbmQpFxo?usp=sharing


