<div class="alert alert-block alert-danger">

# FIT5196 Task 2 in Assessment 1
    
#### Student Name: Deshui Yu
#### Student ID: 34253599

Date: 24/08/2024

Environment: xxxxxx

Libraries used:
* os (for interacting with the operating system, included in Python xxxx) 
* pandas 1.1.0 (for dataframe, installed and imported) 
* multiprocessing (for performing processes on multi cores, included in Python 3.6.9 package) 
* itertools (for performing operations on iterables)
* nltk 3.5 (Natural Language Toolkit, installed and imported)
* nltk.tokenize (for tokenization, installed and imported)
* nltk.stem (for stemming the tokens, installed and imported)

    </div>

<div class="alert alert-block alert-info">
    
## Table of Contents

</div>

[1. Introduction](#Intro) <br>
[2. Importing Libraries](#libs) <br>
[3. Examining Input File](#examine) <br>
[4. Loading and Parsing Files](#load) <br>
$\;\;\;\;$[4.1. Tokenization and Generating Numerical Representation](#tokenize) <br>
[5. Writing Output Files](#write) <br>
$\;\;\;\;$[5.1. Vocabulary List](#write-vocab) <br>
$\;\;\;\;$[5.2. Sparse Matrix](#write-sparseMat) <br>
[6. Summary](#summary) <br>
[7. References](#Ref) <br>

<div class="alert alert-block alert-success">
    
## 1. Introduction <a class="anchor" name="Intro"></a>

</div>

This assessment concerns textual data. The aim is to extract the data, process it, and transform it into a suitable format. The dataset is provided as a JSON file (task1_181.json) containing reviews grouped by business.

<div class="alert alert-block alert-success">
    
## 2. Importing Libraries <a class="anchor" name="libs"></a>

</div>

In this assessment, any Python package is permitted. The following packages were used to accomplish the related tasks:

* **os:** to interact with the operating system, e.g. navigate through folders to read files
* **re:** to define and use regular expressions
* **json:** to load the input JSON file
* **pandas:** to work with dataframes
* **multiprocessing:** to perform processes on multiple cores for faster performance
* **langid:** to detect the language of the text data, ensuring it is in English
* **itertools.chain:** to flatten lists or iterate over multiple lists consecutively
* **nltk (Natural Language Toolkit):** a comprehensive library for natural language processing tasks
* **sklearn.feature_extraction.text.CountVectorizer:** to convert a collection of text documents into a matrix of token counts
* **collections.defaultdict:** to create dictionaries with default values for missing keys

In [1]:
import json
import os
import re
import multiprocessing
from collections import defaultdict
from itertools import chain

import langid
import nltk
import pandas as pd
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import MWETokenizer, RegexpTokenizer
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer

-------------------------------------

<div class="alert alert-block alert-success">
    
## 3. Examining Input File <a class="anchor" name="examine"></a>

</div>

In [2]:
# from google.colab import drive
# drive.mount('/content/drive')

Let's examine the content of the file. To do this, I need to open the JSON file. The previous assignment required converting all review comments in the JSON file to lowercase and removing any emojis.

It is noticed that the file contains several reviews associated with various businesses. These reviews are in the form of JSON objects, where each review includes fields such as review_text, review_rating, user_id, and others. The goal is to preprocess this textual data by performing specific operations like lowercasing, removing emojis, tokenization, and filtering, to ultimately create a vocabulary list and a sparse numerical representation of the text.

Having parsed the JSON file, the following observations can be made:

* **Review text characteristics:** the review_text field must be normalized to lowercase and stripped of emojis. Tokenization should handle punctuation, special characters, and numeric values appropriately.
* **Business review count:** only businesses with at least 70 text reviews are processed, to ensure robust vocabulary generation.
* **Preprocessing requirements:** validate that the text is in English, convert it to lowercase, and remove emojis. Then perform tokenization, stemming, and stopword removal, and filter out rare and short tokens.
* **Vocabulary and count vector generation:** generate both unigrams and meaningful bigrams, selecting the top 200 bigrams by the PMI measure. Output a sorted vocabulary list and a sparse numerical representation mapping business IDs to token frequencies.
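
As a rough illustration of the normalization and tokenization requirements listed above, the sketch below lowercases a string, strips characters from a few common emoji code-point ranges, and extracts letter-only tokens. The helper names `normalize_text` and `letter_tokens` and the chosen Unicode ranges are assumptions made for this example and are not part of the assignment code; the ranges are not an exhaustive definition of emojis.

```python
import re

# Assumed emoji ranges (illustrative, not exhaustive): pictographs/emoticons
# plus the older "miscellaneous symbols" and "dingbats" blocks.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def normalize_text(text):
    """Lowercase the text and drop characters matched by EMOJI_PATTERN."""
    return EMOJI_PATTERN.sub("", text.lower())


def letter_tokens(text):
    """Return runs of ASCII letters; digits and punctuation act as separators."""
    return re.findall(r"[a-zA-Z]+", text)


sample = "Don't miss it 😀 5 stars!"
cleaned = normalize_text(sample)
tokens = letter_tokens(cleaned)
print(tokens)  # ['don', 't', 'miss', 'it', 'stars']
```

Because the apostrophe acts as a separator, a contraction such as "don't" splits into `don` and `t`, which is one way fragments like `don` and `ve` can end up among the tokens.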

<div class="alert alert-block alert-success">
    
## 4. Loading and Parsing File <a class="anchor" name="load"></a>

</div>

In this section, I read the business review-related data from task1_181.json and initialized an empty dictionary called Temp_review_dictionary to store qualifying review_text entries. For each valid dictionary entry, the code retrieves the list of reviews associated with the business. It then extracts the review_text from each review, ensuring the text is not None or an empty string, and stores it in a temporary list for that business. After processing all reviews for the business, the code checks if the number of reviews meets or exceeds 70. If this threshold is met, the reviews are added to Temp_review_dictionary under the corresponding business key.

In [3]:
# Open and load the JSON file
with open('task1_181.json', 'r', encoding='utf-8') as file:
    all_data = json.load(file)

# Initialize a dictionary to store qualifying review_text
Temp_review_dictionary = {}

# Iterate over all items in the data
for key, value in all_data.items():
    if isinstance(value, dict):
        reviews = value.get("reviews", [])
        if isinstance(reviews, list):
            business_reviews = []  # To store all review_text for the current business
            for review in reviews:
                # Get the review_text
                row_review_text = review.get("review_text")
                if row_review_text:  # Ensure review_text is not None or an empty string
                    row_review_text = str(row_review_text)
                    business_reviews.append(row_review_text)
            # If the number of reviews for the current business is 70 or more, save the reviews
            if len(business_reviews) >= 70:
                Temp_review_dictionary[key] = business_reviews

Let's examine the generated dictionary by counting the total number of reviews extracted across all qualifying businesses.

In [4]:
total_reviews = sum(len(reviews) for reviews in Temp_review_dictionary.values())
print(total_reviews)

21132


<div class="alert alert-block alert-warning">
    
### 4.1. Tokenization and Generating Numerical Representation <a class="anchor" name="tokenize"></a>

</div>

Tokenization is a key step in text processing and the basis for producing unigrams. In this section, I first tokenize each review into words using a regular expression that keeps only alphabetic sequences, then store the tokenized reviews back in the dictionary. Next, the token lists for each business are flattened into a single list, and both context-independent stopwords (from stopwords_en.txt) and context-dependent stopwords (tokens that appear in more than 95% of the businesses) are removed. The remaining tokens are stemmed with the Porter stemmer, rare tokens that appear in fewer than 5% of the businesses are removed, and finally tokens shorter than 3 characters are filtered out.
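
To make the two document-frequency thresholds concrete, here is a minimal sketch on a toy corpus. It treats each business as one document, counts in how many documents each token occurs, and keeps only tokens whose document frequency falls between a lower and an upper ratio. The function name `filter_by_document_frequency`, the toy data, and the ratios are hypothetical; unlike the notebook pipeline, the sketch applies both thresholds in a single pass with no stemming in between.

```python
from collections import Counter


def filter_by_document_frequency(docs, min_ratio, max_ratio):
    """Keep tokens whose document frequency lies within [min_ratio, max_ratio].

    docs maps a document ID to its list of tokens.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each token once per document

    keep = {
        token
        for token, count in doc_freq.items()
        if min_ratio * n_docs <= count <= max_ratio * n_docs
    }
    return {doc_id: [t for t in tokens if t in keep] for doc_id, tokens in docs.items()}


toy_docs = {
    "b1": ["food", "great", "pizza", "pizza"],
    "b2": ["food", "great", "coffee"],
    "b3": ["food", "service", "coffee"],
    "b4": ["food", "great", "tyre"],
}
filtered = filter_by_document_frequency(toy_docs, min_ratio=0.5, max_ratio=0.95)
print(filtered)
# {'b1': ['great'], 'b2': ['great', 'coffee'], 'b3': ['coffee'], 'b4': ['great']}
```

Here `food` occurs in every document and is dropped as too common, while `pizza`, `service`, and `tyre` each occur in a single document and are dropped as too rare. Note that `pizza` is counted once for `b1` even though it appears twice there.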

In [5]:
tokenizer = RegexpTokenizer(r"[a-zA-Z]+")
# Iterate over all business reviews and perform tokenization
for business_id, reviews in Temp_review_dictionary.items():
    tokenized_reviews = []  # To store tokenized reviews for each business
    for review in reviews:
        # Tokenize
        tokens = tokenizer.tokenize(review)
        tokenized_reviews.append(tokens)
    
    # Update the review_dictionary with tokenized reviews
    Temp_review_dictionary[business_id] = tokenized_reviews
review_dictionary = {}

# Flatten the list of tokenized words for each business
for review_id, word_list in Temp_review_dictionary.items():
    flattened_list = list(chain.from_iterable(word_list))
    review_dictionary[review_id] = flattened_list
# review_dictionary

# Load the context-independent stopwords from a file    
with open('stopwords_en.txt', 'r') as f:
    stopwords = set(f.read().splitlines())
# Remove context-independent stopwords from the dictionary
for id, flattened_tokens in review_dictionary.items():
    filtered_tokens = []
    for word in flattened_tokens:
        if word not in stopwords:
            filtered_tokens.append(word)
    review_dictionary[id] = filtered_tokens

# Calculate document frequency for each word
doc_freq = FreqDist()
for wordList in review_dictionary.values():
    unique_words = set(wordList)
    for word in unique_words:
        doc_freq[word] += 1
# Identify context-dependent stopwords that appear in more than 95% of the businesses
threshold = 0.95 * len(review_dictionary)
context_dependent_stopwords = [word for word, count in doc_freq.items() if count > threshold]
print(context_dependent_stopwords)
# Remove context-dependent stopwords from the dictionary
for id, flattened_tokens in review_dictionary.items():
    filtered_tokens = []
    for word in flattened_tokens:
        if word not in context_dependent_stopwords:
            filtered_tokens.append(word)

    review_dictionary[id] = filtered_tokens

stemmer = PorterStemmer()
# Stem each word in the dictionary
for id, flattened_tokens in review_dictionary.items():
    stemmed_tokens = []
    for word in flattened_tokens:
        stemmed_word = stemmer.stem(word)
        stemmed_tokens.append(stemmed_word)
    # Update the dictionary with stemmed tokens
    review_dictionary[id] = stemmed_tokens
# Recalculate document frequency after stemming
doc_freq = FreqDist()
for wordList in review_dictionary.values():
    unique_words = set(wordList)
    for word in unique_words:
        doc_freq[word] += 1
# Identify rare words that appear in less than 5% of the businesses
threshold = 0.05 * len(review_dictionary)
rare_words = [word for word, count in doc_freq.items() if count < threshold]
print(rare_words)
# Remove rare words from the dictionary
for id, flattened_tokens in review_dictionary.items():
    filtered_tokens = []
    for word in flattened_tokens:
        if word not in rare_words:
            filtered_tokens.append(word)
    review_dictionary[id] = filtered_tokens

filtered_review_dictionary = {}
# Remove tokens with a length of fewer than 3 characters
for id, flattened_tokens in review_dictionary.items():
    filtered_tokens = [word for word in flattened_tokens if len(word) >= 3]
    filtered_review_dictionary[id] = filtered_tokens
# Final processed dictionary
review_dictionary = filtered_review_dictionary

['don', 'great', 'nice', 'service', 've', 'back', 'love', 'excellent', 'friendly', 'amazing', 'good', 'place', 'staff', 'time']
['newbi', 'immers', 'den', 'brain', 'crime', 'freebi', 'ypu', 'sequenc', 'beginn', 'unforeseen', 'allot', 'ro', 'magnet', 'addict', 'optim', 'uncov', 'riddl', 'bond', 'countrysid', 'knight', 'neon', 'anniversari', 'intric', 'grandkid', 'rancheria', 'jailbreak', 'puzzl', 'heck', 'idiot', 'brightli', 'plot', 'hilltop', 'moseley', 'amus', 'boggl', 'jt', 'prop', 'beep', 'voupl', 'actor', 'exercis', 'januari', 'ashley', 'prison', 'versu', 'impact', 'these', 'powerpoint', 'old', 'jail', 'tinker', 'teas', 'media', 'doabl', 'unsur', 'thrive', 'roommat', 'definet', 'captur', 'fasten', 'shasta', 'cousin', 'teambuild', 'script', 'het', 'joseph', 'businessman', 'soin', 'cab', 'exterior', 'camper', 'shenanigan', 'shi', 'egr', 'rv', 'carolyn', 'glenn', 'ounc', 'rust', 'phil', 'spark', 'negit', 'rewir', 'johnson', 'curteou', 'fluid', 'vandal', 'enuff', 'smoge', 'dui', 'canta

After preprocessing, review_dictionary maps each business ID to a single list containing the processed tokens from all of that business's reviews. The next cell joins each token list into one space-separated string per business.

In [6]:
concatenated_review_dictionary = {}
for pid, tokens in review_dictionary.items():
    concatenated_review_dictionary[pid] = ' '.join(tokens)

At this stage, all reviews for each business are tokenized, and the tokens are stored as a single string in the new dictionary, with one entry per business.

-------------------------------------

<div class="alert alert-block alert-success">
    
## 5. Writing Output Files <a class="anchor" name="write"></a>

</div>

Two files need to be generated:
* Vocabulary list
* Sparse matrix (count_vectors)

This is performed in the following sections.

<div class="alert alert-block alert-warning">
    
### 5.1. Vocabulary List <a class="anchor" name="write-vocab"></a>

</div>

The vocabulary list should also be written to a file, sorted alphabetically, with each entry followed by its index. The sparse matrix in the next file refers to these indices. For this purpose, the code identifies the top 200 bigrams using Pointwise Mutual Information (PMI) and combines them with the unigrams from the tokenized reviews to create a vocabulary. The vocabulary is then sorted alphabetically, each entry is assigned an index, and the result is saved to 181_vocab.txt.
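
For reference, the PMI of a bigram $(x, y)$ is commonly defined as $\log_2 \frac{P(x, y)}{P(x)\,P(y)}$. The sketch below computes it from raw counts on a toy token list. The helper `pmi_from_counts` is hypothetical and only approximates what NLTK's PMI measure computes, since the library's exact counting of totals may differ; the token list is invented for illustration.

```python
import math
from collections import Counter


def pmi_from_counts(count_xy, count_x, count_y, total):
    """PMI = log2(P(x, y) / (P(x) * P(y))), with each probability estimated as count / total."""
    return math.log2((count_xy * total) / (count_x * count_y))


tokens = ["oil", "chang", "fast", "oil", "chang", "price", "fast", "price"]
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))  # adjacent token pairs
total = len(tokens)

scores = {
    bigram: pmi_from_counts(count, unigram_counts[bigram[0]], unigram_counts[bigram[1]], total)
    for bigram, count in bigram_counts.items()
}
best = max(scores, key=scores.get)
print(best, scores[best])  # ('oil', 'chang') 2.0
```

Under this definition, a pair seen once whose two words each occur only once would score `log2(total)`, the largest possible value. This is why a PMI ranking without a minimum-frequency filter can be dominated by rare pairs.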

In [7]:

# Set up bigram measures and find top 200 bigrams using PMI
bigram_measures = nltk.collocations.BigramAssocMeasures()
# Convert the dictionary values (tokenized reviews) to a list for processing
tokenized_reviews = list(review_dictionary.values())
# Create a BigramCollocationFinder to find the most frequent bigrams in the tokenized reviews
finder = BigramCollocationFinder.from_documents(tokenized_reviews)
top_200_bigrams = finder.nbest(bigram_measures.pmi, 200)
#reference from chatGPT
collocated_bigrams = [' '.join(bigram) for bigram in top_200_bigrams]
# Create vocabulary from unigrams and top bigrams, then save to a file
unigrams = set(chain.from_iterable(review_dictionary.values()))
# Combine unigrams and collocated bigrams, then sort them to create the final vocabulary list
vocab = sorted(list(unigrams) + collocated_bigrams)
with open("181_vocab.txt", "w") as vocab_file:
    for index, word in enumerate(vocab):
        vocab_file.write(f"{word}:{index}\n")
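
A quick way to sanity-check the vocabulary file is to read it back into a dictionary. The sketch below is one possible approach, assuming each line has the `token:index` layout written above. `load_vocab` is a hypothetical helper, and an in-memory buffer with made-up entries stands in for 181_vocab.txt.

```python
import io


def load_vocab(lines):
    """Parse 'token:index' lines into a {token: index} dictionary."""
    vocab_index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last colon so the token part (which may contain a space) stays whole.
        token, index = line.rsplit(":", 1)
        vocab_index[token] = int(index)
    return vocab_index


sample_file = io.StringIO("coffe:0\noil chang:1\npizza:2\n")
vocab_index = load_vocab(sample_file)
print(vocab_index)  # {'coffe': 0, 'oil chang': 1, 'pizza': 2}
```

Since the vocabulary is written in alphabetical order, the tokens read back should already be sorted and the indices should run from 0 upward without gaps.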

<div class="alert alert-block alert-warning">
    
### 5.2. Sparse Matrix <a class="anchor" name="write-sparseMat"></a>

</div>

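To write the sparse matrix, we first count how often each vocabulary entry occurs in the combined reviews of each business. Each business is then written as one line containing its ID followed by comma-separated index:frequency pairs.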
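
Conceptually, each row of the sparse matrix stores only the vocabulary indices that occur for a business, together with their counts. The sketch below builds such a row by hand for unigrams and adjacent-token bigrams. The helper `sparse_row` and the toy vocabulary are assumptions for illustration and do not reproduce every detail of `CountVectorizer`.

```python
from collections import Counter


def sparse_row(tokens, vocab_index):
    """Return {vocabulary index: count} for unigrams and adjacent-pair bigrams."""
    candidates = list(tokens) + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter(term for term in candidates if term in vocab_index)
    return {vocab_index[term]: count for term, count in sorted(counts.items())}


vocab_index = {"chang": 0, "fast": 1, "oil": 2, "oil chang": 3}
row = sparse_row(["oil", "chang", "fast", "oil", "chang"], vocab_index)
print(row)  # {0: 2, 1: 1, 2: 2, 3: 2}
```

Terms that are not in the vocabulary, such as the pair `chang fast` here, contribute nothing to the row, which is what keeps the representation sparse.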

In [8]:
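# Initialize a CountVectorizer using the previously created vocabulary.
# ngram_range=(1, 2) is needed so that the space-joined bigrams in vocab are counted;
# with the default (1, 1), only unigrams would ever be matched.
vectorizer = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2))
# Combine the tokens of each business into a single string to prepare for vectorization
reviews_as_text = [' '.join(tokens) for tokens in review_dictionary.values()]
X = vectorizer.transform(reviews_as_text)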

# Get the ordered list of business IDs so that each ID lines up with its row in X
business_ids = list(review_dictionary.keys())
# Open a file to write the sparse matrix data
with open("181_countvec.txt", "w") as countvec_file:
    for i, business_id in enumerate(business_ids):
        row = X[i].tocoo()  # Convert the sparse matrix row to COOrdinate format
        # Create a list of token indices and their frequencies in the format "index:frequency"
        tokens_freq = [f"{col}:{val}" for col, val in zip(row.col, row.data)]
        # Create a line with the business ID followed by the token frequencies, separated by commas
        line = f"{business_id}, " + ", ".join(tokens_freq) + "\n"
        countvec_file.write(line)
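
To check the output, a line of the count-vector file can be parsed back into a business ID and an index-to-frequency mapping. This is a sketch assuming the `business_id, index:frequency, ...` layout written above and business IDs that contain no commas. `parse_countvec_line` and the sample ID are hypothetical.

```python
def parse_countvec_line(line):
    """Split 'business_id, i:f, j:g, ...' into (business_id, {i: f, j: g})."""
    parts = [part.strip() for part in line.strip().split(",")]
    business_id, pairs = parts[0], parts[1:]
    freqs = {}
    for pair in pairs:
        if not pair:
            continue  # a business with no vocabulary tokens leaves an empty field
        index, freq = pair.split(":")
        freqs[int(index)] = int(freq)
    return business_id, freqs


business_id, freqs = parse_countvec_line("0x123abc, 0:2, 3:1, 17:4\n")
print(business_id, freqs)  # 0x123abc {0: 2, 3: 1, 17: 4}
```

Every parsed index should also appear in the vocabulary file, which gives a simple cross-check between the two outputs.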

-------------------------------------

<div class="alert alert-block alert-success">
    
## 6. Summary <a class="anchor" name="summary"></a>

</div>

This script processes business reviews by loading data from a JSON file, filtering reviews for businesses with at least 70 reviews, and performing text preprocessing steps such as tokenization, stopword removal, stemming, and filtering of rare and short words. It then identifies the top 200 bigrams using PMI, combines them with unigrams to create a vocabulary, and saves this vocabulary to a file. Finally, the script vectorizes the reviews using CountVectorizer and writes the sparse matrix representation of the review data, mapped to business IDs, to an output file.

-------------------------------------

<div class="alert alert-block alert-success">
    
## 7. References <a class="anchor" name="Ref"></a>

</div>

[1] Pandas dataframe.drop_duplicates(), https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/, Accessed 13/08/2022.

[2] ChatGPT, https://chatgpt.com/. Used to help improve text expression, correct grammar, provide assignment step suggestions, offer code advice, debug code, and check and enhance regular expressions.



## --------------------------------------------------------------------------------------------------------------------------