<div class="alert alert-block alert-danger">

# FIT5196 Task 2 in Assessment 1
    
#### Student Name: Manh Tung Vu, Ilya Bessonov

#### Student ID: 30531438, 34466029

Date: 24/08/2024

Environment: Python 3

Libraries used:
* os (for interacting with the operating system, included in the Python standard library)
* pandas 2.1.4 (for dataframe, installed and imported)
* itertools (for performing operations on iterables)
* nltk 3.5 (Natural Language Toolkit, installed and imported)
* nltk.tokenize (for tokenization, installed and imported)
* nltk.stem (for stemming the tokens, installed and imported)

    </div>

<div class="alert alert-block alert-info">
    
## Table of Contents

</div>

[1. Introduction](#Intro) <br>
[2. Importing Libraries](#libs) <br>
[3. Examining Input File](#examine) <br>
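[4. Filtering and Parsing File](#load) <br>
$\;\;\;\;$[4.1. Approach and Rationale](<#Approach and Rationale>) <br>
$\;\;\;\;$[4.2. Tokenization and Stopwords Removal](<#Tokenization and Stopwords Removal>) <br>
$\;\;\;\;$[4.3. Generating top 200 bigrams](<#Generating top 200 bigrams>) <br>
$\;\;\;\;$[4.4. Context-dependent stopwords and rare tokens](<#Context-dependent stopwords and rare tokens>) <br>
$\;\;\;\;$[4.5. Stemming, create vocabulary, and output to vocab.txt](#Stemming) <br>
$\;\;\;\;$[4.6. Create a sparse matrix and CountVector](#whetev) <br>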
[6. Summary](#summary) <br>
[7. References](#Ref) <br>

<div class="alert alert-block alert-success">
    
## 1.  Introduction  <a class="anchor" name="Intro"></a>

In this notebook, we preprocess Google Reviews data to generate a vocabulary list and numerical representations, which will be used in downstream NLP tasks such as recommender systems, information retrieval, and machine translation. The data comes from businesses with at least 70 text reviews. Our primary goal is to extract meaningful features that will aid in understanding patterns in the text.

The input data consists of two files:

A CSV file containing business-related information, including:

* gmap_id: The ID of the business.
* review_count: The total number of reviews for the business.
* review_text_count: The number of reviews that contain text.
* response_count: The number of responses from the business.

A JSON file containing detailed review data for each business, including:

* gmap_id: The ID of the business.
* reviews: A list of reviews, each containing:
  - user_id: The ID of the reviewer.
  - time: The time of the review in UTC format (YYYY-MM-DD tt:hh).
  - review_rating: The rating of the business.
  - review_text: The English review text. If the review is in another language, only the English translation is extracted, and emojis are removed. The text is normalized to lowercase.
  - if_pic: Whether the reviewer included pictures (Y/N).
  - pic_dim: The dimensions of any pictures included, as a list of pairs (e.g., [[h,w],[h,w]...]). An empty list is used if there are no pictures.
  - if_response: Whether the review has a response (Y/N).
* earliest_review_date: The earliest review date for the business in the given data subset (UTC format).
* latest_review_date: The latest review date for the business in the given data subset (UTC format).

These files serve as the foundation for extracting and processing the text data, enabling us to build a structured vocabulary and numerical representation for further analysis.
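
As a rough illustration of the JSON layout described above, the sketch below builds one hypothetical record and walks it. It assumes the file is a dictionary keyed by gmap_id, as the loading code later in this notebook suggests. The ID and every value are made up for illustration; only the key names come from the description.

```python
import json

# Hypothetical record: key names follow the structure described above,
# but the gmap_id and every value are invented for illustration only.
sample = {
    "0xaaaa:0xbbbb": {
        "reviews": [
            {
                "user_id": "123",
                "time": "2020-01-01 10:00",
                "review_rating": 5,
                "review_text": "great food and friendly staff",
                "if_pic": "N",
                "pic_dim": [],
                "if_response": "N",
            }
        ],
        "earliest_review_date": "2020-01-01 10:00",
        "latest_review_date": "2020-01-01 10:00",
    }
}

# Round-trip through JSON text to mimic reading the file from disk
data_sketch = json.loads(json.dumps(sample))

# Collect every review text across all businesses
texts = [
    review["review_text"]
    for business in data_sketch.values()
    for review in business["reviews"]
]
print(texts)
```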

<div class="alert alert-block alert-success">
    
## 2.  Importing Libraries  <a class="anchor" name="libs"></a>

Any Python package may be used in this assessment. The following packages were used to accomplish the related tasks:

* **os:** to interact with the operating system, e.g. navigate through folders to read files
* **re:** to define and use regular expressions
* **pandas:** to work with dataframes
* **itertools**: provides tools for creating iterators for efficient looping
* **nltk** (Natural Language Toolkit): to work with human text data
* **nltk.tokenize**: tools to split text into tokens
* **nltk.stem**: provides algorithms for stemming (reducing words to their root forms)
* **json**: to handle JSON (JavaScript Object Notation) data
* **sklearn.feature_extraction.text**: provides the "CountVectorizer" class, which converts a collection of text documents to a matrix of token counts

In [1]:
import os
import re
import pandas as pd
import json

from itertools import chain

import nltk
from nltk.probability import FreqDist
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
from nltk.util import ngrams
from nltk import bigrams

from sklearn.feature_extraction.text import CountVectorizer

-------------------------------------

<div class="alert alert-block alert-success">
    
## 3.  Examining Input File <a class="anchor" name="examine"></a>

In [2]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [3]:
# Load data
path = '/content/drive/MyDrive/FIT5196_assignment_files/'
summary_df = pd.read_csv(path + 'task1_044.csv')
summary_df.shape

(176, 4)

In [4]:
summary_df.dtypes

Unnamed: 0,0
gmap_id,object
review_count,int64
review_text_count,int64
response_count,int64


In [5]:
summary_df.describe()

Unnamed: 0,review_count,review_text_count,response_count
count,176.0,176.0,176.0
mean,214.238636,131.698864,32.3125
std,317.10694,194.234954,96.145182
min,51.0,17.0,0.0
25%,74.0,50.0,0.0
50%,108.0,73.0,0.0
75%,199.0,130.5,23.5
max,2853.0,1776.0,904.0


In [6]:
ilya_path = '/content/drive/MyDrive/Assigment1data/task1_044.json'
tung_path = '/content/drive/MyDrive/FIT5196_assignment_files/task1_044.json'
# Open the JSON file through the path variable instead of repeating the literal path
with open(tung_path, 'r') as file:
  data = json.load(file)

Let's examine the content of the file. We start by checking the type of the loaded object.

In [7]:
type(data)

dict

The data is a dictionary. Next, we examine its structure by checking its keys.

In [8]:
# Examine the structure of json file
gmap_list = list(data.keys())
print(len(gmap_list))
gmap_list[:10]

176


['0x54d4000810cde343:0x144f867433fa0966',
 '0x808164de5730862d:0xcb473d5d65fb1d39',
 '0x808327ba5f197e6d:0x5e31fb9492a334e7',
 '0x8084047107a176cb:0x889336c7010bad47',
 '0x808437fdd8593cbb:0x813a0531fbb5eb48',
 '0x80843e0cb60c8a03:0x32689617e649dd',
 '0x808447e3edf965c9:0x3f8efa82b45b0b45',
 '0x8084acae2f9d2541:0xe64c0a74ef6732af',
 '0x8084d092a8afc95d:0x1b7c5c974eb7e63f',
 '0x8084d0e3d7fa160f:0xf481ebe2fa126e60']

As we know from Task 1, our json file has a structure:
* gmap_id: The ID of the business.
* reviews: A list of reviews, each containing:
  - user_id: The ID of the reviewer.
  - time: The time of the review in UTC format (YYYY-MM-DD tt:hh).
  - review_rating: The rating of the business.
  - review_text: The English review text. If the review is in another language, only the English translation is extracted, and emojis are removed. The text is normalized to lowercase.
  - if_pic: Whether the reviewer included pictures (Y/N).
  - pic_dim: The dimensions of any pictures included, as a list of pairs (e.g., [[h,w],[h,w]...]). An empty list is used if there are no pictures.
  - if_response: Whether the review has a response (Y/N).
* earliest_review_date: The earliest review date for the business in the given data subset (UTC format).
* latest_review_date: The latest review date for the business in the given data subset (UTC format).

We continue by checking that we can access each element of our data.

In [9]:
# Examine the structure of json file
data[gmap_list[0]].keys()

dict_keys(['reviews', 'earliest_review_date', 'latest_review_date'])

In [10]:
data[gmap_list[0]]['reviews'][0].keys()

dict_keys(['user_id', 'time', 'review_rating', 'review_text', 'if_pic', 'pic_dim', 'if_response'])

The next step is to verify that our review_text contains no emojis and is all lowercase.

In [11]:
def contains_emoji(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"
        "\U0001F300-\U0001F5FF"
        "\U0001F680-\U0001F6FF"
        "\U0001F1E0-\U0001F1FF"
        "\U00002702-\U000027B0"
        "\U000024C2-\U0001F251"
        "]+",
        flags=re.UNICODE
    )
    return bool(emoji_pattern.search(text))

In [12]:
# Check the text reviews
n_issues = 0

for gmap_id in gmap_list:
  reviews = data[gmap_id]['reviews']
  for review in reviews:
    if review['review_text'] is not None:
      # Check that the review text contains no emoji
      if contains_emoji(review['review_text']):
        print(f"The record for gmap_id {gmap_id} contains an emoji. Text: {review['review_text']}")
        n_issues += 1

print(f"\n {n_issues} has been found.")
print('Done')


 0 has been found.
Done


In [16]:
# Check if all review_text are in lowercase (excepting 'None')
n_issues = 0
for gmap_id in gmap_list:
  reviews = data[gmap_id]['reviews']
  for review in reviews:
    if review['review_text'] is not None and any(char.isalpha() for char in review['review_text']):  # Check if text has alphabetic characters
          if not (review['review_text'].islower() or review['review_text'] == 'None'):
            print(f"The record for gmap_id {gmap_id} is not in lowercase. Text: {review['review_text']}")
            n_issues += 1

print(f"\n {n_issues} has been found.")
print('Done')

The record for gmap_id 0x80ea13b6e203e78d:0x36861194143d1ebd is not in lowercase. Text: ت

 1 has been found.
Done


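The flagged text is a single Arabic letter (ت) rather than English text. Arabic letters have no upper or lower case, so islower() returns False for it even though it passes the isalpha() check.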
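
A minimal check of this behaviour, using only built-in string methods, is sketched below. The variable names are illustrative and not part of the notebook.

```python
# The Arabic letter Teh (U+062A) is alphabetic but has no case,
# so islower() is False even though nothing is uppercase.
arabic_letter = "\u062a"
english_word = "review"

arabic_flags = (arabic_letter.isalpha(), arabic_letter.islower())
english_flags = (english_word.isalpha(), english_word.islower())

print(arabic_flags)
print(english_flags)
```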

In [15]:
# Check the text reviews

english_symbols = r"a-zA-Z0-9\s\.,;!?\'\"\-\+\$\#\:\(\)\\\/\’\@\&\“\”\*\%`\-…\=\—\!"
pattern_is_english = re.compile(r'^[' + english_symbols+r']+$')
not_english_symbols = re.compile(r'[^' + english_symbols+r']')
n_issues = 0
non_english_words = []

for gmap_id in gmap_list:
  reviews = data[gmap_id]['reviews']
  for review in reviews:
    if review['review_text'] is not None:
      # Check if all review_text are in English
      if not pattern_is_english.match(review['review_text']):
        non_english_words.append(not_english_symbols.findall(review['review_text']))
        n_issues += 1

print(non_english_words[:10])
print(f"\n {n_issues} has been found.")
print('Done')

[['🤗'], ['ñ'], [], ['🤙'], ['é', 'é'], ['🤤'], ['ã'], ['û', 'é'], ['🤗'], ['í']]

 288 has been found.
Done


These characters are either non-English letters or emojis that fall outside the list of emojis we were asked to remove.
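
One way to sanity-check this observation is to look up the Unicode category of a few flagged characters and test them against the code point ranges used in contains_emoji(). The sketch below uses the standard unicodedata module. The helper in_emoji_ranges and the choice of three sample characters are assumptions made for illustration.

```python
import unicodedata

# Three characters taken from the flagged output: a hugging-face emoji,
# n with tilde, and e with acute accent (written as escapes for clarity)
flagged = ["\U0001F917", "\u00f1", "\u00e9"]

# Code point ranges copied from the pattern in contains_emoji()
emoji_ranges = [
    (0x1F600, 0x1F64F), (0x1F300, 0x1F5FF), (0x1F680, 0x1F6FF),
    (0x1F1E0, 0x1F1FF), (0x2702, 0x27B0), (0x24C2, 0x1F251),
]

def in_emoji_ranges(char):
    """Hypothetical helper: True if the character falls in any listed range."""
    return any(low <= ord(char) <= high for low, high in emoji_ranges)

# Map each character to (Unicode category, covered by the emoji pattern?)
report = {
    char: (unicodedata.category(char), in_emoji_ranges(char))
    for char in flagged
}
for char, info in report.items():
    print(f"U+{ord(char):04X}", info)
```

Here "So" marks a symbol and "Ll" a lowercase letter. None of the three characters falls inside the listed ranges, which is consistent with the emoji check reporting no issues while the English check flags them.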

In [17]:
# Check if all review_text are in English
for gmap_id in gmap_list:
    reviews = data[gmap_id]['reviews']
    for review in reviews:
        review_text = review['review_text']
        if review_text is not None:
          if len(re.findall(r'\(original\).*', review_text)) > 0:
            print(gmap_id)
            print(review_text)
print('Done')

Done


The cells above check whether the input text is lowercase, in English, and free of emojis. The English check flagged 288 reviews; however, the majority of the flagged characters are emojis that are not in the list we need to remove, along with accented letters such as ñ and é, so we leave them in place. Moreover, we have over 30 thousand observations and only 288 of them are affected, so this is acceptable.

<div class="alert alert-block alert-success">
    
## 4.  Filtering and Parsing File <a class="anchor" name="load"></a>

In this section, we will create a vocab and a count vector from the review_text of the selected businesses. We start by selecting gmap_ids from the summary data in the CSV file; the filter below keeps businesses whose review_count is at least 70.

In [18]:
# Extract gmap_ids with at least 70 reviews (the filter uses the review_count column)
filt = summary_df['review_count'] >= 70
gmap_id_list = summary_df[filt]['gmap_id'].tolist()
len(gmap_id_list)

135

So there are 135 businesses with at least 70 reviews. Next, we create a dictionary where the keys are the gmap_ids of the businesses and the values are the corresponding review_text. One business has multiple reviews, but for processing we concatenate its reviews into one long string.

In [19]:
gmap_review_dict = {}
for gmap_id in gmap_id_list:
    review_sum = ''
    for review in data[gmap_id]['reviews']:
        review_text = review['review_text']
        if review_text != 'None':
            if review_sum:  # Check if review_sum is not empty
                review_sum += ' '  # Add space before the new review
            review_sum += review_text
    gmap_review_dict[gmap_id] = review_sum

In [20]:
# Check to see if the dictionary is created correctly
gmap_review_dict['0x80859a3b08246ae3:0x73dd87ce4c42c354'][:500]

"they saved mango's life.mango was diagnosed with lymphoma at another clinic and quickly went downhill after his elspar injection.  he could barely stand.  at 14 we thought his little body couldn't handle any more and we feared the worst when we brought him here.the staff was calming and professional and most important they kept us constantly informed so we were never left wondering what was going on.after his first night here we came to visit in the outside tent. honestly i thought it would be o"

We examine the first 500 characters of one of our review_text to see how it looks. The next step is Tokenization.

<div class="alert alert-block alert-warning">
    
### 4.1. Approach and Rationale <a class="anchor" name="Approach and Rationale"></a>

**Order of Text Processing and Rationale**

We decided to process our text data in the following order:
1. Tokenization
2. Context-independent stopwords and short tokens removal
3. Rare tokens and context-dependent stopwords removal
4. Generating the top 200 bigrams
5. Stemming unigrams and keeping bigrams intact
6. Calculating the vocabulary, including both unigrams and bigrams
7. Creating a sparse matrix and CountVector

The rationale for this order of processing is as follows.
1. **Tokenization**: This step breaks the text into individual tokens. It is done first because all subsequent text processing steps depend on it.

2. **Context-independent stopwords and short tokens removal**: This step removes common stopwords (such as "and", "the", etc.) and short tokens, which are assumed to be meaningless. It is done before **Rare tokens and context-dependent stopwords removal** because removing stopwords and short tokens early reduces the dataset size quickly. These tokens add little value to the analysis, so removing them simplifies the following steps.

3. **Rare tokens and context-dependent stopwords removal**: Rare tokens and context-dependent stopwords are removed to further clean the data. Rare tokens are defined as tokens with a document frequency of less than 5%, and context-dependent stopwords as tokens with a document frequency of more than 95%. Rare tokens can introduce noise, while context-dependent stopwords (words that appear in almost every document) do not provide meaningful information for analysis.

4. **Generating top 200 bigrams**: The top 200 bigrams by Pointwise Mutual Information (PMI) are added to the list of tokens.

**Remarks:** Stopwords and rare tokens removal was done before generating bigrams to prevent noise in the bigrams and reduce the computing load. Stopwords and rare tokens carry little meaning, and if they were kept when generating bigrams, meaningless bigrams like "of the" and "and it" would be generated. Removing stopwords and rare tokens early on also reduces the number of potential bigrams.

5. **Stemming unigrams and keeping bigrams intact**: This step reduces each unigram to its root form using PorterStemmer. Bigrams are kept intact to preserve the specific word combinations that could lose meaning if stemmed.

**Remarks:** Stemming was done after stopwords removal (both context-dependent and context-independent) to avoid stemming irrelevant words and altering the document frequency of some words. Because stemming maps several word forms to one root, the root can have a higher document frequency than any of the original forms. For example, if "run", "running", and "runs" each appear in a single, different document, each has a document frequency of 1; after stemming, all three become "run", which then has a document frequency of 3. Context-dependent stopwords and rare tokens are defined by document frequency, so applying those filters to the changed frequencies would remove a different set of tokens.

6. **Calculating the vocabulary, including both unigrams and bigrams**: Combining the cleaned and processed unigrams and bigrams into a final vocabulary list ensures that the text analysis captures both individual word meanings and important word combinations.

7. **Creating a sparse matrix and CountVector**: Finally, CountVectorizer is used to build a sparse matrix that represents each document numerically over the vocabulary, ready for machine learning models or other quantitative analysis.

The details of each step are below.
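
The effect of stemming on document frequency described in the remarks can be sketched on a toy corpus. The function toy_stem below is a hypothetical stand-in that only strips two suffixes; it is not PorterStemmer and is used here only to keep the example free of external dependencies.

```python
from collections import Counter
from itertools import chain

def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strips 'ning' or 's'."""
    for suffix in ("ning", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def document_frequency(documents):
    """Count in how many documents each token appears at least once."""
    return Counter(chain.from_iterable(set(doc) for doc in documents))

# Three toy documents, each containing a different form of "run"
docs = [["run", "fast"], ["running", "late"], ["runs", "daily"]]

df_before = document_frequency(docs)
df_after = document_frequency([[toy_stem(t) for t in doc] for doc in docs])

print(df_before["run"], df_before["running"], df_before["runs"])
print(df_after["run"])
```

Before stemming, each form has a document frequency of 1; after stemming, "run" has a document frequency of 3, so a frequency-based filter applied at that point would see different numbers.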


<div class="alert alert-block alert-warning">
    
### 4.2. Tokenization and Stopwords Removal <a class="anchor" name="Tokenization and Stopwords Removal"></a>

Tokenization is a principal step in text processing and produces the unigrams. In this section, we turn review_text into a list of tokens for each business gmap_id. Then, context-independent stopwords and short tokens (of length less than three) are removed. The output is a dictionary where each key is a gmap_id and each value is the corresponding list of tokens for that gmap_id (with context-independent stopwords and tokens of length less than 3 removed).



In [21]:
#Stopwords
ilya_path = "/content/drive/Shareddrives/FIT5196_S2_2024/GroupAssessment1/stopwords_en.txt"
tung_path = "/content/drive/MyDrive/FIT5196_assignment_files/stopwords_en.txt"
with open(ilya_path, 'r') as file:
    context_independent_stopwords_list = file.read().splitlines()

With the list of context-independent stopwords loaded, we loop through each gmap_id and review_text in gmap_review_dict and tokenize each review_text using RegexpTokenizer with the regular expression r"[a-zA-Z]+". We then remove context-independent stopwords and tokens with length less than 3 using a Python list comprehension. Finally, we store the list of tokens for each gmap_id in a new dictionary, gmap_id_token_dict.

In [22]:
gmap_id_token_dict = {}

# Build the tokenizer and the stopword set once, outside the loop
tokenizer = RegexpTokenizer(r"[a-zA-Z]+")
stopword_set = set(context_independent_stopwords_list)

for gmap_id, review_text in gmap_review_dict.items():
    # Tokenize the review text
    unigram_tokens = tokenizer.tokenize(review_text)

    # Remove context-independent stopwords and tokens with length < 3
    unigram_tokens = [
        token for token in unigram_tokens
        if token.lower() not in stopword_set and len(token) >= 3
    ]

    # Store the token list under the corresponding gmap_id
    gmap_id_token_dict[gmap_id] = unigram_tokens

len(gmap_id_token_dict['0x80859a3b08246ae3:0x73dd87ce4c42c354'])

5056
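
To see what the tokenization and filtering do on a small input, the sketch below applies the same regular expression with the standard re module. The sentence and the four-word stopword set are assumptions chosen for illustration and stand in for the full stopwords_en.txt list.

```python
import re

# Toy input and a tiny, hypothetical stopword set
text = "they saved mango's life. the staff was calming and professional"
stopwords = {"they", "the", "was", "and"}

# Same pattern as in the notebook: runs of ASCII letters only
raw_tokens = re.findall(r"[a-zA-Z]+", text)

# Drop stopwords and tokens shorter than three characters
tokens = [t for t in raw_tokens if t.lower() not in stopwords and len(t) >= 3]

print(raw_tokens)
print(tokens)
```

Note that the apostrophe splits "mango's" into "mango" and "s", and the length filter then discards the stray "s".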

<div class="alert alert-block alert-warning">

### 4.3. Generating top 200 bigrams <a class="anchor" name="Generating top 200 bigrams"></a>

First, we build all_tokens_list, a flat list of all tokens across our review_text, using chain.from_iterable(). We then obtain the unique vocabulary by converting all_tokens_list to a set.



In [23]:
all_tokens_list = list(chain.from_iterable(gmap_id_token_dict.values()))
len(all_tokens_list)

245552

In [24]:
uni_voc = list(set(all_tokens_list))
len(uni_voc)

16320

We build a BigramCollocationFinder from all_tokens_list and score the candidate bigrams with the PMI measure from nltk.collocations.BigramAssocMeasures(). The 200 highest-scoring bigrams are kept.

In [25]:
#generate top 200 bigrams
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(all_tokens_list)
top_200_bigrams = finder.nbest(bigram_measures.pmi, 200)
top_200_bigrams[:10]


[('aarrrggh', 'corporations'),
 ('abbot', 'kenny'),
 ('abusive', 'creeps'),
 ('accountability', 'instilling'),
 ('accounted', 'attatude'),
 ('accurated', 'calculations'),
 ('achievable', 'empowering'),
 ('administrational', 'therapeutic'),
 ('afb', 'ventured'),
 ('affectation', 'refined')]
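
The top-ranked pairs above are mostly very rare words. A small sketch of the PMI formula helps explain why: PMI is the base-2 logarithm of (bigram count × total tokens) / (count of first word × count of second word), so a pair of words that each occur once, and occur together, gets the highest possible score. The token list below is a toy assumption, and the function is a plain-Python illustration of the formula rather than NLTK's implementation.

```python
import math
from collections import Counter

# Toy token stream: "great service" is frequent, "abbot kenny" occurs once
tokens = ["great", "service", "great", "food",
          "abbot", "kenny", "great", "service"]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)

def pmi(w1, w2):
    """PMI = log2( count(w1, w2) * N / (count(w1) * count(w2)) )."""
    joint = bigram_counts[(w1, w2)]
    return math.log2(joint * n_tokens / (unigram_counts[w1] * unigram_counts[w2]))

rare_pair = pmi("abbot", "kenny")
common_pair = pmi("great", "service")
print(rare_pair, common_pair)
```

The one-off pair scores higher than the frequent pair, which matches the kind of bigrams seen in the output above.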

We then check that our bigrams actually appear in our tokens by re-tokenizing all_tokens_list with MWETokenizer(), which merges each matched pair into a single token joined by "_".


In [26]:
# Only the bigram tuples need to be registered as multi-word expressions
tokenizer = MWETokenizer(top_200_bigrams)
mwe_tokens = tokenizer.tokenize(all_tokens_list)
len(mwe_tokens)

245362

In [27]:
mwe_tokens = list(set(mwe_tokens))
len(mwe_tokens)

16130

In [28]:
# Check if we have bigrams in our list
bigram_checklist = []
for bigram in mwe_tokens:
  if '_' in bigram:
    bigram_checklist.append(bigram)

# Inspect the output
len(bigram_checklist)

190

So we have added 190 bigrams into our vocab.
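
The merging step can be illustrated with a small left-to-right merge written in plain Python. The function merge_bigrams is a hypothetical simplification, not MWETokenizer itself. It also shows one possible reason why fewer than 200 bigrams survive: when two candidate bigrams overlap, only the first one can be merged.

```python
def merge_bigrams(tokens, bigrams, separator="_"):
    """Greedy left-to-right merge of adjacent token pairs listed in bigrams."""
    bigram_set = set(bigrams)
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigram_set:
            merged.append(separator.join(pair))
            i += 2  # both tokens are consumed by the merged bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

simple = merge_bigrams(["visited", "abbot", "kenny", "yesterday"],
                       [("abbot", "kenny")])

# Overlapping candidates: ("a", "b") is merged first, so ("b", "c") never matches
overlap = merge_bigrams(["a", "b", "c"], [("a", "b"), ("b", "c")])

print(simple)
print(overlap)
```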

Now we remove tokens with length less than 3.

In [29]:
mwe_tokens = [token for token in mwe_tokens if len(token) >= 3]
len(mwe_tokens)

16130

<div class="alert alert-block alert-warning">

### 4.4. Context-dependent stopwords and rare tokens <a class="anchor" name="Context-dependent stopwords and rare tokens"></a>


The next step is to remove context-dependent stopwords and rare tokens. Both are defined by the document frequency of a word, so we first compute the document frequency of each word in our vocab.

First, we create a set of tokens for each value in gmap_id_token_dict. The set operation makes each word appear only once per business, so when we combine all the sets and count how often each token appears, we get its document frequency. The sets are combined using chain.from_iterable, and the counting is done with the FreqDist() function from nltk.probability.

The thresholds for removal are a document frequency of less than 5% (rare tokens) or more than 95% (context-dependent stopwords), as specified in the assignment requirements.

Then, we remove these tokens from the vocab.

In [30]:
words = list(chain.from_iterable([set(value) for value in gmap_id_token_dict.values()]))
fd = FreqDist(words)

#Set up threshhold for inclusion criteria
num_businesses = len(gmap_review_dict)
upper_threshhold = 0.95 * num_businesses
lower_threshhold = 0.05 * num_businesses

context_dependent_stopwords_list = set([word for word, freq in fd.items() if freq > upper_threshhold])
rare_tokens = set([word for word, freq in fd.items() if freq < lower_threshhold])

mwe_tokens = [token for token in mwe_tokens if token not in context_dependent_stopwords_list and token not in rare_tokens]
len(mwe_tokens)

2729

We print the number of tokens left in our vocab after removing context-dependent stopwords and rare tokens as a check that the code works.
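
The thresholding logic can be sketched on a synthetic corpus using only the standard library, with Counter standing in for FreqDist. The 40 toy documents and their words are assumptions made for illustration.

```python
from collections import Counter
from itertools import chain

# Synthetic corpus: "place" is in every document, "aarrrggh" in only one
docs = []
for i in range(40):
    tokens = ["place", "pizza" if i < 10 else "coffee"]
    if i == 0:
        tokens.append("aarrrggh")
    docs.append(tokens)

# Document frequency: each document contributes a word at most once
doc_freq = Counter(chain.from_iterable(set(doc) for doc in docs))

n_docs = len(docs)
lower, upper = 0.05 * n_docs, 0.95 * n_docs

too_common = {word for word, freq in doc_freq.items() if freq > upper}
too_rare = {word for word, freq in doc_freq.items() if freq < lower}
kept = sorted(set(doc_freq) - too_common - too_rare)

print(too_common, too_rare, kept)
```

With 40 documents the cut-offs are 2 and 38 documents, so "place" is dropped as a context-dependent stopword, "aarrrggh" is dropped as a rare token, and the remaining two words are kept.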

Our token list now contains both the unigrams and the bigrams we need. The next step is to stem the unigrams using PorterStemmer().

<div class="alert alert-block alert-warning">
    
### 4.5. Stemming, create vocabulary, and output to vocab.txt <a class="anchor" name="Stemming"></a>

We stem the unigrams using PorterStemmer() and leave the bigrams untouched. Bigrams are kept intact to preserve the specific word combinations that could lose meaning if stemmed.

First, we write a function that stems unigrams only, then we apply that function to our list of mwe_tokens. Finally, we use set() so that each token appears only once in our list.

In [31]:
stemmer = PorterStemmer()

def stem_unigrams_only(tokens):
    stemmed_tokens = []
    for token in tokens:
        if "_" in token:  # Skip bigrams (containing "_")
            stemmed_tokens.append(token)
        else:  # Stem unigrams
            stemmed_tokens.append(stemmer.stem(token))
    return stemmed_tokens

stemmed_tokens = stem_unigrams_only(mwe_tokens)

# Inspect the output
stemmed_tokens[:10]

['paid',
 'appar',
 'spanish',
 'abbot_kenny',
 'slice',
 'form',
 'patron',
 'mask',
 'adult',
 'grate']

In [32]:
len(stemmed_tokens)

2729

In [33]:
# Create a unique and sorted vocab list
stemmed_tokens = list(set(stemmed_tokens))
stemmed_tokens = sorted(stemmed_tokens)
len(stemmed_tokens)

1991

We now have a vocab, stemmed_tokens, that contains both bigrams and unigrams. We have removed context-dependent stopwords, context-independent stopwords, short tokens, and rare tokens. We now write the vocab to our output file.

In [34]:
with open('/content/drive/MyDrive/FIT5196_assignment_files/044_vocab.txt', 'w') as f:
    for index, word in enumerate(stemmed_tokens, start=0):
        f.write(f"{word}:{index}\n")

<div class="alert alert-block alert-warning">
    
### 4.6. Create a sparse matrix and CountVector <a class="Create a sparse" name="whetev"></a>

The final step is to create a sparse matrix and its count vector using CountVectorizer().

We created new bigrams but have not yet added them to the token list of each gmap_id, so we have to add them before we create the count vector.

In [35]:
# Merge bigrams into each gmap_id's token list, then stem the unigrams

# Only the bigram tuples need to be registered as multi-word expressions
tokenizer = MWETokenizer(top_200_bigrams)

for gmap_id, tokens in gmap_id_token_dict.items():
    # Re-tokenize so that each matched bigram becomes a single "w1_w2" token
    mwe_tokens = tokenizer.tokenize(tokens)

    # Stem unigrams only and store the updated token list
    gmap_id_token_dict[gmap_id] = stem_unigrams_only(mwe_tokens)

We create a mapping between token and index number to match the expected output.

In [36]:
vocab = stemmed_tokens
token_to_index = {token: index for index, token in enumerate(vocab)}

In [37]:
gmap_list = list(gmap_id_token_dict.keys())
gmap_list[:10]

['0x808164de5730862d:0xcb473d5d65fb1d39',
 '0x808327ba5f197e6d:0x5e31fb9492a334e7',
 '0x8084047107a176cb:0x889336c7010bad47',
 '0x8084acae2f9d2541:0xe64c0a74ef6732af',
 '0x8084d092a8afc95d:0x1b7c5c974eb7e63f',
 '0x8084d0e3d7fa160f:0xf481ebe2fa126e60',
 '0x8085072c051576f1:0xef004a4a10bf4322',
 '0x808566df363f8e6b:0x4981b6aa71d7df9',
 '0x80857342c8df43f5:0x63ca2b246ed1de5e',
 '0x8085778fd5a0ea71:0xe9c5b4b613568661']

We instantiate CountVectorizer() with our vocab as its vocabulary and create a sparse matrix.

In [38]:
vectorizer = CountVectorizer(vocabulary=vocab)
text = [' '.join(tokens) for tokens in gmap_id_token_dict.values()]
X = vectorizer.fit_transform(text)

In [39]:
# Inspect the matrix
X.shape

(135, 1991)

In [40]:
# Convert matrix to array for access
X_arr = X.toarray()
X_arr

array([[1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 2, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 2, 0]])

Now we have a matrix in which each row is a gmap_id and each column holds the number of times a specific vocabulary token appears in that business's reviews. Knowing the structure of the matrix, we can start writing our output.

In [41]:
with open('/content/drive/MyDrive/FIT5196_assignment_files/044_countvec.txt', 'w') as file:
  for i, row in enumerate(X_arr):
    line = gmap_list[i]
    for j, val in enumerate(row):
      if val > 0:
        line += f",{token_to_index[vocab[j]]}:{val}"
    file.write(line + '\n')
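
The line format written above (gmap_id followed by index:count pairs) can also be produced directly from token counts, without building a dense array. The sketch below is an alternative illustration using only the standard library; the three-word vocab, the ID "0xabc:0xdef", and the helper countvec_line are assumptions and not part of the notebook.

```python
from collections import Counter

# Toy vocabulary in sorted order, mapped to indices as in the notebook
vocab_sketch = ["abbot_kenny", "food", "servic"]
index_of = {token: index for index, token in enumerate(vocab_sketch)}

def countvec_line(doc_id, tokens, token_to_index):
    """Format one document as 'doc_id,index:count,...' for in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in token_to_index)
    pairs = sorted((token_to_index[t], c) for t, c in counts.items())
    return doc_id + "".join(f",{index}:{count}" for index, count in pairs)

# Hypothetical document: "unknown" is not in the vocabulary and is ignored
line_sketch = countvec_line(
    "0xabc:0xdef", ["food", "servic", "food", "unknown"], index_of
)
print(line_sketch)
```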

<div class="alert alert-block alert-success">
    
## 6. Summary <a class="anchor" name="summary"></a>

In this notebook, we successfully processed Google Reviews data to extract meaningful features that are crucial for various downstream natural language processing tasks, such as recommender systems, information retrieval, and machine translation. The preprocessing pipeline involved several key steps:

1. Data Loading and Initial Exploration: We started by loading data from a CSV file containing business-related information and a JSON file containing detailed review data. We ensured that the review texts were normalized, free of emojis, and in English.

2. Text Processing: The core text processing tasks were systematically executed, including tokenization, stopwords removal, and stemming. We also identified and removed rare tokens and context-dependent stopwords to clean the dataset effectively.

3. Bigram Generation and Integration: We generated the top 200 bigrams based on Pointwise Mutual Information (PMI) and integrated these into our vocabulary, maintaining the meaning of specific word combinations.

4. Vocabulary Creation and Sparse Matrix Generation: A final vocabulary list, comprising both unigrams and bigrams, was generated. Using this vocabulary, we created a sparse matrix and a corresponding CountVector, which were outputted for further analysis.

Overall, the structured and detailed approach taken in this notebook provided a solid foundation for further text analysis, ensuring that the processed data is in a format suitable for machine learning models and other quantitative analyses.

<div class="alert alert-block alert-success">
    
## 7. References <a class="anchor" name="Ref"></a>

[1] Pandas DataFrame: Wes McKinney, “pandas: powerful Python data analysis toolkit,” https://pandas.pydata.org/, Accessed 24/08/2024.

[2] NLTK Tokenization and Stemming: Steven Bird, Ewan Klein, and Edward Loper, “Natural Language Toolkit,” http://www.nltk.org/, Accessed 24/08/2024.

[3] CountVectorizer (scikit-learn): Pedregosa et al., “Scikit-learn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011, https://scikit-learn.org/, Accessed 24/08/2024.

[4] RegexpTokenizer (NLTK): Bird, S., Klein, E., & Loper, E. (2009). “Natural Language Processing with Python,” O’Reilly Media Inc. http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.RegexpTokenizer, Accessed 24/08/2024.

[5] MWETokenizer (NLTK): Bird, S., Klein, E., & Loper, E. (2009). “Natural Language Processing with Python,” O’Reilly Media Inc. http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.mwe.MWETokenizer, Accessed 24/08/2024.

[6] FreqDist (NLTK): Bird, S., Klein, E., & Loper, E. (2009). “Natural Language Processing with Python,” O’Reilly Media Inc. http://www.nltk.org/api/nltk.probability.html#nltk.probability.FreqDist, Accessed 24/08/2024.

[7] PorterStemmer (NLTK): Bird, S., Klein, E., & Loper, E. (2009). “Natural Language Processing with Python,” O’Reilly Media Inc. http://www.nltk.org/api/nltk.stem.html#nltk.stem.PorterStemmer, Accessed 24/08/2024.

We acknowledge the use of Gen-AI (ChatGPT, Copilot and Gemini) to get hints about specific Python libraries/functions/classes/methods/options/settings. All information obtained from AI was located and double-checked in the original library manuals.



