<div class="alert alert-block alert-danger">

# FIT5196 Task 2 in Assessment 1
    
#### Student Name: Huangjin Wang
#### Student ID: 32189222

Date: 30/08/2022

Environment: Python 3.9.7

Libraries used:
* os (for interacting with the operating system, included in Python 3.9)
* re (for extracting PIDs and review text, included in Python 3.9)
* pandas 1.1.0 (for dataframes, installed and imported)
* multiprocessing (for running processes on multiple cores, included in Python 3.9)
* itertools (for performing operations on iterables, included in Python 3.9)
* nltk 3.5 (Natural Language Toolkit, installed and imported)
* nltk.collocations (for finding bigrams, installed and imported)
* nltk.tokenize (for tokenization, installed and imported)
* nltk.stem (for stemming the tokens, installed and imported)
* sklearn.feature_extraction.text (for creating the count vector, installed and imported)
* pdfminer (for reading the PDF file, installed and imported)
* io (for converting the PDF to text, included in Python 3.9)
* math (for calculating the threshold, included in Python 3.9)
    </div>

<div class="alert alert-block alert-info">
    
## Table of Contents

</div>

[1. Introduction](#Intro) <br>
[2. Importing Libraries](#libs) <br>
[3. Examining Input File](#examine) <br>
[4. Loading and Parsing Files](#load) <br>
$\;\;\;\;$[4.1. Tokenization](#tokenize) <br>
$\;\;\;\;$[4.2. Whatever else](#whetev) <br>
$\;\;\;\;$[4.3. Finding First 200 Meaningful Bigrams](#bigrams) <br>
[5. Writing Output Files](#write) <br>
$\;\;\;\;$[5.1. Vocabulary List](#write-vocab) <br>
$\;\;\;\;$[5.2. Sparse Matrix](#write-sparseMat) <br>
[6. Summary](#summary) <br>
[7. References](#Ref) <br>

<div class="alert alert-block alert-success">
    
## 1.  Introduction  <a class="anchor" name="Intro"></a>

This assessment concerns textual data. The aim is to extract the data, process it, and transform it into a proper format. The dataset is provided as a PDF file containing PIDs (product IDs) and reviews. First, we read the PDF file and extract its text. Then we apply regular expressions to extract the PIDs and their reviews, and store them in a dictionary. Next, we tokenize the review text for each PID and store the tokens in another dictionary. Finally, the following steps produce the output files `vocab.txt` and `countVec.txt`:

[1] Apply `case normalisation` to the tokens

[2] Find the `top 200 meaningful bigrams` by the PMI measure

[3] Remove `stopwords` and tokens with length `less than 3`

[4] Remove `duplicated tokens` within each review, so that each token is counted at most once per review when finding the token frequency

[5] Find the `token frequency`

[6] Remove `rare tokens` and `context_dependent` vocab

[7] Apply `stemming` to unigrams, then remove vocab with length `less than 3` and `stopwords`

[8] Output `vocab.txt`

[9] Find the vocab that appears in each PID's review and the frequency of each vocab entry

[10] Create the `sparse matrix` by using `CountVectorizer`

<div class="alert alert-block alert-success">
    
## 2.  Importing Libraries  <a class="anchor" name="libs"></a>

In this assessment, any Python package is permitted. The following packages were used to accomplish the related tasks:

* **os:** to interact with the operating system, e.g. navigate through folders to read files
* **re:** to define and use regular expressions
* **pandas:** to work with dataframes
* **multiprocessing:** to perform processes on multi cores for fast performance 
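* **itertools:** to flatten the per-review token lists (`chain`)
* **nltk:** to use the tokenizers, collocation finder and stemmer
* **pdfminer:** to read the PDF file
* **io:** to hold the text extracted from the PDF in memory (`StringIO`)
* **sklearn.feature_extraction.text:** to create the count vector (`CountVectorizer`)
* **math:** to calculate the threshold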

In [1]:
import os
import re
import math
import pandas as pd
import multiprocessing
from itertools import chain
import nltk
from nltk.probability import *
from nltk.collocations import *
from nltk.tokenize import RegexpTokenizer
from nltk.tokenize import MWETokenizer
from nltk.stem import PorterStemmer
from nltk.util import ngrams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
from sklearn.feature_extraction.text import CountVectorizer

-------------------------------------

<div class="alert alert-block alert-success">
    
## 3.  Examining Input File <a class="anchor" name="examine"></a>

Let's examine the content of the file. Each PID is 10 characters long and starts with either `B0` or digits. Each review text starts with `[` and ends with `]`. However, the review text itself can also contain `[` and `]`, as in `[amazon]`.

<div class="alert alert-block alert-success">
    
## 4.  Loading and Parsing File <a class="anchor" name="load"></a>

In this section, we read the PDF file and extract the PIDs and reviews.

In [2]:
# Function to convert a PDF file to text
def pdf_to_text(path):
    manager = PDFResourceManager()
    retstr = StringIO()
    layout = LAParams(all_texts=True)
    device = TextConverter(manager, retstr, laparams=layout)
    filepath = open(path, 'rb')
    interpreter = PDFPageInterpreter(manager, device)

    for page in PDFPage.get_pages(filepath, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    filepath.close()
    device.close()
    retstr.close()
    return text

Let's examine the dictionary generated. The total number of reviews extracted is 100.

In [3]:
# Reading pdf file 
pdf_text = pdf_to_text('32189222_task2.pdf')
pdf_text

pid_pattern = r'B0\w{8}(?=\n{2})|\d{9}\w(?=\n{2})' # regular expression for extracting product ids
pids = re.findall(pid_pattern, pdf_text)

text_pattern = r'(?:\n*\[)(.*(?:\n.*)*?\n.*)(?:\]\n{2})' # regular expression for extracting reviews
texts = re.findall(text_pattern, pdf_text)

pid_text_dict = dict(zip(pids,texts))
print(len(pids))
print(len(texts))
print(len(pid_text_dict))

100
100
100
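
The two patterns above can be checked against a small synthetic string. The sketch below is only an illustration: `sample` is made-up text that mimics the layout described in Section 3 (it is an assumption, not content from the real PDF), and the second PID is invented.

```python
import re

# Patterns copied from the cell above.
pid_pattern = r'B0\w{8}(?=\n{2})|\d{9}\w(?=\n{2})'
text_pattern = r'(?:\n*\[)(.*(?:\n.*)*?\n.*)(?:\]\n{2})'

# Synthetic stand-in for the extracted PDF text (assumed layout, not real data).
sample = (
    "B00006IRR4\n\n"
    "[Great sound from [amazon]\nfor the price]\n\n"
    "0123456789\n\n"
    "[Second review\nline two]\n\n"
)

sample_pids = re.findall(pid_pattern, sample)
sample_reviews = re.findall(text_pattern, sample)

# Pair each PID with its review, as the notebook does with dict(zip(...)).
print(dict(zip(sample_pids, sample_reviews)))
```

The inner `[amazon]` does not end the first review, because a review is only closed by `]` followed by two newlines. Note that the review pattern expects at least one line break inside the brackets.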


<div class="alert alert-block alert-warning">
    
### 4.1. Tokenization <a class="anchor" name="tokenize"></a>

Tokenization is a principal step in text processing and in producing unigrams and bigrams. In this section, we apply case normalization before tokenizing the review texts.

In [4]:
tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")
tokenized_dict = {}
for k,v in pid_text_dict.items():
    v = str(v).lower() # case normalization
    tokenized_dict[k] = tokenizer.tokenize(v) # tokenization for texts
tokenized_dict

{'B00006IRR4': ["i've",
  'recently',
  'got',
  'into',
  'to',
  'the',
  'itunes',
  'thing',
  'and',
  'it',
  'became',
  'pretty',
  'apparent',
  'to',
  'me',
  'that',
  'my',
  'computer',
  'speakers',
  "aren't",
  'going',
  'to',
  'cut',
  'it',
  'i',
  'spent',
  'some',
  'time',
  'looking',
  'around',
  'for',
  'a',
  'set',
  'of',
  'reasonably',
  'priced',
  'set',
  'and',
  'when',
  'i',
  'saw',
  'the',
  'price',
  'here',
  'on',
  'amazon',
  'com',
  'i',
  'saw',
  'this',
  'set',
  'go',
  'for',
  's',
  'in',
  'most',
  'places',
  'i',
  'decided',
  'to',
  'buy',
  'it',
  'anyway',
  'wow',
  'really',
  'fantastic',
  'sound',
  "it's",
  'hard',
  'to',
  'believe',
  'such',
  'great',
  'sound',
  'comes',
  'from',
  'this',
  'small',
  'compact',
  'set',
  'plan',
  'on',
  'having',
  'the',
  'whole',
  'thing',
  'on',
  'your',
  'desk',
  'though',
  'in',
  'order',
  'to',
  'get',
  'easy',
  'access',
  'to',
  'the',
  'co

The above operation results in a dictionary with PIDs as keys and, for each PID, the list of lowercase tokens from its review text.
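
To see what the token pattern keeps and drops, here is a minimal sketch on one made-up sentence. It uses `re.findall` as a standard-library stand-in for `RegexpTokenizer` so that it runs without NLTK; the sentence itself is an assumption for illustration.

```python
import re

# Same regular expression as the one passed to RegexpTokenizer above.
TOKEN_PATTERN = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

sample = "I've seen this well-known set go for $50 on Amazon.com"

# Case normalization first, then tokenization.
tokens = re.findall(TOKEN_PATTERN, sample.lower())
print(tokens)
```

Words with one internal hyphen or apostrophe stay together, digits and currency symbols are dropped, and `amazon.com` splits into `amazon` and `com`, which matches the tokens visible in the output above.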

In [5]:
words = list(chain.from_iterable(tokenized_dict.values()))
words

["i've",
 'recently',
 'got',
 'into',
 'to',
 'the',
 'itunes',
 'thing',
 'and',
 'it',
 'became',
 'pretty',
 'apparent',
 'to',
 'me',
 'that',
 'my',
 'computer',
 'speakers',
 "aren't",
 'going',
 'to',
 'cut',
 'it',
 'i',
 'spent',
 'some',
 'time',
 'looking',
 'around',
 'for',
 'a',
 'set',
 'of',
 'reasonably',
 'priced',
 'set',
 'and',
 'when',
 'i',
 'saw',
 'the',
 'price',
 'here',
 'on',
 'amazon',
 'com',
 'i',
 'saw',
 'this',
 'set',
 'go',
 'for',
 's',
 'in',
 'most',
 'places',
 'i',
 'decided',
 'to',
 'buy',
 'it',
 'anyway',
 'wow',
 'really',
 'fantastic',
 'sound',
 "it's",
 'hard',
 'to',
 'believe',
 'such',
 'great',
 'sound',
 'comes',
 'from',
 'this',
 'small',
 'compact',
 'set',
 'plan',
 'on',
 'having',
 'the',
 'whole',
 'thing',
 'on',
 'your',
 'desk',
 'though',
 'in',
 'order',
 'to',
 'get',
 'easy',
 'access',
 'to',
 'the',
 'controls',
 'on',
 'the',
 'subwoofer',
 'and',
 'speakers',
 "it's",
 'not',
 'a',
 'big',
 'deal',
 'i',
 'have',

`words` stores all the tokens in the pdf text.

-------------------------------------

<div class="alert alert-block alert-warning">
    
### 4.2. Whatever else <a class="anchor" name="whetev"></a>

<div class="alert alert-block alert-warning">
    
### 4.3. Finding First 200 Meaningful Bigrams <a class="anchor" name="bigrams"></a>

One of the tasks is to find the first 200 meaningful bigrams. These bigrams should also be included in the final vocabulary list. Each of the top 200 bigrams should be joined by `'_'`, for example `abandonment_supposedly`.
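
As a rough illustration of how PMI ranks bigrams, the sketch below scores every adjacent pair in a toy token list using the textbook form $\log_2\frac{c(w_1,w_2)\,N}{c(w_1)\,c(w_2)}$. The helper name `pmi_scores` and the toy sentence are assumptions made for this example, and NLTK's own normalisation may differ in detail.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Score each adjacent token pair by pointwise mutual information."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {
        (w1, w2): math.log2(count * n / (word_counts[w1] * word_counts[w2]))
        for (w1, w2), count in bigram_counts.items()
    }

# Toy corpus (hypothetical): 'agave nectar' occurs once, 'the tea' twice.
tokens = "agave nectar is sweet and the tea is sweet and the tea is hot".split()
scores = pmi_scores(tokens)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

A pair whose two words each occur only once gets the highest possible score, which is why a PMI ranking without a frequency filter tends to surface rare words and misspellings.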

In [6]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(words)
top200_bigrams = finder.nbest(bigram_measures.pmi, 200)
top200_bigrams

[('abi', 'ean'),
 ('accessibility', 'chaining'),
 ('accident-prone', 'cowardly'),
 ('acrylic', 'pane'),
 ('action-filled', 'retelling'),
 ('aestheticised', 'transgression'),
 ('agave', 'nectar'),
 ('ahora', 'hasta'),
 ('all-natural', 'sugars'),
 ('all-night', 'ayahuasca'),
 ('alows', 'youto'),
 ('amillion', 'yeasrs'),
 ('amps', 'loudspeakers'),
 ('amyl', 'nitrate'),
 ('anderson', 'magnolia'),
 ('angeles', 'dodgers'),
 ('anita', 'flores'),
 ('ann', 'sothern'),
 ('antichrist', 'hugh'),
 ('apologizes', 'thusly'),
 ('appropriated', 'translations'),
 ('ar', 'wedge'),
 ('arabians', 'tricked'),
 ('art-house', 'stylings'),
 ('arts-based', 'tourism'),
 ('ashtaavakr', 'geetaa'),
 ('asphalt', 'jungle'),
 ('asteroids', 'slamming'),
 ('atvpdkikx', 'der'),
 ('audible', 'crackling'),
 ('audition', 'crouched'),
 ('automobile', 'accidents'),
 ('avermedia', 'broadcaster'),
 ('ayahuasca', 'ceremony'),
 ('bande', 'dessinee'),
 ('barred', 'hardcore'),
 ('base-plate', 'pivots'),
 ('bd', 'aacs'),
 ('belasari

Having found the top 200 meaningful bigrams, we need to retokenize the reviews so that each of these bigrams is kept as a single token.
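
The effect of this retokenization can be sketched in plain Python. `merge_bigrams` below is a hypothetical helper written for illustration; it is not NLTK's implementation, only a greedy left-to-right merge that behaves the same way for two-word expressions.

```python
def merge_bigrams(tokens, bigrams, separator="_"):
    """Join every adjacent pair found in `bigrams` into one token."""
    bigram_set = set(bigrams)
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigram_set:
            merged.append(separator.join(pair))
            i += 2  # skip both words of the merged bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

merged = merge_bigrams(
    ["pure", "agave", "nectar", "tastes", "great"],
    [("agave", "nectar")],
)
print(merged)
```

Because the merged token contains `_`, later steps can tell bigrams apart from unigrams with a simple `'_' in token` check.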

In [7]:
# concatenate these bigrams by '_'
bigrams = []
for i in top200_bigrams:
    bigrams.append(i[0] + '_' + i[1])
bigrams
# Keep each top bigram as a single token by using MWETokenizer
mwe_tokenizer = MWETokenizer(top200_bigrams)
bigram_dict= {}
for k, v in tokenized_dict.items():
    bigram_dict[k] = mwe_tokenizer.tokenize(v)
bigram_dict

{'B00006IRR4': ["i've",
  'recently',
  'got',
  'into',
  'to',
  'the',
  'itunes',
  'thing',
  'and',
  'it',
  'became',
  'pretty',
  'apparent',
  'to',
  'me',
  'that',
  'my',
  'computer',
  'speakers',
  "aren't",
  'going',
  'to',
  'cut',
  'it',
  'i',
  'spent',
  'some',
  'time',
  'looking',
  'around',
  'for',
  'a',
  'set',
  'of',
  'reasonably',
  'priced',
  'set',
  'and',
  'when',
  'i',
  'saw',
  'the',
  'price',
  'here',
  'on',
  'amazon',
  'com',
  'i',
  'saw',
  'this',
  'set',
  'go',
  'for',
  's',
  'in',
  'most',
  'places',
  'i',
  'decided',
  'to',
  'buy',
  'it',
  'anyway',
  'wow',
  'really',
  'fantastic',
  'sound',
  "it's",
  'hard',
  'to',
  'believe',
  'such',
  'great',
  'sound',
  'comes',
  'from',
  'this',
  'small',
  'compact',
  'set',
  'plan',
  'on',
  'having',
  'the',
  'whole',
  'thing',
  'on',
  'your',
  'desk',
  'though',
  'in',
  'order',
  'to',
  'get',
  'easy',
  'access',
  'to',
  'the',
  'co

In [8]:
# Stop words list
with open('stopwords_en.txt') as f:
    stop_words = f.read().splitlines()
    stop_words = set(stop_words)  # a set gives fast membership tests
# Remove tokens that are stop words or shorter than 3 characters
for k,v in bigram_dict.items():
    bigram_dict[k] = [token for token in v if len(token) >= 3 and token not in stop_words]
    
# Remove duplicated tokens
uniq_dict = {}
for k,v in bigram_dict.items():
    uniq_dict[k] = list(set(v))
uniq_words = list(chain.from_iterable(uniq_dict.values()))
uniq_words

['recently',
 "wife's",
 'reason',
 'store',
 'worth',
 'track',
 'soo',
 'great',
 'true',
 'home',
 'labs',
 'apple',
 'muddy',
 'excellent',
 'hear',
 'amazing',
 'turn',
 'earlier',
 'sad',
 'purest',
 'item',
 'bass',
 'apparent',
 'cars',
 'expected',
 'desire',
 'itunes',
 'timbre',
 'decision-maker',
 'nice',
 'crisper',
 'economical',
 'recomend',
 'wow',
 'places',
 'satisfy',
 'durable',
 'feedback',
 'sony',
 'deal',
 'products',
 'listed',
 'amazed',
 'hot',
 'high',
 'decided',
 'owned',
 'days',
 'years',
 'proud',
 'distinct',
 'earth',
 'dissapointed',
 'incredibly',
 'fired',
 'visuals',
 'guitar',
 'listen',
 'person',
 'crank',
 'research',
 'digital',
 'speaker',
 'version',
 'cheap',
 'pulled',
 'bank',
 'entrusted',
 'hurriedly',
 'love',
 'creatures',
 'speakers',
 'big',
 'exact',
 'order',
 'movies',
 'rebates',
 'extremely',
 'faster',
 'trendy',
 'sticks',
 'positive',
 'exceptional',
 'fantastic',
 'designs',
 'roughly',
 'alive',
 'hard',
 'care',
 'good',

Now we have the deduplicated token list, so we can get token frequency.
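
The reason for deduplicating first is that the count then measures how many reviews contain a token (document frequency) and not how often it occurs overall. A minimal sketch with three made-up reviews, using `collections.Counter` in place of `FreqDist`:

```python
from collections import Counter
from itertools import chain

# Hypothetical tokenized reviews, keyed by made-up PIDs.
reviews = {
    "p1": ["great", "sound", "great"],
    "p2": ["great", "price"],
    "p3": ["sound", "price", "price"],
}

# Count each token once per review (document frequency).
doc_freq = Counter(chain.from_iterable(set(tokens) for tokens in reviews.values()))

# Count every occurrence (term frequency), for comparison.
term_freq = Counter(chain.from_iterable(reviews.values()))

print(doc_freq["great"], term_freq["great"])
```

With 100 PIDs, a document frequency can never exceed 100, which is consistent with the largest counts shown in the output below.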

In [9]:
# Frequency of each token
fd = FreqDist(uniq_words)
fd

FreqDist({'great': 98, 'good': 97, 'time': 91, 'make': 84, 'recommend': 80, 'love': 77, 'find': 73, 'work': 72, 'made': 70, 'lot': 68, ...})

In [10]:
# Collect rare tokens and context-dependent stopwords
context_dependent = []
rare_tokens = []
for w, f in fd.items():
    if f > math.ceil(len(pids)/2) and w not in bigrams:
        context_dependent.append(w)
    if f < 10 and w not in bigrams:
        rare_tokens.append(w)

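The lists `context_dependent` and `rare_tokens` now hold the context-dependent words (appearing in more than half of the reviews) and the rare tokens (appearing in fewer than 10 reviews). Bigrams are exempt from both lists.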
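
The two thresholds can be written as one small function. This is a sketch under assumptions: the function name, the `protected` parameter and the sample frequencies are invented for illustration, while the cut-offs (more than `ceil(n_docs / 2)`, fewer than 10) follow the cell above.

```python
import math

def split_by_document_frequency(doc_freq, n_docs, protected=(), rare_below=10):
    """Return (context_dependent, rare) word sets, skipping protected words."""
    protected = set(protected)
    upper = math.ceil(n_docs / 2)
    context_dependent = {
        w for w, f in doc_freq.items() if f > upper and w not in protected
    }
    rare = {
        w for w, f in doc_freq.items() if f < rare_below and w not in protected
    }
    return context_dependent, rare

# Hypothetical document frequencies over 100 reviews.
doc_freq = {"great": 98, "speaker": 30, "timbre": 2, "agave_nectar": 1}
too_common, too_rare = split_by_document_frequency(
    doc_freq, n_docs=100, protected={"agave_nectar"}
)
print(too_common, too_rare)
```

Returning sets keeps the later `word not in ...` checks fast, which can matter when the check runs once per token.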

In [11]:
vocab = []
for word in uniq_words:
    # Remove context dependent words and rare tokens
    if word not in context_dependent and word not in rare_tokens:
        vocab.append(word)
vocab = list(set(vocab))
vocab.sort()

# Stemming for the vocab
stemmer = PorterStemmer()
# vocab = list(set([stemmer.stem(w) for w in vocab]))
stemmed_vocab = []
for w in vocab:
    if '_' not in w: # do not stem bigrams, as stemming would change them
        stemmed_vocab.append(stemmer.stem(w))
    else:
        stemmed_vocab.append(w)
stemmed_vocab = list(set(stemmed_vocab)) # remove duplicated vocab
stemmed_vocab = [w for w in stemmed_vocab if len(w)>= 3 and w not in stop_words] # filter again after stemming
stemmed_vocab.sort()
stemmed_vocab

['abi_ean',
 'abil',
 'absolut',
 'accessibility_chaining',
 'accident-prone_cowardly',
 'acrylic_pane',
 'act',
 'action',
 'action-filled_retelling',
 'actor',
 'actual',
 'adapt',
 'add',
 'addit',
 'adult',
 'advantag',
 'adventur',
 'advertis',
 'advic',
 'aestheticised_transgression',
 'agave_nectar',
 'age',
 'ago',
 'agre',
 'ahead',
 'air',
 'all-natural_sugars',
 'all-night_ayahuasca',
 'alot',
 'alows_youto',
 'amaz',
 'amazon',
 'american',
 'amillion_yeasrs',
 'amount',
 'amps_loudspeakers',
 'amyl_nitrate',
 'anderson_magnolia',
 'angeles_dodgers',
 'angl',
 'anita_flores',
 'ann_sothern',
 'annoy',
 'answer',
 'antichrist_hugh',
 'anymor',
 'apologizes_thusly',
 'appar',
 'appropriated_translations',
 'ar_wedge',
 'arabians_tricked',
 'area',
 'arriv',
 'art',
 'art-house_stylings',
 'arts-based_tourism',
 'ashtaavakr_geetaa',
 'aspect',
 'asphalt_jungle',
 'asteroids_slamming',
 'attach',
 'attempt',
 'attent',
 'attract',
 'atvpdkikx_der',
 'audible_crackling',
 'audie

Stemming is applied to the vocab only at this point, after the bigrams have been identified, because stemming earlier would change the bigrams. The stemmed vocab is then filtered again in case a stemmed word is a stop word or is shorter than 3 characters. At this stage, the vocab is complete and ready to output.
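
The stem, deduplicate and filter sequence can be sketched without NLTK. `toy_stem` below is a deliberately crude suffix stripper used only as a placeholder for `PorterStemmer`; it, `build_vocab` and the sample words are all assumptions for illustration.

```python
def toy_stem(word):
    """Crude placeholder stemmer: strips one common suffix (not Porter)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def build_vocab(words, stem, stop_words, min_len=3):
    """Stem unigrams, keep bigrams as they are, then dedupe, filter and sort."""
    stemmed = {w if "_" in w else stem(w) for w in words}
    return sorted(w for w in stemmed if len(w) >= min_len and w not in stop_words)

vocab = build_vocab(
    ["speakers", "speaker", "agave_nectar", "doing", "played"],
    stem=toy_stem,
    stop_words={"do"},
)
print(vocab)
```

Here `speakers` and `speaker` collapse into one entry, the bigram is left untouched, and `doing` disappears because its stem is both a stop word and too short.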

As we already have the vocab list, we can now check which vocab entries appear in each product review.

In [12]:
vec_dict = {}
for k,v in bigram_dict.items(): # use the tokens before deduplication so that vocab frequencies are recorded
    vec_dict[k] = []
    for w in v:
        if w not in context_dependent and w not in rare_tokens and '_' not in w and len(stemmer.stem(w)) >= 3 and stemmer.stem(w) not in stop_words:
            vec_dict[k].append(stemmer.stem(w))
# dictionary that stores each pid and the vocab in its review
vec_dict

{'B00006IRR4': ['recent',
  'pretti',
  'comput',
  'speaker',
  'cut',
  'spent',
  'price',
  'price',
  'amazon',
  'place',
  'decid',
  'wow',
  'fantast',
  'sound',
  'sound',
  'small',
  'plan',
  'order',
  'speaker',
  'deal',
  'space',
  'purchas',
  'speaker',
  'replac',
  'speaker',
  'qualiti',
  'sound',
  'speaker',
  'incred',
  'cool',
  'sound',
  'touch',
  'volum',
  'control',
  'number',
  'local',
  'price',
  'amazon',
  'price',
  'free',
  'ship',
  'listen',
  'pro',
  'night',
  'fine',
  'soni',
  'person',
  'incred',
  'product',
  'audio',
  'product',
  'comput',
  'speaker',
  'day',
  'cheap',
  'speaker',
  'matter',
  'hear',
  'sad',
  'short',
  'disappoint',
  'speaker',
  'clear',
  'crisp',
  'sound',
  'home',
  'theater',
  'system',
  'smaller',
  'cheaper',
  'speaker',
  'amazon',
  'carri',
  'longer',
  'item',
  'older',
  'comput',
  'incred',
  'stand',
  'home',
  'theater',
  'system',
  'sound',
  'audio',
  'system',
  'incred

In [13]:
vocab_index_dict = {}
for i in enumerate(stemmed_vocab):
    vocab_index_dict[i[1]] = i[0]
vocab_index_dict

{'abi_ean': 0,
 'abil': 1,
 'absolut': 2,
 'accessibility_chaining': 3,
 'accident-prone_cowardly': 4,
 'acrylic_pane': 5,
 'act': 6,
 'action': 7,
 'action-filled_retelling': 8,
 'actor': 9,
 'actual': 10,
 'adapt': 11,
 'add': 12,
 'addit': 13,
 'adult': 14,
 'advantag': 15,
 'adventur': 16,
 'advertis': 17,
 'advic': 18,
 'aestheticised_transgression': 19,
 'agave_nectar': 20,
 'age': 21,
 'ago': 22,
 'agre': 23,
 'ahead': 24,
 'air': 25,
 'all-natural_sugars': 26,
 'all-night_ayahuasca': 27,
 'alot': 28,
 'alows_youto': 29,
 'amaz': 30,
 'amazon': 31,
 'american': 32,
 'amillion_yeasrs': 33,
 'amount': 34,
 'amps_loudspeakers': 35,
 'amyl_nitrate': 36,
 'anderson_magnolia': 37,
 'angeles_dodgers': 38,
 'angl': 39,
 'anita_flores': 40,
 'ann_sothern': 41,
 'annoy': 42,
 'answer': 43,
 'antichrist_hugh': 44,
 'anymor': 45,
 'apologizes_thusly': 46,
 'appar': 47,
 'appropriated_translations': 48,
 'ar_wedge': 49,
 'arabians_tricked': 50,
 'area': 51,
 'arriv': 52,
 'art': 53,
 'art-ho

Now that we have the vocab and the index of each entry, we can map the vocab in each PID's review to its index.
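
The mapping step amounts to a dictionary lookup per token. A minimal sketch with a three-word vocab (the words are taken from the output above, the review is made up):

```python
# Sorted vocab, as produced by the previous step.
vocab = ["abil", "act", "agave_nectar"]

# Word -> index lookup, equivalent to looping over enumerate(vocab).
vocab_index = {word: index for index, word in enumerate(vocab)}

# Hypothetical stemmed review tokens.
review = ["act", "abil", "act"]
indices = [vocab_index[word] for word in review]
print(indices)
```

Repeated tokens yield repeated indices, so the per-review frequencies are still recoverable from the index list.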

In [14]:
vec_dict_final = {}
for k,v in vec_dict.items():
    vec_dict_final[k] = []
    for w in v:
        vec_dict_final[k].append(vocab_index_dict[w])
vec_dict_final

{'B00006IRR4': [733,
  697,
  230,
  815,
  269,
  818,
  700,
  700,
  31,
  676,
  285,
  933,
  422,
  812,
  812,
  804,
  678,
  651,
  815,
  280,
  813,
  711,
  815,
  745,
  815,
  716,
  812,
  815,
  539,
  246,
  812,
  874,
  903,
  244,
  641,
  583,
  700,
  31,
  700,
  469,
  785,
  580,
  702,
  636,
  448,
  810,
  670,
  539,
  705,
  67,
  705,
  230,
  815,
  278,
  179,
  815,
  605,
  506,
  760,
  786,
  320,
  815,
  200,
  263,
  812,
  517,
  860,
  850,
  805,
  180,
  815,
  31,
  160,
  584,
  550,
  646,
  230,
  539,
  820,
  517,
  860,
  850,
  812,
  67,
  850,
  539,
  804,
  471,
  815,
  230,
  725,
  31,
  752,
  435,
  697,
  624,
  747,
  887,
  731,
  24,
  710,
  812,
  401,
  408,
  200,
  812,
  513,
  840,
  312,
  880,
  315,
  512,
  446,
  408,
  246,
  865,
  63,
  608,
  815,
  303,
  393,
  812,
  716,
  354,
  812,
  812,
  850,
  105,
  576,
  636,
  539,
  260,
  180,
  158,
  815,
  158,
  872,
  818,
  815,
  827,
  901,
  676,


This dictionary stores each PID with the vocab indices of its review. We are now ready to use CountVectorizer to get the frequency of each vocab entry.

At this stage, we can output `vocab.txt` and `countVec.txt`.

-------------------------------------

<div class="alert alert-block alert-success">
    
## 5. Writing Output Files <a class="anchor" name="write"></a>

Two files need to be generated:
* Vocabulary list
* Sparse matrix

This is performed in the following sections.

<div class="alert alert-block alert-warning">
    
### 5.1. Vocabulary List <a class="anchor" name="write-vocab"></a>

The vocabulary list should be written to a file, sorted alphabetically, with each word followed by its reference code (index). The sparse matrix in the next file refers to these indices. By using `enumerate`, we can get the words and their indices.

In [15]:
with open ('32189222_vocab.txt', 'w') as f:
    for i in enumerate(stemmed_vocab):
        f.write(i[1]+ ':' + str(i[0]) + '\n')


<div class="alert alert-block alert-warning">
    
### 5.2. Sparse Matrix <a class="anchor" name="write-sparseMat"></a>

To write the sparse matrix, we calculate the frequency of each word for every PID with CountVectorizer, then write one line per PID containing the word indices and their frequencies.
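
For comparison, the same line format can be produced with `collections.Counter`. This is an alternative sketch, not the CountVectorizer approach used in this notebook; `sparse_line` and the sample data are hypothetical.

```python
from collections import Counter

def sparse_line(pid, tokens, vocab_index):
    """Build one 'pid,index:count,...' line from a review's tokens."""
    counts = Counter(vocab_index[t] for t in tokens if t in vocab_index)
    pairs = ",".join(f"{index}:{count}" for index, count in sorted(counts.items()))
    return f"{pid},{pairs}"

# Hypothetical vocab index and stemmed review tokens.
vocab_index = {"amazon": 0, "price": 1, "sound": 2, "speaker": 3}
tokens = ["speaker", "sound", "price", "price", "sound", "speaker", "speaker"]
line = sparse_line("B00006IRR4", tokens, vocab_index)
print(line)
```

One difference to keep in mind: this sketch orders the indices numerically, whereas CountVectorizer orders its feature names as strings.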

In [16]:
with open('32189222_countVec.txt', 'w') as w:
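    # each index is a feature; the explicit token pattern keeps single-digit
    # indices (0-9), which the default pattern (two or more characters) would drop
    vectorizer = CountVectorizer(analyzer = "word", token_pattern = r"(?u)\b\w+\b")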
    for k ,v in vec_dict_final.items():
        v = [str(i) for i in v ]
        data_features = vectorizer.fit_transform([' '.join(v)])
        name_features = vectorizer.get_feature_names()
        w.write(k + ',')
        for word, count in zip(name_features, data_features.toarray()[0]):
            if word != name_features[-1]:
                w.write(word + ':' +str(count)+',')
            else:
                w.write(word + ':' +str(count))
        w.write('\n')
    

-------------------------------------

<div class="alert alert-block alert-success">
    
## 6. Summary <a class="anchor" name="summary"></a>

The task is to create a vocabulary and a sparse matrix from a PDF file. We used pdfminer to read the PDF file and regular expressions to extract the PIDs and reviews. We then followed the steps below to generate `vocab.txt` and the sparse matrix.

[1] Apply `case normalisation` to the tokens

[2] Find the `top 200 meaningful bigrams` by the PMI measure

[3] Remove `stopwords` and tokens with length `less than 3`

[4] Remove `duplicated tokens` within each review, so that each token is counted at most once per review when finding the token frequency

[5] Find the `token frequency`

[6] Remove `rare tokens` and `context_dependent` vocab

[7] Apply `stemming` to unigrams, then remove vocab with length `less than 3` and `stopwords`

[8] Output `vocab.txt`

[9] Find the vocab that appears in each PID's review and the frequency of each vocab entry

[10] Create the `sparse matrix` by using `CountVectorizer`

-------------------------------------

<div class="alert alert-block alert-success">
    
## 7. References <a class="anchor" name="Ref"></a>

[1] Pandas dataframe.drop_duplicates(), https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/, Accessed 27/08/2022.

[2] sklearn.feature_extraction.text.CountVectorizer, https://scikitlearn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer, Accessed 29/08/2022.


