**TASK - 2**

**Text Pre-processing**

The data is usually unclean and has a lot of inconsistencies. Wrangling the data helps us to achieve a cleaner data that can be used for further analysis. When it comes to text data, there are a lot of inconsistencies, spelling errors, punctuation and grammatical errors, redundancies and words that are incomplete. To rectify it and get the desirable data, text preprocessing is used. Text preprocessing usually involves cleaning the text data to make it more meaningful for the analysis. There are many steps when it comes to text preprocessing and we shall see all of them in this task. 

Firstly, we import all the required packages for the task which are pandas, nltk, re and sklearn

In [119]:
from google.colab import drive
import pandas as pd
import nltk
import copy
from datetime import datetime
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer
import nltk.data
from nltk.corpus import stopwords
from nltk.collocations import BigramAssocMeasures
from nltk import BigramCollocationFinder
from sklearn.feature_extraction.text import CountVectorizer
import re

To access the text data, we are mounting the drive in our colab space. 

In [120]:
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


Once the drive is mounted, we can access the data from the drive. Firstly, we are retrieving the excel file. This file has 19  sheets which has 3 columns of text data in each sheet namely the 'reviewTime', 'reviewText' and the 'summary'.
The review time has dates on when the review has been given. The reviewText has the review and the summary has the summary of the review. 

In [121]:
ls_df, df = pd.read_excel("/content/drive/Shareddrives/FIT5196-s1-2022/A1/Task2/input_data/32263546.xlsx", sheet_name = None), pd.DataFrame()

for i in ls_df.values():
  #making sure that there are no missing values
  t = i.dropna(how='all', axis=1).dropna(how='all').reset_index(drop=True)
  if 'reviewTime' not in t.columns:
    t = t.rename(columns=t.iloc[0])[1:]

  t.reviewTime = pd.to_datetime(t.reviewTime)
  df = pd.concat([df, t])

df.set_index('reviewTime', inplace = True)
df

Unnamed: 0_level_0,reviewText,summary
reviewTime,Unnamed: 1_level_1,Unnamed: 2_level_1
2012-09-18,"I do have other dvd's of World War 1, World Wa...",World at War makes you feel that you're there!!!
2007-01-15,What an outstanding collection of information ...,Wonderful Book on a Wonderful Dog
2012-12-26,O! What a lovely and beautiful film. Rich visu...,Good Deeds Upping The Bar
2013-06-27,I am trying to copy it in a notebook to have ...,great help
2013-01-08,This title was what caught my interest. This s...,War on sacred mountain
...,...,...
2004-02-18,As an investment banker I enjoy reading novels...,Great Financial Novel
2014-05-19,"In my home, I have quite a few earbuds and hea...","Great earbuds with excellent sound, high quali..."
2012-08-10,I bought these and can recommend them for the ...,Very nice for the price
2014-02-19,That was breathtaking and marvelous all rolled...,OMG


The file is loaded with all the sheets merged into one single sheet so that it is easy to start the text preprocessing. 

Text preprocessing has multiple steps. They are
removing

*  Removing the punctuations, symbols, links, etc,.
*  Removing the stop words
*  Converting the strings to the lowercase
*  Tokenization
*  Stemming
*  Lemmatization

All these can be performed in any order depending on what task we have to perform. We shall see how these tasks can make changes to the final data

Here, we have used PorterStemmer() for stemming. Stemming involves standardizing the words to its root forms. 

For example: walking -> walk. 

The root word of walking is walk. It stems the word to its root form which should have a definite meaning. Stemming the words should make sure that the root word after stemming has a definite meaning and is not a meaningless word.

Below, we have concatenated the 'summary' and the 'reviewText' columns into a single column namely 'textcopy'
Before proceeding futher, we have to remove the stopwords. Stopwords are the words that are most commonly used. Examples of stopwords are = 'the', 'an', 'a', 'and', and much more. Removing the stop words are integral to further process the data. Here, we are uploading the stopwords_en.txt file as a csv file to compare the words in this file with the ones from the iput file based on which, these words will be removed from the dataframe 

In [122]:
#creating a copy of the existing list and then concatenating the reviewText column and the summary column into a new column
stemmer = PorterStemmer()
test = df.copy()
test['textcopy'] = test.reviewText +test.summary
#reading the stop_words file as a csv and not as a text file 
stop_words = pd.read_csv('/content/drive/Shareddrives/FIT5196-s1-2022/A1/Task2/stopwords_en.txt').T.reset_index().iloc[0].values
test

Unnamed: 0_level_0,reviewText,summary,textcopy
reviewTime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2012-09-18,"I do have other dvd's of World War 1, World Wa...",World at War makes you feel that you're there!!!,"I do have other dvd's of World War 1, World Wa..."
2007-01-15,What an outstanding collection of information ...,Wonderful Book on a Wonderful Dog,What an outstanding collection of information ...
2012-12-26,O! What a lovely and beautiful film. Rich visu...,Good Deeds Upping The Bar,O! What a lovely and beautiful film. Rich visu...
2013-06-27,I am trying to copy it in a notebook to have ...,great help,I am trying to copy it in a notebook to have ...
2013-01-08,This title was what caught my interest. This s...,War on sacred mountain,This title was what caught my interest. This s...
...,...,...,...
2004-02-18,As an investment banker I enjoy reading novels...,Great Financial Novel,As an investment banker I enjoy reading novels...
2014-05-19,"In my home, I have quite a few earbuds and hea...","Great earbuds with excellent sound, high quali...","In my home, I have quite a few earbuds and hea..."
2012-08-10,I bought these and can recommend them for the ...,Very nice for the price,I bought these and can recommend them for the ...
2014-02-19,That was breathtaking and marvelous all rolled...,OMG,That was breathtaking and marvelous all rolled...


The stopwords file is uploaded so as to compare and remove the stopwords from the dataframe that we have. Once the stopwords are removed, the next process is to tokenize, and then stem the data. After removing the stopwords, we have meaningful words. These words are tokenized. The word-tokenization is made to follow the regular expression for tokens generation. This is done using the RegexpTokenizer() for the regular expression and then tokenize() for the overall tokenization. By this, we have generated tokens for all the words in the temp list. 
I have made a copy of the list so that it does not affect the original dataframe. The next step after tokenization is the stemming which is done using the PorterStemmer(). Once this over, we are removing all the tokens whose length is less than 3. Here I have stemmed the tokens whose length is greater or equal to 3 and the also not in the stopwords.


In [123]:
#regex and tokenizing for every phrase
temp = test.textcopy.apply(lambda x: RegexpTokenizer("[a-zA-Z]+(?:[-'][a-zA-Z]+)?").tokenize(x))

#Stemming and removing stopwords whose length is less than 3
# the variable temp is a list of stemmed tokens
temp = temp.apply(lambda x: [stemmer.stem(w) for w in x if w not in stop_words and len(w) >2])

In the above cell, we have performed tokenization, stemming, and made sure the words length are 3 or above and not i the stopwords. Once this is over, the next step is to make an empty dictionary named 'token_dict' that stores the unique dates and the corresponding text. The key will be the dates and the values will be a list of the texts for that particular date. 

In [124]:
#creating a dictionary for the unique dates and the respective tokens
new = temp.copy()
new.index = new.index.strftime('%Y-%m-%d')
token_dict = {}
for i in range(len(new)):
  token_dict[new.iloc[[i]].index[0]] = list(set(token_dict.get(new.iloc[[i]].index[0], []) + new.iloc[i]))

token_dict

{'2012-09-18': ['sit',
  'laugh-out',
  'author',
  'gear',
  'clever',
  'slade',
  'rush',
  'tend',
  'surpris',
  'veteran',
  'night',
  'sister',
  'fact',
  'cute',
  'take',
  'battl',
  'stiller',
  'channel',
  'necessarili',
  'knew',
  'prepar',
  'garant',
  'sens',
  'water',
  'dian',
  'testament',
  'potti',
  'think',
  'mother',
  'wild',
  'call',
  'camera',
  'entertain',
  'buff',
  'ador',
  'point',
  'peopl',
  'relax',
  'suppos',
  'loyal',
  'tie',
  'origin',
  'cave',
  'clear',
  'school',
  'purpos',
  'interview',
  'ear',
  'the',
  'doe',
  'asid',
  'good',
  'biblic',
  'singer',
  'and',
  'sound',
  'overanalyz',
  'elev',
  'comedi',
  'pretend',
  'episod',
  'blindli',
  'place',
  'mention',
  'spare',
  'incorpor',
  'ask',
  'work',
  'genuin',
  'divorc',
  'she',
  'interest',
  'jenna',
  'begin',
  'hang',
  'plug',
  'give',
  'partner',
  'comput',
  'love',
  'design',
  'obvious',
  'inform',
  'built',
  'titl',
  'live',
  'merit'

By this, we get a dictionary of lists with each list having the token words for that particular date. Based on that we generate the count of each token using a dictionary 'count_dict'. By this, we get the count of every token from the 'token_dict' dictionary

In [125]:
#retrieving the count of each token over the respective dates thus giving each word and its count

count_dict = {}
for i in token_dict.keys():
  for j in token_dict[i]:
    count_dict[j] = count_dict.get(j,0) + 1
count_dict

{'sit': 202,
 'laugh-out': 7,
 'author': 699,
 'gear': 42,
 'clever': 80,
 'slade': 3,
 'rush': 89,
 'tend': 64,
 'surpris': 363,
 'veteran': 25,
 'night': 292,
 'sister': 130,
 'fact': 466,
 'cute': 159,
 'take': 562,
 'battl': 137,
 'stiller': 3,
 'channel': 88,
 'necessarili': 39,
 'knew': 236,
 'prepar': 111,
 'garant': 1,
 'sens': 330,
 'water': 114,
 'dian': 18,
 'testament': 23,
 'potti': 5,
 'think': 303,
 'mother': 242,
 'wild': 87,
 'call': 360,
 'camera': 340,
 'entertain': 394,
 'buff': 22,
 'ador': 77,
 'point': 539,
 'peopl': 793,
 'relax': 43,
 'suppos': 186,
 'loyal': 33,
 'tie': 93,
 'origin': 432,
 'cave': 17,
 'clear': 292,
 'school': 203,
 'purpos': 106,
 'interview': 88,
 'ear': 115,
 'the': 2175,
 'doe': 95,
 'asid': 21,
 'good': 1646,
 'biblic': 32,
 'singer': 32,
 'and': 759,
 'sound': 489,
 'overanalyz': 1,
 'elev': 17,
 'comedi': 173,
 'pretend': 33,
 'episod': 184,
 'blindli': 2,
 'place': 491,
 'mention': 241,
 'spare': 51,
 'incorpor': 35,
 'ask': 107,
 'wo

Having the count of all the tokens, the next step is to filter for rare tokens where the count is lesser than 10 with unique dates

In [126]:
#checking for the count that is greater than 10
filtered_count = copy.deepcopy(count_dict)
for i in list(filtered_count.keys()):
  if filtered_count[i] < 10:
    del filtered_count[i]

print(len(count_dict), len(filtered_count))

35384 5022


By this, the actual number of tokens which are 33,125 in count has been filtered and reduced to 4979 tokens. By far, we have completed tokenization, stemming, removing stop words, counting the tokens and filtering them. The next step is to generate the bigrams for those tokens. The bigrams are generated in sets where we have two words that are likely to go together giving a definite sense of meaning. First 200 meaningful bigrams are being generated as a part of this task which is shown below.

In [127]:
#generating bigrams 
all_words = sum(test.textcopy.apply(lambda x: RegexpTokenizer("[a-zA-Z]+(?:[-'][a-zA-Z]+)?").tokenize(x)).values, [])
all_words = [i.lower() for i in all_words]
bigram_measures = BigramAssocMeasures()
bigrams = BigramCollocationFinder.from_words(all_words)
#bigrams.apply_freq_filter(20)
bigrams.apply_word_filter(lambda w: len(w)<3)
top_200_bigrams = bigrams.nbest(bigram_measures.pmi, 200)

In [128]:
scored = bigrams.score_ngrams(BigramAssocMeasures().pmi)
scored

[(('a-heart', 'of-gold'), 20.361147043051652),
 (('a-spiritual', 'sounding-name'), 20.361147043051652),
 (('abazurowa', 'granice'), 20.361147043051652),
 (('abra', 'caboring'), 20.361147043051652),
 (('abscesses', 'lacerations'), 20.361147043051652),
 (('abwehr', 'signalman'), 20.361147043051652),
 (('academically', 'non-impressive'), 20.361147043051652),
 (('accredited', 'cadaver'), 20.361147043051652),
 (('achei', 'bom'), 20.361147043051652),
 (('achieva', 'shimian'), 20.361147043051652),
 (('acknowledgments', 'xoxoxomg'), 20.361147043051652),
 (('acquiredsony', 'nex-f'), 20.361147043051652),
 (('actionis', 'countered'), 20.361147043051652),
 (('actiontec', 'wirless'), 20.361147043051652),
 (('acuratetly', 'prortrayed'), 20.361147043051652),
 (('adaptationdavid', 'cunliffe'), 20.361147043051652),
 (('adm', 'cartwright'), 20.361147043051652),
 (('admissions', 'damagesthe'), 20.361147043051652),
 (('adquirir', 'conocimiento'), 20.361147043051652),
 (('adrianople', 'hadrians'), 20.36114

The PMI score is the probability or likelihood of both the words of the bigram occuring together. The biagrams and the corresponding PMI score are generated in the form of a set as shown above. Now that the bigrams are generated, the next step is to combine the unigrams and bigrams together, sort the words alphabetically  and make it into a text file as vocab_text.

In [129]:
#creating a list for the bigram words
bigram_list = []
for i in top_200_bigrams:
  bigram_list.append(str(i[0] + '_' + i[1]))

#creating a list for the unigram words
unigram_list = list(count_dict.keys())
unigram_list =[i for i in unigram_list if len(i)>2]


#joining both the unigrams and bigrams and sorting them alphabetically
bigram_unigram_joined = bigram_list + unigram_list
bigram_unigram_joined.sort()






In [130]:

save_location = "/content/drive/MyDrive/dw_assignment1_output/32263546_vocab.txt"
with open(save_location, 'a+') as f:
  for i in range(len(bigram_unigram_joined)):
    f.write(bigram_unigram_joined[i] + ":" + str(i) + "\n")
  f.close()

The next step is to make the sparse representation of the token counts using the CountVectorizer(). For this, the bigram_unigram_joined is passed to fit in the CountVectorizer() as a variable cvec and a new vocabulary is generated based on this using the 'vocabulary_'

In [131]:
cvec = CountVectorizer().fit(bigram_unigram_joined)
new_vocab = cvec.vocabulary_
new_vocab


{'la': 14989,
 'poppin': 20678,
 'bel': 2452,
 'chang': 4553,
 'class': 5023,
 'darn': 6588,
 'divorc': 7586,
 'grad': 11230,
 'half': 11699,
 'heart': 12083,
 'heart_of': 12084,
 'gold': 11085,
 'kind': 14688,
 'lifetim': 15543,
 'lik': 15563,
 'list': 15670,
 'maz': 16718,
 'plenti': 20483,
 'seri': 23968,
 'spiritu': 25403,
 'spiritual_sounding': 25404,
 'name': 18099,
 'team': 26837,
 'aa': 1,
 'aaa': 2,
 'aacut': 3,
 'aaker': 4,
 'aaliyah': 5,
 'aam': 6,
 'aao': 7,
 'aappl': 8,
 'aargh': 9,
 'aaron': 10,
 'aaspect': 11,
 'aat': 12,
 'aaveri': 13,
 'aback': 14,
 'abacu': 15,
 'abandon': 17,
 'abattoir': 19,
 'abazurowa': 20,
 'abazurowa_granice': 21,
 'abba': 22,
 'abbatoir': 23,
 'abbey': 24,
 'abbi': 25,
 'abbot': 26,
 'abbott': 27,
 'abbrevi': 28,
 'abby': 29,
 'abc': 30,
 'abdi': 31,
 'abdic': 32,
 'abdomin': 33,
 'abduct': 34,
 'abductor': 35,
 'abe': 36,
 'abel': 37,
 'abenteu': 38,
 'aberr': 39,
 'aberystwyth': 40,
 'abet': 41,
 'abhimanyu': 42,
 'abhor': 43,
 'abhorr': 44,


Then the final step is to show the sparse representation of one day of the review. The format of the date is changed for acquiring the required format and the final dictionary will be written to a text document named '32263546_countVec.txt'.

In [132]:
t = test.textcopy.apply(lambda x: pd.Series([new_vocab[i] for i in x.split() if i in new_vocab.keys()]).value_counts().to_dict())
t.index = t.index.strftime('%Y/%m/%d')
t.to_dict()


  """Entry point for launching an IPython kernel.


{'2012/09/18': {72: 1,
  258: 1,
  445: 2,
  694: 1,
  972: 9,
  1489: 2,
  2401: 1,
  2424: 2,
  2646: 1,
  3832: 4,
  3863: 1,
  5359: 1,
  7587: 1,
  7707: 1,
  8327: 1,
  9006: 1,
  9397: 2,
  9576: 1,
  9832: 1,
  10114: 1,
  10356: 1,
  10840: 1,
  10950: 1,
  11117: 3,
  11977: 2,
  12238: 5,
  12416: 1,
  12724: 1,
  12845: 1,
  12940: 1,
  13120: 6,
  13735: 13,
  13802: 5,
  13854: 1,
  14367: 1,
  14808: 1,
  15230: 1,
  15322: 1,
  15700: 1,
  16661: 2,
  17744: 3,
  17801: 8,
  17868: 1,
  18556: 1,
  18663: 2,
  18940: 4,
  19066: 1,
  19070: 2,
  19099: 2,
  20044: 2,
  20397: 1,
  21470: 1,
  21910: 3,
  22894: 1,
  23481: 1,
  23822: 1,
  23981: 1,
  24179: 4,
  24326: 1,
  27096: 5,
  27108: 13,
  27159: 1,
  27181: 1,
  27265: 2,
  27451: 1,
  27529: 8,
  27733: 1,
  27869: 1,
  28183: 1,
  28851: 1,
  29180: 1,
  29348: 1,
  29572: 1,
  29657: 1,
  29699: 2,
  29882: 2,
  30004: 1,
  30188: 1,
  30646: 2,
  30661: 1},
 '2007/01/15': {3854: 1,
  5353: 1,
  10114: 1,


In [133]:
save_location = "/content/drive/MyDrive/dw_assignment1_output/32263546_countVec.txt"
with open(save_location, 'a+') as f:
  for key, values in t.items():
    line = []
    line.append(key)
    for word,count in values.items():
      line.append(str(word) + ":" + str(count))
    f.write(",".join(line) + "\n")
  f.close()

Finally copying the values for the countVector into a text file - 32263546_countVec.txt.  

Thus, by doing all these tasks, we have preprocessed the text successfully. Now the data can be used for further analysis.

Refences:


*   Tutorial solutions - week 4,5
*   Analytics Vidhya. 2022. Text Preprocessing NLP | Text Preprocessing in NLP with Python codes. [online] Available at: <https://www.analyticsvidhya.com/blog/2021/06/text-preprocessing-in-nlp-with-python-codes/>.

