
# Pre-Processing of Textual Data

This notebook walks through the steps involved in pre-processing textual data. Text needs a lot of pre-processing before machines can interpret it smoothly, and bag-of-words is one of the most popular ways to represent text for machine processing. Here, I've taken a sample corpus (link provided at the end) that contains 10 .txt files. We will convert the corpus into a bag-of-words representation (both count vectors and tf-idf vectors) through a series of steps. For the sake of understanding, I've tried to perform the steps both with and without built-in functions.
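To make the idea of a count-vector bag-of-words concrete before the real corpus is loaded, here is a minimal sketch on two made-up sentences. The sentences and the variable names are assumptions chosen only for illustration; they are not part of the corpus.

```python
from collections import Counter

# Two made-up documents, used only for illustration
docs = ["machines learn from data", "machines mimic human actions"]

# Vocabulary: every distinct word in the toy corpus, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, aligned with the vocabulary
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Each document becomes a row of word counts, and word order inside the document is discarded.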

Some important terms:

**NLTK:** NLTK stands for Natural Language Toolkit; it is a popular Python library used for NLP and text processing.

**Corpus:** A collection of texts

In [1]:
#importing necessary libraries
import nltk
import os

**Loading a corpus (of .txt files) :** <br> **1. File method**

In [2]:
#importing google drive in order to retrieve the corpus
from google.colab import drive 
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [4]:
filenames = os.listdir("/content/gdrive/MyDrive/my_corpus")
filenames

['file1.txt',
 'file2.txt',
 'file3.txt',
 'file4.txt',
 'file5.txt',
 'file6.txt',
 'file7.txt',
 'file8.txt',
 'file10.txt',
 'file9.txt']

In [5]:
content=list()
for name in filenames:
  #the with-block closes the file automatically after reading
  with open("/content/gdrive/MyDrive/my_corpus/"+name,'r') as f:
    content.append(f.read())
print(content)

['Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.', 'The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific goal. A subset of artificial intelligence is machine learning, which refers to the concept that computer programs can automatically learn from and adapt to new data without being assisted by humans. Deep learning techniques enable this automatic learning through the absorption of huge amounts of unstructured data such as text, images, or video.', "When most people hear the term artificial intelligence, the first thing they usually think of is robots. That's because big-budget films and novels weave stories about human-like machines that wreak havoc on Ea
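Note that `os.listdir` returns names in arbitrary order (here `file10.txt` came before `file9.txt`). A minimal sketch of a loader that sorts the names first is given below; the helper name `load_corpus` and the throwaway demo files are assumptions for illustration only.

```python
import os
import tempfile

def load_corpus(folder):
    """Return (names, texts) for every .txt file in folder, sorted by name."""
    names = sorted(n for n in os.listdir(folder) if n.endswith(".txt"))
    texts = []
    for name in names:
        with open(os.path.join(folder, name), "r", encoding="utf-8") as f:
            texts.append(f.read())
    return names, texts

# Demo on a temporary folder with hypothetical file names and contents
with tempfile.TemporaryDirectory() as folder:
    for name, text in [("file2.txt", "second"), ("file10.txt", "tenth"), ("file1.txt", "first")]:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write(text)
    demo_names, demo_texts = load_corpus(folder)

print(demo_names)
print(demo_texts)
```

Sorting is plain string sorting, so `file10.txt` lands between `file1.txt` and `file2.txt`. Returning the names together with the texts keeps each document paired with its label.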

**2. PlaintextCorpusReader**

In [6]:
from nltk.corpus import PlaintextCorpusReader
CorpusContent=PlaintextCorpusReader("/content/gdrive/MyDrive/my_corpus/",'.*')
print(CorpusContent)
print(CorpusContent.fileids()) #fileids() returns the names of the files in the corpus

<PlaintextCorpusReader in '/content/gdrive/MyDrive/my_corpus'>
['file1.txt', 'file10.txt', 'file2.txt', 'file3.txt', 'file4.txt', 'file5.txt', 'file6.txt', 'file7.txt', 'file8.txt', 'file9.txt']


In [9]:
#raw() already returns the text of one file as a single string
Content_final=[CorpusContent.raw(fileids=fileid) for fileid in CorpusContent.fileids()]
Content_final

['Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.',
 "Artificial intelligence also has applications in the financial industry, where it is used to detect and flag activity in banking and finance such as unusual debit card usage and large account deposits—all of which help a bank's fraud department. Applications for AI are also being used to help streamline and make trading easier. This is done by making supply, demand, and pricing of securities easier to estimate.",
 'The ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific goal. A subset of artificial intelligence is machine learning, which refers to the concept that computer programs can automatically learn fr

**Q2: Pre-process the corpus loaded in step 1 (apply Normalization, Tokenization, Stopword Removal, Stemming)**

**NORMALIZATION**

In [12]:
normailzed_content=list()
for i in range(len(Content_final)):
  normailzed_content.append(' '.join([word.lower() for word in Content_final[i].split() if word.isalpha()]))
print(normailzed_content)

['artificial intelligence refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their the term may also be applied to any machine that exhibits traits associated with a human mind such as learning and', 'artificial intelligence also has applications in the financial where it is used to detect and flag activity in banking and finance such as unusual debit card usage and large account of which help a fraud applications for ai are also being used to help streamline and make trading this is done by making and pricing of securities easier to', 'the ideal characteristic of artificial intelligence is its ability to rationalize and take actions that have the best chance of achieving a specific a subset of artificial intelligence is machine which refers to the concept that computer programs can automatically learn from and adapt to new data without being assisted by deep learning techniques enable this automatic learning through the absorp
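One thing to notice in the output above: the `isalpha()` filter drops every token that has punctuation attached, so words such as "actions." or "(AI)" disappear entirely. As an alternative, here is a minimal sketch that keeps the word and removes only the punctuation. The function name `normalize` and the sample sentence are assumptions for illustration, and the pattern only keeps ASCII letters.

```python
import re

def normalize(text):
    """Lowercase the text and keep only runs of ASCII letters."""
    # Assumption: hyphenated words and contractions are split at the punctuation
    words = re.findall(r"[a-z]+", text.lower())
    return " ".join(words)

sample = "The term (AI) covers learning and problem-solving."

# Regex-based normalization keeps 'ai', 'problem' and 'solving'
regex_result = normalize(sample)
print(regex_result)

# The isalpha() approach used in the cell above, for comparison
isalpha_result = " ".join(w.lower() for w in sample.split() if w.isalpha())
print(isalpha_result)
```

Which behaviour is preferable depends on the task; the rest of this notebook continues with the `isalpha()` version.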

**TOKENIZATION**

In [13]:
nltk.download('punkt') #downloading the necessary tokenizer (punkt tokenizer)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [14]:
tokenized_content=list()
for i in range(len(normailzed_content)):
  tokenized_content.append(nltk.word_tokenize(normailzed_content[i]))
print(tokenized_content)

[['artificial', 'intelligence', 'refers', 'to', 'the', 'simulation', 'of', 'human', 'intelligence', 'in', 'machines', 'that', 'are', 'programmed', 'to', 'think', 'like', 'humans', 'and', 'mimic', 'their', 'the', 'term', 'may', 'also', 'be', 'applied', 'to', 'any', 'machine', 'that', 'exhibits', 'traits', 'associated', 'with', 'a', 'human', 'mind', 'such', 'as', 'learning', 'and'], ['artificial', 'intelligence', 'also', 'has', 'applications', 'in', 'the', 'financial', 'where', 'it', 'is', 'used', 'to', 'detect', 'and', 'flag', 'activity', 'in', 'banking', 'and', 'finance', 'such', 'as', 'unusual', 'debit', 'card', 'usage', 'and', 'large', 'account', 'of', 'which', 'help', 'a', 'fraud', 'applications', 'for', 'ai', 'are', 'also', 'being', 'used', 'to', 'help', 'streamline', 'and', 'make', 'trading', 'this', 'is', 'done', 'by', 'making', 'and', 'pricing', 'of', 'securities', 'easier', 'to'], ['the', 'ideal', 'characteristic', 'of', 'artificial', 'intelligence', 'is', 'its', 'ability', 'to

**STOPWORD REMOVAL**

In [15]:
nltk.download('stopwords') #downloading the necessary corpus

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [16]:
#displaying the stopwords of the English language
import nltk.corpus #explicit import instead of a wildcard import

stopwords=nltk.corpus.stopwords.words(fileids='english')
print(stopwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [17]:
#removing stopwords from content
stopword_removed_content=list()
for i in range(len(tokenized_content)):
  stopword_removed_content.append([w for w in tokenized_content[i] if w not in stopwords])
print(stopword_removed_content)

[['artificial', 'intelligence', 'refers', 'simulation', 'human', 'intelligence', 'machines', 'programmed', 'think', 'like', 'humans', 'mimic', 'term', 'may', 'also', 'applied', 'machine', 'exhibits', 'traits', 'associated', 'human', 'mind', 'learning'], ['artificial', 'intelligence', 'also', 'applications', 'financial', 'used', 'detect', 'flag', 'activity', 'banking', 'finance', 'unusual', 'debit', 'card', 'usage', 'large', 'account', 'help', 'fraud', 'applications', 'ai', 'also', 'used', 'help', 'streamline', 'make', 'trading', 'done', 'making', 'pricing', 'securities', 'easier'], ['ideal', 'characteristic', 'artificial', 'intelligence', 'ability', 'rationalize', 'take', 'actions', 'best', 'chance', 'achieving', 'specific', 'subset', 'artificial', 'intelligence', 'machine', 'refers', 'concept', 'computer', 'programs', 'automatically', 'learn', 'adapt', 'new', 'data', 'without', 'assisted', 'deep', 'learning', 'techniques', 'enable', 'automatic', 'learning', 'absorption', 'huge', 'amou
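The filter above checks every token against a list, which scans the list each time. For larger corpora, a set makes each membership check a hash lookup instead. A minimal sketch follows, using a small hand-picked stopword set as a stand-in for NLTK's English list (the set contents and tokens are assumptions for illustration).

```python
# A small hand-picked stopword set, standing in for NLTK's English list
stopword_set = {"to", "the", "of", "in", "that", "are", "and"}

tokens = ["machines", "that", "are", "programmed", "to", "think"]

# Same list comprehension as above, but membership is tested against a set
filtered = [w for w in tokens if w not in stopword_set]
print(filtered)
```

In this notebook the same effect could be had by wrapping the NLTK list in `set(...)` before filtering.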

**STEMMING**

In [18]:
from nltk.stem import PorterStemmer
ps=PorterStemmer()
final=[]
for i in range(len(stopword_removed_content)):
    final.append(' '.join([ps.stem(word) for word in stopword_removed_content[i]]))
print(final)
#pre-processing of text ends here

['artifici intellig refer simul human intellig machin program think like human mimic term may also appli machin exhibit trait associ human mind learn', 'artifici intellig also applic financi use detect flag activ bank financ unusu debit card usag larg account help fraud applic ai also use help streamlin make trade done make price secur easier', 'ideal characterist artifici intellig abil ration take action best chanc achiev specif subset artifici intellig machin refer concept comput program automat learn adapt new data without assist deep learn techniqu enabl automat learn absorpt huge amount unstructur data', 'peopl hear term artifici first thing usual think film novel weav stori machin wreak havoc noth could', 'artifici intellig base principl human intellig defin way machin easili mimic execut simpl even goal artifici intellig includ mimick human cognit research develop field make surprisingli rapid stride mimick activ extent concret believ innov may soon abl develop system exceed cap

**Converting the corpus into Bag-of-Words and tf-idf feature matrix using:**  
**(a) TfidfVectorizer() and CountVectorizer**  


In [24]:
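#using TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
#smooth_idf=False and norm=None give the raw tf * (ln(n/df) + 1) weights
tf=TfidfVectorizer(smooth_idf=False,norm=None)
X=tf.fit_transform(final)
print(X.toarray())
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(tf.get_feature_names_out().tolist())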

[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [3.30258509 0.         3.30258509 ... 0.         2.60943791 0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         2.60943791 0.        ]]
['abil', 'abl', 'absorpt', 'account', 'achiev', 'act', 'action', 'activ', 'adapt', 'ai', 'alexa', 'also', 'amount', 'answer', 'appli', 'applic', 'approach', 'artifici', 'ask', 'assist', 'associ', 'automat', 'bank', 'base', 'basic', 'becom', 'believ', 'benchmark', 'benefit', 'best', 'calcul', 'capac', 'car', 'card', 'carri', 'chanc', 'charact', 'characterist', 'chess', 'cognit', 'complex', 'complic', 'comput', 'concept', 'concret', 'consequ', 'consid', 'continu', 'could', 'data', 'debit', 'deep', 'defin', 'design', 'detect', 'develop', 'differ', 'divi

In [25]:
import pandas as pd
#rows of X follow Content_final, which was built in CorpusContent.fileids() order (not os.listdir order)
tf_bow=pd.DataFrame(data=X.toarray(),columns=tf.get_feature_names_out(),index=CorpusContent.fileids())
print(tf_bow)

                abil       abl   absorpt  ...      wire   without     wreak
file1.txt   0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
file10.txt  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
file2.txt   3.302585  0.000000  3.302585  ...  0.000000  2.609438  0.000000
file3.txt   0.000000  0.000000  0.000000  ...  0.000000  0.000000  3.302585
file4.txt   0.000000  3.302585  0.000000  ...  0.000000  0.000000  0.000000
file5.txt   0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
file6.txt   0.000000  0.000000  0.000000  ...  3.302585  0.000000  0.000000
file7.txt   0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
file8.txt   0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000
file9.txt   0.000000  0.000000  0.000000  ...  0.000000  2.609438  0.000000

[10 rows x 193 columns]
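The two non-zero values in the table, 3.302585 and 2.609438, can be checked by hand. With `smooth_idf=False` and no normalization, scikit-learn computes idf as ln(n/df) + 1 and multiplies it by the raw term count. The sketch below reproduces the numbers for a term that occurs once in a single document (such as `abil`) and a term that occurs once in each of two documents (such as `without`); the helper name `idf` is an assumption for illustration.

```python
import math

n_documents = 10  # size of the corpus in this notebook

def idf(document_frequency, n=n_documents):
    """Unsmoothed idf: ln(n / df) + 1."""
    return math.log(n / document_frequency) + 1

# Term count is 1 in both cases, so tf-idf equals idf
print(round(idf(1), 6))  # term found in 1 of 10 documents -> 3.302585
print(round(idf(2), 6))  # term found in 2 of 10 documents -> 2.609438
```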


In [26]:
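#using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
cv=CountVectorizer()
Y=cv.fit_transform(final)
print(Y.toarray())
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(cv.get_feature_names_out().tolist())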

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [1 0 1 ... 0 1 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 0]]
['abil', 'abl', 'absorpt', 'account', 'achiev', 'act', 'action', 'activ', 'adapt', 'ai', 'alexa', 'also', 'amount', 'answer', 'appli', 'applic', 'approach', 'artifici', 'ask', 'assist', 'associ', 'automat', 'bank', 'base', 'basic', 'becom', 'believ', 'benchmark', 'benefit', 'best', 'calcul', 'capac', 'car', 'card', 'carri', 'chanc', 'charact', 'characterist', 'chess', 'cognit', 'complex', 'complic', 'comput', 'concept', 'concret', 'consequ', 'consid', 'continu', 'could', 'data', 'debit', 'deep', 'defin', 'design', 'detect', 'develop', 'differ', 'divid', 'done', 'dose', 'drug', 'easier', 'easili', 'embodi', 'enabl', 'end', 'even', 'evolv', 'exampl', 'exceed', 'execut', 'exhibit', 'extent', 'extern', 'field', 'film', 'financ', 'financi', 'first', 'flag', 'found', 'fraud', 'function', 'game', 'goal', 'grant', 'handl', 'havoc', 'healthcar', 'hear', 'help', 'hospit', 'huge

In [28]:
import pandas as pd
#column names come from cv (the CountVectorizer); rows follow CorpusContent.fileids() order
cv_bow=pd.DataFrame(data=Y.toarray(),columns=cv.get_feature_names_out(),index=CorpusContent.fileids())
print(cv_bow)

            abil  abl  absorpt  account  ...  win  wire  without  wreak
file1.txt      0    0        0        0  ...    0     0        0      0
file10.txt     0    0        0        1  ...    0     0        0      0
file2.txt      1    0        1        0  ...    0     0        1      0
file3.txt      0    0        0        0  ...    0     0        0      1
file4.txt      0    1        0        0  ...    0     0        0      0
file5.txt      0    0        0        0  ...    0     0        0      0
file6.txt      0    0        0        0  ...    0     1        0      0
file7.txt      0    0        0        0  ...    0     0        0      0
file8.txt      0    0        0        1  ...    1     0        0      0
file9.txt      0    0        0        0  ...    0     0        1      0

[10 rows x 193 columns]


**(b) Without using in-built functions**  

In [29]:
#count vectorization
fileids=CorpusContent.fileids() #same order as the documents in final
count={}
for i in range(len(final)):
    word_count={}
    for word in final[i].split():
        #start at 0 for unseen words, then add one occurrence
        word_count[word]=word_count.get(word,0)+1
    count[fileids[i]]=word_count
print(count)

{'file1.txt': {'artifici': 1, 'intellig': 2, 'refer': 1, 'simul': 1, 'human': 3, 'machin': 2, 'program': 1, 'think': 1, 'like': 1, 'mimic': 1, 'term': 1, 'may': 1, 'also': 1, 'appli': 1, 'exhibit': 1, 'trait': 1, 'associ': 1, 'mind': 1, 'learn': 1}, 'file10.txt': {'artifici': 1, 'intellig': 1, 'also': 2, 'applic': 2, 'financi': 1, 'use': 2, 'detect': 1, 'flag': 1, 'activ': 1, 'bank': 1, 'financ': 1, 'unusu': 1, 'debit': 1, 'card': 1, 'usag': 1, 'larg': 1, 'account': 1, 'help': 2, 'fraud': 1, 'ai': 1, 'streamlin': 1, 'make': 2, 'trade': 1, 'done': 1, 'price': 1, 'secur': 1, 'easier': 1}, 'file2.txt': {'ideal': 1, 'characterist': 1, 'artifici': 2, 'intellig': 2, 'abil': 1, 'ration': 1, 'take': 1, 'action': 1, 'best': 1, 'chanc': 1, 'achiev': 1, 'specif': 1, 'subset': 1, 'machin': 1, 'refer': 1, 'concept': 1, 'comput': 1, 'program': 1, 'automat': 2, 'learn': 3, 'adapt': 1, 'new': 1, 'data': 2, 'without': 1, 'assist': 1, 'deep': 1, 'techniqu': 1, 'enabl': 1, 'absorpt': 1, 'huge': 1, 'amount

In [30]:
cv_bow1=pd.DataFrame(data=count)
cv_bow1.fillna(0,inplace=True)
print(cv_bow1)

          file1.txt  file10.txt  file2.txt  ...  file7.txt  file8.txt  file9.txt
artifici        1.0         1.0        2.0  ...        1.0        1.0        3.0
intellig        2.0         1.0        2.0  ...        1.0        1.0        3.0
refer           1.0         0.0        1.0  ...        0.0        0.0        0.0
simul           1.0         0.0        0.0  ...        0.0        0.0        0.0
human           3.0         0.0        0.0  ...        0.0        0.0        0.0
...             ...         ...        ...  ...        ...        ...        ...
solv            0.0         0.0        0.0  ...        0.0        0.0        1.0
kind            0.0         0.0        0.0  ...        0.0        0.0        1.0
found           0.0         0.0        0.0  ...        0.0        0.0        1.0
car             0.0         0.0        0.0  ...        0.0        0.0        1.0
hospit          0.0         0.0        0.0  ...        0.0        0.0        1.0

[193 rows x 10 columns]
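The manual section above covers count vectors only. A tf-idf matrix can be built without built-in vectorizers in the same spirit. The sketch below uses a toy dictionary shaped like `count` (the document names, words and counts are made up for illustration) and the unsmoothed, unnormalized weighting tf * (ln(n/df) + 1), which is my assumption of what matches the `TfidfVectorizer` settings used earlier.

```python
import math

# Toy per-document counts in the same shape as the `count` dictionary
# (document names and words are made up for illustration)
toy_count = {
    "doc_a": {"machin": 2, "learn": 1},
    "doc_b": {"machin": 1, "human": 3},
}

n_docs = len(toy_count)

# Document frequency: number of documents that contain each word
doc_freq = {}
for word_count in toy_count.values():
    for word in word_count:
        doc_freq[word] = doc_freq.get(word, 0) + 1

# tf-idf = raw count * (ln(n / df) + 1), with no normalization
tfidf = {
    doc: {w: c * (math.log(n_docs / doc_freq[w]) + 1) for w, c in word_count.items()}
    for doc, word_count in toy_count.items()
}
print(tfidf)
```

Passing the resulting dictionary to `pd.DataFrame` and filling missing entries with 0, as done for the counts, would give a term-by-document tf-idf table.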


# **Link to the Corpus Used**

[Click here to see the corpus used for the entire pre-processing](https://drive.google.com/drive/folders/1HzbobbAoPh-irlajzesJvG6Wuy3y5RJT?usp=sharing)