### Abstract¶
#### This notebook pertains to a representative NLP pipeline which helps in extracting information from an unstructured log file.

### Terminology

* Corpus: A collection of documents
* Document: A log file
* Word: A single word in an unprocessed document
* Token: A single word in a processed document
* Processing (word -> token)
    * removal of punctuations, stopwords, and other user-defined characters
    * lemmatization (grouping together the inflected forms of a word so they can be analysed as a single item)
* Tokens are ready for vectorization and processing by different ML algorithms

### What are the key steps in an NLP workflow?

1. load the corpus
2. remove capitals
3. remove punctuations
4. check frequency distribution
5. removing stopwords and user-defined patterns (regex)
6. stemming
7. lemmatization
8. bag-of-words/tokenization
9. masking/n-grams
10. term frequency-inverse document frequency (TF-IDF)

### Key Features

1. meta data extraction 
2. word frequency analysis
3. sentiment analysis
4. document summarization
5. anomaly detection

In [1]:
!pip3 install PyPDF2

[33mDEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621[0m
You should consider upgrading via the '/usr/local/opt/python@3.9/bin/python3.9 -m pip install --upgrade pip' command.[0m


In [2]:
import PyPDF2
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

In [3]:
filename = './manibook'
open_filename = open(filename, 'rb')

ind_manifesto = PyPDF2.PdfFileReader(open_filename)

### Extraction of the meta information from the pdf document

In [4]:
ind_manifesto.getDocumentInfo()

{'/Title': 'Microsoft Word - MANIBOOK  A4 L.docx',
 '/Producer': 'macOS Version 10.14.3 (Build 18D109) Quartz PDFContext',
 '/Creator': 'Word',
 '/CreationDate': "D:20191025172054Z00'00'",
 '/ModDate': "D:20191025172054Z00'00'",
 '/Keywords': '',
 '/AAPL:Keywords': []}

In [5]:
total_pages = ind_manifesto.numPages
total_pages

28

In [6]:
#!pip3 install textract

In [7]:
import textract   

In [8]:
count = 0
text  = ''

# Lets loop through, to read each page from the pdf file
while(count < total_pages):
    # Get the specified number of pages in the document
    mani_page  = ind_manifesto.getPage(count)
    # Process the next page
    count += 1
    # Extract the text from the page
    text += mani_page.extractText()

In [9]:
if text != '':
    text = text
    
else:
    textract.process(open_filename, method='tesseract', encoding='utf-8', langauge='eng' )    

In [10]:
#!pip3 install autocorrect

In [11]:
from autocorrect import Speller
from nltk.tokenize import word_tokenize


def to_lower(text):

    """
    Converting text to lower case as in, converting "Hello" to  "hello" or "HELLO" to "hello".
    """
    
    # Specll check the words
    spell  = Speller(lang='en')
    
    texts = spell(text)
    
    return ' '.join([w.lower() for w in word_tokenize(text)])

lower_case = to_lower(text)
# print(lower_case)

In [12]:
import nltk
import re
import string
from nltk.corpus import stopwords, brown
from nltk.tokenize import word_tokenize, sent_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from autocorrect import spell

In [13]:
def clean_text(lower_case):
    # split text phrases into words
    words  = nltk.word_tokenize(lower_case)
    
    
    # Create a list of all the punctuations we wish to remove
    punctuations = ['.', ',', '/', '!', '?', ';', ':', '(',')', '[',']', '-', '_', '%']
    
    # Remove all the special characters
    punctuations = re.sub(r'\W', ' ', str(lower_case))
    
    # Initialize the stopwords variable, which is a list of words ('and', 'the', 'i', 'yourself', 'is') that do not hold much values as key words
    stop_words  = stopwords.words('english')
    
    # Getting rid of all the words that contain numbers in them
    w_num = re.sub('\w*\d\w*', '', lower_case).strip()
    
    # remove all single characters
    lower_case = re.sub(r'\s+[a-zA-Z]\s+', ' ', lower_case)
    
    # Substituting multiple spaces with single space
    lower_case = re.sub(r'\s+', ' ', lower_case, flags=re.I)
    
    # Removing prefixed 'b'
    lower_case = re.sub(r'^b\s+', '', lower_case)
    
    # Removing non-english characters
    lower_case = re.sub(r'^b\s+', '', lower_case)
    
    # Return keywords which are not in stop words 
    keywords = [word for word in words if not word in stop_words  and word in punctuations and  word in w_num]
    
    return keywords

In [14]:
# Lemmatize the words
wordnet_lemmatizer = WordNetLemmatizer()

lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in clean_text(lower_case)]

# lets print out the output from our function above and see how the data looks like
clean_data = ' '.join(lemmatized_word)
print(clean_data)

table content introduction b improving sanitation strategy government economy infrastructure industry social service summ ary introduction nearly year hero heroine sacrificed many tortured imprisoned fight gain independence self socioeconomic development time ask promise freedom delivered need look closely constructively good namibia good need done meet expectation blame outside world country trouble sma country large resource resource world need african country fewer resource bigger population outperforming u economically something must wrong strategy decision manifesto set need install good governance identifies priority return benefit countryõs rich people struggle intended simply replace rich white people rich black people provide opportunity people participate building enjoying countryõs rich government living peopleõs expectation must due either people running government system government course namibia plagu ed common weakness emerging nation ð self elite whose initial intention

In [15]:
import pandas as pd
df = pd.DataFrame([clean_data])
df.columns = ['script']
df.index = ['Itula']
df

Unnamed: 0,script
Itula,table content introduction b improving sanitat...


In [16]:
#  Counting the occurrences of tokens and building a sparse matrix of documents x tokens.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

corpus = df.script
vect = CountVectorizer(stop_words='english')

# Transforms the data into a bag of words
data_vect = vect.fit_transform(corpus)


In [17]:
feature_names = vect.get_feature_names()
data_vect_feat = pd.DataFrame(data_vect.toarray(), columns=feature_names)
data_vect_feat.index = df.index
data_vect_feat

Unnamed: 0,abandoning,ability,able,abridge,abroad,abuse,academia,academic,accelerate,accelerating,...,youth,youthfulness,zambezi,òjobs,òlift,ònatural,òthe,òunspoiltó,ôall,õs
Itula,1,1,2,1,1,1,1,2,2,1,...,23,1,1,1,1,1,1,1,1,1


In [18]:
data = data_vect_feat.transpose()
data

Unnamed: 0,Itula
abandoning,1
ability,1
able,2
abridge,1
abroad,1
...,...
ònatural,1
òthe,1
òunspoiltó,1
ôall,1


In [19]:
import matplotlib.pyplot as plt
import seaborn as sn

# Find the top 1000 words written in the manifesto
top_dict = {}
for c in data.columns:
    top = data[c].sort_values(ascending=False)
    top_dict[c]= list(zip(top.index, top.values))

for x in list(top_dict)[0:100]:
    print("key {}, value {} ".format(x,  top_dict[x]))

key Itula, value [('need', 39), ('people', 35), ('namibian', 26), ('namibia', 26), ('government', 25), ('youth', 23), ('economy', 21), ('health', 21), ('free', 18), ('citizen', 18), ('provide', 17), ('country', 16), ('education', 16), ('benefit', 16), ('service', 16), ('resource', 15), ('president', 14), ('shall', 14), ('political', 13), ('business', 13), ('new', 12), ('nation', 12), ('water', 12), ('sector', 11), ('road', 11), ('training', 11), ('land', 11), ('state', 11), ('access', 11), ('opportunity', 11), ('way', 10), ('party', 10), ('company', 10), ('good', 10), ('market', 10), ('leader', 10), ('private', 10), ('independent', 9), ('change', 9), ('right', 9), ('make', 9), ('industry', 9), ('policy', 9), ('community', 9), ('future', 9), ('activity', 9), ('mining', 9), ('especially', 8), ('large', 8), ('worker', 8), ('time', 8), ('skill', 8), ('development', 8), ('care', 8), ('public', 7), ('year', 7), ('professional', 7), ('national', 7), ('permit', 7), ('infrastructure', 7), ('tra

In [20]:
# Look at the most common top words --> add them to the stop word list
from collections import Counter

# Let's first pull out the top 100 words for each comedian
words = []
for president in data:
    top = [word for (word, count) in top_dict[president]]
    for t in top:
        words.append(t)

print(words[:10])

['need', 'people', 'namibian', 'namibia', 'government', 'youth', 'economy', 'health', 'free', 'citizen']


### Sentiment Analysis

* Polarity: a float value which ranges from [-1.0 to 1.0] where 0 indicates neutral, +1 indicates most positive statement and -1 rindicates most negative statement.

* Subjectivity: a float value which ranges from [0.0 to 1.0] where 0.0 is most objective while 1.0 is most subjective. Subjective sentence expresses some personal opinios, views, beliefs, emotions, allegations, desires, beliefs, suspicions, and speculations where as objective refers to factual information.

In [21]:
# !pip3 install vaderSentiment

In [22]:
from collections import defaultdict
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

In [23]:
blob = TextBlob(clean_data)
blob.sentiment

Sentiment(polarity=0.07240899626355898, subjectivity=0.456152041874476)

### Summarization of the Document

Text summarization is an important NLP task which is a way of producing a concise and fluent summary of a perticular textin an article, journal, book, comment review, etc while also preseving the key information and overall meaning.

In [24]:
from gensim.summarization.summarizer import summarize

In [25]:
print(summarize(lower_case))

nearly 30 years after our heroes and heroines sacrificed , many tortured , or imprisoned in the fight to gain our independence , self -determination and socioeconomic development , it is time to ask if the promises of freedom are being delivered .
this manifesto sets out what we need to do to install good governance and identifies priorities to return the benefits of our countryõs riches to the people .
it was to provide opportunities to all the people to participate in building and enjoying our countryõs riches .
namibia is plagu ed by a common weakness of emerging nations ð a self -perpetuating elite whose initial intentions however good , however bold have been replaced with protecting their status , their wealth and their privileges .
the millions looted from various government institutions need to be found , for example from gipf , sme bank , ghost salaries , inflated tendering procedures and inefficient state -owned enterprises .
these looted funds could be reinvested into the be

### Anomaly Detection 

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

class Anomaly(object):
    def __init__(self, corpus, max_features):
        self.corpus=corpus
        self.max_features=max_features
        
    def jaccard_similarity(self, list1):
        '''
            jaccard similarity score to measure similarity of two sets: intersection/union
        '''
        vectorizer = TfidfVectorizer(max_features=self.max_features, stop_words={"english"})
        X = vectorizer.fit_transform(corpus)
        feature_list = vectorizer.get_feature_names()
        intersection = len(list(set(list1).intersection(feature_list)))
        union = (len(set(list1)) + len(set(feature_list))) - intersection
        
        return float(intersection) / union
    
    def batch_similarity_vect(self):
        '''
            similarity scores across a corpus of data
        '''
        lst = []
        for idx in np.arange(len(corpus)):
            lst.append(self.jaccard_similarity(corpus[idx].split()))
            
        return lst
    
    def online_similarity_val(self, lst):
        '''
            similarity score for a new data point w.r.t. a corpus of data
        '''
        return self.jaccard_similarity(lst)
            

In [32]:
corpus = [
     'This is the first document',
     'This document is the second document',
     'And this is the third one',
     'Is this the first document',
     'I work at Nutanix'
]

In [34]:
obj=Anomaly(corpus, 5)
print()
print(f'batch similarity vector: {obj.batch_similarity_vect()}')
print()
print(f'online similarity value: {obj.online_similarity_val("Mary has a little lamb")}')


batch similarity vector: [0.6666666666666666, 0.42857142857142855, 0.375, 0.6666666666666666, 0.0]

online similarity value: 0.0
