# Objective

#### Write a code to extract the keywords (like Inheritance, encapsulation, multithreading) from the document shared in the link http://bit.ly/epo_keyword_extraction_document, and upload the code in Github and also mention the keywords in order of their weightages in a Google doc or excel sheet

## Import libraries

In [4]:
import textract
import pandas as pd
import numpy as np
import re #To work with regular expressions.
import PyPDF2 #Python library to read, manipulate and extract PDF files.

#### Followed tutorial "https://medium.com/@rqaiserr/how-to-convert-pdfs-into-searchable-key-words-with-python-85aab86c544f" to learn how to extract data from pdf files.

In [5]:
#Initialize file pointer and a PdfFileReader object.
filename ='JavaBasics-notes.pdf' 
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj) #It initializes a PdfFileReader object.
num_pages = pdfReader.numPages #method numPages return total number of pages in a pdf document.

#Gather all the text in document in text form.
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count) #Retrieves a page by number from this PDF file, and returns a pageObject.
    count = count + 1
    text = text + pageObj.extractText() #It extracts the text and returns a unicode string object.
    
#PyPDF2 can not read scanned pdf files/images. And the document provided do contain some images ,
#and might contain keywords. 

if text != "":
    text = text
else:
    text = textract.process('http://bit.ly/epo_keyword_extraction_document', method='tesseract', language='eng')
#Now text file contain text data from pdf file.

In [6]:
text = text.encode('ascii', 'ignore').lower() #it encodes a unicode string to ascii and ignores errors,
#and then lowercase all words.

## Extract words from text

In [7]:
text = str(text) #re can't work on byte object, thus convert byte-->string. 
keywords = re.findall(r'[a-zA-Z]\w+',text)
len(keywords)       

3410

In [8]:
len(set(keywords)) #Remove repetitions of words, and returns unique set of keywords.

937

In [9]:
df = pd.DataFrame(list(set(keywords)), columns = ['keywords'])

## Extraction Approach

<h4>Two famous algorithms that I found for keyword extraction are:</h4>
<li>Rapid Automatic Keyword Extraction(RAKE)</li>
<li>Term frequency – Inverse document frequency (TF-IDF)</li>
<br>
TF-IDF algorithm can work on a large context and can figure out the most relevant words.t is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. <br>
<h4>source</h4><br>
https://nzmattgrant.wordpress.com/2018/01/31/a-comparison-of-rake-and-tf-idf-algorithms-for-finding-keywords-in-text/

### Implementing TF-IDF for extracting words

Typically, the tf-idf weight is composed by two terms: the first computes the normalized <b>Term Frequency (TF)</b>, aka. the number of times a word appears in a document, divided by the total number of words in that document; the second term is the <b>Inverse Document Frequency (IDF)</b>, computed as the logarithm of the number of the documents in the corpus divided by the number of documents where the specific term appears.

<li>TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear much more times in long documents than shorter ones. Thus, the term frequency is often divided by the document length (aka. the total number of terms in the document) as a way of normalization: <br>

<b>TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).</b><br><br><br>

<li>IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important.<b> However it is known that certain terms, such as "is", "of", and "that", may appear a lot of times but have little importance.</b> Thus we need to weigh down the frequent terms while scale up the rare ones, by computing the following:<br> 

<b>IDF(t) = log_e(Total number of documents / Number of documents with term t in it)</b><br>
<br><br>
Hence we can conclude TF-IDF algorithm can even filter out stop words.

<li>The product of the TF and IDF scores of a term is called the TF*IDF weight of that term.The higher the TF*IDF score (weight), the rarer the term and vice versa.</li>

Source(s) :- <br>
<li>https://www.elephate.com/blog/what-is-tf-idf/</li>
<li>http://www.tfidf.com/</li>


In case of single document<br>
<b>idf = (number_of_documents)/(number_of_times_word_appeared)</b>
<br>
This will give uniqueness of word

In [47]:
def key_weight(word,text,total_docs=1):
    list_of_words = re.findall(word,text) #findall return a list containing all the appearances of 'word'.
    number_of_times_word_appeared =len(list_of_words)
    tf = number_of_times_word_appeared/float(len(text)) #this returns term frequency
    idf = np.log((total_docs)/float(number_of_times_word_appeared))
    tf_idf = tf*idf
    return number_of_times_word_appeared,tf,idf ,tf_idf

In [48]:
df['number_of_times_word_appeared'] = df['keywords'].apply(lambda x: key_weight(x,text)[0])
df['tf'] = df['keywords'].apply(lambda x: key_weight(x,text)[1])
df['idf'] = df['keywords'].apply(lambda x: key_weight(x,text)[2])
df['tf_idf'] = df['keywords'].apply(lambda x: key_weight(x,text)[3])

In [55]:
df = df.sort_values('tf_idf', ascending=False)

<b>Words in the document with a high tfidf score occur frequently in the document and provide the most information about that specific document.</b>

## Storing dataframe in excel sheet and csv file

In [57]:
#Store dataframe in excel sheet.
df.to_excel('keywords.xlsx', sheet_name='keywords_weight') #pip3 install openpyxl


'''
To read excel files
pip3 install xlrd
df = pd.read_excel('keywords.xlsx')'''

"\nTo read excel files\npip3 install xlrd\ndf = pd.read_excel('keywords.xlsx')"

In [58]:
#Store dataframe in csv sheet
df.to_csv('keywords.csv')