
# **Problem Statement**
We address the question-answering challenge by applying deep learning algorithms to textual data.
The solution relies on Extractive Question Answering, in which an answer is extracted from a
text given a question.
### 1.2.1 Topic Modelling
This is a theme extraction task on a collection of Data Science documents, which can be done
via Latent Dirichlet Allocation (LDA). The topic model should identify the important themes of a
document and list the top-N constituent words of each theme/topic.
### 1.2.2 Extractive Question Answering
Extractive Question Answering is the task of extracting an answer from a text given a question. The
text would essentially be the group of documents that have the highest concentration of the topic
closest to the asked question.
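
The document selection step described above could look roughly like the sketch below: find the topic that dominates the question, then keep the documents with the highest weight for that topic. The topic distributions are hypothetical numbers standing in for the output of a topic model, and the helper names `closest_topic` and `retrieve` are assumptions, not part of the project code.

```python
def closest_topic(question_topics):
    """Index of the topic with the largest weight in the question."""
    return max(range(len(question_topics)), key=lambda t: question_topics[t])


def retrieve(doc_topics, question_topics, top_k=2):
    """Return indices of the top_k documents richest in the question's topic."""
    topic = closest_topic(question_topics)
    ranked = sorted(range(len(doc_topics)),
                    key=lambda d: doc_topics[d][topic], reverse=True)
    return ranked[:top_k]


# Hypothetical topic distributions (each row sums to 1), e.g. from an LDA model.
doc_topics = [
    [0.80, 0.15, 0.05],
    [0.10, 0.70, 0.20],
    [0.55, 0.05, 0.40],
    [0.05, 0.05, 0.90],
]
question_topics = [0.70, 0.20, 0.10]

selected = retrieve(doc_topics, question_topics, top_k=2)
print(selected)  # [0, 2]
```

The selected documents would then be passed as context to the extractive question answering model.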


## **1.2.2.1 Head-start References**
❖ https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
<br>❖ https://pyldavis.readthedocs.io/en/latest/readme.html
<br>❖ https://huggingface.co/transformers/usage.html#extractive-question-answering
### NOTE - The solution should not be limited to the above references; students are encouraged to read relevant research papers.
### 1.3 Scope of project
<br>A. The topic model should be able to identify/extract important topics.
<br>B. The topic model would be built on the corpus of Data Science documents.
<br>C. The topic model should yield the most relevant and stable topics, as measured by the
perplexity score.
<br>D. Once the relevant documents have been retrieved, the extractive question answering
model would generate the answer for the question.
<br>E. The entire dual-model pipeline would be deployed on AWS/GCP/Azure.
<br>F. The dual-model pipeline must be accessible via a web application (Streamlit) for demo
purposes.
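
Scope item C relies on the perplexity score, which is the exponential of the negative average log-likelihood per token; lower values mean the model is less "surprised" by the text. The sketch below shows only the arithmetic. The per-token probabilities are hypothetical values, and in practice they would come from the fitted topic model evaluated on held-out documents.

```python
import math


def perplexity(token_probs):
    """Perplexity given the probability a model assigned to each token."""
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))


# Hypothetical per-token probabilities from two candidate topic models
# scored on the same held-out text.
model_a = [0.25, 0.25, 0.25, 0.25]
model_b = [0.5, 0.5, 0.5, 0.5]

print(perplexity(model_a))
print(perplexity(model_b))
```

Here `model_b` assigns higher probability to every token, so it gets the lower perplexity and would be preferred under this criterion.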


# **Part-1 - Making DataFrame in CSV Form**

## **Importing Important Libraries**

In [None]:
import pandas as pd
import sys
import re
import os
import csv

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
###################################################################################
## file      information_retrieval_system.py
#  brief     The information_retrieval_system.py is a basic information retrieval system  
#             implemented using Python, NLTK and GenSIM.
###################################################################################

from nltk.tokenize import wordpunct_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from gensim import corpora, models, similarities
from operator import itemgetter
import abc
import re
import numpy as np

In [None]:
import nltk

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:


###################################################################################
## class   IRSystem
#  brief   This class represents the information retrieval system, i.e., the basic methods 
#           used to preprocess and rank documents according to user queries.
###################################################################################

class IRSystem(object):
    
    #################################################################################
    ## brief   Constructor
    #  details This method stores the corpus and the queries supplied by the user.
    #           Subclasses launch the query from their own constructors.
    #################################################################################    
    def __init__(self, corpus, queries):
        self.corpus = corpus
        self.queries = queries

    #################################################################################
    ## brief   preprocess_document
    #  details This method returns the list of stemmed keywords for the given document.
    #  param   doc The document to be preprocessed
    #################################################################################    
    def preprocess_document(self,doc):
        stopset = set(stopwords.words('english'))
        stemmer = PorterStemmer()
        tokens = wordpunct_tokenize(doc) # split text on whitespace and punctuation
        clean = [token.lower() for token in tokens if token.lower() not in stopset and len(token) > 2]
        final = [stemmer.stem(word) for word in clean]
        return final


    #################################################################################
    ## brief   create_dictionary
    #  details This method creates a dictionary based on the taxonomy of keywords for each document.
    #  param   docs The documents to be preprocessed
    #################################################################################

    def create_dictionary(self,docs):
        pdocs = [self.preprocess_document(doc) for doc in docs]
        dictionary = corpora.Dictionary(pdocs)
        dictionary.save('vsm.dict')
        return dictionary,pdocs

    #################################################################################
    ## brief   get_keyword_to_id_mapping
    #  details This method prints the token-to-id mapping of the given dictionary.
    #  param   dictionary The dictionary with the documents keywords.
    #################################################################################    
    def get_keyword_to_id_mapping(self, dictionary):
        print(dictionary.token2id)

    #################################################################################
    ## brief   docs2bows
    #  details This method converts each preprocessed document (a list of words) into the
    #           bag-of-words format, i.e. a list of (token_id, token_count) 2-tuples.
    #  param   corpus Set of documents to be processed.
    #  param   dictionary The dictionary with the documents keywords.
    #  param   pdocs The preprocessed (tokenized and stemmed) documents.
    #################################################################################    
    def docs2bows(self,corpus, dictionary, pdocs):
        vectors = [dictionary.doc2bow(doc) for doc in pdocs]
        corpora.MmCorpus.serialize('vsm_docs.mm', vectors) # Save the corpus in the Matrix Market format
        return vectors



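    #################################################################################
    ## brief   ranking_function
    #  details This method ranks the documents by cosine similarity to the query and
    #           prints them, best match first.
    #  param   corpus Set of documents to be ranked.
    #  param   q The query written in natural language.
    #  param   mode The id of the retrieval model (currently unused).
    #################################################################################
    def ranking_function(self, corpus, q, mode):
        model, dictionary = self.create_documents_view(corpus, mode)
        loaded_corpus = corpora.MmCorpus('vsm_docs.mm')
        # Index the TF-IDF weighted documents so that documents and query share one space
        index = similarities.MatrixSimilarity(model[loaded_corpus], num_features=len(dictionary))
        vq = self.create_query_view(q, dictionary)
        self.query_weight = model[vq]
        sim = index[self.query_weight]
        ranking = sorted(enumerate(sim), key=itemgetter(1), reverse=True)
        for doc, score in ranking:
            print("[ Score = %.3f] %s" % (score, corpus[doc]))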
      
   
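    #################################################################################
    ## brief   create_query_view
    #  details This method preprocesses the query and converts it into the bag-of-words format.
    #  param   query The query written in natural language.
    #  param   dictionary The dictionary with the documents keywords.
    #################################################################################
    def create_query_view(self, query, dictionary):
        pq = self.preprocess_document(query)
        vq = dictionary.doc2bow(pq)
        return vq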
  

    #################################################################################
    ## brief   create_documents_view
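    #  details This method preprocesses the documents written in natural language to build
    #           the documents view (dictionary, bag-of-words corpus and TF-IDF model).
    #  param   corpus Set of documents to be processed.
    #  param   ir_mode The id of the retrieval model (currently unused).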
    #################################################################################  
    def create_documents_view(self,corpus, ir_mode):
        dictionary,pdocs = self.create_dictionary(corpus)
        bow = self.docs2bows(corpus, dictionary,pdocs)     
        loaded_corpus = corpora.MmCorpus('vsm_docs.mm') # Recover the corpus
        model = models.TfidfModel(loaded_corpus) # TF IDF model
        return model, dictionary

    def query_launcher(self, corpus, queries, mode):
        self.ranking_function(corpus, queries, mode)

#class IR_tf(IRSystem):

#  def __init__(self,corpus,queries):
#         IRSystem.__init__(self,corpus,queries)
#         print("\n--------------------------Executing TF information retrieval model--------------------------\n")
#         self.ranking_query=dict()
#         self.query_launcher(corpus,queries,0)


class IR_tf_idf(IRSystem):

    def __init__(self,corpus,queries):
        IRSystem.__init__(self,corpus,queries)
        print("\n--------------------------Executing TF IDF information retrieval model--------------------------\n")
        self.ranking_query=dict()
        self.query_launcher(corpus,queries,1)
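
The retrieval idea behind the classes above, TF-IDF weighting followed by cosine-similarity ranking, can be sketched without gensim. This is a simplified, dependency-free illustration rather than a copy of gensim's implementation: the helper names `tfidf_vectors`, `weight` and `rank` and the three toy documents are assumptions, and the documents are taken as already tokenized.

```python
import math
from collections import Counter


def weight(counts, idf):
    """Weight raw counts by IDF and scale the vector to unit length."""
    vec = {tok: c * idf[tok] for tok, c in counts.items() if tok in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def tfidf_vectors(token_lists):
    """Build unit-length TF-IDF vectors (as dicts) for tokenized documents."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    idf = {tok: math.log2(n_docs / df) for tok, df in doc_freq.items()}
    vectors = [weight(Counter(tokens), idf) for tokens in token_lists]
    return vectors, idf


def rank(token_lists, query_tokens):
    """Return (doc_index, cosine_score) pairs, best match first."""
    vectors, idf = tfidf_vectors(token_lists)
    query = weight(Counter(query_tokens), idf)
    # Both sides are unit vectors, so the dot product is the cosine similarity.
    scores = [sum(w * vec.get(tok, 0.0) for tok, w in query.items())
              for vec in vectors]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


docs = [["data", "science", "statistics"],
        ["cooking", "recipes", "kitchen"],
        ["data", "engineering", "pipelines"]]
ranking = rank(docs, ["data", "science"])
for doc_id, score in ranking:
    print("[ Score = %.3f] %s" % (score, " ".join(docs[doc_id])))
```

The first document shares both query terms and ranks first; the second shares none and scores zero.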


In [None]:
tf_idf = IR_tf_idf(corpus_text,query_text)


--------------------------Executing TF IDF information retrieval model--------------------------



TypeError: ignored

In [None]:
# Import modules needed for this project
!pip install pdfplumber
import pdfplumber


In [None]:
#################################################################################
#  brief   preprocess_userinput
#  details This method reads a PDF file and transforms it into a list holding the
#           text of each page
#  param   infile The path of the PDF file given by the user
#################################################################################  
def preprocess_userinput(infile):
    pgList = []
    with pdfplumber.open(infile) as pdf:
        for page in pdf.pages:
            # extract_text() can return None for pages without a text layer
            text = page.extract_text() or ''
            # Keep each page as one string: the tokenizer expects text, not a list of lines
            pgList.append(text)
    return pgList

#################################################################################
## brief   create_ir_system
#  details This method creates an information retrieval system with the model 
#           chosen by the user
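#  param   irmodel_choice The id of the information retrieval model chosen by the user
#  param   corpus The documents to be searched
#  param   queries The query entered by the user
#################################################################################  
def create_ir_system(irmodel_choice, corpus, queries):
    # Only the TF-IDF model is implemented, so irmodel_choice is currently unused
    return IR_tf_idf(corpus, queries)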



if __name__ == '__main__':
    corpus_input = input("Write a text or enter the corpus path:\n")
    corpus_text = preprocess_userinput(corpus_input)

    query_text = input("Query Enter By User:\n")
    # ir = execute_IRsystem_prompt(corpus_text)
    # rocchio = execute_Rocchio_prompt(query_text)

Write a text or enter the corpus path:
/content/drive/MyDrive/AlmaBetter/Cohort Aravali/Module 8/Q A System Building/Automated Q_A PDFs/Applied Data Science.pdf
Query Enter By User:
What is Data Science


In [None]:
corpus_text

In [None]:
import pickle
# Persist the retrieval object created above (the variable is named tf_idf, not tfidf)
with open('tfidf.pkl', 'wb') as p:
    pickle.dump(tf_idf, p)

In [None]:
query_text

'What is Data Science?'

In [None]:
IR_tf_idf.