<a href="https://colab.research.google.com/github/sushi15/Online-QA-System/blob/main/wiki_qa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports

In [1]:
# General downloads
!pip install transformers datasets
!pip install wptools
!pip install wikipedia 

import itertools 
import os 
import numpy 
import re 

# HuggingFace Transformers
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, pipeline 
import tensorflow as tf
import spacy 

import nltk 
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger') 

# MediaWiki API 
import wptools 
import wikipedia 

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn.metrics.pairwise import linear_kernel 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.0-py3-none-any.whl (4.4 MB)
[K     |████████████████████████████████| 4.4 MB 5.2 MB/s 
[?25hCollecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
[K     |████████████████████████████████| 362 kB 69.3 MB/s 
[?25hCollecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 61.6 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.7.0-py3-none-any.whl (86 kB)
[K     |████████████████████████████████| 86 kB 6.5 MB/s 
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 57.3 MB/s 
Collecting xxhash
  Downloadin

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [2]:
# Model trained on the SQuAD 2.0 dev set 
model_name = "deepset/roberta-base-squad2"

qa_final = pipeline('question-answering', model = model_name, tokenizer = model_name) 

Downloading:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/473M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/772 [00:00<?, ?B/s]

## Modules

In [37]:
def getKeywords(question): 
    tagged = nltk.pos_tag(nltk.word_tokenize(question)) 
    print(tagged) 

    # The NLTK POS Tagger follows the Penn Treebank Project tag conventions 
    # Only the following kinds of words are extracted from the query as keywords 
    limit = ['FW', 'JJ', 'JJS', 'JJR', 'NN', 'NNS', 'NNP', 'NNPS', 'SYM'] 
    keywords = ' '.join(i[0] for i in tagged if i[1] in limit) 
    print("Keywords: ")
    print(keywords) 
    print("\n" + '='*(20) + "\n")
    return keywords 

In [38]:
def retrieveDocs(keywords): 
    wiki_search = wikipedia.search(keywords) 
    # print("Wiki search results:") 
    # print(wiki_search) 
    # print("\n" + '='*(20) + "\n")
    documents = []
    documentTitles = []
    for i in wiki_search: 
        page = wptools.page(str(i))
        page.get_parse() 
        # print(str(i), page.data['pageid']) 
        try: 
            content = wikipedia.page(pageid=page.data['pageid']) 
        except: 
            continue
        documentTitles.append(str(i))
        res = cleanDoc(content.content)
        documents.append(res) 
    print("Entries considered:") 
    print(documentTitles) 
    print("\n" + '='*(20) + "\n")
    return documents 

In [29]:
def cleanDoc(content):
      headings_to_remove = ['== Further reading ==', '== Further references ==', '=== Citations ===', '== References ==', '== Footnotes ==', 
                            '=== Notes ===', '== Notes ==', '=== Sources ===', '== Sources ==', '== External links', '== See also ==', ]
      headings_to_remove = '|'.join(headings_to_remove) 
      inds = [m.start() for m in re.finditer(headings_to_remove, content)]
      # print(inds) 
      if len(inds) != 0: 
          mini = min(inds) 
          mini = min(mini, len(content)) 
      else: 
          mini = len(content)
      # print(mini)
      return content[:mini]

In [30]:
def splitDocs(question, documents): 
    passages = [question]
    for i in documents: 
        curr_passages = [p for p in i.split('\n') if p and not p.startswith('=')] 
        passages += curr_passages 
    return passages 

In [31]:
# def retrievePassages(passages): 
#     tfidf = TfidfVectorizer().fit_transform(passages) 
#     cosSims = linear_kernel(tfidf[0:1], tfidf).flatten()
#     # print(cosSims) 
#     passageInds = cosSims.argsort()[:-12:-1]
#     print("Indices of most relevant passages: ")
#     print(passageInds[1:]) 
#     print("\n" + '='*(20) + "\n")
#     return passageInds 

# def printRelevantPassages(passages, passageInds): 
#     print("Most relevant passages: ")
#     for i in range(1, len(passageInds)): 
#         print(passages[passageInds[i]]) 
#     print("\n" + '='*(20) + "\n") 

# def getAnswers(passages, passageInds): 
#     possibleAnswers = []
#     for i in range(1, len(passageInds)): 
#         possibleAnswers.append(qa_final(question = passages[0], context = passages[passageInds[i]])) 
#     # print(possibleAnswers) 
#     possibleAnswers = sorted(possibleAnswers, key = lambda i: i['score']) 
#     return possibleAnswers 

In [32]:
# def retrievePassages(question, documents): 
#     passages = {}
#     for i in documents: 
#         curr_passages = [p for p in i.split('\n') if p and not p.startswith('=')] 
#         curr_passages.insert(0, question) 
#         tfidf = TfidfVectorizer().fit_transform(curr_passages) 
#         cosSims = linear_kernel(tfidf[0:1], tfidf).flatten()
#         print(cosSims) 
#         passageInds = cosSims.argsort()[:-12:-1] 
#         print(cosSims) 
#         print(passageInds)
#         for i in range(1, len(passageInds)): 
#             passages[curr_passages[passageInds[i]]] = cosSims[i]
#         # passages.append(curr_passages[passageInds[-1]]) 
#     return passages 

def retrievePassages(question, documents): 
    passages = {}
    for i in documents: 
        curr_passages = [p for p in i.split('\n') if p and not p.startswith('=')] 
        curr_passages.insert(0, question) 
        tfidf = TfidfVectorizer().fit_transform(curr_passages) 
        cosSims = linear_kernel(tfidf[0:1], tfidf).flatten() 
        curr_passages = dict(zip(curr_passages, cosSims)) 
        curr_passages = dict(sorted(curr_passages.items(), key = lambda item: item[1], reverse = True)) 
        p = list(curr_passages.keys()) 
        s = list(curr_passages.values()) 
        i = 1 
        while i < len(p) and i < 10: 
          passages[p[i]] = s[i] 
          i += 1

    # print(passages) 
    passages = dict(sorted(passages.items(), key = lambda item: item[1], reverse = True)) 
    # print(passages.values())
    passages = list(passages.keys())[:10] 
    # print(passages) 
    return passages 

In [33]:
def printRelevantPassages(passages): 
    print("Most relevant passages: ")
    for i in passages: 
        print(i) 
    print("\n" + '='*(20) + "\n") 

In [34]:
def getAnswers(question, passages): 
    possibleAnswers = []
    for i in passages: 
        possibleAnswers.append(qa_final(question = question, context = i)) 
    # print(possibleAnswers) 
    possibleAnswers = sorted(possibleAnswers, key = lambda i: i['score']) 
    return possibleAnswers 

In [35]:
def printAllAnswers(possibleAnswers): 
    print("Possible answers sorted by confidence rating: ")
    for i in range(len(possibleAnswers) - 1, -1, -1): 
        print(str(len(possibleAnswers) - 1 - i + 1) + '.' + possibleAnswers[i]['answer'] + ':' + str(possibleAnswers[i]['score'])) 
    print("\n" + '='*(20) + "\n") 

## System

In [None]:
# question = input("Enter question: ") 

In [18]:
# Old 
# keywords = getKeywords(question) 
# documents = retrieveDocs(keywords) 
# passages = splitDocs(question, documents) 
# passageInds = retrievePassages(passages) 
# printRelevantPassages(passages, passageInds) 
# possibleAnswers = getAnswers(passages, passageInds) 
# printAllAnswers(possibleAnswers) 

In [None]:
# keywords = getKeywords(question) 
# documents = retrieveDocs(keywords) 
# passages = retrievePassages(question, documents) 
# printRelevantPassages(passages) 
# possibleAnswers = getAnswers(question, passages) 
# printAllAnswers(possibleAnswers) 

In [None]:
# print(question) 
# print(possibleAnswers[-1]['answer'])

In [None]:
##### 
# Type either keywords only or the entire question itself 
# Does not yet work for yes/no questions like "Is Australia a Continent?" 
# Model used to derive answers from context will be modified to increase accuracy, and NLG for the answer will also be tried out 
# Example questions that work: What is the capital of Assam?, Who is the Greek goddess of Wisdom?, Where is Addis Ababa? 
# Example questions that don't work: Who played Harley Quinn in the Suicide Squad?, What is a binary search tree? Who is the CEO of Apple? 
#####

In [40]:
#@title Enter a question! { run: "auto", vertical-output: true }
Question = "Who played Jack Sparrow???" #@param {type:"string"} 

# keywords = getKeywords(question) 
# documents = retrieveDocs(keywords) 
# passages = splitDocs(question, documents) 
# passageInds = retrievePassages(passages) 
# printRelevantPassages(passages, passageInds) 
# possibleAnswers = getAnswers(passages, passageInds) 
# printAllAnswers(possibleAnswers) 

keywords = getKeywords(question) 
documents = retrieveDocs(keywords) 
passages = retrievePassages(question, documents) 
printRelevantPassages(passages) 
possibleAnswers = getAnswers(question, passages) 
printAllAnswers(possibleAnswers) 

# print(question) 
print("Answer: ")
print(possibleAnswers[-1]['answer'])

[('Who', 'WP'), ('played', 'VBD'), ('Jack', 'NNP'), ('sparrow', 'NN')]
Keywords: 
Jack sparrow




en.wikipedia.org (parse) Jack Sparrow
en.wikipedia.org (imageinfo) File:Jack Sparrow In Pirates of the ...
Jack Sparrow (en) data
{
  image: <list(1)> {'kind': 'parse-image', 'file': "File:Jack Spar...
  infobox: <dict(14)> name, series, image, caption, first, last, c...
  iwlinks: <list(1)> https://commons.wikimedia.org/wiki/Category:J...
  pageid: 333335
  parsetree: <str(77084)> <root><template><title>Short description...
  requests: <list(2)> parse, imageinfo
  title: Jack Sparrow
  wikibase: Q202857
  wikidata_url: https://www.wikidata.org/wiki/Q202857
  wikitext: <str(64583)> {{Short description|Protagonist of the Pi...
}
en.wikipedia.org (parse) List of Pirates of the Caribbean characters
List of Pirates of the Caribbean characters (en) data
{
  infobox: <dict(10)> title, showflag, s, t, p, l, w, myr, j, y
  pageid: 5950581
  parsetree: <str(111800)> <root><template><title>Short descriptio...
  requests: <list(1)> parse
  title: List of Pirates of the Caribbean characters
  wiki

Entries considered:
['Jack Sparrow', 'List of Pirates of the Caribbean characters', 'Jack Sparrow (song)', 'Pirates of the Caribbean: Jack Sparrow', 'Pirates of the Caribbean: The Legend of Jack Sparrow', 'Jack Ward', 'Pirates of the Caribbean (attraction)', 'Pirates of the Caribbean: The Curse of the Black Pearl', 'Black Pearl', 'Davy Jones (character)']


Most relevant passages: 
Kevin R. McNally as Joshamee Gibbs: Jack Sparrow's loyal first mate. He was once a sailor for the Royal Navy, serving under Lieutenant Norrington aboard HMS Dauntless, and is the one who tells Will about the mutiny against Jack Sparrow as well as the pirate's marooning and legendary escape.
In the book series about Jack Sparrow's earlier adventures, Davy Jones shows interest in the Sword of Cortes, also sought by Jack. He is a minor character, but appears in the seventh book as Jack and his crew encounter the Flying Dutchman.
Davy Jones was released as a PEZ dispenser, along with Jack Sparrow and Will Turner