
# Internet-QA-LM

## 0. Info

### Paper
* title: Internet-augmented language models through few-shot prompting for open-domain question answering
* author: Angeliki Lazaridou et al.
* url: https://arxiv.org/abs/2203.05115

### Features
* pretrained: facebook/opt-1.3b
* retriever: sentence embedding model (used here in place of TF-IDF)

## 1. Setup

In [None]:
import easydict
import requests
from bs4 import BeautifulSoup
from newspaper import Article
from sentence_splitter import SentenceSplitter, split_text_into_sentences

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM, StoppingCriteriaList, StoppingCriteria

In [None]:
cfg = easydict.EasyDict(
    generator = 'facebook/opt-1.3b',
    encoder = 'sentence-transformers/all-MiniLM-L6-v2',
    device = 'cuda:2'
)

## 2. Search

In [None]:
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

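def search(query, num_results=20, maxpage=10):
    """Scrape Google result pages and return up to `num_results` https URLs."""
    results = []
    # Stop after `maxpage` pages so the loop ends even when no links are found.
    for page in range(maxpage):
        res = requests.get(
            'https://www.google.com/search',
            params={'q': query, 'start': page * 10},  # requests URL-encodes the query
            headers=HEADERS,
            timeout=10,
        )
        soup = BeautifulSoup(res.text, 'html.parser')

        # Result links are anchors that wrap an <h3> title.
        anchors = soup.select('a:has(h3)')
        urls = [a.get('href') for a in anchors]
        # `href` may be missing (None), so check it before calling startswith.
        results += [u for u in urls if u and u.startswith('https')]

        if len(results) >= num_results:
            break

    return results[:num_results]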


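def parse(url, num_sentences=6):
    """Download an article and split its text into paragraphs of `num_sentences` sentences."""
    try:
        article = Article(url)
        article.download()
        article.parse()
    except Exception:
        # Skip pages that cannot be downloaded or parsed.
        return []
    
    sentences = split_text_into_sentences(article.text, language='en')
    
    paragraphs = []
    buffer = []
    for sent in sentences:
        if sent:
            buffer.append(sent)
        if len(buffer) == num_sentences:
            paragraphs.append(' '.join(buffer))
            buffer = []
    # Keep the trailing partial paragraph, but never append an empty string.
    if buffer:
        paragraphs.append(' '.join(buffer))
    
    return paragraphs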

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
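
To illustrate how `parse` groups sentences into fixed-size paragraphs, the sketch below applies the same buffering logic to a toy sentence list, with no download involved. `chunk_sentences` is a hypothetical helper name introduced only for this example; it is not part of the notebook.

```python
def chunk_sentences(sentences, num_sentences=6):
    """Group non-empty sentences into paragraphs of at most `num_sentences` each."""
    paragraphs, buffer = [], []
    for sent in sentences:
        if sent:  # empty strings (blank lines) are skipped
            buffer.append(sent)
        if len(buffer) == num_sentences:
            paragraphs.append(' '.join(buffer))
            buffer = []
    if buffer:  # keep the trailing partial paragraph, if any
        paragraphs.append(' '.join(buffer))
    return paragraphs


toy = ['A.', 'B.', '', 'C.', 'D.', 'E.']
chunks = chunk_sentences(toy, num_sentences=2)
print(chunks)
```

The last paragraph can be shorter than `num_sentences`, so short articles still contribute one candidate paragraph.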

In [None]:
encoder_tokenizer = AutoTokenizer.from_pretrained(cfg.encoder)
encoder = AutoModel.from_pretrained(cfg.encoder)
_ = encoder.eval().requires_grad_(False)

In [None]:
question = "where is the best place for a date in NY"
urls = search(question, num_results=5)

In [None]:
paragraphs = []
for url in urls:
    paragraphs += parse(url, num_sentences=5)

In [None]:
inputs = encoder_tokenizer([question] + paragraphs, padding=True, truncation=True, max_length=512, return_tensors='pt')
model_output = encoder(**inputs)
embeds = mean_pooling(model_output, inputs.attention_mask)
embeds = F.normalize(embeds, p=2, dim=1)

query_embeds, doc_embeds = embeds[:1], embeds[1:]
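
The embedding cell above mean-pools the token embeddings under the attention mask and then L2-normalizes the result. As a rough illustration of that arithmetic, here is a pure-Python sketch on a toy input; `masked_mean` and `l2_normalize` are hypothetical names used only here, and the small epsilon guards are chosen to mirror the `torch.clamp` call in `mean_pooling` and the default in `F.normalize`.

```python
import math


def masked_mean(token_vectors, mask):
    """Average the token vectors whose mask entry is 1, ignoring padded positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, m in zip(token_vectors, mask):
        for j in range(dim):
            total[j] += vec[j] * m
    count = max(sum(mask), 1e-9)  # guard against division by zero
    return [t / count for t in total]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, 1e-12) for v in vec]


# Two real tokens and one padding token; the padding vector must not affect the mean.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
unit = l2_normalize(pooled)
print(pooled, unit)
```

Because the vectors end up with unit length, a plain dot product between them equals their cosine similarity.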

In [None]:
score = query_embeds @ doc_embeds.T
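# Cap k at the number of paragraphs; torch.topk raises an error if k exceeds it.
topk_indices = torch.topk(score, k=min(5, len(paragraphs))).indices[0].tolist()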
topk_paragraphs = [paragraphs[i] for i in topk_indices]
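
The ranking step above scores each paragraph by its dot product with the question embedding and keeps the highest-scoring ones. A minimal pure-Python sketch of the same idea on hand-made unit vectors follows; `top_k_by_dot` is a hypothetical helper, not something defined in the notebook.

```python
def top_k_by_dot(query, docs, k):
    """Return the indices of the k documents with the highest dot product, plus all scores."""
    scores = [sum(q * d for q, d in zip(query, doc)) for doc in docs]
    k = min(k, len(docs))  # never ask for more items than exist
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores


# Toy unit vectors: doc 1 matches the query exactly, doc 0 is orthogonal to it.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
indices, scores = top_k_by_dot(query, docs, k=2)
print(indices, scores)
```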

## 3. QA

In [None]:
PROMPT = """Look at the evidence and answer the question
Evidence: Top 20 rankings as of 16 October 2017 Rank Change Team Points Germany 1631 Brazil 1619 Portugal 1446 Argentina 1445 5 Belgium 1333 6 Poland 1323 7 France 1226 8 Spain 1218 9 Chile 1173 10 Peru 1160 11 Switzerland 1134 12 England 1116 13 Colombia 1095 14 Wales 1072 15 Italy 1066 16 Mexico 1060 17 Uruguay 1034 18 Croatia 1013 19 7 Denmark 1001 20 9 Netherlands 931 * Change from 14 September 2017 Complete rankings at FIFA.com
Question: who has been ranked no. 1 in the latest football rankings announced by fifa
Answer: Germany

Evidence: "Your Love" is a song by the English rock band the Outfield, taken from their debut album Play Deep (1985). The song was penned by the band's guitarist John Spinks.
Question: who sings i just want to use your love tonight
Answer: English rock band the Outfield

Evidence: Principal photography began on May 20, 2016, in Welch, West Virginia.
Question: where was the movie the glass castle filmed
Answer: in Welch, West Virginia

Evidence: No. Name Field Affiliation Date of Appointment Date of Retirement Roopa Ganguly Art Bharatiya Janata Party 04-Oct-2016 03-Oct-2022 Sambhaji Raje Social work Bharatiya Janata Party 07-Jun-2016 03-May-2022 Suresh Gopi Art Bharatiya Janata Party 25-Apr-2016 24-Apr-2022 Subramanian Swamy Economics Bharatiya Janata Party 25-Apr-2016 24-Apr-2022 5 Narendra Jadhav Economics Nominated 25-Apr-2016 24-Apr-2022 6 Mary Kom Sport Nominated 25-Apr-2016 24-Apr-2022 7 Swapan Dasgupta Journalism Nominated 25-Apr-2016 24-Apr-2022 8 K.T.S. Tulsi Law Nominated 25-Feb-2014 24-Feb-2020 9 K. Parasaran Law Nominated 09-Jun-2012 28-Jun-2018 10 Rekha Art Nominated 27-Apr-2012 26-Apr-2018 11 Sachin Tendulkar Social service Nominated 27-Apr-2012 26-Apr-2018 12 Anu Aga Business Nominated 27-Apr-2012 26-Apr-2018
Question: who was the first lady nominated member of the rajya sabha
Answer: Mary Kom

Evidence: The McChicken is a chicken sandwich sold by the international fast-food chain McDonald's. The sandwich consists of a toasted wheat bun, a breaded chicken patty, shredded lettuce, and mayonnaise.
Question: what is on a mcchicken sandwich from mcdonalds
Answer: a breaded chicken patty

Evidence: Life of Pi is a Canadian fantasy adventure novel by Yann Martel published in 2001. The protagonist is Piscine Molitor "Pi" Patel, an Indian boy from Pondicherry who explores issues of spirituality and practicality from an early age. He survives 227 days after a shipwreck while stranded on a lifeboat in the Pacific Ocean with a Bengal tiger named Richard Parker.
Question: what is the tigers name in life of pi
Answer: Richard Parker

Evidence: Malware, short for malicious software, is an umbrella term used to refer to a variety of forms of harmful or intrusive software, including computer viruses, worms, Trojan horses, ransomware, spyware, adware, scareware, and other malicious programs. It can take the form of executable code, scripts, active content, and other software. Malware is defined by its malicious intent, acting against the requirements of the computer user -- and so does not include software that causes unintentional harm due to some deficiency.
Question: the general term for software that is designed to damage disable or steal data is
Answer: Malware

Evidence: Mum Genre Sitcom Created by Stefan Golaszewski Written by Stefan Golaszewski Directed by Richard Laxton Stefan Golaszewski Starring Lesley Manville Peter Mullan Sam Swainsbury Lisa McGrillis Opening theme Cups by Lulu and the Lampshades Country of origin United Kingdom Original language (s) English No. of series No. of episodes 12 (to 27 March 2018) Production Running time 30 minutes Production company (s) Big Talk Productions Distributor ITV Studios Release Original network BBC Two (2016-present) BBC Two HD (2016-present) Picture format 16: 9 1080i Audio format Stereo Original release 13 May 2016 (2016-05-13) -- present
Question: who sings the theme tune to mum on bbc2
Answer: Lulu and the Lampshades

Evidence: The Chess World Cup 2017 was a 128-player single-elimination chess tournament, held in Tbilisi, Georgia, from 2 to 27 September 2017. It was won by Armenian grandmaster Levon Aronian. This was the second time he had won the Chess World Cup, 12 years after his first win in 2005.
Question: where was the world chess tournament 2017 held
Answer: Tbilisi, Georgia

Evidence: T.J. Miller as Randy Kevin Michael Richardson as Rosie, others David Koechner as Robert "Bob Pogo" Pogrohvich, Frank's obese, chainsmoking boss. Kevin Farley as Babe, Carl, others Gary Cole as Rodger Dunbarton, the owner and founder of the airlines where Frank and his co-workers work. Joe Buck as Lou Gagliardi, others John DiMaggio as Scoop Dunbarton, Roger Dunbarton's racist and moronic nephew. Allison Janney as Henrietta Van Horne T.J. Miller as Randy Michael K. Williams as Smoky
Question: who voices randy in f is for family
Answer: T.J. Miller"""

In [None]:
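def create_prompt(evidence, question):
    # Append the new evidence/question pair after the few-shot examples.
    prompt = f'{PROMPT}\n\nEvidence: {evidence}\nQuestion: {question}\nAnswer:'
    return prompt

def clean_answer(answer, input_text):
    # Drop the echoed prompt, then keep only the first generated line.
    return answer.split(input_text)[-1].split('\n')[0].strip()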
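
To make the post-processing concrete, the sketch below runs the same prompt-building and answer-cleaning logic on a made-up decoded string. The one-shot `FEW_SHOT` text and the fake model continuation are assumptions for illustration only; they stand in for `PROMPT` and for real generator output.

```python
# Hypothetical one-shot prefix standing in for the notebook's PROMPT.
FEW_SHOT = (
    "Evidence: Paris is the capital of France.\n"
    "Question: what is the capital of france\n"
    "Answer: Paris"
)


def create_prompt(evidence, question, few_shot=FEW_SHOT):
    """Append a new evidence/question pair after the few-shot prefix."""
    return f'{few_shot}\n\nEvidence: {evidence}\nQuestion: {question}\nAnswer:'


def clean_answer(answer, input_text):
    """Drop the echoed prompt and keep only the first generated line."""
    return answer.split(input_text)[-1].split('\n')[0].strip()


prompt = create_prompt('Berlin is the capital of Germany.', 'what is the capital of germany')
# A causal LM echoes the prompt and often keeps writing further examples after the answer.
decoded = prompt + ' Berlin\n\nEvidence: Rome is the capital of Italy.'
answer = clean_answer(decoded, prompt)
print(answer)
```

Cutting at the first newline is what discards any extra "Evidence/Question" blocks the model generates after its answer.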

In [None]:
generator_tokenizer = AutoTokenizer.from_pretrained(cfg.generator, padding_side='left')
generator = AutoModelForCausalLM.from_pretrained(cfg.generator)
_ = generator.eval().requires_grad_(False).to(cfg.device)

In [None]:
input_texts = [create_prompt(evidence, question) for evidence in topk_paragraphs]
# Keep the attention mask: with left padding, generate() needs it to ignore pad tokens.
inputs = generator_tokenizer(input_texts, padding=True, return_tensors='pt').to(cfg.device)

In [None]:
outputs = generator.generate(**inputs, do_sample=True, top_p=0.9, temperature=1., max_new_tokens=32, num_return_sequences=1, use_cache=True)
outputs = generator_tokenizer.batch_decode(outputs, skip_special_tokens=True)

In [None]:
answers = [clean_answer(o, i) for i, o in zip(input_texts, outputs)]
answers