# Answering questions using RoBERTa

## Main solution using pre-made model

In [1]:
"""Install requirements"""
# Install the transformers library from Hugging Face, with PyTorch as its backend
!pip install transformers torch pytesseract
# Extra tokenizer dependencies that some of these models use under the hood
!pip install sentencepiece sacremoses
# Needed by the web-scraping cells below
!pip install requests beautifulsoup4



In [2]:
"""Import packages"""
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset


In [3]:
"""For web scraping"""
import requests
from bs4 import BeautifulSoup
import re

In [4]:
"""Scrape BBC"""
story = "https://www.bbc.co.uk/news/uk-england-beds-bucks-herts-67407334"
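response = requests.get(story)
response.raise_for_status()  # fail early if the page could not be fetched
# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(response.content, "html.parser")
# Each paragraph of the story sits in a div marked data-component="text-block"
article = " ".join(
    para.text for para in soup.find_all("div", {"data-component": "text-block"})
)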
article

'A cat whose pictures went viral for regularly visiting a railway station is releasing a Christmas single. Four-year-old Nala has been delighting commuters who have been taking photos of her at Stevenage station. Owner Natasha Ambler revealed the cat was releasing a single called Meow and has been approached for a book deal. The ginger tabby has also recorded a video for the song due to be released this week, under the name Nala the Station Cat. It has been produced by Danny Kirsch, who wrote it with Joe Killington, while Nala is also co-credited as a songwriter, as well as a vocalist. Ms Ambler said "we want to spread the happiness that Stevenage has had, and she\'s had on socials to the world". The single is officially released on Wednesday and BBC Three Counties Radio\'s Justin Dealey gave the single an exclusive first play on Sunday. "I\'m slightly lost for words," said the presenter after the song finished. Nala\'s owner replied: "So am I to be fair." The musical cat does not yet 
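
Scraped text often carries stray whitespace, such as non-breaking spaces or doubled spaces where paragraphs were joined. A minimal sketch of a cleanup step using the `re` module imported above; `clean_text` is a hypothetical helper name, not part of the original notebook, and the sample string is made up for illustration:

```python
import re

def clean_text(text: str) -> str:
    """Collapse runs of whitespace (including non-breaking spaces) into single spaces."""
    # For str patterns, \s also matches Unicode whitespace such as \xa0
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample containing a non-breaking space, repeated spaces and a newline
sample = "Nala the\xa0Station   Cat \n releases a single. "
print(clean_text(sample))
```

In the notebook this could be applied as `article = clean_text(article)` before passing the text to the model.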

In [17]:
"""Open a file to answer questions about"""
# The context manager closes the file even if reading fails
with open("example_article.txt", "r", encoding="utf-8") as file:
    content = file.read()
print(content)

This is just some text to use as an example.
It does not particularly say much that is very interesting or useful.
The purpose of the article is that I can confirm how to open and read articles and to see whether my model can answer questions based on it.
Animals: mouse, cat, horse, hippo, elephant, whale.
Spain is a country in the Solar System.



In [6]:
"""Import our question answering model"""
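# Name the task explicitly rather than relying on it being inferred from the model
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")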

In [33]:
"""Ask a question, specifying some text as a context"""
qa(question="Where will profit go?", context=article)['answer']

'RSPCA and Stevenage homelessness charity Feed Up Warm Up'
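
The question-answering pipeline returns a dictionary with `score`, `start`, `end` and `answer` keys, and the cell above keeps only `answer`. The sketch below shows one way to discard low-confidence answers. `answer_or_none` is a hypothetical helper, the `0.1` threshold is an arbitrary assumption that would need tuning, and the example dictionary is hand-written rather than real model output:

```python
from typing import Optional

def answer_or_none(result: dict, min_score: float = 0.1) -> Optional[str]:
    """Return the answer text only if the model's confidence clears min_score."""
    answer = result.get("answer", "").strip()
    # An empty answer or a low score is treated as "no reliable answer"
    if not answer or result.get("score", 0.0) < min_score:
        return None
    return answer

# Hand-written stand-in shaped like a pipeline result (values are made up)
example = {"score": 0.42, "start": 10, "end": 24, "answer": "example answer"}
print(answer_or_none(example))
```

With the real model this would be called as `answer_or_none(qa(question="Where will profit go?", context=article))`.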

## Importing audio as input for questions or answers

In [None]:
"""Installs to analyse audio"""
# -y answers the confirmation prompt, which a notebook cell cannot do interactively
!sudo apt install -y ffmpeg
!pip install datasets soundfile librosa

In [20]:
"""Example audio to analyse"""
!mkdir -p data
!curl https://wagon-public-datasets.s3.amazonaws.com/deep_learning_datasets/harvard.wav > data/harvard.wav

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3173k  100 3173k    0     0  2497k      0  0:00:01  0:00:01 --:--:-- 2499k
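
Before transcribing, it can help to check the sample rate and channel count of the downloaded file, since Whisper expects 16 kHz audio. A minimal sketch using the standard-library `wave` module; `wav_info` is a hypothetical helper, and it assumes the file is uncompressed PCM, which is all that `wave` reads reliably:

```python
import wave

def wav_info(path: str) -> dict:
    """Read basic header fields from an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": sample_rate,
            "sample_width_bytes": wav.getsampwidth(),
            # Number of frames divided by frames per second gives the duration
            "duration_s": wav.getnframes() / sample_rate,
        }
```

In this notebook it would be called as `wav_info("data/harvard.wav")`. A `sample_rate` other than 16000 means the audio needs resampling before it is passed to Whisper.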


In [21]:
"""Packages for audio"""
from scipy.io import wavfile
from IPython.display import Audio

In [34]:
"""Read the audio file and play it to verify"""
rate, audio = wavfile.read("data/harvard.wav")
Audio(audio.T, rate=rate)



In [70]:
"""Transcription of a downloaded wav file"""

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa  

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.config.forced_decoder_ids = None

# Whisper expects 16 kHz audio, so resample (and downmix to mono) while loading with librosa
audio, rate = librosa.load('data/harvard.wav', sr=16000)
input_features = processor(audio, sampling_rate=rate, return_tensors="pt").input_features

# generate token ids
predicted_ids = model.generate(input_features)
# decode token ids to text, dropping special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [68]:
"""Transcription of a flac file from Hugging Face"""

from transformers import WhisperProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# load model and processor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.config.forced_decoder_ids = None

# load dummy dataset and read audio files
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features 

# generate token ids
predicted_ids = model.generate(input_features)
# decode token ids to text, dropping special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [71]:
transcription

[' The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health in zest. A salt pickle tastes fine with ham. Tacos all pastora are my favorite. A zestful food is the hot cross bun.']
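
`batch_decode` returns a list of strings, one per audio input, so the transcription can be joined into a single context and passed to the question-answering model from the first section. A minimal sketch; `answer_from_transcription` is a hypothetical helper, and `qa` is assumed to be any callable that accepts `question` and `context` keyword arguments and returns a dictionary with an `answer` key, as the pipeline created earlier does:

```python
from typing import Callable, List

def answer_from_transcription(qa: Callable, transcription: List[str], question: str) -> str:
    """Join the decoded segments into one context string and ask the QA model about it."""
    # Decoded Whisper text often starts with a space, so strip each segment first
    context = " ".join(segment.strip() for segment in transcription)
    return qa(question=question, context=context)["answer"]
```

In the notebook this could be called as `answer_from_transcription(qa, transcription, "What tastes fine with ham?")`, where the question is only an illustration.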

## Other things