# Question Answering
### For this project, I load the Wikipedia Movie Plots dataset and ask an extractive question-answering transformer a few questions about each movie plot. I then look at how confident the model is in its answers (measured by the logit of the predicted answer start), grouped by a few different factors, such as the origin of the movie or the genre.

In [1]:
import pandas as pd

## Get the data and look at it

In [2]:
# Kaggle Wikipedia Movie Plots dataset
# https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots
movies = pd.read_csv('wiki_movie_plots_deduped.csv')
movies

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...
...,...,...,...,...,...,...,...,...
34881,2014,The Water Diviner,Turkish,Director: Russell Crowe,Director: Russell Crowe\r\nCast: Russell Crowe...,unknown,https://en.wikipedia.org/wiki/The_Water_Diviner,"The film begins in 1919, just after World War ..."
34882,2017,Çalgı Çengi İkimiz,Turkish,Selçuk Aydemir,"Ahmet Kural, Murat Cemcir",comedy,https://en.wikipedia.org/wiki/%C3%87alg%C4%B1_...,"Two musicians, Salih and Gürkan, described the..."
34883,2017,Olanlar Oldu,Turkish,Hakan Algül,"Ata Demirer, Tuvana Türkay, Ülkü Duru",comedy,https://en.wikipedia.org/wiki/Olanlar_Oldu,"Zafer, a sailor living with his mother Döndü i..."
34884,2017,Non-Transferable,Turkish,Brendan Bradley,"YouTubers Shanna Malcolm, Shira Lazar, Sara Fl...",romantic comedy,https://en.wikipedia.org/wiki/Non-Transferable...,The film centres around a young woman named Am...


In [3]:
# drop columns that are not used in this analysis
movies = movies.drop(columns=['Director', 'Cast', 'Wiki Page'])
movies

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Genre,Plot
0,1901,Kansas Saloon Smashers,American,unknown,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,unknown,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,unknown,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,unknown,Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,unknown,The earliest known adaptation of the classic f...
...,...,...,...,...,...
34881,2014,The Water Diviner,Turkish,unknown,"The film begins in 1919, just after World War ..."
34882,2017,Çalgı Çengi İkimiz,Turkish,comedy,"Two musicians, Salih and Gürkan, described the..."
34883,2017,Olanlar Oldu,Turkish,comedy,"Zafer, a sailor living with his mother Döndü i..."
34884,2017,Non-Transferable,Turkish,romantic comedy,The film centres around a young woman named Am...


In [4]:
# average length of plots, in words
word_lens = [len(x.split()) for x in movies['Plot'].tolist()]
sum(word_lens) / len(word_lens)

372.4932064438457

## Bring out the transformers!

In [5]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

model_ckpt = "bert-large-uncased-whole-word-masking-finetuned-squad"
#model_ckpt = "deepset/xlm-roberta-base-squad2"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt)


In [6]:
small_movies = movies.sample(n=25000, random_state=42)
small_movies = small_movies.reset_index(drop=True)
small_movies

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Genre,Plot
0,1951,The Day the Earth Stood Still,American,science fiction,"When a flying saucer lands in Washington, D.C...."
1,1981,The Burning,American,horror,"One night at Camp Blackfoot, several campers p..."
2,2012,Nobel Chor,Bengali,suspense / drama,"The first Asian Nobel Laureate, Rabindranath T..."
3,1952,Trent's Last Case,British,detective,A major international financier is found dead ...
4,1977,Aafat,Bollywood,unknown,Inspector Amar and Inspector Chhaya are after ...
...,...,...,...,...,...
24995,1998,The Dentist 2,American,horror,Dr. Alan Feinstone is in the maximum security ...
24996,1994,Zakhmi Dil,Bollywood,"action, romance",Jaidev (Akshay Kumar) and Gayatri (Ashwini Bha...
24997,1998,Billy's Hollywood Screen Kiss,American,comedy,Billy Collier (Sean P. Hayes) is an aspiring p...
24998,2009,Renigunta,Tamil,action,"The movie begins in Madurai where a young boy,..."


## Q: Who is the main character?

In [7]:
from tqdm import tqdm
import logging

question = "Who is the main character?"
answers = []
logits = []

# so that it doesn't warn me about input being too long
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)

# move the model to the GPU once (when one is available) instead of on every iteration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

for i in tqdm(range(len(small_movies))):
    encoding = tokenizer.encode_plus(text=question, text_pair=small_movies['Plot'][i])

    inputs = encoding['input_ids']  # token ids (indices into the vocabulary)

    # skip plots that do not fit in the model's 512-token input window
    if len(inputs) > 512:
        answers.append(None)
        logits.append(None)
        continue

    tokens = tokenizer.convert_ids_to_tokens(inputs)  # input tokens
    model_inputs = {'input_ids': torch.tensor([inputs]).to(device)}

    if "roberta" not in model_ckpt:
        # BERT also needs token_type_ids (segment ids), which mark whether
        # each token belongs to the question (0) or to the plot (1)
        model_inputs['token_type_ids'] = torch.tensor([encoding['token_type_ids']]).to(device)

    with torch.no_grad():  # inference only, no gradients needed
        scores = model(**model_inputs)

    start_index = torch.argmax(scores['start_logits'])
    end_index = torch.argmax(scores['end_logits'])

    # the logit of the most likely start token serves as the confidence score
    start_logit = scores['start_logits'][0][start_index]
    answer = ' '.join(tokens[start_index:end_index+1])

    answers.append(answer)
    logits.append(start_logit.item())

100%|██████████| 25000/25000 [09:26<00:00, 44.14it/s]


In [8]:
small_movies['Answers'] = answers
small_movies['Logits'] = logits
small_movies

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Genre,Plot,Answers,Logits
0,1951,The Day the Earth Stood Still,American,science fiction,"When a flying saucer lands in Washington, D.C....",,
1,1981,The Burning,American,horror,"One night at Camp Blackfoot, several campers p...",,
2,2012,Nobel Chor,Bengali,suspense / drama,"The first Asian Nobel Laureate, Rabindranath T...",b ##han ##u,3.600073
3,1952,Trent's Last Case,British,detective,A major international financier is found dead ...,phillip trent,5.286287
4,1977,Aafat,Bollywood,unknown,Inspector Amar and Inspector Chhaya are after ...,inspector amar,3.426002
...,...,...,...,...,...,...,...
24995,1998,The Dentist 2,American,horror,Dr. Alan Feinstone is in the maximum security ...,,
24996,1994,Zakhmi Dil,Bollywood,"action, romance",Jaidev (Akshay Kumar) and Gayatri (Ashwini Bha...,jai ##dev,3.040354
24997,1998,Billy's Hollywood Screen Kiss,American,comedy,Billy Collier (Sean P. Hayes) is an aspiring p...,,
24998,2009,Renigunta,Tamil,action,"The movie begins in Madurai where a young boy,...",,
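
The `Logits` column holds the raw start logit of the predicted answer. Raw logits are unnormalized, so the same value can correspond to quite different levels of certainty depending on the other positions in the sequence. A softmax over all positions turns the logit into a probability, which may be easier to compare across plots. The sketch below shows this on made-up numbers using only the standard library; it is an illustration, not a change to the analysis in this notebook.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two made-up sequences whose best start logit is identical (3.0) ...
peaked = [3.0, -2.0, -2.0, -2.0]
flat = [3.0, 2.9, 2.8, 2.7]

# ... but the probability mass on the best position differs a lot.
p_peaked = max(softmax(peaked))
p_flat = max(softmax(flat))
print(round(p_peaked, 3), round(p_flat, 3))
```

On the model output itself, the equivalent call would be `torch.softmax(scores['start_logits'], dim=-1)`.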


In [9]:
# drop all rows with NaN in the Logits column (which happens here when the plot is too long to fit in the transformer)
small_movies = small_movies.dropna(subset=['Logits'])
small_movies

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Genre,Plot,Answers,Logits
2,2012,Nobel Chor,Bengali,suspense / drama,"The first Asian Nobel Laureate, Rabindranath T...",b ##han ##u,3.600073
3,1952,Trent's Last Case,British,detective,A major international financier is found dead ...,phillip trent,5.286287
4,1977,Aafat,Bollywood,unknown,Inspector Amar and Inspector Chhaya are after ...,inspector amar,3.426002
6,1947,I Cover Big Town,American,drama,"""Illustrated Press"" society editor Lorelei Kil...",lore ##lei ki ##lb ##our ##ne,5.650765
7,2008,Sultan,Malayalam,unknown,Sivan (Vinu Mohan) is a medical college studen...,si ##van,4.589819
...,...,...,...,...,...,...,...
24992,2014,Dekh Tamasha Dekh,Bollywood,comedy,"Inspired by a true incident, the film starts o...",sat ##ish ka ##ush ##ik,1.480206
24993,1980,The Legend of Alfred Packer,American,western,"McMurphy comes to Denver, Colorado to see Poll...",[CLS] who is the main character ? [SEP],1.799027
24994,2006,Euphoria,Russian,drama,The story unfolds in the Eurasian Steppes. Ver...,. vera,1.533089
24996,1994,Zakhmi Dil,Bollywood,"action, romance",Jaidev (Akshay Kumar) and Gayatri (Ashwini Bha...,jai ##dev,3.040354
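
The answers above are printed as WordPiece tokens, where a `##` prefix marks a piece that continues the previous word (for example `lore ##lei ki ##lb ##our ##ne`). A minimal sketch for turning them back into readable words is shown below; `merge_wordpieces` is a hypothetical helper, and it does not fix spacing around punctuation. The tokenizer's own `convert_tokens_to_string` method is likely the more robust choice inside the loop.

```python
def merge_wordpieces(answer):
    """Join WordPiece tokens: a token starting with '##' continues the previous word."""
    words = []
    for token in answer.split():
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the piece onto the previous word
        else:
            words.append(token)
    return " ".join(words)


print(merge_wordpieces("lore ##lei ki ##lb ##our ##ne"))  # lorelei kilbourne
print(merge_wordpieces("b ##han ##u"))                    # bhanu
```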


#### Grouped by genre: not much to learn here, since the genres at both ends of the ranking contain only a single movie each

In [10]:
small_movies[['Genre', 'Logits']].groupby(['Genre']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

Genre,Logits (mean),Logits (count)
comedy / action / sci-fi / animation,8.603385,1
panorama studios,8.011988,1
drama / history / war,7.935113,1
drama / mystery / suspense,7.905862,1
comedy / western,7.541578,1
...,...,...
war documentary,-3.321903,1
social drama romance,-3.456677,1
"fantasy, drama, children's, action, comedy",-3.633883,1
adult/horror,-3.975382,1
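
Every genre visible at the top and bottom of this ranking has a count of 1, so the means are really just individual movies. One way to get a more informative ranking is to keep only groups with a minimum number of movies before sorting. The sketch below uses a small made-up DataFrame as a stand-in for `small_movies[['Genre', 'Logits']]`; `MIN_COUNT` is a hypothetical threshold, not a value used in this notebook.

```python
import pandas as pd

# Toy stand-in for small_movies[['Genre', 'Logits']]; the values are made up.
toy = pd.DataFrame({
    "Genre": ["drama", "drama", "drama", "comedy", "comedy", "satire - thriller"],
    "Logits": [3.0, 4.0, 5.0, 2.0, 4.0, 8.0],
})

MIN_COUNT = 2  # hypothetical threshold; pick one that suits the data

# Mean logit and group size per genre, then drop the tiny groups before ranking.
stats = toy.groupby("Genre")["Logits"].agg(["mean", "count"])
ranked = stats[stats["count"] >= MIN_COUNT].sort_values(by="mean", ascending=False)
print(ranked)
```

In the toy data, the single-movie genre with the highest logit no longer dominates the ranking once the threshold is applied.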


#### Grouped by release year: this is more interesting. Plots of movies from the 1910s and 1920s seem to have more clear-cut main characters than those from the first decade of the 1900s, although the earliest years contain only a handful of movies each.

In [11]:
small_movies[['Release Year', 'Logits']].groupby(['Release Year']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

Release Year,Logits (mean),Logits (count)
1901,4.640282,3
1923,4.633491,22
1925,4.580996,30
1915,4.462149,15
1924,4.171358,29
...,...,...
1907,3.026783,3
1902,2.054796,1
1903,1.366107,2
1906,1.232053,3
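
Since the observation here is about decades while the table is grouped by single years (some with only one to three movies), grouping by decade may give a steadier picture. The sketch below shows the idea on a small made-up DataFrame standing in for `small_movies[['Release Year', 'Logits']]`; the `Decade` column is a hypothetical addition.

```python
import pandas as pd

# Toy stand-in for small_movies[['Release Year', 'Logits']]; the values are made up.
toy = pd.DataFrame({
    "Release Year": [1901, 1906, 1915, 1919, 1923, 1925],
    "Logits": [4.0, 1.0, 4.0, 3.0, 5.0, 4.0],
})

# Integer division maps each year to the first year of its decade (1923 -> 1920).
toy["Decade"] = (toy["Release Year"] // 10) * 10

by_decade = toy.groupby("Decade")["Logits"].agg(["mean", "count"])
print(by_decade)
```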


#### Grouped by origin/ethnicity: movies from some regions seem to have more clearly defined main characters than others

In [12]:
small_movies[['Origin/Ethnicity', 'Logits']].groupby(['Origin/Ethnicity']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

Origin/Ethnicity,Logits (mean),Logits (count)
Maldivian,6.206839,1
South_Korean,4.129292,313
Chinese,4.12362,237
Turkish,4.003654,44
Punjabi,3.979159,50
Malayalam,3.940484,586
Assamese,3.887315,4
Hong Kong,3.794327,370
Malaysian,3.790133,35
Marathi,3.787433,81


## Q: What is the setting of the story?

In [13]:
small_movies = movies.sample(n=25000, random_state=42)
small_movies = small_movies.reset_index(drop=True)

question = "What is the setting of the story?"
answers = []
logits = []

# move the model to the GPU once (when one is available) instead of on every iteration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

for i in tqdm(range(len(small_movies))):
    encoding = tokenizer.encode_plus(text=question, text_pair=small_movies['Plot'][i])

    inputs = encoding['input_ids']  # token ids (indices into the vocabulary)

    # skip plots that do not fit in the model's 512-token input window
    if len(inputs) > 512:
        answers.append(None)
        logits.append(None)
        continue

    tokens = tokenizer.convert_ids_to_tokens(inputs)  # input tokens
    model_inputs = {'input_ids': torch.tensor([inputs]).to(device)}

    if "roberta" not in model_ckpt:
        # BERT also needs token_type_ids (segment ids), which mark whether
        # each token belongs to the question (0) or to the plot (1)
        model_inputs['token_type_ids'] = torch.tensor([encoding['token_type_ids']]).to(device)

    with torch.no_grad():  # inference only, no gradients needed
        scores = model(**model_inputs)

    start_index = torch.argmax(scores['start_logits'])
    end_index = torch.argmax(scores['end_logits'])

    # the logit of the most likely start token serves as the confidence score
    start_logit = scores['start_logits'][0][start_index]
    answer = ' '.join(tokens[start_index:end_index+1])

    answers.append(answer)
    logits.append(start_logit.item())

100%|██████████| 25000/25000 [09:33<00:00, 43.57it/s]


In [14]:
small_movies['Answers'] = answers
small_movies['Logits'] = logits
# drop rows whose plot was too long for the model (their Logits value is NaN)
small_movies = small_movies.dropna(subset=['Logits'])
small_movies

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Genre,Plot,Answers,Logits
2,2012,Nobel Chor,Bengali,suspense / drama,"The first Asian Nobel Laureate, Rabindranath T...",kolkata,4.305354
3,1952,Trent's Last Case,British,detective,A major international financier is found dead ...,hampshire,3.416084
4,1977,Aafat,Bollywood,unknown,Inspector Amar and Inspector Chhaya are after ...,inspector amar and inspector ch ##haya,-2.576887
6,1947,I Cover Big Town,American,drama,"""Illustrated Press"" society editor Lorelei Kil...",illustrated press,2.650819
7,2008,Sultan,Malayalam,unknown,Sivan (Vinu Mohan) is a medical college studen...,medical college,3.997332
...,...,...,...,...,...,...,...
24992,2014,Dekh Tamasha Dekh,Bollywood,comedy,"Inspired by a true incident, the film starts o...",after the man dies,2.051994
24993,1980,The Legend of Alfred Packer,American,western,"McMurphy comes to Denver, Colorado to see Poll...","denver , colorado",3.321502
24994,2006,Euphoria,Russian,drama,The story unfolds in the Eurasian Steppes. Ver...,eurasian steppe ##s,7.187541
24996,1994,Zakhmi Dil,Bollywood,"action, romance",Jaidev (Akshay Kumar) and Gayatri (Ashwini Bha...,mumbai,1.637063


#### Grouped by genre: again not too exciting, since the genres at both ends of the ranking contain only a single movie each

In [15]:
small_movies[['Genre', 'Logits']].groupby(['Genre']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

Genre,Logits (mean),Logits (count)
satire - thriller,8.119677,1
historical epic drama,7.952402,1
adventure action comedy,7.341554,1
"drama, romance, action",7.273798,1
drama / romance / mystery,7.175522,1
...,...,...
kung-fu film,-2.753941,1
"fantasy, drama, children's, action, comedy",-2.893498,1
adult romance,-2.919601,1
interactive cinema,-2.959424,1


#### Grouped by release year: here we see the years in which the movies' plots had clearer settings

In [16]:
small_movies[['Release Year', 'Logits']].groupby(['Release Year']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

Release Year,Logits (mean),Logits (count)
1904,4.649937,1
1919,2.763269,38
2010,2.681820,310
1947,2.643266,155
2009,2.637227,269
...,...,...
1914,1.252082,34
1907,0.561697,3
1909,0.204229,2
1908,0.046060,3


#### Grouped by origin/ethnicity: interestingly, the single 'Maldivian' movie had the clearest answer to the main character question but the most uncertain answer to the setting question. I print out its title and plot below!

In [17]:
small_movies[['Origin/Ethnicity', 'Logits']].groupby(['Origin/Ethnicity']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

Origin/Ethnicity,Logits (mean),Logits (count)
Assamese,4.023631,4
Chinese,3.014825,237
Punjabi,2.856076,50
Bengali,2.750242,164
Malaysian,2.749345,35
Australian,2.719779,291
Canadian,2.718702,369
Marathi,2.664511,81
Russian,2.657073,116
Japanese,2.605691,452


In [23]:
mal_movie = small_movies.loc[small_movies['Origin/Ethnicity'] == 'Maldivian']
print(mal_movie['Title'][1404])
print(mal_movie['Plot'][1404])

Mikoe Bappa Baey Baey
The film opens when Ahmed Fazeel/Mohamed Saleem (Mohamed Manik) wakes up to an unfamiliar environment. Everything he sees seems to be queer and were unrecognized. When he starts looking and discovering things around him, he was amazed by seeing a baby monitor. In a while he hears a baby crying transmitted from the monitor, followed by a women's lullaby. He was further astonished when he found out that he was wearing a wedding ring on his finger. As a result he gets confused of his identity.
After a few minutes, Aminath Shifa/Nisha (Aishath Rishmy) enters the room and behaves like his wife. She tried to convince him that he is Fazeel. She revealed that he met with an accident and only remembers things that happened only recently. She medicates him for recover. However, since no change has been identified, he loses control and get furious. In return she had to hurt him in order to save their baby.
She struggles enough to save their wedding and their baby. His past