# Question Answering
### For this project, I gather data from the Wikipedia Movie Plots dataset. I then ask a few questions about the movie plots through the transformer. I then look at the surety of the answers generated (the logits) grouped by a few different factors, such as the origin of the movie or the genre.

In [1]:
import pandas as pd

## Get the data and look at it

In [2]:
# Kaggle Wikipedia Movie Plots dataset
# https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots
movies = pd.read_csv('wiki_movie_plots_deduped.csv')
movies

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...
...,...,...,...,...,...,...,...,...
34881,2014,The Water Diviner,Turkish,Director: Russell Crowe,Director: Russell Crowe\r\nCast: Russell Crowe...,unknown,https://en.wikipedia.org/wiki/The_Water_Diviner,"The film begins in 1919, just after World War ..."
34882,2017,Çalgı Çengi İkimiz,Turkish,Selçuk Aydemir,"Ahmet Kural, Murat Cemcir",comedy,https://en.wikipedia.org/wiki/%C3%87alg%C4%B1_...,"Two musicians, Salih and Gürkan, described the..."
34883,2017,Olanlar Oldu,Turkish,Hakan Algül,"Ata Demirer, Tuvana Türkay, Ülkü Duru",comedy,https://en.wikipedia.org/wiki/Olanlar_Oldu,"Zafer, a sailor living with his mother Döndü i..."
34884,2017,Non-Transferable,Turkish,Brendan Bradley,"YouTubers Shanna Malcolm, Shira Lazar, Sara Fl...",romantic comedy,https://en.wikipedia.org/wiki/Non-Transferable...,The film centres around a young woman named Am...


In [3]:
# drop useless columns
movies = movies.drop(columns=['Director', 'Cast', 'Wiki Page'])
movies

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Genre,Plot
0,1901,Kansas Saloon Smashers,American,unknown,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,unknown,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,unknown,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,unknown,Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,unknown,The earliest known adaptation of the classic f...
...,...,...,...,...,...
34881,2014,The Water Diviner,Turkish,unknown,"The film begins in 1919, just after World War ..."
34882,2017,Çalgı Çengi İkimiz,Turkish,comedy,"Two musicians, Salih and Gürkan, described the..."
34883,2017,Olanlar Oldu,Turkish,comedy,"Zafer, a sailor living with his mother Döndü i..."
34884,2017,Non-Transferable,Turkish,romantic comedy,The film centres around a young woman named Am...


In [4]:
# average length of plots, in words
word_lens = [len(x.split()) for x in movies['Plot'].tolist()]
sum(word_lens) / len(word_lens)

372.4932064438457

## Bring out the transformers!

In [5]:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# model_ckpt = "bert-large-uncased-whole-word-masking-finetuned-squad"
# model_ckpt = "deepset/xlm-roberta-base-squad2"
# model_ckpt = "allenai/longformer-large-4096-finetuned-triviaqa"
model_ckpt = "mrm8488/longformer-base-4096-finetuned-squadv2"

tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForQuestionAnswering.from_pretrained(model_ckpt, num_labels=2)

2022-11-15 18:37:22.176779: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-11-15 18:37:22.396239: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2022-11-15 18:37:23.236192: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-11-15 18:37:23.236294: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or 

In [6]:
# small_movies = movies.sample(n=25000, random_state=42)
# small_movies = small_movies.reset_index(drop=True)
movies = movies.head(15000)
# small_movies

## Q: Who is the main character?

In [7]:
# !nvidia-smi
# device = torch.device('cuda:4')
torch.cuda.get_device_name(0)

'NVIDIA TITAN V'

In [8]:
from tqdm import tqdm
import logging

question = "Who is the main character?"
answers = []
logits = []

device = torch.device('cuda:0')
model = model.to(device)

# so that it doesn't warn me about input being too long
# logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)

for i in tqdm(range(len(movies))):
    encoding = tokenizer.encode_plus(text=question, text_pair=movies['Plot'][i])
    
    inputs = encoding['input_ids']  # Token embeddings
    
    if len(inputs) > 4096:
        answers.append(None)
        logits.append(None)
        continue

#     if not "roberta" in model_ckpt:
#         sentence_embedding = encoding['token_type_ids']  # Segment embeddings - only needed for BERT

    tokens = tokenizer.convert_ids_to_tokens(inputs) # input tokens
    
#     if "roberta" in model_ckpt:
#         scores = model(input_ids=torch.tensor([inputs]))
#     else:
    # BERT needs token_type_ids which mask the question and answer
#     device = torch.device('cuda')
#     x = torch.tensor([inputs]).to(device)
#     y = torch.tensor([sentence_embedding]).to(device)
#     model = model.to(device)
#     scores = model(input_ids=x, token_type_ids=y)
#         scores = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))

    x = torch.tensor([inputs]).to(device)
    scores = model(input_ids=x)
        
    start_index = torch.argmax(scores['start_logits'])
    end_index = torch.argmax(scores['end_logits'])
    
    start_logit = scores['start_logits'][0][start_index]
    answer = ' '.join(tokens[start_index:end_index+1])
    
    if start_logit is None or answer is None:
        raise Exception("that wasn't supposed to happen")
    
    answers.append(answer)
    logits.append(start_logit.item())

100%|██████████| 15000/15000 [19:14<00:00, 12.99it/s]


In [9]:
movies['Answers'] = answers
movies['Logits'] = logits
movies

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['Answers'] = answers
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  movies['Logits'] = logits


Unnamed: 0,Release Year,Title,Origin/Ethnicity,Genre,Plot,Answers,Logits
0,1901,Kansas Saloon Smashers,American,unknown,"A bartender is working at a saloon, serving dr...",ĠCarrie ĠNation,4.085766
1,1901,Love by the Light of the Moon,American,unknown,"The moon, painted with a smiling face hangs ov...",<s>,-1.154163
2,1901,The Martyred Presidents,American,unknown,"The film, just over a minute long, is composed...",ĠLady ĠJustice,0.787540
3,1901,"Terrible Teddy, the Grizzly King",American,unknown,Lasting just 61 seconds and consisting of two ...,ĠTheodore ĠRoosevelt,5.488764
4,1902,Jack and the Beanstalk,American,unknown,The earliest known adaptation of the classic f...,ĠJack Ġis Ġthe Ġson Ġof Ġa Ġdep osed Ġking,-0.409901
...,...,...,...,...,...,...,...
14995,2006,Mad Cowgirl,American,drama,The central character in Mad Cowgirl is Theres...,ĠThere se,5.528144
14996,2006,Madea's Family Reunion,American,comedy-drama,After Madea (Tyler Perry) violates the terms o...,,1.370978
14997,2006,Man About Town,American,comedy,"Top Hollywood talent agent, Jack Giamoro (Ben ...",ĠJack ĠG iam oro,-0.347427
14998,2006,Man of the Year,American,comedy,"Tom Dobbs is host of a satirical news program,...",<s>,2.310436


In [15]:
# drop all rows with NaN in Logits column (which happens here when the plot is too long to fit in the transformer)
# movies = movies.dropna()
# movies
print(movies['Answers'].value_counts()['<s>'])

2847


#### Grouped by genre: not super exciting stuff here

In [11]:
movies[['Genre', 'Logits']].groupby(['Genre']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

Unnamed: 0_level_0,Logits,Logits
Unnamed: 0_level_1,mean,count
Genre,Unnamed: 1_level_2,Unnamed: 2_level_2
"horror, fantasy",7.833051,1
"historical, erotic",7.522468,1
"sci-fi, drama",7.242088,1
"animated, short",7.027125,2
semi-staged documentary,6.781630,1
...,...,...
"drama, exploitation",-1.333774,2
short action/crime western,-1.387757,1
adventures,-1.764926,1
"crime, western",-1.774272,1


#### Grouped by release year: this is more interesting, it seems that the plots of movies in the 10's and 20's have more clear-cut main characters that those in the first decade of the 1900's...

In [12]:
movies[['Release Year', 'Logits']].groupby(['Release Year']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

Unnamed: 0_level_0,Logits,Logits
Unnamed: 0_level_1,mean,count
Release Year,Unnamed: 1_level_2,Unnamed: 2_level_2
1905,6.227219,2
1923,3.766922,25
1927,3.093295,44
1971,2.918604,131
1972,2.853445,122
...,...,...
1908,1.347897,6
1906,0.704565,3
1903,0.052075,2
1902,-0.409901,1


#### Grouped by origin/ethnicity: movies out of some regions seem to have more defined main characters than others

In [13]:
movies[['Origin/Ethnicity', 'Logits']].groupby(['Origin/Ethnicity']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

Unnamed: 0_level_0,Logits,Logits
Unnamed: 0_level_1,mean,count
Origin/Ethnicity,Unnamed: 1_level_2,Unnamed: 2_level_2
American,2.329058,15000


## Q: What is the setting of the story?

In [14]:
# small_movies = movies.sample(n=25000, random_state=42)
# small_movies = small_movies.reset_index(drop=True)

question = "What is the setting of the story?"
answers = []
logits = []

for i in tqdm(range(len(movies))):
    encoding = tokenizer.encode_plus(text=question, text_pair=movies['Plot'][i])
    
    inputs = encoding['input_ids']  # Token embeddings
    
    if len(inputs) > 512:
        answers.append(None)
        logits.append(None)
        continue

    if not "roberta" in model_ckpt:
        sentence_embedding = encoding['token_type_ids']  # Segment embeddings - only needed for BERT

    tokens = tokenizer.convert_ids_to_tokens(inputs) # input tokens
    
    #     if "roberta" in model_ckpt:
#         scores = model(input_ids=torch.tensor([inputs]))
#     else:
    # BERT needs token_type_ids which mask the question and answer
#     device = torch.device('cuda')
#     x = torch.tensor([inputs]).to(device)
#     y = torch.tensor([sentence_embedding]).to(device)
#     model = model.to(device)
#     scores = model(input_ids=x, token_type_ids=y)
#         scores = model(input_ids=torch.tensor([inputs]), token_type_ids=torch.tensor([sentence_embedding]))

    x = torch.tensor([inputs]).to(device)
    scores = model(input_ids=x)
        
    start_index = torch.argmax(scores['start_logits'])
    end_index = torch.argmax(scores['end_logits'])
    
    start_logit = scores['start_logits'][0][start_index]
    answer = ' '.join(tokens[start_index:end_index+1])
    
    if start_logit is None or answer is None:
        raise Exception("that wasn't supposed to happen")
    
    answers.append(answer)
    logits.append(start_logit.item())

  0%|          | 0/15000 [00:00<?, ?it/s]


KeyError: 'token_type_ids'

In [None]:
movies['Answers'] = answers
movies['Logits'] = logits
# small_movies = small_movies.dropna()
movies

#### Grouped by genre: not too exciting here

In [None]:
movies[['Genre', 'Logits']].groupby(['Genre']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

#### Grouped by release year: here we see years in which the movie's plots had clearer settings

In [None]:
movies[['Release Year', 'Logits']].groupby(['Release Year']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

#### Grouped by origin/ethnicity: quite interesting that the 'Maldivian' movie had the most clear answer for the main character question, but is the most uncertain for the setting question. I print out its title and plot below!

In [None]:
movies[['Origin/Ethnicity', 'Logits']].groupby(['Origin/Ethnicity']).agg(['mean', 'count']).sort_values(by=('Logits', 'mean'), ascending=False)

In [None]:
# mal_movie = movies.loc[movies['Origin/Ethnicity'] == 'Maldivian']
# print(mal_movie['Title'][1404])
# print(mal_movie['Plot'][1404])