# Full cycle of an NLP project

## I. Fact verification

In this task you will build a rule-based fact extraction system, along with tools for evaluating both the system and the facts it extracts.

### 1. Domain

Choose a domain for which a small database can be built from Wikipedia, DBPedia, IMDB, etc. Analyze the domain and write a program that builds the database. Example domains:
- an actor and the films they appeared in
- a writer and the books they wrote
- a music band and its members/concerts/albums with years of activity/release
- a company and all its CEOs/CTOs with their years in office
- a person and all their places of work with time periods
- a politician and the political parties they belonged to
- an inventor and their inventions

### 2. Fact extraction

2.1. Write a program that finds the Wikipedia article about an entity from your domain and extracts the text of that article.

2.2. Write a program that processes the article text (the text itself, not the tables, if there are any) and extracts information about your domain from it. You will compare this information with the database you built.
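
Step 2.1 could be sketched roughly as follows. This is a minimal sketch, assuming the MediaWiki API on English Wikipedia with the `prop=extracts` query is used to obtain plain text. The function names `build_extract_url`, `parse_extract` and `fetch_article_text` and the User-Agent string are hypothetical, and the search step (finding the right article title) is left out.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build a MediaWiki API URL that asks for the plain-text extract of one page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,   # plain text instead of HTML
        "redirects": 1,     # follow redirects to the canonical page
        "format": "json",
        "titles": title,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_extract(payload):
    """Return the article text from a decoded API response, or None if it is missing."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None


def fetch_article_text(title):
    """Download the plain text of a Wikipedia article (performs a network call)."""
    request = urllib.request.Request(
        build_extract_url(title),
        # Hypothetical User-Agent; replace with your own contact details.
        headers={"User-Agent": "nlp-course-fact-checker/0.1"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_extract(json.load(response))
```

Keeping URL construction and response parsing separate from the network call makes both parts easy to check without internet access.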

### 3. Evaluating the results

Design a metric that shows how well the information you extracted from the article matches the information in your database. How much information is missing on each side? Are there partial matches? (For example, the CEO's name matches only partially, or the CEO's name matches but the years in office differ.)
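
One possible shape for such a metric is set-based precision and recall over `(year, name)` facts, with partial matches counted separately. The sketch below is only an illustration: the function name `compare_facts` is hypothetical, a "partial match" is assumed to mean "same name, different year", and the `extracted` list is made up to exercise every case.

```python
def compare_facts(extracted, database):
    """Compare two collections of (year, name) facts.

    Exact matches agree on both fields; partial matches agree on the name only.
    """
    extracted, database = set(extracted), set(database)
    exact = extracted & database
    # Names in the database that were not matched exactly.
    db_names = {name for _, name in database - exact}
    partial = {fact for fact in extracted - exact if fact[1] in db_names}
    precision = len(exact) / len(extracted) if extracted else 0.0
    recall = len(exact) / len(database) if database else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "partial": sorted(partial),
        "unmatched_in_db": sorted(database - exact),
        "unmatched_in_text": sorted(extracted - exact - partial),
    }


database = [(1956, "Little Willie John"), (1958, "Peggy Lee"),
            (1960, "Elvis Presley"), (1982, "The Jam")]
# Made-up extraction result: one exact hit, one wrong year, one unknown name.
extracted = [(1960, "Elvis Presley"), (1983, "The Jam"), (1999, "Unknown Artist")]
report = compare_facts(extracted, database)
print(report["precision"], report["recall"], report["partial"])
```

Reporting the unmatched facts on each side, not just the two scores, makes it easier to see whether the rules or the database are responsible for a mismatch.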

### Example

1. From a database of song covers, we take songs and their performers together with the years of performance. For example, *Fever* from https://secondhandsongs.com/work/886/versions becomes
```
1956,Little Willie John,Sandra Meade	
1957,Earl Grant,Kay Martin and Her Bodyguards,Ray Peterson
1958,Peggy Lee,Maureen Evans,Sallie Blair
1959,Norma Bengell,Sam Butera and The Witnesses,Frankie Avalon	
1960,Elvis Presley
...
1982,The Jam,Amanda Lear
1983,Marine Girls
1984,The Neville Brothers
...
```

2.1. We scrape the information about the song from Wikipedia: https://en.wikipedia.org/wiki/Fever_(Little_Willie_John_song).

2.2. We write a set of rules (using parts of speech, syntax trees, regular expressions, etc.) that extract the information we need from the text:

*Elvis Presley released a near identical version to Lee's two years following her cover, for his 1960 album, Elvis Is Back!...*  
*During their 1982 world tour, the British group The Jam covered the song as part of a medley with their own "Pity Poor Alfie"...*  
```
1960,Elvis Presley
1982,The Jam
```
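
A very small rule of this kind could look like the sketch below. It assumes a gazetteer of performer names is available (here a hand-written list) and simply pairs each performer mentioned in a sentence with the four-digit years found in the same sentence. The function name `extract_cover_facts` is hypothetical, and a real rule set would rely on POS tags, NER or dependency parses rather than a fixed name list.

```python
import re

# Four-digit years from the 20th and 21st centuries.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def extract_cover_facts(sentences, known_performers):
    """Pair every known performer mentioned in a sentence with the years found in it."""
    facts = []
    for sentence in sentences:
        years = [int(year) for year in YEAR_RE.findall(sentence)]
        for performer in known_performers:
            # Word boundaries keep a name from matching inside a longer word.
            if re.search(r"\b" + re.escape(performer) + r"\b", sentence):
                facts.extend((year, performer) for year in years)
    return facts


sentences = [
    "Elvis Presley released a near identical version to Lee's two years "
    "following her cover, for his 1960 album, Elvis Is Back!",
    "During their 1982 world tour, the British group The Jam covered the "
    'song as part of a medley with their own "Pity Poor Alfie"',
]
performers = ["Peggy Lee", "Elvis Presley", "The Jam"]
for year, performer in extract_cover_facts(sentences, performers):
    print("{},{}".format(year, performer))
```

On the two sentences above this prints `1960,Elvis Presley` and `1982,The Jam`. Sentences that mention several performers or several years would produce spurious pairs, which is exactly what the evaluation step is meant to expose.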

3. We evaluate the result.


# Possible data sources

I chose the domain based on the information I would be able to acquire.
Ideally, I wanted a set of statements about a specific domain that
a) can be easily verified against the constructed KB,
b) have ground-truth labels,
c) are phrased in varied ways, so that rules have to be built to account for that variety.

## [Open Trivia DB](https://opentdb.com)


This was the first data source I considered appropriate. Using its REST API, I downloaded 1,000 batches of 50 random True/False questions with this script:
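```bash
#!/usr/bin/env bash

# Start from a clean output file (-f: no error if it does not exist yet).
rm -f input_raw.json
echo "[" >> input_raw.json
for i in $(seq 1 1000); do
    # Write the separator before every batch except the first,
    # so the final array has no trailing comma.
    if [ "$i" -gt 1 ]; then echo "," >> input_raw.json; fi
    # Fetch 50 boolean questions and strip the enclosing "[" and "]" lines.
    curl -s "https://opentdb.com/api.php?amount=50&type=boolean" | jq .results | sed '1d;$d' >> input_raw.json
    echo "."
done
echo "]" >> input_raw.json
```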
The analysis of the downloaded data follows.

In [1]:
import numpy as np
import pandas as pd
import gzip
import html
import re
import codecs
import spacy
import itertools
import hashlib
from IPython.display import clear_output
import sys

In [117]:
with gzip.open('input_raw.json.gz') as input_file:
    input_raw_df = pd.read_json(input_file, orient='records')

In [118]:
input_raw_df.head(5)

Unnamed: 0,category,correct_answer,difficulty,incorrect_answers,question,type
0,Science & Nature,False,easy,[True],Igneous rocks are formed by excessive heat and...,boolean
1,Entertainment: Video Games,True,medium,[False],Amazon acquired Twitch in August 2014 for $970...,boolean
2,Entertainment: Music,True,easy,[False],The music group Daft Punk got their name from ...,boolean
3,Geography,False,medium,[True],The flag of South Africa features 7 colours.,boolean
4,Geography,True,easy,[False],A group of islands is called an &#039;archipel...,boolean


In [119]:
input_unique = input_raw_df.drop_duplicates(subset=['question'])

In [73]:
input_unique.to_csv("open_trivia_unique.csv", index=False)

In [8]:
input_unique.shape

(492, 6)

After removing duplicates, only 492 statements are left :(

In [9]:
input_unique.groupby('category')['question'].agg('count')

category
Animals                                   16
Art                                        4
Celebrities                                4
Entertainment: Board Games                 8
Entertainment: Books                       5
Entertainment: Cartoon & Animations        4
Entertainment: Comics                      4
Entertainment: Film                       26
Entertainment: Japanese Anime & Manga     21
Entertainment: Music                      29
Entertainment: Musicals & Theatres         1
Entertainment: Television                 16
Entertainment: Video Games               106
General Knowledge                         50
Geography                                 35
History                                   37
Mythology                                  8
Politics                                  16
Science & Nature                          31
Science: Computers                        32
Science: Gadgets                           3
Science: Mathematics                      15
...

After going through the questions manually, and given the other datasets available, I settled on the movie domain. I will use questions from `Entertainment: Television`, `Entertainment: Film`, `Entertainment: Cartoon & Animations` and `Entertainment: Video Games`, manually narrowed down to those that state a fact about %X being in movie / series / game %Y.

In [22]:
input_unique[input_unique['category'] == 'Sports']['question'].apply(lambda x: print(x) or x)

Peyton Manning retired after winning Super Bowl XLIX.
Roger Federer is a famous soccer player.
Wilt Chamberlain scored his infamous 100-point-game against the New York Knicks in 1962.
Soccer player Cristiano Ronaldo opened a museum dedicated to himself.
Tennis was once known as Racquetball.
In association football, or soccer, a corner kick is when the game restarts after someone scores a goal.
Skateboarding will be included in the 2020 Summer Olympics in Tokyo.
In Rugby League, performing a &quot;40-20&quot; is punished by a free kick for the opposing team.
Formula E is an auto racing series that uses hybrid electric race cars.
Manchester United won the 2013-14 English Premier League.
The Olympics tennis court is a giant green screen.


16      Peyton Manning retired after winning Super Bow...
178              Roger Federer is a famous soccer player.
191     Wilt Chamberlain scored his infamous 100-point...
260     Soccer player Cristiano Ronaldo opened a museu...
318                 Tennis was once known as Racquetball.
390     In association football, or soccer, a corner k...
617     Skateboarding will be included in the 2020 Sum...
740     In Rugby League, performing a &quot;40-20&quot...
937     Formula E is an auto racing series that uses h...
1403    Manchester United won the 2013-14 English Prem...
1482    The Olympics tennis court is a giant green scr...
Name: question, dtype: object

## [Film Trivia 30](https://www.sporcle.com/games/sekula32/film-trivia-30-true-statements)
Another movie-related dataset, scraped directly from the page's iframe. It can be used to infer the "%X in movie %Y" relation.

In [81]:
# taken from https://www.sporcle.com/games/sekula32/film-trivia-30-true-statements
film_trivia_30 = """
Al Pacino and Dustin Hoffman never appeared in the same film
Alfred Hitchcock’s Psycho was the first American film ever to show a flushing toilet
Bob Hoskins and Paul Sorvino directed Super Mario Bros
Clint Eastwood became oldest Best Actor nominee, after receiving nomination for his role in Million Dollar Baby
Daniel Day-Lewis is only actor to win three Best Actor Oscars
Darth Vader only has 12 minutes of screen time in the original Star Wars
Denzel Washington is first African American to win the Best Actor Oscar
Die Hard was Alan Rickman's feature film debut
E.T. and Poltergeist were supposed to be the same film
Jason Scott Lee, who portrays Bruce Lee in Dragon: The Bruce Lee Story, is not related to Bruce Lee
Jason Voorhees is the killer in the original Friday the 13th
Jennifer Lawrence is youngest Best Actress winner ever
John Wayne never died in any of his films
Korbin Dallas and Zorg, hero and villain of The Fifth Element, never met each other
Nicolas Cage turned down the role of Aragorn in The Lord of the Rings
None of Kurosawa's films won Best Foreign Language Film Oscar
Quentin Tarantino wrote a screenplay for Layer Cake
Stanley Kubrick has one Oscar, but not for Best Director
Sylvester Stallone starred in a porn film
The Godfather: Part II is only sequel that won Oscar for Best Picture
The Matrix marks only collaboration between Keanu Reeves and Laurence Fishburne
The nude portrait of Kate Winslet in Titanic was actually drawn by Leonardo DiCaprio
The rain in Singin' in the Rain was actually milk
The Shawshank Redemption, #1 on IMDb was the 51st highest grossing film in 94′
The Shining received two Razzie nominations
The voice of the velociraptors communicating in Jurassic Park are actually children screaming
The word 'Zombie' is never used in Night of the Living Dead
Three Oscar winners portrayed Batman on film
Tim Burton is the director of The Nightmare Before Christmas
While writing 'Good Will Hunting' Ben Affleck and Matt Damon included a scene of gay sex
"""
"OK"

'OK'

## [The Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset/data)

This dataset is a candidate for the role of the KB.

In [23]:
#movies_metadata - keep id, imdb_id, original_title, title
#credits.csv - keep cast[gender/name/id], id
from zipfile import ZipFile
import ast 
with ZipFile('the-movies-dataset.zip', 'r') as myzip:
    with myzip.open('credits.csv') as input_file:
        credits_df = pd.read_csv(input_file, usecols=('id', 'cast'))
    with myzip.open('movies_metadata.csv') as input_file:
        metadata_df = pd.read_csv(input_file, usecols=('id', 'imdb_id', 'original_title', 'title'))


In [24]:
def unwrap_cast(cast):
    return [{'id': person['id'], 'gender': person['gender'], 'name': person['name']} for person in ast.literal_eval(cast)]

credits_df['cast'] = credits_df['cast'].apply(unwrap_cast)

In [33]:
credits_df.to_csv('movie_chars.csv', index = False)

In [37]:
credits_df['cast'][0]

[{'gender': 2, 'id': 31, 'name': 'Tom Hanks'},
 {'gender': 2, 'id': 12898, 'name': 'Tim Allen'},
 {'gender': 2, 'id': 7167, 'name': 'Don Rickles'},
 {'gender': 2, 'id': 12899, 'name': 'Jim Varney'},
 {'gender': 2, 'id': 12900, 'name': 'Wallace Shawn'},
 {'gender': 2, 'id': 7907, 'name': 'John Ratzenberger'},
 {'gender': 1, 'id': 8873, 'name': 'Annie Potts'},
 {'gender': 0, 'id': 1116442, 'name': 'John Morris'},
 {'gender': 2, 'id': 12901, 'name': 'Erik von Detten'},
 {'gender': 1, 'id': 12133, 'name': 'Laurie Metcalf'},
 {'gender': 2, 'id': 8655, 'name': 'R. Lee Ermey'},
 {'gender': 1, 'id': 12903, 'name': 'Sarah Freeman'},
 {'gender': 2, 'id': 37221, 'name': 'Penn Jillette'}]

In [70]:
metadata_df

Unnamed: 0_level_0,imdb_id,original_title,title
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
862,tt0114709,Toy Story,Toy Story
8844,tt0113497,Jumanji,Jumanji
15602,tt0113228,Grumpier Old Men,Grumpier Old Men
31357,tt0114885,Waiting to Exhale,Waiting to Exhale
11862,tt0113041,Father of the Bride Part II,Father of the Bride Part II
949,tt0113277,Heat,Heat
11860,tt0114319,Sabrina,Sabrina
45325,tt0112302,Tom and Huck,Tom and Huck
9091,tt0114576,Sudden Death,Sudden Death
710,tt0113189,GoldenEye,GoldenEye


In [71]:
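# DataFrame.get_value was removed in pandas 1.0; .at is the replacement (the index here is the string movie id).
metadata_df.at['862', 'title']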

'Toy Story'

In [34]:
metadata_df.to_csv('movie_list.csv', index = False)
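
To act as a KB, the cast lists and titles need to be queryable by actor and movie. The following is a minimal sketch of such a lookup using plain dictionaries in the same shape as the data above. The names `build_actor_index` and `actor_in_movie` are hypothetical, the sample holds only two cast entries, and exact lower-cased string matching is an assumption that real data (aliases, alternative titles) would quickly outgrow.

```python
from collections import defaultdict


def build_actor_index(credits, titles):
    """Map each lower-cased actor name to the set of lower-cased titles they appear in.

    `credits` maps movie id -> list of cast dicts (as produced by unwrap_cast);
    `titles` maps movie id -> title.
    """
    index = defaultdict(set)
    for movie_id, cast in credits.items():
        title = titles.get(movie_id)
        if title is None:
            continue  # credits without metadata cannot be verified
        for person in cast:
            index[person["name"].lower()].add(title.lower())
    return index


def actor_in_movie(index, actor, movie):
    """Return True if the KB lists `actor` in the cast of `movie`."""
    return movie.lower() in index.get(actor.lower(), set())


# Tiny in-memory sample in the shape of the data above.
titles = {"862": "Toy Story", "8844": "Jumanji"}
credits = {"862": [{"id": 31, "gender": 2, "name": "Tom Hanks"},
                   {"id": 12898, "gender": 2, "name": "Tim Allen"}]}
index = build_actor_index(credits, titles)
print(actor_in_movie(index, "Tom Hanks", "Toy Story"))
print(actor_in_movie(index, "Tim Allen", "Jumanji"))
```

The first call prints `True` and the second prints `False`, since the sample has no cast list for Jumanji.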

## [200,000+ Jeopardy! Questions](https://data.world/sya/200000-jeopardy-questions)

In [2]:
nlp = spacy.load('en_core_web_md')

In [34]:
with codecs.open("JEOPARDY_CSV.csv", "r",encoding='utf-8', errors='ignore') as fdata:
    jeopardy = pd.read_csv(fdata)

Given the large number of questions, the idea is to select the relevant ones based on
- the words starred / acted / featured in the question or category
- movie / cinema / screen / entities in the category

Then, manually or with a script, fill in the blanks or replace pronouns with the answer, so  
```it starred Gregory Hines as a slumlord named Scrooge",A Christmas Carol```  
becomes  
```A Christmas Carol starred Gregory Hines as a slumlord named Scrooge```  
**Afterthought**: there is no need to search the questions; filtering on categories alone is good enough.
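
The substitution described above can be sketched as a single regular-expression replacement. This is an illustration only: the name `fill_placeholder` is hypothetical, and it assumes that the first standalone pronoun or run of underscores in the clue stands for the answer, which will be wrong for some clues.

```python
import re

# Assumption: the first standalone pronoun or run of underscores stands for the answer.
PLACEHOLDER_RE = re.compile(
    r"""(?:^|(?<=\s))(?:he|it|she|this|these|who|_+)(?=(?:'s|\s|$|"))"""
)


def fill_placeholder(question, answer):
    """Substitute the answer for the first placeholder in the question."""
    # A function replacement keeps backslashes in the answer literal.
    return PLACEHOLDER_RE.sub(lambda match: answer, question, count=1)


print(fill_placeholder("it starred Gregory Hines as a slumlord named Scrooge",
                       "A Christmas Carol"))
```

The pattern is case-sensitive, so a clue that starts with a capitalised "He" or "This" is left unchanged and has to be handled manually.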

In [43]:
nlp("A JIM CARREY FILM FESTIVAL".lower()).ents

(jim,)

In [45]:
jeopardy.head(5)

Unnamed: 0,Show_Number,Air_Date,Round,Category,Value,Question,Answer,Category_entities
0,4680,12/31/04,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,()
1,4680,12/31/04,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,"((10),)"
2,4680,12/31/04,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,()
3,4680,12/31/04,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,()
4,4680,12/31/04,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,()


In [44]:
jeopardy['Category_entities'] = jeopardy['Category'].apply(lambda x: nlp(x.lower()).ents)

In [46]:
jeopardy['Question_entities'] = jeopardy['Question'].apply(lambda x: nlp(x.lower()).ents)

In [53]:
jeopardy.to_csv('jeopardy_with_ents.csv', index = False)

In [54]:
jeopardy.head(5)

Unnamed: 0,Show_Number,Air_Date,Round,Category,Value,Question,Answer,Category_entities,Question_entities
0,4680,12/31/04,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,(),"((the, last, 8, years),)"
1,4680,12/31/04,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,"((10),)","((2), (1912), (6), (&))"
2,4680,12/31/04,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,(),"((yuma), (4,055, hours))"
3,4680,12/31/04,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,(),"((1963), (billionth))"
4,4680,12/31/04,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,(),"((second),)"


In [3]:
# jeopardy = pd.read_csv('jeopardy_with_ents.csv')

In [4]:
jeopardy.head(5)

Unnamed: 0,Show_Number,Air_Date,Round,Category,Value,Question,Answer,Category_entities,Question_entities
0,4680,12/31/04,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,(),"(the last 8 years,)"
1,4680,12/31/04,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,"(10,)","(2, 1912, 6, &)"
2,4680,12/31/04,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,(),"(yuma, 4,055 hours)"
3,4680,12/31/04,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,(),"(1963, billionth)"
4,4680,12/31/04,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,(),"(second,)"


In [6]:
jeopardy[['Category_entities', 'Question_entities', 'Answer', 'Category']]

Unnamed: 0,Category_entities,Question_entities,Answer,Category
0,(),"(the last 8 years,)",Copernicus,HISTORY
1,"(10,)","(2, 1912, 6, &)",Jim Thorpe,ESPN's TOP 10 ALL-TIME ATHLETES
2,(),"(yuma, 4,055 hours)",Arizona,EVERYBODY TALKS ABOUT IT...
3,(),"(1963, billionth)",McDonald's,THE COMPANY LINE
4,(),"(second,)",John Adams,EPITAPHS & TRIBUTES
5,(),(),the ant,3-LETTER WORDS
6,(),"(312, today)",the Appian Way,HISTORY
7,"(10,)","(8, 30, birmingham, 2,306)",Michael Jordan,ESPN's TOP 10 ALL-TIME ATHLETES
8,(),"(the winter of 1971-72, a record 1,122 inches)",Washington,EVERYBODY TALKS ABOUT IT...
9,(),(),Crate & Barrel,THE COMPANY LINE


In [39]:
category_filter = jeopardy.Category.str.contains(r"\s(?:character|act|play|movie|cinema|film|tv|screen|role|sitcom)", case=False)
jeopardy[category_filter]


Unnamed: 0,Show_Number,Air_Date,Round,Category,Value,Question,Answer,Category_entities,Question_entities
118,3751,12/18/00,Jeopardy!,TV ACTORS & ROLES,$100,"Once Tommy Mullaney on ""L.A. Law"", John Spence...",The West Wing,(),"(l.a,)"
124,3751,12/18/00,Jeopardy!,TV ACTORS & ROLES,$200,Barbra Streisand knows he played Lt. Col. Bill...,James Brolin,(),()
130,3751,12/18/00,Jeopardy!,TV ACTORS & ROLES,$300,"(Hi, I'm Wallace Langham) I played Don Kirshn...",The Monkees,(),"(vh1,)"
136,3751,12/18/00,Jeopardy!,TV ACTORS & ROLES,$400,"Teri Hatcher looked ""shipshape"" as one of the ...",The Love Boat,(),"(1985,)"
142,3751,12/18/00,Jeopardy!,TV ACTORS & ROLES,$500,"On ""Saturday Night Live"", he's famous for play...",Will Ferrell,(),"(saturday night, craig)"
179,3673,7/19/00,Jeopardy!,1994 FILMS,$100,Quentin Tarantino directed this film & also ha...,Pulp Fiction,"(1994,)","(quentin tarantino, &, toluca)"
185,3673,7/19/00,Jeopardy!,1994 FILMS,$200,"As mad bomber Howard Payne in this film, Denni...",Speed,"(1994,)","(l.a,)"
191,3673,7/19/00,Jeopardy!,1994 FILMS,$300,"Jean Vander Pyl, who played Wilma in the origi...",The Flintstones,"(1994,)",()
197,3673,7/19/00,Jeopardy!,1994 FILMS,$400,"Containing the hit ""Can You Feel The Love Toni...",The Lion King,"(1994,)","(tonight, first)"
203,3673,7/19/00,Jeopardy!,1994 FILMS,$800,In this film Martin Scorsese says the TV audie...,Quiz Show,"(1994,)","(martin scorsese, )"


In [40]:
extra_category_filter = jeopardy.Category.isin(["3", "10", "21", "1921", "1946", "1955", "1957", "1959", "1994", "2001",
"____ AND ____", "...AND MAN CREATED WOMAN",
"...OR BUST", "'40s POP CULTURE", "'49ers", "'60s FLICKS", "'60s MUSIC SCENE",
"'60s POP MUSIC", "A.M.", "AA"])
jeopardy[extra_category_filter]

Unnamed: 0,Show_Number,Air_Date,Round,Category,Value,Question,Answer,Category_entities,Question_entities
4948,3003,9/24/97,Double Jeopardy!,1957,$200,On October 4 Russia launched this first satell...,Sputnik,"(1957,)","(october 4, russia, first)"
4954,3003,9/24/97,Double Jeopardy!,1957,$400,"When Wham-O introduced this toy in 1957, it wa...",Frisbee,"(1957,)","(1957,)"
4960,3003,9/24/97,Double Jeopardy!,1957,$600,"As the Teamsters' vice president, he was indic...",Jimmy Hoffa,"(1957,)",()
4966,3003,9/24/97,Double Jeopardy!,1957,$800,He ended his brief retirement to become chairm...,Armand Hammer,"(1957,)",()
4972,3003,9/24/97,Double Jeopardy!,1957,"$1,000","The first explorer to fly over both poles, he ...",Admiral Richard Byrd,"(1957,)","(first,)"
15928,3660,6/30/00,Jeopardy!,...AND MAN CREATED WOMAN,$100,"In this animated 1995 film, Annie Potts was a ...",Toy Story,(),"(1995,)"
15934,3660,6/30/00,Jeopardy!,...AND MAN CREATED WOMAN,$200,"Carmen Ibanez fights the bugs as a pilot in ""R...",Starship Troopers,(),()
15940,3660,6/30/00,Jeopardy!,...AND MAN CREATED WOMAN,$300,This 3-D tomb-raiding woman has done commercia...,Lara Croft,(),()
15946,3660,6/30/00,Jeopardy!,...AND MAN CREATED WOMAN,$400,Vanessa Angel played Lisa in the TV series of ...,Kelly LeBrock,(),()
15952,3660,6/30/00,Jeopardy!,...AND MAN CREATED WOMAN,$500,Webbie Tokay is a cyberspace supermodel create...,Elite,(),()


In [41]:
interesting_questions = jeopardy[category_filter | extra_category_filter]
interesting_questions

Unnamed: 0,Show_Number,Air_Date,Round,Category,Value,Question,Answer,Category_entities,Question_entities
118,3751,12/18/00,Jeopardy!,TV ACTORS & ROLES,$100,"Once Tommy Mullaney on ""L.A. Law"", John Spence...",The West Wing,(),"(l.a,)"
124,3751,12/18/00,Jeopardy!,TV ACTORS & ROLES,$200,Barbra Streisand knows he played Lt. Col. Bill...,James Brolin,(),()
130,3751,12/18/00,Jeopardy!,TV ACTORS & ROLES,$300,"(Hi, I'm Wallace Langham) I played Don Kirshn...",The Monkees,(),"(vh1,)"
136,3751,12/18/00,Jeopardy!,TV ACTORS & ROLES,$400,"Teri Hatcher looked ""shipshape"" as one of the ...",The Love Boat,(),"(1985,)"
142,3751,12/18/00,Jeopardy!,TV ACTORS & ROLES,$500,"On ""Saturday Night Live"", he's famous for play...",Will Ferrell,(),"(saturday night, craig)"
179,3673,7/19/00,Jeopardy!,1994 FILMS,$100,Quentin Tarantino directed this film & also ha...,Pulp Fiction,"(1994,)","(quentin tarantino, &, toluca)"
185,3673,7/19/00,Jeopardy!,1994 FILMS,$200,"As mad bomber Howard Payne in this film, Denni...",Speed,"(1994,)","(l.a,)"
191,3673,7/19/00,Jeopardy!,1994 FILMS,$300,"Jean Vander Pyl, who played Wilma in the origi...",The Flintstones,"(1994,)",()
197,3673,7/19/00,Jeopardy!,1994 FILMS,$400,"Containing the hit ""Can You Feel The Love Toni...",The Lion King,"(1994,)","(tonight, first)"
203,3673,7/19/00,Jeopardy!,1994 FILMS,$800,In this film Martin Scorsese says the TV audie...,Quiz Show,"(1994,)","(martin scorsese, )"


In [42]:
interesting_questions.to_csv('temp.csv')

In [84]:
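# Replace the first pronoun/blank with the answer; a function replacement keeps backslashes in answers literal.
cleaned_qs = interesting_questions.apply(lambda x: re.sub(r"(?:^|(?<=\s))(?:he|it|she|this|these|who|_+)(?=(?:'s|\s|$|\"))", lambda m: x.Answer, x.Question, count=1), axis=1)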

## [NELL](http://rtw.ml.cmu.edu/rtw/)

This one seems really noisy. It has some related information, but it is not very suitable, and annotating the other information would be an issue :(

## Combining all questions

In [67]:
all_qs = list()

In [120]:
input_unique_list = input_unique[input_unique['category'].isin(["Entertainment: Television",
                                            "Entertainment: Film", 
                                            "Entertainment: Cartoon & Animations",
                                            "Entertainment: Video Games"])]['question'].apply(html.unescape).tolist()

In [113]:
trivia_list = film_trivia_30.split("\n")[1:-1]

In [114]:
japardy_list = cleaned_qs.tolist()

In [115]:
japardy_list

['Once Tommy Mullaney on "L.A. Law", John Spencer now plays White House chief of staff Leo McGarry on The West Wing series',
 'Barbra Streisand knows James Brolin played Lt. Col. Bill "Raider" Kelly on "Pensacola: Wings of Gold"',
 '(Hi, I\'m Wallace Langham)  I played Don Kirshner in VH1\'s TV movie about The Monkees quartet who sang "Daydream Believer"',
 'Teri Hatcher looked "shipshape" as one of the singing "mermaids" The Love Boat jumped on board this cruisin\' series in 1985',
 'On "Saturday Night Live", Will Ferrell\'s famous for playing Craig the Cheerleader, Janet Reno & moi',
 'Quentin Tarantino directed Pulp Fiction film & also had a bit role as Jimmy of Toluca Lake',
 'As mad bomber Howard Payne in Speed film, Dennis Hopper planted a bomb on an L.A. area transit bus',
 'Jean Vander Pyl, The Flintstones played Wilma in the original cartoon series, played Mrs. Feldspar in this movie adaptation',
 'Containing the hit "Can You Feel The Love Tonight", The Lion King was Disney\'s

In [125]:
all_lines = itertools.chain(input_unique_list, trivia_list, japardy_list)

In [126]:
good = 0
total = 0  # renamed from `all` to avoid shadowing the built-in
for line in all_lines:
    clear_output()
    print("Select if this is about actor(s) in movie/show/series")
    print(line)
    # Keep only the first character; anything other than y or q skips the sentence.
    selection = input('y/n/q').lower().strip()[:1]
    total += 1
    if selection == 'y':
        good += 1
        # Name the file after the MD5 of the sentence, so a repeated sentence maps to the same file.
        hex_string = hashlib.md5(line.encode('utf-8')).hexdigest()
        with open("data/output_{}.txt".format(hex_string), 'w') as out:
            out.write(line)
    elif selection == 'q':
        print('Done at {} / {}'.format(good, total))
        break

Select if this is about actor(s) in movie/show/series
He was born in St. John, New Brunswick, but his son Kiefer was born in London, England
y/n/qq
Done at 149 / 634


Going through the first 634 samples, I annotated 149 sentences as matching what I need.

## Next, I am going to annotate the sentences with the following labels
- star_movie
- star_actor
- unrelated_movie
- unrelated_actor

Here unrelated_movie/unrelated_actor is a catch-all for movies/people that are not part of the actor-in-movie pair I am looking for.
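
One way to store such annotations is as labelled character spans over each sentence. The sketch below is an assumption about the storage format, not the scheme actually used: the `Span` class and `span_for` helper are hypothetical names, and the labels assigned to the sample sentence are illustrative.

```python
from dataclasses import dataclass

LABELS = {"star_movie", "star_actor", "unrelated_movie", "unrelated_actor"}


@dataclass(frozen=True)
class Span:
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    label: str

    def __post_init__(self):
        # Reject labels outside the agreed set and empty or negative spans.
        if self.label not in LABELS:
            raise ValueError("unknown label: {}".format(self.label))
        if not 0 <= self.start < self.end:
            raise ValueError("invalid span offsets")


def span_for(text, mention, label):
    """Build a Span for the first occurrence of `mention` in `text`."""
    start = text.index(mention)
    return Span(start, start + len(mention), label)


text = ("Quentin Tarantino directed Pulp Fiction film & also had a bit role "
        "as Jimmy of Toluca Lake")
annotations = [
    span_for(text, "Quentin Tarantino", "star_actor"),
    span_for(text, "Pulp Fiction", "star_movie"),
]
for span in annotations:
    print(span.label, text[span.start:span.end])
```

Character offsets keep the annotation independent of any particular tokenizer; they can be converted to token-level tags later if a tagger needs them.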