INTRO

The original project goal was to investigate a dataset of Jeopardy! questions and answers:

* build and use Python lambda functions
* understand the basics of NumPy
* create and manipulate pandas DataFrames
* work with aggregates and multiple DataFrames using pandas


But while investigating the dataset I set myself a more ambitious goal: to make my program actually answer questions, because figuring out how many times a particular word appears in the questions didn't seem like that big of a deal.

I would like to note that I'm a complete novice in ML, and I do not aim to build the ultimate question answering machine or even come close to what the whole IBM team achieved back in 2011, when the Watson computer beat Jeopardy! champions.

QUESTION ANSWERING

When I figured out what I wanted to do (answer questions), I found out that there is a whole discipline for it within NLP.

I learned a lot from the book "Speech and Language Processing, An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition" by Daniel Jurafsky from Stanford University and James H. Martin from the University of Colorado at Boulder. Below you'll find several passages from this book that I thought would be helpful for understanding this project.
________________________

As far back as the 1960s, there were two major paradigms of question answering: information-retrieval-based and knowledge-based.

IR-based factoid question answering has two stages: retrieval, which returns relevant documents from the collection, and reading, in which a neural reading comprehension system extracts answer spans.

In the second paradigm, knowledge-based question answering, a system instead builds a semantic representation of the query. These meaning representations are then used to query databases of facts.

Watson by IBM combines information from IR-based and knowledge-based sources.

OUR TASK

We will try IR-based question answering and rely on Google for the information retrieval stage, while our own task will be to get the right answer out of the retrieved passages.

Without further ado, let's get to work!
________________________

What we are going to use:

1. Python Fundamentals (Functions, Lists, Loops, Strings, List comprehension)
2. Data Manipulation with Pandas
3. Data Acquisition (Web scraping with BeautifulSoup)
4. Natural Language Processing (Text preprocessing, Regex)

Let's begin by importing the pandas library and opening our dataset as a DataFrame.

In [None]:
import pandas as pd

df = pd.read_csv('/kaggle/input/200000-jeopardy-questions/JEOPARDY_CSV.csv')


Let's look at the first several entries of our dataset and find out how many rows it has.

In [None]:
df.head(10)

In [None]:
print(len(df))

The columns we will make use of are Answer, Question and Category (if a question is too short to make sense of, we will add its Category to the search query).

We'll need to know the exact names of the columns to work with them.

In [None]:
print(df.columns)

We can see an unneeded space before the name in some columns. It wouldn't affect our task, but for the sake of tidiness we'll strip the column names and check them again.

In [None]:
df.columns = df.columns.str.strip()

In [None]:
print(df.columns)

Do we have any empty, null or NaN cells?

In [None]:
df.isnull().sum()

2 out of 216930 doesn't seem that bad, so we can just get rid of them.

In [None]:
df = df.dropna()
df = df.reset_index(drop=True)

print(df.isnull().sum())
print(len(df))

We will be asking Google the question, but then how do we get the answer?

If we plug some of these questions into Google, we see three kinds of outcome. For some questions Google gets the second part of an IR-based QA machine's work done, and the first result basically contains the answer. For many of the questions we get linked to some Jeopardy! site with the answers. And some queries get no relevant results at all.

Let's first try brute force and assume that we can get the right answer by counting words in the result snippets, because the right answer should appear frequently in the results. We obviously have to exclude the words from the question, and since the most frequent words of all are articles and prepositions such as 'the, a, to, in...', we have to get rid of them as well.
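
A minimal sketch of this counting idea, using made-up snippets and a tiny hand-written stop-word list instead of real search results. The names `STOP_WORDS`, `top_words`, `snippets` and `question_words` are illustrative assumptions and not part of the notebook, which uses nltk's stop words and lemmatization.

```python
from collections import Counter

# hypothetical, hand-picked stop words; the notebook itself uses nltk's list
STOP_WORDS = {"the", "a", "to", "in", "of", "was", "is", "by", "who"}

def top_words(snippets, question_words, k=3):
    """Return the k most frequent words that are neither stop words nor question words."""
    counts = Counter()
    for snippet in snippets:
        for word in snippet.lower().split():
            word = word.strip(".,!?\"'")  # drop punctuation glued to the word
            if word and word not in STOP_WORDS and word not in question_words:
                counts[word] += 1
    # most_common returns (word, count) tuples; keep only the words
    return [word for word, _ in counts.most_common(k)]

# made-up snippets for an imaginary question about the author of "Tom Sawyer"
snippets = [
    "Tom Sawyer was written by Mark Twain.",
    "Twain published the novel in 1876.",
    "The author of Tom Sawyer is Twain.",
]
question_words = {"author", "tom", "sawyer"}
print(top_words(snippets, question_words, k=1))
```

With these toy snippets 'twain' wins because it appears in all three of them, while the question words and the stop words never enter the count.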

In order to match the most frequent word with the correct answer, we also have to strip the same kinds of words from the answers. So the answers need almost the same preprocessing, but not exactly the same, because during data analysis we found some data traits that need to be dealt with separately: some answers are dual, with a second, equally valid answer in parentheses, and some questions are so short that we can make no sense of them without adding the Category.
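
These two traits can be illustrated in isolation. The sketch below is only an illustration: the helper names `strip_alternative` and `add_category_if_short`, the `min_length` parameter and the sample strings are assumptions made up for this example, and the extra `.strip()` is a convenience that the notebook does not need because it tokenizes the text afterwards.

```python
import re

def strip_alternative(answer):
    """Remove any text in parentheses or square brackets, then trim whitespace."""
    return re.sub(r"[\(\[].*?[\)\]]", "", answer).strip()

def add_category_if_short(question, category, min_length=20):
    """Prepend the category when the question is shorter than min_length characters."""
    if len(question) < min_length:
        return category + " " + question
    return question

# made-up examples, not rows from the dataset
print(strip_alternative("Theodore Roosevelt (Teddy)"))
print(strip_alternative("(Bill) Clinton"))
print(add_category_if_short("Mercury", "PLANETS"))
```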

We will make use of the nltk library and its WordNet lemmatizer.

In [None]:
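import nltk
# stopwords for filtering; punkt and wordnet are needed by word_tokenize and the lemmatizer below
for resource in ('stopwords', 'punkt', 'wordnet'):
    nltk.download(resource)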

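First we will create a helper function that guesses the most likely part of speech of a word, which the lemmatizer needs.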

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from collections import Counter
import re

lemmatizer = WordNetLemmatizer()

def get_part_of_speech(word):
    # look up every WordNet synset (sense) of the word
    probable_part_of_speech = wordnet.synsets(word)

    # count how many synsets belong to each part of speech:
    # noun, verb, adjective, adverb
    pos_counts = Counter()
    for pos in ("n", "v", "a", "r"):
        pos_counts[pos] = len([item for item in probable_part_of_speech if item.pos() == pos])

    # the most frequent part of speech wins; on a tie (or no synsets at all) we fall back to "n"
    most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
    return most_likely_part_of_speech
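
The voting logic of this helper can be tried out without WordNet. The sketch below is a stand-in under stated assumptions: `most_likely_pos` is a hypothetical name, and the tag lists are made up rather than taken from real synsets. Tags outside the four counted ones are ignored, just as in the function above (WordNet, for instance, marks satellite adjectives with 's').

```python
from collections import Counter

def most_likely_pos(synset_tags):
    """Pick the most frequent tag among 'n', 'v', 'a', 'r'; ties and empty input give 'n'."""
    # start every counted tag at zero so that 'n' is always the first key
    counts = Counter({"n": 0, "v": 0, "a": 0, "r": 0})
    counts.update(tag for tag in synset_tags if tag in counts)
    # most_common keeps the first-inserted key when counts are equal
    return counts.most_common(1)[0][0]

# made-up tag lists standing in for the pos() of each synset
print(most_likely_pos(["n", "v", "v"]))  # verbs outnumber nouns
print(most_likely_pos([]))               # unknown word, no synsets
print(most_likely_pos(["s", "s", "n"]))  # 's' is not one of the counted tags
```

The fallback to 'n' matters in practice, because names and rare words often have no synsets at all and are then lemmatized as nouns.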



In [None]:
def preprocess_question(data, n):
    # to get rid of all the articles and prepositions we'll use stopwords from nltk
    stop_words = set(stopwords.words('english'))
    question = data['Question'][n]
    # if the question is shorter than 20 characters we prepend its category
    if len(question) < 20:
        question = data['Category'][n] + ' ' + question
    # keep only letters, spaces and hyphens (digits and punctuation are dropped)
    words = re.sub("[^ a-zA-Z-]|[&'-]{2,}", "", question)
    words = word_tokenize(words)
    # lowercase every word, then lemmatize it using its most likely part of speech
    words = [word.lower() for word in words]
    words = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in words]

    # if the question has fewer than 5 words we might not want to lose any of them,
    # so we keep them all; otherwise we get rid of the stop words
    if len(words) < 5:
        return words
    return [word for word in words if word not in stop_words]

Let's try and see what our function does to a question.

In [None]:
process_q_3699 = preprocess_question(df, 3699)
print(df['Question'][3699])
print(process_q_3699)

In [None]:
process_q_5550 = preprocess_question(df, 5550)
print(df['Question'][5550])
print(process_q_5550)

Below we create a somewhat simplified function for preprocessing the answers.

In [None]:
def preprocess_answers(data):
    # build the stop-word set once instead of once per answer
    stop_words = set(stopwords.words('english'))
    answers = []
    for answer in data:
        # some answers carry an alternative answer in parentheses or brackets;
        # for now we get rid of the alternative using regex
        ans = re.sub(r"[\(\[].*?[\)\]]", "", answer)
        ans = word_tokenize(ans)
        ans = [word.lower() for word in ans if word.isalnum()]
        ans = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in ans]
        # drop the stop words
        answers.append([each for each in ans if each not in stop_words])
    return answers


We will preprocess all the answers, as it doesn't take much time, and compare our "before and after".

In [None]:
df['Processed answer'] = preprocess_answers(df['Answer'])

In [None]:
df['Answer'][12]

In [None]:
df['Processed answer'][12]

In [None]:
# number of words left in every processed answer
answer_lengths = df['Processed answer'].map(len)
total = len(answer_lengths)

x = int((answer_lengths == 0).sum())  # nothing left after preprocessing
y = int((answer_lengths == 1).sum())  # one word
z = int((answer_lengths == 2).sum())  # two words
u = int((answer_lengths >= 3).sum())  # three or more words

print(f"There are {x} answers with 0 words.\n"
      f"The reason is imperfect preprocessing, during which we 'lost' "
      f"{round(x / total * 100, 1)}% of all answers. We will get rid of these empty lists for now.\n")
print(f"There are {y} answers with 1 word, which is {round(y / total * 100, 1)}% of all answers. "
      f"These are the ones we will use.\n")
print(f"There are also {z} answers with 2 words and {u} answers with 3 or more words. "
      f"To simplify the task we will not use them.\n")
print(f"The total number of answers for now is {total}")

In [None]:
# keep only the rows whose processed answer is exactly one word long
one_word_answer_data = df[df['Processed answer'].map(lambda d: len(d)) == 1]

In [None]:
one_word_answer_data = one_word_answer_data.reset_index()

In [None]:
h = 0
for answer in one_word_answer_data['Processed answer']:
    if len(answer) == 0:
        h += 1
print(h)

In [None]:
len(one_word_answer_data)

Let's find out the actual efficiency of our method. We found out (using the almighty Internet) that for a 95% confidence level and a 5% margin of error, with our population size of 118832, we need a sample size of 383 for our test. We will pick those 383 questions randomly.
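
The figure of 383 can be reproduced with the standard sample-size formula for a proportion plus a finite population correction. This is a sketch under assumptions that the text does not state: a z-score of 1.96 for the 95% confidence level and the most conservative proportion p = 0.5. The function name `required_sample_size` is hypothetical.

```python
import math

def required_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Sample size for estimating a proportion, corrected for a finite population."""
    # sample size for an infinitely large population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # finite population correction
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(required_sample_size(118832))
```

For a population this large the correction barely matters: the uncorrected value is about 384.2, and the correction brings it just under 383.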

In [None]:
import random

# draw 383 distinct row positions; random.sample never repeats a value,
# and range() excludes its upper bound, so every index is a valid row
indices_list = random.sample(range(len(one_word_answer_data)), 383)

In [None]:
len(indices_list)

In order not to waste computing power we will not process all the questions, just the ones we picked randomly. After preprocessing them, we will form the queries to feed to our web scraper.
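
One possible way to turn a list of tokens into a search URL is `urllib.parse.quote_plus` from the standard library, which joins words with '+' and also escapes characters that are not safe in a URL. This is a sketch of an alternative rather than what the notebook does; the function name `build_query_url` and the sample tokens are made up for illustration.

```python
from urllib.parse import quote_plus

def build_query_url(tokens, base="https://google.com/search?q="):
    """Join the tokens with spaces and URL-encode them (spaces become '+')."""
    return base + quote_plus(" ".join(tokens))

# made-up tokens, not taken from the dataset
print(build_query_url(["mark", "twain", "novel"]))
print(build_query_url(["at&t", "founder"]))
```

For plain lowercase words the result is the same as joining the tokens with '+' by hand; the difference only shows up with characters such as '&', which would otherwise be read as a separator between URL parameters.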

In [None]:
one_word_answer_data.loc[71073]

In [None]:
queries = []
sample_questions_processed = []
for index in indices_list:
    tokens = preprocess_question(one_word_answer_data, index)
    # join the tokens with '+' (no trailing '+') so they can go straight into a search URL
    queries.append('+'.join(tokens))
    sample_questions_processed.append(tokens)
    


In [None]:
len(queries)

In [None]:
queries[33]

Next, using BeautifulSoup, we will get the results, preprocess them and right away compute the most frequent words. As they are given to us in the form of tuples, we need to extract just the words for the ease of working with them later.

NOTE: It takes time to go through all of the almost 400 queries (~4 sec per query).

We will also use spaCy to get the most frequent proper nouns from the results (the code picks them out by their part-of-speech tag, a rough stand-in for NER). Then we can compare these two almost brute-force methods of getting our answer. For that purpose we need the spacy library.

In [None]:
pip install bs4

In [None]:
%%time

import requests
from bs4 import BeautifulSoup
import spacy
from collections import Counter

# load the model and the stop words once instead of once per query
# ("en" is the spaCy 2.x shortcut; spaCy 3 and later need a full model name such as "en_core_web_sm")
nlp = spacy.load("en")
stop_words = set(stopwords.words('english'))
# words that show up in the results only because many hits are Jeopardy! answer pages
no = ['jeopardy', 'answer']

sample_results_frequency = []
sample_results_ner_fre = []

for index in range(len(queries)):
    question_tokens = sample_questions_processed[index]
    url = 'https://google.com/search?q=' + queries[index]
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser").select(".s3v9rd.AP7Wnd")
    # plain text of every result snippet
    results = [item.getText(strip=True) for item in soup]

    # method 1: the most frequent words in the snippets
    bag = []
    for result in results:
        tokens = [word.lower() for word in word_tokenize(result) if word.isalnum()]
        tokens = [lemmatizer.lemmatize(token, get_part_of_speech(token)) for token in tokens]
        for token in tokens:
            # skip stop words and words that already appear in the question
            if token in stop_words or token in question_tokens:
                continue
            # also skip near-duplicates of question words (edit distance of 1),
            # not counting question words listed in `no`
            if any(nltk.edit_distance(word, token) <= 1
                   for word in question_tokens if word not in no):
                continue
            bag.append(token)
    # most_common returns (word, count) tuples; we keep only the words
    sample_results_frequency.append([word for word, count in Counter(bag).most_common(3)])

    # method 2: the most frequent proper nouns according to spaCy
    nouns_processed = []
    for result in results:
        for token in nlp(result):
            if token.pos_ != 'PROPN':
                continue
            # strip dots (as in "U.S."), but keep a dot that is followed by a digit
            # or by two more non-space characters
            noun = re.sub(r'\.(?!(\S[^. ])|\d)', '', token.text)
            words = [word.lower() for word in word_tokenize(noun) if word.isalpha()]
            words = [lemmatizer.lemmatize(word, get_part_of_speech(word)) for word in words]
            for word in words:
                if word not in question_tokens and word not in stop_words and word not in no:
                    nouns_processed.append(word)
    # one entry per query, even when the search returned no snippets
    sample_results_ner_fre.append([word for word, count in Counter(nouns_processed).most_common(3)])


    

Now we collect the correct answers for our sample.

In [None]:
sample_answers = []

for index in indices_list:
    sample_answers.append(one_word_answer_data['Processed answer'][index])

Let's check out what we have got here.

In [None]:
for index in range(10):
    #print(sample_questions_processed[index])
    #print(queries[index])
    print(sample_answers[index])
    print(sample_results_frequency[index])
    print(sample_results_ner_fre[index])

Now it's time to get a score for our program. We score each question separately: if the word from the answer is among the 3 most frequent words in our results (except stop words), we add 1 point to our score.
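
The scoring rule can be sketched on toy data. Everything here is made up for illustration: the function name `top_k_hit_rate`, the answers and the predicted word lists are assumptions, not results from the notebook.

```python
def top_k_hit_rate(answers, predictions):
    """Share of questions (in %) whose answer word appears among the predicted words."""
    hits = sum(
        1
        for answer, predicted in zip(answers, predictions)
        if any(word in predicted for word in answer)
    )
    return round(hits / len(answers) * 100, 2)

# made-up one-word answers and made-up top-3 guesses for four questions
answers = [["twain"], ["paris"], ["mercury"], ["everest"]]
predictions = [
    ["twain", "novel", "mark"],     # hit
    ["france", "city", "capital"],  # miss
    ["planet", "mercury", "sun"],   # hit
    [],                             # no search results at all counts as a miss
]
print(top_k_hit_rate(answers, predictions))
```

Two of the four toy questions are hits, so the sketch reports 50.0.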

In [None]:
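score_fre = 0          # answer is among the 3 most frequent words
score_ner = 0          # answer is among the 3 most frequent proper nouns
score_ner_not_fre = 0  # answer was found by the proper-noun method only

# for now we score only the first 10 sampled questions;
# use len(sample_answers) to score the whole sample
n_scored = 10

for x in range(n_scored):
    for word in sample_answers[x]:
        in_fre = word in sample_results_frequency[x]
        in_ner = word in sample_results_ner_fre[x]
        score_fre += in_fre
        score_ner += in_ner
        score_ner_not_fre += in_ner and not in_fre

print(f'score fre: {round(score_fre / n_scored * 100, 2)} %')
print(f'score ner not fre: {round(score_ner_not_fre / n_scored * 100, 2)} %')
print(f'score ner: {round(score_ner / n_scored * 100, 2)} %')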


WORK IN PROGRESS...