# Investigating patterns in Jeopardy questions

Most of us are familiar with Jeopardy, but in case you're not, it is a popular TV quiz show in the US where participants answer questions to win money. It was hosted for many years by the absolute legend Alex Trebek until his unfortunate passing this year.

I'm writing this project on New Year's Eve and building on the hypothetical assumption that I would like to compete in Jeopardy. In order to gain any sort of competitive advantage, I'm going to take a look at a dataset of Jeopardy questions to see if I can figure out some patterns that might help me win (again - should I decide to compete). 

We're going to be using a dataset named jeopardy.csv containing 19,999 rows. Here are the columns in the dataset and what they mean (if you'd like to download the data yourself, you can follow this [link](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file)):

- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the number of dollars answering the question correctly is worth.
- Question -- the text of the question.
- Answer -- the text of the answer.

Let's go ahead and read in our dataset so we can explore its contents a bit. 

In [1]:
import pandas as pd

# read in dataset
jeopardy = pd.read_csv('jeopardy.csv')

# print first 5 rows
jeopardy.head(5)

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
# print columns of jeopardy
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


Some of the columns have spaces in front. Let's get rid of those quickly. 

In [3]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [4]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
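Typing the names out by hand works fine for seven columns. A less error-prone alternative, sketched here on a small stand-in DataFrame (`demo` is a hypothetical name, not the real dataset), is to strip the whitespace from whatever names pandas read in:

```python
import pandas as pd

# Stand-in frame with the same padded column names as the raw file
# (hypothetical demo frame, not the real jeopardy.csv).
demo = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round', ' Category',
                             ' Value', ' Question', ' Answer'])

# Strip leading/trailing whitespace from every column name at once.
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

The same line should work on `jeopardy` directly, and it keeps working if the file ever gains or loses a column.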

In [5]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


### Normalizing text and value columns

In order to avoid mistakes in our analysis, we're going to normalize the text columns (the Question and Answer columns). This ensures that words differing only in case or punctuation aren't treated as different words when compared.

Let's write a function that:

- Takes in a string.
- Converts the string to lowercase.
- Removes all punctuation in the string.
- Returns the string.

In [6]:
import re

def normalize_text(text): # takes in string
    text = text.lower() # make lower case
    text = re.sub(r'[^A-Za-z0-9\s]', '', text) # removes punctuation
    text = re.sub(r'\s+', ' ', text) # collapses runs of whitespace into one space
    return text # returns string

# apply function to question and answer columns
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
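To see what the function does to a single string, here is a quick check on two values based on the preview above (the function is repeated so the snippet runs on its own):

```python
import re

def normalize_text(text):
    text = text.lower()                         # make lower case
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # drop punctuation
    text = re.sub(r'\s+', ' ', text)            # collapse repeated whitespace
    return text

print(normalize_text('In 1963, live on "The Art Linkletter Show"'))
# in 1963 live on the art linkletter show
print(normalize_text("McDonald's"))
# mcdonalds
```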


Text columns are not the only ones needing normalizing here. The Value column should be numeric to allow us to manipulate it more easily, while the Air Date column should be in datetime format, not in string format. 

For the Value column, let's write a function that can:
- Take in a string.
- Remove any punctuation in the string.
- Convert the string to an integer.
- If the conversion has an error, assign 0 instead.
- Return the integer.

In [7]:
def normalize_values(text):
    text = re.sub(r'[^A-Za-z0-9\s]', '', text) # remove punctuation
    try:
        text = int(text)
    except ValueError: # if conversion fails, assign 0 instead
        text = 0
    return text

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
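A quick sanity check of the conversion on a few inputs. `'$200'` appears in the preview above; the other two inputs are assumed examples of a value with a thousands separator and a value that isn't a number at all:

```python
import re

def normalize_values(text):
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # remove punctuation
    try:
        return int(text)
    except ValueError:  # not a number after cleaning
        return 0

print(normalize_values('$200'))    # 200
print(normalize_values('$2,000'))  # 2000 (assumed input format)
print(normalize_values('None'))    # 0 (assumed non-numeric input)
```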

Let's now quickly change the Air Date column to a datetime column. 

In [8]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

### Is the answer deducible from the question?

Now that our dataset is cleaned up, we're going to try and answer our first question:

- How often is the answer deducible from the question?

To do so, we need to see how many times words in the answer also occur in the question. 

In [9]:
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [10]:
jeopardy['answer_in_question'].mean()

0.05900196524977763

According to our analysis, on average only about 6% of the words in an answer (a mean ratio of 0.059) also appear in the question. Consequently, I don't think that we'll be able to rely on words in the question giving away the answer for our studying strategy for Jeopardy 2021.
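To make that ratio concrete, here is the same matching logic run on two made-up rows. Plain dictionaries stand in for DataFrame rows, and both example rows are hypothetical:

```python
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")  # drops the first "the" only
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    # fraction of answer words that also occur in the question
    return match_count / len(split_answer)

# Hypothetical rows, already normalized
partial = {"clean_answer": "the grand canyon",
           "clean_question": "this grand arizona landmark was carved by the colorado river"}
none = {"clean_answer": "arizona",
        "clean_question": "the city of yuma in this state has a record"}

print(count_matches(partial))  # 0.5 ("grand" matches, "canyon" does not)
print(count_matches(none))     # 0.0
```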

### How often are questions repeats of older ones?

We can't completely answer this question, since we only have about 10% of the full Jeopardy question dataset, but we can at least work with what we have. 

To figure this out, we'll have to:

- Sort jeopardy in order of ascending air date.
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy.
- Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
    - If it does, increment a counter.
    - Add each word to terms_used.

We're removing words shorter than 6 characters so that we can filter out common words like 'the' and 'than', which don't tell us a whole lot about the question (unless the question is "the the the the the than?").

Let's give it a go. 

In [11]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('Air Date')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

In [12]:
print(jeopardy['question_overlap'].mean())

0.6876260592169802


According to our results, about 69% of the terms (of six or more characters) in a question have already appeared in an earlier question. Of course, we're only working with a fraction of all Jeopardy questions, and matching single terms is not the same as matching whole questions, so this result is not necessarily significant, but it does give us reason to look into this a bit more.
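Here is the same bookkeeping on two tiny made-up questions, which shows why the earliest questions score 0 and later ones score higher. The function name `overlap_ratios` and the sample questions are hypothetical:

```python
def overlap_ratios(questions):
    """questions: normalized question strings in air-date order."""
    terms_used = set()
    ratios = []
    for question in questions:
        words = [w for w in question.split(' ') if len(w) > 5]
        # count words that were already seen in earlier questions
        seen = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        ratios.append(seen / len(words) if words else 0)
    return ratios

sample = ["galileo studied the planets",
          "copernicus studied planets before galileo"]
print(overlap_ratios(sample))  # [0.0, 0.6]
```

The first question has nothing to overlap with, while three of the five long words in the second one ("studied", "planets", "galileo") were already used.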

### High value vs low value questions

Finally, let's say we wanted to study more high value questions instead of low value questions. That would be the perfect way to make more money, right? 

Well, we can actually figure out which terms correspond to high-value questions using a chi-squared test. First, we'll need to narrow down questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

Then we can loop through each of the terms in terms_used and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

Then we can find the words with the biggest differences in usage between high and low value questions by selecting words with the highest associated chi-squared values. We'll just do it with a small sample for now, since doing it for all the words would take a pretty long while. 
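To make the steps above concrete, here is the arithmetic for a single word, written out by hand. All of the counts are hypothetical, and `chi_squared_for_word` is just an illustrative helper, not part of the analysis below:

```python
def chi_squared_for_word(high_obs, low_obs, high_total, low_total):
    """Chi-squared statistic for one word, following the steps listed above."""
    total_rows = high_total + low_total
    # share of all questions that contain the word
    word_share = (high_obs + low_obs) / total_rows
    # expected counts if the word were spread evenly across both groups
    expected_high = word_share * high_total
    expected_low = word_share * low_total
    # sum of (observed - expected)^2 / expected over the two groups
    return ((high_obs - expected_high) ** 2 / expected_high
            + (low_obs - expected_low) ** 2 / expected_low)

# Hypothetical numbers: 30 high value and 70 low value questions,
# and a word that shows up in 6 high value and 4 low value questions.
stat = chi_squared_for_word(6, 4, 30, 70)
print(round(stat, 3))  # 4.286
```

With these numbers we'd expect 3 high value and 7 low value occurrences, so seeing 6 and 4 gives a fairly large statistic. The larger the statistic, the more the word's usage differs between the two groups.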

In [13]:
def determine_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

In [16]:
def high_low_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        # count the rows whose question contains the word passed in
        if word in row["clean_question"].split(" "):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [17]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(high_low_count(term))

observed_expected

[(0, 1),
 (0, 2),
 (1, 2),
 (0, 1),
 (0, 1),
 (1, 0),
 (1, 0),
 (1, 0),
 (1, 1),
 (2, 4)]

Okay, we've found the observed counts for a few terms, so now we'll compute the expected counts and the chi-squared value. 

In [20]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    
    expected_high_count = total_prop * high_value_count
    expected_low_count = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([expected_high_count, expected_low_count])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.06376233446880725, pvalue=0.8006453026878781)]

It looks like none of the terms had a significant difference in usage between high value and low value rows (all of the p-values are above 0.05). Also, the frequencies were all below 5, so the chi-squared test isn't reliable here. Further analysis would need to be done with terms that occur more frequently.
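One possible way to find those more frequent terms, sketched under a few assumptions: the helper name `count_terms_by_value`, the `min_total` cutoff of 5 (echoing the threshold mentioned above), and the sample rows are all hypothetical. It counts every term in a single pass instead of rescanning the whole dataset once per term:

```python
from collections import Counter

def count_terms_by_value(rows, min_total=5):
    """rows: iterable of (clean_question, high_value) pairs.
    Returns {term: (high_count, low_count)} for terms that appear
    in at least min_total questions."""
    high_counts, low_counts = Counter(), Counter()
    for question, high_value in rows:
        # a set, so a word repeated within one question counts once
        terms = {w for w in question.split() if len(w) > 5}
        if high_value == 1:
            high_counts.update(terms)
        else:
            low_counts.update(terms)
    all_terms = set(high_counts) | set(low_counts)
    return {t: (high_counts[t], low_counts[t])
            for t in all_terms
            if high_counts[t] + low_counts[t] >= min_total}

# Hypothetical rows: (normalized question, high_value flag)
rows = [("which planet is largest", 1),
        ("this planet has rings", 0),
        ("a famous painter", 0)]
frequent = count_terms_by_value(rows, min_total=2)
print(frequent)  # {'planet': (1, 1)}
```

On the real data, something like `zip(jeopardy['clean_question'], jeopardy['high_value'])` could be passed in as `rows`, and the resulting counts fed into the same expected-count and chi-squared calculation used above.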

### Conclusion

It's looking like Jeopardy questions are a tough nut to crack. We were able to delve into three specific questions:

- How often is the answer used in the question?
- How often are terms recycled in questions?
- Do certain terms in questions relate to more high-value questions?

We didn't have much luck in looking through each of these questions, but we did get some indication of terms being recycled, which would warrant further analysis. 