# Winning Jeopardy

Jeopardy is a popular US TV show in which participants answer questions to win money. This project looks for any edge that could help us win. We will work with a [dataset](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file) of about 20,000 Jeopardy questions and look for patterns in the questions that we could exploit.

## Data overview and cleaning

In [1]:
# Import modules
import pandas as pd
import re

In [2]:
# Import the dataset
jeopardy = pd.read_csv("jeopardy.csv", parse_dates=[" Air Date"])
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [3]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Show Number  19999 non-null  int64         
 1    Air Date    19999 non-null  datetime64[ns]
 2    Round       19999 non-null  object        
 3    Category    19999 non-null  object        
 4    Value       19999 non-null  object        
 5    Question    19999 non-null  object        
 6    Answer      19999 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.1+ MB


In [4]:
# Remove leading and trailing whitespaces from the columns names
jeopardy.columns = jeopardy.columns.str.strip()

Before analyzing the Jeopardy questions and answers, we need to normalize them. We will convert all text to lowercase and remove punctuation.

In [5]:
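def normalize_text(text):
    # Lowercase the text, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)

    return text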
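
As a quick sanity check of the normalization described above, the sketch below restates the function so that it runs on its own and applies it to two made-up sample strings (the samples are illustrative, not rows taken from the dataset).

```python
import re

def normalize_text(text):
    # Same steps as the notebook function: lowercase, strip punctuation,
    # collapse repeated whitespace
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

# Illustrative inputs (assumed, not taken from the dataset)
print(normalize_text("No. 2: 1912 Olympian;  football star"))  # no 2 1912 olympian football star
print(normalize_text("McDonald's"))                             # mcdonalds
```

Note that removing the apostrophe merges `McDonald's` into `mcdonalds`, which is fine for word matching as long as questions and answers are normalized the same way.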

In [6]:
jeopardy["Question_Old"] = jeopardy["Question"]
jeopardy["Question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["Answer_Old"] = jeopardy["Answer"]
jeopardy["Answer"] = jeopardy["Answer"].apply(normalize_text)

jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer | Question_Old | Answer_Old |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | for the last 8 years of his life galileo was u... | copernicus | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | no 2 1912 olympian football star at carlisle i... | jim thorpe | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | the city of yuma in this state has a record av... | arizona | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | in 1963 live on the art linkletter show this c... | mcdonalds | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | signer of the dec of indep framer of the const... | john adams | Signer of the Dec. of Indep., framer of the Co... | John Adams |


The `Value` column should be numeric. We need to remove the dollar sign and the thousands separator from each value, then convert the column from text to numbers.

In [7]:
# Remove the dollar sign and thousands separator
jeopardy["Value"] = jeopardy["Value"].str.replace(r"[$,]", "", regex=True)

# Check if all characters in the strings are numeric characters
jeopardy.loc[~jeopardy["Value"].str.isnumeric(), "Value"].value_counts()

None    336
Name: Value, dtype: int64

Every value can be converted to a number except the 336 rows that hold the string "None". We will treat those as zero.

In [8]:
# Convert "None" values to zero
jeopardy.loc[jeopardy["Value"] == "None", "Value"] = "0"

# Convert column Value to numeric
jeopardy["Value"] = jeopardy["Value"].astype(int)
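
The same cleanup can also be written with `pd.to_numeric`, which turns anything unparseable into `NaN` instead of raising an error. The sketch below is one possible alternative, shown on a small hand-made Series rather than the real column; the sample values are assumptions chosen to cover the three cases seen above.

```python
import pandas as pd

# Hand-made sample covering a plain value, a thousands separator and "None"
values = pd.Series(["$200", "$1,000", "None"])

# Strip the dollar sign and comma; regex=True is needed because the
# pattern is a character class, not a literal string
cleaned = values.str.replace(r"[$,]", "", regex=True)

# Unparseable strings such as "None" become NaN, which we then set to 0
numeric = pd.to_numeric(cleaned, errors="coerce").fillna(0).astype(int)

print(numeric.tolist())  # [200, 1000, 0]
```

This variant would also absorb any other unexpected string by mapping it to zero, so it is worth inspecting the non-numeric values first, as the notebook does.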

## Answers in questions

To figure out whether to study past questions, general knowledge, or not study at all, it would be helpful to figure out two things:
* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

We can answer the second question by seeing how often longer words (six or more characters) recur, and the first by seeing how often words in the answer also occur in the question. We'll start with the first question.

In [9]:
def question_answer_match(row):

    # Remove every occurrence of "the" and split Question and Answer on whitespace
    # (note: str.replace also strips "the" from inside longer words such as "other")
    question = row["Question"].replace("the","").split()
    answer = row["Answer"].replace("the","").split()
    if len(answer) == 0: return 0
    
    # Count how many times terms in the answer occur in the question
    match_count = 0

    for word in answer:
        
        if word in question: match_count += 1
    
    return match_count / len(answer)

jeopardy["answer_in_question"] = jeopardy.apply(question_answer_match, axis=1)

jeopardy["answer_in_question"].mean()

0.058400780789400364

On average, only about 6% of the words in an answer also appear in the question. This is a small share, so we probably can't count on working out the answer just from hearing the question. We'll probably have to study.
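
One caveat about the matching function: `str.replace("the", "")` removes the letters "the" wherever they occur, including inside longer words, which can create spurious matches. The sketch below is a token-level variant that splits first and then drops only the standalone word. The function name `question_answer_match_tokens` and the sample row are hypothetical, and the variant has not been run on the dataset, so the 6% figure above still refers to the original function.

```python
def question_answer_match_tokens(row):
    # Split first, then drop only the standalone token "the"
    question = [w for w in row["Question"].split() if w != "the"]
    answer = [w for w in row["Answer"].split() if w != "the"]
    if len(answer) == 0:
        return 0

    # Share of answer words that also occur in the question
    match_count = sum(1 for word in answer if word in question)
    return match_count / len(answer)

# Hypothetical row showing the difference between the two approaches
row = {"Question": "this other planet", "Answer": "or"}

# Substring removal turns "other" into "or", creating a false match
print(row["Question"].replace("the", "").split())  # ['this', 'or', 'planet']

# The token-level version leaves "other" intact, so there is no match
print(question_answer_match_tokens(row))  # 0.0
```

It can be applied the same way as the original, with `jeopardy.apply(question_answer_match_tokens, axis=1)`.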

## Recycled questions

Let's now investigate how often new questions are repeats of older ones. For each question, we will compute the share of its words that have already appeared in earlier questions. We will only look at words with six or more characters to filter out words like `the` and `than`, which are common but don't tell us much about a question.

In [10]:
question_overlap = []
terms_used = set()

for i, row in jeopardy.iterrows():
    
    # Split the question around whitespaces
    question = row["Question"].split()
    
    # Remove any words that are less than 6 characters long
    question = [word for word in question if len(word) >= 6]

    # Count how many of these words have already appeared in earlier questions
    match_count = 0

    for word in question:
        if word in terms_used:
            match_count += 1
    for word in question:
        terms_used.add(word)

    # Convert the count to a proportion of the question's long words
    if len(question) > 0:
        match_count /= len(question)
    
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.6908737315671878
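
The loop above compares individual words only. As a rough sketch of how the same check could be extended to two-word phrases, the function below tracks bigrams (pairs of adjacent words) instead of single terms. The name `bigram_overlap` and the sample questions are made up for illustration, and this has not been run on the dataset.

```python
def bigram_overlap(questions):
    """For each question, return the share of its two-word phrases seen in earlier questions."""
    seen = set()
    overlaps = []

    for question in questions:
        words = question.split()
        # Pair each word with the one that follows it
        bigrams = list(zip(words, words[1:]))

        if bigrams:
            overlaps.append(sum(b in seen for b in bigrams) / len(bigrams))
        else:
            # Questions with fewer than two words have no phrases to compare
            overlaps.append(0)

        seen.update(bigrams)

    return overlaps

# Made-up, already normalized sample questions
sample = [
    "this state is home to the grand canyon",
    "the grand canyon is in this state",
    "one",
]
print(bigram_overlap(sample))  # [0.0, 0.5, 0]
```

On the real data it could be called as `bigram_overlap(jeopardy["Question"])`. Because it keeps short words, common pairs such as "of the" will inflate the score unless they are filtered out.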

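On average, about 69% of the long terms in a question have already appeared in earlier questions. This only looks at single terms, not phrases, so it is a rough measure. Still, it suggests that the recycling of questions deserves a closer look.

## Low value vs high value questions

Next, we check whether some terms appear more often in high-value questions (worth more than $800) than in low-value ones. For ten randomly chosen terms, we count their occurrences in each group and use a chi-squared test to compare those counts with what the overall share of high-value and low-value questions would predict.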

In [11]:
jeopardy["high_value"] = jeopardy["Value"].apply(lambda x: 1 if x > 800 else 0)

In [12]:
def count_usage(word):
    
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy.iterrows():
        
        if word in row["Question"].split():
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1

    return high_count, low_count
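
`count_usage` walks the whole DataFrame once per term, which becomes slow when many terms are tested. A possible alternative, sketched below, makes a single pass and counts every term at once. The function name `count_usage_all` and the sample data are hypothetical; like `count_usage`, it counts each question at most once per word.

```python
from collections import Counter

def count_usage_all(questions, high_flags):
    """Count, for every word, how many high-value and low-value questions contain it."""
    high_counts = Counter()
    low_counts = Counter()

    for question, is_high in zip(questions, high_flags):
        # A set ensures a word repeated within one question counts once
        words = set(question.split())
        if is_high == 1:
            high_counts.update(words)
        else:
            low_counts.update(words)

    return high_counts, low_counts

# Made-up sample: one high-value question followed by two low-value ones
questions = ["famous painter of sunflowers", "painter and sculptor", "famous famous river"]
flags = [1, 0, 0]

high, low = count_usage_all(questions, flags)
print(high["painter"], low["painter"])  # 1 1
print(high["river"], low["river"])      # 0 1
```

With the notebook's data, the call would be `count_usage_all(jeopardy["Question"], jeopardy["high_value"])`, after which `(high[term], low[term])` gives the same pair that `count_usage(term)` returns.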

In [13]:
from random import choice

comparison_terms = [choice(list(terms_used)) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 1),
 (0, 1),
 (36, 79),
 (1, 0),
 (11, 9),
 (5, 3),
 (1, 1),
 (0, 2),
 (0, 1),
 (1, 0)]

In [14]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.3898148396824147, pvalue=0.5323967459066039),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=6.779091835847712, pvalue=0.009223180537640667),
 Power_divergenceResult(statistic=4.476558568129228, pvalue=0.03436284804287323),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

## Conclusions

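Two of the ten terms have p-values below 0.05 (about 0.009 and 0.034), and both appear in high-value questions more often than expected. This is weak evidence, though. Most of the sampled terms occur only once or twice, and the chi-squared test is not reliable at such low frequencies. With ten tests run at once, a low p-value or two can also arise by chance. It would be better to rerun the test using only terms with higher frequencies.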
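
One way to restrict the test to sufficiently frequent terms is to keep only those whose expected count is at least 5 in both groups, a commonly used rule of thumb for the chi-squared test. The sketch below applies that filter to the observed counts printed above. The helper names and the `high_share` value of 0.29 are assumptions for illustration; in the notebook the share would be `high_value_count / jeopardy.shape[0]`.

```python
def expected_counts(observed, high_share):
    """Expected (high, low) counts for a term, given the overall share of high-value rows."""
    total = sum(observed)
    return total * high_share, total * (1 - high_share)

def frequent_enough(observed, high_share, min_expected=5):
    """True if both expected counts reach the minimum (rule-of-thumb threshold)."""
    return min(expected_counts(observed, high_share)) >= min_expected

# Observed (high, low) counts copied from the output above
observed_counts = [(0, 1), (0, 1), (36, 79), (1, 0), (11, 9),
                   (5, 3), (1, 1), (0, 2), (0, 1), (1, 0)]

# Assumed share of high-value questions, for illustration only
high_share = 0.29

kept = [obs for obs in observed_counts if frequent_enough(obs, high_share)]
print(kept)  # [(36, 79), (11, 9)]
```

Only two of the ten sampled terms pass this filter, which illustrates why sampling terms at random from `terms_used` mostly yields words that are too rare to test.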