# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for a few decades now.

Let's say we want to compete on Jeopardy and are looking for an edge to win. In this project, we will work with a dataset of Jeopardy questions to find patterns in the questions that could help us win.

The dataset we will work with contains the first 19,999 rows (roughly 20,000) of a full dataset of Jeopardy questions.

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


## Data Cleaning

The column names in this dataset have some leading spaces. Let's start by removing those.

In [2]:
jeopardy.columns
jeopardy.columns = jeopardy.columns.str.strip()

In [3]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


Before we start analyzing the dataset, we also need to normalize all of the text columns. We will lowercase all words and remove punctuation next.

In [4]:
import re

def normalize_text(text):
    text = text.lower()
    text = re.sub(r'[^A-Za-z0-9\s]', "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_values(text):
    text = re.sub(r'[^A-Za-z0-9\s]', "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
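
To see what the normalization does, here is a minimal sketch that restates the two helpers above (with raw-string patterns) and runs them on a few sample strings. The sample question is illustrative text modeled on the preview rows, not a row taken from the dataset.

```python
import re

def normalize_text(text):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_values(text):
    # Strip punctuation such as "$" and ",", then convert to an integer.
    # Anything that cannot be parsed (for example "None") becomes 0.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

# Illustrative inputs (assumed, not read from jeopardy.csv).
clean_q = normalize_text('In 1963, live on "The Art Linkletter Show", this company...')
clean_v = normalize_values("$2,000")
clean_none = normalize_values("None")

print(clean_q)     # in 1963 live on the art linkletter show this company
print(clean_v)     # 2000
print(clean_none)  # 0
```

Note that unparsable values silently become 0, so they end up in the low value group later in the analysis.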

## Deducing the Answer from the Question

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

You can answer the second question by seeing how often complex words (6 or more characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question.

Let's work on answering the first question first.

In [5]:
def count_matches(row):
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    return match_count/len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count_matches,axis=1)

jeopardy['answer_in_question'].mean()


0.05898946462474648

On average, only about 6% of the words in an answer also appear in the question. This is a small number, which means we can't rely on the question itself to give away the answer. We will likely have to study trivia!
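
To make the metric concrete, the sketch below applies the same matching idea to three made-up question and answer pairs. The function name `match_fraction` and the sample strings are hypothetical. Unlike `count_matches`, it drops every occurrence of 'the' from the answer rather than only the first one.

```python
def match_fraction(clean_question, clean_answer):
    # Words of the answer, ignoring the article "the".
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    # Fraction of answer words that also appear in the question.
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up, already-normalized examples.
hit = match_fraction("this state is home to the city of new york", "new york")
partial = match_fraction("the city of yuma in this state has a record", "arizona state")
miss = match_fraction("signer of the dec of indep", "john adams")

print(hit, partial, miss)  # 1.0 0.5 0.0
```

A dataset-wide mean of about 0.06 means that most rows look like the last example, where no answer word appears in the question.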

## Repeat Questions

This dataset is only 10% of the full Jeopardy question dataset. So if we wanted to investigate how often questions are repeats of older ones, we can't completely answer this with the dataset but it is worth investigating. 

To do this, we will sort the dataset by ascending air date, keep only words of 6 or more characters, and check whether each word has occurred before. This lets us see whether terms in questions have been used previously. Looking only at words of 6 or more characters filters out words like 'the' and 'than', which are common but not helpful.

In [6]:
question_overlap = []
terms_used = set()

jeopardy.sort_values('Air Date',inplace = True,ascending=True)

for i,row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap

jeopardy['question_overlap'].mean()

0.6876260592169776

There is about 69% overlap between terms in new questions and terms in old questions. This only covers a small set of questions, and it only looks at single terms rather than phrases, so the result is not conclusive on its own. It does suggest that the recycling of questions is worth investigating further.
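
The overlap score is easier to follow on a tiny example. The sketch below runs the same idea over three made-up questions that are assumed to be already normalized and sorted by air date. The function name `question_overlaps` and the `min_len` parameter are hypothetical.

```python
def question_overlaps(questions, min_len=6):
    seen = set()      # long words seen in earlier questions
    overlaps = []
    for question in questions:
        words = [w for w in question.split(" ") if len(w) >= min_len]
        if words:
            # Share of this question's long words that appeared before.
            overlaps.append(sum(w in seen for w in words) / len(words))
        else:
            overlaps.append(0)
        seen.update(words)
    return overlaps

toy_questions = [
    "this president signed the treaty",
    "the treaty ended this conflict",
    "a short one",
]
toy_overlaps = question_overlaps(toy_questions)
print(toy_overlaps)  # [0.0, 0.5, 0]
```

The first question always scores 0 because nothing has been seen yet, and a question with no long words also scores 0, so both cases pull the mean down.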

## Low Value vs High Value Questions

Let's say you only want to study terms that come up in high value questions rather than low value questions. This will help you earn more money when you're on Jeopardy.


You can figure out which terms correspond to high-value questions using a chi-squared test. We first will need to narrow down questions into 2 categories:

- Low value - row where value is 800 or less
- High value - row where value is greater than 800

In [7]:
def determine_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value,axis = 1)

def count_usage(term):
    low_count = 0
    high_count = 0
    
    for i,row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if term in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count,low_count

from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for word in comparison_terms:
    observed_expected.append(count_usage(word))
    
observed_expected

[(0, 1),
 (0, 1),
 (3, 8),
 (0, 1),
 (0, 2),
 (0, 6),
 (1, 0),
 (0, 3),
 (2, 2),
 (1, 2)]
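
`count_usage` rescans every row for each term, which gets slow once more than a handful of terms are compared. As a rough sketch of an alternative, the counts for all terms can be collected in a single pass. The function name `count_usage_all` and the toy rows below are hypothetical. With the real data, the rows would come from the `clean_question` and `high_value` columns.

```python
from collections import Counter

def count_usage_all(rows):
    # rows: iterable of (clean_question, high_value) pairs.
    high_counts, low_counts = Counter(), Counter()
    for clean_question, high_value in rows:
        # set() so a term repeated inside one question counts once,
        # matching the `term in split_question` check in count_usage.
        for term in set(clean_question.split(" ")):
            if high_value == 1:
                high_counts[term] += 1
            else:
                low_counts[term] += 1
    return high_counts, low_counts

toy_rows = [
    ("this capital city hosted the olympics", 1),
    ("this capital is on the thames", 0),
    ("this river flows through the capital", 0),
]
high_counts, low_counts = count_usage_all(toy_rows)
capital_counts = (high_counts["capital"], low_counts["capital"])
print(capital_counts)  # (1, 2)
```

Each tuple has the same (high, low) order that `count_usage` returns.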

Now that we have found the observed counts for a few terms, we can compute the expected counts and chi-squared value.

In [8]:
from scipy.stats import chisquare 
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for val in observed_expected:
    total = sum(val)
    total_prop = total/jeopardy.shape[0]
    expected_high_val = total_prop * high_value_count
    expected_low_val = total_prop * low_value_count
    
    observed = np.array([val[0],val[1]])
    expected = np.array([expected_high_val,expected_low_val])
    chi_squared.append(chisquare(observed,expected))
    
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.01052283698924083, pvalue=0.9182956181393399),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.411777076761304, pvalue=0.120425590069509),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766901714),
 Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293)]
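
To see where these numbers come from, the calculation for a single term can be written out by hand. The sketch below uses only the standard library. The group sizes of 300 high value and 700 low value rows are made up for illustration and are not the real counts from this dataset, and the function name `chi_squared_two_cells` is hypothetical.

```python
import math

def chi_squared_two_cells(observed_high, observed_low, high_total, low_total):
    n_rows = high_total + low_total
    term_total = observed_high + observed_low
    # Expected counts if the term were spread in proportion to group size.
    expected_high = term_total / n_rows * high_total
    expected_low = term_total / n_rows * low_total
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # Two cells give one degree of freedom, where the chi-squared
    # survival function reduces to erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return expected_high, expected_low, statistic, p_value

# Observed counts (3, 8) with assumed group sizes of 300 and 700 rows.
exp_high, exp_low, stat, p = chi_squared_two_cells(3, 8, high_total=300, low_total=700)
print(round(exp_high, 2), round(exp_low, 2), round(stat, 4))  # 3.3 7.7 0.039
```

The observed counts sit close to the expected ones, so the statistic is small and the p-value is large.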

## Chi-Squared Results

If you examine the p-values at a significance level of 0.05, you will find no statistically significant results. None of the terms showed a significant difference in usage between high and low value questions.

Most of the frequencies were also lower than 5, so the chi-squared test is not reliable here. It would be better to run this test with only terms that have higher frequencies.
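
One way to restrict the test to higher-frequency terms is to keep only terms whose expected counts reach a minimum in both groups. The sketch below assumes per-term counts are already available as dictionaries. The function name `terms_with_enough_data`, the threshold parameter `min_expected`, and all the counts are hypothetical.

```python
def terms_with_enough_data(high_counts, low_counts, high_total, low_total,
                           min_expected=5):
    n_rows = high_total + low_total
    keep = []
    for term in set(high_counts) | set(low_counts):
        term_total = high_counts.get(term, 0) + low_counts.get(term, 0)
        expected_high = term_total * high_total / n_rows
        expected_low = term_total * low_total / n_rows
        # Keep the term only if both expected counts are large enough.
        if min(expected_high, expected_low) >= min_expected:
            keep.append(term)
    return sorted(keep)

# Made-up counts for three terms, with 300 high and 700 low value rows.
toy_high = {"capital": 12, "president": 5}
toy_low = {"capital": 20, "president": 10, "yuma": 1}
usable_terms = terms_with_enough_data(toy_high, toy_low, high_total=300, low_total=700)
print(usable_terms)  # ['capital']
```

Here 'president' is dropped because its expected high value count is 4.5, just under the threshold, even though it appears 15 times in total.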