
In [1]:
import pandas as pd
import string
from scipy.stats import chisquare

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')

In [3]:
print(jeopardy.head(5))
print(jeopardy.columns)

   Show Number  ...      Answer
0         4680  ...  Copernicus
1         4680  ...  Jim Thorpe
2         4680  ...     Arizona
3         4680  ...  McDonald's
4         4680  ...  John Adams

[5 rows x 7 columns]
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [5]:
jeopardy.rename(columns=lambda x: x.replace(' ',''),inplace=True)

In [6]:
def normalize_text(text):
    """Strip ASCII punctuation from the text and lowercase it."""
    return ''.join(ch for ch in text if ch not in string.punctuation).lower()

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

In [7]:
def normalize_dollars(text):
    """Strip punctuation and convert to int; values that are not numeric become 0."""
    digits = ''.join(ch for ch in text if ch not in string.punctuation)
    try:
        return int(digits)
    except ValueError:
        return 0

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollars)
# Convert the whole column at once instead of parsing one value at a time
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])

In [8]:
def avg_matches(row):
    split_answer = row['clean_answer'].split(' ')
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    split_question = row['clean_question'].split(' ')
    match_count = 0
    for a in split_answer:
        if a in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [9]:
jeopardy['answer_in_question'] = jeopardy.apply(avg_matches,axis=1)

In [10]:
print(jeopardy['answer_in_question'].mean())

0.060352773854699004


On average, only about 6% of the words in a Jeopardy answer also appear in the corresponding question. That overlap is small, so giving extra consideration to candidate answers that reuse words from the question may help occasionally, but it is a weak strategy on its own.
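
To make the metric concrete, here is a minimal sketch of the same word-overlap calculation applied to a single question and answer. The helper name `overlap_fraction` and the sample strings are illustrative assumptions and do not come from `jeopardy.csv`; the logic is meant to mirror `avg_matches`.

```python
def overlap_fraction(clean_question, clean_answer):
    """Fraction of answer words that also appear in the question."""
    answer_words = clean_answer.split(' ')
    if 'the' in answer_words:
        # Like avg_matches, this drops only the first occurrence of 'the'
        answer_words.remove('the')
    if len(answer_words) == 0:
        return 0
    question_words = clean_question.split(' ')
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Hypothetical, already-normalized example (not a row from the dataset)
question = 'this river runs through the grand canyon'
answer = 'the colorado river'
example_score = overlap_fraction(question, answer)
print(example_score)  # prints 0.5: 'river' matches, 'colorado' does not
```

A score of 0.5 for one question is far above the dataset mean of about 0.06, which suggests that most questions share no words with their answers at all.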

In [11]:
question_overlap = []
terms_used = set()
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [n for n in split_question if len(n) >= 6]
    match_count = 0
    
    for q in split_question:
        if q in terms_used:
            match_count += 1
        terms_used.add(q)
    question_overlap.append(match_count)
    
question_overlap = pd.Series(question_overlap)
jeopardy['question_overlap'] = question_overlap
print(jeopardy['question_overlap'].mean())

3.073953697684884


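On average, about three words of six or more characters in each question had already appeared in an earlier row of the dataset, which suggests that terms, and possibly whole questions, are recycled to some extent. Two caveats apply: this is a raw count rather than a share of each question's words, and "earlier" refers to row order in the file, since the rows are not sorted by air date before the loop runs.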
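
One way to make the overlap figure easier to interpret is to express it as a share of each question's long words rather than a raw count. The sketch below does that on a few made-up questions. The function name `repeated_term_share`, its `min_length` parameter, and the sample data are assumptions for illustration. On the real data, the input would presumably be the `clean_question` column after sorting by `AirDate`, for example with `jeopardy.sort_values('AirDate')`.

```python
def repeated_term_share(clean_questions, min_length=6):
    """For each question, the share of its long words seen in earlier questions."""
    seen = set()
    shares = []
    for question in clean_questions:
        # Use a set so a word repeated inside one question cannot match itself
        words = {w for w in question.split(' ') if len(w) >= min_length}
        if words:
            shares.append(len(words & seen) / len(words))
        else:
            shares.append(0)
        seen |= words
    return shares

# Hypothetical, already-normalized questions in chronological order
sample_questions = [
    'this planet is nearest the sun',
    'the largest planet in our system',
    'this french emperor was exiled to elba',
]
shares = repeated_term_share(sample_questions)
print(shares)
```

In this toy data only the second question reuses a long word (`planet`, one of its three), so its share is one third and the others are zero.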

In [12]:
def high_low(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value
jeopardy['high_value'] = jeopardy.apply(high_low, axis=1)

def high_low_count(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_expected = []
terms_used = list(terms_used)
comparison_terms = terms_used[0:5]

for t in comparison_terms:
    observed_expected.append(high_low_count(t))

In [13]:
high_value_count = sum(jeopardy['high_value'])
low_value_count = jeopardy.shape[0] - high_value_count
chi_squared = []

for observed in observed_expected:
    # Expected counts if the term were split between high- and low-value
    # questions in the same proportion as the dataset as a whole
    total = observed[0] + observed[1]
    total_prop = total / jeopardy.shape[0]
    high_value_expected = total_prop * high_value_count
    low_value_expected = total_prop * low_value_count
    chisq, pval = chisquare(observed, [high_value_expected, low_value_expected])
    chi_squared.append([chisq, pval])
print(chi_squared)

[[2.487792117195675, 0.11473257634454047], [0.401962846126884, 0.5260772985705469], [0.401962846126884, 0.5260772985705469], [0.401962846126884, 0.5260772985705469], [0.401962846126884, 0.5260772985705469]]


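None of the five terms I tested shows a statistically significant difference between high-value and low-value questions at the 0.05 threshold: the smallest p-value is about 0.11 and the other four are about 0.53. The sample is also very limited, since the five terms are simply the first five taken from an unordered set. The identical statistics are consistent with terms that each occur in only one question, and the chi-squared test is not reliable when the expected counts are that small.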
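
The calculation above can be worked by hand for a single term, which shows why a term that occurs only once cannot produce a meaningful result. The sketch below computes the Pearson chi-squared statistic directly and, for the two-category case (1 degree of freedom), derives the p-value from the standard-library `math.erfc`. The helper `chi_squared_1df` and all of the counts are hypothetical, chosen only for illustration.

```python
import math

def chi_squared_1df(observed, expected):
    """Pearson chi-squared statistic and p-value for two categories (1 d.o.f.)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, P(X > stat) equals erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical dataset sizes (not the real counts from jeopardy.csv)
total_rows = 1000
high_rows = 300                      # rows with clean_value > 800
low_rows = total_rows - high_rows

# A term that appears exactly once, in a high-value question
observed = (1, 0)
term_share = sum(observed) / total_rows
expected = (term_share * high_rows, term_share * low_rows)

stat, p_value = chi_squared_1df(observed, expected)
print(stat, p_value)
```

With a single occurrence, the statistic reduces to `low_rows / high_rows` (or its reciprocal if the one occurrence is in a low-value question), so it reflects the overall split of the dataset rather than anything about the term itself.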

It may be more informative to repeat the test on terms that appear in many questions, where the expected counts are large enough for the chi-squared test to be meaningful.
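
A minimal sketch of how the most frequent terms might be selected, using `collections.Counter` from the standard library. The function name `most_frequent_terms`, its parameters, and the sample questions are assumptions for illustration; on the real data the input would presumably be the `clean_question` column.

```python
from collections import Counter

def most_frequent_terms(clean_questions, top_n=5, min_length=6):
    """Return the top_n long words, ranked by how many questions contain them."""
    counts = Counter()
    for question in clean_questions:
        # A set counts each term at most once per question
        counts.update({w for w in question.split(' ') if len(w) >= min_length})
    return [term for term, _ in counts.most_common(top_n)]

# Hypothetical, already-normalized questions
sample_questions = [
    'this planet is nearest the sun',
    'the largest planet in our system',
    'a dwarf planet beyond the solar system',
]
top_terms = most_frequent_terms(sample_questions, top_n=2)
print(top_terms)  # prints ['planet', 'system']
```

The resulting list could stand in for `comparison_terms`, so that each tested term contributes many observations instead of one.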