# Winning Jeopardy

Imagine that I want to compete on Jeopardy, and I am looking for any way to win. In this project, I'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help me win.

In [374]:
import pandas as pd
import string

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [375]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [376]:
# Most column names start with a stray space, so replace them with clean names.
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

## Normalizing Text

Before I can start doing analysis on the Jeopardy questions, I need to normalize all of the text columns (the `Question` and `Answer` columns). 

In [377]:
import re

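def clean_str(text):
    # Lowercase, strip punctuation, and collapse runs of whitespace.
    # Raw strings keep "\s" from being treated as an invalid string escape.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text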

In [378]:
jeopardy['clean_question'] = jeopardy['Question'].apply(clean_str)

In [379]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(clean_str)
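
As a quick sanity check, here is a minimal sketch that applies the same cleaning steps to a single string. The sample text is made up for illustration, and the function is repeated only so the snippet runs on its own.

```python
import re

def clean_str(text):
    # Same steps as the notebook's function: lowercase, drop punctuation,
    # collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

sample = 'In 1963, live on "The Art Linkletter Show"'  # made-up input
cleaned = clean_str(sample)
print(cleaned)  # in 1963 live on the art linkletter show
```

Note that apostrophes are removed rather than replaced with a space, so a word like `McDonald's` becomes `mcdonalds`.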

## Normalizing Columns

Now that I've normalized the text columns, I need to normalize the `Value` column and the `Air Date` column.

In [380]:
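def dollar_to_int(value):
    # Strip punctuation such as "$" and "," so "$1,000" becomes "1000".
    value = value.translate(str.maketrans('', '', string.punctuation))
    # Anything that is not purely digits after stripping becomes 0.
    return int(value) if value.isdigit() else 0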

In [381]:
jeopardy['clean_value'] = jeopardy['Value'].apply(dollar_to_int)

In [382]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
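
To make the conversion rule concrete, here is a small sketch that runs the same logic on a few hand-picked strings. The sample values, including the `"None"` placeholder, are assumptions for illustration and are not taken from the dataset.

```python
import string

def dollar_to_int(value):
    # Remove punctuation, then convert to int if only digits remain.
    value = value.translate(str.maketrans('', '', string.punctuation))
    return int(value) if value.isdigit() else 0

samples = ["$200", "$1,000", "None"]  # hypothetical inputs
converted = [dollar_to_int(v) for v in samples]
print(converted)  # [200, 1000, 0]
```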

## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

I can answer the second question by seeing how often complex words (six or more characters) reoccur. I can answer the first question by seeing how many times words in the answer also occur in the question.

In [383]:
def answer_in_question(row):
    
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    
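    # Ignore "the", which is common but says nothing about the answer.
    # Note: list.remove() drops only the first occurrence.
    if 'the' in split_answer:
        split_answer.remove('the')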
    
    if len(split_answer) == 0:
        return 0
    
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    return match_count/len(split_answer)

In [384]:
jeopardy['answer_in_question'] = jeopardy.apply(answer_in_question, axis=1)

In [385]:
jeopardy['answer_in_question'].value_counts()

0.000000    17475
0.500000     1448
0.333333      494
0.250000      155
1.000000      124
0.666667      104
0.200000       68
0.166667       27
0.400000       26
0.142857       21
0.750000       17
0.600000        9
0.125000        9
0.285714        7
0.800000        2
0.428571        2
0.181818        2
0.571429        2
0.300000        2
0.111111        2
0.350000        1
0.444444        1
0.875000        1
Name: answer_in_question, dtype: int64

In [386]:
jeopardy['answer_in_question'].mean()

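0.05900196524977763

On average, only about 6% of the words in an answer also appear in its question, so I can't rely on hearing the answer in the question. I will probably need to study.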
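
To see where an individual ratio comes from, here is a sketch that reproduces the calculation for two rows whose cleaned text appears in full in this notebook's later table output. The helper name `answer_overlap` is hypothetical; it takes two strings instead of a DataFrame row but otherwise follows the same steps.

```python
def answer_overlap(clean_answer, clean_question):
    # Hypothetical string-based variant of answer_in_question.
    split_answer = clean_answer.split()
    split_question = clean_question.split()

    # Drop the first "the", as the notebook's function does.
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0

    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)

# "a" is the only answer word that also appears in the question: 1 of 3.
caste = answer_overlap("a caste cast", "hindu hierarchy or a plays actors")

# After dropping one "the", 7 answer words remain and only "day" matches: 1 of 7.
bounty = answer_overlap("the day of the mutiny on the bounty",
                        "why april 28th was a bad day for capt bligh")

print(round(caste, 6), round(bounty, 6))  # 0.333333 0.142857
```

The first case shows a weakness of the measure: the match comes from the filler word `a`, not from a meaningful term.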

## Recycled Questions

Let's say I want to investigate how often new questions are repeats of older ones. I can't completely answer this, because I only have about 10% of the full Jeopardy question dataset, but I can investigate it at least.

To do this, I can:

- Sort `jeopardy` in order of ascending air date.
- Maintain a set called `terms_used` that will be empty initially.
- Iterate through each row of `jeopardy`.
- Split `clean_question` into words, remove any word shorter than `6` characters, and check if each word occurs in `terms_used`.
    - If it does, increment a counter.
    - Add each word to `terms_used`.

This allows me to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables me to filter out words like `the` and `than`, which are commonly used, but don't tell me a lot about a question.

In [387]:
question_overlap = []
terms_used = set()

jeopardy.sort_values('Air Date',inplace=True, ascending=True)

In [388]:
for index, row in jeopardy.iterrows():

    # Keep only words with six or more characters.
    split_question = [word for word in row['clean_question'].split() if len(word) >= 6]

    match_count = 0

    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)

    # Convert the count to a proportion of the question's long words.
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

In [389]:
jeopardy['question_overlap'] = question_overlap

In [390]:
jeopardy['question_overlap'].mean()

0.6894031359073245
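
To show how the overlap measure behaves, here is a toy run of the same procedure on three made-up, already-cleaned questions listed in chronological order. The function name `question_overlaps` and the sample questions are assumptions for illustration only.

```python
def question_overlaps(questions, min_length=6):
    # Hypothetical helper: for each question, the share of its long words
    # that already appeared in an earlier question (or earlier in this one).
    terms_used = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        overlaps.append(match_count / len(words) if words else 0)
    return overlaps

toy_questions = [
    "this famous scientist discovered gravity",     # nothing seen before
    "this scientist developed general relativity",  # "scientist" repeats
    "name the state",                               # no long words at all
]
overlaps = question_overlaps(toy_questions)
print(overlaps)  # [0.0, 0.25, 0]
```

Because the set only grows, later questions tend to score higher simply because more terms have accumulated, which is one reason to read the overall mean with caution.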

## Low Value vs High Value Questions

There is about `70%` overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it looks at single terms rather than phrases, so the result is not conclusive on its own. It does suggest that the recycling of questions is worth a closer look.

Let's say I only want to study terms that show up in high-value questions rather than low-value questions. This will help me earn more money when I'm on Jeopardy.

I can actually figure out which terms correspond to high-value questions using a chi-squared test. 

I'll compute the chi-squared value from the expected and observed counts for high- and low-value questions. I can then find the words with the biggest differences in usage between high- and low-value questions by selecting the words with the highest chi-squared values. Doing this for all of the words would take a very long time, so I'll just do it for a small sample for now.

In [336]:
def high_or_low(row):
    # Questions worth more than $800 count as high value (1); all others are low value (0).
    return 1 if row['clean_value'] > 800 else 0

In [337]:
jeopardy['high_value'] = jeopardy.apply(high_or_low,axis=1)

In [338]:
jeopardy

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | answer_in_question | question_overlap | high_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19325 | 10 | 1984-09-21 | Final Jeopardy! | U.S. PRESIDENTS | | Adventurous 26th president, he was 1st to ride... | Theodore Roosevelt | adventurous 26th president he was 1st to ride ... | theodore roosevelt | 0 | 0.000000 | 0.000000 | 0 |
| 19301 | 10 | 1984-09-21 | Double Jeopardy! | LABOR UNIONS | $200 | Notorious labor leader missing since '75 | Jimmy Hoffa | notorious labor leader missing since 75 | jimmy hoffa | 200 | 0.000000 | 0.000000 | 0 |
| 19302 | 10 | 1984-09-21 | Double Jeopardy! | 1789 | $200 | Washington proclaimed Nov. 26, 1789 this first... | Thanksgiving | washington proclaimed nov 26 1789 this first n... | thanksgiving | 200 | 0.000000 | 0.000000 | 0 |
| 19303 | 10 | 1984-09-21 | Double Jeopardy! | TOURIST TRAPS | $200 | Both Ferde Grofe' & the Colorado River dug thi... | the Grand Canyon | both ferde grofe the colorado river dug this n... | the grand canyon | 200 | 0.000000 | 0.500000 | 0 |
| 19304 | 10 | 1984-09-21 | Double Jeopardy! | LITERATURE | $200 | Depending on the book, he could be a "Jones", ... | Tom | depending on the book he could be a jones a sa... | tom | 200 | 0.000000 | 0.000000 | 0 |
| 19305 | 10 | 1984-09-21 | Double Jeopardy! | HOMONYMS | $200 | Hindu hierarchy or a play's actors | a caste (cast) | hindu hierarchy or a plays actors | a caste cast | 200 | 0.333333 | 0.000000 | 0 |
| 19306 | 10 | 1984-09-21 | Double Jeopardy! | TV TRIVIA | $200 | Last season, this series mourned the loss of S... | Hill Street Blues | last season this series mourned the loss of sg... | hill street blues | 200 | 0.000000 | 0.000000 | 0 |
| 19307 | 10 | 1984-09-21 | Double Jeopardy! | 1789 | $400 | Why April 28th was a bad day for Capt. Bligh | the day of the mutiny on the Bounty | why april 28th was a bad day for capt bligh | the day of the mutiny on the bounty | 400 | 0.142857 | 0.000000 | 0 |
| 19308 | 10 | 1984-09-21 | Double Jeopardy! | TOURIST TRAPS | $400 | Seaside resort that has a monopoly on East Coa... | Atlantic City, New Jersey | seaside resort that has a monopoly on east coa... | atlantic city new jersey | 400 | 0.000000 | 0.000000 | 0 |
| 19309 | 10 | 1984-09-21 | Double Jeopardy! | LITERATURE | $400 | He wrote "The 3 Musketeers"; his son wrote "Ca... | (Alexandre) Dumas | he wrote the 3 musketeers his son wrote camille | alexandre dumas | 400 | 0.000000 | 0.000000 | 0 |


In [339]:
def high_low_count(word):
    # Count how many high-value and low-value questions contain `word`.
    low_count = 0
    high_count = 0

    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()

        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1

    return high_count, low_count

In [340]:
import random

# random.sample() needs a sequence; passing a set raises a TypeError as of Python 3.11.
comparison_terms = random.sample(list(terms_used), 10)

observed_expected = []

for word in comparison_terms:
    observed_expected.append(high_low_count(word))

In [341]:
observed_expected

[(0, 1),
 (0, 2),
 (0, 1),
 (0, 1),
 (1, 1),
 (0, 5),
 (0, 1),
 (1, 0),
 (1, 0),
 (1, 0)]

## Applying the Chi-Squared Test

Now that I've found the observed counts for a few terms, I can compute the expected counts and the chi-squared value.
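
For reference, the calculation in the next cell can be written out as follows. The symbols are my own shorthand, not names from the dataset: $O_{\text{high}}$ and $O_{\text{low}}$ are the observed counts for one word, $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high- and low-value rows, and $N = N_{\text{high}} + N_{\text{low}}$.

$$
E_{\text{high}} = \frac{O_{\text{high}} + O_{\text{low}}}{N}\,N_{\text{high}},
\qquad
E_{\text{low}} = \frac{O_{\text{high}} + O_{\text{low}}}{N}\,N_{\text{low}}
$$

$$
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}}
       + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
$$

The expected counts assume the word is spread across high- and low-value rows in the same proportion as the rows themselves. With two categories, the statistic has one degree of freedom.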

In [351]:
from scipy.stats import chisquare

high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])

chi_squared = []

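for item in observed_expected:
    total = item[0] + item[1]
    total_prop = total / len(jeopardy)
    # Expected counts if the word were spread evenly across high- and low-value rows.
    expected = [total_prop * high_value_count, total_prop * low_value_count]
    chi_squared.append(chisquare(item, f_exp=expected))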

In [352]:
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.00981423063442, pvalue=0.1562844540498966),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

## Chi-Squared Results

None of the terms had a significant difference in usage between high-value and low-value rows (every p-value is above 0.05). Additionally, the counts were all very small (no term appeared in more than five questions), so the chi-squared test isn't reliable here. It would be better to run this test only on terms that have higher frequencies.
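
One way to act on this, sketched under assumptions: count every term in a single pass instead of scanning all rows once per word, then keep only the terms whose expected counts clear a threshold. The function names `term_counts` and `frequent_terms`, the `min_expected` parameter, and the toy rows are all hypothetical; with the real data, the rows would come from `clean_question` and `high_value`.

```python
from collections import Counter

def term_counts(rows):
    # rows: iterable of (clean_question, high_value) pairs.
    high_counts, low_counts = Counter(), Counter()
    for question, high_value in rows:
        # set() counts a term at most once per question.
        for word in set(question.split()):
            if high_value == 1:
                high_counts[word] += 1
            else:
                low_counts[word] += 1
    return high_counts, low_counts

def frequent_terms(high_counts, low_counts, n_high, n_low, min_expected=5):
    # Keep terms whose expected count in BOTH groups is at least min_expected.
    n = n_high + n_low
    kept = []
    for word in set(high_counts) | set(low_counts):
        total = high_counts[word] + low_counts[word]
        if total * n_high / n >= min_expected and total * n_low / n >= min_expected:
            kept.append(word)
    return sorted(kept)

# Toy data: one high-value row and three low-value rows.
toy_rows = [
    ("famous scientist", 1),
    ("famous painter", 0),
    ("famous scientist writer", 0),
    ("painter", 0),
]
high_counts, low_counts = term_counts(toy_rows)

# A tiny threshold is used only because the toy data is tiny.
kept = frequent_terms(high_counts, low_counts, n_high=1, n_low=3, min_expected=0.6)
print(kept)  # ['famous']
```

A common rule of thumb is to require expected counts of at least 5 in each category, which is why `min_expected` defaults to 5 here.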