# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. This project will attempt to find patterns in the questions that are asked. 

We are working with a subset of 20,000 Jeopardy questions; the full dataset is available [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). The dataset contains the following columns:

- **Show Number** -- the Jeopardy episode number of the show this question was in.
- **Air Date** -- the date the episode aired.
- **Round** -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- **Category** -- the category of the question.
- **Value** -- the number of dollars answering the question correctly is worth.
- **Question** -- the text of the question.
- **Answer** -- the text of the answer.

In [14]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [15]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [16]:
jeopardy.columns = jeopardy.columns.str.replace(' ', '')
print(jeopardy.columns)

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


In [19]:
import re

def normalise(text):
    return re.sub(r'[^\w\s]', '', text.lower())

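def normalise_values(text):
    # Keep digits only, dropping the dollar sign and any separators
    text = re.sub(r'[^\d]', '', text)
    try:
        return int(text)
    except ValueError:
        # No digits left, so treat the value as 0
        return 0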
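
As a quick sanity check, the two helpers can be tried on a few sample strings before applying them to the whole DataFrame. This is only a sketch: the definitions are repeated so the snippet runs on its own, and apart from `McDonald's` and `$200`, which appear in the output above, the sample strings are made up for illustration.

```python
import re

def normalise(text):
    # Lowercase and drop everything that is not a word character or whitespace
    return re.sub(r'[^\w\s]', '', text.lower())

def normalise_values(text):
    # Keep digits only; fall back to 0 when nothing numeric is left
    digits = re.sub(r'[^\d]', '', text)
    try:
        return int(digits)
    except ValueError:
        return 0

# Illustrative inputs (assumed, not taken from the full dataset)
samples = ["McDonald's", "$200", "$2,000", "None"]
cleaned_text = [normalise(s) for s in samples]
cleaned_values = [normalise_values(s) for s in samples]

print(cleaned_text)
print(cleaned_values)
```

Strings with no digits at all fall through to the `0` default, which is why non-numeric values end up with a `clean_value` of 0.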

In [22]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalise)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalise)
jeopardy['clean_value'] = jeopardy['Value'].apply(normalise_values)
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [23]:
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])
jeopardy.dtypes

ShowNumber                 int64
AirDate           datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

## Answers in questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

1. How often the answer is deducible from the question.
2. How often new questions are repeats of older questions.

In [24]:
def matches(row):
    split_question = row['clean_question'].split()
    split_answer = row['clean_answer'].split()
    
    if 'the' in split_answer:
        split_answer.remove('the')
        
    if len(split_answer) == 0:
        return 0
    
    match_count = 0
    for ans in split_answer:
        if ans in split_question:
            match_count += 1
    
    return match_count / len(split_answer)


answer_in_question = jeopardy.apply(matches, axis=1)
print(answer_in_question.mean())    

0.05900196524977763


On average, only about 6% of an answer's terms also occur in its question, so it would be difficult to deduce answers consistently from the question text.
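
To make the metric concrete, here is a small sketch that applies the same logic to three question/answer pairs without needing the DataFrame. The helper name `answer_term_share` is hypothetical. The first two pairs come from the `head()` output above, and the third is invented to show a partial match.

```python
def answer_term_share(clean_question, clean_answer):
    """Share of answer terms (ignoring one 'the') that occur in the question."""
    question_terms = clean_question.split()
    answer_terms = clean_answer.split()

    # Same rule as matches(): drop the first 'the' from the answer, if any
    if 'the' in answer_terms:
        answer_terms.remove('the')

    if not answer_terms:
        return 0

    hits = sum(1 for term in answer_terms if term in question_terms)
    return hits / len(answer_terms)

examples = [
    ("formerly formosa", "taiwan"),
    ("unsinkable for most of its maiden voyage in 1912", "the titanic"),
    # Invented pair: two of the three answer terms appear in the question
    ("john adams was the father of this sixth president", "john quincy adams"),
]
shares = [answer_term_share(q, a) for q, a in examples]
print(shares)
```

Most real questions behave like the first two pairs and score zero, which is what pulls the overall mean down to about 6%.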

In [26]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('AirDate')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    split_question = [term for term in split_question if len(term) > 5]
    
    match_count = 0
    for term in split_question:
        if term in terms_used:
            match_count += 1
            
    for term in split_question:    
        terms_used.add(term)
    
    if len(split_question) > 0:
        match_count /= len(split_question)
    
    question_overlap.append(match_count)
    
# Caution: pd.Series(...) is aligned on index labels, not on the sorted row order
jeopardy['question_overlap'] = pd.Series(question_overlap)
print(jeopardy['question_overlap'].mean())

0.6877209804150519


On average, about 69% of the terms in a question (counting only words of six or more characters) had already appeared in an earlier question. That is high enough to be worth exploring further, but it does not show that questions are repeated: we compared single words rather than phrases, and individual words recur far more readily than whole questions.
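
One caveat about the per-row values: the cell assigns `pd.Series(question_overlap)` to a DataFrame that has just been sorted. pandas aligns a Series on index labels rather than on position, so each value lands on the row whose label matches, not on the row it was computed for. The mean is unaffected because it does not depend on order. The sketch below uses a made-up three-row frame (the column names and numbers are assumptions) to show the difference and two positional alternatives.

```python
import pandas as pd

# Toy frame; dates and values are invented for illustration
df = pd.DataFrame({
    "AirDate": pd.to_datetime(["2004-12-31", "1984-09-21", "1997-05-02"]),
    "q": ["c", "a", "b"],
})
df = df.sort_values("AirDate")   # row order is now index labels 1, 2, 0

overlap = [0.0, 0.5, 1.0]        # computed while iterating in sorted order

# Label-aligned: the Series has index 0, 1, 2, so 0.0 goes to the row
# labelled 0, which is the last row after sorting
df["aligned_by_label"] = pd.Series(overlap)

# Positional: a plain list, or a Series built on df.index, follows row order
df["by_position"] = overlap
df["by_position_series"] = pd.Series(overlap, index=df.index)

print(df)
```

Assigning the list directly is the smallest change that keeps each overlap value on the row it was computed for.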

## High/low value questions

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

1. Low value -- Any row where Value is 800 or less.
2. High value -- Any row where Value is greater than 800.

In [28]:
def value(row):
    # 1 for high-value questions (more than $800), 0 otherwise
    if row['clean_value'] > 800:
        return 1
    return 0
    
jeopardy['high_value'] = jeopardy.apply(value, axis=1)
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.857143,0
19274,10,1984-09-21,Jeopardy!,GEOGRAPHY,$100,Formerly Formosa,Taiwan,formerly formosa,taiwan,100,0.75,0
19275,10,1984-09-21,Jeopardy!,DOUBLE TALK,$100,"Not a Hawaiian cow, but a dress worn by Hawaii...",a muumuu,not a hawaiian cow but a dress worn by hawaiia...,a muumuu,100,1.0,0
19276,10,1984-09-21,Jeopardy!,"""JACKS"" OF ALL TRADES",$100,He celebrated his 39th birthday 41 times,Jack Benny,he celebrated his 39th birthday 41 times,jack benny,100,0.857143,0
19277,10,1984-09-21,Jeopardy!,SHIPS,$100,"""Unsinkable"" for most of its maiden voyage in ...",the Titanic,unsinkable for most of its maiden voyage in 1912,the titanic,100,0.75,0


In [31]:
def match_value(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else: 
                low_count += 1
    return high_count, low_count

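# Despite its name, this list holds the observed (high, low) counts per term
observed_expected = []
# Sets are unordered, so this is an arbitrary sample of five terms
comparison_terms = list(terms_used)[:5]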
for term in comparison_terms:
    observed_expected.append(match_value(term))

print(observed_expected[:5])
        

[(2, 4), (1, 0), (0, 2), (0, 1), (0, 1)]


## Chi-squared test

With the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
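
Before running the test on the real counts, the arithmetic can be sketched by hand. The statistic is the sum of `(observed - expected)**2 / expected` over the two categories, and with two categories (one degree of freedom) the p-value reduces to a complementary error function. The helper name `chi_squared_1dof` and the totals below are hypothetical, chosen only to keep the numbers easy to follow.

```python
import math

def chi_squared_1dof(observed, expected):
    """Return (statistic, p_value) for two categories (one degree of freedom)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the chi-squared tail probability is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical totals (not the real dataset counts)
n_rows, n_high = 1000, 300
n_low = n_rows - n_high

observed = (10, 20)               # (high-value, low-value) questions containing a term
share = sum(observed) / n_rows    # fraction of all questions containing the term
expected = (share * n_high, share * n_low)   # roughly (9, 21)

stat, p_value = chi_squared_1dof(observed, expected)
print(stat, p_value)
```

The expected counts simply split the term's total occurrences in the same proportion as high- and low-value questions overall, which is what the loop in the next cell does for each term.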

In [36]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy['high_value'].sum()
low_value_count = jeopardy.shape[0] - high_value_count

chi_squared = []
for oe in observed_expected:
    total = sum(oe)
    total_prop = total / jeopardy.shape[0]
    e_high = total_prop * high_value_count
    e_low = total_prop * low_value_count
    
    observed = np.array(oe)
    expected = np.array([e_high, e_low])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.06376233446880725, pvalue=0.8006453026878781),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

As noted in the documentation for the **chisquare** function:

"This test is invalid when the observed or expected frequencies in each category are too small. A typical rule is that all of the observed and expected frequencies should be at least 5."

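Every term tested here appears in fewer than 5 high-value and fewer than 5 low-value questions, so these results are not meaningful; in any case, none of the p-values is below 0.05. It would be worth rerunning the test using only terms with higher observed counts.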
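
One way to pick such terms is to count every term in a single pass and keep only those that satisfy the rule of thumb quoted above. The sketch below is an assumption about how that could look, not part of the original analysis: the function names, the `min_count` parameter, and the toy rows are all hypothetical.

```python
from collections import Counter

def count_terms_by_value(rows):
    """Count, per term, how many high- and low-value questions contain it.

    `rows` is an iterable of (clean_question, high_value) pairs.
    """
    high, low = Counter(), Counter()
    for question, is_high in rows:
        terms = set(question.split())   # count each term once per question
        (high if is_high else low).update(terms)
    return high, low

def frequent_terms(high, low, n_high, n_low, min_count=5):
    """Keep terms whose observed and expected counts all reach min_count."""
    n_rows = n_high + n_low
    kept = {}
    for term in set(high) | set(low):
        total = high[term] + low[term]
        e_high = total / n_rows * n_high
        e_low = total / n_rows * n_low
        if min(high[term], low[term], e_high, e_low) >= min_count:
            kept[term] = (high[term], low[term])
    return kept

# Toy data: 10 high-value and 10 low-value questions
rows = [("alpha beta", 1)] * 10 + [("alpha gamma", 0)] * 10
high, low = count_terms_by_value(rows)
kept = frequent_terms(high, low, n_high=10, n_low=10)
print(kept)
```

On the real data, the pairs could come from `zip(jeopardy['clean_question'], jeopardy['high_value'])`, which also avoids rescanning the whole DataFrame once per term.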

## Next steps...

Here are some potential next steps:

- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    - Manually create a list of words to remove, like the, than, etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see which terms differ most between high- and low-value questions. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (available [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/)) instead of the subset we used in this project.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
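
As a starting point for the last idea, the overlap calculation could compare adjacent word pairs (bigrams) instead of single words. This is a minimal sketch under that assumption; the helper names `bigrams` and `phrase_overlap` are hypothetical, and the input is expected to be already-normalised question text in air-date order.

```python
def bigrams(text):
    """Return the set of adjacent word pairs in an already-normalised string."""
    words = text.split()
    return set(zip(words, words[1:]))

def phrase_overlap(questions):
    """For each question, in order, the share of its bigrams seen in earlier ones."""
    seen = set()
    overlaps = []
    for question in questions:
        pairs = bigrams(question)
        if pairs:
            overlaps.append(len(pairs & seen) / len(pairs))
        else:
            # One-word or empty questions have no bigrams to compare
            overlaps.append(0.0)
        seen |= pairs
    return overlaps

# Toy questions, invented for illustration
toy = ["formerly formosa", "this island formerly formosa", "taiwan"]
overlaps = phrase_overlap(toy)
print(overlaps)
```

Because a bigram only matches when two words recur together and in the same order, this measure should be much stricter than the single-word overlap computed earlier.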