# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help one win.

Here are explanations of each column:

* Show Number -- the Jeopardy episode number of the show this question was in.
* Air Date -- the date the episode aired.
* Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* Category -- the category of the question.
* Value -- the number of dollars answering the question correctly is worth.
* Question -- the text of the question.
* Answer -- the text of the answer.

## Reorganize the data

In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
print(jeopardy.head())

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  


In [2]:
# fix the column names that have spaces in front
jeopardy.columns = jeopardy.columns.str.strip()
print(jeopardy.columns)

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


There's a need to normalize the questions and answers. We can lowercase words and remove punctuation.

In [3]:
import re
def normalized_text(string):
    string = str.lower(string)
    string = re.sub('[^A-Za-z0-9]+', ' ', string)
    return string

jeopardy['clean_question'] = jeopardy['Question'].apply(normalized_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalized_text)

Write a function to normalize dollar values.

In [4]:
def normalize_values(text):
    text = re.sub("[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)

Convert the Air Date column to a datetime column.

In [5]:
import datetime as dt
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'],format='%Y-%m-%d')

In [6]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonald s,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

* How often the answer is deducible from the question?
* How often new questions are repeats of older questions?

## Answers deducible from questions

To figure out how often the answer is deducible from the question, we can see how many times words in the answer also occur in the question.

In [7]:
def repeat_words(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the')
    
    if len(split_answer) == 0:
        return 0
    
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    return match_count / len(split_answer)

In [8]:
answer_in_question = jeopardy.apply(repeat_words, axis=1)
print(answer_in_question.mean())

0.06296062319488491


Only around 6% of the answers are implied in the questions.

##  Recycled Questions

In [9]:
# sort jeopardy in order of ascending air date
jeopardy.sort_values('Air Date', ascending=True, inplace=True)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this n...,the grand canyon,200
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200


In [10]:
#Investigate how often new questions are repeats of older ones.
question_overlap = []
terms_used = set()

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]

    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1        
        terms_used.add(word)
        
    if len(split_question) > 0:
        p_match = match_count / len(split_question)
    
    question_overlap.append(p_match)

jeopardy['question_overlap'] = question_overlap

print(jeopardy['question_overlap'].mean())
print(len(terms_used))

0.7426625052441135
20249


## High-value questions

Studying questions that pertain to high value questions instead of low value questions will help one earn more money. We can figure out which terms correspond to high-value questions using a chi-squared test.

First, we need to narrow down the questions into two categories:

* Low value -- Any row where Value is less than 800
* High value -- Any row where Value is greater than 800

In [11]:
def is_high_value(row):
    value = 0
    
    if row['clean_value'] > 800:
        value = 1
    
    return value

jeopardy['high_value'] = jeopardy.apply(is_high_value, axis=1)

In [12]:
def count_usage(word):
    low_count = 0
    high_count = 0
    
    for i,row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count 

In [13]:
import random
comparison_terms = random.sample(terms_used, 10)

observed_expected = []
for term in comparison_terms:
    counts = count_usage(term)
    observed_expected.append(counts)

print(observed_expected)

[(1, 0), (0, 1), (3, 3), (1, 0), (0, 2), (1, 0), (0, 1), (0, 1), (0, 1), (2, 2)]


## Expected counts and chi-squared value

In [14]:
from scipy.stats import chisquare
import numpy as np

high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / len(jeopardy)
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))
    
print(chi_squared)

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=1.3346324449838385, pvalue=0.24798277007881886), Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571), Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483468)]


There isn't a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. 