# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

In [1]:
import pandas as pd
import numpy as np

jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
# Get columns
columns = jeopardy.columns
columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
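
Most of the column names reported above begin with a stray space, which is why they are reassigned here. As an alternative to typing the names out by hand, the whitespace can be stripped programmatically. The sketch below does this on a plain list so that it runs on its own; with the DataFrame, the analogous call would be `jeopardy.columns = jeopardy.columns.str.strip()`.

```python
# Column names as reported by jeopardy.columns above (note the leading spaces).
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category', ' Value',
               ' Question', ' Answer']

# str.strip() removes leading and trailing whitespace from each name.
clean_columns = [name.strip() for name in raw_columns]
print(clean_columns)
```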

## Normalizing text

Before you can start doing analysis on the Jeopardy questions, you need to normalize all of the text columns (the Question and Answer columns). The idea is to ensure that you have lowercase words and remove punctuation so Don't and don't aren't considered to be different words when you compare them.

In [4]:
# normalize
import re

def normalize(text):
    # Lowercase the text, then drop everything except letters, digits, and whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

In [5]:
# normalize questions and answers

jeopardy["clean_question"] = jeopardy["Question"].apply(normalize)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize)

In [6]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
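
As a quick sanity check, here is a minimal, self-contained sketch of the same normalization applied to a few made-up strings (they are illustrations, not rows from the dataset). It repeats the function so that it can run on its own.

```python
import re

def normalize(text):
    # Lowercase, then drop everything except letters, digits, and whitespace.
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", "", text)

# Made-up sample strings, not rows from the dataset.
samples = ["Don't", "don't", "ESPN's TOP 10", "ALL-TIME"]
for sample in samples:
    print(repr(sample), "->", repr(normalize(sample)))
```

Because punctuation is deleted rather than replaced with a space, hyphenated or possessive words fuse into one token: `ALL-TIME` becomes `alltime`, just as `McDonald's` becomes `mcdonalds` in the table above. That is fine for the word comparisons in this project, but worth keeping in mind when reading the cleaned text.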


## Normalizing columns

Now that we've normalized the text columns, there are also some other columns to normalize.

The Value column should also be numeric, to allow us to manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, to enable us to work with it more easily.

In [7]:
# function to normalize value column

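def normalize_value(text):
    # Strip the dollar sign, commas, and any other punctuation.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)

    # Values that still aren't valid integers are treated as 0.
    try:
        text = int(text)
    except ValueError:
        text = 0
    return text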

In [8]:
# apply function to value column
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_value)

In [9]:
jeopardy["clean_value"].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64

In [10]:
# Convert Airdate to datetime
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
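
To see how the value cleanup behaves on edge cases, here is a small standalone sketch. It is a variant of `normalize_value` that keeps digits only, and the inputs are made up; in particular, the string `"None"` is an assumed example of a value with no dollar amount. The date line uses the standard library's `datetime.strptime` rather than pandas, purely so that the block runs without the dataset.

```python
import re
from datetime import datetime

def normalize_value(text):
    # Keep digits only, so "$1,000" becomes "1000".
    digits = re.sub(r"[^0-9]", "", text)
    # Anything without digits (for example the string "None") falls back to 0.
    return int(digits) if digits else 0

# Made-up inputs for illustration.
for raw in ["$200", "$1,000", "None"]:
    print(raw, "->", normalize_value(raw))

# Parsing an ISO-style date string like the ones in the Air Date column.
air_date = datetime.strptime("2004-12-31", "%Y-%m-%d")
print(air_date.year)
```

Mapping unparseable values to 0 keeps the column numeric, but those rows are then indistinguishable from genuinely cheap questions. Any later analysis that splits on value will count them as low value.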

## Answers in Questions

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

<ul>
<li>How often the answer is deducible from the question.
<li>How often new questions are repeats of older questions.
</ul>

We can answer the second question by seeing how often complex words (six or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

In [11]:
# function to find answers in questions

def answers_in_questions(row):
    split_question = row["clean_question"].split(" ")
    split_answer = row["clean_answer"].split(" ")
    
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the')
    
    if len(split_answer) == 0:
        return 0
    
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    return match_count / len(split_answer)

In [12]:
jeopardy["answer_in_question"] = jeopardy.apply(answers_in_questions, axis=1)

jeopardy["answer_in_question"].mean()

0.06049325706933587

On average, only about 6% of the words in an answer also appear in its question. This is a small share, so we probably can't count on the question itself to give the answer away. We'll probably have to study.
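
To make the metric concrete, here is a minimal sketch of the same idea on hand-written examples. The function name `answer_overlap` and the sample strings are introduced here for illustration. It is a slight variant of the function above: it removes every occurrence of `the` from the answer rather than only the first, and it uses a set for the question words.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    question_words = set(clean_question.split())
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up examples, already lowercased and stripped of punctuation.
print(answer_overlap("the city of yuma in this state", "arizona"))
print(answer_overlap("john quincy was the son of this president", "john adams"))
print(answer_overlap("any question text", "the"))
```

The score is a fraction per row, so the mean over all rows is an average share of matching words, not the share of questions whose full answer appears in the question.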

## Recycled questions

Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.

To do this, you can:
<ul>
<li>Sort jeopardy in order of ascending air date.
<li>Maintain a set called terms_used that will be empty initially.
<li>Iterate through each row of jeopardy.
<li>Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
<li>If it does, increment a counter.
<li>Add each word to terms_used.
</ul>

This will enable us to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables us to filter out words like the and than, which are commonly used, but don't tell us a lot about a question.

In [13]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
            
    for word in split_question:
        terms_used.add(word)
        
    if len(split_question) > 0:
        match_count /= len(split_question)
            
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

print(jeopardy["question_overlap"].mean())

0.6876260592169802


On average, about 69% of the longer terms in a question have already appeared in an earlier question. This only covers a small set of questions, and it compares single terms rather than phrases, so on its own it is weak evidence that questions are recycled. It does suggest that the recycling of questions is worth looking into further.
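
The procedure is easier to follow on a toy input. The sketch below applies the same steps to three made-up questions; the function name `term_overlap` and the sample data are assumptions for illustration only.

```python
def term_overlap(questions, min_length=6):
    """For each question, the share of its longer terms already seen earlier."""
    terms_used = set()
    overlaps = []
    for question in questions:
        # Keep only terms with at least min_length characters.
        terms = [w for w in question.split(" ") if len(w) >= min_length]
        seen = sum(1 for w in terms if w in terms_used)
        terms_used.update(terms)
        overlaps.append(seen / len(terms) if terms else 0)
    return overlaps

# Made-up questions, assumed to be sorted by air date already.
toy_questions = [
    "this italian astronomer defended copernicus",
    "this polish astronomer proposed heliocentrism",
    "copernicus studied in this italian city",
]
print(term_overlap(toy_questions))
```

The third toy question is not a repeat of either earlier one, yet two of its three longer terms have been seen before. Since `terms_used` only ever grows, later questions tend to score higher, which is one reason a high average overlap does not by itself show that whole questions are reused.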

## Low value vs high value questions

Let's say we only want to study the kinds of questions that carry high values rather than low values. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:
<ul>
<li>Low value -- Any row where Value is 800 or less.
<li>High value -- Any row where Value is greater than 800.
</ul>

We'll then be able to loop through each of the terms collected in the previous section, terms_used, and:
<ul>
<li>Find the number of low value questions the word occurs in.
<li>Find the number of high value questions the word occurs in.
<li>Find the percentage of questions the word occurs in.
<li>Based on the percentage of questions the word occurs in, find expected counts.
<li>Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.
</ul>

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
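
Written out, the quantities in the steps above are as follows. The symbols are introduced here for illustration: $O_{\text{high}}$ and $O_{\text{low}}$ are the observed numbers of high and low value questions containing the term, $N_{\text{high}}$ and $N_{\text{low}}$ are the total numbers of high and low value questions, and $N = N_{\text{high}} + N_{\text{low}}$.

$$
p = \frac{O_{\text{high}} + O_{\text{low}}}{N}, \qquad
E_{\text{high}} = p \, N_{\text{high}}, \qquad
E_{\text{low}} = p \, N_{\text{low}}
$$

$$
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}} + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
$$

The expected counts describe how the term's occurrences would be split if it were equally likely to appear in high and low value questions. A large $\chi^2$ means the observed split departs from that.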

In [14]:
def determine_value(row):
    value = 1 if row["clean_value"] > 800 else 0
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)
jeopardy[['high_value', 'clean_value']].head()

Unnamed: 0,high_value,clean_value
19325,0,0
19301,0,200
19302,0,200
19303,0,200
19304,0,200


In [15]:
def count_usage(term):
    # Count the high- and low-value questions that contain the term.
    low_count = 0
    high_count = 0

    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [16]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 3),
 (1, 2)]
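
Calling `count_usage` once per term rescans every row each time, which is why only ten terms are sampled here. One possible way around that, sketched below under the assumption that a single pass over the data is acceptable, is to tally every term at once with `collections.Counter`. The function name `count_usage_all` and the toy rows are hypothetical.

```python
from collections import Counter

def count_usage_all(rows, min_length=6):
    """Count, per term, how many high- and low-value questions contain it.

    rows is an iterable of (clean_question, high_value) pairs.
    """
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, high_value in rows:
        # A set, so a term repeated within one question is counted once.
        terms = {w for w in clean_question.split(" ") if len(w) >= min_length}
        if high_value == 1:
            high_counts.update(terms)
        else:
            low_counts.update(terms)
    return high_counts, low_counts

# Made-up rows for illustration, not taken from the dataset.
toy_rows = [
    ("this italian astronomer defended copernicus", 1),
    ("this polish astronomer proposed heliocentrism", 0),
    ("copernicus studied in this italian city", 0),
]
high_counts, low_counts = count_usage_all(toy_rows)
print(high_counts["astronomer"], low_counts["astronomer"])
print(high_counts["polish"], low_counts["polish"])
```

With the DataFrame, the rows could be supplied as `zip(jeopardy["clean_question"], jeopardy["high_value"])`.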

## Applying the chi-squared test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [17]:
import numpy as np
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []

for values in observed_expected:
    total = sum(values)
    total_prop = total / jeopardy.shape[0]
    
    high_value_exp = high_value_count * total_prop
    low_value_exp = low_value_count * total_prop
    
    observed = np.array([values[0], values[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293)]

None of the terms had a significant difference in usage between high value and low value rows: every p-value is well above 0.05. Additionally, each term occurred fewer than five times, and the chi-squared test is unreliable at such low frequencies. It would be better to run this test only on terms that have higher frequencies.
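
As a sketch of what that filtering might look like, the function below computes the statistic by hand and skips any term whose expected counts fall below a threshold. The function name, the `min_expected=5` cutoff, and the totals in the example are assumptions for illustration. The p-value uses the fact that, for one degree of freedom, the chi-squared survival function equals `erfc(sqrt(x / 2))`.

```python
import math

def chi_squared_for_term(high_obs, low_obs, high_total, low_total, min_expected=5):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term.

    Returns None when an expected count is below min_expected.
    """
    n = high_total + low_total
    term_prop = (high_obs + low_obs) / n
    high_exp = high_total * term_prop
    low_exp = low_total * term_prop
    if min(high_exp, low_exp) < min_expected:
        return None
    statistic = ((high_obs - high_exp) ** 2 / high_exp
                 + (low_obs - low_exp) ** 2 / low_exp)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical totals: 300 high value and 700 low value questions.
print(chi_squared_for_term(0, 1, 300, 700))    # too rare, skipped
print(chi_squared_for_term(30, 20, 300, 700))  # frequent enough to test
```

Testing many terms this way would also call for some correction for multiple comparisons, since a few small p-values are expected by chance alone.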