# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. 

Imagine that we want to compete on Jeopardy, and we're looking for any way to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named `jeopardy.csv`, and contains roughly the first 20,000 rows (19,999 in the copy loaded below) of a full dataset of Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

## Exploring the data

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import chisquare
import matplotlib.pyplot as plt
%matplotlib inline

jeopardy = pd.read_csv('jeopardy.csv')
print(jeopardy.head())

print(jeopardy.columns)

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [2]:
# Removing leading and trailing spaces
jeopardy.columns = jeopardy.columns.str.strip()
print(jeopardy.columns)

print(jeopardy.info())

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB
None


## Normalizing text

Before we can analyze the text columns, we need to normalize them. We'll convert every word to lowercase and replace punctuation with spaces.

In [3]:
# Define a function to normalize strings

import re

def normalize_text(string):
    """Takes a string as an input, transforms it to lowercase and replaces punctuation with spaces"""
    string = string.lower()
    # \W matches any character that is not a letter, digit or underscore
    string = re.sub(r'\W', ' ', string)
    return string

In [4]:
# Apply function to the Question column

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_question'].head()

0    for the last 8 years of his life  galileo was ...
1    no  2  1912 olympian  football star at carlisl...
2    the city of yuma in this state has a record av...
3    in 1963  live on  the art linkletter show   th...
4    signer of the dec  of indep   framer of the co...
Name: clean_question, dtype: object

In [5]:
# Apply function to the Answer column

jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
jeopardy['clean_answer'].head()

0    copernicus
1    jim thorpe
2       arizona
3    mcdonald s
4    john adams
Name: clean_answer, dtype: object

## Normalizing columns

The `Value` column should be numeric so that we can manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `Air Date` column should also be a datetime, not a string, so that we can work with it more easily.

In [6]:
# Function to normalize dollar values

def normalize_dollars(string):
    """Takes in a string as an input, removes the dollar sign and returns an integer (0 if the value cannot be parsed)"""
    if not string or string == 'None':
        return 0
    try:
        # Strip '$', ',' and any other non-word characters, then convert.
        # int() sits inside the try block so that a failed conversion is caught.
        return int(re.sub(r'\W', '', string))
    except ValueError:
        # Erroneous values are replaced with 0
        return 0

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollars)
jeopardy['clean_value'].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64
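
As an alternative to applying a Python function row by row, the same cleaning could be done with pandas' vectorized string methods. The following is only a sketch on a made-up sample `Series` (the values are illustrative, not taken from `jeopardy.csv`), and it assumes that unparseable entries should become 0, as in the function above.

```python
import pandas as pd

# Made-up sample values for illustration only
values = pd.Series(['$200', '$2,000', 'None'])

# Drop every non-digit character, then convert; anything left empty becomes NaN
digits = values.str.replace(r'\D', '', regex=True)
clean = pd.to_numeric(digits, errors='coerce').fillna(0).astype(int)

print(clean.tolist())  # [200, 2000, 0]
```

On the real data this would be applied to `jeopardy['Value']` in place of `values`.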

In [7]:
# Convert Air Date to datetime

import datetime as dt

jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy['Air Date'].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: Air Date, dtype: datetime64[ns]

## Answers in questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question and come back to the second.

In [8]:
# Function to match terms occurring in the answers to questions

def match_count(clean_answer, clean_question):
    """Takes in the cleaned answer and the question, splits them into words, and counts recurring matches."""
    
    # Split both into lists of words
    split_answer = clean_answer.split()  
    split_question = clean_question.split()  

    # Remove instances of 'the'
    split_answer = [word for word in split_answer if word != 'the']
    
    if len(split_answer) == 0:
        return 0  # Return 0 if answer is empty 
    
    # Count matches (the counter has its own name so it does not shadow the function)
    matches = 0
    for item in split_answer:
        if item in split_question:
            matches += 1
    
    return matches / len(split_answer)  # Normalize by the length of the answer    

In [9]:
# Count how many terms in clean_answer occur in clean_question

jeopardy['answer_in_question'] = jeopardy.apply(lambda row: match_count(row['clean_answer'], row['clean_question']), axis=1)
print(jeopardy['answer_in_question'].mean())

0.06229526885934705


This mean tells us that, on average, only about 6% of the words in an answer also appear in its question. So if our strategy were to recycle content from what we hear in the question, we would not be very successful.
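
To make the per-row value concrete, here is a small self-contained sketch of the same calculation on two made-up question and answer pairs. The helper name `answer_word_share` and the example strings are hypothetical and are not part of the dataset.

```python
def answer_word_share(clean_answer, clean_question):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    return sum(w in question_words for w in answer_words) / len(answer_words)

# Made-up rows for illustration only, not taken from jeopardy.csv
print(answer_word_share('the grand canyon', 'this canyon in arizona is a mile deep'))  # 0.5
print(answer_word_share('jim thorpe', 'this olympian played football at carlisle'))    # 0.0
```

A row scores 1.0 only when every word of the answer is present in the question, so a mean near 0.06 indicates that this rarely happens.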

## Recycled questions

We might, however, be able to exploit the fact that questions are recycled, i.e. that new questions repeat old ones. We can't answer this completely, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

In [10]:
# Check for question overlap over time

question_overlap = []
terms_used = set()
jeopardy = jeopardy.sort_values('Air Date')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ') # split into words
    split_question = [word for word in split_question if len(word) > 5] # keep only words with 6 or more characters
    overlap = 0 # separate name so the match_count function defined earlier is not overwritten
    for word in split_question:
        if word in terms_used:
            overlap += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        overlap /= len(split_question)
    question_overlap.append(overlap)

jeopardy['question_overlap'] = question_overlap
print(jeopardy['question_overlap'].mean())

0.7197989717809739


There is about a 72% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it looks at single terms rather than phrases, so on its own the figure is weak evidence that whole questions are repeated. It does suggest, however, that the recycling of questions is worth investigating further.
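
The overlap measure above looks only at single terms. One possible way to look at phrases instead is to compare runs of consecutive words (here pairs). The following is a hypothetical extension, not part of the original analysis. The function name `phrase_overlap`, the parameter `n`, and the toy questions are all assumptions made for illustration.

```python
def phrase_overlap(questions, n=2):
    """For each question in order, the share of its n-word phrases seen in earlier questions."""
    seen = set()
    ratios = []
    for question in questions:
        words = question.split()
        # Every run of n consecutive words in this question
        phrases = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        repeated = sum(phrase in seen for phrase in phrases)
        ratios.append(repeated / len(phrases) if phrases else 0)
        seen.update(phrases)
    return ratios

# Made-up questions for illustration only
toy = ['this river flows through egypt',
       'this river flows through paris',
       'a city of light']
print(phrase_overlap(toy))  # [0.0, 0.75, 0.0]
```

On the real data, the input would be `jeopardy['clean_question']` after sorting by `Air Date`, so that "earlier" means earlier in time.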

## Low-value vs. high-value questions



Let's say we only want to study high-value questions rather than low-value ones. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [15]:
# Function to determine high and low value questions

def high_low(value):
    """Returns 1 for a high-value question (value above 800) and 0 otherwise."""
    if value > 800:
        return 1
    else:
        return 0

# Apply to clean_value column
jeopardy['high_value'] = jeopardy['clean_value'].apply(high_low)
print(jeopardy['high_value'].head())
print(jeopardy['high_value'].mean())   

19325    0
19301    0
19302    0
19303    0
19304    0
Name: high_value, dtype: int64
0.28671433571678584


In [16]:
# Connecting words and value

def high_low_count(word):
    """Function to determine which words show up in high or low value questions."""
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split() # split into words
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count       
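
Because `high_low_count` rescans every row for each term, running it over the whole vocabulary is slow. One possible alternative is to tally all terms in a single pass. This is a sketch on plain Python lists. The function name `count_all_terms` and the toy inputs are hypothetical, and the flags are assumed to use 1 for high value and 0 for low value, as in the `high_value` column.

```python
from collections import defaultdict

def count_all_terms(clean_questions, high_flags):
    """Map each term to [high_count, low_count] in one pass over the rows."""
    counts = defaultdict(lambda: [0, 0])
    for question, is_high in zip(clean_questions, high_flags):
        # A set counts each term once per question, as high_low_count does
        for word in set(question.split()):
            counts[word][0 if is_high == 1 else 1] += 1
    return counts

# Made-up rows for illustration only
toy_questions = ['famous river in egypt', 'famous painter', 'river delta']
toy_flags = [1, 0, 0]
counts = count_all_terms(toy_questions, toy_flags)
print(counts['famous'], counts['river'])  # [1, 1] [1, 1]
```

With the real data, the two inputs would be `jeopardy['clean_question']` and `jeopardy['high_value']`.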

In [17]:
# Select sample of comparison terms

from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

# Calculate the observed (high, low) counts for each term
observed_expected = []

for term in comparison_terms:
    observed_expected.append(high_low_count(term))

print(observed_expected)

[(0, 1), (0, 1), (1, 0), (2, 1), (1, 0), (1, 1), (0, 1), (1, 0), (0, 1), (1, 1)]


## Applying the chi-squared test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [18]:
## Applying the chi-squared test

from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

print(chi_squared)

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344), Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996)]
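
To see where the first result in the output above comes from, the calculation can be reproduced by hand with only the standard library. This is a sketch, not the way `scipy` computes it internally. The share of high-value rows is copied from the mean of `high_value` printed earlier, and the function name `chi_squared_1df` is hypothetical.

```python
import math

# Share of high-value rows, copied from the printed mean of `high_value`
high_share = 0.28671433571678584

def chi_squared_1df(high_obs, low_obs, high_share):
    """Chi-squared statistic and p-value for one term's (high, low) counts."""
    total = high_obs + low_obs
    # Expected counts if the term were spread like the questions overall
    expected = [total * high_share, total * (1 - high_share)]
    observed = [high_obs, low_obs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Two categories give 1 degree of freedom, for which the chi-squared
    # survival function reduces to erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

stat, p = chi_squared_1df(0, 1, high_share)
print(stat, p)  # roughly 0.402 and 0.526, as in the first result above
```

A term seen once, in a low-value question, is expected about 0.29 times in high-value rows and 0.71 times in low-value rows, which is why the statistic is so small.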


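None of the sampled terms showed a significant difference in usage between high-value and low-value rows: every p-value is well above the conventional 0.05 level. In addition, each term appeared in at most three questions, below the usual rule of thumb of at least 5 per category, so the chi-squared approximation is unreliable here. It would be better to run this test with only terms that have higher frequencies.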
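
One way to restrict the test to more frequent terms is to count how many questions each term appears in and keep only those above a threshold. This is a sketch under stated assumptions: the function name `frequent_terms`, its parameters, and the toy questions are hypothetical, and the threshold is a judgment call. With about 29% of rows being high value, a term needs to appear in roughly 18 questions (5 / 0.287) before its expected high-value count reaches 5.

```python
from collections import Counter

def frequent_terms(clean_questions, min_count=18, min_len=6):
    """Terms of at least `min_len` characters appearing in at least `min_count` questions."""
    counts = Counter()
    for question in clean_questions:
        # Count each term once per question, matching how high_low_count tallies rows
        counts.update({w for w in question.split() if len(w) >= min_len})
    return [term for term, count in counts.most_common() if count >= min_count]

# Made-up questions for illustration only, with small thresholds to suit the toy data
toy = ['this famous river', 'a famous painter', 'famous river delta']
print(frequent_terms(toy, min_count=2, min_len=5))  # ['famous', 'river']
```

On the real data, `frequent_terms(jeopardy['clean_question'])` would give a candidate list to use in place of the random sample of terms.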