Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Suppose we want to compete on Jeopardy and are looking for any edge that could help us win. To find one, let's explore a sample dataset of Jeopardy questions (i.e. the first 20,000 rows of the full dataset of Jeopardy questions).

# Jeopardy Questions

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')

In [3]:
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
...,...,...,...,...,...,...,...
19994,3582,2000-03-14,Jeopardy!,U.S. GEOGRAPHY,$200,"Of 8, 12 or 18, the number of U.S. states that...",18
19995,3582,2000-03-14,Jeopardy!,POP MUSIC PAIRINGS,$200,...& the New Power Generation,Prince
19996,3582,2000-03-14,Jeopardy!,HISTORIC PEOPLE,$200,In 1589 he was appointed professor of mathemat...,Galileo
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky


Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

* `Show Number` - the Jeopardy episode number
* `Air Date` - the date the episode aired
* `Round` - the round of Jeopardy
* `Category` - the category of the question
* `Value` - the number of dollars the correct answer is worth
* `Question` - the text of the question
* `Answer` - the text of the answer

Each episode is a Jeopardy game consisting of rounds in which questions are asked.

In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

## Cleaning column names

In [5]:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Column information and formatting

In [6]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


* The text columns need to be standardized.
* The `Value` column should be numeric so that we can manipulate it more easily.
* The `Air Date` column should be a datetime, not a string, so that we can work with it more easily.

# Normalizing Text

We need to normalize the text columns (`Question` and `Answer`) by lowercasing the words and removing punctuation, so that "Don't" and "don't" aren't treated as different words when we compare them.

In [7]:
jeopardy[['Question','Answer']]

Unnamed: 0,Question,Answer
0,"For the last 8 years of his life, Galileo was ...",Copernicus
1,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,The city of Yuma in this state has a record av...,Arizona
3,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,"Signer of the Dec. of Indep., framer of the Co...",John Adams
...,...,...
19994,"Of 8, 12 or 18, the number of U.S. states that...",18
19995,...& the New Power Generation,Prince
19996,In 1589 he was appointed professor of mathemat...,Galileo
19997,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky


In [8]:
import re

In [9]:
def normalize_text(qa_text):
    normalized_text = qa_text.lower()
    # remove punctuation marks (keep only letters, digits and whitespace)
    normalized_text = re.sub(r'[^a-zA-Z0-9\s]', '', normalized_text)
    # collapse runs of whitespace into a single space
    normalized_text = re.sub(r'\s+', ' ', normalized_text)
    return normalized_text
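
As a quick check of what this normalization does, the sketch below repeats the function so that it runs on its own and applies it to two strings taken from the table above. One thing it makes visible: the function does not trim leading or trailing whitespace. Whether to add a `.strip()` is a judgment call that this notebook does not make.

```python
import re

def normalize_text(qa_text):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space.
    normalized_text = qa_text.lower()
    normalized_text = re.sub(r'[^a-zA-Z0-9\s]', '', normalized_text)
    normalized_text = re.sub(r'\s+', ' ', normalized_text)
    return normalized_text

samples = ["McDonald's", "...& the New Power Generation"]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)  # ['mcdonalds', ' the new power generation']
```

The second result keeps a leading space because the stripped `...&` was followed by one. That is harmless for the word-level comparisons later on, since they split the text on whitespace and only compare whole words.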

In [10]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_question']

0        for the last 8 years of his life galileo was u...
1        no 2 1912 olympian football star at carlisle i...
2        the city of yuma in this state has a record av...
3        in 1963 live on the art linkletter show this c...
4        signer of the dec of indep framer of the const...
                               ...                        
19994    of 8 12 or 18 the number of us states that tou...
19995                             the new power generation
19996    in 1589 he was appointed professor of mathemat...
19997    before the grand jury she said im really sorry...
19998    llamas are the heftiest south american members...
Name: clean_question, Length: 19999, dtype: object

In [11]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
jeopardy['clean_answer']

0             copernicus
1             jim thorpe
2                arizona
3              mcdonalds
4             john adams
              ...       
19994                 18
19995             prince
19996            galileo
19997    monica lewinsky
19998             camels
Name: clean_answer, Length: 19999, dtype: object

# Normalizing other columns

## Normalizing the `Value` column

In [12]:
jeopardy['Value'].unique()

array(['$200', '$400', '$600', '$800', '$2,000', '$1000', '$1200',
       '$1600', '$2000', '$3,200', 'None', '$5,000', '$100', '$300',
       '$500', '$1,000', '$1,500', '$1,200', '$4,800', '$1,800', '$1,100',
       '$2,200', '$3,400', '$3,000', '$4,000', '$1,600', '$6,800',
       '$1,900', '$3,100', '$700', '$1,400', '$2,800', '$8,000', '$6,000',
       '$2,400', '$12,000', '$3,800', '$2,500', '$6,200', '$10,000',
       '$7,000', '$1,492', '$7,400', '$1,300', '$7,200', '$2,600',
       '$3,300', '$5,400', '$4,500', '$2,100', '$900', '$3,600', '$2,127',
       '$367', '$4,400', '$3,500', '$2,900', '$3,900', '$4,100', '$4,600',
       '$10,800', '$2,300', '$5,600', '$1,111', '$8,200', '$5,800',
       '$750', '$7,500', '$1,700', '$9,000', '$6,100', '$1,020', '$4,700',
       '$2,021', '$5,200', '$3,389'], dtype=object)

In [13]:
def normalize_value(dollar_string):
    if dollar_string == 'None':
        normalized_value = 0
    else:
        normalized_value = int(re.sub('[$,]','',dollar_string))
    return normalized_value

In [14]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
jeopardy['clean_value']

0        200
1        200
2        200
3        200
4        200
        ... 
19994    200
19995    200
19996    200
19997    200
19998    200
Name: clean_value, Length: 19999, dtype: int64

## Normalizing the `Air Date` column

In [15]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy['Air Date']

0       2004-12-31
1       2004-12-31
2       2004-12-31
3       2004-12-31
4       2004-12-31
           ...    
19994   2000-03-14
19995   2000-03-14
19996   2000-03-14
19997   2000-03-14
19998   2000-03-14
Name: Air Date, Length: 19999, dtype: datetime64[ns]

To decide whether to study past questions, study general knowledge, or not study at all, it would help to figure out two things:
* How often the answer can be deduced from the question.
   - We can estimate this by checking how many of the words in the answer also occur in the question.
* How often questions are repeated.
   - We can estimate this by checking how often complex words (6 or more characters) recur.

# Answers in Questions

In [16]:
def calculate_answer_in_question(question_info_row):
    split_answer = question_info_row['clean_answer'].split()
    split_question = question_info_row['clean_question'].split()
    
    proportion_of_answer_word_in_q_to_num_answer_words = 0
    match_count = 0
    
    # 'the' is common in both answers and questions but carries no information
    # about the answer, so remove occurrences of 'the' from split_answer
    split_answer = [word for word in split_answer if word != 'the']
    
    #prevents a division by zero error later
    if len(split_answer) == 0:
        return proportion_of_answer_word_in_q_to_num_answer_words
    
    for answer_word in split_answer:
        if answer_word in split_question:
            match_count += 1
    
    proportion_of_answer_word_in_q_to_num_answer_words = match_count/len(split_answer)
    return proportion_of_answer_word_in_q_to_num_answer_words
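
To make the proportion concrete, here is a minimal standalone sketch of the same calculation. The function `answer_in_question` is a hypothetical variant that takes two strings rather than a DataFrame row, and the example strings are made up and already normalized; they are not rows from the dataset.

```python
def answer_in_question(clean_answer, clean_question):
    # Hypothetical string-based variant of the row function above.
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0  # nothing left to match, and avoids dividing by zero
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up, already-normalized strings (not taken from the dataset).
ratio_partial = answer_in_question('the new york times',
                                   'this new york paper is called the gray lady')
ratio_none = answer_in_question('camels', 'these animals spit when annoyed')
ratio_empty = answer_in_question('the', 'a very common english word')
print(ratio_partial, ratio_none, ratio_empty)
```

In the first case two of the three remaining answer words ("new" and "york") occur in the question, giving 2/3. The second case has no shared words, and the third has no answer words left once 'the' is removed, so both give 0.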

In [17]:
jeopardy['answer_in_question'] = jeopardy.apply(calculate_answer_in_question,axis=1)
jeopardy['answer_in_question']

0        0.0
1        0.0
2        0.0
3        0.0
4        0.0
        ... 
19994    1.0
19995    0.0
19996    0.0
19997    0.0
19998    0.0
Name: answer_in_question, Length: 19999, dtype: float64

In [18]:
jeopardy['answer_in_question'].mean()

0.058347444789267004

On average, only about 6% of the words in an answer also appear in its question, so we probably can't count on the question itself to reveal the answer. We'll probably have to study.

# Recycled Questions

In [19]:
question_overlap = list()
terms_used = set()
jeopardy = jeopardy.sort_values('Air Date')
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [word for word in split_question if len(word) >= 6]
    
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    
    if len(split_question) > 0:
        match_count /= len(split_question)
    
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap    

In [20]:
jeopardy['question_overlap'].mean()

0.6894031359073217

On average, about 69% of the terms (words of 6 or more characters) in a question have already appeared in an earlier question. This figure comes from a small sample of questions and looks only at single terms rather than phrases, so it is weak evidence on its own, but it does suggest that the recycling of questions is worth a closer look.
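
The overlap measure can be easier to follow on a tiny example. The sketch below wraps the same loop in a hypothetical helper, `question_overlaps`, and runs it on three made-up questions that are assumed to be in chronological order already.

```python
def question_overlaps(clean_questions, min_length=6):
    # clean_questions must already be sorted from oldest to newest.
    terms_used = set()
    overlaps = []
    for question in clean_questions:
        words = [w for w in question.split() if len(w) >= min_length]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        # Share of this question's long words that were seen earlier.
        overlaps.append(match_count / len(words) if words else 0)
    return overlaps

# Made-up questions, not taken from the dataset.
toy_questions = [
    'this planet is closest to the sun',
    'the largest planet in the solar system',
    'it is the red one',
]
overlaps = question_overlaps(toy_questions)
print(overlaps)
```

The first question scores 0 because nothing has been seen yet. The second has three long words ("largest", "planet", "system"), of which only "planet" appeared before, so it scores 1/3. The third has no words of 6 or more characters and scores 0.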

# Low Value vs High Value Questions

Let's say we only want to focus our studying on high-value questions rather than low-value ones. This will help us earn more money when we're on Jeopardy.

We can figure out which terms correspond to high-value questions using a chi-squared test. Here, a question counts as high value if it is worth more than $800.

In [21]:
def determine_high_value(question_info):
    if question_info['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [22]:
jeopardy['high_value'] = jeopardy.apply(determine_high_value,axis=1)

In [23]:
def count_high_low_occurances(term_used):
    high_count = 0
    low_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if term_used in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return (high_count,low_count)
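
The function above scans every row once per term, which becomes slow if many terms are tested. One possible alternative, sketched here with a hypothetical helper `count_terms_by_value` and made-up rows, is to tally every term in a single pass. It is an illustration under those assumptions, not the approach this notebook uses.

```python
from collections import defaultdict

def count_terms_by_value(rows):
    # rows: iterable of (clean_question, high_value) pairs, high_value is 1 or 0.
    counts = defaultdict(lambda: [0, 0])  # term -> [high_count, low_count]
    for clean_question, high_value in rows:
        # A set counts each term once per question, like the `in` test above.
        for term in set(clean_question.split(' ')):
            counts[term][0 if high_value == 1 else 1] += 1
    return counts

# Made-up rows, not taken from the dataset.
toy_rows = [
    ('this planet has rings', 1),
    ('the largest planet', 0),
    ('a ringed planet', 0),
]
term_counts = count_terms_by_value(toy_rows)
print(tuple(term_counts['planet']))  # (1, 2)
```

With the real data, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`.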

In [24]:
from random import choice
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for i in range(10)]

observed_expected = list()
for term in comparison_terms:
    observed_expected.append(count_high_low_occurances(term))
    
observed_expected

[(3, 2),
 (0, 1),
 (0, 1),
 (0, 2),
 (7, 12),
 (2, 9),
 (2, 0),
 (0, 1),
 (0, 1),
 (1, 5)]

# Applying the Chi-squared Test

In [25]:
from scipy.stats import chisquare

high_value_count = jeopardy.loc[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy.loc[jeopardy['high_value'] == 0].shape[0]

chi_squared = list()

for observed in observed_expected:
    total_observed_count = sum(observed)
    
    #calculating expected_high_value_count
    expected_high_value_proportion = float(high_value_count)/jeopardy.shape[0]
    expected_high_value_count = expected_high_value_proportion * total_observed_count
    
    #calculating expected_low_value_count
    expected_low_value_proportion = float(low_value_count)/jeopardy.shape[0]
    expected_low_value_count = expected_low_value_proportion * total_observed_count

    observed_frequencies = np.array([observed[0],observed[1]])
    expected_frequencies = np.array([expected_high_value_count,expected_low_value_count])
    
    chi_squared.append(chisquare(observed_frequencies,expected_frequencies))

chi_squared

[Power_divergenceResult(statistic=2.3995960878537224, pvalue=0.12136658322360773),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.6202349255189139, pvalue=0.4309599995797433),
 Power_divergenceResult(statistic=0.5918326368236639, pvalue=0.44171132316120576),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.42281054506129573, pvalue=0.515537958129453)]
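
To see where these numbers come from, the sketch below recomputes the statistic and p-value by hand for the term with observed counts `(2, 0)`. It relies on one assumption that the notebook never prints: that roughly 28.67% of rows are high value, a share inferred from the statistics above. The helper `chi_squared_two_cells` is hypothetical and only covers the two-cell case.

```python
import math

def chi_squared_two_cells(observed_high, observed_low, high_share):
    total = observed_high + observed_low
    expected_high = high_share * total
    expected_low = (1 - high_share) * total
    # Sum of (observed - expected)^2 / expected over both cells.
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # Two cells give one degree of freedom, for which the chi-squared
    # tail probability equals erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# ASSUMPTION: high_share is inferred from the output above, not printed by the notebook.
statistic, p_value = chi_squared_two_cells(2, 0, high_share=0.2867)
print(round(statistic, 2), round(p_value, 3))
```

Under that assumption the result should land close to the statistic of about 4.98 and p-value of about 0.026 reported for the seventh term. Note that the expected counts here are roughly 0.57 and 1.43, far too small for the approximation behind the p-value to be trusted.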

# Chi-squared results

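At the 0.05 significance level, nine of the ten sampled terms show no significant difference in usage between high-value and low-value questions. The one exception (observed counts `(2, 0)`, p ≈ 0.026) appears in only two questions, so that result is not reliable. Most of the sampled terms occur so rarely that their expected frequencies are below 5, the usual minimum for the chi-squared approximation to hold, so the test isn't very valid here. It would be better to run this test with only terms that have higher frequencies.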
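
One way to restrict the test to terms with higher frequencies is to keep only those whose expected counts reach a minimum in both cells. The sketch below applies the common rule of thumb of at least 5 to the observed counts listed earlier. The helper `has_adequate_expected_counts` is hypothetical, and the high-value share of about 28.67% is an assumption inferred from the chi-squared output rather than a figure the notebook prints.

```python
def has_adequate_expected_counts(observed_high, observed_low, high_share, minimum=5):
    total = observed_high + observed_low
    expected_high = high_share * total
    expected_low = (1 - high_share) * total
    # Both expected cells must reach the minimum for the test to be usable.
    return min(expected_high, expected_low) >= minimum

# Observed (high, low) counts copied from the sampled terms above.
observed_counts = [(3, 2), (0, 1), (0, 1), (0, 2), (7, 12),
                   (2, 9), (2, 0), (0, 1), (0, 1), (1, 5)]

# ASSUMPTION: high_share is inferred from the notebook's output, not printed by it.
testable = [counts for counts in observed_counts
            if has_adequate_expected_counts(*counts, high_share=0.2867)]
print(testable)  # [(7, 12)]
```

Under these assumptions only one of the ten sampled terms would qualify, which suggests choosing terms by frequency instead of at random.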