# Are There Advantages to Winning Jeopardy?

Jeopardy is a popular trivia game show in the United States. The answers are given first, and the contestants must supply the questions. In this project, I work with a dataset of Jeopardy questions to look for patterns that could help someone win. The dataset is named jeopardy.csv and contains 20000 rows. It can be downloaded here: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/


The data will be read into a pandas DataFrame. Here is an explanation of the columns:
- Show Number: The Jeopardy episode number of the show this question was in.
- Air Date: The date the episode aired.
- Round: The round of Jeopardy in which the question was asked. Each episode progresses through several rounds.
- Category: The category of the question.
- Value: The number of dollars that answering the question correctly is worth.
- Question: The text of the question.
- Answer: The text of the answer.


In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
# Removing leading white space from the column names
jeopardy.columns = jeopardy.columns.str.lstrip()

In [3]:
import string
def normalize(s):
    # Lowercase the text and remove all punctuation so that words compare equal
    s = s.lower()
    s = s.translate(str.maketrans('', '', string.punctuation))
    return s

In [4]:
#Using the normalize function to clean the question and answer columns in the dataframe
jeopardy['clean_question'] = jeopardy.Question.apply(normalize)
jeopardy['clean_answer'] = jeopardy.Answer.apply(normalize)
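
To make the effect of `normalize` concrete, here is a small check on a made-up string. The sample text is hypothetical and not taken from the dataset, and the function body is repeated from the cell above only so that the snippet runs on its own.

```python
import string

def normalize(s):
    # Lowercase the text, then delete every ASCII punctuation character.
    s = s.lower()
    return s.translate(str.maketrans('', '', string.punctuation))

# Hypothetical sample, for illustration only.
sample = "It's the U.S. state nicknamed 'The Grand Canyon State'!"
cleaned = normalize(sample)
print(cleaned)  # its the us state nicknamed the grand canyon state
```

Note that punctuation is deleted rather than replaced by a space, so "U.S." collapses to "us" and "It's" to "its". That is acceptable for the word matching done later, but it is worth keeping in mind when reading the cleaned columns.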

In [5]:
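def normal_convert(s):
    # Remove punctuation such as "$" and "," before converting
    s = s.translate(str.maketrans('', '', string.punctuation))
    try:
        s = int(s)
    except ValueError:
        # Anything that is not a number is treated as 0
        s = 0
    return s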

In [6]:
#Normalizing the dollar values so that they are integers
jeopardy['clean_value'] = jeopardy.Value.apply(normal_convert)
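
As a quick sanity check of the conversion, the sketch below runs `normal_convert` on a few sample strings. The inputs are illustrative (only "$200" appears in the preview above; the others are assumed formats), and the function is repeated so the snippet is self-contained.

```python
import string

def normal_convert(s):
    # Strip punctuation such as "$" and ",", then try to parse an integer.
    s = s.translate(str.maketrans('', '', string.punctuation))
    try:
        return int(s)
    except ValueError:
        # Non-numeric text falls back to 0.
        return 0

# Illustrative inputs; the last one is an assumed non-numeric value.
converted = {raw: normal_convert(raw) for raw in ['$200', '$2,000', 'None']}
print(converted)  # {'$200': 200, '$2,000': 2000, 'None': 0}
```

One consequence of the fallback is that a value that cannot be parsed becomes indistinguishable from a question worth 0 dollars, so such rows end up in the low-value group later on.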

In [7]:
#Normalizing the Air Date column so that it is of type datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

My goal is to work out a strategy for winning at Jeopardy. There are three possible approaches:
- study past questions
- study general knowledge
- not study at all

It would be helpful to figure out two things:
- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

I can answer the second question by seeing how often complex words (6 or more characters) reoccur. I can answer the first question by seeing how many of the words in the answer also occur in the question. I'll work on the first question now, and come back to the second.

In [8]:
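def question_deduce(s):
    # Fraction of the words in the answer that also appear in the question
    split_answer = s.clean_answer.split()
    split_question = s.clean_question.split()
    match_count = 0
    # Drop the first "the" from the answer, since it carries no information
    if 'the' in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)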

jeopardy['answer_in_question'] = jeopardy.apply(question_deduce,axis=1)
print('On average the number of words that appear in both the answer and question is:', 
      jeopardy.answer_in_question.mean()*100,   '%')
    
    

On average the number of words that appear in both the answer and question is: 5.886148203514072 %


Above, I created a function that computes the fraction of the words in each answer that also appear in the corresponding question. The problem I am trying to answer is: how often is the answer deducible from the question? According to this calculation, only about 6% of the words in an answer, on average, also appear in its question.

Because the answer so rarely appears in the question, I can't count on deducing answers from the question itself, so I must reevaluate my study strategy for Jeopardy.
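
To show what this fraction means for a single row, here is a minimal sketch that applies the same logic as `question_deduce` to two plain strings instead of a DataFrame row. The name `answer_overlap` and both sample rows are hypothetical and used only for illustration.

```python
def answer_overlap(clean_question, clean_answer):
    # Fraction of answer words (ignoring the first "the") found in the question.
    answer_words = clean_answer.split()
    question_words = clean_question.split()
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Hypothetical, already-normalized rows.
no_hint = answer_overlap('this grand canyon state is home to yuma', 'arizona')
partial_hint = answer_overlap('the john who signed the declaration', 'john adams')
print(no_hint)       # 0.0
print(partial_hint)  # 0.5
```

A dataset-wide mean of about 0.06 therefore says that most rows look like the first example, where the question contains none of the answer's words.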

In [9]:
jeopardy.sort_values(by=['Air Date'],inplace=True)
terms_used = set()
question_overlap = []
for index, row in jeopardy.iterrows():
    split_question = row.clean_question.split()
    split_question = [i for i in split_question if len(i) >= 6]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if (len(split_question) > 0):
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()
        

0.6889055316620328

# Question Overlap

On average, about 70% of the complex terms in a question have already appeared in an earlier question. This analysis only covers a small sample of all Jeopardy questions, and it compares single terms rather than phrases, so on its own the result is only weakly informative. Still, it suggests that the recycling of questions is worth a closer look.
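
The running-set calculation can be illustrated on a toy example. The sketch below mirrors the loop from the cell above on three made-up questions, assumed to be already normalized and sorted by air date. The function name `running_overlap` and the sample questions are hypothetical.

```python
def running_overlap(questions, min_length=6):
    # For each question, in chronological order, compute the share of its
    # long words that already appeared earlier.
    seen = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        matches = 0
        for w in words:
            if w in seen:
                matches += 1
            seen.add(w)
        overlaps.append(matches / len(words) if words else 0)
    return overlaps

# Hypothetical questions, oldest first.
questions = [
    'this african country borders the atlantic',
    'the largest african country by area',
    'this atlantic island country is volcanic',
]
overlaps = running_overlap(questions)
print(overlaps)
```

The first question scores 0 because nothing has been seen yet, while the later two score about 0.67 and 0.5 purely because of common words such as "country" and "atlantic". None of the three questions is a repeat of another, which illustrates why a high single-term overlap is weak evidence of recycled questions.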

In [10]:
def low_or_high(row):
    if row.clean_value > 800:
        value = 1
    else:
        value = 0
    return value
jeopardy['high_value'] = jeopardy.apply(low_or_high,axis=1)


In [11]:
def word_counts(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        question = row.clean_question.split()
        if word in question:
            if row.high_value == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count , low_count

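# Each entry is the observed (high_count, low_count) pair for one term
observed_expected = []
# Sets are unordered, so this is an arbitrary sample of 10 terms
terms_used = list(terms_used)
comparison_terms = terms_used[:10]
for term in comparison_terms:
    observed_expected.append(word_counts(term))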
    


        

In [12]:
observed_expected

[(1, 1),
 (0, 1),
 (0, 1),
 (1, 0),
 (1, 2),
 (1, 1),
 (1, 0),
 (1, 0),
 (0, 1),
 (0, 1)]

In [13]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    ex_high = total_prop * high_value_count
    ex_low = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([ex_high, ex_low])
    chi_squared.append(chisquare(observed, expected))
    

In [14]:
chi_squared

[Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]


# Chi-squared results
None of the ten sampled terms showed a statistically significant difference in usage between high-value and low-value rows (every p-value is above 0.05). Additionally, the observed frequencies were all lower than 5, so the chi-squared approximation is unreliable here. It would be better to run this test with only terms that have higher frequencies.
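
One way to follow up on this is to count every term in a single pass and test only the terms whose expected counts reach a minimum size. The sketch below is an assumption about how that could look, not part of the analysis above: the helper names, the toy rows, and the threshold of 5 are illustrative. It computes the chi-squared statistic by hand to stay dependency-free; with two categories there is one degree of freedom, for which the p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math
from collections import Counter

def chi_squared_1df(observed, expected):
    # Pearson statistic for two categories and its p-value (1 degree of freedom).
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, math.erfc(math.sqrt(statistic / 2))

def frequent_term_tests(rows, min_expected=5):
    # rows: (clean_question, is_high_value) pairs standing in for the DataFrame.
    high_total = sum(1 for _, high in rows if high)
    low_total = len(rows) - high_total
    high_counts, low_counts = Counter(), Counter()
    for question, high in rows:
        # Count each term once per question, as word_counts does.
        for term in set(question.split()):
            (high_counts if high else low_counts)[term] += 1
    results = {}
    for term in set(high_counts) | set(low_counts):
        observed = (high_counts[term], low_counts[term])
        share = sum(observed) / len(rows)
        expected = (share * high_total, share * low_total)
        # Skip rare terms, where the chi-squared approximation is unreliable.
        if min(expected) >= min_expected:
            results[term] = chi_squared_1df(observed, expected)
    return results

# Hypothetical toy data: 21 high-value rows and 20 low-value rows.
rows = ([('famous river', True)] * 15 + [('famous city', True)] * 5
        + [('famous river', False)] * 5 + [('famous city', False)] * 15
        + [('rare obelisk', True)])
results = frequent_term_tests(rows)
for term in sorted(results):
    print(term, results[term])
```

In this toy data the terms "rare" and "obelisk" are dropped because their expected counts are far below 5, while "river" and "city" are kept and tested. On the real data, testing many terms at once would also raise the chance of false positives, so the resulting p-values should be read with that in mind.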