Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. 

Let's say someone wants to compete on Jeopardy, and they're looking for any edge they can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help them win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). 

In [51]:
import pandas as pd
import string
import re
import numpy as np
from scipy.stats import chisquare

In [2]:
jeopardy = pd.read_csv("jeopardy.csv")

In [3]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
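# Remove the leading space from every affected column name, including ' Category'.
jeopardy.rename(
    columns={
        ' Air Date': 'Air Date',
        ' Round': 'Round',
        ' Category': 'Category',
        ' Value': 'Value',
        ' Question': 'Question',
        ' Answer': 'Answer',
    },
    inplace=True,
)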
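
Listing every column by hand makes it easy to miss one (the `Index` output above shows `' Category'` with a leading space as well). A minimal sketch of a more general approach, assuming the only problem is stray whitespace around the names, strips all headers in one step. The small `demo` frame is a made-up stand-in for illustration, not the real data.

```python
import pandas as pd

# Hypothetical stand-in with the same kind of padded headers as jeopardy.csv.
demo = pd.DataFrame(columns=["Show Number", " Air Date", " Round", " Category"])

# Strip leading/trailing whitespace from every column name at once.
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

The same line applied to `jeopardy.columns` would cover any column, so nothing depends on remembering each name.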

## Normalizing Columns

In [25]:
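def normalize_text(text):
    # Lowercase, then drop everything except letters, digits and whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_value(text):
    # Strip punctuation such as "$" and ",", then convert to an integer.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Non-numeric values are treated as 0.
        text = 0
    return text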

In [24]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)




In [26]:
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_value)
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
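
To make the effect of the two helpers concrete, here is a small self-contained sketch that re-declares them and applies them to a few sample strings. The inputs are made up for illustration, apart from `$200` and `McDonald's`, which appear in the table above.

```python
import re

def normalize_text(text):
    # Lowercase, then drop everything except letters, digits and whitespace.
    text = text.lower()
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

def normalize_value(text):
    # Remove "$" and ",", convert to int, and fall back to 0 for non-numeric text.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

print(normalize_text("McDonald's"))  # mcdonalds
print(normalize_value("$200"))       # 200
print(normalize_value("$2,000"))     # 2000
print(normalize_value("None"))       # 0
```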


## Answer Terms in the Question

In [27]:
def count_matches(row):
    # Fraction of the answer's terms that also appear in the question.
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")

    match_count = 0
    # Drop the first "the" so a common article doesn't count as a match.
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for each in split_answer:
        if each in split_question:
            match_count = match_count + 1

    return match_count / len(split_answer)

In [29]:
answer_in_question = jeopardy.apply(count_matches, axis=1)
print(np.mean(answer_in_question))

0.0604932570693


**Comments on the mean result**  
On average, only about 6% of the terms in an answer also appear in its question. The question therefore rarely gives the answer away, so hearing the question alone usually won't be enough to answer it.
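
As a rough illustration of what that fraction measures, the sketch below runs the same matching logic on two made-up rows. Plain dictionaries stand in for DataFrame rows, and both questions are hypothetical, not taken from the dataset.

```python
def count_matches(row):
    # Fraction of the answer's terms that also appear in the question.
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for term in split_answer if term in split_question)
    return match_count / len(split_answer)

# Hypothetical rows for illustration only.
rows = [
    {"clean_question": "this city of light is the capital of france",
     "clean_answer": "paris"},
    {"clean_question": "this new york team plays in the bronx",
     "clean_answer": "the new york yankees"},
]

for row in rows:
    print(count_matches(row))
```

The first row scores 0 because "paris" never appears in the question. The second scores 2/3, since "new" and "york" appear but "yankees" does not.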

## Question Overlap

In [31]:
question_overlap = []
terms_used = set()

for i, each in jeopardy.iterrows():
    split_question = each["clean_question"].split(" ")
    # Keep only words longer than five characters to skip common filler words.
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count = match_count + 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
print("Mean Question Overlap: ", np.mean(question_overlap))

Mean Question Overlap:  0.690873731567


**Comments**  
On average, about 69% of the words longer than five characters in a question had already appeared in an earlier question. This covers only the 20,000-row subset of the full dataset, and it compares single terms rather than phrases, so the figure alone says little about whether whole questions are reused. It does suggest that the recycling of questions is worth a closer look.
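
The overlap measure is easier to follow on a tiny example. The sketch below wraps the same logic in a helper and runs it on three made-up questions. The name `overlap_fractions` and the `min_len` parameter are illustrative assumptions, not part of the notebook.

```python
def overlap_fractions(questions, min_len=6):
    """For each question, return the share of its long words seen in earlier questions."""
    terms_used = set()
    fractions = []
    for question in questions:
        # Same filter as len(q) > 5 when min_len is 6.
        words = [w for w in question.split(" ") if len(w) >= min_len]
        seen = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        fractions.append(seen / len(words) if words else 0)
    return fractions

# Hypothetical, already-normalized questions.
questions = [
    "this president signed the declaration",
    "the declaration was signed in this building",
    "a river in egypt",
]
print(overlap_fractions(questions))
```

The first question has nothing to overlap with, the second reuses two of its three long words ("declaration" and "signed"), and the third has no words long enough to count, so it contributes 0.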

In [32]:
def determine_value(row):
    # Flag questions worth more than $800 as high value (1), the rest as low value (0).
    if row["clean_value"] > 800:
        value = 1
    else:
        value = 0
    return value


In [35]:
jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)
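
A row-wise `apply` works, but the same flag can also be computed with a single vectorized comparison. The sketch below shows that alternative on a made-up `demo` frame, which stands in for `jeopardy` here.

```python
import pandas as pd

# Hypothetical values for illustration only.
demo = pd.DataFrame({"clean_value": [200, 800, 1000, 2000, 0]})

# True/False from the comparison, converted to 1/0.
demo["high_value"] = (demo["clean_value"] > 800).astype(int)

print(demo["high_value"].tolist())  # [0, 0, 1, 1, 0]
```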

In [45]:
def counts(word):
    # Count how many high-value and low-value questions contain the word.
    low_count = 0
    high_count = 0
    for i, each in jeopardy.iterrows():
        split_question = each["clean_question"].split(" ")
        if word in split_question:
            if each["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [46]:
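# Observed (high_count, low_count) pairs, one per term.
observed_expected = []
# A set has no defined order, so these five terms are effectively an arbitrary sample.
comparison_terms = list(terms_used)[0:5]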

In [47]:
for each in comparison_terms:
    observed_expected.append(counts(each))
observed_expected

[(0, 1), (1, 0), (1, 0), (0, 5), (2, 1)]
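
`counts` rescans every row for each term, which gets slow as the list of terms grows. One possible alternative, sketched here on made-up data, tallies high- and low-value occurrences for all terms in a single pass. The `(clean_question, high_value)` tuples and the name `count_all_terms` are illustrative assumptions, not part of the notebook.

```python
from collections import defaultdict

def count_all_terms(rows):
    """Return {term: [high_count, low_count]} from (clean_question, high_value) pairs."""
    tallies = defaultdict(lambda: [0, 0])
    for clean_question, high_value in rows:
        # set() counts a term once per question, matching the
        # `word in split_question` check used in counts().
        for term in set(clean_question.split(" ")):
            if high_value == 1:
                tallies[term][0] += 1
            else:
                tallies[term][1] += 1
    return tallies

# Hypothetical rows for illustration only.
rows = [
    ("this planet is the largest", 1),
    ("this planet has rings", 0),
    ("the largest ocean", 0),
]
tallies = count_all_terms(rows)
print(tallies["planet"])   # [1, 1]
print(tallies["largest"])  # [1, 1]
```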

## Chi-Squared Test

In [57]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.00981423063442, pvalue=0.1562844540498966),
 Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344)]

**Comments**  
None of the terms showed a statistically significant difference in usage between high-value and low-value rows (the smallest p-value is about 0.11). In addition, each term occurred at most 5 times, so the expected frequencies were all below 5 and the chi-squared test is unreliable here. It would be better to rerun the test using only terms with higher frequencies.
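
One way to act on that is to skip any term whose expected counts are too small before testing it. The sketch below computes the same statistic by hand, $\chi^2 = \sum (O - E)^2 / E$ with one degree of freedom, and returns `None` when an expected count falls below a threshold. The function name, the `min_expected=5` cutoff (a common rule of thumb) and the row totals are all assumptions chosen for illustration, not values from the dataset.

```python
import math

def chi_squared_for_term(high_obs, low_obs, high_total, low_total, min_expected=5):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term.

    Returns None when either expected count is below min_expected.
    """
    n_rows = high_total + low_total
    term_prop = (high_obs + low_obs) / n_rows
    high_exp = term_prop * high_total
    low_exp = term_prop * low_total
    if min(high_exp, low_exp) < min_expected:
        return None
    stat = (high_obs - high_exp) ** 2 / high_exp + (low_obs - low_exp) ** 2 / low_exp
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical totals: 300 high-value rows and 700 low-value rows.
print(chi_squared_for_term(12, 8, 300, 700))  # expected counts 6 and 14, so it is tested
print(chi_squared_for_term(0, 1, 300, 700))   # expected counts too small, so None
```

With these made-up numbers the first term has expected counts of 6 and 14, giving a statistic of 60/7 (about 8.57). The second term is seen only once and is skipped instead of producing an unreliable p-value.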