# Winning Jeopardy
Jeopardy! is a popular daytime television trivia show in the United States. It has been on the air for a long time, and a contestant who knows the answers to a wide range of trivia questions can win a lot of money. There are several ways to study for Jeopardy!, so in this project I perform a statistical analysis on a dataset of Jeopardy! questions to find out which one gives the best chance of winning.

### Goal:
My goal is to answer the question:  
   
If I want to win Jeopardy! in the future, should I study previous Jeopardy! questions, general knowledge, or not study at all?

I will use a dataset of 216,930 Jeopardy! questions, roughly 83% of all Jeopardy! questions ever used. These questions were collected by crawling www.j-archive.com, and this dataset can be found <a href='https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file'>here</a>. 

## Reading in the Data

In [1]:
import pandas as pd

jeopardy = pd.read_csv('/Users/admin/Downloads/DATASCIENCE PROJECTS - GUIDED/Project 16 | Winning Jeopardy/JEOPARDY_CSV.csv')

jeopardy.shape

(216930, 7)

In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.describe(include='all')

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
count,216930.0,216930,216930,216930,216930,216930,216928
unique,,3640,4,27995,150,216124,88268
top,,2007-11-13,Jeopardy!,BEFORE & AFTER,$400,[audio clue],China
freq,,62,107384,547,42244,17,216
mean,4264.238519,,,,,,
std,1386.296335,,,,,,
min,1.0,,,,,,
25%,3349.0,,,,,,
50%,4490.0,,,,,,
75%,5393.0,,,,,,


## Cleaning the Dataset
I need to clean the column names and the "Question", "Answer", and "Value" columns.

In [4]:
jeopardy.columns = jeopardy.columns.str.strip()  #Stripping white space from the column names

In [5]:
#Normalizing the Question and Answer columns

import re

def normalize(string):
    string = str(string)
    string = string.lower()
    string = re.sub(r'\W', ' ', string)    #Replacing non-word characters with spaces (raw string avoids an invalid escape)
    return string

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)

In [6]:
#Normalizing the Value column

def normalize_value(value):
    value = str(value)
    value = re.sub(r'\D', '', value)    #Keeping only the digits (raw string avoids an invalid escape)
    if value == '':     #Correcting for non-digit entry
        value = 0
    value = int(value)
    return value

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
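
As a quick sanity check, the sketch below repeats the two cleaning functions in compact form so it runs on its own, and applies them to a few sample values. The inputs are made up for illustration and are not rows from the dataset.

```python
import re

def normalize(text):
    # Lowercase the text and replace every non-word character with a space
    return re.sub(r'\W', ' ', str(text).lower())

def normalize_value(value):
    # Keep only the digits; entries without digits become 0
    digits = re.sub(r'\D', '', str(value))
    return int(digits) if digits else 0

# Illustrative inputs (assumed, not taken from the dataset)
cleaned_answer = normalize("McDonald's")
cleaned_value = normalize_value("$2,000")
missing_value = normalize_value("None")

print(cleaned_answer)   # mcdonald s
print(cleaned_value)    # 2000
print(missing_value)    # 0
```

Note that the apostrophe becomes a space, so "McDonald's" splits into two words later in the analysis.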

## Exploring the Data
There are two questions that can help me answer my original question:   
<ul><li>How often can a Jeopardy! answer be found in its question?</li>
    <li>How often are questions repeated?</li></ul>
    
If I can answer these questions, then I have a good idea of whether I should study previous Jeopardy! questions if I want to win in the future.

### How often can a Jeopardy! answer be found in its question?
This question will help me discover whether I even need to study if I want to win Jeopardy! by finding out how often the answer can be discerned from the question itself.

To answer this question, I will use a word-count method to estimate how many times, on average, previous answers appear in their corresponding question.

To do this, I will create and apply a function that calculates what fraction of the words in a Jeopardy! answer also appear in its corresponding question, and then average this value over all questions.

In [7]:
def answer_in_question(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    while 'the' in split_answer:    #Removing "the" because it is too common
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count/len(split_answer)   #Words-matched-with-question per words-total-in-answer

jeopardy['answer_in_question'] = jeopardy.apply(answer_in_question, axis=1)

jeopardy['answer_in_question'].mean()

0.06082627177801448

On average, only about 6% of the words in an answer also appear in its question. Since the answer can rarely be read off the question itself, I will have to do more than "not study at all" in order to win.
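
To make the metric concrete, here is the same calculation on a single made-up clue. The function name `answer_word_share` and the toy clue are assumptions for illustration; the logic mirrors the row-wise function used in this section.

```python
def answer_word_share(clean_answer, clean_question):
    """Fraction of answer words (ignoring "the") that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up clue, already lowercased and stripped of punctuation
toy_question = 'this desert is the largest hot desert in the world'
toy_answer = 'the sahara desert'

share = answer_word_share(toy_answer, toy_question)
print(share)   # 0.5
```

Only "desert" out of the two remaining answer words appears in the question, so the score for this clue is 0.5.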

### How often are questions repeated?
This question will help me discover if studying past Jeopardy! questions will actually help me prepare for future Jeopardy! questions.

To answer this question, I will once again use a word-count method. This time, this method will help reveal if previous question topics reappear in future Jeopardy! questions.

To do this, I will create and apply a function that calculates, for each Jeopardy! question, how many words in that question have already been used in previous questions.

In [8]:
question_overlap = []
terms_used = set()

jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

jeopardy = jeopardy.sort_values('Air Date')     #Sorting by date aired

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    split_question = [word for word in split_question if len(word) >= 6]    #Removing common words by removing small words
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap     #Creating a column of how many words in each question were used in previous questions

jeopardy['question_overlap'].mean()

0.8987549397173473

On average, 89.9% of the longer words (six or more characters) in a question had already appeared in an earlier question. This is a sign that question topics may in fact be recycled! However, the result is far from conclusive: it measures the overlap of single words rather than whole questions, and as the vocabulary of past questions grows, almost any new question will share words with it. To provide more certainty, a good next step is to measure the overlap of phrases rather than solitary words. Still, this is a sign that studying past questions can be worthwhile.
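
A phrase-level version of this overlap could look roughly like the sketch below. It is only an outline of the idea, not part of the analysis: the function name `phrase_overlap`, the choice of two-word phrases, and the toy questions are all assumptions.

```python
def phrase_overlap(questions, n=2):
    """For each question (oldest first), return the fraction of its n-word
    phrases that already appeared in an earlier question."""
    seen = set()
    overlaps = []
    for text in questions:
        words = text.split()
        phrases = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
        if phrases:
            matched = sum(1 for p in phrases if p in seen)
            overlaps.append(matched / len(phrases))
        else:
            overlaps.append(0.0)
        # Register this question's phrases only after scoring it
        seen.update(phrases)
    return overlaps

# Made-up, already-normalized questions in chronological order
toy_questions = [
    'this river flows through egypt',
    'this river flows through paris',
    'capital of france',
]

overlaps = phrase_overlap(toy_questions)
print(overlaps)   # [0.0, 0.75, 0.0]
```

On the real data, the input would be the `clean_question` column after sorting by air date.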

## Studying High Value Jeopardy Questions
There is another dimension of the game show to consider when looking for a path to Jeopardy! victory. Questions carry different cash values, so to win more money it is better to know the answers to high-value questions than to low-value questions.

Another good question to answer is:   
Will concentrating on previous high-value questions help me win a future game of Jeopardy! more than studying previous questions in general?

To help answer this question, I will test whether the words used in high-value questions differ from the words used in low-value questions. I will use a word-count method once again, and this time I will perform a chi-squared test to evaluate whether any difference is statistically significant.

Null hypothesis:   
The words used in high-value questions are not statistically different from the words used in low-value questions.

Alternative hypothesis:   
The words used in high-value questions are significantly different from the words used in low-value questions.

### Testing the Hypothesis
I will test the hypothesis by dividing the Jeopardy! questions into high-value and low-value questions. High-value questions are questions that are worth more than \$800.

Next, I will create a function that counts how many times a particular word is used in high-value questions, and how many times that same word is used in low-value questions. The function will return both counts.

I will then apply this function to 10 random words from the list of used Jeopardy! terms. This result will serve as 10 'observation' instances for a chi-squared test.

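### The Chi-Squared Test
The chi-squared test calculates a test statistic that quantifies the difference between sets of observed and expected categorical values. It is used to determine statistical significance. The statistic is the sum, over all categories, of the squared difference between the observed and expected counts divided by the expected count:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$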

For each of the 10 Jeopardy! terms that I will test, there are 2 observed values and 2 expected values: one observed value for the high-value count, one observed value for the low-value count, one expected value for the high-value count, one expected value for the low-value count.

The observed values are calculated from the term-counting function. 

The expected values are the product of the term's proportional representation in the set of Jeopardy! questions with the number of questions that are high-/low-value.

Each term will generate its own chi-squared value.

From each chi-squared value, I will generate a p-value from the chi-squared distribution to determine statistical significance.

If any term has a p-value of 5\% or less, then that term is used at a significantly different rate in high-value questions than in low-value questions.

P-value threshold: 5%.
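
The calculation for a single term can be sketched in plain Python as follows. The question totals (1,000 questions, 300 of them high-value) are made-up numbers chosen for easy arithmetic, not the real dataset counts, and the function name is hypothetical. With two categories the test has one degree of freedom, in which case the p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math

def chi_squared_two_cells(observed, expected):
    """Chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Assumed totals, for illustration only
total_questions = 1000
high_value_count = 300
low_value_count = 700

# A term seen in 8 high-value and 15 low-value questions
observed = (8, 15)
term_total = sum(observed)
proportion = term_total / total_questions

# Expected counts if the term were spread evenly across both groups
expected = (proportion * high_value_count, proportion * low_value_count)

statistic, p_value = chi_squared_two_cells(observed, expected)
print(round(statistic, 4), round(p_value, 3))
```

Here the observed counts (8 and 15) sit close to the expected counts (6.9 and 16.1), so the statistic is small and the p-value is far above the 5% threshold.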

In [9]:
#Categorizing questions as high-value or low-value

def assign_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(assign_value, axis=1)

In [12]:
#High-/low-value count function

def high_count(word):
    #Counting the questions that contain the word, split by value category
    low = 0
    high = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high += 1
            else:
                low += 1
    return high, low
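
Because the function above loops over every row for each term, testing many terms becomes slow. One possible alternative, sketched here under the assumption that the rows are passed in as `(clean_question, high_value)` pairs, counts all terms in a single pass. The name `count_terms_by_value` and the toy rows are hypothetical.

```python
from collections import Counter

def count_terms_by_value(rows, terms):
    """rows: iterable of (clean_question, high_value) pairs.
    Returns {term: (high_count, low_count)} after one pass over the rows."""
    terms = set(terms)
    high = Counter()
    low = Counter()
    for clean_question, high_value in rows:
        # A term counts once per question, however often it occurs in it
        present = terms & set(clean_question.split())
        if high_value == 1:
            high.update(present)
        else:
            low.update(present)
    return {term: (high[term], low[term]) for term in terms}

# Made-up rows for illustration
toy_rows = [
    ('the nile river', 1),
    ('river of africa', 0),
    ('a river river', 0),
    ('paris france', 1),
]

counts = count_terms_by_value(toy_rows, ['river', 'france', 'galaxy'])
print(counts['river'])   # (1, 2)
```

With the DataFrame from this project, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`.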

In [14]:
#Generating sample observation values

from random import choice

terms_used_list = list(terms_used)

comparison_terms = [choice(terms_used_list) for _ in range(10)]     #Simple random sampling with replacement (a term can be drawn more than once)

observed_expected = []

for term in comparison_terms:
    observed_expected.append(high_count(term))     #Each entry is (high-value count, low-value count)

observed_expected

[(1, 0),
 (0, 1),
 (1, 3),
 (0, 1),
 (8, 15),
 (2, 6),
 (2, 6),
 (2, 4),
 (0, 2),
 (0, 1)]
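
Since `choice` draws with replacement, the same term can be selected more than once. A small sketch of drawing distinct terms reproducibly is shown below; the helper name `sample_terms`, the seed value, and the toy vocabulary are assumptions for illustration.

```python
import random

def sample_terms(terms, k=10, seed=None):
    """Draw k distinct terms; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    # Sorting first gives a stable order, since set ordering can vary between runs
    return rng.sample(sorted(terms), k)

# Made-up vocabulary standing in for terms_used
toy_vocabulary = {
    'galileo', 'olympian', 'arizona', 'linkletter', 'epitaph', 'tribute',
    'athlete', 'history', 'company', 'record', 'signer', 'framer',
}

picked = sample_terms(toy_vocabulary, k=10, seed=1)
print(len(picked), len(set(picked)))   # 10 10
```

Fixing the seed also makes the counts and test results the same on every run.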

In [16]:
#Performing the chi-squared test

from scipy.stats import chisquare
import numpy as np

low_value_count = len(jeopardy[jeopardy['high_value'] == 0])
high_value_count = len(jeopardy[jeopardy['high_value'] == 1])

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total/len(jeopardy)
    
    exp_high = total_prop * high_value_count    #Calculating the expected values
    exp_low = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([exp_high, exp_low])
    
    chi_sq = chisquare(observed, expected)
    chi_squared.append(chi_sq)
    
chi_squared

[Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.021646150708492677, pvalue=0.8830323245068887),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.4741163319768614, pvalue=0.4910995336105496),
 Power_divergenceResult(statistic=0.043292301416985354, pvalue=0.8351758561462266),
 Power_divergenceResult(statistic=0.043292301416985354, pvalue=0.8351758561462266),
 Power_divergenceResult(statistic=0.07446818777814278, pvalue=0.7849388502668134),
 Power_divergenceResult(statistic=0.7899529284667026, pvalue=0.3741143592744989),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695)]

### Chi-Squared Test Results
None of the terms chosen in the simple random sample demonstrates statistical significance in the chi-squared test. None of these terms is significantly more likely to appear in high-value questions than in low-value questions, or vice versa.

The lowest p-value of all 10 terms in the sample is 11.2%. All other p-values range from 37.4% to 88.3%, with 7 of the 10 p-values higher than 50%.

All of the chi-squared values are low. Only 1 chi-squared value is greater than 1. This means that 9 out of 10 of the terms in the sample display strong similarity between their observed and expected values.

At this time, the null hypothesis cannot be rejected. However, this sample is too small to be conclusive: most of the sampled terms appear in fewer than 10 questions, so their expected counts fall below 5, the usual rule-of-thumb minimum for the chi-squared approximation to be reliable. Many more chi-squared tests can be performed using more terms in the vocabulary, preferably terms that occur often enough to meet that minimum. With more terms tested, a few may show high chi-squared values and low p-values; 1 of the 10 sampled terms already has a much higher chi-squared value and a much lower p-value than the rest. Keep in mind, though, that at a 5\% threshold roughly 1 in 20 tests is expected to fall below the threshold by chance alone, so a handful of low p-values among many tests would not by itself be strong evidence.
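
When choosing which additional terms to test, one option is to keep only terms whose expected counts are large enough for the chi-squared approximation, commonly taken as at least 5 per category. The sketch below is one way to express that check; the helper name and the 30% high-value share are assumptions for illustration, not values computed from the dataset.

```python
def has_adequate_expected_counts(term_total, high_share, minimum=5):
    """True if both expected counts (high-value and low-value) reach the minimum."""
    expected_high = term_total * high_share
    expected_low = term_total * (1 - high_share)
    return min(expected_high, expected_low) >= minimum

# Assumed share of high-value questions, for illustration only
assumed_high_share = 0.3

frequent_term_ok = has_adequate_expected_counts(23, assumed_high_share)
rare_term_ok = has_adequate_expected_counts(4, assumed_high_share)

print(frequent_term_ok, rare_term_ok)   # True False
```

A term appearing in only a handful of questions fails the check, which is why very rare words tell us little in this test.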

## Conclusion
It looks like studying previous Jeopardy! questions might be a worthwhile strategy for winning a future game of Jeopardy!, although the evidence is not conclusive. More analysis is required to reach a firm conclusion.

### Next Steps:
<ul><li>Answering the question "How often are questions repeated?" by searching for common phrases among newer and older questions.</li>
    <li>Discovering which categories ("Category" column) of Jeopardy! questions appear most often among previous questions.</li>
    <li>Performing more chi-squared tests on more Jeopardy! terms to find out if any topics do appear more often in high-value questions.</li></ul>