## Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions. Here's the beginning of the file:
![](https://dq-content.s3.amazonaws.com/Nlfu13A.png)

As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:
   + Show Number -- the Jeopardy episode number of the show this question was in.
   + Air Date -- the date the episode aired.
   + Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
   + Category -- the category of the question.
   + Value -- the number of dollars answering the question correctly is worth.
   + Question -- the text of the question.
   + Answer -- the text of the answer.
   
First, we'll explore the dataset.

In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


Some of the column names start with a space. Let's strip the leading whitespace from each name.

In [3]:
jeopardy.columns = jeopardy.columns.str.strip()
print(jeopardy.columns)

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


## Normalizing text

Before we can analyze the Jeopardy questions, we need to normalize the text columns ("Question" and "Answer"). Normalization converts every word to lowercase and replaces punctuation with spaces, so that words such as Don't and don't aren't treated as different when we compare them.

In [4]:
#Write a function to normalize questions and answers.
import re
def convert_word(m):
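    #Replace every non-word character (punctuation) with a space. The raw string keeps \W from being read as an invalid escape.
    m = re.sub(r'\W', ' ', m)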
    m = m.lower()
    return m
jeopardy['clean_question'] = jeopardy['Question'].apply(convert_word)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(convert_word)
jeopardy.head()        

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams


## Normalizing columns

Now that we've normalized the text columns, there are also some other columns to normalize.

The 'Value' column should be numeric, to allow us to manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The 'Air Date' column should also be a datetime, not a string, to enable us to work with it more easily.
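
As a minimal sketch of what these two conversions involve, the snippet below parses a few made-up values using only the standard library. The helper names `parse_dollars` and `parse_air_date`, the sample strings, and the choice to map unparseable values to 0 are assumptions for illustration.

```python
import re
from datetime import datetime

def parse_dollars(text):
    """Strip non-word characters (such as '$' and ',') and convert to int.

    Values that cannot be parsed fall back to 0 (an assumed convention).
    """
    try:
        return int(re.sub(r"\W", "", text))
    except (TypeError, ValueError):
        return 0

def parse_air_date(text):
    """Parse a 'YYYY-MM-DD' string into a datetime object."""
    return datetime.strptime(text, "%Y-%m-%d")

samples = ["$200", "$2,000", "None"]  # hypothetical raw values
print([parse_dollars(s) for s in samples])  # [200, 2000, 0]
print(parse_air_date("2004-12-31").year)    # 2004
```

Catching only `TypeError` and `ValueError` keeps unrelated errors visible instead of silently turning them into zeros.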

In [5]:
#Write a function to normalize Value column.
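def change_value(s):
    #Strip non-word characters such as "$" and ",", then convert to an integer.
    #Values that can't be parsed (for example, missing values) become 0.
    try:
        s = re.sub(r"\W", "", s)
        s = int(s)
    except (TypeError, ValueError):
        s = 0
    return s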
jeopardy['clean_value'] = jeopardy['Value'].apply(change_value)

#Convert the 'Air Date' column to a datetime column.
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams,200


## Answers in questions

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
   + How often the answer is deducible from the question.
   + How often new questions are repeats of older questions.
   
We can answer the first question by seeing how many times words in the answer also occur in the question. We can answer the second question by checking how often complex words (6 or more characters) reoccur.

We'll work on the first question now, and then come back to the second.

In [6]:
#Write a function to calculate the matches of the words between answer and question.
def match_count(row):
    #This function takes in a row in "jeopardy" dataset.
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    
    #Remove the first 'the' in split_answer, since it isn't informative.
    if "the" in split_answer: 
        split_answer.remove("the")
    
    #Use a local name that doesn't shadow the function name.
    matches = 0
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            matches += 1
    return matches/len(split_answer)

#Count how many times terms in 'clean_answer' occur in 'clean_question'.
jeopardy["answer_in_question"] = jeopardy.apply(match_count, axis = 1)

#Calculate the mean of 'answer_in_question'.
mean = jeopardy["answer_in_question"].mean()
print(mean)        

0.09565366087691443


On average, only about 10% of the words in an answer also occur in the question, so we can't count on hearing the answer in the question itself. Next, we'll find out how often new questions are repeats of older questions.
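
To make the metric concrete, here is a small standalone sketch of the same idea applied to one made-up question and answer pair. It is a variant, not the notebook's function: `answer_overlap` is a hypothetical name, it splits on any whitespace so that empty strings are never counted as words, and it drops every occurrence of "the". Because of those differences, it may not reproduce the mean above exactly.

```python
def answer_overlap(clean_question, clean_answer):
    """Return the fraction of answer words that also appear in the question."""
    question_words = set(clean_question.split())
    # Drop every "the": it carries no information about the answer.
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical, already-normalized example row.
question = "this adams was the second president of the united states"
answer = "john adams"
print(answer_overlap(question, answer))  # 0.5
```

Here one of the two answer words ("adams") appears in the question, so the score is 0.5.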

## Recycled questions

Now let's answer the second question from the previous section:
   + How often new questions are repeats of older questions.

Since we only have about 10% of the full Jeopardy question dataset, we can't completely answer this. But we can at least investigate it.

To do this, we can:
   + Sort "jeopardy" in order of ascending air date.
   + Maintain a set "terms_used" to contain all the terms used in the old questions.
   + Iterate through each row of "jeopardy".
   + Split "clean_question" into words, remove any word shorter than 6 characters, and check if each word occurs in "terms_used".
   
This will enable us to check whether the terms in a question have been used previously. Only looking at words with 6 or more characters lets us filter out words like "the" and "than", which are commonly used but don't tell us much about a question.

In [7]:
jeopardy = jeopardy.sort_values(by = ['Air Date'])
question_overlap = []
terms_used = set()
for index, row in jeopardy.iterrows():
    #Count matches under a name that doesn't overwrite the match_count function.
    matches = 0
    split_question = row["clean_question"].split(" ")
    split_question = [i for i in split_question if len(i)>5]
       
    for word in split_question:
        if word in terms_used:
            matches += 1
        else:
            terms_used.add(word)
    if len(split_question) > 0:
        overlap = matches/len(split_question)
    else:
        overlap = 0
    question_overlap.append(overlap)
    
print(len(terms_used))
jeopardy["question_overlap"] = question_overlap
print(jeopardy["question_overlap"].mean())

21223
0.721603243720504


On average, about 72% of the longer words in a question had already appeared in earlier questions. However, this figure comes from only about 10% of the full dataset, and it is based on single words rather than phrases. Checking the overlap of phrases would likely give a different result. Even so, it looks worthwhile to investigate recycled questions further.
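
The bookkeeping behind this number is easier to follow on a toy example. The sketch below applies the same steps to three made-up questions that are assumed to be already normalized and sorted by air date; `overlap_per_question` is a hypothetical helper name.

```python
def overlap_per_question(questions, min_length=6):
    """For each question, return the share of its long words seen in earlier text."""
    terms_used = set()
    overlaps = []
    for text in questions:
        words = [w for w in text.split() if len(w) >= min_length]
        matches = 0
        for word in words:
            if word in terms_used:
                matches += 1
            else:
                terms_used.add(word)
        overlaps.append(matches / len(words) if words else 0)
    return overlaps, terms_used

questions = [
    "this planet is the largest in our solar system",
    "the smallest planet in our solar system",
    "this roman emperor built a famous wall",
]
overlaps, terms_used = overlap_per_question(questions)
print(overlaps)         # first and third questions score 0, the second 2/3
print(len(terms_used))  # 6
```

The first question always scores 0 because nothing has been seen yet, so the mean partly reflects how quickly the vocabulary accumulates rather than whether whole questions are reused.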

## Low value vs high value questions

Let's say we only want to study topics that come up in high-value questions rather than low-value questions. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:
   + Low value -- Any row where "Value" is less than 800.
   + High value -- Any row where "Value" is 800 or more.
   
We'll then be able to loop through each of the terms from the last section, terms_used, and:

   + Find the number of low value questions the word occurs in.
   + Find the number of high value questions the word occurs in.
   + Find the percentage of questions the word occurs in.
   + Based on the percentage of questions the word occurs in, find expected counts.
   + Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. 

There are more than 20000 words in terms_used. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
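
One way to avoid rescanning the whole dataset for every word is to tally all words in a single pass. The sketch below does this with `collections.Counter` on a few made-up rows; the `(clean_question, high_value)` tuples and the function name `count_terms_by_value` are assumptions for illustration, not part of the notebook.

```python
from collections import Counter

def count_terms_by_value(rows, min_length=6):
    """Count, per word, how many high- and low-value questions contain it."""
    high_counts, low_counts = Counter(), Counter()
    for clean_question, high_value in rows:
        # A set counts each word at most once per question.
        words = {w for w in clean_question.split() if len(w) >= min_length}
        target = high_counts if high_value == 1 else low_counts
        target.update(words)
    return high_counts, low_counts

rows = [
    ("this planet is the largest in our solar system", 0),
    ("the smallest planet in our solar system", 1),
    ("this planet has famous rings", 0),
]
high_counts, low_counts = count_terms_by_value(rows)
print(high_counts["planet"], low_counts["planet"])  # 1 2
```

With this layout the cost grows with the total number of words in the dataset, rather than with the number of terms multiplied by the number of rows.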

In [8]:
#High-value and low-value questions
def value_classification(row):
    if row["clean_value"]<800:
        value = 0
    else:
        value = 1
    return value
jeopardy["high_value"] = jeopardy.apply(value_classification, axis = 1)

#Counts in high-value questions and low-value questions for the words.
def counts(word):
    high_count = 0
    low_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        if word in split_question:
            if row["high_value"] == 0:
                low_count += 1
            else:
                high_count += 1
    return high_count, low_count

#Select 5 elements from terms_used to compare. Sets are unordered, so this is an arbitrary sample.
terms_used_list = list(terms_used)
comparison_terms = terms_used_list[:5]
observed_expected = [counts(w) for w in comparison_terms]
print(observed_expected)

[(0, 1), (0, 1), (0, 1), (0, 1), (1, 5)]


## Applying the chi-squared test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [9]:
high_value_count = jeopardy[jeopardy["high_value"]==1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"]==0].shape[0]

import numpy as np
from scipy.stats import chisquare 
chi_squared = []
for lis in observed_expected:
    total = lis[0] + lis[1]
    total_prop = total/jeopardy.shape[0]
    high_value_expected = high_value_count * total_prop
    low_value_expected = low_value_count * total_prop
    observed = np.array([lis[0], lis[1]])
    expected = np.array([high_value_expected, low_value_expected])
    chi_squared.append(chisquare(observed, expected))
    
print(chi_squared)

[Power_divergenceResult(statistic=0.7721754541426672, pvalue=0.3795448984353682), Power_divergenceResult(statistic=0.7721754541426672, pvalue=0.3795448984353682), Power_divergenceResult(statistic=0.7721754541426672, pvalue=0.3795448984353682), Power_divergenceResult(statistic=0.7721754541426672, pvalue=0.3795448984353682), Power_divergenceResult(statistic=1.7665714689958703, pvalue=0.18380695652645074)]


None of the five terms shows a significant difference in usage between high-value and low-value questions. In addition, the frequencies for all of the terms are less than 10, and the chi-squared test is unreliable with counts this small. It would be better to run the test only on terms with higher frequencies.
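
To see where these statistics come from, the calculation can be reproduced with the standard library alone. The sketch below computes the expected counts, the chi-squared statistic, and the p-value for one degree of freedom, where the upper-tail probability reduces to `erfc(sqrt(x / 2))`. The dataset totals in the example (8,000 high-value and 12,000 low-value questions) are made-up placeholders, not the notebook's actual counts, and the function names are hypothetical.

```python
import math

def chi_squared_term(high_obs, low_obs, high_total, low_total):
    """Chi-squared statistic for one term's counts against the dataset split."""
    # Share of all questions that contain the term.
    share = (high_obs + low_obs) / (high_total + low_total)
    expected = [high_total * share, low_total * share]
    observed = [high_obs, low_obs]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return statistic, expected

def p_value_one_dof(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))

# Hypothetical totals; the observed counts (1, 5) mirror the last term above.
statistic, expected = chi_squared_term(1, 5, high_total=8000, low_total=12000)
print(expected, statistic, p_value_one_dof(statistic))
```

Passing the first statistic from the output above (0.7721754541426672) to `p_value_one_dof` gives the same p-value that `chisquare` reported, about 0.3795. The expected counts in this example are also small, which is the kind of case where the chi-squared approximation is usually considered unreliable.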

## Conclusions and future directions

In this project, we analyzed a subset of the Jeopardy dataset and used a chi-squared test to investigate strategies for winning the game.

Here are some potential next steps:
   + Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long.
   + Perform the chi-squared test across more terms to see what terms have larger differences. We need to select terms that have high frequencies across the dataset.
   + Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
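
As a rough starting point for the last idea, the sketch below measures overlap on two-word phrases (bigrams) instead of single words. The helper names and the sample questions are hypothetical, and the questions are assumed to be already normalized and sorted by air date.

```python
def bigrams(text):
    """Return the consecutive word pairs in a normalized question."""
    words = text.split()
    return list(zip(words, words[1:]))

def phrase_overlap(questions):
    """Share of each question's bigrams already seen in earlier questions."""
    seen = set()
    overlaps = []
    for text in questions:
        pairs = bigrams(text)
        repeated = sum(1 for p in pairs if p in seen)
        seen.update(pairs)
        overlaps.append(repeated / len(pairs) if pairs else 0)
    return overlaps

questions = [
    "largest planet in our solar system",
    "smallest planet in our solar system",
]
print(phrase_overlap(questions))  # [0.0, 0.8]
```

Four of the five bigrams in the second question already occur in the first, so it scores 0.8. On real data, common pairs such as "in our" would still need filtering.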