# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download here (https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). Here's the beginning of the file:

As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number - the Jeopardy episode number
- Air Date - the date the episode aired
- Round - the round of Jeopardy
- Category - the category of the question
- Value - the number of dollars the correct answer is worth
- Question - the text of the question
- Answer - the text of the answer

## Jeopardy Questions

### Data Exploration

In [1]:
import pandas as pd

jeopardy = pd.read_csv("jeopardy.csv")

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

As the output above shows, every column name except Show Number starts with a space. We need to strip that whitespace so we can refer to the columns by their clean names.

In [15]:
# With only seven columns, it is simple to set the names by hand.
# For a wider table, jeopardy.columns = jeopardy.columns.str.strip() does the same job.
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
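
The `str.strip()` alternative mentioned in the comment can be sketched on a toy frame. The frame below is made up for illustration and only mimics the leading-space headers seen in jeopardy.csv.

```python
import pandas as pd

# Toy frame (hypothetical data) with the same kind of leading-space headers
toy = pd.DataFrame(
    [[4680, "2004-12-31", "$200"]],
    columns=["Show Number", " Air Date", " Value"],
)

# Strip leading and trailing whitespace from every column label at once
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

This scales to any number of columns and does not require knowing the names in advance.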

In [16]:
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel


## Normalizing Text

Based on the dataframe above, we can see that the text is a mix of upper and lower case and contains punctuation. We need to normalize it so that the same word isn't treated as two different words. For example, Don't and don't should be considered the same word, not different ones.

To achieve this, we'll write a function that lowercases a string and removes its punctuation, then apply it to the Question and Answer columns.

In [21]:
import re
def lowercase(string):
    '''
    takes a string and returns it in lowercase with its punctuation removed
    '''
    
    string=string.lower()
    res = re.sub(r'[^\w\s]', '', string)
    
    return res

In [40]:
jeopardy['clean_question']=jeopardy['Question'].apply(lowercase)
jeopardy['clean_answer']=jeopardy['Answer'].apply(lowercase)

In [41]:
jeopardy['clean_question'].head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [42]:
jeopardy['clean_answer'].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
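
To see what this normalization does to individual strings, here is a small standalone sketch of the same regex. The helper name `normalize_text` and the sample strings are chosen for illustration only.

```python
import re

def normalize_text(text):
    """Lowercase text and drop every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

samples = ["Don't", "don't", "McDonald's", "Crate & Barrel"]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)

# Removing "&" leaves two consecutive spaces; split() with no argument collapses them
print(cleaned[3].split())
```

Note that removing a standalone symbol can leave a double space. Calling `split()` with no argument handles this, whereas `split(" ")` would produce an empty string between the two words.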

## Normalizing Columns

Even though we normalized the text columns, there are also some other columns to normalize.

The Value column should be numeric so that it is easier to work with. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should be a datetime rather than a string, for the same reason.

In [38]:

def normalize_values(text):
    '''
    takes in a string
    remove any punctuation in the string
    convert the string to an integer
    assign 0 instead if the conversion has an error
    return the integer.
    '''
        
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)  # raw string avoids an invalid-escape warning for \s
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

In [44]:
jeopardy['clean_value']=jeopardy['Value'].apply(normalize_values)
jeopardy['clean_value'].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64
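
As an alternative to a row-by-row function, the same conversion could be done with vectorized pandas calls. This is only a sketch: the sample values are made up for illustration, and the choice to map unparseable entries to 0 simply mirrors the function above.

```python
import pandas as pd

# Hypothetical sample of raw Value strings
values = pd.Series(["$200", "$1,000", "None", "$2,000"])

# Drop dollar signs and thousands separators
stripped = values.str.replace(r"[$,]", "", regex=True)

# Anything that still is not a number becomes NaN, then 0
clean = pd.to_numeric(stripped, errors="coerce").fillna(0).astype(int)

print(clean.tolist())
```

Keeping the NaN values instead of filling them with 0 would make it easier to tell missing values apart from genuine zero-dollar entries.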

In [45]:
#converting air date to datetime

jeopardy['Air Date']=pd.to_datetime(jeopardy['Air Date'])

In [46]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question and come back to the second.

In [53]:
def count_matches(row):
    '''
    returns the proportion for clean_question
    & clean_answer with matching terms
    '''
    # split both strings on whitespace
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    # counter for answer words that also appear in the question
    match_count = 0

    # drop the first 'the', since it is the most common word in the answer column
    if 'the' in split_answer:
        split_answer.remove('the')

    # return 0 for an empty answer to avoid dividing by zero below
    if len(split_answer) == 0:
        return 0

    # count the answer words that also occur in the question
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [54]:
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [55]:
#finding the mean of answer_in_question

jeopardy["answer_in_question"].mean()

0.05900196524977763

On average, only about 6% of the words in an answer also appear in its question. That is too little overlap to rely on reading the answer out of the question, so the best strategy is to actually study for Jeopardy.
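
To make the per-row score concrete, here is a standalone sketch of the same matching idea on plain strings. The function name `answer_overlap` is hypothetical, the first question is invented for the example, and unlike `count_matches` this version drops every occurrence of "the" rather than only the first.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0.0  # nothing to match, and avoids dividing by zero
    question_words = set(clean_question.split())
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# Invented question in which both answer words appear
full = answer_overlap("this canyon in arizona is called grand", "the grand canyon")
# Question and answer taken from the dataset preview
none = answer_overlap("notorious labor leader missing since 75", "jimmy hoffa")
print(full, none)
```

A score of 1.0 means every answer word is in the question, and 0.0 means none is. The column mean reported above averages these per-row fractions.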

## Recycled Questions

Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

To do this, you can:

- Sort jeopardy in order of ascending air date.
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy.
- Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
 - If it does, increment a counter.
 - Add each word to terms_used.
 
This allows us to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables you to filter out words like the and than, which are commonly used, but don't tell you a lot about a question.

In [56]:
question_overlap=[]
terms_used=set()

jeopardy = jeopardy.sort_values('Air Date')

for row in jeopardy.iterrows():
    row = row[1]
    split_question = row["clean_question"].split(" ")
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0: # to avoid dividing by 0
        match_count /= len(split_question)
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap

In [57]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,adventurous 26th president he was 1st to ride ...,Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0.0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,notorious labor leader missing since 75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,0.0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,washington proclaimed nov 26 1789 this first n...,Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0,0.0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,both ferde grofe the colorado river dug this ...,the Grand Canyon,both ferde grofe the colorado river dug this ...,the grand canyon,200,0.0,0.5
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,depending on the book he could be a jones a sa...,Tom,depending on the book he could be a jones a sa...,tom,200,0.0,0.0


In [58]:
jeopardy['question_overlap'].mean()

0.6876235590919739

On average, about 69% of the long words in a question have already appeared in an earlier question. This figure comes from only a small sample of the full dataset, and it compares single terms rather than phrases, so it is weak evidence on its own. Still, it suggests that the recycling of questions is worth a closer look.
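
A toy run of the same procedure shows how the score behaves: the first question always scores 0 because nothing has been seen yet, and later questions score higher as the set of seen terms grows. The function name `overlap_scores` and the three questions are made up for illustration.

```python
def overlap_scores(questions, min_len=6):
    """For each question, in order, the fraction of long words already seen earlier."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        if words:
            scores.append(sum(w in seen for w in words) / len(words))
        else:
            scores.append(0.0)  # no long words, so there is nothing to compare
        seen.update(words)
    return scores

toy_questions = [
    "this president signed the treaty",
    "the treaty ended this famous conflict",
    "a short one",
]
scores = overlap_scores(toy_questions)
print(scores)
```

In the second question only "treaty" was seen before, out of three long words, so its score is one third.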

## Low Value vs High Value Questions

Let's say we only want to study topics that come up in high-value questions rather than low-value ones. This would help us earn more money on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We will first need to narrow down the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

We'll then be able to loop through each of the terms collected in the previous section, terms_used, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [59]:
def value_category(row):
    '''categorises rows into high or low value
    1 = high value (above 800), 0 = low value (800 or less)'''
    
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
    
jeopardy['high_value'] = jeopardy.apply(value_category, axis=1)

In [60]:
def count_value(word):
    '''counts the value of individual words 
    in the clean question column'''
    
    low_count = 0
    high_count = 0
    for row in jeopardy.iterrows():
        row = row[1]
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
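
Because `count_value` scans the whole dataframe for each word, running it for every term is slow. One possible alternative, sketched here on made-up rows, is to count all terms in a single pass. The function name `count_all_terms` and the toy data are hypothetical.

```python
from collections import Counter

def count_all_terms(rows):
    """rows: iterable of (clean_question, high_value) pairs. Returns (high, low) Counters."""
    high, low = Counter(), Counter()
    for question, is_high in rows:
        # set() so a word counts once per question, like `word in split_question`
        for word in set(question.split()):
            if is_high:
                high[word] += 1
            else:
                low[word] += 1
    return high, low

toy_rows = [
    ("this river flows through egypt", 1),
    ("this river borders texas", 0),
    ("egypt built these monuments", 0),
]
high_counts, low_counts = count_all_terms(toy_rows)
print(high_counts["river"], low_counts["river"])
print(high_counts["texas"], low_counts["texas"])
```

Looking up a term afterwards is a dictionary access, so the cost of the pass over the data is paid only once.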

In [62]:

from random import choice
terms = list(terms_used)  # convert the set once instead of on every draw
comparison_terms = [choice(terms) for _ in range(10)]  # 10 random terms, drawn with replacement

observed_expected = []
for i in comparison_terms:
    result = count_value(i)
    observed_expected.append(result)
    
print(observed_expected)

[(0, 1), (0, 1), (4, 16), (0, 1), (1, 1), (1, 1), (1, 0), (0, 1), (1, 2), (0, 1)]


## Applying the Chi-Squared Test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [63]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.7353581241806549, pvalue=0.39115190605378425),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

In our observed_expected list, most sampled terms appear more often in low-value questions, which is expected because there are more low-value questions than high-value ones. None of the ten p-values is below 0.05 (the smallest is about 0.11), so for this sample we cannot conclude that any term is used differently in high-value and low-value questions. Most of these terms also occur only once or twice, and the chi-squared test is unreliable at such low counts, so it would be more informative to rerun it on terms with higher frequencies.
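
The calculation behind each result can be reproduced by hand for a single term. The sketch below makes one assumption: the group sizes 5734 (high value) and 14265 (low value) are not printed in the notebook. They are inferred from the reported statistics and the 19999 total rows, so treat them as illustrative. With two groups there is one degree of freedom, which allows the p-value to be computed with `math.erfc`.

```python
import math

def chi_squared_two_groups(observed_high, observed_low, n_high, n_low):
    """Chi-squared statistic and p-value for one term across two groups (1 degree of freedom)."""
    total_rows = n_high + n_low
    term_total = observed_high + observed_low
    # Expected counts if the term were spread in proportion to group size
    expected_high = term_total * n_high / total_rows
    expected_low = term_total * n_low / total_rows
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# ASSUMPTION: group sizes inferred from the printed output, not shown in the notebook
N_HIGH, N_LOW = 5734, 14265

stat, p = chi_squared_two_groups(4, 16, N_HIGH, N_LOW)
print(round(stat, 4), round(p, 4))
```

For the term observed 4 times in high-value and 16 times in low-value questions, the expected counts are roughly 5.7 and 14.3, which is why the statistic is small and the p-value is far above 0.05.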

## Popular Categories Per Round

Jeopardy is played in rounds. Here we want to find the most frequent category in each round.

In [64]:
jeopardy['Round'].value_counts(normalize=True)


Jeopardy!           0.495075
Double Jeopardy!    0.488124
Final Jeopardy!     0.016751
Tiebreaker          0.000050
Name: Round, dtype: float64

In [65]:
jeopardy_grp = jeopardy.groupby('Round')  # a plain string key keeps group names as scalars

In [83]:
for i in jeopardy['Round'].unique():
    j_round = jeopardy_grp.get_group(i)
    # share of each category within this round, sorted from most to least frequent
    category_share = j_round['Category'].value_counts(normalize=True)
    top_cat_name = category_share.index[0]  # most frequent category in the round
    top_cat_percentage = round(category_share.iloc[0] * 100, 2)  # .iloc for positional access

    print('{} category make up {}% of the questions in {} round'.format(top_cat_name, top_cat_percentage, i))

WORD ORIGINS category make up 2.39% of the questions in Final Jeopardy! round
LITERATURE category make up 0.36% of the questions in Double Jeopardy! round
TELEVISION category make up 0.35% of the questions in Jeopardy! round
CHILD'S PLAY category make up 100.0% of the questions in Tiebreaker round


Most of the questions in our dataset come from the Jeopardy! and Double Jeopardy! rounds, which together make up about 98% of the data. Although we know the top category for each of these rounds, each accounts for well under 1% of its round's questions, and even the top Final Jeopardy! category accounts for only 2.39%. Focusing on a single category for a specific round therefore isn't a very good strategy.
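
The same per-round summary can be written by iterating over the groups directly. This is a sketch on a made-up frame, so the names `toy` and `top` and the data are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the Round and Category columns
toy = pd.DataFrame({
    "Round": ["Jeopardy!", "Jeopardy!", "Jeopardy!", "Double Jeopardy!", "Double Jeopardy!"],
    "Category": ["HISTORY", "HISTORY", "SCIENCE", "LITERATURE", "LITERATURE"],
})

top = {}
for round_name, group in toy.groupby("Round"):
    # Shares are sorted in descending order, so the first entry is the top category
    shares = group["Category"].value_counts(normalize=True)
    top[round_name] = (shares.index[0], round(float(shares.iloc[0]) * 100, 2))

print(top)
```

Collecting the results in a dictionary makes it easy to compare rounds or to extend the summary to the top few categories per round.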

## Conclusion

- While we found no guaranteed strategy for winning Jeopardy, it might be worthwhile to look at past questions while preparing.

- In our small sample, there was no significant relationship between any term and high-value questions, so we found no keyword to look out for when preparing for high-value questions.

- There isn't a dominant question category to focus on for any Jeopardy round; it's best to be prepared for as many categories as possible.