# Winning Jeopardy

# Introduction

In this project, suppose we want to compete on Jeopardy, a popular U.S. TV game show in which participants answer questions to win money, and we are looking for any edge that could help us win. We will work with a dataset of Jeopardy questions to look for patterns in the questions that could help. The dataset can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file); each row represents a single question from a single episode of Jeopardy. The accompanying data dictionary is as follows:

* `Show Number` -- the Jeopardy episode number of the show this question was in.
* `Air Date` -- the date the episode aired.
* `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* `Category` -- the category of the question.
* `Value` -- the number of dollars answering the question correctly is worth.
* `Question` -- the text of the question.
* `Answer` -- the text of the answer.

# Data Exploration and Cleaning

First, we load the dataset and take a quick look at it.

In [1]:
# importing the libraries to be used
import numpy as np
import pandas as pd

In [2]:
# Reading in the csv file into a pandas DataFrame
jeopardy = pd.read_csv('jeopardy.csv')

In [3]:
# First five rows
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
# Column names
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have a leading space, which should be removed. This is done in the next code block.

In [5]:
# Looping through the column names to strip the whitespaces at the start, if any
cleaned_columns = []
for column in jeopardy.columns:
    cleaned_columns.append(column.strip())
    
# Update the column names with the new ones
jeopardy.columns = cleaned_columns

In [6]:
# Checking after cleaning the column names
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
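
As an aside, the same cleanup can be sketched without an explicit loop by using the string accessor on the column index. The `toy` frame below is a made-up stand-in for `jeopardy`, used only so the snippet runs on its own.

```python
import pandas as pd

# Toy frame whose headers mimic the leading spaces seen in the raw CSV
toy = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round'])

# Index.str.strip() removes leading and trailing whitespace from every label
toy.columns = toy.columns.str.strip()

print(list(toy.columns))
```

This is only an alternative spelling of the loop above; both approaches leave the data itself untouched.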

In [7]:
# Overview of the columns in the dataframe
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


Some of the column data types appear to be incorrect. The `Air Date` column ought to be of the `datetime` dtype and the `Value` column ought to be of the `int` dtype. Both are handled below.

## Normalizing Text

Before analysis, we need to normalize the text columns (primarily, the `Question` and `Answer` columns). All words are converted to lowercase and punctuation is removed. This can be achieved with a user-defined function.

In [8]:
# Creating a function to normalize the text
import re

def normalize_text(text):
    text = text.lower()
    # Raw strings avoid invalid escape sequence warnings for \W and \s
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

In [9]:
# Testing out the custom function first before applying to the columns
test = normalize_text('Hello, how.are you?')
print(test)

hello how are you 
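
To see what this normalization does to trickier strings, here is a small sketch that restates the same function and runs it on a few inputs taken from the rows shown earlier. It assumes nothing beyond the standard `re` module.

```python
import re

def normalize_text(text):
    # Same steps as the function above: lowercase, then collapse
    # runs of non-word characters and whitespace into single spaces
    text = text.lower()
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

samples = ["McDonald's", 'No. 2: 1912 Olympian;', 'EPITAPHS & TRIBUTES']
normalized = [normalize_text(s) for s in samples]
for raw, clean in zip(samples, normalized):
    print(repr(raw), '->', repr(clean))
```

Two side effects are worth knowing about: an apostrophe splits a word into two tokens (`mcdonald s`), and trailing punctuation leaves a trailing space, which a later `.split()` call ignores.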


In [10]:
# Normalize the `Question` column and assign the cleaned questions to a new column
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)

# Normalize the `Answer` column and assign the cleaned answers to a new column
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

In [11]:
# Checking the dataframe after cleaning
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonald s
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


## Converting the `Value` and `Air Date` Columns' Data Types

After normalizing the text columns, a few columns still need their dtypes converted, as mentioned earlier. The `Value` column should be numeric so that it is easier to manipulate. The dollar sign has to be removed before the column can be converted from the object dtype to a numeric dtype like int.

The `Air Date` column should be of the datetime dtype, not the object dtype.

In [12]:
# Creating a custom function to normalize dollar values:
def normalize_values(text):
    # Raw string avoids an invalid escape sequence warning for \s
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

In [13]:
# Testing out the custom function first before applying over to the `Value` column
test_1 = normalize_values('$2,510')
print(test_1)
print(type(test_1))

test_2 = normalize_values('None')
print(test_2)
print(type(test_2))

2510
<class 'int'>
0
<class 'int'>


In [14]:
# Normalizing the `Value` column by applying the custom function
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
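
For comparison, here is a hedged sketch of a vectorized version of the same conversion. The `values` Series is a made-up stand-in for the `Value` column, and treating unparseable entries as 0 simply mirrors the convention of `normalize_values`.

```python
import pandas as pd

# Toy values in the same shape as the `Value` column, including the
# 'None' string used in the test above
values = pd.Series(['$200', '$2,510', 'None'])

# Drop dollar signs and thousands separators, then coerce anything that
# is still not a number to NaN instead of raising an error
numeric = pd.to_numeric(values.str.replace(r'[$,]', '', regex=True), errors='coerce')

# Match the notebook's convention of treating unparseable values as 0
clean = numeric.fillna(0).astype(int)
print(clean.tolist())
```

One design difference: this version strips only `$` and `,`, whereas `normalize_values` strips every character that is not a letter, digit, or whitespace.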

In [15]:
# Converting the dtype of the `Air Date` to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [16]:
# Checking out the dataframe after all the normalization
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


Both conversions worked: `Air Date` is now of the `datetime64[ns]` dtype and `clean_value` is of the `int64` dtype.

# Analysis of Past Questions

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
* How often is the answer deducible from the question?
* How often are new questions repeats of older questions?

The second question can be answered by checking how often complex words with six or more characters reoccur. The first question can be answered by checking how many times words in the answer also occur in the question. The first question is tackled first, in the next section.

## Deducing Answers from the Questions

In [17]:
# Creation of custom function to check how many times words in the answer also occur in the question, expressed as a ratio of
# the number of words in the answer that appear in the question to the length of the answer
def deduce(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count/len(split_answer)
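
To make the ratio concrete, here is a small sketch that restates the same logic and runs it on three made-up rows. The rows are hypothetical and are written as plain dicts rather than DataFrame rows so the snippet runs on its own.

```python
def deduce(row):
    # Same logic as the function above, restated so the example is standalone
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Hypothetical rows for illustration only
rows = [
    {'clean_answer': 'the grand canyon',
     'clean_question': 'this grand landmark was carved by a river'},
    {'clean_answer': 'arizona',
     'clean_question': 'the city of yuma is in this state'},
    {'clean_answer': 'the',
     'clean_question': 'the most common english word'},
]
ratios = [deduce(r) for r in rows]
print(ratios)  # [0.5, 0.0, 0]
```

In the first row, one of the two remaining answer words (`grand`) appears in the question, giving 0.5. The last row shows the guard against dividing by zero when the answer is empty after `the` is removed.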

In [18]:
# Applying the custom function to every row in the dataframe
jeopardy['answer_in_question'] = jeopardy.apply(deduce, axis=1)

In [19]:
# Calculating the mean of the new `answer_in_question` column
jeopardy['answer_in_question'].mean()

0.06294645581984949

The mean ratio is very low, at about 0.063: on average, only about 6% of the words in an answer also appear in its question. The answer can therefore rarely be deduced from the question alone, so it would be unwise to skip studying before the Jeopardy game. A better strategy might be to spend more time on general knowledge. The next question is whether studying past questions helps at all: how often are new questions repeats of older questions?

## Recycled Questions

We want to investigate how often new questions are repeats of older ones. We cannot answer this fully, since we are working with only about 10% of the full Jeopardy question dataset, but it is still worth investigating. We can check whether questions are recycled by determining whether the terms in each question have been used previously. We only look at words with six or more characters, to filter out words like `the` and `than`, which are commonly used but are not helpful in distinguishing questions.

In [20]:
question_overlap = []
terms_used = set()

# Sorting the dataframe by Air Date, in ascending order
jeopardy = jeopardy.sort_values('Air Date')

# Looping through each row of the dataframe
for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        split_question = [q for q in split_question if len(q) > 5]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count += 1
        for word in split_question:
            terms_used.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
        question_overlap.append(match_count)
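
The loop above can be illustrated on a handful of made-up questions. This is a minimal sketch of the same idea, assuming the questions are already cleaned and sorted by air date; the example sentences are hypothetical.

```python
# Hypothetical cleaned questions, already in air-date order
questions = [
    'this ancient egyptian pharaoh ruled for decades',
    'an egyptian pharaoh built this pyramid',
    'ancient pyramid',
    'a fast red car',
]

terms_used = set()
question_overlap = []
for question in questions:
    # Keep only words with six or more characters
    terms = [w for w in question.split() if len(w) > 5]
    seen_before = sum(1 for w in terms if w in terms_used)
    terms_used.update(terms)
    # Questions with no long words get an overlap of 0
    question_overlap.append(seen_before / len(terms) if terms else 0)

print(question_overlap)
```

The first question has no earlier terms to match, the second reuses two of its three long words, the third reuses both, and the last has no long words at all. Early questions therefore score low simply because little has been seen yet.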

In [21]:
jeopardy['question_overlap'] = question_overlap

In [22]:
jeopardy['question_overlap'].mean()

0.7197989717809739

The mean of about 0.72 indicates that, on average, about 72% of the terms with six or more characters in a question have already appeared in older questions. This suggests that questions may be similar to one another, and that answers may be similar too, so studying past questions may be worthwhile. However, this analysis covers only a small set of questions and looks at single terms rather than phrases, so it does not show that whole questions are recycled.

## Low Value vs High Value

Now, suppose we only want to study high value questions rather than low value questions, since this would maximise earnings on Jeopardy. Questions can be split into two categories:
* Low value -- Any row where `Value` is 800 or less.
* High value -- Any row where `Value` is greater than 800.

We can then loop through each of the terms collected from the questions in the previous section, `terms_used`, and:

* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find expected counts.
* Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
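
The steps listed above can be worked through by hand for a single term. The sketch below uses only the standard library, and every count in it is hypothetical, chosen to keep the arithmetic easy to follow rather than taken from the dataset. The helper name `chi_squared_two_bins` is also made up for this illustration.

```python
import math

def chi_squared_two_bins(observed, expected):
    # Pearson statistic: sum over bins of (observed - expected)^2 / expected
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With two bins there is one degree of freedom, where the chi-squared
    # survival function reduces to erfc(sqrt(statistic / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical totals for illustration only
total_rows = 1000
high_value_rows = 300
low_value_rows = 700

# Hypothetical term seen in 15 high value and 5 low value questions
observed = (15, 5)
share = sum(observed) / total_rows       # 20 / 1000 = 0.02
expected = (share * high_value_rows,     # about 6 expected high value hits
            share * low_value_rows)      # about 14 expected low value hits

statistic, p_value = chi_squared_two_bins(observed, expected)
print(statistic, p_value)
```

Here the statistic is 81/6 + 81/14, roughly 19.29, because the term shows up in high value questions far more often than its overall share would predict.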

In [23]:
# Adding a column to identify whether a question is low or high value based on above criteria
jeopardy['high_value'] = jeopardy['clean_value'].apply(lambda x: 1 if x > 800 else 0)

In [24]:
# Checking first 5 rows
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0.0,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,0.0,0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0,0.0,0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this n...,the grand canyon,200,0.0,0.5,0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.0,0.0,0


In [25]:
# Function to count how many high value and low value questions a single term appears in
def count_lowhigh_times(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row['clean_question'].split():
            if row['high_value'] == 1:
                high_count +=1
            else:
                low_count += 1
    return high_count, low_count
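
Because the function above scans the whole dataframe once per word, counting many words is slow. One possible alternative, sketched here under the assumption that the questions and their high value flags are available as simple pairs, is to tally every word in a single pass. The name `count_terms_by_value` and the sample rows are hypothetical.

```python
from collections import Counter

def count_terms_by_value(rows):
    # rows: iterable of (clean_question, high_value) pairs
    high_counts, low_counts = Counter(), Counter()
    for question, high_value in rows:
        # set() so a word repeated inside one question is counted once,
        # matching the "word occurs in the question" check above
        for word in set(question.split()):
            if high_value == 1:
                high_counts[word] += 1
            else:
                low_counts[word] += 1
    return high_counts, low_counts

# Hypothetical rows standing in for
# zip(jeopardy['clean_question'], jeopardy['high_value'])
rows = [
    ('this monarch ruled england', 1),
    ('the monarch butterfly migrates south', 0),
    ('england borders this country', 0),
]
high_counts, low_counts = count_terms_by_value(rows)
print(high_counts['monarch'], low_counts['monarch'])
```

A `Counter` returns 0 for words it has never seen, so looking up any term afterwards is safe.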

In [26]:
# Randomly pick ten elements from the `terms_used` set from before and append to a list
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

# Determining the observed counts in high and low value questions for each of the 10 randomly picked terms
observed = []
for term in comparison_terms:
    result = count_lowhigh_times(term)
    observed.append(result)
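
Note that `choice` draws with replacement, so the same term could be picked more than once, and the selection changes on every run. A hedged sketch of one alternative is shown below, using a seeded generator and `sample`. The small `terms_used` set and the seed value are placeholders for illustration.

```python
import random

# Hypothetical stand-in for the `terms_used` set built earlier
terms_used = {'monarch', 'england', 'butterfly', 'pyramid', 'pharaoh', 'ancient'}

# A dedicated, seeded generator makes the draw repeatable between runs;
# the seed value itself is arbitrary
rng = random.Random(1)

# sample() needs a sequence, and sorting the set first gives it a stable
# order; it draws without replacement, so no term is repeated
comparison_terms = rng.sample(sorted(terms_used), 3)
print(comparison_terms)
```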

In [27]:
observed

[(2, 1),
 (1, 1),
 (1, 2),
 (0, 1),
 (4, 13),
 (1, 0),
 (0, 1),
 (2, 0),
 (2, 0),
 (0, 1)]

With observed counts obtained, the expected counts need to be computed before arriving at a chi-squared value.

In [28]:
# Determining the number of high value and low value questions
high_value_count = jeopardy['high_value'].value_counts()[1]
low_value_count = jeopardy['high_value'].value_counts()[0]

In [29]:
from scipy.stats import chisquare

# List to store all 10 chi-squared values for the ten randomly picked terms
chi_squared = []
for obs in observed:
    total = sum(obs)
    total_prop = total/jeopardy.shape[0]
    high_expected = total_prop * high_value_count
    low_expected = total_prop * low_value_count
    # Separate names so the `observed` list being looped over is not overwritten
    observed_counts = np.array([obs[0], obs[1]])
    expected_counts = np.array([high_expected, low_expected])
    chi_squared.append(chisquare(observed_counts, expected_counts))

chi_squared

[Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.21978793356318777, pvalue=0.6392015498682628),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

# Conclusion

Testing at the 5% significance level, two of the ten terms showed a significant difference in usage between high value and low value questions. However, each of those two terms appeared in only two questions, and nearly all of the observed counts were below 5, so the chi-squared test is not reliable here. It would be better to rerun the test using only terms with higher frequencies.
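
As a starting point for that follow-up, terms could be filtered by how many questions they appear in before testing. The sketch below is one possible way to do this. The function name `frequent_terms`, the `min_count=5` threshold, and the sample questions are all assumptions, and since the usual rule of thumb concerns expected counts per bin, a higher threshold may be needed in practice.

```python
from collections import Counter

def frequent_terms(questions, min_count=5, min_length=6):
    # Count the number of questions each sufficiently long word appears in
    counts = Counter()
    for question in questions:
        counts.update({w for w in question.split() if len(w) >= min_length})
    # Keep only terms seen in at least `min_count` questions
    return {term: n for term, n in counts.items() if n >= min_count}

# Hypothetical cleaned questions
questions = ['this pharaoh built a pyramid'] * 5 + ['a famous painter'] * 2
print(frequent_terms(questions))
```

Only `pharaoh` and `pyramid` survive the filter here, because `famous` and `painter` appear in just two questions.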