# Winning Jeopardy
This project investigates similarities between historical Jeopardy questions and answers to see whether there are patterns that a contestant could exploit as a study guide to improve their chances of winning.

The data set is the first 20,000 rows of a data set posted on Reddit, which can be found here: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

Two main questions we will be trying to answer are:
1) How often do questions contain hints about the answer?
2) How often are questions repeated over time?

In [1]:
# Read in libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re

# Read in data
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


## Clean and prepare the data

In [2]:
# investigate column names
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# Remove spaces from the column names
jeopardy.columns = [col.strip() for col in jeopardy.columns]

In [4]:
# Check data types and check for null values
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [5]:
# Convert date column to datetime format
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [6]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null datetime64[ns]
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.1+ MB


In order to analyze the words in the question and answer columns, we need to normalize them by converting the text to lowercase and removing all punctuation.

In [7]:
def normalize(str_):
    """
    Function takes in a string, converts it to lowercase and removes
    all punctuation.
    """
    # Raw string so that \w and \s reach the regex engine unchanged
    str_ = re.sub(r'[^\w\s]', '', str_.lower())

    return str_

# Using the function above, clean the question and answer columns
# in the data set.
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)

# Check data
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
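
As a quick check of what the normalization does, here is a minimal sketch that applies the same regular expression to a few strings taken from the table above. The helper name `normalize_text` is hypothetical and only mirrors `normalize`.

```python
import re

def normalize_text(text):
    # Lowercase, then drop every character that is not a word character or whitespace
    return re.sub(r"[^\w\s]", "", text.lower())

samples = ["McDonald's", "Jim Thorpe", "No. 2: 1912 Olympian; football star"]
cleaned = [normalize_text(sample) for sample in samples]
print(cleaned)

# Hyphens are removed too, so hyphenated words are merged into one token
print(normalize_text("ALL-TIME"))
```

Note that because punctuation is simply deleted, hyphenated or apostrophized words become a single token, which affects later word matching.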


The value column is also problematic because it is in string format and has punctuation in it. We will write a function to clean that as well.

In [8]:
# Create function to clean the Value column

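def norm_value(value):
    """
    Function takes in a string, removes '$' and ',' and casts string to an integer.
    The string 'None' (no dollar value) is converted to 0.
    """

    if value == 'None':
        value = 0
    else:
        # lstrip('$') removes the leading dollar sign; no backslash escape is needed
        value = value.lstrip('$').replace(',', '')

    return int(value)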

# Apply function and create new column
jeopardy['clean_value'] = jeopardy['Value'].apply(norm_value)
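
To see the three kinds of input the value cleaning has to handle, the sketch below runs a stand-alone version on sample strings. The helper name `parse_value` and the sample list are illustrative assumptions, not part of the notebook.

```python
def parse_value(raw):
    # 'None' marks a clue without a dollar amount; treat it as 0
    if raw == "None":
        return 0
    # Drop the leading dollar sign and any thousands separators
    return int(raw.lstrip("$").replace(",", ""))

sample_values = ["$200", "$2,000", "None"]
parsed = [parse_value(raw) for raw in sample_values]
print(parsed)
```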

## Analysis
### Repeat words
Let's start with investigating whether it is possible to figure out the answer to a question from the words in the question. We will investigate how often words from the question are repeated in the answer.

In [9]:
def repeat_words(row):
    """
    Function takes in a dataframe row and splits the clean_answer and
    clean_question columns into lists of words. It then removes the first
    occurrence of the word 'the' from the answer and returns the proportion
    of answer words that are in the question.
    """

    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    # Counter to count how many words are in both question and answer
    match_count = 0

    # Remove the first occurrence of 'the' (text is already lowercase)
    if 'the' in split_answer:
        split_answer.remove('the')

    # Avoid dividing by zero when the answer has no words left
    if len(split_answer) == 0:
        return 0

    # Return proportion of matched words
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Apply to dataframe
jeopardy['answer_in_question'] = jeopardy.apply(repeat_words, axis=1)

In [10]:
# Calculate the mean proportion of repeat words
jeopardy['answer_in_question'].mean()

0.05900196524977763

In [11]:
# Calculate how many questions have relatively high proportion of repeat words
jeopardy['answer_in_question'].value_counts().sort_index(ascending=False)

1.000000      124
0.875000        1
0.800000        2
0.750000       17
0.666667      104
0.600000        9
0.571429        2
0.500000     1448
0.444444        1
0.428571        2
0.400000       26
0.350000        1
0.333333      494
0.300000        2
0.285714        7
0.250000      155
0.200000       68
0.181818        2
0.166667       27
0.142857       21
0.125000        9
0.111111        2
0.000000    17475
Name: answer_in_question, dtype: int64

In [12]:
# Proportion of answers with 50% or more words in the question
jeopardy['answer_in_question'][jeopardy['answer_in_question'] >= 0.5].count() / jeopardy['answer_in_question'].count()

0.08535426771338567

About 8.5% of the answers have half or more of their words repeated in the question, and the mean proportion of repeated words is 5.9%. Most answers (17,475 of 19,999, roughly 87%) share no words with their question, so the mean is driven by a small group of answers with a high proportion of repeated words. That group is large enough to be worth investigating: if the type of question can be identified, it could inform studying.
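
To make the proportion concrete, here is a small sketch of the same calculation on plain strings. The function name `answer_overlap` and the three question/answer pairs are made up for illustration; they are not rows from the data set.

```python
def answer_overlap(clean_question, clean_answer):
    answer_words = clean_answer.split()
    question_words = set(clean_question.split())

    # Drop the first 'the' from the answer, as the notebook function does
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0

    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Hypothetical, already-normalized examples
examples = [
    ("this new york city bridge opened in 1883", "the brooklyn bridge"),
    ("the state of new york is named for this duke", "duke of york"),
    ("this planet is closest to the sun", "mercury"),
]
results = [answer_overlap(question, answer) for question, answer in examples]
print(results)
```

The first pair scores 0.5 because only "bridge" is shared, the second scores 1.0, and the third scores 0.0.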

### Repeat questions
Next let's see how often questions are repeated. We only have about 10% of the total data, but we will see what conclusions we can draw. We will look at how often words are reused across questions over time. To focus on meaningful words and ignore small common words like articles, we will only consider words with more than six characters.

In [13]:
# Sort rows by air date
jeopardy_sorted = jeopardy.sort_values('Air Date')

In [14]:
# Create an empty set to house all the unique words used
terms_used = set()

def repeat_questions(row):
    """
    Function takes in a row and calculates the proportion of words
    longer than six characters in the clean_question column which are
    already in the terms_used set. Words not yet present in terms_used
    are added to the set.

    Function returns the proportion of words repeated as a float, or
    None (NaN in the dataframe) if the question has no words longer
    than six characters.
    """
    
    split_question = row['clean_question'].split(' ')
    split_question = [word for word in split_question if len(word) > 6]
        
    match_count = 0
    
    for word in split_question:  
        if word in terms_used:
            match_count += 1
        else:
            terms_used.add(word)
                
    if len(split_question) > 0:
        return match_count / len(split_question)

# Apply function to sorted df
jeopardy_sorted['question_overlap'] = jeopardy_sorted.apply(repeat_questions, axis=1)
# calculate mean proportion of repeated words
jeopardy_sorted['question_overlap'].mean()

0.6649036076777254

In [15]:
# Calculate how often proportions above 0 appear in the data set
jeopardy_sorted['question_overlap'][jeopardy_sorted['question_overlap'] > 0].value_counts().sort_index(ascending=False)

1.000000    6853
0.950000       1
0.923077       1
0.916667       3
0.909091       4
0.900000       4
0.888889       4
0.875000      21
0.866667       1
0.857143      77
0.846154       1
0.833333     228
0.818182       7
0.800000     648
0.785714       1
0.777778      14
0.769231       2
0.764706       1
0.750000    1320
0.733333       2
0.727273      13
0.714286      79
0.705882       1
0.700000      10
0.692308       7
0.666667    2040
0.642857       2
0.636364       5
0.625000      30
0.615385       9
0.600000     546
0.583333       3
0.571429      58
0.555556      14
0.545455       6
0.533333       1
0.500000    2725
0.461538       3
0.454545       5
0.444444       7
0.428571      28
0.416667       2
0.400000     255
0.384615       2
0.375000      11
0.363636       2
0.333333    1014
0.285714      10
0.250000     391
0.222222       1
0.200000      95
0.166667      20
0.142857       5
Name: question_overlap, dtype: int64

There are 6,853 questions in which every word longer than six characters had already appeared in an earlier question. This is a significant number considering we only have about 20,000 rows. On average, 66% of the words longer than six characters in a question had been used before. This measures the reuse of individual words rather than of whole questions, but it suggests that studying old questions would be valuable to a contestant.
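
The overlap score depends on the order in which questions are processed, which is why the rows are sorted by air date first. The sketch below reproduces the logic on a short list of made-up questions; `overlap_over_time` and the sample questions are hypothetical.

```python
def overlap_over_time(questions, min_length=7):
    seen = set()
    proportions = []
    for question in questions:
        # Keep only words with more than six characters
        words = [word for word in question.split() if len(word) >= min_length]
        if not words:
            proportions.append(None)
            continue
        repeated = 0
        for word in words:
            if word in seen:
                repeated += 1
            else:
                seen.add(word)
        proportions.append(repeated / len(words))
    return proportions

# Hypothetical questions, oldest first
questions = [
    "this english playwright wrote hamlet",
    "this english composer wrote the planets",
    "name the playwright",
    "who is he",
]
proportions = overlap_over_time(questions)
print(proportions)
```

The earliest question always scores 0 because nothing has been seen yet, and a question with no long words gets no score at all, so both cases should be kept in mind when reading the mean.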

### Repeat words in high value questions
We will now look at how often words from the terms_used set appear in high value (over $800) questions. We will start by labeling rows as high value or not, and then count how often a given word occurs in high and low value questions.

In [16]:
# Create function to label high value and low value questions in a new column
def q_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    
    return value

# Apply function and create labels
jeopardy_sorted['high_value'] = jeopardy_sorted.apply(q_value, axis=1)

In [17]:
# Create function to count the number of times a word occurs in high and low
# value questions in the data set.
def word_value(word):
    """
    Function takes in a word and returns the number of times the word occurs in
    both high value and low value questions (function returns two values).
    """
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy_sorted.iterrows():
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
                
    return high_count, low_count
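
Because `word_value` scans every row for each word, it becomes slow when many words are checked. One possible alternative, sketched here on toy data, is to count all words in a single pass and look the counts up afterwards. The function `count_by_value` and the sample rows are assumptions for illustration only.

```python
from collections import Counter

def count_by_value(rows):
    """rows: iterable of (clean_question, high_value) pairs."""
    high_counts, low_counts = Counter(), Counter()
    for question, high_value in rows:
        target = high_counts if high_value == 1 else low_counts
        # A set counts each word once per question, like `word in split_question`
        target.update(set(question.split()))
    return high_counts, low_counts

# Hypothetical rows: (normalized question, high_value label)
rows = [
    ("famous french painter", 1),
    ("french capital city", 0),
    ("painter painter of water lilies", 0),
]
high_counts, low_counts = count_by_value(rows)
print(high_counts["french"], low_counts["french"])
print(high_counts["painter"], low_counts["painter"])
```

A `Counter` returns 0 for words it has never seen, so looking up a word that occurs only in low value questions needs no special handling.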

### Calculate chi squared value for each word
For efficiency purposes, we will sample 10 words from our set of words stored as terms_used. We can then calculate the number of times the word occurs in both high and low value questions and calculate a chi squared score using the proportion of high and low value questions as our expected values. If we have a high enough chi squared score we can see which questions are more valuable to study.
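
The calculation can be written out by hand for a single word. The sketch below uses invented counts (1,000 questions, 300 of them high value, and a word seen 12 times in high value and 8 times in low value questions) and a hypothetical helper `chi_squared_1df`; with two categories there is one degree of freedom, so the p-value can be obtained from the complementary error function.

```python
import math

def chi_squared_1df(observed, expected):
    # Sum of (observed - expected)^2 / expected over both categories
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical totals and observed counts
total_questions = 1000
high_value_count = 300
low_value_count = 700
observed = (12, 8)

# Expected counts if the word were spread evenly across all questions
total_prop = sum(observed) / total_questions
expected = (total_prop * high_value_count, total_prop * low_value_count)

statistic, p_value = chi_squared_1df(observed, expected)
print(expected, statistic, p_value)
```

Here the word is expected about 6 times in high value questions but appears 12 times, which gives a statistic of roughly 8.57 and a p-value well below 0.05.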

In [18]:
# Randomly sample 10 words from terms_used set
import random
# random.sample needs a sequence; passing a set raises a TypeError in Python 3.11+
comparison_terms = random.sample(list(terms_used), 10)
observed_expected = []

# Iterate through sample list and feed into word_value function defined above to
# return the counts of the word in both high and low value questions
for word in comparison_terms:
    observed_expected.append(word_value(word))

In [19]:
from scipy.stats import chisquare
# Get the total counts of both high value and low value questions in the data
high_value_count = jeopardy_sorted[jeopardy_sorted['high_value'] == 1].shape[0]
low_value_count = jeopardy_sorted[jeopardy_sorted['high_value'] == 0].shape[0]

# Empty list to store the chi_squared values.
chi_squared = []

# Iterate through observed counts, sum the observed counts, calculate the 
# proportion of observations to the total number of questions; then calculate
# the expected proportions by multiplying the total proportion by the number of 
# high value questions and the number of low value questions.
for observation in observed_expected:
    total = sum(observation)
    total_prop = total / jeopardy_sorted.shape[0]
    high_value_expected = total_prop * high_value_count
    low_value_expected = total_prop * low_value_count
    expected = (high_value_expected,low_value_expected)
    
    values = chisquare(observation, expected)
    chi_squared.append(values)
    
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

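Only one of the ten results has a significant p-value (0.0257), with a chi-squared statistic of almost 5 (4.976). If the sample is representative of the whole population, then perhaps up to 10% of words may have a significant result. However, the same statistic (0.402) appears for six of the ten words, which suggests that most of the sampled words occur only a handful of times, and the chi-squared test is unreliable when expected counts are that small. Let's see if we can improve these results by looking at high frequency words in particular.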
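
The repeated statistics in the output can be reproduced with very small counts. The sketch below assumes 5,734 high value and 14,265 low value questions; these totals are inferred from the printed statistics rather than stated in the notebook, and the function names are hypothetical.

```python
def chi_squared_statistic(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumed totals, inferred from the printed statistics (not given in the notebook)
high_value_count = 5734
low_value_count = 14265
total_questions = high_value_count + low_value_count

def expected_counts(observed):
    total_prop = sum(observed) / total_questions
    return (total_prop * high_value_count, total_prop * low_value_count)

# (high value count, low value count) for words that occur once or twice
stats = {}
for observed in [(0, 1), (0, 2), (2, 0)]:
    expected = expected_counts(observed)
    stats[observed] = chi_squared_statistic(observed, expected)
    print(observed, [round(e, 2) for e in expected], round(stats[observed], 4))
```

Under that assumption, a word found in a single low value question gives a statistic of about 0.402, and a word found in just two high value questions gives about 4.976. In both cases the expected counts are far below 5, the usual rule of thumb for trusting a chi-squared test.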

In [27]:
# Making a frequency table for the words could take too long because we have a lot
# of words!
terms_used = list(terms_used)
len(terms_used)

19401

In [39]:
# Some of the words are actually remnants of HTML links to image files.
terms_used[:10]

['strongest',
 'wharton',
 'medicine',
 'bergenbelsen',
 'hrefhttpwwwjarchivecommedia20100706_dj_14jpg',
 'huizong',
 'grandmother',
 'berserk',
 'designate',
 'kendall']

In [37]:
# The image files have very long names so we can look for words with over 25 characters.
count = 0
for word in terms_used:
    
    if len(word) > 25:
        count += 1
        
print(count)

1208


In [40]:
# Remove words over 25 characters to clear out image files
terms_used = [word for word in terms_used if len(word) <= 25]
len(terms_used)

18193
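
An alternative to filtering by length would be to strip HTML tags before removing punctuation, so that link remnants never become words in the first place. The sketch below is one way to do that; the helper names and the sample clue are hypothetical, and the URL is reconstructed from the token shown above as an assumption.

```python
import re

def normalize_plain(text):
    # Same approach as the notebook: lowercase and drop punctuation
    return re.sub(r"[^\w\s]", "", text.lower())

def normalize_without_html(text):
    # Replace anything that looks like an HTML tag with a space first
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Collapse the extra whitespace left behind by removed tags
    return " ".join(text.split())

# Hypothetical clue text containing an image link
raw = ('This <a href="http://www.j-archive.com/media/2010-07-06_DJ_14.jpg" '
       'target="_blank">structure</a> is seen here')

plain_tokens = normalize_plain(raw).split()
clean_text = normalize_without_html(raw)
print(plain_tokens)
print(clean_text)
```

With the plain normalization, the link also leaves behind a shorter token that fuses the `target` attribute with the link text, which a 25-character cutoff would not remove.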

In [65]:
# Initialize dictionary with a count of zero for every word
word_freq = {word: 0 for word in terms_used}

# Create frequency table
def word_count(question):
    # Each word is counted at most once per question
    split_question = set(question.split())

    for word in split_question:
        if word in word_freq:
            word_freq[word] += 1

jeopardy_sorted['clean_question'].apply(word_count)


In [93]:
# Create dataframe from frequency table
word_freq_df = pd.DataFrame.from_dict(word_freq, orient='index')
word_freq_df.rename(columns={0:'count'}, inplace=True)
# Store 20 most common words as a list
common_words = list(word_freq_df['count'].sort_values(ascending=False)[:20].index)
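
The same frequency table could also be built with `collections.Counter` from the standard library, which provides `most_common` directly. This is a minimal sketch on made-up questions; `question_frequency` and its length limits are assumptions that mirror the filters used above.

```python
from collections import Counter

def question_frequency(questions, min_length=7, max_length=25):
    counts = Counter()
    for question in questions:
        # Count each qualifying word once per question
        words = {word for word in question.split()
                 if min_length <= len(word) <= max_length}
        counts.update(words)
    return counts

# Hypothetical normalized questions
questions = [
    "country capital country",
    "country borders",
    "capital letters",
    "country",
]
counts = question_frequency(questions)
top_two = counts.most_common(2)
print(top_two)
```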

We will now rerun our chi squared calculations using the 20 most common words in terms_used. 

In [94]:
# Initialize empty list to hold observed high/low value counts
observed_expected = []

# Iterate through sample list and feed into word_value function defined above to
# return the counts of the word in both high and low value questions
for word in common_words:
    observed_expected.append(word_value(word))

# Empty list to store the chi_squared values.
chi_squared = []

# Iterate through observed counts, sum the observed counts, calculate the 
# proportion of observations to the total number of questions; then calculate
# the expected proportions by multiplying the total proportion by the number of 
# high value questions and the number of low value questions.
for observation in observed_expected:
    total = sum(observation)
    total_prop = total / jeopardy_sorted.shape[0]
    high_value_expected = total_prop * high_value_count
    low_value_expected = total_prop * low_value_count
    expected = (high_value_expected,low_value_expected)
    
    values = chisquare(observation, expected)
    chi_squared.append(values)
    
chi_squared

[Power_divergenceResult(statistic=0.29967829483482744, pvalue=0.5840841713114313),
 Power_divergenceResult(statistic=0.4938111242657224, pvalue=0.4822321568398581),
 Power_divergenceResult(statistic=0.22592591114717697, pvalue=0.6345612982626103),
 Power_divergenceResult(statistic=1.9084254764809114, pvalue=0.16713826420471323),
 Power_divergenceResult(statistic=15.028296538003147, pvalue=0.00010591119029347305),
 Power_divergenceResult(statistic=0.36956355622281933, pvalue=0.5432422635312689),
 Power_divergenceResult(statistic=1.9892622715198827, pvalue=0.15841803672554888),
 Power_divergenceResult(statistic=1.4521478773041714, pvalue=0.22818361990918334),
 Power_divergenceResult(statistic=4.4934633334396965, pvalue=0.03402468062121473),
 Power_divergenceResult(statistic=2.6950209285044178, pvalue=0.10066216730100558),
 Power_divergenceResult(statistic=0.08036666360833383, pvalue=0.7768011333475542),
 Power_divergenceResult(statistic=0.0027964365481966506, pvalue=0.9578264488892704),


The final result was 4 significant words in our list, that is, words whose split between high and low value questions differs significantly from what we would expect. This technique could lead to finding more questions which should be studied; however, it may not yield many, as we only found significant results when the word frequency was particularly high.

We will stop the analysis here.