# Winning Jeopardy

Jeopardy is a popular TV show in the US that has been running for decades. Participants in the show answer questions to win money.

This project aims to find some patterns in the questions to give participants an edge.

The dataset is named jeopardy.csv and contains the first 20000 rows of a full dataset of Jeopardy questions.

In [1]:
%config IPCompleter.greedy=True
import pandas as pd
import re

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [3]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


Several of the column names start with a space. Removing it now avoids mistakes later in the analysis, such as a `KeyError` when looking up `jeopardy['Air Date']`.

In [4]:
new_cols = []
for each in jeopardy.columns:
    x = re.sub('^ ','', each)
    new_cols.append(x)

In [5]:
jeopardy.columns = new_cols
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
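
As a cross-check on the loop above, the same clean-up can be sketched with plain string methods. The list of names below is copied from the `jeopardy.columns` output. Using `str.strip()` instead of `re.sub` is a swapped-in alternative, not the approach used in this notebook, and it also removes trailing whitespace.

```python
# Column names as printed by jeopardy.columns before cleaning.
raw_cols = ['Show Number', ' Air Date', ' Round', ' Category', ' Value',
            ' Question', ' Answer']

# str.strip() drops leading and trailing whitespace from each name.
stripped_cols = [name.strip() for name in raw_cols]
print(stripped_cols)
```

On the real DataFrame, `jeopardy.columns = jeopardy.columns.str.strip()` should have the same effect.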

Before we can start analyzing the dataset, we have to normalize the text columns. The idea is to lowercase all words and remove punctuation so that, for example, *Don't* and *don't* are not treated as different words.

In [6]:
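def normalize_text(string):
    # Lowercase so that capitalization does not create distinct words.
    string = string.lower()
    # Drop everything except letters, digits, and whitespace (raw string avoids an invalid-escape warning).
    string = re.sub(r'[^A-Za-z0-9\s]', '', string)
    return string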

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
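
To see what the normalization does, here is a minimal standalone sketch that applies the same two steps to a few sample strings. The samples are illustrative and are not rows taken from the dataset.

```python
import re

def normalize_text(text):
    # Lowercase, then remove every character that is not a letter, digit, or whitespace.
    text = text.lower()
    return re.sub(r'[^A-Za-z0-9\s]', '', text)

# Illustrative inputs (assumed, not drawn from jeopardy.csv).
samples = ["Don't", "don't", "McDonald's"]
normalized = [normalize_text(s) for s in samples]
print(normalized)  # ['dont', 'dont', 'mcdonalds']
```

Note that the apostrophe is deleted rather than replaced with a space, so *Don't* becomes the single token `dont`.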

The Value column is currently a string with a dollar sign in front. Removing the dollar sign and converting the column to integers will let us manipulate the values easily.

In [7]:
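def normalize_vals(string):
    # Strip the dollar sign and any other punctuation, e.g. '$200' -> '200'.
    string = re.sub(r'[^A-Za-z0-9\s]', '', string)
    try:
        string = int(string)
    except ValueError:
        # Non-numeric values are treated as 0.
        string = 0
    return string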

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_vals)
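
The conversion can be checked on a few values in isolation. This is a minimal sketch: `'$200'` appears in the data shown above, while `'$1,000'` and `'None'` are assumed inputs chosen to exercise the comma and the fallback branch.

```python
import re

def normalize_vals(value):
    # Remove the dollar sign, commas, and any other punctuation.
    value = re.sub(r'[^A-Za-z0-9\s]', '', value)
    try:
        return int(value)
    except ValueError:
        # Anything that is not a whole number falls back to 0.
        return 0

# '$1,000' and 'None' are hypothetical inputs used only for illustration.
converted = [normalize_vals(v) for v in ['$200', '$1,000', 'None']]
print(converted)  # [200, 1000, 0]
```

Mapping unparseable values to 0 keeps the column numeric, at the cost of making those rows indistinguishable from a true value of 0.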

Finally, we have to convert the values in the Air Date column from strings to datetime.

In [8]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [9]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

In [10]:
jeopardy.head(2)

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |


## Answers In Questions

How often can the answer be deduced from the question itself? To figure this out, we can count how many terms in the answer also occur in the question, across the entire dataset.

In [11]:
def count_matches(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    if 'the' in split_answer:
        split_answer = [item for item in split_answer if item != "the"]
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for each in split_answer:
        if each in split_question:
            match_count += 1
    return match_count/len(split_answer)
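
To make the metric concrete, the sketch below runs the same logic on three toy rows. Plain dictionaries stand in for DataFrame rows, and the question and answer texts are made up for illustration.

```python
def count_matches(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    # "the" is ignored because it is too common to be informative.
    split_answer = [item for item in split_answer if item != 'the']
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    # Fraction of answer words that also appear in the question.
    return match_count / len(split_answer)

# Hypothetical rows, not taken from jeopardy.csv.
toy_rows = [
    {'clean_answer': 'the eiffel tower',
     'clean_question': 'this tower in paris was completed in 1889'},
    {'clean_answer': 'copernicus',
     'clean_question': 'galileo was under house arrest for espousing his theory'},
    {'clean_answer': 'the', 'clean_question': 'this word is an article'},
]
match_fractions = [count_matches(row) for row in toy_rows]
print(match_fractions)  # [0.5, 0.0, 0]
```

In the first row only `tower` of the two remaining answer words appears in the question, giving 0.5. The third row shows the guard against dividing by zero once `the` has been removed.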

In [12]:
jeopardy['answer in question'] = jeopardy.apply(count_matches, axis = 1)
jeopardy['answer in question'].mean()*100

5.987760759999372

On average, only about 6% of the words in an answer also appear in its question. This is a small share, so we cannot expect to deduce an answer from the question alone. Hence, we have to study.

## Recycled Questions

Now, we can investigate how often new questions are repeats of older ones.

In [13]:
question_overlap = []
terms_used = set()

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [x for x in split_question if len(x) > 5]
    match_count = 0
    for each in split_question:
        if each in terms_used:
            match_count += 1
    for each in split_question:
        terms_used.add(each)
    if len(split_question) > 0:
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()*100

69.08737315671962
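
The overlap calculation can be traced by hand on a short list of questions. This is a minimal sketch of the same loop over made-up questions; the helper name `overlap_scores` is hypothetical and does not appear in the notebook.

```python
def overlap_scores(questions):
    terms_used = set()
    scores = []
    for question in questions:
        # Keep only words of six or more characters, as in the cell above.
        words = [w for w in question.split(' ') if len(w) > 5]
        match_count = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        if len(words) > 0:
            match_count = match_count / len(words)
        scores.append(match_count)
    return scores

# Hypothetical questions in air-date order.
toy_questions = [
    'this planet is called the red one',
    'this planet has famous rings',
    'who was it',
]
scores = overlap_scores(toy_questions)
print(scores)  # [0.0, 0.5, 0]
```

The first question can never score above zero because nothing has been seen yet, and the score depends on row order, so the rows are best sorted by air date before running the loop on the real data.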

Based on this initial analysis, which looks at individual words in a question rather than phrases, about 70% of the longer words (six or more characters) in a question have already appeared in an earlier question. Because this only compares individual words, the result is relatively weak evidence on its own, but it does warrant a deeper look into recycled questions.

## High Value Questions

Identifying the kinds of questions that are worth more can help us prepare for them and therefore earn more when we are on the show. Here, a question counts as high value when its value is greater than 800.

In [14]:
def val_determine(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(val_determine, axis = 1)

In [19]:
def word_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_expected = []
comparison_terms = list(terms_used)[:5]

for each in comparison_terms:
    observed_expected.append(word_count(each))
    
observed_expected

[(0, 1), (0, 1), (2, 4), (0, 1), (0, 1)]
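
Calling `word_count` once per term scans the whole dataset each time. As a sketch of an alternative, the counts for every word can be collected in a single pass. The rows below are made-up `(clean_question, high_value)` pairs standing in for the DataFrame, and the `counts` dictionary is a hypothetical helper structure.

```python
from collections import defaultdict

# Hypothetical (clean_question, high_value) pairs, not real dataset rows.
rows = [
    ('this planet is known as the red planet', 1),
    ('this planet has famous rings', 0),
    ('name the largest ocean on earth', 0),
]

# counts[word] = [high_count, low_count]
counts = defaultdict(lambda: [0, 0])
for question, high_value in rows:
    # set() counts a word once per question, matching the `in` test in word_count.
    for word in set(question.split(' ')):
        index = 0 if high_value == 1 else 1
        counts[word][index] += 1

planet_counts = tuple(counts['planet'])
ocean_counts = tuple(counts['ocean'])
print(planet_counts, ocean_counts)  # (1, 1) (0, 1)
```

Each tuple has the same `(high_count, low_count)` layout as the entries of `observed_expected`.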

In [20]:
import numpy as np
from scipy.stats import chisquare


high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []
for each in observed_expected:
    total = sum(each)
    total_prop = total/jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    obs = np.array([each[0], each[1]])
    exp = np.array([high_value_exp, low_value_exp])
    chisq, p = chisquare(obs, exp)
    chi_squared.append((chisq, p))
    
chi_squared

[(0.401962846126884, 0.5260772985705469),
 (0.401962846126884, 0.5260772985705469),
 (0.06376233446880725, 0.8006453026878781),
 (0.401962846126884, 0.5260772985705469),
 (0.401962846126884, 0.5260772985705469)]
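
The calculation in the loop above can be reproduced by hand for a single word. The sketch below uses only the standard library instead of `scipy.stats.chisquare`. The row totals are hypothetical round numbers, not the real dataset sizes, and the p-value formula is valid only for two categories (one degree of freedom).

```python
import math

def chi_squared_two_cells(observed, expected):
    """Chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical totals, NOT the real dataset sizes.
total_rows = 100
high_value_rows = 30
low_value_rows = 70

observed = (2, 4)  # (high_count, low_count) for one word
total_prop = sum(observed) / total_rows
# Expected counts if the word were spread evenly across high and low value rows.
expected = (total_prop * high_value_rows, total_prop * low_value_rows)

stat, p_value = chi_squared_two_cells(observed, expected)
print(round(stat, 4), round(p_value, 4))
```

The expected counts here are 1.8 and 4.2, so the observed counts of 2 and 4 are close to what chance alone would produce, and the statistic is correspondingly small.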

## Chi-squared results

None of the terms showed a significant difference in usage between high value and low value rows: every p-value is well above 0.05. Additionally, all of the observed counts were lower than 5, so the chi-squared test is not reliable for these terms. It would be better to run this test only on terms that have higher frequencies.
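
One way to act on this is to keep only the terms whose expected counts are large enough before testing them. The sketch below assumes a minimum expected count of 5, a common rule of thumb rather than something set in this notebook. The helper `frequent_terms`, the term counts, and the row totals are all hypothetical.

```python
def frequent_terms(term_counts, total_rows, high_rows, min_expected=5):
    """Return terms whose expected high and low counts both reach min_expected."""
    low_rows = total_rows - high_rows
    kept = []
    for term, (high_count, low_count) in term_counts.items():
        total_prop = (high_count + low_count) / total_rows
        expected_high = total_prop * high_rows
        expected_low = total_prop * low_rows
        if min(expected_high, expected_low) >= min_expected:
            kept.append(term)
    return kept

# Hypothetical (high_count, low_count) pairs and dataset sizes.
term_counts = {'planet': (12, 30), 'called': (0, 1), 'famous': (2, 4)}
selected = frequent_terms(term_counts, total_rows=1000, high_rows=300)
print(selected)  # ['planet']
```

Only the terms returned by the filter would then be passed to the chi-squared loop.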