In [1]:
import pandas as pd
import string
import numpy as np
from scipy.stats import chisquare

## Winning Jeopardy

In [3]:
jeopardy = pd.read_csv("jeopardy.csv")

In [4]:
cols = list(jeopardy.columns)

In [5]:
cols

['Show Number',
 ' Air Date',
 ' Round',
 ' Category',
 ' Value',
 ' Question',
 ' Answer']

There's an extra space at the start of most column names. Let's remove it, along with the spaces between words.

In [9]:
for i,col in enumerate(cols):
    cols[i] = col.replace(" ", "")

In [10]:
jeopardy.columns = cols

In [11]:
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
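Note that `replace(" ", "")` removes every space, including the ones between words, whereas `strip()` would remove only the leading and trailing ones. A minimal sketch of the difference, using a few of the column names from the output above (the variable names are illustrative and not part of the notebook):

```python
# A few of the raw column names, copied from the earlier output.
raw_cols = ["Show Number", " Air Date", " Round"]

# strip() trims only leading and trailing whitespace.
stripped = [c.strip() for c in raw_cols]

# replace(" ", "") removes every space, which is what the notebook does.
compact = [c.replace(" ", "") for c in raw_cols]

print(stripped)  # ['Show Number', 'Air Date', 'Round']
print(compact)   # ['ShowNumber', 'AirDate', 'Round']
```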

Now let's normalize the questions and answers by lowercasing every letter and removing all punctuation.

In [12]:
def normalize(st):
    st = st.lower()
    st = ''.join((x for x in st if x not in string.punctuation))
    return st

In [13]:
jeopardy["clean_answer"] = jeopardy.Answer.apply(normalize)
jeopardy["clean_question"] = jeopardy.Question.apply(normalize)
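The same normalization can also be written with `str.translate`, which avoids the character-by-character generator. This is a sketch of an alternative, not what the notebook runs, and `normalize_fast` is a hypothetical name:

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize_fast(text):
    """Lowercase the text and drop ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)

print(normalize_fast("McDonald's"))  # mcdonalds
```

Like `normalize`, this only handles the characters in `string.punctuation`, so non-ASCII punctuation such as curly quotes would pass through unchanged.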

In [14]:
jeopardy.head()

|   | ShowNumber | AirDate | Round | Category | Value | Question | Answer | clean_answer | clean_question |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | copernicus | for the last 8 years of his life galileo was u... |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | jim thorpe | no 2 1912 olympian football star at carlisle i... |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | arizona | the city of yuma in this state has a record av... |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | mcdonalds | in 1963 live on the art linkletter show this c... |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | john adams | signer of the dec of indep framer of the const... |


The air date will be easier to work with if we convert it to a datetime object.

In [15]:
jeopardy["AirDate"] = pd.to_datetime(jeopardy["AirDate"])

Each question's value will be much easier to work with as an integer. We first strip out the dollar sign (and any other punctuation), then convert the result to an int. If the conversion fails, we set the value to 0.

In [16]:
def norm_int(val):
    val = ''.join((x for x in val if x not in string.punctuation))
    try:
        val = int(val)
    except ValueError:
        val = 0
    return val

In [17]:
jeopardy["clean_value"] = jeopardy["Value"].apply(norm_int)
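To make the conversion concrete, here is a small standalone sketch that mirrors `norm_int` on a few hand-picked strings. The sample inputs are assumptions for illustration and have not been checked against the dataset; `parse_value` is a hypothetical name.

```python
import string

def parse_value(val):
    """Strip punctuation, then convert to int, falling back to 0."""
    digits = "".join(ch for ch in val if ch not in string.punctuation)
    try:
        return int(digits)
    except ValueError:
        # Covers non-numeric text as well as the empty string.
        return 0

for sample in ["$200", "$2,000", "None", ""]:
    print(repr(sample), "->", parse_value(sample))
```

Because all punctuation is removed, a thousands separator is dropped along with the dollar sign, so `"$2,000"` becomes `2000`.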

How often do words from the answer appear in the question? If they appear often, the question itself would give away part of the answer.

In [18]:
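def a_in_q(row):
    # Use column labels rather than positions so the function keeps
    # working if columns are added or reordered.
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    match_count = 0
    # Drop the first "the" from the answer; it is too common to be informative.
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    # Fraction of answer words that also appear in the question.
    return match_count / len(split_answer)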

In [19]:
jeopardy["answer_in_question"] = jeopardy.apply(a_in_q, axis = 1)

In [20]:
jeopardy["answer_in_question"].mean()

0.060352773854698942

On average, only about 6% of the words in an answer also appear in the question. Relying on a strategy like this would be foolish.
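To see what the score measures, here is the same calculation on a single made-up answer/question pair. The strings and the function name `answer_word_fraction` are hypothetical.

```python
def answer_word_fraction(clean_answer, clean_question):
    """Fraction of answer words (minus one 'the') found in the question."""
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

score = answer_word_fraction(
    "the grand canyon",
    "this canyon in arizona is about a mile deep",
)
print(score)  # 0.5
```

After "the" is dropped, the answer has two words, and only "canyon" appears in the question, so the score is 1/2.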

Let's see if we can find whether a question is being repeated.

In [22]:
sorted_by_date = jeopardy.sort_values(["AirDate"])

In [23]:
question_overlap = []
terms_used = set()

for i in range(0, jeopardy.shape[0]):
    split_question = jeopardy['clean_question'][i].split(" ")
    match_count = 0
    split_question = [q for q in split_question if len(q) > 5]
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)

In [24]:
jeopardy["question_overlap"] = question_overlap

In [25]:
jeopardy["question_overlap"].mean()

0.69195779922036438

On average, about 70% of a question's longer words (six or more characters) have already appeared in a question processed earlier. However, this is a small sample of the data, and we are only looking at single words. The result doesn't tell us much on its own, but it suggests there may be something worth exploring further.
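Note that the loop above walks the rows of `jeopardy` in their original order, so `sorted_by_date` isn't actually used. If "repeated" should mean "seen in an earlier episode", the rows would need to be processed in air-date order. A minimal sketch of that idea on hypothetical data (the sample rows and the name `overlap_scores` are illustrative):

```python
from datetime import date

def overlap_scores(questions, min_len=6):
    """For each question, the share of long words already seen before it."""
    seen = set()
    scores = []
    for text in questions:
        words = [w for w in text.split(" ") if len(w) >= min_len]
        matches = 0
        for w in words:
            if w in seen:
                matches += 1
            seen.add(w)
        scores.append(matches / len(words) if words else 0)
    return scores

# Hypothetical (air date, cleaned question) pairs, deliberately out of order.
rows = [
    (date(2005, 1, 3), "copernicus published this theory in 1543"),
    (date(2004, 12, 31), "this italian astronomer defended copernicus"),
    (date(2005, 1, 4), "the sun is a star"),
]
rows.sort(key=lambda r: r[0])  # oldest first
scores = overlap_scores([text for _, text in rows])
print(scores)
```

Here only the second question in date order reuses a long word ("copernicus"), giving it a score of 1/3. The other two score 0.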

In [26]:
values = list(jeopardy["clean_value"])
jeopardy["high_value"] = [int(x>800) for x in values]

What if we only want to study high-value questions (here, those worth more than $800)? Which words appear more often in high-value questions than in low-value ones?

To do this, we can:
+ Find the number of low value questions the word occurs in.
+ Find the number of high value questions the word occurs in.
+ Find the percentage of questions the word occurs in.
+ Based on the percentage of questions the word occurs in, find expected counts.
+ Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

In [27]:
def word_val(word):
    low_count = 0
    high_count = 0
    for i in range(0, jeopardy.shape[0]):
        split_question = jeopardy['clean_question'][i].split(" ")
        if word in split_question:
            if jeopardy['high_value'][i] == 1:
                high_count += 1
            else:
                low_count += 1
    return [low_count, high_count]

In [28]:
observed_expected = []
comparison_terms = list(terms_used)[:5]

for term in comparison_terms:
    observed_expected.append((term,(word_val(term))))

In [36]:
observed_expected[1][1]

[2, 0]

In [37]:
high_value_count = sum(jeopardy["high_value"])
low_value_count = len(jeopardy["high_value"]) - high_value_count

In [38]:
chi_squared = []

In [39]:
for e in observed_expected:
    word = e[0]
    total = sum(e[1])
    total_prop = total/jeopardy.shape[0]
    exp_high = total_prop * high_value_count
    exp_low = total_prop * low_value_count
    obs = np.array(e[1])
    exp = np.array([exp_low, exp_high])
    chisquare_value, pvalue = chisquare(obs, exp)
    chi_squared.append((word, chisquare_value, pvalue))

In [40]:
chi_squared

[('harness', 0.44487748166127949, 0.50477764875459963),
 ('opener', 0.80392569225376798, 0.36992223780795708),
 ('osub3sub', 0.40196284612688399, 0.52607729857054686),
 ('targetblankwyatt', 0.40196284612688399, 0.52607729857054686),
 ('officials', 0.026364433084407689, 0.87101348468892104)]

None of the five terms shows a statistically significant difference in usage between high- and low-value questions: every p-value is well above 0.05.
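To show where these statistics come from, the chi-squared value can be worked by hand for a two-category test. The sketch below uses hypothetical totals (1,000 questions, 300 of them high value) and a hypothetical word seen in two low-value questions and no high-value ones. With one degree of freedom, the p-value equals `erfc(sqrt(chi2 / 2))`, so SciPy isn't needed for this illustration.

```python
import math

def chi_squared_two_way(observed, expected):
    """Pearson chi-squared statistic and p-value for two categories (1 dof)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical totals, not taken from the dataset.
total_questions = 1000
high_total, low_total = 300, 700

observed = [2, 0]  # [low_count, high_count] for one word
share = sum(observed) / total_questions
expected = [share * low_total, share * high_total]  # about [1.4, 0.6]

stat, p_value = chi_squared_two_way(observed, expected)
print(round(stat, 4), round(p_value, 4))
```

With expected counts this small, the chi-squared approximation is rough, which is one more reason to restrict the test to terms that occur frequently.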

Here are some potential next steps:

+ Find a better way to eliminate non-informative words than just removing words shorter than 6 characters. Some ideas:
    + Manually create a list of words to remove, like the, than, etc.
    + Find a list of stopwords to remove.
    + Remove words that occur in more than a certain percentage (like 5%) of questions.
+ Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    + Use the apply method to make the code that calculates frequencies more efficient.
    + Only select terms that have high frequencies across the dataset, and ignore the others.
+ Look more into the `Category` column and see if any interesting analysis can be done with it. Some ideas:
    + See which categories appear the most often.
    + Find the probability of each category appearing in each round.
+ Use the whole Jeopardy dataset instead of the subset we used in this mission.
+ Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
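As a starting point for the speed issue mentioned above: instead of rescanning every question for each term, as `word_val` does, all words can be counted in a single pass. This sketch uses a plain dictionary rather than `apply`, and the sample rows and the name `count_by_value` are hypothetical.

```python
from collections import defaultdict

def count_by_value(rows):
    """Map each word to [low_count, high_count] in one pass over the rows.

    `rows` is an iterable of (clean_question, high_value) pairs, where
    high_value is 1 for high-value questions and 0 otherwise.
    """
    counts = defaultdict(lambda: [0, 0])
    for question, high_value in rows:
        # set() so a word counts once per question, as in `word_val`.
        for word in set(question.split(" ")):
            counts[word][high_value] += 1
    return counts

rows = [
    ("this river flows through egypt", 0),
    ("this river flows through paris", 1),
    ("this desert covers much of egypt", 0),
]
counts = count_by_value(rows)
print(counts["river"])  # [1, 1]
print(counts["egypt"])  # [2, 0]
```

On the notebook's data, the pairs could presumably be built with `zip(jeopardy["clean_question"], jeopardy["high_value"])`, after which every term's observed counts are available without another scan.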