# Winning Jeopardy

In [1]:
import pandas as pd
import numpy as np

In [2]:
jeopardy = pd.read_csv("jeopardy.csv")

In [3]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
jeopardy.dtypes

Show Number     int64
 Air Date      object
 Round         object
 Category      object
 Value         object
 Question      object
 Answer        object
dtype: object

Some of the column names have leading spaces. Remove all spaces from the column names.

In [5]:
jeopardy.columns = jeopardy.columns.str.replace(' ','')

In [6]:
jeopardy.dtypes

ShowNumber     int64
AirDate       object
Round         object
Category      object
Value         object
Question      object
Answer        object
dtype: object
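
Note that `str.replace(' ','')` removes every space, including the one inside `Show Number`. If only the leading spaces should go, `str.strip()` is an alternative. A minimal sketch on a toy DataFrame (the values are illustrative, and later cells would then need `Air Date` rather than `AirDate`):

```python
import pandas as pd

# Toy frame with the same kind of leading-space column names (values are illustrative)
toy = pd.DataFrame({"Show Number": [4680], " Air Date": ["2004-12-31"], " Round": ["Jeopardy!"]})

# str.strip() trims only leading and trailing whitespace, keeping inner spaces
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```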

Normalize questions and answers. 
1. Convert the string to lowercase.
2. Remove all punctuation in the string.

In [7]:
import string

In [8]:
def clean_string(s):
    s=s.lower()
    for c in string.punctuation:
        s=s.replace(c,'')    
    return s 

In [9]:
jeopardy['clean_question'] = jeopardy['Question'].apply(clean_string)

In [10]:
jeopardy['clean_answer']=jeopardy['Answer'].apply(clean_string)
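
The same normalization can be done in a single pass with `str.translate`. This is a sketch of an alternative, not the function used in this notebook; `clean_string_fast` is a hypothetical name.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_string_fast(s):
    """Lowercase s and drop punctuation in one pass (hypothetical helper name)."""
    return s.lower().translate(_PUNCT_TABLE)

print(clean_string_fast("McDonald's"))
```

Building the table once avoids looping over every punctuation character for each string.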

In [11]:
jeopardy.head(5)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


Normalize dollar values:

1. Remove any punctuation in the string
2. Convert the string to an integer
3. If the conversion has an error, assign 0 instead.

In [12]:
def clean_value(v):
    # Strip punctuation such as "$" and ","
    for c in string.punctuation:
        v=v.replace(c,'')
    # Fall back to 0 when the remaining text is not a valid integer
    try:
        return int(v)
    except ValueError:
        return 0

In [13]:
jeopardy['clean_value']=jeopardy['Value'].apply(clean_value)
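
A vectorized variant is also possible. The sketch below assumes it is acceptable to drop every non-digit character (slightly broader than removing punctuation only) and uses sample inputs that are not taken from the dataset.

```python
import pandas as pd

values = pd.Series(["$200", "$1,000", "None"])  # sample inputs, not taken from the dataset

# Strip everything except digits, then coerce unparseable entries to NaN and fill with 0
digits = values.str.replace(r"[^0-9]", "", regex=True)
clean = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)
print(clean.tolist())
```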

Normalize AirDate values:

1. Convert the column to a datetime column.

In [14]:
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])

In [15]:
jeopardy.dtypes

ShowNumber                 int64
AirDate           datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

To improve our chances of winning, we need to figure out two things:

1. How often the answer is deducible from the question.
2. How often new questions are repeats of older questions.

# Question 1: 

To answer the first question, we need to check how often words in the answer also occur in the question.

In [16]:
def match(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    match_count=0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer)==0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count+=1
    return match_count/len(split_answer)

In [17]:
jeopardy['answer_in_question']=jeopardy.apply(match,axis=1)

In [18]:
jeopardy['answer_in_question'].mean()

0.06035277385469894

On average, only about 6% of the words in an answer also occur in its question, so the answer can rarely be deduced from the question alone.
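
To make the metric concrete, here is a sketch of the same calculation on a single made-up row. `match_fraction` is a hypothetical standalone version of `match` that takes the two cleaned strings directly.

```python
def match_fraction(clean_answer, clean_question):
    """Fraction of answer words (ignoring one 'the') that also appear in the question."""
    answer_words = clean_answer.split(' ')
    question_words = clean_question.split(' ')
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# Made-up row: one of the two remaining answer words appears in the question
print(match_fraction("the eiffel tower", "this tower in paris opened in 1889"))
```

Here "tower" matches and "eiffel" does not, so the row scores 0.5.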

# Question 2:

Consider only words with 6 or more characters, to filter out common words like "the" and "than".

1. Sort jeopardy by ascending air date.
2. Check whether each word in a question appeared in a previous question.
3. Find the mean proportion of words in each question that appeared in an earlier question.

In [19]:
question_overlap=[]
terms_used=set()
jeopardy=jeopardy.sort_values('AirDate')
for i,row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    for word in split_question:
        if len(word)<6:
            split_question.remove(word)
    match_count=0
    for words in split_question:
        if words in terms_used:
            match_count+=1
        terms_used.add(words)
    if len(split_question) > 0:
        match_count/=len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.8016009520196677

The mean overlap is about 80%. This measures the reuse of individual words rather than whole questions, but it suggests that studying old questions is worthwhile. These questions deserve a deeper look.
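
One caveat about the loop above: it removes words from `split_question` while iterating over the same list, and in Python this skips the element that follows each removal. Some short words can therefore survive the filter, so the reported mean may differ from what a strict 6-character filter would give. A small sketch on a made-up clue shows the difference:

```python
words = "this is the largest planet in the solar system".split(' ')

# Removing while iterating: the loop index advances past the element that
# slides into the removed slot, so some short words are never examined.
buggy = list(words)
for word in buggy:
    if len(word) < 6:
        buggy.remove(word)

# Building a new list examines every word exactly once.
filtered = [word for word in words if len(word) >= 6]

print(buggy)
print(filtered)
```

The list comprehension keeps only "largest", "planet" and "system", while the in-place version also leaves "is" and "the" behind.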

To earn more money, we can focus on the high-value questions. We can use a chi-squared test to figure out which terms correspond to high-value questions.

Classify the questions into two categories:

1. Low value: any row where the value is 800 or less.
2. High value: any row where the value is greater than 800.

In [20]:
def value(row):
    # 1 for high-value questions (more than $800), 0 otherwise
    return 1 if row['clean_value']>800 else 0

In [21]:
jeopardy['high_value']=jeopardy.apply(value,axis=1)
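
The same flag can be computed without a row-wise `apply` by comparing the whole column at once. A sketch on a toy DataFrame with illustrative values:

```python
import pandas as pd

toy = pd.DataFrame({"clean_value": [200, 800, 1000, 0]})  # illustrative values

# Boolean comparison on the whole column, converted to 1/0
toy["high_value"] = (toy["clean_value"] > 800).astype(int)
print(toy["high_value"].tolist())
```

A value of exactly 800 falls in the low-value group, matching the `>800` test above.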

In [25]:
# Count how many high-value and low-value questions contain the word
def value_count(word):
    low_count=0
    high_count=0
    for i,row in jeopardy.iterrows():
        split_question=row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value']==1:
                high_count+=1
            else:
                low_count+=1
    return high_count,low_count

Count how often each term appears in high-value and low-value questions.

In [26]:
observed_expected = []
comparison_terms= list(terms_used)[0:5]
for term in comparison_terms:
    observed_expected.append(value_count(term))
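
Because `value_count` scans every row once per term, it becomes slow for many terms. One possible alternative, sketched here on made-up rows, tallies all words in a single pass with `collections.Counter`. The tuples stand in for the `clean_question` and `high_value` columns.

```python
from collections import Counter

# Made-up (clean_question, high_value) pairs standing in for DataFrame rows
rows = [
    ("this planet has rings", 1),
    ("this planet is red", 0),
    ("rings of saturn", 1),
]

high, low = Counter(), Counter()
for clean_question, high_value in rows:
    # set() counts each question at most once per word, like `word in split_question`
    for word in set(clean_question.split(' ')):
        if high_value == 1:
            high[word] += 1
        else:
            low[word] += 1

print(high['planet'], low['planet'])
print(high['rings'], low['rings'])
```

Looking up `(high[term], low[term])` then gives the same pair that `value_count(term)` returns.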

In [29]:
high_value_count=len(jeopardy[jeopardy['high_value']==1])

In [31]:
low_value_count=len(jeopardy[jeopardy['high_value']==0])

In [33]:
from scipy.stats import chisquare
chi_squared=[]
for lists in observed_expected:
    total = sum(lists)
    total_prop = total/len(jeopardy)
    expected_term_count_high=total_prop*high_value_count
    expected_term_count_low=total_prop*low_value_count
    observed = np.array([lists[0], lists[1]])
    expected = np.array([expected_term_count_high, expected_term_count_low])
    chi_squared.append(chisquare(observed, expected))
chi_squared

[Power_divergenceResult(statistic=4.4007463431988825, pvalue=0.035923206140745186),
 Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483468),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=1.9628115606834662, pvalue=0.1612129460510291),
 Power_divergenceResult(statistic=1.3346324449838385, pvalue=0.24798277007881886)]
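
To show what each result contains, here is a sketch of the same statistic computed by hand for two categories (one degree of freedom). The counts are hypothetical and `chi_squared_two_cells` is an illustrative helper, not part of scipy.

```python
import math

def chi_squared_two_cells(observed, expected):
    """Return (statistic, p-value) for two categories, i.e. one degree of freedom."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom, P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical numbers: a term seen in 30 high-value and 70 low-value questions,
# in a dataset where 40% of all questions are high value.
observed = [30, 70]
expected = [40.0, 60.0]
stat, p = chi_squared_two_cells(observed, expected)
print(round(stat, 3), round(p, 3))
```

The statistic sums the squared gap between observed and expected counts, scaled by the expected count, so a term whose usage follows the overall high/low split scores close to 0.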

# Chi-squared results

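Only the first term has a p-value below 0.05 (about 0.036); the other four show no significant difference in usage between high-value and low-value questions. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't reliable here, and even the first result should be treated with caution. It would be better to run this test only with terms that have higher frequencies.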