## How to win in Jeopardy

The goal of this project is to find patterns in the questions asked on the Jeopardy! show. Such patterns may help a contestant win.

In [149]:
import pandas as pd
import numpy as np
import re

from scipy.stats import chisquare

In [3]:
jeopardy = pd.read_csv("jeopardy.csv")

In [172]:
jeopardy.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0,0.0,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0,0.0,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0,0.0,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0,0.2,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0,0.142857,0.0


In [26]:
jeopardy.columns = jeopardy.columns.str.lower().str.strip().str.replace(" ","_")

In [128]:
def normalize_text(string):
    # Lowercase, drop punctuation, and collapse runs of whitespace.
    lowered = string.lower()
    s = re.sub(r"[^A-Za-z0-9\s]", '', lowered)
    s = re.sub(r"\s+", ' ', s)
    s = s.strip()
    return s


In [129]:
jeopardy["clean_question"] = jeopardy["question"].apply(normalize_text)
jeopardy["clean_answer"]  = jeopardy["answer"].apply(normalize_text)

In [130]:
def convert_value(string):
    # Strip "$" and "," so that "$2,000" becomes 2000.
    s = re.sub(r"[^A-Za-z0-9\s]", '', string)
    try:
        int_val = int(s)
    except ValueError:
        # Non-numeric values are treated as 0.
        int_val = 0
    return int_val


In [131]:
jeopardy["clean_value"] = jeopardy["value"].apply(convert_value)
jeopardy["air_date"] = pd.to_datetime(jeopardy["air_date"])

In [132]:
jeopardy.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0,0.0,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0,0.083333,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0,0.1,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0,0.111111,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0,0.333333,0.0


In [135]:
def remove_words_arr(words_to_remove,arr):
    # Drops the first occurrence of each listed word from arr, in place.
    for word in words_to_remove:
        if word in arr:
            arr.remove(word)

def answer_in_question(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    match_count = 0
    # Articles carry no information about the answer, so leave them out.
    words_to_remove = ["the","a","an"]
    remove_words_arr(words_to_remove,split_answer)
    remove_words_arr(words_to_remove,split_question)
    # Avoid dividing by zero when the answer consisted only of articles.
    if len(split_answer)==0:
        return 0
    for word_answer in split_answer:
        if word_answer in split_question:
            match_count += 1

    return match_count/len(split_answer)


In [136]:
jeopardy["answer_in_question"] = jeopardy.apply(answer_in_question,axis=1)

In [137]:
jeopardy["answer_in_question"].mean()

0.04283139120881008

To check whether the answer can be deduced from the question, I split each answer and question on spaces and counted how many answer words also appear in the question. The answer_in_question column holds the ratio of answer words found in the question to the total number of answer words (after dropping the articles "the", "a" and "an"). On average the ratio is about 0.04, so only around 4% of answer words appear in the question, and the question alone rarely gives the answer away.
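The ratio described above can be tried on a single answer and question pair. The following is a minimal sketch, not the notebook's exact function: the name `answer_overlap_ratio` and the sample strings are hypothetical, and it filters every occurrence of an article by building new collections.

```python
STOP_WORDS = {"the", "a", "an"}  # the same articles the notebook leaves out


def answer_overlap_ratio(clean_answer, clean_question):
    """Share of non-article answer words that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w not in STOP_WORDS]
    question_words = {w for w in clean_question.split() if w not in STOP_WORDS}
    if not answer_words:
        # The answer consisted only of articles (or was empty).
        return 0.0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)


# Hypothetical, already-normalized strings: two of the three answer words
# ("john" and "adams") occur in the question.
ratio = answer_overlap_ratio(
    "john quincy adams",
    "john adams was the father of this president",
)
print(ratio)
```

A value close to 1 would mean the question nearly spells out the answer. The dataset-wide mean of about 0.04 says that this almost never happens.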

In [175]:
question_overlap = []
terms_used = set()

def remove_short_words(arr):
    # Drop words shorter than 6 characters, in place.
    # Removing while iterating skips the word after each removal,
    # so some short words stay in the list.
    for word in arr:
        if len(word)<6:
            arr.remove(word)

# Walk through the questions in chronological order.
sorted_jeopardy = jeopardy.sort_values(by="air_date")

for row in sorted_jeopardy.iterrows():
    split_question = row[1]["clean_question"].split(" ")
    remove_short_words(split_question)
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if match_count>0:
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)

# Align on the index so each score is stored on its own row.
jeopardy["question_overlap"] = pd.Series(question_overlap, index=sorted_jeopardy.index)
jeopardy["question_overlap"].mean()

0.7989239490508186

By counting the words in each question that already appeared in earlier questions, I wanted to estimate how likely a new question is to resemble previous ones. The mean overlap is very high: about 80% of the words kept from a question had already occurred in earlier questions. A weakness of this score is that the repeated words in one question may come from many different earlier questions, which inflates the value without the question being related to any single earlier one. Still, the result is worth investigating further.
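One more thing to keep in mind when reading this score: the filter in the loop above removes items from a list while iterating over it, which skips the item that follows each removal, so some short, common words remain and can raise the overlap. Below is a minimal sketch of the same scoring with a filter that builds a new list instead. The names `keep_long_words` and `overlap_scores` and the toy questions are hypothetical, and the real dataset would give a different mean than the one reported above.

```python
def keep_long_words(words, min_length=6):
    """Return a new list with only the words of at least min_length characters."""
    return [word for word in words if len(word) >= min_length]


def overlap_scores(clean_questions, min_length=6):
    """Score each question by the share of its long words seen in earlier ones."""
    terms_seen = set()
    scores = []
    for question in clean_questions:
        words = keep_long_words(question.split(), min_length)
        matches = 0
        for word in words:
            if word in terms_seen:
                matches += 1
            terms_seen.add(word)
        scores.append(matches / len(words) if words else 0.0)
    return scores


# In-place removal while iterating: "of" is skipped and survives the filter.
leftover = ["a", "of", "jeopardy"]
for word in leftover:
    if len(word) < 6:
        leftover.remove(word)
print(leftover)

# Toy questions, assumed to be already normalized and in chronological order.
toy_questions = [
    "this famous president signed the treaty",
    "the treaty was signed by this president in paris",
    "a small island",
]
toy_scores = overlap_scores(toy_questions)
print(toy_scores)
```

In the toy data the second question reuses only words from the first one, so its score is the maximum, while the other two questions score zero.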

In [176]:
jeopardy.loc[jeopardy["clean_value"]>800,"high_value"] = 1
jeopardy.loc[jeopardy["clean_value"]<=800,"high_value"] = 0

In [177]:
terms_used

{'cranio',
 'lynda',
 'sunlight',
 'mountaina',
 'publishing',
 'automobile',
 'woody',
 'cooper',
 'arturo',
 'presiding',
 'astrologer',
 'claustrophobia',
 'brenda',
 'g',
 'hrefhttpwwwjarchivecommedia20091014j09jpg',
 'mortals',
 'almighty',
 'chair',
 'downey',
 'dispensary',
 'estrogen',
 'morial',
 'hiphop',
 'poplicola',
 'stoller',
 'partagas',
 'zee',
 'pastorship',
 'dhaka',
 'outfit',
 'sympathies',
 'cardiac',
 'piggies',
 'onethird',
 'cruella',
 'corporate',
 'mural',
 'medals',
 'quatrains',
 'abbey',
 'literature',
 'immediate',
 'crossed',
 'lanai',
 'doodah',
 'sundew',
 'petticoats',
 'camp',
 '2masted',
 'person',
 'beta',
 'differently',
 '4th',
 'peddles',
 'invigorate',
 'stairs',
 'makola',
 'sunset',
 'snowed',
 'viceia',
 'blossomed',
 'undertake',
 'hobble',
 'gemini',
 'embarrassment',
 'text',
 'borglum',
 'heroic',
 'herring',
 '1877',
 'hippophagy',
 'meltdown',
 'hrefhttpwwwjarchivecommedia20040521j22jpg',
 'carmel',
 'matlock',
 'antipsychotic',
 'hess

In [192]:
def count_high_low(word):
    # Count the questions containing `word`, split by high and low value.
    low_count = 0
    high_count = 0
    for row in jeopardy.iterrows():
        row = row[1]
        split_question = row["clean_question"].split(" ")
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count,low_count

# Observed (high, low) counts for a small sample of terms.
# A set has no fixed order, so the five terms picked are arbitrary.
observed_expected = []
comparision_terms = list(terms_used)[:5]
for term in comparision_terms:
    observed_expected.append(count_high_low(term))
    

In [193]:
observed_expected

[(0, 1), (0, 1), (0, 3), (1, 1), (3, 7)]

In [194]:
comparision_terms

['cranio', 'lynda', 'sunlight', 'mountaina', 'publishing']

In [195]:
high_value_count = (jeopardy["high_value"]==1).sum()
low_value_count = (jeopardy["high_value"]==0).sum()

In [196]:
chi_squared = []
for high,low in observed_expected:
    total = high + low
    # Share of all questions that contain the term.
    total_prop = total/len(jeopardy)
    # Expected counts if the term were spread evenly across both groups.
    exp_high = total_prop*high_value_count
    exp_low = total_prop*low_value_count
    chi_squared.append(chisquare([high,low],[exp_high,exp_low]))

In [197]:
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.008630851497838939, pvalue=0.9259811180040979)]

None of the tested terms shows a significant difference between high-value and low-value questions: every p-value is well above 0.05. The term frequencies are also very low, which makes the chi-squared test unreliable for these terms.
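To make the low-frequency problem concrete, the test can be reproduced by hand for two of the observed pairs. This is a sketch under assumptions: the function name `chi_squared_two_groups` is hypothetical, and the share of high-value questions (about 0.2867) is not printed anywhere in the notebook but backed out of the first reported statistic, since for observed counts (0, 1) the statistic reduces to p / (1 - p). A common rule of thumb asks for expected counts of at least 5 in every category.

```python
import math

# Assumed share of high-value questions, derived from the reported
# statistic 0.402 for the observed pair (0, 1): p / (1 - p) = 0.402.
HIGH_SHARE = 0.2867


def chi_squared_two_groups(high, low, high_share=HIGH_SHARE):
    """Return (statistic, p_value, expected_high, expected_low)."""
    total = high + low
    expected_high = total * high_share
    expected_low = total * (1 - high_share)
    statistic = ((high - expected_high) ** 2 / expected_high
                 + (low - expected_low) ** 2 / expected_low)
    # Two categories give 1 degree of freedom, where the chi-squared
    # survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, expected_high, expected_low


for high, low in [(0, 1), (3, 7)]:
    stat, p, exp_high, exp_low = chi_squared_two_groups(high, low)
    below_5 = min(exp_high, exp_low) < 5
    print(f"observed=({high}, {low}) statistic={stat:.3f} p={p:.3f} "
          f"expected=({exp_high:.2f}, {exp_low:.2f}) below_5={below_5}")
```

Even for the most frequent of the five terms, with 10 occurrences in total, the expected high-value count stays below 5. A more informative comparison would restrict the test to terms that occur often enough for both expected counts to clear that threshold.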