# Patterns to Win Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money.

In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

    - Show Number - the Jeopardy episode number
    - Air Date - the date the episode aired
    - Round - the round of Jeopardy
    - Category - the category of the question
    - Value - the number of dollars the correct answer is worth
    - Question - the text of the question
    - Answer - the text of the answer

# Reading Dataset

In [1]:
import pandas as pd

data = pd.read_csv('jeopardy.csv')
data.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
data.columns = data.columns.str.replace(" ", "")
data.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

# Normalize Data

In [3]:
import re

#normalize question and answer columns
def normalize(st):
    st = st.lower()
    st = re.sub('\W', " ", st)
    st = re.sub('\s+', ' ',st)  #remove extra whitespaces
    return st

In [4]:
#assign it to new columns
data['CleanQuestion'] = data['Question'].apply(normalize)
data['CleanAnswer'] = data['Answer'].apply(normalize)

In [5]:
#normalize value column
def norm_values(val):
    val = re.sub('\W', " ", val)
    try:
        val = int(val)
    except Exception:
        val = 0
    return val

In [6]:
data['CleanValue'] = data['Value'].apply(norm_values)

In [7]:
data.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,CleanQuestion,CleanAnswer,CleanValue
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonald s,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [8]:
#convert AirDate Column to datetime object
data['AirDate'] = pd.to_datetime(data['AirDate'])

In [9]:
data.dtypes

ShowNumber                int64
AirDate          datetime64[ns]
Round                    object
Category                 object
Value                    object
Question                 object
Answer                   object
CleanQuestion            object
CleanAnswer              object
CleanValue                int64
dtype: object

# Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

    - How often the answer can be used for a question.
    - How often questions are repeated

In [10]:
def first(series):
    split_answer = series['CleanAnswer'].split()
    split_question = series['CleanQuestion'].split()
    match_count= 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    
    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [11]:
data['AnswerInQuestion'] = data.apply(first, axis=1)
print(data['AnswerInQuestion'].mean())

0.06294645581984942


On average, the answer only makes up for about 6% of the question. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer.

In [12]:
question_overlap = []
terms_used = set()

data_sorted = data.sort_values("AirDate")

for i, row in data_sorted.iterrows():
        split_question = row["CleanQuestion"].split(" ")
        split_question = [q for q in split_question if len(q) > 5]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count += 1
        for word in split_question:
            terms_used.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
        question_overlap.append(match_count)
data_sorted["QuestionOverlap"] = question_overlap

data_sorted["QuestionOverlap"].mean()

0.7197989717809659

# Low value vs high value questions

There is about 70% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it doesn't look at phrases, it looks at single terms. This makes it relatively insignificant, but it does mean that it's worth looking more into the recycling of questions.

    - Low value -- Any row where Value is less than 800.
    - High value -- Any row where Value is greater than 800

In [14]:
def value(row):
    if row['CleanValue'] > 800:
        return 1
    return 0

data['HighValue'] = data.apply(value, axis=1)

In [37]:
def count(word):
    low_count = 0
    high_count = 0
    for i,row in data.iterrows():
        clean_question = row['CleanQuestion'].split(" ")
        if word in clean_question:
            if row['HighValue'] == 1:
                high_count += 1
            else:
                low_count += 1
    return low_count, high_count

In [43]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

In [44]:
observed_expected = []

for i in comparison_terms:
    observed_expected.append(count(i))

In [45]:
observed_expected

[(1, 0),
 (1, 0),
 (3, 1),
 (2, 0),
 (1, 0),
 (8, 1),
 (1, 0),
 (2, 3),
 (1, 0),
 (1, 0)]

# Expected Counts and Chi-Squared Value

In [57]:
high_value_count = data[data['HighValue'] == 1].shape[0]
low_value_count = data[data['HighValue'] == 0].shape[0]

In [59]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / data.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=5.382949069925676, pvalue=0.02033447695772831),
 Power_divergenceResult(statistic=6.044650040225262, pvalue=0.013948497547915516),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=19.751074709544152, pvalue=8.821207646809378e-06),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=0.6134279937303524, pvalue=0.43350002170202684),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953),
 Power_divergenceResult(statistic=3.022325020112631, pvalue=0.08212564786568953)]

# Some potential next steps

    - Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
        - Manually create a list of words to remove, like the, than, etc.
        - Find a list of stopwords to remove.
        - Remove words that occur in more than a certain percentage (like 5%) of questions.
        
    - Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
        - Use the apply method to make the code that calculates frequencies more efficient.
        - Only select terms that have high frequencies across the dataset, and ignore the others.
        
    - Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
        - See which categories appear the most often.
        - Find the probability of each category appearing in each round.
        
    - Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.