# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that we want to compete on Jeopardy, and we're looking for any edge that could help us win. In this project, we will work with a dataset of Jeopardy questions to look for patterns in the questions that could help us win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). 

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv', parse_dates=[' Air Date'])

jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [2]:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns 

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [3]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
jeopardy.tail()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
216925,4999,2006-05-11,Double Jeopardy!,RIDDLE ME THIS,$2000,This Puccini opera turns on the solution to 3 ...,Turandot
216926,4999,2006-05-11,Double Jeopardy!,"""T"" BIRDS",$2000,In North America this term is properly applied...,a titmouse
216927,4999,2006-05-11,Double Jeopardy!,AUTHORS IN THEIR YOUTH,$2000,"In Penny Lane, where this ""Hellraiser"" grew up...",Clive Barker
216928,4999,2006-05-11,Double Jeopardy!,QUOTATIONS,$2000,"From Ft. Sill, Okla. he made the plea, Arizona...",Geronimo
216929,4999,2006-05-11,Final Jeopardy!,HISTORIC NAMES,,A silent movie title includes the last name of...,Grigori Alexandrovich Potemkin


# Data cleaning

Before we start analysing the Jeopardy questions, we need to normalise the text columns (the *Question* and *Answer* columns): strip punctuation, convert to lowercase, and collapse repeated whitespace.

In [5]:
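# regex=True is passed explicitly: recent pandas versions treat the pattern as a literal string by default.
jeopardy['clean_question'] = jeopardy['Question'].str.replace(r'[^\w\s]', '', regex=True).str.lower().str.replace(r'\s+', ' ', regex=True)
jeopardy['clean_answer'] = jeopardy['Answer'].str.replace(r'[^\w\s]', '', regex=True).str.lower().str.replace(r'\s+', ' ', regex=True)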
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
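
The same normalisation can be sketched as a plain function, which is easier to test on a single string. This is a minimal sketch using only the standard library; the name `normalize_text` is taken from the commented-out helper later in this notebook, and the sample string is abbreviated from the rows shown above.

```python
import re

def normalize_text(text):
    """Strip punctuation, lowercase, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text)  # keep only word characters and whitespace
    text = text.lower()
    text = re.sub(r"\s+", " ", text)     # collapse repeated whitespace into one space
    return text

print(normalize_text('In 1963, live on "The Art Linkletter Show"'))
print(normalize_text("McDonald's"))
```

Applied with `jeopardy['Question'].apply(normalize_text)`, this should give the same result as the chained string methods, under the assumption that every value is a string.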


In [6]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 9 columns):
Show Number       216930 non-null int64
Air Date          216930 non-null datetime64[ns]
Round             216930 non-null object
Category          216930 non-null object
Value             216930 non-null object
Question          216930 non-null object
Answer            216928 non-null object
clean_question    216930 non-null object
clean_answer      216928 non-null object
dtypes: datetime64[ns](1), int64(1), object(7)
memory usage: 14.9+ MB


In [7]:
jeopardy['clean_answer'].isnull().sum()

2

In [8]:
jeopardy.dropna(inplace=True)
jeopardy['clean_answer'].isnull().sum()

0

In [9]:
jeopardy['Value'].value_counts()

$400       42243
$800       31860
$200       30454
$600       20377
$1000      19539
$1200      11331
$2000      11243
$1600      10801
$100        9029
$500        9016
$300        8663
None        3634
$1,000      2101
$2,000      1586
$3,000       769
$1,500       546
$1,200       441
$4,000       349
$1,600       239
$2,500       232
$5,000       231
$1,400       228
$700         203
$1,800       182
$2,200       147
$2,400       127
$900         114
$6,000        85
$2,600        83
$1,300        75
           ...  
$2,127         1
$1,534         1
$11,000        1
$1,810         1
$3,599         1
$1,347         1
$1,020         1
$11,600        1
$13,200        1
$1,512         1
$50            1
$3,499         1
$8,917         1
$5,700         1
$1,246         1
$16,400        1
$1,801         1
$1,777         1
$2,021         1
$8,700         1
$6,435         1
$14,200        1
$6,300         1
$585           1
$13,800        1
$1,492         1
$22            1
$1,809        

In [10]:
# Strip "$" and "," so that values like "$1,200" become "1200".
jeopardy['clean_value'] = jeopardy['Value'].str.replace(r'[^\w\s]', '', regex=True)
jeopardy['clean_value'] = jeopardy['clean_value'].apply(lambda x: 0 if x == 'None' else int(x))
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
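
As a quick check of the value cleaning, here is a minimal standalone sketch. The helper name `normalize_value` is hypothetical, and mapping anything without digits (such as `'None'`) to 0 is an assumption that mirrors the lambda above.

```python
import re

def normalize_value(value):
    """Convert a string like '$1,200' to 1200; values with no digits become 0."""
    digits = re.sub(r"[^\d]", "", str(value))  # drop everything except digits
    return int(digits) if digits else 0

for raw in ["$200", "$1,200", "None"]:
    print(raw, "->", normalize_value(raw))
```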


In [11]:
# import re

# def normalize_text(text):
#     text = text.lower()
#     text = re.sub("[^A-Za-z0-9\s]", "", text)
#     text = re.sub("\s+", " ", text)
#     return text

# def normalize_values(text):
#     text = re.sub("[^A-Za-z0-9\s]", "", text)
#     try:
#         text = int(text)
#     except Exception:
#         text = 0
#     return text
# jeopardy["Air Date"] = pandas.to_datetime(jeopardy["Air Date"])

In [12]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

# Answers in questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

We can answer the first question by seeing how many times words in the answer also occur in the question. We can answer the second question by seeing how often complex words (six or more characters) reoccur.

## How often the answer can be deduced from the question

In [13]:
def count_matches(row):
    
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    # "the" is common in both answers and questions, but it doesn't help identify the answer.
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count/len(split_answer)


In [14]:
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)
jeopardy['answer_in_question'].mean()

0.057921237245162335

On average, only about 6% of the words in an answer also appear in its question. This is a small share, which means we probably can't count on the question itself to give away the answer. We'll probably have to study.
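
To make the metric concrete, here is a minimal sketch of the same calculation on plain strings. The function name `answer_match_fraction` and both example rows are made up for illustration. Unlike `count_matches`, it drops every occurrence of "the" from the answer rather than only the first.

```python
def answer_match_fraction(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up examples: no answer word in the question, then one of two.
print(answer_match_fraction("this state is home to the city of yuma", "arizona"))
print(answer_match_fraction("the kremlin in this city houses the russian government",
                            "the moscow kremlin"))
```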

## How often questions are repeated

Note that we can't answer this completely, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

In [15]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('Air Date').reset_index(drop=True)

In [16]:
for index, row in jeopardy.iterrows():
    
    split_question = row['clean_question'].split(' ')
    # Keep only words with six or more characters. This filters out common words
    # like "the" and "than", which don't tell us much about a question.
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        
        terms_used.add(word)
    
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()
            

0.8721734034756163

On average, about 87% of the terms (of six or more characters) in a question have already appeared in an earlier question. This only looks at a small set of questions, and it looks at single terms rather than phrases, so on its own the figure says little about whether whole questions are repeated. It does suggest that it's worth looking more into the recycling of questions.
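
The overlap measure is easier to follow on a toy input. The sketch below reproduces the loop on three made-up questions; the function name `overlap_scores` and its `min_length` parameter are hypothetical.

```python
def overlap_scores(questions, min_length=6):
    """For each question, the share of its long words already seen in earlier text."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split(' ') if len(w) >= min_length]
        matches = 0
        for word in words:
            if word in seen:
                matches += 1
            seen.add(word)
        scores.append(matches / len(words) if words else 0)
    return scores

# Made-up questions, assumed to be already cleaned and in air-date order.
toy = [
    "this russian fortress houses the kremlin",
    "the kremlin is located in this capital",
    "short words only",
]
print(overlap_scores(toy))
```

Only "kremlin" is repeated here, so the second question scores one match out of its three long words, and the third question has no long words at all.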

# Low value vs. high value questions

Let's say we only want to study material that pertains to high-value questions instead of low-value questions. This will help us earn more money when we are on Jeopardy.

We can figure out which terms correspond to high-value questions using a chi-squared test.

We will first need to split the questions into two categories:
- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.


In [19]:
jeopardy['value_classify'] = jeopardy['clean_value'].apply(lambda x: 1 if x>800 else 0)
jeopardy['value_classify'].head()

0    0
1    1
2    1
3    1
4    1
Name: value_classify, dtype: int64

In [21]:
def count_usage(term):
    low_count = 0
    high_count = 0
    
    for index, row in jeopardy.iterrows():
        if term in row['clean_question'].split():
            if row['value_classify'] == 1:
                high_count += 1
            else:
                low_count += 1
    return low_count, high_count
           

In [23]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

comparison_terms 

['sorrowful',
 'headdresses',
 'webbed',
 'hrefhttpwwwjarchivecommedia20100924_dj_22jpg',
 'hrefhttpwwwjarchivecommedia20091023_j_01ajpg',
 'chechen',
 'target_blankbob',
 'halfsies',
 'stuart',
 'kremlin']

# Applying the chi-squared test

We will loop through each of the sampled terms in comparison_terms and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
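
The steps above can be sketched by hand for a single term, without SciPy. This is an illustrative sketch: the function name `chi_squared_for_term` and the row totals (700 low-value, 300 high-value) are made up. With two categories there is one degree of freedom, so the p-value can be computed as `erfc(sqrt(statistic / 2))`.

```python
import math

def chi_squared_for_term(low_obs, high_obs, low_total, high_total):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
    n_rows = low_total + high_total
    term_prop = (low_obs + high_obs) / n_rows  # share of questions containing the term
    low_exp = term_prop * low_total            # expected count in low-value rows
    high_exp = term_prop * high_total          # expected count in high-value rows
    statistic = ((low_obs - low_exp) ** 2 / low_exp
                 + (high_obs - high_exp) ** 2 / high_exp)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical term seen in 2 low-value and 8 high-value questions.
print(chi_squared_for_term(2, 8, low_total=700, high_total=300))
```

In this toy case the expected counts are 7 and 3, so a term that appears mostly in high-value questions gets a large statistic and a small p-value.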

In [24]:
observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))
    
observed_expected

[(4, 2),
 (1, 0),
 (10, 2),
 (0, 1),
 (1, 0),
 (1, 1),
 (1, 0),
 (2, 0),
 (38, 16),
 (13, 2)]

In [25]:
high_value_count = jeopardy[jeopardy['value_classify'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['value_classify'] == 0].shape[0]

In [29]:
import numpy as np
from scipy.stats import chisquare

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total/jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([low_value_exp, high_value_exp])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.07446002639524918, pvalue=0.7849503458405339),
 Power_divergenceResult(statistic=0.39498154412048414, pvalue=0.5296924445254796),
 Power_divergenceResult(statistic=0.8021008220410135, pvalue=0.37046599961469506),
 Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=0.39498154412048414, pvalue=0.5296924445254796),
 Power_divergenceResult(statistic=0.4633727036157106, pvalue=0.4960519396377898),
 Power_divergenceResult(statistic=0.39498154412048414, pvalue=0.5296924445254796),
 Power_divergenceResult(statistic=0.7899630882409683, pvalue=0.3741112870360538),
 Power_divergenceResult(statistic=0.04601664196998195, pvalue=0.8301455495155613),
 Power_divergenceResult(statistic=1.658595760587037, pvalue=0.1977930281337782)]

None of the terms showed a significant difference in usage between high-value and low-value rows: every p-value is above 0.05. Additionally, most of the terms occur fewer than 5 times in total, and the chi-squared test is less reliable when the counts are that low. It would be better to run this test only on terms that have higher frequencies.

# Next steps

Potential next steps:

- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
  - Manually create a list of words to remove, like the, than, etc.
  - Find a list of stopwords to remove.
  - Remove words that occur in more than a certain percentage (like 5%) of questions.

- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
  - Use the apply method to make the code that calculates frequencies more efficient.
  - Only select terms that have high frequencies across the dataset, and ignore the others.

- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
  - See which categories appear the most often.
  - Find the probability of each category appearing in each round.
  
- Use the whole Jeopardy dataset (linked at the top of this project) instead of the subset we used here.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
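
As one possible starting point for the speed-up and the frequency filter mentioned above, the sketch below counts every term in a single pass instead of scanning all rows once per term. The function name `count_terms_by_value`, its parameters, and the toy rows are hypothetical. It assumes each row is a `(clean_question, value_classify)` pair, for example from `zip(jeopardy['clean_question'], jeopardy['value_classify'])`.

```python
from collections import defaultdict

def count_terms_by_value(rows, min_length=6, min_total=5):
    """Return {term: (low_count, high_count)} for terms in at least min_total questions."""
    counts = defaultdict(lambda: [0, 0])  # index 0 = low value, index 1 = high value
    for question, is_high in rows:
        # set() counts each term once per question, like `term in question.split()`.
        for term in set(question.split()):
            if len(term) >= min_length:
                counts[term][is_high] += 1
    return {term: tuple(c) for term, c in counts.items() if sum(c) >= min_total}

# Made-up rows for illustration.
toy_rows = [("kremlin fortress", 0), ("kremlin kremlin palace", 1), ("fortress", 0)]
print(count_terms_by_value(toy_rows, min_total=2))
```

The resulting tuples have the same `(low_count, high_count)` shape that `count_usage` returns, so they could be fed to the chi-squared loop unchanged.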