# Winning_Money_In_Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for a few decades, and is a major force in popular culture. 

In this project, we worked with a dataset of Jeopardy questions to look for patterns in the questions that could help us win. The dataset is named `jeopardy.csv` and contains the first **20000** rows of the full dataset of Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). Here's the beginning of the file:

<img src='jeopardy_beginning_rows.png'>

As shown above, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- `Show Number` -- the Jeopardy episode number of the show this question was in.
- `Air Date` -- the date the episode aired.
- `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- `Category` -- the category of the question.
- `Value` -- the dollar amount awarded for answering the question correctly.
- `Question` -- the text of the question.
- `Answer` -- the text of the answer.

In [1]:
# reading the dataset into pandas

import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
# printing out the columns
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# removing the leading spaces from the column names of the jeopardy dataframe

jeopardy.columns = jeopardy.columns.str.strip()

In [4]:
# defining a function to normalize the text columns
# ('Question' and 'Answer')

# importing regex module
import re

def normalize_text(text):
    # lowercase, then drop everything except letters, digits and whitespace
    lower_text = text.lower()
    lower_text = re.sub(r'[^A-Za-z0-9\s]', '', lower_text)
    return lower_text

# defining a function to normalize the 'Value' column
def normalize_values(text):
    # strip the dollar sign and other punctuation, e.g. '$200' -> '200'
    removed_dollar = re.sub(r'[^A-Za-z0-9\s]', '', text)
    try:
        text = int(removed_dollar)
    except ValueError:
        # non-numeric entries are treated as 0
        text = 0
    return text

In [5]:
# applying the functions to the desired columns and assigning these to
# new columns
jeopardy['clean_question'] = (jeopardy['Question']
                              .apply(normalize_text)
                             )
jeopardy['clean_answer'] = (jeopardy['Answer']
                            .apply(normalize_text)
                           )
jeopardy['clean_value'] = (jeopardy['Value']
                           .apply(normalize_values)
                          )

In [6]:
# examining a few random rows of jeopardy
jeopardy.sample(2)

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 17692 | 5770 | 2009-10-16 | Jeopardy! | ON THE WEBSITE'S FRONT PAGE | $800 | "Share your photos. Watch the world" at this ... | Flickr | share your photos watch the world at this web... | flickr | 800 |
| 1238 | 1302 | 1990-04-10 | Double Jeopardy! | AUTHORS | $800 | An eye ailment contracted at Eton School ended... | Aldous Huxley | an eye ailment contracted at eton school ended... | aldous huxley | 800 |


In [7]:
# converting 'Air Date' column to a datetime dtype
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [8]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

In [9]:
# defining a function that returns the fraction of words in the answer
# (ignoring the first 'the') that also occur in the question

def count_matches(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for answer in split_answer:
        if answer in split_question:
            match_count += 1
    return match_count / len(split_answer)

# applying the function to each row in jeopardy dataframe
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)

In [10]:
# finding the mean of the 'answer_in_question' column
jeopardy['answer_in_question'].mean()

0.06049325706933587

On average, only about 6% of the words in an answer also appear in its question. This is a small share, so we cannot expect the wording of a question to give away the answer. We will probably have to study.
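To make the ratio behind this figure concrete, here is a minimal sketch of the same calculation on two made-up rows. The function name `answer_word_fraction` and the sample strings are hypothetical and not taken from the dataset. Unlike `count_matches`, this version drops every occurrence of 'the' from the answer rather than only the first.

```python
def answer_word_fraction(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# hypothetical cleaned strings, made up for illustration
no_overlap = answer_word_fraction(
    'this new york city borough is home to the yankees', 'the bronx')
partial_overlap = answer_word_fraction(
    'the capital of new york state', 'new york city')

print(no_overlap)       # 0.0, 'bronx' does not occur in the question
print(partial_overlap)  # two of the three answer words occur in the question
```

A value near 0.06 across the whole dataset means that most rows look like the first case, not the second.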

In [11]:
# investigating how often new questions are repeats of older ones

question_overlap = []
terms_used = set()

# sorting jeopardy dataframe by ascending air date
jeopardy = jeopardy.sort_values('Air Date')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
        
jeopardy['question_overlap'] = question_overlap

jeopardy['question_overlap'].mean()

0.6876260592169802

On average, about 69% of the terms longer than five characters in a question had already appeared in an earlier question. This estimate comes from a small subset of the full dataset, and it compares single terms rather than phrases, so it is weak evidence on its own. It does suggest that the recycling of questions is worth a closer look.
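One way to move from single terms to phrases is to compare pairs of adjacent words (bigrams) instead. The sketch below is only an illustration of that idea, not part of the original analysis. The helper names `bigrams` and `phrase_overlap` and the sample questions are hypothetical.

```python
def bigrams(words):
    """Return the set of adjacent word pairs in a list of words."""
    return set(zip(words, words[1:]))

def phrase_overlap(questions):
    """For each question, in order, return the share of its bigrams
    that already appeared in an earlier question."""
    seen = set()
    overlaps = []
    for question in questions:
        pairs = bigrams(question.split())
        if pairs:
            overlaps.append(len(pairs & seen) / len(pairs))
        else:
            overlaps.append(0)
        seen |= pairs
    return overlaps

# made-up cleaned questions, assumed to be sorted by air date
sample = [
    'this city is the capital of france',
    'this city is the capital of italy',
    'who wrote hamlet',
]
result = phrase_overlap(sample)
print(result)
```

In this sample the second question shares five of its six bigrams with the first, while the third shares none. On the real data, the same function could be applied to `jeopardy['clean_question']` after sorting by `Air Date`.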

In [12]:
# creating a function that flags each question as high value
# (worth more than $800) or low value

def determine_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

In [13]:
# examining high_value column
jeopardy.sample(2)

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value | answer_in_question | question_overlap | high_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7461 | 5094 | 2006-11-02 | Double Jeopardy! | I BIT OFF MORE THAN I COULD CHEW | $2000 | Jim Reeves was top-"seeded" after he chomped 1... | watermelon | jim reeves was topseeded after he chomped 13 p... | watermelon | 2000 | 0.0 | 0.6 | 1 |
| 10129 | 5825 | 2010-01-01 | Jeopardy! | IDEAS FOR TOURISM CAMPAIGNS | $800 | From Koluszki to Kolno, & Wozniki to Strzelce,... | Poland | from koluszki to kolno wozniki to strzelce vi... | poland | 800 | 0.0 | 0.25 | 0 |


In [14]:
# creating a function to count usage of words/terms

def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# converting 'terms_used' into a list and assigning first 5 elements
# to comparison_terms (a set is unordered, so this is an arbitrary
# sample of terms that can differ between runs)
comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 4), (0, 1), (2, 3), (2, 3), (1, 0)]

In [15]:
# computing expected counts and chi-squared value

from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    jeopardy_rows = jeopardy.shape[0]
    total_prop = total / jeopardy_rows
    high_value_expected = total_prop * high_value_count
    low_value_expected = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_expected, low_value_expected])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=1.607851384507536, pvalue=0.20479409439225948),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.3137668167849311, pvalue=0.5753778622944691),
 Power_divergenceResult(statistic=0.3137668167849311, pvalue=0.5753778622944691),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

None of the five terms showed a significant difference in usage between high value and low value rows (every p-value is above 0.05). In addition, all of the observed counts were below 5, and the chi-squared test is unreliable at such low frequencies. It would be better to rerun the test using only terms that occur more often.
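A rough sketch of that follow-up is shown below, using synthetic rows so that it runs on its own. The helpers `frequent_terms` and `term_chi_squared`, the `min_count` cut-off and the sample data are all assumptions made for illustration. The usual rule of thumb concerns expected counts of at least 5 in each cell, so a cut-off on the total count is only a first filter.

```python
from collections import Counter
from scipy.stats import chisquare

def frequent_terms(rows, min_count):
    """Terms longer than 5 characters that occur in at least min_count questions."""
    counts = Counter()
    for question, _ in rows:
        counts.update({w for w in question.split() if len(w) > 5})
    return [term for term, count in counts.items() if count >= min_count]

def term_chi_squared(term, rows):
    """Chi-squared test of a term's high/low usage against the overall high/low split."""
    high_total = sum(1 for _, high in rows if high)
    low_total = len(rows) - high_total
    high_obs = sum(1 for q, high in rows if high and term in q.split())
    low_obs = sum(1 for q, high in rows if not high and term in q.split())
    share = (high_obs + low_obs) / len(rows)
    return chisquare([high_obs, low_obs],
                     [share * high_total, share * low_total])

# synthetic (clean_question, high_value) pairs, made up for illustration
rows = ([('which planet is largest', 1)] * 6
        + [('which planet is closest', 0)] * 6
        + [('name this author', 0)] * 8)

terms = frequent_terms(rows, min_count=10)
print(terms)  # ['planet'] is the only term used in 10 or more questions
chi_result = term_chi_squared('planet', rows)
print(chi_result)
```

On the real data, the rows could be built from the `clean_question` and `high_value` columns, and `min_count` raised until the expected counts are comfortably above 5.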