# Winning Jeopardy

## Introduction

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

We will work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named `jeopardy.csv` and contains about 20,000 rows taken from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/?st=j063dgeb&sh=90ed4830).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

* Show Number -- the Jeopardy episode number of the show this question was in.
* Air Date -- the date the episode aired.
* Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* Category -- the category of the question.
* Value -- the number of dollars answering the question correctly is worth.
* Question -- the text of the question.
* Answer -- the text of the answer.

## Reading in and exploring the data

In [386]:
# importing pandas
import pandas as pd

In [387]:
# reading in jeopardy.csv
jeopardy = pd.read_csv("jeopardy.csv")

In [388]:
# displaying first 5 rows
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [389]:
# displaying column names
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [390]:
# displaying the no. of rows and columns
jeopardy.shape

(19999, 7)

In [391]:
# displaying data types of each column
jeopardy.dtypes

Show Number     int64
 Air Date      object
 Round         object
 Category      object
 Value         object
 Question      object
 Answer        object
dtype: object

## Removing spaces in front of some column names

In [392]:
# renaming columns to remove the leading spaces
jeopardy = jeopardy.rename(columns={
    ' Air Date': 'Air Date', ' Round': 'Round', ' Category': 'Category',
    ' Value': 'Value', ' Question': 'Question', ' Answer': 'Answer',
})

In [393]:
# displaying column names to confirm removal of spaces in front
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
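
As an aside, the explicit mapping above works, but it has to list every column by hand. A shorter alternative is sketched below on a toy frame (the frame `df` is made up for illustration and is not the Jeopardy data). It relies on the `str.strip()` accessor that pandas exposes on an index of string labels.

```python
import pandas as pd

# Toy frame with the same leading-space problem as the raw CSV header.
df = pd.DataFrame(columns=["Show Number", " Air Date", " Round", " Value"])

# Index.str.strip() removes leading and trailing whitespace from every label.
df.columns = df.columns.str.strip()

print(list(df.columns))
```

This also removes trailing whitespace, which the manual mapping does not handle, so it is slightly more aggressive than the original cell.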

## Normalizing columns

In [394]:
# defining a function to normalize strings: convert to lowercase and remove punctuation
import string

def norm_qanda(s):
    norm_s = s.lower()
    exclude = set(string.punctuation)
    norm_s = ''.join(ch for ch in norm_s if ch not in exclude)
    return norm_s

In [395]:
# normalizing the question and answer columns
jeopardy['clean_question'] = jeopardy['Question'].apply(norm_qanda)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(norm_qanda)

In [396]:
# Confirming normalization
jeopardy.loc[:,['Question','clean_question','Answer','clean_answer']].head()

| | Question | clean_question | Answer | clean_answer |
|---|---|---|---|---|
| 0 | For the last 8 years of his life, Galileo was ... | for the last 8 years of his life galileo was u... | Copernicus | copernicus |
| 1 | No. 2: 1912 Olympian; football star at Carlisl... | no 2 1912 olympian football star at carlisle i... | Jim Thorpe | jim thorpe |
| 2 | The city of Yuma in this state has a record av... | the city of yuma in this state has a record av... | Arizona | arizona |
| 3 | In 1963, live on "The Art Linkletter Show", th... | in 1963 live on the art linkletter show this c... | McDonald's | mcdonalds |
| 4 | Signer of the Dec. of Indep., framer of the Co... | signer of the dec of indep framer of the const... | John Adams | john adams |
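
The normalization above filters characters one at a time. A possible alternative, sketched here with only the standard library, is to build a translation table once and let `str.translate` delete the punctuation. The name `norm_text` is hypothetical and not part of the notebook.

```python
import string

# Translation table that maps every punctuation character to "deleted".
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def norm_text(s):
    """Lowercase a string and strip all characters in string.punctuation."""
    return s.lower().translate(PUNCT_TABLE)

print(norm_text("No. 2: 1912 Olympian; football star"))
print(norm_text("McDonald's"))
```

It should give the same result as the character-by-character version for plain strings, and could be passed to `apply` in the same way.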


In [397]:
# defining a function to normalize the strings in the value column:
# remove punctuation and convert to an integer (0 if the value is not numeric)
def norm_val(v):
    exclude = set(string.punctuation)
    norm_v = ''.join(ch for ch in v if ch not in exclude)
    try:
        norm_v = int(norm_v)
    except ValueError:
        norm_v = 0
    return norm_v

In [398]:
# normalizing value column
jeopardy['clean_value'] = jeopardy['Value'].apply(norm_val)

In [399]:
# confirming normalization
jeopardy.loc[:,['Value','clean_value']].head()

| | Value | clean_value |
|---|---|---|
| 0 | $200 | 200 |
| 1 | $200 | 200 |
| 2 | $200 | 200 |
| 3 | $200 | 200 |
| 4 | $200 | 200 |
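
The same conversion can probably be done without a row-by-row function. The sketch below uses a toy Series; the `"$1,000"` and `"None"` entries are assumed examples of what the column may contain, not values shown above. `pd.to_numeric` with `errors="coerce"` turns anything unparseable into `NaN`, which is then replaced by 0 to mirror the fallback in the function above.

```python
import pandas as pd

# Toy values; "None" stands in for any non-numeric entry (an assumption).
values = pd.Series(["$200", "$1,000", "None"])

# Drop dollar signs and thousands separators, then parse as numbers.
stripped = values.str.replace(r"[$,]", "", regex=True)
clean_value = pd.to_numeric(stripped, errors="coerce").fillna(0).astype(int)

print(clean_value.tolist())
```

Unlike the function above, this only strips `$` and `,` rather than all punctuation, which is a deliberate simplification.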


In [400]:
# checking data types
jeopardy.dtypes

Show Number        int64
Air Date          object
Round             object
Category          object
Value             object
Question          object
Answer            object
clean_question    object
clean_answer      object
clean_value        int64
dtype: object

In [401]:
# converting air date from string to datetime type
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'], format='%Y-%m-%d')

In [402]:
# confirming air date datetime conversion
jeopardy['Air Date'].value_counts().head()

2007-11-13    62
2007-11-27    61
2008-12-08    61
2009-05-08    61
2001-05-11    61
Name: Air Date, dtype: int64

In [403]:
# checking data types
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

## Finding answers in questions

In [404]:
# defining a function to return the proportion of words in the answer
# (ignoring 'the') that also appear in the respective question

def words_ans_ques(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    # drop every occurrence of 'the'; it is too common to indicate a meaningful match
    split_answer = [word for word in split_answer if word != 'the']
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [405]:
# applying the function above to each row in jeopardy
jeopardy['answer_in_question'] = jeopardy.apply(words_ans_ques, axis=1)

In [406]:
# Verifying that the function worked as expected
jeopardy.loc[:,['clean_question','clean_answer','answer_in_question']].head()

| | clean_question | clean_answer | answer_in_question |
|---|---|---|---|
| 0 | for the last 8 years of his life galileo was u... | copernicus | 0.0 |
| 1 | no 2 1912 olympian football star at carlisle i... | jim thorpe | 0.0 |
| 2 | the city of yuma in this state has a record av... | arizona | 0.0 |
| 3 | in 1963 live on the art linkletter show this c... | mcdonalds | 0.0 |
| 4 | signer of the dec of indep framer of the const... | john adams | 0.0 |


In [407]:
# descriptive stats of answer in question column
jeopardy['answer_in_question'].describe()

count    19999.000000
mean         0.059737
std          0.166078
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: answer_in_question, dtype: float64

### Observations

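On average, only about 6% of the words in an answer also appear in its question, and for at least 75% of the questions none do (the 75th percentile is 0). Looking for the answer inside the question is therefore unlikely to help us prepare for Jeopardy.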

## Repeating questions

In [408]:
# Calculating, for each question, the proportion of its words (longer than 5 characters)
# that already appeared in earlier questions. This estimates how often question terms
# are repeated, which could help us prepare for jeopardy

jeopardy = jeopardy.sort_values('Air Date')
question_overlap = []
terms_used = set()

for i,row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) == 0:
        question_overlap.append(0)
    else:
        question_overlap.append(match_count/len(split_question))

jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6889055316620302

### Observations

On average, about 69% of the words longer than five characters in a question have already appeared in an earlier question. However, this overlap is measured on single words, not phrases, so it does not show that whole questions are being recycled. It is suggestive enough to be worth investigating further.

## Low and high value questions

In [409]:
# defining a function to classify high and low value questions
def high_val(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [410]:
# applying the function above
jeopardy['high_value'] = jeopardy.apply(high_val, axis=1)
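
The same flag can likely be computed without `apply`, by comparing the whole column at once. This is a sketch on a toy frame (the frame `toy` is hypothetical); the threshold of 800 is the one used above.

```python
import pandas as pd

# Toy frame standing in for the jeopardy DataFrame.
toy = pd.DataFrame({"clean_value": [200, 800, 1000, 2000]})

# The comparison yields booleans; astype(int) turns them into 1 (high) and 0 (low).
toy["high_value"] = (toy["clean_value"] > 800).astype(int)

print(toy["high_value"].tolist())
```

Note that a value of exactly 800 counts as low, matching the strict `>` in `high_val`.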

In [411]:
# defining a function to return the counts of a word appearing in high and low value questions
def high_low_count(word):
    high_count = 0
    low_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [412]:
# Using the function above to get the high and low value question counts for five terms from terms_used
# (terms_used is a set, so which five terms are picked is arbitrary)
observed_expected = []
comparison_terms = list(terms_used)[:5]

for term in comparison_terms:
    observed_expected.append(high_low_count(term))
    
observed_expected

[(1, 0), (1, 0), (1, 0), (1, 2), (0, 2)]

In [413]:
# calculating the chi squared and p-value to understand the statistical significance 
# of these five words appearing in high and low value questions
from scipy.stats import chisquare

chi_squared = []
high_value_count = sum(jeopardy['high_value'] == 1)
low_value_count = jeopardy.shape[0] - high_value_count

for item in observed_expected:
    total = sum(item)
    total_prop = total/ jeopardy.shape[0]
    high_val_exp = total_prop * high_value_count
    low_val_exp = total_prop * low_value_count
    chi_squared.append(chisquare(item, [high_val_exp, low_val_exp]))
    
chi_squared

[Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.031881167234403623, pvalue=0.85828871632352932),
 Power_divergenceResult(statistic=0.80392569225376798, pvalue=0.36992223780795708)]

### Observations

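None of the p-values is below the conventional 0.05 threshold, so none of these differences is statistically significant. Moreover, the chi-squared test is not reliable here because the frequencies are so low: each of the five terms appears in at most three questions. It would make more sense to run the chi-squared test on higher-frequency terms.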
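
To see why such low counts are a problem, it may help to compute the statistic by hand for a term observed once in a high-value question and never in a low-value one. The sketch below is an illustration under an assumption: the high/low split (5734 and 14265) is not printed in the notebook and is assumed here only because it reproduces the statistic reported above. The helper `chi_squared_stat` is hypothetical.

```python
# Assumed split of the 19,999 rows; not shown in the notebook output.
high_value_count = 5734
low_value_count = 14265
n_rows = high_value_count + low_value_count

def chi_squared_stat(observed_high, observed_low):
    """Return the chi-squared statistic and both expected counts for one term."""
    total = observed_high + observed_low
    # Expected counts if the term were spread in proportion to the overall split.
    expected_high = total * high_value_count / n_rows
    expected_low = total * low_value_count / n_rows
    stat = ((observed_high - expected_high) ** 2 / expected_high
            + (observed_low - expected_low) ** 2 / expected_low)
    return stat, expected_high, expected_low

stat, expected_high, expected_low = chi_squared_stat(1, 0)
print(round(stat, 4), round(expected_high, 4), round(expected_low, 4))
```

Both expected counts come out below 1, far under the common rule of thumb of at least 5 per cell, which is why the resulting p-values should not be trusted.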

## Potential next steps

Find a better way to eliminate non-informative words than just removing words that are fewer than 6 characters long. Some ideas:

* Manually create a list of words to remove, like the, than, etc.
* Find a list of stopwords to remove.
* Remove words that occur in more than a certain percentage (like 5%) of questions.
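
The first and third ideas could be combined roughly as follows. This is a minimal sketch on made-up questions: the stopword list is hand-written for illustration, and the 50% cutoff is only chosen to suit four toy questions (on the real data something like the 5% mentioned above would be more sensible).

```python
from collections import Counter

# Made-up, already-normalized questions.
questions = [
    "this state is the largest in the country",
    "this author wrote the novel",
    "the capital of this country",
    "a river in africa",
]

# Hypothetical hand-written stopword list.
stopwords = {"the", "a", "of", "in", "is"}

# Document frequency: in how many questions does each word occur at least once?
doc_freq = Counter()
for q in questions:
    doc_freq.update(set(q.split()))

max_share = 0.5  # assumed cutoff for this toy example
too_common = {w for w, n in doc_freq.items() if n / len(questions) > max_share}

def informative_terms(question):
    """Keep words that are neither stopwords nor too common across questions."""
    return [w for w in question.split()
            if w not in stopwords and w not in too_common]

print(sorted(too_common))
print(informative_terms(questions[0]))
```

A frequency cutoff catches words such as "this" that a generic stopword list might treat differently, but are close to meaningless in Jeopardy clues.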

Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:

* Use the apply method to make the code that calculates frequencies more efficient.
* Only select terms that have high frequencies across the dataset, and ignore the others.
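
One way to remove the bottleneck, sketched below, is to stop rescanning every row for each term and instead count all terms in a single pass with `collections.Counter`. The rows here are made up; on the real data they could come from something like `zip(jeopardy['clean_question'], jeopardy['high_value'])`.

```python
from collections import Counter

# Made-up (clean_question, high_value) pairs.
rows = [
    ("this president signed the treaty", 1),
    ("this president was born in ohio", 0),
    ("the treaty ended this war", 1),
]

high_counts = Counter()
low_counts = Counter()
for question, high_value in rows:
    target = high_counts if high_value == 1 else low_counts
    # set() so a word counts once per question, like `word in split_question`.
    target.update(set(question.split(" ")))

def high_low_count(word):
    """Look up the precomputed counts instead of iterating over all rows."""
    return high_counts[word], low_counts[word]

print(high_low_count("treaty"))
print(high_low_count("president"))
```

With the counts precomputed, picking only high-frequency terms becomes a simple filter on `high_counts[word] + low_counts[word]`.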

Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:

* See which categories appear the most often.
* Find the probability of each category appearing in each round.
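
Both category ideas map onto standard pandas calls. The sketch below uses a toy frame with invented rows: `value_counts` gives the most frequent categories, and `pd.crosstab` with `normalize="index"` gives the share of each category within each round.

```python
import pandas as pd

# Invented rows; only the column names match the dataset.
toy = pd.DataFrame({
    "Round": ["Jeopardy!", "Jeopardy!", "Double Jeopardy!",
              "Double Jeopardy!", "Jeopardy!"],
    "Category": ["HISTORY", "SCIENCE", "HISTORY", "HISTORY", "HISTORY"],
})

# Categories sorted from most to least frequent.
category_counts = toy["Category"].value_counts()

# Each row sums to 1: the estimated probability of a category given the round.
category_given_round = pd.crosstab(toy["Round"], toy["Category"],
                                   normalize="index")

print(category_counts)
print(category_given_round)
```

On the real data most categories appear only a handful of times, so these proportions would be noisy for all but the most common ones.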

Use the whole Jeopardy dataset [available here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/?st=j06k4zh9&sh=6bf18734) instead of the subset we used in this analysis.

Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
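
A simple way to try this is to compare two-word phrases (bigrams) instead of single words. The sketch below runs on made-up questions, and the helper `bigrams` is hypothetical. Unlike the word-level loop earlier, it only counts phrases seen in previous questions, not earlier in the same question.

```python
def bigrams(text):
    """Return the consecutive word pairs of a whitespace-separated string."""
    words = text.split()
    return list(zip(words, words[1:]))

# Made-up, already-normalized questions in air-date order.
questions = [
    "this state borders the pacific ocean",
    "the pacific ocean is the largest ocean",
    "this river flows into the pacific ocean",
]

phrases_used = set()
phrase_overlap = []
for q in questions:
    phrases = bigrams(q)
    seen = sum(1 for p in phrases if p in phrases_used)
    # Update only after counting, so a question never matches itself.
    phrases_used.update(phrases)
    phrase_overlap.append(seen / len(phrases) if phrases else 0)

print(phrase_overlap)
```

Phrase overlap should be much lower than the 69% word overlap, which would make it a stricter signal of genuinely recycled material.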