# Analysing patterns in Jeopardy questions

In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions. It can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

In [172]:
import pandas as pd
import numpy as np
import re
import random
from scipy.stats import chisquare

## Understanding and cleaning the data

In [173]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [174]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [175]:
# Removing spaces before column names
jeopardy.columns = ['Show Number','Air Date','Round','Category','Value','Question','Answer']


The Question and Answer columns contain free text, so we'll normalise them by converting everything to lowercase and removing punctuation.

In [176]:
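def normalize_text(text):
    # Lowercase first so that matching is case-insensitive
    text = text.lower()
    # Raw strings avoid invalid-escape warnings for \s
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text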
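
As a quick sanity check, the normalisation can be tried on a couple of strings taken from the table above. This is a minimal, self-contained sketch that repeats the function so it runs on its own.

```python
import re

def normalize_text(text):
    # Lowercase, drop punctuation, then collapse repeated whitespace
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text)

for sample in ["McDonald's", "EPITAPHS & TRIBUTES"]:
    print(repr(sample), "->", repr(normalize_text(sample)))
```

Removing the ampersand leaves two adjacent spaces, which is why the final whitespace-collapsing step is needed.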


In [177]:
# normalising question and answer columns
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)


We'll also normalise the Value column and convert the Air Date column values to datetime values.

In [178]:
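def normalize_values(value):
    # Strip punctuation so that "$200" becomes "200"
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    try:
        value = int(value)
    except ValueError:
        # Anything that is not a number is treated as 0
        value = 0
    return value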
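
The behaviour of this function on a few inputs can be sketched as follows. The input strings are illustrative only; in particular, "None" is an assumption about how a missing value might be written in the file.

```python
import re

def normalize_values(value):
    # Strip punctuation such as "$" and ",", then try to parse an integer
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    try:
        return int(value)
    except ValueError:
        # Non-numeric entries fall back to 0
        return 0

for raw in ["$200", "$2,000", "None"]:  # illustrative inputs
    print(raw, "->", normalize_values(raw))
```

One consequence of the fallback is that rows without a numeric value count as 0 and so land in the low-value group later on.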
    

In [179]:
jeopardy["clean_value"] = jeopardy['Value'].apply(normalize_values)

# Convert the whole column at once rather than row by row
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'], format='%Y-%m-%d')


In [180]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


## Deciding what to study

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question
- How often questions are repeated

We will answer the first question by measuring what fraction of the words in each answer also occur in its question.

In [181]:
def matching(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)
    

In [182]:
# Counting how many times words in answers appear in questions
jeopardy['answer_in_question'] = jeopardy.apply(matching, axis=1)
mean_matches = jeopardy['answer_in_question'].mean()
print(mean_matches)


0.05900196524977763


On average, only about 5.9% of the words in an answer also appear in its question, so the question text rarely gives the answer away. We will therefore need to study to be able to answer questions.
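
To make the metric concrete, here is a minimal sketch of the same ratio computed for a single made-up question and answer pair. The helper name `answer_overlap` and the sample strings are hypothetical, and unlike `matching` above this variant drops every occurrence of "the" rather than only the first.

```python
def answer_overlap(clean_question, clean_answer):
    # Ignore "the", which says nothing about the answer
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    # Fraction of answer words that also occur in the question
    hits = sum(w in question_words for w in answer_words)
    return hits / len(answer_words)

# Made-up example: "tom" and "sawyer" match, "adventures" and "of" do not
ratio = answer_overlap(
    "this mark twain novel follows tom sawyer",
    "the adventures of tom sawyer",
)
print(ratio)
```

Here two of the four remaining answer words appear in the question, giving a ratio of 0.5. The dataset-wide mean of about 0.059 shows that overlaps of this size are rare.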

### Repeated questions

Now we want to understand how often new questions repeat older ones, to see whether past questions are a useful study guide. Even though we only have about 10% of the full Jeopardy questions dataset, we can use this sample to estimate how often longer words (6 or more characters) recur.

To do this, we will:

- Sort jeopardy in order of ascending air date.
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy.
- Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
- If it does, increment a counter.
- Add each word to terms_used.

This allows us to check whether the terms in a question have been used previously. Only looking at words with six or more characters filters out words like "the" and "than", which are commonly used but don't tell us much about a question.

In [183]:
question_overlap = []
terms_used = set()
jeopardy.sort_values(by='Air Date',inplace=True)
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [x for x in split_question if len(x) >= 6]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap   

In [184]:
jeopardy['question_overlap'].mean()

0.6876260592169802

On average, about 68.8% of the long terms in a question have already appeared in an earlier question, so roughly two out of three such terms are repeats. Although this only looks at single words rather than whole questions, it suggests that further analysis of repeated questions is worthwhile.

### Looking at high value questions

We'll figure out which terms correspond to high-value questions using a chi-squared test. First, we'll narrow down the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

Then we'll loop through the terms collected in the previous step, terms_used, and for each one:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi-squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions by selecting the words with the highest chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small random sample now.
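
The arithmetic behind these steps can be sketched by hand for a single term. The row totals below (20000 rows, 6000 of them high value) are hypothetical round numbers chosen for illustration, not the counts from this dataset, and `chi_squared_stat` is a helper name introduced here.

```python
def chi_squared_stat(observed, expected):
    # Sum of (observed - expected)^2 / expected over the categories
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical totals, for illustration only
total_rows = 20000
high_rows = 6000
low_rows = total_rows - high_rows

# A term seen in 2 high value questions and 0 low value questions
observed = [2, 0]

# Share of all questions that contain the term
share = sum(observed) / total_rows

# If the term were unrelated to value, it would split in proportion to group size
expected = [share * high_rows, share * low_rows]

stat = chi_squared_stat(observed, expected)
print(expected, round(stat, 2))
```

With these assumed totals the expected counts are about 0.6 and 1.4, and the statistic comes out at roughly 4.67. Expected counts this far below the usual rule of thumb of 5 are a sign that the chi-squared approximation should not be trusted for such a rare term.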

In [185]:
# Determining whether questions are high or low value

def high_low(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(high_low, axis=1)

In [186]:
def count_high_low(word):
    low_count = 0
    high_count = 0
    for i,row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count +=1 
    return [high_count, low_count]


In [187]:
comparison_terms = random.sample(list(terms_used), 10)
observed_expected = []
for x in comparison_terms:
    result = count_high_low(x)
    observed_expected.append(result)


In [188]:
print(observed_expected)

[[1, 3], [1, 0], [2, 6], [0, 1], [0, 1], [0, 1], [1, 0], [1, 0], [2, 0], [0, 1]]


In [189]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

In [190]:
for pair in observed_expected:
    total = sum(pair)
    total_prop = total/jeopardy.shape[0]
    exp_high = total_prop*high_value_count
    exp_low = total_prop*low_value_count
    observed = np.array(pair)
    expected = np.array([exp_high,exp_low])
    chi_squared.append(chisquare(observed, expected))

In [191]:
chi_squared

[Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.05272886616881538, pvalue=0.818381104912348),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

In [192]:
comparison_terms

['statehood',
 'tufting',
 'pictures',
 'abandoning',
 'federations',
 'robson',
 'gunner',
 'excommunist',
 'pundit',
 'waller']

### Chi-squared results

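Of the ten sampled terms, only "pundit" (p ≈ 0.026) falls below the conventional 0.05 significance level, and it occurs in just two questions, so that result should not be taken at face value. More generally, every observed count was 6 or lower and the expected counts are correspondingly small, which makes the chi-squared test unreliable here. It would be better to run this test only on terms with higher frequencies.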

## Conclusion

Based on our analysis, it appears that studying is recommended to do well on Jeopardy, and one strategy to explore further is whether past questions tend to be repeated.

Possible next steps:

- Find a better way to eliminate non-informative words than just removing words that are shorter than 6 characters. Some ideas:
    - Manually create a list of words to remove, like "the", "than", etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.

- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.

- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (available [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file)) instead of the subset we used in this project.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
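
As one possible starting point for the speed and frequency ideas above, the per-term counts can be gathered in a single pass over the questions instead of rescanning the data for every term. This is a minimal sketch on made-up rows. The `rows` data and the `min_total` threshold are assumptions for illustration; in the notebook the pairs would come from the `clean_question` and `high_value` columns.

```python
from collections import Counter

# Made-up (clean_question, high_value) pairs standing in for the DataFrame rows
rows = [
    ("this european capital sits on the danube", 1),
    ("the danube flows through this capital", 0),
    ("this capital of france hosts the louvre", 0),
]

high_counts = Counter()
low_counts = Counter()
for clean_question, high_value in rows:
    # A set counts each term once per question, as count_high_low does
    terms = {w for w in clean_question.split() if len(w) >= 6}
    if high_value:
        high_counts.update(terms)
    else:
        low_counts.update(terms)

# Keep only terms seen in at least min_total questions (threshold is an assumption)
min_total = 2
frequent = {
    term: (high_counts[term], low_counts[term])
    for term in set(high_counts) | set(low_counts)
    if high_counts[term] + low_counts[term] >= min_total
}
print(frequent)
```

Each entry maps a term to its (high, low) question counts, which is the same shape of input the chi-squared step expects. Raising `min_total` would restrict the test to terms frequent enough for the result to be meaningful.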