# Guided Project: Winning Jeopardy

In this project, I'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help me win.

The dataset, jeopardy.csv, contains the first 20,000 rows of a full dataset of Jeopardy questions, which can be downloaded from [this reddit post](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

In [1]:
# Importing the dataset
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")

In [2]:
# Exploring the first five rows
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
# Exploring the columns
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
# Fixing the column names
# Strip the leading whitespace from each name, then rename "Air Date"
fixed_columns = list(jeopardy.columns.str.strip())
fixed_columns[1] = "Air_Date"
jeopardy.columns = fixed_columns

In [5]:
# Checking the column names
jeopardy.head()

Unnamed: 0,Show Number,Air_Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


## Normalising Text

Before I can start doing analysis on the Jeopardy questions, I need to normalise all of the text columns (the Question and Answer columns).
I will do the following:
- Write a function that:
    - Takes in a string.
    - Converts the string to lowercase.
    - Removes all punctuation in the string.
    - Returns the string.
- Use the function to normalise the Question column.
- Use the function to normalise the Answer column.

In [6]:
# Writing a normalising function
import re
def normalising(string):
    lowercase = string.lower()
    punctuation = re.sub(r'[^\w\s]','',lowercase)
    return punctuation

In [7]:
# Testing normalising function
test_string1 = "Liverpool F.C. is the BEST football team ever!"
normalising(test_string1)

'liverpool fc is the best football team ever'

In [8]:
# Using the normalising function on the Question column
jeopardy["clean_question"] = jeopardy["Question"].apply(normalising)

In [9]:
# Using the normalising function on the Answer column
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalising)

In [10]:
# Inspecting dataset for normalising changes
jeopardy.head(10)

Unnamed: 0,Show Number,Air_Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant,in the title of an aesop fable this insect sha...,the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way,built in 312 bc to link rome the south of ita...,the appian way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan,no 8 30 steals for the birmingham barons 2306 ...,michael jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington,in the winter of 197172 a record 1122 inches o...,washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel,this housewares store was named for the packag...,crate barrel


## Normalising Columns

Now that I've normalised the text columns, there are also some other columns to normalise.

The Value column should be numeric so that I can manipulate it more easily. I'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, so that I can work with it more easily.

Similar to before, I'll need to do the following steps for the Value column:
- Write a function to normalise dollar values that:
    - Takes in a string.
    - Removes any punctuation in the string.
    - Converts the string to an integer.
    - Assigns 0 instead if the conversion has an error.
    - Returns the integer.

In [11]:
# Writing a normalising function for values
def norm_val(string):
    punctuation = re.sub(r'[^\w\s]','',string)
    try:
        integer = int(punctuation)
    except Exception:
        integer = 0
    return integer

In [12]:
# Testing above function
test_value1 = "$1"
norm_val(test_value1)

1

In [13]:
# Normalising the Value column with the above function
jeopardy["clean_value"] = jeopardy["Value"].apply(norm_val)

In [14]:
# Inspecting values for normalising changes
jeopardy["clean_value"].head(10)

0    200
1    200
2    200
3    200
4    200
5    200
6    400
7    400
8    400
9    400
Name: clean_value, dtype: int64

In [15]:
# Converting Air Date column to datetime
jeopardy["Air_Date"] = pd.to_datetime(jeopardy["Air_Date"])

In [16]:
# Inspecting Air Dates for changes
jeopardy["Air_Date"].head(10)

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
5   2004-12-31
6   2004-12-31
7   2004-12-31
8   2004-12-31
9   2004-12-31
Name: Air_Date, dtype: datetime64[ns]

## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
- How often the answer can be deduced from the question.
- How often questions are repeated.

I can answer the first question by seeing how many of the words in the answer also occur in the question.

In [17]:
# Building the function
def counter(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [18]:
# Calculating the share of terms in clean_answer that occur in clean_question
jeopardy["answer_in_question"] = jeopardy.apply(counter, axis=1)

In [19]:
# Finding the mean of the answer_in_question column
jeopardy["answer_in_question"].mean()

0.05900196524977763

On average, only about 5.90% of the words in an answer also occur in the question. Furthermore, the true figure is probably even lower, because this mean counts matches on common words like "a". Given these results, I would not recommend trying to work out the answer from the words in the question.
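To see how much common words could affect this figure, the matching step can be sketched with a small stop-word filter. This is only an illustration, not the function used above: the name `answer_overlap` and the contents of `STOP_WORDS` are assumptions, and it counts each distinct answer term once.

```python
# Hypothetical stop-word list; an assumption for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "in", "and"}

def answer_overlap(clean_answer, clean_question, stop_words=STOP_WORDS):
    """Share of distinct non-stop-word answer terms that also appear in the question."""
    answer_terms = {w for w in clean_answer.split() if w not in stop_words}
    if not answer_terms:
        # Nothing left to match once stop words are removed.
        return 0
    question_terms = set(clean_question.split())
    return len(answer_terms & question_terms) / len(answer_terms)

# "the" is ignored, "way" matches, "appian" does not.
print(answer_overlap("the appian way", "this way linked rome and the south of italy"))
```

With the two columns already normalised, a function like this could be applied row by row in the same way as `counter`.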

## Recycled Questions

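Let's say I want to investigate how often new questions are repeats of older ones. I can't completely answer this, because I only have about 10% of the full Jeopardy question dataset, but I can investigate it at least. To do so, I'll sort the questions by air date and, for each question, check what share of its longer words (six or more characters) already appeared in earlier questions.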

In [20]:
# Investigating repeated questions
question_overlap = []
terms_used = set()
jeopardy = jeopardy.sort_values("Air_Date",ascending=True)

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only longer words to skip common terms like "the" and "than"
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    # Count the words that already appeared in an earlier question
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

In [21]:
# Mean term overlap across questions
sum(question_overlap)/len(question_overlap)

0.6876235590919714

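On average, 68.8% of the longer terms in a question had already appeared in an earlier question. This measures the overlap of single terms rather than of whole questions, so it is only rough evidence, but given this result I would recommend studying older questions, as there is a good chance that their material will be repeated.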
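A toy run makes it clearer what this overlap figure measures. The sketch below repeats the same idea on three made-up questions; the function name `term_overlap` and the sample questions are assumptions for illustration, not part of the dataset.

```python
def term_overlap(questions, min_length=6):
    """For each question, the share of its long terms already seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        terms = [w for w in question.split() if len(w) >= min_length]
        matches = sum(1 for w in terms if w in seen)
        overlaps.append(matches / len(terms) if terms else 0)
        # Only add the terms after scoring, so a question never matches itself.
        seen.update(terms)
    return overlaps

# Made-up questions, assumed to be sorted by air date already.
toy_questions = [
    "this planet is closest to the sun",
    "this planet has the famous rings",
    "the famous author of hamlet",
]
print(term_overlap(toy_questions))
```

The second question scores 0.5 because it shares "planet" with the first, even though the two questions are different. A high mean overlap therefore shows that vocabulary is reused, which is weaker than showing that whole questions are repeated.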

## Low Value vs High Value Questions

Let's say I only want to study high-value questions rather than low-value ones. This will help me earn more money when I'm on Jeopardy.

I can figure out which terms correspond to high-value questions using a chi-squared test. I'll treat a question as high value if its value is above $800, then count how often a term appears in high-value and low-value questions.

In [22]:
# Creating value-calculator function
def question_value(row):
    if row["clean_value"] > 800:
        value = 1
    else:
        value = 0
    return value

In [23]:
# Applying the above function to each row of the dataframe
jeopardy["high_value"] = jeopardy.apply(question_value, axis=1)

In [24]:
# Creating another function for above investigation
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [25]:
# Comparing terms
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 1),
 (0, 2),
 (2, 2),
 (0, 1),
 (0, 10),
 (2, 3),
 (0, 1),
 (1, 1),
 (0, 3),
 (0, 2)]

## Applying the Chi-squared Test

Now that I've found the observed counts for a few terms, I can compute the expected counts and the chi-squared value.
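As a worked illustration of that calculation, the sketch below computes the expected counts and Pearson's chi-squared statistic by hand for a single term. All of the numbers are hypothetical and chosen to keep the arithmetic simple; they are not taken from the dataset.

```python
def chi_squared_statistic(observed, expected):
    """Pearson's chi-squared statistic: the sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical numbers for illustration only, not taken from the dataset.
total_rows = 1000
high_value_rows = 300
low_value_rows = 700

observed = (2, 8)  # (high_count, low_count) for one term
# Share of all rows that contain the term.
term_share = sum(observed) / total_rows
# If the term were unrelated to value, it would be spread in proportion
# to the number of high-value and low-value rows.
expected = (term_share * high_value_rows, term_share * low_value_rows)

print(expected)
print(chi_squared_statistic(observed, expected))
```

Here the expected counts are 3 and 7, so the statistic is (2 - 3)^2 / 3 + (8 - 7)^2 / 7, roughly 0.48. `scipy.stats.chisquare` computes the same statistic and also returns the p-value.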

In [26]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483468),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=4.01962846126884, pvalue=0.04497362002407036),
 Power_divergenceResult(statistic=0.3137668167849311, pvalue=0.5753778622944691),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571)]

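Only one of the ten terms (the fifth, which appeared in 0 high-value and 10 low-value questions) had a p-value below 0.05, at about 0.045; none of the others showed a significant difference in usage between high-value and low-value rows. Additionally, almost all of the observed frequencies were lower than 5, so the chi-squared test isn't reliable here. It would be better to run this test with only terms that have higher frequencies.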
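One way to pick such terms is to count how many questions each term appears in and keep only the frequent ones. The sketch below is a minimal version of that idea; the function name `frequent_terms` and the thresholds `min_count` and `min_length` are assumptions, not values from the analysis above.

```python
from collections import Counter

def frequent_terms(clean_questions, min_count=5, min_length=6):
    """Terms of at least min_length characters that occur in at least min_count questions."""
    counts = Counter()
    for question in clean_questions:
        # Use a set so each term is counted once per question.
        counts.update({w for w in question.split() if len(w) >= min_length})
    return [term for term, n in counts.most_common() if n >= min_count]

# Made-up sample; with the real data, the clean_question column could be passed instead.
sample = ["target planet", "target author", "target target"]
print(frequent_terms(sample, min_count=3))
```

The usual rule of thumb for the chi-squared test concerns the expected counts in each category rather than the total count, so in practice `min_count` would probably need to be set well above 5.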