# Introducing the Dataset

In [1]:
import pandas as pd
import os

os.chdir(r"C:\Users\Gerrit\Desktop\Data Science Portfolio\data")

jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
#Making all columns lowercase and removing spaces
columns = []
for col in jeopardy.columns:
    col = col.replace(" ","").lower()
    columns.append(col)
    
jeopardy.columns = columns

# Normalizing Columns

In [4]:
import re

def normalize_text(string):
    # Lowercase, then drop everything except letters, digits, and whitespace
    string = string.lower()
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    return string

jeopardy["clean_question"] = jeopardy["question"].apply(normalize_text)

jeopardy["clean_answer"] = jeopardy["answer"].apply(normalize_text)

In [5]:
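def normalize_values(string):
    # Strip "$" and "," so that "$2,000" becomes "2000"
    string = re.sub("[^A-Za-z0-9]+", "", string)
    try:
        string = int(string)
    except ValueError:
        # Anything that is not a whole number is treated as 0
        string = 0
    return string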

jeopardy["clean_values"] = jeopardy["value"].apply(normalize_values)

In [6]:
jeopardy["airdate"] = pd.to_datetime(jeopardy["airdate"])

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

How often the answer is deducible from the question.

How often new questions are repeats of older questions.

In [7]:
# Fraction of the terms in clean_answer that also occur in clean_question
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    # "the" is so common that matching it tells us nothing
    if "the" in split_answer:
        split_answer.remove("the")
    # Avoid dividing by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for term in split_answer:
        if term in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answers_in_question"] = jeopardy.apply(count_matches, axis=1)

In [8]:
mean_answers_in_questions = jeopardy["answers_in_question"].mean()
print(mean_answers_in_questions)

0.0604932570693


On average, only about 6% of the words in an answer also appear in its question. That is too low to count on the question giving the answer away, so we'll probably have to study.
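To make the metric concrete, here is a minimal sketch of the same per-row calculation applied to plain strings. The function name and the sample questions are hypothetical and are not taken from jeopardy.csv; the logic is meant to mirror `count_matches`.

```python
def fraction_of_answer_in_question(clean_question, clean_answer):
    """Share of answer words (after dropping one "the") found in the question."""
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    if "the" in answer_words:
        answer_words.remove("the")
    if len(answer_words) == 0:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)


# Hypothetical rows, made up for illustration only
no_overlap = fraction_of_answer_in_question(
    "this state is home to a famous canyon", "arizona"
)
half_overlap = fraction_of_answer_in_question(
    "this canyon state borders utah", "grand canyon"
)
full_overlap = fraction_of_answer_in_question(
    "the grand canyon is in this state", "the grand canyon"
)

print(no_overlap, half_overlap, full_overlap)
```

A mean of about 0.06 over the whole column therefore suggests that most rows look like the first case, where none of the answer's words appear in the question.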

# Recycled Questions

Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can investigate it at least.

In [9]:
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [a for a in split_question if len(a) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
    
jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.69087373156719623

On average, about 69% of the terms of six or more characters in a question have already appeared in an earlier question. This covers only a small subset of all Jeopardy questions, and it compares single terms rather than phrases, so the figure is weak evidence on its own. It does suggest that the recycling of questions is worth a closer look.
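The overlap score can be easier to follow on a tiny example. The sketch below repeats the same idea on a list of strings; the function name, the `min_length` parameter, and the three sample questions are assumptions made for illustration.

```python
def overlap_scores(clean_questions, min_length=6):
    """For each question, the share of its long terms seen in earlier questions."""
    terms_seen = set()
    scores = []
    for question in clean_questions:
        words = [w for w in question.split(" ") if len(w) >= min_length]
        if words:
            score = sum(1 for w in words if w in terms_seen) / len(words)
        else:
            score = 0
        # Only add this question's terms after scoring it
        terms_seen.update(words)
        scores.append(score)
    return scores


# Hypothetical questions, made up for illustration only
sample = [
    "this italian astronomer was placed under house arrest",
    "this astronomer discovered the moons of jupiter",
    "jupiter is the largest of these",
]
scores = overlap_scores(sample)
print(scores)
```

The first question always scores 0 because nothing has been seen yet, and scores tend to rise as the set of seen terms grows. This is one reason the mean depends on how many questions are in the sample.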

# Low Value Vs High Value Questions

You can use a chi-squared test to find out which terms are associated with high-value questions (here, questions worth more than $800).

In [10]:
def determine_value(row):
    value = 0
    if row["clean_values"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [11]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

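# terms_used is a set, so this picks five arbitrary terms
comparison_terms = list(terms_used)[:5]
# Despite its name, this holds only observed (high_count, low_count) pairs;
# the expected counts are computed later
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))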

observed_expected

[(0, 1), (0, 1), (2, 1), (0, 1), (0, 1)]

# Applying the Chi-Squared Test

In [12]:
from scipy.stats import chisquare
import numpy as np

high_value_count = len(jeopardy[jeopardy["high_value"] == 1])
low_value_count = len(jeopardy[jeopardy["high_value"] == 0])
chi_squared = []
pvalues = []

for each in observed_expected:
    total = sum(each)
    total_prop = total/jeopardy.shape[0]
    exp_high = total_prop*high_value_count
    exp_low = total_prop*low_value_count
    
    obs = np.asarray([each[0],each[1]])
    exp = np.asarray([exp_high,exp_low])
    chisq, pvalue = chisquare(obs,exp)
    chi_squared.append(chisq)
    pvalues.append(pvalue)

significant_pvalues = [x for x in pvalues if x <= .05]
print("Number of significant results:", len(significant_pvalues))

chi_squared

Number of significant results: 0


[0.40196284612688399,
 0.40196284612688399,
 2.1177104383031944,
 0.40196284612688399,
 0.40196284612688399]

As shown above, none of the five terms shows a statistically significant difference between its usage in high-value and low-value questions. In addition, every term occurred fewer than 5 times, and the chi-squared approximation is unreliable with counts that small. It would be better to run this test only on terms that have higher frequencies.
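To see why small counts cannot reach significance, it may help to work the statistic out by hand. The sketch below uses only the standard library and assumes two categories, so there is one degree of freedom, which is also what `scipy.stats.chisquare` uses by default for two bins. The row totals are hypothetical round numbers, not the real counts from the dataset.

```python
import math


def chi_squared_two_bins(observed, expected):
    """Chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical totals, chosen only to keep the arithmetic readable
total_rows = 20000
high_value_rows = 6000
low_value_rows = 14000


def expected_counts(high_count, low_count):
    """Split a term's total usage in proportion to the high/low row totals."""
    share = (high_count + low_count) / total_rows
    return share * high_value_rows, share * low_value_rows


# A rare term (3 uses) and a common term (100 uses)
rare_stat, rare_p = chi_squared_two_bins((2, 1), expected_counts(2, 1))
common_stat, common_p = chi_squared_two_bins((60, 40), expected_counts(60, 40))

print(rare_stat, rare_p)
print(common_stat, common_p)
```

With only three occurrences, even a lopsided split stays well above the 0.05 threshold, while a term with a hundred occurrences and a clear skew falls far below it.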

# Further Analysis

Here are some potential next steps:

- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    - Manually create a list of words to remove, like the, than, etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (available here: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file?st=iscjul8i&sh=140c8205) instead of the subset we used in this mission.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
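As a starting point for the speed and frequency ideas above, here is a rough sketch that counts every term in a single pass instead of rescanning the whole dataset once per term. The function name, the `stopwords` and `min_total` parameters, and the sample rows are all hypothetical. It also uses `split()` rather than `split(" ")`, which drops empty strings and so may give slightly different counts than `count_usage`.

```python
from collections import Counter


def count_terms_by_value(rows, stopwords=frozenset(), min_total=5):
    """Map each frequent term to (high_count, low_count) in one pass over rows.

    rows is an iterable of (clean_question, high_value) pairs.
    """
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, high_value in rows:
        # A set counts each term once per question, as count_usage does
        terms = set(clean_question.split()) - stopwords
        target = high_counts if high_value == 1 else low_counts
        target.update(terms)
    return {
        term: (high_counts[term], low_counts[term])
        for term in set(high_counts) | set(low_counts)
        if high_counts[term] + low_counts[term] >= min_total
    }


# Hypothetical rows, made up for illustration only
rows = [
    ("this river flows through egypt", 0),
    ("this river is the longest in europe", 1),
    ("the river thames flows through this city", 0),
]
frequent = count_terms_by_value(
    rows, stopwords={"this", "the", "in", "is"}, min_total=2
)
print(frequent)
```

On the notebook's DataFrame, the rows could presumably be supplied as `zip(jeopardy["clean_question"], jeopardy["high_value"])`, and the resulting pairs fed to the chi-squared loop in place of the five arbitrary terms.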