# Chi squares - Winning Jeopardy

Let's investigate a dataset called `jeopardy.csv` to see if we can discern some patterns in its questions. We'll use chi-squared tests to analyze the data.


Hopefully this will give us an edge for when we compete!

### Load the dataset

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some column names have spaces in front of them. Let's remove these.

## Data Cleaning

In [3]:
# Strip whitespace
jeopardy.columns = jeopardy.columns.str.strip()

In [4]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Now let's normalize the `Question` and `Answer` columns by removing punctuation and converting all words to lowercase. We don't want the same word to register as different words just because of capitalization or attached punctuation.

Let's write a function to complete this task.

In [5]:
import re  # For text replacement

def normalize_text(text):
    # Lowercase, drop punctuation, and collapse runs of whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
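As a quick sanity check, here is a minimal standalone sketch of the same normalization applied to a few short strings. The sample strings are taken from, or modeled on, the rows shown above. Note that hyphenated words are fused rather than split, because the hyphen is simply deleted.

```python
import re

def normalize_text(text):
    # Same steps as above: lowercase, drop punctuation, collapse whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

samples = ["McDonald's", "Signer of the Dec. of Indep.", "ALL-TIME  ATHLETES"]
for s in samples:
    print(repr(s), "->", repr(normalize_text(s)))
# "McDonald's" -> 'mcdonalds'
# 'Signer of the Dec. of Indep.' -> 'signer of the dec of indep'
# 'ALL-TIME  ATHLETES' -> 'alltime athletes'
```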

Great. Let's also normalize the `Value` column to remove the dollar sign and convert it from a string to an integer.

In [6]:
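def normalize_values(text):
    # Strip "$" and "," and convert to int; anything that can't be
    # parsed (a missing value, for example) becomes 0
    try:
        return int(re.sub(r"[^A-Za-z0-9\s]", "", text))
    except (TypeError, ValueError):
        return 0
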

In [7]:
jeopardy['clean_values'] = jeopardy['Value'].apply(normalize_values)

Finally, let's clean the `Air Date` column and convert it to a datetime column.

In [8]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [9]:
# Look at our cleaned data
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_values
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


## Analyzing questions

In order to decide what questions to study, let's figure out two things:

* How often the answer appears in the question itself, and
* How often questions are repeated

This will tell us the most important things to study! We'll write a function to answer these questions.

### Answer in question

In [10]:
# Do words in the answer also occur in the question?
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    # Ignore "the" since it doesn't provide useful info
    if "the" in split_answer:
        split_answer.remove("the")
        
    # If length of split_answer is 0, return 0
    # Prevents division by 0
    if len(split_answer) == 0:
        return 0
        
    match_count = 0
    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count/len(split_answer)            

jeopardy["answer_in_question"] = jeopardy.apply(count_matches,axis=1)

In [11]:
jeopardy['answer_in_question'].mean()

0.05900196524977763

On average, only about 6% of the words in an answer also appear in its question, so we usually can't expect to hear the answer read to us in the question.

### Recycled Questions

How often are old questions used?


In [12]:
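question_overlap = []
terms_used = set()

# Process questions in chronological order so "already used" is meaningful
jeopardy = jeopardy.sort_values('Air Date')

for i, row in jeopardy.iterrows():
    split_q = row['clean_question'].split(" ")
    # Keep only words of six or more characters to filter out common filler words
    split_q = [q for q in split_q if len(q) > 5]
    match_count = 0
    for word in split_q:
        if word in terms_used:
            match_count += 1
    for word in split_q:
        terms_used.add(word)
    # Convert the count into a proportion of the question's long words
    if len(split_q) > 0:
        match_count /= len(split_q)
    question_overlap.append(match_count)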

jeopardy['question_overlap'] = question_overlap

mean_overlap = jeopardy['question_overlap'].mean()
print(mean_overlap)

0.6876260592169802


On average, roughly 69% of the longer words (six or more characters) in a question have already appeared in an earlier question. Because this only looks at individual words and not phrases, it doesn't show that whole questions are recycled. Still, it's worth looking into.
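One way to get closer to phrase-level overlap is to compare adjacent word pairs (bigrams) instead of single words. The following is only a sketch on made-up toy questions, not a result from the dataset; `bigrams`, `phrase_overlap`, and `toy_questions` are hypothetical names introduced for illustration.

```python
def bigrams(text):
    """Return the set of adjacent word pairs in a normalized question."""
    words = text.split()
    return set(zip(words, words[1:]))

def phrase_overlap(questions):
    """For each question, in order, the share of its bigrams seen in earlier questions."""
    seen = set()
    overlaps = []
    for q in questions:
        pairs = bigrams(q)
        # Questions with fewer than two words have no bigrams; score them 0
        overlaps.append(len(pairs & seen) / len(pairs) if pairs else 0)
        seen |= pairs
    return overlaps

# Toy data, assumed to be already normalized and sorted by air date
toy_questions = [
    "this president signed the treaty",
    "this president vetoed the bill",
    "the treaty ended the war",
]
print(phrase_overlap(toy_questions))  # [0.0, 0.25, 0.25]
```

On the real data, the same idea could be applied to the `clean_question` column after sorting by `Air Date`.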

## Low Value vs High Value Questions

Let's analyze low-value and high-value questions separately:

- Low value -- any question where `Value` is `800` or less
- High value -- any question where `Value` is greater than `800`

For a random sample of words from `terms_used`, we'll find:

- The number of low-value questions the word occurs in
- The number of high-value questions the word occurs in
- The percentage of all questions the word occurs in
- The expected high-value and low-value counts, based on that percentage
- The chi-squared value based on the expected and observed counts for high- and low-value questions
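
The quantities in this list can be written out as follows. The symbols are introduced here for illustration only: $O_H$ and $O_L$ are the observed numbers of high- and low-value questions containing the word, $N_H$ and $N_L$ are the total numbers of high- and low-value questions, and $N = N_H + N_L$.

$$
E_H = \frac{O_H + O_L}{N}\, N_H, \qquad
E_L = \frac{O_H + O_L}{N}\, N_L, \qquad
\chi^2 = \frac{(O_H - E_H)^2}{E_H} + \frac{(O_L - E_L)^2}{E_L}
$$

With two categories, the statistic has one degree of freedom.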


In [13]:
def determine_value(row):
    value = 0
    if row["clean_values"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [14]:
def count_usage(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_q = row['clean_question'].split(" ")
        if word in split_q:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Test the function with 'president'
count_usage('president')

(68, 181)
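
`count_usage` walks through the whole DataFrame once per word, which becomes slow as the number of terms grows. A possible alternative, sketched here on toy data, tallies every word in a single pass. It assumes the rows are available as `(clean_question, high_value)` pairs, for example from `zip(jeopardy['clean_question'], jeopardy['high_value'])`; `build_usage_counts` and `toy_rows` are hypothetical names.

```python
from collections import Counter

def build_usage_counts(rows):
    """Count, for every word, how many high- and low-value questions contain it.

    rows: iterable of (clean_question, high_value) pairs, with high_value 1 or 0.
    """
    high_counts, low_counts = Counter(), Counter()
    for question, high_value in rows:
        # set() counts a word once per question, matching the
        # `word in split_q` check used by count_usage
        words = set(question.split(" "))
        if high_value == 1:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return high_counts, low_counts

# Made-up rows for illustration only
toy_rows = [
    ("this president signed the treaty", 1),
    ("this president vetoed the bill", 0),
    ("the treaty ended the war", 0),
]
high_counts, low_counts = build_usage_counts(toy_rows)
print(high_counts["president"], low_counts["president"])  # 1 1
print(high_counts["the"], low_counts["the"])              # 1 2
```

Looking up `(high_counts[word], low_counts[word])` then plays the role of `count_usage(word)`, and a `Counter` returns 0 for words it has never seen.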

In [15]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []
for i in comparison_terms:
    observed_expected.append(count_usage(i))

In [16]:
observed_expected

[(0, 1),
 (1, 1),
 (1, 0),
 (1, 0),
 (1, 1),
 (1, 3),
 (1, 2),
 (2, 6),
 (0, 2),
 (0, 1)]

## Applying the chi-squared test

In [17]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []

for i in observed_expected:
    total = sum(i)
    total_prop = total / jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    observed = np.array([i[0],i[1]])
    expected = np.array([expected_high,expected_low])
    
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.05272886616881538, pvalue=0.818381104912348),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

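All p-values are greater than 0.05, so none of these terms shows a statistically significant difference in usage between high- and low-value questions. That is not surprising: each sampled term appears in only a handful of questions, and the chi-squared test is unreliable at such low frequencies. Running the test on terms that occur more often would give more meaningful results.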