## Chi-squared test of significance for question patterns in Jeopardy

In [1]:
import pandas as pd
import numpy as np
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head(5)


| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

#### Cleaning column names and removing punctuation from the Question and Answer columns

In [3]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
       'Question', 'Answer']

In [4]:
import re

def normalize_text(string):
    # Remove every character that is not a word character or whitespace
    string = re.sub(r'[^\w\s]','',string)
    return string

jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
print(jeopardy["clean_question"].head())
# Print the cleaned answers (not the raw "Answer" column) to check the result
print(jeopardy["clean_answer"].head())

0    For the last 8 years of his life Galileo was u...
1    No 2 1912 Olympian football star at Carlisle I...
2    The city of Yuma in this state has a record av...
3    In 1963 live on The Art Linkletter Show this c...
4    Signer of the Dec of Indep framer of the Const...
Name: clean_question, dtype: object
0    Copernicus
1    Jim Thorpe
2       Arizona
3     McDonalds
4    John Adams
Name: clean_answer, dtype: object
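
The `count_matches` step further down removes the lowercase word `"the"`, which only matches text that has been lowercased. As a minimal sketch, a variant of the cleaning function could lowercase the text first. The name `normalize_text_lower` is hypothetical and is not part of the notebook, and the numbers reported in later cells were not computed with it.

```python
import re

def normalize_text_lower(text):
    """Hypothetical variant of normalize_text: lowercase, then strip punctuation."""
    text = text.lower()
    # Same pattern as normalize_text: keep only word characters and whitespace
    return re.sub(r"[^\w\s]", "", text)

print(normalize_text_lower("The Art Linkletter Show!"))  # the art linkletter show
```

With lowercased text, `The` and `the` would be treated as the same word in the later matching steps.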


#### Normalizing Value column

In [5]:
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | For the last 8 years of his life Galileo was u... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | No 2 1912 Olympian football star at Carlisle I... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | In 1963 live on The Art Linkletter Show this c... | McDonalds |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | Signer of the Dec of Indep framer of the Const... | John Adams |


In [6]:
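# Strip thousands separators and dollar signs; regex=False treats '$' as a literal character
jeopardy['Value'] = jeopardy['Value'].str.replace(',', '', regex=False)
jeopardy['Value'] = jeopardy['Value'].str.replace('$', '', regex=False)
valueint = []
for i in jeopardy['Value']:
    try:
        i = int(i)
    except Exception:
        # Values that cannot be parsed as integers are recorded as 0
        i = 0
    valueint.append(i)
se = pd.Series(valueint)
jeopardy['Value'] = se.values
jeopardy['Value'].head()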

0    200
1    200
2    200
3    200
4    200
Name: Value, dtype: int64
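
The same normalization can probably be written without an explicit loop. The sketch below uses `pd.to_numeric` with `errors="coerce"` on a small made-up Series; the sample values, including the unparseable `"None"` entry, are assumptions for illustration and are not taken from the dataset. It should agree with the loop for whole-number strings, but it was not used to produce the output above.

```python
import pandas as pd

# Toy values for illustration only
values = pd.Series(["$200", "$1,000", "None", "$2,000"])

# Remove dollar signs and commas in one pass
cleaned = values.str.replace(r"[$,]", "", regex=True)

# Unparseable entries become NaN, then 0, mirroring the try/except fallback
as_int = pd.to_numeric(cleaned, errors="coerce").fillna(0).astype(int)
print(as_int.tolist())  # [200, 1000, 0, 2000]
```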

#### Normalizing the 'Air Date' column

In [7]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [8]:
jeopardy["Air Date"].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: Air Date, dtype: datetime64[ns]

#### Calculating matching words in answers and questions

In [9]:
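def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    # Drop the first lowercase "the" so the article does not count as a match
    if "the" in split_answer:
        split_answer.remove("the")
    # Guard against dividing by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    # Fraction of answer words that also appear in the question
    return match_count / len(split_answer)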

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [10]:
jeopardy["answer_in_question"].mean()


0.044881387009743423

On average, only about 4.5% of the words in an answer also appear in its question. This is a small share, so we probably can't count on the question itself to give away the answer. We'll probably have to study.
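
To make the metric concrete, here is a small standalone sketch of the same ratio computed on plain strings. The function name `match_fraction` and the example sentence are made up for illustration; the logic is meant to mirror `count_matches` rather than replace it.

```python
def match_fraction(clean_answer, clean_question):
    """Share of answer words (ignoring "the") that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# "Adams" appears in the question, "John" does not: 1 of 2 words match
print(match_fraction("John Adams", "Adams was the second president"))  # 0.5
```

Averaging this ratio over every row gives the figure reported above.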

#### Let's investigate how often new questions are repeats of older ones

In [11]:
# Chronological sorting is left disabled here, so rows are processed in file order
#jeopardy = jeopardy.sort_values(by="Air Date")
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only words longer than 5 characters to skip common short words
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.65706547782404867

#### Question overlap
There is about 66% overlap between terms in new questions and terms in old questions. The check covers only a small set of questions and compares single terms rather than phrases, so the figure is weak evidence on its own, but it does suggest that the recycling of questions is worth a closer look.
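
Because the question is whether newer questions repeat older ones, the order in which rows are visited matters. The sketch below shows one way the overlap could be computed in air-date order while keeping the results aligned with the original row index. The toy DataFrame and the function name `overlap_by_date` are assumptions for illustration; the mean reported above was computed in file order, not with this function.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Air Date": pd.to_datetime(["2005-01-02", "2004-12-31", "2005-01-01"]),
    "clean_question": [
        "This famous astronomer studied planets",
        "Galileo was a famous astronomer",
        "Jupiter and Saturn are giant planets",
    ],
})

def overlap_by_date(df, min_len=6):
    """Fraction of long words in each question already seen in earlier-dated rows."""
    seen = set()
    overlaps = []
    ordered = df.sort_values("Air Date")
    for question in ordered["clean_question"]:
        words = [w for w in question.split(" ") if len(w) >= min_len]
        matches = sum(1 for w in words if w in seen)
        seen.update(words)
        overlaps.append(matches / len(words) if words else 0)
    # Restore the original row order so the result can be assigned as a column
    return pd.Series(overlaps, index=ordered.index).sort_index()

print(overlap_by_date(toy).tolist())  # [0.75, 0.0, 0.0]
```

Only the latest question (row 0) reuses words, because the two earlier questions are processed before it.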

#### Let's analyze questions by value

In [12]:

def determine_value(row):
    value = 0
    if row["Value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [13]:
jeopardy["high_value"].head()

0    0
1    0
2    0
3    0
4    0
Name: high_value, dtype: int64

In [14]:
def val_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count


In [15]:
observed_expected = []
comparison_terms = list(terms_used)[:5]
#comparison_terms

In [16]:
for item in comparison_terms:
    x = val_count(item)
    observed_expected.append(x)
    
observed_expected    

[(1, 2), (0, 1), (1, 1), (1, 0), (0, 1)]
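
`val_count` walks the whole DataFrame once per term, which gets slow as the list of terms grows. A possible alternative, sketched here on toy data, is to tally every word in a single pass with `collections.Counter` and then look terms up. The names `high_counts`, `low_counts` and `val_count_fast` are hypothetical.

```python
from collections import Counter
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "clean_question": ["famous astronomer", "famous painter", "famous famous river"],
    "high_value": [1, 0, 0],
})

high_counts, low_counts = Counter(), Counter()
for question, is_high in zip(toy["clean_question"], toy["high_value"]):
    target = high_counts if is_high == 1 else low_counts
    # set() counts each word at most once per question, as val_count does
    target.update(set(question.split(" ")))

def val_count_fast(word):
    """Return (high_count, low_count) for a word; unseen words give (0, 0)."""
    return high_counts[word], low_counts[word]

print(val_count_fast("famous"))  # (1, 2)
```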

1. Find the number of rows in jeopardy where high_value is 1, and assign to high_value_count.
2. Find the number of rows in jeopardy where high_value is 0, and assign to low_value_count.
3. Create an empty list called chi_squared.
4. Loop through each tuple in observed_expected.
   - Add up both items in the tuple (high and low counts) to get the total count, and assign to total.
   - Divide total by the number of rows in jeopardy to get the proportion across the dataset. Assign to total_prop.
   - Multiply total_prop by high_value_count to get the expected term count for high value rows.
   - Multiply total_prop by low_value_count to get the expected term count for low value rows.
   - Use the scipy.stats.chisquare function to compute the chi-squared value and p-value given the expected and observed counts.
   - Append the results to chi_squared.
5. Look over the chi-squared values and the associated p-values. Are there any statistically significant results? Write up your thoughts in a markdown cell.
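
Written as formulas, the steps above amount to the following. The symbols are notation introduced here for illustration: $O_{\text{high}}$ and $O_{\text{low}}$ are the observed counts for a term, $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high and low value rows, and $N = N_{\text{high}} + N_{\text{low}}$.

$$
E_{\text{high}} = \frac{O_{\text{high}} + O_{\text{low}}}{N}\, N_{\text{high}}, \qquad
E_{\text{low}} = \frac{O_{\text{high}} + O_{\text{low}}}{N}\, N_{\text{low}}, \qquad
\chi^2 = \sum_{k \in \{\text{high},\, \text{low}\}} \frac{(O_k - E_k)^2}{E_k}
$$

As a worked example, take the first term with observed counts $(1, 2)$ and the row counts computed in the next cell, $N_{\text{high}} = 5734$ and $N_{\text{low}} = 14265$:

$$
E_{\text{high}} = \frac{3}{19999} \cdot 5734 \approx 0.860, \qquad
E_{\text{low}} = \frac{3}{19999} \cdot 14265 \approx 2.140, \qquad
\chi^2 \approx \frac{(1 - 0.860)^2}{0.860} + \frac{(2 - 2.140)^2}{2.140} \approx 0.0319
$$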

In [24]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]
print(low_value_count)
print(high_value_count)

14265
5734


In [35]:
from scipy.stats import chisquare
import numpy as np
chi_squared = []
for obs in observed_expected:
    # obs is a (high_count, low_count) tuple for one term
    total = sum(obs)
    # Share of all rows that contain the term
    total_prop = total / jeopardy.shape[0]
    # Expected counts if the term were equally likely in high and low value rows
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared
    

[Power_divergenceResult(statistic=0.031881167234403623, pvalue=0.85828871632352932),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.44487748166127949, pvalue=0.50477764875459963),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686)]

None of the five terms showed a statistically significant difference in usage between high value and low value rows; every p-value is above 0.05. Additionally, every term appeared fewer than 5 times, so the chi-squared approximation is unreliable here. It would be better to rerun this test using only terms that have higher frequencies.
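
One way to act on this, sketched below, is to skip any term whose expected counts fall under a minimum threshold before running the test. The helper name `chi_squared_for_term`, the default threshold of 5, and the counts `(40, 60)` are assumptions for illustration and not results from the dataset; only the row totals 5734 and 14265 come from the cells above.

```python
from scipy.stats import chisquare

def chi_squared_for_term(high_obs, low_obs, high_total, low_total, min_expected=5):
    """Run the test only when both expected counts reach min_expected."""
    total = high_obs + low_obs
    prop = total / (high_total + low_total)
    expected = [prop * high_total, prop * low_total]
    if min(expected) < min_expected:
        return None  # too rare for the chi-squared approximation
    return chisquare([high_obs, low_obs], expected)

# A term seen only 3 times is skipped
print(chi_squared_for_term(1, 2, 5734, 14265))  # None

# Hypothetical counts for a more frequent term
result = chi_squared_for_term(40, 60, 5734, 14265)
print(result.statistic, result.pvalue)
```

Filtering on expected counts rather than observed counts follows the usual rule of thumb for this test, though the exact threshold is a judgment call.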