Chi-Square test

In [24]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv') # reading the 20000 rows from jeopardy dataset

print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [25]:
jeopardy.columns = [i.strip() for i in jeopardy.columns] # removing leading and trailing white space from the column names

jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [26]:
# Normalizing the text columns by lowercasing and removing punctuation
import re
def normalizing_text(text):
    text = text.lower()
    text = re.sub(r'[^A-Za-z0-9\s]', "", text) # Delete every character that is not a letter, a digit, or whitespace
    text = re.sub(r'\s+', " ", text) # Collapse each run of whitespace into a single space
    return text
jeopardy['clean_question'] = jeopardy['Question'].apply(normalizing_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalizing_text)
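
As a quick sanity check, here is a minimal standalone sketch of the same normalization applied to a few strings taken from the rows shown in this notebook. The function is repeated so the snippet runs on its own, and the `samples` list is only an illustration.

```python
import re

def normalizing_text(text):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into one space.
    text = text.lower()
    text = re.sub(r'[^A-Za-z0-9\s]', "", text)
    text = re.sub(r'\s+', " ", text)
    return text

# Illustrative inputs (fragments of values that appear in the dataset preview).
samples = ["McDonald's", 'In 1963, live on "The Art Linkletter Show"', "Jim  Thorpe"]
cleaned = [normalizing_text(s) for s in samples]
print(cleaned)
```

Note that punctuation is deleted rather than replaced by a space, so "McDonald's" becomes "mcdonalds" and not "mcdonald s".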

In [27]:
# Normalizing value column by converting to numeric:
def normalizing_value(value):
    value = re.sub(r'[^A-Za-z0-9\s]', "", value) # strip '$' and ',' so that a string like '$2,000' becomes '2000'
    try:
        value = int(value)
    except Exception:
        value = 0
    return value
jeopardy['clean_value'] = jeopardy['Value'].apply(normalizing_value)
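
A small standalone sketch of how this conversion behaves on a few representative strings. The inputs below are assumptions about what the Value column may contain (a plain dollar amount, an amount with a thousands separator, and a non-numeric placeholder), not values read from the file.

```python
import re

def normalizing_value(value):
    # Remove punctuation such as '$' and ',' and try to parse what is left.
    value = re.sub(r'[^A-Za-z0-9\s]', "", value)
    try:
        value = int(value)
    except Exception:
        # Anything that is not a whole number falls back to 0.
        value = 0
    return value

# Hypothetical inputs chosen to exercise each branch.
converted = [normalizing_value(v) for v in ["$200", "$2,000", "None"]]
print(converted)
```

Falling back to 0 keeps the column numeric, at the cost of treating unparseable values as worth nothing.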

In [14]:
# Converting Air Date to pd.datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [28]:
jeopardy.dtypes # checking the data types

Show Number        int64
Air Date          object
Round             object
Category          object
Value             object
Question          object
Answer            object
clean_question    object
clean_answer      object
clean_value        int64
dtype: object

In [29]:
# In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
# - How often the answer can be deduced from the question (answer_in_question).
# - How often new questions repeat terms from older questions (question_overlap).

def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    if 'the' in split_answer:
        split_answer.remove('the') # drops the first 'the' only; it is common in answers but doesn't contribute to our analysis
    if len(split_answer) == 0:
        return 0
    
    match_count = 0
    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)
jeopardy.head()



Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0


In [30]:
avg_answer_in_question = jeopardy['answer_in_question'].mean()
avg_answer_in_question

0.05900196524977763

On average, only about 6% of the words in an answer also appear in its question, so the question text rarely gives the answer away. That's not very helpful.
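
To make the metric concrete, here is a minimal sketch that runs the same matching logic on two hand-made rows. Plain dictionaries stand in for DataFrame rows, and both rows are made up for illustration.

```python
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    # Ignore one leading/embedded 'the', as in the cell above.
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0

    # Fraction of answer words that also occur in the question.
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Hypothetical rows: one where the answer is spelled out in the question, one where it is not.
row_with_match = {'clean_answer': 'the art linkletter show',
                  'clean_question': 'in 1963 live on the art linkletter show this company'}
row_without_match = {'clean_answer': 'copernicus',
                     'clean_question': 'for the last 8 years of his life galileo was'}

score_match = count_matches(row_with_match)
score_no_match = count_matches(row_without_match)
print(score_match, score_no_match)
```

A score of 1.0 means every answer word appears in the question, and 0.0 means none do, so a mean near 0.06 says that matches are rare.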


In [42]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('Air Date')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

avg_overlap = jeopardy['question_overlap'].mean()
avg_overlap


0.6893853524586494

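On average, about 69% of the longer words (six or more characters) in a question had already appeared in an earlier question. This only measures overlap of single words, not of whole phrases or questions, so it is a rough signal, but it suggests that studying the topics of previously asked questions may be worthwhile.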
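
The running-overlap idea can be sketched on a tiny list of questions. The three strings below are invented and assumed to be already cleaned and in air-date order; the helper name `running_overlap` is hypothetical.

```python
def running_overlap(questions):
    terms_used = set()
    overlaps = []
    for question in questions:
        # Keep only longer words, as in the cell above.
        words = [w for w in question.split(" ") if len(w) > 5]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        # Share of this question's long words that were seen before.
        overlaps.append(match_count / len(words) if words else 0)
    return overlaps

toy_questions = [
    "galileo studied planets",
    "planets circle around",
    "galileo studied around planets",
]
toy_overlap = running_overlap(toy_questions)
print(toy_overlap)
```

The first question scores 0 because nothing has been seen yet, and the last scores 1 because all of its long words occurred earlier. This also shows why the average climbs as the set of seen terms grows.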

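Suppose we only want to study terms that show up in high value questions rather than low value ones, since that would help us earn more money on Jeopardy. To check whether a term is tied to high value questions, we can split the questions into high and low value groups and compare how often the term appears in each.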

In [57]:
# Qualifying values greater than $800 as high value (1), and $800 or less as low value (0):
def greater_than_800(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value
        
jeopardy['high_value'] = jeopardy.apply(greater_than_800, axis=1)


In [58]:
jeopardy['high_value'].value_counts() # there are 14265 low value questions, and 5734 high value questions


0    14265
1     5734
Name: high_value, dtype: int64

In [64]:
# Defining a function to count how many high value and low value questions contain a given word:
def high_low_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
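
The function above scans every row once per word, which gets slow as the list of words grows. One possible alternative, sketched here on plain Python lists instead of the DataFrame, is to tally all words in a single pass with `collections.Counter`. The helper name `count_terms_by_value` and the toy data are assumptions made for illustration.

```python
from collections import Counter

def count_terms_by_value(questions, high_flags):
    high_counts = Counter()
    low_counts = Counter()
    for question, is_high in zip(questions, high_flags):
        # set() so a word repeated inside one question is counted once,
        # matching the "word in split_question" check above.
        words = set(question.split(" "))
        if is_high == 1:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return high_counts, low_counts

# Hypothetical cleaned questions and their high_value flags.
toy_questions = [
    "galileo was under house arrest",
    "this planet was named for a god",
    "galileo observed this planet",
]
toy_flags = [1, 0, 1]

high_counts, low_counts = count_terms_by_value(toy_questions, toy_flags)
print(high_counts["galileo"], low_counts["galileo"])
```

With the counters built once, the (high, low) pair for any word is a dictionary lookup rather than another full scan.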


In [74]:
import random
# Since working with all the words in all the questions would take a long time, let's randomly select 10 words from the terms_used set:
comparison_terms = random.sample(list(terms_used), 10) # random.sample needs a sequence; passing a set raises TypeError on Python 3.11+
comparison_terms

['everyone',
 'nelson',
 'knocking',
 'stairclimbing',
 'compete',
 'practice',
 'alleles',
 'postcentral',
 'hrefhttpwwwjarchivecommedia20091218j19jpg',
 'ebenezer']

In [75]:
observed_expected = []

for word in comparison_terms:
    observed_expected.append(high_low_count(word))
    
observed_expected # list of (high_count, low_count) tuples: how many high and low value questions contain each word

[(3, 11),
 (3, 7),
 (2, 1),
 (0, 1),
 (0, 4),
 (5, 13),
 (0, 1),
 (0, 1),
 (0, 1),
 (1, 1)]

Now that we have found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
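
As a worked example, the calculation for a single term can be done by hand. This sketch uses the first observed pair from the list above, (3, 11), and the high and low value totals from the value_counts() output, and assumes those two totals add up to the number of rows in the dataset.

```python
# Observed counts for one term, taken from the output above: (high, low).
observed_high, observed_low = 3, 11

# Totals from the value_counts() output earlier in the notebook.
high_value_count = 5734
low_value_count = 14265
total_rows = high_value_count + low_value_count

# Expected counts if the term were spread across rows in proportion
# to the number of high and low value questions.
term_total = observed_high + observed_low
expected_high = term_total / total_rows * high_value_count
expected_low = term_total / total_rows * low_value_count

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
chi_squared_stat = (
    (observed_high - expected_high) ** 2 / expected_high
    + (observed_low - expected_low) ** 2 / expected_low
)
print(round(chi_squared_stat, 4))  # roughly 0.359
```

The expected high value count here is only about 4, which already hints that counts this small leave the test with little to work with.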

In [76]:
high_value_count = jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]

In [79]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []

for i in observed_expected:
    total = sum(i)
    total_prop = total/jeopardy.shape[0]
    term_exp_high_value = total_prop * high_value_count
    term_exp_low_value = total_prop * low_value_count
    
    observed = np.array([i[0], i[1]])
    expected = np.array([term_exp_high_value, term_exp_low_value])
    
    chi_squared.append(chisquare(observed, expected))
    
chi_squared
    


[Power_divergenceResult(statistic=0.3591166740081454, pvalue=0.5489972218273612),
 Power_divergenceResult(statistic=0.008630851497838939, pvalue=0.9259811180040979),
 Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.607851384507536, pvalue=0.20479409439225948),
 Power_divergenceResult(statistic=0.007029106963070332, pvalue=0.93318382776185),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996)]

Results:
None of the terms showed a significant difference in usage between high value and low value rows: every p-value is above 0.05. Additionally, the frequencies were all low (most terms appear in fewer than 15 questions), so the chi-squared approximation is unreliable here. It would be better to run this test only on terms with higher frequencies.
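
One way to act on this, sketched below, is to keep only the terms whose expected counts are large enough before running the test. The cutoff of 5 per cell is a common rule of thumb and is an assumption here, as are the helper names `expected_counts` and `frequent_enough`.

```python
def expected_counts(high_count, low_count, n_high, n_low):
    # Expected (high, low) counts if the term were unrelated to question value.
    total = high_count + low_count
    n_rows = n_high + n_low
    return total / n_rows * n_high, total / n_rows * n_low

def frequent_enough(high_count, low_count, n_high, n_low, min_expected=5):
    # Keep a term only if both expected counts reach the cutoff.
    return min(expected_counts(high_count, low_count, n_high, n_low)) >= min_expected

# A few observed (high, low) pairs from the output above,
# with the high and low value totals from value_counts().
observed_pairs = [(3, 11), (3, 7), (2, 1), (5, 13)]
kept_pairs = [pair for pair in observed_pairs
              if frequent_enough(pair[0], pair[1], 5734, 14265)]
print(kept_pairs)
```

Of these four pairs only (5, 13) passes, which shows how few of the randomly sampled terms were frequent enough to test.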