## Pattern finding in the Jeopardy game show

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

The dataset is named jeopardy.csv and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.shape

(19999, 7)

In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have leading spaces, so we strip them, replace any remaining spaces with underscores, and lowercase the names.

In [4]:
colnames = jeopardy.columns.str.strip().str.replace(' ', '_').str.lower()
colnames

Index(['show_number', 'air_date', 'round', 'category', 'value', 'question',
       'answer'],
      dtype='object')

In [5]:
jeopardy.columns = colnames
jeopardy.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


### Data Cleaning

In [6]:
import re
def normalize_string(value):
    # Lowercase the text and replace every non-word character with a space
    value = value.lower()
    value = re.sub(r"\W", " ", value)
    return value

In [7]:
jeopardy['clean_question'] = jeopardy['question'].apply(normalize_string)
jeopardy['clean_answer'] = jeopardy['answer'].apply(normalize_string)

Convert the value column to a numeric format, treating missing values ("None") as 0.

In [8]:
jeopardy['value'] = jeopardy['value'].str.replace('$', '', regex=False)\
                .str.replace(",", "", regex=False)
jeopardy.loc[jeopardy['value'] == "None", 'value'] = 0 #replace None value by 0
jeopardy['clean_value'] = jeopardy['value'].astype('float64')

In [9]:
jeopardy['clean_value'].dtypes

dtype('float64')
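The same conversion can be sketched for a single string, which makes the handling of the dollar sign, the thousands separator, and the "None" placeholder easier to check. The helper name `parse_value` is hypothetical and is not part of the notebook.

```python
def parse_value(raw):
    """Turn a raw Jeopardy value such as "$2,000" into a float.

    Hypothetical helper for illustration; "None" is mapped to 0.0,
    mirroring the replacement done on the value column.
    """
    if raw.strip() == "None":
        return 0.0
    # Remove the currency symbol and the thousands separator
    return float(raw.replace("$", "").replace(",", ""))


samples = ["$200", "$2,000", "None"]
parsed = [parse_value(s) for s in samples]
```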

In [10]:
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])

In [11]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
show_number       19999 non-null int64
air_date          19999 non-null datetime64[ns]
round             19999 non-null object
category          19999 non-null object
value             19999 non-null object
question          19999 non-null object
answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(7)
memory usage: 1.5+ MB


In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

In [12]:
def ans_in_que(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
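    # Drop "the" so this common word does not count as a match
    # (list.remove only drops the first occurrence)
    if 'the' in split_answer:
        split_answer.remove('the')

    if 'the' in split_question:
        split_question.remove('the')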
    if len(split_answer) == 0:
        return 0
    else:
        for answer in split_answer:
            if answer in split_question:
                match_count += 1
        return match_count/len(split_answer)

In [13]:
jeopardy['answer_in_question'] = jeopardy.apply(ans_in_que, axis = 1)

In [14]:
jeopardy.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_question,clean_answer,clean_value,answer_in_question
0,4680,2004-12-31,Jeopardy!,HISTORY,200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus,200.0,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe,200.0,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200.0,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s,200.0,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams,200.0,0.0


In [15]:
jeopardy['answer_in_question'].mean()

0.062493576033003005

On average, only about 6% of the words in an answer also appear in its question, so the question text rarely gives the answer away.
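To make the ratio computed by ans_in_que concrete, here is a minimal sketch on a single made-up question and answer. The function name `answer_overlap` and the sample text are assumptions for illustration, and unlike the notebook version this variant drops every occurrence of "the" rather than only the first.

```python
import re


def normalize_string(value):
    # Lowercase and replace non-word characters with spaces
    return re.sub(r"\W", " ", value.lower())


def answer_overlap(question, answer):
    """Fraction of answer words (ignoring "the") that occur in the question."""
    q_words = {w for w in normalize_string(question).split() if w != "the"}
    a_words = [w for w in normalize_string(answer).split() if w != "the"]
    if not a_words:
        return 0.0
    return sum(w in q_words for w in a_words) / len(a_words)


# Made-up example: "adams" appears in the question, "john" does not
toy_ratio = answer_overlap("This Adams was the second U.S. president", "John Adams")
```

Here one of the two answer words appears in the question, so the ratio is 0.5.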

In [16]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values(by = 'air_date')
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap']  = question_overlap

jeopardy['question_overlap'].mean()
                
            

0.7197989717809739

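So on average, about 72% of the words longer than five characters in a question have already appeared in an earlier question. This only measures overlap of single terms, not of whole questions, so it suggests rather than proves that questions are recycled.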
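The overlap calculation can be traced on a few made-up questions. This is a sketch under the same rules as the loop above (words of six or more characters, compared against everything seen in earlier questions); the function name `question_overlaps` and the sample questions are hypothetical.

```python
def question_overlaps(questions, min_len=6):
    """Per-question share of long words already seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        matched = sum(w in seen for w in words)
        # Add the words only after counting, so a question
        # never matches against itself
        seen.update(words)
        overlaps.append(matched / len(words) if words else 0)
    return overlaps


toy_questions = [
    "this planet is closest to the sun",       # planet, closest: nothing seen yet
    "the largest planet in our solar system",  # largest, planet, system: 1 of 3 seen
    "a tiny word quiz",                        # no long words at all
]
toy_overlaps = question_overlaps(toy_questions)
```

The questions must be processed in chronological order for "earlier" to be meaningful, which is why the notebook sorts by air_date first.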

To check whether some terms are tied to high-value questions, split the rows into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.
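As a rough sketch, the same flag can also be produced without a row-wise apply by comparing the whole column at once. The small `toy` frame below is made-up data, not the Jeopardy dataset.

```python
import pandas as pd

# Made-up values, including the boundary case of exactly 800
toy = pd.DataFrame({"clean_value": [200.0, 800.0, 1000.0, 0.0, 2000.0]})

# The comparison yields booleans; astype(int) turns them into 1 / 0
toy["high_value"] = (toy["clean_value"] > 800).astype(int)
```

A value of exactly 800 ends up in the low-value group, because the comparison is strict.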

In [17]:
def value_que(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [18]:
jeopardy['high_value'] = jeopardy.apply(value_que, axis = 1)

In [19]:
def freq(word):
    low_count = 0
    high_count =0
    for _, row in jeopardy.iterrows():
        if word in row['clean_question'].split(" "):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count            

In [20]:
from random import choice

comparison_terms =  [choice(list(terms_used)) for _ in range(10)]
comparison_terms

['moreno',
 'strokes',
 'zygomatic',
 '07_j_03',
 'additions',
 'demise',
 'decays',
 'cheney',
 'praline',
 'dumfries']

In [21]:
observed_expected = []
for term in comparison_terms:
    high_value, low_value = freq(term)
    observed_expected.append((high_value, low_value))

In [23]:
high_value_count = sum(jeopardy['high_value'] == 1)
low_value_count = sum(jeopardy['high_value'] == 0)
chi_squared = []
from scipy.stats import chisquare
import numpy as np
for tup in observed_expected:
    total = tup[0] + tup[1]
    total_prop = total / jeopardy.shape[0]
    high_value = total_prop * high_value_count
    low_value = total_prop * low_value_count
    
    observed = np.array([tup[0], tup[1]])
    expected = np.array([high_value, low_value])
    
    chi_squared.append(chisquare(observed, expected))

In [24]:
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.3137668167849311, pvalue=0.5753778622944691),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

None of the terms showed a significant difference in usage between high-value and low-value rows (every p-value is well above 0.05). Additionally, the frequencies were all lower than 5, so the chi-squared test is not reliable here. It would be better to run this test with only terms that have higher frequencies.
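One way to follow up on this is sketched below: count every term once per question, skip terms whose expected counts fall under 5, and compute the chi-squared statistic by hand. The function name `chi_squared_for_terms`, the `min_expected` threshold parameter, and the toy rows are assumptions for illustration. The p-value uses the identity for one degree of freedom, p = erfc(sqrt(stat / 2)), which is the case here because there are two categories.

```python
import math
from collections import Counter


def chi_squared_for_terms(rows, min_expected=5):
    """rows: iterable of (clean_question, high_value) pairs, high_value in {0, 1}."""
    rows = list(rows)
    n_total = len(rows)
    n_high = sum(high for _, high in rows)
    n_low = n_total - n_high

    high_counts, low_counts = Counter(), Counter()
    for question, high in rows:
        # set() counts a term once per question, like the `in` check in freq()
        for word in set(question.split()):
            (high_counts if high else low_counts)[word] += 1

    results = {}
    for word in set(high_counts) | set(low_counts):
        observed = (high_counts[word], low_counts[word])
        total = sum(observed)
        expected = (total * n_high / n_total, total * n_low / n_total)
        if min(expected) < min_expected:
            continue  # too rare for the chi-squared approximation
        stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
        results[word] = (stat, math.erfc(math.sqrt(stat / 2)))
    return results


# Made-up rows: "capital" appears in 16 high-value and 4 low-value questions
toy_rows = (
    [("capital city", 1)] * 16
    + [("river rare", 1)] * 4
    + [("capital gains", 0)] * 4
    + [("mountain range", 0)] * 16
)
toy_results = chi_squared_for_terms(toy_rows)
```

In the toy data, "capital" has expected counts of 10 and 10 against observed counts of 16 and 4, giving a statistic of 7.2, while "river" is skipped because its expected counts are only 2 and 2.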