# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for a few decades and is a major force in popular culture. This project works through a dataset of Jeopardy questions, looking for patterns in the questions that could help someone win. Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:
- `Show Number` -- the Jeopardy episode number of the show this question was in.
- `Air Date` -- the date the episode aired.
- `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- `Category` -- the category of the question.
- `Value` -- the number of dollars answering the question correctly is worth.
- `Question` -- the text of the question.
- `Answer` -- the text of the answer.

In [43]:
import pandas as pd
import numpy as np
import re
from scipy.stats import chisquare

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')

### Exploring the Data

In [3]:
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
       'Question', 'Answer']

In [6]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
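
The rename above retypes every column name by hand. As a minimal sketch of an alternative, the leading spaces can be stripped programmatically. The list below copies the names printed by `jeopardy.columns`, and `raw_columns` and `clean_columns` are illustrative names that do not appear in the notebook.

```python
# Column names as printed by `jeopardy.columns` before the manual rename.
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category', ' Value',
               ' Question', ' Answer']

# Strip surrounding whitespace from every name instead of retyping the list.
clean_columns = [name.strip() for name in raw_columns]
print(clean_columns)
```

Assigning the result back with `jeopardy.columns = clean_columns` would have the same effect as the manual rename.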

### Normalization

In [7]:
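def normalize(txt):
    # Drop everything except letters, digits, and whitespace, then lowercase.
    # The raw string keeps "\s" from being read as an invalid string escape.
    txt = re.sub(r"[^A-Za-z0-9\s]", "", txt)
    return txt.lower()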

In [8]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)

In [9]:
jeopardy['clean_question'].head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [10]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)

In [11]:
jeopardy['clean_answer'].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
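
To make the effect of `normalize` concrete, here is a small self-contained check on two strings taken from the rows shown earlier. The `samples` and `cleaned` names are introduced only for this illustration.

```python
import re

def normalize(txt):
    # Same logic as the notebook: keep letters, digits, whitespace; lowercase.
    txt = re.sub(r"[^A-Za-z0-9\s]", "", txt)
    return txt.lower()

samples = ["McDonald's", "No. 2: 1912 Olympian; football star"]
cleaned = [normalize(s) for s in samples]
print(cleaned)
```

Punctuation is deleted rather than replaced by a space, which is why `McDonald's` becomes the single token `mcdonalds`.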

In [12]:
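def normalize_val(txt):
    # Strip "$", "," and other punctuation so "$200" becomes "200".
    txt = re.sub(r"[^A-Za-z0-9\s]", "", txt)
    try:
        txt = int(txt)
    except ValueError:
        # Anything that is not a plain number is treated as 0.
        txt = 0
    return txt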

In [13]:
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_val)

In [14]:
jeopardy["clean_value"].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64
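
The first five rows all hold `$200`, so they do not exercise the fallback branch of `normalize_val`. The sketch below runs the same logic on a few hand-written inputs. These strings are assumptions chosen for illustration, not values confirmed to occur in the dataset.

```python
import re

def normalize_val(txt):
    # Remove punctuation such as "$" and ",", then try to parse an integer.
    txt = re.sub(r"[^A-Za-z0-9\s]", "", txt)
    try:
        return int(txt)
    except ValueError:
        # Non-numeric text falls back to 0.
        return 0

# Hypothetical inputs: a plain value, one with a thousands separator,
# and a non-numeric placeholder.
examples = ["$200", "$2,000", "None"]
parsed = [normalize_val(v) for v in examples]
print(parsed)
```

One consequence of this design is that a non-numeric value and a genuine value of 0 become indistinguishable in `clean_value`.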

In [15]:
jeopardy["Air Date"].head()

0    2004-12-31
1    2004-12-31
2    2004-12-31
3    2004-12-31
4    2004-12-31
Name: Air Date, dtype: object

In [16]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [17]:
jeopardy["Air Date"].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: Air Date, dtype: datetime64[ns]

### Analyzing the Data

In [18]:
def q_matches_a(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    
    if "the" in split_answer:
        split_answer.remove("the")
        
    if len(split_answer) == 0:
        return 0
    
    match_count = 0
    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [19]:
jeopardy["answer_in_question"] = jeopardy.apply(q_matches_a, axis=1)

In [22]:
jeopardy["answer_in_question"].head(20)

0     0.000000
1     0.000000
2     0.000000
3     0.000000
4     0.000000
5     0.000000
6     0.000000
7     0.000000
8     0.000000
9     0.333333
10    0.000000
11    0.000000
12    0.000000
13    0.000000
14    0.500000
15    0.000000
16    0.000000
17    0.000000
18    0.000000
19    0.000000
Name: answer_in_question, dtype: float64

In [23]:
jeopardy['answer_in_question'].mean()

0.060493257069335914

**On average, only about 6% of the words in an answer also appear in its question. This suggests that just hearing the question wouldn't be very helpful in finding the answer!**
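
As a small illustration of how the `answer_in_question` score is computed, the sketch below applies the same steps to plain strings instead of DataFrame rows. The function name `answer_fraction` and the example question are made up for this illustration.

```python
def answer_fraction(clean_question, clean_answer):
    # Mirrors q_matches_a, but takes two already-normalized strings.
    split_answer = clean_answer.split(" ")
    split_question = clean_question.split(" ")

    # "the" carries no information, so one occurrence is dropped.
    if "the" in split_answer:
        split_answer.remove("the")

    if len(split_answer) == 0:
        return 0

    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Hypothetical question text; one of the two answer words appears in it.
question = "this state capital is named for president james madison"
half = answer_fraction(question, "madison wisconsin")
none = answer_fraction(question, "the")
print(half, none)
```

The score is the fraction of answer words found in the question, so a value of 0.5 means half of the answer's words were present.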

In [26]:
question_overlap = []
terms_used = set()

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [j for j in split_question if len(j) > 5]
    
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
        
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
        
    question_overlap.append(match_count)

In [28]:
jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].head(20)

0     0.000000
1     0.000000
2     0.000000
3     0.000000
4     0.000000
5     0.000000
6     0.000000
7     0.000000
8     0.125000
9     0.000000
10    0.000000
11    0.000000
12    0.200000
13    0.142857
14    0.000000
15    0.000000
16    0.000000
17    0.000000
18    0.250000
19    0.000000
Name: question_overlap, dtype: float64

In [29]:
jeopardy["question_overlap"].mean()

0.6908737315671878

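**On average, about 69% of the words longer than five characters in a question have already appeared in an earlier question. This compares single words rather than whole phrases, so it is only a rough signal, but it may be helpful to spend some time looking into the repeating terms.**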
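
To show how the overlap score builds up from one question to the next, here is a minimal sketch of the same loop run on three invented questions. The question texts and the `toy_questions` and `overlaps` names are assumptions for illustration only.

```python
# Three invented, already-normalized questions, in "air" order.
toy_questions = [
    "this italian astronomer built an early telescope",
    "the telescope helped this astronomer study jupiter",
    "jupiter is the largest one in our solar system",
]

terms_used = set()
overlaps = []

for question in toy_questions:
    # Keep only words longer than five characters, as in the notebook.
    words = [w for w in question.split(" ") if len(w) > 5]

    # Count words already seen in earlier questions, then record the new ones.
    match_count = sum(1 for w in words if w in terms_used)
    terms_used.update(words)

    overlaps.append(match_count / len(words) if words else 0)

print(overlaps)
```

The first question always scores 0 because nothing has been seen yet, and scores tend to rise as `terms_used` grows, so the mean depends on row order and dataset size.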

In [30]:
def high_low_value(row):
    if row['clean_value'] > 800:
        return 1
    else: 
        return 0

In [31]:
jeopardy['high_value'] = jeopardy.apply(high_low_value, axis=1)

In [35]:
jeopardy['high_value']

0        0
1        0
2        0
3        0
4        0
5        0
6        0
7        0
8        0
9        0
10       0
11       0
12       0
13       0
14       0
15       0
16       0
17       0
18       0
19       0
20       0
21       0
22       1
23       0
24       1
25       1
26       1
27       1
28       1
29       0
        ..
19969    1
19970    1
19971    1
19972    1
19973    1
19974    1
19975    1
19976    1
19977    1
19978    1
19979    1
19980    1
19981    1
19982    1
19983    1
19984    1
19985    1
19986    1
19987    0
19988    0
19989    0
19990    0
19991    0
19992    0
19993    0
19994    0
19995    0
19996    0
19997    0
19998    0
Name: high_value, Length: 19999, dtype: int64

In [36]:
def value_counts(word):
    low_count = 0 
    high_count = 0 
    for i, row in jeopardy.iterrows():
        words = row['clean_question'].split(" ")
        if word in words:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [38]:
comparison_terms = list(terms_used)[:5]

observed_expected = []
for each in comparison_terms:
    observed_expected.append(value_counts(each))

observed_expected

[(1, 0), (3, 3), (0, 1), (0, 1), (0, 1)]

In [40]:
high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
high_value_count

5734

In [41]:
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])
low_value_count

14265

In [44]:
chi_squared = []

for i in observed_expected:
    total = i[0] + i[1]
    total_prop = total / len(jeopardy)
    high_count_expected = total_prop * high_value_count
    low_value_expected = total_prop * low_value_count
    observed = np.array([i[0], i[1]])
    expected = np.array([high_count_expected, low_value_expected])
    chi_squared.append(chisquare(observed,expected))
    
chi_squared

[Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=1.3346324449838385, pvalue=0.24798277007881564),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686)]

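**Based on the chi-squared statistics and p-values, none of the five sampled terms shows a statistically significant difference in usage between high value and low value rows. These terms appear in very few questions, and the chi-squared test is less reliable at such low frequencies, so terms with higher counts would be needed for a meaningful comparison.**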
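
To make the numbers above easier to check, the sketch below recomputes the chi-squared statistic by hand for a term that appears once in a low value question, using the row counts printed earlier (5734 and 14265). It is a hand calculation for illustration, not a replacement for `scipy.stats.chisquare`; the function name `chi_squared_for_term` is made up here, and the p-value uses the closed form that holds only for one degree of freedom.

```python
import math

high_value_count = 5734    # rows with clean_value > 800 (from the output above)
low_value_count = 14265    # all remaining rows
total_rows = high_value_count + low_value_count

def chi_squared_for_term(high_observed, low_observed):
    # Expected counts if the term were spread evenly across all rows.
    total = high_observed + low_observed
    total_prop = total / total_rows
    expected = [total_prop * high_value_count, total_prop * low_value_count]
    observed = [high_observed, low_observed]

    # Pearson statistic: sum of (observed - expected)^2 / expected.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Two categories give 1 degree of freedom, for which the chi-squared
    # survival function reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, expected

statistic, p_value, expected = chi_squared_for_term(0, 1)
print(statistic, p_value, expected)
```

For this term both expected counts are below 1, which illustrates why the test has little power on words that occur only once.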