# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we will work with a dataset of Jeopardy questions to find patterns in the questions that could help us win.

## Define our hypothesis

Before we start the analysis, let's define our null and alternate hypotheses:

   - null hypothesis - We cannot predict questions based on past questions. Any relationship between questions asked in newer episodes and questions asked in past episodes is just a random occurrence.
   - alternate hypothesis - There is a strong relationship between newer questions and those asked in older episodes. By analysing newer questions against older questions, we can recognize a pattern that enables us to predict questions better.

### Read data and do basic exploration

The dataset is named jeopardy.csv, and contains roughly 20,000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

Let's start off by reading the dataset into a pandas DataFrame.

In [1]:
import pandas as pd

jeopardy = pd.read_csv("jeopardy.csv")

print("dataset size = {}".format(jeopardy.shape))
jeopardy.head()

dataset size = (19999, 7)


| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

  - **`Show Number`** -- the Jeopardy episode number of the show this question was in.
  - **`Air Date`** -- the date the episode aired.
  - **`Round`** -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
  - **`Category`** -- the category of the question.
  - **`Value`** -- the number of dollars answering the question correctly is worth.
  - **`Question`** -- the text of the question.
  - **`Answer`** -- the text of the answer.

In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have leading spaces. Let's fix that.

In [3]:
# Strip leading and trailing whitespace from every column name
jeopardy.columns = jeopardy.columns.str.strip()

In [4]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

The extra leading spaces have been stripped from all column names.

## Data normalization

Before we go any further, we want to make sure the column text and values are normalized, i.e., comparable. We can do this as follows:

   - for both **`Question`** and **`Answer`**
       - remove punctuation
       - convert text to lower case
   - The **`Value`** column should be numeric so it is easier to manipulate. We need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.
   - The **`Air Date`** column should be a datetime, not a string, so it is easier to work with.

To accomplish the above, we will write a couple of helper functions.

In [5]:
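import re

def normalize_text(text):
    # Lowercase, then drop everything except letters, digits and whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_values(text):
    # Strip the dollar sign and thousands separators, then convert to int
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Entries such as "None" cannot be parsed; treat them as 0
        text = 0
    return text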

In [6]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
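The same normalization can also be expressed with pandas' vectorized string methods instead of `apply`. The following is a minimal sketch on a small hypothetical frame (`sample` and its rows are made up for illustration), not a replacement for the cells above. It assumes missing values appear as the string "None".

```python
import pandas as pd

# Hypothetical rows, for illustration only.
sample = pd.DataFrame({
    "Question": ["For the last 8 years of his life, Galileo...", "No. 2: 1912 Olympian"],
    "Value": ["$2,000", "None"],
})

# Lowercase, then drop everything except letters, digits and whitespace.
sample["clean_question"] = (
    sample["Question"].str.lower().str.replace(r"[^a-z0-9\s]", "", regex=True)
)

# Keep only digits, convert to numbers, and map unparseable entries to 0.
digits = sample["Value"].str.replace(r"[^0-9]", "", regex=True)
sample["clean_value"] = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

print(sample[["clean_question", "clean_value"]])
```

On a dataset of this size either approach is fast enough, so the choice is mostly a matter of readability.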

In [7]:
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |


In [8]:
jeopardy["clean_value"].value_counts(dropna=False)[0]

336

In [9]:
jeopardy["Value"].value_counts(dropna=False)["None"]

336

Only 336 rows have a value of zero, and the same number of rows have the value "None" in the original column, so the zeros come from those entries. We are good to go.

Now let's turn to **`Air Date`**, which should be a datetime column rather than a string.
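Equal totals suggest, but do not strictly prove, that the same rows are involved. A slightly stricter check compares the two conditions row by row. The sketch below uses a small hypothetical frame (`check` and its values are made up); on the real data the same comparison would run against `jeopardy`.

```python
import pandas as pd

# Hypothetical values, for illustration only.
check = pd.DataFrame({
    "Value": ["$200", "None", "$1,000", "None"],
    "clean_value": [200, 0, 1000, 0],
})

zero_rows = check["clean_value"] == 0
none_rows = check["Value"] == "None"

# True only if every zero comes from a "None" entry and vice versa.
same_rows = bool((zero_rows == none_rows).all())
print(same_rows)
```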


In [10]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [11]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

In [12]:
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |


It looks like all our value transformations were successful. We can now move on to analyzing the data.

## Analysis of past questions

### Do the questions contain the answers themselves?

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

   - How often the answer is deducible from the question.
   - How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

We will build a function that will:
   - split both the question and the answer into words
   - remove 'the' from the answer, as it is not useful for matching
   - compute the fraction of answer words that appear in the question

We can run this function on every row and save the result in a separate column. That tells us how much of each answer is already present in the question itself.



In [13]:
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [14]:
print("Answer in question = {0:.2f}% ".format(jeopardy["answer_in_question"].mean() * 100))


Answer in question = 6.05% 


On average, only about 6% of the words in an answer also appear in the question. That is too little to rely on, so we cannot expect to answer a question just by listening to it carefully.

### Do questions repeat often? If yes, how often?

Let's say we want to investigate how often new questions are repeats of older ones. Unfortunately, we cannot answer this completely, because we only have a small subset of all the Jeopardy questions ever asked. Our dataset is around 10% of the full question set, but we can treat it as a representative sample and see what insight it gives us.

To answer this question, we can follow this algorithm:

   - Sort **`jeopardy`** in order of ascending air date. This way, for any question, every earlier question has already been processed.
   - Maintain a collection called **`terms_used`** that is empty initially.
   - Iterate through each row of jeopardy.
   - Split **`clean_question`** into words, remove any word shorter than 6 characters, and check if each word occurs in **`terms_used`**.
       - If it does, increment a counter.
       - Add each word to **`terms_used`**.
       
This lets us check whether the terms in a question have been used previously. Looking only at words of 6 or more characters filters out common words like "the" and "than", which don't tell us much about a question.
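To make that figure concrete, here is a minimal sketch of the per-row score on a few hypothetical, already-normalized rows. The function name `answer_word_fraction` and the example strings are made up for illustration; the logic follows the same steps as the matching function above.

```python
def answer_word_fraction(clean_question, clean_answer):
    """Fraction of answer words (ignoring one 'the') that also appear in the question."""
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Hypothetical rows, for illustration only.
full = answer_word_fraction("this state is home to new york city", "new york")
half = answer_word_fraction("john hancock signed this first", "john adams")
none = answer_word_fraction("this band from liverpool sang yesterday", "the beatles")
print(full, half, none)
```

The reported percentage is the mean of these per-row fractions, so it reflects partial matches as well as full ones.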
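The steps above can be traced on a toy example. The sketch below is a simplified, self-contained version that works on a plain list of hypothetical questions, assumed to be already sorted by air date. The name `question_overlaps` and the `min_length` parameter are illustrative assumptions.

```python
def question_overlaps(clean_questions, min_length=6):
    """For each question, in order, return the share of its long words seen in earlier questions."""
    terms_seen = set()
    overlaps = []
    for question in clean_questions:
        # Keep only words with at least min_length characters.
        words = [w for w in question.split(" ") if len(w) >= min_length]
        if words:
            overlaps.append(sum(w in terms_seen for w in words) / len(words))
        else:
            overlaps.append(0)
        # Only now record this question's words, so a question never matches itself.
        terms_seen.update(words)
    return overlaps

# Hypothetical questions, for illustration only.
toy = [
    "this planet is closest to the sun",
    "the largest planet in our solar system",
    "galileo observed this planet through a telescope",
]
overlaps = question_overlaps(toy)
print(overlaps)
```

The first question always scores 0 because nothing has been seen yet. Later questions score higher as the vocabulary grows, which is one reason a high average overlap should be read with caution.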



In [15]:
# Every qualifying word seen so far, repeats included (reused later for frequency counts)
terms_used = []
terms_used_unique = []


def find_mean_question_overlap(jeopardy):
    question_overlap = []
    # Work on a copy sorted by air date so earlier questions are processed first
    jeopardy = jeopardy.sort_values("Air Date")
    # A set gives fast membership checks; the list above keeps the repeats
    terms_seen = set(terms_used)

    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        # Keep only words of 6 or more characters
        split_question = [q for q in split_question if len(q) > 5]
        match_count = 0
        for word in split_question:
            if word in terms_seen:
                match_count += 1
        for word in split_question:
            terms_used.append(word)
            terms_seen.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
        question_overlap.append(match_count)
    jeopardy["question_overlap"] = question_overlap
    return jeopardy["question_overlap"].mean()

overlap_mean = find_mean_question_overlap(jeopardy)
print("Question overlap = {0:.2f}% ".format(overlap_mean * 100))


Question overlap = 68.76% 


In [16]:
terms_used_unique = list(set(terms_used))
terms_used_unique

['hrefhttpwwwjarchivecommedia20110718j10jpg',
 'consistent',
 'leinart',
 'evertuned',
 'shopkeepers',
 'impoverished',
 'csonkaa',
 'easing',
 'mercurys',
 'typical',
 'socially',
 'misnamed',
 'courtney',
 'amount',
 'jazair',
 'boutiques',
 'pianoplaying',
 'benson',
 'refused',
 'associations',
 'footwear',
 'sentinels',
 'hrefhttpwwwjarchivecommedia20100319dj26jpg',
 'following',
 'fingershaped',
 'langera',
 'soulsia',
 'coeducational',
 '54yearold',
 'saxons',
 'kellya',
 'interracial',
 'favors',
 'antiperspirant',
 'ibadan',
 'elements',
 'frenchman',
 'jellyfish',
 'fledgling',
 'exalted',
 'porters',
 'bribed',
 'approving',
 'drowned',
 'mazarin',
 'targetblankcharles',
 'indiana',
 'fighter',
 'twinkle',
 'bottoms',
 'gigolo',
 'purify',
 'hrefhttpwwwjarchivecommedia20111103dj14jpg',
 'friday',
 'battles',
 'action',
 'drunkenness',
 'dwellings',
 'eliminated',
 'phoenix',
 'devitos',
 'altitudes',
 'hopalong',
 'promontory',
 'sleeket',
 'nowsmallest',
 'aucklandarea',
 ...]

### Value analysis

There is close to 70% overlap between the words in new questions and the words in older questions. Because we only compare single words, not phrases or whole questions, it is hard to say whether this is significant. Still, the percentage is high enough to justify further analysis.

We want to restrict our analysis to high-value questions, since those win more money on the show. We will treat any amount above $800 as high value. With that categorization in place, we can use a chi-squared test to check which terms are associated with high-value questions.

First, let's split the questions into two categories:

   - Low value -- Any row where Value is 800 or less.
   - High value -- Any row where Value is greater than 800.

In [17]:
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)


Now we will be able to loop through each of the terms in **`terms_used`**, and:

   - Find the number of low value questions the word occurs in.
   - Find the number of high value questions the word occurs in.
   - Find the percentage of questions the word occurs in.
   - Based on the percentage of questions the word occurs in, find expected counts.
   - Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.
   
We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
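Much of that cost comes from scanning every row once per term. One way to reduce it, sketched here under assumptions, is to make a single pass over the rows and tally each distinct word per question. The frame `toy` and the helper `usage` are hypothetical; the column names match the ones used in this notebook.

```python
from collections import Counter

import pandas as pd

# Hypothetical rows standing in for the real columns.
toy = pd.DataFrame({
    "clean_question": ["this planet has rings", "the red planet", "rings of saturn"],
    "high_value": [1, 0, 0],
})

high_counts = Counter()
low_counts = Counter()

# One pass over the rows: each question contributes at most once per distinct word.
for question, is_high in zip(toy["clean_question"], toy["high_value"]):
    target = high_counts if is_high == 1 else low_counts
    target.update(set(question.split(" ")))

def usage(term):
    """Return (high_count, low_count) for a term; unseen terms give (0, 0)."""
    return high_counts[term], low_counts[term]

print(usage("planet"), usage("saturn"))
```

After the single pass, looking up any term is a dictionary access, so checking thousands of terms no longer requires thousands of scans.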

In [18]:
def count_usage(jeopardy, term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count


def find_observed(jeopardy, comparison_terms):
    print(comparison_terms)
    observed = []
    for term in comparison_terms:
        observed.append(count_usage(jeopardy, term))
    return observed



### Chi-squared and p values

Now that we have found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.
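As a worked example with made-up numbers, suppose a term appears in 30 high-value and 20 low-value questions, in a dataset with 6,000 high-value and 14,000 low-value rows. All figures below are hypothetical and only illustrate the arithmetic.

```python
from scipy.stats import chisquare

# Hypothetical numbers, for illustration only.
total_rows = 20000
high_value_rows = 6000
low_value_rows = 14000
observed = [30, 20]  # (high, low) questions containing the term

# Expected counts if the term were unrelated to question value.
term_share = sum(observed) / total_rows
expected = [term_share * high_value_rows, term_share * low_value_rows]

# Chi-squared by hand: sum of (observed - expected)^2 / expected.
manual = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The same statistic, plus a p-value, from SciPy.
result = chisquare(observed, expected)
print(manual, result.statistic, result.pvalue)
```

Here the expected counts are about 15 and 35, so a term observed 30 times among high-value questions departs clearly from what an even spread would give.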



In [19]:
import numpy as np
from scipy.stats import chisquare


def calculate_chi_squared(jeopardy, observed):
    high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
    low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

    chi_squared = []
    for obs in observed:
        # Share of all questions that contain the term
        total = sum(obs)
        total_prop = total / jeopardy.shape[0]
        # Expected counts if the term were unrelated to question value
        high_value_exp = total_prop * high_value_count
        low_value_exp = total_prop * low_value_count

        obs = np.array([obs[0], obs[1]])
        exp = np.array([high_value_exp, low_value_exp])
        chi_squared.append(chisquare(obs, exp))

    return chi_squared

In [20]:
def find_and_print_chi_squared(jeopardy, terms):
    observed = find_observed(jeopardy, terms)

    # Now that we have found the observed counts for a few terms,
    # we can compute the expected counts and the chi-squared value.
    chi_squared = calculate_chi_squared(jeopardy, observed)
    for chi_sq in chi_squared:
        print("statistic = {0:.2f} p value = {1:.2f}%".format(
            chi_sq[0], chi_sq[1]*100))




In [None]:
words_to_check = 10
find_and_print_chi_squared(jeopardy, terms_used_unique[:words_to_check] )

['hrefhttpwwwjarchivecommedia20110718j10jpg', 'consistent', 'leinart', 'evertuned', 'shopkeepers', 'impoverished', 'csonkaa', 'easing', 'mercurys', 'typical']


### Chi-squared results

None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.
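A common rule of thumb is that every expected count should be at least 5. The sketch below shows how that condition could be checked before running the test. The helper names and the 6,000 / 14,000 split are hypothetical.

```python
def expected_counts(term_total, high_rows, low_rows):
    """Expected (high, low) counts for a term that appears in term_total questions."""
    share = term_total / (high_rows + low_rows)
    return share * high_rows, share * low_rows

def large_enough(term_total, high_rows, low_rows, minimum=5):
    """True when both expected counts reach the rule-of-thumb minimum."""
    return min(expected_counts(term_total, high_rows, low_rows)) >= minimum

# Hypothetical split of 6000 high-value and 14000 low-value rows.
rare = large_enough(3, 6000, 14000)      # term seen in 3 questions
common = large_enough(100, 6000, 14000)  # term seen in 100 questions
print(rare, common)
```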

### Chi-squared test on high frequency words

Let's create a frequency table and run the chi-squared test only for the terms that occur more than 100 times.

In [None]:
terms_used_freq = pd.Series(sorted(terms_used)).value_counts()
terms_used_100_times_or_more = terms_used_freq[terms_used_freq > 100]
terms_used_100_times_or_more

In [None]:
find_and_print_chi_squared(jeopardy, terms_used_100_times_or_more.index.tolist())

### Chi squared for high frequency word results

Again the p-values are very high; most of them are well above 5%, our threshold for deciding whether a relationship exists or the results are just random. So we have to stop here: we cannot reject the null hypothesis.

## Conclusion

We fail to reject the null hypothesis, so the alternate hypothesis is not supported. In this sample, any observed relationship between questions asked in newer episodes and those asked in older episodes is consistent with random chance.

