## The Goal of This Project

Jeopardy is a popular TV quiz show in the US that has been running for a few decades. Contestants answer questions to win money.

Suppose you want to compete on Jeopardy and are looking for an edge that could help you win. In this project, we analyze a dataset of Jeopardy questions to look for patterns that could help.

## Jeopardy Datasets

A dataset of 200,000 Jeopardy questions (!) has been posted on Reddit, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

The columns in the dataset are as follows.

* Show Number: the number of the episode in which the question appeared.
* Air Date: the date the episode aired.
* Round: the round in which the question appeared. Jeopardy has several rounds as each episode progresses.
* Category: the category of the question.
* Value: the number of dollars a correct answer is worth.
* Question: the text of the question.
* Answer: the text of the answer.

## Opening the dataset

In [1]:
import pandas as pd

# read the dataset
jeopardy = pd.read_csv('../data/jeopardy.csv')

# show the first five rows 
display(jeopardy.head())
# show the name of the columns
print(jeopardy.columns)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [2]:
# strip leading and trailing spaces from each column name
columns = [column.strip() for column in jeopardy.columns]
jeopardy.columns = columns
print(jeopardy.columns)

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
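
The same cleanup can also be written with the vectorized string accessor on the column index. This is a minimal sketch, assuming a small made-up frame (`demo`, not part of the notebook) whose headers mimic the leading spaces seen above.

```python
import pandas as pd

# Hypothetical frame whose headers carry the same leading spaces as the raw CSV
demo = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round'])

# Index.str.strip() removes leading and trailing whitespace from every label
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```

Either form works; the list comprehension in the cell above is equivalent for plain string labels.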


## Normalizing text

In [21]:
import re

def normalize_text(text):
    """
    Normalize a text string.
    
    Args:
        text (str) : text to normalize
    
    Returns:
        text (str) : lowercased text with punctuation removed
    """
    
    # convert the string to lowercase
    text = text.lower()
    # remove all punctuation except hyphens (raw string avoids invalid-escape warnings)
    text = re.sub(r'[^A-Za-z0-9\s\-]', '', text)

    return text
    
# remove missing values
jeopardy = jeopardy.dropna().reset_index(drop=True)
# normalize the `Question` column
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
# normalize the `Answer` column
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

display(jeopardy.head())

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,question_overlap
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0
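
To see what the normalization keeps and drops, here is a small self-contained check. It repeats the same two steps as `normalize_text` (lowercasing, then deleting every character that is not a letter, digit, whitespace, or hyphen). The sample strings are taken from the rows shown above.

```python
import re

def normalize_text(text):
    # lowercase, then drop everything except letters, digits, whitespace and hyphens
    return re.sub(r'[^A-Za-z0-9\s\-]', '', text.lower())

samples = ["McDonald's", "ESPN's TOP 10 ALL-TIME ATHLETES", "No. 2: 1912 Olympian"]
cleaned = [normalize_text(sample) for sample in samples]

for raw, clean in zip(samples, cleaned):
    print(f'{raw!r} -> {clean!r}')
```

Note that hyphens survive the cleanup, so hyphenated words stay joined as a single term.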


In [22]:
def normalize_value(value):
    """
    Normalize a dollar value.
    
    Args:
        value (str) : dollar value, e.g. '$200'
    
    Returns:
        value (int) : normalized dollar value, or 0 if it cannot be parsed
    """
    
    # remove all punctuation
    value = re.sub(r'[^A-Za-z0-9\s\-]', '', value)
    # convert the string to an integer; fall back to 0 for non-numeric values
    try:
        value = int(value)
    except ValueError:
        value = 0
        
    return value

# normalize the `Value` column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
# convert the `Air Date` column to a datetime column
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
display(jeopardy.head())
# check the types of each column
display(jeopardy.dtypes)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,question_overlap
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0


Show Number                  int64
Air Date            datetime64[ns]
Round                       object
Category                    object
Value                       object
Question                    object
Answer                      object
clean_question              object
clean_answer                object
clean_value                  int64
question_overlap           float64
dtype: object
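
A quick way to sanity-check the value cleanup is to run it on a few representative strings. The sketch below repeats the logic of `normalize_value` so that it runs on its own. The inputs `'$2,000'` and `'None'` are illustrative assumptions and are not taken from the output above.

```python
import re

def normalize_value(value):
    # strip punctuation such as '$' and ',' before parsing
    value = re.sub(r'[^A-Za-z0-9\s\-]', '', value)
    try:
        return int(value)
    except ValueError:
        # anything that is not a plain integer is treated as 0
        return 0

# '$2,000' and 'None' are hypothetical inputs used only for illustration
results = {raw: normalize_value(raw) for raw in ['$200', '$2,000', 'None']}
print(results)
```

Mapping unparseable values to 0 keeps the column numeric, at the cost of making such rows look like very low-value questions in later comparisons.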

## How often the answer is deducible from the question

In [23]:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# get the list of stopwords
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\raye4\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
def count_matches(row):
    """
    Compute the share of terms in clean_answer that also occur in clean_question.
    
    Args:
        row (pd.Series) : a row in jeopardy
    
    Returns:
        match_count / len(split_answer) (float) : share of answer terms that reoccur in the question
    """
    
    # split both columns on spaces
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    
    match_count = 0
    
    # remove stopwords from split_answer (list.remove drops only the first occurrence of each)
    for stop_word in stop_words:
        if stop_word in split_answer:
            split_answer.remove(stop_word)

    # return 0 if split_answer is empty in order to prevent a division-by-zero error
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1

    return match_count / len(split_answer)

# count how many terms in clean_answer occur in clean_question
answer_in_question = jeopardy.apply(count_matches, axis=1)
print(answer_in_question.mean())

0.04090742091587874
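
A mean of about 0.04 says that, on average, only around 4% of the non-stopword answer terms also appear in the question, so the answer can rarely be read off the question text. The sketch below illustrates the same measure on two toy rows. It is not the notebook's exact function: `STOP_WORDS` is a tiny hypothetical list standing in for NLTK's, and every occurrence of a stopword is filtered out, whereas `list.remove` in the cell above drops only the first one, so results on the full dataset could differ slightly.

```python
# Hypothetical miniature stopword list; the notebook uses NLTK's English list instead
STOP_WORDS = {'the', 'a', 'an', 'of', 'in'}

def answer_overlap(clean_question, clean_answer):
    """Share of non-stopword answer terms that also appear in the question."""
    answer_terms = [term for term in clean_answer.split() if term not in STOP_WORDS]
    if not answer_terms:
        # nothing left to compare, so avoid dividing by zero
        return 0.0
    question_terms = set(clean_question.split())
    matches = sum(term in question_terms for term in answer_terms)
    return matches / len(answer_terms)

partial = answer_overlap('this city of lights is the capital of france', 'the city of paris')
none = answer_overlap('the city of yuma in this state has a record', 'arizona')
print(partial, none)
```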


## How often new questions are repeats of older questions

In [31]:
question_overlap = []
terms_used = set()

# sort by air date so that each question is compared only with earlier ones
sorted_jeopardy = jeopardy.sort_values('Air Date')

for i, row in sorted_jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    
    # remove stopwords (list.remove drops only the first occurrence of each)
    for stop_word in stop_words:
        if stop_word in split_question:
            split_question.remove(stop_word)
    
    # keep only words that are more than 7 characters long
    split_question = [term for term in split_question if len(term) > 7]
    # remove all words that contain a number
    split_question = [term for term in split_question if not re.search(r'[0-9]', term)]
    
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)

# the overlaps are in air-date order, so align them on the sorted index before assigning
jeopardy['question_overlap'] = pd.Series(question_overlap, index=sorted_jeopardy.index)
# print the mean of the `question_overlap` column
print(jeopardy['question_overlap'].mean())

0.7063366846813552
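
The figure of about 0.71 measures how often individual long terms have been seen in earlier questions, not how often whole questions are repeated. The following sketch shows the running-set idea on three toy questions, assumed to be listed oldest first. The function name `term_reuse` and the sample questions are made up for illustration, and the stopword and digit filters from the cell above are left out for brevity.

```python
def term_reuse(questions, min_length=8):
    """For each question (oldest first), the share of long terms already seen earlier."""
    seen = set()
    overlaps = []
    for question in questions:
        # keep only terms with at least `min_length` characters
        terms = [term for term in question.split() if len(term) >= min_length]
        matches = 0
        for term in terms:
            if term in seen:
                matches += 1
            seen.add(term)
        overlaps.append(matches / len(terms) if terms else 0)
    return overlaps

toy_questions = [
    'this mountain range crosses seven countries',
    'the highest mountain in these countries',
    'a short one',
]
overlaps = term_reuse(toy_questions)
print(overlaps)
```

The second toy question scores 1.0 although it is a different question from the first, which is why a high mean here is only weak evidence that studying past questions pays off.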


In [36]:
terms_used

{'schwitters',
 'fribourg',
 'supported',
 'targetblanktendera',
 'disbanded',
 'override',
 'allergens',
 'denouncing',
 'mph--hence',
 'doorkeeper',
 'targetblankprograma',
 'toymakers',
 'all-japan',
 'hypotenuses',
 'tumbling',
 'con-mans',
 'dramatization',
 'abductee',
 'ofatmospheric',
 'muttered',
 'culminates',
 'serenaded',
 'defensible',
 'masthead',
 'grandstand',
 'revisited',
 'napsters',
 'refilled',
 'targetblanklook',
 'thoroughly',
 'sleepingi',
 'microphone',
 'paraphrase',
 'portlands',
 'jacobson',
 'pendulums',
 'indefinite-sized',
 'popsicles',
 'single-horned',
 'author-aviator',
 'puppies--playing',
 'mcgoverns',
 'fairylandversing',
 'speculation',
 'winglike',
 'hungry-man',
 'christinas',
 'unruffled',
 'delusions',
 'warehouses',
 'increased',
 'troublemakers',
 'microchips',
 'mellowed',
 'pythoner',
 'draftsmans',
 'chegwidden',
 'engine-powered',
 'demoiselles',
 'daffodils',
 'malagrida',
 'mailable',
 'collusion',
 'targetblankyouthful',
 'lady--this',
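
Terms such as `targetblanktendera` look like leftovers of HTML links (for example `target="_blank">tender</a>`) that were fused together when punctuation was removed. That reading is an assumption based on the term names alone, since the raw question text is not shown here. If it holds, stripping tags before normalizing would avoid such terms. A minimal sketch, with a made-up question and a hypothetical helper `strip_html`:

```python
import re

def strip_html(text):
    # replace anything that looks like an HTML tag, e.g. <a href="..."> or </a>, with a space
    return re.sub(r'<[^>]+>', ' ', text)

def normalize_text(text):
    # same normalization as in the notebook: lowercase, then remove punctuation
    return re.sub(r'[^A-Za-z0-9\s\-]', '', text.lower())

# Made-up question text containing a link, for illustration only
raw = 'This <a href="http://example.com/clip.jpg" target="_blank">tender</a> is a small boat'

without_strip = normalize_text(raw).split()
with_strip = normalize_text(strip_html(raw)).split()

print(without_strip)
print(with_strip)
```

A regular expression is a rough tool for HTML; it is used here only because the markup in question appears to be simple inline tags.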

## Figure out which terms correspond to high-value questions using a chi-squared test

In [37]:
def determine_value(row):
    """
    Flag whether a question is high value.
    
    Args:
        row (pd.Series) : a row from the jeopardy DataFrame
    
    Returns:
        value (int) : 1 if the clean_value column is greater than 800, else 0
    
    """
    
    value = 1 if row['clean_value'] > 800 else 0
        
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)
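
The same flag can be computed without a row-wise `apply` by comparing the whole column at once. This is a sketch on a small made-up frame (`demo`), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the `clean_value` column
demo = pd.DataFrame({'clean_value': [200, 800, 1000, 2000]})

# Comparing the whole column yields booleans; astype(int) maps them to 0 and 1
demo['high_value'] = (demo['clean_value'] > 800).astype(int)

print(demo['high_value'].tolist())
```

Note that a value of exactly 800 counts as low value under this threshold.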

In [39]:
def count_usage(term):
    """
    Count how many high-value and low-value questions contain the term.
    
    Args:
        term (str) : the term to look for in `clean_question`
    
    Returns:
        high_count (int) : the number of questions containing the term whose value is greater than 800
        low_count (int) : the number of questions containing the term whose value is 800 or less
    
    """
    
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if term in split_question:
            if row['high_value']:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count

observed_expected = []
comparison_terms = list(terms_used)[:20]

for term in comparison_terms:
    observed_expected.append(count_usage(term))
    
print(comparison_terms)
print(observed_expected)

['schwitters', 'fribourg', 'supported', 'targetblanktendera', 'disbanded', 'override', 'allergens', 'denouncing', 'mph--hence', 'doorkeeper', 'targetblankprograma', 'toymakers', 'all-japan', 'hypotenuses', 'tumbling', 'con-mans', 'dramatization', 'abductee', 'ofatmospheric', 'muttered']
[(0, 1), (1, 2), (24, 47), (1, 0), (7, 5), (0, 2), (0, 1), (1, 3), (0, 1), (1, 1), (1, 0), (0, 1), (0, 1), (0, 1), (3, 12), (1, 0), (0, 3), (1, 0), (0, 1), (0, 2)]
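
The chi-squared comparison in the next cell can be reproduced by hand for a single term, which makes the calculation easier to follow. The sketch below uses the observed counts `(7, 5)` reported above for `disbanded`, but the totals of high- and low-value questions (`high_total`, `low_total`) are hypothetical round numbers, not values from the dataset. With two categories the test has one degree of freedom, for which the p-value reduces to `erfc(sqrt(statistic / 2))`.

```python
import math

def chi_squared_1dof(observed, expected):
    """Pearson chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # for 1 degree of freedom the survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical totals, NOT taken from the dataset
high_total, low_total = 300, 700
n_rows = high_total + low_total

observed = (7, 5)  # (high, low) counts reported above for 'disbanded'
share = sum(observed) / n_rows
# expected counts if the term appeared in high- and low-value questions proportionally
expected = (share * high_total, share * low_total)

statistic, p_value = chi_squared_1dof(observed, expected)
print(statistic, p_value)
```

Most of the sampled terms occur only one to three times, so their expected counts are very small. The chi-squared approximation is generally considered unreliable in that regime, and p-values for such rare terms should be read with caution.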


In [62]:
from scipy.stats import chisquare

high_value_count = len(jeopardy[jeopardy['high_value']==1])
low_value_count = len(jeopardy[jeopardy['high_value']==0])

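chi_squared = []
for obs in observed_expected:
    # share of all questions that contain the term
    total = sum(obs)
    total_prop = total / len(jeopardy)
    # expected counts if the term appeared in high- and low-value questions proportionally
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    observed = list(obs)
    expected = [high_value_exp, low_value_exp]

    # chi-squared test of the observed counts against the expected counts
    chi_squared.append(chisquare(observed, expected))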

chi_squared

[Power_divergenceResult(statistic=0.39498154412048414, pvalue=0.5296924445254796),
 Power_divergenceResult(statistic=0.03723001319762459, pvalue=0.8469974958245368),
 Power_divergenceResult(statistic=1.053665015690835, pvalue=0.3046645082630172),
 Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=5.3275806579540035, pvalue=0.02099050312842717),
 Power_divergenceResult(statistic=0.7899630882409683, pvalue=0.3741112870360538),
 Power_divergenceResult(statistic=0.39498154412048414, pvalue=0.5296924445254796),
 Power_divergenceResult(statistic=0.02164944004882361, pvalue=0.8830235016084509),
 Power_divergenceResult(statistic=0.39498154412048414, pvalue=0.5296924445254796),
 Power_divergenceResult(statistic=0.4633727036157106, pvalue=0.4960519396377898),
 Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=0.39498154412048414, pvalue=0.5296924445254796),
 Power_d