# Guided Project: Winning Jeopardy

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.




In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


- Replace spaces in Column Names

In [2]:
print(jeopardy.columns)
col_names = jeopardy.columns.str.replace(" ","")
jeopardy.columns = col_names

print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


# Normalizing Text

Write a function to normalize questions and answers. The function should:
    - Take in a string.
    - Convert the string to lowercase.
    - Remove all punctuation in the string.
    - Return the string.

In [3]:
import string
import re

def convert_string(inpstr):
    inpstr = inpstr.translate(str.maketrans('', '', string.punctuation)).lower()
    return inpstr

def normalize_text(text):
    text = text.lower()
    text = re.sub("[^A-Za-z0-9\s]", "", text)
    text = re.sub("\s+", " ", text)
    return text


jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

print(jeopardy['clean_question'].head(1))
print(jeopardy['clean_answer'].head(1))

0    for the last 8 years of his life galileo was u...
Name: clean_question, dtype: object
0    copernicus
Name: clean_answer, dtype: object


# Normalizing Columns

Write a function to normalize dollar values. The function should:
- Take in a string.
- Remove any punctuation in the string.
- Convert the string to an integer.
- Assign 0 instead if the conversion has an error.
- Return the integer.

In [4]:
import re

def normalize_values(text):
    text = re.sub("[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
print(jeopardy[['Value', 'clean_value']].head())

  Value  clean_value
0  $200          200
1  $200          200
2  $200          200
3  $200          200
4  $200          200


## Convert AirDate to date time



In [5]:
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])
print(jeopardy['AirDate'].head(5))

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: AirDate, dtype: datetime64[ns]


In [6]:
print(jeopardy.dtypes)
print(jeopardy.head())

ShowNumber                 int64
AirDate           datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
   ShowNumber    AirDate      Round                         Category Value  \
0        4680 2004-12-31  Jeopardy!                          HISTORY  $200   
1        4680 2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2        4680 2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3        4680 2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4        4680 2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl..

 # Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

- How often the answer can be used for a question.
- How often questions are repeated.

You can answer the second question by seeing how often complex words (> 6 characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question and come back to the second.


In [7]:
def process_row(data):
    split_answer = data['clean_answer'].split()
    split_question = data['clean_question'].split()
    #print(len(split_answer), len(split_question))
    # remove 'the'
    if 'the' in split_answer:
        split_answer.remove('the')
    
     
    if len(split_answer) == 0:
        return 0
    else:
        match_count = 0   
        for items in split_answer:
            if items in split_question:
                match_count += 1
               
    return match_count/len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(process_row,axis=1)
    
print(jeopardy['answer_in_question'].mean())


0.05900196524977763


# Recycled Questions

- Sort jeopardy in order of ascending air date.
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy.
- Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
    -  If it does, increment a counter.
    -  Add each word to terms_used.


In [8]:
# Sort by Air Date
import numpy as np

question_overlap = []
terms_used = set()

jeopardy.sort_values('AirDate')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    # Remove any words in split_question that are less than 6 characters long.
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for words in split_question:
        if words in terms_used:
            match_count +=1
    for words in split_question:
        terms_used.add(words)
        
    if (len(split_question) > 0):
        match_count = match_count / len(split_question)
        question_overlap.append(match_count)
    #row["question_overlap"] = question_overlap

#jeopardy['question_overlap'] = question_overlap
#print(jeopardy['question_overlap'].head(5))
print(np.mean(question_overlap), 'Done')
        


0.7056940475822236 Done


# Low Value vs High Value Questions

There is about a 70% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it doesn't look at phrases — it looks at single terms. This makes it relatively insignificant, but it does mean that it's worth looking more into the recycling of questions.



In [9]:


def calc_value(data):
    if data['clean_value'] > 800:
        return 1
    else:
        return 0
    
jeopardy['high_value'] = jeopardy.apply(calc_value, axis=1)
print(jeopardy['high_value'].head(5))
    

0    0
1    0
2    0
3    0
4    0
Name: high_value, dtype: int64


In [10]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count


In [11]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected


[(0, 1),
 (3, 4),
 (0, 1),
 (0, 1),
 (1, 5),
 (1, 1),
 (0, 1),
 (0, 1),
 (1, 0),
 (1, 0)]

# Applying the Chi-squared Test



In [12]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0 ].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared


[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.6887906561130311, pvalue=0.4065760282166111),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.42281054506129573, pvalue=0.515537958129453),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

# Chi-Squared Results

None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.