# Winning Jeopardy
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. 

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import re

## Check the data

In [8]:
data = pd.read_csv('jeopardy.csv')
data.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [9]:
data.columns = [x.strip() for x in data.columns]

In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19663 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


## Normalizing Text
The normalization function should:

1. Take in a string.
2. Convert the string to lowercase.
3. Remove all punctuation in the string.
4. Return the string.

In [5]:
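def clean_col(col):
    # Vectorized variant: works on a whole column (Series) at once
    col = col.str.lower()
    # regex=True is required in recent pandas, where str.replace is literal by default
    col = col.str.replace(r'\W', ' ', regex=True)
    return col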

In [17]:
re.sub(r"[^A-Za-z0-9\s]", "", '$2,00')

'200'

In [18]:
def normalize_text(text):
    # Lowercase, strip punctuation, then collapse runs of whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", '', text)
    text = re.sub(r'\s+', ' ', text)
    return text

data['clean_question'] = data['Question'].apply(normalize_text)
data['clean_answer'] = data['Answer'].apply(normalize_text)


In [29]:
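def normalize_value(text):
    # Missing values arrive as float NaN, so convert non-string input first
    if not isinstance(text, str):
        text = str(text)
    # Strip the dollar sign and thousands separator, e.g. '$2,000' -> '2000'
    text = re.sub(r"[^A-Za-z0-9\s]", '', text)
    try:
        text = int(text)
    except ValueError:
        text = 0  # 'nan' and other non-numeric values fall back to 0
    return text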

data['clean_value'] = data['Value'].apply(normalize_value)
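
The same cleanup can also be done on the whole column at once. The following is a minimal sketch rather than the approach used in this notebook; the `values` Series is made up for illustration and stands in for `data['Value']`. Unlike `normalize_value`, it drops letters as well as punctuation before parsing.

```python
import pandas as pd

# Made-up sample standing in for data['Value']; None mimics a missing value
values = pd.Series(['$200', '$2,000', None, 'None'])

# Drop every non-digit character, then parse; anything unparseable becomes NaN
digits = values.str.replace(r'[^0-9]', '', regex=True)
clean_value = pd.to_numeric(digits, errors='coerce').fillna(0).astype(int)

print(clean_value.tolist())
```

Missing and non-numeric entries end up as 0, matching the fallback in `normalize_value`.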

In [30]:
data['Air Date'] = pd.to_datetime(data['Air Date'])

## Answers in Questions

In [31]:
# Apply a function that uses more than one column of each row

data_a = data.copy()
def count_matches(row):
    match_count = 0
    split_question = row['clean_question'].split()
    split_answer = row['clean_answer'].split()
    # 'the' is too common to be a meaningful match, so drop its first occurrence
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    # Share of answer words that also appear in the question
    return match_count / len(split_answer)

data_a['answer_in_question'] = data_a.apply(count_matches, axis=1)

In [32]:
data_a['answer_in_question'].mean()

0.05900196524977763

## Recycled Questions

In [33]:
data_a = data_a.sort_values('Air Date')

In [35]:
question_overlap = []
terms_used = set()
for i, row in data_a.iterrows():
    split_question = row['clean_question'].split(' ')
    # Keep only words of six or more characters to skip common filler words
    split_question = [t for t in split_question if len(t) > 5]
    match_count = 0
    for item in split_question:
        if item in terms_used:
            match_count += 1
        terms_used.add(item)  # record the term only after checking it
    if len(split_question) > 0:
        match_count /= len(split_question)  # share of terms seen before
    question_overlap.append(match_count)

In [None]:
data_a['question_overlap'] = question_overlap
data_a['question_overlap'].mean()

## Low Value vs High Value Questions
Suppose you only want to study terms that tend to appear in high-value questions rather than low-value ones. This will help you earn more money when you're on Jeopardy.

You can figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to split the questions into two categories:

- Low value: any row where Value is 800 or less.
- High value: any row where Value is greater than 800.
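
The split can be expressed as a single vectorized comparison. This is a minimal sketch with made-up values standing in for `data_a['clean_value']`, not the row-by-row function used in the cells that follow.

```python
import pandas as pd

# Made-up values standing in for data_a['clean_value']
clean_value = pd.Series([200, 800, 1000, 2000, 0])

# True where the value exceeds 800, converted to 1 (high) / 0 (low)
high_value = (clean_value > 800).astype(int)

print(high_value.tolist())
```

A value of exactly 800 lands in the low-value group, as does a missing value that was cleaned to 0.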

In [None]:
data_a['clean_value']

In [None]:
def modify(row):
    # 1 for questions worth more than $800, 0 for everything else
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [None]:
data_a['high_value'] = data_a.apply(modify, axis=1)

In [None]:
def create_counts(word):
    # Count how many high and low value questions contain the word
    low_count = 0
    high_count = 0
    for i, row in data_a.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
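
Because `create_counts` scans every row once per word, it gets slow as the list of terms grows. The sketch below counts all terms in a single pass instead. It is an illustration only: `count_terms` and the `toy` DataFrame are hypothetical names, and the rows are made up to stand in for `data_a`.

```python
from collections import defaultdict

import pandas as pd

# Made-up rows standing in for data_a
toy = pd.DataFrame({
    'clean_question': ['this consonant follows a vowel',
                       'a sherpa guides climbers',
                       'name the consonant'],
    'high_value': [1, 0, 0],
})

def count_terms(df, terms):
    """Return {term: [high_count, low_count]} using one pass over df."""
    terms = set(terms)
    counts = defaultdict(lambda: [0, 0])
    for question, high in zip(df['clean_question'], df['high_value']):
        # set() so a term repeated within one question is counted once per row
        for word in set(question.split(' ')) & terms:
            counts[word][0 if high == 1 else 1] += 1
    return {term: counts[term] for term in terms}

result = count_terms(toy, ['consonant', 'sherpa'])
print(result)
```

Each term maps to the same (high, low) pair that `create_counts` would return for it.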

In [36]:
from numpy.random import choice

terms = list(terms_used)  # build the list once instead of once per draw
comparison_terms = [choice(terms) for _ in range(10)]
comparison_terms

['consonant',
 'sherpa',
 'lobsters',
 'listening',
 'josiah',
 'countdown',
 'twentieth',
 'reuben',
 'hammock',
 'creepycrawly']

In [None]:
# Collect the observed (high, low) counts for each term
observed_expected = []
for item in comparison_terms:
    observed_expected.append(create_counts(item))

In [None]:
observed_expected

## Applying the Chi-squared Test

In [None]:
# Number of high and low value questions in the whole dataset
high_value_count = data_a[data_a['high_value']==1].shape[0]
low_value_count = data_a[data_a['high_value']==0].shape[0]

In [None]:
from scipy.stats import chisquare

# Share of all questions that are high / low value (the same for every term)
high_value_p = high_value_count / len(data_a)
low_value_p = low_value_count / len(data_a)

chi_squared = []
for item in observed_expected:
    total = sum(item)  # number of questions that contain the term
    # Expected counts if the term followed the overall high/low proportions
    high_value_expec = total * high_value_p
    low_value_expec = total * low_value_p
    observed = np.array(item)
    expected = np.array([high_value_expec, low_value_expec])
    chi_squared.append(chisquare(observed, expected))
chi_squared
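
To make the expected-count arithmetic concrete, here is a small worked example. All numbers are made up for illustration and do not come from the dataset. It computes by hand the statistic that `chisquare` reports alongside its p-value.

```python
# Made-up numbers: 100 questions in total, 30 of them high value
high_value_p = 30 / 100
low_value_p = 70 / 100

# A hypothetical term seen in 6 high value and 4 low value questions
observed = [6, 4]
total = sum(observed)

# Expected counts if the term followed the overall proportions (about 3 and 7)
expected = [total * high_value_p, total * low_value_p]

# Chi-squared statistic: sum of (observed - expected)^2 / expected
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(statistic, 4))
```

With only a handful of occurrences per term, as is likely for randomly chosen terms, the expected counts are small and the resulting p-values should be read with caution.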