# Winning Jeopardy


### Jeopardy Questions

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that we want to compete on Jeopardy, and we're looking for any way to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us to win.

The dataset is named `jeopardy.csv`, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

Here's the beginning of the file:

![](https://dq-content.s3.amazonaws.com/Nlfu13A.png)

In [1]:
# read dataset
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')

print(jeopardy.shape)
jeopardy.head(5)

(19999, 7)


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
# see the columns

jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# remove the leading spaces from the column names
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
jeopardy.columns


Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
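
Hard-coding the column names works, but the list has to be kept in sync with the file by hand. A minimal alternative sketch, assuming the headers differ only by leading or trailing whitespace, is to strip them with the `.str.strip()` accessor on the column index. The small `demo` frame below is a made-up stand-in for `jeopardy`.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame, with the same padded headers.
demo = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round', ' Category',
                             ' Value', ' Question', ' Answer'])

# Strip leading/trailing whitespace from every column name at once.
demo.columns = demo.columns.str.strip()

print(list(demo.columns))
```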

In [4]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


### Normalizing Text

Before we can start doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the Question and Answer columns).


In [5]:

import re

def normalize_text(text):
    text = text.lower()
    # raw strings keep \s from being treated as a string escape
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

jeopardy.head(3)
    

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
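
To make the effect of the normalization concrete, the sketch below applies the same two substitutions to a made-up clue string (the sample text is an assumption, not a row from the dataset). Punctuation is dropped and runs of whitespace collapse to a single space.

```python
import re

def normalize_text(text):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

sample = 'No. 2:  1912 Olympian; "football" star'
print(normalize_text(sample))  # no 2 1912 olympian football star
```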


### Normalizing Columns

Now that we've normalized the text columns, two other columns need normalizing:
 
 - The `Value` column should be numeric.
 - The `Air Date` column should be a datetime.

In [6]:
def normalize_value(text):
    # strip the dollar sign and commas, then convert to an integer
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # non-numeric values become 0
        text = 0
    return text

In [7]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

In [8]:
# convert the Air Date column to a datetime column.
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [9]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

In [10]:
print(jeopardy.shape)
jeopardy.head(3)

(19999, 10)


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
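
The row-by-row `normalize_value` function can also be expressed with pandas string methods. This is a sketch on a made-up `Value` column; it assumes that entries with no digits (for example a literal `None` string) should become 0, matching the behavior of the function above.

```python
import pandas as pd

# Hypothetical sample of the raw Value column.
values = pd.Series(['$200', '$1,000', 'None'])

# Keep digits only, then convert; entries with no digits become NaN and then 0.
digits = values.str.replace(r'[^0-9]', '', regex=True)
clean_value = pd.to_numeric(digits, errors='coerce').fillna(0).astype(int)

print(clean_value.tolist())  # [200, 1000, 0]
```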


### Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

 - How often the answer can be deduced from the question.
 - How often questions are repeated.

In [11]:
# This function takes a row of the dataset, splits clean_answer and
# clean_question into words, and returns the fraction of words in
# clean_answer that also occur in clean_question
def matches_row_count(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)   

jeopardy['answer_in_question'] = jeopardy.apply(matches_row_count, axis=1)
    

In [12]:
jeopardy['answer_in_question'].mean()

0.05900196524977763

On average, only about 6% of the words in an answer also appear in its question.
That is too little to rely on, so we'll continue to investigate.
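
As a small worked example of the ratio computed above, consider a made-up row (not from the dataset) where one of two answer words appears in the question. The function is repeated here so the snippet runs on its own, and a plain dictionary stands in for the DataFrame row.

```python
def matches_row_count(row):
    # Fraction of answer words (ignoring one 'the') that also occur in the question.
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for item in split_answer if item in split_question)
    return match_count / len(split_answer)

# Hypothetical row: 'adams' is in the question, 'john' is not.
row = {
    'clean_question': 'this adams signed the declaration of independence',
    'clean_answer': 'john adams',
}
print(matches_row_count(row))  # 0.5
```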

### Recycled Questions

Let's say we want to investigate how often new questions are repeats of older ones.
    

In [13]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('Air Date')
# use the DataFrame.iterrows method to loop through each row of jeopardy
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap

jeopardy['question_overlap'].mean()
            


0.6876260592169802

On average, about 69% of the longer terms (six or more characters) in a question have already appeared in an earlier question. However, this analysis uses only a small set of questions, and it compares single terms rather than phrases.
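
The same bookkeeping can be traced on a tiny, made-up list of questions that is already in air-date order. Only words longer than five characters count, and each question is compared against the words seen in earlier questions.

```python
# Hypothetical cleaned questions in chronological order.
questions = [
    'this planet is closest to the sun',
    'the largest planet in our solar system',
    'galileo observed this planet through a telescope',
]

terms_used = set()
question_overlap = []
for question in questions:
    words = [w for w in question.split() if len(w) > 5]
    # Share of this question's long words already seen in earlier questions.
    seen = sum(1 for w in words if w in terms_used)
    question_overlap.append(seen / len(words) if words else 0)
    terms_used.update(words)

print(question_overlap)
```

The first question has nothing to overlap with, so its score is 0; the later ones each share only `planet` with what came before.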

### Low Value vs High Value questions

Let's say we only want to study high-value questions instead of low-value questions. This will help us earn more money when we're on Jeopardy.

In [14]:
# flag a row as high value (1) if its clean_value is above 800, otherwise low value (0)

def determined_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(determined_value, axis=1) 

In [15]:
# Count how many high-value and low-value questions contain a given term
def count_used(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():  # loop through each row of the dataset
        if term in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
        

In [16]:
# randomly pick 10 terms from terms_used and collect the observed
# (high_count, low_count) pair for each one
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_used(term))
    
observed_expected    

[(1, 0),
 (0, 2),
 (0, 1),
 (1, 0),
 (1, 0),
 (1, 1),
 (19, 35),
 (1, 0),
 (3, 10),
 (1, 2)]
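
`count_used` scans the whole DataFrame once per term, which becomes slow as the number of terms grows. One possible alternative, sketched here on made-up `(clean_question, high_value)` pairs, is to make a single pass and tally every word at once. The names `rows`, `high_counts`, and `low_counts` are illustrative and not part of the notebook.

```python
from collections import Counter

# Hypothetical (clean_question, high_value) pairs standing in for the DataFrame rows.
rows = [
    ('this planet is closest to the sun', 0),
    ('galileo observed this planet', 1),
    ('the largest planet', 1),
]

high_counts = Counter()
low_counts = Counter()
for question, high_value in rows:
    # Use a set so a word is counted once per question, as count_used does.
    target = high_counts if high_value == 1 else low_counts
    target.update(set(question.split(' ')))

term = 'planet'
print((high_counts[term], low_counts[term]))  # (2, 1)
```

After the single pass, the `(high, low)` pair for any term is a dictionary lookup rather than another loop over the rows.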

### Applying the Chi-squared Test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [17]:

from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_pro = total / jeopardy.shape[0]
    high_value_exp = total_pro * high_value_count
    low_value_exp = total_pro * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared    


[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=1.1203229780198425, pvalue=0.2898489190576429),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.19895489749611328, pvalue=0.6555657483351991),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293)]
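
For reference, the statistic that `chisquare` returns for each term is the sum of `(observed - expected) ** 2 / expected` over the two groups. The sketch below computes it by hand; the dataset totals used here are invented round numbers, not the real counts from `jeopardy`.

```python
def chi_squared_stat(observed, expected):
    # Pearson's chi-squared statistic: sum of (O - E)^2 / E over all categories.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical totals: 1,000 questions, 300 of them high value.
total_rows, high_rows, low_rows = 1000, 300, 700

# A term seen in 4 high-value and 6 low-value questions.
observed = (4, 6)
share = sum(observed) / total_rows  # fraction of all questions using the term
expected = (share * high_rows, share * low_rows)

print(expected)
print(chi_squared_stat(observed, expected))
```

With these invented numbers the expected counts are about 3 and 7, so the statistic is roughly 1/3 + 1/7.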

### Chi-squared Result

None of the terms had a significant difference in usage between high-value and low-value rows: every p-value is above the conventional 0.05 threshold. Additionally, most of the observed frequencies were lower than 5, so the chi-squared test is less reliable here. It would be better to run this test only with terms that have higher frequencies.
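
One way to act on this, sketched under the assumption that per-term `(high, low)` counts are already available in a dictionary, is to keep only terms whose smaller expected count is at least 5. The counts and totals below are made up for illustration, and the cutoff of 5 is a common rule of thumb rather than a hard requirement.

```python
# Hypothetical per-term (high_count, low_count) tallies.
term_counts = {
    'galileo': (1, 0),
    'president': (19, 35),
    'capital': (3, 10),
}

# Hypothetical dataset totals.
total_rows, high_rows, low_rows = 1000, 300, 700

def min_expected(counts):
    # Smallest expected count across the high and low groups for one term.
    share = sum(counts) / total_rows
    return min(share * high_rows, share * low_rows)

frequent_terms = [t for t, c in term_counts.items() if min_expected(c) >= 5]
print(frequent_terms)  # ['president']
```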

### Next Steps

Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
  - See which categories appear the most often.
  - Find the probability of each category appearing in each round.
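
Both ideas map onto standard pandas calls. The sketch below uses a small invented frame with `Category` and `Round` columns; `value_counts` ranks categories by frequency, and `pd.crosstab` with `normalize='columns'` gives, for each round, the share of questions that fall in each category.

```python
import pandas as pd

# Hypothetical sample with the same column names as the dataset.
sample = pd.DataFrame({
    'Category': ['HISTORY', 'HISTORY', 'SCIENCE', 'HISTORY'],
    'Round': ['Jeopardy!', 'Double Jeopardy!', 'Jeopardy!', 'Jeopardy!'],
})

# Most frequent categories overall.
category_counts = sample['Category'].value_counts()

# Share of each category within each round (each column sums to 1).
category_by_round = pd.crosstab(sample['Category'], sample['Round'],
                                normalize='columns')

print(category_counts)
print(category_by_round)
```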