# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

* Our aim is to find patterns in the questions that could help you win.

* The dataset is named jeopardy.csv and contains roughly 20,000 rows taken from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

In [60]:
import pandas as pd
import numpy as np

In [61]:
jeopardy = pd.read_csv('jeopardy.csv') #reading the dataset
jeopardy.head() #printing the first five rows of the dataset

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [62]:
jeopardy.columns #showing the columns of the dataset

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [63]:
jeopardy.columns = jeopardy.columns.str.strip() #removing extra spaces from column names 
jeopardy.columns 

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [64]:
jeopardy.dtypes #finding the type of each column

Show Number     int64
Air Date       object
Round          object
Category       object
Value          object
Question       object
Answer         object
dtype: object

In [80]:
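#creating a function to normalize string columns
import re
def string_normalizer(element):
    element = element.lower()
    #dropping every character that is not a letter, digit, or whitespace
    #(raw strings avoid invalid escape sequence warnings for \s)
    element = re.sub(r'[^A-Za-z0-9\s]', '', element)
    #collapsing runs of whitespace into a single space
    element = re.sub(r'\s+', ' ', element)
    return element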


In [81]:
#normalizing question column
jeopardy['clean_question'] = jeopardy['Question'].apply(string_normalizer)
jeopardy['clean_question'].head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [82]:
#normalizing the Answer column
jeopardy['clean_answer'] = jeopardy['Answer'].apply(string_normalizer)
jeopardy['clean_answer'].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
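
As a quick check of what the normalizer does, here is a minimal, self-contained sketch that re-declares the function and runs it on a few strings that appear in the Question and Answer columns above. The choice of sample strings is ours, made only for illustration.

```python
import re

def string_normalizer(element):
    # lowercase, strip everything except letters, digits and whitespace,
    # then collapse runs of whitespace into single spaces
    element = element.lower()
    element = re.sub(r'[^A-Za-z0-9\s]', '', element)
    element = re.sub(r'\s+', ' ', element)
    return element

samples = [
    "McDonald's",
    "Crate & Barrel",
    "In the winter of 1971-72, a record 1,122 inches",
]
normalized = [string_normalizer(s) for s in samples]
for raw, clean in zip(samples, normalized):
    print(repr(raw), '->', repr(clean))
```

Note that stripping punctuation also merges tokens: "1971-72" becomes "197172" and "1,122" becomes "1122". That is harmless for the word-matching done later, but it is worth keeping in mind when reading the cleaned text.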

In [83]:
#inspecting the raw values of the Value column before normalizing it
jeopardy['Value'].unique()[:20]

array(['$200', '$400', '$600', '$800', '$2,000', '$1000', '$1200',
       '$1600', '$2000', '$3,200', 'None', '$5,000', '$100', '$300',
       '$500', '$1,000', '$1,500', '$1,200', '$4,800', '$1,800'],
      dtype=object)

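The values are stored as strings: some are 'None', and the others carry a $ sign and, in some cases, a comma as a thousands separator. These need to be handled before the column can be converted to integers.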

In [84]:
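#defining a normalizer function for the Value column
def value_normalizer(element):
    try:
        #removing the $ sign and thousands separators before converting
        element = re.sub('[$,]', '', element)
        return int(element)
    except ValueError:
        #strings such as 'None' cannot be converted, so they become 0
        return 0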
    

In [85]:
#checking if the function works properly
print(jeopardy['Value'].dtypes)
jeopardy['Value'].apply(value_normalizer).unique()

object


array([  200,   400,   600,   800,  2000,  1000,  1200,  1600,  3200,
           0,  5000,   100,   300,   500,  1500,  4800,  1800,  1100,
        2200,  3400,  3000,  4000,  6800,  1900,  3100,   700,  1400,
        2800,  8000,  6000,  2400, 12000,  3800,  2500,  6200, 10000,
        7000,  1492,  7400,  1300,  7200,  2600,  3300,  5400,  4500,
        2100,   900,  3600,  2127,   367,  4400,  3500,  2900,  3900,
        4100,  4600, 10800,  2300,  5600,  1111,  8200,  5800,   750,
        7500,  1700,  9000,  6100,  1020,  4700,  2021,  5200,  3389])
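
To make the behaviour of the normalizer concrete, the following sketch re-declares it and applies it to a handful of the raw strings listed earlier. It is a standalone illustration, not a replacement for the cell above.

```python
import re

def value_normalizer(element):
    try:
        # drop the dollar sign and thousands separators, then convert
        return int(re.sub('[$,]', '', element))
    except ValueError:
        # non-numeric strings such as 'None' fall back to 0
        return 0

raw_values = ['$200', '$2,000', '$1000', 'None']
cleaned = [value_normalizer(v) for v in raw_values]
print(cleaned)
```

Because unparseable values become 0, those rows fall into the low-value group once the $800 threshold is applied later in the notebook.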

In [155]:
#creating a new column clean_value
jeopardy['clean_value'] = jeopardy['Value'].apply(value_normalizer)
print(jeopardy['clean_value'].dtype)
jeopardy.head()

int64


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0.0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,0.0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0,0.0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this n...,the grand canyon,200,0.0,0.5
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.0,0.0


In [87]:
#converting Air Date column data type to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

**First question: how often do the words of the answer appear in the question itself?**


In [132]:
#defining a function to compute the fraction of answer words
#that also appear in the question of the same row:
def match_count(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    matches = 0

    #"the" is too common to be informative; note that list.remove
    #drops only its first occurrence
    if "the" in split_answer:
        split_answer.remove("the")

    #avoiding a division by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            matches += 1
    return matches / len(split_answer)
    

In [134]:
#creating a new column using the match_count function
jeopardy['answer_in_question'] = jeopardy.apply(match_count,axis = 1)
jeopardy.head(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant,in the title of an aesop fable this insect sha...,the ant,200,0.0
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way,built in 312 bc to link rome the south of ital...,the appian way,400,0.0
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan,no 8 30 steals for the birmingham barons 2306 ...,michael jordan,400,0.0
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington,in the winter of 197172 a record 1122 inches o...,washington,400,0.0
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel,this housewares store was named for the packag...,crate barrel,400,0.0


In [135]:
#finding the mean of jeopardy['answer_in_question']
jeopardy['answer_in_question'].mean()

0.05900196524977763

The low mean of about 0.059 shows that, on average, only about 6% of the words in an answer also appear in its question. Reading the question closely is therefore unlikely to reveal the answer.
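
To see where non-zero scores come from, here is a small sketch that applies the same logic to two question and answer pairs from this dataset (the "caste" and "mutiny on the Bounty" rows). `match_fraction` is a hypothetical standalone variant of `match_count` that takes two strings instead of a DataFrame row.

```python
def match_fraction(clean_answer, clean_question):
    # hypothetical string-based variant of match_count, for illustration
    split_answer = clean_answer.split()
    split_question = clean_question.split()
    if "the" in split_answer:
        split_answer.remove("the")  # removes only the first "the"
    if not split_answer:
        return 0
    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)

caste = match_fraction("a caste cast", "hindu hierarchy or a plays actors")
bounty = match_fraction("the day of the mutiny on the bounty",
                        "why april 28th was a bad day for capt bligh")
print(round(caste, 6), round(bounty, 6))
```

In both cases the only shared word is a common one ("a" and "day"), not the key term of the answer. This suggests that even the small mean above may overstate how much a question gives away.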

**New question: how often are new questions repeats of older ones?**

In [136]:
#sorting the dataset based on date
jeopardy = jeopardy.sort_values('Air Date', ascending = True)

In [145]:
# looping over the questions in air-date order to measure term overlap
question_overlap = list()
terms_used = set()

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    #keeping only words of six or more characters, which drops
    #short, common words like the, than
    split_question = [v for v in split_question if len(v) > 5]
    #using a name other than match_count so the function defined
    #earlier is not overwritten, and not reusing the row index i
    overlap_count = 0
    for word in split_question:
        if word in terms_used:
            overlap_count += 1
    for word in split_question:
        terms_used.add(word)

    if len(split_question) > 0:
        overlap_count /= len(split_question)

    question_overlap.append(overlap_count)

question_overlap[:20]

[0.0,
 0.0,
 0.0,
 0.5,
 0.0,
 0.0,
 0.0,
 0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0,
 0.0]

In [146]:
#assigning the question_overlap list to our dataset
jeopardy['question_overlap'] = question_overlap
jeopardy.head(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0.0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,0.0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0,0.0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this n...,the grand canyon,200,0.0,0.5
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.0,0.0
19305,10,1984-09-21,Double Jeopardy!,HOMONYMS,$200,Hindu hierarchy or a play's actors,a caste (cast),hindu hierarchy or a plays actors,a caste cast,200,0.333333,0.0
19306,10,1984-09-21,Double Jeopardy!,TV TRIVIA,$200,"Last season, this series mourned the loss of S...",Hill Street Blues,last season this series mourned the loss of sg...,hill street blues,200,0.0,0.0
19307,10,1984-09-21,Double Jeopardy!,1789,$400,Why April 28th was a bad day for Capt. Bligh,the day of the mutiny on the Bounty,why april 28th was a bad day for capt bligh,the day of the mutiny on the bounty,400,0.142857,0.0
19308,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$400,Seaside resort that has a monopoly on East Coa...,"Atlantic City, New Jersey",seaside resort that has a monopoly on east coa...,atlantic city new jersey,400,0.0,0.0
19309,10,1984-09-21,Double Jeopardy!,LITERATURE,$400,"He wrote ""The 3 Musketeers""; his son wrote ""Ca...",(Alexandre) Dumas,he wrote the 3 musketeers his son wrote camille,alexandre dumas,400,0.0,0.0


In [147]:
#the mean of question_overlap column
jeopardy['question_overlap'].mean()

0.6876260592169802

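On average, about 69% of the words of six or more characters in a question have already appeared in an earlier question. This measures single-word overlap rather than repeated phrases or whole questions, so it is only a rough signal, but it suggests that studying past questions is a worthwhile way to become familiar with new ones.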
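
The overlap measure can be traced on a toy example. The sketch below is a minimal, self-contained version of the same idea; the function name `term_overlap`, its `min_length` parameter, and the three toy questions are assumptions made for illustration.

```python
def term_overlap(questions, min_length=6):
    # questions must already be normalized and in chronological order
    terms_used = set()
    overlaps = []
    for question in questions:
        # keep only the longer words, as in the notebook (len > 5)
        words = [w for w in question.split(' ') if len(w) >= min_length]
        # count the words that were seen in any earlier question
        seen = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        overlaps.append(seen / len(words) if words else 0)
    return overlaps

toy_questions = [
    "this president signed the treaty",
    "the treaty ended this famous conflict",
    "a president in conflict",
]
overlaps = term_overlap(toy_questions)
print(overlaps)
```

The first question has no history, so its overlap is 0; the second shares one of three long words with it; the third consists entirely of words seen before. By construction, later questions tend to score higher simply because the set of seen terms keeps growing, which is one reason to read the overall mean with some caution.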

**Finding words that appear significantly more often in high-value questions than in low-value questions**
* To achieve this goal, we use the chi-squared test.

In [156]:
# grouping questions into low-value and high-value questions, using a threshold of $800
def value_segment(row):
    if int(row['clean_value']) > 800:
        return 1
    else:
        return 0

In [161]:
#applying the value_segment function to value column
jeopardy['high_value'] = jeopardy.apply(value_segment,axis=1)
jeopardy['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [169]:
# creating a function to find the frequency of 
#a specific word in low and high value questions
def frq(word):
    low_count = 0
    high_count = 0
    
    for i,row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
                
    return high_count,low_count   

In [172]:
# creating a random sample of 10 terms from the terms_used set, which holds every question word of six or more characters
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

comparison_terms

['mansions',
 'gundycoached',
 'hochdeutsch',
 'crimea',
 'neutering',
 'hrefhttpwwwjarchivecommedia20060303dj27jpg',
 'complex',
 'mather',
 'houses',
 'toledo']

In [175]:
#applying the frq function to our random list; each tuple is (high_count, low_count)
observed_expected = list()
for term in comparison_terms:
    observed_expected.append(frq(term))
    
observed_expected

[(1, 0),
 (0, 1),
 (0, 1),
 (1, 1),
 (0, 1),
 (0, 1),
 (0, 13),
 (0, 1),
 (5, 19),
 (0, 3)]
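
The `frq` function scans the whole dataset once per term, which becomes slow when many terms are compared. A possible alternative, sketched here under the assumption that the questions are available as (clean_question, high_value) pairs, counts all terms in a single pass. The helper name `count_terms` and the toy rows are hypothetical.

```python
from collections import Counter

def count_terms(rows, terms):
    # rows: iterable of (clean_question, high_value) pairs
    # returns {term: (high_count, low_count)}
    terms = set(terms)
    high, low = Counter(), Counter()
    for clean_question, high_value in rows:
        # a set ensures each question is counted at most once per term
        present = terms & set(clean_question.split(' '))
        (high if high_value == 1 else low).update(present)
    return {t: (high[t], low[t]) for t in terms}

toy_rows = [
    ("these houses line the crimea coast", 1),
    ("houses in toledo", 0),
    ("toledo is known for swords", 0),
]
counts = count_terms(toy_rows, ["houses", "toledo", "crimea"])
print(counts)
```

With the notebook's DataFrame, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])` and the terms as `comparison_terms`.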

In [177]:
#number of high value questions
high_value_count = jeopardy['high_value'].value_counts()[1]
high_value_count

5734

In [178]:
#number of low value questions
low_value_count = jeopardy['high_value'].value_counts()[0]
low_value_count

14265

To use the chi-squared test, we first determine the expected frequency of each selected word in low-value and high-value questions. For example, the word "complex" appears in 0 high-value questions and in 13 low-value questions, so it appears in 13 (0 + 13) questions overall.

A word that appears in 13 of our 19,999 questions has an overall proportion of 13/19999. If the word were unrelated to question value, its occurrences would be split in proportion to the number of high-value and low-value questions, so we multiply this proportion by the number of high-value questions (5,734) and by the number of low-value questions (14,265) to get the expected counts.

Then we compare the expected counts with the observed counts and compute the chi-squared statistic for the word. If its p-value is below 5% (our threshold), the word is significantly more frequent in either high-value or low-value questions.
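
The arithmetic described above can be traced by hand for the word "complex". The sketch below uses only the standard library: for two categories (one degree of freedom) the chi-squared p-value has the closed form erfc(sqrt(x / 2)), which is used here in place of `scipy.stats.chisquare`. The group sizes are the ones reported by this notebook.

```python
import math

high_value_count = 5734   # high-value questions, from the notebook output
low_value_count = 14265   # low-value questions, from the notebook output
observed_high, observed_low = 0, 13   # observed counts for "complex"

total_rows = high_value_count + low_value_count
total_prop = (observed_high + observed_low) / total_rows
expected_high = total_prop * high_value_count
expected_low = total_prop * low_value_count

# chi-squared statistic: sum of (observed - expected)^2 / expected
chi_squared_value = ((observed_high - expected_high) ** 2 / expected_high
                     + (observed_low - expected_low) ** 2 / expected_low)

# with two categories there is one degree of freedom, for which the
# p-value equals erfc(sqrt(chi_squared / 2))
p_value = math.erfc(math.sqrt(chi_squared_value / 2))

print(round(expected_high, 2), round(expected_low, 2))
print(round(chi_squared_value, 4), round(p_value, 4))
```

The expected counts come out at roughly 3.73 and 9.27. An expected count below 5 is a common warning sign that the chi-squared approximation may not be reliable for that word.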

In [185]:
# calculating the chi-square 
from scipy.stats import chisquare
chi_squared = list()
for observation in observed_expected:
    total = sum(observation)
    total_prop = total / jeopardy.shape[0]
    expected_high_value = total_prop * high_value_count
    expected_low_value = total_prop * low_value_count
    
    observed = np.array([observation[0],observation[1]])
    expected = np.array([expected_high_value,expected_low_value])
    
    chi_squared_value , p_value = chisquare(observed,expected)
    chi_squared.append([chi_squared_value,p_value])
    
chi_squared

[[2.487792117195675, 0.11473257634454047],
 [0.401962846126884, 0.5260772985705469],
 [0.401962846126884, 0.5260772985705469],
 [0.4448774816612795, 0.5047776487545996],
 [0.401962846126884, 0.5260772985705469],
 [0.401962846126884, 0.5260772985705469],
 [5.225516999649492, 0.022257828882083316],
 [0.401962846126884, 0.5260772985705469],
 [0.7209745992373746, 0.3958244089185019],
 [1.205888538380652, 0.27214791766902047]]

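Nine of our ten randomly selected words have a p-value above 0.05, so there is no significant difference between their occurrence in low-value and high-value questions. The exception is "complex" (0 high-value and 13 low-value occurrences), with a p-value of about 0.022. That result should be treated with caution: its expected high-value count is only about 3.7, and the chi-squared approximation is generally considered unreliable when expected counts fall below 5. The same caveat applies even more strongly to the other words, most of which appear only once.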

## Conclusion
* While there is no guaranteed strategy for winning Jeopardy, it might be worthwhile to study past questions while preparing, since most of the longer words in a question have appeared in earlier questions.
* In our small random sample of ten terms, we found no reliable relationship between a term and high-value questions, so this analysis does not point to any keyword to look out for when preparing for high-value questions.

