# Winning Jeopardy

#### Objectives:

The aim of this project is to interpret and analyze mostly categorical (nominal) data. The main goal is to apply a chi-square test at the end and assess the statistical significance of the result. The dataset was provided by dataquest.io; the project is the first in a series of SQL tutorials using Python.


##### Resources used:

**Anaconda distribution** - *Jupyter Notebook v 5.7.8*, *Python 3.7.3*



In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()



| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [4]:
# Normalize the question and answer strings: lowercase them and remove all punctuation.

import re

def norm(text):
    text = text.lower()
    # Raw string so that \s is passed to the regex engine unchanged.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

jeopardy["clean_questions"] = jeopardy["Question"].apply(norm)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(norm)

In [5]:
jeopardy["clean_questions"].head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_questions, dtype: object

In [6]:
jeopardy["clean_answer"].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
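
As a quick sanity check on the normalization step, the sketch below applies the same lowercase-and-strip-punctuation rule to two strings outside the DataFrame. The helper name `normalize_text` and the sample strings are illustrative only and are not part of the notebook.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop everything except letters, digits and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

# Illustrative inputs, loosely based on the rows shown above.
samples = ["McDonald's", "No. 2: 1912 Olympian; football star"]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)
```

Because the apostrophe is simply deleted, "McDonald's" collapses to a single token, which matches the `mcdonalds` value in the output above.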

In [7]:
# Normalize the numeric columns: strip punctuation from Value and parse Air Date as a datetime.

def norm_val(text):
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        value = int(text)
    except ValueError:
        # Entries that are not numeric are treated as 0.
        value = 0
    return value

jeopardy["clean_value"] = jeopardy["Value"].apply(norm_val)
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
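
To see how the value parsing behaves on individual strings, here is a minimal standalone sketch. The name `parse_value` is hypothetical, and the inputs `"$1,000"` and `"None"` are assumptions about what the `Value` column may contain; only `"$200"` appears in the rows shown above.

```python
import re

def parse_value(raw):
    """Strip punctuation and convert to int; fall back to 0 for non-numeric text."""
    digits = re.sub(r"[^A-Za-z0-9\s]", "", raw)
    try:
        return int(digits)
    except ValueError:
        return 0

# "$200" comes from the data above; the other two inputs are assumed examples.
parsed = [parse_value(v) for v in ["$200", "$1,000", "None"]]
print(parsed)
```

Note that mapping unparseable entries to 0 means they will later be counted as low-value questions.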

In [15]:
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_questions | clean_answer | clean_value | answer_in_question | question_overlap | high_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 | 0.0 | 0.0 | 0 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 | 0.0 | 0.0 | 0 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 | 0.0 | 0.0 | 0 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 | 0.0 | 0.0 | 0 |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 | 0.0 | 0.0 | 0 |


One theory we can test is that the answer often already appears in the question. To check this, we apply a few more string operations:

In [16]:
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_questions"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
print(jeopardy["answer_in_question"].mean())

0.060493257069335914


Unfortunately, the theory does not hold: on average, only about 6% of the words in an answer also appear in its question.
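
To make the metric concrete, the sketch below computes the same fraction for three hand-written question/answer pairs. The function name `answer_overlap` and the pairs themselves are made up for illustration and are not rows from the dataset.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring one 'the') that also occur in the question."""
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Made-up pairs covering a full match, no match and a partial match.
full = answer_overlap("this state is home to the city of new york", "new york")
none = answer_overlap("the city of yuma in this state has a record", "arizona")
half = answer_overlap("this adams was the second president", "john adams")
print(full, none, half)
```

A score of 1.0 means every answer word is in the question, so a dataset-wide mean near 0.06 indicates that such giveaways are rare.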

In [10]:
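# Note: sort_values returns a new DataFrame and the result is not assigned here,
# so the loop below still processes the rows in their original order.
jeopardy.sort_values(by=["Air Date"])
question_overlap = []
terms_used = []
for i, row in jeopardy.iterrows():
    split_question = row["clean_questions"].split(" ")
    # Keep only words longer than 5 characters to skip short, common words.
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    # Count the words that already appeared in an earlier question.
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.append(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

print(jeopardy["question_overlap"].mean())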


0.6908737315671878
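
The overlap figure above is the mean fraction of longer words in each question that were already seen in an earlier question. The following sketch shows the same idea on three toy questions, using a `set` for the seen terms so that membership checks stay fast. This is an alternative formulation under stated assumptions, not the notebook's code: the function name, the `min_length` parameter and the toy questions are all hypothetical.

```python
def overlap_with_seen_terms(clean_questions, min_length=6):
    """For each question, return the share of long words seen in earlier questions."""
    seen_terms = set()
    overlaps = []
    for question in clean_questions:
        # min_length=6 mirrors the "len(q) > 5" filter used in the notebook.
        words = [w for w in question.split(" ") if len(w) >= min_length]
        matches = sum(1 for w in words if w in seen_terms)
        overlaps.append(matches / len(words) if words else 0)
        # Add this question's words only after counting its matches.
        seen_terms.update(words)
    return overlaps, seen_terms

toy_questions = [
    "galileo studied the planets",
    "the planets orbit the sun",
    "galileo built a telescope",
]
overlaps, seen = overlap_with_seen_terms(toy_questions)
print(overlaps)
```

One caveat of this design: a set has no stable ordering, so taking "the first five terms" from it would not give a reproducible sample the way slicing a list does.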


In [11]:
# Flag questions worth more than 800 as high value (1) and all others as low value (0).
def clv(row):
    if row["clean_value"] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy["high_value"] = jeopardy.apply(clv, axis=1)

In [12]:
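def count_usage(term):
    # Count how many high-value and low-value questions contain the term.
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_questions"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Use the first five collected terms as a small sample.
comparison_terms = list(terms_used)[:5]
# Each entry is a (high_count, low_count) pair of observed counts.
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected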

[(1, 6), (3, 2), (0, 1), (11, 14), (0, 2)]

In [13]:
len(observed_expected)

5

In [14]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    # Expected counts: the term's total usage split in proportion to the
    # number of high-value and low-value questions in the whole dataset.
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.7083506539662141, pvalue=0.3999918991363616),
 Power_divergenceResult(statistic=2.3995960878537224, pvalue=0.12136658322360773),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.8723025608618364, pvalue=0.09011585768849395),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571)]
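
All five p-values above are larger than 0.05, so none of these terms shows a statistically significant difference in usage between high-value and low-value questions. The observed counts are also very small, and a common rule of thumb asks for expected counts of at least 5 before trusting a chi-square test, so these results should be read with caution. To show what the test computes, the sketch below reproduces the calculation for a single term using only the standard library. The function name and the totals in the example are hypothetical, not values from the dataset.

```python
import math

def chi_square_high_low(observed_high, observed_low, high_total, low_total):
    """Chi-square statistic and p-value for one term, with one degree of freedom."""
    n_rows = high_total + low_total
    term_total = observed_high + observed_low
    # Expected counts if the term were spread across high- and low-value
    # questions in proportion to how many of each exist.
    expected_high = term_total * high_total / n_rows
    expected_low = term_total * low_total / n_rows
    statistic = (
        (observed_high - expected_high) ** 2 / expected_high
        + (observed_low - expected_low) ** 2 / expected_low
    )
    # With two categories there is one degree of freedom, for which the
    # chi-square survival function reduces to erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical totals (2 high-value and 8 low-value questions) and a term
# that was seen once, in a low-value question.
statistic, p_value = chi_square_high_low(0, 1, high_total=2, low_total=8)
print(statistic, p_value)
```

The closed-form p-value only holds for the two-category case used here; with more categories, a library routine such as `scipy.stats.chisquare` is the safer choice.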