# An Important Note
I completed this project while studying in the Dataquest online Data Science Bootcamp. It belongs to the "Hypothesis Testing: Fundamentals" part of the bootcamp.

# Winning Jeopardy
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

For this project, I assume that I want to compete on Jeopardy and am looking for any edge I can get to win. I'll work with a dataset of Jeopardy questions to find patterns in the questions that could help me win.

# Reading in and Exploring the Data

In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")

In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

There are four problems in the dataset that should be fixed before analyzing the data:
1. The column names contain leading whitespace and are not in snake case.
2. The "Value" column has a "$" sign in front of the numerical values, which is why it is stored as "object".
3. The "Question" and "Answer" columns contain special characters that could cause problems during the analysis.
4. The "Air Date" column is stored as "object" instead of a datetime type.

## Correcting the Column Names

In [5]:
columns = ['show_number', 'air_date', 'round', 'category', 'value',
       'question', 'answer']
jeopardy.columns = columns

In [6]:
jeopardy.columns

Index(['show_number', 'air_date', 'round', 'category', 'value', 'question',
       'answer'],
      dtype='object')
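
The same names can also be derived from the existing headers instead of being typed by hand. The following is a minimal sketch, assuming the raw headers look like the ones printed earlier; `to_snake_case` is a hypothetical helper written for illustration, not a pandas function.

```python
def to_snake_case(name):
    # Hypothetical helper: trim surrounding whitespace, lowercase,
    # and replace inner spaces with underscores.
    return name.strip().lower().replace(" ", "_")

# Raw headers as printed by jeopardy.columns before the renaming.
raw_columns = ["Show Number", " Air Date", " Round", " Category",
               " Value", " Question", " Answer"]

clean_columns = [to_snake_case(name) for name in raw_columns]
print(clean_columns)
```

With the DataFrame loaded, the equivalent assignment would be `jeopardy.columns = [to_snake_case(c) for c in jeopardy.columns]`.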

## Correcting the "value" Column
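The function below removes "$" (and any other punctuation) from the "value" column and converts the result to an integer. Entries that are not numeric after cleaning are set to 0.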

In [7]:
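import re

def normalizing_value(string):
    # Remove every character that is not a letter, digit, or whitespace (e.g. "$" and ",").
    # A raw string keeps "\s" from being treated as an invalid string escape.
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    try:
        string = int(string)
    except ValueError:
        # Values that are not numeric after cleaning are set to 0.
        string = 0
    return string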
    

In [8]:
jeopardy['clean_value'] = jeopardy['value'].apply(normalizing_value)
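
To see what the cleaning does, it can help to run the function on a few representative strings. This is only a sketch: the sample inputs below are assumptions about what the "value" column may contain, not values taken from the output above.

```python
import re

def normalizing_value(string):
    # Drop everything except letters, digits, and whitespace.
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    try:
        return int(string)
    except ValueError:
        # Non-numeric entries fall back to 0.
        return 0

# Assumed sample inputs: a plain amount, an amount with a comma, a non-numeric entry.
samples = ["$200", "$2,000", "None"]
results = [normalizing_value(s) for s in samples]
print(results)  # [200, 2000, 0]
```

Note that a non-numeric entry becomes 0, so it ends up on the low-value side of any later comparison based on this column.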

## Correcting the "question" and "answer" Columns
The function below lowercases the text and removes special characters from the "question" and "answer" columns.

In [9]:
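def normalizing_string(string):
    # Lowercase first so that comparisons between words are case-insensitive.
    string = string.lower()
    # Keep only letters, digits, and whitespace (raw string avoids an invalid "\s" escape).
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    return string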
    

In [10]:
jeopardy['clean_question'] = jeopardy['question'].apply(normalizing_string)

In [11]:
jeopardy['clean_answer'] = jeopardy['answer'].apply(normalizing_string)

## Changing the Data Type of the "air_date" Column

In [12]:
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])

In [13]:
jeopardy.dtypes

show_number                int64
air_date          datetime64[ns]
round                     object
category                  object
value                     object
question                  object
answer                    object
clean_value                int64
clean_question            object
clean_answer              object
dtype: object

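As the output shows, the problems listed earlier have been addressed: the column names are in snake case, "clean_value" is stored as an integer, and "air_date" is stored as a datetime.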

# Finding How Often the Answer Appears in the Question

In quiz shows like Jeopardy, some questions contain their own answers. If this happens frequently, such questions could help me win.

To check, I'll use the function below to calculate, for each row, the share of answer words that also appear in the question, and then average that share over all rows.

In [14]:
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

jeopardy["answer_in_question"].mean()*100

6.049325706933587

On average, only about 6% of the words in an answer also appear in its question. This isn't a large share, so I probably can't count on the question itself to reveal the answer. I'll probably have to study.
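
A small worked example may make the metric more concrete. The sketch below repeats the matching logic and applies it to two made-up rows; the question and answer strings are assumptions chosen for illustration, and plain dictionaries stand in for DataFrame rows.

```python
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    # "the" is too common to count as a meaningful match.
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Hypothetical rows, already lowercased and stripped of punctuation.
partial = {"clean_question": "this york house gave its name to a new world city",
           "clean_answer": "the new york times"}
no_match = {"clean_question": "the city of yuma is in this state",
            "clean_answer": "arizona"}

partial_score = count_matches(partial)    # "new" and "york" match, "times" does not
no_match_score = count_matches(no_match)  # "arizona" is not in the question
print(partial_score, no_match_score)
```

The first row scores two out of three even though the question does not actually give the answer away, which is one reason to read the average as a rough upper bound rather than as the share of self-answering questions.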

# Finding the Percentage of Question Overlapping
Some questions on shows like Jeopardy may be repeated. To check, I'll measure how much the terms in each question overlap with the terms used in earlier questions.

In [15]:
question_overlap = []
terms_used = set()
jeopardy = jeopardy.sort_values('air_date')
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()*100

68.76260592169801

On average, about 69% of the longer terms (six or more characters) in a question had already appeared in an earlier question. However, this measure looks only at single terms rather than phrases, and it covers only a small set of questions, so on its own it says little about whether whole questions are recycled. Still, it suggests that question recycling is worth a closer look.
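
The overlap computation can be traced on a handful of questions. The following sketch wraps the same logic in a hypothetical function, `overlap_scores`, and runs it on three made-up questions that are assumed to be already cleaned and sorted by air date.

```python
def overlap_scores(questions, min_length=6):
    # Hypothetical helper mirroring the loop above: for each question, the share
    # of its long terms that already appeared in an earlier question.
    terms_used = set()
    scores = []
    for question in questions:
        words = [w for w in question.split(" ") if len(w) >= min_length]
        matches = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        scores.append(matches / len(words) if words else 0)
    return scores

# Made-up questions, in chronological order.
questions = [
    "this president signed the emancipation proclamation",
    "the emancipation proclamation was issued during this conflict",
    "this river is the longest in africa",
]
scores = overlap_scores(questions)
print(scores)  # [0.0, 0.4, 0.0]
```

The second question reuses two of its five long terms, so it scores 0.4 even though it asks something different from the first. This illustrates why term overlap alone cannot show that a question was recycled.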

# Looking for the Questions with High Values
Suppose I only want to study high-value questions rather than low-value ones, since that would help me earn more money on Jeopardy. To do so, I'll flag the questions whose value is above 800 USD and then compare how often a term appears in high-value versus low-value questions.

In [16]:
def checking_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

In [17]:
jeopardy["high_value"] = jeopardy.apply(checking_value, axis=1)

In [18]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Sets are unordered, so these five terms are an arbitrary sample and can differ between runs.
comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(0, 1), (1, 1), (1, 1), (1, 0), (0, 1)]

In [19]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

# Chi-Squared Results
None of the terms showed a significant difference in usage between high-value and low-value rows: the smallest p-value is about 0.11, well above the conventional 0.05 threshold. Additionally, each term appeared in at most two questions, and the chi-squared test is unreliable when the frequencies are below 5. It would be better to run this test only on terms that have higher frequencies.
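
One possible way to follow up is to keep only terms that occur in enough questions before testing them. The sketch below is an illustration under stated assumptions, not the method used above: `frequent_terms` and `chi_squared_stat` are hypothetical helpers, the questions are made up, and the counts in the last step (a term seen 50 times, with 30% of all questions being high value) are invented for the example.

```python
from collections import Counter

def frequent_terms(questions, min_count, min_length=6):
    # Count each term once per question, then keep terms seen in at least min_count questions.
    counts = Counter()
    for question in questions:
        counts.update({w for w in question.split(" ") if len(w) >= min_length})
    return sorted(term for term, n in counts.items() if n >= min_count)

def chi_squared_stat(observed, expected):
    # Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Made-up cleaned questions.
questions = [
    "this country borders france",
    "this country borders germany",
    "the capital of this country is madrid",
]
terms = frequent_terms(questions, min_count=2)
print(terms)  # ['borders', 'country']

# Invented counts: observed 30 high-value and 20 low-value uses,
# against expected counts of 15 and 35.
stat = chi_squared_stat([30, 20], [15, 35])
print(round(stat, 4))  # 21.4286
```

In the real analysis the threshold would need to be large enough that both expected counts reach at least 5, and `scipy.stats.chisquare` can still be used to obtain the p-value once the terms are selected.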