# Winning Jeopardy
Jeopardy is a popular TV show in the US where participants answer questions to win money. I am going to work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help to win.

The dataset is named jeopardy.csv and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

* **Show Number** -- the Jeopardy episode number of the show this question was in.
* **Air Date** -- the date the episode aired.
* **Round** -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* **Category** -- the category of the question.
* **Value** -- the number of dollars answering the question correctly is worth.
* **Question** -- the text of the question.
* **Answer** -- the text of the answer.
First I am going to read the dataset and explore.

Two main questions we will be trying to answer are: 1) How often do questions contain hints about the answer? 2) How often are questions repeated over time?

In [1]:
import pandas as pd
jeopardy = pd.read_csv("Jeopardy.csv")
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


# Cleaning and Preparing Data

* Printing out the columns of jeopardy using jeopardy.columns.
* Some of the column names have spaces in front.
* Removing the spaces in each item in jeopardy.columns.
* Assign the result back to jeopardy.columns to fix the column names in jeopardy.

In [2]:
#investigating columns names
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Many columns contains the blank spaces infront, these must be removed.

In [3]:
# Remove spaces from the column names
jeopardy.columns = jeopardy.columns.str.strip()


#We can also try this method
#jeopardy.columns = [col.strip() for col in jeopardy.columns]

In [4]:
#Assign the result back to jeopardy.columns to fix the column names in jeopardy.
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Before we can start doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the Question and Answer columns),the idea is to ensure that there are all lowercase words and no punctuation.

In [5]:
# Check data types and check for null values
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216930 non-null int64
Air Date       216930 non-null object
Round          216930 non-null object
Category       216930 non-null object
Value          216930 non-null object
Question       216930 non-null object
Answer         216928 non-null object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


In [6]:
#convert the date columns into the date-time format

jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

The data time column is successfully changed into the datetime format. The Air Date column should also be a datetime, not a string, to enable us to work with it more easily too.

In [7]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216930 non-null int64
Air Date       216930 non-null datetime64[ns]
Round          216930 non-null object
Category       216930 non-null object
Value          216930 non-null object
Question       216930 non-null object
Answer         216928 non-null object
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 11.6+ MB


In [8]:
jeopardy['Answer'] = jeopardy['Answer'].astype(str)

We took this step because answer columns were float type and was showing error after normalization


# Normalizing Text

Before we can start doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the Question and Answer columns). The idea is to ensure that you lowercase words and remove punctuation so Don't and don't aren't considered to be different words when you compare them.

In [9]:
import string
def normalize_text(text):
    text = text.lower()
    for character in string.punctuation:
        text = text.replace(character,'')
    return text 

It's Working effectively, Let's apply the normalize function to **Question**and **Answer columns** and save the result in **clean_question** and **clean_answer columns**.

In [10]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

In [11]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


Now that we have normalized the text columns, there are also some other columns to normalize.

The Value column should also be numeric, to allow us to manipulate it more easily. We need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.



In [12]:
#Writing a function to normalize Value column
def normalize_values(text):
    clean = ''
    for num in text:
        if num in '0123456789':
            clean += num
        if not clean:
            clean = '0'
            return int(clean)
        else:
            return int(clean)
        
#or we can try this 

#def normalize_values(text):
    #text = re.sub("[^A-Za-z0-9\s]", "", text)#"\s" is space
    #try:
        #text = int(text)
    #except Exception:
        #text = 0
    #return text

In [13]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)

In [14]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,0


# Answers in Question
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.



We can answer the second question by seeing how often complex words (> 6 characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second later.

In [15]:
def answer_deducible(data):
    split_answer = data['clean_answer'].split(' ')
    split_question = data['clean_question'].split(' ')
    match_count = 0
    #The is commonly found in answers and questions, 
    #but doesn't have any meaningful use in finding the answer.
    if 'the' in split_answer:
        split_answer.remove('the')
    #To prevent a division by zero error later.
    if len(split_answer) == 0:
        return 0
    
    for element in split_answer:
        if element in split_question:
            match_count += 1
    return match_count / len(split_answer)

counting how many times terms in clean_answer occur in clean_question

In [16]:
jeopardy['answer_in_question'] =  jeopardy.apply(answer_deducible, axis=1)


In [17]:
jeopardy['answer_in_question'].value_counts()


0.000000    189104
0.500000     15236
0.333333      5847
0.250000      1865
1.000000      1395
0.666667      1123
0.200000       847
0.400000       351
0.166667       307
0.142857       149
0.750000       123
0.285714       103
0.600000        99
0.125000        74
0.428571        41
0.111111        35
0.800000        30
0.375000        27
0.222222        26
0.571429        21
0.300000        15
0.100000        12
0.272727        10
0.181818         8
0.153846         7
0.714286         6
0.444444         6
0.857143         5
0.090909         5
0.833333         5
0.625000         5
0.555556         4
0.214286         4
0.230769         4
0.083333         4
0.545455         3
0.357143         2
0.875000         2
0.700000         2
0.105263         1
0.363636         1
0.266667         1
0.350000         1
0.071429         1
0.307692         1
0.454545         1
0.777778         1
0.636364         1
0.416667         1
0.818182         1
0.538462         1
0.133333         1
0.294118    

In [18]:
jeopardy['answer_in_question'].mean()

0.05932504431848426

From the above output we find that the mean is really too low, thus our first point to deduce an answer from the question seems negligible for our winning strategy.

Now let us investigate our second point mentioned above - how often new questions are repeats of older ones.

We can't completely answer this, because we only have about 10% of the full Jeopardy questions dataset, but we can investigate it at least.

# Recycled Questions
Finding the frequency of words greater than 6 characters to find the the repetition of old questions in Jeopardy



In [20]:
  #Removing any words in split_question that are less than 6 characters long.
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('Air Date')

for x, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        split_question = [q for q in split_question if len(q) > 5]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count += 1
            else:
                terms_used.add(word)
        if len(split_question) > 0:
            match_count = match_count / len(split_question)
        question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap

jeopardy['question_overlap'].mean()


0.8722047694957848

The above result is from only 10% of the full Jeopardy questions dataset. But no doubt the mean seems to be good and it does derives the fact that we need to prepare on old questions of Jeopardy should be one of the points in our winning strategy.

We can include in our winning strategy something like to focus on studying questions that pertain to high value questions instead of low value questions. This can help us earn more money when we are on Jeopardy.

So we can do this by actually figuring out which terms correspond to high-value questions using a chi-squared test. We need to first need to narrow down the questions into two categories:

* Low value -- Any row where Value is less than 800.
* High value -- Any row where Value is greater than 800.
And we can achieve this by using the set terms_used from our previous code and:

* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find expected counts.
* Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.



We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values.

Right now for this we will be just working on a small sample from our dataset `jeopardydata`.

# Low value vs High value Questions

Let's say you only want to study questions that pertain to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

* Low value -- Any row where Value is less than 800.
* High value -- Any row where Value is greater than 800.
You'll then be able to loop through each of the terms from the last screen, terms_used, and:

* Find the number of low value questions the word occurs in.
* Find the number of high value questions the word occurs in.
* Find the percentage of questions the word occurs in.
* Based on the percentage of questions the word occurs in, find expected counts.
* Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

You can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

Create a function that takes in a row from a Dataframe, and:

* If the clean_value column is greater than 800, assign 1 to value.
* Otherwise, assign 0 to value.
* Return value.
Determine which questions are high and low value.

Use the Pandas DataFrame.apply method to apply the function to each row in jeopardy.
Pass the axis=1 argument to apply the function across each row.
Assign the result to the high_value column.
Create a function that takes in a word, and:

Assigns 0 to low_count.
Assigns 0 to high_count.
* Loops through each row in jeopardy using the iterrows method.

* split
clean_question column on the space character ()
* If the word is in the split question:
* If the high_value column is 1, add 1 to high_count.
* Else, add 1 to low_count.
* Returns high_count and low_count. You can return multiple values by separating them with a comma.
* Randomly pick ten elements of terms_used and append them to a list called comparison_terms.

Create an empty list called observed_expected.
Loop through each term in comparison_terms, and:

Run the function on the term to get the high value and low value counts.


In [21]:
#Determining which questions are high and low value
def question_value(data):
    value = 0
    if data['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value
jeopardy['high_value'] = jeopardy.apply(question_value, axis=1)

In [22]:
jeopardy['high_value'].value_counts()

0    216930
Name: high_value, dtype: int64

In [29]:
#Calculating the observed count

In [28]:
def highlow_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardydata.iterrows():
        if word in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return [high_count, low_count]

observed_expected = []
comparison_terms = list(terms_used)[:5]

for term in comparison_terms:
    observed_expected.append(highlow_count(term))

observed_expected
def highlow_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return [high_count, low_count]

observed_expected = []
comparison_terms = list(terms_used)[:5]

for term in comparison_terms:
    observed_expected.append(highlow_count(term))

observed_expected

[[0, 1], [0, 1], [0, 5], [0, 1], [0, 1]]

Now that we have found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [32]:
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / len(jeopardy)
    expected_highvalue = total_prop * high_value_count
    expected_lowvalue = total_prop * low_value_count
    expected = [expected_highvalue, expected_lowvalue]
    chi_squared.append(chisquare(obs, expected))

chi_squared

[Power_divergenceResult(statistic=nan, pvalue=nan),
 Power_divergenceResult(statistic=nan, pvalue=nan),
 Power_divergenceResult(statistic=nan, pvalue=nan),
 Power_divergenceResult(statistic=nan, pvalue=nan),
 Power_divergenceResult(statistic=nan, pvalue=nan)]

# Chi-squared results

None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.