# Basic Set Up
Read the dataset into a Dataframe called jeopardy using Pandas.

Print out the first 5 rows of jeopardy.

Print out the columns of jeopardy using jeopardy.columns.

Some of the column names have spaces in front.

Remove the spaces in each item in jeopardy.columns.

Assign the result back to jeopardy.columns to fix the column names in jeopardy.

In [2]:
import pandas as pd
import string
from scipy.stats import chisquare

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy = jeopardy.rename(columns=lambda x: x.strip())

#print (jeopardy.head(5))
jeopardy.columns



Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

# Normalising Text
Write a function to normalize questions and answers. It should:

Take in a string.

    Convert the string to lowercase.
    
    Remove all punctuation in the string.
    
    Return the string.
    
Normalize the Question column.

    Use the Pandas apply method to apply the function to each item in the Question column.
    
    Assign the result to the clean_question column.
    
Normalize the Answer column.

    Use the Pandas apply method to apply the function to each item in the Answer column.
    
    Assign the result to the clean_answer column.

In [3]:
def normie(text):
    # convert string to lowercase
    text = text.lower()
    # remove all punctuation
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

jeopardy['clean_question'] = jeopardy['Question'].apply(normie)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normie)


# Normalising Non-text

Write a function to normalize dollar values. It should:

    Take in a string.
    
    Remove any punctuation in the string.
    
    Convert the string to an integer.
    
    If the conversion has an error, assign 0 instead.
    
    Return the integer.
    
Normalize the Value column.

    Use the Pandas apply method to apply the function to each item in the Value column.
    
    Assign the result to the clean_value column.
    
Use the pandas.to_datetime function to convert the Air Date column to a datetime column.

In [4]:
def normint(val):
    # remove punctuation
    for punctuation in string.punctuation:
        val = val.replace(punctuation, '')
    # convert to integer and if error, assign zero:
    try: 
        val = int(val)
    except: 
        val = 0
    if val == 'none':
        val = 0
    return val

jeopardy['clean_value'] = jeopardy['Value'].apply(normint)

# convert date to date time format
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])



# Answers in Questions

Find out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

    How often the answer is deducible from the question.
    How often new questions are repeats of older questions.

Write a function that takes in a row in jeopardy, as a Series.

It should:

    Split the clean_answer column on the space character (), and assign to the variable split_answer.

    Split the clean_question column on the space character (), and assign to the variable split_question.
    
    Create a variable called match_count, and set it to 0.

    If the is in split_answer, remove it using the remove method on lists. The is commonly found in answers and questions, but doesn't have any meaningful use in finding the answer.
    
    If the length of split_answer is 0, return 0. This prevents a division by zero error later.
    
    Loop through each item in split_answer, and see if it occurs in split_question. If it does, add 1 to match_count.
    
    Divide match_count by the length of split_answer, and return the result.

Count how many times terms in clean_answer occur in clean_question.

    Use the Pandas apply method on Dataframes to apply the function to each row in jeopardy.
    
    Pass the axis=1 argument to apply the function across each row.
    
    Assign the result to the answer_in_question column.
    
Find the mean of the answer_in_question column using the mean method on Series.

Write up a markdown cell with a short explanation of how finding this mean might influence your studying strategy for Jeopardy.

In [5]:
def deducible(row):
    split_answer = row['clean_answer'].split()
    split_question= row['clean_question'].split()
    match_count = 0
    for word in split_answer:
        if word == "the":
            split_answer.remove(word)
    if len(split_answer) == 0:
        return 0
    else:
        for item in split_answer:
            if item in split_question:
                match_count += 1
        return (match_count/len(split_answer))
answer_in_question = jeopardy.apply(deducible, axis = 1)        


meancounts = answer_in_question.mean()
meancounts

0.058206961574629956

The mean shows that there is the answer only coincides with the question only about once in 20 times. Hence, it's not entirely useful to use this "deducible" method.

# Recycle Questions

This investigates how often new questions are repeats of older ones.

Sort jeopardy in order of ascending air date.

Maintain a set called terms_used that will be empty initially.

Iterate through each row of jeopardy (using iterrows)

Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.

If it does, increment a counter.

Add each word to terms_used.

In [6]:
jeopardy = jeopardy.sort_values(by='Air Date',ascending=True)

terms_used = []
question_overlap =[]

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    for word in split_question:
        if len(word) < 6:
            split_question.remove(word)
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.append(word)
    if len(split_question) > 0:
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)


In [7]:
jeopardy['question_overlap'] = question_overlap
meanques = jeopardy['question_overlap'].mean()
print (meanques)

0.798436884793


Figure out whcih terms correspond to high-value questions using chi-square test

# Low vs High value Questions

Find out which questions to study such that it pertain to high value questions instead of low value questions. In order to get this, find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

Create a function that takes in a row from a Dataframe, and:

    If the clean_value column is greater than 800, assign 1 to value.

    Otherwise, assign 0 to value.

    Return value.
    
Use the Pandas apply method on Dataframes to apply the function to each row in jeopardy.

Pass the axis=1 argument to apply the function across each row.

Assign the result to the high_value column.

In [8]:
jeopardy.dtypes


Show Number                  int64
Air Date            datetime64[ns]
Round                       object
Category                    object
Value                       object
Question                    object
Answer                      object
clean_question              object
clean_answer                object
clean_value                  int64
question_overlap           float64
dtype: object

In [9]:
def highlow(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(highlow, axis = 1)

Create a function that takes in a word, and:

    Assigns 0 to low_count.
    Assigns 0 to high_count.
    Loops through each row in jeopardy using the iterrows method.
    Split the clean_question column on the space character ().
    If the word is in the split question:
        If the high_value column is 1, add 1 to high_count.
        Else, add 1 to low_count.
    
    Returns high_count and low_count. You can return multiple values by separating them with a comma.

In [10]:
def highlowcounts(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        splitwords = row['clean_question'].split()
        if word in splitwords:
            if row['high_value'] == 1:
                high_count += 1
            else: 
                low_count += 1
    return high_count, low_count



Create an empty list called observed_expected.

Convert terms_used into a list using the list function, and assign the first 5 elements to comparison_terms.

Loop through each term in comparison_terms, and:

    Run the function on the term to get the high value and low value counts.
    Append the result of running the function (which will be a list) to observed_expected.

In [11]:
observed_expected = []

terms_usedlist = list(terms_used)
comparison_terms = terms_usedlist[0:5]

for term in comparison_terms:
    result = highlowcounts(term)
    observed_expected.append(result)

In [12]:
print (observed_expected)

[(0, 3), (68, 181), (781, 1754), (1209, 2962), (2324, 5491)]


# Applying the chi-squared test

With the observed counts for a few terms, compute the expected counts and the chi-squared value.

Find the number of rows in jeopardy where high_value is 1, and assign to high_value_count.

Find the number of rows in jeopardy where high_value is 0, and assign to low_value_count.

Create an empty list called chi_squared.


In [13]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []


Loop through each list in observed_expected.

    Add up both items in the list (high and low counts) to get the total count, and assign to total.
    Divide total by the number of rows in jeopardy to get the proportion across the dataset. Assign to total_prop.
    Multiply total_prop by high_value_count to get the expected term count for high value rows.
    Multiply total_prop by low_value_count to get the expected term count for low value rows.
    Use the scipy.stats.chisquare function to compute the chi-squared value and p-value given the expected and observed counts.
    Append the results to chi_squared.

In [18]:
for elem in observed_expected:
    total = sum(elem)
    total_prop = total/ (jeopardy.shape[0])
    expectedhigh = total_prop * high_value_count
    expectedlow = total_prop * low_value_count
    expectedvals = [expectedhigh, expectedlow]
    chisqvalue, pvalue = chisquare(elem, expectedvals)
    chi_squared.append([chisqvalue, pvalue])

print (chi_squared)

[[1.2058885383806519, 0.27214791766902047], [0.22592591114717697, 0.63456129826261032], [5.6620493537229537, 0.017335855359660587], [0.20162796200547214, 0.65340999849142001], [4.3444466443049095, 0.037129831143466852], [1.2058885383806519, 0.27214791766902047], [0.22592591114717697, 0.63456129826261032], [5.6620493537229537, 0.017335855359660587], [0.20162796200547214, 0.65340999849142001], [4.3444466443049095, 0.037129831143466852]]
