# Data Description
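Jeopardy is a popular American TV quiz show in which contestants answer trivia questions to win prize money. Suppose you want to compete on the show and are looking for an edge. This small project works with the [Jeopardy dataset](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/) to look for patterns in the questions that could help you win.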

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


## Remove leading spaces from column names

In [3]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
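
Hard-coding the names works for this file. As a more general sketch, assuming the only problem is leading or trailing whitespace, the same cleanup can be done with `str.strip` on the column index. The small `demo` frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame that mimics the padded headers shown above.
demo = pd.DataFrame({
    "Show Number": [4680],
    " Air Date": ["2004-12-31"],
    " Round": ["Jeopardy!"],
})

# Index.str.strip() removes leading and trailing whitespace from every label.
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```

This avoids retyping every name and keeps working if columns are added or reordered.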

## Normalize the string columns
Normalize the contents of the `Question` and `Answer` columns: convert each string to lowercase and remove punctuation.

In [4]:
import re

def normalize_text(text):
    # Lowercase first, then delete every character that is not a letter, digit, or whitespace.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
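
To see what this normalization does to individual strings, here is a small standalone sketch. It redefines the function so the snippet runs on its own; the third sample string is made up to show a side effect: punctuation is deleted rather than replaced, so a symbol surrounded by spaces leaves two consecutive spaces behind.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop everything except letters, digits, and whitespace."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text.lower())

apostrophe = normalize_text("McDonald's")
punctuated = normalize_text("No. 2: 1912 Olympian;")
ampersand = normalize_text("rock & roll")  # made-up sample

print(repr(apostrophe))
print(repr(punctuated))
print(repr(ampersand))
# Splitting on a single space then yields an empty token between the two words.
print(ampersand.split(" "))
```

Because later steps split on `" "`, such empty tokens end up in the word lists. Calling `split()` with no argument would skip them, at the cost of slightly different counts.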

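## Normalize the numeric columns
Remove the "$" sign from the `Value` column and convert the result to a numeric type. Convert the strings in the `Air Date` column to datetime.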

In [6]:
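def normalize_values(text):
    # Strip "$", "," and other punctuation before converting to an integer.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Anything that is not a pure number is treated as 0.
        text = 0
    return text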
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
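
The following standalone sketch shows how the value cleanup behaves on a few inputs. The sample strings are made up for illustration, including one with no digits at all.

```python
import re
import pandas as pd

def normalize_values(text):
    """Strip punctuation such as "$" and "," and return an int, or 0 if that fails."""
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

# Made-up sample values; the last one contains no number.
samples = ["$200", "$2,000", "None"]
cleaned = [normalize_values(s) for s in samples]
print(cleaned)

# pd.to_datetime parses an ISO-formatted date string into a Timestamp.
air_date = pd.to_datetime("2004-12-31")
print(air_date.year, air_date.month, air_date.day)
```

Note the design choice: entries that cannot be parsed silently become 0, so any later split by value will place them in the low-value group.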

In [7]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [8]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

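Sometimes the answer can be inferred directly from the question, in which case words from the answer may already appear in the question. The code below computes, for each row, the fraction of words in `clean_answer` that also occur in `clean_question`.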

In [9]:
def word_matches(row):
    match_count = 0
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")  # drop the uninformative word "the" (first occurrence only)
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    # Fraction of the answer's words that also occur in the question.
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(word_matches, axis=1)
print(jeopardy["answer_in_question"].mean())

0.06049325706933587


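On average, only about 6% of the words in an answer also appear in its question. That share is small, so we cannot count on inferring the answer from the question itself and will need to prepare more seriously.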
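
As a worked example of the per-row score, the sketch below applies the same logic to a single made-up row (a plain dict stands in for a DataFrame row).

```python
def word_matches(row):
    """Return the fraction of answer words that also appear in the question."""
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Hypothetical row, not taken from the dataset.
row = {
    "clean_question": "this state is home to the city of yuma",
    "clean_answer": "the state of arizona",
}
score = word_matches(row)
print(score)
```

After "the" is removed, the answer words are "state", "of", and "arizona"; two of the three occur in the question, so the score is 2/3.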

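Second, we might ask whether new questions resemble ones that were asked before. To test this idea, we sort `jeopardy` by `Air Date` in ascending order, then iterate over the rows, split `clean_question` into words, and count how many words of 6 or more characters have already appeared (the length cutoff filters out filler words such as "the" and "than"). If the share of repeated words is high overall, new questions plausibly reuse or resemble old ones. Note that the cell below does not perform the sort itself; run `jeopardy = jeopardy.sort_values("Air Date")` beforehand if the rows are not already in chronological order.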

In [14]:
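question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only words with at least 6 characters to skip filler such as "the" or "than".
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    # Count the words that already appeared in an earlier row.
    for word in split_question:
        if word in terms_used:
            match_count += 1
    # Record this question's words so later rows can match against them.
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)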

jeopardy["question_overlap"] = question_overlap
print(jeopardy["question_overlap"].mean())

0.6908737315671962


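The result shows that, on average, about 69% of the qualifying words in a question had already appeared in an earlier one. This only measures single words, not phrases, but it does suggest that studying previously asked questions is worth a closer look.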
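
To make the overlap score concrete, here is a minimal sketch of the same idea as a reusable function, run on three made-up questions. The name `overlap_scores` and the `min_len` parameter are introduced here for illustration only.

```python
def overlap_scores(questions, min_len=6):
    """For each question, return the share of long words already seen in earlier questions."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split(" ") if len(w) >= min_len]
        matches = sum(1 for w in words if w in seen)
        seen.update(words)  # later questions can now match these words
        scores.append(matches / len(words) if words else 0)
    return scores

# Hypothetical questions, listed in the order they would have aired.
questions = [
    "this planet is closest to the sun",
    "the largest planet in our system",
    "closest planet to the largest desert",
]
scores = overlap_scores(questions)
print(scores)
```

The first question scores 0 because nothing has been seen yet, and the scores rise as the vocabulary accumulates. This is why the order of the rows matters, and also why a high average is expected in any large collection of questions.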

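Suppose our time is limited and, to win as much money as possible, we only want to study the previously used terms that tend to show up in high-value questions (judged by `clean_value`). A chi-squared test can help here. First, we split the questions into high value (`clean_value` > 800) and low value. Then, for each term in `terms_used`, we count how many high-value and how many low-value questions contain it, and compute the chi-squared statistic from these observed counts and the corresponding expected counts.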
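
Written out for a single term, the calculation looks as follows. The symbols are introduced here for illustration: $N$ is the total number of questions, $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high- and low-value questions, $n_t$ is the number of questions containing the term, and $O$ and $E$ are the observed and expected counts.

$$
E_{\text{high}} = \frac{n_t}{N}\, N_{\text{high}}, \qquad
E_{\text{low}} = \frac{n_t}{N}\, N_{\text{low}}, \qquad
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}} + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
$$

The expected counts describe a term that is spread across high- and low-value questions in the same proportion as the questions themselves.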

In [15]:
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0,0.0,0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0,0.0,0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0,0.0,0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0,0.0,0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0,0.0,0
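
The same flag can also be computed without a row-wise `apply`. The sketch below, on a made-up `demo` frame, compares the whole column at once and converts the resulting booleans to 0/1.

```python
import pandas as pd

# Hypothetical values; 800 itself is not "high" because the comparison is strict.
demo = pd.DataFrame({"clean_value": [200, 800, 1000, 0]})

# Vectorized comparison yields a boolean Series; astype(int) maps True/False to 1/0.
demo["high_value"] = (demo["clean_value"] > 800).astype(int)
print(demo["high_value"].tolist())
```

On a large frame this is typically much faster than calling a Python function once per row.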


In [17]:
# terms_used is not a column of the DataFrame, so pd.crosstab cannot produce these counts directly.
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        if term in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return low_count, high_count

observed_expected = []
comparison_terms = list(terms_used)[:5]  # only check five terms; set order is arbitrary

for t in comparison_terms:
    observed_expected.append(count_usage(t))
print(observed_expected)

[(2, 0), (1, 0), (0, 2), (2, 0), (0, 1)]
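
Scanning the whole frame once per term gets slow when many terms are checked. One possible alternative, sketched here on a made-up `demo` frame, is to tally every term in a single pass with `collections.Counter` and then look counts up. The name `count_usage_fast` is hypothetical.

```python
from collections import Counter
import pandas as pd

# Hypothetical data standing in for the clean_question and high_value columns.
demo = pd.DataFrame({
    "clean_question": ["famous planet", "planet system", "famous desert"],
    "high_value": [1, 0, 0],
})

low_counts, high_counts = Counter(), Counter()
for question, high in zip(demo["clean_question"], demo["high_value"]):
    target = high_counts if high == 1 else low_counts
    # set() counts a term once per question, like `term in split_question`.
    target.update(set(question.split(" ")))

def count_usage_fast(term):
    """Return (low_count, high_count) for a term; unseen terms give (0, 0)."""
    return low_counts[term], high_counts[term]

print(count_usage_fast("planet"), count_usage_fast("desert"))
```

The tuple order matches the row-by-row version: low count first, high count second.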


In [18]:
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for low_obs, high_obs in observed_expected:  # count_usage returns (low, high)
    total = low_obs + high_obs
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    # Observed and expected must list the categories in the same order: (low, high).
    observed = np.array([low_obs, high_obs])
    expected = np.array([low_value_exp, high_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

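Judging from these results, none of the five terms shows a reliable difference in usage between high- and low-value questions. Even where a p-value falls below 0.05, every observed count is below 5, and the chi-squared test is only dependable when the frequencies are reasonably large.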
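
The effect of sample size can be illustrated with a small standalone sketch. It computes the statistic by hand and, for the two-category case (one degree of freedom), obtains the p-value from the identity p = erfc(sqrt(statistic / 2)). All counts below are made up, and the helper name `chi_squared_1df` is hypothetical.

```python
import math

def chi_squared_1df(observed, expected):
    """Chi-squared statistic and p-value for two categories (one degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the upper-tail probability equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Assume 25% of all questions are high value; counts are ordered (low, high).
small_stat, small_p = chi_squared_1df([2, 0], [1.5, 0.5])
large_stat, large_p = chi_squared_1df([200, 0], [150, 50])

print(small_stat, small_p)
print(large_stat, large_p)
```

Both terms appear only in low-value questions, yet only the frequent one produces a small p-value. With counts as low as 2, the test cannot separate a real pattern from chance.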