1. Parse, clean, and organize the Jeopardy! question data file to train a Naive Bayesian classifier.

2. Pass the Text Analysis Basics quiz with a score of 85% or better.

As with the classifier built above, your aim here is to make sense of the data and build a binary classifier for questions, labeling each one "high value" or "low value" based on the points available for it. Despite the large number of questions, this is an extraordinarily difficult classification problem.

Consider it as a human coder: how often could you tell "easy" questions from "hard" ones? Your success depends largely on your own contextual knowledge; indeed, you might be tempted to classify questions you know the answer to as "easy" and those you do not as "hard." The computer doesn't know the answers to any of these.

For that reason, do not be discouraged if your classifier does not perform well. This constitutes an especially difficult problem for a simple classifier to solve.
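
The assignment names a Naive Bayes classifier, while the cells below fit a logistic regression. As a rough sketch of the Naive Bayes route, assuming scikit-learn's `MultinomialNB` on plain word counts and a handful of invented questions (none taken from the data file), the model could be set up like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy, invented questions and labels; the real input would be the
# 'question' column and the high/low labels derived from 'value'.
toy_questions = [
    "This planet is closest to the sun",
    "This color is the color of grass",
    "This treaty ended the Thirty Years War",
    "This physicist formulated the uncertainty principle",
]
toy_labels = ["low value", "low value", "high value", "high value"]

# Word counts feed a multinomial Naive Bayes model (add-one smoothing by default).
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(toy_questions, toy_labels)

toy_prediction = nb_model.predict(["This treaty ended a war"])
print(toy_prediction)
```

On the real data, the same pipeline would be fit on the training questions and scored on the held-out ones.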

In [54]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import numpy as np
from string import punctuation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer

In [55]:
jeopdata = pd.read_json('jeopardy.json')
print(jeopdata)

                               category    air_date  \
0                               HISTORY  2004-12-31   
1       ESPN's TOP 10 ALL-TIME ATHLETES  2004-12-31   
2           EVERYBODY TALKS ABOUT IT...  2004-12-31   
3                      THE COMPANY LINE  2004-12-31   
4                   EPITAPHS & TRIBUTES  2004-12-31   
...                                 ...         ...   
216925                   RIDDLE ME THIS  2006-05-11   
216926                        "T" BIRDS  2006-05-11   
216927           AUTHORS IN THEIR YOUTH  2006-05-11   
216928                       QUOTATIONS  2006-05-11   
216929                   HISTORIC NAMES  2006-05-11   

                                                 question  value  \
0       'For the last 8 years of his life, Galileo was...   $200   
1       'No. 2: 1912 Olympian; football star at Carlis...   $200   
2       'The city of Yuma in this state has a record a...   $200   
3       'In 1963, live on "The Art Linkletter Show", t...   $200   

In [56]:
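# Removing the dollar sign and thousands separator from values and changing dtype to float
# regex=False treats '$' as a literal character rather than a regex anchor
jeopdata['value'] = jeopdata['value'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)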
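
As a small check of the cleaning step, the sketch below applies the same replacements to a few invented value strings. Passing `regex=False` treats `'$'` as a literal character; the default for `regex` has differed between pandas versions, so being explicit is the safer assumption.

```python
import pandas as pd

# Invented sample values in the same "$1,000" style as the 'value' column.
raw_values = pd.Series(["$200", "$1,000", "$2,000"])

clean_values = (
    raw_values
    .str.replace("$", "", regex=False)   # drop the literal dollar sign
    .str.replace(",", "", regex=False)   # drop the thousands separator
    .astype(float)
)
print(clean_values.tolist())
```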


In [57]:
# Initializing the logistic regression classifier
logistic_reg = LogisticRegression(max_iter=1000)

In [58]:
# Initializing the TF-IDF vectorizer, which turns text into weighted word features and drops English stop words
stop = TfidfVectorizer(stop_words='english')

In [59]:
# Dollar threshold: questions worth this much or more are labeled "high value"
threshold = 1000

In [60]:
# Assigning label based on the threshold
HiLo = jeopdata['value'].apply(lambda x: 'high value' if x >= threshold else 'low value')
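
Before training, it can help to check how the chosen threshold splits the data, since a lopsided split makes raw accuracy harder to interpret. A minimal sketch on invented dollar values follows; the names `toy_values`, `toy_labels`, and `class_share` are illustrative only.

```python
import numpy as np
import pandas as pd

threshold = 1000
toy_values = pd.Series([200.0, 400.0, 800.0, 1000.0, 2000.0])

# Vectorized equivalent of the apply/lambda labeling step.
toy_labels = pd.Series(
    np.where(toy_values >= threshold, "high value", "low value"),
    index=toy_values.index,
)

# Share of each class under this threshold.
class_share = toy_labels.value_counts(normalize=True)
print(class_share)
```

On the notebook's own labels, `HiLo.value_counts(normalize=True)` would give the same kind of summary.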

In [61]:
# Assigning X (question text) and y (labels) for the classifier
X = jeopdata['question']
y = HiLo
print(X)
print(y)

0         'For the last 8 years of his life, Galileo was...
1         'No. 2: 1912 Olympian; football star at Carlis...
2         'The city of Yuma in this state has a record a...
3         'In 1963, live on "The Art Linkletter Show", t...
4         'Signer of the Dec. of Indep., framer of the C...
                                ...                        
216925    'This Puccini opera turns on the solution to 3...
216926    'In North America this term is properly applie...
216927    'In Penny Lane, where this "Hellraiser" grew u...
216928    'From Ft. Sill, Okla. he made the plea, Arizon...
216929    'A silent movie title includes the last name o...
Name: question, Length: 216930, dtype: object
0          low value
1          low value
2          low value
3          low value
4          low value
             ...    
216925    high value
216926    high value
216927    high value
216928    high value
216929     low value
Name: value, Length: 216930, dtype: object


In [62]:
# Splitting the dataset into train (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
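
If the two classes turn out to be unbalanced, one option (not used in the cell above) is to pass `stratify=y` so that both splits keep roughly the same class proportions. A small sketch with invented placeholder data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Invented data: 15 "low value" and 5 "high value" placeholder questions.
toy_X = [f"question {i}" for i in range(20)]
toy_y = ["low value"] * 15 + ["high value"] * 5

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# Both splits keep the 3:1 ratio of the full toy label list.
print(Counter(y_tr), Counter(y_te))
```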

In [63]:
# Learning the vocabulary and IDF weights from the training questions, then vectorizing them
X_train_tfidf = stop.fit_transform(X_train)

In [64]:
# Vectorizing the test questions with the vocabulary learned from the training set
X_test_tfidf = stop.transform(X_test)
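
The difference between `fit_transform` on the training split and `transform` on the test split can be seen on a tiny invented example. In this sketch, words that appear only in the test text are dropped, and both matrices share the same columns:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sentences, used only to illustrate the two calls.
toy_train = [
    "This planet is closest to the sun",
    "This treaty ended the Thirty Years War",
]
toy_test = ["This comet passed close to the sun"]

vectorizer = TfidfVectorizer(stop_words="english")
train_matrix = vectorizer.fit_transform(toy_train)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(toy_test)        # reuses them; unseen words are ignored

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Fitting the vectorizer on the training split alone keeps information from the test questions out of the features the classifier is trained on.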

In [65]:
# Training the classifier
logistic_reg.fit(X_train_tfidf, y_train)

In [66]:
# Prediction on the test set
y_pred = logistic_reg.predict(X_test_tfidf)
print(y_pred)

['low value' 'low value' 'low value' ... 'low value' 'low value'
 'low value']


In [67]:
# Evaluating the classifier
accuracy_score(y_test, y_pred)

0.719679159175771
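
An accuracy near 0.72 is easiest to judge next to what a trivial model would score. The sketch below computes the accuracy of always predicting the most common training label, using invented label lists; `majority_baseline_accuracy` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

# Invented labels, only to show the calculation.
toy_train_labels = ["low value"] * 7 + ["high value"] * 3
toy_test_labels = ["low value", "low value", "high value", "low value"]

baseline = majority_baseline_accuracy(toy_train_labels, toy_test_labels)
print(baseline)
```

With the notebook's own variables the call would be `majority_baseline_accuracy(y_train, y_test)`. A classifier adds value only to the extent that it beats this number.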