<a href="https://colab.research.google.com/github/susan-sajadi/stc510/blob/master/Module5BasicsSajadi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Project: Parse, clean, and organize the Jeopardy! question data file to train a Naive Bayesian classifier.
Pass the Text Analysis Basics quiz with a score of 85% or better.
Just as we have built a classifier above, your aim here is to make sense of the data presented, and create a binary classifier ("high value" and "low value," based on the points available for each) for questions. Despite the large number of questions, this is an extraordinarily difficult classification problem. Consider it as a human coder: how often could you tell those questions that are "easy" versus "hard"? The degree to which you are successful in this is largely based on your own contextual knowledge--indeed, you might be tempted to classify questions you know the answer to as "easy" and those you do not as "hard." The computer doesn't know the answers to any of these.

For that reason, do not be discouraged if your classifier does not perform well. This constitutes an especially difficult problem for a simple classifier to solve.

Put the script and its output (which may merely report the accuracy of the trial) in your github repository, and share the link/filenames when you start your quiz.

In [166]:
import pandas as pd
import json
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
import os
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [301]:
transcripts = pd.read_json("/jeopardy.json")
transcripts.head()

Unnamed: 0,category,air_date,question,value,answer,round,show_number
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680




```
# This is formatted as code
```



In [302]:
transcripts.info()
print(transcripts.columns)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   category     216930 non-null  object
 1   air_date     216930 non-null  object
 2   question     216930 non-null  object
 3   value        213296 non-null  object
 4   answer       216930 non-null  object
 5   round        216930 non-null  object
 6   show_number  216930 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 11.6+ MB
Index(['category', 'air_date', 'question', 'value', 'answer', 'round',
       'show_number'],
      dtype='object')


Zooming into 1 column I see that there are tags for the monetary value. We are operating under the assumption that harder questions are worth more $$. We will assume then that questions 400 and 500 are hard, and 100 to 300 are easy. 


We need to remove dollars signs and commas from the values column so we can treat it like a string
---



In [303]:
print(transcripts)
transcripts = transcripts.mask(transcripts.eq('None')).dropna()
transcripts['value'] = transcripts['value'].replace({'\$':''}, regex = True)
transcripts['value'] = transcripts['value'].replace({'\,':''}, regex = True)
print(transcripts)

                               category  ... show_number
0                               HISTORY  ...        4680
1       ESPN's TOP 10 ALL-TIME ATHLETES  ...        4680
2           EVERYBODY TALKS ABOUT IT...  ...        4680
3                      THE COMPANY LINE  ...        4680
4                   EPITAPHS & TRIBUTES  ...        4680
...                                 ...  ...         ...
216925                   RIDDLE ME THIS  ...        4999
216926                        "T" BIRDS  ...        4999
216927           AUTHORS IN THEIR YOUTH  ...        4999
216928                       QUOTATIONS  ...        4999
216929                   HISTORIC NAMES  ...        4999

[216930 rows x 7 columns]
                               category  ... show_number
0                               HISTORY  ...        4680
1       ESPN's TOP 10 ALL-TIME ATHLETES  ...        4680
2           EVERYBODY TALKS ABOUT IT...  ...        4680
3                      THE COMPANY LINE  ...        4680
4   

In [304]:
#sorted = transcripts.sort_values(by=['value'])
easyqset = transcripts[transcripts['value'].astype(int) <= 350] 
easyqset['Test'] = 0
hardqset = transcripts[transcripts['value'].astype(int) >= 351] 
hardqset['Test'] = 1
# print(easyqset)
combined = pd.concat([hardqset, easyqset])



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """


Now I have separated the two types of questions, and cleaned up any empty or none data. The 0 or 1 for hard or easy has been added. Now the data is ready to see how naive bayes will do in training/testing. 

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [310]:
X_train, X_test, y_train, y_test = train_test_split(transcripts.question, combined.Test, random_state=1)

In [315]:
tfidf_vectorizer =TfidfVectorizer(use_idf=True)
X_train_tf = tfidf_vectorizer.fit_transform(X_train)
X_test_tf = tfidf_vectorizer.transform(X_test)


In [316]:
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_tf, y_train)
predictions = naive_bayes.predict(X_test_tf)

In [317]:
print('Accuracy: ', accuracy_score(y_test, predictions))

Accuracy:  0.7752231640537094


This is a better percentage than I was expecting honestly! How cool 😎 (unless I made a mistake that is giving me this false sense of accomplishment haha)