<a href="https://colab.research.google.com/github/susan-sajadi/stc510/blob/master/Module5BasicsSajadi.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Project: Parse, clean, and organize the Jeopardy! question data file to train a Naive Bayesian classifier.
Pass the Text Analysis Basics quiz with a score of 85% or better.
Just as we have built a classifier above, your aim here is to make sense of the data presented, and create a binary classifier ("high value" and "low value," based on the points available for each) for questions. Despite the large number of questions, this is an extraordinarily difficult classification problem. Consider it as a human coder: how often could you tell those questions that are "easy" versus "hard"? The degree to which you are successful in this is largely based on your own contextual knowledge--indeed, you might be tempted to classify questions you know the answer to as "easy" and those you do not as "hard." The computer doesn't know the answers to any of these.

For that reason, do not be discouraged if your classifier does not perform well. This constitutes an especially difficult problem for a simple classifier to solve.

Put the script and its output (which may merely report the accuracy of the trial) in your github repository, and share the link/filenames when you start your quiz.

In [166]:
import pandas as pd
import json
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
import os
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [301]:
transcripts = pd.read_json("/jeopardy.json")
transcripts.head()

| | category | air_date | question | value | answer | round | show_number |
|---|---|---|---|---|---|---|---|
| 0 | HISTORY | 2004-12-31 | 'For the last 8 years of his life, Galileo was... | $200 | Copernicus | Jeopardy! | 4680 |
| 1 | ESPN's TOP 10 ALL-TIME ATHLETES | 2004-12-31 | 'No. 2: 1912 Olympian; football star at Carlis... | $200 | Jim Thorpe | Jeopardy! | 4680 |
| 2 | EVERYBODY TALKS ABOUT IT... | 2004-12-31 | 'The city of Yuma in this state has a record a... | $200 | Arizona | Jeopardy! | 4680 |
| 3 | THE COMPANY LINE | 2004-12-31 | 'In 1963, live on "The Art Linkletter Show", t... | $200 | McDonald\'s | Jeopardy! | 4680 |
| 4 | EPITAPHS & TRIBUTES | 2004-12-31 | 'Signer of the Dec. of Indep., framer of the C... | $200 | John Adams | Jeopardy! | 4680 |







In [340]:
transcripts.info()
print(transcripts.columns)
transcripts.value.max()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 213295 entries, 0 to 216928
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   category     213295 non-null  object
 1   air_date     213295 non-null  object
 2   question     213295 non-null  object
 3   value        213295 non-null  object
 4   answer       213295 non-null  object
 5   round        213295 non-null  object
 6   show_number  213295 non-null  int64 
dtypes: int64(1), object(6)
memory usage: 13.0+ MB
Index(['category', 'air_date', 'question', 'value', 'answer', 'round',
       'show_number'],
      dtype='object')


'9800'
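
One caution about the `'9800'` above: `value` is an object (string) column, so `max()` compares the strings character by character rather than numerically. A minimal sketch on made-up values (the toy Series below is an assumption, not taken from the data file) shows how the two can differ once the column is converted with `pd.to_numeric`:

```python
import pandas as pd

# Toy values for illustration only; they are not rows from jeopardy.json.
toy_values = pd.Series(["$200", "$9,800", "$12,000"])

# Strip dollar signs and commas, leaving digit-only strings.
cleaned = toy_values.str.replace(r"[$,]", "", regex=True)

# max() on strings is lexicographic: "9..." sorts after "1...".
string_max = cleaned.max()

# Converting to numbers first gives the numeric maximum instead.
numeric_max = pd.to_numeric(cleaned).max()

print("string max: ", string_max)
print("numeric max:", numeric_max)
```

If the numeric maximum of the real column turns out to differ from `'9800'`, this string comparison is the likely reason.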

Looking at the `value` column, I see that each question is tagged with a monetary value. We are operating under the assumption that harder questions are worth more money. We will therefore treat questions worth more than 350 dollars as hard and the rest as easy.


We need to remove dollar signs and commas from the value column so we can treat it like a number
---



In [303]:
print(transcripts)
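# Mask cells equal to the string 'None' as NaN, then drop rows with any missing value
transcripts = transcripts.mask(transcripts.eq('None')).dropna()
# Strip dollar signs and commas so the values can later be cast to int
transcripts['value'] = transcripts['value'].replace({r'\$': ''}, regex=True)
transcripts['value'] = transcripts['value'].replace({',': ''}, regex=True)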
print(transcripts)

                               category  ... show_number
0                               HISTORY  ...        4680
1       ESPN's TOP 10 ALL-TIME ATHLETES  ...        4680
2           EVERYBODY TALKS ABOUT IT...  ...        4680
3                      THE COMPANY LINE  ...        4680
4                   EPITAPHS & TRIBUTES  ...        4680
...                                 ...  ...         ...
216925                   RIDDLE ME THIS  ...        4999
216926                        "T" BIRDS  ...        4999
216927           AUTHORS IN THEIR YOUTH  ...        4999
216928                       QUOTATIONS  ...        4999
216929                   HISTORIC NAMES  ...        4999

[216930 rows x 7 columns]
                               category  ... show_number
0                               HISTORY  ...        4680
1       ESPN's TOP 10 ALL-TIME ATHLETES  ...        4680
2           EVERYBODY TALKS ABOUT IT...  ...        4680
3                      THE COMPANY LINE  ...        4680
4   

I read the warning about chained indexing in the pandas documentation. The easy and hard subsets do not overlap and each one carries its own 0/1 label, so I believe the warning is harmless here. However, I am not sure I am interpreting it correctly.
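
For what it's worth, the warning appears because a new column is assigned to a filtered subset of another DataFrame, and pandas cannot tell whether that subset is a view or a copy. A minimal sketch of two ways to avoid it, using a made-up `toy` frame as a stand-in for `transcripts` (the toy rows and the names `toy`, `easy`, `hard`, and `labeled` are assumptions for illustration):

```python
import pandas as pd

# Toy stand-in for the cleaned transcripts frame (values are illustrative).
toy = pd.DataFrame({
    "question": ["q1", "q2", "q3", "q4"],
    "value": ["200", "350", "400", "1000"],
})

# Option 1: take an explicit copy of each subset before adding a column.
easy = toy[toy["value"].astype(int) <= 350].copy()
easy["Test"] = 0
hard = toy[toy["value"].astype(int) > 350].copy()
hard["Test"] = 1
combined = pd.concat([hard, easy])

# Option 2: label every row in one step, with no subsets at all.
# The boolean comparison becomes 1 (hard) or 0 (easy).
labeled = toy.assign(Test=(toy["value"].astype(int) > 350).astype(int))

print(combined)
print(labeled)
```

Both options give the same labels. The second keeps the original row order and avoids the separate concatenation step.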

In [370]:
#sorted = transcripts.sort_values(by=['value'])
easyqset = transcripts[transcripts['value'].astype(int) <= 350] 
easyqset['Test'] = 0
print(easyqset)

                               category    air_date  ... show_number Test
0                               HISTORY  2004-12-31  ...        4680    0
1       ESPN's TOP 10 ALL-TIME ATHLETES  2004-12-31  ...        4680    0
2           EVERYBODY TALKS ABOUT IT...  2004-12-31  ...        4680    0
3                      THE COMPANY LINE  2004-12-31  ...        4680    0
4                   EPITAPHS & TRIBUTES  2004-12-31  ...        4680    0
...                                 ...         ...  ...         ...  ...
216870             LOVE SONGS IN GERMAN  2006-05-11  ...        4999    0
216871              FIRST IN OUR HEARTS  2006-05-11  ...        4999    0
216872             IT'S NOT ALEX TREBEK  2006-05-11  ...        4999    0
216873                   SCIENCE BRIEFS  2006-05-11  ...        4999    0
216874                        RATED "R"  2006-05-11  ...        4999    0

[48165 rows x 8 columns]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [371]:
hardqset = transcripts[transcripts['value'].astype(int) >= 351] 
hardqset['Test'] = 1
print(hardqset)


                               category    air_date  ... show_number Test
6                               HISTORY  2004-12-31  ...        4680    1
7       ESPN's TOP 10 ALL-TIME ATHLETES  2004-12-31  ...        4680    1
8           EVERYBODY TALKS ABOUT IT...  2004-12-31  ...        4680    1
9                      THE COMPANY LINE  2004-12-31  ...        4680    1
10                  EPITAPHS & TRIBUTES  2004-12-31  ...        4680    1
...                                 ...         ...  ...         ...  ...
216924                     OFF-BROADWAY  2006-05-11  ...        4999    1
216925                   RIDDLE ME THIS  2006-05-11  ...        4999    1
216926                        "T" BIRDS  2006-05-11  ...        4999    1
216927           AUTHORS IN THEIR YOUTH  2006-05-11  ...        4999    1
216928                       QUOTATIONS  2006-05-11  ...        4999    1

[165130 rows x 8 columns]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [368]:
combined = pd.concat([hardqset, easyqset])
print(combined)

                               category    air_date  ... show_number Test
6                               HISTORY  2004-12-31  ...        4680    1
7       ESPN's TOP 10 ALL-TIME ATHLETES  2004-12-31  ...        4680    1
8           EVERYBODY TALKS ABOUT IT...  2004-12-31  ...        4680    1
9                      THE COMPANY LINE  2004-12-31  ...        4680    1
10                  EPITAPHS & TRIBUTES  2004-12-31  ...        4680    1
...                                 ...         ...  ...         ...  ...
216870             LOVE SONGS IN GERMAN  2006-05-11  ...        4999    0
216871              FIRST IN OUR HEARTS  2006-05-11  ...        4999    0
216872             IT'S NOT ALEX TREBEK  2006-05-11  ...        4999    0
216873                   SCIENCE BRIEFS  2006-05-11  ...        4999    0
216874                        RATED "R"  2006-05-11  ...        4999    0

[213295 rows x 8 columns]


Now I have separated the two types of questions and removed any rows with empty or 'None' data. A `Test` label of 0 (easy) or 1 (hard) has been added. The data is now ready to see how naive Bayes will do in training/testing.




In [359]:
X_train, X_test, y_train, y_test = train_test_split(combined.question, combined.Test, random_state=1)

In [360]:
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
X_train_tf = tfidf_vectorizer.fit_transform(X_train)
X_test_tf = tfidf_vectorizer.transform(X_test)


In [365]:
naive_bayes = MultinomialNB()
naive_bayes.fit(X_train_tf, y_train)
predictions = naive_bayes.predict(X_test_tf)

In [364]:
print('Accuracy: ', accuracy_score(y_test, predictions))

Accuracy:  0.7750168779536419


This is a better percentage than I was expecting honestly! How cool 😎 (unless I made a mistake that is giving me this false sense of accomplishment haha)
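
One way to put the accuracy in context is to compare it with a majority-class baseline, that is, a classifier that always predicts the more common label. A rough sketch using only the row counts printed above (165130 hard and 48165 easy); the variable names are mine, and the share in the actual test split may differ slightly:

```python
# Row counts taken from the hardqset and easyqset outputs above.
hard_count = 165130
easy_count = 48165
total = hard_count + easy_count

# Accuracy of always predicting the larger class.
majority_share = max(hard_count, easy_count) / total

# Accuracy reported by the Naive Bayes run above.
reported_accuracy = 0.7750168779536419

print(f"Majority-class share: {majority_share:.4f}")
print(f"Gain over baseline:   {reported_accuracy - majority_share:.4f}")
```

The baseline comes out at roughly 0.774, so the classifier is only marginally ahead of always guessing "hard". To run the same check on the real split, `sklearn.dummy.DummyClassifier(strategy="most_frequent")` can be fit and scored the same way as `MultinomialNB`.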