In [None]:
from fastai.text import *
import pandas as pd

# Language models

A language model is an algorithm that takes a sequence of words and outputs the likely next word in the sequence. Most language models output a list of candidate words, each with its probability of occurrence. For example, given a sentence that starts `I would like to eat a hot`, the algorithm should ideally predict that the word `dog` is far more likely to come next than the word `meeting`.
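To make the idea concrete, here is a minimal sketch of the simplest possible language model: one that only looks at the previous word and counts what followed it. The three-sentence corpus is made up for illustration, and this is not how the fastai model used later works; it only shows what "a list of words, each with its probability" means.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
corpus = [
    "i would like to eat a hot dog",
    "she ate a hot dog at the game",
    "we have a hot meal ready",
]

# Count how often each word follows each previous word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follow_counts[prev][nxt] += 1

def next_word_probs(prev_word):
    """Return {candidate next word: probability} given the previous word."""
    counts = follow_counts[prev_word]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

probs = next_word_probs("hot")
print(probs)  # "dog" gets 2/3 of the probability, "meal" gets 1/3
```

A real language model conditions on the whole sequence so far rather than on a single word, which is what lets it pick up on meaning as well as word order.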

Language models are a very powerful building block in natural language processing. They are used for classifying text (e.g. is this review positive or negative?), for answering questions based on text (e.g. "what is the capital of Finland?" based on the Wikipedia page on Finland), and for language translation (e.g. English to Japanese).

## The intuition behind why language models are so broadly useful
How can this simple-sounding algorithm be that broadly useful? Intuitively, this is because predicting the next word in a sentence requires a lot of information, not just about grammar and syntax, but also about semantics: what things mean in the real world. For instance, we know that `I would like to eat a hot dog` is semantically reasonable, but `I would like to eat a hot cat` is nonsensical.

We trained a simple language model, and asked it to predict the word following `I would like to eat a `. 

We get:
    

# Step 1: Load all the data 
In this example, we are going to use a dataset of headlines from [The Onion](https://www.theonion.com), as well as from some non-satirical news sources. We found this dataset on [Kaggle](https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection).

Before creating this notebook, we downloaded the JSON file with the dataset of headlines to our computer. You can download the JSON dataset from our GitHub:
https://github.com/saiphcita/Human_AI_Interaction/tree/main/Assignment4 


In [None]:
# Code to upload the JSON dataset from your computer.
# The data is in a JSON file, so we are using the `read_json` method.
# If your data is CSV, use the `read_csv` method instead.
# We use the `lines=True` argument here because the author formatted each line
# as a separate JSON object. At least half of your time as a data
# scientist/AI researcher is spent dealing with other people's data formats!

import io

import pandas as pd
from google.colab import files

uploaded = files.upload()
nameFile="Sarcasm_Headlines_Dataset_v2.json"
bytesFile=io.BytesIO(uploaded[nameFile])
headlines = pd.read_json(bytesFile, lines=True)

Saving Sarcasm_Headlines_Dataset_v2.json to Sarcasm_Headlines_Dataset_v2 (1).json
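As a rough illustration of why `lines=True` is needed above: in this format each line is a complete JSON object, but the file as a whole is not valid JSON. The sketch below uses only Python's standard `json` module and two made-up records (they are not taken from the dataset).

```python
import json

# Two made-up records in "one JSON object per line" form.
raw = (
    '{"is_sarcastic": 1, "headline": "example satirical headline"}\n'
    '{"is_sarcastic": 0, "headline": "example real headline"}\n'
)

# Parsing the whole text as a single JSON document fails...
try:
    json.loads(raw)
    parsed_as_one = True
except json.JSONDecodeError:
    parsed_as_one = False

# ...but parsing each non-empty line on its own works.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]

print(parsed_as_one)   # False
print(len(records))    # 2
```

With `lines=True`, `pd.read_json` reads the file line by line in the same spirit and builds one row per object.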


In [None]:
headlines

Unnamed: 0,is_sarcastic,headline,article_link
0,1,thirtysomething scientists unveil doomsday clo...,https://www.theonion.com/thirtysomething-scien...
1,0,dem rep. totally nails why congress is falling...,https://www.huffingtonpost.com/entry/donna-edw...
2,0,eat your veggies: 9 deliciously different recipes,https://www.huffingtonpost.com/entry/eat-your-...
3,1,inclement weather prevents liar from getting t...,https://local.theonion.com/inclement-weather-p...
4,1,mother comes pretty close to using word 'strea...,https://www.theonion.com/mother-comes-pretty-c...
...,...,...,...
28614,1,jews to celebrate rosh hashasha or something,https://www.theonion.com/jews-to-celebrate-ros...
28615,1,internal affairs investigator disappointed con...,https://local.theonion.com/internal-affairs-in...
28616,0,the most beautiful acceptance speech this week...,https://www.huffingtonpost.com/entry/andrew-ah...
28617,1,mars probe destroyed by orbiting spielberg-gat...,https://www.theonion.com/mars-probe-destroyed-...


As you can see, some of this dataset is drawn from The Onion; the rest is drawn from places like the Huffington Post, which publish real news, not satire.

## Step 1a: Examine the data set (5 points)

Before we go off adventuring, let's first see what this dataset looks like. 

### Q: How large is this dataset? Is it balanced? (1 point)

In [None]:
# Insert code here to check size of dataset, and how many are positive (is_sarcastic = 1) and how many negative?



### Q: How long on average is each headline? (4 points)
Longer text = more information. We want to see what the length of the headline is in order to see how much information it may have. 

In [None]:
# Insert code here to find the average length of headline (in words)
## Hint: see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.count.html 
# the '\s' regex looks for spaces.



# Step 2: Build a language model that knows how to write news headlines

This is the first step of our project that will be using a machine learning model. 

We are going to use the [fast.ai](https://fast.ai/) library to create this model. If you need help with understanding this section, look at the fast.ai documentation -- it is fantastic! The steps below are modified from the [online tutorial](https://docs.fast.ai/text.html#Quick-Start:-Training-an-IMDb-sentiment-model-with-ULMFiT)

In [None]:
import fastai
from fastai.text import * 

In [None]:
data_lm = (TextList.from_df(headlines, cols='headline').split_none().label_for_lm().databunch())



## So here is what happened above. 

First, we tell fastai that we want to work on a list of texts (headlines in our case) that are stored in a dataframe (that's the `TextList.from_df` part). We also tell it where to look for the headline in the dataframe (which column to use, `cols=`).

Then there are two other important parts. We'll take it from the end. A `databunch` is a fastai convenience. It keeps all your training, validation and test data together. But what kind of validation data do we need for a language model? Remember that a language model predicts the next word in an input sequence of words. So, we can't just take some of the headlines and set them aside as validation. Instead, we want to use all the sentences and validate whether we can guess the right next word some fraction of the time. So, we first say `split_none` so you use all your data. Then we say `label_for_lm` so it labels the "next word" as the label for each sequence of words. It's a clever method -- see the source if you're curious!
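Conceptually, labeling for a language model means shifting the token sequence by one position, so that each token's label is the token that follows it. The sketch below shows this on a single made-up headline; it is a simplification, and fastai's actual implementation differs in its details (for example, in how it batches the text).

```python
# A made-up, already-tokenized headline. `xxbos` marks the beginning of a text.
tokens = "xxbos students protest new campus parking rules".split()

inputs = tokens[:-1]   # what the model sees at each step
targets = tokens[1:]   # the "next word" it should predict at that step

for seen, expected in zip(inputs, targets):
    print(f"after {seen!r:12} predict {expected!r}")
```

Because every position in the text supplies its own label, no separate labeled column is needed for this stage.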


In [None]:
data_lm.save('data_lm_export.pkl')

Let's save this databunch. We'll use this saved copy later. 

## Step 2a: Learn the model

Now that we have the data, it's time to train the model.

Now, we *could* learn a language model from scratch. But we're instead going to cheat. We're going to use a pretrained language model, and fine-tune it for our purpose. Specifically, we're going to use a model trained on the `WikiText-103` corpus.

One way to understand it is to think of our pretrained model as a model that can predict the next word in a Wikipedia article. We want to train it to write headlines instead. Since headlines still have to sound like English (i.e., follow grammar and syntax, be generally plausible, etc.), being able to predict the next word in Wikipedia is super useful. It allows us to start with a model that already knows some English, and then just train it for writing headlines.



In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.5)

Downloading https://s3.amazonaws.com/fast-ai-modelzoo/wt103-fwd.tgz


`AWD_LSTM` is the model architecture; the download above (`wt103-fwd`) supplies its weights, pretrained on Wikipedia text.

Let's train it.

In [None]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,6.049117,#na#,,11:35


Once trained, it's time to write some headlines! We give it a starting sequence `Students protest ` and see what it comes up with. 

In [None]:
learn.predict("Students protest ", n_words=5, no_unk=True)

'Students protest  night in town hall station'

Pretty good, huh? 

In [None]:
learn.predict('The Fed is expected to', n_words=3, no_unk=True)

'The Fed is expected to be a work'

OK, it's not perfect! Let's make it a little better. 

The `unfreeze` below is telling fastai to allow us to change the weights throughout the model. We do this when we want to make the model generate text that's more similar to our headlines (than to Wikipedia). 

In [None]:
learn.unfreeze()

In [None]:
learn.fit_one_cycle(cyc_len=1, max_lr=1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,5.169644,#na#,,17:31


In [None]:
learn.predict('New Study', n_words=5)

'New Study finds 65 % of our'

In [None]:
learn.predict('16 Problems', n_words=5)

'16 Problems of the past , 15'

OK, now let's save our hard work. We'll use this later. (Pssst: why is it called an encoder? Look at the fastai docs to find out!)

In [None]:
learn.save_encoder('headlines-awd.pkl')

Note that we also want to save the whole model, so we can reuse it in our Twitter bot.


In [None]:
learn.export('headlines-lm.pkl')

## Step 2b: See how well the language model works (15 points)

Try generating a few more headlines. Then, answer the following questions. Wherever possible, show what code you ran, or what predictions you asked it for. *Suggestion: Try using punctuation, numbers, texts of different lengths etc.*

### Q: What is the effect of starting with longer strings? (5 points)

We could start our headline generation with just one word, e.g. `learn.predict('White', n_words=9)` or with many: `learn.predict('White House Says Whistleblower Did', n_words=5)`. 

State in your Notebook: What is the difference you see in the kinds of headlines generated?


In [None]:
## Your answer here. Insert more cells if you want to insert code etc.
learn.predict('White', n_words=9)
#learn.predict('White House Says Whistleblower Did', n_words=5)


'White mayor of transgender candidate just choosing to have a'

### Q: What aspects of the task of generating headlines does our language model do well? (5 points)
For example, does it get grammar right? Does it know genders of people or objects? etc. State the response in your Notebook. 

In [None]:
#Your answer here. Insert more cells if you want to insert code etc.




### Q: What aspects of the task of generating headlines does our model do poorly? (5 points)
What does it frequently get wrong? Why might it make these mistakes?



In [None]:
## Your answer here





# Step 3: Learn a classifier to see which headlines are satire

Remember, our dataset has some stories that are satire (from The Onion) and others that are real. Now, we're going to train a classifier to distinguish one from the other.

In [None]:
data_clas = (TextList.from_df(df=headlines, vocab= data_lm.train_ds.vocab, cols='headline').split_by_rand_pct(valid_pct=0.2).label_from_df(cols='is_sarcastic').databunch())




We're using a similar databunch method as we did for our language model above. Here, we are using `split_by_rand_pct` so we keep some fraction of our dataset as a validation set. There is one other trick: `vocab= data_lm.train_ds.vocab` ensures that our classifier only uses words that we have in our language model -- so it never deals with words it hasn't encountered before. (Consider: why is this important?)

See if you can work out what the other arguments are. 

In [None]:
data_clas.show_batch()



text,target
"xxbos ' 12 years a slave , ' ' captain xxunk , ' ' american xxunk , ' ' wolf of wall street , ' ' blue xxunk , ' ' dallas buyers club , ' ' her , ' ' xxunk , ' ' before midnight , ' and ' xxunk ' all written during same continuing education screenwriting class",1
"xxbos past xxunk and on to xxunk , one of israel 's premier xxunk sites : spring break 2016 , breaking bad on the looney front - part 1",0
xxbos trump : ' the only way to find out what happened at the saudi consulate is to send in more journalists one at a time ',1
"xxbos ' let 's all say what we 're grateful for , ' says mother who apparently believes she 's in a norman fucking xxunk painting",1
"xxbos trump : ' i know that was pretty bad , but let 's just say you 're going to want to save your energy '",1


Above: what our data looks like after we apply the vocabulary restriction. `xxunk` is an unknown word. 
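The effect of the vocabulary restriction can be sketched in a few lines: any token that is not in the vocabulary is replaced by `xxunk`. The tiny vocabulary below is made up for illustration, and real fastai preprocessing does more than this (tokenization rules, other special tokens, and so on).

```python
UNK = "xxunk"

# A made-up vocabulary; the real one comes from `data_lm.train_ds.vocab`.
vocab = {"xxbos", "scientists", "unveil", "doomsday", "clock", "of", "hair", "loss"}

def restrict_to_vocab(text, vocab):
    """Replace every whitespace-separated token not found in `vocab` with xxunk."""
    return [tok if tok in vocab else UNK for tok in text.split()]

tokens = restrict_to_vocab(
    "xxbos thirtysomething scientists unveil doomsday clock of hair loss", vocab
)
print(" ".join(tokens))
```

Here `thirtysomething` is outside the toy vocabulary, so it is the only token that becomes `xxunk`.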

Below: we're creating a classifier. 

In [None]:
classify = text_classifier_learner(data=data_clas, arch=AWD_LSTM, drop_mult=0.5)

Remember the language model encoder we saved earlier? It's time to load it back!

In [None]:
classify.load_encoder('headlines-awd.pkl')

RNNLearner(data=TextClasDataBunch;

Train: LabelList (22896 items)
x: TextList
xxbos xxunk scientists unveil doomsday clock of hair loss,xxbos dem rep . totally nails why congress is falling short on gender , racial equality,xxbos eat your xxunk : 9 xxunk different recipes,xxbos xxunk weather prevents liar from getting to work,xxbos mother comes pretty close to using word ' streaming ' correctly
y: CategoryList
1,0,0,1,1
Path: .;

Valid: LabelList (5723 items)
x: TextList
xxbos larry xxunk ' leaning ' toward senate run in connecticut,xxbos everyone doing it , xxunk sources xxunk,xxbos bloated obama delivers press conference from couch behind podium,xxbos clinton announces transition leadership should she win in november,xxbos cnn launches ' cnn for the shuttle bus from the airport to the hotel ' news channel
y: CategoryList
0,1,1,0,1
Path: .;

Test: None, model=SequentialRNN(
  (0): MultiBatchEncoder(
    (module): AWD_LSTM(
      (encoder): Embedding(11152, 400, padding_idx=1)
      (

What's happening here? 

Here's the trick: a language model predicts the next word in a sequence using all the information it has so far (all the previous words). When we train a classifier, we ask it to predict the label (satire or not) instead of the next word. 

The intuition here is that if you can tell what the next word in a sentence is, you can tell if it is satirical. (Similarly, if you can tell what the next word in an email is, you can tell if it is spam, etc.)

In [None]:
classify.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.446148,0.368206,0.840468,05:37




In [None]:
classify.freeze_to(-2)

Above: `freeze_to(-2)` is similar to the `unfreeze()` we used before, except that it only allows the last two layer groups of the model to change, while the earlier ones stay frozen. Then we can train again, just as we did after `unfreeze()`.
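One way to picture the negative index is as a list slice over the model's layer groups. The sketch below assumes that groups before index `n` are frozen and groups from `n` onward are trainable. The group names are hypothetical, chosen only for illustration, and this plain-Python code does not touch a real model.

```python
# Hypothetical layer-group names, ordered from input to output (illustration only).
layer_groups = ["embedding", "rnn_1", "rnn_2", "rnn_3", "classifier_head"]

def split_for_freeze_to(groups, n):
    """Return (frozen, trainable) using the slicing rule groups[:n] / groups[n:]."""
    return groups[:n], groups[n:]

frozen, trainable = split_for_freeze_to(layer_groups, -2)
print("frozen:   ", frozen)
print("trainable:", trainable)

# With n = 0 nothing is frozen, which corresponds to fully unfreezing the model.
none_frozen, all_trainable = split_for_freeze_to(layer_groups, 0)
```

Unfreezing gradually like this lets the layers closest to the output adapt first, before the earlier layers are allowed to change.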

In [None]:
classify.fit_one_cycle(1, 1e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.385368,0.320701,0.86528,06:17




Wow! An accuracy of about 86.5%! That sounds great, and for not that much work.

Now, let's try it on some headlines, to see how well it does. 

# Step 4: Try out the classifier (20 points)

In [None]:
classify.predict("Despair for Many and Silver Linings for Some in California Wildfires")

(Category tensor(0), tensor(0), tensor([0.9891, 0.0109]))

Here in the output, the first part of this tuple is the chosen category (`0`, i.e. not satire), and the last part is an array of probabilities, one per category. The classifier suggests that the headline (which I got from the [New York Times](https://www.nytimes.com/2019/10/29/us/california-fires-homes.html?action=click&module=Top%20Stories&pgtype=Homepage)) is not satire, with about 99% confidence.
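Reading the confidence off the probability array can be sketched as follows. The numbers are copied from the output above as a plain Python list (not a tensor), and the label names are our own wording for categories `0` and `1`.

```python
# Probabilities copied from the prediction above: index 0 = not satire, index 1 = satire.
probs = [0.9891, 0.0109]
labels = ["not satire", "satire"]

# The predicted category is the index with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[best]

print(f"{labels[best]} ({confidence:.1%})")  # not satire (98.9%)
```

The same check is handy later, when you need predictions with at least 85% confidence.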

## Step 4a: Try out this classifier (10 points)

Below, try the classifier with some headlines, real or made up (including made up by the language model above). 


In [None]:
## Two headlines that the classifier correctly classifies (1 point)

In [None]:
## Two headlines that the classifier classifies incorrectly (1 point)

Now, we want to find two headlines that the classifier is really confident about, but classifies incorrectly. We want the confidence of the prediction to be at least 85%.

One headline is anything you want to write. Another must be a real headline (not satire) that you can trick the classifier into misclassifying by changing only one word. For instance, taking `"Despair for Many and Silver Linings for Some in California Wildfires"`, a real NYTimes headline, you can change it to `"Despair for Many and Silver Linings for Some in Oregon Wildfires"` (note that this particular change does not cause the classifier to misclassify).
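If you want to try many one-word changes quickly, a small helper like the one below can generate the candidates. This is only a sketch: the function name and the replacement words are our own, and it splits on whitespace rather than using the classifier's tokenizer.

```python
def one_word_variants(headline, position, replacements):
    """Return copies of `headline` with the word at `position` swapped out."""
    words = headline.split()
    variants = []
    for new_word in replacements:
        changed = words.copy()
        changed[position] = new_word
        variants.append(" ".join(changed))
    return variants

original = "Despair for Many and Silver Linings for Some in California Wildfires"

# Position 9 is "California" (counting from 0); the replacements are arbitrary examples.
variants = one_word_variants(original, position=9, replacements=["Oregon", "Texas"])
for v in variants:
    print(v)
```

Each variant can then be passed to `classify.predict(...)` to see whether the predicted category or its confidence changes.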

In [None]:
## Insert one headline of your own that the classifier classifies incorrectly, with high confidence. (4 points)


In [None]:
## Insert one real headline, changed by only one word, that the classifier classifies incorrectly, with high confidence. (4 points)

# Also, insert link to the original headline/article.


## Step 4b: What kinds of headlines are misclassified? (10 points)

Write your hypothesis below on what kinds of headlines are misclassified. If it helps you, use the [TextClassificationInterpretation](https://docs.fast.ai/text.learner.html#TextClassificationInterpretation) utility. Show your work, especially if you use this utility.

In [None]:
## Show work here

(Add your interpretation here)

# Step 5: Save your classifier
Now that we've trained the classifier, you're ready for Part 2. You'll use this saved file in your bot later.

In [None]:
classify.export(file='satire_awd.pkl')

Later, you'll use it like so.

In [None]:

#You will want to download the classifiers to your google drive to use them for 
#this part. Download pre-trained learners here or use your own:
#https://drive.google.com/drive/folders/1KrYsysOojlOvVK-YfSehREsiz1k0Xxme?usp=sharing
#use this tutorial to understand how to access and store some of this data
#on your google drive for easy access:  
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
data_path="/content/"
serve_classifier = load_learner(path=data_path, file='satire_awd.pkl')
serve_lm = load_learner(path=data_path, file='headlines-lm.pkl')

In [None]:
#this will use the learner you uploaded to predict whether a certain sentence 
#is satire or not. 
serve_classifier.predict('How the New Syria Took Shape')

(Category tensor(0), tensor(0), tensor([0.9880, 0.0120]))

In [None]:
#this will create a headline based on a phrase you provide
#and based on what your learner learned
serve_lm.predict('Rising Seas', n_words=7)

'Rising Seas : a awful love cook , few'

# Step 6: Add the bot code

See the assignment document for what the bot code should look like. You can add it just below here, but you are also welcome to create a new notebook where you put that code. 