<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Word Vectors

---

### Learning Objectives
- Describe word vectors and understand the shortcomings of bag-of-words methods.
- Describe word embeddings.
- Apply Word2Vec, GloVe, and BERT embedding techniques.

**We will start by importing what we need for Word2Vec, GloVe, and the transformer models.** (Downloading the text8 corpus and the pre-trained GloVe embeddings can take a while! We are using the [gensim.downloader](https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html) module for this.)

In [None]:
# Install/upgrade Gensim & transformers
# !pip install gensim --upgrade
# !pip install transformers --upgrade

In [2]:
!pip install git+https://github.com/huggingface/transformers

In [4]:
import gensim.downloader as api
from gensim.models.word2vec import Word2Vec
from transformers import pipeline

In [5]:
api.info('text8')

{'checksum': '68799af40b6bda07dfa47a32612e5364',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'file_name': 'text8.gz',
 'file_size': 33182058,
 'license': 'not found',
 'num_records': 1701,
 'parts': 1,
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'record_format': 'list of str (tokens)'}

In [6]:
corpus = api.load('text8')



In [7]:
model = Word2Vec(corpus)

## Word Embeddings

### What is a vector?

There are lots of ways to think about a vector.

<img src="./images/vector.png" alt="drawing" width="450"/>

In **physics**, vectors are arrows.

<img src="./images/vector.jpg" alt="drawing" width="400"/>

In **computer science** and **statistics**, vectors are columns of values, like one numeric Series in a DataFrame.

#### It turns out that these are equivalent.

<img src="./images/vector_on_graph.png" alt="drawing" width="450"/>

[This video](https://www.youtube.com/watch?v=fNk_zzaMoSs) does an exceptional job explaining vectors.

### So... what is a word vector?

A word vector, simply, is a way for us to represent words with vectors.

<details><summary>How have we technically already done this?</summary>
    
- `CountVectorizer` and `TfidfVectorizer`. By representing each word as a new column in our DataFrame, we have represented words with vectors.

![](../images/countvectorizer.jpeg)
</details>

To be more precise, we can think of each word as its own dimension or axis. In the example below, we have represented the horizontal axis with a vector for `cat` and the vertical axis with a vector for `hat`.

<img src="./images/cat_hat.png" alt="drawing" width="400"/>

This is exactly what CountVectorization and TFIDFVectorization have done; we are now just representing it geometrically/visually! Each column in our DataFrame corresponds to a new axis.
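
To make the "one column per word" idea concrete, here is a minimal plain-Python sketch of count vectorization. The three documents are made up for illustration, and this is not scikit-learn's implementation: the real `CountVectorizer` also lowercases text, drops one-character tokens by default, and returns a sparse matrix.

```python
from collections import Counter

# Made-up toy documents (an assumption for illustration only).
docs = ["the cat in the hat", "the cat sat", "a hat"]

# Vocabulary: one axis (column) per distinct word, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Represent a document as word counts along the vocabulary axes."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

doc_vectors = [count_vector(doc) for doc in docs]

print(vocab)
for doc, vec in zip(docs, doc_vectors):
    print(vec, "<-", doc)
```

Every new word in the corpus adds one more entry to `vocab`, and so one more axis to every vector.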

This type of vectorization of words (turning each word into its own column) is known as "1-of-N encoding," also called one-hot encoding.

<img src="./images/one-hot-new.png" alt="drawing" width="400"/>

For example:
- the vector for the word `dog` would be [1, 0, 0, 0, 0].
- the vector for the word `cat` would be [0, 1, 0, 0, 0].
- the vector for the word `puppy` would be [0, 0, 1, 0, 0].
- the vector for the word `kitten` would be [0, 0, 0, 1, 0].
- the vector for the word `pug` would be [0, 0, 0, 0, 1].

All of the above vectors are independent of one another. Thinking purely about language and the way we use it, **should** dog and puppy be independent of one another? **Should** dog and pug be independent of one another?

<details><summary>What do you think?</summary>
    
- Probably not!
- Dog and puppy have similar meanings. (Really, only the age is different.)
- Dog and cat have similar meanings. (i.e. I know that "dog" and "cat" are more similar than "dog" and "book" or "cat" and "car.")
- Our current data science strategy for NLP (CountVectorization, TFIDFVectorization) is good in that it turns text into numbers that a model can work with... but it treats every word as unrelated to every other word, so our current strategy has its limitations!
</details>
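
One way to see the problem in numbers is the sketch below. It builds the five 1-of-N vectors from the list above with NumPy and measures cosine similarity; the helper name `cosine_similarity` is our own, not part of any library used in this lesson.

```python
import numpy as np

words = ["dog", "cat", "puppy", "kitten", "pug"]

# Row i of the identity matrix is the 1-of-N vector for words[i].
one_hot = dict(zip(words, np.eye(len(words), dtype=int)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction, 0 = perpendicular)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(one_hot["puppy"])
print(cosine_similarity(one_hot["dog"], one_hot["puppy"]))
print(cosine_similarity(one_hot["dog"], one_hot["dog"]))
```

Any two different words score 0, so under this encoding `dog` is exactly as unrelated to `puppy` as it is to any other word.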

Rather than creating a whole new dimension each time we encounter a new word and treating it as independent of all other words, can we instead come up with "new axes" that allow us to better understand meanings and relationships among words?
- YES.

**Word embedding** is a term used to describe representing words in mathematical space.
- One word embedding technique is CountVectorization.
- A more advanced word embedding technique is `Word2Vec`.

## Non-contextual Word Embeddings

### Word2Vec
- Word2Vec is an approach that takes in observations (sentences, tweets, books) and maps them into some other space using a neural network.

Going back to our previous example, try to "think" of a five-dimensional space. 
- The horizontal axis corresponds to `dog`.
- The vertical axis corresponds to `cat`.
- The axis extending out toward you corresponds to `puppy`.
- Given that we live in 3D space, we can't really visualize higher dimensions.

Instead of giving each word its own axis, the `Word2Vec` algorithm will take all of our words and map them to another set of axes that accounts for these relationships.

<img src="./images/word-vectors-new.png" alt="drawing" width="350"/>

### Why do we care?
The structure of language has a lot of valuable information in it! The way we organize our text/speech tells us a lot about what things mean.

By using machine learning to "learn" about the structure and content of language, our models can now organize concepts and learn the relationships among them.
- Above, we did not explicitly tell the computer what "dog" or "puppy" or "cat" or "kitten" actually mean. But by learning from the data, our model can quantify the relationship among these entities!

### How does Word2Vec work?

#### Basic Answer:
The idea is that we can use the position of words in sentences (i.e. see which words were commonly used together) to understand their relationships.
- If "dog" and "puppy" are used near one another a lot, then it suggests that there may be some sort of relationship between them.
- If "cat" and "dog" are used near similar words a lot (e.g. "pet"), then it suggests that there may be some sort of relationship between them.
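
Here is a rough sketch of that intuition in plain Python. The corpus, the window size, and the tiny stop-word list are all invented for illustration, and this only collects neighboring words; it is not the Word2Vec algorithm itself.

```python
from collections import defaultdict

# Made-up toy corpus, window size, and stop words (assumptions for illustration).
sentences = [
    "the dog is a pet",
    "the cat is a pet",
    "the puppy followed the dog",
    "the car is a vehicle",
]
window = 3  # how many words on each side count as "near"
stop_words = {"the", "is", "a"}

# For every word, collect the non-stop-words seen within the window.
contexts = defaultdict(set)
for sentence in sentences:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        neighbors = tokens[lo:i] + tokens[i + 1:hi]
        contexts[word].update(t for t in neighbors if t not in stop_words)

print(sorted(contexts["dog"] & contexts["cat"]))  # shared context words
print(sorted(contexts["dog"] & contexts["car"]))  # nothing in common
print("puppy" in contexts["dog"])                 # used near one another
```

In this toy corpus, "dog" and "cat" share the context word "pet", while "dog" and "car" share nothing once the stop words are ignored.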

#### More Advanced Answer:
There are two algorithms that use neural networks to learn these relationships: Continuous Bag-of-Words (CBOW) and Continuous Skip-grams.

![](./images/cbow.png)

**CBOW (BONUS)**

A continuous Bag-of-Words model is a two-layer neural network that:
- takes the surrounding "context words" as an input.
- generates the "focus word" as the output.

<img src="./images/word2vec-cbow.png" alt="drawing" width="400"/>

**Skip-Gram (BONUS)**

A Continuous Skip-gram model is a two-layer neural network that:
- takes the "focus word" as an input.
- generates the surrounding "context words" as the output.

<img src="./images/skipgram.png" alt="drawing" width="400"/>

([image source](https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/)).
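
The two setups differ only in which side is the input and which is the target. The sketch below generates the training examples each one would see from a single made-up sentence; the function name `training_pairs` and the window of 2 are our own choices, and the neural network itself is left out.

```python
def training_pairs(tokens, window=2):
    """Yield (focus_word, context_words) for every position in the sentence."""
    for i, focus in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield focus, left + right

tokens = "the quick brown fox jumps".split()

# CBOW: the context words are the input, the focus word is the target.
cbow_examples = [(context, focus) for focus, context in training_pairs(tokens)]

# Skip-gram: the focus word is the input, each context word is a target.
skipgram_examples = [(focus, c)
                     for focus, context in training_pairs(tokens)
                     for c in context]

print(cbow_examples[2])
print(skipgram_examples[:3])
```

Notice that skip-gram produces one example per (focus, context) pair, so the same sentence yields more training examples than it does for CBOW.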

### What is the vector for "cat"?

In [8]:
model.wv.get_vector('cat')

array([ 0.32087883, -2.0395539 ,  0.21913894,  1.9984392 , -1.0418444 ,
       -1.3562835 ,  0.3287806 , -0.65221465,  0.31915167, -1.3611715 ,
       -0.30536845, -1.0756254 , -0.45678398,  1.3633684 ,  0.47309658,
       -0.5094344 , -0.21412805,  0.6477837 , -0.2805002 ,  1.2492162 ,
       -0.23100717,  0.9812699 , -1.9186287 , -0.19715191,  0.81795245,
        0.7551643 ,  0.17827928,  0.34133685,  0.70873034,  0.16768254,
        0.853074  ,  0.34719294, -1.2049208 , -0.4515342 ,  0.9580673 ,
        1.1141106 , -0.71812993, -0.06592321, -1.8316294 ,  1.850454  ,
        0.68572396, -0.7247576 , -0.2332919 , -0.10931403,  0.13636535,
       -0.20184138,  0.24175973, -0.07787517,  0.11430994, -0.5003069 ,
       -0.59514254, -0.13427012, -2.1095424 ,  1.9945333 ,  2.545565  ,
       -0.267724  ,  2.1525702 ,  0.65854055,  0.16320081,  0.3433632 ,
        0.23347677, -0.94860137, -1.1830678 ,  1.0236462 ,  0.8227303 ,
        1.0762573 , -0.05766812,  0.6318919 , -0.25701755,  1.38

### Neat application 1: Which of these is not like the other?

In [9]:
model.wv.doesnt_match(['dog', 'fish', 'cat', 'hamster', 'elephant'])

'elephant'

In [10]:
model.wv.doesnt_match(['taco', 'salad', 'burrito', 'quesadilla'])

'salad'

In [11]:
model.wv.doesnt_match(['london', 'chicago', 'madrid', 'vienna'])

'chicago'

In [12]:
model.wv.doesnt_match(['king', 'princess', 'doctor', 'duke'])

'doctor'

In [13]:
model.wv.doesnt_match(['physics', 'math', 'english', 'statistics'])

'english'

Try your own and share the most mind-blowing one in a thread.

In [14]:
# Shuya's code

model.wv.doesnt_match(['english', 'chinese', 'korean', 'french', 'spanish', 'portuguese'])

'chinese'

In [15]:
# Mason's Code

model.wv.doesnt_match(['apple', 'banana', 'watermelon', 'stocks'])

'apple'

In [17]:
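# Chris' code
# Note: text8 tokens are single words, so the multi-word phrases are not in the
# vocabulary. gensim skips them, which leaves 'dune' as the only candidate.

model.wv.doesnt_match(['dune', 'lord of the rings', 'harry potter', 'star wars'])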

'dune'

**Real-world application of this**: Suppose you're attempting to automatically detect spam emails or detect plagiarism based on words that don't belong.
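
How might a method like `doesnt_match` pick its answer? As far as we understand gensim's approach, it averages the normalized word vectors and returns the word that is least similar to that average. The sketch below does this with NumPy on hypothetical 2-D vectors that we invented; the function name `odd_one_out` is ours, not gensim's.

```python
import numpy as np

# Hypothetical 2-D vectors, invented for illustration (the trained model uses 100 dimensions).
toy_vectors = {
    "dog": np.array([0.9, 0.1]),
    "cat": np.array([0.8, 0.2]),
    "hamster": np.array([0.7, 0.3]),
    "elephant": np.array([0.1, 0.9]),
}

def unit(v):
    """Scale a vector to length 1."""
    return v / np.linalg.norm(v)

def odd_one_out(words, vectors):
    # Normalize each vector, average them, then find the word whose
    # vector has the lowest cosine similarity to that average.
    normed = np.vstack([unit(vectors[w]) for w in words])
    mean = unit(normed.mean(axis=0))
    similarities = normed @ mean
    return words[int(np.argmin(similarities))]

print(odd_one_out(["dog", "cat", "hamster", "elephant"], toy_vectors))
```

With these made-up vectors, the three small pets point in roughly the same direction, so "elephant" ends up furthest from the group average.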

### Neat application 2: What is most alike?

In [18]:
model.wv.most_similar('paris')

[('munich', 0.7468262314796448),
 ('milan', 0.7405922412872314),
 ('vienna', 0.7290281057357788),
 ('venice', 0.7287450432777405),
 ('leipzig', 0.7011932134628296),
 ('bologna', 0.6788668632507324),
 ('zurich', 0.6788535118103027),
 ('florence', 0.6696841716766357),
 ('commune', 0.663390040397644),
 ('louvre', 0.6616088151931763)]

In [19]:
model.wv.most_similar('physics')

[('mechanics', 0.7927184104919434),
 ('chemistry', 0.7566920518875122),
 ('mathematics', 0.7377824187278748),
 ('astronomy', 0.737430214881897),
 ('theoretical', 0.7299087047576904),
 ('electromagnetism', 0.7282665967941284),
 ('cosmology', 0.7205817699432373),
 ('quantum', 0.7110670208930969),
 ('electrodynamics', 0.7000282406806946),
 ('mathematical', 0.6976156234741211)]

**Real-world application of this**: Suppose you're building out a process to detect when people are tweeting about an emergency. They may not just use the word "emergency." Rather than manually creating a list of words people could use, you may want to learn from a much larger corpus of data than just your personal experience!

In [22]:
model.wv.most_similar('emergency')

[('safety', 0.697718620300293),
 ('monitoring', 0.692162275314331),
 ('detention', 0.6860338449478149),
 ('contraception', 0.6787272691726685),
 ('surveillance', 0.6639240384101868),
 ('conscription', 0.6323573589324951),
 ('involuntary', 0.6247165203094482),
 ('audit', 0.6246040463447571),
 ('prevention', 0.6237530708312988),
 ('civilian', 0.6195006370544434)]
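
The numbers returned by `most_similar` are cosine similarities between word vectors. Here is a minimal sketch of that ranking using hypothetical 3-D vectors we made up; the real method searches the model's whole vocabulary rather than a three-word dictionary.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-D vectors, invented for illustration.
toy_vectors = {
    "paris": np.array([0.9, 0.8, 0.1]),
    "vienna": np.array([0.8, 0.9, 0.2]),
    "physics": np.array([0.1, 0.2, 0.9]),
}

def most_similar(word, vectors):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = most_similar("paris", toy_vectors)
for other, score in ranked:
    print(other, round(score, 3))
```

Cosine similarity ignores vector length and only compares direction, which is why two words can score highly even if one is far more frequent than the other.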

---
## Create Word2Vec word vectors from your own corpus! (BONUS)

### NOTE: This will usually take a *long* time!

In [None]:
# # Import Word2Vec
# from gensim.models.word2vec import Word2Vec

# # If you want to use gensim's data, import their downloader
# # and load it.
# import gensim.downloader as api
# corpus = api.load('text8')

# # If you have your own iterable corpus of cleaned data, you can 
# # read it in as corpus and pass that in.

# # Train a model! 
# model = Word2Vec(corpus,          # Corpus of data.
#                  vector_size=100, # Dimensions of each word vector (named `size` in gensim < 4.0).
#                  window=5,        # Max distance between the focus word and a context word.
#                  min_count=1,     # Ignores words that appear fewer times than this.
#                  sg=0,            # sg=1 uses skip-gram, sg=0 uses CBOW (default).
#                  workers=4)       # Number of worker threads (parallelizes training).

# # Do what you'd like to do with your data!
# model.wv.most_similar("car")

Check out the documentation for Gensim's implementation of [Word2Vec here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec).

---
### GloVe

GloVe stands for Global Vectors for Word Representation. It is an unsupervised technique that maps words to vectors so that the distance between two vectors reflects how semantically similar the words are. It learns these vectors from a co-occurrence matrix, which records how often pairs of words occur together.
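
To show what a co-occurrence matrix looks like, here is a small NumPy sketch that builds one from a made-up corpus. It covers only the counting step: GloVe goes on to fit vectors whose dot products approximate the log of these counts, and that fitting is not shown. The sentences and window size are assumptions for illustration.

```python
import numpy as np

# Made-up toy corpus and window size (assumptions for illustration only).
sentences = ["ice is cold", "steam is hot", "ice is solid", "steam is gas"]
window = 2

vocab = sorted({w for s in sentences for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how often word j appears within `window` words of word i.
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for s in sentences:
    tokens = s.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                X[index[word], index[tokens[j]]] += 1

print(vocab)
print(X)
print(X[index["ice"], index["cold"]], X[index["ice"], index["hot"]])
```

In this toy corpus "ice" co-occurs with "cold" but never with "hot", which is the kind of signal the real model learns from at a much larger scale.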

In [23]:
api.info('glove-wiki-gigaword-50')

{'base_dataset': 'Wikipedia 2014 + Gigaword 5 (6B tokens, uncased)',
 'checksum': 'c289bc5d7f2f02c6dc9f2f9b67641813',
 'description': 'Pre-trained vectors based on Wikipedia 2014 + Gigaword, 5.6B tokens, 400K vocab, uncased (https://nlp.stanford.edu/projects/glove/).',
 'file_name': 'glove-wiki-gigaword-50.gz',
 'file_size': 69182535,
 'license': 'http://opendatacommons.org/licenses/pddl/',
 'num_records': 400000,
 'parameters': {'dimension': 50},
 'parts': 1,
 'preprocessing': 'Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i <fname> -o glove-wiki-gigaword-50.txt`.',
 'read_more': ['https://nlp.stanford.edu/projects/glove/',
  'https://nlp.stanford.edu/pubs/glove.pdf'],
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/glove-wiki-gigaword-50/__init__.py'}

In [24]:
model_glove = api.load('glove-wiki-gigaword-50')



### Neat application 1: Which of these is not like the other?

In [25]:
model_glove.doesnt_match(['dog', 'fish', 'cat', 'hamster', 'tiger'])

'hamster'

### Neat application 2: What is most alike?

In [26]:
model_glove.most_similar('glass')

[('plastic', 0.7942505478858948),
 ('metal', 0.770871639251709),
 ('walls', 0.7700636386871338),
 ('marble', 0.7638524174690247),
 ('wood', 0.7624281048774719),
 ('ceramic', 0.7602593302726746),
 ('pieces', 0.7589111924171448),
 ('stained', 0.7528817057609558),
 ('tile', 0.748193621635437),
 ('furniture', 0.746385931968689)]

---
## Contextualized/Dynamic Word Embeddings

What are some shortcomings of `Word2Vec`? It takes into consideration the meaning of words based on context in the corpus, but what about words with different meanings?

How many meanings can you think of for the word "set"? This word [holds the record](https://www.guinnessworldrecords.com/world-records/english-word-with-the-most-meanings/) for the most meanings in the English language. Even a word like "apple" can take on vastly different meanings today. `Word2Vec` assigns exactly one vector to each word, no matter which meaning is intended.

**Dynamic Word Embeddings** overcome this shortcoming by assigning an embedding to each word only after looking at the sentence it appears in. This means that the same word (e.g. "apple" in a sentence about fruit and "Apple" in a sentence about computers) can be represented by different vectors depending on its context. One of the first popular models to do this was **ELMo**. Another popular one is **BERT**.
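
The difference between a static and a contextual embedding can be sketched with a toy example. To be clear, this is **not** how ELMo or BERT compute embeddings: the 2-D vectors are invented, and the "contextual" function simply blends a word's vector with its neighbors' vectors as a stand-in for a real model.

```python
import numpy as np

# Hypothetical 2-D static vectors, invented for illustration.
static = {
    "apple": np.array([0.5, 0.5]),
    "pie": np.array([1.0, 0.0]),
    "laptop": np.array([0.0, 1.0]),
}

def static_embedding(tokens, position):
    # A static model looks the word up and ignores the rest of the sentence.
    return static[tokens[position]]

def toy_contextual_embedding(tokens, position):
    # Toy stand-in for a contextual model: mix the word's own vector
    # with the average vector of the other words in the sentence.
    others = [static[t] for i, t in enumerate(tokens) if i != position]
    return 0.5 * static[tokens[position]] + 0.5 * np.mean(others, axis=0)

fruit, tech = ["apple", "pie"], ["apple", "laptop"]

print(np.array_equal(static_embedding(fruit, 0), static_embedding(tech, 0)))
print(toy_contextual_embedding(fruit, 0))
print(toy_contextual_embedding(tech, 0))
```

The static lookup returns the same vector for "apple" in both sentences, while the toy contextual version returns two different vectors.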

<img src="./images/bert.png" alt="drawing" width="200"/>

[BERT](https://github.com/google-research/bert) (Bidirectional Encoder Representations from Transformers) was released by Google in late 2018 and set new state-of-the-art results on many NLP benchmarks at the time. Like ELMo, it produces contextual embeddings, but it is built on the Transformer encoder rather than on recurrent networks, and it is deeply bidirectional: the vector for a word depends on the words both before and after it.

BERT is an example of a Transformer model. The following is from [Wikipedia](https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)):

> Like recurrent neural networks (RNNs), Transformers are designed to handle sequential data, such as natural language, for tasks such as translation and text summarization. However, unlike RNNs, Transformers do not require that the sequential data be processed in order. For example, if the input data is a natural language sentence, the Transformer does not need to process the beginning of it before the end. Due to this feature, the Transformer allows for much more parallelization than RNNs and therefore reduced training times.  
Since their introduction, Transformers have become the model of choice for tackling many problems in NLP, replacing older recurrent neural network models such as the long short-term memory (LSTM).

We will use Hugging Face's [transformers](https://github.com/huggingface/transformers) for this section.

### Neat application 1: Fill in the blank
We will use the BERT model here!

In [27]:
unmasker = pipeline('fill-mask', model='bert-base-uncased')

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

In [28]:
unmasker('I want to go for a [MASK] in the park')

[{'score': 0.9612460732460022,
  'sequence': 'i want to go for a walk in the park',
  'token': 3328,
  'token_str': 'walk'},
 {'score': 0.008957397192716599,
  'sequence': 'i want to go for a run in the park',
  'token': 2448,
  'token_str': 'run'},
 {'score': 0.006247113924473524,
  'sequence': 'i want to go for a ride in the park',
  'token': 4536,
  'token_str': 'ride'},
 {'score': 0.00602363608777523,
  'sequence': 'i want to go for a stroll in the park',
  'token': 27244,
  'token_str': 'stroll'},
 {'score': 0.004511327017098665,
  'sequence': 'i want to go for a swim in the park',
  'token': 9880,
  'token_str': 'swim'}]
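
Each `score` above is a probability: the model produces a raw score (logit) for every token in its vocabulary at the masked position, and a softmax turns those into probabilities before the top few are returned. The sketch below shows that last step on a hypothetical five-word vocabulary with made-up logits; the real BERT vocabulary has roughly 30,000 tokens.

```python
import numpy as np

# Hypothetical raw scores (logits) for a tiny 5-word vocabulary.
vocab = ["walk", "run", "ride", "stroll", "swim"]
logits = np.array([6.0, 1.3, 1.0, 0.9, 0.6])

def softmax(x):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return shifted / shifted.sum()

probs = softmax(logits)
top_k = np.argsort(probs)[::-1][:3]  # indices of the 3 highest probabilities

for i in top_k:
    print(vocab[i], round(float(probs[i]), 4))
```

Because softmax exponentiates the logits, a modest lead in raw score becomes a very large lead in probability, which is why one candidate can dominate the way "walk" does above.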

In [29]:
unmasker("No [MASK] allowed in class")

[{'score': 0.10337336361408234,
  'sequence': 'no smoking allowed in class',
  'token': 9422,
  'token_str': 'smoking'},
 {'score': 0.045239489525556564,
  'sequence': 'no weapons allowed in class',
  'token': 4255,
  'token_str': 'weapons'},
 {'score': 0.021998925134539604,
  'sequence': 'no drugs allowed in class',
  'token': 5850,
  'token_str': 'drugs'},
 {'score': 0.021368766203522682,
  'sequence': 'no alcohol allowed in class',
  'token': 6544,
  'token_str': 'alcohol'},
 {'score': 0.010899743065237999,
  'sequence': 'no music allowed in class',
  'token': 2189,
  'token_str': 'music'}]

In [30]:
unmasker('[MASK] is my favorite color')

[{'score': 0.37367215752601624,
  'sequence': 'this is my favorite color',
  'token': 2023,
  'token_str': 'this'},
 {'score': 0.11635718494653702,
  'sequence': 'it is my favorite color',
  'token': 2009,
  'token_str': 'it'},
 {'score': 0.042492132633924484,
  'sequence': 'that is my favorite color',
  'token': 2008,
  'token_str': 'that'},
 {'score': 0.04232509806752205,
  'sequence': 'he is my favorite color',
  'token': 2002,
  'token_str': 'he'},
 {'score': 0.03611450642347336,
  'sequence': 'chocolate is my favorite color',
  'token': 7967,
  'token_str': 'chocolate'}]

In [32]:
unmasker('My favorite color is [MASK].')

[{'score': 0.18677358329296112,
  'sequence': 'my favorite color is pink.',
  'token': 5061,
  'token_str': 'pink'},
 {'score': 0.16873008012771606,
  'sequence': 'my favorite color is red.',
  'token': 2417,
  'token_str': 'red'},
 {'score': 0.12622351944446564,
  'sequence': 'my favorite color is purple.',
  'token': 6379,
  'token_str': 'purple'},
 {'score': 0.09535549581050873,
  'sequence': 'my favorite color is orange.',
  'token': 4589,
  'token_str': 'orange'},
 {'score': 0.08776086568832397,
  'sequence': 'my favorite color is yellow.',
  'token': 3756,
  'token_str': 'yellow'}]

### Neat application 2: Sentiment Analysis
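The default model for this pipeline is a DistilBERT model fine-tuned on [SST-2](https://www.tensorflow.org/datasets/catalog/glue), a dataset of movie-review sentences labeled as positive or negative.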

In [33]:
sent = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


In [34]:
sent('this was the worst movie ever')

[{'label': 'NEGATIVE', 'score': 0.9997519850730896}]

In [35]:
sent('this was the WORST movie EVER!!!!!!')

[{'label': 'NEGATIVE', 'score': 0.9997619986534119}]

In [36]:
sent('I love this so much')

[{'label': 'POSITIVE', 'score': 0.9998810291290283}]

In [37]:
sent('This movie was incredibly not bad')

[{'label': 'POSITIVE', 'score': 0.9990553259849548}]

### Neat application 3: Question Answering
The default model for this pipeline is a DistilBERT model fine-tuned on [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/).

In [38]:
question = pipeline('question-answering')

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


In [39]:
# https://generalassemb.ly/faq
ga = 'General Assembly is a pioneer in education and career transformation, specializing in today\'s most in-demand coding, business, data, and design skills. With 30+ campuses around the world, we provide award-winning, dynamic training to a global community of professionals pursuing careers they love. Named one of Fast Company’s most innovative education companies, GA offers full- and part-time courses for career climbers both on campus and online. Through our corporate training programs, we also help companies compete for the future by sourcing, assessing, and growing their talent. All of these offerings are developed and led by industry experts.'
print(ga)

General Assembly is a pioneer in education and career transformation, specializing in today's most in-demand coding, business, data, and design skills. With 30+ campuses around the world, we provide award-winning, dynamic training to a global community of professionals pursuing careers they love. Named one of Fast Company’s most innovative education companies, GA offers full- and part-time courses for career climbers both on campus and online. Through our corporate training programs, we also help companies compete for the future by sourcing, assessing, and growing their talent. All of these offerings are developed and led by industry experts.


In [40]:
question(context = ga, question = 'Where is General Assembly located?')

{'answer': '30+ campuses around the world',
 'end': 186,
 'score': 0.5068291425704956,
 'start': 157}

In [41]:
question(context = ga,
         question = 'What can I learn at General Assembly?')

{'answer': 'in-demand coding, business, data, and design skills',
 'end': 150,
 'score': 0.19035615026950836,
 'start': 99}

In [42]:
question(context = ga,
         question = 'What is General Assembly?')

{'answer': 'a pioneer in education and career transformation',
 'end': 68,
 'score': 0.671988308429718,
 'start': 20}
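
The `start` and `end` values are character offsets into the context string, so the model is pointing at a span of the original text rather than writing a new answer. A quick way to check this, using the opening words of the `ga` context and the result copied from the cell output above:

```python
# Opening words of the `ga` context, and the result copied from the output above.
context = "General Assembly is a pioneer in education and career transformation"
result = {"answer": "a pioneer in education and career transformation",
          "start": 20, "end": 68}

# Slicing the context with the offsets recovers the answer text.
span = context[result["start"]:result["end"]]
print(span)
print(span == result["answer"])
```

This is why question-answering models of this kind are called "extractive": they can only return text that already appears in the context.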

### Neat application 4: Summarization
By default, this uses a distilled [BART](https://medium.com/analytics-vidhya/assesing-barts-syntactic-abilities-and-bert-s-part-1-cbf0983f6ea4) model (`sshleifer/distilbart-cnn-12-6`) that was trained on CNN/Daily Mail data.

In [43]:
summarizer = pipeline('summarization')

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


In [44]:
# https://www.upi.com/Odd_News/2020/10/22/Bear-opens-car-door-climbs-inside-in-Tennessee/9821603398162/
news = """
An Indiana family visiting Tennessee captured video of a black bear wandering up to their unoccupied car, opening a door and climbing inside.
The Franczak family said they traveled from Crown Point, Ind., to Sevierville, Tenn., to celebrate a grandmother's birthday. "One of our bucket list things was to see a bear," father Brian Franczak told WBBM-TV.
The family said they were shocked, however, when a bear came walking up the driveway of their vacation home and headed for their SUV.
"I just screamed, 'Oh my God! The bear is here! The bear is in the driveway,'" mom Carly Franczak said.
The family captured video as the bear opened a back door of the vehicle and climbed inside.
"I was at go-carts racing and my grandpa got a call about that there's a bear in their car," daughter Olivia Franczak said, "and we couldn't believe it at first. We thought my uncle got dressed up as a bear and went into the car."
The Tennessee Wildlife Resources Agency recommends residents and visitors keep vehicle doors locked at all times and make sure food and trash are secured where the animals can't reach.
"""

In [45]:
summarizer(news)

[{'summary_text': " The Franczak family was visiting Tennessee to celebrate a grandmother's birthday . The family said they were shocked when a bear came walking up to their car . The bear climbed into the vehicle and climbed inside . The Tennessee Wildlife Resources Agency recommends residents and visitors keep vehicle doors locked ."}]

### Neat application 5: Text Generation
Using [GPT-2](https://en.wikipedia.org/wiki/OpenAI#GPT-2)!

In [46]:
text_generator = pipeline("text-generation")

No model was supplied, defaulted to gpt2 (https://huggingface.co/gpt2)


In [47]:
text_generator('I would like to')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I would like to invite the other members of the media to post comments at this place as well.\n\nFirstly, I would like to ask you, to be careful, if you are interested in attending, when you read that this is where'}]

In [48]:
text_generator('The last thing in the world I would ever want to do is')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The last thing in the world I would ever want to do is be arrested and held against my will.\n\n\nThe best news of 2016 is that you will be able to play a part in helping me get out of jail and do good for the'}]

## (BONUS) Applying this to your data

Want to use a pre-trained model on your own text data? Due to hardware and time limitations, we will not do this in class, but below are several tutorials that can walk you through this. Warning: these models take a lot of time/memory - you may need a GPU for this! ([Google Colab offers free use of a GPU!](https://www.tutorialspoint.com/google_colab/google_colab_using_free_gpu.htm))

- [Example of BERT in Keras](https://colab.research.google.com/drive/1934Mm2cwSSfT5bvi78-AExAl-hSfxCbq#scrollTo=gsscu_BluPLE)
- [BERT tutorial](https://towardsml.com/2019/09/17/bert-explained-a-complete-guide-with-theory-and-tutorial/)
- [Predicting movie review sentiment with BERT](https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb#scrollTo=dCpvgG0vwXAZ)
- [Classification with BERT in PyTorch](https://colab.research.google.com/drive/1ywsvwO6thOVOrfagjjfuxEf6xVRxbUNO)
- [Classification with GloVe embeddings](https://medium.com/analytics-vidhya/text-classification-using-word-embeddings-and-deep-learning-in-python-classifying-tweets-from-6fe644fcfc81)
- [Using pre-trained word embeddings in Keras](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)
- Series of three articles on using word embeddings (e.g. Word2Vec and FastText) to transform text data for classification models (e.g. MultinomialNB). This particular series uses some more advanced methodologies as well, but the overall process described still applies.
    - [Part 1](https://towardsdatascience.com/word-embeddings-and-document-vectors-part-1-similarity-1cd82737cf58)
    - [Part 2](https://towardsdatascience.com/word-embeddings-and-document-vectors-part-2-order-reduction-2d11c3b5139c)
    - [Part 3](https://towardsdatascience.com/word-embeddings-and-document-vectors-when-in-doubt-simplify-8c9aaeec244e)