## Calculating Containment

In this notebook, you'll implement a containment function that looks at a source and answer text and returns a *normalized* value that represents the similarity between those two texts based on their n-gram intersection.

In [1]:
import numpy as np
import sklearn

### N-gram counts

One of the first things you'll need to do is to count up the occurrences of n-grams in your text data. To convert a set of text data into a matrix of counts, you can use a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

Below, you can set a value for n and use a CountVectorizer to count up the n-gram occurrences. In the next cell, we'll see that the CountVectorizer constructs a vocabulary, and later, we'll look at the matrix of counts.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

a_text = "This is an answer text"
s_text = "This is a source text"

# set n
n = 1

# instantiate an ngram counter
counts = CountVectorizer(analyzer='word', ngram_range=(n,n))

# create a dictionary of n-grams by calling `.fit`
vocab2int = counts.fit([a_text, s_text]).vocabulary_

# print dictionary of words:index
print(vocab2int)

{'this': 5, 'is': 2, 'an': 0, 'answer': 1, 'text': 4, 'source': 3}
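To make the mapping above less of a black box, here is a minimal pure-Python sketch of how such a vocabulary can be built. The helper name `ngram_vocabulary` is hypothetical, and the sketch assumes tokens are lowercased, made of two or more word characters, and indexed in alphabetical order, which appears to match what the printed dictionary shows. It is an illustration, not scikit-learn's actual implementation.

```python
import re

def ngram_vocabulary(texts, n):
    """Hypothetical helper: map each distinct word n-gram to an index."""
    ngrams = set()
    for text in texts:
        # lowercase, then keep tokens of two or more word characters
        tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
        # slide a window of n tokens over the token list
        for i in range(len(tokens) - n + 1):
            ngrams.add(" ".join(tokens[i:i + n]))
    # assumption: indices follow alphabetical order of the n-grams
    return {gram: idx for idx, gram in enumerate(sorted(ngrams))}

vocab = ngram_vocabulary(["This is an answer text", "This is a source text"], 1)
print(vocab)
```

The same helper called with `n=2` gives a bigram vocabulary, which is a handy cross-check for the exercise that follows.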


### EXERCISE: Create a vocabulary for 2-grams (aka "bigrams")

Create a `CountVectorizer`, `counts_2grams`, and fit it to our text data. Print out the resultant vocabulary.

In [3]:
# create a vocabulary for 2-grams
n = 2
counts_2grams = CountVectorizer(analyzer='word', ngram_range=(n,n))

vocab2int = counts_2grams.fit([a_text, s_text]).vocabulary_
print(vocab2int)

{'this is': 5, 'is an': 2, 'an answer': 0, 'answer text': 1, 'is source': 3, 'source text': 4}


### What makes up a word?

You'll note that the word "a" does not appear in the vocabulary, and that the words have been converted to lowercase. When `CountVectorizer` is passed `analyzer='word'`, its default token pattern defines a word as *two or more* characters, so it ignores uni-character words. In a lot of text analysis, single characters are irrelevant to the meaning of a passage, so leaving them out of a vocabulary is often the desired behavior.

For our purposes, this default behavior will work well; we don't need uni-character words to determine cases of plagiarism, but you may still want to experiment with uni-character counts.

> If you *do* want to include single characters as words, you can choose to do so by adding one more argument when creating the `CountVectorizer`; pass in the definition of a token, `token_pattern = r"(?u)\b\w+\b"`. 

This regular expression defines a word as one or more characters. If you want to learn more about this vectorizer, I suggest reading through the [source code](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L664), which is well documented.
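The difference between the two token definitions can be seen by applying the regular expressions directly with Python's `re` module. This is only a sketch of the tokenizing step: the variable names are illustrative, and the two-or-more-characters pattern is written out here on the assumption that it mirrors the vectorizer's default.

```python
import re

text = "This is a source text".lower()

default_pattern = r"(?u)\b\w\w+\b"    # two or more word characters
single_ok_pattern = r"(?u)\b\w+\b"    # one or more word characters

default_tokens = re.findall(default_pattern, text)
single_ok_tokens = re.findall(single_ok_pattern, text)

print(default_tokens)     # "a" is dropped
print(single_ok_tokens)   # "a" is kept
```

Passing the second pattern as `token_pattern` when creating the `CountVectorizer` should therefore add "a" to the vocabulary and shift the indices of the words that sort after it.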

**Next, let's fit our `CountVectorizer` to all of our text data to make an array of n-gram counts!**

The code below assumes that `counts` is our `CountVectorizer` for the n-gram size we are interested in.

In [4]:
# create array of n-gram counts for the answer and source text
ngrams = counts.fit_transform([a_text, s_text])

# row = the 2 texts and column = indexed vocab terms (as mapped above)
# ex. column 0 = 'an', col 1 = 'answer'.. col 4 = 'text'
ngram_array = ngrams.toarray()
print(ngram_array)

[[1 1 1 0 1 1]
 [0 0 1 1 1 1]]


So, the top row indicates the n-gram counts for the answer text `a_text`, and the second row indicates those for the source text `s_text`. If they have n-grams in common, you can see this by looking at the column values. For example, they both have one "is" (column 2), one "text" (column 4), and one "this" (column 5).

```
[[1 1 1 0 1 1]    =   an  answer  [is]  ______  [text] [this]
 [0 0 1 1 1 1]]   =   __  ______  [is]  source  [text] [this]
```
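The same column-by-column comparison can be done programmatically. The sketch below hard-codes the array and the alphabetical vocabulary shown above (the list `vocab` is an illustrative stand-in for the fitted vocabulary) and picks out the columns where both rows are non-zero.

```python
import numpy as np

# rows: answer text, source text; columns follow the vocabulary order
ngram_array = np.array([[1, 1, 1, 0, 1, 1],
                        [0, 0, 1, 1, 1, 1]])
vocab = ['an', 'answer', 'is', 'source', 'text', 'this']

# per-column count shared by both texts (0 when either text lacks the term)
shared_counts = np.minimum(ngram_array[0], ngram_array[1])

# names of the terms that appear in both texts
shared_terms = [vocab[i] for i in np.flatnonzero(shared_counts)]
print(shared_terms)
```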

### EXERCISE: Calculate containment values

Assume your function takes in an `ngram_array` just like that generated above, for an answer text (row 0) and a source text (row 1). Using just this information, calculate the containment between the two texts. As before, it's okay to ignore the uni-character words.

To calculate the containment:
1. Calculate the n-gram **intersection** between the answer and source text.
2. Add up the number of common terms.
3. Normalize by dividing the value in step 2 by the number of n-grams in the answer text.

The complete equation is:

$$ \frac{\sum{count(\text{ngram}_{A}) \cap count(\text{ngram}_{S})}}{\sum{count(\text{ngram}_{A})}} $$
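As a worked instance of the equation, take the unigram counts printed earlier (answer row `[1 1 1 0 1 1]`, source row `[0 0 1 1 1 1]`) and, as an assumption about how to read the intersection, use the column-wise minimum of the two rows in the numerator:

$$ \frac{0 + 0 + 1 + 0 + 1 + 1}{1 + 1 + 1 + 0 + 1 + 1} = \frac{3}{5} = 0.6 $$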

In [19]:
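def containment(ngram_array):
    ''' Containment is a measure of text similarity. It is the normalized
       intersection of n-gram word counts in two texts.
       :param ngram_array: an array of n-gram counts for an answer (row 0) and source (row 1) text.
       :return: a normalized containment value.'''
    
    # your code here
    
    # numerator: number of vocabulary columns that are non-zero in both rows
    # (each shared n-gram counts once, however often it repeats)
    num = np.sum(np.logical_and(*ngram_array))
    # denominator: total number of n-grams in the answer text (row 0)
    den = np.sum(ngram_array[0])
    
    return num/den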
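One thing worth knowing about the solution above: `np.logical_and` records only whether an n-gram is present in both texts, while the denominator sums actual counts, so repeated n-grams can pull the value below 1 even when the answer is fully contained in the source. A possible alternative, sketched here under the assumption that the intersection should count repeated occurrences, takes the column-wise minimum instead. The name `containment_clipped` and the small `example` array are hypothetical.

```python
import numpy as np

def containment_clipped(ngram_array):
    """Hypothetical variant: intersection as the per-column minimum count."""
    ngram_array = np.asarray(ngram_array)
    # shared occurrences: an n-gram seen twice in both texts counts twice
    intersection = np.minimum(ngram_array[0], ngram_array[1]).sum()
    return intersection / ngram_array[0].sum()

# answer (row 0) repeats its first n-gram; source (row 1) contains everything
example = np.array([[2, 1, 0],
                    [2, 1, 1]])

presence_based = np.sum(np.logical_and(*example)) / np.sum(example[0])
clipped = containment_clipped(example)

print(presence_based)  # 2 shared columns / 3 answer n-grams
print(clipped)         # 3 shared occurrences / 3 answer n-grams
```

When every count is 0 or 1, as in the two short example sentences, both versions return the same value.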

In [20]:
# test out your code
containment_val = containment(ngrams.toarray())

print('Containment: ', containment_val)

# note that for the given texts, and n = 1
# the containment value should be 3/5 or 0.6
assert containment_val==0.6, 'Unexpected containment value for n=1.'
print('Test passed!')

Containment:  0.6
Test passed!


In [22]:
# test for n = 2
counts_2grams = CountVectorizer(analyzer='word', ngram_range=(2,2))
bigram_counts = counts_2grams.fit_transform([a_text, s_text])

# calculate containment
containment_val = containment(bigram_counts.toarray())

print('Containment for n=2 : ', containment_val)

# the containment value should be 1/4 or 0.25
assert containment_val==0.25, 'Unexpected containment value for n=2.'
print('Test passed!')

Containment for n=2 :  0.25
Test passed!


I recommend trying out different phrases and different values of n. What happens if you count uni-character words? What if you make the sentences much longer?

I find that the best way to understand a new concept is to think about how it might be applied in a variety of different ways.
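One way to run such experiments quickly is a small loop over n. The sketch below is self-contained and uses `collections.Counter` rather than `CountVectorizer`; the helper names are hypothetical, tokens are assumed to be lowercased words of two or more characters, and the intersection is taken as the minimum count per n-gram (`Counter & Counter`).

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Hypothetical helper: count word n-grams in a text."""
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def containment_from_text(answer, source, n):
    """Containment of the answer in the source for a given n."""
    a_counts = ngram_counts(answer, n)
    s_counts = ngram_counts(source, n)
    total = sum(a_counts.values())
    if total == 0:
        return 0.0  # the answer has no n-grams of this size
    # Counter & Counter keeps the minimum count of each shared n-gram
    return sum((a_counts & s_counts).values()) / total

a_text = "This is an answer text"
s_text = "This is a source text"

results = {n: containment_from_text(a_text, s_text, n) for n in (1, 2, 3)}
for n, value in results.items():
    print(n, value)
```

For these two sentences the value drops as n grows, since longer n-grams are less likely to be shared.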

### My tests

(I'm going to follow the suggestions and make some tests)

In [27]:
source_text = 'Abundantia was a divine personification of abundance and prosperity in ancient Rome. One explanation of the origin of the cornucopia myth, as related by Ovid, is that while the river god Achelous, in the form of a bull, was fighting Hercules, one of his horns was ripped off. The horn was taken up by the Naiads, who filled it with fruit and flowers, transforming it into a "horn of plenty", and gave it into Abundantia\'s care. '
answer_text = 'Abundantia was a divine personification of abundance and prosperity in ancient Rome. One explanation of the origin of the cornucopia myth is that while the river god Achelous, in the form of a bull, was fighting Hercules, one of his horns was ripped off.'


* 1-grams

In [28]:
one_grams_vect = CountVectorizer(analyzer='word', ngram_range=(1,1))
one_gram_counts = one_grams_vect.fit_transform([answer_text, source_text])

In [30]:
one_grams_vect.vocabulary_

{'abundantia': 1,
 'was': 46,
 'divine': 10,
 'personification': 34,
 'of': 29,
 'abundance': 0,
 'and': 4,
 'prosperity': 36,
 'in': 23,
 'ancient': 3,
 'rome': 40,
 'one': 31,
 'explanation': 11,
 'the': 43,
 'origin': 32,
 'cornucopia': 9,
 'myth': 27,
 'is': 25,
 'that': 42,
 'while': 47,
 'river': 39,
 'god': 18,
 'achelous': 2,
 'form': 15,
 'bull': 6,
 'fighting': 12,
 'hercules': 19,
 'his': 20,
 'horns': 22,
 'ripped': 38,
 'off': 30,
 'as': 5,
 'related': 37,
 'by': 7,
 'ovid': 33,
 'horn': 21,
 'taken': 41,
 'up': 45,
 'naiads': 28,
 'who': 48,
 'filled': 13,
 'it': 26,
 'with': 49,
 'fruit': 16,
 'flowers': 14,
 'transforming': 44,
 'into': 24,
 'plenty': 35,
 'gave': 17,
 'care': 8}

In [29]:
print(one_gram_counts.toarray())

[[1 1 1 1 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 1 1 0 1 2 0 1 0 1 0 5 1 2 1 0 1 0 1
  0 1 1 1 0 1 4 0 0 3 1 0 0]
 [1 2 1 1 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 1 3 1 1 6 1 2 1 1 1 1 1
  1 1 1 1 1 1 6 1 1 4 1 1 1]]


In [31]:
# calculate containment
containment_val = containment(one_gram_counts.toarray())
print(containment_val)

0.738095238095


* 2-grams

In [35]:
two_grams_vect = CountVectorizer(analyzer='word', ngram_range=(2,2))
two_gram_counts = two_grams_vect.fit_transform([answer_text, source_text])

In [36]:
two_grams_vect.vocabulary_

{'abundantia was': 2,
 'was divine': 64,
 'divine personification': 13,
 'personification of': 47,
 'of abundance': 37,
 'abundance and': 0,
 'and prosperity': 7,
 'prosperity in': 49,
 'in ancient': 27,
 'ancient rome': 4,
 'rome one': 53,
 'one explanation': 43,
 'explanation of': 14,
 'of the': 41,
 'the origin': 60,
 'origin of': 45,
 'the cornucopia': 56,
 'cornucopia myth': 12,
 'myth is': 35,
 'is that': 31,
 'that while': 55,
 'while the': 68,
 'the river': 61,
 'river god': 52,
 'god achelous': 21,
 'achelous in': 3,
 'in the': 28,
 'the form': 57,
 'form of': 18,
 'of bull': 38,
 'bull was': 9,
 'was fighting': 65,
 'fighting hercules': 15,
 'hercules one': 22,
 'one of': 44,
 'of his': 39,
 'his horns': 23,
 'horns was': 26,
 'was ripped': 66,
 'ripped off': 51,
 'myth as': 34,
 'as related': 8,
 'related by': 50,
 'by ovid': 10,
 'ovid is': 46,
 'off the': 42,
 'the horn': 58,
 'horn was': 25,
 'was taken': 67,
 'taken up': 54,
 'up by': 63,
 'by the': 11,
 'the naiads': 59

In [37]:
print(two_gram_counts.toarray())

[[1 0 1 1 1 0 0 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 1 1 0 0 1 1 1 0 0 1 0 0 0 1 0
  1 1 1 0 2 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 0 0 1 1 0 0 1 1 1 0 1 0 0]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 0 1
  1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]


In [38]:
# calculate containment
containment_val = containment(two_gram_counts.toarray())
print(containment_val)

0.951219512195


* 1-grams + 2-grams

In [41]:
one_two_grams_vect = CountVectorizer(analyzer='word', ngram_range=(1,2))
one_two_gram_counts = one_two_grams_vect.fit_transform([answer_text, source_text])

In [42]:
one_two_grams_vect.vocabulary_

{'abundantia': 2,
 'was': 110,
 'divine': 23,
 'personification': 81,
 'of': 66,
 'abundance': 0,
 'and': 9,
 'prosperity': 85,
 'in': 50,
 'ancient': 7,
 'rome': 93,
 'one': 74,
 'explanation': 25,
 'the': 99,
 'origin': 77,
 'cornucopia': 21,
 'myth': 61,
 'is': 56,
 'that': 97,
 'while': 115,
 'river': 91,
 'god': 39,
 'achelous': 5,
 'form': 33,
 'bull': 15,
 'fighting': 27,
 'hercules': 41,
 'his': 43,
 'horns': 48,
 'ripped': 89,
 'off': 72,
 'abundantia was': 4,
 'was divine': 111,
 'divine personification': 24,
 'personification of': 82,
 'of abundance': 67,
 'abundance and': 1,
 'and prosperity': 12,
 'prosperity in': 86,
 'in ancient': 51,
 'ancient rome': 8,
 'rome one': 94,
 'one explanation': 75,
 'explanation of': 26,
 'of the': 71,
 'the origin': 104,
 'origin of': 78,
 'the cornucopia': 100,
 'cornucopia myth': 22,
 'myth is': 63,
 'is that': 57,
 'that while': 98,
 'while the': 116,
 'the river': 105,
 'river god': 92,
 'god achelous': 40,
 'achelous in': 6,
 'in the':

In [43]:
print(two_gram_counts.toarray())

[[1 0 1 1 1 0 0 1 0 1 0 0 1 1 1 1 0 0 1 0 0 1 1 1 0 0 1 1 1 0 0 1 0 0 0 1 0
  1 1 1 0 2 0 1 1 1 0 1 0 1 0 1 1 1 0 1 1 1 0 0 1 1 0 0 1 1 1 0 1 0 0]
 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 0 1
  1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]


In [44]:
# calculate containment
containment_val = containment(two_gram_counts.toarray())
print(containment_val)

0.951219512195


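Mmm... the value is exactly the same as in the 2-gram case. Looking back, cells In [43] and In [44] pass `two_gram_counts` rather than `one_two_gram_counts`, so they just repeat the 2-gram calculation; the combined 1-gram + 2-gram vocabulary was never actually scored.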
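A minimal sketch of scoring a combined vocabulary, where the matrix comes from the `ngram_range=(1, 2)` vectorizer itself rather than from `two_gram_counts`. To keep it short and checkable by hand, it uses the two short example sentences from the start of the notebook instead of the Abundantia passages, and it repeats the presence-based calculation from the `containment` function inline.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

a_text = "This is an answer text"
s_text = "This is a source text"

# one vectorizer that counts both unigrams and bigrams
vect = CountVectorizer(analyzer='word', ngram_range=(1, 2))
combined = vect.fit_transform([a_text, s_text]).toarray()

# same calculation as containment(): shared columns / answer n-gram total
num = np.sum(np.logical_and(*combined))
den = np.sum(combined[0])
combined_containment = num / den

print(combined.shape)          # 6 unigram columns + 6 bigram columns
print(combined_containment)    # (3 + 1) shared / (5 + 4) answer n-grams
```

The result sits between the separate 1-gram (0.6) and 2-gram (0.25) values, which is what one would expect when both sets of columns are pooled into one matrix.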