# **Zip Me Up: ngrams**

This lesson fills in the details about Python's built-in zip function. It's a very powerful utility for manipulating and rearranging lists/arrays of data.

**N-Grams Revisited**

As a quick refresher, ngrams are a way to group words together (usually from processing text). An ngram is a contiguous sequence of n tokens.
For example, tri-grams (N = 3) for the first 2 sentences in **The Cat in The Hat** (cith.txt) would be:

```
The sun did
sun did not
did not shine
not shine It
shine It was
It was too
was too wet
too wet to
wet to play
```


You can print out the contents of the book using the following command:


In [1]:
def read_data_file(filename):
  with open(filename, 'r') as fd:
    return fd.read()
        
print(read_data_file("cith.txt")[0:100])

CHAPTER ONE
The sun did not shine.
It was too wet to play.
So we sat in the house
All that cold, col


One of the easiest ways to generate ngrams is to use Python's array slicing and comprehension syntax:



```
def get_ngrams(words, n):
  total = len(words) - (n-1)
  return [words[i:i+n] for i in range(total)]
```



In [4]:
# type&run the above example/exercise in this cell

f = read_data_file("cith.txt")

def get_ngrams(words, n):
  total = len(words) - (n-1)
  return [words[i:i+n] for i in range(total)]

get_ngrams(f, 4)

['CHAP',
 'HAPT',
 'APTE',
 'PTER',
 'TER ',
 'ER O',
 'R ON',
 ' ONE',
 'ONE\n',
 'NE\nT',
 'E\nTh',
 '\nThe',
 'The ',
 'he s',
 'e su',
 ' sun',
 'sun ',
 'un d',
 'n di',
 ' did',
 'did ',
 'id n',
 'd no',
 ' not',
 'not ',
 'ot s',
 't sh',
 ' shi',
 'shin',
 'hine',
 'ine.',
 'ne.\n',
 'e.\nI',
 '.\nIt',
 '\nIt ',
 'It w',
 't wa',
 ' was',
 'was ',
 'as t',
 's to',
 ' too',
 'too ',
 'oo w',
 'o we',
 ' wet',
 'wet ',
 'et t',
 't to',
 ' to ',
 'to p',
 'o pl',
 ' pla',
 'play',
 'lay.',
 'ay.\n',
 'y.\nS',
 '.\nSo',
 '\nSo ',
 'So w',
 'o we',
 ' we ',
 'we s',
 'e sa',
 ' sat',
 'sat ',
 'at i',
 't in',
 ' in ',
 'in t',
 'n th',
 ' the',
 'the ',
 'he h',
 'e ho',
 ' hou',
 'hous',
 'ouse',
 'use\n',
 'se\nA',
 'e\nAl',
 '\nAll',
 'All ',
 'll t',
 'l th',
 ' tha',
 'that',
 'hat ',
 'at c',
 't co',
 ' col',
 'cold',
 'old,',
 'ld, ',
 'd, c',
 ', co',
 ' col',
 'cold',
 'old,',
 'ld, ',
 'd, w',
 ', we',
 ' wet',
 'wet ',
 'et d',
 't da',
 ' day',
 'day.',
 'ay.\n',
 ...]

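Type in the above code and be sure to experiment. Note that the cell above passes the raw file contents (a string) to get_ngrams, so the output is character 4-grams; pass a list of words (for example, f.split()) to get word ngrams. You should be able to parse this out:
 * Experiment with some simple sentences
 * 'Prove' to yourself that if there are M words, the number of ngrams is M - (n - 1)
 * words[i:i+n] is just a slice, n items long, of the array words
 * [ slice for i in range(total) ] builds one such slice for each starting index i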
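To make the points above concrete, here is a minimal sketch that runs get_ngrams on a list of words and on a plain string. The sample sentence is only an illustration taken from the tri-gram example at the top of the lesson.

```python
def get_ngrams(words, n):
    total = len(words) - (n - 1)
    return [words[i:i + n] for i in range(total)]

# illustrative sample: the first two sentences, already split into words
words = "The sun did not shine It was too wet to play".split()
trigrams = get_ngrams(words, 3)

print(trigrams[0])                # ['The', 'sun', 'did']
print(len(words), len(trigrams))  # 11 9  (M = 11, n = 3, so M - (n - 1) = 9)

# a string is a sequence of characters, so slicing it gives character ngrams
print(get_ngrams("CHAPTER", 4))   # ['CHAP', 'HAPT', 'APTE', 'PTER']
```

The same function works for both cases because slicing behaves the same way on lists and strings.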

# **Zip Me UP**
We have seen that working with parallel arrays can be cumbersome. The Python zip function can help manage the situation by taking different arrays and combining them into tuples: (Be sure to run and understand what is happening).

In [5]:
players = ["A. Gordon", "A. Holiday", "A. Nader"]
teams = ["ORL", "IND", "OKC"]
y_old = [23, 22, 25]
h_ins = [81, 73, 78]
w_lbs = [220, 185, 225]

values = zip(players, teams, y_old, h_ins, w_lbs)
for t in values:
  print(t)

('A. Gordon', 'ORL', 23, 81, 220)
('A. Holiday', 'IND', 22, 73, 185)
('A. Nader', 'OKC', 25, 78, 225)


The zip function returns a zip object (its own type) that can be used as an iterator.
If you want all the items in a list, or you want access to a specific element, simply convert the output into a list:

In [6]:
values = zip(players, teams, y_old, h_ins, w_lbs)
dataset = list(values)
print(dataset)
print(dataset[1])

[('A. Gordon', 'ORL', 23, 81, 220), ('A. Holiday', 'IND', 22, 73, 185), ('A. Nader', 'OKC', 25, 78, 225)]
('A. Holiday', 'IND', 22, 73, 185)


Note that we have to recreate the value assigned to values. Once you iterate through the object, it is exhausted and yields nothing more.
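A small sketch of that behavior, using two made-up lists (the names letters and numbers are just for illustration): converting the same zip object to a list twice gives an empty list the second time.

```python
letters = ["a", "b", "c"]
numbers = [1, 2, 3]

pairs = zip(letters, numbers)
first_pass = list(pairs)    # consumes the iterator
second_pass = list(pairs)   # nothing is left to yield

print(first_pass)   # [('a', 1), ('b', 2), ('c', 3)]
print(second_pass)  # []
```

If you need the data more than once, keep the list (first_pass here) rather than the zip object.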

With zip and list comprehensions, we can even create a dictionary of data from parallel arrays:



```
values = list(zip(players, teams, y_old, h_ins, w_lbs))
keys = ['p{}'.format(i) for i in range(len(values))]
dataset = {k:v for k,v in zip(keys, values)}
print(dataset)
```



In [8]:
#type in the above code

values = list(zip(players, teams, y_old, h_ins, w_lbs))
keys = ['p{}'.format(i) for i in range(len(values))]
dataset = {k:v for k,v in zip(keys, values)}
print(dataset)

{'p0': ('A. Gordon', 'ORL', 23, 81, 220), 'p1': ('A. Holiday', 'IND', 22, 73, 185), 'p2': ('A. Nader', 'OKC', 25, 78, 225)}
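As a possible alternative to the comprehension, the built-in dict accepts any iterable of (key, value) pairs, which is exactly what zip produces. The sketch below uses only two of the parallel lists to keep it short; the variable names team_by_player and dataset are illustrative.

```python
players = ["A. Gordon", "A. Holiday", "A. Nader"]
teams = ["ORL", "IND", "OKC"]

# dict() consumes the (player, team) pairs that zip yields
team_by_player = dict(zip(players, teams))
print(team_by_player["A. Holiday"])  # IND

# enumerate can stand in for building a separate list of keys
dataset = {"p{}".format(i): row for i, row in enumerate(zip(players, teams))}
print(dataset["p1"])  # ('A. Holiday', 'IND')
```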


**Example: Building columns from rows using zip**

Here's a more complex example of using zip to wrangle your data from one format to another. Look at the following familiar dataset. Our goal is to easily get a full column of values in a single list (or tuple). For example, the first column would be ['a',1,4,7] as a list or ('a',1,4,7) as a tuple.

```
a, b, c
1, 2, 3
4, 5, 6
7, 8, 9
```


So for this matrix (or table) of data, we want to get the 3 columns of data. Each column will have 4 items (the header plus three values). Each one is an example of a column vector.

**Set Up**

We can easily read this data into a list of lists. So the first row is the header, the second row is the list [1,2,3], etc. You will want to be sure this code is run before all of the following examples.
Before you run this code, try to figure out what gets printed on the last line.



```
table = [
['a','b','c'],
[ 1,  2,  3],
[ 4,  5,  6],
[ 7,  8,  9] ]

header = table[0]
rows   = table[1:]  # this right here, is why we love slicing
print(header, rows)
print(rows[1][1])   # what gets printed here (figure it out before running)
```



In [9]:
#type in the above code

table = [
['a','b','c'],
[ 1,  2,  3],
[ 4,  5,  6],
[ 7,  8,  9] ]

header = table[0]
rows   = table[1:]  # this right here, is why we love slicing
print(header, rows)
print(rows[1][1])   # what gets printed here (figure it out before running)

['a', 'b', 'c'] [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
5



**Attempt 1:**

Our first attempt will be to use the list concatenation operator '+':

In [10]:
column0 = header + rows[0]
print(column0)

['a', 'b', 'c', 1, 2, 3]


This is not what we wanted. Does the output make sense to you? 

You could build the first column by hand:

```
t = header[0] + str(rows[0][0]) + str(rows[1][0]) + str(rows[2][0])
print(t)
```
However, the row indices are hard coded (and the result is one concatenated string rather than a list). You want to be able to build this regardless of the number of rows in the dataset.

**Attempt 2:**

You could try looping over the indices:

In [12]:
out = []
for i in range(0, len(header)):
  l = header[i]
  v = rows[i]
  out.append( (l,v) )
print(out)

[('a', [1, 2, 3]), ('b', [4, 5, 6]), ('c', [7, 8, 9])]


This is closer. It at least builds an array of tuples, though with the wrong values: each header is paired with a whole row instead of a column. Before continuing, think about what you would try next. You will get so much more out of this lesson if you try to solve it first.

**Attempt 3: We need one more loop:**

```
out = []
for i in range(0, len(header)):
  l = header[i]
  row = rows[i]
  for j in range(0, len(row)):
    v = row[j]
    out.append( (l,v) )
print(out)
```
Does that work?

That is a lot of code. But it's important that you understand what is happening.

We are looping through the rows in the table (i is the row index). Then for each row, we are looping over each of the values found at row i (j is the column index). So any cell is at rows[i][j].
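One possible way to repair the nested loop (a sketch, not the only fix) is to swap the roles of the two loops: the outer loop walks the column index j and the inner loop walks the row index i, so each header is collected together with rows[i][j] for every row. The names columns and column are illustrative.

```python
header = ['a', 'b', 'c']
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

columns = []
for j in range(len(header)):      # j is the column index
    column = [header[j]]          # start each column with its header
    for i in range(len(rows)):    # i is the row index
        column.append(rows[i][j])
    columns.append(column)

print(columns)  # [['a', 1, 4, 7], ['b', 2, 5, 8], ['c', 3, 6, 9]]
```

This works for any number of rows, but it is still a fair amount of bookkeeping for a simple idea.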

**Attempt 4:**

Let's try to use zip for solving this. As we have seen, zip works great if you have all your arrays ahead of time. Every argument is supposed to be a list that will be "zipped up" with the other arguments. If we pass in a single list of lists as its only argument, zip will do the wrong thing:

```
print(rows)
print(list(zip(rows)))
```

In [13]:
# type&run the above example/exercise in this cell

out = []
for i in range(0, len(header)):
  l = header[i]
  row = rows[i]
  for j in range(0, len(row)):
    v = row[j]
    out.append( (l,v) )
print(out)

[('a', 1), ('a', 2), ('a', 3), ('b', 4), ('b', 5), ('b', 6), ('c', 7), ('c', 8), ('c', 9)]


The function zip is looking for multiple arguments to zip up. In the zip(rows) example above, zip is only being passed one argument (the rows), so each row ends up alone in its own 1-tuple.
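To see what the single-argument call actually produces, here is a short sketch using the same rows as in the set up. The variable name wrapped is illustrative.

```python
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# with only one iterable, zip has nothing to pair each item with
wrapped = list(zip(rows))
print(wrapped)          # [([1, 2, 3],), ([4, 5, 6],), ([7, 8, 9],)]
print(len(wrapped[0]))  # 1
```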

**"Fixing" zip:**

As we have seen, Python has a special 'operator', the `*`, that takes a list and unpacks it into its individual elements:

In [14]:
items = [1,2,3]
print(items)
print(*items)

[1, 2, 3]
1 2 3



We can use that operator on the list we pass into zip. This operator will essentially pass each row to zip as a separate argument:

```
print(list(zip(*rows)))
```

In [16]:
# type&run the above example/exercise in this cell

print(rows)
print(list(zip(*rows)))


[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
[(1, 4, 7), (2, 5, 8), (3, 6, 9)]


Oh WOW. So close. Make sure you can take apart that syntax and understand how it works. 

So zip(*rows) is similar to saying:

zip(rows[0], rows[1], rows[2])

But we never had to hard code the parameters (those numbers, 0, 1, 2 are 'hard coded'). If the number of rows in the table changes, we won't have to change our code.
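A quick sketch to check that claim: the unpacked call matches the hard-coded call for three rows, and keeps working unchanged after a fourth row (made up here for illustration) is appended.

```python
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# unpacking gives the same result as spelling out each row by index
same = list(zip(*rows)) == list(zip(rows[0], rows[1], rows[2]))
print(same)  # True

# add a row: the hard-coded version would miss it, zip(*rows) does not
rows.append([10, 11, 12])
columns = list(zip(*rows))
print(columns)  # [(1, 4, 7, 10), (2, 5, 8, 11), (3, 6, 9, 12)]
```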

**Zipping It Up (finally)**
```
table = [
   ['a','b','c'],
   [1,2,3],
   [4,5,6],
   [7,8,9]
]
print(list(zip(*table)))
```

In [17]:
# type&run the above example/exercise in this cell

table = [
   ['a','b','c'],
   [1,2,3],
   [4,5,6],
   [7,8,9]
]
print(list(zip(*table)))


[('a', 1, 4, 7), ('b', 2, 5, 8), ('c', 3, 6, 9)]


That syntax can be formidable, but once you know what zip does and how the `*` operator works, reading complex syntax becomes a bit easier.
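As one possible follow-up (a sketch, with by_name as an illustrative name), the transposed columns can be turned into a dictionary keyed by header, and transposing a second time recovers the original rows as tuples.

```python
table = [
    ['a', 'b', 'c'],
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

columns = list(zip(*table))

# first item of each column is its header, the rest are the values
by_name = {col[0]: list(col[1:]) for col in columns}
print(by_name['b'])  # [2, 5, 8]

# zipping the columns transposes back to rows (as tuples)
restored = list(zip(*columns))
print(restored[1])   # (1, 2, 3)
```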

# **Ngrams Revisited (Again)**

We can use zip to build ngrams as well. Let's start with some simple data:

words = "The sun did not shine It was too wet to play".split()

![](https://drive.google.com/uc?export=view&id=1yUmDI0UrAlXYM2317_8hVtO_JdXiTbBD)


In [19]:

words = "The sun did not shine It was too wet to play".split()
words

['The', 'sun', 'did', 'not', 'shine', 'It', 'was', 'too', 'wet', 'to', 'play']

**Bi-grams**

For creating bi-grams, we pass in the words AND the words after removing the first word:

```
# bi-grams
words = "The sun did not shine It was too wet to play".split()
bigrams = list(zip(words, words[1:]))
print(bigrams)
```

In [20]:
# type&run the above example/exercise in this cell

# bi-grams
words = "The sun did not shine It was too wet to play".split()
bigrams = list(zip(words, words[1:]))
print(bigrams)

[('The', 'sun'), ('sun', 'did'), ('did', 'not'), ('not', 'shine'), ('shine', 'It'), ('It', 'was'), ('was', 'too'), ('too', 'wet'), ('wet', 'to'), ('to', 'play')]


**Tri-grams**

For tri-grams, it's now 3 lists we need to pass to zip:

```
# tri-grams
trigrams = list(zip(words, words[1:], words[2:]))
print(trigrams)
```

In [21]:
# type&run the above example/exercise in this cell

# tri-grams
trigrams = list(zip(words, words[1:], words[2:]))
print(trigrams)

[('The', 'sun', 'did'), ('sun', 'did', 'not'), ('did', 'not', 'shine'), ('not', 'shine', 'It'), ('shine', 'It', 'was'), ('It', 'was', 'too'), ('was', 'too', 'wet'), ('too', 'wet', 'to'), ('wet', 'to', 'play')]


## **N-grams**

Do you see a pattern?

* What's the pattern for 4 words?
* ``` zip(words, words[1:], words[2:], words[3:])```

We can generalize the parameter pattern using words and n:

* ```slices = [words[i:] for i in range(n)]```

and then pass the slices to zip:
* ```ngrams = zip( *slices )```


Once again, be certain you understand why we need to unpack the slices before sending them to zip. Finally, putting it all together:

```
def find_ngrams_v1(words, n):
  return zip(*[words[i:] for i in range(n)])
print(list(find_ngrams_v1(words, 3)))
```

In [22]:
# type&run the above example/exercise in this cell

def find_ngrams_v1(words, n):
  return zip(*[words[i:] for i in range(n)])
print(list(find_ngrams_v1(words, 3)))

[('The', 'sun', 'did'), ('sun', 'did', 'not'), ('did', 'not', 'shine'), ('not', 'shine', 'It'), ('shine', 'It', 'was'), ('It', 'was', 'too'), ('was', 'too', 'wet'), ('too', 'wet', 'to'), ('wet', 'to', 'play')]
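It may help to see why this produces exactly M - (n - 1) ngrams: zip stops as soon as its shortest argument runs out, and the shortest slice is words[n-1:]. A minimal sketch with a shortened sample sentence:

```python
words = "The sun did not shine".split()
n = 3

slices = [words[i:] for i in range(n)]
print([len(s) for s in slices])  # [5, 4, 3]

# zip stops at the shortest slice, so we get len(words) - (n - 1) ngrams
ngrams = list(zip(*slices))
print(len(ngrams))   # 3
print(ngrams[-1])    # ('did', 'not', 'shine')
```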


### **Joining lists**

If you ever want to present ngrams as unified strings, just use the string join method on each ngram tuple:

```
def find_ngrams_v2(words, n):
  ngrams = zip(*[words[i:] for i in range(n)])
  return [" ".join(ngram) for ngram in ngrams]

print(list(find_ngrams_v2(words, 3)))
```

That was easy!! Take a look at find_ngrams_v2.

At first glance, it may seem impossible to understand but you now have the tools to unpack complex Pythonic code that you will see out in the wild.

In [23]:
# type&run the above example/exercise in this cell

def find_ngrams_v2(words, n):
  ngrams = zip(*[words[i:] for i in range(n)])
  return [" ".join(ngram) for ngram in ngrams]

print(list(find_ngrams_v2(words, 3)))

['The sun did', 'sun did not', 'did not shine', 'not shine It', 'shine It was', 'It was too', 'was too wet', 'too wet to', 'wet to play']
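Joined ngrams are plain strings, so they can be counted directly with collections.Counter (the same tool the assignment below uses). This is a small sketch on a made-up sentence, not output from the book.

```python
import collections

def find_ngrams_v2(words, n):
    ngrams = zip(*[words[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

# illustrative sample with some repeated word pairs
words = "it was too wet it was too cold".split()
counts = collections.Counter(find_ngrams_v2(words, 2))

print(counts["it was"])   # 2
print(counts["too wet"])  # 1
```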


## **Review**

Before you go, you should know:

* What does zip do?

* What do you pass into zip?

* What is the return type of zip?






`## type in your answers to the above review questions ##`

1. zip combines the corresponding items from several lists (or other iterables) into tuples
2. we pass in one or more iterables as separate arguments (a list of lists must be unpacked with *)
3. zip() returns an iterator object, but we can convert it to a list

## **Lesson Assignment:**

Be sure to type in all the examples first. For this lesson you will build on find_ngrams_v2.

**Create it**

Create a function named find_ngrams_bow:

* it has 4 parameters (words, n, bow=False, stopwords=[])
* words is a list of tokens/words
* each word should be converted to lowercase
* if bow is True, create ngrams such that order of the ngram words is no longer considered. Hence, each ngram is simply a bag-of-words (BOW). You can implement this by always using the alphabetical order for the words. For example the two ngrams, 'he said fine' and 'fine he said' would be the same ngram in the BOW model.
* if stopwords contains words, those words should not be considered part of the text
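The two less familiar requirements can be sketched in isolation before writing the full function. The helper name bow_key and the sample stopwords are hypothetical and only illustrate the idea; they are not part of the assignment's required interface.

```python
def bow_key(ngram_words):
    # hypothetical helper: lowercase each word, then sort alphabetically
    # so that word order no longer distinguishes two ngrams
    return " ".join(sorted(w.lower() for w in ngram_words))

print(bow_key(["he", "said", "fine"]))  # fine he said
print(bow_key(["fine", "he", "said"]))  # fine he said

# stopword filtering: drop the words before building any ngrams
stopwords = {"the", "of"}  # illustrative stopword set
tokens = "Out of the house".split()
kept = [t.lower() for t in tokens if t.lower() not in stopwords]
print(kept)  # ['out', 'house']
```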


```
import collections
def find_ngrams_bow(words, n, bow=False, stopwords=[]):
   return []

def simple_test():
  text = read_data_file('hp1.txt')
  ngrams = find_ngrams_bow(text.split(), 3)
  top5 = collections.Counter(ngrams).most_common(5)
  print(top5)

# expected output of simple_test():
# [('of out the', 63), ('and harry ron', 51), ('end of the', 35), ('of rest the', 34), ('and hermione ron', 32)]
```


In [50]:
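import collections
def find_ngrams_bow(tokens, gram, bow=False, stopwords=[]):
  # lowercase every token, then drop stopwords without mutating the caller's list
  skip = set(stopwords)
  tokens = [t.lower() for t in tokens]
  tokens = [t for t in tokens if t not in skip]

  n_grams = find_ngrams_v1(tokens, gram)

  if bow:
    # bag-of-words: sort each ngram so word order no longer matters
    n_grams = [tuple(sorted(g)) for g in n_grams]

  return n_grams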

def simple_test(data_file='hp1.txt', ngrams=3, bow=False, stopwords=[], most_com=5):
  text = read_data_file(data_file)
  ngrams = find_ngrams_bow(text.split(), ngrams, bow=bow, stopwords=stopwords)
  top5 = collections.Counter(ngrams).most_common(most_com)
  print(top5)

simple_test()

[(('out', 'of', 'the'), 63), (('harry', 'and', 'ron'), 36), (('ron', 'and', 'hermione'), 32), (('in', 'front', 'of'), 25), (('seemed', 'to', 'be'), 22)]


In [41]:
set([1, 2, 3]).difference(set([3, 4]))

{1, 2}

## **Use it**

With everything working, you will now use find_ngrams_bow to help support your research: 

**Question 1: write a function named q1 that takes no parameters.**

The function will use find_ngrams_bow to answer the following question:

As the n in ngrams increases, would you expect the BOW ngram counts to be higher or lower than non-BOW version?

* make sure you understand the question
* answer it BEFORE writing any code
* now write the code inside q1 that will help you confirm/deny your answer. You can use any method you want (print statements, analytical calculations, etc).
* q1 provides evidence to support the truth
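Before running anything on the full book, it can help to reason on a tiny made-up example. The sketch below uses a hypothetical stand-alone helper named ngrams (not the assignment's find_ngrams_bow) to compare ordered and bag-of-words bigram counts on five words.

```python
import collections

def ngrams(words, n, bow=False):
    # hypothetical helper for this illustration only
    grams = zip(*[words[i:] for i in range(n)])
    if bow:
        grams = (tuple(sorted(g)) for g in grams)
    return list(grams)

words = "ron and harry and ron".split()

ordered = collections.Counter(ngrams(words, 2))
unordered = collections.Counter(ngrams(words, 2, bow=True))

print(max(ordered.values()))    # 1  (every ordered bigram is unique)
print(max(unordered.values()))  # 2  ('and ron' and 'ron and' are merged)
```

Both versions contain the same total number of ngrams; BOW only merges ngrams that differ in word order.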

In [52]:
# expectation: as n increases, the counts of the most common ngrams decrease

def q1():
  for n in range(1, 5):
    print('ngrams={}'.format(n))
    simple_test(ngrams=n)  # simple_test prints its own result and returns None


q1()

ngrams=1
[(('the',), 3626), (('and',), 1920), (('to',), 1857), (('he',), 1528), (('of',), 1258)]
ngrams=2
[(('of', 'the'), 286), (('in', 'the'), 270), (('on', 'the'), 217), (('it', 'was'), 207), (('he', 'was'), 195)]
ngrams=3
[(('out', 'of', 'the'), 63), (('harry', 'and', 'ron'), 36), (('ron', 'and', 'hermione'), 32), (('in', 'front', 'of'), 25), (('seemed', 'to', 'be'), 22)]
ngrams=4
[(('the', 'rest', 'of', 'the'), 15), (('in', 'front', 'of', 'the'), 12), (('out', 'of', 'the', 'way'), 12), (('he', 'was', 'going', 'to'), 11), (('the', 'three', 'of', 'them'), 11)]


**Question 2: write a function named q2 that takes no parameters.**

The function will use find_ngrams_bow to answer the following question:

If you add stopwords, should you see higher or lower counts in your ngrams?

* make sure you understand the question
* answer it BEFORE writing any code
* now write the code inside q2 that will help you confirm/deny your answer. You can use any method you want (print statements, analytical calculations, etc).
* q2 provides evidence to support the truth
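As with the previous question, a tiny made-up example can help you form an expectation first. This sketch uses an illustrative sentence and stopword set and a hypothetical helper named bigrams; it only shows how removing stopwords changes which bigrams exist and how many there are.

```python
import collections

def bigrams(words):
    # hypothetical helper for this illustration only
    return list(zip(words, words[1:]))

words = "out of the house and out of the rain".split()
stopwords = {"of", "the", "and"}  # illustrative stopword set
kept = [w for w in words if w not in stopwords]

before = collections.Counter(bigrams(words))
after = collections.Counter(bigrams(kept))

print(sum(before.values()), sum(after.values()))  # 8 3
print(before[("out", "of")])    # 2
print(after[("out", "house")])  # 1
```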

In [53]:
# expectation: adding stopwords will decrease the counts of the ngrams


def load_stopwords(extra=[]):
    return extra + ['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself', "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'same', "shan't", 'she', "she'd", "she'll", "she's", 'should', "shouldn't", 'so', 'some', 'such', 'than', 'that', "that's", 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'there', "there's", 'these', 'they', "they'd", "they'll", "they're", "they've", 'this', 'those', 'through', 'to', 'too', 'under', 'until', 'up', 'very', 'was', "wasn't", 'we', "we'd", "we'll", "we're", "we've", 'were', "weren't", 'what', "what's", 'when', "when's", 'where', "where's", 'which', 'while', 'who', "who's", 'whom', 'why', "why's", 'with', "won't", 'would', "wouldn't", 'you', "you'd", "you'll", "you're", "you've", 'your', 'yours', 'yourself', 'yourselves']


def q2():
  stopword_lists = [[], load_stopwords()]
  for stopwords in stopword_lists:
    print('stopwords={}'.format(stopwords))
    simple_test(stopwords=stopwords)  # simple_test prints its own result and returns None

q2()

stopwords=[]
[(('out', 'of', 'the'), 63), (('harry', 'and', 'ron'), 36), (('ron', 'and', 'hermione'), 32), (('in', 'front', 'of'), 25), (('seemed', 'to', 'be'), 22)]
stopwords=['a', 'about', 'above', 'after', 'again', 'against', 'all', 'am', 'an', 'and', 'any', 'are', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', "can't", 'cannot', 'could', "couldn't", 'did', "didn't", 'do', 'does', "doesn't", 'doing', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', "hadn't", 'has', "hasn't", 'have', "haven't", 'having', 'he', "he'd", "he'll", "he's", 'her', 'here', "here's", 'hers', 'herself', 'him', 'himself', 'his', 'how', "how's", 'i', "i'd", "i'll", "i'm", "i've", 'if', 'in', 'into', 'is', "isn't", 'it', "it's", 'its', 'itself', "let's", 'me', 'more', 'most', "mustn't", 'my', 'myself', 'no', 'nor', 'not', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'ought', 'our', 'ours', 'ourselves', 'out', 'over', 

**Steps to submit your work:**


1.   Download the notebook from Moodle. It is recommended that you use Google Colab to work on it.
2.   Upload any supporting files using file upload option within Google Colab.
3.   Complete the exercises and/or assignments
4.   Download as .ipynb
5.   Name the file as "lastname_firstname_WeekNumber.ipynb"
6.   After following the above steps, submit the final file in Moodle





<h1><center>The End!</center></h1>