# N-gram Tutorial
An n-gram is a contiguous sequence of n items from a given sequence of text or speech. Depending on the application, the items can be phonemes, syllables, letters, words, or base pairs. N-grams are typically collected from a text or speech corpus.

An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram". Larger sizes are sometimes referred to by the value of n, e.g., "four-gram", "five-gram", and so on. (Source: [Wikipedia](https://en.wikipedia.org/wiki/N-gram))
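To make the definition concrete, here is a minimal pure-Python sketch that slides a window of size n over a sequence. The helper name `ngrams` and the sample sentence are illustrative assumptions, not part of any library used later in this tutorial.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n items from tokens, as tuples."""
    # Zip n staggered views of the sequence: tokens[0:], tokens[1:], ..., tokens[n-1:]
    return list(zip(*(tokens[i:] for i in range(n))))

words = "the quick brown fox".split()

unigrams = ngrams(words, 1)       # one tuple per word
word_bigrams = ngrams(words, 2)   # adjacent word pairs
word_trigrams = ngrams(words, 3)  # adjacent word triples

# The items do not have to be words: a string yields character n-grams
char_bigrams = ngrams("text", 2)

print(word_bigrams)
print(char_bigrams)
```

A sequence of length m yields m - n + 1 n-grams, so the four-word sentence gives three bigrams and two trigrams.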

### Pros:
* Simplicity and scalability
* Useful for many applications
* Well understood math

### Cons:
* Do not capture non-local dependencies: each n-gram covers only n adjacent items
* Lack any explicit representation of long-range dependencies
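The limitation listed under Cons can be seen in a small sketch. In the sentence below (a made-up example), the subject "dog" and its verb "is" are separated by a relative clause, so no small window contains both. The helper names `ngrams` and `co_occur` are hypothetical.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n items from tokens, as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

def co_occur(tokens, n, a, b):
    """True if at least one n-gram of size n contains both a and b."""
    return any(a in gram and b in gram for gram in ngrams(tokens, n))

# "dog" is at index 1 and "is" at index 6, so a window must span 6 tokens to hold both
tokens = "the dog that chased the cats is tired".split()

in_bigram = co_occur(tokens, 2, "dog", "is")
in_trigram = co_occur(tokens, 3, "dog", "is")
in_sixgram = co_occur(tokens, 6, "dog", "is")

print(in_bigram, in_trigram, in_sixgram)  # False False True
```

Raising n far enough does capture the pair, but the number of distinct n-grams grows quickly with n, which is why this is usually counted as a weakness of the approach.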

### 1. Sklearn implementation 
Source: [stackoverflow](http://stackoverflow.com/a/26891673/7338277)

In [79]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

a = 'text about stuff'
b = 'stuff about text'
c = 'text about ngrams'
d = 'n-grams are handy'
document = [a, b, c, d]

vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(document)

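# Create a dictionary mapping each term to its total frequency.
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
terms = vectorizer.get_feature_names_out()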
freqs = X.sum(axis=0).A1
result = dict(zip(terms, freqs))

result

{'about': 3,
 'about ngrams': 1,
 'about stuff': 1,
 'about text': 1,
 'are': 1,
 'are handy': 1,
 'grams': 1,
 'grams are': 1,
 'handy': 1,
 'ngrams': 1,
 'stuff': 2,
 'stuff about': 1,
 'text': 3,
 'text about': 2}
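Note that the result contains 'grams' rather than 'n-grams'. This comes from the tokenizer, not from the n-gram step: the default `token_pattern` documented for `CountVectorizer` is `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters, so the hyphen splits the word and the lone "n" is dropped. The sketch below reproduces that behaviour with the standard `re` module only; `tokenize` and `HYPHEN_PATTERN` are hypothetical names, and the second pattern is just one possible alternative.

```python
import re

# Default token_pattern documented for CountVectorizer: 2+ word characters
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Hypothetical alternative that also allows hyphens inside a token
HYPHEN_PATTERN = r"(?u)\b\w[\w-]*\w\b"

def tokenize(text, pattern=DEFAULT_PATTERN):
    """Lowercase the text and return all tokens matching the pattern."""
    return re.findall(pattern, text.lower())

default_tokens = tokenize("n-grams are handy")
hyphen_tokens = tokenize("n-grams are handy", HYPHEN_PATTERN)

print(default_tokens)  # ['grams', 'are', 'handy']
print(hyphen_tokens)   # ['n-grams', 'are', 'handy']

# Bigrams built from the default tokens match the keys in the result above
default_bigrams = [" ".join(pair) for pair in zip(default_tokens, default_tokens[1:])]
print(default_bigrams)  # ['grams are', 'are handy']
```

`CountVectorizer` accepts a `token_pattern` argument, so a pattern along these lines could be passed to it if hyphenated terms should stay intact.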

In [80]:
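matrix_terms = np.array(vectorizer.get_feature_names_out(), dtype=str)

# Sum down each column (axis=0) to total each term's count across all documents
matrix_freq = np.asarray(X.sum(axis=0)).ravel()
# A NumPy array holds a single dtype, so convert the counts to strings explicitly
final_matrix = np.array([matrix_terms, matrix_freq.astype(str)])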

print(final_matrix)

[['about' 'about ngrams' 'about stuff' 'about text' 'are' 'are handy'
  'grams' 'grams are' 'handy' 'ngrams' 'stuff' 'stuff about' 'text'
  'text about']
 ['3' '1' '1' '1' '1' '1' '1' '1' '1' '1' '2' '1' '3' '2']]


### 2. NLTK implementation 
Source: [stackoverflow](http://stackoverflow.com/a/17547860/7338277)

In [94]:
from nltk import bigrams
sentence = 'this is a foo bar sentences and i want to ngramize it'

grams = bigrams(sentence.split())
for gram in grams:
    print(gram)

('this', 'is')
('is', 'a')
('a', 'foo')
('foo', 'bar')
('bar', 'sentences')
('sentences', 'and')
('and', 'i')
('i', 'want')
('want', 'to')
('to', 'ngramize')
('ngramize', 'it')
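The tuples printed above can be tallied to get bigram frequencies, similar to the term counts in section 1. The sketch below is an assumption-laden illustration: it uses plain `zip` in place of `nltk.bigrams` so that it runs without NLTK installed, and the sample sentence is made up so that one bigram repeats.

```python
from collections import Counter

# Hypothetical sample sentence in which the bigram ('to', 'be') occurs twice
sentence = "to be or not to be"
tokens = sentence.split()

# Pair each token with its successor, as nltk.bigrams does for a token list
pairs = list(zip(tokens, tokens[1:]))

# Counter maps each bigram tuple to the number of times it occurs
bigram_counts = Counter(pairs)
top_bigram = bigram_counts.most_common(1)

print(top_bigram)  # [(('to', 'be'), 2)]
```

Because `nltk.bigrams` returns a generator, it can be passed directly to `Counter` in the same way.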
