# Social Media Mining: Word and Characters
### Vincent Malic - Spring 2018

# Module 9.1  N-grams in sklearn
* We work with a small amount of made-up data to illustrate, but in principle nothing prevents the same approach from working on any collection of texts.

In [1]:
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps behind the lazy dog.",
    "The lazy brown fox leaps beyond the sleeping dog.",
    "This lazy dog has a Twitter account that the fox subscribes to."
]

## Import the count vectorizer as usual.
* Turn the texts into count vectors by initializing a Count Vectorizer and calling ``fit_transform``.
* Convert the result to an array holding the word counts for each text.
* This gives a matrix of 4 rows (one per text) and 19 columns (one per unique word).

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
vectors = cv.fit_transform(texts)

vectors = vectors.toarray()

In [7]:
vectors.shape

(4, 19)

In [8]:
vectors

array([[0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 2, 0, 0, 0],
       [0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 2, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]], dtype=int64)
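As a side note on the ``toarray()`` step, here is a minimal sketch showing that ``fit_transform`` returns a SciPy sparse matrix and that ``toarray()`` turns it into a dense NumPy array. The two-sentence subset of the data and the variable names are chosen for illustration only.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy brown fox leaps beyond the sleeping dog.",
]

# fit_transform returns a sparse matrix that stores only the nonzero counts.
sparse_vectors = CountVectorizer().fit_transform(texts)

# toarray() materializes every cell, including the zeros.
dense_vectors = sparse_vectors.toarray()

print(issparse(sparse_vectors), type(dense_vectors).__name__)
print(sparse_vectors.nnz, "stored values out of", dense_vectors.size, "cells")
```

For four short sentences the dense array is harmless, but with a large vocabulary it is usually preferable to keep the sparse form.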

### Vectorizer object attribute ``vocabulary_``
* Records *which element in a vector* corresponds to *which word*: each word maps to its column index.

In [9]:
print(cv.vocabulary_)

{'the': 15, 'quick': 11, 'brown': 3, 'fox': 5, 'jumps': 7, 'over': 10, 'lazy': 8, 'dog': 4, 'behind': 1, 'leaps': 9, 'beyond': 2, 'sleeping': 12, 'this': 16, 'has': 6, 'twitter': 18, 'account': 0, 'that': 14, 'subscribes': 13, 'to': 17}


### Word count by position in text
* The element at index 0 of the first count vector has a value of 0, so we know the word "account" appears in it 0 times.
* The element at index 3 of the first count vector has a value of 1, so we know "brown" appears in it 1 time.
* We can modify how the Count Vectorizer does its vectorization by *passing parameters to the constructor during initialization*.
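Reading counts by position is easier with the dictionary inverted. The following is a small sketch, not part of the original notebook; ``index_to_word`` and ``first_counts`` are names introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps behind the lazy dog.",
    "The lazy brown fox leaps beyond the sleeping dog.",
    "This lazy dog has a Twitter account that the fox subscribes to.",
]

cv = CountVectorizer()
vectors = cv.fit_transform(texts).toarray()

# vocabulary_ maps word -> index; invert it to map index -> word.
index_to_word = {index: word for word, index in cv.vocabulary_.items()}

# Pair every nonzero count in the first vector with its word.
first_counts = {
    index_to_word[i]: int(count)
    for i, count in enumerate(vectors[0])
    if count > 0
}
print(index_to_word[0], index_to_word[3])
print(first_counts)
```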

## **Presence vectors** versus **Count vectors**
* Some of the readings have shown that presence vectors perform better than count vectors.
* In a **presence vector**, a value of 1 indicates the word is in the text, and 0 indicates it is absent.
* The constructor has a named argument called ``binary`` that, if True, produces presence vectors.

In [11]:
cv = CountVectorizer(binary=True)
vectors = cv.fit_transform(texts)
print(vectors.shape)
vectors.toarray()

(4, 19)


array([[0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]], dtype=int64)

### Presence vectors only have values of 0 and 1.
* In the sample sentences, the word "the" appeared twice in the same text, which was represented by a 2 in the count vectors; here it is a 1.
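One way to check the relationship between the two representations: a presence vector should equal the count vector with every nonzero entry clipped to 1. A brief sketch under that assumption, using two of the sample sentences:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The lazy brown fox leaps beyond the sleeping dog.",
]

counts = CountVectorizer().fit_transform(texts).toarray()
presence = CountVectorizer(binary=True).fit_transform(texts).toarray()

# Clip every nonzero count to 1 and compare with the binary=True result.
clipped = (counts > 0).astype(int)
print(np.array_equal(presence, clipped))
print("max count:", counts.max(), "max presence:", presence.max())
```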


# N-Grams
* An n-gram is a sequence of n consecutive words.
* In first sentence: ("the", "quick") is a bigram, and ("quick", "brown") is a bigram. 
* Considering n-grams *preserves some information about the order of the words*.

### For example, with a bigram 
* The classifier can know that "quick" and "brown" occurred together.
* The cost is that there are *many combinations of 2 words*, so the vectors become *much longer*.
* Vector length increases further for 3-grams or 4-grams, and the vectors become more *sparse* because longer combinations of words are rarer.
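To make the definition concrete, here is a plain-Python sketch of n-gram extraction over a list of tokens. The helper ``word_ngrams`` is hypothetical and is not how ``sklearn`` implements it internally; it only illustrates the idea of sliding a window of n words.

```python
def word_ngrams(tokens, n):
    """Return all n-grams (as tuples) of consecutive tokens."""
    # zip over n shifted copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "the quick brown fox jumps over the lazy dog".split()

bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)

print(bigrams[:2])
print(len(tokens), "words,", len(bigrams), "bigrams,", len(trigrams), "trigrams")
```

A text of k words yields k - n + 1 n-grams, but across a corpus the number of *distinct* n-grams tends to grow with n, which is where the longer vectors come from.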

## Parameter `ngram_range` 
* Controls which n-grams are counted: the default value is ``(1, 1)``, i.e., unigrams only.
* An ``ngram_range`` of ``(2, 2)`` considers only bigrams, counting every pair of consecutive words.

In [12]:
cv = CountVectorizer(ngram_range=(2, 2))
bigram_vectors = cv.fit_transform(texts)
print(bigram_vectors.shape)
bigram_vectors.toarray()

(4, 25)


array([[0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1,
        0, 0, 0],
       [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,
        0, 0, 0],
       [0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,
        1, 0, 0],
       [1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0,
        0, 1, 1]], dtype=int64)

### Note: Dimensionality has increased. 
* There are 19 unique words, but there are 25 unique bigrams.
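The growth in dimensionality can be compared across several settings in one loop. This is an illustrative sketch; the ``(3, 3)`` setting and the ``feature_counts`` dictionary are additions that do not appear in the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps behind the lazy dog.",
    "The lazy brown fox leaps beyond the sleeping dog.",
    "This lazy dog has a Twitter account that the fox subscribes to.",
]

feature_counts = {}
for ngram_range in [(1, 1), (2, 2), (3, 3), (1, 2)]:
    cv = CountVectorizer(ngram_range=ngram_range)
    cv.fit(texts)
    # The vocabulary size is the number of columns the vectors will have.
    feature_counts[ngram_range] = len(cv.vocabulary_)
    print(ngram_range, feature_counts[ngram_range])
```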

## Peek into the vocabulary: 
* Algorithm is counting combinations of two words, instead of single words.

In [13]:
cv.vocabulary_

{'account that': 0,
 'behind the': 1,
 'beyond the': 2,
 'brown fox': 3,
 'dog has': 4,
 'fox jumps': 5,
 'fox leaps': 6,
 'fox subscribes': 7,
 'has twitter': 8,
 'jumps behind': 9,
 'jumps over': 10,
 'lazy brown': 11,
 'lazy dog': 12,
 'leaps beyond': 13,
 'over the': 14,
 'quick brown': 15,
 'sleeping dog': 16,
 'subscribes to': 17,
 'that the': 18,
 'the fox': 19,
 'the lazy': 20,
 'the quick': 21,
 'the sleeping': 22,
 'this lazy': 23,
 'twitter account': 24}

## Vectorizer that considers *both* unigrams and bigrams:
* In the constructor, we set ``ngram_range`` to the tuple ``(1, 2)``
* i.e., the minimum is a 1-gram, and the maximum is a 2-gram
* Dimensionality increases markedly

In [14]:
cv = CountVectorizer(ngram_range=(1, 2))
ngram_vectors = cv.fit_transform(texts)
print(ngram_vectors.shape)
ngram_vectors.toarray()

(4, 44)


array([[0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1,
        0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0,
        1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 2, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1]], dtype=int64)

### Considering both one-grams and bigrams has increased our dimensionality to 44.
* Element in position 1 represents the bigram, "account that"

In [15]:
cv.vocabulary_

{'account': 0,
 'account that': 1,
 'behind': 2,
 'behind the': 3,
 'beyond': 4,
 'beyond the': 5,
 'brown': 6,
 'brown fox': 7,
 'dog': 8,
 'dog has': 9,
 'fox': 10,
 'fox jumps': 11,
 'fox leaps': 12,
 'fox subscribes': 13,
 'has': 14,
 'has twitter': 15,
 'jumps': 16,
 'jumps behind': 17,
 'jumps over': 18,
 'lazy': 19,
 'lazy brown': 20,
 'lazy dog': 21,
 'leaps': 22,
 'leaps beyond': 23,
 'over': 24,
 'over the': 25,
 'quick': 26,
 'quick brown': 27,
 'sleeping': 28,
 'sleeping dog': 29,
 'subscribes': 30,
 'subscribes to': 31,
 'that': 32,
 'that the': 33,
 'the': 34,
 'the fox': 35,
 'the lazy': 36,
 'the quick': 37,
 'the sleeping': 38,
 'this': 39,
 'this lazy': 40,
 'to': 41,
 'twitter': 42,
 'twitter account': 43}

# Custom Tokenization
* ``CountVectorizer`` counts instances of words, or of n-grams, inside a string
* First, it has to split the string into a list of tokens, and then it counts them
* Sometimes the default tokenizer is not sufficient

### Some situations call for finer-grained control of tokenization process.

In [16]:
tweettext = [
    "RT @joeschmoe I'm really interested in your tutorial on #python!",
    "@annaschneider Thanks! Hopefully there'll be more #python lessons on the way!"]
cv = CountVectorizer()
cv.fit_transform(tweettext)
cv.vocabulary_

{'annaschneider': 0,
 'be': 1,
 'hopefully': 2,
 'in': 3,
 'interested': 4,
 'joeschmoe': 5,
 'lessons': 6,
 'll': 7,
 'more': 8,
 'on': 9,
 'python': 10,
 'really': 11,
 'rt': 12,
 'thanks': 13,
 'the': 14,
 'there': 15,
 'tutorial': 16,
 'way': 17,
 'your': 18}

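Note that the ``sklearn`` Count Vectorizer lowercased "RT", stripped the @ sign from the mentions and the # sign from the hashtag, and dropped the punctuation. It also broke up the contractions: "there'll" became "there" and "ll", and "I'm" disappeared entirely because the default token pattern ignores single-character tokens.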
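To see exactly which tokens the default settings produce for a single tweet, one option is to call the analyzer directly. A short sketch using ``build_analyzer()``, which returns the callable the vectorizer applies to each text (lowercasing followed by the default tokenization):

```python
from sklearn.feature_extraction.text import CountVectorizer

tweet = "RT @joeschmoe I'm really interested in your tutorial on #python!"

# The analyzer bundles preprocessing (lowercasing) and tokenization.
analyze = CountVectorizer().build_analyzer()
tokens = analyze(tweet)

print(tokens)
```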

## NLTK Tweet Tokenizer:

In [17]:
from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
tt.tokenize(tweettext[0])

['RT',
 '@joeschmoe',
 "I'm",
 'really',
 'interested',
 'in',
 'your',
 'tutorial',
 'on',
 '#python',
 '!']

## If we want CountVectorizer object to use a *different tokenizer* 
* Specify the parameter called ``tokenizer``, and
* Pass it *the name of the function we want to use to tokenize*.
* Normally, if tokenizing a single text with the initialized Tweet tokenizer, we'd type:

```python
tt.tokenize(sometext)
```

When we want to pass it to a Count Vectorizer object, we use the parameter ``tokenizer=tt.tokenize``.

In [18]:
cv = CountVectorizer(tokenizer=tt.tokenize)

# CV will now use tt.tokenize() to tokenize a text it receives.
tweetvector = cv.fit_transform(tweettext)

In [19]:
tweetvector.toarray()

array([[1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
       [2, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0]], dtype=int64)

In [20]:
cv.vocabulary_

{'!': 0,
 '#python': 1,
 '@annaschneider': 2,
 '@joeschmoe': 3,
 'be': 4,
 'hopefully': 5,
 "i'm": 6,
 'in': 7,
 'interested': 8,
 'lessons': 9,
 'more': 10,
 'on': 11,
 'really': 12,
 'rt': 13,
 'thanks': 14,
 'the': 15,
 "there'll": 16,
 'tutorial': 17,
 'way': 18,
 'your': 19}

### Count Vectorizer kept the mentions and hashtags intact
* It used the ``tt.tokenize`` function to split the text strings instead of the default tokenizer.
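The ``tokenizer`` parameter accepts any callable that takes a string and returns a list of tokens, not only NLTK's. As a rough sketch, a hand-written regular-expression tokenizer can be passed the same way. The pattern and the function name ``simple_tweet_tokenize`` below are made up for illustration and are far less thorough than ``TweetTokenizer``.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pattern: an optional @ or # prefix followed by word characters
# (with an optional apostrophe suffix), or else a single punctuation mark.
TOKEN_RE = re.compile(r"[@#]?\w+(?:'\w+)?|[^\w\s]")


def simple_tweet_tokenize(text):
    """Split a string into a list of tokens, keeping @mentions and #hashtags."""
    return TOKEN_RE.findall(text)


tweettext = [
    "RT @joeschmoe I'm really interested in your tutorial on #python!",
    "@annaschneider Thanks! Hopefully there'll be more #python lessons on the way!",
]

# Pass the function itself (no parentheses), just like tt.tokenize.
cv = CountVectorizer(tokenizer=simple_tweet_tokenize)
cv.fit(tweettext)

print(sorted(cv.vocabulary_))
```

Recent versions of ``sklearn`` may emit a warning that ``token_pattern`` is ignored when a custom tokenizer is supplied; that is expected here.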


# Character N-Grams in ``sklearn``:
* A *character n-gram* **counts characters instead of words**. 
* In some situations, character n-grams provide better performance than word n-grams. 
* Counting characters instead of words just involves a different way of tokenizing the text.

## Splitting by characters, instead of by words
* Use ``tokenizer`` parameter and pass it a function that takes a string and splits it into all its characters.
* Splitting a text string into its constituent characters by converting the string into a list directly:

In [21]:
astring = "Hello world!"
characters = list(astring)
print(characters)

['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!']


### Pass ``list`` function directly to Count Vectorizer's ``tokenizer`` parameter.

In [22]:
cv = CountVectorizer(tokenizer=list)
character_vectors = cv.fit_transform(texts)

In [23]:
character_vectors.toarray()

array([[ 8,  1,  1,  1,  1,  1,  3,  1,  1,  2,  1,  1,  1,  1,  1,  1,  4,
         1,  1,  2,  1,  2,  2,  1,  1,  1,  1,  1],
       [ 8,  1,  1,  2,  1,  2,  3,  1,  1,  3,  2,  1,  1,  1,  1,  2,  3,
         1,  1,  1,  1,  2,  2,  0,  1,  1,  1,  1],
       [ 8,  1,  2,  2,  0,  2,  6,  1,  2,  2,  1,  0,  0,  3,  0,  3,  4,
         2,  0,  1,  2,  2,  0,  0,  1,  1,  2,  1],
       [11,  1,  5,  2,  3,  1,  3,  1,  1,  4,  3,  0,  0,  1,  0,  1,  4,
         0,  0,  2,  5,  9,  2,  0,  1,  1,  1,  1]], dtype=int64)

In [24]:
cv.vocabulary_

{' ': 0,
 '.': 1,
 'a': 2,
 'b': 3,
 'c': 4,
 'd': 5,
 'e': 6,
 'f': 7,
 'g': 8,
 'h': 9,
 'i': 10,
 'j': 11,
 'k': 12,
 'l': 13,
 'm': 14,
 'n': 15,
 'o': 16,
 'p': 17,
 'q': 18,
 'r': 19,
 's': 20,
 't': 21,
 'u': 22,
 'v': 23,
 'w': 24,
 'x': 25,
 'y': 26,
 'z': 27}

The resulting vectors contain counts of characters, including the spaces and periods.
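As an aside, ``CountVectorizer`` also has an ``analyzer`` parameter, and ``analyzer="char"`` is an alternative way to get character n-grams. It is not the approach used in this notebook. The sketch below compares the two on a pair of short made-up sentences: with ``tokenizer=list`` the characters of a multi-character n-gram are joined with spaces in the vocabulary keys, whereas ``analyzer="char"`` keeps them contiguous.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["The quick brown fox.", "The lazy dog."]

# Approach used in this notebook: tokenize into characters with list().
list_cv = CountVectorizer(tokenizer=list, ngram_range=(1, 2)).fit(texts)

# Alternative (an assumption, not used in the original): built-in char analyzer.
char_cv = CountVectorizer(analyzer="char", ngram_range=(1, 2)).fit(texts)

# The bigram of "t" followed by "h" appears under a different key in each.
print("t h" in list_cv.vocabulary_, "th" in char_cv.vocabulary_)
print(len(list_cv.vocabulary_), len(char_cv.vocabulary_))
```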


# Character n-grams AND Word n-grams
* Burger and colleagues used character n-grams and word n-grams together. 
* Considered all character n-grams from 1 to 5, and all word 1-grams and 2-grams.
* They used **presence vectors** not count vectors, and we'll represent them the same way.

In [25]:
word_vectorizer = CountVectorizer(tokenizer=tt.tokenize, binary=True, ngram_range=(1, 2))
character_vectorizer = CountVectorizer(tokenizer=list, binary=True, ngram_range=(1, 5))

word_vectors = word_vectorizer.fit_transform(texts)
character_vectors = character_vectorizer.fit_transform(texts)

### We now have a matrix containing word n-gram presence values:
* 4 rows, one for each text, and 49 columns: 49 unique 1-grams and 2-grams.

In [26]:
word_vectors.shape

(4, 49)

### We also have a matrix containing character n-gram presence values:
* 481 columns: the unique character 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.

In [27]:
character_vectors.shape

(4, 481)

## Concatenate these two matrices
* The ``hstack`` function ("horizontal stack") glues the matrices together side by side, so each row is extended with the columns of the second matrix
* The result is a single new matrix with 4 rows and (49 + 481) = 530 columns.
* ``hstack`` takes a tuple as its argument, containing the matrices to be pasted together, in left-to-right order:

In [28]:
from scipy.sparse import hstack
vectors = hstack((word_vectors, character_vectors))
vectors.shape

(4, 530)
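A toy example may make the column order explicit. In this sketch, two tiny hand-made sparse matrices (not the notebook's data) are stacked horizontally; the columns of the first matrix come first, followed by the columns of the second.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two matrices with the same number of rows but different numbers of columns.
left = csr_matrix(np.array([[1, 0], [0, 1]]))
right = csr_matrix(np.array([[1, 1, 0], [0, 0, 1]]))

combined = hstack((left, right))

print(combined.shape)
print(combined.toarray())
```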

### Two separate vocabulary dictionaries, but one merged matrix
* One for word n-grams, one for the character n-grams.
* Each dictionary tells us which n-gram corresponds to the 0th element of a vector, which n-gram corresponds to the 1st element, and so on.
* After concatenating the vectors, the **positions in the character dictionary no longer hold**.

# Single merged matrix for character vectors and word vectors
* Word vector dictionary is correct (positions didn't change);
* Need to update the character n-grams dictionary.

In [31]:
character_vectorizer.vocabulary_

{' ': 0,
 '  a': 1,
 '  a  ': 2,
 '  a   t': 3,
 '  a   t w': 4,
 '  a c': 5,
 '  a c c': 6,
 '  a c c o': 7,
 '  b': 8,
 '  b e': 9,
 '  b e h': 10,
 '  b e h i': 11,
 '  b e y': 12,
 '  b e y o': 13,
 '  b r': 14,
 '  b r o': 15,
 '  b r o w': 16,
 '  d': 17,
 '  d o': 18,
 '  d o g': 19,
 '  d o g  ': 20,
 '  d o g .': 21,
 '  f': 22,
 '  f o': 23,
 '  f o x': 24,
 '  f o x  ': 25,
 '  h': 26,
 '  h a': 27,
 '  h a s': 28,
 '  h a s  ': 29,
 '  j': 30,
 '  j u': 31,
 '  j u m': 32,
 '  j u m p': 33,
 '  l': 34,
 '  l a': 35,
 '  l a z': 36,
 '  l a z y': 37,
 '  l e': 38,
 '  l e a': 39,
 '  l e a p': 40,
 '  o': 41,
 '  o v': 42,
 '  o v e': 43,
 '  o v e r': 44,
 '  q': 45,
 '  q u': 46,
 '  q u i': 47,
 '  q u i c': 48,
 '  s': 49,
 '  s l': 50,
 '  s l e': 51,
 '  s l e e': 52,
 '  s u': 53,
 '  s u b': 54,
 '  s u b s': 55,
 '  t': 56,
 '  t h': 57,
 '  t h a': 58,
 '  t h a t': 59,
 '  t h e': 60,
 '  t h e  ': 61,
 '  t o': 62,
 '  t o .': 63,
 '  t w': 64,
 '  t w i': 65,
 ...

## Change character n-grams dictionary.
* Character n-grams come after the 49 word n-grams, so shift their indices accordingly
* The space symbol has index 0, but its new index *after* all word n-grams is: `0 + 49 = 49`
* The bigram " a" has index 1, but its new index is: `1 + 49 = 50`, etc.

### Add 49 to all indices of the character dictionary

In [36]:
word_ngrams_count = len(word_vectorizer.vocabulary_)

new_character_vocabulary = {}

for character_ngram, index in character_vectorizer.vocabulary_.items():
    new_character_vocabulary[character_ngram] = index + word_ngrams_count

In [38]:
new_character_vocabulary


{' ': 49,
 '  a': 50,
 '  a  ': 51,
 '  a   t': 52,
 '  a   t w': 53,
 '  a c': 54,
 '  a c c': 55,
 '  a c c o': 56,
 '  b': 57,
 '  b e': 58,
 '  b e h': 59,
 '  b e h i': 60,
 '  b e y': 61,
 '  b e y o': 62,
 '  b r': 63,
 '  b r o': 64,
 '  b r o w': 65,
 '  d': 66,
 '  d o': 67,
 '  d o g': 68,
 '  d o g  ': 69,
 '  d o g .': 70,
 '  f': 71,
 '  f o': 72,
 '  f o x': 73,
 '  f o x  ': 74,
 '  h': 75,
 '  h a': 76,
 '  h a s': 77,
 '  h a s  ': 78,
 '  j': 79,
 '  j u': 80,
 '  j u m': 81,
 '  j u m p': 82,
 '  l': 83,
 '  l a': 84,
 '  l a z': 85,
 '  l a z y': 86,
 '  l e': 87,
 '  l e a': 88,
 '  l e a p': 89,
 '  o': 90,
 '  o v': 91,
 '  o v e': 92,
 '  o v e r': 93,
 '  q': 94,
 '  q u': 95,
 '  q u i': 96,
 '  q u i c': 97,
 '  s': 98,
 '  s l': 99,
 '  s l e': 100,
 '  s l e e': 101,
 '  s u': 102,
 '  s u b': 103,
 '  s u b s': 104,
 '  t': 105,
 '  t h': 106,
 '  t h a': 107,
 '  t h a t': 108,
 '  t h e': 109,
 '  t h e  ': 110,
 '  t o': 111,
 '  t o .': 112,
 '  t w': ...

## Working with final matrix
* Use ``word_vectorizer.vocabulary_`` for Word n-grams to find which words correspond to which index
* Use ``new_character_vocabulary`` for the Character n-grams. 
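Putting the pieces together, the two dictionaries can also be folded into a single index-to-feature lookup. The following is a sketch under a few assumptions: it uses two short made-up sentences, the default word tokenizer stands in for ``TweetTokenizer``, the n-gram ranges are kept small, and ``index_to_feature`` is a name introduced here.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

texts = ["The quick brown fox.", "The lazy dog."]

# Assumption: default word tokenizer instead of TweetTokenizer, smaller ranges.
word_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2))
character_vectorizer = CountVectorizer(tokenizer=list, binary=True, ngram_range=(1, 2))

vectors = hstack((
    word_vectorizer.fit_transform(texts),
    character_vectorizer.fit_transform(texts),
)).tocsr()

# Word indices are unchanged; character indices are shifted past the words.
offset = len(word_vectorizer.vocabulary_)
index_to_feature = {i: ("word", w) for w, i in word_vectorizer.vocabulary_.items()}
index_to_feature.update(
    (i + offset, ("char", c)) for c, i in character_vectorizer.vocabulary_.items()
)

# List the features present in the second text via its nonzero columns.
present = [index_to_feature[i] for i in vectors[1].indices]
print(len(index_to_feature) == vectors.shape[1])
print(("word", "lazy dog") in present, ("char", "z") in present)
```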