# Categorical Variables

Categorical variables can take one of a fixed set of values. They are often encoded with one-hot encoding, in which the explanatory variable is represented by one binary feature for each of its possible values.

* Example: Represent cities with one-hot encoding

In [1]:
from sklearn.feature_extraction import DictVectorizer

onehot_encoder = DictVectorizer()

X = [
    {'city': 'New York'},
    {'city': 'San Francisco'},
    {'city': 'Chapel Hill'}
]
print(onehot_encoder.fit_transform(X).toarray())

[[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
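
To make the encoding concrete, here is a minimal pure-Python sketch of the same idea. The helper `one_hot_encode` is a hypothetical name, not part of scikit-learn, and it assumes (as the output above suggests) that the columns follow the sorted order of the category values.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row has a single 1 marking its category."""
    categories = sorted(set(values))  # one column per distinct value
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return categories, rows


cities = ['New York', 'San Francisco', 'Chapel Hill']
categories, rows = one_hot_encode(cities)
print(categories)
for row in rows:
    print(row)
```

Each row contains exactly one 1, so the encoding imposes no artificial ordering on the cities.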


In [2]:
# Standardization gives every feature zero mean and unit variance so that a feature
# with a large variance does not dominate the learning algorithm.
from sklearn import preprocessing
import numpy as np

X = np.array([
    [0., 0., 5., 13., 9., 1.],
    [0., 0., 13., 15., 10., 15.],
    [0., 3., 15., 2., 0., 11.]
])
print(preprocessing.scale(X))

[[ 0.         -0.70710678 -1.38873015  0.52489066  0.59299945 -1.35873244]
 [ 0.         -0.70710678  0.46291005  0.87481777  0.81537425  1.01904933]
 [ 0.          1.41421356  0.9258201  -1.39970842 -1.4083737   0.33968311]]


- StandardScaler works by subtracting the mean of a feature from each instance's value and then dividing by the feature's standard deviation.

- RobustScaler is an alternative that subtracts the median and divides by the interquartile range. Quartiles are calculated by splitting a sorted dataset into four parts of equal size. The second quartile is the median, and the interquartile range is the difference between the third and first quartiles. Using these statistics mitigates the effect of outliers.
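
The two scaling rules can be sketched for a single feature column with the standard library alone. The function names `standardize` and `robust_scale` are hypothetical, and the sketch assumes the population standard deviation and linearly interpolated quartiles; it is an illustration of the arithmetic, not scikit-learn's implementation.

```python
import statistics


def standardize(column):
    """Subtract the mean, then divide by the population standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]


def robust_scale(column):
    """Subtract the median, then divide by the interquartile range."""
    median = statistics.median(column)
    q1, _, q3 = statistics.quantiles(column, n=4, method='inclusive')
    return [(x - median) / (q3 - q1) for x in column]


column = [5.0, 13.0, 15.0]  # third column of X in the cell above
print(standardize(column))

with_outlier = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up data with one outlier
print(standardize(with_outlier))
print(robust_scale(with_outlier))
```

With the outlier present, standardization squeezes the four ordinary values into a narrow band, while the median and interquartile range keep them evenly spread.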

# Extracting Features from Text

* Bag-of-Words Model: The most common representation of text. The model uses a multiset, or "bag", that encodes the words that appear in a text. It does not encode the text's syntax, ignores the order of words, and disregards all grammar. It can be thought of as an extension of one-hot encoding: it creates one feature for each word of interest in the text. The intuition is that documents containing similar words have similar meanings. The model can be used for document classification and retrieval. A collection of documents is called a corpus.

  * Example: Using a corpus with 2 documents

In [1]:
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]

# The number of elements in a vector is its dimension. The corpus has 8 unique words,
# so each document is represented by a vector with 8 elements. vocabulary_ is a
# dictionary that maps each word to its index in the feature vector.
# CountVectorizer lowercases the documents and tokenizes them with a regular
# expression that extracts sequences of two or more alphanumeric characters.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 1, 'in': 3, 'basketball': 0, 'lost': 4, 'the': 6, 'game': 2}
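
The steps described in the comments above can be sketched without scikit-learn. The names `tokenize` and `bag_of_words` are hypothetical helpers, and the sketch assumes a token pattern of two or more word characters and a vocabulary indexed in sorted order, which is what the output above shows.

```python
import re

TOKEN_PATTERN = re.compile(r'\b\w\w+\b')  # tokens of two or more word characters


def tokenize(document):
    """Lowercase a document and extract its tokens."""
    return TOKEN_PATTERN.findall(document.lower())


def bag_of_words(documents):
    """Return (vocabulary, vectors) with one count per vocabulary word."""
    tokenized = [tokenize(document) for document in documents]
    words = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {word: index for index, word in enumerate(words)}
    vectors = []
    for tokens in tokenized:
        vector = [0] * len(vocabulary)
        for token in tokens:
            vector[vocabulary[token]] += 1
        vectors.append(vector)
    return vocabulary, vectors


documents = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]
vocabulary, vectors = bag_of_words(documents)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Building the vocabulary first and the vectors second is why a dictionary-based vectorizer needs two passes over the corpus.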


In [5]:
corpus.append('I ate a sandwich')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]
{'unc': 9, 'played': 6, 'duke': 2, 'in': 4, 'basketball': 1, 'lost': 5, 'the': 8, 'game': 3, 'ate': 0, 'sandwich': 7}


Notice that the first two feature vectors are closer to each other than either is to the third. The first two documents are therefore more similar under a metric such as Euclidean distance.

*Euclidean distance* between two vectors is the Euclidean norm (L2 norm) of the difference between the two vectors. A norm is a function that assigns a non-negative size to a vector; the Euclidean norm is the vector's magnitude.

In [6]:
from sklearn.metrics.pairwise import euclidean_distances
# Keep X sparse: recent scikit-learn releases reject the np.matrix returned by .todense()
X = vectorizer.fit_transform(corpus)
print('Distance between 1st and 2nd documents:', euclidean_distances(X[0], X[1]))
print('Distance between 1st and 3rd documents:', euclidean_distances(X[0], X[2]))
print('Distance between 2nd and 3rd documents:', euclidean_distances(X[1], X[2]))

Distance between 1st and 2nd documents: [[2.44948974]]
Distance between 1st and 3rd documents: [[2.64575131]]
Distance between 2nd and 3rd documents: [[2.64575131]]
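
The same distances can be checked by hand. This is a minimal sketch with a hypothetical helper `euclidean_distance`; the three vectors are copied from the three-document count matrix printed earlier.

```python
import math


def euclidean_distance(u, v):
    """L2 norm of the element-wise difference between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Count vectors copied from the three-document output above.
doc1 = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]
doc2 = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]
doc3 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]

print(euclidean_distance(doc1, doc2))
print(euclidean_distance(doc1, doc3))
print(euclidean_distance(doc2, doc3))
```

The first two documents differ in six positions and each differs from the third in seven, so the distances are the square roots of 6 and 7.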


A basic strategy for reducing dimensionality is converting text to lowercase, since letter case usually does not affect the meaning of a word. **Stop Word Filtering** is a second strategy, which removes words that are common to most documents in the corpus. Examples include determiners (the, a, an), auxiliary verbs (do, be, will), and prepositions (on, around, beneath). These words are known as stop words. They are function words that contribute to a text's meaning through grammar rather than through their own denotations, and grammar is not considered by the bag-of-words model.

#### Example: stop word filtering

In [7]:
vectorizer = CountVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwich': 6}
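
Stop word filtering amounts to dropping tokens that appear in a fixed set. The sketch below uses a hypothetical, deliberately tiny stop list and a hypothetical helper name; scikit-learn's built-in English list is much longer.

```python
import re

# Hypothetical stop list for illustration only.
STOP_WORDS = {'the', 'in', 'a', 'an', 'and', 'on'}


def tokenize_without_stop_words(document, stop_words=STOP_WORDS):
    """Lowercase, tokenize, and drop any token found in stop_words."""
    tokens = re.findall(r'\b\w\w+\b', document.lower())
    return [token for token in tokens if token not in stop_words]


print(tokenize_without_stop_words('UNC played Duke in basketball'))
print(tokenize_without_stop_words('Duke lost the basketball game'))
```

Every removed word is one fewer dimension in the feature vectors, which is how the vocabulary above shrank from 10 entries to 8.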


**Stemming and Lemmatization** are used to condense inflected and derived forms of a word into a single feature.

Notice how the documents below have similar meanings, but their vectors have nothing in common because forms such as 'ate' and 'eaten' are treated as separate features.

In [2]:
corpus = [
    'He ate the sandwichs',
    'Every sandwich was eaten by him'
]
vectorizer = CountVectorizer(binary=True, stop_words='english')
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)

[[1 0 0 1]
 [0 1 1 0]]
{'ate': 0, 'sandwichs': 3, 'sandwich': 2, 'eaten': 1}


*__Lemmatization__ is the process of determining the lemma, the morphological root, of an inflected word based on its context. __Stemming__ has a similar goal but is more simplistic: it strips affixes and does not attempt to produce the morphological root of a word. Lemmatization often requires a lexical resource such as WordNet, while stemming uses rules alone.*

In [10]:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

corpus =  [
    'I am gathering ingredients for the sandwich.',
    'There were many wizards at the gathering'
]
print(lemmatizer.lemmatize('gathering', 'v'))
print(lemmatizer.lemmatize('gathering', 'n'))

gather
gathering


In [12]:
# Notice that PorterStemmer cannot distinguish between the noun and verb forms of the word
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('gathering'))

gather


In [17]:
from nltk import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag

wordnet_tags = ['n', 'v']
corpus = [
    'He ate the sandwichs',
    'Every sandwich was eaten by him'
]
stemmer = PorterStemmer()
print('Stemmed:', [[stemmer.stem(token) for token in word_tokenize(document)] for document in corpus])

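def lemmatize(token, tag):
    # Penn Treebank tags start with 'N' for nouns and 'V' for verbs. Only those are
    # passed to WordNet; every other token is returned unchanged.
    if tag[0].lower() in wordnet_tags:
        return lemmatizer.lemmatize(token, tag[0].lower())
    return token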

lemmatizer = WordNetLemmatizer()
tagged_corpus = [pos_tag(word_tokenize(document)) for document in corpus]
print('Lemmatized:', [[lemmatize(token, tag) for token, tag in document] for document in tagged_corpus])

Stemmed: [['He', 'ate', 'the', 'sandwich'], ['everi', 'sandwich', 'wa', 'eaten', 'by', 'him']]
Lemmatized: [['He', 'eat', 'the', 'sandwich'], ['Every', 'sandwich', 'be', 'eat', 'by', 'him']]
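
The difference between the two approaches can be illustrated with a toy example. Everything here is hypothetical: `toy_stem` applies three crude suffix rules and is far simpler than the Porter algorithm, and `LEMMAS` is a four-entry stand-in for a lexical resource such as WordNet.

```python
# Toy suffix rules, tried in order; not the Porter algorithm.
SUFFIXES = ('ing', 'es', 's')

# Toy lexical resource keyed by (word, part of speech).
LEMMAS = {
    ('gathering', 'v'): 'gather',
    ('gathering', 'n'): 'gathering',
    ('ate', 'v'): 'eat',
    ('eaten', 'v'): 'eat',
}


def toy_stem(word):
    """Strip the first matching suffix, ignoring context and part of speech."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


def toy_lemmatize(word, pos):
    """Look the word up with its part of speech; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)


print(toy_stem('gathering'))
print(toy_lemmatize('gathering', 'v'), toy_lemmatize('gathering', 'n'))
print(toy_stem('ate'), toy_stem('eaten'))
print(toy_lemmatize('ate', 'v'), toy_lemmatize('eaten', 'v'))
```

The rule-based stemmer gives one answer for 'gathering' regardless of its role in the sentence and cannot relate 'ate' to 'eaten', while the lookup can do both because it knows the part of speech and the irregular forms.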


In [20]:
# The frequency of a word in a document can indicate its significance.
# This cell counts raw term frequencies, the "tf" part of tf-idf weighting.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich']
vectorizer = CountVectorizer(stop_words='english')
frequencies = np.array(vectorizer.fit_transform(corpus).todense())[0]
print(frequencies)
print('Token indices %s' % vectorizer.vocabulary_)
for token, index in vectorizer.vocabulary_.items():
    print('The token "%s" appears %s times' % (token, frequencies[index]))

[2 1 3 1 1]
Token indices {'dog': 1, 'ate': 0, 'sandwich': 2, 'wizard': 4, 'transfigured': 3}
The token "dog" appears 1 times
The token "ate" appears 2 times
The token "sandwich" appears 3 times
The token "wizard" appears 1 times
The token "transfigured" appears 1 times


In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]
vectorizer = TfidfVectorizer(stop_words='english')
print(vectorizer.fit_transform(corpus).todense())

[[0.75458397 0.37729199 0.53689271 0.         0.        ]
 [0.         0.         0.44943642 0.6316672  0.6316672 ]]
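
The weights above can be reproduced by hand. This sketch assumes the defaults I believe `TfidfVectorizer` uses: raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf` is hypothetical, and the token lists are written out manually as the corpus would look after lowercasing, tokenizing, and stop word removal.

```python
import math
from collections import Counter


def tfidf(tokenized_corpus):
    """Return (vocabulary, rows) of L2-normalized tf-idf weights."""
    vocabulary = sorted({t for tokens in tokenized_corpus for t in tokens})
    n_documents = len(tokenized_corpus)
    # Document frequency: the number of documents each term appears in.
    df = Counter(t for tokens in tokenized_corpus for t in set(tokens))
    # Smoothed idf (assumed to match the vectorizer's defaults).
    idf = {t: math.log((1 + n_documents) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized_corpus:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])
    return vocabulary, rows


tokenized_corpus = [
    ['dog', 'ate', 'sandwich', 'ate', 'sandwich'],
    ['wizard', 'transfigured', 'sandwich'],
]
vocabulary, rows = tfidf(tokenized_corpus)
print(vocabulary)
for row in rows:
    print([round(w, 8) for w in row])
```

'sandwich' appears in both documents, so its idf is the minimum value of 1 and it is weighted below 'ate' in the first document even though both words occur twice there.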


The hashing trick: creating a dictionary for the corpus has two drawbacks:
1. Two passes over the corpus are required, one to create the dictionary and one to create the feature vectors.
2. The dictionary must be stored in memory, which is expensive for large corpora.
  * Solution: apply a hash function to each token to compute its feature index directly, so no dictionary is needed.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer
corpus = ['the', 'ate', 'bacon', 'cat']
vectorizer = HashingVectorizer(n_features=6)
print(vectorizer.transform(corpus).todense())

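# HashingVectorizer uses a signed hash function, so errors from hash collisions
# tend to cancel each other out rather than accumulate.
# The resulting model is harder to inspect because the mapping from feature
# indices back to tokens is not stored.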

[[-1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0.  0.]]
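
The idea can be sketched in a few lines. This is an illustration under stated assumptions, not scikit-learn's implementation: it uses SHA-256 from the standard library where `HashingVectorizer` uses a different hash function, so the indices and signs will not match the output above, and it skips normalization. The name `hashed_features` is hypothetical.

```python
import hashlib


def hashed_features(tokens, n_features=6):
    """Map tokens to a fixed-length vector without storing a vocabulary."""
    vector = [0.0] * n_features
    for token in tokens:
        digest = hashlib.sha256(token.encode('utf-8')).digest()
        # The first four bytes choose the feature index.
        index = int.from_bytes(digest[:4], 'big') % n_features
        # A separate byte chooses the sign, so colliding tokens can cancel.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vector[index] += sign
    return vector


for token in ['the', 'ate', 'bacon', 'cat']:
    print(token, hashed_features([token]))
```

The function is stateless: a document can be vectorized in a single pass, and the same token always lands in the same position with the same sign.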


### Word Embeddings
Representations of text that mitigate some shortcomings of the bag-of-words model. They use a vector rather than a scalar to represent each token. The vectors are dense and typically have between 50 and 500 dimensions. Semantically similar words are represented by vectors that are near each other.

See [this link](https://radimrehurek.com/gensim/install.html) for gensim installation instructions.

Download and gunzip the word2vec embeddings from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing

In [3]:
import gensim

# The model is large
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin',
                                                        binary=True)

# Let's inspect the embedding for "cat"
embedding = model['cat']
print("Dimensions: %s" % embedding.shape)
print(embedding)

In [None]:
# See which words are more similar
print(model.similarity('cat', 'dog'))
print(model.similarity('cat', 'sandwich'))

In [None]:
# Predict similar words
print(model.most_similar(positive=['puppy', 'cat'], negative=['kitten'], topn=3))
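
The similarity scores in the cells above are cosine similarities between embedding vectors. The sketch below shows the computation on made-up 3-dimensional vectors; the values in `toy_embeddings` are invented for illustration and are not taken from the word2vec model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors for illustration only.
toy_embeddings = {
    'cat': [0.9, 0.8, 0.1],
    'dog': [0.8, 0.9, 0.2],
    'sandwich': [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_embeddings['cat'], toy_embeddings['dog']))
print(cosine_similarity(toy_embeddings['cat'], toy_embeddings['sandwich']))
```

Because cosine similarity depends only on direction, two words can be judged similar even when their vectors have different magnitudes.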