# Extracting features from categorical variables

Categorical variables are commonly encoded using one-of-K or one-hot encoding, in which the explanatory variable is encoded using one binary feature for each of the variable's possible values.

In [1]:
from sklearn.feature_extraction import DictVectorizer

In [2]:
one_hot = DictVectorizer()

In [3]:
instances = [
    {'city': 'Shanghai'},
    {'city': 'Shangxi'},
    {'city': 'Hangzhou'}
]

In [4]:
hot_feature = one_hot.fit_transform(instances)

In [5]:
# sparse matrix
print hot_feature

  (0, 1)	1.0
  (1, 2)	1.0
  (2, 0)	1.0


In [6]:
# matrix
print hot_feature.toarray()

[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]
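
To make the encoding concrete, the following is a minimal hand-rolled sketch of one-hot encoding for the same three instances. It assumes, as `DictVectorizer` does for string values, that each feature is named `key=value` and that columns are sorted alphabetically; the variable names `feature_names`, `column_of`, and `encoded` are illustrative only.

```python
# Hand-rolled one-hot encoding of a single categorical feature.
instances = [
    {'city': 'Shanghai'},
    {'city': 'Shangxi'},
    {'city': 'Hangzhou'},
]

# One "key=value" feature per distinct value, sorted alphabetically.
feature_names = sorted({'%s=%s' % (key, value)
                        for instance in instances
                        for key, value in instance.items()})
column_of = {name: column for column, name in enumerate(feature_names)}

# Each row gets 1.0 in the column of its own value and 0.0 elsewhere.
encoded = []
for instance in instances:
    row = [0.0] * len(feature_names)
    for key, value in instance.items():
        row[column_of['%s=%s' % (key, value)]] = 1.0
    encoded.append(row)

print(feature_names)
print(encoded)
```

The rows agree with the dense matrix printed above: `Hangzhou` takes column 0, `Shanghai` column 1, and `Shangxi` column 2.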


# Extracting features from text

Text must be transformed to a different representation that encodes as much of its meaning as possible in a feature vector.

## The bag-of-words representation

This representation uses a multiset, or bag, that encodes the words that appear in a text; the bag-of-words does not encode any of the text's syntax, ignores the order of words, and disregards all grammar. 

The bag-of-words model is motivated by the intuition that documents containing similar words often have similar meanings. The bag-of-words model can be used effectively for document classification and retrieval despite the limited information that it encodes.

In [7]:
'''
    A collection of documents is called a corpus. Let's use a corpus with the following three documents to examine the
    bag-of-words model. The corpus's unique words comprise its vocabulary.
'''
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]

In [8]:
from sklearn.feature_extraction.text import  CountVectorizer

In [9]:
# The corpus is passed to fit_transform, not to the constructor
vectorizer = CountVectorizer()

In [10]:
# The first two documents are more similar in meaning to each other than either is to the third document
print vectorizer.fit_transform(corpus).todense()

[[0 1 1 0 1 0 1 0 0 1]
 [0 1 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0]]


In [11]:
print vectorizer.vocabulary_

{u'duke': 2, u'basketball': 1, u'lost': 5, u'played': 6, u'in': 4, u'game': 3, u'sandwich': 7, u'unc': 9, u'ate': 0, u'the': 8}
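
The vocabulary has ten entries even though the corpus contains twelve distinct words, because `CountVectorizer` lowercases the text and, by default, keeps only tokens of two or more characters, so `I` and `a` are dropped. A rough pure-Python sketch of the same counting, under that assumption, might look like this; `tokenize` and `count_vector` are hypothetical helper names, not scikit-learn functions.

```python
import re
from collections import Counter

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich',
]

def tokenize(document):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', document.lower())

tokenized = [tokenize(document) for document in corpus]

# Assign each distinct token an index in alphabetical order.
vocabulary = {term: index for index, term in
              enumerate(sorted({t for tokens in tokenized for t in tokens}))}

def count_vector(tokens):
    # One element per vocabulary entry, holding that term's count.
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        vector[vocabulary[term]] = count
    return vector

vectors = [count_vector(tokens) for tokens in tokenized]
for vector in vectors:
    print(vector)
```

The three vectors match the rows of the matrix printed by `fit_transform` above.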


In [12]:
from sklearn.metrics.pairwise import euclidean_distances

In [13]:
feature_vectors = vectorizer.fit_transform(corpus).todense()

In [14]:
euclidean_distances(feature_vectors[0], feature_vectors[1])

array([[ 2.44948974]])

In [15]:
euclidean_distances(feature_vectors[0], feature_vectors[2])

array([[ 2.64575131]])

In [16]:
euclidean_distances(feature_vectors[1], feature_vectors[2])

array([[ 2.64575131]])
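
The Euclidean distance between two vectors is $d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$. For 0/1 vectors like these, the squared distance is simply the number of positions at which the two documents differ. The sketch below recomputes the three distances by hand from the count vectors printed earlier; `euclidean_distance` is an illustrative helper, not the scikit-learn function.

```python
import math

def euclidean_distance(a, b):
    # Square root of the summed squared element-wise differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

doc0 = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]  # 'UNC played Duke in basketball'
doc1 = [0, 1, 1, 1, 0, 1, 0, 0, 1, 0]  # 'Duke lost the basketball game'
doc2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # 'I ate a sandwich'

d01 = euclidean_distance(doc0, doc1)  # vectors differ in 6 positions: sqrt(6)
d02 = euclidean_distance(doc0, doc2)  # vectors differ in 7 positions: sqrt(7)
d12 = euclidean_distance(doc1, doc2)  # vectors differ in 7 positions: sqrt(7)
print(d01)
print(d02)
print(d12)
```

The first two documents share `duke` and `basketball`, which is why they end up closer to each other than either is to the third.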

# Sparse Vectors

High-dimensional feature vectors that have many zero-valued elements are called sparse vectors. Working with high-dimensional data creates two problems:

1. High-dimensional vectors require more memory than smaller vectors.

2. The curse of dimensionality, also known as the Hughes effect: as the feature space's dimensionality increases, more training data is required to ensure that there are enough training instances with each combination of the features' values.
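
The memory cost in the first point is usually reduced by storing only the non-zero elements, which is the idea behind the `(row, column) value` output printed for the one-hot matrix at the start of this document. A minimal sketch of that idea with a plain dictionary follows; the names `sparse` and `to_dense` are illustrative and this is not how SciPy implements its sparse formats.

```python
# Dense form: every element is stored, including the zeros.
dense = [0, 1, 1, 0, 1, 0, 1, 0, 0, 1]

# Sparse form: keep only index -> value pairs for non-zero elements.
sparse = {index: value for index, value in enumerate(dense) if value != 0}

def to_dense(sparse_vector, size):
    # Rebuild the full vector, filling the missing positions with zeros.
    return [sparse_vector.get(index, 0) for index in range(size)]

print(sparse)
print(to_dense(sparse, len(dense)) == dense)
```

With a vocabulary of tens of thousands of words and documents that each use only a few hundred of them, the saving from this representation becomes substantial.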

# How to reduce dimensionality

## 1. Stop-words filter

In [17]:
vectorizer = CountVectorizer(stop_words='english')

In [18]:
feature_vectors = vectorizer.fit_transform(corpus)

In [19]:
print feature_vectors.todense()

[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]


In [20]:
print vectorizer.vocabulary_

{u'duke': 2, u'basketball': 1, u'lost': 4, u'played': 5, u'game': 3, u'sandwich': 6, u'unc': 7, u'ate': 0}
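
Stop-word filtering amounts to dropping tokens that appear in a fixed list before the vocabulary is built. The sketch below uses a deliberately tiny, hypothetical stop list containing only the two words that matter for this corpus; scikit-learn's built-in English list is far longer.

```python
import re

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich',
]

# Hypothetical stop list for illustration only.
stop_words = {'in', 'the'}

def tokenize(document):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', document.lower())

# Remove stop words before assigning vocabulary indices.
kept = [[t for t in tokenize(d) if t not in stop_words] for d in corpus]
vocabulary = {term: index for index, term in
              enumerate(sorted({t for tokens in kept for t in tokens}))}
print(sorted(vocabulary.items(), key=lambda item: item[1]))
```

Removing `in` and `the` shrinks the vocabulary from ten terms to eight, matching the mapping printed above.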


## 2. Stemming and lemmatization

In [21]:
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]

In [22]:
vectorizer = CountVectorizer(binary=True, stop_words='english')
print vectorizer.fit_transform(corpus).todense()
print vectorizer.vocabulary_

[[1 0 0 1]
 [0 1 1 0]]
{u'sandwich': 2, u'ate': 0, u'sandwiches': 3, u'eaten': 1}


In [23]:
import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


True

In [24]:
from nltk.stem.wordnet import WordNetLemmatizer

In [25]:
lemmatizer = WordNetLemmatizer()
print lemmatizer.lemmatize('gathering', 'v')
print lemmatizer.lemmatize('gathering', 'n')

gather
gathering


In [26]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print stemmer.stem('gathering')

gather


In [27]:
from nltk import word_tokenize

In [28]:
from nltk.stem import PorterStemmer

In [29]:
wordnet_tags = ['v', 'n']

In [30]:
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]

In [31]:
stemmer = PorterStemmer()

In [32]:
print 'Stemmed:', [[stemmer.stem(token) for token in word_tokenize(document)] for document in corpus]

Stemmed: [['He', 'ate', 'the', u'sandwich'], [u'everi', 'sandwich', u'wa', 'eaten', 'by', 'him']]


In [33]:
word_tokenize(corpus[0])

['He', 'ate', 'the', 'sandwiches']

In [34]:
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import pos_tag

In [35]:
lemmatizer = WordNetLemmatizer()

In [47]:
# [('He', 'PRP'), ('ate', 'VBD'), ('the', 'DT'), ('sandwiches', 'NNS')]
tagged_corpus = [pos_tag(word_tokenize(document)) for document in corpus]
print tagged_corpus

[[('He', 'PRP'), ('ate', 'VBD'), ('the', 'DT'), ('sandwiches', 'NNS')], [('Every', 'DT'), ('sandwich', 'NN'), ('was', 'VBD'), ('eaten', 'VBN'), ('by', 'IN'), ('him', 'PRP')]]


In [55]:
def lemmatize(token, tag):
    if tag[0].lower() in wordnet_tags:
        return lemmatizer.lemmatize(token, tag[0].lower())
    return token

In [56]:
print 'Lemmatized:', [[lemmatize(token, tag) for token, tag in document] for document in tagged_corpus]

Lemmatized: [['He', u'eat', 'the', u'sandwich'], ['Every', 'sandwich', u'be', u'eat', 'by', 'him']]
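
The `lemmatize` function above only handles nouns and verbs, because it passes the lowercased first letter of the Penn Treebank tag straight to WordNet. WordNet uses `'a'` for adjectives and `'r'` for adverbs, so extending the approach needs an explicit mapping. One possible sketch is shown below; `wordnet_pos` is a hypothetical helper and is not part of NLTK.

```python
def wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if unmapped."""
    prefix_to_pos = {
        'NN': 'n',  # NN, NNS, NNP, NNPS
        'VB': 'v',  # VB, VBD, VBG, VBN, VBP, VBZ
        'JJ': 'a',  # JJ, JJR, JJS
        'RB': 'r',  # RB, RBR, RBS
    }
    return prefix_to_pos.get(penn_tag[:2])

for tag in ['PRP', 'VBD', 'DT', 'NNS', 'JJ', 'RB']:
    print((tag, wordnet_pos(tag)))
```

A token whose tag maps to `None` would be returned unchanged, in the same way that `lemmatize` falls through to `return token`.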


In [57]:
lemmatizer.lemmatize("ate", "v")

u'eat'

In [58]:
help(pos_tag)

Help on function pos_tag in module nltk.tag:

pos_tag(tokens, tagset=None, lang='eng')
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.
    
        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
        ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
        [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
        ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
    
    NB. Use `pos_tag_sents()` for efficient tagging of more than one sentence.
    
    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :param tagset: the tagset to be u

# Extending bag-of-words with TF-IDF weights

Instead of using a binary value for each element in the feature vector, we will now use an integer that represents the number of times each word appears in the document.

In [62]:
corpus = ['The dog ate a sandwich, the wizard transfigured a sandwich, and I ate a sandwich']
vectorizer = CountVectorizer(stop_words='english')
print vectorizer.fit_transform(corpus).todense()

[[2 1 3 1 1]]


In [63]:
vectorizer.vocabulary_

{u'ate': 0, u'dog': 1, u'sandwich': 2, u'transfigured': 3, u'wizard': 4}

## Normalization

$$tf(t,d) = \frac{f(t,d)+1}{\|x\|}$$ where $f(t,d)$ is the frequency of term $t$ in document $d$ and $\|x\|$ is the L2 norm of the count vector.

## Logarithmically scaled term frequencies

$$tf(t,d) = \log(f(t,d)+1)$$ which scales the counts to a more limited range.

## Augmented term frequencies 

$$tf(t,d) = 0.5 + \frac{0.5 \cdot f(t,d)}{\max\{f(w,d) : w \in d\}}$$ which further mitigates the bias toward longer documents.
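
As a worked example, the three term-frequency variants can be applied to the counts from the single-document corpus above (`[[2 1 3 1 1]]`). The sketch follows the formulas as they are written in this section and assumes the natural logarithm, since the text does not state a base; it does not claim to reproduce what any scikit-learn class computes.

```python
import math

# Raw counts f(t, d) taken from the CountVectorizer output above.
counts = {'ate': 2, 'dog': 1, 'sandwich': 3, 'transfigured': 1, 'wizard': 1}

# L2 norm of the count vector: sqrt(4 + 1 + 9 + 1 + 1) = 4.0
l2_norm = math.sqrt(sum(c ** 2 for c in counts.values()))

# Normalized term frequency: (f + 1) / ||x||
normalized_tf = {t: (c + 1) / l2_norm for t, c in counts.items()}

# Logarithmically scaled term frequency: log(f + 1), natural log assumed
log_tf = {t: math.log(c + 1) for t, c in counts.items()}

# Augmented term frequency: 0.5 + 0.5 * f / (largest count in the document)
max_count = max(counts.values())
augmented_tf = {t: 0.5 + 0.5 * c / max_count for t, c in counts.items()}

print(normalized_tf)
print(log_tf)
print(augmented_tf)
```

The most frequent term, `sandwich`, receives an augmented frequency of exactly 1.0, and every other term falls between 0.5 and 1.0.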

All three methods represent the frequencies of terms in a document while mitigating the effects of different document sizes. However, some words can be thought of as corpus-specific stop words and may not be useful for calculating the similarity of documents. The `inverse document frequency (IDF)` is a measure of how rare or common a word is in a corpus.

## Inverse document frequency

$$idf(t,D) = \log\frac{N}{1+|\{d \in D : t \in d\}|}$$ Here, $N$ is the total number of documents in the corpus and $|\{d \in D : t \in d\}|$ is the number of documents in the corpus that contain the term $t$.
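
A small sketch of this formula, applied to a made-up four-document corpus, shows how rarer terms receive larger weights. The corpus, the `idf` helper, and the choice of the natural logarithm are all assumptions made for illustration.

```python
import math

# Hypothetical corpus; each document is reduced to its set of distinct terms.
documents = [
    {'dog', 'ate', 'sandwich'},
    {'wizard', 'transfigured', 'sandwich'},
    {'dog', 'chased', 'wizard'},
    {'sandwich', 'shop'},
]

def idf(term, documents):
    # log(N / (1 + number of documents containing the term))
    n_containing = sum(1 for d in documents if term in d)
    return math.log(float(len(documents)) / (1 + n_containing))

for term in ['shop', 'dog', 'sandwich']:
    print((term, idf(term, documents)))
```

Here `shop` appears in one document, `dog` in two, and `sandwich` in three, so their weights decrease in that order. With the `1 +` in the denominator, a term that appears in every document would receive a negative weight.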

In [64]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [65]:
corpus = [
    'The dog ate a sandwich and I ate a sandwich',
    'The wizard transfigured a sandwich'
]

In [66]:
vectorizer = TfidfVectorizer(stop_words='english')

In [67]:
print vectorizer.fit_transform(corpus).todense()

[[ 0.75458397  0.37729199  0.53689271  0.          0.        ]
 [ 0.          0.          0.44943642  0.6316672   0.6316672 ]]


In [68]:
print vectorizer.vocabulary_

{u'sandwich': 2, u'wizard': 4, u'dog': 1, u'transfigured': 3, u'ate': 0}
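
The weights printed above do not come from the IDF formula exactly as written earlier. With its default settings, `TfidfVectorizer` uses raw counts as the term frequency, a smoothed IDF of $\ln\frac{1+N}{1+df(t)} + 1$, and then scales each row to unit L2 norm. The sketch below recomputes the matrix by hand under those defaults; the stop words are removed manually in the token lists, and `tfidf_vector` is an illustrative helper.

```python
import math
from collections import Counter

# The two documents after lowercasing and removing English stop words by hand.
documents = [
    ['dog', 'ate', 'sandwich', 'ate', 'sandwich'],
    ['wizard', 'transfigured', 'sandwich'],
]

vocabulary = sorted({t for d in documents for t in d})
n_documents = len(documents)

# Number of documents that contain each term.
document_frequency = {t: sum(1 for d in documents if t in d) for t in vocabulary}

# Smoothed IDF: ln((1 + N) / (1 + df)) + 1
idf = {t: math.log((1.0 + n_documents) / (1.0 + document_frequency[t])) + 1.0
       for t in vocabulary}

def tfidf_vector(tokens):
    # Raw count times IDF, then divide the row by its L2 norm.
    counts = Counter(tokens)
    weights = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf = [tfidf_vector(d) for d in documents]
print(vocabulary)
for row in tfidf:
    print([round(w, 8) for w in row])
```

Because `sandwich` occurs in both documents, its IDF is the minimum value of 1, which is why it is weighted below `ate` in the first row even though both occur twice.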


## Space-efficient feature vectorizing with the hashing trick

In [71]:
from sklearn.feature_extraction.text import HashingVectorizer
corpus = ['the', 'ate', 'bacon', 'cat']
vectorizer = HashingVectorizer(n_features=6)
print vectorizer.transform(corpus).todense()

[[-1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.]
 [ 0.  0.  0.  0. -1.  0.]
 [ 0.  1.  0.  0.  0.  0.]]
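
The hashing trick avoids storing a vocabulary: each token is hashed, the hash modulo `n_features` picks the column, and a sign derived from the hash lets colliding tokens partly cancel rather than always accumulate. The sketch below illustrates the idea with MD5 from the standard library as a stand-in hash. `HashingVectorizer` itself uses a signed 32-bit MurmurHash3, so the columns and signs produced here will not match the output above; `hashed_vector` is a hypothetical helper.

```python
import hashlib

def hashed_vector(tokens, n_features):
    # No vocabulary is kept: the hash alone decides column and sign.
    vector = [0.0] * n_features
    for token in tokens:
        digest = int(hashlib.md5(token.encode('utf-8')).hexdigest(), 16)
        index = digest % n_features                       # column for this token
        sign = 1.0 if (digest >> 64) % 2 == 0 else -1.0   # one hash bit picks the sign
        vector[index] += sign
    return vector

corpus = ['the', 'ate', 'bacon', 'cat']
hashed = [hashed_vector([document], 6) for document in corpus]
for row in hashed:
    print(row)
```

Since no vocabulary is stored, the vectorizer does not need to be fitted, which is why the cell above calls `transform` directly. The trade-off is that distinct tokens can collide in the same column, and the mapping cannot be inverted to recover the words.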
