# Extracting features from text

Many machine learning problems use text as an explanatory variable. Text must
be transformed to a different representation that encodes as much of its meaning
as possible in a feature vector. In the following sections we will review variations
of the most common representation of text that is used in machine learning: the
bag-of-words model.

## The bag-of-words representation

 - It creates one feature for each word of interest in the text.
 - It is used effectively for document classification and retrieval.

A collection of documents is called a corpus. Let's use a corpus with the following
two documents to examine the bag-of-words model:

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game'
]


vectorizer = CountVectorizer()


In [2]:
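# Learn the vocabulary and build the document-term count matrix
print(vectorizer.fit_transform(corpus).todense())
# Mapping from each term to its column index in the matrix
print(vectorizer.vocabulary_)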

[[1 1 0 1 0 1 0 1]
 [1 1 1 0 1 0 1 0]]
{'basketball': 0, 'played': 5, 'duke': 1, 'unc': 7, 'in': 3, 'lost': 4, 'the': 6, 'game': 2}
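
To see where these numbers come from, the same matrix can be rebuilt by hand. The following is a minimal sketch in plain Python, not scikit-learn's implementation. The helper names `tokenize` and `bag_of_words` are made up for illustration, and the regular expression is an assumption that mirrors `CountVectorizer`'s default behaviour of lowercasing the text and keeping tokens of two or more word characters.

```python
import re

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
]

def tokenize(document):
    # Lowercase the text and keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", document.lower())

# Assign a column index to every distinct term, in alphabetical order
terms = sorted({token for document in corpus for token in tokenize(document)})
vocabulary = {term: index for index, term in enumerate(terms)}

def bag_of_words(document):
    # Count how often each vocabulary term occurs in the document
    vector = [0] * len(vocabulary)
    for token in tokenize(document):
        vector[vocabulary[token]] += 1
    return vector

vectors = [bag_of_words(document) for document in corpus]
print(vocabulary)
print(vectors)
```

Each column corresponds to one vocabulary term and each row to one document, so the rows should match the matrix printed above.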


In [3]:
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]

In [4]:
vectorizer = CountVectorizer()

In [5]:
counts = vectorizer.fit_transform(corpus).todense()
counts

matrix([[0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
        [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
        [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]], dtype=int64)

In [6]:
print(vectorizer.vocabulary_)

{'basketball': 1, 'played': 6, 'duke': 2, 'unc': 9, 'in': 4, 'sandwich': 7, 'lost': 5, 'the': 8, 'ate': 0, 'game': 3}


In [7]:
from sklearn.metrics.pairwise import euclidean_distances
print('Distance between 3rd and 2nd documents:', euclidean_distances(counts[2], counts[1]))

Distance between 3rd and 2nd documents: [[ 2.64575131]]
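
The Euclidean distance between two count vectors is the square root of the sum of squared element-wise differences. As a rough sketch, the plain-Python helper below (`euclidean` is a name chosen here, not a scikit-learn function) applies that formula to every pair of rows from the three-document count matrix above, which makes it easier to compare how far apart the documents are.

```python
import math
from itertools import combinations

# Rows copied from the three-document count matrix shown above
counts = [
    [0, 1, 1, 0, 1, 0, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
]

def euclidean(u, v):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Distance for every pair of documents, keyed by 1-based document numbers
distances = {
    (i + 1, j + 1): euclidean(counts[i], counts[j])
    for i, j in combinations(range(len(counts)), 2)
}

for (i, j), distance in distances.items():
    print('Distance between documents %d and %d: %.8f' % (i, j, distance))
```

The two basketball documents differ in six positions, while each of them differs from the sandwich document in seven, so the first pair comes out closer (the square root of 6 versus the square root of 7).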


## Stop-word filtering

In [8]:
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich'
]

In [9]:
vectorizer = CountVectorizer(stop_words='english')

In [10]:
print(vectorizer.fit_transform(corpus).todense())
print(vectorizer.vocabulary_)


[[0 1 1 0 0 1 0 1]
 [0 1 1 1 1 0 0 0]
 [1 0 0 0 0 0 1 0]]
{'basketball': 1, 'played': 5, 'duke': 2, 'unc': 7, 'sandwich': 6, 'lost': 4, 'ate': 0, 'game': 3}
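
Stop words are very common words such as "the" and "in" that say little about what a document is about; passing `stop_words='english'` drops them from the vocabulary. The sketch below imitates that step in plain Python. The four-word `STOP_WORDS` set is a tiny hand-picked stand-in chosen for this corpus, not scikit-learn's built-in English list, and `tokenize` and `build_vocabulary` are hypothetical helpers.

```python
import re

corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
    'I ate a sandwich',
]

# Assumption: a tiny hand-picked list, far smaller than scikit-learn's own
STOP_WORDS = {'in', 'the', 'i', 'a'}

def tokenize(document, stop_words=frozenset()):
    # Single-character tokens such as "i" and "a" are already dropped here,
    # because the pattern only keeps tokens of two or more characters
    tokens = re.findall(r"\b\w\w+\b", document.lower())
    return [token for token in tokens if token not in stop_words]

def build_vocabulary(documents, stop_words=frozenset()):
    terms = sorted({t for d in documents for t in tokenize(d, stop_words)})
    return {term: index for index, term in enumerate(terms)}

full = build_vocabulary(corpus)
filtered = build_vocabulary(corpus, STOP_WORDS)

print(len(full), len(filtered))
print(sorted(set(full) - set(filtered)))
print(filtered)
```

Dropping the two stop words shrinks the vocabulary from ten features to eight, and the remaining term-to-column mapping should agree with the dictionary printed above.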


In [11]:
corpus = [
    'He ate the sandwiches',
    'Every sandwich was eaten by him'
]

In [12]:
vectorizer = CountVectorizer(stop_words='english')

In [13]:
counts = vectorizer.fit_transform(corpus).todense()
print(counts)
print(vectorizer.vocabulary_)


[[1 0 0 1]
 [0 1 1 0]]
{'sandwich': 2, 'ate': 0, 'eaten': 1, 'sandwiches': 3}


In [14]:
from sklearn.metrics.pairwise import euclidean_distances
print('Distance between 1st and 2nd documents:', euclidean_distances(counts[0], counts[1]))

Distance between 1st and 2nd documents: [[ 2.]]


## Stemming and lemmatization

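Stemming and lemmatization reduce inflected forms of a word, such as "sandwich" and "sandwiches" or "ate" and "eaten" in the previous example, to a common base form. The Natural Language Toolkit (NLTK) library is commonly used for this purpose, but it is outside the scope of this tutorial.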
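
To illustrate why this matters, the sketch below uses a hand-written lookup table as a stand-in for a real stemmer or lemmatizer. `LEMMAS`, `normalize`, and `count_vectors` are hypothetical names invented for this example and have nothing to do with NLTK's API; the token lists are the two sandwich documents after stop-word filtering.

```python
import math

# Tokens of the two sandwich documents after stop-word filtering
documents = [
    ['ate', 'sandwiches'],
    ['sandwich', 'eaten'],
]

# Assumption: a toy lookup table, not a real stemmer or lemmatizer
LEMMAS = {'ate': 'eat', 'eaten': 'eat', 'sandwiches': 'sandwich'}

def normalize(tokens):
    # Replace each token by its base form when the table knows one
    return [LEMMAS.get(token, token) for token in tokens]

def count_vectors(token_lists):
    # One column per distinct term, in alphabetical order
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    return [[tokens.count(term) for term in vocabulary] for tokens in token_lists]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

raw = count_vectors(documents)
lemmatized = count_vectors([normalize(tokens) for tokens in documents])

print('Raw tokens:       ', raw, euclidean(*raw))
print('Normalized tokens:', lemmatized, euclidean(*lemmatized))
```

Without normalization the two documents share no features; once the inflected forms are mapped to a common base form, their vectors become identical and the distance drops to zero.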

## Extending bag-of-words with TF-IDF weights

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
vectorizer = TfidfVectorizer(stop_words='english')
counts = vectorizer.fit_transform(corpus).todense()
print(counts)

[[ 0.70710678  0.          0.          0.70710678]
 [ 0.          0.70710678  0.70710678  0.        ]]
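
TF-IDF scales each term count by an inverse document frequency, so that terms appearing in many documents weigh less than rare ones. The following sketch reproduces the matrix above in plain Python. It assumes `TfidfVectorizer`'s default settings: raw counts as term frequency, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. `tfidf_rows` is a name invented for this example.

```python
import math

# Count matrix for the two sandwich documents
# (columns: ate, eaten, sandwich, sandwiches)
counts = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]

def tfidf_rows(count_rows):
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: number of documents that contain each term
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # Smoothed inverse document frequency (assumed default, see lead-in)
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    weighted_rows = []
    for row in count_rows:
        weighted = [tf * weight for tf, weight in zip(row, idf)]
        # Scale the row to unit Euclidean length
        norm = math.sqrt(sum(x * x for x in weighted))
        weighted_rows.append([x / norm if norm else 0.0 for x in weighted])
    return weighted_rows

rows = tfidf_rows(counts)
for row in rows:
    print([round(x, 8) for x in row])
```

Because every term here occurs in exactly one document, all IDF values are equal and the normalization alone determines the result: two non-zero entries of about 0.7071 in each row.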


In [17]:
from sklearn.metrics.pairwise import euclidean_distances
print('Distance between 1st and 2nd documents:', euclidean_distances(counts[0], counts[1]))

Distance between 1st and 2nd documents: [[ 1.41421356]]
