## Explaining the NLP terms 

The Bag of Words representation


Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors with a fixed size rather than raw text documents of variable length.

In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

- tokenizing strings and giving an integer id for each possible token, for instance by using white-spaces and punctuation as token separators.
- counting the occurrences of tokens in each document.
- normalizing and weighting with diminishing importance tokens that occur in the majority of samples / documents.
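The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not how scikit-learn implements them: the toy corpus, the `tokenize` helper, and the plain `log(N / df)` weighting are all chosen for this example.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
docs = ["the cat sat", "the dog sat down", "the bird flew"]

def tokenize(text):
    """Lowercase the text and split it on non-word characters."""
    return re.findall(r"\w+", text.lower())

# Step 1: give every distinct token an integer id, in order of first appearance.
vocabulary = {}
for doc in docs:
    for token in tokenize(doc):
        vocabulary.setdefault(token, len(vocabulary))

# Step 2: count token occurrences in each document.
counts = [Counter(tokenize(doc)) for doc in docs]

# Step 3: down-weight tokens that occur in many documents.
# Assumption: a textbook idf = log(N / df), not scikit-learn's exact formula.
n_docs = len(docs)
doc_freq = Counter(token for c in counts for token in c)
idf = {token: math.log(n_docs / df) for token, df in doc_freq.items()}

print(vocabulary)
print(counts[1])
print(idf["the"], idf["sat"], idf["cat"])
```

With this weighting, "the" occurs in every document and gets weight 0, while rarer tokens such as "cat" get the largest weight.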

In this scheme, features and samples are defined as follows:

- each individual token occurrence frequency (normalized or not) is treated as a feature.
- the vector of all the token frequencies for a given document is considered a multivariate sample.

A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g. word) occurring in the corpus.

We call vectorization the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the Bag of Words or “Bag of n-grams” representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
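To make the loss of position information concrete, here is a small sketch in plain Python. The two sentences and the `ngrams` helper are assumptions made for this illustration. The sentences have identical bags of words, while their bags of 2-grams differ because each 2-gram keeps a little local word order.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "dog bites man".split()
b = "man bites dog".split()

# Bag of words: only the counts matter, so the two sentences look identical.
same_unigrams = Counter(a) == Counter(b)

# Bag of 2-grams: adjacent pairs are counted, so the sentences differ.
same_bigrams = Counter(ngrams(a, 2)) == Counter(ngrams(b, 2))

print(same_unigrams)  # True
print(same_bigrams)   # False
```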

CountVectorizer implements both tokenization and occurrence counting in a single class:

In [2]:
# CountVectorizer handles both tokenization and occurrence counting

from sklearn.feature_extraction.text import CountVectorizer

In [6]:
vect = CountVectorizer()
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [22]:
# Let's take a sample corpus to explain how CountVectorizer works

corpus = ['Hi my name is Sravan','I like texting','this is example for text pre processing.']

In [23]:
# Apply CountVectorizer: learn the vocabulary and build the document-term matrix

X = vect.fit_transform(corpus)
X

<3x13 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

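In the output above, the shape 3x13 means 3 rows and 13 columns: one row for each of the 3 documents and one column for each of the 13 unique words. The single-character token "I" is dropped because the default token_pattern only keeps tokens of two or more word characters.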
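The shape and the number of stored elements can be double-checked without scikit-learn. The sketch below reuses the default `token_pattern` shown in the `CountVectorizer` repr above and lowercases the text (matching `lowercase=True`). The variable names are chosen for this example.

```python
import re

corpus = ['Hi my name is Sravan', 'I like texting',
          'this is example for text pre processing.']

# Same regular expression as the default token_pattern: 2+ word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Lowercase each document, then extract its tokens.
tokenized = [token_pattern.findall(doc.lower()) for doc in corpus]

# Columns: the sorted set of distinct tokens across the corpus.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Stored (non-zero) cells: one per distinct token in each document.
stored = sum(len(set(doc)) for doc in tokenized)

print(len(corpus), len(vocabulary))  # 3 13
print(stored)                        # 14
```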

In [24]:
# Get the names of the features (here, the unique words).
# Note: scikit-learn 1.0 and later provide get_feature_names_out() instead.
vect.get_feature_names()

['example',
 'for',
 'hi',
 'is',
 'like',
 'my',
 'name',
 'pre',
 'processing',
 'sravan',
 'text',
 'texting',
 'this']

## The frequency matrix for the given documents

Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. The feature names listed above give the meaning of each column; converting the sparse matrix to a dense array shows the per-document counts:

In [25]:
X.toarray()

array([[0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1]], dtype=int64)
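To read this matrix, pair each column with its feature name. The sketch below copies the feature names and the array from the outputs above and lists, for each document, the words with a non-zero count. The names `rows` and `decoded` are chosen for this example.

```python
feature_names = ['example', 'for', 'hi', 'is', 'like', 'my', 'name', 'pre',
                 'processing', 'sravan', 'text', 'texting', 'this']

# Dense count matrix copied from the X.toarray() output.
rows = [
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
]

# For every document, keep only the words whose count is non-zero.
decoded = [
    {name: count for name, count in zip(feature_names, row) if count}
    for row in rows
]

for i, words in enumerate(decoded):
    print(i, words)
```

The second row decodes to `{'like': 1, 'texting': 1}`, which matches the document 'I like texting' once the single-character "I" is dropped.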

In [20]:
X.dtype

dtype('int64')

Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:

In [26]:
vect.transform(['hi,whats your name?.']).toarray()

array([[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]], dtype=int64)
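The same behaviour can be sketched in plain Python. This is an illustration, not the library's implementation: the `transform` function below is a hypothetical stand-in that looks tokens up in the fixed vocabulary learned earlier and silently skips anything it does not find.

```python
import re

feature_names = ['example', 'for', 'hi', 'is', 'like', 'my', 'name', 'pre',
                 'processing', 'sravan', 'text', 'texting', 'this']

# Map each known word to its column index.
column_of = {name: idx for idx, name in enumerate(feature_names)}

def transform(text):
    """Count known tokens; tokens outside the vocabulary are ignored."""
    row = [0] * len(feature_names)
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        idx = column_of.get(token)
        if idx is not None:
            row[idx] += 1
    return row

print(transform('hi,whats your name?.'))
# [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```

Only "hi" and "name" are counted; "whats" and "your" have no column, so they leave no trace in the vector.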

## Normalization and Stemming

Since words like "text" and "texting" carry the same meaning, why not treat them as the same token?

In [28]:
import nltk
porter = nltk.PorterStemmer()


In [34]:
for i in vect.get_feature_names():
    # stem() strips affixes from the token and returns the stem,
    # e.g. "example" -> "exampl", "texting" -> "text"
    print(porter.stem(i))

exampl
for
hi
is
like
my
name
pre
process
sravan
text
text
thi


Now "texting" has become "text", so both words map to the same stem.

In [38]:
# The same thing using a list comprehension

[porter.stem(i) for i in vect.get_feature_names()]

['exampl',
 'for',
 'hi',
 'is',
 'like',
 'my',
 'name',
 'pre',
 'process',
 'sravan',
 'text',
 'text',
 'thi']

In [40]:
# set() removes the duplicate stem; note that it does not preserve order
list(set([porter.stem(i) for i in vect.get_feature_names()]))

['sravan',
 'text',
 'thi',
 'process',
 'hi',
 'like',
 'pre',
 'my',
 'for',
 'exampl',
 'is',
 'name']
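Because a set does not keep the original order, the result above comes out shuffled. One possible alternative, sketched here with the stems copied from the earlier output, is `dict.fromkeys`, which removes duplicates while keeping first-seen order. The second half shows how count columns could be merged once two words share a stem; the `merged` dictionary is an illustration, not something scikit-learn or NLTK does for you.

```python
# Stems copied from the PorterStemmer output above (one per original column).
stems = ['exampl', 'for', 'hi', 'is', 'like', 'my', 'name', 'pre',
         'process', 'sravan', 'text', 'text', 'thi']

# Remove duplicates while keeping the order of first appearance.
unique_stems = list(dict.fromkeys(stems))
print(len(unique_stems))  # 12

# Count row of the second document ('I like texting') from X.toarray().
row = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]

# Add up the counts of all columns that share the same stem.
merged = {}
for stem, count in zip(stems, row):
    merged[stem] = merged.get(stem, 0) + count

print(merged['text'])  # 1
```

After merging, the "texting" count of the second document lands in the shared "text" column.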

In [41]:
lemmat = nltk.WordNetLemmatizer()

In [45]:
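# lemmatize() treats every word as a noun by default (pos='n'),
# so 'texting' and 'processing' are left unchanged
[lemmat.lemmatize(i) for i in vect.get_feature_names()]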

['example',
 'for',
 'hi',
 'is',
 'like',
 'my',
 'name',
 'pre',
 'processing',
 'sravan',
 'text',
 'texting',
 'this']

## Lemmatization

A very similar operation to stemming is lemmatization. The major difference is that, as you saw earlier, stemming can often create non-existent words (such as "exampl" and "thi"), whereas lemmas are actual words.

So a stem is not necessarily something you can look up in a dictionary, but a lemma is.

Sometimes you will wind up with a very similar word, and sometimes with a completely different one. Let's see some examples.

In [50]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("running"))  # default pos="n" (noun), so "running" is unchanged
print(lemmatizer.lemmatize("ran", pos="v"))

cat
cactus
goose
rock
python
good
best
running
run
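The outputs above suggest why the part of speech matters: a lemma is looked up for a (word, part of speech) pair, and the word comes back unchanged when nothing matches. The toy sketch below mimics only that lookup behaviour. The `LEMMAS` table and `toy_lemmatize` function are hypothetical, with entries copied from the outputs above; this is not how WordNet works internally.

```python
# Toy lookup table; entries are copied from the outputs shown above.
LEMMAS = {
    ("cats", "n"): "cat",
    ("geese", "n"): "goose",
    ("better", "a"): "good",
    ("ran", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when there is no entry."""
    return LEMMAS.get((word, pos), word)

print(toy_lemmatize("geese"))            # goose
print(toy_lemmatize("better", pos="a"))  # good
print(toy_lemmatize("better"))           # better (no noun entry in this toy table)
print(toy_lemmatize("ran", pos="v"))     # run
```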
