# 2 - Feature extraction

Before machine learning algorithms can be trained, preprocessed text needs to be transformed into numerical data. This process is called feature extraction, or vectorization.

In this exercise, you will familiarize yourself with four popular feature extraction techniques:
- Bag-of-Words
- N-grams
- Tfidf
- Part of Speech tagging

The operations below will be applied to the following corpus (set of texts):

In [2]:
corpus = ['i love this game',
          'football is a game',
          'i love football']

## Bag of Words

A bag of words is a count of word occurrences within a text.

The algorithm has two steps. First, it builds a dictionary (vocabulary) from the entire corpus. Then, it transforms each text in the corpus into a vector of word occurrences. It is called a "bag" because it disregards the order of the words within the text and keeps only which words appear and how often.
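To make the two steps concrete, here is a minimal hand-rolled sketch in plain Python. It is only an illustration of the idea, not Sklearn's implementation: the helper names `build_vocabulary` and `to_count_vector` are made up for this example, and splitting on whitespace is an assumed, simplified tokenization.

```python
# Illustrative sketch only: a hand-rolled Bag of Words, not Sklearn's implementation.
corpus = ['i love this game',
          'football is a game',
          'i love football']

def build_vocabulary(texts):
    """Step 1: collect every distinct word in the corpus, sorted alphabetically."""
    return sorted({word for text in texts for word in text.split()})

def to_count_vector(text, vocabulary):
    """Step 2: count how often each vocabulary word occurs in one text."""
    words = text.split()
    return [words.count(term) for term in vocabulary]

vocabulary = build_vocabulary(corpus)
vectors = [to_count_vector(text, vocabulary) for text in corpus]

print(vocabulary)        # one column per distinct word
for vector in vectors:   # one row per text
    print(vector)
```

Library vectorizers apply their own tokenization rules, so their vocabulary may differ from the one this sketch produces.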

A Bag of Words can be obtained using Sklearn's CountVectorizer. Instantiate a default vectorizer and name it default_vectorizer.

In [6]:
# Code Here

Use the default vectorizer to transform the corpus to a Bag of Words, and store the result in bag_of_words.

In [7]:
# Code Here

The cell below prints the original corpus, the features of our Bag of Words, and the corresponding vectors. Make sure you understand the code, you will need it later ;)

In [None]:
print(*corpus,sep='\n') #Prints original corpus with each text at a new line

print(' ') # Print a blank line to air things out ;)

print(default_vectorizer.get_feature_names_out()) # Prints features of your Bag of Words (get_feature_names() was removed in scikit-learn 1.2)

print(' ') # Another blank line.....

print( bag_of_words.todense() ) # Prints Bag of words vectors

As you can see, the vectors represent the occurrences of each word in each sentence. It is those vectors that are then used for Machine Learning tasks.

## N-grams

An N-gram is a sequence of N consecutive words treated as a single feature, as opposed to the single-word features of the Bag of Words. The idea is to capture contextual information and enrich the data. For example, the word "good" is positive on its own but becomes negative when preceded by "not". In such cases, "not good" is an informative bigram.
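The idea can be sketched in a few lines of plain Python. This is only an illustration: `extract_ngrams` is a hypothetical helper, the sentence is a made-up example, and whitespace splitting is an assumed, simplified tokenization.

```python
def extract_ngrams(text, n):
    """Return every run of n consecutive words in text, joined by spaces."""
    words = text.split()
    # A text with fewer than n words yields an empty list.
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = 'this game is not good'  # hypothetical example sentence

unigrams = extract_ngrams(sentence, 1)
bigrams = extract_ngrams(sentence, 2)

print(unigrams)
print(bigrams)
```

Among the unigrams, "good" appears on its own. Among the bigrams, it only appears as part of "not good", which keeps the negation attached to it.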

N-grams can also be extracted with Sklearn's CountVectorizer, by setting a specific parameter. Find out how that parameter works and instantiate a bigram (2-gram) vectorizer.

In [18]:
# Code Here

Transform corpus to bigram vectors.

In [10]:
# Code Here

Print the original corpus, the bigram features, and the corresponding vectors. Separate them with blank lines to air things out ;)

In [2]:
# Code Here

## Term Frequency - Inverse Document Frequency (Tfidf)

Rather than counting occurrences, the Tfidf vectorizer computes an importance value for each word, based on how often it appears in its own text and how rare it is across the entire corpus. That value is the product of the term frequency (TF) and the inverse document frequency (IDF).

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
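As a rough check of the two formulas, the sketch below applies them by hand to the corpus from the start of this exercise. The function names are made up for this example and whitespace splitting is an assumed tokenization. It follows the textbook formulas exactly as written above, so it is not a reimplementation of any library.

```python
import math

corpus = ['i love this game',
          'football is a game',
          'i love football']

# Assumed tokenization: split each text on whitespace.
documents = [text.split() for text in corpus]

def term_frequency(term, document):
    """TF: occurrences of the term divided by the number of terms in the document."""
    return document.count(term) / len(document)

def inverse_document_frequency(term, documents):
    """IDF: natural log of (number of documents / documents containing the term).
    Assumes the term occurs in at least one document."""
    containing = sum(1 for document in documents if term in document)
    return math.log(len(documents) / containing)

def tf_idf(term, document, documents):
    """Importance of a term in one document: TF multiplied by IDF."""
    return term_frequency(term, document) * inverse_document_frequency(term, documents)

first = documents[0]
for term in first:
    print(term, round(tf_idf(term, first, documents), 4))
```

In the first text, "this" gets the highest score because it appears in no other text, while "i", "love" and "game" each appear in two of the three texts and therefore score lower.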

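Don't worry, the steps above are automated by Sklearn's TfidfVectorizer class. Note that its default settings use a smoothed variant of the IDF and normalize each vector, so its output will not exactly match a hand calculation with the formulas above.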

Import the class, instantiate a default Tfidf vectorizer, and transform your corpus. Then, print the original corpus, the Tfidf features, and the corresponding vectors.

In [16]:
# Code Here

## Part of Speech (POS) tagging 

Part-of-Speech tagging is a way to enrich text data by specifying the grammatical function of each word within its sentence (noun, verb, adjective, etc.).

If executed prior to Bag-of-Words vectorization, it can, for example, differentiate two words that are spelled the same but have different grammatical properties.
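One possible way to exploit this is sketched below, under loud assumptions: the (word, tag) pairs are written by hand for illustration and are not the output of a real tagger, and the word_TAG naming scheme is just one convention you could choose.

```python
from collections import Counter

# Hand-written (word, tag) pairs for illustration; NOT the output of a real tagger.
tagged_tokens = [('we', 'PRP'), ('complete', 'VBP'), ('the', 'DT'),
                 ('complete', 'JJ'), ('set', 'NN')]

# Counting plain words merges both uses of "complete" into one feature.
plain_counts = Counter(word for word, _ in tagged_tokens)

# Appending the tag to each word keeps the verb and the adjective apart.
tagged_counts = Counter(f'{word}_{tag}' for word, tag in tagged_tokens)

print(plain_counts['complete'])
print(tagged_counts['complete_VBP'], tagged_counts['complete_JJ'])
```

With plain words, "complete" is a single feature counted twice. With tags attached, it becomes two separate features counted once each.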

In [21]:
example = "After I complete the bootcamp my life will be complete "

Before being POS tagged, a sentence needs to be tokenized into words. Go ahead.

In [3]:
# Code Here

You can now import NLTK's pos_tag function and apply it to your tokens.

In [4]:
# Code Here

You can see that the two occurrences of "complete" have different functions within the sentence. To check the meaning of each tag, use the function below.

In [None]:
import nltk

nltk.help.upenn_tagset('VBP')

## Lexical Richness

Depending on the task you are working on, you may need to engineer your own feature. 

Assume that your task is to classify texts according to their author. In such a case, Lexical Richness carries author-specific information: one author may use a significantly larger vocabulary than another. However, there is no NLTK package to compute such a feature...

Engineer a feature extraction function that returns a ratio of (unique words / total words). Apply it to the example below. There are a few tricks here ;)

In [None]:
example = "I love Malmo but it is too cold and i do not like the cold"

In [1]:
# Code here

# Should return 0.8666666666666667