# Feature extraction

In this notebook we will learn how to extract different features from a text and how to combine them. It is fairly simple, but keeping this part well organized will pay off later. So, let's get started!

In [96]:
import nltk
from sklearn.pipeline import FeatureUnion
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd


In [3]:
train_sentences = [
    'I liked this movie',
    'The plot was intriguing',
    'Oh, it was truly boring',
]
train_classes = [1,1,0]

test_sentences = [
    'I liked it',
    'The plot was boring !',
    'Oh, it was absolutely terrible !',
]
test_classes = [1,0,0]

### Exercise 1: Get the bag of words representation of the training set

You can do this with the `CountVectorizer` class from `scikit-learn`. Once you instantiate the vectorizer and call `fit_transform` on the training sentences, you should get a 3x11 sparse matrix with the default settings (a custom `token_pattern`, as in the solution below, can change the number of columns).

If the matrix is saved in a variable called `X`, you can see its values by calling the `X.toarray()` method.

In [79]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=u"(?u)\\b\\w+\\b",min_df=1)
unigrams = vectorizer.fit_transform(train_sentences)

In [32]:
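# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
tokens_uni = vectorizer.get_feature_names()
# The matrix was stored in `unigrams` above, not in `X`.
tokens_uni, unigrams.toarray()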

(['boring',
  'i',
  'intriguing',
  'it',
  'liked',
  'movie',
  'oh',
  'plot',
  'the',
  'this',
  'truly',
  'was'],
 array([[0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1],
        [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1]], dtype=int64))
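
The output above has 12 columns, whereas the exercise text mentions 11. A minimal sketch of where the difference comes from, assuming only the training sentences from this notebook: the default `token_pattern` keeps tokens of two or more characters, so `i` only appears with the custom pattern. The variable names are illustrative and not part of the original solution.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_sentences = [
    'I liked this movie',
    'The plot was intriguing',
    'Oh, it was truly boring',
]

# The default pattern r"(?u)\b\w\w+\b" only keeps tokens of two or more characters.
default_vectorizer = CountVectorizer()
default_bow = default_vectorizer.fit_transform(train_sentences)

# The pattern used in this notebook also keeps one-character tokens such as "i".
single_char_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
single_char_bow = single_char_vectorizer.fit_transform(train_sentences)

print(default_bow.shape)      # (3, 11)
print(single_char_bow.shape)  # (3, 12)

# Terms that exist only with the custom pattern.
extra_terms = set(single_char_vectorizer.vocabulary_) - set(default_vectorizer.vocabulary_)
print(extra_terms)            # {'i'}
```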

### Exercise 2: Now get a bag of words of bigrams and trigrams from the training set

Use the same class as before.

In [78]:

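# Bigrams only: ngram_range=(2, 2) restricts the features to two-word sequences.
bigr = CountVectorizer(ngram_range=(2, 2), token_pattern=u"(?u)\\b\\w+\\b", min_df=1)
bigrams_bow = bigr.fit_transform(train_sentences)

# Trigrams only. This vectorizer keeps the default token_pattern,
# so one-character tokens such as "I" are dropped.
trigr = CountVectorizer(ngram_range=(3, 3))
trigrams_bow = trigr.fit_transform(train_sentences)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there.
tokens_bi = bigr.get_feature_names()

tokens_bi, bigrams_bow.toarray()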

#### Hint:

Oh, do you want to check the vocabulary of the training set? You can do it using the vectorizer. If the vectorizer is saved in a variable called `unigram_vectorizer`, you can check the attribute `unigram_vectorizer.vocabulary_` and you will see the vocabulary there.
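
As a rough illustration of the hint, the sketch below fits a vectorizer named `unigram_vectorizer` (the name used in the hint) and inspects `vocabulary_`, which maps each term to its column index. The helper variable `terms_in_column_order` is made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_sentences = [
    'I liked this movie',
    'The plot was intriguing',
    'Oh, it was truly boring',
]

unigram_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
unigram_vectorizer.fit(train_sentences)

# vocabulary_ is a dict: term -> column index in the bag-of-words matrix.
vocabulary = unigram_vectorizer.vocabulary_
print(vocabulary['boring'])  # 0, because columns follow the sorted order of the terms

# Sorting the terms by their index recovers the column order of the matrix.
terms_in_column_order = sorted(vocabulary, key=vocabulary.get)
print(terms_in_column_order)
```

Sorting by index gives the same order as the feature-name list printed in Exercise 1.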

### Exercise 3: Test these vectorizers with the test set

Try the vectorizers with the test set. What happens if a word doesn't appear in the training corpus?

In [42]:
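# Note: fit_transform re-learns the vocabulary from the test sentences, which is
# why the dictionary below contains only test-set bigrams. To reuse the training
# vocabulary, call bigr.transform(test_sentences) instead; bigrams that were not
# seen during training are then ignored.
bi_test = bigr.fit_transform(test_sentences)

bigr.vocabulary_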

{'i liked': 1,
 'liked it': 3,
 'the plot': 6,
 'plot was': 5,
 'was boring': 8,
 'oh it': 4,
 'it was': 2,
 'was absolutely': 7,
 'absolutely terrible': 0}
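
To answer the question in this exercise, one possible approach is to fit on the training sentences only and then call `transform` on the test sentences. The sketch below assumes a fresh vectorizer called `bigram_vectorizer` (a name chosen for this example) and shows that the vocabulary stays fixed while unseen bigrams are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_sentences = [
    'I liked this movie',
    'The plot was intriguing',
    'Oh, it was truly boring',
]
test_sentences = [
    'I liked it',
    'The plot was boring !',
    'Oh, it was absolutely terrible !',
]

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), token_pattern=r"(?u)\b\w+\b")
bigram_vectorizer.fit(train_sentences)
train_vocabulary = dict(bigram_vectorizer.vocabulary_)

# transform() only counts bigrams already in the training vocabulary.
test_bigrams = bigram_vectorizer.transform(test_sentences)

print(test_bigrams.shape)                                 # (3, 10): same columns as training
print(bigram_vectorizer.vocabulary_ == train_vocabulary)  # True: the vocabulary did not change
print(test_bigrams.toarray())
```

Bigrams such as `liked it` or `absolutely terrible` have no column, so they contribute nothing to the test matrix.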

### Exercise 4: Calculate a bag of POS tags for the training set and then apply the vectorizer to the test set.

Compute something similar to the bag of words, but use the POS tags of the sentences instead of the words. The goal here is not to get perfect results, so simply use the `nltk.pos_tag()` function to get the part-of-speech tags.

In [101]:
help(nltk.pos_tag_sents)

Help on function pos_tag_sents in module nltk.tag:

pos_tag_sents(sentences, tagset=None, lang='eng')
    Use NLTK's currently recommended part of speech tagger to tag the
    given list of sentences, each consisting of a list of tokens.
    
    :param tokens: List of sentences to be tagged
    :type tokens: list(list(str))
    :param tagset: the tagset to be used, e.g. universal, wsj, brown
    :type tagset: str
    :param lang: the ISO 639 code of the language, e.g. 'eng' for English, 'rus' for Russian
    :type lang: str
    :return: The list of tagged sentences
    :rtype: list(list(tuple(str, str)))



In [73]:
from nltk.tokenize import word_tokenize

# Tokenize each training sentence, then tag every token with its part of speech.
bu = [nltk.pos_tag(word_tokenize(sentence)) for sentence in train_sentences]

bu


[[('I', 'PRP'), ('liked', 'VBD'), ('this', 'DT'), ('movie', 'NN')],
 [('The', 'DT'), ('plot', 'NN'), ('was', 'VBD'), ('intriguing', 'VBG')],
 [('Oh', 'UH'),
  (',', ','),
  ('it', 'PRP'),
  ('was', 'VBD'),
  ('truly', 'RB'),
  ('boring', 'JJ')]]
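
The tagged sentences above still need to be turned into counts. A minimal sketch of one way to do that, assuming the tagged output shown above is copied in as `tagged_train`: join the tags of each sentence into a string and vectorize those strings. The helper `tags_as_text` is hypothetical, and the test-sentence tags are written by hand for illustration; they are not real tagger output.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tagged training sentences, copied from the nltk.pos_tag output above.
tagged_train = [
    [('I', 'PRP'), ('liked', 'VBD'), ('this', 'DT'), ('movie', 'NN')],
    [('The', 'DT'), ('plot', 'NN'), ('was', 'VBD'), ('intriguing', 'VBG')],
    [('Oh', 'UH'), (',', ','), ('it', 'PRP'), ('was', 'VBD'), ('truly', 'RB'), ('boring', 'JJ')],
]

# ASSUMPTION: hand-written tags for the test sentence 'I liked it', not tagger output.
tagged_test = [[('I', 'PRP'), ('liked', 'VBD'), ('it', 'PRP')]]


def tags_as_text(tagged_sentences):
    """Turn each tagged sentence into a space-separated string of its tags."""
    return [' '.join(tag for _, tag in sentence) for sentence in tagged_sentences]


# token_pattern=r"\S+" keeps punctuation tags such as ','; lowercase=False keeps 'PRP' as is.
pos_vectorizer = CountVectorizer(token_pattern=r"\S+", lowercase=False)
train_pos_bow = pos_vectorizer.fit_transform(tags_as_text(tagged_train))
test_pos_bow = pos_vectorizer.transform(tags_as_text(tagged_test))

print(sorted(pos_vectorizer.vocabulary_))
print(train_pos_bow.toarray())
print(test_pos_bow.toarray())
```

As with the word vectorizers, the POS vectorizer is fitted on the training set only and then applied to the test set with `transform`.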

### Exercise 5: Combine all features for each sentence.

Combine all the previous features into a single matrix that encodes unigrams, bigrams, trigrams and POS tags. The resulting matrix should have dimensions 3x31.

You could use the `sklearn.pipeline.FeatureUnion` class.

In [114]:
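# This attempt fails: FeatureUnion expects every entry to be a transformer with
# fit and transform methods, but the "PoS" entry is the plain list returned by
# nltk.pos_tag, hence the TypeError below.
union = FeatureUnion([("uni", vectorizer),("bi", bigr), ("tri", trigr), ("PoS", nltk.pos_tag(word_tokenize(train_sentences[0])))])

#union.fit_transform(train_sentences)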


TypeError: All estimators should implement fit and transform. '[('I', 'PRP'), ('liked', 'VBD'), ('this', 'DT'), ('movie', 'NN')]' (type <class 'list'>) doesn't
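
One possible way around this error is to wrap the POS tagging inside a `CountVectorizer` through its `tokenizer` parameter, so that the fourth entry also implements `fit` and `transform`. The sketch below is not the intended solution of the exercise. To run without NLTK data it uses a stand-in lookup tagger (`KNOWN_TAGS` and `pos_tags` are hypothetical names) built from the tags printed in Exercise 4; in the notebook the function body could call `nltk.pos_tag(word_tokenize(sentence))` instead.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

train_sentences = [
    'I liked this movie',
    'The plot was intriguing',
    'Oh, it was truly boring',
]

# ASSUMPTION: stand-in for a real tagger, copied from the tagged output in Exercise 4.
KNOWN_TAGS = {
    'I': 'PRP', 'liked': 'VBD', 'this': 'DT', 'movie': 'NN',
    'The': 'DT', 'plot': 'NN', 'was': 'VBD', 'intriguing': 'VBG',
    'Oh': 'UH', ',': ',', 'it': 'PRP', 'truly': 'RB', 'boring': 'JJ',
}


def pos_tags(sentence):
    """Return the POS tags of a sentence; unknown tokens get the placeholder 'UNK'."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    return [KNOWN_TAGS.get(token, 'UNK') for token in tokens]


word_pattern = r"(?u)\b\w+\b"
union = FeatureUnion([
    ("uni", CountVectorizer(ngram_range=(1, 1), token_pattern=word_pattern)),
    ("bi", CountVectorizer(ngram_range=(2, 2), token_pattern=word_pattern)),
    ("tri", CountVectorizer(ngram_range=(3, 3), token_pattern=word_pattern)),
    # The tokenizer emits tags instead of words, giving a bag of POS tags.
    ("pos", CountVectorizer(tokenizer=pos_tags, lowercase=False, token_pattern=None)),
])

# Each transformer is fitted on the raw sentences; the outputs are stacked column-wise.
features = union.fit_transform(train_sentences)
widths = [len(transformer.vocabulary_) for _, transformer in union.transformer_list]
print(features.shape, widths)
```

The width of the combined matrix is the sum of the individual vocabulary sizes, so it depends on the tokenizer settings and tag set used and need not match the 3x31 stated above.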

### Extra to play with: Check this website and think about it. Do you think you can use this for something? (in the exam)

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

## SHARE YOUR KNOWLEDGE!

### Do you know any other way of representing the features of the training/testing set?

Please share your knowledge on the Absalon forum!