# Text Feature Extraction

A machine learning model needs numerical input. For example, a categorical input is typically one-hot encoded, and the resulting vector is fed to the model. The question is how to convert text, which is high-level and carries meaning, into numbers and vectors that a machine can work with.

Traditional methods for extracting features from text build the features manually, guided by an understanding of the domain. For example, we can use bag of words, TF-IDF and similar techniques to represent the text. More recently, a powerful alternative has emerged: using deep neural networks to learn compact (lower-dimensional) features automatically.

In this notebook, we go through some of these methods to understand how they work, so that we can decide which technique to use in our project.

## Traditional Methods

Let us start with some of the traditional methods, which rely on manual effort and domain knowledge to create features for the text.

### 1. Bag of Words

This method assumes that we have a document and a corpus (a collection of documents). In this and the upcoming examples, a document is a simple English sentence, so the corpus is a list of sentences.

For every sentence, we generate a list of tokens that represents it. The important part is generating these tokens and then counting them to represent the text. We use Python throughout this notebook.

There are 3 important steps involved in this whole process:

1. Tokenization
2. Stemming
3. Lemmatization

Stemming trims a word down to a base form (its stem) by stripping suffixes. The stem is not always a dictionary word: for example, a stemmer leaves leaf as leaf but reduces leaves to leav.

Lemmatization maps a word to its dictionary form (its lemma). For example, a lemmatizer maps leaves to leaf.

The goal of both steps is that variants such as 'leaf' and 'leaves' end up as the same token. Note that a lemmatizer can only normalize words it recognizes, so a stem that is not a real word may pass through it unchanged. Let us see how we can do these steps in Python.
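To make the difference between the two steps concrete, here is a minimal toy sketch. It is not how NLTK works internally: the suffix rules and the tiny lookup table are hypothetical and exist only to show that stemming is rule-based while lemmatization is dictionary-based.

```python
# Toy illustration only: real stemmers (e.g. Porter) and lemmatizers
# (e.g. WordNet-based) use far larger rule sets and dictionaries.

# Hypothetical suffix rules for the toy stemmer, tried in order.
SUFFIXES = ("es", "s")

# Hypothetical mini-dictionary mapping inflected forms to lemmas.
LEMMA_TABLE = {"leaves": "leaf", "mice": "mouse", "ran": "run"}


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words pass through."""
    return LEMMA_TABLE.get(word, word)


for w in ["leaf", "leaves", "mice"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

In this sketch the stemmer turns leaves into leav, while the lemmatizer turns it into leaf. Passing the stem leav to the lemmatizer returns leav unchanged, because it is not in the lookup table.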

If you get errors using nltk, run nltk.download() in a separate Python session to download the required packages.

In [46]:
import nltk
from nltk import word_tokenize

text = 'Hi! This is a test sentence that we will use to perform tokenization, stemming and lemmatization. Let us see, how it performs. We will use a lot of sentences the next time.'
word_tokens = word_tokenize(text)

word_tokens

['Hi',
 '!',
 'This',
 'is',
 'a',
 'test',
 'sentence',
 'that',
 'we',
 'will',
 'use',
 'to',
 'perform',
 'tokenization',
 ',',
 'stemming',
 'and',
 'lemmatization',
 '.',
 'Let',
 'us',
 'see',
 ',',
 'how',
 'it',
 'performs',
 '.',
 'We',
 'will',
 'use',
 'a',
 'lot',
 'of',
 'sentences',
 'the',
 'next',
 'time',
 '.']

If you look at the output, it contains punctuation marks as well as frequent, uninformative words like 'the', 'it' etc. Such words are called stop words: they help to fill in the sentence but add little meaning to it.

We can get rid of these stop words.

In [47]:
from nltk.corpus import stopwords
import string

def get_filtered_tokens(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_word_tokens = []

    # keep only alphabetic tokens that are neither stop words nor punctuation
    for token in word_tokens:
        if token.lower() not in stop_words and token.lower() not in string.punctuation and token.isalpha():
            filtered_word_tokens.append(token.lower())

    return filtered_word_tokens

get_filtered_tokens(text)

['hi',
 'test',
 'sentence',
 'use',
 'perform',
 'tokenization',
 'stemming',
 'lemmatization',
 'let',
 'us',
 'see',
 'performs',
 'use',
 'lot',
 'sentences',
 'next',
 'time']

Now the tokens are all meaningful words, with the stop words and punctuation removed.

We can proceed with the stemming and lemmatization process.

In [48]:
from nltk import PorterStemmer
from nltk import WordNetLemmatizer

ps = PorterStemmer()
lmtzr = WordNetLemmatizer()

def stem_lmtz_word(word):
    return lmtzr.lemmatize(ps.stem(word))

Let us test the function by passing a few words like 'programming' and 'programmer' and looking at the output.

In [49]:
words = ['programming', 'programmer', 'program', 'programs']

for word in words:
    print(stem_lmtz_word(word))

program
programm
program
program


For the next steps, we need a dataset. We will use a sentiment analysis dataset: the Apple Twitter Sentiment dataset from data.world. For now, we focus only on the text column of the dataset, to generate tokens and perform other operations.

In [50]:
import pandas as pd

data = pd.read_csv('../Datasets/Apple-Twitter-Sentiment-DFE.csv', encoding='cp1252')
data['text'] = data['text'].apply(lambda x: x.strip())
data['text'].head()

0    #AAPL:The 10 best Steve Jobs emails ever...htt...
1    RT @JPDesloges: Why AAPL Stock Had a Mini-Flas...
2    My cat only chews @apple cords. Such an #Apple...
3    I agree with @jimcramer that the #IndividualIn...
4         Nobody expects the Spanish Inquisition #AAPL
Name: text, dtype: object

In [51]:
get_filtered_tokens(data['text'][0])

['aapl', 'best', 'steve', 'jobs', 'emails', 'ever', 'http']

The next step is to create a vocabulary, that is, the set of all the words occurring in the corpus.

In [52]:
vocab = set()
tokens_data = []

for index, sentence in data['text'].items():
    tokens = []
    for token in get_filtered_tokens(sentence):
        final_token = stem_lmtz_word(token)
        vocab.add(final_token)
        tokens.append(final_token)
    tokens_data.append(tokens)

Now we are all set. We have the vocabulary ready, and every sentence is represented by a list of tokens with stop words removed and stemming and lemmatization applied.

The next step is to create a bag of words for each sentence. This is a vector with one entry per vocabulary token, where each value is the frequency of that token in the sentence.

In [53]:
# reusable counter with one zero-initialized entry per vocabulary token
token_count_dict = {}
for token in vocab:
    token_count_dict[token] = 0

bag_of_words_data = []

for sent in tokens_data:
    sent_rep = []

    # count how often each token occurs in this sentence
    for token in sent:
        token_count_dict[token] += 1

    # copy the counts out in a fixed key order, resetting each to zero
    for key in token_count_dict:
        sent_rep.append(token_count_dict[key])
        token_count_dict[key] = 0

    bag_of_words_data.append(sent_rep)
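As a compact cross-check of the same idea, the sketch below builds bag of words vectors with `collections.Counter` from the standard library. The token lists are hypothetical stand-ins for `tokens_data`, and sorting the vocabulary is an added assumption that makes the column order reproducible.

```python
from collections import Counter

# Hypothetical, already-preprocessed token lists (stand-ins for tokens_data).
toy_tokens = [
    ["appl", "stock", "stock"],
    ["cat", "chew", "appl", "cord"],
]

# Sort the vocabulary so that the column order is reproducible.
toy_vocab = sorted({token for sent in toy_tokens for token in sent})


def bag_of_words(tokens, vocab):
    """Return one count per vocabulary entry, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[token] for token in vocab]


toy_bow = [bag_of_words(sent, toy_vocab) for sent in toy_tokens]
print(toy_vocab)  # ['appl', 'cat', 'chew', 'cord', 'stock']
print(toy_bow)    # [[1, 0, 0, 0, 2], [1, 1, 1, 1, 0]]
```

A `Counter` returns 0 for missing keys, so tokens absent from a sentence need no special handling.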

### 2. TF-IDF Model

Although the bag of words model gives us a numerical representation of a sentence, it does not capture any semantics. We now look at the tf-idf model, which weights words according to their importance.

Bag of words gives more weight to a word that occurs more frequently, but a frequent word is not necessarily an important one. Words that occur frequently across many documents often carry little information.

Hence the tf-idf model takes into account not only the frequency of a word in a document, but also the number of documents the word appears in.

$\text{TF} = \dfrac{\left(\text{Word frequency in document}\right)}{\left(\text{Total number of terms in document}\right)}$

$\text{IDF} = \log{\left(\dfrac{\text{N}}{1 + \text{n}}\right)}\text{, where}$

$\text{N is total number of documents}$

$\text{n is number of documents that contain the word}$
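The two formulas can be checked by hand on a tiny example. The sketch below uses a hypothetical three-document corpus and the helper names `tf` and `idf`, which are introduced here for illustration only and are not part of the notebook.

```python
import math

# Hypothetical three-document corpus of preprocessed tokens.
docs = [
    ["appl", "stock", "stock"],
    ["appl", "cat"],
    ["appl", "cord"],
]


def tf(term, doc):
    """Term frequency: occurrences of term divided by document length."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Inverse document frequency with the +1 smoothing used above."""
    n = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / (1 + n))


# Compare a term found in every document with one found in a single document.
for term in ["appl", "stock"]:
    weight = tf(term, docs[0]) * idf(term, docs)
    print(term, tf(term, docs[0]), idf(term, docs), weight)
```

Here 'stock' appears in only one document and gets a positive idf of log(3/2), while 'appl' appears in all three and gets log(3/4), which is slightly negative. That is a side effect of the 1 + n term in the denominator: a word present in every document is pushed just below zero.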

In [54]:
import numpy as np

# tf data can be calculated from the bag of words model
tf_data = []

for rep, tokens in zip(bag_of_words_data, tokens_data):
    npa = np.asarray(rep, dtype=np.float64)
    num_terms = len(tokens)
    # a sentence with no tokens left after filtering keeps an all-zero vector
    if num_terms > 0:
        npa = npa / num_terms
    tf_data.append(npa.tolist())

To create the idf data, we need a dictionary that stores, for each token, the number of documents it appears in.

We rebuild token_count_dict with every token set to zero, then go through the documents and increment a token's value once for each document that contains it.

The idf array has one entry per token and does not depend on any single document, so we can calculate it separately.

In [60]:
import math

token_count_dict = {}
for token in vocab:
    token_count_dict[token] = 0

idf_data = []

# count, for each token, the number of documents that contain it
for sent in tokens_data:
    for token in set(sent):
        token_count_dict[token] += 1

# total number of documents
N = data['text'].shape[0]

for key in token_count_dict:
    init_val = token_count_dict[key]
    final_val = math.log(N/(1+init_val))
    token_count_dict[key] = final_val

    idf_data.append(final_val)

Now we have the tf data for all the sentences and the idf data for all the words. We can calculate the tf-idf representation of each sentence by multiplying its tf vector element-wise with the idf array.

In [68]:
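tf_idf_data = []
idf_npa = np.asarray(idf_data, dtype=np.float64)

for tf_rep in tf_data:
    tf_npa = np.asarray(tf_rep, dtype=np.float64)
    # element-wise product of the sentence's tf vector and the idf array
    tf_idf_npa = np.multiply(tf_npa, idf_npa)
    tf_idf_data.append(tf_idf_npa.tolist())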
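To read a tf-idf vector, it helps to map the columns back to tokens. The sketch below uses a hypothetical helper, `top_terms`, with made-up column names and weights. In this notebook the column order would presumably be `list(token_count_dict)`, assuming `vocab` is not modified between cells.

```python
def top_terms(weights, columns, k=3):
    """Return the k (token, weight) pairs with the largest weights."""
    pairs = sorted(zip(columns, weights), key=lambda pair: pair[1], reverse=True)
    return pairs[:k]


# Hypothetical column order and tf-idf vector for one document.
toy_columns = ["appl", "cat", "chew", "cord", "stock"]
toy_vector = [0.02, 0.0, 0.0, 0.0, 0.27]

print(top_terms(toy_vector, toy_columns, k=2))
# [('stock', 0.27), ('appl', 0.02)]
```

Listing the highest-weighted tokens of a few documents is a quick sanity check that the weights favor distinctive words over common ones.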