<center> <h2>Feature Extraction from Text</h2></center>

## Outline
1. <a href='#1'>CountVectorizer</a>
2. <a href='#2'>TfidfVectorizer</a>
3. <a href='#3'>Stop Words</a>
4. <a href='#4'>min_df</a>
5. <a href='#5'>ngrams</a>



## 1. Extracting Features from Text 
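* Textual data need to be transformed so that they can be represented in quantitative terms
* A common text representation strategy in ML:
    * Bag of Words: each string is represented by how often each vocabulary word occurs in it, ignoring word order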
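
The bag-of-words idea can be sketched by hand before turning to scikit-learn. The snippet below is a minimal illustration that uses only the standard library; the two-sentence corpus and the helper names `tokenize` and `count_vector` are made up for this example, and the tokenizer is deliberately simpler than the one scikit-learn uses.

```python
import re
from collections import Counter

# Hypothetical toy corpus, for illustration only
docs = [
    "the cat sat on the mat",
    "the dog sat",
]

def tokenize(text):
    # Lowercase the text, then keep runs of word characters (a simplification)
    return re.findall(r"\w+", text.lower())

# The vocabulary is the sorted set of distinct tokens across all documents
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

def count_vector(text):
    # One count per vocabulary word; word order in the text is ignored
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

vectors = [count_vector(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

Each document becomes a fixed-length vector whose positions follow the alphabetical vocabulary, which is the same layout the CountVectorizer output in the next section has.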

In [2]:
quote = ["Happiness can be found in the darkest of times, if one only remembers to turn on the light"]

<a id="1"></a>

## 2. CountVectorizer
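* Converts a collection of strings to a matrix of token counts
* Converts all characters to lowercase before tokenizing (can disable this using lowercase=False)
* By default, a token is a run of two or more alphanumeric characters, so single-character words such as "a" are dropped (see token_pattern in the output below)
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html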
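
To see what lowercasing and the default token pattern do to a string, the sketch below applies the same regular expression that appears as token_pattern in the CountVectorizer output. It only mimics these two steps with the standard library; the helper name `simple_tokenize` is an assumption for illustration, not part of scikit-learn.

```python
import re

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_tokenize(text, lowercase=True):
    # Hypothetical helper: optionally lowercase, then extract matching tokens
    if lowercase:
        text = text.lower()
    return TOKEN_PATTERN.findall(text)

sample = "Curiosity is not a sin"
tokens = simple_tokenize(sample)
tokens_cased = simple_tokenize(sample, lowercase=False)

print(tokens)        # the one-letter word "a" does not match the pattern
print(tokens_cased)  # original casing is kept when lowercase=False
```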

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit(quote)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [4]:
print("Vocabulary: \n", vect.vocabulary_)

Vocabulary: 
 {'happiness': 4, 'can': 1, 'be': 0, 'found': 3, 'in': 6, 'the': 13, 'darkest': 2, 'of': 8, 'times': 14, 'if': 5, 'one': 10, 'only': 11, 'remembers': 12, 'to': 15, 'turn': 16, 'on': 9, 'light': 7}


In [5]:
print("Vocabulary size: ", len(vect.vocabulary_))

Vocabulary size:  17


In [6]:
bag_of_words = vect.transform(quote)

In [7]:
bag_of_words

<1x17 sparse matrix of type '<class 'numpy.int64'>'
	with 17 stored elements in Compressed Sparse Row format>

In [8]:
bag_of_words_arr = bag_of_words.toarray()
bag_of_words_arr

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1]], dtype=int64)

In [9]:
# On scikit-learn >= 1.0 use vect.get_feature_names_out(); get_feature_names() was removed in 1.2
feature_names = vect.get_feature_names()

In [10]:
feature_names

['be',
 'can',
 'darkest',
 'found',
 'happiness',
 'if',
 'in',
 'light',
 'of',
 'on',
 'one',
 'only',
 'remembers',
 'the',
 'times',
 'to',
 'turn']

In [11]:
import pandas as pd
quote_df = pd.DataFrame(bag_of_words_arr, columns = feature_names)
quote_df["quote"] = quote
quote_df.set_index("quote", inplace = True)

In [12]:
quote_df

| quote | be | can | darkest | found | happiness | if | in | light | of | on | one | only | remembers | the | times | to | turn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Happiness can be found in the darkest of times, if one only remembers to turn on the light | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 |


### 2.1. Bag-of-Words for a List of Strings

In [14]:
import pandas as pd
quotes = pd.read_csv("quotes.csv", header=None)
quotes

| | 0 |
|---|---|
| 0 | It matters not what someone is born, but what ... |
| 1 | We are only as strong as we are united, as wea... |
| 2 | Happiness can be found, even in the darkest of... |
| 3 | Curiosity is not a sin.... But we should exerc... |
| 4 | Age is foolish and forgetful when it underesti... |
| 5 | You will also find that help will always be gi... |
| 6 | It is our choices, Harry, that show what we tr... |
| 7 | The fact that you can feel pain like this is y... |
| 8 | It does not do to dwell on dreams and forget t... |
| 9 | Differences of habit and language are nothing ... |


In [15]:
quotes[0].values

array(['It matters not what someone is born, but what they grow to be.',
       'We are only as strong as we are united, as weak as we are divided.',
       'Happiness can be found, even in the darkest of times, if one only remembers to turn on the light',
       'Curiosity is not a sin.... But we should exercise caution with our curiosity... yes, indeed',
       'Age is foolish and forgetful when it underestimates youth',
       'You will also find that help will always be given at Hogwarts to those who ask for it',
       'It is our choices, Harry, that show what we truly are far more than our abilities.',
       'The fact that you can feel pain like this is your greatest strength',
       'It does not do to dwell on dreams and forget to live',
       'Differences of habit and language are nothing at all if our aims are identical and our hearts are open'],
      dtype=object)

In [16]:
vect = CountVectorizer()
vect.fit(quotes[0].values)
bag_of_words = vect.transform(quotes[0].values)

In [17]:
print("Vocabulary: \n", vect.vocabulary_)

print("\nVocabulary size: ", len(vect.vocabulary_))

Vocabulary: 
 {'it': 50, 'matters': 55, 'not': 57, 'what': 87, 'someone': 70, 'is': 49, 'born': 12, 'but': 13, 'they': 76, 'grow': 38, 'to': 80, 'be': 11, 'we': 85, 'are': 7, 'only': 62, 'as': 8, 'strong': 72, 'united': 84, 'weak': 86, 'divided': 20, 'happiness': 40, 'can': 14, 'found': 35, 'even': 25, 'in': 47, 'the': 75, 'darkest': 18, 'of': 59, 'times': 79, 'if': 46, 'one': 61, 'remembers': 66, 'turn': 82, 'on': 60, 'light': 52, 'curiosity': 17, 'sin': 69, 'should': 67, 'exercise': 26, 'caution': 15, 'with': 91, 'our': 64, 'yes': 92, 'indeed': 48, 'age': 1, 'foolish': 31, 'and': 6, 'forgetful': 34, 'when': 88, 'underestimates': 83, 'youth': 95, 'you': 93, 'will': 90, 'also': 4, 'find': 30, 'that': 74, 'help': 43, 'always': 5, 'given': 36, 'at': 10, 'hogwarts': 44, 'those': 78, 'who': 89, 'ask': 9, 'for': 32, 'choices': 16, 'harry': 41, 'show': 68, 'truly': 81, 'far': 28, 'more': 56, 'than': 73, 'abilities': 0, 'fact': 27, 'feel': 29, 'pain': 65, 'like': 53, 'this': 77, 'your': 94, '

In [18]:
# Label the columns of the count matrix with the vocabulary terms and index the rows by quote
feature_names = vect.get_feature_names()
quote_df = pd.DataFrame(bag_of_words.toarray(), columns = feature_names)
quote_df["quote"] = quotes[0].values
quote_df.set_index("quote", inplace=True)

In [19]:
quote_df

| quote | abilities | age | aims | all | also | always | and | are | as | ask | ... | weak | what | when | who | will | with | yes | you | your | youth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| It matters not what someone is born, but what they grow to be. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| We are only as strong as we are united, as weak as we are divided. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 4 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Happiness can be found, even in the darkest of times, if one only remembers to turn on the light | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Curiosity is not a sin.... But we should exercise caution with our curiosity... yes, indeed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| Age is foolish and forgetful when it underestimates youth | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| You will also find that help will always be given at Hogwarts to those who ask for it | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 |
| It is our choices, Harry, that show what we truly are far more than our abilities. | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| The fact that you can feel pain like this is your greatest strength | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| It does not do to dwell on dreams and forget to live | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Differences of habit and language are nothing at all if our aims are identical and our hearts are open | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### 2.2. CountVectorizer Summary

In [20]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the quotes: one quote per row, no header line
quotes = pd.read_csv("quotes.csv", header=None)

# Learn the vocabulary, then encode each quote as a row of token counts
vect = CountVectorizer()
vect.fit(quotes[0].values)
bag_of_words = vect.transform(quotes[0].values)

# Label the columns with the vocabulary terms and index the rows by quote
feature_names = vect.get_feature_names()
quote_df = pd.DataFrame(bag_of_words.toarray(), columns = feature_names)
quote_df["quote"] = quotes[0].values
quote_df.set_index("quote", inplace=True)

In [21]:
quote_df

| quote | abilities | age | aims | all | also | always | and | are | as | ask | ... | weak | what | when | who | will | with | yes | you | your | youth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| It matters not what someone is born, but what they grow to be. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| We are only as strong as we are united, as weak as we are divided. | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 4 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Happiness can be found, even in the darkest of times, if one only remembers to turn on the light | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Curiosity is not a sin.... But we should exercise caution with our curiosity... yes, indeed | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| Age is foolish and forgetful when it underestimates youth | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| You will also find that help will always be given at Hogwarts to those who ask for it | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 1 | 0 | 0 |
| It is our choices, Harry, that show what we truly are far more than our abilities. | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| The fact that you can feel pain like this is your greatest strength | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| It does not do to dwell on dreams and forget to live | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Differences of habit and language are nothing at all if our aims are identical and our hearts are open | 0 | 0 | 1 | 1 | 0 | 0 | 2 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


<a id="2"></a>

## 2. TfidfVectorizer
* Converts a collection of raw documents to a matrix of TF-IDF (term frequency-inverse document frequency) features
* Allows us to weight words by how important they are to a string
* High weight is given to words, or terms, that appear often in a particular string but do not appear often in the corpus (across all strings)
    * Features with low tf-idf are either used commonly across all strings, or used rarely and only in long strings
* https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer
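
The weighting can be worked through by hand. The sketch below assumes TfidfVectorizer's default settings as described in the scikit-learn documentation: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and scaling of each row to unit Euclidean length. The pre-tokenized toy corpus and the helper name `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized toy corpus, for illustration only
docs = [
    ["light", "light", "dark"],
    ["dark", "times"],
    ["dark", "light"],
]

n_docs = len(docs)
vocabulary = sorted({tok for doc in docs for tok in doc})

# Document frequency: the number of documents that contain each term
df = Counter(tok for doc in docs for tok in set(doc))

# Smoothed idf, assuming the scikit-learn default (smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_vector(doc):
    # Term count times idf, then scale the row to unit Euclidean (L2) length
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]

weights = [tfidf_vector(doc) for doc in docs]
for term in vocabulary:
    print(term, round(idf[term], 4))
```

In this toy corpus "dark" occurs in every document and receives the lowest idf, while "times" occurs in only one and receives the highest, so within the second document "times" outweighs "dark" even though both appear once.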

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
# By default each row is scaled to unit Euclidean (L2) length; pass norm=None to keep unnormalized weights
vect = TfidfVectorizer().fit(quotes[0].values)
print("Number of features: ", len(vect.get_feature_names()))

Number of features:  96


In [27]:
# Encode each quote as a row of tf-idf weights over the fitted vocabulary
bag_of_words = vect.transform(quotes[0].values)

In [28]:
feature_names = vect.get_feature_names()
quote_df = pd.DataFrame(bag_of_words.toarray(), columns = feature_names)
quote_df["quote"] = quotes[0].values
quote_df.set_index("quote", inplace=True)

In [29]:
quote_df

| quote | abilities | age | aims | all | also | always | and | are | as | ask | ... | weak | what | when | who | will | with | yes | you | your | youth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| It matters not what someone is born, but what they grow to be. | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.51587 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| We are only as strong as we are united, as weak as we are divided. | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.402824 | 0.722169 | 0.0 | ... | 0.180542 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Happiness can be found, even in the darkest of times, if one only remembers to turn on the light | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Curiosity is not a sin.... But we should exercise caution with our curiosity... yes, indeed | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.26983 | 0.26983 | 0.0 | 0.0 | 0.0 |
| Age is foolish and forgetful when it underestimates youth | 0.0 | 0.371176 | 0.0 | 0.0 | 0.0 | 0.0 | 0.276055 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.371176 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.371176 |
| You will also find that help will always be given at Hogwarts to those who ask for it | 0.0 | 0.0 | 0.0 | 0.0 | 0.240136 | 0.240136 | 0.0 | 0.0 | 0.0 | 0.240136 | ... | 0.0 | 0.0 | 0.0 | 0.240136 | 0.480272 | 0.0 | 0.0 | 0.204138 | 0.0 | 0.0 |
| It is our choices, Harry, that show what we truly are far more than our abilities. | 0.274206 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.203935 | 0.0 | 0.0 | ... | 0.0 | 0.2331 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| The fact that you can feel pain like this is your greatest strength | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.255458 | 0.300506 | 0.0 |
| It does not do to dwell on dreams and forget to live | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.23601 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Differences of habit and language are nothing at all if our aims are identical and our hearts are open | 0.0 | 0.0 | 0.22048 | 0.22048 | 0.0 | 0.0 | 0.327955 | 0.491933 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


<a id="3"></a>

## 3. Stop words
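* It is common practice to remove stop words (very frequent words such as "the", "this", and "in" that carry little meaning on their own) from text before vectorizing strings
    * This requires a list of stop words
* Both CountVectorizer and TfidfVectorizer can do this
    * Use the stop_words keyword argument: "english" selects the built-in English list, or pass a custom list of words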
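
Conceptually, stop-word removal is a filter applied to the tokens before they are counted. The sketch below illustrates that step with the standard library; the ten-word `STOP_WORDS` set and the helper name `tokens_without_stop_words` are made up for this example and are much smaller than scikit-learn's built-in English list.

```python
import re

# A tiny, hypothetical stop-word list, for illustration only
STOP_WORDS = frozenset({"the", "this", "in", "is", "of", "to", "on", "can", "be", "if"})

def tokens_without_stop_words(text, stop_words=STOP_WORDS):
    # Lowercase and tokenize (tokens of two or more word characters),
    # then drop every token that is in the stop-word list
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

kept = tokens_without_stop_words(
    "The fact that you can feel pain like this is your greatest strength"
)
print(kept)
```

Words such as "that", "you", and "your" survive here only because the toy list does not contain them; a fuller list would remove more.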

In [30]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

In [31]:
print("Number of stop words: ", len(ENGLISH_STOP_WORDS))

Number of stop words:  318


In [32]:
ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In [33]:
stop_words_list = list(ENGLISH_STOP_WORDS)

In [34]:
# Every 10th stop word; a frozenset has no defined order, so this sample can differ between runs
stop_words_list[::10]

['me',
 'could',
 'hundred',
 'since',
 'describe',
 'here',
 'co',
 'whereas',
 'via',
 'de',
 'three',
 'of',
 'eleven',
 'where',
 'forty',
 'seeming',
 'her',
 'own',
 'himself',
 'us',
 'whatever',
 'third',
 'behind',
 'formerly',
 'sincere',
 'per',
 'further',
 'few',
 'yet',
 'about',
 'might',
 'how']

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(stop_words = "english").fit(quotes[0].values)

print("Number of features: ", len(vect.get_feature_names()))

Number of features:  50


In [36]:
vect.get_feature_names()[::2]

['abilities',
 'aims',
 'born',
 'choices',
 'darkest',
 'divided',
 'dreams',
 'exercise',
 'far',
 'foolish',
 'forgetful',
 'greatest',
 'habit',
 'harry',
 'help',
 'identical',
 'light',
 'live',
 'open',
 'remembers',
 'strength',
 'times',
 'turn',
 'united',
 'yes']

<a id="4"></a>

## 4. min_df
* Allows us to specify the minimum number of different documents in which a word must appear to be kept in the vocabulary
    * An integer is an absolute number of documents; a float between 0.0 and 1.0 is a proportion of the documents
* Acts as a cut-off value when constructing the bag of words
* A word that appears in only one document has little to contribute to the model
    * so there is no need to include it in the vocabulary (and thus in the bag of words)
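
The cut-off amounts to counting, for every word, the number of documents that contain it and keeping only the words that reach the threshold. The sketch below shows that logic with the standard library on a made-up, pre-tokenized corpus; it illustrates the idea rather than scikit-learn's actual implementation.

```python
from collections import Counter

# Hypothetical pre-tokenized toy corpus, for illustration only
docs = [
    ["it", "matters", "not"],
    ["it", "does", "not", "do"],
    ["age", "is", "age"],
]
min_df = 2

# Document frequency: each document counts a word at most once,
# no matter how often the word is repeated inside that document
doc_freq = Counter(tok for doc in docs for tok in set(doc))

# Keep only the words that appear in at least min_df documents
vocabulary = sorted(tok for tok, n in doc_freq.items() if n >= min_df)
print(vocabulary)
```

Note that "age" occurs twice but only within a single document, so its document frequency is 1 and it is dropped.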

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Set the minimum number of documents in which a word must appear to be included in the vocabulary
vect = TfidfVectorizer(min_df=2).fit(quotes[0].values)

bag_of_words = vect.transform(quotes[0].values)

feature_names = vect.get_feature_names()

print("Number of features: ", len(vect.get_feature_names()))

Number of features:  20


In [38]:
quote_df = pd.DataFrame(bag_of_words.toarray(), columns = feature_names)
quote_df["quote"] = quotes[0].values
quote_df.set_index("quote", inplace=True)

quote_df.head()

| quote | and | are | at | be | but | can | if | is | it | not | of | on | only | our | that | the | to | we | what | you |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| It matters not what someone is born, but what they grow to be. | 0.0 | 0.0 | 0.0 | 0.307179 | 0.351109 | 0.0 | 0.0 | 0.245263 | 0.245263 | 0.307179 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.273104 | 0.0 | 0.702218 | 0.0 |
| We are only as strong as we are united, as weak as we are divided. | 0.0 | 0.682763 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.260135 | 0.0 | 0.0 | 0.0 | 0.0 | 0.682763 | 0.0 | 0.0 |
| Happiness can be found, even in the darkest of times, if one only remembers to turn on the light | 0.0 | 0.0 | 0.0 | 0.271676 | 0.0 | 0.310528 | 0.310528 | 0.0 | 0.0 | 0.0 | 0.310528 | 0.310528 | 0.310528 | 0.0 | 0.0 | 0.621057 | 0.241539 | 0.0 | 0.0 | 0.0 |
| Curiosity is not a sin.... But we should exercise caution with our curiosity... yes, indeed | 0.0 | 0.0 | 0.0 | 0.0 | 0.514058 | 0.0 | 0.0 | 0.359089 | 0.0 | 0.44974 | 0.0 | 0.0 | 0.0 | 0.44974 | 0.0 | 0.0 | 0.0 | 0.44974 | 0.0 | 0.0 |
| Age is foolish and forgetful when it underestimates youth | 0.662993 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.529358 | 0.529358 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


<a id="5"></a>

## 5. ngrams
* Allows us to count pairs or triplets of tokens that appear next to each other (bigrams and trigrams), not just single tokens
* ngram_range = (1, 2)
    * creates unigrams and bigrams
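
How an n-gram range expands a token sequence can be sketched in a few lines. The helper `word_ngrams` below is a hypothetical stand-in written for this example, not a scikit-learn function; it joins adjacent tokens with a space, which matches the look of feature names such as 'it is' in the output further down.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    # Collect every run of n adjacent tokens for each n in the inclusive range
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["it", "is", "our", "choices"]
grams = word_ngrams(tokens)
print(grams)
```

With four tokens, the range (1, 2) yields four unigrams plus three bigrams, which is why the number of features grows quickly once bigrams are included.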

In [45]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Extract both single words (unigrams) and pairs of adjacent words (bigrams) as features
vect = TfidfVectorizer(ngram_range = (1, 2)).fit(quotes[0].values)

bag_of_words = vect.transform(quotes[0].values)

feature_names = vect.get_feature_names()
print("Number of features: ", len(feature_names))

Number of features:  231


In [46]:
feature_names[::12]

['abilities',
 'and forget',
 'as',
 'born',
 'curiosity',
 'dreams',
 'feel',
 'found',
 'harry',
 'if our',
 'it is',
 'more than',
 'on the',
 'our hearts',
 'someone is',
 'the fact',
 'to dwell',
 'we',
 'who',
 'your']