## Main Documentation

http://scikit-learn.org/stable/modules/feature_extraction.html

The <font color='green' size='3pt'>sklearn.feature_extraction</font> module extracts features from datasets in formats such as text and images, and returns them in a format supported by machine learning algorithms.



# Text feature extraction



<font color = 'red' size = '4pt'> The Bag of Words representation </font>

Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols, cannot be fed directly to the algorithms, as most of them expect numerical feature vectors of a fixed size rather than raw text documents of variable length.


In order to address this, scikit-learn provides utilities for the most common ways to extract numerical features from text content, namely:

- <font color = 'red'>tokenizing </font> strings and giving an integer id to each possible token, for instance by using whitespace and punctuation as token separators.
- <font color = 'red'>counting </font> the occurrences of tokens in each document.
- <font color = 'red'>normalizing </font> and weighting with diminishing importance tokens that occur in the majority of samples / documents.
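
The first two steps can be sketched by hand in plain Python. This is only an illustration of the idea, not scikit-learn's implementation: the helper `tokenize`, the two-sentence corpus and all variable names are made up for this sketch, and the normalizing step is left out.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This is the second second document.",
]

def tokenize(text):
    # Hypothetical helper: lowercase, then split on anything that is
    # not a letter or a digit (whitespace and punctuation act as separators).
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

# Step 1: tokenize and give every distinct token an integer id.
tokens_per_doc = [tokenize(doc) for doc in corpus]
all_tokens = sorted({tok for tokens in tokens_per_doc for tok in tokens})
vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}

# Step 2: count the occurrences of each token in each document.
counts = []
for tokens in tokens_per_doc:
    row = [0] * len(vocabulary)
    for tok, n in Counter(tokens).items():
        row[vocabulary[tok]] = n
    counts.append(row)

print(vocabulary)
print(counts)
```

Each row of `counts` has one column per vocabulary entry, so documents of different lengths end up as vectors of the same size.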

<font color= 'red'> CountVectorizer </font> implements both tokenization and occurrence counting in a single class:

<font color= 'green'> Parameters </font> : http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [6]:
"Let’s use it to tokenize and count the word occurrences of text documents:"

corpus = [
        'This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?',
        ]

X = vectorizer.fit_transform(corpus)
X

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [7]:
"The default configuration tokenizes the string by extracting words of at least 2 letters."
"The specific function that does this step can be requested explicitly:"

analyze = vectorizer.build_analyzer()
analyze("This is a text document to analyze.")

['this', 'is', 'text', 'document', 'to', 'analyze']
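
The single-letter word "a" is missing from the result because the default token_pattern shown in the CountVectorizer repr above only matches runs of two or more word characters. A rough stand-alone sketch of that behaviour with the standard `re` module (`simple_analyze` is a hypothetical helper, not the analyzer scikit-learn builds):

```python
import re

# Same pattern as the default token_pattern in the repr above.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def simple_analyze(text):
    # Hypothetical helper: lowercase, then keep runs of 2+ word characters.
    return TOKEN_PATTERN.findall(text.lower())

tokens = simple_analyze("This is a text document to analyze.")
print(tokens)
```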

In [8]:
"Each term found by the analyzer during the fit is assigned "
"a unique integer index corresponding to a column in the resulting matrix."
"This interpretation of the columns can be retrieved as follows:"

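# On scikit-learn 1.0 and later, use get_feature_names_out() instead;
# get_feature_names() was removed in version 1.2.
vectorizer.get_feature_names()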

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

In [9]:
"The output above shows that there are exactly 9 unique words, arranged in alphabetical order."
"So the matrix is 4x9, because we have 4 sentences."
"For each sentence, a column holds the number of times that word occurs in the sentence (0 if it is absent)."

X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

In [10]:
"The converse mapping from feature name to column index "
"is stored in the vocabulary_ attribute of the vectorizer:"

vectorizer.vocabulary_.get('document')

1
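
The feature-name list and the vocabulary are two directions of the same mapping. A small plain-Python sketch, with the feature list and the matrix row copied from the outputs above (the variable names are chosen for illustration only):

```python
# Column index -> feature name, copied from the output above.
feature_names = ['and', 'document', 'first', 'is', 'one',
                 'second', 'the', 'third', 'this']

# Feature name -> column index: the role played by vocabulary_.
vocabulary = {name: idx for idx, name in enumerate(feature_names)}

# Second row of X.toarray(), i.e. 'This is the second second document.'
row = [0, 1, 0, 1, 0, 2, 1, 0, 1]

# Use the index -> name direction to read the row back as word counts.
word_counts = {feature_names[i]: c for i, c in enumerate(row) if c}

print(vocabulary['document'])
print(word_counts)
```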

Udacity CountVectorizer
====================


In [5]:
string1 = "Hi Katie the self driving car will be late Best Sebastian"
string2 = "Hi Sebastian the Machine Learning class will be great great great Best Katie"
string3 = "hi Katie the machine learning class will be most excellent"

email_list = [string1, string2, string3]
vector = CountVectorizer()

bag_of_words = vector.fit_transform(email_list)

print(bag_of_words)

vector.get_feature_names()

  (0, 13)	1
  (0, 1)	1
  (0, 9)	1
  (0, 0)	1
  (0, 16)	1
  (0, 2)	1
  (0, 4)	1
  (0, 14)	1
  (0, 15)	1
  (0, 8)	1
  (0, 7)	1
  (1, 6)	3
  (1, 3)	1
  (1, 10)	1
  (1, 11)	1
  (1, 13)	1
  (1, 1)	1
  (1, 0)	1
  (1, 16)	1
  (1, 15)	1
  (1, 8)	1
  (1, 7)	1
  (2, 5)	1
  (2, 12)	1
  (2, 3)	1
  (2, 10)	1
  (2, 11)	1
  (2, 0)	1
  (2, 16)	1
  (2, 15)	1
  (2, 8)	1
  (2, 7)	1


['be',
 'best',
 'car',
 'class',
 'driving',
 'excellent',
 'great',
 'hi',
 'katie',
 'late',
 'learning',
 'machine',
 'most',
 'sebastian',
 'self',
 'the',
 'will']

Each line of the output above has the form (document index, feature index) followed by the count.

Take this entry as an example:

    (1, 6)    3


In document number 1, feature number 6 appears 3 times.

Because Python indexing starts at 0, document 1 is string2 and feature number 6 is "great".


In [6]:


vector.vocabulary_.get('great')

6
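
The same lookup can be done for any entry of the printed sparse output. Below is a plain-Python sketch that decodes a few (document, feature, count) triplets copied from the output above; the names `triplets`, `doc_names` and `decoded` are made up for this illustration.

```python
# Feature names copied from the output above (position = feature index).
feature_names = ['be', 'best', 'car', 'class', 'driving', 'excellent',
                 'great', 'hi', 'katie', 'late', 'learning', 'machine',
                 'most', 'sebastian', 'self', 'the', 'will']

doc_names = ['string1', 'string2', 'string3']

# A few entries from the printed sparse matrix: (document, feature, count).
triplets = [(1, 6, 3), (0, 13, 1), (2, 5, 1)]

# Translate the indices back into readable names.
decoded = [(doc_names[d], feature_names[f], n) for d, f, n in triplets]

for doc, word, n in decoded:
    print(doc, word, n)
```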

NLTK stopwords and stemming
=========================


 <font color='green' size = '4pt'> See the example in ud120-projects/text-learning/nltk_stopwords.py </font>
 
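 Text should be processed in this order: <font color='green' size = '3pt'> 1) stemming, 2) bag of words </font>. Stemming first means that variants of the same word are merged into one feature before they are counted.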
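
A toy sketch of why the order matters. The function `toy_stem` is a hypothetical, very crude stand-in for a real stemmer (such as the ones shipped with NLTK); it only strips a few suffixes and is used here so the example runs without any extra packages or downloads.

```python
from collections import Counter

def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: strip a few common
    # suffixes, keeping at least 3 characters of the word.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

text = "respond responds responded responding"

# Bag of words without stemming: four separate features.
raw_counts = Counter(text.split())

# Stemming first, then counting: a single feature with count 4.
stemmed_counts = Counter(toy_stem(w) for w in text.split())

print(raw_counts)
print(stemmed_counts)
```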

TF-IDF representation
=====================

<font color= 'red' size='3pt'> TF => </font> Term Frequency (similar to bag of words)

<font color= 'red' size='3pt'> IDF => </font> Inverse Document Frequency (puts a higher weight on words that are rare across all documents than on common ones)
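
A hand-rolled sketch of the two factors on a made-up, already tokenized corpus. The IDF formula used here, ln((1 + n) / (1 + df)) + 1, is the smoothed variant documented for scikit-learn's TfidfTransformer with smooth_idf=True; the final row normalization is left out, and all names (`docs`, `idf`, `tfidf`) are assumptions of this sketch.

```python
import math
from collections import Counter

# Tiny made-up corpus, already tokenized.
docs = [
    ["great", "class", "great"],
    ["great", "car"],
    ["late", "car"],
]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get a larger value.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Term frequency (raw count) times inverse document frequency.
    tf = Counter(doc)
    return {term: count * idf(term) for term, count in tf.items()}

weights = tfidf(docs[0])
print(weights)
```

Here "class" appears in one document and "great" in two, so "class" gets the larger IDF even though "great" has the larger term frequency in the first document.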

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
vectorizer.fit_transform(email_list)

feature_names = vectorizer.get_feature_names()

print('Number of different words: {0}'.format(len(feature_names)))


Number of different words: 13


For clustering of text documents, see the example at the link below.

http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#sphx-glr-auto-examples-text-document-clustering-py