# Introduction to Text Analytics

-----

In this notebook, we introduce basic concepts in text analytics, which is one of the most exciting application areas for machine learning. Text analytics forms the basis for [natural language processing][nlp], and is explicitly used for [sentiment analysis][sa], for [language identification][lc], for [spelling and grammar correction][sc], as well as for mining text data contained in forms, such as [medical informatics][mi]. Before moving into text classification, in this notebook we focus on basic text analytics such as accessing data, tokenizing a corpus, and computing token frequencies. We demonstrate these tasks by using basic Python concepts and by using functionality from within the scikit-learn library. We also introduce the NLTK library and use methods within this library to perform text analytic tasks.

---

[nlp]: https://en.wikipedia.org/wiki/Natural_language_processing
[sa]: https://en.wikipedia.org/wiki/Sentiment_analysis
[lc]: https://en.wikipedia.org/wiki/Language_identification
[sc]: http://norvig.com/spell-correct.html
[mi]: https://en.wikipedia.org/wiki/Health_informatics

## Table of Contents

[Text Data Preparation](#Text-Data-Preparation)
- [Tokenization](#Tokenization)
- [Remove Unwanted Characters](#Remove-Unwanted-Characters)
- [Convert to Same Case](#Convert-to-Same-Case)
- [NLTK](#NLTK)
- [Stop Words](#Stop-Words)
- [Stemming](#Stemming)

[Bag of Words](#Bag-of-Words)
- [CountVectorizer](#CountVectorizer)
- [TF-IDF](#TF-IDF)

-----

Before proceeding with the rest of this notebook, we first include our notebook setup code.

-----

In [1]:
import pandas as pd

# Suppress warnings to keep the notebook output uncluttered
import warnings
warnings.filterwarnings("ignore")


-----

[[Back to TOC]](#Table-of-Contents)

## Text Data Preparation
    
Unlike the datasets we've used so far, text analysis datasets don't have many features. A typical text analysis dataset is just a collection of text, which is called a **corpus**. Machine learning algorithms can't train directly on text, so we first have to pre-process the text data. **Tokenization** is normally the first step in text analysis.

### Tokenization
Tokenization is the process of breaking a piece of text into smaller chunks, normally single words or phrases, which are called tokens.

In the next Code cell, we tokenize a paragraph of text into single words with the string method `split`, then print all the words (tokens) in the text.


In [2]:
text = """Machine Learning is a technique of parsing data, learn from that data and then 
apply what is learned to make an informed decision. Machine learning focuses on 
designing algorithms that can learn from and make predictions on the data. 
The learning can be supervised or unsupervised.
"""
print(text.split())

['Machine', 'Learning', 'is', 'a', 'technique', 'of', 'parsing', 'data,', 'learn', 'from', 'that', 'data', 'and', 'then', 'apply', 'what', 'is', 'learned', 'to', 'make', 'an', 'informed', 'decision.', 'Machine', 'learning', 'focuses', 'on', 'designing', 'algorithms', 'that', 'can', 'learn', 'from', 'and', 'make', 'predictions', 'on', 'the', 'data.', 'The', 'learning', 'can', 'be', 'supervised', 'or', 'unsupervised.']


---
With the collection of tokens, we can take a small step forward and count them. In the following code we use the `Counter` class from Python's `collections` module to count the words in the text. The word count already gives us some useful information. For example, the most frequently used words may give us a better idea of what this piece of text is about. In the next Code cell, we print the word count, as well as the 10 most used words in the text.

---

In [3]:
import collections as cl

# Tokenize and create counter
words = text.split()
wc = cl.Counter(words)
print("Word Count:\n", wc)
print("\n Top 10 Words:")
wc.most_common(10)

Word Count:
 Counter({'Machine': 2, 'is': 2, 'learn': 2, 'from': 2, 'that': 2, 'and': 2, 'make': 2, 'learning': 2, 'on': 2, 'can': 2, 'Learning': 1, 'a': 1, 'technique': 1, 'of': 1, 'parsing': 1, 'data,': 1, 'data': 1, 'then': 1, 'apply': 1, 'what': 1, 'learned': 1, 'to': 1, 'an': 1, 'informed': 1, 'decision.': 1, 'focuses': 1, 'designing': 1, 'algorithms': 1, 'predictions': 1, 'the': 1, 'data.': 1, 'The': 1, 'be': 1, 'supervised': 1, 'or': 1, 'unsupervised.': 1})

 Top 10 Words:


[('Machine', 2),
 ('is', 2),
 ('learn', 2),
 ('from', 2),
 ('that', 2),
 ('and', 2),
 ('make', 2),
 ('learning', 2),
 ('on', 2),
 ('can', 2)]

---
Let's examine the output of the previous Code cell. You may have already spotted some issues. In the word count, `'data'`, `'data,'` and `'data.'` are treated as different words because of the attached punctuation; `'Learning'` and `'learning'` are treated as different words because of their different case. These two issues have a big impact on the effectiveness of our analysis, but they are easy to fix.

---

### Remove Unwanted Characters
In text analysis, we want pure *text*. We want to remove all special characters, which typically means any characters that are not letters or numbers. One simple way to remove special characters is to use a _regular expression_. A regular expression, also known as a regex, is a sequence of characters that defines a search pattern. For example, the regex `r'[^\w]'` matches every character that is not a word character, where word characters are letters (both upper and lower case), digits, and the underscore.

In the following Code cell, we use Python's regular expression module [re](https://docs.python.org/3/library/re.html) to replace all special characters in `text` with whitespace.

---



In [4]:
import re
text_ns = re.sub(r'[^\w]',' ', text)
print(text_ns)

Machine Learning is a technique of parsing data  learn from that data and then  apply what is learned to make an informed decision  Machine learning focuses on  designing algorithms that can learn from and make predictions on the data   The learning can be supervised or unsupervised  


---
Regular expressions are very powerful. In our case, we eliminate all characters that are not letters or numbers. Sometimes you may want to keep information that contains special characters, like the '@' in email addresses. You can then define a more complicated regular expression to keep certain patterns of characters. To learn more about Python regular expressions, refer to the documentation of the Python [re module](https://docs.python.org/3/library/re.html).
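As a minimal sketch of keeping certain patterns, the code below leaves simple e-mail-like tokens intact while still splitting everything else on non-word characters. The sample sentence, the address in it, and the e-mail pattern are illustrative assumptions; real addresses can be far more varied than this simplified pattern allows.

```python
import re

# Hypothetical sample text; the address is made up for illustration.
sample = "Contact us at info@example.com, or visit the lab!"

# Try the e-mail-like alternative first, then fall back to plain word tokens.
# This pattern is a deliberate simplification, not a full address validator.
token_pattern = r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\w+'

tokens = re.findall(token_pattern, sample)
print(tokens)
```

Here `re.findall` extracts the tokens directly instead of replacing unwanted characters with whitespace, which makes it easier to state what should be kept.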

### Convert to Same Case
Next, we convert all the words to the same case. In the following code we use the string method `lower()` to convert all characters to lower case.

In [5]:
text_nsl = text_ns.lower()
print(text_nsl)

machine learning is a technique of parsing data  learn from that data and then  apply what is learned to make an informed decision  machine learning focuses on  designing algorithms that can learn from and make predictions on the data   the learning can be supervised or unsupervised  


---
The text is much cleaner now. Let's perform word count on the cleaned text.

In [6]:
wc = cl.Counter(text_nsl.split())
print("Word Count:\n", wc)
print("\n Top 10 Words:")
wc.most_common(10)

Word Count:
 Counter({'learning': 3, 'data': 3, 'machine': 2, 'is': 2, 'learn': 2, 'from': 2, 'that': 2, 'and': 2, 'make': 2, 'on': 2, 'can': 2, 'the': 2, 'a': 1, 'technique': 1, 'of': 1, 'parsing': 1, 'then': 1, 'apply': 1, 'what': 1, 'learned': 1, 'to': 1, 'an': 1, 'informed': 1, 'decision': 1, 'focuses': 1, 'designing': 1, 'algorithms': 1, 'predictions': 1, 'be': 1, 'supervised': 1, 'or': 1, 'unsupervised': 1})

 Top 10 Words:


[('learning', 3),
 ('data', 3),
 ('machine', 2),
 ('is', 2),
 ('learn', 2),
 ('from', 2),
 ('that', 2),
 ('and', 2),
 ('make', 2),
 ('on', 2)]

---
The new list of most used words gives us a better idea of what the text is about. However, if we examine the list more closely, we can see that there's another issue.

Some words in the list, like 'from', 'that', 'and', 'on', and 'can', don't give any useful information. They are _useless_ for text analysis and can be ignored. Frequently used but uninformative words like these are called **stop words**. We can use the Python [Natural Language ToolKit][nltk], or NLTK, module to deal with stop words.

---

### NLTK

The scikit-learn library is a general purpose, Python machine learning library that does include some basic text analysis functionality. Text analysis, however, is an extremely large and growing topic. As a result, we want to explore an additional library for natural language processing. This library, known as [Natural Language ToolKit][nltk], or NLTK, enables a wide range of text analyses either on its own, or in conjunction with scikit-learn. The NLTK library is extensive and includes [documentation][nltkd] covering many of the topics we have demonstrated previously.

In the rest of this notebook, we will explore how to use NLTK to perform basic text analysis, in a similar manner as demonstrated earlier via standard Python and the scikit-learn library. 

-----
[nltk]: http://www.nltk.org
[nltkd]: http://www.nltk.org/book/

### Stop Words
In the following two Code cells, we use the NLTK module to remove stop words from the text. We first print all English stop words, then remove the stop words from the word list.




In [7]:
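# The stop word corpus must be available locally; if it is missing,
# run nltk.download('stopwords') once beforehand.
from nltk.corpus import stopwords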
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [8]:
#remove stop words
words = text_nsl.split()
words_no_stop = []
for word in words:
    if word not in stop_words:
        words_no_stop.append(word)
wc = cl.Counter(words_no_stop)
print("Word Count:\n", wc)
print("\n Top 10 Words:")
wc.most_common(10)        

Word Count:
 Counter({'learning': 3, 'data': 3, 'machine': 2, 'learn': 2, 'make': 2, 'technique': 1, 'parsing': 1, 'apply': 1, 'learned': 1, 'informed': 1, 'decision': 1, 'focuses': 1, 'designing': 1, 'algorithms': 1, 'predictions': 1, 'supervised': 1, 'unsupervised': 1})

 Top 10 Words:


[('learning', 3),
 ('data', 3),
 ('machine', 2),
 ('learn', 2),
 ('make', 2),
 ('technique', 1),
 ('parsing', 1),
 ('apply', 1),
 ('learned', 1),
 ('informed', 1)]

---
### Stemming
Now the most common words are all informative, but three of them are different forms of *learn*: *'learn'*, *'learning'* and *'learned'*. It would be better to treat them as the same word. Reducing words to a common base form in this way is called **stemming**. The `PorterStemmer` in NLTK can be used to accomplish this task.

In the following Code cell, we first create a `PorterStemmer` object, then use the `stem()` method of the object to *stem* the words.

The new list of most common words shows that the count of the word 'learn' is now 6, because both 'learning' and 'learned' are converted to 'learn'.

However, it seems `PorterStemmer` also makes some mistakes. It converts 'machine' to 'machin', 'parsing' to 'pars', 'apply' to 'appli' and 'decision' to 'decis'. The reason is that the purpose of stemming is to bring variant forms of a word together, not to map a word onto its _paradigm_ form. For example, 'apply', 'applied' and 'applying' are all converted to 'appli'. In fact, stemming doesn't always improve text analysis results, and whether to apply it is a rather advanced topic. An easy way to decide is to run the analysis with and without stemming and compare the results.

Now the top 3 most common words describe the text well. From just 'learn', 'data' and 'machin' (the stem of 'machine'), we can already figure out what this piece of text is about with high confidence.

In [9]:
from nltk.stem import PorterStemmer
st = PorterStemmer()

words_clean = []
for word in words_no_stop:
    words_clean.append(st.stem(word))
    
wc = cl.Counter(words_clean)
print("Word Count:\n", wc)
print("\n Top 10 Words:")
wc.most_common(10)

Word Count:
 Counter({'learn': 6, 'data': 3, 'machin': 2, 'make': 2, 'techniqu': 1, 'pars': 1, 'appli': 1, 'inform': 1, 'decis': 1, 'focus': 1, 'design': 1, 'algorithm': 1, 'predict': 1, 'supervis': 1, 'unsupervis': 1})

 Top 10 Words:


[('learn', 6),
 ('data', 3),
 ('machin', 2),
 ('make', 2),
 ('techniqu', 1),
 ('pars', 1),
 ('appli', 1),
 ('inform', 1),
 ('decis', 1),
 ('focus', 1)]
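
The preparation steps used so far (removing special characters, lower-casing, removing stop words, and optionally stemming) can be collected into one helper function. The sketch below is only one possible arrangement: the names `prepare`, `demo_stop_words`, and `demo_text` are hypothetical, and the stop word set is a tiny illustrative stand-in for NLTK's much longer list.

```python
import collections
import re

def prepare(text, stop_words=frozenset(), stem=None):
    """Return cleaned tokens: strip non-word characters, lower-case,
    drop stop words, and optionally stem each remaining token."""
    cleaned = re.sub(r'[^\w]', ' ', text).lower()
    tokens = [tok for tok in cleaned.split() if tok not in stop_words]
    if stem is not None:
        tokens = [stem(tok) for tok in tokens]
    return tokens

# A tiny illustrative stop word set; NLTK's English list is much longer.
demo_stop_words = {'is', 'a', 'of', 'the', 'and', 'or', 'can', 'be'}

demo_text = "The learning can be supervised or unsupervised. Learning is fun!"
demo_tokens = prepare(demo_text, demo_stop_words)
demo_counts = collections.Counter(demo_tokens)
print(demo_counts.most_common(3))
```

To include stemming, a stemming function can be passed as `stem`, for example the `stem` method of the `PorterStemmer` object created above. Leaving `stem` unset makes it easy to compare results with and without stemming.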

-----

[[Back to TOC]](#Table-of-Contents)

## Bag of Words

A simple question about text data mining that you might have is _How can we classify documents made up of words when machine learning algorithms work on numerical data?_ The answer is simple. We need to build a numerical summary of a text data set that our algorithms can manipulate. A conceptually easy approach to implement this idea is to identify all possible words in the documents of interest and track the number of times each word occurs in specific documents. This produces a (very) sparse matrix for our sample of documents, where the columns are the possible words (or tokens) and the rows are different documents. 

This concept, where one tokenizes documents to build these sparse matrices, is more formally known as [_bag of words_][bwd], because we effectively create a bag of words out of the documents. In the bag of words model, each document is mapped to a vector in which each element is the number of times the word associated with that column appears in the document.

For example, if we build a *bag of words* from the `text` defined above, we get a matrix with 15 columns, because 15 distinct words (tokens) remain after we clean up `text`. Each word is a column in the matrix, as shown below:

|make|inform|unsupervis|machin|design|techniqu|focus|data|predict|decis|supervis|algorithm|learn|pars|appli|
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |


With the matrix, we can convert another piece of text to a series of numbers that forms a row in the matrix.

For example, adding `'Machine learning has supervised and unsupervised learning'` to the matrix, we will get this:

|make|inform|unsupervis|machin|design|techniqu|focus|data|predict|decis|supervis|algorithm|learn|pars|appli|
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
|0|0|1|1|0|0|0|0|0|0|1|0|2|0|0|

This is a matrix with only one row, which represents the new text. The number in each column is the frequency of the corresponding word in the new text. Most columns have the value 0, meaning those words are not in the new text. This kind of matrix, in which only a small subset of cells have non-zero values, is called a sparse matrix. When we add another piece of text to the matrix, the matrix gains one more row.

When we use a large collection of text to build the bag of words, we can get a matrix with thousands of columns. This matrix is also called a Document Term Matrix (DTM).
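
The one-row example above can be reproduced with a few lines of plain Python. This is a minimal sketch rather than how any library implements it: the helper name `to_row` is hypothetical, the vocabulary is copied from the 15-column table, and `new_tokens` is the assumed result of cleaning, stop word removal, and stemming applied to the new sentence.

```python
import collections

# Vocabulary copied from the 15-column table above (column order preserved).
vocabulary = ['make', 'inform', 'unsupervis', 'machin', 'design', 'techniqu',
              'focus', 'data', 'predict', 'decis', 'supervis', 'algorithm',
              'learn', 'pars', 'appli']

def to_row(tokens, vocabulary):
    """Count each vocabulary word in tokens; other words are ignored."""
    counts = collections.Counter(tokens)
    return [counts[word] for word in vocabulary]

# Assumed cleaned and stemmed tokens for
# 'Machine learning has supervised and unsupervised learning'.
new_tokens = ['machin', 'learn', 'supervis', 'unsupervis', 'learn']

row = to_row(new_tokens, vocabulary)
print(row)
```

Each additional piece of text would produce one more list of the same length, which is exactly one more row of the matrix.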

### CountVectorizer

With scikit-learn, we can use the [`CountVectorizer`][skcv] to break our documents into tokens (in this case words), which are used to construct our _bag of words_ for the given set of documents. Given this tokenizer, we first need to construct the list of tokens, which we do with the `fit` method. Second, we need to transform our documents into the sparse matrix (DTM), which we do with the `transform` method. If both steps use the same input data, there is a convenient method that performs both operations at once, called `fit_transform`.

By default, `CountVectorizer` ignores special characters and converts all letters to lower case. It can also remove stop words when the `stop_words` argument is set. It does not perform stemming by default, however, so you have to handle that yourself if you want to apply stemming.

In the following code, we demonstrate how to build a document term matrix with `CountVectorizer`. We set the argument `stop_words='english'` to filter out English stop words. We use a list of training texts to create the tokens in the bag of words, then transform both the training text set and the testing text set. Both sets contain two pieces of text. In real text analysis, both may contain hundreds or thousands of different texts.

For simplicity, we will not do stemming this time.

-----
[bwd]: https://en.wikipedia.org/wiki/Bag-of-words_model
[skcv]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

#training corpus
training_text = ["""Machine Learning is a technique of parsing data, learn from that data and then 
        apply what is learned to make an informed decision. Machine learning focuses on 
        designing algorithms that can learn from and make predictions on the data. 
        The learning can be supervised or unsupervised.""",
        """A special school is a school catering for students who have special educational needs 
        due to learning difficulties, physical disabilities or behavioral problems. Special 
        schools may be specifically designed, staffed and resourced to provide appropriate 
        special education for children with additional needs.
        """]
#testing corpus
testing_text = ["Machine learning has supervised and unsupervised learning.",
               "Special education is important because children with special needs have equal rights to education."]

# Define our vectorizer
cv = CountVectorizer(stop_words='english')

# Build a vocabulary from our training texts
cv.fit(training_text)

#transform training texts to a Document Term Matrix (DTM)
training_dtm = cv.transform(training_text)

#transform testing texts to a DTM
testing_dtm = cv.transform(testing_text)

#explore the characteristics of the matrix
print(f'Number of Tokens = {training_dtm.shape[1]}')
print('Tokens:')
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out().tolist())
print(100*'-')
print(f'Number of Training Samples = {training_dtm.shape[0]}')
print(f'Number of Testing Samples = {testing_dtm.shape[0]}')

# print out testing DTM
print(100*'-')
print('Testing dtm')
print(testing_dtm.todense())
print(100*'-')


Number of Tokens = 38
Tokens:
['additional', 'algorithms', 'apply', 'appropriate', 'behavioral', 'catering', 'children', 'data', 'decision', 'designed', 'designing', 'difficulties', 'disabilities', 'education', 'educational', 'focuses', 'informed', 'learn', 'learned', 'learning', 'machine', 'make', 'needs', 'parsing', 'physical', 'predictions', 'problems', 'provide', 'resourced', 'school', 'schools', 'special', 'specifically', 'staffed', 'students', 'supervised', 'technique', 'unsupervised']
----------------------------------------------------------------------------------------------------
Number of Training Samples = 2
Number of Testing Samples = 2
----------------------------------------------------------------------------------------------------
Testing dtm
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  0 1]
 [0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 0 0 0
  0 0]]
-----------------------------------------------------------------

In the above code, we fit `CountVectorizer` on `training_text` and then transform `training_text` to get `training_dtm`. Because we fit and transform on the same text, we can combine the two steps into one call to `fit_transform()`, as shown below.

In [11]:
cv = CountVectorizer(stop_words='english')

# Build a vocabulary from our training text and transform training text
training_dtm = cv.fit_transform(training_text)

#transform testing set
testing_dtm = cv.transform(testing_text)


---
In the next Code cells we create two DataFrames to display the training and testing DTMs. This is not necessary for text analysis; it just shows the structure of the two document term matrices.

---

In [12]:
#set option to display all columns of a dataframe
pd.set_option('display.max_columns', None)

#create dataframe for training DTM
training_data = training_dtm.todense().tolist()
training_df = pd.DataFrame(training_data, columns=cv.get_feature_names_out())
print('Training DTM:')
training_df

Training DTM:


Unnamed: 0,additional,algorithms,apply,appropriate,behavioral,catering,children,data,decision,designed,designing,difficulties,disabilities,education,educational,focuses,informed,learn,learned,learning,machine,make,needs,parsing,physical,predictions,problems,provide,resourced,school,schools,special,specifically,staffed,students,supervised,technique,unsupervised
0,0,1,1,0,0,0,0,3,1,0,1,0,0,0,0,1,1,2,1,3,2,2,0,1,0,1,0,0,0,0,0,0,0,0,0,1,1,1
1,1,0,0,1,1,1,1,0,0,1,0,1,1,1,1,0,0,0,0,1,0,0,2,0,1,0,1,1,1,2,1,4,1,1,1,0,0,0


In [13]:
#create dataframe for testing DTM
testing_data = testing_dtm.todense().tolist()
testing_df = pd.DataFrame(testing_data, columns=cv.get_feature_names_out())
print('Testing DTM')
testing_df

Testing DTM


Unnamed: 0,additional,algorithms,apply,appropriate,behavioral,catering,children,data,decision,designed,designing,difficulties,disabilities,education,educational,focuses,informed,learn,learned,learning,machine,make,needs,parsing,physical,predictions,problems,provide,resourced,school,schools,special,specifically,staffed,students,supervised,technique,unsupervised
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1
1,0,0,0,0,0,0,1,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0


---
Now we've converted both the training and testing text sets to a bag of words representation (or document term matrix). Each row in `training_dtm` represents a piece of training text and each row in `testing_dtm` represents a piece of testing text. To understand the DTM, let's look at the second row in the testing DTM, which represents the text `'Special education is important because children with special needs have equal rights to education.'`. The second row shows the word counts of this text: one `'children'`, two `'education'`, one `'needs'` and two `'special'`. Some words in the text, like `'equal'` or `'rights'`, are not captured by the matrix, because they do not appear in the training text from which the bag of words is built. This problem diminishes when we have a large collection of training texts, which produces a matrix with a much more comprehensive set of tokens.
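
The effect of words missing from the vocabulary can be checked with plain Python. This is a rough sketch under stated assumptions: the vocabulary is copied from the token list printed by the fitted vectorizer above, and the tokenization (lower-casing plus runs of word characters) only approximates what `CountVectorizer` does internally.

```python
import collections
import re

# Vocabulary copied from the token list printed by the fitted vectorizer above.
vocabulary = {
    'additional', 'algorithms', 'apply', 'appropriate', 'behavioral',
    'catering', 'children', 'data', 'decision', 'designed', 'designing',
    'difficulties', 'disabilities', 'education', 'educational', 'focuses',
    'informed', 'learn', 'learned', 'learning', 'machine', 'make', 'needs',
    'parsing', 'physical', 'predictions', 'problems', 'provide', 'resourced',
    'school', 'schools', 'special', 'specifically', 'staffed', 'students',
    'supervised', 'technique', 'unsupervised',
}

test_text = ("Special education is important because children with special "
             "needs have equal rights to education.")

# Simplified tokenization: lower-case, then runs of word characters.
tokens = re.findall(r'\w+', test_text.lower())

# Words in the vocabulary are counted; all other words are simply dropped.
known = collections.Counter(tok for tok in tokens if tok in vocabulary)
unknown = sorted({tok for tok in tokens if tok not in vocabulary})

print(dict(known))
print(unknown)
```

The dropped words include stop words, which the vectorizer filters out anyway, alongside informative words such as `'equal'` and `'rights'` that never occurred in the training texts.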

With the numeric representation of texts, we can then apply machine learning algorithms on the dataset for classification tasks. 

Let's say our goal is to find out what topic a new piece of text is about. Assume there are only two topics, *Machine Learning* and *Special Education*. The first text in the training DTM is labeled as Machine Learning and the second as Special Education. We can train a model with the training DTM and labels, then evaluate the model with the testing DTM and labels. With the trained model, we can assign an unseen text to one of the two topics.

-----

<font color='red' size = '5'> Student Exercise </font>

In the preceding Code cells, we used scikit-learn to parse our sample texts by using `CountVectorizer`.

Now that you have run the notebook, go back and make the following changes to see how the results change.

1. Try vectorizing a different message (i.e., transform a different message, or messages) with the already fitted `cv`. How do the results change?

-----

### TF-IDF

Previously, we have simply used counts of tokens. Even with the removal of stop words, however, this can still overemphasize tokens that occur across many documents (e.g., names or general concepts). An alternative technique that often improves text analysis accuracy is to use the frequency of token occurrence, normalized by how frequently the token occurs across all documents. In this manner, we give higher weight in the text analysis process to tokens that are more strongly tied to a particular label (i.e., topic).

Formally, this concept is known as [term frequency–inverse document frequency][tfd] (or tf-idf). scikit-learn provides this functionality via the [`TfidfTransformer`][tftd], which can follow a tokenizer such as `CountVectorizer`, or via the [`TfidfVectorizer`][tfvd], which combines both steps in a single transformer.

In the following code, we will build the DTM with `TfidfVectorizer`. Then we will print out the training DTM. 

In the first row of the training DTM created by `CountVectorizer`, both `'data'` and `'learning'` have a count of 3, which means both words carry the same weight in classification.

In the first row of the training DTM created by `TfidfVectorizer`, `'data'` now has the value 0.489531 and `'learning'` has the value 0.348306. So in this new matrix, `'data'` has more classification power than `'learning'`. The reason is that `'learning'` appears in both training texts, while `'data'` appears in only one of them. So even though both words appear 3 times in the first training text, the "inverse document frequency" part of TF-IDF gives a lower weight to the word that occurs in more documents, and the value of `'data'` ends up greater than the value of `'learning'`. A more intuitive way to explain this: when a text contains the word `'learning'`, it's hard to guess whether the text is about machine learning or special education, because `'learning'` is common to both topics. On the other hand, if a text contains the word `'data'`, it's more likely to be about machine learning than special education.

The details of how TF-IDF values are calculated are beyond the scope of this course.

-----
[tfd]: https://en.wikipedia.org/wiki/Tf–idf

[tftd]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

[tfvd]: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Define our TFIDF vectorizer
tf_cv = TfidfVectorizer(stop_words='english')

# Build a vocabulary from our training text and transform training text
training_dtm_tf = tf_cv.fit_transform(training_text)

#transform testing set
testing_dtm_tf = tf_cv.transform(testing_text)

#create dataframe for training DTM
training_data_tf = training_dtm_tf.todense().tolist()
training_tf_df = pd.DataFrame(training_data_tf, columns=tf_cv.get_feature_names_out())
print('Training TFIDF DTM')
training_tf_df

Training TFIDF DTM


Unnamed: 0,additional,algorithms,apply,appropriate,behavioral,catering,children,data,decision,designed,designing,difficulties,disabilities,education,educational,focuses,informed,learn,learned,learning,machine,make,needs,parsing,physical,predictions,problems,provide,resourced,school,schools,special,specifically,staffed,students,supervised,technique,unsupervised
0,0.0,0.163177,0.163177,0.0,0.0,0.0,0.0,0.489531,0.163177,0.0,0.163177,0.0,0.0,0.0,0.0,0.163177,0.163177,0.326354,0.163177,0.348306,0.326354,0.326354,0.0,0.163177,0.0,0.163177,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.163177,0.163177,0.163177
1,0.153382,0.0,0.0,0.153382,0.153382,0.153382,0.153382,0.0,0.0,0.153382,0.0,0.153382,0.153382,0.153382,0.153382,0.0,0.0,0.0,0.0,0.109132,0.0,0.0,0.306763,0.0,0.153382,0.0,0.153382,0.153382,0.153382,0.306763,0.153382,0.613527,0.153382,0.153382,0.153382,0.0,0.0,0.0
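
For readers curious where the values 0.489531 and 0.348306 in the first row come from, the optional sketch below recomputes them by hand. It assumes the default `TfidfVectorizer` settings, namely a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row. The counts are copied from the first row of the training DTM shown earlier, and the names `doc_counts`, `doc_freq`, and `smooth_idf` are hypothetical.

```python
import math

# Non-zero counts from the first row of the training DTM shown earlier.
doc_counts = {
    'algorithms': 1, 'apply': 1, 'data': 3, 'decision': 1, 'designing': 1,
    'focuses': 1, 'informed': 1, 'learn': 2, 'learned': 1, 'learning': 3,
    'machine': 2, 'make': 2, 'parsing': 1, 'predictions': 1,
    'supervised': 1, 'technique': 1, 'unsupervised': 1,
}
n_docs = 2

# Number of training texts containing each token: only 'learning' is in both.
doc_freq = {token: 1 for token in doc_counts}
doc_freq['learning'] = 2

def smooth_idf(df, n_docs):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

# Weight each count by its idf, then scale the row to unit Euclidean length.
raw = {tok: count * smooth_idf(doc_freq[tok], n_docs)
       for tok, count in doc_counts.items()}
norm = math.sqrt(sum(value ** 2 for value in raw.values()))
tfidf = {tok: value / norm for tok, value in raw.items()}

print(round(tfidf['data'], 6), round(tfidf['learning'], 6))
```

Both words have the same count of 3, so the difference comes entirely from the document frequency: `'learning'` occurs in both training texts and therefore receives the smaller idf weight.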


## Ancillary Information

The following links are to additional documentation that you might find helpful in learning this material. Reading these web-accessible documents is completely optional.

1. Wikipedia article on [Bag of Words][wbow] model
1. Wikipedia article on [Document Term Matrix][wdtm]
1. Gentle Introduction (in Python 2) to text analysis with Python, [part 1][nctap1] and [part 2][nctap2]
1. Kaggle tutorial on [Bag of Words][kbow]

-----

[inlp]: https://blog.monkeylearn.com/the-definitive-guide-to-natural-language-processing/

[wnlp]: https://en.wikipedia.org/wiki/Natural_language_processing
[wbow]: https://en.wikipedia.org/wiki/Bag-of-words_model
[wdtm]: https://en.wikipedia.org/wiki/Document-term_matrix


[nytnlp]: http://www.nytimes.com/2003/10/16/technology/circuits/16mine.html?pagewanted=print
[nltk3]: http://www.nltk.org/book/ch01.html

[nctap1]: http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-1/
[nctap2]: http://nealcaren.web.unc.edu/an-introduction-to-text-analysis-with-python-part-2/

[kbow]: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words

**&copy; 2019: Gies College of Business at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode