# Text Mining

## Overview

Text mining is the application of the techniques we have discussed so far to textual data, with the goal of inferring information from the data. Examples of text mining applications are the analysis of customer reviews to infer their sentiment or the automated grouping of related documents. The problem with analyzing natural language text is that sentences or longer texts are neither numeric nor categorical data. Moreover, there is often some inherent structure in texts, e.g., headlines, introductions, references to other related content, or summaries. When we read text, we automatically identify these internal structures. This is one of the biggest challenges of text mining: finding a good representation of the text such that it can be used for machine learning.

For this, the text has to be *encoded* into numeric or categorical data with as little loss of information as possible. The ideal encoding captures not only the words, but also the meaning of the words in their *context*, the grammatical structure, as well as the broader context of the text, e.g., of sentences within a document. Achieving this is still a subject of ongoing research. However, there have been many advancements in recent years that have made text mining a powerful, versatile, and often sufficiently reliable tool. Since text mining itself is a huge field, we can only scratch the surface of the topic. The goal is that after reading this chapter, you have a good idea of the challenges of text mining, know basic text processing techniques, and also have a general idea of how more advanced text mining works.

We will use the following eight tweets from Donald Trump as an example of textual data to demonstrate how text mining works in general. All data processing steps are done with the goal of preparing the text such that the topic of the tweets can be analyzed.

In [1]:
from textwrap import TextWrapper

tweets_list = ['Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone]',
               'Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you! https://t.co/eQC2NqdIil [Twitter for iPhone]',
               'Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the U.S. Senate, and we need the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter for iPhone]',
               'Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds of thousands of dollars on his opponent because they both support a liberal agenda of higher taxes and wasteful spending! [Twitter for iPhone]',
               'Oct 4, 2018 02:29:27 PM "U.S. Stocks Widen Global Lead" https://t.co/Snhv08ulcO [Twitter for iPhone]',
               'Oct 4, 2018 02:17:28 PM Statement on National Strategy for Counterterrorism: https://t.co/ajFBg9Elsj https://t.co/Qr56ycjMAV [Twitter for iPhone]',
               'Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone]',
               'Oct 4, 2018 09:17:01 AM This is now the 7th. time the FBI has investigated Judge Kavanaugh. If we made it 100, it would still not be good enough for the Obstructionist Democrats. [Twitter for iPhone]']

wrapper = TextWrapper(width=70)
for tweet in tweets_list:
    print('\n'.join(wrapper.wrap(tweet)))
    print()

Oct 4, 2018 08:03:25 PM Beautiful evening in Rochester, Minnesota.
VOTE, VOTE, VOTE! https://t.co/SyxrxvTpZE [Twitter for iPhone]

Oct 4, 2018 07:52:20 PM Thank you Minnesota - I love you!
https://t.co/eQC2NqdIil [Twitter for iPhone]

Oct 4, 2018 05:58:21 PM Just made my second stop in Minnesota for a
MAKE AMERICA GREAT AGAIN rally. We need to elect @KarinHousley to the
U.S. Senate, and we need the strong leadership of @TomEmmer,
@Jason2CD, @JimHagedornMN and @PeteStauber in the U.S. House! [Twitter
for iPhone]

Oct 4, 2018 05:17:48 PM Congressman Bishop is doing a GREAT job! He
helped pass tax reform which lowered taxes for EVERYONE! Nancy Pelosi
is spending hundreds of thousands of dollars on his opponent because
they both support a liberal agenda of higher taxes and wasteful
spending! [Twitter for iPhone]

Oct 4, 2018 02:29:27 PM "U.S. Stocks Widen Global Lead"
https://t.co/Snhv08ulcO [Twitter for iPhone]

Oct 4, 2018 02:17:28 PM Statement on National Strategy for
Counterterrorism: 

## Preprocessing

Through preprocessing, text is transformed into a representation that we can use for machine learning algorithms, e.g., for classification or for grouping through clustering.

### Creation of a Corpus

The first preprocessing step is to create a *corpus* of *documents*. In the sense of the terminology we have used so far, the documents are the objects that we want to reason about, and the corpus is a collection of objects. In our Twitter example, the corpus is a collection of tweets, and each tweet is a document. In our case, we already have a list of tweets, which is the same as a corpus of documents. In other use cases, this can be more difficult. For example, if you crawl the internet to collect reviews for a product, it is likely that you find multiple reviews on the same website. In this case, you must extract the reviews into separate documents, which can be challenging.
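
As a minimal sketch of this extraction step, assume that a crawled page arrives as a single string in which the reviews are separated by a line of dashes. The page content, the separator, and the helper name `split_reviews` are made up for illustration; real pages usually require an HTML parser and site-specific rules.

```python
def split_reviews(page_text, separator='-----'):
    """Split one crawled page into a corpus with one document per review."""
    parts = page_text.split(separator)
    # strip surrounding whitespace and drop empty fragments
    return [part.strip() for part in parts if part.strip()]


# hypothetical page content with three reviews
page = """Great phone, the battery lasts two days.
-----
Stopped working after a week.
-----
Decent value for the price."""

corpus = split_reviews(page)
for document in corpus:
    print(document)
```

Each entry of `corpus` is then one document, in the same way that each tweet is one document in our example.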

### Relevant Content

Textual data, especially text that was automatically collected from the Internet, often contains content that is irrelevant for a given use case. For example, if we only want to analyze the topic of tweets, the timestamps are irrelevant. It also does not matter if a tweet was sent with an iPhone or a different application. Links are a tricky case, as they may contain relevant information, but are also often irrelevant. For example, the URL of this page contains relevant information, e.g., the author, the general topic, and the name of the current chapter. Other parts, like the `http`, are irrelevant. Other links are completely irrelevant, e.g., if link shorteners are used. In this case, a link is just a random string.
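
The following sketch illustrates the difference between the two kinds of links with `urlparse` from the Python standard library. The first URL is hypothetical and only stands in for a descriptive link; the second is one of the shortened links from the tweets. Which parts of a URL are relevant remains a judgment call for the use case at hand.

```python
from urllib.parse import urlparse

# hypothetical descriptive URL, used only for illustration
descriptive_url = 'https://example.com/jane-doe/data-science-book/chapter_10_text_mining.html'
# shortened link taken from the first tweet
short_url = 'https://t.co/SyxrxvTpZE'


def path_segments(url):
    """Return the non-empty path segments of a URL."""
    parsed = urlparse(url)
    return [segment for segment in parsed.path.split('/') if segment]


# the scheme is the same for both links and carries no topical information
print(urlparse(descriptive_url).scheme)
# the path of the descriptive link contains readable words
print(path_segments(descriptive_url))
# the path of the shortened link is only a random string
print(path_segments(short_url))
```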

When we strip the irrelevant content from the tweets, we get the following.

In [2]:
import re

tweets_relevant_content = []
for tweet in tweets_list:
    # remove the first 24 chars, because they are the time stamp
    # remove everything after last [ because this is the source of the tweet
    modified_tweet = tweet[24:tweet.rfind('[')]
    # drop links
    modified_tweet = re.sub(r'http\S+', '', modified_tweet).strip()
    tweets_relevant_content.append(modified_tweet)

for tweet in tweets_relevant_content:
    print('\n'.join(wrapper.wrap(tweet)))
    print()

Beautiful evening in Rochester, Minnesota. VOTE, VOTE, VOTE!

Thank you Minnesota - I love you!

Just made my second stop in Minnesota for a MAKE AMERICA GREAT AGAIN
rally. We need to elect @KarinHousley to the U.S. Senate, and we need
the strong leadership of @TomEmmer, @Jason2CD, @JimHagedornMN and
@PeteStauber in the U.S. House!

Congressman Bishop is doing a GREAT job! He helped pass tax reform
which lowered taxes for EVERYONE! Nancy Pelosi is spending hundreds of
thousands of dollars on his opponent because they both support a
liberal agenda of higher taxes and wasteful spending!

"U.S. Stocks Widen Global Lead"

Statement on National Strategy for Counterterrorism:

Working hard, thank you!

This is now the 7th. time the FBI has investigated Judge Kavanaugh. If
we made it 100, it would still not be good enough for the
Obstructionist Democrats.



What is relevant and irrelevant can also depend on the context. For example, a different use case for Twitter data would be to analyze if there are differences between tweets from different sources. In this case, the source cannot be dropped, but would be needed to divide the tweets by their source. Another analysis of Twitter data may want to consider how the content of tweets evolves over time. In this case, the timestamps cannot just be dropped. Therefore, every text mining application should carefully consider what is relevant and tailor the contents of the text to the specific needs.
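
For such use cases, the timestamp and the source can be kept as separate fields instead of being discarded. The following is a minimal sketch under the assumption that every raw tweet has the same layout as in our example, i.e., a 24-character timestamp at the start and the source in square brackets at the end. The helper name `split_tweet` and the dictionary layout are our own choices for illustration.

```python
from datetime import datetime


def split_tweet(raw_tweet):
    """Split a raw tweet string into timestamp, text, and source."""
    # the first 24 characters hold the timestamp, e.g., 'Oct 4, 2018 08:03:25 PM '
    # note: %b and %p depend on the locale, an English locale is assumed here
    timestamp = datetime.strptime(raw_tweet[:24].strip(), '%b %d, %Y %I:%M:%S %p')
    # the source is enclosed in the last pair of square brackets
    source_start = raw_tweet.rfind('[')
    source = raw_tweet[source_start:].strip().strip('[]')
    # everything in between is the text of the tweet (links are kept here)
    text = raw_tweet[24:source_start].strip()
    return {'timestamp': timestamp, 'text': text, 'source': source}


raw = 'Oct 4, 2018 12:38:08 PM Working hard, thank you! https://t.co/6HQVaEXH0I [Twitter for iPhone]'
record = split_tweet(raw)
print(record['timestamp'])
print(record['source'])
print(record['text'])
```

With records like this, the tweets can be grouped by `source` or sorted by `timestamp`, while the `text` field is preprocessed as before.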

### Punctuation and Cases

When we are only interested in the topic of documents, the punctuation, as well as the cases of the letters, are often not useful and introduce unwanted differences between the same words. Acronyms are a relevant corner case when dropping punctuation and cases. The acronym `U.S.` from the tweets is a perfect example of this, because it becomes `us`, which has a completely different meaning. If you are aware that there may be such problems within your data, you can manually address them, e.g., by mapping `US` to `usa` after dropping the punctuation, but before lower casing the string.

In [9]:
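import re
import string
tweets_lowercase = []

for tweet in tweets_relevant_content:
    # drop all punctuation, which turns U.S. into US
    modified_tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    # map the standalone token US to usa before lower casing
    # the word boundaries prevent replacements inside other upper case words
    modified_tweet = re.sub(r'\bUS\b', 'usa', modified_tweet)
    modified_tweet = modified_tweet.lower()
    tweets_lowercase.append(modified_tweet)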
    
for tweet in tweets_lowercase:
    print('\n'.join(wrapper.wrap(tweet)))
    print()

beautiful evening in rochester minnesota vote vote vote

thank you minnesota  i love you

just made my second stop in minnesota for a make america great again
rally we need to elect karinhousley to the usa senate and we need the
strong leadership of tomemmer jason2cd jimhagedornmn and petestauber
in the usa house

congressman bishop is doing a great job he helped pass tax reform
which lowered taxes for everyone nancy pelosi is spending hundreds of
thousands of dollars on his opponent because they both support a
liberal agenda of higher taxes and wasteful spending

usa stocks widen global lead

statement on national strategy for counterterrorism

working hard thank you

this is now the 7th time the fbi has investigated judge kavanaugh if
we made it 100 it would still not be good enough for the
obstructionist democrats
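
If several acronyms are known in advance, an alternative is to replace them before the punctuation is dropped, so that an emphasized `US` (the pronoun) is not confused with `U.S.`. The following is only a sketch of this variant and differs from the order used above. The mapping `ACRONYMS` and the function name `normalize` are hypothetical. The sketch also collapses repeated whitespace, which otherwise remains where a punctuation character was surrounded by blanks.

```python
import string

# hypothetical mapping of known acronyms to unambiguous tokens
ACRONYMS = {'U.S.': 'usa', 'U.K.': 'uk'}


def normalize(text, acronyms=ACRONYMS):
    """Replace known acronyms, then drop punctuation and lower case the text."""
    # replace the acronyms while the periods are still present
    for acronym, replacement in acronyms.items():
        text = text.replace(acronym, replacement)
    # drop the remaining punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # lower case and collapse repeated whitespace
    return ' '.join(text.lower().split())


print(normalize('"U.S. Stocks Widen Global Lead"'))
print(normalize('VOTE for US in the U.S. House!'))
print(normalize('Thank you Minnesota - I love you!'))
```

The drawback of this variant is that every spelling of an acronym, e.g., `US` without periods, must be listed explicitly in the mapping.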



### Stop Words


### Stemming and Lemmatization

### Bag-of-Words

### Inverse Document Frequency


### Beyond the Bag-of-Words

## Challenges

### Dimensionality

### Ambiguities

### Syntax and Semantics

### Parsing