## 6.0 Introduction

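Unstructured text data, such as the contents of a book or a tweet, is both one of the most interesting sources of features and one of the most complex to handle. This chapter covers strategies for transforming text into information-rich features. The coverage here is not comprehensive: entire academic disciplines focus on handling this and similar types of data, and the contents of all their techniques would fill a small library. Even so, there are some commonly used techniques, and knowing them adds valuable tools to our preprocessing toolbox.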

## 6.1 Cleaning Text

**Problem**\
You have some unstructured text data and want to complete some basic cleaning.

**Solution**\
Most basic text cleaning needs only Python’s core string operations, in particular `strip`, `replace`, and `split`:

In [2]:
text_data = ["  Interrobang. By aishwarya Henriette    ",
            "Parking And Going. By Karl Gautier",
            "    Today Is The night. By Jarek Prakash"]

# strip whitespaces
strip_whitespace = [string.strip() for string in text_data]
strip_whitespace 

['Interrobang. By aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']

In [3]:
remove_periods = [string.replace(".", "") for string in strip_whitespace]
remove_periods

['Interrobang By aishwarya Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']

In [4]:
def capitalizer(string: str) -> str:
    return string.upper()

[capitalizer(string) for string in remove_periods]

['INTERROBANG BY AISHWARYA HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']

In [5]:
import re

def replace_letters_with_X(string: str) -> str:
    return re.sub(r"[a-zA-Z]", "X", string)

[replace_letters_with_X(string) for string in remove_periods]

['XXXXXXXXXXX XX XXXXXXXXX XXXXXXXXX',
 'XXXXXXX XXX XXXXX XX XXXX XXXXXXX',
 'XXXXX XX XXX XXXXX XX XXXXX XXXXXXX']

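### Discussion

Most text data needs to be cleaned before we can use it to build features. Most basic text cleaning can be done with Python's standard string operations. In the real world, we will most likely define a custom cleaning function (e.g., `capitalizer`) and apply it to our text data.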
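
As a minimal sketch of that idea, the individual steps from the solution can be combined into one function and applied in a single pass. The name `clean_text` is hypothetical and chosen only for this illustration:

```python
def clean_text(string: str) -> str:
    """Strip whitespace, remove periods, and uppercase a string."""
    string = string.strip()            # remove leading and trailing whitespace
    string = string.replace(".", "")   # remove periods
    return string.upper()              # uppercase every letter

text_data = ["  Interrobang. By aishwarya Henriette    ",
             "Parking And Going. By Karl Gautier",
             "    Today Is The night. By Jarek Prakash"]

# apply the combined cleaning function to every string
cleaned = [clean_text(string) for string in text_data]
print(cleaned)
```

Keeping the steps in one function makes it easy to reuse exactly the same cleaning on training data and on new data.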

### See Also
* Beginners Tutorial for Regular Expressions in Python (https://www.analyticsvidhya.com/blog/2015/06/regular-expression-python/)

## 6.2 Parsing and Cleaning HTML

**Problem**\
You have text data with HTML elements and want to extract just the text.

**Solution**\
Use Beautiful Soup’s extensive set of options to parse and extract from HTML:

In [6]:
from bs4 import BeautifulSoup

html = """
    <div class='full_name'><span style='font-weight:bold'>Yan</span> Chin</div>
"""

# parse the HTML; naming the parser explicitly avoids a Beautiful Soup warning
soup = BeautifulSoup(html, "html.parser")

soup.find("div", {"class": "full_name"}).text

'Yan Chin'

### Discussion

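Despite the strange name, Beautiful Soup is a powerful Python library designed for scraping HTML. It is typically used to scrape live websites, but we can just as easily use it to extract text data embedded in HTML.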

### See Also
* Beautiful Soup documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

## 6.3 Removing Punctuation

**Problem**\
You have a feature of text data and want to remove punctuation.

**Solution**\
Use `translate` with a dictionary of punctuation characters:


In [7]:
import unicodedata
import sys

text_data = ['Hi!!! I. Love. This. Song.....', '10000% Agree!!!! #LoveIT', 'Right?!?!']

# create a dictionary of punctuation characters
punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))

# for each string, remove any punctuation characters
[string.translate(punctuation) for string in text_data]

['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
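
The solution works because `translate` deletes every character whose code point maps to `None`, and `dict.fromkeys` builds exactly such a mapping for all Unicode punctuation. As a rough sketch of a lighter-weight variant, assuming the text contains only ASCII punctuation, the standard library's `string.punctuation` can be used instead. Note that this set is not identical to the Unicode `P` categories: it also contains symbols such as `$` and `+`, and it does not cover non-ASCII punctuation such as curly quotes.

```python
import string

text_data = ['Hi!!! I. Love. This. Song.....', '10000% Agree!!!! #LoveIT', 'Right?!?!']

# the third argument of str.maketrans lists characters to map to None (delete)
ascii_punctuation = str.maketrans("", "", string.punctuation)

# remove ASCII punctuation from each string
no_punctuation = [text.translate(ascii_punctuation) for text in text_data]
print(no_punctuation)

# non-ASCII punctuation (here, curly quotes) is left untouched by this table
curly = "“Quoted”".translate(ascii_punctuation)
print(curly)
```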

## 6.4 Tokenizing Text

**Problem**\
You have text and want to break it up into individual words.

**Solution**\
Natural Language Toolkit for Python (NLTK) has a powerful set of text manipulation operations, including word tokenizing:

In [8]:
from nltk.tokenize import word_tokenize
string = "The science of today is the technology of tomorrow"

# tokenize words
word_tokenize(string)

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

In [9]:
from nltk.tokenize import sent_tokenize
string = "The science of today is the technology of tomorrow. Tomorrow is today"

# tokenize sentences
sent_tokenize(string)

['The science of today is the technology of tomorrow.', 'Tomorrow is today']

## 6.5 Removing Stop Words

**Problem**\
Given tokenized text data, you want to remove extremely common words (e.g., a, is, of, on) that contain little informational value.

**Solution**\
Use NLTK’s stopwords:

In [10]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

tokenized_words = ['i', 'am', 'going', 'to', 'go', 'to', 'the', 'store', 'and', 'park']

stop_words = stopwords.words('english')

# remove stop words
[word for word in tokenized_words if word not in stop_words]

['going', 'go', 'store', 'park']
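
One detail worth illustrating: NLTK's English stop word list is all lowercase, so tokens should be lowercased before the comparison. The sketch below uses a tiny hand-picked set as a stand-in for `stopwords.words('english')` so that it runs without downloading the corpus. Both the stand-in set and the mixed-case tokens are assumptions made for the illustration.

```python
# hypothetical stand-in for stopwords.words('english'), which is all lowercase
stop_words = {"i", "am", "to", "the", "and"}

tokenized_words = ['I', 'am', 'going', 'to', 'The', 'store']

# a case-sensitive comparison misses capitalized stop words
naive = [word for word in tokenized_words if word not in stop_words]

# lowercasing each token before the lookup catches them
normalized = [word for word in tokenized_words if word.lower() not in stop_words]

print(naive)
print(normalized)
```

Using a `set` rather than a list also makes each membership test constant time, which matters on large corpora.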

## 6.6 Stemming Words

In [11]:
from nltk.stem.porter import PorterStemmer

tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']

# create stemmer
porter = PorterStemmer()

# apply stemmer
[porter.stem(word) for word in tokenized_words]

['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']

### See Also
* Porter Stemming Algorithm (https://tartarus.org/martin/PorterStemmer/)

## 6.7 Tagging Part of Speech

In [28]:
from nltk import pos_tag
from nltk import word_tokenize
import nltk
nltk.download('averaged_perceptron_tagger')

text_data = "Chris loved outdoor running"

text_tagged = pos_tag(word_tokenize(text_data))

text_tagged

NLTK uses the Penn Treebank part-of-speech tags. Some examples:

| Tag | Parts of Speech |
|---  |-----------------|
|NNP| Proper noun, singular|
|NN| Noun, singular or mass|
|RB| Adverb|
|VBD| Verb, past tense|
|VBG| Verb, gerund or present participle|
|JJ| Adjective|
|PRP| Personal pronoun|

Once the text has been tagged, we can use the tags to find certain parts of speech. For example, here are all nouns:

In [42]:
[word for word, tag in text_tagged if tag in ['NN', 'NNS', 'NNP', 'NNPS']]

['Chris']

In [46]:
from sklearn.preprocessing import MultiLabelBinarizer

tweets = ["I am eating a burrito for breakfast",
         "Political science is an amazing field",
         "San Francisco is an awesome city"]

tagged_tweets = []

# tag each word in each tweet, keeping only the tags
for tweet in tweets:
    tweet_tag = nltk.pos_tag(word_tokenize(tweet))
    tagged_tweets.append([tag for word, tag in tweet_tag])

# use one hot encoding to convert the tags into features
one_hot_multi = MultiLabelBinarizer()
one_hot_multi.fit_transform(tagged_tweets)

array([[1, 1, 0, 1, 0, 1, 1, 1, 0],
       [1, 0, 1, 1, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 0, 0, 0, 1]])

In [47]:
# show feature names
one_hot_multi.classes_

array(['DT', 'IN', 'JJ', 'NN', 'NNP', 'PRP', 'VBG', 'VBP', 'VBZ'],
      dtype=object)

In [49]:
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
import nltk
nltk.download('brown')
    
# get some tagged sentences from the Brown corpus
sentences = brown.tagged_sents(categories='news')

# split into 4000 sentences for training and 623 for testing
train = sentences[:4000]
test = sentences[4000:]

# create backoff tagger
unigram = UnigramTagger(train)
bigram = BigramTagger(train, backoff=unigram)
trigram = TrigramTagger(train, backoff=bigram)

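# show accuracy on the test sentences
# (newer NLTK releases deprecate evaluate() in favor of accuracy())
trigram.evaluate(test)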

0.8174734002697437

### See Also
* Alphabetical list of part-of-speech tags used in the Penn Treebank Project (https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

## 6.8 Encoding Text as a Bag of Words

In [17]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

text_data = np.array(['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both'])

count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)

bag_of_words

<3x8 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [18]:
bag_of_words.toarray()

array([[0, 0, 0, 2, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 0]], dtype=int64)

In [19]:
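# show feature names
# (scikit-learn 1.0 added get_feature_names_out(); get_feature_names() was removed in 1.2)
count.get_feature_names()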

['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love', 'sweden']
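
To make concrete what the count matrix holds, here is a rough pure-Python sketch that reproduces the numbers above. It is not how scikit-learn is implemented. It assumes `CountVectorizer`'s default settings, which lowercase the text and keep only tokens of two or more word characters (so `I` is dropped), and the helper name `tokenize` is hypothetical.

```python
import re
from collections import Counter

text_data = ['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both']

def tokenize(document):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", document.lower())

# the vocabulary is the sorted set of all tokens seen in the corpus
vocabulary = sorted({token for doc in text_data for token in tokenize(doc)})

# one row per document, one column per vocabulary word
counts = [Counter(tokenize(doc)) for doc in text_data]
bag_of_words = [[count[term] for term in vocabulary] for count in counts]

print(vocabulary)
for row in bag_of_words:
    print(row)
```

Each column corresponds to one word of the vocabulary, which is why the matrix has 8 columns and why `brazil` has a count of 2 in the first row.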

In [20]:
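# ngram_range=(1, 2) builds features from single words and two-word combinations;
# vocabulary=['brazil'] restricts the features to the listed terms
count_2gram = CountVectorizer(ngram_range=(1,2), stop_words='english', vocabulary=['brazil'])
bag = count_2gram.fit_transform(text_data)
bag.toarray()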

array([[2],
       [0],
       [0]], dtype=int64)

In [21]:
count_2gram.vocabulary_

{'brazil': 0}

### See Also
* n-gram (https://en.wikipedia.org/wiki/N-gram)
* Bag of Words Meets Bags of Popcorn (https://www.kaggle.com/c/word2vec-nlp-tutorial)

## 6.9 Weighting Word Importance

In [22]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = np.array(['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both'])

# create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)

feature_matrix

<3x8 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [23]:
feature_matrix.toarray()

array([[0.        , 0.        , 0.        , 0.89442719, 0.        ,
        0.        , 0.4472136 , 0.        ],
       [0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.57735027],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027,
        0.        , 0.        , 0.        ]])

In [57]:
tfidf.vocabulary_

{'love': 6,
 'brazil': 3,
 'sweden': 7,
 'is': 5,
 'best': 1,
 'germany': 4,
 'beats': 0,
 'both': 2}

$$
\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)
$$

where $t$ is a word (term) and $d$ is a document. The inverse document frequency is

$$
\text{idf}(t) = \log\frac{1 + n_d}{1 + \text{df}(d, t)} + 1
$$

where $n_d$ is the number of documents and $\text{df}(d, t)$ is the document frequency of term $t$ (i.e., the number of documents in which the term appears).

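A common textbook variant defines TF as the number of times a word appears in a document divided by the total number of words in that document, and IDF as log(number of documents in the corpus / (number of documents containing the word + 1)). Either way, the more documents a word appears in, the less weight it receives. In the end, each document is represented as a vector whose length is the number of distinct words in the corpus.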
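
The formulas can be checked by hand against the feature matrix shown earlier. The sketch below is a pure-Python approximation, not scikit-learn's implementation. It assumes `TfidfVectorizer`'s defaults: raw counts for tf, the smoothed idf formula given above with the natural logarithm, and L2 normalization of each document vector. The helper names `tokenize` and `tfidf_row` are hypothetical.

```python
import math
import re
from collections import Counter

text_data = ['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both']

def tokenize(document):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", document.lower())

tokenized = [tokenize(doc) for doc in text_data]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_d = len(text_data)

# df: number of documents in which each term appears
df = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# idf(t) = log((1 + n_d) / (1 + df(t))) + 1
idf = {term: math.log((1 + n_d) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(tokens):
    """Return the L2-normalized tf-idf vector for one tokenized document."""
    tf = Counter(tokens)
    weights = [tf[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(weight * weight for weight in weights))
    return [weight / norm for weight in weights]

feature_matrix = [tfidf_row(tokens) for tokens in tokenized]
for row in feature_matrix:
    print([round(weight, 8) for weight in row])
```

Every term here appears in exactly one document, so all idf values are equal and the normalized weights depend only on the counts: `brazil` (count 2) and `love` (count 1) become 2/√5 ≈ 0.894 and 1/√5 ≈ 0.447, while the three words in each of the other documents all become 1/√3 ≈ 0.577.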

### See Also
* scikit-learn documentation: tf-idf term weighting (http://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting)