## 6.1 Cleaning Text
### Problem
You have some unstructured text data and want to complete some basic cleaning.
### Solution
Most basic text cleaning needs nothing more than Python’s core string operations, in particular strip, replace, and split:


In [1]:
# Create text
text_data = [" Interrobang. By Aishwarya Henriette ",
 "Parking And Going. By Karl Gautier",
 " Today Is The night. By Jarek Prakash "]
# Strip leading and trailing whitespace
strip_whitespace = [string.strip() for string in text_data]

In [2]:
# Show text
strip_whitespace

['Interrobang. By Aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']

In [3]:
# Remove periods
remove_periods = [string.replace(".", "") for string in strip_whitespace]
# Show text
remove_periods


['Interrobang By Aishwarya Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']

In [4]:
# Create function
def capitalizer(string: str) -> str:
    return string.upper()

In [5]:
# Apply function
[capitalizer(string) for string in remove_periods]

['INTERROBANG BY AISHWARYA HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']

In [6]:
# Import library
import re
# Create function
def replace_letters_with_X(string: str) -> str:
    return re.sub(r"[a-zA-Z]", "X", string)
# Apply function
[replace_letters_with_X(string) for string in remove_periods]

['XXXXXXXXXXX XX XXXXXXXXX XXXXXXXXX',
 'XXXXXXX XXX XXXXX XX XXXX XXXXXXX',
 'XXXXX XX XXX XXXXX XX XXXXX XXXXXXX']
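The Solution names split alongside strip and replace, but the cells above never call it. A minimal sketch of how the three might be chained follows. The helper name `clean_title` and the ". By " separator are assumptions chosen to fit this example data, not part of the recipe:

```python
def clean_title(raw: str) -> tuple[str, str]:
    """Return (title, author) from a string shaped like ' Title. By Author '."""
    # Strip surrounding whitespace, then split once on the assumed separator
    title, author = raw.strip().split(". By ", maxsplit=1)
    # Remove any periods left inside the title
    return title.replace(".", ""), author

text_data = [" Interrobang. By Aishwarya Henriette ",
             "Parking And Going. By Karl Gautier",
             " Today Is The night. By Jarek Prakash "]

# Apply the helper to every string
cleaned = [clean_title(string) for string in text_data]
```

A string without the separator would raise a ValueError on unpacking, so real data would likely need a fallback branch.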

## 6.2 Parsing and Cleaning HTML
### Problem
You have text data with HTML elements and want to extract just the text.
### Solution
Use Beautiful Soup’s extensive set of options to parse and extract from HTML:

In [7]:
# Load library
from bs4 import BeautifulSoup
# Create some HTML code
html = """
       <div class='full_name'><span style='font-weight:bold'>Masego</span> Azra</div>
       """
# Parse html
soup = BeautifulSoup(html, "lxml")
# Find the div with the class "full_name", show text
soup.find("div", { "class" : "full_name" }).text

'Masego Azra'
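If Beautiful Soup or lxml is not installed, the standard library's `html.parser` module can cover the simplest case. This is a swapped-in technique rather than the recipe's method, and the names `TextExtractor` and `extract_text` are hypothetical. Unlike `soup.find`, this sketch collects all text in the document instead of targeting one element:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect every piece of text that appears between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags
        self.chunks.append(data)

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces and trim the whitespace around the markup
    return "".join(parser.chunks).strip()

html = """
       <div class='full_name'><span style='font-weight:bold'>Masego</span> Azra</div>
       """
full_name = extract_text(html)
```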

## 6.3 Removing Punctuation
### Problem
You have a feature of text data and want to remove punctuation.
### Solution
Use translate with a dictionary that maps every punctuation character to None:

In [8]:
# Load libraries
import unicodedata
import sys
# Create text
text_data = ['Hi!!!! I. Love. This. Song....',
 '10000% Agree!!!! #LoveIT',
 'Right?!?!']
# Create a dictionary of punctuation characters
punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
# For each string, remove any punctuation characters
[string.translate(punctuation) for string in text_data]

['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
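When the text is known to contain only ASCII punctuation, a smaller translation table may be enough. The sketch below assumes that restriction and builds the table from `string.punctuation` with `str.maketrans`. Note that `string.punctuation` also contains characters such as `$` and `+`, which Unicode classifies as symbols rather than punctuation, and that non-ASCII punctuation is left in place:

```python
import string

# Translation table that deletes every ASCII punctuation character
ascii_punctuation = str.maketrans("", "", string.punctuation)

text_data = ['Hi!!!! I. Love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right?!?!']

# For each string, remove the ASCII punctuation characters
cleaned = [s.translate(ascii_punctuation) for s in text_data]

# Non-ASCII punctuation such as the inverted question mark survives
spanish = "¿Qué?".translate(ascii_punctuation)
```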

## 6.4 Tokenizing Text
### Problem
You have text and want to break it up into individual words.
### Solution
The Natural Language Toolkit for Python (NLTK) has a powerful set of text manipulation
operations, including word and sentence tokenizing:

In [9]:
# Load library
from nltk.tokenize import word_tokenize
# Create text
string = "The science of today is the technology of tomorrow"
# Tokenize words
word_tokenize(string)

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

In [10]:
# Load library
from nltk.tokenize import sent_tokenize
# Create text
string = "The science of today is the technology of tomorrow. Tomorrow is today."
# Tokenize sentences
sent_tokenize(string)

['The science of today is the technology of tomorrow.', 'Tomorrow is today.']
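NLTK's tokenizers depend on tokenizer models that have to be downloaded separately. For a quick experiment without them, a regular expression gives a rough approximation. This is not NLTK's algorithm, and the function name `simple_word_tokenize` is hypothetical. It treats each run of word characters as a token and each remaining non-space character as its own token, so contractions and abbreviations are split differently than `word_tokenize` would split them:

```python
import re

def simple_word_tokenize(text: str) -> list[str]:
    # Runs of word characters, or any single character that is
    # neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

words = simple_word_tokenize("The science of today is the technology of tomorrow")
with_period = simple_word_tokenize("Tomorrow is today.")
```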

## 6.5 Removing Stop Words
### Problem
Given tokenized text data, you want to remove extremely common words (e.g., a, is, of, on) that contain little informational value.
### Solution
Use NLTK’s stopwords:

In [11]:
# Load library
from nltk.corpus import stopwords

In [12]:
# Create word tokens
tokenized_words = ['i',
                   'am',
                   'going',
                   'to',
                   'go',
                   'to',
                   'the',
                   'store',
                   'and',
                   'park']
# Load stop words
stop_words = stopwords.words('english')
# Remove stop words
[word for word in tokenized_words if word not in stop_words]

['going', 'go', 'store', 'park']

Note that NLTK’s stopwords assumes the tokenized words are all lowercased.
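The lowercase assumption matters in practice: a capitalized stop word will not match the list. The sketch below illustrates this with a tiny hand-written stop list, which is a stand-in for illustration only and is much shorter than NLTK's English list:

```python
# Hypothetical mini stop list; NLTK's English list is much longer
stop_words = {"i", "am", "to", "the", "and"}

tokenized_words = ["I", "am", "going", "to", "The", "store"]

# Comparing the raw tokens lets capitalized stop words slip through
without_lowercasing = [w for w in tokenized_words if w not in stop_words]

# Lowercasing each token before the comparison catches them
with_lowercasing = [w for w in tokenized_words if w.lower() not in stop_words]
```

Storing the stop words in a set rather than a list also makes each membership test faster on long documents.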

## 6.6 Stemming Words
### Problem
You have tokenized words and want to convert them into their root forms.
### Solution
Use NLTK’s PorterStemmer:

In [13]:
# Load library
from nltk.stem.porter import PorterStemmer
# Create word tokens
tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']
# Create stemmer
porter = PorterStemmer()
# Apply stemmer
[porter.stem(word) for word in tokenized_words]

['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']

## 6.7 Tagging Parts of Speech
### Problem
You have text data and want to tag each word or character with its part of speech.
### Solution
Use NLTK’s pre-trained parts-of-speech tagger:

In [14]:
# Load libraries
from nltk import pos_tag
from nltk import word_tokenize
# Create text
text_data = "Chris loved outdoor running"
# Use pre-trained part of speech tagger
text_tagged = pos_tag(word_tokenize(text_data))
# Show parts of speech
text_tagged

[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

| Tag | Part of speech |
| --- | --- |
| `NNP` | Proper noun, singular |
| `NN` | Noun, singular or mass |
| `RB` | Adverb |
| `VBD` | Verb, past tense |
| `VBG` | Verb, gerund or present participle |
| `JJ` | Adjective |
| `PRP` | Personal pronoun |
In [15]:
# Filter words
[word for word, tag in text_tagged if tag in ['NN','NNS','NNP','NNPS'] ]


['Chris']
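Because related tags share a prefix (all of the noun tags in the filter above start with `NN`), the same filter can be written with `str.startswith`. A minimal sketch, reusing the tagged output shown earlier as literal data so that it runs without NLTK; the helper name `words_with_tag_prefix` is hypothetical:

```python
# Tagged output copied from the cell above
text_tagged = [('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

def words_with_tag_prefix(tagged, prefix):
    """Keep the words whose tag starts with the given prefix."""
    return [word for word, tag in tagged if tag.startswith(prefix)]

nouns = words_with_tag_prefix(text_tagged, "NN")
verbs = words_with_tag_prefix(text_tagged, "VB")
```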

In [16]:
import nltk

# Create text
tweets = ["I am eating a burrito for breakfast",
          "Political science is an amazing field",
          "San Francisco is an awesome city"]
# Create list
tagged_tweets = []
# Tag each word and each tweet
for tweet in tweets:
    tweet_tag = nltk.pos_tag(word_tokenize(tweet))
    tagged_tweets.append([tag for word, tag in tweet_tag])

In [17]:
tagged_tweets

[['PRP', 'VBP', 'VBG', 'DT', 'NN', 'IN', 'NN'],
 ['JJ', 'NN', 'VBZ', 'DT', 'JJ', 'NN'],
 ['NNP', 'NNP', 'VBZ', 'DT', 'JJ', 'NN']]

In [18]:
from sklearn.preprocessing import MultiLabelBinarizer
# Use one-hot encoding to convert the tags into features
one_hot_multi = MultiLabelBinarizer()
one_hot_multi.fit_transform(tagged_tweets)

array([[1, 1, 0, 1, 0, 1, 1, 1, 0],
       [1, 0, 1, 1, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 0, 0, 0, 1]])

In [19]:
# Show feature names
one_hot_multi.classes_

array(['DT', 'IN', 'JJ', 'NN', 'NNP', 'PRP', 'VBG', 'VBP', 'VBZ'],
      dtype=object)
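To see what the binarizer is doing, the same indicator matrix can be rebuilt in plain Python. This is only an illustrative sketch of the idea, not scikit-learn's implementation: each column is one tag from the sorted set of all tags, and each cell records whether the tweet contains that tag at least once:

```python
# Tag sequences copied from the output above
tagged_tweets = [['PRP', 'VBP', 'VBG', 'DT', 'NN', 'IN', 'NN'],
                 ['JJ', 'NN', 'VBZ', 'DT', 'JJ', 'NN'],
                 ['NNP', 'NNP', 'VBZ', 'DT', 'JJ', 'NN']]

# One column per distinct tag, in sorted order
classes = sorted({tag for tweet in tagged_tweets for tag in tweet})

# 1 if the tweet contains the tag, 0 otherwise (repeats are not counted)
indicator = [[int(tag in tweet) for tag in classes] for tweet in tagged_tweets]
```

Note that a tag appearing twice in a tweet, such as `NN` in the first one, still produces a single 1.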

In [20]:
# Train your own tagger
# Load library
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
# Get some text from the Brown Corpus, broken into sentences
sentences = brown.tagged_sents(categories='news')
# Split into 4000 sentences for training and 623 for testing
train = sentences[:4000]
test = sentences[4000:]
# Create backoff tagger
unigram = UnigramTagger(train)
bigram = BigramTagger(train, backoff=unigram)
trigram = TrigramTagger(train, backoff=bigram)
# Show accuracy (newer NLTK releases deprecate evaluate() in favor of accuracy())
trigram.evaluate(test)

0.8174734002697437
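The backoff idea can be sketched in plain Python on toy data. This is a simplified illustration under stated assumptions, not NLTK's implementation: it stops at two levels (bigram, then unigram), the function names `train` and `tag_with_backoff` are hypothetical, and the four training sentences are made up. The bigram table looks up the pair (previous tag, current word), and the tagger falls back to the word alone when that pair was never seen:

```python
from collections import Counter, defaultdict

def train(tagged_sents, use_previous_tag):
    """Map each context to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        previous = None
        for word, tag in sent:
            context = (previous, word) if use_previous_tag else word
            counts[context][tag] += 1
            previous = tag
    return {context: c.most_common(1)[0][0] for context, c in counts.items()}

def tag_with_backoff(words, bigram, unigram):
    tagged, previous = [], None
    for word in words:
        # Try the more specific context first
        tag = bigram.get((previous, word))
        if tag is None:
            # Back off to the word on its own; None if it is unknown too
            tag = unigram.get(word)
        tagged.append((word, tag))
        previous = tag
    return tagged

# Made-up training sentences for illustration
train_sents = [[("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
               [("they", "PRP"), ("run", "VBP")],
               [("dogs", "NNS"), ("run", "VBP")],
               [("the", "DT"), ("run", "NN")]]

unigram = train(train_sents, use_previous_tag=False)
bigram = train(train_sents, use_previous_tag=True)

seen_context = tag_with_backoff(["the", "run"], bigram, unigram)
unseen_context = tag_with_backoff(["cats", "run"], bigram, unigram)
```

In the first call the bigram table tags "run" as a noun because it follows a determiner. In the second call "cats" is unknown, so "run" falls back to its most frequent tag overall.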

## 6.8 Encoding Text as a Bag of Words
### Problem
You have text data and want to create a set of features indicating the number of times
an observation’s text contains a particular word.
### Solution
Use scikit-learn’s CountVectorizer:

In [21]:
# Load library
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
# Create text
text_data = np.array(['I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])
# Create the bag of words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
# Show feature matrix
bag_of_words


<3x8 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [22]:
bag_of_words.toarray()

array([[0, 0, 0, 2, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 0]], dtype=int64)

In [23]:
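# Show feature names (scikit-learn 1.0 and later provide get_feature_names_out();
# get_feature_names() was removed in version 1.2)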
count.get_feature_names()

['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love', 'sweden']
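The counts above can be reproduced by hand, which makes the default behavior easier to see. The sketch below is an illustration rather than scikit-learn's code. It lowercases each document and extracts tokens with the pattern `(?u)\b\w\w+\b`, which is CountVectorizer's default `token_pattern` and explains why the one-letter word "I" never becomes a feature:

```python
import re
from collections import Counter

text_data = ['I love Brazil. Brazil!',
             'Sweden is best',
             'Germany beats both']

# Tokens are runs of two or more word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in text_data]

# Features are the distinct tokens in sorted order
vocabulary = sorted({token for doc in tokenized for token in doc})

# One row per document, one count per vocabulary entry
bag = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]
```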

In [24]:
# Create a vectorizer that counts both single words and two-word sequences
count_2gram = CountVectorizer(ngram_range=(1,2))


In [25]:
bag = count_2gram.fit_transform(text_data)
# View feature matrix
bag.toarray()

array([[0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1],
       [1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]], dtype=int64)

In [26]:
count_2gram.vocabulary_


{'love': 10,
 'brazil': 4,
 'love brazil': 11,
 'brazil brazil': 5,
 'sweden': 12,
 'is': 8,
 'best': 2,
 'sweden is': 13,
 'is best': 9,
 'germany': 6,
 'beats': 0,
 'both': 3,
 'germany beats': 7,
 'beats both': 1}
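With `ngram_range=(1,2)`, every single token and every pair of adjacent tokens becomes a feature. A minimal sketch of how those features could be generated for the first document; the helper name `ngrams` is hypothetical and the token list assumes the default tokenization and lowercasing:

```python
from collections import Counter

def ngrams(tokens, n):
    """Join every run of n adjacent tokens with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens of 'I love Brazil. Brazil!' after default tokenization
tokens = ["love", "brazil", "brazil"]

# Unigrams followed by bigrams
features = ngrams(tokens, 1) + ngrams(tokens, 2)
counts = Counter(features)
```

These counts match the nonzero entries in the first row of the matrix above: 2 for "brazil" and 1 each for "brazil brazil", "love", and "love brazil".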

## 6.9 Weighting Word Importance
### Problem
You want a bag of words, but with words weighted by their importance to an observation.
### Solution
Compare the frequency of the word in a document (a tweet, movie review, speech
transcript, etc.) with the frequency of the word in all other documents using term
frequency-inverse document frequency (tf-idf). scikit-learn makes this easy with
TfidfVectorizer:

In [27]:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
# Create text
text_data = np.array(['I love Brazil. Brazil!','Sweden is best','Germany beats both'])
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)
# Show tf-idf feature matrix
feature_matrix

<3x8 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [28]:
feature_matrix.toarray()


array([[0.        , 0.        , 0.        , 0.89442719, 0.        ,
        0.        , 0.4472136 , 0.        ],
       [0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.57735027],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027,
        0.        , 0.        , 0.        ]])

In [29]:
# Show the mapping from each word to its feature index
tfidf.vocabulary_

{'love': 6,
 'brazil': 3,
 'sweden': 7,
 'is': 5,
 'best': 1,
 'germany': 4,
 'beats': 0,
 'both': 2}
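The weights in the matrix above can be checked by hand. The sketch below assumes TfidfVectorizer's default settings: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper name `tfidf_row` is hypothetical and the token lists assume the default tokenization:

```python
import math

# Documents after default tokenization and lowercasing
docs = [["love", "brazil", "brazil"],
        ["sweden", "is", "best"],
        ["germany", "beats", "both"]]

vocabulary = sorted({term for doc in docs for term in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = {term: sum(term in doc for doc in docs) for term in vocabulary}

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_row(doc):
    # Term count times idf, then scale the row to unit Euclidean length
    weights = [doc.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

matrix = [tfidf_row(doc) for doc in docs]
```

Every word here appears in exactly one document, so all idf values are equal and the row weights depend only on the counts: "brazil" appears twice and "love" once, giving 2/√5 ≈ 0.894 and 1/√5 ≈ 0.447.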