<a href="https://colab.research.google.com/github/rikanga/Easy-Numpy/blob/main/ML_UP_Text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Handling Text

## 6.1 Cleaning Text

**Problem**

You have some unstructured text data and want to complete some basic cleaning.

**Solution**

Most basic text cleaning needs nothing more than Python’s core string operations,
in particular strip, replace, and split:

In [None]:
# Create text
text_data = [
             "   Interrobang. By Aishwarya Henriette   ",
             "Parking And Going. By Karl Gautier",
             "   Today Is The night. By Jarek Prakash   "
             ]

In [None]:
# Strip whitespace
strip_whitespace = [string.strip() for string in text_data]

In [None]:
# View strip_whitespace
strip_whitespace

['Interrobang. By Aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']

In [None]:
# Remove periods
remove_stop = [string.replace('.', '') for string in strip_whitespace]

In [None]:
# View remove_stop
remove_stop

['Interrobang By Aishwarya Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']

In [None]:
# Create function
def capitalizer(string: str):
  return string.upper()

In [None]:
# Apply function
[capitalizer(string) for string in remove_stop]

['INTERROBANG BY AISHWARYA HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']

In [None]:
# USING REGULAR EXPRESSION
# Load library
import re

# Define the function
def replace_letters_with_X(string:str):
  return re.sub(r'[a-zA-Z]','X', string)

In [None]:
# Apply function with re pattern
[replace_letters_with_X(string) for string in remove_stop]

['XXXXXXXXXXX XX XXXXXXXXX XXXXXXXXX',
 'XXXXXXX XXX XXXXX XX XXXX XXXXXXX',
 'XXXXX XX XXX XXXXX XX XXXXX XXXXXXX']
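The solution names split, but the cells above never use it. As a minimal sketch, assuming every string follows the "Title. By Author" pattern of the sample data, split can separate each title from its author (the names `raw`, `records`, `title`, and `author` are illustrative, not part of the original notebook):

```python
# Sample strings in the same "Title. By Author" shape as text_data
raw = [
    "   Interrobang. By Aishwarya Henriette   ",
    "Parking And Going. By Karl Gautier",
]

records = []
for line in raw:
    # Strip outer whitespace, then split once on the ". By " separator
    title, author = line.strip().split(". By ", 1)
    records.append((title, author))

print(records)
```

Passing 1 as the second argument limits the split to the first separator, so a title that itself contains ". By " would still produce exactly two parts.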

## 6.2 Parsing and Cleaning HTML

**Problem**

You have text data with HTML elements and want to extract just the text.

**Solution**

Use Beautiful Soup’s extensive set of options to parse and extract from HTML:

In [None]:
# Install Beautiful Soup and the lxml parser used below
!pip install beautifulsoup4 lxml



In [None]:
# Load library
from bs4 import BeautifulSoup

In [None]:
# Create some HTML code
html = """
<div class='full_name'><span style='font-weight:bold'>
Masego</span> Azra</div>
"""

In [None]:
# parse html
soup = BeautifulSoup(html, 'lxml')

In [None]:
soup

<html><body><div class="full_name"><span style="font-weight:bold">
Masego</span> Azra</div>
</body></html>

In [None]:
# Find the div with the class full_name and show its text
soup.find("div", {'class':'full_name'}).text.strip()

'Masego Azra'

In [None]:
# Find the bold span and show its text
soup.find('span', {"style":'font-weight:bold'}).text.strip()

'Masego'
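When Beautiful Soup or lxml cannot be installed, the same text can be recovered with the standard library alone. The sketch below is a stand-in built on html.parser, not Beautiful Soup, and the class name `TextExtractor` is made up for illustration. It gathers every text node and ignores the tags:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of a document, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called once for each run of text between tags
        self.chunks.append(data)


html = """
<div class='full_name'><span style='font-weight:bold'>
Masego</span> Azra</div>
"""

extractor = TextExtractor()
extractor.feed(html)
extractor.close()

# Concatenate the text nodes and collapse runs of whitespace
text = " ".join("".join(extractor.chunks).split())
print(text)
```

This approach cannot select elements by class or attribute the way soup.find does, so it only suits cases where all of the text is wanted.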

In [None]:
import pandas as pd

## 6.3 Removing Punctuation

**Problem**

You have a feature of text data and want to remove punctuation.

**Solution**

Use a regular expression to keep only the runs of letters and digits in each string,
then join the pieces back together with spaces:

In [None]:
# Load library
import re

In [None]:
# Create text
text_data = [
             'Hi!!!! I. Love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right?!?!']

In [None]:
# Extract the runs of letters and digits from each string
word_found = [re.findall(r'[a-zA-Z0-9]+', x) for x in text_data]
word_found

[['Hi', 'I', 'Love', 'This', 'Song'], ['10000', 'Agree', 'LoveIT'], ['Right']]

In [None]:
# Join the words of each string with single spaces
[' '.join(x) for x in word_found]

['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
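Another way to get the same result is str.translate with a dictionary that maps every punctuation character to None. The sketch below assumes that "punctuation" means any code point whose Unicode category starts with "P"; the names `punctuation` and `no_punct` are illustrative:

```python
import sys
import unicodedata

# Map every Unicode punctuation code point (category "P*") to None
punctuation = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

text_data = ['Hi!!!! I. Love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right?!?!']

# translate deletes every character whose code point maps to None
no_punct = [string.translate(punctuation) for string in text_data]
print(no_punct)
```

Unlike the regular expression, this keeps accented letters and other non-ASCII word characters. Symbols such as $ or + belong to the "S" categories, so they are left in place.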

## 6.4 Tokenizing Text

**Problem**

You have text and want to break it up into individual words.

**Solution**

Natural Language Toolkit for Python (NLTK) has a powerful set of text manipulation
operations, including word tokenizing:

In [None]:
# Load library
from nltk.tokenize import word_tokenize

In [None]:
# Create text
string = "The science of today is the technology of tomorrow"

In [None]:
# Load the library
import nltk

In [None]:
# Download the 'punkt' tokenizer models
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Tokenize the string
word_tokenize(string)

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

In [None]:
word_tokenize("Bonjour tout le monde, je suis chez moi à la maison. Content de vous parler")

['Bonjour',
 'tout',
 'le',
 'monde',
 ',',
 'je',
 'suis',
 'chez',
 'moi',
 'à',
 'la',
 'maison',
 '.',
 'Content',
 'de',
 'vous',
 'parler']

We can also tokenize text into sentences:

In [None]:
from nltk.tokenize import sent_tokenize

In [None]:
# Tokenize into sentences
sent_tokenize(string)

['The science of today is the technology of tomorrow']

## 6.5 Removing Stop Words

**Problem**

Given tokenized text data, you want to remove extremely common words (e.g., a, is,
of, on) that contain little informational value.

Once the text is tokenized, we can remove the stop words.

**Solution**

Use NLTK’s stopwords:

In [None]:
from nltk.corpus import stopwords

In [None]:
# Download the stop word lists
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# Create word tokens
tokenized_words = ['i',
'am',
'going',
'to',
'go',
'to',
'the',
'store',
'and',
'park']

In [None]:
tokenized_words

['i', 'am', 'going', 'to', 'go', 'to', 'the', 'store', 'and', 'park']

In [None]:
# Load stop words
stop_words = stopwords.words('english')

In [None]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [None]:
# Remove stop words
[word for word in tokenized_words if word not in stop_words]

['going', 'go', 'store', 'park']
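Every entry in the stop word list shown above is lowercase, so a capitalized token such as "I" would slip through the filter. A minimal sketch of a case-insensitive filter follows. It uses a small hand-picked set in place of NLTK’s list (an assumption made to keep the example self-contained), and a set also makes each membership test fast:

```python
# Hypothetical, hand-picked stop list; NLTK's English list is much longer
stop_words = {"i", "am", "to", "the", "and"}

tokenized_words = ["I", "am", "going", "to", "go", "to",
                   "the", "store", "and", "park"]

# Lowercase each token before the membership test so "I" matches "i"
filtered = [word for word in tokenized_words
            if word.lower() not in stop_words]
print(filtered)
```

With NLTK installed, `set(stopwords.words('english'))` could replace the hand-picked set.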

## 6.6 Stemming Words

**Problem**

You have tokenized words and want to convert them into their root forms.

**Solution**

Use NLTK’s PorterStemmer:


In [None]:
# Load library
from nltk.stem.porter import PorterStemmer

In [None]:
# Create word tokens
tokenized_words = ['i', 'am', 'humbled', 'by', 'this', 'traditional', 'meeting']

In [None]:
# Create stemmer
porter = PorterStemmer()

In [None]:
# Apply stemmer
[porter.stem(word) for word in tokenized_words]

['i', 'am', 'humbl', 'by', 'thi', 'tradit', 'meet']

## 6.7 Tagging Parts of Speech

**Problem**

You have text data and want to tag each word or character with its part of speech.

**Solution**

Use NLTK’s pre-trained parts-of-speech tagger:

In [1]:
# Load libraries
from nltk import pos_tag
from nltk.tokenize import word_tokenize

In [2]:
# Create text
text_data = "Chris loved outdoor running"

In [3]:
# Download the 'punkt' tokenizer models
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
# Tokenize the text into words
word_tokenized = word_tokenize(text_data)
word_tokenized

['Chris', 'loved', 'outdoor', 'running']

In [5]:
# Download the pre-trained tagger model
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [6]:
# Use pre-trained part of speech tagger
text_tag = pos_tag(word_tokenized)

In [7]:
# View the text tag
text_tag

[('Chris', 'NNP'), ('loved', 'VBD'), ('outdoor', 'RP'), ('running', 'VBG')]

In [8]:
# Keep only the nouns
[word for word, tag in text_tag if tag in ['NN', 'NNS', 'NNP', 'NNPS']]

['Chris']

In [9]:
# Another example
# Create text
tweets = [
          'I am eating a burrito for breakfast',
          'Political science is amazing field',
          'San Francisco is an awesome city'
]

In [10]:
# Create list
tagged_tweets = []

In [14]:
# Tag each word and each tweet
for tweet in tweets:
  tweet_tag = nltk.pos_tag(word_tokenize(tweet))
  tagged_tweets.append([tag for word, tag in tweet_tag])

In [15]:
tagged_tweets

[['PRP', 'VBP', 'VBG', 'DT', 'NN', 'IN', 'NN'],
 ['JJ', 'NN', 'VBZ', 'JJ', 'NN'],
 ['NNP', 'NNP', 'VBZ', 'DT', 'JJ', 'NN']]

In [16]:
# Load the library
from sklearn.preprocessing import MultiLabelBinarizer

In [18]:
# Use one-hot encoding to convert the tags into features
one_hot_multi = MultiLabelBinarizer()
one_hot_multi.fit_transform(tagged_tweets)

array([[1, 1, 0, 1, 0, 1, 1, 1, 0],
       [0, 0, 1, 1, 0, 0, 0, 0, 1],
       [1, 0, 1, 1, 1, 0, 0, 0, 1]])

In [19]:
# Show the feature names
one_hot_multi.classes_

array(['DT', 'IN', 'JJ', 'NN', 'NNP', 'PRP', 'VBG', 'VBP', 'VBZ'],
      dtype=object)
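To make clear what the one-hot matrix encodes, here is a plain-Python sketch that rebuilds it from the tags shown above. It is not how MultiLabelBinarizer is implemented. It only illustrates that the columns follow the sorted tag names and that each cell records presence rather than a count:

```python
tagged_tweets = [
    ["PRP", "VBP", "VBG", "DT", "NN", "IN", "NN"],
    ["JJ", "NN", "VBZ", "JJ", "NN"],
    ["NNP", "NNP", "VBZ", "DT", "JJ", "NN"],
]

# Columns are the distinct tags in sorted order
classes = sorted({tag for tags in tagged_tweets for tag in tags})

# Each cell is 1 if the tag occurs in the tweet at all, otherwise 0
matrix = [[int(tag in tags) for tag in classes] for tags in tagged_tweets]

print(classes)
for row in matrix:
    print(row)
```

NN appears twice in the first tweet but is still encoded as 1, so these features say which parts of speech a tweet contains, not how often.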

In [20]:
# Load library
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger

In [22]:
# Get some text from the Brown Corpus, broken into sentences
nltk.download('brown')
sentences = brown.tagged_sents(categories='news')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


In [25]:
# Split into 4000 sentences for training and 623 for testing
train = sentences[:4000]
test = sentences[4000:]

In [26]:
train[:5]

[[('The', 'AT'),
  ('Fulton', 'NP-TL'),
  ('County', 'NN-TL'),
  ('Grand', 'JJ-TL'),
  ('Jury', 'NN-TL'),
  ('said', 'VBD'),
  ('Friday', 'NR'),
  ('an', 'AT'),
  ('investigation', 'NN'),
  ('of', 'IN'),
  ("Atlanta's", 'NP$'),
  ('recent', 'JJ'),
  ('primary', 'NN'),
  ('election', 'NN'),
  ('produced', 'VBD'),
  ('``', '``'),
  ('no', 'AT'),
  ('evidence', 'NN'),
  ("''", "''"),
  ('that', 'CS'),
  ('any', 'DTI'),
  ('irregularities', 'NNS'),
  ('took', 'VBD'),
  ('place', 'NN'),
  ('.', '.')],
 [('The', 'AT'),
  ('jury', 'NN'),
  ('further', 'RBR'),
  ('said', 'VBD'),
  ('in', 'IN'),
  ('term-end', 'NN'),
  ('presentments', 'NNS'),
  ('that', 'CS'),
  ('the', 'AT'),
  ('City', 'NN-TL'),
  ('Executive', 'JJ-TL'),
  ('Committee', 'NN-TL'),
  (',', ','),
  ('which', 'WDT'),
  ('had', 'HVD'),
  ('over-all', 'JJ'),
  ('charge', 'NN'),
  ('of', 'IN'),
  ('the', 'AT'),
  ('election', 'NN'),
  (',', ','),
  ('``', '``'),
  ('deserves', 'VBZ'),
  ('the', 'AT'),
  ('praise', 'NN'),
  ('and', ...

In [27]:
test[:5]

[[('In', 'IN'),
  ("Ruth's", 'NP$'),
  ('day', 'NN'),
  ('--', '--'),
  ('and', 'CC'),
  ('until', 'IN'),
  ('this', 'DT'),
  ('year', 'NN'),
  ('--', '--'),
  ('the', 'AT'),
  ('schedule', 'NN'),
  ('was', 'BEDZ'),
  ('154', 'CD'),
  ('games', 'NNS'),
  ('.', '.')],
 [('Baseball', 'NN'),
  ('commissioner', 'NN'),
  ('Ford', 'NP'),
  ('Frick', 'NP'),
  ('has', 'HVZ'),
  ('ruled', 'VBN'),
  ('that', 'CS'),
  ("Ruth's", 'NP$'),
  ('record', 'NN'),
  ('will', 'MD'),
  ('remain', 'VB'),
  ('official', 'JJ'),
  ('unless', 'CS'),
  ('it', 'PPS'),
  ('is', 'BEZ'),
  ('broken', 'VBN'),
  ('in', 'IN'),
  ('154', 'CD'),
  ('games', 'NNS'),
  ('.', '.')],
 [(')', ')')],
 [('``', '``'),
  ('Even', 'RB'),
  ('on', 'IN'),
  ('the', 'AT'),
  ('basis', 'NN'),
  ('of', 'IN'),
  ('154', 'CD'),
  ('games', 'NNS'),
  (',', ','),
  ('this', 'DT'),
  ('is', 'BEZ'),
  ('the', 'AT'),
  ('ideal', 'JJ'),
  ('situation', 'NN'),
  ("''", "''"),
  (',', ','),
  ('insists', 'VBZ'),
  ('Hank', 'NP'),
  ('Greenberg', ...

In [28]:
# Create backoff tagger
unigram = UnigramTagger(train)
bigram = BigramTagger(train, backoff=unigram)
trigram = TrigramTagger(train, backoff=bigram)

In [29]:
# Show accuracy (recent NLTK releases deprecate evaluate in favor of accuracy)
trigram.evaluate(test)

0.8174734002697437

## 6.8 Encoding Text as a Bag of Words

**Problem**

You have text data and want to create a set of features indicating the number of times
an observation’s text contains a particular word.

**Solution**

Use scikit-learn’s CountVectorizer:

In [33]:
# Load libraries
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

In [34]:
# Create text
text_data = np.array([
                      'I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])

In [35]:
# Create the bag of words feature matrix
count = CountVectorizer()

bag_of_words = count.fit_transform(text_data)

In [37]:
# Show feature matrix
bag_of_words

<3x8 sparse matrix of type '<class 'numpy.int64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [38]:
# Translate to array
bag_of_words.toarray()

array([[0, 0, 0, 2, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 1],
       [1, 0, 1, 0, 1, 0, 0, 0]])

In [40]:
# Show the feature names (get_feature_names was removed in scikit-learn 1.2)
print(count.get_feature_names_out().tolist())

['beats', 'best', 'both', 'brazil', 'germany', 'is', 'love', 'sweden']




In [44]:
count.vocabulary_

{'beats': 0,
 'best': 1,
 'both': 2,
 'brazil': 3,
 'germany': 4,
 'is': 5,
 'love': 6,
 'sweden': 7}

In [41]:
import pandas as pd

In [43]:
# Show the feature matrix as a DataFrame with one column per word
pd.DataFrame(bag_of_words.toarray(), columns=count.get_feature_names_out())



   beats  best  both  brazil  germany  is  love  sweden
0      0     0     0       2        0   0     1       0
1      0     1     0       0        0   1     0       1
2      1     0     1       0        1   0     0       0


In [45]:
pd.Series(count.vocabulary_)

love       6
brazil     3
sweden     7
is         5
best       1
germany    4
beats      0
both       2
dtype: int64
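For intuition, the same matrix can be rebuilt with the standard library. This is a rough sketch of what a bag of words is, not CountVectorizer’s actual implementation. It assumes the vectorizer’s default behavior of lowercasing the text and keeping only tokens of two or more word characters, which is why "I" does not appear among the features:

```python
import re
from collections import Counter

text_data = ["I love Brazil. Brazil!", "Sweden is best", "Germany beats both"]


def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())


# One Counter of word frequencies per document
counts = [Counter(tokenize(doc)) for doc in text_data]

# The vocabulary is every distinct word, in sorted order
vocabulary = sorted(set().union(*counts))

# Rows are documents, columns are words, cells are counts
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

A Counter returns 0 for a missing key, so words absent from a document need no special handling.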

## 6.9 Weighting Word Importance

**Problem**

You want a bag of words, but with words weighted by their importance to an observation.

**Solution**

Compare the frequency of the word in a document (a tweet, movie review, speech
transcript, etc.) with the frequency of the word in all other documents using term
frequency-inverse document frequency (tf-idf). scikit-learn makes this easy with
TfidfVectorizer:

In [46]:
# Load library
from sklearn.feature_extraction.text import TfidfVectorizer

In [47]:
# Create text
text_data = np.array([
                      'I love Brazil. Brazil!',
                      'Sweden is best',
                      'Germany beats both'])
text_data

array(['I love Brazil. Brazil!', 'Sweden is best', 'Germany beats both'],
      dtype='<U22')

In [49]:
# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(text_data)

In [50]:
# Show tf-idf feature matrix
feature_matrix

<3x8 sparse matrix of type '<class 'numpy.float64'>'
	with 8 stored elements in Compressed Sparse Row format>

In [52]:
# Show tf-idf feature matrix as a dense matrix
feature_matrix.toarray()

array([[0.        , 0.        , 0.        , 0.89442719, 0.        ,
        0.        , 0.4472136 , 0.        ],
       [0.        , 0.57735027, 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.57735027],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027,
        0.        , 0.        , 0.        ]])

Here each word’s frequency within a document is compared with its frequency across all the other documents.

In [53]:
# Show the vocabulary (word to column index)
tfidf.vocabulary_

{'beats': 0,
 'best': 1,
 'both': 2,
 'brazil': 3,
 'germany': 4,
 'is': 5,
 'love': 6,
 'sweden': 7}
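The weights in the first row can be reproduced by hand. The sketch below assumes TfidfVectorizer’s default settings: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each row to unit Euclidean length. The helper names `idf` and `tfidf_row` are illustrative:

```python
import math

# Word counts per document, with the single-letter token "I" already dropped
counts = [
    {"love": 1, "brazil": 2},
    {"sweden": 1, "is": 1, "best": 1},
    {"germany": 1, "beats": 1, "both": 1},
]
n_docs = len(counts)


def idf(word):
    # Document frequency: number of documents that contain the word
    df = sum(word in doc for doc in counts)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    # Raw term frequency times idf, then L2 normalization of the row
    weights = {word: tf * idf(word) for word, tf in doc.items()}
    norm = math.sqrt(sum(value * value for value in weights.values()))
    return {word: value / norm for word, value in weights.items()}


row = tfidf_row(counts[0])
print(round(row["brazil"], 8), round(row["love"], 8))
```

Every word in this tiny corpus occurs in exactly one document, so all idf values are equal and the weights depend only on the counts: "brazil" occurs twice as often as "love" and gets twice the weight, giving 0.89442719 and 0.4472136 after normalization.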