
# Handling Text

## 6.1 Cleaning Text

**Problem**

You have some unstructured text data and want to complete some basic cleaning.

**Solution**

Most basic text cleaning can be done with Python's core string operations, in particular `strip`, `replace`, and `split`:

In [1]:
# Create text
text_data = [
             "   Interrobang. By Aishwarya Henriette   ",
             "Parking And Going. By Karl Gautier",
             "   Today Is The night. By Jarek Prakash   "
             ]

In [4]:
# Strip whitespace
strip_whitespace = [string.strip() for string in text_data]

In [5]:
# View strip_whitespace
strip_whitespace

['Interrobang. By Aishwarya Henriette',
 'Parking And Going. By Karl Gautier',
 'Today Is The night. By Jarek Prakash']

In [6]:
# Remove periods (full stops)
remove_stop = [string.replace('.', '') for string in strip_whitespace]

In [7]:
# View remove_stop
remove_stop

['Interrobang By Aishwarya Henriette',
 'Parking And Going By Karl Gautier',
 'Today Is The night By Jarek Prakash']

In [8]:
# Create function
def capitalizer(string: str) -> str:
    return string.upper()

In [10]:
# Apply function
[capitalizer(string) for string in remove_stop]

['INTERROBANG BY AISHWARYA HENRIETTE',
 'PARKING AND GOING BY KARL GAUTIER',
 'TODAY IS THE NIGHT BY JAREK PRAKASH']

In [11]:
# Using regular expressions
# Load library
import re

# Define a function that replaces every ASCII letter with "X"
def replace_letters_with_X(string: str) -> str:
    return re.sub(r'[a-zA-Z]', 'X', string)

In [14]:
# Apply function with re pattern
[replace_letters_with_X(string) for string in remove_stop]

['XXXXXXXXXXX XX XXXXXXXXX XXXXXXXXX',
 'XXXXXXX XXX XXXXX XX XXXX XXXXXXX',
 'XXXXX XX XXX XXXXX XX XXXXX XXXXXXX']
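
The Solution also names `split`, which none of the cells above use. As a minimal sketch, assuming every entry follows the "Title. By Author" pattern seen in `text_data` (the helper name `split_title_author` is hypothetical), `split` can separate the two fields:

```python
# Same sample strings as above
text_data = [
    "   Interrobang. By Aishwarya Henriette   ",
    "Parking And Going. By Karl Gautier",
    "   Today Is The night. By Jarek Prakash   ",
]

def split_title_author(raw: str) -> tuple:
    """Split 'Title. By Author' into (title, author).

    Assumes the separator '. By ' appears in the string;
    the unpacking below raises ValueError if it does not.
    """
    title, author = raw.strip().split(". By ", 1)
    return title, author

pairs = [split_title_author(raw) for raw in text_data]
print(pairs)
```

Passing `1` as the second argument limits the split to the first separator, so an author name that happened to contain ". By " would stay intact.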

## 6.2 Parsing and Cleaning HTML

**Problem**

You have text data with HTML elements and want to extract just the text.

**Solution**

Use Beautiful Soup’s extensive set of options to parse and extract from HTML:

In [18]:
# Install Beautiful Soup and the lxml parser used below
!pip install beautifulsoup4 lxml

In [21]:
# Load library
from bs4 import BeautifulSoup

In [22]:
# Create some HTML code
html = """
<div class='full_name'><span style='font-weight:bold'>
Masego</span> Azra</div>
"""

In [23]:
# Parse the HTML with the lxml parser
soup = BeautifulSoup(html, 'lxml')

In [24]:
soup

<html><body><div class="full_name"><span style="font-weight:bold">
Masego</span> Azra</div>
</body></html>

In [31]:
# Find the div with class "full_name" and show its text
soup.find("div", {'class':'full_name'}).text.strip()

'Masego Azra'

In [34]:
# Find the span by its style attribute and show its text
soup.find('span', {"style":'font-weight:bold'}).text.strip()

'Masego'
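
As a hedged variation on the cells above, the sketch below pulls the same text out with `get_text` and a CSS selector. It assumes Python's built-in `html.parser` is an acceptable substitute for `lxml` here; the variable names `full_text` and `first_name` are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div class='full_name'><span style='font-weight:bold'>
Masego</span> Azra</div>
"""

# "html.parser" ships with Python, so no extra parser package is needed
soup = BeautifulSoup(html, "html.parser")

# Collect every text node, strip each one, and join them with a space
full_text = soup.get_text(" ", strip=True)

# CSS selector equivalent of the find() calls above
first_name = soup.select_one("div.full_name span").get_text(strip=True)

print(full_text)
print(first_name)
```

Using `get_text(" ", strip=True)` avoids the stray newline that sits inside the `span`, which is why the earlier cells needed a trailing `.strip()`.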

In [35]:
import pandas as pd

## 6.3 Removing Punctuation

**Problem**

You have a feature of text data and want to remove punctuation.

**Solution**

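One option is to use `translate` with a dictionary of punctuation characters. The cells below take a simpler route and use a regular expression that keeps only runs of ASCII letters and digits: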

In [36]:
# Load libraries
import sys
import unicodedata

In [38]:
# Create text
text_data = [
             'Hi!!!! I. Love. This. Song....',
             '10000% Agree!!!! #LoveIT',
             'Right?!?!']

In [61]:
# Extract runs of letters and digits, which drops all punctuation
word_found = [re.findall(r'[a-zA-Z0-9]+', x) for x in text_data]
word_found

[['Hi', 'I', 'Love', 'This', 'Song'], ['10000', 'Agree', 'LoveIT'], ['Right']]

In [63]:
# Join each list of words back into a single string
[' '.join(words) for words in word_found]

['Hi I Love This Song', '10000 Agree LoveIT', 'Right']
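
The `translate` approach named in the Solution is not shown in the cells above. A minimal sketch of it, using the `sys` and `unicodedata` modules already imported, might look like this (the variable names `punctuation` and `cleaned` are assumptions):

```python
import sys
import unicodedata

text_data = [
    "Hi!!!! I. Love. This. Song....",
    "10000% Agree!!!! #LoveIT",
    "Right?!?!",
]

# Map every Unicode code point whose category starts with "P"
# (punctuation) to None
punctuation = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P")
)

# translate() deletes any character that is mapped to None
cleaned = [string.translate(punctuation) for string in text_data]
print(cleaned)
```

Unlike the `[a-zA-Z0-9]+` pattern, this version leaves accented and other non-ASCII letters untouched, because it removes only characters that Unicode classifies as punctuation.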

## 6.4 Tokenizing Text

**Problem**

You have text and want to break it up into individual words.

**Solution**

The Natural Language Toolkit for Python (NLTK) has a powerful set of text manipulation
operations, including word tokenizing:

In [65]:
# Load library
from nltk.tokenize import word_tokenize

In [66]:
# Create text
string = "The science of today is the technology of tomorrow"

In [68]:
# Load the library
import nltk

In [71]:
# Download the 'punkt' tokenizer models
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [73]:
# Tokenize the string
word_tokenize(string)

['The', 'science', 'of', 'today', 'is', 'the', 'technology', 'of', 'tomorrow']

In [74]:
word_tokenize("Bonjour tout le monde, je suis chez moi à la maison. Content de vous parler")

['Bonjour',
 'tout',
 'le',
 'monde',
 ',',
 'je',
 'suis',
 'chez',
 'moi',
 'à',
 'la',
 'maison',
 '.',
 'Content',
 'de',
 'vous',
 'parler']

We can also tokenize text into sentences:

In [75]:
from nltk.tokenize import sent_tokenize

In [76]:
# Tokenize into sentences
sent_tokenize(string)

['The science of today is the technology of tomorrow']
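
To make concrete what word tokenization does, here is a rough approximation that uses only the standard library. This is not NLTK's algorithm: it is a simplified regular-expression stand-in (the name `simple_word_tokenize` is hypothetical) that separates runs of word characters from individual punctuation marks and treats contractions and abbreviations naively.

```python
import re

def simple_word_tokenize(text: str) -> list:
    # \w+ matches a run of letters, digits, or underscores
    # (Unicode-aware for str patterns in Python 3);
    # [^\w\s] matches a single punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize(
    "Bonjour tout le monde, je suis chez moi à la maison."
)
print(tokens)
```

For real text, `word_tokenize` is the safer choice; the sketch only illustrates why the comma and the period show up as separate tokens in the output above.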

## 6.5 Removing Stop Words

**Problem**

Given tokenized text data, you want to remove extremely common words (e.g., a, is,
of, on) that contain little informational value.

After tokenizing, we can remove stop words.

**Solution**

Use NLTK's `stopwords`:

In [77]:
from nltk.corpus import stopwords

In [78]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [79]:
# Create word tokens
tokenized_words = ['i',
'am',
'going',
'to',
'go',
'to',
'the',
'store',
'and',
'park']

In [80]:
tokenized_words

['i', 'am', 'going', 'to', 'go', 'to', 'the', 'store', 'and', 'park']

In [87]:
# Load stop words
stop_words = stopwords.words('english')

In [88]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [89]:
# Remove stop words
[word for word in tokenized_words if word not in stop_words]

['going', 'go', 'store', 'park']
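
Two practical details are worth sketching. The stop words shown above are all lowercase, so tokens should be lowercased before the comparison, and membership tests are faster against a set than against a list. To stay self-contained, the example below uses a small hand-picked subset of the list shown above rather than loading `stopwords.words('english')`; both that subset and the mixed-case tokens are assumptions for illustration.

```python
# Hand-picked subset of the stop words listed above (assumption);
# with the NLTK data available you could use
# set(stopwords.words('english')) instead
stop_words = {"i", "am", "to", "the", "and"}

# Mixed-case tokens, as they might come out of a tokenizer
tokenized_words = ["I", "am", "going", "to", "go", "to",
                   "The", "store", "and", "park"]

# Lowercase each token before the set lookup so "I" and "The" are caught
filtered = [word for word in tokenized_words
            if word.lower() not in stop_words]
print(filtered)
```

Without the `.lower()` call, "I" and "The" would pass through the filter, since neither matches a lowercase entry.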

## 6.6 Stemming Words

**Problem**

You have tokenized words and want to convert them into their root forms.

**Solution**

Use NLTK's `PorterStemmer`.
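
A minimal sketch of how `PorterStemmer` might be applied to a list of tokens follows. The example words are chosen for illustration and do not come from the notebook.

```python
from nltk.stem.porter import PorterStemmer

# Example tokens (assumption, chosen for illustration)
tokenized_words = ["i", "am", "humbled", "by", "this",
                   "traditional", "meeting"]

# Create the stemmer; it needs no downloaded corpus
porter = PorterStemmer()

# Reduce each word to its stem
stemmed = [porter.stem(word) for word in tokenized_words]
print(stemmed)
```

Stems are not guaranteed to be dictionary words: "traditional" is cut down to "tradit" and "meeting" to "meet". That makes the output less readable but closer to a shared root form, which is usually what matters when the tokens are used as features.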
