# Text Preprocessing

Suppose we have textual data available. We need to apply several preprocessing steps to transform the words into numerical features that machine learning algorithms can work with.

Which preprocessing steps to apply depends mainly on the domain and the problem itself. We don't need to apply every step to every problem.

Here, we'll walk through text preprocessing in Python using the NLTK (Natural Language Toolkit) library.

In [4]:
# import necessary libraries 
import nltk
import string
import re

### Text lowercase

We lowercase the text to reduce the size of the vocabulary in our text data.

In [5]:
def lowercase_text(text): 
    return text.lower() 
  
input_str = "Data science is the study of data to extract meaningful insights for business!!"
lowercase_text(input_str) 

'data science is the study of data to extract meaningful insights for business!!'
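
To make the vocabulary claim concrete, here is a minimal sketch that counts distinct tokens before and after lowercasing. The sample sentence and the variable names are made up for illustration, and the tokenization is a plain whitespace split rather than NLTK's tokenizer.

```python
# Illustrative sentence (not from the dataset): the same words in different cases.
sample = "Data science uses data. DATA drives Science."

# Naive tokenization: drop the periods, then split on whitespace.
tokens = sample.replace(".", "").split()

vocab_raw = set(tokens)                              # case-sensitive vocabulary
vocab_lower = set(token.lower() for token in tokens)  # vocabulary after lowercasing

print(len(vocab_raw), sorted(vocab_raw))
print(len(vocab_lower), sorted(vocab_lower))
```

The raw vocabulary treats "Data", "data", and "DATA" as three different entries, while the lowercased one keeps a single entry for each word.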

### Remove numbers

We should either remove numbers or convert them into their textual representations.
We can use regular expressions (the re module) to remove numbers.

In [6]:
# For Removing numbers 
def remove_num(text): 
    result = re.sub(r'\d+', '', text) 
    return result 
  
input_s = "You bought 6 apple from shop, and 8 Banana."
remove_num(input_s) 

'You bought  apple from shop, and  Banana.'
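
Notice that the output keeps a double space wherever a number was removed. One possible follow-up, sketched below, is to collapse runs of whitespace after stripping the digits. The function name remove_num_clean is hypothetical and not part of the original notebook.

```python
import re

def remove_num_clean(text):
    # Hypothetical variant of remove_num: drop the digits first...
    without_digits = re.sub(r'\d+', '', text)
    # ...then collapse any run of whitespace into one space and trim the ends.
    return re.sub(r'\s+', ' ', without_digits).strip()

cleaned = remove_num_clean("You bought 6 apple from shop, and 8 Banana.")
print(cleaned)
```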

As mentioned above, you can also convert the numbers into words. This can be done using the inflect library.

In [7]:
# import the library 
import inflect 
q = inflect.engine() 
  
# convert number into text 
def convert_num(text): 
    # split strings into list of texts 
    temp_string = text.split() 
    # initialise empty list 
    new_str = [] 
  
    for word in temp_string: 
        # if the token consists only of digits, convert it
        # to words and append it to the new_str list
        if word.isdigit(): 
            temp = q.number_to_words(word) 
            new_str.append(temp) 
  
        # append the texts as it is 
        else: 
            new_str.append(word) 
  
    # join the texts of new_str to form a string 
    temp_str = ' '.join(new_str) 
    return temp_str 
  
input_str = 'You bought 6 apple from shop, and 8 Banana.'
convert_num(input_str)

'You bought six apple from shop, and eight Banana.'
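
One limitation of the whitespace-split approach is that a token such as "8," fails the isdigit() check, because the comma stays attached to the number. The sketch below shows one way around this, using re.sub with a replacement function. To stay self-contained it uses a small hypothetical lookup table (DIGIT_WORDS) that covers single digits only; it is an illustration, not a replacement for a number-to-words library.

```python
import re

# Hypothetical lookup table for illustration only (single digits).
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def convert_single_digits(text):
    # \b\d\b matches a digit that stands alone as a word, even when
    # punctuation follows it; multi-digit numbers are left untouched.
    return re.sub(r'\b\d\b', lambda m: DIGIT_WORDS[m.group()], text)

converted = convert_single_digits("You bought 6 apples, then 8, then 12.")
print(converted)
```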

### Remove Punctuation

We remove punctuation so that we don't end up with different forms of the same word. If we don't remove punctuation, then the tokens "been,", "been." and "been!" will be treated as three different words.

In [8]:
# let's remove punctuation 
def rem_punct(text):
    translator = str.maketrans('', '', string.punctuation) 
    return text.translate(translator) 
  
input_str = "Is data science a good career?? Data science is a fantastic career with a tonne of potential for future growth!!!"
rem_punct(input_str) 

'Is data science a good career Data science is a fantastic career with a tonne of potential for future growth'
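
The effect on token counts can be checked directly. The following sketch, using a made-up three-token string, counts tokens before and after applying the same str.translate approach as rem_punct.

```python
import string
from collections import Counter

# Same translation table as in rem_punct: delete every punctuation character.
translator = str.maketrans('', '', string.punctuation)

tokens = "been, been. been!".split()

before = Counter(tokens)                                        # three distinct tokens
after = Counter(token.translate(translator) for token in tokens)  # one token, counted three times

print(before)
print(after)
```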

### Remove stopwords

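Stopwords are common words (such as "is" and "the") that contribute little to the meaning of a sentence, so they can usually be removed without changing what the sentence is about. The NLTK (Natural Language Toolkit) library provides a set of stopwords, and we can use it to remove stopwords from our text and return a list of word tokens. Note that the comparison below is case-sensitive and NLTK's English list is lowercase, so capitalized stopwords are kept unless the text is lowercased first.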

In [9]:
# importing nltk library
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

nltk.download('stopwords')
nltk.download('punkt')
  
# remove stopwords function 
def rem_stopwords(text): 
    stop_words = set(stopwords.words("english")) 
    word_tokens = word_tokenize(text) 
    filtered_text = [word for word in word_tokens if word not in stop_words] 
    return filtered_text 
  
ex_text = "Data is the new oil. A.I is the last invention"
rem_stopwords(ex_text)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\subhash\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\subhash\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Data', 'new', 'oil', '.', 'A.I', 'last', 'invention']
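
The individual steps covered so far can be chained into a single function. The sketch below is one possible ordering, not a prescribed pipeline. To keep it runnable without downloading NLTK data, it uses a tiny hypothetical stopword set (STOP_WORDS) and a whitespace split; in practice you would substitute stopwords.words("english") and word_tokenize.

```python
import re
import string

# Hypothetical, tiny stopword list for illustration; NLTK's English list is much longer.
STOP_WORDS = {"is", "the", "a", "an", "and", "of", "to"}

def preprocess(text):
    text = text.lower()                                               # 1. lowercase
    text = re.sub(r'\d+', '', text)                                   # 2. remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # 3. remove punctuation
    return [word for word in text.split() if word not in STOP_WORDS]  # 4. remove stopwords

tokens = preprocess("Data is the new oil. A.I is the last invention")
print(tokens)
```

Lowercasing first means the stopword comparison no longer depends on capitalization, and removing punctuation before splitting avoids stray "." tokens in the result.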

# RegEx

A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. 

- For example: ^a...s$

The line above defines a RegEx pattern. The pattern matches any five-character string that starts with a and ends with s; each . matches any single character except a newline, not only letters.



In [12]:
import re

pattern = '^a...s$'

test_string = 'abyss'    # five characters: matches
test_string2 = 'abyssb'  # six characters: does not match

result = re.match(pattern, test_string)
# result = re.match(pattern, test_string2)  # uncomment to see the unsuccessful case

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")


Search successful.
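
To see where the pattern's boundaries lie, the following sketch runs it against a handful of made-up strings and records whether each one matches.

```python
import re

pattern = r'^a...s$'

# Illustrative inputs: correct length, non-letter characters, too short,
# too long, and wrong case.
candidates = ['abyss', 'alias', 'a1-3s', 'abs', 'abyssb', 'Alias']

results = {s: bool(re.match(pattern, s)) for s in candidates}

for s, matched in results.items():
    print(f"{s!r}: {matched}")
```

'a1-3s' matches because . accepts any character, while 'Alias' fails because matching is case-sensitive by default.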


## What is RegEx and why is it important?

A regular expression (RegEx) is a pattern that helps you extract information from string data by searching through the text for what you need. Whether it's punctuation, numbers, letters, or even whitespace, RegEx lets you check for and match any combination of characters in a string.

For example, suppose you need to validate the format of email addresses or security numbers. You can use RegEx to check for the pattern inside the text and, if needed, replace the matched substring with another one.

For instance, a RegEx can tell the program to search for specific text in a string and print the result accordingly. Expressions can include text matching, repetition, branching, and pattern composition.


### RegEx Syntax

    import re

- The *re* library in Python is used for string searching and manipulation.
- It is also used frequently for web scraping.

#### Example for \w+ and ^ Expression

- *^:* This expression matches the start of a string.
- *\w+:* This expression matches one or more word characters (letters, digits, or underscores).

Here is an example of how you can use the "\w+" and "^" expressions in code. re.findall is covered in a later part, so just focus on the "\w+" and "^" expressions.

Take the string "SuccessBatch13, Data Science Bootcamp Batch". If we execute the code, we get "SuccessBatch13" as the result, because the match starts at the beginning of the string and stops at the first non-word character (the comma).

In [13]:
import re
sent = "SuccessBatch13, Data Science Bootcamp Batch"
r2 = re.findall(r"^\w+",sent)
print(r2)

['SuccessBatch13']


*Note:* If we remove the '+' sign from \w+, the output changes and gives only the first character of the string, i.e. ['S'].

In [14]:
import re
sent = "SuccessBatch13, Data Science Bootcamp Batch"
r2 = re.findall(r"^\w",sent)
print(r2)

['S']
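
For comparison, here is a short sketch of what happens when the ^ anchor is dropped instead: \w+ on its own returns every run of word characters, not only the one at the start of the string.

```python
import re

sent = "SuccessBatch13, Data Science Bootcamp Batch"

first_word = re.findall(r"^\w+", sent)  # anchored: only the leading word
all_words = re.findall(r"\w+", sent)    # unanchored: every word in the string

print(first_word)
print(all_words)
```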


#### Example of \s expression in re.split function

- *\s:* This expression matches a single whitespace character (space, tab, or newline).

To understand this expression better, we use it with the split function in a simple example. Here we split the sentence into words using re.split, with \s as the separator so that each word is returned separately.

In [15]:
import re

print((re.split(r'\s','We splited this sentence')))

['We', 'splited', 'this', 'sentence']


As we can see above, we got the output ['We', 'splited', 'this', 'sentence']. But what if we remove the '\' from '\s'? The pattern then matches the literal letter 's', so the sentence is split at every 's' and those letters are dropped. Let's see this in the example below.

In [16]:
import re

print((re.split(r's','We splited this sentence')))

['We ', 'plited thi', ' ', 'entence']
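
One practical caveat with the \s split shown earlier: because \s matches exactly one whitespace character, consecutive spaces produce empty strings in the result. A common remedy, sketched below on a made-up string, is to split on \s+ so that a whole run of whitespace acts as one separator.

```python
import re

# Illustrative input with a double space, a tab, and a newline.
messy = "We  split\tthis\nsentence"

single = re.split(r'\s', messy)   # one split per whitespace character
multi = re.split(r'\s+', messy)   # one split per run of whitespace

print(single)
print(multi)
```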


Similarly, Python supports a range of other special sequences and metacharacters that you can use in various ways, such as \d, \D, $, \., and \b.

- \d - Matches any decimal digit. Equivalent to [0-9] for ASCII text.
- \D - Matches any character that is not a decimal digit. Equivalent to [^0-9] for ASCII text.
- \b - Matches a word boundary, i.e. the position at the beginning or end of a word.
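
The three sequences in the list above can be tried out quickly. The sketch below uses made-up input strings and only standard re functions.

```python
import re

# \d: each decimal digit is matched individually.
digits = re.findall(r'\d', 'Room 42')

# \D: remove everything that is not a digit.
digits_only = re.sub(r'\D', '', 'Call 555-0199')

# \b: match 'cat' only as a whole word, not inside 'concat'.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')

print(digits)
print(digits_only)
print(whole_words)
```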

## Use RegEx methods

The "re" package provides several methods to perform queries on an input string. We will look at the following:

    re.match()
    re.search()
    re.findall()
    
**Note:** Python offers two different primitive operations based on regular expressions: match checks for a match only at the beginning of the string, while search checks for a match anywhere in the string.
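
The difference between the two operations is easiest to see side by side. In this minimal sketch, the same pattern is passed to both functions for a word that does not sit at the start of the string.

```python
import re

text = "Python is fun"

at_start = re.match(r'fun', text)     # 'fun' is not at the beginning
anywhere = re.search(r'fun', text)    # 'fun' is found later in the string
leading = re.match(r'Python', text)   # 'Python' is at the beginning

print(at_start)          # None
print(anywhere.span())   # start and end index of the match
print(leading.group())
```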

### Finding Pattern in the text (re.search())

A RegEx is commonly used to search for a pattern in text. The re.search() method takes a RegEx pattern and a string and searches for that pattern within the string.

To use re.search(), you need to import re first. The function scans the given text for the pattern and returns a match object if the pattern is found, or None otherwise.

In [17]:

import re

# named 'text' so that it does not shadow the string module imported earlier
text = "Python is fun"

# check if 'Python' is at the beginning
# \A - Matches only at the start of the string.
match = re.search(r'\APython', text)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")

pattern found inside the string


In [18]:

import re

# named 'text' so that it does not shadow the string module imported earlier
text = "I live Python programming language"

# check if 'Python' appears anywhere in the string
match = re.search('Python', text)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")

pattern found inside the string


In [19]:
import re

pattern = ["playing", "Success Analytics"]
text = "Sanju is playing outside."

for p in pattern:
    print("You're looking for '%s' in '%s'" %(p, text), end = ' ')
    
    if re.search(p, text):
        print('Found match!')
        
    else:
        print("no match found!")

You're looking for 'playing' in 'Sanju is playing outside.' Found match!
You're looking for 'Success Analytics' in 'Sanju is playing outside.' no match found!
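
When the phrases being searched for are plain text rather than patterns, any special characters they contain are still interpreted as RegEx syntax. The sketch below shows one way to guard against that with re.escape. The phrase "outside." and the second test sentence are made up for illustration.

```python
import re

text = "Sanju is playing outside."
phrase = "outside."   # the trailing '.' is a RegEx metacharacter

# Unescaped, '.' matches any character, so 'outsideX' is accepted too.
loose_hit = bool(re.search(phrase, "Sanju is playing outsideX"))

# re.escape turns the phrase into a pattern that matches it literally.
strict_hit = bool(re.search(re.escape(phrase), "Sanju is playing outsideX"))
literal_hit = bool(re.search(re.escape(phrase), text))

print(loose_hit, strict_hit, literal_hit)
```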


### Using re.findall() for text

We use re.findall() when we want every match rather than just the first one: it scans the whole string and returns all non-overlapping matches as a list in one go. In the examples below, we first extract all the numbers from a string and then fetch all the email addresses from a list of addresses.

In [20]:

# Program to extract numbers from a string

import re

# named 'text' so that it does not shadow the string module imported earlier
text = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'
# \d - Matches any decimal digit. Equivalent to [0-9]

result = re.findall(pattern, text)
print(result)

# Output: ['12', '89', '34']

['12', '89', '34']


**Note**: If the pattern is not found, re.findall() returns an empty list.
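
Two behaviours of re.findall() are worth checking by hand: the empty list for no matches, and the change in return type when the pattern contains capture groups. The strings in this sketch are made up for illustration.

```python
import re

# No digits in the input: the result is an empty list, which is falsy.
no_digits = re.findall(r'\d+', 'hello hi howdy')

# With two capture groups, each match is returned as a tuple of the groups.
pairs = re.findall(r'(\w+)=(\d+)', 'a=1, b=22')

print(no_digits)
print(pairs)
```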

In [21]:
import re

data = "shivan@analytics.in, bhaskar@analytics.in, ratan@analytics.in, hiteshanalytics.in"

# The regular expression [\w\.-]+@[\w\.-]+ is a simple pattern commonly used to match email addresses.
emails = re.findall(r'[\w\.-]+@[\w\.-]+', data)

for e in emails:
    print(e)

shivan@analytics.in
bhaskar@analytics.in
ratan@analytics.in


**Explanation**

1. [\w\.-]+: This part of the regular expression matches the username part of an email address. Here's what each component does:

    [\w\.-]: This character class [] matches any single character that is either a word character (letter, digit, or underscore), a dot . or a hyphen -.

    +: The + quantifier means that the previous character class (or group) should appear one or more times. So, [\w\.-]+ matches one or more word characters, dots, or hyphens in the username.
    
2. @: This character matches the literal "@" symbol, which separates the username from the domain in an email address.


3. [\w\.-]+: This part of the regular expression matches the domain part of an email address. It's similar to the first part:

    [\w\.-]: Again, this character class matches word characters, dots, or hyphens.
    
    +: Like before, + indicates that the previous character class (or group) should appear one or more times.
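
Because both character classes accept dots anywhere, this pattern is permissive: it will also capture a trailing full stop and accept a domain with no dot at all. The sketch below compares it with a slightly stricter variant that requires at least one dot-separated label in the domain. The stricter pattern and the sample text are assumptions for illustration, and neither pattern is a complete email validator.

```python
import re

# Illustrative text: an address followed by a full stop, and one without a dotted domain.
data = "Contact ratan@analytics.in. Internal alias: admin@localhost"

# Pattern from the example above.
loose = re.findall(r'[\w\.-]+@[\w\.-]+', data)

# Hypothetical stricter variant: domain = label, then one or more ".label" parts.
strict = re.findall(r'[\w.-]+@[\w-]+(?:\.[\w-]+)+', data)

print(loose)
print(strict)
```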


### Using re.match()

The match function tries to match the RegEx pattern at the beginning of the string, with optional flags. Here, the pattern combines "\w+" and "\W" to match two consecutive words that both start with a lowercase "i", separated by a single non-word character; strings that do not begin with such a pair are not matched. To check each element in the list, we run a for loop.

In [22]:
import re

lists = ['icecream images', 'i immitated', 'inner peace', 'I have an iPhone and an iPad.']

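for i in lists:
    # raw string avoids invalid-escape warnings; 'match' avoids reusing
    # the name q, which holds the inflect engine defined earlier
    match = re.match(r"(i\w+)\W(i\w+)", i)
    
    if match:
        print((match.groups()))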

('icecream', 'images')


**Explanation**

- (i\w+): This part of the expression matches a sequence of characters that starts with the letter "i" followed by one or more word characters (letters, digits, or underscores). The parentheses capture this sequence as a group.


- \W: This part of the expression matches a non-word character, such as whitespace or punctuation.


- (i\w+): This part is similar to the first part and captures another sequence of characters that starts with "i" and is followed by one or more word characters.
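
The captured groups can also be read individually from the match object. The following sketch uses the same pattern with named groups; the names first and second are chosen here for illustration and are not part of the original example.

```python
import re

# Same pattern as above, with hypothetical group names added via (?P<name>...).
pattern = r'(?P<first>i\w+)\W(?P<second>i\w+)'

m = re.match(pattern, 'icecream images')

print(m.group(0))         # the whole match
print(m.group(1))         # first group, by position
print(m.group('second'))  # second group, by name
print(m.groupdict())      # all named groups as a dictionary

# 'i immitated' does not match: i\w+ needs at least one word character after the 'i'.
no_match = re.match(pattern, 'i immitated')
print(no_match)
```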

In [23]:
import re

# Sample text
text1 = "I enjoy ice cream."
text2 = "In the evening, I like to watch movies."
text3 = "The idea is interesting."
text4 = "I have an iPhone and an iPad."
text5 = "Igloos are cool."
text6 = "The Internet is a vast network."

# Regular expression pattern
pattern = r'(i\w+)\W(i\w+)'

# Function to find and print matches
def find_matches(text):
    matches = re.findall(pattern, text, re.IGNORECASE)  # Case-insensitive search
    if matches:
        for match in matches:
            print(f"Match 1: {match[0]}")
            print(f"Match 2: {match[1]}")
    else:
        print("No matches found.")

# Apply the regular expression to the sample texts
find_matches(text1)
find_matches(text2)
find_matches(text3)
find_matches(text4)
find_matches(text5)
find_matches(text6)

No matches found.
No matches found.
Match 1: idea
Match 2: is
No matches found.
No matches found.
Match 1: Internet
Match 2: is


# Happy Learning!