## 4. Text Cleaning and Manipulation

### Before you can use text to train statistical algorithms, you need to perform several preprocessing tasks, such as text cleaning, part-of-speech tagging, stop word removal, stemming, and lemmatization.

### In Python, text is represented as the string type (str). Therefore, the regular expression (regex) module re can be used for text cleaning and manipulation. In addition, you can use the built-in string methods to clean and manipulate strings.

1. Introduction to Regular Expressions
2. Searching Patterns in Text
3. Substituting Text in Strings
4. Removing Digits and Alphabets from a String
5. Removing Empty Spaces from Strings
6. Removing Special Characters from Strings
7. Miscellaneous String Functions
8. Further Reading
9. Exercises


### 4.1. Introduction to Regular Expressions

#### A regular expression, also known as a regex, is a sequence of characters that defines a pattern to match against text. Once a pattern is matched, you can apply different operations to the matched text. For instance, depending on the regex pattern, you can search for values inside a text, substitute them with other values, or remove them.

[Python Regular Expression Operation Documentation](https://docs.python.org/3/library/re.html)

[Regular Expression Howto](https://docs.python.org/3/howto/regex.html#regex-howto)

In [1]:
import re
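
As a quick orientation before the detailed sections, the sketch below shows the three kinds of operation mentioned above (searching, substituting, and collecting matches) on the sample sentence used throughout this chapter. The variable names and the acronym pattern are illustrative choices, not part of the original notebook.

```python
import re

sentence = "France won the FIFA World Cup 2018"

# Search: find the first run of digits anywhere in the string.
year_match = re.search(r"\d+", sentence)
year = year_match.group(0) if year_match else None

# Substitute: replace every run of digits with other text.
replaced = re.sub(r"\d+", "1998", sentence)

# Find all: collect every whole word made of two or more uppercase letters.
acronyms = re.findall(r"\b[A-Z]{2,}\b", sentence)

print(year)
print(replaced)
print(acronyms)
```

Each of these functions is covered in more detail in the sections that follow.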

### 4.2. Searching Patterns in Text (use match() or search() function)

In [4]:
sentence = "France won the FIFA World Cup 2018"

# match any string
# "." matches any character except a newline
# "*" matches zero or more repetitions of the preceding element

output = re.match(r".*", sentence)

print(output)

<re.Match object; span=(0, 34), match='France won the FIFA World Cup 2018'>


In [5]:
# to return the matched string

print(output.group(0))

France won the FIFA World Cup 2018


In [6]:
sentence = ""
output = re.match(r".*", sentence)
print(output)

<re.Match object; span=(0, 0), match=''>


In [7]:
# "+" matches one or more repetitions of the preceding element

sentence = ""
output = re.match(r".+", sentence)
print(output)

None


In [36]:
# match a run of lowercase and uppercase letters (a-z, A-Z) and spaces

sentence = "France won the FIFA World Cup 2018"
output = re.match(r"[a-zA-Z ]+", sentence)
print(output.group())

France won the FIFA World Cup 


In [8]:
sentence = "2018 was the year when France won the FIFA World Cup"
output = re.match(r"[a-zA-Z ]+", sentence)
print(output)

None


## The match() function only matches at the beginning of the string. If the pattern does not match there, match() returns None.
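
Because `re.match()` can return `None`, calling `.group()` on the result without a check raises an `AttributeError` when there is no match. The following minimal sketch guards against that; the helper name `leading_letters` is hypothetical and used only for illustration.

```python
import re

def leading_letters(text):
    """Return the run of letters and spaces at the start of text, or None."""
    match = re.match(r"[a-zA-Z ]+", text)
    # re.match returns None when the pattern does not match at position 0,
    # so check the result before calling .group().
    if match is None:
        return None
    return match.group(0)

starts_with_letters = leading_letters("France won the FIFA World Cup 2018")
starts_with_digits = leading_letters("2018 was the year when France won the FIFA World Cup")

print(repr(starts_with_letters))
print(repr(starts_with_digits))
```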

In [9]:
sentence = "2018 was the year when France won the FIFA World Cup"
output = re.search(r"[a-zA-Z ]+", sentence)
print(output.group(0))

 was the year when France won the FIFA World Cup


## To solve the above problem, use the search() function, which looks for the regex pattern anywhere in the string, not only at the beginning. In the example above, the pattern skips the digits at the beginning of the string and returns the remaining text.
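
Note that `re.search()` returns only the first match it finds. If every match is needed, `re.findall()` is one option. The sketch below uses a made-up sentence containing two years to show the difference.

```python
import re

# Illustrative sentence with two numbers (not from the original examples).
sentence = "In 1998 and 2018 France won the FIFA World Cup"

# search() stops at the first match and returns a match object.
first = re.search(r"\d+", sentence)
first_text = first.group(0)
first_span = first.span()

# findall() returns a list with every non-overlapping match.
all_years = re.findall(r"\d+", sentence)

print(first_text, first_span)
print(all_years)
```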

### 4.3. Substituting Text in Strings

### To substitute text in a string, use the sub() function from the re module.

In [10]:
sentence = "France won the FIFA World Cup 2018"
output = re.sub(r"2018", "1998", sentence)
print(output)

France won the FIFA World Cup 1998


In [12]:
sentence = "France won the FIFA World Cup in 2018"
output = re.sub(r"[0-9]", "*", sentence)
print(output)

France won the FIFA World Cup in ****
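
The pattern `[0-9]` matches one digit at a time, which is why each digit above becomes its own asterisk. As a small sketch of two variations, adding `+` replaces a whole run of digits at once, and the `count` argument of `re.sub()` limits how many replacements are made.

```python
import re

sentence = "France won the FIFA World Cup in 2018"

# One replacement per digit.
per_digit = re.sub(r"[0-9]", "*", sentence)

# One replacement per run of digits.
per_run = re.sub(r"[0-9]+", "*", sentence)

# Replace only the first two matching digits.
first_two = re.sub(r"[0-9]", "*", sentence, count=2)

print(per_digit)
print(per_run)
print(first_two)
```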


### 4.4. Removing Digits and Alphabets from a String

The sub() function can also be used to remove digits, letters, or special characters from a string. All you have to
do is specify a regex pattern that matches the letters or digits you want to remove and replace them with an empty
string.

In [43]:
# find all digits and replace them with an empty string
# the regex pattern that matches a digit is \d

sentence = "France won the FIFA World Cup 2018"
output = re.sub(r"\d", "", sentence)
print(output)

France won the FIFA World Cup 
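
Removing the digits leaves the space that preceded them at the end of the string. Two possible ways to get rid of it are sketched below: calling `strip()` afterwards, or letting the pattern consume the whitespace in front of the digits. Which one fits depends on the text being cleaned.

```python
import re

sentence = "France won the FIFA World Cup 2018"

# Removing only the digits keeps the trailing space.
no_digits = re.sub(r"\d", "", sentence)

# Option 1: strip leading and trailing whitespace afterwards.
stripped = no_digits.strip()

# Option 2: remove each run of digits together with any whitespace before it.
with_space = re.sub(r"\s*\d+", "", sentence)

print(repr(no_digits))
print(repr(stripped))
print(repr(with_space))
```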


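### Sometimes, removing characters from a text leaves behind a single letter with no meaning. For instance, in the following example, there is a stray "a" at the end of the string. The pattern below replaces any single letter surrounded by whitespace with a single space.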

In [13]:
sentence = "France won the FIFA World Cup 2018 a "
result = re.sub(r"\s+[a-zA-Z]\s+", " ", sentence)
print(result)

France won the FIFA World Cup 2018 
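
The pattern `\s+[a-zA-Z]\s+` requires whitespace on both sides of the letter, so it misses a single letter at the very start or very end of a string. One alternative, sketched below with a made-up sentence, is to use word boundaries (`\b`) and then normalize the leftover spaces. Be aware that `\b[a-zA-Z]\b` also matches single letters next to punctuation, such as the "s" in "Nick's".

```python
import re

# Illustrative sentence with stray letters at both ends and no outer spaces.
sentence = "a France won the FIFA World Cup 2018 a"

# The whitespace-based pattern leaves both stray letters in place.
whitespace_based = re.sub(r"\s+[a-zA-Z]\s+", " ", sentence)

# Word boundaries match single letters regardless of position.
boundary_based = re.sub(r"\b[a-zA-Z]\b", "", sentence)

# Collapse the leftover spaces and trim the ends.
cleaned = " ".join(boundary_based.split())

print(repr(whitespace_based))
print(repr(cleaned))
```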


## The following script removes all the letters. Here, the regex pattern is [a-z], and the argument flags=re.I makes the match case-insensitive. Hence, both uppercase and lowercase letters are replaced by empty strings, while the spaces between the words are left in place.

In [14]:
sentence = "France won the FIFA World Cup 2018"
output = re.sub(r"[a-z]", "", sentence, flags=re.I)
print(output)

      2018


### 4.5. Removing Empty Spaces from Strings

## To remove whitespace from the text, substitute every run of whitespace characters with an empty string.

In [15]:
sentence = "      2018"
output = re.sub(r"\s+", "", sentence)
print(output)

2018
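
Replacing whitespace with an empty string also removes the spaces between words, which is rarely what you want for a full sentence. A minimal sketch of the more common cleanup, collapsing each run of whitespace into a single space, is shown below; the sample sentence with irregular spacing is made up for illustration.

```python
import re

# Illustrative sentence with irregular spacing.
sentence = "France   won  the FIFA   World Cup  2018  "

# Removing all whitespace glues the words together.
removed_all = re.sub(r"\s+", "", sentence)

# Collapsing runs of whitespace keeps the words separated.
collapsed = re.sub(r"\s+", " ", sentence).strip()

# The same result using only string methods.
collapsed_no_regex = " ".join(sentence.split())

print(removed_all)
print(collapsed)
print(collapsed_no_regex)
```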


### 4.6. Removing Special Characters from strings

In [53]:
sentence = "Fr@nce won // the + - & * FIFA World Cup 2018"
output = re.sub(r"[^\w ]", "", sentence)
print(output)

Frnce won  the     FIFA World Cup 2018


In [16]:
sentence = "Fr@nce won // the + - & * FIFA World Cup 2018"
output = re.sub(r"[^a-zA-Z0-9 ]", "", sentence)
print(output)

Frnce won  the     FIFA World Cup 2018
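
The two patterns above give the same result for this sentence, but they are not equivalent in general. For `str` patterns in Python 3, `\w` also matches the underscore and non-ASCII letters, whereas an explicit class such as `[a-zA-Z0-9]` matches ASCII letters and digits only. The sketch below illustrates the difference on a made-up string, written with Unicode escapes so the characters are unambiguous.

```python
import re

# "Café_au_lait costs 3€" written with escapes for the non-ASCII characters.
sentence = "Caf\u00e9_au_lait costs 3\u20ac"

# \w keeps the accented letter and the underscores; the euro sign is removed.
unicode_aware = re.sub(r"[^\w ]", "", sentence)

# The explicit ASCII class removes the accented letter and underscores too.
ascii_only = re.sub(r"[^a-zA-Z0-9 ]", "", sentence)

print(unicode_aware)
print(ascii_only)
```

Which behavior is preferable depends on whether the text contains non-English words that should be preserved.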


### 4.7. Miscellaneous String Functions

#### Finding String Length

In [17]:
sentence = "France won the FIFA World Cup 2018"
print(len(sentence))

34


#### Splitting a String

In [18]:
sentence = "France won the FIFA World Cup 2018"

print(sentence.split())

['France', 'won', 'the', 'FIFA', 'World', 'Cup', '2018']


#### Joining Strings

In [19]:
sentence1 = "France won the "
sentence2 = "FIFA World Cup 2018"
output = sentence1 + sentence2
print(output)

France won the FIFA World Cup 2018
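
The `+` operator works well for two strings. For a list of words, such as the output of `split()`, the `str.join()` method is a common alternative. The following short sketch shows it reversing a split.

```python
sentence = "France won the FIFA World Cup 2018"

# Split the sentence into a list of words.
words = sentence.split()

# Join the words back together, separated by a single space.
rejoined = " ".join(words)

# Any separator string can be used.
hyphenated = "-".join(words)

print(rejoined)
print(hyphenated)
```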


#### Finding Start and End of a String

In [20]:
sentence = "France won the FIFA World Cup 2018"
print(sentence.startswith("France"))

True


In [21]:
sentence = "France won the FIFA World Cup 2018"
print(sentence.endswith("France"))

False


#### Changing String Case

In [22]:
sentence = "France won the FIFA World Cup 2018"
print(sentence.lower())

france won the fifa world cup 2018


In [23]:
sentence = "France won the FIFA World Cup 2018"
print(sentence.upper())

FRANCE WON THE FIFA WORLD CUP 2018


#### Finding Substring in a String

In [24]:
sentence = "France won the FIFA World Cup 2018"
print("France" in sentence)

True


In [25]:
sentence = "France won the FIFA World Cup 2018"
print("England" in sentence)

False
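
The `in` operator is case-sensitive and only reports whether the substring is present. As a small sketch, lowercasing both sides gives a case-insensitive check, and `str.find()` returns the index of the first occurrence, or -1 when the substring is absent.

```python
sentence = "France won the FIFA World Cup 2018"

# The "in" check is case-sensitive.
exact_case = "france" in sentence

# Lowercase both sides for a case-insensitive check.
ignore_case = "france" in sentence.lower()

# find() returns the starting index of the substring, or -1 if not found.
fifa_index = sentence.find("FIFA")
england_index = sentence.find("England")

print(exact_case, ignore_case)
print(fifa_index, england_index)
```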


### 4.8. Further Reading

[Official Python String Functions](https://docs.python.org/2.5/lib/string-methods.html)

[Official Regex Functions Documentation](https://docs.python.org/3/library/re.html)

## In Class Exercise

Consider the following sentence:
    
sentence = "Nick's car was sold for $ 1500"

Perform the following tasks on the above sentence:

1. Replace special characters with spaces
2. Replace multiple consecutive spaces with a single space
3. Remove any single characters
4. Convert the text to all lower case
5. Split the text into individual words

The final output:

['nick', 'car', 'was', 'sold', 'for', '1500']