 <div>
<img src="https://edlitera-images.s3.amazonaws.com/new_edlitera_logo.png" width="500"/>
</div>

# Basics of working with text in Python 


* typically we use the **`print()`** function in Python to display some text 

* the **`print()`** function is used for many different things:
    <br>
    
    * displaying results of some operation
    * searching for bugs in code
    * displaying the progress of some operation
    * etc.

## Escape characters

* help format the output of the **`print()`** function

* very useful when you want to create your own text files from some text data

### List of important escape characters

* **`\n`** - newline
* **`\t`** - horizontal tab
* **`\v`** - vertical tab
* **`\r`** - carriage return (shift to the beginning of line and replace)
* **`\b`** - backspace
* **`\f`** - form feed (delimiter for a page break)
* **`\\`** - backslash
* **`\N{character name}`** - display character using character name

### Examples

In [1]:
# Newline example

print("Hello \nWorld")

Hello 
World


In [2]:
# Horizontal tab example

print("Hello \tWorld")

Hello 	World


In [3]:
# Vertical tab example

print("Hello\vWorld")

Hello
World


In [4]:
# Carriage return example

print("abcdefgh\r987654")

abcdefgh
987654


In [5]:
# Backspace example

print("Hello$\b World")

Hello$ World


In [6]:
# Form feed example

print("Hello World\f")

Hello World



In [7]:
# Backslash example

print("Backslash example \\")

Backslash example \


In [8]:
# Character name example

print("\N{GREEK CAPITAL LETTER GAMMA}")

Γ
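
Some escape characters are hard to see in printed output. As a small sketch (the variable names are made up for illustration), the built-in `repr()` and `len()` functions can show what a string really contains:

```python
# Illustrative strings; the names are made up for this sketch
tabbed = "Hello\tWorld"
escaped_backslash = "C:\\temp"

# repr() shows the escape sequences instead of interpreting them
print(repr(tabbed))
print(repr(escaped_backslash))

# Each escape sequence counts as a single character
tab_length = len("\t")
backslash_length = len("\\")
print(tab_length, backslash_length)
```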


## Raw strings

* using a **backslash( `\` )** we can escape a character to make sure we print it as is
    <br>
    
    * that way we can print characters that typically hold special meaning in Python as if they were just a part of a standard string

* another (and perhaps more elegant) way of printing strings as they are, without interpreting special characters, is using so-called **raw strings**

* to use a raw string, we just put the lowercase letter **`r`** before a string, which signals to Python that we want to treat everything inside the string, even special characters, as standard characters
    <br>
    
    * what happens is Python treats each character as its own separate literal character
    * that means that something like \n is treated as if we had a separate \ character followed by a n character


### Examples:

In [9]:
# Carriage return example
# Raw string

print(r"abcdefgh\r987654")

abcdefgh\r987654


In [10]:
# Newline example
# Raw string

print(r"Hello \nWorld")

Hello \nWorld
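
One way to check that a raw string really keeps the backslash as its own character is to compare lengths. A minimal sketch with illustrative variable names:

```python
regular = "\n"   # one character: a newline
raw = r"\n"      # two characters: a backslash followed by the letter n

print(len(regular))
print(len(raw))

# A raw string is equal to the same text written with an escaped backslash
same_text = raw == "\\n"
print(same_text)
```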


## String Formatting

* there are several ways to do it

* the two most popular ways are:
    <br>
    
    * using **`.format()`**
    * using **`f strings`**

### Using `.format()`


In [11]:
fav_num_1 = 22454325.879857349

In [12]:
fav_num_2 = 224545

In [13]:
'My favorite number is {:,.2f}. My other one is {:,d}.'.format(fav_num_1, fav_num_2)

'My favorite number is 22,454,325.88. My other one is 224,545.'

### Using `f strings` (my favorite)

* also the newest

In [14]:
num = 11

In [15]:
print(f'November is the {num}th month of the year')

November is the 11th month of the year


What if you want to format a large number?

In [16]:
num = 13936987629872

In [17]:
x = f'The size of the file is {num:,d} bits.'
x

'The size of the file is 13,936,987,629,872 bits.'

What's up with this `{num:,d}`?

* `{}` tells Python we're about to specify a format for a variable
* `num` is the name of the variable we're formatting
* `d` tells Python to format the value as a decimal integer
* `,` tells Python we want it to add the commas to make the number more legible

What if the number has decimals?

In [18]:
fav_num = 22454325.879857349

In [19]:
fav_num_str = f'My favorite number is {fav_num:,.4f}.'

In [20]:
print(fav_num_str)

My favorite number is 22,454,325.8799.


What is this `{fav_num:,.4f}` sorcery?

* `{}` tells Python we're about to specify a format for a variable
* `fav_num` is the name of the variable we're formatting
* `f` tells Python to format the value as a float in fixed-point notation
* `.4` tells Python we want 4 decimals of precision
* `,` again tells Python we want it to add the commas to make the number more legible

**CAUTION:** Referring to a variable by name inside `{}` only works in `f` strings. Without the `f` prefix, the placeholder is printed as is:

In [21]:
print('My favorite number is {fav_num:,.2f}.')

My favorite number is {fav_num:,.2f}.
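
If the template has to stay a plain string, one option is to pass the value to `.format()` by keyword. A minimal sketch that reuses the value of `fav_num` from above (the names `with_format` and `with_f_string` are illustrative):

```python
fav_num = 22454325.879857349

# Without the f prefix, the value has to be handed to .format() explicitly
with_format = 'My favorite number is {fav_num:,.2f}.'.format(fav_num=fav_num)

# The f-string version looks up fav_num by itself
with_f_string = f'My favorite number is {fav_num:,.2f}.'

print(with_format)
print(with_format == with_f_string)
```

Both versions use the same format specification after the colon, so they produce the same text.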


**NOTE:** Variables can be anything that can be converted to string:

In [22]:
name = 'Ciprian'

In [23]:
print(f'My name is {name}. My favorite number is {fav_num}.')

My name is Ciprian. My favorite number is 22454325.87985735.


# Introduction to `Regular Expressions`

* often shortened to just **`RegEx`**

* sequences of characters that form various search patterns

* we can use them to check whether a string contains a specific pattern

* can also be used to extract a part of a string or even replace it

* very useful for reading and writing text data

* **not unique to Python**
    <br>
    
    * the concept of using regular expressions to search for various patterns in text data exists in other programming languages such as **Java** or **C++**

## Advantages of using `RegEx`

* use cases range from basic (searching for certain letters in text) to advanced (preparing text data for Machine Learning models)

* more flexible than standard Python string operations, which can only look for fixed text and not for patterns, such as 
    <br>
    
    * using the keyword **in** to search for the presence of some substring
    * using **`find()`**  to find a certain part of a string
    * using **`replace()`**  to replace a part of a string

* one of the most useful libraries to learn if you plan to work in the field of **NLP (Natural Language Processing)**
    

## Disadvantages of using `RegEx`

* takes some time to get really comfortable with them

* they can lead to very subtle bugs

# Introduction to `RegEx` in Python

* available through the built-in module **`re`**
    <br>
    
    * more functionality available in the third party **`regex`** module

In [24]:
# Import the built-in Python module
# For using Regular Expressions

import re

* the module provides us with a set of functions we can use to manipulate strings

* the syntax of a `RegEx` query consists of two parts:
    <br>
    
    * **functions**
    * **search patterns**

## `RegEx` functions

* the most useful functions are:
    <br>
    
    * **`findall()`** 
    * **`search()`** 
    * **`split()`** 
    * **`sub()`** 

* we will use these functions in combination with search patterns to form `RegEx` queries

## Search patterns

* a search pattern consists of three parts:
    <br>
    
    * **metacharacters**
    * **special sequences**
    * **sets**

* using various combinations of these we can build a search pattern that we input into a **`RegEx`** function to achieve some result

# RegEx functions

* each of the four previously mentioned functions works on the same principle

* they search a string for the pattern we pass in as an argument, and follow that up with a particular operation:
    <br>
    
    * **`findall()`** - returns a list containing all matches
    * **`search()`** - returns a match if there is one 
    * **`split()`** - returns a list that consists of a string split at each match
    * **`sub()`** - replaces one or multiple matches with a new string

## Examples:

* let's take a look at a few simple examples, just to demonstrate how these functions work



* in practice we will use more complex queries that include metacharacters, special sequences and sets
    <br>
    
    * that will be demonstrated after we explain how metacharacters, special sequences and sets work


In [25]:
# Import what we need

import re

### `findall()` example

In [26]:
# Define example text data 

text_data = """In the advanced Python course we cover many different topics. 
We cover topics such as concurrency, asynchronous programming and web scrapping, but also topics such as modules, packages
iterators and generators."""

In [27]:
# Find how many times
# the word "topics" appears in our text

list_of_occurrences = re.findall("topics", text_data)


In [28]:
# Display result 

print(f"The word 'topics' appears {len(list_of_occurrences)} times in our example text data.")

The word 'topics' appears 3 times in our example text data.


### `search()` example

In [29]:
# Define example text data 

text_data = """In the advanced Python course we cover many different topics. 
We cover topics such as concurrency, asynchronous programming and web scrapping, but also topics such as modules, packages
iterators and generators."""

In [30]:
# Search for the first occurrence
# of the word "cover" in our text

match = re.search("cover", text_data)


In [31]:
# display result

print(match)

<re.Match object; span=(33, 38), match='cover'>


* **`search()`** returns a so-called **`Match object`** (or `None` if nothing matches)

* **`Match objects`** have certain properties and methods that allow us to access various information about the "match" that was found
    <br>
    
    * when we search for a part of a string, if that part of a string is found that is considered a "match"
    * even when a part of a string appears multiple times in our string, **`search()`** will create a **`Match object`** that corresponds to only the first occurrence 

* important property of the **`Match object`:**
    <br>
    
    * **`.string`** - returns the string passed to the **`search()`** function

* important methods of the **`Match object`**:
    <br>
    
    * **`span()`** - returns a tuple that contains the start and end position of the match inside the original string
    * **`group()`** - returns the part of the string where the match was

In [32]:
# Demonstrate string property

print(f"The string that contains the term we are searching for: \n\n{match.string}")

The string that contains the term we are searching for: 

In the advanced Python course we cover many different topics. 
We cover topics such as concurrency, asynchronous programming and web scrapping, but also topics such as modules, packages
iterators and generators.


In [33]:
# Demonstrate span() 

print(f"The term we are searching for starts at index {match.span()[0]}, and ends at index {match.span()[1]}")

print(f"The term we are searching for is: {text_data[match.span()[0]:match.span()[1]]}")

The term we are searching for starts at index 33, and ends at index 38
The term we are searching for is: cover


In [34]:
# Demonstrate group()

print(f"The part of the string we were searching for is: {match.group()}")

The part of the string we were searching for is: cover
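
When the pattern is not found, `search()` returns `None` instead of a `Match object`, so calling `group()` on the result would fail. A short sketch of one way to guard against that; it also uses the `start()` and `end()` methods, which return the same positions as `span()` (the variable names are illustrative):

```python
import re

text_data = "In the advanced Python course we cover many different topics."

# "regex" does not appear in the text, so search() returns None
missing = re.search("regex", text_data)

if missing is None:
    print("No match found")
else:
    print(missing.group())

# start() and end() return the same positions as span()
found = re.search("cover", text_data)
positions = (found.start(), found.end())
print(positions)
```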


### `split()` example

In [35]:
# Define example text data 

text_data = """In the advanced Python course we cover many different topics. 
We cover topics such as concurrency, asynchronous programming and web scrapping, but also topics such as modules, packages
iterators and generators."""

In [36]:
# Split on newline 

result_list = re.split("\n", text_data)


In [37]:
# Display result

print(f"When we split our string on a newline, we get the following list: \n\n{result_list}")

When we split our string on a newline, we get the following list: 

['In the advanced Python course we cover many different topics. ', 'We cover topics such as concurrency, asynchronous programming and web scrapping, but also topics such as modules, packages', 'iterators and generators.']


In [38]:
# Split on word example

result_list = re.split("cover", text_data)

In [39]:
# Display result

print(f"When we split our string on the substring 'cover', we get the following list: \n\n{result_list}")

When we split our string on the substring 'cover', we get the following list: 

['In the advanced Python course we ', ' many different topics. \nWe ', ' topics such as concurrency, asynchronous programming and web scrapping, but also topics such as modules, packages\niterators and generators.']


### `sub()` example

In [40]:
# Define example text data 

text_data = """In the advanced Python course we cover many different topics. 
We cover topics such as concurrency, asynchronous programming and web scrapping, but also topics such as modules, packages
iterators and generators."""

In [41]:
# Substitute the word "topics"
# with the word "concepts"

new_text_data = re.sub("topics", "concepts", text_data)

In [42]:
# Display result

print(f"The new version of our sentence is: \n\n{new_text_data}")

The new version of our sentence is: 

In the advanced Python course we cover many different concepts. 
We cover concepts such as concurrency, asynchronous programming and web scrapping, but also concepts such as modules, packages
iterators and generators.
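
Both `sub()` and `split()` accept an optional limit on how many matches they act on. A minimal sketch with made-up example strings, using the `count` and `maxsplit` keyword arguments of the `re` module:

```python
import re

text_data = "topics, topics and more topics"

# count=1 replaces only the first match
first_only = re.sub("topics", "concepts", text_data, count=1)
print(first_only)

# maxsplit=1 splits only at the first match
two_parts = re.split(", ", "a, b, c", maxsplit=1)
print(two_parts)
```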


# Metacharacters, special sequences and sets

* the second part of **`RegEx`** queries

* in combination with the previously mentioned functions, they allow us to easily (and quickly) process text

* the number of different combinations you can create is limitless
    <br>
    
    * in practice, since you will probably be working on similar problems in similar fields, you will create a "database of formulas" that you can reuse whenever you need to finish a certain task
    * that being said, from time to time you will run into a situation where you need to do something new, so in that case it is important to understand how **`RegEx`** works (and not just blindly copy premade formulas)

* we already covered functions, now it is time to cover:
    <br>
    
    * **metacharacters**
    * **special sequences**
    * **sets**

## Metacharacters

* characters that carry a special meaning in the syntax of regular expressions during pattern processing

* we use them to define search criteria and other manipulations

* the most important metacharacters are:
    <br>
    
    * **`[]`** - used to form a set of characters
    * **`\`** - signals a special sequence (for example `\d` for digits) or escapes a special character
    * **`.`** - any character except a newline
    * **`^`** - starts with
    * **`$`** - ends with
    * **`*`** - zero or more occurrences of a character
    * **`+`** - one or more occurrences of a character
    * **`?`** - zero or one occurrence of a character
    * **`{}`** - specified number of occurrences
    * **`|`** - either or
    * **`()`** - capture or group
    
   

### Examples:

In [1]:
# Import what we need

import re

In [18]:
# Define example text data


text_data = "Sensors XGB-100 and XGY-107 recorded value above the threshold."

In [19]:
# Example 1


result = re.findall("[a-o]", text_data)

result

['e',
 'n',
 'o',
 'a',
 'n',
 'd',
 'e',
 'c',
 'o',
 'd',
 'e',
 'd',
 'a',
 'l',
 'e',
 'a',
 'b',
 'o',
 'e',
 'h',
 'e',
 'h',
 'e',
 'h',
 'o',
 'l',
 'd']

In [20]:
# Example 2


result = re.sub(r"\d", "-", text_data)

result

'Sensors XGB---- and XGY---- recorded value above the threshold.'

In [21]:
# Example 3


result = re.findall("^se", text_data.lower())

result

['se']

In [25]:
# Example 4


result = re.findall("XG.{5}", text_data)

result

['XGB-100', 'XGY-107']

In [26]:
# Example 5


result = re.sub("XGB|XGY", "ASY", text_data)



result

'Sensors ASY-100 and ASY-107 recorded value above the threshold.'
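
The examples above do not cover `+`, `*`, `?`, `$` and `()`. The following sketch shows one possible use of each on the same kind of text; the patterns and variable names are illustrative, not the only way to write them:

```python
import re

text_data = "Sensors XGB-100 and XGY-107 recorded value above the threshold."

# + : one or more digits in a row
digit_runs = re.findall(r"\d+", text_data)

# * : a 1 followed by zero or more 0s
ones_and_zeros = re.findall(r"10*", "1 10 100")

# ? : the trailing s is optional
optional_s = re.findall(r"Sensors?", "Sensor and Sensors")

# $ : the text must end with "threshold." (the dot is escaped to match literally)
ends_with = re.findall(r"threshold\.$", text_data)

# () : each pair of parentheses captures one part of the match
sensor_parts = re.findall(r"(XG\w)-(\d+)", text_data)

print(digit_runs, ones_and_zeros, optional_s, ends_with, sensor_parts)
```

When a pattern contains more than one group, `findall()` returns a list of tuples, one tuple per match.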

## Special sequences

* formed by combining a backslash ( \ ) character with a character that holds special meaning

* depending on which character we attach to the backslash, we will get different results

* the most commonly used combinations are:
    <br>
    
    * **`\A`** - matches only at the beginning of the string
    * **`\b`** - matches at a word boundary (the beginning or end of a word)
    * **`\B`** - matches at a position that is NOT a word boundary
    * **`\d`** - matches any digit 
    * **`\D`** - matches any character that is NOT a digit
    * **`\s`** - matches any whitespace character
    * **`\S`** - matches any character that is NOT whitespace
    * **`\w`** - matches any word character (letters a to z and A to Z, digits 0-9, and underscore _)
    * **`\W`** - matches any character that is NOT a word character
    * **`\Z`** - matches only at the end of the string 

### Examples:

In [34]:

example_text = "Sensors XGB-100 and XGY-107 recorded value above the threshold #?!%$."

In [35]:
# Example 1

result = re.findall(r"\d", example_text)

result

['1', '0', '0', '1', '0', '7']

In [36]:
# Example 2

result = re.findall(r"threshold\Z", example_text)

result

[]

In [37]:
# Example 3

result = re.findall(r"\W", example_text)

result

[' ',
 '-',
 ' ',
 ' ',
 '-',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '#',
 '?',
 '!',
 '%',
 '$',
 '.']
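
The sequences `\b`, `\A` and `\s` are not used in the examples above. A small sketch of how they might be applied to the same example text (the patterns and variable names are illustrative):

```python
import re

example_text = "Sensors XGB-100 and XGY-107 recorded value above the threshold #?!%$."

# Without \b, "re" is also found inside the word "threshold"
anywhere = re.findall(r"re\w*", example_text)

# With \b, "re" has to be at the beginning of a word
word_start = re.findall(r"\bre\w*", example_text)

# \A anchors the pattern to the beginning of the string
at_start = re.findall(r"\ASensors", example_text)

# \s matches spaces, tabs and newlines alike
pieces = re.split(r"\s", "a b\tc")

print(anywhere, word_start, at_start, pieces)
```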

## Sets

* formed by placing characters inside a pair of square brackets

* what we place inside the brackets is the most important part

* the most commonly used sets are:
    <br>
    
    * **`[bwcd]`** - matches any one of the characters inside the square brackets
    * **`[a-e]`** - matches any lowercase character alphabetically between the first and second letter inside the square brackets (inclusive)
    * **`[^asd]`** - matches any character except the ones in the square brackets
    * **`[0123]`** - matches any one of the digits inside the square brackets
    * **`[0-9]`** - matches any digit between the first mentioned digit and second mentioned digit inside the square brackets (inclusive)
    * **`[0-7][0-1]`** - matches any two-digit sequence whose first digit is between 0 and 7 and whose second digit is 0 or 1 (e.g. 00 and 71)
    * **`[a-zA-Z]`** - matches any letter between a and z, lower case or upper case
    * **`[+]`** - no special meaning in a set, treated as a + character (returns a match for any + in a string, and this also applies to similar characters such as *, ., |, (), $ and {} )

### Examples:

In [38]:
# Example 1

example_text = "The concepts of digits(0123456789) is actually relatively easy for little kids to grasp."

result = re.findall("[^cde.]", example_text[0:15])

result

['T', 'h', ' ', 'o', 'n', 'p', 't', 's', ' ', 'o', 'f']

In [39]:
# Example 2

example_text = "The concepts of digits(0123456789) is actually relatively easy for little kids to grasp."

result = re.findall("[a-g]", example_text)

result

['e',
 'c',
 'c',
 'e',
 'f',
 'd',
 'g',
 'a',
 'c',
 'a',
 'e',
 'a',
 'e',
 'e',
 'a',
 'f',
 'e',
 'd',
 'g',
 'a']

In [57]:
# Example 3

example_text = "The concepts of digits(0123456789) is actually relatively easy for little kids to grasp."

result = re.findall("[^a-zA-Z]", example_text)

result

[' ',
 ' ',
 ' ',
 '(',
 '0',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 ')',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 ' ',
 '.']

In [40]:
# Example 4

example_text = "The winning lottery combination is: 23, 17, 1, 5, 10, 32."

result = re.findall("[0-3][0-9]", example_text)

result

['23', '17', '10', '32']
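
To illustrate that metacharacters lose their special meaning inside a set, here is a short sketch on a made-up arithmetic expression. Outside a set, the same characters have to be escaped with a backslash:

```python
import re

expression = "3+4*2"

# Inside a set, + and * are treated as ordinary characters
operators = re.findall(r"[+*]", expression)

# Outside a set, + must be escaped to be matched literally
plus_signs = re.findall(r"\+", expression)

print(operators)
print(plus_signs)
```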

#  `RegEx` in practice

* given a string, extract and print all words that contain **`ea`** in the sentence

        "eat say ease sparrow please"

* we will use the following pattern : 

     **`([^\s]*ea[^\s]*)`**

## The pattern explained:

- `(`...`)` - this represents a group of characters that match the pattern
  - use this when you want to keep track of multiple patterns / occurrences
  - `(` represents the beginning of the group
  - `)` represents the end of the group

- `[^\s]` - this will match *any* character other than whitespace (`\s`)
  - `[`, `]` are used to describe sets of characters
  - `^` signals an exclusion (except)
  - `\s` is the special sequence for whitespace (spaces, tabs, newlines)

- `ea` - we want each match to contain this letter group

* `*` indicates that the character or group of characters immediately to the left should be matched 0 or more times

* **matching is greedy! regular expressions will match as much as they can**
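
A quick sketch of what greedy matching means in practice, on a made-up HTML snippet. Adding `?` after a quantifier (a feature of the `re` module that is not otherwise covered here) makes it match as little as possible instead:

```python
import re

html = "<b>bold</b>"

# Greedy: .* grabs as much as it can, up to the last >
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first > it can
lazy = re.findall(r"<.*?>", html)

print(greedy)
print(lazy)
```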

## Demonstration

In [41]:
# Import what we need 

import re

In [42]:
# Define example text data

text_data = "eat say ease sparrow please"

In [43]:
# Get words that contain "ea"

words = re.findall(r"[^\s]*ea[^\s]*", text_data)

In [44]:
# Display result

words

['eat', 'ease', 'please']

# More examples

### Extracting numbers

* in practice you want to avoid typing in a **`RegEx`** pattern multiple times

* to make sure you don't repeat your code, you can use the **`compile()`** function from **`re`**

* **`compile()`** allows us to create a pattern we can then utilize to search for a match inside different strings

In [63]:
# Import what we need

import re

In [45]:
# Define example text data

text_data = ["My phone number is 5551234567", "You can call me at 5557214455", "Contact number: 5557615454"]

In [46]:
# Compile the RegEx
# By defining a search pattern

regex = re.compile(r'[0123456789\-]+')

In [47]:
# Extract phone numbers

numbers = regex.findall(" ".join(text_data))

In [48]:
# Display result 

numbers

['5551234567', '5557214455', '5557615454']

* if you want to, you can also get the result in the form of an iterator using the **`finditer()`** function

In [49]:
# Create an iterator

numbers = regex.finditer(" ".join(text_data))

In [50]:
numbers

<callable_iterator at 0x7fb508711d60>

In [51]:
# Display numbers

for num in numbers:
    print(num.group())

5551234567
5557214455
5557615454
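
A compiled pattern can also be applied to each string separately instead of joining the strings first. A minimal sketch, assuming the same example data and a simplified digits-only pattern:

```python
import re

text_data = ["My phone number is 5551234567", "You can call me at 5557214455", "Contact number: 5557615454"]

# Compile once, reuse for every string
regex = re.compile(r"[0-9]+")

# search() finds the first run of digits in each string
for text in text_data:
    match = regex.search(text)
    if match is not None:
        print(match.group())

# findall() keeps the results grouped by source string
numbers_per_string = [regex.findall(text) for text in text_data]
print(numbers_per_string)
```

Keeping the strings separate makes it possible to tell which number came from which string.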


### Cleaning data

* you will occasionally run into text data that has not been cleaned and contains weird characters or sequences of characters

* in those cases, the **`sub()`** function can be very useful to clean the data 
    <br>
    
    * after we clean the data we can push it further down a data pipeline, we can feed it to an ML model, etc.

In [52]:
# Import what we need

import re

In [53]:
# Define example corrupted text data

corrupted_data = "In this s3nt#nc!, the l5tt9r 3 was r0plac8d with random digits and sp$cial charact%rs."

In [54]:
# Clean data

clean_data = re.sub(r"[35908#!$%]", "e", corrupted_data)

In [55]:
# Display result

clean_data

'In this sentence, the letter e was replaced with random digits and special characters.'

### Find very specific patterns in data

* when working with large amounts of text data sometimes you need to find when a particular pattern appears in that data for the first time

* for example, you might want to extract a particular IP address

In [56]:
# Import what we need

import re

In [57]:
# Define example text data

log = "127.0.0.1 AZW-183448-BWXU INFO -- : hi there"

In [58]:
# Define search pattern

regex = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

In [59]:
# Extract IP address


ip_address = regex.search(log).group()

In [60]:
# Display IP address

ip_address

'127.0.0.1'
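
The pattern above accepts any group of one to three digits, so it would also match something like `999.0.0.1`, and calling `.group()` fails when there is no match at all. The sketch below shows one possible way to handle both cases with the standard library `ipaddress` module; `extract_ipv4` is a hypothetical helper written for this illustration:

```python
import ipaddress
import re

regex = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")


def extract_ipv4(line):
    """Hypothetical helper: return the first IPv4-looking match if it is a valid address, else None."""
    match = regex.search(line)
    if match is None:
        return None
    candidate = match.group()
    try:
        # Raises a ValueError subclass for out-of-range parts such as 999
        ipaddress.IPv4Address(candidate)
    except ValueError:
        return None
    return candidate


print(extract_ipv4("127.0.0.1 AZW-183448-BWXU INFO -- : hi there"))
print(extract_ipv4("999.0.0.1 is not a real address"))
print(extract_ipv4("no address here"))
```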