In [1]:
import re

In [2]:
string = "we have this file1.xml, file2.xml, file3.xml"


In [3]:
pattern = re.compile(r'(\w+)\d')

In [4]:
re.findall(pattern, string)

['file', 'file', 'file']

In [5]:
pattern2 = re.compile(r'(\w+)(\d)')

In [6]:
re.findall(pattern2, string)

[('file', '1'), ('file', '2'), ('file', '3')]

In [7]:
pattern3 = re.compile(r'(\w+)(\d)\.(\w+)')

In [8]:
re.findall(pattern3, string)

[('file', '1', 'xml'), ('file', '2', 'xml'), ('file', '3', 'xml')]

In [9]:
pattern4 = re.compile(r'(\w+)\d.(\w+)')

In [10]:
re.findall(pattern4, string)

[('file', 'xml'), ('file', 'xml'), ('file', 'xml')]

# Chapter 1 Introduction to Regular Expressions

When we search a text such as ```string = "we have this file1.xml, file2.xml, file3.xml"``` with a pattern, that pattern can contain two kinds of components: **literals** (such as ```file``` and ```.xml```) and **metacharacters** (such as ```?``` or ```*```).

A regular expression is a pattern of text that consists of ordinary characters (for example, letters a through z or digits 0 through 9) and special characters known as metacharacters. This pattern describes the strings that would match when applied to a text.

Let's see our very first regular expression, which will match any word starting with ```a``` that is preceded by a whitespace (shown as ```-->```):

This regex uses both literals and metacharacters and is usually written as ```/-->a\w*/```. In it, ```-->``` (the whitespace) is a literal, ```a``` is a literal, ```\w``` is a metacharacter, and ```*``` is also a metacharacter.
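As a minimal sketch of this first expression in Python, assuming that ```-->``` stands for one literal space; the sample sentence and the variable names are illustrative and not taken from the book:

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "an apple and a banana are available"

# A literal space, a literal 'a', then zero or more word characters.
first_regex = re.compile(r' a\w*')

matches = first_regex.findall(sample)
print(matches)
```

Note that the leading ```an``` is not reported, because no whitespace precedes it.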

**Representation of regular expressions in this book**

In this book, a regex is represented *bounded by the ```/``` symbol*. This is the QED demarcation that is followed in most textbooks. The code examples, however, won't use this notation.

On the other hand, even with monospaced font faces, the *whitespaces* of a regular expression are difficult to count. To simplify reading, every single whitespace in the figures will appear as ```-->```.

# Literal

Literals are the simplest form of pattern matching in regular expressions. They will simply succeed whenever that literal is found.

If we apply the regular expression ```'fox'``` to search the phrase ```"The quick brown fox jumps over the lazy dog"```, we will find one match:

In [11]:
sentence = "The quick brown fox jumps over the lazy dog"

re.findall('fox', sentence)

['fox']

However, we can also obtain several results instead of just one, for example if we apply the regular expression ```'be'``` to the phrase ```'To be, or not to be'```:

In [12]:
sentence2 = 'To be, or not to be'
re.findall(r'be', sentence2)

['be', 'be']

We have just learned in the previous section that *metacharacters* can coexist with *literals* in the same expression. Because of this coexistence, we can find that some expressions do not mean what we intended.

For example, if we apply the expression
```/(this is inside)/``` to search the text ```this is outside (this is inside)```,  
we will find that the parentheses are not included in the result. This happens  
because parentheses are metacharacters and they have a special meaning.

In [13]:
sentence3 = "this is outside (this is inside)"
re.findall('(this is inside)', sentence3)

# incorrectly unescaped metacharacters

['this is inside']

***We can use metacharacters as if they were literals***. There are three mechanisms to do so:
-  Escape the metacharacters by preceding them with a backslash.
-  In Python, use the ```re.escape``` function to escape the special characters that may appear in the expression. We will cover this in Chapter 2, Regular Expressions with Python.
-  Quote with ```\Q``` and ```\E```: in the flavors that support them, it's as simple as enclosing the parts that have to be quoted between ```\Q``` (which starts a quote) and ```\E``` (which ends it).

**However, this is not supported in Python at the moment.**
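The second mechanism can be sketched as follows. This is only an illustration that reuses the sample text from this section; the variable names are ours.

```python
import re

text = "this is outside (this is inside)"
literal = "(this is inside)"

# re.escape puts a backslash in front of the characters that would
# otherwise be read as metacharacters, such as the parentheses.
escaped = re.escape(literal)

found = re.findall(escaped, text)
print(escaped)
print(found)
```

Building the pattern this way is handy when the literal text comes from user input and its special characters are not known in advance.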

Using the backslash method, we can convert the previous expression to ```/\(this is inside\)/``` and apply it again to the same text to have the parentheses included in the result:

In [14]:
sentence3 = "this is outside (this is inside)"
re.findall(r'\(this is inside\)', sentence3)

# escaped metacharacters in regex

['(this is inside)']

In regular expressions, there are twelve metacharacters that should be escaped if they are to be used with their literal meaning:
- Backslash ```\```
- Caret ```^```
- Dollar sign ```$```
- Dot ```.```
- Pipe symbol ```|```
- Question mark ```?```
- Asterisk ```*```
- Plus sign ```+```
- Opening parenthesis ```(```
- Closing parenthesis ```)```
- Opening square bracket ```[```
- Opening curly brace ```{```

In some cases, the regular expression engines will do their best to understand if they 
should have a literal meaning even if they are not escaped; for example, the opening 
curly brace { will only be treated as a metacharacter if it's followed by a number to 
indicate a repetition, as we will learn later in this chapter.
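The curly brace behavior described above can be checked with a short sketch. The sample strings are made up for illustration, and the results reflect how Python's ```re``` module behaves; other engines may differ.

```python
import re

# '{' not followed by a repetition count is taken literally.
literal_brace = re.findall(r'a{b', 'xa{by')

# 'a{2}' is a quantifier: exactly two 'a' characters.
quantified = re.findall(r'a{2}', 'caab')

# Escaping the braces makes the literal intent explicit.
escaped_brace = re.findall(r'a\{2\}', 'a{2}')

print(literal_brace, quantified, escaped_brace)
```

Escaping the brace explicitly is usually the safer choice, since it does not depend on how a particular engine guesses the intent.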

# Character classes or character sets

We are going to use a metacharacter for the first time to learn how to leverage the character classes. The character classes (also known as character sets) allow us to define a character that will match if any of the defined characters in the set is present.

To define a character class, we should use the opening square bracket metacharacter ```[```, then any accepted characters, and finally close with a closing square bracket ```]```. For instance, let's define a regular expression that can match the word "license" in British and American English written form:

In [15]:
# searching using a character class or character set
sentence4 = "licence and license are valid"
re.findall(r'licen[cs]e', sentence4)

['licence', 'license']

It is also possible to use a range of characters. This is done by placing the hyphen symbol (```-```) between two related characters; for example, to match any lowercase letter we can use ```[a-z]```. Likewise, to match any single digit we can define the character set ```[0-9]```.

The character classes' ranges can be combined to be able to match a character against many ranges by just putting one range after the other—no special separation is required. For instance, if we want to match any lowercase or uppercase alphanumeric character, we can use ```[0-9a-zA-Z]``` (see next table for a more detailed explanation). 

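Some regex flavors (Java, for example) also allow this to be written using the union mechanism: ```[0-9[a-z[A-Z]]]```. Python's ```re``` module does not support this nested syntax, so in Python the ranges should simply be written one after the other.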
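The combined ranges can be tried out with a small sketch; the sample string is an illustrative assumption.

```python
import re

# Three ranges in one set: digits, lowercase and uppercase letters.
alnum = re.compile(r'[0-9a-zA-Z]')

sample = "a1-B2_c3!"
found = alnum.findall(sample)
print(found)
```

The hyphen, the underscore and the exclamation mark are skipped because none of them falls inside any of the three ranges.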

There is another possibility—the negation of ranges. We can invert the meaning of a character set by placing a caret (^) symbol right after the opening square  bracket metacharacter (```[```). If we have a character class such as ```[0-9]``` meaning any digit, the negated character class ```[^0-9]``` will match anything that is not a digit. 

However, it is important to notice that there has to be a character that is not a digit; for example, ```'hello[^0-9]'``` won't match the string ```hello``` because after ```hello``` there has to be a non-digit character. There is a mechanism to match only when something does not follow, called **negative lookahead**, and it will be covered in *Chapter 4, Look Around*.

In [16]:
re.findall(r'hello[^0-9]', 'hello')

[]

In [17]:
re.findall(r'hello[^0-9]', 'hello_')

['hello_']

# Predefined character classes

After using character classes for some time, it becomes clear that some of them are very useful and probably worthy of a shortcut.

Luckily enough, there are a number of predefined character classes that can be reused and will already be known by other developers, making the expressions using them more readable.

These classes are not only useful as well-known shortcuts for typical character sets, but also have different meanings in different contexts. 
The character class ```\w```, which matches any alphanumeric character, will match a different set of characters depending on the configured locale and the support of Unicode.

The following table shows the character classes supported at this moment in Python:

https://stackoverflow.com/questions/48655801/tables-in-markdown-in-jupyter

The ```---``` in between the column definitions ```| |``` mean that the column is unjustified. In standard Markdown, this would align to the left of the column but in Jupyter notebook, it appears to align to the right instead.

If you'd like to left align or centre align, you can use ```:-``` and ```:-:``` respectively. Depending on what Jupyter notebook environment you're using, you will need to use ```-:``` to right align.

| Element | Description (for regex with default flags) |
|:-|:-|
|```.``` | Matches any character *except* newline ```\n``` |
|```\d```| Matches any decimal digit; equivalent to the class ```[0-9]``` |
|```\D```| Matches any non-digit character; equivalent to the class ```[^0-9]``` |
|```\s```| Matches any whitespace character; equivalent to the class ```[-->\t\n\r\f\v]``` |
|```\S```| Matches any non-whitespace character; equivalent to the class ```[^-->\t\n\r\f\v]``` |
|```\w```| Matches any alphanumeric character or underscore; equivalent to the class ```[a-zA-Z0-9_]``` |
|```\W```| Matches any non-alphanumeric character; equivalent to the class ```[^a-zA-Z0-9_]``` |
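A quick way to get a feel for these classes is to run each one against the same short text. The sample string below is an illustrative assumption and contains only ASCII characters.

```python
import re

sample = "id_7 = 42\n"

digits = re.findall(r'\d', sample)             # decimal digits
non_digit_count = len(re.findall(r'\D', sample))
whitespace = re.findall(r'\s', sample)         # spaces and the newline
words = re.findall(r'\w+', sample)             # runs of word characters
non_word = re.findall(r'\W', sample)           # everything else

print(digits, non_digit_count, whitespace, words, non_word)
```

The underscore counts as a word character, which is why ```id_7``` comes back as a single item.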

The first one from the previous table, the dot, requires special attention. The dot is probably one of the oldest and also one of the most used metacharacters. The dot can match any character except a newline character.

Let's put the dot into practice by creating a regular expression that matches three characters of any value except newline:

```'...'```

Here the first ```.``` matches any character, the second ```.``` matches any character following the previous one, and the third ```.``` matches any character following the second.
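The three-dot expression can be sketched in Python like this; both sample strings are made up for illustration.

```python
import re

# Three consecutive characters of any value except newline.
three_any = re.findall(r'...', 'abcdefgh')

# A newline cannot be matched by the dot, so a group of three
# can only be found after it.
around_newline = re.findall(r'...', 'ab\ncde')

print(three_any, around_newline)
```

In the first call the trailing ```gh``` is left out because only two characters remain.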

The dot is a very powerful metacharacter that can create problems if it is not used moderately. In most of the cases where the dot is used, it could be considered overkill (or just a symptom of laziness when writing regular expressions).

To better define what is expected to be matched and to express more concisely to any later reader what a regular expression is intended to do, the use of character classes is strongly recommended. For instance, when working with Windows and UNIX file paths, to match any character except the slash or the backslash, you can use a negated character set (the backslash has to be escaped inside the set):
```r'[^/\\]'```

This character set explicitly tells the reader that we intend to match anything but a Windows or UNIX file path separator.
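As a sketch of this idea, a negated set that excludes both separators can split a path into its components. The paths are hypothetical, and inside a Python raw string the backslash in the set is written twice so that the regex engine sees one escaped backslash.

```python
import re

# One or more characters that are neither '/' nor '\'.
not_separator = re.compile(r'[^/\\]+')

unix_parts = not_separator.findall('/usr/local/bin')
windows_parts = not_separator.findall(r'C:\Users\guest')

print(unix_parts, windows_parts)
```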

# Alternation

We have just learned how to match a single character from a set of characters. Now, we are going to learn a broader approach: how to match against a set of  
regular expressions. This is accomplished using the pipe symbol ```|``` .

Let's start by saying that we want to match if we find either the word "yes" or the word "no". Using alternation, it will be as simple as: ```r'yes|no'```



On the other hand, if we want to accept more than two values, we can continue adding values to the alternation like this: ```'yes|no|maybe'```
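A short sketch of the three-way alternation, using an illustrative sample text of our own:

```python
import re

answers = "yes, no, maybe, yesterday"

# Each alternative is tried at every position in the text.
found = re.findall(r'yes|no|maybe', answers)
print(found)
```

The last item comes from ```yesterday```: alternation on its own does not require the alternatives to be whole words.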

When used in bigger regular expressions, we will probably need to wrap our alternation inside parentheses to express that only that part is alternated and not the whole expression. For instance, if we make the mistake of not using the parentheses, as in the following expression: 

```'Licence: yes|no'```

We may think we are accepting either ```Licence: yes``` or ```Licence: no```, but we are actually accepting either ```Licence: yes``` or ```no``` as the alternation has been applied to the whole regular expression instead of just the ```yes|no``` part. A correct approach for this will be:

```'Licence: (yes|no)'```

In [18]:
# regular expression using alternation
sentence5 = "Licence: yes"
re.findall('Licence: yes|no', sentence5)

['Licence: yes']

In [19]:
sentence5 = "Licence: yes"
sentence6 = "Licence: no"
re.findall(r'Licence: (yes|no)', sentence5)

['yes']

In [20]:
re.findall(r'Licence: (yes|no)', sentence6)

['no']

# Quantifiers

So far, we have learned how to define a single character in a variety of fashions. At this point, we will leverage the quantifiers: the mechanisms to define how a character, metacharacter, or character set can be repeated.

For instance, if we define that a ```\d``` can be repeated many times, we can easily create a form validator for the number of items field of a shopping cart (remember that ```\d``` matches any decimal digit). But let's start from the beginning, the three basic quantifiers: the question mark ```?```, the plus sign ```+```, and the asterisk ```*```.

|Symbol |Name| Quantification of previous character|
|:-:|-:|-:|
|? |Question mark| Optional (0 or 1 repetitions)|
|* |Asterisk |Zero or more times|
|+ |Plus sign |One or more times |
|{n,m}| Curly braces |Between n and m times|

In the preceding table, we can find the three basic quantifiers, each with a specific utility, along with the curly braces form. The question mark can be used to match the word car and its plural form cars:

```'cars?'```

| Element |  Description |
|:-: | :-|
| car | Matches the characters c, a and r |
| s? | Optionally matches the character s |
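The optional ```s``` can be seen at work in a small sketch; the sample sentence is an illustrative assumption.

```python
import re

sample = "one car, two cars, a cart"

# 'car' followed by an optional 's'.
found = re.findall(r'cars?', sample)
print(found)
```

The third item is the ```car``` inside ```cart```, since nothing in the expression says the word has to end there.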

Another interesting example of the usage of the question mark quantifier will be to match a telephone number that can be in the format ```555-555-555```, ```555 555 555```, or ```555555555```.

We now know how to leverage character sets to accept different characters, but is it possible to apply a quantifier to a character set? 

Yes, quantifiers can be applied to characters, character sets, and even to groups (a feature we will cover in Chapter 3, Grouping). We can construct a regular expression like this to validate the telephone numbers:

```\d+[-\s]?\d+[-\s]?\d+```

In [21]:
# raw strings keep the backslashes intact for the regex engine
re.findall(r'\d+[-\s]?\d+[-\s]?\d+', '555-555-555')

['555-555-555']

In [22]:
re.findall(r'\d+[-\s]?\d+[-\s]?\d+', '555 555 555')

['555 555 555']

In [23]:
re.findall(r'\d+[-\s]?\d+[-\s]?\d+', '555555555')

['555555555']

At the beginning of this section, one more kind of quantifier, using curly braces, was mentioned. Using this syntax, we can define that the previous character must appear exactly three times by appending ```{3}``` to it; likewise, the expression ```\w{8}``` specifies exactly eight alphanumeric characters.

We can also define a range of repetitions by providing a minimum and a maximum number of repetitions; for example, between three and eight times can be defined with the syntax ```{3,8}```. Either the minimum or the maximum value can be omitted, defaulting to 0 and infinity respectively. To designate a repetition of up to three times, we can use ```{,3}```; we can also establish a repetition of at least three times with ```{3,}```.

These four different combinations are shown in the next table:

|Syntax | Description |
|:- | :-|
|{n} |The previous character is repeated exactly n times.|
|{n,}| The previous character is repeated at least n times.|
|{,n}| The previous character is repeated at most n times.|
|{n,m}| The previous character is repeated between n and m times (both inclusive).|
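The four combinations can be compared side by side with a small sketch. The helper ```matches``` is a hypothetical name introduced here; it relies on ```re.fullmatch``` so that the whole text has to fit the pattern.

```python
import re

def matches(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# {n}: exactly three times
exactly = [matches(r'a{3}', s) for s in ('aa', 'aaa', 'aaaa')]
# {n,}: at least three times
at_least = [matches(r'a{3,}', s) for s in ('aa', 'aaa', 'aaaa')]
# {,n}: at most three times (zero is allowed)
at_most = [matches(r'a{,3}', s) for s in ('', 'aaa', 'aaaa')]
# {n,m}: between two and three times, both inclusive
between = [matches(r'a{2,3}', s) for s in ('a', 'aa', 'aaa', 'aaaa')]

print(exactly, at_least, at_most, between)
```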

Returning to the telephone number formats ```555-555-555```, ```555 555 555```, and ```555555555```, the regular expression we used to validate them relies on the plus sign metacharacter: 

```'\d+[-\s]?\d+[-\s]?\d+'```

It requires the digits (```\d```) to be repeated one or more times.

Let's fine-tune the regular expression by defining that the leftmost digit group can contain up to three characters, while the rest of the digit groups should contain exactly three digits:

In [24]:
# using quantifiers
re.findall(r'\d{1,3}[-\s]?\d{3}[-\s]?\d{3}', '555 555 555')

['555 555 555']
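To use the fine-tuned expression as a validator rather than a search, one option is ```re.fullmatch```, which only succeeds when the entire input fits. The function name ```is_phone``` is a hypothetical helper for this sketch.

```python
import re

# Up to three leading digits, then two groups of exactly three digits,
# each optionally preceded by a hyphen or a whitespace.
PHONE = re.compile(r'\d{1,3}[-\s]?\d{3}[-\s]?\d{3}')

def is_phone(text):
    """Return True when the whole text looks like a phone number."""
    return PHONE.fullmatch(text) is not None

print(is_phone('555-555-555'), is_phone('5555-555-555'))
```

With ```re.findall``` a longer string could still contain a matching fragment, which is why the whole-string check is used here.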

# Greedy and Non-greedy (reluctant) quantifiers

We still haven't defined what would match if we apply an expression with a quantifier such as 
```".+"``` to a text such as the following: ```English "Hello", Spanish "Hola"```. We may expect that it matches ```"Hello"``` and ```"Hola"``` but it will actually match ```"Hello", Spanish "Hola"```.

In [25]:
re.findall(r'".+"', 'English "Hello", Spanish "Hola"')

['"Hello", Spanish "Hola"']

This behavior is called greedy and is one of the two possible behaviors of the quantifiers in Python: **greedy** and **non-greedy** (also known as reluctant).
- The greedy behavior is applied by default. A greedy quantifier will try to match as much as possible to have the biggest match result possible.
- The non-greedy behavior can be requested by adding an extra question mark to the quantifier; for example, ```??```, ```*?``` or ```+?```. A quantifier marked as reluctant behaves in exactly the opposite way: it will try to have the smallest match possible.

We can understand better how these quantifiers work by looking at the next examples. We will apply almost the same regular expression (with the exception of leaving the quantifier as greedy or marking it as reluctant) to the same text, having two very different results:

In [26]:
# Greedy quantifier
re.findall(r'".+"', 'English "Hello", Spanish "Hola"')

['"Hello", Spanish "Hola"']

In [27]:
# reluctant or non-greedy quantifiers
re.findall(r'".+?"', 'English "Hello", Spanish "Hola"')

['"Hello"', '"Hola"']

In [28]:
# another example of greedy quantifier
re.findall(r'\(.+\)', r'English (HELLO), Spanish (HOLA)')

['(HELLO), Spanish (HOLA)']

In [29]:
# another example of reluctant or non-greedy quantifier
re.findall(r'\(.+?\)', r'English (HELLO), Spanish (HOLA)')

['(HELLO)', '(HOLA)']

In [30]:
# another example of reluctant or non-greedy quantifier
re.findall(r'\(.+?\)', r'Perserikatan Bangsa Bangsa (PBB), United State of America (USA), Uni Emirate Arab (UEA)')

['(PBB)', '(USA)', '(UEA)']
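The cells above only exercise ```+?```. The other two reluctant forms, ```??``` and ```*?```, can be sketched the same way; the sample strings are illustrative assumptions.

```python
import re

# '?' greedy takes the optional 'b'; '??' reluctant leaves it out.
greedy_optional = re.findall(r'ab?', 'abab')
reluctant_optional = re.findall(r'ab??', 'abab')

# '*' greedy runs to the last '>'; '*?' stops at the first one.
greedy_star = re.findall(r'<.*>', '<a><b>')
reluctant_star = re.findall(r'<.*?>', '<a><b>')

print(greedy_optional, reluctant_optional, greedy_star, reluctant_star)
```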

# Boundary matchers

Until this point, we have just tried to find regular expressions within a text. Sometimes, when it is required to match a whole line, we may also need to match at the beginning of a line or even at the end. This can be done thanks to the **boundary matchers**.

The boundary matchers are a number of identifiers that correspond to a particular position inside the input. The following table shows the boundary matchers available in Python:

|Matcher |Description|
|:-: | :-|
|```^``` |Matches at the beginning of a line|
|```$``` | Matches at the end of a line |
|```\b```| Matches a word boundary|
|```\B```| Matches the opposite of \b. Anything that is not a word boundary |
|```\A```| Matches the beginning of the input |
|```\Z```| Matches the end of the input|

These boundary matchers will behave differently in different contexts. For instance, the word boundaries (```\b```) will depend directly on the configured locale as different languages may have different word boundaries, and the beginning and end of line boundaries will behave differently based on certain flags that we will study in the next chapter.

Let's start working with boundary matchers by writing a regular expression that  will match *lines that start with ```"Name:"```*. If you take a look at the previous table, you may notice the existence of the metacharacter ```^``` that expresses the beginning of a line. Using it, we can write the following expression:

```'^Name:'```

|Element |Description| 
|:-: | :- |
|```^```| Matches the beginning of the line |
|```N```| Then matches the character N |
|```a```| Then matches the character a |
|```m```| Then matches the character m |
|```e```| Then matches the character e |
|```:```| Then matches the colon symbol |
If we want to take one step further and continue using the caret and the dollar sign in combination to match the end of the line, we should take into consideration that from now on we are going to be matching against the whole line, and not just trying to find a pattern within a line.

Following the previous example, let's say that we want to make sure that after the name, there are only alphabetic characters or spaces until the end of the line. We will do this by matching the whole line until the end by setting a character set with the accepted characters and allowing their repetition any number of times until the end of the line.

```'^Name:[\sa-zA-Z]+$'```

|Element |Description|
|:- | :-|
|^ |Matches the beginning of the line.|
|N |Then matches the character N.|
|a |Then matches the character a.|
|m |Then matches the character m.|
|e |Then matches the character e.|
|: |Then matches the colon symbol.|
|```[\sa-zA-Z]```| Then matches a whitespace, or any lowercase or uppercase alphabetic character.|
|+ | The previous character set can be repeated one or more times.|
|$ |Until the end of the line.|

In [31]:
text = r'Name: Sri Sultan Hamengku Buwono ke X'
re.findall(r'^Name:[\sa-zA-Z]+$', text)

['Name: Sri Sultan Hamengku Buwono ke X']

In [32]:
text2 = r'Name: Sri Sultan Hamengku Buwono ke X. City:Yogyakarta'
re.findall(r'^Name:[\sa-zA-Z]+', text2)

['Name: Sri Sultan Hamengku Buwono ke X']

In [33]:
text4 = r'Name: Sri Sultan Hamengku Buwono ke X City:Yogyakarta'
re.findall(r'^Name:[\sa-zA-Z]+$', text4)

[]

In [34]:
text41 = r'Name: Sri Sultan Hamengku Buwono ke X City. Yogyakarta'
re.findall(r'^Name:[\sa-zA-Z]+$', text41)

[]

In [35]:
text5 = '''
Name: Sri Sultan Hamengku Buwono 
ke X City Yogyakarta
'''
re.findall(r'^Name:[\sa-zA-Z]+$', text5)

[]

In [36]:
text6 = r'Name: Sri Sultan Hamengku Buwono ke X City:Yogyakarta'
re.findall(r'^Name:[\sa-zA-Z]+', text6)

['Name: Sri Sultan Hamengku Buwono ke X City']

In [37]:
text7 = r'Name: Sri Sultan Hamengku Buwono ke. X City Yogyakarta'
re.findall(r'^Name:[\sa-zA-Z]+', text7)

['Name: Sri Sultan Hamengku Buwono ke']

In [38]:
# experiment: changing the dollar sign
text8 = r'Name: Sri Sultan Hamengku Buwono ke X City Yogyakarta'
re.findall(r'^Name:[\sa-zA-Z]+$', text8)

['Name: Sri Sultan Hamengku Buwono ke X City Yogyakarta']

In [39]:
# experiment: changing the dollar sign
text9 = '''
Name: Sri Sultan Hamengku. 
Buwono ke X City Yogyakarta
'''
re.findall(r'^Name:[\sa-zA-Z]+$', text9)

[]

In [40]:
text3 = '''
 Name: Sri Sultan Hamengku 
 Buwono ke X
 Name: Yogyakarta
'''

re.findall(r'^Name:[\sa-zA-Z]+$', text3)

[]

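**Note on the caret ```^``` and the dollar sign ```$```:** with the default flags, ```^``` matches only at the very beginning of the input and ```$``` only at its very end (or just before a final newline). That is why the multi-line texts above, which all start with a newline, return no match.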
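As a preview of the flags covered in the next chapter, the following sketch contrasts the default behavior with ```re.MULTILINE```, which makes ```^``` and ```$``` work line by line. The sample text is an assumption, and a plain space replaces ```\s``` in the character set so that a match cannot run across a newline.

```python
import re

text = "Name: Ada\nName: Grace\nCity: London"
pattern = r'^Name:[ a-zA-Z]+$'

# Default flags: '$' would have to sit at the end of the whole input.
default_mode = re.findall(pattern, text)

# re.MULTILINE: '^' and '$' also match at every line start and end.
multiline_mode = re.findall(pattern, text, flags=re.MULTILINE)

# '\A' keeps meaning "start of the input" even with re.MULTILINE.
start_only = re.findall(r'\AName:[ a-zA-Z]+$', text, flags=re.MULTILINE)

print(default_mode, multiline_mode, start_only)
```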

Another outstanding boundary matcher is the word boundary ```\b```. It matches a position rather than a character: the point between a word character and a non-word character (in the configured locale), or the start or end of the input next to a word character. This is very useful when we want to work with isolated words and we don't want to create character sets with every single character that may divide our words (spaces, commas, colons, hyphens, and so on). We can, for instance, make sure that the word hello appears in a text by using the following regular expression:

```'\bhello\b'```

|Element |Description |
| :-: | :- |
|\b | Matches a word boundary.|
|h |Then matches the character h.|
|e |Then matches the character e.|
|l |Then matches the character l.|
|l |Then matches the character l.|
|o |Then matches the character o.|
|\b|Then matches another word boundary.|

As an exercise, we could think about why the preceding expression is better than ```'hello'```. The reason is that this expression will match an isolated word instead of a word containing ```"hello"```; that is, ```'hello'``` will easily match ```hello```, ```helloed```, or ```Othello```, while ```'\bhello\b'``` will only match ```hello```.

In [41]:

# a raw string is needed so that \b means a word boundary, not a backspace
re.findall(r'\bhello\b', 'hello, helloed, Othello')

['hello']

In [42]:
hello_string = "'hello', 'helloed', 'Othello'"
# the quotes in the text are non-word characters, so they act as boundaries
hello_pattern = re.compile(r'\bhello\b')
re.findall(hello_pattern, hello_string)

['hello']

In [43]:
hello_string = "helloed"
hello_pattern = re.compile(r'\bhello\b')
re.findall(hello_pattern, hello_string)

[]

In [44]:
hello_string = "hello"
hello_pattern = re.compile(r'\bhello\b')
re.findall(hello_pattern, hello_string)

['hello']
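One pitfall with ```\b``` is worth a short sketch: in an ordinary Python string literal, ```'\b'``` is the backspace character, so the pattern has to be written as a raw string. The variable names below are illustrative.

```python
import re

text = 'hello, helloed, Othello'

# Without the r prefix, Python turns '\b' into a backspace (\x08)
# before the regex engine ever sees the pattern.
not_raw = '\bhello\b'
# With the r prefix, the regex engine receives \b as a word boundary.
raw = r'\bhello\b'

without_raw = re.findall(not_raw, text)
with_raw = re.findall(raw, text)

print(repr(not_raw), without_raw, with_raw)
```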

# Summary

In this first chapter, we have learned the importance of regular expressions and how they became such a relevant tool for programmers. We also studied, from a not yet practical point of view, the basic regular expression syntax and some of the key features, such as character classes and quantifiers. In the next chapter, we are going to jump over to Python to start practicing with the re module.