In [1]:
import re

In [190]:
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
print(x)

<re.Match object; span=(0, 17), match='The rain in Spain'>


# [a-z]
- Square brackets [] define a character class. It is used to match any single character that is present inside the brackets. For example, [abc] will match either 'a', 'b', or 'c'. You can also define ranges, like [a-z] to match any lowercase letter. When used, it checks if the input string contains one of the characters from the defined set.

In [20]:
#Find all lower case characters alphabetically between "a" and "m":
x = re.findall("[a-m]", txt)
print(x)

['h', 'e', 'a', 'i', 'i', 'a', 'i']


# [A-Z]

In [22]:
#Find all Upper case characters alphabetically between "A" and "Z":
x = re.findall("[A-Z]", txt)
print(x)

['T', 'S']


# \d
###### `\d` (Digit) in Regular Expressions:

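The `\d` symbol matches any **digit** character. For ASCII text it behaves like `[0-9]`, matching any character from 0 to 9. With Python 3 `str` patterns it also matches decimal digits from other scripts unless the `re.ASCII` flag is set.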

###### Example:
- `\d` matches **'1'**, **'5'**, **'9'** in the string.
- `\d{2}` matches exactly **two digits** like '12', '99', '34', etc.
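
The `\d{2}` case can be sketched as follows. This is a minimal illustration; the `sample` string and variable names are made up for the example.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "Order 12 shipped in 7 days, invoice 3456"

# \d matches one digit at a time.
single_digits = re.findall(r"\d", sample)

# \d{2} matches exactly two consecutive digits. findall scans left to right
# and returns non-overlapping matches, so "3456" yields "34" and "56",
# while the lone "7" is skipped.
digit_pairs = re.findall(r"\d{2}", sample)

print(single_digits)  # ['1', '2', '7', '3', '4', '5', '6']
print(digit_pairs)    # ['12', '34', '56']
```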

In [30]:
txt = "That will be 59 dollars"

#Find all digit characters:
x = re.findall(r"\d", txt)
print(x)

['5', '9']


# .
###### `.` (Dot) in Regular Expressions:

The `.` symbol matches **any single character** except for newline characters (`\n`).

###### Example:
- `a.b` matches any string that has an "a", followed by any character, and then a "b". For example, it will match "acb", "axb", "a3b", but not "ab" (since there's no character between "a" and "b").

###### Key Point:
- The dot (`.`) is a wildcard, meaning it can match **any character** except a newline.
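
The newline exception can be checked with a small sketch. The `text` string is an assumption made for the example, and `re.DOTALL` is the standard flag that lets the dot match newlines as well.

```python
import re

# Illustrative string: "a", a newline, "b", then the word "acb".
text = "a\nb acb"

# By default the dot does not match "\n", so only "acb" is found.
default_matches = re.findall(r"a.b", text)

# With re.DOTALL the dot also matches the newline character.
dotall_matches = re.findall(r"a.b", text, re.DOTALL)

print(default_matches)  # ['acb']
print(dotall_matches)   # ['a\nb', 'acb']
```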

In [58]:
txt = "hello planet"

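#Search for "he" followed by one (any) character:
a = re.findall("he.", txt)
#Search for "he" followed by two (any) characters:
b = re.findall("he..", txt)
#Search for a sequence that starts with "he", followed by two (any) characters, and an "o":
c = re.findall("he..o", txt)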

print(f"Output of a variable : {a}")
print(f"Output of b variable : {b}")
print(f"Output of c variable : {c}")

Output of a variable : ['hel']
Output of b variable : ['hell']
Output of c variable : ['hello']


# ^
###### `^` (Caret) in Regular Expressions:

The `^` symbol is used to **anchor** the match to the **beginning** of a string (or line, in multi-line mode). It asserts that the match must occur at the start of the string.

###### Example:
- `^Hello` will match any string that **starts with "Hello"**. For instance, it will match "Hello there" but not "Say Hello".

###### Key Point:
- The caret (`^`) is used to ensure that a match occurs **only at the start** of the string or line. It is a powerful way to restrict matching patterns to the beginning of your input text.

In [61]:
txt = "hello planet"

#Check if the string starts with 'hello':
x = re.findall("^hello", txt)
if x:
    print("Yes, the string starts with 'hello'")
else:
    print("No match")

Yes, the string starts with 'hello'


# $
###### `$` (Dollar Sign) in Regular Expressions:

The `$` symbol is used to **anchor** the match to the **end** of a string (or line, in multi-line mode). It asserts that the match must occur at the end of the string.

###### Example:
- `world$` will match any string that **ends with "world"**. For instance, it will match "Hello world" but not "world Hello".

###### Key Point:
- The dollar sign (`$`) is used to ensure that a match occurs **only at the end** of the string or line. It is useful for checking or capturing patterns at the end of a text input.

In [64]:
txt = "hello planet"

#Check if the string ends with 'planet':
x = re.findall("planet$", txt)
if x:
    print("Yes, the string ends with 'planet'")
else:
    print("No match")

Yes, the string ends with 'planet'


# *
###### `*` (Asterisk) in Regular Expressions:

The `*` symbol is used to match **zero or more occurrences** of the preceding character, group, or pattern. It means that the pattern can repeat any number of times, including not appearing at all.

###### Example:
- `a*` will match:
  - An empty string (because it allows zero occurrences of "a").
  - Any string with one or more "a"s (like "a", "aa", "aaa", etc.).

###### Key Point:
- The `*` quantifier allows flexibility in pattern matching, as it accepts both **empty** and **repeated** occurrences of the preceding element.
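
A short sketch of the "zero or more" behaviour, using made-up input strings:

```python
import re

# "ab*" means: an "a" followed by zero or more "b" characters.
# The lone "b" at the end is skipped because no "a" precedes it.
matches = re.findall(r"ab*", "a ab abbb b")

# fullmatch shows that "a*" accepts the empty string (zero occurrences).
empty_ok = re.fullmatch(r"a*", "") is not None

print(matches)   # ['a', 'ab', 'abbb']
print(empty_ok)  # True
```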

In [121]:
txt = "heo planet"

#Search for a sequence that starts with "he", followed by 0 or more (any) characters, and an "o":
x = re.findall("he.*o", txt)

print(x)

['heo']


# +
###### `+` (Plus) in Regular Expressions:

The `+` symbol is used to match **one or more occurrences** of the preceding character, group, or pattern. Unlike the `*`, which allows zero occurrences, `+` requires at least one occurrence of the pattern to match.

###### Example:
- `a+` will match:
  - "a", "aa", "aaa", etc.
  - It **won't** match an empty string because at least one "a" is required.

###### Key Point:
- The `+` quantifier ensures that the pattern appears at least once, making it different from `*`, which can match zero occurrences. It is useful when you expect a minimum of one repetition.
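
To see the difference between `*` and `+` side by side, one option is to test a few candidate strings against both patterns. The `words` list below is chosen for illustration.

```python
import re

# Candidate strings to test in full against each pattern.
words = ["he", "heo", "hello"]

# "he.*o" allows nothing between "he" and "o", so "heo" qualifies.
star = [w for w in words if re.fullmatch(r"he.*o", w)]

# "he.+o" needs at least one character between "he" and "o", so "heo" fails.
plus = [w for w in words if re.fullmatch(r"he.+o", w)]

print(star)  # ['heo', 'hello']
print(plus)  # ['hello']
```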

In [127]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 1 or more (any) characters, and an "o":
x = re.findall("he.+o", txt)

print(x)

['hello']


# ?
###### `?` (Question Mark) in Regular Expressions:

The `?` symbol is used to match **zero or one occurrence** of the preceding character, group, or pattern. It makes the preceding element optional.

###### Example:
- `a?` will match:
  - An empty string (because zero occurrences of "a" is allowed).
  - "a" (one occurrence of "a").
  - It never consumes more than one "a" per match: in "aa" or "aaa", each "a" is matched separately rather than as a single run.

###### Key Point:
- The `?` quantifier is useful when you want to make a part of your pattern optional. It means the pattern can occur **at most once** or **not at all**.
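
A common use of an optional element is accepting two spellings of a word. The sample sentence below is made up for the example.

```python
import re

# Illustrative text containing both spellings.
text = "The color of the colour chart"

# "u?" makes the "u" optional, so both "color" and "colour" match.
matches = re.findall(r"colou?r", text)

print(matches)  # ['color', 'colour']
```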

In [137]:
import re

txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or 1 (any) character, and an "o":
x = re.findall("he.?o", txt)

print(x)

#This time we got no match, because there were not zero, not one, but two characters between "he" and the "o"

[]


# {}
###### `{}` (Curly Braces) in Regular Expressions:

The `{}` symbol is used to **specify the exact number of occurrences** of the preceding character, group, or pattern. It allows you to match a specific number of repetitions, or a range of repetitions.

###### Syntax:
- `{n}`: Matches exactly **n** occurrences of the preceding element.
- `{n,}`: Matches **n or more** occurrences of the preceding element.
- `{n,m}`: Matches between **n and m** occurrences of the preceding element.

###### Example:
1. `a{3}`:
   - This will match exactly **3 occurrences** of "a" (i.e., "aaa").

2. `a{2,4}`:
   - This will match **2 to 4 occurrences** of "a" (i.e., "aa", "aaa", or "aaaa").

3. `a{2,}`:
   - This will match **2 or more occurrences** of "a" (i.e., "aa", "aaa", "aaaa", etc.).

4. `a{,3}`:
   - This will match **zero to 3 occurrences** of "a" (i.e., "", "a", "aa", or "aaa"), since an omitted lower bound defaults to 0.

###### Key Point:
- The `{}` quantifier allows you to fine-tune how many times a pattern can occur, giving you flexibility to match specific repetitions of a character or group.
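
The four forms can be compared by testing a fixed list of strings against each pattern. This is only a sketch; `candidates` and the helper `accepted` are hypothetical names introduced for the example.

```python
import re

# Strings of zero to five "a" characters.
candidates = ["", "a", "aa", "aaa", "aaaa", "aaaaa"]

def accepted(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(accepted(r"a{3}"))    # ['aaa']
print(accepted(r"a{2,4}"))  # ['aa', 'aaa', 'aaaa']
print(accepted(r"a{2,}"))   # ['aa', 'aaa', 'aaaa', 'aaaaa']
print(accepted(r"a{,3}"))   # ['', 'a', 'aa', 'aaa']
```

`re.fullmatch` is used here so that each string is accepted or rejected as a whole; `re.findall` would instead report matching pieces inside longer runs.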

In [144]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by exactly 2 (any) characters, and an "o":
x = re.findall("he.{2}o", txt)

print(x)

['hello']


# |
###### `|` (Pipe) in Regular Expressions:

The `|` symbol represents the **OR** operator in regular expressions. It allows you to match one pattern **or** another, providing a choice between different alternatives.

###### Syntax:
- `A|B`: Matches **either** pattern `A` **or** pattern `B`.

###### Example:
1. `cat|dog`:
   - This will match either the word **"cat"** or the word **"dog"**.

2. `apple|orange|banana`:
   - This will match **either** the word "apple", "orange", or "banana".

3. `(a|b)c`:
   - This will match either "ac" or "bc", where it first matches **"a"** or **"b"**, and then "c" must follow.

###### Key Point:
- The `|` symbol allows you to match multiple options in a regular expression. It's very useful when you want to search for any one of several possible patterns in a string.
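
Grouping changes what `|` applies to, which the following sketch tries to show on a made-up string. Note that `re.findall` returns the contents of a capturing group when the pattern has one, so a non-capturing group `(?:...)` is used to get the full matches.

```python
import re

text = "ac bc cc"  # illustrative input

# Without a group, "|" splits the whole pattern: "a" OR "bc".
ungrouped = re.findall(r"a|bc", text)

# With a non-capturing group: ("a" OR "b"), then "c".
grouped = re.findall(r"(?:a|b)c", text)

# With a capturing group, findall returns only the captured part.
captured = re.findall(r"(a|b)c", text)

print(ungrouped)  # ['a', 'bc']
print(grouped)    # ['ac', 'bc']
print(captured)   # ['a', 'b']
```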

In [150]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains either "falls" or "stays":
x = re.findall("falls|stays", txt)
print(x)

if x:
    print("Yes, there is at least one match!")
else:
    print("No match")

['falls']
Yes, there is at least one match!


In [159]:
txt = "The rain in Spain"

#Check if the string starts with "The":
x = re.findall(r"\AThe", txt)
y = re.findall("^The", txt)

print(x)
print(y)

if x and y:
    print("Yes, there is a match!")
else:
    print("No match")

['The']
['The']
Yes, there is a match!


# \A
###### `\A` in Regular Expressions:

The `\A` symbol is used to match the **start of a string**, **regardless** of whether the string is multi-line or not. It ensures that the match occurs at the very beginning of the input string.

###### Syntax:
- `\A`: Matches only the start of the string.

###### Example:
1. `\AHello`:
   - This will match the word **"Hello"** only if it appears at the very beginning of the string.

   - If the string is `"Hello world"`, it matches **"Hello"**.
   - If the string is `"A Hello world"`, it does **not** match because **"Hello"** is not at the start of the string.

###### Key Point:
- Unlike `^`, which can be used in multi-line mode to match the start of any line, `\A` will only match the **start of the entire string**. This makes it more strict when you're working with multi-line input.



In [170]:
txt = "The quick brown fox.\nThe lazy dog."

# Using \A - matches "The" only at the very beginning of the string, even with re.MULTILINE
x = re.findall(r"\AThe", txt, re.MULTILINE)  # Output: ['The']

# Using ^ - with re.MULTILINE, it matches "The" at the beginning of every line
y = re.findall(r"^The", txt, re.MULTILINE)  # Output: ['The', 'The']

print(x)
print(y)

['The']
['The', 'The']


# \b
###### `\b` in Regular Expressions:

The `\b` symbol is used to match a **word boundary**. It ensures that the match occurs at the **start** or **end** of a word. A word boundary is the position between a word character (like a letter, number, or underscore) and a non-word character (like space, punctuation, etc.), or between a word character and the start or end of the string.

###### Syntax:
- `\b`: Matches a word boundary.
- It does **not** consume characters in the string; it only asserts the position.

###### Example:
1. `\bcat\b`:
   - This will match **"cat"** only if it is a whole word, not part of a larger word.

   - Matches: `"cat is small"`, `"a cat"` (because "cat" is a separate word).
   - Does not match: `"concats"` (because "cat" is part of the word "concats").

2. `\b123`:
   - This will match the number `123` if it occurs at the start of a word (e.g., `"123abc"`).
   - It does not match if `123` is immediately preceded by a word character (e.g., `"a123"` or `"4123"`).

###### Key Point:
- The `\b` symbol helps ensure that the pattern matches **exact words**, and not part of a longer string.
- It works the same way in multi-line strings: a newline is a non-word character, so a word boundary exists between a word and an adjacent line break.
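
A minimal sketch of whole-word matching with `\bcat\b`. The `text` string is made up, and `re.finditer` is used so the match positions are visible.

```python
import re

# Illustrative text: "cat" appears as a whole word and inside longer words.
text = "cat concats cats a cat."

# \bcat\b requires a word boundary on both sides of "cat".
whole_words = re.findall(r"\bcat\b", text)

# Start positions of the whole-word matches.
positions = [m.start() for m in re.finditer(r"\bcat\b", text)]

print(whole_words)  # ['cat', 'cat']
print(positions)    # [0, 19]
```

The "cat" inside "concats" and "cats" is skipped because a word character sits next to it on at least one side.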

In [193]:
txt = "The quick brown the fox jumps over the lazy dog."

# Using \b to match the beginning of a word
match = re.findall(r"\bThe", txt)
print(match)

# Using \b to match the end of a word
match_end = re.findall(r"dog\b", txt)
print(match_end)

['The']
['dog']


# \B
###### `\B` in Regular Expressions:

The `\B` symbol is used to match a **non-word boundary**. It asserts the position where a **word boundary does not exist**. In other words, it matches positions where there is no transition between a word character (like a letter or number) and a non-word character (such as spaces, punctuation, etc.).

###### Syntax:
- `\B`: Matches a position that is **not** a word boundary.
- It asserts that the match must happen between two word characters or between two non-word characters.

###### Example:
1. `\Bcat\B`:
   - This will match `"cat"` **only** when it is part of a larger word and not a standalone word.

   - Matches: `"concats"`, `"scattered"`, `"education"` (because "cat" has word characters on both sides).
   - Does not match: `"cat is small"` (because "cat" is a separate word and has word boundaries) or `"cats"` (because there is a word boundary before the "c").

2. `\B123`:
   - This will match `123` **only** if it is immediately preceded by a word character.

   - Matches: `"abc123"`, `"a123"` (because there is no word boundary before `123`).
   - Does not match: `"123abc"`, `"123 is"`, `"abc 123"` (because `123` starts the string or follows a space, so a word boundary precedes it).

###### Key Point:
- `\B` ensures that the match occurs **inside a word**, where there are no boundaries.
- It is useful for finding patterns that should not occur at the start or end of a word but rather in the middle.
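
The position rules for `\B` are easy to get wrong, so here is a small check against a few candidate strings. The lists `words` and `tokens` are chosen for illustration.

```python
import re

# \Bcat\B needs a word character on both sides of "cat".
words = ["cat", "cats", "concats", "scattered"]
inside = [w for w in words if re.search(r"\Bcat\B", w)]

# \B123 needs a word character immediately before "123".
tokens = ["abc123", "a123", "123abc", "123"]
preceded = [t for t in tokens if re.search(r"\B123", t)]

print(inside)    # ['concats', 'scattered']
print(preceded)  # ['abc123', 'a123']
```

"cats" and "123abc" are rejected because the pattern would have to start at the beginning of the string, which is a word boundary.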

In [195]:
txt = "The quick brown fox jumps over the lazy dog."

# "fox" follows a space, so a word boundary precedes it and \Bfox finds nothing
print(re.findall(r"\Bfox", txt))

# "How" also follows a space, so \BHow finds nothing
txt2 = "Hello! How are you?"
print(re.findall(r"\BHow", txt2))

# Using \B in the middle of a word: only the "is" inside "This" matches
print(re.findall(r"\Bis", "This is a test"))

[]
[]
['is']


# \d

In [196]:
txt = "The rain 8585in Spain7"

#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", txt)

print(x)

if x:
    print("Yes, there is at least one match!")
else:
    print("No match")


['8', '5', '8', '5', '7']
Yes, there is at least one match!


# \D
###### `\D` in Regular Expressions:

The `\D` symbol is used to match **any character that is not a digit**. It is the opposite of `\d`, which matches digits (0-9). Essentially, `\D` will match anything except for numbers.

###### Syntax:
- `\D`: Matches any character that is **not** a digit (0-9).

###### Example:
1. `\D+`:
   - This will match one or more non-digit characters.

   - Matches: `"abc"`, `"Hello"`, `"!@#$"`.
   - Does not match: `"123"`, `"4567"` (because they contain only digits). In `"1abc"` it matches just the `"abc"` part.

2. `\D{3}`:
   - This matches exactly 3 non-digit characters in a row.

   - Matches: `"abc"`, `"xyz"`, `"!@#"` (any 3 non-digit characters).
   - Does not match: `"123"`, `"12ab"`, `"a1b"` (because none of them contains three non-digit characters in a row).

###### Key Point:
- `\D` is useful when you want to exclude digits from your matches and focus on the characters that are non-numeric.
- It's commonly used for matching text or symbols while ignoring numbers.
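
Two small uses of `\D`, sketched with made-up inputs: finding runs of non-digits, and stripping everything except digits. The phone number is fictional, and `re.sub` (covered later in this notebook) replaces each match with the given text.

```python
import re

# \D+ finds each run of non-digit characters, even in a string with digits.
mixed = "1abc22de"
non_digit_runs = re.findall(r"\D+", mixed)

# Replacing every non-digit with "" keeps only the digits.
phone = "+1 (555) 010-9999"  # fictional number for illustration
digits_only = re.sub(r"\D", "", phone)

print(non_digit_runs)  # ['abc', 'de']
print(digits_only)     # 15550109999
```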

In [199]:
txt = "The ra8in in 7Spain"

#Return a match at every non-digit character:
x = re.findall(r"\D", txt)

print(x)

if x:
    print("Yes, there is at least one match!")
else:
    print("No match")

['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


# \s
###### `\s` in Regular Expressions:

The `\s` symbol is used to match any **whitespace character**. This includes spaces, tabs, newlines, and other forms of whitespace characters.

###### Syntax:
- `\s`: Matches any **whitespace character**, such as a space, tab, or newline.

###### Example:
1. `\s+`:
   - This will match **one or more whitespace characters**.
   
   - Matches: `" "`, `"\t"`, `"\n"`, `"   "` (spaces, tabs, newlines).
   - Does not match: `"abc"`, `"123"` (because they are not whitespace characters).

2. `\s{2}`:
   - This will match **exactly two whitespace characters** in a row.
   
   - Matches: `"  "` (two spaces), `"\t\t"` (two tabs).
   - Does not match: `" "` (one space). In `"   "` (three spaces) it matches only the first two.

###### Key Points:
- `\s` is useful when you need to handle spaces, tabs, and newlines in your regular expression patterns.
- It's commonly used to match spaces or ensure there's some kind of whitespace between words or elements.
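
As a sketch of how `\s+` handles mixed whitespace, the example below uses a made-up string with double spaces, a tab, and a newline. `re.split` is covered later in this notebook.

```python
import re

# Illustrative string with irregular whitespace.
messy = "The  rain\tin\n Spain"

# Each run of consecutive whitespace is one match.
runs = re.findall(r"\s+", messy)

# Splitting on \s+ yields the words regardless of the whitespace used.
words = re.split(r"\s+", messy)

print(runs)   # ['  ', '\t', '\n ']
print(words)  # ['The', 'rain', 'in', 'Spain']
```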


In [204]:
txt = "The rain in Spain"

#Return a match at every white-space character:

x = re.findall(r"\s", txt)

print(x)

if x:
    print("Yes, there is at least one match!")
else:
    print("No match")

[' ', ' ', ' ']
Yes, there is at least one match!


# \S
###### `\S` in Regular Expressions:

The `\S` symbol is used to match any **non-whitespace character**. It will match any character that is not a space, tab, newline, or any other whitespace character.

###### Syntax:
- `\S`: Matches any **non-whitespace character**, including letters, digits, punctuation, etc.

###### Example:
1. `\S+`:
   - This will match **one or more non-whitespace characters** in a row.
   
   - Matches: `"abc"`, `"123"`, `"hello123"`, `"!"`, `"$"`, `"abc@123"`.
   - Does not match: `" "`, `"\t"`, `"\n"` (whitespace characters).

2. `\S{3}`:
   - This will match **exactly three consecutive non-whitespace characters**.
   
   - Matches: `"abc"`, `"123"`, `"!@#"`.
   - Does not match: `"a b"` (because the space breaks the run). In `"abc "` it matches `"abc"` and leaves the trailing space unmatched.

###### Key Points:
- `\S` is the inverse of `\s`. While `\s` matches whitespace characters, `\S` matches everything except them.
- It's useful when you need to capture or exclude whitespace and focus on non-whitespace elements like words, digits, punctuation, or symbols.
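
A quick sketch of `\S+` and `\S{3}` on a made-up string, showing that punctuation counts as non-whitespace:

```python
import re

text = "Hello, world!  42"  # illustrative input

# \S+ grabs each run of non-whitespace, punctuation included.
tokens = re.findall(r"\S+", text)

# \S{3} returns non-overlapping chunks of exactly three such characters;
# "42" is too short to produce a match.
chunks = re.findall(r"\S{3}", text)

print(tokens)  # ['Hello,', 'world!', '42']
print(chunks)  # ['Hel', 'lo,', 'wor', 'ld!']
```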

In [207]:
txt = "8The rain in Spain"

#Return a match at every NON white-space character:
x = re.findall(r"\S", txt)

print(x)

if x:
    print("Yes, there is at least one match!")
else:
    print("No match")

['8', 'T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


# \w
###### `\w` in Regular Expressions:

The `\w` symbol is used to match any **word character**. It includes letters (both lowercase and uppercase), digits, and the underscore (`_`).

###### Syntax:
- `\w`: Matches a **word character**, i.e., any alphanumeric character or underscore (`_`).

###### Example:
1. `\w+`:
   - This will match **one or more word characters** in a row.
   
   - Matches: `"abc"`, `"123"`, `"hello123"`, `"word_123"`.
   - Does not match: `"!"`, `"@#"`, `" "`, `"\t"` (non-word characters or whitespace).

2. `\w{3}`:
   - This will match **exactly three consecutive word characters**.
   
   - Matches: `"abc"`, `"123"`.
   - Does not match: `"ab"` (only two characters). In `"abc "` it matches `"abc"` and leaves the space unmatched.

###### Key Points:
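- `\w` matches **letters (a-z, A-Z)**, **digits (0-9)**, and **underscores (_)**.
- For ASCII text it is the equivalent of `[a-zA-Z0-9_]`, so it captures alphanumeric words and underscores. With Python 3 `str` patterns it also matches letters and digits from other scripts unless the `re.ASCII` flag is set.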
- It's useful for matching identifiers (like variable names) or extracting words from a text where you want to ignore spaces, punctuation, and special symbols.
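
The effect of the `re.ASCII` flag on `\w` can be sketched as follows. The sample text is made up, and the accented letters are written as escapes so the example does not depend on file encoding.

```python
import re

# "naïve café_1" with the accented letters given as Unicode escapes.
text = "na\u00efve caf\u00e9_1"

# Default for str patterns: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so the words are broken up.
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(unicode_words)  # ['naïve', 'café_1']
print(ascii_words)    # ['na', 've', 'caf', '_1']
```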

In [211]:
txt = "$The %rain in Spain _8"

#Return a match at every word character (letters a-z and A-Z, digits 0-9, and the underscore _ character):

x = re.findall(r"\w", txt)

print(x)

if x:
    print("Yes, there is at least one match!")
else:
    print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n', '_', '8']
Yes, there is at least one match!


# \W
###### `\W` in Regular Expressions:

The `\W` symbol is the **opposite** of `\w`. It matches any character that is **not a word character**. A "word character" includes letters (both lowercase and uppercase), digits, and the underscore (`_`). Therefore, `\W` matches everything except those characters.

###### Syntax:
- `\W`: Matches a **non-word character**. This means anything that is not a letter, digit, or underscore.

###### Example:
1. `\W+`:
   - This will match **one or more consecutive non-word characters**.
   
   - Matches: `"!"`, `"@"`, `"#"`, `" "`, `"&*("`.
   - Does not match: `"abc"`, `"123"`, `"word_123"` (these are word characters).

2. `\W{3}`:
   - This will match **exactly three consecutive non-word characters**.
   
   - Matches: `"!@#"`, `" ^ "` (space and punctuation).
   - Does not match: `"abc"`, `"123"` (these are word characters).

###### Key Points:
- `\W` matches **anything that is not a letter, digit, or underscore**.
- For ASCII text it is the equivalent of `[^a-zA-Z0-9_]`, so it captures punctuation, spaces, special characters, and other non-word characters. Non-ASCII letters count as word characters unless the `re.ASCII` flag is set.
- It's useful for identifying and extracting non-word characters from a string or to split text based on non-alphanumeric symbols.
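
Splitting on `\W+` is one way to pull words out of punctuated text. The sketch below uses a made-up string and also shows a detail that is easy to miss: a separator at the very end produces a trailing empty string.

```python
import re

text = "rain, Spain; plain!"  # illustrative input

# Split wherever one or more non-word characters occur.
parts = re.split(r"\W+", text)

# The final "!" is a separator with nothing after it, so the last
# element is an empty string; filter it out if it is not wanted.
words = [p for p in parts if p]

print(parts)  # ['rain', 'Spain', 'plain', '']
print(words)  # ['rain', 'Spain', 'plain']
```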

In [212]:
txt = "T%he r*ain in Spa_8in"

#Return a match at every NON word character (anything that is not a letter, digit, or underscore, such as "!", "?", or white-space):
x = re.findall(r"\W", txt)

print(x)

if x:
    print("Yes, there is at least one match!")
else:
    print("No match")

['%', ' ', '*', ' ', ' ']
Yes, there is at least one match!


# \Z
###### `\Z` in Regular Expressions:

The `\Z` symbol is used to match the **end of a string**. It behaves similarly to `$`, but there is a key difference. While `$` matches the end of the string or before a newline at the end of the string, `\Z` matches strictly the **very end of the string**, even if there are newline characters at the end.

###### Syntax:
- `\Z`: Matches the **end of the string**.

###### Example:
1. `re.search(r"abc\Z", "abc")`:
   - This will match because the string ends with `"abc"`.

2. `re.search(r"abc\Z", "abc\n")`:
   - This **won't match** because `\Z` requires the string to end with exactly `"abc"` without a newline at the end.
   - In contrast, `abc$` would match `"abc\n"`.

###### Key Points:
- `\Z` is stricter than `$` because it ensures that **no characters follow the end of the string**.
- `\Z` does not allow for any extra newline characters or anything beyond the end of the string. If there's a newline or any other character at the end, it won't match.
- It’s especially useful when you want to ensure the string ends exactly at a point without any extra characters or newlines after it.

###### Comparison with `$`:
- **`$`**: Matches the end of the string or before a newline at the end of the string.
- **`\Z`**: Matches only at the **absolute end** of the string (no newline allowed after).
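
The comparison can be verified with a string that ends in a newline. This is a minimal sketch with illustrative variable names.

```python
import re

text = "abc\n"  # note the trailing newline

# $ also matches just before a single trailing newline.
dollar_match = re.search(r"abc$", text)

# \Z matches only at the absolute end, so the newline blocks it.
strict_match = re.search(r"abc\Z", text)

print(dollar_match is not None)  # True
print(strict_match is not None)  # False
```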

In [216]:
txt = "The rain in Spain"

#Check if the string ends with "Spain":
x = re.findall(r"Spain\Z", txt)

print(x)

if x:
    print("Yes, there is a match!")
else:
    print("No match")

['Spain']
Yes, there is a match!


###### The findall() Function
- The findall() function returns a list containing all matches.

In [223]:
#Print a list of all matches:
txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


###### The search() Function
- The search() function searches the string for a match, and returns a Match object if there is a match, or `None` if there is none.

- If there is more than one match, only the first occurrence of the match will be returned:

In [224]:
#Search for the first white-space character in the string:
txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3
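
Because `re.search` returns `None` when nothing matches, calling `.start()` on the result without a check would raise an `AttributeError`. A minimal sketch of the usual guard, with illustrative search terms:

```python
import re

txt = "The rain in Spain"

# A pattern that occurs: the Match object exposes the text and its span.
first = re.search(r"ai", txt)
if first is not None:
    print(first.group(), first.span())  # ai (5, 7)

# A pattern that does not occur: search returns None.
missing = re.search(r"Portugal", txt)
print(missing)  # None
```

Only the first "ai" (inside "rain") is reported, even though "Spain" contains another one.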


###### The split() Function
- The split() function returns a list where the string has been split at each match:

In [225]:
#Split at each white-space character:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

#You can control the number of occurrences by specifying the maxsplit parameter:
#Split the string only at the first occurrence:
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain', 'in', 'Spain']
['The', 'rain in Spain']


###### The sub() Function
- The sub() function replaces the matches with the text of your choice:

In [226]:
#Replace every white-space character with the number 9:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain
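
`re.sub` also accepts a `count` argument that limits how many matches are replaced. A short sketch on the same string:

```python
import re

txt = "The rain in Spain"

# Replace only the first two white-space characters.
limited = re.sub(r"\s", "9", txt, count=2)

print(limited)  # The9rain9in Spain
```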
