# Assignment 7

#### 1. What is the name of the feature responsible for generating Regex objects?

Regex objects are created by the re.compile() function in Python's re module. It takes a pattern string (and optional flags) and returns a compiled pattern object whose methods, such as search(), match(), findall(), and sub(), perform the actual matching.
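
As a minimal sketch of the call described above (the phone pattern and sample text are illustrative assumptions, not part of the assignment):

```python
import re

# re.compile() turns a pattern string into a reusable Regex (Pattern) object.
phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

# The compiled object exposes the matching methods directly.
match = phone_regex.search('Call 415-555-0199 today')
found = match.group() if match else None

print(isinstance(phone_regex, re.Pattern))  # True
print(found)                                # 415-555-0199
```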

#### 2. Why do raw strings often appear in Regex objects?

Raw strings (the 'r' prefix) are common in regex patterns because they avoid doubled backslashes. Regular expressions use the backslash character to introduce special sequences such as \d and to escape metacharacters, and ordinary Python string literals also treat the backslash as an escape character. In an ordinary string literal every regex backslash therefore has to be doubled, which is cumbersome and error-prone. In a raw string such as r"\d+\s\w+", backslashes are kept as literal characters, so the pattern reads exactly as the regex engine will see it.

In [2]:
pattern1 = "\\d+\\s\\w+"  # Using regular string with double escaping
pattern2 = r"\d+\s\w+"     # Using raw string

#### 3. What is the return value of the search() method?

In Python's re module, search() scans a string for the first location where the pattern matches. It returns a Match object if a match is found and None otherwise; it does not return a boolean or an index. Because a Match object is truthy and None is falsy, the result can be tested directly in an if statement, and the match position is available through the Match object's start(), end(), and span() methods.
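
A short sketch of both outcomes of re.search(), using made-up sample strings:

```python
import re

hit = re.search(r'\d+', 'Order 66 shipped')   # pattern occurs in the string
miss = re.search(r'\d+', 'No digits here')    # pattern does not occur

print(hit.group())  # 66
print(hit.span())   # (6, 8)
print(miss)         # None
```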

#### 4. From a Match item, how do you get the actual strings that match the pattern?

Call the Match object's group() method. group() or group(0) returns the entire matched text, group(n) returns the text captured by the n-th group (a group name also works for named groups), and groups() returns all captured groups as a tuple. Example:

In [3]:
import re

pattern = r'(\d{3})-(\d{3})-(\d{4})'  # Example pattern for a phone number
string = 'Phone: 123-456-7890'

match = re.search(pattern, string)
if match:
    matched_string = match.group(0)  # Entire matched string
    group1 = match.group(1)          # Captured group 1
    group2 = match.group(2)          # Captured group 2
    group3 = match.group(3)          # Captured group 3
    print("Matched String:", matched_string)
    print("Group 1:", group1)
    print("Group 2:", group2)
    print("Group 3:", group3)

Matched String: 123-456-7890
Group 1: 123
Group 2: 456
Group 3: 7890


#### 5. In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does group zero cover? Group 2? Group 1?

Group zero (group(0), or simply group()) covers the entire matched text, i.e. the substring that matches the whole pattern.

Group 1, (\d\d\d), captures the first three digits, e.g., "123".

Group 2, (\d\d\d-\d\d\d\d), captures three digits, a hyphen, and then four more digits, e.g., "456-7890".
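
The three groups can be checked directly. This is a small sketch; the sample sentence is an assumption chosen for illustration:

```python
import re

phone_regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phone_regex.search('My number is 415-555-4242.')

whole = mo.group(0)   # entire match
area = mo.group(1)    # first parenthesised group
rest = mo.group(2)    # second parenthesised group

print(whole)  # 415-555-4242
print(area)   # 415
print(rest)   # 555-4242
```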

#### 6. In regular expression syntax, parentheses and periods have special meanings. How can you tell a regex that you want it to match literal parentheses and periods?

In regular expressions, parentheses and periods are metacharacters: parentheses define groups, and a period matches any single character except a newline. To match a literal parenthesis or period, escape it by putting a backslash in front of it.

For example, \( matches a literal left parenthesis, \) matches a literal right parenthesis, and \. matches a literal period.

In [4]:
import re

# Match a literal parenthesis and a period
pattern = r'\(Hello\)\.'  # Pattern to match "(Hello)."

# Input string
string = '(Hello). Welcome to the world.'

# Search for the pattern in the string
match = re.search(pattern, string)
if match:
    print("Match found:", match.group(0))
else:
    print("Match not found.")

Match found: (Hello).


#### 7. The findall() method returns a string list or a list of string tuples. What causes it to return one of the two options?

findall() returns a list of strings when the pattern has no capturing groups (each string is a full match) or exactly one capturing group (each string is that group's text). It returns a list of tuples of strings when the pattern has two or more capturing groups, with one tuple per match and one element per group.

In [5]:
import re

# Pattern with capturing group
pattern = r'(\d{3})-(\d{3}-\d{4})'  # Example pattern for phone number

# Input string
string = 'Phone: 123-456-7890, Fax: 987-654-3210'

# Using findall() to match pattern with capturing groups
matches = re.findall(pattern, string)

print(matches)

[('123', '456-7890'), ('987', '654-3210')]


If the regular expression pattern does not contain any capturing groups, findall() returns a list of strings, where each string is the entire substring matched by the pattern.

In [6]:
import re

# Pattern without capturing group
pattern = r'\d{3}-\d{3}-\d{4}'  # Example pattern for phone number without capturing group

# Input string
string = 'Phone: 123-456-7890, Fax: 987-654-3210'

# Using findall() to match pattern without capturing groups
matches = re.findall(pattern, string)

print(matches)

['123-456-7890', '987-654-3210']
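
One case the two cells above do not show: with exactly one capturing group, findall() returns a plain list of strings holding only that group's text. A sketch on the same input string, with a non-capturing (?:...) group shown as a way to group without changing the return shape:

```python
import re

text = 'Phone: 123-456-7890, Fax: 987-654-3210'

# One capturing group: list of strings, each the text of that group.
area_codes = re.findall(r'(\d{3})-\d{3}-\d{4}', text)

# Non-capturing group: grouping without capturing, so full matches come back.
full_numbers = re.findall(r'(?:\d{3}-){2}\d{4}', text)

print(area_codes)    # ['123', '987']
print(full_numbers)  # ['123-456-7890', '987-654-3210']
```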


#### 8. In regular expressions, what does the | character mean?

The | character is a logical OR operator, also known as alternation. It lets you specify several alternative patterns, and a match succeeds if any one of them matches.

In [7]:
import re

# Pattern with alternation
pattern = r'apple|banana|cherry'  # Example pattern with alternatives

# Input string
string = 'I like bananas and apples, but not cherries.'

# Using findall() with alternation pattern
matches = re.findall(pattern, string)

print(matches)

['banana', 'apple']


#### 9. In regular expressions, what does the character stand for?

Question is incomplete

#### 10. In regular expressions, what is the difference between the + and * characters?

"+" and "*" characters are quantifiers that specify how many times a preceding element should occur in a pattern.

The main difference between the "+" and "*" characters is that "+" requires the preceding element to occur at least once, while "*" allows the preceding element to occur zero or more times.
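
A minimal sketch of the difference, using illustrative patterns and re.fullmatch() so that the whole string must match:

```python
import re

star = re.fullmatch(r'ab*c', 'ac')     # matches: * accepts zero b's
plus = re.fullmatch(r'ab+c', 'ac')     # None: + needs at least one b
both = re.fullmatch(r'ab+c', 'abbbc')  # matches: three b's satisfy +

print(star is not None)  # True
print(plus is not None)  # False
print(both is not None)  # True
```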

#### 11. What is the difference between {4} and {4,5} in regular expression?

"{4}" specifies that the preceding element must occur exactly four times, while "{4,5}" specifies that it must occur between four and five times, inclusive: at least four times and at most five.
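
The two quantifiers can be compared on digit strings of different lengths. This is a sketch with made-up inputs:

```python
import re

exact = re.compile(r'\d{4}')     # exactly four digits
ranged = re.compile(r'\d{4,5}')  # four or five digits

four_ok = exact.fullmatch('1234') is not None      # True
five_exact = exact.fullmatch('12345') is not None  # False: one digit too many
five_ok = ranged.fullmatch('12345') is not None    # True
six_ok = ranged.fullmatch('123456') is not None    # False: more than five

# With search(), the greedy {4,5} takes five digits when they are available.
longest = ranged.search('123456').group()

print(four_ok, five_exact, five_ok, six_ok)  # True False True False
print(longest)                               # 12345
```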

#### 12. What do the \d, \w, and \s shorthand character classes signify in regular expressions?

Shorthand character classes such as "\d", "\w", and "\s" represent common character sets. The bracket equivalents given below are the ASCII ones; for Python str patterns these classes also match the corresponding Unicode characters unless the re.ASCII flag is used.

"\d" shorthand character class: The "\d" represents any digit character (0-9). It is equivalent to the character class [0-9]. For example, the regular expression "\d+" would match one or more consecutive digits in a string.

"\w" shorthand character class: The "\w" represents any word character, which includes alphanumeric characters (letters and digits) as well as the underscore (_). It is equivalent to the character class [a-zA-Z0-9_]. For example, the regular expression "\w+" would match one or more consecutive word characters in a string.

"\s" shorthand character class: The "\s" represents any whitespace character, including spaces, tabs, and line breaks. It is equivalent to the character class [ \t\n\r\f\v]. For example, the regular expression "\s+" would match one or more consecutive whitespace characters in a string.

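A short sketch of the three classes on an illustrative string, plus one assumed example of how re.ASCII narrows \w to ASCII characters:

```python
import re

text = 'Room_12 opens\tat 9 am'

digits = re.findall(r'\d+', text)  # runs of digits
words = re.findall(r'\w+', text)   # runs of letters, digits, underscore
spaces = re.findall(r'\s+', text)  # runs of whitespace

print(digits)  # ['12', '9']
print(words)   # ['Room_12', 'opens', 'at', '9', 'am']
print(spaces)  # [' ', '\t', ' ', ' ']

# \w is Unicode-aware for str patterns; re.ASCII restricts it to [a-zA-Z0-9_].
unicode_words = re.findall(r'\w+', 'caf\u00e9')
ascii_words = re.findall(r'\w+', 'caf\u00e9', re.ASCII)
print(ascii_words)  # ['caf']
```
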
#### 13. What do the \D, \W, and \S shorthand character classes signify in regular expressions?

"\D", "\W", and "\S" are the negated shorthand character classes: each matches exactly the characters that its lowercase counterpart does not.

**"\D" shorthand character class:** The "\D" represents any character that is not a digit (0-9). It is the inverse of "\d". It is equivalent to the character class [^0-9]. For example, the regular expression "\D+" would match one or more consecutive characters that are not digits in a string.

**"\W" shorthand character class:** The "\W" represents any character that is not a word character, i.e., any character that is not an alphanumeric character (letter or digit) or an underscore (_). It is the inverse of "\w". It is equivalent to the character class [^a-zA-Z0-9_]. For example, the regular expression "\W+" would match one or more consecutive characters that are not word characters in a string.

**"\S" shorthand character class**: The "\S" represents any character that is not a whitespace character, i.e., any character that is not a space, tab, or line break. It is the inverse of "\s". It is equivalent to the character class [^ \t\n\r\f\v]. For example, the regular expression "\S+" would match one or more consecutive characters that are not whitespace characters in a string.
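
The negated classes can be tried the same way. This sketch uses an illustrative string:

```python
import re

text = 'ID: 42, ok'

non_digits = re.findall(r'\D+', text)  # runs of anything except digits
non_word = re.findall(r'\W+', text)    # runs of anything except word characters
non_space = re.findall(r'\S+', text)   # runs of anything except whitespace

print(non_digits)  # ['ID: ', ', ok']
print(non_word)    # [': ', ', ']
print(non_space)   # ['ID:', '42,', 'ok']
```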

#### 14. What is the difference between ".*?" and ".*"?

**".*?"** - This is a non-greedy or lazy quantifier. It matches zero or more occurrences of any character (except for a newline) in a non-greedy or minimal manner. It means that it matches as few characters as possible to satisfy the overall pattern. It is used to perform the shortest possible match.

**".*"** - This is a greedy quantifier. It matches zero or more occurrences of any character (except for a newline) in a greedy or maximal manner. It means that it matches as many characters as possible to satisfy the overall pattern. It is used to perform the longest possible match.

Example- String: "Hello world!"

"Hel.*?" - the resulting match would be "Hel". Nothing follows the lazy ".*?" in the pattern, so it satisfies the pattern by matching zero characters. This is the shortest possible match.

"Hel.*" - the resulting match would be the entire string "Hello world!", because the greedy ".*" consumes every remaining character up to the end of the line. This is the longest possible match.

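The difference is easiest to see when something follows the quantifier. A sketch with the "Hello world!" string from the example and an assumed HTML-like string:

```python
import re

# The example from the text: nothing follows the quantifier.
lazy_hel = re.search(r'Hel.*?', 'Hello world!').group()
greedy_hel = re.search(r'Hel.*', 'Hello world!').group()
print(lazy_hel)    # Hel
print(greedy_hel)  # Hello world!

# With a closing '>' after the quantifier, greedy runs to the last '>',
# while lazy stops at the first one.
text = '<b>bold</b> and <i>italic</i>'
greedy_tag = re.search(r'<.*>', text).group()
lazy_tag = re.search(r'<.*?>', text).group()
print(greedy_tag)  # <b>bold</b> and <i>italic</i>
print(lazy_tag)    # <b>
```
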
#### 15. What is the syntax for matching both numbers and lowercase letters with a character class?

Place the desired character ranges inside square brackets ([ ]).

The character class "[0-9a-z]" will match any single character that is either a digit (0-9) or a lowercase letter (a-z).
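
A quick sketch of the class in use, on a made-up input:

```python
import re

sample = 'Ab3-Zz9'

chars = re.findall(r'[0-9a-z]', sample)   # one character per match
runs = re.findall(r'[0-9a-z]+', sample)   # consecutive characters grouped

print(chars)  # ['b', '3', 'z', '9']
print(runs)   # ['b3', 'z9']
```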

#### 16. What is the procedure for making a regular expression in regex case insensitive?

To make a regular expression case insensitive, pass the appropriate flag when compiling or matching.

**'re.I'** flag: Pass re.I (short for re.IGNORECASE) as the second argument to re.compile(), or as the flags argument of functions such as re.search(), like this:

In [9]:
import re
pattern = re.compile(r'regex pattern', re.I)
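
To see the flag's effect, here is a small sketch with an illustrative pattern and inputs; the inline (?i) form at the start of the pattern is an equivalent alternative:

```python
import re

robocop = re.compile(r'robocop', re.I)

mixed = robocop.search('RoboCop is part man, part machine.').group()
inline = re.search(r'(?i)robocop', 'ROBOCOP protects the innocent.').group()
strict = re.search(r'robocop', 'ROBOCOP protects the innocent.')  # no flag

print(mixed)   # RoboCop
print(inline)  # ROBOCOP
print(strict)  # None
```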

#### 17. What does the . character normally match? What does it match if re.DOTALL is passed as 2nd argument in re.compile()?

The "." (dot) character normally matches any character except for a newline. It matches any single character, such as letters, digits, symbols, whitespace characters, etc., except for a newline character.

However, if the re.DOTALL or re.S flag is passed as the second argument to the re.compile() function then the "." (dot) character will also match newline characters.

In [10]:
import re
pattern = re.compile(r'pattern', re.DOTALL)
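
A short sketch of the difference on an assumed two-line string:

```python
import re

text = 'first line\nsecond line'

default = re.search(r'.*', text).group()            # stops at the newline
dotall = re.search(r'.*', text, re.DOTALL).group()  # crosses the newline

print(repr(default))  # 'first line'
print(repr(dotall))   # 'first line\nsecond line'
```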

#### 18. If numRegex = re.compile(r'\d+'), what will numRegex.sub('X', '11 drummers, 10 pipers, five rings, 4 hen') return?

In [11]:
import re

numRegex = re.compile(r'\d+')
result = numRegex.sub('X', '11 drummers, 10 pipers, five rings, 4 hen')
print(result)

X drummers, X pipers, five rings, X hen


**numRegex** is a compiled regular expression object that matches one or more digits ("\d+"). <br>
**The sub() method** is called on numRegex with the replacement string 'X' and the input string '11 drummers, 10 pipers, five rings, 4 hen'.<br>
**The sub() method** replaces every occurrence of the pattern (one or more digits) with the replacement string 'X'.<br>
In the input string, '11', '10', and '4' each match and are replaced with 'X'.<br>
'five' in 'five rings' contains no digits, so it does not match "\d+" and is left unchanged.

#### 19. What does passing re.VERBOSE as the 2nd argument to re.compile() allow you to do?

The re.VERBOSE flag (also written re.X) can be passed as the second argument to re.compile() to enable verbose mode.<br>
In verbose mode, whitespace in the pattern string is ignored (except inside a character class or when escaped with a backslash), and everything from an unescaped # to the end of the line is treated as a comment. This lets you spread a pattern over several lines and annotate it, which makes complex patterns more readable and organized.
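
As a sketch, the phone-number pattern from question 4 could be written in verbose mode like this (the layout and comments are one possible arrangement):

```python
import re

phone_regex = re.compile(r'''
    (\d{3})   # first three digits
    -         # literal hyphen
    (\d{3})   # next three digits
    -         # literal hyphen
    (\d{4})   # last four digits
''', re.VERBOSE)

parts = phone_regex.search('Phone: 123-456-7890').groups()
print(parts)  # ('123', '456', '7890')
```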

#### 20. How would you write a regex that matches a number with a comma for every three digits? It must match the following:<br>
'42'<br>
'1,234'<br>
'6,368,745'<br>
#### but not the following:<br>
'12,34,567' (which has only two digits between the commas)<br>
'1234' (which lacks commas)<br>

In [14]:
import re

# Define the regex pattern
pattern = r'^\d{1,3}(?:,\d{3})*$'

# Test strings
strings = ['42', '1,234', '6,368,745', '12,34,567', '1234']

for string in strings:
    match = re.match(pattern, string)
    if match:
        print(f"Match: {string}")
    else:
        print(f"No match: {string}")

Match: 42
Match: 1,234
Match: 6,368,745
No match: 12,34,567
No match: 1234


#### 21. How would you write a regex that matches the full name of someone whose last name is Watanabe? You can assume that the first name that comes before it will always be one word that begins with a capital letter. The regex must match the following:<br>
'Haruto Watanabe'<br>
'Alice Watanabe'<br>
'RoboCop Watanabe'<br>
#### but not the following:<br>
'haruto Watanabe' (where the first name is not capitalized)<br>
'Mr. Watanabe' (where the preceding word has a nonletter character)<br>
'Watanabe' (which has no first name)<br>
'Haruto watanabe' (where Watanabe is not capitalized)<br>

In [15]:
import re

inputs = ["Haruto Watanabe", "Alice Watanabe", "RoboCop Watanabe",
          "haruto Watanabe", "Mr. Watanabe", "Watanabe", "Haruto watanabe"]

# Regex pattern
pattern = r"\b[A-Z][a-zA-Z]* Watanabe\b"

for input_str in inputs:
    match = re.search(pattern, input_str)
    if match:
        print(f"Match found: {input_str}")
    else:
        print(f"No match found: {input_str}")

Match found: Haruto Watanabe
Match found: Alice Watanabe
Match found: RoboCop Watanabe
No match found: haruto Watanabe
No match found: Mr. Watanabe
No match found: Watanabe
No match found: Haruto watanabe


#### 22. How would you write a regex that matches a sentence where the first word is either Alice, Bob, or Carol; the second word is either eats, pets, or throws; the third word is apples, cats, or baseballs; and the sentence ends with a period? This regex should be case-insensitive. It must match the following:<br>
'Alice eats apples.'<br>
'Bob pets cats.'<br>
'Carol throws baseballs.'<br>
'Alice throws Apples.'<br>
'BOB EATS CATS.'<br>
#### but not the following:<br>
'RoboCop eats apples.'<br>
'ALICE THROWS FOOTBALLS.'<br>
'Carol eats 7 cats.'<br>

In [16]:
import re

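# Define the regex pattern. The inline (?i) flag must come first:
# Python 3.11+ raises an error if a global flag appears after ^.
pattern = r'(?i)^(Alice|Bob|Carol)\s+(eats|pets|throws)\s+(apples|cats|baseballs)\.$'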

sentences = [
    'Alice eats apples.',
    'Bob pets cats.',
    'Carol throws baseballs.',
    'Alice throws Apples.',
    'BOB EATS CATS.',
    'RoboCop eats apples.',
    'ALICE THROWS FOOTBALLS.',
    'Carol eats 7 cats.'
]

for sentence in sentences:
    if re.match(pattern, sentence):
        print(f"Matched: {sentence}")
    else:
        print(f"Not matched: {sentence}")


Matched: Alice eats apples.
Matched: Bob pets cats.
Matched: Carol throws baseballs.
Matched: Alice throws Apples.
Matched: BOB EATS CATS.
Not matched: RoboCop eats apples.
Not matched: ALICE THROWS FOOTBALLS.
Not matched: Carol eats 7 cats.


