# Q1. Explain the difference between greedy and non-greedy syntax with visual terms in as few words as possible. What is the bare minimum effort required to transform a greedy pattern into a non-greedy one? What character or characters can you introduce or change?

A1. In regular expressions, a greedy quantifier matches as many repetitions as it can while still letting the rest of the pattern succeed, whereas a non-greedy (lazy) quantifier matches as few as it can. The quantifiers +, *, ?, and {m,n} are greedy by default; their non-greedy forms are +?, *?, ??, and {m,n}?.

For example, consider the string 'abbbbab' and the pattern 'ab*'. A greedy match would be 'abbbb', while a non-greedy match would be 'a'. The bare minimum effort required to transform a greedy pattern into a non-greedy one is to append a ? to the quantifier.

Code Example:

In [2]:
import re

# Greedy match
string = 'abbbbab'
pattern = 'ab*'
match = re.search(pattern, string)
print('Greedy match:', match.group())

# Non-greedy match
pattern = 'ab*?'
match = re.search(pattern, string)
print('Non-greedy match:', match.group())


Greedy match: abbbb
Non-greedy match: a
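
The same one-character change applies to the other quantifiers. The following is a minimal sketch, assuming a made-up input string 'aaaa', that compares the optional quantifier and a bounded repetition in their greedy and non-greedy forms:

```python
import re

text = 'aaaa'

# Optional quantifier: ? takes one 'a' if it can, ?? takes none.
greedy_opt = re.match(r'a?', text).group()
lazy_opt = re.match(r'a??', text).group()

# Bounded repetition: {2,4} takes up to four, {2,4}? stops at two.
greedy_range = re.match(r'a{2,4}', text).group()
lazy_range = re.match(r'a{2,4}?', text).group()

print(repr(greedy_opt), repr(lazy_opt))      # 'a' ''
print(repr(greedy_range), repr(lazy_range))  # 'aaaa' 'aa'
```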


# Q2. When exactly does greedy versus non-greedy make a difference?  What if you're looking for a non-greedy match but the only one available is greedy?

Greedy versus non-greedy syntax makes a difference only when more than one substring starting at the same position could satisfy the pattern. A greedy quantifier matches as much as it can while still allowing the remainder of the pattern to match, whereas a non-greedy quantifier matches as little as it can while still allowing the remainder of the pattern to match.

For example, consider the string text = 'aabaaab' and the pattern pattern = 'a.*b'. The greedy match for this pattern on this string would be 'aabaaab', while the non-greedy match would be 'aab'.

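Appending a ? to the .* quantifier (pattern = 'a.*?b') is what produces the non-greedy match 'aab'. If you're looking for a non-greedy match but the only one available is greedy, the non-greedy pattern still succeeds: the quantifier expands one character at a time until the rest of the pattern can match, so it returns the same text the greedy pattern would. For example, on the string 'aaab' both 'a.*b' and 'a.*?b' match 'aaab', because there is only one 'b' to end on. Non-greedy syntax changes which match is preferred, not whether a match exists.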

Example:

In [3]:
import re

# Greedy match
text = 'aabaaab'
pattern = 'a.*b'
greedy_match = re.search(pattern, text)
print(greedy_match.group())  # Output: aabaaab

# Non-greedy match
text = 'aabaaab'
pattern = 'a.*?b'
nongreedy_match = re.search(pattern, text)
print(nongreedy_match.group())  # Output: aab



aabaaab
aab
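
When only one match is possible, the two forms agree. A small sketch, assuming the illustrative string 'aaab' (not part of the original example), shows both patterns returning the same text:

```python
import re

text = 'aaab'  # only one 'b', so there is only one way to finish the match

greedy = re.search(r'a.*b', text).group()
lazy = re.search(r'a.*?b', text).group()

print(greedy, lazy)    # aaab aaab
print(greedy == lazy)  # True
```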


# Q3. In a simple match of a string, which looks only for one match and does not do any replacement, is the use of a nontagged group likely to make any practical difference?

In a simple match of a string, which looks only for one match and does not do any replacement, the use of a nontagged group is not likely to make any practical difference. A nontagged (non-capturing) group, written (?:...), groups parts of a regular expression together without saving the matched substring for later use. A simple match reads only the overall match and never refers back to individual groups, so tagged and nontagged groups give the same result.

Example:

In [4]:
import re

# Simple match without nontagged group
text = 'hello world'
pattern = 'hello'
match = re.search(pattern, text)
print(match.group())  # Output: hello

# Simple match with nontagged group
text = 'hello world'
pattern = '(?:hello)'  # (?:...) groups without capturing
match = re.search(pattern, text)
print(match.group())  # Output: hello


hello
hello


# Q4. Describe a scenario in which using a nontagged category would have a significant impact on the program's outcomes.

Using a nontagged category has a significant impact whenever the program relies on the numbering or the contents of groups. In a search-and-replace operation, backreferences such as \1 and \2 refer only to tagged groups; a nontagged group, written (?:...), is skipped in the numbering and cannot be referenced at all. Marking a group as nontagged therefore changes which text each backreference inserts. Functions such as re.findall and re.split are affected as well, because they return the text of the tagged groups whenever the pattern contains any.

For example, suppose you have a string containing a list of names in the format "last name, first name", and you want to convert it to the format "first name last name". Both names must be tagged so that the replacement string can refer to them as \1 and \2:

In [5]:
import re

text = 'Smith, John\nDoe, Jane\nJohnson, Bob'
pattern = r'(\w+), (\w+)'  # raw string so \w is passed to the regex engine unchanged
replacement = r'\2 \1'
new_text = re.sub(pattern, replacement, text)
print(new_text)


John Smith
Jane Doe
Bob Johnson
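
The effect on re.findall and re.split can be seen by toggling a single group between tagged and nontagged. This is a minimal sketch with made-up input strings; only the standard re functions are used:

```python
import re

text = 'ab abab ababab'

# Tagged group: findall returns only what the group last captured.
tagged = re.findall(r'(ab)+', text)
# Nontagged group: findall returns each whole match.
untagged = re.findall(r'(?:ab)+', text)

print(tagged)    # ['ab', 'ab', 'ab']
print(untagged)  # ['ab', 'abab', 'ababab']

# re.split keeps the separators only when they are captured.
split_tagged = re.split(r'(,)', 'a,b')
split_untagged = re.split(r'(?:,)', 'a,b')

print(split_tagged)    # ['a', ',', 'b']
print(split_untagged)  # ['a', 'b']
```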


# Q5. Unlike a normal regex pattern, a look-ahead condition does not consume the characters it examines. Describe a situation in which this could make a difference in the results of your program.

A look-ahead condition is useful when you need to match a pattern that is followed by another pattern, but you don't want to include the second pattern in the match. Because the text examined by the look-ahead is not consumed, it stays out of the result and remains available to the next match attempt. For example, suppose you have a list of words and you want to find all the words that are followed by the word "apple". You can use a positive look-ahead to match the word without consuming the characters that come after it:

In [6]:
import re

text = "banana apple, cherry apple, grape apple"
pattern = r"\w+(?=\sapple)"  # positive look-ahead for " apple"

matches = re.findall(pattern, text)
print(matches)  


['banana', 'cherry', 'grape']
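
Not consuming the examined characters also changes how many matches are found. The sketch below, assuming the illustrative string '1234', finds digits that are followed by another digit, first with a consuming pattern and then with a look-ahead:

```python
import re

text = '1234'

consuming = re.findall(r'\d\d', text)      # each match uses up two digits
lookahead = re.findall(r'\d(?=\d)', text)  # the second digit is only inspected

print(consuming)  # ['12', '34']
print(lookahead)  # ['1', '2', '3']
```

With the consuming pattern, the digit 2 is used up by the first match and cannot start a match of its own; with the look-ahead, it can.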


# Q6. In regular expressions, what is the difference between positive look-ahead and negative look-ahead?

Positive look-ahead ((?=pattern)) matches the current position if the next characters match pattern, but it doesn't consume those characters. Negative look-ahead ((?!pattern)) matches the current position if the next characters do not match pattern, but it also doesn't consume those characters.

For example, let's say you have a string of URLs, but you want to exclude any URLs that end in ".png". You can use a negative look-ahead to exclude those URLs:

In [7]:
import re

text = "http://example.com/test.html http://example.com/image.png"
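# The negative look-ahead sits right after the scheme so that it checks the
# whole URL before any of it is consumed; placed at the end of the pattern,
# the quantifier could simply back off one character to satisfy it.
pattern = r"http://(?!\S*\.png\b)\S+"  # reject URLs that end in ".png"

matches = re.findall(pattern, text)
print(matches)  # Output: ['http://example.com/test.html']


['http://example.com/test.html']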
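
The two kinds of look-ahead can be compared side by side on the same input. This is a small sketch that assumes a made-up list of file names; the only difference between the two patterns is = versus !:

```python
import re

text = 'logo.png notes.txt photo.png data.csv'

# Positive look-ahead: the extension after the dot must be "png".
png_files = re.findall(r'\b\w+\.(?=png\b)\w+', text)
# Negative look-ahead: the extension after the dot must not be "png".
other_files = re.findall(r'\b\w+\.(?!png\b)\w+', text)

print(png_files)    # ['logo.png', 'photo.png']
print(other_files)  # ['notes.txt', 'data.csv']
```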


# Q7. What is the benefit of referring to groups by name rather than by number in a regular expression?

Referring to groups by name rather than by number makes the code more readable and less error-prone. When you use named groups, you can refer to them by their names in the rest of the regular expression, as well as in any code that uses the match object.

For example, suppose you have a list of phone numbers in the format (123) 456-7890, and you want to extract the area code and the exchange number. Here's how you can use named groups to do that:

In [8]:
import re

text = "Phone numbers: (123) 456-7890, (456) 789-0123"
pattern = r"\((?P<area>\d{3})\) (?P<exchange>\d{3})-\d{4}"  # named groups for area code and exchange

# finditer yields match objects, which allow lookup by group name
for match in re.finditer(pattern, text):
    print("Area code:", match.group("area"), "Exchange:", match.group("exchange"))  # Access named groups using their names


Area code: 123 Exchange: 456
Area code: 456 Exchange: 789
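
Names also carry over to replacement strings and to dictionaries of results. The following sketch assumes a hypothetical "last, first" name format and uses the \g<name> replacement syntax together with groupdict():

```python
import re

pattern = r"(?P<last>\w+), (?P<first>\w+)"
text = "Smith, John"

match = re.search(pattern, text)
fields = match.groupdict()
print(fields)   # {'last': 'Smith', 'first': 'John'}

# \g<name> refers to a named group inside the replacement string.
swapped = re.sub(pattern, r"\g<first> \g<last>", text)
print(swapped)  # John Smith
```

If another group is later inserted into the pattern, the names keep pointing at the same parts, whereas numbered references would shift.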


# Q8. Can you identify repeated items within a target string using named groups, as in "The cow jumped over the moon"?

Yes, you can use named groups to match repeated items in a target string. For example, suppose you have a string containing multiple occurrences of the same word, and you want to match only the words that occur more than once. Here's how you can use a named group to do that:

In [9]:
import re

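text = "The quick brown fox jumps over the lazy dog"
pattern = r"\b(?P<word>\w+)\b.*\b(?P=word)\b"  # (?P=word) must repeat the text captured by the group named word

# IGNORECASE lets "The" and "the" count as the same word
match = re.search(pattern, text, flags=re.IGNORECASE)
if match:
    print(match.group("word"))  # Output: The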
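
Applied to the sentence from the question, one possible approach is to put the named backreference inside a look-ahead, so that each word is tested without consuming the rest of the sentence. This is a sketch under the assumption that "The" and "the" should count as the same word, hence re.IGNORECASE:

```python
import re

text = "The cow jumped over the moon"

# A word is reported if the same word appears again later in the string.
pattern = r"\b(?P<word>\w+)\b(?=.*\b(?P=word)\b)"

repeated = re.findall(pattern, text, flags=re.IGNORECASE)
print(repeated)  # ['The']
```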


# Q9. When parsing a string, what is at least one thing that the Scanner interface does for you that the re.findall feature does not?

A9. The Scanner class in Python's re module (re.Scanner, which is not covered in the official documentation) provides more fine-grained control over string parsing than re.findall. A Scanner takes a list of (pattern, action) pairs, one per token type, and calls the matching action on each token it recognizes, so different types of tokens can be handled differently. It also returns whatever part of the input it could not match, whereas re.findall silently skips unmatched text.

The example below defines a Scanner that recognizes three types of tokens: WORDs (sequences of letters), NUMBERs (sequences of digits), and PUNCTUATION (runs of characters that are neither word characters nor whitespace). It also includes a pattern that matches whitespace, with None as its action, so whitespace is skipped rather than returned as a token.

The scanner.scan() method then extracts tokens from an input string. It returns a tuple of two elements: the first is a list of the results produced for the recognized tokens, and the second is the remainder of the input string that was not matched by any of the patterns. Finally, the code iterates through the list of tokens and prints each one; the output follows the code:


In [10]:
import re

scanner = re.Scanner([
    (r'[a-zA-Z]+', lambda scanner, token: ('WORD', token)),
    (r'[0-9]+', lambda scanner, token: ('NUMBER', int(token))),
    (r'[^\w\s]+', lambda scanner, token: ('PUNCTUATION', token)),
    (r'\s+', None)
])

input_str = 'The quick brown fox, jumped over the lazy dog.'

tokens, remainder = scanner.scan(input_str)

for token in tokens:
    print(token)


('WORD', 'The')
('WORD', 'quick')
('WORD', 'brown')
('WORD', 'fox')
('PUNCTUATION', ',')
('WORD', 'jumped')
('WORD', 'over')
('WORD', 'the')
('WORD', 'lazy')
('WORD', 'dog')
('PUNCTUATION', '.')
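
The handling of unmatched input is easiest to see with a scanner that deliberately lacks a rule. The sketch below uses a hypothetical word_scanner that knows only lowercase words and whitespace; it relies on the undocumented re.Scanner class, so its behavior is an assumption based on the current CPython implementation:

```python
import re

# Hypothetical mini-scanner: lowercase words and whitespace only.
word_scanner = re.Scanner([
    (r'[a-z]+', lambda scanner, token: token),
    (r'\s+', None),  # skip whitespace
])

text = 'abc 123 def'

tokens, remainder = word_scanner.scan(text)
found = re.findall(r'[a-z]+', text)

print(tokens, repr(remainder))  # ['abc'] '123 def'
print(found)                    # ['abc', 'def']
```

The scanner stops at the first character no rule can match and hands back the rest of the input, while re.findall skips over '123' without reporting it.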


# Q10. Does a scanner object have to be named scanner?

A10. No, a Scanner object can be named anything you like, as long as the name is a valid Python identifier. For example, you could write:

In [15]:
import re

my_scanner = re.Scanner([
    (r'[a-zA-Z]+', lambda scanner, token: ('WORD', token)),
    (r'[0-9]+', lambda scanner, token: ('NUMBER', int(token))),
    (r'[^\w\s]+', lambda scanner, token: ('PUNCTUATION', token)),
    (r'\s+', None)
])

input_str = 'The quick brown fox, jumped over the lazy dog.'

tokens, remainder = my_scanner.scan(input_str)

for token in tokens:
    print(token)

('WORD', 'The')
('WORD', 'quick')
('WORD', 'brown')
('WORD', 'fox')
('PUNCTUATION', ',')
('WORD', 'jumped')
('WORD', 'over')
('WORD', 'the')
('WORD', 'lazy')
('WORD', 'dog')
('PUNCTUATION', '.')


This code defines a Scanner object named my_scanner instead of scanner. The name of the object is not important; what matters is that you use the same name consistently when calling its methods.