## Assignment 17

### Q1. Explain the difference between greedy and non-greedy syntax with visual terms in as few words as possible. What is the bare minimum effort required to transform a greedy pattern into a non-greedy one? What character or characters can you introduce or change?

In regular expressions, greedy syntax matches as much as possible, while non-greedy syntax matches as little as possible. 

To transform a greedy pattern into a non-greedy one, append a single `?` to the quantifier: `*` becomes `*?`, `+` becomes `+?`, `?` becomes `??`, and `{m,n}` becomes `{m,n}?`.

For example, consider the regular expression `.*`, which matches any character, repeated zero or more times, as much as possible (i.e., greedy). To make it non-greedy, you can add a `?` after the `*`, like so: `.*?`. This will match any character, repeated zero or more times, as little as possible (i.e., non-greedy).

Here's an example in Python:

```python
import re

text = "foo bar baz"

# Greedy match: .* extends to the last "o" it can reach
pattern = "f.*o"
result = re.search(pattern, text)
print(result.group())  # Output: foo

# Non-greedy match: .*? stops at the first "o"
pattern = "f.*?o"
result = re.search(pattern, text)
print(result.group())  # Output: fo
```

In the first example, the `f.*o` pattern matches as much as possible: it runs to the last `o` in the string, giving `"foo"`. In the second example, the `f.*?o` pattern matches as little as possible and stops at the first `o`, giving just `"fo"`.
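The same contrast shows up when extracting delimited pieces with `re.findall`. The following is a minimal sketch; the sample string `html` and the variable names are illustrative assumptions, not part of the assignment.

```python
import re

# Hypothetical sample input for illustration
html = "<b>bold</b>"

# Greedy: .* runs to the last '>' in the string, so one long match is returned
greedy_tags = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first '>', so each tag is matched separately
lazy_tags = re.findall(r"<.*?>", html)

print(greedy_tags)  # ['<b>bold</b>']
print(lazy_tags)    # ['<b>', '</b>']
```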


### Q2. When exactly does greedy versus non-greedy make a difference? What if you're looking for a non-greedy match but the only one available is greedy?

Greedy and non-greedy matching are two ways of matching patterns in regular expressions. 

In greedy matching, the pattern tries to match the longest possible sequence of characters that satisfies the pattern. On the other hand, non-greedy (or lazy) matching tries to match the shortest possible sequence of characters that satisfies the pattern.

Greedy matching is the default behavior of regular expressions, and it can be changed to non-greedy matching by adding a question mark `?` to the quantifier. For example, `.*?` is a non-greedy version of `.*`.

The choice between greedy and non-greedy matching makes a difference only when the rest of the pattern could succeed at more than one end position. For example, consider the text `"ababab"`. If we search for the pattern `a.*b` with greedy matching, it matches the entire string `"ababab"`, because `.*` runs to the last `b`. With non-greedy matching (`a.*?b`), the pattern stops at the first `b` and matches only `"ab"`.

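If we ask for a non-greedy match but the only match available is the one a greedy pattern would find, the non-greedy pattern returns that same match: a lazy quantifier starts with as few characters as possible and keeps expanding until the rest of the pattern succeeds. For example, `^a.*?b$` on `"ababab"` still matches the whole string, because the `$` anchor rules out the shorter candidates. When the goal is to stop at the nearest delimiter regardless of greediness, a negated character set also works: `a[^b]*b` matches only `"ab"` in `"ababab"`, since `[^b]*` cannot run past the first `b`.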
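As a rough illustration of when the two forms differ and when they coincide, the sketch below runs both quantifiers on `"ababab"`, first unanchored and then anchored at both ends. The variable names are chosen here for illustration only.

```python
import re

text = "ababab"

# Unanchored: several end positions are possible, so the two forms differ
greedy_free = re.search(r"a.*b", text).group()
lazy_free = re.search(r"a.*?b", text).group()

# Anchored at both ends: only one match is possible, so both forms agree
greedy_anchored = re.search(r"^a.*b$", text).group()
lazy_anchored = re.search(r"^a.*?b$", text).group()

print(greedy_free, lazy_free)          # ababab ab
print(greedy_anchored, lazy_anchored)  # ababab ababab
```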

### Q3. In a simple match of a string, which looks only for one match and does not do any replacement, is the use of a nontagged group likely to make any practical difference?

In a simple match of a string, which looks only for one match and does not do any replacement, a non-tagged group is unlikely to make a practical difference: the text returned by `match.group()` is the same either way.

Non-tagged (non-capturing) groups, written `(?:...)`, group parts of a regular expression together without capturing them. They do not change what the pattern matches; they only change what is recorded, so `match.groups()` omits them and they take no group number. Skipping the capture saves a little bookkeeping, but in a single simple match the effect is negligible.

Here is an example of a simple match using non-tagged groups:

```python
import re

text = "The quick brown fox jumps over the lazy dog"
# (?:...) groups without capturing
pattern = r"(?:quick)\s(?:brown)"
match = re.search(pattern, text)
print(match.group())  # Output: quick brown
```

In this example, the non-tagged group `(?:quick)` matches the word "quick" in the text. Because we only read the whole match with `match.group()`, writing the groups as `(quick)` and `(brown)` instead would print the same result, so the choice makes no practical difference here.

The choice starts to matter when the code reads individual groups, counts them, or passes the pattern to functions such as `re.findall` or `re.split`, whose results depend on whether the pattern contains capturing groups.
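To see what a non-tagged group does and does not change in a single search, the sketch below runs a capturing and a non-capturing version of the same pattern. The variable names `capturing` and `non_capturing` are illustrative.

```python
import re

text = "The quick brown fox jumps over the lazy dog"

capturing = re.search(r"(quick)\s(brown)", text)
non_capturing = re.search(r"(?:quick)\s(?:brown)", text)

# The overall match is identical
print(capturing.group())      # quick brown
print(non_capturing.group())  # quick brown

# Only the recorded groups differ
print(capturing.groups())      # ('quick', 'brown')
print(non_capturing.groups())  # ()
```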


### Q4. Describe a scenario in which using a nontagged category would have a significant impact on the program's outcomes.

In regular expressions, a nontagged category, also known as a non-capturing group, is a group that is used for matching a pattern but is not captured as a group. It can be helpful in certain scenarios where grouping is necessary for matching, but the captured group is not required for further processing.

A nontagged category changes a program's outcomes whenever the result depends on which groups exist. `re.findall` returns the contents of the capturing groups rather than the whole match when the pattern has any, `re.split` includes captured separators in its output, and every capturing group shifts the numbers of the groups after it. This comes up in string manipulations such as URL rewriting, where parentheses are needed for grouping but only part of the match is needed for further processing; making the purely structural groups non-capturing keeps the numbering and the returned values focused on the parts that matter.

For example, consider the following regular expression that matches a URL and captures the query string parameter "id":

```python
import re

url = 'https://example.com/page?id=12345'

match = re.search(r'https://example\.com/page\?id=(\d+)', url)

if match:
    page_id = match.group(1)  # avoid shadowing the built-in id()
    print(page_id)  # Output: 12345
```

In this example, the capturing group `(\d+)` captures only the digits, so `match.group(1)` returns `12345` without the "id=" prefix. If the group is made nontagged instead, the same text is matched but group 1 no longer exists, so the ID has to be recovered from the whole match:

```python
match = re.search(r'https://example\.com/page\?id=(?:\d+)', url)

if match:
    # No capturing group exists, so slice the ID out of the whole match
    page_id = match.group(0)[len('https://example.com/page?id='):]
    print(page_id)  # Output: 12345
```

In this example, the nontagged category `(?:\d+)` matches one or more digits without capturing them as a group. The entire match is then accessed using `match.group(0)`, and everything up to and including the "id=" prefix is removed using string slicing. Calling `match.group(1)` here would raise `IndexError`, because the pattern has no capturing groups.

The two versions match exactly the same text; what differs is how the program must retrieve the result, and that is where a nontagged category affects a program's outcomes.
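A case where the program's output itself changes is when the pattern is passed to `re.findall` or `re.split`. The sketch below is a minimal illustration with made-up input strings; the variable names are assumptions for this example.

```python
import re

data = "abab ab"

# Capturing group: findall returns the group's last captured text per match
with_capture = re.findall(r"(ab)+", data)
# Non-capturing group: findall returns each whole match
without_capture = re.findall(r"(?:ab)+", data)

print(with_capture)     # ['ab', 'ab']
print(without_capture)  # ['abab', 'ab']

# re.split keeps captured separators but drops non-captured ones
split_capture = re.split(r"(,)", "a,b")
split_plain = re.split(r"(?:,)", "a,b")
print(split_capture)  # ['a', ',', 'b']
print(split_plain)    # ['a', 'b']
```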

### Q5. Unlike a normal regex pattern, a look-ahead condition does not consume the characters it examines. Describe a situation in which this could make a difference in the results of your programme.

A look-ahead assertion is a zero-width assertion, which means it does not consume any characters in the string. A positive look-ahead succeeds only if the text at the current position matches its sub-pattern; a negative look-ahead succeeds only if it does not. Because the examined characters are not consumed, they stay available to the rest of the pattern and to later matches, which makes a difference when a condition must be checked without becoming part of the match.

For example, consider a string of email addresses from which you want to match all addresses except those with a specific domain. A negative look-ahead placed right after the `@` can reject that domain before any of it is consumed. Here's an example:

```python
import re

email_addresses = "john@gmail.com, mary@yahoo.com, mike@hotmail.com, peter@domain.com"

# Match all email addresses except those with domain 'domain.com'.
# The look-ahead must sit directly after '@', before the domain is consumed.
pattern = r"\b\w+@(?!domain\.com\b)[^\s,]+"
matches = re.findall(pattern, email_addresses)
print(matches)  # Output: ['john@gmail.com', 'mary@yahoo.com', 'mike@hotmail.com']
```

In the above example, the pattern matches all email addresses whose domain is not "domain.com". The negative look-ahead `(?!domain\.com\b)` inspects the text after the `@` and fails the match if it reads "domain.com"; otherwise `[^\s,]+` goes on to consume those same characters as the domain. Placing the look-ahead after `[^\s,]+` would not work, because by then the domain has already been consumed and the assertion would be looking at the comma or the end of the string.

Without a look-ahead, the check would have to consume the domain itself, so the same characters could not then be matched by the rest of the pattern, and the exclusion would need a separate filtering step.
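Another situation where not consuming characters changes the result is overlapping matches. The sketch below is an illustration with a made-up input string: it compares a consuming pattern with a look-ahead that captures the same text without consuming it.

```python
import re

digits = "1234"

# Consuming pattern: each match uses up two characters
consumed_pairs = re.findall(r"\d\d", digits)

# Look-ahead: the pair is captured inside the assertion but not consumed,
# so the search resumes one character later and overlapping pairs are found
overlapping_pairs = re.findall(r"(?=(\d\d))", digits)

print(consumed_pairs)     # ['12', '34']
print(overlapping_pairs)  # ['12', '23', '34']
```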


### Q6. In regular expressions, what is the difference between positive look-ahead and negative look-ahead?

In regular expressions, positive look-ahead and negative look-ahead are both types of look-ahead assertions that allow for more complex pattern matching. 

Positive look-ahead, denoted as `(?=...)`, asserts that the sub-pattern inside the parentheses must match immediately after the current position. The match only succeeds if the sub-pattern is found, but the characters that matched the sub-pattern are not consumed and are not included in the overall match.

Negative look-ahead, denoted as `(?!...)`, asserts that the sub-pattern inside the parentheses must not match immediately after the current position. The match only succeeds if the sub-pattern is not found. Like positive look-ahead, it consumes no characters and adds nothing to the overall match.

Here are some examples:

```python
import re

# Positive look-ahead example: match a word that is followed by "python"
text = "I love python programming"
pattern = r"\b\w+(?= python)"
match = re.search(pattern, text)
print(match.group())  # output: love

# Negative look-ahead example: match a word that is not followed by "programming"
text = "I love python programming"
pattern = r"\b\w+(?! programming)"
match = re.search(pattern, text)
print(match.group())  # output: I
```

In the first example, the positive look-ahead `(?= python)` asserts that the word must be followed by the string " python", but that text is not included in the match. The second example uses a negative look-ahead `(?! programming)` to assert that the word must not be followed by " programming". `re.search` returns the first match it finds, and the first word, "I", already satisfies the assertion, so the result is "I" rather than "love".
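One subtlety with negative look-ahead is that a quantifier placed before it can backtrack to a shorter match that satisfies the assertion. The sketch below compares `\b\w+(?! programming)` with a variant that adds a trailing `\b`; the variant is a suggested refinement rather than part of the original answer.

```python
import re

text = "I love python programming"

# Without a trailing \b, \w+ backtracks from "python" to "pytho",
# which is no longer directly followed by " programming"
loose = re.findall(r"\b\w+(?! programming)", text)

# With a trailing \b, only whole words can match, so "python" is excluded
strict = re.findall(r"\b\w+\b(?! programming)", text)

print(loose)   # ['I', 'love', 'pytho', 'programming']
print(strict)  # ['I', 'love', 'programming']
```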

### Q7. What is the benefit of referring to groups by name rather than by number in a regular expression?

Referring to groups by name instead of number in a regular expression has several benefits:

1. Clarity: Naming groups makes the regular expression more readable and easier to understand, especially for people who are not familiar with the pattern.

2. Maintainability: If the regular expression changes and the group numbers shift, using names ensures that the code doesn't break, whereas using group numbers could cause errors.

3. Reusability: A named group can be referenced again by name, inside the same pattern with `(?P=name)` and in a replacement string with `\g<name>`, so later references do not depend on the group's position.

To define a named group, the syntax is `(?P<name>pattern)`, where `name` is the name of the group and `pattern` is the regular expression pattern that defines the group. The matched text is then read with `match.group('name')`.

Here is an example that shows the difference between referring to groups by name and by number:

```python
import re

# Using group number
text = 'Hello, world!'
pattern = r'(\w+), (\w+)'
match = re.match(pattern, text)
print(match.group(1)) # Output: Hello
print(match.group(2)) # Output: world

# Using group name
text = 'Hello, world!'
pattern = r'(?P<greeting>\w+), (?P<target>\w+)'
match = re.match(pattern, text)
print(match.group('greeting')) # Output: Hello
print(match.group('target')) # Output: world
```

In this example, we have two groups in our regular expression pattern: `(\w+)` and `(\w+)`. In the first case, we refer to the groups by their position (1 and 2). In the second case, we use the `(?P<name>pattern)` syntax to name our groups and refer to them by name (`greeting` and `target`).
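Named groups also work with `match.groupdict()` and with `\g<name>` references in replacement strings. The following is a minimal sketch; the date pattern, the sample sentence, and the group names `year`, `month`, and `day` are assumptions made up for this illustration.

```python
import re

# Hypothetical ISO-style date pattern with three named groups
date_pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
sentence = "Released on 2024-03-15."

# groupdict() maps each group name to the text it matched
parts = date_pattern.search(sentence).groupdict()
print(parts)  # {'year': '2024', 'month': '03', 'day': '15'}

# \g<name> refers to a named group inside a replacement string
reordered = date_pattern.sub(r"\g<day>/\g<month>/\g<year>", sentence)
print(reordered)  # Released on 15/03/2024.
```

If a new group were later added in front of `year`, both calls would keep working unchanged, whereas numeric references such as `\1` would need renumbering.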

### Q8. Can you identify repeated items within a target string using named groups, as in "The cow jumped over the moon"?

Yes, we can identify repeated items within a target string using named groups. The syntax `(?P<name>pattern)` defines a named group, and the backreference `(?P=name)` matches the same text again later in the pattern.

Here's an example:

```python
import re

# Named group 'word' followed by whitespace and the same word again
pattern = r'\b(?P<word>\w+)\s+(?P=word)\b'

# The sentence from the question has no word repeated back to back
text = 'The cow jumped over the moon'
print(re.findall(pattern, text))  # Output: []

# A variant with an accidentally doubled word
text = 'The cow jumped over the the moon'
print(re.findall(pattern, text))  # Output: ['the']
```

In this example, the named group 'word' matches a sequence of one or more word characters (`\w+`) starting at a word boundary (`\b`). The backreference `(?P=word)` then matches only the exact text that the group 'word' just matched, and the trailing `\b` prevents a partial hit such as "the theory".

Because the pattern contains one capturing group, `re.findall(pattern, text)` returns the text of that group for each match. The original sentence contains "The" and "the", but they differ in case and are not adjacent, so the result is `[]`; the variant with "the the" returns `['the']`.
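A named backreference can also ignore case and drive a replacement. The sketch below is one possible approach, using a made-up input with two doubled words; the names `repeated` and `deduplicated` are illustrative.

```python
import re

# Hypothetical input with doubled words, one pair differing in case
text = "The the cow jumped over the the moon"

# re.IGNORECASE makes the backreference match regardless of case
pattern = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b", re.IGNORECASE)

repeated = [m.group("word") for m in pattern.finditer(text)]
print(repeated)  # ['The', 'the']

# Replace each doubled word with a single copy of its first occurrence
deduplicated = pattern.sub(r"\g<word>", text)
print(deduplicated)  # The cow jumped over the moon
```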

### Q9. When parsing a string, what is at least one thing that the Scanner interface does for you that the re.findall feature does not?

`Scanner` is an undocumented class in Python's `re` module (`re.Scanner`) for tokenizing input based on regular expressions. Here are some differences between `Scanner` and `re.findall()`:

- `Scanner` pairs each regular expression with an action. The action can be a function that converts or labels the matched text, or `None` to skip it, so different token types are handled in one pass.
- `Scanner.scan()` returns a tuple of the token list and the unmatched remainder of the string, so you can tell where scanning stopped. `re.findall()` returns only a list of non-overlapping matches and silently skips text that does not match.
- `Scanner` matches tokens contiguously from the start of the string and stops at the first position where no pattern matches, whereas `re.findall()` searches ahead for the next match.

Here's an example of using `Scanner` to tokenize a string based on whitespace and punctuation:

```python
import re

text = "Hello, world! This is a test."

scanner = re.Scanner([
    (r"\w+", lambda s, t: ("WORD", t)),
    (r"\s+", lambda s, t: ("SPACE", t)),
    (r"[^\w\s]+", lambda s, t: ("PUNCT", t)),
])

tokens, remainder = scanner.scan(text)

print(tokens)
```

Output:
```
[('WORD', 'Hello'), ('PUNCT', ','), ('SPACE', ' '), ('WORD', 'world'), ('PUNCT', '!'), ('SPACE', ' '), ('WORD', 'This'), ('SPACE', ' '), ('WORD', 'is'), ('SPACE', ' '), ('WORD', 'a'), ('SPACE', ' '), ('WORD', 'test'), ('PUNCT', '.')]
```

Here, we define a scanner object that tokenizes the input string into words, whitespace, and punctuation. Each lexicon entry is a tuple containing the pattern and a lambda function that receives the scanner and the matched text and returns a tuple of the token type and value. `scan()` returns the resulting tokens as a list, together with any unmatched remainder, which is empty here because every character belongs to some token.
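The remainder is the part `re.findall` has no equivalent for. The sketch below is a minimal illustration, assuming the undocumented `re.Scanner` class behaves as it does in current CPython; the input string and the name `number_scanner` are made up for this example.

```python
import re

source = "12 34 abc 56"

number_scanner = re.Scanner([
    (r"\d+", lambda scanner, token: int(token)),  # action converts the token
    (r"\s+", None),                               # None means: skip this token
])

# scan() stops at the first text no pattern matches and reports what is left
tokens, remainder = number_scanner.scan(source)
print(tokens)     # [12, 34]
print(remainder)  # abc 56

# findall skips over "abc" without any indication that it did so
found = re.findall(r"\d+", source)
print(found)      # ['12', '34', '56']
```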

### Q10. Does a scanner object have to be named scanner?

No. `scanner` is just a variable name; the object returned by `re.Scanner(...)` can be bound to any valid Python identifier.

For example:

```python
import re

# Create a Scanner object and assign it to the variable 'my_tokenizer'
my_tokenizer = re.Scanner([
    (r"\d+", lambda s, t: int(t)),  # convert digit runs to integers
    (r"\s+", None),                 # skip whitespace
])

tokens, remainder = my_tokenizer.scan("1 22 333")
print(tokens)  # Output: [1, 22, 333]
```

In this example, `re.Scanner()` returns a scanner object, which is assigned to the variable `my_tokenizer`. It behaves exactly as it would under the name `scanner`; only the class name `Scanner` is fixed.