Skip to content

Files

Latest commit

 

History

History
105 lines (38 loc) · 8.65 KB

mastering-regular-expressions--3rd-edition.md

File metadata and controls

105 lines (38 loc) · 8.65 KB

Mastering Regular Expressions, 3rd Edition

> Home

1: Introduction to Regular Expressions

The sequence ⌈.⌋ is described as an escaped period or escaped dot, and you can do this with all the normal metacharacters, except in a character-class (link)

This yields ⌈<([A-Za-z]+)•+1>⌋. (link)

This is called the interval quantifier. For example, ⌈···{3,12}⌋ matches up to 12 times if possible, but settles for three. One might use ⌈[a-zA-Z]{1,5}⌋ to match a US stock ticker (from one to five letters). (link)

Let’s look at matching color or colour. Since they are the same except that one has a u and the other doesn’t, we can use ⌈colou?r⌋ to match either. (link)

The rules about which characters are and aren’t metacharacters (and exactly what they mean) are different inside a character class. For example, dot is a metacharacter outside of a class, but not within one. Conversely, a dash is a metacharacter within a class (usually), but not outside. Moreover, a caret has one meaning outside, another if specified inside a class immediately after the opening [, and a third if given elsewhere in the class. (link)

For example, ⌈Bob⌋ and ⌈Robert⌋ are separate expressions, but ⌈Bob|Robert⌋ is one expression that matches either. When combined this way, the subexpressions are called alternatives. (link)

Probably the easiest metacharacters to understand are ⌈^⌋ (caret) and ⌈$⌋ (dollar), which represent the start and end, respectively, of the line of text as it is being checked. (link)

Ther egular-expression construct ⌈[···]⌋, usually called a character class, lets you list the characters you want to allow at that point in the match. (link)

Within a character class, the character-class metacharacter ‘-’ (dash) indicates a range of characters: ⌈<H[1-6]>⌋ is identical to the previous example. (link)

Negated character classes If you use ⌈[^···]⌋ instead of ⌈[···]⌋, the class matches any character that isn’t listed. For example, ⌈[^1-6]⌋ matches a character that’s not 1 through 6 (link)

In ⌈03[-./]l9[-./]76⌋, the dots are not metacharacters because they are within a character class. (Remember, the list of metacharacters and their meanings are different inside and outside of character classes.) The dashes are also not class metacharacters in this case because each is the first thing after [ or [^. (link)

Vikram: [- within character class ie [] has two meanings. as first character, it has literal meaning, other places it denotes range. outside of character class it has literal meaning.]

Remember, a negated character class means “match a character that’s not listed” and not “don’t match what is listed.” (link)

Vikram: [example q[^u]]

egrep’s command-line option “-i” tells it to do a case-insensitive match. (link)

We can accomplish this by using parentheses to “constrain” the alternation:

⌈^(From|Subject|Date):•⌋ (link)

The expression ⌈⌋ literally means “match if we can find a start-of-word position, followed immediately by c·a·t, followed immediately by an end-of-word position.” (link)

Note that ⌈<⌋ and ⌈>⌋ alone are not metacharacters — when combined with a backslash, the sequences become special. This is why I called them “metasequences.” (link)

The expression ⌈⌋ literally means “match if we can find a start-of-word position, followed immediately by c·a·t, followed immediately by an end-of-word position.” More naturally, it means “find the word cat.” (link)

We can accomplish this by using parentheses to “constrain” the alternation:

⌈^(From|Subject|Date):•⌋ (link)

A very convenient metacharacter is ⌈|⌋, which means “or.” It allows you to combine multiple expressions into a single expression that matches any of the individual ones. For example, ⌈Bob⌋ and ⌈Robert⌋ are separate expressions, but ⌈Bob|Robert⌋ is one expression that matches either. When combined this way, the subexpressions are called alternatives. (link)

Remember, a negated character class means “match a character that’s not listed” and not “don’t match what is listed.” These might seem the same, but the Iraq example shows the subtle difference. (link)

Just as the English word “wind” can mean different things depending on the context (sometimes a strong breeze, sometimes what you do to a clock), so can a metacharacter. (link)

The rules regarding which metacharacters are supported (and what they do) are completely different inside and outside of character classes. (link)

the only special characters within the class in ⌈[0-9A-Z_!.?]⌋ are the two dashes). (link)

Within a character class, the character-class metacharacter ‘-’ (dash) indicates a range of characters: ⌈<H[1-6]>⌋ is identical to the previous example. ⌈[0-9]⌋ and ⌈[a-z]⌋ are common shorthands for classes to match digits and English lowercase letters, respectively. (link)

The contents of a class is a list of characters that can match at that point, so the implication is “or.” (link)

Ther egular-expression construct ⌈[···]⌋, usually called a character class, lets you list the characters you want to allow at that point in the match. (link)

⌈^cat⌋ matches if you have the beginning of a line, followed immediately by c, followed immediately by a, followed immediately by t (link)

Similarly, ⌈cat$⌋ finds c·a·t only at the end of the line, such as a line ending with scat (link)

Just as there is a difference between playing a musical piece well and making music, there is a difference between knowing about regular expressions and really understanding them (link)

> Home