# Reading 17: Regular Expressions & Grammars (JAVA)
## Objectives
After today’s class, you should:

* Understand the ideas of grammar productions and regular expression operators
* Be able to read a grammar or regular expression and determine whether it matches a sequence of characters
* Be able to write a grammar or regular expression to match a set of character sequences and parse them into a data structure

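Today’s reading introduces several ideas:

* grammars, with productions, nonterminals, terminals, and operators
* regular expressions

Some program modules take input or produce output in the form of a sequence of bytes or a sequence of characters, which is called a *string* when it is simply stored in memory, or a *stream* when it flows into or out of a module. In today’s reading, we talk about how to write a specification for such a sequence. Concretely, a sequence of bytes or characters might be:

* a string
* a file on disk, in which case the specification is called the file format
* messages sent over a network, in which case the specification is a wire protocol
* a command typed by the user on the console, in which case the specification is a command line interface

For these kinds of sequences, we introduce the notion of a grammar, which allows us not only to distinguish between legal and illegal sequences, but also to parse a sequence into a data structure that a program can work with. The data structure produced from a grammar will often be a recursive data type like the ones we talked about in the recursive data type reading.

We also talk about a specialized form of a grammar called a regular expression. In addition to being used for specification and parsing, regular expressions are a widely used tool for many string-processing tasks that need to disassemble a string, extract information from it, or transform it.

The next reading will talk about parser generators, a kind of tool that translates a grammar automatically into a parser for that grammar.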

## Grammars
To describe a string of symbols, whether they are bytes, characters, or some other kind of symbol drawn from a fixed set, we use a compact representation called a grammar.

A grammar defines a set of strings. Suppose we want to write a grammar that represents URLs. Our grammar for URLs will specify the set of strings that are legal URLs in the HTTP protocol.

The literal strings in a grammar are called terminals. They’re called terminals because they can’t be expanded any further. We generally write terminals in quotes, like 'http' or ':'.

A grammar is described by a set of productions, where each production defines a nonterminal. You can think of a nonterminal like a variable that stands for a set of strings, and the production as the definition of that variable in terms of other variables (nonterminals), operators, and constants (terminals). Nonterminals are internal nodes of the tree representing a string.

A production in a grammar has the form

> nonterminal ::= expression of terminals, nonterminals, and operators

**A grammar is simply a set of productions.**
grammar = {production}
production = {nonterminal ::= expression of terminals, nonterminals, and operators}

**One of the nonterminals of the grammar is designated as the root**. The set of strings that the grammar recognizes are the ones that match the root nonterminal. This nonterminal is sometimes called root or start or even just S, but in the grammars below we will typically choose more readable names for the root, like url, html, and markdown.

So a grammar that represents a singleton set, allowing only one specific URL, might have just one production defining the nonterminal url, with a terminal on the righthand side:
>url ::= 'http://mit.edu/'

### Grammar operators

Productions can use operators to combine terminals and nonterminals on the righthand side. The three most important operators in a production expression are:

Repetition, represented by *:
> x ::= y*        // x matches zero or more y

Concatenation, represented not by a symbol, but just a space:
> x ::= y z       // x matches y followed by z 

Union, also called alternation, represented by |:
> x ::= y | z     // x matches either y or z 

By convention, postfix operators like * have highest precedence, which means they are applied first. Concatenation is applied next. Alternation | has lowest precedence, which means it is applied last. Parentheses can be used to override precedence:
> m ::= a (b|c) d      // m matches a, followed by either b or c, followed by d
x ::= (y z | a b)*   // x matches zero or more yz or ab pairs

Let’s use these operators to generalize our url grammar to match some other hostnames, such as http://stanford.edu/ and http://google.com/:

> url ::= 'http://' hostname '/'
hostname ::= 'mit.edu' | 'stanford.edu' | 'google.com'

The first rule of this grammar demonstrates concatenation. The url nonterminal matches strings that start with the literal string http://, followed by a match to the hostname nonterminal, followed by the literal string /.

The hostname rule demonstrates union. A hostname can match one of the three literal strings, mit.edu or stanford.edu or google.com.

So this grammar represents the set of three strings, http://mit.edu/, http://google.com/, and http://stanford.edu/.
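
One way to see concatenation and union at work is to translate the two productions into a Java regular expression, a notation this reading develops further on. The following is a minimal sketch rather than a definitive implementation: the class name `SimpleUrlGrammar` and the method `isUrl` are made up for illustration, and each `.` in a hostname is backslash-escaped so that it matches a literal period.

```java
import java.util.regex.Pattern;

final class SimpleUrlGrammar {
    // hostname ::= 'mit.edu' | 'stanford.edu' | 'google.com'   (union)
    private static final String HOSTNAME = "(mit\\.edu|stanford\\.edu|google\\.com)";

    // url ::= 'http://' hostname '/'                           (concatenation)
    private static final Pattern URL = Pattern.compile("http://" + HOSTNAME + "/");

    /** Returns true if the whole string s is one of the three URLs in the grammar. */
    static boolean isUrl(String s) {
        return URL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUrl("http://mit.edu/"));      // true
        System.out.println(isUrl("http://example.org/"));  // false
    }
}
```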

Let’s take it one step further by allowing any lowercase word in place of mit, stanford, google, com and edu:

> url ::= 'http://' hostname '/'
hostname ::= word '.' word
word ::= letter*
letter ::= ('a' | 'b' | 'c' | 'd' | 'e' | 'f' | 'g' | 'h' | 'i' 
                | 'j' | 'k' | 'l' | 'm' | 'n' | 'o' | 'p' | 'q' 
                | 'r' | 's' | 't' | 'u' | 'v' | 'w' | 'x' | 'y' | 'z')


The new word rule matches a string of zero or more lowercase letters, so the overall grammar can now match http://alibaba.com/ and http://zyxw.edu/ as well. Unfortunately word can also match an empty string, so this url grammar also matches http://./, which is not a legal URL. Here’s a verbose way to fix that, which requires word to match at least one letter.

> word ::= letter letter*
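
The difference between the two versions of `word` can be checked with a small Java sketch. It assumes the grammar is written as a regular expression, with `letter` abbreviated as the character class `[a-z]` (a shorthand the reading introduces shortly); the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class WordRule {
    // word ::= letter*          (zero or more letters, so word may be empty)
    private static final Pattern WITH_STAR =
            Pattern.compile("http://[a-z]*\\.[a-z]*/");

    // word ::= letter letter*   (at least one letter)
    private static final Pattern AT_LEAST_ONE =
            Pattern.compile("http://[a-z][a-z]*\\.[a-z][a-z]*/");

    static boolean matchesOriginal(String s) {
        return WITH_STAR.matcher(s).matches();
    }

    static boolean matchesFixed(String s) {
        return AT_LEAST_ONE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesOriginal("http://./"));      // true: both words are empty
        System.out.println(matchesFixed("http://./"));         // false
        System.out.println(matchesFixed("http://zyxw.edu/"));  // true
    }
}
```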


### More grammar operators
You can also use additional operators which are just syntactic sugar (i.e., they’re equivalent to combinations of the main three operators).

0 or 1 occurrence is represented by ?:

> x ::= y?       // an x is a y or is the empty string

1 or more occurrences is represented by +:

>x ::= y+       // an x is one or more y
                //    equivalent to  x ::= y y*

An exact number of occurrences is represented by {n}, and a range of occurrences by {n,m}, {n,}, or {,m}:

> x ::= y{3}     // an x is three y
               // equivalent to x ::= y y y 

> x ::= y{1,3}   // an x is between one and three y
               // equivalent to x ::= y | y y | y y y

> x ::= y{,4}    // an x is at most four y
               // equivalent to x ::=   | y | y y | y y y | y y y y
               //                     ^--- note the empty string here, so this can match zero y's

> x ::= y{2,}    // an x is two or more y
               // equivalent to x ::= y y y*
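
The claimed equivalences can be spot-checked in Java by comparing each sugared form against its expansion on a handful of inputs. This is only a sketch: `RepetitionSugar` and `agree` are illustrative names, and a few sample strings are not a proof of equivalence. Note one assumption about the library: the `java.util.regex.Pattern` documentation lists `{n}`, `{n,}`, and `{n,m}` but not `{,m}`, so the sketch writes `{0,4}` in place of `{,4}`.

```java
final class RepetitionSugar {
    /** Returns true if both regexes give the same match/no-match answer on every input. */
    static boolean agree(String sugared, String expanded, String... inputs) {
        for (String input : inputs) {
            if (input.matches(sugared) != input.matches(expanded)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] inputs = { "", "y", "yy", "yyy", "yyyy", "yyyyy" };
        System.out.println(agree("y?", "|y", inputs));                  // true
        System.out.println(agree("y+", "yy*", inputs));                 // true
        System.out.println(agree("y{3}", "yyy", inputs));               // true
        System.out.println(agree("y{1,3}", "y|yy|yyy", inputs));        // true
        System.out.println(agree("y{0,4}", "|y|yy|yyy|yyyy", inputs));  // true
        System.out.println(agree("y{2,}", "yyy*", inputs));             // true
    }
}
```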

A character class [...] represents the set of single-character strings matching any of the characters listed in the square brackets:

> x ::= [aeiou]  // equivalent to  x ::= 'a' | 'e' | 'i' | 'o' | 'u'

Ranges of characters can be described compactly using -:
> x ::= [a-ckx-z]    // equivalent to  x ::= 'a' | 'b' | 'c' | 'k' | 'x' | 'y' | 'z'

An inverted character class [^...] represents the set of single-character strings matching a character not listed in the brackets:

>x ::= [^a-c]  // equivalent to  x ::= 'd' | 'e' | 'f' | ... | '0' | '1' | '2' | ... | '!' | '@'
              //                          | ... (all other possible characters)
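
Java's regex syntax uses the same character-class notation, so the three examples above can be tried directly. The sketch below wraps each class in a hypothetical helper method; `String.matches` requires the entire string to match, so each helper accepts only single-character strings.

```java
final class CharClasses {
    // x ::= [aeiou]
    static boolean isVowel(String s) {
        return s.matches("[aeiou]");
    }

    // x ::= [a-ckx-z]   i.e. a, b, c, k, x, y, or z
    static boolean inRanges(String s) {
        return s.matches("[a-ckx-z]");
    }

    // x ::= [^a-c]      any single character other than a, b, or c
    static boolean notAtoC(String s) {
        return s.matches("[^a-c]");
    }

    public static void main(String[] args) {
        System.out.println(isVowel("e"));    // true
        System.out.println(inRanges("k"));   // true
        System.out.println(inRanges("d"));   // false
        System.out.println(notAtoC("!"));    // true
        System.out.println(notAtoC("b"));    // false
    }
}
```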


These additional operators allow the word production to be expressed more compactly:

> url ::= 'http://' hostname '/'
hostname ::= word '.' word
word ::= [a-z]+

### Recursion in grammars
How else do we need to generalize? Hostnames can have more than two components, and there can be an optional port number:

> http://didit.csail.mit.edu:4949/

To handle this kind of string, the grammar is now:

> url ::= 'http://' hostname (':' port)? '/'
hostname ::= word '.' hostname | word '.' word
port ::= [0-9]+
word ::= [a-z]+


Notice how hostname is now defined recursively in terms of itself. Which part of the hostname definition is the base case, and which part is the recursive step? What kinds of hostnames are allowed?

Using the repetition operator, we could also write hostname without recursion, like this:

> hostname ::= (word '.')+ word

Recursion can sometimes be eliminated from a grammar using operators like this, but not always.
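
To make the two formulations of `hostname` concrete, here is a sketch in Java that checks a string both ways: once with a recursive method that mirrors the recursive production, and once with the repetition-based form written as a regex. The class and method names are invented for this example. Because a `word` never contains a dot, the recursive version can safely split at the first dot.

```java
final class HostnameRecursion {
    // word ::= [a-z]+
    static boolean isWord(String s) {
        return s.matches("[a-z]+");
    }

    // hostname ::= word '.' hostname | word '.' word
    static boolean isHostname(String s) {
        int dot = s.indexOf('.');
        if (dot < 0) {
            return false;                  // both alternatives require a '.'
        }
        String first = s.substring(0, dot);
        String rest = s.substring(dot + 1);
        if (!isWord(first)) {
            return false;
        }
        return isWord(rest)                // word '.' word
            || isHostname(rest);           // word '.' hostname
    }

    // hostname ::= (word '.')+ word
    static boolean isHostnameByRepetition(String s) {
        return s.matches("([a-z]+\\.)+[a-z]+");
    }

    public static void main(String[] args) {
        System.out.println(isHostname("didit.csail.mit.edu"));              // true
        System.out.println(isHostnameByRepetition("didit.csail.mit.edu"));  // true
        System.out.println(isHostname("mit"));                              // false
    }
}
```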

Another thing to observe is that this grammar allows port numbers that are not technically legal, since port numbers can only range from 0 to 65535 (2^16 - 1). We could write a more complex definition of port that would match only these integers, but that’s not typically done in a grammar. Instead, the constraint 0 ≤ port ≤ 65535 would be specified in the program that uses the grammar.
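
A program that uses the grammar might enforce the port constraint roughly as follows. This is a sketch under assumptions: the grammar is written as a Java regex, the port digits are picked out with a capturing group (group 3 in this particular pattern), and `PortCheck` and `isLegalUrl` are hypothetical names. `BigInteger` is used so that an arbitrarily long run of digits cannot overflow.

```java
import java.math.BigInteger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PortCheck {
    // url ::= 'http://' hostname (':' port)? '/'
    // Groups: 1 = last (word '.'), 2 = (':' port), 3 = the port digits.
    private static final Pattern URL =
            Pattern.compile("http://([a-z]+\\.)+[a-z]+(:([0-9]+))?/");

    private static final BigInteger MAX_PORT = BigInteger.valueOf(65535);

    /** Returns true if s matches the grammar and its port, if any, is in 0..65535. */
    static boolean isLegalUrl(String s) {
        Matcher m = URL.matcher(s);
        if (!m.matches()) {
            return false;                 // rejected by the grammar
        }
        String port = m.group(3);
        if (port == null) {
            return true;                  // optional port was omitted
        }
        return new BigInteger(port).compareTo(MAX_PORT) <= 0;  // checked outside the grammar
    }

    public static void main(String[] args) {
        System.out.println(isLegalUrl("http://didit.csail.mit.edu:4949/"));  // true
        System.out.println(isLegalUrl("http://mit.edu:65536/"));             // false
    }
}
```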

There are more things we should do to go further:

* generalizing http to support the additional protocols that URLs can have
* generalizing the / at the end to a slash-separated path
* allowing hostnames with the full set of legal characters instead of just a-z




















## Parse trees

Matching a grammar against a string can generate a parse tree that shows how parts of the string correspond to parts of the grammar.

The leaves of the parse tree are labeled with terminals, representing the parts of the string that have been parsed. They don’t have any children, and can’t be expanded any further. If we concatenate the leaves together in order, we get back the original string. A trivial example is the one-line URL grammar that we started with, whose (only possible) parse tree is shown at the right:
![](ref/lect12/2023-08-05-13-59-37.png)

> url ::= 'http://mit.edu/'



The internal nodes of the parse tree are labeled with nonterminals, since they all have children. The immediate children of a nonterminal node must follow the pattern of the nonterminal’s production rule in the grammar. For example, in our more elaborate URL grammar that allows any two-part hostname, the children of a hostname node in the tree must follow the pattern of the hostname rule, word '.' word. The figure on the right shows the parse tree produced by matching this grammar against http://mit.edu/:

> url ::= 'http://' hostname '/'
hostname ::= word '.' word
word ::= [a-z]+

![](ref/lect12/2023-08-05-14-04-09.png)

For a more elaborate example, here is the parse tree for the recursive URL grammar. The tree has more structure now. The hostname and word nonterminals are labeling nodes of the tree whose subtrees match those rules in the grammar.

>url ::= 'http://' hostname (':' port)? '/' 
hostname ::= word '.' hostname | word '.' word
port ::= [0-9]+
word ::= [a-z]+

![](ref/lect12/2023-08-05-15-04-13.png)
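
A parse tree maps naturally onto a small recursive data type. The sketch below is one possible representation, not the one a parser generator would necessarily produce: `ParseNode` and its methods are invented names, and the example tree for http://mit.edu/ is built by hand, assuming each `word` node has a single leaf holding its text. It illustrates the property stated earlier, that concatenating the leaves in order gives back the original string.

```java
import java.util.List;

final class ParseNode {
    final String label;               // nonterminal name, or the matched text for a leaf
    final List<ParseNode> children;   // empty for a terminal leaf

    private ParseNode(String label, List<ParseNode> children) {
        this.label = label;
        this.children = children;
    }

    /** A leaf labeled with the terminal text it matched. */
    static ParseNode leaf(String terminalText) {
        return new ParseNode(terminalText, List.of());
    }

    /** An internal node; this sketch assumes it has at least one child. */
    static ParseNode node(String nonterminal, ParseNode... children) {
        return new ParseNode(nonterminal, List.of(children));
    }

    /** Concatenates the leaves under this node, left to right. */
    String text() {
        if (children.isEmpty()) {
            return label;
        }
        StringBuilder sb = new StringBuilder();
        for (ParseNode child : children) {
            sb.append(child.text());
        }
        return sb.toString();
    }

    /** Hand-built tree for http://mit.edu/ using hostname ::= word '.' word. */
    static ParseNode exampleTree() {
        return node("url",
                leaf("http://"),
                node("hostname",
                        node("word", leaf("mit")),
                        leaf("."),
                        node("word", leaf("edu"))),
                leaf("/"));
    }

    public static void main(String[] args) {
        System.out.println(exampleTree().text());  // http://mit.edu/
    }
}
```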







### Example: Markdown and HTML
Now let’s look at grammars for some file formats. We’ll be using two different markup languages that represent typographic style in text. Here they are:

**Markdown:**

`This is _italic_.`

**HTML:**

`Here is an <i>italic</i> word.`

For simplicity, our example HTML and Markdown grammars will only specify italics, but other text styles are of course possible. Also for simplicity, we will assume the plain text between the formatting delimiters isn’t allowed to use any formatting punctuation, like _ or <.

Here’s the grammar for our simplified version of Markdown:

>markdown ::= ( normal | italic ) *
italic ::= '_' normal '_'
normal ::= text
text ::= [^_]*

![](ref/lect12/2023-08-05-15-26-03.png)

Here’s the grammar for our simplified version of HTML:

> html ::= ( normal | italic ) *
italic ::= '<i>' html '</i>'
normal ::= text
text ::= [^<>]*

![](ref/lect12/2023-08-05-15-28-35.png)

## Regular expressions
A regular grammar has a special property: by substituting every nonterminal (except the root one) with its righthand side, you can reduce it down to a single production for the root, with only terminals and operators on the right-hand side.

Our URL grammar is regular. By replacing nonterminals with their productions, it can be reduced to a single expression:

>url ::= 'http://' ([a-z]+ '.')+ [a-z]+ (':' [0-9]+)? '/' 

The Markdown grammar is also regular:
> markdown ::= ([^_]* | '_' [^_]* '_' )*

But our HTML grammar can’t be reduced completely. By substituting righthand sides for nonterminals, you can eventually reduce it to something like this:

```regex
html ::= ( [^<>]* | '<i>' html '</i>' )*
```
Unfortunately, the recursive use of html on the righthand side can’t be eliminated, and can’t be simply replaced by a repetition operator either. So the HTML grammar is not regular.
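
Although the simplified HTML grammar cannot be collapsed into a single regular expression, it can still be matched by code whose structure follows the recursion in the grammar. The following hand-written recursive matcher is only a sketch of that idea, with invented names (`HtmlMatcher`, `matches`); it answers yes or no and does not build a parse tree.

```java
final class HtmlMatcher {
    private final String s;
    private int pos = 0;   // index of the next unread character

    private HtmlMatcher(String s) {
        this.s = s;
    }

    /** Returns true if the whole string matches the simplified html grammar. */
    static boolean matches(String s) {
        HtmlMatcher m = new HtmlMatcher(s);
        return m.html() && m.pos == s.length();
    }

    // html ::= ( normal | italic )*
    private boolean html() {
        while (pos < s.length()) {
            char c = s.charAt(pos);
            if (s.startsWith("<i>", pos)) {
                if (!italic()) {
                    return false;
                }
            } else if (c != '<' && c != '>') {
                pos++;        // one character of normal text, [^<>]
            } else {
                break;        // e.g. "</i>": leave it for the enclosing italic()
            }
        }
        return true;
    }

    // italic ::= '<i>' html '</i>'
    private boolean italic() {
        pos += "<i>".length();            // caller already saw "<i>" here
        if (!html()) {                    // recursive use of html
            return false;
        }
        if (!s.startsWith("</i>", pos)) {
            return false;                 // missing closing tag
        }
        pos += "</i>".length();
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("Here is an <i>italic</i> word."));  // true
        System.out.println(matches("a<i>b<i>c</i></i>d"));              // true
        System.out.println(matches("<i>unclosed"));                     // false
    }
}
```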

The reduced expression of terminals and operators can be written in an even more compact form, called a regular expression. A regular expression does away with the quotes around the terminals, and the spaces between terminals and operators, so that it consists just of terminal characters, parentheses for grouping, and operator characters. For example, the regular expression for our markdown format is just

```regex
([^_]*|_[^_]*_)*
```
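
As a quick check, the regular expression above can be pasted into Java essentially unchanged, since it contains no characters that need extra escaping inside a Java string literal. The class and method names in this sketch are illustrative only.

```java
import java.util.regex.Pattern;

final class MarkdownRegex {
    // markdown ::= ([^_]* | '_' [^_]* '_')*
    private static final Pattern MARKDOWN = Pattern.compile("([^_]*|_[^_]*_)*");

    /** Returns true if the whole string matches the simplified Markdown grammar. */
    static boolean isMarkdown(String s) {
        return MARKDOWN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMarkdown("This is _italic_."));  // true
        System.out.println(isMarkdown("a_b"));                // false: unpaired underscore
    }
}
```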

Regular expressions are also called regexes for short. A regex is far less readable than the original grammar, because it lacks the nonterminal names that documented the meaning of each subexpression. But many programming languages have library support for regexes (and not for grammars), and regexes are much faster to match than a grammar.

The regex syntax commonly implemented in programming language libraries has a few more special characters, in addition to the ones we used above in grammars. Here are some common useful ones:

```regex
.   // matches any single character (but sometimes excluding newline, depending on the regex library)

\d  // matches any digit, same as [0-9]
\s  // matches any whitespace character, including space, tab, newline
\w  // matches any word character including underscore, same as [a-zA-Z_0-9]
```
Backslash is also used to “escape” an operator or special character so that it matches literally. Here are some of the common special characters that you need to escape:

```regex
\.  \(  \)  \*  \+  \|  \[  \]  \\
```

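For example, our url regular expression contains `.` as a terminal, so it must be escaped with a backslash:

```regex
http://([a-z]+\.)+[a-z]+(:[0-9]+)?/
```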

Another way to escape a special character is to wrap it in character-class brackets. Instead of \., we could also write [.] to match just the literal . character. Inside character-class brackets, most special characters lose their special meaning, and are simply treated literally. But characters that are special to the character-class syntax, like [, ], ^, -, and \, still need to be escaped by a backslash to be used literally.
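
The effect of escaping can be seen in a short Java sketch that compares an unescaped `.` with the two escaping styles described above. `Pattern.quote`, a standard library method that turns a whole string into a literal pattern, is shown as a third option that the reading itself does not discuss. The class and method names are hypothetical, and the backslash is doubled because it sits inside a Java string literal.

```java
import java.util.regex.Pattern;

final class EscapingDot {
    // Unescaped '.' matches any single character.
    static boolean unescaped(String s) {
        return s.matches("mit.edu");
    }

    // Backslash escape: the regex is mit\.edu
    static boolean backslashEscaped(String s) {
        return s.matches("mit\\.edu");
    }

    // Character-class escape: [.] matches only a literal period.
    static boolean bracketEscaped(String s) {
        return s.matches("mit[.]edu");
    }

    // Pattern.quote treats every character of its argument literally.
    static boolean quoted(String s) {
        return s.matches(Pattern.quote("mit.edu"));
    }

    public static void main(String[] args) {
        System.out.println(unescaped("mitxedu"));         // true
        System.out.println(backslashEscaped("mitxedu"));  // false
        System.out.println(bracketEscaped("mit.edu"));    // true
        System.out.println(quoted("mit.edu"));            // true
    }
}
```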

### Using regular expressions in practice (TS)

<div data-outline="using_regular_expressions_in_practice">

<p id="@regexes_widely_used"><a class="jump" href="#@regexes_widely_used"></a>Regexes are widely used in programming, and you should have them in your toolbox.</p>

<p id="@typescript-javascript_you_can"><a class="jump" href="#@typescript-javascript_you_can"></a>In TypeScript/JavaScript, you can use regexes for manipulating strings (see <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/split"><code>string.split</code></a>, <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/match"><code>string.match</code></a>, <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp"><code>RegExp</code></a>).
They’re built-in as a first-class feature of other modern scripting languages like Python and Ruby, and you can use them in many text editors for find and replace.</p>

<p id="@some_languages_including"><a class="jump" href="#@some_languages_including"></a>In some languages (including TypeScript/JavaScript), regexes have their own literal syntax delimited by forward slashes, <code>/..../</code>.
In others (including Python and Java), there is no special syntactic support for regexes, so you just use a quoted string.</p>

<p id="@here_some_examples"><a class="jump" href="#@here_some_examples"></a>Here are some examples of using regexes in TypeScript.</p>

<p id="@replace_all_runs"><a class="jump" href="#@replace_all_runs"></a>Replace all runs of spaces in a string <code>s</code> with a single space:</p>

<pre id="@const_singlespacedstring_s-replace"><a class="jump" href="#@const_singlespacedstring_s-replace"></a><code class="language-ts hljs typescript"><span class="hljs-keyword">const</span> singleSpacedString = s.replace(<span class="hljs-regexp">/ +/g</span>, <span class="hljs-string">" "</span>);</code></pre>

<p id="@notice_g_character"><a class="jump" href="#@notice_g_character"></a>Notice the <code>g</code> character in the example above.
By default, a regex will only match the first instance in a string.
Adding a <code>g</code> character after the regex makes this a “global” pattern which forces it to match all instances in the string.</p>

<p id="@match_url"><a class="jump" href="#@match_url"></a>Match a URL:</p>

<pre id="@if_s-match-http-a-z-a-z-0-9_then"><a class="jump" href="#@if_s-match-http-a-z-a-z-0-9_then"></a><code class="language-ts hljs typescript"><span class="hljs-keyword">if</span> (s.match(<span class="hljs-regexp">/http:\/\/([a-z]+\.)+[a-z]+(:[0-9]+)?\//</span>)) {
    <span class="hljs-comment">// then s is a url</span>
}</code></pre>

<p id="@notice_backslashes_example"><a class="jump" href="#@notice_backslashes_example"></a>Notice the backslashes in the example above.
We want to match a literal period <code>.</code>, so we have to first escape it as <code>\.</code> to protect it from being interpreted as the regex match-any-character operator.
Furthermore, we want to match literal forward slashes <code>/</code>, so we also have to escape them as <code>\/</code> to prevent them from prematurely signalling the end of the regex literal.
The frequent necessity for backslash escapes makes regexes less readable.</p>

<p id="@extract_parts_date"><a class="jump" href="#@extract_parts_date"></a>Extract parts of a date like <code>"2020-03-18"</code>:</p>

<pre id="@const_s_2020-03-18"><a class="jump" href="#@const_s_2020-03-18"></a><code class="language-ts hljs typescript"><span class="hljs-keyword">const</span> s = <span class="hljs-string">"2020-03-18"</span>;
<span class="hljs-keyword">const</span> regex = <span class="hljs-regexp">/(?&lt;year&gt;\d{4})-(?&lt;month&gt;\d{2})-(?&lt;day&gt;\d{2})/</span>;
<span class="hljs-keyword">const</span> m = s.match(regex);
<span class="hljs-keyword">if</span> (m) {
  assert(m.groups);
  <span class="hljs-keyword">const</span> year = m.groups.year;
  <span class="hljs-keyword">const</span> month = m.groups.month;
  <span class="hljs-keyword">const</span> day = m.groups.day;
  <span class="hljs-comment">// m.groups.name is the part of s that matched (?&lt;name&gt;...)</span>
}</code></pre>

<p id="@example_uses_named"><a class="jump" href="#@example_uses_named"></a>This example uses <a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Groups_and_Ranges#using_named_groups"><em>named capturing groups</em></a> like <span class="nowrap"><code>(?&lt;year&gt;...)</code></span> to extract parts of the matched string and assign them names.
The <span class="nowrap"><code>(?&lt;name&gt;...)</code></span> syntax matches the regex <code>...</code> inside the parentheses, and then assigns <em>name</em> to the string that matches.
Note that <code>?</code> here does <em>not</em> mean 0 or 1 repetition.
In this context, right after an open parenthesis, the <code>?</code> signals that these parentheses have special meaning, not just grouping.</p>

<p id="@named_capturing_groups"><a class="jump" href="#@named_capturing_groups"></a>Named capturing groups can be retrieved by the groups property after a successful match.
If this regex were matched against <code>"2025-03-18"</code>, for example, then <code>m.groups.year</code> would return <code>"2025"</code>, <code>m.groups.month</code> would return <code>"03"</code>, and <code>m.groups.day</code> would return <code>"18"</code>.</p>

<div class="reading-exercises exercises panel-group converted" id="ex-regular_expressions_3"><h4 class="text-danger">reading exercises</h4><div class="panel panel-danger"><div class="panel-heading" data-structure-tag="exercise" id="@ex-regular_expressions_3-regexes_for_substring_replacement" data-target="#ex-regular_expressions_3-regexes_for_substring_replacement" data-toggle="collapse"><a class="jump" href="#@ex-regular_expressions_3-regexes_for_substring_replacement"></a><span class="panel-title">Regexes for substring replacement</span></div><div class="panel-collapse collapse exercise-panel" id="ex-regular_expressions_3-regexes_for_substring_replacement" data-outline="regexes_for_substring_replacement" data-ex-id="using_regular_expressions_in_practice/regexes_for_substring_replacement" data-ex-category="reading-exercises" data-ex-remote="https://6031.mit.edu/handx/sp23/submit.php" data-ex-handout="classes-12-regex-grammars"><div class="panel-body">

<p id="@write_shortest_regex"><a class="jump" href="#@write_shortest_regex"></a>Write the shortest regex you can to remove <strong>single-word, lowercase-letter-only HTML tags</strong> from a string:</p>

<pre id="@const_input_b-good-b"><a class="jump" href="#@const_input_b-good-b"></a><code class="language-ts hljs typescript"><span class="hljs-keyword">const</span> input = <span class="hljs-string">"The &lt;b&gt;Good&lt;/b&gt;, the &lt;i&gt;Bad&lt;/i&gt;, and the &lt;strong&gt;Ugly&lt;/strong&gt;"</span>;
<span class="hljs-keyword">const</span> regex = <span class="hljs-regexp">/TODO/g</span>;
<span class="hljs-keyword">const</span> output = input.replace(regex, <span class="hljs-string">""</span>);</code></pre>

<p id="@desired_output_example"><a class="jump" href="#@desired_output_example"></a>The desired output for that example is <code>"The Good, the Bad, and the Ugly"</code>.
What is the shortest regex that you can put in place of <code>TODO</code>, to match any single-word, lowercase-letter-only HTML tag, not just the ones in this particular input string?</p>

<p>You may find it useful to <a href="https://regex101.com/r/FUKDDV/9">try your answer here</a>.</p>

</div></div></div>

<div class="panel panel-danger"><div class="panel-heading" data-structure-tag="exercise" id="@ex-regular_expressions_3-regexes_for_string_parsing" data-target="#ex-regular_expressions_3-regexes_for_string_parsing" data-toggle="collapse"><a class="jump" href="#@ex-regular_expressions_3-regexes_for_string_parsing"></a><span class="panel-title">Regexes for string parsing</span></div><div class="panel-collapse collapse exercise-panel" id="ex-regular_expressions_3-regexes_for_string_parsing" data-outline="regexes_for_string_parsing" data-ex-id="using_regular_expressions_in_practice/regexes_for_string_parsing" data-ex-category="reading-exercises" data-ex-remote="https://6031.mit.edu/handx/sp23/submit.php" data-ex-handout="classes-12-regex-grammars"><div class="panel-body">

<p id="@consider_snippet_code"><a class="jump" href="#@consider_snippet_code"></a>Consider this snippet of code that matches a street address:</p>

<pre id="@const_input_77"><a class="jump" href="#@const_input_77"></a><code class="language-ts hljs typescript"><span class="hljs-keyword">const</span> input = <span class="hljs-string">"77 Rose Court Ln"</span>;
<span class="hljs-keyword">const</span> regex = <span class="hljs-regexp">/[0-9]+ .* (Rd|St|Ave|Ln)/</span>;
<span class="hljs-keyword">const</span> m = input.match(regex);
<span class="hljs-keyword">if</span> (m) {
    ...
}</code></pre>

<p id="@want_not_only"><a class="jump" href="#@want_not_only"></a>We want to not only match the street address, but also parse it into pieces.</p>

<p id="@insert_named_capturing"><a class="jump" href="#@insert_named_capturing"></a>Insert named capturing groups in the regular expression <code>regex</code> so that we could extract these pieces after successfully matching <code>m</code>:</p>

<pre id="@const_housenumber_m-groups-housenumber"><a class="jump" href="#@const_housenumber_m-groups-housenumber"></a><code class="language-ts hljs typescript"><span class="hljs-keyword">const</span> houseNumber = m.groups.houseNumber; <span class="hljs-comment">// should be "77"</span>
<span class="hljs-keyword">const</span> streetName = m.groups.streetName; <span class="hljs-comment">// should be "Rose Court"</span>
<span class="hljs-keyword">const</span> streetType = m.groups.streetType; <span class="hljs-comment">// should be "Ln"</span></code></pre>

<p id="@start_original_regular"><a class="jump" href="#@start_original_regular"></a>Start with the original regular expression <code>[0-9]+ .* (Rd|St|Ave|Ln)</code>, already filled in below, and just put correctly-named parentheses around the parts of the expression that correspond to the pieces that need to be extracted.
Take care with the spaces, because spaces have meaning in a regex.</p>

<span class="long-monospaced-answer" data-prefilled-answer="[0-9]+ .* (Rd|St|Ave|Ln)"></span>

<div class="form-group exercise-part" data-outline="a"><div class="textfield exercise-choice"><input type="text" class="form-control" style="width: 50%;" value="[0-9]+ .* (Rd|St|Ave|Ln)"><span class="exercise-answer exercise-remote" style="display: none;">(missing answer)</span></div>
</div>

<p>You may find it useful to <a href="https://regex101.com/r/TAHcBA/9">try your answer here</a>.</p>

</div></div></div></div>

</div>

### Context-free grammars

<div data-outline="context-free_grammars">

<p id="@general_language_can"><a class="jump" href="#@general_language_can"></a>In general, a language that can be expressed with our system of grammars is called context-free.  <a href="https://en.wikipedia.org/wiki/Chomsky_hierarchy">Not all context-free languages are also regular</a>; that is, some grammars can’t be reduced to single nonrecursive productions.  Our HTML grammar is context-free but not regular.</p>

<p id="@grammars_most_programming"><a class="jump" href="#@grammars_most_programming"></a>The grammars for most programming languages are also context-free.  In general, any language with nested structure (like nesting parentheses or braces) is context-free but not regular.  That description applies to the TypeScript grammar, shown here in part:</p>

<pre id="@statement_statement_if"><a class="jump" href="#@statement_statement_if"></a><code class="language-parserlib hljs">statement <span class="hljs-keyword">::=</span> 
  <span class="hljs-string">'{'</span> statement<span class="hljs-keyword">*</span> <span class="hljs-string">'}'</span>
<span class="hljs-keyword">|</span> <span class="hljs-string">'if'</span> <span class="hljs-string">'('</span> expression <span class="hljs-string">')'</span> statement (<span class="hljs-string">'else'</span> statement)<span class="hljs-keyword">?</span>
<span class="hljs-keyword">|</span> <span class="hljs-string">'for'</span> <span class="hljs-string">'('</span> forinit<span class="hljs-keyword">?</span> <span class="hljs-string">';'</span> expression<span class="hljs-keyword">?</span> <span class="hljs-string">';'</span> forupdate<span class="hljs-keyword">?</span> <span class="hljs-string">')'</span> statement
<span class="hljs-keyword">|</span> <span class="hljs-string">'while'</span> <span class="hljs-string">'('</span> expression <span class="hljs-string">')'</span> statement
<span class="hljs-keyword">|</span> <span class="hljs-string">'do'</span> statement <span class="hljs-string">'while'</span> <span class="hljs-string">'('</span> expression <span class="hljs-string">')'</span> <span class="hljs-string">';'</span>
<span class="hljs-keyword">|</span> <span class="hljs-string">'try'</span> <span class="hljs-string">'{'</span> statement<span class="hljs-keyword">*</span> <span class="hljs-string">'}'</span> ( catches <span class="hljs-keyword">|</span> catches<span class="hljs-keyword">?</span> <span class="hljs-string">'finally'</span> <span class="hljs-string">'{'</span> statement<span class="hljs-keyword">*</span> <span class="hljs-string">'}'</span> )
<span class="hljs-keyword">|</span> <span class="hljs-string">'switch'</span> <span class="hljs-string">'('</span> expression <span class="hljs-string">')'</span> <span class="hljs-string">'{'</span> switchgroups <span class="hljs-string">'}'</span>
<span class="hljs-keyword">|</span> <span class="hljs-string">'return'</span> expression<span class="hljs-keyword">?</span> <span class="hljs-string">';'</span>
<span class="hljs-keyword">|</span> <span class="hljs-string">'throw'</span> expression <span class="hljs-string">';'</span> 
<span class="hljs-keyword">|</span> <span class="hljs-string">'break'</span> label<span class="hljs-keyword">?</span> <span class="hljs-string">';'</span>
<span class="hljs-keyword">|</span> <span class="hljs-string">'continue'</span> label<span class="hljs-keyword">?</span> <span class="hljs-string">';'</span>
<span class="hljs-keyword">|</span> expression <span class="hljs-string">';'</span> 
<span class="hljs-keyword">|</span> label <span class="hljs-string">':'</span> statement
<span class="hljs-keyword">|</span> <span class="hljs-string">';'</span></code></pre>

</div>

### Taking stock

<div data-outline="taking_stock">

<p id="@so_far_we-ve"><a class="jump" href="#@so_far_we-ve"></a>So far, we’ve discussed:</p>

<ul id="@grammars_which_describe">
<li><a class="jump" href="#@grammars_which_describe"></a><a href="#grammars"><em>grammars</em></a>, which describe groups of strings using a <em>production</em> of the form <code>nonterminal ::= expression of terminals, nonterminals, and operators</code>

<ul><li>grammars make use of <em>operators</em>; the three most important ones are <code>a*</code> (repetition), <code>a b</code> (concatenation), and <code>a | b</code> (union)</li>
<li>grammars help differentiate between meaningful strings and nonmeaningful strings; for example, a properly formatted URL will <em>match</em> our url grammar, but a file-system path will not</li></ul></li>
<li><a href="#parse_trees"><em>parse trees</em></a>, which diagram how a concrete string matches a grammar

<ul><li>parse trees will come in handy when designing software to automatically match strings to a grammar in useful ways,
which we will see in the next section</li></ul></li>
<li><a href="#regular_expressions"><em>regular expressions</em></a>, or <em>regexes</em>, a compact version of regular (i.e. non-recursive) grammars 

<ul><li>regexes have a rich history in computer science; they are widely used throughout the full stack and across many languages </li>
<li>TypeScript/JavaScript (like many languages) has built-in support for regexes, 
which includes <em>named capturing groups</em> for meaningfully handling matches</li>
<li><a href="https://regex101.com">regex101.com</a> is a great site to experiment with regular expressions</li></ul></li>
</ul>

<p id="@last_part_reading"><a class="jump" href="#@last_part_reading"></a>The last part of this reading introduces the idea of a <em>parser generator</em>, which automates the process of parsing with a grammar.</p>

</div>

### Using regular expressions in practice (JAVA)

<div data-outline="using_regular_expressions_in_practice">

<p id="@regexes_widely_used"><a class="jump" href="#@regexes_widely_used"></a>Regexes are widely used in programming, and you should have them in your toolbox.</p>

<p id="@java_you_can"><a class="jump" href="#@java_you_can"></a>In Java, you can use regexes for manipulating strings (see <a href="http://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/lang/String.html#split(java.lang.String)"><code>String.split</code></a>, <a href="http://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/lang/String.html#matches(java.lang.String)"><code>String.matches</code></a>, <a href="http://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/util/regex/Pattern.html"><code>java.util.regex.Pattern</code></a>).  They’re built-in as a first-class feature of modern scripting languages like Python, Ruby, and JavaScript, and you can use them in many text editors for find and replace.  Regular expressions are your friend!  Most of the time.  Here are some examples.</p>

<p id="@replace_all_runs"><a class="jump" href="#@replace_all_runs"></a>Replace all runs of spaces in a string <code>s</code> with a single space:</p>

<pre id="@string_singlespacedstring_s-replaceall"><a class="jump" href="#@string_singlespacedstring_s-replaceall"></a><code class="language-java hljs">String singleSpacedString = s.replaceAll(<span class="hljs-string">" +"</span>, <span class="hljs-string">" "</span>);</code></pre>
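A minimal sketch of the same idea as a complete program. The class and method names here are hypothetical, chosen only for illustration. It also shows a variant using `\s+`, which matches runs of any whitespace (tabs and newlines included), whereas `" +"` matches runs of the space character only.

```java
class WhitespaceDemo {
    // Collapse each run of one or more space characters into a single space.
    static String collapseSpaces(String s) {
        return s.replaceAll(" +", " ");
    }

    // Collapse each run of any whitespace (spaces, tabs, newlines) into a single space.
    // The regex is \s+, written "\\s+" inside a Java string literal.
    static String collapseWhitespace(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(collapseSpaces("a   b  c"));      // prints "a b c"
        System.out.println(collapseWhitespace("a \t b"));    // prints "a b"
    }
}
```

Note that `String.replaceAll` always treats its first argument as a regex; `String.replace` is the method to reach for when the search text should be taken literally.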

<p id="@match_url"><a class="jump" href="#@match_url"></a>Match a URL:</p>

<pre id="@if_s-matches-http-a-z-a-z-0-9_then"><a class="jump" href="#@if_s-matches-http-a-z-a-z-0-9_then"></a><code class="language-java hljs"><span class="hljs-keyword">if</span> (s.matches(<span class="hljs-string">"http://([a-z]+\\.)+[a-z]+(:[0-9]+)?/"</span>)) {
    <span class="hljs-comment">// then s is a url</span>
}</code></pre>

<p id="@notice_backslashes_example"><a class="jump" href="#@notice_backslashes_example"></a>Notice the backslashes in the example above.
We want to match a literal period <code>.</code>, so we have to first escape it as <code>\.</code> to protect it from being interpreted as the regex match-any-character operator, and then we have to further escape it as <code>\\.</code> to protect the backslash from being interpreted as a Java string escape character.
The frequent need for double-backslash escapes makes regexes even less readable.</p>
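The effect of the escaping can be checked with a small program. This is a sketch: the class name, the helper `isUrl`, and the `LOOSE_REGEX` variant (the same pattern with the dot left unescaped) are hypothetical and exist only to show the difference. It also relies on the fact that `String.matches` succeeds only when the regex matches the entire string.

```java
class UrlMatchDemo {
    // The regex from the example above: \\. in the Java literal is \. in the regex.
    static final String URL_REGEX = "http://([a-z]+\\.)+[a-z]+(:[0-9]+)?/";

    // Hypothetical variant with the dot left unescaped, so it matches any character.
    static final String LOOSE_REGEX = "http://([a-z]+.)+[a-z]+(:[0-9]+)?/";

    static boolean isUrl(String s) {
        return s.matches(URL_REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isUrl("http://mit.edu/"));            // true
        System.out.println(isUrl("http://stanford.edu:8080/"));  // true
        System.out.println(isUrl("http://mit.edu"));             // false: no trailing slash
        System.out.println(isUrl("http://mitXedu/"));            // false: no literal period
        // With the unescaped dot, the X is accepted in place of the period.
        System.out.println("http://mitXedu/".matches(LOOSE_REGEX)); // true
    }
}
```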

<p id="@extract_parts_date"><a class="jump" href="#@extract_parts_date"></a>Extract parts of a date like <code>"2020-03-18"</code>:</p>

<pre id="@string_s_2020-03-18"><a class="jump" href="#@string_s_2020-03-18"></a><code class="language-java hljs">String s = <span class="hljs-string">"2020-03-18"</span>;
Pattern regex = Pattern.compile(<span class="hljs-string">"(?&lt;year&gt;\\d{4})-(?&lt;month&gt;\\d{2})-(?&lt;day&gt;\\d{2})"</span>);
Matcher m = regex.matcher(s);
<span class="hljs-keyword">if</span> (m.matches()) {
    String year = m.group(<span class="hljs-string">"year"</span>);
    String month = m.group(<span class="hljs-string">"month"</span>);
    String day = m.group(<span class="hljs-string">"day"</span>);
    <span class="hljs-comment">// Matcher.group(name) returns the part of s that matched (?&lt;name&gt;...)</span>
}</code></pre>

<p id="@example_uses_named"><a class="jump" href="#@example_uses_named"></a>This example uses <a href="http://docs.oracle.com/en/java/javase/15/docs/api/java.base/java/util/regex/Pattern.html#groupname"><em>named capturing groups</em></a> like <span class="nowrap"><code>(?&lt;year&gt;...)</code></span> to extract parts of the matched string and assign them names.
The <span class="nowrap"><code>(?&lt;name&gt;...)</code></span> syntax matches the regex <code>...</code> inside the parentheses, and then assigns <em>name</em> to the string that matched.
Note that <code>?</code> here does <em>not</em> mean 0 or 1 repetition.
In this context, right after an open parenthesis, the <code>?</code> signals that these parentheses have special meaning, not just grouping.</p>

<p id="@named_capturing_groups"><a class="jump" href="#@named_capturing_groups"></a>Named capturing groups can be retrieved with the <code>Matcher.group(name)</code> method after a successful match.
If this regex were matched against <code>"2025-03-18"</code>, for example, then <code>m.group("year")</code> would return <code>"2025"</code>, <code>m.group("month")</code> would return <code>"03"</code>, and <code>m.group("day")</code> would return <code>"18"</code>.</p>
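Putting the pieces together, here is a minimal self-contained sketch of the date example with the imports it needs. The class name `DateParseDemo`, the method `parseDate`, and the choice to return an `int[]` are assumptions made for illustration, not part of the reading.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateParseDemo {
    // Compile the pattern once and reuse it for every call.
    private static final Pattern DATE =
        Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})");

    /**
     * Parses a date of the form yyyy-mm-dd.
     * @return an array {year, month, day}
     * @throws IllegalArgumentException if s does not match the pattern
     */
    static int[] parseDate(String s) {
        Matcher m = DATE.matcher(s);
        // group(name) may only be called after a successful match;
        // calling it earlier throws IllegalStateException.
        if (!m.matches()) {
            throw new IllegalArgumentException("not a yyyy-mm-dd date: " + s);
        }
        return new int[] {
            Integer.parseInt(m.group("year")),
            Integer.parseInt(m.group("month")),
            Integer.parseInt(m.group("day"))
        };
    }

    public static void main(String[] args) {
        int[] d = parseDate("2025-03-18");
        System.out.println(d[0] + " " + d[1] + " " + d[2]); // prints "2025 3 18"
    }
}
```

The regex only checks the shape of the string, so an input like `"2025-99-99"` still matches; validating the ranges would be a separate step.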

<div class="reading-exercises exercises panel-group converted" id="ex-regular_expressions_2"><h4 class="text-danger">reading exercises</h4><div class="panel panel-danger"><div class="panel-heading" data-structure-tag="exercise" id="@ex-regular_expressions_2-regexes_for_substring_replacement" data-target="#ex-regular_expressions_2-regexes_for_substring_replacement" data-toggle="collapse" aria-expanded="true"><a class="jump" href="#@ex-regular_expressions_2-regexes_for_substring_replacement"></a><span class="panel-title">Regexes for substring replacement</span></div><div class="panel-collapse exercise-panel collapse in" id="ex-regular_expressions_2-regexes_for_substring_replacement" data-outline="regexes_for_substring_replacement" data-ex-id="using_regular_expressions_in_practice/regexes_for_substring_replacement" data-ex-category="reading-exercises" data-ex-remote="https://6031.mit.edu/handx/sp21-java/submit.php" data-ex-handout="classes-17-regex-grammars" aria-expanded="true" style=""><div class="panel-body">

<p id="@write_shortest_regex"><a class="jump" href="#@write_shortest_regex"></a>Write the shortest regex you can to remove <strong>single-word, lowercase-letter-only HTML tags</strong> from a string:</p>

<pre id="@string_input_b-good-b"><a class="jump" href="#@string_input_b-good-b"></a><code class="language-java hljs">String input = <span class="hljs-string">"The &lt;b&gt;Good&lt;/b&gt;, the &lt;i&gt;Bad&lt;/i&gt;, and the &lt;strong&gt;Ugly&lt;/strong&gt;"</span>;
String regex = <span class="hljs-string">"TODO"</span>;
String output = input.replaceAll(regex, <span class="hljs-string">""</span>);</code></pre>

<p id="@desired_output_example"><a class="jump" href="#@desired_output_example"></a>The desired output for that example is <code>"The Good, the Bad, and the Ugly"</code>.
What is the shortest regex that you can put in place of <code>TODO</code>, to match any single-word, lowercase-letter-only HTML tag, not just the ones in this particular input string?</p>

<p>You may find it useful to <a href="https://regex101.com/r/FUKDDV/1">try your answer here</a>.</p>

</div></div></div>

<div class="panel panel-danger"><div class="panel-heading" data-structure-tag="exercise" id="@ex-regular_expressions_2-regexes_for_string_parsing" data-target="#ex-regular_expressions_2-regexes_for_string_parsing" data-toggle="collapse" aria-expanded="true"><a class="jump" href="#@ex-regular_expressions_2-regexes_for_string_parsing"></a><span class="panel-title">Regexes for string parsing</span></div><div class="panel-collapse exercise-panel collapse in" id="ex-regular_expressions_2-regexes_for_string_parsing" data-outline="regexes_for_string_parsing" data-ex-id="using_regular_expressions_in_practice/regexes_for_string_parsing" data-ex-category="reading-exercises" data-ex-remote="https://6031.mit.edu/handx/sp21-java/submit.php" data-ex-handout="classes-17-regex-grammars" aria-expanded="true" style=""><div class="panel-body">

<p id="@consider_snippet_code"><a class="jump" href="#@consider_snippet_code"></a>Consider this snippet of code that matches a street address:</p>

<pre id="@string_input_77"><a class="jump" href="#@string_input_77"></a><code class="language-java hljs">String input = <span class="hljs-string">"77 Rose Court Ln"</span>;
Pattern regex = Pattern.compile(<span class="hljs-string">"[0-9]+ .* (Rd|St|Ave|Ln)"</span>);
Matcher m = regex.matcher(input);
<span class="hljs-keyword">if</span> (m.matches()) {
    ...
}</code></pre>

<p id="@want_not_only"><a class="jump" href="#@want_not_only"></a>We want to not only match the street address, but also parse it into pieces.</p>

<p id="@insert_named_capturing"><a class="jump" href="#@insert_named_capturing"></a>Insert named capturing groups in the regular expression <code>regex</code> so that we could extract these pieces after successfully matching <code>m</code>:</p>

<pre id="@string_housenumber_m-group-housenumber"><a class="jump" href="#@string_housenumber_m-group-housenumber"></a><code class="language-java hljs">String houseNumber = m.group(<span class="hljs-string">"houseNumber"</span>); <span class="hljs-comment">// should be "77"</span>
String streetName = m.group(<span class="hljs-string">"streetName"</span>); <span class="hljs-comment">// should be "Rose Court"</span>
String streetType = m.group(<span class="hljs-string">"streetType"</span>); <span class="hljs-comment">// should be "Ln"</span></code></pre>

<p id="@start_original_regular"><a class="jump" href="#@start_original_regular"></a>Start with the original regular expression <code>[0-9]+ .* (Rd|St|Ave|Ln)</code> and turn each part of the expression that corresponds to a piece to be extracted into a correctly-named capturing group.
Take care with the spaces, because spaces have meaning in a regex.</p>

<p>You may find it useful to <a href="https://regex101.com/r/TAHcBA/1">try your answer here</a>.</p>

</div></div></div></div>

</div>


## Context-free grammars

[regular and context-free](https://en.wikipedia.org/wiki/Chomsky_hierarchy#Regular_(Type-3)_grammars)

In general, a language that can be expressed with our system of grammars is called context-free. Not all context-free languages are also regular; that is, some grammars can’t be reduced to single nonrecursive productions. Our HTML grammar is context-free but not regular.

The grammars for most programming languages are also context-free. In general, any language with nested structure (like nesting parentheses or braces) is context-free but not regular. That description applies to the Java grammar, shown here in part:

<pre><code>statement ::= 
  '{' statement* '}'
| 'if' '(' expression ')' statement ('else' statement)?
| 'for' '(' forinit? ';' expression? ';' forupdate? ')' statement
| 'while' '(' expression ')' statement
| 'do' statement 'while' '(' expression ')' ';'
| 'try' '{' statement* '}' ( catches | catches? 'finally' '{' statement* '}' )
| 'switch' '(' expression ')' '{' switchgroups '}'
| 'synchronized' '(' expression ')' '{' statement* '}'
| 'return' expression? ';'
| 'throw' expression ';'
| 'break' identifier? ';'
| 'continue' identifier? ';'
| expression ';'
| identifier ':' statement
| ';'</code></pre>

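In short, every production in a context-free grammar has a single nonterminal (a variable) on its left-hand side, and its right-hand side may be any expression of terminals, nonterminals, and operators. A regular grammar restricts what the right-hand side may contain, so the possible forms of its productions are fixed in advance.
So every regular grammar is also a context-free grammar, but not every context-free grammar is regular.

* A context-free grammar (CFG) can express nested and recursive structure, such as arithmetic expressions with nested parentheses. A regular grammar cannot express this kind of nesting.

* The rules of a CFG can be more complex, and may involve more nonterminals and recursive definitions. The rules of a regular grammar are comparatively simple, usually just characters combined with concatenation, union, and closure (repetition).

* CFGs are suited to describing the syntax of complex programming languages, while regular grammars are suited to simple text pattern matching, as done with regular expressions.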
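
To make the first point concrete, here is a minimal sketch of a recognizer for balanced parentheses. The grammar `balanced ::= ( '(' balanced ')' )*` is a hypothetical example introduced here, not one from the reading, and the class and method names are likewise only illustrative. The method `balanced()` calls itself exactly where the production refers to itself, which is the recursion a single regex has no way to express.

```java
class BalancedParens {
    // Hypothetical grammar, for illustration only:
    //   balanced ::= ( '(' balanced ')' )*
    private final String input;
    private int pos = 0; // index of the next unread character

    private BalancedParens(String input) {
        this.input = input;
    }

    /** Returns true if and only if the whole of s matches the balanced nonterminal. */
    static boolean matches(String s) {
        BalancedParens p = new BalancedParens(s);
        return p.balanced() && p.pos == s.length();
    }

    // One method per nonterminal: the loop is the * operator,
    // and the recursive call is the nested use of balanced.
    private boolean balanced() {
        while (pos < input.length() && input.charAt(pos) == '(') {
            pos++; // consume '('
            if (!balanced()) {
                return false;
            }
            if (pos >= input.length() || input.charAt(pos) != ')') {
                return false; // missing the matching ')'
            }
            pos++; // consume ')'
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(matches("(()())")); // true
        System.out.println(matches("(()"));    // false
        // A regex can check the characters but cannot count nesting depth:
        System.out.println("(()".matches("\\(*\\)*")); // true
    }
}
```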
