In [16]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## String -- Common string operations

### String constants

In [17]:
import string
string.ascii_letters    # this value is not locale-dependent
string.ascii_lowercase
string.ascii_uppercase
string.digits
string.hexdigits
string.octdigits
string.punctuation
string.printable
string.whitespace

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

'abcdefghijklmnopqrstuvwxyz'

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

'0123456789'

'0123456789abcdefABCDEF'

'01234567'

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

' \t\n\r\x0b\x0c'
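
The constants are plain strings, so they work directly with membership tests and with `str.strip`. A minimal sketch follows; the helper name `is_hex` and the sample sentence are made up for illustration.

```python
import string

def is_hex(text):
    """Return True if text is non-empty and contains only hexadecimal digits."""
    return bool(text) and all(ch in string.hexdigits for ch in text)

# str.strip accepts a set of characters, so string.punctuation can be used
# to trim punctuation from both ends of each word.
words = [w.strip(string.punctuation) for w in "Hello, world! (again)".split()]

print(is_hex("C0A80001"))   # True
print(is_hex("xyz"))        # False
print(words)                # ['Hello', 'world', 'again']
```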

### Custom String Formatting

In [18]:
# str.format() method
"{0} is year, {1} is month, {2} is day. And this is the literal {{}}".format(2023, 8, 5)

# If the arg_name is a number, it refers to a positional argument, else it is a keyword argument
"{0!r} is year, {1} is month, {_} is day. And this is the literal {{}}".format(2023, 8, _=12)

year = 2023
month = 8
day = 5

# f-string
f'{year} is year, {month} is month, {day} is day.'

# printf-style % formatting
'%d is year, %d is month, %d is day.' % (year, month, day)

'2023 is year, 8 is month, 5 is day. And this is the literal {}'

'2023 is year, 8 is month, 12 is day. And this is the literal {}'

'2023 is year, 8 is month, 5 is day.'

'2023 is year, 8 is month, 5 is day.'

In [19]:
format('123')

'123'
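
The built-in `format(value, format_spec)` also takes an optional second argument, which uses the same format specification mini-language as the part after `:` in a replacement field. A small sketch with illustrative values:

```python
value = 1234.5678

plain = format(value)            # empty spec: same result as str(value)
fixed = format(value, ".2f")     # two digits after the decimal point
grouped = format(value, ",.1f")  # comma as thousands separator, one decimal
padded = format(42, "08b")       # binary, zero-padded to a width of 8

print(plain)    # 1234.5678
print(fixed)    # 1234.57
print(grouped)  # 1,234.6
print(padded)   # 00101010
```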

### Metasyntax

*Elements* of metasyntax
- *Terminals*: a stand-alone syntactic structure. Terminals could be denoted by double quoting the name of the terminals.
  e.g. `"else"`, `"if"`, `"then"`, `"while"`
- *Nonterminals*: a symbolic representation defining a set of allowable syntactic structures that is composed of a subset of elements. Nonterminals could be denoted by angle bracketing the name of the nonterminals.
  e.g. `<int>`, `<char>`, `<boolean>`
- *Metasymbol*: a symbolic representation denoting syntactic information.
  e.g. `:=`, `|`, `{}`, `()`, `[]`, `*`


*Methods* of phrase termination
- Juxtaposition: e.g. `A B`
- Alternation: e.g. `A|B`
- Repetition: e.g. `{A B}`
- Optional phrase: e.g. `[A B]`
- Grouping: e.g. `(A|B)`

### Format String Syntax

Format strings contain "replacement fields" surrounded by curly braces `{}`. The grammar for a replacement field is as follows:

```code
replacement_field ::=  "{" [field_name] ["!" conversion] [":" format_spec] "}"
field_name        ::=  arg_name ("." attribute_name | "[" element_index "]")*
arg_name          ::=  [identifier | digit+]
attribute_name    ::=  identifier
element_index     ::=  digit+ | index_string
index_string      ::=  <any source character except "]"> +
conversion        ::=  "r" | "s" | "a"
format_spec       ::=  <described in the next section>
```
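
One way to see how a format string maps onto this grammar is `string.Formatter.parse`, which splits a format string into `(literal_text, field_name, format_spec, conversion)` tuples. The template below is only an illustration.

```python
from string import Formatter

template = "{0!r} has {players[0]:>10}"

# Each tuple is (literal_text, field_name, format_spec, conversion).
fields = list(Formatter().parse(template))
for literal_text, field_name, format_spec, conversion in fields:
    print(repr(literal_text), field_name, repr(format_spec), conversion)

# The field names found in the template, in order.
field_names = [field_name for _, field_name, _, _ in fields]
print(field_names)   # ['0', 'players[0]']
```

Note that `field_name` still contains the attribute and index parts (`players[0]`); splitting those off is a separate step.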

In [20]:
name = "Bryant"
players = ["Kobe", "Jamas"]

"First, thou shalt count to {0}".format(12, 14)    # the last one is unused
"My quest is {name}".format(name="Bryant")
"Weight in tons {0.count}".format(["Kobe", "Jamas"])
"Units destroyed: {players[0]}".format(players = ["Kobe", "Jamas"])

'First, thou shalt count to 12'

'My quest is Bryant'

'Weight in tons <built-in method count of list object at 0x107884a80>'

'Units destroyed: Kobe'

In [21]:
# Three conversion flags are currently supported
"Harold's a clever {0!s}".format('boy')        # Calls str() on the argument first
"Bring out the holy {name!r}".format(name='cow')    # Calls repr() on the argument first
"More {!a}".format('A')                      # Calls ascii() on the argument first

"Harold's a clever boy"

"Bring out the holy 'cow'"

"More 'A'"

In [22]:
# Accessing arguments by name
coord = {'latitude': '37.24N', 'longitude': '-115.81W'}
'Coordinates: {latitude}. {longitude}'.format(**coord)

'Coordinates: 37.24N. -115.81W'

In [23]:
# Accessing arguments' attributes
c = 3-5j
('The complex number {0} is formed from the real part {0.real} '
 'and the imaginary part {0.imag}.').format(c)

'The complex number (3-5j) is formed from the real part 3.0 and the imaginary part -5.0.'

In [24]:
# Accessing arguments' items
coord = (3, 5)
'X: {0[0]};  Y: {0[1]}'.format(coord)

'X: 3;  Y: 5'

In [25]:
# Replacing %s and %r
"repr() shows quotes: {!r}; str() doesn't: {!s}".format('test1', 'test2')

"repr() shows quotes: 'test1'; str() doesn't: test2"

In [26]:
# Aligning the text and specifying a width
'{:<30}'.format('left aligned')
'{:>30}'.format('right aligned')
'{:^30}'.format('centered')
'{:*^30}'.format('centered')    # use '*' as a fill char

'left aligned                  '

'                 right aligned'

'           centered           '

'***********centered***********'

In [27]:
# Replacing %x and %o and converting the value to different bases
"int: {0:d}; hex: {0:x}; oct: {0:o}; bin: {0:b}".format(42)

'int: 42; hex: 2a; oct: 52; bin: 101010'

In [28]:
# Number formatting
'{:,}'.format(1234567890)    # using a comma as a thousands separator
'Correct answers: {:.2%}'.format(19/22)    # expressing a percentage

# Using type-specific formatting
import datetime
d = datetime.datetime(2010, 7, 4, 12, 15, 58)
'{:%Y-%m-%d %H:%M:%S}'.format(d)

'1,234,567,890'

'Correct answers: 86.36%'

'2010-07-04 12:15:58'

In [29]:
# Nesting arguments
for align, text in zip('<^>', ['left', 'center', 'right']):
    '{0:{fill}{align}16}'.format(text, fill=align, align=align)

octets = [192, 168, 0, 1]
'{:02X}{:02X}{:02X}{:02X}'.format(*octets)

int(_, 16)

width = 5
for num in range(5, 12):
    for base in 'dXob':    # d: decimal, X: hexadecimal (uppercase), o: octal, b: binary
        print('{0:{width}{base}}'.format(num, base=base, width=width), end=' ')
    print()


'left<<<<<<<<<<<<'

'^^^^^center^^^^^'

'>>>>>>>>>>>right'

'C0A80001'

3232235521

    5     5     5   101 
    6     6     6   110 
    7     7     7   111 
    8     8    10  1000 
    9     9    11  1001 
   10     A    12  1010 
   11     B    13  1011 


### Template strings

In [30]:
from string import Template
s = Template('$who likes $what')
s.substitute(who='tim', what='kung pao')

d = dict(who='tim')
Template('$who likes $what').safe_substitute(d)

'tim likes kung pao'

'tim likes $what'
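
A few more details of `Template`, sketched with made-up placeholder values: `${name}` delimits a placeholder that is followed directly by other word characters, `$$` is an escaped dollar sign, and `substitute` raises `KeyError` for a missing placeholder where `safe_substitute` leaves it in place.

```python
from string import Template

t = Template("$who likes ${what}s and pays $$5")

full = t.substitute(who="tim", what="noodle")

# substitute() raises KeyError when a placeholder has no value ...
try:
    t.substitute(who="tim")
    missing = None
except KeyError as exc:
    missing = exc.args[0]

# ... while safe_substitute() keeps the unresolved placeholder as written.
partial = t.safe_substitute(who="tim")

print(full)     # tim likes noodles and pays $5
print(missing)  # what
print(partial)  # tim likes ${what}s and pays $5
```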

## re - Regular expression operations

Usually patterns will be expressed in Python code using *raw string notation*.
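
The reason is that backslashes are left untouched in raw strings, so the escape reaches the `re` module intact. A short sketch with an illustrative sample text:

```python
import re

# In a normal string literal "\b" is the backspace character (0x08);
# in a raw string it stays backslash + "b", which re reads as a word boundary.
print(len("\b"), len(r"\b"))    # 1 2

text = "cat concat cat."
raw_hits = re.findall(r"\bcat\b", text)    # word-boundary match
plain_hits = re.findall("\bcat\b", text)   # looks for backspace characters

print(raw_hits)    # ['cat', 'cat']
print(plain_hits)  # []
```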

### Regular Expression Syntax

The special characters:

**Character classes and class-like constructs**:
- `.` In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
- `\` Either escapes special characters or signals a special sequence.
- `[]` Used to indicate a set of characters.
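
A quick sketch of these three constructs; the sample strings are made up.

```python
import re

text = "v1.0\nv2.5"

# "." does not cross the newline by default, but does with re.DOTALL.
default_dot = re.search(r"0.v", text)
dotall_dot = re.search(r"0.v", text, re.DOTALL)

# "\." escapes the dot so it matches a literal period;
# "\d" is a special sequence that matches a digit.
versions = re.findall(r"\d\.\d", text)

# "[...]" matches one character from the set; a leading "^" negates the set.
vowels = re.findall(r"[aeiou]", "regular")
non_digits = re.findall(r"[^0-9\n]+", text)

print(default_dot)         # None
print(dotall_dot.group())  # the three characters "0", newline, "v"
print(versions)            # ['1.0', '2.5']
print(vowels)              # ['e', 'u', 'a']
print(non_digits)          # ['v', '.', 'v', '.']
```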


**Anchors**:
- `^` Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
- `$` Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.
- `(?=...)` Matches if `...` matches next, but doesn't consume any of the string. This is called *lookahead assertion*. For example, `Isaac (?=Asimov)` will match `'Isaac '` only if it’s followed by `'Asimov'`.
- `(?!...)` Matches if `...` doesn’t match next. This is a *negative lookahead assertion*. For example, `Isaac (?!Asimov)` will match `'Isaac '` only if it’s not followed by `'Asimov'`.
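- `(?<=...)` Matches if the current position in the string is preceded by a match for `...` that ends at the current position. This is called a *positive lookbehind assertion*. For example, `(?<=abc)def` will find a match in `'abcdef'`. The contained pattern must only match strings of some fixed length.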
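
A brief sketch of anchors and lookaround assertions in use. The sample strings are illustrative; the lookbehind line uses `(?<=...)`, which requires the text immediately before the current position to match.

```python
import re

text = "alpha\nbeta\ngamma"

first_only = re.findall(r"^\w+", text)                # "^" only at string start
every_line = re.findall(r"^\w+", text, re.MULTILINE)  # "^" after each newline too
last_only = re.findall(r"\w+$", text)                 # "$" only at string end

names = "Isaac Asimov, Isaac Newton"
# Lookaheads check what follows without consuming it.
before_asimov = [m.start() for m in re.finditer(r"Isaac (?=Asimov)", names)]
not_asimov = [m.start() for m in re.finditer(r"Isaac (?!Asimov)", names)]

# Lookbehind: digits that are immediately preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $42, qty 7")

print(first_only)     # ['alpha']
print(every_line)     # ['alpha', 'beta', 'gamma']
print(last_only)      # ['gamma']
print(before_asimov)  # [0]
print(not_asimov)     # [14]
print(prices)         # ['42']
```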

**Quantifiers, Alternation, Grouping, and Capturing**:
- `*` Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as possible. `ab*` will match 'a', 'ab', or 'a' followed by any number of 'b's.
- `+` Causes the resulting RE to match 1 or more repetitions of the preceding RE. `ab+` will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.
- `?` Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. `ab?` will match either 'a' or 'ab'.
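- `*?`, `+?`, `??` Non-greedy quantifiers. They behave like `*`, `+`, and `?` but match as few repetitions as possible.
- `*+`, `++`, `?+` Possessive quantifiers. Like `*`, `+`, and `?`, they match as many repetitions as possible, but they do not allow backtracking when the expression that follows fails to match. New in version 3.11.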
- `{m}` Specifies that exactly *m* copies of the previous RE should be matched.
- `{m,n}` Causes the resulting RE to match from *m* to *n* repetitions of the preceding RE, attempting to match as many as possible. There must be no space after the comma, otherwise the braces are treated as literal characters.
- `|` `A|B`, where *A* and *B* can be arbitrary REs, creates a regular expression that will match either *A* or *B*.
- `(...)` Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the `\number` special sequence, described below.

- `(?:...)` *Grouping-only parentheses*. A non-capturing version of regular parentheses.

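- `(?>...)` *Atomic grouping*: attempts to match `...` as if it were a separate regular expression and, if successful, continues to match the rest of the pattern following it. If the rest of the pattern then fails to match, the engine cannot backtrack into the group; it can only retry from a point before the `(?>...)`. New in version 3.11.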

- `(?P<name>...)` Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name *name*.
- `(?P=name)` A backreference to a named group; it matches whatever text was matched by the earlier group named *name*.
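
The following sketch exercises several of the constructs above on made-up sample strings. Possessive quantifiers and atomic groups are left out so the example does not depend on Python 3.11.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy ".+" runs to the last ">", non-greedy ".+?" stops at the first one.
greedy = re.findall(r"<.+>", html)
lazy = re.findall(r"<.+?>", html)

# {m} and {m,n}: exactly three digits, a dash, then two to four digits.
counted = re.fullmatch(r"\d{3}-\d{2,4}", "555-0199") is not None

# Alternation inside a non-capturing group, so findall returns whole matches.
pets = re.findall(r"(?:cat|dog)s?", "cats dog bird")

# A named group and a backreference to it: find a doubled word.
doubled = re.search(r"(?P<word>\b\w+) (?P=word)", "it was the the best")

# A numbered backreference: the same character twice in a row.
double_letter = re.search(r"(\w)\1", "balloon").group()

print(greedy)                 # ['<b>bold</b> and <i>italic</i>']
print(lazy)                   # ['<b>', '</b>', '<i>', '</i>']
print(counted)                # True
print(pets)                   # ['cats', 'dog']
print(doubled.group("word"))  # the
print(double_letter)          # ll
```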

**Mode modifier**: *(?modifier)*, such as *(?a)* or *(?i)*

- `(?...)` This is an extension notation. The first character after the `?` determines what the meaning and further syntax of the construct is.

- `(?aiLmsux)` One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'. The group matches the empty string; the letters set the corresponding flags: re.A (*ASCII-only matching*), re.I (*ignore case*), re.L (*locale dependent*), re.M (*multi-line*), re.S (*dot matches all*), re.U (*Unicode matching*), and re.X (*verbose*), for the entire regular expression.

Changed in version 3.11: This construction can only be used at the start of the expression.

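- `(?aiLmsux-imsx:...)` *Mode-modified span*, (?*modifier*:...), such as `(?i:...)` or `(?-i:...)`. Zero or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x', optionally followed by '-' and one or more letters from 'i', 'm', 's', 'x'. The letters set or remove the corresponding flags for the part of the expression inside the group only.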

- `(?#...)` A comment; the contents of the parentheses are simply ignored.
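
A short sketch of these modifiers on illustrative sample strings: a global inline flag, a flag scoped to a group, a flag switched off inside a group, an inline comment, and verbose mode.

```python
import re

# A global inline flag at the start of the pattern acts like flags=re.I.
inline = re.findall(r"(?i)python", "Python PYTHON python")

# A scoped flag applies only inside its group: only "py" ignores case here.
scoped = re.findall(r"(?i:py)thon", "Python PYthon PYTHON")

# With re.I set for the whole pattern, "-i" turns it off inside the group,
# so the first letter must be an uppercase "P".
negated = re.findall(r"(?-i:P)ython", "Python python PYTHON", re.I)

# (?#...) is a comment and does not affect matching.
commented = re.findall(r"\d{4}(?#year)-\d{2}(?#month)", "2023-08 and 2010-07")

# (?x) enables verbose mode: whitespace and "#" comments in the pattern are ignored.
verbose = re.compile(r"""(?x)
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
""")
parts = verbose.search("released 2023-08").groupdict()

print(inline)     # ['Python', 'PYTHON', 'python']
print(scoped)     # ['Python', 'PYthon']
print(negated)    # ['Python', 'PYTHON']
print(commented)  # ['2023-08', '2010-07']
print(parts)      # {'year': '2023', 'month': '08'}
```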
