## 2. Strings and Text

For basic string operations, see <https://docs.python.org/3/library/string.html?highlight=string#module-string>


The main topics are:

1. Regular expressions
2. Encoding and decoding
3. Formatted output
4. Formatted input (HTML and XML text)


### 2.1 Splitting Strings

`str` has a built-in `split` method. For more flexibility, such as splitting on several delimiters at once, use `re.split`.

In [3]:
line = 'asdfjl. sdfj sld; fsdkfj, sdfj, fsdfl'

import re

re.split(r'[;,\.\s]\s*', line)

['asdfjl', 'sdfj', 'sld', 'fsdkfj', 'sdfj', 'fsdfl']

### 2.2 startswith and endswith

Note that both methods accept a tuple of candidates, for example `endswith(('.c', '.h'))`.
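A minimal sketch of the tuple form. The file names and the URL are made up for illustration.

```python
filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']

# endswith accepts a tuple of suffixes and is True if any of them matches
c_sources = [name for name in filenames if name.endswith(('.c', '.h'))]
print(c_sources)

# A list is not accepted as the argument; convert it with tuple() first
prefixes = ['http:', 'https:', 'ftp:']
url = 'https://www.python.org'
is_remote = url.startswith(tuple(prefixes))
print(is_remote)
```

Passing the list directly (`url.startswith(prefixes)`) raises a `TypeError`, which is why the conversion is needed.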

### 2.3 Matching Strings Using Shell Wildcard Patterns

How to match filename strings using Unix shell wildcard patterns.


In [7]:
from fnmatch import fnmatch, fnmatchcase

assert fnmatch('foo.txt', '*.txt')
assert fnmatch('foo.txt', '?oo.txt')
assert fnmatch('Dat45.csv', 'Dat[0-9]*')


assert not fnmatchcase('foo.txt', '*.TXT')

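### 2.4 Matching and Searching for Text

String matching is a very broad topic. The simple cases are:

1. `str.find('xx')` usually suffices; it returns the starting index of the substring, or -1 if it is absent.
2. For more complex patterns, use `re.match`.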
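The two approaches can be compared in a small sketch. The sample strings are illustrative only.

```python
import re

text = 'yeah, but no, but yeah, but no, but yeah'

# str.find returns the index of the first occurrence, or -1 if absent
first_no = text.find('no')
missing = text.find('maybe')
print(first_no, missing)

# re.match only tries to match at the beginning of the string
date_re = r'\d+/\d+/\d+'
print(bool(re.match(date_re, '11/27/2012')))      # True
print(bool(re.match(date_re, 'Nov 27, 2012')))    # False

# re.match tolerates trailing text; re.fullmatch requires the whole string to match
loose = bool(re.match(date_re, '11/27/2012abcdef'))
strict = bool(re.fullmatch(date_re, '11/27/2012abcdef'))
print(loose, strict)
```

To find a pattern anywhere in the text rather than only at the start, `re.search` or `findall` is the usual choice.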

In [13]:
import re

# Some sample text
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# (a) Find all matching dates
datepat = re.compile(r'\d+/\d+/\d+')
print(datepat.findall(text))

# (b) Find all matching dates with capture groups
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
for month, day, year in datepat.findall(text):
    print('{}-{}-{}'.format(year, month, day))

# (c) Iterative search
for m in datepat.finditer(text):
    print(m.groups())


['11/27/2012', '3/13/2013']
2012-11-27
2013-3-13
('11', '27', '2012')
('3', '13', '2013')


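### 2.5 Searching and Replacing Text

- The usual approach is the `str.replace` method.
- `re.sub(pattern, repl, text)` replaces every match of `pattern` with `repl`.
- `pat.sub(func, text)`: `pat` is a compiled pattern from `re.compile`, and `func` is a replacement function that receives the match object.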
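As a small sketch of the literal case, plus `re.subn` for counting substitutions (the sample text is illustrative):

```python
import re

text = 'yeah, but no, but yeah, but no, but yeah'

# Literal replacement; the optional third argument caps the number of replacements
all_replaced = text.replace('yeah', 'yep')
first_only = text.replace('yeah', 'yep', 1)
print(all_replaced)
print(first_only)

# re.subn behaves like re.sub but also returns how many substitutions were made
dates = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
newtext, n = re.subn(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', dates)
print(newtext, n)
```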


In [12]:
import re

# Some sample text
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

# (a) Simple substitution
print(datepat.sub(r'\3-\1-\2', text))

# (b) Replacement function
from calendar import month_abbr

def change_date(m):
    mon_name = month_abbr[int(m.group(1))]
    return '{} {} {}'.format(m.group(2), mon_name, m.group(3))

print(datepat.sub(change_date, text))

Today is 2012-11-27. PyCon starts 2013-3-13.
Today is 27 Nov 2012. PyCon starts 13 Mar 2013.


### 2.6 Regular Expression Matching Options

1. `re.IGNORECASE`: ignore case when matching.
2. Shortest (non-greedy) matching, for example `r'\"(.*?)\"'`.
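A short sketch of the first option, using a made-up sample string:

```python
import re

text = 'UPPER PYTHON, lower python, Mixed Python'

# With re.IGNORECASE the pattern matches regardless of letter case
found = re.findall('python', text, flags=re.IGNORECASE)
print(found)

# The replacement is inserted verbatim; it does not adopt the case of the match
replaced = re.sub('python', 'snake', text, flags=re.IGNORECASE)
print(replaced)
```

If the replacement should follow the case of each match, a replacement function is needed instead of a plain string.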

In [14]:
# Sample text
text = 'Computer says "no." Phone says "yes."'

# (a) Regex that finds quoted strings - longest match
str_pat = re.compile(r'\"(.*)\"')
print(str_pat.findall(text))

# (b) Regex that finds quoted strings - shortest match
str_pat = re.compile(r'\"(.*?)\"')
print(str_pat.findall(text))


['no." Phone says "yes.']
['no.', 'yes.']


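### 2.7 Matching Across Multiple Lines

By default `.` does not match a newline. The `re.DOTALL` flag makes `.` match any character, including newlines.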

In [18]:
text = '''/* this is a
              multiline comment */
'''
comment = re.compile(r'/\*(.*?)\*/')
print(comment.findall(text))   # [] because . does not match newlines by default

comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
print(comment.findall(text))   # matches: re.DOTALL lets . match newlines

comment = re.compile(r'/\*((?:.|\n)*?)\*/')
print(comment.findall(text))

[]
[' this is a\n              multiline comment ']
[' this is a\n              multiline comment ']


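### 2.8 Handling Unicode Characters

This is another big topic.

See <https://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001386819196283586a37629844456ca7e5a7faa9b94ee8000>

Key points:

1. In memory, a Python 3 `str` is a sequence of Unicode code points. CPython stores each string at a fixed width per character (1, 2, or 4 bytes, chosen by the widest character in that string), which keeps indexing simple.
2. For transmission or storage, text is usually encoded as UTF-8 to save bandwidth and space. ASCII characters have the same one-byte (8-bit) encoding in UTF-8 as in ASCII.
3. `bytes.decode()` and `str.encode()`: note the names and directions. Encoding turns `str` into `bytes`; decoding turns `bytes` back into `str`.
4. Other encodings are also possible (for example `encode('gbk')`), but they are not recommended.

Python 3 and Python 2 also differ in how they handle this.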
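A minimal round-trip sketch of `str.encode()` and `bytes.decode()` in Python 3. The sample string is arbitrary.

```python
s = 'héllo 中文'

data = s.encode('utf-8')            # str -> bytes
print(data, len(s), len(data))      # 8 characters, 13 bytes

roundtrip = data.decode('utf-8')    # bytes -> str
print(roundtrip == s)

# ASCII characters have identical bytes in ASCII and in UTF-8
same = 'abc'.encode('ascii') == 'abc'.encode('utf-8') == b'abc'
print(same)

# Decoding with a codec that cannot represent the bytes raises an error
try:
    data.decode('ascii')
    failed = False
except UnicodeDecodeError:
    failed = True
print('ASCII decode failed:', failed)
```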

In [58]:
s = '中'
print(s)
b1 = s.encode('ascii', 'backslashreplace')
print(b1, len(b1))   # 6 bytes: backslashreplace emits the ASCII text \u4e2d, i.e. the six characters \ u 4 e 2 d, not the raw code point
b2 = s.encode('utf-8')
print(b2, len(b2))   # the UTF-8 encoding takes three bytes

中
b'\\u4e2d' 6
b'\xe4\xb8\xad' 3


In [78]:
import sys

print(sys.getsizeof(12))
print(sys.getsizeof(''))  # 49 + 0
print(sys.getsizeof('a'))  # 49 + 1
print(sys.getsizeof('中'))   # 49 + 25 + 2
print(sys.getsizeof('中文'))   # 49 + 25 + 4
print(sys.getsizeof('中文字'))   # 49 + 25 + 6
print(sys.getsizeof('中文字d'))   # 49 + 25 + 8
print(sys.getsizeof('d中文字d2'))   # 49 + 25 + 12

28
49
50
76
78
80
82
86
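The sizes above grow by 2 bytes per character once a CJK character is present. The following sketch estimates the per-character storage directly. It assumes CPython 3.3 or later (the flexible string representation from PEP 393); other implementations may report different numbers. `bytes_per_char` is a helper name made up for this example.

```python
import sys

def bytes_per_char(ch):
    """Estimate storage per character as the size difference between
    a 10-character and a 9-character string built from ch."""
    return sys.getsizeof(ch * 10) - sys.getsizeof(ch * 9)

# ASCII, Latin-1, CJK (BMP), and an emoji outside the BMP
for ch in ('a', '\xe9', '\u4e2d', '\U0001F600'):
    print(hex(ord(ch)), bytes_per_char(ch))
```

On CPython this prints 1, 1, 2, and 4: the width is chosen per string from its widest character, so a single CJK character makes every character in that string cost 2 bytes.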


### 2.9 strip: Removing Unwanted Leading and Trailing Characters

- strip
- lstrip
- rstrip
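A brief sketch of the three methods on made-up strings:

```python
s = '   hello world \n'
print(repr(s.strip()))     # whitespace removed from both ends
print(repr(s.lstrip()))    # left side only
print(repr(s.rstrip()))    # right side only

# The argument is a set of characters to remove, not a prefix or suffix
t = '-----hello====='
left = t.lstrip('-')
both = t.strip('-=')
print(left, both)

# Characters in the middle of the string are never touched
u = '  hello     world  '
middle = u.strip()
print(repr(middle))
```

To clean up the inside of a string as well, `replace` or `re.sub` is needed.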

### 2.10 Sanitizing and Cleaning Up Text

I did not fully understand how `unicodedata` is used here.
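As a rough explanation of what the `unicodedata` calls do: `normalize('NFD', ...)` decomposes an accented character into a base character followed by combining marks, and `combining()` returns a nonzero value for such marks. Removing the marks after decomposition therefore strips the accents. A small sketch with an illustrative string:

```python
import unicodedata

s1 = 'Spicy Jalape\u00f1o'     # ñ as a single code point (U+00F1)
s2 = 'Spicy Jalapen\u0303o'    # n followed by a combining tilde (U+0303)

# Both render the same but differ as code point sequences
print(s1 == s2, len(s1), len(s2))

# NFD decomposes into base + combining mark; NFC composes them again
nfd = unicodedata.normalize('NFD', s1)
nfc = unicodedata.normalize('NFC', s2)
print(nfd == s2, nfc == s1)

# combining() is nonzero for combining marks and 0 for ordinary characters
print(unicodedata.combining('\u0303'), unicodedata.combining('n'))

# Dropping the combining marks after NFD removes the accent
stripped = ''.join(c for c in nfd if not unicodedata.combining(c))
print(stripped)
```

The cell below does the same thing with `str.translate` and a table that maps every combining character to `None`.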


In [86]:
# A tricky string
s = 'p\xfdt\u0125\xf6\xf1\x0cis\tawesome\r\n'
print(s)

# (a) Remapping whitespace
remap = {
    ord('\t') : ' ',
    ord('\f') : ' ',
    ord('\r') : None      # Deleted
}

a = s.translate(remap)
print('whitespace remapped:', a)

# (b) Remove all combining characters/marks
import unicodedata
import sys
cmb_chrs = dict.fromkeys(c for c in range(sys.maxunicode)
                         if unicodedata.combining(chr(c)))

b = unicodedata.normalize('NFD', a)
c = b.translate(cmb_chrs)
print('accents removed:', c)

# (c) Accent removal using I/O decoding
d = b.encode('ascii','ignore').decode('ascii')
print('accents removed via I/O:', d)


pýtĥöñis	awesome

whitespace remapped: pýtĥöñ is awesome

accents removed: python is awesome

accents removed via I/O: python is awesome



### 2.11 Aligning Text Strings

- ljust
- rjust
- center
- format

`format` is the more flexible option, since it also works on non-string values such as numbers.

In [103]:
text = "Hello World"
print('"%s"' % text.ljust(20))
print('"%s"' % text.rjust(20))
print('"%s"' % text.center(20))

print('"%s"' % text.ljust(20, '='))
print('"%s"' % text.rjust(20, '-'))
print('"%s"' % text.center(20, '*'))

print('"%s"' % format(text, '<20'))
print('"%s"' % format(text, '>20'))
print('"%s"' % format(text, '^20'))

print('"%s"' % format(text, '=<20'))
print('"%s"' % format(text, '->20'))
print('"%s"' % format(text, '*^20'))

print('"%-20s"' % text)
print('"%20s"' % text)


print('"%s"' % format(1.2345, '=<20.2f'))
print('"%s"' % format(1.2345, '->20.2f'))
print('"%s"' % format(1.2345, '*^20.2f'))

"Hello World         "
"         Hello World"
"    Hello World     "
"Hello World========="
"---------Hello World"
"****Hello World*****"
"Hello World         "
"         Hello World"
"    Hello World     "
"Hello World========="
"---------Hello World"
"****Hello World*****"
"Hello World         "
"         Hello World"
"1.23================"
"----------------1.23"
"********1.23********"



### 2.12 Combining and Concatenating Strings

`''.join` is straightforward.

In [104]:
def sample():
    yield "Is"
    yield "Chicago"
    yield "Not"
    yield "Chicago?"

# (a) Simple join operator
text = ''.join(sample())
print(text)

# (b) Redirection of parts to I/O
import sys
for part in sample():
    sys.stdout.write(part)
sys.stdout.write('\n')

# (c) Combination of parts into buffers and larger I/O operations
def combine(source, maxsize):
    parts = []
    size = 0
    for part in source:
        parts.append(part)
        size += len(part)
        if size > maxsize:
            yield ''.join(parts)
            parts = []
            size = 0
    yield ''.join(parts)

for part in combine(sample(), 32768):
    sys.stdout.write(part)
sys.stdout.write('\n')


IsChicagoNotChicago?
IsChicagoNotChicago?
IsChicagoNotChicago?


### 2.13 Reformatting Text (wrap)

In [105]:

# A long string
s = "Look into my eyes, look into my eyes, the eyes, the eyes, \
the eyes, not around the eyes, don't look around the eyes, \
look into my eyes, you're under."

import textwrap

print(textwrap.fill(s, 70))
print()

print(textwrap.fill(s, 40))
print()

print(textwrap.fill(s, 40, initial_indent='    '))
print()

print(textwrap.fill(s, 40, subsequent_indent='    '))
print()


Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes, don't look around the eyes, look into my eyes,
you're under.

Look into my eyes, look into my eyes,
the eyes, the eyes, the eyes, not around
the eyes, don't look around the eyes,
look into my eyes, you're under.

    Look into my eyes, look into my
eyes, the eyes, the eyes, the eyes, not
around the eyes, don't look around the
eyes, look into my eyes, you're under.

Look into my eyes, look into my eyes,
    the eyes, the eyes, the eyes, not
    around the eyes, don't look around
    the eyes, look into my eyes, you're
    under.



### 2.14 Tokenizing Text

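This is similar to splitting a string, but the focus is on extracting typed tokens such as names, numbers, and operators.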

In [106]:
import re
from collections import namedtuple

NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'
NUM  = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
TIMES = r'(?P<TIMES>\*)'
EQ    = r'(?P<EQ>=)'
WS    = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NAME, NUM, PLUS, TIMES, EQ, WS]))

Token = namedtuple('Token', ['type','value'])

def generate_tokens(pat, text):
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        yield Token(m.lastgroup, m.group())

for tok in generate_tokens(master_pat, 'foo = 42'):
    print(tok)


Token(type='NAME', value='foo')
Token(type='WS', value=' ')
Token(type='EQ', value='=')
Token(type='WS', value=' ')
Token(type='NUM', value='42')



### 2.15 A Simple Expression Parser

Expression grammar:

```BNF
expr ::= expr + term
     |   expr - term
     |   term
term ::= term * factor
     |   term / factor
     |   factor
factor ::= (expr)
       |   NUM
```

In [107]:
# example.py
#
# An example of writing a simple recursive descent parser

import re
import collections

# Token specification
NUM    = r'(?P<NUM>\d+)'
PLUS   = r'(?P<PLUS>\+)'
MINUS  = r'(?P<MINUS>-)'
TIMES  = r'(?P<TIMES>\*)'
DIVIDE = r'(?P<DIVIDE>/)'
LPAREN = r'(?P<LPAREN>\()'
RPAREN = r'(?P<RPAREN>\))'
WS     = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NUM, PLUS, MINUS, TIMES, 
                                  DIVIDE, LPAREN, RPAREN, WS]))

# Tokenizer
Token = collections.namedtuple('Token', ['type','value'])

def generate_tokens(text):
    scanner = master_pat.scanner(text)
    for m in iter(scanner.match, None):
        tok = Token(m.lastgroup, m.group())
        if tok.type != 'WS':
            yield tok

# Parser 
class ExpressionEvaluator:
    '''
    Implementation of a recursive descent parser.   Each method
    implements a single grammar rule.  Use the ._accept() method
    to test and accept the current lookahead token.  Use the ._expect()
    method to exactly match and discard the next token on the input
    (or raise a SyntaxError if it doesn't match).
    '''

    def parse(self,text):
        self.tokens = generate_tokens(text)
        self.tok = None             # Last symbol consumed
        self.nexttok = None         # Next symbol tokenized
        self._advance()             # Load first lookahead token
        return self.expr()

    def _advance(self):
        'Advance one token ahead'
        self.tok, self.nexttok = self.nexttok, next(self.tokens, None)

    def _accept(self,toktype):
        'Test and consume the next token if it matches toktype'
        if self.nexttok and self.nexttok.type == toktype:
            self._advance()
            return True
        else:
            return False

    def _expect(self,toktype):
        'Consume next token if it matches toktype or raise SyntaxError'
        if not self._accept(toktype):
            raise SyntaxError('Expected ' + toktype)

    # Grammar rules follow

    def expr(self):
        "expression ::= term { ('+'|'-') term }*"

        exprval = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            if op == 'PLUS':
                exprval += right
            elif op == 'MINUS':
                exprval -= right
        return exprval
    
    def term(self):
        "term ::= factor { ('*'|'/') factor }*"

        termval = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            if op == 'TIMES':
                termval *= right
            elif op == 'DIVIDE':
                termval /= right
        return termval

    def factor(self):
        "factor ::= NUM | ( expr )"

        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            exprval = self.expr()
            self._expect('RPAREN')
            return exprval
        else:
            raise SyntaxError('Expected NUMBER or LPAREN')

if __name__ == '__main__':
    e = ExpressionEvaluator()
    print(e.parse('2'))
    print(e.parse('2 + 3'))
    print(e.parse('2 + 3 * 4'))
    print(e.parse('2 + (3 + 4) * 5'))

# Example of building trees

class ExpressionTreeBuilder(ExpressionEvaluator):
    def expr(self):
        "expression ::= term { ('+'|'-') term }"

        exprval = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            if op == 'PLUS':
                exprval = ('+', exprval, right)
            elif op == 'MINUS':
                exprval = ('-', exprval, right)
        return exprval
    
    def term(self):
        "term ::= factor { ('*'|'/') factor }"

        termval = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            if op == 'TIMES':
                termval = ('*', termval, right)
            elif op == 'DIVIDE':
                termval = ('/', termval, right)
        return termval

    def factor(self):
        'factor ::= NUM | ( expr )'

        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            exprval = self.expr()
            self._expect('RPAREN')
            return exprval
        else:
            raise SyntaxError('Expected NUMBER or LPAREN')

if __name__ == '__main__':
    e = ExpressionTreeBuilder()
    print(e.parse('2 + 3'))
    print(e.parse('2 + 3 * 4'))
    print(e.parse('2 + (3 + 4) * 5'))
    print(e.parse('2 + 3 + 4'))


2
5
14
37
('+', 2, 3)
('+', 2, ('*', 3, 4))
('+', 2, ('*', ('+', 3, 4), 5))
('+', ('+', 2, 3), 4)


In [109]:
# plyexample.py
#
# Example of parsing with PLY
# Requires the third-party PLY package (pip install ply), not PyParsing


from ply.lex import lex
from ply.yacc import yacc

# Token list
tokens = [ 'NUM', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN' ]

# Ignored characters

t_ignore = ' \t\n'

# Token specifications (as regexs)
t_PLUS   = r'\+'
t_MINUS  = r'-'
t_TIMES  = r'\*'
t_DIVIDE = r'/'
t_LPAREN = r'\('
t_RPAREN = r'\)'

# Token processing functions
def t_NUM(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Error handler
def t_error(t):
    print('Bad character: {!r}'.format(t.value[0]))
    t.lexer.skip(1)   # skip the bad character; in PLY 3.x skip() lives on the lexer

# Build the lexer
lexer = lex()

# Grammar rules and handler functions
def p_expr(p):
    '''
    expr : expr PLUS term
         | expr MINUS term
    '''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]

def p_expr_term(p):
    '''
    expr : term
    '''
    p[0] = p[1]

def p_term(p):
    '''
    term : term TIMES factor
         | term DIVIDE factor
    '''
    if p[2] == '*':
        p[0] = p[1] * p[3]
    elif p[2] == '/':
        p[0] = p[1] / p[3]

def p_term_factor(p):
    '''
    term : factor
    '''
    p[0] = p[1]

def p_factor(p):
    '''
    factor : NUM
    '''
    p[0] = p[1]

def p_factor_group(p):
    '''
    factor : LPAREN expr RPAREN
    '''
    p[0] = p[2]

def p_error(p):
    print('Syntax error')

parser = yacc()

if __name__ == '__main__':
    print(parser.parse('2'))
    print(parser.parse('2+3'))
    print(parser.parse('2+(3+4)*5'))


ModuleNotFoundError: No module named 'ply'