# Pandas String Operations
- String object methods
    - split()
    - strip()
    - join()
    - index()
    - find()
    - count()
    - replace()
- Regular expressions
    - \s+: one or more whitespace characters
    - findall()
    - search(): returns only the first match.
    - match(): only matches at the beginning of the string.
    - sub()
        - using special symbols like \1 and \2
    - groups()
        - findall returns a list of tuples when the pattern has groups
    - Table: Regular expression methods
- Vectorized string functions in pandas
    - SeriesObj.str.contains('keyword')
    - SeriesObj.str.findall(pattern, flags=re.IGNORECASE), combined with regular expressions
    - Table: Vectorized string methods

In [1]:
# coding:utf-8
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

## String object methods
A comma-separated string can be broken into pieces with `split`:

In [2]:
val = 'a,b,  guido'
val.split(',')

['a', 'b', '  guido']

`split` is often combined with `strip` to trim whitespace (including line breaks):

In [3]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

These substrings could be concatenated with a two-colon delimiter using addition:

In [4]:
first, second, third = pieces
first + '::' + second + '::' + third

'a::b::guido'

A faster and more Pythonic way is to pass a list or tuple to the `join` method on the string '::':

In [5]:
'::'.join(pieces)

'a::b::guido'

Using Python’s `in` keyword is the best way to detect a substring, though `index` and `find` can also be used:

In [6]:
'guido' in val

True

In [7]:
val.index(',')

1

In [8]:
val.find(':')

-1

Note the difference between `find` and `index` is that index raises an exception if the string isn’t found (versus returning -1):

In [9]:
val.index(':')

ValueError: substring not found
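The difference between `index` and `find` matters mostly when the substring may be absent. A minimal sketch of one way to handle both cases; the helper name `position_or_none` is hypothetical and only for illustration:

```python
val = 'a,b,  guido'

def position_or_none(text, sub):
    """Return the index of `sub` in `text`, or None when it is absent.

    Hypothetical helper, not part of the standard library.
    """
    try:
        return text.index(sub)   # raises ValueError when sub is missing
    except ValueError:
        return None

comma_pos = position_or_none(val, ',')   # position of the first comma
colon_pos = position_or_none(val, ':')   # val contains no colon
find_result = val.find(':')              # find signals absence with -1
print(comma_pos, colon_pos, find_result)
```

Note that `-1` is also a valid index in Python (the last character), so the result of `find` should be compared against `-1` before it is used for slicing.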

Relatedly, `count` returns the number of occurrences of a particular substring:

In [10]:
val.count(',')

2

`replace` will substitute occurrences of one pattern for another. It is also commonly used to delete a pattern by passing an empty string:

In [11]:
val.replace(',', '::')

'a::b::  guido'

In [12]:
val.replace(',', '')

'ab  guido'

Table 7-3. Python built-in string methods

Method | Description
---------|------------
count | Return the number of non-overlapping occurrences of substring in the string.
endswith | Returns True if string ends with suffix.
startswith | Returns True if string starts with prefix.
join | Use string as delimiter for concatenating a sequence of other strings.
index | Return position of first character in substring if found in the string. Raises ValueError if not found.
find | Return position of first character of first occurrence of substring in the string. Like index, but returns -1 if not found.
rfind | Return position of first character of last occurrence of substring in the string. Returns -1 if not found.
replace | Replace occurrences of string with another string.
strip, rstrip, lstrip | Trim whitespace, including newlines, from both sides (strip), the right side (rstrip), or the left side (lstrip).
split | Break string into list of substrings using passed delimiter.
lower | Convert alphabetic characters to lowercase
upper | Convert alphabetic characters to uppercase
casefold | Convert characters to lowercase, and convert any region-specific variable character combinations to a common comparable form
ljust, rjust | Left justify or right justify, respectively. Pad opposite side of string with spaces (or some other fill character) to return a string with a minimum width.
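
Several of the methods in the table are not exercised in the cells above. The following sketch tries a few of them on the same `val` string; the sample word `'Straße'` is an illustrative assumption, chosen because `casefold` and `lower` treat it differently:

```python
val = 'a,b,  guido'

starts = val.startswith('a,')      # prefix test
ends = val.endswith('guido')       # suffix test
last_comma = val.rfind(',')        # position of the last comma
shouted = val.upper()              # uppercase copy
padded = 'ab'.ljust(5, '.')        # pad on the right up to width 5

# casefold is more aggressive than lower for caseless comparison
lowered = 'Straße'.lower()
folded = 'Straße'.casefold()

print(starts, ends, last_comma, shouted, padded, lowered, folded)
```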

## Regular expressions
The `re` module functions fall into three categories: pattern matching, substitution, and splitting. The regex describing one or more whitespace characters is `\s+`:

In [13]:
import re
text = "foo    bar\t baz  \tqux"
re.split(r'\s+', text)  # raw string avoids the invalid-escape warning for \s

['foo', 'bar', 'baz', 'qux']

When you call `re.split(r'\s+', text)`, the regular expression is first compiled, then its `split` method is called on the passed text. You can compile the regex yourself with `re.compile`, forming a reusable regex object:

In [14]:
regex = re.compile(r'\s+')  # raw string avoids the invalid-escape warning for \s
regex.split(text)

['foo', 'bar', 'baz', 'qux']

Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.
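
As a rough illustration of that advice, the sketch below compiles the whitespace pattern once and reuses it across several strings. The `lines` list is made-up sample data:

```python
import re

# Compile once, then reuse the same regex object for every string
whitespace = re.compile(r'\s+')

lines = ["foo    bar\t baz", "a  b", "single"]  # illustrative input
tokens = [whitespace.split(line) for line in lines]
print(tokens)
```

The `re` module also keeps an internal cache of recently compiled patterns, so the saving from compiling by hand is usually modest; the compiled object is still the clearer choice when one pattern is applied in many places.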

If, instead, you wanted to get a list of all patterns matching the regex, you can use the `findall` method:

In [15]:
regex.findall(text)

['    ', '\t ', '  \t']

`match` and `search` are closely related to `findall`. While `findall` returns all matches in a string, `search` returns only the first match. More rigidly, `match` only matches at the beginning of the string.

In [16]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

Using `findall` on the text produces a list of the e-mail addresses:

In [17]:
regex.findall(text)

['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']

`search` returns a special match object for the first email address in the text. The match object records the start and end positions of the match in the string:

In [18]:
m = regex.search(text)
m

<_sre.SRE_Match at 0x118488ac0>

In [19]:
text[m.start():m.end()]

'dave@google.com'

`regex.match` returns `None`, as it will only match if the pattern occurs at the start of the string:

In [20]:
print(regex.match(text))

None
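
To make the contrast concrete, here is a small sketch comparing `match`, `search`, and `fullmatch` on a single line. `fullmatch` is not covered in the text above; it is included as a related standard-library method that requires the whole string to match:

```python
import re

email = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}',
                   flags=re.IGNORECASE)

line = 'Dave dave@google.com'

at_start = email.match(line)     # None: the line begins with 'Dave ', not an address
anywhere = email.search(line)    # finds the address later in the line
bare = email.match('dave@google.com')                # pattern is at position 0
whole = email.fullmatch('dave@google.com trailing')  # None: extra text after the address

print(at_start, anywhere.span(), bare.group(), whole)
```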


Relatedly, `sub` will return a new string with occurrences of the pattern replaced by a new string:

In [21]:
print(regex.sub('REDACTED', text))

Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED



Suppose you wanted to find email addresses and simultaneously segment each address into its three components: username, domain name, and domain suffix. To do this, put parentheses around the parts of the pattern to segment. A match object produced by this modified regex returns a tuple of the pattern components with its `groups` method:

In [22]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
m.groups()

('wesm', 'bright', 'net')

`findall` returns a list of tuples when the pattern has groups:

In [24]:
regex.findall(text)

[('dave', 'google', 'com'),
 ('steve', 'gmail', 'com'),
 ('rob', 'gmail', 'com'),
 ('ryan', 'yahoo', 'com')]

`sub` also has access to groups in each match using special symbols like `\1` and `\2`. The symbol \1 corresponds to the first matched group, \2 the second, and so forth.

In [25]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))

Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com



One variation on the above email regex gives names to the match groups:

In [26]:
regex = re.compile(r"""
	(?P<username>[A-Z0-9._%+-]+)
	@
	(?P<domain>[A-Z0-9.-]+)
	\.
	(?P<suffix>[A-Z]{2,4})""", flags=re.IGNORECASE|re.VERBOSE)
m = regex.match('wesm@bright.net')
m.groupdict()

{'domain': 'bright', 'suffix': 'net', 'username': 'wesm'}
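
Named groups can also be read one at a time with `group('name')` and referenced in a replacement string with the `\g<name>` syntax. A minimal sketch reusing the named pattern; the masking format `@***.` is an arbitrary choice for illustration:

```python
import re

regex = re.compile(r"""
    (?P<username>[A-Z0-9._%+-]+)
    @
    (?P<domain>[A-Z0-9.-]+)
    \.
    (?P<suffix>[A-Z]{2,4})""", flags=re.IGNORECASE | re.VERBOSE)

m = regex.match('wesm@bright.net')
domain = m.group('domain')   # look up a single group by name

# \g<name> refers to a named group inside the replacement string
masked = regex.sub(r'\g<username>@***.\g<suffix>', 'wesm@bright.net')
print(domain, masked)
```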

Table 7-4. Regular expression methods

Method | Description
---------|------------
findall | Return all non-overlapping matching patterns in a string as a list.
finditer | Like findall, but returns an iterator
match | Match pattern at start of string and optionally segment pattern components into groups. If the pattern matches, returns a match object, otherwise None.
search | Scan string for match to pattern; returning a match object if so. Unlike match, the match can be anywhere in the string as opposed to only at the beginning.
split | Break string into pieces at each occurrence of pattern.
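sub, subn | Replace occurrences of pattern in string with replacement expression; pass `count` to limit the number of replacements. `subn` behaves like `sub` but returns a tuple of the new string and the number of substitutions made. Use symbols \1, \2, ... to refer to match group elements in the replacement string.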
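
`finditer` and `subn` from the table do not appear in the cells above. The sketch below shows one possible use of each on a shortened copy of the email text; the two-line `text` is sample data trimmed for brevity:

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
"""
regex = re.compile(
    r'(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})',
    flags=re.IGNORECASE)

# finditer yields match objects lazily instead of building a list of tuples
records = [m.groupdict() for m in regex.finditer(text)]

# subn returns the new string together with the number of replacements made
redacted, n_subs = regex.subn('REDACTED', text)

# count=1 limits sub to the first occurrence
first_only = regex.sub('REDACTED', text, count=1)

print(records, n_subs)
print(first_only)
```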

## Vectorized string functions in pandas

In [27]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
        'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = pd.Series(data)
data

Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

In [28]:
data.isnull()

Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

String and regular expression methods can be applied (passing a lambda or other function) to each value using `data.map`, but it will fail on the NA (null) values. To cope with this, Series has array-oriented methods for string operations that skip NA values. These are accessed through Series’s `str` attribute; for example, we could check whether each email address has 'gmail' in it with `str.contains`:

In [29]:
data.str.contains('gmail')

Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object
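
The sketch below illustrates both points: the `map` approach raising on the missing value, and `str.contains` skipping it. Passing `na=False` is an addition not used in the cell above; it fills missing entries with `False` so the result can serve directly as a boolean mask:

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Plain Python string logic fails on the missing value, which is a float NaN
try:
    data.map(lambda x: 'gmail' in x)
    map_failed = False
except TypeError:
    map_failed = True

# na=False treats missing values as "no match", giving a pure boolean mask
mask = data.str.contains('gmail', na=False)
gmail_only = data[mask]
print(map_failed)
print(gmail_only)
```

By default `str.contains` interprets its argument as a regular expression; pass `regex=False` when the keyword should be matched literally.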

Regular expressions can be used, too, along with any `re` options like `IGNORECASE`:

In [30]:
pattern

'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [31]:
data.str.findall(pattern, flags=re.IGNORECASE)

Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

There are a couple of ways to do vectorized element retrieval. Either use `str.get` or index into the `str` attribute:

In [32]:
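# str.match returns booleans in current pandas, so take the first
# findall result to get each tuple of groups
matches = data.str.findall(pattern, flags=re.IGNORECASE).str[0]
matches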


Dave     (dave, google, com)
Rob        (rob, gmail, com)
Steve    (steve, gmail, com)
Wes                      NaN
dtype: object

In [33]:
matches.str.get(1)

Dave     google
Rob       gmail
Steve     gmail
Wes         NaN
dtype: object

In [34]:
matches.str[0]

Dave      dave
Rob        rob
Steve    steve
Wes        NaN
dtype: object

In [35]:
data.str[:5]

Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object
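
When the goal is to split each string into its captured groups, `str.extract` may be a more direct route than retrieving elements from tuples. A minimal sketch, assuming a named-group version of the email pattern (the group names are an addition for readability, since they become column names):

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

# Named groups are an illustrative choice; unnamed groups would yield
# integer column labels instead
pattern = r'(?P<username>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+)\.(?P<suffix>[A-Z]{2,4})'

# One column per capture group; rows that are missing or do not match are NaN
parts = data.str.extract(pattern, flags=re.IGNORECASE)
print(parts)
```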

Table 7-5. Vectorized string methods

Method | Description
-------|------------
cat | Concatenate strings element-wise with optional delimiter
contains | Return boolean array indicating whether each string contains pattern/regex
count | Count occurrences of pattern
endswith | Equivalent to x.endswith(pattern) for each element.
startswith | Equivalent to x.startswith(pattern) for each element.
findall | Compute list of all occurrences of pattern/regex for each string
get | Index into each element (retrieve i-th element)
isalnum | Equivalent to built-in str.isalnum
isalpha | Equivalent to built-in str.isalpha
isdecimal | Equivalent to built-in str.isdecimal
isdigit | Equivalent to built-in str.isdigit
islower | Equivalent to built-in str.islower
isnumeric | Equivalent to built-in str.isnumeric
isupper | Equivalent to built-in str.isupper
join | Join strings in each element of the Series with passed separator
len | Compute length of each string
lower, upper | Convert cases; equivalent to x.lower() or x.upper() for each element.
match | Use re.match with the passed regular expression on each element, returning True or False for whether it matches (older pandas versions returned the matched groups).
pad | Add whitespace to left, right, or both sides of strings
center | Equivalent to pad(side='both')
repeat | Duplicate values; for example s.str.repeat(3) equivalent to x * 3 for each string.
replace | Replace occurrences of pattern/regex with some other string
slice | Slice each string in the Series.
split | Split strings on delimiter or regular expression
strip | Trim whitespace from both sides, including newlines
rstrip | Trim whitespace on right side
lstrip | Trim whitespace on left side
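
A short sketch exercising a handful of the methods from the table on a small Series; the `emails` data is illustrative:

```python
import numpy as np
import pandas as pd

emails = pd.Series(['dave@google.com', 'steve@gmail.com', np.nan])

lengths = emails.str.len()                  # NaN stays NaN
shouted = emails.str.upper()                # element-wise upper()
users = emails.str.split('@').str[0]        # split, then take the first piece
ends_com = emails.str.endswith('.com')      # element-wise endswith
# regex=False treats '@' as a literal string rather than a pattern
spelled = emails.str.replace('@', ' at ', regex=False)

print(lengths, shouted, users, ends_com, spelled, sep='\n')
```

Each call propagates the missing value instead of raising, which is the main practical advantage over applying the built-in string methods with `map`.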