# Manipulation of strings

pandas provides the ability to succinctly apply string methods and regular expressions to entire arrays of data.

## String object methods

In many string munging and scripting applications, the built-in string methods are sufficient. For example, a comma-separated string can be split into parts using [split](https://docs.python.org/3/library/stdtypes.html#str.split):

In [1]:
str = """Jupyter tutorial, PyViz tutorial, Python basics,
Jupyter tutorial, Python basics"""

chunks = str.split(',')

print(chunks)

['Jupyter tutorial', ' PyViz tutorial', ' Python basics', '\nJupyter tutorial', ' Python basics']


`split` is often combined with [str.strip](https://docs.python.org/3/library/stdtypes.html#str.strip) to remove spaces and line breaks:

In [2]:
chunks = [x.strip() for x in str.split(',')]

chunks

['Jupyter tutorial',
 'PyViz tutorial',
 'Python basics',
 'Jupyter tutorial',
 'Python basics']

A quick way to join a list or tuple into a single string is the [str.join](https://docs.python.org/3/library/stdtypes.html#str.join) method:

In [3]:
';'.join(chunks)

'Jupyter tutorial;PyViz tutorial;Python basics;Jupyter tutorial;Python basics'

By using the Python keyword `in`, it is easy to check whether a certain string is present:

In [4]:
'Python basics' in chunks

True

The number of occurrences of a substring can be determined with [str.count](https://docs.python.org/3/library/stdtypes.html#str.count):

In [5]:
str.count('Python basics')

2

The [str.replace](https://docs.python.org/3/library/stdtypes.html#str.replace) method can be used to replace the occurrence of one pattern with another. It is also often used to delete patterns by passing an empty string:

In [6]:
str.replace(', ', ';')

'Jupyter tutorial;PyViz tutorial;Python basics,\nJupyter tutorial;Python basics'

In [7]:
str.replace('\n', '')

'Jupyter tutorial, PyViz tutorial, Python basics,Jupyter tutorial, Python basics'

Python built-in string methods:

Method | Description
:----- | :----------
`count` | returns the number of non-overlapping occurrences of the string
`endswith` | returns `True` if the string ends with the suffix
`startswith` | returns `True` if the string starts with the prefix
`join` | uses the string as a delimiter for concatenating a sequence of other strings
`index` | returns the position of the first character of the first occurrence of the substring in the string; raises a `ValueError` if it is not found
`find` | returns the position of the first character of the first occurrence of the substring in the string; like `index`, but returns `-1` if nothing was found
`rfind` | returns the position of the first character of the last occurrence of the substring in the string; returns `-1` if nothing was found
`replace` | replaces occurrences of a string with another string
`strip`, `rstrip`, `lstrip` | strip whitespace, including line breaks, from both ends, the right end or the left end of the string
`split` | splits a string into a list of substrings using the passed separator character
`lower` | converts alphabetic characters into lower case letters
`upper` | converts alphabetic characters to uppercase letters
`casefold` | converts characters to lower case and converts all region-specific variable character combinations to a common comparable form
`ljust`, `rjust` | left-justified and right-justified respectively; fills the opposite side of the string with spaces (or another fill character) to obtain a string with a minimum width
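
Some of the methods in the table are not demonstrated above. The following is a minimal sketch of how `find`, `index`, `rfind`, `casefold`, `ljust` and `rjust` behave; the sample text and the variable names are made up for illustration only:

```python
# Hypothetical sample text, chosen only for illustration.
text = "Python basics, Jupyter tutorial, Python basics"

# find and index both report the position of the first occurrence ...
first = text.find("Jupyter")
assert first == text.index("Jupyter")

# ... but they differ when the substring is missing:
# find returns -1, index raises a ValueError.
missing = text.find("PyViz")
try:
    text.index("PyViz")
except ValueError:
    index_failed = True
else:
    index_failed = False

# rfind reports the start of the last occurrence.
last = text.rfind("Python")

# casefold is a more aggressive lower(), intended for caseless comparisons.
same = "STRASSE".casefold() == "straße".casefold()

# ljust and rjust pad to a minimum width with a fill character.
padded = "io".ljust(5, ".") + "|" + "io".rjust(5, ".")

print(first, missing, index_failed, last, same, padded)
```

Which of `find` and `index` fits better depends on whether a missing substring is an expected case or an error in your data.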

## Regular expressions

Regular expressions, also called *regex*, provide a flexible way to search or match (often more complex) string patterns in text. Python’s built-in [re](https://docs.python.org/3/library/re.html) module is responsible for applying regular expressions to strings. The functions of the `re` module fall into three categories: pattern matching, substitution and splitting. These are all related, of course; a regex describes a pattern to be found in text, which can then be used for many purposes.

> **See also:**
> 
> [Regular expressions](../ipython/unix-shell/regex.ipynb)

Consider a simple example: suppose we want to split a string at a variable number of whitespace characters (tabs, spaces and newlines). The regex describing one or more whitespace characters is `\s+`:

In [8]:
import re

re.split(r'\s+', str)

['Jupyter',
 'tutorial,',
 'PyViz',
 'tutorial,',
 'Python',
 'basics,',
 'Jupyter',
 'tutorial,',
 'Python',
 'basics']

When you call `re.split(r'\s+', str)`, the regular expression is first compiled and then its `split` method is called on the passed text. You can compile the regex yourself with `re.compile` and thus create a reusable regex object:

In [9]:
regex = re.compile(r'\s+')

regex.split(str)

['Jupyter',
 'tutorial,',
 'PyViz',
 'tutorial,',
 'Python',
 'basics,',
 'Jupyter',
 'tutorial,',
 'Python',
 'basics']

If instead you want to get a list of all substrings that match the regex, you can use the [findall](https://docs.python.org/3/library/re.html#re.findall) method:

In [10]:
regex.findall(str)

[' ', ' ', ' ', ' ', ' ', '\n', ' ', ' ', ' ']

> **Note:**
> 
> To avoid unwanted escaping with `\` in a regular expression, use raw string literals like `r'C:\PATH\TO\FILE'` instead of the corresponding `'C:\PATH\TO\FILE'`.

Creating a regex object with `re.compile` is highly recommended if you intend to apply the same expression to many strings; this also saves CPU cycles.
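
As a minimal sketch of this recommendation, the following example compiles a pattern once and reuses it for several strings. The list `lines` and the names `whitespace` and `normalised` are hypothetical and only serve the illustration:

```python
import re

# Hypothetical input lines, invented for this sketch.
lines = [
    "Jupyter   tutorial",
    "PyViz\ttutorial",
    "Python \n basics",
]

# Compile once; the raw string passes the backslash through to the regex engine.
whitespace = re.compile(r"\s+")

# Reuse the compiled object for every string.
normalised = [whitespace.sub(" ", line) for line in lines]

# A raw string keeps the backslash as a literal character:
# "\n" is a single newline character, r"\n" is a backslash followed by "n".
lengths = (len("\n"), len(r"\n"))

print(normalised, lengths)
```

The `re` module also caches recently compiled patterns internally, so the saving is probably most noticeable in tight loops; a compiled object mainly gives the pattern a name and keeps its flags in one place.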

`match` and `search` are closely related to `findall`. While `findall` returns all matches in a string, `search` returns only the first match and `match` returns only matches at the beginning of the string. As a less trivial example, consider a block of text and a regular expression that can identify most email addresses:

In [11]:
addresses = """Veit <veit@cusy.io>
Veit Schiele <veit.schiele@cusy.io>
cusy GmbH <info@cusy.io>
"""

In [12]:
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

`re.IGNORECASE` makes the regex case-insensitive.

Using `findall` for the text gives a list of email addresses:

In [13]:
regex.findall(addresses)

['veit@cusy.io', 'veit.schiele@cusy.io', 'info@cusy.io']

`search` returns a special `match` object for the first email address in the text. For the preceding regex, the `match` object can only specify the start and end position of the pattern in the string:

In [14]:
first = regex.search(addresses)

first

<re.Match object; span=(6, 18), match='veit@cusy.io'>

In [15]:
addresses[first.start():first.end()]

'veit@cusy.io'

`regex.match` returns `None` because the pattern only matches if it is at the beginning of the string:

In [16]:
print(regex.match(addresses))

None


Accordingly, `sub` returns a new string in which all occurrences of the pattern are replaced by the new string:

In [17]:
print(regex.sub('…', addresses))

Veit <…>
Veit Schiele <…>
cusy GmbH <…>



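Suppose you want to find email addresses and at the same time split each address into its three components: username, domain name and domain suffix. To do this, put parentheses around the parts of the pattern to be segmented: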

In [18]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)

In [19]:
match = regex.match('veit@cusy.io')

match.groups()

('veit', 'cusy', 'io')

`findall` returns a list of tuples if the pattern contains groups:

In [20]:
regex.findall(addresses)

[('veit', 'cusy', 'io'),
 ('veit.schiele', 'cusy', 'io'),
 ('info', 'cusy', 'io')]

`sub` also has access to the groups in each match with special symbols. Thus `\1` stands for the first matching group, `\2` for the second and so on:

In [21]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', addresses))

Veit <Username: veit, Domain: cusy, Suffix: io>
Veit Schiele <Username: veit.schiele, Domain: cusy, Suffix: io>
cusy GmbH <Username: info, Domain: cusy, Suffix: io>



The following table contains a brief overview of methods for regular expressions:

Method | Description
:----- | :----------
`findall` | returns all non-overlapping matching patterns in a string as a list
`finditer` | like `findall`, but returns an iterator over `match` objects
`match` | matches the pattern at the beginning of the string and optionally segments the pattern components into groups; if the pattern matches, a `match` object is returned, otherwise `None`
`search` | searches the string for a match to the pattern and, if one is found, returns a `match` object; unlike `match`, the match can be anywhere in the string, not just at the beginning
`split` | splits the string into parts at each occurrence of the pattern
`sub`, `subn` | replace all occurrences of the pattern in the string with a replacement expression, or only the first `count` occurrences if `count` is passed; `subn` additionally returns the number of substitutions made; the symbols `\1`, `\2`, … in the replacement string refer to the elements of the match group
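
`finditer` and `subn` are listed in the table but not shown in the examples. The following sketch repeats the sample text and the grouped pattern from above so that it runs on its own; the names `spans`, `masked`, `n` and `first_only` are only illustrative:

```python
import re

# Sample text and pattern repeated from the examples above.
addresses = """Veit <veit@cusy.io>
Veit Schiele <veit.schiele@cusy.io>
cusy GmbH <info@cusy.io>
"""
pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(pattern, flags=re.IGNORECASE)

# finditer yields one match object per address, so groups and
# positions are available for every match, not just the first.
spans = [(m.group(1), m.span()) for m in regex.finditer(addresses)]

# subn returns a tuple: the new string and the number of substitutions.
masked, n = regex.subn("…", addresses)

# The count argument of sub limits the number of substitutions.
first_only = regex.sub("…", addresses, count=1)

print(spans[0], n)
print(first_only)
```

`finditer` can be preferable to `findall` for long texts because the matches are produced lazily, one at a time.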

## Vectorised string functions in pandas

Cleaning up a cluttered dataset for analysis often requires a lot of string manipulation. To make matters worse, a column containing strings sometimes has missing data:

In [22]:
import pandas as pd
import numpy as np

data = {'Veit': np.nan, 'Veit Schiele': 'veit.schiele@cusy.io',
        'cusy GmbH': 'info@cusy.io'}
data = pd.Series(data)

data

Veit                             NaN
Veit Schiele    veit.schiele@cusy.io
cusy GmbH               info@cusy.io
dtype: object

In [23]:
data.isna()

Veit             True
Veit Schiele    False
cusy GmbH       False
dtype: bool

You can apply string and regular expression methods to each value (by passing a lambda or other function) using `data.map`, but this fails for `NA` values. To deal with this, `Series` has array-oriented methods for string operations that skip `NA` values and pass them through to the result. These are accessed via the `str` attribute of the `Series`; for example, we could use `str.contains` to check whether each email address contains `veit`:

In [24]:
data.str.contains('veit')

Veit              NaN
Veit Schiele     True
cusy GmbH       False
dtype: object
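
To make the difference between `data.map` and the `str` accessor concrete, here is a small sketch that rebuilds the series from above. It assumes an object-dtype series; `map_error`, `ignored` and `mask` are hypothetical names used only here:

```python
import numpy as np
import pandas as pd

# The same data as above, rebuilt so the sketch runs on its own.
data = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    },
    dtype=object,
)

# A plain function fails on the missing value, because NaN is a float
# and does not support the `in` operator.
try:
    data.map(lambda x: "veit" in x)
    map_error = None
except TypeError as err:
    map_error = type(err).__name__

# na_action="ignore" makes map skip missing values instead.
ignored = data.map(lambda x: "veit" in x, na_action="ignore")

# str.contains skips missing values by default; na=False replaces
# them in the result, which yields a mask usable for filtering.
mask = data.str.contains("veit", na=False)

print(map_error)
print(data[mask])
```

Passing `na=False` is one possible choice; whether a missing address should count as "does not contain" depends on the analysis.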

Regular expressions can also be used, along with options such as `IGNORECASE`:

In [25]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

data.str.findall(pattern, flags=re.IGNORECASE)

Veit                                   NaN
Veit Schiele    [(veit.schiele, cusy, io)]
cusy GmbH               [(info, cusy, io)]
dtype: object

There are several ways to retrieve elements in a vectorised way: either use `str.get` or index into the `str` attribute:

In [26]:
matches = data.str.findall(pattern, flags=re.IGNORECASE).str[0]

matches

Veit                                 NaN
Veit Schiele    (veit.schiele, cusy, io)
cusy GmbH               (info, cusy, io)
dtype: object

In [27]:
matches.str.get(1)

Veit             NaN
Veit Schiele    cusy
cusy GmbH       cusy
dtype: object

Similarly, you can also slice strings with this syntax:

In [28]:
data.str[:5]

Veit              NaN
Veit Schiele    veit.
cusy GmbH       info@
dtype: object

The [pandas.Series.str.extract](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html) method returns the captured groups of a regular expression as a DataFrame:

In [29]:
data.str.extract(pattern, flags=re.IGNORECASE)

                         0     1    2
Veit                   NaN   NaN  NaN
Veit Schiele  veit.schiele  cusy   io
cusy GmbH             info  cusy   io
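
If the groups in the pattern are named, `str.extract` uses the group names as column names. The following sketch assumes the same data as above; the group names `user`, `domain` and `suffix` are chosen freely for this illustration:

```python
import re

import numpy as np
import pandas as pd

# The same data as above, rebuilt so the sketch runs on its own.
data = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    },
    dtype=object,
)

# The same pattern as before, but with named groups (?P<name>...).
named = (
    r"(?P<user>[A-Z0-9._%+-]+)"
    r"@(?P<domain>[A-Z0-9.-]+)"
    r"\.(?P<suffix>[A-Z]{2,4})"
)

# One column per named group; rows without a match contain NaN.
parts = data.str.extract(named, flags=re.IGNORECASE)

print(parts)
```

Named columns tend to make subsequent selections such as `parts["domain"]` easier to read than positional ones.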


More vectorised pandas string methods:

Method | Description
:----- | :----------
`cat` | concatenates strings element by element with optional delimiter
`contains` | returns a boolean array indicating whether each string contains a pattern/regex
`count` | counts occurrences of the pattern
`extract` | uses a regular expression with groups to extract one or more strings from a set of strings; the result is a DataFrame with one column per group
`endswith` | equivalent to `x.endswith(pattern)` for each element
`startswith` | equivalent to `x.startswith(pattern)` for each element
`findall` | computes list of all occurrences of pattern/regex for each string
`get` | indexes into each element (retrieves the `i`-th element)
`isalnum` | Equivalent to built-in `str.isalnum`
`isalpha` | Equivalent to built-in `str.isalpha`
`isdecimal` | Equivalent to built-in `str.isdecimal`
`isdigit` | Equivalent to built-in `str.isdigit`
`islower` | Equivalent to built-in `str.islower`
`isnumeric` | Equivalent to built-in `str.isnumeric`
`isupper` | Equivalent to built-in `str.isupper`
`join` | joins strings in each element of the series with the passed separator character
`len` | calculates the length of each string
`lower`, `upper` | converts case; equivalent to `x.lower()` or `x.upper()` for each element
`match` | uses `re.match` with the passed regular expression for each element, returning `True` or `False` depending on whether it matched
`extract` | captures group elements (if any) by index from each string
`pad` | inserts spaces on the left, right or both sides of strings
`center` | Equivalent to `pad(side='both')`
`repeat` | Duplicates values (for example `s.str.repeat(3)` equals `x * 3` for each string)
`replace` | replaces pattern/regex with another string
`slice` | slices each string in the series
`split` | splits strings using delimiters or regular expressions
`strip` | strips whitespace on both sides, including line breaks
`rstrip` | strips whitespace on the right side
`lstrip` | strips whitespace on the left side
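
A few of the methods from the table can be sketched with the email series from above. The replacement domain `example.org` and the result names are hypothetical and only serve the illustration:

```python
import numpy as np
import pandas as pd

# The same data as above, rebuilt so the sketch runs on its own.
data = pd.Series(
    {
        "Veit": np.nan,
        "Veit Schiele": "veit.schiele@cusy.io",
        "cusy GmbH": "info@cusy.io",
    },
    dtype=object,
)

# Length of each string; the missing value stays missing.
lengths = data.str.len()

# Element-wise equivalents of the built-in string methods.
upper = data.str.upper()
starts = data.str.startswith("info")

# split returns a list per element; str.get picks one item from each list.
domains = data.str.split("@").str.get(1)

# regex=False treats the pattern as a literal string, so the dot
# in "cusy.io" is not interpreted as a regex wildcard.
replaced = data.str.replace("cusy.io", "example.org", regex=False)

print(lengths, upper, starts, domains, replaced, sep="\n\n")
```

In all of these results the missing value for `Veit` is passed through as `NaN` instead of raising an error.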