# Vectorized String Operations

This section walks through some of the _vectorized string operations_ that Pandas provides to handle string data efficiently.

Previous sections introduced how tools like NumPy and Pandas use _vectorization_ to perform the same operation on many array elements at once, both efficiently (without an explicit Python loop) and elegantly (with simple, concise syntax). Here is a quick example:

In [1]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2

array([ 4,  6, 10, 14, 22, 26])

Unfortunately, NumPy does not provide such simple access for arrays of strings, so we have to fall back on the conventional, more verbose loop syntax:

In [2]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

Probably the biggest downside of this approach, especially when working with real-world data, is that it isn't robust to missing values:

In [3]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'
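One way to avoid the error in plain Python is to guard each element explicitly. The following is only a minimal sketch of that workaround; the name `capitalized` is introduced here purely for illustration:

```python
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Capitalize real strings and pass None through unchanged.
capitalized = [s.capitalize() if s is not None else None for s in data]
print(capitalized)
```

This works, but the guard has to be repeated for every string operation, which quickly becomes tedious.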

Pandas addresses both needs, vectorized string operations and correct handling of missing data, through the `str` attribute of `Series` and `Index` objects containing strings. As an example, we'll use `data` to create a new `Series` object:

In [4]:
import pandas as pd
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

Now we can capitalize all entries by calling a single method, which skips over any missing values:

In [5]:
names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
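The same accessor is available on `Index` objects. As a small sketch (the labels and values below are made up for illustration), the following upper-cases a set of index labels and then uses them to index a `Series`:

```python
import pandas as pd

# Hypothetical labels, used only to illustrate the str accessor on an Index.
labels = pd.Index(['peter', 'paul', 'mary'])
upper_labels = labels.str.upper()

# The transformed Index can be used like any other Index.
scores = pd.Series([1, 2, 3], index=upper_labels)
print(scores)
```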

Next, we will go over some examples to illustrate the capabilities of the string operations that are provided through the `str` attribute. First, let's declare the series of names that will be used:

In [6]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

## Methods similar to Python string methods

Quite a few methods behave much like Python's built-in string methods. Let's go over a few:

In [7]:
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

In [8]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [9]:
monte.str.startswith('T')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

In [10]:
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object
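A few more methods of the same kind can be sketched in the same way. The block below redefines `monte` so that it runs on its own, and the result names (`upper`, `ends_with_n`, `underscored`) are illustrative. Passing `regex=False` to `str.replace` is a deliberate choice so that the pattern is treated as a literal string:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Element-wise counterparts of str.upper, str.endswith and str.replace.
upper = monte.str.upper()
ends_with_n = monte.str.endswith('n')
underscored = monte.str.replace(' ', '_', regex=False)

print(underscored)
```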

The Python Data Science Handbook lists the rest of them [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html#Methods-similar-to-Python-string-methods).

## Methods using regular expressions

Pandas also provides us with a variety of methods that accept regular expressions as parameters, allowing for very flexible and powerful operations. Let's see some examples:

Imagine we want to extract the first name of each entry. We can do so by capturing the first contiguous group of letters in each element:

In [11]:
monte.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object
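With more than one capture group, `str.extract` returns a `DataFrame` with one column per group. The sketch below assumes every entry is exactly two space-separated words made up of ASCII letters, which holds for this series but not for names in general. The group names `first` and `last` are chosen here for illustration:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Named groups become the column names of the resulting DataFrame.
parts = monte.str.extract(r'^(?P<first>[A-Za-z]+)\s+(?P<last>[A-Za-z]+)$')
print(parts)
```

Entries that do not match the pattern would produce missing values in both columns instead of raising an error.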

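We can also find all names that begin and end with a consonant by anchoring the pattern to the start (`^`) and end (`$`) of the string. Strictly speaking, the pattern below matches any non-vowel character at either end, which is equivalent to a consonant for these names: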

In [12]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
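If the goal is to filter the series and not to list the matches, the same pattern can be passed to `str.contains` to build a boolean mask. This is a sketch, with `mask` and `selected` as illustrative names:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# True where the name starts with a non-uppercase-vowel character
# and ends with a non-lowercase-vowel character.
mask = monte.str.contains(r'^[^AEIOU].*[^aeiou]$')
selected = monte[mask]
print(selected)
```

For data with missing values, `str.contains` also accepts an `na` argument (for example `na=False`) so that the mask stays purely boolean.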

Many more methods and operations can be performed on string data. The ["Working with text data"](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html) section of the Pandas documentation covers the available methods and their capabilities in much more detail. The handbook also goes over some miscellaneous methods [here](https://jakevdp.github.io/PythonDataScienceHandbook/03.10-working-with-strings.html#Miscellaneous-methods), followed shortly after by a great example that works with real data.