# Vectorized String Operations

### Introduction

In [4]:
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2

array([ 4,  6, 10, 14, 22, 26])

- NumPy does not do vectorization for string arrays - thus you're stuck using a loop syntax:

In [5]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

- This will break if there are any missing values.

In [6]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'

- Pandas solves this problem via the ``str`` attribute of Pandas Series and Index objects containing strings:

In [7]:
import pandas as pd
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

In [8]:
names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object

- Reminder: using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas.

### Tables of Pandas String Methods

In [9]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

### Methods similar to Python string methods
Nearly all Python's built-in string methods are mirrored by a Pandas vectorized string method. Here is a list of Pandas ``str`` methods that mirror Python string methods:

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |

- Notice that these have various return values. Some, like ``lower()``, return a series of strings:

In [10]:
monte.str.lower()

0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

- Some return numbers:

In [11]:
monte.str.len()

0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

- Or Boolean values:

In [12]:
monte.str.startswith('T')

0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

- Others return lists or other compound values for each element:

In [13]:
monte.str.split()

0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

### Methods using regular expressions

- Several methods accept regular expressions. They follow the API conventions of Python's built-in ``re`` module:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Call ``re.match()`` on each element, returning matched groups as strings.|
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()``, but accepts regexps |

- For example, we can extract the first name from each by asking for a contiguous group of characters at the beginning of each element.

In [14]:
monte.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

- Find all names that start and end with a consonant, using the start-of-string (``^``) and end-of-string (``$``) regular expression characters:

In [15]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object

### Miscellaneous methods

| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element|
| ``slice_replace()`` | Replace slice in each element with passed value|
| ``cat()``      | Concatenate strings|
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode form of string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings|
| ``wrap()`` | Split long strings into lines with length less than a given width|
| ``join()`` | Join strings in each element of the Series with passed separator|
| ``get_dummies()`` | extract dummy variables as a dataframe |

#### Vectorized item access and slicing

- ``get()`` and ``slice()`` provide vectorized element access from each string array. For example, we can get the first three characters of each array using ``str.slice(0, 3)``. This behavior is also available through Python's normal indexing syntax–for example, ``df.str.slice(0, 3)`` is equivalent to ``df.str[0:3]``:

In [16]:
monte.str[0:3]

0    Gra
1    Joh
2    Ter
3    Eri
4    Ter
5    Mic
dtype: object

- Indexing via ``df.str.get(i)`` and ``df.str[i]`` works similarly. 

- ``get()`` and ``slice()`` also let you access elements of arrays returned by ``split()``. For example, to extract the last name of each entry, we can combine ``split()`` and ``get()``:

In [17]:
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object

#### Indicator variables

- ``get_dummies()`` is useful when your data has a column containing a coded indicator. For example, we might have a dataset that contains information in the form of codes, such as A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":

In [18]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

Unnamed: 0,info,name
0,B|C|D,Graham Chapman
1,B|D,John Cleese
2,A|C,Terry Gilliam
3,B|D,Eric Idle
4,B|C,Terry Jones
5,B|C|D,Michael Palin


- ``get_dummies()`` lets you split these indicator variables into a ``DataFrame``:

In [19]:
full_monte['info'].str.get_dummies('|')

Unnamed: 0,A,B,C,D
0,0,1,1,1
1,0,1,0,1
2,1,0,1,0
3,0,1,0,1
4,0,1,1,0
5,0,1,1,1


### Example: Recipe Database

- Vectorized string operations become most useful when cleaning up messy, real-world data. Let's parse some recipe data into ingredient lists, so we can quickly find a recipe based on some ingredients we have on hand.

- The scripts used to compile this can be found at https://github.com/fictivekin/openrecipes, and the link to the current version of the database is found there as well.

- See [issue# 62](https://github.com/jakevdp/PythonDataScienceHandbook/issues/62) - updated dataset is [here](https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz).

In [20]:
#!curl -O http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz
#!gunzip recipeitems-latest.json.gz

- The database is in JSON format - try ``pd.read_json`` to read it:

In [21]:
try:
    #recipes = pd.read_json('recipeitems-latest.json')
    recipes = pd.read_json('20170107-061401-recipeitems.json')
except ValueError as e:
    print("ValueError:", e)

ValueError: Trailing data


- Oops! We get a ``ValueError`` mentioning that there is "trailing data."
Searching for the text of this error on the Internet, it seems that it's due to using a file in which *each line* is itself a valid JSON, but the full file is not. Let's check if this interpretation is true:

In [22]:
with open('20170107-061401-recipeitems.json') as f:
    line = f.readline()
pd.read_json(line).shape

(2, 12)

- Yes, apparently each line is a valid JSON, so we'll need to string them together. One way we can do this is to actually construct a string representation containing all these JSON entries, and then load the whole thing with ``pd.read_json``:

In [23]:
# read the entire file into a Python array
with open('20170107-061401-recipeitems.json', 'r') as f:
    # Extract each line
    data = (line.strip() for line in f)
    # Reformat so each line is the element of a list
    data_json = "[{0}]".format(','.join(data))
# read the result as a JSON
recipes = pd.read_json(data_json)

In [24]:
recipes.shape

(173278, 17)

- Examine one row to see what we have:

In [25]:
recipes.iloc[0]

_id                                {'$oid': '5160756b96cc62079cc2db15'}
cookTime                                                          PT30M
creator                                                             NaN
dateModified                                                        NaN
datePublished                                                2013-03-11
description           Late Saturday afternoon, after Marlboro Man ha...
image                 http://static.thepioneerwoman.com/cooking/file...
ingredients           Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
name                                    Drop Biscuits and Sausage Gravy
prepTime                                                          PT10M
recipeCategory                                                      NaN
recipeInstructions                                                  NaN
recipeYield                                                          12
source                                                  thepione

- There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web. The ingredient list is in string format; we're going to have to carefully extract the information we're interested in. Describe the ingredients:

In [26]:
recipes.ingredients.str.len().describe()

count    173278.000000
mean        244.617926
std         146.705285
min           0.000000
25%         147.000000
50%         221.000000
75%         314.000000
max        9067.000000
Name: ingredients, dtype: float64

- The ingredient lists averages ~244 characters. Which recipe has the longest ingredient list?

In [27]:
recipes.name[np.argmax(recipes.ingredients.str.len())]

'Carrot Pineapple Spice &amp; Brownie Layer Cake with Whipped Cream &amp; Cream Cheese Frosting and Marzipan Carrots'

- How many of the recipes are for breakfast?

In [28]:
recipes.description.str.contains('[Bb]reakfast').sum()

3524

- How many of the recipes list cinnamon as an ingredient?

In [29]:
recipes.ingredients.str.contains('[Cc]innamon').sum()

10526

- Any recipes misspell the ingredient as "cinamon"?

In [30]:
recipes.ingredients.str.contains('[Cc]inamon').sum()

11

### A simple recipe recommender

- Design a simple recipe recommendation system: given a list of ingredients, find a recipe that uses all those ingredients.
- While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row.
- So we will cheat by starting with a list of common ingredients and search to see whether they are in each recipe's ingredient list.
- For simplicity, start with herbs and spices.

In [31]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

- Build a Boolean ``DataFrame`` consisting of True and False values, indicating whether this ingredient appears in the list.

In [32]:
import re
spice_df = pd.DataFrame(dict((spice, recipes.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))
spice_df.head()

Unnamed: 0,cumin,oregano,paprika,parsley,pepper,rosemary,sage,salt,tarragon,thyme
0,False,False,False,False,False,False,True,False,False,False
1,False,False,False,False,False,False,False,False,False,False
2,True,False,False,False,True,False,False,True,False,False
3,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False


- Let's say we'd want a recipe that uses parsley, paprika, and tarragon. Use the ``query()`` method of ``DataFrame``s.

In [33]:
selection = spice_df.query('parsley & paprika & tarragon')
len(selection)

10

- Use the index returned by this selection to discover the names of the 10 recipes that have this combination.

In [34]:
recipes.name[selection.index]

2069      All cremat with a Little Gem, dandelion and wa...
74964                         Lobster with Thermidor butter
93768      Burton's Southern Fried Chicken with White Gravy
113926                     Mijo's Slow Cooker Shredded Beef
137686                     Asparagus Soup with Poached Eggs
140530                                 Fried Oyster Po’boys
158475                Lamb shank tagine with herb tabbouleh
158486                 Southern fried chicken in buttermilk
163175            Fried Chicken Sliders with Pickles + Slaw
165243                        Bar Tartine Cauliflower Salad
Name: name, dtype: object