Code based on: https://nbviewer.jupyter.org/github/Springboard-CourseDev/PythonDataScienceHandbook/blob/master/notebooks/03.10-Working-With-Strings.ipynb

Vectorized operations simplify the syntax of working with arrays of data: we no longer have to worry about the size or shape of the array, only about which operation we want performed. For arrays of strings, NumPy does not provide such simple access, so you are stuck with a more verbose loop syntax:

In [1]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

This breaks if there are any missing values:

In [2]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'
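
One plain-Python workaround, sketched here as an illustration rather than taken from the original notebook, is to guard each element before calling the method. The helper name `safe_capitalize` is hypothetical.

```python
# Hypothetical helper: capitalize strings and pass missing values through unchanged
def safe_capitalize(values):
    return [s.capitalize() if isinstance(s, str) else s for s in values]

data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
result = safe_capitalize(data)
print(result)
```

This works, but the guard has to be repeated for every string operation, which is the bookkeeping the pandas `str` accessor takes over.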

In [3]:
import pandas as pd
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

In [4]:
# The .str accessor skips missing values instead of raising an error
names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object

In [5]:
# Create a new Series of names

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])


A sample of several string methods:

In [6]:
monte.str.lower()


0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

In [7]:
monte.str.len()


0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [8]:
monte.str.startswith('T')


0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool
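
Because methods such as `startswith` return a boolean Series, the result can be used directly as a row filter. A minimal sketch, assuming the same `monte` Series as above (the variable name `starts_with_t` is introduced here for illustration):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Use the boolean result of a string method as a mask to select rows
starts_with_t = monte[monte.str.startswith('T')]
print(starts_with_t)
```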

In [9]:
monte.str.split()


0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

Using regular expressions:

In [10]:
monte.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object
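
With more than one capture group, `str.extract` returns a DataFrame with one column per group, and named groups become the column names. The sketch below assumes every name has the form "First Last", which holds for this sample but not in general; the group names `first` and `last` are chosen for illustration.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Each named group (?P<name>...) becomes a column in the resulting DataFrame
names_df = monte.str.extract(r'(?P<first>[A-Za-z]+) (?P<last>[A-Za-z]+)')
print(names_df)
```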

In [11]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
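
The pattern above keeps names whose first character is not an uppercase vowel and whose last character is not a lowercase vowel. When only a yes/no answer is needed, the same pattern can be passed to `str.contains`, as in this sketch (the variable name `consonant_ends` is introduced here):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# ^[^AEIOU]  first character is anything except an uppercase vowel
# .*         any characters in between
# [^aeiou]$  last character is anything except a lowercase vowel
consonant_ends = monte.str.contains(r'^[^AEIOU].*[^aeiou]$')
print(consonant_ends)
```

The `True` entries line up with the non-empty lists in the `findall` output.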

In [12]:
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object
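
The `str` accessor also supports element-wise indexing and slicing, so `str[0:3]` is shorthand for `str.slice(0, 3)`. A small sketch on the same data (variable names are illustrative):

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# Slice the first three characters of every entry, written two equivalent ways
prefix_a = monte.str[0:3]
prefix_b = monte.str.slice(0, 3)
print(prefix_a)
```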

In [13]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

             name   info
0  Graham Chapman  B|C|D
1     John Cleese    B|D
2   Terry Gilliam    A|C
3       Eric Idle    B|D
4     Terry Jones    B|C
5   Michael Palin  B|C|D


In [14]:
# get_dummies() splits each entry on '|' and returns one indicator column per code

full_monte['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
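
Since the indicator columns are plain 0/1 values, summing them counts how many rows carry each code. A minimal sketch, assuming the same `info` strings as above (the names `info`, `dummies` and `code_counts` are introduced for illustration):

```python
import pandas as pd

info = pd.Series(['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D'])

# One indicator column per code, then column sums give per-code counts
dummies = info.str.get_dummies('|')
code_counts = {code: int(total) for code, total in dummies.sum().items()}
print(code_counts)
```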


Example: Recipe Database

In [15]:
#!curl -O https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz
# Extraction done in browser; the equivalent command would be:
#!gunzip 20170107-061401-recipeitems.json.gz

In [16]:
try:
    recipes = pd.read_json('20170107-061401-recipeitems.json')
except ValueError as e:
    print("ValueError:", e)

ValueError: Trailing data
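
A "Trailing data" error usually means the file is not one JSON document but a sequence of them. If each line is a complete JSON object, `pd.read_json` can parse that layout with `lines=True`. The sketch below uses two made-up records in an in-memory buffer rather than the real file, so the record contents are assumptions:

```python
import io
import pandas as pd

# Two hypothetical records, one JSON object per line
sample = io.StringIO(
    '{"name": "Recipe A", "source": "site-one"}\n'
    '{"name": "Recipe B", "source": "site-two"}\n'
)

# lines=True tells pandas to treat every line as a separate JSON object
sample_df = pd.read_json(sample, lines=True)
print(sample_df.shape)
```

Whether this reads the full recipe file depends on every line being valid JSON and decodable text, which the notebook does not confirm.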


In [17]:
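import json

# The file holds one JSON object per line, so parse it line by line
frames = []
with open('20170107-061401-recipeitems.json') as f:
    line = f.readline()
    while line:
        frames.append(pd.DataFrame([json.loads(line)]))
        try:
            line = f.readline()
        except Exception:
            # Stop at the first line that cannot be read
            print('missed line')
            break

# Concatenate once; every row keeps index label 0 because ignore_index is not set
recipe_data = pd.concat(frames)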
        



missed line


In [18]:
recipe_data.shape

(115, 16)

In [19]:
recipe_data.iloc[0]

_id                            {'$oid': '5160756b96cc62079cc2db15'}
name                                Drop Biscuits and Sausage Gravy
ingredients       Biscuits\n3 cups All-purpose Flour\n2 Tablespo...
url               http://thepioneerwoman.com/cooking/2013/03/dro...
image             http://static.thepioneerwoman.com/cooking/file...
ts                                         {'$date': 1365276011104}
cookTime                                                      PT30M
source                                              thepioneerwoman
recipeYield                                                      12
datePublished                                            2013-03-11
prepTime                                                      PT10M
description       Late Saturday afternoon, after Marlboro Man ha...
totalTime                                                       NaN
creator                                                         NaN
recipeCategory                                  

In [20]:
recipe_data.ingredients.str.len().describe()


count    115.000000
mean     284.313043
std      142.117222
min       29.000000
25%      182.000000
50%      258.000000
75%      366.500000
max      904.000000
Name: ingredients, dtype: float64

In [21]:
import numpy as np
recipe_data.iloc[np.argmax(recipe_data.ingredients.str.len())]['name']

'Spaghetti Sauce'

In [22]:
recipe_data.description.str.contains('[Bb]reakfast').sum()

3

In [23]:
recipe_data.ingredients.str.contains('[Cc]innamon').sum()

7
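
Instead of spelling out both cases in a character class, `str.contains` accepts `case=False` or, equivalently, `flags=re.IGNORECASE`; `na=False` makes missing entries count as non-matches. A small sketch on made-up ingredient strings (the data and variable names are assumptions):

```python
import re
import pandas as pd

# Hypothetical ingredient strings, including one missing entry
ingredients = pd.Series(['1 tsp Cinnamon', 'ground cinnamon',
                         'CINNAMON stick', 'salt', None])

# Two equivalent ways to request a case-insensitive match
by_case = ingredients.str.contains('cinnamon', case=False, na=False)
by_flags = ingredients.str.contains('cinnamon', flags=re.IGNORECASE, na=False)
print(by_case.sum())
```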

In [24]:
spice_list = ['salt', 'pepper', 'oregano', 'sage', 'parsley',
              'rosemary', 'tarragon', 'thyme', 'paprika', 'cumin']

In [25]:
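import re
# Note: the second positional argument of str.contains() is `case`, not `flags`,
# so re.IGNORECASE is read as case=True here and the match stays case-sensitive.
# Pass flags=re.IGNORECASE (or case=False) for a case-insensitive search.
spice_df = pd.DataFrame(dict((spice, recipe_data.ingredients.str.contains(spice, re.IGNORECASE))
                             for spice in spice_list))
spice_df = spice_df.reset_index(drop=True)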
spice_df

      salt  pepper  oregano   sage  parsley  rosemary  tarragon  thyme  paprika  cumin
0    False   False    False   True    False     False     False  False    False  False
1    False   False    False  False    False     False     False  False    False  False
2     True    True    False  False    False     False     False  False    False   True
3    False   False    False  False    False     False     False  False    False  False
4    False   False    False  False    False     False     False  False    False  False
...    ...     ...      ...    ...      ...       ...       ...    ...      ...    ...
110   True   False    False  False    False     False     False  False    False  False
111   True   False    False  False    False     False     False  False    False  False
112   True   False    False  False    False     False     False  False    False  False
113   True   False    False  False    False     False     False  False    False  False


In [26]:
selection = spice_df.query('cumin & salt')
selection

     salt  pepper  oregano   sage  parsley  rosemary  tarragon  thyme  paprika  cumin
2    True    True    False  False    False     False     False  False    False   True
25   True    True    False  False    False     False     False  False    False   True
29   True    True    False  False    False     False     False  False    False   True
50   True    True    False  False    False     False     False  False    False   True
82   True   False    False  False    False     False     False  False     True   True
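
The `query` string is a compact way of writing a boolean mask. The sketch below shows the two forms side by side on a small made-up indicator table (the frame `flags_df` and its values are assumptions):

```python
import pandas as pd

# Hypothetical indicator table with the same kind of boolean columns
flags_df = pd.DataFrame({'salt':  [True, True, False, True],
                         'cumin': [False, True, True, True]})

# Rows where both columns are True, written two equivalent ways
by_query = flags_df.query('cumin & salt')
by_mask = flags_df[flags_df['cumin'] & flags_df['salt']]
print(by_query)
```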


In [27]:
recipe_data.name.iloc[selection.index]

0                   Morrocan Carrot and Chickpea Salad
0                 Kale Artichoke Dip with Greek Yogurt
0    Homemade Frozen Bean and Veggie BurritosHomema...
0                                 Chicken Tikka Masala
0                           Charmoula-Rubbed Mahi-Mahi
Name: name, dtype: object
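
Every row in this output carries the index label 0 because the one-row frames were concatenated without renumbering, which is also why positional access with `iloc` is needed here. A minimal sketch of the effect on made-up rows (the recipe names are placeholders):

```python
import pandas as pd

# Three hypothetical one-row frames, each with its own index label 0
rows = [pd.DataFrame([{'name': 'Recipe A'}]),
        pd.DataFrame([{'name': 'Recipe B'}]),
        pd.DataFrame([{'name': 'Recipe C'}])]

stacked = pd.concat(rows)                         # index labels: 0, 0, 0
renumbered = pd.concat(rows, ignore_index=True)   # index labels: 0, 1, 2

# Positional selection works regardless of the duplicated labels
picked = stacked['name'].iloc[[0, 2]]
print(picked)
```

With `ignore_index=True` the labels match the positions, so label-based and positional selection would agree.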