Code based on: https://nbviewer.jupyter.org/github/Springboard-CourseDev/PythonDataScienceHandbook/blob/master/notebooks/03.10-Working-With-Strings.ipynb

Vectorized operations simplify the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, only about the operation we want performed. For arrays of strings, NumPy does not provide such simple access, so you are stuck with a more verbose loop syntax:

In [1]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

This will break if there are any missing values:

In [2]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'
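The plain-Python loop can be made to tolerate missing values by guarding each element explicitly. A minimal sketch, assuming missing entries are represented as None (the name `capitalized` is introduced here for illustration):

```python
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Capitalize only real strings; pass None through unchanged
capitalized = [s.capitalize() if s is not None else None for s in data]
print(capitalized)
```

This works, but the guard has to be repeated for every string operation.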

In [3]:
import pandas as pd
names = pd.Series(data)
names

0    peter
1     Paul
2     None
3     MARY
4    gUIDO
dtype: object

In [4]:
# Use str to skip missing values
names.str.capitalize()

0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object

In [5]:
# Create a new Series of names

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])


A sample of several string methods:

In [6]:
monte.str.lower()


0    graham chapman
1       john cleese
2     terry gilliam
3         eric idle
4       terry jones
5     michael palin
dtype: object

In [7]:
monte.str.len()


0    14
1    11
2    13
3     9
4    11
5    13
dtype: int64

In [8]:
monte.str.startswith('T')


0    False
1    False
2     True
3    False
4     True
5    False
dtype: bool

In [9]:
monte.str.split()


0    [Graham, Chapman]
1       [John, Cleese]
2     [Terry, Gilliam]
3         [Eric, Idle]
4       [Terry, Jones]
5     [Michael, Palin]
dtype: object

Using regular expressions:

In [10]:
monte.str.extract('([A-Za-z]+)', expand=False)

0     Graham
1       John
2      Terry
3       Eric
4      Terry
5    Michael
dtype: object

In [11]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

0    [Graham Chapman]
1                  []
2     [Terry Gilliam]
3                  []
4       [Terry Jones]
5     [Michael Palin]
dtype: object
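The pattern in In [11] matches names that start with anything other than an uppercase vowel and end with anything other than a lowercase vowel. A sketch of the same test with Python's built-in re module, for comparison (the names `pattern` and `matches` are illustrative):

```python
import re

names = ['Graham Chapman', 'John Cleese', 'Terry Gilliam',
         'Eric Idle', 'Terry Jones', 'Michael Palin']

# ^[^AEIOU]  first character is not an uppercase vowel
# .*         any characters in between
# [^aeiou]$  last character is not a lowercase vowel
pattern = re.compile(r'^[^AEIOU].*[^aeiou]$')

# True where the whole name satisfies the pattern
matches = [bool(pattern.match(name)) for name in names]
print(matches)
```

The False entries correspond to the empty lists in the findall output above.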

In [12]:
monte.str.split().str.get(-1)

0    Chapman
1     Cleese
2    Gilliam
3       Idle
4      Jones
5      Palin
dtype: object

In [13]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

             name   info
0  Graham Chapman  B|C|D
1     John Cleese    B|D
2   Terry Gilliam    A|C
3       Eric Idle    B|D
4     Terry Jones    B|C
5   Michael Palin  B|C|D


In [14]:
# Split each entry on '|' and expand the codes into indicator (dummy) columns

full_monte['info'].str.get_dummies('|')

   A  B  C  D
0  0  1  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  0  1
4  0  1  1  0
5  0  1  1  1
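To see what the dummy-variable expansion is doing, the same 0/1 table can be rebuilt by hand. A rough sketch in plain Python, not how pandas implements it; the names `codes` and `indicators` are made up for this example:

```python
info = ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D']

# Collect every distinct code, sorted to give a stable column order
codes = sorted({code for row in info for code in row.split('|')})

# One row of 0/1 flags per entry, one flag per code
indicators = [[int(code in row.split('|')) for code in codes]
              for row in info]

print(codes)
for flags in indicators:
    print(flags)
```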


Example: Recipe Database

In [15]:
#!curl -O https://s3.amazonaws.com/openrecipes/20170107-061401-recipeitems.json.gz
#!gunzip 20170107-061401-recipeitems.json.gz


In [17]:
try:
    recipes = pd.read_json('20170107-061401-recipeitems.json')
except ValueError as e:
    print("ValueError:", e)

ValueError: Trailing data
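The "Trailing data" error suggests the file is not one JSON document but a sequence of JSON objects, one per line. A minimal sketch of that situation using only the standard library; the two-record `text` sample is made up for illustration and is not taken from the recipe file:

```python
import json

# Hypothetical two-line sample: each line is valid JSON on its own
text = '{"name": "A"}\n{"name": "B"}\n'

whole_file_error = None
try:
    json.loads(text)  # the text as a whole is not a single JSON document
except json.JSONDecodeError as e:
    whole_file_error = str(e)
    print("JSONDecodeError:", whole_file_error)

# Parsing line by line succeeds
records = [json.loads(line) for line in text.splitlines() if line.strip()]
print(records)
```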


In [19]:
with open('20170107-061401-recipeitems.json') as f:
    line = f.readline()
pd.read_json(line).shape

ValueError: Protocol not known: { "_id" : { "$oid" : "5160756b96cc62079cc2db15" }, "name" : "Drop Biscuits and Sausage Gravy", "ingredients" : "Biscuits\n3 cups All-purpose Flour\n2 Tablespoons Baking Powder\n1/2 teaspoon Salt\n1-1/2 stick (3/4 Cup) Cold Butter, Cut Into Pieces\n1-1/4 cup Butermilk\n SAUSAGE GRAVY\n1 pound Breakfast Sausage, Hot Or Mild\n1/3 cup All-purpose Flour\n4 cups Whole Milk\n1/2 teaspoon Seasoned Salt\n2 teaspoons Black Pepper, More To Taste", "url" : "http
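The "Protocol not known" message suggests that pandas treated the string as a location to open rather than as JSON text, probably because it contains a URL. One way around this is to wrap the text in io.StringIO. A hedged sketch: the `line` below is a shortened stand-in for the first record, its URL is a placeholder rather than the real one, and `lines=True` is an addition that tells pandas to read one record per line:

```python
import io

import pandas as pd

# Hypothetical, shortened record in the same shape as one line of the file
line = ('{ "_id" : { "$oid" : "5160756b96cc62079cc2db15" }, '
        '"name" : "Drop Biscuits and Sausage Gravy", '
        '"url" : "http://example.com/recipe" }\n')

# StringIO marks the text as content to parse, not a path or URL to open;
# lines=True treats each line as one record
first = pd.read_json(io.StringIO(line), lines=True)
print(first.shape)
```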

In [21]:
# Read the entire file and combine its lines into one JSON array
with open('20170107-061401-recipeitems.json', 'r') as f:
    # Strip the trailing newline from each line
    data = (line.strip() for line in f)
    # Join the lines with commas and wrap them in brackets to form a JSON array
    data_json = "[{0}]".format(','.join(data))
# Parse the combined string as JSON
recipes = pd.read_json(data_json)

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3611: character maps to <undefined>
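The 'charmap' codec in this message points to a platform-default encoding (typically a Windows code page) being used to read a file that is most likely UTF-8; byte 0x9d is, for instance, the last byte of a UTF-8 curly quote. Passing the encoding explicitly to open() is one way to avoid this. A sketch under those assumptions, using a small temporary file with made-up records in place of the recipe file:

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical two-record sample; the curly quotes are non-ASCII characters
sample = ('{"name": "Drop Biscuits", "description": "\u201cfluffy\u201d"}\n'
          '{"name": "Hot Roast Beef", "description": "plain"}\n')

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample-recipes.json')
    with open(path, 'w', encoding='utf-8') as f:
        f.write(sample)

    # Name the encoding explicitly instead of relying on the platform default
    with open(path, 'r', encoding='utf-8') as f:
        data = (line.strip() for line in f)
        data_json = "[{0}]".format(','.join(data))

# Wrap the text in StringIO so pandas parses it rather than opening it
recipes_sample = pd.read_json(io.StringIO(data_json))
print(recipes_sample.shape)
```

For line-delimited files, pd.read_json also accepts a file path together with lines=True, which may make the manual join unnecessary.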