Code based on: https://nbviewer.jupyter.org/github/Springboard-CourseDev/PythonDataScienceHandbook/blob/master/notebooks/03.10-Working-With-Strings.ipynb

This vectorization of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done. For arrays of strings, NumPy does not provide such simple access, and thus you're stuck using a more verbose loop syntax:

In [1]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

['Peter', 'Paul', 'Mary', 'Guido']

This breaks if there are missing values:

In [2]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'
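One way to keep the plain loop working is to guard each element explicitly. The following is a minimal sketch, not part of the original notebook, and it assumes that missing entries should simply pass through unchanged as None:

```python
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# Capitalize only real strings; leave missing entries (None) untouched.
# The isinstance guard is an illustrative choice, not the only option.
capitalized = [s.capitalize() if isinstance(s, str) else s for s in data]
print(capitalized)
```

The guard works, but it has to be repeated for every string operation, which is the verbosity the pandas string methods below avoid.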

In [None]:
import pandas as pd
names = pd.Series(data)
names

In [None]:
# The str accessor passes missing values through instead of raising an error
names.str.capitalize()

In [None]:
# Create a new Series of names

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])


A sample of several string methods:

In [None]:
monte.str.lower()


In [None]:
monte.str.len()


In [None]:
monte.str.startswith('T')


In [None]:
monte.str.split()


Using regular expressions:

In [None]:
monte.str.extract(r'([A-Za-z]+)', expand=False)

In [None]:
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')

In [None]:
monte.str.split().str.get(-1)
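To make the three cells above easier to read, here is a small self-contained sketch that rebuilds the same Series and collects each result as a plain list. The variable names first_names, matches, and last_names are introduced here only for illustration:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# extract: the first run of letters in each entry, i.e. the first name
first_names = monte.str.extract(r'([A-Za-z]+)', expand=False).tolist()

# findall: entries that start with a non-vowel capital and end with a
# non-vowel; entries that do not match yield an empty list
matches = monte.str.findall(r'^[^AEIOU].*[^aeiou]$').tolist()

# split + get(-1): the last whitespace-separated token, i.e. the last name
last_names = monte.str.split().str.get(-1).tolist()

print(first_names)
print(matches)
print(last_names)
```

Note that findall returns a list per entry, so 'John Cleese' (ends in a vowel) and 'Eric Idle' (starts with a vowel) produce empty lists rather than missing values.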

In [None]:
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})
full_monte

In [None]:
# get_dummies splits each entry on '|' and builds one indicator column per code

full_monte['info'].str.get_dummies('|')
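The indicator columns become useful once they are used to filter rows. The sketch below rebuilds the same DataFrame and selects every name whose info field contains the code 'C'. The choice of 'C' and the names dummies and with_c are arbitrary illustrations:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C',
                                    'B|D', 'B|C', 'B|C|D']})

# One 0/1 column per distinct code found in the '|'-separated field
dummies = full_monte['info'].str.get_dummies(sep='|')

# Keep the rows where the 'C' indicator is set
with_c = full_monte.loc[dummies['C'] == 1, 'name'].tolist()

print(list(dummies.columns))
print(with_c)
```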

Example: Recipe Database

In [None]:
!curl -O https://s3.amazonaws.com/openrecipes/recipeitems-latest.json.gz
# Decompress the download so that recipeitems-latest.json exists for the cells below
!gunzip recipeitems-latest.json.gz

In [None]:
try:
    recipes = pd.read_json('recipeitems-latest.json')
except ValueError as e:
    print("ValueError:", e)

In [None]:
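from io import StringIO

# Read only the first line to check that it is valid JSON on its own
with open('recipeitems-latest.json') as f:
    line = f.readline()
pd.read_json(StringIO(line)).shape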

In [None]:
# Read the entire file into a single JSON string
from io import StringIO
with open('recipeitems-latest.json', 'r') as f:
    # Strip surrounding whitespace from each line
    data = (line.strip() for line in f)
    # Join the lines into one JSON array: "[obj1,obj2,...]"
    data_json = "[{0}]".format(','.join(data))
# Parse the combined string as JSON
recipes = pd.read_json(StringIO(data_json))
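As an alternative to joining the lines by hand, pd.read_json accepts lines=True, which parses a file containing one JSON object per line. The sketch below compares the two approaches on a tiny stand-in file. The records, the file name sample.json, and the variable names are made up for illustration and are not taken from the recipe data:

```python
import json
import os
import tempfile
from io import StringIO

import pandas as pd

# Hypothetical sample records, written one JSON object per line
records = [
    {"name": "Sample Pancakes", "ingredients": "flour\neggs\nmilk"},
    {"name": "Sample Salad", "ingredients": "lettuce\ntomato"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

    # Manual approach: join the lines into a single JSON array
    with open(path) as f:
        joined = "[{0}]".format(",".join(line.strip() for line in f))
    manual = pd.read_json(StringIO(joined))

    # Built-in approach: let pandas parse one object per line
    by_lines = pd.read_json(path, lines=True)

print(manual.shape, by_lines.shape)
```

Both calls should produce the same table here; whether lines=True is faster or more memory-friendly on the full recipe file is not something this sketch measures.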