Preface: This notebook benchmarks the pandas str methods against other ways of getting the same results.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
from textblob import TextBlob
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

# Example Set 1 - concatenating and accessing

In [None]:
# Create pools
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
# months = [str(i).zfill(2) for i in range(1, 13)]
days = [str(i).zfill(2) for i in range(1, 29)]
years = [str(i) for i in range(1992, 2016)]

In [None]:
# Randomly generate from pools
rand_months = np.random.choice(months, size=10000)
rand_days = np.random.choice(days, size=10000)
rand_years = np.random.choice(years, size=10000)

In [None]:
# Place into df
%time rand_dates = pd.DataFrame({'yr':rand_years, 'mo':rand_months, 'da': rand_days})

In [None]:
rand_dates.head()

Pandas str.cat method vs simple concat with + - str.cat is 3x slower

In [None]:
%timeit rand_dates['mo_da']  = rand_dates['mo'].str.cat(rand_dates['da'])

In [None]:
%timeit rand_dates['mo_da2']  = rand_dates['mo'] + rand_dates['da']

str.get() or str[idx] vs str.slice(i, j) - 2x slower

In [None]:
%timeit rand_dates['mo'].str.slice(0,1) + rand_dates['da'].str.slice(0,1)

In [None]:
%timeit rand_dates['mo'].str[0] + rand_dates['da'].str[0]

In [None]:
%timeit rand_dates['mo'].str.get(0) + rand_dates['da'].str.get(0)

Adding an empty string - stays the same

In [None]:
rand_dates.loc[23, 'mo'] = ''

In [None]:
%timeit rand_dates['mo'].str.slice(0,1) + rand_dates['da'].str.slice(0,1)
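
The accessors timed above agree on non-empty strings, but they are not interchangeable on an empty string like the one just placed in row 23. A minimal sketch on a made-up three-element series (the name `s` and its values are illustrative, not the notebook's data):

```python
import pandas as pd

# Illustrative data: one empty string among ordinary values
s = pd.Series(['Jan', '', 'Mar'], dtype=object)

sliced = s.str.slice(0, 1)   # slicing an empty string yields ''
indexed = s.str[0]           # an out-of-range index yields NaN
got = s.str.get(0)           # behaves like str[0]

print(sliced.tolist())
print(indexed.isna().tolist())
print(got.isna().tolist())
```

So a timing comparison between these calls is only like-for-like when the column has no empty strings, or when the NaN versus '' difference does not matter downstream.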

Adding NaN, a different data type - takes 1.2x longer

In [None]:
rand_dates.loc[23, 'mo'] = np.nan  # np.NaN was removed in NumPy 2.0

In [None]:
%timeit rand_dates['mo'].str.slice(0,1) + rand_dates['da'].str.slice(0,1)

In [None]:
%timeit rand_dates['mo'] + rand_dates['da']

The str methods may offer extra functionality, but in the simple case where you know the dtypes of your columns, the simpler calls give an extra boost in performance.
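
As one example of that extra functionality, `str.cat` accepts `sep` and `na_rep` arguments that plain `+` has no equivalent for. A small sketch on made-up data (the series `mo` and `da` here are illustrative stand-ins, not the `rand_dates` columns):

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for the month and day columns, with one missing month
mo = pd.Series(['Jan', np.nan, 'Mar'], dtype=object)
da = pd.Series(['01', '02', '03'], dtype=object)

plus = mo + da                                  # missing value propagates
cat = mo.str.cat(da)                            # also propagates by default
cat_filled = mo.str.cat(da, na_rep='')          # treat missing as empty string
cat_sep = mo.str.cat(da, sep='-', na_rep='?')   # separator plus placeholder

print(plus.tolist())
print(cat_filled.tolist())
print(cat_sep.tolist())
```

If none of these options are needed, `+` gives the same result as the default `str.cat` call on this data.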

# Example 2:  Operating on the same string

In [None]:
train = pd.read_csv('../input/train.csv', nrows=10000)
train2 = train.copy()

In [None]:
%timeit d = train['question_text'].str.slice(0,1) # same trend as earlier

In [None]:
%timeit a = train['question_text'].str[0]

In [None]:
%timeit b = train['question_text'].str.count('e')

In [None]:
%timeit c = train['question_text'].str.capitalize()

### Using a pandas str method once to create each new column

In [None]:
%%timeit
train['first'] = train['question_text'].str[0]
train['count_e'] = train['question_text'].str.count('e')
train['cap'] = train['question_text'].str.capitalize()
# Just the individual values added together

### Returning a tuple with apply and zip(*) - ~15% faster 

In [None]:
def extract_text_features(x):
    return x[0], x.count('e'), x.capitalize()

In [None]:
%timeit train['first'], train['count_e'], train['cap'] = zip(*train['question_text'].apply(extract_text_features))
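
For readers unfamiliar with the `zip(*...)` idiom used above: `apply` returns a Series of 3-tuples (one per row), and `zip(*rows)` transposes it into three tuples (one per output column). A tiny sketch with two made-up questions in place of the real data:

```python
import pandas as pd

def extract_text_features(x):
    return x[0], x.count('e'), x.capitalize()

# Illustrative stand-in for train['question_text']
questions = pd.Series(['where is everyone?', 'how deep is the sea?'], dtype=object)

rows = questions.apply(extract_text_features)   # Series of (first, count_e, cap) tuples
first, count_e, cap = zip(*rows)                # transpose rows into columns

print(first)
print(count_e)
print(cap)
```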

### Basic python loop and then assign to new columns - Another 25% faster 

In [None]:
%%timeit
a,b,c = [], [], []
for s in train['question_text']:
    a.append(s[0]), b.append(s.count('e')), c.append(s.capitalize())
train['first'] = a
train['count_e'] = b
train['cap'] = c
# assigning to new column takes about the same time in either method

### Back to str methods:  
    str.len() vs apply lambda - str.len() is ~25% faster - has some optimization done

In [None]:
%timeit x = train['question_text'].str.len()

In [None]:
%timeit b = train['question_text'].apply(lambda x:len(x))

In [None]:
# bonus - memory of the underlying array (for an object-dtype column this counts only the pointers, not the strings themselves)
train['question_text'].values.nbytes
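
A caveat on the `nbytes` figure: for an object-dtype column it is just the pointer size times the number of rows. To include the string contents, `Series.memory_usage(deep=True)` is one option. A sketch on a made-up two-row series:

```python
import pandas as pd

# Illustrative object-dtype series with strings of very different lengths
s = pd.Series(['short', 'a considerably longer question text?'], dtype=object)

pointer_bytes = s.values.nbytes                        # pointers only
deep_bytes = s.memory_usage(index=False, deep=True)    # pointers plus string objects

print(pointer_bytes, deep_bytes)
```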

### More string methods

### Individually vs Tuples vs Series -  
 Function calls are expensive 

In [None]:
%%timeit 
train2['num_chars'] = train2['question_text'].str.len()
train2['is_titlecase'] = train2['question_text'].str.istitle().astype('int')
train2['has_*'] = train2['question_text'].str.contains(r'[A-Za-z]\*.|.\*[A-Za-z]', regex=True).astype('int')


In [None]:
def srs_funcs(srs):
    a = len(srs)
    b = int(srs.istitle())
    c = int(bool(re.search(r'[A-Za-z]\*.|.\*[A-Za-z]', srs)))
    return a, b, c
# would have expected this to be faster than creating three new columns individually but maybe the type conversion calls slowed it down

In [None]:
%timeit  train2['num_chars'] , train2['is_titlecase'], train2['has_*'] = zip(*train2['question_text'].apply(srs_funcs))

In [None]:
def srs_funcs2(srs):
    a = len(srs)
    b = int(srs.istitle())
    c = int(bool(re.search(r'[A-Za-z]\*.|.\*[A-Za-z]', srs)))
    return pd.Series([a, b, c])

In [None]:
%timeit  train2[['num_chars','is_titlecase','has_*']] = train2['question_text'].apply(srs_funcs2)
# calling pd.Series on every row kills performance
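
One way to keep the multi-column assignment without constructing a `pd.Series` per row is to return plain tuples and build a single DataFrame from them afterwards. This is a sketch of that alternative on made-up data, not something timed in this notebook:

```python
import re
import pandas as pd

def srs_funcs(text):
    a = len(text)
    b = int(text.istitle())
    c = int(bool(re.search(r'[A-Za-z]\*.|.\*[A-Za-z]', text)))
    return a, b, c

# Illustrative stand-in for train2
df = pd.DataFrame({'question_text': ['Is This Title Case', 'mask a*b here', 'plain']})

# One DataFrame construction for all rows instead of one pd.Series per row
features = pd.DataFrame(
    df['question_text'].apply(srs_funcs).tolist(),
    index=df.index,
    columns=['num_chars', 'is_titlecase', 'has_*'],
)
df = df.join(features)
print(df)
```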

### Example 3: Same string, more complicated function

In [None]:
def textblob_methods(blob):
    '''Access TextBlob attributes and return them as a tuple
    '''
    # sentiment polarity, sentiment subjectivity, and a 0/1 flag for a trailing '?'
    return blob.polarity, blob.subjectivity, int(blob.ends_with('?'))

In [None]:
train3 = pd.read_csv('../input/train.csv', nrows=10000)
train3.head()

In [None]:
# Convert  - any ways to make this faster? 
%timeit train3['blobs'] = train3['question_text'].map(lambda x: TextBlob(x))

# Suggestions: 
1) parallelize with joblib, multiprocessing
2) SpaCy parallel processing
3) Pandas extension arrays

###  Locating both index and column at the same time is faster

In [None]:
%timeit zsamp = train3.loc[5006,'blobs']

In [None]:
%timeit zsamp = train3.loc[5006]['blobs']
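
The two lookups above return the same value; the difference is that the chained form first materializes the whole row as a Series and then indexes into it. A small sketch on a made-up frame (`.at` is shown as a further scalar-only accessor, which this notebook does not time):

```python
import pandas as pd

# Illustrative frame with a non-default index
df = pd.DataFrame({'qid': ['a', 'b', 'c'], 'score': [0.1, 0.2, 0.3]},
                  index=[10, 20, 30])

single = df.loc[20, 'score']     # row and column resolved in one lookup
row = df.loc[20]                 # intermediate row Series built by the chained form
chained = row['score']           # second lookup on that Series
scalar = df.at[20, 'score']      # scalar-only accessor

print(single, chained, scalar)
```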

In [None]:
zsamp = train3.loc[5006]['blobs']

In [None]:
%timeit textblob_methods(zsamp)

In [None]:
%timeit  train3['polarity'], train3['subjectivity'], train3['ends_with_?'] = zip(*train3['blobs'].map(textblob_methods))

In [None]:
%%timeit
a, b, c = [], [], []
for s in train3['blobs']:
    a.append(s.polarity), b.append(s.subjectivity), c.append(int(s[-1] in '?'))
train3['polarity'], train3['subjectivity'], train3['ends_with_?'] = a, b, c

In [None]:
%%timeit
# Doing it separately - takes longer
train3['polarity'] = train3['blobs'].apply(lambda x: x.polarity)
train3['subjectivity'] = train3['blobs'].apply(lambda x: x.subjectivity)
train3['ends_with_?'] = train3['blobs'].apply(lambda x: int(x.endswith('?')))  # int() to match the versions above

In [None]:
def textblob_methods2(blob):
    '''Access TextBlob sentiment attributes and return them as a tuple
    '''
    # same as textblob_methods, without the trailing '?' flag
    return blob.polarity, blob.subjectivity

In [None]:
%timeit  train3['polarity'], train3['subjectivity'] = zip(*train3['blobs'].map(textblob_methods2))

In [None]:
%%timeit
a, b = [], []
for s in train3['blobs']:
    a.append(s.polarity), b.append(s.subjectivity)
train3['polarity'], train3['subjectivity'] = a, b

# Takeaways
1. Str methods are better than apply but still nowhere close to the performance increase from vectorization like with numeric columns.   
2. Apply is still a loop. Try to do multiple things within the same loop if you can.
3. Unzipping tuples can be a way to output multiple columns, but using lists and python loops can be surprisingly fast for str series.
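
To check these takeaways outside a notebook, where `%timeit` is not available, the three approaches can be compared with the standard `timeit` module. The sketch below uses a synthetic column (an assumption, since the competition data may not be at hand) and hypothetical function names; absolute and relative timings will vary with the pandas version, dtype and data.

```python
import timeit
import pandas as pd

# Synthetic stand-in for train['question_text']
texts = pd.Series(['How do I speed up pandas string methods?'] * 2000, dtype=object)

def with_str_methods():
    return (texts.str[0].tolist(),
            texts.str.count('e').tolist(),
            texts.str.capitalize().tolist())

def with_apply_zip():
    rows = texts.apply(lambda x: (x[0], x.count('e'), x.capitalize()))
    return tuple(list(col) for col in zip(*rows))

def with_loop():
    a, b, c = [], [], []
    for s in texts:
        a.append(s[0])
        b.append(s.count('e'))
        c.append(s.capitalize())
    return a, b, c

# All three approaches must produce identical columns before timing them
assert with_str_methods() == with_apply_zip() == with_loop()

timings = {}
for fn in (with_str_methods, with_apply_zip, with_loop):
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    timings[fn.__name__] = best
    print(f'{fn.__name__}: {best * 1000:.2f} ms per run')
```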

# Resources:
1. [**Pandas docs on working with text**](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)
2. [General order of precedence for performance of various operations from pandas maintainer](https://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316#24871316)
3. [Numeric Vectorization](https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c/52674448#52674448)
4. [Article overview, focused more on numeric columns](https://realpython.com/fast-flexible-pandas/)
5. [**Python loop outperforming apply**](https://stackoverflow.com/questions/16236684/apply-pandas-function-to-column-to-create-multiple-new-columns/47097625#47097625)

This notebook was based primarily on the two bolded links.
    