* [.str.upper()](#strupper)
* [.str.isdigit()](#strdigit)
* [.str.split()](#strsplit)
* [.str.split().str[0] get values at 0 index of Series](#index0)
* [.str.split(expand=True)   from Series to DataFrame](#splitexpand)
* [.str.replace()](#strreplace)
* [.str.strip()](#strstrip)
* [.str.capitalize()](#strcap)
* [Time efficiency measurement with _timeit_ module](#timeit)

___

https://pandas.pydata.org/docs/user_guide/text.html

### Working with Text Data
__`.str library`__

The __.str__ accessor is only available on a __pandas Series__ (or Index) that holds strings; a plain Python string uses its own methods directly.
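A rough illustration of that restriction (a sketch, not part of the original notebook; `words` and `numbers` are made-up example data): the accessor works on a Series of strings, while a numeric Series raises an AttributeError.

```python
import pandas as pd

words = pd.Series(['alpha', 'beta'])   # hypothetical example data
numbers = pd.Series([1, 2])            # hypothetical example data

upper_words = words.str.upper()        # fine: the Series holds strings

try:
    numbers.str.upper()                # a numeric Series has no usable .str accessor
    accessor_failed = False
except AttributeError:
    accessor_failed = True
```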

___

- Often text data needs to be cleaned or manipulated for processing.
- While a custom __apply() function__ always works for these tasks, pandas ships with many built-in __string methods__, all reachable through the __.str accessor__.

___

There are two ways to store text data in pandas:

- __object__ dtype
- __StringDtype__ extension type
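To make the two storage options concrete, here is a minimal sketch that requests each dtype explicitly. The variable names are made up for illustration, and the assumption is that your pandas version accepts the `'string'` alias for StringDtype (pandas 1.0 and later).

```python
import pandas as pd

# Both Series hold the same text; only the dtype differs.
as_object = pd.Series(['andrew', 'bobo'], dtype=object)
as_string = pd.Series(['andrew', 'bobo'], dtype='string')  # StringDtype

print(as_object.dtype)
print(as_string.dtype)

# The .str accessor works on either one.
upper_from_string = as_string.str.upper()
```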

___

In [1]:
import numpy as np
import pandas as pd

In [2]:
email = 'jose@email.com'

In [3]:
email.split('@')

['jose', 'email.com']

In [4]:
names = pd.Series(['andrew', 'bobo', 'claire', 'david', '5'])

In [5]:
names

0    andrew
1      bobo
2    claire
3     david
4         5
dtype: object

<a id='strupper'></a>
__`.str.upper()`__

In [6]:
names.str.upper()

0    ANDREW
1      BOBO
2    CLAIRE
3     DAVID
4         5
dtype: object

__Keep in mind__: this __does NOT modify the Series in place__. To make the change permanent, reassign the result.
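A short sketch of the reassignment pattern, reusing the `names` Series from above (`unchanged` is just a helper name for this example):

```python
import pandas as pd

names = pd.Series(['andrew', 'bobo', 'claire', 'david', '5'])

names.str.upper()           # returns a new Series; names itself is untouched
unchanged = names.tolist()

names = names.str.upper()   # reassign to keep the uppercase version
```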

___

<a id='strdigit'></a>

__`.str.isdigit()`__

In [19]:
email

'jose@email.com'

In [20]:
email.isdigit()

False

In [21]:
'5'.isdigit()

True

In [22]:
names.str.isdigit()

0    False
1    False
2    False
3    False
4     True
dtype: bool

In [23]:
names[names.str.isdigit()]  # the boolean Series works as a filter mask

4    5
dtype: object
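The same mask can be flipped with `~` to keep everything that is not a digit string. A small sketch (the names `is_number` and `non_numeric` are made up for this example):

```python
import pandas as pd

names = pd.Series(['andrew', 'bobo', 'claire', 'david', '5'])

is_number = names.str.isdigit()   # boolean mask, one value per element
non_numeric = names[~is_number]   # ~ inverts the mask: keep the non-digit entries
```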

___

<a id='strsplit'></a>
__`.str.split()`__

In [24]:
tech_finance = ['GOOG,APPL,AMZN', 'JPM,BAC,GS']

In [25]:
len(tech_finance)

2

In [26]:
tickers = pd.Series(tech_finance)

In [27]:
tickers

0    GOOG,APPL,AMZN
1        JPM,BAC,GS
dtype: object

In [28]:
tickers.str.split(',')

0    [GOOG, APPL, AMZN]
1        [JPM, BAC, GS]
dtype: object

<a id='index0'></a>
__`.str.split().str[0]`__

To understand this, let's first look at a single string.

In [31]:
tech = 'GOOG,APPL,AMZN'
tech.split(',')[0] # on a single string

'GOOG'

To get only the first item of each split list in a pandas Series, index through another .str call:

In [39]:
tickers.str.split(',').str[0] # pandas version of above cell

0    GOOG
1     JPM
dtype: object
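The index does not have to be 0. As a sketch on the same `tickers` data, other positions and negative indices work the same way, and `.str.get(i)` is the method form of `.str[i]` (the variable names here are made up for illustration):

```python
import pandas as pd

tickers = pd.Series(['GOOG,APPL,AMZN', 'JPM,BAC,GS'])
parts = tickers.str.split(',')

second = parts.str[1]            # second item of each list
last = parts.str[-1]             # negative indices count from the end
also_second = parts.str.get(1)   # method form of parts.str[1]
```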

<a id='splitexpand'></a>
__`.str.split(expand=True)`__ 

Now suppose we want every item rather than just the first, with each item in its own column (three columns here).

In [41]:
tickers.str.split(',', expand=True) # expand=True returns a DataFrame, one column per item

      0     1     2
0  GOOG  APPL  AMZN
1   JPM   BAC    GS
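The expanded columns are labelled 0, 1, 2 by default. A possible follow-up, sketched here with made-up column labels, is to rename them, or to cap the number of splits with the `n` parameter:

```python
import pandas as pd

tickers = pd.Series(['GOOG,APPL,AMZN', 'JPM,BAC,GS'])

split_df = tickers.str.split(',', expand=True)
split_df.columns = ['first', 'second', 'third']   # hypothetical labels

# n=1 splits on the first comma only, so the result has two columns
two_cols = tickers.str.split(',', n=1, expand=True)
```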


___

In [42]:
messy_names = pd.Series(['andrew  ', 'bo;bo', '   claire  '])

In [43]:
messy_names

0       andrew  
1          bo;bo
2       claire  
dtype: object

In [45]:
messy_names[0] # there is spacing involved

'andrew  '

<a id='strreplace'></a>
__`.str.replace()`__

In [46]:
messy_names.str.replace(';', '')

0       andrew  
1           bobo
2       claire  
dtype: object
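One detail worth knowing: `.str.replace()` takes a `regex` flag, and its default has changed between pandas versions, so passing it explicitly avoids surprises. A minimal sketch on the same data (the result names are made up):

```python
import pandas as pd

messy_names = pd.Series(['andrew  ', 'bo;bo', '   claire  '])

# regex=False treats ';' as a literal string
literal = messy_names.str.replace(';', '', regex=False)

# regex=True treats the pattern as a regular expression;
# here [;,] matches either a semicolon or a comma
pattern = messy_names.str.replace('[;,]', '', regex=True)
```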

<a id='strstrip'></a>
__`.str.strip()`__

In [52]:
messy_names.str.strip() # removes leading and trailing whitespace

0    andrew
1     bo;bo
2    claire
dtype: object

In [54]:
messy_names.str.strip()[0]

'andrew'
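If only one side should be trimmed, `.str.lstrip()` and `.str.rstrip()` do the same job for the left or right side only. A quick sketch:

```python
import pandas as pd

messy_names = pd.Series(['andrew  ', 'bo;bo', '   claire  '])

left_trimmed = messy_names.str.lstrip()    # removes leading whitespace only
right_trimmed = messy_names.str.rstrip()   # removes trailing whitespace only
```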

<a id='strcap'></a>
__`.str.capitalize()`__

In [58]:
messy_names.str.replace(';', '').str.strip().str.capitalize()  

# you're replacing the semicolons, stripping out whitespace and then capitalizing

0    Andrew
1      Bobo
2    Claire
dtype: object

___

Always __keep in mind__: when a transformation is too hard to express with the built-in string methods, you can fall back on a custom function passed to apply().

In [60]:
def cleanup(name):
    name = name.replace(';', '')
    name = name.strip()
    name = name.capitalize()
    return name

# does the same thing

In [62]:
messy_names.apply(cleanup)

0    Andrew
1      Bobo
2    Claire
dtype: object

In [64]:
np.vectorize(cleanup)(messy_names)

array(['Andrew', 'Bobo', 'Claire'], dtype='<U6')

__Keep in mind__: once the logic gets complicated and needs to handle special cases, for example with an if statement, chaining string methods becomes impractical and you will have to write your own custom function.
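As a sketch of such a case, the function below cleans names but replaces purely numeric entries with a placeholder. The data, the function name `cleanup_or_flag`, and the `'UNKNOWN'` placeholder are all invented for this example.

```python
import pandas as pd

mixed = pd.Series(['andrew  ', 'bo;bo', '5'])   # hypothetical data

def cleanup_or_flag(value):
    """Clean a name; flag entries that are only digits."""
    value = value.replace(';', '').strip()
    if value.isdigit():
        return 'UNKNOWN'          # placeholder chosen for this example
    return value.capitalize()

cleaned = mixed.apply(cleanup_or_flag)
```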

___

<a id='timeit'></a>
### Which one is more efficient?

Time efficiency measurement with _timeit_ module

In [65]:
import timeit

# code snippet to be executed only once
setup = '''
import numpy as np
import pandas as pd
messy_names = pd.Series(['andrew  ', 'bo;bo', '   claire  '])
def cleanup(name):
    name = name.replace(';', '')
    name = name.strip()
    name = name.capitalize()
    return name
'''


# code snippet whose execution time is to be measured
stmt_pandas_str = '''
messy_names.str.replace(';', '').str.strip().str.capitalize()
'''

stmt_pandas_apply = '''
messy_names.apply(cleanup)
'''

stmt_pandas_vectorize = '''
np.vectorize(cleanup)(messy_names)
'''

In [66]:
timeit.timeit(setup=setup,
             stmt = stmt_pandas_str,
             number=10000)

4.935063399999763

In [67]:
timeit.timeit(setup=setup, 
              stmt = stmt_pandas_apply,
              number=10000)

1.4165186999998696

In [68]:
timeit.timeit(setup=setup,
             stmt = stmt_pandas_vectorize,
             number=10000)

0.26484249999975873

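While the .str methods can be extremely convenient, don't forget about np.vectorize() when performance matters! In the timings above (10,000 runs on a three-element Series) it was the fastest of the three approaches, so it is worth re-measuring on your own data. Review the "Useful Methods" lecture for a deeper discussion of np.vectorize().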
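Since the measurements above use only three rows, here is a hedged sketch for repeating the comparison on a larger Series. The 3,000-row `big_names` Series and the run count of 20 are arbitrary assumptions; the timings will differ by machine and pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

def cleanup(name):
    return name.replace(';', '').strip().capitalize()

# Assumption: 3,000 rows built by repeating the notebook's three names
big_names = pd.Series(['andrew  ', 'bo;bo', '   claire  '] * 1000)

candidates = {
    'str methods': lambda: big_names.str.replace(';', '').str.strip().str.capitalize(),
    'apply': lambda: big_names.apply(cleanup),
    'np.vectorize': lambda: np.vectorize(cleanup)(big_names),
}

# Run each approach once so the outputs can be compared
results = {label: func() for label, func in candidates.items()}

for label, func in candidates.items():
    seconds = timeit.timeit(func, number=20)
    print(f'{label}: {seconds:.4f} s for 20 runs')
```

Checking that all three approaches return the same values before comparing their speed keeps the benchmark honest.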