
In [1]:
%run supportvectors-common.ipynb

# Working with Strings

In this lab, we will learn to manipulate string data. Pandas has numerous features for working with strings; we will cover the basics you will need most often while cleaning datasets. References to the Pandas documentation are provided wherever more information is required.

In [2]:
import pandas as pd
import numpy as np

# paths to the most commonly used dataset repositories for this lab

# path to pandas_for_everyone datasets repository
pfe_rep_path = 'https://raw.githubusercontent.com/chendaniely/pandas_for_everyone/master/data/'

# path to SupportVectors data-wrangling-datasets repository
sv_rep_path = 'https://raw.githubusercontent.com/supportvectors/data-wrangling-datasets/main/'

## Storing text  

Text, or more generally any non-numeric data, is stored as the `object` datatype.

In [3]:
s = pd.Series(['a', '2' , 'c10'])
s

0      a
1      2
2    c10
dtype: object

Pandas recommends storing string data as `StringDtype`:

In [4]:
s = pd.Series(['a', '2' , 'c10'],
              dtype=pd.StringDtype() # or dtype='string'
             )
s

0      a
1      2
2    c10
dtype: string

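The two dtypes do not behave identically. For example, with `StringDtype`, string methods that return numbers produce a nullable integer dtype and methods that return booleans produce a nullable boolean dtype, whereas with `object` the result dtype depends on whether missing values are present.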

See documentation: https://pandas.pydata.org/docs/user_guide/text.html#behavior-differences
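As a rough illustration of one such difference, the sketch below measures string lengths on the same values stored both ways. The series and variable names are made up for this example, and a missing value is added on purpose to make the difference visible.

```python
import pandas as pd

# Hypothetical sample values; None stands in for a missing entry
values = ["a", None, "c10"]

as_object = pd.Series(values, dtype="object")
as_string = pd.Series(values, dtype="string")

# Same method, different result dtypes
object_lengths = as_object.str.len()   # falls back to float so it can hold NaN
string_lengths = as_string.str.len()   # nullable integer, missing shown as <NA>

print(object_lengths.dtype)
print(string_lengths.dtype)
```

With `object`, the missing value forces the lengths into floats; with `StringDtype`, they remain integers.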

## Accessing string methods

String methods are accessed through the `.str` accessor. They mirror Python's built-in string methods and mostly share their names.

A list of Python's built-in string methods: https://docs.python.org/2.5/lib/string-methods.html
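A minimal sketch of what "similar to the built-in methods" means in practice, reusing the small series from earlier in this lab. The variable names are chosen only for this example.

```python
import pandas as pd

s = pd.Series(["a", "2", "c10"])

# The .str accessor applies the method to every element of the Series
upper = s.str.upper()
is_digit = s.str.isdigit()

# The same result, written with Python's built-in str.upper on each element
builtin_upper = [text.upper() for text in s]

print(upper.tolist())
print(is_digit.tolist())
```

The accessor version is vectorized over the whole Series and returns a Series, which keeps the index aligned with the original data.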

### Weather dataset
This weather data is collected from https://www.ncdc.noaa.gov/cdo-web/ where free access is provided to global historic climate data. For this exercise, we take a subset of the data collected from the New Orleans Lakefront Airport station. Specifically, we will work with four variables. 

Columns

* AWND = Average daily wind speed (miles per hour)
* PRCP = Precipitation (inches to hundredths)
* TMAX = Maximum temperature (Fahrenheit)
* TMIN = Minimum temperature (Fahrenheit)

Let's load this dataset.

In [20]:
source = sv_rep_path + 'New_Orleans_Lakefront_Airport_weather_2021_subset.csv'

weather_df = pd.read_csv(source)
weather_df.head()

Unnamed: 0,STATION,NAME,DATE,AWND,PRCP,TMIN,TMAX
0,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",01-01-2021,13.42,0.0,56.0,69.0
1,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",02-01-2021,13.87,0.0,48.0,57.0
2,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",03-01-2021,7.83,0.0,47.0,56.0
3,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",04-01-2021,6.49,0.0,44.0,68.0
4,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",05-01-2021,5.82,0.0,52.0,61.0


We observe that the column names are all uppercase, which might be difficult to work with. Let's change them using some of the string methods.

In [21]:
cols = weather_df.columns.str.lower() # change  text to lowercase
cols

Index(['station', 'name', 'date', 'awnd', 'prcp', 'tmin', 'tmax'], dtype='object')

In [22]:
cols = weather_df.columns.str.capitalize() # uppercase the first letter of each string and lowercase the rest
cols

Index(['Station', 'Name', 'Date', 'Awnd', 'Prcp', 'Tmin', 'Tmax'], dtype='object')

In [23]:
weather_df.columns = cols # reset the column names to the modified names
weather_df

Unnamed: 0,Station,Name,Date,Awnd,Prcp,Tmin,Tmax
0,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",01-01-2021,13.42,0.00,56.0,69.0
1,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",02-01-2021,13.87,0.00,48.0,57.0
2,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",03-01-2021,7.83,0.00,47.0,56.0
3,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",04-01-2021,6.49,0.00,44.0,68.0
4,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",05-01-2021,5.82,0.00,52.0,61.0
...,...,...,...,...,...,...,...
357,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",27-12-2021,7.61,0.00,64.0,78.0
358,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",28-12-2021,14.09,0.30,68.0,81.0
359,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",29-12-2021,12.75,0.00,71.0,82.0
360,USW00053917,"NEW ORLEANS LAKEFRONT AIRPORT, LA US",30-12-2021,9.62,0.00,72.0,83.0
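The two steps above (build the new names, then assign them) can also be written as a single call. This is only a sketch on a one-row toy frame; `toy_df` and its values are stand-ins for `weather_df`.

```python
import pandas as pd

# Hypothetical one-row stand-in for the weather data
toy_df = pd.DataFrame({"STATION": ["USW00053917"], "TMIN": [56.0], "TMAX": [69.0]})

# rename() accepts a function and applies it to every column label;
# it returns a new DataFrame and leaves toy_df untouched
renamed = toy_df.rename(columns=str.capitalize)

print(list(renamed.columns))
```

Using `rename` avoids mutating the frame in place, which can be convenient when the original column names are still needed later.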


## Split and Replace

Let's split the station `Name` into a name part and a state part, using `,` as the separator. By default, `.str.split()` returns a Series in which each row holds the split pieces as a list. We will store the result in `name_state_df`.

In [24]:
weather_df.Name.head()

0    NEW ORLEANS LAKEFRONT AIRPORT, LA US
1    NEW ORLEANS LAKEFRONT AIRPORT, LA US
2    NEW ORLEANS LAKEFRONT AIRPORT, LA US
3    NEW ORLEANS LAKEFRONT AIRPORT, LA US
4    NEW ORLEANS LAKEFRONT AIRPORT, LA US
Name: Name, dtype: object

Now let us split the `Name` column, using `,` as the delimiter. Each row should become a list of two strings: the name of the weather station, and the state and country it belongs to.

In [9]:

name_state_df = weather_df.Name.str.split(",")
name_state_df

0      [NEW ORLEANS LAKEFRONT AIRPORT,  LA US]
1      [NEW ORLEANS LAKEFRONT AIRPORT,  LA US]
2      [NEW ORLEANS LAKEFRONT AIRPORT,  LA US]
3      [NEW ORLEANS LAKEFRONT AIRPORT,  LA US]
4      [NEW ORLEANS LAKEFRONT AIRPORT,  LA US]
                        ...                   
357    [NEW ORLEANS LAKEFRONT AIRPORT,  LA US]
358    [NEW ORLEANS LAKEFRONT AIRPORT,  LA US]
359    [NEW ORLEANS LAKEFRONT AIRPORT,  LA US]
360    [NEW ORLEANS LAKEFRONT AIRPORT,  LA US]
361    [NEW ORLEANS LAKEFRONT AIRPORT,  LA US]
Name: Name, Length: 362, dtype: object

To access one row of the result (the list of pieces for that row), we can use its index label.

In [10]:
name_state_df[1]  # the list stored in row 1; .str.get(1) instead returns piece 1 of every row

['NEW ORLEANS LAKEFRONT AIRPORT', ' LA US']
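Indexing the Series and indexing through `.str` answer different questions, which the following sketch tries to make explicit. It uses a three-row stand-in for the `Name` column; the variable names are illustrative.

```python
import pandas as pd

# Three-row stand-in for weather_df.Name
names = pd.Series(["NEW ORLEANS LAKEFRONT AIRPORT, LA US"] * 3)
parts = names.str.split(",")

# Plain indexing selects a ROW: the list of pieces stored at label 1
row_one = parts[1]

# .str.get(i) or .str[i] selects a PIECE: element i from every row's list
second_pieces = parts.str.get(1)
first_pieces = parts.str[0]

print(row_one)
print(second_pieces.tolist())
```

Note that the second piece keeps the space that followed the comma in the original string.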

We can also directly expand the string column into two (or more if needed) separate columns by setting `expand` to True. This yields a dataframe.

In [11]:
name_state_df = weather_df.Name.str.split(",", expand=True)
name_state_df

Unnamed: 0,0,1
0,NEW ORLEANS LAKEFRONT AIRPORT,LA US
1,NEW ORLEANS LAKEFRONT AIRPORT,LA US
2,NEW ORLEANS LAKEFRONT AIRPORT,LA US
3,NEW ORLEANS LAKEFRONT AIRPORT,LA US
4,NEW ORLEANS LAKEFRONT AIRPORT,LA US
...,...,...
357,NEW ORLEANS LAKEFRONT AIRPORT,LA US
358,NEW ORLEANS LAKEFRONT AIRPORT,LA US
359,NEW ORLEANS LAKEFRONT AIRPORT,LA US
360,NEW ORLEANS LAKEFRONT AIRPORT,LA US


In [26]:
weather_df['Name'] = name_state_df[0]   # overwrite Name with the station part
weather_df['State'] = name_state_df[1]  # add the state/country part as a new column
weather_df

Unnamed: 0,Station,Name,Date,Awnd,Prcp,Tmin,Tmax,State
0,USW00053917,NEW ORLEANS LAKEFRONT AIRPORT,01-01-2021,13.42,0.00,56.0,69.0,LA US
1,USW00053917,NEW ORLEANS LAKEFRONT AIRPORT,02-01-2021,13.87,0.00,48.0,57.0,LA US
2,USW00053917,NEW ORLEANS LAKEFRONT AIRPORT,03-01-2021,7.83,0.00,47.0,56.0,LA US
3,USW00053917,NEW ORLEANS LAKEFRONT AIRPORT,04-01-2021,6.49,0.00,44.0,68.0,LA US
4,USW00053917,NEW ORLEANS LAKEFRONT AIRPORT,05-01-2021,5.82,0.00,52.0,61.0,LA US
...,...,...,...,...,...,...,...,...
357,USW00053917,NEW ORLEANS LAKEFRONT AIRPORT,27-12-2021,7.61,0.00,64.0,78.0,LA US
358,USW00053917,NEW ORLEANS LAKEFRONT AIRPORT,28-12-2021,14.09,0.30,68.0,81.0,LA US
359,USW00053917,NEW ORLEANS LAKEFRONT AIRPORT,29-12-2021,12.75,0.00,71.0,82.0,LA US
360,USW00053917,NEW ORLEANS LAKEFRONT AIRPORT,30-12-2021,9.62,0.00,72.0,83.0,LA US
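The `State` values produced this way still start with the space that followed the comma. One possible variation, sketched on a two-row toy frame (`toy_df` is a stand-in, not the lab's dataset), limits the split to the first comma and strips the surrounding whitespace. The rest of this lab does not apply this cleanup.

```python
import pandas as pd

# Two-row stand-in for weather_df
toy_df = pd.DataFrame({"Name": ["NEW ORLEANS LAKEFRONT AIRPORT, LA US"] * 2})

# n=1 splits on the first comma only, so we always get exactly two columns
parts = toy_df["Name"].str.split(",", n=1, expand=True)

# .str.strip() removes leading and trailing whitespace from each piece
toy_df["Name"] = parts[0].str.strip()
toy_df["State"] = parts[1].str.strip()

print(toy_df["State"].tolist())
```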


Let's replace LA with Louisiana in the `State` column

In [32]:
weather_df['State'] = weather_df['State'].replace("LA", "Louisiana")
weather_df['State']

0      Louisiana
1      Louisiana
2      Louisiana
3      Louisiana
4      Louisiana
         ...    
357    Louisiana
358    Louisiana
359    Louisiana
360    Louisiana
361    Louisiana
Name: State, Length: 362, dtype: object

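Huh! That did not work! On a fresh run the `State` column still reads ` LA US`; the `Louisiana` values printed above are left over from a later, out-of-order run of the notebook.

Called this way, `replace` substitutes a value only when the entire cell equals the search string, and our cells contain ` LA US` (note the leading space left behind by the split), not `LA`. To get a match, we have to spell out the whole value: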

In [35]:
weather_df['State'] = weather_df['State'].replace(" LA US", "Louisiana")
weather_df['State']

0      Louisiana
1      Louisiana
2      Louisiana
3      Louisiana
4      Louisiana
         ...    
357    Louisiana
358    Louisiana
359    Louisiana
360    Louisiana
361    Louisiana
Name: State, Length: 362, dtype: object

## Regular expressions

Matching the whole cell works, but it forces us to know and type the entire value, leading space included. To substitute `LA` wherever it occurs inside a string, we pass `regex=True`, which makes `replace` treat the search string as a regular expression and substitute every match within each cell. `Regular expressions` are very powerful for finding patterns in strings, and learning to work with them is very handy while cleaning text data. The cell below is applied to the original ` LA US` values, that is, before the exact-match replacement above.

https://regex101.com/ is a useful place to start learning and testing out regular expressions.

In [14]:
weather_df['State'] = weather_df['State'].replace("LA", "Louisiana", regex=True)
weather_df['State']

0       Louisiana US
1       Louisiana US
2       Louisiana US
3       Louisiana US
4       Louisiana US
           ...      
357     Louisiana US
358     Louisiana US
359     Louisiana US
360     Louisiana US
361     Louisiana US
Name: State, Length: 362, dtype: object
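To keep the three behaviours apart, the sketch below runs each variant on a fresh two-row stand-in for the `State` column. It also shows `.str.replace`, which always works on substrings; the variable names are illustrative only.

```python
import pandas as pd

# Stand-in for the State column right after the split (leading space included)
state = pd.Series([" LA US", " LA US"])

# 1. Series.replace with a plain string: matches whole cell values only
no_match = state.replace("LA", "Louisiana")
whole_cell = state.replace(" LA US", "Louisiana")

# 2. Series.replace with regex=True: substitutes inside each string
regex_match = state.replace("LA", "Louisiana", regex=True)

# 3. .str.replace: substring replacement; regex=False treats "LA" literally
substring = state.str.replace("LA", "Louisiana", regex=False)

print(no_match.tolist())
print(regex_match.tolist())
```

For a fixed piece of text such as `LA`, `.str.replace(..., regex=False)` states the intent most directly; `regex=True` becomes useful once the thing to replace is a pattern rather than a literal.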

## Concatenation

As opposed to splitting a column of strings, concatenation joins two string columns using the specified separator. It is done using the `.str.cat()` method. Let's concatenate the `Name` and `State` columns. Because `State` still begins with a space, the result has two spaces after the comma.

In [15]:
station_name = weather_df['Name'].str.cat(weather_df['State'], sep=", ")
station_name

0      NEW ORLEANS LAKEFRONT AIRPORT,  Louisiana US
1      NEW ORLEANS LAKEFRONT AIRPORT,  Louisiana US
2      NEW ORLEANS LAKEFRONT AIRPORT,  Louisiana US
3      NEW ORLEANS LAKEFRONT AIRPORT,  Louisiana US
4      NEW ORLEANS LAKEFRONT AIRPORT,  Louisiana US
                           ...                     
357    NEW ORLEANS LAKEFRONT AIRPORT,  Louisiana US
358    NEW ORLEANS LAKEFRONT AIRPORT,  Louisiana US
359    NEW ORLEANS LAKEFRONT AIRPORT,  Louisiana US
360    NEW ORLEANS LAKEFRONT AIRPORT,  Louisiana US
361    NEW ORLEANS LAKEFRONT AIRPORT,  Louisiana US
Name: Name, Length: 362, dtype: object
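A small sketch of two details worth knowing about `.str.cat()`: stripping the stray space before joining, and what happens when one side is missing. The second station name and the missing state are invented for this example.

```python
import pandas as pd

# Second row is hypothetical, added to show the effect of a missing value
name = pd.Series(["NEW ORLEANS LAKEFRONT AIRPORT", "SOME OTHER STATION"])
state = pd.Series([" Louisiana US", None])

# Without na_rep, a missing value on either side makes the whole result missing
no_fill = name.str.cat(state, sep=", ")

# Strip the leading space first, and substitute a placeholder for missing values
joined = name.str.cat(state.str.strip(), sep=", ", na_rep="unknown")

print(joined.tolist())
```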

## Extraction

At times the relevant substrings are not separated by a specific character. For example, we can split the string `Cases_Guinea` into `Cases` and `Guinea` by applying the `split` method with `_` as the separator, but if we wanted to split the string `CasesGuinea` into `Cases` and `Guinea`, the `split` method cannot be used.

For these cases, we can use the `extract()` method, which accepts a regular expression with one or more `capture groups`. Capture groups are specified using `()` parentheses. **Each capture group is extracted into a separate column.**


Let's separate the series `sex_age` using regular expressions. 

In [16]:
sex_age = pd.Series(['f7', 'm54', 'm33', 'f42', 'm', 'f29', 'm19'])

sex_age

0     f7
1    m54
2    m33
3    f42
4      m
5    f29
6    m19
dtype: object

The following regex pattern captures two groups:

1. The first group captures the letter m or f.
2. The second group captures 0 or more digits, so a row with no digits, such as `m`, still matches and yields an empty second group.

In [17]:
regex = r'([mf])(\d*)'  # captures two groups 
sex_age = sex_age.str.extract(regex)
sex_age

Unnamed: 0,0,1
0,f,7.0
1,m,54.0
2,m,33.0
3,f,42.0
4,m,
5,f,29.0
6,m,19.0
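The extracted columns are labelled `0` and `1` and both hold text. One way to get readable column names and a numeric age, sketched here under the assumption that the names `sex` and `age` suit the data, is to use named capture groups and convert afterwards.

```python
import pandas as pd

sex_age = pd.Series(["f7", "m54", "m33", "f42", "m", "f29", "m19"])

# (?P<name>...) is a named capture group; extract() uses the names as column labels
parts = sex_age.str.extract(r"(?P<sex>[mf])(?P<age>\d*)")

# The age column is still text; errors="coerce" turns anything
# that is not a number (here the empty match for "m") into NaN
parts["age"] = pd.to_numeric(parts["age"], errors="coerce")

print(parts)
```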


When capture groups are nested, every group, inner or outer, is still extracted as its own column, numbered in the order in which its opening parenthesis appears.

In [18]:
sexage_range = pd.Series(['f0-10', 'f10-20', 'f20-30', 'f30+', 'm0-10', 'm10-20', 'm20-30', 'm30+'])
sexage_range

0     f0-10
1    f10-20
2    f20-30
3      f30+
4     m0-10
5    m10-20
6    m20-30
7      m30+
dtype: object

This pattern captures nested groups `'([mf])((\d{1,2}-\d{1,2})|(\d{1,2}\+))'`

* `([mf])` -  group `1` captures the letter m or f
* `((\d{1,2}-\d{1,2})|(\d{1,2}\+))` - group `2` has two subgroups, `2.a` and `2.b`. Whichever of `2.a` or `2.b` matches is what group `2` captures.
* `(\d{1,2}-\d{1,2})` - subgroup `2.a` captures strings of type `00-00`
* `(\d{1,2}\+)` - subgroup `2.b` captures strings of type `00+`

In [19]:
regex = r'([mf])((\d{1,2}-\d{1,2})|(\d{1,2}\+))'  # (1)((2.a)|(2.b))
sexage_range = sexage_range.str.extract(regex)
sexage_range

Unnamed: 0,0,1,2,3
0,f,0-10,0-10,
1,f,10-20,10-20,
2,f,20-30,20-30,
3,f,30+,,30+
4,m,0-10,0-10,
5,m,10-20,10-20,
6,m,20-30,20-30,
7,m,30+,,30+


The output has four columns, `0` to `3`, corresponding to the capture groups `1`, `2`, `2.a`, and `2.b`. In each row, whichever subgroup did not match is left empty.
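If only the sex and the full age range are wanted, the inner parentheses can be dropped (or written as non-capturing groups, `(?:...)`), so that just two groups remain. The following is a sketch of that variation; the group names `sex` and `age_range` are chosen for this example.

```python
import pandas as pd

sexage_range = pd.Series(
    ["f0-10", "f10-20", "f20-30", "f30+", "m0-10", "m10-20", "m20-30", "m30+"]
)

# Only two capture groups: the alternation lives inside the second group,
# so neither alternative produces a column of its own
pattern = r"(?P<sex>[mf])(?P<age_range>\d{1,2}-\d{1,2}|\d{1,2}\+)"
parts = sexage_range.str.extract(pattern)

print(parts)
```

This yields one column per piece of information, without the partly empty subgroup columns.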