# String and Text Manipulation II

**Learning Objectives:**
  * Gain an introduction to string and text manipulation using the *re* (regex) and *pandas* libraries

## Library import

The following lines import the *pandas*, *re* and *numpy* libraries.

In [80]:
import pandas as pd
import re
import numpy as np


## REGEX-based operations


### Search operations

In [81]:
s = 'foo123bar'

In [82]:
re.search('123',s)

<re.Match object; span=(3, 6), match='123'>

In [83]:
re.search('456',s)

In [84]:
if re.search('123', s):
    print('Found the pattern 123')
else:
    print('No match.')

Found the pattern 123
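The cells above only test whether a match exists. As a minimal sketch (the variable names `m`, `matched_text`, `start`, `end` and `no_match` are illustrative, not from the notebook), the Match object returned by `re.search` can also report what matched and where, and a failed search returns `None`, which is why the search for `'456'` shows no output:

```python
import re

s = 'foo123bar'

# re.search returns a Match object, or None when the pattern is absent
m = re.search('123', s)

if m is not None:
    matched_text = m.group()  # the substring that matched
    start, end = m.span()     # start index (inclusive), end index (exclusive)
    print(matched_text, start, end)

# A pattern that does not occur yields None rather than raising an error
no_match = re.search('456', s)
print(no_match)
```

Checking `is not None` before calling `.group()` avoids an `AttributeError` when nothing matches.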


In [85]:
sentence='This is a sentence with 123456 a bit of 123 and some bits of foo123foo'

In [86]:
re.findall('123',sentence)

['123', '123', '123']

In [87]:
re.findall('bit',sentence)

['bit', 'bit']
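Note that `'bit'` was found twice because it also matches inside `bits`. The sketch below, using illustrative names (`whole_words`, `positions`), shows one possible way to restrict the match to the whole word with the `\b` word-boundary token, and how `re.finditer` can report where each match starts:

```python
import re

sentence = 'This is a sentence with 123456 a bit of 123 and some bits of foo123foo'

# \b marks a word boundary, so 'bits' no longer counts as a match for 'bit'
whole_words = re.findall(r'\bbit\b', sentence)
print(whole_words)

# re.finditer yields Match objects, which carry the position of each match
positions = [(m.start(), m.group()) for m in re.finditer('123', sentence)]
print(positions)
```

The raw-string prefix `r'...'` keeps Python from interpreting `\b` as a backspace character.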

### Substitution operations

In [88]:
AnotherSentence = 'This is a sentence containing #s lots of #s that need to be replaced'

In [89]:
re.sub('#', 'hashtag', AnotherSentence)

'This is a sentence containing hashtags lots of hashtags that need to be replaced'
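`re.sub` replaces every occurrence by default. As a hedged sketch (the names `first_only`, `new_text` and `n_replaced` are made up for this example), the `count` argument limits the number of replacements, and `re.subn` additionally reports how many were made:

```python
import re

another_sentence = 'This is a sentence containing #s lots of #s that need to be replaced'

# count=1 replaces only the first occurrence
first_only = re.sub('#', 'hashtag', another_sentence, count=1)
print(first_only)

# re.subn returns a (new_string, number_of_replacements) tuple
new_text, n_replaced = re.subn('#', 'hashtag', another_sentence)
print(n_replaced)
```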

### Split operations

In [90]:
words= 'foo # bar # baz # qux # quux # corge'

In [91]:
re.split('#',words)

['foo ', ' bar ', ' baz ', ' qux ', ' quux ', ' corge']

In [92]:
for word in re.split('#',words):
    print("I have found:", word)

I have found: foo 
I have found:  bar 
I have found:  baz 
I have found:  qux 
I have found:  quux 
I have found:  corge
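The pieces above keep the spaces that surrounded each `#`. One possible way to drop them, sketched here with an illustrative variable `clean_words`, is to let the split pattern absorb optional whitespace on both sides of the delimiter:

```python
import re

words = 'foo # bar # baz # qux # quux # corge'

# \s* matches zero or more whitespace characters around the '#'
clean_words = re.split(r'\s*#\s*', words)
print(clean_words)
```

An alternative with the same effect here would be to call `.strip()` on each piece after splitting on `'#'`.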


## REGEX-based matching

### Single digit matching

In [93]:
aString='foo456bar'

In [94]:
if re.search('[0-9]', aString):
    print('Found a digit')
else:
    print('No match.')

Found a digit


In [95]:
re.findall('[0-9]',aString)

['4', '5', '6']
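The character class `[0-9]` is often written with the shorthand `\d`. The sketch below (variable names are illustrative) shows that both give the same result on this string, with one caveat: for `str` patterns `\d` also matches non-ASCII decimal digits unless the `re.ASCII` flag is passed.

```python
import re

a_string = 'foo456bar'

# \d is shorthand for "a decimal digit"
digits_class = re.findall('[0-9]', a_string)
digits_short = re.findall(r'\d', a_string)
print(digits_class == digits_short)

# '\u0663' is ARABIC-INDIC DIGIT THREE: matched by \d, but not with re.ASCII
unicode_digits = re.findall(r'\d', 'a\u0663b')
ascii_digits = re.findall(r'\d', 'a\u0663b', flags=re.ASCII)
print(len(unicode_digits), len(ascii_digits))
```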

### Single character matching

In [96]:
Number='2344555a'

In [97]:
if re.search('[a-z]', Number):
    print('Found a character')
else:
    print('No match.')

Found a character


In [98]:
re.findall('[a-z]',Number)

['a']

### Multiple digit matching

In [99]:
YetAnotherSentence='This is a sentence with a number 234 in it '

In [100]:
if re.search('[0-9]+', YetAnotherSentence):
    print('Found a number')
else:
    print('No match.')

Found a number


In [101]:
SentencewithNumbers="this is another sentence with several numbers 345 in it 12 and 67 and 9800"

In [102]:
re.findall('[0-9]+',SentencewithNumbers)

['345', '12', '67', '9800']
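`re.findall` returns the numbers as strings. If they are needed for arithmetic, a small follow-up step converts them, as in this sketch (the names `numbers` and `total` are illustrative):

```python
import re

sentence_with_numbers = "this is another sentence with several numbers 345 in it 12 and 67 and 9800"

# Each match is a string such as '345'; int() turns it into a number
numbers = [int(n) for n in re.findall('[0-9]+', sentence_with_numbers)]
total = sum(numbers)
print(numbers, total)
```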

### Multiple character matching

In [103]:
NumbersAndWords='122345, 393939 some text 2999 another word'

In [104]:
re.findall('[a-z]+',NumbersAndWords)

['some', 'text', 'another', 'word']
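The class `[a-z]` covers lowercase letters only, so capitalised words would be cut short. As a sketch on a hypothetical string with mixed case (not from the notebook), the `re.IGNORECASE` flag makes the same pattern match both cases:

```python
import re

mixed_case = '122345, 393939 Some Text 2999 Another word'

# Without the flag, the leading capital letters are skipped
lower_only = re.findall('[a-z]+', mixed_case)
print(lower_only)

# With re.IGNORECASE, [a-z] also matches A-Z
any_case = re.findall('[a-z]+', mixed_case, flags=re.IGNORECASE)
print(any_case)
```

Writing the class as `[A-Za-z]+` would be an equivalent choice for this input.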

## REGEX-based Anchors

In [105]:
Words='foo appears at the beginning, then another foo appears in the middle'

In [106]:
re.findall('^foo',Words)

['foo']

In [107]:
YetMoreWords='word appears at the beginning, word in the middle and finally word'

In [108]:
re.findall('word$',YetMoreWords)

['word']
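In both examples only one occurrence is returned, because `^` anchors the pattern to the start of the string and `$` to its end. For text spanning several lines, a sketch like the following (the sample text and variable names are made up for illustration) shows how the `re.MULTILINE` flag makes the anchors apply to every line instead:

```python
import re

text = 'foo first line\nbar second line\nfoo third line'

# By default ^ matches only at the very start of the string
start_of_string = re.findall('^foo', text)
print(start_of_string)

# With re.MULTILINE, ^ also matches right after each newline
start_of_lines = re.findall('^foo', text, flags=re.MULTILINE)
print(start_of_lines)
```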

## Applying REGEX on Pandas DataFrames

In [109]:
emails = {"Dave": "dave@google.com", "Steve": "steve@gmail.com","Rob": "rob@gmail.com", "Wes": "wes@yahoo.com"}
emails

{'Dave': 'dave@google.com',
 'Steve': 'steve@gmail.com',
 'Rob': 'rob@gmail.com',
 'Wes': 'wes@yahoo.com'}

In [110]:
TextSeries=pd.Series(emails)
TextSeries

Dave     dave@google.com
Steve    steve@gmail.com
Rob        rob@gmail.com
Wes        wes@yahoo.com
dtype: object

In [111]:
TextDataFrame=TextSeries.to_frame(name="emails")

In [112]:
TextDataFrame

                emails
Dave   dave@google.com
Steve  steve@gmail.com
Rob      rob@gmail.com
Wes      wes@yahoo.com


### Split operation

In [113]:
TextDataFrame['emails'].str.split('@')

Dave     [dave, google.com]
Steve    [steve, gmail.com]
Rob        [rob, gmail.com]
Wes        [wes, yahoo.com]
Name: emails, dtype: object

In [114]:
TextDataFrame['emailsSplit']=TextDataFrame['emails'].str.split('@')

In [115]:
TextDataFrame

                emails         emailsSplit
Dave   dave@google.com  [dave, google.com]
Steve  steve@gmail.com  [steve, gmail.com]
Rob      rob@gmail.com    [rob, gmail.com]
Wes      wes@yahoo.com    [wes, yahoo.com]
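Storing the split result as a list in a single column makes the parts awkward to use later. One possible alternative, sketched here with illustrative names (`email_series`, `parts`, and the column labels `user` and `domain`), is to pass `expand=True` so that each part lands in its own column:

```python
import pandas as pd

email_series = pd.Series({
    "Dave": "dave@google.com",
    "Steve": "steve@gmail.com",
    "Rob": "rob@gmail.com",
    "Wes": "wes@yahoo.com",
})

# expand=True returns a DataFrame with one column per piece
parts = email_series.str.split('@', expand=True)
parts.columns = ['user', 'domain']  # hypothetical column names
print(parts)
```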


### Filter operation

In [116]:
employees = {'name': ['david','john','peter','fangfang','lucas','daniel'], 'unit': ['engineering', 'finance','marketing','accounting','sales','marketing'],'email':['david@tesla.com','john@tesla.com','peterNoemail','fangfang@tesla.com','lucasthielNoEemail','daniel@tesla.com']}

In [117]:
employeesDataFrame=pd.DataFrame(employees)
employeesDataFrame

       name         unit               email
0     david  engineering     david@tesla.com
1      john      finance      john@tesla.com
2     peter    marketing        peterNoemail
3  fangfang   accounting  fangfang@tesla.com
4     lucas        sales  lucasthielNoEemail
5    daniel    marketing    daniel@tesla.com


In [118]:
containsEmailFilter=employeesDataFrame['email'].str.contains('@')

In [119]:
employeesDataFrame[containsEmailFilter]

       name         unit               email
0     david  engineering     david@tesla.com
1      john      finance      john@tesla.com
3  fangfang   accounting  fangfang@tesla.com
5    daniel    marketing    daniel@tesla.com
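The filter above works because every row holds a string. If the column could contain missing values, the boolean mask would contain missing entries too and could not be used directly for indexing. The sketch below, on a small hypothetical Series, shows one way to handle that with the `na` argument, and uses `regex=False` to state that `'@'` is a literal character rather than a pattern:

```python
import pandas as pd

# Hypothetical data with one missing email
email = pd.Series(['david@tesla.com', 'peterNoemail', None])

# na=False treats missing values as "does not contain"
has_at = email.str.contains('@', regex=False, na=False)
print(email[has_at])
```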


### Replace operation

In [120]:
employeesDataFrame['emailWithHashtag']=employeesDataFrame['email'].str.replace('@','#', regex=False)

In [121]:
employeesDataFrame

       name         unit               email    emailWithHashtag
0     david  engineering     david@tesla.com     david#tesla.com
1      john      finance      john@tesla.com      john#tesla.com
2     peter    marketing        peterNoemail        peterNoemail
3  fangfang   accounting  fangfang@tesla.com  fangfang#tesla.com
4     lucas        sales  lucasthielNoEemail  lucasthielNoEemail
5    daniel    marketing    daniel@tesla.com    daniel#tesla.com
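The replacement above swaps a single literal character. `str.replace` can also take a regular expression when `regex=True` is passed explicitly. As a sketch under assumptions (the sample Series and the target domain `example.org` are invented for illustration), the pattern below rewrites everything from the `@` to the end of the string:

```python
import pandas as pd

email = pd.Series(['david@tesla.com', 'peterNoemail', 'daniel@tesla.com'])

# '@.*$' matches the '@' and everything after it, up to the end of the string
new_domain = email.str.replace(r'@.*$', '@example.org', regex=True)
print(new_domain)
```

Rows without an `@` are left unchanged because the pattern does not match them.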


### Filter operation with an anchor

In [125]:
startsWithDFilter=employeesDataFrame['email'].str.contains('^d')

In [126]:
employeesDataFrame[startsWithDFilter]

     name         unit             email  emailWithHashtag
0   david  engineering   david@tesla.com   david#tesla.com
5  daniel    marketing  daniel@tesla.com  daniel#tesla.com
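For a fixed prefix, a regex anchor is not strictly required. The sketch below, on a small hypothetical Series, shows `str.startswith` as a possible literal-prefix alternative, and `str.extract` as one way to pull out the part of each email that follows the `@` using a capture group:

```python
import pandas as pd

email = pd.Series(['david@tesla.com', 'peterNoemail', 'daniel@tesla.com'])

# Literal prefix test, no regular expression involved
starts_with_d = email.str.startswith('d')
print(email[starts_with_d])

# The parentheses form a capture group; expand=False returns a Series.
# Rows where the pattern does not match get a missing value.
domain = email.str.extract(r'@(.+)$', expand=False)
print(domain)
```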
