## Text Manipulation using string methods and regular expressions

## Dealing with String Data

Python string objects provide methods that:
* split a string by given delimiter - `.split()`
* trim whitespace - `.strip()`
* concatenate strings - `.join()`
* detect substrings - `.find()` and `.index()`
* count occurrences - `.count()`
* find and replace - `.replace()`

In [13]:
s = 'ready, set ,   go '
s

'ready, set ,   go '

In [14]:
s.split(',')

['ready', ' set ', '   go ']

In [15]:
s.split(' ')

['ready,', 'set', ',', '', '', 'go', '']

In [16]:
'_'.join(s.split(','))

'ready_ set _   go '

In [None]:
# Split on commas, strip each piece, then join the pieces with underscores
'_'.join([x.strip() for x in s.split(',')])

In [19]:
# Trimming whitespace
pieces = [x.strip() for x in s.split(',')]
pieces
# Also see rstrip, lstrip

['ready', 'set', 'go']

In [20]:
'_#_'.join(list('abcde'))

'a_#_b_#_c_#_d_#_e'

In [21]:
# Concatenating strings
print('::'.join(pieces))
print('--'.join(pieces))
print(' '.join(pieces))

ready::set::go
ready--set--go
ready set go


In [22]:
# Test whether a substring occurs in a string
print('steady' in s)
print('set' in s)

False
True


In [23]:
# Locate a substring
s.index('go')

15

In [24]:
s[15:17]

'go'

In [25]:
#find vs index
sentence = 'the sun rises in the east'
sentence.find('east')

21

In [26]:
sentence.index('east')

21

In [27]:
print(sentence.find('west'))     # find() returns -1 when the substring is absent
# print(sentence.index('west'))  # index() raises ValueError instead

-1
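
The difference between `.find()` and `.index()` matters mostly for error handling. The following is a minimal sketch; the helper name `locate` is hypothetical and only illustrates one way to wrap the `ValueError` that `.index()` raises.

```python
sentence = 'the sun rises in the east'

def locate(text, sub):
    """Return the position of sub in text, or None if it is absent."""
    try:
        return text.index(sub)   # raises ValueError when sub is missing
    except ValueError:
        return None

found = locate(sentence, 'east')
missing = locate(sentence, 'west')
print(found)                    # 21
print(missing)                  # None
print(sentence.find('west'))    # -1
```

Use `.find()` when a sentinel value of `-1` is convenient, and `.index()` when a missing substring should be treated as an error.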


In [28]:
# Locate a substring
s.find(',')

5

In [29]:
# Count occurrences
s.count(',')

2

In [30]:
sentence.endswith('east')

True

In [31]:
s2 = 'the quick brown fox jumps over the lazy dog'
s2.find('fox')  # not displayed: the value is neither printed nor the last expression in the cell

print('lazy' in s2)

print(s2.endswith('dog'))

True
True


In [32]:
s.startswith('ready')
# similarly .endswith()

True
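
The list at the start of this section also mentions `.replace()`, and the trimming cell points to `lstrip` and `rstrip`. A short sketch of these methods, reusing the same string `s` (the variable names `no_spaces`, `first_only`, and `padded` are illustrative):

```python
s = 'ready, set ,   go '

# replace() returns a new string; the original is left unchanged
no_spaces = s.replace(' ', '')
print(no_spaces)                 # ready,set,go

# An optional third argument limits the number of replacements
first_only = s.replace(',', ';', 1)
print(repr(first_only))          # 'ready; set ,   go '

# lstrip() and rstrip() trim one side only
padded = '   go '
print(repr(padded.lstrip()))     # 'go '
print(repr(padded.rstrip()))     # '   go'
```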

In [45]:
# Data importing
import pandas as pd

cust_demo = pd.read_csv("Cust_demo.csv")

In [46]:
cust_demo.columns

Index([u'ID', u'Location', u'Gender', u'age', u'Martial_Status',
       u'NumberOfDependents', u'Own_House', u'No_Years_address'],
      dtype='object')

In [47]:
cust_demo.Location.head(5)

0         Gandhinagar,Gujarat
1    Hyderabad,Andhra Pradesh
2     Shimla,Himachal Pradesh
3                 Srinagar,JK
4              Imphal,Manipur
Name: Location, dtype: object

In [48]:
# Location is stored as "City,State", so the first piece is the city
cust_demo[['City', 'State']]=cust_demo['Location'].str.split(',', expand=True)

In [49]:
cust_demo.head(5)

Unnamed: 0,ID,Location,Gender,age,Martial_Status,NumberOfDependents,Own_House,No_Years_address,City,State
0,4532,"Gandhinagar,Gujarat",0,39,Single,1.0,1,3,Gandhinagar,Gujarat
1,148736,"Hyderabad,Andhra Pradesh",0,52,Married,0.0,0,3,Hyderabad,Andhra Pradesh
2,95965,"Shimla,Himachal Pradesh",0,62,Married,0.0,0,2,Shimla,Himachal Pradesh
3,61759,"Srinagar,JK",0,42,Single,1.0,1,1,Srinagar,JK
4,49806,"Imphal,Manipur",0,41,Single,1.0,0,3,Imphal,Manipur
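
For comparison, the same split can be sketched in plain Python without pandas. The list `locations` below copies a few values from the `Location` column shown above; in these rows the text before the comma is the city. Passing `1` as the second argument to `split` is an added safeguard (not used in the cell above) so that a value with extra commas still yields exactly two pieces.

```python
locations = ['Gandhinagar,Gujarat', 'Hyderabad,Andhra Pradesh', 'Srinagar,JK']

# Split each value at the first comma only
pairs = [loc.split(',', 1) for loc in locations]
cities = [city.strip() for city, state in pairs]
states = [state.strip() for city, state in pairs]

print(cities)   # ['Gandhinagar', 'Hyderabad', 'Srinagar']
print(states)   # ['Gujarat', 'Andhra Pradesh', 'JK']
```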


## Regular Expressions

A regular expression (regex) is a sequence of characters that defines a search pattern, used to match, extract, split, or find and replace text.

Examples:
* `\s+` matches one or more whitespace characters
* `(?<=\.) {2,}(?=[A-Z])` matches two or more spaces that occur after a period (.) and before an uppercase letter

Note:
* A regex can be _compiled_ with `re.compile` to create a reusable regex object. Compiling is optional (module-level functions such as `re.search` accept the pattern string directly), but it is convenient when the same pattern is used repeatedly.
* The object's methods can then be called on a string.
* These include:
    * **`split`** (splits the string at each match),
    * **`findall`** (returns all matches),
    * **`match`** (checks only the beginning of the string),
    * **`search`** (returns a match object for the first occurrence, or `None` if there is none),
    * **`sub`** (returns a new string with occurrences of the pattern replaced with the supplied string)

Syntax:
1. `import re`
2. `r_obj = re.compile(r'my-regex')`
3. `r_obj.method(my_text)`
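
The methods listed above can be sketched on the two example patterns. The sample string `text` and the variable names are made up for illustration.

```python
import re

text = 'ready,  set.   Go now'

ws = re.compile(r'\s+')                   # one or more whitespace characters
pieces = ws.split(text)
runs = ws.findall(text)
collapsed = ws.sub(' ', text)
first = ws.search(text)                   # match object for the first run
at_start = ws.match(text)                 # None: the string starts with 'r'

print(pieces)          # ['ready,', 'set.', 'Go', 'now']
print(runs)            # ['  ', '   ', ' ']
print(collapsed)       # ready, set. Go now
print(first.start())   # 6
print(at_start)        # None

# Two or more spaces after a period and before an uppercase letter
sentence_gap = re.compile(r'(?<=\.) {2,}(?=[A-Z])')
fixed = sentence_gap.sub(' ', text)
print(fixed)           # ready,  set. Go now
```

Note that the second pattern leaves the double space after `ready,` untouched, because it is not preceded by a period.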

In [1]:
import re

In [2]:
# The phone number example
# Aim: check whether a text contains a phone number of the form +CC NNN-NNN-NNNN

phone = re.compile(r'\+\d\d \d\d\d-\d\d\d-\d\d\d\d')
phoneString1 = "My phone number is +91 905-298-9892"

ps_1 = phone.search(phoneString1)
print("Phone number identified " + ps_1.group())

Phone number identified +91 905-298-9892


In [3]:
# Aim is to identify the country code and number separately

phone2 = re.compile(r'(\+\d\d) (\d\d\d-\d\d\d-\d\d\d\d)')

ps_2 = phone2.search(phoneString1)
print("The country code is : " + ps_2.group(1) + " and the number is : " + ps_2.group(2))

The country code is : +91 and the number is : 905-298-9892


In [4]:
# The Pythonic way:
# use groups() to extract all the groups at once and unpack them

CountryCode, PhoneNumber = ps_2.groups()
print("The country code is : " + CountryCode)
print("The phone number is : " + PhoneNumber)

The country code is : +91
The phone number is : 905-298-9892
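
Groups can also be given names, which avoids relying on their position. The sketch below uses the standard `(?P<name>...)` syntax; the group names `code` and `number` and the variable `phone_named` are chosen here for illustration.

```python
import re

phone_named = re.compile(r'(?P<code>\+\d{2}) (?P<number>\d{3}-\d{3}-\d{4})')
m = phone_named.search("My phone number is +91 905-298-9892")

if m is not None:
    parts = m.groupdict()        # dictionary keyed by group name
    print(m.group('code'))       # +91
    print(parts['number'])       # 905-298-9892
```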


In [5]:
# The pipe | matches one of several alternatives:
# the regular expression r'Queen|MJ' will match either 'Queen' or 'MJ'.

BandRegex = re.compile(r'Queen|MJ')

BandSearch1 = BandRegex.search("Bohemian Rhapsody by Queen")
print(BandSearch1.group())

BandSearch2 = BandRegex.search("Black or White by MJ")
print(BandSearch2.group())


Queen
MJ


In [6]:
# If both are present, group() returns whichever alternative occurs first in the text
BandSearch3 = BandRegex.search("Queen and MJ songs are awesome!")
print(BandSearch3.group())

Queen


In [7]:
# The pipe can also be used inside a group

LambRegex = re.compile(r'Lamborghini (Countach|Diablo|Murcielago|Gallardo|Aventador|SestoElemento)')

Trivia1 = "NFS1 featured Lamborghini Diablo"
Trivia2 = "NFS Movie had Lamborghini SestoElemento"
Trivia3 = "FnF featured a Koenigsegg CCX-R"

LambSearch1 = LambRegex.search(Trivia1)
print(LambSearch1.group())     # the whole match
print(LambSearch1.group(0))    # group(0) is the same as group()
print(LambSearch1.groups())    # tuple of the captured groups


Lamborghini Diablo
Lamborghini Diablo
('Diablo',)
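
When a text does not contain the pattern, `search()` returns `None`, and calling `.group()` on it raises an `AttributeError`. A minimal sketch of guarding against that case follows; the shortened pattern and the sample sentences are illustrative only.

```python
import re

lamb_regex = re.compile(r'Lamborghini (Countach|Diablo|Gallardo)')
trivia = ["NFS1 featured Lamborghini Diablo",
          "This sentence mentions a Koenigsegg instead"]

models = []
for line in trivia:
    m = lamb_regex.search(line)
    # Test the match object before calling group()
    models.append(m.group(1) if m else None)

print(models)   # ['Diablo', None]
```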


In [8]:
# The ? character marks the group that precedes it as an optional part of the pattern.
# ? - zero or one occurrence

# Optional country code for a phone number
# Note: the space after the optional group is still required,
# so a match without a country code begins with that space.
phone2 = re.compile(r'(\+\d\d)? (\d\d\d-\d\d\d-\d\d\d\d)')

ps_2 = phone2.search(phoneString1)
print(ps_2.group())

phoneString2 = "The other number is 789-256-1234"
ps_3 = phone2.search(phoneString2)
print(ps_3.group())

+91 905-298-9892
 789-256-1234
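
The second match above begins with a space because the space sits outside the optional group. One possible variation, sketched here under the assumption that the leading space is unwanted, moves the space inside the group. The name `phone_opt` is hypothetical.

```python
import re

# The separating space is now part of the optional country-code group
phone_opt = re.compile(r'(\+\d\d )?(\d\d\d-\d\d\d-\d\d\d\d)')

with_code = phone_opt.search("My phone number is +91 905-298-9892")
without_code = phone_opt.search("The other number is 789-256-1234")

print(repr(with_code.group()))      # '+91 905-298-9892'
print(repr(without_code.group()))   # '789-256-1234'
print(without_code.groups())        # (None, '789-256-1234')
```

An optional group that did not take part in the match is reported as `None` by `groups()`.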


In [9]:
# The * character
# The group that precedes the star may occur any number of times in the text.
# * - zero or more occurrences

reString = re.compile(r'[Cc]hair(wo)*man')

string2 = "SBI's Chairwoman is Mrs Arundhati Bhattacharya"
star1 = reString.search(string2)
print(star1.group())

string3 = "Vice President is ex-officio chairman of Rajya Sabha"
star2 = reString.search(string3)
print(star2.group())

string4 = "Chairwowowowowowoman"
star3 = reString.search(string4)
print(star3.group())


Chairwoman
chairman
Chairwowowowowowoman


In [10]:
# The + character
# The group that precedes the plus must occur at least once in the text.
# + - one or more occurrences, otherwise no match

reString = re.compile(r'[Cc]hair(wo)+man')

string2 = "SBI's Chairwoman is Mrs Arundhati Bhattacharya"
plus1 = reString.search(string2)
print(plus1.group())

string3 = "Vice President is ex-officio chairman of Rajya Sabha"
plus2 = reString.search(string3)
print(plus2 is None)   # 'chairman' has no 'wo', so there is no match

string4 = "Chairwowowowowowoman"
plus3 = reString.search(string4)
print(plus3.group())

Chairwoman
True
Chairwowowowowowoman


In [11]:
# Curly braces specify repetition counts:
# {n}   -> exactly n times
# {n,}  -> at least n times
# {,m}  -> up to m times
# {n,m} -> at least n and at most m times

pat1 = re.compile(r'\d{3}')
num1 = "666"
numMatch1 = pat1.match(num1)
print(numMatch1.group())

pat2 = re.compile(r'\d{3}')
num2 = "42"
numMatch2 = pat2.match(num2)
print(numMatch2 is None)   # only two digits, so there is no match

# A trailing ? makes the quantifier lazy: it stops at the minimum count (10 digits here)
pat3 = re.compile(r'\d{10,14}?')
num3 = "01242344325"
numMatch3 = pat3.match(num3)
print(numMatch3.group())

666
True
0124234432
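
The last result above contains only 10 of the 11 digits because `{10,14}?` is a lazy quantifier. The sketch below contrasts it with the default greedy form on the same input; the variable names are illustrative.

```python
import re

num = "01242344325"                              # 11 digits

greedy = re.match(r'\d{10,14}', num).group()     # takes as many digits as it can
lazy = re.match(r'\d{10,14}?', num).group()      # stops at the minimum of 10

print(greedy)   # 01242344325
print(lazy)     # 0124234432

# {,m} has an implicit lower bound of zero
up_to_three = re.match(r'\d{,3}', "12345").group()
print(up_to_three)   # 123
```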


In [12]:
# search() returns only the first match;
# findall() returns the strings of every match in the searched string.

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
PhoneInfo = "Cell: 783-824-5336 Work: 987-555-1111"

mo = phoneNumRegex.search(PhoneInfo)
print(mo.group())

print(phoneNumRegex.findall(PhoneInfo))

783-824-5336
['783-824-5336', '987-555-1111']
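
When the pattern contains groups, `findall()` returns tuples of the groups rather than whole matches, and `finditer()` yields match objects when positions are needed. A short sketch follows; the grouping into area code and local number is an assumption made for illustration.

```python
import re

phone_info = "Cell: 783-824-5336 Work: 987-555-1111"
grouped = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')

tuples = grouped.findall(phone_info)
print(tuples)   # [('783', '824-5336'), ('987', '555-1111')]

# finditer() gives access to each match object, for example its start index
starts = [m.start() for m in grouped.finditer(phone_info)]
print(starts)   # [6, 25]
```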


## Use Case

In [35]:
import pandas as pd

In [36]:
# Create a dataframe with a single column of strings
data = {'raw': ['Arizona 1 2014-12-23       3242.0',
                'Iowa 1 2010-02-23       3453.7',
                'Oregon 0 2014-06-20       2123.0',
                'Maryland 0 2014-03-14       1123.6',
                'Florida 1 2013-01-15       2134.0',
                'Georgia 0 2012-07-14       2345.6']}
df = pd.DataFrame(data, columns = ['raw'])
df

Unnamed: 0,raw
0,Arizona 1 2014-12-23 3242.0
1,Iowa 1 2010-02-23 3453.7
2,Oregon 0 2014-06-20 2123.0
3,Maryland 0 2014-03-14 1123.6
4,Florida 1 2013-01-15 2134.0
5,Georgia 0 2012-07-14 2345.6


### Objective: split the raw field above into multiple columns

In [37]:
# '.' matches any single character, so even this loose pattern matches every row
df['raw'].str.contains('......-..-...',regex=True)

0    True
1    True
2    True
3    True
4    True
5    True
Name: raw, dtype: bool

In [38]:
# Which rows of df['raw'] contain 'xxxx-xx-xx'?
df['raw'].str.contains('....-..-..', regex=True)

0    True
1    True
2    True
3    True
4    True
5    True
Name: raw, dtype: bool
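
Because `.` matches any character, the pattern `....-..-..` also accepts text that is not a date. A stricter alternative, sketched here with the standard `re` module, uses `\d` with explicit counts. The second sample string is made up to show the difference.

```python
import re

loose = re.compile(r'....-..-..')
strict = re.compile(r'\d{4}-\d{2}-\d{2}')   # four digits, two digits, two digits

samples = ['Arizona 1 2014-12-23       3242.0', 'well-to-do people']

loose_hits = [bool(loose.search(text)) for text in samples]
strict_hits = [bool(strict.search(text)) for text in samples]

print(loose_hits)    # [True, True]
print(strict_hits)   # [True, False]
print(strict.search(samples[0]).group())   # 2014-12-23
```

The same stricter pattern string can be passed to `str.contains` or `str.extract` in place of the dotted one.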

In [39]:
# In the column 'raw', extract xxxx-xx-xx in the strings
df['date'] = df['raw'].str.extract('(....-..-..)', expand=True)
df

Unnamed: 0,raw,date
0,Arizona 1 2014-12-23 3242.0,2014-12-23
1,Iowa 1 2010-02-23 3453.7,2010-02-23
2,Oregon 0 2014-06-20 2123.0,2014-06-20
3,Maryland 0 2014-03-14 1123.6,2014-03-14
4,Florida 1 2013-01-15 2134.0,2013-01-15
5,Georgia 0 2012-07-14 2345.6,2012-07-14


In [40]:
# In the column 'raw', extract the first single digit in each string
df['female'] = df['raw'].str.extract(r'(\d)', expand=True)
df

Unnamed: 0,raw,date,female
0,Arizona 1 2014-12-23 3242.0,2014-12-23,1
1,Iowa 1 2010-02-23 3453.7,2010-02-23,1
2,Oregon 0 2014-06-20 2123.0,2014-06-20,0
3,Maryland 0 2014-03-14 1123.6,2014-03-14,0
4,Florida 1 2013-01-15 2134.0,2013-01-15,1
5,Georgia 0 2012-07-14 2345.6,2012-07-14,0


In [41]:
# In the column 'raw', extract ####.# (four digits, a period, one digit) from each string
df['score'] = df['raw'].str.extract(r'(\d\d\d\d\.\d)', expand=True)
df

Unnamed: 0,raw,date,female,score
0,Arizona 1 2014-12-23 3242.0,2014-12-23,1,3242.0
1,Iowa 1 2010-02-23 3453.7,2010-02-23,1,3453.7
2,Oregon 0 2014-06-20 2123.0,2014-06-20,0,2123.0
3,Maryland 0 2014-03-14 1123.6,2014-03-14,0,1123.6
4,Florida 1 2013-01-15 2134.0,2013-01-15,1,2134.0
5,Georgia 0 2012-07-14 2345.6,2012-07-14,0,2345.6


In [42]:
# In the column 'raw', extract the first capitalized word in each string
df['state'] = df['raw'].str.extract(r'([A-Z]\w*)', expand=True)
df

Unnamed: 0,raw,date,female,score,state
0,Arizona 1 2014-12-23 3242.0,2014-12-23,1,3242.0,Arizona
1,Iowa 1 2010-02-23 3453.7,2010-02-23,1,3453.7,Iowa
2,Oregon 0 2014-06-20 2123.0,2014-06-20,0,2123.0,Oregon
3,Maryland 0 2014-03-14 1123.6,2014-03-14,0,1123.6,Maryland
4,Florida 1 2013-01-15 2134.0,2013-01-15,1,2134.0,Florida
5,Georgia 0 2012-07-14 2345.6,2012-07-14,0,2345.6,Georgia


In [43]:
df

Unnamed: 0,raw,date,female,score,state
0,Arizona 1 2014-12-23 3242.0,2014-12-23,1,3242.0,Arizona
1,Iowa 1 2010-02-23 3453.7,2010-02-23,1,3453.7,Iowa
2,Oregon 0 2014-06-20 2123.0,2014-06-20,0,2123.0,Oregon
3,Maryland 0 2014-03-14 1123.6,2014-03-14,0,1123.6,Maryland
4,Florida 1 2013-01-15 2134.0,2013-01-15,1,2134.0,Florida
5,Georgia 0 2012-07-14 2345.6,2012-07-14,0,2345.6,Georgia
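
The four extractions above can also be combined into one pattern with named groups. The sketch below uses only the standard `re` module on two of the raw strings; the name `row_pattern` and the type conversions are illustrative choices, not part of the original workflow.

```python
import re

raw_rows = ['Arizona 1 2014-12-23       3242.0',
            'Iowa 1 2010-02-23       3453.7']

# One named group per target column, separated by runs of whitespace
row_pattern = re.compile(
    r'(?P<state>[A-Z]\w*)\s+'
    r'(?P<female>\d)\s+'
    r'(?P<date>\d{4}-\d{2}-\d{2})\s+'
    r'(?P<score>\d+\.\d+)'
)

records = []
for row in raw_rows:
    m = row_pattern.match(row)
    if m:                                   # skip rows that do not fit the layout
        rec = m.groupdict()
        rec['female'] = int(rec['female'])  # extracted values are strings
        rec['score'] = float(rec['score'])
        records.append(rec)

print(records[0]['state'], records[0]['date'], records[0]['score'])
```

A list of dictionaries like `records` can be passed to `pd.DataFrame`, and `Series.str.extract` accepts the same kind of named-group pattern, using the group names as column names.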
