# Pandas String Operations

A string is a sequence of characters. You can access the characters one at a time with the bracket operator:

In [1]:
town = 'douala'


In [2]:
town[1]

'o'

In Python, the index is an offset from the beginning of the string, and the offset of the first letter is zero:

In [3]:
town[0]

'd'

A slice such as town[0:2] returns the characters from offset 0 up to, but not including, offset 2:

In [4]:
town[0:2]

'do'

What about accessing the last letter? Since 'douala' has six characters, the last one sits at offset 5:

In [5]:
town[5]

'a'

A more convenient way to access the last character, without knowing the length of the string, is to pass -1 as the index:

In [7]:
town[-1]

'a'

To get the whole string except the last character, slice up to -1:

In [10]:
town[:-1]

'doual'
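
The indexing and slicing rules above can be summarised in a short sketch. It reuses the town string from the examples; the variable names are only illustrative.

```python
town = 'douala'

# Positive indices count from the start, beginning at 0.
first = town[0]                    # 'd'

# Negative indices count from the end: -1 is the last character.
last = town[-1]                    # 'a'

# town[-1] is shorthand for town[len(town) - 1].
same_last = town[len(town) - 1]

# A slice start:stop includes start and excludes stop.
head = town[0:2]                   # 'do'
all_but_last = town[:-1]           # 'doual'

print(first, last, same_last, head, all_but_last)
```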

* Write a function that takes a string as input and returns a new string with the same letters in reverse order.

* Write a function that takes a string and returns the longest word in it. You may assume that the string contains only letters and spaces.

* Write a function that takes a string and returns the number of vowels in it. You may assume that all the letters are lowercase.

* Write a function that takes a string and returns True if it is a palindrome. A palindrome is a string that reads the same backward and forward. Assume that there are no spaces and that only lowercase letters will be given.

In [12]:
dir(town)  # lists every attribute and method available on a string

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
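
Most of the names at the top of that list are special "dunder" methods that are rarely called directly. As a small sketch, the list can be filtered down to the ordinary string methods; the name public_methods is just an illustrative choice.

```python
town = 'douala'

# Keep only the names that do not start with an underscore,
# i.e. the ordinary string methods such as 'upper' or 'strip'.
public_methods = [name for name in dir(town) if not name.startswith('_')]

print(public_methods)
```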

Here are the few we are going to use in this section:

**islower, isupper, title, lower, upper, replace, rsplit, strip, startswith**

In [14]:
town.islower()

True

In [15]:
town.isupper()

False

In [17]:
town.title()

'Douala'

In [20]:
town.upper()

'DOUALA'

In [24]:
v1= '   Here we go    '

In [25]:
# strip() removes all whitespace at the beginning and at the end of the string, however much there is
v1.strip()

'Here we go'

In [36]:
v2 = 'Hello World'

In [39]:
v2.rsplit()  # the default separator is whitespace; a different one can be passed as an argument

['Hello', 'World']
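
As a brief illustration of passing a separator, the sketch below also shows where rsplit differs from split: when the number of splits is limited, rsplit works from the right. The string title is a made-up example value.

```python
v2 = 'Hello World'
title = 'the_lonely_hour'   # hypothetical example string

# With no arguments, rsplit() splits on whitespace.
words = v2.rsplit()

# A custom separator can be passed as the first argument.
parts = title.rsplit('_')

# The second argument limits the number of splits.
# rsplit() starts from the right, split() starts from the left.
from_right = title.rsplit('_', 1)
from_left = title.split('_', 1)

print(words, parts, from_right, from_left)
```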

In [43]:
v2.startswith('H')

True

In [49]:
v2.replace('e', 'y')

'Hyllo World'

The good thing is that the same string operations are available on pandas Series and on DataFrame columns.

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods skip missing/NA values automatically and leave them as missing in the result. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods:



In [50]:
import numpy as np
import pandas as pd

In [51]:
s = pd.Series(['cat' , 'dog' , 'mouse' , 'rabbit', 'lion'])

In [52]:
s.str.lower()

0       cat
1       dog
2     mouse
3    rabbit
4      lion
dtype: object

In [53]:
s.str.upper()

0       CAT
1       DOG
2     MOUSE
3    RABBIT
4      LION
dtype: object

In [54]:
s.str.title()

0       Cat
1       Dog
2     Mouse
3    Rabbit
4      Lion
dtype: object

In [55]:
s.str.len()

0    3
1    3
2    5
3    6
4    4
dtype: int64

In [56]:
t = pd.Series([' guitar' , '   drum ', 'piano ', np.nan , "violon       "])

In [61]:
t.str.strip()  # strips whitespace from each element; the NaN entry is left as it is

0    guitar
1      drum
2     piano
3       NaN
4    violon
dtype: object
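
Because each .str call returns a new Series, the calls can be chained. The following is a minimal sketch, using the same t Series, of stripping and then measuring the length of each element; the name lengths is illustrative.

```python
import numpy as np
import pandas as pd

t = pd.Series([' guitar', '   drum ', 'piano ', np.nan, 'violon       '])

# Strip the surrounding whitespace, then count the remaining characters.
lengths = t.str.strip().str.len()

# The missing entry does not raise an error; it stays missing in the result.
print(lengths)
print(lengths.isna().tolist())
```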

In [62]:
s1 = pd.Series(['the_lonely_hour' , '1000_of_fears' , 'some_people_have_problem', 'The_Thrill_of_it_all'])

In [63]:
s1

0             the_lonely_hour
1               1000_of_fears
2    some_people_have_problem
3        The_Thrill_of_it_all
dtype: object

In [69]:
s1.str.split('_').str.get(1)

0    lonely
1        of
2    people
3    Thrill
dtype: object

In [70]:
s1.str.split('_' , expand=True)

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | the | lonely | hour | | |
| 1 | 1000 | of | fears | | |
| 2 | some | people | have | problem | |
| 3 | The | Thrill | of | it | all |
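
Two variations on the split calls above may be useful. This is a sketch rather than an exhaustive tour: the n parameter of Series.str.split caps the number of splits, and .str.get returns a missing value for rows that have no element at the requested position. The names first_rest and fifth are illustrative.

```python
import pandas as pd

s1 = pd.Series(['the_lonely_hour', '1000_of_fears',
                'some_people_have_problem', 'The_Thrill_of_it_all'])

# n=1 splits only at the first underscore, giving exactly two columns.
first_rest = s1.str.split('_', n=1, expand=True)
print(first_rest)

# Only the last title has a fifth word; the other rows come back as missing.
fifth = s1.str.split('_').str.get(4)
print(fifth)
```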


In [76]:
text = ['I can cry a river for you' , 'I will love you' , 'life is hell without you', "don't leave me, please"]
response = ['yes' , 'no', 'yes', 'yes']

In [77]:
text_Series = pd.Series(text)

In [103]:
df1 = text_Series.str.split(' ', expand=True)
df1

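|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | I | can | cry | a | river | for | you |
| 1 | I | will | love | you | | | |
| 2 | life | is | hell | without | you | | |
| 3 | don't | leave | me, | please | | | |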


In [104]:
df1.columns = [f"term {i}" for i in range(1, 8)]  # rename the 7 columns to 'term 1' ... 'term 7'

In [105]:
df1['response'] = pd.Series(response)

In [106]:
df1

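|   | term 1 | term 2 | term 3 | term 4 | term 5 | term 6 | term 7 | response |
|---|---|---|---|---|---|---|---|---|
| 0 | I | can | cry | a | river | for | you | yes |
| 1 | I | will | love | you | | | | no |
| 2 | life | is | hell | without | you | | | yes |
| 3 | don't | leave | me, | please | | | | yes |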


In [108]:
df1.index = [f"doc {i}" for i in range(1, len(df1) + 1)]  # label the rows 'doc 1' ... 'doc 4'

In [109]:
df1

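|   | term 1 | term 2 | term 3 | term 4 | term 5 | term 6 | term 7 | response |
|---|---|---|---|---|---|---|---|---|
| doc 1 | I | can | cry | a | river | for | you | yes |
| doc 2 | I | will | love | you | | | | no |
| doc 3 | life | is | hell | without | you | | | yes |
| doc 4 | don't | leave | me, | please | | | | yes |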


This layout, with one row per document and its terms as columns, is a simple form of document-term matrix.
Such matrices are commonly used to build spam filters, or to predict whether a message calls for a "yes" or a "no" response.
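
More commonly, a document-term matrix has one column per distinct term rather than one per position. Here is a minimal sketch of that layout, assuming simple whitespace tokenisation and presence/absence indicators instead of counts. It uses Series.str.get_dummies; the name dtm is illustrative.

```python
import pandas as pd

text = ['I can cry a river for you', 'I will love you',
        'life is hell without you', "don't leave me, please"]
text_series = pd.Series(text)

# One column per distinct token: 1 if the token occurs in the document, else 0.
dtm = text_series.str.get_dummies(sep=' ')

# Label the rows the same way as in the table above.
dtm.index = [f"doc {i}" for i in range(1, len(dtm) + 1)]

print(dtm)
```

Note that punctuation is not handled here, so "me," is kept as its own token.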

In [120]:
df1['term 2'].str.replace(r'^\w+$', 'oups', regex=True)  # replace every value made up only of word characters
# *  = 0 or more times
# +  = 1 or more times
# ^  = start of the string, $ = end of the string
# \w = word character, \d = digit character

doc 1    oups
doc 2    oups
doc 3    oups
doc 4    oups
Name: term 2, dtype: object
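
A few more pattern-based calls, as a sketch on the s1 Series of titles used earlier. The regex argument is passed explicitly because its default in Series.str.replace changed from True to False in pandas 2.0; the variable names are illustrative.

```python
import pandas as pd

s1 = pd.Series(['the_lonely_hour', '1000_of_fears',
                'some_people_have_problem', 'The_Thrill_of_it_all'])

# ^\d+ matches one or more digits at the start of the string.
no_digits = s1.str.replace(r'^\d+', 'NUM', regex=True)

# With regex=False the pattern is treated as plain text.
spaced = s1.str.replace('_', ' ', regex=False)

# str.contains returns a boolean per element; the match is case-sensitive.
starts_with_the = s1.str.contains(r'^the', regex=True)

print(no_digits)
print(spaced)
print(starts_with_the)
```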

In [115]:
df1

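|   | term 1 | term 2 | term 3 | term 4 | term 5 | term 6 | term 7 | response |
|---|---|---|---|---|---|---|---|---|
| doc 1 | I | can | cry | a | river | for | you | yes |
| doc 2 | I | will | love | you | | | | no |
| doc 3 | life | is | hell | without | you | | | yes |
| doc 4 | don't | leave | me, | please | | | | yes |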
