# **Module 1: Working with Text in Python**

## Section I: Handling Text in Python

#### **1. Primitive Constructs in Text**

$\qquad$(1) Sentences/Input strings  
$\qquad$(2) Words or tokens  
$\qquad$(3) Characters  
$\qquad$(4) Documents and large files  

...And their properties

#### **2. Basic Operations on Text**  

$\qquad$(1) Length of the text:  
$\qquad\qquad$ a. Character: *len(text)*

In [2]:
#Python practice cell

text1 = 'Ethics are built right into the ideals and objectives of the United Nations '
len(text1)

76

$\qquad\qquad$ b. Words: *split* the text, then call *len* on the resulting list


In [4]:
text2 = text1.split(' ')
print(text2)
len(text2)

['Ethics', 'are', 'built', 'right', 'into', 'the', 'ideals', 'and', 'objectives', 'of', 'the', 'United', 'Nations', '']


14
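
The count above is 14 rather than 13 because `text1` ends with a space, so `split(' ')` produces a trailing empty string. A minimal sketch of one way around this, assuming an empty token should not count as a word: calling `split()` with no argument splits on any run of whitespace and drops empty strings. The variable names below are illustrative only.

```python
text = 'Ethics are built right into the ideals and objectives of the United Nations '

# Splitting on a single space keeps the empty string after the trailing space.
words_by_space = text.split(' ')

# Splitting with no argument treats any run of whitespace as one separator
# and never returns empty strings.
words_default = text.split()

print(len(words_by_space))  # 14
print(len(words_default))   # 13
```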

$\qquad$ (2) Finding specific words:  
$\qquad\qquad$ a. Long words (e.g. more than 3 letters): use a list comprehension with an *if* filter

In [9]:
text3 = [w for w in text2 if len(w) > 3] 
# 'len' works on a list (returns the number of elements) and on a string (returns the number of characters)
print(text3)
len(text3)

['Ethics', 'built', 'right', 'into', 'ideals', 'objectives', 'United', 'Nations']


8

$\qquad\qquad$ b. Capitalized words (words that begin with a capital letter followed by lowercase letters): use *istitle*

In [6]:
text4 = [w for w in text2 if w.istitle()]
print(text4)
len(text4)

['Ethics', 'United', 'Nations']


3
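
Note that `istitle` is stricter than "starts with a capital letter": it also requires the remaining cased letters to be lowercase. The sketch below contrasts it with a plain first-letter check; the word list is made up for illustration.

```python
words = ['Ethics', 'UN', 'McDonald', 'nations', '']

# istitle() is True only for title-cased strings such as 'Ethics'.
title_words = [w for w in words if w.istitle()]

# w[:1] is the first character (or '' for an empty string), so this
# keeps every word that merely starts with an uppercase letter.
capitalized_words = [w for w in words if w[:1].isupper()]

print(title_words)        # ['Ethics']
print(capitalized_words)  # ['Ethics', 'UN', 'McDonald']
```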

$\qquad\qquad$ c. Words that end with 's': use *endswith*

In [8]:
text5 = [w for w in text2 if w.endswith('s')]
print(text5)
len(text5)

['Ethics', 'ideals', 'objectives', 'Nations']


4

$\qquad\qquad$ d. Finding unique words (**words that are repeated count only once**): use *set*

In [10]:
text6 = 'To be or not to be'
text7 = text6.split(' ')

len(text7)

6

In [13]:
print(set(text7))
len(set(text7))
# returned 5 because 'to' and 'To' are treated as different words

{'to', 'not', 'or', 'be', 'To'}


5

In [15]:
# fix: convert every word to lowercase before building the set

text8 = set([w.lower() for w in text7])
print(text8)
len(text8)

{'to', 'or', 'be', 'not'}


4

$\qquad\qquad$ e. Other functions

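$\qquad\qquad\qquad$ - *s.startswith(t)*: whether *s* begins with *t*  
$\qquad\qquad\qquad$ - *t in s*: whether *t* is a substring of *s*  
$\qquad\qquad\qquad$ - *isupper*, *islower*, *istitle*: check the capitalization of *s*  
$\qquad\qquad\qquad$ - *isalpha*, *isdigit*, *isalnum*: whether *s* contains only letters, only digits, or only letters and digits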

In [22]:
text9 = 'There are 9000 people living within a 100 km2 land in this town of Canada. SO SPARSE!'
text10 = text9.split(' ')
print(text10)
len(text10)

['There', 'are', '9000', 'people', 'living', 'within', 'a', '100', 'km2', 'land', 'in', 'this', 'town', 'of', 'Canada.', 'SO', 'SPARSE!']


17

In [23]:
text11 = [w for w in text10 if w.istitle()]
print(text11, len(text11))

text12 = [w for w in text10 if w.isupper()]
print(text12, len(text12))

text13 = [w for w in text10 if w.isdigit()]
print(text13, len(text13))

text14 = [w for w in text10 if 'e' in w]
print(text14, len(text14))

text15 = [w for w in text10 if w.isalpha()]
print(text15, len(text15))

text16 = [w for w in text10 if w.isalnum()]
print(text16, len(text16))

['There', 'Canada.'] 2
['SO', 'SPARSE!'] 2
['9000', '100'] 2
['There', 'are', 'people'] 3
['There', 'are', 'people', 'living', 'within', 'a', 'land', 'in', 'this', 'town', 'of', 'SO'] 12
['There', 'are', '9000', 'people', 'living', 'within', 'a', '100', 'km2', 'land', 'in', 'this', 'town', 'of', 'SO'] 15
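
In the output above, `'Canada.'` and `'SPARSE!'` fail `isalpha` and `isalnum` because of the attached punctuation. One possible workaround, sketched here under the assumption that leading and trailing punctuation should be discarded, is to strip each token with `string.punctuation` before testing it.

```python
import string

text = 'There are 9000 people living within a 100 km2 land in this town of Canada. SO SPARSE!'

# Remove punctuation characters from both ends of every token.
tokens = [w.strip(string.punctuation) for w in text.split()]

# 'Canada' and 'SPARSE' now count as alphabetic words.
alpha_words = [w for w in tokens if w.isalpha()]

print(alpha_words, len(alpha_words))
```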


$\qquad$ (3) String Operations:  
$\qquad\qquad$ a. Change the capitalization of words: *lower*, *upper*, *title*


In [25]:
text17 = 'The quick brown fox Jumps Over the lazy dog.'
print('Original sentence: ', text17)

text18 = text17.lower()
print('All lower case: ', text18)

text19 = text17.upper()
print('All upper case: ', text19)

text20 = text17.title()
print('Title format: ', text20)

Original sentence:  The quick brown fox Jumps Over the lazy dog.
All lower case:  the quick brown fox jumps over the lazy dog.
All upper case:  THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.
Title format:  The Quick Brown Fox Jumps Over The Lazy Dog.


$\qquad\qquad$ b. Split a sentence on a separator *t*: *split(t)*

In [26]:
text21 = text17.split('o')
print("Split the sentence with letter 'o': ", text21)

Split the sentence with letter 'o':  ['The quick br', 'wn f', 'x Jumps Over the lazy d', 'g.']


$\qquad\qquad$ c. Split a text into lines: *splitlines*

In [28]:
text22 = 'I love you.\nYou love me.\nWe are sweethearts.'
print('Original sentence is: ', text22, len(text22))

text23 = text22.splitlines()
print("Split the sentence into lines", text23,len(text23))

Original sentence is:  I love you.
You love me.
We are sweethearts. 44
Split the sentence into lines ['I love you.', 'You love me.', 'We are sweethearts.'] 3


$\qquad\qquad$ d. Join a list of strings with separator *t*: *t.join(list)* **(called on the separator, applied to a list)**

In [33]:
text24 = 'oo'.join(text21)
print('Rejoined sentence: ', text24, len(text24))

Rejoined sentence:  The quick broown foox Jumps Over the lazy doog. 47
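
Since `join` is the inverse of `split`, joining with the same separator that was used for splitting restores the original string. The short sketch below checks this round trip and also shows a common idiom for collapsing repeated spaces; the second example string is made up.

```python
sentence = 'The quick brown fox Jumps Over the lazy dog.'

# Splitting on 'o' and joining with 'o' gives back the original sentence.
parts = sentence.split('o')
restored = 'o'.join(parts)
print(restored == sentence)  # True

# split() with no argument plus ' '.join() normalizes runs of whitespace.
normalized = ' '.join('  too   many    spaces  '.split())
print(normalized)  # too many spaces
```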


$\qquad\qquad$ e. Cleaning operations: remove leading and trailing whitespace with *strip*, or trailing whitespace only with *rstrip*

In [57]:
text25 = "        If you won't leave me, I will be with you till the end of the time.       "
print(text25, len(text25))

text25_1 = text25.split(' ')
print(text25_1, len(text25_1))

text26 = text25.strip()
print(text26, len(text26))

text26_1 = text26.split(' ')
print(text26_1, len(text26_1))


text27 = text25.rstrip()
print(text27, len(text27))

text27_1 = text27.split(' ')
print(text27_1, len(text27_1))

text28 = text25.strip().rstrip()  # rstrip() is redundant here: strip() already removed the trailing whitespace
print(text28, len(text28))

text28_1 = text28.split(' ')
print(text28_1, len(text28_1))


        If you won't leave me, I will be with you till the end of the time.        82
['', '', '', '', '', '', '', '', 'If', 'you', "won't", 'leave', 'me,', 'I', 'will', 'be', 'with', 'you', 'till', 'the', 'end', 'of', 'the', 'time.', '', '', '', '', '', '', ''] 31
If you won't leave me, I will be with you till the end of the time. 67
['If', 'you', "won't", 'leave', 'me,', 'I', 'will', 'be', 'with', 'you', 'till', 'the', 'end', 'of', 'the', 'time.'] 16
        If you won't leave me, I will be with you till the end of the time. 75
['', '', '', '', '', '', '', '', 'If', 'you', "won't", 'leave', 'me,', 'I', 'will', 'be', 'with', 'you', 'till', 'the', 'end', 'of', 'the', 'time.'] 24
If you won't leave me, I will be with you till the end of the time. 67
['If', 'you', "won't", 'leave', 'me,', 'I', 'will', 'be', 'with', 'you', 'till', 'the', 'end', 'of', 'the', 'time.'] 16
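
For completeness, the sketch below shows the three stripping methods side by side, including `lstrip` (leading whitespace only), and the optional argument that strips a chosen set of characters instead of whitespace. The sample strings are made up; `repr` is used so the remaining spaces are visible.

```python
padded = '   hello world   '

left_stripped = padded.lstrip()   # removes leading whitespace only
right_stripped = padded.rstrip()  # removes trailing whitespace only
both_stripped = padded.strip()    # removes both

print(repr(left_stripped))   # 'hello world   '
print(repr(right_stripped))  # '   hello world'
print(repr(both_stripped))   # 'hello world'

# With an argument, strip removes any of the given characters from both ends.
title = '--title--'.strip('-')
print(title)  # title
```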


$\qquad\qquad$ f. Find the position of substring *t*: *find* returns the index of the first occurrence, *rfind* the index of the last

In [48]:
pos1 = text28.find('you')
print(pos1)

pos2 = text28.rfind('you')
print(pos2)

3
38
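
Two details are worth knowing: `find` returns `-1` when the substring is absent, and it accepts an optional start index, which makes it possible to collect every occurrence. A minimal sketch, using the same sentence as above:

```python
sentence = "If you won't leave me, I will be with you till the end of the time."

# find returns -1 instead of raising an error when nothing matches.
missing = sentence.find('they')
print(missing)  # -1

# Collect the start index of every occurrence by searching repeatedly,
# each time starting just after the previous match.
positions = []
start = sentence.find('you')
while start != -1:
    positions.append(start)
    start = sentence.find('you', start + 1)

print(positions)  # [3, 38]
```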


$\qquad\qquad$ g. Replace every occurrence of *u* with *v*: *replace(u, v)*

In [49]:
text29 = text18.replace('o','oo')
print(text18, '\n', text29)

the quick brown fox jumps over the lazy dog. 
 the quick broown foox jumps oover the lazy doog.


$\qquad\qquad$ h. Convert a word into a list of its characters

In [55]:
text30 = 'ouagadougou'
text31 = list(text30)
text32 = [c for c in text30]

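# keep each character only once, in order of first appearance
text33 = []
for c in text30:
    if c not in text33:
        text33.append(c)

# set() also removes duplicates, but the resulting order is arbitrary
text34 = list(set(list(text30)))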

print(text30, '\n', text31, '\n', text32, '\n', text33, '\n', text34)

ouagadougou 
 ['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u'] 
 ['o', 'u', 'a', 'g', 'a', 'd', 'o', 'u', 'g', 'o', 'u'] 
 ['o', 'u', 'a', 'g', 'd'] 
 ['d', 'o', 'g', 'u', 'a']
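
The loop above preserves the order of first appearance, while the `set` version does not. A compact alternative, sketched here on the assumption that Python 3.7 or later is used (where dictionaries keep insertion order), is `dict.fromkeys`:

```python
word = 'ouagadougou'

# dict.fromkeys keeps one key per distinct character, in first-seen order.
unique_chars = list(dict.fromkeys(word))

print(unique_chars)  # ['o', 'u', 'a', 'g', 'd']
```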


#### **3. Handling Larger Texts**

$\qquad$(1) Read files

$\qquad\qquad$ a. Open the file in read mode: *'r'*  
$\qquad\qquad$ b. Read the next line from the current position: *readline*

In [96]:
f = open('UNDHR.txt','r') # r stands for read mode
f.readline()

'Universal Declaration of Human Rights\n'

$\qquad\qquad$ c. Read the whole passage: *read*

In [97]:
f.read()

'Preamble\nWhereas recognition of the inherent dignity and of the equal and inalienable\nrights of all members of the human family is the foundation of freedom, justice\nand peace in the world,\nWhereas disregard and contempt for human rights have resulted in barbarous\nacts which have outraged the conscience of mankind, and the advent of a world\nin which human beings shall enjoy freedom of speech and belief and freedom\nfrom fear and want has been proclaimed as the highest aspiration of the common\npeople,\nWhereas it is essential, if man is not to be compelled to have recourse, as a last\nresort, to rebellion against tyranny and oppression, that human rights should be\nprotected by the rule of law,\nWhereas it is essential to promote the development of friendly relations between\nnations,\nWhereas the peoples of the United Nations have in the Charter reaffirmed their\nfaith in fundamental human rights, in the dignity and worth of the human person\nand in the equal rights of men and 

$\qquad\qquad$ d. Move the read position to offset *x*: *seek(x)*  
$\qquad\qquad$ e. Read the next *n* characters from the current position: *read(n)*

In [98]:
f.seek(50)
f.read(10)

'hereas rec'
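
The read position is shared by `readline`, `read`, and `seek`: each read continues where the previous one stopped. The sketch below illustrates this on a small temporary file rather than on `UNDHR.txt`; the file name `sample.txt` and its contents are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.txt')

    # Create a two-line file to read from.
    with open(path, 'w', encoding='utf-8', newline='\n') as out:
        out.write('First line\nSecond line\n')

    with open(path, 'r', encoding='utf-8') as f:
        first_line = f.readline()  # reads up to and including the first newline
        remainder = f.read()       # reads everything after the first line
        at_end = f.read()          # nothing left, so this is an empty string
        f.seek(0)                  # jump back to the start of the file
        head = f.read(5)           # reads the first five characters

print(repr(first_line))  # 'First line\n'
print(repr(remainder))   # 'Second line\n'
print(repr(at_end))      # ''
print(repr(head))        # 'First'
```

In text mode, `seek(0)` is always safe; other offsets are best taken from `f.tell()`, because the offset counts positions in the stored file, which can differ from character counts (for example with Windows line endings).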

$\qquad$(2) Write a message  

$\qquad\qquad$ a. Open a file in writing mode: *'w'*  
$\qquad\qquad$ b. Write the message: *write*  
$\qquad\qquad$ c. Close the file: *close*

In [103]:
g = open('test.txt','w') # 'w' opens the file in write mode, replacing any existing content
g.write('This is a test message.')
g.close()

In [104]:
g = open('test.txt','r')
g.read()

'This is a test message.'
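
The cells above call `close` by hand after writing and leave the file open after reading. A common alternative is the `with` statement, which closes the file automatically when the block ends, even if an error occurs. A minimal sketch, writing to a temporary directory so that no existing file is touched:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.txt')

    # The file is closed as soon as the with-block ends.
    with open(path, 'w', encoding='utf-8') as g:
        g.write('This is a test message.')

    with open(path, 'r', encoding='utf-8') as g:
        message = g.read()

print(message)   # This is a test message.
print(g.closed)  # True
```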

$\qquad$(3) Issues with reading text files

$\qquad\qquad$ Remove the trailing newline character: *rstrip*

In [106]:
f.seek(0)
text37 = f.readline()
text37

'Universal Declaration of Human Rights\n'

In [107]:
text38 = text37.rstrip()
text38

'Universal Declaration of Human Rights'
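
The same cleanup can be applied to every line at once by iterating over the file object. The sketch below uses `io.StringIO` as a stand-in for an open text file (an assumption made only to keep the example self-contained), and passes `'\n'` to `rstrip` so that only the newline is removed and other trailing spaces are kept.

```python
import io

# io.StringIO behaves like an open text file for reading purposes.
f = io.StringIO('Universal Declaration of Human Rights\nPreamble\n  indented line  \n')

# Iterating over a file yields one line at a time, newline included.
lines = [line.rstrip('\n') for line in f]

print(lines)
# ['Universal Declaration of Human Rights', 'Preamble', '  indented line  ']
```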

#### ***\* Take Home Concepts***

$\qquad$ - Handling text sentences *len*  
$\qquad$ - Splitting sentences into words, words into characters *split*, *list*, *list comprehension*  
$\qquad$ - Finding unique words *set*  
$\qquad$ - Handling text from documents *open*, *read*, *write*, *seek*, *close*