# Common String Operations
Here are some useful resources for Python string operations:
* https://www.learnpython.org/en/Basic_String_Operations
* https://www.w3schools.com/python/python_ref_string.asp

Here, we will read a text file containing a movie review and count how often each word appears.

In [41]:
# It is useful to set input/output folders as global variables
INPUT_FOLDER = 'C:/Users/tyifat/Workspace/python-for-dna/Student Projects/Sentiment Analyzer Project/'

In [42]:
import glob
# glob finds all the pathnames matching a specified pattern according to the rules used by the Unix shell
train_pos_files = glob.glob(INPUT_FOLDER + 'train/pos/*.txt') 
len(train_pos_files)

12500

In [45]:
train_pos_files[:5]

['C:/Users/tyifat/Workspace/python-for-dna/Student Projects/Sentiment Analyzer Project/train/pos\\0_9.txt',
 'C:/Users/tyifat/Workspace/python-for-dna/Student Projects/Sentiment Analyzer Project/train/pos\\10000_8.txt',
 'C:/Users/tyifat/Workspace/python-for-dna/Student Projects/Sentiment Analyzer Project/train/pos\\10001_10.txt',
 'C:/Users/tyifat/Workspace/python-for-dna/Student Projects/Sentiment Analyzer Project/train/pos\\10002_7.txt',
 'C:/Users/tyifat/Workspace/python-for-dna/Student Projects/Sentiment Analyzer Project/train/pos\\10003_8.txt']
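The paths above mix forward slashes and backslashes because the folder name was joined to the pattern by string concatenation. As a hedged alternative, the sketch below uses `pathlib` to build the path and find the `.txt` files. The temporary folder and the file names in it are placeholders made up for illustration, not the real dataset.

```python
import tempfile
from pathlib import Path

# Build a throwaway folder that mimics the train/pos layout (placeholder data).
with tempfile.TemporaryDirectory() as tmp:
    input_folder = Path(tmp)
    pos_folder = input_folder / 'train' / 'pos'   # the / operator joins path parts
    pos_folder.mkdir(parents=True)
    for name in ['0_9.txt', '1_7.txt', 'notes.md']:
        (pos_folder / name).write_text('placeholder review')

    # Path.glob applies the same shell-style pattern as glob.glob
    review_files = sorted(pos_folder.glob('*.txt'))
    review_names = [path.name for path in review_files]

print(review_names)
```

Only the two `.txt` files are picked up; `notes.md` does not match the pattern.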

In [51]:
# Opening a file using a context manager - this way the file is automatically closed.
with open(train_pos_files[2]) as file:
    text = file.read()
text

'Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love scenes in clothes warehouse are second to none. The corn on face is a classic, as good as anything in Blazing Saddles. The take on lawyers is also superb. After being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of Pepto Bolt shrugs indifferently "I\'m a lawyer" he says. Three funny words. Jeffrey Tambor, a favorite from the later Larry Sanders show, is fantastic here too as a mad millionaire who wants to crush the ghetto. His character is more malevolent than usual. The hospital scene, and the scene where the homeless invade a demolition site, are all-time classics. Look for the legs scene and the two big diggers fighting (one bleeds). This movie gets better each time I see it (which is quite often).'

In [52]:
# Converting to lower case
text = text.lower()
text

'brilliant over-acting by lesley ann warren. best dramatic hobo lady i have ever seen, and love scenes in clothes warehouse are second to none. the corn on face is a classic, as good as anything in blazing saddles. the take on lawyers is also superb. after being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of pepto bolt shrugs indifferently "i\'m a lawyer" he says. three funny words. jeffrey tambor, a favorite from the later larry sanders show, is fantastic here too as a mad millionaire who wants to crush the ghetto. his character is more malevolent than usual. the hospital scene, and the scene where the homeless invade a demolition site, are all-time classics. look for the legs scene and the two big diggers fighting (one bleeds). this movie gets better each time i see it (which is quite often).'

In [53]:
print(text)

brilliant over-acting by lesley ann warren. best dramatic hobo lady i have ever seen, and love scenes in clothes warehouse are second to none. the corn on face is a classic, as good as anything in blazing saddles. the take on lawyers is also superb. after being accused of being a turncoat, selling out his boss, and being dishonest the lawyer of pepto bolt shrugs indifferently "i'm a lawyer" he says. three funny words. jeffrey tambor, a favorite from the later larry sanders show, is fantastic here too as a mad millionaire who wants to crush the ghetto. his character is more malevolent than usual. the hospital scene, and the scene where the homeless invade a demolition site, are all-time classics. look for the legs scene and the two big diggers fighting (one bleeds). this movie gets better each time i see it (which is quite often).


In [55]:
# Removing special characters
text = text.replace('(', '')
text = text.replace(')', '')
text = text.replace(r'"', '')
text = text.replace(r'.', '')
text = text.replace(r',', '')
text

"brilliant over-acting by lesley ann warren best dramatic hobo lady i have ever seen and love scenes in clothes warehouse are second to none the corn on face is a classic as good as anything in blazing saddles the take on lawyers is also superb after being accused of being a turncoat selling out his boss and being dishonest the lawyer of pepto bolt shrugs indifferently i'm a lawyer he says three funny words jeffrey tambor a favorite from the later larry sanders show is fantastic here too as a mad millionaire who wants to crush the ghetto his character is more malevolent than usual the hospital scene and the scene where the homeless invade a demolition site are all-time classics look for the legs scene and the two big diggers fighting one bleeds this movie gets better each time i see it which is quite often"

In [56]:
# Splitting into words
word_by_word = text.split()
word_by_word[-5:]

['it', 'which', 'is', 'quite', 'often']
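The chain of `replace()` calls used earlier to strip punctuation can also be written as a single `str.translate()` call. This is a minimal sketch on a short sample string (adapted from the review, not the full text); the variable names are illustrative.

```python
# Characters to delete; hyphens and apostrophes are kept, as in the cells above.
chars_to_remove = '().,"'

# str.maketrans with a third argument builds a table that deletes those characters.
removal_table = str.maketrans('', '', chars_to_remove)

sample = 'Look (one bleeds). "I\'m a lawyer," he says.'
cleaned = sample.lower().translate(removal_table)
words = cleaned.split()

print(cleaned)
print(words[-2:])
```

Adding another character to remove then means editing one string instead of adding another line.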

In [57]:
# How many words?
len(word_by_word)

147

In [58]:
# How many unique words?
word_count = {word: 0 for word in set(word_by_word)}
print('Unique words: %d' % len(word_count))
word_count['the']

Unique words: 112


0

In [60]:
for key in word_count:
    word_count[key] = word_by_word.count(key)
word_count

{'selling': 1,
 'big': 1,
 'often': 1,
 'ghetto': 1,
 'turncoat': 1,
 'malevolent': 1,
 'clothes': 1,
 'more': 1,
 'larry': 1,
 'also': 1,
 'gets': 1,
 'in': 2,
 'face': 1,
 'who': 1,
 'are': 2,
 'take': 1,
 'ever': 1,
 'scene': 3,
 'shrugs': 1,
 'good': 1,
 'have': 1,
 'classic': 1,
 'lawyer': 2,
 'the': 10,
 'here': 1,
 'by': 1,
 'diggers': 1,
 'dramatic': 1,
 'two': 1,
 'show': 1,
 'hobo': 1,
 'better': 1,
 'warehouse': 1,
 'bolt': 1,
 'usual': 1,
 'later': 1,
 'dishonest': 1,
 'funny': 1,
 'classics': 1,
 'blazing': 1,
 'he': 1,
 'for': 1,
 'words': 1,
 'and': 4,
 'a': 6,
 'i': 2,
 'lady': 1,
 'being': 3,
 'corn': 1,
 'see': 1,
 'bleeds': 1,
 'hospital': 1,
 'three': 1,
 'best': 1,
 'lawyers': 1,
 'scenes': 1,
 'out': 1,
 'character': 1,
 'sanders': 1,
 'seen': 1,
 'his': 2,
 'love': 1,
 'where': 1,
 'which': 1,
 'quite': 1,
 'ann': 1,
 'pepto': 1,
 'brilliant': 1,
 'of': 2,
 'boss': 1,
 'crush': 1,
 'warren': 1,
 'tambor': 1,
 'mad': 1,
 'favorite': 1,
 'saddles': 1,
 'lesley': 1,
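The loop above calls `list.count()` once per unique word, so it scans the whole list each time. As a sketch of an alternative, `collections.Counter` from the standard library builds the same kind of dictionary in a single pass. The sentence below is a made-up sample, not the review text.

```python
from collections import Counter

# A short made-up sample; in the notebook this would be word_by_word.
words = 'the scene and the other scene are the best'.split()

# Counter counts every item in one pass and behaves like a dictionary.
word_count = Counter(words)

print(len(word_count))            # number of unique words
print(word_count['the'])          # count for a single word
print(word_count.most_common(2))  # the two most frequent words
```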

In [66]:
import pandas as pd

df_pos_words = pd.Series(word_count).to_frame(name='Word Count')
df_pos_words.sort_values('Word Count', ascending=False).head()

Unnamed: 0,Word Count
the,10
a,6
is,5
and,4
as,3


In [67]:
# Calculate each word's share of the total word count
df_pos_words['Word Share'] = df_pos_words['Word Count'] / df_pos_words['Word Count'].sum()
# Display the results in a user-friendly way
df_pos_words.sort_values('Word Share', ascending=False).head(10).style.format({
        'Word Count': '{:,d}'.format, 
        'Word Share': '{:,.1%}'.format})

Unnamed: 0,Word Count,Word Share
the,10,6.8%
a,6,4.1%
is,5,3.4%
and,4,2.7%
as,3,2.0%
being,3,2.0%
scene,3,2.0%
are,2,1.4%
of,2,1.4%
i,2,1.4%
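To see where the percentages come from, here is the same arithmetic in plain Python for the top three words, using the counts and the total of 147 words shown above. The `'.1%'` format multiplies by 100 and rounds to one decimal place.

```python
total_words = 147                          # len(word_by_word) from the earlier cell
counts = {'the': 10, 'a': 6, 'is': 5}      # top three counts from the table above

# share = count / total, formatted as a percentage with one decimal place
shares = {word: f'{count / total_words:.1%}' for word, count in counts.items()}

print(shares)
```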


# Regular Expressions (regex)
Regular expressions (called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language that is made available in Python through the `re` module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways. (Based on the [Python official documentation](https://docs.python.org/3/howto/regex.html).)
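As a quick illustration of those four uses, the sketch below applies one pattern to a short sample line. The line and the replacement text are made up for illustration; the functions are all from the standard `re` module.

```python
import re

pattern = r'\d{3}-\d{3}-\d{4}'   # three digits, three digits, four digits
line = 'Phone: 432-555-0189'

# "Does this string match the pattern?" - the whole string must match
whole = re.fullmatch(pattern, line)

# "Is there a match anywhere in this string?"
anywhere = re.search(pattern, line)

# Modify the string: replace every match with a mask
masked = re.sub(pattern, 'XXX-XXX-XXXX', line)

# Split the string apart on a colon followed by optional whitespace
parts = re.split(r':\s*', line)

print(whole, anywhere.group(0), masked, parts)
```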

Here are useful regex resources:
* [Learn regex](https://regexone.com/) - we're going to use this here to learn the regex language.
* [Regex with Python tutorial](https://docs.python.org/3/howto/regex.html)
* [Regex tester](https://regex101.com/)
* [Regex Golf](https://alf.nu/RegexGolf)

Let's try to extract some data from Excel's template for customer invoices:

![image.png](attachment:image.png)

I saved the sheet as a text file so we can easily process it.

In [7]:
with open('Customer Invoice.txt') as file:
    invoice_text = file.read()
invoice_text

'\tcompany name\t\t\t\t\t\t\n\tStreet Address\t\tP: Phone Number\t\tEmail\t\t\n\t"City, State ZIP Code"\t\tF: Fax Number\t\tWebsite\t\t\n\tBill To:\t"Contoso, Ltd"\tPhone: 432-555-0189\t\tInvoice #:\t3-456-2\t\n\tAddress:\t567 Walnut Lane\tFax:     432-555-0123\t\tInvoice Date:\t6/28/2020\t\n\t\t"Moline, MO 098765"\tEmail: someone@example.com\t\t\t\t\n\tInvoice For: Project 2\t\t\t\t\t\t\n\tItem #\tDescription\tQty\tUnit Price\tDiscount\tPrice\t\n\t Z4567\t Invoice 3-456-2 Data 1  \t39 \t $5.00 \t $-   \t $195.00 \t\n\t Z4568\t Invoice 3-456-2 Data 2  \t40 \t 4.00 \t 5.00 \t 155.00 \t\n\t Z4569\t Invoice 3-456-2 Data 3  \t30 \t 6.00 \t 7.00 \t 173.00 \t\n\t Z4570\t Invoice 3-456-2 Data 4  \t40 \t 7.00 \t -   \t 280.00 \t\n\t Z4571\t Invoice 3-456-2 Data 5  \t10 \t 4.00 \t -   \t 40.00 \t\n\t Z4572\t Invoice 3-456-2 Data 6  \t5 \t 8.00 \t -   \t 40.00 \t\n\t Z4573\t Invoice 3-456-2 Data 7  \t70 \t 6.00 \t -   \t 420.00 \t\n\t Z4574\t Invoice 3-456-2 Data 8  \t25 \t 4.00 \t -   \t 100.00

In [3]:
with open('Customer Invoice 2.txt') as file:
    invoice_text_2 = file.read()
print(invoice_text_2)

	company name						
	Street Address		P: Phone Number		Email		
	"City, State ZIP Code"		F: Fax Number		Website		
	Bill To:	"Bananas & Co."	Phone: (519) 479-0159		Invoice #:	3-456-3	
	Address:	909 Aviation Pkwy #100 	Fax:      			Invoice Date:	June 28th, 2020	
		Morrisville, NC 27560	Email: bartolo@bananas.ai				
	Invoice For: Project 2						
	Item #	Description			Qty	Unit Price	Discount	Price	
	 Z5678	 Invoice 3-456-3 Item-1  	1	$7,860 	 	 $-   		$7,860	
	 Z4568	 Invoice 3-456-3 Item-2  	40 	 4.00 	 	5.00 		 195.00 	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
					Invoice Subtotal  	 $8,055	
					Tax Rate  	8.25% 	
					Sales Tax  	 664.53 	
					Other  	 -   	
	Make all checks payable to company name.				Deposit Received  	 -   	
	Total due in <#> days. Overdue accounts subject to a service charge 

In [4]:
with open('Customer Invoice 3.txt') as file:
    invoice_text_3 = file.read()
print(invoice_text_3)

	company name						
	Street Address		P: Phone Number		Email		
	"City, State ZIP Code"		F: Fax Number		Website		
	Bill To:	Kiwi Designs	Phone: +1-432-555-0189		Invoice #:	3-456-4	
	Address:	567 Pine Nut Ave.	Lane	Fax:     +1-683-555 0123		Invoice Date:	2020-06-30	
		"Moline, MO 098765"	Email: someone@example.com				
	Invoice For: Project 2						
	Item #	Description	Qty	Unit Price	Discount	Price	
	 Z7890	 Invoice 3-456-4 Data 1  	50 	 $15.00  $50.00	 $690.00 	
	 Z7891	 Invoice 3-456-4 Data 2  	50 	 4.00 	 5.00 	 195.00 	
	 Z7892	 Invoice 3-456-4 Data 3  	50 	 6.00 	 7.00 	 293.00 	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
	 	   		  	  	  	
					Invoice Subtotal  	" $1078.00 "	
					Tax Rate  	8.75% 	
					Sales Tax  	 94.33 	
					Other  	 -   	
	Make all checks payable to company name.				Deposit Received  	 -   	


In [3]:
# repr() shows the raw string, including the tab (\t) and newline (\n) characters
print(repr(invoice_text))

In [25]:
# The parentheses define a capture group; group(0) still returns the entire match
import re
match = re.search(r'Invoice \d[^\t]+\t(\d+)', invoice_text_3)
match.group(0)

'Invoice 3-456-4 Data 1  \t50'

In [24]:
# group(1) returns only the first capture group, here the quantity (Qty)
import re
match = re.search(r'Invoice \d[^\t]+\t(\d+)', invoice_text_3)
match.group(1)

'50'
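When a pattern has several capture groups, numbered groups become hard to track. A hedged sketch with named groups follows; the group names `invoice` and `qty` are chosen here for illustration, and the row is copied from the third invoice above.

```python
import re

# One item row from the third invoice, with its tabs written out explicitly
row = '\t Z7891\t Invoice 3-456-4 Data 2  \t50 \t 4.00 \t 5.00 \t 195.00 \t'

# (?P<name>...) gives a capture group a name instead of only a number
pattern = r'Invoice (?P<invoice>\d[\d-]*)[^\t]*\t(?P<qty>\d+)'
match = re.search(pattern, row)

print(match.group('invoice'))
print(match.group('qty'))
print(match.groupdict())
```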

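In [9]:
# match[0] is shorthand for match.group(0); here match comes from a search for the first phone number
match = re.search(r'\d{3}-\d{3}-\d{4}', invoice_text)
match[0]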

'432-555-0189'

In [72]:
# What if there is no match?
# Let's look for a phone number with a pattern that cannot match (13 digits in the middle group)
match = re.search(r'\d{3}-\d{13}-\d{4}', invoice_text)
print(match)

None
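Because `re.search` returns `None` when nothing matches, calling `.group()` on the result directly would raise an `AttributeError`. A minimal sketch of a guard is shown below; `find_phone` is a hypothetical helper name, and the sample strings are taken from the invoices above.

```python
import re

def find_phone(text):
    """Return the first NNN-NNN-NNNN phone number in text, or None if absent."""
    match = re.search(r'\d{3}-\d{3}-\d{4}', text)
    # Check the match object before using it: None means "no match"
    return match.group(0) if match else None

found = find_phone('Phone: 432-555-0189')
missing = find_phone('Fax:      ')

print(found, missing)
```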


In [73]:
match = re.findall(r'\d{3}-\d{3}-\d{4}', invoice_text)
match

['432-555-0189', '432-555-0123']
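The pattern above fits the first invoice, but the other two write phone numbers differently: `(519) 479-0159`, `+1-432-555-0189`, and `+1-683-555 0123`. The sketch below is one possible pattern that covers these four spellings. It is an assumption based only on the three sample invoices and is not a general phone number validator.

```python
import re

# Phone and fax fields copied from the three invoices, joined by tabs
samples = '\t'.join([
    'Phone: 432-555-0189',
    'Phone: (519) 479-0159',
    'Phone: +1-432-555-0189',
    'Fax:     +1-683-555 0123',
])

# Optional "+1-" style prefix, optional parentheses around the area code,
# and either a hyphen or whitespace between the digit groups.
phone_pattern = r'(?:\+\d-)?\(?\d{3}\)?[-\s]\d{3}[-\s]\d{4}'

# The only group is non-capturing, so findall returns the full matches
phones = re.findall(phone_pattern, samples)
print(phones)
```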

# Homework
See how many of the invoice fields you can extract.

In [6]:
import pandas as pd
pd.read_csv('Customer Invoice.txt', sep='\t')

Unnamed: 0.1,Unnamed: 0,company name,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7
0,,Street Address,,P: Phone Number,,Email,,
1,,"City, State ZIP Code",,F: Fax Number,,Website,,
2,,Bill To:,"Contoso, Ltd",Phone: 432-555-0189,,Invoice #:,3-456-2,
3,,Address:,567 Walnut Lane,Fax: 432-555-0123,,Invoice Date:,6/28/2020,
4,,,"Moline, MO 098765",Email: someone@example.com,,,,
5,,Invoice For: Project 2,,,,,,
6,,Item #,Description,Qty,Unit Price,Discount,Price,
7,,Z4567,Invoice 3-456-2 Data 1,39,$5.00,$-,$195.00,
8,,Z4568,Invoice 3-456-2 Data 2,40,4.00,5.00,155.00,
9,,Z4569,Invoice 3-456-2 Data 3,30,6.00,7.00,173.00,
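One possible starting point for the item rows, offered as a sketch and not a full solution: split a row on tabs, strip the padding, and check the numbers. The row is copied from the first invoice. The relationship price = quantity × unit price − discount is an assumption inferred from the sample rows, and the variable names are illustrative.

```python
# One item row from the first invoice, with its tabs written out explicitly
row = '\t Z4568\t Invoice 3-456-2 Data 2  \t40 \t 4.00 \t 5.00 \t 155.00 \t'

# Split on tabs, strip surrounding spaces, and drop the empty edge fields
fields = [field.strip() for field in row.split('\t')]
fields = [field for field in fields if field]

item, description, qty, unit_price, discount, price = fields

# Assumed relationship between the columns (inferred from the sample data)
expected_price = float(qty) * float(unit_price) - float(discount)

print(fields)
print(expected_price == float(price))
```

Rows that use `$` signs or `-` for an empty discount would need extra cleaning before the `float()` conversion.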
