# 2 | Working with Text Data Files

This notebook covers specific text file topics:
- Loading and writing text files
- Parsing texts


Regular expressions match and parse strings, and the same syntax is available in most other programming languages. Common patterns:
   - `^`   Matches the beginning of a line
   - `$`   Matches the end of the line
   - `.`   Matches any character
   - `*`   Repeats a character zero or more times
   - `\s`    Matches whitespace
   - `\S`    Matches any non-whitespace character
   - `*?`    Repeats a character zero or more times (non-greedy)
   - `+`    Repeats a character one or more times
   - `+?`    Repeats a character one or more times (non-greedy)
   - `[aeiou]`    Matches a single character in the listed set
   - `[^XYZ]`    Matches a single character not in the listed set
   - `[a-z0-9]`    The set of characters can include a range
   - `(`   Indicates where string extraction is to start
   - `)`   Indicates where string extraction is to end
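
A minimal sketch of a few of these patterns using Python's `re` module. The first sample line is taken from the mail file used later in this notebook; the `header` string is made up for illustration.

```python
import re

line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008"

# ^ anchors the match at the start of the line, \S+ matches a run of
# non-whitespace characters, and the parentheses mark the part to extract
sender = re.findall(r"^From (\S+@\S+)", line)

# [0-9]+ matches one or more characters from the range 0 to 9
digits = re.findall(r"[0-9]+", line)

# Greedy versus non-greedy repetition on an illustrative string
header = "From: first: second"
greedy = re.findall(r"^F.+:", header)       # .+ grabs as much as possible
non_greedy = re.findall(r"^F.+?:", header)  # .+? stops at the first colon

print(sender)      # ['stephen.marquard@uct.ac.za']
print(digits)      # ['5', '09', '14', '16', '2008']
print(greedy)      # ['From: first:']
print(non_greedy)  # ['From:']
```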


*Resources*: 
- **Coursera Python for Informatics** (University of Michigan)


## Text Files and the Operating System

In [203]:
cd /app/1-Basics/0-Functions

/app/1-Basics/0-Functions


In [204]:
ls

1-Python-Intro.ipynb  4-Dataframes.ipynb    advanced-pandas.ipynb
2-Texts.ipynb         Data/                 numpy.ipynb
3-Arrays.ipynb        advanced-numpy.ipynb  pandas2.ipynb


In [210]:
# set the path and open the file; open() returns a text I/O wrapper
# use open() to read files, especially large ones, rather than feeding the text through input()
path = 'Data/examples/segismundo.txt'
f = open(path)
f

<_io.TextIOWrapper name='Data/examples/segismundo.txt' mode='r' encoding='UTF-8'>

In [208]:
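# strip trailing whitespace (including the newline) from each line
# a separate name is used so the handle f opened above stays open
with open(path) as fh:
    lines = [x.rstrip() for x in fh]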
lines

['Sueña el rico en su riqueza,',
 'que más cuidados le ofrece;',
 '',
 'sueña el pobre que padece',
 'su miseria y su pobreza;',
 '',
 'sueña el que a medrar empieza,',
 'sueña el que afana y pretende,',
 'sueña el que agravia y ofende,',
 '',
 'y en el mundo, en conclusión,',
 'todos sueñan lo que son,',
 'aunque ninguno lo entiende.',
 '']

In [211]:
# read the first 10 characters
f.read(10)

'Sueña el r'

In [213]:
# position in the underlying file; 'ñ' takes two bytes in UTF-8, so 10 characters advance 11 bytes
f.tell()

11

In [212]:
f2 = open(path, 'rb')  # Binary mode
data = f2.read(10)     # read the first 10 bytes
data

b'Sue\xc3\xb1a el '

In [228]:
# decode the UTF-8 bytes back into a str
data.decode('utf8')

'Sueña el '
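
The difference between characters and bytes seen above can be reproduced without the file. This is a small sketch that assumes UTF-8 encoding, as reported by the `TextIOWrapper` output earlier; the variable names are illustrative.

```python
text = "Sueña el r"            # the first 10 characters of the poem
raw = text.encode("utf-8")     # the same text as UTF-8 bytes

n_chars = len(text)            # counts characters
n_bytes = len(raw)             # counts bytes; 'ñ' is encoded as two bytes

# Decoding only the first 10 bytes yields 9 characters
decoded = raw[:10].decode("utf-8")

# Cutting the bytes in the middle of 'ñ' cannot be decoded
try:
    raw[:4].decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(n_chars, n_bytes)   # 10 11
print(repr(decoded))      # 'Sueña el '
print(split_ok)           # False
```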

In [None]:
f.seek(3)
f.read(1)

In [223]:
# write the non-blank lines to a new file, then read them back
with open(path) as source, open('data/tmp.txt', 'w') as handle:
    handle.writelines(x for x in source if len(x) > 1)
with open('data/tmp.txt') as tmp:
    lines = tmp.readlines()
lines

['Sueña el rico en su riqueza,\n',
 'que más cuidados le ofrece;\n',
 'sueña el pobre que padece\n',
 'su miseria y su pobreza;\n',
 'sueña el que a medrar empieza,\n',
 'sueña el que afana y pretende,\n',
 'sueña el que agravia y ofende,\n',
 'y en el mundo, en conclusión,\n',
 'todos sueñan lo que son,\n',
 'aunque ninguno lo entiende.\n']
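
The same write-then-read round trip can be tried without touching the notebook's data folder. This sketch assumes a temporary directory and hypothetical file names (`poem.txt`, `poem_no_blanks.txt`), and uses the first lines of the poem as sample text.

```python
import os
import tempfile

poem = (
    "Sueña el rico en su riqueza,\n"
    "que más cuidados le ofrece;\n"
    "\n"
    "sueña el pobre que padece\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    src = os.path.join(tmpdir, "poem.txt")            # hypothetical file name
    dst = os.path.join(tmpdir, "poem_no_blanks.txt")  # hypothetical file name

    # Write the sample text to disk
    with open(src, "w", encoding="utf-8") as handle:
        handle.write(poem)

    # Copy only the lines that contain something besides whitespace
    with open(src, encoding="utf-8") as source, open(dst, "w", encoding="utf-8") as target:
        target.writelines(line for line in source if line.strip())

    # Read the result back; readlines() keeps the trailing newlines
    with open(dst, encoding="utf-8") as handle:
        kept = handle.readlines()

print(kept)
```

Opening both files in a single `with` statement closes each of them as soon as the block ends, even if an error occurs while copying.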

In [None]:
f.close()
f2.close()

## Text Parsing

In [33]:
# split() breaks a string into a list of words, splitting on whitespace
text="the clown ran after the car and the car ran into the tent and the tent fell down on the clown and the car"
words=text.split()
words

['the',
 'clown',
 'ran',
 'after',
 'the',
 'car',
 'and',
 'the',
 'car',
 'ran',
 'into',
 'the',
 'tent',
 'and',
 'the',
 'tent',
 'fell',
 'down',
 'on',
 'the',
 'clown',
 'and',
 'the',
 'car']

In [35]:
# count word frequency; get(word, 0) returns 0 for a word not seen yet
counts=dict()
for word in words:
    counts[word]=counts.get(word,0)+1
counts

{'the': 7,
 'clown': 2,
 'ran': 2,
 'after': 1,
 'car': 3,
 'and': 3,
 'into': 1,
 'tent': 2,
 'fell': 1,
 'down': 1,
 'on': 1}

In [36]:
# find most frequent word from text
bigcount = None
bigword = None
for word,count in counts.items():
    if bigcount is None or count > bigcount: 
        bigword = word
        bigcount = count 
print(bigword, bigcount)

the 7
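
As an alternative to the manual loop, the standard library's `collections.Counter` does the counting and the ranking in one step. This is a sketch of that swapped-in approach, reusing the sentence from the cells above.

```python
from collections import Counter

text = ("the clown ran after the car and the car ran into the tent "
        "and the tent fell down on the clown and the car")

# Counter builds the word -> frequency mapping directly from the list of words
counts = Counter(text.split())

# most_common(1) returns a list holding the single (word, count) pair with the highest count
bigword, bigcount = counts.most_common(1)[0]

print(bigword, bigcount)  # the 7
```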


In [37]:
# split() breaks a string into a list of substrings
# with no argument it splits on whitespace and treats consecutive spaces as one separator
# you can also pass a delimiter such as a semicolon, comma or dash
s ='spam-spam-spam' 
delimiter = '-'
s.split(delimiter) 

['spam', 'spam', 'spam']
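
The difference between splitting on whitespace and splitting on an explicit delimiter is easy to miss, so here is a short sketch. The sample strings are made up, and `re.split` is brought in only to show splitting on more than one delimiter at once.

```python
import re

spaced = "spam   eggs  ham"

# No argument: any run of whitespace counts as a single separator
on_whitespace = spaced.split()

# Explicit " ": every single space is a separator, so empty strings appear
on_single_space = spaced.split(" ")

# Several possible delimiters: split on either a semicolon or a comma
mixed = "spam;eggs,ham"
on_either = re.split(r"[;,]", mixed)

# join() is the inverse operation for a single delimiter
rejoined = "-".join(on_whitespace)

print(on_whitespace)    # ['spam', 'eggs', 'ham']
print(on_single_space)  # ['spam', '', '', 'eggs', '', 'ham']
print(on_either)        # ['spam', 'eggs', 'ham']
print(rejoined)         # spam-eggs-ham
```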

In [105]:
# download a text file from the internet using requests
import requests
url = "http://www.py4inf.com/code/mbox-short.txt"
res = requests.get(url)
text = res.text
# count the characters in the text
len(text)

94626

In [42]:
# have a look at the first 100 characters from text file
text[:100]

'From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\nReturn-Path: <postmaster@collab.sakaiprojec'

In [141]:
lines=text.split('\n')
lines[:10]

['From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008',
 'Return-Path: <postmaster@collab.sakaiproject.org>',
 'Received: from murder (mail.umich.edu [141.211.14.90])',
 '\t by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;',
 '\t Sat, 05 Jan 2008 09:14:16 -0500',
 'X-Sieve: CMU Sieve 2.3',
 'Received: from murder ([unix socket])',
 '\t by mail.umich.edu (Cyrus v2.2.12) with LMTPA;',
 '\t Sat, 05 Jan 2008 09:14:16 -0500',
 'Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu [141.211.14.79])']

In [237]:
# count lines that start with From (this matches both "From " lines and "From:" headers)
count=0
for line in lines:
    line = line.rstrip()
    if line.startswith("From"): 
        count=count+1
print("There were",count,"lines starting with From")

There were 54 lines starting with From
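
Because `startswith("From")` matches any line beginning with those four letters, it can help to count the variants separately. The sketch below uses a small hand-built list rather than the downloaded file; the first two lines come from the output above, and the last two are illustrative.

```python
sample_lines = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008",
    "Return-Path: <postmaster@collab.sakaiproject.org>",
    "From: stephen.marquard@uct.ac.za",   # illustrative header line
    "Subject: a hypothetical subject",    # illustrative header line
]

# Lines of the form "From address date"
with_space = sum(1 for line in sample_lines if line.startswith("From "))

# Lines of the form "From: address"
with_colon = sum(1 for line in sample_lines if line.startswith("From:"))

# startswith() also accepts a tuple of prefixes
either = sum(1 for line in sample_lines if line.startswith(("From ", "From:")))

print(with_space, with_colon, either)  # 1 1 2
```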


In [234]:
# collect the unique sender domains from lines that begin with From
import re
output = []
for line in lines:
    # strip trailing whitespace and lower-case the line
    line = line.rstrip().lower()
    # remove the punctuation characters !, # and ?
    line = re.sub('[!#?]', '', line)
    # skip lines that do not begin with 'from'
    if not line.startswith('from'): continue
    split=line.split("@")
    # the domain is the first token after the @
    domain = split[1].split()
    output.append(domain[0])
# a set keeps only the unique domain names
print(set(output))

{'umich.edu', 'iupui.edu', 'caret.cam.ac.uk', 'uct.ac.za', 'gmail.com', 'media.berkeley.edu'}
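
The same extraction can be sketched with `re.findall`, which returns a list of the parenthesized parts of every match. The pattern below is one possible choice, not the only one, and the sample list mixes two lines from the file with a made-up address at `example.com`.

```python
import re

sample_lines = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008",
    "Return-Path: <postmaster@collab.sakaiproject.org>",
    "From: alice@example.com",  # hypothetical line for illustration
]

domains = set()
for line in sample_lines:
    # ^From:? matches "From" or "From:" at the start of the line;
    # \S+@ skips the user name, and (\S+) extracts the domain after the @
    domains.update(re.findall(r"^From:? \S+@(\S+)", line))

print(sorted(domains))  # ['example.com', 'uct.ac.za']
```

Lines that do not match, such as the `Return-Path` header, simply contribute an empty list, so no separate filtering step is needed.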


In [242]:
# sort by the number of distinct characters in each domain
strings = list(set(output))
strings.sort(key=lambda x: len(set(x)))
strings

['iupui.edu',
 'uct.ac.za',
 'umich.edu',
 'gmail.com',
 'caret.cam.ac.uk',
 'media.berkeley.edu']
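
For comparison, here is a short sketch that sorts the same domain names alphabetically and by number of distinct characters. The list is copied from the output above; `sorted` returns a new list and leaves the original unchanged.

```python
domains = ['umich.edu', 'iupui.edu', 'caret.cam.ac.uk',
           'uct.ac.za', 'gmail.com', 'media.berkeley.edu']

# Default ordering for strings is alphabetical
alphabetical = sorted(domains)

# key= replaces each item by a sort value, here the count of distinct characters
by_distinct_chars = sorted(domains, key=lambda d: len(set(d)))

print(alphabetical)
print(by_distinct_chars)
```

Python's sort is stable, so names with the same number of distinct characters keep their relative order from the input list.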