# (3C) Files and strings

In this notebook we'll focus on 'strings' and how to do things with them, like find and count instances of a word. At the very end, we'll learn how to open a text file and read it to a string. Text files are just big strings, so text mining always begins with string processing.

*This notebook began as an adaptation of this [strings tutorial by Jerry Pusinen](https://jerry-git.github.io/learn-python3/notebooks/beginner/html/strings.html).*

## String slicing

We just looked at some of the methods strings have. There's one more major functionality of strings: slicing.

'Slicing' refers to slicing off a specific piece of the string.

This is all coordinated through two brackets at the end of a string:

    my_string[a:b]
    
    a = index of the first character to include (counting from 0)
    b = index at which to stop (the character at b itself is not included)
    
For instance...

### Slice off the beginning of strings

To get the first N characters:

    my_string[:N]

In [3]:
alphabet = 'abcdefghijklmnopqrstuvwxyz'

Each letter of the alphabet string has an index:

    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z
    |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25

In [4]:
# first three letters

alphabet[0:3]     # means: start at index 0 and stop just before index 3
                  # or:    the first 3 letters

'abc'

In [5]:
# first 10 letters
alphabet[0:10]

'abcdefghij'

In [6]:
alphabet[:10]          # also works

'abcdefghij'
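
Two properties of slices follow from the rule that the end index is not included. The sketch below (the names `n`, `first_part` and `rest` are only illustrative) checks that a slice `[a:b]` holds `b - a` characters, and that cutting a string at any point `n` loses nothing:

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# a slice from a to b holds b - a characters
print(len(alphabet[3:7]))       # 7 - 3 = 4

# leaving out a number means "from the very start" or "to the very end"
n = 10
first_part = alphabet[:n]       # the first n characters
rest = alphabet[n:]             # everything after them

# the two pieces fit back together exactly
print(first_part + rest == alphabet)
```

`len`, used here to count characters, is introduced properly further down in this notebook.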

In [7]:
# @TODO: slice 'William' away from 'William Shakespeare'

bardname='William Shakespeare'

bardname[:7]

'William'

In [8]:
# @TODO: slice off your own first name

fullname='Ryan Heuser'

fullname[0:4]

'Ryan'

### Slice off the ends of strings

The alphabet's indices, counted backwards from the end:

    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z
    |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
    -26 -25 -24 -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9  -8  -7  -6  -5  -4  -3  -2  -1

In [13]:
# last 3 letters
alphabet[-3:]  

'xyz'

In [14]:
# last 10 letters
alphabet[-10:]

'qrstuvwxyz'
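
A negative index is shorthand for counting from the end: in a string of 26 characters, index `-3` points at the same character as index `26 - 3`. A minimal sketch of that equivalence (the variable `length` is just an illustrative name):

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'
length = len(alphabet)                  # 26 characters

print(alphabet[-3])                     # one character, counted from the end
print(alphabet[length - 3])             # the same character, counted from the start

# the same holds for slices
print(alphabet[-3:] == alphabet[length - 3:])
```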

In [15]:
# @TODO slice 'Shakespeare' from 'William Shakespeare'

bardname[-11:]

'Shakespeare'

In [16]:
# @TODO slice off your last name

fullname[-6:]

'Heuser'

### Slice the middles of strings

    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z
    |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25

In [17]:
# the first three
alphabet[0:3]

'abc'

In [18]:
# the next four letters 
alphabet[3:7]

'defg'

In [19]:
# the next four
alphabet[7:11]

'hijk'

In [20]:
# the next five
alphabet[11:16]

'lmnop'

In [22]:
# everything but the first three and the last three
alphabet[3:-3]

'defghijklmnopqrstuvw'

In [24]:
word='Hello,'

print(word)

Hello,


In [25]:
word[:-1]

'Hello'

In [None]:
# @TODO: Slice Shakespeare's name so that you see only
# the last two letters of his first name ('am') and the first two letters of his last name ('Sh')



In [None]:
# @TODO: Slice the middle of your own name in the same way as just above:



In [None]:
# @TODO: Slice away everything but the first and last letters of your name



### Get the length of strings

Use the built-in function `len` to get the length (character count) of a string.

In [27]:
len('Is this tweet too long?')

23

In [28]:
len(alphabet)

26
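
Keep in mind that `len` counts every character, not only letters: spaces, punctuation and line breaks (`\n`) each add one. A small sketch with a made-up string:

```python
short = 'Hi, you!\n'

# 5 letters + comma + space + exclamation mark + line break
print(len(short))

# strip() removes the trailing line break, so one character fewer
print(len(short.strip()))
```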

In [33]:
##
# @TODO: Pare this string down to fit into a tweet (280 chars) 
#
lorem = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea
commodo consequat. Duis aute irure dolor in reprehenderit """

len(lorem)

271

## String methods

So far we've seen a few things strings can do. They can be 'added' to each other. They can be sliced.

But strings can do many more things using their **methods**.

A method is something an object can do. If my dog is a Python object, things he might be able to do are 'sit', 'stay', 'walk'. Here's some pseudo-code:

    my_dog.sit()
    my_dog.stay()
    my_dog.walk()

In other words, a method is activated by this general formula:

    object.method()

For strings specifically, certain methods do very useful things.

### `str.upper(), str.lower(), str.title()`

These three methods let you change the capitalization of the string you have.

In [34]:
mixed_case = 'PyTHoN hackER'

In [35]:
mixed_case.upper()

'PYTHON HACKER'

In [36]:
mixed_case.lower()

'python hacker'

In [37]:
mixed_case.title()

'Python Hacker'
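
One caveat, noted in Python's own documentation for `str.title()`: the method treats any unbroken run of letters as a word, so a letter right after an apostrophe gets capitalized too. The sketch below shows this with a made-up phrase, and one possible alternative, `string.capwords` from the standard library, which only splits on whitespace:

```python
import string

phrase = "they're bill's friends"

# title() capitalizes after every non-letter, including apostrophes
print(phrase.title())

# capwords() capitalizes only the first letter of each whitespace-separated word
print(string.capwords(phrase))
```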

In [39]:
# @TODO: Recreate the following title page with its correct capitalization using the variable names.
"""
The
LIFE
and
Strange Surprising
ADVENTURES
of
ROBINSON CRUSOE
Of York, Mariner
"""

line1='the'
line2='life'
line3='AND'
line4='strange surprising'
line5='adventures'
line6='OF'
line7='robinson crusoe'
line8='of york, mariner'

print(line1.title())
print(line2.upper())
print(line3.lower())
print(line4.title())
print(line5.upper())
print(line6.lower())
print(line7.upper())
print(line8.title())
# .... Print the other lines using the appropriate string method

The
LIFE
and
Strange Surprising
ADVENTURES
of
ROBINSON CRUSOE
Of York, Mariner


In [44]:
# Remember you can always check the help for a string method like this:
help(str.title)

Help on method_descriptor:

title(...)
    S.title() -> str
    
    Return a titlecased version of S, i.e. words start with title case
    characters, all remaining cased characters have lower case.



### `str.replace()`

A lot of the time we need to replace something inside a string with something else.

#### Fixed That For You #1

In [52]:
corporation = 'Dunkin Donuts'

In [53]:
x='Dunkin'
y='Capitalist'

In [54]:
corporation.replace(x,y)

'Capitalist Donuts'

In [48]:
# This did not overwrite 'corporation'!

corporation

'Dunkin Donuts'

In [57]:
# To do that, we need to re-assign corporation

corporation = corporation.replace(x,y)

In [58]:
corporation

'Capitalist Donuts'

#### Fixed That For You #2

In [59]:
heteronormative = 'men like women, women like men'

In [61]:
heteronormative.replace('women','people')

'men like people, people like men'

In [62]:
heteronormative.replace('men','people')

'people like wopeople, wopeople like people'

In [63]:
heteronormative.replace('women','people').replace('men','people')

'people like people, people like people'
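
`str.replace()` matches characters, not words, which is why 'men' was also replaced inside 'women' above. One way around this, sketched here on the assumption that you are willing to reach for Python's built-in `re` module (regular expressions, which this notebook does not otherwise cover), is to require a word boundary (`\b`) on both sides of the match:

```python
import re

heteronormative = 'men like women, women like men'

# plain replace also hits the 'men' hidden inside 'women'
print(heteronormative.replace('men', 'people'))

# \b marks a word boundary, so only the standalone word 'men' matches
only_men = re.sub(r'\bmen\b', 'people', heteronormative)
print(only_men)
```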

In [64]:
# This did not overwrite heteronormative!

heteronormative

'men like women, women like men'

In [65]:
# We can overwrite one at a time

heteronormative = heteronormative.replace('women','people')

In [66]:
heteronormative

'men like people, people like men'

In [67]:
heteronormative = heteronormative.replace('men','people')

In [68]:
heteronormative

'people like people, people like people'

#### Fixed That For You #3

In [73]:
#@TODO: Rid this string of its gendered assumptions (use 'they', 'them', and 'person' as neutral pro/nouns).

gendered = 'You saw your doctor? What did he say? And the nurse? What did she say?'

gendered.replace('she','they').replace(' he',' they')

'You saw your doctor? What did they say? And the nurse? What did they say?'

### `str.strip()`

Sometimes we need to clean strings up and remove the whitespace they have surrounding them.

In [78]:
overpadded = """


                                  Chapeter 1
                                  
                                  
                                  The night was young
                                  
                                  
                                  
                                  
                                  
                                  
                                  

"""

In [79]:
print(overpadded)




                                  Chapeter 1
                                  
                                  
                                  The night was young
                                  
                                  
                                  
                                  
                                  
                                  
                                  




In [80]:
overpadded

'\n\n\n                                  Chapeter 1\n                                  \n                                  \n                                  The night was young\n                                  \n                                  \n                                  \n                                  \n                                  \n                                  \n                                  \n\n'

In [81]:
overpadded.strip()

'Chapeter 1\n                                  \n                                  \n                                  The night was young'
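
`strip()` only trims the two ends of a string, which is why the padding *between* the lines survived above. There are also `lstrip()` and `rstrip()` for trimming just the left or the right end. To clean every line, one option (a sketch using a shorter made-up string and `str.splitlines()`) is to strip each line separately and glue the results back together:

```python
padded = "\n\n      Chapter 1\n      \n      The night was young\n      \n\n"

print(repr(padded.lstrip()))    # only the left end is trimmed
print(repr(padded.rstrip()))    # only the right end is trimmed

clean_lines = []
for line in padded.splitlines():    # break the string into its lines
    line = line.strip()             # trim this one line
    if line:                        # skip lines that are now empty
        clean_lines.append(line)

cleaned = '\n'.join(clean_lines)    # join the lines with single line breaks
print(cleaned)
```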

### `str.count()`

How many times does one string (a 'substring') occur inside another string?

In [96]:
rafaela = """I wish either my father or my mother, or indeed both of them, as they
were in duty both equally bound to it, had minded what they were about
when they begot me; had they duly consider'd how much depended upon what
they were then doing;--that not only the production of a rational
Being was concerned in it, but that possibly the happy formation and
temperature of his body, perhaps his genius and the very cast of his
mind;--and, for aught they knew to the contrary, even the fortunes of
his whole house might take their turn from the humours and dispositions
which were then uppermost;--Had they duly weighed and considered all
this, and proceeded accordingly,--I am verily persuaded I should have
made a quite different figure in the world, from that in which the
reader is likely to see me.--Believe me, good folks, this is not so
inconsiderable a thing as many of you may think it;--you have all, I
dare say, heard of the animal spirits, as how they are transfused from
father to son, &c. &c.--and a great deal to that purpose:--Well, you may
take my word, that nine parts in ten of a man's sense or his nonsense,
his successes and miscarriages in this world depend upon their motions
and activity, and the different tracks and trains you put them into, so
that when they are once set a-going, whether right or wrong, 'tis not
a half-penny matter,--away they go cluttering like hey-go mad; and by
treading the same steps over and over again, they presently make a road
of it, as plain and as smooth as a garden-walk, which, when they are
once used to, the Devil himself sometimes shall not be able to drive
them off it."""

In [97]:
rafaela=rafaela.replace('  ',' ')

In [98]:
# How many crabs?
rafaela.count('crab')

0

In [99]:
# How many of these animals respectively?
print(rafaela.count('iguana'),rafaela.count('scorpion'),rafaela.count('snake'),rafaela.count('parakeet'))

0 0 0 0


In [100]:
# How many hummingbirds?
rafaela.count('hummingbird')

0

In [101]:
# How come!?
# Look: it's capitalized in text. "Hummingbirds"!
# So...

# @TODO how do we fix this?

rafaela_l = rafaela.lower()

rafaela_l.count('hummingbird')

0
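
`count()` is case-sensitive, so lowercasing first helps whenever the word occurs in the text with different capitalization (it cannot help if the word is not there at all). A tiny sketch with a made-up sentence:

```python
sentence = 'Hummingbirds hover. A hummingbird hums.'

# the capitalized 'Hummingbirds' is missed
print(sentence.count('hummingbird'))

# after lowercasing, both occurrences are found
print(sentence.lower().count('hummingbird'))
```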

In [112]:
##
# Getting some stats
##

num_commas = rafaela.count(',')
num_commas

29

In [113]:
num_sents = rafaela.count('.') + rafaela.count('?') + rafaela.count('!')
num_sents

4

In [114]:
num_words = rafaela.strip().count(' ')
num_words

269

In [115]:
num_quots = rafaela.count('"')
num_quots

0

In [116]:
# Words per sentence
wps = num_words/num_sents
wps

67.25

In [125]:
# Commas per sentence
cps = num_commas/num_sents
cps

7.25

In [126]:
bobby = """
THEY order, said I, this matter better in France.—You have been in
France? said my gentleman, turning quick upon me, with the most civil
triumph in the world.—Strange! quoth I, debating the matter with myself,
That one and twenty miles sailing, for ’tis absolutely no further from
Dover to Calais, should give a man these rights:—I’ll look into them: so,
giving up the argument,—I went straight to my lodgings, put up half a
dozen shirts and a black pair of silk breeches,—“the coat I have on,”
said I, looking at the sleeve, “will do;”—took a place in the Dover
stage; and the packet sailing at nine the next morning,—by three I had
got sat down to my dinner upon a fricaseed chicken, so incontestably in
France, that had I died that night of an indigestion, the whole world
could not have suspended the effects of the _droits d’aubaine_; {557}—my
shirts, and black pair of silk breeches,—portmanteau and all, must have
gone to the King of France;—even the little picture which I have so long
worn, and so often have told thee, Eliza, I would carry with me into my
grave, would have been torn from my neck!—Ungenerous! to seize upon the
wreck of an unwary passenger, whom your subjects had beckoned to their
coast!—By heaven! Sire, it is not well done; and much does it grieve me,
’tis the monarch of a people so civilized and courteous, and so renowned
for sentiment and fine feelings, that I have to reason with!—
"""

In [127]:
num_sents_bobby = bobby.count('.') + bobby.count('!') + bobby.count('?')
num_sents_bobby

9

In [128]:
num_commas_bobby = bobby.count(',')
num_commas_bobby

31

In [129]:
num_words_bobby = bobby.strip().count(' ')
num_words_bobby

231

In [130]:
# Words per sent
wps_bobby = num_words_bobby / num_sents_bobby
wps_bobby

25.666666666666668

In [131]:
# Commas per sentence
cps_bobby = num_commas_bobby / num_sents_bobby
cps_bobby

3.4444444444444446

In [132]:
##
# Compare Rafaela and Bobby:
##

In [133]:
print('\t','Rafaela (ch1 pr1)','\t','Bobby (ch2 pr1)')
print('WPS','\t',wps,'\t',wps_bobby)
print('CPS','\t',cps,'\t',cps_bobby)

	 Rafaela (ch1 pr1) 	 Bobby (ch2 pr1)
WPS 	 67.25 	 25.666666666666668
CPS 	 7.25 	 3.4444444444444446


In [None]:
##
# @TODO: Run the above statistics on the first paragraph of chapter 3 (Emi)
# * find the text file in the jupyter file browser
# * click the home button, then corpora > tropic_of_orange > ch3.txt
# * copy the first paragraph and create a string for it
# * then rerun the procedures above
# * and save variables: wps_emi and cps_emi



In [None]:
print('\t','Rafaela (ch1 pr1)','\t','Bobby (ch2 pr1)','\t','Emi (ch3)')
print('WPS','\t',wps,'\t',wps_bobby,'\t',wps_emi)
print('CPS','\t',cps,'\t',cps_bobby,'\t',cps_emi)

### `str.index(substr)`

Find the index of a substring in a string. Useful for Keywords in Context (KWIC).

First let's start small.

In [None]:
alphabet.index('j')

In [None]:
alphabet[9]

In [None]:
alphabet[9-2:9+3]

In [None]:
# let's abstract from this

index = alphabet.index('j')
alphabet[index-2:index+3]

In [None]:
# let's abstract one more time

index = alphabet.index('j')
radius = 2
alphabet[index-radius:index+radius+1]
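
One edge case in this window: if the match sits closer to the start of the string than `radius`, then `index-radius` is negative, and Python reads a negative start as counting from the end, which typically yields an empty slice. A sketch of the problem and one possible fix using the built-in `max` (the name `start` is illustrative):

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'
radius = 2

index = alphabet.index('a')                          # 0: the very first letter
print(repr(alphabet[index-radius:index+radius+1]))   # alphabet[-2:3] is empty

start = max(0, index - radius)                       # never let the start drop below 0
print(alphabet[start:index+radius+1])
```

No such guard is needed at the other end: a slice that runs past the end of the string simply stops at the last character.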

#### Key Words In Context (KWIC)

In [None]:
index_crab = rafaela.index('crab')
index_crab
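
Be aware that `str.index()` raises a `ValueError` if the substring never occurs, for instance when `count()` reports 0 for it. The related method `str.find()` returns `-1` instead. A minimal sketch with a made-up string:

```python
text = 'the quick brown fox'

print(text.find('fox'))       # position of the match, same as text.index('fox')
print(text.find('crab'))      # -1 means "not found"

# index() stops with an error instead, which we can catch
try:
    text.index('crab')
except ValueError:
    print("'crab' does not occur in the text")
```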

In [None]:
radius_r = 200

print(rafaela[index_crab-radius_r:index_crab+radius_r+1])

In [None]:
help(rafaela.index)

In [None]:
index_crab2=rafaela.index('crab',index_crab+1)
index_crab2

In [None]:
print(rafaela[index_crab2-radius_r:index_crab2+radius_r+1])

In [None]:
##
# @TODO: Find the first and second instances of 'Windex' in bobby's paragraph
#



## Functions

We've been repeating ourselves a lot. Let's rewrite what we've done so far as a 'function': a little algorithm or recipe which requires certain input and delivers certain output.

The basic format is:

```python
def function_name(input):
    # do things...
    return output
```

Note the **indentation**: throughout Python, indentation means "inside of" or "in this context."

### Demo functions

In [134]:
def shout(string):
    loud_string = string.upper() + '!!!'
    print(loud_string)    # this will not print now, because it is 'inside' the function
    
print('hello?')           # this will print now, because it is 'outside' the function

hello?


In [145]:
shout("wtf")

WTF!!!
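
Notice that `shout` prints its result but does not `return` it, so there is nothing to store in a variable afterwards. A function without a `return` statement gives back `None`. The sketch below contrasts the two; `shout_back` is a made-up name for a variant that returns its result:

```python
def shout(string):
    loud_string = string.upper() + '!!!'
    print(loud_string)        # shows the result, but hands nothing back

def shout_back(string):
    loud_string = string.upper() + '!!!'
    return loud_string        # hands the result back to the caller

printed = shout('hello')          # prints HELLO!!!
returned = shout_back('hello')    # prints nothing by itself

print(printed)                    # None
print(returned)                   # the loud string, ready to be reused
```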


In [136]:
def hours2weeks(num_hours):
    num_days=num_hours / 24
    num_weeks = num_days / 7
    return num_weeks

In [138]:
hours2weeks(100)

0.5952380952380952

In [139]:
def hours2workweeks(num_hours,num_working_hours=8,num_working_days=5):
    num_workdays = num_hours / num_working_hours
    num_workweeks = num_workdays / num_working_days
    return num_workweeks

In [144]:
hours2workweeks(num_working_days=3, num_hours=10000)

416.6666666666667
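
`hours2workweeks` shows two further features of functions: parameters with **default values** (such as `num_working_hours=8`) can be left out when calling, and arguments passed by name can come in any order. A short sketch, repeating the function so that it runs on its own:

```python
def hours2workweeks(num_hours, num_working_hours=8, num_working_days=5):
    num_workdays = num_hours / num_working_hours
    num_workweeks = num_workdays / num_working_days
    return num_workweeks

# both defaults are used: 80 / 8 / 5
print(hours2workweeks(80))

# one default is overridden by name: 80 / 8 / 4
print(hours2workweeks(80, num_working_days=4))

# the same call with the named arguments in another order
print(hours2workweeks(num_working_days=4, num_hours=80))
```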

### Functions using string slicing

In [None]:
def first_n_letters(string,n):
    letters = string[:n]
    return letters

In [None]:
first_n_letters(bardname,10)

In [155]:
##
# @TODO: Write a function that returns your Star Wars name!
#
# First name: 
# First 3 letters of your last name
# + First 2 letters of your first name
#
# Last name:
# First 2 letters of your mother's maiden name
# + First 3 letters of the town you were born in

def star_wars_name_gen(first_name, last_name, maiden_name, town_name):
    
    star_wars_first_name = last_name[:3] + first_name[:2].lower()
    
    star_wars_last_name = maiden_name[:2] + town_name[:3].lower()
    
    return star_wars_first_name + ' ' + star_wars_last_name

In [156]:
star_wars_name_gen('Ryan', 'Heuser', 'Waitt', 'Margate')

'Heury Wamar'

In [None]:
##
# @TODO: Execute the function
#
# star_wars_name_gen(...)



### Functions using `str.count()`

In [158]:
def get_num_sents(string):
    num_sents = string.count('.') + string.count('?') + string.count('!')
    return num_sents

In [160]:
get_num_sents("Hello. This is a sentence? ok?")

3

In [163]:
get_num_sents(rafaela)

4

In [161]:
get_num_sents(bobby)

9

In [164]:
def get_num_words(string):
    # @TODO: Is this right?
    return string.strip().count(' ')

In [165]:
get_num_words(rafaela)

269

In [166]:
get_num_words(bobby)

231
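
Counting spaces only approximates the number of words: two words separated by a line break are not told apart, and a double space is counted twice. One alternative, sketched here under the hypothetical name `get_num_words_split`, is `str.split()`, which with no arguments cuts the string at any run of whitespace:

```python
def get_num_words_split(string):
    # split() with no arguments splits on spaces, tabs and line breaks alike
    return len(string.split())

sample = 'one two\nthree  four'       # one line break, one double space

print(sample.strip().count(' '))      # counts the 3 space characters
print(get_num_words_split(sample))    # counts the 4 words
```

This is still a rough measure: words joined only by dashes, such as `me.--Believe`, would count as a single word.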

In [185]:
def get_words_per_sent(string):
    
    # First, get the number of sentences
    num_sents = get_num_sents(string)
    
    # then, get the number of words
    num_words = get_num_words(string)
    
    # The words per sentence is Nw / Ns
    wps = num_words / num_sents
    
    # return wps
    return wps

In [175]:
get_words_per_sent(rafaela)

0.01486988847583643

In [170]:
get_words_per_sent(bobby)

25.666666666666668

In [172]:
get_words_per_sent('Hello. These are very short sentences. You know?')

2.3333333333333335

In [None]:
##
# @TODO: Write your own function counting something interesting about a text
#
# e.g. count the animals in a text? count the exclamation marks?
# return as raw count or as a ratio of num words (or some other stat)
#
def my_interesting_calculation(string):
    output = None    # @TODO: replace None with your own calculation on `string`
    return output

### Functions using `str.index()`

In [None]:
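# let's abstract one last time!
# offset = index of the previous match (-1 means: no previous match, search from the start)
def kwic(string,substring,offset=-1,radius=100):
    index = string.index(substring,offset+1)
    # a negative start would count from the end of the string, so stop at 0
    start = max(0,index-radius)
    print(string[start:index+radius+1])
    return index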

In [None]:
kwic(bobby,'Windex')

In [None]:
kwic(bobby,'Windex',217)

In [None]:
index_first_match_r = kwic(rafaela,'scorpion',radius=50)

In [None]:
index_second_match_r = kwic(rafaela,'scorpion',index_first_match_r,radius=50)

In [None]:
## 
# @TODO: Find the third instance of 'dead' in Rafaela's first paragraph
#
#

## Files

We've been working a lot with strings. But where have these strings come from? So far they've just been manually entered, whether by me, you, or from a copy/paste of a text.

Instead of manual entry, we can automatically fill a string with the contents of a text file.

### Files are just big strings

To open a file, we use a specific syntax:

```python
with open('filename.txt') as file:
    string = file.read()
```

This means: open filename.txt and, while it's open, read its contents into a variable called `string`. When the indented block ends, Python closes the file automatically, and the contents stay saved in `string`.

As an analogy, imagine this process:

```python
with open(the_refrigerator) as open_fridge:
    filled_glass = open_fridge.pour_oj()
```
Unindented now, the fridge door is closed. But we still have our `filled_glass` of OJ...
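
To see the whole cycle in one place, here is a self-contained sketch. It first writes a small throwaway file (the file name and its contents are made up for illustration) and then reads it back. Passing `encoding='utf-8'` explicitly is an assumption about how your text files are saved, but it is a common and usually safe choice:

```python
import os
import tempfile

# create a small throwaway file so the example runs anywhere
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'example.txt')
with open(path, 'w', encoding='utf-8') as file:
    file.write('Call me Ishmael.\nSome years ago...')

# read the whole file into one string
with open(path, encoding='utf-8') as file:
    text = file.read()

print(file.closed)    # True: the file was closed when the indented block ended
print(len(text))      # the string is still here, and has a length like any other
print(text[:16])      # ...and can be sliced like any other
```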

In [178]:
pwd

'/Users/ryan/Dropbox/PHD/Teaching/Literary Text Mining/literarytextmining/03_python'

In [190]:
# Let's open Rafaela's chapter (ch 1)
with open('../corpora/testing/tristram.txt') as file_r:
    rafaela_chap1 = file_r.read()

In [191]:
# Print the first 2000 characters
print(rafaela_chap1[:2000])

﻿To the Right Honourable Mr. Pitt.

Sir,

Never poor Wight of a Dedicator had less hopes from his Dedication,
than I have from this of mine; for it is written in a bye corner of the
kingdom, and in a retir'd thatch'd house, where I live in a constant
endeavour to fence against the infirmities of ill health, and other
evils of life, by mirth; being firmly persuaded that every time a man
smiles,--but much more so, when he laughs, it adds something to this
Fragment of Life.

I humbly beg, Sir, that you will honour this book, by taking it--(not
under your Protection,--it must protect itself, but)--into the country
with you; where, if I am ever told, it has made you smile; or can
conceive it has beguiled you of one moment's pain--I shall think myself
as happy as a minister of state;--perhaps much happier than any one (one
only excepted) that I have read or heard of.

I am, Great Sir, (and, what is more to your Honour) I am, Good Sir, Your
Well-wisher, and most humble Fellow-subject,

The Au

In [186]:
get_words_per_sent(rafaela)

67.25

In [187]:
get_words_per_sent(rafaela_chap1)

24.273429377691965

In [192]:
# Load Bobby's chapter
with open('../corpora/testing/sentimental.txt') as file:
    bobby_chap2 = file.read()

In [193]:
get_words_per_sent(bobby)

25.666666666666668

In [194]:
get_words_per_sent(bobby_chap2)

24.125412541254125

In [None]:
##
# @TODO: Load Emi's chapter and print the words per sentence
#


In [None]:
##
# @TODO: Find the second instance of 'funghi' in Emi
#


### Reading your own files

In [195]:
##
# @TODO: Load one of your texts into a string
#


In [203]:
from textblob import TextBlob

# create a 'blob'
blob = TextBlob("A rose is a rose is a rose. This is another sentence.")

# list the words in the blob
blob.words

WordList(['A', 'rose', 'is', 'a', 'rose', 'is', 'a', 'rose', 'This', 'is', 'another', 'sentence'])

In [200]:
##
# @TODO: Print its number of words per sentence
#


In [None]:
##
# @TODO: Print its number of commas per sentence
#


In [None]:
##
# @TODO: Perform your my_interesting_calculation (written above) on your text
#


In [None]:
##
# @TODO: Find a few instances of an interesting word to you in your text
#
