# Files and strings

*This notebook was prepared by [Ryan Heuser](https://ryanheuser.org). His excellent original notebooks can be found [here](https://github.com/quadrismegistus/literarytextmining). I (Russell Williams) have made some light amendments to use his material in the context of our course.*

In this notebook we'll focus on 'strings' and how to do things with them, like find and count instances of a word. At the very end, we'll learn how to open a text file and read it to a string. Text files are just big strings, so text mining always begins with string processing.

*This notebook began as an adaptation of this [strings tutorial by Jerry Pusinen](https://jerry-git.github.io/learn-python3/notebooks/beginner/html/strings.html).*

## String slicing

Strings have one major piece of functionality that we'll use throughout: slicing.

'Slicing' refers to slicing off a specific piece of the string.

This is all coordinated through a pair of square brackets at the end of a string:

    my_string[a:b]
    
    a = index of the character in my_string to start at (counting from 0)
    b = index of the character in my_string to stop at (this character is not included)
    
For instance...

### Slice off the beginning of strings

To get the first N characters:

    my_string[:N]

In [None]:
alphabet = 'abcdefghijklmnopqrstuvwxyz'

Each letter of the alphabet string has an index:

    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z
    |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25
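
A minimal sketch of how those index numbers behave. The variable names `first`, `tenth` and `chunk` are only illustrative.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# A single index in square brackets returns one character.
first = alphabet[0]      # 'a'
tenth = alphabet[9]      # 'j'

# A slice [a:b] starts at index a and stops just before index b,
# so it contains b - a characters (when both are in range).
chunk = alphabet[3:7]    # 'defg'
print(first, tenth, chunk, len(chunk))
```

Counting from 0 is why the tenth letter, `j`, sits at index 9.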

In [None]:
# first three letters

alphabet[0:3]     # means: starting at 0, and including everything until 3, which is not included
                  # or:    the first 3 letters

In [None]:
# first 10 letters
alphabet[0:10]

In [None]:
alphabet[:10]          # also works

In [None]:
# @TODO: slice 'William' away from 'William Shakespeare'

bardname='William Shakespeare'


In [None]:
bardname[:]

In [None]:
# @TODO: slice off your own first name

fullname='Russell Williams'


In [None]:
fullname[:]

### Slice off the ends of strings

The same alphabet with negative indices, which count back from the end:

    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z
    |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
    -26 -25 -24 -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9  -8  -7  -6  -5  -4  -3  -2  -1
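
As a small sketch of what the negative numbers mean (the names `last`, `tail` and `same_tail` are illustrative), a negative index is just a distance measured from the end of the string:

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# A negative index counts back from the end: -1 is the last character.
last = alphabet[-1]                        # 'z'

# alphabet[-3:] means "start 3 from the end and run to the end".
tail = alphabet[-3:]                       # 'xyz'

# The same slice written with a positive start index.
same_tail = alphabet[len(alphabet) - 3:]   # 'xyz'
print(last, tail, tail == same_tail)
```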

In [None]:
# last 3 letters
alphabet[-3:]   

In [None]:
# last 10 letters
alphabet[-10:]

In [None]:
# @TODO slice 'Shakespeare' from 'William Shakespeare'
bardname[-11:]



In [None]:
# @TODO slice off your last name



### Slice the middles of strings

    a   b   c   d   e   f   g   h   i   j   k   l   m   n   o   p   q   r   s   t   u   v   w   x   y   z
    |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |   |
    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25

In [None]:
# the first three
alphabet[0:3]

In [None]:
# the next four letters 
alphabet[3:7]

In [None]:
# the next four
alphabet[7:11]

In [None]:
# the next five
alphabet[11:16]

In [None]:
# everything but the first three and the last three
alphabet[3:-3]

In [None]:
# @TODO: Slice Shakespeare's name so that you see only
# the last two letters of his first name ('am') and the first two letters of his last name ('Sh')



In [None]:
# @TODO: Slice the middle of your own name in the same way as just above:



In [None]:
# @TODO: Slice away everything but the first and last letters of your name



### Get the length of strings

Use the built-in function `len` to get the length (character count) of a string.

In [None]:
len('hello')

In [None]:
len(alphabet)

In [None]:
##
# @TODO: Pare this string down to fit into a tweet (280 chars) 
#
lorem = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea
commodo consequat. Duis aute irure dolor in reprehenderit in voluptate
velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint
occaecat cupidatat non proident, sunt in culpa qui officia deserunt
mollit anim id est laborum."""

len(lorem)

## String methods

So far we've seen a few things strings can do. They can be 'added' to each other. They can be sliced.

But strings can do many more things using their **methods**.

A method is something an object can do. If my dog is a Python object, things he might be able to do are 'sit', 'stay', 'walk'. Here's some pseudo-code:

    my_dog.sit()
    my_dog.stay()
    my_dog.walk()

In other words, a method is activated by this general formula:

    object.method()

For strings specifically, certain methods do very useful things.
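
Here is a short sketch of the `object.method()` pattern applied to a real string. The variable `greeting` is made up for illustration.

```python
greeting = 'hello world'

# Calling a method: the object, a dot, the method name, parentheses.
shouted = greeting.upper()
print(shouted)     # HELLO WORLD

# String methods return a new string; the original is left unchanged.
print(greeting)    # hello world

# Some methods take arguments inside the parentheses.
starts_with_hello = greeting.startswith('hello')
print(starts_with_hello)   # True
```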

### `str.upper(), str.lower(), str.title()`

These three methods let you change the capitalization of the string you have.

In [None]:
mixed_case = 'PyTHoN hackER'

In [None]:
mixed_case.upper()

In [None]:
mixed_case.lower()

In [None]:
mixed_case.title()

In [None]:
# @TODO: Recreate the following title page with its correct capitalization using the variable names.
"""
The
LIFE
and
Strange Surprising
ADVENTURES
of
ROBINSON CRUSOE
Of York, Mariner
"""

line1='the'
line2='life'
line3='AND'
line4='strange surprising'
line5='adventures'
line6='OF'
line7='robinson crusoe'
line8='of york, mariner'

print(line1.title())
# .... Print the other lines using the appropriate string method

In [None]:
# Remember you can always check the help for a string method like this:
#help(str.title)

### `str.replace()`

A lot of the time we need to replace something inside a string with something else.

#### Fixed That For You #1

In [None]:
corporation = 'Dunkin Donuts'

In [None]:
corporation.replace('Dunkin','Unkind')

In [None]:
# This did not overwrite 'corporation'!

corporation

In [None]:
# To do that, we need to re-assign corporation

corporation = corporation.replace('Dunkin','Unkind')

In [None]:
corporation

#### Fixed That For You #2

In [None]:
heteronormative = 'men like women, women like men'

In [None]:
heteronormative.replace('women','people')

In [None]:
heteronormative.replace('men','people')

In [None]:
heteronormative.replace('women','people').replace('men','people')

In [None]:
# This did not overwrite heteronormative!

heteronormative

In [None]:
# We can overwrite one at a time

heteronormative = heteronormative.replace('women','people')

In [None]:
heteronormative

In [None]:
heteronormative = heteronormative.replace('men','people')

In [None]:
heteronormative
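
The results above depend on the order of the two replacements, because the letters `men` also occur inside `women`. A small sketch of the difference, using an illustrative variable `sentence` rather than the one above:

```python
sentence = 'men like women, women like men'

# Replacing 'men' first also rewrites the 'men' inside 'women'.
men_first = sentence.replace('men', 'people')
print(men_first)      # people like wopeople, wopeople like people

# Replacing the longer word first avoids that.
women_first = sentence.replace('women', 'people').replace('men', 'people')
print(women_first)    # people like people, people like people
```

`str.replace()` knows nothing about word boundaries; it simply swaps every occurrence of the exact character sequence it is given.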

#### Fixed That For You #3

In [None]:
#@TODO: Rid this string of its gendered assumptions (use 'they', 'them', and 'person' as neutral pro/nouns).

gendered = 'You saw your doctor? What did he say? And the nurse? What did she say?'



### `str.strip()`

Sometimes we need to clean strings up and remove the whitespace they have surrounding them.
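
As a quick sketch, `str.strip()` has two siblings, `str.lstrip()` and `str.rstrip()`, which trim only one end. The string `padded` is an illustrative example; `repr()` is used so the remaining whitespace is visible.

```python
padded = '   hi!  \n'

# strip() removes whitespace (spaces, tabs, newlines) from both ends.
print(repr(padded.strip()))      # 'hi!'

# lstrip() and rstrip() trim only the left or the right end.
print(repr(padded.lstrip()))     # 'hi!  \n'
print(repr(padded.rstrip()))     # '   hi!'

# Whitespace inside the string is never touched.
inner = '  a  b  '.strip()
print(repr(inner))               # 'a  b'
```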

In [None]:
overpadded = """


                                  hi!

"""

In [None]:
print(overpadded)

In [None]:
overpadded

In [None]:
overpadded.strip()

### `str.count()`

How many times does one string appear inside another string?
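
One thing to keep in mind, sketched here on a made-up sentence: `str.count()` counts character sequences, not words, so it also finds matches inside longer words.

```python
line = 'the cat sat there'

# count() matches the substring anywhere, including inside longer words:
# here it finds 'the' on its own and again at the start of 'there'.
n_the = line.count('the')
print(n_the)            # 2

# Padding with spaces is a rough way to count only the standalone word.
padded = ' ' + line + ' '
n_the_word = padded.count(' the ')
print(n_the_word)       # 1

# Matches never overlap: 'aaaa' contains 'aa' twice, not three times.
n_overlap = 'aaaa'.count('aa')
print(n_overlap)        # 2
```

The space-padding trick is only approximate, since it misses words followed by punctuation or a line break.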

In [None]:
rafaela = """Rafaela Cortes spent the morning barefoot, sweeping both dead and
living things from over and under beds, from behind doors and shutters,
through archways, along the veranda—sweeping them all across the deep
shadows and luminous sunlight carpeting the cool tile floors. Her
slender arms worked the broom industriously through the air—already
thickening with tepid heat—and along the floor, her feet following,
printing their moisture in dark footprints over baked clay. Every
morning, a small pile of assorted insects and tiny animals—moths and
spiders, lizards and beetles—collected, their brittle bodies tossed in
waves along the floor, a cloudy hush of sandy soil, cobwebs, and human
hair. An iguana, a crab, and a mouse. And there was the scorpion,
always dead—its fragile back broken in the middle. And the snake that
slithered away at the urging of her broom—probably not poisonous, but
one never knew. Every morning it was the same. Every morning, she swept
this mound of dead and wiggling things to the door and off the side of
the veranda and into the dark green undergrowth with the same flourish.
Occasionally, there was more of one species or the other, but each
somehow always made its way back into the house. The iguana, the crab,
and the mouse, for example, were always there. Sometimes they were
dead; sometimes they were alive. As for the scorpion, it was always
dead, but the snake was always alive. On some days, it seemed to twirl
before her broom communicating a kind of dance that seemed to send a
visceral message up the broom to her fingertips. There was no
explanation for any of it. It made no difference if she closed the
doors and shutters at the first sign of dusk or if she left the house
unoccupied and tightly shut for several days. Every morning when the
house was thrown open to the sunlight, she knew that she and the boy
had not slept alone that night. Hummingbirds and parakeets fluttered
across the rooms, stirring the languid humidity settled by the night,
frantically searching for escape through the open lace curtains, while
crawling lives hid beneath furniture or presented itself lifeless at
her feet."""

In [None]:
# How many crabs?
rafaela.count('crab')

In [None]:
# How many of these animals respectively?
print(rafaela.count('iguana'), rafaela.count('scorpion'),rafaela.count('snake'),rafaela.count('parakeet'))

In [None]:
# How many hummingbirds?
rafaela.count('???')

In [None]:
# How come!?
# Look: it's capitalized in text. "Hummingbirds"!
# So...

# @TODO how do we fix this?



In [None]:
##
# Getting some stats
##

num_commas = rafaela.count(',')
num_commas

In [None]:
num_sents = rafaela.count('.') + rafaela.count('!') + rafaela.count('?')
num_sents

In [None]:
num_words = rafaela.strip().count(' ')
num_words

In [None]:
# Words per sentence
wps = num_words/num_sents
wps

In [None]:
# Commas per sentence
cps = num_commas/num_sents
cps

In [None]:
bobby = """
Check it out, ése. You know this story? Yeah, over at Sanitary Supply
they always tell it. This dude drives up, drives up to Sanitary. Makes
a pickup like always. You know. Paper towels. Rags. Mop handles. Gallon
of Windex. Stuff like that. Drives up in a Toyota pickup. Black shiny
deal, all new, big pinche wheels. Very nice. Yeah. Asian dude. Kinda
skinny. Short, yeah. But so what? Dark glasses. Cigarette in the mouth.
He’s getting out the truck, see. In the parking lot. Big tall dude
comes by with a gun. Yeah, a gun. Puts it to his head and says, GIMME
THE KEYS! It’s a jacker. Asian dude don’t lose no time, man. No time.
Not a doubt. Rams the door closed. wham! Just like that. Slams the door
on the jacker’s hand. On the jacker’s gun! Smashes the gun! Smashes the
hand. Gun ain’t worth shit. Hand’s worth even less. Jacker loses it
bad. He’s crying. Screaming. It’s not over. Asian dude swings the door
open. Attacks the jacker. Pushes him up to the wall of Sanitary and
beats the shit out. Dude don’t come up to the jacker’s nose. But it
don’t matter. Got every trick in the books. Bruce Lee moves. Kick. VAP!
WHOP! Damn. Don’t mess with this man. By now Sanitary’s called the
police. Crowd’s seen it all. Jacker’s a mess. Blood everywhere. Never
seen so much blood. But not a drop on the Asian. Not a drop. Never took
off his shades. Never even stopped smoking. Turns over the jacker’s
remains to the police. Don’t say nothing. That’s it. Goes into
Sanitary. Picks up the mop handles, Windex, rags. Gets in the pickup.
He’s gone. That’s it. That’s it.
"""

In [None]:
num_sents_bobby = bobby.count('.') + bobby.count('!') + bobby.count('?')
num_sents_bobby

In [None]:
num_commas_bobby = bobby.count(',')
num_commas_bobby

In [None]:
num_words_bobby = bobby.strip().count(' ')
num_words_bobby

In [None]:
# Words per sent
wps_bobby = num_words_bobby / num_sents_bobby
wps_bobby

In [None]:
# Commas per sentence
cps_bobby = num_commas_bobby / num_sents_bobby
cps_bobby

In [None]:
##
# Compare Rafaela and Bobby:
##

In [None]:
print('\t','Rafaela (ch1 pr1)','\t','Bobby (ch2 pr1)')
print('WPS','\t',wps,'\t',wps_bobby)
print('CPS','\t',cps,'\t',cps_bobby)

### `str.index(substr)`

Find the index of a substring in a string. Useful for Keywords in Context (KWIC).

First let's start small.

In [None]:
alphabet.index('j')

In [None]:
alphabet[9]

In [None]:
alphabet[9-2:9+3]

In [None]:
# let's abstract from this

index = alphabet.index('j')
alphabet[index-2:index+3]

In [None]:
# let's abstract one more time

index = alphabet.index('j')
radius = 2
alphabet[index-radius:index+radius+1]
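
Two edge cases are worth knowing before applying this pattern to longer texts. The sketch below uses the same alphabet string; the names `start` and `window` are illustrative.

```python
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# index() raises ValueError when the substring is absent;
# find() returns -1 instead.
try:
    alphabet.index('!')
except ValueError:
    print('not found')
print(alphabet.find('!'))      # -1

# Near the start of a string, index - radius can go negative, and a
# negative slice start counts from the end. Clamping at 0 avoids that.
index = alphabet.index('b')    # 1
radius = 2
start = max(0, index - radius)
window = alphabet[start:index + radius + 1]
print(window)                  # abcd
```

Without the `max(0, ...)`, the slice would be `alphabet[-1:4]`, which is an empty string.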

#### Key Words In Context (KWIC)

In [None]:
index_crab = rafaela.index('crab')
index_crab

In [None]:
radius_r = 200

print(rafaela[index_crab-radius_r:index_crab+radius_r+1])

In [None]:
help(rafaela.index)

In [None]:
index_crab2=rafaela.index('crab',index_crab+1)
index_crab2

In [None]:
print(rafaela[index_crab2-radius_r:index_crab2+radius_r+1])

In [None]:
##
# @TODO: Find the first and second instances of 'Windex' in bobby's paragraph
#



## Functions

We've been repeating ourselves a lot. Let's rewrite what we've written as a 'function', a little algorithm or recipe which requires certain input and will deliver certain output.

The basic format is:

```python
def function_name(input):
    # do things...
    return output
```

Note the **indentation**: throughout Python, indentation means "inside of" or "in this context."
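
A minimal sketch of both ideas, indentation and `return`, using two made-up functions. One returns its result and the other only prints it.

```python
def add_exclamation(text):
    # Indented lines belong to the function and run only when it is called.
    result = text + '!'
    return result          # hands the value back to the caller

def print_exclamation(text):
    print(text + '!')      # displays the value but returns nothing (None)

# Unindented lines are outside the functions and run right away.
returned = add_exclamation('hello')
printed = print_exclamation('hello')
print(returned)            # hello!
print(printed)             # None
```

A function that only prints is fine for display, but its result cannot be stored or reused in a later calculation.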

### Demo functions

In [None]:
def shout(string):
    loud_string = string.upper() + '!!!'
    print(loud_string)    # this will not print now, because it is 'inside' the function
    
print('hello?')           # this will print now, because it is 'outside' the function

In [None]:
shout("I'm listening!")

In [None]:
def hours2weeks(num_hours):
    num_days=num_hours / 24
    num_weeks = num_days / 7
    return num_weeks

In [None]:
hours2weeks(10000)

In [None]:
def hours2workweeks(num_hours,num_working_hours=8,num_working_days=5):
    num_workdays = num_hours / num_working_hours
    num_workweeks = num_workdays / num_working_days
    return num_workweeks

In [None]:
hours2workweeks(10000)
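
The `=8` and `=5` in the definition are default values: they are used unless the caller supplies something else. A short sketch, repeating the function so the example runs on its own:

```python
def hours2workweeks(num_hours, num_working_hours=8, num_working_days=5):
    num_workdays = num_hours / num_working_hours
    num_workweeks = num_workdays / num_working_days
    return num_workweeks

# With no extra arguments, the defaults (8-hour days, 5-day weeks) apply.
default_weeks = hours2workweeks(10000)
print(default_weeks)      # 250.0

# A default can be overridden by name, leaving the others as they are.
four_day_weeks = hours2workweeks(10000, num_working_days=4)
print(four_day_weeks)     # 312.5
```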

### Functions using string slicing

In [None]:
def first_n_letters(string,n):
    letters = string[:n]
    return letters

In [None]:
first_n_letters(bardname,10)

In [None]:
##
# @TODO: Write a function that returns your Star Wars name!
#
# First name: 
# First 3 letters of your last name
# + First 2 letters of your first name
#
# Last name:
# First 2 letters of your mother's maiden name
# + First 3 letters of the town you were born in

def star_wars_name_gen(first_name, last_name, maiden_name, town_name):
    
    star_wars_first_name = ''
    
    star_wars_last_name = ''
    
    return 

In [None]:
##
# @TODO: Execute the function
#
# star_wars_name_gen(...)



### Functions using `str.count()`

In [None]:
def get_num_sents(string):
    num_sents = string.count('.') + string.count('?') + string.count('!')
    return num_sents

In [None]:
get_num_sents(rafaela)

In [None]:
get_num_sents(bobby)

In [None]:
def get_num_words(string):
    # @TODO: Is this right?
    return string.count(' ')

In [None]:
get_num_words(rafaela)

In [None]:
get_num_words(bobby)

In [None]:
def get_words_per_sent(string):
    num_sents = get_num_sents(string)
    num_words = get_num_words(string)
    return num_words / num_sents

In [None]:
get_words_per_sent(rafaela)

In [None]:
get_words_per_sent(bobby)
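
On the `@TODO` in `get_num_words`: counting spaces misses words that are separated by a line break, and both passages above are wrapped across many lines. One possible alternative, sketched here with a hypothetical helper named `count_words_by_split`, is to use `str.split()`, which breaks a string on any run of whitespace.

```python
def count_words_by_split(string):
    # split() with no arguments breaks on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
    return len(string.split())

sample = """one two
three four"""

n_spaces = sample.count(' ')
n_words = count_words_by_split(sample)
print(n_spaces)   # 2
print(n_words)    # 4
```

This is still only an approximation of "words": anything between two stretches of whitespace counts, punctuation included.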

In [None]:
##
# @TODO: Write your own function counting something interesting about a text
#
# e.g. count the animals in a text? count the exclamation marks?
# return as raw count or as a ratio of num words (or some other stat)
#
def my_interesting_calculation(string):
    return output

### Functions using `str.index()`

In [None]:
# let's abstract one last time!
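def kwic(string,substring,offset=0,radius=100):
    # the search starts just after `offset`, so passing a previous match's index finds the next match
    index = string.index(substring,offset+1)
    # clamp at 0: a negative start would count from the end of the string
    start = max(0,index-radius)
    print(string[start:index+radius+1])
    return index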

In [None]:
kwic(bobby,'Windex')

In [None]:
kwic(bobby,'Windex',217)

In [None]:
index_first_match_r = kwic(rafaela,'scorpion',radius=50)

In [None]:
index_second_match_r = kwic(rafaela,'scorpion',index_first_match_r,radius=50)

In [None]:
## 
# @TODO: Find the third instance of 'dead' in Rafaela's first paragraph
#
#

## Files

We've been working a lot with strings. But where have these strings come from? So far they've just been manually entered, whether by me, you, or from a copy/paste of a text.

Instead of manual entry, we can automatically fill a string with the contents of a text file.

### Files are just big strings

To open a file, we use a specific syntax:

```python
with open('filename.txt') as file:
    string = file.read()
```

This means: open filename.txt, and while it's open, read out its contents to a variable called `string`. Once the indented block ends, the file is closed automatically, with the contents still saved in `string`.

As an analogy, imagine this process:

```python
with open(the_refrigerator) as open_fridge:
    filled_glass = open_fridge.pour_oj()
```
Once the code is unindented again, the fridge door is closed. But we still have our `filled_glass` of OJ...
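
Here is a runnable sketch of the same pattern. Since no particular text file is assumed to exist on your machine, it first writes a throwaway file (the name `example.txt` and its contents are made up) and then reads it back.

```python
import os
import tempfile

# Create a small throwaway text file so the example is self-contained.
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'example.txt')   # hypothetical filename

with open(path, 'w', encoding='utf-8') as file:
    file.write('It was a dark and stormy night.\n')

# Read it back. Naming the encoding explicitly avoids surprises on
# systems whose default is not UTF-8.
with open(path, encoding='utf-8') as file:
    string = file.read()

# The file is closed here, but its contents live on in `string`.
print(file.closed)    # True
print(string)
```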

In [None]:
# Load the opening letters from Frankenstein by Mary Shelley

import urllib.request

with urllib.request.urlopen("https://raw.githubusercontent.com/rwilliamsparis/AUPCL1099/main/corpora/FrankensteinLetters.txt") as f:
    frank_text=f.read().decode()
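
A note on the `.decode()` call: data read from a URL arrives as raw bytes, and `decode()` turns those bytes into an ordinary string (UTF-8 is the default encoding). A small offline sketch, using made-up sample data instead of a download:

```python
# Bytes as they might arrive from a download (illustrative data only).
raw = 'Café Müller'.encode('utf-8')
print(type(raw))       # <class 'bytes'>

# decode() turns bytes into a str; UTF-8 is the default encoding.
text = raw.decode('utf-8')
print(type(text))      # <class 'str'>
print(text)

# Bytes and characters are not the same thing: accented letters take
# more than one byte in UTF-8.
print(len(raw), len(text))   # 13 11
```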


In [None]:
print(frank_text[:2000])

In [None]:
get_words_per_sent(frank_text)

In [None]:
# Let's open Cat Person by Kristen Roupenian

with urllib.request.urlopen("https://raw.githubusercontent.com/rwilliamsparis/AUPCL1099/main/corpora/Cat_Person.txt") as c:
    cat_text=c.read().decode()

In [None]:
# Print the first 2000 characters

print(cat_text[:2000])

In [None]:
get_words_per_sent(cat_text)

In [None]:
# Load the opening four chapters of Dracula by Bram Stoker


with urllib.request.urlopen("https://raw.githubusercontent.com/rwilliamsparis/AUPCL1099/main/corpora/Dracula_Chapters_1_to_4.txt") as d:
    drac_text=d.read().decode()

In [None]:
print(drac_text[:2000])

In [None]:
get_words_per_sent(drac_text)

In [None]:
##
# @TODO: Print Dracula's number of words per sentence
#


In [None]:
##
# @TODO: Print Dracula's number of commas per sentence
#


In [None]:
##
# @TODO: Find a few instances of an interesting word to you in Dracula
#
