# Lecture 5, Part 2 – File I/O

## CSS Summer Bootcamp, Week 1 🥾

#### Suraj Rampure

### Motivation

We'll look at two tasks:
- Reading files from the computer (file input).
- Writing files to the computer (file output).

The primary function that will help us with both tasks is `open`.

### Reading files

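To create a file object for reading a file from the computer, call the `open` function with the file's path as the first argument and `'r'` as the second argument. (`'r'` is the default mode, so `open('data/words.txt')` behaves the same way.)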

In [1]:
# Means "show me all the files in the data folder"
!ls data

users.csv words.txt


In [2]:
f = open('data/words.txt', 'r')

In [3]:
type(f)

_io.TextIOWrapper

Now that we've created our file object, we need to do something with it to actually read the information. There are three relevant methods:
- `read`.
- `readline`.
- `readlines`.

### `read`

The `read` method returns the entire contents of the file as a string.

In [4]:
f = open('data/words.txt', 'r')

In [5]:
whole_file = f.read()
whole_file[:100]

'A\na\naa\naal\naalii\naam\nAani\naardvark\naardwolf\nAaron\nAaronic\nAaronical\nAaronite\nAaronitic\nAaru\nAb\naba\nA'

**Question:** When might this be a bad idea?

Note that after a single call to `read()`, the entire contents of the file have already been consumed, so a second call returns an empty string:

In [6]:
f.read()

''
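The empty string appears because a file object keeps track of its current position in the file, and `read` leaves that position at the end. Here is a minimal sketch of that behavior, assuming a small throwaway file (the name `demo.txt` and its contents are made up for illustration, standing in for `data/words.txt`). The `seek` method moves the position back to the start.

```python
import os
import tempfile

# Hypothetical demo file; it stands in for data/words.txt.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as f:
    f.write('A\na\naa\n')

with open(demo_path, 'r') as f:
    first = f.read()    # reads everything; the position is now at the end
    second = f.read()   # nothing is left to read, so this is ''
    f.seek(0)           # move the position back to the start of the file
    third = f.read()    # reads everything again

print(repr(first), repr(second), repr(third))
```

Re-opening the file, as the cells in this lecture do, has the same effect as `seek(0)`.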

### `readlines`

The `readlines` method returns the entire contents of the file as a list, with one entry per line.

In [7]:
f = open('data/words.txt', 'r')

In [8]:
all_lines = f.readlines()

In [9]:
all_lines[:10]

['A\n',
 'a\n',
 'aa\n',
 'aal\n',
 'aalii\n',
 'aam\n',
 'Aani\n',
 'aardvark\n',
 'aardwolf\n',
 'Aaron\n']

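Note that this is nearly the same information that `whole_file.split('\n')` contains. The differences: `readlines` keeps the trailing `'\n'` on each line, while `split('\n')` drops it (and, if the file ends with a newline, leaves an empty string at the end of the list).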

In [10]:
whole_file.split('\n')

['A',
 'a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'Aani',
 'aardvark',
 'aardwolf',
 'Aaron',
 'Aaronic',
 'Aaronical',
 'Aaronite',
 'Aaronitic',
 'Aaru',
 'Ab',
 'aba',
 'Ababdeh',
 'Ababua',
 'abac',
 'abaca',
 'abacate',
 'abacay',
 'abacinate',
 'abacination',
 'abaciscus',
 'abacist',
 'aback',
 'abactinal',
 'abactinally',
 'abaction',
 'abactor',
 'abaculus',
 'abacus',
 'Abadite',
 'abaff',
 'abaft',
 'abaisance',
 'abaiser',
 'abaissed',
 'abalienate',
 'abalienation',
 'abalone',
 'Abama',
 'abampere',
 'abandon',
 'abandonable',
 'abandoned',
 'abandonedly',
 'abandonee',
 'abandoner',
 'abandonment',
 'Abanic',
 'Abantes',
 'abaptiston',
 'Abarambo',
 'Abaris',
 'abarthrosis',
 'abarticular',
 'abarticulation',
 'abas',
 'abase',
 'abased',
 'abasedly',
 'abasedness',
 'abasement',
 'abaser',
 'Abasgi',
 'abash',
 'abashed',
 'abashedly',
 'abashedness',
 'abashless',
 'abashlessly',
 'abashment',
 'abasia',
 'abasic',
 'abask',
 'Abassin',
 'abastardize',
 'abatable',
 'abate
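To compare the ways of breaking text into lines side by side, here is a small sketch on a made-up three-line string (an assumption standing in for the contents of a file). The string method `splitlines` is not used elsewhere in this lecture; it is included here as one more option.

```python
# Made-up stand-in for the contents of a small file.
text = 'A\na\naa\n'

# Keeps each '\n'. For text whose only line breaks are '\n',
# this matches what readlines() returns.
with_newlines = text.splitlines(keepends=True)

# Drops each '\n', but leaves a trailing '' because the text ends with '\n'.
split_version = text.split('\n')

# Drops each '\n' and has no trailing ''.
clean_lines = text.splitlines()

print(with_newlines)
print(split_version)
print(clean_lines)
```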

### `readline`

The `readline` method reads one line of the file at a time.

In [11]:
f = open('data/words.txt', 'r')

In [12]:
f.readline()

'A\n'

In [13]:
f.readline()

'a\n'

In [14]:
f.readline()

'aa\n'

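Note that if the file we're reading is larger than the available memory on our computer, then `read` and `readlines` won't work, since both load the entire file at once. In that case, we have to process the file one line at a time, for instance with `readline`.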
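One way to process a file a line at a time is to call `readline` in a loop until it returns an empty string, which only happens at the end of the file (a blank line comes back as `'\n'`, not `''`). The sketch below assumes a tiny made-up file in place of a truly large one, and finds the longest word without ever holding more than one line in memory.

```python
import os
import tempfile

# Hypothetical small file; imagine it is too large to fit in memory.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as f:
    f.write('A\naardvark\naa\n')

longest = ''
with open(demo_path, 'r') as f:
    while True:
        line = f.readline()
        if line == '':          # '' means we've reached the end of the file
            break
        word = line.strip()     # remove the trailing '\n'
        if len(word) > len(longest):
            longest = word

print(longest)
```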

Alternatively, we can loop through the file directly:

In [15]:
f = open('data/words.txt', 'r')

contains_age = set() # can't use {}, that initializes a dictionary
for line in f:
    if 'age' in line.lower():
        contains_age.add(line.strip())

In [16]:
contains_age

{'subdrainage',
 'pager',
 'diageotropic',
 'autodrainage',
 'dumpage',
 'encastage',
 'culvertage',
 'fossage',
 'homageable',
 'Lagenaria',
 'quinquagesimal',
 'plumage',
 'presavagery',
 'warpage',
 'cullage',
 'thanage',
 'unappendaged',
 'ageustia',
 'minorage',
 'haulageway',
 'grandparentage',
 'overdiscourage',
 'sage',
 'flagellum',
 'intortillage',
 'pannage',
 'pillager',
 'encage',
 'grainage',
 'scrapeage',
 'formagenic',
 'perioesophageal',
 'boomage',
 'aeciostage',
 'avenage',
 'counteragent',
 'coupage',
 'patronage',
 'matriheritage',
 'railage',
 'rager',
 'bacteriophage',
 'fathomage',
 'nonvillager',
 'nonagesimal',
 'agentess',
 'defoliage',
 'precartilage',
 'heritage',
 'plumaged',
 'cageman',
 'suckage',
 'polypage',
 'gulravage',
 'bestowage',
 'screenage',
 'collagenic',
 'imager',
 'pilgrimage',
 'ravage',
 'stowage',
 'beverage',
 'enraged',
 'laborage',
 'squattage',
 'antisavage',
 'rediscourage',
 'fosterage',
 'Ageratum',
 'enheritage',
 'lymphorrhage',

### Closing files

Whenever you `open` a file, you either need to:
- `close` the file after using it.
- Use `with`, which automatically closes the file (**preferred**).


Otherwise, you may later on run into errors saying that the file is being accessed by another process.

Option 1:

In [17]:
f = open('data/words.txt', 'r')
all_words = f.read()
f.close()

Option 2 (**preferred**):

In [18]:
with open('data/words.txt', 'r') as f:
    all_words = f.read()
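To see that `with` really does close the file, we can inspect the file object's `closed` attribute. This is a minimal sketch using a made-up temporary file rather than `data/words.txt`.

```python
import os
import tempfile

# Hypothetical demo file.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(demo_path, 'w') as f:
    f.write('A\na\naa\n')

with open(demo_path, 'r') as f:
    inside = f.closed       # False: the file is open inside the block

after = f.closed            # True: leaving the block closed the file

# Reading from a closed file raises a ValueError.
error_message = None
try:
    f.read()
except ValueError as e:
    error_message = str(e)

print(inside, after, error_message)
```

The file is closed even if the code inside the `with` block raises an error, which is one reason this option is preferred.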

## Aside: File formats

### File formats

Next week, when you start learning about `pandas`, you will focus on manipulating **tabular data** – that is, data that looks like a table/spreadsheet. Such data is typically stored in `.csv` files (CSV stands for "comma-separated values").

<center><img src='images/tabular.png' width=60%><i>An example CSV file, and an equivalent spreadsheet.</i></center>

In [20]:
import pandas as pd
pd.read_csv('data/pups.csv')

| Unnamed: 0 | name | age | breed |
| --- | --- | --- | --- |
| 0 | Junior Smith | 11 | cockapoo |
| 1 | Rex Rogers | 7 | labradoodle |
| 2 | Flash Heat | 3 | labrador |
| 3 | Reese Bo | 4 | boston terrier |
| 4 | Polo Cash | 2 | shih tzu |


**Issue:** Not all data can be stored in a spreadsheet!

<center><img src='images/json_tree.png' width=60%></center>

<center>How would I store these relationships in a spreadsheet?</center>

### JSON

JSON, which stands for JavaScript Object Notation, is a file format that allows us to store hierarchical data.

- The JSON format looks very similar to the syntax we use for defining dictionaries in Python!
- Technically, JSON can be used to store tabular data, but it’s far less elegant than just using a CSV.

The file `data/family.json` contains a representation of the family tree in JSON. How do we load this in as a dictionary?

### The `json` module

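The `json` module provides useful functions for converting JSON to and from Python objects such as dictionaries.
- `json.load(f)` takes a "file object" `f` and returns the corresponding Python object (a dictionary, if the JSON file's top level is written with `{curly brackets}`).
- `json.loads(s)` takes a string `s` and returns the corresponding Python object (the "s" stands for "string").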

In [21]:
import json

In [22]:
with open('data/family.json', 'r') as f:
    family_tree = json.load(f)

In [23]:
family_tree

{'name': 'Grandma',
 'children': [{'name': 'Dad',
   'children': [{'name': 'Me'}, {'name': 'Brother'}]},
  {'name': 'my aunt',
   'children': [{'name': 'Cousin 1'},
    {'name': 'Cousin 2', 'children': [{'name': 'Cousin 2 Jr.'}]}]}]}

In [26]:
family_tree['children'][0]['children'][0]

{'name': 'Me'}

In [29]:
json.loads('{"hey": "hello"}')

{'hey': 'hello'}
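Because the loaded family tree is just nested dictionaries and lists, a short recursive function can visit every person in it. The sketch below assumes the same structure as the family tree above, written inline as a JSON string so the example is self-contained. The function name `all_names` is made up for illustration.

```python
import json

# The family tree from above, written as a JSON string.
family_json = '''
{"name": "Grandma",
 "children": [
   {"name": "Dad", "children": [{"name": "Me"}, {"name": "Brother"}]},
   {"name": "my aunt", "children": [
     {"name": "Cousin 1"},
     {"name": "Cousin 2", "children": [{"name": "Cousin 2 Jr."}]}]}]}
'''

def all_names(person):
    """Return every name in the tree, listing each person before their children."""
    names = [person['name']]
    # Not everyone has a 'children' key, so fall back to an empty list.
    for child in person.get('children', []):
        names.extend(all_names(child))
    return names

tree = json.loads(family_json)
names = all_names(tree)
print(names)
```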

## Writing files

### Writing files

To create a file object to "write" a file to the computer, use the `open` function and set the second argument to `'w'`. Then, use the `write` method to write a string to the file.

In [38]:
with open('test_document.txt', 'w') as f:
    f.write('Hello! I am a student in the CSS MS program at UCSD.\nI\'m currently in Week 1 of the summer bootcamp.\n')

- If the file you specified does not already exist, Python will create it for you (the folder it lives in must already exist, though).
- If the file you specified **does** already exist, Python will overwrite it (deleting anything that was already there).
    - Use the `'a'` mode to **append** to the end of an existing file without deleting its contents.
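The difference between `'w'` and `'a'` can be seen in a small sketch. The file name `notes.txt` and the lines written to it are made up for illustration.

```python
import os
import tempfile

# Hypothetical file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'notes.txt')

with open(path, 'w') as f:
    f.write('first line\n')     # creates the file

with open(path, 'w') as f:
    f.write('second line\n')    # 'w' again: the old contents are deleted

with open(path, 'a') as f:
    f.write('third line\n')     # 'a': added to the end, nothing is deleted

with open(path, 'r') as f:
    contents = f.read()

print(contents)
```

Only the second and third lines survive, since the second `'w'` wiped out the first line.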

### Saving dictionaries

We can use the function `json.dump` to save a dictionary to a JSON file, in case we need to share it with someone else (or another program).

In [40]:
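import copy

# A deep copy also copies the nested lists and dictionaries, so editing
# updated_family_tree leaves family_tree unchanged. (family_tree.copy() is
# a shallow copy and would share the nested 'children' lists.)
updated_family_tree = copy.deepcopy(family_tree)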

In [41]:
updated_family_tree['children'][0]['children'][0]['children'] = [{'name': 'Nobody yet'}]
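A word of caution when copying nested dictionaries: a dictionary's `.copy()` method makes a *shallow* copy, meaning the nested lists and dictionaries are shared with the original. The sketch below, on a made-up two-person tree, contrasts it with `copy.deepcopy` from the standard library.

```python
import copy

# Made-up miniature family tree.
original = {'name': 'Dad', 'children': [{'name': 'Me'}]}

shallow = original.copy()         # new outer dict, same nested objects
deep = copy.deepcopy(original)    # everything is copied, at every level

# Editing a nested value through the shallow copy changes the original too.
shallow['children'][0]['name'] = 'changed'
name_after_shallow_edit = original['children'][0]['name']

# Editing through the deep copy leaves the original alone.
deep['children'][0]['name'] = 'changed again'
name_after_deep_edit = original['children'][0]['name']

print(name_after_shallow_edit, name_after_deep_edit)
```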

In [39]:
with open('data/new_tree.json', 'w') as f:
    json.dump(updated_family_tree, f)
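To check that a saved dictionary survives the trip to disk and back, we can `json.dump` it and then `json.load` it again. This sketch assumes a made-up miniature tree and a temporary file. It also passes the optional `indent` argument, which makes the saved file easier for humans to read.

```python
import json
import os
import tempfile

# Made-up miniature family tree and a temporary output path.
tree = {'name': 'Dad', 'children': [{'name': 'Me'}, {'name': 'Brother'}]}
path = os.path.join(tempfile.mkdtemp(), 'tree.json')

with open(path, 'w') as f:
    json.dump(tree, f, indent=2)    # indent=2 puts each entry on its own line

with open(path, 'r') as f:
    raw_text = f.read()             # the file's contents, as plain text

with open(path, 'r') as f:
    reloaded = json.load(f)         # the file's contents, as a dictionary

print(raw_text)
print(reloaded == tree)
```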

## More on strings

### f-strings

- f-strings in Python provide a convenient way to format strings (the "f" stands for "format").
- To create an f-string, create a string with the character `f` **right before** the opening quote. Then, anything in the string that is inside `{curly brackets}` will be evaluated as a Python expression, and the result will be inserted into the string.

In [45]:
f'2 + 3 = {2 + 3}'

'2 + 3 = 5'

In [46]:
def make_greeting(name):
    return f'Hi {name}! 👋 Your name has {len(name)} characters, the first of which is {name[0]}.'

In [47]:
make_greeting('Billy')

'Hi Billy! 👋 Your name has 5 characters, the first of which is B.'
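f-strings can also control *how* a value is displayed, by placing a format specification after a colon inside the curly brackets. This goes slightly beyond the examples above. The variable names and values below are made up for illustration.

```python
price = 1234.5678
share = 0.256
episode = 7

two_decimals = f'{price:.2f}'       # round to 2 decimal places
with_commas = f'{price:,.2f}'       # also add a thousands separator
as_percent = f'{share:.1%}'         # multiply by 100 and add a % sign
zero_padded = f'{episode:03d}'      # pad with zeros to a width of 3

print(two_decimals, with_commas, as_percent, zero_padded)
```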

### Multiline strings

So far, we've seen that strings can be created using `'single quotes'` and `"double quotes"`. They can also be created with `'''triple quotes'''`, the benefit being that **strings created with triple quotes can span multiple lines**.

In [48]:
long_string = '''Hi Will,

Hope you're well! Take a look at the edits I made to tomorrow's lecture notebook, and let me know what you think.

Thanks,
Suraj
'''

In [49]:
long_string

"Hi Will,\n\nHope you're well! Take a look at the edits I made to tomorrow's lecture notebook, and let me know what you think.\n\nThanks,\nSuraj\n"

In [50]:
print(long_string)

Hi Will,

Hope you're well! Take a look at the edits I made to tomorrow's lecture notebook, and let me know what you think.

Thanks,
Suraj



### Example: marketing emails

Let's put everything we just learned together. The file `data/users.csv` contains the name, email address, and most recent purchase on Amazon for several individuals. We will:
- Read this file into our notebook.
- Write a custom advertising email for each individual.
- Save each email to the computer.

Note that you will learn how to work with `csv` files more efficiently next week when you learn about `pandas`, but for now, let's use the techniques we've learned.

In [51]:
with open('data/users.csv', 'r') as f:
    users = f.readlines()

In [52]:
users

['Sill Wtyler,swtyler@ucsandiego.edu,a Linguistics 101 Textbook,True\n',
 'Haqian Yuang,hay101@ucsandiego.edu,a MacBook Pro,False\n',
 'Len Bang,lbang@ucsandiego.edu,Ph.D. regalia,True\n',
 'Ruraj Sampure,rsampure@ucsandiego.edu,no-show socks,True\n',
 'Kradeep Phosla,kphosla@ucsandiego.edu,a UCSD sweater,False']
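Each line here is a string that still needs to be split on commas. Splitting by hand works for this file, but it breaks if a value itself contains a comma. As a hedged alternative, the built-in `csv` module handles quoted values. In the sketch below, the first row comes from the file above, while the second row is made up to show a purchase containing a comma.

```python
import csv
import io

# First row is from users.csv; the second row is hypothetical.
raw = (
    'Sill Wtyler,swtyler@ucsandiego.edu,a Linguistics 101 Textbook,True\n'
    'Example Person,person@example.com,"socks, no-show",False\n'
)

# io.StringIO lets us treat a string like an open file.
rows = list(csv.reader(io.StringIO(raw)))

# A plain split breaks the quoted purchase into two pieces.
naive = raw.splitlines()[1].split(',')

print(rows[1])
print(naive)
```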

Let's write a function that produces an email for each user. We'll use a little bit of Markdown and HTML to make the email look somewhat "legit".

In [53]:
def format_email(user):
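    # Strip the trailing '\n' (if any) before splitting, so that the last line
    # of the file, which has no '\n', is handled the same way as the others
    name, email, purchase, member = user.strip().split(',')
    member = member == 'True' # If the last element in the line is 'True' they are a member, otherwise they're not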
    
    img_path = "https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Amazon_logo.svg/1200px-Amazon_logo.svg.png"
    
    out = f'''<br>
<img src="{img_path}" width=10%>

### Purchase Followup

Hi {name},

We hope you enjoyed your recent purchase of {purchase}.

'''
    
    if member:
        out += '''As an Amazon Prime member, we encourage you to check out Amazon Prime, home to over 3,000 TV shows and movies on demand.'''
        
    else:
        out += '''Interested in free two-day shipping and access to over 3,000 TV shows and movies on demand? Sign up for Amazon Prime.
        Click [here](#) to sign up for a free 14-day trial, no strings attached.'''
        
    return out

The following cell doesn't tell us much:

In [54]:
format_email(users[0])[:1000]

'<br>\n<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Amazon_logo.svg/1200px-Amazon_logo.svg.png" width=10%>\n\n### Purchase Followup\n\nHi Sill Wtyler,\n\nWe hope you enjoyed your recent purchase of a Linguistics 101 Textbook.\n\nAs an Amazon Prime member, we encourage you to check out Amazon Prime, home to over 3,000 TV shows and movies on demand.'

Instead, we can use the `Markdown` function from `IPython.display` to format our email nicely.

In [55]:
from IPython.display import Markdown

In [56]:
Markdown(format_email(users[0]))

<br>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Amazon_logo.svg/1200px-Amazon_logo.svg.png" width=10%>

### Purchase Followup

Hi Sill Wtyler,

We hope you enjoyed your recent purchase of a Linguistics 101 Textbook.

As an Amazon Prime member, we encourage you to check out Amazon Prime, home to over 3,000 TV shows and movies on demand.

Now, let's write each email to a file. First, let's create an `emails` folder on our computer. We can do this using the `os` module in Python.

In [58]:
import os

In [59]:
os.makedirs('emails', exist_ok=True) # unlike os.mkdir, doesn't raise an error if the folder already exists

Time for a `for`-loop!

In [61]:
for user in users:
    email_content = format_email(user)
    email = user.split(',')[1]
    with open(f'emails/{email}.md', 'w') as f:
        f.write(email_content)
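To confirm that the loop wrote one file per user, we can list the folder's contents with `os.listdir`. The sketch below is self-contained: it assumes a temporary folder and two made-up email addresses and bodies, rather than the real `emails` folder.

```python
import os
import tempfile

# Hypothetical stand-ins for the emails folder and the formatted emails.
emails_dir = os.path.join(tempfile.mkdtemp(), 'emails')
os.makedirs(emails_dir, exist_ok=True)
os.makedirs(emails_dir, exist_ok=True)   # running it twice causes no error

messages = {'a@example.com': 'Hi A', 'b@example.com': 'Hi B'}

for address, body in messages.items():
    # os.path.join builds the path with the right separator for this computer
    with open(os.path.join(emails_dir, f'{address}.md'), 'w') as f:
        f.write(body)

written = sorted(os.listdir(emails_dir))
print(written)
```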

Now, all we need to do is find a way to send each email. That's beyond the scope of what we'll look at, but know that it's possible!

### Aside: docstrings

Another common use for multiline strings is in creating **docstrings**. A docstring is a (typically) multiline string placed at the start of a function's body.
- Docstrings usually explain what the function's purpose is.
- They often contain "doctests", or example behavior of the function.
- **Important:** When you run `help(func)` or `func?`, the docstring appears!

In [62]:
def harmonic_mean(a, b):
    '''Computes the harmonic mean between two numbers.
       >>> harmonic_mean(30, 60)
       40.0
       >>> harmonic_mean(30, 90)
       45.0
    '''
    return 2 / (1 / a + 1 / b)

In [63]:
harmonic_mean(60, 80)

68.57142857142857

In [64]:
help(harmonic_mean)

Help on function harmonic_mean in module __main__:

harmonic_mean(a, b)
    Computes the harmonic mean between two numbers.
    >>> harmonic_mean(30, 60)
    40.0
    >>> harmonic_mean(30, 90)
    45.0



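The following allows us to verify that our code behaves as we expected! It runs each example in the docstring and compares the result to the output written below it. If every example passes, nothing is printed.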

In [65]:
import doctest

In [66]:
doctest.run_docstring_examples(harmonic_mean, {'harmonic_mean': harmonic_mean})
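Since passing doctests print nothing, it can be reassuring to see a count of how many examples ran. The sketch below is one possible way to get that count, using the lower-level `DocTestParser` and `DocTestRunner` classes from the `doctest` module. It redefines `harmonic_mean` with a single example so that it is self-contained.

```python
import doctest

def harmonic_mean(a, b):
    '''Computes the harmonic mean of two numbers.
    >>> harmonic_mean(30, 60)
    40.0
    '''
    return 2 / (1 / a + 1 / b)

# Pull the examples out of the docstring...
parser = doctest.DocTestParser()
test = parser.get_doctest(
    harmonic_mean.__doc__,               # the text containing the examples
    {'harmonic_mean': harmonic_mean},    # names the examples are allowed to use
    'harmonic_mean',                     # a label for the test
    None,                                # no source file name
    0,                                   # starting line number
)

# ...then run them and count the results.
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

print(f'{results.attempted} attempted, {results.failed} failed')
```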

### Aside: Python versions

- In this course, we've used Python 3, which was released in 2008. Most people writing **new** Python code write in Python 3.
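- However, on the internet you'll often run into older code written in Python 2. **Python 3 is not backwards compatible with Python 2!**
    - Some Python 2 code doesn't work in Python 3. For example, `print 'hi'` is valid in Python 2 but a syntax error in Python 3, where `print` is a function.
    - Python 2 reached its end of life on January 1, 2020, and no longer receives updates.
- When using code you find on the internet, make sure it's written for the right Python version (3).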