# Agenda, week 3

1. Q&A
2. Dictionaries
    - What are they?
    - How to create them
    - How to work with them
    - How to retrieve from them
    - How to iterate over them
    - How dicts are implemented behind the scenes
    - Three paradigms for dict use
3. Files
    - Reading from (text) files
    - Iterating over text files
    - Writing to files (a little bit)
    - The `with` statement -- what it does, and why we want it

In [1]:
# what does enumerate do?
# the short answer: it gives us a numeric index for each element of something we iterate over

# slightly longer answer: In other languages, we often use an index to retrieve a value.
# that means we automatically have the index. But in Python, we don't -- we have just the values.
# enumerate gives us "back" the index, along with the value, in each iteration

s = 'abcd'

for index, one_character in enumerate(s):
    print(f'{index}: {one_character}')

0: a
1: b
2: c
3: d


# What's happening in the above code:

1. `for` turns to the value at the end of the line, which is `enumerate(s)`.
2. `enumerate` is a builtin function that is designed for use inside of a `for` loop, and can only take an argument that is itself iterable. (So yes to strings, lists, and tuples. No to integers.)
3. `for` asks `enumerate(s)`: Are you iterable? The answer is "yes."
4. Repeatedly, `for` asks `enumerate(s)` to give us the next value.
5. In this case, the value of each iteration is a 2-element tuple. The first element is the index, starting with 0. The second item is the value from `s` (what `enumerate` is wrapping) that is associated with that index. In other words, we'll get `(0, 'a')`, `(1, 'b')`, `(2, 'c')`, and `(3, 'd')`.
6. Because we're running the `for` loop with two loop variables, Python uses "tuple unpacking" to take the two elements of our tuple and assign them in parallel to the two variables. So if the current loop value is `(0, 'a')`, then `index` will be assigned `0`, and `one_character` will be assigned `'a'`.
7. The loop body runs.
8. We go back to step 4, stopping when we run out of values.
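
To see the values from steps 5 and 6 for yourself, here's a small sketch. It uses the builtin `list` to collect everything that `enumerate(s)` would hand to the `for` loop; the variable name `pairs` is just something I made up for this illustration.

```python
s = 'abcd'

pairs = list(enumerate(s))   # collect every value that enumerate(s) would give to `for`
print(pairs)                 # [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]

index, one_character = pairs[0]   # tuple unpacking, just as `for` does on each iteration
print(index)                      # 0
print(one_character)              # a
```

Each element of `pairs` is one of the 2-element tuples described above, and the unpacking line does by hand what the `for` loop does for us automatically.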

# Where do we now stand?

- Python uses lots of values, and each value has a different type
- We can assign these values to variables, and then reuse the values
- Data structures we've seen so far:
    - integers and floats (numbers)
    - strings (for text)
    - lists and tuples (for sequences of values)

Many times, we don't want one piece of data at a time, though. We want combinations of data.

You could have a list of tuples, or a list of lists, etc.  In fact, we often do such a thing in Python.

But dictionaries are really the most important data structure in Python because they combine flexibility with speedy retrieval. We can build very complex combinations of data structures using dicts, and still know that things will run quickly and flexibly.

# What is a dictionary?

What we call a dictionary (or `dict`) in Python is not new to Python, and not unique to the language. If you've used another language before, you might have heard of similar data structures:

- hash table
- hash
- hash map
- map
- associative array
- key-value store
- name-value store

All of these are basically (or exactly) the same idea: Each item in the dict contains two parts, a "key" and a "value." Because so many things in life can be classified in this way as two-part data structures, dicts are very popular and useful.

You can think of a dict as a list in which we control the index. And the index can be just about any value. In a list, we are stuck with the indexes 0, 1, 2, 3, etc., and we know that the final value will be at `len(mylist) - 1`. In a dict, we can use any values we want for keys, and they can be in any order. Yes, this means we can use *strings* as keys. 

In many ways, this means that dicts are often self-documenting, where the keys aren't random numbers, or unconnected to the data. Rather, the keys are inherently part of the data and reflect its values.

# Some rules for a dict

The term "key" in a dict is basically the same as the "index" in a string, list, or tuple, except that we can choose it. It isn't chosen for us, and there isn't any way to get Python to choose it for us.

## Keys
- Anything at all, so long as *it is immutable*! This means that we'll usually use integers and strings for dict keys
- In a given dict, the keys must be unique. In other words, no key can repeat.
- Every key has a value, and every value has a key. There is no way for a key to not have a value, although it could have a value of `None` or 0 or the empty string.

## Values
- Values can be absolutely anything in Python, without any restrictions
- This means that values can indeed repeat, even though keys cannot.

# Dict syntax

- We use `{}` to create dictionaries. (These are completely unrelated to the `{}` we use inside of f-strings.)
- Inside of a dict definition, each key is separated from its value with `:`
- The key-value pairs in a dict are separated by `,`
- A dict may contain any number of key-value pairs, from zero (the empty dict, `{}`) to as many as will fit in your computer's memory
- It's usual for keys to all be of one type, but that's not a rule
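
To tie the rules and the syntax together, here's a short sketch. The variable names and values are invented for illustration. Notice that if the same key appears twice in a dict definition, Python doesn't complain; the last value simply wins.

```python
# values may repeat: both 'a' and 'b' have the value 10
d = {'a': 10, 'b': 10}
print(len(d))      # 2

# keys may not repeat: if the same key appears twice, the last value wins
e = {'a': 10, 'a': 20}
print(e)           # {'a': 20}

# keys are usually all of one type, but they don't have to be
mixed = {1: 'one', 'two': 2, (3, 4): 'tuple key'}
print(mixed[(3, 4)])   # tuple key
```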


In [2]:
# here, I'm defining a dict with 3 key-value pairs
# always count dicts in pairs

d = {'a':10, 'b':20, 'c':30}

In [3]:
len(d)

3

In [4]:
# the empty dict

d = {}
len(d)

0

In [5]:
# to retrieve a value from a dict, just pass the key inside of []

d = {'a':10, 'b':20, 'c':30}

d['a']

10

In [6]:
d['b']

20

In [7]:
d['c']

30

In [8]:
d['A']  # what about this?  It won't work; the key 'A' does not exist, even though the key 'a' does

KeyError: 'A'

In [9]:
d['a ']   # this won't work either: the key 'a ' (with a trailing space) is a different string from 'a', and isn't a key in d

KeyError: 'a '

In [10]:
d['ab']  # This won't work either; the key 'a' and the key 'b' both have nothing to do with the key 'ab', which doesn't exist

KeyError: 'ab'

In [11]:
# what if I don't put the key in quotes?

d[ab]  # this means: find the value of the variable ab, and use its value as a key to retrieve from d

NameError: name 'ab' is not defined

In [12]:
k = 'a'    # the variable k now exists
d[k]       # the variable k's value, aka 'a', will be used as the key for retrieval

10

In [13]:
# if I were to put quotes around the variable name...

d['k']

KeyError: 'k'

In [14]:
d

{'a': 10, 'b': 20, 'c': 30}

In [16]:
# I can search in a dict with the "in" operator -- but it only looks in the keys, never the values

'a' in d    # this means: is 'a' (a string) a key in the dict d?

True

In [17]:
'x' in d

False

In [18]:
# we can use a combination of "in" and [] to retrieve from a dict only if the key exists
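
Here's a minimal sketch of that combination. The list of keys to try (`'a'` and `'x'`) is just my choice for the demonstration.

```python
d = {'a': 10, 'b': 20, 'c': 30}

for k in ['a', 'x']:      # try one key that exists, and one that doesn't
    if k in d:            # only use [] when we know the key is there
        print(f'{k}: {d[k]}')
    else:
        print(f'{k} is not a key in d')
```

Because we check with `in` first, we never hit the `KeyError` that we saw in the earlier cells.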


In [20]:
# will this work?

# when assigning, the right side runs before the left
# meaning: we create our dict first

# this dict should have one key-value pair
# the key will be whatever the value of the variable a is
# the value will be 10

d = {a:10}   # this is *NOT* a dict with a key 'a' (a string), but rather a key with whatever the variable a contains

NameError: name 'a' is not defined

In [21]:
a   # this means: give me the value of the variable a

NameError: name 'a' is not defined

In [22]:
'a'   # this is a one-character string

'a'

In [23]:
d = {'a':10}

d

{'a': 10}

In [26]:
# or...

a = 'hello'
d = {a:10}  # here, it'll grab the value of the variable a, and use it as the key

In [25]:
d

{'hello': 10}

# Where are dicts useful?

- A dict with keys that are names of months, and values that are numbers of months
- A dict with keys that are month numbers, and values that are month names
- A dict with keys being user ID numbers and the values are usernames
- A dict with keys being user ID numbers and the values are dicts describing each user
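
As a rough sketch, two of these ideas might look like the following. All of the names, ID numbers, and e-mail addresses are made up for the example.

```python
# keys are month names, values are month numbers
month_numbers = {'Jan': 1, 'Feb': 2, 'Mar': 3}

# keys are user ID numbers, values are dicts describing each user
users = {101: {'username': 'alice', 'email': 'alice@example.com'},
         102: {'username': 'bob', 'email': 'bob@example.com'}}

print(month_numbers['Feb'])     # 2
print(users[102]['username'])   # the first [] gets the inner dict, the second [] gets a value from it
```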

# Exercise: Restaurant

1. Define a dict, `menu`, in which the keys are strings (names of items on the menu) and the values are integers (prices of items on the menu).
2. Define `total` to be 0.
3. Ask the user, repeatedly, to order something from the menu.
    - If the user enters an empty string, that's a sign to stop asking them. Break out of the loop, and print `total`.
    - If the user enters a string that *is* an item on the menu (i.e., a key in the `menu` dict), then print the price of the item and the new total. (And add to `total`.)
    - If the string does *not* match any key in the `menu` dict, scold the user but let them try again.
4. At the end, print the total owed.

Example:

    Order: sandwich
    sandwich is 20, total is 20
    Order: apple
    apple is 3, total is 23
    Order: elephant
    Sorry, we're fresh out of elephant today!
    Order: [ENTER]
    Your total is 23

In [29]:
menu = {'sandwich':20, 'apple':3, 'cake':10, 'tea':5}
total = 0

while True:   # infinite loop!
    order = input('Order: ').strip()
    
    if order == '':    # if the user enters the empty string, break out of the loop
        break

    elif order in menu:        # is the user's input a key in our menu dict?
        price = menu[order]  # grab the price (value) from the menu, based on the user's input
        total += price       # update the total
        print(f'{order} is {price}, total is now {total}')
    else:
        print(f'Sorry, we are fresh out of {order} today!')

print(f'total = {total}')    

Order:  sandwich


sandwich is 20, total is now 20


Order:  apple


apple is 3, total is now 23


Order:  tea


tea is 5, total is now 28


Order:  cookie


Sorry, we are fresh out of cookie today!


Order:  


total = 28


In [31]:
# SB

menu = {'burger':10, 'pizza':15, 'fries':5}

total = 0

while True:
    order = input('Please order something from menu: ').strip()
    
    if order == '':
        print(f'total={total}')
        break
    elif order in menu:
        total += menu[order]
        print(f'total={total}')

Please order something from menu:  burger


total=10


Please order something from menu:  burgetr
Please order something from menu:  burger


total=20


Please order something from menu:  


total=20


# `while` loops

`while` is kind of like `if`:

- It looks to its right
- If it sees a `True` value, then it executes its block

However, it's also different from an `if`:
- If it sees a `False` value to its right, then the loop exits, and does not run the loop body
- At the end of the loop block, Python returns to the top of the loop, and again evaluates the expression to the right of `while`

Here, we gave an explicit `True` to the right of `while`, meaning that it'll always run the loop block, over and over, forever. So we need a way to ensure that the `while` loop will eventually exit -- and that's our `break` statement.
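
For comparison, here's a small sketch of a `while` loop whose condition eventually becomes `False` on its own, so no `break` is needed. The variable `countdown` is just an illustration.

```python
countdown = 3

while countdown > 0:      # re-evaluated at the top of every iteration
    print(countdown)
    countdown -= 1        # without this line, the loop would never end

print('Done!')
```

This prints 3, 2, and 1, and then `Done!`, because after the third iteration `countdown > 0` is `False` and the loop exits.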

# Are dicts mutable?

Short answer: Yes!

Longer answer: Yes, indeed!

What does it mean for a data structure to be mutable:

- We can modify its elements
- We can add new elements to it
- We can remove existing elements from it

In the case of a dict, each "element" is really a key-value pair.

In [32]:
# can we change the value associated with a key? Yes, with assignment!

d = {'a':10, 'b':20, 'c':30}

d['b'] = 999   # here, I'm modifying the existing dictionary
d

{'a': 10, 'b': 999, 'c': 30}

In [33]:
# can we add new key-value pairs? Yes!
# with a list, we need to append to the list, and that adds one more item
# with a dict, we .. just assign!

d['z'] = 888
d

{'a': 10, 'b': 999, 'c': 30, 'z': 888}

There's no difference between adding a new key-value pair and updating the value for an existing key:

- If the key exists, we update the value
- If the key doesn't exist, we add the new key-value pair



In [34]:
d


{'a': 10, 'b': 999, 'c': 30, 'z': 888}

In [35]:
d['a'] += 1    # this is the same as "d['a'] = d['a'] + 1" , we'll update the value d['a'] 

In [36]:
d

{'a': 11, 'b': 999, 'c': 30, 'z': 888}

In [37]:
# what if I try to use += on a key that doesn't exist?

d['q'] += 1

KeyError: 'q'

In [38]:
# how do we remove a key-value pair?
# we use the "pop" method, specifying the key we want to remove
# the value is returned, and the key-value pair is deleted

d

{'a': 11, 'b': 999, 'c': 30, 'z': 888}

In [39]:
d.pop('z')

888

In [40]:
d

{'a': 11, 'b': 999, 'c': 30}

# PM asks: how do we delete a dict?

In Python, you cannot directly delete a value. You can delete a variable referring to a value, and when no more variables are referring to that value, the value will be "garbage collected," and the memory released.

You can use the `del` statement to delete a variable. (It looks a bit like a function call, but `del` is a keyword in Python, not a function.)

Easier: Assign the variable to something else.
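
Here's a small sketch of `del` in action. The second variable, `e`, is only there to show that deleting a variable doesn't delete the value, so long as something else still refers to it.

```python
d = {'a': 10, 'b': 20, 'c': 30}
e = d          # now two variables refer to the same dict

del d          # removes the variable d, not the dict itself

print(e)       # the dict is still around, because e still refers to it
```

If we now tried to use `d`, we would get a `NameError`, just as with any variable that was never defined.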

In [41]:
d = {'a':10, 'b':20, 'c':30}

d = {}   # empty dict -- totally new, different dict -- the previous one is released from memory

In [43]:
d = {'a':10, 'b':20, 'c':30}

# to delete a key-value pair, I invoke dict.pop, and I pass the key
# the key-value pair will be removed, and the value will be returned

d.pop('a')  # this removes the key-value pair with 'a'

10

In [44]:
d.pop('b')  # this removes the key-value pair with 'b'

20

In [45]:
d.pop('c') #this removes the key-value pair with 'c'

30

In [46]:
d  # what remains? nothing, we have an empty dict

{}

In [47]:
d.pop('a')  # trying to remove the key-value pair with 'a' results in a KeyError

KeyError: 'a'

In [48]:
# how can I comment all four of the "print" invocations in this cell?
# is there a "multi-line comment" in Python?

# no, but... many people use a triple-quoted string above and below the stuff they want to comment

"""
print('a')
print('b')
print('c')
print('d')
"""

a
b
c
d


In [49]:
dict.popleft

AttributeError: type object 'dict' has no attribute 'popleft'

# Next up

- Accumulating with dicts
- Accumulating the unknown (!)
- Iterating over dicts

# Accumulating with dicts

We've seen how we can keep track of values in variables. If I want to keep track of how many odd vs. even numbers there are, I can do that with variables.

In [50]:
odds = 0
evens = 0

numbers = [10, 15, 20, 25]   # list of ints

for one_number in numbers:
    if one_number % 2 == 0:  # if there's a remainder of 0 after dividing by 2, it's even!
        evens += 1
    else:
        odds += 1     

print(f'odds = {odds}')        
print(f'evens = {evens}')

odds = 2
evens = 2


In [51]:
# in many cases, it's easier to keep track of things if we use a dict
# we get the same results, but semantically, they're kept together
# we can then print/pass/use them together, too.

counts = {'odds':0, 'evens':0}

numbers = [10, 15, 20, 25]   # list of ints

for one_number in numbers:
    if one_number % 2 == 0:  # if there's a remainder of 0 after dividing by 2, it's even!
        counts['evens'] += 1
    else:
        counts['odds'] += 1     

print(counts)

{'odds': 2, 'evens': 2}


# Exercise: Vowels, digits, and others (dict edition)

1. Define a dict in which `vowels`, `digits`, and `others` are all keys. All three values should be 0.
2. Ask the user to enter a string.
3. Go through each character in the string:
    - If it's a vowel, add 1 to `vowels`
    - If it's a digit, add 1 to `digits`
    - Otherwise, add 1 to `others`
4. Print the entire dict (no reason to break it apart)

In [52]:
counts = {'vowels':0,
          'digits':0,
          'others':0}

text = input('Enter text: ').strip()

for one_character in text:
    if one_character in 'aeiou':
        counts['vowels'] += 1
    elif one_character.isdigit():
        counts['digits'] += 1
    else:
        counts['others'] += 1

print(counts)    

Enter text:  hello!! 123


{'vowels': 2, 'digits': 3, 'others': 6}


In [53]:
# SB

d = {'vowels':0, 'digits':0, 'others':0}

val = input('Enter a string: ').strip()

for one_character in val:
    if one_character in 'aeiou':
        d['vowels'] += 1
    elif one_character.isdigit():
        d['digits'] += 1
    else:
        d['others'] += 1
        
print(d)

Enter a string:  hello!! 123


{'vowels': 2, 'digits': 3, 'others': 6}


# Exercise: Vowels, digits, and others (dict edition, part 2)

Now, I want you to create a dict whose keys are still `vowels`, `digits`, and `others`. But the values in the dict should start off as empty *lists*. 

When you find a character of each type (vowel, digit, other), then `append` the character to the appropriate list.

In the end, you won't have a count for each type of character, but rather a list of characters from the user's input text categorized.

In [54]:
chars  = {'vowels':[],
          'digits':[],
          'others':[]}

text = input('Enter text: ').strip()

for one_character in text:
    if one_character in 'aeiou':
        chars['vowels'].append(one_character)
    elif one_character.isdigit():
        chars['digits'].append(one_character)
    else:
        chars['others'].append(one_character)

print(chars)    

Enter text:  hello!! 123


{'vowels': ['e', 'o'], 'digits': ['1', '2', '3'], 'others': ['h', 'l', 'l', '!', '!', ' ']}


In [56]:
# SB

d = {'vowels':[], 'digits':[], 'others':[]}

val = input('Enter a string: ').strip()

for one_character in val:
    if one_character in 'aeiou':
        d['vowels'].append(one_character)
    elif one_character.isdigit():
        d['digits'].append(one_character)
    else:
        d['others'].append(one_character)
print(d)

Enter a string:  hello!! 123


{'vowels': ['e', 'o'], 'digits': ['1', '2', '3'], 'others': ['h', 'l', 'l', '!', '!', ' ']}


# Accumulating the unknown

So far, we've used dictionaries to track things that we know to look for: I'm looking for odd/even numbers, so I create a dict with "odd" and "even" keys. I'm looking for vowels/digits/others, so I create a dict with those keys.

But what if I don't know what I'm looking for? You might not know the values in advance, or there might be so many that it's not worth creating the dict with all of those keys.

For example, say that I want to count the characters in a string. I could create a dict in which every single possible character that Python recognizes is a key, and we start with 0 as the value for each.  That's a huge waste of memory and CPU.

Instead, I can start with an empty dict. As I encounter a character, I can check:

- If the character is already a key in the dict, add 1 to the count
- If not, then add a key-value pair with this character as a key and 1 as the count

In [57]:
counts = {}

text = input('Enter text: ').strip()

for one_character in text:
    if one_character in counts:    # is one_character already a key in counts?
        counts[one_character] += 1 #   add 1 to the count
    else:
        counts[one_character] = 1  # otherwise, add a new key-value pair

print(counts)        

Enter text:  this is the most interesting program ever!


{'t': 5, 'h': 2, 'i': 4, 's': 4, ' ': 6, 'e': 5, 'm': 2, 'o': 2, 'n': 2, 'r': 4, 'g': 2, 'p': 1, 'a': 1, 'v': 1, '!': 1}


# Exercise: Rainfall

1. Define an empty dict, `rainfall`. We will, over time, add keys (strings, names of cities) and values (ints, mm of rain) based on user input.
2. Ask the user to enter a city name.
3. If they give us an empty city name, stop asking; break out of the loop and print the `rainfall` dict.
4. Ask the user to enter the mm rain that fell in that city.
5. We can assume that we'll get digits.
6. If the city already exists in the dict as a key, just add the new value to the existing one.
7. If the city doesn't yet exist, then add a new key-value pair, with this city and the rainfall.
8. After the loop, print the dict.

Example:

    City: a
    Rain: 5
    City: b
    Rain: 4
    City: a
    Rain: 3
    City: [ENTER]
    {'a':8, 'b':4}

In [58]:
rainfall = {}

while True:
    city_name = input('City: ').strip()

    if city_name == '':     # got empty input? break out of the loop
        break

    mm_rain = input('Rain: ').strip()
    mm_rain = int(mm_rain)   # assume we got digits, and I'll just convert it

    if city_name in rainfall:            # if we have seen this city before...
        rainfall[city_name] += mm_rain   # add mm_rain to the existing amount for this city
    else:
        rainfall[city_name] = mm_rain    # otherwise, add this new key-value pair

print(rainfall)        
        

City:  a
Rain:  5
City:  b
Rain:  4
City:  a
Rain:  3
City:  


{'a': 8, 'b': 4}


# Next up:

- Iterating over dicts
- How dicts work

# How do we iterate over dicts?

We've seen that we can iterate over strings, lists, and tuples. In each case, the data structure dictates what we get with each iteration.

In [59]:
s = 'abcd'

for one_item in s:   # when we iterate over strings, we get the characters
    print(one_item)

a
b
c
d


In [60]:
mylist = [10, 20, 30, 40]

for one_item in mylist:  # when we iterate over lists, we get the items
    print(one_item)

10
20
30
40


In [61]:
t = (10, 20, 30, 40)

for one_item in t:  # when we iterate over tuples, we get the items
    print(one_item)

10
20
30
40


In [62]:
# what about a dict?

d = {'a':10, 'b':20, 'c':30, 'd':40}

for one_item in d:
    print(one_item)

a
b
c
d


Iterating over a dict gives us, one by one, each of the *keys*. 

This makes sense:

- When we search in a dict using `in`, we search only on the keys.
- We need the key to get a value, but we cannot use a value to get a key

Dicts are all about the keys primarily, and the values come along for the ride.

In [63]:
# what if I want to print the key-value pairs in a dict?

for one_key in d:
    print(f'{one_key}: {d[one_key]}')

a: 10
b: 20
c: 30
d: 40


In [64]:
# you might know that there is a dict.keys method
# that returns the keys of a dict
# so you can iterate over that, instead

# you'll get the same answer as iterating over d... but why invoke the keys() method?
# there's almost never any good reason to invoke dict.keys

for one_key in d.keys():
    print(f'{one_key}: {d[one_key]}')

a: 10
b: 20
c: 30
d: 40


In [65]:
# there is also a method called dict.items
# with each iteration, dict.items returns a 2-element tuple with the (key, value)

for one_pair in d.items():
    print(one_pair)

('a', 10)
('b', 20)
('c', 30)
('d', 40)


In [67]:
# I could do tuple unpacking!

for one_pair in d.items():
    key, value = one_pair     # assign the 2 elements of one_pair to key, value
    print(f'{key}: {value}')

a: 10
b: 20
c: 30
d: 40


In [68]:
# we can cut down the code and do it all in two lines:

for key, value in d.items():
    print(f'{key}: {value}')

a: 10
b: 20
c: 30
d: 40


In [69]:
# there is a dict.values method. You can use it to search on the values or iterate over them
# the values are not guaranteed to be unique, but they will be in the same order as the keys
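
A minimal sketch of both uses of `dict.values`; the variable `total` is just for this illustration.

```python
d = {'a': 10, 'b': 20, 'c': 30, 'd': 40}

print(20 in d.values())        # search in the values, rather than in the keys

total = 0
for one_value in d.values():   # iterate over the values, in the same order as the keys
    total += one_value

print(total)
```

The first `print` shows `True`, and the second shows `100`, the sum of all four values.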

# Exercise: Odds and evens

1. Define a dict, `odds_and_evens`, with two keys, `odds` and `evens`. The values should both be empty lists.
2. Ask the user to enter a string containing numbers separated by spaces.
3. Break that string apart into separate numbers/words, and iterate over that list.
    - If the item is non-numeric, then scold the user and move on
4. If the item is numeric, then convert it to an integer
5. Decide if the number is odd or even, and append it to the appropriate list.
6. Iterate over the dict, printing each key-value pair.

Example:

    10 11 hello 15 18
    adding 10 to evens
    adding 11 to odds
    ignoring hello; not numeric
    adding 15 to odds
    adding 18 to evens
    evens: [10, 18]
    odds: [11, 15]

In [74]:
# odds_and_evens is a dict with strings for keys (odds/evens) and lists for values (starting with [])
odds_and_evens = {'odds':[],
                  'evens':[]}

text = input('Enter numbers: ').strip()

# go through each item in the user's input
for one_item in text.split():

    if not one_item.isdigit():  # if one_item contains non-digit characters...
        print(f'{one_item} is not numeric; ignoring')
        continue
        
    n = int(one_item)   # one_item is a string; we invoke int() on it to get an integer based on it

    if n % 2 == 0:    # if dividing by 2 gives us a remainder of 0... it's even
        odds_and_evens['evens'].append(n)
    else:
        odds_and_evens['odds'].append(n)

# print a report
for key, value in odds_and_evens.items():
    print(f'{key}: {value}')

Enter numbers:  10 hello 15


hello is not numeric; ignoring
odds: [15]
evens: [10]


In [79]:
# SB

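eng = {'odds':[], 'evens':[]}

text = input('Enter a string: ').strip()

text = text.split()

for one_charac in text:
    if not one_charac.isdigit():
        print(f'{one_charac} is not numeric')
    else:   # this else must be indented to line up with the if, not with the for
        n = int(one_charac)
        if n % 2 == 0:
            eng['evens'].append(n)
        else:
            eng['odds'].append(n)

print(eng)

Enter a string:  10 hello 15


hello is not numeric
{'odds': [15], 'evens': [10]}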


In [76]:
s = 'abcd efgh ikjl'

s.split()

['abcd', 'efgh', 'ikjl']

In [81]:
text = '10'
int(text)  # int returns a new integer based on the string text

10

In [82]:
text = 'tada!'
int(text)  # int cannot create an integer from this string, so we get a ValueError

ValueError: invalid literal for int() with base 10: 'tada!'

In [83]:
text = 'tada!'

if text.isdigit():
    int(text)  # int returns a new integer based on the string text
else:
    print(f'Cannot turn {text} into an integer')

Cannot turn tada! into an integer


In [86]:
text = '9876'

if text.isdigit():
    int(text)  # int returns a new integer based on the string text
    print(int(text))
else:
    print(f'Cannot turn {text} into an integer')

    

9876


# How dicts work

We've seen that dicts have a bunch of rules associated with them:

- Every key has a value, every value has a key
- Keys must be immutable
- You can get from the key to the value, but not vice versa
- Search in a dict is very, very fast (`'a' in d` will give you a very fast answer.)

Dictionaries are stored in a very clever way: When we have a key-value pair, and Python needs to decide where to store it in memory, it invokes a special function, `hash`, on the key. The value Python gets back from `hash(key)` determines the location in memory where it will be stored.

The key itself dictates where the pair will be put in memory.

This means that if we want to search in a dict, we can do that very easily: When we say `'a' in d`, Python runs `hash('a')`, jumps to that location in memory, and finds out whether that key-value pair is there or not.

This is why mutable values cannot be used as dict keys: The `hash` function uses the elements of a Python value when it determines the location in memory. If those values can change, then the location might change, and the keys will get "lost."
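
You can invoke `hash` yourself to get a feel for this. The following is only a sketch: the actual numbers that `hash` returns aren't important (and for strings, they typically change each time you start Python), so I compare hashes rather than print them. It also uses `try`/`except`, which we haven't covered, just to keep the error from stopping the cell.

```python
print(hash('a') == hash('a'))              # within one run of Python, the same key always gives the same hash
print(hash((10, 20)) == hash((10, 20)))    # tuples of immutable values are hashable, too

try:
    hash([10, 20, 30])                     # lists are mutable, so Python refuses to hash them
except TypeError as e:
    print(f'TypeError: {e}')
```

The first two lines print `True`. The third results in a `TypeError`, which is the same problem we run into if we try to use a list as a dict key.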

In [87]:
d = {'a':10, 'b':20, 'c':30}
mylist = [100, 200, 300]

d[mylist] = 999   # store 999 in d (a dict) with the key of "mylist," a list

TypeError: cannot use 'list' as a dict key (unhashable type: 'list')

# Next up: Files!

- Reading
- a little writing

If you want to follow along, you can download a zipfile with some exercise files in it from https://files.lerner.co.il/exercise-files.zip

# Files

You use files all of the time -- Word files, Excel files, PowerPoint files, PDF files, etc.

What is a file? Basically, it's a way to store data structures from a computer's memory into a more permanent/portable location. Once we store those data structures in a file, we can (a) move them to another computer, (b) back them up, (c) read them even after the power has been cycled.

In this course, we're just going to talk about text files, meaning files that contain only text. These aren't as efficient as binary files, but they're far easier to work with and talk about, and they're quite common.

If we want to read a file from disk into Python, what do we need?

To access a file on disk, we need to go through the OS. We need to tell the OS what file we want to use, and if it's available, we'll then get a "file handle," a special object that gives us access to the file via the OS.

In Python, we don't really talk about "file handles," but rather "file objects." 

The way that we get these file objects is by *opening* the file. We invoke the builtin function `open`, and get a file object back. The argument to `open` is a string describing where the file lives on disk. This can be a bit confusing/frustrating if you're new to specifying where files are:

- By default, Python looks for a file in the current directory. If you're using Jupyter, then it'll look in the directory/folder in which Jupyter is running. So if you `open('myfile.txt')`, then it'll look for `myfile.txt` wherever Jupyter is running.
- If you're on Unix/MacOS and have a `/` in the filename (but not at the beginning), that's known as a "relative path," and it starts in the current directory and then goes down from there. So if you `open('a/b/c/myfile.txt')`, it'll go down through the `a` subdirectory, then `b`, then `c`, and only then look for `myfile.txt`.
- If you're on Unix/MacOS and the filename *starts* with `/`, that's known as an absolute filename, and it looks from the "root" of the filesystem down to where you say.
- If you're on Windows, we use `\` rather than `/` to separate paths, and the root isn't `/` (as on Unix) but rather `c:\`.
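
If you aren't sure what the current directory is, or whether a path is relative or absolute, you can ask Python. This sketch uses the `os` module from the standard library, which we haven't discussed in class; the filenames are the same examples as above, and they don't need to exist for this to work.

```python
import os

print(os.getcwd())                          # the current directory, where relative paths start

print(os.path.isabs('myfile.txt'))          # False: a relative path
print(os.path.isabs('a/b/c/myfile.txt'))    # False: still relative, just with subdirectories
print(os.path.isabs('/etc/passwd'))         # on Unix/MacOS, a leading / makes the path absolute
```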

In [93]:
# let's open a file!

f = open('/etc/passwd')   # absolute path, to a file that exists on all Unix systems

In [89]:
type(f)  # what kind of value does the variable f refer to?

_io.TextIOWrapper

In [90]:
f

<_io.TextIOWrapper name='/etc/passwd' mode='r' encoding='UTF-8'>

In [94]:
# how can I read data via this file object?
# the easiest (but not best) way is by invoking the read() method

text = f.read()  # this returns a string with the entire content of the file, starting where we last read from (or the beginning)
text[:1000]

'##\n# User Database\n# \n# Note that this file is consulted directly only when the system is running\n# in single-user mode.  At other times this information is provided by\n# Open Directory.\n#\n# See the opendirectoryd(8) man page for additional information about\n# Open Directory.\n##\nnobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false\nroot:*:0:0:System Administrator:/var/root:/bin/sh\ndaemon:*:1:1:System Services:/var/root:/usr/bin/false\n_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico\n_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false\n_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false\n_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false\n_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false\n_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false\n_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false\n_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/fa

In [95]:
print(text)

##
# User Database
# 
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33

# What's wrong with this picture?

If we read from a file using `read`, it will work! We will get a string!

But:

1. Is a string containing the entire file really useful?
2. A bigger issue: If the file contains 10 TB of text, do we really want to read it into memory all at once?
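
One more thing to keep in mind: `read` starts wherever the previous read left off. So if you invoke `read` a second time on the same file object, you'll get an empty string back. Here's a sketch that shows this. So that it can run on any computer, it first creates a small throwaway file; the filename `example.txt` and its contents are made up, and the `tempfile` module and writing with `'w'` are things we haven't covered yet.

```python
import os
import tempfile

# create a small throwaway file, so that this example works anywhere
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

f = open(path, 'w')
f.write('abc\ndef\n')
f.close()

f = open(path)
first = f.read()     # the entire file, as one string
second = f.read()    # we're already at the end, so there's nothing left to read
f.close()

print(repr(first))   # 'abc\ndef\n'
print(repr(second))  # ''
```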

# An alternative way to read from a file: Iterate over it!

If I iterate over a file object, I will get the next *line* of the file with each iteration. Each iteration will return a string that ends with `'\n'`. At the end of the file, the iteration will stop, and we'll leave the `for` loop.

In [97]:
f = open('/etc/passwd')

for one_line in f:  # one_line will always end with a newline character
    print(one_line) # if we print one_line, every line will end with two newlines, one from the file and one from print

##

# User Database

# 

# Note that this file is consulted directly only when the system is running

# in single-user mode.  At other times this information is provided by

# Open Directory.

#

# See the opendirectoryd(8) man page for additional information about

# Open Directory.

##

nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false

root:*:0:0:System Administrator:/var/root:/bin/sh

daemon:*:1:1:System Services:/var/root:/usr/bin/false

_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico

_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false

_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false

_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false

_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false

_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false

_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false

_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/fal

In [98]:
f = open('/etc/passwd')

for one_line in f:  # one_line will always end with a newline character
    print(one_line, end='')   # pass end='' so that print doesn't add its own newline after each line

##
# User Database
# 
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33
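
Another way to avoid the doubled newlines is to strip the trailing newline from each line before printing it. The sketch below uses `io.StringIO` with made-up sample text as a stand-in for a real file, so that it runs anywhere; `sample` and `stripped_lines` are names introduced only for this illustration.

```python
import io

# Made-up sample text standing in for a real file (an assumption for this sketch)
sample = io.StringIO('first line\nsecond line\nthird line\n')

stripped_lines = []
for one_line in sample:                            # StringIO iterates line by line, like a file
    print(repr(one_line))                          # repr shows the trailing '\n' explicitly
    stripped_lines.append(one_line.rstrip('\n'))   # remove only the trailing newline

print(stripped_lines)
```

Using `rstrip('\n')` changes the string itself, which is handy if we want to store or compare the lines and not just print them.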

# Let's query our file!

I want a list of the usernames on my computer. 

- We know that usernames are in `'/etc/passwd'`.
- We know that in each line, the username is the first field, and that fields are separated by `':'`.
- We also know that there are comment lines before the data actually starts.

In [101]:
f = open('/etc/passwd')

for one_line in f:
    if not one_line.startswith('#'):    # not a comment line?
        print(one_line.split(':')[0])   # print the first field of every non-comment line

nobody
root
daemon
_uucp
_taskgated
_networkd
_installassistant
_lp
_postfix
_scsd
_ces
_appstore
_mcxalr
_appleevents
_geod
_devdocs
_sandbox
_mdnsresponder
_ard
_www
_eppc
_cvs
_svn
_mysql
_sshd
_qtss
_cyrus
_mailman
_appserver
_clamav
_amavisd
_jabber
_appowner
_windowserver
_spotlight
_tokend
_securityagent
_calendar
_teamsserver
_update_sharing
_installer
_atsserver
_ftp
_unknown
_softwareupdate
_coreaudiod
_screensaver
_locationd
_trustevaluationagent
_timezone
_lda
_cvmsroot
_usbmuxd
_dovecot
_dpaudio
_postgres
_krbtgt
_kadmin_admin
_kadmin_changepw
_devicemgr
_webauthserver
_netbios
_warmd
_dovenull
_netstatistics
_avbdeviced
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_launchservicesd
_iconservices
_distnote
_nsurlsessiond
_displaypolicyd
_astris
_krbfast
_gamecontrollerd
_mbsetupuser
_ondemand
_xserverdocs
_wwwproxy
_mobileasset
_findmydevice
_datadetectors
_captiveagent
_ctkd
_applepay
_hidd
_cmiodalassistants
_analyticsd
_fps
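
The loop above prints the usernames. If we want them in an actual list, we can append each one to a list instead of printing it. This is a minimal sketch: `get_usernames` is a hypothetical helper, the sample text is made up, and the blank-line check is an extra precaution that the loop above doesn't need for this particular file.

```python
import io

def get_usernames(lines):
    """Return the first colon-separated field of every non-comment, non-blank line."""
    usernames = []
    for one_line in lines:
        if one_line.startswith('#'):    # skip comment lines
            continue
        if not one_line.strip():        # skip blank lines (a precaution)
            continue
        usernames.append(one_line.split(':')[0])
    return usernames

# Made-up sample in the same format as /etc/passwd
sample = io.StringIO('##\n# User Database\n##\n'
                     'nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false\n'
                     'root:*:0:0:System Administrator:/var/root:/bin/sh\n')

print(get_usernames(sample))
```

Because the function only iterates over its argument, it should work the same way with an open file object as with the `StringIO` stand-in.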

# Exercise: IP counts

1. I asked you earlier to download a small zipfile with example files in it. I want to use `mini-access-log.txt`.
2. Write a program that creates an empty dict, `counts`.
3. The file contains many lines from an Apache Web server I used to run. The first item on each line is the IP address of the client that accessed the server.
4. Grab each line from the file, grab the IP address that is at the front of each line, and then use it as a key in the `counts` dict to count how many times each IP address appeared in the file.
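
The counting step in item 4 can be sketched on its own: if we haven't seen a key yet, we add it with a count of 1; otherwise we add 1 to its existing count. The lines and addresses below are made up, and the exact format of `mini-access-log.txt` isn't shown here, so splitting on whitespace is an assumption. Try the exercise with the real file before leaning on this.

```python
# Made-up lines standing in for the log file; the real format may differ
sample_lines = ['10.0.0.1 - - "GET /index.html"\n',
                '10.0.0.2 - - "GET /about.html"\n',
                '10.0.0.1 - - "GET /contact.html"\n']

counts = {}

for one_line in sample_lines:
    ip_address = one_line.split()[0]    # assumes the IP address is the first whitespace-separated field

    if ip_address in counts:
        counts[ip_address] += 1         # seen before: add 1 to its count
    else:
        counts[ip_address] = 1          # first time: start the count at 1

print(counts)
```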