# Agenda: Day 3

1. Q&A
2. Dictionaries
    - What are they?
    - How to create them
    - How to work with them
    - How to iterate over them
    - Three paradigms for working with dicts
    - How dicts are implemented behind the scenes
3. Files
    - How to work with files, `open` and friends
    - How to take data in from a text file and iterate over it
    - How to write to files (a little)
    - The `with` construct, and how it fits into everything else

In [2]:
print('a')   # by default, print adds a \n to everything it prints
print('b')
print('c')

a
b
c


In [3]:
# print actually takes a bunch of optional arguments, each of which must be specified as a "keyword argument,"
# meaning that it looks like name=value.

# one of those is "end", which takes a string. Whatever you define "end" to be will be added at the end of
# what you print. By default, it's '\n'. You can set it to anything else, including ' ' (space) or '' (empty string).

In [4]:
print('a', end='') 
print('b', end='')
print('c')

abc


In [5]:
print('a', end='*') 
print('b', end='*')
print('c')

a*b*c


In [6]:
print('a', end='xyz') 
print('b', end='xyz')
print('c')

axyzbxyzc


# Where do we stand now?

- Python uses lots of values, and each value has a different type
- We can assign these values to variables, and then reuse them
- The data structures we have used so far:
    - `int` and `float`
    - `str` (for text strings)
    - `list` and `tuple` for collections

Many times, we don't want a single piece of data. Rather, we want a *pair* of values together, or even one value (e.g., an ID number) that then represents a larger, complex value or set of values.

This is where dictionaries really shine. On top of that, dicts turn out to be super flexible and super efficient!

Dicts are the most important data structure in Python -- not only for us, as coders in the language, but also for the language itself, which is often implemented using dicts under the hood.

# What is a dictionary?

What we call a dict in Python is also famous and widely used in other programming languages. They tend to call this data structure by other names:

- Hash tables
- Hash maps
- Maps
- Hashes
- Key-value stores
- Name-value stores
- Associative arrays

The idea is that we aren't storing a single value. Rather, we're storing a *pair* of values. The first part of that pair is similar to the index we've used in a list or tuple *except* that it can be almost anything! The keys, as they're known, can be any immutable type, and we often use integers or strings. You can think of a dict, in some ways, as a list where you can dictate what the indexes are, rather than just accepting 0, 1, 2, etc.

A few rules for dicts:

- Keys can be anything you want, so long as they're immutable -- basically numbers and strings
- In a given dict, each key must be unique
- Each key has exactly one value. The same value can appear under more than one key, though.
- Values can be absolutely anything in Python, with zero limitations.
- Everything you do in a dict works through the keys. Values are accessed via the keys, but not (typically) on their own, and certainly not efficiently.
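
A minimal sketch of these rules in action. The dict names and values here (`d`, `repeated`, `shared`) are made up for illustration:

```python
# keys can be numbers, strings, or tuples (all immutable)
d = {1: 'one', 'two': 2, (3, 4): 'a tuple key'}
print(d[(3, 4)])          # a tuple key

# a list is mutable, so Python refuses to use it as a key
try:
    bad = {[1, 2]: 'a list key'}
except TypeError as e:
    print(f'TypeError: {e}')

# keys are unique: if a key is repeated, the last value wins
repeated = {'a': 1, 'a': 2}
print(repeated)           # {'a': 2}

# values don't have to be unique: two keys can share a value
shared = {'a': 10, 'b': 10}
print(shared['a'] == shared['b'])   # True
```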

# Creating and using dicts

To create a dictionary, use `{}`. 

- You can have an empty dict, `{}`.
- You can set as many key-value pairs as you want when the dict is created.
- Each pair looks like `key:value`, with a `:` between the key and value.
- The pairs are separated by `,` -- just like in a list or tuple

In [7]:
d = {'a':10, 'b':20, 'c':30}   # a dict with 3 key-value pairs

In [8]:
len(d)   # how many pairs?

3

In [9]:
# to retrieve from d, we use [] -- just like in a list or tuple (or string)
# in the [], we put the key, no matter what data type it is

d['a']  # in the [], I put the string 'a'!

10

In [10]:
d['b']

20

In [12]:
d['q']  # if you request the value for a key that doesn't exist, you'll get a KeyError

KeyError: 'q'

In [13]:
# you can search for a key in a dict with the "in" operator
# we've used "in" before for searching in a string, list, or tuple
# here, we *ONLY* search in the keys, not in the values

'a' in d

True

In [14]:
'b' in d

True

In [15]:
'q' in d

False

In [16]:
months = {'Jan':1, 'Feb':2, 'Mar':3, 'Apr':4, 'May':5, 'Jun':6}

months['Jan']

1

In [18]:
months = {1:'Jan', 2:'Feb', 3:'Mar', 4:'Apr', 5:'May', 6:'Jun'}
months[1]

'Jan'

# Exercise: Restaurant

We'll define a dict, `menu`, whose keys are strings (menu items) and whose values are integers (prices). We'll repeatedly ask the user to order something -- and if it's on the menu, we'll add it to `total`.

1. Define `total` to be 0.
2. Define `menu` to be a dict with the restaurant's menu.
3. Repeatedly ask the user to order something:
    - If they give us an empty string, stop asking and print `total`
    - If they order something on the menu, print the price, add to `total`, print the new `total`.
    - If they order something *NOT* on the menu, scold them and let them try again.
4. Print `total`

Example:

    Order: sandwich
    sandwich costs 10, total is 10
    Order: tea
    tea costs 7, total is 17
    Order: elephant
    We're fresh out of elephant today!
    Order: [ENTER]
    Total is 17

In [21]:
menu = {'sandwich':10, 'tea':7, 'apple':2, 'cake':5}
total = 0

while True:    # infinite loop!

    order = input('Order: ').strip()

    if order == '':     # empty string?
        break           # leave the loop
    
    if order in menu:        # meaning: if the user's input is a key in the "menu" dict
        price = menu[order]  # get the price for the item the user ordered
        total += price       # add the price to the total
        print(f'{order} is {price}; total is now {total}')
    else:
        print(f'We are fresh out of {order} today!')

print(f'Total = {total}')    

Order:  sandwich


sandwich is 10; total is now 10


Order:  tea


tea is 7; total is now 17


Order:  orange


We are fresh out of orange today!


Order:  apple


apple is 2; total is now 19


Order:  elephant


We are fresh out of elephant today!


Order:  


Total = 19


# Dictionaries are mutable

We've seen, so far:

- Integers, strings, and tuples are *immutable* -- once defined, they cannot be changed
- Lists are *mutable* -- meaning, they can be changed

For something to be mutable:

- We can modify/update an existing value
- We can add a new value
- We can remove an existing value

All of this is true for dicts, as well!

In [22]:
d = {'a':10, 'b':20, 'c':30}

# I'd like to update an existing value -- how?
# Just assign to the key for that value

d['a'] = 123
d

{'a': 123, 'b': 20, 'c': 30}

In [23]:
d['a'] += 1    # will this work? Yes! It's the same as saying d['a'] = d['a'] + 1
d

{'a': 124, 'b': 20, 'c': 30}

In [24]:
# how do we add a new key-value pair to our dict?
# there is *NO* special method for this, unlike with lists (list.append)
# rather, it's just assignment... the same as we just used to update a value!

d['x'] = 987    # 'x' didn't exist as a key... but now it does!

d

{'a': 124, 'b': 20, 'c': 30, 'x': 987}

In [25]:
# what if I do this?

d['y'] += 123   # what will happen?  This is the same as d['y'] = d['y'] + 123

KeyError: 'y'
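
The `KeyError` happens because `+=` first has to read `d['y']`, and that key doesn't exist yet. One way around it, sketched here on a separate dict (the name `other` is just for illustration, so that `d` stays as it was), is to check with `in` before updating:

```python
other = {'a': 124, 'b': 20, 'c': 30}

if 'y' in other:
    other['y'] += 123     # the key exists, so add to its value
else:
    other['y'] = 123      # the key is new, so assign it for the first time

print(other)   # {'a': 124, 'b': 20, 'c': 30, 'y': 123}
```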

In [26]:
# to remove a key-value pair, we use dict.pop -- similar to list.pop
# we name the key we want to remove, and we get the value returned to us -- plus, the key-value pair is removed

d.pop('x')  

987

In [27]:
d

{'a': 124, 'b': 20, 'c': 30}

# Accumulate with dicts

You've already seen (last week) that we can define an empty list in which we'll store values. As the program continues, we update those values to reflect what we've collected. Meaning, we can use the list to accumulate values over time.

But if you want multiple values, you'll need multiple lists. 

A dict allows us to centralize all of this in one place. At the start of the program, define a dict with keys (that won't change) and values (that start as 0, an empty list, or the like) that will be updated over the course of the program.



In [29]:
counts = {'odds':0, 
          'evens':0}

s = input('Enter a string of digits: ').strip()

for one_character in s:
    if not one_character.isdigit():   # if it isn't a digit, ignore it
        print(f'Ignoring non-number {one_character}')
        continue

    n = int(one_character)   # get an integer from this character

    if n % 2 == 0:             # if dividing by 2 produces no remainder, it's even!
        counts['evens'] += 1   # update the count!
    else:
        counts['odds'] += 1    # it's an odd number, then

print(counts)                  # just print the dict, and get all of the key-value pairs        

Enter a string of digits:  123abc456


Ignoring non-number a
Ignoring non-number b
Ignoring non-number c
{'odds': 3, 'evens': 3}


# Exercise: Vowels, digits, and others (dict edition)

1. Define a dict, `counts`, with three keys: `vowels`, `digits`, and `others`. Set the value to be 0 in all cases.
2. Ask the user to enter a string.
3. Go through the string, one character at a time:
    - If it's a vowel (a, e, i, o, u), add 1 to `counts['vowels']`
    - If it's a digit (0-9), add 1 to `counts['digits']`
    - Otherwise, add 1 to `counts['others']`
4. Print `counts`

In [30]:
counts = {'vowels':0, 
          'digits':0,
          'others':0}

text = input('Enter text: ').strip()

for one_character in text:
    if one_character in 'aeiou':   # is it a vowel?
        counts['vowels'] += 1
    elif one_character.isdigit():  # is it a digit?
        counts['digits'] += 1
    else:
        counts['others'] += 1

print(counts)       
        

Enter text:  hello!! 123


{'vowels': 2, 'digits': 3, 'others': 6}


In [31]:
# YP

counts = {'vowels':0, 'digits':0, 'others':0}
s = input("Enter text: ").strip()
for one_char in s:
    if one_char.lower() in 'aeiou':
        counts['vowels'] += 1
    elif one_char.isdigit():
        counts['digits'] += 1
    else:
        counts['others'] += 1
print(counts)

Enter text:  hello!! 123


{'vowels': 2, 'digits': 3, 'others': 6}


In [32]:
# U1

counts = {'vowels':0, 'digits':0, 'other':0}


s = input('Enter a string: ').strip()

for one_character in s:
    if  one_character in 'aeiou':   
        counts['vowels'] += 1 
        
    elif one_character.isdigit():
        counts['digits'] += 1 
    else: 
        counts['other'] += 1 

print(counts)

Enter a string:  hello!! 123


{'vowels': 2, 'digits': 3, 'other': 6}


# Isn't this weird?

- We define a string with `''` or `""`, but we retrieve characters with `[]`.
- We define a list with `[]`, and retrieve/set with `[]`.
- We define a tuple with `()`, and retrieve with `[]`.
- We define a dict with `{}`, and retrieve/set with `[]`.

What's going on? Why not just stick with the same parentheses for each data type?

In Python, any retrieval (or setting) will happen with `[]`. But when we define a data structure, such as a dict, we use a special kind of brackets/parentheses so that Python knows what we're trying to do.

You can read this: https://lerner.co.il/2018/06/08/python-parentheses-primer/
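
A quick sketch of the point above, with made-up values: each type is defined with its own delimiters, but retrieval always uses `[]`.

```python
s = 'abcd'                 # string: defined with quotes
mylist = [10, 20, 30]      # list: defined with []
t = (10, 20, 30)           # tuple: defined with ()
d = {'a': 10, 'b': 20}     # dict: defined with {}

# retrieval always uses [], no matter how the value was defined
print(s[0])        # a
print(mylist[0])   # 10
print(t[0])        # 10
print(d['a'])      # 10
```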

# Next up

1. Accumulating more complex things with our data
2. Accumulating the unknown
3. Iterating over our dicts

Remember: 

- The keys of a dict must be unique (in that dict) and immutable (basically, numbers and strings)
- The values can be anything at all -- numbers, strings, lists, tuples, or even other dicts!

What if, instead of counting how many odd/even numbers we have, we were to *store* them?

To do that, we need a dict whose values are *lists*!

In [33]:
# here, counts is a dict whose values are lists!
counts = {'odds': [], 
          'evens':[]}

s = input('Enter a string of digits: ').strip()

for one_character in s:
    if not one_character.isdigit():   # if it isn't a digit, ignore it
        print(f'Ignoring non-number {one_character}')
        continue

    n = int(one_character)   # get an integer from this character

    if n % 2 == 0:             # if dividing by 2 produces no remainder, it's even!
        counts['evens'].append(n)   # append the number to the list in the dict!
    else:
        counts['odds'].append(n)    # it's an odd number, then

print(counts)                  # just print the dict, and get all of the key-value pairs        

Enter a string of digits:  123abc456


Ignoring non-number a
Ignoring non-number b
Ignoring non-number c
{'odds': [1, 3, 5], 'evens': [2, 4, 6]}


# Exercise: Vowels, digits, and others (dict-list edition)

1. Define a dict, `counts`, with three keys: `vowels`, `digits`, and `others`. Set the value to be `[]` in all cases.
2. Ask the user to enter a string.
3. Go through the string, one character at a time:
    - If it's a vowel (a, e, i, o, u), append it to `counts['vowels']`
    - If it's a digit (0-9), append it to `counts['digits']`
    - Otherwise, append it to `counts['others']`
4. Print `counts`

In [34]:
counts = {'vowels':[], 
          'digits':[],
          'others':[]}

text = input('Enter text: ').strip()

for one_character in text:
    if one_character in 'aeiou':   # is it a vowel?
        counts['vowels'].append(one_character)
    elif one_character.isdigit():  # is it a digit?
        counts['digits'].append(one_character)
    else:
        counts['others'].append(one_character)

print(counts)       

Enter text:  hello!! 123


{'vowels': ['e', 'o'], 'digits': ['1', '2', '3'], 'others': ['h', 'l', 'l', '!', '!', ' ']}


In [35]:
# YP

counts = {'vowels':[], 'digits':[], 'others':[]}
s = input("Enter text: ").strip()
for one_char in s:
    if one_char.lower() in 'aeiou':
        counts['vowels'].append(one_char)
    elif one_char.isdigit():
        counts['digits'].append(one_char)
    else:
        counts['others'].append(one_char)
print(counts)

Enter text:  hello!! 123


{'vowels': ['e', 'o'], 'digits': ['1', '2', '3'], 'others': ['h', 'l', 'l', '!', '!', ' ']}


In [36]:
# U1

counts = {'vowels':[], 'digits':[], 'other':[]}

s = input('Enter a string: ').strip()

for one_character in s:

    n = one_character
    if  one_character in 'aeiou':   
        counts['vowels'].append(n)
        
    elif one_character.isdigit():
        counts['digits'].append(n)
    else: 
        counts['other'].append(n)
        
print(counts)

Enter a string:  hello!! 123


{'vowels': ['e', 'o'], 'digits': ['1', '2', '3'], 'other': ['h', 'l', 'l', '!', '!', ' ']}


In [41]:
# JS

counts = {'vowels':[], 'digits': [], 'others': []}

word = input ("Please enter any keyboard input: ").strip()

for character in word:
    if character in "aeiou":
        counts['vowels'].append(character)
    elif character in "0123456789":
        counts['digits'].append(character)
    else:
        counts['others'].append(character)
print(counts)

Please enter any keyboard input:  hello!! 123


{'vowels': ['e', 'o'], 'digits': ['1', '2', '3'], 'others': ['h', 'l', 'l', '!', '!', ' ']}


In [44]:
len(counts['vowels'])

2

# Paradigms so far

I like to talk about 3 paradigms for dict usage in programs:

1. Define a dict, and treat it as a read-only database in the program.
2. Define a dict, with keys that will never change (no additions, no subtractions) but with initial values that will grow over time -- could be integers, and could be lists that we'll fill.
3. Define an empty dict:
    - When we encounter a key that is already in the dict, treat it like paradigm 2, adding/appending to the value
    - If it's a new key, then treat it sort of like paradigm 2's initialization, with a 0, 1, empty list, etc.
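
The three paradigms can be sketched side by side. This is only an illustration with made-up data (`menu`, `counts`, and `seen` are example names):

```python
# paradigm 1: a read-only dict, used for lookups
menu = {'sandwich': 10, 'tea': 7}
print(menu['tea'])                 # 7

# paradigm 2: fixed keys, values that grow
counts = {'odds': 0, 'evens': 0}
for n in [1, 2, 3, 4, 5]:
    if n % 2 == 0:
        counts['evens'] += 1
    else:
        counts['odds'] += 1
print(counts)                      # {'odds': 3, 'evens': 2}

# paradigm 3: start empty, add keys as we meet them
seen = {}
for word in ['tea', 'cake', 'tea']:
    if word in seen:
        seen[word] += 1            # seen before: update the value
    else:
        seen[word] = 1             # new key: initialize it
print(seen)                        # {'tea': 2, 'cake': 1}
```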

# Why do we need paradigm 3?

Example: I want to ask the user for a string, and to count how many times each character appears in the string. I'll use a dict for that! But does this mean I need to create a dict whose keys are EVERY POSSIBLE CHARACTER KNOWN TO PYTHON, and whose value is 0? 

Far better is to start with an empty dict:

- If we encounter a character we've seen before (i.e., a key in our dict), just add 1 to it
- If it's a new character, add a new key-value pair to the dict, with the character and a count of 1

In [40]:
counts = {}

text = input('Enter text: ').strip()

for one_character in text:
    if one_character in counts:      # have we seen this before? Is it already a key in our dict?
        counts[one_character] += 1   # ... just add 1 to the count
    else:
        counts[one_character] = 1    # base case: add a new key-value pair with this character and the count of 1

counts        

Enter text:  hello to everyone in the audience


{'h': 2,
 'e': 7,
 'l': 2,
 'o': 3,
 ' ': 5,
 't': 2,
 'v': 1,
 'r': 1,
 'y': 1,
 'n': 3,
 'i': 2,
 'a': 1,
 'u': 1,
 'd': 1,
 'c': 1}

# Exercise: Rainfall 

The goal of this program will be to create a dict that tracks rainfall in a variety of cities. The keys in the dict will be strings, city names. The values will be integers, mm of rain that fell in each city. You don't know in advance what cities you'll be tracking.

1. Define an empty dict, `rainfall`.
2. Ask the user, repeatedly, to enter the name of a city.
    - If they give us an empty city name, stop asking.
3. If they gave us a city, ask how many mm of rain fell there.
    - We can assume the user will give us a number.
4. Add this report to the dict:
    - If we have seen this city before, then add `mm_rain` to the existing value in `rainfall[city_name]`.
    - If this is the first time we're seeing this city, add a key-value pair to `rainfall` with `city_name` and `mm_rain`.
5. Print `rainfall`.

In [45]:
rainfall = {}

while True:
    print(rainfall)
    city_name = input('City: ').strip()

    if city_name == '':   # got an empty string? Stop asking
        break

    mm_rain = input('Rain: ').strip()
    mm_rain = int(mm_rain)   

    if city_name in rainfall:
        rainfall[city_name] += mm_rain  # add mm_rain to the existing value for this city
    else:
        rainfall[city_name] = mm_rain  # add a new key-value pair with the city and mm_rain


print(rainfall)        

{}


City:  a
Rain:  5


{'a': 5}


City:  b
Rain:  4


{'a': 5, 'b': 4}


City:  a
Rain:  3


{'a': 8, 'b': 4}


City:  


{'a': 8, 'b': 4}


In [46]:
# SK

rainfall={}
while True:
    city=input('Enter the name of a city:').strip()
    if city=='':
        break

    mm_rain=int(input('How many mm rain fell:'))
    if city in rainfall:
        rainfall[city]+=mm_rain
    else:
        rainfall[city]=mm_rain
        
print(rainfall)

Enter the name of a city: a
How many mm rain fell: 5
Enter the name of a city: b
How many mm rain fell: 4
Enter the name of a city: a
How many mm rain fell: 3
Enter the name of a city: 


{'a': 8, 'b': 4}


In [48]:
#RM

rainfall = {}

while True:
    s = input('Enter the City name "')

    if s == '':
        break
    
    rain=input('Enter the amount of rainfall : ')

    if s in rainfall:
        rainfall[s] += int(rain)

    else:
        rainfall[s] = int(rain)


print(rainfall)

Enter the City name " a
Enter the amount of rainfall :  5
Enter the City name " b
Enter the amount of rainfall :  4
Enter the City name " a
Enter the amount of rainfall :  3
Enter the City name " 


{'a': 8, 'b': 4}


In [None]:
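# MS

rainfall = {}

while True:   # "break" only works inside a loop
    city = input('Enter city: ').strip()

    if city == '':
        break

    mm_rain = int(input('Enter rain: ').strip())   # get the rainfall as an integer

    if city in rainfall:
        rainfall[city] += mm_rain   # city seen before: add to its total
    else:
        rainfall[city] = mm_rain    # new city: add a new key-value pair

print(rainfall)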

In [50]:
# YP

rainfall = {}

while True:
    # repeatedly get city_name
    city_name = input("Enter city name: ").strip()
    if city_name == '':
        break

    # if we got a city name, ask the amount of rain
    rmm_rain = float(input("Enter rainfall amount: "))

    if city_name in rainfall:  # meaning: is this city already a key in the rainfall dict?
        rainfall[city_name] += rmm_rain
    else:
        rainfall[city_name] = rmm_rain

print(rainfall)

Enter city name:  a
Enter rainfall amount:  5
Enter city name:  b
Enter rainfall amount:  4
Enter city name:  a
Enter rainfall amount:  3
Enter city name:  


{'a': 8.0, 'b': 4.0}


In [51]:
# MG

counts = {}

text = input('Enter text: ').strip()

for one_character in text:
    if one_character in counts:       # have we seen this before? Is it already in our dict?
        counts[one_character] += 1    #... just add 1 to the count
    else:
        counts[one_character] = 1      # base case: add a new key-value pair with this character and the count of 1

counts

Enter text:  hello out there!


{'h': 2, 'e': 3, 'l': 2, 'o': 2, ' ': 2, 'u': 1, 't': 2, 'r': 1, '!': 1}

In [56]:
# PB

rainfall = {}

while True: 
    city = input(' Enter City Name: ').strip()

    if city == '':
        print ( "city entered in None Exiting ")
        break
        
    rain = input(' Enter Rainfall for city : ').strip()
    if city in rainfall:
        rainfall[city] += int(rain)
    else:
        rainfall[city] = int(rain)

print(rainfall)

# TypeError: int() argument must be a string, a bytes-like object or a real number, not 'builtin_function_or_method'

 Enter City Name:  a
 Enter Rainfall for city :  5
 Enter City Name:  b
 Enter Rainfall for city :  4
 Enter City Name:  a
 Enter Rainfall for city :  3
 Enter City Name:  


city entered in None Exiting 
{'a': 8, 'b': 4}


In [57]:
# MR

rainfall = {}

while True:
    city = input(f'Enter your city name').strip()
      
    if city == "":
          break
    
   # else:
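    # input returns a string, so the += below joins strings ('5' + '3' gives '53'); wrap this in int() to add numbers
    mm_rain= input (f'How much it rained today in your City?').strip()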

    if city in rainfall:     
        rainfall[city] += mm_rain 
    else:
        rainfall[city] = mm_rain

print(rainfall)


Enter your city name a
How much it rained today in your City? 5
Enter your city name b
How much it rained today in your City? 4
Enter your city name a
How much it rained today in your City? 3
Enter your city name 


{'a': '53', 'b': '4'}
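
The `'53'` in that output is a clue: `input` always returns a string, and `+` (or `+=`) on strings joins them rather than adding them. A small sketch of the difference, with made-up values standing in for what the user typed:

```python
first = '5'     # what input() gives us when the user types 5
second = '3'

print(first + second)             # 53 -- the strings are joined
print(int(first) + int(second))   # 8  -- the integers are added
```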


# Next up

1. Alternative "rainfall" solution with lists, rather than ints
2. Iteration on dicts
3. How are dicts implemented?
4. Files!

In [58]:
# another version, with lists, rather than integers, for values
# in other words: track each entry of rainfall, not the total!

rainfall = {}

while True:
    print(rainfall)
    city_name = input('City: ').strip()

    if city_name == '':   # got an empty string? Stop asking
        break

    mm_rain = input('Rain: ').strip()
    mm_rain = int(mm_rain)   

    if city_name in rainfall:
        rainfall[city_name].append(mm_rain)  # add mm_rain to the existing value for this city
    else:
        rainfall[city_name] = [mm_rain]      # add a 1-element list as the value, containing mm_rain


print(rainfall)        

{}


City:  a
Rain:  5


{'a': [5]}


City:  b
Rain:  4


{'a': [5], 'b': [4]}


City:  a
Rain:  3


{'a': [5, 3], 'b': [4]}


City:  b
Rain:  2


{'a': [5, 3], 'b': [4, 2]}


City:  c
Rain:  1


{'a': [5, 3], 'b': [4, 2], 'c': [1]}


City:  


{'a': [5, 3], 'b': [4, 2], 'c': [1]}


In [60]:
sum(rainfall['a'])   # get the total of rainfall['a']

8

In [61]:
len(rainfall['a'])  # get the length of rainfall['a']

2

In [62]:
# if we want the mean rainfall in 'a'...

sum(rainfall['a']) / len(rainfall['a'])

4.0

# How can we iterate over a dict?

We've seen that we can use a `for` loop on many data structures:

- On a string, we get the characters
- On a list or tuple, we get the elements
- What happens if we iterate over a dict?

In [64]:
d = {'a':10, 'b':20, 'c':30}

# iterating over a dict gives you the *keys*
for one_thing in d:
    print(one_thing)

a
b
c


In [65]:
for one_key in d:
    print(d[one_key])

10
20
30


In [66]:
for one_key in d:
    print(f'{one_key}: {d[one_key]}')

a: 10
b: 20
c: 30


In [67]:
for one_city in rainfall:
    print(f'{one_city}: {rainfall[one_city]}')

a: [5, 3]
b: [4, 2]
c: [1]


In [68]:
# there is a method, dict.items, that returns 2-element tuples of (key, value)

for t in d.items():
    key = t[0]
    value = t[1]

    print(f'{key}: {value}')

a: 10
b: 20
c: 30


In [69]:
# we can use unpacking in the "for" loop, and assign both of those variables immediately:

for key, value in d.items():
    print(f'{key}: {value}')

a: 10
b: 20
c: 30


In [70]:
# there is a dict.values() method, and you can search in it,
# but that search is slow, because Python has to check the values one at a time
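
A short sketch of searching in the values, using the same small `d` as in the earlier cells:

```python
d = {'a': 10, 'b': 20, 'c': 30}

print(20 in d)            # False -- "in" on the dict checks only the keys
print(20 in d.values())   # True  -- this checks the values, one at a time
print(99 in d.values())   # False
```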



In [71]:
for key, value in rainfall.items():
    print(f'{key}: {value}')

a: [5, 3]
b: [4, 2]
c: [1]
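
Putting `dict.items` together with `sum` and `len`, here is a minimal sketch that reports the total and mean rainfall for each city. It assumes `rainfall` holds the lists collected above; the `means` dict is just an illustrative extra.

```python
rainfall = {'a': [5, 3], 'b': [4, 2], 'c': [1]}
means = {}

for city, readings in rainfall.items():
    total = sum(readings)
    means[city] = total / len(readings)   # mean of this city's readings
    print(f'{city}: total {total}, mean {means[city]}')

# a: total 8, mean 4.0
# b: total 6, mean 3.0
# c: total 1, mean 1.0
```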


# What's with dicts and their rules?

- How can it be that we can use any string or int as the key?
- Why must keys be immutable?
- Are dicts really that fast to search in?

When you append a value to a list, you're just sticking that value in the next available free slot in memory. If Python wants to search for something in the list, it has no choice but to go through it, one element at a time.

When you add a key-value pair to a dict, it's much more sophisticated than that. Python runs a special function, called `hash`, on the key. That returns a number, which Python uses to work out the location in memory where the pair should be stored.

- If you say `'a' in d`, then Python runs `hash('a')`, jumps to the place in memory that the function returns, and checks if the key-value pair is there. If so, it returns `True`. If not, it returns `False`.
- When you store a value in the key `'a'`, then Python calculates `hash('a')`, jumps to that place in memory, and stores both key and value right there.

This means that searching and retrieving are *very* fast. However, if a key could change after being stored, then `hash` would return a different number for it, and Python would no longer find the pair in memory. To avoid that trouble, Python forbids us from using mutable values as keys.
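
To make the idea concrete, here is a toy model of a dict. It is only a sketch of the general technique, not how Python really lays out dicts in memory, and the names `slots`, `toy_set`, and `toy_get` are invented for this example:

```python
# hash returns an integer for any immutable key
print(type(hash('a')))      # <class 'int'>

# a toy dict: 8 slots, each holding a list of (key, value) pairs
slots = [[] for _ in range(8)]

def toy_set(key, value):
    slot = slots[hash(key) % 8]        # the hash tells us which slot to use
    for i, (k, v) in enumerate(slot):
        if k == key:                   # key already stored? replace its value
            slot[i] = (key, value)
            return
    slot.append((key, value))          # new key: store the pair in this slot

def toy_get(key):
    slot = slots[hash(key) % 8]        # jump straight to the right slot
    for k, v in slot:
        if k == key:
            return v
    raise KeyError(key)                # nothing there: the key doesn't exist

toy_set('a', 10)
toy_set('b', 20)
toy_set('a', 123)
print(toy_get('a'))    # 123
print(toy_get('b'))    # 20
```

Notice that `toy_get` never looks at most of the slots. It goes directly to the one that `hash` points to, which is why the search stays fast no matter how many pairs are stored.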

In [74]:
for key, value in rainfall.items():
    print(f'{key}: {value}')
    rainfall[key + key] = value + value   # add a new key-value pair

a: [5, 3]


RuntimeError: dictionary changed size during iteration
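
Python raises that error because the dict grew while the loop was still walking through it. One common way to avoid it, sketched here with the same `rainfall` data, is to loop over a snapshot made with `list`:

```python
rainfall = {'a': [5, 3], 'b': [4, 2], 'c': [1]}

# list() makes a snapshot of the pairs before the loop starts
for key, value in list(rainfall.items()):
    rainfall[key + key] = value + value   # safe: we're looping over the snapshot

print(rainfall)
```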

When you assign `x = 5` in Python, Python takes `x` and turns it into a string, `'x'` (behind the scenes). It then uses that string as a key in a dictionary of variable names and values!
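
A small sketch of that idea. Here we hand Python an ordinary dict (called `namespace`, a name chosen for this example) and ask it to run an assignment using that dict to hold the variables. In a notebook, you can see the real dict of global variables by calling the builtin `globals()`.

```python
namespace = {}              # an ordinary dict, to hold variables

exec('x = 5', namespace)    # run "x = 5", storing its variables in our dict

print('x' in namespace)     # True -- the variable name became a string key
print(namespace['x'])       # 5
```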

# What is a file?

A file is a permanent version of an in-memory data structure.

Meaning:

- If you want to store a data structure, such that it can be loaded onto another computer or just survive a reboot/powerdown, save it to a file.
- If you want to retrieve a data structure from a file into memory, then load the file.

There has to be agreement across various parts of software regarding how the data structures are translated from their in-memory form to their on-disk form.

Text files are boring and simple to work with. They're also common -- configuration files, logfiles, and instruction files for GenAI.

1. How can we read from a text file?
2. How can we turn the contents of that file into a data structure?

For practice, you can (should?) download a small zipfile from:

https://files.lerner.co.il/exercise-files.zip

Open the zipfile, and put its contents in the directory where Jupyter is running or somewhere you know how to access. If you're using Jupyter Lite, then drag each of the files from your computer's Finder/Explorer to the panel on the left.

# Working with files

It used to be that any program could talk to the disk directly. This was terrible! 

In modern systems, a program talks to the OS, which then talks to the disk on the program's behalf. 

To ask the OS to open a connection to the file, a program can use the `open` function. `open` takes one argument (for starters), the name of the file that should be opened. It returns a "file object," from which we can read the contents of the file.

In [75]:
# for example

f = open('/etc/passwd')  # this is a Unix file (on Mac/Linux) that used to contain passwords

type(f)   # what did we get back in f?

_io.TextIOWrapper

In [76]:
# now that the file is open, how can we read from it?
# option 1: invoke read(), which returns a string with the file's contents

s = f.read()

In [77]:
type(s)  # type tells us what kind of value we have

str

In [78]:
len(s)

9344

It's a bad idea to use `read()` on a file whose size you don't know: a really big file could be many gigabytes or even terabytes, and `read()` will try to load all of it into memory at once, which can crash your program.

In [79]:
# option 2: Invoke read(n), the same method, but with an integer argument. That is the max number of 
# characters we would want. The good news? We won't run out of memory. The bad news? We'll get
# lines chopped up.
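
A minimal sketch of option 2. To keep it self-contained, it first writes a small throwaway file (the filename `demo.txt` and its contents are made up), and then reads it back 8 characters at a time:

```python
import os
import tempfile

# create a small throwaway file so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
f = open(path, 'w')
f.write('first line\nsecond line\n')
f.close()

f = open(path)
chunks = []
while True:
    chunk = f.read(8)      # ask for at most 8 characters
    if chunk == '':        # an empty string means we've reached the end
        break
    chunks.append(chunk)
f.close()

print(chunks)   # ['first li', 'ne\nsecon', 'd line\n']
```

The chunks ignore where the lines start and end, which is the "chopped up" problem described above.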

In [80]:
# option 3: the most common, by far
# run a for loop on the file object. You'll get one line at a time from the file, 
# always returning a string, and always ending with `\n`. Most text files are designed to
# be read line by line.

In [81]:
f = open('/etc/passwd')

for one_line in f:   # iterate over the file, one line at a time, assigning each to one_line
    print(one_line)  # print the latest line that we got from the file

##

# User Database

# 

# Note that this file is consulted directly only when the system is running

# in single-user mode.  At other times this information is provided by

# Open Directory.

#

# See the opendirectoryd(8) man page for additional information about

# Open Directory.

##

nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false

root:*:0:0:System Administrator:/var/root:/bin/sh

daemon:*:1:1:System Services:/var/root:/usr/bin/false

_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico

_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false

_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false

_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false

_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false

_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false

_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false

_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/fal

Why is the file double spaced?

- `print` always puts `\n` after everything it prints.
- Iterating over a file gives us each line up to and including its end, which means the `\n`.

This means we have one `\n` from `one_line` and another from `print`.

You could actually use `print(one_line, end='')`.

In [82]:
f = open('/etc/passwd')

for one_line in f:   # iterate over the file, one line at a time, assigning each to one_line
    print(one_line, end='')

##
# User Database
# 
# Note that this file is consulted directly only when the system is running
# in single-user mode.  At other times this information is provided by
# Open Directory.
#
# See the opendirectoryd(8) man page for additional information about
# Open Directory.
##
nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1:System Services:/var/root:/usr/bin/false
_uucp:*:4:4:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico
_taskgated:*:13:13:Task Gate Daemon:/var/empty:/usr/bin/false
_networkd:*:24:24:Network Services:/var/networkd:/usr/bin/false
_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
_postfix:*:27:27:Postfix Mail Server:/var/spool/postfix:/usr/bin/false
_scsd:*:31:31:Service Configuration Service:/var/empty:/usr/bin/false
_ces:*:32:32:Certificate Enrollment Service:/var/empty:/usr/bin/false
_appstore:*:33:33

In [85]:
# I want to print only the usernames in /etc/passwd -- on lines containing records, the first field 
# is the username

f = open('/etc/passwd')

for one_line in f:   
    if one_line[0] == '#':
        continue  # go onto the next line...
    
    print(one_line.split(':')[0])

nobody
root
daemon
_uucp
_taskgated
_networkd
_installassistant
_lp
_postfix
_scsd
_ces
_appstore
_mcxalr
_appleevents
_geod
_devdocs
_sandbox
_mdnsresponder
_ard
_www
_eppc
_cvs
_svn
_mysql
_sshd
_qtss
_cyrus
_mailman
_appserver
_clamav
_amavisd
_jabber
_appowner
_windowserver
_spotlight
_tokend
_securityagent
_calendar
_teamsserver
_update_sharing
_installer
_atsserver
_ftp
_unknown
_softwareupdate
_coreaudiod
_screensaver
_locationd
_trustevaluationagent
_timezone
_lda
_cvmsroot
_usbmuxd
_dovecot
_dpaudio
_postgres
_krbtgt
_kadmin_admin
_kadmin_changepw
_devicemgr
_webauthserver
_netbios
_warmd
_dovenull
_netstatistics
_avbdeviced
_krb_krbtgt
_krb_kadmin
_krb_changepw
_krb_kerberos
_krb_anonymous
_assetcache
_coremediaiod
_launchservicesd
_iconservices
_distnote
_nsurlsessiond
_displaypolicyd
_astris
_krbfast
_gamecontrollerd
_mbsetupuser
_ondemand
_xserverdocs
_wwwproxy
_mobileasset
_findmydevice
_datadetectors
_captiveagent
_ctkd
_applepay
_hidd
_cmiodalassistants
_analyticsd
_fps
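
The loop above skips comments by checking `one_line[0]`. Here is a slightly more defensive sketch that uses `str.startswith` and also skips blank lines. The name `sample_lines` is made up for this example, and the blank line in it is an assumption (the real file may not contain one). The other lines are copied from the `/etc/passwd` output shown earlier, and they are kept in a list so the code runs on any machine.

```python
# A few lines in /etc/passwd format, taken from the output shown earlier.
# sample_lines is a hypothetical stand-in for the open file; iterating over
# a real file also gives us one string per line, each ending in '\n'.
sample_lines = [
    '# Open Directory.\n',
    '##\n',
    '\n',   # assumed blank line, included only to show how it is skipped
    'nobody:*:-2:-2:Unprivileged User:/var/empty:/usr/bin/false\n',
    'root:*:0:0:System Administrator:/var/root:/bin/sh\n',
]

usernames = []

for one_line in sample_lines:
    if one_line.startswith('#'):   # comment line, so skip it
        continue
    if not one_line.strip():       # nothing but whitespace, so skip it
        continue

    usernames.append(one_line.split(':')[0])   # first field is the username

print(usernames)
```

Collecting the usernames in a list, rather than printing them one at a time, makes it easy to reuse them later in the program.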

# Exercise: Get IP addresses

1. Read through the file `mini-access-log.txt`. On each line is one log entry, typically with the IP address at the start.
2. As you go through the file, grab the IP address from the start of each line, and print it out.

In [88]:
f = open('mini-access-log.txt')

for one_line in f:
    print(one_line.split()[0])

67.218.116.165
66.249.71.65
65.55.106.183
65.55.106.183
66.249.71.65
66.249.71.65
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.131
65.55.106.131
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.186
65.55.106.186
66.249.65.12
66.249.65.12
66.249.65.12
74.52.245.146
74.52.245.146
66.249.65.43
66.249.65.43
66.249.65.43
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.25
65.55.207.25
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.94
65.55.207.94
66.249.65.12
65.55.207.71
66.249.65.12
66.249.65.12
66.249.65.12
98.242.170.241
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38


In [90]:
# PB

# f = open('/Users/prasadbs/Downloads/mini-access-log.txt')
f = open('mini-access-log.txt')

for one_line in f:
    print(one_line.split()[0])

67.218.116.165
66.249.71.65
65.55.106.183
65.55.106.183
66.249.71.65
66.249.71.65
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.131
65.55.106.131
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.186
65.55.106.186
66.249.65.12
66.249.65.12
66.249.65.12
74.52.245.146
74.52.245.146
66.249.65.43
66.249.65.43
66.249.65.43
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.25
65.55.207.25
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.94
65.55.207.94
66.249.65.12
65.55.207.71
66.249.65.12
66.249.65.12
66.249.65.12
98.242.170.241
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38


In [91]:
# CJ

f = open('mini-access-log.txt')

for one_line in f:   
    print(one_line.split(' - ')[0])

67.218.116.165
66.249.71.65
65.55.106.183
65.55.106.183
66.249.71.65
66.249.71.65
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.131
65.55.106.131
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.186
65.55.106.186
66.249.65.12
66.249.65.12
66.249.65.12
74.52.245.146
74.52.245.146
66.249.65.43
66.249.65.43
66.249.65.43
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.25
65.55.207.25
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.94
65.55.207.94
66.249.65.12
65.55.207.71
66.249.65.12
66.249.65.12
66.249.65.12
98.242.170.241
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38


In [92]:
# TE

f = open("mini-access-log.txt")

for one_line in f:
    print(one_line.split("-")[0])

67.218.116.165 
66.249.71.65 
65.55.106.183 
65.55.106.183 
66.249.71.65 
66.249.71.65 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
65.55.106.131 
65.55.106.131 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
65.55.106.186 
65.55.106.186 
66.249.65.12 
66.249.65.12 
66.249.65.12 
74.52.245.146 
74.52.245.146 
66.249.65.43 
66.249.65.43 
66.249.65.43 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
65.55.207.25 
65.55.207.25 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
66.249.65.12 
65.55.207.94 
65.55.207.94 
66.249.65.12 
65.55.207.71 
66.249.65.12 
66.249.65.12 
66.249.65.12 
98.242.170.241 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.65.38 
66.249.6
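
The three solutions above split the line in three different ways, and the last one leaves a trailing space after each IP address. The sketch below compares them on a single log entry. The entry is rebuilt from the first line of the log, assuming its fields are separated by single spaces, and the variable names are made up for this example.

```python
# One log entry, rebuilt from the first line of the log
# (assumption: the fields are separated by single spaces).
one_line = ('67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] '
            '"GET /robots.txt HTTP/1.0" 200 99 "-" '
            '"Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"\n')

by_whitespace = one_line.split()[0]       # split on any run of whitespace
by_space_dash = one_line.split(' - ')[0]  # split on space-dash-space
by_dash = one_line.split('-')[0]          # split on the dash alone

# repr shows the quotes, which makes a trailing space visible
print(repr(by_whitespace))
print(repr(by_space_dash))
print(repr(by_dash))
print(repr(by_dash.strip()))   # str.strip removes the leftover space
```

Splitting on `'-'` alone keeps the space that sat between the address and the dash, which is why that output has a space at the end of every line. Calling `split` with no argument avoids the problem, because it treats any amount of whitespace as the separator.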

# Next up

1. Closing files (the hard way)
2. More sophisticated reading of files
3. Writing to files
4. Closing files (the fancy way)

You've noticed that we use `open` to open a file, and get access to it. What about closing the file?

We'll see in a bit, when we write to files, that closing is a big deal.

If you're only reading from files, and if you're only opening a handful at a time, then it doesn't really matter much; when Python exits, it'll close the file(s) for you.

However, it's considered polite to close a file when you're done with it.

You do that by invoking the `close` method.

In [93]:
f = open('mini-access-log.txt')

for one_line in f:
    print(one_line.split()[0])

f.close()   # close the file!    

67.218.116.165
66.249.71.65
65.55.106.183
65.55.106.183
66.249.71.65
66.249.71.65
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.131
65.55.106.131
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.106.186
65.55.106.186
66.249.65.12
66.249.65.12
66.249.65.12
74.52.245.146
74.52.245.146
66.249.65.43
66.249.65.43
66.249.65.43
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.25
65.55.207.25
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
66.249.65.12
65.55.207.94
65.55.207.94
66.249.65.12
65.55.207.71
66.249.65.12
66.249.65.12
66.249.65.12
98.242.170.241
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
66.249.65.38
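
If you want to check whether a file has really been closed, every file object has a `closed` attribute. A minimal sketch follows. It opens `os.devnull`, an always-empty file that exists on every system, as a stand-in for `mini-access-log.txt`, so that it runs anywhere. The `try`/`except` at the end is there only to show the error without stopping the program.

```python
import os

# os.devnull stands in for mini-access-log.txt so that this runs anywhere
f = open(os.devnull)
print(f.closed)     # False, because the file is still open

f.close()
print(f.closed)     # True, because we just closed it

# Reading from a closed file raises ValueError
try:
    f.read()
except ValueError as e:
    print(f'Cannot read: {e}')
```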


# Exercise: Count IP addresses

Once again, I want you to read through `mini-access-log.txt`. But this time, I don't want you to print the IP addresses! Rather, I want you to create a dict in which the keys are the IP addresses (strings) and the values are the number of times each address appears in the file (integers).

- Define `counts`, an empty dict.
- Iterate over each line of `mini-access-log.txt`.
- Grab the IP address from the start of each line.
- If you have seen this IP address before, add 1 to its count.
- If it is a new address, not yet a key in `counts`, add a key-value pair to the `counts` dict.
- In the end, iterate over `counts`, and print every key and value.
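
One possible shape for these steps is sketched below. It uses a short list of log lines in place of the file. The name `sample_lines` is made up for this example, and the entries are shortened versions of lines from the log, so the counts here are not the real ones.

```python
# Hypothetical stand-in for the lines of mini-access-log.txt
# (shortened entries, so the counts are NOT the real counts)
sample_lines = [
    '67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0"\n',
    '66.249.71.65 - - [30/Jan/2010:00:12:06 +0200] "GET /browse/one_node/1557 HTTP/1.1"\n',
    '65.55.106.183 - - [30/Jan/2010:01:29:23 +0200] "GET /robots.txt HTTP/1.1"\n',
    '66.249.71.65 - - [30/Jan/2010:02:07:14 +0200] "GET /browse/browse_applet_tab/2593 HTTP/1.1"\n',
]

counts = {}

for one_line in sample_lines:
    ip_address = one_line.split()[0]   # the IP address is the first field

    if ip_address in counts:           # seen before? add 1 to its count
        counts[ip_address] += 1
    else:                              # new address? start counting at 1
        counts[ip_address] = 1

for key, value in counts.items():
    print(key, value)
```

With the real file, you would replace the loop over `sample_lines` with a loop over the file object returned by `open`.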

In [94]:
d = {'a':10, 'b':20, 'c':30}

d.pop('b')  # dict.pop is a method; it takes a key ('b'), removes that key-value pair from the dict, and returns the value

20

In [95]:
d

{'a': 10, 'c': 30}
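
One more detail about `dict.pop`: if the key isn't in the dict, you get a `KeyError`, unless you pass a second argument to be returned as a default. A short sketch, where the key `'zzz'` and the variable names are made up for illustration:

```python
d = {'a': 10, 'b': 20, 'c': 30}

removed = d.pop('b')           # removes 'b' and returns its value
print(removed)
print(d)

# 'zzz' is not a key in d. With only one argument, d.pop('zzz') would
# raise KeyError. With a second argument, that default comes back instead.
missing = d.pop('zzz', None)
print(missing)
print(d)                       # unchanged, since there was nothing to remove
```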

In [97]:
counts = {}

f = open('mini-access-log.txt')

for one_line in f:
    print(one_line.split())

f.close()

['67.218.116.165', '-', '-', '[30/Jan/2010:00:03:18', '+0200]', '"GET', '/robots.txt', 'HTTP/1.0"', '200', '99', '"-"', '"Mozilla/5.0', '(Twiceler-0.9', 'http://www.cuil.com/twiceler/robot.html)"']
['66.249.71.65', '-', '-', '[30/Jan/2010:00:12:06', '+0200]', '"GET', '/browse/one_node/1557', 'HTTP/1.1"', '200', '39208', '"-"', '"Mozilla/5.0', '(compatible;', 'Googlebot/2.1;', '+http://www.google.com/bot.html)"']
['65.55.106.183', '-', '-', '[30/Jan/2010:01:29:23', '+0200]', '"GET', '/robots.txt', 'HTTP/1.1"', '200', '99', '"-"', '"msnbot/2.0b', '(+http://search.msn.com/msnbot.htm)"']
['65.55.106.183', '-', '-', '[30/Jan/2010:01:30:06', '+0200]', '"GET', '/browse/one_model/2162', 'HTTP/1.1"', '200', '2181', '"-"', '"msnbot/2.0b', '(+http://search.msn.com/msnbot.htm)"']
['66.249.71.65', '-', '-', '[30/Jan/2010:02:07:14', '+0200]', '"GET', '/browse/browse_applet_tab/2593', 'HTTP/1.1"', '200', '10305', '"-"', '"Mozilla/5.0', '(compatible;', 'Googlebot/2.1;', '+http://www.google.com/bot.htm