# Week 3: Agenda

1. Q&A
2. Tuples and unpacking
3. Dictionaries
    - What are they?
    - Creating dictionaries
    - Retrieving from them
    - Different ways to use them in our programs
4. Files
    - Reading from files (plain-text files)
    - Looping over file objects to read them
    - Writing to files


# Tuples and unpacking

Last time, we talked about lists:

- Strings are sequences of characters. They are immutable. (We cannot change a string.)
- Lists are sequences of *anything*. They are mutable. (We *can* change a list.)

Tuples are a mix of these two ideas:

- A tuple is a sequence of *anything*
- But it is also immutable.

Many people like to think of tuples as immutable lists (or, as I've sometimes heard, "locked lists"). This is *not* the way that the Python core developers want you to think about them! Rather, they want you to think of lists as sequences of the same type, whereas tuples are sequences of different types.

If you have a bunch of integers, then a list is appropriate. If you have one integer, one string, and one list, then a tuple is more appropriate. (At least, officially.)

Where do we use tuples?  Many people use them instead of structs or records, data structures from other programming languages. Here are some more concrete examples:

- A record in a database can be thought of as a tuple, and when we read a database record into Python, it's useful to have it in a tuple
- When I call a function in Python, the arguments are passed as a tuple
- If I have information about a person -- first name, last name, birthdate, and shoe size -- then I'll want to use a tuple, because we have different types.

Many beginners (and not-so-beginners) in Python wonder why we need tuples at all. That's not a bad question! They're immutable, so they're more efficient than lists. But you can get away without using tuples for a while.
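
As a rough illustration of the "record" idea, here is a minimal sketch. The variable name `person` and all of the field values are made up for this example; a birth year stands in for a full birthdate.

```python
# a hypothetical "person" record: first name, last name, birth year, shoe size
person = ('Ada', 'Lovelace', 1815, 38.5)

first_name = person[0]      # fields are retrieved by position
shoe_size = person[-1]      # negative indexes work, just as with lists

print(f'{first_name} wears size {shoe_size}')
```

The fields have different types (strings, an int, a float), which is the situation in which a tuple is officially the better fit.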

In [3]:
# create a tuple

t = (100, 200, 300, 400, 500)    # round parentheses and , between the elements
type(t)

tuple

In [4]:
t[0]   # retrieve one item

100

In [5]:
t[1]

200

In [6]:
t[-1]  # get the final item

500

In [7]:
len(t)  # how many items?

5

In [9]:
t[2:5]    #get a slice

(300, 400, 500)

In [10]:
for one_item in t:
    print(one_item)

100
200
300
400
500


In [11]:
# but of course, they're immutable!

t[0] = '!'

TypeError: 'tuple' object does not support item assignment

In [12]:
t = (100, 'abcd', [102, 103, 104])
len(t)

3

In [13]:
# we don't even need the parentheses!
# The commas are enough to make it a tuple

t = 100, 'abcd', [102, 103, 104]
t

(100, 'abcd', [102, 103, 104])

In [14]:
# can I change a list in a tuple? YES, absolutely!

t[-1].append(105)
t

(100, 'abcd', [102, 103, 104, 105])

In [16]:
# one aspect of tuples that's really useful!
# tuple unpacking

mylist = [10, 20, 30]

x = mylist    # what will the value of x be?

In [17]:
x

[10, 20, 30]

In [18]:
# but what if I do this:

x,y,z = mylist    # three variables in a tuple (no parentheses), and an object with three elements

In [19]:
x

10

In [20]:
y

20

In [21]:
z

30

"Unpacking" is when the object on the right has a certain number of values, and we put variables on the left to capture those values.

If the object on the right is iterable, and if I have the same number of variables on the left, then I'm totally fine, and each variable will get one element of the iterable.

What if the numbers don't match? We get an error.

In [22]:
x,y = mylist

ValueError: too many values to unpack (expected 2)

In [23]:
w,x,y,z = mylist

ValueError: not enough values to unpack (expected 4, got 3)
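
One common place where unpacking shows up is swapping two variables. This is a small sketch with made-up values: the right side builds a two-element tuple, and the left side unpacks it.

```python
x = 10
y = 20

x, y = y, x     # the right side is the tuple (20, 10), which is unpacked into x and y

print(x, y)
```

Because the counts match (two values on the right, two variables on the left), there is no `ValueError`.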

In [25]:
# an example of unpacking

# how can I get the indexes along with elements of a string/list/tuple? I can use "enumerate"

# 

for one_item in enumerate('abcd'):
    print(one_item)

(0, 'a')
(1, 'b')
(2, 'c')
(3, 'd')


In [26]:
# we know that enumerate('abcd') returns a 2-element tuple (index, letter) with each
# iteration.  Can I capture this in a nicer way?

for index, one_letter in enumerate('abcd'):
    print(f'{index}: {one_letter}')

0: a
1: b
2: c
3: d


In [28]:
t = (100, 200, 300)

for index, one_number in enumerate(t):
    print(f'{index}: {one_number}')

0: 100
1: 200
2: 300


# Dictionaries

So far, we've seen that we can store our data in strings (only if they're characters/text), lists, or tuples. In all of these cases, we retrieve our data via a numeric index, starting at 0 for the first item and going up from there. If there are `n` elements in a sequence, then the highest index will always be `n-1`.

If I want to search through a string/list/tuple for a value, it might take a long time! I might need to search through the entire data structure to see if it's there. Which means that the longer the data structure, the more time it might take to find the value.

How do dictionaries help?

1. We can set the index (known as a "key" in the world of dicts) to be anything we want, so long as it's unique  and it's immutable. (Basically, strings, integers, and floats.)  This makes the dict easier to understand and work with.
2. Searching and retrieval via the keys is done at very very high speed.

In [29]:
# example:

d = {'a':100, 'b':200, 'c':300}

# this dict has 3 key-value pairs
len(d)

3

In [30]:
# retrieve from the dict with d[]
# in the [], put the key whose value you want

d['a']

100

In [31]:
# what if I retrieve a key that doesn't exist?
d['qqqq']

KeyError: 'qqqq'

In [32]:
# how can we check to see if a key exists?
# we can use *in*

'a' in d

True

In [33]:
'b' in d

True

In [34]:
'g' in d

False

# Summary of dicts, so far

1. We can define them with
    - `{}` around the outside
    - commas between the key-value pairs
    - colons separating the keys from the values
    - every key needs a value and vice versa
    - Keys must be immutable
2. We can retrieve with `[]`
3. We can check for membership of a key with `in`

Dicts are designed to be very fast and efficient if you retrieve the values via the keys. You can, in theory, get the keys via the values, but values aren't guaranteed to be unique, and it's just not done that much.

- The keys of a dict must be immutable (typically strings or integers)
- The values of a dict can be *anything at all*.
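
To make the "values can be anything" point concrete, here is a minimal sketch. The dict `info` and its contents are invented for this example.

```python
# hypothetical data: each value has a different type
info = {'name': 'Dana',              # str value
        'scores': [90, 85, 100],     # list value
        'location': (31.9, 35.0)}    # tuple value

info['scores'].append(77)      # the list stored in the dict is still a mutable list

print('name' in info)          # "in" looks at the keys
print('Dana' in info)          # "in" does not look at the values
print(info['scores'])
```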

# Exercise: Restaurant

1. Define a dict whose keys are the names of items on a restaurant menu, and whose values are prices (ints). This will be the `menu` dict.
2. Define `total` to be 0
3. Ask the user, repeatedly, to enter something they want to eat.
    - If they enter the empty string, stop asking and print the final bill.
    - If they enter something that is on the menu, then add its price to the total, and print out the price and the new total.
    - If they order something that is *not* on the menu, then scold them appropriately.
    
Example:

    Order: sandwich
    sandwich costs 15, total is 15
    Order: tea
    tea costs 7, total is 22
    Order: cake
    cake costs 9, total is 31
    Order: elephant
    Sorry, we are fresh out of elephant
    Order: [ENTER]
    total is 31

# Use cases for dicts

1. Month names (keys) and month numbers (values)
2. Month numbers (keys) and month names (values)
3. User IDs (keys) and user info (values)
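
Each of these use cases can be sketched in a line or two. The names `month_numbers`, `month_names`, and `users`, along with the user data, are assumptions made up for this example.

```python
# 1. month names (keys) and month numbers (values)
month_numbers = {'Jan': 1, 'Feb': 2, 'Mar': 3}

# 2. month numbers (keys) and month names (values)
month_names = {1: 'Jan', 2: 'Feb', 3: 'Mar'}

# 3. user IDs (keys) and user info (values); here the info is a (name, email) tuple
users = {1001: ('Dana', 'dana@example.com'),
         1002: ('Noam', 'noam@example.com')}

print(month_numbers['Feb'])
print(month_names[2])
print(users[1002][0])
```

Which of the first two you want depends on what you will have in hand when you do the lookup: the name or the number.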


In [35]:
# define a dict, where the keys are strings (menu items), and the values are prices (ints)

menu = {'sandwich':15, 'tea':7, 'cake':9, 'apple':5}
total = 0

while True:   # loop until we explicitly break out
    order = input('Order: ').strip()
    
    if order == '':      # no order? break out of the loop; you could also say "if not order:"
        break
        
    if order in menu:         # does the user's input string exist as a key in our dict?
        price = menu[order]   # get the price
        total += price
        print(f'{order} costs {price}; total is now {total}')
    else:
        print(f'{order} is not on the menu today.')
        
print(total)        

Order: sandwich
sandwich costs 15; total is now 15
Order: tea
tea costs 7; total is now 22
Order: coffee
coffee is not on the menu today.
Order: cake
cake costs 9; total is now 31
Order: cake
cake costs 9; total is now 40
Order: 
40


# Dictionaries are mutable!

Remember that "mutable" means that the value, the object, can be modified. Strings and tuples are immutable. Lists and dicts are mutable.

This is *NOT* remotely the same thing as saying they are (or aren't) constants. A constant (in other languages) means that once you set a value to a variable, you can never change that assignment. Python doesn't have constants.

Immutable means that the variable refers to an object, and the object will never change.

In the case of dictionaries, we can modify the dict -- adding or removing key-value pairs, or modifying values once they're in the dict.
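
The difference between modifying an object and assigning to a variable can be seen in a small sketch. The variable names here are chosen only for illustration.

```python
d = {'a': 10}
other = d            # both names now refer to the same dict object

d['b'] = 20          # modify the object; neither d nor other is reassigned
print(other)         # the change is visible through both names

s = 'abc'
t = s
s = s + 'd'          # strings are immutable: this builds a new string and reassigns s
print(t)             # t still refers to the original, unchanged string
```

In the first half, the object changed. In the second half, only the assignment changed, and the original string stayed exactly as it was.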

In [36]:
# how do I modify a dict?

d = {'a':10, 'b':20, 'c':30}   # this is the start

len(d)   # this tells me how many key-value pairs I have

3

In [37]:
# I'm not saying d = SOMETHING, but rather I'm modifying the object that d is referring to

d['c'] = 40     # I'm replacing the value associated with the key 'c'
d

{'a': 10, 'b': 20, 'c': 40}

In [38]:
# how can I add a new key-value pair to the dict?
# I assign to the dict, EXACTLY as I modified an existing value

# if the key exists, the value is updated
# if the key is new, the key-value pair is added

# there is no "append" method for dictionaries

d['x'] = 1234   # this is the first assignment to the key 'x', so this creates that key-value pair

In [39]:
d

{'a': 10, 'b': 20, 'c': 40, 'x': 1234}

In [40]:
len(d)

4

In [41]:
# I can remove a key-value pair with the "pop" method
# I name the key, and I get back the value associated with it

d.pop('x')   # this looks for the key "x" and returns the value associated with it, when we remove the pair

1234

In [42]:
d

{'a': 10, 'b': 20, 'c': 40}

In [43]:
print(d)  # if I print my dict, this is how it looks...

{'a': 10, 'b': 20, 'c': 40}


In [44]:
d['b']

20

In [45]:
print(d['b'])

20


In [47]:
# perhaps it's obvious, but the key can be a variable

k = 'b'

d[k]  # retrieve the value from d, where k (the variable)'s value is the key

20

# Dicts aren't unique to Python!

Just about every programming language has something like a dict. High-level languages like Python, Perl, Ruby, PHP, and JavaScript really use them everywhere. But they're known under many different names:

- Hash tables
- Hashes
- Associative arrays
- Name-value pairs
- Key-value pairs
- Hash maps
- Maps

In [48]:
d = {}   # empty dict

for one_letter in 'abcd':
    d[one_letter] = 10
    
d

{'a': 10, 'b': 10, 'c': 10, 'd': 10}

# Next up

1. Accumulating (known things) in dicts
2. Accumulating (unknown things) in dicts
3. Looping over dicts
4. How dicts work behind the scenes



# Using dicts to accumulate information

In some cases, I know what I want to count. I can create a dict in which the things I want to count are the keys, and the values all start at 0. Then, over the course of the program, those numbers can rise.  At the end of the program, I look at the dict and see how many of each thing there were.

In [49]:
# count even and odd numbers in a string

counts = {'evens':0, 'odds':0}   

numbers = input('Enter numbers, separated by spaces: ').split()

for one_number in numbers:

    if not one_number.isdigit():   # can this string not be made into an int? Next number!
        continue
        
    n = int(one_number)
    
    if n % 2 == 0:   # if the number is even
        counts['evens'] += 1
    else:
        counts['odds'] += 1
        
print(counts)        

Enter numbers, separated by spaces: 10 15 abcd 17 18
{'evens': 2, 'odds': 2}


In [50]:
d = {'a':input('Enter a value: ').strip(),
     'b':input('Enter b value: ').strip()}

Enter a value: 123
Enter b value: 456


In [51]:
d

{'a': '123', 'b': '456'}

# Exercise: Vowels, digits, and others (dict edition)

1. Define a dict with three keys: `vowels`, `digits`, and `others`, and 0 for all values.
2. Ask the user, repeatedly, to enter a string.
    - If the user enters the empty string, stop asking and print our dict
3. Go through each character in the user's string:
    - If it's a vowel, add 1 to `vowels`
    - If it's a digit, add 1 to `digits`
    - If it's neither, add 1 to `others`.
4. Print our dict with all of the counts.    

Example:

    Enter text: hello!
    Enter text: ab12??
    Enter text: [ENTER]
    
    {'vowels':3, 'digits':2, 'others':7}

In [None]:
# if you're getting errors about futures and Pyodide (most of you won't), then you have to say

x = input('Enter text: ')

# don't put these in the same cell!
s = x.result()

In [52]:
counts = {'vowels':0, 'digits':0, 'others':0}

while True:   # I don't know how many strings the user will enter
    
    s = input('Enter text: ').strip()
    
    if s == '':
        break
        
    # go through each character -- for loop inside of my while loop!
    for one_character in s:
        if one_character in 'aeiou':   # is it a vowel?
            counts['vowels'] += 1
        elif one_character.isdigit():  # is it a digit?
            counts['digits'] += 1
        else:
            counts['others'] += 1      # others == any character that is neither a vowel nor a digit
            
print(counts)

Enter text: hello!
Enter text: ab12??
Enter text: 
{'vowels': 3, 'digits': 2, 'others': 7}


hello! -- h (others), e (vowels), l (others), l (others), o (vowels), ! (others)
ab12?? -- a (vowels), b (others), 1 (digits), 2 (digits), ? (others), ? (others)

others -- 7
vowels -- 3
digits -- 2


If your dict has integer keys, then use ints, not strings:



In [53]:
months = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr'}

months[3]

'Mar'

# Accumulating the unknown

We don't always know what we'll be counting. Examples:

- I want to know how many times each IP address has visited my server
- I want to know how much we have earned from each product
- I want to know which users have tried to log in unsuccessfully

I can start with an empty dict:

- If I want to store info about an existing key, I just add it to the current count
- If I want to store info about a new key, I have to add the new key-value pair.


In [54]:
# example: letter counts

counts = {}   # we will count each character, how many times it appears

while True:   # I don't know how many strings the user will enter
    
    s = input('Enter text: ').strip()
    
    if s == '':
        break
        
    for one_character in s:   # go through s, one character at a time

        # I want to do this:
        counts[one_character] += 1  # this won't work because we can only add to something that exists!
        
print(counts)

Enter text: hello


KeyError: 'h'

In [55]:
# example: letter counts

counts = {}   # we will count each character, how many times it appears

while True:   # I don't know how many strings the user will enter
    
    s = input('Enter text: ').strip()
    
    if s == '':
        break
        
    for one_character in s:   # go through s, one character at a time

        if one_character in counts:      # have we seen this character before?
            counts[one_character] += 1   #   add to its count!
        else:                            # new character?
            counts[one_character] = 1    #   add the key-value pair with a value of 1
        
print(counts)

Enter text: hello
Enter text: what is new?
Enter text: I'm fine, how are you?
Enter text: 
{'h': 3, 'e': 4, 'l': 2, 'o': 3, 'w': 3, 'a': 2, 't': 1, ' ': 6, 'i': 2, 's': 1, 'n': 2, '?': 2, 'I': 1, "'": 1, 'm': 1, 'f': 1, ',': 1, 'r': 1, 'y': 1, 'u': 1}
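
A shorter variation on the same idea uses the `dict.get` method, which returns a default value when the key is missing. This is an alternative sketch, not the approach used in the cells above, and it counts a hard-coded string rather than calling `input`.

```python
counts = {}

for one_character in 'hello':
    # get the current count, or 0 if we haven't seen this character yet, and add 1
    counts[one_character] = counts.get(one_character, 0) + 1

print(counts)
```

The `if`/`else` version is more explicit about the two cases; this one folds them into a single line.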


# Three paradigms for using dicts

1. Define it at the start of the program, treat it as read only.
2. Define the keys and default values (0, empty list, etc.) at the start of the program. Update the values, but don't add/remove keys.
3. Define an empty dict at the start of the program. Add keys as necessary, and update values when you see the same key a subsequent time.
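
The three paradigms can be put side by side in a short sketch. It reuses the dict names from the earlier cells, but with hard-coded data in place of `input`, so the specific numbers and cities are only examples.

```python
# 1. read-only dict: defined once, then only consulted
menu = {'sandwich': 15, 'tea': 7}
price = menu['tea']

# 2. fixed keys with starting values; only the values change
counts = {'evens': 0, 'odds': 0}
for n in [10, 15, 17, 18]:
    if n % 2 == 0:
        counts['evens'] += 1
    else:
        counts['odds'] += 1

# 3. empty dict; keys are added as we encounter them
rainfall = {}
for city, mm in [('Jerusalem', 4), ('Tel Aviv', 3), ('Tel Aviv', 2)]:
    if city in rainfall:
        rainfall[city] += mm
    else:
        rainfall[city] = mm

print(price, counts, rainfall)
```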

# Exercise: Rainfall exercise

1. Define an empty dict, called `rainfall`. The dict's keys will be city names, and its values will be integers -- mm of rain that fell in each city.
2. We're going to ask the user to enter a city, again and again. When we get an empty city name, we'll stop asking.
3. If we got a city name, then we're going to ask how much rain (in mm) fell in that city in the last day.
    - If this is the first time we're seeing a city, then add the city + its rainfall as a key-value pair.
    - If this is not the first time we're seeing a city, then add the current rainfall to the existing value.
4. Print the `rainfall` dict.

Example:

    City: Jerusalem
    Rain: 4
    City: Tel Aviv
    Rain: 3
    City: Tel Aviv
    Rain: 2
    City: [ENTER]
    {'Jerusalem':4, 'Tel Aviv':5}

In [58]:
rainfall = {}  # empty dict

while True:
    city_name = input('City: ').strip()
    
    if city_name == '':
        break
        
    mm_rain = input('Rain: ').strip()

    if not mm_rain.isdigit():
        print('\tNot numeric; try again')
        continue

    mm_rain = int(mm_rain)     # safe: we already checked that mm_rain contains only digits
    
    if city_name in rainfall:
        rainfall[city_name] += mm_rain    # add to the existing value if the city is already in the dict
    else:
        rainfall[city_name] = mm_rain     # add the key-value pair if the city is new
        
print(rainfall)        

City: a
Rain: 5
City: b
Rain: 4
City: a
Rain: asdfsadfa
	Not numeric; try again
City: a
Rain: 3
City: 
{'a': 8, 'b': 4}


In [59]:
# does in only work on the keys? 

d = {'a':10, 'b':20, 'c':30}

'a' in d

True

In [60]:
10 in d

False

How do we search for values in a dict?

1. Try to structure your data so you don't have to do this.
2. If you really need to, there are dict methods `.keys()` and `.values()`. If you want, you can say something like `10 in d.values()`.

In [61]:
10 in d.values()

True

In [62]:
# Rainfall, dict of lists edition

rainfall = {}  # empty dict

while True:
    city_name = input('City: ').strip()
    
    if city_name == '':
        break
        
    mm_rain = input('Rain: ').strip()

    if not mm_rain.isdigit():
        print('\tNot numeric; try again')
        continue

    mm_rain = int(mm_rain)     # safe: we already checked that mm_rain contains only digits
    
    if city_name in rainfall:
        rainfall[city_name].append(mm_rain)    # add to the city's existing list
    else:
        rainfall[city_name] = [mm_rain]        # new city: start a one-element list

print(rainfall)        

City: 
{}


# Next up

1. Loops and dicts (and avoiding some common mistakes)
2. How do dicts work behind the scenes?
3. Files
    - Reading from text files
    - Writing (a little) to text files
    
    

# Loops and dicts

We've seen that we can run a `for` loop over a variety of data structures:

- Iterating over a string gives us the characters
- Iterating over a list gives us the elements
- Iterating over a tuple also gives us its elements

What happens if we iterate over a dict?

In [63]:
d = {'a':10, 'b':20, 'c':30}

for one_item in d:
    print(one_item)

a
b
c


In [64]:
# Iterating over a dict gives you the keys
# Let's use that to print our dict

for one_key in d:
    print(f'{one_key}: {d[one_key]}')

a: 10
b: 20
c: 30


In [65]:
# I mentioned before that there is a "dict.keys" method that returns the keys
# you *could* iterate over dict.keys -- but iterating directly over d does the same thing with less typing (and is more Pythonic)



In [67]:
# it's kind of annoying that I get the keys, and then I retrieve the values via the keys
# is there a way for me to get both the keys and the values?

# yes, with the .items method!

# every iteration over .items gives me a 2-element tuple, (key, value)

for one_item in d.items():
    print(one_item)

('a', 10)
('b', 20)
('c', 30)


In [68]:
for t in d.items():
    print(f'{t[0]}: {t[1]}')   # t[0] is the key, and t[1] is the value

a: 10
b: 20
c: 30


In [69]:
# let's take advantage of unpacking
for t in d.items():
    one_key, one_value = t
    print(f'{one_key}: {one_value}')

a: 10
b: 20
c: 30


In [70]:
# we can shrink this even more by putting unpacking in our for loop

for one_key, one_value in d.items():
    print(f'{one_key}: {one_value}')

a: 10
b: 20
c: 30


In [71]:
for one_key, one_value in rainfall.items():
    print(f'{one_key}: {one_value}')

In [72]:
# Rainfall, dict of lists edition

rainfall = {}  # empty dict

while True:
    city_name = input('City: ').strip()
    
    if city_name == '':
        break
        
    mm_rain = input('Rain: ').strip()

    if not mm_rain.isdigit():
        print('\tNot numeric; try again')
        continue

    mm_rain = int(mm_rain)     # safe: we already checked that mm_rain contains only digits
    
    if city_name in rainfall:
        rainfall[city_name].append(mm_rain)    # add to the city's existing list
    else:
        rainfall[city_name] = [mm_rain]        # new city: start a one-element list

        
for one_key, one_value in rainfall.items():
    print(f'{one_key}: {one_value}')        

City: a
Rain: 5
City: b
Rain: 4
City: a
Rain: 3
City: b
Rain: 2
City: a
Rain: 6
City: b
Rain: 5
City: 
a: [5, 3, 6]
b: [4, 2, 5]


In [73]:
rainfall

{'a': [5, 3, 6], 'b': [4, 2, 5]}

In [74]:
for one_key, all_values in rainfall.items():
    print(one_key)

    for one_value in all_values:
        print(f'\t{one_value}')

a
	5
	3
	6
b
	4
	2
	5


# How do dicts work?

How is it that:

- Dict keys need to be immutable?
- Searching in a dict is so fast?
- Keys must be unique?
- Values can be anything?
- We can search via keys, but not via values?

The answer: A hash function.  In other words, when Python decides where in memory to store each of the key-value pairs in our dict, it uses the key to make that calculation.

In other words, a key-value pair with the key `'a'` will be stored in a place based on calling `hash('a')`, where `hash` is a special function that turns an immutable value (such as a string) into a number. That number is very hard to predict, but it's deterministic: within a single run of Python, the same key always gives the same number.

How does this work? How is it different from lists?

Lists are similar to an office building in which you don't know where people work. Searching in a list means going through each element, one at a time, hoping that you'll find the element you're looking for. You might have to search the entire list before either finding the value or giving up. That can take a long time, if you have a lot of data.

Dicts work very differently, thanks to the hash function. You can look at the key and jump to the right place in memory, and then ask: Is my data here? If so, then great; you're done. If not, then it's not in the dictionary; you're also done. This is super fast -- we barely need to search at all.

This explains a lot:

- Why do dict keys need to be immutable? Because changing them would make their hash value irrelevant or wrong. The key-value pairs would be stored in the wrong place in the dict, and they would get lost.

- Why do keys need to be unique? Because two equal keys have the same hash value, and thus point to the same location in the dict. There's only room for one value per key, so assigning to an existing key replaces its value rather than adding a second pair.

- Values aren't part of the hash function, so they can be anything at all.
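
You can call `hash` yourself to see some of this in action. This is only a sketch of the visible behavior, not of how dicts are implemented internally; the variable `message` is just a name chosen for this example.

```python
print(hash('a') == hash('a'))           # same key, same hash (within one run of Python)
print(hash((1, 2)) == hash((1, 2)))     # tuples of immutable values can be hashed, too

message = None
try:
    hash([1, 2])                        # lists are mutable, so they cannot be hashed
except TypeError as e:
    message = str(e)

print(message)
```

The error message is the same one we get when we try to use a list as a dict key.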



In [75]:
mylist = [10, 20, 30]

d[mylist] = 100

TypeError: unhashable type: 'list'

# Files

We use files all of the time:

- Word files
- Powerpoint files
- Excel files
- PDF files

All a file is, is a bunch of bytes on disk that allow us to turn off the computer, turn it back on, and still have our data.  

We're going to be using plain-text files -- as you would probably see for configuration, for logs, or for data. 

How can we read a file? We don't have access to the disk. But the OS does. So:

- We need to ask the OS for help in opening the file
- It does this by giving us a "file handle," with which we can read from the file
- This file handle goes through the OS, and makes sure that we don't do any monkey business

The way this works in Python is that we run the `open` function, and it gives us the Python equivalent of a file handle, namely a file object. This object acts as our agent -- we ask it to read from the file, it asks the OS to read from the file, the OS asks the disk to read from the file, and then we get the file's contents (if all goes well).

In [83]:
# option 1 for reading from a file: open it, and run the "read" method on the file object

f = open('linux-etc-passwd.txt')   # opened the file for reading, the default


In [84]:
f   # printed representation of my file object, showing name, mode, and encoding

<_io.TextIOWrapper name='linux-etc-passwd.txt' mode='r' encoding='UTF-8'>

In [85]:
print(f.read())  # this returns a string, the entire contents of the file

# This is a comment
# You should ignore me
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/usr/sbin/nologin
man:x:6:12:man:/var/cache/man:/usr/sbin/nologin
lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin
mail:x:8:8:mail:/var/mail:/usr/sbin/nologin



news:x:9:9:news:/var/spool/news:/usr/sbin/nologin
uucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin
proxy:x:13:13:proxy:/bin:/usr/sbin/nologin
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
backup:x:34:34:backup:/var/backups:/usr/sbin/nologin
list:x:38:38:Mailing List Manager:/var/list:/usr/sbin/nologin
irc:x:39:39:ircd:/var/run/ircd:/usr/sbin/nologin
gnats:x:41:41:Gnats Bug-Reporting System (admin):/var/lib/gnats:/usr/sbin/nologin

nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
syslog:x:101:104::/home/syslog:/bin/false
messagebus:x:102:106::/var/run/dbu

In [86]:
# not sure where Jupyter is running?

%pwd   

'/Users/reuven/Courses/Current/oreilly-2023-05May-python'

We could run `read()` on a file object, if it's small.

But what if it's big? 

The answer is: read it, one line at a time.  If we do that, then the odds that we run out of memory are very slim.

We can do that by iterating over our file object in a `for` loop.

In [88]:
f = open('linux-etc-passwd.txt')

for one_line in f:   # iterating over a file gives you the next line (through \n) with each iteration
    print(one_line.strip())   # remove the newline from the end of the line

# This is a comment
# You should ignore me
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/usr/sbin/nologin
man:x:6:12:man:/var/cache/man:/usr/sbin/nologin
lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin
mail:x:8:8:mail:/var/mail:/usr/sbin/nologin



news:x:9:9:news:/var/spool/news:/usr/sbin/nologin
uucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin
proxy:x:13:13:proxy:/bin:/usr/sbin/nologin
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
backup:x:34:34:backup:/var/backups:/usr/sbin/nologin
list:x:38:38:Mailing List Manager:/var/list:/usr/sbin/nologin
irc:x:39:39:ircd:/var/run/ircd:/usr/sbin/nologin
gnats:x:41:41:Gnats Bug-Reporting System (admin):/var/lib/gnats:/usr/sbin/nologin

nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
syslog:x:101:104::/home/syslog:/bin/false
messagebus:x:102:106::/var/run/dbu

In [91]:
# Let's print all of the usernames in my passwd file
# each useful line has a record with fields separated by ':'
# the username is the first field on each line, meaning before the first :

# let's read each line of the file, and break each line into a list using split on :
# then field 0 will be our username

for one_line in open('linux-etc-passwd.txt'):    # iterate over each line of linux-etc-passwd

    if one_line[0] == '#':                       # ignore lines starting with #
        continue
        
    if one_line.strip() == '':                   # if, after removing whitespace, nothing is left -- ignore it
        continue
    
    fields = one_line.split(':')                 # get the current line into a list
    print(fields[0])                             # username is first field, aka fields[0] -- print it

root
daemon
bin
sys
sync
games
man
lp
mail
news
uucp
proxy
www-data
backup
list
irc
gnats
nobody
syslog
messagebus
landscape
jci
sshd
user
reuven
postfix
colord
postgres
dovecot
dovenull
postgrey
debian-spamd
memcache
genadi
shira
atara
shikma
amotz
mysql
clamav
amavis
opendkim
gitlab-redis
gitlab-psql
git
opendmarc
dkim-milter-python
deploy
redis


# Different filenames + paths

- Unix
    - If we give a filename without any `/` characters, Python assumes the file is in the current directory, meaning the directory in which Python is running.
    - If the filename contains a `/` but doesn't start with it, then it's a *relative path*, and Python will try to open it relative to the current directory. So `abc/def.txt` will be the file `def.txt` in the `abc` subdirectory of the current directory.
    - If the filename *starts* with `/`, then Python will start at the top of the filesystem, as in `/etc/passwd`.
    
- Windows
    - If we give a filename without any `\` characters, Python assumes the file is in the current directory, meaning the directory in which Python is running.
    - If the filename contains a `\` but doesn't start with it, then it's a *relative path*, and Python will try to open it relative to the current directory. So `abc\def.txt` will be the file `def.txt` in the `abc` subdirectory of the current directory.
    - If the filename *starts* with `c:\`, then Python will start at the top of that disk in looking for the file, as in `c:\etc\passwd`.


# code -> Markdown -> code

- I can switch a cell to code mode by going into command mode (ESC) and then `y`.
- I can switch a cell to Markdown mode by going into command mode (ESC) and then `m`.

shift-Enter  executes Python code, but renders Markdown into HTML.

# Next up

1. Exercise with reading from files
    - Simple reading from a file
    - More advanced reading
2. Writing to files
    - the `with` construct
    - writing a dict to a config file
    
Download this zipfile: https://files.lerner.co.il/exercise-files.zip    
    

In [92]:
%pwd

'/Users/reuven/Courses/Current/oreilly-2023-05May-python'

# Recap: Reading from files

1. We have to use `open` to start working with a file. By default, `open` gives us a file object that can only read from the file, and it treats the file's contents as text.
2. We say `open(filename)`, where `filename` is any filename, relative path, or absolute path on your filesystem. (You need to have permission to read from it, of course.)
3. If you want, you could then run the `.read()` method on the file object you get back from `open`. But that's often going to be dangerous, because you don't know how big the file is, and reading one very big string could cause very big problems.
4. It's best, and most Pythonic, to iterate over a file object:
    - Each iteration uses only as much memory as one line in the file
    - With each iteration, we get back a string ending with `\n`
    - The `for` loop ends automatically when we get to the end of the file
    - You can use `if` to filter lines of the file you don't want
    - You can use `split` to break the line into a list, when that's appropriate
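
The recap above can be sketched as follows. This is only an illustration: the sample data and the filename `sample.txt` are made up, and the file is created in a temporary directory so that the example is self-contained. It uses `with` and `open(filename, 'w')`, which these notes explain in the section on writing to files.

```python
import os
import tempfile

# Made-up sample data (not one of the course files), including one blank line
sample_text = 'apple 10\n\nbanana 20\ncherry 30\n'

with tempfile.TemporaryDirectory() as dirname:
    filename = os.path.join(dirname, 'sample.txt')   # hypothetical filename

    with open(filename, 'w') as outfile:             # create the sample file
        outfile.write(sample_text)

    fields_per_line = []
    with open(filename) as infile:
        for one_line in infile:                      # one line per iteration
            if one_line.strip() != '':               # filter out blank lines
                fields_per_line.append(one_line.split())   # list of fields

print(fields_per_line)
```

Each element of `fields_per_line` is the list that `split` returned for one non-blank line. The blank line never makes it into the list, because the `if` filters it out.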

# Exercise: Summing numbers

1. There is a file (in my zipfile, and the GitHub repo) called `nums.txt`. In that file, each line contains one integer, surrounded by some amount of whitespace. That's true except for one line with only whitespace and no integer.
2. Go through this file, one line at a time, and sum/add the numbers.



In [93]:
!cat nums.txt

5
	10     
	20
  	3
		   	20        

 25


Some hints:

- Go through the file, one line at a time, with `for`
- How can you turn the digits in that string into an integer? (`int`)
- How can you remove the leading/trailing whitespace on each line? (`strip`)
- How can you avoid problems with the single blank line? (One option: `strip`, then check whether the result is an empty string)

If you're using Windows, then remember that Windows uses `\` to separate the directories in a path. Python uses `\` inside a string to indicate that we're about to get a special character, such as `\n` for newline.

This can and will collide!  You thus either want to double your backslashes (so Python treats each pair as one literal backslash) or use a raw string, with an `r` before the opening quote in a string.
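
To see the collision for yourself, here is a quick sketch that only builds strings and doesn't open any files. The variable names are just for illustration.

```python
doubled = 'exercise-files\\nums.txt'    # doubled backslash: one literal \
raw = r'exercise-files\nums.txt'        # raw string: the backslash is literal
collided = 'exercise-files\nums.txt'    # oops: \n here is a newline character

print(doubled == raw)        # the first two spellings give the same string
print('\n' in collided)      # the third one contains a newline
print('\\' in collided)      # ... and no backslash at all
```

The first two lines print `True`, and the last prints `False`. The third string isn't a usable Windows path, because the `\n` was swallowed up as a newline.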

In [94]:
%pwd

'/Users/reuven/Courses/Current/oreilly-2023-05May-python'

In [95]:
# if I want to open a file in a subdirectory of the current directory (as seen above in %pwd), then
# I can say

open('subdir/file.txt')    # subdir must exist, as must file.txt

FileNotFoundError: [Errno 2] No such file or directory: 'subdir/file.txt'

In [None]:
# this means:
# 1. I'm on a Windows machine
# 2. The path doesn't start with \ or with a drive letter, so it's a relative path
# 3. there is a directory called exercise-files in the current directory
# 4. there is a file called nums.txt in that directory

f = open('exercise-files\\nums.txt')

In [100]:
import os
os.getcwd()  # this returns a string, the current working directory

'/Users/reuven/Courses/Current/oreilly-2023-05May-python'

In [104]:
# iterate over the lines of the file
# because the file is in the current directory, I don't have any special path stuff to do 

total = 0
for one_line in open('nums.txt'):      # go through the file, one line at a time
    if one_line.strip() == '':         # if, after removing whitespace, nothing is left...
        print('Bad line -- ignoring')  #   ... ignore the line
    else:                              # if there's something left on the line...
        total += int(one_line)         #   ... turn it into an int, and add to total
        
print(total)        

Bad line -- ignoring
83


In [105]:
# even more robust version -- check that after stripping, a line only has digits

total = 0
for one_line in open('nums.txt'):       # go through the file, one line at a time
    if not one_line.strip().isdigit():  # remove whitespace, and check that only digits are left
        print('Bad line -- ignoring')   #   ... ignore the line
    else:                               # if only digits are left on the line...
        total += int(one_line)          #   ... turn it into an int, and add to total
        
print(total)        

Bad line -- ignoring
83


In [106]:
# what if this file were in a subdirectory of the current directory?

filename = 'data/also-nums.txt'    # this is the file in a subdirectory

total = 0
for one_line in open(filename):         # go through the file, one line at a time
    if not one_line.strip().isdigit():  # remove whitespace, and check that only digits are left
        print('Bad line -- ignoring')   #   ... ignore the line
    else:                               # if only digits are left on the line...
        total += int(one_line)          #   ... turn it into an int, and add to total
        
print(total)        

Bad line -- ignoring
83


# Another example of reading from a file: counting IP addresses

I have a file `mini-access-log.txt`, which contains log info from a Web server years ago.

I want a dict containing the IP addresses and how many times each address accessed our server.

In [107]:
!head mini-access-log.txt

67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0" 200 99 "-" "Mozilla/5.0 (Twiceler-0.9 http://www.cuil.com/twiceler/robot.html)"
66.249.71.65 - - [30/Jan/2010:00:12:06 +0200] "GET /browse/one_node/1557 HTTP/1.1" 200 39208 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
65.55.106.183 - - [30/Jan/2010:01:29:23 +0200] "GET /robots.txt HTTP/1.1" 200 99 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.183 - - [30/Jan/2010:01:30:06 +0200] "GET /browse/one_model/2162 HTTP/1.1" 200 2181 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
66.249.71.65 - - [30/Jan/2010:02:07:14 +0200] "GET /browse/browse_applet_tab/2593 HTTP/1.1" 200 10305 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.71.65 - - [30/Jan/2010:02:10:39 +0200] "GET /browse/browse_files_tab/2499?tab=true HTTP/1.1" 200 446 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.12 - -

In [111]:
filename = 'mini-access-log.txt'
counts = {}

for one_line in open(filename):
    ip_address = one_line.split()[0]   # split the line on whitespace into a list... grab the first field
    
    if ip_address in counts:
        counts[ip_address] += 1        # if we've seen this IP before, add 1 to its count
    else:
        counts[ip_address] = 1         # new IP? add the key-value pair to our dict
        
for key, value in counts.items():
    print(f'{key}: {value}')

67.218.116.165: 2
66.249.71.65: 3
65.55.106.183: 2
66.249.65.12: 32
65.55.106.131: 2
65.55.106.186: 2
74.52.245.146: 2
66.249.65.43: 3
65.55.207.25: 2
65.55.207.94: 2
65.55.207.71: 1
98.242.170.241: 1
66.249.65.38: 100
65.55.207.126: 2
82.34.9.20: 2
65.55.106.155: 2
65.55.207.77: 2
208.80.193.28: 1
89.248.172.58: 22
67.195.112.35: 16
65.55.207.50: 3
65.55.215.75: 2
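
The `if`/`else` in the loop above can also be collapsed into a single line with `dict.get`, which returns a default value when the key isn't in the dict yet. Here is a sketch of that variation. So that it runs without the log file, it loops over a short list of strings, which are shortened copies of the first few lines shown above. The name `sample_lines` is just for this example.

```python
# Shortened copies of the first five log lines, standing in for the file
sample_lines = [
    '67.218.116.165 - - [30/Jan/2010:00:03:18 +0200] "GET /robots.txt HTTP/1.0" 200 99\n',
    '66.249.71.65 - - [30/Jan/2010:00:12:06 +0200] "GET /browse/one_node/1557 HTTP/1.1" 200 39208\n',
    '65.55.106.183 - - [30/Jan/2010:01:29:23 +0200] "GET /robots.txt HTTP/1.1" 200 99\n',
    '65.55.106.183 - - [30/Jan/2010:01:30:06 +0200] "GET /browse/one_model/2162 HTTP/1.1" 200 2181\n',
    '66.249.71.65 - - [30/Jan/2010:02:07:14 +0200] "GET /browse/browse_applet_tab/2593 HTTP/1.1" 200 10305\n',
]

counts = {}

for one_line in sample_lines:
    ip_address = one_line.split()[0]    # first whitespace-separated field

    # get the current count (or 0 if this IP is new), then add 1
    counts[ip_address] = counts.get(ip_address, 0) + 1

for key, value in counts.items():
    print(f'{key}: {value}')
```

Whether this is clearer than the explicit `if`/`else` is a matter of taste. Both build the same dict.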


# Writing to files

If we want to write to a file:

- Open it, but for writing (not for reading -- normally, you have to choose)
- We have to ensure that the file is closed when we're done writing to it. Otherwise, some of what we wrote might sit in memory and not reach the disk for a while, or ever.

So:

1. I'll open the file, passing `'w'` as the second argument to `open`, indicating that I want to open it for writing
2. I'll use the `write` method to write to the file. I can write any string I want. It's sort of like `print`, but doesn't automatically add a newline.
3. How will we close our file? One way is to explicitly use the `close` method on our file object. But there's a nicer way, which looks weird, namely the `with` statement.
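
Before getting to `with`, here is a sketch of the explicit `close` approach from step 3. The filename `explicit-close.txt` is made up, and the file is created in a temporary directory (via `tempfile.mkdtemp`) so that the example doesn't overwrite anything of yours.

```python
import os
import tempfile

dirname = tempfile.mkdtemp()                        # a fresh, empty directory
filename = os.path.join(dirname, 'explicit-close.txt')   # hypothetical filename

outfile = open(filename, 'w')    # open for writing
outfile.write('abcd')            # no newline is added for us...
outfile.write('efgh\n')          # ... so this continues on the same line
outfile.close()                  # make sure everything reaches the disk

print(outfile.closed)            # is the file object closed now?

infile = open(filename)          # open again, this time for reading
contents = infile.read()         # safe here, because we know the file is tiny
infile.close()

print(repr(contents))

os.remove(filename)              # clean up the temporary file and directory
os.rmdir(dirname)
```

The two `write` calls end up on a single line in the file, because only the second one included `\n`. The downside of this approach is that it's easy to forget the call to `close`.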

In [112]:
# when you open a file for writing, IT WILL DESTROY THAT FILE if it already exists, replacing
# it with a new file of 0 length

# (1) I'm opening the file for writing with the filename + 'w'
# (2) I'm putting it in this weird "with" statement to ensure that the file is closed when we're done
# (3) The 'as' is a variable assignment in disguise... inside of the block, outfile is our file object for writing
# (4) we can write whatever string we want -- don't forget the newline

with open('myfile.txt', 'w') as outfile:
    outfile.write('abcd\n')
    outfile.write('efgh\n')
    
    # thanks to "with", the file is closed when we leave this block

In [113]:
!cat myfile.txt

abcd
efgh
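
Here is a sketch that checks two of the claims in the comments above: that the file is closed once the `with` block ends, and that opening an existing file with `'w'` replaces its contents. It works in a temporary directory so that it doesn't touch the real `myfile.txt`. The name `closed_after_block` is just for this example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as dirname:
    filename = os.path.join(dirname, 'myfile.txt')

    with open(filename, 'w') as outfile:
        outfile.write('abcd\n')
        outfile.write('efgh\n')

    # we've left the block, so the file should already be closed
    closed_after_block = outfile.closed

    # opening the same file with 'w' again starts over with an empty file
    with open(filename, 'w') as outfile:
        outfile.write('ijkl\n')

    with open(filename) as infile:
        contents = infile.read()

print(closed_after_block)
print(repr(contents))
```

Only `'ijkl\n'` is left in the file at the end. The two lines written the first time around are gone.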
