# Week 02: More Python

This week's learning goals are as follows:

1. Manipulate strings.
1. Read and write regular files and Comma-Separated Value files (CSVs).
1. Understand tuples and sets.
1. Be able to create and use dictionaries.
1. Use NumPy for very simple statistics.

For reference: [Reading different OS newlines](https://stackoverflow.com/questions/2536545/how-to-write-unix-end-of-line-characters-in-windows-using-python/23434608#23434608)

This notebook uses data from [Folger Digital Texts](http://www.folgerdigitaltexts.org/?chapter=5&play=Oth). The version of Othello by William Shakespeare from Folger is saved for student convenience at ```txts/othello.txt```.

In [89]:
# the following code guarantees you'll properly reload any modules that you custom-defined in your environment.
# you don't need to understand it.
# just run this once at the beginning.
# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
import os
import sys

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1. Strings

Most of the material is based on the [Google for Education Python course](https://developers.google.com/edu/python/strings).

We saw strings last week, but this week we're going to get into more of how we can manipulate strings. Strings are the means with which we can interact with file data on our computers.

In [6]:
s = 'hi'
print(s[1])          ## i
print(len(s))        ## 2
print(s + ' there')  ## hi there
print(s*4) ## multiplication works, but it's not super useful in general

i
2
hi there
hihihihi


You can concatenate strings, but you can't concatenate strings to integers or floats, or to lists.

In [7]:
pi = 3.14
#text = 'The value of pi is ' + pi      ## NO, does not work
text = 'The value of pi is '  + str(pi)  ## yes
print(text)

The value of pi is 3.14


**Special characters**

There are two that we are concerned with: tab (```\t```) and newline (```\n```).

In [10]:
tab_text = 'This line has\ta tab'
print(tab_text)
print()
newline_text = 'This line has\na newline'
print(newline_text)

This line has	a tab

This line has
a newline


**Slicing strings**

We can slice strings using the same type of list indexing that we did earlier.

In [12]:
text = 'hellooooooooooo world'
print(text[:5])
print(text[5:])
print(text[5:10])

hello
oooooooooo world
ooooo


### String functions

There are plenty of functions that are exclusive to strings that will really help us work with files. The bolded ones are the most important.

- ```s.lower()```, ```s.upper()``` -- returns the lowercase or uppercase version of the string
- ```s.strip()``` -- returns a string with whitespace removed from the start and end
- ```s.isalpha()/s.isdigit()/s.isspace()...``` -- tests if all the string chars are in the various character classes
- **```s.startswith('other'), s.endswith('other')```** -- tests if the string starts or ends with the given other string
- ```s.find('other')``` -- searches for the given other string (not a regular expression) within s, and returns the first index where it begins or -1 if not found
- ```s.replace('old', 'new')``` -- returns a string where all occurrences of 'old' have been replaced by 'new'
- **```s.split('delim')```** -- returns a list of substrings separated by the given delimiter. The delimiter is not a regular expression, it's just text. ```'aaa,bbb,ccc'.split(',')``` -> ```['aaa', 'bbb', 'ccc']```. As a convenient special case ```s.split()``` (with no arguments) splits on all whitespace chars.
- **```s.join(list)```** -- opposite of ```split()```, joins the elements in the given list together using the string as the delimiter. e.g. '---'.join(['aaa', 'bbb', 'ccc']) -> ```aaa---bbb---ccc```
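
The test-style functions in the list above (```isalpha()```, ```startswith()```, ```find()```, and so on) are not demonstrated elsewhere in this section, so here is a minimal sketch of how they behave. The sample strings are made up for illustration.

```python
# Hypothetical sample string, chosen only for illustration
filename = 'othello.txt'

print(filename.endswith('.txt'))      # True
print(filename.startswith('hamlet'))  # False

print('abc'.isalpha())    # True: every character is a letter
print('abc1'.isalpha())   # False: '1' is not a letter
print('2015'.isdigit())   # True: every character is a digit
print(' \t\n'.isspace())  # True: only whitespace characters

# find() returns the index where the first match begins, or -1 if there is none
print(filename.find('ello'))  # 3
print(filename.find('xyz'))   # -1
```

Because these functions return ```True```/```False``` (or an index), they fit naturally inside ```if``` statements when you filter lines of a file.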

Let's first go through the not-so-useful ones.

In [20]:
sentence_ex = "The Quick Brown Fox Jumps over the Lazy Dog."
print(sentence_ex)

sentence_lower = sentence_ex.lower()
print(sentence_lower)

sentence_upper = sentence_ex.upper()
print(sentence_upper)

The Quick Brown Fox Jumps over the Lazy Dog.
the quick brown fox jumps over the lazy dog.
THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.


In [23]:
sentence_whitespace = "  \t\t\t " + sentence_ex + "    \n \t\t\n"
print(sentence_whitespace)
print('stripped version:***' + sentence_whitespace.strip()  + "***end stripped version")

  			 The Quick Brown Fox Jumps over the Lazy Dog.    
 		

stripped version:***The Quick Brown Fox Jumps over the Lazy Dog.***end stripped version


In [19]:
sentence_punctuation = "/.,/.,asdf/.,//.,&"
print(sentence_punctuation.strip('/.,')) # the right end isn't stripped: the string ends in '&', which isn't in '/.,'
print(sentence_punctuation.strip('/.,&'))

asdf/.,//.,&
asdf


In [27]:
s = sentence_ex.lower()
print('First instance of the:', s.find('the'))
print(s.replace('the', 'a'))

First instance of the: 0
a quick brown fox jumps over a lazy dog.


Now here are the most important ones:

**split** converts a string into a list of strings by separating on a given element.

In [28]:
s.split(' ')

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.']

**join** converts a list of strings into a string by joining on a given element.

In [29]:
arr = ['sdf']*5
print('&'.join(arr))

sdf&sdf&sdf&sdf&sdf


In [34]:
print('original:\t' + s)
print('with commas:\t' + ','.join(s.split(' ')))
print('with tabs:\t' + '\t'.join(s.split(' ')))

original:	the quick brown fox jumps over the lazy dog.
with commas:	the,quick,brown,fox,jumps,over,the,lazy,dog.
with tabs:	the	quick	brown	fox	jumps	over	the	lazy	dog.


### Formatting strings

There are a few ways to format strings so that you can include numerical output with strings. 

The easiest way is to use the plus sign with type-casting (meaning that you explicitly convert non-strings to strings):

In [66]:
print("this is the simplest " + str(2347.0) + " thing but it's annoying")

this is the simplest 2347.0 thing but it's annoying


The second way is to explicitly set the formatting of each value:

In [73]:
print("This is a little more fancy %03d, %d, %s, %f, %.2f, %.3f" % \
      (2347.0, 2347.0, 2347.0, 2347.0, 2347.0, 2347.0))
print("...but it requires you to know formatting tricks")

This is a little more fancy 2347, 2347, 2347.0, 2347.000000, 2347.00, 2347.000
...but it requires you to know formatting tricks


Finally, the more common way nowadays is to let Python strings do the formatting automatically for you, as follows:

In [77]:
print("This is the convention now: {}".format(2347.0))
print("We let Python decide the best display: {} {} {}".format(37.0304, 27, 00.345))

This is the convention now: 2347.0
We let Python decide the best display: 37.0304 27 0.345
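
The ```{}``` placeholders can also carry the same kind of formatting codes as the ```%``` style, placed after a colon. The sketch below shows a few of these, plus f-strings, which are an alternative spelling of the same idea available in Python 3.6 and later. The variable names are made up for illustration.

```python
value = 2347.0
count = 7

# Format codes go after a colon inside the braces
print('two decimals: {:.2f}'.format(value))          # two decimals: 2347.00
print('zero-padded: {:03d}'.format(count))           # zero-padded: 007
print('by position: {1} then {0}'.format('a', 'b'))  # by position: b then a

# f-strings (Python 3.6+) put the expression directly inside the braces
formatted = f'value is {value:.1f} and count is {count}'
print(formatted)  # value is 2347.0 and count is 7
```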


#### Programming exercises

Run the following block to get a string.

In [114]:
ubw_string = """I am the bone of my sword
Steel is my body and fire is my blood
I have created over a thousand blades
Unknown to Death, Nor known to Life
Have withstood pain to create many weapons
Yet, those hands will never hold anything
So as I pray, Unlimited Blade Works."""

Print a lower-case version of this string.

In [115]:
# lowercase version

Print an upper-case version of this string.

In [116]:
# uppercase version

Print a single-line version of this string. In other words, replace all the newline characters with spaces.

In [None]:
# single-line version

(10 minute exercise) Print out a version of this string where every word has its last character capitalized (and all other characters are lowercase).

Hint: use a for loop to iterate over every word. Create a string variable that gets a new word concatenated to it for each iteration of the loop.

In [117]:
# your code here

## 2. Files and CSVs

Finally, we are ready to read files! Python file reading (and writing) is a three-step process:
1. Create the file object and open the file
1. Read or write stuff from the file object
1. Close the file object

The convention is to let Python handle the file opening and closing for you. We do this as follows:

```
with open('filename.txt', mode) as f: # opens the file as f; mode is 'r', 'w', or 'a'
    # do file reading or writing by calling functions on f...
# now after the indented block, the file f is now closed
```

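In this course, we open a file in read mode (```'r'```), write mode (```'w'```), or append mode (```'a'```), one mode per ```open()``` call. Python also has combined read/write modes such as ```'r+'```, but we won't need them.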

In general, we try to put everything related to reading the file in the ```with``` block. However once we have finished calling functions on the file object, we exit the ```with``` block as soon as possible and just work with the variables defined in memory. I'll try to show you how I do this through the rest of this section.
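
To make the open/use/close steps concrete, here is a minimal sketch showing that the ```with``` block really does close the file, and that variables filled inside the block remain usable afterwards. It writes to a temporary folder (an assumption made here so that nothing in ```txts``` is touched); the file name is hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary folder
tmp_dir = tempfile.mkdtemp()
fpath = os.path.join(tmp_dir, 'scratch.txt')

with open(fpath, 'w') as f:
    f.write('first line\nsecond line\n')
    print(f.closed)   # False: we are still inside the with block

print(f.closed)       # True: leaving the with block closed the file

# Read quickly inside the with block...
with open(fpath, 'r') as f:
    lines = f.readlines()

# ...then keep working with the variable after the file is closed
stripped = [line.strip() for line in lines]
print(stripped)       # ['first line', 'second line']
```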

For the first part of this section, we will be dealing with the folder ```02_python_advanced/txts```.

Let's first deal with **write mode**.

In [80]:
# Using write mode
with open(os.path.join('txts', 'hello_world.txt'), 'w') as f:
    f.write('hello world')
    # the write() function does not create newlines for you
    f.write(' and real life\n') # continues on same line then adds a newline
    f.write('a different line')

Now if you open the txt file in your text editor, you will see the file.

If we open the file for writing again, it automatically starts from the beginning, so the previous contents are erased. If we want to keep them, we can open the file in append (```'a'```) mode instead, which adds to the end of the file.

In [81]:
# Overwriting a file
with open(os.path.join('txts', 'hello_world.txt'), 'w') as f:
    f.write('hello world again')
    
with open(os.path.join('txts', 'hello_world.txt'), 'a') as f:
    f.write('\nthis is an appended line')

In general, you'll find that it's easier to construct what you want to write line by line, and then at the end open and write a file with code that makes use of ***```'\n'.join([list of lines])```***. This is easier to read and code up than opening a file and writing to it bit by bit. An example is below.

In [105]:
# this code is harder to read
with open(os.path.join('txts', 'example_write.txt'), 'w') as f:
    for i in range(3):
        f.write('line {}\n'.format(i))
            

In [106]:
# this code is easier to read
# ...and also more pythonic
lines = ['line {}'.format(i) for i in range(3)]
with open(os.path.join('txts', 'example_write.txt'), 'w') as f:
    f.write('\n'.join(lines))

For **read mode**, there are a few different ways we can read in the information:

In [82]:
# Using read mode
othello_fpath = os.path.join('txts', 'othello.txt')

We can read in the entire file at once with **```f.read()```**.

In [83]:
# read in everything at once with read()
with open(othello_fpath, 'r') as f:
    contents = f.read()
    print('number of characters in text file: {}'.format(len(contents)))

number of characters in text file: 153671


Alternatively, we could read in the entire file line by line with **```f.readline()```**, where the file object automatically detects newlines. **Note that when a file object reads a line, it keeps the newline character at the end.** When the file object no longer has any lines to read, it will return an empty string ```''```, so the best way to use this function is inside a ```while``` loop, where you always read in the next line at the end of a loop iteration.

In [99]:
# read in line by line
with open(othello_fpath, 'r') as f:
    num_lines = 0
    line = f.readline()
    last_line = None
    while len(line) != 0: # alternatively, while len(line):
        num_lines += 1
        last_line = line # save the last line read  
        line = f.readline() # read in a new line
print('num lines:', num_lines)
# here I've encapsulated it in a list just to show you there's a newline at the end
print(['last line:', last_line]) 

num lines: 5730
['last line:', '[They exit.]\n']


A third way is to use **f.readlines()**, which reads in the entire file, then splits on newlines. Again, it will keep the newline characters at the end.

In [102]:
# read in everything with line breaks
with open(othello_fpath, 'r') as f:
    lines = f.readlines()
    
print('num lines:', len(lines))
# here I've encapsulated it in a list just to show you there's a newline at the end
print(['last line:', lines[-1]]) 

num lines: 5730
['last line:', '[They exit.]\n']


Finally, the last way to read in files is to simply use a for loop on the file object itself. Python detects that you want to read in line by line and will exit the loop when there are no longer any lines to read.

In [104]:
# read in line by line with auto-exit
with open(othello_fpath, 'r') as f:
    num_lines = 0
    for line in f:
        num_lines +=1
        last_line = line
print('num lines:', num_lines)
# here I've encapsulated it in a list just to show you there's a newline at the end
print(['last line:', last_line]) 

num lines: 5730
['last line:', '[They exit.]\n']


Keep in mind that **a file object remembers how far it has read, so each read picks up where the last one stopped**. If you want to reread a file from the top, the simplest way is to close it, then open it again (```with ...``` syntax).

In [266]:
with open(othello_fpath, 'r') as f:
    f.readline() # the first line is read but isn't printed out
    print(">>>line 2", f.readline().strip()) # now this is the second line
    print(">>>20 lines starting from third line:")
    for i, line in enumerate(f):
        if i == 20: break
        print(line.strip()) # starts from the third line onward

>>>line 2 by William Shakespeare
>>>20 lines starting from third line:
Edited by Barbara A. Mowat and Paul Werstine
with Michael Poston and Rebecca Niles
Folger Shakespeare Library
http://www.folgerdigitaltexts.org/?chapter=5&play=Oth
Created on Jul 31, 2015, from FDT version 0.9.2

Characters in the Play
OTHELLO, a Moorish general in the Venetian army
DESDEMONA, a Venetian lady
BRABANTIO, a Venetian senator, father to Desdemona
IAGO, Othello's standard-bearer, or "ancient"
EMILIA, Iago's wife and Desdemona's attendant
CASSIO, Othello's second-in-command, or lieutenant
RODERIGO, a Venetian gentleman
Duke of Venice
Venetian gentlemen, kinsmen to Brabantio:
LODOVICO
GRATIANO
Venetian senators
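
The same "reads are used up" behavior can be seen on a tiny file. The sketch below uses a made-up three-line file in a temporary folder. It also shows ```f.seek(0)```, a standard file method that moves the read position back to the start; this is offered as an alternative to reopening the file, not something the rest of this notebook relies on.

```python
import os
import tempfile

# Hypothetical three-line file in a temporary folder
tmp_dir = tempfile.mkdtemp()
fpath = os.path.join(tmp_dir, 'three_lines.txt')
with open(fpath, 'w') as f:
    f.write('one\ntwo\nthree\n')

with open(fpath, 'r') as f:
    first_pass = f.readlines()   # reads everything
    second_pass = f.readlines()  # nothing is left to read
    f.seek(0)                    # jump back to the start of the file
    third_pass = f.readlines()   # reads everything again

print(first_pass)   # ['one\n', 'two\n', 'three\n']
print(second_pass)  # []
print(third_pass == first_pass)  # True
```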


The way you choose to read in a file depends on your use case. I tend to like using ```f.readlines()``` because I like using list comprehensions on the list of strings. However, sometimes it benefits you to use the last method if it's easier for you to think about a single line at a time; it's usually also a better use of memory. ```f.read()``` and ```f.readlines()``` require you to load the entire file into memory first, whereas ```f.readline()``` and ```for line in f``` allow you to read the file piece by piece.

Finally, note that we can remove all the whitespace and newline characters before/after a line by calling **line.strip()**.

In [125]:
# read in line by line with auto-exit
with open(othello_fpath, 'r') as f:
    for i, line in enumerate(f):
        if i == 10: break
        print(i+1, line.strip()) # will not print the newline

1 Othello
2 by William Shakespeare
3 Edited by Barbara A. Mowat and Paul Werstine
4 with Michael Poston and Rebecca Niles
5 Folger Shakespeare Library
6 http://www.folgerdigitaltexts.org/?chapter=5&play=Oth
7 Created on Jul 31, 2015, from FDT version 0.9.2
8 
9 Characters in the Play
10 OTHELLO, a Moorish general in the Venetian army


#### Programming Exercises

In the first cell, write a multi-line file of your choice and save it into the ```txts``` folder.

In the second cell, read your file and print it out with line numbers as shown:
```
line 1: This is my first line
line 2: This is my second line
```

Hint: you might want to use string formatting, as follows:
```
print('line {}: {}'.format(some_arg1, some_arg2))
```

In [None]:
# write your custom file here

In [108]:
# read your custom file here

In the following cell, print out the first fifty lines of ```txts/othello.txt``` with the following formatting:

```
line 1/5730: Othello
line 2/5730: by William Shakespeare
...
line 50/5730: IAGO  Despise me
```

Hint: you might want to use ```f.readlines()``` for this, as follows:
```
lines = f.readlines()
# some code to get the total number of lines here
for i, line in enumerate(lines):
    ...
```

### CSVs

CSVs are a large class of files that use commas as the main separator. Such files are easy to load into Excel or another spreadsheet program, because the csv format also assumes that you can visualize the data as columns or as rows.

The following section uses the folder ```02_python_advanced/csvs```.

You can read csvs using conventional Python file reading functions:

In [127]:
cheese_bedsheets_fpath = os.path.join('csvs', 'cheese_bedsheets.csv')
with open(cheese_bedsheets_fpath, 'r') as f:
    for line in f:
        print(line.strip()) # removes newline

Years,Per capita consumption of cheese (USA) in Pounds (USDA),Number of people who died by becoming tangled in their bedsheets in Deaths (US) (CDC)
2000,29.8,327
2001,30.1,456
2002,30.5,509
2003,30.6,497
2004,31.3,596
2005,31.7,573
2006,32.6,661
2007,33.1,741
2008,32.7,809
2009,32.8,717


For CSVs, we want to (1) remove all whitespace and newlines, and (2) separate by commas. The following code does these two things in a single line:

In [131]:
with open(cheese_bedsheets_fpath, 'r') as f:
    lines = [line.strip().split(',') for line in f]
for line in lines:
    print(line)

['Years', 'Per capita consumption of cheese (USA) in Pounds (USDA)', 'Number of people who died by becoming tangled in their bedsheets in Deaths (US) (CDC)']
['2000', '29.8', '327']
['2001', '30.1', '456']
['2002', '30.5', '509']
['2003', '30.6', '497']
['2004', '31.3', '596']
['2005', '31.7', '573']
['2006', '32.6', '661']
['2007', '33.1', '741']
['2008', '32.7', '809']
['2009', '32.8', '717']


However, it's more useful for us to use the ```csv``` module since it automatically strips and splits all lines for us. As always, we make sure that we import the package before using it:

**IMPORTANT NOTE:** Neither the CSV reader (as we learn it) nor the normal file reader converts the strings into other types. So even if we have floats in our csv, ```csv.reader()``` returns strings. It is up to us to convert to the proper type when we want to do more with the data.

In [132]:
# remember, this only has to be done once per file
import csv

In [133]:
with open(cheese_bedsheets_fpath, 'r') as f:
    for line in csv.reader(f):
        print(line)

['Years', 'Per capita consumption of cheese (USA) in Pounds (USDA)', 'Number of people who died by becoming tangled in their bedsheets in Deaths (US) (CDC)']
['2000', '29.8', '327']
['2001', '30.1', '456']
['2002', '30.5', '509']
['2003', '30.6', '497']
['2004', '31.3', '596']
['2005', '31.7', '573']
['2006', '32.6', '661']
['2007', '33.1', '741']
['2008', '32.7', '809']
['2009', '32.8', '717']
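
Since ```csv.reader()``` hands back strings, here is a minimal sketch of converting each row to numbers. To keep it self-contained, it reads from an in-memory string via ```io.StringIO``` instead of the real file, and the header text is shortened; both are assumptions made for illustration.

```python
import csv
import io

# In-memory stand-in for the first rows of cheese_bedsheets.csv
# (the header is shortened here; this is not the real file)
csv_text = 'Years,Cheese,Bedsheet deaths\n2000,29.8,327\n2001,30.1,456\n'

rows = list(csv.reader(io.StringIO(csv_text)))
header, data = rows[0], rows[1:]
print(data[0])       # ['2000', '29.8', '327']  (all strings)

# Convert each column to the type that makes sense for it
converted = [[int(year), float(cheese), int(deaths)]
             for year, cheese, deaths in data]
print(converted[0])  # [2000, 29.8, 327]
```

Skipping the header before converting matters: calling ```int('Years')``` would raise a ```ValueError```.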


The ```csv``` module is especially useful because sometimes your csv files have commas inside items. This happens when you have something saved in Excel, where you modify a cell to have commas in it, and the item appears as ```"item, stuff"``` with quotes around it. For example:

In [148]:
cage_harvard_fpath = os.path.join('csvs', 'cage_harvard.csv')

# using simple string splitting will split across commas
# inside a single item, too, which is not what we want
with open(cage_harvard_fpath, 'r') as f:
    lines = [line.strip().split(',') for line in f]
for line in lines:
    print(line)

['Year', '"Films (IMDB)', ' Cage', ' Nicholas"', '"Women (Harvard Crimson)', ' Female Editors', ' Harvard Law Review"']
['2005', '2', '9']
['2006', '3', '14']
['2007', '4', '19']
['2008', '1', '12']
['2009', '4', '19']


In [146]:
# quotechar tells the reader which character wraps a single item that may contain commas.
# the other options aren't needed but they're good to keep in.
with open(cage_harvard_fpath, 'r') as f:
    for line in csv.reader(f, quotechar='"', delimiter=',',
                     quoting=csv.QUOTE_MINIMAL):
        print(line)

['Year', 'Films (IMDB), Cage, Nicholas', 'Women (Harvard Crimson), Female Editors, Harvard Law Review']
['2005', '2', '9']
['2006', '3', '14']
['2007', '4', '19']
['2008', '1', '12']
['2009', '4', '19']
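
To see the difference between naive splitting and ```csv.reader()``` on quoted items without needing the file, here is a small sketch on a single in-memory line. The line is reconstructed from the split output shown earlier, so treat it as an assumed stand-in for the real header row of ```cage_harvard.csv```.

```python
import csv
import io

# Assumed stand-in for the header row of cage_harvard.csv
line = ('Year,"Films (IMDB), Cage, Nicholas",'
        '"Women (Harvard Crimson), Female Editors, Harvard Law Review"\n')

# Naive splitting cuts inside the quoted items
naive = line.strip().split(',')
print(len(naive))   # 7

# csv.reader treats everything between a pair of quotes as one item
parsed = next(csv.reader(io.StringIO(line)))
print(len(parsed))  # 3
print(parsed)
```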


#### Programming exercises

In ```util.py```, implement ```write_csv(fpath, tups)``` which takes in a filepath and a list of lists and saves a csv.

You can test it using the following cells.
1. Reimport your function by running the first cell whenever you change your util.py.
1. Run a particular test case cell.
1. Check the output of your csv by opening it in Notepad++ or by reading it through the notebook.

In [194]:
# run this every time you change your function
from util import write_csv

def read_csv(fpath):
    # do you have a newline at the end of every line?
    with open(fpath, 'r') as f:
        print('list of lines:', f.readlines())
    
    with open(fpath, 'r') as f:
        print('using the regular csv reader:')
        for line in csv.reader(f, quotechar='"', delimiter=',',
                     quoting=csv.QUOTE_MINIMAL):
            print(line)

In [168]:
def test_case1():
    fpath = os.path.join('csvs', 'test1.csv')
    tups = [ ('test', 'title', 'row'),
              ['only','strings','and'],
              ('equal', 'length', 'rows')]
    write_csv(fpath, tups)
test_case1()
#read_csv(os.path.join('csvs', 'test1.csv')) # uncomment this if you want to use it

list of lines: ['test,title,row\n', 'only,strings,and\n', 'equal,length,rows\n']
using the regular csv reader
['test', 'title', 'row']
['only', 'strings', 'and']
['equal', 'length', 'rows']


In [177]:
def test_case2():
    fpath = os.path.join('csvs', 'test2.csv')
    tups = [ ('now', 'test', 130.0, 'non-string', 'elements', 'and single line')]
    write_csv(fpath, tups)
test_case2()
# read_csv(os.path.join('csvs', 'test2.csv')) # uncomment this if you want to use it

Wrote csv to csvs/test2.csv


In [179]:
def test_case3():
    fpath = os.path.join('csvs', 'test3.csv')
    tups = [] # empty
    write_csv(fpath, tups)
test_case3()
# read_csv(os.path.join('csvs', 'test3.csv')) # uncomment this if you want to use it

Wrote csv to csvs/test3.csv


In [193]:
def test_case4():
    fpath = os.path.join('csvs', 'test4.csv')
    tups = [['finally, test', 'things with', 'commas, or'],
            ['different length', 'sublists'],
            ['but, single element should work too1!!'], 
            (234, 2348.0234, 'asfa', True, False, 0x234)]
    write_csv(fpath, tups)
test_case4()
# read_csv(os.path.join('csvs', 'test4.csv')) # uncomment this if you want to use it

Wrote csv to csvs/test4.csv
list of lines: ['"finally, test",things with,"commas, or"\n', 'different length,sublists\n', '"but, single element should work too1!!"\n', '234,2348.0234,asfa,True,False,564\n']
using the regular csv reader:
['finally, test', 'things with', 'commas, or']
['different length', 'sublists']
['but, single element should work too1!!']
['234', '2348.0234', 'asfa', 'True', 'False', '564']


We'll practice using CSVs more in the next few sections.

## 3. Tuples and sets
Other than lists, there are some other data structures that you might come across in Python that have different purposes.

A **tuple** is like a list, except that once you make a tuple, you can't modify the elements. This is useful when you want some read-only information, or when you want to make sure that the format of your data is absolutely fixed.

In [40]:
tup = (3,4,5)
arr = [6,7,8]
print(tup, "vs", arr) # note parentheses vs square brackets

(3, 4, 5) vs [6, 7, 8]


In [39]:
tup[0] = 3 # this won't work

TypeError: 'tuple' object does not support item assignment

In [42]:
# you can cast lists to tuples and vice versa
print(tup, 'converts to', list(tup))
print(arr, 'converts to', tuple(arr))

(3, 4, 5) converts to [3, 4, 5]
[6, 7, 8] converts to (6, 7, 8)
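
Two things you will do with tuples often are unpacking them into separate variables and building a new tuple when you need a changed copy. A minimal sketch, using a made-up record for illustration:

```python
# Hypothetical read-only record: (year, cheese in pounds, bedsheet deaths)
row = (2000, 29.8, 327)

# Unpacking assigns each element to its own variable
year, cheese, deaths = row
print(year, cheese, deaths)  # 2000 29.8 327

# A tuple can't be changed in place, but you can build a new one from its pieces
updated = row[:2] + (330,)   # note the trailing comma for a one-element tuple
print(updated)               # (2000, 29.8, 330)
print(row)                   # (2000, 29.8, 327)  (the original is untouched)
```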


A **set** is also like a list, except that it only contains one of each object. Furthermore, it is **not ordered**. This means that you can't index into it. However, its real benefit is that it's fast to check whether or not something exists in it, and in general it works faster than lists for checking membership.

In [206]:
x_set = set(['a', 'b', 'd', 'e'])
print(x_set) # note: not sorted

{'d', 'e', 'a', 'b'}


In [44]:
x_set[0] # won't work

TypeError: 'set' object does not support indexing

In [207]:
for c in ['a', 'b', 'c', 'd', 'e', 'f']:
    print(c, c in x_set)

a True
b True
c False
d True
e True
f False


You can add and remove elements from sets.

In [209]:
x_set.add('c')
print('c in set', x_set)
x_set.add('c') # add it again # doesn't do anything
print('c still in set', x_set)
x_set.remove('c') 
print('c no longer in set', x_set)

c in set {'c', 'b', 'd', 'a', 'e'}
c still in set {'c', 'b', 'd', 'a', 'e'}
c no longer in set {'b', 'd', 'a', 'e'}


Finally, if you want to convert back into a list, you can simply do so as follows:

In [210]:
print(x_set)
x_list = list(x_set)
print(x_list)

{'b', 'd', 'a', 'e'}
['b', 'd', 'a', 'e']


It is useful to convert and sort the list in one line:

In [211]:
x_list_sorted = sorted(list(x_set))
print(x_list_sorted)

['a', 'b', 'd', 'e']
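
A common use of sets is collecting the distinct values seen while looping over data. Here is a minimal sketch of that pattern on a made-up list of words:

```python
# Hypothetical data: a sentence with repeated words
words = 'the quick brown fox jumps over the lazy dog the end'.split()

unique_words = set()
for w in words:
    unique_words.add(w)   # adding a word that is already present does nothing

print(len(words))         # 11
print(len(unique_words))  # 9
print(sorted(unique_words))
```

```sorted()``` accepts a set directly and returns a sorted list, so the separate ```list(...)``` conversion is optional.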


#### Programming exercise

In the file ```csvs/cheese_everything.csv```:
1. Print out the number of years of data.
1. Print out a sorted list of the years.
1. Print out the two different categories that are compared to cheese.

Create two sets: one for years, and one for categories, and update them as you go through each line of the csv.

In [None]:
cheese_fpath = os.path.join('csvs', 'cheese_everything.csv')
# your code here

## 4. Dictionaries
Finally, the most important data structure for us is a dictionary. A **dictionary** is a lookup table that maps a unique key to a value.

There are two ways to make a dictionary if you know what the values are ahead of time:
1. ```d = {key1: value1, key2: value2}```
1. ```d = dict([(key1, value1), (key2, value2)]) # tuple or list will work```

In [63]:
key1, value1 = 'key1', 'value1'
key2, value2 = 'key2', 'value2'
d = {key1: value1, key2: value2}
print(d)

{'key1': 'value1', 'key2': 'value2'}


In [65]:
d = dict([(key1, value1), (key2, value2)])
print(d)

{'key1': 'value1', 'key2': 'value2'}


You can also manually set each value of a dictionary as follows:

In [220]:
d = {} # make an empty dictionary
d[key1] = value1
d[key2] = value2
print(d)

{'key1': 'value1', 'key2': 'value2'}


If you know a key, you can get the value by simply asking:

In [222]:
print('key1 maps to', d['key1'])
print('key2 maps to', d['key2'])
print('key3 maps to', d['key3']) # this won't work

key1 maps to value1
key2 maps to value2


KeyError: 'key3'
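
If you are not sure whether a key exists, there are two standard ways to avoid the ```KeyError```: check with ```in``` first, or use the dictionary's ```get()``` method, which returns a default value instead of raising an error. A minimal sketch:

```python
d = {'key1': 'value1', 'key2': 'value2'}

# Option 1: check membership before looking up
if 'key3' in d:
    print(d['key3'])
else:
    print('key3 is not in the dictionary')

# Option 2: get() returns the default when the key is missing
print(d.get('key1', 'missing'))  # value1
print(d.get('key3', 'missing'))  # missing
print(d.get('key3'))             # None (the default when no fallback is given)
```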

It's important to remember that **keys are unique**. In other words, only one value can be mapped to a key. However, values don't have to be unique, because they're just the result of a lookup.

In [217]:
d = dict([('a', 1), ('b', 1), ('a', 'A'), ('c', 1)])
print(d) # 'a' will get mapped to the latest value

{'a': 'A', 'b': 1, 'c': 1}


We can grab all keys of a dictionary with ```d.keys()```. This returns a view of the keys that is **not sorted**: the keys come back in the order they were inserted (Python 3.7+), not in alphabetical or numerical order.

In [223]:
print(d.keys())

dict_keys(['key1', 'key2'])


Also, we can iterate in a few ways over our dictionary.

In [232]:
fancy_dict = dict([(chr(ord('a') + i), ord('a') + i) for i in range(26)])
for k in fancy_dict: # automatically asks for the keys
    print(k, '-> ascii ->', fancy_dict[k])

a -> ascii -> 97
b -> ascii -> 98
c -> ascii -> 99
d -> ascii -> 100
e -> ascii -> 101
f -> ascii -> 102
g -> ascii -> 103
h -> ascii -> 104
i -> ascii -> 105
j -> ascii -> 106
k -> ascii -> 107
l -> ascii -> 108
m -> ascii -> 109
n -> ascii -> 110
o -> ascii -> 111
p -> ascii -> 112
q -> ascii -> 113
r -> ascii -> 114
s -> ascii -> 115
t -> ascii -> 116
u -> ascii -> 117
v -> ascii -> 118
w -> ascii -> 119
x -> ascii -> 120
y -> ascii -> 121
z -> ascii -> 122


In [234]:
for k, v in fancy_dict.items():
    print(k, '-> ascii ->', v) # keys and values

a -> ascii -> 97
b -> ascii -> 98
c -> ascii -> 99
d -> ascii -> 100
e -> ascii -> 101
f -> ascii -> 102
g -> ascii -> 103
h -> ascii -> 104
i -> ascii -> 105
j -> ascii -> 106
k -> ascii -> 107
l -> ascii -> 108
m -> ascii -> 109
n -> ascii -> 110
o -> ascii -> 111
p -> ascii -> 112
q -> ascii -> 113
r -> ascii -> 114
s -> ascii -> 115
t -> ascii -> 116
u -> ascii -> 117
v -> ascii -> 118
w -> ascii -> 119
x -> ascii -> 120
y -> ascii -> 121
z -> ascii -> 122


In [236]:
# makes sure that you have sorted keys in your loop
for k in sorted(fancy_dict):
    print(k, '-> ascii ->', fancy_dict[k])

a -> ascii -> 97
b -> ascii -> 98
c -> ascii -> 99
d -> ascii -> 100
e -> ascii -> 101
f -> ascii -> 102
g -> ascii -> 103
h -> ascii -> 104
i -> ascii -> 105
j -> ascii -> 106
k -> ascii -> 107
l -> ascii -> 108
m -> ascii -> 109
n -> ascii -> 110
o -> ascii -> 111
p -> ascii -> 112
q -> ascii -> 113
r -> ascii -> 114
s -> ascii -> 115
t -> ascii -> 116
u -> ascii -> 117
v -> ascii -> 118
w -> ascii -> 119
x -> ascii -> 120
y -> ascii -> 121
z -> ascii -> 122


**Important tip**: When we add items to dictionaries, we can check if a key exists in a dictionary with ```key in fancy_dict``` or ```key not in fancy_dict```, and then do things as needed. For example:

In [270]:
char_counts = ['a']*5 + ['b']*4 + ['a']*2 + ['d']*2 + ['e']*3
char_dict = {}
for char in char_counts:
    if char not in char_dict:
        # first add it in
        char_dict[char] = 0 # no counts seen yet
    char_dict[char] += 1 # we have definitely seen it once, so increment the count
print(char_dict)

{'a': 7, 'b': 4, 'd': 2, 'e': 3}
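
The same counting pattern can be written more compactly. The sketch below shows two alternatives, neither of which the rest of this notebook depends on: ```dict.get()``` with a default of 0, and ```collections.Counter``` from the standard library. The variable names are made up for illustration.

```python
from collections import Counter

chars = ['a']*5 + ['b']*4 + ['a']*2 + ['d']*2 + ['e']*3

# Alternative 1: get() supplies 0 the first time a character is seen
counts_get = {}
for char in chars:
    counts_get[char] = counts_get.get(char, 0) + 1
print(counts_get)  # {'a': 7, 'b': 4, 'd': 2, 'e': 3}

# Alternative 2: Counter does the whole loop for you
counts_counter = Counter(chars)
print(dict(counts_counter) == counts_get)  # True
print(counts_counter.most_common(1))       # [('a', 7)]
```

The explicit ```if char not in char_dict``` version is still worth knowing, because the same check works when the value is something other than a count, such as a list you append to.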


Dictionaries are the most useful data structure for parsing CSV files because they allow us to create lookup tables and meaningful lists, instead of always just looking at a single column or a single row.

**IMPORTANT NOTE:** Remember that whenever we read in CSV files, ```csv.reader()``` doesn't cast any strings to our correct type. When we are creating our dictionaries from CSV files, we should look at our csv to make sure that we cast to the correct type. That way we can use our dictionaries to do math, as we'll see in the next section.

In [245]:
cheese_bed_dict = {}
with open(cheese_bedsheets_fpath, 'r') as f:
    header = f.readline() # skip the first row/title header
    print(header)
    for line in csv.reader(f):
        year_str, cheese_str, bed_str = line
        cheese_bed_dict[int(year_str)] = [float(cheese_str), int(bed_str)]
for k, v in cheese_bed_dict.items():
    print(k, v)

Years,Per capita consumption of cheese (USA) in Pounds (USDA),Number of people who died by becoming tangled in their bedsheets in Deaths (US) (CDC)

2000 [29.8, 327]
2001 [30.1, 456]
2002 [30.5, 509]
2003 [30.6, 497]
2004 [31.3, 596]
2005 [31.7, 573]
2006 [32.6, 661]
2007 [33.1, 741]
2008 [32.7, 809]
2009 [32.8, 717]


**Tip** After you create your dictionary, it's good to define some integer constants so that referencing into your dictionary makes sense. So instead of having to remember that the pounds of cheese is the first element and the bedsheet deaths is the second element, you can simply reference meaningful variable names. For example:

In [247]:
# define some useful constants
CHEESE_LBS, BEDSHEET_DEATHS = 0, 1

# this also does the same thing -- can you see why?
CHEESE_LBS, BEDSHEET_DEATHS = range(2)

"""
The following list is useful in case we need to print out headers.
If we need the cheese header, then we can just reference
CHEESE_BED_STRS[CHEESE_LBS] which will return 'cheese in pounds'.
This is useful when we plot axes and stuff later on.
"""
CHEESE_BED_STRS = ['cheese in pounds', 'bedsheet deaths']

#### Programming exercises

In the file ```csvs/cheese_everything.csv```, create a dictionary called ```cheese_dict``` that maps:
    year -> [cheese_in_pounds, golf_revenue, civil_eng_doctorates]
    
In other words, your keys should be integer years, and your values should be lists of three floats as specified above.

In [243]:
# your code here

In [244]:
# print out your dictionary to make sure it looks alright

## 5. Basic NumPy

For the last section of this week, we are going to go over basics of NumPy, which stands for Numerical Python. This is a math library that has a whole bunch of really useful mathematical functions, and is mainly built on arrays and matrices. However, for this week, we are just going to focus on the easy statistics that NumPy gives us.

First, let's import ```numpy```. The following line means that instead of typing ```numpy.<sthg>``` every time you want to call something from the ```numpy``` library, you can just type ```np.<sthg>```.

In [241]:
import numpy as np

Recall what our ```cheese_bed_dict``` looked like. Let's use the constants to help us out.

In [250]:
print('year -> [{}, {}]'.format(
        CHEESE_BED_STRS[CHEESE_LBS],
        CHEESE_BED_STRS[BEDSHEET_DEATHS]))
for k, v in cheese_bed_dict.items():
    # converted all to floats and ints
    print('{} -> [{}, {}]'.format(
         k,
         v[CHEESE_LBS],
         v[BEDSHEET_DEATHS]))

year -> [cheese in pounds, bedsheet deaths]
2000 -> [29.8, 327]
2001 -> [30.1, 456]
2002 -> [30.5, 509]
2003 -> [30.6, 497]
2004 -> [31.3, 596]
2005 -> [31.7, 573]
2006 -> [32.6, 661]
2007 -> [33.1, 741]
2008 -> [32.7, 809]
2009 -> [32.8, 717]


The following list comprehension then gives us a list containing all of our cheese values.

In [254]:
cheese_values = [v[CHEESE_LBS] for k, v in cheese_bed_dict.items()]
print(cheese_values) # might not be sorted

[29.8, 30.1, 30.5, 30.6, 31.3, 31.7, 32.6, 33.1, 32.7, 32.8]


We can also do this via a normal for loop. Let's do this for the bedsheet_deaths:

In [256]:
bedsheet_values = []
for k, v in cheese_bed_dict.items():
    bedsheet_values.append(v[BEDSHEET_DEATHS])
print(bedsheet_values) # might not be sorted

[327, 456, 509, 497, 596, 573, 661, 741, 809, 717]
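
The comments above note that the values might not come out sorted. Since Python 3.7, a dictionary iterates in the order its keys were inserted, so the lists above follow the order of the rows in the CSV file. If you want the values in year order no matter how the dictionary was built, one option is to loop over the sorted keys. This is a minimal sketch; `sample_dict` is a hypothetical, deliberately out-of-order stand-in for `cheese_bed_dict`.

```python
BEDSHEET_DEATHS = 1

# hypothetical excerpt, inserted out of year order on purpose
sample_dict = {2002: [30.5, 509], 2000: [29.8, 327], 2001: [30.1, 456]}

# sorted() on a dictionary returns its keys in ascending order
years = sorted(sample_dict)
bedsheet_sorted = [sample_dict[y][BEDSHEET_DEATHS] for y in years]

print(years)            # [2000, 2001, 2002]
print(bedsheet_sorted)  # [327, 456, 509]
```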


Okay, now that our lists are set up, we can use NumPy to get some basic statistics on them.
* ```np.mean(arr)``` # returns the mean of the list arr
* ```np.median(arr)``` # returns the median of the list arr
* ```np.std(arr)``` # returns standard deviation of the list arr
* ```np.amin(arr)``` # returns min of list arr. Same as min(arr)
* ```np.amax(arr)``` # returns max of list arr. Same as max(arr)
* ```np.argmax(arr)``` # returns the index of the maximum element of list arr. If there are multiple maximums, returns the lowest index.
* ```np.argmin(arr)``` # returns the index of the minimum element of list arr. If there are multiple minimums, returns the lowest index.
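
To see the functions listed above side by side, here is a quick sketch on a made-up list (the values in `arr` are invented for illustration and are not from the cheese data). The list contains ties for both the maximum and the minimum, so it also shows the lowest-index behavior of `np.argmax` and `np.argmin`.

```python
import numpy as np

# made-up values with a repeated maximum (7) and a repeated minimum (1)
arr = [3, 7, 1, 7, 1]

mean_val = np.mean(arr)      # (3 + 7 + 1 + 7 + 1) / 5
median_val = np.median(arr)  # middle value of the sorted list [1, 1, 3, 7, 7]
min_val = np.amin(arr)
max_val = np.amax(arr)
max_ind = np.argmax(arr)     # first position holding the maximum
min_ind = np.argmin(arr)     # first position holding the minimum

print(mean_val, median_val)  # 3.8 3.0
print(min_val, max_val)      # 1 7
print(max_ind, min_ind)      # 1 2
```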

NumPy has extensive online documentation that you will most likely use several times during this course. For example, here is the ```np.std()``` documentation [page](https://docs.scipy.org/doc/numpy/reference/generated/numpy.std.html).

In [257]:
print(CHEESE_BED_STRS[CHEESE_LBS],
      'max', np.amax(cheese_values),
      'min', np.amin(cheese_values))

cheese in pounds max 33.1 min 29.8


In [260]:
print('{} statistics: mean {} (stdev {}), median {}'.format(
    CHEESE_BED_STRS[BEDSHEET_DEATHS],
    np.mean(bedsheet_values),
    np.std(bedsheet_values),
    np.median(bedsheet_values)))

bedsheet deaths statistics: mean 588.6 (stdev 139.48921105232478), median 584.5
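
One detail worth knowing about the standard deviation above: by default, `np.std` divides by the number of values `n` (the population standard deviation). Passing `ddof=1` makes it divide by `n - 1` instead, which gives the sample standard deviation you may know from a statistics class. The sketch below compares the two, with the bedsheet values copied from the output earlier in this section.

```python
import numpy as np

# bedsheet deaths per year, copied from the printed list above
bedsheet_values = [327, 456, 509, 497, 596, 573, 661, 741, 809, 717]

# default: ddof=0, divides the summed squared deviations by n
population_std = np.std(bedsheet_values)

# ddof=1 divides by n - 1 instead, so the result is slightly larger
sample_std = np.std(bedsheet_values, ddof=1)

print('population stdev:', population_std)
print('sample stdev:', sample_std)
```

Which one you want depends on whether your list is the whole population or a sample drawn from it. This notebook uses the default throughout.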


In [269]:
# argmax and argmin return positions in cheese_values,
# so we keep a parallel list of years to translate a position back to a key
year_keys = list(cheese_bed_dict.keys())
max_ind = np.argmax(cheese_values)
maximal_year = year_keys[max_ind]
print('maximal year was {} with cheese amount {}'.format(
    maximal_year, cheese_bed_dict[maximal_year][CHEESE_LBS]))
min_ind = np.argmin(cheese_values)
minimal_year = year_keys[min_ind]
print('minimal year was {} with cheese amount {}'.format(
    minimal_year, cheese_bed_dict[minimal_year][CHEESE_LBS]))
print('verifying with original dictionary', cheese_bed_dict)

maximal year was 2007 with cheese amount 33.1
minimal year was 2000 with cheese amount 29.8
verifying with original dictionary {2000: [29.8, 327], 2001: [30.1, 456], 2002: [30.5, 509], 2003: [30.6, 497], 2004: [31.3, 596], 2005: [31.7, 573], 2006: [32.6, 661], 2007: [33.1, 741], 2008: [32.7, 809], 2009: [32.8, 717]}
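
As a cross-check on the argmax/argmin approach, the same question can be answered in plain Python by giving the built-in `max` and `min` a `key` function. This is only an alternative sketch, not a replacement for the NumPy version; `sample_dict` is a hypothetical three-row stand-in for `cheese_bed_dict`, with values copied from the output above.

```python
CHEESE_LBS = 0

# hypothetical excerpt standing in for cheese_bed_dict
sample_dict = {2006: [32.6, 661], 2007: [33.1, 741], 2008: [32.7, 809]}

# max/min iterate over the keys (years) and compare them by their cheese value
max_year = max(sample_dict, key=lambda year: sample_dict[year][CHEESE_LBS])
min_year = min(sample_dict, key=lambda year: sample_dict[year][CHEESE_LBS])

print('maximal year was {} with cheese amount {}'.format(
    max_year, sample_dict[max_year][CHEESE_LBS]))
print('minimal year was {} with cheese amount {}'.format(
    min_year, sample_dict[min_year][CHEESE_LBS]))
```

This avoids keeping a separate list of keys, while the NumPy version is handy when you already have the values in a list or array.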


## 6. Homework

* Problem 1:
    * ```problem_01.ipynb``` - contains problem descriptions and test cases
    * ```problem_01_fn.py``` - where you write your code
* Problem 2: 
    * ```problem_02.ipynb``` - contains problem descriptions and test cases
    * ```problem_02_fn.py``` - where you write your code
    