# Tutorial 4: Introduction to dictionaries and reading files

## Sources

All tutorials in this folder are adapted from Software Carpentry - Programming with Python (v5), licensed under the [CC Zero public domain waiver](https://creativecommons.org/publicdomain/zero/1.0/). Some instructions have been removed and others added to fit the Data Analytics course given at Uppsala University, Campus Gotland.


Dictionaries are data structures consisting of a collection of *key-value pairs* that is not sorted by key (since Python 3.7, dictionaries keep their items in insertion order). Some languages refer to dictionaries as *associative arrays*. The basic idea is that you reference items by their key rather than by an index location (as you would in a NumPy ndarray or a list). In that sense they behave like lookup tables.

In [None]:
# Create a dictionary with three items in it. Note
# the curly braces used when **constructing** a dictionary.

params = {"parameter1" : 1.0,
          "parameter2" : 2.0,
          "parameter3" : 3.0}

print(type(params))
print(params)

To reference a specific value, just use its key. Notice the square brackets when **referencing** dictionary elements.

In [None]:
params["parameter2"]

Modifying existing dictionary elements is easy.

In [None]:
# Look, we can even change data types on the fly and mix different data
# types together in the same dictionary. That's what dictionaries are great for.

params["parameter1"] = "A"
params["parameter2"] = "B"

# add a new entry
params["parameter4"] = "D"

print(params)

Dictionaries have numerous properties and methods. Let's explore a few.

In [None]:
# Does a certain key exist? The `in` operator checks the keys directly.

print("parameter1" in params)

"parameter5" in params

In [None]:
# iterate over the keys
for k in params.keys():
    print(k)

**IMPORTANT:** Keys are **not** guaranteed to be sorted in any way.
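
If you need the keys in a particular order, one option is to sort them explicitly with the built-in `sorted` function. The following is a minimal sketch using a small made-up dictionary (`example_params` and its keys are hypothetical, not part of the tutorial data):

```python
# Hypothetical dictionary, with keys deliberately inserted out of alphabetical order
example_params = {"gamma": 3.0, "alpha": 1.0, "beta": 2.0}

# sorted() returns a new list of the keys in ascending order;
# the dictionary itself is left unchanged
sorted_keys = sorted(example_params)

for k in sorted_keys:
    print(k, example_params[k])
```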

What if we want to list the key-value pairs in a dictionary? The `dict.items()` method does this. Technically it returns a *view* object rather than a list, but it is iterable and behaves much like a list of tuples. Let's see this.

In [None]:
print(params.items())

for item in params.items():
    print(item)
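
Because each item is a `(key, value)` tuple, it can also be unpacked directly in the `for` statement. A short sketch, using a hypothetical dictionary (`example_params`) shaped like `params` above:

```python
# Hypothetical dictionary mirroring the shape of `params` above
example_params = {"parameter1": "A", "parameter2": "B", "parameter3": 3.0}

# Each item is a (key, value) tuple, so it can be unpacked into two names
pairs = []
for key, value in example_params.items():
    pairs.append((key, value))
    print(key, "->", value)
```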

We'll see later that one use of dictionaries is that they can be passed (in a special way) to a function in which the key-value pairs are named function arguments and their respective values. We'll use this in web scraping too when we need to pass parameters along with a URL.
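
As a preview, the "special way" is the `**` unpacking operator. The sketch below assumes a made-up function, `describe_run`, and a made-up dictionary, `settings`; neither is part of the course material.

```python
# Hypothetical function used only for illustration
def describe_run(rate, steps, label="run"):
    return "{}: rate={}, steps={}".format(label, rate, steps)

# The keys must match the function's parameter names
settings = {"rate": 0.5, "steps": 10}

# The ** operator unpacks the dictionary into keyword arguments,
# so this call is equivalent to describe_run(rate=0.5, steps=10)
result = describe_run(**settings)
print(result)
```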

More on dictionaries
---------------------

In one of the examples we'll explore, we'll use a dictionary to hold counts of website hits per month, based on an Apache log file. The records look like:

    local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150
    local - - [24/Oct/1994:13:41:41 -0600] "GET 1.gif HTTP/1.0" 200 1210
    local - - [24/Oct/1994:13:43:13 -0600] "GET index.html HTTP/1.0" 200 3185
    local - - [24/Oct/1994:13:43:14 -0600] "GET 2.gif HTTP/1.0" 200 2555
    local - - [24/Oct/1994:13:43:15 -0600] "GET 3.gif HTTP/1.0" 200 36403
    local - - [24/Oct/1994:13:43:17 -0600] "GET 4.gif HTTP/1.0" 200 441
    
Each month is represented by a three character abbreviation. Let's say that our basic strategy is to:

* Create an empty dictionary called `monthly_counts`.
* Read a line and get the month into a variable, for example `month = 'Oct'`.
* Increment the count for that month via `monthly_counts[month] = monthly_counts[month] + 1`.

In [None]:
# Create an empty dictionary
monthly_counts = {}

Now, let's assume that the variable `month` has the value 'Oct'. What happens if we try to increment the dictionary value for that key?

In [None]:
month = 'Oct'
monthly_counts[month] = monthly_counts[month] + 1

Ah, that raises a `KeyError`. If we haven't added a key yet, we can't assume it starts out with a value of 0 (or anything else, for that matter). Of course, we could add a bunch of initialization lines such as `monthly_counts['Jan'] = 0`, `monthly_counts['Feb'] = 0`, and so on. However, there's another way of accessing a dictionary value: its `get` method. The beauty of `get` is its optional second parameter, which specifies the value to return if the key doesn't exist (without it, `get` returns `None` for a missing key).



In [None]:
print(monthly_counts.get('Oct', 0))

In [None]:
month = 'Oct'
monthly_counts[month] = monthly_counts.get(month, 0) + 1
print(monthly_counts['Oct'])
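
The same pattern works inside a loop. Here is a minimal sketch that counts a hypothetical list of month abbreviations (the `months` list is made up for illustration; in practice the values would come from the log lines):

```python
# Hypothetical list of month abbreviations, as if extracted from log lines
months = ["Oct", "Oct", "Nov", "Oct", "Dec", "Nov"]

month_totals = {}
for m in months:
    # get() returns 0 the first time a month is seen, so no KeyError is raised
    month_totals[m] = month_totals.get(m, 0) + 1

print(month_totals)
```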

Reading Files
-------------

Soon we'll use the `csv` module to do something similar, and when we learn Pandas we'll see functions like `read_csv`. However, you often need to read a text file line by line and do some scraping, parsing, or other transformation of the data. Here are a few of the basic ideas.

### Example 1: open file, read a line, strip, print, repeat until no more lines, close file

In [None]:
# Store input filename in variable. Include necessary path info.
in_filename = "data/apache-mini.log"

# Open the input file for reading. in_file is a "file object" or "file handle".
in_file = open(in_filename, 'r')

# Init counter to keep track of line numbers (not necessary but sometimes useful)
line_number = 0

# Loop through each line in the file. Check out the nice looping syntax for traversing a file.
for line in in_file:
    # The variable line contains the current line as one big string and includes things like
    # end of line characters. Also, to be clear, 'line' is a variable name we made up. We could have
    # called it 'peanutbutter' had we chosen to.
    
    # Let's strip off any end of line characters
    # After running this cell, let's comment out this line to see what happens.
    line = line.rstrip()
    
    # Increment the line counter
    line_number += 1
    
    # Print the line and line number. What do you think the ':6' is for? Hint: There are < 1 million rows.
    print( '{:6}: {}'.format(line_number, line) )
    
# After the loop is done, close the file
in_file.close()

In [None]:
# What is in_file?

type(in_file)

### Example 2: an alternate way of opening and closing

Now let's see a more "Pythonic" way of working with files: a `with` block, which closes the file automatically when the block ends.

In [None]:
# Store input filename in variable. Include necessary path info.
in_filename = "data/apache-mini.log"

# Init counter to keep track of line numbers (not necessary but sometimes useful)
line_number = 0

# Open the input file for reading using a `with` block
with open(in_filename, 'r') as in_file:
    # Loop through each line in the file. Check out the nice looping syntax for traversing a file.
    for line in in_file:
        # The variable line contains the current line as one big string and includes things like
        # end of line characters. 
        
        # Let's strip off any end of line characters
        line = line.rstrip()
        
        # Increment the line counter
        line_number += 1
        
        # Print the line and line number
        print( '{:6}: {}'.format(line_number, line) )
                
# After the loop is done, there is no need to close the file. It's already been
# closed for you. :) To see that:

if in_file.closed:
    print("\nFile already closed.")
else:
    print("\nFile NOT closed yet")
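
The manual line counter can also be replaced by the built-in `enumerate` function. The sketch below is one possible variant, not the approach used in the rest of this tutorial. To keep it runnable without `data/apache-mini.log`, it uses an `io.StringIO` object as a stand-in for the open file; the names `sample_file`, `handle` and `numbered_lines` are made up for illustration.

```python
import io

# Stand-in for an open file so the sketch runs without data/apache-mini.log;
# the two lines are copied from the sample records shown earlier
sample_file = io.StringIO(
    'local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150\n'
    'local - - [24/Oct/1994:13:41:41 -0600] "GET 1.gif HTTP/1.0" 200 1210\n'
)

numbered_lines = []
with sample_file as handle:
    # enumerate(..., start=1) yields (number, text) pairs, counting from 1
    for number, text in enumerate(handle, start=1):
        numbered_lines.append('{:6}: {}'.format(number, text.rstrip()))

for numbered in numbered_lines:
    print(numbered)
```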


### Example 3: splitting lines into a list

One common thing you might want to do when reading a formatted text file is to split each line on a delimiter such as a comma, tab, or space. Let's split the Apache log on spaces, so that each line becomes a list. We'll store each of these lists in a master list. Sometimes this is exactly what you need to get lines ready for import into something like a Pandas DataFrame.

We can always use more powerful tools like [regex](http://regexr.com/) to do this job. And of course, Python supports regex. We'll see this a little later.
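
As a small taste, here is a sketch that uses Python's `re` module to pull the day, month and year out of one log record. The pattern (`date_pattern`) is an illustrative assumption that matches the sample records shown earlier; it is not a general-purpose Apache log parser.

```python
import re

# One record copied from the sample log lines shown earlier
sample = 'local - - [24/Oct/1994:13:41:41 -0600] "GET index.html HTTP/1.0" 200 150'

# Hypothetical pattern: capture day, month and year from the bracketed timestamp
date_pattern = re.compile(r'\[(\d{2})/([A-Za-z]{3})/(\d{4}):')

match = date_pattern.search(sample)
if match:
    day, month_abbr, year = match.groups()
    print(day, month_abbr, year)
```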

In [None]:
# Store input filename in variable. Include necessary path info.
in_filename = "data/apache-mini.log"

# Init counter to keep track of line numbers (not necessary but sometimes useful)
line_number = 0

# Create empty list
loglines = []

# Open the input file for reading
with open(in_filename, 'r') as in_file:
    # Loop through each line in the file. Check out the nice looping syntax for traversing a file.
    for line in in_file:
        
        # Let's strip off any end of line characters
        line = line.rstrip()
        
        # Before we split on the spaces, let's get rid of the brackets around the date
        line = line.replace('[', '')
        line = line.replace(']', '')
        
        # Now split the line using space as our delimiter
        logline_list = line.split(' ')
        
        # Append the logline list to the master list
        loglines.append(logline_list)
        
        # Increment the line counter
        line_number += 1
        
# All done, print the list
print(loglines)


Well, not so pretty. Sometimes we need to "pretty print" with the [`pprint`](https://docs.python.org/3/library/pprint.html) module.

In [None]:
from pprint import pprint

In [None]:
pprint(loglines)

... or of course, we could iterate over the list and print a line at a time for finer control. 

In [None]:
for logline in loglines:
    print(logline)
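
Once each line is a list, the month can be pulled out of the timestamp field and counted with the `get` pattern from the dictionary section. The following is a sketch under the assumption that the timestamp is always the fourth field (index 3), as it is in the sample records. `sample_loglines` and `hits_per_month` are hypothetical names, and the two records are hard-coded so that the example runs without the log file.

```python
# Two records copied from the sample log lines shown earlier,
# already split on spaces the same way as in Example 3
sample_loglines = [
    ['local', '-', '-', '24/Oct/1994:13:41:41', '-0600',
     '"GET', 'index.html', 'HTTP/1.0"', '200', '150'],
    ['local', '-', '-', '24/Oct/1994:13:43:13', '-0600',
     '"GET', 'index.html', 'HTTP/1.0"', '200', '3185'],
]

hits_per_month = {}
for fields in sample_loglines:
    # Assumption: the timestamp is at index 3, e.g. '24/Oct/1994:13:41:41',
    # so splitting it on '/' puts the month abbreviation at position 1
    month_abbr = fields[3].split('/')[1]
    hits_per_month[month_abbr] = hits_per_month.get(month_abbr, 0) + 1

print(hits_per_month)
```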