# A gentle introduction to Python, File IO, Parallel Programming

## What is Python?

![Image of Python](https://www.andreabacciu.com/upload/2015/02/Python-Logo-PNG-Image.png)

Python is a high-level programming language created by Guido van Rossum and first released in 1991.

It is a popular language because it is easy to read, write, and understand. Python abstracts a lot of underlying computational details allowing the programmer to focus on the overall idea of what the program should do.

Python is used extensively in data analytics and machine learning due to the massive number of libraries supporting these endeavors.

## Lesson 1 - Basic Programming & Python Syntax

I did tell you that Python is easy to read, right? Let's put that to the test: what do you think this line of code does?

(Select the code snippet by clicking on it with your mouse and then hit "Run" on the menu above.)

In [1]:
print("Hello World!")

Hello World!


There are 2 things happening here:

    1) A function called "print" is called.
    2) The function is supplied with input data, that is, the text in quotations: "Hello World!"
    
As a result, Hello World! is printed to the screen. Go figure.

### Variables

In the previous coding example, you saw that a "function" could be supplied with data to produce a result. In programming, this "data" is usually stored in what we call "variables".

The code snippet above could easily be re-written as:

In [2]:
greeting = "Hello World!"

print(greeting)

Hello World!


So what is the difference? In this scenario, we store the text to be printed in a variable called "greeting". This variable is then passed into the function, print.

We can manipulate that piece of data now that we have stored it:

In [14]:
greeting = "Hello World."

message = greeting + " I am alive!"

print(message)

Hello World. I am alive!


### Functions

*Skippable!*

print() is a built-in function provided by default in Python.

In programming, we often find ourselves doing repetitive tasks. Functions can store the steps of these tasks, so that they can be executed via a single line of code. 

Python also gives us the ability to write our own functions:

In [15]:
def greeting(message):
    greeting = "Hello World."
    message = greeting + " " + message
    print(message)

message = "This is a function call."
greeting(message)

Hello World. This is a function call.


To declare a function, simply follow this format:

    def function_name(variable1, variable2):

Lines of code belonging to a function should be indented one level (four spaces by convention). For example:

    def function_name(variable1, variable2):
        this_is_a_part_of_the_function = True

    this_is_not_a_part_of_the_function = False

A function can even return a piece of data:

In [1]:
def make_greeting(message):
    greeting = "Hello World."
    message = greeting + " " + message
    return message

message = "This is a function call."
greeting = make_greeting(message)

print(greeting)

Hello World. This is a function call.
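
As a small sketch of the declaration format above with two inputs, here is a function that takes two variables and returns a result. The function name `make_greeting_for` and its parameter names are made up for this illustration.

```python
# Hypothetical example: a function with two inputs that returns a value.
def make_greeting_for(name, punctuation):
    # Build the text from both inputs and hand it back to the caller.
    return "Hello " + name + punctuation

# The returned text is stored in a variable before it is printed.
result = make_greeting_for("Toronto", "!")
print(result)
```

Because the function returns the text instead of printing it, the caller decides what to do with the result.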


### Data Types

Variables can be of different data types. 

In the previous coding examples, I have shown you the String datatype:
    
    String = "This is a string"
    
And the Boolean data type:
    
    is_it_raining = False
    are_we_hacking = True

There are many others to store different types of data:

    Integer = 1234
    3+3
    
    Float = 12.34
    2 * 12.34
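
If you are ever unsure which data type a variable holds, the built-in `type()` function will tell you. The sketch below uses made-up variable names and also shows `str()`, which converts a number to a String so that it can be joined to other text.

```python
# A list holding one value of each data type mentioned above.
values = ["This is a string", True, 1234, 12.34]

# Collect the name of each value's data type.
type_names = []
for value in values:
    type_names.append(type(value).__name__)
    print(value, "is of type", type(value).__name__)

# Arithmetic keeps the type you would expect: Integer + Integer is an Integer,
# while Integer * Float is a Float.
total = 3 + 3
product = 2 * 12.34

# Numbers must be converted with str() before being joined to a String.
label = "The total is " + str(total)
print(label)
```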

### Data structures

Variables alone can only do so much. By placing variables within data structures, we can begin to perform more interesting tasks.

For this Hackathon, we will only be looking at 1 data structure: Dictionaries.

#### Dictionaries

Dictionaries are known as "key-value" pairs. This means for each "key" there is a value associated with it.

Here I have pre-initialized a dictionary with some key value pairs:

    phone_dict = { "Police" : 911,
                  "City of Toronto Archives": 4163970778,
                  "Pizza-Pizza" : 4169671111 }

This dictionary contains 3 key-value pairs. Each key is a String containing the name of a business. Each value is an Integer representing the phone number of the business. 

Note that values can be of any data type, and keys can be of any immutable type such as a String or an Integer. We could just as easily have made the phone number the key and the name of the business the value.

#### Common operations on dictionaries

To add/edit a key-value pair inside a dictionary:

    phone_dict = {}

    # to add    
    phone_dict["Dominos"] = 4166407777
    
    # now to edit an existing value
    phone_dict["Dominos"] = 1234567890
    
To access the value of a key:

    print(phone_dict["Police"])
    
And finally to check if a key is in a dictionary:

    if "Pizza-Pizza" in phone_dict:
        # str() converts the Integer phone number so it can be joined to the text
        print("This is Pizza Pizza's number: " + str(phone_dict["Pizza-Pizza"]))
    else:
        print("We don't have the number")

Note that if you try to get the value for a key which doesn't exist in the dictionary, Python raises an error (a KeyError).
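
Here is a minimal sketch of what a missing key looks like in practice, and of the dictionary method `get()`, which returns a fallback value instead of raising an error. The shortened dictionary and the fallback value `0` are chosen just for this illustration.

```python
phone_dict = {"Police": 911, "Pizza-Pizza": 4169671111}

# Looking up a key that is not in the dictionary raises a KeyError.
try:
    print(phone_dict["Dominos"])
except KeyError:
    print("No entry for Dominos")

# get() returns None when the key is missing...
number = phone_dict.get("Dominos")

# ...or a fallback value of your choice, given as the second input.
fallback = phone_dict.get("Dominos", 0)

print(number, fallback)
```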

## Lesson 2 - File I/O (Input & Output)

As a member of the archive team, you are aware that there are thousands of individual files containing important information. Using Python you can access the text within these files and manipulate, sort and transform this data.

For this section, we will only be looking at how to access data within a plain-text file (.txt). Matthew will show you how to access other file types (like PDFs) in a future section.

### Opening and reading from a file

You can open a file in Python using a single line:

In [1]:
with open("~$HF 2010-11 annual report final from MOHLTC.txt") as f:
    print('We opened it')

We opened it


In line 1, we open the file and store a reference to it in a variable called 'f'.

To begin reading this file, we need to operate on this 'f'. So, we write:


In [5]:
with open("~$HF 2010-11 annual report final from MOHLTC.txt") as f:
    data = f.read()
    print(data)

Since I assumed this position five years ago, much has changed at the Foundation:  the Foundation moved offices from College Street to Jarvis Street; Ms. Sandy Hengeveld and Ms. Emmanuelle Fontaine joined the staff; the Foundation moved from a paper-based system to a web-based system for applications and reports; Drs. Keith Jarvie and Vivian Rakoff began and completed their terms as, respectively, Chair and Vice-Chair;  six new members have joined the Board of the Foundation – Drs. Mary Seeman, Gregory Brown, George Tolomizencko, Mr. Herman Gill, Ms. Clare Sullivan, and Ms. Jeanette Lewis; Drs. William Avison and Harriet MacMillan were appointed Chair and Vice-Chair of the Foundation, and the membership of both the Grants and Fellowships Committees has changed considerably  since 2005. The Foundation has, however, continued – as it has for nearly 45 years – to support excellent research to promote the mental health of Ontarians, to prevent mental illness, and to improve diagnosis, trea

Here we store all the text within the file in a String variable called data. When we print data, we print the contents of the entire file to the screen.

And it only took 3 lines of code. Nice.

### Breaking the data

What if you wanted to access the words in the data? We can do this by using split(' ').

    with open("~$HF 2010-11 annual report final from MOHLTC.txt") as f:
        for line in f:
            words = line.split(' ')
            for word in words:
                print('One word within the file: ' + word)
                
In [None]:
with open("~$HF 2010-11 annual report final from MOHLTC.txt") as f:
    for line in f:
        words = line.split(' ')
        print(words)

In many written languages, words are separated by whitespace. split(' ') splits the line at each space and stores the blocks of text in between in a List.
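
One detail worth knowing: when two spaces appear in a row, split(' ') produces an empty "word" between them. Calling split() with no input splits on any run of whitespace instead. The sketch below uses a short sample line written for this illustration rather than the real file.

```python
# A sample line with a double space after the colon.
line = "much has changed at the Foundation:  the Foundation moved"

# Splitting on a single space leaves an empty string where the double space was.
with_space = line.split(' ')

# Splitting with no input treats any run of whitespace as one separator.
without_args = line.split()

print(len(with_space), with_space)
print(len(without_args), without_args)
```

This matters for word counts, because every empty string produced by split(' ') is counted as a word.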

We can also go through each element within a List by doing this:


In [7]:
with open("~$HF 2010-11 annual report final from MOHLTC.txt") as f:
    for line in f:
        words = line.split(' ')
        for word in words:
            print('One word within the file: ' + word)


One word within the file: Since
One word within the file: I
One word within the file: assumed
One word within the file: this
One word within the file: position
One word within the file: five
One word within the file: years
One word within the file: ago,
One word within the file: much
One word within the file: has
One word within the file: changed
One word within the file: at
One word within the file: the
One word within the file: Foundation:
One word within the file: 
One word within the file: the
One word within the file: Foundation
One word within the file: moved
One word within the file: offices
One word within the file: from
One word within the file: College
One word within the file: Street
One word within the file: to
One word within the file: Jarvis
One word within the file: Street;
One word within the file: Ms.
One word within the file: Sandy
One word within the file: Hengeveld
One word within the file: and
One word within the file: Ms.
One word within the file: Emmanuelle
One w

(*Hint*)
Something interesting that you can do now is count how many words there are in a file:

In [6]:
word_count = 0
    
with open("~$HF 2010-11 annual report final from MOHLTC.txt") as f:
    for line in f:
        words = line.split(' ')
        for word in words:
            word_count = word_count + 1
    
print(word_count)

643


### Writing to files

*Skippable!*

It is possible to write lines of text to a file. The code is similar to the code for reading:

    with open("filename.txt", "w") as f:
       f.write("I love programming in Python.")

Note a few differences:
    1. We open "filename.txt" in 'w' mode (write mode). If the file already exists, its contents are overwritten.
    2. We still store a reference to this file in a variable called 'f'.
    3. We can now use the 'write' operation. If you were to open a file in read mode and run this line of code, Python would raise an error and your script would stop.
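
To see these points in action, here is a small sketch that writes a file, reads it back, and then tries to write in read mode. It uses a temporary folder and a made-up file name, "example.txt", so that it cannot overwrite any of your real files.

```python
import io
import os
import tempfile

# Everything happens inside a temporary folder that is deleted afterwards.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")

    # Open in write mode and write two lines of text.
    with open(path, "w") as f:
        f.write("I love programming in Python.\n")
        f.write("I also love archives.\n")

    # Open again (read mode is the default) and read the lines back.
    with open(path) as f:
        lines = f.read().splitlines()

    # Writing to a file opened in read mode raises an error.
    read_mode_write_failed = False
    try:
        with open(path) as f:
            f.write("This will not work.")
    except io.UnsupportedOperation:
        read_mode_write_failed = True

print(lines)
print(read_mode_write_failed)
```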

Writing to files is important, but is not as relevant to this Hackathon. 

### Programming challenge #1

Section requirements:
    1. Data structures
    2. Opening and reading a file
    3. Breaking down each line

Now that you can read the contents of a file and store data inside a data structure, can you do both by storing the count of each word in a Dictionary?

(*Hint*) Your program will first need to have a completely empty dictionary. This can be initialized by using:

    word_count_dict = {}

## Lesson 3 - Parallel Programming

In the world of big data analytics, it is not unusual for a computational job to take a single computer several days to complete. A common way to lower a program's run time is to distribute portions of the job across many computers running simultaneously.

The field of using simultaneous computational processes is known as parallel computing.

### Data Parallelism

![Image of Data Parallelism](https://computing.llnl.gov/tutorials/parallel_comp/images/domain_decomp.gif)

Data parallelism is one of the best-known methods to achieve parallel computing. In this method, the input data set is divided equally and sent to multiple processes.

Here is an analogy. Let's say that I want to make an entire bag of flour into pizza dough. If I were to do it alone, it would take me a while, let's say 4 hours. If 3 of my buddies joined in and we all made dough at the same time, it would take us 4 / 4 = 1 hour. Each of us would take 1/4 of the flour to make the dough.

Data parallelism works in a similar way. Each "worker" process performs exactly the same task, but is given a different set of inputs to work on.
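
Before involving any real processes, the "divide the flour" step can be sketched on its own. The helper `split_into_chunks` and the sample data are made up for this illustration; the helper only decides which items each worker would receive and does not run anything in parallel.

```python
def split_into_chunks(items, number_of_workers):
    # Ceiling division so that no item is left over; at least 1 so the
    # range() step below is never zero.
    chunk_size = max(1, -(-len(items) // number_of_workers))
    # Slice the list into consecutive chunks of that size.
    return [items[start:start + chunk_size]
            for start in range(0, len(items), chunk_size)]

# Ten cups of flour shared between four workers.
flour = ["cup %i" % n for n in range(10)]
shares = split_into_chunks(flour, 4)

for worker_id, share in enumerate(shares):
    print("Worker %i gets %i cups" % (worker_id, len(share)))
```

When the data does not divide evenly, the last worker simply gets a smaller share.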

In [1]:
#### Data parallelism coding example

import multiprocessing as mp
import string

output = mp.Queue()

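def load_data(file):
    with open(file) as f:
        data = f.read()
    # Note: strip() only trims characters from the two ends of the text,
    # so punctuation attached to words inside the text is kept.
    for character in string.punctuation:
        data = data.strip(character)
    return data.split(' ')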

def count(output, data, begin, end):
    unique_words = {}
    for word in data[int(begin):int(end)]:
        if word in unique_words:
            unique_words[word] = unique_words[word] + 1
        else:
            unique_words[word] = 1
    output.put(unique_words)

if __name__ == "__main__":
    data = load_data("~$HF 2010-11 annual report final from MOHLTC.txt")
    number_of_words_in_data = len(data)
    set_size = number_of_words_in_data / 4
        
    processes = []
    
    # Specify which input data each "worker" process gets
    for x in range(4):
        begin = set_size * x
        end = min(set_size * (x+1), number_of_words_in_data)
        p = mp.Process(target=count, args=(output, data, begin, end))
        
        processes.append(p)
        
        print('Adding process %i, which runs on the word # %i - %i.' % (x, begin, end))
        
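    # Run processes
    for p in processes:
        p.start()

    # Get process results before joining: a process that has put data on a
    # Queue may not exit until that data has been read
    results = [output.get() for p in processes]

    # Wait for the completed processes to exit
    for p in processes:
        p.join()

    # Print the results to the screen. Results arrive in the order the
    # processes finish, which is not necessarily the order they were started in
    for process_id, result in enumerate(results):
        print("Process %i found the following word counts in its section:" % process_id)
        print(str(result))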
    

Adding process 0, which runs on the word # 0 - 155.
Adding process 1, which runs on the word # 155 - 311.
Adding process 2, which runs on the word # 311 - 467.
Adding process 3, which runs on the word # 467 - 623.
4
Process 0 found the following word counts in its section:
{'Since': 1, 'I': 1, 'assumed': 1, 'this': 1, 'position': 1, 'five': 1, 'years': 2, 'ago,': 1, 'much': 1, 'has': 3, 'changed': 2, 'at': 1, 'the': 10, 'Foundation:': 1, '': 3, 'Foundation': 4, 'moved': 2, 'offices': 1, 'from': 2, 'College': 1, 'Street': 1, 'to': 5, 'Jarvis': 1, 'Street;': 1, 'Ms.': 4, 'Sandy': 1, 'Hengeveld': 1, 'and': 11, 'Emmanuelle': 1, 'Fontaine': 1, 'joined': 2, 'staff;': 1, 'a': 2, 'paper-based': 1, 'system': 2, 'web-based': 1, 'for': 2, 'applications': 1, 'reports;': 1, 'Drs.': 3, 'Keith': 1, 'Jarvie': 1, 'Vivian': 1, 'Rakoff': 1, 'began': 1, 'completed': 1, 'their': 1, 'terms': 1, 'as,': 1, 'respectively,': 1, 'Chair': 2, 'Vice-Chair;': 1, 'six': 1, 'new': 1, 'members': 1, 'have': 1, 'Board': 

### A note

In reality, you will most likely be using tools which already implement parallel processing. In Shaikh's upcoming section, he will touch a little on Hadoop and how it can run on multiple computers.

### Programming challenge #2

This is one for experienced programmers.

In the above coding example, I showed you how to print the word counts for each section of a body of text. Can you modify the program to show word counts across the whole document? You will need to combine the results of each process and store them in a central dictionary.

*Hint*: To iterate across all key-value pairs within a dictionary:
    
    for key, value in dictionary.items():
        print(key, value)

## What's next?

Congratulations! You have finished the first section of this Hackathon. You did it. You are the programmer.

![Image of Hacker](https://ak1.picdn.net/shutterstock/videos/3365171/thumb/1.jpg)

In this section you touched on how to read and store data, and you did a basic word count analysis. You also touched on the topic of parallel computing.

In future sections, you will learn how to read data in from other file types (.pdf, .doc, .xls etc.) and how to perform advanced analysis using machine learning models.