Quantitative Methods 2:  Data Science and Visualisation
------------------------------

Week 1: Recapping Python
------------------------------

Using Python
------------

In this course, we'll make extensive use of *Python*, a programming language used widely in scientific computing and on the web. We will be using Python as a way to manipulate, plot and analyse data. This isn't a course about learning Python; it's about working with data - but we'll learn a little bit of programming along the way.

By now, you should have done the prerequisites for the module, and understand a bit about how Python is structured, what different commands do, and so on - this is a bit of a refresher to remind you of what we need at the beginning of term.

The particular flavour of Python we're using is *iPython*, which, as we've seen, allows us to combine text, code, images, equations and figures in a *Notebook*. This is a *cell*, written in *markdown* - a way of writing nice text. Contrast this with a *code* cell, which executes a bit of Python:

In [None]:
print(1+1)

The Notebook format allows you to engage in what Don Knuth describes as [Literate Programming](http://en.wikipedia.org/wiki/Literate_programming):

> […] Instead of writing code containing documentation, the literate programmer writes documentation containing code. No longer does the English commentary injected into a program have to be hidden in comment delimiters at the top of the file, or under procedure headings, or at the end of lines. Instead, it is wrenched into the daylight and made the main focus. The "program" then becomes primarily a document directed at humans, with the code being herded between "code delimiters" from where it can be extracted and shuffled out sideways to the language system by literate programming tools.
[Ross Williams][1]

[1]: http://www.literateprogramming.com/lpquotes.html

Libraries
---------

We will work with a number of *libraries*, which provide additional functions and techniques to help us to carry out our tasks.

These include:

*Pandas:* we'll use this a lot to slice and dice data

*matplotlib*: this is our basic graphing software, and we'll also use it for mapping

*nltk*: The Natural Language Tool Kit will help us work with text

We aren't doing all this to learn to program. We could spend a whole term learning how to use Python and never look at any data, maps, graphs, or visualisations. But we do need to understand a few basics to use Python for working with data. So let's revisit a few concepts that you should have covered in your prerequisites.

Variables
---------

Python can broadly be divided into verbs and nouns: things which *do* things, and things which *are* things. In Python, the verbs can be *commands*, *functions*, or *methods*. We won't worry too much about the distinction here - suffice it to say, they are the parts of code which manipulate data, calculate values, or show things on the screen.

The simplest proper noun object in Python is the *variable*. Variables are given names and store information. This can be, for example, numeric, text, or boolean (true/false). These are all statements setting up variables:

n = 1

t = "hi"

b = True

Now let's try this in code:

In [None]:
n = 1

t = "hi"

b = True

Note that each command is on a new line; other than that, the *syntax* of Python should be fairly clear. We're setting these variables equal to the letters and numbers and phrases and booleans. **What's a boolean?**
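To make the three kinds of value concrete, here is a small sketch (an addition, not part of the original exercise) that asks Python what type each variable holds using the built-in `type()` function, and shows that a comparison produces a boolean. The name `is_positive` is made up for illustration.

```python
# The same three variables as above
n = 1
t = "hi"
b = True

# type() reports what kind of value a variable holds
print(type(n).__name__)   # int
print(type(t).__name__)   # str
print(type(b).__name__)   # bool

# A boolean is either True or False; comparisons produce booleans
is_positive = n > 0       # hypothetical name, for illustration only
print(is_positive)        # True
```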

The value of this is that we now have values tied to these variables - so every time we want to use one, we can refer to the variable by name:

In [None]:
n

In [None]:
t

In [None]:
b

Because we've defined these variables in the early part of the notebook, we can use them later on.

***Advanced**: where do **classes** fit into this noun/verb picture of variables and commands?*

Where is my data?
-----------------

When we work in Excel and text editors, we're used to seeing the data onscreen - and if we manipulate the data in some way (averaging or summing up), we see both the inputs and outputs on screen. The big difference in working with Python is that we don't see our variables all of the time, or the effect we're having on them. They're there in the background, but it's usually worth checking in on them from time to time, to see whether our processes are doing what we think they're doing.

This is pretty easy to do - we can just type the variable name, or "print(*variable name*)":

In [None]:
n = n+1
print(n)
print(t)
print(b)

Flow
----

Python, in common with all programming languages, executes commands in a sequence - we might refer to this as the "ineluctable march of the machines", but it's more commonly referred to as the *flow* of the code (we'll use the word "code" a lot - it just means commands written in the programming language). In most cases, code just executes in the order it's written. This is true within each *cell* (each block of text in the notebook), and it's true when we execute the cells in order; that's why we can refer back to the variables we defined earlier:

In [None]:
print(n)

If we make a change to one of these variables, say n:

In [None]:
n = 3

and then re-run the cell above that prints n, you'll see that it now prints 3. So if we run cells out of order, the flow of the code becomes hard to follow. For this reason, try to write your code so it executes in order, one cell at a time. At least for the moment, this will make it easier to follow the logic of what you're doing to data.
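As a rough illustration of why order matters, the sketch below applies the same two steps to n in two different orders and ends up with two different answers. The names `first_order` and `second_order` are made up for this example.

```python
# Statements run top to bottom, so order changes the result
n = 1
n = n + 1      # n is now 2
n = n * 10     # n is now 20
first_order = n

# The same two steps in the opposite order
n = 1
n = n * 10     # n is now 10
n = n + 1      # n is now 11
second_order = n

print(first_order, second_order)   # 20 11
```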

*Advanced*: what happens to this flow when you write *functions* to automate common tasks? 

***Exercise - Setting up variables***:


1. Create a new cell. 

2. Create a variable "name" and assign your name to it. 

3. Create a variable "Python" and assign a score out of 10 to how much you like Python. 

4. Create a variable "prior"; if you've used Python before, assign True to it, otherwise assign False.

5. Print these out to the screen

Downloading Data
--------------------------

Let's fetch the data we will be using for this session. You can either upload the data to the Azure notebook by using the Data Menu above, or you can use the following cells to fetch the data directly from the QM2 server.

Let's create a folder in which we can store all our data for this session:

In [None]:
!mkdir data

In [None]:
!mkdir ./data/wk1
!curl https://s3.eu-west-2.amazonaws.com/qm2/wk1/data.csv -o ./data/wk1/data.csv
!curl https://s3.eu-west-2.amazonaws.com/qm2/wk1/sample_group.csv -o ./data/wk1/sample_group.csv
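The `!mkdir` lines above hand the work to the operating system's shell. If you would rather stay in Python, the folder can also be created with the standard library's `os.makedirs`. This is only a sketch, assuming Python 3; it uses a throwaway base folder so it can be run anywhere, whereas in the notebook the base would simply be ".".

```python
import os
import tempfile

# Use a throwaway base folder so the sketch cleans up after itself;
# in the notebook the base would simply be "."
with tempfile.TemporaryDirectory() as base:
    folder = os.path.join(base, "data", "wk1")

    # makedirs creates "data" and "wk1" in one step; exist_ok=True
    # means calling it a second time is not an error
    os.makedirs(folder, exist_ok=True)
    os.makedirs(folder, exist_ok=True)

    created = os.path.isdir(folder)

print(created)   # True
```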

Storing and importing data
--------------------------

Typically, data we look at won't be just one number, or one bit of text. Python has a lot of different ways of dealing with a bunch of numbers: for example, a list of values is called a **list**:

In [None]:
listy = [1,2,3,6,9]
print(listy)

A set of values *linked* to an index (or key) is called a **dictionary**; for example:

In [None]:
dicty = {'Bob': 1.2, 'Mike': 1.2, 'Coop': 1.1, 'Maddy': 1.3, 'Giant': 2.1}
print(dicty)

Notice that the list uses square brackets with values separated by commas, and the dict uses curly brackets with pairs separated by commas, and colons (:) to link a *key* (index or address) with a value.

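(Depending on your version of Python, you might notice that the dictionary entries haven't printed out in the order you entered them; from Python 3.7 onwards, dictionaries keep the order in which entries were added.)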
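To show how values are pulled back out of these two structures, here is a short sketch using made-up data (`scores` and `ages` are hypothetical, not the notebook's own variables). Lists are accessed by position, counting from zero, while dictionaries are accessed by key.

```python
# Hypothetical data, not the notebook's listy and dicty
scores = [7, 4, 9]
ages = {'Ann': 31, 'Raj': 27}

# Lists are indexed by position, counting from zero
print(scores[0])      # first element: 7

# Dictionaries are looked up by key rather than position
print(ages['Raj'])    # 27

# Both can be updated in place
scores.append(5)
ages['Lee'] = 45
print(len(scores), len(ages))   # 4 3
```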

***Advanced**: Print out 1) The third element of **listy**, and 2) The element of **dicty** relating to Giant*

We'll discuss different ways of organising data again soon, but for now we'll look at *dataframes* - the way our data-friendly *library* **Pandas** works with data. We'll be using Pandas a lot this term, so it's good to get started with it early.

Let's start by importing pandas. We'll also import another library, but we're not going to worry about that too much at the moment.  

If you see a warning about 'Building Font Cache' don't worry - this is normal.

In [None]:
import pandas

import matplotlib
%matplotlib inline

Let's import a simple dataset and show it in pandas. We'll use a pre-prepared ".csv" file, which needs to be in the "data/wk1" folder we created above.

In [None]:
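# Note: DataFrame.from_csv was removed in pandas 1.0; on newer versions the equivalent is
# pandas.read_csv('./data/wk1/data.csv', index_col=0, parse_dates=True)
data = pandas.DataFrame.from_csv('./data/wk1/data.csv')
data.head()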

What we've done here is read a .csv file into a **dataframe**, the object pandas uses to work with data, and one that has lots of methods for slicing and dicing data, as we will see over the coming weeks. The **head()** method tells iPython to show the first few rows of the data, so we can start to get a sense of what the data looks like and what sort of objects it represents.
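If you want to experiment without touching the course files, the sketch below builds a tiny dataframe from made-up CSV text held in memory. Note the assumptions: the names and numbers are invented, and it uses `pandas.read_csv` rather than `DataFrame.from_csv`, since the latter is not available in recent versions of pandas.

```python
import io
import pandas

# A tiny made-up CSV, held in memory instead of in a file
csv_text = """Name,Height,Season
Ann,1.6,1
Raj,1.8,1
Lee,1.7,2
"""

# index_col=0 uses the first column (Name) as the row labels
frame = pandas.read_csv(io.StringIO(csv_text), index_col=0)

print(frame.shape)      # (rows, columns)
print(frame.head(2))    # only the first two rows
```

Passing a number to `head()` controls how many rows are shown; with no argument it shows the first five.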

Extension: Dotting the Is
--------------
You have no doubt noticed the chain of dots above - like "pandas.DataFrame.from_csv('./data/wk1/data.csv')" - which is the Python way of accessing subcomponents of libraries. So, when we use "data.head()" we are calling a method in the *data* object called *head()* - which draws the first five rows. When we do "pandas.DataFrame.from_csv(*blah*)", we are going into the library *pandas* (which we've mentioned before), getting the object *DataFrame* and then delving within that to get a command called *from_csv* - which is what we need to import our data. So each dot is like saying "look inside the object and get this method/variable/whatever". If you're interested in this, you can look into Object Oriented Programming a bit more - but generally, it's worth just thinking about these things as packages that have all kinds of useful functions and commands. As we'll see soon, we can also do stuff like plot the data in the dataframe...
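The same dot notation appears throughout Python, not just in pandas. As a small sketch using only the standard library (the variable names are chosen for illustration), each dot below steps one level further inside a library or a value:

```python
import os

path = './data/wk1/data.csv'

# os -> path -> basename: look inside the os library, then inside its
# path module, and call the basename function found there
print(os.path.basename(path))   # data.csv

# Values have methods too: a string carries its own upper() method
greeting = "hi"
print(greeting.upper())         # HI
```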



In [None]:
data.plot()

Here, the dataset represents characters from 90s TV series *Twin Peaks* - which season/episode they first appeared in, their approximate height, gender and whether they are law enforcement officers. Let's start to think about a more useful dataset.

Exercise: Assigning groups
-------

In the next few days, we will be assigning groups for your project work, so you can start thinking about your final projects. To do this, we want to assign groups based on a mix of skills and interests.

1. Form a group of 4 with your neighbours
2. Create a .csv file with the headings "Name", "Context", "Data Analysis", and "Design". 
3. Fill in your names
4. Each individual has ten points - they need to assign those ten points to the three categories ("Context", "Data Analysis", and "Design") based on what they are most interested in - "Context" (using data and quantitative methods to explore issues in the arts, science, or politics, for example), "Data Analysis" (the business of using these methods to get a deeper understanding of what the data tells us), or "Design" (creating maps, graphs and other visual tools to communicate these stories visually and complement textual narratives).
5. Save as a csv, then load this into Python and display it, as we did with our *Twin Peaks* data above. We can even use some boilerplate code to make your life very easy:

In [None]:
myFilename = './data/wk1/your_sample_group_file.csv'
# On pandas 1.0 or later, use pandas.read_csv(myFilename, index_col=0, parse_dates=True) instead
data = pandas.DataFrame.from_csv(myFilename)
data.head()
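If you would rather build the file in Python than in a spreadsheet, something like the following sketch could work. The names and scores are invented, and the text is written to an in-memory buffer rather than a real file; it also checks that each person's points add up to ten.

```python
import csv
import io

# Made-up names and scores; each person splits ten points three ways
rows = [
    ["Name", "Context", "Data Analysis", "Design"],
    ["Ann", 5, 3, 2],
    ["Raj", 2, 6, 2],
    ["Lee", 3, 3, 4],
    ["Sam", 4, 1, 5],
]

# Write to an in-memory buffer; swap in open(filename, "w", newline="")
# to write a real file instead
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# Check that every person's points add up to ten
totals = {row[0]: sum(row[1:]) for row in rows[1:]}
print(totals)
```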

Homework
--------

Co-ordinate across the cohort to create these scores for everyone taking this class - then we'll do some work behind the scenes to create balanced groups which incorporate different interests and skills. *Remember*, we expect you all to do a little coding, a little mapping/graphing/visualisation, and a little writing - so this isn't a way to avoid Python altogether for your final project!

Deadline
--------
Before Week 2 Lecture 

Supplementary: More about Markdown
------------------------

Markdown lets us format text:

# As Heading Text
## A little bit smaller
### There are six header levels

**In bold**

*In italics*

In unordered lists:

* A list item
* Another List Item
* Yet Another list item
    
Or in numbered lists:

1. The first entry
2. The Second entry
3. The third entry
4. The fourth entry
    5. You can even have sub-lists
    6. It's very flexible

Or in bulleted lists:

   - You're beginning to get the idea now
   - And so on

Text can be formatted as code inline: `[entry for entry in ls if entry % 2 == 0]`

And it even has blockquotes:

   > The sky above the port was the color of a television, tuned to a dead channel

And you can embed images if you want:

In [None]:
from IPython.display import Image, display

Image("https://s3.eu-west-2.amazonaws.com/qm2/wk1/python.png")

You can also use $\LaTeX$, which allows you to type mathematical formulæ and display them correctly:

$x = y+z$

$z = \sum_{\alpha}Y_{\alpha}$

$Y_{ij} = \frac{M_iM_j}{r^\beta}$