# Plotting and Programming in Python 👑 💻 🐍

## January 16-17, 2020
https://columbiaswc.github.io/2020-01-16-columbia-section-1/

Contact: Alan Crosswell / alan@columbia.edu and Teddy Thomas / tthoma24@columbia.edu

Python helpers: Rob Lane / rob@cs.columbia.edu, Hima Bindu / hb2635@columbia.edu, and Axinia Radeva / ar2667@columbia.edu

Special thanks to https://twitter.com/mariinyrop for this instructor notebook, as well as the slides that I reuse shamelessly - Teddy
<hr>

### Table of Contents

#### Part I

1. [Running and Quitting](#1.-Running-and-Quitting)
2. [Variables and Assignment](#2.-Variables-and-Assignment)
3. [Data Types and Type Conversion](#3.-Data-Types-and-Type-Conversion)
4. [Built-in Functions and Help](#4.-Built-in-Functions-and-Help)
5. [Libraries](#5.-Libraries)
6. [Reading Tabular Data into DataFrames](#6.-Reading-Tabular-Data-into-DataFrames)
7. [Pandas DataFrames](#7.-Pandas-DataFrames)

#### Part II

8. [Lists](#8.-Lists)
9. [For Loops](#9.-For-Loops)
10. [Looping Over Data Sets](#10.-Looping-Over-Data-Sets)
11. [Writing Functions](#11.-Writing-Functions)
12. [Variable Scope](#12.-Variable-Scope)
13. [Conditionals](#13.-Conditionals)
14. [Programming Style](#14.-Programming-Style)

## Intro

- We'll be learning the basics of Python with an emphasis on research (as opposed to, say, app building), though most of this will be widely applicable no matter what you do.

- We'll be using Python3 within the Jupyter interactive Notebook environment, which is preferred by researchers because (1) you can write prose alongside your code to contextualize it, (2) it's a portable format that shows the code as well as the resulting output, and (3) it encourages reproducibility.

- We've set your Jupyter notebooks up to work with Anaconda, which you installed along with Python. Anaconda is what manages all the special Python *libraries* that you can choose to use in your Python projects.

- The libraries we'll use today are called *pandas* and *matplotlib*, which are two of the most widely used libraries for manipulating and visualizing data.

## Check-In

- Do you have `python-novice-gapminder-data.zip` downloaded and unpacked in your current directory?

In [None]:
!ls python-novice-gapminder-data.zip

- Can your notebook import the `pandas` library using Anaconda?

In [None]:
import pandas

<hr>

## 1. Running and Quitting

### Key Points:

- Python programs are plain text files.
- Use the Jupyter Notebook for editing and running Python.
- The Notebook has Command and Edit modes.
- Use the keyboard and mouse to select and edit cells.
- The Notebook will turn Markdown into pretty-printed documentation.
- Markdown does most of what HTML does.

<hr>

## 2. Variables and Assignment

[Slide #1: Variables](https://slides.com/marii/cul-swc-python#/1)


Use variables to store values.

In [None]:
age = 42
first_name = 'Ahmed'

Use `print` to display values.

In [None]:
print(first_name, 'is', age, 'years old')

Variables must be created before they are used.

In [None]:
print(last_name)

Variables can be used in calculations.

In [None]:
age = age + 3
print('Age in three years:', age)

Use an index to get a single character from a string.

In [None]:
atom_name = 'helium'
print(atom_name[0])

Use a slice to get a substring.

In [None]:
atom_name = 'sodium'
print(atom_name[0:3])
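
A minimal sketch of how slice bounds behave, reusing the `atom_name` string from above: the start index is included and the stop index is excluded. The names `first_three` and `last_three` are illustrative only.

```python
atom_name = 'sodium'

# The slice [0:3] takes the characters at positions 0, 1, and 2.
first_three = atom_name[0:3]
print(first_three)

# Omitting a bound means "from the start" or "to the end".
last_three = atom_name[3:]
print(last_three)

# Joining the two pieces gives back the original string.
print(first_three + last_three == atom_name)
```

Because the stop index is excluded, the length of a slice is simply `stop - start`.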

Use the built-in function `len` to find the length of a string.

In [None]:
print(len('helium'))

<hr>

## 3. Data Types and Type Conversion

[Slide #2: Data Types](https://slides.com/marii/cul-swc-python#/2)

Use the built-in function `type` to find the type of a value.

In [None]:
print(type(52))

In [None]:
fitness = 'average'
print(type(fitness))

Types control what operations (or methods) can be performed on a given value.

In [None]:
print(5 - 3)

In [None]:
print('hello' - 'h')

You can use the “+” and “\*” operators on strings.

In [None]:
full_name = 'Ahmed' + ' ' + 'Walsh'
print(full_name)

In [None]:
separator = '=' * 10
print(separator)

Strings have a length (but numbers don’t).

In [None]:
print(len(full_name))

In [None]:
print(len(52))

You must convert numbers to strings or vice versa when operating on them.

In [None]:
print(1 + '2')

In [None]:
print(1 + int('2'))
print(str(1) + '2')

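You can mix integers and floats freely in operations; Python converts integers to floats as needed. (In Python 3, `/` always returns a float, whereas dividing two integers in Python 2 returns the floored integer, so watch out!)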

In [None]:
print('half is', 1 / 2.0)
print('three squared is', 3.0 ** 2)
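
As a quick sketch of how division behaves in Python 3 (assuming a Python 3 kernel, as in this workshop): `/` always produces a float, while `//` floors the result. The name `mixed` is only for illustration.

```python
# True division always returns a float in Python 3.
print(1 / 2)
print(type(4 / 2))

# Floor division rounds down to a whole number.
print(7 // 2)

# Mixing an int and a float gives a float.
mixed = 1 + 2.0
print(mixed, type(mixed))
```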

Variables only change value when something is assigned to them.

In [None]:
first = 1
second = 5 * first
first = 2
print('first is', first, 'and second is', second)

<hr>

## 4. Built-in Functions and Help 

Use comments to add documentation to programs.

[Slide #3: Functions + Syntax](https://slides.com/marii/cul-swc-python#/3)

In [None]:
# This sentence isn't executed by Python.
adjustment = 0.5   # Neither is this - anything after '#' is ignored.

A function may take zero or more arguments.

In [None]:
print('before')
print()
print('after')

Commonly-used built-in functions include `max`, `min`, and `round`.

In [None]:
print(max(1, 2, 3))
print(min('a', 'A', '0'))

Functions may only work for certain (combinations of) arguments.

In [None]:
print(max(1, 'a'))

Functions may have default values for some arguments.

In [None]:
round(3.712)

In [None]:
round(3.712, 1)

Use the built-in function `help` to get help for a function.

In [None]:
help(round)

Python reports a syntax error when it can’t understand the source of a program.

In [None]:
# Forgot to close the quote marks around the string.
name = 'Feng

In [None]:
# An extra '=' in the assignment.
age = = 52

In [None]:
print("hello world"

Python reports a runtime error when something goes wrong while a program is executing.

In [None]:
age = 53
remaining = 100 - aege # mis-spelled 'age'

The Jupyter Notebook has two ways to get help.


- Place the cursor inside the parentheses of the function, hold down `shift`, and press `tab`.
- Or type a function name with a question mark after it.

In [None]:
round()

Every function returns something.

In [None]:
result = print('example')
print('result of print is', result)

<hr>

## 5. Libraries

[Slide #4: What are Libraries?](https://slides.com/marii/cul-swc-python#/4)

A library is a collection of modules, but the terms are often used interchangeably, especially since many libraries consist of a single module. Don’t worry if you mix them up.

A program must import a library module before using it.

In [None]:
import math

print('pi is', math.pi)
print('cos(pi) is', math.cos(math.pi))

Use `help` to learn about the contents of a library module.

In [None]:
help(math)

Import specific items from a library module to shorten programs.

In [None]:
from math import cos, pi

print('cos(pi) is', cos(pi))

Create an alias for a library module when importing it to shorten programs.

In [None]:
import math as m

print('cos(pi) is', m.cos(m.pi))

<hr> 

## 6. Reading Tabular Data into DataFrames


[Slide #5: What is Tabular Data?](https://slides.com/marii/cul-swc-python#/5)

Use the Pandas library to do statistics on tabular data.

In [None]:
import pandas

data = pandas.read_csv('data/gapminder_gdp_oceania.csv')
data

Use `index_col` to specify that a column’s values should be used as row headings.

In [None]:
data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
data

Use `DataFrame.info` to find out more about a dataframe.

In [None]:
data.info()

The `DataFrame.columns` variable stores information about the dataframe’s columns.

In [None]:
data.columns

Use `DataFrame.T` to transpose a dataframe. (Switch columns and rows)

In [None]:
data.T

Use `DataFrame.describe` to get summary statistics about data.

In [None]:
data.describe()

<hr> 

## 7. Pandas DataFrames

[Slide #6: Selecting values](https://slides.com/marii/cul-swc-python#/6)

Use `DataFrame.iloc[..., ...]` to select values by their (entry) position.

In [None]:
import pandas

data = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
data.iloc[0, 0]

Use `DataFrame.loc[..., ...]` to select values by their (entry) label.

In [None]:
data = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')

data.loc["Albania", "gdpPercap_1952"]

In [None]:
data.loc["Albania", :]

Use `:` on its own to mean all columns or all rows.

In [None]:
data.loc[:, "gdpPercap_1952"]

Select multiple columns or rows using `DataFrame.loc` and a named slice.

In [None]:
data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
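
One detail worth sketching: label slices with `loc` include both end points, while position slices with `iloc` exclude the stop position, like list slices. The small `table` below uses made-up numbers and hypothetical column names (`x_1962` and so on), not the gapminder data.

```python
import pandas

# Made-up numbers, only to show the shape of each selection.
table = pandas.DataFrame(
    {'x_1962': [1.0, 2.0, 3.0],
     'x_1967': [4.0, 5.0, 6.0],
     'x_1972': [7.0, 8.0, 9.0]},
    index=['Italy', 'Norway', 'Poland'],
)

# Label slices include both end points: 2 rows and 2 columns here.
by_label = table.loc['Italy':'Norway', 'x_1962':'x_1967']
print(by_label.shape)

# Position slices exclude the stop position: 1 row and 1 column here.
by_position = table.iloc[0:1, 0:1]
print(by_position.shape)
```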

Result of slicing can be used in further operations.

In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())

In [None]:
print(data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())

Use comparisons to select data based on value.

In [None]:
# Use a subset of data to keep output readable.
subset = data.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:') 
subset

In [None]:
# Which values were greater than 10000 ?
print('\nWhere are values large?')
subset > 10000

Select values or NaN using a Boolean mask.

In [None]:
mask = subset > 10000
subset[mask]

In [None]:
subset[subset > 10000].describe()

Split-apply-combine operations.

In [None]:
mask_higher = data.apply(lambda x:x > x.mean())
wealth_score = mask_higher.aggregate('sum', axis=1)/len(data.columns)
wealth_score

In [None]:
data.groupby(wealth_score).sum()

<hr> 

## 7. Plotting

`matplotlib` is the most widely used scientific plotting library in Python.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
time = [0, 1, 2, 3]
position = [0, 100, 200, 300]

plt.plot(time, position)
plt.xlabel('Time (hr)')
plt.ylabel('Position (km)')

Plot data directly from a Pandas dataframe.

In [None]:
import pandas

data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')

# Extract the year by removing the 'gdpPercap_' prefix from each column name
years = data.columns.str.replace('gdpPercap_', '', regex=False)
# Convert year values to integers, saving results back to dataframe
data.columns = years.astype(int)

data.loc['Australia'].plot()
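
One caution when pulling the year out of a column name such as `gdpPercap_1952`: Python's `str.strip` treats its argument as a set of characters to remove from both ends, not as a prefix. The sketch below uses plain Python strings to show the difference; `pop_1952` is a made-up column name used only for illustration.

```python
column = 'gdpPercap_1952'

# strip removes any of the listed characters from both ends.
print(column.strip('gdpPercap_'))

# replace removes the exact substring instead.
print(column.replace('gdpPercap_', ''))

# With a different (hypothetical) name, strip removes only the leading 'p',
# because 'o' is not in the character set.
other = 'pop_1952'
print(other.strip('gdpPercap_'))

# Slicing the last four characters works for any prefix.
print(other[-4:])
```

For the gapminder column names both approaches give the same year, but replacing the prefix or slicing the last four characters is less likely to surprise you with other names.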

Select and transform data, then plot it.

In [None]:
data.T.plot()
plt.ylabel('GDP per capita')

Many styles of plot are available.

In [None]:
plt.style.use('ggplot')
data.T.plot(kind='bar')
plt.ylabel('GDP per capita')

Data can also be plotted by calling the `matplotlib` `plot` function directly.


- The command is `plt.plot(x, y)`.
- The color / format of markers can also be specified as an optional argument: e.g. `'b-'` is a blue line, `'g--'` is a green dashed line.

Get Australia data from dataframe

In [None]:
years = data.columns
gdp_australia = data.loc['Australia']

plt.plot(years, gdp_australia, 'g--')

You can plot many sets of data together.

In [None]:
# Select two countries' worth of data.
gdp_australia = data.loc['Australia']
gdp_nz = data.loc['New Zealand']

# Plot with differently-colored markers.
plt.plot(years, gdp_australia, 'b-', label='Australia')
plt.plot(years, gdp_nz, 'g-', label='New Zealand')

# Create legend.
plt.legend(loc='upper left')
plt.xlabel('Year')
plt.ylabel('GDP per capita ($)')

In [None]:
plt.scatter(gdp_australia, gdp_nz)

In [None]:
data.T.plot.scatter(x = 'Australia', y = 'New Zealand')

In [None]:
# plt.savefig('my_figure.png')

<hr> 

## 8. Lists

A list stores many values in a single structure.

In [None]:
pressures = [0.273, 0.275, 0.277, 0.275, 0.276]
print('pressures:', pressures)
print('length:', len(pressures))

Use an item’s index to fetch it from a list.

In [None]:
print('zeroth item of pressures:', pressures[0])
print('fourth item of pressures:', pressures[4])

Lists’ values can be replaced by assigning to them.

In [None]:
pressures[0] = 0.265
print('pressures is now:', pressures)

Appending items to a list lengthens it.

In [None]:
primes = [2, 3, 5]
print('primes is initially:', primes)
primes.append(7)
primes.append(9)
print('primes has become:', primes)

In [None]:
teen_primes = [11, 13, 17, 19]
middle_aged_primes = [37, 41, 43, 47]
print('primes is currently:', primes)
primes.extend(teen_primes)
print('primes has now become:', primes)
primes.append(middle_aged_primes)
print('primes has finally become:', primes)
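
A short sketch of the difference between the two methods, using a fresh `primes` list so it can be run on its own: `extend` adds each item separately, while `append` adds its argument as a single item.

```python
primes = [2, 3, 5]

# extend adds each item of the other list individually.
primes.extend([7, 11])
print(primes, len(primes))

# append adds its argument as one single item, even if it is a list.
primes.append([13, 17])
print(primes, len(primes))

# The last item is itself a list.
print(primes[-1])
```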

In [None]:
help(list)

Use `del` to remove items from a list entirely.

In [None]:
print('primes before removing an item:', primes)
del primes[4]
print('primes after removing an item:', primes)

The empty list contains no values.

In [None]:
my_list = []
print(my_list)

Lists may contain values of different types.

In [None]:
goals = [1, 'Create lists.', 2, 'Extract items from lists.', 3, 'Modify lists.']

Character strings can be indexed like lists.

In [None]:
element = 'carbon'
print('zeroth character:', element[0])
print('third character:', element[3])

Character strings are immutable.

In [None]:
element[0] = 'C'

Indexing beyond the end of the collection is an error.

In [None]:
print('99th element of element is:', element[99])

<hr>

## 9. For Loops

A for loop executes commands once for each value in a collection.

In [None]:
for number in [2, 3, 5]:
    print(number)

In [None]:
print(2)
print(3)
print(5)

The first line of the `for` loop must end with a colon, and the body must be indented.

In [None]:
for number in [2, 3, 5]:
print(number)

Loop variables can be called anything.

In [None]:
for kitten in [2, 3, 5]:
    print(kitten)

The body of a loop can contain many statements.

In [None]:
primes = [2, 3, 5]
for p in primes:
    squared = p ** 2
    cubed = p ** 3
    print(p, squared, cubed)

Use `range` to iterate over a sequence of numbers.

In [None]:
for number in range(0,3):
    print(number)
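
A brief sketch of how `range` chooses its numbers; wrapping it in `list` is only a way to display all the values at once.

```python
# range(stop) starts at 0 and stops before `stop`.
print(list(range(3)))

# range(start, stop) also stops before `stop`.
print(list(range(2, 5)))

# A third argument sets the step size.
print(list(range(0, 10, 3)))
```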

The Accumulator pattern (very common) turns many values into one.

In [None]:
# Sum the first 10 integers.
total = 0
for number in range(10):
    total = total + (number + 1)
print(total)
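
The same pattern works for more than sums. As a sketch, the accumulator can start as an empty list or an empty string; the names `squares` and `acronym` are illustrative.

```python
# Accumulate squares into a list, starting from an empty list.
squares = []
for number in range(1, 6):
    squares.append(number ** 2)
print(squares)

# Accumulate characters into a string, starting from an empty string.
acronym = ''
for word in ['red', 'green', 'blue']:
    acronym = acronym + word[0].upper()
print(acronym)
```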

<hr>

## 10. Looping Over Data Sets

Use a `for` loop to process files given a list of their names.

In [None]:
import pandas

for filename in ['data/gapminder_gdp_africa.csv', 'data/gapminder_gdp_asia.csv']:
    data = pandas.read_csv(filename, index_col='country')
    print(filename, data.min())

Use `glob.glob` to find sets of files whose names match a pattern.

In [None]:
import glob

print('all csv files in data directory:', glob.glob('data/*.csv'))

In [None]:
print('all PDB files:', glob.glob('*.pdb'))

Use `glob` and `for` to process batches of files.

In [None]:
for filename in glob.glob('data/gapminder_*.csv'):
    data = pandas.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
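
`glob.glob` returns matches in no guaranteed order, so it can help to sort them before looping. The sketch below is self-contained: it creates a throwaway folder with hypothetical file names (`sample_a.csv`, `sample_b.csv`, `notes.txt`) rather than using the workshop data.

```python
import glob
import os
import tempfile

# Create a temporary folder holding three small files (hypothetical names).
with tempfile.TemporaryDirectory() as folder:
    for name in ['sample_b.csv', 'sample_a.csv', 'notes.txt']:
        with open(os.path.join(folder, name), 'w') as handle:
            handle.write('value\n1\n')

    # Only names matching the pattern are returned; sorted() fixes the order.
    matches = sorted(glob.glob(os.path.join(folder, 'sample_*.csv')))
    found = []
    for filename in matches:
        found.append(os.path.basename(filename))

print(found)
```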

<hr>

## 11. Writing Functions

*`()` contains the ingredients for the function while the body contains the recipe.*

Break programs down into functions to make them easier to understand.

In [None]:
def print_greeting():
    print('Hello!')

Defining a function does not run it.

In [None]:
print_greeting()

Arguments in a call are matched to the parameters in the definition.

In [None]:
def print_date(year, month, day):
    joined = str(year) + '/' + str(month) + '/' + str(day)
    print(joined)

print_date(1871, 3, 19)

In [None]:
print_date(month=3, day=19, year=1871)
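
Parameters can also be given default values, which callers may override. A minimal sketch, assuming a hypothetical `format_date` helper that returns the string instead of printing it:

```python
def format_date(year, month, day, separator='/'):
    """Join the parts of a date with a separator (defaults to '/')."""
    return str(year) + separator + str(month) + separator + str(day)

# Positional arguments, using the default separator.
print(format_date(1871, 3, 19))

# Keyword arguments can be given in any order.
print(format_date(day=19, year=1871, month=3, separator='-'))
```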

Functions may return a result to their caller using `return`.

In [None]:
def average(values):
    if len(values) == 0:
        return None
    return sum(values) / len(values)

In [None]:
a = average([1, 3, 4])
print('average of actual values:', a)

In [None]:
print('average of empty list:', average([]))

Every function returns something.
A function that doesn’t explicitly return a value automatically returns `None`.

In [None]:
result = print_date(1871, 3, 19)
print('result of call is:', result)

<hr>

## 12. Variable Scope

In [None]:
pressure = 103.9

def adjust(t):
    temperature = t * 1.43 / pressure
    return temperature

`pressure` is a global variable.
- Defined outside any particular function.
- Visible everywhere.

`t` and `temperature` are local variables in `adjust`. 
- Defined in the function.
- Not visible in the main program.
- Remember: a function parameter is a variable that is automatically assigned a value when the function is called.


In [None]:
print('adjusted:', adjust(0.9))
print('temperature after call:', temperature)

In [None]:
for item in ['book', 'rock', 'chair', 'ghost']:
    print(item)

In [None]:
print(item)
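
The two behaviours above can be sketched in one runnable cell. The `try`/`except` is only there so the cell finishes instead of stopping at the error; `adjusted` is an illustrative name.

```python
pressure = 103.9

def adjust(t):
    # `temperature` exists only while adjust() is running.
    temperature = t * 1.43 / pressure
    return temperature

adjusted = adjust(0.9)

# Accessing the local name from the main program raises NameError.
try:
    print(temperature)
except NameError as error:
    print('NameError:', error)

# A loop variable is not local to the loop: it keeps its last value.
for item in ['book', 'rock']:
    pass
print(item)
```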

<hr> 

## 13. Conditionals

Use `if` statements to control whether or not a block of code is executed.

In [None]:
mass = 3.54
if mass > 3.0:
    print(mass, 'is large')

In [None]:
mass = 2.07
if mass > 3.0:
    print(mass, 'is large')

Conditionals are often used inside loops.

In [None]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')

Use `else` to execute a block of code when an `if` condition is not true.

In [None]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')

Use `elif` to specify additional tests.

In [None]:
masses = [3.54, 2.07, 9.22, 1.86, 1.71]
for m in masses:
    if m > 9.0:
        print(m, 'is HUGE')
    elif m > 3.0:
        print(m, 'is large')
    else:
        print(m, 'is small')

Conditions are tested once, in order.

In [None]:
grade = 85
if grade >= 70:
    print('grade is C')
elif grade >= 80:
    print('grade is B')
elif grade >= 90:
    print('grade is A')
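
Because only the first true branch runs, the order of the tests matters. One possible reordering is sketched below, wrapped in a hypothetical `letter_grade` function; the `'below C'` label is an assumption added for grades under 70.

```python
def letter_grade(grade):
    # Test the highest threshold first so each grade reaches the right branch.
    if grade >= 90:
        return 'A'
    elif grade >= 80:
        return 'B'
    elif grade >= 70:
        return 'C'
    else:
        return 'below C'

for grade in [95, 85, 72, 40]:
    print(grade, 'is', letter_grade(grade))
```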

Python does not automatically go back and re-evaluate a condition if values change.

In [None]:
velocity = 10.0
if velocity > 20.0:
    print('moving too fast')
else:
    print('adjusting velocity')
    velocity = 50.0

In [None]:
velocity = 10.0
for i in range(5):
    print(i, ':', velocity)
    if velocity > 20.0:
        print('moving too fast')
        velocity = velocity - 5.0
    else:
        print('moving too slow')
        velocity = velocity + 10.0
        
print('final velocity:', velocity)

<hr>

## 14. Programming Style