# Intermediate Python programming

A notebook like this one consists of cells that contain either text (such as this cell) or code (such as the next cell). To progress to the next cell, running any code along the way, press shift + enter:

In [None]:
print('hello world!')

## Numbers and arithmetic

We can create numerical variables and apply arithmetic operations to them:

In [None]:
x = 1
# arithmetic operations
print(x + 1) # addition
print(x - 1) # subtraction
print(x * 2) # multiplication
print(x / 2) # division
print(x ** 2) # exponentiation (note: not x^2)
print(1 / (2 - 0.5)) # BEDMAS rules

You can add `=` onto the end of any of these arithmetic operators to overwrite a variable's value with their result.

I.e., `x ☐= y` is the same as `x = x ☐ y`. E.g.:

In [None]:
x = 2
x *= 3
print(x)
x -= 1
print(x)
x **= 2
print(x)

## Strings (text)

Python uses single or double quotes to store text information in **strings**:

In [None]:
# Single vs double quotes:
greeting = "hello"
greeting = 'hello'

# Quotes to define a string vs quotes in a string:
sentence = 'He says "hello"'
sentence = "I'm busy now"
# Backslash to use quote in a string:
sentence = 'He says "hello, I\'m busy now"'
print(sentence)

# Combining text:
new_text = 'a' + 'b'
print(new_text)
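
Note that `+` only combines strings with other strings. As a small sketch (the variable names `n_trials`, `message`, and `error_seen` are made up for this illustration), a number has to be converted with `str()` before it can be joined to text:

```python
n_trials = 12  # a number, not a string

# Convert the number to text first, then combine:
message = 'Completed ' + str(n_trials) + ' trials'
print(message)

# Without str(), Python refuses to mix text and numbers:
try:
    broken = 'Completed ' + n_trials
    error_seen = False
except TypeError:
    error_seen = True
print(error_seen)
```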

## Booleans (true/false values)

True and false values, called **Booleans**, result from comparisons:

In [None]:
# Booleans result from comparisons
x = 2
y = 3
print(x < y)
print(x <= y)
print(x > y)
print(x >= y)
print(x == y) # Note: not x = y, which is for variable assignment
print(x != y)
print(x < y < 4) # chained comparison---this tests x < y and y < 4
print('abc' == 'abc') # Not just numbers!

Like the arithmetic operators we can apply to numbers, there are Boolean operators we can apply to Booleans:

In [None]:
# Boolean operators
bool1 = False
bool2 = True
print(bool1 and bool2)
print(bool1 or bool2)
print(not bool2)

Also as with arithmetic operators, Boolean operators have an order of operations. To illustrate this, let's imagine a situation where we can publish our paper if the journal is free to publish in or we have funds for publication fees, and if we either have significant results or have preregistered our study:

In [None]:
# Publication example
free = True # The journal has no fee
funds = False # We have no funds
signif = False # Our results are not significant
prereg = False # Our study was not preregistered

can_publish = free or funds and signif or prereg
print(can_publish) # !!

That can't be right: we don't have significant results and didn't preregister our study. The reason we're getting the wrong answer is that `and` takes precedence over `or`. I.e., what's really being evaluated is `free or (funds and signif) or prereg`. Adding parentheses gives the grouping we intended:

In [None]:
can_publish = (free or funds) and (signif or prereg)
print(can_publish)
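
To see how often the grouping matters, here is a minimal sketch that tries every combination of the four Booleans and lists the cases where the two versions disagree. The function names `without_parens` and `with_parens` are made up for this illustration:

```python
from itertools import product

def without_parens(free, funds, signif, prereg):
    # Python reads this as: free or (funds and signif) or prereg
    return free or funds and signif or prereg

def with_parens(free, funds, signif, prereg):
    # The grouping we actually meant
    return (free or funds) and (signif or prereg)

# Try all 16 True/False combinations and keep those where the results differ
disagreements = [
    combo for combo in product([True, False], repeat=4)
    if without_parens(*combo) != with_parens(*combo)
]

# Each tuple is (free, funds, signif, prereg)
for combo in disagreements:
    print(combo)
```

The situation from the publication example, `(True, False, False, False)`, is one of the combinations printed.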

## Collections (lists and dictionaries)

We can collect multiple pieces of data into collections, organizing them either so that they have a serial order (in a `list`) or so that they are each associated with a different name (in a `dict`, or **dictionary**):

In [None]:
# Create a list to store values in an ordered sequence:
my_list = [1, 2, 3, 4, 5, 6, 7]
# Create a dict to associate each value with a name:
my_dict = {'first': 1, 'second': 2, 'third': 3}

# We can then access the stored values either by their serial position:
print(my_list[0]) # !! Positions are counted from 0, so this is the first item
# Or by their name:
print(my_dict['second'])

We can access multiple items in a list at a time using **list slicing**. The general notation is:

`listname[start:stop:step]`

- Start with the `start`th value (if omitted, start at the very beginning)
- Stop short of the `stop`th value (if omitted, don't stop anywhere)
- Move forward in the list by `step` each time (if omitted, move forward by 1)

In [None]:
print(my_list[0:5:2])
print(my_list[:5:2])
print(my_list[::2])
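
A few more slices of the same list, as a sketch. The last two lines use features not covered above (negative indices, which count from the end, and a negative step, which walks backwards), so treat them as optional extras:

```python
my_list = [1, 2, 3, 4, 5, 6, 7]

first_three = my_list[:3]      # positions 0, 1, 2 (stops short of position 3)
from_third = my_list[2:]       # position 2 through to the end
every_other = my_list[1:6:2]   # positions 1, 3, 5
last_two = my_list[-2:]        # negative indices count from the end
reversed_list = my_list[::-1]  # a negative step walks backwards

print(first_three)
print(from_third)
print(every_other)
print(last_two)
print(reversed_list)
```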

## Conditionals

**Conditional statements** allow us to only run certain pieces of code if certain conditions hold---that is, if certain tests return the value `True`:

In [None]:
# Imaginary p value example
p = 0.009

if p < 0.001:
	print('our results are very significant')
elif p < 0.01:
	print('our results are significant')
elif p < 0.05:
	print('our results are technically significant')
else:
	print('our results are not significant')

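`elif` and `else` clauses are linked to the previous `if` and `elif`s, so at most one branch runs: once a test succeeds, the remaining clauses are skipped. Compare this to two separate `if` statements, where both tests are evaluated: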

In [None]:
if p < 0.01:
	print('our results are significant')
if p < 0.05:
	print('our results are technically significant')
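
The difference can be made explicit by collecting the messages instead of printing them. This is only a sketch; the function names `linked` and `separate` are invented for the comparison:

```python
def linked(p):
    # if/elif: at most one branch runs
    messages = []
    if p < 0.01:
        messages.append('significant')
    elif p < 0.05:
        messages.append('technically significant')
    return messages

def separate(p):
    # Two independent ifs: both tests are always evaluated
    messages = []
    if p < 0.01:
        messages.append('significant')
    if p < 0.05:
        messages.append('technically significant')
    return messages

print(linked(0.009))    # one message
print(separate(0.009))  # two messages, because 0.009 passes both tests
```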

## `while` loops

**Loops** allow you to run the same piece of code many times, often with the value of some variable being different on each loop iteration. `while` loops repeat the same piece of code for as long as some condition remains `True`:

In [None]:
x = 0
while x < 10:
	x += 1
	print(x) # Note that x does reach 10

We can use a `break` statement to force a `while` loop to stop before its condition becomes false:

In [None]:
x = 0
while x < 10:
	x += 1
	print(x)
	if x == 5:
		print('ending the loop now')
		break

## `for` loops

Loops are often tied to a sequence of values, running once per item. With a `while` loop, we have to keep track of an index ourselves:

In [None]:
numbers = [1, 2, 3]
# print each item in the above list:
idx = 0
while idx < len(numbers):
	curr_number = numbers[idx]
	print(curr_number)
	idx += 1

A more concrete example:

```python
files = ['file1.ext', 'file2.ext', 'file3.ext']
file_idx = 0
while file_idx < len(files):
	curr_file = files[file_idx]
	load_and_process(curr_file)
	file_idx += 1
```

`for` loops give us a succinct notation for achieving the same effect:

In [None]:
numbers = [1, 2, 3]
for curr_number in numbers:
	print(curr_number)

Similarly:

```python
files = ['file1.ext', 'file2.ext', 'file3.ext']
for curr_file in files:
	load_and_process(curr_file)
```

Now we don't have to worry about creating and iterating an index variable!

You may also want to create a for loop that still uses an index. For this, the `range()` function is great:

In [None]:
xs = [1, 2, 3]
for idx in range(len(xs)):
  x = xs[idx]
  print(idx, x**2)

E.g.:

```python
files = ['file1.ext', 'file2.ext', 'file3.ext']
for file_idx in range(len(files)):
	curr_file = files[file_idx]
	print('Processing file ' + str(file_idx))
	load_and_process(curr_file)
```
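
When you need both the index and the item, Python's built-in `enumerate` function is one alternative to `range(len(...))`. It isn't used elsewhere in this notebook, so treat this as an optional sketch. The file names are placeholders, and the `processed` list stands in for the hypothetical `load_and_process` function:

```python
files = ['file1.ext', 'file2.ext', 'file3.ext']
processed = []  # stand-in for actually loading and processing each file

# enumerate() hands us the index and the item together on each iteration
for file_idx, curr_file in enumerate(files):
    print('Processing file ' + str(file_idx) + ': ' + curr_file)
    processed.append((file_idx, curr_file))
```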

Tip: if you find yourself creating numbered variable names like `x1`, `x2`, `x3`, it's a sure sign that you could save yourself a lot of time by putting their values in a list and creating a `for` loop:

In [None]:
x1 = 1
x2 = 2
x3 = 3
print(x1**2)
print(x2**2)
print(x3**2)

# vs

xs = [1, 2, 3]
for x in xs:
	print(x**2)

## Functions

A **function** is a little set of instructions that you can refer to by name. Functions usually take inputs (or **arguments**) and perform some operations on them to produce outputs.

The general notation for **defining** a function is as follows:

```python
def <function name>(<inputs>):
  <code>
  return <outputs>
```

For example:

In [None]:
def double_it(x):
  y = x * 2
  return y

In [None]:
print(double_it(4))

Functions can take more than one input and return more than one output:

In [None]:
def double_both(in1, in2):
  out1 = double_it(in1)
  out2 = double_it(in2)
  return out1, out2

In [None]:
x, y = double_both(2, 3)
print(x)
print(y)

Inputs can also be provided by name, in case you forget the default order:

In [None]:
def kumaraswamy_density(x, a, b):
  return a*b*x**(a-1) * (1-x**a)**(b-1)

print(kumaraswamy_density(0.5, 3, 2))

print(kumaraswamy_density(a=3, b=2, x=0.5))
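
Positional and named inputs can also be mixed, as long as the positional ones come first. A short sketch using the same function (the variable names `by_position`, `by_name`, and `mixed` are just for illustration):

```python
def kumaraswamy_density(x, a, b):
    return a*b*x**(a-1) * (1-x**a)**(b-1)

by_position = kumaraswamy_density(0.5, 3, 2)
by_name = kumaraswamy_density(b=2, x=0.5, a=3)  # any order works with names
mixed = kumaraswamy_density(0.5, b=2, a=3)      # positional inputs must come first

print(by_position, by_name, mixed)
```

All three calls compute the same value.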

Because we're generally only interested in the final outputs of functions, any variables that aren't returned as output get deleted once the function stops executing. That is, they are **local variables** that only exist within the function (as opposed to **global variables**, which exist everywhere):

In [None]:
def some_function():
  loc_var = 'local variable'
  print(loc_var)

some_function()
print(loc_var)
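
The last line of the cell above stops with a `NameError`, because `loc_var` no longer exists once the function has finished. The sketch below catches that error so the code keeps running, and shows that returning a value is the way to get it out of a function. The names `returned` and `outcome` are made up for this example:

```python
def some_function():
    loc_var = 'local variable'
    return loc_var  # hand the value back to the caller

returned = some_function()

try:
    print(loc_var)  # loc_var was never defined outside the function
    outcome = 'no error'
except NameError as err:
    outcome = 'NameError'
    print('NameError:', err)

print(returned)  # the returned value is available out here
```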

Because of this, variables inside and outside a function can have the same name without interfering with each other:

In [None]:
def create_a():
  a = 'two'
  print('Inside the function, a is ' + a)

a = 'one'
print('Prior to the function, a is ' + a)
create_a()
print('After the function, a is ' + a)

## Object-oriented programming

In Python, everything is an object. Objects have **attributes** and **methods**, which are, respectively, data and functions that are attached to them. They are accessed using dot notation:

`object.attribute`

or

`object.method(input)`

In [None]:
x = [1, 2, 3]
x.append(4) # append() is a method of the "list" class
print(x)

You can create your own **classes** of objects with custom methods and attributes:

In [None]:
class Participant():
  def __init__(self, ID, age):
    # Create "ID" and "age" attributes:
    self.ID = ID
    self.age = age
  def print_info(self): # Defines a method
    print('ID', self.ID)
    print('age', self.age)

In [None]:
p1 = Participant(ID = '181720', age = 23)
print(p1.ID) # Access the ID attribute
p1.print_info() # Call the print_info method
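
Objects can be stored in lists and looped over like any other value. Here is a minimal sketch, assuming a simplified version of the class above with one extra method, `is_adult`, invented for this example; the participant data are made up:

```python
class Participant():
    def __init__(self, ID, age):
        self.ID = ID
        self.age = age
    def is_adult(self):  # hypothetical method, added for illustration
        return self.age >= 18

# Made-up participants
participants = [
    Participant(ID='181720', age=23),
    Participant(ID='181721', age=17),
    Participant(ID='181722', age=41),
]

# Call a method on each object and keep the IDs of those that pass
adult_ids = [p.ID for p in participants if p.is_adult()]
print(adult_ids)
```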

It's unlikely that you'll need to create your own classes in the course of ordinary research. However, because you probably *will* end up using libraries written by someone else, it's worth understanding a little bit about how the classes defined by these libraries work.


## Important science libraries in Python

### Importing libraries

In [None]:
import random

for i in range(5):
  print(random.gauss(0, 1))
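
There are three common ways to write an import, all of which appear in this notebook. The sketch below shows them side by side using the `math` library:

```python
import math            # use the full name: math.sqrt
import math as m       # same library under a shorter alias: m.sqrt
from math import sqrt  # import one function directly: sqrt

print(math.sqrt(16))
print(m.sqrt(16))
print(sqrt(16))
```

All three lines call the same function; the only difference is how much you have to type to reach it.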

### The `math` library

Python's `math` library defines a number of useful mathematical functions. Let's see a few:

In [None]:
import math as m

print(m.exp(1)) # e^x, where e is Euler's number
print(m.sqrt(2)) # square root
print(m.log(8, 2)) # logarithm (second argument is base; default base is e)
print(m.sin(m.pi/2)) # sin and pi
print(m.e)
print(m.ceil(0.2)) # round up ("ceiling")

### The `random` library

In [None]:
import random

print(random.random()) # Uniform distribution between 0 and 1
print(random.gauss(mu = 0, sigma = 1)) # Standard normal distribution

random.seed(1) # "Seed" the random number generator so that it will produce the same sequence of results
print(random.random())
print(random.gauss(mu = 0, sigma = 1))

# Re-seed the random number generator and draw two more samples. They'll be the same if the seed is the same
random.seed(1)
print(random.random())
print(random.gauss(mu = 0, sigma = 1))
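
Rather than comparing the printed numbers by eye, we can check the effect of seeding directly. In this sketch, `draw_samples` is a helper function made up for the example:

```python
import random

def draw_samples(seed, n=3):
    # Seed the generator, then draw n uniform samples
    random.seed(seed)
    return [random.random() for i in range(n)]

first_run = draw_samples(seed=1)
second_run = draw_samples(seed=1)
other_seed = draw_samples(seed=2)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # a different seed gives a different sequence
```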

### Matplotlib

To make sure plots are displayed in notebooks like this one, run the `%matplotlib inline` command (recent versions of Jupyter often do this by default, but running it does no harm):

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

The `plot` and `scatter` functions both take an `x` and `y` sequence as input and produce line and scatter plots, respectively:

In [None]:
# Generate some sinusoidal data
ndata = 50
x = [d*4*m.pi/(ndata - 1) for d in range(ndata)]
y = [m.sin(d) for d in x]
print(x)
print(y)

# Line plot
plt.plot(x, y)
plt.show()

# Scatterplot
plt.scatter(x, y)
plt.show()

The `hist` and `boxplot` functions both take a single sequence of numbers and display its distribution:

In [None]:
# Generate some normally distributed data
x = [random.gauss(0, 1) for i in range(1000)]

plt.hist(x, 100)
plt.show()

plt.boxplot(x)
plt.show()

You can add axis labels and a title to your plots using `xlabel`, `ylabel`, and `title`:

In [None]:
ndata = 100
x = [random.random() for i in range(ndata)]
y = [random.random() for i in range(ndata)]
plt.scatter(x, y)
plt.xlabel('My x data (units)')
plt.ylabel('My y data (units)')
plt.title('My data')
plt.show()

### NumPy

NumPy is Python's main library for numerical computing, including linear algebra. Its basic data type is the **array**, which is a lot like a list except that all of its elements share one type and it's stored more efficiently in the computer's memory:

In [None]:
import numpy as np

# Create a list first, then convert it to an array
list_data = [1, 2, 3, 4, 5]
array_data = np.array(list_data)
print(array_data)

Arrays can be one-dimensional or higher-dimensional (i.e., can be vectors or matrices/tensors):

In [None]:
list_data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
array_data = np.array(list_data)
print(array_data)

Accessing elements in an array is the same as accessing elements in a list, except that when the array is multidimensional, you have to specify which elements to access across each dimension:

`arrayname[<which row(s)>, <which column(s)>, <which page(s)>, ...]`

In [None]:
print(array_data[1, 1])
print(array_data[1:, 1:])
print(array_data[:, 1])

Math with arrays is faster than math with lists, and you can apply mathematical operations to arrays directly rather than having to apply them to each element:

In [None]:
print(array_data + 5)
print(array_data * 3)
print(array_data + array_data)
print(np.sqrt(array_data))
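
It's worth seeing how differently lists and arrays treat the same operators. The following sketch uses a small made-up list; the names `doubled_list` and `doubled_array` are just for illustration:

```python
import numpy as np

list_data = [1, 2, 3]
array_data = np.array(list_data)

# With a list, * repeats the list and + joins two lists end to end
print(list_data * 2)
print(list_data + list_data)

# With an array, both operate element by element
print(array_data * 2)
print(array_data + array_data)

# Doubling each element of a list needs an explicit loop
doubled_list = [value * 2 for value in list_data]
doubled_array = array_data * 2
```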

### pandas

pandas is a library for working with tabular data. The core object in pandas is the `DataFrame`:

In [None]:
import pandas as pd

gcbs = pd.read_csv('https://pnb.mcmaster.ca/becker/expts/2a03/gcbs.csv')
print(type(gcbs))
print()
print(gcbs)

Access the data in a column using either dictionary notation or dot notation:

In [None]:
print(gcbs['Q1']) # Dictionary notation
print()
print(gcbs.Q1) # Dot notation

To select multiple columns, we use dictionary notation, but select using a list of strings rather than a single string:

In [None]:
print(gcbs[['age', 'Q1', 'Q2']])

We can compute the mean, standard deviation, median, and various other statistical quantities on the data in columns using dot notation:

In [None]:
print(gcbs[['Q1', 'Q2']].mean())

We can also group the data according to the values of one of the columns, and then compute these quantities for each group:

In [None]:
print(gcbs.groupby('age')[['Q1', 'Q2']].mean())
print(type(gcbs.groupby('age')[['Q1', 'Q2']].mean()))

To select a subset of rows, we use the following notation:

`subset = dataframe[<condition>]`

In [None]:
gcbs = gcbs[gcbs['age'] < 115]
print(gcbs)

by_age = gcbs.groupby('age').mean()
print(by_age)
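
To see what the condition is doing, it helps to look at a tiny table. The sketch below uses made-up data (not the gcbs dataset) and a hypothetical DataFrame called `toy`. The condition is itself a column of `True`/`False` values, one per row, and only the `True` rows are kept:

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    'age': [20, 20, 35, 35, 200],
    'Q1': [1, 3, 2, 4, 5],
})

condition = toy['age'] < 115  # one True/False value per row
subset = toy[condition]       # keeps only the rows where the condition is True
print(condition)
print(subset)

# Grouping the remaining rows by age and averaging within each group
by_age = subset.groupby('age').mean()
print(by_age)
```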

pandas also provides a convenient interface to Matplotlib:

In [None]:
by_age[['Q1', 'Q2', 'Q3']].plot()

It looks like the data is more variable among older participants. Maybe this is because there are fewer of them? Let's find out.

In [None]:
gcbs['age'].hist(bins=30)

### Statistics with SciPy's `stats` module

SciPy contains a number of modules for various applications in science and engineering, such as signal processing and clustering algorithms. For now, we'll just look at the `stats` module, which contains, among other things, functions to perform various statistical tests.

In [None]:
from scipy import stats

The above plot suggests that the average response to question 1 is higher than the average response to question 3. How would we test this? One way would be to do a t-test for related samples, the function for which is `ttest_rel`:

In [None]:
print(stats.ttest_rel(gcbs.Q1, gcbs.Q3))

However, because these scores are discrete rather than continuous, it's technically incorrect to do a t-test. Instead, we should do a Wilcoxon signed-rank test, which does not assume that the data are normally distributed:

In [None]:
print(stats.wilcoxon(gcbs.Q1, gcbs.Q3))

We can also test for correlations:

In [None]:
print(stats.pearsonr(gcbs.Q1, gcbs.Q3))

However, again, the usual significance test for the Pearson correlation assumes the data is normally distributed, which is not possible for these discrete questionnaire responses. To be technically correct, we can compute a Spearman correlation, which is based on ranks and does not make this assumption:

In [None]:
print(stats.spearmanr(gcbs.Q1, gcbs.Q3))

### scikit-learn

scikit-learn is an important library widely used for machine learning. If you want to learn about it, come to my next workshop in 3 weeks! https://libcal.mcmaster.ca/event/3691279