In [None]:
# In Colab, run this cell first to set up the file structure!
%cd /content
!rm -rf MOL518-Intro-to-Data-Analysis

!git clone https://github.com/shaevitz/MOL518-Intro-to-Data-Analysis.git
%cd MOL518-Intro-to-Data-Analysis/Lecture_2

# Lecture 2
## NumPy arrays, loading tabular data, and first scientific plots

In Lecture 1 we used Python mostly as a calculator.
That let us practice variables, arithmetic, and execution order.

In this lecture we move to something closer to real scientific work.
Most biological experiments do not produce a single number.
They produce *collections* of measurements.

Today we will learn how to represent those collections, store them in files, load them back into Python, and make our first plots.


## What is an array?

An array is a collection of values that all represent the same kind of measurement.
You can think of it as a column in a spreadsheet.

Arrays are ordered.
They have a fixed length.
Each element has a position, called an index.

Arrays show up everywhere in biology: time points, fluorescence intensities, cell lengths, gene counts.


### A 1D array as a table

Here is a one-dimensional array written as a table.

| index | value |
|-------|-------|
| 0     | 1     |
| 1     | 2     |
| 2     | 3     |
| 3     | 4     |
| 4     | 5     |

This is what Python sees when you work with a simple array.

### A 2D array as a table

A two-dimensional array looks like a small spreadsheet.

| row \ col | 0 | 1 | 2 |
|-----------|---|---|---|
| 0         | 1 | 2 | 3 |
| 1         | 4 | 5 | 6 |

This kind of structure appears naturally when data come from files.
Each row is one observation.
Each column is one variable.


## External packages and `import`

Python itself is a small language.
Most scientific functionality lives in external packages.

A package is a collection of code written by other people.
NumPy and matplotlib are examples.

The `import` statement tells Python to load that code so you can use it.
We often give packages short nicknames, such as `np` for NumPy, to make code easier to read.
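
As a small illustration of what a nickname does, the sketch below imports NumPy twice, once under its full name and once as `np`. It assumes NumPy is installed; `np.sqrt` is used only as a convenient function to call.

```python
import numpy          # load the package under its full name
import numpy as np    # load it again under the short nickname "np"

# Both names refer to the same package, so these two calls are equivalent.
print(numpy.sqrt(16.0))
print(np.sqrt(16.0))
```

The nickname changes only how you refer to the package, not what the package does.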

Finding packages is part of scientific programming.
You will often rely on documentation, examples, and other scientists' code.


## NumPy

NumPy is the core numerical package we will use.
It provides the array object and fast numerical operations.
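
To give a feel for what "numerical operations" means here, the following sketch applies arithmetic to a whole array at once. The values and the names `lengths_um` and `lengths_nm` are made up for illustration.

```python
import numpy as np

# Hypothetical cell lengths in micrometers (made-up values).
lengths_um = np.array([2.0, 2.5, 3.0, 4.0])

# Arithmetic applies to every element at once, with no explicit loop.
lengths_nm = lengths_um * 1000
print(lengths_nm)

# Summary statistics are single method calls.
print(lengths_um.mean())
```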


In [None]:
import numpy as np

### Creating a 1D array

In [None]:
arr = np.array([1, 2, 3, 4, 5])
arr

### Length and shape

In [None]:
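# A notebook cell only displays its last expression, so print each result.
print(len(arr))   # number of elements: 5
print(arr.shape)  # size along each dimension: (5,)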
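
`len` and `.shape` tell the same story for a 1D array, but they differ once an array has more than one dimension. The sketch below compares them; the names `one_d` and `two_d` are illustrative only.

```python
import numpy as np

one_d = np.array([1, 2, 3, 4, 5])
two_d = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(len(one_d), one_d.shape)  # 5 (5,)
print(len(two_d), two_d.shape)  # 2 (2, 3)  len counts rows only
print(two_d.size)               # 6  total number of elements
```

For this reason `.shape` is usually the safer thing to inspect.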

### Indexing and slicing

In [None]:
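print(arr[0])    # first element: 1
print(arr[-1])   # last element: 5
print(arr[0:3])  # indices 0, 1, 2 (the stop index is excluded): [1 2 3]
print(arr[::2])  # every second element: [1 3 5]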

### Replacing values

In [None]:
arr[2] = 10
arr
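
One thing to keep in mind when replacing values: a slice of a NumPy array is a view of the same data, so assigning into the slice also changes the original array. The sketch below shows this with illustrative names (`original`, `view`, `independent`).

```python
import numpy as np

original = np.array([1, 2, 3, 4, 5])

view = original[0:3]   # a slice shares memory with the original array
view[0] = 99           # so this also changes original[0]
print(original)        # the first element is now 99

independent = original.copy()  # an explicit copy has its own memory
independent[1] = -1            # original is not affected by this
print(original)                # the second element is still 2
```

Calling `.copy()` before modifying is a simple way to keep the original values intact.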

### Creating a 2D array

In [None]:
arr2 = np.array([[1, 2, 3],
                 [4, 5, 6]])
arr2

### Shape and indexing in 2D

In [None]:
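print(arr2.shape)  # (rows, columns): (2, 3)
print(arr2[0, 1])  # row 0, column 1: 2
print(arr2[1])     # all of row 1: [4 5 6]
print(arr2[:, 1])  # all of column 1: [2 5]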
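
The same row and column syntax also accepts slices, which selects a rectangular block. The sketch below is a minimal illustration; the names `table`, `row`, `column`, and `block` are not part of the lecture code.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

row = table[1]         # one row
column = table[:, 2]   # one column
block = table[:, 0:2]  # all rows, columns 0 and 1

# A single row or column is 1D; a block of rows and columns stays 2D.
print(row.shape, column.shape, block.shape)
```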

### In-class exercise: arrays

Exercise 1a: Predict the output of slicing operations on a 1D array.
Exercise 1b: Predict the output of indexing and slicing on a 2D array.


In [None]:
# Exercise 1a (print each result so that every line is displayed)
print(arr[0:5])
print(arr[-1])
print(arr[::2])

# Exercise 1b
print(arr2[0])
print(arr2[:, 0])
print(arr2[1, 2])

## Loading data from a file

Experimental data are usually stored in files.
In this course, each lecture folder contains a `data/` directory.

For now, we assume CSV files with no header rows.
Each row is one time point.
Column 0 is time.
Column 1 is optical density.


### Loading with `np.loadtxt`

`np.loadtxt` is strict.
If any entry cannot be converted to a number, for example a text header, it raises an error instead of guessing.
This helps catch problems early.
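
To see this strictness without touching any real file, the sketch below feeds `np.loadtxt` two small in-memory texts through `io.StringIO`. The contents are made up, and the variable names are illustrative.

```python
import io
import numpy as np

# Two tiny in-memory "files" with made-up values, used instead of real data.
clean_text = "0.0,0.05\n1.0,0.10\n"
header_text = "time,od\n0.0,0.05\n1.0,0.10\n"

# Purely numeric text loads into a 2D array.
clean = np.loadtxt(io.StringIO(clean_text), delimiter=",")
print(clean.shape)

# A text header cannot be converted to numbers, so loading stops with an error.
try:
    np.loadtxt(io.StringIO(header_text), delimiter=",")
    failed = False
except ValueError as err:
    failed = True
    print("loadtxt refused the file:", err)
```

If a file is known to have exactly one header line, passing `skiprows=1` to `np.loadtxt` skips it.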


In [None]:
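# np.loadtxt splits on whitespace by default; a comma-separated file needs delimiter=','
data = np.loadtxt('data/OD1.csv', delimiter=',')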
data

### Inspect immediately

In [None]:
print(data.shape)  # (number of rows, number of columns)
print(data[:5])    # first five rows
print(data[-1])    # last row

### Extract columns

In [None]:
time = data[:, 0]
od = data[:, 1]
time[:5], od[:5]

### Duration of experiment

In [None]:
print(time[0], time[-1])   # first and last time points
print(time[-1] - time[0])  # total duration, in the units of the time column

## Saving modified data

When you modify data, save the result to a new file.
Do not overwrite raw data.


In [None]:
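od_modified = od.copy()  # work on a copy so the loaded data stay unchanged
od_modified[0] = 0.05
new_data = np.column_stack((time, od_modified))  # recombine into two columns
np.savetxt('data/example2_modified.csv', new_data, delimiter=',')  # write comma-separated values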
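
A quick way to check a save step is a round trip: write the array, load it back, and compare. `np.savetxt` writes space-separated values unless `delimiter` is given, so this sketch passes `delimiter=','` to both calls. The values and the file name `round_trip.csv` are made up, and the file goes into a temporary folder so no real data are touched.

```python
import os
import tempfile
import numpy as np

# Made-up time and OD values, standing in for a real data set.
time_points = np.array([0.0, 1.0, 2.0])
od_values = np.array([0.05, 0.10, 0.20])
table = np.column_stack((time_points, od_values))

# Write to a temporary folder that is deleted when the block ends.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "round_trip.csv")  # hypothetical file name
    np.savetxt(path, table, delimiter=",")
    reloaded = np.loadtxt(path, delimiter=",")

print(reloaded.shape)
print(np.allclose(table, reloaded))
```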

## Plotting

Numbers alone are hard to interpret.
Plots turn arrays into pictures that humans can understand.


### What is matplotlib?

Matplotlib is a plotting library.
It is not part of core Python.

It takes numerical arrays and turns them into figures.
We will start with a very small subset of its features.


In [None]:
import matplotlib.pyplot as plt

### Plot OD versus time

In [None]:
plt.plot(time, od)
plt.xlabel('Time')
plt.ylabel('Optical density (OD)')
plt.title('Growth curve')
plt.show()

### Normalizing the growth curve

In [None]:
od_norm = od / od[0]
plt.plot(time, od_norm)
plt.xlabel('Time')
plt.ylabel('Normalized OD')
plt.title('Normalized growth curve')
plt.show()
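
What the normalization does numerically can be seen on a few made-up readings. In this sketch, `od_example` is an illustrative array, not course data.

```python
import numpy as np

# Made-up OD readings for illustration only.
od_example = np.array([0.05, 0.10, 0.20, 0.40])

# Dividing by the first reading expresses every point as a fold change
# relative to the start of the experiment.
od_example_norm = od_example / od_example[0]
print(od_example_norm)
```

The normalized curve always starts at 1, which makes curves with different starting densities easier to compare.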

## In-class exercises with files

Exercise 2: Load `example2.csv`, modify values, and save a new file.
Exercise 3: Load `OD1.csv` and `OD2.csv`, plot both, and discuss growth modes.


## Common failure modes

- Forgetting imports.
- Mixing up rows and columns.
- Overwriting variables.
- Not checking shapes or plots.
