# Lab 1: Getting Started with Machine Learning

## Labs for this short course
Each session will be split between lectures and a hands-on lab session like this one. We aim for these lab sessions to serve multiple purposes:
- Provide a space to explore the concepts discussed in the lectures.
- See and use tools that rely on the machine learning methods we discuss.
- Give a friendly first glimpse into the programming needed to develop machine learning models.

Some of our lab sessions (like this one) will use an environment called a [Jupyter notebook](https://jupyter.org/). Other sessions will be a guided walk-through of specific mass spectrometry tools that use machine learning methods under the hood. Jupyter notebooks provide a way to marry text and code (a concept called *literate programming*), giving us a rich interactive document to explore. While you could install Jupyter on your computer, we are using a service called [Binder](https://mybinder.org/) to run these Jupyter notebooks on a borrowed computer in the cloud; thus, the labs require nothing beyond a web browser.

This lab will focus on getting familiar with the Jupyter notebook interface, a short introduction to programming in Python, and an exploration of the bias-variance tradeoff.

## Exercise 1: Meet your group!
Everyone should have been assigned a random group for our lab sessions. Find the other 3–4 folks in your group and introduce yourself! This is an excellent time to help each other and learn together. Over the next 5 minutes, go around and introduce yourself with the following info and anything else you want to share:
1. Your name (don't hesitate to gently correct mispronunciations!)
2. Your preferred pronouns (he/him, she/her, they/them, ...)
3. Where you are from, as specific as you want to be.
4. Why are you interested in machine learning?

Your group should be the first folks you turn to when you have questions or want to discuss the exercises during the labs. However, if you do get stuck or still have a question, don't hesitate to ask for help! Use your post-it notes (we should have already talked about what the colors mean) to signal that your group needs help, and one of the instructors will be over as soon as possible to assist.

## Exercise 2: Introducing Jupyter notebooks
We chose to use Jupyter notebook environments with Binder because they provide a way for us to explore the machine learning methods we discuss with little-to-no setup required. However, the notebook interface can be intimidating, particularly if you've never programmed before. If this describes how you're feeling, please don't worry; we've tried to design these labs to be approachable by anyone, with or without coding experience, once they understand the Jupyter notebook interface at a high level.

### The Jupyter Lab interface
Jupyter notebooks allow us to switch between writing text (in a syntax called "markdown") and code using the concept of cells. The markdown cells are where we've written the prose for the exercises. In fact, the text you're reading right now is contained in a markdown cell. Here are the different parts of a Jupyter notebook:

![notebook](images/blank-notebook-ui.png)

### Running code in Jupyter notebooks
Code cells contain code that we can run. When we run or execute a code cell, the computer reads the code that is written in it and uses it as a set of instructions to do *something*: whatever we've programmed it to do. Many of the exercises involve running code that we've already written and modifying it in specific ways to see what happens. Throughout the labs, you'll see two types of code: Python and shell commands (commands you would run from the command line or terminal).

A code cell is indicated with a `[x]:` beside it, where `x` is blank if the code cell has never been run. After a code cell has been run, `x` indicates the order in which that code cell was run among all of the other code cells in the notebook. Some code cells will also output text or images when they are run. These outputs appear below the code cell that generated them and are updated each time the code cell is run.

To run the example code cell below, click on the cell and press the "Run" button in the toolbar, or press ⌘Cmd + Enter (macOS) or Ctrl + Enter (Windows).

**Go ahead and run the cell below. What happened?**

In [None]:
print("I just ran this cell!")

Throughout these labs, we will modify parts of existing code cells to see what happens. Below is a code cell that will create a plot which we can modify. Note that parts following `#`s are comments and have no effect on the code itself. Most of our Jupyter notebooks will begin with a series of `import` statements, which allow us to load the functionality required for the session.

For this session, we only need the following import.

**Run the cell below with ⌘Cmd + Enter (macOS) or Ctrl + Enter (Windows):**

In [None]:
import src

With the functionality for our current session loaded, **try running the code below**. If you see a message that says `NameError`, please verify that you've run the code cell above.

**Try experimenting with values for the parameter**. To change the values, type any number you want in place of the `1`. You can also try non-numbers, but you'll receive an error.

In [None]:
# Change the value of parameter and see what happens:
parameter = 1

# After you change the above parameter, run the code.
# The plot generated by the function below should change!
src.session_1.make_first_plot(parameter)

Which function do you think $f(X)$ describes? *The answer is at the bottom of this notebook.*

### Running command line programs in Jupyter notebooks
Finally, we can also execute command line programs from within a Jupyter notebook code cell. In later labs we'll explore command line programs like Percolator in detail. There are multiple ways to run command line programs in a Jupyter notebook. The easiest is to prefix your code with an exclamation point `!`.

**Run the following code cell to print the help for Percolator.** Please note that you don't really need to pay attention to what the help message says, just that it was printed from the Percolator command line program.

In [None]:
!percolator --help
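The `!` prefix is a feature of the notebook rather than of Python itself. As a minimal sketch of one of the other ways, Python's standard-library `subprocess` module can run a command and capture what it prints. The command used here (asking the Python interpreter for its version) is only a stand-in that we picked because it is available wherever Python runs; it is not part of the lab's own code.

```python
import subprocess
import sys

# Run a command line program and capture the text it prints.
# sys.executable is the path to the Python interpreter running this code;
# it stands in for a program like Percolator in this sketch.
result = subprocess.run(
    [sys.executable, "--version"],
    capture_output=True,
    text=True,
)

# A return code of 0 conventionally means the program finished without error.
print("Return code:", result.returncode)
print(result.stdout.strip())
```

Unlike the `!` prefix, this approach stores the program's output in a variable (`result`), which is handy if we want to use that output later in our Python code.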

## Exercise 3: A brief introduction to Python

Python is a general-purpose programming language that is very popular for machine learning and data science. We have no expectation that you've programmed in any language before, but as you saw in the previous exercise, we will often be working together on small modifications to Python code in these labs. We chose this format so that everyone would be exposed to Python code and so that these labs could serve as a jumping-off point to pursue your interests after the course. However, we'll need to introduce a little bit of how Python works to maximize your experience.

### Variables are assigned with the `=` symbol.
In Python, we assign *values* to *variables* as a way to make the computer remember something that we choose. As in many programming languages, this is done using the `=` symbol. The values we can assign take many forms, which we call *data types*. We will mostly be working with numeric data types like `1`, `10`, `100382`, `2.35`, and `3e4`.

The code below assigns the value `3.14159` to the variable `pi`, then uses the `print()` function to show us the value we've assigned:

In [None]:
pi = 3.14159
print(pi)
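To make the idea of numeric data types a little more concrete, here is a small sketch using some of the example values listed earlier. The variable names are our own, chosen only for illustration. Python's built-in `type()` function reports the data type of a value:

```python
# Each value has a data type, which the built-in type() function reports.
whole_number = 100382    # an integer (int)
decimal_number = 2.35    # a floating-point number (float)
scientific = 3e4         # scientific notation for 3 x 10^4, stored as a float

print(type(whole_number))    # <class 'int'>
print(type(decimal_number))  # <class 'float'>
print(scientific)            # 30000.0
```

Note that `3e4` prints as `30000.0`: values written in scientific notation are always stored as decimal (floating-point) numbers.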

There are also non-numeric data types that we'll use less frequently in these labs. One example is strings, which are collections of characters enclosed in quotation marks (either single quotes `'` or double quotes `"` will work).

For example, we can assign the string `"awesome"` to the variable `mass_spec`:

In [None]:
mass_spec = "awesome"
print(mass_spec)
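As a quick illustration of the point about quotation marks, the sketch below (with variable names of our own choosing) shows that single and double quotes produce the same string, and that one kind of quote can be used to enclose the other:

```python
# Single and double quotes create identical strings.
single = 'awesome'
double = "awesome"
print(single == double)  # True

# Enclosing a string in double quotes lets it contain a single quote.
phrase = "it's awesome"
print(phrase)

# len() counts the characters in a string.
print(len(double))  # 7
```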

Additionally, we can change the value of a variable after it's first assigned. In this example, we assign the variable `pi` the value of `3.14159`, then we reassign the variable `pi` the value of `"whoops"`:

In [None]:
pi = 3.14159
print("First pi was:", pi)

pi = "whoops"
print("Now pi is:   ", pi)

**Experiment with assigning your own variables in the empty code cell below.** 

**Before you run the next code cell, what do you predict will happen?**

In [None]:
pi = 3.14159
cherry = pi
cherry = cherry + 1
print(cherry)

### Functions use inputs to produce outputs

Just like the functions we're trying to estimate using machine learning, functions in Python can take inputs and turn them into outputs. We typically call the inputs *arguments*. In some exercises you'll need to change the values of arguments to explore how a function behaves. One example of a function we've used already is `print()` which prints the values of the variables you give it.

We can also define our own custom function using the `def` keyword:

In [None]:
#  The function
#      name        arguments
#       |              |
#       |       +------+--------+
#       V       V      V        V
def my_function(X, multiplier, power):
    """I am a docstring.

    I contain information about the function.
    """
    # We use arguments like variables inside the function.
    Y = multiplier * X**power  # ** indicates an exponent.

    # The return keyword specifies the output
    return Y

Once it's defined, we can then use the function:

In [None]:
output = my_function(10, 2, 0.5)  # 2(10^0.5)
print(output)
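Arguments can also be passed by name, which can make it easier to see which value goes where. The sketch below redefines `my_function` so that it can run on its own, then calls it both ways. Passing arguments by name is optional and is shown here only as an illustration.

```python
def my_function(X, multiplier, power):
    """Multiply X raised to a power by a multiplier."""
    return multiplier * X**power


# Arguments matched by their position...
by_position = my_function(10, 2, 0.5)

# ...or matched by their name. Both calls do the same thing.
by_name = my_function(X=10, multiplier=2, power=0.5)
print(by_position == by_name)  # True

# Changing an argument changes the output: 2 * 10**2
print(my_function(10, 2, 2))  # 200
```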

**Write a function that calculates the mean of three numbers.**

## Exploring the bias-variance tradeoff

We're now going to revisit our peptide retention time data and use it to get a better understanding of the bias-variance tradeoff. The code below fits a model to our peptide retention time dataset from [Zolg et al.](https://doi.org/10.1002/pmic.201700263) The model has one hyperparameter, `k`, that controls its flexibility. As a group, explore what happens when you change `k` and discuss your observations.

**Explore values of k between 1 and 10**
- What changes do you see in the model fit (the line in the first two panels) when you change k?
- What value of k would you choose for a final model based on these results?
- Do low values of k result in high bias or high variance?
- Do high values of k result in high bias or high variance?

In [None]:
# Change the value of k below to an integer between 0 and 29:
k = 13

# This function fits a model, then plots the results.
src.session_1.fit_model_to_ret_times(k)

## The Prime Directive

Recall Will's prime directive of ML: 
> Only evaluate your models on data that has never been used for training.

Now we're going to explore why it is so important and what happens when we ignore it!

First, we'll load some data that represents a biomarker panel for a disease. Each **feature** in our dataset is a measurement of one out of 100 analytes and each **example** is one patient. The **output** variable indicates whether or not each patient was diagnosed with our disease of interest by some other means. In the code cell below, we load our data:

In [None]:
## load data.

Before doing anything else, we split our data into a training set and a test set. We will use the training set for all of our model training tasks, then use our test set only for our final evaluation. The code below splits our data once:

In [None]:
## split data

Here we split the training data again, so that we have a held-out validation set on which to evaluate the different **hyperparameters** we can choose for our model:

In [None]:
## split data again
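The two splits described above could be sketched as follows, using only Python's standard library. Everything here is an assumption made for illustration: the number of patients, the split fractions, the seeds, and the helper `split_indices` are all hypothetical and are not the lab's actual code or data.

```python
import random


def split_indices(n_examples, holdout_fraction, seed):
    """Shuffle the example indices and split them into two groups.

    Returns a tuple of (kept indices, held-out indices).
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_holdout = round(n_examples * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


# Hypothetical dataset size, chosen only for illustration.
n_patients = 100

# First split: set aside 20% of all patients as the test set.
train_val_idx, test_idx = split_indices(n_patients, 0.2, seed=1)

# Second split: set aside 25% of the remaining patients for validation.
# The second call returns positions within train_val_idx, so we map them
# back to the original patient indices.
train_pos, val_pos = split_indices(len(train_val_idx), 0.25, seed=2)
train_idx = [train_val_idx[i] for i in train_pos]
val_idx = [train_val_idx[i] for i in val_pos]

print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

The key property is that no patient appears in more than one of the three sets, so the test set stays untouched until the final evaluation.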

Finally, before modeling, let's visualize our training set using principal component analysis (PCA). PCA allows us to reduce the dimensionality of our dataset from 100 features down to two, which we can plot and explore. It is always a good idea to perform some exploratory analysis to learn about your dataset before modeling.

In [None]:
## PCA

Now we're going to build our model two ways, once ignoring Will's Prime Directive and once following it, to see how each performs.

## Ignoring the Prime Directive

We're now going to fit a classifier to our full dataset, prior to splitting data for modeling. In practice, you should never do this for a supervised learning task (or one that will become a supervised learning task on the same data).

Let's begin building our model! Once again, there is a single hyperparameter that we need to pick. This time it is represented by the Greek letter lambda, $\lambda$, which we write as the variable `lambda_val` below. *Note that `lambda` is a reserved keyword in Python, so it cannot be used as a variable name.*
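As a quick check of the note about `lambda`, Python's standard-library `keyword` module can tell us whether a name is reserved. This is only a small sketch; the value assigned to `lambda_val` is a placeholder of our own and not a recommended setting.

```python
import keyword

# Reserved keywords cannot be used as variable names.
print(keyword.iskeyword("lambda"))      # True
print(keyword.iskeyword("lambda_val"))  # False

# So we store the hyperparameter under a different name instead.
lambda_val = 0.1  # placeholder value, for illustration only
print(lambda_val)
```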

## Answers
*Which function do you think $f(X)$ describes?* If we call our parameter $p$, then $f(X) = X\sin(pX)$
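For those curious, the formula above can be written as a Python function. This is a minimal sketch of the math only; we are not showing how `make_first_plot` is actually implemented, and the name `f` and its argument names are our own.

```python
import math


def f(X, p):
    """Evaluate f(X) = X * sin(p * X) for a single number X."""
    return X * math.sin(p * X)


# At X = 0 the function is zero, no matter the parameter.
print(f(0.0, 1))  # 0.0

# With p = 1 and X = pi/2, sin(pX) is 1, so f(X) is about pi/2.
print(f(math.pi / 2, 1))
```

Larger values of $p$ make the sine term oscillate faster, which is why the plot gets "wigglier" as the parameter increases.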