Capstone Assignment
===================

**Author:** Ulrich G Wortmann



## Goal



Today's assignment has two parts:

1.  Write a program (notebook) that can perform a regression analysis on arbitrary CSV files. You can re-use many parts of last week's assignment, but you will likely have to modify them. The user of your program will only specify the file name, the column index of the independent variable, the column index of the dependent variable, and whether any of these require a log transform. Your code will produce a histogram plot and a fully annotated regression plot that includes the residuals. The plots must use the CSV file column headers as axis labels.
2.  Use the new code to solve the calibration assignment below, i.e., you submit a notebook with your code and the results obtained with the calibration data below.



## Preparing the code



In the following, we want to create a notebook that can be used with any given dataset to produce a linear regression analysis. The point of the notebook is to make this analysis as comfortable as possible, so we will add a few bells and whistles. The notebook should have the following sequence of cells:

1.  A cell with all import statements and the variable definitions for the file name, the index values of the columns you want to analyze, the graph labels, the name of the figure file (e.g., xyz-lin-reg.pdf or png, etc.), and whether this analysis will use a log transformation or not. Use the following variable names:



In [1]:
# data file
fn: str = "foo.csv"  # replace foo with your file name
fig_fn: str = "foo.pdf"  # figure file name

# select data from csv file
x_col: int = 1  # independent_var
y_col: int = 0  # dependent_var
log_transform: str = ""  # use as "X", or "Y" or "XY"
sig: float = 0.05  # significance level, 1 - sig = 0.95, i.e., 95% confidence

Note: If you do a log transform, your code should change the respective axis labels to log(…).

2.  Starting with this cell, all code actions should only depend on the variables defined in the first cell. Continue with reading the data from the CSV file, and mapping the data frame columns to generic variables (e.g., `X` and `Y`, see below). Add a statement to sort the data frame in ascending order by "X".
3.  Plot "X" and "Y" as histograms to check visually whether the data is normally distributed.
4.  Print the Pearson correlation coefficient using a suitable print statement (i.e., don't just print a number!)
5.  Perform the regression analysis for `Y~X` and print the resulting parameters
6.  Extract all data needed to make a regression plot that shows the regression parameters, and the confidence intervals for the model and the predictions. In addition, also extract the residuals from the results object like this:



In [1]:
residuals: pd.Series = results.resid

7.  Create a graph with 1 column and two rows. The first plot will be the usual regression plot, whereas the second graph should plot the residuals versus the x-axis data. The graph should be 120 dpi, 6 by 8 inches. See this example:

![img](./sed.png)

8.  Note that you should derive the text position for the regression parameters automatically. You can retrieve the axis dimensions as `ax.get_xlim()`  and `ax.get_ylim()`.  Consider whether your correlation coefficient is positive or negative when you compute the text position.
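
One possible way to derive the text position from the axis limits is sketched below. The helper name `text_position` and the 5% margin are assumptions made for illustration and are not part of the assignment. The idea is that a rising regression line leaves the upper left corner free, whereas a falling line leaves the upper right corner free.

```python
def text_position(
    xlim: tuple[float, float],
    ylim: tuple[float, float],
    r: float,
    margin: float = 0.05,
) -> tuple[float, float, str]:
    """Return x, y, and the horizontal alignment for the regression text.

    xlim and ylim are the tuples returned by ax.get_xlim() and
    ax.get_ylim(); r is the correlation coefficient.
    """
    x_min, x_max = xlim
    y_min, y_max = ylim
    # place the text just below the top of the axes
    y: float = y_max - margin * (y_max - y_min)
    if r >= 0:
        # line rises to the right, so the upper left corner is free
        x: float = x_min + margin * (x_max - x_min)
        ha: str = "left"
    else:
        # line falls to the right, so the upper right corner is free
        x = x_max - margin * (x_max - x_min)
        ha = "right"
    return x, y, ha


# hypothetical axis limits, as returned by ax.get_xlim() and ax.get_ylim()
x, y, ha = text_position((0.0, 10.0), (0.0, 100.0), r=-0.8)
print(x, y, ha)
```

The returned values can then be passed on to matplotlib, e.g., `ax.text(x, y, text, ha=ha, va="top")`.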



### Mapping the column names



To keep our code universal, and to avoid potential problems with the statsmodels library, it is best if we rename the columns used in the statistical model. So first extract the column headers for the independent and dependent variables, and assign their names to some variables (also to be used in the plotting step). Using these variables, we can replace the existing column headers in the data frame with our generic names (i.e., "X" and "Y"). We achieve this by passing a dictionary that defines the mapping to the `df.rename()` method. Note that `independent_var` and `dependent_var` are string variables that contain the names of the respective column headers.



In [1]:
# replace df column names
map_headers: dict[str, str] = {independent_var: "X",
                               dependent_var: "Y"}

df.rename(map_headers, axis='columns', inplace=True)
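
The following sketch shows how the header extraction, the renaming, the sorting, and the log transform could fit together. The small data frame and its column names are hypothetical stand-ins for the CSV data, and using `np.log10` (rather than the natural logarithm) is an assumption, since the text does not specify the base.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for the data frame read from the CSV file
df: pd.DataFrame = pd.DataFrame(
    {"Rate": [1.0, 10.0, 100.0], "Depth": [3.0, 2.0, 1.0]}
)
x_col: int = 1  # independent_var
y_col: int = 0  # dependent_var
log_transform: str = "Y"  # use as "", "X", "Y", or "XY"

# extract the column headers by index, they double as axis labels
independent_var: str = df.columns[x_col]
dependent_var: str = df.columns[y_col]
x_label: str = independent_var
y_label: str = dependent_var

# replace df column names with the generic names and sort by X
map_headers: dict[str, str] = {independent_var: "X", dependent_var: "Y"}
df = df.rename(map_headers, axis="columns")
df = df.sort_values("X")

# apply the requested log transform (base 10 is an assumption)
# and adjust the axis labels accordingly
if "X" in log_transform.upper():
    df["X"] = np.log10(df["X"])
    x_label = f"log({x_label})"
if "Y" in log_transform.upper():
    df["Y"] = np.log10(df["Y"])
    y_label = f"log({y_label})"

print(x_label, y_label)
print(df)
```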

### Rules



-   At this point, you have seen many examples of well-written code. Follow these examples in building your own code:
    -   Provide a header with an authorship and purpose statement
    -   When reading the CSV file, use the pathlib library to ensure that the file exists
    -   Use comments
    -   Use type hints
    -   If you want to, use functions, but for this exercise, it is not needed
    -   All figures use the ggplot style



## Testing your code



Run the notebook with the storks and sulfate reduction data, to make
sure it functions as intended. Note: I do not need to see these
tests. This is purely for your benefit.

Pay attention to the residuals plot. For well-behaved data, the residuals should be evenly distributed around zero. If they are not (e.g., in the stork data), or if they show a clear pattern, it is a strong indication that your analysis is not valid. You can test this with the sedimentation versus sulfate reduction data set.
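
To illustrate what well-behaved residuals look like, here is a minimal sketch that fits `Y~X` and plots the residuals below the regression. The data is synthetic (a straight line plus random noise) and is an assumption made for this example, not one of the course data sets. The confidence intervals and the annotation text are left out to keep the sketch short.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

plt.style.use("ggplot")

# synthetic, well-behaved data (not one of the course data sets)
rng = np.random.default_rng(42)
x: np.ndarray = np.linspace(0, 10, 30)
noise: np.ndarray = rng.normal(0, 0.5, x.size)
df: pd.DataFrame = pd.DataFrame({"X": x, "Y": 2.0 * x + 1.0 + noise})

# Pearson correlation coefficient
r: float = df["X"].corr(df["Y"])
print(f"Pearson correlation coefficient r = {r:.3f}")

# regression analysis for Y~X
results = smf.ols(formula="Y ~ X", data=df).fit()
residuals: pd.Series = results.resid

# one column, two rows: regression on top, residuals below
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(6, 8), dpi=120)
ax1.scatter(df["X"], df["Y"], color="C0")
ax1.plot(df["X"], results.fittedvalues, color="C1")
ax1.set_ylabel("Y")
ax2.scatter(df["X"], residuals, color="C0")
ax2.axhline(0, color="C1")  # residuals should scatter evenly around zero
ax2.set_xlabel("X")
ax2.set_ylabel("Residuals")
fig.tight_layout()
```

For this synthetic data, the points in the lower panel should scatter without any visible trend around the zero line. A curved or funnel-shaped pattern would hint at a problem with the linear model.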



## Let's put your code to use



As part of my research in the International Ocean Drilling Program
(IODP), I occasionally go on expeditions with the drill ship JOIDES Resolution (JR):

![img](Joides.jpg)

The ship takes about 28 scientists, 28 technicians, 28 drillers, and
about 28 seamen. Expeditions are typically two months in length. While
the scientists typically have cruise-related research objectives,
their duties during the cruise are taken up by routine measurements
describing the core materials. This includes geomagnetic,
sedimentological, chemical, biological, mineralogical, and structural
data.

I participated twice and was part of the team documenting the
inorganic chemistry of the water that is trapped between sediment
grains - the so-called interstitial water (IW).

The chemistry of the interstitial water is monitored for many ionic
species, among them ammonium (NH<sub>4</sub><sup>+</sup>). Ammonium is measured on a
photospectrometer, which measures how much light passes through a
sample container at a specified wavelength. This allows for exact concentration measurements but requires manual calibration. In
the first step, you have to mix standards. In the second step, you
measure those standards to create a calibration function. The
calibration function allows you to relate the light absorbance to the
NH<sub>4</sub><sup>+</sup> concentration. In other words, we build a linear model to predict
the NH<sub>4</sub><sup>+</sup> concentrations based on the absorbance values we get
from the photospectrometer.  The whole process is a bit of an art, but
after practicing it for three days, I got this:

![img](ammonium_311.png)

Below, you will find calibration data for my first expedition in 1998. I
was a total noob, I did not practice, and likely I was not careful
enough when I prepared my calibration standards. Alas, my lab journal lacks enough detail to understand why it was so bad.

The data for the above table is in the file `ammonia-1998.csv`.
Next, use your shiny new code to perform a regression analysis where you treat the absorbance data as
the independent variable and the NH<sub>4</sub><sup>+</sup> concentration as the dependent
variable. The regression analysis should be performed at the 99% confidence level, and there is no need for a log transform.

Describe in words the measurement (i.e., prediction) uncertainty for NH<sub>4</sub><sup>+</sup> in units of µmol/l for an absorbance reading of 600. Write a few words about how this compares to the uncertainty you get for `ammonia-2005.csv`.
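
One way to obtain the prediction uncertainty at a given absorbance value is the `get_prediction()` method of the statsmodels results object, sketched below. The data here is synthetic and only a stand-in for the calibration measurements, so the numbers it prints are meaningless for the assignment. The variable names other than `sig`, `X`, and `Y` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

sig: float = 0.01  # 1 - sig = 0.99, i.e., the 99% confidence level

# synthetic stand-in for the calibration data (NOT the real measurements)
rng = np.random.default_rng(1)
absorbance: np.ndarray = np.linspace(100, 1000, 10)
noise: np.ndarray = rng.normal(0, 10, absorbance.size)
df: pd.DataFrame = pd.DataFrame({"X": absorbance, "Y": 0.5 * absorbance + noise})

results = smf.ols(formula="Y ~ X", data=df).fit()

# predict Y for a single new X value
new_data: pd.DataFrame = pd.DataFrame({"X": [600.0]})
pred: pd.DataFrame = results.get_prediction(new_data).summary_frame(alpha=sig)

# obs_ci_* is the interval for a single new observation (prediction
# interval), mean_ci_* is the confidence interval of the model itself
y_hat: float = pred["mean"].iloc[0]
lower: float = pred["obs_ci_lower"].iloc[0]
upper: float = pred["obs_ci_upper"].iloc[0]
print(
    f"Predicted Y at X = 600: {y_hat:.1f}, "
    f"{100 * (1 - sig):.0f}% prediction interval: {lower:.1f} to {upper:.1f}"
)
```

Half the width of the prediction interval, `(upper - lower) / 2`, is one possible way to express the measurement uncertainty as a plus/minus value.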

Note the structure in the residuals for the 2005 data. This strongly
suggests that a linear model is not suitable for this data and that the
measurement error could be further reduced by a polynomial fit.



### Marking Scheme



There are no partial points. Total Points: 24 pts

-   Correct use of type hints: 1 pt
-   Correct use of comments and docstrings: 1 pt
-   Correct results for r<sup>2</sup>, p, and the regression equation: 3 pts
-   Correct graph labels: 1 pt
-   Display of the confidence interval for the regression model: 1 pt
-   Display of the confidence interval for the predictions of the
    model: 1 pt
-   Correct value for the measurement uncertainty: 6 pts
-   Code calculates useful values for the regression text coordinates: 2 pts
-   Code correctly replaces column headers: 2 pts
-   Code handles log transforms specified in the `log_transform` variable: 2 pts
-   Code adjusts axis labels depending on the `log_transform` variable: 2 pts
-   Code plots the file name as plot title: 1 pt
-   Correct graph dimensions: 1 pt
-   **Include your CSV data with your submission!** Otherwise, the TA will use his own data file, and you will lose the 6 points for the measurement uncertainty.



### Submission Instructions



Create a new (or copy an existing) notebook in your `submissions`
folder before editing it. Otherwise, your edits may be overwritten the
next time you log into syzygy. Please name your copy
"Assignment-Name-FirstName-LastName": 

-   Replace the `Assignment-Name` with the name of the assignment
    (i.e., the filename of the respective Jupyter Notebook)
-   Replace `FirstName-LastName` with your own name.

Note: If the notebook contains images, you must also copy the image files!

Your notebook/pdf must start with the following lines:

**Assignment Title**

**Date:**

**First Name:**

**Last Name:**

**Student ID:**

Before submitting your assignment:

-   Check the marking scheme and ensure you have covered all requirements.
-   Re-read the learning outcomes and verify that you are comfortable
    with each concept. If not, please speak up on the discussion board
    and ask for further clarification. I can guarantee that if you feel
    uncertain about a concept, at least half the class will be in the
    same boat. So don't be shy!

To submit your assignment, you need to download it in `ipynb` notebook
format **and** in `pdf` format. The best way to export your notebook as pdf
is to use your browser's print function (`Ctrl-P`) and then select
`Save as PDF`.  Please submit both files on Quercus. Note that the pdf
export can fail if your file contains invalid markup/python code. So
you need to check that the pdf export is complete and does not miss
any sections. If you have export problems, don't hesitate to contact
the course instructor directly.

Notebooks typically have empty code cells in which you must enter
python code. Please use the respective cell below each question, or
create a python cell where necessary. Add text cells to enter your
answers where appropriate. Your responses will only count if the code
executes without error. It is thus recommended to run your solutions
before submitting the assignment.

**Note: Unless specifically requested, do not type your answers by**
**hand. Instead, write code that produces the answer. Your pdf file**
**should show the code and the results of the code execution.**

