Capstone Assignment
===================

**Author:** Ulrich G. Wortmann



## Goal



Today's assignment has two parts:

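1.  Write a program (notebook) that can do regression analysis on arbitrary CSV files. You can re-use many parts of last week's assignment, but you will likely have to modify them. The user of your program will only specify the CSV file name, the figure file name, which columns to use, and the confidence level (see Part 1).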

2.  Use the new code to solve the calibration assignment below, i.e., you submit a notebook with your code and the results obtained with the calibration data below.



## Part 1 - Preparing the code



In the following, we want to create a notebook that can be used with any given dataset to produce a linear regression analysis. The point of the notebook is to make this analysis as comfortable as possible, so we will add a few bells and whistles. The notebook should have the following sequence:

1.  A cell where you specify the CSV file name, the figure file name, which columns to use, and the confidence level.

2.  Starting with the next cell, all code actions should only depend on the variables defined in the first cell. Continue with reading the data from the CSV file, and mapping the data frame columns to generic variables (e.g., `X` and `Y`, see below). Add a statement to sort the data frame in ascending order by "X". The sorting is needed because otherwise, some statistical tests may fail.

3.  A cell where you use the Shapiro-Wilk test to check whether the X and the Y data are normally distributed.
    -   If the test fails, your code will apply a log transform to the variable that needs it,
    -   Then it will apply the test again. 
        -   If the test fails again, produce a histogram plot and end your program with an error message telling the user that neither the data nor the log-transformed data follows a normal distribution.
        -   If the test succeeds without using a log transform, proceed to the next cell
        -   If the test succeeds after applying a log transform, add the log-transformed data to the dataframe, save the dataframe as a CSV file, print a warning stating that X and/or Y have been log-transformed, and then proceed to the next cell

4.  Print the Pearson correlation coefficient using a suitable print statement (i.e., don't just print a number!)
5.  Perform the regression analysis for `Y~X` and print the resulting parameters
6.  Extract all data needed to make a regression plot that shows the regression parameters, and the confidence intervals for the model and the predictions. In addition, also extract the residuals from the results object like this:
    
        residuals: pd.Series = results.resid

7.  Create a graph with one column and two rows. The first plot will be the usual regression plot, whereas the second should plot the residuals versus the x-axis data. The graph should be 6 by 8 inches at 120 dpi. The plot must use the CSV file column headers as axis labels (modify accordingly if you used a log transform!). Place your annotation in the upper left corner if you have a positive correlation, and in the lower left corner if your data shows a negative correlation. See this example:

![img](./sed.png)

8.  Note that you should derive the text position for the regression parameters automatically. You can retrieve the axis dimensions as `ax.get_xlim()` and `ax.get_ylim()`.  Consider whether your correlation coefficient is positive or negative when you compute the text position.
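One possible way to derive the text position from the axis limits is sketched below. The helper name `annotation_position` and the 5% padding are assumptions made for illustration; they are not part of the assignment.

```python
def annotation_position(
    xlim: tuple[float, float],
    ylim: tuple[float, float],
    r: float,
    pad: float = 0.05,
) -> tuple[float, float, str]:
    """Return x, y, and the vertical alignment for the regression text.

    xlim and ylim are the tuples returned by ax.get_xlim() and
    ax.get_ylim(); r is the correlation coefficient. The padding
    (a fraction of the axis span) is an arbitrary choice.
    """
    x_span: float = xlim[1] - xlim[0]
    y_span: float = ylim[1] - ylim[0]
    x: float = xlim[0] + pad * x_span  # always close to the left edge

    if r >= 0:
        # positive correlation: anchor the text in the upper left corner
        return x, ylim[1] - pad * y_span, "top"

    # negative correlation: anchor the text in the lower left corner
    return x, ylim[0] + pad * y_span, "bottom"


x, y, va = annotation_position((0.0, 10.0), (0.0, 100.0), r=0.9)
print(f"x = {x}, y = {y}, vertical alignment = {va}")
```

In a notebook, the result could be passed on as `ax.text(x, y, text, verticalalignment=va)`, so that the text grows downward from the top corner or upward from the bottom corner.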



### Column names



To keep our code universal, and to avoid potential problems with the statsmodels library, it is best to use standardized names rather than column headers like "Babies 10<sup>3</sup>/yr", which would fail with the statsmodels formula interface. It is thus best to copy your data into generic variables (i.e., `X` and `Y`). This way, there is no need to change the column names on the dataframe.
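A minimal sketch of this mapping is shown below. The column headers and the numbers are made up for illustration; in your notebook, the header names would come from the first cell and the dataframe from the CSV file.

```python
import pandas as pd

# Hypothetical column headers (normally defined in the first cell)
x_name: str = "Storks"
y_name: str = "Babies 10^3/yr"

# Made-up numbers standing in for the data read from the CSV file
df: pd.DataFrame = pd.DataFrame(
    {x_name: [3.0, 1.0, 2.0], y_name: [30.0, 10.0, 20.0]}
)

# Sort in ascending order by the x column
df = df.sort_values(by=x_name).reset_index(drop=True)

# Copy the columns into generic variables; the dataframe keeps its headers
X: pd.Series = df[x_name]
Y: pd.Series = df[y_name]

print(X.to_list(), Y.to_list())
```

Because the original headers stay on the dataframe, they remain available for the axis labels later on.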

### Rules

-   At this point, you have seen many examples of well-written code. Follow these examples in building your own code
-   Provide a header with an authorship and purpose statement
-   When reading the CSV file, use the pathlib library to ensure that the file exists
-   Use comments
-   Use type hints
-   You may use functions if you want to, but they are not required for this exercise
-   All figures use the ggplot style
-   Develop your code step by step!
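As an illustration of the pathlib rule, the check could look like the following sketch. The function name `require_file` is hypothetical, and wrapping the check in a function is only one option.

```python
from pathlib import Path


def require_file(name: str) -> Path:
    """Return the path to `name`, or raise an error if the file is missing."""
    # Relative names are resolved against the current working directory
    path: Path = Path.cwd() / name
    if not path.is_file():
        raise FileNotFoundError(f"Cannot find {path}")
    return path
```

The returned `Path` object can be passed directly to `pd.read_csv()`.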



## Testing your code



Run the notebook with the storks and sulfate reduction data, to make sure it functions as intended. Note: I do not need to see these tests. This is purely for your benefit.

Pay attention to the residual plot. For well-behaved data, the residuals should be evenly distributed. If they are not (e.g., in the stork data), or if they show a clear pattern, this strongly indicates that your analysis is not valid. Test your code with the following datasets:

-   `nd_positive.csv`: positive correlation
-   `nd_negative.csv`: negative correlation
-   `x_log.csv`: this data requires a log transform in x
-   `y_log.csv`: this data requires a log transform in y
-   `x_log_y_log.csv`: this data requires a log transform in x and y
-   `x_skewed.csv`: this data is highly correlated, but x is not normally distributed
-   `y_skewed.csv`: this data is highly correlated, but y is not normally distributed
-   `x_and_y_skewed.csv`: this data is highly correlated, but x and y are not normally distributed
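These datasets exercise the three possible outcomes of the normality check described in Part 1. The decision logic might be sketched as follows. The function name `check_normality` and the significance level of 0.05 are assumptions, the log transform assumes strictly positive values, and the histogram plot is left out.

```python
import numpy as np
from scipy import stats


def check_normality(
    values: np.ndarray, alpha: float = 0.05
) -> tuple[np.ndarray, bool]:
    """Return (data, was_log_transformed).

    Raises ValueError if neither the data nor the log-transformed
    data passes the Shapiro-Wilk test at the given alpha.
    """
    _, p_raw = stats.shapiro(values)
    if p_raw > alpha:
        # the test does not reject normality: use the data as is
        return values, False

    logged: np.ndarray = np.log(values)  # requires values > 0
    _, p_log = stats.shapiro(logged)
    if p_log > alpha:
        # the log-transformed data passes: the caller should warn the user
        return logged, True

    raise ValueError(
        "Neither the data nor the log-transformed data "
        "follows a normal distribution"
    )
```

Returning a flag rather than printing inside the function leaves it to the calling cell to print the warning, save the CSV file, and adjust the axis labels.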



## Part 2: Let's put your code to use



To quote ChatGPT:

> Measurement uncertainty refers to the doubt or range of possible values associated with a measurement result. It accounts for factors such as variation in measurement conditions, instruments, and human factors, and is typically expressed as an interval within which the true value is believed to lie with a certain level of confidence.

Moreover, with measured data, there are two things to consider:

-   The precision of the measurement, i.e., if you re-measure the same sample again and again, how much variation do you get? This number can be very small but does not imply that your measurement is correct (i.e., you can measure a wrong result with high precision).
-   The accuracy: This value reflects how closely your value matches the true value. Accuracy typically has a much larger uncertainty than precision.

In the following, we will use your code to evaluate the accuracy (or error range) of some real-life data. As part of my research in the International Ocean Drilling Program (IODP), I occasionally go on expeditions with the JOIDES Resolution (JR).

![img](Joides.jpg)

The ship takes about 24 scientists, 24 technicians, 23 drillers, and about 22 seamen. Expeditions are typically two months in length. While the scientists have cruise-related research objectives, their duties during the cruise are taken up by routine measurements describing the core materials. This includes geomagnetic, sedimentological, chemical, biological, mineralogical, and structural data.

I participated twice and was part of the team documenting the inorganic chemistry of the water that is trapped between sediment grains - the so-called interstitial water (IW).

The chemistry of the interstitial water is monitored for many ionic species, among them Ammonium (NH<sub>4</sub><sup>+</sup>). Ammonium is measured on a photo spectrometer, which measures how much light passes through a sample container at a specified wavelength. 

The photo spectrometer readings are very precise, but to get true concentration values, the instrument requires manual calibration. In the first step, you have to mix standards. In the second step, you measure those standards to create a calibration function. The calibration function allows you to relate the light absorbance to the NH<sub>4</sub><sup>+</sup> concentration in your standards. In other words, we build a linear model to predict the NH<sub>4</sub><sup>+</sup> concentrations based on the absorbance values we get from the photo spectrometer. The whole process is a bit of an art, but after practicing it for three days, I got this:

![img](ammonium_311.png)

Below, you will find calibration data for my first expedition in 1998. I was a total noob, I did not practice, and likely I was not careful enough when I prepared my calibration standards. Alas, my lab journal lacks enough detail to understand why it was so bad.

The data for the above table is in the file `ammonia-1998.csv`. Next, use your shiny new code to perform a regression analysis where you treat the absorbance data as the independent variable and the NH<sub>4</sub><sup>+</sup> concentration as the dependent variable. The regression analysis should be performed at a 99% confidence level. Since there are only a few data points, the Shapiro-Wilk test may fail, so you may need to add an override option to your code.

Describe in your own words the measurement uncertainty (i.e., the prediction interval) for the NH<sub>4</sub><sup>+</sup> concentration in units of &mu;mol/l for an absorbance reading of 600. Write a few words about how this compares to the uncertainty you get for `ammonia-2005.csv`.
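If your regression uses the statsmodels formula interface, the prediction interval at a single x value can be read from the results object. The sketch below uses made-up numbers, not the assignment data, and the variable names are chosen for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up calibration points for illustration only (not the assignment data)
df: pd.DataFrame = pd.DataFrame(
    {
        "X": [100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0],
        "Y": [12.0, 19.0, 33.0, 38.0, 52.0, 61.0, 68.0, 83.0],
    }
)

alpha: float = 0.01  # corresponds to a 99% confidence level
results = smf.ols("Y ~ X", data=df).fit()

# Ask the model for a prediction at X = 600
new_x: pd.DataFrame = pd.DataFrame({"X": [600.0]})
frame: pd.DataFrame = results.get_prediction(new_x).summary_frame(alpha=alpha)

predicted: float = frame["mean"].iloc[0]
lower: float = frame["obs_ci_lower"].iloc[0]  # prediction interval, lower bound
upper: float = frame["obs_ci_upper"].iloc[0]  # prediction interval, upper bound

print(
    f"Predicted Y at X = 600: {predicted:.1f} "
    f"(99% prediction interval: {lower:.1f} to {upper:.1f})"
)
```

The `obs_ci_*` columns describe the interval for a single new observation, whereas the `mean_ci_*` columns describe the narrower confidence interval of the regression line itself.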

Note the structure in the residuals for the 2005 data. This strongly suggests that a linear model is not suitable for this data and that the measurement error could be further reduced by a polynomial fit.
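To illustrate the idea (this is not required for the assignment), the sketch below compares a linear and a quadratic fit with `numpy.polyfit`. The data is made up and deliberately curved; it is not the 2005 calibration data.

```python
import numpy as np

# Made-up, slightly curved data for illustration only
x: np.ndarray = np.array([100.0, 200.0, 300.0, 400.0, 500.0, 600.0, 700.0, 800.0])
noise: np.ndarray = np.array([0.4, -0.3, 0.2, -0.5, 0.3, -0.2, 0.5, -0.4])
y: np.ndarray = 0.1 * x + 0.00005 * x**2 + noise


def residual_sum_of_squares(degree: int) -> float:
    """Fit a polynomial of the given degree and return the sum of squared residuals."""
    coefficients: np.ndarray = np.polyfit(x, y, deg=degree)
    residuals: np.ndarray = y - np.polyval(coefficients, x)
    return float(np.sum(residuals**2))


rss_linear: float = residual_sum_of_squares(1)
rss_quadratic: float = residual_sum_of_squares(2)
print(f"Linear fit RSS:    {rss_linear:.2f}")
print(f"Quadratic fit RSS: {rss_quadratic:.2f}")
```

Keep in mind that a higher-degree polynomial always fits at least as well as a lower-degree one, so the residual plot, not the residual sum alone, should guide the choice of model.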



### Marking Scheme



There are no partial points. Total Points: 24 pts

-   Correct use of type hints: 1 pt
-   Correct use of comments and docstrings: 1 pt
-   Correct results for r<sup>2</sup>, p, and the regression equation: 3 pts
-   Correct graph labels: 1 pt
-   Display of the confidence interval for the regression model: 1 pt
-   Display of the confidence interval for the predictions of the model: 1 pt
-   Correct value for the measurement uncertainty: 6 pts
-   Code calculates useful values for the regression text coordinates: 2 pts
-   Code correctly replaces column headers: 2 pts
-   Code handles log transforms correctly: 2 pts
-   Code adjusts the axis label(s) if a log transform was necessary: 2 pts
-   Code plots the file name as plot title: 1 pt
-   Correct graph dimensions: 1 pt



### Submission Instructions



Create a new (or copy an existing) notebook in your `submissions`
folder before editing it. Otherwise, your edits may be overwritten the
next time you log into syzygy. Please name your copy `assignment-name-firstname-lastname`

-   Replace the `assignment-name` with the name of the assignment
    (i.e., the filename of the respective Jupyter Notebook)
-   Replace `firstname-lastname` with your own name.

Note: If the notebook contains images, you must also copy the image files!

Your notebook/pdf must start with the following lines:

**ESS245: Assignment Title**

**Date:**

**First Name:**

**Last Name:**

**Student ID:**

Before submitting your assignment:

-   Check the marking scheme and ensure you have covered all requirements.
-   Re-read the learning outcomes and verify that you are comfortable
    with each concept. If not, please speak up on the discussion board
    and ask for further clarification. I can guarantee that if you feel
    uncertain about a concept, at least half the class will be in the
    same boat. So don't be shy!

To submit your assignment, you need to download it in `ipynb` notebook
format **and** `pdf` format. **To export your notebook as pdf,
use your browser's print function (`Ctrl-P`) and then select
`Save as pdf`.** In the past, this worked best with Chrome or Firefox.

Please submit **both files** on Quercus. Note that the pdf
export can fail if your file contains invalid markup/python code. So
you need to check that the pdf export is complete and does not miss
any sections. If you have export problems, don't hesitate to contact
the course instructor directly.

Notebooks typically have empty code cells in which you must enter
python code. Please use the respective cell below each question, or
create a python cell where necessary. Add text cells to enter your
answers where appropriate. Your responses will only count if the code
executes without error. It is thus recommended to run your solutions
before submitting the assignment.

**Note: Unless specifically requested, do not type your answers by**
**hand. Instead, write code that produces the answer. Your pdf file**
**should show the code and the results of the code execution.**

