## Assignment



Learning outcomes:

-   practice testing for normality
-   practice ordinary least squares model definitions
-   learn how to transform data when necessary
-   continue practicing plot-generation

Instructions:

1.  The file `sed_rate_vs_sufate_reduction_rate.csv` contains
    measurements of the activity of sulfate-reducing bacteria in
    marine sediments, and of the sedimentation rates at the
    respective sample locations (Claypool, 2004). The sulfate
    reduction rate is given as a rate constant in 1/Myrs, and the
    sedimentation rate is given in m/Myrs.  Create a scatter plot
    with the sedimentation rate on the X-axis and the sulfate
    reduction rate constant on the Y-axis.  Both axes must show
    labels with units. All units must be given in square brackets.
2.  Calculate the Pearson product-moment correlation coefficient, and
    state whether the data correlates weakly, correlates, or
    correlates strongly.
3.  Use a histogram to test whether both variables show a normal distribution.
4.  Data from natural processes often show a log-normal distribution
    (here we use the logarithm to the base 10). In other words,
    once we transform the data into logarithmic space, it will show a
    normal distribution. Assuming you have a data frame `df` with two
    columns called `A` and `B`, you can create the log-transform
    as a new column like this:



```python
import numpy as np

df['A_log10'] = np.log10(df['A'])
```

Note, this is just a template. You need to adapt it to your own variable names!
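
To see why the transform matters, here is a minimal sketch on made-up, log-normally distributed numbers (not the assignment data). The column names `A` and `B`, the seed, and the exponent are assumptions chosen for the demo. `Series.corr()` uses the Pearson coefficient by default, and `Series.skew()` is used only as a quick numerical companion to a histogram, not as a formal normality test.

```python
import numpy as np
import pandas as pd

# made-up demo data: log-normal values, NOT the assignment data
rng = np.random.default_rng(42)  # fixed seed so the demo is repeatable
a = 10 ** rng.normal(loc=1.0, scale=0.5, size=200)
b = a**1.5 * 10 ** rng.normal(loc=0.0, scale=0.2, size=200)
demo = pd.DataFrame({"A": a, "B": b})

# log-transform both columns into new columns
demo["A_log10"] = np.log10(demo["A"])
demo["B_log10"] = np.log10(demo["B"])

# Pearson correlation before and after the transform
r_raw = demo["A"].corr(demo["B"])
r_log = demo["A_log10"].corr(demo["B_log10"])
print(f"r (raw data)   = {r_raw:.2f}")
print(f"r (log10 data) = {r_log:.2f}")

# skewness close to zero hints at a symmetric distribution
print(f"skew A       = {demo['A'].skew():.2f}")
print(f"skew A_log10 = {demo['A_log10'].skew():.2f}")
```

A histogram of `A` would show a long tail to the right, whereas `A_log10` should look roughly bell-shaped.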

5.  Create a new plot using the log10 transformed data for
    sedimentation rate and sulfate reduction rate.
6.  Perform a regression analysis of this data-set where the
    sedimentation rate is the independent variable and the sulfate
    reduction rate constant is the dependent variable.
7.  Extract the regression parameters, and use the results to create
    a single multi line f-string which shows the regression
    parameters as seen in the previous chapter. Use the `text()`
    method of the plot handle to place this string in the upper left
    corner of the plot.
8.  Use the `predict` method of your model result object to generate a
    new set of predicted y-values, which you then plot as a regression
    line on top of your scatter plot.
9.  Discuss the statistical relevance of your analysis. E.g., is the
    data normally distributed? How much of the variance is explained by
    your model? What is the likelihood that your hypothesis is
    correct? Were there any warnings in the fitting step?
10. Using the code below, create a dataset with a known linear
    relationship, then add some noise to it, and then analyze the
    noisy data and compare it to the actual data. Plot the results of
    this experiment similar to #6. Plot the original data (i.e.,
    without the noise) as a blue line, the data with the noise as a
    scatter plot, and the results of your regression analysis as an
    orange line.



```python
# Here we use the numpy library. We will discuss its use in greater
# detail in the next module.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

nodp: int = 10  # number of data points
# create nodp values from 0 to nodp - 1 with a step size of 1
x0: np.ndarray = np.arange(0, nodp, 1)  # this is the independent variable
# calculate y-values based on x0
y0: np.ndarray = 12.1 + 0.23 * x0
# create some random noise which varies between -3 and 3
noise: np.ndarray = np.random.uniform(-3, 3, size=nodp)
# and add it to the dataset
yn: np.ndarray = y0 + noise

# create a dataframe from the above data
df: pd.DataFrame = pd.DataFrame({"X": x0, "Y": yn})

# create a regression model using the dataframe columns X and Y
model = smf.ols(formula="Y ~ X", data=df)
results = model.fit()
print(results.summary())
```
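
The mechanics of extracting the regression parameters, predicting y-values, and annotating a plot can be sketched as follows. This is only an illustration under stated assumptions: the data are a made-up line (y = 2 + 0.5 x with Gaussian noise), and the names `demo`, `fit`, and `label` are hypothetical. Adapt the idea to your own variables rather than copying it.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# made-up demo data (NOT the assignment data): y = 2 + 0.5 x plus noise
rng = np.random.default_rng(1)  # fixed seed so the demo is repeatable
x = np.linspace(0, 20, 40)
demo = pd.DataFrame({"X": x, "Y": 2.0 + 0.5 * x + rng.normal(0, 1, x.size)})

# fit the model and predict y-values for the observed x-values
fit = smf.ols(formula="Y ~ X", data=demo).fit()
demo["Y_pred"] = fit.predict(demo)

# multi-line f-string with the regression parameters
label = (
    f"y = {fit.params['Intercept']:.2f} + {fit.params['X']:.2f} x\n"
    f"$r^2$ = {fit.rsquared:.2f}\n"
    f"p = {fit.pvalues['X']:.3g}"
)

# object-oriented interface: scatter plot, regression line, annotation
fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(demo["X"], demo["Y"], label="noisy data")
ax.plot(demo["X"], demo["Y_pred"], color="orange", label="regression")
# transform=ax.transAxes places the text in axes coordinates (0 to 1),
# so (0.05, 0.95) is the upper left corner regardless of the data range
ax.text(0.05, 0.95, label, transform=ax.transAxes, verticalalignment="top")
ax.set_xlabel("X [arbitrary units]")
ax.set_ylabel("Y [arbitrary units]")
ax.legend(loc="lower right")
fig.tight_layout()
```

In a notebook, the figure is rendered when the cell finishes executing.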

Last but not least:

11. Create a new code cell, copy and paste the code from #10,
    and rerun it using 30 data points, i.e., modify the
    value of `nodp`. Show your results as a figure including the
    regression parameters.
12. Same as before, but this time use 100 data points.
13. How does the number of data points affect the statistical
    parameters r<sup>2</sup> and p? How well does the predicted
    regression line fit the original data (the line)? Summarize your
    findings in a text cell.



### Marking Scheme



Marking Scheme: 1 pt per question. It is either right or wrong. No
half points. (13 Points total)

All exercises:

-   use self-contained code for #10 to #12. It is OK to inherit results
    from previous code cells for #1 to #9
-   test whether the data file exists before reading it
-   use 5x4 inch plot size
-   use the ggplot plot-style
-   use the object-oriented interface to matplotlib
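
One possible way to meet these general requirements is sketched below. The helper name `read_if_exists` is hypothetical, and the axis labels are placeholders. Treat it as a starting point, not as the required solution.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd


def read_if_exists(filename: str) -> pd.DataFrame:
    """Hypothetical helper: read a csv file only if it exists."""
    path = Path(filename)
    if not path.exists():
        raise FileNotFoundError(f"Cannot find {path.resolve()}")
    return pd.read_csv(path)


# usage with the assignment file (commented out so that the sketch
# also runs where the file is not present):
# df = read_if_exists("sed_rate_vs_sufate_reduction_rate.csv")

plt.style.use("ggplot")  # ggplot plot-style
fig, ax = plt.subplots(figsize=(5, 4))  # 5x4 inch, object-oriented interface
ax.set_xlabel("Quantity [unit]")  # placeholder label, units in brackets
ax.set_ylabel("Quantity [unit]")
fig.tight_layout()
```

Raising `FileNotFoundError` with the resolved path makes it easy to see where Python was looking for the file.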



### Submission Instructions



Create a new notebook (or copy an existing one) in your `submissions`
folder before editing it. Otherwise, your edits may be overwritten the
next time you log into syzygy. Please name your copy
"Assignment-Name-FirstName-LastName":

-   Replace `Assignment-Name` with the name of the assignment
    (i.e., the filename of the respective Jupyter Notebook).
-   Replace `FirstName-LastName` with your own name.

Note: If the notebook contains images, you need to copy the image files as well!

Your notebook/pdf must start with the following lines 

**Assignment Title**

**Date:**

**First Name:**

**Last Name:**

**Student ID:**

Before submitting your assignment:

-   Check the marking scheme, and make sure you have covered all requirements.
-   Re-read the learning outcomes and verify that you are comfortable
    with each concept. If not, please speak up on the discussion board
    and ask for further clarification. I can guarantee that if you feel
    uncertain about a concept, at least half the class will be in the
    same boat. So don't be shy!

To submit your assignment, you need to download it in `ipynb` notebook
format **and** in `pdf` format. The best way to export your notebook as
pdf is to select `print`, and then `print to pdf`.  Please submit
both files on Quercus. Note that the pdf export can fail if your file
contains invalid markup/python code, so check that the pdf
export is complete and is not missing any sections. If you have export
problems, please contact the course instructor directly.

Notebooks typically have empty code cells in which you have to enter
python code. Please use the respective cell below each question, or
create a python cell where necessary. Add text cells to enter your
answers where appropriate. Your answers will only count if the code
executes without error. It is thus recommended to run your solutions
before submitting the assignment.

**Note: Unless specifically requested, do not type your answers by**
**hand. Instead, write code that produces the answer. Your pdf file**
**should show the code as well as the results of the code execution.**



### References



-   E. G. Claypool, Ventilation of marine sediments indicated by depth
    profiles of pore water sulfate and δ<sup>34</sup>S, Geochemical Investigations
    in Earth and Space Science: A Tribute to Isaac R. Kaplan, pp. 59-65,
    2004, [https://doi.org/10.1016/S1873-9881(04)80007-5](https://doi.org/10.1016/S1873-9881(04)80007-5)

