### BIOS470/570 Lecture 23 Data fitting

In [None]:
import numpy as np
from scipy import optimize
import matplotlib.pyplot as plt 
import pandas as pd
import seaborn as sns

import warnings
warnings.simplefilter(action="ignore",category=FutureWarning)

## In data fitting, we define a model and optimize its parameters so that it matches a set of data as closely as possible. Fitting is typically done using least-squares methods, and numpy and scipy have built-in functions for doing this. 
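
### To make "least squares" concrete, here is a minimal sketch on made-up data. The arrays `x_demo` and `y_demo` and the helper `sum_sq_residuals` are illustrative names, not part of the lecture. For a straight line, the fit minimizes the sum of squared differences between the data and $mx+b$; one way to solve this directly is `np.linalg.lstsq`:

```python
import numpy as np

# Hypothetical toy data (not from the lecture): points lying exactly on y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])

def sum_sq_residuals(m, b):
    """Sum of squared vertical distances between the line m*x + b and the data."""
    return np.sum((y_demo - (m * x_demo + b)) ** 2)

# Least squares picks the (m, b) that minimizes this sum.
# For a model that is linear in its parameters, the minimum can be found directly:
A = np.column_stack([x_demo, np.ones_like(x_demo)])   # columns multiply m and b
(m_best, b_best), *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print(m_best, b_best)                  # close to 2 and 1
print(sum_sq_residuals(m_best, b_best))
print(sum_sq_residuals(2.5, 1.0))      # any other line gives a larger sum
```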

### Let's begin by generating some random points between 0 and 1. 

In [None]:
x = np.random.random(100)

### Let's define y values for each x value according to $y=mx+b$+noise with m = 5 and b = -2. The noise should be centered on 0; otherwise it would bias the data in one direction. If r is a random number between 0 and 1, then $2r-1$ is a random number between -1 and 1, which the code scales by 0.5:

In [None]:
y = 5*x-2+0.5*(2*np.random.random(100)-1)
plt.plot(x,y,'.');
plt.xlabel('x')
plt.ylabel('y');
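
### As a quick check of the centering argument, this sketch compares the mean of r with the mean of $2r-1$ over many samples. The seeded generator `rng` is an assumption added here for reproducibility; the lecture code calls `np.random.random` directly:

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (an assumption, for reproducibility)
r = rng.random(100_000)          # uniform on [0, 1)
centered = 2 * r - 1             # uniform on [-1, 1)

print(r.mean())          # close to 0.5: adding r directly would shift y upward
print(centered.mean())   # close to 0: no systematic shift
```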

### The numpy function polyfit is used to fit a polynomial to data. Remember that in numpy a polynomial $ax^n+bx^{n-1}+...+c$ is represented as an array [a,b,...,c], so a polynomial of degree n is represented as an array of length n+1. The polyfit function takes the x and y values of the data and the degree of the polynomial to use, and returns the coefficients of the best-fitting polynomial. 

In [None]:
pfit = np.polyfit(x,y,1)
pfit
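
### To see the coefficient ordering without noise getting in the way, here is a small sketch that fits a noise-free line. The names `x_demo`, `y_demo`, and `coeffs` are illustrative. The slope comes first and the intercept second:

```python
import numpy as np

x_demo = np.linspace(0, 1, 20)
y_demo = 5 * x_demo - 2                  # noise-free line with m = 5 and b = -2

coeffs = np.polyfit(x_demo, y_demo, 1)   # degree 1 gives an array of length 2
m_fit, b_fit = coeffs                    # highest power first: [slope, intercept]
print(coeffs)                            # approximately [5, -2]
```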

### The polyval function evaluates a polynomial at specific points. That is, for an array ar, np.polyval(poly,ar) returns the values of the polynomial represented by poly evaluated at the points in ar. We can use this to compare the fit to the data:

In [None]:
fitvals = np.polyval(pfit,x)
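
### A small worked example of polyval, using polynomials chosen only for illustration:

```python
import numpy as np

line = [5, -2]                          # represents 5x - 2
points = np.array([0, 1, 2])
line_vals = np.polyval(line, points)    # 5*0-2, 5*1-2, 5*2-2 gives -2, 3, 8

quad = [1, 0, -1]                       # represents x^2 - 1 (the x term needs an explicit 0)
quad_val = np.polyval(quad, 3)          # 3^2 - 1 gives 8

print(line_vals, quad_val)
```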

In [None]:
plt.plot(x,y,'.')
plt.plot(x,fitvals,'-',color = 'k')
plt.legend(['Data','Fit'])
plt.xlabel('x')
plt.ylabel('y');

### What happens if we fit to a higher order polynomial?

In [None]:
pfit2 = np.polyfit(x,y,2)
pfit2

### Now let's look at some more complicated data, defined according to $y=7x^3+2x^2-5x+1$ plus noise. 

In [None]:
y2 = 7*x**3+2*x**2-5*x+1+(2*np.random.random(100)-1)*0.2
plt.plot(x,y2,'.');
plt.xlabel('x')
plt.ylabel('y');

### Let's see what fits look like for polynomials of degree 1 to 4:

In [None]:
## We will store the fit polynomials as well as the evaluation of these polynomials on the interval 0,1 to compare with data:
fits = []
vals = []

## Loop over degree, do the fit and store the results:
evalat = np.arange(0,1,0.01)
for ii in range(4):
    pfit = np.polyfit(x,y2,ii+1)
    fits.append(pfit)
    vals.append(np.polyval(pfit,evalat))

In [None]:
fits

In [None]:
plt.plot(x,y2,'.',label = 'data')
for ii in range(4):
    plt.plot(evalat,vals[ii],'-', label = 'fit, degree ' + str(ii+1))
plt.legend();
plt.xlabel('x')
plt.ylabel('y');
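
### One way to put numbers on the visual comparison is the residual sum of squares (RSS) for each degree. This is a sketch on freshly generated data with a seeded generator (an assumption, so the numbers will differ from the cells above). Because each polynomial contains the lower-degree ones as special cases, the RSS can only go down as the degree goes up, so the useful question is where it stops dropping substantially:

```python
import numpy as np

rng = np.random.default_rng(3)     # seeded generator (an assumption, for reproducibility)
x_demo = rng.random(100)
y_demo = 7*x_demo**3 + 2*x_demo**2 - 5*x_demo + 1 + 0.2*(2*rng.random(100) - 1)

rss = []                           # residual sum of squares for degrees 1..4
for degree in range(1, 5):
    coeffs = np.polyfit(x_demo, y_demo, degree)
    residuals = y_demo - np.polyval(coeffs, x_demo)
    rss.append(np.sum(residuals**2))
    print(degree, rss[-1])
```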

### The dangers of overfitting, an example:

### Define some completely random data:

In [None]:
xx = 10*np.random.random(20)
yy = 10*np.random.random(20)
plt.plot(xx,yy,'.');
plt.xlabel('x')
plt.ylabel('y');

### Fit the data to a very high order polynomial and compare the result to the data:

In [None]:
pfit = np.polyfit(xx,yy,20)
evalat = np.arange(xx.min(),xx.max(),0.001)
yfit = np.polyval(pfit,evalat)
plt.plot(xx,yy,'.',label = 'data')
plt.plot(evalat,yfit,'-',label = 'fit degree 20')
plt.legend();
plt.ylim([-20, 20]);
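
### A common way to detect overfitting is to hold some data back and check the fit against it. The sketch below is an illustration under assumptions: the train/test split, the helper `mse`, and the choice of degrees 1 and 9 are not from the lecture. On pure noise, the high-degree fit matches the points it was fit to more closely, but that says nothing about how it does on points it has not seen:

```python
import warnings
import numpy as np

rng = np.random.default_rng(1)     # seeded generator (an assumption, for reproducibility)
x_all = 10 * rng.random(40)
y_all = 10 * rng.random(40)        # pure noise: y has no relationship to x

x_train, y_train = x_all[:20], y_all[:20]   # points used for fitting
x_test, y_test = x_all[20:], y_all[20:]     # held-out points

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial coeffs on the points (xs, ys)."""
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

results = {}
for degree in (1, 9):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")     # high-degree fits may warn about conditioning
        coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(degree, results[degree])          # (training error, held-out error)
```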

### What if we want to fit other functions to data? scipy provides a general function for curve fitting which takes a function and data as input and returns the optimized parameters. First let's revisit our cubic example using this function. To fit data to a cubic, we need to define a function for the fit:

In [None]:
def cubic(x,a,b,c,d):
    return a*x**3+b*x**2+c*x+d

### Use the curve_fit function to actually do the fitting. It returns the best-fit parameters as well as the covariance matrix, which can be used to estimate the uncertainty in the parameters:

In [None]:
pfit, pvar = optimize.curve_fit(cubic,x,y2)

In [None]:
pfit

In [None]:
pvar.diagonal()
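
### The diagonal of the covariance matrix holds the variance of each parameter, so its square root gives a one-standard-deviation uncertainty. Here is a self-contained sketch on synthetic data. The names `x_demo`, `y_demo`, `popt`, `pcov`, and `perr` and the seeded generator are assumptions for illustration:

```python
import numpy as np
from scipy import optimize

def cubic(x, a, b, c, d):
    return a*x**3 + b*x**2 + c*x + d

rng = np.random.default_rng(2)     # seeded generator (an assumption, for reproducibility)
x_demo = rng.random(100)
y_demo = cubic(x_demo, 7, 2, -5, 1) + 0.2*(2*rng.random(100) - 1)

popt, pcov = optimize.curve_fit(cubic, x_demo, y_demo)
perr = np.sqrt(np.diag(pcov))      # standard deviation of each fitted parameter

for name, value, err in zip("abcd", popt, perr):
    print(f"{name} = {value:.2f} +/- {err:.2f}")
```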

### Compare the fit with data:

In [None]:
evalat = np.arange(0,1,0.01)
plt.plot(x,y2,'.',label = 'data')
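# polyval can be reused here only because cubic's parameters (a,b,c,d) are ordered highest power first, as polyval expects
plt.plot(evalat,np.polyval(pfit,evalat),'-',label = 'fit curve')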
plt.legend();
plt.xlabel('x')
plt.ylabel('y');
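
### Since a cubic is linear in its parameters, curve_fit and polyfit are solving the same least-squares problem and should land on essentially the same coefficients. A quick sketch to check this on synthetic data (the variable names and the seeded generator are assumptions for illustration):

```python
import numpy as np
from scipy import optimize

def cubic(x, a, b, c, d):
    return a*x**3 + b*x**2 + c*x + d

rng = np.random.default_rng(5)     # seeded generator (an assumption, for reproducibility)
x_demo = rng.random(100)
y_demo = cubic(x_demo, 7, 2, -5, 1) + 0.2*(2*rng.random(100) - 1)

coeffs_polyfit = np.polyfit(x_demo, y_demo, 3)
coeffs_curve_fit, _ = optimize.curve_fit(cubic, x_demo, y_demo)

print(coeffs_polyfit)
print(coeffs_curve_fit)   # expected to agree with polyfit up to small numerical differences
```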

### Finally, let's look at some realistic data. Here the concentration of a ligand was varied, and the expression of a target gene was measured:

In [None]:
dat = pd.read_csv('data/data.txt',header=None,names = ['concentration','expression'])

In [None]:
dat

In [None]:
sns.scatterplot(data=dat,x='concentration',y='expression');

### Let's try to fit this to a Michaelis-Menten-type function like the one we used to model gene expression last week. First we need to define the function:

In [None]:
def fitFunc(x,ku,kb,K):
    return (ku+kb*x)/(K+x)
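
### Before fitting, it helps to know what each parameter controls. This sketch evaluates the function at its two extremes, using parameter values chosen only for illustration: at zero ligand the function equals ku/K, and at very high ligand it approaches kb:

```python
import numpy as np

def fitFunc(x, ku, kb, K):
    return (ku + kb*x) / (K + x)

# Hypothetical parameter values, chosen only for illustration
ku, kb, K = 2.0, 10.0, 4.0

at_zero = fitFunc(0.0, ku, kb, K)    # ku / K = 0.5: expression with no ligand
at_large = fitFunc(1e9, ku, kb, K)   # approaches kb = 10 at saturating ligand
print(at_zero, at_large)

# Note: if K were negative, the denominator K + x would vanish at x = -K
# and the curve would blow up there.
```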

### Run the fitting:

In [None]:
pfit, pvar = optimize.curve_fit(fitFunc,dat["concentration"],dat["expression"])

### Plot the result. Notice the use of the unpacking operator (*), which makes it easy to evaluate the function with the best-fit parameters:

In [None]:
evalat= np.arange(0,100,0.1)
yeval = fitFunc(evalat,*pfit)
sns.scatterplot(data=dat,x='concentration',y='expression');
plt.plot(evalat,yeval,'-');
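
### A minimal illustration of what the * does: it expands a sequence into separate positional arguments. The tuple `params` below is a hypothetical stand-in for the fitted parameters:

```python
def fitFunc(x, ku, kb, K):
    return (ku + kb*x) / (K + x)

params = (2.0, 10.0, 4.0)    # hypothetical values standing in for pfit

explicit = fitFunc(1.0, params[0], params[1], params[2])
unpacked = fitFunc(1.0, *params)   # *params expands to params[0], params[1], params[2]

print(explicit, unpacked)    # both are (2 + 10*1) / (4 + 1) = 2.4
```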

### What happened?

### One solution: Put bounds on the allowed values of the parameters:

In [None]:
pfit, pvar = optimize.curve_fit(fitFunc,dat["concentration"],dat["expression"],bounds=(0,np.inf))

In [None]:
evalat= np.arange(0,100,0.1)
yeval = fitFunc(evalat,*pfit)
sns.scatterplot(data=dat,x='concentration',y='expression');
plt.plot(evalat,yeval,'-');
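
### The call above passes a single (lower, upper) pair that applies to every parameter. As a sketch of a variation, bounds can also be given as arrays with one entry per parameter. Since the lecture data file is not reproduced here, this example uses synthetic data with made-up "true" parameters (50, 20, 5); the limits in `lower` and `upper` are likewise assumptions for illustration:

```python
import numpy as np
from scipy import optimize

def fitFunc(x, ku, kb, K):
    return (ku + kb*x) / (K + x)

# Synthetic stand-in for the lecture data (hypothetical "true" parameters 50, 20, 5)
rng = np.random.default_rng(6)
conc = np.linspace(0, 100, 50)
expr = fitFunc(conc, 50, 20, 5) + rng.normal(0, 0.3, conc.size)

# One lower and one upper limit per parameter, in the order (ku, kb, K)
lower = [0, 0, 0.1]
upper = [np.inf, 100, np.inf]
popt, pcov = optimize.curve_fit(fitFunc, conc, expr, bounds=(lower, upper))
print(popt)
```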

### Another solution: Make an initial guess for the best fit parameters:

In [None]:
pfit, pvar = optimize.curve_fit(fitFunc,dat["concentration"],dat["expression"],p0 = (100,10,5))

In [None]:
pfit
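
### An initial guess does not have to be exact, and rough values can often be read off the data. This is one possible heuristic, sketched on synthetic data with made-up "true" parameters (50, 20, 5); it is not the method used to pick the values above. The plateau at high concentration approximates kb, the value at zero concentration approximates ku/K, and the concentration where the curve is halfway between the two is roughly K:

```python
import numpy as np
from scipy import optimize

def fitFunc(x, ku, kb, K):
    return (ku + kb*x) / (K + x)

# Synthetic stand-in for the lecture data (hypothetical "true" parameters 50, 20, 5)
rng = np.random.default_rng(7)
conc = np.linspace(0, 100, 50)
expr = fitFunc(conc, 50, 20, 5) + rng.normal(0, 0.3, conc.size)

# Rough guesses read off the data
kb0 = expr[-1]                               # plateau at high concentration, roughly kb
half = 0.5 * (expr[0] + expr[-1])            # level halfway between baseline and plateau
K0 = conc[np.argmin(np.abs(expr - half))]    # concentration where that level is reached, roughly K
ku0 = expr[0] * K0                           # at x = 0 the model equals ku / K

popt, pcov = optimize.curve_fit(fitFunc, conc, expr, p0=(ku0, kb0, K0))
print((ku0, kb0, K0))
print(popt)
```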

In [None]:
p