**Tutorial 7 - Photometric redshifts.**

In this tutorial we will build models to predict galaxy redshifts from their photometry in five bands.

We will get some practice with the scikit-learn package, which contains powerful tools for simple machine learning problems (https://scikit-learn.org/stable/user_guide.html).

1) Import the data from the file reduced_galaxy_data.fits. This file contains a small subset of data from the Sloan Digital Sky Survey (SDSS). The columns in the FITS table are: ID number, the measured spectroscopic redshift, the fluxes in five bands (ugriz), and the magnitudes in the same five bands. Note that each entry in the flux and magnitude columns is an array of 5 values.

Put the redshifts into an array called `redshift` and the magnitudes into a two-dimensional array `color_block`, where `color_block[0]` holds the 5 magnitudes for the first entry.

Make a scatter plot of redshift versus u band magnitude.

In [None]:
import numpy as np
from astropy.io import fits
import matplotlib.pyplot as plt
#import pandas as pa

filename = "reduced_galaxy_data.fits"
hdul = fits.open(filename)
print(hdul[1].columns)
data = hdul[1].data

redshift = np.array(data['z'])

....

plt.xlabel('redshift')
plt.ylabel('U magnitude')
plt.show()
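
An array-valued table column comes back as a two-dimensional array, one row per table entry. The sketch below illustrates this with a toy NumPy structured array rather than the real file. The field name `mag` is an assumption made for illustration; the actual names are whatever `print(hdul[1].columns)` reports.

```python
import numpy as np

# Toy structured array mimicking a table with an array-valued column.
# 'mag' is a hypothetical field name; check hdul[1].columns for the real ones.
toy = np.zeros(3, dtype=[('z', 'f8'), ('mag', 'f8', (5,))])
toy['z'] = [0.1, 0.2, 0.3]
toy['mag'] = np.arange(15).reshape(3, 5)

# Selecting the column gives shape (n_rows, 5): one row of 5 values per galaxy.
block = np.array(toy['mag'])
print(block.shape)
print(block[0])
```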


2) Subtract the U band magnitude from all the other bands so that we have 4 colors and one apparent magnitude.

In [None]:

for i in np.arange(1,5) :
    color_block[:,i] = color_block[:,i] - color_block[:,0]
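
Note that the loop above modifies `color_block` in place, so running the cell a second time subtracts the U band again. As a minimal sketch on made-up magnitudes (not the SDSS data), the same subtraction can be done on a copy in one broadcasting step:

```python
import numpy as np

# Hypothetical magnitudes for 3 galaxies in 5 bands (u, g, r, i, z); made-up values.
mags = np.array([[19.0, 18.0, 17.5, 17.2, 17.0],
                 [20.0, 19.5, 19.0, 18.8, 18.7],
                 [18.5, 17.0, 16.2, 15.9, 15.6]])

# Work on a copy so the original magnitudes survive re-running the cell.
colors = mags.copy()

# Subtract the u-band column from the other four columns in one step.
# colors[:, [0]] has shape (3, 1) and broadcasts across the 4 columns.
colors[:, 1:] -= colors[:, [0]]

print(colors)
```

Column 0 still holds the U band magnitude and columns 1 to 4 hold the four colors.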

3) From sklearn import `linear_model`. Make a `linear_model.LinearRegression(copy_X=True)` object and then fit a model that predicts redshifts from the colors and the U band magnitude (see https://scikit-learn.org/stable/modules/linear_model.html for more information). Print out the coefficients. Use the `model.score()` function to get the score for the model, which in this case is the R^2 statistic (coefficient of determination).

In [None]:
from sklearn import linear_model

linear_mod = ...
linear_mod.fit(...)

print('coefficients : ',...)
print("score = ",linear_mod.score(...))
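
To make the meaning of the score concrete, the sketch below fits a `LinearRegression` to synthetic data (random numbers, not the galaxy sample) and checks that `score()` agrees with R^2 computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The variable names and the synthetic relation are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 samples, 3 features, linear relation plus noise.
X = rng.normal(size=(200, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.3]) + rng.normal(scale=0.1, size=200)

model = linear_model.LinearRegression(copy_X=True)
model.fit(X, y)

# R^2 = 1 - (residual sum of squares) / (total sum of squares about the mean)
y_pred = model.predict(X)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("coefficients :", model.coef_)
print("intercept    :", model.intercept_)
print("score        :", model.score(X, y), " manual R^2 :", r2_manual)
```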


4) Use the model to predict the redshifts for all of the galaxies.  Make a scatter plot of the predicted vs observed redshifts. Decreasing the alpha parameter can make this clearer.

In [None]:

# predict redshifts
y = linear_mod ...

# plot prediction vs redshifts

plt.scatter(y,redshift,alpha=0.05)
plt.plot([0,5],[0,5],linestyle=':')
plt.ylabel("observed")
plt.xlabel("prediction")
plt.xlim(0,0.5)
plt.ylim(0,0.5)
plt.show()

5) Scatter plots can be deceiving. The density of points can be estimated with a Gaussian kernel, which is what the code below does. Put some labels on the plot and overlay a contour plot. (Extra point if you can make a nicer contour plot than this one.)

In [None]:
# gaussian_kde is the public name; the scipy.stats.kde submodule is deprecated
from scipy.stats import gaussian_kde

nbins = 40
#k = gaussian_kde(np.array([predictions,observations]))
k = gaussian_kde(np.array([ ... , ... ]))

xi, yi = np.mgrid[0:0.5:nbins*1j, 0:0.5:nbins*1j]
zi = k(np.vstack([xi.flatten(), yi.flatten()]))
 
plt.pcolormesh(xi,yi, zi.reshape(xi.shape), shading='gouraud')
plt.plot([0,0.5],[0,0.5],linestyle=':')
plt.colorbar()

plt.contour(xi, yi, zi.reshape(xi.shape) )

plt.show()
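
As a rough check of how the kernel density estimate is built, the sketch below runs the same steps on synthetic predicted and observed values (random numbers, not the tutorial data) and leaves out the plotting. `gaussian_kde` expects its input with shape (number of dimensions, number of points), which is why the two arrays are stacked.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Synthetic stand-ins for predicted and observed redshifts (not the SDSS data).
observed = rng.uniform(0.05, 0.45, size=500)
predicted = observed + rng.normal(scale=0.02, size=500)

# gaussian_kde expects an array of shape (n_dimensions, n_points).
k = gaussian_kde(np.vstack([predicted, observed]))

# Evaluate the density on a regular nbins x nbins grid.
nbins = 40
xi, yi = np.mgrid[0:0.5:nbins*1j, 0:0.5:nbins*1j]
zi = k(np.vstack([xi.ravel(), yi.ravel()])).reshape(xi.shape)

# The density should sum to roughly 1 over the grid (some mass lies outside it).
cell_area = (0.5 / (nbins - 1)) ** 2
print(zi.shape, zi.sum() * cell_area)
```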

6) We are going to need a quick way of assessing how well a model is working.

Make a function that takes the residuals (predictions - true values) and prints six things: 1) their median, 2) their mean, 3) the range that contains 90% of the cases, with 5% above and 5% below, 4) the same for 80%, 5) the standard deviation and 6) the mean absolute deviation.

Call this function `report()`.

In [None]:
def report(residuals) :
    print('median',...)
    print('mean',...)
    print('90% quantile range : ',... , ...)
    print('80% quantile range : ', ... , ...)
    print('standard deviation : ', ... )
    print('mean absolute deviation : ', ... )
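
One possible way to get the quantile ranges is `np.percentile`. The sketch below applies it to a toy array where the answers are easy to read off. It is an illustration on made-up numbers, and the choice to take the mean absolute deviation about the mean is an assumption; the mean of the absolute residuals is another common convention.

```python
import numpy as np

# Toy residuals: the numbers 0..100, so percentiles are easy to read off.
toy = np.arange(101, dtype=float)

# 90% central range: 5% of values below the lower bound, 5% above the upper.
lo90, hi90 = np.percentile(toy, [5, 95])
# 80% central range: 10% below, 10% above.
lo80, hi80 = np.percentile(toy, [10, 90])

# Mean absolute deviation, taken here about the mean (an assumed convention).
mad = np.mean(np.abs(toy - toy.mean()))

print(lo90, hi90, lo80, hi80, np.std(toy), mad)
```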

7) We need a quick way of calculating the residuals, but we should not calculate them on the same data that was used to fit the model. Let's make a function that splits the data in two, fits on one subset and calculates the residuals on the remaining subset.

The function should take the independent variables `X`, the dependent variable `Y`, the model `model` and the fraction of the data set that will be used for fitting the model. The `model` has the functions `fit()` and `predict()`, as the sklearn models do. The function should split the data set into two random subsets: fit and test. It should then fit the model on the fit subset and return the residuals for the test set together with the true Y values of the test set.

In [None]:
def test_residuals(X,Y,model,fit_fraction) :
   
    # make an index
    ...
    # shuffle the index
    np.random.shuffle(index)
    
    #split the index into two parts 
    
    split = int(fit_fraction*N)
    index_fit = index[:split]
    index_test = ...

    # define X_fit,Y_fit
    X_fit = ...
    Y_fit = ...

    # define X_test, Y_test
 
    X_test = ...
    Y_test = ...

    ## fit model
   
    ## predict redshifts
   
    return ... , ...
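
For comparison, scikit-learn ships its own helper for this kind of random split, `sklearn.model_selection.train_test_split`. The sketch below uses it on toy arrays; it is shown as an alternative to the hand-written index shuffle that the exercise asks for, not as a replacement for it. The toy data and variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features; Y is the row number for easy checking.
X = np.arange(20).reshape(10, 2)
Y = np.arange(10)

# train_size plays the role of fit_fraction; random_state makes the split repeatable.
X_fit, X_test, Y_fit, Y_test = train_test_split(
    X, Y, train_size=0.8, random_state=0
)

# Rows and targets stay aligned: the first feature of each row is 2 * Y.
print(len(Y_fit), len(Y_test))
print(X_test[:, 0], Y_test)
```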

8) Use your function `test_residuals()` to get a set of residuals with `linear_mod` and our data.  Use 80% of the data for fitting and 20% for testing. Run `report()` on the resulting residuals to see how well this model predicts the redshifts.

9) Make a nice histogram of the residuals.

In [None]:
residuals,z_test = test_residuals(color_block, ... ,... ,...)

print("residuals")
report(residuals)

10) What is important is that the fractional error in the redshift is small, i.e. the residuals divided by the true redshifts (`residuals / z_test`). Find the fractional residuals, run them through `report()` and make a histogram of them.

In [None]:
frac_residuals = ...

print("fractional residuals")
report(frac_residuals)

plt.hist(...)
plt.xlabel(...)
plt.show()

11) Would you consider this a successful model in terms of the fractional residuals?

12) So far, the model has been linear in the parameters AND the colors/magnitude.  It might improve the model if we include terms in the model that are second order in the colors while keeping it linear in the parameters.

`sklearn` provides a convenient function that will transform our matrix of colors into a larger matrix that includes higher order terms.  In particular, `PolynomialFeatures(2).fit_transform(color_block)` will include all the colors squared and all the products of the colors.

Use this function to transform our `color_block` into another one with second order terms.

How many more parameters will be in a linear model fit with this new data matrix?

In [None]:
from sklearn.preprocessing import ...

color_block2 = ...

print(color_block2.shape)
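
To see what the transformation produces, the sketch below applies `PolynomialFeatures(2)` to a single toy sample with only two features, a = 2 and b = 3 (made-up values, fewer features than `color_block`). The output contains a constant term, the original features, their squares and their product.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy sample with two features, a = 2 and b = 3.
toy = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(2)
expanded = poly.fit_transform(toy)

# Column order for degree 2: 1, a, b, a^2, a*b, b^2
print(expanded)        # [[1. 2. 3. 4. 6. 9.]]
print(expanded.shape)  # (1, 6)
```

The leading column of ones is a constant term. `LinearRegression` already fits an intercept by default, so that column is redundant in the fit; passing `include_bias=False` to `PolynomialFeatures` leaves it out.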

13) Create and fit a new linear model using this new data matrix.

What is the score for this model?

In [None]:
linear_mod2 = ...
linear_mod2 ...

print("score = ",...)

14) Use your functions `test_residuals()` and `report()` with this new model as before.  Make a histogram of the residuals as before.

In [None]:
residuals,z_test = ...

print("residuals")
report(residuals)

plt.hist(residuals,bins=50)
plt.show()

15) As before, do the same for the fractional residuals.

16) Make a new contour plot for predictions based on `color_block2`.

17) Use `sklearn.model_selection.cross_val_score` to find an estimate of the mean absolute error using k-fold cross-validation (`scoring="neg_mean_absolute_error"`, `cv=5`).

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(..., ..., ..., cv=...,scoring="neg_mean_absolute_error")
print(-scores)
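
The scores come back negative because scikit-learn scorers follow a "higher is better" convention, which is why the cell above prints `-scores`. A minimal sketch on synthetic data (random numbers, not the galaxy sample; all names here are assumptions) shows the call and the sign flip:

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

# Synthetic stand-in data, not the SDSS sample.
X = rng.normal(size=(300, 4))
y = X @ np.array([0.2, -0.1, 0.05, 0.3]) + rng.normal(scale=0.05, size=300)

model = linear_model.LinearRegression()

# One negated mean absolute error per fold; flip the sign to get the error.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
mae_per_fold = -scores

print(mae_per_fold, mae_per_fold.mean())
```

With an integer `cv` and a regression model, the folds are taken as consecutive blocks of rows without shuffling. If the rows of the table are sorted in some way, it may be worth passing `cv=KFold(n_splits=5, shuffle=True)` from `sklearn.model_selection` instead.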


18) Is this model an improvement on the first one?