# Assignment 2

## Machine Learning Techniques

This assignment is split into three main sections, roughly corresponding to the contents of each of the three weeks in the Machine Learning module, preceded by a short data preprocessing section (Section 0).

All assignments are presented as Jupyter notebooks. You will fork the repository to have your own access to all files. You can edit this notebook directly with your answers and push your changes to GitHub. 

### **The goal of this assignment is to use different ML techniques to explore your data, find patterns in it, and eventually build a model that will allow us to predict stellar mass & redshift of galaxies *without doing SED fitting*.**

# Section 0: Data Preprocessing

Before we delve into machine learning, it's a good idea to look at our data, pick the sample we want to work with, and prepare it for the later sections.

The code below loads in the input data catalog:

In [4]:
from astropy.table import Table
from astropy.io import fits
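with fits.open('../data/sw_input.fits') as f:
    # HDU 1 holds the catalog table; the `with` block closes the file on exit,
    # so no explicit f.close() is needed
    df = Table(f[1].data).to_pandas()
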
# Display the top 3 rows of the dataframe
df.head(3)

| Unnamed: 0 | id | ra | dec | redshift | PLATE | MJD | FIBERID | designation | flux0_u | flux0_u_e | ... | flux_w2_e | flux_w3 | flux_w3_e | flux_w4 | flux_w4_e | extin_u | extin_g | extin_r | extin_i | extin_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 337.45031 | 1.266134 | 0.088372 | 376 | 52143 | 404 | J222948.07+011558.1 | 3.1e-05 | 3e-06 | ... | 4.9e-05 | 4.172e-07 | 0.000209 | 2e-06 | 0.001187 | 0.341327 | 0.26596 | 0.18399 | 0.136724 | 0.101698 |
| 1 | 5 | 338.115522 | 1.270146 | 0.1638 | 376 | 52143 | 567 | J223227.69+011612.6 | 1.1e-05 | 4e-06 | ... | 0.000111 | 9.851e-07 | 0.000493 | 4e-06 | 0.001883 | 0.368063 | 0.286793 | 0.198402 | 0.147434 | 0.109664 |
| 2 | 8 | 341.101481 | 1.266255 | 0.143369 | 378 | 52146 | 404 | J224424.38+011558.3 | 1.7e-05 | 3e-06 | ... | 3.9e-05 | 1.0137e-06 | 0.000507 | 8e-06 | 0.003856 | 0.33763 | 0.263079 | 0.181997 | 0.135243 | 0.100596 |


#### Question 1

Look at all the column names. Choose which columns have *meaningful* data, i.e. have data we want to use in our machine learning to predict stellar mass and/or redshift. Why did you choose these ones?

In [None]:
# Placeholder

#### Question 2

Choose a reasonably sized subset of your data (roughly $10^3$ to $10^4$ galaxies).<br>
Make sure to save your subset, or at least the IDs you chose, for later - you will need them!
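
One possible way to draw a reproducible subset is sketched below. It runs on a small synthetic table standing in for the catalog: the column names `id` and `redshift` follow the table shown earlier, while the sample size, the random seed, the file name in the comment, and all variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the input catalog (hypothetical values).
rng = np.random.default_rng(0)
catalog = pd.DataFrame({
    "id": np.arange(50_000),
    "redshift": rng.uniform(0.0, 0.3, size=50_000),
})

# Draw a random subset; a fixed random_state makes the draw repeatable.
n_subset = 5_000
subset = catalog.sample(n=n_subset, random_state=42)

# Keep the chosen IDs so the same galaxies can be recovered later.
subset_ids = subset["id"].to_numpy()
# np.save("subset_ids.npy", subset_ids)  # hypothetical file name

# Re-select the same rows from the catalog using the saved IDs.
recovered = catalog[catalog["id"].isin(subset_ids)]
```

Saving only the IDs is usually enough, since the rows can always be re-selected from the full catalog with `isin`.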

In [None]:
# Placeholder

#### Question 3

It is often useful (and sometimes required) to *normalize* your data, i.e. for each parameter, subtract the mean of that parameter from each point in the sample, and divide by the standard deviation. For example, for mass, for each galaxy $i$, you can calculate

$$ M_{norm, i} = \frac{M_i - \langle M \rangle}{\sigma_M}$$

There are other ways to pre-process data (e.g., normalize by quantiles, or redefine variables such that they match some distribution).

* For each column you chose, normalize the data in your sample in some way.
* For one parameter, make a histogram of the original data, and the data after normalization. Do the histograms look as you expected them to?
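
The mean and standard deviation normalization described above can be sketched as follows. The table here is synthetic, and the column names `flux_a` and `flux_b` are hypothetical placeholders for whichever columns you chose; this is one possible approach, not the required one.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the chosen columns (hypothetical names and values).
rng = np.random.default_rng(1)
sample = pd.DataFrame({
    "flux_a": rng.lognormal(mean=-10.0, sigma=1.0, size=2_000),
    "flux_b": rng.lognormal(mean=-9.0, sigma=0.5, size=2_000),
})

# For each column: subtract the column mean, divide by the column standard deviation.
normalized = (sample - sample.mean()) / sample.std(ddof=0)

# Histogram one parameter before and after normalization.
counts_raw, edges_raw = np.histogram(sample["flux_a"], bins=30)
counts_norm, edges_norm = np.histogram(normalized["flux_a"], bins=30)
```

The same columns can be passed to a plotting routine such as `matplotlib.pyplot.hist` to compare the two histograms side by side.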

In [None]:
# Placeholder

# Section 1: Data Compression

#### Question 1

What is a dimensionality reduction technique? Why would you use one?

#### Question 2

There are many different data compression techniques: PCA, UMAP, t-SNE, VAE... Pick two of these methods, and explain briefly:
* How does each of them work?
* What are the advantages and disadvantages of each method?
* When would you use one over the other?

#### Question 3

Pick one of the dimensionality reduction methods.

> Most already have easy-to-use implementations, so you don't have to code them from scratch: [PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), [UMAP](https://umap-learn.readthedocs.io/en/latest/index.html), [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html). Training something like a Variational Autoencoder is a more involved task and may require access to a GPU.

Do the following:    
1. Reduce your data to 2 dimensions using your chosen algorithm
2. Save your output
    > Remember to keep the IDs with the principal components, so that you can easily see which galaxy those values are for later
3. Plot the two principal variables against each other and describe what you see
    * Are there any obvious patterns in your data?
    * Are there any clusters?
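
As a minimal sketch of steps 1 and 2, assuming PCA is the chosen method, the snippet below reduces a synthetic feature matrix to two dimensions and stores the result together with the galaxy IDs. The feature matrix, the column names `pc1` and `pc2`, and the file name in the comment are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for the normalized input columns (hypothetical values).
rng = np.random.default_rng(2)
n_gal = 1_000
ids = np.arange(n_gal)
features = rng.normal(size=(n_gal, 8))

# Reduce the data to 2 dimensions.
pca = PCA(n_components=2)
components = pca.fit_transform(features)

# Keep the IDs next to the two components so each row can be traced to its galaxy.
compressed = pd.DataFrame(components, columns=["pc1", "pc2"])
compressed.insert(0, "id", ids)
# compressed.to_csv("compressed.csv", index=False)  # hypothetical file name

# Fraction of the total variance captured by each of the two components.
print(pca.explained_variance_ratio_)
```

For UMAP or t-SNE the overall pattern is similar (fit, transform, attach the IDs), although those methods have no explained-variance attribute.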

In [None]:
# placeholder

#### Question 4

Now, load in the `sw_output.fits` table and cross-match the two tables to get stellar masses, redshifts, dust opacities, etc.

* Color the points on your plot above by a physical property and discuss if you see any patterns.
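
A cross-match of this kind can be sketched with a pandas merge. The example assumes that both tables share an `id` column, which is not confirmed here for `sw_output.fits`; the column names `pc1`, `pc2`, and `log_mass` and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the compressed input table and the output catalog.
inputs = pd.DataFrame({
    "id": [3, 5, 8, 13],
    "pc1": [0.1, -0.4, 1.2, 0.7],
    "pc2": [0.3, 0.9, -0.5, 0.0],
})
outputs = pd.DataFrame({
    "id": [8, 3, 5, 21],
    "log_mass": [10.9, 10.2, 11.1, 9.8],
    "redshift": [0.143, 0.088, 0.164, 0.050],
})

# An inner join keeps only galaxies present in both tables;
# validate= raises an error if an id is duplicated on either side.
matched = inputs.merge(outputs, on="id", how="inner", validate="one_to_one")

# The matched table can then drive a colored scatter plot, for example:
# plt.scatter(matched["pc1"], matched["pc2"], c=matched["log_mass"])
```

Checking the number of rows before and after the merge is a quick way to see how many galaxies failed to match.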

In [None]:
# placeholder

#### Question 5 [bonus]

If you are feeling brave, you can try several encoding tools - how do your results change? What if you keep more than 2 principal variables? Does changing *hyperparameters* of your algorithm change your results quantitatively / qualitatively?


# Section 2: Unsupervised ML

In this section, you will implement a clustering algorithm to see if there are any *natural* clusters in your data. You can choose any algorithm from the ones shown [on the Scikit-Learn website](https://scikit-learn.org/stable/modules/clustering.html). The best algorithm depends on your data, so refer back to the plots you made in Section 1 to see which algorithm you think will work best.

Load in the subset you chose in the previous section. 

In [3]:
# Space for code

#### Question 1

Choose a clustering algorithm. Why did you go for this particular one?

#### Question 2

Run clustering on your **compressed data**. Think about the following questions where they are relevant to your algorithm; its *hyperparameters* often correspond to the answers.

* How many clusters should you fit to your data?
* Where should the initial guesses for the cluster centers be?
* What should be the typical size for each cluster?
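
To show how these questions map onto hyperparameters, here is a minimal sketch that assumes k-means as the chosen algorithm and uses two synthetic, well-separated blobs in place of the compressed data. In k-means, `n_clusters` sets the number of clusters and `init` controls the initial guesses for the centers; other algorithms expose different knobs (DBSCAN, for example, takes a neighborhood radius `eps` instead of a cluster count).

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic 2D blobs standing in for the compressed data (hypothetical values).
rng = np.random.default_rng(3)
blob_a = rng.normal(loc=[-3.0, 0.0], scale=0.5, size=(300, 2))
blob_b = rng.normal(loc=[3.0, 0.0], scale=0.5, size=(300, 2))
compressed = np.vstack([blob_a, blob_b])

# n_clusters: how many clusters to fit.
# init / n_init: how the initial centers are chosen and how many restarts to try.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(compressed)

# Fitted cluster centers, one row per cluster.
centers = kmeans.cluster_centers_
```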


In [None]:
# Placeholder

#### Question 3

Plot your compressed data, coloring the points by the cluster they belong to.
Optionally, you can also overplot the boundaries of your clusters (see the example [here](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#visualize-the-results-on-pca-reduced-data)).

* Do the clusters make sense?
* Do you think you chose the right number of clusters?

You can use this visualization to tune the hyperparameters of your clustering algorithm to perhaps get a cleaner result.

In [None]:
# Placeholder

#### Question 4

Look at the distribution of *physical properties* (mass, redshift, dust...) from the output catalog for each one of your clusters. Are there any statistically significant differences between the clusters in any of these properties?
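
One way to test whether a property differs between two clusters is a two-sample Kolmogorov-Smirnov test. The sketch below applies it to synthetic mass values for two hypothetical clusters; the choice of test and all numbers are assumptions, and other tests may suit your data better.

```python
import numpy as np
from scipy.stats import ks_2samp

# Synthetic log-mass values for two hypothetical clusters.
rng = np.random.default_rng(4)
mass_cluster0 = rng.normal(loc=10.0, scale=0.3, size=500)
mass_cluster1 = rng.normal(loc=10.5, scale=0.3, size=500)

# Null hypothesis: both samples are drawn from the same distribution.
result = ks_2samp(mass_cluster0, mass_cluster1)
statistic, p_value = result.statistic, result.pvalue

# A small p-value suggests the two distributions differ.
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.3g}")
```

When comparing many properties across many cluster pairs, keep in mind that some small p-values are expected by chance alone.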

In [None]:
# Placeholder

#### Question 5

Repeat **questions 2 and 4** but using your full dataset instead of the compressed one. 
* Do you see any differences in your results?
* Did you need to choose different hyperparameters?


# Section 3: Supervised ML

Finally, we can use *supervised ML* to train an algorithm that predicts our relevant physical parameters (mass, redshift, etc.) from the input data directly. This is a regression task, since we want to predict a continuous variable.

You should aim to **predict mass and redshift** of galaxies, but you can also try to predict more properties available in the output catalog.

#### Question 1

Choose a regression tool (e.g., a linear model, SVM, neural network, Gaussian process, decision tree... see [more examples here](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)).

Why did you choose this particular one? 

* You can either train N different models for mass, redshift, etc.; or you can train a single model that predicts N parameters simultaneously.
* Think: how many variables do you want to predict, what does your data look like, what relationship do you see by eye between your input and output data...



#### Question 2

Split your data into two sub-samples: one for training and one for validation.

Use the tool you chose to predict the physical parameters from the input *training* data.

* Optimize the parameters of the model
* What is the correlation between the predicted and true values of mass, redshift, and any other parameters? How large are the errors in the predictions?
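
The split-and-train workflow can be sketched as below, assuming a random forest as the regression tool and a single model that predicts two targets at once. The features and targets are synthetic, the 80/20 split and `n_estimators=50` are arbitrary choices, and the correlation shown compares predicted with true values, which is one reading of the question above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic features and two synthetic targets (stand-ins for mass and redshift).
rng = np.random.default_rng(5)
X = rng.normal(size=(1_000, 5))
y = np.column_stack([
    10.0 + 0.5 * X[:, 0] + 0.1 * rng.normal(size=1_000),
    0.1 + 0.02 * X[:, 1] + 0.005 * rng.normal(size=1_000),
])

# Hold out 20% of the galaxies for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# One model predicting both targets simultaneously.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Correlation between true and predicted values of the first target (training set).
pred_train = model.predict(X_train)
corr_mass = np.corrcoef(y_train[:, 0], pred_train[:, 0])[0, 1]
```

Training-set scores tend to be optimistic, which is why the validation sample is kept aside for the next question.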


In [None]:
# Placeholder

#### Question 3

Use your model to predict the physical parameters of your validation sample.

* What is the accuracy? Is the model performing as well as you expected?
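
For a regression task, "accuracy" is usually quantified with an error metric. The sketch below computes the root-mean-square error and the coefficient of determination $R^2$ on a handful of made-up true and predicted values; which metric is most appropriate for your model is left open.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values for a validation sample.
y_true = np.array([10.0, 10.5, 11.0, 9.5])
y_pred = np.array([10.1, 10.4, 11.2, 9.5])

# Root-mean-square error: typical size of the prediction error, in the units of y.
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))

# R^2: fraction of the variance in y_true explained by the predictions.
r2 = r2_score(y_true, y_pred)

print(f"RMSE = {rmse:.4f}, R^2 = {r2:.4f}")
```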

In [None]:
# Placeholder

#### Question 4

Load in `test_sample.csv` and use your model to predict physical parameters.

* What is the accuracy? Is the model performing as well as you expected?
* Why do you think the performance is what it is?

In [None]:
# Placeholder

#### Question 5 [bonus]

Repeat questions 1-4 for the *compressed* dataset. Do you see any differences in performance?

In [None]:
# Placeholder

#### Question 6 [bonus]

Think back to the results you got from clustering. Does the model work equally well for galaxies in the different clusters?

In [5]:
# Placeholder