## **Name**:
## **Status**: (e.g., UG3, G1, etc.)
## **Department**:

# ***CBE 512. Machine Learning in Chemical Science and Engineering.***
## **Assignment 02:** *Preliminaries, Regression+Classification*
### &#169; Princeton University
### **DUE**: 11:59pm, October 02, 2024
### **Notes**: Supply responses/solutions in a packaged format, including notebook (.ipynb) and any supporting files required to run the notebook. Your assignment should be submitted to Canvas.



---
## **Problem 1, Concept check (25 points):**

**(a)** *In about a paragraph*, describe the fundamental distinction between supervised vs. unsupervised learning and classification vs. regression. For each case, try to identify/explain a setting (in physical science/engineering) for which these modes of machine learning may be appropriate.

---


***Student Response:***

**(b)** In class, we have often noted when a problem might be solved exactly using techniques from linear algebra (using matrix inversion and minimizing projection error). Remark on the scope/applicability of this approach (i.e., for what class of models and errors is it useful). Use this explanation to expand on the utility of machine learning. *Hint*: your response should touch on model complexity and the nature of loss functions.
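
For reference, the linear-algebra route mentioned above can be sketched as follows. This is a minimal illustration on synthetic data, not a template for your response; the arrays `X`, `y`, and `true_w` are made-up placeholders, and the model is assumed to be linear in its parameters with a squared-error loss.

```python
import numpy as np

# Synthetic data (placeholder): 50 samples, 3 features, known linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

# Prepend a column of ones so the first coefficient acts as the intercept
A = np.column_stack([np.ones(len(X)), X])

# Solve the normal equations (A^T A) w = A^T y for the least-squares coefficients
w_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Same problem solved with NumPy's least-squares routine, for comparison
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(w_normal)
print(w_lstsq)
```

Both routes should agree to numerical precision here; the closed form relies on the loss being quadratic in the parameters.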

---

***Student Response:***

**(c)** Explain the utility of feature scaling (in general). Pick one feature-scaling method available in scikit-learn. Describe its approach/implementation and discuss any advantages/limitations.
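
As one possible illustration (other scalers in scikit-learn are equally valid choices), the sketch below applies `StandardScaler` to synthetic features with very different magnitudes. The data are placeholders, and fitting the scaler on the training portion only is a common convention assumed here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (placeholder values)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[300.0, 2.0], scale=[80.0, 0.5], size=(100, 2))
X_test = rng.normal(loc=[300.0, 2.0], scale=[80.0, 0.5], size=(25, 2))

# Fit the scaler on the training data only, then apply the same transform to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```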

---

***Student Response:***

**(d)** In class, we discussed some of the dangers of simple gradient descent as a method for parameter optimization. Explain how methods like "gradient descent with momentum" or "stochastic" gradient descent might be superior by contrast.
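
To make the comparison concrete, here is a minimal sketch of plain gradient descent versus gradient descent with momentum on an ill-conditioned quadratic. The objective, the learning rate `lr`, and the momentum coefficient `beta` are illustrative assumptions, not values from class.

```python
import numpy as np

def grad(w):
    # Gradient of the ill-conditioned quadratic f(w) = 0.5 * (w0^2 + 25 * w1^2)
    return np.array([w[0], 25.0 * w[1]])

def descend(w0, lr=0.03, beta=0.0, n_steps=100):
    """Gradient descent with optional (heavy-ball) momentum; beta=0 gives plain descent."""
    w = np.array(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(n_steps):
        v = beta * v - lr * grad(w)  # velocity accumulates past gradients
        w = w + v
    return w

w_plain = descend([5.0, 5.0], beta=0.0)
w_momentum = descend([5.0, 5.0], beta=0.8)

# Distance from the minimum at the origin after the same number of steps
print(np.linalg.norm(w_plain), np.linalg.norm(w_momentum))
```

With the same step size and step count, the momentum variant ends closer to the minimum because it makes faster progress along the shallow direction.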

---

***Student Response:***

**(e)** In the vernacular of class, what is the distinction between "parameters" and "hyperparameters"? For the case of support vector machines and random forest models, identify and explain some of the more evident hyperparameters.

---

***Student Response:***



---


## **Problem 2, Rest and Regression (40 points):**

Examine the contents of `solubility_regression.csv`.
You can download and view it on your own or easily examine the contents on [github](https://github.com/webbtheosim/CBE512-MLinChmSciEng/blob/main/data/solubility-regression-cbe512.csv).

This is an expanded version of the solubility dataset used in the first problem set. Notably, whereas previously the `.csv` had columns for just `SMILES`, `Solubility`, and `MolWt`, now there are many more columns that can be used as inputs in our feature vectors.

Overall, we are going to explore aspects of model-building with the example of multivariate linear regression. For this problem, the *label* that we want to predict is `Solubility`.

---



**(a)** Begin by partitioning the data into a simple 80/20 train/test split.
You may either use your own function to do this train-test split, or you may use a function from scikit-learn (e.g., `sklearn.model_selection.train_test_split`).
Pick two features in the dataset. Compare the distributions of those two features with each other, and compare how each feature is distributed in the proposed training set versus the test set.
What considerations might one have with respect to these distributions in developing subsequent machine learning models?
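
A minimal sketch of the split and a simple distribution comparison is given below. It uses synthetic arrays in place of the dataset columns, and the two-sample Kolmogorov-Smirnov test from SciPy is just one assumed way to compare distributions; histograms would serve equally well.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split

# Placeholder data standing in for two feature columns and the label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.normal(size=200)

# 80/20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Compare each feature's distribution between the training and test sets
for j in range(X.shape[1]):
    stat, p_value = ks_2samp(X_train[:, j], X_test[:, j])
    print(
        f"feature {j}: train mean={X_train[:, j].mean():.2f}, "
        f"test mean={X_test[:, j].mean():.2f}, KS p-value={p_value:.2f}"
    )
```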

---

In [None]:
# import modules for this problem


**(b)** Now create and train a multivariate [linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) model that takes available *numerical* features as input and predicts solubility. You may choose to implement feature-scaling in any way you like (if at all).
How well does the model fit the data it was trained on? How well does it generalize to the test set?
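
The general workflow might look like the sketch below, shown on synthetic data rather than the solubility dataset. The coefficient values and noise level are arbitrary placeholders, and $r^2$ is only one of several reasonable metrics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (placeholder): 5 features, 2 of them irrelevant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.3 * rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training set only, then score on both sets
model = LinearRegression().fit(X_train, y_train)
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))

print(f"train r2 = {r2_train:.3f}, test r2 = {r2_test:.3f}")
```

A large gap between the training and test scores is one signal of poor generalization.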

---





**(c)** Some of the features might not be useful, adding unnecessary noise and complexity to the model.
In this part, you will select a "minimal" set of features that still results in a quality linear model by using cross-validation and information criteria.

While we *could* evaluate the space of combinations exhaustively (the code would take approximately 1.5 hours to run), we will instead use a greedy, stepwise algorithm to build up a feature list.
Starting with the null model (no features; the optimal model would just predict the mean for every input), compare models that augment the existing model with one of the features from the remaining pool of features.
Use the mean $r^2$ obtained from cross-validation as the metric for comparing models with the same complexity (i.e., same number of features).
Identify the feature that results in the best model and add it permanently to the feature list; make note of the model's performance metrics.
Then, repeat the procedure until all seventeen features have been included.
The result should be a plot of model performance as a function of model complexity.
Now, compare the best models identified at every step of the procedure using an *information criterion* of your choice as the metric for comparison. Which model is "optimal"? Compare the predictive performance of this model to that of part **b**.
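
The greedy procedure described above can be sketched roughly as follows, here on a small synthetic problem with six placeholder features. The helper `aic` is a hypothetical name, and the Akaike information criterion in its Gaussian form (up to an additive constant) is only one possible choice of information criterion.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (placeholder): only features 0 and 3 influence the label
rng = np.random.default_rng(0)
n, p = 150, 6
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.5 * rng.normal(size=n)

def aic(X_sub, y):
    """Gaussian AIC up to an additive constant: n * ln(RSS / n) + 2k."""
    model = LinearRegression().fit(X_sub, y)
    rss = np.sum((y - model.predict(X_sub)) ** 2)
    k = X_sub.shape[1] + 1  # coefficients plus intercept
    return len(y) * np.log(rss / len(y)) + 2 * k

selected, remaining, history = [], list(range(p)), []
while remaining:
    # Score every candidate augmentation by its mean cross-validated r^2
    scores = {}
    for j in remaining:
        cols = selected + [j]
        cv_r2 = cross_val_score(LinearRegression(), X[:, cols], y, cv=5, scoring="r2")
        scores[j] = cv_r2.mean()
    best = max(scores, key=scores.get)
    selected.append(best)
    remaining.remove(best)
    # Record the feature list, its CV score, and its information criterion
    history.append((list(selected), scores[best], aic(X[:, selected], y)))

# Compare the per-step winners across complexities using the information criterion
best_features, best_r2, best_aic = min(history, key=lambda h: h[2])
print(best_features, round(best_r2, 3))
```

Plotting the second entry of each `history` tuple against the number of features gives the performance-versus-complexity curve.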


***Note:*** Minimizing the MSE of the linear model is equivalent to maximizing the likelihood (i.e., finding the maximum likelihood estimate, MLE) of the probabilistic form of the linear model with Gaussian noise. In this probabilistic framework, regularization is seen as putting a prior belief on your parameters and instead finding the maximum *a posteriori* (MAP) estimate of the model parameters.
The specific form of the $\ell_2$ norm regularization term comes from asserting a Gaussian prior on the parameters with mean 0; the variance of this prior is the hyperparameter of the model (a Laplace prior yields $\ell_1$ regularization instead). Accordingly, hyperparameters in many models formulated within a probabilistic framework are parameters of the prior distributions that are placed on parameters. To read more about this, visit [this description](https://bjlkeng.github.io/posts/probabilistic-interpretation-of-regularization).
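
As a brief worked sketch of this connection, assume (notation introduced here for illustration only) Gaussian noise $y_i \sim \mathcal{N}(w^\top x_i, \sigma^2)$ and independent zero-mean Gaussian priors $w_j \sim \mathcal{N}(0, \tau^2)$. The MAP estimate maximizes the log-posterior:

$$
\hat{w}_{\mathrm{MAP}} = \arg\max_{w} \left[ \ln p(y \mid X, w) + \ln p(w) \right]
= \arg\min_{w} \left[ \frac{1}{2\sigma^2} \sum_i \left( y_i - w^\top x_i \right)^2 + \frac{1}{2\tau^2} \sum_j w_j^2 \right]
$$

Multiplying through by $2\sigma^2$ gives the familiar penalized least-squares form:

$$
\hat{w}_{\mathrm{MAP}} = \arg\min_{w} \left[ \sum_i \left( y_i - w^\top x_i \right)^2 + \lambda \lVert w \rVert_2^2 \right], \qquad \lambda = \frac{\sigma^2}{\tau^2}
$$

Under these assumptions, a narrower prior (smaller $\tau^2$) corresponds to stronger regularization, and dropping the prior term recovers ordinary least squares.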

---






**(d)** Using scikit-learn, explore the use of a non-linear regressor other than a neural network (e.g., a random forest; we will spend more time on neural networks later) for predicting solubility. Explain how this model works and what the parameters are. What are the hyperparameters associated with this model and what do they control? Explore the effects of changing these hyperparameters on the predictive performance of your model.
Use grid-search and cross-validation to identify an optimal set of *hyperparameters*.
Examine how your results change using *(i)* the full set of features and *(ii)* the set of features identified in part **c**.

***Note:*** Grid search with cross-validation exhaustively evaluates all hyperparameter combinations, which may take several minutes to complete depending on the grid size (i.e., number of hyperparameters and their options).
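
A minimal sketch of grid search with cross-validation is shown below, assuming a random forest as the non-linear regressor and synthetic data in place of the solubility features. The grid values are arbitrary placeholders, kept small so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear regression problem (placeholder data)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(150, 4))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=150)

# Small illustrative grid: 2 x 3 x 2 = 12 combinations, each scored with 3-fold CV
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [2, 5, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0), param_grid, cv=3, scoring="r2"
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The number of model fits grows as the product of the option counts times the number of folds, which is why larger grids become slow.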

---



---

## **Problem 3. High classification. (35 points):**

Examine the contents of `solubility_classification.csv`.
You can download and view it on your own or easily examine the contents on [github](https://github.com/webbtheosim/CBE512-MLinChmSciEng/blob/main/data/solubility-classification-cbe512.csv). This is the same dataset as in **Problem 2** except we have removed `Solubility` as a label and included `Group`. For this problem, the *label* that we want to predict is the class featured in `Group`.


---



**(a)** Use multivariate [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to predict the solubility `Group` of a molecule from the numerical features in the dataset. Follow the same principles as in **Problem 2a**.

---



**(b)** Again, some of these features might not be useful or may be redundant for the model. In this part, use cross-validation to perform feature selection based on a criterion that you describe. Rather than using a greedy method, this time you will adopt a stochastic strategy in which you randomly choose a set of features (from all possible combinations) and assess the performance of the resulting model. You should conduct as many random trials as there were model evaluations in the greedy-selection algorithm from **Problem 2**. Identify the best-performing model and compare its performance to the model from part **a**.
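
One way to sketch the stochastic strategy is shown below, using synthetic data and mean cross-validated accuracy as an assumed selection criterion. Each feature is included with probability 1/2 and empty subsets are redrawn, which samples uniformly from all non-empty feature combinations; `n_trials` is a placeholder value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem (placeholder): features 0 and 2 carry the signal
rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 2] + 0.5 * rng.normal(size=n) > 0).astype(int)

n_trials = 20  # placeholder; match this to the greedy evaluation count in practice
results = []
while len(results) < n_trials:
    # Draw a random subset: each feature is kept with probability 1/2
    mask = rng.random(p) < 0.5
    if not mask.any():
        continue  # redraw empty subsets
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    acc = cross_val_score(clf, X[:, mask], y, cv=5, scoring="accuracy").mean()
    results.append((acc, np.flatnonzero(mask).tolist()))

best_acc, best_subset = max(results)
print(best_subset, round(best_acc, 3))
```

For reference, a greedy forward pass over seventeen features evaluates 17 + 16 + ... + 1 = 153 candidate models.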

---

**(c)** Here we will compare the effects of $\ell_1$ vs. $\ell_2$ vs. no regularization on logistic regression. Construct a graphic that summarizes the (relative) performances of these three classifiers for different strengths of the regularization. Specifically, show this for weighting coefficients of [1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]. Include error bars on these performances.

*Hint:* use cross-validation to get separate performance evaluations.
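
A rough sketch of how the per-strength statistics could be gathered is given below, on synthetic data. Several choices here are assumptions: the weighting coefficient is treated as a penalty strength and converted to scikit-learn's inverse-strength argument via `C = 1 / strength`, the `liblinear` solver is used because it supports both penalties, and the unregularized case is approximated with a very large `C`. Argument names for the penalty may also differ between scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem (placeholder data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(size=200) > 0).astype(int)

strengths = [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]
summary = {}
for penalty in ("l1", "l2"):
    for strength in strengths:
        # scikit-learn's C is the inverse of the regularization strength
        clf = LogisticRegression(penalty=penalty, C=1.0 / strength, solver="liblinear")
        scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        summary[(penalty, strength)] = (scores.mean(), scores.std())

# Approximately unregularized baseline: a very large C makes the penalty negligible
baseline = LogisticRegression(C=1e10, max_iter=1000)
scores = cross_val_score(baseline, X, y, cv=5, scoring="accuracy")
summary[("none", None)] = (scores.mean(), scores.std())

for key, (mean, std) in summary.items():
    print(key, f"{mean:.3f} +/- {std:.3f}")
```

The fold-to-fold standard deviation stored alongside each mean is one simple source for the error bars.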

---

**(d)** Implement a *non-linear* support vector machine classifier from `scikit-learn` to predict the solubility group. Explain roughly how this model works. What are the hyperparameters associated with this model and what do they control? Explore the effects of changing these hyperparameters on the predictive performance of your model and on how well your model fits the training data. Try to identify an optimal set of hyperparameters using cross-validation. How does the predictive performance of this model compare to those you constructed in previous parts?
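
A minimal sketch of such a search is shown below, assuming an RBF kernel and synthetic two-dimensional data with a circular class boundary. The grid values for `C` and `gamma` are placeholders, and setting `return_train_score=True` makes the training-fold scores available for comparison against the validation scores.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (placeholder): classes separated by a circle, so no linear boundary works
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# Scale features inside the pipeline so each CV fold is scaled independently
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(
    pipe, param_grid, cv=5, scoring="accuracy", return_train_score=True
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))

# Mean training vs. validation accuracy for the best setting
i = search.best_index_
print(search.cv_results_["mean_train_score"][i], search.cv_results_["mean_test_score"][i])
```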

---