# Project Template: Phase 2

Below are some concrete steps that you can take while doing your analysis for phase3. This guide isn't "one size fits all", so you will probably not do everything listed. But it still serves as a good "pipeline" for how to do data analysis.

If you do engage in a step, you should clearly mention it in the notebook.

---


## 2.1) Decide on what models you will use and compare

Select at least 3 models to compare on your prediction task. At least 2 of your models should be ones we've covered in class. 

Some resources try to help you select a well-performing model for your data:
* [sklearn's Flowchart](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
* [GeeksforGeeks Flowchart](https://www.geeksforgeeks.org/flowchart-for-basic-machine-learning-models/)
* [SAS Cheatsheet](https://blogs.sas.com/content/subconsciousmusings/files/2017/04/machine-learning-cheet-sheet.png)

**Note**: These are general guides, and not guarantees of success. Some of the models are also outside of what we have covered, but you can explore them if you want to.

In addition to selecting a model you think will perform well, there are other reasons to select a model:
* To serve as a baseline (naive) approach you expect to outperform with more complex/appropriate models.
* You need a model that is human interpretable (e.g. Decision Tree).
* The model has historically performed well on similar tasks.
* Some properties of the model are effective for the type of data you have. Remember, at the end of most Seminars, you learned the strengths and weaknesses of each model.

1. Model XXX: I am selecting XXX because...
2. Model YYY: I am selecting YYY because...
3. Model ZZZ: I am selecting ZZZ because...

## 2.2) Split into train and test
Make sure to split your data *before* you apply any transformations.

**Note**: If you have multiple records from the same object (e.g., multiple attempts from the same student), these should all go in either training or test, but not split between them. See the examples for how to accomplish this.
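
As a rough illustration of the note above, here is a minimal sketch of a group-aware split using scikit-learn's `GroupShuffleSplit`. The toy arrays and the `groups` vector of student ids are made up for this example; in your project the groups would come from whatever column identifies the object (e.g., a student id).

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 10 records belonging to 4 students.
X = np.arange(20).reshape(10, 2)
Y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])  # student id for each record

# test_size is a fraction of the *groups*, not of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, Y, groups=groups))

X_train, X_test = X[train_idx], X[test_idx]
Y_train, Y_test = Y[train_idx], Y[test_idx]

# Sanity check: no student should appear on both sides of the split.
shared_students = set(groups[train_idx]) & set(groups[test_idx])
print("Students in both train and test:", shared_students)
```

Because whole groups are held out, the number of test rows can differ from 25% of the data when groups have different sizes.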

### 2.2.1) Sampling (If needed)

If one of your classes is very underrepresented (e.g. 1000 of Class 0; 200 of Class 1), you might consider oversampling the minority class (e.g. sample 1000 times with replacement from 200 instances), or undersampling the majority class (e.g. sample 200 times from 1000 instances).

Check out [np.random.choice](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html) for how to sample a vector.

**Note 1**: You should only ever sample the *training dataset*, never the test. After all, you can't choose the class distribution of your test data!

**Note 2**: Sampling can help a classifier perform better on the minority class, often at the cost of *overall* performance. But this is no guarantee. If you choose to sample, you should compare your classifiers' performance with and without sampling to see if it actually helped.

**Note 3**: Make sure you sample the *same* indices from your training features and training labels (`X_train` and `Y_train`) -- otherwise they won't match anymore!
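
The notes above can be sketched as follows. This is only one possible approach, assuming `X_train` and `Y_train` are NumPy arrays; the helper name `oversample_minority` and the toy data are hypothetical. It uses `Generator.choice` from `np.random.default_rng`, which plays the same role as `np.random.choice`.

```python
import numpy as np

def oversample_minority(X_train, Y_train, seed=0):
    """Hypothetical helper: resample every smaller class (with replacement)
    until it has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(Y_train, return_counts=True)
    target = counts.max()

    selected = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(Y_train == cls)
        # Draw the missing number of rows from this class, with replacement.
        extra = rng.choice(cls_idx, size=target - count, replace=True)
        selected.append(np.concatenate([cls_idx, extra]))

    idx = np.concatenate(selected)
    # Index X and Y with the SAME indices so features and labels stay aligned.
    return X_train[idx], Y_train[idx]

# Toy training set: 10 rows of class 0, 2 rows of class 1.
X_toy = np.arange(24).reshape(12, 2)
Y_toy = np.array([0] * 10 + [1] * 2)

X_bal, Y_bal = oversample_minority(X_toy, Y_toy)
print(np.unique(Y_bal, return_counts=True))
```

If your data lives in a pandas DataFrame, the same idea applies with positional indexing (`.iloc[idx]`).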


Play around with sampling below (or skip this step if you don't need sampling).

When you're done, write the `sample_data` method to perform sampling on any training dataset.

In [None]:
def sample_data(X_train, Y_train):
    """
    Input: The original X_train and Y_train training dataset
    Output: A new training dataset with sampling applied (same columns, different rows)
    """
    # For example, undersample the majority class, or oversample the minority class.
    
    return (X_train, Y_train)

## 2.3) Feature Transformation

Use your training data to fit any transformers or encoders you need, then apply the fitted transformers to your test data. This applies to:
* Normalizing/standardizing your features
* Using Bag of Words or TF-IDF to encode strings
* PCA or dimensionality reduction

**Rationale**: In practice, we won't be able to see the test data we'll be making predictions for, so we shouldn't use that data as the basis for any transformation or feature extraction.
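
To make the fit-on-train, apply-to-test pattern concrete, here is a minimal sketch using scikit-learn's `StandardScaler` on made-up numbers. The same `fit` / `transform` pattern applies to other transformers such as `TfidfVectorizer` or `PCA`; standardization is just an illustrative choice.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices, invented for illustration.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows only.
X_train_scaled = scaler.fit_transform(X_train)
# transform reuses the training statistics; the test data is never used for fitting.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately zero for each column
print(X_test_scaled)
```

Note that the scaled test values need not have zero mean or unit variance, since they are expressed relative to the training distribution.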

Try your feature transformation below:

When you're done, write the `apply_feature_transformation` method to perform transformation on any training/test split.

In [1]:
def apply_feature_transformation(X_train, X_test):
    """
    Input: The original X_train and X_test feature sets.
    Output: The transformed X_train and X_test feature sets.
    """
    return (X_train, X_test)

## 2.4) Train and Explore your Models
Now train the models you decided upon in 2.1. Conduct preliminary evaluations to see whether each model is even feasible before spending time tuning a model that's no good.
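
One possible way to run such a preliminary check is a quick cross-validation on the training data, comparing each candidate against a naive baseline. The sketch below assumes a classification task and uses synthetic data from `make_classification`; the two candidate models are placeholders for whichever models you selected.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for your training data.
X_train, Y_train = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

candidates = {
    "baseline (most frequent class)": DummyClassifier(strategy="most_frequent"),
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation on the training data only; the test set stays untouched.
    scores[name] = cross_val_score(model, X_train, Y_train, cv=5).mean()
    print(f"{name}: mean CV accuracy = {scores[name]:.3f}")
```

A model that cannot clearly beat the baseline here is probably not worth tuning further.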

## 2.5) Hyperparameter Tuning
For promising models, tune them even further to squeeze out the best possible performance. Some questions to consider:

1. What hyperparameters should I tune? Why?
2. What value ranges should I choose for each param? Why?
3. Should I try the values manually, or use the [built-in tuning functions](https://scikit-learn.org/stable/modules/grid_search.html)?

**Make sure to only tune on the training dataset!**

In [2]:
from sklearn.model_selection import GridSearchCV

def find_best_hyperparameters_m1(X_train, Y_train):
    """
    Input: The training X features and Y labels/values
    Output: The fitted search object (the best classifier is search.best_estimator_)
    """
    clf = None # Create your base classifier (replace None with an sklearn estimator)
    param_grid = {"param_1": [0, 1, 2],
                  "param_2": ['value1', 'value2']}
    
    # Cross-validated grid search, fit on the training data only
    search = GridSearchCV(clf, param_grid)
    search.fit(X_train, Y_train)
    return search
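
For reference, here is a filled-in sketch of the same idea with a concrete classifier. The choice of `DecisionTreeClassifier`, the parameter ranges, and the synthetic data are all assumptions made for illustration, not recommendations for your dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, split before any tuning happens.
X, Y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# Illustrative grid: 3 x 3 = 9 combinations, each scored with 5-fold CV.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)

# Only the training data is used to pick hyperparameters.
search.fit(X_train, Y_train)
print("Best params:", search.best_params_)

# By default GridSearchCV refits the best model on all of X_train,
# so the search object can predict on the held-out test set directly.
predictions = search.predict(X_test)
```

The test set is only touched once, at the very end, after the hyperparameters have been fixed.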

## Put it All Together

Now, combine the "scratch work" that you did above into a tidy function that someone could use to replicate your work and process in a single step.

In [None]:
def evaluate_model1(X_train, X_test, Y_train, Y_test):
    (X_train, X_test) = apply_feature_transformation(X_train, X_test)
    (X_train, Y_train) = sample_data(X_train, Y_train)
    hyperparameters = find_best_hyperparameters_m1(X_train, Y_train)
    # Fit your model here
    
    # Return your model's predictions

In [None]:
def evaluate_model2(X_train, X_test, Y_train, Y_test):
    (X_train, X_test) = apply_feature_transformation(X_train, X_test)
    (X_train, Y_train) = sample_data(X_train, Y_train)
    # You need to create a new hyperparameter selector for your second model, or remove this step
    hyperparameters = find_best_hyperparameters_m2(X_train, Y_train)
    # Fit your model here
    
    # Return your model's predictions

In [None]:
def evaluate_model3(X_train, X_test, Y_train, Y_test):
    (X_train, X_test) = apply_feature_transformation(X_train, X_test)
    (X_train, Y_train) = sample_data(X_train, Y_train)
    # You need to create a new hyperparameter selector for your third model, or remove this step
    hyperparameters = find_best_hyperparameters_m3(X_train, Y_train)
    # Fit your model here
    
    # Return your model's predictions