## Improving Algorithms

Last class we saw that we can employ different techniques to improve model accuracy (e.g. blending, boosting). We keep track of the metrics (accuracy, true negatives, false negatives, etc.) for each model and pick the one that performs best. For competitions, however, this is not good enough: we have to look at approaches such as blending to build a better algorithm.

### Choose best hyperparameters with GridSearchCV

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

rf_clf = RandomForestClassifier(random_state=42)

# Every combination of these values is tried: 2 * 3 * 3 * 2 * 2 = 72 candidates.
params_grid = {'max_depth': [3, None],
               'min_samples_split': [2, 3, 10],
               'min_samples_leaf': [1, 3, 10],
               'bootstrap': [True, False],
               'criterion': ['gini', 'entropy']}

grid_search = GridSearchCV(rf_clf,params_grid,n_jobs=-1,cv=5,verbose=1,scoring='accuracy')

grid_search.fit(X_train,y_train)

grid_search.best_score_

grid_search.best_estimator_.get_params()

print_score(grid_search, X_train, y_train, X_test, y_test, train=True)

print_score(grid_search, X_train, y_train, X_test, y_test, train=False)

### K-Fold Cross-Validation

Run k-fold cross-validation and see how the model performs on each data split. If the model reaches roughly the same accuracy across the splits, that is a good sign: its performance does not depend on one particular train/test split.

In the example below, cv = 10 on a dataset of 500 records means the data is divided into 10 folds of 50 records each. In every round we train on 9 folds (450 records) and test on the remaining fold, rotating until each fold has been used as the test set once.

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(rfc,x,y,cv=10)
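
The fold arithmetic described above can be checked directly. This is a minimal sketch, assuming a placeholder array of 500 rows (`X_demo` is made up for illustration and is not the class dataset) and scikit-learn's `KFold` splitter:

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: 500 records with 4 features (assumed for illustration only).
X_demo = np.zeros((500, 4))

kfold = KFold(n_splits=10)

# Record (train size, test size) for each of the 10 rounds.
fold_sizes = [(len(train_idx), len(test_idx))
              for train_idx, test_idx in kfold.split(X_demo)]

# Each round trains on 9 folds and tests on the remaining fold.
print(len(fold_sizes))
print(fold_sizes[0])
```

Note that when `cross_val_score` is given a classifier and an integer `cv`, it uses stratified folds, so the class proportions are kept roughly equal in every fold; the fold sizes stay the same.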

### Blending

There are different types of blending. One of them was explained in the previous class; another is described below.

- We train 'n' different algorithms on the same training data
- We average the predictions of the algorithms and use that average as the final prediction
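
The two steps above could be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (`make_classification`), the three models are an arbitrary choice, and the averaging is done on predicted class probabilities rather than on hard labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only so the sketch runs on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The 'n' algorithms (n = 3 here); this choice of models is an assumption.
models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=42),
    KNeighborsClassifier(),
]

# Train each model and collect its predicted class probabilities on the test set.
probas = [model.fit(X_tr, y_tr).predict_proba(X_te) for model in models]

# Average the probabilities across models, then pick the most likely class.
avg_proba = np.mean(probas, axis=0)
blended_pred = models[0].classes_[np.argmax(avg_proba, axis=1)]

print("Blended accuracy:", np.mean(blended_pred == y_te))
```

scikit-learn's `VotingClassifier` with `voting='soft'` packages the same idea into a single estimator, which is convenient when the blend has to go through `GridSearchCV` or `cross_val_score`.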

### Model Building steps

- Model Building
    - Machine learning models
    - Accuracy score and hyper-parameter tuning
    - The single model that gives the highest accuracy is chosen
    - Deploy it and use it for prediction

- Advanced users/Kaggle competitions
    - Stacking models
    - Boosting
    - Use these models to predict future observations
    - Trade-off between interpretability and accuracy (complex techniques such as stacking often give better accuracy, but interpretability is lost)
    - Grid search
    - Find optimized parameters for each algorithm
    - Stacking or boosting algorithms
    - Check accuracy
    - K-fold validation
    - Submit predictions once the k-fold validation scores fall in a similar range
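
The last few steps of the competition workflow (stack the models, check accuracy, confirm that the k-fold scores are in a similar range) might look like the sketch below. The dataset is synthetic and the base models and fold counts are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, for illustration only.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# Base models feed their predictions into a final logistic regression.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# One accuracy score per fold; a small standard deviation means the folds agree.
scores = cross_val_score(stack, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", scores.round(3))
print("Mean:", scores.mean().round(3), "Std:", scores.std().round(3))
```

How small the spread has to be before submitting is a judgment call; the sketch only shows where the numbers come from.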

### Feature Engineering

While developing a model, we have to select the features (predictors) that help the most in predicting/classifying the y-variable. The technique used to achieve this is called 'feature engineering'. We try to select the most important features, and we also try to make insignificant features significant by applying techniques such as those listed below.


- Feature Engineering
- Forward Selection
- Backward Elimination
- Stepwise Selection
- Low-variance Method
- PCA (Principal Component Analysis)
- LDA (Linear Discriminant Analysis)
- t-SNE (t-distributed Stochastic Neighbor Embedding)

Objective: identify the important columns using
- forward selection
- backward elimination
- stepwise selection
- low-variance method

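
Two of the selection methods listed above, the low-variance method and forward selection, can be sketched with scikit-learn's `VarianceThreshold` and `SequentialFeatureSelector`. The data is synthetic, a constant column is appended on purpose so the low-variance filter has something to remove, and the number of features to keep (3) is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector, VarianceThreshold
from sklearn.linear_model import LogisticRegression

# Synthetic data with 8 features, plus one constant column that carries no information.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     n_informative=3, random_state=42)
X_demo = np.hstack([X_demo, np.ones((300, 1))])

# Low-variance method: drop columns whose variance is zero.
low_var = VarianceThreshold(threshold=0.0)
X_var = low_var.fit_transform(X_demo)
print("Columns after low-variance filter:", X_var.shape[1])

# Forward selection: start with no features and add one at a time,
# each time keeping the feature that improves the cross-validated score most.
forward = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=3,
    direction="forward",
    cv=5,
)
forward.fit(X_var, y_demo)
print("Selected column mask:", forward.get_support())
```

Setting `direction="backward"` on the same class gives backward elimination: it starts from all features and removes one at a time.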
Convert insignificant features to significant ones using
- PCA (Principal Component Analysis)
- LDA (Linear Discriminant Analysis)
- t-SNE (t-distributed Stochastic Neighbor Embedding)
- Transformations
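
As an illustration of the first technique in this list, the sketch below applies PCA to a synthetic dataset. The number of components (3) and the scaling step are assumptions made for the example, not values from the class.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with 10 original features, for illustration only.
X_demo, _ = make_classification(n_samples=300, n_features=10, random_state=42)

# PCA is sensitive to feature scale, so standardize the columns first.
X_scaled = StandardScaler().fit_transform(X_demo)

# Combine the 10 original columns into 3 new components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

print("Reduced shape:", X_pca.shape)
print("Share of variance kept by each component:", pca.explained_variance_ratio_)
```

The new columns are combinations of the original ones, so they can carry signal that no single original column shows on its own, at the cost of being harder to interpret.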
