- Classification
- Regression
- Clustering
- Dimensionality Reduction
- Ensembling
- Model Analysis
- NLP

- EDA: non-categorical cols: univariate, bivariate, multivariate, correlation, numerical, categorical, missing values, outliers (box plot)
- Statistical tests: choose between the null and the alternative hypothesis
- Concept drift, data drift: label drift, feature drift (may or may not affect the model)

- All learning methods: https://scikit-learn.org/stable/supervised_learning.html
- All API references: https://scikit-learn.org/stable/modules/classes.html
- tSNE: https://distill.pub/2016/misread-tsne/
- sk example graph comparisons: https://scikit-learn.org/stable/auto_examples/
- Glossary of terms: https://scikit-learn.org/stable/glossary.

# Classification

## 1. Logistic Regressor
Despite its name, it is a linear model for classification rather than regression. In this model the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.

https://scikit-learn.org/stable/modules/linear_model.html

**sklearn.linear_model.LogisticRegression**(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

- **penalty** - regularization term added to the cost function: 'none', 'l2', 'l1' or 'elasticnet' (both L1 and L2 added). Refer to the table in the docs summarizing the penalties supported by each solver.
- **l1_ratio**: The Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Only used if penalty='elasticnet'. LogisticRegressionCV has built-in cross-validation support to find the optimal C and l1_ratio.
- **C** - inverse of regularization strength; smaller values mean stronger regularization. A low value says "This data is not fully representative of the real world, so if it's making a parameter large, don't listen to it".
- **fit_intercept**: bool, default=True - whether a bias (intercept) should be added to the decision function.
- **class_weight** - Weights for classes in the form {class_label: weight}. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
- **solver** - Algorithm used for the optimization: ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’. ‘liblinear’ is a good choice for small datasets and is limited to one-vs-rest (OvR) schemes; ‘sag’ and ‘saga’ are faster for large datasets, provided the features are scaled. The usable solver depends on the penalty (refer to the table); lbfgs is good for small data.
- **random_state**: int, default=None - Used when solver == ‘sag’, ‘saga’ or ‘liblinear’ to shuffle the data.
- **warm_start**: bool, default=False - when True, reuse the solution of the previous call to fit as initialization; otherwise, erase the previous solution. Useless for the liblinear solver.
- **multi_class**: {‘auto’, ‘ovr’, ‘multinomial’} - With ‘ovr’, a binary problem is fit for each label. With ‘multinomial’, the loss minimised is the multinomial loss fit across the entire probability distribution, even when the data is binary; ‘multinomial’ is unavailable when solver=’liblinear’. ‘auto’ selects ‘ovr’ if the data is binary or if solver=’liblinear’, and otherwise selects ‘multinomial’.
https://chrisyeh96.github.io/2018/06/11/logistic-regression.html
- **tol**: Tolerance for the optimization. When the loss or score is not improving by at least tol for two consecutive iterations, convergence is considered to be reached and training stops.
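
To make the role of **C** concrete, here is a minimal sketch, assuming the iris dataset and a `StandardScaler` step (neither is prescribed by these notes). It fits the same model with three values of C and records the size of the learned coefficients; stronger regularization (small C) should pull them toward zero.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: iris is used purely as a convenient example dataset.
X, y = load_iris(return_X_y=True)

coef_norms = {}
for C in (0.01, 1.0, 100.0):
    # Scale first so the L2 penalty treats every feature alike.
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X, y)
    # model[-1] is the fitted LogisticRegression step of the pipeline.
    coef_norms[C] = float(np.linalg.norm(model[-1].coef_))

for C, norm in coef_norms.items():
    print(f"C={C:>6}: ||coef|| = {norm:.3f}")
```

In practice C is usually chosen by cross-validation, for example with LogisticRegressionCV, rather than by inspecting coefficient norms.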


- **Advantages:**
•	Classifies linearly separable data well
•	Fast training speed and prediction speeds
•	Does well on small datasets
•	Decently interpretable results, easy to explain
•	Easy to update the model when new data comes in
•	Can avoid overfitting when regularized
•	Can do both binary and multiclass classification
•	No parameter tuning required (except when regularized, where we need to tune the regularization parameter; note that scikit-learn applies an L2 penalty by default)
•	Doesn’t need feature scaling (except when regularized)

- **Disadvantages:**
•	Doesn’t work well for non-linearly separable data
•	Low(er) prediction accuracy
•	Doesn’t learn feature interactions in the dataset
•	Doesn’t separate signal from noise well — cull irrelevant features before use
•	If the dataset has redundant (highly correlated) features, the fitted coefficients can be unstable




In [None]:
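from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
X, y = load_iris(return_X_y=True)  # example data: 150 samples, 3 classes
clf = LogisticRegression(random_state=0).fit(X, y)
clf.predict(X[:2, :])  # predicted class labels
clf.predict_proba(X[:2, :])  # one probability column per class, ordered as in clf.classes_
clf.score(X, y)  # mean accuracy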

## 2. Support Vector Machines
Used for classification, regression and outlier detection. The core of an SVM is a quadratic programming (QP) problem, separating support vectors from the rest of the training data. The liblinear-based algorithm used by LinearSVC is much more efficient than its libsvm-based SVC counterpart and can scale almost linearly to millions of samples and/or features.

It is highly recommended to scale your data, for example by standardizing it to have mean 0 and variance 1.
In SVC, if the data is unbalanced (e.g. many positive and few negative), set class_weight='balanced' and/or try different penalty parameters C.

The kernel function can be: polynomial (parameters degree and coef0); rbf (parameter gamma, which must be greater than 0); sigmoid (parameter coef0).
With the Radial Basis Function (RBF) kernel, two parameters must be considered: C and gamma. Both are critical to the SVM’s performance. Use GridSearchCV with C and gamma spaced exponentially far apart to choose good values.
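
The grid search over C and gamma described above can be sketched as follows. This is a minimal illustration, assuming the iris dataset, a 5-fold split and the particular `np.logspace` ranges shown; all three are placeholder choices rather than recommendations from these notes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: iris stands in for your own data.
X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fit on each CV training fold.
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Exponentially spaced candidates; step names follow make_pipeline's
# lowercase-class-name convention ("svc").
param_grid = {
    "svc__C": np.logspace(-2, 3, 6),
    "svc__gamma": np.logspace(-4, 1, 6),
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

If the best value sits on the edge of the grid, widening the range in that direction is a reasonable next step.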


**sklearn.svm.SVC**(*, C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None)

**sklearn.svm.LinearSVC**(penalty='l2', loss='squared_hinge', *, dual=True, tol=0.0001, C=1.0, multi_class='ovr', fit_intercept=True, intercept_scaling=1, class_weight=None, verbose=0, random_state=None, max_iter=1000)


- **C**: Regularization parameter, common to all SVM kernels; trades off misclassification of training examples against simplicity of the decision surface. The strength of the regularization is inversely proportional to C. The penalty is a squared l2 penalty. Decrease C for noisy data to regularize more. LinearSVC and LinearSVR are less sensitive to C when it becomes large, and prediction results stop improving after a certain threshold. Meanwhile, larger C values take more time to train, sometimes up to 10 times longer.
- **kernel**: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’} or callable, default=’rbf’
- **Kernel cache size**: For SVC, SVR, NuSVC and NuSVR, it is recommended to set cache_size to a higher value than the default of 200 (MB), such as 500 or 1000.
- **degree**: int, default=3. Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
- **gamma**: {‘scale’, ‘auto’} or float, default=’scale’. Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. ‘scale’ uses gamma = 1 / (n_features * X.var()); ‘auto’ uses 1 / n_features. gamma defines how much influence a single training example has. The larger gamma is, the closer other examples must be to be affected.
- **probability**: bool, default=False. Whether to enable probability estimates. This must be enabled prior to calling fit; it slows down that method because it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict.
- **shrinking**: True/False. If the number of iterations is large, shrinking can shorten the training time. But if we solve the optimization problem loosely (e.g., by using a large stopping tolerance), the code without shrinking may be much faster.
- **class_weight**: dict or ‘balanced’, default=None. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)). You can also set weights for individual samples in the fit method through the sample_weight parameter. This sets the parameter C for the i-th example to C * sample_weight[i].
- **max_iter**: int, default=-1. Hard limit on iterations within the solver, or -1 for no limit.
- **decision_function_shape**: {‘ovo’, ‘ovr’}, default=’ovr’
- **break_ties**: bool, default=False. If true, decision_function_shape='ovr', and number of classes > 2, predict will break ties according to the confidence values of decision_function; otherwise the first class among the tied classes is returned.
- **nu** in NuSVC/OneClassSVM/NuSVR approximates the fraction of training errors and support vectors.
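
The “balanced” formula quoted for class_weight can be checked directly. The sketch below uses a made-up label vector with a 90/10 imbalance and compares the hand-computed weights with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Formula from the notes: n_samples / (n_classes * np.bincount(y))
manual = len(y) / (len(classes) * np.bincount(y))

# The same weights as computed by scikit-learn.
from_sklearn = compute_class_weight("balanced", classes=classes, y=y)

print(manual)
print(from_sklearn)
```

The rare class receives the larger weight, so mistakes on it cost more during training.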

The fit time scales at least quadratically with the number of samples and may be impractical beyond tens of thousands of samples. For large datasets consider using LinearSVC or SGDClassifier instead, possibly after a Nystroem transformer. The multiclass support is handled according to a one-vs-one scheme.

- **Advantages:**
•	High prediction accuracy.
•	Resists overfitting in high-dimensional spaces, provided C and the kernel are chosen with care, so it suits data with lots of features
•	Works well for small datasets (<100k training set)
•	Works well for text classification problems
•	Versatile: different kernel functions can be specified for the decision function.
•	The support vector machines in scikit-learn support both dense and sparse (any scipy.sparse) sample vectors as input. To make predictions on sparse data, the SVM must have been fit on such data.
- **Disadvantages:**
•	Not easy to update the model when new data comes in
•	Is very memory intensive
•	Doesn’t work well on large datasets
•	Requires you to choose the right kernel in order to work: the linear kernel models linear data and is fast, while the non-linear kernels can model non-linear boundaries but can be slow (consider boosting instead)
•	SVMs don't directly provide probability estimates; these are calculated using an expensive five-fold cross-validation

- NuSVC, SVC: one-vs-one (OvO) approach for multi-class; n_classes * (n_classes - 1) / 2 classifiers are constructed and each one is trained on data from two classes.
- LinearSVC: implemented in terms of liblinear rather than libsvm, with more flexibility in the choice of penalties and loss functions; it should scale better to large numbers of samples. Faster, no kernel, and lacks some of the attributes of SVC and NuSVC, like support_. Uses one-vs-rest (OvR), thus training n_classes models.
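
The classifier counts for the two multi-class schemes can be seen in the shapes of the fitted objects. This is a small sketch on synthetic 4-class data; the `make_classification` settings are arbitrary assumptions chosen only to produce four classes and five features.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Synthetic data: 4 classes, 5 features (illustrative values only).
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
n_classes = 4

# SVC with the one-vs-one output: one decision column per pair of classes.
ovo = SVC(decision_function_shape="ovo").fit(X, y)
ovo_shape = ovo.decision_function(X).shape   # (n_samples, n_classes * (n_classes - 1) / 2)

# LinearSVC is one-vs-rest: one weight vector per class.
lin = LinearSVC(max_iter=10000, random_state=0).fit(X, y)
lin_shape = lin.coef_.shape                  # (n_classes, n_features)

print(ovo_shape, lin_shape)
```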

In [None]:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# Toy data: two classes in 2-D
X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
y = np.array([1, 1, 2, 2])
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X, y)
clf.predict([[-0.8, -1]])


# The SVM decision function depends on a subset of the training data, called the support vectors.
# Their properties are stored in the attributes support_vectors_, support_ and n_support_.
# A Pipeline does not expose these attributes, so read them from the fitted SVC step.
svc = clf.named_steps['svc']
# get support vectors (in the scaled feature space)
svc.support_vectors_
# get indices of support vectors
svc.support_
# get number of support vectors for each class
svc.n_support_

#LINEAR SVC
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
# Synthetic data with 4 features, to match the 4-feature query below
X, y = make_classification(n_features=4, random_state=0)
clf = make_pipeline(StandardScaler(), LinearSVC(random_state=0, tol=1e-5))
clf.fit(X, y)
clf.named_steps['linearsvc'].coef_
clf.named_steps['linearsvc'].intercept_
clf.predict([[0, 0, 0, 0]])

## 3. K Nearest Neighbors (Distance Based)
Known as non-generalizing machine learning methods, as they simply “remember” all the training data (possibly transformed into a fast indexing structure such as a Ball Tree or KD Tree). As a non-parametric method, it is good for classification problems where the decision boundary is very irregular. It can handle either NumPy arrays or scipy.sparse matrices as input. If two neighbors have the same distance but different labels, the result depends on the order of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point. The choice of "k" is highly data-dependent: in general a larger "k" suppresses the effects of noise, but makes the classification boundaries less distinct.

**sklearn.neighbors.KNeighborsClassifier**(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)

- **n_neighbors**: int, default=5. Number of neighbors to use for kneighbors queries.
- **weights**: {‘uniform’, ‘distance’} or callable, default=’uniform’. ‘uniform’: all points in each neighborhood are weighted equally. ‘distance’: points are weighted by the inverse of their distance, so closer neighbors of a query point have a greater influence than neighbors further away.
- **algorithm**: {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’. Algorithm used to compute the nearest neighbors. ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to the fit method. Note: fitting on sparse input will override this parameter, using brute force.

**brute**: computes the distances between all pairs of points in the dataset; for N samples in D dimensions, this approach scales as O[DN<sup>2</sup>]. For small datasets (N less than 30 or so) it can be more efficient than a tree-based approach. However, as N grows, it quickly becomes infeasible. Query time is unchanged by the structure of the data and largely unaffected by the value of k.

**K-D Tree**: Addresses the computational inefficiencies of the brute-force approach. It attempts to reduce the required number of distance calculations by efficiently encoding aggregate distance information for the sample. The basic idea is that if point A is very distant from point B, and point C is very close to B, then A and C are very distant, without having to calculate their distance explicitly. This reduces the cost of a nearest neighbors search to O[DN log(N)] or better. The construction of a KD tree is very fast. It is very fast for low-dimensional (D<20) neighbors searches, but becomes inefficient as D grows very large. Query time becomes slower as k increases. As k becomes large compared to N, the ability to prune branches in a tree-based query is reduced, and brute-force queries can be more efficient.

**Ball Tree**: Addresses the inefficiencies of KD Trees in higher dimensions. Where KD trees partition data along Cartesian axes, ball trees partition data in a series of nesting hyper-spheres, which makes tree construction more costly. Because of the spherical geometry of the ball tree nodes, it can out-perform a KD tree in high dimensions. Query time grows as approximately O[D log(N)].

**'auto'**: selects 'brute' if any of the following conditions holds: the input data is sparse, metric = 'precomputed', D>15, k>=N/2, or effective_metric_ isn’t in the VALID_METRICS list for either 'kd_tree' or 'ball_tree'. Otherwise, it selects the first out of 'kd_tree' and 'ball_tree' that has effective_metric_ in its VALID_METRICS list. This heuristic rests on the following assumptions:
- the number of query points is at least of the same order as the number of training points
- leaf_size is close to its default value of 30
- when D>15, the intrinsic dimensionality of the data is generally too high for tree-based methods

Number of query points: both the ball tree and the KD tree require a construction phase. The cost of this construction is negligible when amortized over many queries. If only a few queries will be performed, construction can make up a significant fraction of the total cost, and brute force can be better than a tree-based method.
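
The three algorithm options differ in speed, not in results: each performs an exact neighbor search. A minimal sketch on random data (the sizes and seed are arbitrary assumptions) confirms that they return the same neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Random low-dimensional data; sizes are illustrative only.
rng = np.random.RandomState(0)
X = rng.rand(500, 3)
queries = rng.rand(5, 3)

results = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=4, algorithm=algorithm).fit(X)
    # kneighbors returns (distances, indices), each of shape (n_queries, n_neighbors)
    results[algorithm] = nn.kneighbors(queries)

brute_dist, brute_idx = results["brute"]
for name in ("kd_tree", "ball_tree"):
    dist, idx = results[name]
    print(name, np.array_equal(idx, brute_idx), np.allclose(dist, brute_dist))
```

Timing the same loop on your own data is a simple way to decide whether overriding 'auto' is worthwhile.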


- **leaf_size**: int, default=30. Leaf size passed to BallTree or KDTree. This can affect the speed of construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem. It controls the number of samples at which a query switches to brute force.
- **p**: default=2. Power parameter for the Minkowski metric. p = 1 is equivalent to manhattan_distance (l1), p = 2 to euclidean_distance (l2).
- **metric**: default=’minkowski’. Metric to use for distance computation.
- **Attributes**: classes_, n_features_in_ (number of features seen during fit), feature_names_in_

- **Advantages:**
•	Fast training speed
•	Doesn’t need much parameter tuning
•	Interpretable results, easy to explain
•	Works well for small datasets (<100k training set)
- **Disadvantages:**
•	Low(er) prediction accuracy
•	Doesn’t do well on very small datasets, where the nearest neighbors may not be representative
•	Need to pick a suitable distance function
•	Needs feature scaling to work well
•	Prediction time grows with the size of the dataset
•	Doesn’t separate signal from noise well — cull irrelevant features before use
•	Is memory intensive because it saves every observation
•	Storing and comparing every observation also means it doesn’t work well with high-dimensional data
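
The need for feature scaling can be illustrated with a short comparison. This sketch assumes the wine dataset (whose features have very different ranges), k=5 and 5-fold cross-validation; those are example choices, not part of the notes.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: wine is used because its feature scales differ by orders of magnitude.
X, y = load_wine(return_X_y=True)

# Without scaling, the features with the largest numeric range dominate the distance.
raw_acc = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean()

# With scaling inside a pipeline, every feature contributes comparably.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_acc = cross_val_score(scaled_knn, X, y, cv=5).mean()

print(f"raw: {raw_acc:.3f}  scaled: {scaled_acc:.3f}")
```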


In [None]:
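from sklearn.neighbors import KNeighborsClassifier
X = [[0], [1], [2], [3]]  # toy 1-D data
y = [0, 0, 1, 1]
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
neigh.predict([[1.1]])
neigh.predict_proba([[0.9]])
neigh.get_params(deep=True)  # hyperparameters of the estimator
neigh.score(X, y, sample_weight=None)  # mean accuracy on the given data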

## 4. Decision Tree

Non-parametric supervised learning method for classification and regression. The cost of predicting data is logarithmic in the number of data points used to train the tree. Decision trees can handle multi-output problems and, in principle, both numerical and categorical data; however, the scikit-learn implementation does not support categorical variables for now. It is possible to validate a model using statistical tests. scikit-learn uses np.float32 arrays internally; if the training data is not in this format, a copy of the dataset will be made. If the input matrix is very sparse, it is recommended to convert to sparse csc_matrix before calling fit and sparse csr_matrix before calling predict. Training time can be orders of magnitude faster for a sparse matrix input compared to a dense matrix when features have zero values in most of the samples.

Trees can overfit on data with a large number of features; PCA, pruning, setting a minimum number of samples required at a leaf node, or setting a maximum depth of the tree can mitigate this. Small variations in the data can result in a very different tree being generated; this can be mitigated with an ensemble. Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation. Trees become biased if some classes dominate, so balance the dataset before fitting. Concepts like XOR, parity or multiplexer problems are hard to learn because decision trees do not express them easily.

**sklearn.tree.DecisionTreeClassifier**(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)

- **criterion**: {“gini”, “entropy”, “log_loss”}, default=”gini”. Function used to measure the quality of a split at each node. Gini is usually faster to compute (it avoids the logarithm); results from entropy are sometimes slightly better, though the two often yield similar trees.

$\text{Gini} = 1 - \sum_{j} p_{j}^{2}$, where $p_j$ is the probability of class j. The minimum value of Gini is 0, reached when the node is pure, so such a node will not be split again. For two classes, Gini reaches its maximum value (0.5) when the two classes are equally probable.
$\text{Entropy} = -\sum_{j} p_{j} \log_2 p_{j}$. Entropy measures the disorder of the target within a node. The optimum split it selects is usually similar to the one chosen by the Gini index. For two classes its values lie in [0, 1].
- **max_depth**: int, default=None. If None, nodes are expanded until all leaves are pure or all leaves contain fewer than min_samples_split samples.
- **min_samples_split**: int or float, default=2. The minimum number of samples required to split an internal node. If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) is the minimum number of samples.
- **min_samples_leaf**: int or float, default=1. The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches.
- **max_features**: int, float or {“auto”, “sqrt”, “log2”}, default=None. The number of features to consider when looking for the best split. “auto” or "sqrt": max_features=sqrt(n_features); “log2”: max_features=log2(n_features); None: max_features=n_features.
- **random_state**: int, default=None. The features are always randomly permuted at each split, even when splitter is "best". When max_features < n_features, the algorithm will select max_features at random at each split. To obtain deterministic behaviour during fitting, random_state has to be fixed to an integer.
- **max_leaf_nodes**: int, default=None. Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined by relative reduction in impurity. If None, the number of leaf nodes is unlimited.
- **min_weight_fraction_leaf**: float, default=0.0. The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
- **class_weight**: dict, list of dict or “balanced”, default=None. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
- **ccp_alpha**: non-negative float, default=0.0. Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed.
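
The Gini and entropy formulas given under criterion are easy to evaluate by hand. The helper functions below are illustrative names, not part of scikit-learn; they compute both impurities for a few class-probability vectors.

```python
import numpy as np

def gini(p):
    """Gini impurity 1 - sum_j p_j^2 for a vector of class probabilities."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Entropy -sum_j p_j * log2(p_j), treating 0 * log2(0) as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability classes to avoid log2(0)
    return float(-np.sum(p * np.log2(p)))

for probs in ([1.0, 0.0], [0.5, 0.5], [0.9, 0.1], [1 / 3, 1 / 3, 1 / 3]):
    print(probs, round(gini(probs), 3), round(entropy(probs), 3))
```

With three equally likely classes the entropy exceeds 1, which is why the [0, 1] range only holds for the two-class case.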

- Visualize your tree as you are training by using the export function. Use max_depth=3 as an initial depth to get a feel for how the tree is fitting the data, and then increase the depth. The number of samples required to populate the tree doubles for each additional level the tree grows to.
- Use min_samples_split or min_samples_leaf to ensure that multiple samples inform every decision in the tree, by controlling which splits will be considered. A very small number will usually mean the tree will overfit, whereas a large number will prevent the tree from learning the data. Try min_samples_leaf=5 as an initial value. If the sample size varies greatly, a float can be used as a percentage in these two parameters. While min_samples_split can create arbitrarily small leaves, min_samples_leaf guarantees that each leaf has a minimum size, avoiding low-variance, over-fit leaf nodes in regression problems. For classification with few classes, min_samples_leaf=1 is often the best choice.
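
As a rough illustration of how pruning controls tree size, the sketch below fits the same classifier with increasing ccp_alpha and records the node count. The iris dataset and the specific alpha values are assumptions chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris is used only as a small example dataset.
X, y = load_iris(return_X_y=True)

alphas = [0.0, 0.01, 0.05, 0.5]  # illustrative values, from no pruning to heavy pruning
node_counts = []
for alpha in alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    # tree_.node_count is the total number of nodes (internal nodes plus leaves)
    node_counts.append(clf.tree_.node_count)

for alpha, n_nodes in zip(alphas, node_counts):
    print(f"ccp_alpha={alpha}: {n_nodes} nodes")
```

A larger ccp_alpha prunes more aggressively; a useful value is normally picked by cross-validating over the alphas returned by cost_complexity_pruning_path.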


- **Advantages:**
•	Fast training speed and prediction speeds
•	Captures non-linear relationships in the dataset well
•	Learns feature interactions in the dataset
•	Great when your dataset has outliers
•	Great for finding the most important features in the dataset
•	Can do both 2 class and multiclass classification
•	Doesn’t need feature scaling    
•	Decently interpretable results, easy to explain

- **Disadvantages:**
•	Low(er) prediction accuracy
•	Requires some parameter tuning
•	Doesn’t do well on small datasets
•	Doesn’t separate signal from noise well
•	Used very rarely in practice, use ensembled trees instead
•	Not easy to update the model when new data comes in
•	Can overfit (see ensembled models below)




In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)
cross_val_score(clf, iris.data, iris.target, cv=10)
# cross_val_score fits clones of clf, so fit clf itself before inspecting it
clf.fit(iris.data, iris.target)
tree.plot_tree(clf)  # requires matplotlib
path = clf.cost_complexity_pruning_path(iris.data, iris.target, sample_weight=None)
# path.ccp_alphas: effective alphas of subtrees during pruning
# path.impurities: sum of the impurities of the subtree leaves for each alpha in ccp_alphas
clf.get_depth()  # the maximum depth of the tree
clf.get_params(deep=True)

## 5. Naive Bayes

- **sklearn.naive_bayes.GaussianNB**(*, priors=None, var_smoothing=1e-09)

- **Advantages:**
•	Performs really well on text classification problems
•	Fast training speed and prediction speeds
•	Does well on small datasets
•	Separates signal from noise well
•	Performs well in practice
•	Simple, easy to implement
•	Works well for small datasets (<100k training set)
•	The naive assumption about the independence of features and their potential distribution lets it avoid overfitting
•	Also if this condition of independence holds, Naive Bayes can work on smaller datasets and can have faster training speed
•	Doesn’t need feature scaling
•	Not memory intensive
•	Decently interpretable results, easy to explain
•	Scales well with the size of the dataset

- **Disadvantages:**
•	Low(er) prediction accuracy


In [None]:
import numpy as np
from sklearn.naive_bayes import GaussianNB
# Toy data: two classes in 2-D
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
clf = GaussianNB()
clf.fit(X, Y)
clf.predict([[-0.8, -1]])
# partial_fit: incremental fit on a batch of samples, meant to be called several times on
# successive chunks of a dataset (out-of-core or online learning) when the whole dataset
# does not fit in memory. It has some performance and numerical stability overhead, so
# use chunks as large as the memory budget allows.
clf_pf = GaussianNB()
clf_pf.partial_fit(X, Y, np.unique(Y))  # the list of all classes is passed on the first call
clf_pf.predict([[-0.8, -1]])

## 6. Neural Networks (NN)

- **Advantages:**
•	High prediction accuracy — does really well in practice
•	Captures very complex underlying patterns in the data
•	Does really well with both big datasets and those with high-dimensional data
•	Easy to update the model when new data comes in
•	The network’s hidden layers reduce the need for feature engineering remarkably
•	Is state of the art for computer vision, machine translation, sentiment analysis and speech recognition tasks

- **Disadvantages:**
•	Very long training time
•	Need a huge amount of computing power
•	Need feature scaling
•	Not easy to explain or interpret results
•	Need lots of training data because it learns a vast number of parameters
•	Outperformed by Boosting algorithms for non-image, non-text, non-speech tasks
•	Very flexible, come with lots of different architecture building blocks, thus require expertise to design the architecture


# Regression

## 1. Lasso, Ridge, Elastic-Net Regression

### a. Lasso regressor
Linear model trained with an L1 prior as regularizer (aka the Lasso). Technically the Lasso model is optimizing the same objective function as the Elastic Net with l1_ratio=1.0 (no L2 penalty). The Lasso is a linear model that estimates sparse coefficients. It is useful in some contexts due to its tendency to prefer solutions with fewer non-zero coefficients, effectively reducing the number of features upon which the given solution is dependent. For this reason, Lasso and its variants are fundamental to the field of compressed sensing. Under certain conditions, it can recover the exact set of non-zero coefficients (see Compressive sensing: tomography reconstruction with L1 prior (Lasso)). The implementation in the class Lasso uses coordinate descent as the algorithm to fit the coefficients.

**sklearn.linear_model.Lasso**(alpha=1.0, *, fit_intercept=True, normalize='deprecated', precompute=False, copy_X=True, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic')
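
The statement that the Lasso optimizes the same objective as the Elastic Net with l1_ratio=1.0 can be checked numerically. This is a small sketch on synthetic data; the dataset sizes and alpha=0.5 are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic regression problem (illustrative sizes).
X, y = make_regression(n_samples=50, n_features=8, noise=1.0, random_state=0)

lasso = Lasso(alpha=0.5).fit(X, y)
# l1_ratio=1.0 removes the L2 part of the Elastic Net penalty.
enet = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)

same = np.allclose(lasso.coef_, enet.coef_)
print(same)
```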

### b. Ridge regressor
Linear least squares with l2 regularization. This model solves a regression model where the loss function is the linear least squares function and regularization is given by the l2-norm. Also known as Ridge Regression or Tikhonov regularization

**sklearn.linear_model.Ridge**(alpha=1.0, *, fit_intercept=True, normalize='deprecated', copy_X=True, max_iter=None, tol=0.001, solver='auto', positive=False, random_state=None)

- **Advantages:**
•	These models are linear regression with regularization
•	Help counteract overfitting
•	These models are much better at generalizing because they are simpler
•	They work well when we only care about a few features

- **Disadvantages:**
•	Need feature scaling
•	Need to tune the regularization parameter


Ridge-regression
1.	Ordinary least squares fits the line with the minimum sum of squared residuals. With few samples this line can overfit and therefore have high variance. The goal of ridge is to find a line that fits the training data slightly worse, by introducing a small amount of bias, in exchange for lower variance
2.	Add the ridge regression penalty to the cost: Cost = sum of sq residuals + lambda * slope^2
3.	The smaller the slope, the less the prediction depends on that feature; to choose lambda, use 10-fold CV
4.	On categorical values, use sum of sq residuals + lambda * (avg y value of 1st category – avg y val of 2nd category)^2
5.	For data with both categorical and numerical features, use lambda * (slope^2 + categ_difference^2)
6.	Can be used in logistic regression too
7.	Ridge still gives a usable solution when the sample size is smaller than the number of features, where plain least squares has no unique solution

Lasso-regression
1.	Cost = sum of sq. residuals + lambda * |slope|
2.	The lasso can shrink a slope all the way to zero with a high lambda, whereas ridge can only bring it close to 0
3.	When most features are not useful, use lasso, because it gets rid of unimportant features. When most features are useful, use ridge regression, because it takes all features into consideration
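
The difference in shrinkage behaviour between lasso and ridge can be sketched on synthetic data where only a few features matter. The dataset parameters and alpha=1.0 below are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# 20 features, of which only 3 actually drive the target (illustrative setup).
X, y = make_regression(
    n_samples=100, n_features=20, n_informative=3, noise=5.0, random_state=0
)
X = StandardScaler().fit_transform(X)  # both penalties are scale-sensitive

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# Count coefficients that are exactly zero under each penalty.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(f"lasso zeros: {n_zero_lasso}, ridge zeros: {n_zero_ridge}")
```

Lasso sets some coefficients to exactly zero, acting as a feature selector, while ridge only makes them small.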

Elastic-Net regression
1.	When there are millions of parameters, we cannot assess in advance which of them are useful
2.	Cost = sum of sq. residuals + lambda_1 * slope^2 + lambda_2 * |slope|
3.	When features are correlated, use both lambdas > 0


In [None]:
#LASSO
from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.1)
reg.fit([[0, 0], [1, 1]], [0, 1])
reg.predict([[1, 1]])

#RIDGE 
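import numpy as np
from sklearn.linear_model import Ridge
# Synthetic data: 10 samples, 5 features
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
clf = Ridge(alpha=1.0)
clf.fit(X, y)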



## 3. Stochastic Gradient Descent
**sklearn.linear_model.SGDRegressor**(loss='squared_error', *, penalty='l2', alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=1000, tol=0.001, shuffle=True, verbose=0, epsilon=0.1, random_state=None, learning_rate='invscaling', eta0=0.01, power_t=0.25, early_stopping=False, validation_fraction=0.1, n_iter_no_change=5, warm_start=False, average=False)

Linear model fitted by minimizing a regularized empirical loss with SGD.

SGD stands for Stochastic Gradient Descent: the gradient of the loss is estimated each sample at a time and the model is updated along the way with a decreasing strength schedule (aka learning rate).

The regularizer is a penalty added to the loss function that shrinks model parameters towards the zero vector using either the squared euclidean norm L2 or the absolute norm L1 or a combination of both (Elastic Net). If the parameter update crosses the 0.0 value because of the regularizer, the update is truncated to 0.0 to allow for learning sparse models and achieve online feature selection.

This implementation works with data represented as dense numpy arrays of floating point values for the features.
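
Because the model is updated sample by sample, it can also be trained incrementally on chunks of data with partial_fit. The following is a minimal sketch under several assumptions: synthetic data from `make_regression`, a chunk size of 100, 20 passes over the data, and a scaler fitted once on the full array for simplicity.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustrative sizes).
X, y = make_regression(n_samples=1000, n_features=5, noise=1.0, random_state=0)
X = StandardScaler().fit_transform(X)  # SGD is sensitive to feature scale

reg = SGDRegressor(penalty="l2", alpha=1e-4, random_state=0)

# Stream the data in chunks, as one would when it does not fit in memory.
for epoch in range(20):
    for start in range(0, len(X), 100):
        batch = slice(start, start + 100)
        reg.partial_fit(X[batch], y[batch])

r2 = reg.score(X, y)  # coefficient of determination on the training data
print(round(r2, 3))
```

In a true out-of-core setting the scaler would also need to be fitted incrementally, for example with StandardScaler.partial_fit.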



In [None]:
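import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Synthetic data: 10 samples, 5 features
n_samples, n_features = 10, 5
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
# Always scale the input. The most convenient way is to use a pipeline.
reg = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, tol=1e-3))
reg.fit(X, y)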

# Clustering

## 1. DBSCAN - Density-Based Spatial Clustering of Applications with Noise
Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density. Performs DBSCAN clustering from a vector array or distance matrix.

- **sklearn.cluster.DBSCAN**(eps=0.5, *, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=None)

- **Advantages**
•	Scalable to large datasets
•	Detects noise well
•	Don’t need to know the number of clusters in advance
•	Doesn’t make an assumption that the shape of the cluster is globular

- **Disadvantages:**
•	Doesn’t always work if your entire dataset is densely packed
•	Need to tune the density parameters — epsilon and min_samples to the right values to get good results


DBSCAN
Nested clusters = when one cluster is surrounded by another cluster. DBSCAN can identify nested clusters in high dimensions.
1.	For each point, count the number of points within a given radius (hyperparameter, eps)
2.	Pick the core points: those that have at least “n” (hyperparameter, min_samples) close points
3.	Randomly pick a core point and assign it to the first cluster; add all core points close to it to the cluster; then keep adding core points that are close to any core point already in the cluster
4.	Now add all non-core points close to the cluster’s core points to the first cluster, but do not use the non-core points to extend the cluster
5.	Pick a point from the remaining core points, put it in a second cluster and repeat the above
6.	After every core point has been assigned to a cluster, the remaining unassigned points are outliers


In [None]:
from sklearn.cluster import DBSCAN
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
clustering.labels_
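
The core / border / noise distinction from the steps above can be read off a fitted estimator. A small sketch on made-up points (the array and the `eps` / `min_samples` values are assumptions chosen so that all three kinds of point appear):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy points on a line (made up for illustration): four close together, one far away.
X_demo = np.array([[0, 0], [1, 0], [2, 0], [3, 0], [10, 0]], dtype=float)

# min_samples counts the point itself, so a core point needs 2 other points within eps.
db = DBSCAN(eps=1.1, min_samples=3).fit(X_demo)

is_core = np.zeros(len(X_demo), dtype=bool)
is_core[db.core_sample_indices_] = True   # steps 1-2: points with enough neighbours
is_noise = db.labels_ == -1               # step 6: never reached from a core point
is_border = ~is_core & ~is_noise          # step 4: in a cluster but not core

print("labels:", db.labels_)
print("core:  ", np.flatnonzero(is_core))
print("border:", np.flatnonzero(is_border))
print("noise: ", np.flatnonzero(is_noise))
```

The two end points of the dense group join the cluster as border points, and the isolated point gets the noise label -1.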

## 2. KMeans and MiniBatchKMeans

https://www.cross-validated.com/Starbucks-Rewards-Program/

In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it can get stuck in local minima. That's why it can be useful to restart it several times (n_init).

If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e. the cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.

- **sklearn.cluster.KMeans**(n_clusters=8, *, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd')
- **sklearn.cluster.MiniBatchKMeans**(n_clusters=8, *, init='k-means++', max_iter=100, batch_size=1024, verbose=0, compute_labels=True, random_state=None, tol=0.0, max_no_improvement=10, init_size=None, n_init=3, reassignment_ratio=0.01)

- **Advantages:**
  - Great for revealing the structure of the underlying dataset
  - Simple, easy to interpret
  - Works well if you know the number of clusters in advance
- **Disadvantages:**
  - Doesn't always work if the clusters aren't globular and similar in size
  - Needs the number of clusters in advance: k has to be tuned to get good results
  - Memory intensive
  - Doesn't scale well to very large datasets (MiniBatchKMeans is the usual workaround)


K-means
1. Select k from the variation vs k elbow graph: pick the k after which the variation stops dropping sharply (the elbow). Randomly select k data points as the initial cluster centers.
2. Measure the distance between every data point and each of the k centers. Assign every data point to the nearest center.
3. Calculate the mean of each cluster and take the means as the new centers. Measure all the distances and reassign again, until the assignments no longer change. The quality of a clustering can be assessed by adding up the variation within each cluster.
4. Store the clustering and its total variation, pick another k random starting points, repeat the steps above and store the result again.
5. Repeat this n times (an HP, n_init) and choose the clustering with the least variation.
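
The steps above can be sketched in plain NumPy. This is an illustrative re-implementation under simple assumptions (random initial centers instead of k-means++, squared Euclidean variation, hypothetical function names), not scikit-learn's code:

```python
import numpy as np

def kmeans_once(X, k, rng, max_iter=100):
    """One run of the assign/update loop from a random start."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: distance from every point to every center, assign to the nearest.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its points (keep it if the cluster is empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and total within-cluster variation for the returned centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    variation = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, variation

def kmeans(X, k, n_init=20, seed=0):
    """Steps 4-5: restart n_init times and keep the run with the least variation."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])

X_demo = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
labels, centers, variation = kmeans(X_demo, k=2)
print(labels, centers, variation)
```

On this toy data a start with both centers in the same group converges to a poor local minimum, which is why the restarts matter.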


In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
kmeans.predict([[0, 0], [12, 3]])
kmeans.cluster_centers_

# MiniBatchKMeans
from sklearn.cluster import MiniBatchKMeans
# manually fit on batches
kmeans = MiniBatchKMeans(n_clusters=2, random_state=0, batch_size=6)
kmeans = kmeans.partial_fit(X[0:6,:])
kmeans = kmeans.partial_fit(X[6:12,:])
kmeans.cluster_centers_
kmeans.predict([[0, 0], [4, 4]])

# fit on the whole data
kmeans = MiniBatchKMeans(n_clusters=2, random_state=0, batch_size=6, max_iter=10).fit(X)
kmeans.cluster_centers_
kmeans.predict([[0, 0], [4, 4]])

## 3. Hierarchical - AgglomerativeClustering
Recursively merges pairs of clusters of sample data; uses the linkage distance.

- **sklearn.cluster.AgglomerativeClustering**(n_clusters=2, *, affinity='euclidean', memory=None, connectivity=None, compute_full_tree='auto', linkage='ward', distance_threshold=None, compute_distances=False); affinity was renamed to metric in scikit-learn 1.2.


Hierarchical
Orders the rows and/or the columns based on similarity. Shorter branches in the dendrogram = more similar.
1. Calculate the Euclidean distance between every pair of samples and cluster the pair with the smallest distance. Manhattan distance can also be used.
2. Treat the new cluster as a single sample and repeat step 1 until everything is merged. To measure the distance to a cluster:
   1. centroid = distance to the average of the cluster's samples
   2. single linkage = distance to the closest sample in the cluster
   3. complete linkage = distance to the farthest sample in the cluster


In [None]:
from sklearn.cluster import AgglomerativeClustering
clustering = AgglomerativeClustering().fit(X)
clustering.labels_
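
To compare the centroid, single and complete ways of measuring the distance to a cluster, here is a small sketch using SciPy's hierarchical clustering (swapped in for illustration because it exposes the merge heights directly; the toy points are made up). The last row of the linkage matrix is the final merge, and its third column is the dendrogram height of that merge.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two well separated groups of three points (made up for illustration).
X_demo = np.array([[1, 0], [1, 2], [1, 4], [10, 0], [10, 2], [10, 4]], dtype=float)

final_merge_height = {}
for method in ("single", "complete", "centroid"):
    Z = linkage(X_demo, method=method, metric="euclidean")
    # Z[-1, 2] is the distance at which the last two clusters are merged.
    final_merge_height[method] = float(Z[-1, 2])

# Cut the complete-linkage tree into 2 flat clusters.
Z_complete = linkage(X_demo, method="complete")
flat_labels = fcluster(Z_complete, t=2, criterion="maxclust")

print(final_merge_height)
print(flat_labels)
```

Single linkage reports the gap between the closest pair across the two groups, complete linkage the farthest pair, and centroid the distance between the group averages.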

# Dimensionality reduction

## 1. Principal Component Analysis - Randomized SVD
Linear dimensionality reduction using approximated Singular Value Decomposition of the data and keeping only the most significant singular vectors to project the data to a lower dimensional space.

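- **sklearn.decomposition.PCA**(n_components=None, *, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None); pass svd_solver='randomized' for the randomized SVD. The old **sklearn.decomposition.RandomizedPCA**(n_components=None, copy=True, iterated_power=3, whiten=False, random_state=None) class was removed in scikit-learn 0.20.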


PCA using SVD (singular value decomposition)
Find which variables are important for clustering the data
1. Center the data (mean = 0), then fit a line PC1 through the origin that best fits the data, i.e. with the smallest sum of squared distances from the points to the line. PCA solves the equivalent problem of maximizing the sum of squared distances from the origin to the projections of the points onto the line.
2. The slope of the line gives the ratio in which the variables are mixed. The line is a linear combination of the variables. If the ratio is 1:4, the 2nd variable is the more important one for describing the data.
3. With SVD the ratio is scaled so that its hypotenuse = 1 (divide each side by root(1^2+4^2)). The scaled unit vector is called the singular vector / eigenvector for PC1, and its components are called loading scores.
   sum(sq(distances from 0 to projection points on PC1)) = eigen_value for PC1
   sq_root(eigen_value) = singular_value for PC1
4. PC2 = the line through the origin perpendicular to PC1, so for a PC1 ratio of 1:4 the PC2 ratio is -4:1.
5. If the data is multi-D: among the lines perpendicular to all previous PCs, find the best one the same way PC1 was found, and so on until all dimensions are covered.
6. eigen_value for PC1 / (n-1) = variation for PC1. Divide the variation of each PC by the sum of the variations of all PCs to get the percentage of the total variation each PC accounts for. The PCs with the highest percentages matter most. A scree plot is the graphical representation of these percentages.
7. To convert the data to the reduced dimension, project the data onto the selected PC lines, rotate so that PC1 is horizontal, and plot the projected points in the PC space.

Note: PCA works best when the first 2 PCs account for most of the variation in the data. It is not good for complicated datasets.


In [None]:
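from sklearn.decomposition import PCA
# RandomizedPCA was removed in scikit-learn 0.20; PCA with svd_solver='randomized' replaces it
pca = PCA(n_components=2, svd_solver='randomized')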
pca.fit(X)
pca.explained_variance_ratio_
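
The quantities named in the PCA steps (eigenvalue as a sum of squared distances, singular value, variation, percentage of variation) can be checked with a direct NumPy SVD. A minimal sketch on random data; the data and the `_demo` names are assumptions for illustration, and "eigenvalue" follows the convention of these notes (sum of squared distances, before dividing by n-1):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Random correlated 3-D data (made up for illustration).
X_demo = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                               [1.0, 1.0, 0.0],
                                               [0.0, 0.0, 0.2]])

Xc = X_demo - X_demo.mean(axis=0)            # step 1: center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = Vt                                 # rows are unit-length PC directions
scores = Xc @ Vt.T                            # projections of the points onto each PC
eigenvalues = s ** 2                          # sum of squared distances from the origin
variation = eigenvalues / (len(X_demo) - 1)   # step 6: variation per PC
ratio = variation / variation.sum()           # scree-plot percentages

pca_demo = PCA(n_components=3).fit(X_demo)
print(ratio)
print(pca_demo.explained_variance_ratio_)
```

The hand-computed ratios should agree with `explained_variance_ratio_`, and the square roots of the eigenvalues with `singular_values_`.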

## 2. t-SNE and UMAP

t-SNE
Projects data to a low-dim space so that the clustering in the high-dim space is preserved.
1. Determine the similarity between every pair of points: for a given point, center a normal curve on it, place every other point on the curve according to its distance to that point, and measure the height of the curve at each of them. Scale these heights so that they add up to 1 => close points get a high similarity and distant points a low one.
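
The similarity step can be sketched as follows. This is a simplified illustration: the curve width `sigma` is fixed by assumption, whereas real t-SNE tunes it per point from the perplexity, and the points and function name are made up.

```python
import numpy as np

# Four points on a line (made up for illustration).
X_demo = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [5.0, 0.0]])
sigma = 1.0  # assumed fixed width of the normal curve

def similarities_to(i, X, sigma):
    """Scaled similarities of every other point to point i."""
    sq_dists = ((X - X[i]) ** 2).sum(axis=1)
    # Height of a normal curve centered on point i, evaluated at each distance.
    unscaled = np.exp(-sq_dists / (2.0 * sigma ** 2))
    unscaled[i] = 0.0                 # a point is not its own neighbour
    return unscaled / unscaled.sum()  # scale so the similarities add up to 1

p = similarities_to(0, X_demo, sigma)
print(p)
```

The nearest point gets the largest share and the distant point a share close to zero.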


UMAP (uniform manifold approximation and projection)
Calculates similarity scores that help identify clustered points, so that it can try to preserve that clustering in the low-dim graph.
1. Calculate the distance between each pair of points. For each point, make a graph with that point at zero and all other points placed according to their distance from it.


# Ensembling
- **Bagging**: Train many base models with different randomly chosen subsets of data, with replacement. Let the base models vote on final predictions. Eg: Bagging, RandomForests.
- **Boosting**: Iteratively train models and update the importance of getting each training example right after each iteration. Eg: AdaBoost, GradientBoosting.
- **NOTIMP Blending**: Train many different types of base models and make predictions on a holdout set. Train a new model out of their predictions, to make predictions on the test set. (Stacking with a holdout set).
- **NOTIMP Stacking**: Train many different types of base models and make predictions on k-folds of the dataset. Train a new model out of their predictions, to make predictions on the test set.


## 1. BaggingClassifier
Builds several instances of a black-box estimator on random subsets of the original training set and then aggregates their individual predictions to form a final prediction. This reduces the variance of a base estimator (e.g., a decision tree) and so reduces overfitting, without making it necessary to adapt the underlying base algorithm. Bagging methods work best with strong, complex models (e.g., fully developed decision trees), in contrast with boosting methods, which usually work best with weak models (e.g., shallow decision trees).
#### Different randomization techniques
1. When random subsets of the dataset are drawn as random subsets of the samples (without replacement), the algorithm is called Pasting
2. When samples are drawn with replacement, the method is called Bagging
3. When random subsets of the dataset are drawn as random subsets of the features, the method is called Random Subspaces
4. When base estimators are built on subsets of both samples and features, the method is called Random Patches

**class sklearn.ensemble.BaggingClassifier**(base_estimator=None, n_estimators=10, *, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)

- **base_estimator**: object, default=None; the estimator fitted on each random subset. If None, a DecisionTreeClassifier is used. Renamed to **estimator** in scikit-learn 1.2.
- **n_estimators**: int, default=10; The number of base estimators in the ensemble.
- **max_samples**: int or float, default=1.0; The number of samples to draw from X for each base estimator. If float, draw max_samples * X.shape[0] samples.
- **bootstrap**: bool, default=True; whether samples are drawn with replacement. If False, sampling is without replacement.
- **bootstrap_features**: bool, default=False; whether features are drawn with replacement.
- **oob_score**: bool, default=False; whether to use out-of-bag samples to estimate the generalization error. Only available if bootstrap=True. When each estimator sees only a subset of the samples, the generalization accuracy can be estimated from the left-out samples by setting oob_score=True.
- **warm_start**: bool, default=False; when True, reuse the solution of the previous call to fit and add more estimators to the ensemble; otherwise just fit a whole new ensemble.
- **n_jobs**: int, default=None; The number of jobs to run in parallel for both fit and predict.
- **random_state**: int, default=None; Controls the random resampling of the original dataset. Pass an int for reproducible output across multiple function calls.



In [None]:
# more example codes: https://www.programcreek.com/python/example/86713/sklearn.ensemble.BaggingClassifier
#1
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
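# pass the base estimator positionally: the keyword is base_estimator before scikit-learn 1.2 and estimator afterwards
clf = BaggingClassifier(SVC(), n_estimators=10, random_state=0).fit(X, y)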
clf.predict([[0, 0, 0, 0]])
clf.score(X, y, sample_weight=None)

#2
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=0.5)
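
The four randomization techniques listed earlier map onto BaggingClassifier arguments roughly as below. A sketch on the iris dataset; the 0.5 fractions and the dictionary names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)  # 150 samples, 4 features

strategies = {
    # subsets of samples, drawn without replacement
    "pasting": dict(max_samples=0.5, bootstrap=False),
    # subsets of samples, drawn with replacement
    "bagging": dict(max_samples=0.5, bootstrap=True),
    # all samples, subsets of features
    "random_subspaces": dict(max_features=0.5, bootstrap=False),
    # subsets of both samples and features
    "random_patches": dict(max_samples=0.5, max_features=0.5, bootstrap=False),
}

models = {}
for name, kwargs in strategies.items():
    models[name] = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=10, random_state=0, **kwargs
    ).fit(X_demo, y_demo)
    print(name, round(models[name].score(X_demo, y_demo), 3))
```

After fitting, `estimators_samples_` and `estimators_features_` show which rows and columns each base estimator actually saw.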

## 2. RandomForest
A diverse set of classifiers is created by introducing randomness in the classifier construction. The prediction of the ensemble is given as the averaged prediction of the individual classifiers.

Single decision trees are inaccurate: they do well on the data used to create them, but are not good at classifying new samples.

- Bootstrapped dataset (BSDS): randomly select samples from the original dataset. The same sample can be picked more than once.
- Create a decision tree using the BSDS, but only consider a random subset of variables (cols) at each split. The size of this subset is an HP to tune. Typically we use around the square root of the number of variables.
- Make new BSDSs and new trees, hundreds of times.
- At inference: run the sample through all trees and count the votes for each class. The class with the most votes is the final answer. Bootstrapping the data plus aggregating the votes is called bagging.
- Typically about 1/3 of the data does not end up in a given BSDS. Use these out-of-bag samples for validation.
- Missing values in the original dataset: the general idea is to make an initial guess and gradually refine it. Initial guess: fill in the most common value (categorical) or the median (numeric) among samples of the same class. To refine it, find which samples are similar to the one with missing data: build a random forest, run all the data down all the trees, and record in a proximity matrix (samples x samples, like a correlation matrix) how often each pair of samples ends up in the same leaf node. Divide each proximity value by the total number of trees. For a categorical column, weighted frequency of a value = frequency of that value in the column * (sum of the proximities of the samples with that value / sum of all proximities in the missing sample's row); compute it for each value (e.g. yes and no) and choose the highest. For a numeric column, use the weighted average sum(value * proximity / sum of the proximities in the row). Repeat 6 or 7 times until the guesses stop changing.
- Missing values in a sample at inference: create one copy of the sample per possible class (e.g. yes and no). Use the method above to fill in the missing values in each copy. Run each copy through the forest and see which copy is labelled with its assumed class by the most trees. Use that one.

- **sklearn.ensemble.RandomForestClassifier**(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2,min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)

- **sklearn.ensemble.RandomForestRegressor**(n_estimators=100, *, criterion='squared_error', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=1.0, max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, ccp_alpha=0.0, max_samples=None)

- **n_estimators**: int, default=100; The number of trees in the forest.
- **criterion**: {“gini”, “entropy”, “log_loss”}, default=”gini”
- **max_depth**: int, default=None= nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
- **min_samples_split**: int or float, default=2; The minimum number of samples required to split an internal node. If float, it is a fraction and the int value = ceil(min_samples_split * n_samples)
- **min_samples_leaf**: int or float, default=1; The minimum number of samples required at a leaf node. A split is only considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This smooths the model, especially in regression. If float, it is a fraction and the int value = ceil(min_samples_leaf * n_samples)
- **min_weight_fraction_leaf**: float, default=0.0; The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node.
- **max_features**: {“sqrt”, “log2”, None}, int or float, default=”sqrt”; The number of features to consider when looking for the best split. If float, it is a fraction and the int value = max(1, int(max_features * n_features_in_)). If "sqrt" (or the deprecated "auto"): sqrt(n_features). If "log2": log2(n_features). If None: n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.
- **max_leaf_nodes**: int, default=None; Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
- **min_impurity_decrease**: float, default=0.0; A node is split if this split induces a decrease of the impurity greater than or equal to this value.
- **bootstrap**: bool, default=True = bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
- **max_samples**: int or float, default=None; If bootstrap is True, the number of samples to draw from X to train each base estimator. If float, then draw max_samples * X.shape[0] samples.
- **oob_score**: bool, default=False; Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True.
- **n_jobs**: int, default=None; The number of jobs to run in parallel. fit, predict, decision_path and apply are all parallelized over trees.
- **random_state**: int, RandomState instance or None, default=None; Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features).
- **warm_start**: bool, default=False; When True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest. 
- **class_weight**: {“balanced”, “balanced_subsample”}, dict or list of dicts, default=None; The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
- **ccp_alpha**: non-negative float, default=0.0; Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed.

https://medium.datadriveninvestor.com/decision-tree-adventures-2-explanation-of-decision-tree-classifier-parameters-84776f39a28




In [None]:
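from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# synthetic 4-feature dataset so that the cell runs on its own
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)
clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(X, y)
clf.predict([[0, 0, 0, 0]])
clf.score(X, y, sample_weight=None)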
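
The out-of-bag idea from the notes above can be checked in two ways: by simulating one bootstrap draw and by asking the forest for its `oob_score_`. A minimal sketch; the synthetic dataset, its sizes and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 1) Fraction of rows left out of one bootstrapped dataset of the same size.
rng = np.random.default_rng(0)
n = 10_000
bootstrap_idx = rng.integers(0, n, size=n)           # draw n rows with replacement
oob_fraction = 1.0 - len(np.unique(bootstrap_idx)) / n

# 2) Let the forest use each tree's left-out rows as validation data.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     n_informative=4, random_state=0)
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                bootstrap=True, random_state=0).fit(X_demo, y_demo)

print("out-of-bag fraction:", round(oob_fraction, 3))
print("OOB accuracy estimate:", round(forest.oob_score_, 3))
```

The left-out fraction comes out close to 1/e (about 0.37), which is where the "about 1/3" rule of thumb comes from.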


## 3. AdaBoost and GradientBoosting
**GradientBoostingClassifier**: this algorithm builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage n_classes_ regression trees are fit on the negative gradient of the loss function, e.g. binary or multiclass log loss. Binary classification is a special case where only a single regression tree is induced.

sklearn.ensemble.HistGradientBoostingClassifier is a much faster variant of this algorithm for intermediate datasets (n_samples >= 10_000).

**sklearn.ensemble.GradientBoostingClassifier**(*, loss='log_loss', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, validation_fraction=0.1, n_iter_no_change=None, tol=0.0001, ccp_alpha=0.0)

**AdaBoostClassifier**: fits a sequence of weak learners on repeatedly re-weighted versions of the data. Its parameters:

**sklearn.ensemble.AdaBoostClassifier**(base_estimator=None, *, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
- **base_estimator**: object, default=None; Support for sample weighting is required, as well as proper classes_ and n_classes_ attributes. If None, the base estimator is a DecisionTreeClassifier with max_depth=1. Renamed to **estimator** in scikit-learn 1.2.
- **n_estimators**: int, default=50; The maximum number of estimators at which boosting is terminated. In case of perfect fit, the learning procedure is stopped early.
- **learning_rate**: float, default=1.0; Weight applied to each classifier at each boosting iteration. A higher learning rate increases the contribution of each classifier. There is a trade-off between the learning_rate and n_estimators parameters. Values must be in the range (0.0, inf)
- **algorithm**: {‘SAMME’, ‘SAMME.R’}, default=’SAMME.R’; If ‘SAMME.R’ then use the SAMME.R real boosting algorithm. base_estimator must support calculation of class probabilities. If ‘SAMME’ then use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations. SAMME.R is deprecated since scikit-learn 1.4.
- **random_state**: int, RandomState instance or None, default=None; Controls the random seed given at each base_estimator at each boosting iteration. Thus, it is only used when base_estimator exposes a random_state. Pass an int for reproducible output across multiple function calls.


AdaBoost
- In a random forest, each tree is grown to full size, with no max_depth
- AdaBoost trees are usually a node and 2 leaves (a stump) = forest of stumps. A stump uses only one variable to make a decision = weak learner
- In contrast to a random forest, each stump's vote on the final classification is weighted (amount of say), and the order in which the stumps are created matters: the errors of the 1st stump are taken into account by the next stump, and so on

1. Give every sample the same weight 1/num_samples. Choose the split with the lowest Gini index to make the 1st stump.
2. Assign a weight to this stump based on its error. total_error = sum of the weights of the incorrectly classified samples, in [0, 1].
   1. weight_of_stump = ½ * log((1 - total_error) / total_error)
   2. new_sample_wt = sample_wt * e^(+weight_of_stump) for an incorrectly classified sample, sample_wt * e^(-weight_of_stump) for a correctly classified one
3. Normalize all sample weights to add up to 1.
4. Make a new dataset of the same size by randomly drawing samples, with replacement, according to the new sample weights (draw a random number in (0,1) and pick the sample whose cumulative weight range contains it). Give all samples in the new dataset equal weights again.
5. Repeat from step 1, again considering all variables, to build the next stump.
6. Prediction: add up the weights of the stumps that vote for each class. Select the class with the highest sum.
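
One round of the weight update in steps 1 to 4 can be worked through in NumPy. The numbers are hypothetical (8 samples, a stump that gets only sample 3 wrong), chosen to keep the arithmetic easy to follow.

```python
import numpy as np

n_samples = 8
weights = np.full(n_samples, 1.0 / n_samples)   # step 1: equal weights

# Hypothetical outcome of the first stump: only sample 3 is misclassified.
misclassified = np.zeros(n_samples, dtype=bool)
misclassified[3] = True

# Step 2: total error and the stump's weight (amount of say).
total_error = weights[misclassified].sum()
amount_of_say = 0.5 * np.log((1.0 - total_error) / total_error)

# Step 2.2: raise the weight of wrong samples, lower the weight of right ones.
new_weights = weights * np.exp(np.where(misclassified, amount_of_say, -amount_of_say))

# Step 3: normalize so the weights add up to 1.
new_weights /= new_weights.sum()

# Step 4: draw a new dataset of the same size according to the new weights.
rng = np.random.default_rng(0)
resampled_idx = rng.choice(n_samples, size=n_samples, replace=True, p=new_weights)

print(total_error, amount_of_say)
print(new_weights)
print(resampled_idx)
```

After normalization the misclassified sample holds half of the total weight, so it is likely to be drawn several times into the next dataset.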


In [None]:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)

## 4. XGBoost

DMatrix is an internal data structure that is used by XGBoost, which is optimized for both memory efficiency and training speed. You can construct DMatrix from multiple different sources of data.

xgboost.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, eval_metric='mlogloss',
gamma=0, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.300000012, max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=16,num_parallel_tree=1, objective='multi:softprob', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=None, subsample=1, tree_method='exact', use_label_encoder=False, validate_parameters=1, verbosity=None)

- base_score - initial prediction score of all instances (global bias). Default 0.5.
- booster - which booster to use: gbtree, gblinear or dart.
- colsample_bylevel - subsample ratio of columns for each depth level of a tree.
- colsample_bynode - subsample ratio of columns for each node (split).
- colsample_bytree - subsample ratio of columns when constructing each tree.
- gamma - minimum loss reduction required to make a further partition on a leaf node. Larger = more conservative.
- learning_rate - step size shrinkage (eta) applied to each new tree's contribution.
- max_delta_step - maximum delta step allowed for each leaf output; 0 means no constraint.
- max_depth - maximum depth of each tree.
- min_child_weight - minimum sum of instance weight (hessian) needed in a child; splits that would go below it are not made.
- n_estimators - number of boosting rounds (trees).
- n_jobs - number of parallel threads used to run XGBoost.
- objective - the learning task and its loss; 'multi:softprob' here is multiclass classification that outputs per-class probabilities.
- reg_alpha - L1 regularization term on the weights.
- reg_lambda - L2 regularization term on the weights.
- subsample - subsample ratio of the training instances used for each tree.
- random_state - random seed.

### Two ways to control overfitting:

1. Directly control model complexity: max_depth, min_child_weight and gamma.

2. Add randomness to make training robust to noise: subsample and colsample_bytree. You can also reduce the step size eta; remember to increase num_round when you do so.

Set tree_method to hist or gpu_hist for faster computation. Additional parameter for hist and gpu_hist: single_precision_histogram [default=false], use single precision to build histograms instead of double precision.

### Two ways to handle an imbalanced dataset
For common cases such as ads clickthrough logs, the dataset is extremely imbalanced.
1. If you care only about the overall performance metric (AUC) of your prediction: balance the positive and negative weights via scale_pos_weight and use AUC for evaluation.

2. If you care about predicting the right probability: you cannot re-balance the dataset. Set the parameter max_delta_step to a finite number (say 1) to help convergence.
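
For option 1, a common starting value for scale_pos_weight is the ratio of negative to positive samples. The sketch below only computes that value with NumPy on made-up labels (about 2% positives, an assumption) and collects it in a parameter dictionary; passing it on, for example as `xgb.XGBClassifier(**params)`, is left out so that the snippet runs without xgboost installed.

```python
import numpy as np

# Made-up binary labels with roughly 2% positives (assumption for illustration).
rng = np.random.default_rng(0)
y_demo = (rng.random(10_000) < 0.02).astype(int)

n_pos = int(y_demo.sum())
n_neg = int(len(y_demo) - n_pos)

# Typical starting point: sum(negative instances) / sum(positive instances).
scale_pos_weight = n_neg / n_pos

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",            # evaluate with AUC, as in option 1
    "scale_pos_weight": scale_pos_weight,
}
print(n_pos, n_neg, round(scale_pos_weight, 1))
```

The value is usually tuned further, since the plain ratio is only a first guess.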



In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_california_housing
import xgboost as xgb
import multiprocessing

if __name__ == "__main__":
    print("Parallel Parameter optimization")
    X, y = fetch_california_housing(return_X_y=True)
    xgb_model = xgb.XGBRegressor(n_jobs=multiprocessing.cpu_count() // 2)
    clf = GridSearchCV(xgb_model, {'max_depth': [2, 4, 6], 'n_estimators': [50, 100, 200]}, verbose=1, n_jobs=2)
    clf.fit(X, y)
    print(clf.best_score_)
    print(clf.best_params_)

hyperparameters = {
        "max_depth":"5",
        "eta":"0.2",
        "gamma":"4",
        "min_child_weight":"6",
        "subsample":"0.7",
        "verbosity":"1",
        "objective":"reg:squarederror",
        "num_round":"50"}

## 5. Voting
Soft Voting/Majority Rule classifier for unfitted estimators.
- **sklearn.ensemble.VotingClassifier**(estimators, *, voting='hard', weights=None, n_jobs=None, flatten_transform=True, verbose=False)

- **estimators**: list of (str, estimator) tuples; Invoking the fit method on the VotingClassifier will fit clones of those original estimators that will be stored in the class attribute self.estimators_. An estimator can be set to 'drop' using set_params.
- **voting**: {‘hard’, ‘soft’}, default=’hard’; If ‘hard’, uses predicted class labels for majority rule voting. Else if ‘soft’, predicts the class label based on the argmax of the sums of the predicted probabilities, which is recommended for an ensemble of well-calibrated classifiers.
- **weights**: array-like of shape (n_classifiers,), default=None; Sequence of weights (float or int) to weight the occurrences of predicted class labels (hard voting) or class probabilities before averaging (soft voting). Uses uniform weights if None.
- **n_jobs**: int, default=None; The number of jobs to run in parallel for fit. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- **flatten_transform**: bool, default=True; Affects shape of transform output only when voting=’soft’ If voting=’soft’ and flatten_transform=True, transform method returns matrix with shape (n_samples, n_classifiers * n_classes). If flatten_transform=False, it returns (n_classifiers, n_samples, n_classes).


In [None]:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
# multi_class is left at its default: with the lbfgs solver a multinomial loss is used for multiclass targets
clf1 = LogisticRegression(random_state=1)
clf2 = RandomForestClassifier(n_estimators=50, random_state=1)
clf3 = GaussianNB()
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='hard')
eclf1 = eclf1.fit(X, y)
eclf1.predict(X)
np.array_equal(eclf1.named_estimators_.lr.predict(X), eclf1.named_estimators_['lr'].predict(X))
eclf2 = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3)], voting='soft')
eclf2 = eclf2.fit(X, y)
eclf2.predict(X)
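
Hard and soft voting can disagree on the same sample. A small hand-computed sketch with made-up probabilities from three hypothetical classifiers, using uniform weights:

```python
import numpy as np

# Predicted class probabilities for ONE sample from three hypothetical classifiers.
# Columns are class 0 and class 1.
probas = np.array([
    [0.45, 0.55],   # classifier 1 leans slightly to class 1
    [0.45, 0.55],   # classifier 2 leans slightly to class 1
    [0.90, 0.10],   # classifier 3 is very sure about class 0
])
weights = np.ones(len(probas))  # uniform weights, like weights=None

# Hard voting: each classifier votes for its most likely class, majority wins.
votes = probas.argmax(axis=1)
hard_prediction = int(np.bincount(votes, weights=weights, minlength=probas.shape[1]).argmax())

# Soft voting: average the probabilities, then take the most likely class.
soft_prediction = int(np.average(probas, axis=0, weights=weights).argmax())

print("hard:", hard_prediction, "soft:", soft_prediction)
```

Two lukewarm votes win under hard voting, while one confident classifier tips the average under soft voting, which is why soft voting is recommended only for well-calibrated classifiers.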

# Model analysis
- Model Metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
- Validation curves: plotting scores to evaluate models: https://scikit-learn.org/stable/modules/learning_curve.html
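
As a concrete instance of a validation curve, here is a minimal sketch that scores a decision tree on iris for several max_depth values. The dataset, estimator and depth grid are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_iris(return_X_y=True)
depths = [1, 2, 3, 5, 8]

# One row per parameter value, one column per cross-validation fold.
train_scores, valid_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5,
)

for depth, train, valid in zip(depths, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"max_depth={depth}: train={train:.3f} validation={valid:.3f}")
```

A growing gap between the training and validation scores as the depth increases is the usual sign of overfitting.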

# NLP- Transformers

## 1. BERT

### a. RoBERTa

### b. ALBERT

### c. DistilBERT

## 2. GPT