## 3. Classification

### Q 3.1 How do Decision Trees work?
There are two components of Decision Trees:

    Entropy — It is regarded as the randomness of the system. It is a measure of node purity or impurity.
    
<img src="https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/entropy.png?raw=1">

**Entropy is maximum when p = 0.5, i.e., when both outcomes are equally likely.**

    Information gain — It is a reduction in entropy. It is the difference between the starting node’s uncertainty and the weighted impurity of the two child nodes.
    
<img src="https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/information_gain.png?raw=1" width = 500>

    Goal of Decision Tree: Maximize Information Gain and Minimize Entropy.
    
    A decision tree first splits on the feature with the highest information gain. This process repeats recursively until all child nodes are pure or until the information gain is zero.
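
The two quantities above can be sketched in a few lines of plain Python. This is a minimal illustration, not a tree implementation; the helper names `entropy` and `information_gain` and the toy labels are assumptions made for the example.

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy node: 5 "yes" and 5 "no", so p = 0.5 and entropy is at its maximum
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]          # mostly "yes"
right = ["yes"] + ["no"] * 4         # mostly "no"

print("parent entropy  :", entropy(parent))
print("information gain:", information_gain(parent, left, right))
```

A tree builder would evaluate `information_gain` for every candidate split and keep the one with the largest value.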

    ## Pros

    Easy to interpret
    Handles both categorical and continuous data well.
    Works well on a large dataset.
    Not sensitive to outliers.
    Non-parametric in nature.
    
    ## Cons

    These are prone to overfitting.
    It can be quite large, thus making pruning necessary.
    It can’t guarantee optimal trees.
    A single tree often gives lower prediction accuracy than ensemble methods such as random forests or boosting.
    Calculations can become complex when there are many class variables.
    High variance (the model can change a lot with a small change in the training data)

    Which is better, Gini or entropy?
    For a binary split, entropy ranges from 0 to 1 and Gini impurity ranges from 0 to 0.5.
    The different ranges do not make either measure better: both peak at p = 0.5 and usually choose very similar splits. Gini impurity is slightly cheaper to compute because it needs no logarithm, which is why it is a common default.
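
To make the comparison concrete, here is a small sketch that evaluates both measures for a binary node at a few values of p. The function names `gini` and `binary_entropy` are chosen for this example only.

```python
import math

def gini(p):
    """Gini impurity of a binary node where one class has probability p."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def binary_entropy(p):
    """Entropy (base 2) of a binary node; defined as 0 for a pure node."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={binary_entropy(p):.3f}")
```

Both curves are zero for a pure node and largest at p = 0.5; they differ mainly in scale and in the cost of the logarithm.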

### Q 3.2 Why is Random Forest better than CART (Classification and Regression Tree)?
https://towardsdatascience.com/why-random-forest-is-my-favorite-machine-learning-model-b97651fa3706
https://towardsdatascience.com/a-pragmatic-dive-into-random-forests-and-decision-trees-with-python-a850f6ed4ed
https://towardsai.net/p/machine-learning/why-choose-random-forest-and-not-decision-trees

**How does Random Forest work?**

Let's assume we have “m” features in our dataset:

1. Create n bootstrapped samples from the training data, which are used to train n decision trees. A bootstrapped sample is a randomised sample of the original training data with replacement, meaning we have some duplicates (and also samples missing entirely — known as out-of-bag).
2. Randomly choose “k” features satisfying the condition k < m.
3. Among the k features, pick the root node split by choosing the feature with the highest information gain.
4. Split the node into child nodes. The choice on making this split is based on the objective function used (e.g. Entropy, Gini index, classification error, Mean Squared Error).
5. Repeat the previous steps n times.
6. You end up with a forest constituting n trees.
7. Aggregate our results together to form an overall prediction. For regression, this is usually the mean. For classification, this will often be through majority voting.
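
The steps above can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is an illustration under stated assumptions, not how `RandomForestClassifier` is implemented internally: the synthetic dataset, the number of trees (25) and the use of `max_features="sqrt"` to pick the random feature subset at each split are all choices made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption for the example)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
forest = []
for i in range(n_trees):
    # Step 1: bootstrap sample, rows drawn with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Steps 2-4: each split only considers a random subset of k < m features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    forest.append(tree)

# Step 7: majority vote over the trees (labels are 0/1, n_trees is odd so no ties)
votes = np.stack([t.predict(X_test) for t in forest])   # shape (n_trees, n_test)
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
forest_acc = (y_pred == y_test).mean()
print("hand-built forest accuracy:", forest_acc)
```

In practice `sklearn.ensemble.RandomForestClassifier` does all of this in one call; the loop is only meant to expose the individual steps.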

<img src= "decision_tree_graph.png" width = 400>

    If we just combine many decision trees together, we have a bagging model. 
    Random Forest is an extension over bagging. It takes one extra step where in addition to taking the random subset of data, it also takes the random selection of features rather than using all features to grow trees. For a random forest, there is a key difference; decorrelation. This prevents one or two dominating features from making all individual trees similarly correlated. This reduces the model variance compared to a bagging model.
    
    
**Pros:**

1. Robust to outliers.
2. Works well with non-linear data.
3. Lower risk of overfitting.
4. Runs efficiently on a large dataset. Handles higher dimensionality data very well.
5. Often more accurate than a single decision tree and competitive with many other classification algorithms.
6. Handles missing values and maintains accuracy for missing data.

**Cons:**

1. Random forests are found to be biased while dealing with categorical variables.
2. Slow Training.
3. Less suitable than linear methods when the data has a lot of sparse features.
4. Since final prediction is based on the mean predictions from subset trees, it won’t give precise values for the regression model.

### Q 3.3 Can Random Forest be used for feature selection? 
Random forests typically consist of several hundred decision trees (for example 400 to 1,200), each of them built over a random extraction of the observations from the dataset and a random extraction of the features. Not every tree sees all the features or all the observations, which helps keep the trees de-correlated and therefore less prone to over-fitting. <br>

Each tree is also a sequence of yes-no questions based on a single feature or a combination of features. At each node (that is, at each question), the tree divides the dataset into 2 buckets, each of them hosting observations that are more similar among themselves and different from the ones in the other bucket. **Therefore, the importance of each feature is derived from how “pure” each of the buckets is.**

    For classification, the measure of impurity is either the Gini impurity or the information gain/entropy.
    For regression the measure of impurity is variance.

Therefore, when training a tree, it is possible to compute how much each feature decreases the impurity. **The more a feature decreases the impurity, the more important the feature is**. In random forests, the impurity decrease from each feature can be averaged across trees to determine the final importance of the variable.<br>

For example, **features that are selected at the top of the trees are in general more important than features that are selected at the end nodes of the trees, as generally the top splits lead to bigger information gains.** <br>


    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    # Fit a forest, then keep the features whose importance exceeds the threshold
    sel = SelectFromModel(RandomForestClassifier(n_estimators=100))
    sel.fit(X_train, y_train)

By default, SelectFromModel selects the features whose importance is greater than the mean importance of all the features, but we can alter this threshold if we want. <br>

To see which features are important we can use the get_support method on the fitted model. <br>

    sel.get_support()
    
We can now list, count and print the selected features. <br>

    selected_feat = X_train.columns[sel.get_support()]
    print(len(selected_feat))
    print(selected_feat)


### Q 3.4 Is there any possibility of duplicate trees because of random sampling of the data?
If you understand the difference between Bagging and Random Forest you will say **yes, there is a probability of having duplicate trees**

**How bagging works:**

1. It creates randomized (bootstrap) samples of the data set (just like random forest) and grows each tree on a different sample of the original data. The roughly 1/3 of observations left out of each sample (the out-of-bag observations) is used to estimate an unbiased OOB error.
2. It considers all the features at a node (for splitting).
3. Once the trees are fully grown, it uses averaging or voting to combine the resultant predictions.
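
The "about 1/3 left out" figure in step 1 can be checked numerically. The sketch below draws one bootstrap sample of row indices with NumPy and measures the share of rows that never appear; the sample size and seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# One bootstrap sample: n row indices drawn with replacement
sample = rng.integers(0, n, size=n)

# Rows that were never drawn are "out-of-bag" for this tree
oob_fraction = 1 - len(np.unique(sample)) / n
print("out-of-bag fraction:", oob_fraction)
print("theoretical limit 1/e:", np.exp(-1))
```

Each row is missed with probability (1 - 1/n)^n, which tends to 1/e (about 0.368) as n grows, hence the rule of thumb of roughly one third.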

The need for random forest surfaced after discovering that the bagging algorithm results in correlated trees when faced with a data set having strong predictors. Unfortunately, averaging several highly correlated trees doesn't lead to a large reduction in variance.

**But how are correlated trees created? Let's say a data set has a very strong predictor, along with other moderately strong predictors. In bagging, every tree grown would consider the very strong predictor at its root node, thereby resulting in trees similar to each other.**

The main difference between random forest and bagging is that random forest considers only a subset of predictors at a split. This results in trees with different predictors at top split, thereby resulting in **decorrelated trees** and more reliable average output. That's why we say random forest is robust to correlated predictors.

### Q 3.5  How to prune a decision tree
https://www.analyticsvidhya.com/blog/2020/10/cost-complexity-pruning-decision-trees/


Understanding the problem of Overfitting in Decision Trees and solving it by Minimal Cost-Complexity Pruning
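
A minimal sketch of minimal cost-complexity pruning with scikit-learn follows. It uses `DecisionTreeClassifier.cost_complexity_pruning_path` and the `ccp_alpha` parameter; the breast cancer dataset and the train/test split are assumptions picked for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas at which subtrees of the fully grown tree get pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

results = []
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)   # guard against tiny negative values from rounding
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    results.append((alpha, tree.tree_.node_count, tree.score(X_test, y_test)))

for alpha, nodes, acc in results:
    print(f"ccp_alpha={alpha:.5f}  nodes={nodes:3d}  test accuracy={acc:.3f}")
```

Larger `ccp_alpha` values give smaller trees; one would normally pick the value with the best validation (not test) score, for example via cross-validation.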



### Q 3.6 Bagging vs boosting
https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9#:~:text=Bagging%20(Bootstrap%20Aggregation)%20is%20used,to%20train%20their%20decision%20trees

The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner.
Let’s talk about a few techniques for building ensembles of decision trees:

    Bagging
    Boosting

1. Bagging (Bootstrap Aggregation) is used when our goal is to reduce the variance of a decision tree. The idea is to create several subsets of data from the training sample, chosen randomly with replacement. Each subset is then used to train its own decision tree. As a result, we end up with an ensemble of different models. The average of the predictions from the different trees is used, which is more robust than a single decision tree.

Random Forest is an extension over bagging. It takes one extra step where, in addition to taking the random subset of data, it also takes a random selection of features rather than using all features to grow trees. When you have many such random trees, it’s called a Random Forest.


2. Boosting is another ensemble technique to create a collection of predictors. In this technique, learners are trained sequentially, with early learners fitting simple models to the data and then analyzing the data for errors. In other words, we fit consecutive trees (on random samples), and at every step the goal is to reduce the net error from the prior trees. When an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly. Combining the whole set at the end converts the weak learners into a better performing model.

    Gradient Boosting= Gradient Descent + Boosting.
    
Gradient Boosting is an extension over boosting method. It uses gradient descent algorithm which can optimize any differentiable loss function. An ensemble of trees are built one by one and individual trees are summed sequentially. Next tree tries to recover the loss (difference between actual and predicted values).
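
As a rough sketch of the idea, the loop below builds a gradient boosting regressor by hand for the squared-error loss, where the negative gradient is simply the residual. The synthetic data, the learning rate of 0.1, the tree depth of 2 and the 100 rounds are assumptions for the example, not recommended settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())        # start from a constant model
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    residuals = y - prediction                # negative gradient of 1/2 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # add the new tree to the sum
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print("training MSE before boosting:", mse_history[0])
print("training MSE after boosting :", mse_history[-1])
```

For other differentiable losses, only the line computing `residuals` changes: it becomes the negative gradient of that loss with respect to the current prediction.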

**Pros :**
1. Supports different loss function.
2. Works well with interactions.

**Cons :**
1. Prone to over-fitting.
2. Requires careful tuning of different hyper-parameters


### Q 3.7 Why is XGBoost better than Random Forest?

Boosting is iterative learning: the model makes an initial prediction, analyses its own mistakes, and gives more weight in the next iteration to the data points it predicted wrongly. After the second iteration, it again analyses its wrong predictions and gives more weight to those data points in the following iteration. This process continues as a cycle. Hence, when a prediction is made, it is less likely to be the result of random chance and more likely to reflect patterns the model has learned from the data. A model that reduces chance predictions in this way is trustworthy most of the time.

Random forest is just a collection of trees in which each of them gives a prediction; finally, we collect the outputs from all the trees and consider the mean, median, or mode of this collection as the prediction of the forest, depending upon the nature of the data (either continuous or categorical). At a high level, this seems to be fine, but there is a high chance that many of the trees made predictions partly by chance, since each tree had its own circumstances like class imbalance, sample duplication, overfitting, inappropriate node splitting, etc.

1. **XG Boost straight away prunes the tree with a score called “Similarity score” before entering into the actual modeling purposes.** It considers the “Gain” of a split as the sum of the similarity scores of the child nodes minus the similarity score of the parent node. If the gain from a split is found to be minimal, it stops growing the tree to a greater depth, which can overcome the challenge of overfitting to a great extent. Meanwhile, a Random forest might overfit the data if the majority of the trees in the forest are provided with similar samples. If the trees are completely grown, the model may collapse once the test data is introduced. Therefore major consideration should be given to distributing all the elementary units of the sample with approximately equal participation to all trees.

2. **XG Boost is a good option for unbalanced datasets but we cannot trust random forest in these types of cases.** In applications like forgery or fraud detection, the classes will be almost certainly imbalanced where the number of authentic transactions will be huge when compared with unauthentic transactions. In XG Boost, when the model fails to predict the anomaly for the first time, it gives more preferences and weightage to it in the upcoming iterations thereby increasing its ability to predict the class with low participation but we cannot assure that random forest will treat the class imbalance with a proper process.

3. **One of the most important differences between XG Boost and Random forest is that the XG boost always gives more importance to functional space when reducing the cost of a model while Random Forest tries to give more preferences to hyperparameters to optimize the model.** A small change in the hyperparameter will affect almost all trees in the forest which can alter the prediction. Also, this is not a good approach when we expect test data with so many variations in real-time with a pre-defined mindset of hyperparameters for the whole forest but XG boost hyperparameters are applied to only one tree at the beginning which is expected to adjust itself in an efficient manner when iterations progress. Also, the XG boost needs only a very low number of initial hyperparameters (shrinkage parameter, depth of the tree, number of trees) when compared with the Random forest.

### Q 3.8 Random Forest Vs Adaboost Vs XGBoost
https://towardsdatascience.com/the-ultimate-guide-to-adaboost-random-forests-and-xgboost-7f9327061c4f


### Q3.9 How you avoid overfitting of GBM?


### Q4 How are you going to fix an imbalanced dataset? Explain SMOTE in detail!!

### Q5 What are the different types of missing data?
Missing completely at random (MCAR). When data are MCAR, the fact that the data are missing is independent of the observed and unobserved data. ...

Missing at random (MAR). ...

Missing not at random (MNAR).

### Q6 Generative Vs Discriminative model

In general, a Discriminative model models the **decision boundary between the classes**. A Generative model explicitly models the **actual distribution of each class**. In the end, both of them predict the conditional probability P(Animal | Features), but the two models learn different probabilities.

A Generative model learns the **joint probability distribution p(x,y)**. It predicts the conditional probability with the help of **Bayes' Theorem**. A Discriminative model learns the **conditional probability distribution p(y|x)**. Both of these models are generally used in supervised learning problems.
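
The difference shows up in what each fitted model stores. The sketch below fits a generative classifier (`GaussianNB`) and a discriminative one (`LogisticRegression`) on the same toy data; the two-blob dataset is an assumption for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=300, centers=2, random_state=0)

gen = GaussianNB().fit(X, y)             # models p(x | y) and p(y), i.e. the joint p(x, y)
disc = LogisticRegression().fit(X, y)    # models p(y | x) directly

# The generative model keeps a description of each class ...
print("class priors p(y):", gen.class_prior_)
print("per-class feature means:", gen.theta_)

# ... while the discriminative model only keeps the decision boundary
print("boundary weights:", disc.coef_, "intercept:", disc.intercept_)

# Both end up producing p(y | x) for a new point
print(gen.predict_proba(X[:1]), disc.predict_proba(X[:1]))
```

`GaussianNB` reaches p(y | x) through Bayes' theorem from the class-wise Gaussians, whereas `LogisticRegression` never estimates how x itself is distributed.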

https://medium.com/@mlengineer/joint-probability-vs-conditional-probability-fa2d47d95c4a

Examples:

###### Generative classifiers
    Naïve Bayes
    Bayesian networks
    Markov random fields
    Hidden Markov Models (HMM)
###### Discriminative Classifiers
    Logistic regression
    Support Vector Machine
    Traditional neural networks
    Nearest neighbour
    Conditional Random Fields (CRFs)
    
    Questions:

    What are the problems these models can solve?
    Which model learns joint probability?
    Which model learns conditional probability?
    What happens when we give correlated features in discriminative models?
    What happens when we give correlated features in generative models?
    Which model works very well even with little training data?
    Is it possible to generate data with the help of these models?
    Which model will take less time to get trained?
    Which model will take less time to predict output?
    Which model fails to work well if we give a lot of features?
    Which model is prone to overfitting very easily?
    Which model is prone to underfitting easily?
    What happens when training data is biased over one class in Generative Model?
    What happens when training data is biased over one class in Discriminative Models?
    Which model is more sensitive to outliers?
    Are you able to fill in the missing values in a dataset with the help of these models?




[Calling C/C++ from Python?](https://stackoverflow.com/questions/145270/calling-c-c-from-python)

[boost](https://www.boost.org/doc/libs/1_49_0/libs/python/doc/tutorial/doc/html/index.html)



### Q7 non-parametric vs parametric model
<br>
Machine learning can be summarized as learning a function (f) that maps input variables (X) to output variables (Y).

Y = f(X)

An algorithm learns this target mapping function from training data.

The form of the function is unknown, so our job as machine learning practitioners is to evaluate different machine learning algorithms and see which is better at approximating the underlying function.

Different algorithms make different assumptions or biases about the form of the function and how it can be learned.

#### parametric:
<br>
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model. No matter how much data you throw at a parametric model, it won’t change its mind about how many parameters it needs.

Some examples of parametric machine learning algorithms include:

<font color = 'green'> Logistic Regression </font><br>
Linear Discriminant Analysis(LDA)<br>
Perceptron<br>
<font color = 'green'> Naive Bayes </font> <br>
Simple Neural Networks<br>

Benefits of Parametric Machine Learning Algorithms:

Simpler: These methods are easier to understand and interpret results.<br>
Speed: Parametric models are very fast to learn from data.<br>
Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.<br>

Limitations of Parametric Machine Learning Algorithms:

Constrained: By choosing a functional form these methods are highly constrained to the specified form. <br>
Limited Complexity: The methods are more suited to simpler problems. <br>
Poor Fit: In practice the methods are unlikely to match the underlying mapping function. <br>

#### non-parametric:
<br>
Algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms. By not making assumptions, they are free to learn any functional form from the training data.

Nonparametric methods are good when you have a lot of data and no prior knowledge, and when you don’t want to worry too much about choosing just the right features.

Some more examples of popular nonparametric machine learning algorithms are:

<font color = 'green'> k-Nearest Neighbors </font> <br>
<font color = 'green'> Support Vector Machines </font> <br>
<font color = 'green'> Decision Trees </font> like CART and C4.5
<br>
Benefits of Nonparametric Machine Learning Algorithms:

Flexibility: Capable of fitting a large number of functional forms. <br>
Power: No assumptions (or weak assumptions) about the underlying function. <br>
Performance: Can result in higher performance models for prediction. <br>

Limitations of Nonparametric Machine Learning Algorithms:

More data: Require a lot more training data to estimate the mapping function. <br>
Slower: A lot slower to train as they often have far more parameters to train. <br>
Overfitting: More of a risk to overfit the training data and it is harder to explain why specific predictions are made. <br>
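
One way to see the parametric / non-parametric distinction is to count what the fitted model stores as the training set grows. In this sketch (synthetic noisy data, sizes 200 and 2000 chosen arbitrarily) logistic regression keeps the same number of parameters while an unpruned decision tree keeps growing.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

param_counts, node_counts = {}, {}
for n in (200, 2000):
    X, y = make_classification(n_samples=n, n_features=5, n_informative=3,
                               n_redundant=0, flip_y=0.1, random_state=0)
    logreg = LogisticRegression(max_iter=1000).fit(X, y)
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Parametric: one weight per feature plus an intercept, whatever n is
    param_counts[n] = logreg.coef_.size + logreg.intercept_.size
    # Non-parametric: the structure grows with the amount of data
    node_counts[n] = tree.tree_.node_count
    print(f"n={n}: logistic regression params={param_counts[n]}, tree nodes={node_counts[n]}")
```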

### Q8. How do you detect outliers?
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561

Outliers can be of two kinds: univariate and multivariate. Some of the most popular methods for outlier detection are:

##### Z-Score or Extreme Value Analysis (parametric)
###### Please have a look at MyPythonCode/SigTuple_Challenge
The z-score or standard score of an observation is a metric that indicates how many standard deviations a data point is from the sample’s mean, assuming a Gaussian distribution. This makes z-score a parametric method. Very frequently data points are not described well by a Gaussian distribution; this problem can often be reduced by applying a transformation to the data first (for example a log transform). <br>
When computing the z-score for each sample in the data set, a threshold must be specified. Some good ‘rule-of-thumb’ thresholds are 2.5, 3, 3.5 or more standard deviations. <br>
By ‘tagging’ or removing the data points that lie beyond a given threshold, we are classifying data into outliers and non-outliers.
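
A minimal sketch of z-score tagging with NumPy follows. The synthetic data (500 roughly Gaussian values plus two planted extremes) and the threshold of 3 are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# 500 "normal" observations plus two planted extreme values
data = np.concatenate([rng.normal(loc=50, scale=5, size=500), [95.0, 5.0]])

# z-score: distance from the mean, measured in standard deviations
z = (data - data.mean()) / data.std()

threshold = 3.0
is_outlier = np.abs(z) > threshold
outliers = data[is_outlier]
print("tagged as outliers:", outliers)
```

Note that the mean and standard deviation are themselves pulled by the extreme values, which is one reason robust variants (for example based on the median) are sometimes preferred.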

#### Probabilistic and Statistical Modeling (parametric)
#### Linear Regression Models (PCA, LMS)
#### Proximity Based Models (non-parametric) (like SVM)
#### Information Theory Models
#### High Dimensional Outlier Detection Methods (high dimensional sparse data)

### Q9. Difference between stochastic vs Deterministic model
In deterministic models, the output of the model is fully determined by the parameter values and the initial conditions. Stochastic models possess some inherent randomness: the same set of parameter values and initial conditions will lead to an ensemble of different outputs.
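
A tiny sketch of the distinction, using a made-up growth model: the functions `deterministic_growth` and `stochastic_growth` and all their parameter values are hypothetical and exist only for this illustration.

```python
import numpy as np

def deterministic_growth(x0, rate, steps):
    """Same inputs always give the same output."""
    x = x0
    for _ in range(steps):
        x = x * (1 + rate)
    return x

def stochastic_growth(x0, rate, steps, rng):
    """The growth rate is perturbed by random noise at every step."""
    x = x0
    for _ in range(steps):
        x = x * (1 + rate + rng.normal(scale=0.05))
    return x

det_a = deterministic_growth(100, 0.02, 10)
det_b = deterministic_growth(100, 0.02, 10)
print("deterministic:", det_a, det_b)          # identical

rng = np.random.default_rng(0)
runs = [stochastic_growth(100, 0.02, 10, rng) for _ in range(5)]
print("stochastic   :", runs)                  # an ensemble of different outputs
```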

### Q10. Difference between SGD and GD
https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent

In GD optimization, we compute the cost gradient based on the complete training set; hence, we sometimes also call it batch GD. In case of very large datasets, using GD can be quite costly since we are only taking a single step for one pass over the training set -- thus, the larger the training set, the slower our algorithm updates the weights and the longer it may take until it converges to the global cost minimum (note that the SSE cost function is convex).
In Stochastic Gradient Descent (SGD; sometimes also referred to as iterative or on-line GD), we don't accumulate the weight updates as we've seen above for GD. <br>

![title](https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/MyPythonCode/resources/GD.png?raw=1)


Instead, we update the weights after each training sample:

![title](https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/MyPythonCode/resources/SGD.png?raw=1)

Here, the term <font color='red'> "stochastic" </font> comes from the fact that the gradient based on a single training sample is a "stochastic approximation" of the "true" cost gradient. Due to its stochastic nature, the path towards the global cost minimum is not "direct" as in GD, but may go "zig-zag" if we are visualizing the cost surface in a 2D space. However, it has been shown that SGD almost surely converges to the global cost minimum if the cost function is convex
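
The two update rules can be compared on a small least-squares problem. This is a sketch under assumptions: the synthetic data (true slope 3, intercept 2), the learning rates and the epoch counts are picked for the example, and `batch_gd` and `sgd` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=200)
Xb = np.hstack([X, np.ones((200, 1))])        # extra column of ones for the intercept

def batch_gd(Xb, y, lr=0.1, epochs=100):
    """One weight update per pass, using the gradient over the whole training set."""
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        grad = Xb.T @ (Xb @ w - y) / len(y)
        w -= lr * grad
    return w

def sgd(Xb, y, lr=0.01, epochs=20, seed=0):
    """One weight update per training sample, visited in random order."""
    order_rng = np.random.default_rng(seed)
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for i in order_rng.permutation(len(y)):
            grad = (Xb[i] @ w - y[i]) * Xb[i]   # gradient from a single sample
            w -= lr * grad
    return w

w_gd = batch_gd(Xb, y)
w_sgd = sgd(Xb, y)
print("batch GD weights:", w_gd)
print("SGD weights     :", w_sgd)
```

Both end up close to the true slope and intercept here; SGD makes 200 small, noisy updates per pass where batch GD makes a single exact one.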


### Q11 What exactly is Bias-variance tradeoff?

1. Bias is the error from wrong assumptions we make when building the learning algorithm. It is the primary reason for underfitting. When the bias is high, the model misses much of the relationship between the regressors and the response variable, hence an underfit model.

2. Variance is the error due to sensitivity to fluctuations in the training set. When the variance is high, the model is going to overfit.
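
A small sketch of both failure modes, fitting polynomials of increasing degree to noisy samples of a sine curve. The data-generating function, the sample sizes and the degrees (1, 4, 12) are assumptions for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

errors = {}
for degree in (1, 4, 12):
    poly = Polynomial.fit(x_train, y_train, degree)     # least-squares polynomial fit
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

The straight line (degree 1) has high error on both sets, which is the high-bias case. Training error keeps falling as the degree grows, and a gap opening between training and test error at high degree is the typical sign of high variance.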

### Q12. How would you design a recommendation system?
Please see the campaign recommender.

### Q13 What is data normalization and why do we need it? 
Data normalization is a very important preprocessing step, used to rescale values to fit in a specific range to assure better convergence during backpropagation. In general, it boils down to **subtracting each feature's mean from its data points and dividing by that feature's standard deviation**. If we don’t do this, then some of the features (those with high magnitude) will be weighted more in the cost function (if a higher-magnitude feature changes by 1%, then that change is pretty big, but for smaller features it’s quite insignificant). Data normalization makes all features weighted equally.
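
The "subtract the mean, divide by the standard deviation" recipe can be sketched directly in NumPy. The tiny two-feature array is made up for the example; `sklearn.preprocessing.StandardScaler` performs the same per-feature computation.

```python
import numpy as np

# Two features on very different scales (toy values)
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 5000.0]])

mean = X.mean(axis=0)     # per-feature mean
std = X.std(axis=0)       # per-feature standard deviation
X_standardized = (X - mean) / std

print(X_standardized)
print("new means:", X_standardized.mean(axis=0))
print("new stds :", X_standardized.std(axis=0))
```

After the transformation both columns have mean 0 and standard deviation 1, so neither dominates the cost function purely because of its units.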

### Q14 Difference between data normalization and data scaling.

One of the reasons that it's easy to get confused between scaling and normalization is because the terms are sometimes used interchangeably and, to make it even more confusing, they are very similar! In both cases, you're transforming the values of numeric variables so that the transformed data points have specific helpful properties. **The difference is that, in scaling, you're changing the range of your data while in normalization you're changing the shape of the distribution of your data**. 

#### Scaling
This means that you're transforming your data so that it fits within a specific scale, like 0-100 or 0-1. You want to scale data when you're using methods based on measures of how far apart data points are, like <font color='red'>support vector machines (SVM) or k-nearest neighbors (KNN)</font>. With these algorithms, a change of "1" in any numeric feature is given the same importance.


For example, you might be looking at the prices of some products in both Yen and US Dollars. One US Dollar is worth about 100 Yen, but if you don't scale your prices, methods like SVM or KNN will consider a difference in price of 1 Yen as important as a difference of 1 US Dollar! This clearly doesn't fit with our intuitions of the world. With currency, you can convert between currencies. But what about if you're looking at something like height and weight? It's not entirely clear how many pounds should equal one inch (or how many kilograms should equal one meter).

By scaling your variables, you can help compare different variables on equal footing.

#### Normalization
Scaling just changes the range of your data. Normalization is a more radical transformation. The point of normalization is to change your observations so that they can be described by a normal distribution.

##### Normal distribution: 
Also known as the *"bell curve"*, this is a specific statistical distribution where
1. roughly equal numbers of observations fall above and below the mean,
2. the mean and the median are the same,
3. and there are more observations closer to the mean.

The normal distribution is also known as the Gaussian distribution.


In general, you'll only want to normalize your data if you're going to be using a machine learning or statistics technique that assumes your data is normally distributed. Some examples of these include t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes. (Pro tip: any method with "Gaussian" in the name probably assumes normality.)


    from sklearn.datasets import fetch_california_housing
    from sklearn.preprocessing import MinMaxScaler, PowerTransformer

    dataset = fetch_california_housing()
    X_full, y_full = dataset.data, dataset.target

    # Take only 2 features to make visualization easier.
    # Feature 0 has a long-tailed distribution.
    # Feature 5 has a few but very large outliers.
    X = X_full[:, [0, 5]]

    # Scaling: squeeze each feature into [0, 1]; the shape of the distribution is unchanged.
    scaling = MinMaxScaler().fit_transform(X)

    # Normalizing: reshape each feature towards a normal distribution.
    # Note: sklearn's Normalizer only rescales each row to unit norm, so PowerTransformer is used instead.
    normalizing = PowerTransformer().fit_transform(X)

    # For plotting, see
    # https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py




### Q15 Explain dimensionality reduction, where it’s used, and its benefits? or Why do we need feature selection?

Dimensionality reduction is the **process of reducing the number of feature variables** under consideration by obtaining a set of principal variables which are basically the important features. Importance of a feature depends on how much the feature variable contributes to the information representation of the data and depends on which technique you decide to use. <br>

Benefits of dimensionality reduction for a data set may be: <br>
1. Reduce the storage space needed
2. Speed up computation: fewer dimensions mean less computing, and fewer dimensions can also allow the use of algorithms unfit for a large number of dimensions.
3. Remove redundant features, for example no point in storing a terrain’s size in both sq meters and sq miles (maybe data gathering was flawed).
4. Reducing a data’s dimension to 2D or 3D may allow us to plot and visualize it, maybe observe patterns, give us insights.
5. Too many features or too complex a model can lead to overfitting.

### Q16 Bias - variance dilemma & No of features.  or What’s the trade-off between bias and variance?

#### High Bias
1. Pays little attention to the data
2. over simplified
3. High error on training set (low r^2, high SSE)

#### High Variance
1. Pays too much attention to data(does not generalize well)
2. overfits
3. Much higher error on test set than on training set

When you use **few features**, high bias can happen. When you use **too many features without "Regularization"**, high variance can happen. Balance is achieved with enough features to fit the training data well, but regularized so that the model still generalizes. The optimum number of features leads to the best-fit model.

From the Lasso regression above we know that we have to decrease the value of our cost function.

minimize(cost function) = SSE + lambda|beta|

1. When the number of features increases, our SSE decreases because the model fits the training data better
2. But when we increase the number of features, we have to pay a penalty (lambda|beta|)

![best_fit_model](https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/MyPythonCode/resources/best_fit_model.png?raw=1)

    from sklearn import linear_model

    # Lasso = least squares + L1 penalty; alpha plays the role of lambda
    regressor = linear_model.Lasso(alpha=0.1)
    regressor.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])
    # Echoed estimator (older scikit-learn version):
    # Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
    #    normalize=False, positive=False, precompute=False, random_state=None,
    #    selection='cyclic', tol=0.0001, warm_start=False)
    print(regressor.coef_)
    # Output: [ 0.85  0.  ]

### Q16.1 How many feature really matters in above code?
Since regressor.coef_ returns [ 0.85  0.  ], only 1 feature is useful because the coefficient of the 2nd feature is 0.

### Q16.2 If a decision tree is overfit then would you expect the accuracy on a test set to be very high or pretty low?
Ans: Overfit, accuracy on training data is high but accuracy on test data is low.

### Q16.3 How to overfit an algorithm?
A classic way to overfit an algorithm is by using lots of features but not enough training data.

### Q17 What is PCA and why we should use it?
PCA stands for principal component analysis. It is used for dimensionality reduction (especially in images).
Somewhat unsurprisingly, reducing the dimension of the feature space is called “dimensionality reduction.” There are many ways to achieve dimensionality reduction, but most of these techniques fall into one of two classes:

1. Feature Elimination
2. Feature Extraction

Feature elimination is what it sounds like: we reduce the feature space by eliminating features. As a disadvantage, though, you gain no information from those variables you’ve dropped. 

Feature extraction, however, doesn’t run into this problem. PCA based on the feature extraction strategy. PCA is basically moving our centre of axis to the centre of the data. That is moving towards the maximum variance of the data. Because maximum variance has the highest information.

For example, let's say we have 2 measurable features, "safety problem" and "school ranking". We can combine them into a latent feature, "neighborhood quality". Basically, we are converting 2D to 1D by placing a line through the data and projecting the data points onto it.
So, if we have 10 data points scattered in 2D space, after PCA we will have 10 points on a single line.

Formula: project each old data point onto the direction of maximal variance to get its new transformed value. This is the same line that minimizes the distance between the points and their projections.

Max number of principal components: min(no_of_features, no_of_data_points). Let's say we have 4 features and 100 data points; then we will have at most 4 principal components.
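
The 2D-to-1D projection described above can be sketched with NumPy. The ten data points are made up for illustration, and computing PCA through an SVD of the centered data is one common route, not necessarily the one the original course used.

```python
import numpy as np

# 10 hypothetical neighborhoods: (safety score, school ranking), roughly correlated.
X = np.array([
    [2.0, 1.9], [3.1, 3.0], [4.2, 3.8], [5.0, 5.1], [6.1, 5.9],
    [7.0, 7.2], [8.1, 7.9], [9.0, 9.1], [1.0, 1.2], [5.5, 5.4],
])

# 1. Center the data so the new origin sits at the mean of the points.
X_centered = X - X.mean(axis=0)

# 2. The principal directions are the right singular vectors of the centered data.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
first_pc = Vt[0]                       # direction of maximal variance

# 3. Project every 2D point onto that direction: 10 points in 2D -> 10 values in 1D.
neighborhood_quality = X_centered @ first_pc

# Share of the total variance captured by each component.
explained_variance_ratio = S**2 / np.sum(S**2)
print(neighborhood_quality.shape)      # (10,)
print(explained_variance_ratio)
```

Because the two features move together, nearly all of the variance lands on the first component, which is why dropping the second one loses little information.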

### What is SVD (Singular Value Decomposition)
[SVD](https://medium.com/data-science-group-iitr/singular-value-decomposition-elucidated-e97005fb82fa) <br>
A linear algebra method that decomposes a matrix into 3 resultant matrices to reduce information redundancy and noise. Most commonly used for PCA.

A = U x S x V^T

A (m x n) = original matrix <br>
U (m x f) = left orthogonal matrix, holds important, nonredundant information about observations <br>
S (f x f) = diagonal matrix of singular values, sorted from largest to smallest; each value measures how much of A its component captures <br>
V^T (f x n) = transpose of the right orthogonal matrix V, holds important, nonredundant information on features

Here f = min(m, n) for the reduced decomposition.
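
A minimal NumPy sketch of the decomposition, using a random 6 x 4 matrix as a stand-in for real data. Note that `np.linalg.svd` returns the singular values as a 1D array and the right factor already transposed.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))            # m = 6 observations, n = 4 features

# Reduced SVD: U is (m x f), s holds the f singular values, Vt is (f x n), f = min(m, n).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)      # (6, 4) (4,) (4, 4)

# Multiplying the three factors back together recovers A.
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))       # True

# Keeping only the k largest singular values gives a rank-k approximation of A.
k = 2
A_rank_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.matrix_rank(A_rank_k)) # 2
```

The rank-k truncation is the compression step: the discarded singular values are the smallest ones, so they contribute the least to A.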

### PCA for Facial Recognition

What makes facial recognition in pictures good for PCA? <br>
1. Pictures of faces generally have high **input dimension**. Each 600x600 colour channel of a (600x600x3) picture has up to 600 singular values in its SVD (Singular Value Decomposition), yet the picture can still be pretty clear when rebuilt from only 100 of those 600 singular values.

2. Faces have general patterns that could be captured in a smaller number of dimensions.

### What is a good way to figure out how many PCs to use?
Train on different numbers of PCs and see how the accuracy responds. Cut off when adding more PCs doesn't buy more discrimination.
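
The approach above needs a trained model per setting. A related, cheaper heuristic (a different technique, shown here only as a first pass) is to look at the cumulative explained variance and keep the smallest number of components that crosses a chosen threshold. The synthetic data and the 95% threshold below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 10 features, but only 3 underlying directions plus small noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

X_centered = X - X.mean(axis=0)
s = np.linalg.svd(X_centered, compute_uv=False)

# Fraction of total variance captured by the first k components, for k = 1..10.
cumulative = np.cumsum(s**2) / np.sum(s**2)

# Smallest k whose components together explain at least 95% of the variance.
n_components = int(np.searchsorted(cumulative, 0.95) + 1)
print(np.round(cumulative, 3))
print(n_components)
```

The curve flattens after the third component, mirroring the "adding more PCs doesn't buy more" cut-off; the final choice should still be confirmed against model accuracy.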

Order of selection:
1. PCA 
2. Feature selection

### Q18. What is cross validation and why we should use it?

To avoid overfitting, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test.
See the following code example of train test split:
<font color = 'green'>
from sklearn import svm, datasets <br>
from sklearn.model_selection **import train_test_split** <br>
iris = datasets.load_iris() <br>
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)  <br>
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train) <br>
clf.score(X_test, y_test)   <br>                        
0.96 </font> <br>

When evaluating different settings (“hyperparameters”) for estimators, such as the **C setting** that must be manually set for an **SVM**, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, **knowledge about the test set can “leak” into the model** and evaluation metrics no longer report on generalization performance. <br>

To solve this problem, yet another part of the dataset can be held out as a so-called **“validation set”:** training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set. <br>

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets. <br>

A solution to this problem is a procedure called **cross-validation (CV for short)**. A test set should still be held out for final evaluation, but the **validation set is no longer needed** when doing CV. In the basic approach, called **k-fold CV**, the training set is split into k smaller sets.

<font color = 'green'>
from sklearn.model_selection import cross_val_score <br>
clf = svm.SVC(kernel='linear', C=1) <br>
scores = cross_val_score(clf, iris.data, iris.target, cv=5)<br>
scores <br>
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ]) <br>

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)) <br>
Accuracy: 0.98 (+/- 0.03) </font> <br>
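
To make the k-fold idea concrete, here is a plain-Python sketch of how the indices are split. `k_fold_indices` is a hypothetical helper written for this note, not a scikit-learn function. Each sample is used for validation exactly once and for training in the other k-1 rounds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

In practice the data is usually shuffled or stratified before splitting; for classifiers, `cross_val_score` with an integer `cv` uses stratified folds.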


### Q19. What is GridSearchCV
GridSearchCV is a way of systematically working through multiple combinations of parameter values, cross-validating as it goes to determine which combination gives the best performance.

<font color = 'green'>
from sklearn import svm, datasets <br> 
from sklearn.model_selection import GridSearchCV <br> <br>
iris = datasets.load_iris() <br>
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]} <br>
svc = svm.SVC() <br>
clf = GridSearchCV(svc, parameters) <br>
clf.fit(iris.data, iris.target) <br>
clf.best_params_ </font> <br>

     

### Q20. What are evaluation metrics? What different types of evaluation metrics are available?

Evaluating your machine learning algorithm is an essential part of any project. Your model may give you satisfying results when evaluated using one metric, say accuracy_score, but poor results when evaluated against another, such as logarithmic_loss. Most of the time we use classification accuracy to measure the performance of our model, but it is not enough to truly judge the model. <br>

The following are the different types of evaluation metrics: <br>

**1. Classification Accuracy:** <br>

accuracy = number of correct predictions / total number of predictions made

##### sklearn acc
prediction = algo.predict(X_test) <br>
acc = metrics.accuracy_score(y_test, prediction) <br>

Shortcomings of accuracy:
> not ideal for skewed classes (when you have few data points for a particular class)<br>
> hides false positives (guessing innocent as guilty)<br>
> hides false negatives (guessing guilty as innocent)<br><br>
> The real problem arises when the cost of misclassifying the minority-class samples is very high. If we deal with a rare but fatal disease, the cost of failing to diagnose the disease of a sick person is much higher than the cost of sending a healthy person to more tests. <br>
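
A small sketch of the skewed-class problem, using made-up numbers: 1000 patients of whom 10 are sick, and a "classifier" that always predicts healthy.

```python
# 1000 patients, 10 of whom are sick (label 1); the rest are healthy (label 0).
y_true = [1] * 10 + [0] * 990

# A useless "classifier" that always predicts healthy.
y_pred = [0] * 1000

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)   # share of sick patients actually found

print(accuracy)  # 0.99
print(recall)    # 0.0
```

The model scores 99% accuracy while missing every sick patient, which is why accuracy alone is misleading on imbalanced data.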

**2. Logarithmic loss or log loss** <br>
Log Loss has no upper bound and it exists on the range [0, ∞).
In general, minimising Log Loss gives greater accuracy for the classifier.

Logarithmic Loss or Log Loss, works by penalising the false classifications. It works well for multi-class classification. When working with Log Loss, the classifier must assign probability to each class for all the samples. Suppose, there are N samples belonging to M classes, then the Log Loss is calculated as below : <br>

![log_loss](https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/MyPythonCode/resources/log_loss.gif?raw=1)

where,

y_ij, indicates whether sample i belongs to class j or not

p_ij, indicates the probability of sample i belonging to class j
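
A plain-Python sketch, assuming the formula pictured above is the standard multi-class log loss, -(1/N) Σ_i Σ_j y_ij log(p_ij). Because y_ij is 1 only for the true class, the inner sum reduces to the log of the probability assigned to the true class. The example probabilities are made up.

```python
import math

def log_loss(y_true, probs):
    """Mean negative log-probability assigned to the true class.

    y_true: list of true class indices, one per sample.
    probs:  list of per-sample probability lists (each sums to 1).
    """
    eps = 1e-15  # clip to avoid log(0)
    total = 0.0
    for label, p in zip(y_true, probs):
        p_true = min(max(p[label], eps), 1 - eps)
        total += -math.log(p_true)
    return total / len(y_true)

y_true = [0, 2, 1]
confident_right = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8], [0.2, 0.7, 0.1]]
confident_wrong = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]

print(log_loss(y_true, confident_right))
print(log_loss(y_true, confident_wrong))
```

The second set of predictions is confidently wrong, so it receives a much larger loss than the first.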

**3. Confusion Matrix**



![accuracy_confusion_matrix](https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/MyPythonCode/resources/confusion_matrix_2.png?raw=1)

![accuracy_confusion_matrix](https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/MyPythonCode/resources/accuracy_confusion_metrics1.gif?raw=1)

**sensitivity, recall, hit rate, or true positive rate (TPR)**
${\displaystyle \mathrm {TPR} ={\frac {\mathrm {TP} }{P}}={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FN} }}=1-\mathrm {FNR} }$ <br>

**specificity, selectivity or true negative rate (TNR)**
${\displaystyle \mathrm {TNR} ={\frac {\mathrm {TN} }{N}}={\frac {\mathrm {TN} }{\mathrm {TN} +\mathrm {FP} }}=1-\mathrm {FPR} }$

**precision or positive predictive value (PPV)**
${\displaystyle \mathrm {PPV} ={\frac {\mathrm {TP} }{\mathrm {TP} +\mathrm {FP} }}}$

**accuracy (ACC)**
${\displaystyle \mathrm {ACC} ={\frac {\mathrm {TP} +\mathrm {TN} }{P+N}}={\frac {\mathrm {TP} +\mathrm {TN} }{\mathrm {TP} +\mathrm {TN} +\mathrm {FP} +\mathrm {FN} }}}$

**miss rate or false negative rate (FNR)**
${\displaystyle \mathrm {FNR} ={\frac {\mathrm {FN} }{P}}={\frac {\mathrm {FN} }{\mathrm {FN} +\mathrm {TP} }}=1-\mathrm {TPR} }$

**fall-out or false positive rate (FPR)**
${\displaystyle \mathrm {FPR} ={\frac {\mathrm {FP} }{N}}={\frac {\mathrm {FP} }{\mathrm {FP} +\mathrm {TN} }}=1-\mathrm {TNR} }$

**F1 score**
${\displaystyle F_{1}=2\cdot {\frac {\mathrm {PPV} \cdot \mathrm {TPR} }{\mathrm {PPV} +\mathrm {TPR} }}={\frac {2\mathrm {TP} }{2\mathrm {TP} +\mathrm {FP} +\mathrm {FN} }}}$

<br>
Example:

![confusion_matrix](https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/MyPythonCode/resources/confusion_matrix.png?raw=1)

accuracy = (TP + TN) / Total = (100 + 50) / 165 = 0.91 <br>
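
The formulas above can be checked with a few lines of Python. TP = 100, TN = 50 and the total of 165 come from the example; FP = 10 and FN = 5 are assumed here for illustration so that the four counts add up to 165.

```python
# TP, TN and the total of 165 come from the example; FP and FN are assumed.
TP, TN, FP, FN = 100, 50, 10, 5

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)              # PPV
recall = TP / (TP + FN)                 # sensitivity / TPR
specificity = TN / (TN + FP)            # TNR
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity), ("f1", f1)]:
    print(f"{name}: {value:.3f}")
```

Under these assumed counts the F1 score also equals 2TP / (2TP + FP + FN) = 200 / 215, matching the closed form given above.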


**4. Area Under Curve**

Area Under Curve(AUC) is one of the most widely used metrics for evaluation. It is used for binary classification problem. AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. <br>

False Positive Rate and True Positive Rate both have values in the range [0, 1]. FPR and TPR are both computed at threshold values such as (0.00, 0.02, 0.04, …., 1.00) and a graph is drawn. AUC is the area under the curve obtained by plotting True Positive Rate against False Positive Rate at these different thresholds in [0, 1]. <br>

![ROC](https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/MyPythonCode/resources/ROC.png?raw=1)
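
The ranking interpretation of AUC can be computed directly by comparing every positive score with every negative score. `auc_by_ranking` is a hypothetical helper and the scores are made up; counting ties as half a win is the usual convention.

```python
def auc_by_ranking(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

pos_scores = [0.9, 0.8, 0.4]   # classifier scores for positive examples
neg_scores = [0.7, 0.3, 0.2]   # classifier scores for negative examples

print(auc_by_ranking(pos_scores, neg_scores))  # 8 of 9 pairs correct, about 0.889
```

Only the pair (0.4, 0.7) is ranked the wrong way round, so the AUC is 8/9. This pairwise count is quadratic in the number of samples and is meant for intuition, not for large datasets.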

**5. F1 score**

F1 Score is the Harmonic Mean between precision and recall. The range for F1 Score is [0, 1]. It tells you how precise your classifier is (how many of the instances it flags as positive are correct), as well as how robust it is (it does not miss a significant number of instances).

High precision but low recall gives you an extremely accurate classifier that nevertheless misses a large number of instances that are difficult to classify. The greater the F1 Score, the better the performance of our model.

F1 Score tries to find the balance between precision and recall.

**6. Mean Absolute Error**
Mean Absolute Error is the average of the absolute difference between the Original Values and the Predicted Values. It gives us the measure of how far the predictions were from the actual output.

![MAE](https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/MyPythonCode/resources/MAE.gif?raw=1)

**7. Mean Squared Error**
Mean Squared Error(MSE) is quite similar to Mean Absolute Error, the only difference being that MSE takes the average of the square of the difference between the original values and the predicted values. 

![MSE](https://github.com/rnomadic/ML_Learning/blob/main/MyPythonCode/MyPythonCode/resources/MSE.gif?raw=1)
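
Both regression metrics are easy to compute by hand. The four values below are made up for illustration.

```python
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE averages the absolute errors; MSE averages the squared errors.
mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)

print(mae)  # 0.5
print(mse)  # 0.375
```

Squaring means a single large error weighs more heavily in MSE than in MAE, so MSE is the more outlier-sensitive of the two.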

### Q21. How is the k-nearest neighbour(KNN) algorithm different from k-means clustering?

KNN is a supervised learning algorithm used for classification, which means its training data is labeled. The goal of KNN is to classify an unlabeled point by looking at the labels of its k nearest labeled neighbours and taking a majority vote. K-means is an unsupervised learning algorithm used for clustering. It requires only a set of unlabeled points and the number of clusters k (plus, optionally, a convergence threshold). The algorithm then repeatedly assigns each point to its nearest cluster centre and recomputes each centre as the mean of the points assigned to it, until the clusters stop changing. The primary difference is that KNN needs labeled data (supervised learning) while k-means works on unlabeled data (unsupervised learning).
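
A compact NumPy sketch of both algorithms side by side. The six points, the naive "first k points" initialization and the fixed iteration count are simplifications chosen for illustration; library implementations use smarter initialization and a convergence check.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Supervised: label x by majority vote among its k nearest labeled points."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def k_means(X, k=2, n_iter=10):
    """Unsupervised: alternate between assigning points and recomputing centroids."""
    centroids = X[:k].copy()             # naive initialization: the first k points
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignment = np.argmin(distances, axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            if np.any(assignment == j):
                centroids[j] = X[assignment == j].mean(axis=0)
    return assignment, centroids

X = np.array([[1.0, 1.0], [9.0, 9.0], [1.2, 0.8], [0.9, 1.1], [8.8, 9.2], [9.1, 8.9]])
y = np.array([0, 1, 0, 0, 1, 1])         # labels are used by KNN only

print(knn_predict(X, y, np.array([1.1, 1.0])))   # 0
clusters, centers = k_means(X, k=2)
print(clusters)
```

Note that `k_means` never sees `y`: it recovers the two groups from the geometry alone, whereas `knn_predict` cannot work without the labels.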

### Q22. What’s the difference between Type I and Type II error?

Type I error is equivalent to a false positive. Type II error is equivalent to a false negative. A Type I error is the rejection of a null hypothesis which ought to be accepted. A Type II error is the acceptance of a null hypothesis which ought to be rejected. Let's take an example from biometrics, where the null hypothesis is that the person scanning is authorized. When someone scans their fingers for a biometric scan, a Type I error is the possibility of rejection even with an authorized match. A Type II error is the possibility of acceptance even with a wrong/unauthorized match.

### Q23. How do you handle missing or corrupted data in a dataset?

You could find missing/corrupted data in a dataset and either drop those rows or columns, or decide to replace them with another value. In Pandas, there are two very useful methods: isnull() and dropna() that will help you find columns of data with missing or corrupted data and drop those values. If you want to fill the invalid values with a placeholder value (for example, 0), you could use the fillna() method.

See more detail in **Data-processing**.

[pandas](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['alfred', 'disuja', 'usha'],
                   'toy': [np.nan, 'Bottomwhile', 'wagon'],
                   'born': [pd.NaT, pd.Timestamp("1979-06-07"), pd.NaT]})

df.head()

| | born | name | toy |
|---|---|---|---|
| 0 | NaT | alfred | NaN |
| 1 | 1979-06-07 | disuja | Bottomwhile |
| 2 | NaT | usha | wagon |


In [None]:
# Drop the rows where at least one element is missing.
df.dropna()

| | born | name | toy |
|---|---|---|---|
| 1 | 1979-06-07 | disuja | Bottomwhile |


In [None]:
# Drop the columns where at least one element is missing.
df.dropna(axis=1)

| | name |
|---|---|
| 0 | alfred |
| 1 | disuja |
| 2 | usha |


In [None]:
# fillna method
df = pd.read_csv('train.csv')
df.head()

# Show how many null values each column has
df.isnull().sum()

# Missing values in 'Item_Weight' and 'Outlet_Size' need to be imputed.
# Impute Item_Weight with the column mean.
mean = df['Item_Weight'].mean()

# Assign the result back instead of using inplace=True on a column selection,
# which may not update df in recent pandas versions.
df['Item_Weight'] = df['Item_Weight'].fillna(mean)

# Impute Outlet_Size with the mode.
# mode() returns the most repeated value(s) of a column as a Series,
# so take its first entry.
mode = df['Outlet_Size'].mode()
df['Outlet_Size'] = df['Outlet_Size'].fillna(mode[0])

### Q24. How would you go about doing an Exploratory Data Analysis (EDA)?

The goal of an EDA is to gather some insights from the data before applying your predictive model, i.e. gain some information. Basically, you want to do your EDA in a coarse-to-fine manner. We start by gaining some high-level global insights.
1. Check for imbalanced classes. Look at the **mean and variance** of each class.

2. Check out the first few rows to see what it’s all about. Run a **pandas df.info()** to see which features are continuous, categorical, their type (int, float, string). 

3. Next, **drop unnecessary columns** that won’t be useful in analysis and prediction. These can simply be columns that look useless, ones where many rows have the same value (i.e. they don’t give us much information), or ones that are missing a lot of values. We can also **fill in missing values with the most common value in that column, or the median**.

4. Now we can start making some **basic visualizations**. Start with high-level stuff. Do some bar plots for features that are categorical and have a small number of groups. Bar plots of the final classes. Look at the most “general features”. Create some visualizations about these individual features to try and gain some basic insights.

5. Now we can start to get more specific. Create **visualizations between features**, two or three at a time. **How are features related to each other**? 

6. You can also do a **PCA to see which features contain the most information**. **Group some features together** as well to see their relationships. For example, what happens to the classes when A = 0 and B = 0? How about A = 1 and B = 0? Compare different features. For example, if feature A can be either “Female” or “Male” then we can plot feature A against which cabin they stayed in to see if Males and Females stay in different cabins. 

7. Beyond bar, scatter, and other basic plots, we can do a **PDF/CDF, overlaid plots**, etc. Look at some statistics like **distribution, p-value**, etc.

8. Finally it’s time to build the ML model. Start with easier stuff like Naive Bayes and Linear Regression. 

9. If you see that those perform poorly or the data is highly non-linear, go with polynomial regression, decision trees, or SVMs. The features can be selected based on their importance from the EDA. If you have lots of data you can use a Neural Network. Check the ROC curve, precision, and recall.
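
The first few steps above might look like this in pandas. The tiny DataFrame and its column names are hypothetical, loosely modelled on a passenger-survival dataset; only standard pandas calls are used.

```python
import pandas as pd

# Tiny hypothetical dataset; column names are illustrative only.
df = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male", "Male"],
    "age": [29.0, 41.0, None, 35.0, 52.0, 23.0],
    "fare": [72.5, 8.1, 7.9, 53.0, 8.5, 7.3],
    "survived": [1, 0, 0, 1, 0, 0],
})

df.info()                                       # dtypes and non-null counts per column
print(df["survived"].value_counts())            # class balance
print(df.isnull().sum())                        # missing values per column

# Fill the missing age with the median, then compare features across classes.
df["age"] = df["age"].fillna(df["age"].median())
print(df.groupby("survived")[["age", "fare"]].agg(["mean", "var"]))
print(pd.crosstab(df["sex"], df["survived"]))   # relationship between two categorical columns
```

Plots (bar charts, scatter plots, overlaid distributions) would follow the same coarse-to-fine order once these summaries have shown which columns deserve a closer look.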

### Q25 Which optimizer should you use?
Over the years, many gradient descent optimization algorithms have been developed, and each has its pros and cons. A few of the most popular ones include:

1. Stochastic Gradient Descent (SGD) with momentum
2. Adam
3. RMSprop
4. Adadelta

RMSprop, Adadelta, and Adam are considered to be adaptive optimization algorithms, since they automatically update the learning rate. With SGD you have to manually select the learning rate and momentum parameter, usually decaying the learning rate over time.



### Q26 Gradient Descent vs Adagrad vs Momentum in TensorFlow

1. AdaGrad or adaptive gradient allows the learning rate to adapt per parameter. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Because of this it is well suited for sparse data (NLP or image recognition). Another advantage is that it basically eliminates the need to tune the learning rate. Each parameter has its own learning rate, and because of the way the algorithm accumulates gradients the learning rate is monotonically decreasing. This causes the biggest problem: at some point the learning rate is so small that the system stops learning.

2. AdaDelta resolves the problem of the monotonically decreasing learning rate in AdaGrad. In AdaGrad the learning rate is calculated approximately as one divided by the square root of the sum of all past squared gradients. At each step you add another squared gradient to the sum, which causes the denominator to constantly increase. Instead of summing all past squared gradients, AdaDelta uses a decaying average that acts like a sliding window, which allows the denominator to decrease. **RMSprop is very similar to AdaDelta.**

3. Adam or adaptive moment estimation is an algorithm similar to AdaDelta. But in addition to storing a decaying average of past squared gradients for each of the parameters, it also stores a decaying average of the past gradients themselves (momentum) for each of them separately.
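
The difference between AdaGrad's ever-growing sum and a decaying average can be seen on a toy problem. This is a plain-Python sketch, not TensorFlow code: the objective f(x) = x^2, the learning rate and the decay factor are all illustrative assumptions, and RMSprop stands in for the decaying-average family.

```python
import math

def grad(x):
    return 2.0 * x          # gradient of f(x) = x^2

lr, eps, steps = 0.5, 1e-8, 20

# AdaGrad: accumulate ALL past squared gradients, so the step size only shrinks.
x, accum = 5.0, 0.0
adagrad_step_sizes = []
for _ in range(steps):
    g = grad(x)
    accum += g ** 2
    step_size = lr / (math.sqrt(accum) + eps)
    adagrad_step_sizes.append(step_size)
    x -= step_size * g
x_adagrad = x

# RMSprop: exponentially decaying average, so old gradients are gradually forgotten.
x, avg, decay = 5.0, 0.0, 0.9
rmsprop_step_sizes = []
for _ in range(steps):
    g = grad(x)
    avg = decay * avg + (1 - decay) * g ** 2
    step_size = lr / (math.sqrt(avg) + eps)
    rmsprop_step_sizes.append(step_size)
    x -= step_size * g
x_rmsprop = x

print(adagrad_step_sizes[0], adagrad_step_sizes[-1])
print(rmsprop_step_sizes[0], rmsprop_step_sizes[-1])
```

AdaGrad's effective step size never increases, while the decaying average lets it recover once the gradients become small. Adam adds a second decaying average, of the gradients themselves, on top of this.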

#### What is Sparse Data
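A common problem in machine learning is sparse data, which alters the performance of machine learning algorithms and their ability to calculate accurate predictions. Data is considered sparse when certain expected values in a dataset are missing, which is a common phenomenon in large-scale data analysis. The term is also used for data in which most values are zero, such as one-hot or bag-of-words features in NLP; this is the sense in which adaptive optimizers are said to suit sparse data.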