# General Linear Model:

1. What is the purpose of the General Linear Model (GLM)?

**ANS:** The General Linear Model describes a continuous response variable as a linear combination of one or more predictor variables plus normally distributed error; ordinary linear regression, ANOVA and ANCOVA are all special cases. It should not be confused with Generalized Linear Models, a broader class of regression models that connect the mean of the response to the linear predictor through a link function and allow non-normal response distributions (for example binomial or Poisson). This lets them model a wider range of relationships than traditional linear regression. Both are commonly abbreviated GLM.

2. What are the key assumptions of the General Linear Model?

**ANS:** The key assumptions of the General Linear Model are:
First, the general linear model assumes that the relationships between the outcome and any continuous predictors are linear; in Figure 1, the straight regression line adequately captures the trend in the data. (You can add non-linear transformations of the predictors, too, but then the model will assume that the transformed predictors are linearly related to the outcome. I’m using the term ‘linearity assumption’ to mean ‘the model captures the trend in the data’.)

Second, the general linear model assumes that the residuals were all sampled from the same distribution. The three red curves in Figure 1 show the assumed distribution of the residuals at three predictor values; the ‘constant variance’ assumption entails that these distributions all have the same width as they do in Figure 1.

Third, the general linear model assumes that the residuals were all sampled from normal distributions. Strictly speaking, t- and F-statistics – and the p-values derived from them – depend on this assumption.


3. How do you interpret the coefficients in a GLM?

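**ANS:** In a general linear model, each coefficient is the expected change in the response for a one-unit increase in its predictor, holding the other predictors constant. In a generalized linear model the coefficient acts on the scale of the link function. With a log link, as in Poisson regression, increasing x by 1 multiplies the mean of the response by exp(β1).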
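
As a small illustration of the multiplicative reading under a log link, the sketch below uses made-up coefficients (`beta0` and `beta1` are assumptions, not fitted values) and checks that a one-unit increase in x scales the mean by exp(β1) wherever it starts.

```python
import math

# Hypothetical coefficients for a model with a log link:
# log(mean) = beta0 + beta1 * x
beta0, beta1 = 0.5, 0.3

def expected_count(x):
    """Mean of the response implied by the log-link model."""
    return math.exp(beta0 + beta1 * x)

# Raising x by one unit multiplies the mean by exp(beta1), whatever x is.
ratio_at_2 = expected_count(3) / expected_count(2)
ratio_at_10 = expected_count(11) / expected_count(10)
print(ratio_at_2, ratio_at_10, math.exp(beta1))
```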


4. What is the difference between a univariate and multivariate GLM?

**ANS:**
The difference between a univariate and multivariate Generalized Linear Model (GLM) lies in the number of dependent variables (responses) being modeled.

1. **Univariate GLM:** In a univariate GLM, there is a single dependent variable or response variable being modeled. The model estimates the relationship between this single response variable and one or more independent variables (predictors). The focus is on understanding and modeling the relationship between the response variable and the predictors.

2. **Multivariate GLM:** In a multivariate GLM, there are multiple dependent variables or response variables being modeled simultaneously. The model estimates the relationships between the multiple response variables and the predictors. The focus is on understanding and modeling the relationships between the response variables themselves, as well as their relationships with the predictors.

Key differences between univariate and multivariate GLMs include:

- **Number of Response Variables:** Univariate GLM deals with one response variable, whereas multivariate GLM deals with two or more response variables.

- **Modeling Approach:** Univariate GLM focuses on modeling the relationship between a single response variable and predictors. Multivariate GLM considers the relationships between multiple response variables simultaneously, accounting for their potential interdependencies.

- **Parameter Estimation:** In univariate GLM, separate parameter estimates are obtained for each predictor in relation to the single response variable. In multivariate GLM, parameter estimates are obtained for predictors in relation to each response variable, as well as for the relationships among the response variables themselves.

- **Model Complexity:** Multivariate GLM models tend to be more complex than univariate GLM models because they involve analyzing and capturing relationships between multiple response variables.

Univariate GLM is often used when analyzing the relationship between a single outcome variable and multiple predictors. On the other hand, multivariate GLM is employed when investigating the relationships between multiple outcome variables, such as in multivariate regression or multivariate analysis of variance (MANOVA) scenarios.


5. Explain the concept of interaction effects in a GLM.

**ANS:** In general, an interaction means that the effect of one variable depends on the value of the variable with which it interacts. Without an interaction, the value of the other variable does not matter.
In GLMs that describe probabilities and counts, interaction effects are not simply equal to the product terms between predictor variables. Because of the non-linear link, the interaction on the scale of the response may itself be a function of the predictors in the model, so interpreting it accurately requires non-traditional approaches.
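
For the simplest case, a linear model with an identity link, an interaction can be sketched as a product term. The coefficients below are invented for illustration; the point is only that the effect of `x` changes with `z`.

```python
# Hypothetical coefficients for y = b0 + b1*x + b2*z + b3*x*z
b0, b1, b2, b3 = 1.0, 2.0, 0.5, 1.5

def predict(x, z):
    return b0 + b1 * x + b2 * z + b3 * x * z

def effect_of_x(z):
    """Change in the prediction when x rises by one unit at a fixed z."""
    return predict(1, z) - predict(0, z)

print(effect_of_x(0))  # equals b1
print(effect_of_x(1))  # equals b1 + b3
```

With `b3 = 0` the two printed values would coincide, which is the "no interaction" case.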


6. How do you handle categorical predictors in a GLM?

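**ANS:** Categorical predictors are usually encoded as indicator (dummy) variables: a predictor with k levels becomes k - 1 columns of ones and zeros, with the remaining level serving as the reference category. Each coefficient is then interpreted relative to that reference level. Categorical dependent variables are a separate matter. When the dependent variable is binary (1 vs. 0, "dead" vs. "alive"), you might use logistic regression, which is a GLM with a binomial error distribution and a logit link function. When it is ordinal (e.g. "bad" < "good" < "best"), you can use ordinal logistic regression. For a nominal dependent variable (e.g. transportation: "walk", "car", "bicycle"), you can use multinomial logistic regression.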


7. What is the purpose of the design matrix in a GLM?

**ANS:** The design matrix holds the predictor values used by a statistical model such as the general linear model, with one row per observation and one column per model term. It can contain indicator variables (ones and zeros) that indicate group membership in an ANOVA, or it can contain values of continuous variables.
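
A rough sketch of how such a matrix might be assembled by hand, assuming one continuous predictor (`ages`), one categorical predictor (`groups`) and a chosen reference level. The function name and data are hypothetical.

```python
def design_matrix(ages, groups, reference):
    """Build rows of [intercept, age, one indicator per non-reference group]."""
    levels = sorted(set(groups) - {reference})
    rows = []
    for age, group in zip(ages, groups):
        indicators = [1.0 if group == level else 0.0 for level in levels]
        rows.append([1.0, float(age)] + indicators)
    return levels, rows

levels, X = design_matrix([23, 35, 41], ["a", "b", "c"], reference="a")
for row in X:
    print(row)
```

The reference level gets no column of its own; its effect is absorbed by the intercept.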


8. How do you test the significance of predictors in a GLM?

**ANS :** The likelihood ratio test, the F-test and ANOVA are used to test the significance of predictors in a GLM. Each compares a model that includes the predictor with one that omits it.


9. What is the difference between Type I, Type II, and Type III sums of squares in a GLM?

**ANS :** There are three approaches to splitting the variation: Type I, Type II and Type III Sums of Squares. They do not give the same result for unbalanced data.

Type I Sums of Squares, also called Sequential Sums of Squares, assign variation to the variables in the order in which they are specified, so the result depends on that order.

Type II Sums of Squares take a different approach in two ways:

- The variation assigned to independent variable A accounts for B, and the variation assigned to B accounts for A.
- They assume that there is no interaction effect.

Type III Sums of Squares, also called partial sums of squares, are another way of computing the Sums of Squares:

- Like Type II, they are not sequential, so the order of specification does not matter.
- Unlike Type II, they do take an interaction effect into account.


10. Explain the concept of deviance in a GLM.

**ANS :** Deviance is a goodness-of-fit metric for statistical models, used particularly for GLMs. It is defined as twice the difference between the log-likelihood of the saturated model, which fits every observation perfectly, and the log-likelihood of the proposed model. A smaller deviance means that the proposed model accounts for more of the variation in the data.
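
As one concrete case, for a Poisson model twice the log-likelihood gap between the saturated and proposed models simplifies to 2 Σ [y log(y/μ) − (y − μ)]. The sketch below assumes that family; `poisson_deviance` is an illustrative helper, not a library function.

```python
import math

def poisson_deviance(y, mu):
    """Deviance of fitted means mu against observed counts y (Poisson family)."""
    total = 0.0
    for yi, mi in zip(y, mu):
        # y * log(y / mu) is taken as 0 when y == 0
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += term - (yi - mi)
    return 2.0 * total

observed = [2, 0, 5]
print(poisson_deviance(observed, observed))         # perfect fit
print(poisson_deviance(observed, [1.5, 0.5, 4.0]))  # imperfect fit
```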


# Regression:

11. What is regression analysis and what is its purpose?

**ANS :** Regression analysis is a statistical method for estimating the relationship between a dependent variable and one or more independent variables. Its purpose is to quantify how the dependent variable changes with the predictors and to predict it for new observations. The fitted relationship is often shown as a line on a graph.


12. What is the difference between simple linear regression and multiple linear regression?

**ANS :** Simple linear regression has only one x and one y variable. Multiple linear regression has one y and two or more x variables.

13. How do you interpret the R-squared value in regression?

**ANS:** R-squared is a goodness-of-fit measure for linear regression models. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. R-squared measures the strength of the relationship between your model and the dependent variable on a convenient 0 – 100% scale.
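
A minimal sketch of the calculation, assuming the usual definition R² = 1 − SS_res / SS_tot. The data values are made up.

```python
def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))  # unexplained
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)               # total
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y_true, y_pred))
```

Predicting the mean of `y_true` for every observation gives an R² of 0 under this definition.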

14. What is the difference between correlation and regression?

**ANS:** The key difference is that correlation measures the degree of association between two variables (x and y) and treats them symmetrically. Regression, in contrast, describes how one variable (the predictor) affects another (the response).

15. What is the difference between the coefficients and the intercept in regression?

**ANS:** The simple linear regression model is essentially a linear equation of the form y = c + b*x, where y is the dependent variable (outcome), x is the independent variable (predictor), b is the slope of the line, also known as the regression coefficient, and c is the intercept. The coefficient gives the change in y for a one-unit change in x, while the intercept is the value of y when x is 0.
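
To make `b` and `c` concrete, here is a sketch of the closed-form least-squares fit for one predictor. The function name `fit_line` and the data are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Return (intercept c, slope b) of the least-squares line y = c + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    c = mean_y - b * mean_x
    return c, b

c, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(c, b)
```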


16. How do you handle outliers in regression analysis?

**ANS:** There are many possible approaches to dealing with outliers: 
1. removing them from the observations, 
2. treating them (for example, capping the extreme observations at a reasonable value), 
3. using algorithms that are well-suited for dealing with such values on their own.
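
The second option, capping, could look like the sketch below. The bounds are supplied by hand here; choosing them (for example from percentiles) is left to the analyst.

```python
def cap(values, lower, upper):
    """Clip each value into the interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]

capped = cap([1, 2, 3, 100], lower=0, upper=10)
print(capped)
```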

17. What is the difference between ridge regression and ordinary least squares regression?

**ANS:** Ordinary least squares chooses the coefficients that minimize the residual sum of squares alone, so when predictors are highly correlated or numerous the estimated coefficients can have high variance. Ridge regression minimizes the residual sum of squares plus an L2 penalty on the size of the coefficients. This shrinks the coefficients towards zero, introducing a little bias in exchange for lower variance.

18. What is heteroscedasticity in regression and how does it affect the model?

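**ANS:** Heteroscedasticity refers to a condition in which the variance of the residual term, or error term, in a regression model is not constant across observations. Homoscedasticity refers to a condition in which the variance of the error term is constant. Under heteroscedasticity the OLS coefficient estimates remain unbiased, but the usual standard errors are unreliable, so confidence intervals and significance tests can be misleading.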

19. How do you handle multicollinearity in regression analysis?

**ANS:** One approach is regularization, which shrinks the coefficients of the features that are collinear. Increasing the alpha value of the L1 regularizer introduces a small bias in the estimator and in return reduces its variance. The L1 penalty can also set the coefficient of a redundant feature to exactly zero.

20. What is polynomial regression and when is it used?

**ANS:** Polynomial regression captures non-linear relationships between variables by adding powers of the predictors (x², x³, ...) to the model and fitting a curve, which simple linear regression cannot do. The model is still linear in its coefficients. It is used when a straight line does not adequately capture the relationship.

# Loss function:

21. What is a loss function and what is its purpose in machine learning?

**ANS:** At its core, a loss function is a measure of how well your prediction model predicts the expected outcome (or value). We convert the learning problem into an optimization problem: define a loss function, then adjust the model's parameters to minimize it.

22. What is the difference between a convex and non-convex loss function?

**ANS:** A convex function is one in which a line segment drawn between any two points on the graph lies on or above the graph. As a consequence, any local minimum of a convex loss is also a global minimum. A non-convex function violates this condition somewhere: its graph can be "wavy", with several local minima, so an optimizer may settle in a minimum that is not the global one.


23. What is mean squared error (MSE) and how is it calculated?

**ANS:** The Mean Squared Error measures how close a regression line is to a set of data points. It is a risk function corresponding to the expected value of the squared error loss. It is calculated by taking the mean of the squared differences between the observed values and the values predicted by the model.
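
A short sketch of that calculation, with invented numbers:

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between observations and predictions."""
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true)

print(mse([3, 5, 2], [2, 5, 4]))  # (1 + 0 + 4) / 3
```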


24. What is mean absolute error (MAE) and how is it calculated?


**ANS:** MAE is calculated as the sum of absolute errors divided by the sample size: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. It is thus an arithmetic average of the absolute errors $|y_i - \hat{y}_i|$, where $\hat{y}_i$ is the prediction and $y_i$ the true value. Alternative formulations may include relative frequencies as weight factors.
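
The same calculation as a sketch in code, with invented numbers:

```python
def mae(y_true, y_pred):
    """Arithmetic mean of the absolute errors."""
    return sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / len(y_true)

print(mae([3, 5, 2], [2, 5, 4]))  # (1 + 0 + 2) / 3
```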


25. What is log loss (cross-entropy loss) and how is it calculated?

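**ANS:** Log Loss (Binary Cross-Entropy Loss): A loss function that represents how much the predicted probabilities deviate from the true labels. It is used in binary cases. For a true label y in {0, 1} and a predicted probability p, the loss for one example is -[y log(p) + (1 - y) log(1 - p)], and the log loss is the average over all examples. Cross-Entropy Loss: A generalized form of the log loss, which is used for multi-class classification problems.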
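
A minimal sketch of the binary case. The clipping constant `eps` is an assumption added here to avoid taking log(0); it is not part of the definition.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(log_loss([1, 0], [0.9, 0.2]))  # confident and correct: small loss
print(log_loss([1, 0], [0.2, 0.9]))  # confident and wrong: large loss
```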

26. How do you choose the appropriate loss function for a given problem?

**ANS:** There is no one-size-fits-all loss function in machine learning. Several factors influence the choice of a loss function for a specific problem, such as the type of machine learning algorithm chosen, the ease of calculating the derivatives and, to some degree, the percentage of outliers in the data set.

27. Explain the concept of regularization in the context of loss functions.

**ANS:** Regularization refers to techniques that add a penalty term to the loss function, typically on the size of the model's weights. The model is then trained to minimize this adjusted loss function, which discourages overly complex fits and helps prevent overfitting.

28. What is Huber loss and how does it handle outliers?

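**ANS:** In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in data than the squared error loss. It is quadratic for small residuals and linear for large ones, so an outlier contributes in proportion to its distance rather than its squared distance. A variant for classification is also sometimes used.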
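
A sketch of the standard piecewise form, with the threshold `delta` set to 1.0 purely as an example value:

```python
def huber(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)

print(huber(0.5))   # inside the quadratic zone
print(huber(3.0))   # linear zone; squared error would give 0.5 * 9 = 4.5
```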

29. What is quantile loss and when is it used?

**ANS:** As the name suggests, the quantile regression loss function is applied to predict quantiles. A quantile is the value below which a given fraction of observations in a group falls. For example, a prediction for quantile 0.9 should over-predict 90% of the time.
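
The quantile (pinball) loss for a single observation can be sketched as follows. For `q = 0.9`, under-predicting costs nine times as much as over-predicting by the same amount, which is what pushes the prediction upwards.

```python
def quantile_loss(y_true, y_pred, q):
    """Pinball loss: weight q on under-prediction, 1 - q on over-prediction."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1) * diff)

print(quantile_loss(10, 8, q=0.9))   # prediction too low by 2
print(quantile_loss(10, 12, q=0.9))  # prediction too high by 2
```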

30. What is the difference between squared loss and absolute loss?

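**ANS:** Squared loss penalizes the square of each error, while absolute loss penalizes its absolute value, so squared loss is more sensitive to large errors. Under squared loss, the best prediction is the estimated mean of y0, as the true mean minimizes squared loss on average (where the average is taken across random samples of y0 subject to x = x0). Under absolute loss, the best prediction is the estimated median.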

# Optimizer (GD):

31. What is an optimizer and what is its purpose in machine learning?

**ANS:** An optimizer is an algorithm that adjusts a model's parameters, such as the weights of a neural network, and sometimes training settings such as the learning rate, in order to reduce the total loss and so improve accuracy. Choosing appropriate weights by hand is impractical, which is why this search is automated.

32. What is Gradient Descent (GD) and how does it work?

**ANS:** Gradient descent is an optimization algorithm used when training a machine learning model. It iteratively tweaks the parameters by taking a step in the direction of the negative gradient of the function being minimized, until it reaches a local minimum. If the function is convex, that local minimum is also the global minimum.
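
A minimal sketch of the update rule on a one-dimensional toy function f(w) = (w − 3)², whose gradient is 2(w − 3). The function, starting point and step count are all assumptions chosen for illustration.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient."""
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

# Minimize f(w) = (w - 3)^2; the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
print(w_min)
```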

33. What are the different variations of Gradient Descent?

**ANS:** There are three types of gradient descent learning algorithms: batch gradient descent, stochastic gradient descent and mini-batch gradient descent.

34. What is the learning rate in GD and how do you choose an appropriate value?

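**ANS :** In order for Gradient Descent to work, we must set the learning rate to an appropriate value. This parameter determines the size of each step and therefore how fast or slow we move towards the optimal weights. If the learning rate is very large we will skip over the optimal solution; if it is very small, training needs many iterations. In practice it is usually chosen by trying several values, often on a logarithmic scale, and comparing the resulting loss curves.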

35. How does GD handle local optima in optimization problems?

**ANS:** The cost function may have many minimum points. Gradient descent may settle in any one of these minima, depending on the initial point (i.e. the initial parameters theta) and the learning rate. The optimization may therefore converge to different points for different starting points and learning rates, and plain gradient descent has no built-in way to escape a local minimum.


36. What is Stochastic Gradient Descent (SGD) and how does it differ from GD?

**ANS:** Stochastic Gradient Descent is a drastic simplification of GD which overcomes some of its difficulties. Each iteration of SGD computes the gradient from one randomly chosen observation (or a small subset) of the shuffled dataset, instead of using all of the observations. Each update is therefore much cheaper but noisier than a full GD step.

37. Explain the concept of batch size in GD and its impact on training.

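**ANS:** Batch size defines the number of samples used to compute one parameter update when training a neural network. There are three types of gradient descent with respect to the batch size: batch gradient descent uses all samples from the training set for each update, stochastic gradient descent uses a single sample, and mini-batch gradient descent uses a small subset. Larger batches give smoother but more expensive updates, while smaller batches give noisier but cheaper ones.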

38. What is the role of momentum in optimization algorithms?

**ANS:** Momentum is a strategy for accelerating the convergence of the optimization process by including a momentum element in the update rule. This momentum factor assists the optimizer in continuing to go in the same direction even if the gradient changes direction or becomes zero.
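
One common form of the momentum update keeps a running `velocity` that accumulates past gradients. The sketch below compares it with plain gradient descent on the toy function f(w) = (w − 3)²; the learning rate, `beta` and step count are arbitrary illustrative choices.

```python
def grad(w):
    return 2 * (w - 3)  # gradient of (w - 3)^2

def plain_descent(start, learning_rate, steps):
    w = start
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

def momentum_descent(start, learning_rate, beta, steps):
    w, velocity = start, 0.0
    for _ in range(steps):
        # velocity is a decaying sum of past gradient steps
        velocity = beta * velocity - learning_rate * grad(w)
        w += velocity
    return w

err_plain = abs(plain_descent(0.0, 0.01, 100) - 3)
err_momentum = abs(momentum_descent(0.0, 0.01, 0.9, 100) - 3)
print(err_plain, err_momentum)
```

With these settings the momentum run ends closer to the minimum after the same number of steps.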

39. What is the difference between batch GD, mini-batch GD, and SGD?

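**ANS:** Batch GD computes each update from the whole training set, so it follows a smooth path towards the minimum but each step is expensive. SGD computes each update from a single sample, so its path is noisy, but it makes progress faster on large datasets. Mini-batch GD uses a small subset of samples per update and sits between the two, giving smoother curves than SGD at a lower cost per step than batch GD.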

40. How does the learning rate affect the convergence of GD?

**ANS:** The learning rate determines the size of each step towards the optimal weights. If it is too small, convergence is slow and many iterations are needed. If it is very large, the updates overshoot the optimal solution and the algorithm may oscillate or diverge. A moderate value converges in relatively few steps.
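
The three regimes can be seen on the toy function f(w) = (w − 3)². The specific learning rates below are assumptions picked to show each behaviour.

```python
def distance_after(learning_rate, steps=20, start=0.0):
    """Distance from the minimum at w = 3 after a fixed number of steps."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * (w - 3)  # gradient of (w - 3)^2
    return abs(w - 3)

small = distance_after(0.01)     # still far away: too slow
moderate = distance_after(0.3)   # very close to the minimum
large = distance_after(1.1)      # further away than at the start: diverging
print(small, moderate, large)
```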

# Regularization:

41. What is regularization and why is it used in machine learning?


**ANS:** Regularization refers to techniques that are used to calibrate machine learning models by minimizing an adjusted loss function, with the aim of preventing overfitting. With regularization, a model fitted on the training set generalizes better, which reduces its errors on unseen test data.

42. What is the difference between L1 and L2 regularization?

**ANS:** L1 regularization penalizes the sum of absolute values of the weights, whereas L2 regularization penalizes the sum of squares of the weights.
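
The two penalty terms, sketched with an assumed strength parameter `lam` and made-up weights:

```python
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    return lam * sum(w ** 2 for w in weights)

weights = [0.5, -2.0, 0.0]
print(l1_penalty(weights, lam=0.1))
print(l2_penalty(weights, lam=0.1))
```

The large weight (-2.0) dominates the L2 penalty far more than the L1 penalty, because it is squared.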

43. Explain the concept of ridge regression and its role in regularization.

**ANS:** Ridge regression is a type of linear regression in which a small amount of bias is introduced so that we can get better long-term predictions. It is a regularization technique used to reduce the complexity of the model, and it is also called L2 regularization because it penalizes the sum of squared coefficients.
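
For the special case of a single predictor and no intercept, the ridge estimate has the closed form Σxy / (Σx² + λ), which makes the shrinkage easy to see. This simplified setting is an assumption made for the sketch.

```python
def ridge_slope(xs, ys, lam):
    """Ridge estimate of b in y = b*x (one predictor, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs, ys = [1, 2, 3], [2, 4, 6]
print(ridge_slope(xs, ys, lam=0))   # ordinary least squares
print(ridge_slope(xs, ys, lam=14))  # shrunk towards zero
```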


44. What is the elastic net regularization and how does it combine L1 and L2 penalties?

**ANS:** In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a regularized regression method that linearly combines the L1 and L2 penalties of the lasso and ridge methods.

45. How does regularization help prevent overfitting in machine learning models?

**ANS:** In short, regularization in machine learning constrains, or shrinks, the coefficient estimates towards zero. In other words, this technique discourages learning a more complex or flexible model, which reduces the risk of overfitting.


46. What is early stopping and how does it relate to regularization?

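**ANS:** During training, the model's performance on a held-out validation set is monitored, and training is stopped once that performance stops improving. The model at the time that training is stopped is then used and tends to have good generalization performance. This procedure is called "early stopping" and is perhaps one of the oldest and most widely used forms of neural network regularization, since it limits how closely the model can fit the training data.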
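
One simple way to implement the stopping rule is a `patience` counter, sketched below on a hypothetical list of per-epoch validation losses. Both the rule and the numbers are illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch whose weights would be kept."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch

best = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.5], patience=2)
print(best)
```

Training stops after the fifth value, so the final 0.5 is never seen; a larger `patience` would trade extra training time for the chance to find it.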

47. Explain the concept of dropout regularization in neural networks.

**ANS:** Dropout is a regularization technique for reducing overfitting in neural networks by preventing complex co-adaptations on training data. It is a very efficient way of performing model averaging with neural networks. The term "dropout" refers to dropping out units (both hidden and visible) in a neural network.
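
A sketch of the "inverted dropout" variant applied to one layer's activations at training time. Scaling the kept units by 1 / keep_prob is a common convention assumed here; the random generator is passed in so the example is reproducible.

```python
import random

def dropout(activations, drop_prob, rng):
    """Zero each unit with probability drop_prob; rescale the survivors."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

layer = [1.0, 2.0, 3.0, 4.0]
print(dropout(layer, drop_prob=0.5, rng=random.Random(0)))
```

At prediction time dropout is switched off and the activations are used as they are.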

48. How do you choose the regularization parameter in a model?

**ANS:** The regularization parameter is chosen as follows: on the training set, we estimate several different Ridge regressions, with different values of the regularization parameter; on the validation set, we choose the best model (the regularization parameter which gives the lowest MSE on the validation set).


49. What is the difference between feature selection and regularization?

**ANS:** Feature selection, also known as feature subset selection, variable selection, or attribute selection, removes dimensions (e.g. columns) from the input data and results in a reduced data set for model inference. Regularization keeps all the features but constrains the solution space during optimization, typically by penalizing large coefficients.

50. What is the trade-off between bias and variance in regularized models?

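**ANS:** Bias-variance trade-off:
If the algorithm is too simple (a hypothesis with a linear equation), it may have high bias and low variance and thus be error-prone. If the algorithm is too complex (a hypothesis with a high-degree equation), it may have high variance and low bias. In a regularized model, increasing the regularization strength moves the model towards higher bias and lower variance, and decreasing it does the opposite.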

# SVM:

51. What is Support Vector Machines (SVM) and how does it work?

**ANS:** An SVM builds a learning model that assigns new examples to one group or another. It does so by finding the hyperplane that separates the two classes with the largest possible margin. For this reason, SVMs are described as non-probabilistic, binary linear classifiers. In probabilistic classification settings, SVMs can use methods such as Platt Scaling.

52. How does the kernel trick work in SVM?

**ANS:** The “trick” is that kernel methods represent the data only through a set of pairwise similarity comparisons between the original data observations x (with the original coordinates in the lower dimensional space), instead of explicitly applying the transformations ϕ(x) and representing the data by these transformed coordinates in the higher dimensional feature space.
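
As a small check of the idea, the degree-2 polynomial kernel k(x, z) = (x · z)² on two-dimensional inputs gives the same number as a dot product in the three-dimensional feature space ϕ(x) = (x₁², √2·x₁x₂, x₂²), without ever building ϕ. The kernel choice and input points are assumptions for illustration.

```python
import math

def poly_kernel(x, z):
    """Similarity computed directly in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map into the 3-D space."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, 4.0)
print(poly_kernel(x, z), dot(phi(x), phi(z)))
```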

53. What are support vectors in SVM and why are they important?

**ANS:** Support vectors are the data points closest to the hyperplane, and they determine its position and orientation. Using these support vectors, we maximize the margin of the classifier.

54. Explain the concept of the margin in SVM and its impact on model performance.

**ANS:** The margin is the distance between the hyperplane and the observations closest to the hyperplane (the support vectors). In SVM a large margin is considered a good margin, because it tends to generalize better to new data. There are two types of margin: hard margin and soft margin.

55. How do you handle unbalanced datasets in SVM?

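**ANS:** A common approach is to make the SVM cost-sensitive by weighting the classes, so that misclassifying a minority-class example is penalized more heavily than misclassifying a majority-class example; in scikit-learn this is the `class_weight` argument of `SVC`. Resampling the training data (oversampling the minority class or undersampling the majority class) is an alternative. To experiment, the `make_classification()` function can define a synthetic imbalanced two-class classification dataset, for example 10,000 examples with an approximate 1:100 minority to majority class ratio. Once generated, summarizing the class distribution confirms that the dataset was created as expected.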

56. What is the difference between linear SVM and non-linear SVM?

**ANS:** Linear SVM: When the data points are linearly separable into two classes, the data is called linearly-separable data. We use the linear SVM classifier to classify such data. Non-linear SVM: When the data is not linearly separable, we use the non-linear SVM classifier to separate the data points.

57. What is the role of C-parameter in SVM and how does it affect the decision boundary?

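**ANS:** The C parameter adds a penalty for each misclassified data point. If C is small, the penalty for misclassified points is low, so a decision boundary with a large margin is chosen at the expense of a greater number of misclassifications. If C is large, the SVM tries to minimize the number of misclassified points, which results in a decision boundary with a smaller margin.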

58. Explain the concept of slack variables in SVM.

**ANS:** Slack variables are introduced to allow certain constraints to be violated. That is, certain training points will be allowed to be within the margin. We want the number of points within the margin to be as small as possible, and of course we want their penetration of the margin to be as small as possible.
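
In the usual soft-margin formulation, the slack of a training point with label y in {−1, +1} is max(0, 1 − y(w · x + b)). The sketch below uses an invented hyperplane (`w`, `b`) to show the three typical cases.

```python
def slack(w, b, x, y):
    """How far a point violates the margin constraint y * (w.x + b) >= 1."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

w, b = (1.0, 0.0), 0.0
print(slack(w, b, (2.0, 0.0), 1))    # outside the margin: no slack
print(slack(w, b, (0.5, 0.0), 1))    # inside the margin but correctly classified
print(slack(w, b, (-1.0, 0.0), 1))   # misclassified
```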

59. What is the difference between hard margin and soft margin in SVM?

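**ANS:** The difference between a hard margin and a soft margin in SVMs lies in the separability of the data. If our data is linearly separable, we can use a hard margin, which allows no training point inside the margin or on the wrong side of the hyperplane. If the data is not linearly separable, a hard margin is not feasible, so we use a soft margin, which tolerates some violations at a cost.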

60. How do you interpret the coefficients in an SVM model?

**ANS:** For a linear SVM, the coefficients define the vector orthogonal to the separating hyperplane. Let's say the SVM finds only one feature useful for separating the data; then the hyperplane would be orthogonal to that axis. So you could say that the absolute size of a coefficient relative to the others gives an indication of how important the feature was for the separation, provided the features are on comparable scales.

# Decision Trees:

61. What is a decision tree and how does it work?

**ANS:** A decision tree is a non-parametric supervised learning algorithm, which is utilized for both classification and regression tasks. It has a hierarchical, tree structure, which consists of a root node, branches, internal nodes and leaf nodes.


62. How do you make splits in a decision tree?

**ANS:** Steps to split a decision tree node using reduction in variance (the criterion used for regression trees):
1. For each candidate split, calculate the variance of each child node.
2. Calculate the variance of the split as the weighted average of the child-node variances.
3. Select the split with the lowest variance.
4. Repeat steps 1-3 until the nodes are completely homogeneous or a stopping criterion is met.
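
The variance-based steps above can be sketched for a single numeric feature as follows. The candidate thresholds are simply the observed feature values, which is an assumption made to keep the example short.

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_variance(xs, ys, threshold):
    """Weighted average variance of the two child nodes."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * variance(left) + len(right) / n * variance(right)

def best_split(xs, ys):
    candidates = sorted(set(xs))[:-1]  # drop the largest so both children are non-empty
    return min(candidates, key=lambda t: split_variance(xs, ys, t))

xs, ys = [1, 2, 3, 4], [10, 11, 20, 21]
print(best_split(xs, ys))
```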

63. What are impurity measures (e.g., Gini index, entropy) and how are they used in decision trees?

**ANS:** The Gini impurity measure is one of the methods used in decision tree algorithms to decide the optimal split from a root node, and subsequent splits. To put it into context, a decision tree is trying to create sequential questions such that it partitions the data into smaller groups.

The Entropy and Information Gain method focuses on purity and impurity in a node. The Gini Index, or Gini Impurity, measures the probability that a randomly chosen instance would be misclassified if it were labelled at random according to the class distribution in the node. The lower the Gini Index, the lower the likelihood of misclassification.
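
A minimal sketch of the Gini impurity of a node, computed as 1 − Σ pₖ² over the class proportions pₖ. The labels are made up.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini(["a", "a", "a", "a"]))  # pure node
print(gini(["a", "a", "a", "b"]))
print(gini(["a", "a", "b", "b"]))  # evenly mixed two-class node
```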

64. Explain the concept of information gain in decision trees.

**ANS:** Information gain is the basic criterion to decide whether a feature should be used to split a node or not. The feature with the optimal split i.e., the highest value of information gain at a node of a decision tree is used as the feature for splitting the node.
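
Information gain can be sketched as the entropy of the parent node minus the weighted entropy of its children. The helper names and labels below are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["a", "a", "b", "b"]
print(information_gain(parent, [["a", "a"], ["b", "b"]]))  # perfect split
print(information_gain(parent, [["a", "b"], ["a", "b"]]))  # uninformative split
```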

65. How do you handle missing values in decision trees?

**ANS:** Surrogate splitting rules enable you to use the values of other input variables to perform a split for observations with missing values. Note that the surrogate splitting method can handle missing values for both numeric and categorical variables.

66. What is pruning in decision trees and why is it important?

**ANS:** Pruning reduces the size of decision trees by removing parts of the tree that provide little power to classify instances. Decision trees are particularly prone to overfitting, and effective pruning can reduce this likelihood.

67. What is the difference between a classification tree and a regression tree?

**ANS:** Classification trees are used when the response variable is categorical, so the dataset needs to be split into the classes of the response variable. Regression trees, on the other hand, are used when the response variable is continuous.

68. How do you interpret the decision boundaries in a decision tree?

**ANS:** A Decision tree splits the data based on a feature value and this value would remain constant throughout for one decision boundary e.g., x=2 or y=3 where x and y are two different features. Whereas in a linear classifier, a decision boundary could be for instance: y=mx+c.

69. What is the role of feature importance in decision trees?

**ANS:** A decision tree is an explainable machine learning algorithm all by itself. Beyond its transparency, feature importance is a common way to explain built models as well. The coefficients of a linear regression equation give an indication of feature importance, but that approach fails for non-linear models.

70. What are ensemble techniques and how are they related to decision trees?

**ANS:** Ensemble methods combine several decision trees to produce better predictive performance than a single decision tree. The main principle behind the ensemble model is that a group of weak learners come together to form a strong learner.

# Ensemble Techniques:

71. What are ensemble techniques in machine learning?

**ANS:** Ensemble techniques use a number of models, rather than a single one, to make predictions about each data point. In the simplest form, majority voting, the predictions made by the different models are taken as separate votes, and the prediction made by most models is treated as the final prediction.
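
Majority voting over the predictions of several models for one data point can be sketched as follows. The predictions are made up.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of models."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(["cat", "dog", "cat"]))
```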

72. What is bagging and how is it used in ensemble learning?

**ANS:** Bagging, also known as bootstrap aggregation, is the ensemble learning method that is commonly used to reduce variance within a noisy dataset. In bagging, a random sample of data in a training set is selected with replacement—meaning that the individual data points can be chosen more than once.

73. Explain the concept of bootstrapping in bagging.

**ANS:** Bootstrapping is a sampling method in which a sample is drawn from a data set with replacement. The learning algorithm is then run on each of the samples selected.
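
A sketch of drawing one bootstrap sample of the same size as the original data. The seeded generator is an assumption added for reproducibility.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data, with replacement."""
    return [rng.choice(data) for _ in data]

data = [10, 20, 30, 40, 50]
sample = bootstrap_sample(data, random.Random(42))
print(sample)
```

Because the draw is with replacement, some items typically appear more than once and others not at all.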

74. What is boosting and how does it work?

**ANS:** Boosting creates an ensemble model by combining several weak decision trees sequentially. It assigns weights to the output of individual trees. Examples that the first decision tree classified incorrectly are given a higher weight and passed as input to the next tree, so each new tree concentrates on the mistakes of the previous ones.

75. What is the difference between AdaBoost and Gradient Boosting?

**ANS:** AdaBoost was the first boosting algorithm to be designed, and it is tied to a particular loss function (the exponential loss). Gradient Boosting, on the other hand, is a generic algorithm that searches for approximate solutions to the additive modelling problem and works with any differentiable loss function. This makes Gradient Boosting more flexible than AdaBoost.

76. What is the purpose of random forests in ensemble learning?

**ANS:** Random forest algorithm is an ensemble learning technique combining numerous classifiers to enhance a model's performance. Random Forest is a supervised machine-learning algorithm made up of decision trees. Random Forest is used for both classification and regression problems.

77. How do random forests handle feature importance?

**ANS:** The more a feature decreases the impurity, the more important the feature is. In random forests, the impurity decrease from each feature can be averaged across trees to determine the final importance of the variable.

78. What is stacking in ensemble learning and how does it work?

**ANS:** Stacking is one of the most popular ensemble machine learning techniques. It trains multiple base models to solve the same problem and then builds a new model (a meta-model) that takes their combined outputs as input, with the aim of improving on the performance of the individual models.

79. What are the advantages and disadvantages of ensemble techniques?

**ANS:** Ensemble methods offer several advantages over single models, such as improved accuracy and performance, especially for complex and noisy problems. They can also reduce the risk of overfitting and underfitting by balancing the trade-off between bias and variance, and by using different subsets and features of the data. Their main disadvantages are higher computational cost, since many models must be trained and stored, and reduced interpretability compared with a single model.

80. How do you choose the optimal number of models in an ensemble?

**ANS:**
1. Find the KS (Kolmogorov-Smirnov) statistic of the individual models.
2. Index all the models for easy access.
3. Choose the first two models as the initial selection and set a correlation limit.
4. Iteratively add each remaining model that is not highly correlated with any of the models already chosen.