In [None]:
"""
Multiclass classification problems:

A single example is assigned exactly one label from a group of many possible classes.
For example, if our model is classifying images as cats, dogs, or rabbits, the softmax output might look like this for a given image: [.89, .02, .09]. This means our model is predicting an 89% chance the image is a cat, 2% chance it’s a dog, and 9% chance it’s a rabbit. Because each image can have only one possible label in this scenario, we can take the argmax (index of the highest probability) to determine our model’s predicted class.

Multilabel models:

Refers to problems where more than one label can be assigned to a given training example.
For example, in text models, we can imagine a few scenarios where text can be labeled with multiple tags. Given a dataset of Stack Overflow questions, we could build a model to predict the tags associated with a particular question. As an example, the question “How do I plot a pandas DataFrame?” could be tagged as “Python”, “pandas”, and “visualization”.
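
A minimal sketch of the difference at prediction time, assuming the model emits raw scores (logits). The class names, tag names, and score values are made up for illustration: a multiclass model applies softmax and takes the argmax, while a multilabel model applies an independent sigmoid to every tag and keeps each tag whose probability clears a threshold (0.5 here, an assumption).

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Multiclass: exactly one label, chosen with argmax over the softmax output.
classes = ["cat", "dog", "rabbit"]
probs = softmax([2.0, -1.8, -0.3])  # hypothetical logits
predicted_class = classes[probs.index(max(probs))]

# Multilabel: every tag gets its own sigmoid and its own yes/no decision.
tags = ["python", "pandas", "visualization", "java"]
tag_logits = [3.1, 2.2, 0.4, -4.0]  # hypothetical logits
predicted_tags = [t for t, z in zip(tags, tag_logits) if sigmoid(z) >= 0.5]

print(predicted_class, predicted_tags)
```

Softmax probabilities compete with each other because they must sum to 1; the sigmoid outputs do not, so any number of tags can be switched on.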

"""

In [None]:
"""
There are two main methods for tackling a multi-label classification problem: problem transformation methods and algorithm adaptation methods.

Problem transformation methods convert the multi-label problem into one or more single-label problems, which can then be handled by standard single-label classifiers. Binary relevance (one-against-all) trains one binary classifier per label, while label powerset treats every distinct combination of labels as one class of a multiclass problem.

Algorithm adaptation methods instead adapt the algorithms to perform multi-label classification directly. In other words, rather than converting the problem into a simpler one, they address it in its full form.

In an extensive comparison with other approaches, the label-powerset method scores best, followed by the one-against-all method.
Both ML-KNN and label-powerset take a considerable amount of time when run on this dataset, so experimentation was done on a random sample of the training data.
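
To make the two problem transformations concrete, here is a small sketch on a made-up set of tagged samples. It only builds the training targets; any single-label classifier could then be fit on them. The sample data and variable names are assumptions for illustration.

```python
# Each sample carries a set of labels (hypothetical data).
samples = [
    {"python", "pandas"},
    {"python"},
    {"pandas", "visualization"},
    {"python", "pandas"},
]
labels = sorted(set().union(*samples))

# One-against-all (binary relevance): one 0/1 target vector per label,
# so one binary classifier is trained per label.
binary_relevance = {
    label: [int(label in s) for s in samples] for label in labels
}

# Label powerset: each distinct label combination becomes one class id,
# so a single multiclass classifier is trained.
combo_to_class = {}
powerset_targets = []
for s in samples:
    key = tuple(sorted(s))
    class_id = combo_to_class.setdefault(key, len(combo_to_class))
    powerset_targets.append(class_id)

print(binary_relevance)
print(powerset_targets)
```

The number of label-powerset classes can grow as fast as the number of distinct label combinations, which is one reason the method can be slow on large datasets.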
"""

In [None]:
"""
Adapted algorithms
Some classification algorithms/models have been adapted to the multi-label task without requiring problem transformations. Examples include:

k-nearest neighbors: the ML-kNN algorithm extends the k-NN classifier to multi-label data.[8]
decision trees: "Clare" is an adapted C4.5 algorithm for multi-label classification; the modification involves the entropy calculations.[9] MMC, MMDT, and SSC (a refinement of MMDT) can classify multi-labeled data based on multi-valued attributes without transforming the attributes into single values. They are also called multi-valued and multi-labeled decision tree classification methods.[10][11][12]
kernel methods for vector output
neural networks: BP-MLL is an adaptation of the popular back-propagation algorithm for multi-label learning.[13]

"""

In [None]:
# Decision boundary
# Logistic regression predicts log odds, and the decision boundary is where the log odds equal 0:
# log(p / (1 - p)) = 0  =>  p / (1 - p) = 1  =>  2p = 1  =>  p = 0.5

# For points below the decision boundary, the log odds are negative (p < 0.5).
# For points above the decision boundary, the log odds are positive (p > 0.5).



![image.png](attachment:image.png)

In [None]:
"""
Logistic Regression (LR):

LR is based on statistical approaches and is commonly used for binary classification tasks. It models the probability of a binary outcome (e.g., class 0 or class 1) based on one or more predictor variables.
LR uses the logistic function (sigmoid function) to transform a linear combination of the predictor variables into a probability. The logistic function maps the linear output to a value between 0 and 1, representing the probability of belonging to a particular class.
LR estimates the coefficients (weights) for each predictor variable using techniques like maximum likelihood estimation, and these coefficients indicate the impact of each predictor on the probability of the outcome.
LR provides interpretable results, as each coefficient can be read directly as the change in the log odds of the outcome for a one-unit change in the predictor variable, holding the other predictors fixed.
"""

In [None]:
"""
 How would you interpret coefficients of logistic regression for categorical and boolean variables? ***


 In logistic regression, a coefficient represents the change in the log odds of the outcome per one-unit change in the predictor, holding the other predictors fixed. For a boolean variable, a one-unit change means going from 0 to 1, so the coefficient is the difference in log odds between the two groups. A categorical variable is usually encoded as dummy variables against a reference level, and each coefficient compares its level with that reference. A positive coefficient implies an increase in the log odds of the event, while a negative coefficient implies a decrease. Exponentiating the coefficient yields the odds ratio, the multiplicative change in odds. For example, an odds ratio of 2 means the odds of the event are twice as high when a boolean predictor is 1 rather than 0, or for each unit increase in a numeric predictor. Confidence intervals for coefficients aid in assessing statistical significance. Interpretation should consider the specific context and variables involved.
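
As a worked example of reading a coefficient as an odds ratio, the sketch below uses a made-up model with one boolean predictor. The names `intercept` and `coef_is_member` and their values are hypothetical, not taken from any fitted model.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

intercept = -1.0         # hypothetical log odds when is_member = 0
coef_is_member = 0.6931  # hypothetical coefficient, roughly ln(2)

# Exponentiating the coefficient gives the odds ratio.
odds_ratio = math.exp(coef_is_member)

# Check against odds computed from the predicted probabilities.
p_non_member = sigmoid(intercept)
p_member = sigmoid(intercept + coef_is_member)
odds_non_member = p_non_member / (1 - p_non_member)
odds_member = p_member / (1 - p_member)

print(round(odds_ratio, 3), round(odds_member / odds_non_member, 3))
```

The two printed numbers agree: switching the boolean predictor from 0 to 1 multiplies the odds by exp(coefficient), whatever the intercept is.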
"""

In [None]:
"""
 What are the odds?
Odds are defined as the ratio of the probability of an event occurring to the probability of the event not occurring. 
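
The definition translates directly into code. This is a small sketch with helper names chosen for illustration; it also shows that a probability of 0.5 corresponds to odds of 1 and log odds of 0.

```python
import math

def odds_from_prob(p):
    # odds = P(event) / P(no event)
    return p / (1 - p)

def prob_from_odds(odds):
    # Inverse of odds_from_prob.
    return odds / (1 + odds)

def log_odds(p):
    # The logit: the quantity a logistic regression models linearly.
    return math.log(odds_from_prob(p))

print(odds_from_prob(0.8))   # about 4, i.e. "4 to 1"
print(log_odds(0.5))         # 0.0, the decision boundary
print(prob_from_odds(3.0))   # 0.75
```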
"""

In [None]:
"""
Let’s say we are trying to improve our search feature. How would you improve recall without changing the underlying algorithm?

Recall is the ratio of relevant items that are returned to all relevant items that exist: TP / (TP + FN). To improve recall in Amazon’s search feature without changing the algorithm, this means either relaxing the acceptance threshold or increasing the number of parameters that are evaluated for a match. I would enrich the metadata and product descriptions to ensure broader coverage of relevant keywords. Additionally, lowering the threshold for matching query terms increases the number of returned results, which raises recall, usually at some cost in precision.
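
A small sketch of the threshold effect, using made-up match scores and relevance flags for six search results. The function name and the data are assumptions for illustration only.

```python
def recall_at_threshold(scores, relevant, threshold):
    # An item is returned when its match score clears the threshold.
    returned = [s >= threshold for s in scores]
    true_positives = sum(1 for r, rel in zip(returned, relevant) if r and rel)
    false_negatives = sum(1 for r, rel in zip(returned, relevant) if not r and rel)
    return true_positives / (true_positives + false_negatives)

scores = [0.9, 0.7, 0.55, 0.4, 0.3, 0.1]            # hypothetical match scores
relevant = [True, True, False, True, False, True]   # hypothetical ground truth

strict = recall_at_threshold(scores, relevant, 0.6)
loose = recall_at_threshold(scores, relevant, 0.35)
print(strict, loose)
```

Lowering the threshold returns more items, so more of the relevant ones are found, but the irrelevant item scored 0.55 is now returned as well.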

"""

In [None]:
"""
Which algorithm is better when outliers are present in the dataset: Logistic Regression or SVM?
SVM (Support Vector Machine) generally handles outliers better than Logistic Regression.
Logistic Regression: Logistic Regression fits a linear boundary using every training point, so it will shift the boundary to accommodate the outliers.

SVM: The SVM boundary is determined only by the support vectors, so it is insensitive to most individual samples, and accommodating an outlier usually does not cause a major shift in the linear boundary. SVM also comes with a built-in complexity control (the margin penalty C), which helps guard against overfitting; Logistic Regression has no such control unless a regularization term is added.
"""

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [None]:
"""
Underlying Concept:

SVM: SVMs find the optimal hyperplane that maximizes the margin to the closest data points of each class (the support vectors). This hyperplane is the decision boundary for future classifications.

Logistic Regression: Logistic Regression uses a sigmoid function to model the probability of a data point belonging to a specific class. It estimates a linear relationship between the features and the log odds of belonging to a particular class.

Applications:

Data: SVMs work well with unstructured and semi-structured data, like text and images. LR works with independent variables that have already been identified.

SVM: Commonly used for tasks where understanding the decision boundary is crucial, such as image classification, spam detection, or novelty detection (identifying outliers). SVMs can also be effective for high-dimensional data with a small number of samples.

Logistic Regression: Well-suited for problems where predicting the probability of an event is important, such as sentiment analysis, credit risk assessment, or predicting customer churn. It's also often a good choice when an interpretable model is needed, as the coefficients of the logistic regression model can provide insights into feature importance.
"""

In [None]:
"""
SVM works well with unstructured and semi-structured data like text and images while logistic regression works with already identified independent variables.
SVM is based on geometrical properties of the data while logistic regression is based on statistical approaches.
The risk of overfitting is less in SVM, while Logistic regression is vulnerable to overfitting.


When To Use Logistic Regression vs Support Vector Machine
Depending on the number of training examples and features that you have, you can choose either logistic regression or a support vector machine.

Let's take this as an example, where:
n = number of features,
m = number of training examples

1. If n is large (1–10,000) and m is small (10–1,000): use logistic regression or SVM with a linear kernel.

2. If n is small (1–1,000) and m is intermediate (10–10,000): use SVM with a (Gaussian, polynomial, etc.) kernel.

3. If n is small (1–1,000) and m is large (50,000–1,000,000+): first manually add more features, then use logistic regression or SVM with a linear kernel.

It is usually advisable to try logistic regression first to see how the model does. If it fails, try SVM without a kernel (otherwise known as SVM with a linear kernel). Logistic regression and SVM with a linear kernel have similar performance, but depending on your features, one may be more efficient than the other.
"""

In [None]:
"""
Can you use logistic regression for classification between more than two classes?

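Yes. Logistic regression can be used for classification between more than two classes, in which case it is called multinomial logistic regression. However, it cannot be implemented without modifying the vanilla logistic regression model: the sigmoid is replaced with a softmax over the classes. Alternatively, several binary models can be combined in a one-vs-rest or one-vs-one scheme.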




"""

In [None]:
"""
In One-vs-Rest classification, we generate as many binary classifiers as there are class labels in the dataset, so for three classes we create three classifiers:

Classifier 1:- [Green] vs [Red, Blue]
Classifier 2:- [Blue] vs [Green, Red]
Classifier 3:- [Red] vs [Blue, Green]


After training, each test input is passed to all of the generated classifiers. The classifier for the class the input belongs to should give a positive response (+1), while the other classifiers give a negative response (-1). In practice, each binary classifier outputs the probability that the input belongs to its class.

By analyzing the probability scores, we predict the class whose classifier gives the maximum probability score.
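
The prediction step can be sketched as follows, assuming three already-trained binary logistic models. The weights below are invented for illustration and do not come from any real training run.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical (bias, w1, w2) for each "this class vs the rest" model.
ovr_models = {
    "Green": (0.5, 1.0, -1.0),
    "Blue": (-0.2, -1.0, 0.5),
    "Red": (0.0, 0.3, 1.2),
}

def predict_ovr(x1, x2):
    # Score the input with every binary classifier.
    scores = {
        cls: sigmoid(b + w1 * x1 + w2 * x2)
        for cls, (b, w1, w2) in ovr_models.items()
    }
    # The predicted class is the one with the highest probability score.
    return max(scores, key=scores.get), scores

label, scores = predict_ovr(2.0, 0.5)
print(label, scores)
```

Because the classifiers are trained independently, their scores need not sum to 1; only their ranking matters for the final prediction.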


"""

![image.png](attachment:image.png)

In [None]:
# one vs one

![image.png](attachment:image.png)

In [None]:
"""
In One-vs-One classification, for a dataset with N classes, we have to generate N * (N - 1) / 2 binary classifier models. Using this approach, we split the primary dataset into one binary dataset for each pair of classes.

Taking the above example, we have a classification problem with three classes: Green, Blue, and Red (N = 3).

We divide this problem into N * (N - 1) / 2 = 3 binary classification problems:

Classifier 1: Green vs. Blue
Classifier 2: Green vs. Red
Classifier 3: Blue vs. Red

Each binary classifier predicts one class label. When we pass test data to the classifiers, the class that receives the most votes is the final prediction.
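
A sketch of the pairing and voting logic. Each pairwise "classifier" here is a stand-in that votes for the nearer of two made-up class prototypes; in practice each pair would have its own trained binary model.

```python
from collections import Counter
from itertools import combinations

classes = ["Green", "Blue", "Red"]
pairs = list(combinations(classes, 2))  # N * (N - 1) / 2 pairs

# Hypothetical prototype point per class, used only by the stand-in classifiers.
prototypes = {"Green": (0.0, 0.0), "Blue": (4.0, 0.0), "Red": (0.0, 4.0)}

def pairwise_vote(a, b, x):
    # Stand-in for a trained "a vs b" classifier: vote for the nearer prototype.
    def dist2(cls):
        px, py = prototypes[cls]
        return (x[0] - px) ** 2 + (x[1] - py) ** 2
    return a if dist2(a) <= dist2(b) else b

def predict_ovo(x):
    # Every pairwise classifier casts one vote; the majority wins.
    votes = Counter(pairwise_vote(a, b, x) for a, b in pairs)
    return votes.most_common(1)[0][0], votes

label, votes = predict_ovo((3.5, 0.5))
print(pairs)
print(label, dict(votes))
```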
"""

In [None]:
"""
Why can't we use the mean square error cost function used in linear regression for logistic regression?

If we use mean square error in logistic regression, the resultant cost function will be non-convex, i.e., a function that can have multiple local minima, owing to the presence of the sigmoid function in h(x). As a result, gradient descent may fail to optimize the cost function properly and may settle in a local minimum instead of the global minimum.
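
The non-convexity can be checked numerically on a single training example. This sketch assumes one feature x = 1 with label y = 1 and a single weight w; a negative second difference at some point means the function curves downward there and so cannot be convex.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_loss(w, x=1.0, y=1.0):
    return (sigmoid(w * x) - y) ** 2

def bce_loss(w, x=1.0, y=1.0):
    p = sigmoid(w * x)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def second_difference(f, w, h=0.5):
    # Discrete analogue of the second derivative; never negative for a convex f.
    return f(w - h) - 2 * f(w) + f(w + h)

mse_curvature = second_difference(mse_loss, -3.0)
bce_curvature = second_difference(bce_loss, -3.0)
print(mse_curvature, bce_curvature)
```

At w = -3 the squared-error curve bends downward (negative value) while the cross-entropy curve bends upward, which is consistent with cross-entropy being the convex choice for logistic regression.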
"""

In [None]:
"""
If you observe that the cost function decreases rapidly before increasing or stagnating at a specific high value, what could you infer?

A trend pattern of the cost curve exhibiting a rapid decrease before then increasing or stagnating at a specific high value indicates that the learning rate is too high. The gradient descent is bouncing around the global minimum but missing it owing to the larger than necessary step size.
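
The overshoot can be reproduced on a toy problem. This sketch minimizes f(w) = w**2 with plain gradient descent; the two learning rates are arbitrary values chosen to show the contrast, and a simple quadratic only shows the divergence part of the pattern.

```python
def gradient_descent(learning_rate, steps=20, w0=5.0):
    # Minimize f(w) = w**2 and record the cost after every step.
    w = w0
    costs = [w ** 2]
    for _ in range(steps):
        gradient = 2 * w
        w -= learning_rate * gradient
        costs.append(w ** 2)
    return costs

good = gradient_descent(0.1)       # each step moves w closer to the minimum at 0
too_high = gradient_descent(1.05)  # each step jumps past the minimum and lands farther away

print(good[-1], too_high[-1])
```

With the larger step size every update overshoots the minimum, so the cost grows instead of shrinking; reducing the learning rate restores convergence.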
"""

In [None]:
"""
Will the decision boundary be linear or non-linear in logistic regression models? Explain with an example.

The decision boundary is the line or surface that separates the classes predicted by logistic regression. Its shape depends on how the features enter the model.

For a logistic regression model given by the hypothesis function h(x) = g(θ^T x), where g is the sigmoid function: if h(x) = g(1 + 2x_2 + 3x_3), the argument is linear in x and the decision boundary is linear. Alternatively, if h(x) = g(1 + 2x_2^2 + 3x_3^2), the argument is quadratic in x and the decision boundary is non-linear: it is a curve (an ellipse, for any probability threshold above g(1) ≈ 0.73) rather than a line.
"""

In [None]:
"""
Wald Test Applicability:

Wald tests are generally applicable to any model estimated using maximum likelihood estimation (MLE). This includes logistic regression (a type of generalized linear model) and some, but not all, linear regressions.


The Wald statistic is W = (β^ − β0)² / VAR(β^) = ((β^ − β0) / SE(β^))², which under the null hypothesis asymptotically follows χ²₁, where:

β^ : maximum-likelihood estimate (MLE) of the coefficient

β0 : hypothesized value of the parameter of interest, usually 0, as we want to test whether the coefficient differs from zero

SE : standard error of the MLE

VAR : variance of the MLE

χ²₁ : chi-square distribution with 1 degree of freedom

The Wald test results interpretation:

If β^ is significantly different from β0 (null hypothesis: β = β0 = 0), it suggests that the variable significantly improves model fit and is significant.

The Wald test can be used to simultaneously test many parameters. For example: The null hypothesis can be 2 coefficients of interest are at the same time equal to zero. If the test rejects the null hypothesis, this suggests that the 2 variables are significant to that model fit. If the test results could not reject the null hypothesis, this means that removing the variables from the model will not considerably damage the fit of that model. Wald test is used to compare models on best fit criteria in case of logistic regression.
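
For a single coefficient, the test reduces to a few lines. This sketch assumes the usual statistic W = ((estimate - null value) / standard error)**2 compared against a chi-square distribution with 1 degree of freedom; the estimate and standard error passed in are made-up numbers.

```python
import math

def wald_test(beta_hat, standard_error, beta_null=0.0):
    # Wald statistic for one coefficient.
    w = ((beta_hat - beta_null) / standard_error) ** 2
    # Upper tail of chi-square with 1 degree of freedom:
    # P(X > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2)) for a standard normal Z.
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value

w_stat, p_value = wald_test(beta_hat=0.8, standard_error=0.25)  # hypothetical fit
print(round(w_stat, 2), round(p_value, 4))
```

A small p-value leads to rejecting the null hypothesis that the coefficient is zero. Testing several coefficients at once needs the full covariance matrix of the estimates, which this single-parameter sketch does not cover.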
"""

In [None]:
"""
In classification problems like logistic regression, classification accuracy alone is not considered a good measure. Why?

Classification accuracy treats false positives and false negatives as equally costly, and it can look high on imbalanced data even when the minority class is never predicted. If this were just another machine learning problem of not too much consequence, this would be acceptable. However, when the problem involves deciding whether to consider a candidate for life-saving treatment, false positives might not be as bad as false negatives. The opposite can also be true in some cases.
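
A short illustration with made-up labels: 5 positive cases out of 100, and a model that always predicts the negative class.

```python
y_true = [1] * 5 + [0] * 95   # hypothetical labels: 5 patients need treatment
y_pred = [0] * 100            # a model that never predicts the positive class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
false_negatives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = true_positives / (true_positives + false_negatives)

print(accuracy, recall)
```

The accuracy is 0.95 even though every positive case is missed, which is why metrics such as recall, precision, or a full confusion matrix are reported alongside it.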
"""

In [None]:
"""
Which is the most preferred algorithm for variable selection?

Lasso is a commonly preferred method for variable selection because it fits the regression with an L1 shrinkage penalty: the coefficients are shrunk toward zero, and the coefficients of less significant variables are forced to exactly zero, which removes those variables from the model.
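
The zeroing behaviour comes from the soft-thresholding operator. The sketch below applies it directly to made-up unpenalized coefficients, which matches the exact Lasso solution only in the special case of orthonormal features; for general data the same operator is applied repeatedly inside coordinate descent. The feature names, values, and penalty are assumptions.

```python
def soft_threshold(z, penalty):
    # Shrink z toward zero by `penalty`; anything within the penalty becomes exactly 0.
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0

# Hypothetical unpenalized coefficients.
ols_coefficients = {"age": 2.5, "income": -0.3, "zip_digit": 0.1, "tenure": -1.4}
penalty = 0.5

lasso_coefficients = {
    name: soft_threshold(c, penalty) for name, c in ols_coefficients.items()
}
selected = [name for name, c in lasso_coefficients.items() if c != 0.0]

print(lasso_coefficients)
print(selected)
```

Only the variables whose coefficients survive the threshold remain in the model, which is the sense in which Lasso performs variable selection.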
"""

In [None]:
"""
Differences between decision trees and logistic regression:
Decision boundaries: Decision trees split the space into smaller and smaller regions. Logistic regression fits a single line (or hyperplane) that divides the space into exactly two regions.

Interpretability: Small decision trees are usually easier to interpret than logistic regression.
Missing values: Many decision tree implementations can work with missing values, but logistic regression requires them to be imputed or dropped first.

"""

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [None]:
"""
Naive Bayes
Naive Bayes is a probabilistic algorithm that applies Bayes' theorem to calculate the probability of each class given the features. Bayes' theorem is a formula that relates the conditional probability of an event to its prior probability and the evidence. Naive Bayes makes a simplifying assumption that the features are independent of each other given the class, which is often unrealistic but makes the computation easier. Naive Bayes can handle both categorical and numerical features, and it can deal with missing values by ignoring them or using a default value. Naive Bayes is fast, easy to implement, and works well with high-dimensional data and small training sets.
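
A compact sketch of the idea for categorical features, with add-one (Laplace) smoothing so that unseen feature values do not zero out a class. The tiny weather-style dataset and all names are made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training data: ((outlook, temperature), play?)
train = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("rainy", "cool"), "yes"),
    (("overcast", "hot"), "yes"),
]

class_counts = Counter(label for _, label in train)
feature_counts = defaultdict(Counter)   # (feature index, label) -> value counts
values_per_feature = defaultdict(set)   # feature index -> distinct values seen
for features, label in train:
    for i, value in enumerate(features):
        feature_counts[(i, label)][value] += 1
        values_per_feature[i].add(value)

def predict(features, alpha=1.0):
    log_scores = {}
    for label, count in class_counts.items():
        log_p = math.log(count / len(train))  # log prior of the class
        for i, value in enumerate(features):
            # Features are treated as independent given the class,
            # so their log likelihoods simply add up.
            numerator = feature_counts[(i, label)][value] + alpha
            denominator = count + alpha * len(values_per_feature[i])
            log_p += math.log(numerator / denominator)
        log_scores[label] = log_p
    return max(log_scores, key=log_scores.get)

prediction = predict(("rainy", "hot"))
print(prediction)
```

Working in log space avoids underflow when many small probabilities are multiplied, which matters for high-dimensional data.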


"""

![image.png](attachment:image.png)

![image.png](attachment:image.png)

In [None]:
# BCE loss function is convex in the parameters for logistic regression ***

![image.png](attachment:image.png)

In [None]:
"""
Can logistic regression be used for imbalanced data?

Yes, by using class weights. The weights are supplied as a dictionary whose keys are the classes of the dataset and whose values are the weights assigned to each class, so that errors on the minority class are penalized more heavily.
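
One common way to fill such a dictionary is the "balanced" heuristic, weight = n_samples / (n_classes * class_count). The sketch below computes it for a made-up 90/10 label split; the helper name is chosen for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    # weight = n_samples / (n_classes * count): rarer classes get larger weights.
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

y = [0] * 90 + [1] * 10   # hypothetical imbalanced labels
class_weight = balanced_class_weights(y)
print(class_weight)
```

A dictionary of this shape can be passed as the `class_weight` argument of scikit-learn's `LogisticRegression`, which also accepts `class_weight="balanced"` to apply the same heuristic automatically.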

"""
