## Assignment

**Q1. Explain the difference between linear regression and logistic regression models. Provide an example of
a scenario where logistic regression would be more appropriate.**

Ans: Linear Regression:
- Predicts a continuous outcome (e.g., house price, temperature).
- Models a linear relationship between the input features and the outcome.
- Uses the best-fit line to make predictions.

Logistic Regression:
- Predicts a categorical outcome (e.g., yes/no, spam/not spam).
- Models the probability of a certain outcome.
- Uses an S-shaped curve (the sigmoid function) to map a linear combination of the features to a probability between 0 and 1.

Scenario:
A healthcare organization wants to predict whether a patient is likely to develop diabetes (Yes/No) based on features such as age, BMI, glucose levels, and family history.

Why Logistic Regression is Appropriate:
- Nature of the Problem: The outcome (diabetes or no diabetes) is binary (categorical).
- Output: Logistic regression predicts the probability of developing diabetes, which can be thresholded to give a Yes/No decision.
- Interpretability: Exponentiating each coefficient gives an odds ratio, which shows how a one-unit change in that feature changes the odds of diabetes.
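
To make the contrast concrete, here is a minimal sketch that applies the sigmoid function to a linear score for the diabetes scenario. The intercept and coefficients are hypothetical values chosen only for illustration; they are not fitted to any data.

```python
import math


def sigmoid(z: float) -> float:
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for illustration only (not estimated from real data).
INTERCEPT = -8.0
COEFFS = {"age": 0.03, "bmi": 0.08, "glucose": 0.04}


def diabetes_probability(patient: dict) -> float:
    # The linear part has the same form as linear regression,
    # but it is interpreted as log-odds rather than as the outcome itself.
    log_odds = INTERCEPT + sum(COEFFS[name] * patient[name] for name in COEFFS)
    return sigmoid(log_odds)


patient = {"age": 50, "bmi": 30, "glucose": 140}
p = diabetes_probability(patient)
label = "Yes" if p >= 0.5 else "No"
print(f"P(diabetes) = {p:.3f} -> {label}")

# Odds ratio for a one-unit increase in BMI, holding the other features fixed.
bmi_odds_ratio = math.exp(COEFFS["bmi"])
print(f"Odds ratio for BMI: {bmi_odds_ratio:.3f}")
```

A linear regression model would return the raw score directly, which can fall outside the 0 to 1 range; the sigmoid is what turns it into a usable probability.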


**Q2. What is the cost function used in logistic regression, and how is it optimized?**

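Ans: The cost function in logistic regression is called log loss (or binary cross-entropy loss). It measures how far the predicted probabilities are from the actual 0/1 outcomes, and it penalizes confident wrong predictions most heavily.

It's optimized using gradient descent, an iterative algorithm that finds the parameters minimizing the cost function by repeatedly adjusting them in the direction of the steepest decrease. Because log loss is convex for logistic regression, gradient descent with a suitable learning rate converges to the global minimum.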
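
The two ideas above can be sketched in a few lines of plain Python. This is a minimal illustration for a single-feature model, assuming batch gradient descent with a fixed learning rate; the toy data, the learning rate of 0.1, and the 2000 iterations are arbitrary choices made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(w, b, xs, ys):
    """Average cross-entropy between predicted probabilities and 0/1 labels."""
    eps = 1e-12  # keeps log() away from exactly 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def gradient_step(w, b, xs, ys, lr):
    """One batch gradient-descent update of the weight and intercept."""
    n = len(xs)
    errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(errors) / n
    return w - lr * grad_w, b - lr * grad_b


# Toy data, made up for illustration: larger x tends to mean y = 1.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [0, 0, 1, 0, 1, 1]

w, b = 0.0, 0.0
initial_loss = log_loss(w, b, xs, ys)
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys, lr=0.1)
final_loss = log_loss(w, b, xs, ys)

print(f"loss: {initial_loss:.4f} -> {final_loss:.4f} (w={w:.3f}, b={b:.3f})")
```

With all parameters at zero, every prediction is 0.5, so the starting loss equals ln 2; each update then moves the parameters against the gradient and the loss decreases.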

**Q3. Explain the concept of regularization in logistic regression and how it helps prevent overfitting.**

Ans: Regularization is a technique for preventing overfitting, which occurs when a model fits the training data too closely, including its noise and outliers, and therefore performs poorly on new, unseen data. It works by adding a penalty term to the cost function that the model minimizes during training. The penalty grows with the size of the coefficients, so it discourages the model from assigning excessively large weights to any feature. The two common choices are L2 (ridge), which penalizes the sum of squared coefficients and shrinks them toward zero, and L1 (lasso), which penalizes the sum of absolute values and can set some coefficients exactly to zero. A strength parameter controls the trade-off: a stronger penalty constrains the model's complexity and improves generalization, up to the point where the model starts to underfit.
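
The effect of the penalty can be seen in a small sketch. It assumes an L2 penalty of the form `(lam / 2) * w**2` on the weight only, with the intercept left unpenalized; the name `lam`, its value of 0.1, and the toy data are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit(xs, ys, lam, lr=0.1, steps=5000):
    """Gradient descent on log loss + (lam / 2) * w**2 for one feature."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # The penalty contributes lam * w to the gradient of the weight.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + lam * w
        grad_b = sum(errors) / n  # intercept is not penalized here
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


# Perfectly separable toy data: without a penalty the weight keeps growing
# as training continues, which is a simple form of overfitting.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 0, 1, 1, 1]

w_plain, _ = fit(xs, ys, lam=0.0)
w_l2, _ = fit(xs, ys, lam=0.1)
print(f"no penalty: w={w_plain:.3f}   L2 penalty (lam=0.1): w={w_l2:.3f}")
```

The penalized weight settles at a moderate value, while the unpenalized weight continues to drift upward with more iterations.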

**Q4. What is the ROC curve, and how is it used to evaluate the performance of the logistic regression
model?**

Ans:
The ROC (Receiver Operating Characteristic) curve is a graphical representation of the performance of a binary classification model, such as logistic regression, across all possible classification thresholds. It plots two key metrics:
- True Positive Rate (TPR) or Sensitivity: The proportion of actual positives that are correctly identified, TP / (TP + FN).
- False Positive Rate (FPR): The proportion of actual negatives that are incorrectly identified as positives, FP / (FP + TN).

How it's used to evaluate logistic regression:

Varying the Threshold: Logistic regression outputs probabilities between 0 and 1. To make a final classification (e.g., yes/no), you need to set a threshold (e.g., 0.5). The ROC curve is created by varying this threshold and calculating the TPR and FPR at each threshold.   

Plotting the Curve: The TPR is plotted on the y-axis, and the FPR is plotted on the x-axis. This creates a curve that shows the trade-off between sensitivity and specificity (1 - FPR) at different thresholds.   

Ideal Curve: An ideal classifier would have a curve that goes straight up to the top left corner (TPR = 1, FPR = 0), meaning it correctly classifies all instances.

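Area Under the Curve (AUC): The AUC is a single number that summarizes the overall performance of the model across thresholds. An AUC of 1 represents a perfect classifier, while an AUC of 0.5 represents a classifier that performs no better than random guessing. Equivalently, the AUC is the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one.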
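
The threshold sweep and the AUC calculation described above can be sketched as follows. This is a minimal pure-Python illustration that uses each distinct score as a threshold and the trapezoidal rule for the area; the labels and scores are made up.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # A threshold above every score predicts nothing as positive.
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.5, 0.7, 0.2, 0.9]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
print(f"AUC = {auc(points):.3f}")
```

Here 14 of the 16 positive/negative pairs are ranked correctly, so the AUC is 0.875, which matches the pairwise-ranking interpretation of the metric.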

**Q5. What are some common techniques for feature selection in logistic regression? How do these
techniques help improve the model's performance?**

Ans: Feature selection in logistic regression aims to identify the most relevant features for predicting the outcome and to discard irrelevant or redundant ones. This simplifies the model, can improve predictive performance, and reduces overfitting.
Here are some common techniques:
- Filter Methods: Use statistical measures to rank features independently of the model. Examples include:
    - Chi-square test: Tests whether a categorical feature is independent of the outcome.
    - ANOVA: Compares the mean of a continuous feature across outcome classes to assess its relevance.

- Wrapper Methods: Evaluate feature subsets by training the model with different combinations. Examples include:
    - Forward selection: Starts with an empty set and iteratively adds the feature that most improves the model.
    - Backward elimination: Starts with all features and iteratively removes the least important one.
    - Recursive Feature Elimination (RFE): Repeatedly fits the model and removes the least important features, based on their importance ranking, until the desired number remains.

- Embedded Methods: Perform feature selection as part of the model training process. Examples include:
    - L1 regularization (Lasso): Shrinks some feature coefficients to zero, effectively selecting features.  
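
As an illustration of a filter method, the sketch below computes the Pearson chi-square statistic between each categorical feature and the outcome, then ranks the features by it. The feature names `smoker` and `coin_flip` and all values are hypothetical, and the sketch reports only the statistic, not a p-value.

```python
def chi_square(feature, outcome):
    """Pearson chi-square statistic for two categorical sequences (lists)."""
    n = len(feature)
    stat = 0.0
    for f in sorted(set(feature)):
        for o in sorted(set(outcome)):
            observed = sum(1 for a, b in zip(feature, outcome) if a == f and b == o)
            # Count expected under independence of feature and outcome.
            expected = feature.count(f) * outcome.count(o) / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical toy data: 'smoker' is associated with the outcome,
# 'coin_flip' is unrelated to it by construction.
outcome = [1, 1, 1, 1, 0, 0, 0, 0]
smoker = [1, 1, 1, 0, 0, 0, 0, 1]
coin_flip = [1, 0, 1, 0, 1, 0, 1, 0]

features = {"smoker": smoker, "coin_flip": coin_flip}
scores = {name: chi_square(values, outcome) for name, values in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(f"{name}: chi-square = {scores[name]:.2f}")
```

A larger statistic indicates a stronger departure from independence, so features near zero are candidates for removal before the model is fitted.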

**Q6. How can you handle imbalanced datasets in logistic regression? What are some strategies for dealing
with class imbalance?**

Ans:
Imbalanced datasets, where one class has significantly more instances than the other, can bias logistic regression towards the majority class. Here's how to handle them:   

1. Data-level techniques:

    - Oversampling: Increase the number of minority class instances by:
        - Random oversampling: Duplicating existing instances.   
        - SMOTE (Synthetic Minority Over-sampling Technique): Creating synthetic instances based on existing ones.   
    - Undersampling: Decrease the number of majority class instances by:
        - Random undersampling: Randomly removing instances.   
        - Tomek links: Removing the majority class instance from each Tomek link, i.e., a pair of nearest neighbors that belong to opposite classes.
2. Algorithm-level techniques:

    - Class weights: Assign higher weights to the minority class in the cost function, making its misclassifications more costly.
3. Evaluation metrics:

    Use metrics beyond accuracy, such as:
    - Precision: Proportion of true positives among predicted positives.
    - Recall (Sensitivity): Proportion of true positives among actual positives.
    - F1-score: Harmonic mean of precision and recall.
    - AUC: Area under the ROC curve.
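
The sketch below illustrates two of the points above: why accuracy is misleading on imbalanced data, and how class weights might be chosen. It assumes the common "balanced" heuristic, where each class is weighted by `n_samples / (n_classes * class_count)`; the 90/10 split is made-up data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


def precision_recall_f1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up data: 90 negatives and 10 positives.
y_true = [0] * 90 + [1] * 10
always_negative = [0] * 100  # a model that ignores the minority class

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, always_negative)
weights = balanced_class_weights(y_true)

print(f"accuracy={accuracy:.2f}  recall={recall:.2f}  f1={f1:.2f}")
print(f"class weights: {weights}")
```

The always-negative model reaches 90% accuracy while finding no positive cases at all, which is exactly what recall and F1 expose. The balanced weights make each minority example count nine times as much as a majority example in the loss.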



**Q7. Can you discuss some common issues and challenges that may arise when implementing logistic
regression, and how they can be addressed? For example, what can be done if there is multicollinearity
among the independent variables?**

Ans:
Common issues in logistic regression and their solutions:
- Multicollinearity (high correlation between predictors):
    - Problem: Makes it hard to isolate the effect of individual predictors, and inflates the standard errors of the coefficients.
    - Solution: Remove one of the correlated variables, combine them into a composite variable, or use regularization (L1 or L2).
- Overfitting:
    - Problem: Model performs well on training data but poorly on unseen data.
    - Solution: Use regularization, cross-validation, or simplify the model (feature selection).
- Outliers:
    - Problem: Can disproportionately influence the model.
    - Solution: Identify and remove or transform outliers.
- Imbalanced datasets:
    - Problem: Model biased towards the majority class.
    - Solution: Use oversampling, undersampling, class weights, or appropriate evaluation metrics (precision, recall, F1-score, AUC).
- Linearity assumption violation (for continuous predictors):
    - Problem: Logistic regression assumes a linear relationship between continuous predictors and the log-odds of the outcome.
    - Solution: Transform the predictor variables or use a non-linear model.
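
For the multicollinearity item above, one common diagnostic is the variance inflation factor (VIF). The sketch below is a simplified illustration that only handles a pair of predictors, where the VIF reduces to `1 / (1 - r**2)` with `r` the Pearson correlation; the predictor values are made up, and the often-quoted cut-offs of 5 or 10 are rules of thumb rather than fixed limits.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def vif_two_predictors(xs, ys):
    """VIF = 1 / (1 - R^2); with exactly two predictors, R^2 equals r^2."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r ** 2)


# Made-up predictors: weight tracks BMI closely, age does not.
bmi = [22.0, 25.0, 27.5, 30.0, 33.0, 36.0]
weight_kg = [62.0, 71.0, 77.0, 86.0, 92.0, 103.0]
age = [51.0, 34.0, 60.0, 29.0, 45.0, 38.0]

vif_bmi_weight = vif_two_predictors(bmi, weight_kg)
vif_bmi_age = vif_two_predictors(bmi, age)
print(f"VIF (bmi vs weight_kg): {vif_bmi_weight:.1f}")
print(f"VIF (bmi vs age):       {vif_bmi_age:.2f}")
```

A very large VIF for the BMI and weight pair suggests keeping only one of them or combining them, while a value close to 1 for BMI and age indicates little shared information.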
 