## What’s the difference between a Regression Problem and a Classification Problem?

------------
The answer is the label type:
in `Regression` the label is a continuous number, while in `Classification` it is a discrete number.

- In a Regression Problem, the label is a continuous (real) number. Taking housing prices as an example, different combinations of features give labels (i.e. prices) such as $460,000, $232,000, $315,000, etc. Each of them is the actual value for that data point.

![image.png](attachment:image.png)

**Whereas the label in Classification Problem represents `category`**.
- Here we take breast cancer detection as an example. In breast cancer detection, what we care about most is the tumor type, that is, malignant or benign. For convenience, we label malignant and benign as 0 and 1 respectively. Note that the discrete numbers (0 and 1) stand for the label (tumor type) of the data, not for a quantity.

![image.png](attachment:image.png)

### Classification Problem

------------

**Logistic Regression is built on the sigmoid (logistic) function,** 
- which maps real numbers to probabilities in the range [0, 1]. Hence, the value of the sigmoid function expresses how certain the model is that the data belongs to a category. The formula is shown in the following picture.

- The score t is often called the logit. The name comes from the fact that the logit function, defined as logit(p) = log(p / (1 – p)), is the inverse of the logistic function. Indeed, if you compute the logit of the estimated probability p, you will find that the result is t. The logit is also called the log-odds, since it is the log of the ratio between the estimated probability for the positive class and the estimated probability for the negative class.

- The bad news is that there is no known closed-form equation to compute the value of θ that minimizes this cost function (there is no equivalent of the Normal Equation). The good news is that this cost function is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum (if the learning rate is not too large and you wait long enough). 
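The relationship between the sigmoid and the logit described above can be checked numerically. The following is a minimal sketch using only the standard library; the function names `sigmoid` and `logit` are chosen here for illustration.

```python
import math

def sigmoid(t):
    """Logistic function: maps any real score t to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Log-odds, the inverse of the logistic function (defined for 0 < p < 1)."""
    return math.log(p / (1.0 - p))

for t in (-2.0, 0.0, 3.0):
    p = sigmoid(t)
    # logit(sigmoid(t)) recovers t up to floating-point rounding
    print(f"t={t:+.1f}  p={p:.4f}  logit(p)={logit(p):+.4f}")
```

A score of 0 maps to a probability of exactly 0.5, which is why the sign of the score decides the predicted class.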


![image.png](attachment:image.png)

In order to understand this concept more easily and clearly, let’s visualize the example and explain it step by step.
1. Preparing Dataset - suppose we have a breast cancer detection dataset in which each sample has two features and one label.
![image.png](attachment:image.png)



2. **Getting Parameters** - after the learning process, we get a trained model, that is, we get the parameters of the logistic regression. Here, we assume they are -3, 1 and 1.
![image.png](attachment:image.png)


3. Predicting Labels - according to the sigmoid function with the learned parameters.
If -3+x1+x2≥0, then h(x)≥0.5 and we predict that this tumor is benign (label 1).
If not, we predict that this tumor is malignant (label 0).
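As a rough sketch of step 3, the snippet below applies the assumed parameters (-3, 1, 1) to a few points. The points themselves are hypothetical and chosen only for illustration; they are not taken from the dataset in the pictures.

```python
import math

# (bias, weight for x1, weight for x2): the parameters assumed in step 2
THETA = (-3.0, 1.0, 1.0)

def predict_proba(x1, x2):
    """h(x): estimated probability that the point has label 1."""
    z = THETA[0] + THETA[1] * x1 + THETA[2] * x2
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(x1, x2):
    """Label 1 when h(x) >= 0.5, which is the same as -3 + x1 + x2 >= 0."""
    return 1 if predict_proba(x1, x2) >= 0.5 else 0

# Hypothetical points, for illustration only
for point in [(1.0, 1.0), (1.5, 1.5), (2.0, 2.0), (4.0, 3.0)]:
    print(point, round(predict_proba(*point), 3), predict_label(*point))
```

The point (1.5, 1.5) lies exactly on the line -3 + x1 + x2 = 0, so its probability is 0.5; points farther from the line on the positive side get probabilities closer to 1.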


![image.png](attachment:image.png)

##### Understanding the concept of probability in logistic regression
In the following picture, although points A and B are both labeled 1, their probabilities are different: P(B∈1)>P(A∈1).
- *When a point is farther away from the red line, its label is more certain.* 
Whereas when a point is right on the red line, the predictor cannot determine which label the point belongs to, since the probabilities are the same: P(a∈label0)=P(a∈label1)=0.5. That’s quite intuitive.
![image.png](attachment:image.png)



###### Cost Function
After the learning process, we get the predictor - the logistic regression with learned parameters. 
To assess and improve the quality of the predictor, ***we need to define a Cost Function, which measures the error made by the hypothesis, and then minimize it.***

![image.png](attachment:image.png)

![image.png](attachment:image.png)

- For each instance, the partial derivative of the cost function computes the prediction error and multiplies it by the jth feature value, and then it computes the average over all training instances.
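The cost function and its gradient can be sketched in plain Python as follows. This assumes the usual log loss (cross-entropy) for logistic regression; the toy data, the learning rate, and the helper names `log_loss` and `gradient` are illustrative assumptions, not part of the original notes.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_loss(theta, X, y):
    """Average cross-entropy; every row of X starts with 1.0 for the bias term."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        total += -(yi * math.log(p) + (1 - yi) * math.log(1.0 - p))
    return total / len(X)

def gradient(theta, X, y):
    """j-th component: average over instances of (prediction - label) * x_j."""
    m = len(X)
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        error = sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
        for j, xij in enumerate(xi):
            grad[j] += error * xij / m
    return grad

# Toy, non-separable data (hypothetical): one feature plus a bias column
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 1, 0, 1]

theta = [0.0, 0.0]
learning_rate = 0.1
initial_cost = log_loss(theta, X, y)
for _ in range(1000):
    g = gradient(theta, X, y)
    theta = [t - learning_rate * gj for t, gj in zip(theta, g)]
final_cost = log_loss(theta, X, y)
print(initial_cost, final_cost)
```

With all parameters at zero every prediction is 0.5, so the starting cost is log(2); gradient descent then lowers it step by step.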

## Multi-class classification
For example, suppose we have a dataset with three classes.
- Class1 (△)
- Class2 (◯)
- Class3 (☆)

![image.png](attachment:image.png)

Step1 - Choose class1 as the positive class and treat all the other classes as the negative class.
![image.png](attachment:image.png)

Step2 - Suppose we get a predictor g1(z1) after the learning process. We use this predictor to estimate the probability of being class1 (△).
![image.png](attachment:image.png)

Step3 - Repeat step1 and step2, but set a different positive class. 
Finally, we can get g2(z2) and g3(z3), as shown in the following picture. We use predictor g2 and g3 to estimate the probability of being class2 (◯) and class3 (☆) respectively.
![image.png](attachment:image.png)

Step4 - Predicting the class of new data by using three predictors, g1, g2 and g3.
![image.png](attachment:image.png)

![image.png](attachment:image.png)
> As you can see, the probabilities we get for the new data are g1<0.5, g2>0.5 and g3<0.5. The maximum is g2, hence we conclude that the predicted label of this new data is class2 (◯).
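The four steps can be condensed into a small sketch: one binary predictor per class, then pick the class whose predictor gives the highest probability. The parameter values below are hypothetical stand-ins for the learned g1, g2 and g3, chosen only so that the outcome mirrors the example above.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical (bias, w1, w2) for the three one-vs-rest predictors g1, g2, g3
PREDICTORS = {
    "class1": (-1.0, 2.0, -1.0),
    "class2": (0.5, -1.0, 1.0),
    "class3": (-2.0, -0.5, -0.5),
}

def predict_one_vs_rest(x1, x2):
    """Return the class with the highest estimated probability, plus all scores."""
    scores = {
        label: sigmoid(b + w1 * x1 + w2 * x2)
        for label, (b, w1, w2) in PREDICTORS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

label, scores = predict_one_vs_rest(0.0, 1.0)
print(label, {k: round(v, 3) for k, v in scores.items()})
```

Note that the three probabilities come from independent binary models, so they are not required to sum to 1.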

To sum up, there are two advantages of using regularization.
- The prediction error of the regularized model is lower, that is, it works well on the test data (green points).
- The regularized model is simpler since it effectively has fewer features (parameters).

![image.png](attachment:image.png)

![image.png](attachment:image.png)

- Figure 4-24 shows the same dataset, but this time displaying two features: **petal width and length.**
- The dashed line represents the points where the model estimates a 50% probability: this is the model’s decision boundary.

- Note that it is a linear boundary. Each parallel line represents the points where the model outputs a specific probability, **from 15% (bottom left) to 90% (top right)**.

- All the flowers beyond the top-right line have an over 90% chance of being Iris virginica, according to the model.

- The hyperparameter controlling the regularization strength of a Scikit-Learn LogisticRegression model is not alpha (as in other linear models), but its inverse: C. The higher the value of C, the less the model is regularized.

##### Softmax Regression
The Logistic Regression model can be generalized to **support multiple classes directly,** without having to train and combine multiple binary classifiers (as discussed in Chapter 3). **This is called Softmax Regression, or Multinomial Logistic Regression.**

- When given an instance x, the Softmax Regression model first computes a score $s_k(x)$ for each class k,
- then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.

> You can estimate the probability $\hat{p}_k$ that the instance belongs to class k by running the scores through the softmax function.

> The objective is to have a model that estimates a high probability for the target class (and consequently a low probability for the other classes). Minimizing the cost function shown in Equation 4-22, called the cross entropy, should lead to this objective because it penalizes the model when it estimates a low probability for a target class. Cross entropy is frequently used to measure how well a set of estimated class probabilities matches the target classes.
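A minimal sketch of the softmax function and the cross-entropy for a single instance, using only the standard library. The scores are made-up values standing in for the per-class scores of one instance.

```python
import math

def softmax(scores):
    """Normalized exponential: turns per-class scores into probabilities."""
    shift = max(scores)  # subtracting the max avoids overflow, result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Cost for one instance: -log of the probability given to the true class."""
    return -math.log(probs[target_index])

scores = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(scores)
print([round(p, 3) for p in probs])
print(cross_entropy(probs, 0), cross_entropy(probs, 2))
```

The cost is small when the true class is the one with the highest estimated probability and grows as the probability assigned to the true class shrinks.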

![image.png](attachment:image.png)

![image.png](attachment:dfc545d1-e064-4b96-a927-07f2d59a4cb8.png)

##### 22. What are precision and recall?
Precision is the proportion of true positives out of predicted positives. To put it another way, **it is the accuracy of the positive predictions. It is also known as the ‘positive predictive value’**.
`Precision = TP / (TP + FP)`
Recall is the same as the true positive rate (TPR). 
`Recall = TPR = TP / (TP + FN)`

##### 25. What are sensitivity, specificity and FPR?

`FPR (False Positive Rate)` :  # False Positives / # negatives  

>FPR =  FP / (FP+TN)
It represents the proportion of actual negative instances that are incorrectly predicted as positive by the model.
 
- FPR indicates how often the model incorrectly labels actual negatives as positives.
- FPR matters most in applications where minimizing false positives is crucial (e.g., medical testing).

Specificity is the same as the true negative rate, which equals 1 - false positive rate.
`Specificity = TNR = TN / (TN + FP)`
Sensitivity is the `True positive rate`.
`Sensitivity = TP / (TP + FN)` = `The True Positive Rate (TPR)` = `Recall`
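The definitions above translate directly into code. This sketch computes them from the four confusion-matrix counts; the counts passed in are hypothetical.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Precision, recall (sensitivity/TPR), specificity (TNR) and FPR."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
    }

# Hypothetical counts, for illustration only
m = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Specificity and FPR share the same denominator (all actual negatives), so they always add up to 1.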

> In the [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) we look at:  
**TPR (True Positive Rate) = # True positives / # positives = Recall = TP / (TP+FN)  
FPR (False Positive Rate) = # False Positives / # negatives = FP / (FP+TN)**

> The [ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) summarizes the general performance of a model by plotting the TPR against the FPR across various probability thresholds.

- Recall
> Use recall where missing `positive instances (false negatives)` is more critical than wrongly classifying negatives (`false positives`): `FN is more important`
> `Recall emphasizes the model's ability to capture all positive instances.`

- Precision

> Use precision if reducing false positives without compromising true positives is crucial: `FP is more important`

- FPR
> It represents the proportion of actual negative instances that are incorrectly predicted as positive by the model.

- ROC Limit

- On imbalanced datasets the ROC curve can look overly optimistic. This optimism bias arises because the ROC curve’s false positive rate (FPR) can become very small when the number of actual negatives is large.
- As a result, `even a large number of false positives would only lead to a small FPR`, leading to a potentially `high AUC that doesn’t reflect the practical reality of using the model.`

> the PR curve offers a more transparent view of a model’s performance on imbalanced datasets.

> Because precision and recall do not use the true negatives, a large number of negative instances won’t skew our understanding of how well our model performs on the positive class.
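A small numeric illustration of this limit, with made-up counts for a heavily imbalanced dataset (100 positives against 100,000 negatives):

```python
# Hypothetical confusion-matrix counts for an imbalanced problem
tp, fn = 80, 20          # 100 actual positives
fp, tn = 1_000, 99_000   # 100,000 actual negatives

fpr = fp / (fp + tn)           # what the ROC curve sees on the x-axis
recall = tp / (tp + fn)        # TPR, shared by the ROC and PR curves
precision = tp / (tp + fp)     # what the PR curve sees

print(f"FPR={fpr:.3f}  recall={recall:.2f}  precision={precision:.3f}")
```

Here 1,000 false positives produce an FPR of only 1%, which looks excellent on a ROC curve, yet fewer than 1 in 10 positive predictions is actually correct.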


- **Use Precision and Recall for a Small Positive Class**
- **Use the ROC Curve when Detecting Both Classes is Equally Important**
- **Use the F1 Score for Balanced Importance of Precision and Recall**

-   **Micro-Average:** `Calculate metrics globally by considering all predictions across classes`. It's suitable when class imbalances are present.
-   **Macro-Average:** `Calculate metrics independently for each class` and then average them. It treats all classes equally and is useful when all classes are of equal importance.
-   **Weighted Average:** Calculate metrics for each class, `but weighted by the number of instances in each class`. It's beneficial when there's class imbalance.
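The three averaging schemes can be compared on a toy example. The sketch below averages per-class precision; the class names and counts are hypothetical.

```python
# Hypothetical per-class counts: true positives, false positives,
# and support (number of actual instances of the class)
counts = {
    "a": {"tp": 90, "fp": 5, "support": 100},
    "b": {"tp": 10, "fp": 15, "support": 20},
    "c": {"tp": 2, "fp": 8, "support": 10},
}

# Per-class precision
precision = {c: v["tp"] / (v["tp"] + v["fp"]) for c, v in counts.items()}

# Macro: plain mean of the per-class values, every class counts equally
macro = sum(precision.values()) / len(precision)

# Micro: pool the counts over all classes first, then compute the metric once
micro = sum(v["tp"] for v in counts.values()) / sum(
    v["tp"] + v["fp"] for v in counts.values()
)

# Weighted: mean of the per-class values, weighted by class support
total_support = sum(v["support"] for v in counts.values())
weighted = sum(precision[c] * counts[c]["support"] for c in counts) / total_support

print(f"macro={macro:.3f}  micro={micro:.3f}  weighted={weighted:.3f}")
```

The macro average is pulled down by the two small, poorly predicted classes, while the micro and weighted averages are dominated by the large class.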


##### 1. What is a logistic function? What is the range of values of a logistic function?
`f(z) = 1 / (1 + e^(-z))`
- The values of a logistic function range from 0 to 1, while z can vary from -infinity to +infinity.

#### 5. What are odds?
> It is the ratio of the probability of an event occurring to the probability of the event not occurring. For example, let’s assume that the probability of winning a lottery is 0.01. Then, the probability of not winning is 1 - 0.01 = 0.99.
- The odds of winning the lottery = (Probability of winning)/(probability of not winning)
- The odds of winning the lottery = 0.01/0.99
- The odds of winning the lottery are 1 to 99, and the odds of not winning the lottery are 99 to 1.
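The lottery example can be reproduced in a few lines. The helper name `odds` is chosen here for illustration.

```python
import math

def odds(p):
    """Odds of an event with probability p (0 <= p < 1)."""
    return p / (1.0 - p)

p_win = 0.01
odds_win = odds(p_win)          # 0.01 / 0.99, i.e. 1 to 99
odds_lose = odds(1.0 - p_win)   # 0.99 / 0.01, i.e. 99 to 1
log_odds_win = math.log(odds_win)  # the logit used by logistic regression

print(odds_win, odds_lose, log_odds_win)
```

An event with probability 0.5 has odds of 1 and log-odds of 0, which matches the 0.5 decision threshold of logistic regression.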

#####  Why can’t linear regression be used in place of logistic regression for binary classification?
The reasons why linear regression cannot be used for binary classification are as follows:
1. **Distribution of error terms:** The distribution of the data differs between linear and logistic regression. Linear regression assumes that the error terms are normally distributed. In binary classification, this assumption does not hold.

2. **Model output:** In linear regression, the output is continuous, which does not make sense for binary classification: linear regression may predict values below 0 or above 1. If we want the output in the form of probabilities that can be mapped to two classes, its range should be restricted to between 0 and 1. Because the logistic regression model outputs probabilities through the logistic/sigmoid function, it is preferred over linear regression.

3. **Variance of residual errors:** Linear regression assumes that the variance of the random errors is constant. This assumption is also violated in the binary classification setting.

##### 12. What is the likelihood function?
- The likelihood function is the joint probability of observing the data, viewed as a function of the unknown model parameters.
- It tells us how probable the observed results are for a given choice of parameter values.
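For binary labels, and assuming the observations are independent, the likelihood is a product of the probabilities the model assigns to the labels that were actually observed. A minimal sketch with made-up predicted probabilities:

```python
def likelihood(probs, labels):
    """Joint probability of the observed labels, assuming independent outcomes.

    probs[i] is the model's estimated P(label = 1) for instance i.
    """
    result = 1.0
    for p, y in zip(probs, labels):
        result *= p if y == 1 else (1.0 - p)
    return result

labels = [1, 0, 1]
good_fit = likelihood([0.9, 0.2, 0.8], labels)  # 0.9 * 0.8 * 0.8
poor_fit = likelihood([0.4, 0.6, 0.5], labels)  # 0.4 * 0.4 * 0.5
print(good_fit, poor_fit)
```

Parameters whose predictions agree with the observed labels give a higher likelihood; maximizing it is equivalent to minimizing the log loss.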

##### 17. Why can’t we use Mean Square Error (MSE) as a cost function for logistic regression?
- In logistic regression, we use the sigmoid function and perform a non-linear transformation to obtain the probabilities.
- **Squaring this non-linear transformation leads to a non-convex cost function with local minima.**
- With gradient descent, finding the global minimum is then no longer guaranteed.
- Due to this reason, MSE is not suitable for logistic regression. **Cross-entropy or log loss** is used as a cost function for logistic regression. 
- In the cost function for logistic regression, the confident wrong predictions are penalised heavily.
- The confident right predictions are rewarded less. By optimising this cost function, convergence is achieved.
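To see how differently the two costs treat a confident wrong prediction, the sketch below compares squared error and log loss for a single instance whose true label is 1. The probabilities are arbitrary illustrative values.

```python
import math

def squared_error(p, y):
    return (p - y) ** 2

def log_loss(p, y):
    """Cross-entropy for one instance with predicted probability p of label 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

y_true = 1
for p in (0.9, 0.5, 0.1, 0.01):
    print(f"p={p:<5} squared_error={squared_error(p, y_true):.3f} "
          f"log_loss={log_loss(p, y_true):.3f}")
```

The squared error can never exceed 1, whereas the log loss keeps growing without bound as the predicted probability of the true class approaches 0.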

##### 18. Why is accuracy not a good measure for classification problems?
- Accuracy is not a good measure for classification problems because it gives equal importance to both **false positives** and **false negatives.**
- However, this may not be the case in most business problems. For example, in cancer prediction, declaring a malignant tumor benign is more serious than wrongly informing a patient that they have cancer. 
- Accuracy gives equal importance to both cases and cannot differentiate between them.
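A quick sketch of the point above: two hypothetical models with identical accuracy but very different kinds of mistakes. The counts are made up for illustration.

```python
def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fn):
    return tp / (tp + fn)

# Model A only makes false negatives, model B only makes false positives
model_a = {"tp": 40, "fp": 0, "tn": 50, "fn": 10}
model_b = {"tp": 50, "fp": 10, "tn": 40, "fn": 0}

acc_a, acc_b = accuracy(**model_a), accuracy(**model_b)
rec_a = recall(model_a["tp"], model_a["fn"])
rec_b = recall(model_b["tp"], model_b["fn"])
print(acc_a, acc_b, rec_a, rec_b)
```

Both models score 90% accuracy, yet model A misses 10 actual positives while model B misses none, a difference that accuracy alone cannot show.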

###### 0. What are false positives and false negatives?
- False positives are those cases in which the negatives are wrongly predicted as positives. For example, predicting that a customer will churn when, in fact, he is not churning.
- False negatives are those cases in which the positives are wrongly predicted as negatives. For example, predicting that a customer will not churn when, in fact, he churns.



##### 23. What is F-measure?
- It is the harmonic mean of precision and recall. In some cases, there will be a trade-off between the precision and the recall. In such cases, the **F-measure will drop.**
- It will be high when both the precision and the recall are high. Depending on the business case at hand and the goal of data analytics, an appropriate metric should be selected.
- **F-measure** = 2 × (Precision × Recall) / (Precision + Recall)
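The formula above in code, with a guard for the case where both precision and recall are zero (the guard is a convention assumed here, not something stated in the notes):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f_measure(0.8, 0.8)
lopsided = f_measure(0.9, 0.1)
print(balanced, lopsided)
```

Because it is a harmonic mean, one low value drags the F-measure down far more than it would an arithmetic mean, whose value for 0.9 and 0.1 would be 0.5.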

##### 24. What is accuracy?
It is the number of correct predictions out of all predictions made.
   Accuracy = (TP+TN)/(The total number of Predictions)
   


##### 28. What is a cumulative response curve (CRV)?
- In order to convey the results of an analysis to the management, a ‘cumulative response curve’ is used, which is more intuitive than the ROC curve.
- A CRV consists of the true positive rate or the percentage of positives correctly classified on the Y-axis and the percentage of the population targeted on the X-axis.
- As with the ROC curve, there will be a diagonal line which represents random performance. Let’s

##### 29. What are the lift curves?
- The lift is the improvement in model performance (increase in true positive rate) when compared to random performance. 
- Random performance means that if 50% of the instances are targeted, then it is expected that 50% of the positives will be detected. 
- Lift is measured in comparison to this random performance.
- **If a model’s performance is better than random performance, then its lift will be greater than 1.**

In a lift curve, lift is plotted on the Y-axis and the percentage of the population (sorted in descending order) on the X-axis. **At a given percentage of the target population, a model with a high lift is preferred.**
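A minimal sketch of how one point on a lift curve could be computed: sort the instances by score, target the top fraction, and compare the share of positives captured with the share of the population targeted. The scores and labels are hypothetical, and `lift_at` is a name introduced here for illustration.

```python
def lift_at(scores, labels, fraction):
    """Lift when targeting the top `fraction` of instances ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_target = max(1, round(len(ranked) * fraction))
    found = sum(label for _, label in ranked[:n_target])
    captured = found / sum(labels)       # share of all positives detected
    targeted = n_target / len(ranked)    # share of the population targeted
    return captured / targeted

# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

for fraction in (0.2, 0.5, 1.0):
    print(fraction, lift_at(scores, labels, fraction))
```

Targeting the whole population always gives a lift of 1, since every positive is captured by construction.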

##### 31. How will you deal with the multiclass classification problem using logistic regression?

- The most famous method of dealing with multiclass classification using logistic regression is using the one-vs-all approach. 
- Under this approach, a number of models are trained, which is equal to the number of classes. The models work in a specific way. For example, the first model classifies the datapoint depending on whether it belongs to class 1 or some other class; the second model classifies the datapoint into class 2 or some other class. 
- This way, each data point can be checked over all the classes.

###### 32. Explain the use of ROC curves and the AUC of an ROC Curve.
- An ROC (Receiver Operating Characteristic) curve illustrates the performance of a binary classification model.
- It is basically a **TPR versus FPR** (true positive rate versus false-positive rate) curve for all the threshold values ranging from 0 to 1. 
- **A diagonal line from the bottom-left to the top-right on the ROC graph represents random guessing**
- The Area Under the Curve (AUC) signifies how good the classifier model is. 
**If the value for AUC is high (near 1), then the model is working satisfactorily, whereas if the value is low (around 0.5), then the model is not working properly and just guessing randomly.**
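The sketch below computes ROC points for a few thresholds and estimates the AUC through its ranking interpretation: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The scores and labels are hypothetical, and the function names are chosen here for illustration.

```python
def roc_points(scores, labels, thresholds):
    """(FPR, TPR) pairs, predicting positive when score >= threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for thr in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_by_ranking(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predicted probabilities and true labels
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

print(roc_points(scores, labels, [0.0, 0.5, 0.75, 1.0]))
print(auc_by_ranking(scores, labels))
```

A threshold of 0 labels everything positive, giving the top-right corner (1, 1), while a threshold above every score gives the bottom-left corner (0, 0).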

##### 33. How can you use the concept of ROC in a multiclass classification?
- The concept of ROC curves can easily be used for multiclass classification by using the one-vs-all approach.
- For example, let’s say that we have three classes ‘a’, ’b’, and ‘c’. Then, the first class comprises class ‘a’ (true class) and the second class comprises both class ‘b’ and class ‘c’ together (false class).
- Thus, the ROC curve is plotted. Similarly, for all the three classes, we will plot three ROC curves and perform our analysis of AUC.









- A low learning rate means the cost function decreases slowly; a larger learning rate makes it decrease faster, but if it is too large the updates can overshoot the minimum.
- The loss decreases as the log probability assigned to the correct class increases.
- If you decrease the number of training iterations, training will surely take less time but will not reach the same accuracy; to get a similar (though not identical) accuracy you need to increase the learning rate.
- More regularization means a larger penalty, which means a less complex decision boundary.
- Adding more features to the model will increase the training accuracy because the model has more information to fit. Testing accuracy increases only if the added feature is found to be significant.
- The model will become very simple, so its bias will be very high.



> Low bias machine learning algorithms — 
    - Decision Trees, k-NN and SVM 


> High bias machine learning algorithms — 
    - Linear Regression, Logistic Regression

In [7]:
arr = [[1, 2, 3, 4],
       [4, 5, 6, 7],
       [8, 9, 10, 11],
       [12, 13, 14, 15]]
for i in range(0, 4):
    print(arr[i].pop(0))

1
4
8
12


In [8]:
fruit_list1 = ['Apple', 'Berry', 'Cherry', 'Papaya']
fruit_list2 = fruit_list1
fruit_list3 = fruit_list1[:]

fruit_list2[0] = 'Guava'
fruit_list3[1] = 'Kiwi'

In [9]:
fruit_list1,fruit_list2,fruit_list3

(['Guava', 'Berry', 'Cherry', 'Papaya'],
 ['Guava', 'Berry', 'Cherry', 'Papaya'],
 ['Apple', 'Kiwi', 'Cherry', 'Papaya'])

In [10]:
('Python') * 3

'PythonPythonPython'

In [11]:
['Python'] * 3

['Python', 'Python', 'Python']

In [12]:
((1, 2),) * 7

((1, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2))

In [15]:
init_tuple = ((1, 2),) * 7

In [20]:
print((init_tuple[-3:-18]))

()


In [22]:
a = {(1,2):1,(2,3):2}
print(a[(1,2)])

1


In [23]:
arr = {}
arr[1] = 1
arr['1'] = 2
arr[1] += 1
arr

{1: 2, '1': 2}

In [25]:

# Use `total` rather than `sum` so the built-in sum() is not shadowed
total = 0
for k in arr:
    total += arr[k]

print(total)

4


In [26]:
my_dict = {}
my_dict[1] = 1
my_dict['1'] = 2
my_dict[1.0] = 4
my_dict

{1: 4, '1': 2}

In [29]:
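# Named `scores` rather than `dict` so the built-in dict type is not shadowed
scores = {'c': 97, 'a': 96, 'b': 98}

# Iterating over sorted(scores) visits the keys in alphabetical order
for key in sorted(scores):
    print(key, scores[key])

a 96
b 98
c 97


In [30]:
sorted(scores)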

['a', 'b', 'c']

In [32]:
rec = {"Name" : "Python", "Age":"20", "Addr" : "NJ", "Country" : "USA"}
id1 = id(rec)
del rec
rec = {"Name" : "Python", "Age":"20", "Addr" : "NJ", "Country" : "USA"}
id2 = id(rec)
print(id1 == id2)

False


In [62]:
from functools import reduce  # in Python 3, reduce lives in functools

# initializing dictionary 
test_dict = {'Gfg' : {'is' : 'best'}} 
  
# printing original dictionary 
print("The original dictionary is : " + str(test_dict)) 

# using reduce() + lambda 
# Safe access nested dictionary key: walk down one key at a time and
# return None as soon as an intermediate value is missing
keys = ['Gfg', 'is'] 
res = reduce(lambda val, key: val.get(key) if val else None, keys, test_dict)

The original dictionary is : {'Gfg': {'is': 'best'}}


In [63]:
res

'best'

In [56]:
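# reduce calls the lambda as (accumulator, item), so with the parameters in this
# order k is the dictionary and v is the key string 'Gfg'; v.get(k) then fails
res = reduce(lambda k,v: v.get(k) if v else None, keys, test_dict)
res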

AttributeError: 'str' object has no attribute 'get'

In [68]:
# initializing dictionary  
test_dict = {'Gfg1' : {'CS':1, 'GATE' : 2},  
             'Gfg2' : {'CS':2, 'GATE' : 3}, 
             'Gfg3' : {'CS':4, 'GATE' : 5},
            'Gfg4' : {'CS':4, 'GATE' : 5}}  
    
# printing original dictionary  
print("The original dictionary is : " + str(test_dict)) 

The original dictionary is : {'Gfg1': {'CS': 1, 'GATE': 2}, 'Gfg2': {'CS': 2, 'GATE': 3}, 'Gfg3': {'CS': 4, 'GATE': 5}, 'Gfg4': {'CS': 4, 'GATE': 5}}


In [69]:
temp = set().union(*test_dict.values()) 
temp

{'CS', 'GATE'}

In [70]:
res = [list(test_dict.keys())] 
res

[['Gfg1', 'Gfg2', 'Gfg3', 'Gfg4']]

In [71]:
test_dict.values()

dict_values([{'CS': 1, 'GATE': 2}, {'CS': 2, 'GATE': 3}, {'CS': 4, 'GATE': 5}, {'CS': 4, 'GATE': 5}])

In [72]:
res += [[key] + [sub.get(key, 0) for sub in test_dict.values()] for key in temp] 


In [73]:
res

[['Gfg1', 'Gfg2', 'Gfg3', 'Gfg4'], ['GATE', 2, 3, 5, 5], ['CS', 1, 2, 4, 4]]