In [1]:
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

## Quiz 2 ##

### Q.1. Model Diagnostics ###
Suppose you train a model to predict the temperature for a given day (regression). The model performs somewhat well on the training data, but not well enough. Its performance on the testing data is slightly better but still far from what you want. What is the most likely problem?

* The model does poorly on both the training and the testing data, so it underfits -> high bias (the most likely problem)

* High variance would show up as a training error much lower than the testing error. Here the testing error is not worse than the training error, so overfitting is unlikely -> not high variance
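
The reasoning above can be written down as a rough rule of thumb. This is only a sketch: the function `diagnose_fit`, its thresholds, and the error values are hypothetical and not part of the quiz.

```python
def diagnose_fit(train_error, test_error, target_error, gap_tolerance=0.05):
    """Rough bias/variance diagnosis from two error values (lower is better).

    The thresholds are illustrative assumptions, not fixed rules.
    """
    if train_error > target_error:
        # The model cannot even fit the data it was trained on.
        return "high bias"
    if test_error - train_error > gap_tolerance:
        # Good on the training data, much worse on unseen data.
        return "high variance"
    return "ok"


# Hypothetical errors resembling the question: mediocre on the training set,
# slightly better on the testing set, both far from the target.
print(diagnose_fit(train_error=0.30, test_error=0.28, target_error=0.10))
```

With these made-up numbers the training error alone already misses the target, which is why the check for high bias comes first.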


### Q.2. Regularization ###
You've got a dataset with two features. You train two logistic regression models: model1 (C = 100), and model2 (C = 1). One model returns the following weights: [31.29, 28.14], the other's weights are [1.12, 3.46]. Which model is the SECOND set of weights more likely to correspond to and why?

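The second set of weights is much smaller, which points to stronger regularization.
In scikit-learn, C is the inverse of the regularization strength lambda, so strong regularization means a high lambda and a low C. So...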

* model2 because it has low regularization
* model2 because it has high regularization   -> True
* model1 because it has low regularization
* model1 because it has high regularization
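
The link between C and lambda can be made explicit. As a sketch, assuming the L2-penalized objective that scikit-learn documents for `LogisticRegression`, written here for binary labels $y_i \in \{-1, +1\}$ with weights $w$ and intercept $b$:

$$
\min_{w,\, b} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^\top w + b)}\right)
$$

Dividing by $C$ does not change the minimizer and gives

$$
\min_{w,\, b} \; \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^\top w + b)}\right) + \frac{1}{2C} w^\top w ,
$$

so the penalty weight is proportional to $1/C$. Under this assumption, the smaller C of model2 penalizes large weights more heavily, which matches the smaller weights [1.12, 3.46].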

In [2]:
iris_data, iris_types = load_iris(return_X_y=True)
iris_data_scaled = MinMaxScaler().fit_transform(iris_data)

In [8]:
# small C -> bigger lambda -> stronger regularization -> smaller weights
# -> the model relies less on the data
iris_model1 = LogisticRegression(C = 0.01)
iris_model1.fit(iris_data_scaled, iris_types)
iris_model1.coef_

array([[-1.16296292e-08,  7.72222196e-09, -1.94576265e-08,
        -1.98611104e-08],
       [ 1.28703699e-09, -5.98611091e-09,  4.25423714e-09,
         2.63888880e-09],
       [ 1.03425922e-08, -1.73611105e-09,  1.52033893e-08,
         1.72222216e-08]])

In [11]:
# big C -> smaller lambda -> weaker regularization -> larger weights
# -> the model fits the data more closely
iris_model2 = LogisticRegression(C = 100)
iris_model2.fit(iris_data_scaled, iris_types)
iris_model2.coef_

array([[-18.50439934,  29.40165499, -38.2104263 , -38.4367227 ],
       [ 13.69282683,  -6.68702161,  -8.71329212,  -2.71987842],
       [  4.81157251, -22.71463338,  46.92371842,  41.15660112]])
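
To compare the two fits with a single number, one option is the overall norm of the coefficient matrix for several values of C. This is a minimal sketch: the list of C values and the raised `max_iter` are assumptions made for illustration, and the exact numbers depend on the scikit-learn version and solver.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)

# The C values are arbitrary illustrative choices.
c_values = [0.01, 1, 100]
coef_norms = []
for c in c_values:
    # max_iter is raised so the solver has room to converge for large C.
    model = LogisticRegression(C=c, max_iter=5000).fit(X_scaled, y)
    # Frobenius norm of the (n_classes, n_features) coefficient matrix.
    coef_norms.append(np.linalg.norm(model.coef_))

for c, norm in zip(c_values, coef_norms):
    print(f"C = {c:>6}: ||coef_|| = {norm:.4f}")
```

The norm should grow as C grows, since a larger C means a weaker penalty on the weights.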

### Q.3. Regularization Coefficient  ###
Which of the following are true about the regularization coefficient lambda?

* Using a value that is too large may lead to overfitting  -> FALSE
* Using a value that is too small will make the algorithm converge very slowly
* Using a value that is too small may lead to underfitting -> FALSE
* Using a value that is too large may lead to underfitting -> TRUE
* Using a value that is too small may lead to overfitting -> TRUE
* Using a value that is too large will make the algorithm "miss" the minimum
* The value is not related to underfitting or overfitting; it's there to speed up the algorithm convergence
* Using a value that is too small will make the algorithm "miss" the minimum
* Using a value that is too large will make the algorithm converge very slowly
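
The effect of lambda on underfitting and overfitting can be tried out on a toy problem. The sketch below is an assumption-laden illustration, not part of the quiz: it uses synthetic sine data, degree-9 polynomial features, and closed-form ridge regression (L2-regularized least squares) in place of logistic regression, and the lambda values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    # Noisy samples of a sine wave; a synthetic stand-in for real data.
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=n)
    return x, y


def features(x, degree=9):
    # Polynomial features 1, x, x^2, ..., x^degree.
    return np.vander(x, degree + 1, increasing=True)


def fit_ridge(x, y, lam):
    # Closed-form solution of (X^T X + lam * I) w = X^T y.
    # For simplicity the bias term is penalized too.
    X = features(x)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


def mse(x, y, w):
    return float(np.mean((features(x) @ w - y) ** 2))


x_train, y_train = make_data(30)
x_test, y_test = make_data(200)

train_mse, test_mse = {}, {}
for lam in [1e-6, 1e-2, 1e3]:
    w = fit_ridge(x_train, y_train, lam)
    train_mse[lam] = mse(x_train, y_train, w)
    test_mse[lam] = mse(x_test, y_test, w)
    print(f"lambda = {lam:g}: train MSE = {train_mse[lam]:.4f}, "
          f"test MSE = {test_mse[lam]:.4f}")
```

A very large lambda shrinks the weights toward zero, so both errors stay high (underfitting). A very small lambda leaves the flexible model almost unconstrained, and the thing to watch then is the gap between training and testing error.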

### Q.4. Bias and Variance ###
Which of the following are true?

* Plotting a learning curve is not enough to diagnose high bias or high variance -> True
* If an algorithm suffers from high variance, adding more features will lower the variance significantly -> False (adding features is a remedy for high bias: extra features let the model capture more patterns, but they do not reduce variance)
* If an algorithm suffers from high bias, adding more examples will lower the bias significantly -> False (adding examples is a remedy for high variance, because more data makes the model less sensitive to noise; using fewer features also helps against high variance)
* When an algorithm has much lower training set error than test set error, it suffers from high variance -> True (the algorithm has fit the training set too closely)
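
The gap between training and test error from the last statement is what a learning curve makes visible. The sketch below uses scikit-learn's `learning_curve` on the iris data; the estimator settings, the training sizes, and the 5-fold split are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)

# shuffle=True matters here because the iris samples are ordered by class.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=5000),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    shuffle=True,
    random_state=0,
)

# Average over the cross-validation folds. A large, persistent gap between
# the two curves suggests high variance; two low curves suggest high bias.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for n, tr, va in zip(sizes, train_mean, val_mean):
    print(f"{n:>4} examples: train accuracy = {tr:.3f}, "
          f"validation accuracy = {va:.3f}")
```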

### Q.5. Classification Metrics ###
You want to train a model to recognize spam messages. The data you've got contains 92% non-spam messages (class 0, negative) and 8% spam messages (class 1, positive). Which of the following are true?

For a model that always outputs class 0: accuracy is 92%, precision is 0% (strictly undefined, since there are no positive predictions; scikit-learn reports it as 0), recall is 0%.

For a model that always outputs class 1: accuracy is 8%, precision is 8%, recall is 100%.
     

* For a model that always outputs class 1, the precision is 8%       -> True
* For a model that always outputs class 1, the precision is 100%     -> False
* For a model that always outputs class 0, the accuracy is 92%       -> True
* For a model that always outputs class 0, the recall is 100%        -> False
* For a model that always outputs class 0, the accuracy is 8%        -> False
* For a model that always outputs class 1, the accuracy is 100%      -> False
* For a model that always outputs class 0, the recall is 92%         -> False
* For a model that always outputs class 1, the precision is 0%       -> False
* For a model that always outputs class 0, the recall is 0%          -> True
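
The percentages above follow directly from the confusion-matrix counts. The sketch below works them out for 100 messages (92 non-spam, 8 spam); the helper `constant_model_metrics` is a hypothetical name, and returning 0.0 for an undefined ratio is a convention chosen here.

```python
def constant_model_metrics(predicted_class, n_negative=92, n_positive=8):
    """Accuracy, precision and recall (for class 1) of a constant classifier."""
    if predicted_class == 1:
        # Everything is flagged as spam.
        tp, fp, fn, tn = n_positive, n_negative, 0, 0
    else:
        # Nothing is flagged as spam.
        tp, fp, fn, tn = 0, 0, n_positive, n_negative

    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Precision is undefined (0/0) without positive predictions; 0.0 is used
    # here as a convention.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return accuracy, precision, recall


print(constant_model_metrics(0))  # (0.92, 0.0, 0.0)
print(constant_model_metrics(1))  # (0.08, 0.08, 1.0)
```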

In [5]:
# draw 10000 labels at once: 92% class 0 (non-spam), 8% class 1 (spam)
real_y = np.random.choice([0, 1], size=10000, p=[0.92, 0.08])

In [6]:
# the model always outputs class 0
predicted_y = [0] * 10000
#print(confusion_matrix(real_y, predicted_y))
print(classification_report(real_y, predicted_y))

              precision    recall  f1-score   support

           0       0.92      1.00      0.96      9185
           1       0.00      0.00      0.00       815

    accuracy                           0.92     10000
   macro avg       0.46      0.50      0.48     10000
weighted avg       0.84      0.92      0.88     10000





In [7]:
# the model always outputs class 1
predicted_y = [1] * 10000
#print(confusion_matrix(real_y, predicted_y))
print(classification_report(real_y, predicted_y))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      9185
           1       0.08      1.00      0.15       815

    accuracy                           0.08     10000
   macro avg       0.04      0.50      0.08     10000
weighted avg       0.01      0.08      0.01     10000
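
When a class is never predicted, its precision is undefined and scikit-learn falls back to 0 with a warning. Assuming a scikit-learn version that supports the `zero_division` parameter (0.22 or later), the fallback can be made explicit. The fixed labels below are a deterministic stand-in for the sampled `real_y`.

```python
from sklearn.metrics import classification_report

# Deterministic stand-in for the sampled labels: 92 non-spam, 8 spam.
y_true = [0] * 92 + [1] * 8
y_pred = [0] * 100  # the model always outputs class 0

# zero_division=0 reports the undefined precision as 0 without a warning.
report = classification_report(
    y_true, y_pred, zero_division=0, output_dict=True
)
print(report["1"]["precision"], report["1"]["recall"], report["accuracy"])
```

With `output_dict=True` the report is a nested dictionary keyed by the class labels as strings, which is easier to check programmatically than the printed table.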

