## W3&W4 post studio exercises (errors, model fitting)

Enter your solution in the cell(s) below each exercise. Add a couple of inline comments explaining your code. Don't forget to add comments in a markdown cell after each exercise. Missing comments (in markdown cells and/or inline) and late submissions will incur penalties.

Once done, drag and drop your Python file into your ADS1002-name GitHub account.

Copy the URL of this file on GitHub to the appropriate folder on Moodle by 09.30am prior to your next studio.

Solutions will be released later in the semester.

Max 10 marks - 2.5 marks per exercise.

***
We will use 

* [who-health-data.csv](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods/who-health-data.csv)

* [kaggle-wisconsin-cancer.csv](https://gitlab.erc.monash.edu.au/bads/data-challenges-resources/-/tree/main/Machine-Learning/Supervised-Methods/kaggle-wisconsin-cancer.csv)

throughout the exercises. Download the datasets into the same directory as your post-studio notebook.

In [None]:
# Remember these? Our usual package imports for handling data.
import numpy as np
import pandas as pd
import seaborn as sns

# Specialised functions for calculating prediction error rates.
from sklearn.metrics import precision_score

In [None]:
who_data_2015 = (
    pd.read_csv("who-health-data.csv") # Read in the csv data.
    .rename(columns=lambda c: c.strip())      # Clean up column names.
    .query("Year == 2015")                    # Restrict the dataset to records from 2015.
    # Removes two columns which contain a lot of missing data...
    .drop(columns=["Alcohol", "Total expenditure"])
    # ... then drop any rows with missing values.
    .dropna()
)

wisconsin_cancer_biopsies = (
    pd.read_csv("kaggle-wisconsin-cancer.csv")
    # This tidies up the naming of results (M -> malignant, B -> benign)
    .assign(diagnosis=lambda df: df['diagnosis']  
        .map({"M": "malignant", "B": "benign"})
        .astype('category')
    )
)


### Exercise 1

Given the dataframe `ex1_who_with_predictions` below, compute the Mean Absolute Error for the predicted values of life expectancy. You can repeat the process previously shown, or find a function in `sklearn.metrics` to compute this for you.

In [None]:
ex1_who_with_predictions = (
    who_data_2015[["Schooling", "Life expectancy"]]
    .assign(Predicted=lambda df: df["Schooling"] * 2.3 + 43)
    .dropna()
)
ex1_who_with_predictions.head()

In [None]:
#calculating the Mean Absolute Error for the predicted values of life expectancy

errors = ex1_who_with_predictions['Life expectancy'] - ex1_who_with_predictions['Predicted']
mae_score = errors.abs().mean()
print(f"Mean absolute error is {mae_score:.1f} years")
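
Exercise 1 also allows a function from `sklearn.metrics`. A minimal sketch of that route, assuming `mean_absolute_error` is the function intended; the small table below is made up for illustration (it is not WHO data) and the manual calculation is repeated so the two results can be compared:

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

# Hypothetical toy values standing in for ex1_who_with_predictions.
toy = pd.DataFrame({
    "Life expectancy": [70.0, 65.0, 80.0, 75.0],
    "Predicted": [72.0, 64.0, 77.0, 75.0],
})

# Manual route: mean of the absolute differences.
manual_mae = (toy["Life expectancy"] - toy["Predicted"]).abs().mean()

# Library route: the same quantity computed by sklearn.metrics.
sklearn_mae = mean_absolute_error(y_true=toy["Life expectancy"], y_pred=toy["Predicted"])

print(f"manual = {manual_mae:.2f}, sklearn = {sklearn_mae:.2f}")
```

Swapping `toy` for `ex1_who_with_predictions` should give the same figure as the manual calculation above.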



### Exercise 2

Given the classification predictions and actual results in the dataframe `ex2_biopsies_with_predictions` below, compute accuracy, precision and recall. Also find the number of false negatives.

In [None]:
import pandas as pd


ex2_biopsies_with_predictions = (
    wisconsin_cancer_biopsies
    .assign(prediction=lambda df: df['texture_mean'].lt(20)
        .map({True: "benign", False: "malignant"})
    )
    [['radius_mean', 'texture_mean', 'diagnosis', 'prediction']]
)
ex2_biopsies_with_predictions



# "malignant" is treated as the positive class.
# Reuse the given `prediction` column (texture_mean < 20 -> benign, otherwise malignant)
# instead of re-deriving it, so that a value of exactly 20 is handled consistently.
actual = ex2_biopsies_with_predictions['diagnosis']
predicted = ex2_biopsies_with_predictions['prediction']

# Count True Positives, True Negatives, False Positives, and False Negatives
TP = ((predicted == 'malignant') & (actual == 'malignant')).sum()
TN = ((predicted == 'benign') & (actual == 'benign')).sum()
FP = ((predicted == 'malignant') & (actual == 'benign')).sum()
FN = ((predicted == 'benign') & (actual == 'malignant')).sum()

print(f"True Positives (TP): {TP}")
print(f"True Negatives (TN): {TN}")
print(f"False Positives (FP): {FP}")
print(f"False Negatives (FN): {FN}")




In [None]:
TOTAL = TP + TN + FP + FN

print(f"Accuracy = {(TP + TN) / TOTAL = :.3f}")
print(f"Precision = {TP / (TP + FP) = :.3f}")
print(f"Recall = {TP / (TP + FN) = :.3f}")
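
The same quantities can be cross-checked with `sklearn.metrics`. This is a sketch on made-up labels (not the biopsy data), assuming "malignant" is the positive class as in the TP/FN definitions above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels for illustration only.
actual = ["malignant", "malignant", "malignant", "malignant",
          "benign", "benign", "benign", "benign"]
predicted = ["malignant", "benign", "malignant", "malignant",
             "benign", "malignant", "benign", "malignant"]

# pos_label tells sklearn which string counts as the positive class.
accuracy = accuracy_score(actual, predicted)
precision = precision_score(actual, predicted, pos_label="malignant")
recall = recall_score(actual, predicted, pos_label="malignant")

# With labels=["benign", "malignant"], ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=["benign", "malignant"]).ravel()

print(f"accuracy = {accuracy:.3f}, precision = {precision:.3f}, recall = {recall:.3f}")
print(f"false negatives = {fn}")
```

For the exercise itself, the two lists would be replaced by the `diagnosis` and `prediction` columns.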

### Exercise 3

Consider three different predictors for the cancer biopsy screening dataset:

* Predictor A has an accuracy of 0.95, and recall of 0.99
* Predictor B has an accuracy of 0.99, and recall of 0.95
* Predictor C has an accuracy of 0.5, and a recall of 1.0

The test required to collect data from a new patient (on which the predictor will give a predicted diagnosis) is minimally invasive. If the predictor predicts a positive (malignant) diagnosis, the patient will be referred for further screening which can be expensive.

Considering the context, which predictive model (A, B, or C) would likely be preferred for this task? Write your answer in a markdown cell below, and give a brief explanation of your reasoning.

-------------------------------------------------------------------------
<mark> Predictor B </mark>

In this scenario, where the test to collect data is minimally invasive but further screening for a positive diagnosis is expensive, the choice of the predictive model involves balancing accuracy and recall.

**Predictor B** has the highest accuracy (0.99) and a slightly lower recall (0.95) than A and C. It performs very well overall, balancing the detection of positives against avoiding false positives, which keeps the number of expensive follow-up referrals low.


-------------------------------------------------------------------------
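
To make the trade-off concrete, accuracy and recall can be turned back into counts once a cohort size is assumed. The sketch below uses a hypothetical cohort of 1000 patients with 100 malignant cases; both numbers and the helper name `confusion_counts` are assumptions for illustration, not values from the dataset:

```python
def confusion_counts(accuracy, recall, n_patients, n_malignant):
    """Recover TP, FN, TN, FP from accuracy and recall for a cohort of known size."""
    tp = round(recall * n_malignant)          # malignant cases correctly flagged
    fn = n_malignant - tp                     # malignant cases missed
    tn = round(accuracy * n_patients) - tp    # correct predictions that are not TP
    fp = (n_patients - n_malignant) - tn      # benign cases referred for further screening
    return {"TP": tp, "FN": fn, "TN": tn, "FP": fp}

# Hypothetical cohort: 1000 patients, 100 of them malignant.
for name, acc, rec in [("A", 0.95, 0.99), ("B", 0.99, 0.95), ("C", 0.5, 1.0)]:
    print(name, confusion_counts(acc, rec, n_patients=1000, n_malignant=100))
```

Under this assumed prevalence, B refers 5 benign patients for further screening, compared with 49 for A and 500 for C, while missing 5 malignant cases compared with 1 for A and 0 for C. The counts shift if a different prevalence is assumed.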

### Exercise 4

Choose one different input/feature variable (other than Schooling) and fit a linear regression model to predict Life Expectancy using sklearn. Can you achieve a better error rate than what we found in the pre-studio notebook? (RMSE and MAE for Schooling were 4.71 and 3.69, respectively.) Suggest a method to narrow down your choices of variables to use in order to arrive at a good model.

Hint 1: Correlation.

Hint 2: You can use the functions written in the pre-studio notebook, e.g. prediction_root_mean_squared_error(gradient, intercept), to calculate the model error once you choose your model parameters (features).
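
One way Hint 1 could be put into practice is to rank the candidate features by the absolute value of their correlation with the target. The sketch below runs on synthetic data; the column names are made up, and for the WHO data the target column would be "Life expectancy":

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical synthetic data: one strong, one weak and one unrelated feature.
strong = rng.normal(size=n)
weak = rng.normal(size=n)
unrelated = rng.normal(size=n)
toy = pd.DataFrame({
    "strong_feature": strong,
    "weak_feature": weak,
    "unrelated_feature": unrelated,
    "target": 5.0 * strong + 0.5 * weak + rng.normal(scale=0.5, size=n),
})

# Correlation of every numeric column with the target, strongest first.
ranking = (
    toy.select_dtypes(include="number")
    .corr()["target"]
    .drop("target")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

Features near the top of the ranking are the natural first candidates for a single-variable linear model. Correlation only measures linear association, so a scatter plot of the chosen feature is still worth checking.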

----------------------------------------------------------------------------------------------
<mark>Answer</mark>


We can achieve a better RMSE and MAE by using <mark> "Income composition of resources" </mark> as our input feature <mark>(code has been provided below to show the working)</mark>.

I think the gradient descent method would be a better way to narrow this down, as the data set is really large and no direct mathematical expression is given for computing the model parameters.

In [None]:

# Using "Income composition of resources" instead of Schooling.

sns.relplot(data=who_data_2015, x="Income composition of resources", y="Life expectancy");

who_data_2015
# Aim for a better error rate than the one found in the pre-studio notebook.


# Suggest a method to narrow down your choices of variables to use in order to arrive at a good model.




In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

def prediction_root_mean_squared_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters.
    This time around, let's use sklearn.metrics. """
    predictions = who_data_2015["Income composition of resources"] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    # Taking the square root of MSE gives RMSE. Then we're in the same units as MAE.
    # (np.sqrt is used because the `squared=False` argument is not available in every sklearn version.)
    return np.sqrt(mean_squared_error(y_true=actual, y_pred=predictions))

def prediction_mean_absolute_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters.
    This time around, let's use sklearn.metrics. """
    predictions = who_data_2015["Income composition of resources"] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    return mean_absolute_error(y_true=actual, y_pred=predictions)

# Compute error values for different gradient and intercepts.
# This will be used to build the colour contour plots.
gradient_values, intercept_values = np.meshgrid(
    np.linspace(0.5, 5.0, 30),
    np.linspace(30, 80, 30),
)
rmse_errors = np.zeros(gradient_values.shape)
for i in range(rmse_errors.shape[0]):
    for j in range(rmse_errors.shape[1]):
        rmse_errors[i, j] = prediction_root_mean_squared_error(gradient_values[i, j], intercept_values[i, j])
mae_errors = np.zeros(gradient_values.shape)
for i in range(mae_errors.shape[0]):
    for j in range(mae_errors.shape[1]):
        mae_errors[i, j] = prediction_mean_absolute_error(gradient_values[i, j], intercept_values[i, j])

In [None]:
from scipy.optimize import minimize

# This sets initial guess values (gradient = 3, intercept = 60) for the algorithm to
# use as a starting point. You can change these and re-run the cell to observe the
# different paths taken by the algorithm.
initial_guess = (3, 60)

# We'll record the different model parameters tested in these lists.
gradient_steps = [initial_guess[0]]
intercept_steps = [initial_guess[1]]

def callback(values, *args, **kwargs):
    """ This function is called by `minimize` whenever it takes a step. This allows the
    steps to be recorded """
    gradient_steps.append(values[0])
    intercept_steps.append(values[1])

def prediction_error(coefficients):
    """ This function is called with both coefficients (gradient and intercept) as a tuple.
    It returns the result of the error calculation which scipy.optimize will use """
    gradient, intercept = coefficients
    return prediction_root_mean_squared_error(gradient, intercept)

# Run the optimiser and extract the optimal model parameter values.
# (With no `method` given, scipy's `minimize` defaults to BFGS, a gradient-based method.)
opt_result = minimize(
    prediction_error,      # error evaluation function
    initial_guess,         # an initial guess of the model parameters
    callback=callback,     # a function to record trial points
)

# This gives some status information and the model parameter results.
opt_result


optimal_gradient, optimal_intercept = opt_result.x
print("Model is y = {:.2f}x + {:.2f}".format(optimal_gradient, optimal_intercept))
print("RMSE = {:.2f}".format(prediction_root_mean_squared_error(optimal_gradient, optimal_intercept)))
print("MAE = {:.2f}".format(prediction_mean_absolute_error(optimal_gradient, optimal_intercept)))
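
Exercise 4 asks for a fit using sklearn, so here is a minimal sketch of that route with `LinearRegression`. It runs on toy data lying exactly on a known line so the result is easy to check; the column names `feature` and `target` are placeholders for the WHO columns:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical toy data lying exactly on y = 2x + 1.
toy = pd.DataFrame({"feature": [0.0, 1.0, 2.0, 3.0, 4.0]})
toy["target"] = 2.0 * toy["feature"] + 1.0

# sklearn expects a 2D table of features, hence the double brackets.
model = LinearRegression().fit(toy[["feature"]], toy["target"])
predictions = model.predict(toy[["feature"]])

gradient, intercept = model.coef_[0], model.intercept_
rmse = np.sqrt(mean_squared_error(toy["target"], predictions))
mae = mean_absolute_error(toy["target"], predictions)
print(f"Model is y = {gradient:.2f}x + {intercept:.2f}, RMSE = {rmse:.2f}, MAE = {mae:.2f}")
```

On the WHO data, `toy[["feature"]]` would become `who_data_2015[["Income composition of resources"]]` and `toy["target"]` would become `who_data_2015["Life expectancy"]`. The fitted line should be close to the one found by `minimize`, since both minimise squared error.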


## Extra exercises

The following exercises with (*) will not be assessed. Use these to check your understanding of topics covered in the past 2 weeks.

### Exercise 5*

The function `model_correct_predictions` below returns the number of correct predictions made by a predictive model for the cancer biopsy dataset, for a given parameter value. This parameter value simply controls the threshold value for radius above which a sample is predicted as malignant.

Try different values of the parameter in this model within the range [0, 30]. Record and plot the resulting accuracy values against the parameter value (similar to the regression cost function example above).

What value of the parameter provides the best error rate? Explain how you can be confident you have found the best result here.

In [None]:
def model_correct_predictions(radius_split_parameter):
    """ Return the number of correct predictions made by the model
    for the given parameter value. """
    data = wisconsin_cancer_biopsies.assign(
        predicted=lambda df: df['radius_mean'].lt(radius_split_parameter)
            .map({True: "benign", False: "malignant"})
    )
    return (data['diagnosis'] == data['predicted']).sum()

model_correct_predictions(12)
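
A possible starting point for the sweep is sketched below on a small made-up table (not the Wisconsin data). The helper `toy_accuracy` is hypothetical and mirrors `model_correct_predictions`, except that it returns a fraction rather than a count:

```python
import numpy as np
import pandas as pd

# Hypothetical toy measurements for illustration only.
toy = pd.DataFrame({
    "radius_mean": [8.0, 10.0, 11.0, 12.5, 13.0, 15.0, 17.0, 20.0],
    "diagnosis": ["benign", "benign", "benign", "benign",
                  "malignant", "benign", "malignant", "malignant"],
})

def toy_accuracy(radius_split_parameter):
    """Fraction of toy rows classified correctly at the given radius threshold."""
    predicted = np.where(toy["radius_mean"] < radius_split_parameter, "benign", "malignant")
    return (toy["diagnosis"] == predicted).mean()

# Evaluate every threshold on a regular grid over [0, 30].
thresholds = np.linspace(0, 30, 61)
accuracies = np.array([toy_accuracy(t) for t in thresholds])

# argmax returns the first threshold that reaches the maximum accuracy.
best = thresholds[accuracies.argmax()]
print(f"Best threshold on the grid: {best}, accuracy = {accuracies.max():.3f}")
```

Plotting `accuracies` against `thresholds` gives the requested curve. Accuracy only changes when the threshold crosses an observed radius value, so a grid finer than the gaps between data points cannot miss a better threshold, although several thresholds may tie.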

### Exercise 6*

In the examples in the pre-studio notebook (W4) we used root mean squared error (the standard cost function for linear regression) to fit the model parameters. Try re-running the `scipy.optimize` method using mean absolute error. Are the resulting model parameters the same as above? Give some brief reasoning why there might be a difference here.

In [None]:
# Hint: you only need to make one small change in the prediction_error function to do this.

In [None]:
def prediction_root_mean_squared_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters.
    This time around, let's use sklearn.metrics. """
    predictions = who_data_2015["Schooling"] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    # Taking the square root of MSE gives RMSE. Then we're in the same units as MAE.
    # (np.sqrt is used because the `squared=False` argument is not available in every sklearn version.)
    return np.sqrt(mean_squared_error(y_true=actual, y_pred=predictions))

def prediction_mean_absolute_error(gradient, intercept):
    """ Return the prediction error associated with the value of the parameters.
    This time around, let's use sklearn.metrics. """
    predictions = who_data_2015["Schooling"] * gradient + intercept
    actual = who_data_2015["Life expectancy"]
    return mean_absolute_error(y_true=actual, y_pred=predictions)
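
The comparison can be tried on toy data first. The sketch below is not the notebook's own method: it uses made-up points on a line with one outlier, and it swaps in the derivative-free Nelder-Mead method because the MAE surface is not smooth.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy data: points on y = 2x + 1 with one large outlier.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[-1] += 30.0

def rmse(coefficients):
    """Root mean squared error of the line defined by (gradient, intercept)."""
    gradient, intercept = coefficients
    return np.sqrt(np.mean((y - (gradient * x + intercept)) ** 2))

def mae(coefficients):
    """Mean absolute error of the line defined by (gradient, intercept)."""
    gradient, intercept = coefficients
    return np.mean(np.abs(y - (gradient * x + intercept)))

# Fit with RMSE first, then start the MAE search from the RMSE solution.
rmse_fit = minimize(rmse, (1.0, 0.0), method="Nelder-Mead")
mae_fit = minimize(mae, rmse_fit.x, method="Nelder-Mead")

print("RMSE fit: gradient = {:.2f}, intercept = {:.2f}".format(*rmse_fit.x))
print("MAE fit:  gradient = {:.2f}, intercept = {:.2f}".format(*mae_fit.x))
```

Squaring magnifies large residuals, so an RMSE fit is generally pulled towards outliers more strongly than an MAE fit. The toy outlier is there to make that difference visible.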

### Exercise 7*

We can see above that different methods for determining model parameters arrive at the same result, but what happens if we change the dataset slightly? Experiment by taking several (at least 10) different samples of the data, fitting a linear model for each one, and plotting a histogram of the different gradient and intercept coefficients you find. Is there a significant amount of variation in the parameter values?

In [None]:
sample_data = who_data_2015.sample(30)  # selects a small sample of 30 random rows from the data.
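
One way the repeated sampling could be organised is sketched below. It runs on a synthetic population (not WHO data) and assumes `np.polyfit` is an acceptable stand-in for the least-squares fit:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical synthetic population: y = 2x + 40 plus noise.
population = pd.DataFrame({"x": rng.uniform(0, 20, size=150)})
population["y"] = 2.0 * population["x"] + 40.0 + rng.normal(scale=4.0, size=150)

gradients, intercepts = [], []
for seed in range(10):
    # Draw a different random sample of 30 rows each time.
    sample = population.sample(30, random_state=seed)
    # np.polyfit with deg=1 returns the least-squares (gradient, intercept).
    gradient, intercept = np.polyfit(sample["x"], sample["y"], deg=1)
    gradients.append(gradient)
    intercepts.append(intercept)

print(f"gradient:  mean = {np.mean(gradients):.2f}, std = {np.std(gradients):.2f}")
print(f"intercept: mean = {np.mean(intercepts):.2f}, std = {np.std(intercepts):.2f}")
```

The two lists can then be passed to a histogram function such as `sns.histplot`. For the exercise, `population` would be replaced by `who_data_2015` with the relevant feature and target columns.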