<a href="https://colab.research.google.com/github/thuc-github/MIS710-T12023/blob/main/Week%204/MIS710_Lab_4_Logistic_Reg.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# **MIS710 Lab 4 - Introduction to Logistic Regression**

**Author: Associate Professor Lemai Nguyen**

Objective:
**Breast Cancer Diagnosis**
Predict the diagnosis (healthy or cancerous) based on a biopsy dataset.

**Context**: The dataset was adapted from a biopsy dataset. It contains five (5) biological variables and the target variable.

**Data**: 
V1, V2, V7-V9: biological variables

Diagnosis: healthy or cancerous

**Source**: adapted from a dataset provided by Dr Mark Griffin, Industry Fellow, University of Queensland; also available at: https://www.kaggle.com/datasets/ukveteran/biopsy-data-on-breast-cancer-patients 

**Loading Libraries and Functions**

Read about Logistic Regression at:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Train Test Split:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split

Classification metrics:
https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics

In [None]:
!pip install pydotplus #interface for graph visualisation
!pip install graphviz #for graph visualisation

In [None]:
# load libraries
import pandas as pd #for data manipulation and analysis
import numpy as np
 
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for model evaluation



# **Loading Data**


1.   Load the dataset
2.   Explore the data



In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
url="https://raw.githubusercontent.com/VanLan0/MIS710/main/biopsy_ln.csv"

In [None]:
# load dataset
records = pd.read_csv(url)

#explore the dataset
print(records)

In [None]:
#What does the following code do?
print(records[50:70])

In [None]:
records.info()

In [None]:
#Write your own code to inspect missing data
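
One possible way to approach this, sketched on a small made-up table (the values and missing entries below are illustrative assumptions, not the biopsy data). `isnull().sum()` counts the missing values in each column.

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries (hypothetical values, not the biopsy dataset)
toy = pd.DataFrame({
    'V1': [5, 3, np.nan, 8],
    'V2': [1, 1, 2, np.nan],
    'diagnosis': ['healthy', None, 'cancerous', 'healthy'],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Number of rows containing at least one missing value
rows_with_missing = int(toy.isnull().any(axis=1).sum())
print('Rows with missing data:', rows_with_missing)
```

The same two calls can be applied to `records` once the dataset is loaded.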


In [None]:
#What does the code below do?
print('Sample size:', records.shape[0])
print('Number of columns:', records.shape[1]) 

records.describe()

In [None]:
#What does the code below do? Why would you do it?

records=records.drop(['ID'], axis=1)
records.info()

**Alternative way to remove ID.** Do NOT run the next cell if you have already run the previous one, otherwise a second column will be dropped.

In [None]:
#Do NOT do if you have done the previous code!
#ALTERNATIVE way to remove ID: 
records = records.iloc[:,1:]
records.head()

# **Visually Exploring Data**
1. Explore histograms of continuous variables
2. Generate bar charts of categorical variables
3. Convert data as needed
4. Explore relationships among the variables using heatmaps
5. Explore logistic regression relationships between variables

In [None]:
#create histograms, one per column
for col in records.columns:
    plt.hist(records[col])
    plt.title(col)
    plt.show()

In [None]:
#create bar charts
sns.countplot(data=records, x='diagnosis')

In [None]:
#Interpret the outcome of the following code
records['diagnosis'].describe()

**Examine other variables**
Run the code and write down your observations

In [None]:
#box plot of each of the five biological variables against the diagnosis
for i in records.columns[0:5]:
    sns.boxplot(data=records, x=i, y='diagnosis')
    plt.show()

In [None]:
#write your own heatmap using cmap='Blues' and annot=True. Hint: use data=records.corr(numeric_only=True)


What can you observe in the heatmap?

# **Define your own function and call it**

In [None]:
#convert categorical data to numerical 
def coding_diagnosis(x):
        if x=='cancerous': return 1
        if x=='healthy': return 0
       
records['Diagnosis'] = records['diagnosis'].apply(coding_diagnosis)

print(records.sample(10))

In [None]:
#Another way to convert categorical variables to numerical using LabelEncoder
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
records['Diagnosis'] = encoder.fit_transform(records['diagnosis'])
print(records.sample(10))

Compare the above two techniques
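
When comparing, note that `LabelEncoder` assigns codes in sorted order of the class labels, so the two techniques need not agree on which class is coded 1. A minimal sketch on made-up labels (the toy series is an assumption for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical), using the same two categories as the lab
labels = pd.Series(['healthy', 'cancerous', 'healthy', 'cancerous', 'healthy'])

# Technique 1: explicit mapping, cancerous -> 1 and healthy -> 0
manual = labels.map({'cancerous': 1, 'healthy': 0})

# Technique 2: LabelEncoder sorts the classes alphabetically before coding them
encoder = LabelEncoder()
encoded = pd.Series(encoder.fit_transform(labels))

print('Classes in coding order:', list(encoder.classes_))
print(pd.DataFrame({'label': labels, 'manual': manual, 'encoder': encoded}))
```

In this sketch the encoder codes 'cancerous' as 0 and 'healthy' as 1, the reverse of the manual mapping. Since precision and recall treat class 1 as the positive class by default, the choice of coding affects how those metrics should be read.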

**Plotting biomarkers and diagnosis using a logistic relationship**

In [None]:
sns.regplot(x=records['V7'], y=records['Diagnosis'], logistic=True, ci=None)

**Feature Selection**

Select predictors (attributes) for Classification
Set role (Target)

In [None]:
#Selecting predictors
features = ['V1', 'V2', 'V7', 'V8', 'V9'] #alternatively, select a range of columns: features = records.columns[0:5]

#Define the input data and the target variable
X = records[features] # Input data
y = records['Diagnosis'] # Target variable


# **Split the Dataset**

Split arrays or matrices into random train and test subsets
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train_test_split#sklearn.model_selection.train_test_split
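
As a rough illustration of what the split returns, the sketch below uses a made-up table of 10 rows (the data and variable names are assumptions). `stratify` is an optional argument, not used in this lab, that keeps the class proportions similar in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, one predictor, balanced binary target
X_toy = pd.DataFrame({'V1': range(10)})
y_toy = pd.Series([0, 1] * 5)

# 80% training and 20% testing; random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy)

# Inspect the sizes of the resulting subsets and the class balance in each
print(X_tr.shape, X_te.shape)
print(y_tr.value_counts())
print(y_te.value_counts())
```

Checking `shape` and `value_counts()` in this way is one simple method for inspecting the split datasets.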

In [None]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)  # 80% training and 20% testing 

#inspect the split datasets



# **Training a Logistic Regression Model**

1.   Train a model using the training dataset
2.   Make prediction using the model for the test dataset

Read about Logistic Regression Classifier at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html




In [None]:
#Create an initial Logistic Regression classifier object
logreg = LogisticRegression(max_iter=100)

#Train the Logistic Regression classifier with the training dataset
logreg = logreg.fit(X_train, y_train)

#Make predictions for the test dataset
y_pred = logreg.predict(X_test)


**Inspect Predictions**

In [None]:
#join unseen y_test with predicted value into a data frame
inspection=pd.DataFrame({'Actual':y_test, 'Predicted':y_pred})

#join X_test with the new dataframe
inspection=pd.concat([X_test,inspection], axis=1)

inspection.head(20)

# **Model Evaluation**



1.   Calculate Accuracy, Precision, Recall, F1


Classification metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics







In [None]:
#import evaluation functions
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import PrecisionRecallDisplay #replaces plot_precision_recall_curve, removed in scikit-learn 1.2
from sklearn.metrics import ConfusionMatrixDisplay #replaces plot_confusion_matrix, removed in scikit-learn 1.2

#Model evaluation: calculate Accuracy, Precision, Recall and F1
print("Accuracy: ", metrics.accuracy_score(y_test,y_pred))
print("Precision: ", metrics.precision_score(y_test,y_pred))
print("Recall: ", metrics.recall_score(y_test,y_pred))
print("F1: ", metrics.f1_score(y_test,y_pred))

Interpret the above
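
To help with interpretation, each metric can be recomputed by hand from the confusion-matrix counts. The labels below are made up for illustration, and class 1 is treated as the positive class, which is the scikit-learn default.

```python
import numpy as np
from sklearn import metrics

# Toy labels (hypothetical): 1 = positive class, 0 = negative class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, confusion_matrix(...).ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are truly positive
recall = tp / (tp + fn)                      # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print('Accuracy:', accuracy, 'Precision:', precision, 'Recall:', recall, 'F1:', f1)
```

In a diagnostic setting, a false negative (a missed cancerous case) is usually the costlier error, so recall for the cancerous class deserves particular attention.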

In [None]:
#print confusion matrix and evaluation report
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

**Plot ROC (Receiver operating characteristic) curve and confusion matrix**

ROC curve
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html

Confusion matrix
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html

In [None]:
#import classes to display the ROC curve and confusion matrix; read the examples on the website
from sklearn.metrics import RocCurveDisplay
from sklearn.metrics import ConfusionMatrixDisplay

#display the ROC curve (from the fitted model) and the confusion matrix (from the predictions)
RocCurveDisplay.from_estimator(logreg, X_test, y_test)
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()

In [None]:
# Define the sigmoid function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Set the plot style before drawing so that it applies to this figure
sns.set_style('darkgrid')
sns.set_context('notebook', font_scale=1.2)

# Generate a sequence of points along the x axis
x_vals = np.linspace(-10, 10, 100)

# Predicted probabilities for the test set, calculated from the model coefficients
# (the same values as logreg.predict_proba(X_test)[:, 1])
coef = logreg.coef_.flatten()
intercept = logreg.intercept_
y_vals = sigmoid(np.dot(X_test, coef) + intercept)

# Plot the standard sigmoid curve and the fitted model using seaborn
sns.lineplot(x=x_vals, y=sigmoid(x_vals), label='Sigmoid Curve')
# Note: 'Model Fit' sets all five predictors to the same value x, a simplification for plotting
sns.lineplot(x=x_vals, y=sigmoid(np.dot(np.column_stack(([x_vals]*5)), coef) + intercept), label='Model Fit')
sns.scatterplot(x=X_test['V7'], y=y_test, color='blue')
plt.xlabel('Biomarker e.g. V7')
plt.ylabel('Diagnosis')
plt.title('Logistic Regression Sigmoid Curve')
plt.legend()
plt.show()

In [None]:
print(coef[0])

In [None]:
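#The weighted sum of the predictors gives the log-odds of Diagnosis=1, not the diagnosis itself
print('log-odds(Diagnosis=1) = ', '%.3f' % intercept[0], '+', '%.3f' %coef[0], '*V1', '+', '%.3f' %coef[1], '*V2', '+', '%.3f' %coef[2], '*V7', '+', '%.3f' %coef[3], '*V8', '+', '%.3f' %coef[4], '*V9')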
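
The linear combination printed above is the log-odds of class 1; passing it through the sigmoid gives the predicted probability. A small sketch on synthetic data (the data and all variable names here are assumptions for illustration) that checks this against `predict_proba`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): two predictors and a noisy binary target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Log-odds: intercept plus the weighted sum of the predictors
log_odds = X_demo @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid turns log-odds into the probability of class 1
prob_manual = 1 / (1 + np.exp(-log_odds))
prob_sklearn = model.predict_proba(X_demo)[:, 1]

print('Manual and scikit-learn probabilities agree:', np.allclose(prob_manual, prob_sklearn))
```

A log-odds of 0 corresponds to a probability of 0.5, which is the default threshold that `predict` uses to assign class 1.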

# **Congratulations!**

Now do it yourself for other datasets:

1.  Haberman's survival dataset: https://www.kaggle.com/datasets/gilsousa/ or an adapted dataset: https://raw.githubusercontent.com/VanLan0/MIS710/main/haberman_ln.csv
2.  Titanic dataset from Lab 1
3.  and/or another dataset of your choice

