<a href="https://colab.research.google.com/github/wamaw123/Biomedical_Data_analysis/blob/main/Month_1/Week_2_Basic_Statistical_Modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 2: Basic Statistical Modeling

In this notebook, we'll dive deeper into basic statistical modeling to understand the associations between variables in biomedical datasets. With a primary focus on the 'diagnosis' variable, we will explore statistical models suitable for binary outcomes.
As in the previous week, we will proceed through the following steps:
1. **Data Importing**: We'll import the dataset from a GitHub repository.
2. **Descriptive Statistics**: This will give us a reminder of the dataset's structure and characteristics.
3. **Data Cleaning**: Since we will be working with the already clean dataset, we will just make sure it is indeed clean and has no issues.
4. **Data Exploration and Visualization**: Exploring and visualizing the data will provide insights into its distribution and potential patterns.
5. **Determination of the scientific question**: We will set our scientific question as a preliminary step to formulating the hypothesis.
6. **Hypothesis postulation**: From the scientific question, we will posit a working hypothesis that we will strive to falsify using classical statistical models.
7. **Statistical test selection**: We will select a statistical test based on the hypothesis we want to test and verify that the assumptions of this test are met.
8. **Testing and analysis**: We will run the selected test and analyse the output.

We will first set up the environment by installing and importing the necessary libraries.

In [None]:
# Install necessary libraries (scikit-learn provides the sklearn import below)
!pip install pandas numpy scipy statsmodels patsy dtale scikit-learn matplotlib seaborn

# Import necessary libraries
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split
import dtale                      # Interactive tool for data frame exploration.
import dtale.app as dtale_app

## Visualization
import matplotlib.pyplot as plt  # Fundamental plotting library.
import seaborn as sns            # Builds on top of matplotlib for more advanced visualizations.

Next, we load the week 2 dataset directly from GitHub into a pandas DataFrame.


In [None]:
# Fetch the dataset from GitHub
url = "https://raw.githubusercontent.com/wamaw123/Biomedical_Data_analysis/26a597febf37711e75146e8781f4300b9651063a/Datasets/week_2/week_2.csv"
data = pd.read_csv(url)


## About Dataset

This dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe characteristics of the cell nuclei present in the image.

The linear program used to obtain a separating plane in the 3-dimensional feature space is described in the following reference: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets," Optimization Methods and Software, 1, 1992, 23-34].

You can access this dataset from the following sources:

- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)
- [Kaggle Dataset](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data?resource=download)
- UW CS FTP Server: `ftp.cs.wisc.edu`, Path: `cd math-prog/cpo-dataset/machine-learn/WDBC/`

### Attribute Information

We will use the same dataset as in the previous week. The dataset contains the following attributes:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features computed for each cell nucleus:

   a) Radius (mean of distances from center to points on the perimeter)
   b) Texture (standard deviation of gray-scale values)
   c) Perimeter
   d) Area
   e) Smoothness (local variation in radius lengths)
   f) Compactness (perimeter^2 / area - 1.0)
   g) Concavity (severity of concave portions of the contour)
   h) Concave points (number of concave portions of the contour)
   i) Symmetry
   j) Fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
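The field layout described above can be sketched in a few lines. This is only an illustration: the column spellings (for example `radius_mean` or `concave points_worst`) assume the naming used in the Kaggle CSV, and the helper names `expected_feature_columns` and `field_number` are hypothetical, not part of any library.

```python
# The ten nucleus features listed above, spelled as in the Kaggle CSV
# (assumption: the loaded file follows that naming convention).
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
STATISTICS = ["mean", "se", "worst"]  # one block of ten columns per statistic


def expected_feature_columns():
    """Return the 30 feature names in file order (fields 3 to 32)."""
    return [f"{feature}_{stat}" for stat in STATISTICS for feature in BASE_FEATURES]


def field_number(column):
    """Map a feature column name to its 1-based field number in the file."""
    # Fields 1 and 2 are the ID and the diagnosis, so features start at field 3.
    return expected_feature_columns().index(column) + 3


columns = expected_feature_columns()
print(len(columns))  # 30
print(field_number("radius_mean"), field_number("radius_se"), field_number("radius_worst"))  # 3 13 23
```

Comparing `expected_feature_columns()` with `list(data.columns[2:])` would be one quick way to confirm that the loaded file has the expected layout.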

All feature values are recoded with four significant digits.

Missing attribute values: none

### Class Distribution

The class distribution in this dataset is as follows:
- 357 benign
- 212 malignant
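
To make the imbalance between the two classes concrete, the counts above can be turned into proportions. This is a minimal sketch that uses only the two counts quoted in this section; on the loaded DataFrame, `data['diagnosis'].value_counts(normalize=True)` should give the same picture.

```python
# Class counts as listed above (B = benign, M = malignant).
counts = {"B": 357, "M": 212}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

for label, share in proportions.items():
    print(f"{label}: {counts[label]} of {total} samples ({share:.1%})")

# Ratio of benign to malignant samples.
ratio = counts["B"] / counts["M"]
print(f"Benign-to-malignant ratio: {ratio:.2f}")
```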

Before any statistical modeling, a thorough exploratory data analysis (EDA) is crucial. This process will help us better understand the data's structure, identify any anomalies, and decide on subsequent steps. Since we did most of it last week, we will go for a quick version of it using basic exploration then D-Tale to verify everything is set properly.


In [None]:
# Basic data overview
print(data.head())
print(data.describe())
print(data.isnull().sum())


**Analysis of the output**: The dataset contains diagnostic measurements for 569 breast mass samples. Each sample has an ID and a diagnosis (M = malignant, B = benign), along with 30 features describing the cell nuclei in the FNA image. For each of the ten nucleus attributes (radius, texture, perimeter, area, and so on), the dataset reports the mean, the standard error, and the "worst" value (mean of the three largest values).

### **Data Cleaning**

Overall, the dataset seems well-populated without missing data, except for a column named 'Unnamed: 32', which is entirely empty and might be an artifact from data collection or processing. Before diving into further analysis, let's go ahead and drop it.


In [None]:
# Drop the 'Unnamed: 32' column
if 'Unnamed: 32' in data.columns:
    data = data.drop('Unnamed: 32', axis=1)
    print("'Unnamed: 32' column has been successfully dropped.")
else:
    print("'Unnamed: 32' column does not exist or has already been dropped.")
data.head()

Let's now observe and explore the entire dataset using D-Tale

In [None]:
!pip install dtale
import dtale
import dtale.app as dtale_app

# Tell D-Tale it is running in Google Colab so the app is served through Colab's proxy
dtale_app.USE_COLAB = True

# Show your data with D-Tale
d = dtale.show(data)
d

Note: if D-Tale does not work, you can try using ngrok, but ngrok must already be set up with an authentication token; see the details on how to do that [here](https://github.com/man-group/dtale#google-colab:~:text=If%20this%20does%20not%20work%20for%20you%20try%20using%20USE_NGROK%20which%20is%20described%20in%20the%20next%20section.)

If D-Tale still misbehaves, you can try some of the packages described in week 1, use the libraries available here directly, or simply move on: visualizing the data is not indispensable at this step, since the dataset is clean and we have not set the question yet.

## Determination of the Scientific Question
Given the biomedical nature of our dataset, a possible scientific question could be: "Is there a significant association between any of the features and the diagnosis variable?"

Let's select radius_mean as our independent variable for the sake of this exercise.

## Hypothesis Postulation
Based on the above question, our hypothesis can be:

- Null Hypothesis (H0): There is no association between radius_mean and diagnosis.
- Alternative Hypothesis (H1): There is an association between radius_mean and diagnosis.

## Statistical Test Selection
Before diving into the analysis, we must choose the appropriate statistical test and verify that all of its assumptions are met. We can either use an existing selection tool such as [this one](https://inspect-lb.org/statistical-tests/) or experiment with a large language model, as below.

You can get an OpenAI API key [here](https://platform.openai.com/account/api-keys). Make sure to enter it between the quotes (`""`) of the `api_key` field.

In [None]:
# Pin the legacy 0.x client: openai.ChatCompletion (used below) was removed in openai 1.0
!pip install openai==0.28
import openai

# Set your API key from OpenAI (make sure not to share it or expose it publicly)
# It's safer to use secrets or environment variables rather than hardcoding it.
api_key = "" #@param {type:"string"} # Form field for API key input

# Form fields for user to input details about their experiment
experiment_objective = "to know if there is a significant association between the continuous variable \"radius_mean\"  and the binary \"diagnosis\" variable" #@param {type:"string"}
num_of_groups =  1#@param {type:"integer"}
type_of_data = "Binary" #@param ["Continuous", "Categorical", "Ordinal", "Binary"]
is_data_paired = "No" #@param ["Yes", "No"]
additional_details = "Null Hypothesis (H0): There is no association between radius_mean and diagnosis. Alternative Hypothesis (H1):" #@param {type:"string"}

# Setting up the OpenAI API key from user input
openai.api_key = api_key

def chatWithGPT(prompt):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    # Print the model's reply (the function itself returns None)
    print(completion.choices[0].message.content)

# Constructing the prompt for ChatGPT
prompt = (f"I am conducting an experiment where the objective is: {experiment_objective}. "
         f"I have {num_of_groups} group(s) and the outcome / dependent data type is {type_of_data}. "
         f"Is the data paired ? {is_data_paired}. Here are some additional details: {additional_details}. "
         "Which statistical test or method would be best to use?")

# Interacting with GPT-3.5 Turbo using the provided function
chatWithGPT(prompt)

**Using the statistical test selection tool listed above**: I found that the best test would be an independent two-sample t-test comparing radius_mean between the two groups defined by the diagnosis variable. If there is a statistically significant difference, then we can say there is indeed an association.


**According to GPT**: In this case, the appropriate statistical test to determine the association between the continuous variable "radius_mean" and the binary "diagnosis" variable is an independent samples t-test. This test is used to compare the means of two independent groups. Since the data is not paired, an independent samples t-test is more appropriate than a paired samples t-test.

The independent samples t-test will allow you to evaluate if there is a significant difference in the mean "radius_mean" between the two groups defined by the "diagnosis" variable (e.g., malignant and benign). It will provide a p-value that will help determine if you can reject or fail to reject the null hypothesis.
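
As a rough sketch of how such a test could be run with SciPy, the block below checks the usual assumptions and then calls `scipy.stats.ttest_ind`. The two arrays are synthetic stand-ins generated with arbitrary parameters, not the real measurements; in the notebook they would be the `radius_mean` values of each diagnosis group. The 0.05 threshold and the fallback to Welch's t-test when the variances differ are choices made for this illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data with arbitrary parameters. In the notebook these
# would be data.loc[data["diagnosis"] == "M", "radius_mean"] and the "B" equivalent.
rng = np.random.default_rng(seed=0)
malignant = rng.normal(loc=17.0, scale=3.0, size=212)
benign = rng.normal(loc=12.0, scale=2.0, size=357)

# Assumption checks: approximate normality within each group (Shapiro-Wilk)
# and equal variances across the two groups (Levene).
_, p_norm_malignant = stats.shapiro(malignant)
_, p_norm_benign = stats.shapiro(benign)
_, p_levene = stats.levene(malignant, benign)

# If Levene's test rejects equal variances, use Welch's t-test instead of
# the pooled-variance (Student) t-test.
equal_var = bool(p_levene >= 0.05)
t_stat, p_value = stats.ttest_ind(malignant, benign, equal_var=equal_var)

print(f"Shapiro-Wilk p-values: malignant={p_norm_malignant:.3f}, benign={p_norm_benign:.3f}")
print(f"Levene p-value: {p_levene:.3g} (equal variances assumed: {equal_var})")
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: mean radius_mean differs between the two groups.")
else:
    print("Fail to reject H0.")
```

With samples of this size, the t-test is fairly robust to moderate departures from normality, so the Shapiro-Wilk results are better read as a diagnostic than as a strict gate.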

However, it's worth noting that correlation analysis (e.g., Pearson correlation) can also provide information about the association between two variables, but in this case, a t-test is more suited to compare the means of two groups.
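
The link between the two approaches can be illustrated directly: a Pearson correlation between a 0/1 group label and a continuous variable (the point-biserial correlation) yields the same p-value as the pooled-variance t-test. The sketch below uses synthetic data with arbitrary parameters, and the 1/0 coding of the groups is an assumption made for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 1 = malignant, 0 = benign, with arbitrary distribution parameters.
rng = np.random.default_rng(seed=1)
labels = np.repeat([1, 0], [212, 357])
values = np.concatenate([
    rng.normal(loc=12.6, scale=2.0, size=212),  # group coded 1
    rng.normal(loc=12.0, scale=2.0, size=357),  # group coded 0
])

# Point-biserial correlation: Pearson correlation with a binary variable.
r, p_corr = stats.pointbiserialr(labels, values)

# Pooled-variance (Student) t-test on the same two groups.
t_stat, p_ttest = stats.ttest_ind(values[labels == 1], values[labels == 0], equal_var=True)

# The two statistics are linked by r = t / sqrt(t^2 + df), with df = n - 2.
df = len(values) - 2
r_from_t = t_stat / np.sqrt(t_stat**2 + df)

print(f"point-biserial r = {r:.4f}, p = {p_corr:.4g}")
print(f"pooled t-test  t = {t_stat:.4f}, p = {p_ttest:.4g}")
print(f"r recovered from t = {r_from_t:.4f}")
```

The equivalence holds only for the pooled-variance test; Welch's t-test, which does not assume equal variances, will generally give a slightly different p-value.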