<a href="https://colab.research.google.com/github/wamaw123/Biomedical_Data_analysis/blob/main/Month_1/Week_1_Data_Importing_and_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 1: Data Importing and Cleaning

In this notebook, we'll focus on the foundational steps of any data analysis process:
1. **Data Importing**: We'll import a biomedical dataset from a GitHub repository.
2. **Descriptive Statistics**: This will give us a preliminary understanding of the dataset's structure and characteristics.
3. **Data Cleaning**: We'll handle missing values and outliers to ensure the data's quality.
4. **Data Visualization**: Visualizing the data will provide insights into its distribution and potential patterns.
5. **Normalization and Standardization**: We'll transform the data to prepare it for future analysis.
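
As a preview of step 5, here is a minimal sketch of the two transformations on a toy DataFrame. The column names and values are made up for illustration and do not come from the dataset; min-max scaling and z-score standardization are shown as common choices, not necessarily the exact methods used later.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy = pd.DataFrame({"radius": [10.0, 12.0, 14.0, 20.0],
                    "area": [300.0, 450.0, 600.0, 1250.0]})

# Min-max normalization: rescale each column to the [0, 1] interval
normalized = (toy - toy.min()) / (toy.max() - toy.min())

# Z-score standardization: center each column at 0 with unit standard deviation
# (pandas' std() uses the sample standard deviation, ddof=1, by default)
standardized = (toy - toy.mean()) / toy.std()

print(normalized)
print(standardized)
```

Normalization keeps the shape of the distribution but bounds the values, while standardization expresses each value as a distance from the column mean.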

Let's begin by importing the necessary libraries and the dataset.


In [None]:
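# Install necessary packages
# Note: pandas_profiling has since been renamed to ydata-profiling; on recent
# environments, install ydata-profiling and import ydata_profiling instead
!pip install pandas_profiling
!pip install dtale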

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import pandas_profiling
from google.colab import files

# Set up the notebook for visualizations
%matplotlib inline

# Load the dataset from GitHub
url = "https://raw.githubusercontent.com/wamaw123/Biomedical_Data_analysis/c072fdafc2b2abe4e002f8611f80bcf5fd8366b8/Datasets/Week_1/week_1.csv"
data = pd.read_csv(url)

# Display the first few rows of the dataset to understand its structure
data.head()


# Descriptive Statistics and Data Exploration

In this section, we'll examine the dataset's structure, characteristics, and potential issues. We'll look at basic information (shape, data types, unique values), measures of central tendency and dispersion, missing values, and a few visualizations.
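
Several of these summary statistics are available from a single call. The following is a minimal sketch on a hypothetical toy DataFrame (the column names and values are invented for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing numeric value
toy = pd.DataFrame({
    "radius": [10.0, 12.0, np.nan, 20.0],
    "diagnosis": ["B", "B", "M", "M"],
})

# describe() reports count, mean, std, min, quartiles, and max;
# on a mixed DataFrame it summarizes only the numeric columns by default
summary = toy.describe()
print(summary)

# A compact per-column overview of data types and missing counts
overview = pd.DataFrame({"dtype": toy.dtypes.astype(str),
                         "missing": toy.isnull().sum()})
print(overview)
```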


In [None]:
# Basic Information
print("Dataset Shape:", data.shape)
print("\nData Types:\n", data.dtypes)
print("\nUnique Values in Each Column:\n", data.nunique())

# Central Tendency and Dispersion
print("\nMean:\n", data.mean(numeric_only=True))
print("\nMedian:\n", data.median(numeric_only=True))
print("\nMode:\n", data.mode(numeric_only=True).iloc[0])
print("\nStandard Deviation:\n", data.std(numeric_only=True))
print("\nVariance:\n", data.var(numeric_only=True))

# For the range, we'll only consider numeric columns to avoid the TypeError
numeric_data = data.select_dtypes(include=[np.number])
print("\nRange:\n", numeric_data.max() - numeric_data.min())
print("\nQuartiles:\n", data.quantile([0.25, 0.5, 0.75], numeric_only=True))

# Missing Values
print("\nMissing Values Count:\n", data.isnull().sum())
print("\nPercentage of Missing Values:\n", (data.isnull().sum() / len(data)) * 100)

# Visualizations
numeric_data.hist(figsize=(12, 10))
plt.suptitle("Histograms of Data Columns")
plt.show()

plt.figure(figsize=(12, 10))
sns.boxplot(data=numeric_data)
plt.title("Boxplots of Data Columns")
plt.show()

plt.figure(figsize=(12, 10))
sns.heatmap(numeric_data.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()

# Additional Insights
print("\nSkewness:\n", numeric_data.skew())
print("\nKurtosis:\n", numeric_data.kurtosis())

# Dynamic Data Exploration with pandas_profiling

The summary above works, but it is not easy to read. For a more interactive and comprehensive overview of our dataset, we'll use the `pandas_profiling` package. This tool generates an interactive HTML report that covers each column, correlations, missing values, and much more.

## NOTE: pandas_profiling is a powerful tool, but the report file can be very large, and exploring it can be slow or buggy.


In [None]:
# Generate the profile report
profile = pandas_profiling.ProfileReport(numeric_data)
profile_file_path = "data_profile_report.html"
profile.to_file(output_file=profile_file_path)
# Download the file to your local system
files.download(profile_file_path)

# More Lightweight Data Exploration with D-Tale

Alternatively, D-Tale is a lightweight tool that provides an interactive web-based interface for viewing and analyzing Pandas data structures. It's a great alternative for quick and efficient data exploration without the overhead of more comprehensive tools like `pandas_profiling`.

In this section, we'll set up and use D-Tale to explore our dataset.


## Starting D-Tale Session

Once D-Tale is installed, we can start a session to view our dataset. After running the code below, you'll receive a link. Clicking on this link will open the D-Tale interface in a new tab, allowing for interactive exploration of the data.


In [None]:
import dtale
import dtale.app as dtale_app

# This is necessary in Colab to ensure the D-Tale instance keeps running
dtale_app.USE_COLAB = True

# Start D-Tale session
d = dtale.show(numeric_data)
d


In the D-Tale interface, you can:
- View the dataset in a tabular format.
- Generate charts and visualizations.
- Check statistics and distributions of columns.
- Run correlations.
- And much more!

Additionally, D-Tale provides options to export your data or any analysis directly from its interface.


## Data Corruption Step

The data looks pretty clean because it is a high-quality dataset imported from Kaggle. In real life, raw data comes with various issues that can hinder or skew an analysis. In this step, we'll intentionally introduce common data problems into our dataset so that we can later demonstrate corrective measures in a practical context.

The issues we'll introduce are:
- Missing Values
- NaN Values
- Inconsistencies
- Outliers
- Duplicates
- Incorrect Data Types
- Irrelevant Data
- Errors or Typos
- Biased Data (listed for completeness; the corruption code in this notebook does not simulate it)
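
Before fixing such problems, it helps to be able to detect them. The following is a minimal sketch of a few checks on a hypothetical toy DataFrame; the values are invented, and the set of valid labels (`"M"` and `"B"`) is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data containing several of the issues listed above
toy = pd.DataFrame({
    "id": ["1", "2", "3", "3"],             # numeric IDs stored as strings
    "diagnosis": ["M", "B", "N", "N"],      # 'N' is not an expected label
    "area": [500.0, np.nan, 480.0, 480.0],  # one missing value
})

# Missing values (None and NaN are both counted by isnull())
missing_per_column = toy.isnull().sum()

# Fully duplicated rows (the first occurrence is not counted)
duplicate_rows = int(toy.duplicated().sum())

# Labels outside the assumed set of valid categories
unexpected_labels = set(toy["diagnosis"].dropna()) - {"M", "B"}

# Columns that should be numeric but are not
id_is_numeric = pd.api.types.is_numeric_dtype(toy["id"])

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Unexpected labels:", unexpected_labels)
print("id stored as a number:", id_is_numeric)
```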

Let's corrupt our data!


In [None]:
# Introduce Missing Values
for col in data.columns:
    data.loc[data.sample(frac=0.1).index, col] = None

# Introduce NaN Values
data.loc[data.sample(frac=0.05).index, 'radius_mean'] = np.nan

# Introduce Inconsistencies (using different units or scales)
data['texture_mean'] = data['texture_mean'].apply(lambda x: x*10 if random.random() > 0.9 else x)

# Introduce Outliers
data.loc[data.sample(frac=0.02).index, 'area_mean'] = data['area_mean'].mean() + (data['area_mean'].std() * 10)

# Introduce Duplicates
duplicates = data.sample(frac=0.05)
data = pd.concat([data, duplicates])

# Introduce Incorrect Data Types
data['id'] = data['id'].astype(str)

# Introduce Irrelevant Data (adding a column that doesn't relate to the analysis)
data['irrelevant_data'] = [random.choice(['A', 'B', 'C']) for _ in range(len(data))]

# Introduce Errors or Typos in 'diagnosis' column
data['diagnosis'] = data['diagnosis'].apply(lambda x: 'N' if x == 'M' and random.random() > 0.95 else x)

# Display the first few rows of the corrupted data
print(data.head())


## Corrective Measures

Now that our data is corrupted, let's address each issue step by step. Where several approaches are reasonable (missing values and outliers), the code offers multiple corrective methods so you can choose the one that best suits your data.
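
The choice of method matters. As a minimal sketch on made-up numbers (not from the dataset), the IQR rule and the Z-score rule can disagree on a small sample:

```python
import pandas as pd

# Hypothetical measurements with one obviously extreme value
values = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])

# IQR rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Z-score rule: flag points more than 3 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]

print("IQR rule flags:", iqr_outliers.tolist())
print("Z-score rule flags:", z_outliers.tolist())
```

On this tiny sample the extreme value inflates both the mean and the standard deviation, so its Z-score stays below 3 and only the IQR rule flags it. On larger samples the two rules tend to agree more often.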


In [None]:
# Correct Missing Values
# Means and medians are computed on numeric columns only; non-numeric columns
# (such as 'diagnosis') are filled only by the "mode" and "drop" options
missing_value_method = "median"  # @param {type:"string"} ["mean", "median", "mode", "drop"]
if missing_value_method == "mean":
    data.fillna(data.mean(numeric_only=True), inplace=True)
elif missing_value_method == "median":
    data.fillna(data.median(numeric_only=True), inplace=True)
elif missing_value_method == "mode":
    for col in data.columns:
        # Assign back instead of filling the column in place
        data[col] = data[col].fillna(data[col].mode()[0])
elif missing_value_method == "drop":
    data.dropna(inplace=True)

# Correct NaN Values
# pandas treats None and NaN alike in numeric columns, so this pass only
# changes something if the step above left values unfilled
nan_value_method = "median"  # @param {type:"string"} ["mean", "median", "mode", "drop"]
if nan_value_method == "mean":
    data.fillna(data.mean(numeric_only=True), inplace=True)
elif nan_value_method == "median":
    data.fillna(data.median(numeric_only=True), inplace=True)
elif nan_value_method == "mode":
    for col in data.columns:
        data[col] = data[col].fillna(data[col].mode()[0])
elif nan_value_method == "drop":
    data.dropna(inplace=True)

# Correct Inconsistencies
# For this example, we'll revert the texture_mean values to their original scale
data['texture_mean'] = data['texture_mean'].apply(lambda x: x/10 if x > 100 else x)

# Correct Outliers
outlier_method = "Z-Score"  # @param {type:"string"} ["IQR", "Z-Score", "drop"]
if outlier_method == "IQR":
    # Quantiles are only defined for numeric columns
    numeric_cols = data.select_dtypes(include=[np.number])
    Q1 = numeric_cols.quantile(0.25)
    Q3 = numeric_cols.quantile(0.75)
    IQR = Q3 - Q1
    # Drop rows with any value outside the 1.5 * IQR fences
    is_outlier = (numeric_cols < (Q1 - 1.5 * IQR)) | (numeric_cols > (Q3 + 1.5 * IQR))
    data = data[~is_outlier.any(axis=1)]
elif outlier_method == "Z-Score":
    from scipy.stats import zscore
    z_scores = zscore(data.select_dtypes(include=[np.number]))
    abs_z_scores = np.abs(z_scores)
    # Keep rows whose values all lie within 3 standard deviations of the column mean
    data = data[(abs_z_scores < 3).all(axis=1)]
elif outlier_method == "drop":
    # Drop rows where 'area_mean' is an outlier
    data = data[np.abs(data['area_mean'] - data['area_mean'].mean()) <= (3 * data['area_mean'].std())]

# Correct Duplicates
data.drop_duplicates(inplace=True)

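# Correct Incorrect Data Types
# IDs that received missing values were converted to float before becoming
# strings, so they may look like '123.0' or 'nan'. Parse them as numbers first;
# the nullable Int64 dtype keeps any unrecoverable IDs as <NA>
data['id'] = pd.to_numeric(data['id'], errors='coerce').astype('Int64')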

# Remove Irrelevant Data
data.drop(columns=['irrelevant_data'], inplace=True)

# Correct Errors or Typos
data['diagnosis'] = data['diagnosis'].apply(lambda x: 'M' if x == 'N' else x)

# Display the first few rows of the corrected data
print(data.head())
