<a href="https://colab.research.google.com/github/wamaw123/Biomedical_Data_analysis/blob/main/Month_1/Week_1_Data_Importing_and_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Analytics with Python: A 6-Month Journey
By: [Abderrahim Benmoussa, Ph.D.](https://github.com/wamaw123)

Project on GitHub: https://github.com/wamaw123/Biomedical_Data_analysis

# Week 1: Data Importing, Exploring, and Cleaning

---



In this notebook, we'll focus on the foundational steps of any data analysis process:
1. **Data Importing**: We'll import a biomedical dataset from a GitHub repository.
2. **Descriptive Statistics**: This will give us a preliminary understanding of the dataset's structure and characteristics.
3. **Data Cleaning**: We'll handle missing values and outliers to ensure the data's quality.
4. **Data Visualization**: Visualizing the data will provide insights into its distribution and potential patterns.
5. **Normalization and Standardization**: We'll transform the data to prepare it for future analysis.

Let's begin by importing the necessary libraries.


In [None]:
# Install necessary packages
!pip install pandas_profiling dtale

# Import necessary libraries

## Data Manipulation
import pandas as pd   # Essential for data manipulation and mathematical operations.
import numpy as np    # Used for array-based operations and mathematical functions.

## Visualization
import matplotlib.pyplot as plt  # Fundamental plotting library.
import seaborn as sns            # Builds on top of matplotlib for more advanced visualizations.

## Statistical Testing
from scipy.stats import shapiro, pearsonr  # Provides functions for statistical testing.

## Interactive Exploration
import pandas_profiling           # Generates profile reports from a pandas DataFrame.
import dtale                      # Interactive tool for data frame exploration.
import dtale.app as dtale_app
import ipywidgets as widgets      # For creating interactive widgets.
from IPython.display import display  # Helps in displaying objects in Jupyter.

## Scaling and standardization
from sklearn.preprocessing import MinMaxScaler, StandardScaler

## Google Colab-specific
from google.colab import files    # Specific tools for Google Colab environment.

## Database
import sqlite3  # Allows interaction with SQLite databases.

# Set up the notebook for visualizations
%matplotlib inline

Let's retrieve the dataset from GitHub, load it into a DataFrame, and preview its first rows.

In [None]:
# Load the dataset from GitHub
url = "https://raw.githubusercontent.com/wamaw123/Biomedical_Data_analysis/c072fdafc2b2abe4e002f8611f80bcf5fd8366b8/Datasets/Week_1/week_1.csv"
data = pd.read_csv(url)
data.head()

## About Dataset

This dataset contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe characteristics of the cell nuclei present in the image.

The 3-dimensional space is described in the following reference: [K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets," Optimization Methods and Software, 1, 1992, 23-34].

You can access this dataset from the following sources:

- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29)
- [Kaggle Dataset](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data?resource=download)
- UW CS FTP Server: `ftp.cs.wisc.edu`, Path: `cd math-prog/cpo-dataset/machine-learn/WDBC/`

### Attribute Information

The dataset contains the following attributes:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features computed for each cell nucleus:

   a) Radius (mean of distances from center to points on the perimeter)
   b) Texture (standard deviation of gray-scale values)
   c) Perimeter
   d) Area
   e) Smoothness (local variation in radius lengths)
   f) Compactness (perimeter^2 / area - 1.0)
   g) Concavity (severity of concave portions of the contour)
   h) Concave points (number of concave portions of the contour)
   i) Symmetry
   j) Fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
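As a minimal sketch of this naming scheme, the 30 feature columns can be generated from the ten base features and the three statistics. The column names below assume the Kaggle CSV convention (for example `radius_mean`), which may differ from the file you load; field numbers are 1-based, with fields 1 and 2 holding the ID and the diagnosis.

```python
# Ten base features, in the order listed above (names assumed to follow the Kaggle CSV)
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry", "fractal_dimension",
]
statistics = ["mean", "se", "worst"]

# Fields 3-12 hold the means, 13-22 the standard errors, 23-32 the "worst" values
feature_names = [f"{feat}_{stat}" for stat in statistics for feat in base_features]

# Map each 1-based field number to its column name
field_to_name = {i + 3: name for i, name in enumerate(feature_names)}

print(len(feature_names))
print(field_to_name[3], field_to_name[13], field_to_name[23])
```

This reproduces the example given above: field 3 maps to the mean radius, field 13 to its standard error, and field 23 to its worst value.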

All feature values are recoded with four significant digits.

Missing attribute values: none

### Class Distribution

The class distribution in this dataset is as follows:
- 357 benign
- 212 malignant
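
The two classes are not balanced, which is worth keeping in mind when judging a classifier later. The sketch below only uses the counts stated above; the "majority-class baseline" is an illustrative reference point, not something computed elsewhere in this notebook.

```python
import pandas as pd

# Class counts as stated above
counts = pd.Series({"B": 357, "M": 212}, name="diagnosis")

# Share of each class among all samples
shares = counts / counts.sum()
print(shares.round(3))

# Accuracy of a model that always predicts the most frequent class
majority_baseline = shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```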

This dataset can therefore be used for classification tasks in oncology. Let's store it in a SQL database so that we can query specific columns. There are many ways to do this without SQL, but it is a good opportunity to practice with relational databases.

In [None]:
# Create a new SQLite database in memory
conn = sqlite3.connect(':memory:')

# Load the dataset into the SQLite database
data.to_sql('biomedical_data', conn, if_exists='replace')
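
To illustrate the round trip on its own, here is a minimal sketch with a small, made-up DataFrame (`toy`, `toy_table`, and `conn_demo` are hypothetical names). It passes `index=False` so that the DataFrame index is not written as an extra `index` column, and uses a `?` placeholder instead of string formatting for the filter value.

```python
import sqlite3
import pandas as pd

# Small illustrative frame; the values are made up for the example
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "B"],
    "radius_mean": [17.99, 13.54, 12.45],
})

conn_demo = sqlite3.connect(":memory:")
toy.to_sql("toy_table", conn_demo, if_exists="replace", index=False)

# Select two columns for the benign rows only, using a bound parameter
benign = pd.read_sql_query(
    'SELECT "id", "radius_mean" FROM toy_table WHERE "diagnosis" = ?',
    conn_demo,
    params=("B",),
)
print(benign)
conn_demo.close()
```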


Now let's create a tool to manually select columns

In [None]:
# Create widgets
available_columns = widgets.SelectMultiple(
    options=data.columns.tolist(),
    description='Available Columns',
    layout={'height': '150px', 'width': '400px'},
    disabled=False
)

selected_columns = widgets.SelectMultiple(
    description='Selected Columns',
    layout={'height': '150px', 'width': '400px'},
    disabled=False
)

# Define button click actions
def add_columns(b):
    for item in available_columns.value:
        if item not in selected_columns.options:
            selected_columns.options += (item,)

    available_columns.options = [item for item in data.columns.tolist() if item not in selected_columns.options]

def remove_columns(b):
    for item in selected_columns.value:
        if item not in available_columns.options:
            available_columns.options += (item,)

    selected_columns.options = [item for item in data.columns.tolist() if item not in available_columns.options]

# Create buttons
add_button = widgets.Button(description="Add >>")
remove_button = widgets.Button(description="<< Remove")

add_button.on_click(add_columns)
remove_button.on_click(remove_columns)

# Group widgets for column selection
left_box = widgets.VBox([available_columns, add_button, remove_button])
right_box = widgets.VBox([selected_columns])
column_selection_box = widgets.HBox([left_box, right_box])

# Global variable to store the validated columns
validated_columns = []

# Define the function to validate and save selected columns
def validate_selection(button):
    global validated_columns
    validated_columns = list(selected_columns.options)

    if not validated_columns:
        print("No columns selected. Please select columns from the 'Selected Columns' box before validating.")
    else:
        print(f"Selected columns have been saved: {', '.join(validated_columns)}")

# Button to validate and save the selected columns
validate_button = widgets.Button(description="Validate Selection")
validate_button.on_click(validate_selection)

# Group all widgets in a VBox and display
all_widgets = widgets.VBox([column_selection_box, validate_button])
display(all_widgets)


Now we select the rows we want to work with. We can keep all of them or just some. Fewer rows generally mean less accurate models, but a smaller subset can be useful while developing and debugging the code.

In [None]:
# Create a slider for selecting row range
row_slider = widgets.IntRangeSlider(
    value=[0, len(data)],
    min=0,
    max=len(data),
    step=1,
    description='Row Range:',
    continuous_update=False
)

# Display the row selection slider
display(row_slider)


Finally, let's query the data.

In [None]:
# Global variable to store the selected data for subsequent use
selected_data = None

# Define the function to query data based on user selection
def get_selected_data(button):
    global selected_data  # Declare the variable as global
    columns = validated_columns  # Use the validated_columns global variable

    # Check if columns are selected
    if not columns:
        print("Please validate your column selection before querying.")
        return

    start_row = row_slider.value[0]
    end_row = row_slider.value[1]

    # Construct the SQL query with double quotes around column names
    columns_str = ', '.join([f'"{col}"' for col in columns])
    query = f"SELECT {columns_str} FROM biomedical_data LIMIT {start_row}, {end_row - start_row + 1}"

    result = pd.read_sql_query(query, conn)
    selected_data = result  # Assign the result to the global variable
    display(result)

# Button to execute the query and display results
query_button = widgets.Button(description="Show Data")
query_button.on_click(get_selected_data)

# Display the button
display(query_button)
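
The query above relies on SQLite's `LIMIT offset, count` form. As a minimal sketch on a made-up ten-row table (`rows_demo` and `pos` are hypothetical names), the `+ 1` makes the end of the range inclusive, while the `LIMIT ... OFFSET ...` spelling without it gives a half-open range.

```python
import sqlite3
import pandas as pd

# Ten rows holding their own position, 0..9
rows = pd.DataFrame({"pos": list(range(10))})

conn_demo = sqlite3.connect(":memory:")
rows.to_sql("rows_demo", conn_demo, index=False)

start_row, end_row = 2, 5

# "LIMIT offset, count": skip start_row rows, then return count rows (end_row included)
inclusive = pd.read_sql_query(
    f"SELECT pos FROM rows_demo ORDER BY rowid "
    f"LIMIT {start_row}, {end_row - start_row + 1}",
    conn_demo,
)

# Half-open variant: end_row itself is excluded
half_open = pd.read_sql_query(
    f"SELECT pos FROM rows_demo ORDER BY rowid "
    f"LIMIT {end_row - start_row} OFFSET {start_row}",
    conn_demo,
)

print(inclusive["pos"].tolist())
print(half_open["pos"].tolist())
conn_demo.close()
```

The `ORDER BY rowid` clause is added here to make the row order explicit; without an `ORDER BY`, SQL does not guarantee any particular order.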


# Descriptive Statistics and Data Exploration

In this section, we'll delve deep into our dataset to understand its structure, characteristics, and potential issues. This includes understanding basic information, central tendencies, visualizations, and more.


Let's first define a few useful functions.

In [None]:
# Define a function to plot histograms for each column in the DataFrame
def plot_histograms(df):
    """
    Plot histograms for each column in the DataFrame.

    Parameters:
    - df: DataFrame
    """
    df.hist(figsize=(20, 15))
    plt.tight_layout()  # Adjusts subplot params for better layout
    plt.show()

# Define a function to plot a heatmap of correlations between columns in the DataFrame
def plot_corr_heatmap(df):
    """
    Plot a heatmap of correlations between columns in the DataFrame.

    Parameters:
    - df: DataFrame
    """
    plt.figure(figsize=(12, 10))
    sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
    plt.title("Correlation Heatmap")
    plt.show()

# Define a function to print skewness and kurtosis for each column in the DataFrame
def print_skewness_kurtosis(df):
    """
    Print skewness and kurtosis for each column in the DataFrame.

    Parameters:
    - df: DataFrame
    """
    print("\nSkewness:\n", df.skew())
    print("\nKurtosis:\n", df.kurtosis())


# Define a function to make violin plots for each column in the DataFrame
def plot_violinplots(df):
    """
    Plot one violin plot per column in the DataFrame.

    Parameters:
    - df: DataFrame
    """
    for col in df.columns:
        plt.figure(figsize=(5, 4))
        sns.violinplot(y=df[col])
        plt.title(f"Violin plot of {col}")
        plt.show()

# Define a function to print missing values information and plot a heatmap of missing values
def print_missing_values_info(df):
    """
    Print missing values count and percentage for each column in the DataFrame.
    Also, plot a heatmap of missing values.

    Parameters:
    - df: DataFrame
    """
    # Print missing values count and percentage
    print("\nMissing Values Count:\n", df.isnull().sum())
    print("\nPercentage of Missing Values:\n", (df.isnull().sum() / len(df)) * 100)

    # Plot heatmap of missing values
    plt.figure(figsize=(12, 6))
    sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
    plt.title("Heatmap of Missing Values (Yellow indicates missing data)")
    plt.show()
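
As a quick check of what the missing-values summary reports, here is a minimal sketch on a small, made-up DataFrame (`toy` is hypothetical). It computes the same count and percentage as `print_missing_values_info`, gathered in one table and without the heatmap.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in two of the three columns
toy = pd.DataFrame({
    "radius_mean": [17.99, np.nan, 12.45, 14.20],
    "diagnosis": ["M", "B", None, "B"],
    "area_mean": [1001.0, 566.3, 477.1, 600.0],
})

summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,  # mean of booleans = fraction missing
})
print(summary)
```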

Let's now select the numerical data for exploration.

In [None]:
# Extract only the numeric columns from the selected data for further analysis
numeric_selected_data = selected_data.select_dtypes(include=[np.number])

Now we can finally perform descriptive statistics and check some visualizations to understand our dataset

In [None]:
# Descriptive statistics
print("\nDescriptive Statistics:\n", selected_data.describe(include='all'))

# Visualizations

# Histograms
plot_histograms(numeric_selected_data)

# Pairplots - enable only for small datasets, as they are computationally expensive
#sns.pairplot(numeric_selected_data)
#plt.suptitle("Pair Plot of Numeric Features", y=1.02)
#plt.show()

# Violin plots for each numeric column
plot_violinplots(numeric_selected_data)

# Outlier detection
Q1 = numeric_selected_data.quantile(0.25)
Q3 = numeric_selected_data.quantile(0.75)
IQR = Q3 - Q1
outliers = ((numeric_selected_data < (Q1 - 1.5 * IQR)) | (numeric_selected_data > (Q3 + 1.5 * IQR))).sum()
print("\nOutliers Count (using IQR method):\n", outliers)

# Call the function to display missing values info and heatmap
print_missing_values_info(selected_data)

# Correlation Analysis
correlations = numeric_selected_data.corr()
# Print significant correlations
for col in correlations.columns:
    for idx in correlations.index:
        if idx >= col:  # This avoids duplicate pairs
            continue
        corr_val = correlations.loc[idx, col]
        _, p_value = pearsonr(numeric_selected_data[idx], numeric_selected_data[col])
        if p_value < 0.05:
            print(f"Significant correlation between {idx} and {col}: {corr_val:.2f} (p-value: {p_value:.5f})")

# Plot correlation heatmap
plot_corr_heatmap(numeric_selected_data)


# Skewness and kurtosis
print_skewness_kurtosis(numeric_selected_data)

This gives us some exploratory information, but it is not very user-friendly, which can be limiting when working with people outside the field or not used to Colab. It also takes a fairly long time to run.

# Dynamic Data Exploration with pandas_profiling

For a more interactive and comprehensive overview of our dataset, we can use the `pandas_profiling` package. This tool generates an interactive HTML report that provides a deep dive into each column, correlations, missing values, and much more.

### NOTE: pandas_profiling is a powerful tool, but the report file can be very large and exploring it can be sluggish or buggy. On the plus side, you only need to generate it once, and the HTML file makes it easy to explore the dataset outside of Colab or any other tool.


In [None]:
# Generate the profile report for the selected columns
profile = pandas_profiling.ProfileReport(numeric_selected_data)
profile_file_path = "selected_data_profile_report.html"
profile.to_file(output_file=profile_file_path)

# Download the file to your local system
files.download(profile_file_path)

# Lighter-Weight Data Exploration with D-Tale

Alternatively, D-Tale is a lightweight tool that provides an interactive web-based interface for viewing and analyzing Pandas data structures. It's a great alternative for quick and efficient data exploration without the overhead of more comprehensive tools like `pandas_profiling`.

## Starting D-Tale Session

After running the code below, you'll receive a link. Clicking on this link will open the D-Tale interface in a new tab, allowing for interactive exploration of the data.

In the D-Tale interface, you can:
- View the dataset in a tabular format.
- Generate charts and visualizations.
- Check statistics and distributions of columns.
- Run correlations.
- And much more!

Additionally, D-Tale provides options to export your data or any analysis directly from its interface. For now, we will just use it to visually explore and understand the dataset.


In [None]:
# This is necessary in Colab to ensure the D-Tale instance keeps running
dtale_app.USE_COLAB = True

# Start D-Tale session
d = dtale.show(numeric_selected_data)
d

Note: if D-Tale does not work, you can try using ngrok, but ngrok must first be set up with an authentication token. See the details on how to do that [here](https://github.com/man-group/dtale#google-colab:~:text=If%20this%20does%20not%20work%20for%20you%20try%20using%20USE_NGROK%20which%20is%20described%20in%20the%20next%20section.)

## Other Automated Data Exploration and Cleaning Tools

There are several other tools that one can use to explore the dataset, identify common problems like missing values, inconsistencies, outliers, and more. We will not use them here but they are kept for the record.

### 1. [sweetviz](https://github.com/fbdesignpro/sweetviz)
An open-source Python library generating beautiful visualizations for EDA. It provides dataset comparisons, target value analysis, and highlights missing and zero values.
```python
# !pip install sweetviz
# import sweetviz as sv
# report = sv.analyze(your_dataframe)
# report.show_html('report.html')
```

### 2. [DataPrep.EDA](https://github.com/sfu-db/dataprep)
Allows you to explore the dataset with a single line of code, providing insights on missing values, data distribution, correlation, and more.
```python
# !pip install dataprep
# from dataprep.eda import create_report
# report = create_report(your_dataframe)
# report.show_browser()
```

### 3. [datacleaner](https://github.com/rhiever/datacleaner)
A Python tool that automatically cleans datasets and readies them for analysis. It can handle missing values and incorrect data types.
```python
# !pip install datacleaner
# from datacleaner import autoclean
# your_clean_dataframe = autoclean(your_dataframe)
```

### 4. [pydqc](https://github.com/SauceCat/pydqc)
Designed to automatically compare and validate datasets, generating data summaries, missing value statistics, and more.
```python
# !pip install pydqc
# from pydqc.data_summary import distribution_summary_pretty
# distribution_summary_pretty(your_dataframe, 'output_directory')
```

### 5. [Great Expectations](https://github.com/great-expectations/great_expectations)
A Python-based open-source library for validating, documenting, and profiling your data, helping maintain data quality.
```python
# !pip install great_expectations
# import great_expectations as ge
# your_dataframe_ge = ge.dataset.PandasDataset(your_dataframe)
# your_dataframe_ge.expect_column_values_to_not_be_null('column_name')
```

### 6. Tidy Data
The concept of ["tidy data"](https://vita.had.co.nz/papers/tidy-data.html) introduced by Hadley Wickham is a standard to structure datasets to facilitate analysis. Ensuring your data is "tidy" can help prevent inconsistencies and irrelevant data.

Remember to replace `your_dataframe` with the name of your actual dataframe when using these tools.



## Data Corruption Step

The data looks pretty clean. That is because it was a high quality dataset imported from Kaggle. In real life, raw data comes with various issues that can hinder or skew our analysis. In this step, we'll intentionally introduce common data problems to our dataset. This will allow us to later demonstrate corrective measures in a practical context.

The issues we'll introduce are:
- Missing Values
- NaN Values
- Inconsistencies
- Outliers
- Duplicates
- Incorrect Data Types
- Irrelevant Data
- Errors or Typos
- Biased Data

Let's corrupt our data!


In [None]:
import random  # standard-library RNG used for the random corruption below

# Introduce Missing Values
for col in selected_data.columns:
    selected_data.loc[selected_data.sample(frac=0.1).index, col] = None

# Introduce NaN Values
selected_data.loc[selected_data.sample(frac=0.05).index, 'radius_mean'] = np.nan

# Introduce Inconsistencies (using different units or scales)
selected_data['texture_mean'] = selected_data['texture_mean'].apply(lambda x: x*10 if random.random() > 0.9 else x)

# Introduce Outliers
selected_data.loc[selected_data.sample(frac=0.02).index, 'area_mean'] = selected_data['area_mean'].mean() + (selected_data['area_mean'].std() * 10)

# Introduce Duplicates
duplicates = selected_data.sample(frac=0.05)
selected_data = pd.concat([selected_data, duplicates])

# Introduce Incorrect Data Types
selected_data['id'] = selected_data['id'].astype(str)

# Introduce Irrelevant Data (adding a column that doesn't relate to the analysis)
selected_data['irrelevant_data'] = [random.choice(['A', 'B', 'C']) for _ in range(len(selected_data))]

# Introduce Errors or Typos in 'diagnosis' column
selected_data['diagnosis'] = selected_data['diagnosis'].apply(lambda x: 'N' if x == 'M' and random.random() > 0.95 else x)

# Display the first few rows of the corrupted data
selected_data.head()
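
Because the corruption above is random, each run damages different cells. If reproducibility matters, one option is to drive the sampling from a seeded generator. The sketch below is a simplified, hypothetical helper (`corrupt`) that only covers the missing-values step, applied to a made-up DataFrame.

```python
import numpy as np
import pandas as pd

def corrupt(df, frac=0.1, seed=42):
    """Return a copy of df with a fraction of each column set to NaN, reproducibly."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_missing = int(round(frac * len(out)))
    for col in out.columns:
        # Pick row labels without replacement, independently for each column
        idx = rng.choice(out.index.to_numpy(), size=n_missing, replace=False)
        out.loc[idx, col] = np.nan
    return out

# Made-up numeric data: 20 rows, two float columns
toy = pd.DataFrame({
    "radius_mean": np.linspace(10.0, 20.0, 20),
    "area_mean": np.linspace(300.0, 1200.0, 20),
})

run_a = corrupt(toy, seed=0)
run_b = corrupt(toy, seed=0)
print(run_a.isnull().sum())
print("Identical runs:", run_a.equals(run_b))
```

With the same seed, both runs blank out exactly the same cells, which makes the cleaning steps that follow easier to debug.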


Let's look at the dataset again now that it has been corrupted.

In [None]:
# Call the function to display missing values info and heatmap
print_missing_values_info(selected_data)

# This is necessary in Colab to ensure the D-Tale instance keeps running
dtale_app.USE_COLAB = True

# Start D-Tale session
d = dtale.show(selected_data)
d

## Corrective Measures (make sure to drop rows where the diagnosis is NaN)

Now that our data is corrupted, let's address each issue step by step. For each problem, we'll provide multiple corrective methods, allowing you to choose the most suitable one based on the specific context of the data.


In [None]:
selected_data.head()

In [None]:
data = selected_data.copy()  # work on a copy so the corrupted data stays available
# Correct Missing Values
missing_value_method = "median"  # @param {type:"string"} ["mean", "median", "mode", "drop"]
if missing_value_method == "mean":
    # numeric_only=True skips text columns such as 'diagnosis'
    data.fillna(data.mean(numeric_only=True), inplace=True)
elif missing_value_method == "median":
    data.fillna(data.median(numeric_only=True), inplace=True)
elif missing_value_method == "mode":
    for col in data.columns:
        # Assign the result back instead of filling a column in place
        data[col] = data[col].fillna(data[col].mode()[0])
elif missing_value_method == "drop":
    data.dropna(inplace=True)
data.head()


In [None]:
# Correct NaN Values
# pandas treats None and np.nan alike as missing, so this repeats the previous step
# and only changes the data if a different method is chosen here
nan_value_method = "median"  # @param {type:"string"} ["mean", "median", "mode", "drop"]
if nan_value_method == "mean":
    data.fillna(data.mean(numeric_only=True), inplace=True)
elif nan_value_method == "median":
    data.fillna(data.median(numeric_only=True), inplace=True)
elif nan_value_method == "mode":
    for col in data.columns:
        data[col] = data[col].fillna(data[col].mode()[0])
elif nan_value_method == "drop":
    data.dropna(inplace=True)

data.head()
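
Imputation only makes sense for the features; rows with a missing target are better removed, as the section title suggests. The sketch below shows one possible order of operations on a made-up DataFrame (`toy` and `cleaned` are hypothetical names): drop rows without a diagnosis first, then fill the numeric columns with their median.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing feature value and one missing target
toy = pd.DataFrame({
    "radius_mean": [10.0, np.nan, 14.0, 30.0],
    "diagnosis": ["B", "B", None, "M"],
})

# 1) Remove rows whose target is missing
cleaned = toy.dropna(subset=["diagnosis"]).copy()

# 2) Fill the remaining gaps in numeric columns with the column median
num_cols = cleaned.select_dtypes(include="number").columns
cleaned[num_cols] = cleaned[num_cols].fillna(cleaned[num_cols].median())

print(cleaned)
```

Note that the median is computed after the row with the missing target is removed, so it is based on 10.0 and 30.0 only.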



Next, we correct the inconsistencies and then handle the outliers. Be careful: the Z-score and IQR methods are highly dependent on the distribution of the data.

In [None]:
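# Correct Inconsistencies
# For this example, we'll revert the texture_mean values to their original scale.
# Values above 100 are assumed to have been multiplied by 10; adjust the threshold to your data.
data['texture_mean'] = data['texture_mean'].apply(lambda x: x/10 if x > 100 else x)

# Correct Outliers
outlier_method = "IQR"  # @param {type:"string"} ["IQR", "Z-Score", "drop"]
# Outlier rules only apply to numeric columns (text columns cannot be compared to quantiles)
numeric_part = data.select_dtypes(include=[np.number])
if outlier_method == "IQR":
    Q1 = numeric_part.quantile(0.25)
    Q3 = numeric_part.quantile(0.75)
    IQR = Q3 - Q1
    # Drop every row that has an outlier in at least one numeric column
    data = data[~((numeric_part < (Q1 - 1.5 * IQR)) | (numeric_part > (Q3 + 1.5 * IQR))).any(axis=1)]
elif outlier_method == "Z-Score":
    from scipy.stats import zscore
    # nan_policy='omit' ignores NaN when computing each column's mean and standard deviation
    abs_z_scores = np.abs(zscore(numeric_part.to_numpy(dtype=float), nan_policy='omit'))
    data = data[(abs_z_scores < 3).all(axis=1)]
elif outlier_method == "drop":
    # Drop rows where 'area_mean' is an outlier
    data = data[np.abs(data['area_mean'] - data['area_mean'].mean()) <= (3 * data['area_mean'].std())]

data.head()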
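
To see how much the choice of rule matters, here is a minimal sketch on ten made-up values containing one obvious extreme. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Nine ordinary values and one extreme value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)

# IQR rule: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_flags = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score rule: flag values more than 3 standard deviations from the mean
z_scores = (values - values.mean()) / values.std(ddof=0)
z_flags = z_scores.abs() > 3

print("IQR rule flags:", values[iqr_flags].tolist())
print("Z-score rule flags:", values[z_flags].tolist())
print("Z-score of 100:", round(z_scores.iloc[-1], 3))
```

The IQR rule flags the value 100, while the Z-score rule does not: the extreme value inflates the mean and the standard deviation it is measured against, so its Z-score stays just below 3.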



In [None]:
# Correct Duplicates
# Reassign instead of modifying in place, since data may be a filtered view
data = data.drop_duplicates()
data.head()
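
`drop_duplicates` only removes rows that are identical in every column. A minimal sketch on made-up data (`toy` is hypothetical) shows the difference between fully duplicated rows and rows that merely share an `id`, which may deserve a closer look rather than automatic removal.

```python
import pandas as pd

# Row 2 repeats row 1 exactly; rows 3 and 4 share an id but disagree on the diagnosis
toy = pd.DataFrame({
    "id": [1, 2, 2, 3, 3],
    "diagnosis": ["M", "B", "B", "B", "M"],
})

n_full_duplicates = toy.duplicated().sum()
deduped = toy.drop_duplicates().reset_index(drop=True)

# keep=False marks every row whose id occurs more than once
shared_id = toy.duplicated(subset=["id"], keep=False)

print("Fully duplicated rows:", n_full_duplicates)
print("Rows after drop_duplicates:", len(deduped))
print("Rows sharing an id:", shared_id.sum())
```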


In [None]:
# Drop the rows where id is NaN or the string "nan"
data = data.dropna(subset=['id'])
data = data[data['id'] != 'nan']
# Drop the rows where diagnosis is NaN, since it is the target variable
data = data.dropna(subset=['diagnosis'])
data = data[data['diagnosis'] != 'nan']

display(data.head())
data.describe()

In [None]:
# Correct Incorrect Data Types
# The ids were stored as text such as "842302.0", so convert through float before int
data['id'] = data['id'].astype(float).astype(int)
data.head()
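
An alternative to chaining `astype` calls is `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN instead of raising an error. A minimal sketch on made-up id strings:

```python
import pandas as pd

# Ids as they might look after a float column was converted to text
ids_as_text = pd.Series(["842302.0", "nan", "84300903.0"])

# Unparseable or missing entries become NaN rather than raising an error
ids_numeric = pd.to_numeric(ids_as_text, errors="coerce")

# Drop the missing ids, then convert the rest to integers
ids_clean = ids_numeric.dropna().astype("int64")
print(ids_clean.tolist())
```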

In [None]:
# Create widgets
available_columns = widgets.SelectMultiple(
    options=data.columns.tolist(),
    description='Available Columns',
    layout={'height': '150px', 'width': '400px'},
    disabled=False
)

selected_columns = widgets.SelectMultiple(
    description='Columns to Drop',
    layout={'height': '150px', 'width': '400px'},
    disabled=False
)

# Define button click actions
def add_columns(b):
    for item in available_columns.value:
        if item not in selected_columns.options:
            selected_columns.options += (item,)

    available_columns.options = [item for item in data.columns.tolist() if item not in selected_columns.options]

def remove_columns(b):
    for item in selected_columns.value:
        if item not in available_columns.options:
            available_columns.options += (item,)

    selected_columns.options = [item for item in data.columns.tolist() if item not in available_columns.options]

# Create buttons
add_button = widgets.Button(description="Add >>")
remove_button = widgets.Button(description="<< Remove")

add_button.on_click(add_columns)
remove_button.on_click(remove_columns)

# Group widgets for column selection
left_box = widgets.VBox([available_columns, add_button, remove_button])
right_box = widgets.VBox([selected_columns])
column_selection_box = widgets.HBox([left_box, right_box])

# Global variable to store the columns to be dropped
columns_to_drop = []

# Define the function to validate and save columns to drop
def validate_selection(button):
    global columns_to_drop
    columns_to_drop = list(selected_columns.options)

    if not columns_to_drop:
        print("No columns selected to drop. Please select columns from the 'Columns to Drop' box before validating.")
    else:
        print(f"Columns selected to be dropped: {', '.join(columns_to_drop)}")
        data.drop(columns=columns_to_drop, inplace=True)  # This line drops the selected columns from the DataFrame
        print("Columns have been dropped from the DataFrame!")

# Button to validate and save the columns to drop
validate_button = widgets.Button(description="Drop Selected Columns")
validate_button.on_click(validate_selection)

# Group all widgets in a VBox and display
all_widgets = widgets.VBox([column_selection_box, validate_button])
display(all_widgets)


In [None]:
# Correct Errors or Typos
data['diagnosis'] = data['diagnosis'].apply(lambda x: 'M' if x == 'N' else x)
data.head()
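
Mapping `'N'` back to `'M'` works here only because we introduced that typo ourselves. With real data, it is usually safer to first list the labels that fall outside the expected set. A minimal sketch on a made-up column:

```python
import pandas as pd

# Made-up diagnosis column with one unexpected label
diagnosis = pd.Series(["M", "B", "N", "B", "M"])
valid_labels = ["M", "B"]

# Inspect unexpected labels before deciding how to correct them
unexpected = diagnosis[~diagnosis.isin(valid_labels)]
print(unexpected.value_counts())

# Apply the correction, then confirm that only valid labels remain
corrected = diagnosis.replace({"N": "M"})
print(corrected.isin(valid_labels).all())
```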

One can also solve some of these issues directly in D-Tale, export the result as CSV, and then reload it in pandas to solve the rest (dropping NaN values, etc.).

## Normalization and Standardization

Finally, we'll transform our data to ensure it's on a consistent scale. This is crucial for many machine learning algorithms. We'll use:
1. **Min-Max Normalization**: This scales the data between 0 and 1.
2. **Z-score Standardization**: This scales the data based on its mean and standard deviation.
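
Both transformations are simple formulas. As a minimal sketch on four made-up values, computed by hand with NumPy (the population standard deviation, NumPy's default, is also what `StandardScaler` uses):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])

# Min-max normalization: (x - min) / (max - min), which maps the values onto [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: (x - mean) / std, which gives mean 0 and standard deviation 1
x_zscore = (x - x.mean()) / x.std()

print(x_minmax.round(3))
print(x_zscore.round(3))
```

Min-max scaling is sensitive to extreme values, since a single outlier defines the range; standardization is affected too, but it does not confine the data to a fixed interval.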


In [None]:
data = pd.DataFrame(data)

# Select only numerical columns ('diagnosis' is text; 'id' is excluded explicitly)
num_cols = data.select_dtypes(include=[np.number]).columns
num_cols = num_cols.drop(['id'], errors='ignore')  # exclude id column if present

if len(num_cols) > 0:
    # Normalize
    scaler_norm = MinMaxScaler()
    data_normalized = data.copy()
    data_normalized[num_cols] = scaler_norm.fit_transform(data[num_cols])
    normalized_data = data_normalized

    # Standardize
    scaler_std = StandardScaler()
    data_standardized = data.copy()
    data_standardized[num_cols] = scaler_std.fit_transform(data[num_cols])
    standardized_data = data_standardized

    print("\nNormalized Data:")
    print(normalized_data.head())
    print("\nStandardized Data:")
    print(standardized_data.head())
else:
    print("There are no numerical columns to normalize or standardize.")


There! We now have a reasonably clean dataset, ready to use for modeling or statistical analyses.

In [None]:
# Call the function to display missing values info and heatmap
print_missing_values_info(normalized_data)

#Duplicates
print("\nNumber of duplicate rows:", normalized_data.duplicated().sum())
normalized_data.drop_duplicates(inplace=True)
print("Duplicates removed. New shape:", normalized_data.shape)

# This is necessary in Colab to ensure the D-Tale instance keeps running
dtale_app.USE_COLAB = True

# Start D-Tale session
d = dtale.show(normalized_data)
d

# Conclusion and perspectives

Analyzing and preprocessing data is a pivotal aspect of data science. While it can be time-consuming, meticulous and rigorous attention to this phase can significantly enhance model performance.

Several strategies exist to address the issues mentioned. Although I've highlighted traditional methods, it's essential to remember that tasks like data cleaning, imputation, and normalization should be undertaken in collaboration with subject matter experts, examining each feature individually and choosing the proper strategy. In some cases, dropping features might be more advisable than dropping rows to ensure accuracy in the later modeling steps.

A few things to improve in this document for proper data handling:
- It would be better practice to create a new DataFrame at each step instead of modifying the same DataFrame throughout. This would protect data integrity and make changes easier to track.
- One might want to run a full diagnosis on the data once it is cleaned, and repeat the error-correction steps until no issues remain.
- Visual inspection of the data might also help spot common errors, and regular expressions can be used to fix typos and common data-entry errors (unwanted spaces, underscores, commas instead of decimal points, etc.).
- A lot of the code here serves user experience rather than the analysis itself. Depending on the project, it might be better to favor either tidy code or user-friendliness.
- The code could be tidier.
- In a normal workflow, some functions would be better placed in a separate module accompanying this document. Here, I chose to keep everything in one document for clarity.
- Some of the comments and descriptions could benefit from better wording.
- Adding links to tutorials for D-Tale and the other tools might be a good idea.