<a href="https://colab.research.google.com/github/victormurcia/VCHAMPS/blob/main/VCHAMPS_Initial_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# VCHAMPS EDA on Quality Check Dataset
*Made by Victor M. Murcia on 6/22/2023*

This notebook performs an initial EDA on the files that make up the dataset provided for the VCHAMPS challenge, hosted by precisionFDA [here](https://precision.fda.gov/challenges/31).

I showcase routines to generate EDA reports and widgets that allow for rapid exploration and inspection of the various dataframes.

# Required Python Libraries and Modules
Below are the various libraries and modules I used to carry out this initial cursory analysis.

In [None]:
#General utilities
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
import seaborn as sns
from tqdm import tqdm  # Import tqdm for the progress bar
import glob
import os
from typing import List

#For Slider viz
import ipywidgets as widgets
from IPython.display import display, clear_output,HTML

#DataPrep for Quick EDA
!pip install dataprep
from dataprep.eda import create_report

#Enable data to be extracted and downloaded from my Google Drive
from google.colab import drive, files
drive.mount('/content/drive')

In [36]:
def load_csvs(path2data: str) -> List[str]:
  """
  Load and return a list of CSV file paths from the specified directory.

  Args:
      path2data (str): The directory path containing the CSV files.

  Returns:
      List[str]: A list of CSV file paths.

  """
  csv_files = glob.glob(path2data + '/*.csv')
  return csv_files

def make_df_list(csv_files: List[str]) -> List[pd.DataFrame]:
  """
  Read CSV files from the provided list of file paths and return a list of DataFrames.

  Args:
      csv_files (List[str]): A list of CSV file paths.

  Returns:
      List[pd.DataFrame]: A list of DataFrames read from the CSV files.

  """
  df_list = []
  # Read the CSV file
  for csv in csv_files:
    df = pd.read_csv(csv)
    df_list.append(df)

  return df_list

def clean_filenames(csv_files: List[str]) -> List[str]:
  """
  Clean the file names by removing directory path and the .csv extension.

  Args:
      csv_files (List[str]): A list of CSV file paths.

  Returns:
      List[str]: A list of cleaned file names without directory path and file extension.

  """
  #Get list of file names without directory junk and remove .csv extension from name
  file_names = []

  for file_path in csv_files:
      file_name = os.path.basename(file_path)  # Get the file name with extension
      file_name = os.path.splitext(file_name)[0]  # Remove the file extension
      file_names.append(file_name)
  return file_names

#Define data location
path2data = '/content/drive/MyDrive/Quality Check'
#Load the .csv files into memory
csv_files  = load_csvs(path2data)
#Create list of dataframes from csvs
df_list    = make_df_list(csv_files)
#Clean the names of .csv files
file_names = clean_filenames(csv_files)
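
The three helpers above could also be condensed with `pathlib`. The sketch below is one possible variant, not part of the original notebook: `index_csvs` is a hypothetical helper and the file names in the demo are made up. It sorts the paths because `glob` makes no promise about ordering, and the slider widgets later in this notebook address dataframes by position.

```python
import tempfile
from pathlib import Path
from typing import Dict


def index_csvs(path2data: str) -> Dict[str, Path]:
    """Map each cleaned file name (no directory, no .csv) to its path."""
    # sorted() makes the order deterministic; glob order is not guaranteed
    return {p.stem: p for p in sorted(Path(path2data).glob("*.csv"))}


# Demonstration on a throwaway directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("demographics.csv", "labs.csv", "notes.txt"):
        (Path(tmp) / name).write_text("a,b\n1,2\n")
    csv_index = index_csvs(tmp)
    names = list(csv_index)
    print(names)  # only the .csv files, without extension
```

Keeping names and paths in one dictionary avoids having two parallel lists that must stay in the same order.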

# Widget for Rapid Dataframe Inspection
I made this widget to quickly inspect the generated dataframes from the provided .csv files. This allows us to quickly see all the different attributes present in the data.

In [43]:
def visualize_dataframes(df_list: List[pd.DataFrame], file_names: List[str]) -> None:
    """
    Visualizes a list of DataFrames with an interactive slider.

    Args:
        df_list (List[pd.DataFrame]): A list of pandas DataFrames to be visualized.
        file_names (List[str]): A list of corresponding file names for the DataFrames.

    Returns:
        None
    """
    # Set the maximum number of rows to be displayed
    pd.set_option("display.max_rows", 10)

    # Create an index slider
    index_slider = widgets.IntSlider(min=0, max=len(df_list)-1, value=0, description='DataFrame Index')

    # Function to display the selected dataframe
    output = widgets.Output()

    def display_dataframe(index: int) -> None:
        """
        Displays the selected DataFrame based on the index.

        Args:
            index (int): Index of the DataFrame to be displayed.

        Returns:
            None
        """
        df = df_list[index]
        current_file = file_names[index]
        with output:
            clear_output(wait=True)
            print('Filename: ',current_file)
            display(df)

    # Display the output area that will hold the selected dataframe
    display(output)

    # Link the slider value to the display_dataframe function
    # (interactive_output also calls it once, which renders the initial dataframe)
    widgets.interactive_output(display_dataframe, {'index': index_slider})

    # Display the slider
    display(index_slider)

visualize_dataframes(df_list, file_names)

Output()

IntSlider(value=0, description='DataFrame Index', max=14)

# Widget for Rapid Extraction of Dataframe Information
This widget gives you a quick description of each dataframe's composition so that you don't need multiple cells or calls to the df.info() method from pandas.

In [53]:
def show_dataframe_info_slider(df_list: List[pd.DataFrame], file_names: List[str]) -> None:
    """
    Display a slider to select a dataframe from the list and show its .info() method.

    Args:
        df_list (List[pd.DataFrame]): List of pandas DataFrames.
        file_names (List[str]): A list of corresponding file names for the DataFrames.

    Returns:
        None
    """
    def show_dataframe_info(index: int) -> None:
        """
        Display the .info() method for the selected dataframe.

        Args:
            index (int): Index of the dataframe to display info for.

        Returns:
            None
        """
        selected_df = df_list[index]
        current_file = file_names[index]
        with output:
            output.clear_output()
            print('Filename: ',current_file)
            selected_df.info()

    # Create the index slider
    index_slider = widgets.IntSlider(min=0, max=len(df_list)-1, value=0, description='DataFrame Index')

    # Output widget for displaying the .info() output
    output = widgets.Output()

    # Display the initial .info() output
    show_dataframe_info(index_slider.value)

    # Link the slider value to the show_dataframe_info function
    widgets.interactive_output(show_dataframe_info, {'index': index_slider})

    # Display the slider and output
    display(index_slider, output)

show_dataframe_info_slider(df_list, file_names)

IntSlider(value=0, description='DataFrame Index', max=14)

Output()
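
As a complement to the slider, the key numbers from `.info()` can be gathered into a single table. This is a minimal sketch under assumptions: `summarize_frames` is a hypothetical helper that is not part of the notebook, and the toy frames stand in for the challenge files.

```python
import pandas as pd
from typing import List


def summarize_frames(df_list: List[pd.DataFrame], file_names: List[str]) -> pd.DataFrame:
    """Return one row per DataFrame with its shape and missing-value count."""
    rows = []
    for name, df in zip(file_names, df_list):
        rows.append({
            "file": name,
            "rows": len(df),
            "columns": df.shape[1],
            # Total number of missing cells across all columns
            "missing_cells": int(df.isna().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("file")


# Toy frames with made-up values, used only to illustrate the helper
toy_frames = [
    pd.DataFrame({"id": [1, 2, 3], "age": [50.0, None, 70.0]}),
    pd.DataFrame({"id": [1, 2], "state": ["Utah", None]}),
]
summary = summarize_frames(toy_frames, ["vitals", "demographics"])
print(summary)
```

In the notebook this could be called as `summarize_frames(df_list, file_names)` to compare all files at a glance.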

# Widget for Quick Inspection of Descriptive Statistics
This widget gives you the descriptive statistics for any numerical variables present in each of the dataframes via the .describe() method from pandas.

In [54]:
def show_dataframe_describe_slider(df_list: List[pd.DataFrame], file_names: List[str]) -> None:
    """
    Display a slider to select a dataframe from the list and show its .describe() method.

    Args:
        df_list (List[pd.DataFrame]): List of pandas DataFrames.
        file_names (List[str]): A list of corresponding file names for the DataFrames.

    Returns:
        None
    """
    def show_dataframe_describe(index: int) -> None:
        """
        Display the .describe() method for the selected dataframe.

        Args:
            index (int): Index of the dataframe to display describe for.

        Returns:
            None
        """
        selected_df = df_list[index]
        current_file = file_names[index]
        with output:
            output.clear_output()
            print('Filename: ',current_file)
            display(selected_df.describe())

    # Create the index slider
    index_slider = widgets.IntSlider(min=0, max=len(df_list)-1, value=0, description='DataFrame Index')

    # Output widget for displaying the .describe() output
    output = widgets.Output()

    # Display the initial .describe() output
    show_dataframe_describe(index_slider.value)

    # Link the slider value to the show_dataframe_describe function
    widgets.interactive_output(show_dataframe_describe, {'index': index_slider})

    # Display the slider and output
    display(index_slider, output)

show_dataframe_describe_slider(df_list, file_names)

IntSlider(value=0, description='DataFrame Index', max=14)

Output()
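
For a dataframe with mixed column types, `.describe()` summarizes only the numerical columns by default. Passing `include="all"` adds count, unique, top, and freq rows for the remaining columns. The toy data below is made up for illustration.

```python
import pandas as pd

# Toy frame with one numerical and one categorical column (hypothetical values)
toy = pd.DataFrame({
    "age": [56.9, 84.1, 64.0],
    "state": ["Utah", "Utah", "Pennsylvania"],
})

# include="all" reports statistics for every column; entries that do not
# apply to a column's type (e.g. the mean of a text column) are left as NaN
full_summary = toy.describe(include="all")
print(full_summary)
```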

# Widget for Inspection of Categorical Variable in Dataframe
This widget shows the unique values and their counts for a selected categorical (non-numerical) column in each dataframe. Use the slider to change the dataframe, and then use the dropdown menu to select the column you want information for.

In [55]:

def show_dataframe_column_info(df_list: List[pd.DataFrame], file_names: List[str], truncate_list: bool = True) -> None:
    """
    Display a slider to select a dataframe from the list, a dropdown menu to select columns,
    and show the unique() and value_counts() for the selected non-numerical column.

    Args:
        df_list (List[pd.DataFrame]): List of pandas DataFrames.
        file_names (List[str]): A list of corresponding file names for the DataFrames.
        truncate_list (bool): Flag to indicate whether to truncate the list of unique values. Default is True.

    Returns:
        None
    """
    def show_column_info(index: int, column: str) -> None:
        """
        Display the unique() and value_counts() for the selected non-numerical column of the selected dataframe.

        Args:
            index (int): Index of the dataframe.
            column (str): Name of the selected column.

        Returns:
            None
        """
        selected_df = df_list[index]
        current_file = file_names[index]
        with output:
            output.clear_output()
            print('Filename: ',current_file)
            non_numerical_columns = selected_df.select_dtypes(exclude='number').columns
            if column in non_numerical_columns:
                column_values = selected_df[column]
                unique_values = column_values.unique()
                value_counts = column_values.value_counts()

                # Truncate the list of unique values if truncate_list is True
                if truncate_list:
                    unique_values = unique_values[:10]

                print("Unique Values:")
                print(unique_values)
                print("\nValue Counts:")
                print(value_counts)
            else:
                print("Selected column is numerical; choose a non-numerical column.")

    # Create the index slider
    index_slider = widgets.IntSlider(min=0, max=len(df_list)-1, value=0, description='DataFrame Index')

    # Create the column dropdown menu
    columns_dropdown = widgets.Dropdown(options=df_list[0].columns, description='Columns')

    # Output widget for displaying the column info
    output = widgets.Output()

    # Function to update the column options based on the selected dataframe
    def update_columns_options(change):
        columns_dropdown.options = df_list[index_slider.value].columns

    # Update the column options when the index slider changes
    index_slider.observe(update_columns_options, 'value')

    # Display the initial column info
    show_column_info(index_slider.value, columns_dropdown.value)

    # Link the slider and dropdown value to the show_column_info function
    widgets.interactive_output(show_column_info, {'index': index_slider, 'column': columns_dropdown})

    # Set the display.max_rows option based on the truncate_list flag
    if truncate_list:
        pd.set_option("display.max_rows", 10)
    else:
        pd.set_option("display.max_rows", None)

    # Display the slider, dropdown, and output
    display(index_slider, columns_dropdown, output)
show_dataframe_column_info(df_list, file_names,truncate_list = False)

IntSlider(value=0, description='DataFrame Index', max=14)

Dropdown(description='Columns', options=('Unnamed: 0', 'Internalpatientid', 'Ethnicity', 'Gender', 'Races', 'V…

Output()
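
One detail worth knowing when reading the widget output: `unique()` includes missing values, while `value_counts()` drops them by default. The sketch below, on a made-up `Gender` column, shows how `dropna=False` keeps them and how to add relative shares.

```python
import pandas as pd

# Toy categorical column with one missing entry (hypothetical values)
gender = pd.Series(["Male", "Female", "Male", None, "Male"], name="Gender")

# dropna=False keeps the missing entry as its own category
table = gender.value_counts(dropna=False).to_frame("count")
# Share of all rows, including the missing one
table["share"] = table["count"] / len(gender)
print(table)
```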

# Generating EDA Reports
I'm using the dataprep library to conduct the initial EDA and get a basic idea of the data we have available. For each of the provided .csv files, the routine below generates an EDA report, saves it as an .html file, and then downloads it to your local computer for easy viewing.

In [32]:
def df_eda_report(savepath: str, df_list: List[pd.DataFrame], file_names: List[str]):
    """
    Generate EDA reports for each DataFrame in df_list and save them as HTML files.

    Args:
        savepath (str): The directory path to save the generated HTML files.
        df_list (List[pd.DataFrame]): A list of DataFrames to generate reports for.
        file_names (List[str]): A list of desired names for the HTML files.

    Raises:
        FileNotFoundError: If the specified savepath directory does not exist.

    """
    if not os.path.exists(savepath):
        raise FileNotFoundError(f"The directory '{savepath}' does not exist.")

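    # Pair each DataFrame with its file name instead of tracking a manual counter
    for df, name in zip(df_list, file_names):
        report = create_report(df, title='My Report')
        # save() with no arguments produces 'report.html', which is then moved and renamed
        report.save()
        os.rename('report.html', os.path.join(savepath, name + '.html'))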

    html_files = glob.glob(os.path.join(savepath, '*.html'))
    for file in html_files:
        files.download(file)

#df_eda_report(savepath, df_list, file_names)

In [None]:
df_list[4]['Systolic bp'].plot.hist()

# Dataframe Transformations for Tidying
None of the provided datasets are in tidy format (i.e., observations for the same patient are spread across multiple rows). We'll need to change this in order to train our models.

I started working on this and it still needs work; however, it will require a combination of pivoting and flattening transformations along the lines shown below.

The resulting dataframe is now tidy, however, we also now have hundreds of new features. We'll need to do some feature engineering prior to carrying out this transformation effectively.

In [57]:
pd.set_option("display.max_rows", 10)
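# Use pivot_table to group observations by Internalpatientid
pivot_df = pd.pivot_table(df_list[1].drop('Unnamed: 0', axis=1), index='Internalpatientid', columns=['Specialty'], aggfunc='first')
# to_records() turns each (value, Specialty) column tuple into a string header
flattened = pd.DataFrame(pivot_df.to_records())
# First pass: shorten the 'Age at specialty' headers and strip all parentheses
flattened.columns = [hdr.replace("('Age at specialty', ", "Age").replace(")", "").replace("(", "") \
                     for hdr in flattened.columns]
# Second pass: meant to shorten the 'State' headers, but the parentheses were already stripped
# above, so this pattern no longer matches and those headers keep their tuple-like form
flattened.columns = [hdr.replace("('State', ", "State").replace(")", "").replace("(", "") \
                     for hdr in flattened.columns]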
flattened

Unnamed: 0,Internalpatientid,Age'ACUTE PSYCHIATRY <45 DAYS',Age'ALCOHOL DEPENDENCE TRMT UNIT',Age'ALLERGY',Age'ANESTHESIOLOGY',Age'BLIND REHAB',Age'BLIND REHAB OBSERVATION',Age'CARDIAC INTENSIVE CARE UNIT',Age'CARDIAC SURGERY',Age'CARDIAC-STEP DOWN UNIT',...,"'State', 'SURGICAL ICU'","'State', 'SURGICAL OBSERVATION'","'State', 'SURGICAL STEPDOWN'","'State', 'TELEMETRY'","'State', 'THORACIC SURGERY'","'State', 'TRANSPLANTATION'","'State', 'UROLOGY'","'State', 'VASCULAR'","'State', 'ZZALCOHOL DEPENDENCE TRMT UNIT'","'State', 'ZZSUBSTANCE ABUSE INTERMEDCARE'"
0,67,,,56.997988,,,,,,,...,,,,,,,,,,
1,200,,,84.145147,,,,,,,...,,,,,,,Utah,,,
2,291,,,,,,,,,,...,,,,,,,,,,
3,330,,,,,,,,,,...,,,,,,,,,,
4,351,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
627,166881,,,,,,,,,,...,,,,,,,,,,
628,167102,,,,,,,,,64.047944,...,,,,,,,,,,
629,167404,,,,,,,,,,...,,,,,,,,,,
630,167917,,,,,,,,,,...,,,,Pennsylvania,,,,,,


In [60]:
flattened.columns

Index(['Internalpatientid', 'Age'ACUTE PSYCHIATRY <45 DAYS'',
       'Age'ALCOHOL DEPENDENCE TRMT UNIT'', 'Age'ALLERGY'',
       'Age'ANESTHESIOLOGY'', 'Age'BLIND REHAB'',
       'Age'BLIND REHAB OBSERVATION'', 'Age'CARDIAC INTENSIVE CARE UNIT'',
       'Age'CARDIAC SURGERY'', 'Age'CARDIAC-STEP DOWN UNIT'',
       ...
       ''State', 'SURGICAL ICU'', ''State', 'SURGICAL OBSERVATION'',
       ''State', 'SURGICAL STEPDOWN'', ''State', 'TELEMETRY'',
       ''State', 'THORACIC SURGERY'', ''State', 'TRANSPLANTATION'',
       ''State', 'UROLOGY'', ''State', 'VASCULAR'',
       ''State', 'ZZALCOHOL DEPENDENCE TRMT UNIT'',
       ''State', 'ZZSUBSTANCE ABUSE INTERMEDCARE''],
      dtype='object', length=469)
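
The `'State'` headers in the output above still carry their tuple-like form. One way to get uniform names is to join the two levels of the pivoted column index directly instead of editing strings. The sketch below assumes the column names seen in the output above; the values and the `value_specialty` naming scheme are made up for illustration.

```python
import pandas as pd

# Toy long-format data: one row per patient visit (hypothetical values)
visits = pd.DataFrame({
    "Internalpatientid": [67, 67, 200, 200],
    "Specialty": ["ALLERGY", "UROLOGY", "ALLERGY", "UROLOGY"],
    "Age at specialty": [56.9, 57.2, 84.1, 84.5],
    "State": ["Ohio", "Ohio", "Utah", "Utah"],
})

# One row per patient, one column per (value, Specialty) pair
wide = pd.pivot_table(visits, index="Internalpatientid",
                      columns="Specialty", aggfunc="first")

# Each column label is a (value, specialty) tuple; join it into one string
wide.columns = [f"{value}_{specialty}" for value, specialty in wide.columns]

# Turn the patient id back into a regular column
wide = wide.reset_index()
print(wide.columns.tolist())
```

Because every header is built from the tuple itself, no header can be missed by a string pattern.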

In [None]:
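# Use pivot_table to group observations by Internalpatientid
pivot_df = pd.pivot_table(df_list[1].drop('Unnamed: 0', axis=1), index='Internalpatientid', columns=['Specialty'], aggfunc='first')
# Drop the top column level so only the Specialty labels remain
# (note: the age and state columns for a given specialty then share the same name)
pivot_df.columns = pivot_df.columns.droplevel(0)
pivot_df.columns.name = None
# reset_index() returns a new DataFrame, so the result must be assigned back
pivot_df = pivot_df.reset_index()
pivot_df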

In [None]:
pivot_df.columns
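
Since pivoting produces hundreds of sparse columns, one possible direction for the feature engineering mentioned above is to aggregate each patient's rows into a few summary features instead. This is only a sketch of the idea: the feature names are hypothetical and the toy values are made up.

```python
import pandas as pd

# Toy long-format data: several visits per patient (hypothetical values)
visits = pd.DataFrame({
    "Internalpatientid": [67, 67, 200, 200, 200],
    "Specialty": ["ALLERGY", "UROLOGY", "ALLERGY", "ALLERGY", "TELEMETRY"],
    "Age at specialty": [56.9, 57.2, 84.1, 84.3, 84.5],
})

# Named aggregation: one row per patient, one column per summary feature
features = visits.groupby("Internalpatientid").agg(
    n_visits=("Specialty", "size"),
    n_specialties=("Specialty", "nunique"),
    first_age=("Age at specialty", "min"),
    last_age=("Age at specialty", "max"),
).reset_index()
print(features)
```

The result has one row per patient and a fixed, small number of columns regardless of how many specialties appear in the data.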

# GitHub Repository for Project
I made a GitHub repository to maintain the code. You can find that repository [here](https://github.com/victormurcia/VCHAMPS).
## Git Basics
1. To download files from the repository, open a terminal and clone the repository onto your local machine via
```
git clone https://github.com/victormurcia/VCHAMPS.git
```

2. Navigate to the repository on your local machine via
```
cd VCHAMPS
```

3. Now you can make whatever changes you want in your local system to those files.

For example, I mistakenly uploaded the Quality Check EDA Reports into the wrong folder. To correct that I wrote a quick PowerShell script to move all files in the initial folder to a desired folder as shown below.
```
Get-ChildItem 'C:\Users\vmurc\Desktop\VCHAMPS\Quality Check\' -File | ForEach-Object {
  Move-Item -Path $_.FullName -Destination 'C:\Users\vmurc\Desktop\VCHAMPS\Quality Check\EDA\'
}
```

4. After you've made changes to the file(s) in the repository, stage them via

```
git add .
```

5. Next, you'll want to commit the changes via
```
git commit -m "Moved files to new location"
```

6. Finally, push the changes to the remote repository:

```
git push origin main
```
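
For readers not on Windows, the file move from step 3 can also be done in Python. The sketch below is a rough cross-platform equivalent of the PowerShell script, not the script actually used: `move_files` is a hypothetical helper, and the demo runs in a temporary directory with made-up file names.

```python
import shutil
import tempfile
from pathlib import Path


def move_files(src: Path, dst: Path) -> int:
    """Move every regular file directly inside src into dst; return the count."""
    dst.mkdir(parents=True, exist_ok=True)
    moved = 0
    # Materialize the listing first so moving files does not disturb iteration
    for item in list(src.iterdir()):
        if item.is_file():  # skip sub-folders, including dst itself
            shutil.move(str(item), str(dst / item.name))
            moved += 1
    return moved


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "Quality Check"
    src.mkdir()
    for name in ("a.html", "b.html"):
        (src / name).write_text("<html></html>")
    n_moved = move_files(src, src / "EDA")
    remaining = sorted(p.name for p in (src / "EDA").iterdir())
    print(n_moved, remaining)
```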