## Build a Data Quality Dashboard

**Description**: Create a simple dashboard that displays data quality metrics using a library such as `dash` or `streamlit`. The steps below use Streamlit.

**Steps:**
1. Install Streamlit: `pip install streamlit`
2. Create a Python script named `dashboard.py`.
3. Run the dashboard: `streamlit run dashboard.py`

In [None]:
# Write your code from here

In [1]:
%pip install streamlit


Defaulting to user installation because normal site-packages is not writeable
Collecting streamlit
  Downloading streamlit-1.45.1-py3-none-any.whl (9.9 MB)
Collecting pyarrow>=7.0
  Downloading pyarrow-20.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (42.3 MB)
Collecting tenacity<10,>=8.1.0
  Downloading tenacity-9.1.2-py3-none-any.whl (28 kB)
Collecting gitpython!=3.1.19,<4,>=3.0.7
  Downloading GitPython-3.1.44-py3-none-any.whl (207 kB)
Collecting cachetools<6,>=4.0
  Downloading cachetools-5.5.2-py3-none-any.whl (10 kB)
Collecting packaging<25,>=20
  Downloading packaging-24.2-py3-none-any.w

In [2]:
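# Save this cell's contents as dashboard.py and launch it with:
#     streamlit run dashboard.py
# Streamlit apps do not render inside a notebook cell.
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt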

st.title("Data Quality Dashboard")

# Step 1: File uploader for CSV
uploaded_file = st.file_uploader("Upload your CSV file", type=["csv"])

if uploaded_file is not None:
    # Step 2: Read CSV
    df = pd.read_csv(uploaded_file)
    
    st.subheader("Raw Data")
    st.dataframe(df)

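    # Step 3: Calculate missing values and DQI
    # DQI per column is the fraction of non-missing values (1 = complete, 0 = all missing)
    total_rows = len(df)
    if total_rows == 0:
        # Avoid dividing by zero when the file contains only a header row
        st.warning("The uploaded file has no data rows.")
        st.stop()
    missing_counts = df.isna().sum()
    dqi_per_column = 1 - (missing_counts / total_rows)
    overall_dqi = dqi_per_column.mean()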
    
    st.subheader("Data Quality Metrics")
    st.write(f"**Overall Data Quality Index (DQI):** {overall_dqi:.4f}")
    
    # Show missing counts and DQI per column in a table
    metrics_df = pd.DataFrame({
        "Missing Values": missing_counts,
        "DQI": dqi_per_column
    })
    st.dataframe(metrics_df)
    
    # Step 4: Visualize with bar plot + line plot for missing values
    fig, ax1 = plt.subplots(figsize=(10, 5))
    
    dqi_per_column.plot(kind='bar', color='skyblue', ax=ax1)
    ax1.set_ylabel('DQI (1 - Missing Ratio)')
    ax1.set_ylim(0, 1.05)
    ax1.set_title('Data Quality Index (DQI) and Missing Values per Column')
    
    ax2 = ax1.twinx()
    missing_counts.plot(kind='line', color='red', marker='o', linewidth=2, ax=ax2)
    ax2.set_ylabel('Number of Missing Values')
    ax2.set_ylim(0, missing_counts.max() + 5)
    
    st.pyplot(fig)
else:
    st.info("Please upload a CSV file to get started.")


2025-05-22 04:44:26.653 
  command:

    streamlit run /home/vscode/.local/lib/python3.10/site-packages/ipykernel_launcher.py [ARGUMENTS]
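
The DQI calculation used in the dashboard script can be checked directly in the notebook, without launching Streamlit. The following is a minimal sketch under stated assumptions: `compute_dqi` is a hypothetical helper that is not part of the original script, the sample data is made up for illustration, and returning NaN for a table with no rows is one possible choice rather than an established convention.

```python
import numpy as np
import pandas as pd


def compute_dqi(df: pd.DataFrame):
    """Return (per-column metrics, overall DQI).

    Hypothetical helper mirroring the dashboard formula:
    DQI = 1 - (missing values / total rows), averaged over columns.
    """
    total_rows = len(df)
    missing_counts = df.isna().sum()
    if total_rows == 0:
        # Assumption: with no rows there is nothing to measure, so report NaN
        dqi_per_column = pd.Series(np.nan, index=df.columns, dtype=float)
    else:
        dqi_per_column = 1 - missing_counts / total_rows
    metrics = pd.DataFrame({
        "Missing Values": missing_counts,
        "DQI": dqi_per_column,
    })
    return metrics, float(dqi_per_column.mean())


# Made-up sample data with a few gaps
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [34, None, 29, None],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

metrics, overall = compute_dqi(sample)
print(metrics)
print(f"Overall DQI: {overall:.4f}")  # prints "Overall DQI: 0.7500"
```

Here `id` has no gaps (DQI 1.0), `age` is missing 2 of 4 values (0.5), and `city` is missing 1 of 4 (0.75), so the mean is 0.75. Keeping the metric in a plain function like this also makes it easy to unit test separately from the Streamlit layout code.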
