<a href="https://colab.research.google.com/github/victorialovefranklin/MSBD566/blob/main/Lab_2_(Google_Colab).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab (Colab): Secure Retrieval & Analysis of Tennessee SARS-CoV-2 Data

---

## Step 1: Mount Google Drive

This step connects Colab to your Google Drive so you can access your secure folder. Follow the prompt to authorize access.



In [None]:
from google.colab import drive
drive.mount('/content/drive')  # Follow the prompt to authorize access

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
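The message above appears when the cell is re-run in a session where Drive is already mounted. A minimal sketch of a guard that skips the second mount is shown below. `drive_is_mounted` and `mount_drive_once` are hypothetical helper names, not part of the lab, and the check assumes Drive content appears under `MyDrive` inside the mount point, as it does in the paths used later in this notebook.

```python
import os

def drive_is_mounted(mount_point="/content/drive"):
    """Return True if a MyDrive folder is visible under mount_point."""
    return os.path.isdir(os.path.join(mount_point, "MyDrive"))

def mount_drive_once(mount_point="/content/drive"):
    """Mount Google Drive only if it is not already visible (Colab only)."""
    if drive_is_mounted(mount_point):
        return False  # nothing to do
    # Imported lazily because google.colab exists only in a Colab runtime.
    from google.colab import drive
    drive.mount(mount_point)
    return True
```

Calling `mount_drive_once()` in place of `drive.mount(...)` should make the cell safe to re-run without the warning.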


## Step 2: Set Paths to Your Secure Folder & List Files

Update **BASE_DIR** to match the location of your lab folder in Google Drive.
The cell below lists every CSV directly inside that folder so you can confirm the raw files are present.

You should see the three CSVs listed:

- CDC_NWSS_TN_Wastewater.csv
- WastewaterSCAN_TN_SARS2.csv
- CDC_COVIDNET_ClinicalCases_2024_2025.csv

In [None]:
import os, glob

# Path to your lab folder inside Google Drive
BASE_DIR = "/content/drive/MyDrive/Secure_Biosurveillance_Data"  # <- edit if needed

print("Base:", BASE_DIR)

# Find all CSVs directly inside the folder
csvs = glob.glob(os.path.join(BASE_DIR, "*.csv"))
print("Found CSV files:")
for p in csvs:
    print(" -", os.path.basename(p))


Base: /content/drive/MyDrive/Secure_Biosurveillance_Data
Found CSV files:
 - WastewaterSCAN_TN_SARS2.csv.csv
 - CDC_NWSS_TN_Wastewater.csv
 - CDC_COVIDNET_ClinicalCases_2024_2025.csv
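The listing above differs from the expected list in one place: the WastewaterSCAN file carries a doubled `.csv.csv` extension. A small sketch of an automated comparison follows, assuming the three expected names given earlier in this step. `compare_to_expected` is a hypothetical helper, and the paths passed to it are copied from the output above.

```python
import os

EXPECTED = [
    "CDC_NWSS_TN_Wastewater.csv",
    "WastewaterSCAN_TN_SARS2.csv",
    "CDC_COVIDNET_ClinicalCases_2024_2025.csv",
]

def compare_to_expected(found_paths, expected=EXPECTED):
    """Return (missing expected names, found names ending in '.csv.csv')."""
    found = {os.path.basename(p) for p in found_paths}
    missing = [name for name in expected if name not in found]
    doubled = sorted(n for n in found if n.lower().endswith(".csv.csv"))
    return missing, doubled

base = "/content/drive/MyDrive/Secure_Biosurveillance_Data"
missing, doubled = compare_to_expected([
    os.path.join(base, "WastewaterSCAN_TN_SARS2.csv.csv"),
    os.path.join(base, "CDC_NWSS_TN_Wastewater.csv"),
    os.path.join(base, "CDC_COVIDNET_ClinicalCases_2024_2025.csv"),
])
print("Missing:", missing)
print("Doubled extension:", doubled)
```

In the notebook you would pass the `csvs` list from the cell above instead of the hard-coded paths. Either rename the file in Drive or keep the doubled extension in every path that refers to it.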


## Step 3: Create a Working Area (Don’t Modify Raw Files)

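We’ll treat the raw CSVs in the Drive folder as read-only and write any processed outputs to a separate working folder. Note that /content/working lives on the Colab VM, not in Drive, so its contents are lost when the runtime is reset.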

In [None]:
import os

WORK_DIR = "/content/working"
os.makedirs(WORK_DIR, exist_ok=True)
print("Working directory:", WORK_DIR)



Working directory: /content/working
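One way to make sure the raw files are never touched is to copy them into the working directory and operate only on the copies. The sketch below is one possible approach rather than a required lab step. `copy_csvs` is a hypothetical helper name.

```python
import glob
import os
import shutil

def copy_csvs(src_dir, dst_dir):
    """Copy every *.csv directly inside src_dir into dst_dir; return the new paths."""
    os.makedirs(dst_dir, exist_ok=True)
    copied = []
    for src in sorted(glob.glob(os.path.join(src_dir, "*.csv"))):
        dst = os.path.join(dst_dir, os.path.basename(src))
        shutil.copy2(src, dst)  # copy2 also preserves file timestamps
        copied.append(dst)
    return copied
```

In this notebook the call would be `copy_csvs(BASE_DIR, WORK_DIR)`, after which later cells could read from `WORK_DIR` instead of Drive.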


## Step 4: Load the Datasets with Pandas

Load the three CSVs into DataFrames, then check their lengths to confirm that each file loaded.
Expected output: three numbers, one row count per dataset.

In [None]:
import pandas as pd
import os

# File paths (directly inside Secure_Biosurveillance_Data).
# Pass only the file name: an absolute second argument makes os.path.join discard BASE_DIR.
path_nwss     = os.path.join(BASE_DIR, "CDC_NWSS_TN_Wastewater.csv")
path_wwscan   = os.path.join(BASE_DIR, "WastewaterSCAN_TN_SARS2.csv.csv")  # doubled extension, as stored in Drive
path_covidnet = os.path.join(BASE_DIR, "CDC_COVIDNET_ClinicalCases_2024_2025.csv")

# Load CSVs
df_nwss     = pd.read_csv(path_nwss)
df_wwscan   = pd.read_csv(path_wwscan)
df_covidnet = pd.read_csv(path_covidnet)

# Check number of rows in each dataset
len(df_nwss), len(df_wwscan), len(df_covidnet)


(14027, 192, 185194)
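If a file name is wrong, `pd.read_csv` raises an error for the first bad path only. The sketch below shows an alternative that loads from a label-to-file-name mapping and reports which dataset is missing. `load_csvs` and the labels are hypothetical names chosen for illustration.

```python
import os
import pandas as pd

def load_csvs(base_dir, files):
    """Load each file in `files` (label -> file name inside base_dir) into a DataFrame."""
    frames = {}
    for label, name in files.items():
        path = os.path.join(base_dir, name)
        if not os.path.isfile(path):
            raise FileNotFoundError(f"{label}: expected file not found at {path}")
        frames[label] = pd.read_csv(path)
    return frames

def row_counts(frames):
    """Return {label: number of rows} for a dict of DataFrames."""
    return {label: len(df) for label, df in frames.items()}
```

With the files from this lab, `row_counts(load_csvs(BASE_DIR, {"nwss": "CDC_NWSS_TN_Wastewater.csv", ...}))` would give the same counts as the tuple above, keyed by label.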

## Step 5: Quick Data Health Check

Before moving to analysis, always inspect your datasets to ensure they loaded correctly.  
We’ll check:
- Shape (rows × columns)  
- Column names  
- First few rows  
- Missing value counts

In [None]:
def quick_check(df, name):
    print(f"\n=== {name} ===")
    print("Shape:", df.shape)
    print("Columns:", list(df.columns))
    print("\nFirst rows:")
    print(df.head(3))
    print("\nMissing values (top 10):")
    print(df.isna().sum().sort_values(ascending=False).head(10))

quick_check(df_nwss, "CDC NWSS (Wastewater)")
quick_check(df_wwscan, "WastewaterSCAN (SARS-CoV-2)")
quick_check(df_covidnet, "CDC COVID-NET (Clinical Cases)")


=== CDC NWSS (Wastewater) ===
Shape: (14027, 9)
Columns: ['State/Territory', 'Week_Ending_Date', 'Data_Collection_Period', 'State/Territory_WVAL', 'National_WVAL', 'Regional_WVAL', 'WVAL_Category', 'Coverage', 'date_updated']

First rows:
  State/Territory Week_Ending_Date Data_Collection_Period  \
0    South Dakota        1/20/2024            All Results   
1       Tennessee         1/1/2022            All Results   
2    South Dakota        9/14/2024                 1 Year   

   State/Territory_WVAL  National_WVAL  Regional_WVAL WVAL_Category  \
0             16.655816       7.185262       7.864493     Very High   
1              3.368486      20.012608      24.156987           Low   
2              8.444015       7.952651       8.444015     Very High   

           Coverage              date_updated  
0  Limited Coverage  2025-09-18T07:27:17.693Z  
1  Limited Coverage  2025-09-18T07:27:17.693Z  
2  Limited Coverage  2025-09-18T07:27:17.693Z  

Missing values (top 10):
Coverage    