# Week 7 Independent Lab: Hospital Data Manipulation  
**Author:** Thomas J. Greenberg  
**Course:** BGEN632 – Graduate Introduction to Python  
**Term:** Spring 2025  
**Date:** April 15, 2025  


---  

## Overview and Methods

This notebook walks through pre-processing steps for a California hospital dataset and its companion personnel file. It covers the following tasks:  

- Import and inspect structured `.csv` and `.txt` files  
- Clean and transform datasets with `pandas`  
- Simulate data expansion using row/column insertion  
- Perform filtering, sorting, sampling, and merging operations  
- Apply basic model preparation techniques using `KFold`

### Tools & Libraries  
- Python (Jupyter Notebook)  
- pandas, numpy  
- scikit-learn (`KFold`)  



In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

In [2]:
os.chdir("C:/MySystem/School/Python/GitHubStuff/week7labs/data")
print("Current working directory:", os.getcwd())

Current working directory: C:\MySystem\School\Python\GitHubStuff\week7labs\data


## Load Hospital Datasets  
Use `pd.read_csv()` to import both the hospital and personnel data.  
Confirm proper delimiters: `.csv` uses commas, `.txt` uses tabs (`sep="\t"`).

In [3]:
hospitals = pd.read_csv("CaliforniaHospitalData.csv")
personnel = pd.read_csv("CaliforniaHospitalData_Personnel.txt", sep="\t")
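
As a minimal sketch of why the `sep` argument matters, the example below parses two small in-memory strings rather than the lab's files. The sample rows and the names `csv_text`, `tsv_text`, and `wrong_df` are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for a comma-delimited and a tab-delimited file
csv_text = "HospitalID,Name\n1,Alpha General\n2,Beta Clinic\n"
tsv_text = "HospitalID\tLastName\n1\tSmith\n2\tJones\n"

csv_df = pd.read_csv(io.StringIO(csv_text))            # default separator is a comma
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # tab-delimited text needs sep="\t"

# Reading tab-delimited text with the default comma separator
# leaves each line unsplit, so the frame has a single column
wrong_df = pd.read_csv(io.StringIO(tsv_text))

print(csv_df.shape, tsv_df.shape, wrong_df.shape)
```

A single-column result after loading is usually a quick sign that the delimiter was wrong.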

## Inspect Dataset Structure  
Check dimensions, column names, and data types using `.shape`, `.columns`, and `.info()`.  
Verify both datasets (`hospitals` and `personnel`) are properly loaded.

In [4]:
print(hospitals.shape)
print(hospitals.columns)
hospitals.info()
personnel.info()

(61, 14)
Index(['HospitalID', 'Name', 'Zip', 'Website', 'TypeControl', 'Teaching',
       'DonorType', 'NoFTE', 'NetPatRev', 'InOperExp', 'OutOperExp', 'OperRev',
       'OperInc', 'AvlBeds'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61 entries, 0 to 60
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   HospitalID   61 non-null     int64  
 1   Name         61 non-null     object 
 2   Zip          61 non-null     object 
 3   Website      61 non-null     object 
 4   TypeControl  61 non-null     object 
 5   Teaching     61 non-null     object 
 6   DonorType    61 non-null     object 
 7   NoFTE        61 non-null     float64
 8   NetPatRev    61 non-null     float64
 9   InOperExp    61 non-null     float64
 10  OutOperExp   61 non-null     float64
 11  OperRev      61 non-null     int64  
 12  OperInc      61 non-null     int64  
 13  AvlBeds      61 non-null     int64  
dtypes: fl

## Indexing and Slicing with `.iloc[]`  
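Use `.iloc[]` to retrieve rows and columns by position.  
The cell previews the first 5 rows, then selects rows 0, 2, and 4 with the columns at positions 1 and 3. Only the second result is displayed, because a notebook cell shows just its last expression.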

In [5]:
hospitals.iloc[:5]  
hospitals.iloc[[0, 2, 4], [1, 3]]

| | Name | Website |
|---|---|---|
| 0 | Mammoth Hospital | www.mammothhospital.com |
| 2 | Pioneers Memorial Hospital | www.pmhd.org |
| 4 | Barstow Community Hospital | www.barstowhospital.com |
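
The positional selection can be tried on a small frame. This is a sketch on made-up data (the `toy` frame and its values are hypothetical), using `print()` so that both results appear.

```python
import pandas as pd

# Hypothetical toy frame; the names and values are made up
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Zip": ["90001", "90002", "90003", "90004", "90005"],
    "Beds": [10, 20, 30, 40, 50],
})

first_two = toy.iloc[:2]               # rows at positions 0 and 1 (end is exclusive)
picked = toy.iloc[[0, 2, 4], [0, 2]]   # rows 0, 2, 4 and the columns at positions 0 and 2

print(first_two)
print(picked)
```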


## Indexing and Slicing with `.loc[]`  
Use `.loc[]` to retrieve rows and columns by label.  
This is helpful for selecting ranges by row label or picking columns such as `'Name'` or `'Zip'`. Unlike `.iloc[]`, a `.loc[]` slice includes its end label, so `0:3` returns four rows.

In [6]:
hospitals.loc[:, 'Name']  
hospitals.loc[0:3, ['Name', 'Zip']] 

| | Name | Zip |
|---|---|---|
| 0 | Mammoth Hospital | 93546-0660 |
| 1 | Victor Valley Community Hospital | 92392 |
| 2 | Pioneers Memorial Hospital | 92227 |
| 3 | Ridgecrest Regional Hospital | 93555 |
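
The output above has four rows for the slice `0:3`. The sketch below contrasts label-based and position-based slicing on a hypothetical `toy` frame to show where the difference comes from.

```python
import pandas as pd

# Hypothetical toy frame with the default integer index 0..4
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Zip": ["90001", "90002", "90003", "90004", "90005"],
})

by_label = toy.loc[0:3, ["Name", "Zip"]]   # labels 0 through 3, end label included
by_position = toy.iloc[0:3, [0, 1]]        # positions 0 through 2, end position excluded

print(len(by_label), len(by_position))
```

With the default `RangeIndex` the labels and positions coincide, which makes the off-by-one between the two slices easy to miss.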


## Unique Values and Metadata  
Use `.dtypes` to review variable types and `len()` on the index and columns to check dimensions (the same numbers `.shape` reports).  
**Note:** The `"County Name"` column is not present in the provided dataset.

In [10]:
len(hospitals.index), len(hospitals.columns)
hospitals.dtypes

HospitalID       int64
Name            object
Zip             object
Website         object
TypeControl     object
Teaching        object
DonorType       object
NoFTE          float64
NetPatRev      float64
InOperExp      float64
OutOperExp     float64
OperRev          int64
OperInc          int64
AvlBeds          int64
dtype: object
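
The cell above covers the metadata half of this section. For the unique-values half, one possible approach is sketched below on a hypothetical `toy` frame; the values (including the `"Investor"` label) are invented, and `nunique()` and `unique()` are standard pandas methods rather than calls taken from the original notebook.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({
    "TypeControl": ["District", "Non Profit", "Non Profit", "Investor"],
    "AvlBeds": [15, 103, 66, 66],
})

n_rows, n_cols = toy.shape                     # same numbers as len(index), len(columns)
unique_counts = toy.nunique()                  # distinct values per column
type_values = list(toy["TypeControl"].unique())  # distinct labels in order of appearance

print(n_rows, n_cols)
print(unique_counts)
print(type_values)
```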

## Check for Missing Values  
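Use `.isnull().sum()` and `.notnull().sum()` to count missing and present entries per column. The output below is the non-null count: every column has 61 non-null values, matching the 61 rows, so nothing is missing.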

In [11]:
hospitals.isnull().sum()
hospitals.notnull().sum()

HospitalID     61
Name           61
Zip            61
Website        61
TypeControl    61
Teaching       61
DonorType      61
NoFTE          61
NetPatRev      61
InOperExp      61
OutOperExp     61
OperRev        61
OperInc        61
AvlBeds        61
dtype: int64
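
Since the hospital data has no gaps, the two counts are easier to compare on a frame that does. The sketch below uses a hypothetical `toy` frame with one missing value per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "Name": ["A", "B", None],
    "NoFTE": [100.0, np.nan, 250.0],
})

missing = toy.isnull().sum()    # count of missing entries per column
present = toy.notnull().sum()   # count of non-missing entries per column

print(missing)
print(present)
```

For every column the two counts add up to the number of rows.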

## Add New Rows (Simulation)  
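Create a new DataFrame with mock hospitals and combine it with the original using `pd.concat()`.  
This simulates expanding the dataset without modifying the original. Because the mock rows omit `HospitalID`, that column is filled with `NaN` for the new rows and upcast to `float64`; `OperRev` also becomes `float64` because the mock values are floats.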


In [12]:
new_rows = pd.DataFrame({
    'Name': ["Sacred Heart Hospital", "New Sacred Heart"],
    'Zip': ["90210", "90211"],
    'Website': ["www.sacredheart.org", "www.newsacredheart.org"],
    'TypeControl': ["Non-Profit", "Private"],
    'Teaching': ["Yes", "Yes"],
    'DonorType': ["Community", "Corporate"],
    'NoFTE': [275.0, 190.0],
    'NetPatRev': [950000.0, 750000.0],
    'InOperExp': [300000.0, 240000.0],
    'OutOperExp': [160000.0, 120000.0],
    'OperRev': [620000.0, 510000.0],
    'OperInc': [160000, 150000],
    'AvlBeds': [110, 95]
})

hospitals_mod_1 = pd.concat([hospitals, new_rows], ignore_index=True)
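
The effect of a missing column on `pd.concat()` can be seen in a smaller setting. This is a sketch with hypothetical frames `base` and `extra`; the IDs and values are invented.

```python
import pandas as pd

# Hypothetical frames: `extra` has no HospitalID column
base = pd.DataFrame({"HospitalID": [101, 102], "Name": ["A", "B"], "AvlBeds": [15, 66]})
extra = pd.DataFrame({"Name": ["C"], "AvlBeds": [110]})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating labels
combined = pd.concat([base, extra], ignore_index=True)

print(combined)
print(combined.dtypes)   # HospitalID is now float64 because it holds a NaN
print(len(base))         # the original frame is unchanged
```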

## Add New Column  
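Use `np.random.choice()` to generate a new column that randomly assigns each hospital to one of two attendings, simulating experimental group assignment. No seed is set, so the assignments change on each run.  
Combine it with the modified dataset using `pd.concat(..., axis=1)`.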

In [13]:
attendings = pd.DataFrame({
    'Assigned Attending': np.random.choice(["Dr. Cox", "Dr. Turk"], size=len(hospitals_mod_1))
})

hospitals_mod_2 = pd.concat([hospitals_mod_1, attendings], axis=1)
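
If repeatable assignments are wanted, one option is a seeded generator. The sketch below assumes NumPy's `default_rng`, which the original cell does not use; the seed value and the `toy` frame are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)   # arbitrary seed, chosen only for repeatability

# Hypothetical toy frame standing in for the expanded hospital data
toy = pd.DataFrame({"Name": ["A", "B", "C", "D"]})

groups = pd.DataFrame({
    "Assigned Attending": rng.choice(["Dr. Cox", "Dr. Turk"], size=len(toy))
})

# axis=1 places the frames side by side, aligning rows on the index
with_groups = pd.concat([toy, groups], axis=1)
print(with_groups)
```

Column-wise `concat` aligns on the index, so both frames need matching row labels; here both use the default 0..n-1 index.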

## Rename Columns  
Use `.rename()` with `inplace=True` to update column names directly. Here, we rename `"Website"` to `"HospitalURL"`. 

In [16]:
hospitals_mod_2.rename(columns={'Website': 'HospitalURL'}, inplace=True)
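
As an alternative to `inplace=True`, `.rename()` can return a new frame and leave the source untouched. A minimal sketch on a hypothetical one-row frame:

```python
import pandas as pd

# Hypothetical one-row frame; the URL is a placeholder
toy = pd.DataFrame({"Name": ["A"], "Website": ["www.example.org"]})

# Without inplace=True, rename returns a renamed copy
renamed = toy.rename(columns={"Website": "HospitalURL"})

print(list(toy.columns))
print(list(renamed.columns))
```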

## Sort Data  
Use `.sort_values()` to sort alphabetically and `.nlargest()` to find top numeric values.  
Here, we sort by hospital name and show hospitals with the highest operating income.

In [17]:
hospitals_mod_2.sort_values(by='Name').head()
hospitals_mod_2.nlargest(5, 'OperInc')

| | HospitalID | Name | Zip | HospitalURL | TypeControl | Teaching | DonorType | NoFTE | NetPatRev | InOperExp | OutOperExp | OperRev | OperInc | AvlBeds | Assigned Attending |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 60 | 38900.0 | Cedars-Sinai Medical Center | 90048 | www.csmc.edu | Non Profit | Teaching | Alumni | 8000.0 | 4662581.617 | 1285631000.0 | 461675838.2 | 1912179000.0 | 164872413 | 909 | Dr. Turk |
| 54 | 39102.0 | UCSD Medical Center - Hillcrest | 92103-8970 | www.health.ucsd.edu | Non Profit | Teaching | Alumni | 3892.0 | 2428730.281 | 462934100.0 | 332127550.7 | 941108800.0 | 146047105 | 527 | Dr. Turk |
| 59 | 33192.0 | California Pacific Medical Center - Pacific Ca... | 94115 | www.cpmc.org | Non Profit | Teaching | Alumni | 1565.1 | 2736281.415 | 633790300.0 | 296020659.9 | 1074257000.0 | 144445998 | 730 | Dr. Cox |
| 58 | 22460.0 | Stanford Hospital & Clinics | 94305 | stanfordhospital.org | Non Profit | Teaching | Alumni | 6392.0 | 4333934.423 | 928962100.0 | 662155885.7 | 1650392000.0 | 59273999 | 445 | Dr. Turk |
| 44 | 39076.0 | UC Irvine Medical Center | 92868 | www.healthcare.uci.edu | Non Profit | Teaching | Alumni | 3151.0 | 1476284.836 | 380582300.0 | 117664293.6 | 552107500.0 | 53860958 | 363 | Dr. Turk |
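
To see how `.nlargest()` relates to `.sort_values()`, the sketch below runs both on a hypothetical `toy` frame with invented names and incomes.

```python
import pandas as pd

# Hypothetical toy frame; values are made up
toy = pd.DataFrame({
    "Name": ["Delta", "Alpha", "Charlie", "Bravo"],
    "OperInc": [50, -10, 300, 120],
})

by_name = toy.sort_values(by="Name")       # alphabetical, ascending by default
top_two = toy.nlargest(2, "OperInc")       # the two rows with the highest OperInc

# Same rows via an explicit descending sort followed by head()
same_top_two = toy.sort_values(by="OperInc", ascending=False).head(2)

print(by_name)
print(top_two)
```

Both calls keep the original row labels, which is why the index in the output above is not 0 through 4.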


## Manual Sampling  
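Calculate 10% of the dataset size, round it to the nearest integer, and sample that many rows using `.sample()`. With 63 rows this gives a sample of 6.  
Setting `random_state=42` makes the draw reproducible; the sample size itself is computed manually rather than passed as a fraction.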

In [18]:
sample_size = int(np.round(len(hospitals_mod_2.index) * 0.1, 0))
sample = hospitals_mod_2.sample(n=sample_size, random_state=42)
sample.head()

| | HospitalID | Name | Zip | HospitalURL | TypeControl | Teaching | DonorType | NoFTE | NetPatRev | InOperExp | OutOperExp | OperRev | OperInc | AvlBeds | Assigned Attending |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | NaN | Sacred Heart Hospital | 90210 | www.sacredheart.org | Non-Profit | Yes | Community | 275.0 | 950000.0 | 300000.0 | 160000.0 | 620000.0 | 160000 | 110 | Dr. Cox |
| 57 | 31032.0 | Long Beach Memorial & Miller Children's Hospital | 90806 | www.memorialcare.org | Non Profit | Teaching | Alumni | 5218.0 | 1187022.0 | 291248100.0 | 109328600.0 | 439084836.0 | 38508125 | 462 | Dr. Cox |
| 0 | 45740.0 | Mammoth Hospital | 93546-0660 | www.mammothhospital.com | District | Small/Rural | Charity | 327.0 | 135520.2 | 20523430.0 | 34916220.0 | 49933713.0 | -5505933 | 15 | Dr. Cox |
| 43 | 19868.0 | Ojai Valley Community Hospital | 93023-3163 | www.cmhhospital.org | Non Profit | Small/Rural | Charity | 180.0 | 59504.62 | 11955300.0 | 10326800.0 | 22492281.0 | 210180 | 103 | Dr. Turk |
| 5 | 17741.0 | St. Elizabeth Community Hospital | 96080 | redbluff.mercy.org/index.htm | Non Profit | Small/Rural | Charity | 397.5 | 232503.0 | 36682890.0 | 36739260.0 | 85808509.0 | 12386360 | 66 | Dr. Turk |
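
The arithmetic and the role of `random_state` can be checked on a stand-in frame. This sketch uses a hypothetical 63-row `toy` frame, matching the length of the expanded dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in with 63 rows, the same length as the expanded dataset
toy = pd.DataFrame({"row": range(63)})

sample_size = int(np.round(len(toy) * 0.1))   # 6.3 rounds to 6

# The same random_state yields the same rows on every call
first = toy.sample(n=sample_size, random_state=42)
second = toy.sample(n=sample_size, random_state=42)

print(sample_size)
print(first.index.tolist() == second.index.tolist())
```

`.head()` shows at most five rows by default, so one of the six sampled rows is not visible in the output above.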


## Convert to Categorical  
Convert the `"TypeControl"` column to categorical format using `.astype('category')`.  
A categorical column stores each distinct label once, which reduces memory use and makes the set of valid values explicit for later filtering.

In [19]:
hospitals_mod_2['TypeControl'] = hospitals_mod_2['TypeControl'].astype('category')
hospitals_mod_2.dtypes

HospitalID             float64
Name                    object
Zip                     object
HospitalURL             object
TypeControl           category
Teaching                object
DonorType               object
NoFTE                  float64
NetPatRev              float64
InOperExp              float64
OutOperExp             float64
OperRev                float64
OperInc                  int64
AvlBeds                  int64
Assigned Attending      object
dtype: object
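
One thing the category list makes visible is inconsistent spelling: the mock rows use `"Non-Profit"` while the original data uses `"Non Profit"`, so they become separate categories. The sketch below shows this on a hypothetical `toy` column.

```python
import pandas as pd

# Hypothetical column mixing the two spellings that appear in this notebook
toy = pd.DataFrame({
    "TypeControl": ["Non Profit", "District", "Non Profit", "Non-Profit"]
})
toy["TypeControl"] = toy["TypeControl"].astype("category")

categories = list(toy["TypeControl"].cat.categories)   # distinct labels, sorted
non_profit = toy[toy["TypeControl"] == "Non Profit"]   # matches the exact label only

print(categories)
print(len(non_profit))
```

Checking `.cat.categories` before filtering is a quick way to catch labels that differ only by punctuation.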

## DateTime Conversion (Personnel Dataset)  
Convert the `"StartDate"` column in the `personnel` dataset to `datetime64[ns]` using `pd.to_datetime()`.  
With `errors='coerce'`, entries that cannot be parsed become `NaT` instead of raising an error.

In [20]:
personnel['StartDate'] = pd.to_datetime(personnel['StartDate'], errors='coerce')
personnel.dtypes

HospitalID                int64
Work_ID                   int64
LastName                 object
FirstName                object
Gender                   object
PositionID                int64
PositionTitle            object
Compensation              int64
MaxTerm                   int64
StartDate        datetime64[ns]
Phone                    object
Email                    object
dtype: object
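
The behavior of `errors='coerce'` is easiest to see with a deliberately bad value. The sketch below uses a hypothetical `raw` series; the strings are invented.

```python
import pandas as pd

# Hypothetical raw values: two valid dates and one unparseable entry
raw = pd.Series(["2009-01-01", "2011-01-01", "not a date"])

parsed = pd.to_datetime(raw, errors="coerce")   # bad entries become NaT

print(parsed)
print(parsed.isna().sum())   # number of entries that failed to parse
```

Counting `NaT` values after the conversion shows how many entries were silently coerced.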

## KFold Demonstration  
Use `KFold` from `sklearn.model_selection` to split the hospital dataset into two folds. Because `shuffle` is left at its default of `False`, each fold is a contiguous block of row positions.
The loop breaks after the first split, so only one train/test pair is printed.

In [21]:
kf = KFold(n_splits=2)

for train_index, test_index in kf.split(hospitals_mod_2):
    print("Train indices:", train_index)
    print("Test indices:", test_index)
    break  

Train indices: [32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
 56 57 58 59 60 61 62]
Test indices: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31]
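
To see every fold rather than only the first, the loop can run to completion. The sketch below also turns on shuffling, which the original cell does not do; the `shuffle=True` and `random_state=42` settings and the `rows` array are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(63)   # stand-in for the 63 rows of the expanded dataset

# shuffle=True mixes row positions before splitting; random_state fixes the shuffle
kf = KFold(n_splits=2, shuffle=True, random_state=42)

fold_sizes = []
for fold, (train_index, test_index) in enumerate(kf.split(rows)):
    fold_sizes.append((len(train_index), len(test_index)))
    print(f"Fold {fold}: {len(train_index)} train rows, {len(test_index)} test rows")
```

With 63 rows and two folds, one fold holds 32 rows and the other 31, and each row lands in a test set exactly once.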


## Final Preview of Key DataFrames  
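Preview the final form of my cleaned and expanded datasets. Only the `personnel.head()` result is displayed, because a notebook cell shows just its last expression.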

In [22]:
hospitals_mod_2.head()
sample.head()
personnel.head()

| | HospitalID | Work_ID | LastName | FirstName | Gender | PositionID | PositionTitle | Compensation | MaxTerm | StartDate | Phone | Email |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35665 | 351131 | Cherukuri | Dileep | M | 4 | Safety Inspection Member | 23987 | 2 | 2019-01-01 | 405-564-5580 | dileep.cherukuri@okstate.edu |
| 1 | 12145 | 756481 | Rodriguez | Jose | M | 1 | Regional Representative | 46978 | 4 | 2009-01-01 | 405-744-2238 | jose.rodriguez@edihealth.com |
| 2 | 45771 | 756481 | Rodriguez | Jose | M | 1 | Regional Representative | 46978 | 4 | 2011-01-01 | 405-744-2238 | jose.rodriguez@edihealth.com |
| 3 | 43353 | 756481 | Rodriguez | Jose | M | 4 | Safety Inspection Member | 23987 | 2 | 2011-01-01 | 405-744-2238 | jose.rodriguez@edihealth.com |
| 4 | 17718 | 811240 | Charles | Kenneth | M | 1 | Regional Representative | 46978 | 4 | 2009-01-01 | 405-744-3412 | kenneth.charles@edihealth.com |


## References  
---  

- pandas Documentation – https://pandas.pydata.org/  
- NumPy Documentation – https://numpy.org/doc/  
- scikit-learn Documentation – https://scikit-learn.org/stable/  
- BGEN632 Introduction to Python course materials – Prof. Olivia B. Newton, Ph.D.  
- ChatGPT (OpenAI) – Debugging support (`KeyError`, `SyntaxError`); refer to screenshot  