# Assess Data Structure Programmatically

In this exercise, you will perform a brief structural assessment of two datasets.

In [1]:
#Imports - DO NOT MODIFY
import pandas as pd

## Dataset context

Our dataset is the "Hospital Annual Utilization Report" data from the California Health and Human Services Open Data Portal, containing data on hospital buildings.

Columns' description (taken from https://data.chhs.ca.gov/dataset/hospital-building-data/resource/cefc10e5-5071-4ca4-8b03-2249caf0d294):

- **County Code:** County number (set by State of California) and County Name
- **Perm ID:** Facility number per Facilities Development Division
- **Facility Name:** Name of the General Acute Care Hospital
- **City:** City
- **Building Nbr:** Unique building number assigned to seismically separate building in a hospital campus.
- **Building Name:** Building name provided by the Facility.
- **Building Status:** Building status.
> If currently in service, status is "In Service". If under construction, status is "Under Construction". Other statuses are used to identify buildings that may be located in general acute care facility but do not provide general acute care services.
- **SPC Rating\*:** Structural Performance Category (SPC) rating.
> It rates the building structure on a scale from 1 to 5; an “s” is added where the rating is not confirmed by HCAI. SPC 1 is assigned to buildings that may be at risk of collapse in a strong earthquake, and SPC 5 is assigned to buildings reasonably capable of providing services to the public following strong ground motion. N/A = Not Applicable and NYA = Not Yet Available.
- **Building URL:** A URL that opens the page associated with Building Nbr in the eServices Portal, which provides access to related projects being constructed in the building.
- **Height (ft):** Height in feet for the building where available

Let's take a look at the first few rows of the data.

In [2]:
# DO NOT MODIFY - read the .csv file
ospd_data = pd.read_csv('ca-oshpd-gachospital-hospitalbuildingdata-03092023.csv')

In [3]:
# FILL IN - show the first few values of the dataframe
ospd_data.head()

Unnamed: 0,County Code,Perm ID,Facility Name,City,Building Nbr,Building Name,Building Status,SPC Rating *,Building URL,Height (ft),Stories,Building Code,Building Code Year,Year Completed,AB 1882 Notice,Latitude,Longitude,Count
0,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-01278,Original Hospital,No Gen Acute Care - OSHPD Bldg,,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,44.17,4.0,Unknown,,1926.0,,37.762657,-122.253899,1
1,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-01279,Stephens Wing,In Service,2,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,35.0,3.0,1952 Uniform Building Code (UBC),1952.0,1956.0,This building does not significantly jeopardiz...,37.762657,-122.253899,1
2,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-01280,West Wing,In Service,2,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,,2.0,1964 Uniform Building Code (UBC),1964.0,1968.0,This building does not significantly jeopardiz...,37.762657,-122.253899,1
3,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-01281,South Wing,In Service,3s,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,,3.0,1976 California Building Code (CBC),1976.0,1983.0,,37.762657,-122.253899,1
4,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-01282,Radiology Addition,In Service,5s,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,,2.0,1985 California Building Code (CBC),1985.0,1995.0,,37.762657,-122.253899,1


## 1. Inspect the data tidiness

During our data tidiness investigation, we will look at the application of the following rules:

- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.

At first glance, the dataset looks well formatted: the column names serve as the header, each row holds a single observation, and each cell holds a single value. Looking more closely, however, reveals a number of issues. Below, you will investigate three issues related to data tidiness programmatically.

### 1.1 Investigate the `Building Code` column

Investigate the `Building Code` column programmatically using the `.describe()` and `.value_counts()` methods. What kinds of values does this column contain, and how could we tidy it? Hint: Think about the "Multiple variables stored in one column" guideline.

In [4]:
#FILL IN - describe the data
ospd_data['Building Code'].describe()

count                                    3932
unique                                     67
top       2001 California Building Code (CBC)
freq                                      327
Name: Building Code, dtype: object

In [5]:
#FILL IN - use .value_counts() on the data
ospd_data['Building Code'].value_counts()

2001 California Building Code (CBC)    327
1973 California Building Code (CBC)    292
1979 California Building Code (CBC)    279
1985 California Building Code (CBC)    241
1989 California Building Code (CBC)    228
                                      ... 
1961 County of Los Angeles (LAC)         1
1959 County of Los Angeles (LAC)         1
1964 County of Los Angeles (LAC)         1
1966 County of Los Angeles (LAC)         1
2022 California Building Code (CBC)      1
Name: Building Code, Length: 67, dtype: int64

*FILL IN explanation:* 

The `Building Code` column stores multiple variables in one column, spread across 67 unique values.
Each value combines the year of the building code, the name of the building code, and the associated acronym. However, the `Building Code Year` column already holds the year of the applicable building code, so the year is duplicated here. Ideally, this column should be reduced to a single column containing:
- `Building Code Type`: The acronym of the applicable building code, e.g., CBC, UBC, or LAC. The full expansions of the acronyms can be placed in an accompanying legend/documentation.
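One way to derive such a column is sketched below. It runs on a small hypothetical `sample` dataframe whose values mirror those shown in the `.value_counts()` output, and it assumes the acronym always appears in parentheses at the end of the string. The column name `Building Code Type` is the one proposed above, not a column in the source data.

```python
import pandas as pd

# Hypothetical sample mirroring values seen in the `Building Code` column
sample = pd.DataFrame({
    "Building Code": [
        "1952 Uniform Building Code (UBC)",
        "1976 California Building Code (CBC)",
        "1961 County of Los Angeles (LAC)",
        "Unknown",
    ]
})

# Capture the uppercase acronym inside the trailing parentheses.
# Rows without a parenthesised acronym (e.g. "Unknown") become NaN.
sample["Building Code Type"] = sample["Building Code"].str.extract(
    r"\(([A-Z]+)\)\s*$", expand=False
)

print(sample)
```

Values such as `Unknown` do not match the pattern and end up as missing, which may be the desired behaviour but should be checked against the full dataset before dropping the original column.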

### 1.2 Investigate unnecessary values
Which variable adds no value to the data that we could remove from the dataframe? Hint: Use the `describe()` function.

In [6]:
#Fill in
#Describe the dataset
ospd_data.describe()

Unnamed: 0,Perm ID,Height (ft),Stories,Building Code Year,Year Completed,Latitude,Longitude,Count
count,3932.0,2020.0,3302.0,3856.0,3550.0,3932.0,3932.0,3932.0
mean,12157.352238,29.460847,1.958207,1985.51971,1988.371549,35.670429,-119.428604,1.0
std,2153.781276,27.21235,1.796883,19.356151,20.356176,2.257315,2.045018,0.0
min,10006.0,0.0,0.0,1927.0,1902.0,32.6189,-124.194,1.0
25%,10677.0,13.0,1.0,1973.0,1973.0,33.906395,-121.452156,1.0
50%,11510.0,17.0,1.0,1985.0,1990.0,34.325735,-118.463211,1.0
75%,12662.0,36.0,2.0,2001.0,2005.0,37.71674,-117.871239,1.0
max,18232.0,195.5,15.0,2022.0,2022.0,41.774509,-114.595116,1.0


*FILL IN explanation*: The `Count` variable only contains the value `1` for the entire dataframe and adds no meaningful value, so it could be removed.
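The same check can be done without reading the summary table by eye. The sketch below uses a small hypothetical `sample` dataframe (not the real file) and treats any column with a single distinct value as a candidate for removal; whether such a column is truly unnecessary remains a judgment call.

```python
import pandas as pd

# Hypothetical sample with a constant `Count` column, as in the dataset
sample = pd.DataFrame({
    "Perm ID": [11210, 11210, 11322],
    "Stories": [4.0, 3.0, 4.0],
    "Count": [1, 1, 1],
})

# Columns with exactly one distinct value (NaN counted as a value) carry no information
constant_cols = [
    col for col in sample.columns if sample[col].nunique(dropna=False) == 1
]

# Drop the constant columns and keep the rest
trimmed = sample.drop(columns=constant_cols)

print(constant_cols)
print(trimmed.columns.tolist())
```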

### 1.3 Investigate different observational units

Do you see cases of multiple observational units being stored in a single table in the dataset? Inspect the dataframe visually by looking at the first few rows, and programmatically by checking the number of unique values in each column. Explain how we could mitigate this duplication by having two separate tables.

**Note:** Here, we are asking you to think about how you would separate the data into two separate entities/dataframes/tables, rather than looking at repetitive values across columns in the original dataframe.

In [7]:
#FILL IN - print first 10 rows of dataframe
ospd_data.head(10)

Unnamed: 0,County Code,Perm ID,Facility Name,City,Building Nbr,Building Name,Building Status,SPC Rating *,Building URL,Height (ft),Stories,Building Code,Building Code Year,Year Completed,AB 1882 Notice,Latitude,Longitude,Count
0,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-01278,Original Hospital,No Gen Acute Care - OSHPD Bldg,,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,44.17,4.0,Unknown,,1926.0,,37.762657,-122.253899,1
1,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-01279,Stephens Wing,In Service,2,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,35.0,3.0,1952 Uniform Building Code (UBC),1952.0,1956.0,This building does not significantly jeopardiz...,37.762657,-122.253899,1
2,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-01280,West Wing,In Service,2,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,,2.0,1964 Uniform Building Code (UBC),1964.0,1968.0,This building does not significantly jeopardiz...,37.762657,-122.253899,1
3,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-01281,South Wing,In Service,3s,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,,3.0,1976 California Building Code (CBC),1976.0,1983.0,,37.762657,-122.253899,1
4,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-01282,Radiology Addition,In Service,5s,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,,2.0,1985 California Building Code (CBC),1985.0,1995.0,,37.762657,-122.253899,1
5,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-01283,Medical Gas Storage,In Service,3,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,8.42,1.0,1985 California Building Code (CBC),1985.0,1995.0,,37.762657,-122.253899,1
6,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-02630,Compactor Shed,In Service,4,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,,1.0,1976 California Building Code (CBC),1976.0,1983.0,,37.762657,-122.253899,1
7,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-03120,Emergency Room Relocation,In Service,3s,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,,1.0,1979 California Building Code (CBC),1979.0,1988.0,,37.762657,-122.253899,1
8,01 - Alameda,11210,Alameda Hospital,Alameda,BLD-05597,LOX Tank,Under Construction,,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,,,2010 California Building Code (CBC),2010.0,2016.0,,37.762657,-122.253899,1
9,01 - Alameda,11322,Alta Bates Summit Medical Center,Oakland,BLD-00695,Ehman Building,In Service,2,https://esp.oshpd.ca.gov/CitizenAccess/Cap/Cap...,49.75,4.0,1927 Uniform Building Code (UBC),1927.0,1937.0,This building does not significantly jeopardiz...,37.820809,-122.263081,1


In [8]:
#FILL IN
#Find number of unique values per column using .nunique()
ospd_data.nunique()

County Code             56
Perm ID                425
Facility Name          415
City                   254
Building Nbr          3932
Building Name         2708
Building Status         14
SPC Rating *            12
Building URL          3932
Height (ft)            694
Stories                 15
Building Code           67
Building Code Year      43
Year Completed         101
AB 1882 Notice           2
Latitude               424
Longitude              424
Count                    1
dtype: int64

*FILL IN explanation*: The dataset contains information about buildings in hospitals, including their **location information** and other **detail information**.

The "County Code", "Perm ID", "Facility Name", and "City" variables are repeated across multiple rows because they correspond to the same location. Although the other variables are also repeated across multiple rows, they provide unique details about the specific hospital buildings. 

We could separate the dataset into two units/entities:
- "County Code", "Perm ID", "Facility Name", and "City", which contain location information
- All other columns, which contain the details of each hospital building
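A minimal sketch of this split follows, using a hypothetical `sample` dataframe built from a few of the rows shown earlier. It makes one assumption beyond the explanation above: `Perm ID` is kept in the buildings table as a key, so that the two tables can be joined back together. The names `facilities` and `buildings` are illustrative.

```python
import pandas as pd

# Hypothetical sample with repeated facility-level values
sample = pd.DataFrame({
    "County Code": ["01 - Alameda", "01 - Alameda", "01 - Alameda"],
    "Perm ID": [11210, 11210, 11322],
    "Facility Name": [
        "Alameda Hospital",
        "Alameda Hospital",
        "Alta Bates Summit Medical Center",
    ],
    "City": ["Alameda", "Alameda", "Oakland"],
    "Building Nbr": ["BLD-01278", "BLD-01279", "BLD-00695"],
    "Building Name": ["Original Hospital", "Stephens Wing", "Ehman Building"],
})

facility_cols = ["County Code", "Perm ID", "Facility Name", "City"]

# One row per facility: drop the repeated location rows
facilities = sample[facility_cols].drop_duplicates().reset_index(drop=True)

# One row per building: keep `Perm ID` as the key back to `facilities` (assumption)
buildings = sample.drop(columns=[c for c in facility_cols if c != "Perm ID"])

# Joining on the key recovers the original information
merged = buildings.merge(facilities, on="Perm ID")

print(facilities)
print(buildings)
```

Note that in the full dataset `Perm ID` has 425 unique values while `Facility Name` has 415, so the ID is likely the safer key than the name.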