# Computational complexity/Data structures

Develop a Python module that provides simple analytical capabilities on (synthetic) EHR data. The data are provided as:

* A table of patients with demographic data: `PatientCorePopulatedTable.txt`
* A table of laboratory results: `LabsCorePopulatedTable.txt`

## Data parsing

Define a function `parse_data(filename: str) -> ???` that reads and parses the data files. Choose appropriate data structures such that the expected analyses (below) are efficient.

Include a module docstring describing your rationale for choosing these data structures.

Include a function docstring analyzing the computational complexity of the data parser.
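
As a rough illustration of one possible shape for `parse_data`, the sketch below reads a file into a dictionary that maps each column name to a list of that column's values. It assumes the file is tab-delimited with a header row and may start with a byte-order mark (hence `utf-8-sig`); the column-oriented return type is only one option, not a required design.

```python
def parse_data(filename: str) -> dict[str, list[str]]:
    """Parse a tab-delimited file into {column name: list of values}.

    Assumes the first line is a header and every row has one field
    per header column.

    Complexity: each of the N rows is read once and split into M
    fields, and each field is appended once (amortized O(1)), giving
    N*M operations, i.e. O(N*M), or O(N) for a fixed number of columns.
    """
    with open(filename, encoding="utf-8-sig") as f:
        header = f.readline().rstrip("\n").split("\t")
        columns = {name: [] for name in header}
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines, e.g. a trailing newline
            for name, field in zip(header, line.split("\t")):
                columns[name].append(field)
    return columns
```

All values are kept as strings here; converting dates and lab values at parse time instead would move that cost out of the analysis functions.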

## Analysis

Define the following functions to interrogate the data. In each one, include a function docstring describing its computational complexity _at runtime_ (i.e. after parsing into the global data structures).

### Old patients

The function `num_older_than(age, ???)` should take the data and return the number of patients older than a given age (in years). For example,

```python
>>> num_older_than(51.2)
52
```
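
A minimal sketch of how `num_older_than` might look, assuming the patient data is held as a dictionary of column lists with a `PatientDateOfBirth` column in `%Y-%m-%d %H:%M:%S.%f` format (the format used in the exploratory cells further down). The `now` parameter is a hypothetical addition that makes the result reproducible in tests.

```python
from datetime import datetime


def num_older_than(age: float, patients: dict[str, list[str]], now=None) -> int:
    """Count patients older than `age` years.

    Complexity: one date parse, one subtraction and one comparison per
    patient, so roughly 3N operations for N patients, i.e. O(N).
    """
    now = now or datetime.today()
    count = 0
    for dob in patients["PatientDateOfBirth"]:
        born = datetime.strptime(dob, "%Y-%m-%d %H:%M:%S.%f")
        # Approximate age in years, using 365.25 days per year.
        if (now - born).days / 365.25 > age:
            count += 1
    return count
```

If many age queries are expected, sorting the birth dates once at parse time would allow each query to use binary search in O(log N) instead.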

### Sick patients

The function `sick_patients(lab, gt_lt, value, ???)` should take the data and return a (unique) list of patients who have a given test with value above (">") or below ("<") a given level. For example,

```python
>>> sick_patients("METABOLIC: ALBUMIN", ">", 4.0)
["FB2ABB23-C9D0-4D09-8464-49BF0B982F0F", "64182B95-EB72-4E2B-BE77-8050B71498CE"]
```
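
One way `sick_patients` could be sketched, assuming the lab data is a dictionary of column lists with `PatientID`, `LabName` and `LabValue` columns, and that lab values are stored as text and need converting to `float`. A set is used to keep patients unique, so the order of the returned list is not guaranteed.

```python
def sick_patients(lab: str, gt_lt: str, value: float,
                  labs: dict[str, list[str]]) -> list[str]:
    """Return unique IDs of patients with a `lab` result above/below `value`.

    Complexity: each of the N lab rows is visited once, with an O(1)
    name check, comparison and set insertion, i.e. O(N) overall.
    """
    if gt_lt not in (">", "<"):
        raise ValueError("gt_lt must be '>' or '<'")
    found = set()
    rows = zip(labs["PatientID"], labs["LabName"], labs["LabValue"])
    for patient_id, name, raw_value in rows:
        if name != lab:
            continue
        measured = float(raw_value)  # compare numerically, not as text
        if (gt_lt == ">" and measured > value) or (
            gt_lt == "<" and measured < value
        ):
            found.add(patient_id)
    return list(found)
```

Grouping the rows by lab name at parse time would reduce each query to the rows for that one test.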

## Notes

All of this should be generalizable, i.e. it should be designed to work with files with these formats, not just these _specific_ files. State (in module/function docstrings) any assumptions that you make about the input data.

When describing computational complexity, document your thought process in detail. For example:

> 5 is added to `element`, which is a single operation. This operation is performed twice for each element, leading to 2N operations. For big-O analysis, we drop the constant factor, yielding O(N) complexity.
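
To make the quoted reasoning concrete, here is a small hypothetical function (`add_five_twice` is an illustrative name, not part of the assignment) with the same analysis written into its docstring.

```python
def add_five_twice(elements: list[float]) -> list[float]:
    """Return a new list with 10 added to every element, in two steps.

    Complexity: adding 5 to `element` is a single operation. It is
    performed twice for each of the N elements, leading to 2N
    operations (plus N appends). Dropping constant factors gives O(N).
    """
    result = []
    for element in elements:
        element = element + 5  # first addition
        element = element + 5  # second addition
        result.append(element)
    return result
```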

You may like to use the `datetime` (standard) library. _Do not import any other libraries._

Your submission should be a single file titled `ehr_analysis.py`.

## Submission

1. Create a _private_ GitHub repository.
2. Invite `patrickkwang` to collaborate.
3. Create a branch called `part1` and complete this assignment there.
4. Make a pull request `part1` -> `main`. _DO NOT MERGE IT._
5. Request a review from `patrickkwang`.
6. Submit the link to your repository on Sakai by the due date.

In [1]:
import pandas as pd

labs = pd.read_csv("/mnt/c/Users/sdona/Documents/Duke/22Spring/821BIOSTAT/03Assignment/LabsCorePopulatedTable.txt", sep="\t")
patient = pd.read_csv("/mnt/c/Users/sdona/Documents/Duke/22Spring/821BIOSTAT/03Assignment/PatientCorePopulatedTable.txt", sep="\t")

In [None]:
labs.to_csv("/mnt/c/Users/sdona/Documents/Duke/22Spring/"
"821BIOSTAT/03Assignment/Labs.txt", sep="\t")

patient.to_csv("/mnt/c/Users/sdona/Documents/Duke/22Spring/"
"821BIOSTAT/03Assignment/Patient.txt", sep="\t")

In [2]:
delimiter = "\t"

In [3]:
# Parse the tab-delimited labs file into a dict of column name -> list of values.
path = "/mnt/c/Users/sdona/Documents/Duke/22Spring/821BIOSTAT/03Assignment/LabsCorePopulatedTable.txt"
with open(path, "r", encoding="UTF-8-sig") as file:
    # The `with` block closes the file even if reading fails.
    rows = [line.strip("\n") for line in file]

# The first row holds the column names; the remaining rows are records.
columns = rows[0].split(delimiter)
records = [row.split(delimiter) for row in rows[1:]]

# Transpose the records so that each column name maps to its list of values.
data_dict = {}
for i, column in enumerate(columns):
    data_dict[column] = [record[i] for record in records]

In [4]:
import datetime

In [5]:
# dates = data_dict["PatientDateOfBirth"]
# dates_2 = [datetime.datetime.strptime(i, "%Y-%m-%d %H:%M:%S.%f") for i in dates]
# today = datetime.datetime.today()

In [6]:
# age_days = []
# age_years = []
# for i in range(len(dates_2)):
#     age_days.append(today - dates_2[i])
#     age_years.append(age_days[i].days/365.25)
# data_dict["PatientDateOfBirth"] = age_years

In [7]:
# age = 51.2
# older = []
# for i in range(len(data_dict["PatientDateOfBirth"])):
#     if data_dict["PatientDateOfBirth"][i] > age:
#         older.append(data_dict["PatientID"][i])
# print(len(older))

In [8]:
import numpy as np

np.unique(labs["LabName"])

array(['CBC: ABSOLUTE LYMPHOCYTES', 'CBC: ABSOLUTE NEUTROPHILS',
       'CBC: BASOPHILS', 'CBC: EOSINOPHILS', 'CBC: HEMATOCRIT',
       'CBC: HEMOGLOBIN', 'CBC: LYMPHOCYTES', 'CBC: MCH', 'CBC: MCHC',
       'CBC: MEAN CORPUSCULAR VOLUME', 'CBC: MONOCYTES',
       'CBC: NEUTROPHILS', 'CBC: PLATELET COUNT', 'CBC: RDW',
       'CBC: RED BLOOD CELL COUNT', 'CBC: WHITE BLOOD CELL COUNT',
       'METABOLIC: ALBUMIN', 'METABOLIC: ALK PHOS', 'METABOLIC: ALT/SGPT',
       'METABOLIC: ANION GAP', 'METABOLIC: AST/SGOT',
       'METABOLIC: BILI TOTAL', 'METABOLIC: BUN', 'METABOLIC: CALCIUM',
       'METABOLIC: CARBON DIOXIDE', 'METABOLIC: CHLORIDE',
       'METABOLIC: CREATININE', 'METABOLIC: GLUCOSE',
       'METABOLIC: POTASSIUM', 'METABOLIC: SODIUM',
       'METABOLIC: TOTAL PROTEIN', 'URINALYSIS: PH',
       'URINALYSIS: RED BLOOD CELLS', 'URINALYSIS: SPECIFIC GRAVITY',
       'URINALYSIS: WHITE BLOOD CELLS'], dtype=object)

In [9]:
import numpy as np
np.unique(labs[labs["LabName"]=="METABOLIC: ALBUMIN"].LabValue)

array([2.5, 2.6, 2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7,
       3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5. ,
       5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7, 5.8, 5.9, 6. ])

In [10]:
albumin = labs[labs["LabName"]=="METABOLIC: ALBUMIN"]
albumin_values = albumin[albumin["LabValue"] > 5.9]
albumin[albumin["LabValue"]> 5.9].PatientID.nunique()

42

In [19]:
lab = "METABOLIC: ALBUMIN"
gt_lt = ">"
value = 5.9

In [20]:
# Collect the unique patients whose `lab` result is above or below `value`.
if gt_lt not in (">", "<"):
    raise ValueError("Please enter a valid operator")

values = []
seen = set()  # O(1) membership checks instead of scanning the list
for i in range(len(data_dict["PatientID"])):
    # Only the LabName column identifies the test, so check it directly.
    if data_dict["LabName"][i] != lab:
        continue
    # LabValue was parsed as text; convert it for a numeric comparison.
    lab_value = float(data_dict["LabValue"][i])
    is_match = lab_value > value if gt_lt == ">" else lab_value < value
    patient_id = data_dict["PatientID"][i]
    if is_match and patient_id not in seen:
        seen.add(patient_id)
        values.append(patient_id)
print(values)

['0BC491C5-5A45-4067-BD11-A78BEA00D3BE', '220C8D43-1322-4A9D-B890-D426942A3649', '81C5B13B-F6B2-4E57-9593-6E7E4C13B2CE', '69B5D2A0-12FD-46EF-A5FF-B29C4BAFBE49', '8D389A8C-A6D8-4447-9DDE-1A28AB4EC667', '49DADA25-F2C2-42BB-8210-D78E6C7B0D48', 'B5D31F01-7273-4901-B56F-8139769A11EF', '80D356B4-F974-441F-A5F2-F95986D119A2', '016A590E-D093-4667-A5DA-D68EA6987D93', 'B2EB15FA-5431-4804-9309-4215BDC778C0', '53B9FFDD-F80B-43BE-93CF-C34A023EE7E9', '35FE7491-1A1D-48CB-810C-8DC2599AB3DD', '56A35E74-90BE-44A0-B7BA-7743BB152133', '21792512-2D40-4326-BEA2-A40127EB24FF', '2EE42DEF-37CA-4694-827E-FA4EAF882BFC', '4C201C71-CCED-40D1-9642-F9C8C485B854', 'C65A4ADE-112E-49E4-B72A-0DED22C242ED', '7A025E77-7832-4F53-B9A7-09A3F98AC17E', 'DDC0BC57-7A4E-4E02-9282-177750B74FBC', '9C75DF1F-9DA6-4C98-8F5B-E10BDC805ED0', 'A19A0B00-4C9A-4206-B1FE-17E6DA3CEB0B', '66154E24-D3EE-4311-89DB-6195278F9B3C', 'CC9CDA72-B37A-4F8F-AFE4-B08F56A183BE', '3E462A8F-7B90-43A1-A8B6-AD82CB5002C9', '6D5DCAC1-17FE-4D7C-923B-806EFBA3E6DF',