# Cell Population Analysis
This Jupyter notebook analyzes the cell populations recorded in the file cell-count.csv. It has two parts: (1) computing the relative frequencies and generating the output .csv file, and (2) patient response analysis.

To recreate my outputs (the .csv file and the boxplot), run all cells.

## Part 1: Compute the Relative Frequencies

In [None]:
import pandas as pd

In [35]:
cellCountDF = pd.read_csv("cell-count.csv")

#### Generate the output data:
I keep the names of the cell population columns in a small list, which is used to index the data frame when computing each sample's total count and relative frequencies.

As a reminder, the output .csv file should have the columns: sample, total_count, population, count, percentage

In [39]:
# Collect rows in a plain list and build the DataFrame once at the end;
# appending to a DataFrame row by row would copy it on every iteration
populations = ['b_cell', 'cd8_t_cell', 'cd4_t_cell', 'nk_cell', 'monocyte']
outputData = []
for index, sample in cellCountDF.iterrows():
  # Total cell count for this sample across all populations
  total = sum(sample[population] for population in populations)

  # One output row per population: sample, total_count, population, count, percentage
  for population in populations:
    outputData.append([
        sample['sample'],
        total,
        population,
        sample[population],
        round((sample[population] / total * 100), 2)
    ])

outputDF = pd.DataFrame(outputData, columns=['sample', 'total_count', 'population', 'count', 'percentage'])
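
The same total and relative frequency computation can also be written without an explicit row loop. The following is a minimal sketch, not the method used in this notebook: the helper name relative_frequencies and the two-sample demo_df are hypothetical, and the counts are made up for illustration. It assumes pandas 1.1 or newer, where melt accepts ignore_index.

```python
import pandas as pd

# Hypothetical two-sample input; the notebook itself reads cell-count.csv.
demo_df = pd.DataFrame({
    "sample": ["s1", "s2"],
    "b_cell": [10, 50],
    "cd8_t_cell": [20, 30],
    "cd4_t_cell": [30, 20],
    "nk_cell": [25, 10],
    "monocyte": [15, 10],
})

populations = ["b_cell", "cd8_t_cell", "cd4_t_cell", "nk_cell", "monocyte"]


def relative_frequencies(df, populations):
    """Return one row per (sample, population) with count and percentage."""
    # Row-wise sum over the population columns gives each sample's total.
    totals = df[populations].sum(axis=1)
    long_df = df.assign(total_count=totals).melt(
        id_vars=["sample", "total_count"],
        value_vars=populations,
        var_name="population",
        value_name="count",
        ignore_index=False,  # keep the original row index so input order can be restored
    )
    # melt stacks rows population by population; a stable sort on the original
    # index groups each sample's rows together in the input order.
    long_df = long_df.sort_index(kind="mergesort").reset_index(drop=True)
    long_df["percentage"] = (long_df["count"] / long_df["total_count"] * 100).round(2)
    return long_df


demo_out = relative_frequencies(demo_df, populations)
print(demo_out)
```

For a file of this size the loop is perfectly adequate; the vectorized form mainly pays off when the number of samples grows large.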

Because this type of data may be used in high-stakes contexts like medicine or clinical trials, I included a few tests to ensure correctness of the output before exporting.

In [74]:
'''Agreement tests, ensuring internal consistency (not necessarily correctness)'''

# Ensure that total_count agrees with the sum of each sample's population counts
for sample_id in outputDF['sample'].unique():
  sample_rows = outputDF[outputDF['sample'] == sample_id]
  total = sample_rows['total_count'].values[0]
  count_sum = sample_rows['count'].sum()
  assert total == count_sum, f"Mismatch in total for {sample_id}"


# Ensure that each sample's percentages sum to 100. Each percentage is rounded to
# 2 decimals, so the sum can be off by up to 5 * 0.005 = 0.025 (plus floating point
# error); compare with a tolerance instead of exact equality
for sample_id in outputDF['sample'].unique():
  sample_rows = outputDF[outputDF['sample'] == sample_id]
  percentage_sum = sample_rows['percentage'].sum()
  assert abs(percentage_sum - 100) < 0.05, f"Mismatch in percentages for {sample_id}"

'''Correctness test'''

# Ensure that s9's total and relative frequencies are as expected per the input
s9 = outputDF[outputDF['sample'] == 's9']
assert s9['total_count'].values[0] == 130000, "Unexpected total_count for s9"

expectedFrequencies = [
    round(45500 / 130000 * 100, 2),
    round(27300 / 130000 * 100, 2),
    round(32500 / 130000 * 100, 2),
    round(6500 / 130000 * 100, 2),
    round(18200 / 130000 * 100, 2)
    ]

assert s9['percentage'].tolist() == expectedFrequencies, "Unexpected percentages for s9"
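
The per-sample agreement checks can also be expressed with groupby, which makes it easy to see why rounded percentages should be compared against 100 with a tolerance. This is a sketch under stated assumptions: check_consistency, its tolerance default of 0.05, and the demo data are hypothetical and only mirror the column layout of outputDF. Sample x1 has three equal populations, so each percentage rounds to 33.33 and the sum falls short of 100.

```python
import pandas as pd

# Hypothetical long-format data with the same columns as outputDF.
demo_out = pd.DataFrame({
    "sample": ["x1"] * 3 + ["x2"] * 2,
    "total_count": [3, 3, 3, 10, 10],
    "population": ["a", "b", "c", "a", "b"],
    "count": [1, 1, 1, 4, 6],
    "percentage": [33.33, 33.33, 33.33, 40.0, 60.0],
})


def check_consistency(out_df, tolerance=0.05):
    """Return the sample ids that fail the total check and the percentage check."""
    grouped = out_df.groupby("sample")

    # total_count must equal the sum of the per-population counts.
    count_sums = grouped["count"].sum()
    totals = grouped["total_count"].first()
    bad_totals = count_sums[count_sums != totals].index.tolist()

    # Rounded percentages only need to sum to 100 within the tolerance.
    pct_sums = grouped["percentage"].sum()
    bad_pcts = pct_sums[(pct_sums - 100).abs() > tolerance].index.tolist()
    return bad_totals, bad_pcts


bad_totals, bad_pcts = check_consistency(demo_out)
assert not bad_totals, f"Mismatch in total for {bad_totals}"
assert not bad_pcts, f"Mismatch in percentages for {bad_pcts}"
```

With a much tighter tolerance such as 0.001, sample x1 would be reported as a mismatch even though its counts are correct, which is the failure mode an exact comparison to 100 is exposed to.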

In [75]:
# Generate the output csv file from the output data
outputDF.to_csv('output.csv', index=False)
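
One further check that may be worth adding is a round trip through the CSV format, confirming that what is written can be read back unchanged. The sketch below uses an in-memory buffer and a small hypothetical frame (demo_out) in place of output.csv and outputDF, so it does not touch the file system; pd.testing.assert_frame_equal compares floats approximately by default.

```python
import io

import pandas as pd

# Hypothetical stand-in for outputDF with the same column layout.
demo_out = pd.DataFrame({
    "sample": ["x2", "x2"],
    "total_count": [10, 10],
    "population": ["a", "b"],
    "count": [4, 6],
    "percentage": [40.0, 60.0],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
demo_out.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Raises AssertionError if columns, dtypes, or values differ.
pd.testing.assert_frame_equal(demo_out, reloaded)
```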