
# **Data Analysis**
You now have a large dataframe with many different variables. To answer our experimental question, let's first think back to the experiment itself. **Remember**: What was the original goal of this experiment? What parameters do we need to answer this question? Where can you find these parameters?

From your programming and data analysis courses, you should remember that we should first run some quality checks on our data. For this, we'll plot some histograms to check the distribution of variables such as area or signal intensity. What we are looking for is well-behaved, roughly unimodal data. Outliers or a bimodal distribution may hint at what we need to filter or what went wrong.

In [1]:
import os
import pandas as pd
import seaborn as sns

sns.set_theme(style="ticks", palette="pastel")

# `out_dir` is assumed to be the output folder defined earlier, during image analysis
dd = pd.read_csv(os.path.join(out_dir, "combined_measurements.csv"))

In [None]:
# To check for differences between our conditions
# Compare the number of cells between wells
sns.displot(data=dd, x="well",
    facet_kws=dict(margin_titles=True))

Clearly, many more cells were segmented in well E05 than in well D05. This seems peculiar, so we should keep it in mind.

In [None]:
# To check for differences in segmentation between wells
# Compare the size of cells between wells
sns.displot(data=dd, x="area", col="well",
    facet_kws=dict(margin_titles=True),
)

We can see that this large difference in the number of cells between wells D05 and E05 stems from a large number of very small cells being present in well E05. We see that their size is close to 0, and barely any cells of that size are present in well D05. This hints that these cells were missegmented. 

We don't know why, but it might be that cells were more confluent, there was a lot of debris, or our image quality was just bad. If you take a look at some images from well E05 you might notice that the signal intensity is pretty bad for some sites. This indicates to me that something went wrong during image acquisition, like the autofocus messing up.
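
One way to follow up on the suspicion about individual sites is to summarise the DAPI signal per imaging site. The sketch below uses a toy dataframe with invented values. The `site` column name and the 50 % cutoff are assumptions made for illustration, not something taken from our measurements.

```python
import pandas as pd

# Toy stand-in for `dd`; all values are invented for illustration
toy = pd.DataFrame({
    "well": ["D05"] * 4 + ["E05"] * 4,
    "site": [1, 1, 2, 2, 1, 1, 2, 2],  # hypothetical column name
    "mean_intensity-0": [12000, 14000, 13000, 11000, 12500, 13500, 2000, 3000],
})

# Median DAPI intensity of all objects in each well/site combination
site_median = toy.groupby(["well", "site"])["mean_intensity-0"].median()

# Flag sites whose median is far below the typical site (arbitrary 50 % cutoff)
low_sites = site_median[site_median < 0.5 * site_median.median()]
print(low_sites)
```

Sites flagged this way are good candidates for a visual check of the raw images.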

Moreover, some cells might also be mitotic. For mitotic cells, we would expect them to have a high mean DAPI intensity, as they are dividing and have high DNA content. They are also a lot smaller and elongated (or: more elliptical than regular cells/nuclei).

Now for debris, we would expect that signal to be small and spotty, i.e. more perfectly round than a cell should be.

We can test our assumption with one more plot:

In [None]:
sns.relplot(data=dd, x="area", y="eccentricity", hue="mean_intensity-0", alpha=0.4)

What have we done here? We plotted eccentricity against area and used the mean DAPI intensity as the hue. Eccentricity is the ratio of the distance between the two focal points of the ellipse fitted to an object to the length of its major axis. An eccentricity of 0 is a perfect circle; the closer the value gets to 1, the more elongated the shape.

We can see three things:

1.   A large population of cells with a similarly low mean DAPI intensity (< 15'000) that is spread in both eccentricity and area.
2.   A second population of very small cells (area ~0) with high intensity (> 30'000).
3.   A third population of extremely elongated objects (eccentricity close to 1) with high intensity.

Recall what we said about mitotic cells and debris. Our second population probably represents mitotic cells, while the third is too elongated to be a plausible nucleus and most likely corresponds to debris or segmentation artifacts.
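
To make the eccentricity values concrete, here is a minimal sketch that computes the eccentricity of an ellipse from its axis lengths. It relies on the standard relation e = sqrt(1 - (b/a)^2), which is equivalent to the focal-distance ratio. The function name is made up for this example and is not part of our pipeline.

```python
import math

def ellipse_eccentricity(major_axis, minor_axis):
    """Eccentricity of an ellipse from its full major and minor axis lengths."""
    if not 0 < minor_axis <= major_axis:
        raise ValueError("expected 0 < minor_axis <= major_axis")
    return math.sqrt(1 - (minor_axis / major_axis) ** 2)

print(ellipse_eccentricity(10, 10))  # circle
print(ellipse_eccentricity(10, 6))   # moderately elongated
print(ellipse_eccentricity(10, 1))   # very elongated
```

A circle gives 0, the 10-by-6 ellipse gives about 0.8, and the thin 10-by-1 ellipse is already above 0.99.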

In our next step we will thus filter cells by eccentricity and area. We will remove not only very small objects but also very large ones (where several cells were merged into one, see the histogram above).
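
Before committing to cutoffs, it can help to see how many objects each criterion would remove per well. The following is a rough sketch on a toy dataframe with invented values. It borrows the column names of `dd` and the cutoffs used in the next cell, and the `checks` and `removed_per_well` names are only illustrative.

```python
import pandas as pd

# Toy stand-in for `dd`; all values are invented for illustration
toy = pd.DataFrame({
    "well": ["D05", "D05", "D05", "E05", "E05", "E05", "E05"],
    "area": [1200, 3000, 9000, 40, 55, 2500, 700],
    "eccentricity": [0.60, 0.70, 0.80, 1.00, 0.30, 0.65, 0.97],
})

# One boolean column per criterion, so each cutoff can be inspected separately
checks = pd.DataFrame({
    "too_small": toy["area"] < 650,
    "too_large": toy["area"] > 7500,
    "too_elongated": toy["eccentricity"] > 0.95,
})
checks["removed"] = checks.any(axis=1)  # fails at least one criterion

# Number of objects failing each criterion, per well
removed_per_well = checks.groupby(toy["well"]).sum()
print(removed_per_well)
```

If one well loses far more objects than the others, that is another hint that something went wrong in that well.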



In [None]:
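# Keep only objects within the area range and with eccentricity of at most 0.95.
# .copy() makes dd_filtered an independent dataframe, so adding columns later is safe.
dd_filtered = dd[(dd["area"] >= 650) & (dd["area"] <= 7500) & (dd["eccentricity"] <= 0.95)].copy()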

sns.displot(data=dd_filtered, x="area", col="well",
    facet_kws=dict(margin_titles=True),
)

sns.relplot(data=dd_filtered, x="area", y="eccentricity", hue="mean_intensity-0", alpha=0.4)

So, with those quality checks settled, let's dig deeper. First we might ask: in how many cells did transfection actually work? GFP was the signal we used to check whether transfection worked, so we can now create a new variable called `transfected`. This will be a binary flag (0 = no, 1 = yes) for each cell, depending on whether its mean GFP intensity reached a threshold (i.e. whether the cell was transfected or not).

In [None]:
import numpy as np

# Flag each cell: 1 if its mean GFP intensity is >= 1000, 0 otherwise
dd_filtered["transfected"] = np.where(dd_filtered["mean_intensity-1"] >= 1000, 1, 0)

print("Transfection efficiency:\n")

# Take the well names from the data itself
for well in sorted(dd_filtered["well"].unique()):
  in_well = dd_filtered[dd_filtered["well"] == well]
  t = in_well.shape[0] # No. of cells in this well
  p = (in_well["transfected"] == 1).sum() # No. of transfected
  n = (in_well["transfected"] == 0).sum() # No. of non-transfected
  print("Well %s" % (well))
  print("Total cells: %d\nTransfected: %d\nNon-transfected: %d" % (t, p, n))
  print("\nTransfection Efficiency: %.1f %%\n" % (p / t * 100))

In [None]:
sns.boxplot(data=dd_filtered, x="well", y="mean_intensity-1")
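
As a compact alternative to the loop above, the per-well counts can also be obtained with a single `groupby`. This is a minimal sketch on a toy dataframe with invented intensities. It reuses the GFP threshold of 1000, and the `summary` table and its column names are only illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `dd_filtered`; all values are invented for illustration
toy = pd.DataFrame({
    "well": ["D05", "D05", "D05", "D05", "E05", "E05", "E05", "E05"],
    "mean_intensity-1": [300, 1500, 2500, 800, 1200, 400, 600, 900],
})
toy["transfected"] = np.where(toy["mean_intensity-1"] >= 1000, 1, 0)

# Count all cells and transfected cells per well in one pass
summary = toy.groupby("well")["transfected"].agg(total="count", transfected="sum")
summary["efficiency_percent"] = 100 * summary["transfected"] / summary["total"]
print(summary)
```

Because `transfected` is coded as 0/1, its sum is the number of transfected cells and its mean is the transfected fraction.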