# Practical Python Programming for Biologists
Author: Dr. Daniel Pass | www.CompassBioinformatics.com

---

In [None]:
from google.colab import drive
drive.mount('/content/drive')

---

# Day 2 Project - More strings & loops - and very messy data

Data is messy, and biological data even more so. Here we have some data on bacterial abundance, collected by some well-meaning scientists, but unfortunately it's a bit of a mess. It is technically in a four-column format like this; however, when you look at the file it's mixed up:

```
| Collector | Percentage abundance | Dominant Phyla | Date |
```

Delimiters:
- Between collected sample records: ```,```
- Between data fields per sample: ```-```
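
As a quick illustration of how the two delimiters work, here is a minimal sketch on a made-up string. The names, numbers and dates in `raw` are invented for this example and are not taken from the real file.

```python
# Made-up text in the same shape as the file (not the real data)
raw = "Smith - 45 - Chloroflexi - Spring,  Jones- 80 -Firmicutes-Summer "

# Split into records on the comma
records = raw.split(",")
print(records)

# Split the first record into its four fields on the hyphen
fields = records[0].split("-")
print(fields)  # note the stray spaces around each value

# .strip() removes whitespace from both ends of a string
print(fields[1].strip())
```

Splitting never removes the surrounding spaces, which is why a separate `.strip()` step is needed later on.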

We want to clean up the data and make some sense out of it. **The objective is to output a count of the number of samples with a high proportion of each phylum.**

1. Look at the text file first so that you know what we are looking at!
2. We will read in the file ```MessyData.txt``` with ```open()``` as one object (it is too mixed-up to read line-by-line), and then split based on the delimiters above. We will learn more about loading files in the IO session.
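
To see why reading the file as one object helps, here is a minimal sketch. It writes a tiny made-up file (the name `toy_messy.txt` and its contents are invented for illustration, not the course data) in which one record is broken across two lines, then compares reading line by line with reading everything at once.

```python
import os
import tempfile

# Made-up content: the Jones record is broken across two lines
toy_text = (
    "Smith - 45 - Bacillus - Spring, Jones - 80 -\n"
    "Firmicutes - Summer, Lee - 12 - Chloroflexi - Autumn"
)

# Write it to a temporary file so the example is self-contained
toy_path = os.path.join(tempfile.mkdtemp(), "toy_messy.txt")
with open(toy_path, "w") as out_file:
    out_file.write(toy_text)

# Line by line: two lines, neither of which is a whole number of records
with open(toy_path) as in_file:
    lines = in_file.readlines()
print(len(lines), "lines")

# As one object: split on the record delimiter instead of on line breaks
with open(toy_path) as in_file:
    data = in_file.read()
records = data.split(",")
print(len(records), "records")
```

The line break is still inside the Jones record after splitting, but because it is whitespace, `.strip()` on each field should take care of it.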

If you want to challenge yourself, try to clean the data first before looking at the guide section below!
I recommend using ```print()``` functions after each step to check the output is as expected.

---

<details>
<summary>Step-by-step guide</summary>

1. First split the data by commas into a new list of ```records``` with the string method ```.split()```
2. Create a new loop to go through your ```records``` list and split each record by ```-``` into the 4 data elements (put the output into a new list too)
3. Create a **2D/nested** loop for your latest list, to remove the whitespace from each element with ```.strip()```. (First go through each record, then through each element. Make sure to keep the elements of each record together!)
4. Create one long list of all the dominant phyla per sample (the third column of the data). Some samples have multiple phyla, so they have to be split again first! Careful here, because you want a basic list, not a list of lists.

</details>
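
The "basic list, not a list of lists" warning in the guide is easiest to see side by side. This is a minimal sketch on made-up, already-cleaned records; the `&` separator between phyla is the one used in the solution cell of this notebook, and everything else in `clean_records` is invented for illustration.

```python
# Made-up, already-cleaned records (not the real data)
clean_records = [
    ["Smith", "45", "Chloroflexi&Bacillus", "Spring"],
    ["Jones", "80", "Firmicutes", "Summer"],
]

nested = []  # built with .append(): one inner list per sample
flat = []    # built with .extend(): one entry per phylum

for record in clean_records:
    phyla = record[2].split("&")
    nested.append(phyla)  # adds the whole list as a single item
    flat.extend(phyla)    # adds each item of the list separately

print(nested)  # [['Chloroflexi', 'Bacillus'], ['Firmicutes']]
print(flat)    # ['Chloroflexi', 'Bacillus', 'Firmicutes']
```

Only the flat version works with `.count("Bacillus")`, because in the nested version the items are lists rather than phylum names.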

---

5. Print out your new, clean data table (a list of lists)!

**Extensions**
1. Calculate the average abundance for each collection date (there are 4 possible dates). Use ```if date_column == ....``` for each one; we'll look at building lists automatically later.
2. Output a clean list of all named phyla from the data column in a list named ```phyla_count```. There may be more than one phylum per sample. The code block at the end will count how often each phylum in the list I've given you appears in yours, and summarise the result.
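
One possible shape for the first extension is sketched below. It is only a sketch under assumptions: the records, the date labels `"Spring"` and `"Summer"`, and the idea that abundance is stored as a plain number in text form are all made up for illustration, so check what the real file contains before copying it.

```python
# Made-up cleaned records: [collector, abundance, phyla, date]
clean_records = [
    ["Smith", "40", "Bacillus", "Spring"],
    ["Jones", "80", "Firmicutes", "Summer"],
    ["Lee", "60", "Chloroflexi", "Spring"],
]

# One list of abundance values per date (hypothetical date labels)
spring_values = []
summer_values = []

for record in clean_records:
    abundance = float(record[1])  # convert text to a number
    date_column = record[3]
    if date_column == "Spring":
        spring_values.append(abundance)
    elif date_column == "Summer":
        summer_values.append(abundance)

# Average = total divided by how many values there are
spring_mean = sum(spring_values) / len(spring_values)
summer_mean = sum(summer_values) / len(summer_values)
print("Spring mean:", spring_mean)  # Spring mean: 50.0
print("Summer mean:", summer_mean)  # Summer mean: 80.0
```

With four dates you would need four lists and four branches, which is exactly the repetition that building lists automatically will remove later.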

In [18]:
# Write your code here
# Read file in as one block because too messy to read line by line
with open("/Day2-Project-MessyData.txt") as inFile:
    data = inFile.read()

records = data.split(",")

clean_records = []

for record in records:
    fields = record.split("-")

    clean_fields = []

    for field in fields:
        clean_field = field.strip()
        clean_fields.append(clean_field)

    clean_records.append(clean_fields)

all_phyla = []

for record in clean_records:
    phyla = record[2].split("&")
    all_phyla += phyla

print(all_phyla)


['Chloroflexi', 'Chloroflexi', 'Acidobacteria', 'Chloroflexi', 'Acidobacteria', 'Chloroflexi', 'Chloroflexi', 'Bacillus', 'Actinomycetes', 'Actinomycetes', 'Bacillus', 'Actinomycetes', 'Bacillus', 'Acidobacteria', 'Acidobacteria', 'Actinomycetes', 'Acidobacteria', 'Chloroflexi', 'Chloroflexi', 'Firmicutes', 'Chloroflexi', 'Acidobacteria', 'Firmicutes', 'Acidobacteria', 'Proteobacteria', 'Acidobacteria', 'Proteobacteria', 'Acidobacteria', 'Firmicutes', 'Proteobacteria', 'Acidobacteria', 'Firmicutes', 'Cyanobacteria', 'Cyanobacteria', 'Bacillus', 'Chloroflexi', 'Cyanobacteria', 'Bacillus', 'Chloroflexi', 'Cyanobacteria', 'Bacillus', 'Proteobacteria', 'Proteobacteria', 'Bacillus', 'Proteobacteria', 'Bacillus', 'Acidobacteria', 'Proteobacteria', 'Bacillus', 'Actinomycetes', 'Acidobacteria', 'Cyanobacteria', 'Cyanobacteria', 'Acidobacteria', 'Cyanobacteria', 'Acidobacteria', 'Cyanobacteria', 'Cyanobacteria', 'Actinomycetes', 'Cyanobacteria', 'Actinomycetes', 'Bacillus', 'Bacillus', 'Firmicu

In [21]:
# Name your final clean list of all phyla "phyla_count", then test it with this code block
phyla_count = all_phyla  # the list built in the cell above was named all_phyla

phyla = ['Actinomycetes', 'Proteobacteria', 'Cyanobacteria', 'Firmicutes', 'Chloroflexi', 'Acidobacteria', 'Bacillus']

print("Phylum\t\tCount")
for p in phyla:
    print(p, "\t", phyla_count.count(p))


Phylum		Count
Actinomycetes 	 17
Proteobacteria 	 30
Cyanobacteria 	 26
Firmicutes 	 24
Chloroflexi 	 28
Acidobacteria 	 22
Bacillus 	 34
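
As an aside, the same tally can be produced without writing out the list of phylum names by hand. This is an alternative sketch, not the method used in the cell above: it relies on `collections.Counter` from the standard library, and the short `all_phyla` list here is made up so the example runs on its own.

```python
from collections import Counter

# Made-up stand-in for the real list of phyla
all_phyla = ["Bacillus", "Firmicutes", "Bacillus", "Chloroflexi"]

# Counter tallies how many times each distinct item appears
counts = Counter(all_phyla)

# most_common() yields (item, count) pairs, highest count first
for phylum, n in counts.most_common():
    print(f"{phylum:<16}{n}")
```

A side benefit is that unexpected names (for example a misspelt phylum) show up in the output instead of being silently ignored.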
