## Exploring the Tree of Life

<table>
  <tr>
    <td><img
      src="https://i.guim.co.uk/img/static/sys-images/Guardian/Pix/pictures/2008/04/17/DarwinSketch.article.jpg?width=445&quality=85&auto=format&fit=max&s=c7f89552d12b8495b2b4eb4d7a5bc391"
      alt="A page from Darwin's Notebook B showing his sketch of the tree of life" width="200"><a
      href="https://i.guim.co.uk/img/static/sys-images/Guardian/Pix/pictures/2008/04/17/DarwinSketch.article.jpg?width=445&quality=85&auto=format&fit=max&s=c7f89552d12b8495b2b4eb4d7a5bc391">Source</a>
    </td>
    <td><img src="https://www.greennature.ca/greennature/taxonomy/tree_of_life.png" alt="the tree of life"
             width="300"><a href="https://www.greennature.ca/greennature/taxonomy/tree_of_life.png">Source</a></td>
  </tr>
</table>

In this hands-on exercise, you will answer the following questions by using pandas data structures and methods to analyze the eukaryote genome data stored in the following tab-delimited file: `https://raw.githubusercontent.com/csbfx/advpy122-data/master/euk.tsv`

In [16]:
import pandas as pd

In [17]:
%matplotlib inline
# this input file is tab-delimited instead of comma-delimited
tsvFile = "https://raw.githubusercontent.com/csbfx/advpy122-data/master/euk.tsv"
# Load the tab-delimited file into a DataFrame
euk = pd.read_csv(tsvFile, sep="\t")
euk

| | Species | Kingdom | Class | Size (Mb) | GC% | Number of genes | Number of proteins | Publication year | Assembly status |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Emiliania huxleyi CCMP1516 | Protists | Other Protists | 167.676000 | 64.5 | 38549 | 38554 | 2013 | Scaffold |
| 1 | Arabidopsis thaliana | Plants | Land Plants | 119.669000 | 36.0529 | 38311 | 48265 | 2001 | Chromosome |
| 2 | Glycine max | Plants | Land Plants | 979.046000 | 35.1153 | 59847 | 71219 | 2010 | Chromosome |
| 3 | Medicago truncatula | Plants | Land Plants | 412.924000 | 34.047 | 37603 | 41939 | 2011 | Chromosome |
| 4 | Solanum lycopersicum | Plants | Land Plants | 828.349000 | 35.6991 | 31200 | 37660 | 2010 | Chromosome |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8297 | Saccharomyces cerevisiae | Fungi | Ascomycetes | 3.993920 | 38.2 | - | - | 2017 | Scaffold |
| 8298 | Saccharomyces cerevisiae | Fungi | Ascomycetes | 0.586761 | 38.5921 | 155 | 298 | 1992 | Chromosome |
| 8299 | Saccharomyces cerevisiae | Fungi | Ascomycetes | 12.020400 | 38.2971 | - | - | 2018 | Chromosome |
| 8300 | Saccharomyces cerevisiae | Fungi | Ascomycetes | 11.960900 | 38.2413 | - | - | 2018 | Chromosome |


### Q1. How many Mammals have at least 20,000 genes? What are their scientific names?
*Note:* 
- *Mammals are under Class*
- *Scientific names are under Species*

In [22]:
# Following Jones (2020, pp. 18–19), keep the column names as short as possible.
my_names = [
    "species",
    "kingdom",
    "class",
    "size",
    "gc",
    "genes",
    "proteins",
    "year",
    "status",
]

# https://pandas.pydata.org/docs/user_guide/basics.html#dtypes
# According to Jones (2020) on page 15, we can explicitly set each Series/column
# to a specific data type.
my_types = {
    "species": "string",
    "kingdom": "string",
    "class": "string",
    "genes": "Int64",
    "proteins": "Int64",
    "status": "string",
}

# Following Jones (2020, p. 15), reload `euk` from the same file, this time with
# the short column names and with the `-` placeholders read as missing values.
#
# According to https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html,
# if column names are being explicitly passed, then `header=0` needs to be
# explicitly passed too.
euk = pd.read_csv(
    tsvFile, sep="\t", header=0, names=my_names, dtype=my_types, na_values=["-"]
)

# Jones (2020, pp. 31–34) outlines how to select and filter rows and columns;
# p. 18 shows that the `len` function counts rows.
#
# Mammals with 20,000 or more genes. Note that `len` counts records (rows),
# so a species with several records is counted once per record.
mammals_genes20k = euk[(euk["class"] == "Mammals") & (euk["genes"] >= 20_000)]
print(f"There are {len(mammals_genes20k)} mammals with at least 20,000 genes")

There are 134 mammals with at least 20,000 genes
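
The effect of `na_values` and the explicit `dtype` mapping is easier to see on a tiny table. The following is a minimal sketch, assuming a made-up three-row sample (the gene counts and the `sample_tsv` name are hypothetical, not taken from `euk.tsv`):

```python
import io

import pandas as pd

# Hypothetical sample with a "-" placeholder, mimicking the layout of euk.tsv.
sample_tsv = (
    "Species\tClass\tNumber of genes\n"
    "Homo sapiens\tMammals\t59000\n"
    "Mus musculus\tMammals\t-\n"
    "Gallus gallus\tBirds\t25000\n"
)

# Without na_values, the "-" keeps the column from being parsed as integers.
raw = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(raw["Number of genes"].dtype)

# With na_values and a nullable integer dtype, "-" becomes a missing value.
clean = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    header=0,  # replace the header row in the file with `names`
    names=["species", "class", "genes"],
    dtype={"species": "string", "class": "string", "genes": "Int64"},
    na_values=["-"],
)
print(clean["genes"].dtype)

# Rows with a missing gene count are dropped by the comparison.
big_mammals = clean[(clean["class"] == "Mammals") & (clean["genes"] >= 20_000)]
print(len(big_mammals))
```

Because the comparison against a missing value is itself missing, the filter keeps only rows whose gene count is known and large enough.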


In [23]:
# After filtering, select the `species` column. A column is a Series, and
# `Series.unique()` returns its distinct values with no repeats
# (https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html).
#
# `tolist()` (suggested by my IDE, JetBrains DataSpell) converts the result
# to a plain Python list for cleaner output.
mammals_genes20k["species"].unique().tolist()

['Homo sapiens',
 'Mus musculus',
 'Rattus norvegicus',
 'Felis catus',
 'Bos taurus',
 'Ovis aries',
 'Canis lupus familiaris',
 'Sus scrofa',
 'Ornithorhynchus anatinus',
 'Equus caballus',
 'Pan troglodytes',
 'Macaca mulatta',
 'Monodelphis domestica',
 'Loxodonta africana',
 'Sorex araneus',
 'Erinaceus europaeus',
 'Cavia porcellus',
 'Echinops telfairi',
 'Dasypus novemcinctus',
 'Oryctolagus cuniculus',
 'Pongo abelii',
 'Canis lupus dingo',
 'Papio anubis',
 'Callithrix jacchus',
 'Otolemur garnettii',
 'Ictidomys tridecemlineatus',
 'Nomascus leucogenys',
 'Myotis lucifugus',
 'Pteropus vampyrus',
 'Tursiops truncatus',
 'Microcebus murinus',
 'Dipodomys ordii',
 'Macaca fascicularis',
 'Ochotona princeps',
 'Bubalus bubalis',
 'Galeopterus variegatus',
 'Vicugna pacos',
 'Gorilla gorilla gorilla',
 'Ailuropoda melanoleuca',
 'Cricetulus griseus',
 'Sarcophilus harrisii',
 'Mustela putorius furo',
 'Bos indicus',
 'Odocoileus virginianus texanus',
 'Saimiri boliviensis bolivi

### Q2. Animals are a part of Kingdom. How many records are there for each Class of Animals?

In [19]:
# According to Jones (2020, p. 180), the DataFrame method `groupby()` groups rows
# by the values in a particular column and returns a DataFrameGroupBy object.
# Calling `size()` on that object counts the number of rows in each group.
euk[euk["kingdom"] == "Animals"].groupby("class").size()

class
Amphibians         7
Birds            172
Fishes           282
Flatworms         47
Insects          602
Mammals          658
Other Animals    210
Reptiles          41
Roundworms       162
dtype: int64
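
As a cross-check, the same per-class record counts can be obtained with `value_counts()`. This is a small sketch on a made-up four-row frame (the `toy` data is hypothetical), not a replacement for the cell above:

```python
import pandas as pd

# Hypothetical data using the short column names from Q1.
toy = pd.DataFrame(
    {
        "kingdom": ["Animals", "Animals", "Animals", "Plants"],
        "class": ["Mammals", "Mammals", "Birds", "Land Plants"],
    }
)

animals = toy[toy["kingdom"] == "Animals"]

# groupby + size: one row count per class, sorted by class name.
by_group = animals.groupby("class").size()

# value_counts: the same counts, sorted by frequency unless re-sorted.
by_counts = animals["class"].value_counts().sort_index()

print(by_group.to_dict())
print(by_counts.to_dict())
```

Both approaches count rows, so repeated records for one species each contribute to the total.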

### Q3. Animals are a part of Kingdom. How many unique Species are there for each Class of Animals?

In [20]:
# According to https://saturncloud.io/blog/how-to-extract-column-values-based-on-another-column-in-pandas/#method-3-using-the-groupby-method
# and https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
# the DataFrame method `groupby()` groups rows by one or more columns, after
# which another column can be selected from each group.
#
# `nunique()` then counts the distinct species names in each class.
euk[euk["kingdom"] == "Animals"].groupby("class")["species"].nunique()

class
Amphibians         6
Birds            144
Fishes           218
Flatworms         34
Insects          402
Mammals          313
Other Animals    171
Reptiles          38
Roundworms       116
Name: species, dtype: int64
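
The Q2 and Q3 results differ (for example, 658 Mammals records versus 313 unique Mammals species names) because `size()` counts rows while `nunique()` counts distinct values. A minimal sketch on made-up data (the `toy` frame is hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical data: "Homo sapiens" appears in two records.
toy = pd.DataFrame(
    {
        "class": ["Mammals", "Mammals", "Mammals", "Birds"],
        "species": ["Homo sapiens", "Homo sapiens", "Mus musculus", "Gallus gallus"],
    }
)

grouped = toy.groupby("class")["species"]

records = grouped.size()      # rows per class
distinct = grouped.nunique()  # distinct species names per class

print(records.to_dict())
print(distinct.to_dict())
```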

### Q4. What are the unique scientific names of Mammals with the genus name Macaca?
*Recall: the scientific name starts with genus followed by a space and then the species name. Example: Homo sapiens.*

In [21]:
# According to Jones (2020), string properties can be used in boolean
# expressions for filtering, e.g. `.str.startswith("*string*")`.
#
# `unique()` returns the distinct names themselves, and `tolist()` turns them
# into a plain list for cleaner output. The trailing space in "Macaca " makes
# sure the whole genus name is matched.
euk[(euk["class"] == "Mammals") & (euk["species"].str.startswith("Macaca "))][
    "species"
].unique().tolist()
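
`nunique()` and `unique()` are easy to confuse: the first returns a plain integer count, the second returns the distinct values. A short sketch on a made-up Series (the `species` values here are chosen only for illustration):

```python
import pandas as pd

# Hypothetical Series with one repeated name.
species = pd.Series(
    ["Macaca mulatta", "Macaca mulatta", "Macaca fascicularis"], dtype="string"
)

# nunique() gives how many distinct names there are (an int, no tolist()).
count = species.nunique()

# unique() gives the distinct names, in order of first appearance.
names = species.unique().tolist()

print(count)
print(names)
```

Calling `tolist()` only makes sense on the result of `unique()`, since an integer has no such method.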

### Q5. Modify the Species names to only contain the scientific names and create a new dataframe.

Some of the names in the Species column have more than two parts, such as `Emiliania huxleyi CCMP1516`. Create a new column `Species` that contains only the first two parts of the name, such as `Emiliania huxleyi`. Combine this new Species column with `Kingdom`, `Class`, `Size (Mb)`, `Number of genes`, and `Number of proteins` and store this new dataframe as `df_species`.

Hint: Follow Q3 in Lecture 4 with a little twist. Instead of just getting the first element from the split results, you will get the first two elements using `.str[0:2]` which will give you a list. You can then use `.str.join(" ")` to change it back to a string.

In [None]:
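# Split each name on spaces, keep the first two parts (genus and species),
# and join them back into a single string.
euk["Species"] = euk["species"].str.split(" ").str[0:2].str.join(" ")
# Keep only the requested columns (under the short names assigned in Q1).
df_species = euk[["Species", "kingdom", "class", "size", "genes", "proteins"]]
df_species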
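
The split, slice, and join steps from the hint can be followed one at a time on a small Series. This sketch uses three names that appear earlier in this notebook; the intermediate variable names are only for illustration:

```python
import pandas as pd

names = pd.Series(
    ["Emiliania huxleyi CCMP1516", "Homo sapiens", "Canis lupus familiaris"]
)

parts = names.str.split(" ")        # each element becomes a list of words
first_two = parts.str[0:2]          # keep at most the first two words
binomial = first_two.str.join(" ")  # glue the words back into one string

print(binomial.tolist())
```

One side effect worth keeping in mind: subspecies names such as `Canis lupus familiaris` collapse to `Canis lupus`, so unique counts based on the new column can be lower than counts based on the full names.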

### Q6. Create a pie plot using pandas to show the number of unique Species in each Class of Animals using the new dataframe you created in Q5
Hint: First, create a new dataframe that contains the number of unique Species and the index is the corresponding Animals Class. Then, use that dataframe to plot the pie plot.
[Check out this documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.pie.html)

In [None]:
# Count the unique Species in each Animals Class. The result is a Series
# indexed by Class, which pandas can plot directly.

unique_animal_species = (
    df_species[df_species["kingdom"] == "Animals"]
    .groupby(["class"])["Species"]
    .nunique()
)

unique_animal_species

In [None]:
# Create a pandas pie plot from the Series above
unique_animal_species.plot.pie(
    y="Species",
    title="Unique Animal Species",
    autopct="%1.1f%%",
    explode=[0.07] * len(unique_animal_species),
    startangle=70,
    pctdistance=0.75,
    # labeldistance=.6
    # shadow=True
    # legend=True
    # hatch=['**O', 'oO', 'O.O', '.||.']
)
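
The percentages that `autopct="%1.1f%%"` prints on each slice are each class's share of the total, formatted to one decimal place. A minimal sketch of that arithmetic, assuming made-up counts (the `counts` values are hypothetical, not the Q6 result):

```python
import pandas as pd

# Hypothetical unique-species counts per class.
counts = pd.Series({"Birds": 3, "Fishes": 5, "Mammals": 2}, name="Species")

# Each slice's share of the whole, as a percentage.
shares = counts / counts.sum() * 100

# The same format string that is passed to autopct.
labels = shares.map(lambda pct: "%1.1f%%" % pct)

# One explode offset per slice, sized from the data rather than hard-coded.
explode = [0.07] * len(counts)

print(labels.to_dict())
```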

## Reference(s)

Jones, M. (2020). Biological data exploration with Python, pandas and seaborn. Independently published.