# `pandas` Quickstart Tutorial
`pandas` is a Python library for manipulating tabular data: selecting whatever subset you want, overwriting values where necessary, and so on. This tutorial covers the basics: previewing the data and some simple data retrieval. In my experience, the elementary functions in `pandas` are easy to learn, but `pandas` (and database curation in general) is hard to master. That's why "database analyst", a person or team dedicated solely to data management and retrieval, is a legitimate occupation in the private sector.
 - Difficulty: Beginner
 - Time: 10 - 15 mins
 - Prerequisites: none
 
If you haven't installed `pandas` yet, install it with pip by running the following in the terminal:

`pip install pandas`

If `pandas` is already installed, pip will simply tell you so.

This tutorial uses a large (n=3942) data set of Flu B Yam sequences. 

In [1]:
# import the library
import pandas as pd

# I usually define the working directory here, 
# so I don't have to type this long name over and over again
path0 = '/users/dten0001/Google Drive/data_archive/FluB/'

In [2]:
# Read the csv into a pandas dataframe:
d0 = pd.read_csv(path0+"master_3942_yam.csv")
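
If you don't have the csv file at hand, you can still follow along with a small stand-in. The sketch below is only an illustration: the three rows, the isolate names, and the variable `toy` are made up, and only a few of the real column names are used. `pd.read_csv` accepts a file-like object as well as a path, so the text can be read straight from memory.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the real file; the values are illustrative only
csv_text = """iso_name,cyear,country,data_source
B/Example/1/2013,2013,Australia,GISAID
B/Example/2/2014,2014,Singapore,GISAID
B/Example/3/2016,2016,Japan,OTHER
"""

# read_csv takes a path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))
print(toy.shape)
print(list(toy.columns))
```

Any command in this tutorial that uses `d0` should work the same way on `toy`, as long as the column it refers to exists.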

## Viewing your Data
The following are some frequently used commands for exploring the data.

In [3]:
# How many rows and columns does your dataframe have?
print(d0.shape)

# What are the column names?
print("Column names:")
print(d0.columns)

(3941, 20)
Column names:
Index(['iso_name', 'iso_id', 'cdate', 'cyear', 'data_source', 'lineage',
       'location', 'continent', 'country', 'state', 'city', 'district', 'HA',
       'MP', 'NA', 'NP', 'NS', 'PA', 'PB1', 'PB2'],
      dtype='object')
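
Alongside `shape` and `columns`, it can help to check what type each column was read as. A minimal sketch on a made-up two-row dataframe (`toy` and its values are assumptions, not part of the real data set):

```python
import pandas as pd

# Made-up dataframe for illustration
toy = pd.DataFrame({
    "iso_name": ["B/Example/1/2013", "B/Example/2/2014"],
    "cyear": [2013, 2014],
})

# shape is a (rows, columns) tuple, so it can be unpacked
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# dtypes lists the data type of every column
print(toy.dtypes)
```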


In [9]:
# Select one column, say, 'HA':
HA_col = list(d0["HA"])
HA_col

['ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAGATCGAATCTGCACTGGGATAACATCTTCAAACTCACCTCATGTGGTCAAAACAGCTACTCAAGGGGAGGTCAATGTGACTGGCGTGATACCACTGACAACAACACCAACAAAATCTTATTTTGCAAATCTCAAAGGAACAAGGACCAGAGGGAAACTATGCCCGGACTGTCTCAACTGCACAGATCTGGATGTGGCCTTGGGCAGGCCAATGTGTGTGGGGACCACACCTTCTGCTAAAGCTTCAATACTCCATGAGGTCAGACCTGTTACATCCGGGTGCTTTCCTATAATGCACGACAGAACAAAAATCAGGCAACTACCCAATCTTCTCAGAGGATATGAAAAGATCAGGTTATCAACCCAAAACGTTATCGATGCAGAAAAAGCACCAGGAGGACCCTACAGACTTGGAACCTCAGGATCTTGCCCTAACGCTACCAGTAAAATCGGATTTTTTGCAACAATGGCTTGGGCTGTCCCAAAG---GACAACTACAAAAATGCAACGAACCCACAAACAGTGGAAGTACCATACATTTGTACAGAAGGGGAAGACCAAATTACTGTTTGGGGGTTCCATTCGGATAACAAAACCCAAATGAAGAGCCTCTATGGAGACTCAAATCCTCAAAAGTTCACCTCATCTGCTAATGGAGTAACCACACATTATGTTTCTCAGATTGGCGACTTCCCAGATCAAACAGAAGACGGAGGACTACCACAAAGCGGCAGAATTGTTGTTGATTACATGATGCAAAAACCTGGGAAAACAGGAACAATTGTCTATCAAAGGGGTGTTTTGTTGCCTCAAAAGGTGTGGTGCGCGAGTGGCAGGAGCAAAGTAATAAAAGGGTCATTGCCTTTAATTGGTGAAGCAGATTGCCTTCATGAAGAATACGGTGGATTAAACAAAAGCAAGCCTTACTACACAGGAAAACATGCAAAAGCCAT

In [5]:
# Or select multiple columns by using a list as input:
d0[["iso_name", "HA", "MP", "NA"]]

Unnamed: 0,iso_name,HA,MP,NA
0,B/Acre/133954_Iec/2015,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...
1,B/Acre/135657_Iec/2015,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...
2,B/Acre/144346_Iec/2016,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...
3,B/Adana/Tr_06/2010,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...
4,B/Afghanistan/0446/2015,ATGAAGGCAATAGTTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCCTATTTCTCACATCAG...
5,B/Afghanistan/1312/2016,ATGAAGGCAATAATTGTACTACTCATGGTAGTAATGTCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...
6,B/Afghanistan/202/2016,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...
7,B/Aichi/23/2015,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...
8,B/Aichi/25/2015,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...
9,B/Aichi/7/2016,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...
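
The two selections above return different kinds of objects: a single column name gives a `Series`, while a list of names gives a `DataFrame`. A small sketch on a made-up dataframe (`toy` and its short sequences are placeholders):

```python
import pandas as pd

# Made-up dataframe; the sequences are placeholders
toy = pd.DataFrame({
    "iso_name": ["a", "b"],
    "HA": ["ATGA", "ATGC"],
    "MP": ["ATGT", "ATGT"],
})

one_col = toy["HA"]                 # a single name returns a Series
as_list = toy["HA"].tolist()        # convert the Series to a plain Python list
two_cols = toy[["iso_name", "HA"]]  # a list of names returns a DataFrame

print(type(one_col).__name__, type(two_cols).__name__)
```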


In [6]:
# Preview the first 10 rows of data.
# If you leave the input parameter empty, the default is 5
d0.head(10)

Unnamed: 0,iso_name,iso_id,cdate,cyear,data_source,lineage,location,continent,country,state,city,district,HA,MP,NA,NP,NS,PA,PB1,PB2
0,B/Acre/133954_Iec/2015,EPI_ISL_195663,16/1/15,2015,GISAID,yam,South America / Brazil,South America,Brazil,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCTTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
1,B/Acre/135657_Iec/2015,EPI_ISL_195665,29/5/15,2015,GISAID,yam,South America / Brazil,South America,Brazil,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
2,B/Acre/144346_Iec/2016,EPI_ISL_247579,21/10/16,2016,GISAID,yam,South America / Brazil,South America,Brazil,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
3,B/Adana/Tr_06/2010,EPI_ISL_246512,24/12/10,2010,GISAID,yam,Asia / Turkey,West Asia,Turkey,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGAACAATTGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCTTATTTTCTCTTCATAGATGTACCCATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAACAACTGTTAAGGGACA...
4,B/Afghanistan/0446/2015,EPI_ISL_217505,26/1/15,2015,GISAID,yam,Asia / Afghanistan,South Asia,Afghanistan,*,*,*,ATGAAGGCAATAGTTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCCTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATCAACACTGGGACAATTGACA...,ATGGCG---GACAACATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACCTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCTTATTTTCTCTTCATAGATGTACCCATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAACAACTGTTAAGGGACA...
5,B/Afghanistan/1312/2016,EPI_ISL_232264,16/1/16,2016,GISAID,yam,Asia / Afghanistan,South Asia,Afghanistan,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAATGTCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
6,B/Afghanistan/202/2016,EPI_ISL_232263,19/1/16,2016,GISAID,yam,Asia / Afghanistan,South Asia,Afghanistan,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
7,B/Aichi/23/2015,EPI_ISL_192546,3/5/15,2015,GISAID,yam,Asia / Japan,East Asia,Japan,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
8,B/Aichi/25/2015,EPI_ISL_211778,19/10/15,2015,GISAID,yam,Asia / Japan,East Asia,Japan,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATGAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
9,B/Aichi/7/2016,EPI_ISL_215682,18/1/16,2016,GISAID,yam,Asia / Japan,East Asia,Japan,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAATACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTCTTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...


You can also view *some* of the dataframe just by entering `d0`. Jupyter will cut off the middle so that your whole screen isn't flooded. Note the number of rows and columns reported all the way at the bottom.

In [7]:
d0

Unnamed: 0,iso_name,iso_id,cdate,cyear,data_source,lineage,location,continent,country,state,city,district,HA,MP,NA,NP,NS,PA,PB1,PB2
0,B/Acre/133954_Iec/2015,EPI_ISL_195663,16/1/15,2015,GISAID,yam,South America / Brazil,South America,Brazil,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCTTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
1,B/Acre/135657_Iec/2015,EPI_ISL_195665,29/5/15,2015,GISAID,yam,South America / Brazil,South America,Brazil,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
2,B/Acre/144346_Iec/2016,EPI_ISL_247579,21/10/16,2016,GISAID,yam,South America / Brazil,South America,Brazil,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
3,B/Adana/Tr_06/2010,EPI_ISL_246512,24/12/10,2010,GISAID,yam,Asia / Turkey,West Asia,Turkey,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGAACAATTGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCTTATTTTCTCTTCATAGATGTACCCATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAACAACTGTTAAGGGACA...
4,B/Afghanistan/0446/2015,EPI_ISL_217505,26/1/15,2015,GISAID,yam,Asia / Afghanistan,South Asia,Afghanistan,*,*,*,ATGAAGGCAATAGTTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCCTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATCAACACTGGGACAATTGACA...,ATGGCG---GACAACATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACCTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCTTATTTTCTCTTCATAGATGTACCCATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAACAACTGTTAAGGGACA...
5,B/Afghanistan/1312/2016,EPI_ISL_232264,16/1/16,2016,GISAID,yam,Asia / Afghanistan,South Asia,Afghanistan,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAATGTCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
6,B/Afghanistan/202/2016,EPI_ISL_232263,19/1/16,2016,GISAID,yam,Asia / Afghanistan,South Asia,Afghanistan,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
7,B/Aichi/23/2015,EPI_ISL_192546,3/5/15,2015,GISAID,yam,Asia / Japan,East Asia,Japan,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
8,B/Aichi/25/2015,EPI_ISL_211778,19/10/15,2015,GISAID,yam,Asia / Japan,East Asia,Japan,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATGAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
9,B/Aichi/7/2016,EPI_ISL_215682,18/1/16,2016,GISAID,yam,Asia / Japan,East Asia,Japan,*,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAATACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTCTTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...


## Data Retrieval
The generic command syntax to select rows, based on the value in a particular column (or columns), is:

`df.loc[df["column_name"] == <something>]`

To select rows whose column value equals a particular value, which can be a string, an integer, or whatever:

`df.loc[df['column_name'] == some_value]`

To select rows whose column value is in an iterable, `values_list`, use `isin`:

`df.loc[df['column_name'].isin(values_list)]`

Combine multiple conditions with `&`. Place parentheses ( ) around each condition, because `&` binds more tightly than comparisons like `==`:

`df.loc[(df['column_name'] == some_value) & (df['other_column'].isin(values_list))]`

To select rows whose column value does not equal some_value, use `!=`:

`df.loc[df['column_name'] != some_value]`

`isin` returns a boolean Series, so to select rows whose value is not in `values_list`, negate the boolean Series using `~`:

`df.loc[~df['column_name'].isin(values_list)]`
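
The patterns above can be tried out on a tiny made-up dataframe before running them on the real data. This is only a sketch: `toy`, its four rows, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up dataframe for illustration
toy = pd.DataFrame({
    "country": ["Australia", "Japan", "Singapore", "Brazil"],
    "cyear": [2013, 2015, 2012, 2016],
})
values_list = ["Australia", "Singapore"]

# Equals a single value
equal = toy.loc[toy["country"] == "Japan"]

# Value is in a list
in_list = toy.loc[toy["country"].isin(values_list)]

# Two conditions combined with &, each in its own parentheses
both = toy.loc[(toy["country"].isin(values_list)) & (toy["cyear"] >= 2013)]

# Not equal to a value
not_equal = toy.loc[toy["country"] != "Japan"]

# Not in a list: negate the boolean Series with ~
not_in_list = toy.loc[~toy["country"].isin(values_list)]

print(both)
```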

You'll usually have to execute these in sequence to get your data. The following cell shows some worked examples, each of which generates a new dataframe, `d1`. We overwrite `d1` over and over again, so feel free to break the cell below into multiple cells if you want to play around and use `d1.shape`, and so on.

In [12]:
# Select only the records from GISAID, and assign it to another dataframe, called d1
d1 = d0.loc[d0["data_source"] == "GISAID"]

# Select only records from Australia, New Zealand, and Singapore
countries = ["Australia", "New Zealand", "Singapore"]
d1 = d0.loc[d0["country"].isin(countries)]

# Select by multiple conditions: say, records from Australia, NZ and SG, from 2012 to 2014
# read_csv usually infers numeric columns on its own, but if cyear was read in as strings,
# we want to convert the collection year column, cyear, to a number (integer).
# pd.to_numeric returns a new Series, so assign the result back to the column:
d0["cyear"] = pd.to_numeric(d0["cyear"])

d1 = d0.loc[(d0["country"].isin(countries)) & (d0["cyear"] <= 2014) & (d0["cyear"] >= 2012)]
d1.head()

Unnamed: 0,iso_name,iso_id,cdate,cyear,data_source,lineage,location,continent,country,state,city,district,HA,MP,NA,NP,NS,PA,PB1,PB2
132,B/Auckland/1/2013,EPI_ISL_145324,10/4/13,2013,GISAID,yam,Oceania / New Zealand / AUCKLAND,Oceania,New Zealand,Auckland,*,*,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCCTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATCAACACTGGGACAATTGACA...,ATGGCG---GACAACATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACCTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCTTATTTTCTCTTCATAGATGTACCCATACAGG...,ATGACATTGGCCAAAATTGAATTGTTAAAACAACTGTTAAGGGACA...
133,B/Auckland/1/2014,EPI_ISL_215448,14/6/14,2014,LIDD,yam,Oceania/New Zealand/New Zealand/Auckland/Centr...,Oceania,New Zealand,New Zealand,Auckland,Central Auckland,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
134,B/Auckland/10/2014,EPI_ISL_215467,1/9/14,2014,LIDD,yam,Oceania/New Zealand/New Zealand/Auckland/East ...,Oceania,New Zealand,New Zealand,Auckland,East Auckland,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
137,B/Auckland/11/2014,EPI_ISL_215455,1/9/14,2014,LIDD,yam,Oceania/New Zealand/New Zealand/Auckland/South...,Oceania,New Zealand,New Zealand,Auckland,South Auckland,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
139,B/Auckland/12/2014,EPI_ISL_215456,5/9/14,2014,LIDD,yam,Oceania/New Zealand/New Zealand/Auckland/South...,Oceania,New Zealand,New Zealand,Auckland,South Auckland,ATGAAGGCAATAATTGTACTACTCATGGTAGTAACATCCAATGCAG...,ATGTCGCTGTTTGGAGACACAATTGCCTACCTGCTTTCATTGACAG...,ATGCTACCTTCAACTATACAAACGTTAACCTTATTTCTCACATCAG...,ATGTCCAACATGGATATTGACGGTATAAACACTGGGACAATCGACA...,ATGGCG---GACAATATGACCACAACACAAATTGAGGTGGGTCCGG...,ATGGATACTTTTATTACAAGAAACTTCCAGACTACAATAATACAAA...,ATGAATATAAATCCGTATTTTCTATTCATAGATGTACCTATACAGG...,ATGACATTGGCTAAAATTGAATTGTTAAAGCAACTGTTAAGGGACA...
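
One thing to watch with `pd.to_numeric`: it returns a new Series and leaves the original column alone, so the result has to be assigned back for the conversion to stick. The sketch below shows this on a made-up column of year strings (`toy` is an assumption), and also uses `between`, which is inclusive at both ends by default, as a shorter way of writing the two-sided year condition.

```python
import pandas as pd

# Made-up column with years stored as strings
toy = pd.DataFrame({"cyear": ["2012", "2013", "2016"]})

# Calling to_numeric without assigning does not change toy
converted = pd.to_numeric(toy["cyear"])
still_strings = isinstance(toy["cyear"].iloc[0], str)

# Assign the result back to the column to make the conversion stick
toy["cyear"] = pd.to_numeric(toy["cyear"])

# between(2012, 2014) keeps 2012 <= cyear <= 2014
in_range = toy.loc[toy["cyear"].between(2012, 2014)]
print(in_range)
```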


## Misc. Commands

In [13]:
# How to see the sizes of different partitions of data, say, by collection year?
d0.groupby(["cyear"]).size()

cyear
1969       1
1987       1
1988       1
1989       1
1990       3
1991       5
1992       3
1993       8
1994      10
1995      16
1996      10
1997      17
1998      11
1999      15
2000      15
2001      42
2002      12
2003       7
2004      86
2005      94
2006      15
2007      97
2008     160
2009       2
2010      34
2011      86
2012     288
2013     388
2014     403
2015     815
2016    1155
2017     140
dtype: int64
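
For a single column, `value_counts` gives the same counts as `groupby(...).size()`, but sorted by count instead of by group label. A small sketch on made-up years (`toy` is an assumption):

```python
import pandas as pd

# Made-up collection years
toy = pd.DataFrame({"cyear": [2015, 2016, 2016, 2014, 2016]})

# groupby + size: counts ordered by the group label (the year)
by_year = toy.groupby("cyear").size()

# value_counts: the same counts, ordered from most to least frequent
by_count = toy["cyear"].value_counts()

print(by_year)
print(by_count)
```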

In [18]:
# For multiple levels of grouping, say, by continent, then country:
d1 = d0.groupby(["continent", "country"]).size()
# Set to a dataframe
d1 = d1.reset_index()
# Give it some nice column names
d1.columns=["continent", "country", "counts"]

# And try to preview it
d1

Unnamed: 0,continent,country,counts
0,Africa,Algeria,2
1,Africa,Burkina Faso,4
2,Africa,Cameroon,1
3,Africa,Congo,11
4,Africa,Cote DIvoire,8
5,Africa,Egypt,1
6,Africa,Ethiopia,1
7,Africa,Ghana,6
8,Africa,Kenya,1
9,Africa,Madagascar,1
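
The reset-and-rename steps can also be done in one go: `reset_index` on a Series takes a `name` argument for the values column. A sketch on a made-up dataframe (`toy` and its rows are assumptions):

```python
import pandas as pd

# Made-up records
toy = pd.DataFrame({
    "continent": ["Africa", "Africa", "Africa", "Oceania"],
    "country": ["Ghana", "Ghana", "Kenya", "Australia"],
})

# size() returns a Series; reset_index(name=...) turns it into a dataframe
# and names the counts column in the same step
counts = toy.groupby(["continent", "country"]).size().reset_index(name="counts")
print(counts)
```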


In [19]:
# Note that if d1 is too big, only the top and bottom bits will be shown. 
# To get around this, increase the maximum number of rows printed out to, say, 500:
pd.set_option('display.max_rows', 500)
# Other options of this sort:
# pd.set_option('display.max_columns', 500)
# pd.set_option('display.width', 1000)

# Now try viewing d1 again:
d1

Unnamed: 0,continent,country,counts
0,Africa,Algeria,2
1,Africa,Burkina Faso,4
2,Africa,Cameroon,1
3,Africa,Congo,11
4,Africa,Cote DIvoire,8
5,Africa,Egypt,1
6,Africa,Ethiopia,1
7,Africa,Ghana,6
8,Africa,Kenya,1
9,Africa,Madagascar,1
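
`pd.set_option` changes the setting for the rest of the session. If only one cell needs the larger limit, `pd.option_context` may be a tidier alternative, since the old value comes back when the `with` block ends. A sketch with a made-up 100-row dataframe (`long_df` is an assumption):

```python
import pandas as pd

# Made-up dataframe that is longer than the default row limit
long_df = pd.DataFrame({"x": range(100)})

default_max = pd.get_option("display.max_rows")

# The option is only changed inside the with block
with pd.option_context("display.max_rows", 500):
    inside_max = pd.get_option("display.max_rows")
    full_text = repr(long_df)  # all 100 rows are rendered here

# Outside the block, the previous value is back
after_max = pd.get_option("display.max_rows")
print(inside_max, after_max)
```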


In [20]:
# How to subsample? Select 200 records at random:
d_sample = d0.sample(n=200, replace=False)

# Or select a percentage, like 20%:
d_sample = d0.sample(frac=0.2, replace=False)
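
Each call to `sample` draws a different random subset. If you need the same subsample every time you rerun the notebook, `sample` takes a `random_state` seed. A sketch on a made-up dataframe (`toy` and the seed value 0 are arbitrary choices):

```python
import pandas as pd

# Made-up dataframe with 50 rows
toy = pd.DataFrame({"x": range(50)})

# The same seed gives the same rows on every run
s1 = toy.sample(n=10, replace=False, random_state=0)
s2 = toy.sample(n=10, replace=False, random_state=0)
print(s1.equals(s2))

# frac works the same way: 20% of 50 rows is 10 rows
s_frac = toy.sample(frac=0.2, replace=False, random_state=0)
print(len(s_frac))
```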

In [None]:
# Write d_sample to a csv:
#d_sample.to_csv(path0+"d_sample.csv")
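
By default, `to_csv` also writes the row index as an extra first column, which comes back as a column called `Unnamed: 0` when the file is read again. Passing `index=False` leaves it out. The sketch below writes to an in-memory buffer instead of a real file, and `toy` is a made-up dataframe:

```python
import io
import pandas as pd

# Made-up dataframe for illustration
toy = pd.DataFrame({"iso_name": ["a", "b"], "cyear": [2015, 2016]})

# Default: the index is written as an unnamed first column
buf_with = io.StringIO()
toy.to_csv(buf_with)
buf_with.seek(0)
reread_with = pd.read_csv(buf_with)

# index=False: only the data columns are written
buf_without = io.StringIO()
toy.to_csv(buf_without, index=False)
buf_without.seek(0)
reread_without = pd.read_csv(buf_without)

print(list(reread_with.columns))
print(list(reread_without.columns))
```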