There are several ways to declare and populate a Python pandas DataFrame with lists, dictionaries, and combinations of the two (e.g., a list of dictionaries or a dictionary of lists).

## Declaring a DataFrame from a dictionary of lists

Let's create a DataFrame from a single sample's order-level Proteobacteria amplicon reads.

In [111]:
import pandas as pd

OTUnames=['Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales',
          'Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales',
          'Bacteria;Proteobacteria;Deltaproteobacteria;Syntrophobacterales',
          'Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales',
          'Bacteria;Proteobacteria;Gammaproteobacteria;Methylococcales']

noReads=[3000,6000,500,300,0]

envType=['suboxic']*5

In [112]:
dfFromDictofLists=pd.DataFrame({'OTU Name':OTUnames,'noReads':noReads,'Environment':envType})

Look at the DataFrame by typing into the cell below

In [113]:
dfFromDictofLists

|   | OTU Name | noReads | Environment |
|---|---|---|---|
| 0 | Bacteria;Proteobacteria;Deltaproteobacteria;De... | 3000 | suboxic |
| 1 | Bacteria;Proteobacteria;Gammaproteobacteria;Vi... | 6000 | suboxic |
| 2 | Bacteria;Proteobacteria;Deltaproteobacteria;Sy... | 500 | suboxic |
| 3 | Bacteria;Proteobacteria;Gammaproteobacteria;Ae... | 300 | suboxic |
| 4 | Bacteria;Proteobacteria;Gammaproteobacteria;Me... | 0 | suboxic |


I have declared a new DataFrame using the function `pd.DataFrame()`; inside it is a dictionary in the format column name: column data.

Notice that the row indices are set to the pandas default (0, 1, 2, ...) because no index has been explicitly set.

To declare a DataFrame and designate an index column:

In [114]:
dfFromDictofLists=pd.DataFrame({'noReads':noReads,'Environment':envType},index=OTUnames)

Look at the DataFrame by typing into the cell below

In [115]:
dfFromDictofLists

|   | noReads | Environment |
|---|---|---|
| Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales | 3000 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales | 6000 | suboxic |
| Bacteria;Proteobacteria;Deltaproteobacteria;Syntrophobacterales | 500 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales | 300 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Methylococcales | 0 | suboxic |


The numeric default indices have been replaced by the labels in the list **OTUnames**, and OTUnames is no longer a data column.
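As a quick check of the difference between the two declarations, the index and the columns can be inspected directly. This is a minimal sketch on hypothetical toy data (`names` and `reads` are placeholders, not the lists used above):

```python
import pandas as pd

# Hypothetical toy data standing in for OTUnames and noReads
names = ['otuA', 'otuB', 'otuC']
reads = [10, 0, 5]

# No index given: pandas numbers the rows 0, 1, 2
default_df = pd.DataFrame({'OTU Name': names, 'noReads': reads})

# Index given: the labels replace the default numbering
indexed_df = pd.DataFrame({'noReads': reads}, index=names)

default_index = list(default_df.index)      # [0, 1, 2]
indexed_index = list(indexed_df.index)      # ['otuA', 'otuB', 'otuC']
indexed_columns = list(indexed_df.columns)  # ['noReads']
```

The labels appear only in the index of the second DataFrame, not among its columns.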

## Declaring a DataFrame from a list of lists

Let's reorder our three lists into a list of lists in the format `[OTUname,reads,environment]`. This format is common if prior data processing has been done by iterating through several lists in parallel.

In [116]:
bigList=[]

for count,n in enumerate(OTUnames):
    bigList.append([n,noReads[count],envType[count]])

Look at bigList by typing it into the cell below. bigList is made of lists that each contain one set of OTU name, read count, and environment type

In [117]:
bigList

[['Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales',
  3000,
  'suboxic'],
 ['Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales', 6000, 'suboxic'],
 ['Bacteria;Proteobacteria;Deltaproteobacteria;Syntrophobacterales',
  500,
  'suboxic'],
 ['Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales', 300, 'suboxic'],
 ['Bacteria;Proteobacteria;Gammaproteobacteria;Methylococcales', 0, 'suboxic']]
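The parallel iteration used to build the list of lists could also be written with Python's built-in `zip`. A minimal sketch, using short placeholder lists (`names`, `reads`, `envs`) rather than the notebook's own data:

```python
# Hypothetical placeholder lists, standing in for OTUnames, noReads, envType
names = ['otuA', 'otuB', 'otuC']
reads = [10, 0, 5]
envs = ['suboxic'] * 3

# The enumerate-based loop, as in the cell above
rows_enumerate = []
for count, n in enumerate(names):
    rows_enumerate.append([n, reads[count], envs[count]])

# The same result with zip, which walks the three lists in parallel
rows_zip = [list(row) for row in zip(names, reads, envs)]
```

Both approaches give the same nested list; `zip` simply avoids indexing each list by position.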

In [118]:
dfFromListofLists=pd.DataFrame(bigList,columns=['OTU Name','noReads','Environment'])

Look at the DataFrame by typing it into the cell below

In [119]:
dfFromListofLists

|   | OTU Name | noReads | Environment |
|---|---|---|---|
| 0 | Bacteria;Proteobacteria;Deltaproteobacteria;De... | 3000 | suboxic |
| 1 | Bacteria;Proteobacteria;Gammaproteobacteria;Vi... | 6000 | suboxic |
| 2 | Bacteria;Proteobacteria;Deltaproteobacteria;Sy... | 500 | suboxic |
| 3 | Bacteria;Proteobacteria;Gammaproteobacteria;Ae... | 300 | suboxic |
| 4 | Bacteria;Proteobacteria;Gammaproteobacteria;Me... | 0 | suboxic |


This produces exactly the same DataFrame as our unindexed dfFromDictofLists

Setting OTU Name as the index while declaring this DataFrame would be a pain, because we would need to extract the OTU name from each smaller list in bigList. In this case it is easier to set the index afterwards.

In [120]:
dfFromListofLists.set_index('OTU Name',inplace=True)

With the `set_index()` method, I can choose by name which column of an existing DataFrame to use as the index. The argument `inplace` lets the user choose whether to modify the original DataFrame (True) or leave it unchanged and return a new DataFrame (False).
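To illustrate the two options, here is a small sketch on hypothetical toy rows (`rows` is a placeholder, not bigList). It shows `set_index()` with its default `inplace=False`, and, for comparison, what pulling the labels out at declaration time might look like:

```python
import pandas as pd

# Hypothetical toy rows in the same [name, reads, environment] layout
rows = [['otuA', 10, 'suboxic'], ['otuB', 0, 'suboxic']]
df = pd.DataFrame(rows, columns=['OTU Name', 'noReads', 'Environment'])

# inplace defaults to False: df is untouched and a new DataFrame is returned
indexed = df.set_index('OTU Name')
original_columns = list(df.columns)
indexed_columns = list(indexed.columns)

# Alternative: split each row into label and data before declaring
labels = [row[0] for row in rows]
data = [row[1:] for row in rows]
declared = pd.DataFrame(data, columns=['noReads', 'Environment'], index=labels)
```

Note that `set_index()` keeps the column name as the name of the index, while an index passed at declaration has no name unless one is assigned.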

Look at the DataFrame by typing it into the cell below. It now holds the same data as the indexed version of dfFromDictofLists; the only difference is that its index carries the name OTU Name

In [121]:
dfFromListofLists

| OTU Name | noReads | Environment |
|---|---|---|
| Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales | 3000 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales | 6000 | suboxic |
| Bacteria;Proteobacteria;Deltaproteobacteria;Syntrophobacterales | 500 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales | 300 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Methylococcales | 0 | suboxic |


## Declaring a DataFrame from a list of dictionaries

A big list of dictionaries is a common way to store bioinformatics data. As with the list of lists, each dictionary in the list holds one record at a time, but as key:value pairs in which each key is the intended column header.

Let's first make a list of dictionaries using our original lists

In [122]:
bigListofDicts=[]

for count,n in enumerate(OTUnames):
    d={'OTU Name':n,'noReads':noReads[count],'envType':envType[count]}
    bigListofDicts.append(d)

Look at the list of dictionaries by typing it into the cell below

In [123]:
bigListofDicts

[{'OTU Name': 'Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales',
  'noReads': 3000,
  'envType': 'suboxic'},
 {'OTU Name': 'Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales',
  'noReads': 6000,
  'envType': 'suboxic'},
 {'OTU Name': 'Bacteria;Proteobacteria;Deltaproteobacteria;Syntrophobacterales',
  'noReads': 500,
  'envType': 'suboxic'},
 {'OTU Name': 'Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales',
  'noReads': 300,
  'envType': 'suboxic'},
 {'OTU Name': 'Bacteria;Proteobacteria;Gammaproteobacteria;Methylococcales',
  'noReads': 0,
  'envType': 'suboxic'}]

This format is convenient because pandas will recognize each key as a column header and will know which column to place each value into.

In [124]:
dfFromListofDicts=pd.DataFrame(bigListofDicts)
dfFromListofDicts.set_index('OTU Name',inplace=True)

Look at the indexed DataFrame by typing it into the cell below:

In [125]:
dfFromListofDicts

| OTU Name | noReads | envType |
|---|---|---|
| Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales | 3000 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales | 6000 | suboxic |
| Bacteria;Proteobacteria;Deltaproteobacteria;Syntrophobacterales | 500 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales | 300 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Methylococcales | 0 | suboxic |
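Because columns are matched by key, the dictionaries do not all need the same keys. The sketch below uses hypothetical toy records (not the notebook's data) in which one record lacks an `envType` entry; pandas fills the gap with a missing value (NaN):

```python
import pandas as pd

# Hypothetical toy records; the second one has no 'envType' key
records = [
    {'OTU Name': 'otuA', 'noReads': 10, 'envType': 'suboxic'},
    {'OTU Name': 'otuB', 'noReads': 0},
]

df = pd.DataFrame(records)

# True where a value is missing in the envType column
missing = df['envType'].isna().tolist()  # [False, True]
```

A list of lists has no such safety net: each inner list must already have its values in the right order.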


## The basics of exploring and manipulating DataFrames

The basic indexing structure of a pandas DataFrame is `DataFrame.column[row]` or `DataFrame['columnname'][row]`, where `row` is the row's index label. Rows and columns can also be selected by their zero-based positions, as with lists, using `DataFrame.iloc`. If a DataFrame has an index column, select rows either by label with `DataFrame.loc[index]` or by position with `DataFrame.iloc[rownumber]`.

Specifying a column but no row returns the entire column; specifying a row but no column returns all columns in that row. If a column header contains a space or special characters, it cannot be called by `DataFrame.column` and should be called by `DataFrame['columnname']`.

To print indices, use DataFrame.index.
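The access patterns described above can be tried on a small DataFrame. This is a sketch on hypothetical toy data with the default numeric index:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({'OTU Name': ['otuA', 'otuB'], 'noReads': [10, 0]})

# A header without spaces works with both styles
reads_attr = df.noReads
reads_brackets = df['noReads']

# 'OTU Name' contains a space, so only the bracket style is valid
names = df['OTU Name']

# Row label and column name together select a single value
single = df.loc[1, 'noReads']

# The index labels themselves
index_labels = list(df.index)  # [0, 1]
```

Passing the row label and column name to `.loc` in one call is an alternative to chaining two sets of brackets.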

Running the cell below will print all columns in the 2nd row using the .iloc method

In [126]:
dfFromListofDicts.iloc[1]

noReads       6000
envType    suboxic
Name: Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales, dtype: object

The output includes each column name and its associated value for that entry. Name is the index label of the selected row.

We can also call on this row by index name using .loc:

In [127]:
dfFromListofDicts.loc['Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales']

noReads       6000
envType    suboxic
Name: Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales, dtype: object

Running the cell below will print all rows in the column 'noReads'

In [128]:
dfFromListofDicts['noReads']

OTU Name
Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales      3000
Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales            6000
Bacteria;Proteobacteria;Deltaproteobacteria;Syntrophobacterales     500
Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales           300
Bacteria;Proteobacteria;Gammaproteobacteria;Methylococcales           0
Name: noReads, dtype: int64

pandas returns the column of data itself with the index labels on the left, and at the bottom displays the name of the column and the data type.

If I wanted to know the number of reads only for, say, Gammaproteobacteria entries, I can build a list of the matching index labels and use `.loc` to call on that subset of the noReads column. The ability to conditionally filter and manipulate DataFrames is what makes pandas so powerful for handling large and complex data sets.

In [129]:
gammL=[i for i in dfFromListofDicts.index if "Gammaproteobacteria" in i]

dfFromListofDicts['noReads'].loc[gammL]

OTU Name
Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales        6000
Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales       300
Bacteria;Proteobacteria;Gammaproteobacteria;Methylococcales       0
Name: noReads, dtype: int64
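The same subset can also be selected without building an intermediate list, using the string methods available on the index. A minimal sketch on hypothetical toy labels (shortened stand-ins for the taxonomy strings):

```python
import pandas as pd

# Hypothetical shortened labels and read counts
df = pd.DataFrame(
    {'noReads': [3000, 6000, 0]},
    index=['Deltaproteobacteria;orderA',
           'Gammaproteobacteria;orderB',
           'Gammaproteobacteria;orderC'],
)

# List-comprehension approach, as in the cell above
gamma_labels = [i for i in df.index if 'Gammaproteobacteria' in i]
list_based = df['noReads'].loc[gamma_labels]

# Boolean mask built directly from the index strings
mask = df.index.str.contains('Gammaproteobacteria', regex=False)
gamma_reads = df.loc[mask, 'noReads']
```

`regex=False` treats the search text as a plain substring, which matches the behaviour of Python's `in` test.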

It is a common task in cleaning up bioinformatics datasets to remove OTUs with zero counts. We can first find those entries with an expression that tests a DataFrame column and returns a boolean Series. That boolean Series can then be placed inside square brackets to return only the rows where the condition is true.

Let's go through each part of this expression step by step and assemble the final expression. In the expression below I am testing whether each entry in ['noReads'] is equal to zero. Remember that testing whether something is equal to a value uses `==`. If `=` is used, all values in the DataFrame column will be set to that value!

In [130]:
dfFromListofDicts['noReads']==0

OTU Name
Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales      False
Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales            False
Bacteria;Proteobacteria;Deltaproteobacteria;Syntrophobacterales    False
Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales          False
Bacteria;Proteobacteria;Gammaproteobacteria;Methylococcales         True
Name: noReads, dtype: bool

This returns a boolean Series. The entire expression is placed inside the square brackets with the DataFrame name on the outside to return all rows in the DataFrame that meet the condition

In [131]:
dfFromListofDicts[dfFromListofDicts['noReads']==0]

| OTU Name | noReads | envType |
|---|---|---|
| Bacteria;Proteobacteria;Gammaproteobacteria;Methylococcales | 0 | suboxic |
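Boolean conditions of this kind can be stored in a variable and combined. The sketch below uses hypothetical toy data with a second environment type, which the notebook's sample does not have, to show two conditions joined with `&`:

```python
import pandas as pd

# Hypothetical toy data; 'oxic' is added only for illustration
df = pd.DataFrame(
    {'noReads': [3000, 0, 0],
     'envType': ['suboxic', 'suboxic', 'oxic']},
    index=['otuA', 'otuB', 'otuC'],
)

# A mask is a Series of True/False values, one per row
zero_mask = df['noReads'] == 0
zero_rows = df[zero_mask]

# Each condition needs its own parentheses when combined with &
zero_suboxic = df[(df['noReads'] == 0) & (df['envType'] == 'suboxic')]

# True counts as 1, so summing a mask counts the matching rows
n_zero = int(zero_mask.sum())  # 2
```

The parentheses matter because `&` binds more tightly than `==` in Python.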


Using a similar structure, we can drop the entry from the DataFrame by condition using the pandas `.drop()` method, or simply omit it from the DataFrame using the `~` operator.

The `.drop()` method:

* Basic syntax: `DataFrame.drop(DataFrame[condition].index)`
* The default is `axis=0` (rows)
* To drop columns, use `axis=1`
* By default `inplace=False`
* To alter the DataFrame permanently in place, add `inplace=True` to the arguments. Generally speaking, aside from setting the index, I do not recommend altering any DataFrame in place, because the original data can no longer be accessed

In [132]:
noZerosdfFromListofDicts=dfFromListofDicts.drop(dfFromListofDicts[dfFromListofDicts['noReads']==0].index,axis=0)

Enter the noZeros DataFrame in the cell below to view

In [133]:
noZerosdfFromListofDicts

| OTU Name | noReads | envType |
|---|---|---|
| Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales | 3000 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales | 6000 | suboxic |
| Bacteria;Proteobacteria;Deltaproteobacteria;Syntrophobacterales | 500 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales | 300 | suboxic |
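The `axis` and `inplace` settings listed for `.drop()` can be seen on a small example. This is a sketch on hypothetical toy data, not the notebook's DataFrame:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame(
    {'noReads': [3000, 0], 'envType': ['suboxic', 'suboxic']},
    index=['otuA', 'otuB'],
)

# axis=0 (the default): drop rows by index label
no_zero = df.drop(df[df['noReads'] == 0].index)

# axis=1: drop a column by name
no_env = df.drop('envType', axis=1)

# inplace defaults to False, so the original is unchanged
original_shape = df.shape  # (2, 2)
```

Since `.drop()` works on index labels, it removes every row that carries a matching label.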


The `~` operator:

* basic syntax: df[~(condition)]

In [134]:
noZerosdfFromListofDicts=dfFromListofDicts[~(dfFromListofDicts['noReads']==0)]

Enter the noZeros DataFrame in the cell below to view

In [135]:
noZerosdfFromListofDicts

| OTU Name | noReads | envType |
|---|---|---|
| Bacteria;Proteobacteria;Deltaproteobacteria;Desulfobacterales | 3000 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales | 6000 | suboxic |
| Bacteria;Proteobacteria;Deltaproteobacteria;Syntrophobacterales | 500 | suboxic |
| Bacteria;Proteobacteria;Gammaproteobacteria;Aeromonadales | 300 | suboxic |
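To confirm that the two approaches give the same result, the outputs can be compared with `DataFrame.equals()`. The sketch below uses hypothetical toy data and also includes a third spelling, `!=`, which expresses the negated condition directly:

```python
import pandas as pd

# Hypothetical toy data with unique index labels
df = pd.DataFrame(
    {'noReads': [3000, 6000, 0],
     'envType': ['suboxic'] * 3},
    index=['otuA', 'otuB', 'otuC'],
)

by_drop = df.drop(df[df['noReads'] == 0].index)
by_tilde = df[~(df['noReads'] == 0)]
by_not_equal = df[df['noReads'] != 0]

# equals() checks that values, labels, and dtypes all match
all_match = by_drop.equals(by_tilde) and by_tilde.equals(by_not_equal)
```

With unique index labels, as here, the three results are identical; which form to use is largely a matter of readability.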
