## Create Filtered Feature Tables

Split the GCMP feature table into three feature tables, one each for mucus, tissue, and skeleton samples.


In [14]:
from qiime2 import Artifact
from qiime2.plugins.feature_table.methods import filter_samples
from qiime2.metadata import Metadata
from os.path import abspath, exists

In [2]:
#import the feature table
#import the sequences (need upper?)
#filter the sequences by the feature table
#filter the sequences and feature table by metadata (needed?)

As in past notebooks, we assume the analysis is structured in directories as follows:
<pre>

analysis_name/
   input/
       input_files (.biom, etc)
   output/
   procedure/
       this_notebook.ipynb
</pre>

#### Set up variables to hold paths for all input files

In [11]:
working_dir = abspath('../')
biom_path = working_dir + '/input/all.biom'
metadata_path = working_dir + '/input/GCMP_EMP_map_r28_no_empty_samples.txt'
seqs_path = working_dir + '/input/all.seqs.fa'
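The later cells write into `output/`, so that directory has to exist before anything is saved. As a minimal sketch, assuming the directory layout shown above, the same paths could be built with `pathlib`, creating `output/` when it is missing. The helper name `build_paths` and the dictionary keys are hypothetical, chosen only for illustration; the file names are the ones used in this notebook.

```python
from pathlib import Path
import tempfile


def build_paths(working_dir):
    """Return the input paths used in this notebook, plus the output directory.

    Hypothetical helper: creates <working_dir>/output if it does not exist.
    """
    base = Path(working_dir)
    output_dir = base / "output"
    # exist_ok=True makes this safe to re-run when output/ is already there
    output_dir.mkdir(parents=True, exist_ok=True)
    return {
        "biom": base / "input" / "all.biom",
        "metadata": base / "input" / "GCMP_EMP_map_r28_no_empty_samples.txt",
        "seqs": base / "input" / "all.seqs.fa",
        "output_dir": output_dir,
    }


# Demonstration in a throwaway directory, so nothing in the real analysis is touched
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = build_paths(tmp)
    output_was_created = demo_paths["output_dir"].is_dir()
```

In the notebook itself the call would be `build_paths('../')`, and `str(...)` can be applied to a path wherever a plain string is expected.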

#### Verify that input files exist at the above paths

Let's run a quick check to make sure we have everything we need:

In [16]:
print("Verifying that all needed starting data files exist.")
for existing_file in [biom_path,metadata_path,seqs_path]:
    if not exists(existing_file):
        raise IOError(f"Required file {existing_file} not found. Please ensure it is in that directory.")
print("Done.")

Verifying that all needed starting data files exist.
Done.


#### Import the GCMP feature table and metadata

In [17]:
#import the biom table as a feature table
GCMP_ft = Artifact.import_data('FeatureTable[Frequency]', biom_path,
                               'BIOMV210Format')

#import the metadata
metadata = Metadata.load(metadata_path)

#### Split the GCMP feature table by compartment

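The GCMP includes samples from coral mucus, tissue, and skeleton, as well as some contextual water and sediment samples. Here we create a separate .qza feature table for each coral compartment by filtering on the `tissue_compartment` metadata column, keeping only samples coded `M` (mucus), `T` (tissue), or `S` (skeleton). Samples with any other value in that column are left out of all three tables.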

In [21]:
#split into mucus, tissue, and skeleton compartments
compartments = ['M', 'T', 'S']
for compartment in compartments:
    where = f"tissue_compartment='{compartment}'"
    GCMP_filtered, = filter_samples(GCMP_ft, metadata=metadata, where=where)
    save_path = working_dir + '/output/' + compartment + "_ft.qza"
    GCMP_filtered.save(save_path)
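The `where` argument is written in SQLite `WHERE`-clause syntax, so the filter can be previewed without QIIME2. The following is only a sketch using Python's built-in `sqlite3` module on made-up data: the sample IDs and the `W` code for a water sample are hypothetical, and the table is a stand-in for the real metadata file, not the way QIIME2 stores it.

```python
import sqlite3

# Toy metadata: sample IDs and the 'W' compartment code are invented for illustration
rows = [("s1", "M"), ("s2", "T"), ("s3", "S"), ("s4", "W"), ("s5", "M")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (sample_id TEXT, tissue_compartment TEXT)")
conn.executemany("INSERT INTO metadata VALUES (?, ?)", rows)


def samples_matching(where):
    """Return the sample IDs selected by a WHERE clause, sorted by ID."""
    query = f"SELECT sample_id FROM metadata WHERE {where} ORDER BY sample_id"
    return [row[0] for row in conn.execute(query)]


# Same clause construction as in the filtering loop above
selected = {
    compartment: samples_matching(f"tissue_compartment='{compartment}'")
    for compartment in ["M", "T", "S"]
}
print(selected)
# → {'M': ['s1', 's5'], 'T': ['s2'], 'S': ['s3']}
```

The hypothetical water sample `s4` matches none of the three clauses, which mirrors how the contextual samples drop out of the per-compartment tables.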

#### Load the GCMP Sequences

QIIME2 requires sequences to be in uppercase. However, the GCMP FASTA file contains a mix of uppercase and lowercase sequences, so we convert the sequences (but not the sequence labels) to uppercase before loading them into QIIME2.

In [22]:
upper_seqs_path = working_dir + '/output/GCMP.fasta'
with open(seqs_path) as infile:
    with open(upper_seqs_path, "w") as outfile:
        for line in infile:
            #careful to preserve the headers so they match the feature table
            if not line.startswith(">"):
                line = line.upper()
            outfile.write(line)
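To see what this conversion does without touching any files, here is a small sketch of the same logic applied to a list of lines. The function name `uppercase_fasta_lines` and the example records are hypothetical, made up for illustration.

```python
def uppercase_fasta_lines(lines):
    """Yield FASTA lines with sequence lines uppercased and header lines unchanged."""
    for line in lines:
        if line.startswith(">"):
            # Header: leave as-is so IDs still match the feature table
            yield line
        else:
            yield line.upper()


# Invented example records with mixed-case sequences
example = [">seq_a lower-case label\n", "acgtNNacgt\n", ">Seq_B\n", "ACgT\n"]
result = list(uppercase_fasta_lines(example))
print(result)
# → ['>seq_a lower-case label\n', 'ACGTNNACGT\n', '>Seq_B\n', 'ACGT\n']
```

Because only lines that do not start with `>` are changed, mixed-case labels such as `Seq_B` pass through untouched.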

Now that we have a file of uppercase sequences, we should be able to import it into QIIME2 and save a .qza artifact.

In [24]:
#now we can import, and save the artifact
GCMP_seqs = Artifact.import_data('FeatureData[Sequence]', upper_seqs_path)
GCMP_seqs.save(working_dir + '/output/GCMP_seqs.qza')

'/Users/jzaneveld/Dropbox/Zaneveld_Lab_Organization/Projects/GCMP_Global_Disease/gcmp_global_disease/analysis/organelle_removal/output/GCMP_seqs.qza'