<!-- <div id="toc_container"> -->
<h2>Table of Contents</h2>
<ul class="toc_list">
  <a href="#1&nbsp;&nbsp;General-Analysis-Information">1&nbsp;&nbsp;General Analysis Information</a><br>
    <ul>
    <a href="#1.1&nbsp;&nbsp;Project-summary">1.1&nbsp;&nbsp;Project summary</a><br>
    <a href="#1.2&nbsp;&nbsp;Project-directory">1.2&nbsp;&nbsp;Project directory</a><br>
    </ul>
  <a href="#2&nbsp;&nbsp;Manifest-preparation">2&nbsp;&nbsp;Manifest preparation</a><br>
    <ul>
    <a href="#2.1&nbsp;&nbsp;Metadata-overview">2.1&nbsp;&nbsp;Metadata overview</a><br>
    <a href="#2.2&nbsp;&nbsp;Receipt-selection">2.2&nbsp;&nbsp;Receipt selection</a><br>
    <a href="#2.3&nbsp;&nbsp;Output-manifest">2.3&nbsp;&nbsp;Output manifest</a><br>  
    </ul>
  <a href="#3&nbsp;&nbsp;Sequencing-information">3&nbsp;&nbsp;Sequencing information</a><br>
    <ul>
    <a href="#3.1&nbsp;&nbsp;Yaml-parameters">3.1&nbsp;&nbsp;Yaml parameters</a><br>
    <a href="#3.2&nbsp;&nbsp;Pipeline-run">3.2&nbsp;&nbsp;Pipeline run</a><br>
    </ul>
</ul>

In [2]:
%%bash
pip install git+https://github.com/webermarcolivier/statannot

Collecting git+https://github.com/webermarcolivier/statannot
  Cloning https://github.com/webermarcolivier/statannot to c:\users\slsevilla\appdata\local\temp\pip-req-build-eo7_cj7g
Building wheels for collected packages: statannot
  Building wheel for statannot (setup.py): started
  Building wheel for statannot (setup.py): finished with status 'done'
  Created wheel for statannot: filename=statannot-0.2.3-py3-none-any.whl size=11042 sha256=3dc5493390d22b50dda6349e1c3b4466f55b947794b857e422cab92f83243741
  Stored in directory: C:\Users\slsevilla\AppData\Local\Temp\pip-ephem-wheel-cache-l6j3i86q\wheels\8c\02\05\9e6984dcc76e303ad64b18e7de7a1129c71efb98082cd09bb7
Successfully built statannot
Installing collected packages: statannot
Successfully installed statannot-0.2.3


  Running command git clone -q https://github.com/webermarcolivier/statannot 'C:\Users\slsevilla\AppData\Local\Temp\pip-req-build-eo7_cj7g'


In [3]:
from IPython.display import display
import os.path
import yaml
import pandas as pd
import numpy as np
import glob

<h2 id="1&nbsp;&nbsp;Ge">1&nbsp;&nbsp;General Analysis Information</h2>

<h3 id="1.1&nbsp;&nbsp;Project-summary">1.1&nbsp;&nbsp;Project summary</h3>

AIM 1 is a comparison of fecal extraction methods using fresh-frozen fecal material from human donors and artificial colonies. Samples were extracted with multiple methods, and the 16S rRNA gene was then sequenced on an Illumina MiSeq across three projects, listed below:

- NP0084-MB4
- NP0084-MB5
- NP0084-MB6

<h3 id="1.2&nbsp;&nbsp;Project-summary">1.2&nbsp;&nbsp;Project directory</h3>

In [4]:
with open(r'C:\Program Files\Git\Coding\gmu\dissertation\aim1\notebooks\config.yaml') as file:
    # The FullLoader parameter handles the conversion from YAML
    # scalar values to the Python dictionary format
    configfile = yaml.load(file, Loader=yaml.FullLoader)

In [5]:
data_dir=configfile['data_dir']
analysis_dir=configfile['analysis_dir']
git_dir=configfile['git_dir']
manifest_file=configfile['metadata_manifest']
variable_file=configfile['variable_manifest']
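The cell above assumes that all five keys exist in `config.yaml`; a missing key would raise a `KeyError` partway through. The following is a minimal sketch of a check that could run first. The helper name `missing_keys` and the example dictionary are hypothetical and are not part of the original notebook.

```python
# Keys that the notebook reads from the loaded configuration
REQUIRED_KEYS = [
    "data_dir",
    "analysis_dir",
    "git_dir",
    "metadata_manifest",
    "variable_manifest",
]

def missing_keys(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent from a config dictionary."""
    return [key for key in required if key not in config]

# Hypothetical, incomplete configuration used only to illustrate the check
example_config = {"data_dir": "data", "git_dir": "."}
missing = missing_keys(example_config)
```

In the notebook this could be called as `missing_keys(configfile)`, stopping with a clear message if the returned list is not empty.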

<h2 id="2&nbsp;&nbsp;Manifest-preparation">2&nbsp;&nbsp;Manifest preparation</h3>

<h3 id="2.1&nbsp;&nbsp;Metadata Processing">2.1&nbsp;&nbsp;Metadata Processing</h3>

Review and clean up the manifest before processing.

In [6]:
manifest = pd.read_csv(manifest_file,sep='\t')
manifest.head()

Unnamed: 0,Sample ID,Source-PCR-Plate,Run-ID,Project-ID,CGR-Sample-ID,Sample-Type,Sample-Des,Subject-ID,Reciept,Ext-Company,Ext-Kit,Ext-Robotics,Homo-Status,Homo-Method,Homo-Holder,Mag-Col,IR
0,SC249383,PC04924_E_02,180112_M01354_0104_000000000-BFN3F,NP0084-MB4,SC249383,Ext_Control,art_col,DZ35322,sFEMB-001-R-003,ZymoResearch,96 MagBead DNA Extraction Kit,Manual,Standard,Plate Adaptor,Plate,Magnetic,
1,SC249388,PC04924_A_03,180112_M01354_0104_000000000-BFN3F,NP0084-MB4,SC249388,Ext_Control,art_col,DZ35322,sFEMB-001-R-003,ZymoResearch,96 MagBead DNA Extraction Kit,Manual,Standard,Plate Adaptor,Plate,Magnetic,
2,SC249387-PC04925-G-01,PC04925_G_01,180112_M03599_0134_000000000-BFD9Y,NP0084-MB4,SC249387,Ext_Control,Ext_Blank,Water,sFEMB-001-R-003,ZymoResearch,96 MagBead DNA Extraction Kit,Manual,Standard,Plate Adaptor,Plate,Magnetic,
3,SC249387-PC07578-G-01,PC07578_G_01,180328_M01354_0106_000000000-BFMHC,NP0084-MB5,SC249387,Ext_Control,Ext_Blank,Water,sFEMB-001-R-003,ZymoResearch,96 MagBead DNA Extraction Kit,Manual,Standard,Plate Adaptor,Plate,Magnetic,
4,SC249393,PC04924_F_03,180112_M01354_0104_000000000-BFN3F,NP0084-MB4,SC249393,Ext_Control,Ext_Blank,Water,sFEMB-001-R-003,ZymoResearch,96 MagBead DNA Extraction Kit,Manual,Standard,Plate Adaptor,Plate,Magnetic,


In [7]:
#remove columns that contain only NAs
manifest = manifest.dropna(how='all', axis='columns')

#remove spaces from column names
manifest.columns = manifest.columns.str.replace(' ', '')

#rename variables for simplification
df = pd.read_csv(variable_file,sep='\t',index_col=0)
for i in manifest['Run-ID'].unique():
    manifest.replace({i:df.loc[df.index==i, 'Identifier'].values[0]},inplace=True)
for i in manifest['Ext-Kit'].unique():
    manifest.replace({i:df.loc[df.index==i, 'Identifier'].values[0]},inplace=True)
for i in manifest['Subject-ID'].unique():
    manifest.replace({i:df.loc[df.index==i, 'Identifier'].values[0]},inplace=True)
    
#remove well location
manifest.replace({r'_._..': r''}, inplace=True, regex=True)

#replace _ with - for fastq matching (str.replace returns a new Series, so assign it back)
manifest['SampleID'] = manifest['SampleID'].str.replace('_','-')

#set index
manifest.set_index('SampleID',inplace=True)
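The three renaming loops above look up each original value in the variable file and replace it throughout the manifest. The same idea can be sketched with one dictionary applied only to the columns being renamed. This assumes the variable file is indexed by the original value and has an `Identifier` column, as the loops suggest. Both small frames below are made-up stand-ins, not project data.

```python
import pandas as pd

# Hypothetical stand-in for the variable file: original value -> short identifier
variables = pd.DataFrame(
    {"Identifier": ["SQ-1", "EX-7", "M22"]},
    index=["run_A", "kit_A", "subject_A"],
)

# Hypothetical stand-in for the manifest
toy = pd.DataFrame(
    {
        "Run-ID": ["run_A", "run_A"],
        "Ext-Kit": ["kit_A", "kit_A"],
        "Subject-ID": ["subject_A", "unmapped"],
    }
)

# Build the lookup once, then replace only within the columns that need it
lookup = variables["Identifier"].to_dict()
cols = ["Run-ID", "Ext-Kit", "Subject-ID"]
renamed = toy.copy()
renamed[cols] = renamed[cols].replace(lookup)
```

One difference from the loops: a value with no entry in the lookup is left unchanged here, whereas `.values[0]` on an empty selection raises an `IndexError`.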

In [8]:
#Print the variables for the metadata
for i in manifest.drop(columns=['Source-PCR-Plate','CGR-Sample-ID']).columns:
    print(i)
    print(*manifest[i].unique(), sep = ", ")  
    print('\n')

Run-ID
SQ-1, SQ-2, SQ-3, SQ-4


Project-ID
NP0084-MB4, NP0084-MB5, NP0084-MB6


Sample-Type
Ext_Control, Study, Seq_Control


Sample-Des
art_col, Ext_Blank, H01, Robogut, ZymoBiomics, H02, MSA, Seq_Blank


Subject-ID
M22, BW, H01, M98, Z00, M16, H02, Z10, A00, A01, A02, A03, BN, BP, Z05, Z06, Z11


Reciept
sFEMB-001-R-003, sFEMB-001-R-004, sFEMB-001-R-012, sFEMB-001-R-015, sFEMB-001-R-016, sFEMB-001-R-017, sFEMB-001-R-041, sFEMB-001-R-044, sFEMB-001-R-049, sFEMB-001-R-054, sFEMB-001-R-034, sFEMB-001-R-039, sFEMB-001-R-043, sFEMB-001-R-047, sFEMB-001-R-053, sFEMB-001-R-002, sFEMB-001-R-011, sFEMB-001-R-007, sFEMB-001-R-008, sFEMB-001-R-014, sFEMB-001-R-038, sFEMB-001-R-046, sFEMB-001-R-052, sFEMB-001-R-005, sFEMB-001-R-006, sFEMB-001-R-013, sFEMB-001-R-037, sFEMB-001-R-040, sFEMB-001-R-045, sFEMB-001-R-048, sFEMB-001-R-051, sFEMB-001-R-042, sFEMB-001-R-050, sFEMB-001-R-055, sFEMB-001-R-009, sFEMB-001-R-010, Seq


Ext-Company
ZymoResearch, Qiagen, ThermoFisher, Seq


Ext-Kit
EX-7, EX-3, 

<h3 id="2.2&nbsp;&nbsp;Sample-selection">2.2&nbsp;&nbsp;Sample selection</h3>

Review receipts and sample types to make the final selection.

In [9]:
def sort_list (m,col):
    #return the sorted unique non-missing values of a column
    values = [x for x in m[col].unique() if str(x) != 'nan']
    values.sort()
    return values

In [10]:
print ("Receipt", "Ext Kit", "Homogenization Method", sep = " - ")
for i in sort_list(manifest,'Reciept'):
    temp = manifest[(manifest['Reciept'] == i)]
    col2 = temp['Ext-Kit'].unique()
    col3 = temp['Homo-Status'].unique()
    print (i, col2[0], col3[0],sep=" - ")

Receipt - Ext Kit - Homogenization Method
Seq - Seq - Seq
sFEMB-001-R-002 - EX-1 - Standard
sFEMB-001-R-003 - EX-7 - Standard
sFEMB-001-R-004 - EX-7 - Discard
sFEMB-001-R-005 - EX-4 - Standard
sFEMB-001-R-006 - EX-4 - Standard
sFEMB-001-R-007 - EX-2 - Standard
sFEMB-001-R-008 - EX-2 - Standard
sFEMB-001-R-009 - EX-5 - Standard
sFEMB-001-R-010 - EX-5 - Standard
sFEMB-001-R-011 - EX-1 - Standard
sFEMB-001-R-012 - EX-7 - Discard
sFEMB-001-R-013 - EX-4 - Residual
sFEMB-001-R-014 - EX-2 - Residual
sFEMB-001-R-015 - EX-7 - Standard
sFEMB-001-R-016 - EX-7 - Standard
sFEMB-001-R-017 - EX-7 - Residual
sFEMB-001-R-034 - EX-3 - Altered
sFEMB-001-R-037 - EX-4 - Standard
sFEMB-001-R-038 - EX-2 - Standard
sFEMB-001-R-039 - EX-3 - Standard
sFEMB-001-R-040 - EX-4 - Standard
sFEMB-001-R-041 - EX-7 - Standard
sFEMB-001-R-042 - EX-6 - Standard
sFEMB-001-R-043 - EX-3 - Horizontal
sFEMB-001-R-044 - EX-7 - Horizontal
sFEMB-001-R-045 - EX-4 - SPEX
sFEMB-001-R-046 - EX-2 - SPEX
sFEMB-001-R-047 - EX-3 - SPEX
s

Receipts were removed from the project for the following reasons:

- Receipts that failed sequencing: R-01*, R-004, R-012
- Receipts that were cancelled: R-23 to R-25*
- Receipts that contained residuals: R-013, R-014, R-017
- Receipts that contained altered homogenization: R-034, R-043 to R-055

*Not included in sequencing project, but found in LIMS

In [11]:
#receipt suffixes to drop: failed sequencing, residuals, or altered homogenization
recp_num = ['04','12','13','14','17','34','43','44','45','46','47','48','49','50','51','52','53','54','55']
drop_recp = ["sFEMB-001-R-0" + i for i in recp_num]

#split the manifest into retained and removed samples
is_dropped = manifest['Reciept'].isin(drop_recp)
m_rec = manifest[~is_dropped]
m_rec_rm = manifest[is_dropped]
print("The original number of samples:", manifest.shape[0], "\nThe filtered number of samples:",m_rec.shape[0])

The original number of samples: 474 
The filtered number of samples: 341
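The drop list in the cell above is typed out as two-digit strings. As a sketch of an alternative, the same receipt IDs could be generated from integers and a range with zero-padding, which avoids retyping the run from 43 to 55. The helper name `receipt_ids` and the variable names are hypothetical.

```python
def receipt_ids(numbers, prefix="sFEMB-001-R-", width=3):
    """Return zero-padded receipt IDs for an iterable of integers."""
    return [f"{prefix}{n:0{width}d}" for n in numbers]

# Same receipts as the drop list above: six single numbers plus the range 43 to 55
drop_numbers = [4, 12, 13, 14, 17, 34, *range(43, 56)]
drop_recp_alt = receipt_ids(drop_numbers)
```

Note that `range(43, 56)` excludes its upper bound, so the last generated receipt is 055.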


Samples were removed from the project for the following reasons:

- Samples of the type Biocollective (`Sample-Des` value `H02`) were not used equally across all extraction methods

In [12]:
m_nonbio = m_rec[m_rec['Sample-Des']!="H02"]
print("The original number of samples:", manifest.shape[0], "\nThe final filtered number of samples:",m_nonbio.shape[0])

The original number of samples: 474 
The final filtered number of samples: 311
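One way to check whether a sample type was used by every extraction method is to cross-tabulate sample description against extraction kit and look for zero counts. The sketch below uses made-up rows purely for illustration, not the project manifest. The column names follow the manifest, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the project manifest
toy = pd.DataFrame(
    {
        "Sample-Des": ["H01", "H01", "H01", "H02", "H02"],
        "Ext-Kit": ["EX-1", "EX-2", "EX-3", "EX-1", "EX-1"],
    }
)

# Rows are sample descriptions, columns are kits, cells are sample counts
counts = pd.crosstab(toy["Sample-Des"], toy["Ext-Kit"])

# Sample descriptions with zero samples for at least one kit
uneven = counts.index[(counts == 0).any(axis=1)].tolist()
```

Applied to the notebook's data, the same call would be `pd.crosstab(m_rec['Sample-Des'], m_rec['Ext-Kit'])`.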


In [13]:
%cd {git_dir}
m_nonbio.to_csv(r'manifest/m_clean.txt',header = m_nonbio.columns, sep='\t')
m_save=m_nonbio.copy()

C:\Program Files\Git\Coding\gmu\dissertation\aim1


<h3 id="2.3&nbsp;&nbsp;Project-sample-metadata">2.3&nbsp;&nbsp;Project sample metadata</h3>


In [52]:
def metacount_all(df_in):
    #Print the counts for all of the metadata
    m = df_in.drop(columns=['ExternalID','Project-ID','ExtractionBatchID'],errors='ignore')

    for i in m.columns:
        display(m[i].value_counts().rename_axis(i).to_frame('Number of samples'))

In [53]:
def metacount_run_sub(df_in):
    tmp = df_in[['Run-ID','Subject-ID']].copy()
    tmp = tmp.groupby(['Subject-ID','Run-ID'])
    display(tmp.size().to_frame('Subject-ID').join(tmp.apply(list).apply(pd.Series)))

In [54]:
metacount_all(m_nonbio)

Source-PCR-Plate,Number of samples
PC04924,85
PC04925,79
PC07578,74
PC22190,37
PC22192,36


Run-ID,Number of samples
SQ-1,85
SQ-2,79
SQ-3,74
SQ-4,73


CGR-Sample-ID,Number of samples
SC249448,4
SC249414,3
SC253201,3
SC253190,3
SC249411,3
...,...
SC249358,1
SC552966,1
SC553021,1
SC552973,1


Sample-Type,Number of samples
Study,151
Ext_Control,118
Seq_Control,42


Sample-Des,Number of samples
H01,103
ZymoBiomics,61
Robogut,48
Ext_Blank,40
art_col,29
MSA,20
Seq_Blank,10


Subject-ID,Number of samples
H01,103
M98,48
Z00,41
BW,40
M22,24
Z10,8
A02,5
BP,5
M16,5
BN,5


Reciept,Number of samples
Seq,42
sFEMB-001-R-007,26
sFEMB-001-R-002,24
sFEMB-001-R-006,23
sFEMB-001-R-008,22
sFEMB-001-R-010,21
sFEMB-001-R-005,21
sFEMB-001-R-011,21
sFEMB-001-R-003,18
sFEMB-001-R-009,18


Ext-Company,Number of samples
Qiagen,213
ZymoResearch,47
Seq,42
ThermoFisher,9


Ext-Kit,Number of samples
EX-4,62
EX-2,57
EX-7,47
EX-1,45
Seq,42
EX-5,39
EX-3,10
EX-6,9


Ext-Robotics,Number of samples
KingFisher,128
Manual,96
QIASymphony,45
Seq,42


Homo-Status,Number of samples
Standard,269
Seq,42


Homo-Method,Number of samples
TissueLyzer,128
Vertical,123
Seq,42
Plate Adaptor,18


Homo-Holder,Number of samples
Plate,146
Tubes,123
Seq,42


Mag-Col,Number of samples
Magnetic,220
Column,49
Seq,42


IR,Number of samples
Solution,138
,86
Tablet,45
Seq,42
