# Input Files Generation for DeepNovoV2

This Jupyter notebook demonstrates the Python module developed to generate input files for DeepNovoV2. The module is designed to handle any MGF file, but relies on results from the Mascot software to annotate spectra with peptide sequences.

In [1]:
# Python libraries.
import numpy as np
import glob
import os
import smbp_deepnovov2_tools as sdt

The `format_mgf_deepnovo` function parses an MGF file and reformats it to meet DeepNovoV2's input requirements.

The `extract_features` function extracts features from an MGF file. It uses an optional Mascot XML results file to annotate each spectrum with a sequence. If multiple sequences are found for one spectrum, the feature is duplicated so that each sequence appears in the features file. Spectra without a sequence are removed, since they cannot be used for training.
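
The duplication and filtering rules described above can be sketched in plain Python. This is only an illustration of the logic, not the module's implementation: the function name `annotate_spectra`, the dictionary layout, and the `scan`, `mz`, `charge` and `seq` keys are all assumptions made for the example.

```python
# Hypothetical sketch: names and data layout are assumptions, not the sdt API.

def annotate_spectra(spectra, matches):
    """Return one feature row per (spectrum, sequence) pair.

    spectra: list of dicts, each with a 'scan' key.
    matches: dict mapping a scan ID to a list of peptide sequences.
    """
    rows = []
    for spectrum in spectra:
        # A spectrum with no match produces no rows, so it is dropped.
        for sequence in matches.get(spectrum['scan'], []):
            row = dict(spectrum)  # copy, so duplicated rows stay independent
            row['seq'] = sequence
            rows.append(row)
    return rows


# Made-up example data: scan 2 has two candidate sequences, scan 3 has none.
spectra = [
    {'scan': '1', 'mz': 500.25, 'charge': 2},
    {'scan': '2', 'mz': 612.80, 'charge': 3},
    {'scan': '3', 'mz': 433.10, 'charge': 2},
]
matches = {'1': ['PEPTIDE'], '2': ['ELVISK', 'EIVLSK']}

for row in annotate_spectra(spectra, matches):
    print(row['scan'], row['seq'])
```

In this toy input, scan 2 ends up in two rows and scan 3 in none, which mirrors the behaviour described for `extract_features`.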

In [2]:
# Extract features.
mgf_files = glob.glob('smbp_data/*.dat.mgf')
for mgf_file in mgf_files:
    print('Extracting features from: {}'.format(mgf_file))

    mgf_file_formatted = '{0}_formatted{1}'.format(*os.path.splitext(mgf_file))
    mascot_file = os.path.splitext(mgf_file)[0]+'.xml'
    features_file = (
        "{0}_features.csv"
        .format(os.path.splitext(mgf_file_formatted)[0])
    )

    # Reformat MGF file to comply with DeepNovoV2.
    sdt.format_mgf_deepnovo(mgf_file, mgf_file_formatted)
    print('Formatted MGF: {}'.format(mgf_file_formatted))

    # Extract features for current MGF file.
    features = sdt.extract_features(
        mgf_file_formatted, mascot_file
    )

    # Save features to CSV, appending _features to the filename.
    features.to_csv(features_file, index=False)
    print('Extracted features: {}\n'.format(features_file))

Extracting features from: smbp_data/F091321.dat.mgf
Formatted MGF: smbp_data/F091321.dat_formatted.mgf
Extracted features: smbp_data/F091321.dat_formatted_features.csv

Extracting features from: smbp_data/F090408.dat.mgf
Formatted MGF: smbp_data/F090408.dat_formatted.mgf
Extracted features: smbp_data/F090408.dat_formatted_features.csv

Extracting features from: smbp_data/F091600.dat.mgf
Formatted MGF: smbp_data/F091600.dat_formatted.mgf
Extracted features: smbp_data/F091600.dat_formatted_features.csv

Extracting features from: smbp_data/F090201.dat.mgf
Formatted MGF: smbp_data/F090201.dat_formatted.mgf
Extracted features: smbp_data/F090201.dat_formatted_features.csv

Extracting features from: smbp_data/F091234.dat.mgf
Formatted MGF: smbp_data/F091234.dat_formatted.mgf
Extracted features: smbp_data/F091234.dat_formatted_features.csv

Extracting features from: smbp_data/F091301.dat.mgf
Formatted MGF: smbp_data/F091301.dat_formatted.mgf
Extracted features: smbp_data/F091301.dat_formatted_
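
The file names printed above follow from the fact that `os.path.splitext` splits at the last dot only, so the `.dat` part stays in the root. The short check below reproduces the path derivation used in the loop for one of the input files; it also shows that the Mascot XML file is expected to sit next to the MGF file under the name `<root>.xml`.

```python
import os

mgf_file = 'smbp_data/F091321.dat.mgf'

# splitext splits at the last dot only, so '.dat' remains part of the root.
root, ext = os.path.splitext(mgf_file)

# Same derivations as in the extraction loop.
mgf_file_formatted = '{0}_formatted{1}'.format(root, ext)
mascot_file = root + '.xml'
features_file = '{0}_features.csv'.format(
    os.path.splitext(mgf_file_formatted)[0]
)

print(mgf_file_formatted)  # smbp_data/F091321.dat_formatted.mgf
print(mascot_file)         # smbp_data/F091321.dat.xml
print(features_file)       # smbp_data/F091321.dat_formatted_features.csv
```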

The `merge_mgf` and `merge_features` functions merge the previously generated MGF and features files. Because scan IDs are renumbered according to the order of the input files, the files must be passed in the same order to both functions.
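
To see why the file order matters when scan IDs are renumbered, consider the toy sketch below. It assumes a simple scheme in which new IDs are consecutive integers assigned in input order; the actual numbering used by the module may differ, and the `renumber` function is hypothetical.

```python
# Hypothetical sketch: assumes new scan IDs are consecutive integers
# assigned in the order the files (and their scans) are given.

def renumber(files_with_scans):
    """Map each (filename, original scan ID) pair to a new scan ID."""
    mapping = {}
    next_id = 1
    for filename, scans in files_with_scans:
        for scan in scans:
            mapping[(filename, scan)] = next_id
            next_id += 1
    return mapping


file_a = ('a_formatted.mgf', ['10', '11'])
file_b = ('b_formatted.mgf', ['10'])

ids_ab = renumber([file_a, file_b])
ids_ba = renumber([file_b, file_a])

# The same spectrum receives a different ID depending on the file order.
print(ids_ab[('a_formatted.mgf', '10')])  # 1
print(ids_ba[('a_formatted.mgf', '10')])  # 2
```

If the MGF files were merged in one order and the features files in another, the renumbered IDs in the two merged outputs would no longer refer to the same spectra. Sorting both file lists, as done in the next cell, is one way to keep them aligned.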

The `partition_feature_file_nodup` function partitions the merged features file into training, validation and testing datasets using the desired ratios.
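
The sketch below illustrates one plausible reading of the `nodup` suffix, namely that all rows sharing a peptide sequence are placed in the same partition so that no sequence appears in more than one dataset. This interpretation is an assumption, as are the function name `partition_nodup`, the `seq` key and the `seed` parameter; the module's own function may work differently.

```python
import random

# Hypothetical sketch: assumes "nodup" means that a peptide sequence
# never appears in more than one partition.

def partition_nodup(rows, prob, seed=0):
    """Split rows into train/valid/test, keeping equal sequences together."""
    rng = random.Random(seed)
    names = ['train', 'valid', 'test']
    assigned = {}  # sequence -> partition name
    parts = {name: [] for name in names}
    for row in rows:
        seq = row['seq']
        if seq not in assigned:
            # First time this sequence is seen: draw its partition.
            assigned[seq] = rng.choices(names, weights=prob)[0]
        parts[assigned[seq]].append(row)
    return parts


# Made-up rows; 'AAK' and 'CCR' each occur twice.
rows = [{'seq': s} for s in ['AAK', 'AAK', 'CCR', 'DDK', 'CCR', 'EEK']]
parts = partition_nodup(rows, prob=[0.8, 0.1, 0.1])

for name, part in parts.items():
    print(name, [row['seq'] for row in part])
```

Note that with this scheme the ratios apply to distinct sequences rather than to rows, so the resulting row counts only approximate the requested proportions.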

In [3]:
# Merge MGF and features files. Keep the same file order by sorting the input filenames.
mgf_formatted_files = sorted(glob.glob('smbp_data/*_formatted.mgf'))
features_files = sorted(glob.glob('smbp_data/*_features.csv'))
sdt.merge_mgf(mgf_formatted_files, 'spectrum_smbp.mgf')
sdt.merge_features(features_files, 'features_smbp.csv')

# Partition features into training, validation and testing sets.
sdt.partition_feature_file_nodup('features_smbp.csv', prob=[0.8, 0.1, 0.1])