# Machine learning - Feature extraction

This demo creates a feature vector for protein fold classification.
The goal is to classify a protein chain as either an all-alpha or an all-beta protein based on its sequence alone. The sequence is split into overlapping n-grams, and a Word2Vec representation of those n-grams serves as the feature vector.

[Word2Vec model](https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec)

[Word2Vec example](https://spark.apache.org/docs/latest/ml-features.html#word2vec)
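
To make the n-gram step concrete, here is a minimal plain-Python sketch of how overlapping n-grams can be derived from a sequence. The helper `overlapping_ngrams` is a hypothetical name used for illustration only; it is not part of mmtfPyspark, and the library's own implementation may differ.

```python
def overlapping_ngrams(sequence, n=2):
    """Return all n-grams of a sequence, each shifted by one residue.

    Hypothetical helper for illustration; not the mmtfPyspark implementation.
    """
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# A short example sequence of 12 residues yields 11 overlapping 2-grams.
ngrams = overlapping_ngrams("MHHHHHHSSGVD", n=2)
print(ngrams)
```

Each 2-gram then plays the role of a "word" and each protein sequence the role of a "sentence" when the Word2Vec model is trained.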

## Imports

In [1]:
from pyspark import SparkConf, SparkContext, SQLContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.sql.types import *
from mmtfPyspark.io import mmtfReader
from mmtfPyspark.webFilters import Pisces
from mmtfPyspark.filters import ContainsLProteinChain
from mmtfPyspark.mappers import StructureToPolymerChains
from mmtfPyspark.datasets import secondaryStructureExtractor
from mmtfPyspark.ml import ProteinSequenceEncoder

## Configure Spark Context

In [2]:
conf = SparkConf() \
            .setMaster("local[*]") \
            .setAppName("MachineLearningFeaturesExtractionDemo")

sc = SparkContext(conf=conf)

## Read MMTF file and create a non-redundant set (<=40% seq. identity) of L-protein chains

In [3]:
pdb = mmtfReader.read_reduced_sequence_file(sc, fraction=0.1) \
                .flatMap(StructureToPolymerChains()) \
                .filter(Pisces(sequenceIdentity=40,resolution=3.0))

Hadoop Sequence file path: MMTF_REDUCED=/home/marshuang80/PDB/reduced/


2174

## Get secondary structure content

In [4]:
data = secondaryStructureExtractor.get_dataset(pdb)

## Define add_protein_fold_type function

In [5]:
def add_protein_fold_type(data, minThreshold, maxThreshold):
    '''
    Adds a column "foldType" with one of four secondary structure classes:
    "alpha", "beta", "alpha+beta", or "other", based on the fraction of alpha/beta content.

    The simplified syntax used in this method relies on two imports:
        from pyspark.sql.functions import when
        from pyspark.sql.functions import col

    Args:
        data (Dataset<Row>): input dataset with alpha, beta composition
        minThreshold (float): below this fraction, a secondary structure type is treated as absent
        maxThreshold (float): above this fraction, a secondary structure type is treated as present
    '''

    return data.withColumn("foldType", \
                           when((col("alpha") > maxThreshold) & (col("beta") < minThreshold), "alpha"). \
                           when((col("beta") > maxThreshold) & (col("alpha") < minThreshold), "beta"). \
                           when((col("alpha") > maxThreshold) & (col("beta") > minThreshold), "alpha+beta"). \
                           otherwise("other")\
                           )

## Classify chains by secondary structure type

In [6]:
data = add_protein_fold_type(data, minThreshold=0.05, maxThreshold=0.15)
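
The decision rule applied above can be checked by hand with a plain-Python mirror of the `when` chain. This is only a sketch for reasoning about the thresholds; the function name `fold_type` and its snake_case parameters are hypothetical, and the conditions are copied from `add_protein_fold_type` as written.

```python
def fold_type(alpha, beta, min_threshold=0.05, max_threshold=0.15):
    """Plain-Python mirror of the when/otherwise chain (illustration only)."""
    if alpha > max_threshold and beta < min_threshold:
        return "alpha"
    if beta > max_threshold and alpha < min_threshold:
        return "beta"
    if alpha > max_threshold and beta > min_threshold:
        return "alpha+beta"
    return "other"


# (alpha, beta) fractions for a few example chains
print(fold_type(0.808917, 0.0))       # mostly helical, no strands
print(fold_type(0.524476, 0.136364))  # helical with some strand content
print(fold_type(0.084416, 0.422078))  # strand-rich, but alpha is not below min_threshold
```

Note that the rule is not symmetric: a chain with high beta content and an alpha fraction between the two thresholds falls into "other" rather than "alpha+beta".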

## Create a Word2Vec representation of the protein sequences

**n = 2**     # create 2-grams 

**windowSize = 25**    # window size of 25 amino acid residues for Word2Vec

**vectorSize = 50**    # dimension of feature vector

In [7]:
encoder = ProteinSequenceEncoder(data)
data = encoder.overlapping_ngram_word2vec_encode(n=2, windowSize=25, vectorSize=50).cache()

data.toPandas().head(5)

| | structureChainId | sequence | alpha | beta | coil | dsspQ8Code | dsspQ3Code | foldType | ngram | features |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2WWE.A | MHHHHHHSSGVDLGTENLYFQSMSIERATILGFSKKSSNLYLIQVT... | 0.414414 | 0.243243 | 0.342342 | XXXXXXXXXXXXXXXXCCCSCSSSEEEEEEEEEETTEEEEEEEEEE... | XXXXXXXXXXXXXXXXCCCCCCCCEEEEEEEEEECCEEEEEEEEEE... | alpha+beta | [MH, HH, HH, HH, HH, HH, HS, SS, SG, GV, VD, D... | [-0.29939163091873366, -0.10455474958178543, 0... |
| 1 | 5H9N.A | AEVTSIPTGCNALSGKIMSGFDANRFFTGDWYLTHSRDSEVPVRCE... | 0.084416 | 0.422078 | 0.493506 | XXCCSCCTTSCCCTTTSCSSCCHHHHSSSEEEEEEESSCCSSCCCE... | XXCCCCCCCCCCCCCCCCCCCCHHHHCCCEEEEEEECCCCCCCCCE... | other | [AE, EV, VT, TS, SI, IP, PT, TG, GC, CN, NA, A... | [-0.2998051074423617, -0.06302618445347874, 0.... |
| 2 | 3U9J.A | MAPFPEEVDVFTAPHWRMKQLVGLYCDKLSKTNFSNNNDFRALLQS... | 0.808917 | 0.0 | 0.191083 | XXXCCGGGCSSHHHHHHHHHHHHHHHHHHHHCCTTSHHHHHHHHHH... | XXXCCHHHCCCHHHHHHHHHHHHHHHHHHHHCCCCCHHHHHHHHHH... | alpha | [MA, AP, PF, FP, PE, EE, EV, VD, DV, VF, FT, T... | [-0.32142191160317957, 0.057224821281742375, 0... |
| 3 | 3U9L.A | MVMSKKIILITGASSGFGRLTAEALAGAGHRVYASMRDIVGRNASN... | 0.524476 | 0.136364 | 0.339161 | XXXXCCEEEESSCSSHHHHHHHHHHHHTTCEEEEEESCTTTTTHHH... | XXXXCCEEEECCCCCHHHHHHHHHHHHCCCEEEEEECCCCCCCHHH... | alpha+beta | [MV, VM, MS, SK, KK, KI, II, IL, LI, IT, TG, G... | [-0.277416939741929, -0.1891483612554638, 0.27... |
| 4 | 3UBK.B | MVMIKLHGASISNYVNKVKLGILEKGLEYEQIRIAPSQEEDFLKIS... | 0.548544 | 0.082524 | 0.368932 | XCCEEEESCTTCHHHHHHHHHHHHHTCCEEEECCCCCCCHHHHTTS... | XCCEEEECCCCCHHHHHHHHHHHHHCCCEEEECCCCCCCHHHHCCC... | alpha+beta | [MV, VM, MI, IK, KL, LH, HG, GA, AS, SI, IS, S... | [-0.2884620371181904, 0.06924909606181796, 0.5... |
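
The `features` column holds one fixed-length vector per chain. Assuming the encoder relies on Spark ML's Word2Vec (as the links at the top of this notebook suggest), that vector is the average of the vectors learned for the chain's n-grams. The sketch below illustrates the averaging idea only; the 3-dimensional embeddings are made-up values, and `average_vector` is a hypothetical helper, not a Spark or mmtfPyspark function.

```python
# Made-up 3-dimensional embeddings for a few 2-grams (illustration only;
# the real model in this demo learns 50-dimensional vectors).
toy_embeddings = {
    "MH": [1.0, 0.0, 2.0],
    "HH": [0.0, 1.0, 2.0],
    "HS": [2.0, 2.0, 2.0],
}


def average_vector(ngrams, embeddings):
    """Average the embedding vectors of a list of n-grams, component by component.

    Assumes every n-gram has an entry in `embeddings`.
    """
    vectors = [embeddings[gram] for gram in ngrams]
    size = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(size)]


features = average_vector(["MH", "HH", "HS"], toy_embeddings)
print(features)  # [1.0, 1.0, 2.0]
```

Because the result is an average, chains of different lengths all map to vectors of the same dimension, which is what a downstream classifier needs.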


## Keep only a subset of relevant fields for further processing

In [8]:
data = data.select(['structureChainId','alpha','beta','coil','foldType','features'])

## Write to parquet file

In [9]:
data.write.format('parquet').save('./features')

## Terminate Spark

In [10]:
sc.stop()