 ## Feature Engineering
 This section describes the feature engineering method.

 In this study, we extract features from four different views, giving a total of 85 features in four categories.

 The categories are "Expression", "Epigenetic", "Genomic" and "Network".

 On inspection, these features span different orders of magnitude, and some of them have very low variance, so they may be regarded as redundant.
 For this reason, some feature engineering is needed.
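
 Both issues can be checked by tabulating each feature's variance and order of magnitude. The sketch below is only an illustration: `summarize_features` and the toy table are hypothetical and are not part of the original pipeline.

```python
import numpy as np
import pandas as pd

def summarize_features(df):
    """Per-feature variance and order of magnitude (hypothetical helper)."""
    max_abs = df.abs().max()
    # floor(log10(largest absolute value)); all-zero features become NaN
    magnitude = np.floor(np.log10(max_abs.where(max_abs > 0)))
    return pd.DataFrame({"variance": df.var(), "magnitude": magnitude})

# Toy data standing in for the real feature table (values are made up)
toy = pd.DataFrame({
    "a": [0.001, 0.002, 0.003, 0.004],      # very small scale
    "b": [1000.0, 2500.0, 1800.0, 3000.0],  # very large scale
    "c": [1.0, 1.0, 1.0, 1.0],              # constant, zero variance
})
summary = summarize_features(toy)
print(summary)
```

 Features such as `c`, whose variance is zero or close to it, carry little information, while `a` and `b` differ by six orders of magnitude.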

 We used the Laplacian Score to score each feature within its category and then filtered the features by score (a lower score means the feature is more important):
 + Features with a low score are retained.
 + Features with a high score are avoided.

 The Laplacian Score is computed with the Python scikit-feature package, available [here](https://github.com/jundongl/scikit-feature).
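
 To make the score itself concrete, here is a minimal dense NumPy sketch of the standard Laplacian Score formulation. It is not the scikit-feature implementation. The function name, the heat-kernel scaling `exp(-d^2 / t)`, and the handling of constant features are assumptions made for illustration, and the exact values may differ from those the package returns.

```python
import numpy as np

def laplacian_score_sketch(X, k=5, t=1.0):
    """Laplacian Score per feature; a lower score means better locality preservation."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise squared Euclidean distances between samples
    sq = np.sum(X ** 2, axis=1)
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(dist2, np.inf)          # a sample is not its own neighbour
    # k nearest neighbours of every sample, weighted with a heat kernel
    nn = np.argsort(dist2, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = nn.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.exp(-dist2[rows, cols] / t)
    W = np.maximum(W, W.T)                   # symmetrise the kNN graph
    d = W.sum(axis=1)                        # degree of each sample
    L = np.diag(d) - W                       # graph Laplacian
    # Remove the degree-weighted mean of every feature
    F = X - (d @ X) / d.sum()
    num = np.einsum("ij,ij->j", F, L @ F)            # f~^T L f~
    den = np.einsum("ij,ij->j", F, d[:, None] * F)   # f~^T D f~
    # Assumption: (near-)constant features get an infinite score
    return np.divide(num, den, out=np.full_like(num, np.inf), where=den > 1e-12)

# Toy data: feature 0 varies smoothly inside two well-separated clusters,
# feature 1 flips sign between adjacent samples (kept small so that the
# kNN graph is driven by feature 0).
pos = np.concatenate([np.arange(10) * 0.1, 10 + np.arange(10) * 0.1])
flip = 0.001 * np.where(np.arange(20) % 2 == 0, 1.0, -1.0)
X_toy = np.column_stack([pos, flip])
scores = laplacian_score_sketch(X_toy, k=5, t=1.0)
print(scores)
```

 In this toy case the smooth feature receives the lower score, which matches the rule above that low-score features are the ones to retain.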

 ### Code

 First, we should load the data with the category:

In [0]:
import numpy as np
import pandas as pd

def LoadData(DataPath, PositivePath, NegativePath, Seed):
    """
    Load the data file and split it into labeled and unlabeled genes.
    Input: DataPath, PositivePath, NegativePath, and Seed (random seed for the shuffle).
    Output: TrainingData (DataFrame), TrainingLabel (list) and ValidData (DataFrame). Note that the DataFrame indices are not reset.
    """
    Data = pd.read_csv(DataPath)
    PositiveData = pd.read_csv(PositivePath, header=None).iloc[:,0].tolist()
    NegativeData = pd.read_csv(NegativePath, header=None).iloc[:,0].tolist()
    # Generate the Positive Index
    PositiveIndex = []
    for i in range(len(PositiveData)):
        index = Data[Data['Gene_ID'] == PositiveData[i]].index.tolist()
        PositiveIndex.extend(index)
    # Generate the negative index
    NegativeIndex = []
    for i in range(len(NegativeData)):
        index = Data[Data['Gene_ID'] == NegativeData[i]].index.tolist()
        NegativeIndex.extend(index)

    # Generate the data that has the label
    UsefulIndex = PositiveIndex+NegativeIndex
    TrainingData = Data.loc[UsefulIndex, :]  # UsefulIndex holds index labels, so select by label
    ValidData = Data.drop(UsefulIndex)
    ValidName = ValidData['Gene_ID']
    ValidData = ValidData.drop(['Gene_ID'], axis = 1)

    # Generate the training data label
    P_Label = [1 for i in range(len(PositiveIndex))]
    N_Label = [0 for i in range(len(NegativeIndex))]
    TrainingLabel = P_Label + N_Label

    TrainingData = TrainingData.assign(Label = TrainingLabel)
    TrainingData = TrainingData.sample(frac=1, random_state=Seed)

    TrainingLabel = TrainingData['Label'].tolist()
    TrainingName = TrainingData['Gene_ID']
    TrainingData = TrainingData.drop(['Label', 'Gene_ID'], axis=1)


    return TrainingData, TrainingLabel, ValidData

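def GetTypeData(varpath):
    """
    Load the training and validation features of one category.
    varpath: CSV file whose header lists the feature names of that category.
    """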
    DataFilePath = 'OriginalData/Data.csv'
    PositivePath = 'OriginalData/Positive.csv'
    NegativePath = 'OriginalData/Negative.csv'

    ShuffleSeed = 442
    X, y, Valid = LoadData(DataFilePath, PositivePath, NegativePath, ShuffleSeed)
    X = X.fillna(0)
    Valid = Valid.fillna(0)

    _TypeFeatureName = pd.read_csv(varpath).columns.values
    Type_X = X[_TypeFeatureName]
    Type_Valid = Valid[_TypeFeatureName]
    
    return Type_X, Type_Valid

TypeOnePath = "FeatureVariance/Feature_Epigenetic.csv"
TypeOneTrain, TypeOneValid = GetTypeData(TypeOnePath)

TypeTwoPath = "FeatureVariance/Feature_Expression.csv"
TypeTwoTrain, TypeTwoValid = GetTypeData(TypeTwoPath)

TypeThrPath = "FeatureVariance/Feature_Genomic.csv"
TypeThrTrain, TypeThrValid = GetTypeData(TypeThrPath)

TypeFourPath = "FeatureVariance/Feature_Network.csv"
TypeFourTrain, TypeFourValid = GetTypeData(TypeFourPath)


 Then, the Laplacian Score is calculated for each category:

In [0]:
from skfeature.function.similarity_based import lap_score
from skfeature.utility import construct_W
def LapScoreCal(Data):
    Data = np.asarray(Data)
    kwargs_W = {"metric": "euclidean", "neighbor_mode": "knn", "weight_mode": "heat_kernel", "k": 5, 't': 1}
    W = construct_W.construct_W(Data, **kwargs_W)
    LapScore = lap_score.lap_score(Data, W=W)
    RankScore = lap_score.feature_ranking(LapScore)

    return LapScore, RankScore

LapScore_One, RankOne = LapScoreCal(TypeOneTrain)
LapScore_Two, RankTwo = LapScoreCal(TypeTwoTrain)
LapScore_Thr, RankThr = LapScoreCal(TypeThrTrain)
LapScore_Four, RankFour = LapScoreCal(TypeFourTrain)


 We now have the scores and the corresponding feature rankings.
 Plotting them shows how the scores are distributed in each category.

In [0]:
from matplotlib import pyplot as plt
%matplotlib inline
ScoreList = [LapScore_One, LapScore_Two, LapScore_Thr, LapScore_Four]
RankList = [RankOne, RankTwo, RankThr, RankFour]
TypeList = ['Epigenetic', 'Expression', 'Genomic', 'Network']
plt.figure(figsize=(6,5))
for i in range(4):
    plt.subplot(2,2, i+1)
    plt.plot(np.arange(0, len(ScoreList[i])), ScoreList[i], "o")
    plt.title("Laplacian Score\n%s"%(TypeList[i]))
plt.subplots_adjust(hspace=0.13)
plt.tight_layout()
plt.show()

 In each category, the features can be divided into three parts at the two largest differences between scores, as shown below:
 ![LapScore](2.png)
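
 One plausible way to make this split reproducible is to sort the scores and cut at the two largest jumps between consecutive values. The sketch below assumes that reading. `split_by_largest_gaps` and the toy scores are hypothetical, and the original split may have been made by eye from the plot.

```python
import numpy as np

def split_by_largest_gaps(scores, n_gaps=2):
    """Split feature indices into n_gaps + 1 groups at the largest jumps in sorted score."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)        # feature indices, lowest score first
    gaps = np.diff(scores[order])     # jump between consecutive sorted scores
    # Positions of the n_gaps largest jumps, put back in ascending order
    cuts = np.sort(np.argsort(gaps)[-n_gaps:]) + 1
    return np.split(order, cuts)

# Made-up scores for seven features
toy_scores = np.array([0.10, 0.12, 0.55, 0.11, 0.60, 0.95, 0.58])
low, mid, high = split_by_largest_gaps(toy_scores)
print(low, mid, high)
```

 The first group holds the low-score features to retain, and the other two groups are the high-score parts.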

 Simply removing the high-score features is not desirable, because they still carry some information.
 Instead, we integrate the high-score features by taking their mean within each of the two high-score groups, which yields two additional integrated features per category.
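
 The integration step can be sketched as a row-wise mean over each high-score group. The helper name, the `Integrated_` column prefix and the toy columns below are assumptions for illustration, not the original code.

```python
import pandas as pd

def integrate_groups(df, keep, groups, prefix="Integrated"):
    """Keep the low-score columns and append one row-wise mean column per high-score group."""
    out = df[list(keep)].copy()
    for i, cols in enumerate(groups, start=1):
        out[f"{prefix}_{i}"] = df[list(cols)].mean(axis=1)
    return out

# Toy category with one retained feature and two high-score groups
toy = pd.DataFrame({
    "f1": [1.0, 2.0],
    "f2": [3.0, 4.0], "f3": [5.0, 6.0],
    "f4": [7.0, 8.0], "f5": [9.0, 10.0],
})
result = integrate_groups(toy, keep=["f1"], groups=[["f2", "f3"], ["f4", "f5"]])
print(result)
```

 Here five original columns become three: the retained `f1` plus one mean column for each high-score group.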

 In the end, we obtain 53 features: 22 for "Epigenetic", 14 for "Expression", 11 for "Genomic" and 6 for "Network".