# Introduction to Machine Learning 2021 - Task 3

Proteins are large molecules whose blueprints are encoded in the DNA of biological organisms. Each protein consists of many amino acids; our protein of interest, for example, consists of a little fewer than 400. Once the protein is created (synthesized), it folds into a 3D structure, which can be seen in Figure 1. Mutations change which amino acids make up the protein and hence affect its shape.

The goal of this task is to classify mutations of a human antibody protein as active (1) or inactive (0) based on the provided mutation information. Under an active mutation the protein retains its original function, whereas an inactive mutation causes the protein to lose its function. The mutations differ from each other by 4 amino acids at 4 respective sites. The sites (locations) of the mutations are fixed. The amino acids at the 4 mutation sites are given as 4-letter combinations, where each letter denotes the amino acid at the corresponding mutation site. Amino acids at all other positions are kept the same and are not provided.

For example, FCDI corresponds to amino acid F (Phenylalanine) at the first site, amino acid C (Cysteine) at the second site, and so on. Figure 2 gives the translation from symbols to amino acid chemical names for interested students. The biological and chemical aspects can be abstracted away when solving this task.
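
As a minimal sketch of the encoding described above, the snippet below splits one 4-letter combination into its per-site letters. The helper `split_sites` and the lookup `AMINO_ACID_NAMES` are hypothetical names introduced only for illustration; the lookup covers just the four letters of the example.

```python
# Hypothetical lookup covering only the four letters used in the example.
AMINO_ACID_NAMES = {
    "F": "Phenylalanine",
    "C": "Cysteine",
    "D": "Aspartic acid",
    "I": "Isoleucine",
}


def split_sites(mutation):
    """Return the amino-acid letter at each of the 4 mutation sites."""
    if len(mutation) != 4:
        raise ValueError(f"expected 4 letters, got {mutation!r}")
    return list(mutation)


# Print one line per mutation site for the example combination.
for site, letter in enumerate(split_sites("FCDI"), start=1):
    print(f"site {site}: {letter} ({AMINO_ACID_NAMES[letter]})")
```

Each site then becomes one categorical feature, which is how the code cells below treat the data.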

In [None]:
#!/usr/bin/env python
# coding: utf-8

import os
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
# required on scikit-learn < 1.0; no longer needed from 1.0 onward
from sklearn.experimental import enable_hist_gradient_boosting  # noqa: F401
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.compose import (make_column_transformer, make_column_selector)
from sklearn.preprocessing import (OrdinalEncoder, MinMaxScaler)

In [None]:
# function to load data and optional features into dataframe:
# one column per mutation site (the amino-acid letter), followed by the
# selected numeric features of the amino acid at each site
def create_dataframe(df, list_of_features, dict_of_aa):
    rows = []
    # the first column of df holds the 4-letter mutation string
    for sequence in df.iloc[:, 0]:
        sites = list(sequence)
        # dict_of_aa[aa] holds one value per entry of list_of_features
        features = [dict_of_aa[aa][k]
                    for aa in sites
                    for k in range(len(list_of_features))]
        rows.append(sites + features)

    return pd.DataFrame(rows)


# function to convert object to category dtype (modifies df in place)
def object_to_category(df):
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].astype('category')
    return df

In [None]:
# load data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
y_train = train_data.pop('Active')

In [None]:
# amino acid chemical and physical features; every list follows the order of aa_order
aa_order = 'ARNDCQEGHILKMFPSTWYV'
phys_chem = ['no', 'pos', 'polar', 'neg', 'polar', 'polar', 'neg', 'polar', 'pos', 'no', 'no',
        'pos', 'no', 'no', 'no', 'polar', 'polar', 'no', 'polar', 'no']
molmass = [15, 100, 58, 59, 47, 72, 73, 1, 81, 57, 57, 72, 75, 91, 42, 31, 45,
        130, 107, 43]
# hydropathy values (these match the Kyte-Doolittle scale)
hydro = [1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5, 3.8, -3.9,
        1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2]
polar = []  # placeholder, currently unused

# include features for training (only hydropathy is used here)
include_features = [hydro]

# create dictionary with amino acids as keys for corresponding features
aa_dict = {aa: [feature[i] for feature in include_features]
           for i, aa in enumerate(aa_order)}
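
To make the lookup concrete, the following sketch builds the same kind of dictionary and assembles the feature row for one sample: the four site letters first, then the hydropathy value of each site's amino acid, in the order `create_dataframe` uses. The helpers `build_aa_dict` and `featurize` are hypothetical names used only for this illustration.

```python
aa_order = "ARNDCQEGHILKMFPSTWYV"
hydro = [1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5, 3.8, -3.9,
         1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2]


def build_aa_dict(order, feature_tables):
    """Map each amino-acid letter to its list of feature values."""
    return {aa: [table[i] for table in feature_tables]
            for i, aa in enumerate(order)}


def featurize(mutation, aa_dict):
    """Return the site letters followed by the per-site numeric features."""
    letters = list(mutation)
    numeric = [value for aa in letters for value in aa_dict[aa]]
    return letters + numeric


row = featurize("FCDI", build_aa_dict(aa_order, [hydro]))
print(row)  # ['F', 'C', 'D', 'I', 2.8, 2.5, -3.5, 4.5]
```

With a single feature table, each sample therefore has 4 categorical and 4 numeric columns; adding a second table to the list would add one more numeric value per site.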

In [None]:
# create dataframe
train_df = create_dataframe(train_data, include_features, aa_dict)
test_df = create_dataframe(test_data, include_features, aa_dict)

In [None]:
# convert feature dtype
X_train = object_to_category(train_df)
X_test = object_to_category(test_df)

In [None]:
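# ordinal encode categorical features for native categorical support;
# categories unseen during fit are encoded as NaN, which the classifier treats as missing
# minmax scale the remaining (continuous) features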
ordinal_encoder = make_column_transformer(
    (OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan),
     make_column_selector(dtype_include='category')),
    remainder=MinMaxScaler())
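
The two preprocessing steps can be illustrated in plain Python. This is a simplified sketch of the idea, not scikit-learn's implementation: categories seen during fitting are mapped to their rank in sorted order, unseen categories become NaN, and continuous values are rescaled to [0, 1] using the minimum and maximum seen during fitting. All function names here are hypothetical.

```python
import math


def fit_ordinal(values):
    """Map each category seen in training to its rank in sorted order."""
    return {cat: float(code) for code, cat in enumerate(sorted(set(values)))}


def transform_ordinal(values, mapping):
    """Encode values; categories not seen in training become NaN."""
    return [mapping.get(v, math.nan) for v in values]


def fit_min_max(values):
    """Remember the smallest and largest training value."""
    return min(values), max(values)


def transform_min_max(values, bounds):
    """Rescale values so the training minimum maps to 0 and the maximum to 1."""
    low, high = bounds
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


mapping = fit_ordinal(["F", "C", "D", "F"])       # C -> 0.0, D -> 1.0, F -> 2.0
encoded = transform_ordinal(["D", "W"], mapping)  # W was never seen in training
bounds = fit_min_max([2.8, 2.5, -3.5, 4.5])
scaled = transform_min_max([-3.5, 4.5, 0.5], bounds)
print(encoded, scaled)
```

Fitting on the training data only and reusing the fitted mapping on the test data is what the pipeline does automatically when the transformer is one of its steps.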

In [None]:
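# specify categorical features for native categorical support
# (the column transformer outputs the categorical columns first, which matches
# X_train because its category columns precede the numeric ones)
categorical_mask = (X_train.dtypes == 'category').to_list()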

# create model pipeline
pipeline = make_pipeline(
    ordinal_encoder,
    HistGradientBoostingClassifier(random_state=42,
                                   max_leaf_nodes=None,
                                   max_iter=500,
                                   min_samples_leaf=21,
                                   categorical_features=categorical_mask))

# 5-fold cross validation, reporting the mean F1 score and two standard deviations
scores = cross_val_score(pipeline, X_train, y_train, cv=5, scoring='f1')
print('{:.6f} (+/-{:.4f}) - F1 5-fold cross validation'.format(scores.mean(),
    scores.std()*2))
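
For reference, the reported metric can be reproduced by hand. The sketch below computes the binary F1 score for the active class (label 1) from true and predicted labels, and summarizes a list of fold scores as the mean and twice the population standard deviation, which is what `scores.std()*2` yields for a NumPy array. The function names and the toy labels are made up for illustration.

```python
from statistics import mean, pstdev


def f1_score_binary(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def summarize(fold_scores):
    """Return the mean and two population standard deviations."""
    return mean(fold_scores), 2 * pstdev(fold_scores)


# Toy labels: 2 true positives, 1 false positive, 1 false negative.
f1 = f1_score_binary([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(round(f1, 4))  # 0.6667
```

F1 ignores true negatives, which makes it a reasonable choice when one class is much rarer than the other and plain accuracy would be dominated by the majority class.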

In [None]:
# train full model
full_model = pipeline.fit(X_train, y_train)
prediction = full_model.predict(X_test)

In [None]:
# create submission file
submission = pd.DataFrame({'status': prediction})
filename = 'submission.csv'
submission.to_csv(os.path.join('.', filename), index=False, header=False)