**SVM Model for training**
-
This notebook imports direct RNA sequencing data, uses it to train a support vector machine (SVM) classifier, and then tunes the hyperparameters to obtain the final model. It reads two input files: `dataset0.json.gz`, which contains the feature data, and `data.info.labelled`, which contains the m6A labels.

In [2]:
# Load the JSON file, reading each line and storing it as an element in a list
import gzip
import json

json_path = "dataset0.json.gz"

# Load in the feature data
data = []
with gzip.open(json_path, 'rt', encoding='utf-8') as f: 
    for line in f:
        if line.strip():  # skip empty lines
            record = json.loads(line)
            data.append(record)

In [3]:
# Load in the m6A label data
import pandas as pd

data_labels = pd.read_csv("data.info.labelled")

### Aggregation of feature reads data

In [None]:
# Aggregate the per-read values of each record by taking the mean of every feature column
import numpy as np

def aggregation(record):
    aggregates = []

    for transcript_id, value in record.items():
        for pos, value1 in value.items():
            for mers, value2 in value1.items():

                arr = np.array(value2, dtype=float)
                means = arr.mean(axis=0).tolist()
                aggregates.append({
                    "transcript_id": transcript_id,
                    "position": int(pos),
                    "kmer": mers,
                    "features": means
                })
    return aggregates

# For each record, use the function.
parsed_data = []
for line in data:
    parsed_data.extend(aggregation(line))

# The number of aggregated rows should equal the number of lines read from the JSON file
print(len(parsed_data))

121838
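
To make the aggregation step concrete, here is a minimal sketch on a made-up record. The transcript ID `TX_DEMO`, the position and the numeric values are all hypothetical; only the nested shape (transcript, position, 7-mer, list of 9-value reads) follows the loop above.

```python
import numpy as np

# Hypothetical record in the nested shape walked by aggregation():
# {transcript_id: {position: {7-mer: [[9 values per read], ...]}}}
toy_record = {
    "TX_DEMO": {
        "100": {
            "AAGACCA": [
                [0.01, 4.0, 120.0, 0.02, 7.0, 126.0, 0.01, 4.0, 80.0],
                [0.03, 6.0, 124.0, 0.04, 9.0, 128.0, 0.03, 6.0, 82.0],
            ]
        }
    }
}

# Rows are reads, columns are the nine features
reads = np.array(toy_record["TX_DEMO"]["100"]["AAGACCA"], dtype=float)
print(reads.shape)       # (2, 9)

# axis=0 averages over reads, leaving one value per feature column
site_means = reads.mean(axis=0)
print(site_means.shape)  # (9,)
print(site_means.round(3).tolist())
```

However many reads a site has, the result is always one row of nine means, which is why the later `feature_1` to `feature_9` columns have a fixed width.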


In [5]:
# Data manipulation of the feature dataset to make it more readable
df_features = pd.DataFrame(parsed_data)
features_df = pd.DataFrame(df_features["features"].tolist(),
                           columns = [f"feature_{i+1}" for i in range(9)])
new_df = pd.concat([df_features.drop(columns=["features"]), features_df], axis = 1)

In [6]:
df_labels = pd.DataFrame(data_labels)

# Let us join the features and labels dataset together
ndf = pd.merge(new_df,
               df_labels,
               how = 'left',
               left_on = ['transcript_id', 'position'],
               right_on = ['transcript_id', 'transcript_position'])
print(len(ndf))
ndf.head()

121838


Unnamed: 0,transcript_id,position,kmer,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,gene_id,transcript_position,label
0,ENST00000000233,244,AAGACCA,0.008264,4.223784,123.702703,0.009373,7.382162,125.913514,0.007345,4.386989,80.57027,ENSG00000004059,244,0
1,ENST00000000233,261,CAAACTG,0.006609,3.216424,109.681395,0.006813,3.226535,107.889535,0.00771,3.016599,94.290698,ENSG00000004059,261,0
2,ENST00000000233,316,GAAACAG,0.00757,2.940541,105.475676,0.007416,3.642703,98.947027,0.007555,2.087146,89.364324,ENSG00000004059,316,0
3,ENST00000000233,332,AGAACAT,0.01062,6.47635,129.355,0.008632,2.8992,97.8365,0.006102,2.23652,89.154,ENSG00000004059,332,0
4,ENST00000000233,368,AGGACAA,0.010701,6.415051,117.924242,0.011479,5.870303,121.954545,0.010019,4.260253,85.178788,ENSG00000004059,368,0
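
The row count alone does not show whether every feature row actually found a label. A small sketch of one way to check this, using toy frames with hypothetical IDs, relies on the `validate` and `indicator` options of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for the feature and label tables (IDs are hypothetical)
features = pd.DataFrame({
    "transcript_id": ["TX_A", "TX_A", "TX_B"],
    "position": [10, 20, 10],
    "feature_1": [0.1, 0.2, 0.3],
})
labels = pd.DataFrame({
    "transcript_id": ["TX_A", "TX_A"],
    "transcript_position": [10, 20],
    "label": [0, 1],
})

merged = pd.merge(
    features, labels, how="left",
    left_on=["transcript_id", "position"],
    right_on=["transcript_id", "transcript_position"],
    validate="one_to_one",  # raises if a key is duplicated on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Rows marked left_only have no label and would carry NaN into training
n_unlabelled = int((merged["_merge"] == "left_only").sum())
print(len(merged), n_unlabelled)
```

In this toy case one of the three feature rows has no matching label. Applied to the real tables, a count of zero would confirm that the left join lost nothing.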


In [7]:
df_0 = ndf[ndf["label"] == 0]
# Randomly shuffle the rows
df0 = df_0.sample(frac=1, random_state=42).reset_index(drop=True)
df1 = ndf[ndf["label"] == 1]
print(f"Number of records labelled 0: {len(df0)}")
print(f"Number of records labelled 1: {len(df1)}")

Number of records labelled 0: 116363
Number of records labelled 1: 5475


## Performing hyperparameter tuning
I will perform hyperparameter tuning to identify the combination of parameters used in the final model. To keep tuning fast, it runs on a subset of the data that contains fewer "0" records.

In [94]:
# Keep 15,000 "0" records, which is close to a 3:1 ratio of "0" to "1" records
df0_sub = df0.iloc[:15000]
dfs = pd.concat([df0_sub, df1], ignore_index=True)

# Identify the features columns and label column
X = dfs.iloc[:, 3:12]
y = dfs["label"]

# Split the data into training and testing sets in a 4:1 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

In [95]:
# Hyperparameter tuning of a support vector machine (SVM) with GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Note: gamma has no effect on the linear kernel, so each linear candidate
# is refit once per gamma value
param_grid = {
    'C': [1, 10, 100],
    'gamma': [0.0001, 0.001, 0.01],
    'kernel': ['rbf', 'linear']
}

svc = SVC(random_state=42)
model = GridSearchCV(svc, param_grid, refit=True, cv=5, verbose=2)

model.fit(X_train, y_train)
# Best parameters found by GridSearchCV
print("Best Parameters:", model.best_params_)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV] END ......................C=1, gamma=0.0001, kernel=rbf; total time=   8.8s
[CV] END ......................C=1, gamma=0.0001, kernel=rbf; total time=   9.7s
[CV] END ......................C=1, gamma=0.0001, kernel=rbf; total time=   9.0s
[CV] END ......................C=1, gamma=0.0001, kernel=rbf; total time=   8.0s
[CV] END ......................C=1, gamma=0.0001, kernel=rbf; total time=   8.6s
[CV] END ...................C=1, gamma=0.0001, kernel=linear; total time= 1.7min
[CV] END ...................C=1, gamma=0.0001, kernel=linear; total time= 1.5min
[CV] END ...................C=1, gamma=0.0001, kernel=linear; total time= 2.2min
[CV] END ...................C=1, gamma=0.0001, kernel=linear; total time= 2.3min
[CV] END ...................C=1, gamma=0.0001, kernel=linear; total time= 2.4min
[CV] END .......................C=1, gamma=0.001, kernel=rbf; total time=  16.0s
[CV] END .......................C=1, gamma=0.001
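
The grid above crosses `gamma` with the linear kernel, which ignores it, so the 18 candidates contain only 12 distinct models. The sketch below shows one possible way to avoid the repeated fits by passing a list of grids. It also adds a `StandardScaler` step and scores by average precision; both are assumptions that are not part of this notebook, and the data is synthetic. The slow linear fits in the log may be related to unscaled features, but that is not verified here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the nine aggregated features (not the real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=9, weights=[0.75, 0.25], random_state=42)

# Assumption: features are standardised inside the pipeline, so the scaler
# is refit on the training part of every fold
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(random_state=42))])

# One grid per kernel: gamma is only searched where it is used
grid = [
    {"svc__kernel": ["rbf"], "svc__C": [1, 10, 100],
     "svc__gamma": [0.0001, 0.001, 0.01]},
    {"svc__kernel": ["linear"], "svc__C": [1, 10, 100]},
]
n_candidates = len(ParameterGrid(grid))
print(n_candidates)  # 12

search = GridSearchCV(pipe, grid, cv=3, scoring="average_precision")
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Note that `gamma` values tuned on raw features would not carry over to scaled features, so the grid would need to be searched again if scaling were adopted.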

### Solving the problem of imbalanced data
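Now that we have the best combination of parameters, {C = 100, gamma = 0.01, kernel = 'rbf'}, I train the SVM model with these parameters on the training dataset. The model is fitted with `class_weight='balanced'` to compensate for the class imbalance that remains after subsampling.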

In [None]:
# Keep 30,000 "0" records, roughly a 5.5:1 ratio of "0" to "1" records,
# so that the training data is not too imbalanced
df0_sub = df0.iloc[:30000]
print(len(df0_sub))
dfs = pd.concat([df0_sub, df1], ignore_index=True)
print(len(dfs))

30000
35475
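
The subsampling in this notebook takes the first N rows of the shuffled "0" records. As a sketch of the same idea expressed through a target ratio, the helper below is hypothetical (the name `downsample_majority` and its parameters are not from this notebook) and is shown on toy data:

```python
import pandas as pd

def downsample_majority(df, label_col="label", ratio=3.0, seed=42):
    """Keep all '1' rows and at most ratio * n_pos randomly chosen '0' rows."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    n_neg = min(len(neg), int(round(ratio * len(pos))))
    neg_sub = neg.sample(n=n_neg, random_state=seed)
    return pd.concat([neg_sub, pos], ignore_index=True)

# Toy frame with a 10:1 imbalance
toy = pd.DataFrame({"label": [0] * 100 + [1] * 10, "x": range(110)})
balanced = downsample_majority(toy, ratio=3.0)
print(balanced["label"].value_counts().to_dict())  # {0: 30, 1: 10}
```

Expressing the subset as a ratio keeps the class balance stable if the number of "1" records changes between datasets.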


In [None]:
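# Keep the identifier columns (transcript_id, position, kmer) alongside the
# nine features; the feature columns are selected later with iloc[:, 3:12]
X = dfs.iloc[:, :12]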
y = dfs["label"]

In [27]:
# We can split the data into training and testing sets in a 4:1 ratio
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

In [28]:
# We will use the SVM model
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.metrics import classification_report

# Note: SVM training is slow on large datasets

# Train the SVM on the training data with the best parameter values
# found by hyperparameter tuning
svm_model = SVC(kernel='rbf', C = 100, gamma = 0.01, class_weight='balanced', 
                probability = True, random_state=42)
svm_model.fit(X_train.iloc[:, 3:12], y_train)

# After training, evaluate the model on the held-out test set
y_prob = svm_model.predict_proba(X_test.iloc[:, 3:12])
y_pred = svm_model.predict(X_test.iloc[:, 3:12])

# Evaluate
print("SVM Accuracy:", accuracy_score(y_test, y_pred))

SVM Accuracy: 0.8200140944326991


In [29]:
from sklearn.metrics import average_precision_score, f1_score
# Printed in order: ROC AUC, average precision (PR AUC), F1 score
print(roc_auc_score(y_test, y_prob[:, 1]))
print(average_precision_score(y_test, y_prob[:, 1]))
print(f1_score(y_test, y_pred))

0.8721152207001522
0.6115614688967365
0.5678510998307953
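
To put these numbers in context, a short arithmetic sketch of two trivial baselines on the 30,000 : 5,475 subsample follows. It assumes the stratified split keeps roughly the same class ratio in the test set.

```python
# Class counts of the subsample used for the final model
n_neg, n_pos = 30000, 5475
total = n_neg + n_pos

# Accuracy of a classifier that always predicts "0"
majority_accuracy = n_neg / total
print(f"{majority_accuracy:.4f}")  # 0.8457

# Share of "1" records, which is roughly the average precision
# expected from a random ranking
prevalence = n_pos / total
print(f"{prevalence:.4f}")         # 0.1543
```

The reported accuracy of about 0.820 is slightly below the always-"0" baseline of about 0.846, which is plausible for a model fitted with `class_weight='balanced'`, since that setting trades accuracy for recall on the minority class. The average precision of about 0.612, by contrast, is well above the prevalence of about 0.154, so the ranking metrics are the more informative ones here.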


In [31]:
# Save the final model as a .pkl file
import pickle

with open("trained_svm.pkl", "wb") as f:
    pickle.dump(svm_model, f)
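
A saved model can later be restored and used for scoring. The sketch below round-trips a small stand-in model through `pickle` in memory; the synthetic data and the name `demo_model` are assumptions for illustration, not the model trained above.

```python
import pickle

import numpy as np
from sklearn.svm import SVC

# Small synthetic stand-in with nine feature columns
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 9))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = SVC(kernel="rbf", probability=True, random_state=42)
demo_model.fit(X_demo, y_demo)

# Serialise and restore, as pickle.dump / pickle.load would do with a file
blob = pickle.dumps(demo_model)
restored = pickle.loads(blob)

# Probability of class "1" for the first five rows
scores = restored.predict_proba(X_demo[:5])[:, 1]
print(scores.shape)  # (5,)
```

For the real file, the equivalent is `pickle.load` on `trained_svm.pkl` opened in `"rb"` mode. Pickle files should only be loaded from trusted sources, and a model is safest to load with the same scikit-learn version that saved it.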