# Protein Attribute Prediction Project

This script is part of the final project for the data science course, which was run as a Kaggle competition. The goal is to predict 5 attributes of proteins using the provided training data. Participants are allowed to use any prediction algorithm to achieve the best score.

The code defines a hyperparameter search space for the XGBoost classifier and tunes it with randomized search (`RandomizedSearchCV`). It runs one search per attribute, refits the best estimator found for each attribute on the training data, and generates predictions on the test set.

In [1]:
import numpy as np
import pandas as pd
import xgboost
from sklearn.base import clone
from sklearn.model_selection import RandomizedSearchCV


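def save_predictions(predictions, filename):
    # Stack the per-attribute arrays and transpose: one row per test sample, one column per attribute.
    columns = ['attribute 1', 'attribute 2', 'attribute 3', 'attribute 4', 'attribute 5']
    df = pd.DataFrame(np.array(predictions).T, columns=columns)
    df.to_csv(filename, index_label='Id')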

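def fit_predict(best_params, x_train, labels_train, x_test):
    # For each attribute, refit a fresh clone of its tuned estimator and append the
    # positive-class probabilities to the module-level `predictions` list,
    # which must be defined before this function is called.
    for i in range(len(best_params)):
        model = clone(best_params[i]).fit(x_train, labels_train[i])
        prediction = model.predict_proba(x_test)[:, 1]
        predictions.append(prediction)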
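
`save_predictions` receives one probability array per attribute and transposes them so that each CSV row corresponds to one test sample. A minimal sketch of that reshaping in plain Python follows; the probabilities are made up for illustration and only two attributes are shown instead of five.

```python
# Hypothetical probabilities: one inner list per attribute, one value per test sample.
per_attribute = [
    [0.9, 0.2, 0.4],  # attribute 1
    [0.1, 0.8, 0.5],  # attribute 2
]

# zip(*...) swaps the nesting, mirroring np.array(predictions).T:
# each resulting row holds all attribute probabilities for one sample.
per_sample = [list(row) for row in zip(*per_attribute)]

for sample_id, row in enumerate(per_sample):
    print(sample_id, row)
```

The row index plays the role of the `Id` column written by `to_csv`.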

In [2]:
x_train = np.genfromtxt('protein_features_train.csv', delimiter=',')
y_train = np.genfromtxt('protein_labels_train.csv', delimiter=',')
x_test = np.genfromtxt('protein_features_test.csv', delimiter=',')

Create a separate label vector for each attribute (one per column of `y_train`)

In [3]:
labels_train = [np.copy(y_train[:,i]) for i in range(0, y_train.shape[1])]
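
To make the column split concrete, here is a small sketch on a toy label matrix. The values are invented for illustration, and only two attribute columns are used.

```python
import numpy as np

# Toy label matrix (made up): 3 samples (rows) x 2 attributes (columns).
y = np.array([[1, 0],
              [0, 1],
              [1, 1]])

# One independent 1-D label vector per column, built the same way as labels_train.
labels = [np.copy(y[:, i]) for i in range(y.shape[1])]

print(labels[0])  # labels of the first attribute for all samples
print(labels[1])  # labels of the second attribute for all samples

# np.copy detaches each vector from y, so later changes to y do not leak through.
y[0, 0] = 0
print(labels[0][0])
```

Without `np.copy`, `y[:, i]` would be a view into `y`, and the last print would show the modified value.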

Define hyperparameter search space


In [6]:
params = {
    "learning_rate": [0.008, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15, 18, 20, 21, 24],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
    "n_estimators": [100, 200, 300, 400, 500, 1024, 2048, 2042],
    "subsample": [0.65, 0.7, 0.8, 0.9],
}

Run a randomized search for each attribute and keep the best estimator found for each one
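
As a rough illustration of why a randomized search is used here, the sketch below counts the combinations in the grid and draws five random candidates from it. It is not how `RandomizedSearchCV` is implemented: the fixed seed and the simple draw with replacement via `random.Random.choice` are assumptions of this sketch, and no model is trained.

```python
import math
import random

# Same grid as the search space defined in this notebook.
params = {
    "learning_rate": [0.008, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30],
    "max_depth": [3, 4, 5, 6, 8, 10, 12, 15, 18, 20, 21, 24],
    "min_child_weight": [1, 3, 5, 7],
    "gamma": [0.0, 0.1, 0.2, 0.3, 0.4],
    "colsample_bytree": [0.3, 0.4, 0.5, 0.7],
    "n_estimators": [100, 200, 300, 400, 500, 1024, 2048, 2042],
    "subsample": [0.65, 0.7, 0.8, 0.9],
}

# Number of distinct combinations an exhaustive grid search would have to try.
n_combinations = math.prod(len(values) for values in params.values())
print(n_combinations)  # 215040

# Draw n_iter random candidates, picking one value per hyperparameter.
rng = random.Random(0)  # seed chosen arbitrarily so the draw is repeatable
n_iter = 5              # matches n_iter in the RandomizedSearchCV call
candidates = [
    {name: rng.choice(values) for name, values in params.items()}
    for _ in range(n_iter)
]
for candidate in candidates:
    print(candidate)

# Upper bound on the share of the grid that one search explores.
print(f"{n_iter / n_combinations:.6%}")
```

With only five candidates per attribute, each search covers a tiny part of the grid, so raising `n_iter` is the most direct way to trade run time for a more thorough search.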



In [None]:
classifier = xgboost.XGBClassifier()
random_search = RandomizedSearchCV(classifier, param_distributions=params, n_iter=5,
                                   scoring='roc_auc', n_jobs=-1, cv=5, verbose=3)
# Despite its name, best_params collects the refitted best estimator for each attribute.
best_params = []
for i in range(len(labels_train)):
    random_search.fit(x_train, labels_train[i])
    best_params.append(random_search.best_estimator_)

Fitting 5 folds for each of 5 candidates, totalling 25 fits


In [None]:
predictions = []  # module-level list that fit_predict appends to
fit_predict(best_params, x_train, labels_train, x_test)
save_predictions(predictions, 'predictions.csv')