# Predicting Protein Expression from Gene Sequences

This Jupyter notebook presents a Python script that predicts protein expression levels from gene sequences. The dataset pairs gene sequences with measured protein expression. A separate script handles feature extraction from the gene sequences; the main script shown here covers data preprocessing, feature selection, model evaluation, and application to new data.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from scipy.stats import spearmanr

##### **Data Preprocessing**

Clean the data, extract the feature columns, shuffle the rows, and standardize the features.

In [12]:
data = pd.read_excel('new_known_dataset.xlsx')
feature_cols = data.columns[6:]
np.random.seed(42)
data = data.sample(frac=1).reset_index(drop=True)

data['PA'] = data['PA'].str.replace("'", '')
data['PA'] = pd.to_numeric(data['PA'])

scaler = StandardScaler()
data.iloc[:, 6:] = scaler.fit_transform(data.iloc[:, 6:])
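
The two cleaning steps above can be tried on a small example. This is a minimal sketch on a hypothetical toy table (the column names `f1` and `f2` and all values are made up for illustration, not taken from the dataset): it strips stray apostrophes from `PA`, converts the column to numbers, and standardizes the feature columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table: PA stored as text with stray apostrophes,
# plus two made-up feature columns.
toy = pd.DataFrame({
    "PA": ["'1.5", "2.25'", "3"],
    "f1": [1.0, 2.0, 3.0],
    "f2": [10.0, 20.0, 60.0],
})

# Strip the apostrophes, then convert the strings to numbers.
toy["PA"] = pd.to_numeric(toy["PA"].str.replace("'", "", regex=False))

# Standardize each feature column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(toy[["f1", "f2"]])

print(toy["PA"].tolist())
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each feature column has a mean of approximately 0 and a standard deviation of approximately 1, which puts features with different units on a comparable scale for the linear model.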

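Data splitting: 40% of the rows for training, 40% for validation, and the remaining rows (about 20%) for testing.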


In [13]:
train_size = int(0.4 * len(data))
val_size = int(0.4 * len(data))
test_size = int(0.2 * len(data))

train_data = data[:train_size]
val_data = data[train_size:train_size + val_size]
test_data = data[train_size + val_size:]

X_train = train_data[feature_cols]
y_train = train_data['PA']

X_val = val_data[feature_cols]
y_val = val_data['PA']

X_test = test_data[feature_cols]
y_test = test_data['PA']
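
The slicing above can be checked on a small table. This is a sketch with a hypothetical helper, `split_40_40_20`, which is not part of the notebook; it uses the same proportions as the cell above and shows that the test set absorbs any rounding remainder.

```python
import pandas as pd

def split_40_40_20(df):
    """Slice an already-shuffled DataFrame into train/validation/test parts."""
    n = len(df)
    train_end = int(0.4 * n)
    val_end = train_end + int(0.4 * n)
    # The test part takes every remaining row, so no row is dropped
    # when n is not a multiple of 5.
    return df[:train_end], df[train_end:val_end], df[val_end:]

toy = pd.DataFrame({"x": range(11)})
train, val, test = split_40_40_20(toy)
print(len(train), len(val), len(test))  # 4 4 3
```

With 11 rows, the training and validation sets receive 4 rows each and the test set receives the remaining 3.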

In [14]:
best_features = []
best_corr = 0
best_model = None
stages_corr = []
add_feature = True

##### **Feature Selection**
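Features are added greedily in a while loop. Each candidate feature is fit with ordinary least squares on the training set and kept if it raises the Spearman correlation on the validation set. The loop stops once a full pass improves that correlation by less than 0.005 compared with the previous pass.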

In [15]:
while add_feature:
    for i in range(X_train.shape[1]):
        current_feature = X_train.columns[i]

        if current_feature in best_features:
            continue

        features = [current_feature] + best_features

        X_train_selected = X_train[features]
        X_for_model = np.column_stack((np.ones(X_train_selected.shape[0]), X_train_selected))
        weights = np.linalg.lstsq(X_for_model, y_train, rcond=None)[0]

        X_val_selected = X_val[features]
        y_pred = np.dot(np.column_stack((np.ones(X_val_selected.shape[0]), X_val_selected)), weights)

        corr_value, _ = spearmanr(y_pred, y_val)

        if corr_value > best_corr:
            best_corr = corr_value
            best_features = features
            best_model = weights


    stages_corr.append(best_corr)
    if len(stages_corr) > 1 and ((stages_corr[-1] <= stages_corr[-2]) or ((stages_corr[-1] - stages_corr[-2]) < 0.005)):
        add_feature = False
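
The step evaluated for each candidate inside the loop can be isolated as follows. This is a minimal sketch on synthetic data; `fit_and_score` is a hypothetical helper name introduced here for illustration and does not appear in the notebook.

```python
import numpy as np
from scipy.stats import spearmanr

def fit_and_score(X_train, y_train, X_val, y_val):
    """Fit least squares with an intercept on the training data and
    return (weights, Spearman correlation on the validation data)."""
    A_train = np.column_stack((np.ones(len(X_train)), X_train))
    weights = np.linalg.lstsq(A_train, y_train, rcond=None)[0]
    A_val = np.column_stack((np.ones(len(X_val)), X_val))
    rho, _ = spearmanr(A_val @ weights, y_val)
    return weights, rho

# Synthetic data (an assumption for illustration): only column 0 drives y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

# Score the informative column and an uninformative one separately.
w, rho = fit_and_score(X[:100, [0]], y[:100], X[100:, [0]], y[100:])
w_noise, rho_noise = fit_and_score(X[:100, [1]], y[:100], X[100:, [1]], y[100:])
print(rho, rho_noise)
```

The informative column should score far higher than the uninformative one, which is the signal the selection loop uses when deciding whether to keep a candidate.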

#### **Test Set Evaluation**

In [16]:
X_test_selected = X_test[best_features]
X_for_model = np.column_stack((np.ones(X_test_selected.shape[0]), X_test_selected))
y_pred = np.dot(X_for_model, best_model)
corr_value, _ = spearmanr(y_pred, y_test)

print("Best Features:")
print(best_features)
print("Test Correlation:")
print(corr_value)


Best Features:
['TTG_prob', 'GGG_prob', 'GGT_prob', 'TGG_prob', 'GCC_prob', 'ACT_prob', 'CCC_prob', 'CCT_prob', 'TCT_prob', 'ATG_prob', 'CTG_prob', 'CTA_prob', 'CTT_prob', 'TTA_prob', 'cys_prob', 'ATG_context', 'freq_AA', 'GC_content', 'FEWindow3', 'FEWindow2', 'FEWindow1', 'CAI']
Test Correlation:
0.5015384893317347
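
The Spearman correlation reported above compares rankings rather than raw values. The sketch below uses made-up numbers to illustrate this: predictions on the wrong scale still score close to 1 when they preserve the ordering of the true values.

```python
import numpy as np
from scipy.stats import spearmanr

# Made-up values for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Same ordering as y_true but on a very different scale.
y_same_order = y_true ** 3
# Exactly reversed ordering.
y_reversed = -y_true

rho_same, _ = spearmanr(y_same_order, y_true)
rho_rev, _ = spearmanr(y_reversed, y_true)
print(rho_same, rho_rev)  # approximately 1 and -1
```

A test correlation near 0.5 therefore says that the predicted ordering of genes agrees moderately with the measured ordering; it says nothing about how close the predicted values are to the measured ones.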


#### **Apply the Model on an Unknown Dataset**
##### Data Preprocessing

In [19]:

data = pd.read_excel('new_unknown_dataset.xlsx')

feature_cols = data.columns[5:]

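# Reuse the scaler fitted on the known dataset instead of refitting it here;
# this assumes both files share the same feature columns in the same order.
data.iloc[:, 5:] = scaler.transform(data.iloc[:, 5:])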

##### Prediction


In [40]:
X_test_selected = data[best_features]
X_for_model = np.column_stack((np.ones(X_test_selected.shape[0]), X_test_selected))
y_pred = np.dot(X_for_model, best_model)


In [None]:
result = pd.DataFrame({'GeneName': data['GeneName'], 'PA_predicted': y_pred})

# Sort ascending, so Rate 1 is the gene with the lowest predicted PA.
result = result.sort_values(by='PA_predicted')
result['Rate'] = range(1, len(result) + 1)

result.to_csv('results.csv', index=False)
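
Because `sort_values` sorts in ascending order by default, `Rate` 1 goes to the gene with the lowest predicted expression. The sketch below uses made-up gene names and predictions to show both orderings; whether highest-first is the intended convention is not stated in the notebook, so the descending variant is only an alternative to consider.

```python
import pandas as pd

# Made-up gene names and predictions for illustration.
toy = pd.DataFrame({
    "GeneName": ["geneA", "geneB", "geneC"],
    "PA_predicted": [0.2, 1.7, -0.4],
})

# Ascending sort, as in the cell above: Rate 1 is the lowest prediction.
asc = toy.sort_values(by="PA_predicted").copy()
asc["Rate"] = range(1, len(asc) + 1)

# Descending alternative: Rate 1 is the highest prediction.
desc = toy.sort_values(by="PA_predicted", ascending=False).copy()
desc["Rate"] = range(1, len(desc) + 1)

print(asc["GeneName"].tolist())   # ['geneC', 'geneA', 'geneB']
print(desc["GeneName"].tolist())  # ['geneB', 'geneA', 'geneC']
```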