<a href="https://colab.research.google.com/github/reneebrecht/human-protein-atlas-image-classification/blob/nn_run/NN_run.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook for running neural networks

Creating a neural network model works a little differently than creating the other models, so it gets its own notebook.

### Access to the cloud
The data is stored in the cloud and the results are saved there as well, so we need access to the cloud and a name that identifies the model.

In [10]:
#Give model a name before running the notebook:
model_name = input("Enter name of model: ")

Enter name of model: NNNNN


In [26]:
# Clone the GitHub repo first so the notebook can access the other files

import os
from getpass import getpass
import urllib.parse  # the submodule has to be imported explicitly

user = input('Github User name: ')
password = getpass('Github Password: ')
password = urllib.parse.quote(password) # percent-encode the password for use in a URL
#repo_name = 'human-protein-atlas-image-classification' #input('Repo name: ')

cmd_string = 'git clone https://{0}:{1}@github.com/reneebrecht/human-protein-atlas-image-classification.git'.format(user, password)

os.system(cmd_string)
cmd_string, password = "", "" # overwrite both variables so the password is not kept around

# may also need to access google drive
from google.colab import drive
drive.mount('/content/drive')

Github User name: Nilodnewg
Github Password: ··········
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
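
The clone cell above puts the credentials into the URL, so they have to be percent-encoded. A minimal sketch of that step follows; `clone_url` and the example credentials are hypothetical. One detail worth knowing: `urllib.parse.quote` leaves `/` unescaped by default, so the sketch passes `safe=""`. GitHub also expects a personal access token instead of the account password for Git operations over HTTPS, so the "password" here would be such a token.

```python
from urllib.parse import quote


def clone_url(user, secret, repo_path):
    """Build an HTTPS clone URL with percent-encoded credentials (hypothetical helper)."""
    # safe="" also encodes "/", which quote() leaves untouched by default.
    return "https://{0}:{1}@github.com/{2}.git".format(
        quote(user, safe=""), quote(secret, safe=""), repo_path
    )


# Made-up credentials containing characters that would break an unencoded URL.
url = clone_url(
    "some-user", "p@ss/word", "reneebrecht/human-protein-atlas-image-classification"
)
print(url)
```

After encoding, the only `@` left in the URL is the one that separates the credentials from the host.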


In [12]:
from google.colab import auth
auth.authenticate_user()

# https://cloud.google.com/resource-manager/docs/creating-managing-projects
project_id = 'imposing-league-354107'
!gcloud config set project {project_id}

Updated property [core/project].


In [13]:
# so that it finds the classes to import
import sys
sys.path.insert(0,'/content/human-protein-atlas-image-classification/notebooks')

In [14]:
!pip install gcsfs

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### Preparation for running the model 
We need a few packages for the coding. 

In [15]:
!pip install fsspec # this is needed for pandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [16]:
# Import all the libraries I need
import numpy as np
import pandas as pd
#import matplotlib.pyplot as plt
#import seaborn as sns
%matplotlib inline

# ignore Deprecation Warning
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

from sklearn.preprocessing import StandardScaler, QuantileTransformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

import joblib

import tensorflow as tf

np.random.seed(421)
tf.random.set_seed(421)



In [27]:
# Import the classes from our own helper module
from Helper_classes import Location_in_Target, Bin_Embedding, Prepared_Test_Train_Data, Prepare_NN_for_pipline

ModuleNotFoundError: ignored

The labels are stored in a separate `.csv` file. To work with the data, we first copy this file from the cloud and read it.

In [None]:
#get all of the labels
# Download the file from a given Google Cloud Storage bucket.
!gsutil cp 'gs://human_proteins/train.csv' /tmp/train.csv
labels_training = pd.read_csv('/tmp/train.csv')

### Get the embedding 

The raw image data is very large, so we work with embeddings instead. An embedding compresses each image into a much smaller vector, and that vector already captures features of the image, which also helps when training models later. <br>
The embeddings are built in another notebook and saved in the cloud. In the next step we access them.

In [None]:
!mkdir -p /tmp/embed_path
!gsutil cp 'gs://human_proteins_data/embeddings_train/*' /tmp/embed_path

## Coding for training the model, prediction and saving the results
Because training and prediction have to be done separately for each label, the code is split into functions that are later called in a loop over all labels.


### Get the data for one label.
Create a balanced subset of images for one label: it contains as many images with this label as images without it. Then get the embedded data for these images.

In [None]:
def embedding_for_one_location(location_number):
  pictures = Location_in_Target(location = location_number)
  pictures.determine_pictures(labels_training)
  bin_embed = Bin_Embedding(pictures.get_pictures(), location_number, '/tmp/embed_path')
  return pictures, bin_embed
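
The balancing idea can be sketched as follows. This is not the `Location_in_Target` implementation, which lives in `Helper_classes` and is not shown in this notebook. The sketch assumes that the label table has an `Id` column and a `Target` column with space-separated label numbers, as in the Kaggle `train.csv`; `balanced_ids_for_label` is a hypothetical helper and the toy table is made up.

```python
import pandas as pd


def balanced_ids_for_label(labels, label, seed=421):
    """Return equally many image ids with and without `label` (hypothetical helper)."""
    # Assumption: `Target` holds space-separated label numbers, e.g. "16 0".
    has_label = labels["Target"].str.split().apply(lambda t: str(label) in t)
    positives = labels[has_label]
    negatives = labels[~has_label]
    n = min(len(positives), len(negatives))
    # Downsample both groups to the same size so the two classes are balanced.
    subset = pd.concat([
        positives.sample(n=n, random_state=seed).assign(has_label=1),
        negatives.sample(n=n, random_state=seed).assign(has_label=0),
    ])
    return subset[["Id", "has_label"]].reset_index(drop=True)


toy = pd.DataFrame({
    "Id": ["a", "b", "c", "d", "e"],
    "Target": ["5", "1 5", "0", "2 25", "3"],
})
balanced = balanced_ids_for_label(toy, label=5)
print(balanced)
```

Splitting `Target` into tokens before comparing matters: a plain substring search for `5` would also match label `25`.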

### Split into train/test and transform/standardize
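Preprocessing the data before training often improves the results. The function below splits the embedded data into a training and a test set; the quantile transform and the standard scaling are part of the pipeline defined further down.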

In [None]:
def get_train_test(bin_embed):
  prepared_data = Prepared_Test_Train_Data(bin_embed.get_embedding())
  X_train, X_test, y_train, y_test = prepared_data.splitter()
  return X_train, X_test, y_train, y_test
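
What `Prepared_Test_Train_Data.splitter()` does internally is not shown here. A plausible sketch of such a split uses scikit-learn's `train_test_split` with stratification, so that the balanced classes stay balanced in both parts. The random data, the 64 dimensions and the 25 % test share are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(421)
X = rng.normal(size=(200, 64))   # stand-in for the embedding vectors
y = np.repeat([0, 1], 100)       # balanced binary target

# stratify=y keeps the 50/50 class balance in the training and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=421
)
print(X_train.shape, X_test.shape, y_train.mean(), y_test.mean())
```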

### Build the pipeline
Pipelines make the code cleaner and easier to reproduce. For a neural network, the model with its layers has to be built first, then it is wrapped in a regressor, and only then can it be used as a step in a pipeline.

In [None]:
def setting_up_nn():
  '''Build the NN with all the layers. '''
  nn_model = Prepare_NN_for_pipline()
  nn_model.build_layers(number_layers=3, dropout_rate=0.25)
  return nn_model


In [None]:
def model_object(nn_model, X_train):
  '''Get the model in a format for the pipeline.'''
  my_model = nn_model.build_regressor(n_train = len(X_train))
  return my_model

In [None]:
def pipeline(my_model, X_train, y_train):
  '''Define the pipeline with the created model, the transformer and a scaler.'''
  with tf.device('/cpu:0'):

    # create the pipeline: quantile transform, then scaling, then the model
    pipe = make_pipeline(QuantileTransformer(random_state=0), StandardScaler(), my_model)
    training = pipe.fit(X_train, y_train)  # fit the transformers and the model on the training data
  return pipe, training
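
The structure of this pipeline can be tried out without TensorFlow. The sketch below swaps in scikit-learn's `MLPRegressor` for the wrapped Keras model returned by `build_regressor`; that substitution is only there to keep the example self-contained and says nothing about the real architecture. The data is random.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] > 0).astype(float)   # binary target encoded as 0.0 / 1.0

# Stand-in for the wrapped Keras model: any estimator with fit/predict fits here.
stand_in_model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0)

pipe = make_pipeline(
    QuantileTransformer(n_quantiles=100, random_state=0),  # map each feature to a uniform distribution
    StandardScaler(),                                       # then centre and scale
    stand_in_model,
)
pipe.fit(X, y)   # fits both transformers and the model, in this order
print(pipe.predict(X[:5]))
```

Because the transformers are fitted inside the pipeline, `pipe.predict` later applies exactly the same transformation to the test data without refitting it.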

### Prediction 

In [23]:
def prediction_and_stuff(my_model, X_test, y_test, pipe):
  '''Predict data with the trained model and calculate the f1_score. '''
  y_pred = pipe.predict(X_test).round(0)
  f1_test = round(f1_score(y_test, y_pred), 2)
  # X_train and y_train are the notebook-level variables set in the training loop
  y_pred_train = pipe.predict(X_train).round(0)
  f1_train = round(f1_score(y_train, y_pred_train), 2)
  return y_pred, f1_train, f1_test
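
The model is a regressor, so its continuous outputs have to be turned into 0/1 before an F1 score can be computed, and each score has to compare predictions and labels from the same split (training predictions with training labels, test predictions with test labels). A small sketch with made-up outputs follows; `f1_from_scores` is a hypothetical helper, and the clipping is an extra safeguard that is not part of the function above.

```python
import numpy as np
from sklearn.metrics import f1_score


def f1_from_scores(y_true, y_score):
    """Round continuous outputs to 0/1 and return the F1 score with 2 decimals."""
    # Clipping guards against outputs such as 1.6 that would round to 2.
    y_pred = np.clip(np.round(y_score), 0, 1).astype(int)
    return round(float(f1_score(y_true, y_pred)), 2)


y_true = np.array([1, 1, 0, 0, 1])
y_score = np.array([0.9, 0.4, 0.2, 0.7, 0.8])   # made-up regressor outputs
print(f1_from_scores(y_true, y_score))           # 0.67
```

Here the rounded predictions are `[1, 0, 0, 1, 1]`: two true positives, one false positive and one false negative, so precision and recall are both 2/3.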

### Save the results 
To compare different results and reproduce the model, we are saving the model and the results in the cloud.

In [19]:
#Create folder for temporary storing model
!mkdir -p /tmp/saved_model

In [20]:
list_model = np.zeros(28)
df_f1_scores = pd.DataFrame(list_model, columns = [model_name])

In [21]:
def save_one_model(my_model,y_pred, y_test, location_number):
  #save the model
  my_model.save('/tmp/saved_model/'+model_name+'_model_'+str(location_number))
  #save prediction and the given labels
  joblib.dump([y_pred, y_test], '/tmp/saved_model/'+model_name+'_'+str(location_number))
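
Reading the saved predictions back works with `joblib.load`, which returns the list in the same order it was dumped. A minimal round trip with made-up arrays and a hypothetical file name:

```python
import os
import tempfile

import joblib
import numpy as np

y_pred = np.array([1, 0, 1])
y_test = np.array([1, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_model_5")   # hypothetical file name
    joblib.dump([y_pred, y_test], path)
    # load returns the list in the order it was dumped: predictions first, labels second
    loaded_pred, loaded_test = joblib.load(path)

print(loaded_pred, loaded_test)
```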

### Run the code for each label

In [22]:
for location_number in range(5,6):
  pictures, bin_embed = embedding_for_one_location(location_number)
  X_train, X_test, y_train, y_test =  get_train_test(bin_embed)
  nn_model = setting_up_nn()
  my_model = model_object(nn_model, X_train)
  pipe, training =  pipeline(my_model, X_train, y_train )
  y_pred, f1_train, f1 =  prediction_and_stuff(my_model, X_test, y_test, pipe)
  #save_one_model(nn_model.model,y_pred.round(0), y_test, location_number)
  #write scores in a dataframe
  #df_f1_scores.loc[df_f1_scores.index[int(location_number)],model_name]= f1


NameError: ignored

In [21]:
filename = '/tmp/saved_model/'+model_name+'*'
!gsutil cp -r {filename} gs://human_proteins/saved_model/

Copying file:///tmp/saved_model/nn_3_layers_dropout_1 [Content-Type=application/octet-stream]...
Copying file:///tmp/saved_model/nn_3_layers_dropout_2 [Content-Type=application/octet-stream]...
Copying file:///tmp/saved_model/nn_3_layers_dropout_3 [Content-Type=application/octet-stream]...
Copying file:///tmp/saved_model/nn_3_layers_dropout_4 [Content-Type=application/octet-stream]...
\ [4 files][646.9 KiB/646.9 KiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying file:///tmp/saved_model/nn_3_layers_dropout_5 [Content-Type=application/octet-stream]...
Copying file:///tmp/saved_model/nn_3_layers_dropout_6 [Content-Type=application/octet-stream]...
Copying file:///tmp/saved_model/nn_3_layers_dropout_7 [Content-Type=application/octet

In [45]:
f1_mean = sum(df_f1_scores[model_name])/28
df_f1_scores.loc[28] = f1_mean
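
The score column starts out as 28 zeros, so every label that was not trained keeps a 0 and pulls the mean down. The sketch below, with a hypothetical model name and a made-up score, compares the mean over all 28 rows with the mean over the trained labels only. Which of the two is the better summary depends on how the scores are compared; the sketch only shows the difference.

```python
import numpy as np
import pandas as pd

model_name = "demo_model"                     # hypothetical name
scores = pd.DataFrame(np.zeros(28), columns=[model_name])
scores.loc[5, model_name] = 0.68              # pretend only label 5 was trained

mean_all = scores[model_name].mean()          # zeros count as real scores
mean_trained = scores.loc[scores[model_name] > 0, model_name].mean()
print(round(mean_all, 4), round(mean_trained, 2))
```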

In [8]:
import pandas as pd
!gsutil cp 'gs://human_proteins/f1_scores.csv' /tmp/f1_scores.csv
all_models_score = pd.read_csv('/tmp/f1_scores.csv')

Copying gs://human_proteins/f1_scores.csv...
/ [1 files][  2.5 KiB/  2.5 KiB]
Operation completed over 1 objects/2.5 KiB.                                      


In [9]:
all_models_score

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,nn_6_layers_dropout,KNN,GradientBoostingClassifier,ExtraT_Bea,PAC_Bea,RiCcv_Bea,xgboostclass,BerNB_Bea,svc_rbf_C0_1,svc_rbf,Ada_Bea,nn_5_layers_dropout_try,RF_Bea,nn_3_layers_dropout
0,-1,,,,,,,,,,,,,,,0.2125
1,0,0.0,0.76,0.64,0.64,0.78,0.63,0.72,0.7,0.58,0.7,0.76,0.66,0.76,0.76,0.0
2,1,1.0,0.63,0.62,0.62,0.72,0.6,0.66,0.64,0.58,0.6,0.68,0.62,0.0,0.7,0.7
3,2,2.0,0.66,0.63,0.63,0.83,0.59,0.67,0.65,0.6,0.63,0.72,0.62,0.0,0.8,0.77
4,3,3.0,0.57,0.62,0.62,0.66,0.54,0.63,0.62,0.57,0.62,0.66,0.6,0.0,0.63,0.63
5,4,4.0,0.62,0.62,0.62,0.67,0.59,0.63,0.62,0.59,0.64,0.68,0.6,0.0,0.66,0.67
6,5,5.0,0.69,0.6,0.6,0.7,0.56,0.63,0.63,0.58,0.62,0.67,0.61,0.0,0.68,0.68
7,6,6.0,0.64,0.63,0.63,0.65,0.59,0.63,0.65,0.57,0.56,0.69,0.63,0.0,0.65,0.64
8,7,7.0,0.56,0.61,0.61,0.62,0.58,0.59,0.62,0.53,0.49,0.63,0.56,0.0,0.61,0.62
9,8,8.0,0.61,0.71,0.71,0.65,0.64,0.59,0.61,0.53,0.49,0.69,0.6,0.0,0.66,0.62


In [53]:
all_models_score =pd.concat([all_models_score,df_f1_scores], axis = 1)

In [54]:
from pathlib import Path  
filepath = Path('tmp/f1_scores.csv')  
filepath.parent.mkdir(parents=True, exist_ok=True)  
all_models_score.to_csv(filepath, index=False)  # index=False avoids adding another "Unnamed: 0" column on the next read
!gsutil cp -r 'tmp/f1_scores.csv' gs://human_proteins/

Copying file://tmp/f1_scores.csv [Content-Type=text/csv]...
/ [1 files][  2.5 KiB/  2.5 KiB]                                                
Operation completed over 1 objects/2.5 KiB.                                      
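
Columns such as `Unnamed: 0` in the score table typically appear when a DataFrame is written together with its index and then read back. The sketch below uses made-up scores and hypothetical column contents to show the effect and how `index=False` avoids it.

```python
import io

import pandas as pd

old = pd.DataFrame({"KNN": [0.64, 0.62]})
new = pd.DataFrame({"demo_model": [0.70, 0.66]})   # hypothetical new score column

combined = pd.concat([old, new], axis=1)           # rows are aligned by index

# Round trip through CSV text, once with and once without the index.
with_index = pd.read_csv(io.StringIO(combined.to_csv()))
without_index = pd.read_csv(io.StringIO(combined.to_csv(index=False)))
print(list(with_index.columns))      # ['Unnamed: 0', 'KNN', 'demo_model']
print(list(without_index.columns))   # ['KNN', 'demo_model']
```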
