# 3. Normalization 

<strong>1. Reduction of Technical Variation:</strong>

The primary goal of normalization is to reduce technical variability while preserving biological variability as far as possible. Applying sample-wise normalization keeps sample-specific technical variation in check.

<strong>2. Empirical Evidence:</strong>

Several studies (e.g., Callister et al., Kultima et al., and Webb-Robertson et al.) have found that methods that reduce intragroup (within a sample group) variation and sample-specific biases generally perform better, as judged with spike-in proteins.
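
Intragroup variation can be quantified in several ways. The sketch below uses the per-protein coefficient of variation (CV) across the samples of one group, summarized by its median. It is only an illustration: the function names and the toy intensities are hypothetical and are not part of this pipeline. CV is meaningful for raw intensities; for log-scale or VSN-scale values, comparing standard deviations is usually more appropriate.

```python
import math
import statistics


def protein_cv(values):
    """Coefficient of variation (SD / mean) over the non-missing values."""
    observed = [v for v in values if v is not None and not math.isnan(v)]
    if len(observed) < 2:
        return float("nan")  # not enough data to estimate spread
    mean = statistics.mean(observed)
    if mean == 0:
        return float("nan")
    return statistics.stdev(observed) / mean


def median_intragroup_cv(table):
    """Median per-protein CV; `table` maps protein name -> list of intensities."""
    cvs = [protein_cv(values) for values in table.values()]
    cvs = [cv for cv in cvs if not math.isnan(cv)]
    return statistics.median(cvs)


# Toy raw intensities (made up for illustration): one group, three samples.
toy_group = {
    "protein_a": [100.0, 110.0, 90.0],
    "protein_b": [2000.0, 1000.0, float("nan")],
}
print(median_intragroup_cv(toy_group))
```

Computing this summary per group before and after normalization gives a rough indication of whether a method reduced within-group spread.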

In [21]:
# Import the libraries
import pandas as pd
import rpy2.robjects as robjects
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import yaml

# Load configuration
with open('config.yaml', 'r') as file:
    config = yaml.safe_load(file)

# Accessing config values
categorized_dir = config['datasets']['categorized_dir']
normalized_dir = config['datasets']['normalized_dir']

# Input files come from the categorized and filtered data directory specified in the YAML config file
# Construct the file paths
asthma_after = os.path.join(categorized_dir, 'df_after_asthma_filtered.csv')
asthma_before = os.path.join(categorized_dir, 'df_before_asthma_filtered.csv')
control_after = os.path.join(categorized_dir, 'df_after_control_filtered.csv')
control_before = os.path.join(categorized_dir, 'df_before_control_filtered.csv')

asthma_after_norm = os.path.join(normalized_dir, 'df_after_asthma_VSN.csv')
asthma_before_norm = os.path.join(normalized_dir, 'df_before_asthma_VSN.csv')
control_after_norm = os.path.join(normalized_dir, 'df_after_control_VSN.csv')
control_before_norm = os.path.join(normalized_dir, 'df_before_control_VSN.csv')
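
The R script further down writes into `normalized_dir` and will fail if that directory does not exist. A minimal sketch of a pre-flight step is shown here, assuming `config` has the `datasets.categorized_dir` and `datasets.normalized_dir` keys used above; the helper name `prepare_dirs` is hypothetical.

```python
import os


def prepare_dirs(config):
    """Check that the input directory exists and create the output directory."""
    datasets = config["datasets"]
    categorized_dir = datasets["categorized_dir"]
    normalized_dir = datasets["normalized_dir"]
    if not os.path.isdir(categorized_dir):
        raise FileNotFoundError(f"Input directory not found: {categorized_dir}")
    # exist_ok=True makes this a no-op when the directory is already there
    os.makedirs(normalized_dir, exist_ok=True)
    return categorized_dir, normalized_dir
```

Calling `prepare_dirs(config)` right after loading the YAML file would surface a wrong path before any R code runs.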

In [25]:
import rpy2.robjects as robjects

def vsn_normalization(input_csv, output_csv):
    try:
        # R script that loads vsn and Biobase (both must already be installed) and performs the normalization
        r_script = f"""
        library(vsn)
        library(Biobase)

        # Read the data from CSV
        data <- read.csv('{input_csv}', row.names=1)

        # Print column names for debugging
        print(paste('Column names:', toString(colnames(data))))

        # Ensure the entire dataframe is included in normalization
        exprs_data <- as.matrix(data)
        print('Expression data before normalization:')
        print(head(exprs_data))

        # Apply VSN normalization with adjusted minDataPointsPerStratum
        vsn_data <- vsn2(exprs_data, minDataPointsPerStratum=10)

        # Extract the normalized expression values
        normalized_matrix <- predict(vsn_data, newdata=exprs_data)
        print('Normalized data:')
        print(head(normalized_matrix))

        # Convert the normalized matrix back to a dataframe
        normalized_df <- as.data.frame(normalized_matrix)
        colnames(normalized_df) <- colnames(data)  # Restore original column names
        
        # Write the normalized data back to a CSV file
        write.csv(normalized_df, '{output_csv}', quote=FALSE, row.names=TRUE)
        """

        # Run the R script
        robjects.r(r_script)
        
        print("VSN normalization completed successfully. Normalized data saved to:", output_csv)

    except Exception as e:
        print("An error occurred:", e)
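
For intuition about the scale of the normalized values: VSN is built on an arsinh-based generalized logarithm, which behaves like log2 for large intensities but stays finite at zero. The sketch below shows only that base transformation. `vsn2` additionally fits per-sample offset and scale parameters, which are not reproduced here, and the function name `glog2` is used purely for illustration.

```python
import math


def glog2(x):
    """Generalized log, base 2: log2((x + sqrt(x^2 + 1)) / 2)."""
    return math.asinh(x) / math.log(2) - 1


# Compare with the ordinary log2, which is undefined at zero.
for intensity in (0.0, 10.0, 1000.0, 100000.0):
    plain = round(math.log2(intensity), 3) if intensity > 0 else "undefined"
    print(intensity, round(glog2(intensity), 3), plain)
```

This is why the normalized values printed by the function are on a roughly log2-like scale rather than the raw intensity scale.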


In [26]:
df_list = [
    (asthma_after, asthma_after_norm), 
    (asthma_before, asthma_before_norm), 
    (control_after, control_after_norm), 
    (control_before, control_before_norm)
]

# Run VSN normalization on each input CSV and write the result to its output path
for input_path, output_path in df_list:
    vsn_normalization(input_path, output_path)

[1] "Column names: F125..Sample..asthma..B, F126..Sample..asthma..B, F128..Sample..asthma..B, F134..Sample..asthma..B, F136..Sample..asthma..B, F137..Sample..asthma..B, F138..Sample..asthma..B, F139..Sample..asthma..B, F141..Sample..asthma..B, F142..Sample..asthma..B, F184..Sample..asthma..B, F185..Sample..asthma..B, F107..Sample..asthma..B, F108..Sample..asthma..B, F111..Sample..asthma..B, F112..Sample..asthma..B, F113..Sample..asthma..B, F114..Sample..asthma..B, F120..Sample..asthma..B, F121..Sample..asthma..B, F122..Sample..asthma..B, F123..Sample..asthma..B, F150..Sample..asthma..B, F194..Sample..asthma..B, F175..Sample..asthma..B, F177..Sample..asthma..B, F178..Sample..asthma..B, F179..Sample..asthma..B, F180..Sample..asthma..B, F181..Sample..asthma..B"
[1] "Expression data before normalization:"
          F125..Sample..asthma..B F126..Sample..asthma..B
B cyto10              37810.90098              676814.061
keratin4               5157.83715               26892.274
keratin8     

R[write to console]: vsn2: 30 x 30 matrix (1 stratum). 

R[write to console]: Please use 'meanSdPlot' to verify the fit.



[1] "Normalized data:"
          F125..Sample..asthma..B F126..Sample..asthma..B
B cyto10                14.971980               16.424757
keratin4                12.106787               11.765555
keratin8                 6.691406               10.505072
desmopla                12.532523               12.877153
keratin12                      NA                9.153562
albumin                 12.823905               12.406945
          F128..Sample..asthma..B F134..Sample..asthma..B
B cyto10                15.878100                16.76548
keratin4                11.851003                12.91121
keratin8                10.130600                14.34008
desmopla                12.042728                11.27013
keratin12                9.304125                11.96802
albumin                 12.020267                14.41906
          F136..Sample..asthma..B F137..Sample..asthma..B
B cyto10                16.285241               16.189907
keratin4                12.918201               1

albumin               4326.4036              9333.307              8658.964
surf A2               1340.8725              1289.348              2597.150
DENN                   521.0201             13815.369             61764.913
Keratin 1            23203.6283            351272.831             83972.276
keratin2             26982.2923            445249.543             94119.855
cyto 10              14809.1369            294515.429             83872.002
          F24..Sample.A..asthma F32..Sample.A..asthma F35..Sample.A..asthma
albumin              1060936.92             5340.1203            15212.6111
surf A2               202941.95              733.7596             2617.9297
DENN                   14805.55             3204.7975              711.8536
Keratin 1              66047.65           873276.1977            70715.4867
keratin2              235265.38           803349.8440            67548.5193
cyto 10               229594.91           310334.6878            29118.4282
          F3

R[write to console]: vsn2: 30 x 33 matrix (1 stratum). 

R[write to console]: Please use 'meanSdPlot' to verify the fit.



[1] "Normalized data:"
          F5..Sample.A..asthma F6..Sample.A..asthma F8..Sample.A..asthma
albumin              11.867328             13.98398              8.51687
surf A2               7.778194             12.92224                   NA
DENN                 16.415786             15.30458             14.97468
Keratin 1            16.464254             16.87015             17.78880
keratin2             16.725344             16.89356             17.74341
cyto 10              16.119531             16.68111             16.13637
          F10..Sample.A..asthma F18..Sample.A..asthma F19..Sample.A..asthma
albumin                13.57690              12.70913              13.40729
surf A2                11.50021              10.96227               9.68069
DENN                   14.02947              12.64140              13.42143
Keratin 1              16.88790              16.30597              14.89631
keratin2               17.47514              16.69939              15.34741
cyto 10   

          F102..Sample..control..B F105..Sample..control..B
B cyto10                81250.2567               128514.019
keratin4                 2675.8040                 4794.931
keratin8                  342.9837                 4831.509
desmopla                 1683.6725                 1755.305
keratin12                 159.3874                 2270.217
albumin                 97961.3394                60174.615
          F104..Sample..control..B F116..Sample..control..B
B cyto10                25788.4589               186244.108
keratin4                 1059.0594                 7356.128
keratin8                  120.4641                 8085.630
desmopla                 1035.2638                 6252.628
keratin12                       NA                  535.823
albumin                  6350.8325                27642.615
          F115..Sample..control..B F171..Sample..control..B
B cyto10                43594.5550              115172.2436
keratin4                 2007.1960      

R[write to console]: vsn2: 22 x 18 matrix (1 stratum). 

R[write to console]: Please use 'meanSdPlot' to verify the fit.



[1] "Normalized data:"
          F118..Sample..control..B F119..Sample..control..B
B cyto10                 14.613772                15.545808
keratin4                 11.725114                11.722967
keratin8                 14.275457                13.098097
desmopla                  9.829955                10.979225
keratin12                10.415143                 8.394716
albumin                  16.854524                16.197877
          F131..Sample..control..B F133..Sample..control..B
B cyto10                 15.731994                 14.96261
keratin4                 11.011139                 11.32890
keratin8                  8.802992                 12.12304
desmopla                  9.317194                 10.86545
keratin12                 6.949998                 10.40517
albumin                  16.186839                 15.87980
          F117..Sample..control..B F183..Sample..control..B
B cyto10                 15.859548                 14.46947
keratin4         

R[write to console]: vsn2: 31 x 16 matrix (1 stratum). 

R[write to console]: Please use 'meanSdPlot' to verify the fit.



[1] "Normalized data:"
          F1..Sample.A..control F2..Sample.A..control F14..Sample.A..control
albumin                17.82013              14.26085               16.42803
surf A2                16.06867              13.42024               14.51577
DENN                   15.75689              15.05066                     NA
Keratin 1              15.45814              16.00053               16.27266
keratin2               15.17511              16.36018               16.46785
cyto 10                14.75875              15.86171               16.06761
          F16..Sample.A..control F27..Sample.A..control F29..Sample.A..control
albumin                 16.77479               11.63186              12.030228
surf A2                 14.46185               10.78657               9.952108
DENN                    13.09938               12.52017              10.623032
Keratin 1               16.23645               16.79197              17.136953
keratin2                16.29603           
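
The `vsn2` messages above recommend `meanSdPlot` to verify the fit. As a rough Python stand-in (not the vsn implementation), the sketch below orders proteins by their mean normalized value and reports the median standard deviation within bins of that ordering. If variance was stabilized, the values should be roughly flat across bins. The function name `mean_sd_trend` and the `n_bins` parameter are hypothetical.

```python
import math
import statistics


def mean_sd_trend(rows, n_bins=3):
    """Median row SD within bins of rows ordered by row mean.

    `rows` is a list of per-protein lists of normalized values;
    NaN marks a missing value.
    """
    stats = []
    for values in rows:
        observed = [v for v in values if not math.isnan(v)]
        if len(observed) >= 2:  # SD needs at least two observations
            stats.append((statistics.mean(observed), statistics.stdev(observed)))
    stats.sort(key=lambda pair: pair[0])  # rank rows by their mean
    bin_size = max(1, math.ceil(len(stats) / n_bins))
    return [
        statistics.median([sd for _, sd in stats[i:i + bin_size]])
        for i in range(0, len(stats), bin_size)
    ]
```

The rows could be read from one of the `*_VSN.csv` files written above. A clear upward or downward trend across the returned list would suggest that the variance still depends on the mean.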

In [4]:
# Usage
data = '/Users/shahansd/SM_ Proteomics data/GitHUB copy/1- Protein with 70% presence/df_asthma_after_filtered.csv'
data_imp = ''

vsn_normalization(data, data_imp)

R[write to console]: Loading required package: Biobase

R[write to console]: Loading required package: BiocGenerics

R[write to console]: 
Attaching package: ‘BiocGenerics’


R[write to console]: The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


R[write to console]: The following objects are masked from ‘package:base’:

    Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
    as.data.frame, basename, cbind, colnames, dirname, do.call,
    duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
    lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
    pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
    tapply, union, unique, unsplit, which.max, which.min


R[write to console]: Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.


R[write to console]

An error occurred: Error in file(file, "rt") : cannot open the connection
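
The message above comes from R failing to open a file for reading (the `"rt"` mode), which typically means the input path does not exist or is not readable. A minimal sketch of a check that could run before `vsn_normalization` is given here; the helper name `check_paths` is hypothetical and not part of the original notebook.

```python
import os


def check_paths(input_csv, output_csv):
    """Raise early if the input CSV is missing or the output path is unusable."""
    if not os.path.isfile(input_csv):
        raise FileNotFoundError(f"Input CSV not found: {input_csv}")
    if not output_csv:
        raise ValueError("Output path is empty")
    out_dir = os.path.dirname(output_csv) or "."
    if not os.path.isdir(out_dir):
        raise FileNotFoundError(f"Output directory not found: {out_dir}")
```

Failing in Python with the offending path in the message is easier to diagnose than the generic R connection error.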

