## Data-Preprocessing
---

### - This notebook uses _sklearn_ (version 0.23.2) and _scipy_ (version 1.4.1) to preprocess the data
### - It generates the input data for the ML classifiers
### - The file used in this notebook can be created using "ExpressionData-Wrangling.ipynb"

#### Step 1: Load libraries

In [1]:
# For dataframe handling
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Inline Visuals
%matplotlib inline

# For pre-processing
from sklearn.model_selection import train_test_split # Train-Test Splitting

# Display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#### Step 2: Load dataset
* The file used in this notebook can be created using "ExpressionData-Wrangling.ipynb"
* Read the file with ```header = None```, so the dataframe initially gets integer column names
* Afterwards, make the first row the dataframe header

In [2]:
# Importing the dataset
dataset = pd.read_csv('exp_set.csv', header = None)

# Replacing the header with row 1 (Labels)
dataset = dataset.rename(columns = dataset.iloc[0]).drop(dataset.index[0])

# Viewing
dataset.head(2)

# Dimensions
dataset.shape



Unnamed: 0,NaN,Case,Control,Control.1,Case.1,Control.2,Control.3,Case.2,Control.4,Control.5,...,Case.3,Control.6,Control.7,Control.8,Control.9,Case.4,Case.5,Case.6,Case.7,Control.10
1,ENSG00000266714,25,13,27,1,90,3,39,1,0,...,0,1,0,0,5,0,7,0,0,22
2,ENSG00000266208,12,4,15,0,2,4,3,4,4,...,8,1,2,7,2,0,3,4,5,0


(12879, 105)
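
The header promotion used in this step can be tried on a small in-memory file. The sketch below is only an illustration: the CSV content, the gene names `GENE_A`/`GENE_B`, and the variable names `toy_csv`, `raw`, and `toy` are hypothetical and do not come from `exp_set.csv`.

```python
import io

import pandas as pd

# Hypothetical miniature of the input file: the first row carries the
# sample labels and the first column carries the gene IDs.
toy_csv = io.StringIO(
    ",Case,Control,Case\n"
    "GENE_A,25,13,1\n"
    "GENE_B,12,4,0\n"
)

# Read without a header so the label row arrives as ordinary data (row 0)
# and the columns get integer names 0, 1, 2, ...
raw = pd.read_csv(toy_csv, header=None)

# Promote row 0 to the column names, then drop it from the body.
toy = raw.rename(columns=raw.iloc[0]).drop(raw.index[0])

print(toy.columns.tolist())
print(toy.shape)
print(type(toy.iloc[0, 1]))
```

In this toy example the expression values come back as strings (for example `'25'`), because each column also held a text label when it was parsed. Whether a numeric conversion is needed later depends on the downstream classifier code, which is not shown here.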

#### Step 3: Transposing 
1. Transpose the whole dataframe
2. Reset the index with ```reset_index()```
3. Make the first row the header
4. Replace ```NaN``` with "Labels"

In [3]:
# Transposing
dfT = pd.DataFrame(dataset.T.reset_index())
dfT.head(1)

# Renaming
df = dfT.rename(columns = dfT.iloc[0]).drop(dfT.index[0]).rename(columns = {np.nan : "Labels"})
df.head(2)

Unnamed: 0,index,1,2,3,4,5,6,7,8,9,...,12870,12871,12872,12873,12874,12875,12876,12877,12878,12879
0,,ENSG00000266714,ENSG00000266208,ENSG00000266173,ENSG00000266086,ENSG00000265817,ENSG00000265681,ENSG00000265118,ENSG00000264278,ENSG00000263731,...,ENSG00000001460,ENSG00000001167,ENSG00000001084,ENSG00000001036,ENSG00000000971,ENSG00000000938,ENSG00000000460,ENSG00000000457,ENSG00000000419,ENSG00000000003


Unnamed: 0,Labels,ENSG00000266714,ENSG00000266208,ENSG00000266173,ENSG00000266086,ENSG00000265817,ENSG00000265681,ENSG00000265118,ENSG00000264278,ENSG00000263731,...,ENSG00000001460,ENSG00000001167,ENSG00000001084,ENSG00000001036,ENSG00000000971,ENSG00000000938,ENSG00000000460,ENSG00000000457,ENSG00000000419,ENSG00000000003
1,Case,25,12,882,20,30,147,10,8,82,...,199,2798,1376,65,40,162,142,1108,1274,114
2,Control,13,4,777,23,16,169,9,20,165,...,119,1568,1105,124,60,449,147,615,681,45
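
The same reshaping can be sketched on a toy frame. This is a slightly more defensive variant than the cell above, not the notebook's own code: it renames the first column by position instead of matching on `NaN`. The frame contents and the names `toy`, `toyT`, `header`, and `tidy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame shaped like the output of Step 2: a NaN-named column
# of gene IDs followed by one column per sample, named by its class label.
toy = pd.DataFrame(
    [["GENE_A", 25, 13, 1], ["GENE_B", 12, 4, 0]],
    columns=[np.nan, "Case", "Control", "Case"],
)

# Transpose so that samples become rows; reset_index() turns the old
# column names (the labels) into a regular column called "index".
toyT = toy.T.reset_index()

# Row 0 now holds the gene IDs. Use it as the header and name the first
# column "Labels" by position (assumption: the labels are always first).
header = toyT.iloc[0].tolist()
header[0] = "Labels"

tidy = toyT.iloc[1:].copy()
tidy.columns = header

print(tidy)
```

Renaming by position avoids depending on how the missing header value is represented, at the cost of assuming the label column always comes first.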


#### Step 4: Train-Test Split
1. Separate the labels and expression values into NumPy arrays
2. Use ```train_test_split()``` for data splitting
3. Set random state for reproducibility
4. Ratio for splitting is ```80:20```

In [4]:
# Dividing the labels and values
X = df.iloc[:,1:].values
y = df.iloc[:, 0].values

# Splitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 101)
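
To see what the 80:20 ratio and the fixed random state do in practice, the split can be run on toy data. This is a minimal sketch under stated assumptions: the arrays below are hypothetical stand-ins for the real expression matrix, and `X_train2`/`X_test2` exist only to show reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples x 3 genes, with alternating labels.
X = np.arange(30).reshape(10, 3)
y = np.array(["Case", "Control"] * 5)

# Same arguments as in the cell above: 20% held out, fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=101
)

# Repeating the call with the same seed reproduces the same partition.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.20, random_state=101
)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```

If the class balance of the two sets matters, `train_test_split` also accepts a `stratify` argument (for example `stratify = y`). The cell above does not use it, so that would be a change to the original procedure.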

#### Step 5: Saving the sets
1. Convert into dataframes
2. Save without index

In [5]:
pd.DataFrame(X_train).to_csv("training_features.csv", index = False)
pd.DataFrame(X_test).to_csv("testing_features.csv", index = False)
pd.DataFrame(y_train).to_csv("training_labels.csv", index = False)
pd.DataFrame(y_test).to_csv("testing_labels.csv", index = False)
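
A quick round trip can confirm that the saved files load back into the same arrays. The sketch below writes to a temporary directory so it does not touch the real output files. The arrays are hypothetical stand-ins for the Step 4 results, and `features_path`, `labels_path`, `X_back`, and `y_back` are illustrative names.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for the arrays produced in Step 4.
X_train = np.array([[25, 12], [13, 4]])
y_train = np.array(["Case", "Control"])

with tempfile.TemporaryDirectory() as tmp:
    features_path = os.path.join(tmp, "training_features.csv")
    labels_path = os.path.join(tmp, "training_labels.csv")

    # Save without the index, as in the cell above.
    pd.DataFrame(X_train).to_csv(features_path, index=False)
    pd.DataFrame(y_train).to_csv(labels_path, index=False)

    # Read the files back; the header row written by to_csv holds the
    # default integer column names (0, 1, ...).
    X_back = pd.read_csv(features_path).values
    y_back = pd.read_csv(labels_path).iloc[:, 0].values

print(np.array_equal(X_back, X_train))
print(y_back.tolist())
```

Because `.values` discards the column names in Step 4, files written this way carry integer headers rather than gene IDs.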

---