<a href="https://colab.research.google.com/github/vy-phung/genomic-data-science-project-about-fetus-and-adult/blob/main/Predicted_and_classified_genes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prediction and Classification from fetal and adult genes 

## Objectives

This notebook uses three gene-expression tables to test whether we can predict age and classify age group (fetus or adult) and gender for the same 10 samples:
1. `edata_tr.csv` (15559 genes; filtered, but not yet statistically analyzed)
2. `data for regulated gene.csv` (5566 genes; already statistically analyzed)
3. The 2 most up- and down-regulated genes (2 genes, extracted from `data for regulated gene.csv`)

*The data are in the GitHub repository: genomic data in the `tidy_data` folder, phenotype data in the `sample_data` folder.*

**Steps:**
*   Classify the age group (fetus or adult) and gender (female or male) of each sample with logistic regression.
*   Predict age, a continuous variable, with multiple linear regression, ridge regression (with a grid search over alpha), and a random forest regressor.


<h1>Table of contents</h1>
<ul>
    <li><a href="#ref1">Part 1: Algorithm </a></li>
    <li><a href="#ref2">Part 2: Analyzing table type 1 </a></li>
    <li><a href="#ref3">Part 3: Analyzing table type 2 </a></li>
    <li><a href="#ref4">Part 4: Analyzing table type 3</a></li>
</ul>



<h2>Part 1: Predicting and Classifying Algorithm</h2>

In [None]:
#@title Import packages 
# Package
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

In [None]:
#@title Logistic regression & Multi linear regression
class Prediction:
  """Fit a classifier or regressor on an 80/20 split and report its results."""
  def __init__(self, x, y, lr):
    self.x = x
    self.y = y
    self.lr = lr  # estimator used to classify or predict
  def train_test(self):
    # Hold out 20% of the samples and fit the estimator on the remaining 80%.
    x_train, x_test, y_train, y_test = train_test_split(self.x, self.y, test_size=0.20, random_state=0)
    self.lr.fit(x_train, y_train)
    self.x_test = x_test
    self.x_train = x_train
    self.y_test = y_test
    self.y_train = y_train
  def _is_fitting_data(self, x, y):
    # True when the caller passes the same data the object was built with.
    return x.equals(self.x) and y.equals(self.y)
  def predict(self, x, y):
    self.train_test()
    if self._is_fitting_data(x, y):
      # Report on the held-out samples only.
      x = self.x_test
      y = self.y_test
      print("number of test samples :", self.x_test.shape[0])
      print("number of training samples:", self.x_train.shape[0])
    print("actual data: ", y.values)
    print("predicted data: ", self.lr.predict(x))
  def cnf(self, x, y):
    # Runs only for string (object) labels; integer-coded labels are skipped.
    if y.iloc[:,0].dtype == 'O':
      self.train_test()
      # The matrix is computed on the data passed in.
      print("confusion matrix:", metrics.confusion_matrix(y, self.lr.predict(x)))
  def score(self, x, y):
    # Runs only for continuous (float64) targets.
    if y.iloc[:,0].dtype == 'float64':
      self.train_test()
      if self._is_fitting_data(x, y):
        print('The R-square of testing data: ', self.lr.score(self.x_test, self.y_test))
        print('The R-square of training data: ', self.lr.score(self.x_train, self.y_train))
        # The mean squared error is computed over all samples (training and test).
        print('The mean square error of actual and predicted fitting data is: ', \
          mean_squared_error(self.y, self.lr.predict(self.x)))
        return
      print('The R-square: ', self.lr.score(x, y))

In [None]:
#@title Ridge regression and Grid Search
class other_prediction:
  """Ridge regression with alpha chosen by a 4-fold grid search."""
  def __init__(self, x, y, parameters):
    self.x = x
    self.y = y
    self.parameters = parameters  # grid of alpha values to search
  def train_test(self):
    # Split the data and set up the grid search; nothing is fitted here.
    x_train, x_test, y_train, y_test = train_test_split(self.x, self.y, test_size=0.20, random_state=0)
    self.Grid = GridSearchCV(Ridge(), self.parameters, cv=4)
    self.x_test = x_test
    self.x_train = x_train
    self.y_test = y_test
    self.y_train = y_train
  def Ridge(self, x, y):
    # Runs only for continuous (float64) targets.
    if y.iloc[:,0].dtype == 'float64':
      self.train_test()
      # NOTE: the search and the final model are fitted on all samples,
      # including the ones reported below as the test split.
      self.Grid.fit(self.x, self.y)
      best_rr = self.Grid.best_estimator_
      ridge_model = Ridge(alpha=best_rr.alpha)
      ridge_model.fit(self.x, self.y)
      print(best_rr)
      if x.equals(self.x) and y.equals(self.y):
        # Report on the held-out samples only.
        x = self.x_test
        y = self.y_test
      print("R^2 of data:", ridge_model.score(x, y))
      print("predict data:", ridge_model.predict(x))
      print("actual testing data:", y.values)

<h2>Part 2: Table type 1</h2>

#### <b>Data processing</b>

In [None]:
edata = pd.read_csv("edata_tr.csv")
edata

Unnamed: 0.1,Unnamed: 0,SRR1554534,SRR1554535,SRR1554568,SRR1554561,SRR1554567,SRR1554536,SRR1554541,SRR1554539,SRR1554538,SRR1554537
0,A1BG,9.143633,8.583227,8.105891,9.206773,6.995598,8.762864,8.373665,8.429514,7.632873,8.489938
1,ADA,8.646385,9.145974,7.303994,8.001912,7.351627,9.770115,7.843839,8.253862,7.831596,7.197164
2,CDH2,13.562468,13.585906,14.957846,13.652418,14.829826,13.812075,14.545283,13.717079,14.708753,15.051336
3,AKT3,13.479431,13.159273,15.876052,13.925121,14.376621,12.332018,15.204848,14.383520,14.749481,15.584859
4,ZBTB11-AS1,7.746129,7.539217,7.489181,7.323988,7.706548,7.731734,7.684944,7.654166,7.625887,7.705363
...,...,...,...,...,...,...,...,...,...,...,...
15554,CDH1,9.283095,9.269662,10.242205,8.787615,10.671178,9.219568,9.003333,8.933461,9.185240,9.167789
15555,SLC12A6,13.031252,12.899341,13.801854,13.197724,13.451059,12.386146,13.613920,13.345966,13.644767,13.585326
15556,PTBP3,11.367093,11.867973,13.315986,11.766260,12.716064,11.924352,12.990262,12.386899,12.840028,12.564455
15557,DGCR2,15.038492,14.455368,13.976339,15.153004,13.996213,14.484859,13.867110,14.553994,13.960779,14.168734


In [None]:
edata = edata.set_index("Unnamed: 0")
edata.index.names  = [None]
edata_T = edata.T
edata_T

Unnamed: 0,A1BG,ADA,CDH2,AKT3,ZBTB11-AS1,MED6,NAALAD2,DDTL,NAALADL1,NINJ2-AS1,ACOT8,ABI1,GNPDA1,MIMT1,KCNE3,ZBTB33,SNHG8,GTF2IP4,CDH3,TANK,HAVCR1P1,POM121C,ZSCAN30,MCTS2P,SRA1,UCKL1-AS1,TMEM170B,SNORD58C,ZGLP1,FAM86JP,LOC100126784,C8orf88,PLCE1-AS1,FAM229A,JAZF1-AS1,LOC100128164,TMPO-AS1,LINC02731,ZNF667-AS1,LOC100128253,...,EXOG,XYLB,OXSR1,GFPT2,CRYZL1,WDR1,AMMECR1,CDC25C,GOLGA5,HS3ST4,HS3ST2,HS3ST1,USP15,CDC27,USP3,MVP,SLC23A2,SLC23A1,THRAP3,MED12,MED13,CDC34,NR1I3,NUP153,CCS,NR1D2,RBX1,CDC42,DOP1B,THOC1,REC8,RCE1,HNRNPDL,DMTF1,PPP4R1,CDH1,SLC12A6,PTBP3,DGCR2,CASP8AP2
SRR1554534,9.143633,8.646385,13.562468,13.479431,7.746129,10.96576,10.367572,7.328231,6.963692,8.822544,11.946038,13.366491,12.071443,8.185792,7.299651,11.72431,11.040949,9.332611,10.056517,11.043564,6.306034,12.131384,11.18706,8.428487,11.774578,5.942993,12.447507,5.99028,9.813163,8.156964,11.546933,5.310318,5.012922,8.198675,8.275069,7.41189,6.364689,9.262248,11.996517,8.508692,...,10.069137,8.260353,12.261088,11.033097,11.74505,13.960913,8.851549,4.834219,11.609201,13.196758,12.322287,10.436515,11.51002,12.598076,9.241031,11.818152,14.454579,6.029861,14.044886,12.553101,12.546297,13.108467,7.783518,12.263514,11.977166,13.057045,12.213122,13.563907,12.427323,11.218925,11.567748,10.500004,15.010639,13.123384,11.748144,9.283095,13.031252,11.367093,15.038492,11.208092
SRR1554535,8.583227,9.145974,13.585906,13.159273,7.539217,10.953407,10.644661,7.691085,6.885521,9.179721,11.757942,13.75868,12.430628,6.48586,7.174165,12.13727,11.156462,9.108857,8.777516,11.573996,5.65982,11.713654,11.583213,8.646386,11.696444,6.466384,13.535621,6.264076,9.802732,7.660717,11.087468,6.39425,5.049559,7.810233,8.786065,6.888673,6.585095,9.554467,12.107183,7.520252,...,10.350487,8.431188,12.25201,11.37567,11.812673,13.572111,9.177446,5.725231,12.15637,12.959437,12.4104,9.938927,12.066043,13.211498,9.896571,11.529537,14.225245,6.335155,13.753086,12.006727,12.878277,12.702216,7.319999,12.5876,11.735796,13.893709,12.673437,14.078996,12.243227,11.659515,11.144585,10.362881,15.20731,13.417473,12.330147,9.269662,12.899341,11.867973,14.455368,11.459779
SRR1554568,8.105891,7.303994,14.957846,15.876052,7.489181,11.062971,9.091478,7.186178,6.567708,8.099645,10.652184,12.833693,12.045205,8.525005,7.401378,13.056753,10.896357,9.652456,9.948553,12.237046,6.912586,12.078191,12.612542,9.491832,12.062338,7.420499,13.577083,6.72647,10.16727,6.40982,9.171032,7.553961,6.791164,7.500049,4.925746,6.790891,8.339422,7.691874,12.001934,5.392324,...,11.014434,9.493641,13.640403,10.264169,11.56818,13.661309,10.381393,9.036064,12.09683,13.808086,9.569563,10.199837,13.293106,13.49871,12.472372,8.653635,14.039276,7.900844,14.713353,13.746716,14.387287,12.293468,7.643377,13.459464,11.521068,12.491744,11.777826,14.12906,14.28426,12.559479,13.256709,11.223629,16.253288,14.440244,14.06145,10.242205,13.801854,13.315986,13.976339,12.916206
SRR1554561,9.206773,8.001912,13.652418,13.925121,7.323988,10.825176,10.406496,7.30599,6.689252,8.340297,12.014095,13.464797,12.325439,6.562989,7.236675,11.858819,10.746418,8.524623,9.883197,10.949873,6.030009,12.425645,11.040663,8.53971,11.678305,7.618927,12.740738,5.806708,10.260433,8.05794,11.414649,5.844022,4.581718,8.199565,8.083382,7.354909,6.401091,9.734726,11.736873,8.910149,...,10.316555,9.09542,12.113526,11.062619,11.844331,13.799661,8.785126,5.243429,11.545216,12.645491,12.061151,10.549871,11.786105,12.582161,8.911123,11.170068,14.513305,6.162469,13.985093,12.670834,12.884486,13.156844,7.894102,12.546458,12.006163,13.57133,12.112764,13.609118,12.720381,10.998715,11.434048,10.645623,14.82571,13.043291,12.223982,8.787615,13.197724,11.76626,15.153004,11.188994
SRR1554567,6.995598,7.351627,14.829826,14.376621,7.706548,11.161363,10.659595,7.545062,7.167443,8.226544,10.470291,13.058067,11.7991,8.035972,7.522792,12.975194,10.914658,9.626314,10.21418,11.847383,7.252727,11.747622,12.383769,9.571803,11.834974,7.270623,13.558488,6.698271,10.177543,6.812443,9.747976,7.021122,6.238395,7.47833,5.423092,6.93229,7.883494,7.494239,11.908942,5.702407,...,10.988777,8.928275,13.372049,10.518149,11.434877,13.923825,10.270943,8.138421,12.302425,12.840988,10.066941,11.272583,13.166701,13.297583,11.945974,9.499193,14.141542,7.775489,14.487247,12.872057,14.165483,12.183555,7.84095,13.425944,11.439851,12.427382,12.008372,14.559294,14.036642,12.3764,13.087952,11.107675,16.131331,14.136332,13.859248,10.671178,13.451059,12.716064,13.996213,12.598715
SRR1554536,8.762864,9.770115,13.812075,12.332018,7.731734,10.689913,10.299173,7.964718,8.185957,9.999757,11.250633,13.886566,12.113007,7.928325,8.362669,11.759541,12.084851,9.486665,9.344669,11.626087,7.166815,11.276501,11.699403,8.49442,12.000776,6.134546,13.580215,5.459952,10.244601,7.576995,9.96062,7.823179,6.126681,9.247907,7.291583,7.58301,6.524355,9.309402,12.100452,6.236367,...,10.351041,8.592756,12.53632,11.131612,11.551088,14.140617,9.388747,6.278089,12.230316,11.279648,9.403883,9.725603,11.680456,12.622177,10.693432,12.14406,14.625417,6.217735,13.886194,12.442416,13.302027,13.125076,7.781475,12.234108,12.289836,13.081711,13.077235,13.828584,11.0612,12.020781,11.486456,10.342871,15.050849,13.381695,13.077099,9.219568,12.386146,11.924352,14.484859,11.474929
SRR1554541,8.373665,7.843839,14.545283,15.204848,7.684944,11.229573,10.146644,7.117489,6.716187,8.372258,10.522167,13.080177,11.609264,7.924678,6.98381,12.988798,11.095362,9.024497,10.524482,11.599479,6.77456,11.794413,12.182915,9.677926,11.698341,7.921699,14.089833,6.252003,10.280204,7.060275,10.010809,6.705624,4.99975,7.063388,5.21252,7.069784,6.544668,7.243693,11.953166,5.769721,...,11.063445,9.271493,13.364083,10.993929,11.612425,14.120697,9.413796,5.990236,12.156954,13.199108,10.965337,10.308968,13.413337,13.335499,11.672582,9.05964,14.746426,7.244637,14.433801,12.96659,14.532041,12.044646,7.930144,13.430838,11.2073,12.535674,12.352845,15.022133,14.086799,12.025424,12.938192,11.053431,15.731998,13.898561,13.628867,9.003333,13.61392,12.990262,13.86711,12.611138
SRR1554539,8.429514,8.253862,13.717079,14.38352,7.654166,11.095095,11.016579,7.332757,6.733446,8.667929,11.461451,13.620263,12.146074,8.51042,7.914843,12.410141,11.055537,8.966921,9.211475,11.36915,6.265437,11.972456,11.462477,8.654467,11.28815,6.36158,13.962509,5.593128,9.195965,8.16652,11.345474,6.320564,4.970934,7.187688,8.59433,6.973314,5.915103,9.754659,12.038969,7.911591,...,10.406944,8.91649,12.425301,10.963291,11.938392,13.521245,9.518708,4.809077,12.049582,12.803462,12.092248,10.314434,12.692617,13.532675,9.261326,10.919109,14.544397,6.646107,13.975075,12.098914,13.43217,12.386501,7.524622,12.837602,11.373368,14.232894,12.354708,14.026965,12.664088,11.34344,10.863579,10.213115,15.199773,13.450274,12.149493,8.933461,13.345966,12.386899,14.553994,11.802491
SRR1554538,7.632873,7.831596,14.708753,14.749481,7.625887,11.270915,10.110863,7.837074,7.121652,7.761858,10.601639,13.045144,12.010443,7.354156,7.639001,13.263124,10.828735,9.615013,10.247867,11.707281,7.076594,11.884892,12.586344,9.52257,11.828403,8.563425,13.697779,6.513355,10.365845,6.804072,9.786128,7.202698,7.027104,7.439959,6.139303,6.954601,7.583894,7.464869,11.887428,4.781603,...,11.120033,9.519349,13.375188,10.565695,11.52975,13.912222,10.225424,8.343487,12.06887,12.601003,10.043323,11.37818,13.143812,13.153767,12.034657,9.12982,14.247109,7.633567,14.538066,13.484143,14.534461,12.236302,7.724579,13.500652,11.364777,12.463614,12.047324,14.488108,13.995449,12.26821,13.035486,11.18837,15.606863,14.172005,13.728335,9.18524,13.644767,12.840028,13.960779,12.758373
SRR1554537,8.489938,7.197164,15.051336,15.584859,7.705363,11.087431,9.066688,7.288075,6.927873,7.922981,10.700495,12.923172,11.952618,7.240286,7.380692,13.212136,11.467183,9.165945,11.125987,11.593463,6.87832,11.988077,12.277068,9.304977,11.982555,7.968995,13.730049,6.522953,10.679902,6.516883,10.063939,7.289761,6.726085,7.788948,5.692328,6.909132,7.567133,7.722174,11.704102,4.796039,...,11.178258,9.396158,13.460092,10.63957,11.525694,13.957229,9.970585,8.52111,11.993909,13.630118,9.278591,11.800934,13.183333,13.345844,12.174824,9.207155,14.492742,7.684917,14.538169,13.489821,14.506063,12.249261,7.698794,13.469554,11.496307,12.636038,12.099059,14.616747,13.771822,12.246823,13.350524,11.350119,15.920022,14.123586,13.434599,9.167789,13.585326,12.564455,14.168734,12.705135


In [None]:
pheno = pd.read_csv("/content/pheno_sample.csv")
pheno = pheno.set_index("Unnamed: 0")
pheno.index.names = [None]
pheno

Unnamed: 0,fetus_adult,age_group,age,sex,gender
SRR1554534,adult,1,40.42,male,1
SRR1554535,adult,1,41.58,male,1
SRR1554568,fetus,0,-0.4986,male,1
SRR1554561,adult,1,43.88,male,1
SRR1554567,fetus,0,-0.4027,male,1
SRR1554536,adult,1,44.17,female,0
SRR1554541,fetus,0,-0.3836,male,1
SRR1554539,adult,1,36.5,female,0
SRR1554538,fetus,0,-0.4027,female,0
SRR1554537,fetus,0,-0.3836,female,0


In [None]:
x = edata_T

#### <b>Logistic regression</b>

**Classifying fetus or adult based on age_group**

In [None]:
y = pheno[["age_group"]]
y

Unnamed: 0,age_group
SRR1554534,1
SRR1554535,1
SRR1554568,0
SRR1554561,1
SRR1554567,0
SRR1554536,1
SRR1554541,0
SRR1554539,1
SRR1554538,0
SRR1554537,0


In [None]:
lr = LogisticRegression()
p = Prediction(x,y,lr)
p.predict(x,y)
p.cnf(x,y)

number of test samples : 2
number of training samples: 8
actual data:  [[0]
 [0]]
predicted data:  [0 0]


**Classifying gender of samples**

In [None]:
y_gender = pheno[["gender"]]
y_gender

Unnamed: 0,gender
SRR1554534,1
SRR1554535,1
SRR1554568,1
SRR1554561,1
SRR1554567,1
SRR1554536,0
SRR1554541,1
SRR1554539,0
SRR1554538,0
SRR1554537,0


In [None]:
p = Prediction(x,y_gender,lr)
p.predict(x,y_gender)
p.cnf(x,y_gender)

number of test samples : 2
number of training samples: 8
actual data:  [[1]
 [0]]
predicted data:  [0 0]
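
`p.cnf` prints nothing above because the `gender` labels are integer-coded, while `cnf` only runs for string (object) labels. As a minimal sketch, the confusion matrix for the two held-out samples can be tallied by hand from the printed values. The helper name `confusion_matrix_2x2` is hypothetical and not part of the notebook.

```python
def confusion_matrix_2x2(actual, predicted):
    """Rows are actual labels (0, 1); columns are predicted labels (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Values copied from the output above (gender: 1 = male, 0 = female).
actual = [1, 0]
predicted = [0, 0]

cm = confusion_matrix_2x2(actual, predicted)
accuracy = (cm[0][0] + cm[1][1]) / len(actual)
print(cm)        # [[1, 0], [1, 0]]
print(accuracy)  # 0.5
```

With one correct prediction out of two, this split is too small to say whether gender can be recovered from these genes.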


#### <b>Predicting age (continuous data)</b>

In [None]:
y_age = pheno[["age"]]
y_age

Unnamed: 0,age
SRR1554534,40.42
SRR1554535,41.58
SRR1554568,-0.4986
SRR1554561,43.88
SRR1554567,-0.4027
SRR1554536,44.17
SRR1554541,-0.3836
SRR1554539,36.5
SRR1554538,-0.4027
SRR1554537,-0.3836


**Multiple linear regression**

In [None]:
lr = LinearRegression()
p = Prediction(x,y_age,lr)
p.predict(x,y_age)
p.score(x,y_age)

number of test samples : 2
number of training samples: 8
actual data:  [[-0.4986]
 [-0.4027]]
predicted data:  [[-3.38719242]
 [-1.75369115]]
The R-square of testing data:  -2210.4501135869386
The R-square of training data:  1.0
The mean square error of actual and predicted fitting data is:  1.0169143259568714


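The R-square of testing data:  -2210.4501135869386

The R-square of training data:  1.0

The R^2 of the testing data is strongly negative, while the training data are fitted exactly (R^2 = 1.0). With only 8 training samples and thousands of genes, a linear model can always reproduce the training ages perfectly.

=> The model is overfitting.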
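
To see why the test score is so extreme, here is a minimal sketch that recomputes R^2 = 1 - SS_res / SS_tot by hand from the two test values printed above. The function name `r_squared` is only for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Test ages and linear-regression predictions copied from the output above.
actual = [-0.4986, -0.4027]
predicted = [-3.38719242, -1.75369115]

print(r_squared(actual, predicted))  # about -2210.45
```

The two test ages differ by less than 0.1 year, so SS_tot is tiny (about 0.0046) and even an error of a few years produces a very large negative R^2.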

**Using grid search to find suitable alpha for ridge model**

In [None]:
parameters = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000]}]
o = other_prediction(x,y_age,parameters)
o.Ridge(x,y_age)

Ridge(alpha=1000)
R^2 of data: -31.726080041146417
predict data: [[-0.88566156]
 [-0.428606  ]]
actual testing data: [[-0.4986]
 [-0.4027]]


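R^2 of data: -31.726080041146417

Ridge regression with alpha = 1000 raises the R^2 from about -2210 to about -31.7. Note that `other_prediction` fits the grid search and the final ridge model on all 10 samples, including the two test samples, so this score is not directly comparable with the held-out score of the linear model.

=> Ridge regression might be a better choice than multiple linear regression here, although a negative R^2 still means it predicts the test ages worse than their mean would.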
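
The `other_prediction` class fits the grid search on all 10 samples. As a hedged sketch of an alternative, the search can be fitted on the training split only, so that the test samples stay unseen. The data here are synthetic (random numbers standing in for expression values), and the variable names are assumptions for illustration, not the notebook's.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: 10 samples x 50 "genes" (not the notebook's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# Search alpha with 4-fold cross-validation on the 8 training samples only.
# Mean squared error is used because R^2 is unstable on 2-sample folds.
grid = GridSearchCV(
    Ridge(),
    {"alpha": [0.001, 0.1, 1, 10, 100, 1000]},
    cv=4,
    scoring="neg_mean_squared_error",
)
grid.fit(X_train, y_train)

best_alpha = grid.best_params_["alpha"]
test_r2 = grid.best_estimator_.score(X_test, y_test)  # R^2 on unseen samples
print("best alpha:", best_alpha)
print("test R^2:", test_r2)
```

Because the best estimator is refitted on the training split only, the test R^2 here measures performance on samples the model has never seen.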


<h2>Part 3: Table type 2</h2>

#### <b>Data processing</b>

In [None]:
up_down = pd.read_csv("data for regulated gene.csv")
up_down

Unnamed: 0.1,Unnamed: 0,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,SRR1554534,SRR1554535,SRR1554568,SRR1554561,SRR1554567,SRR1554536,SRR1554541,SRR1554539,SRR1554538,SRR1554537
0,A1BG,380.846902,-1.102936,0.407692,-2.705317,6.823930e-03,1.190664e-02,9.143633,8.583227,8.105891,9.206773,6.995598,8.762864,8.373665,8.429514,7.632873,8.489938
1,A2M,13185.383185,-1.670680,0.420489,-3.973179,7.091974e-05,1.717291e-04,13.657416,13.795969,12.584194,13.003245,12.597035,15.235020,12.938130,13.881265,12.930901,12.535831
2,A2ML1,484.632776,-2.748295,0.569283,-4.827644,1.381576e-06,4.322248e-06,8.848175,9.062477,6.909735,8.644313,7.457088,10.622862,6.885452,7.708383,7.500564,7.351208
3,A4GALT,412.311453,-2.790862,0.571964,-4.879432,1.063917e-06,3.379052e-06,8.868851,8.344972,6.895125,8.568670,6.710507,10.362960,7.260562,7.533763,7.292369,6.723675
4,AARD,124.885643,-2.105662,0.537200,-3.919699,8.865947e-05,2.115377e-04,7.347104,7.314591,5.402734,6.636227,6.395016,7.306073,6.484457,7.470666,5.264764,5.737391
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7026,ZXDA,1338.880407,1.446283,0.201328,7.183722,6.783895e-13,4.459430e-12,9.298298,9.408328,10.695708,9.704292,10.691396,9.581422,10.874795,9.996956,10.778896,10.978820
7027,ZXDB,2840.638643,1.029280,0.251155,4.098180,4.164116e-05,1.045492e-04,10.528675,11.216468,11.697161,10.929075,11.812847,10.392533,11.949005,11.379329,11.716040,11.857990
7028,ZXDC,14083.362352,1.596761,0.129687,12.312466,7.761457e-35,3.260866e-33,12.904692,12.734604,14.242381,12.935801,14.026589,12.953526,14.199765,12.951150,14.424315,14.336139
7029,ZYX,14229.672689,-1.002155,0.256375,-3.908937,9.270299e-05,2.205726e-04,14.633231,13.965675,13.345845,14.427006,13.168138,13.685386,13.205510,13.632358,13.208680,13.456611


In [None]:
select = ["Unnamed: 0", "SRR1554534", "SRR1554535", "SRR1554568", "SRR1554561", "SRR1554567", "SRR1554536", "SRR1554541", "SRR1554539", "SRR1554538", "SRR1554537"]
up_down1 = up_down[select]
up_down1 = up_down1.set_index("Unnamed: 0")
up_down1.index.names = [None]
up_down_T = up_down1.T
up_down_T

Unnamed: 0,A1BG,A2M,A2ML1,A4GALT,AARD,AARS1,AATK,ABAT,ABCA1,ABCA10,ABCA12,ABCA2,ABCA6,ABCA7,ABCA8,ABCA9,ABCB1,ABCB7,ABCC12,ABCC4,ABCC8,ABCC9,ABCG2,ABCG4,ABHD11,ABHD12,ABHD12B,ABHD14A,ABHD14B,ABHD15,ABHD16B,ABHD17A,ABHD18,ABHD8,ABI3,ABI3BP,ABL2,ABLIM2,ABR,ABRACL,...,ZNF878,ZNF879,ZNF880,ZNF883,ZNF888,ZNF890P,ZNF90,ZNF91,ZNF92,ZNF93,ZNHIT1,ZNHIT2,ZNRF1,ZNRF2,ZNRF3,ZRANB2-AS1,ZSCAN12,ZSCAN16,ZSCAN2,ZSCAN20,ZSCAN21,ZSCAN22,ZSCAN23,ZSCAN26,ZSCAN29,ZSCAN30,ZSCAN9,ZSWIM3,ZSWIM4,ZSWIM5,ZSWIM6,ZSWIM8-AS1,ZUP1,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYX,ZZZ3
SRR1554534,9.143633,13.657416,8.848175,8.868851,7.347104,15.757594,15.412314,15.773004,10.728758,8.459024,6.127944,16.924871,6.642122,11.876442,11.22606,10.146097,11.725727,10.232176,9.959768,8.642513,13.262715,11.017764,12.156003,12.361851,10.660498,14.027339,8.701414,11.986268,10.572723,9.056314,9.86149,12.058016,9.233208,14.098318,7.786744,7.51958,12.477906,14.127248,16.562266,10.143301,...,5.523122,8.38892,9.117129,9.318429,7.031279,7.694471,5.447482,10.384837,9.385131,7.455699,11.989397,9.976802,13.007817,9.953358,12.479115,5.873261,9.956498,8.342552,10.050968,8.445623,9.713292,9.671006,7.658145,11.056144,11.53146,11.18706,9.657207,9.298647,11.358897,10.997099,12.237565,6.544398,9.8726,9.203633,9.15843,9.298298,10.528675,12.904692,14.633231,11.142157
SRR1554535,8.583227,13.795969,9.062477,8.344972,7.314591,15.345289,15.065392,15.345071,11.097286,8.806715,6.304167,16.764446,7.367755,11.504536,12.397963,10.500113,11.504491,10.476153,9.713523,9.107198,12.773792,11.002192,12.35351,11.955876,10.431284,13.922281,8.933736,11.84107,10.369419,9.143898,9.390017,11.550377,9.317387,13.556946,7.055041,8.419397,12.318609,13.42687,16.039723,10.826507,...,6.584933,9.262461,9.203113,9.805414,7.386525,6.746905,6.036277,11.06584,10.845381,7.94345,11.735374,9.699339,12.606994,10.172354,12.324561,7.158877,10.243413,8.677554,9.795162,8.263585,9.464862,9.680491,7.331028,11.578808,11.704103,11.583213,9.615276,9.454976,10.480033,10.825541,12.48337,6.814309,10.563563,10.563658,9.266755,9.408328,11.216468,12.734604,13.965675,11.893657
SRR1554568,8.105891,12.584194,6.909735,6.895125,5.402734,14.13347,11.532492,14.75558,13.191035,6.858484,7.621509,14.372198,6.09779,11.133889,8.692355,9.506912,9.500953,11.538425,5.277975,10.846871,10.464994,9.921029,10.012782,10.405635,9.161784,11.737641,3.264354,10.270335,9.561828,7.882171,8.783332,9.927939,10.52108,11.857025,6.375065,9.69131,13.387431,10.465655,15.238726,12.014159,...,7.839871,10.127181,11.089222,11.054856,8.429374,5.974485,8.839597,13.197695,12.082913,10.808293,10.895053,8.6974,14.001511,11.231796,11.589324,8.347241,12.09083,10.170591,12.288275,9.769823,10.585639,10.576931,10.375166,12.529951,12.852354,12.612542,11.566507,10.154758,12.185891,13.014882,13.548023,7.629274,11.259243,11.688186,11.768482,10.695708,11.697161,14.242381,13.345845,12.719315
SRR1554561,9.206773,13.003245,8.644313,8.56867,6.636227,15.908997,15.465764,15.899338,10.71751,8.575097,5.908803,16.784595,6.879113,12.057728,11.010436,10.494653,11.084899,9.7398,10.092062,8.077065,13.661388,10.231561,11.516742,12.396533,10.475263,13.848728,8.642972,11.968709,10.327511,8.692463,9.80006,12.277649,9.003095,14.280003,6.652067,7.843736,12.702621,14.271265,16.90717,10.242545,...,5.708142,8.073806,9.156198,9.659393,6.80831,7.87749,5.72666,10.788137,9.020034,7.214739,11.777721,10.451524,12.989843,10.252754,12.057338,6.447057,9.869246,7.923305,9.945018,9.097662,9.19977,9.875337,8.230564,11.173858,11.610486,11.040663,9.007093,9.615308,11.489777,10.879565,12.278762,6.672439,9.89402,9.646344,8.77014,9.704292,10.929075,12.935801,14.427006,11.499921
SRR1554567,6.995598,12.597035,7.457088,6.710507,6.395016,14.037013,11.828298,14.695775,12.993483,7.561909,7.299447,14.174753,6.097308,9.223357,10.005713,9.447863,9.724329,11.047798,7.269236,10.216889,11.079017,10.008156,10.814944,10.334022,9.238554,12.053859,4.516572,10.446782,9.577568,7.492575,8.560335,9.900396,10.267065,11.787334,6.165152,10.041557,13.502438,10.731063,15.325546,11.559712,...,7.641562,10.195784,11.425419,11.261326,7.995979,6.123099,8.002635,13.242272,11.822411,10.332267,10.986188,9.061884,13.999649,11.319463,11.41168,8.548376,11.817836,10.220482,12.305577,9.40036,10.407359,10.02979,9.625091,12.446861,12.600827,12.383769,12.012142,10.18219,12.059872,12.759299,13.339272,7.201364,11.282461,11.741266,10.929261,10.691396,11.812847,14.026589,13.168138,12.594819
SRR1554536,8.762864,15.23502,10.622862,10.36296,7.306073,14.566318,13.675639,15.950138,12.610804,8.817514,6.762878,15.202903,7.559637,10.829712,11.49233,11.877039,12.977066,10.738751,9.313206,10.014521,12.027243,12.094676,14.008323,10.825705,11.162297,13.625637,7.036439,11.942625,11.457666,9.472182,9.306874,11.936746,9.007012,13.562381,7.662428,9.143716,11.914227,12.49365,16.048982,10.276661,...,6.896193,8.988674,10.687252,10.074483,7.232065,5.933244,5.669458,11.183925,10.368346,8.424874,12.722674,9.560777,12.170424,10.099704,13.350345,6.474878,10.319819,8.308719,9.78504,8.380051,9.006333,8.77276,8.03693,11.256797,11.611506,11.699403,9.997116,9.236661,10.455883,11.057891,12.4616,7.120985,10.131568,9.119705,8.653688,9.581422,10.392533,12.953526,13.685386,11.839166
SRR1554541,8.373665,12.93813,6.885452,7.260562,6.484457,14.312565,12.056377,14.899582,12.200847,7.614164,7.872199,14.485433,6.604809,9.533953,9.361372,10.234131,10.332023,11.036601,6.431374,9.920664,11.577783,10.56243,11.374498,10.986478,9.10014,12.214557,4.106232,10.829724,9.391934,6.915489,7.567838,9.842287,10.262243,12.020757,6.106361,9.980088,13.931721,11.478559,15.52662,11.375684,...,7.138301,10.189643,10.839026,11.496353,7.668963,6.60707,7.727974,13.448212,11.8309,9.930534,11.032188,8.935184,14.152803,12.053968,11.366272,8.188887,11.541641,9.988589,12.013712,9.793927,10.06501,10.383262,10.262279,12.294102,12.52136,12.182915,11.569984,10.358341,11.975724,12.69983,13.39752,7.202656,11.181691,12.014128,9.032225,10.874795,11.949005,14.199765,13.20551,12.744916
SRR1554539,8.429514,13.881265,7.708383,7.533763,7.470666,15.532491,14.628722,15.651408,10.843313,8.726925,6.168267,16.26292,7.651274,11.439154,11.805462,10.941172,11.729856,10.484628,10.457069,9.169996,13.016873,10.447138,12.276873,11.96409,10.170058,13.721899,9.222596,11.683003,10.074123,8.376754,9.21842,11.332301,9.479364,13.519756,6.477536,8.485803,12.994895,13.706205,16.417238,10.760538,...,6.281694,9.108239,9.634101,10.534168,7.398601,7.555886,5.978255,11.955723,10.692762,7.752293,11.500031,9.139357,12.675364,10.498504,11.467811,7.34246,10.347586,8.133257,9.971567,8.870507,9.276787,9.626151,8.183871,11.241223,11.833354,11.462477,9.314155,9.50894,10.695666,10.891782,12.405263,6.724808,10.525105,10.472781,8.943709,9.996956,11.379329,12.95115,13.632358,12.01947
SRR1554538,7.632873,12.930901,7.500564,7.292369,5.264764,14.157486,11.641881,14.640573,13.038372,7.01715,7.488857,14.356908,6.022465,9.327464,8.262962,9.484422,9.894313,11.403434,5.826383,10.305574,11.149483,10.638605,10.799533,10.329258,8.967529,11.714889,4.636377,10.381,9.106582,7.644281,8.178266,9.730864,10.439218,11.655518,6.702167,9.949246,13.735393,10.56419,15.280285,11.424988,...,7.469045,10.184612,10.759095,11.441091,8.330132,6.239233,8.251958,13.380185,11.84361,10.458373,10.998128,8.607657,13.863627,11.318112,11.421746,8.371995,11.893092,10.332044,12.257206,10.124996,10.268063,10.358589,10.627177,12.403223,12.682141,12.586344,12.001362,10.27858,12.158205,12.984516,13.459228,7.327727,10.986112,11.676821,11.047248,10.778896,11.71604,14.424315,13.20868,12.714798
SRR1554537,8.489938,12.535831,7.351208,6.723675,5.737391,14.182287,12.151114,14.437766,11.547024,7.10201,6.988406,14.549216,5.339306,10.372521,7.247915,8.526775,9.669661,11.39349,6.019801,10.055305,11.561851,10.067586,10.577172,9.988133,9.007739,12.131975,3.884935,10.928635,9.473773,7.660333,8.464343,10.087238,10.380579,12.058889,6.190769,9.877223,13.598623,10.99077,15.564852,11.684125,...,7.547207,9.924465,10.720465,11.315918,8.328927,6.242902,8.265535,13.145109,11.736259,10.400877,11.141026,9.153779,13.823957,11.312668,11.241986,8.151297,11.81702,10.335912,12.319213,10.054939,10.350619,10.587941,10.329576,12.27958,12.483246,12.277068,11.614997,10.279064,12.18608,12.953881,13.504647,7.569523,10.959673,11.858405,11.014847,10.97882,11.85799,14.336139,13.456611,12.679419


In [None]:
x1 = up_down_T

#### <b>Logistic regression</b>

**Classifying fetus or adult based on age_group**

In [None]:
lr = LogisticRegression()
p = Prediction(x1,y,lr)
p.predict(x1,y)
p.cnf(x1,y)

number of test samples : 2
number of training samples: 8
actual data:  [[0]
 [0]]
predicted data:  [0 0]


**Classifying gender of samples**

In [None]:
p = Prediction(x1,y_gender,lr)
p.predict(x1,y_gender)
p.cnf(x1,y_gender)

number of test samples : 2
number of training samples: 8
actual data:  [[1]
 [0]]
predicted data:  [0 1]


#### <b>Predicting age (continuous data)</b>

**Multiple linear regression**

In [None]:
lr = LinearRegression()
p = Prediction(x1,y_age,lr)
p.predict(x1,y_age)
p.score(x1,y_age)

number of test samples : 2
number of training samples: 8
actual data:  [[-0.4986]
 [-0.4027]]
predicted data:  [[-3.72575422]
 [-1.9660493 ]]
The R-square of testing data:  -2795.314245992872
The R-square of training data:  1.0
The mean square error of actual and predicted fitting data is:  1.2858585410344976


The R-square of testing data:  -2795.314245992872

The R-square of training data:  1.0

The R^2 of the testing data is again strongly negative, while the training data are fitted exactly (R^2 = 1.0).

=> The model is overfitting.

**Using grid search to find suitable alpha for ridge model to predict age of sample**

In [None]:
parameters = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000]}]
o = other_prediction(x1,y_age,parameters)
o.Ridge(x1,y_age)

Ridge(alpha=1000)
R^2 of data: -94.94043343888032
predict data: [[-1.15500159]
 [-0.50423776]]
actual testing data: [[-0.4986]
 [-0.4027]]


R^2 of data: -94.94043343888032

Ridge regression with alpha = 1000 raises the R^2 from about -2795 to about -94.9, which suggests less overfitting than multiple linear regression, although the score is still negative.

<h2>Part 4: Table type 3</h2>

#### <b>Data processing</b>

In [None]:
# The most up-regulated gene: log2FoldChange > 1 with the smallest adjusted p-value
up = up_down[up_down['log2FoldChange'] > 1]
up = up[up["padj"] == up["padj"].min()]
up = up.set_index("Unnamed: 0")
up.index.names = [None]
up

Unnamed: 0,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,SRR1554534,SRR1554535,SRR1554568,SRR1554561,SRR1554567,SRR1554536,SRR1554541,SRR1554539,SRR1554538,SRR1554537
ST8SIA2,21470.98555,7.44298,0.2003,37.159148,3.1199569999999997e-302,4.823766e-298,8.614598,8.485263,14.645985,8.707785,14.756824,8.39641,15.159507,8.971893,14.917903,15.015816


In [None]:
# The most down-regulated gene: log2FoldChange < -1 with the smallest adjusted p-value
down = up_down[up_down['log2FoldChange'] < -1]
down = down[down["padj"] == down["padj"].min()]
down = down.set_index("Unnamed: 0")
down.index.names = [None]
down

Unnamed: 0,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,SRR1554534,SRR1554535,SRR1554568,SRR1554561,SRR1554567,SRR1554536,SRR1554541,SRR1554539,SRR1554538,SRR1554537
BCL2L2,21968.220153,-2.676407,0.088315,-30.305208,9.788276e-202,3.783413e-198,14.967316,14.943844,12.776112,15.151775,12.71442,14.999478,12.761278,15.053436,12.718466,12.793366
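
The two cells above pick, within each direction of regulation, the gene with the smallest adjusted p-value (`padj`), not the one with the largest fold change. A minimal sketch of the same selection in plain Python, using a few rows copied from the tables above (the function name `most_significant` is hypothetical):

```python
# (gene, log2FoldChange, padj) rows copied from the tables above.
rows = [
    ("A1BG", -1.102936, 1.190664e-02),
    ("A2M", -1.670680, 1.717291e-04),
    ("ZXDA", 1.446283, 4.459430e-12),
    ("ZXDC", 1.596761, 3.260866e-33),
    ("ST8SIA2", 7.44298, 4.823766e-298),
    ("BCL2L2", -2.676407, 3.783413e-198),
]

def most_significant(rows, direction):
    """Return the gene with the smallest padj among |log2FC| > 1 in one direction."""
    if direction == "up":
        candidates = [r for r in rows if r[1] > 1]
    else:
        candidates = [r for r in rows if r[1] < -1]
    return min(candidates, key=lambda r: r[2])[0]

print(most_significant(rows, "up"))    # ST8SIA2
print(most_significant(rows, "down"))  # BCL2L2
```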


In [None]:
up1 = up[["SRR1554534", "SRR1554535", "SRR1554568", "SRR1554561", "SRR1554567", "SRR1554536", "SRR1554541", "SRR1554539", "SRR1554538", "SRR1554537"]]
up1_T = up1.T
up1_T

Unnamed: 0,ST8SIA2
SRR1554534,8.614598
SRR1554535,8.485263
SRR1554568,14.645985
SRR1554561,8.707785
SRR1554567,14.756824
SRR1554536,8.39641
SRR1554541,15.159507
SRR1554539,8.971893
SRR1554538,14.917903
SRR1554537,15.015816


In [None]:
down1 = down[["SRR1554534", "SRR1554535", "SRR1554568", "SRR1554561", "SRR1554567", "SRR1554536", "SRR1554541", "SRR1554539", "SRR1554538", "SRR1554537"]]
down1_T = down1.T
down1_T

Unnamed: 0,BCL2L2
SRR1554534,14.967316
SRR1554535,14.943844
SRR1554568,12.776112
SRR1554561,15.151775
SRR1554567,12.71442
SRR1554536,14.999478
SRR1554541,12.761278
SRR1554539,15.053436
SRR1554538,12.718466
SRR1554537,12.793366


In [None]:
x2 = pd.concat([up1_T,down1_T],axis=1)
x2

Unnamed: 0,ST8SIA2,BCL2L2
SRR1554534,8.614598,14.967316
SRR1554535,8.485263,14.943844
SRR1554568,14.645985,12.776112
SRR1554561,8.707785,15.151775
SRR1554567,14.756824,12.71442
SRR1554536,8.39641,14.999478
SRR1554541,15.159507,12.761278
SRR1554539,8.971893,15.053436
SRR1554538,14.917903,12.718466
SRR1554537,15.015816,12.793366


#### <b>Logistic regression</b>

**Classifying by age_group**

In [None]:
lr = LogisticRegression()
p = Prediction(x2,y,lr)
p.predict(x2,y)
p.cnf(x2,y)

number of test samples : 2
number of training samples: 8
actual data:  [[0]
 [0]]
predicted data:  [0 0]


**Classifying by gender**

In [None]:
# gender
p = Prediction(x2,y_gender,lr)
p.predict(x2,y_gender)
p.cnf(x2,y_gender)

number of test samples : 2
number of training samples: 8
actual data:  [[1]
 [0]]
predicted data:  [1 1]


#### <b>Predicting age (continuous data)</b>

**Multiple linear regression**

In [None]:
lr = LinearRegression()
p = Prediction(x2,y_age,lr)
p.predict(x2,y_age)
p.score(x2,y_age)

number of test samples : 2
number of training samples: 8
actual data:  [[-0.4986]
 [-0.4027]]
predicted data:  [[ 1.38679745]
 [-0.25893323]]
The R-square of testing data:  -776.5288225177012
The R-square of training data:  0.9934719800426676
The mean square error of actual and predicted fitting data is:  2.511263182559502


The R-square of testing data:  -776.5288225177012

The R^2 of the testing data is strongly negative, while the training R^2 (0.9934719800426676) is close to 1.

=> The model is overfitting.

**Using grid search to find suitable alpha for ridge model**

In [None]:
parameters = [{'alpha': [0.001, 0.1, 1, 10, 100, 1000, 10000, 100000]}]
o = other_prediction(x2,y_age,parameters)
o.Ridge(x2,y_age)

Ridge(alpha=1)
R^2 of data: -684.0498220950029
predict data: [[ 1.27614849]
 [-0.42280739]]
actual testing data: [[-0.4986]
 [-0.4027]]


R^2 of data: -684.0498220950029
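
To illustrate what the `alpha` selected by the grid search controls, here is a minimal single-gene sketch. For one centered feature, the ridge slope has the closed form Sxy / (Sxx + alpha), so a larger alpha shrinks the slope toward zero. It uses the ST8SIA2 values and ages from the tables above; the function name `ridge_slope` is an assumption for illustration, and this is not the two-gene model fitted in the notebook.

```python
def ridge_slope(x, y, alpha):
    """Slope of a one-feature ridge fit with an unpenalized intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    return s_xy / (s_xx + alpha)

# ST8SIA2 expression and age for the 10 samples, copied from the tables above.
st8sia2 = [8.614598, 8.485263, 14.645985, 8.707785, 14.756824,
           8.39641, 15.159507, 8.971893, 14.917903, 15.015816]
age = [40.42, 41.58, -0.4986, 43.88, -0.4027,
       44.17, -0.3836, 36.5, -0.4027, -0.3836]

# The slope is negative (higher expression, lower age) and shrinks as alpha grows.
for alpha in [0, 1, 100, 10000]:
    print(alpha, ridge_slope(st8sia2, age, alpha))
```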

**Random Forest Regression**

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
lr = RandomForestRegressor(n_estimators=1)
p = Prediction(x2,y_age,lr)
p.predict(x2,y_age)
p.score(x2,y_age)

number of test samples : 2
number of training samples: 8
actual data:  [[-0.4986]
 [-0.4027]]
predicted data:  [-0.3836 -0.3836]
The R-square of testing data:  -1.8759972207754645
The R-square of training data:  0.9830837747102115
The mean square error of actual and predicted fitting data is:  5.582322500000003



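Using random forest regression with `n_estimators=1` (a single tree), the R-square of the testing data is -1.8759972207754645.

For table type 3, random forest regression does better than multiple linear regression and the ridge model, because its testing R^2 is the highest of the three. It is still negative, however, so none of the models predicts the ages of the held-out samples better than their mean would. Note that `Prediction` refits the estimator in both `predict` and `score` and no `random_state` is set, so the predictions and the R^2 printed above come from different random fits.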
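
A random forest is randomized, so fixing `random_state` is what makes a result repeatable from run to run. A minimal sketch, assuming the two-gene table and ages shown earlier and an arbitrary choice of 100 trees and `random_state=0` (neither is the notebook's setting):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# ST8SIA2 and BCL2L2 expression for the 10 samples, copied from the x2 table.
X = np.array([
    [8.614598, 14.967316], [8.485263, 14.943844], [14.645985, 12.776112],
    [8.707785, 15.151775], [14.756824, 12.71442], [8.39641, 14.999478],
    [15.159507, 12.761278], [8.971893, 15.053436], [14.917903, 12.718466],
    [15.015816, 12.793366],
])
age = np.array([40.42, 41.58, -0.4986, 43.88, -0.4027,
                44.17, -0.3836, 36.5, -0.4027, -0.3836])

X_train, X_test, y_train, y_test = train_test_split(
    X, age, test_size=0.20, random_state=0
)

def fit_and_predict(seed):
    """Fit a forest with a fixed seed and predict the held-out samples."""
    forest = RandomForestRegressor(n_estimators=100, random_state=seed)
    forest.fit(X_train, y_train)
    return forest.predict(X_test)

first = fit_and_predict(0)
second = fit_and_predict(0)
print(first, second)  # identical because the seed is fixed
```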