### Paper Choice and Background Outline

- In my project, I will implement the algorithm developed by Dennis D. Boos, Leonard A. Stefanski and Yujun Wu in their article "Fast FSR Variable Selection with Applications to Clinical Trials".

- Many variable selection procedures have been developed in the literature for linear regression models. This paper proposed an updated version of False Selection Rate (FSR) method to control variable selection without simulation. By adding a number of phony variables to the real set of data and monitoring the proportion of the phony variables falsely selected as a function of the tuning parameter, like α-to-enter of forward selection, FSR is able to estimate the appropriate tuning parameter and control the model false selection rate, selecting informative variables and preventing uninformative ones from being selected. Fast FSR in this paper allows us to estimate the tuning parameter from the summary table of the forward selection variable sequence. Therefore, to achieve the same result, no phony variable generation is required in the Fast FSR.

### Pseudocode

- Step 1: Use forward selection to generate the sequence of variables and the associated p-values.

- Step 2: Monotonize the p-value of the original sequence by carrying the larger p-value forward until a even larger p-value. Denote the monotonized p-value sequence with
$$
\tilde{p_1}\leq\tilde{p_2}\leq\cdots\leq\tilde{p_k}\\
$$

- Step 3: For each variable $x_i$ in the selection sequence, calculate the associated

$$
\hat{\alpha_i} = \frac{\gamma(1+S_i)}{k-S_i}  \\
$$

, where $\gamma$ is the pre-determined average selection rate of uninformative variables in the model. $S_i$ is the model size associated with the variables in the sequence. 

- Step 4: Compare $\tilde{p_i}$ and $\hat{\alpha_i}$. Select the model of size $j$, where $j = max\{i: \tilde{p_i}\leq\hat{\alpha_i}\}$. Also, return the corresponding $\hat{\alpha_i}$.

### Draft of Unit Tests 

- Test the algorithm with data from Mangold, Bean, Adams (2003), Journal Of Higher Education, p. 540-562, "The Impact of Intercollegiate Athletics on Graduation Rates Among Major NCAA Division I Universities." Compare my output with the result by Dennis D. Boos and Leonard A. Stefanski, which is saved in http://www4.stat.ncsu.edu/~boos/var.select/fsr.fast.ncaa.ex.txt

### Install "leaps" Package

In [1]:
%load_ext rpy2.ipython
from rpy2.robjects.packages import importr
p1=importr('leaps')
p2=importr('stats')

### Initial Python Code

In [51]:
from __future__ import division
import os
import sys
import glob
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as stats
from sklearn import datasets, linear_model
%matplotlib inline
%precision 4
plt.style.use('ggplot')

In [59]:
def fsr_fast(x,y,gam0=.05,digits=4):
    
    m = x.shape[1]
    n = x.shape[0]
    if(m >= n):
        m1 = n-5  
    else:
        m1 = m    
    
    #as.matrix(x)->x                      # in case x is a data frame
    
    pvm = np.zeros(m1)                      # to create pvm below
    
    out_x = p1.regsubsets(x,y,method="forward")
    
    rss = out_x[9]
    nn = x.shape[0]
    vorder = out_x[7]
    
    q = [(rss[i]-rss[i+1])*(nn-i-2)/rss[i+1] for i in range(len(rss)-1)]
    orig = [1-stats.f.cdf(q[i],1,nn-i-2) for i in range(len(rss)-1)]
    
   
    for i in range(0,m1):
        pvm[i] = max(orig[0:i+1])  # sequential max of pvalues
   
    S = np.arange(1,m1+1)
    alpha = gam0*(1+S)/(m1-S)
   
    
    for i in range(0,m1):
        if orig[i]>orig[i+1]:
            i = i+1
        elif pvm[i]<alpha[i] and pvm[i]<gam0:
            i = i+1
        else:
            break
        i = i-1
        
    svorder = np.array(vorder[0:i])-1
    data_x = x.ix[:,svorder]
    
    regr = linear_model.LinearRegression()

    # Train the model using the training sets
    regr.fit(data_x, y)
    
    return regr.coef_, list(data_x.columns.values)
   

In [53]:
import os   
import pandas as pd
if not os.path.exists('data.txt'):
    ! wget http://www4.stat.ncsu.edu/~boos/var.select/ncaa.data2.txt -O data.txt
data = pd.read_csv('data.txt',delim_whitespace = True).dropna()
data.head()

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,x11,x12,x13,x14,x15,x16,x17,x18,x19,y
0,13,17,9,15,28.0,0,-1.14045,3.66,4.49,3409,65.8,18,81,42.2,660000,77,100,59,1,35.0
1,28,20,32,18,18.4,18,-0.13719,2.594,3.61,7258,66.3,17,82,40.5,150555,88,94,41,25,57.0
2,32,20,20,20,34.8,18,1.55358,2.06,4.93,6405,75.0,19,71,46.5,415400,94,81,25,36,51.3
3,32,21,24,21,14.5,20,2.05712,2.887,3.876,18294,66.0,16,84,42.2,211000,93,88,26,13,41.3
4,24,20,16,20,21.8,13,-0.77082,2.565,4.96,8259,63.5,16,91,41.2,44000,90,92,32,31,65.7


In [61]:
import os   
import pandas as pd
if not os.path.exists('ATCG.txt'):
    ! wget http://www4.stat.ncsu.edu/~boos/var.select/actg.175.trt0.txt -O ATCG.txt
data = pd.read_csv('ATCG.txt',delim_whitespace = True).dropna()
data.head()

--2015-04-16 23:52:44--  http://www4.stat.ncsu.edu/~boos/var.select/actg.175.trt0.txt
Resolving www4.stat.ncsu.edu (www4.stat.ncsu.edu)... 152.1.51.52
Connecting to www4.stat.ncsu.edu (www4.stat.ncsu.edu)|152.1.51.52|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 79417 (78K) [text/plain]
Saving to: ‘ATCG.txt’


2015-04-16 23:52:44 (770 KB/s) - ‘ATCG.txt’ saved [79417/79417]



Unnamed: 0,Obs,censor,event,age,wtkg,hemo,homo,drugs,karnof,oprior,...,gender,str2,strat,symptom,cd40,cd420,cd496,r,cd80,cd820
0,1,0,1090,43,66.679,0,1,0,100,0,...,1,1,3,0,504,353,660,1,870,782
1,2,1,794,31,73.03,0,1,0,100,0,...,1,1,3,0,244,225,106,1,708,699
2,3,0,957,41,66.226,0,1,1,100,0,...,1,1,3,0,401,366,453,1,889,720
3,4,1,188,35,78.019,0,1,0,100,0,...,1,1,3,0,221,132,-1,0,221,759
4,5,1,308,40,83.009,0,1,0,100,0,...,1,1,3,1,150,90,20,1,1730,1160


In [8]:
data_x = data.ix[:,:19]
data_y = data.ix[:,19]

In [17]:
q = [(rss[i]-rss[i+1])*(nn-i-2)/rss[i+1] for i in range(len(rss)-1)]
orig = [1-stats.f.cdf(q[i],1,nn-i-2) for i in range(len(rss)-1)]


In [64]:
data_y = data.ix[:,'event']

0    1090
1     794
2     957
3     188
4     308
Name: event, dtype: int64

In [116]:
data_x = data.ix[:,np.array(['cd40','cd80','age','wtkg','karnof','hemo','homo','drugs','race','gender','str2','symptom'])]

In [117]:
data_x.ix[:,'cd40sq']=np.multiply(data_x.ix[:,'cd40'],data_x.ix[:,'cd40'])
data_x.ix[:,'cd80sq']=np.multiply(data_x.ix[:,'cd80'],data_x.ix[:,'cd80'])
data_x.ix[:,'agesq']=np.multiply(data_x.ix[:,'age'],data_x.ix[:,'age'])
data_x.ix[:,'wtkgsq']=np.multiply(data_x.ix[:,'wtkg'],data_x.ix[:,'wtkg'])
data_x.ix[:,'karnofsq']=np.multiply(data_x.ix[:,'karnof'],data_x.ix[:,'karnof'])

In [118]:
col = 0
inter = np.zeros(shape=(data_x.shape[0],66))
inter = pd.DataFrame(inter)
for i in np.arange(12):
    for j in np.arange((i+1),12):
        inter.ix[:,col]=data_x.ix[:,i]*data_x.ix[:,j]
        col = col + 1
        
inter.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,56,57,58,59,60,61,62,63,64,65
0,438480,21672,33606.216,50400,0,504,0,0,504,504,...,0,0,0,0,0,0,0,1,0,0
1,172752,7564,17819.32,24400,0,244,0,0,244,244,...,0,0,0,0,0,0,0,1,0,0
2,356489,16441,26556.626,40100,0,401,401,0,401,401,...,0,1,1,0,0,0,0,1,0,0
3,48841,7735,17242.199,22100,0,221,0,0,221,221,...,0,0,0,0,0,0,0,1,0,0
4,259500,6000,12451.35,15000,0,150,0,0,150,150,...,0,0,0,0,0,0,0,1,1,1


In [119]:
data_x = data_x.join(inter)

In [122]:
fsr_fast(data_x,data_y,gam0=.05,digits=4)

Reordering variables and trying again:


(array([ -9.8465e-05,   1.5578e-02]), [0, 1])