## Categorical Input and Continuous Output
Student's t-test is typically used to check whether two samples were drawn from the same population, and ANOVA is used when more than two groups are compared. Both techniques can also be adapted for feature selection.


## a. Student's t-test for Feature Selection

In a binary classification problem, the t-test can be used to select features. The idea is that a large absolute t-statistic with a small p-value provides evidence that a feature's mean differs between the two classes, so the feature may have enough discriminative power to be included in the classification model.

- Null Hypothesis: There is no significant difference between the means of two groups.
- Alternate Hypothesis: There is a significant difference between the means of two groups.
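To make the hypotheses concrete, here is a minimal sketch of the pooled (equal-variance) two-sample t-statistic computed by hand. The function name `pooled_t_statistic` and the two tiny samples are made up for illustration and are not part of the breast cancer data. This is the same statistic that `stats.ttest_ind` reports with its default `equal_var=True`.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Return (t, degrees of freedom) for the equal-variance two-sample t-test."""
    n_a, n_b = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (divides by n - 1)
    var_b = statistics.variance(b)
    # Pool the two sample variances, weighting each by its degrees of freedom
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t, n_a + n_b - 2


# Hypothetical toy samples, chosen so the arithmetic is easy to follow
t_stat, dof = pooled_t_statistic([2.0, 4.0, 6.0], [5.0, 7.0, 9.0])
print(round(t_stat, 4), dof)  # -1.8371 4
```

The further the t-statistic is from zero for a given number of degrees of freedom, the smaller the p-value and the stronger the evidence against the null hypothesis.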

### About the data:
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.


In [14]:
import pandas as pd
from scipy import stats
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.model_selection import train_test_split

### Attribute information
1) ID number<br>
2) Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)<br>
b) texture (standard deviation of gray-scale values)<br>
c) perimeter<br>
d) area<br>
e) smoothness (local variation in radius lengths)<br>
f) compactness (perimeter^2 / area - 1.0)<br>
g) concavity (severity of concave portions of the contour)<br>
h) concave points (number of concave portions of the contour)<br>
i) symmetry<br>
j) fractal dimension ("coastline approximation" - 1)

In [15]:
df = pd.read_csv('data.csv')
df.drop(['id','Unnamed: 32'],axis = 1,inplace = True)
df.columns

Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

In [23]:
# Encoding Malignant = 0 and Benign = 1
alter = {'B' : 1,'M' : 0}
df['diagnosis'] = df['diagnosis'].map(alter)
df.shape

(569, 31)

Selecting features whose p-value is < 0.05

In [17]:
# Keep the features whose class means differ significantly (p-value < 0.05)
new_features = []
for x in df.columns[1:]:
    benign = df.loc[df.diagnosis == 1, x]
    malignant = df.loc[df.diagnosis == 0, x]
    pvalue = stats.ttest_ind(benign, malignant)[1]
    if pvalue < 0.05:
        new_features.append(x)
new_df = df[new_features]

# Features that were dropped, excluding the target column 'diagnosis'
dropped = [c for c in df.columns[1:] if c not in new_features]
print('Columns whose p-value was >= 0.05 are:\n', dropped)

Columns whose p-value was >= 0.05 are:
 ['fractal_dimension_mean', 'texture_se', 'smoothness_se', 'symmetry_se', 'fractal_dimension_se']
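The filtering loop above can be wrapped in a reusable function. The sketch below is one possible way to do that, not the notebook's own method: the name `select_by_ttest`, its parameters, and the synthetic `toy` data are all assumptions made for illustration. It also exposes the `equal_var` argument of `stats.ttest_ind`, so that Welch's t-test can be used when the two classes may have different variances.

```python
import numpy as np
import pandas as pd
from scipy import stats


def select_by_ttest(frame, target, alpha=0.05, equal_var=True):
    """Return the feature columns whose two class means differ at level alpha."""
    classes = frame[target].unique()
    if len(classes) != 2:
        raise ValueError("the t-test filter needs exactly two classes")
    first = frame[frame[target] == classes[0]]
    second = frame[frame[target] == classes[1]]
    selected = []
    for column in frame.columns.drop(target):
        result = stats.ttest_ind(first[column], second[column], equal_var=equal_var)
        if result.pvalue < alpha:
            selected.append(column)
    return selected


# Synthetic stand-in data (not the breast cancer data):
# 'shifted' has a different mean in each class, 'noise' does not
rng = np.random.default_rng(0)
label = np.repeat([0, 1], 100)
toy = pd.DataFrame({
    "label": label,
    "shifted": rng.normal(loc=3.0 * label, scale=1.0),
    "noise": rng.normal(size=200),
})

# equal_var=False switches to Welch's t-test
selected = select_by_ttest(toy, "label", equal_var=False)
print(selected)
```

Because every feature is tested separately, some uninformative features can pass the 0.05 threshold by chance when the number of features is large.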


## b. Using ANOVA F-test
Analysis of Variance (ANOVA) is a statistical method used to check whether the means of two or more groups differ significantly from each other.
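With exactly two groups, the one-way ANOVA F-statistic equals the square of the equal-variance t-statistic, and the two tests give the same p-value. The short sketch below checks this with `stats.ttest_ind` and `stats.f_oneway` on two made-up samples (the values are hypothetical and unrelated to the breast cancer data).

```python
from scipy import stats

# Hypothetical toy samples
group_a = [2.0, 4.0, 6.0]
group_b = [5.0, 7.0, 9.0]

# Equal-variance t-test and one-way ANOVA on the same two groups
t_stat, t_pvalue = stats.ttest_ind(group_a, group_b)
f_stat, f_pvalue = stats.f_oneway(group_a, group_b)

# Both values are approximately 3.375 for these samples
print(t_stat ** 2, f_stat)
print(t_pvalue, f_pvalue)
```

This is why the t-test filter and the F-test filter rank features identically on a binary target; ANOVA becomes the more general tool once there are three or more classes.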

The scikit-learn machine learning library provides an implementation of the ANOVA F-test in the f_classif() function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (largest F-scores) via the SelectKBest class.

In [21]:
# Split into input (X) and output (y) variables
X = df.iloc[:, 1:]
y = df.iloc[:, 0]  # 1-D Series; a one-column DataFrame triggers a DataConversionWarning

#Split into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

select = SelectKBest(score_func=f_classif, k=8)
new = select.fit_transform(X_train,y_train)

#printing the features that have been selected using get_support()
cols = select.get_support(indices=True)

# Printing the scores of the selected columns
# (scores_ holds one score per input feature, so index it by column position)
for col in cols:
    print('Feature %d: %f' % (col, select.scores_[col]))

Feature 0: 490.196564
Feature 2: 519.087945
Feature 3: 416.020744
Feature 7: 568.580445
Feature 20: ...
Feature 22: ...
Feature 23: ...
Feature 27: ...




In [22]:
# Creating a new dataframe with the selected columns
# (cols indexes the columns of X, not df, because df still contains 'diagnosis')
features_df_new = X.iloc[:, cols]
features_df_new.columns
features_df_new.shape

(569, 8)
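Once the selector has been fitted on the training split, the same fitted object can be used to reduce the test split and to look up the names and scores of the selected features. The sketch below illustrates this under one assumption: it uses scikit-learn's bundled copy of the dataset (`load_breast_cancer`) instead of `data.csv`, so the column names differ (for example `mean radius` rather than `radius_mean`). In the bundled copy the target is encoded 0 = malignant and 1 = benign.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy of the data stands in for data.csv
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1
)

# Fit on the training split only, then apply the same selection to both splits
selector = SelectKBest(score_func=f_classif, k=8).fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Boolean mask -> names of the selected columns, each paired with its own F-score
mask = selector.get_support()
selected_names = list(X_train.columns[mask])
for name, score in zip(selected_names, selector.scores_[mask]):
    print('%s: %f' % (name, score))
```

Fitting the selector on the training data alone keeps information from the test set out of the feature selection step.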