# Feature Selection and Hyperparameter Tuning

This notebook uses a dataset from [Kaggle](https://www.kaggle.com/pavanraj159/predicting-a-pulsar-star) to find the best model for predicting pulsar stars. I will do this by selecting the best features and then tuning hyperparameters.

### Preprocessing

In [1]:
# read csv file into a pandas dataframe
import numpy as np
import pandas as pd

stars = pd.read_csv('HTRU_2.csv',  
                  names=["1", "2", "3", "4", "5", "6", "7", "8", "class"])

stars.head()

Unnamed: 0,1,2,3,4,5,6,7,8,class
0,140.5625,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,0
1,102.507812,58.88243,0.465318,-0.515088,1.677258,14.860146,10.576487,127.39358,0
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,0
3,136.75,57.178449,-0.068415,-0.636238,3.642977,20.95928,6.896499,53.593661,0
4,88.726562,40.672225,0.600866,1.123492,1.17893,11.46872,14.269573,252.567306,0


In [2]:
# Turn negative numbers into NaN (the chi2 score function used below requires non-negative features)
stars = stars[stars >= 0]

In [3]:
# Drop rows that contain NaN values
stars = stars.dropna()

In [4]:
# split data into train/test sets as well as feature and label dataframes

from sklearn.model_selection import train_test_split

x_train, x_test = train_test_split(stars, test_size=0.2)

y_train = x_train['class']
y_test = x_test['class']

x_train = x_train.drop(['class'], axis=1)
x_test = x_test.drop(['class'], axis=1)
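
The split above is random and unseeded, so the results change between runs, and the share of each class can differ between the train and test sets. A minimal sketch of a reproducible, stratified split is shown here. It runs on a synthetic stand-in dataframe (the `demo` names and the roughly 10% positive rate are assumptions for illustration, not values taken from HTRU_2).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the `stars` dataframe (hypothetical data, not HTRU_2)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((1000, 3)), columns=["1", "2", "3"])
demo["class"] = (rng.random(1000) < 0.1).astype(int)  # roughly 10% positives

# Separate features and labels before splitting
demo_features = demo.drop(columns=["class"])
demo_labels = demo["class"]

# stratify keeps the class ratio the same in both splits;
# random_state makes the split reproducible
demo_x_train, demo_x_test, demo_y_train, demo_y_test = train_test_split(
    demo_features, demo_labels, test_size=0.2, stratify=demo_labels, random_state=0
)

# Share of positive labels in each split
print(demo_y_train.mean(), demo_y_test.mean())
```

Passing the features and labels together returns all four pieces in one call, which avoids dropping the label column afterwards.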

### Feature selection

In this section I will use SelectKBest to find the features that best predict the labels. Training on fewer, more informative features should lead to a better model and faster training time.

In [5]:
# Here I am searching for the three best predictors out of the eight features.

from sklearn.feature_selection import SelectKBest, chi2

x_train_best = SelectKBest(score_func=chi2, k=3).fit_transform(x_train, y_train)
print(x_train_best[:10])

[[2.22069817e-01 3.16471572e+00 6.75923744e+01]
 [2.73274451e+00 3.91354515e+01 1.36942577e+00]
 [6.68161838e-01 7.00501672e+00 2.37320597e+01]
 [2.03404680e+01 1.46530100e+01 1.14409955e+01]
 [1.99916994e-01 2.93143813e+00 9.47097486e+01]
 [6.61009395e-01 2.12541806e+00 1.21746158e+02]
 [1.09159462e-01 6.89799331e-01 3.63471664e+02]
 [6.43726530e-01 3.70150502e+00 5.65603488e+01]
 [4.08896980e+00 2.22993311e+01 4.24387489e+00]
 [8.01163159e-01 2.06354515e+00 1.82066055e+02]]


In [6]:
# Print out all features to find that the three best features from above are 4, 5 and 8

with pd.option_context('display.max_rows', None, 'display.max_columns', None):  
    print(x_train.head(10))

                1          2         3          4          5          6  \
8419   107.609375  44.675083  0.184331   0.222070   3.164716  23.471511   
9037    82.218750  49.904274  1.610495   2.732745  39.135452  72.729627   
3121   118.078125  41.883322  0.125735   0.668162   7.005017  30.025435   
3243    49.007812  32.512377  3.953519  20.340468  14.653010  46.732418   
3274   126.945312  46.491987  0.151432   0.199917   2.931438  14.698940   
4653    99.320312  42.363002  0.650448   0.661009   2.125418  14.520174   
10247  130.937500  48.955616  0.005734   0.109159   0.689799  11.089101   
2503   114.609375  43.190211  0.018183   0.643727   3.701505  25.605974   
7097    88.773438  32.705890  0.758555   4.088970  22.299331  62.944904   
11926  113.062500  45.729632  0.353996   0.801163   2.063545  13.094370   

               7           8  
8419    8.099852   67.592374  
9037    1.684183    1.369426  
3121    4.748131   23.732060  
3243    3.470588   11.440995  
3274    8.144680   

In [15]:
# create a new dataframe with the three best features

new_stars = stars[['4', '5', '8', 'class']]
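
Matching printed values by eye works for a small table, but the selected column names can also be read from the fitted selector. The sketch below illustrates this with `get_support()` on hypothetical data, where column "b" is deliberately constructed to track the label. The column names and data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical non-negative data: column "b" is made to track the label
rng = np.random.default_rng(0)
y_toy = rng.integers(0, 2, size=500)
x_toy = pd.DataFrame({
    "a": rng.random(500),
    "b": rng.random(500) + 5 * y_toy,
    "c": rng.random(500),
})

# Keep the fitted selector instead of only the transformed array
selector = SelectKBest(score_func=chi2, k=1).fit(x_toy, y_toy)

# get_support() returns a boolean mask over the input columns
selected_columns = x_toy.columns[selector.get_support()].tolist()
print(selected_columns)
```

In this notebook the same idea would mean fitting the selector on x_train and indexing x_train.columns with the mask.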

### Cross Validation

Below I will use cross validation to estimate how well the model predicts pulsar stars. Cross validation splits the training data into k folds; each fold is used once for testing while the remaining folds are used for training. Averaging the scores over the folds gives a performance estimate with lower variance than a single train/test split.
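
To make the fold mechanics concrete, here is a rough sketch of the loop that a five-fold cross validation performs. It uses synthetic data from `make_classification` and a small forest so it runs quickly; the variable names and sizes are assumptions for illustration. For a classifier and an integer `cv`, scikit-learn's `cross_val_score` uses stratified folds, which `StratifiedKFold` mirrors here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (hypothetical, not the pulsar dataset)
x_cv, y_cv = make_classification(n_samples=300, n_features=5, random_state=0)

fold_scores = []
splitter = StratifiedKFold(n_splits=5)
for train_idx, val_idx in splitter.split(x_cv, y_cv):
    # Train a fresh model on four folds, score it on the held-out fold
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    model.fit(x_cv[train_idx], y_cv[train_idx])
    fold_scores.append(model.score(x_cv[val_idx], y_cv[val_idx]))

fold_scores = np.array(fold_scores)
print(fold_scores.mean(), fold_scores.std())
```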

In [18]:
# Split data into training and testing data

from sklearn.model_selection import train_test_split

x_train, x_test = train_test_split(new_stars, test_size=0.2)
y_train = x_train['class']
y_test = x_test['class']

x_train = x_train.drop(['class'], axis=1)
x_test = x_test.drop(['class'], axis=1)

In [19]:
# Standardize feature data using StandardScaler (zero mean, unit variance per feature)

from sklearn.preprocessing import StandardScaler  
feature_scaler = StandardScaler()  
x_train = feature_scaler.fit_transform(x_train)  
x_test = feature_scaler.transform(x_test)  


In [20]:
# Import the random forest classifier and create an object

from sklearn.ensemble import RandomForestClassifier  
classifier = RandomForestClassifier(n_estimators=300, random_state=0)  

In [23]:
# Use cross validation on the dataset for five folds

from sklearn.model_selection import cross_val_score  
all_accuracies = cross_val_score(estimator=classifier, X=x_train, y=y_train, cv=5) 

In [24]:
# Print accuracies of the five folds
print(all_accuracies)  

[0.95031491 0.96431071 0.96078431 0.95868347 0.96426069]


In [25]:
# Print mean of the five folds
print(all_accuracies.mean())

0.9596708172373164


In [26]:
# Print the standard deviation of the five folds
print(all_accuracies.std())

0.005144769838290858


### Grid Search

Grid search evaluates every combination of the candidate hyperparameter values with cross validation and reports the best one, so we can avoid guessing and checking.

In [28]:
# Create dictionary of parameters that are being tested

from sklearn.model_selection import GridSearchCV

grid_param = {  
    'n_estimators': [100, 300, 500, 800, 1000],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False]
}
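
It can help to know how much work a grid implies before fitting it. The short sketch below counts the combinations with `ParameterGrid`, repeating the same dictionary so that it runs on its own. The `n_folds` name is an assumption that matches the five folds used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

grid_param = {
    'n_estimators': [100, 300, 500, 800, 1000],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False]
}

# ParameterGrid expands the dictionary into every combination of values
combinations = list(ParameterGrid(grid_param))
n_folds = 5

print(len(combinations))            # 20 parameter combinations (5 * 2 * 2)
print(len(combinations) * n_folds)  # 100 cross-validation fits
```

With refit enabled, the search also trains the best combination once more on the full training data.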

In [29]:
# Create grid search object

gd_sr = GridSearchCV(estimator=classifier,  
                     param_grid=grid_param,
                     scoring='accuracy',
                     cv=5,
                     n_jobs=-1)

In [30]:
# Fit the grid search object to our training data from before

gd_sr.fit(x_train, y_train)  

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'n_estimators': [100, 300, 500, 800, 1000], 'criterion': ['gini', 'entropy'], 'bootstrap': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)

In [31]:
# Print out the results for the best model parameters
best_parameters = gd_sr.best_params_  
print(best_parameters)  

{'bootstrap': True, 'criterion': 'entropy', 'n_estimators': 100}


In [32]:
# Print best accuracy
best_result = gd_sr.best_score_  
print(best_result)

0.9605097325304579
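
The best score above is a cross-validated accuracy on the training data. A possible final step is to score the refitted best model on the held-out test set. The sketch below shows the idea on synthetic data with a deliberately small grid so it runs quickly; the data, grid values, and variable names are assumptions for illustration, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (hypothetical, not the pulsar dataset)
x_all, y_all = make_classification(n_samples=400, n_features=5, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_all, y_all, test_size=0.2, stratify=y_all, random_state=0
)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 30], 'criterion': ['gini', 'entropy']},
    scoring='accuracy',
    cv=3,
)
search.fit(x_tr, y_tr)

# With refit=True (the default), best_estimator_ is retrained on all training data
test_accuracy = search.best_estimator_.score(x_te, y_te)
print(search.best_params_, test_accuracy)
```

In this notebook the equivalent call would be gd_sr.best_estimator_.score(x_test, y_test), using the scaled x_test from earlier.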
