# Feature selection

### What is feature selection?

Feature selection is the process of choosing which features from your existing feature set to use for modeling. Because it draws only on features you already have, it differs from feature engineering, which creates new ones. The goal of feature selection is to improve your model's performance: perhaps your existing feature set is much too large, or some of the features you're working with are unnecessary.

There are different ways to perform feature selection, including automated ones. Scikit-learn provides several automated methods, such as removing features that fall below a variance threshold and scoring features with univariate statistical tests, but we won't cover those here.

Feature selection can help remove noise from your model's inputs. Perhaps you have redundant features, such as latitude and longitude alongside city and state. Or maybe you have features that are strongly statistically correlated, which breaks the assumptions of certain models and thus hurts model performance. If your feature set is large, it may be beneficial to use dimensionality reduction to combine features into a smaller set while retaining as much of the variance in the data as possible.

### Removing features manually

One of the easiest ways to tell that a feature is unnecessary is to notice that it is redundant, meaning the same information already exists in another feature. This also happens when feature engineering produces features that duplicate existing ones in some way.

For example, if your dataset contains repeated information in its feature set, it's unlikely you'll need to use each feature for modeling. You may see columns related to city, state, latitude, and longitude in the same dataset, such as in the `volunteer` dataset:

In [1]:
import pandas as pd

dir_string = "../../datasets/"

volunteer = pd.read_csv(dir_string + "volunteer.csv")
volunteer.columns

Index(['opportunity_id', 'content_id', 'vol_requests', 'event_time', 'title',
       'hits', 'summary', 'is_priority', 'category_id', 'category_desc',
       'amsl', 'amsl_unit', 'org_title', 'org_content_id', 'addresses_count',
       'locality', 'region', 'postalcode', 'primary_loc', 'display_url',
       'recurrence_type', 'hours', 'created_date', 'last_modified_date',
       'start_date_date', 'end_date_date', 'status', 'Latitude', 'Longitude',
       'Community Board', 'Community Council ', 'Census Tract', 'BIN', 'BBL',
       'NTA'],
      dtype='object')

The `volunteer` dataset contains a lot of different information related to the location of the volunteer opportunity: `Latitude`, `Longitude`, `locality`, `region`, and `postalcode`.

Dropping columns is as simple as calling pandas' `drop()` method, which we covered previously. Specifying `axis=1` ensures that we drop entire columns rather than rows:

In [2]:
volunteer_subset = volunteer.drop(["Latitude", "Longitude", "locality", "postalcode"], axis=1)

print(volunteer_subset.columns)

Index(['opportunity_id', 'content_id', 'vol_requests', 'event_time', 'title',
       'hits', 'summary', 'is_priority', 'category_id', 'category_desc',
       'amsl', 'amsl_unit', 'org_title', 'org_content_id', 'addresses_count',
       'region', 'primary_loc', 'display_url', 'recurrence_type', 'hours',
       'created_date', 'last_modified_date', 'start_date_date',
       'end_date_date', 'status', 'Community Board', 'Community Council ',
       'Census Tract', 'BIN', 'BBL', 'NTA'],
      dtype='object')


Perhaps, for your modeling task, you only need the high-level state information in the `region` field, which we didn't drop. 
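As a small aside, `drop()` also accepts a `columns=` keyword, which some readers find easier to scan than `axis=1`. The sketch below uses a made-up frame (`toy` and its values are assumptions for illustration, not rows from `volunteer`) to show that both spellings give the same result and that the original frame is left untouched.

```python
import pandas as pd

# Hypothetical stand-in for a few location columns; the values are invented.
toy = pd.DataFrame({
    "region": ["NY", "NY", "NY"],
    "locality": ["Brooklyn", "Queens", "Bronx"],
    "postalcode": [11201, 11101, 10451],
    "hits": [12, 7, 3],
})

# Two equivalent ways to drop whole columns.
by_axis = toy.drop(["locality", "postalcode"], axis=1)
by_keyword = toy.drop(columns=["locality", "postalcode"])

# drop() returns a new DataFrame, so `toy` still has all four columns.
print(by_keyword.columns.tolist())
print(toy.columns.tolist())
```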

Another situation where duplicate features occur is through feature engineering. In an earlier example, we calculated a number of aggregate statistics on running times. The columns we had were times from 5 separate runs as well as mean, median, total, and fastest time. We could likely drop the five individual run times that the aggregate statistics were computed from. Feature selection, like most aspects of the machine learning pipeline, is very dependent on the model you're trying to build.
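To make the running-times case concrete, here is a minimal sketch. The column names (`run1` through `run5`, `mean`) and every value are hypothetical, since the earlier dataset isn't reproduced in this section.

```python
import pandas as pd

# Hypothetical running times in minutes for three runners; invented values.
running = pd.DataFrame({
    "name": ["Sue", "Mark", "Sean"],
    "run1": [20.1, 16.5, 23.5],
    "run2": [18.5, 17.1, 25.1],
    "run3": [19.6, 16.9, 25.2],
    "run4": [20.3, 17.6, 24.6],
    "run5": [18.3, 17.3, 23.9],
})

run_cols = ["run1", "run2", "run3", "run4", "run5"]

# Engineer an aggregate feature from the individual runs.
running["mean"] = running[run_cols].mean(axis=1)

# Keep the aggregate and drop the columns it was derived from.
running_subset = running.drop(columns=run_cols)

print(running_subset.columns.tolist())
```

Whether the aggregate or the raw runs are the better features depends on the model, so it can be worth trying both.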

### Removing correlated features

Another clear situation in which you'd want to drop features is when they are statistically correlated, meaning they move together directionally. Linear models in particular assume that features are not strongly correlated with each other; when they are, the model's coefficient estimates can become unstable and hard to interpret.

We'll use Pearson's correlation coefficient to check a feature set for correlation. The coefficient measures the strength and direction of the linear relationship between a pair of features: a score close to 1 means they move together in the same direction, a score close to 0 means they have little or no linear relationship, and a score close to -1 means they are strongly negatively correlated, so one feature increases in value as the other decreases.
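For intuition, Pearson's coefficient is the covariance of two features divided by the product of their standard deviations. The sketch below computes it by hand and compares the result with pandas; the helper name `pearson` and the sample values are illustrative assumptions, not part of the walkthrough.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson's r: co-movement of x and y, normalized by their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

a = [1, 2, 3, 4]
b = [10, 20, 30, 40]  # a scaled by 10, so it rises with a
c = [8, 6, 4, 2]      # falls as a rises

print(pearson(a, b))                    # perfect positive relationship
print(pearson(a, c))                    # perfect negative relationship
print(pd.Series(a).corr(pd.Series(b)))  # pandas uses Pearson by default
```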

Let's take a look at this simple dataset:

In [3]:
corr_dict = {"col1": [1, 2, 3, 4], 
             "col2": [0.23, 3, 8, -9], 
             "col3": [10, 20, 30, 40], 
             "col4": [0.3, 0.2, 90.6, 0.1]}

corr_df = pd.DataFrame(corr_dict)

We can easily check correlation in pandas using the `corr()` method, which returns a matrix of pairwise correlation coefficients between features:

In [4]:
corr_df.corr()

| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| col1 | 1.0 | -0.410435 | 1.0 | 0.256485 |
| col2 | -0.410435 | 1.0 | -0.410435 | 0.696157 |
| col3 | 1.0 | -0.410435 | 1.0 | 0.256485 |
| col4 | 0.256485 | 0.696157 | 0.256485 | 1.0 |


As you can see, every column is perfectly correlated with itself. However, we can also see that columns 1 and 3 have a perfect correlation as well! In the toy dataset above, column 3 is simply column 1 multiplied by 10, so this makes sense. We'd definitely want to remove one of those columns.

Correlation isn't usually as clear-cut as this. You can also see that columns 2 and 4 have a relatively high correlation at around 0.7. Since there are only four features in this dataset, perhaps we wouldn't remove one of those columns. This is where iteration is important: you might be able to pick the perfect feature set on the first try, or you might need to experiment with a few different configurations.
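When there are too many features to inspect by eye, one common heuristic is to scan the upper triangle of the absolute correlation matrix and flag one column from each pair above a cutoff. The sketch below applies that idea to the toy data; the `threshold` of 0.95 is an assumed value, and this approach is offered as an illustration rather than as a method from the walkthrough.

```python
import numpy as np
import pandas as pd

corr_df = pd.DataFrame({
    "col1": [1, 2, 3, 4],
    "col2": [0.23, 3, 8, -9],
    "col3": [10, 20, 30, 40],
    "col4": [0.3, 0.2, 90.6, 0.1],
})

threshold = 0.95  # assumed cutoff; tune it for your own data

# Absolute correlations, so strong negative relationships count too.
abs_corr = corr_df.corr().abs()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal (self-correlation) is ignored.
mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
upper = abs_corr.where(mask)

# Flag the later column of any pair whose correlation exceeds the cutoff.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced = corr_df.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

Note that this keeps whichever column comes first in each pair, which is an arbitrary choice; domain knowledge may suggest keeping the other one.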

### Your turn!

Let's take a look at the `wine` dataset, which is made up of continuous, numerical features of various characteristics of wine. Read in the file and print out the `head()`:

In [5]:
wine = pd.read_csv(dir_string + "wine_types.csv")

wine.head()

| | Type | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


Let's use a subset of the `wine` dataset so we can see the output of `corr()` more easily:

In [6]:
wine_subset = wine[["Flavanoids", "Total phenols", "Malic acid", "OD280/OD315 of diluted wines", "Hue"]]

wine_subset.corr()

| | Flavanoids | Total phenols | Malic acid | OD280/OD315 of diluted wines | Hue |
|---|---|---|---|---|---|
| Flavanoids | 1.0 | 0.864564 | -0.411007 | 0.787194 | 0.543479 |
| Total phenols | 0.864564 | 1.0 | -0.335167 | 0.699949 | 0.433681 |
| Malic acid | -0.411007 | -0.335167 | 1.0 | -0.36871 | -0.561296 |
| OD280/OD315 of diluted wines | 0.787194 | 0.699949 | -0.36871 | 1.0 | 0.565468 |
| Hue | 0.543479 | 0.433681 | -0.561296 | 0.565468 | 1.0 |


Looking at the `Flavanoids` row of the correlation matrix, you can see that it's strongly positively correlated with both `Total phenols` and `OD280/OD315 of diluted wines`. Let's drop it from that subset, and then look at the correlation again:

In [7]:
wine_subset = wine_subset.drop("Flavanoids", axis=1)

wine_subset.corr()

| | Total phenols | Malic acid | OD280/OD315 of diluted wines | Hue |
|---|---|---|---|---|
| Total phenols | 1.0 | -0.335167 | 0.699949 | 0.433681 |
| Malic acid | -0.335167 | 1.0 | -0.36871 | -0.561296 |
| OD280/OD315 of diluted wines | 0.699949 | -0.36871 | 1.0 | 0.565468 |
| Hue | 0.433681 | -0.561296 | 0.565468 | 1.0 |


Another feature you might consider dropping is `OD280/OD315 of diluted wines`, which is strongly positively correlated with `Total phenols` and moderately positively correlated with `Hue`. This is a situation where you'd want to iterate on your chosen features and see how they impact your model's performance.
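One way to iterate is to score a model on several candidate feature subsets with cross-validation and compare the results. The sketch below is only an illustration: it uses synthetic data from `make_classification` rather than the `wine` file, and the feature names `f0` to `f4` and the candidate subsets are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the real wine features are not used here.
X_arr, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=2, random_state=0
)
X = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3", "f4"])

# Hypothetical candidate feature sets to compare.
candidates = {
    "all features": ["f0", "f1", "f2", "f3", "f4"],
    "without f4": ["f0", "f1", "f2", "f3"],
    "without f3 and f4": ["f0", "f1", "f2"],
}

scores = {}
for label, cols in candidates.items():
    # Mean accuracy over 5 cross-validation folds for this subset.
    scores[label] = cross_val_score(KNeighborsClassifier(), X[cols], y, cv=5).mean()

for label, score in scores.items():
    print(f"{label}: {score:.3f}")
```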

### Dimensionality reduction for feature selection

A less manual way of reducing the size of your feature set is through dimensionality reduction. Dimensionality reduction is a form of unsupervised learning that transforms your data in a way that shrinks the number of features in your feature space. 

The method of dimensionality reduction we'll cover is principal component analysis, or PCA. PCA uses a linear transformation to project features into a new space whose axes, called components, are uncorrelated with each other. Each component is a combination of the original features, and the components are ordered so that the first captures as much of the variance in the dataset as possible, the second captures as much of the remaining variance as possible, and so on. As a result, a few leading components can often stand in for many original features. In terms of feature selection, it can be a useful method when you have a large number of features and no strong candidates for elimination.
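The claim that components are uncorrelated can be checked directly. The sketch below builds a small synthetic dataset (an assumption for illustration, not the `wine` data) in which two inputs are strongly correlated, then compares the correlation matrix before and after the PCA transformation.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: the second feature is mostly a copy of the first,
# so those two inputs are strongly correlated.
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

components = PCA().fit_transform(X)

# Correlation matrices before and after the transformation.
corr_before = np.corrcoef(X, rowvar=False)
corr_after = np.corrcoef(components, rowvar=False)

print(np.round(corr_before, 2))
print(np.round(corr_after, 2))  # off-diagonal entries are numerically zero
```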

Transforming a dataset through PCA is relatively straightforward in scikit-learn. Let's apply PCA to the `wine` dataset. 

First, we'll set up our data for modeling by removing the label column, `Type`, from the `wine` dataset:

In [8]:
wine_X = wine.drop(["Type"], axis=1)

Next, we'll apply `PCA` to `wine_X` using `fit_transform()`, which transforms the data into components:

In [9]:
from sklearn.decomposition import PCA

pca = PCA()
transformed_X = pca.fit_transform(wine_X)

By default, PCA in scikit-learn keeps as many components as there are input features (more precisely, the smaller of the number of samples and the number of features). If we print out the `explained_variance_ratio_`, we can see the fraction of the total variance explained by each component:

In [10]:
pca.explained_variance_ratio_

array([9.98091230e-01, 1.73591562e-03, 9.49589576e-05, 5.02173562e-05,
       1.23636847e-05, 8.46213034e-06, 2.80681456e-06, 1.52308053e-06,
       1.12783044e-06, 7.21415811e-07, 3.78060267e-07, 2.12013755e-07,
       8.25392788e-08])

You can see that almost all of the variance, around 99.8%, is explained by the first component, so we could likely drop the remaining components, which explain very little variance.
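A common way to decide how many components to keep is to look at the cumulative explained variance and stop once it reaches a target share. The sketch below uses synthetic data and an assumed target of 95%; it also shows that passing a float between 0 and 1 as `n_components` asks scikit-learn to pick the count for you.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 6 features driven by 2 underlying signals plus a little noise.
signals = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X = signals @ mixing + 0.05 * rng.normal(size=(300, 6))

pca = PCA().fit(X)

# Running total of explained variance, component by component.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the target.
target = 0.95  # assumed target; pick what suits your task
n_keep = int(np.argmax(cumulative >= target)) + 1

# A float n_components tells scikit-learn to choose the count itself.
pca_95 = PCA(n_components=target).fit(X)

print(np.round(cumulative, 4))
print(n_keep, pca_95.n_components_)
```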

There are a couple of things to note regarding PCA. The first is that it can be very difficult to interpret PCA components beyond which components explain the most variance. Because each component mixes the original features together, PCA is more of a black box than dropping columns by hand. The other thing to note is that PCA is a good step to do at the end of your preprocessing journey, because of the way the data gets transformed and reshaped. It can be difficult to do much feature work post-PCA, other than eliminating components that aren't useful in explaining variance.
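One preprocessing step that often comes before PCA is standardization, because PCA is sensitive to the scale of each feature. In the `wine` rows printed earlier, `Proline` runs into the hundreds and thousands while most other columns are far smaller, so scale may be contributing to the 99% figure above; that is a guess the sketch below does not verify. It uses synthetic data to show how one large-scale feature dominates unscaled PCA, and how a pipeline can standardize first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: three independent features, one on a much larger scale.
X = np.column_stack([
    rng.normal(loc=0, scale=1, size=200),
    rng.normal(loc=0, scale=1, size=200),
    rng.normal(loc=0, scale=1000, size=200),
])

# PCA on the raw values: the large-scale feature dominates the first component.
raw_ratio = PCA().fit(X).explained_variance_ratio_

# Standardize first, then apply PCA as the final preprocessing step.
scaled_pca = make_pipeline(StandardScaler(), PCA())
scaled_pca.fit(X)
scaled_ratio = scaled_pca.named_steps["pca"].explained_variance_ratio_

print(np.round(raw_ratio, 4))
print(np.round(scaled_ratio, 4))
```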

### Your turn: training a model using PCA

Let's take a look at how `knn` performs using PCA components as the `X` data. First, let's create our `y` labels out of the `Type` column as well as split the PCA-transformed data. We can simply pass in the `transformed_X` data we created in the previous step to `train_test_split()` to create training and test data. 

In [11]:
from sklearn.model_selection import train_test_split

y = wine["Type"]
X_pca_train, X_pca_test, y_pca_train, y_pca_test = train_test_split(transformed_X, y)

Next, we'll create the `knn` classifier:

In [12]:
from sklearn.neighbors import KNeighborsClassifier

knn_pca = KNeighborsClassifier()

Let's fit the model to the training data and print out its score on the test data:

In [13]:
knn_pca.fit(X_pca_train, y_pca_train)

knn_pca.score(X_pca_test, y_pca_test)

0.7555555555555555
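Two caveats about the score above. It will vary from run to run, because `train_test_split()` shuffles randomly unless `random_state` is set. Also, the walkthrough fits PCA on all rows before splitting; a common alternative is to split first and let a pipeline fit PCA on the training rows only, so the test rows play no part in the transformation. The sketch below illustrates both ideas on synthetic data; the dataset, the `random_state` values, and the choice of four components are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a labeled dataset; not the wine data.
X, y = make_classification(n_samples=200, n_features=8, n_informative=4, random_state=0)

# Fixing random_state makes the split, and therefore the score, repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The pipeline fits PCA on the training rows only, then passes the components to knn.
model = make_pipeline(PCA(n_components=4), KNeighborsClassifier())
model.fit(X_train, y_train)

score = model.score(X_test, y_test)
print(score)
```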

Some questions to think about: What other improvements could you make to the `wine` dataset? Would you use PCA? Why or why not?