# Feature Selection Using Variance in Scikit-learn

This notebook shows how to remove low-variance features with `scikit-learn`.

It uses an [OpenML](https://www.openml.org/d/981) dataset with 10108 observations and 69 columns, where the task is to predict who pays for internet access.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_openml
import category_encoders as ce

from sklearn.feature_selection import VarianceThreshold

## Reading the data

The data comes from [OpenML](https://www.openml.org/d/981) and is loaded with `fetch_openml` from `sklearn.datasets`.

In [2]:
data = fetch_openml(name='kdd_internet_usage')
df = data.frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10108 entries, 0 to 10107
Data columns (total 69 columns):
 #   Column                                    Non-Null Count  Dtype   
---  ------                                    --------------  -----   
 0   Actual_Time                               10108 non-null  category
 1   Age                                       10108 non-null  category
 2   Community_Building                        10108 non-null  category
 3   Community_Membership_Family               10108 non-null  category
 4   Community_Membership_Hobbies              10108 non-null  category
 5   Community_Membership_None                 10108 non-null  category
 6   Community_Membership_Other                10108 non-null  category
 7   Community_Membership_Political            10108 non-null  category
 8   Community_Membership_Professional         10108 non-null  category
 9   Community_Membership_Religious            10108 non-null  category
 10  Community_Membership_S

Split the data into the target and the features, dropping the other `Who_Pays_for_Access_*` columns from the features.

In [3]:
target = 'Who_Pays_for_Access_Work'
y = df[target]
X_cat = data.data.drop(columns=['Who_Pays_for_Access_Dont_Know',
       'Who_Pays_for_Access_Other', 'Who_Pays_for_Access_Parents',
       'Who_Pays_for_Access_School', 'Who_Pays_for_Access_Self'])

Encode the categorical variables prior to feature selection.

In [4]:
encoder = ce.LeaveOneOutEncoder(return_df=True)
X = encoder.fit_transform(X_cat, y)
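To make the encoding step less opaque, here is a minimal pure-Python sketch of the general idea behind leave-one-out target encoding: each category label is replaced by the mean target of the other rows that share that label. The function name `leave_one_out_encode`, the toy data, and the fallback to the global mean for categories that occur only once are assumptions of this sketch; `category_encoders` may handle edge cases differently.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each row as the mean target of the *other* rows in its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, t in zip(categories, targets):
        sums[cat] += t
        counts[cat] += 1

    global_mean = sum(targets) / len(targets)

    encoded = []
    for cat, t in zip(categories, targets):
        if counts[cat] > 1:
            # Exclude the current row's own target from the category mean.
            encoded.append((sums[cat] - t) / (counts[cat] - 1))
        else:
            # Assumption of this sketch: singleton categories get the global mean.
            encoded.append(global_mean)
    return encoded


# Toy data, made up for illustration only.
toy_categories = ["a", "a", "a", "b", "b", "c"]
toy_targets = [1, 0, 1, 0, 0, 1]
toy_encoded = leave_one_out_encode(toy_categories, toy_targets)
print(toy_encoded)
```

After this kind of encoding every column is numeric, which is what a variance-based selector needs.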

## Feature Selection

### Remove low-variance features

Start with 63 features.

In [5]:
X.shape

(10108, 63)

Select those features with a variance greater than `0.0025`.

In [13]:
selector = VarianceThreshold(threshold=0.0025)
X_reduced = selector.fit_transform(X, y)
X_reduced.shape

(10108, 18)
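The rule applied here can be illustrated without the library. The sketch below assumes the selector compares each column's population variance (dividing by `n`) with the threshold and keeps only columns strictly above it. The column names and values are made up for illustration.

```python
from statistics import pvariance

# Toy columns, made up for illustration only.
toy_columns = {
    "constant": [0.0, 0.0, 0.0, 0.0, 0.0],
    "spread_out": [1.0, 2.0, 3.0, 4.0, 5.0],
    "nearly_constant": [0.50, 0.51, 0.49, 0.50, 0.52],
}

threshold = 0.0025

# Population variance (divide by n) of each column.
variances = {name: pvariance(values) for name, values in toy_columns.items()}

# Keep only the columns whose variance is strictly greater than the threshold.
kept = [name for name, var in variances.items() if var > threshold]
print(kept)
```

Variance depends on the scale of each column, so a threshold such as `0.0025` is only meaningful relative to how the features were encoded or scaled.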

The method `get_support` returns the indices of the features that were kept, which can then be mapped back to column names.

In [14]:
cols = selector.get_support(indices=True)
selected_columns = X.iloc[:,cols].columns.tolist()
selected_columns

['Actual_Time',
 'Age',
 'Community_Membership_Professional',
 'Country',
 'Education_Attainment',
 'Falsification_of_Information',
 'Household_Income',
 'How_You_Heard_About_Survey_Friend',
 'How_You_Heard_About_Survey_Mailing_List',
 'Major_Geographical_Location',
 'Major_Occupation',
 'Most_Import_Issue_Facing_the_Internet',
 'Primary_Computing_Platform',
 'Primary_Language',
 'Primary_Place_of_WWW_Access',
 'Not_Purchasing_Company_policy',
 'Web_Ordering',
 'Web_Page_Creation']
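By default `get_support()` returns a boolean mask, and with `indices=True` it returns integer positions instead. The two forms carry the same information, as this small pure-Python sketch suggests. The feature names and mask are hypothetical and stand in for a fitted selector's output.

```python
# Hypothetical feature names and support mask, standing in for a fitted selector.
feature_names = ["f0", "f1", "f2", "f3"]
support_mask = [True, False, True, False]

# Integer positions of the kept features (the form returned with indices=True).
support_indices = [i for i, keep in enumerate(support_mask) if keep]

# Either form can be mapped back to the names of the kept features.
names_from_mask = [name for name, keep in zip(feature_names, support_mask) if keep]
names_from_indices = [feature_names[i] for i in support_indices]

print(support_indices)
print(names_from_indices)
```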