https://www.analyticsindiamag.com/what-are-feature-selection-techniques-in-machine-learning/

* binarization
* scaling
* normalization
* mean removal
etc.

## Binarization

In [3]:
from sklearn import preprocessing
import numpy as np 

data = np.array([[2.2, 5.9, -1.8], [5.4, -3.2, -5.1], [-1.9, 4.2, 3.2]])

# Values strictly above the threshold become 1, all others become 0
bindata = preprocessing.Binarizer(threshold=1.5).transform(data)
print('Binarized data:\n\n', bindata)

Binarized data:

 [[1. 1. 0.]
 [1. 0. 0.]
 [0. 1. 1.]]
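As a rough cross-check, the same result can be reproduced with a plain NumPy comparison. This is only a sketch of what `Binarizer` does for this threshold, not its actual implementation; the name `manual_bindata` is made up for illustration.

```python
import numpy as np

data = np.array([[2.2, 5.9, -1.8], [5.4, -3.2, -5.1], [-1.9, 4.2, 3.2]])
threshold = 1.5

# Strictly greater than the threshold -> 1.0, everything else -> 0.0
manual_bindata = (data > threshold).astype(float)
print(manual_bindata)
```

The printed matrix should match the `Binarized data` output above.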


## Mean Removal

In [2]:
print('Mean (before)= ', data.mean(axis=0))
print('Standard Deviation (before)= ', data.std(axis=0))

# Center each column to mean 0 and scale it to unit standard deviation
scaled_data = preprocessing.scale(data)

print('Mean (after)= ', scaled_data.mean(axis=0))
print('Standard Deviation (after)= ', scaled_data.std(axis=0))


Mean (before)=  [ 1.9         2.3        -1.23333333]
Standard Deviation (before)=  [2.98775278 3.95052739 3.41207008]
Mean (after)=  [0.00000000e+00 0.00000000e+00 7.40148683e-17]
Standard Deviation (after)=  [1. 1. 1.]


## Scaling

* Feature values should be put on a comparable scale before most machine learning processes, and especially before clustering.
* This is because each observation's feature values are treated as coordinates in n-dimensional space (n is the number of features), and the distances between these coordinates are then calculated.
* If the coordinates are not normalized, features with larger numeric ranges dominate the distance, which can lead to misleading results.

For example, suppose you have the height and weight of three people: A (6 ft, 75 kg), B (6 ft, 77 kg), C (8 ft, 75 kg). If you plot these features in a two-dimensional coordinate system (height, weight) and calculate the Euclidean distance between the points, the distances between the following pairs are:

**A-B : 2 units**

**A-C : 2 units**

The distance metric says that the pairs A-B and A-C are equally similar, but in reality they clearly are not. A 2 kg difference in weight is minor, while a 2 ft difference in height is large, so the pair A-B is more similar than the pair A-C. Hence it is important to scale these values first and then calculate the distance.
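The example can be checked numerically. The sketch below computes the raw distances and then rescales both features with min-max scaling. The population ranges used for scaling (4 to 8 ft, 40 to 120 kg) are hypothetical values chosen only for illustration; they do not come from the text.

```python
import numpy as np

# Height (ft) and weight (kg) for the three people in the example
people = {
    "A": np.array([6.0, 75.0]),
    "B": np.array([6.0, 77.0]),
    "C": np.array([8.0, 75.0]),
}

def euclidean(p, q):
    """Euclidean distance between two points."""
    return float(np.linalg.norm(p - q))

raw_ab = euclidean(people["A"], people["B"])
raw_ac = euclidean(people["A"], people["C"])

# Hypothetical feature ranges (assumption): height 4-8 ft, weight 40-120 kg
lo = np.array([4.0, 40.0])
hi = np.array([8.0, 120.0])
scaled = {name: (v - lo) / (hi - lo) for name, v in people.items()}

scaled_ab = euclidean(scaled["A"], scaled["B"])
scaled_ac = euclidean(scaled["A"], scaled["C"])

print("raw:   ", raw_ab, raw_ac)
print("scaled:", scaled_ab, scaled_ac)
```

Under these assumed ranges, both raw distances are 2 units, while after scaling A-B comes out far smaller than A-C (roughly 0.025 against 0.5), which agrees with the intuition that A and B are the more similar pair.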

There are various ways to normalize the feature values. One option is to rescale each feature's values x(i) to the range [0, 1] (known as min-max normalization) by applying the following transformation:

**min-max normalization**  
`x(s) = (x(i) − min(x)) / (max(x) − min(x))`

Another type of scaling can be achieved via the following transformation:

**standardization (standard scaler)**  
`x(s) = (x(i) − mean(x)) / sd(x)`

where sd(x) is the standard deviation of the feature values. This ensures that the distribution of feature values has a mean of 0 and a standard deviation of 1. You can achieve this via the scale() function in R, or with `preprocessing.scale` in scikit-learn as in the Mean Removal section above.
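Both transformations can be written out directly in NumPy, which makes the per-feature (column-wise) nature of the formulas explicit. This is a minimal sketch on the same `data` array used earlier; the variable names are illustrative. It uses NumPy's default population standard deviation, which matches the `Standard Deviation (before)` values printed in the Mean Removal section; R's `scale()` divides by the sample standard deviation (n − 1) instead, so its numbers would differ slightly.

```python
import numpy as np

data = np.array([[2.2, 5.9, -1.8], [5.4, -3.2, -5.1], [-1.9, 4.2, 3.2]])

# Min-max normalization, applied column by column (axis=0)
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual_minmax = (data - col_min) / (col_max - col_min)

# Standardization: subtract the column mean, divide by the column std (ddof=0)
manual_standard = (data - data.mean(axis=0)) / data.std(axis=0)

print(manual_minmax)
print(manual_standard.mean(axis=0), manual_standard.std(axis=0))
```

Every column of `manual_minmax` spans exactly 0 to 1, and every column of `manual_standard` has mean approximately 0 and standard deviation 1.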

StandardScaler => features with mean = 0 and variance = 1  
MinMaxScaler => features in a 0 to 1 range  
Normalizer => each feature vector (row) scaled to Euclidean length = 1  

In [5]:
print(data)

minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
data_minmax = minmax_scaler.fit_transform(data)
print('MinMaxScaler applied on the data:\n', data_minmax)

[[ 2.2  5.9 -1.8]
 [ 5.4 -3.2 -5.1]
 [-1.9  4.2  3.2]]
MinMaxScaler applied on the data:
 [[0.56164384 1.         0.39759036]
 [1.         0.         0.        ]
 [0.         0.81318681 1.        ]]


## Normalization

Normalization brings the values of each feature vector (row) onto a common scale.  

L1 - Least Absolute Deviations - sum of absolute values (on each row) = 1; it is less sensitive to outliers  
L2 - Least Squares - sum of squares (on each row) = 1; it takes outliers into consideration during training  

In [6]:
# Scale each row (sample) so that its L1 or L2 norm equals 1
data_l1 = preprocessing.normalize(data, norm='l1')
data_l2 = preprocessing.normalize(data, norm='l2')

print('L1-normalized data:\n', data_l1)
print('\nL2-normalized data:\n', data_l2)

L1-normalized data:
 [[ 0.22222222  0.5959596  -0.18181818]
 [ 0.39416058 -0.23357664 -0.37226277]
 [-0.20430108  0.4516129   0.34408602]]

L2-normalized data:
 [[ 0.3359268   0.90089461 -0.2748492 ]
 [ 0.6676851  -0.39566524 -0.63059148]
 [-0.33858465  0.74845029  0.57024784]]


In [7]:
0.39416058+0.23357664+0.37226277

0.9999999900000001

In [8]:
0.3359268**2+0.90089461**2+(-0.2748492)**2

0.9999999960259321
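The two spot checks above can be extended to every row at once. The sketch below recomputes both normalizations by hand in NumPy (the variable names are illustrative) and confirms that each row's absolute values sum to 1 under L1 and each row's squares sum to 1 under L2.

```python
import numpy as np

data = np.array([[2.2, 5.9, -1.8], [5.4, -3.2, -5.1], [-1.9, 4.2, 3.2]])

# L1: divide each row by the sum of its absolute values
manual_l1 = data / np.abs(data).sum(axis=1, keepdims=True)

# L2: divide each row by the square root of the sum of its squares
manual_l2 = data / np.sqrt((data ** 2).sum(axis=1, keepdims=True))

row_abs_sums = np.abs(manual_l1).sum(axis=1)
row_sq_sums = (manual_l2 ** 2).sum(axis=1)

print(row_abs_sums)
print(row_sq_sums)
```

Both printed vectors should be all ones up to floating-point rounding. The small deviations from 1 in the manual checks above come from using the rounded printed values.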

## One Hot Encoding (dummy variables)

* used on categorical variables
* replaces a categorical variable/feature with one or more new features that take the values 0 or 1
* increases the number of columns, and with it the data burden
* lets algorithms that expect numeric input work with categorical data, which makes the process more efficient
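Before moving to the full dataset, a tiny example may make the mechanics easier to see. The frame `toy` below is hypothetical (three made-up rows with one numeric and one categorical column); `.astype(int)` is applied only so that the dummy columns print as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

# The numeric column is kept as-is; the categorical column becomes
# one 0/1 column per distinct category
toy_dummies = pd.get_dummies(toy).astype(int)
print(toy_dummies)
```

One categorical column with two distinct values becomes two indicator columns, `workclass_Private` and `workclass_State-gov`, which illustrates both the 0/1 encoding and the growth in column count.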

In [13]:
import pandas as pd
from IPython.display import display

columns = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'gender',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           'income']
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
# adult.data has no header row, so the column names are supplied explicitly
data = pd.read_csv(url, header=None, index_col=False, names=columns)
data.head()


| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [14]:
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']]
display(data)

| | age | workclass | education | gender | hours-per-week | occupation | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |
| 5 | 37 | Private | Masters | Female | 40 | Exec-managerial | <=50K |
| 6 | 49 | Private | 9th | Female | 16 | Other-service | <=50K |
| 7 | 52 | Self-emp-not-inc | HS-grad | Male | 45 | Exec-managerial | >50K |
| 8 | 31 | Private | Masters | Female | 50 | Prof-specialty | >50K |
| 9 | 42 | Private | Bachelors | Male | 40 | Exec-managerial | >50K |


In [15]:
print('Original Features:\n', list(data.columns), '\n')
data_dummies = pd.get_dummies(data)
print('Features after One-Hot Encoding:\n', list(data_dummies.columns))

Original Features:
 ['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income'] 

Features after One-Hot Encoding:
 ['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine
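The generated names such as `workclass_ ?` contain a space after the underscore, which suggests that the string values in the raw file carry a leading space (fields separated by a comma plus a space). If that is unwanted, `pd.read_csv` accepts `skipinitialspace=True`. The sketch below demonstrates the effect on a small inline sample that imitates that layout; the sample rows and the trimmed column list are assumptions made for illustration, not a download of the real file.

```python
import io
import pandas as pd

# Two rows imitating a ", "-separated layout (assumed), with trimmed columns
raw = "39, State-gov, Bachelors\n50, Self-emp-not-inc, Bachelors\n"
cols = ["age", "workclass", "education"]

# Default parsing keeps the space that follows each comma
default = pd.read_csv(io.StringIO(raw), header=None, names=cols)

# skipinitialspace=True drops whitespace right after the delimiter
trimmed = pd.read_csv(io.StringIO(raw), header=None, names=cols,
                      skipinitialspace=True)

print(repr(default["workclass"][0]))
print(repr(trimmed["workclass"][0]))
```

Note that applying this to the notebook would change the dummy column names (for example `income_>50K` instead of `income_ >50K`), so any later cell that refers to those names would need to be updated accordingly.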

In [18]:
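# Features: every column from 'age' through 'occupation_ Transport-moving'
# (relies on column order; the income dummy columns that follow are left out)
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
X = features.values
# Target: the one-hot column marking income above 50K
y = data_dummies['income_ >50K'].values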

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

print('Logistic Regression score on the test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Logistic Regression score on the test set: 0.81


<span style="color:red; font-family:Comic Sans MS"> **Code above pulled from:** </span>  
<a href="https://github.com/CristiVlad25/ml-sklearn" target="_blank">https://github.com/CristiVlad25/ml-sklearn</a>