In [1]:
"""Working through the 'Mini-Course' from MachineLearningMastery: https://machinelearningmastery.com/data-preparation-for-machine-learning-7-day-mini-course/
This notebook will cover all lessons:
Lesson 01: Importance of Data Preparation.
Lesson 02: Fill Missing Values With Imputation.
Lesson 03: Select Features With RFE.
Lesson 04: Scale Data With Normalization.
Lesson 05: Transform Categories With One Hot Encoding.
Lesson 06: Transform Numbers to Categories With kBins.
Lesson 07: Dimensionality Reduction with PCA.
"""

In [None]:
#Lesson 01: Importance of Data Preparation.
"""Why do we need to prep the data?
Data Types: Machine learning algorithms require data to be numbers.
Data Requirements: Some machine learning algorithms impose requirements on the data.
Data Errors: Statistical noise and errors in the data may need to be corrected.
Data Complexity: Complex nonlinear relationships may be teased out of the data.

Methods of data preparation:
Data Cleaning: Identifying and correcting mistakes or errors in the data.
Feature Selection: Identifying those input variables that are most relevant to the task.
Data Transforms: Changing the scale or distribution of variables.
Feature Engineering: Deriving new variables from available data.
Dimensionality Reduction: Creating compact projections of the data.

Examples of data preparation:
1. Scaling and normalization of data.
2. Converting a gender column to a boolean column.
3. Looking for highly correlated features that may prove to be redundant or insignificant.
"""

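Examples 2 and 3 from the list above can be sketched with pandas. This is only an illustration on made-up data: the column names and the 0.95 correlation threshold are my own assumptions, not something the mini-course prescribes.

```python
import pandas as pd

# made-up data; height_in is just height_cm in different units (hypothetical columns)
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "height_cm": [160.0, 180.0, 165.0, 175.0],
    "height_in": [63.0, 70.9, 65.0, 68.9],
    "age": [30, 41, 25, 38],
})

# example 2: convert the gender column to a boolean column
df["is_female"] = df["gender"] == "F"
df = df.drop(columns="gender")

# example 3: flag pairs of numeric features whose absolute correlation is very high
# (0.95 is an arbitrary threshold chosen for this sketch)
corr = df[["height_cm", "height_in", "age"]].corr().abs()
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.95
]
print(redundant_pairs)
```

One feature from each flagged pair is a candidate for removal, since it carries almost the same information as the other.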
In [2]:
#Lesson 02: Fill Missing Values With Imputation.
"""Most scikit-learn estimators raise an error on NaN inputs, and dropping every row that has a missing value can throw away
much of the dataset. The SimpleImputer class from scikit-learn replaces each missing value (marked as NaN) with the mean of
its column. Other strategies are 'median', 'most_frequent' (the mode) and 'constant' (a user-defined fill value)."""
from numpy import isnan
from pandas import read_csv
from sklearn.impute import SimpleImputer

# load dataset, treating '?' entries as missing values (NaN)
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/horse-colic.csv'
dataframe = read_csv(url, header=None, na_values='?')

# split into input and output elements
data = dataframe.values
ix = [i for i in range(data.shape[1]) if i != 23]
X, y = data[:, ix], data[:, 23]

# print total missing
print('Missing: %d' % sum(isnan(X).flatten()))

# define imputer
imputer = SimpleImputer(strategy='mean')

# fit on the dataset
imputer.fit(X)

# transform the dataset
Xtrans = imputer.transform(X)

# print total missing
print('Missing: %d' % sum(isnan(Xtrans).flatten()))

Missing: 1605
Missing: 0
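
To see how the imputation strategies differ, here is a small sketch on a made-up array (not the horse colic data). The array values and the fill value of -1.0 are assumptions chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# tiny made-up array with one missing value per column
X_demo = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [8.0, 20.0],
])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    filled[strategy] = SimpleImputer(strategy=strategy).fit_transform(X_demo)

# 'constant' needs a fill_value; -1.0 is an arbitrary choice for this sketch
filled["constant"] = SimpleImputer(strategy="constant", fill_value=-1.0).fit_transform(X_demo)

for strategy, X_filled in filled.items():
    print(strategy, X_filled.tolist())
```

The statistics are computed per column from the non-missing values, so the first column's mean is taken over 1, 3 and 8 only.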


In [4]:
#Lesson 03: Select Features With RFE.
"""Feature selection is the process of reducing the number of input variables when developing a predictive model. Fewer 
input variables can lower the computational cost and sometimes improve results. Recursive Feature Elimination (RFE) fits an 
estimator, drops the least important feature(s), and repeats until the requested number of features remains.
"""
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# define dataset with 5 redundant features
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# define RFE using a Decision-Tree and 5 selected features.
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=5)

# fit RFE
rfe.fit(X, y)

# summarize all features
for i in range(X.shape[1]):
    print('Column: %d, Selected=%s, Rank: %d' % (i, rfe.support_[i], rfe.ranking_[i]))
    
"""This lesson seems a little lackluster. It would be nice to know when RFE is a good choice and how to pick the 
estimator and the number of features to select."""

Column: 0, Selected=False, Rank: 3
Column: 1, Selected=False, Rank: 4
Column: 2, Selected=True, Rank: 1
Column: 3, Selected=True, Rank: 1
Column: 4, Selected=True, Rank: 1
Column: 5, Selected=False, Rank: 5
Column: 6, Selected=True, Rank: 1
Column: 7, Selected=True, Rank: 1
Column: 8, Selected=True, Rank: 1
Column: 9, Selected=False, Rank: 2
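
One possible answer to the question of how many features to keep is RFECV, which runs RFE inside cross-validation and keeps the feature count with the best score. The sketch below is my own addition rather than part of the lesson; the 5-fold setting, the accuracy metric and the fixed random_state are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# same synthetic setup as the lesson, under separate names
X_rfe, y_rfe = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, random_state=1)

# let cross-validation choose the number of features instead of fixing it by hand
rfecv = RFECV(estimator=DecisionTreeClassifier(random_state=1), step=1, cv=5, scoring='accuracy')
rfecv.fit(X_rfe, y_rfe)

print('Chosen number of features: %d' % rfecv.n_features_)
for i in range(X_rfe.shape[1]):
    print('Column: %d, Selected=%s, Rank: %d' % (i, rfecv.support_[i], rfecv.ranking_[i]))
```

The estimator only needs to expose feature importances or coefficients, so a linear model could be swapped in for the decision tree.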


In [5]:
#Lesson 04: Scale Data With Normalization.
"""Algorithms that use a weighted sum of the inputs (e.g. linear regression) and algorithms that use distance measures 
(e.g. KNN) benefit from normalization of feature values. A good additional rule of thumb: if your data is near Gaussian, use 
standardization; otherwise, or when you plan on using either of the aforementioned algorithms, use normalization."""
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=5, n_redundant=0, random_state=1)

# summarize data before the transform
print(X[:3, :])

# define the scaler
trans = MinMaxScaler()

# transform the data
X_norm = trans.fit_transform(X)

# summarize data after the transform
print(X_norm[:3, :])

[[ 2.39324489 -5.77732048 -0.59062319 -2.08095322  1.04707034]
 [-0.45820294  1.94683482 -2.46471441  2.36590955 -0.73666725]
 [ 2.35162422 -1.00061698 -0.5946091   1.12531096 -0.65267587]]
[[0.77608466 0.0239289  0.48251588 0.18352101 0.59830036]
 [0.40400165 0.79590304 0.27369632 0.6331332  0.42104156]
 [0.77065362 0.50132629 0.48207176 0.5076991  0.4293882 ]]
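
To make the difference between the two scalers concrete, the sketch below reproduces MinMaxScaler by hand with (x - min) / (max - min) per column and then applies StandardScaler to the same data. The variable names are mine; the dataset call mirrors the one in the lesson.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_demo, _ = make_classification(n_samples=1000, n_features=5, n_informative=5, n_redundant=0, random_state=1)

# normalization by hand: rescale each column to the range [0, 1]
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
X_manual = (X_demo - col_min) / (col_max - col_min)
print(np.allclose(X_manual, MinMaxScaler().fit_transform(X_demo)))

# standardization: each column ends up with zero mean and unit variance
X_std = StandardScaler().fit_transform(X_demo)
print(X_std.mean(axis=0).round(3), X_std.std(axis=0).round(3))
```

In a real train/test workflow the scaler should be fit on the training split only and then applied to the test split.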


In [6]:
#Lesson 05: Transform Categories With One Hot Encoding.
"""Here we will look at dealing with categorical data, such as yes/no or red/green/blue. scikit-learn has a OneHotEncoder 
class that will work for us. The OneHotEncoder will convert a categorical column "color" into three binary columns, one per 
category, like "isgreen", "isred" and "isblue".
"""
from pandas import read_csv
from sklearn.preprocessing import OneHotEncoder

# define the location of the dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/breast-cancer.csv"

# load the dataset
dataset = read_csv(url, header=None)

# retrieve the array of data
data = dataset.values

# separate into input and output columns
X = data[:, :-1].astype(str)
y = data[:, -1].astype(str)

# summarize the raw data
print(X[:3, :])

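# define the one hot encoding transform (dense output)
# note: scikit-learn >= 1.2 names this parameter sparse_output; older versions use sparse=False
encoder = OneHotEncoder(sparse_output=False)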

# fit and apply the transform to the input data
X_oe = encoder.fit_transform(X)

# summarize the transformed data
print(X_oe[:3, :])

[["'40-49'" "'premeno'" "'15-19'" "'0-2'" "'yes'" "'3'" "'right'"
  "'left_up'" "'no'"]
 ["'50-59'" "'ge40'" "'15-19'" "'0-2'" "'no'" "'1'" "'right'" "'central'"
  "'no'"]
 ["'50-59'" "'ge40'" "'35-39'" "'0-2'" "'no'" "'2'" "'left'" "'left_low'"
  "'no'"]]
[[0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0.]]
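
The breast cancer output is hard to read, so here is the "color" example from the docstring as a tiny sketch. The data is made up, and handle_unknown="ignore" is an option I added to show what happens with a category the encoder never saw during fit.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# made-up single-column dataset
colors = np.array([["red"], ["green"], ["blue"], ["green"]], dtype=object)

# default output is sparse, so convert to a dense array for printing
color_encoder = OneHotEncoder(handle_unknown="ignore")
encoded = color_encoder.fit_transform(colors).toarray()

# categories are sorted, so the columns are blue, green, red
print(color_encoder.categories_)
print(encoded)

# a category not seen during fit becomes a row of all zeros
unseen = color_encoder.transform(np.array([["purple"]], dtype=object)).toarray()
print(unseen)
```

The column order comes from categories_, which is worth checking before naming the new columns.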


In [8]:
#Lesson 06: Transform Numbers to Categories With kBins.
"""Naturally the next lesson is to do the reverse operation. Some algorithms, such as certain decision tree and 
rule-based methods, prefer categorical data. Discretization maps each numeric value to one of k ordered bins.
The scikit-learn class KBinsDiscretizer will do this for us. It supports three binning strategies: 'uniform' 
(equal-width bins), 'quantile' (roughly equal-sized bins) and 'kmeans' (bins from 1-D k-means clustering). """
from sklearn.datasets import make_classification
from sklearn.preprocessing import KBinsDiscretizer

# define dataset
X, y = make_classification(n_samples=1000, n_features=5, n_informative=5, n_redundant=0, random_state=1)

# summarize data before the transform
print(X[:3, :])

# define the transform with 10 equal-width bins, encoded as ordinal integers.
trans = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')

# transform the data
X_discrete = trans.fit_transform(X)

# summarize data after the transform
print(X_discrete[:3, :])


[[ 2.39324489 -5.77732048 -0.59062319 -2.08095322  1.04707034]
 [-0.45820294  1.94683482 -2.46471441  2.36590955 -0.73666725]
 [ 2.35162422 -1.00061698 -0.5946091   1.12531096 -0.65267587]]
[[7. 0. 4. 1. 5.]
 [4. 7. 2. 6. 4.]
 [7. 5. 4. 5. 4.]]
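
The three strategies behave quite differently on skewed data. The sketch below counts how many samples land in each bin for one made-up exponential feature; the distribution, seed and sample size are assumptions picked only to make the contrast visible.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# one skewed, made-up feature
rng = np.random.default_rng(1)
skewed = rng.exponential(scale=1.0, size=(1000, 1))

counts = {}
for strategy in ("uniform", "quantile", "kmeans"):
    binner = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy=strategy)
    codes = binner.fit_transform(skewed).astype(int).ravel()
    # number of samples that fall into each of the 10 bins
    counts[strategy] = np.bincount(codes, minlength=10)
    print(strategy, counts[strategy])
```

With 'uniform' most samples should pile up in the lowest bins, while 'quantile' should spread them roughly evenly. Recent scikit-learn releases may print a FutureWarning about quantile defaults, which does not affect this comparison.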


In [9]:
#Lesson 07: Dimensionality Reduction With PCA (Feature Extraction)
"""Generally speaking, more input features make modeling harder, in the same sense that more ingredients in a recipe can 
make it more complex and difficult. The scikit-learn class PCA implements Principal Component Analysis (throwback to MATH 415 @ UIUC).
PCA is useful when many features remain even after feature selection: it projects the data onto the top n_components 
principal components, which are eigenvectors of the data's covariance matrix.
"""
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# define dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=7, random_state=1)

# summarize data before the transform
print(X[:3, :])

# define the transform
trans = PCA(n_components=3)

# transform the data
X_dim = trans.fit_transform(X)

# summarize data after the transform
print(X_dim[:3, :])

[[-0.53448246  0.93837451  0.38969914  0.0926655   1.70876508  1.14351305
  -1.47034214  0.11857673 -2.72241741  0.2953565 ]
 [-2.42280473 -1.02658758 -2.34792156 -0.82422408  0.59933419 -2.44832253
   0.39750207  2.0265065   1.83374105  0.72430365]
 [-1.83391794 -1.1946668  -0.73806871  1.50947233  1.78047734  0.58779205
  -2.78506977 -0.04163788 -1.25227833  0.99373587]]
[[-1.64710578 -2.11683302  1.98256096]
 [ 0.92840209  4.8294997   0.22727043]
 [-3.83677757  0.32300714  0.11512801]]
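
To connect PCA back to the eigenvector view, the sketch below checks how much variance the three components keep and compares PCA's explained variance with the largest eigenvalues of the covariance matrix. The variable names are mine. Since the 7 redundant features are linear combinations of the 3 informative ones, I would expect the three components to capture nearly all of the variance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X_demo, _ = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=7, random_state=1)

pca = PCA(n_components=3).fit(X_demo)

# fraction of the total variance kept by each component, and in total
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())

# the same variances, recovered as the largest eigenvalues of the covariance matrix
cov = np.cov(X_demo, rowvar=False)
eigenvalues = np.linalg.eigh(cov)[0][::-1]  # eigh returns ascending order, so reverse
print(eigenvalues[:3])
print(pca.explained_variance_)
```

On real data the ratio sum is a practical guide for choosing n_components: keep adding components until it reaches a level you are comfortable with.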
