<a id="top"></a>

# Machine Learning Data Preparation Using scikit-learn

**Features Data Preparation:**

- Numerical
    - [min-max scaling](#min_max) (a.k.a. normalization)
        - susceptible to outliers
    - [standardization](#standardization)
- Text
    - nominal (order does not matter)
        - [label encode (0 to n) and then one-hot encode (matrix of 0s and 1s)](#label_1hot)
    - ordinal (order does matter)
        - [label encode](#label_1hot)
    - document type (free-form text)
        - [CountVectorizer()](#count_vectorize)
        - [remove stop words to improve model accuracy](#stop_words)
            - first ensure such words can be safely removed
        - [TfidfTransformer()](#tfidf)

**Target/Class Data Preparation:**

- Text: [LabelBinarizer](#labelbinarizer)

## Features Data Preparation

<a id="min_max"></a>

### Min-Max Scaling

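**WARNING:** Min-max scaling is sensitive to outliers: a single extreme value sets the minimum or maximum and compresses all other values into a narrow band. Remove or cap outliers before using it.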

[[back to top]](#top)

In [1]:
# Rescale data (between 0 and 1)
import pandas
import numpy
from sklearn.preprocessing import MinMaxScaler
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX = scaler.fit_transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 0.353  0.744  0.59   0.354  0.     0.501  0.234  0.483]
 [ 0.059  0.427  0.541  0.293  0.     0.396  0.117  0.167]
 [ 0.471  0.92   0.525  0.     0.     0.347  0.254  0.183]
 [ 0.059  0.447  0.541  0.232  0.111  0.419  0.038  0.   ]
 [ 0.     0.688  0.328  0.354  0.199  0.642  0.944  0.2  ]]
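The outlier warning can be made concrete with a small sketch. The single-column array below is invented for illustration (it is not taken from the Pima dataset); it shows how one extreme value pushes every other scaled value toward 0, and checks the result against the formula `(x - min) / (max - min)`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data: four ordinary values and one outlier
toy = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Scale with scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy)

# The same computation by hand, column by column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# The outlier maps to 1.0; the four ordinary values are squeezed close to 0
print(scaled.ravel())
```

Here the four ordinary values all land below 0.05, so most of the 0-to-1 range is spent on the gap to the outlier.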


<a id="standardization"></a>

### Standardization

[[back to top]](#top)

In [2]:
# Standardize data (0 mean, 1 stdev)
from sklearn.preprocessing import StandardScaler
import pandas
import numpy
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
# separate array into input and output components
X = array[:,0:8]
Y = array[:,8]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
# summarize transformed data
numpy.set_printoptions(precision=3)
print(rescaledX[0:5,:])

[[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
 [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
 [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
 [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
 [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
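As a quick sanity check of the "0 mean, 1 stdev" comment, the sketch below standardizes a small made-up array and inspects the column statistics. It also fits the scaler on one array and reuses it on another, a common way to keep test-set statistics out of the fitted scaler. The `X_train` and `X_test` arrays are hypothetical and are not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test data with two features on different scales
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test = np.array([[2.5, 25.0]])

# Learn the per-column mean and standard deviation from the training data only
scaler = StandardScaler().fit(X_train)

# Apply the same learned statistics to both arrays
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0))  # per-column mean of the standardized training data
print(X_train_std.std(axis=0))   # per-column standard deviation
print(X_test_std)                # test row expressed in training-set units
```

Because the test row sits exactly at the training means, it standardizes to zeros.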


<a id="label_1hot"></a>

### Label Encode and One-Hot Encode Multiple Columns

[[back to top]](#top)

In [3]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [5]:
df = pd.read_csv('titanic_data.csv')

In [7]:
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |


In [16]:
# limit to categorical data using df.select_dtypes()
X = df.select_dtypes(include=[object]).fillna('')

In [17]:
X.columns

Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')

In [18]:
X.head()

|   | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| 0 | Braund, Mr. Owen Harris | male | A/5 21171 |  | S |
| 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | PC 17599 | C85 | C |
| 2 | Heikkinen, Miss. Laina | female | STON/O2. 3101282 |  | S |
| 3 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 113803 | C123 | S |
| 4 | Allen, Mr. William Henry | male | 373450 |  | S |


LabelEncoder() accepts only a 1-D array, so use the DataFrame's apply() method to label encode each column in turn, as suggested in this Stack Overflow [question](https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn):

In [19]:
le = LabelEncoder()
X_le = X.apply(le.fit_transform)

In [23]:
X_le.head()

|   | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| 0 | 108 | 1 | 523 | 0 | 3 |
| 1 | 190 | 0 | 596 | 82 | 1 |
| 2 | 353 | 0 | 669 | 0 | 3 |
| 3 | 272 | 0 | 49 | 56 | 3 |
| 4 | 15 | 1 | 472 | 0 | 3 |


In [27]:
X_le.shape

(891, 5)
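One caveat for ordinal features: LabelEncoder assigns integers in sorted (alphabetical) order, which may not match the real ranking of the categories. A possible alternative, not used in this notebook, is OrdinalEncoder with an explicit category order. The `size` column below is a hypothetical example, not a Titanic column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature whose natural order is small < medium < large
sizes = pd.DataFrame({'size': ['small', 'large', 'medium', 'small']})

# LabelEncoder sorts the classes alphabetically: large=0, medium=1, small=2
alphabetical = LabelEncoder().fit_transform(sizes['size'])

# OrdinalEncoder with an explicit order: small=0, medium=1, large=2
ordered_enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered = ordered_enc.fit_transform(sizes[['size']])

print(alphabetical)     # integer codes in alphabetical class order
print(ordered.ravel())  # float codes in the order given above
```

Note that OrdinalEncoder expects a 2-D input (hence the double brackets) and returns floats.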

**OneHotEncoder() accepts a multidimensional array but returns a sparse matrix by default. Call .toarray() on the result to get a dense NumPy array.**

In [28]:
onehot_enc = OneHotEncoder()
X_1hot = onehot_enc.fit_transform(X_le).toarray()
X_1hot.shape

(891, 1726)

In [29]:
X_1hot

array([[ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  1.,  0.]])
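In scikit-learn 0.20 and later, OneHotEncoder can also encode string columns directly, so the intermediate label-encoding step is optional. A minimal sketch on a made-up two-column frame follows; the values only mimic the Titanic columns, and `handle_unknown='ignore'` is one possible choice for dealing with categories that appear only at prediction time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical data shaped like two of the Titanic columns
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
})

# Unknown categories seen later are encoded as all zeros instead of raising
enc = OneHotEncoder(handle_unknown='ignore')
toy_1hot = enc.fit_transform(toy).toarray()

# categories_ lists the learned categories per column, in sorted order
print(enc.categories_)
print(toy_1hot)
```

The output columns follow the order of `enc.categories_`: here `female`, `male`, then `C`, `S`.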

Alternatively, instead of using scikit-learn's OneHotEncoder(), you can use pd.get_dummies()

In [45]:
X_1hot2 = pd.get_dummies(data=X_le, columns=X_le.columns)
X_1hot2.head()

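|   | Name_0 | Name_1 | Name_2 | Name_3 | Name_4 | Name_5 | Name_6 | Name_7 | Name_8 | Name_9 | ... | Cabin_142 | Cabin_143 | Cabin_144 | Cabin_145 | Cabin_146 | Cabin_147 | Embarked_0 | Embarked_1 | Embarked_2 | Embarked_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |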


In [44]:
X_1hot2.shape

(891, 1726)
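pd.get_dummies() also works directly on string columns, which gives more readable column names than the integer codes above. A small sketch on made-up data (the frame below is hypothetical and only mimics two Titanic columns):

```python
import pandas as pd

# Hypothetical categorical data shaped like two of the Titanic columns
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
})

# Each listed column is expanded into one indicator column per category,
# named <column>_<category>
dummies = pd.get_dummies(data=toy, columns=['Sex', 'Embarked'])
cols = list(dummies.columns)

print(cols)
print(dummies.shape)
```

The indicator values may be integers or booleans depending on the pandas version.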

<a id="count_vectorize"></a>

### CountVectorizer()

[[back to top]](#top)

In [48]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

In [57]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35482)
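The shape above means 2257 documents by 35482 distinct tokens. To see what the matrix actually holds, here is a minimal sketch on a two-sentence corpus invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus
docs = ["the cat sat", "the cat sat on the mat"]

vect = CountVectorizer()
counts = vect.fit_transform(docs)

# vocabulary_ maps each token to its column index; sorting by index
# recovers the column order of the count matrix
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

print(vocab)             # one entry per column
print(counts.toarray())  # one row per document, one count per token
```

Each row counts how often every vocabulary token occurs in that document, so "the" gets a count of 2 in the second row.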

<a id="stop_words"></a>

### Stop Words

[[back to top]](#top)

In [68]:
from sklearn.feature_extraction import text

len(text.ENGLISH_STOP_WORDS)

318

In [70]:
text.ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

To add your own stop words, use the union() method. The built-in English stop word list is a Python **```frozenset```**, so union() returns a new frozenset and leaves the original unchanged.

In [71]:
my_additional_stop_words = ['customer','state','states','cust','advise']
updated_stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_words)
len(updated_stop_words)

323

In [72]:
updated_stop_words

frozenset({'a',
           'about',
           'above',
           'across',
           'advise',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',

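With the updated stop word list, pass it to the CountVectorizer constructor. Recent scikit-learn releases validate this parameter and expect a list rather than a frozenset, so convert it first:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(stop_words=list(updated_stop_words))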
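To check that custom stop words are really dropped, the sketch below fits a vectorizer on a single made-up sentence and inspects the resulting vocabulary. The sentence and the shortened `extra` list are assumptions for illustration. The stop words are passed as a list, since recent scikit-learn releases expect a list for this parameter.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical extra stop words, merged with the built-in English list
extra = ['customer', 'advise']
stop_words = list(text.ENGLISH_STOP_WORDS.union(extra))

# Hypothetical document containing built-in and custom stop words
docs = ["Advise the customer about the refund policy"]

vect = CountVectorizer(stop_words=stop_words)
vect.fit(docs)

# Only tokens that are not stop words end up in the vocabulary
kept = sorted(vect.vocabulary_)
print(kept)
```

Before removing domain words this way, it is worth confirming on your own data that they carry no signal for the model.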

<a id="tfidf"></a>

### TfidfTransformer()

[[back to top]](#top)

In [73]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35482)
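If you do not need the raw counts, scikit-learn's TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step. The sketch below, on a small invented corpus, checks that both routes give the same matrix with default settings and that each row is L2-normalized:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

# Hypothetical three-document corpus
docs = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]

# Two-step route: raw counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route with the same default settings
one_step = TfidfVectorizer().fit_transform(docs)

# Compare the two results and compute the length of each document vector
same = np.allclose(two_step.toarray(), one_step.toarray())
row_norms = np.linalg.norm(one_step.toarray(), axis=1)

print(same)
print(row_norms)
```

The shape stays the same as the count matrix; only the values change, from integer counts to weighted, normalized scores.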

<a id="labelbinarizer"></a>

## Target or Label Data Preparation

[[back to top]](#top)

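**If your target or label data is text, you can apply both transformations (label encode and one-hot encode) in one step using ```LabelBinarizer```:**

In [None]:
from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
# LabelBinarizer expects a single 1-D column of labels, not a multi-column DataFrame;
# the Embarked column is used here purely for illustration
class_1hot3 = encoder.fit_transform(X['Embarked'])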

This returns a dense matrix by default. To return a sparse matrix instead, pass ```sparse_output=True``` to the constructor:

In [None]:
encoder = LabelBinarizer(sparse_output=True)
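A minimal end-to-end sketch on a made-up label list shows what LabelBinarizer produces and how to map the encoded rows back to text. The labels below are hypothetical.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical text labels for four samples
labels = ['spam', 'ham', 'eggs', 'ham']

lb = LabelBinarizer()
onehot = lb.fit_transform(labels)

# classes_ holds the labels in sorted order; column i corresponds to classes_[i]
print(lb.classes_)
print(onehot)

# inverse_transform maps each one-hot row back to its original label
decoded = lb.inverse_transform(onehot)
print(decoded)
```

With only two classes, LabelBinarizer returns a single 0/1 column rather than two columns.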