# Overview and Preprocessing
## Big Picture

### Example
Let's assume we have collected a data set about cars:

|Customer Group| Model | Mileage | Power | Price |
|-|-|-|-|-|
|Family| Renault Scenic | 50,000 | 132 | 5,000|
|Upper Class | Porsche Carrera | 10,000 | 332 | 50,000|
|Family | Touran  | 80,000 | 90 | 15,000|
| ... | ... | ... | ... | ... |
|?| Wonder Car| 500 | 4000 | ?|

- Given a large set of cars, can we group together cars with similar price, power and mileage?
- Can we predict the price of a new car given mileage and power?
- Can we predict the customer group?
- What kind of cars do upper class people drive?

### General Approach

The usual machine learning setup is:

1. **$n$ data samples** (e.g. $n$ cars), representing the past experience
2. Every data sample is described by a **set of $d$ features/attributes** (e.g. horsepower and price of the car)

|$Attribute_1$|$Attribute_2$|$\ldots$|$Attribute_d$|
|-|-|-|-|
|$Attribute_1$ of $Example_1$|$Attribute_2$ of $Example_1$ |$\ldots$|$Attribute_d$ of $Example_1$|
|$Attribute_1$ of $Example_2$|$Attribute_2$ of $Example_2$ |$\ldots$|$Attribute_d$ of $Example_2$|
|$\ldots$|$\ldots$|$\ldots$|$\ldots$|
|$Attribute_1$ of $Example_n$|$Attribute_2$ of $Example_n$ |$\ldots$|$Attribute_d$ of $Example_n$|

Machine learning estimates a **model** (also called hypothesis) that **'best' fits the data**. Fitting means the model

1. **predicts** features of yet unknown examples (e.g. predict the customer group of a car)
2. **describes** properties of the examples (e.g. points belonging together)

Building such a model is called learning, training or model fitting.

Using such a model is often called "testing", "prediction" or the "inference step".

Converting data into the necessary format for learning and testing is called **preprocessing**.
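The learning and inference steps described above can be sketched with scikit-learn's `fit`/`predict` interface. This is a minimal illustration, not a recommended model: the choice of `LinearRegression` and the use of only mileage and power as features are assumptions made for the example, and the three training cars are the rows of the table above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Training data: one row per car, columns are [mileage, power].
X_train = np.array([[50_000, 132],
                    [10_000, 332],
                    [80_000, 90]], dtype=float)
# Feature to predict: the price of each car.
y_train = np.array([5_000, 50_000, 15_000], dtype=float)

# Learning / training / model fitting.
model = LinearRegression()
model.fit(X_train, y_train)

# Inference: predict the price of a car the model has not seen.
wonder_car = np.array([[500, 4000]], dtype=float)
predicted_price = model.predict(wonder_car)
print(predicted_price)
```

With only three cars the linear model reproduces the training prices exactly, so the prediction for the "Wonder Car" is a pure extrapolation and should not be taken seriously. The point is the division of labour between `fit` and `predict`.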

## Scikit-Learn

scikit-learn is a Machine Learning library in Python ([Homepage](http://scikit-learn.org/stable/)).


* Simple and efficient tools for data mining and data analysis
* Accessible to everybody, and reusable in various contexts
* Built on NumPy, SciPy, and matplotlib
* Open source, commercially usable - BSD license

## Preprocessing
Preprocessing refers to the task of creating and preparing the data to be consumed by the learning algorithm. Usually, the target format is a matrix holding the preprocessed data. scikit-learn uses NumPy arrays (and SciPy sparse matrices) to represent data.


Preprocessing steps can be summarized as follows:

1. **Feature Extraction/Integration**: Convert data into matrix or integrate different data sources into one matrix
2. **Feature Manipulation**: Manipulate and reorganise the features of a matrix
    * *Feature Weighting/Scaling*: Convert the range of feature values
    * *Feature Selection*: Removing unnecessary or low quality features
    * *Feature Transformation (Dimensionality Reduction)*: merge or combine existing features to create new features   
    
3. **Dataset Manipulation**: Manipulate/eliminate data points
    * *Subsampling*: Reduce the number of data points in case the data set is too large (Squashing)
    * *Outlier Detection*: Remove data points that do not fit to the data distribution
             
<p>
<div class="alert alert-info">
**Feature Engineering**, the task of creating features from real-world data, is often the most important and time-consuming step when applying machine learning techniques.
</div>

See http://scikit-learn.org/stable/data_transforms.html for details on preprocessing.
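Some of the manipulation steps listed above can be sketched on a toy matrix. The data and the specific choices here (`VarianceThreshold` for feature selection, `PCA` for feature transformation, uniform random subsampling) are assumptions for illustration; other techniques fit the same categories.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

# Toy matrix: 6 data points, 3 features; the last feature is constant.
X = np.array([[1.0, 10.0, 7.0],
              [2.0, 12.0, 7.0],
              [3.0, 9.0, 7.0],
              [4.0, 15.0, 7.0],
              [5.0, 11.0, 7.0],
              [6.0, 14.0, 7.0]])

# Feature selection: drop features whose variance is zero.
X_selected = VarianceThreshold().fit_transform(X)

# Feature transformation: combine the remaining features into one new feature.
X_reduced = PCA(n_components=1).fit_transform(X_selected)

# Subsampling: keep a random half of the data points.
rng = np.random.default_rng(seed=0)
keep = rng.choice(X.shape[0], size=3, replace=False)
X_sub = X[keep]

print(X_selected.shape, X_reduced.shape, X_sub.shape)
```

Feature manipulation changes the number of columns, while dataset manipulation changes the number of rows.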

### Extracting Features from Dicts
sklearn allows you to convert Python dictionaries that represent features into NumPy arrays.

For nominal data, it implements a one-hot encoding: each distinct value becomes its own attribute that is either on (1) or off (0).

In [27]:
measurements = [
        {'Model': 'Renault Scenic', 'Mileage': 50000, 'Power': 132, 'Price':5000},
        {'Model': 'Porsche Carrera', 'Mileage': 10000, 'Power': 332, 'Price':50000},
        {'Model': 'Touran', 'Mileage': 80000, 'Power': 90, 'Price': 15000}
        ]

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()

print(vec.fit_transform(measurements).toarray())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vec.get_feature_names_out().tolist()

[[  5.00000000e+04   0.00000000e+00   1.00000000e+00   0.00000000e+00
    1.32000000e+02   5.00000000e+03]
 [  1.00000000e+04   1.00000000e+00   0.00000000e+00   0.00000000e+00
    3.32000000e+02   5.00000000e+04]
 [  8.00000000e+04   0.00000000e+00   0.00000000e+00   1.00000000e+00
    9.00000000e+01   1.50000000e+04]]


['Mileage',
 'Model=Porsche Carrera',
 'Model=Renault Scenic',
 'Model=Touran',
 'Power',
 'Price']
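Once fitted, the vectorizer can map new dictionaries into the same column layout with `transform`. The sketch below uses the "Wonder Car" from the introductory table as an unseen example; passing `sparse=False` is a choice made here only to get a dense array directly.

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'Model': 'Renault Scenic', 'Mileage': 50000, 'Power': 132, 'Price': 5000},
    {'Model': 'Porsche Carrera', 'Mileage': 10000, 'Power': 332, 'Price': 50000},
    {'Model': 'Touran', 'Mileage': 80000, 'Power': 90, 'Price': 15000},
]

vec = DictVectorizer(sparse=False)
vec.fit(measurements)

# A car whose model was not seen during fit and whose price is unknown.
wonder_car = [{'Model': 'Wonder Car', 'Mileage': 500, 'Power': 4000}]
row = vec.transform(wonder_car)
print(row)
```

The unseen value `Model=Wonder Car` is silently ignored, so all three model columns are 0, and the missing `Price` key is encoded as 0 as well. A zero can therefore mean "unknown", which is worth keeping in mind before training on such a matrix.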

### Text Preprocessing using sklearn
sklearn supports several counting methods for converting text into a matrix representation. The simplest one is the count vectorizer.

Vectorizers can use analyzers (set in the constructor), which tokenize the text. This is where you can integrate tokenizers from other libraries, such as NLTK.

In [28]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)
corpus = ['This is the first document.',
        'This is the second second document.',
        'And the third one.',
        'Is this the first document?']
word_counts = vectorizer.fit_transform(corpus)
print(word_counts)
print(vectorizer.get_feature_names_out().tolist())
word_counts

  (0, 1)	1
  (0, 2)	1
  (0, 6)	1
  (0, 3)	1
  (0, 8)	1
  (1, 5)	2
  (1, 1)	1
  (1, 6)	1
  (1, 3)	1
  (1, 8)	1
  (2, 4)	1
  (2, 7)	1
  (2, 0)	1
  (2, 6)	1
  (3, 1)	1
  (3, 2)	1
  (3, 6)	1
  (3, 3)	1
  (3, 8)	1
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>
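A custom analyzer can be passed to the constructor as a callable that receives the raw document and returns a list of tokens. The function `simple_analyzer` below is a hypothetical stand-in for a tokenizer from an external library; the two-sentence corpus is also made up for the example.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer


def simple_analyzer(doc):
    """Hypothetical stand-in for an external tokenizer: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", doc.lower())


vectorizer = CountVectorizer(analyzer=simple_analyzer)
counts = vectorizer.fit_transform(["I saw a cat.", "A cat saw me."])

# vocabulary_ maps each token to its column index.
print(sorted(vectorizer.vocabulary_))
print(counts.toarray())
```

Unlike the default token pattern, which drops one-character tokens, this analyzer keeps words such as "i" and "a".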

### Feature Scaling and Weighting
After extracting features, one needs to consider the scale and value range of each feature. Raw value ranges are often not suitable for the subsequent machine learning step.

For example, raw counts of feature occurrences may not provide a meaningful feature representation. In text, the most frequent words are usually stopwords, so we need to reweight the feature values.


As a second example, data coming from a sensor might contain wrong measurements, or two sensors might report on different scales, so the values need to be rescaled/normalized.
<p>

<div class = "alert alert-info">
When preprocessing data, always check that 
<ol>
<li> The extracted features (i.e. the attributes/dimensions) are meaningful and represent information such that the learning task can be solved 
  <li> The value range of the features is as expected by the machine learning algorithm and has been cleaned from problematic data
</ol>

</div>

#### TFIDF Weighting
TF-IDF weighting stands for "Term Frequency, Inverse Document Frequency" weighting and is mostly used for representing textual data.

The weight of a word in a text is its term frequency (how often the word occurs in that text) multiplied by its inverse document frequency, which shrinks as the number of documents containing the word grows.

In [29]:
from sklearn.feature_extraction.text import TfidfTransformer
transformer = TfidfTransformer()
tfidf_counts = transformer.fit_transform(word_counts)

print("Doc\tfeature\t\ttfidf\t\tcount")
for row in range(tfidf_counts.shape[0]):
    for col,name in enumerate(vectorizer.get_feature_names_out()):
        print("%d\t%s\t\t%f\t%d"%\
              (row,name,
               tfidf_counts[row,col],
               word_counts[row,col]))

Doc	feature		tfidf		count
0	and		0.000000	0
0	document		0.438777	1
0	first		0.541977	1
0	is		0.438777	1
0	one		0.000000	0
0	second		0.000000	0
0	the		0.358729	1
0	third		0.000000	0
0	this		0.438777	1
1	and		0.000000	0
1	document		0.272301	1
1	first		0.000000	0
1	is		0.272301	1
1	one		0.000000	0
1	second		0.853226	2
1	the		0.222624	1
1	third		0.000000	0
1	this		0.272301	1
2	and		0.552805	1
2	document		0.000000	0
2	first		0.000000	0
2	is		0.000000	0
2	one		0.552805	1
2	second		0.000000	0
2	the		0.288477	1
2	third		0.552805	1
2	this		0.000000	0
3	and		0.000000	0
3	document		0.438777	1
3	first		0.541977	1
3	is		0.438777	1
3	one		0.000000	0
3	second		0.000000	0
3	the		0.358729	1
3	third		0.000000	0
3	this		0.438777	1
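The values for document 0 can be reproduced by hand. The sketch below assumes the defaults of `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`), under which the inverse document frequency of a word $t$ is $\ln\frac{1 + n}{1 + df(t)} + 1$ for $n$ documents, and each document vector is scaled to unit length afterwards.

```python
import math

n_docs = 4
# Number of documents in the corpus that contain each word.
df = {"document": 3, "first": 2, "is": 3, "the": 4, "this": 3}
# Raw counts of the words in document 0 ('This is the first document.').
tf = {"document": 1, "first": 1, "is": 1, "the": 1, "this": 1}

# Smoothed inverse document frequency.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in df}

# Term frequency times inverse document frequency.
raw = {word: tf[word] * idf[word] for word in tf}

# L2 normalization of the document vector.
norm = math.sqrt(sum(value * value for value in raw.values()))
tfidf = {word: value / norm for word, value in raw.items()}

for word, value in tfidf.items():
    print(f"{word}\t{value:.6f}")
```

The word "the" occurs in every document and ends up with the lowest weight (0.358729), while "first" occurs in only two documents and gets the highest (0.541977), matching the table above.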


#### Standardization (mean removal/variance scaling)
Some machine learning methods do not work well if the value ranges of the attributes are not standardized. Standardization removes the mean of each attribute and scales its values to unit variance; it works best when the values are roughly normally distributed.

Standardization is often referred to as Feature Normalization (i.e. normalization along one attribute).

In [30]:
from sklearn import preprocessing
import numpy as np
X = np.array([[ 1., -1.,  2.],
               [ 2.,  0.,  0.],
               [ 0.,  1., -1.]])
X_scaled = preprocessing.scale(X)
print("Mean original data", np.mean(X, axis=0))
print("Var  original data",np.var(X, axis=0))
print("Mean scaled   data",np.mean(X_scaled, axis=0))
print("Var  scaled   data",np.var(X_scaled, axis=0))

Mean original data [ 1.          0.          0.33333333]
Var  original data [ 0.66666667  0.66666667  1.55555556]
Mean scaled   data [ 0.  0.  0.]
Var  scaled   data [ 1.  1.  1.]
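In practice the mean and variance are usually estimated on the training data and then reused for new data. A minimal sketch with `StandardScaler`, which stores these statistics; the extra data point `X_new` is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# Estimate mean and standard deviation per attribute (column).
scaler = StandardScaler().fit(X)

# The same computation by hand: subtract the mean, divide by the standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# New data is scaled with the statistics of the training data, not its own.
X_new = np.array([[1., 0., 0.]])
X_new_scaled = scaler.transform(X_new)

print(scaler.mean_)
print(X_new_scaled)
```

Reusing the fitted scaler keeps training and test data on the same scale.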


##### Min/Max Scaling
Alternatively, one can simply scale each feature according to its minimum and maximum value in the data set such that the new feature range is $[0, 1]$.

This is done by the `MinMaxScaler`.

In [31]:
min_max_scaler = preprocessing.MinMaxScaler()
X_scale_minmax = min_max_scaler.fit_transform(X)
print(X_scale_minmax)

[[ 0.5         0.          1.        ]
 [ 1.          0.5         0.33333333]
 [ 0.          1.          0.        ]]


In [32]:
print(X)

[[ 1. -1.  2.]
 [ 2.  0.  0.]
 [ 0.  1. -1.]]


In [33]:
#alternative calculation
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) 
print(X_std)

[[ 0.5         0.          1.        ]
 [ 1.          0.5         0.33333333]
 [ 0.          1.          0.        ]]


#### Normalization
Normalization deals with the problem that data point vectors can be of very different length. Consider for example a short and a long document.

Normalization brings all data points to unit length. This is required by methods that rely on the dot product between data points.

In [34]:
X_normalized = preprocessing.normalize(X, norm='l2')
print(np.linalg.norm(X, ord=2, axis=1))
print(np.linalg.norm(X_normalized, ord=2, axis=1))

[ 2.44948974  2.          1.41421356]
[ 1.  1.  1.]
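The effect on the dot product can be seen with a short and a long document that use the same words in the same proportions. The count vectors below are made up for the example. After L2 normalization the dot product equals the cosine similarity and no longer depends on document length.

```python
import numpy as np
from sklearn.preprocessing import normalize

docs = np.array([[1., 2., 0.],    # short document
                 [10., 20., 0.],  # long document with the same word proportions
                 [0., 0., 3.]])   # document sharing no words with the others

# Dot products of the raw count vectors grow with document length.
raw_dot = docs @ docs.T

# After normalization each row has unit length,
# so the dot product is the cosine of the angle between two documents.
unit = normalize(docs, norm="l2")
cosine = unit @ unit.T

print(raw_dot)
print(cosine)
```

The short and the long document get a similarity of 1 after normalization, whereas their raw dot product (50) mostly reflects the length of the second document.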


### Final Remarks
<div class = "alert alert-info">
In practical applications, **Preprocessing** is the most crucial step in applying machine learning. It depends on the machine learning technique used afterwards, the data at hand and the skill of the feature engineer.
<p>
So do not underestimate this step. A good question to ask is whether you, as a human, could solve the task given the information obtained from preprocessing. If you can't, the machine most likely can't either.
</div>

## References
- Chapter 1 in Tom Mitchell (1997), Machine Learning, McGraw-Hill. Chapter slides for instructors are [available](http://www.cs.cmu.edu/~tom/mlbook-chapter-slides.html)
- Tutorial [An introduction to machine learning with scikit-learn](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)