# T81-558: Applications of Deep Neural Networks
**Class 6: Preprocessing.**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Why is Preprocessing Necessary

The feature vector, the input to a model (such as a neural network), must be completely numeric. Converting non-numeric data into numeric is one major component of preprocessing.  It is also often important to preprocess numeric values.  Scikit-learn provides a large number of preprocessing functions: 

* [Scikit-Learn Preprocessing](http://scikit-learn.org/stable/modules/preprocessing.html)

However, this is just the beginning.  The success of your neural network's predictions is often directly tied to the data representation.

# Preprocessing Functions

The following functions will be used in conjunction with TensorFlow to help preprocess the data.  Some of these were [covered previously](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class2_tensor_flow.ipynb), some are new.

It is okay to just use them. For better understanding, try to see how they work.

These functions allow you to build the feature vector for a neural network. Consider the following:

* Predictors/Inputs 
    * Fill any missing inputs with the median for that column.  Use **missing_median**.
    * Encode textual/categorical values with **encode_text_dummy** or more creative means (see last part of this class session). 
    * Encode numeric values with **encode_numeric_zscore**, **encode_numeric_binary** or **encode_numeric_range**. 
    * Consider removing outliers: **remove_outliers**
* Output
    * Discard rows with missing outputs.
    * Encode textual/categorical values with **encode_text_index**. 
    * Do not encode output numeric values.
    * Consider removing outliers: **remove_outliers**
* Produce final feature vectors (x) and expected output (y) with **to_xy**. 

# Helpful Functions

These are exactly the same feature vector encoding functions from [Class 3](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class3_training.ipynb).  They must be defined for this class as well.  For more information, refer to class 3.

In [1]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import os


# Encode text values to dummy variables(i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)


# Encode text values to a single dummy variable.  The new columns (which do not replace the old) will have a 1
# at every location where the original column (name) matches each of the target_values.  One column is added for
# each target value.
def encode_text_single_dummy(df, name, target_values):
    for tv in target_values:
        l = list(df[name].astype(str))
        l = [1 if str(x) == str(tv) else 0 for x in l]
        name2 = "{}-{}".format(name, tv)
        df[name2] = l


# Encode text values to indexes(i.e. [1],[2],[3] for red,green,blue).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_


# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd


# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)


# Convert all missing values in the specified column to the default
def missing_default(df, name, default_value):
    df[name] = df[name].fillna(default_value)


# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column.  Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        dummies = pd.get_dummies(df[target])
        return df.as_matrix(result).astype(np.float32), dummies.as_matrix().astype(np.float32)
    else:
        # Regression
        return df.as_matrix(result).astype(np.float32), df.as_matrix([target]).astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)


# Regression chart.
def chart_regression(pred,y,sort=True):
    t = pd.DataFrame({'pred' : pred, 'y' : y.flatten()})
    if sort:
        t.sort_values(by=['y'],inplace=True)
    a = plt.plot(t['y'].tolist(),label='expected')
    b = plt.plot(t['pred'].tolist(),label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Remove all rows where the specified column is +/- sd standard deviations
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean()) >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)


# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low=-1, normalized_high=1,
                         data_low=None, data_high=None):
    if data_low is None:
        data_low = min(df[name])
        data_high = max(df[name])

    df[name] = ((df[name] - data_low) / (data_high - data_low)) \
               * (normalized_high - normalized_low) + normalized_low

# Analyzing a Dataset

The following script can be used to give a high level overview of how a dataset appears.

In [2]:
ENCODING = 'utf-8'

def expand_categories(values):
    result = []
    s = values.value_counts()
    t = float(len(values))
    for v in s.index:
        result.append("{}:{}%".format(v,round(100*(s[v]/t),2)))
    return "[{}]".format(",".join(result))
        
def analyze(filename):
    print()
    print("Analyzing: {}".format(filename))
    df = pd.read_csv(filename,encoding=ENCODING)
    cols = df.columns.values
    total = float(len(df))

    print("{} rows".format(int(total)))
    for col in cols:
        uniques = df[col].unique()
        unique_count = len(uniques)
        if unique_count>100:
            print("** {}:{} ({}%)".format(col,unique_count,int(((unique_count)/total)*100)))
        else:
            print("** {}:{}".format(col,expand_categories(df[col])))
            expand_categories(df[col])

The analyze script can be run on the MPG dataset.

In [3]:
import tensorflow.contrib.learn as skflow
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
analyze(filename_read)


Analyzing: ./data/auto-mpg.csv
398 rows
** mpg:129 (32%)
** cylinders:[4:51.26%,8:25.88%,6:21.11%,3:1.01%,5:0.75%]
** displacement:[97.0:5.28%,98.0:4.52%,350.0:4.52%,250.0:4.27%,318.0:4.27%,140.0:4.02%,400.0:3.27%,225.0:3.27%,91.0:3.02%,232.0:2.76%,121.0:2.76%,302.0:2.76%,151.0:2.51%,120.0:2.26%,231.0:2.01%,200.0:2.01%,90.0:2.01%,85.0:2.01%,351.0:2.01%,304.0:1.76%,122.0:1.76%,105.0:1.76%,156.0:1.51%,79.0:1.51%,119.0:1.51%,108.0:1.26%,107.0:1.26%,89.0:1.26%,258.0:1.26%,135.0:1.26%,360.0:1.01%,86.0:1.01%,116.0:1.01%,112.0:1.01%,305.0:1.01%,134.0:1.01%,455.0:0.75%,307.0:0.75%,429.0:0.75%,173.0:0.75%,198.0:0.75%,168.0:0.75%,113.0:0.75%,260.0:0.75%,146.0:0.75%,70.0:0.75%,383.0:0.5%,71.0:0.5%,163.0:0.5%,262.0:0.5%,141.0:0.5%,199.0:0.5%,440.0:0.5%,104.0:0.25%,390.0:0.25%,454.0:0.25%,340.0:0.25%,110.0:0.25%,267.0:0.25%,88.0:0.25%,111.0:0.25%,144.0:0.25%,181.0:0.25%,145.0:0.25%,100.0:0.25%,81.0:0.25%,183.0:0.25%,131.0:0.25%,78.0:0.25%,80.0:0.25%,130.0:0.25%,72.0:0.25%,101.0:0.25%,115.0:0.25%,1

# Preprocessing Examples

The above preprocessing functions can be used in a variety of ways.

In [4]:
import tensorflow.contrib.learn as skflow
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])

# create feature vector
missing_median(df, 'horsepower')
df.drop('name',1,inplace=True)
#encode_numeric_binary(df,'mpg',20)
#df['origin'] = df['origin'].astype(str)
#encode_text_tfidf(df, 'origin')

# Drop outliers in horsepower
print("Length before MPG outliers dropped: {}".format(len(df)))
remove_outliers(df,'mpg',2)
print("Length after MPG outliers dropped: {}".format(len(df)))

print(df)


Length before MPG outliers dropped: 398
Length after MPG outliers dropped: 388
      mpg  cylinders  displacement  horsepower  weight  acceleration  year  \
0    18.0          8         307.0       130.0    3504          12.0    70   
1    15.0          8         350.0       165.0    3693          11.5    70   
2    18.0          8         318.0       150.0    3436          11.0    70   
3    16.0          8         304.0       150.0    3433          12.0    70   
4    17.0          8         302.0       140.0    3449          10.5    70   
5    15.0          8         429.0       198.0    4341          10.0    70   
6    14.0          8         454.0       220.0    4354           9.0    70   
7    14.0          8         440.0       215.0    4312           8.5    70   
8    14.0          8         455.0       225.0    4425          10.0    70   
9    15.0          8         390.0       190.0    3850           8.5    70   
10   15.0          8         383.0       170.0    3563         

# Feature Ranking

Feature ranking is an important process where you determine which input columns (features) are the most important. I implemented several feature ranking algorithms for the following academic paper:

Heaton, J., McElwee, S., & Cannady, J. (May 2017). [Early stabilizing feature importance for TensorFlow deep neural networks](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/pdf/heaton_et_al_ijcnn_2017-pre.pdf). In *International Joint Conference on Neural Networks (IJCNN 2017)* (accepted for publication). IEEE.

Two feature ranking algorithms are provided here (a total of 4 are in the paper):

* **CorrelationCoefficientRank** - A simple statistical analysis of the correlation between each input field and the target.  Does not require a trained neural network and does not consider interactions.
* **InputPerturbationRank** - Uses a trained neural network and scrambles each input one-by-one. Neural network does not need to be retrained.  Slower, but more accurate, than CorrelationCoefficientRank.

Some of the code from this paper is provieded here:

In [5]:
# Feature ranking code

class Ranking(object):
    def __init__(self, names):
        self.names = names

    def _normalize(self, x, y, impt):
        impt = impt / sum(impt)
        impt = list(zip(impt, self.names, range(x.shape[1])))
        impt.sort(key=lambda x: -x[0])
        return impt
    
class CorrelationCoefficientRank(Ranking):
    def __init__(self, names):
        super(CorrelationCoefficientRank, self).__init__(names)

    def rank(self, x, y, model=None):
        impt = []

        for i in range(x.shape[1]):
            c = abs(np.corrcoef(x[:, i], y[:, 0]))
            impt.append(abs(c[1, 0]))

        impt = impt / sum(impt)
        impt = list(zip(impt, self.names, range(x.shape[1])))
        impt.sort(key=lambda x: -x[0])

        return (impt)


class InputPerturbationRank(Ranking):
    def __init__(self, names):
        super(InputPerturbationRank, self).__init__(names)

    def _raw_rank(self, x, y, network):
        impt = np.zeros(x.shape[1])

        for i in range(x.shape[1]):
            hold = np.array(x[:, i])
            np.random.shuffle(x[:, i])

            # Handle both TensorFlow and SK-Learn models.
            if 'tensorflow' in str(type(network)).lower():
                pred = list(network.predict(x, as_iterable=True))
            else:
                pred = network.predict(x)

            rmse = metrics.mean_squared_error(y, pred)
            impt[i] = rmse
            x[:, i] = hold

        return impt

    def rank(self, x, y, network):
        impt = self._raw_rank(x, y, network)
        return self._normalize(x, y, impt)

In [6]:
# Rank MPG fields

import tensorflow as tf
from sklearn.model_selection import train_test_split
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.callbacks import EarlyStopping

path = "./data/"

# Set the desired TensorFlow output level for this example
tf.logging.set_verbosity(tf.logging.ERROR)

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])

# create feature vector
missing_median(df, 'horsepower')
df.drop('name',1,inplace=True)
encode_numeric_zscore(df, 'horsepower')
encode_numeric_zscore(df, 'weight')
encode_numeric_zscore(df, 'cylinders')
encode_numeric_zscore(df, 'displacement')
encode_numeric_zscore(df, 'acceleration')
encode_text_dummy(df, 'origin')

# Encode to a 2D matrix for training
x,y = to_xy(df,'mpg')

# Split into train/test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.20, random_state=42)

model = Sequential()
model.add(Dense(10, input_dim=x.shape[1], activation='relu'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto')
model.fit(x,y,validation_data=(x_test,y_test),callbacks=[monitor],verbose=0,epochs=1000)
    
# Fit/train neural network
names = list(df.columns) 
names.remove('mpg') # must remove target field MPG so that index aligns with x (which does not have mpg)

ranker = InputPerturbationRank
print()
print("*** InputPerturbationRank ***")
l1 = ranker(names).rank(x_test, y_test, model)

for itm in l1:
    print(itm)

Using TensorFlow backend.


Epoch 00392: early stopping

*** InputPerturbationRank ***
(0.14533266419625726, 'year', 5)
(0.13929055840496291, 'weight', 3)
(0.12582514508793347, 'cylinders', 0)
(0.12219814896254259, 'horsepower', 2)
(0.10844029542596251, 'displacement', 1)
(0.091625192197964611, 'origin-1', 6)
(0.090854499350810303, 'origin-3', 8)
(0.088911064809467755, 'acceleration', 4)
(0.087522431564098613, 'origin-2', 7)


In [7]:
ranker = CorrelationCoefficientRank
print()
print("*** CorrelationCoefficientRank ***")
l1 = ranker(names).rank(x_test, y_test, model)

for itm in l1:
    print(itm)


*** CorrelationCoefficientRank ***
(0.1523953056674856, 'weight', 3)
(0.14617209454138816, 'displacement', 1)
(0.14485626531604004, 'horsepower', 2)
(0.14248640927492129, 'cylinders', 0)
(0.099341860269407153, 'year', 5)
(0.097932975012336415, 'acceleration', 4)
(0.095503698574517557, 'origin-1', 6)
(0.084529264419704569, 'origin-3', 8)
(0.036782126924199104, 'origin-2', 7)


# Other Examples: Dealing with Addresses

Addresses can be difficult to encode into a neural network.  There are many different approaches, and you must consider how you can transform the address into something more meaningful.  Map coordinates can be a good approach.  [Latitude and longitude](https://en.wikipedia.org/wiki/Geographic_coordinate_system) can be a useful encoding.  Thanks to the power of the Internet, it is relatively easy to transform an address into its latitude and longitude values.  The following code determines the coordinates of [Washington University](https://wustl.edu/):

In [8]:
import requests

address = "1 Brookings Dr, St. Louis, MO 63130"

response = requests.get('https://maps.googleapis.com/maps/api/geocode/json?address='+address)

resp_json_payload = response.json()

print(resp_json_payload['results'][0]['geometry']['location'])

{'lat': 38.6471178, 'lng': -90.3026148}


If latitude and longitude are simply fed into the neural network as two features, they might not be overly helpful.  These two values would allow your neural network to cluster locations on a map.  Sometimes cluster locations on a map can be useful.  Consider the percentage of the population that smokes in the USA by state:

![Smokers by State](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_6_smokers.png "Smokers by State")

The above map shows that certian behaviors, like smoking, can be clustered by global region. 

However, often you will want to transform the coordinates into distances.  It is reasonably easy to estimate the distance between any two points on Earth by using the [great circle distance](https://en.wikipedia.org/wiki/Great-circle_distance) between any two points on a sphere:

The following code implements this formula:

$\Delta\sigma=\arccos\bigl(\sin\phi_1\cdot\sin\phi_2+\cos\phi_1\cdot\cos\phi_2\cdot\cos(\Delta\lambda)\bigr)$

$d = r \, \Delta\sigma$


In [9]:
from math import sin, cos, sqrt, atan2, radians

# Distance function
def distance_lat_lng(lat1,lng1,lat2,lng2):
    # approximate radius of earth in km
    R = 6373.0

    # degrees to radians (lat/lon are in degrees)
    lat1 = radians(lat1)
    lng1 = radians(lng1)
    lat2 = radians(lat2)
    lng2 = radians(lng2)

    dlng = lng2 - lng1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlng / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    return R * c

# Find lat lon for address
def lookup_lat_lng(address):
    response = requests.get('https://maps.googleapis.com/maps/api/geocode/json?address='+address)
    json = response.json()
    if len(json['results']) == 0:
        print("Can't find: {}".format(address))
        return 0,0
    map = json['results'][0]['geometry']['location']
    return map['lat'],map['lng']


# Distance between two locations

import requests

address1 = "1 Brookings Dr, St. Louis, MO 63130" 
address2 = "3301 College Ave, Fort Lauderdale, FL 33314"

lat1, lng1 = lookup_lat_lng(address1)
lat2, lng2 = lookup_lat_lng(address2)

print("Distance, St. Louis, MO to Ft. Lauderdale, FL: {} km".format(
        distance_lat_lng(lat1,lng1,lat2,lng2)))



Distance, St. Louis, MO to Ft. Lauderdale, FL: 1685.0869618595307 km


Distances can be useful to encode addresses as.  You must consider what distance might be useful for your dataset.  Consider:

* Distance to major metropolitan area
* Distance to competitor
* Distance to distribution center
* Distance to retail outlet

The following code calculates the distance between 10 universities and washu:

In [10]:
# Encoding other universities by their distance to Washington University

schools = [
    ["Princeton University, Princeton, NJ 08544", 'Princeton'],
    ["Massachusetts Hall, Cambridge, MA 02138", 'Harvard'],
    ["5801 S Ellis Ave, Chicago, IL 60637", 'University of Chicago'],
    ["Yale, New Haven, CT 06520", 'Yale'],
    ["116th St & Broadway, New York, NY 10027", 'Columbia University'],
    ["450 Serra Mall, Stanford, CA 94305", 'Stanford'],
    ["77 Massachusetts Ave, Cambridge, MA 02139", 'MIT'],
    ["Duke University, Durham, NC 27708", 'Duke University'],
    ["University of Pennsylvania, Philadelphia, PA 19104", 'University of Pennsylvania'],
    ["Johns Hopkins University, Baltimore, MD 21218", 'Johns Hopkins']
]

lat1, lng1 = lookup_lat_lng("1 Brookings Dr, St. Louis, MO 63130")

for address, name in schools:
    lat2,lng2 = lookup_lat_lng(address)
    dist = distance_lat_lng(lat1,lng1,lat2,lng2)
    print("School '{}', distance to wustl is: {}".format(name,dist))


School 'Princeton', distance to wustl is: 1354.7554246422899
School 'Harvard', distance to wustl is: 1670.4803488515167
School 'University of Chicago', distance to wustl is: 418.07074534593164
School 'Yale', distance to wustl is: 1508.0574212290953
School 'Columbia University', distance to wustl is: 1418.0702619177403
School 'Stanford', distance to wustl is: 2781.0147376094565
School 'MIT', distance to wustl is: 1672.3056259012085
School 'Duke University', distance to wustl is: 1046.5733989340915
School 'University of Pennsylvania', distance to wustl is: 1307.0188113752533
School 'Johns Hopkins', distance to wustl is: 1184.1983102146019


# Other Examples: Bag of Words

The Bag of Words algorithm is a common means of encoding strings. (Harris, 1954) Each input represents the count of one particular word. The entire input vector would contain one value for each unique word. Consider the following strings.

```
Of Mice and Men
Three Blind Mice
Blind Man’s Bluff
Mice and More Mice
```

We have the following unique words. This is our “dictionary.”

```
Input 0 : and
Input 1 : blind
Input 2 : bluff
Input 3 : man’s
Input 4 : men
Input 5 : mice
Input 6 : more
Input 7 : of
Input 8 : three
```

The four lines above would be encoded as follows.

```
Of Mice and Men [ 0 4 5 7 ]
Three Blind Mice [ 1 5 8 ]
Blind Man ’ s Bl u f f [ 1 2 3 ]
Mice and More Mice [ 0 5 6 ]
```

Of course we have to fill in the missing words with zero, so we end up with
the following.

* Of Mice and Men [ 1 , 0 , 0 , 0 , 1 , 1 , 0 , 1 , 0 ]
* Three Blind Mice [ 0 , 1 , 0 , 0 , 0 , 1 , 0 , 0 , 1 ]
* Blind Man’s Bluff [ 0 , 1 , 1 , 1 , 0 , 0 , 0 , 0 , 0 ]
* Mice and More Mice [ 1 , 0 , 0 , 0 , 0 , 2 , 1 , 0 , 0 ]

Notice that we now have a consistent vector length of nine. Nine is the total
number of words in our “dictionary”. Each component number in the vector is
an index into our dictionary of available words. At each vector component is
stored a count of the number of words for that dictionary entry. Each string
will usually contain only a small subset of the dictionary. As a result, most of
the vector values will be zero.

As you can see, one of the most difficult aspects of machine learning programming
is translating your problem into a fixed-length array of floating point
numbers. The following section shows how to translate several examples.


* [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?']

vectorizer = CountVectorizer(min_df=1)

vectorizer.fit(corpus)

print("Mapping")
print(vectorizer.vocabulary_)

print()
print("Encoded")
x = vectorizer.transform(corpus)
print(x.toarray())

Mapping
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

Encoded
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]


In [12]:
from sklearn.feature_extraction.text import CountVectorizer

path = "./data/"

filename_read = os.path.join(path,"auto-mpg.csv")
df = pd.read_csv(filename_read,na_values=['NA','?'])

corpus = df['name']

vectorizer = CountVectorizer(min_df=1)

vectorizer.fit(corpus)

print("Mapping")
print(vectorizer.vocabulary_)

print()
print("Encoded")
x = vectorizer.transform(corpus).toarray()
print(x)

print(len(vectorizer.vocabulary_))

# reverse lookup for columns
bag_cols = [0] * len(vectorizer.vocabulary_)
for i,key in enumerate(vectorizer.vocabulary_):
    bag_cols[i] = key


Mapping
{'chevrolet': 94, 'chevelle': 91, 'malibu': 185, 'buick': 75, 'skylark': 251, '320': 37, 'plymouth': 220, 'satellite': 243, 'amc': 62, 'rebel': 228, 'sst': 257, 'ford': 143, 'torino': 274, 'galaxie': 147, '500': 44, 'impala': 167, 'fury': 145, 'iii': 166, 'pontiac': 221, 'catalina': 84, 'ambassador': 61, 'dpl': 131, 'dodge': 129, 'challenger': 88, 'se': 245, 'cuda': 115, '340': 39, 'monte': 202, 'carlo': 83, 'estate': 136, 'wagon': 293, 'sw': 268, 'toyota': 276, 'corona': 107, 'mark': 188, 'ii': 165, 'duster': 132, 'hornet': 164, 'maverick': 191, 'datsun': 123, 'pl510': 219, 'volkswagen': 290, '1131': 4, 'deluxe': 125, 'sedan': 247, 'peugeot': 215, '504': 47, 'audi': 67, '100': 1, 'ls': 179, 'saab': 239, '99e': 56, 'bmw': 73, '2002': 22, 'gremlin': 154, 'f250': 138, 'chevy': 95, 'c20': 77, 'd200': 120, 'hi': 161, '1200d': 7, 'vega': 285, '2300': 26, 'pinto': 218, 'custom': 116, 'matador': 190, 'brougham': 74, 'monaco': 200, 'country': 110, 'squire': 256, 'safari': 240, 'sportab

In [13]:
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.callbacks import EarlyStopping

#x = x.toarray() #.as_matrix()
y = df['mpg'].as_matrix()

model = Sequential()
model.add(Dense(10, input_dim=x.shape[1], activation='relu'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x,y,verbose=0,epochs=1000)

# Rank features
ranker = InputPerturbationRank
print()
print("*** Feature Ranking ***")
l1 = ranker(bag_cols).rank(x, y, model)

for itm in l1:
    print(itm)
    



*** Feature Ranking ***
(0.017396697112590658, 'valiant', 123)
(0.014139692034698025, 'escort', 276)
(0.011201757315901189, 'man', 220)
(0.0086706736749374985, 'lynx', 290)
(0.0085017738804555234, 'monza', 162)
(0.0077187852924681449, 'cruiser', 129)
(0.0073104410283607261, 'manta', 140)
(0.0072324516270989656, 'delta', 94)
(0.0070759292805923114, 'dasher', 150)
(0.0069773848567330002, 'sapporo', 232)
(0.0068839778951273849, 'beetle', 126)
(0.0068752741084926704, 'arrow', 194)
(0.0065229629846020225, 'dart', 143)
(0.0064331948403496941, 'stanza', 292)
(0.005975436666200715, '88', 95)
(0.0057479199208103123, 'lecar', 263)
(0.0050237642451710848, 'astro', 163)
(0.0049504991808959776, 'turbo', 225)
(0.0047836146704541092, 'monaco', 67)
(0.004664201409191355, 'lesabre', 92)
(0.004620721069649684, 'royal', 99)
(0.0045376763835871109, 'opel', 77)
(0.0044229760364832751, 'concord', 221)
(0.0044046449363231823, 'coupe', 102)
(0.0040178697488529739, '2000', 76)
(0.0040063471383821004, 'hatchba

# Other Examples: Time Series

Time series data will need to be encoded for a regular feedforward neural network.  In a few classes we will see how to use a recurrent neural network to find patterns over time.  For now, we will encode the series into input neurons.

Financial forecasting is a very popular form of temporal algorithm. A temporal algorithm is one that accepts input for values that range over time. If the algorithm supports short term memory (internal state) then ranges over time are supported automatically. If your algorithm does not have an internal state then you should use an input window and a prediction window. Most algorithms do not have an internal state. To see how to use these windows, consider if you would like the algorithm to predict the stock market. You begin with the closing price for a stock over several days:

```
Day 1 : $45
Day 2 : $47
Day 3 : $48
Day 4 : $40
Day 5 : $41
Day 6 : $43
Day 7 : $45
Day 8 : $57
Day 9 : $50
Day 10 : $41
```

The first step is to normalize the data. This is necessary whether your algorithm has internal state or not. To normalize, we want to change each number into the percent movement from the previous day. For example, day 2 would become 0.04, because there is a 4% difference between $45 and $47. Once you perform this calculation for every day, the data set will look like the following:

```
Day 2 : 0. 04
Day 3 : 0. 02
Day 4:−0.16
Day 5 : 0. 02
Day 6 : 0. 04
Day 7 : 0. 04
Day 8 : 0. 04
Day 9:−0.12
Day 10:−0.18
```

In order to create an algorithm that will predict the next day’s values, we need to think about how to encode this data to be presented to the algorithm. The encoding depends on whether the algorithm has an internal state. The internal state allows the algorithm to use the last few values inputted to help establish trends.

Many machine learning algorithms have no internal state. If this is the case, then you will typically use a sliding window algorithm to encode the data. To do this, we use the last three prices to predict the next one. The inputs would be the last three-day prices, and the output would be the fourth day. The above data could be organized in the following way to provide training data.

These cases specified the ideal output for the given inputs:

```
[ 0.04 , 0.02 , −0.16 ] −> 0.02
[ 0.02 , −0.16 , 0.02 ] −> 0.04
[ −0.16 , 0.02 , 0.04 ] −> 0.04
[ 0.02 , 0.04 , 0.04 ] −> 0. 26
[ 0.04 , 0.04 , 0.26 ] −> −0.12
[ 0.04 , 0.26 , −0.12 ] −> −0.18
```

The above encoding would require that the algorithm have three inputs and one output.

In [14]:
import numpy as np

def normalize_price_change(history):
    last = None
    
    result = []
    for price in history:
        if last is not None:
            result.append( float(price-last)/last )
        last = price

    return result

def encode_timeseries_window(source, lag_size, lead_size):
    """
    Encode raw data to a time-series window.
    :param source: A 2D array that specifies the source to be encoded.
    :param lag_size: The number of rows uses to predict.
    :param lead_size: The number of rows to be predicted
    :return: A tuple that contains the x (input) & y (expected output) for training.
    """
    result_x = []
    result_y = []

    output_row_count = len(source) - (lag_size + lead_size) + 1
    

    for raw_index in range(output_row_count):
        encoded_x = []

        # Encode x (predictors)
        for j in range(lag_size):
            encoded_x.append(source[raw_index+j])

        result_x.append(encoded_x)

        # Encode y (prediction)
        encoded_y = []

        for j in range(lead_size):
            encoded_y.append(source[lag_size+raw_index+j])

        result_y.append(encoded_y)

    return result_x, result_y


price_history = [ 45, 47, 48, 40, 41, 43, 45, 57, 50, 41 ]
norm_price_history = normalize_price_change(price_history)

print("Normalized price history:")
print(norm_price_history)

print()
print("Rounded normalized price history:")
norm_price_history = np.round(norm_price_history,2)
print(norm_price_history)


print()
print("Time Boxed(time series encoded):")
x, y = encode_timeseries_window(norm_price_history, 3, 1)

for x_row, y_row in zip(x,y):
    print("{} -> {}".format(np.round(x_row,2), np.round(y_row,2)))


Normalized price history:
[0.044444444444444446, 0.02127659574468085, -0.16666666666666666, 0.025, 0.04878048780487805, 0.046511627906976744, 0.26666666666666666, -0.12280701754385964, -0.18]

Rounded normalized price history:
[ 0.04  0.02 -0.17  0.02  0.05  0.05  0.27 -0.12 -0.18]

Time Boxed(time series encoded):
[ 0.04  0.02 -0.17] -> [ 0.02]
[ 0.02 -0.17  0.02] -> [ 0.05]
[-0.17  0.02  0.05] -> [ 0.05]
[ 0.02  0.05  0.05] -> [ 0.27]
[ 0.05  0.05  0.27] -> [-0.12]
[ 0.05  0.27 -0.12] -> [-0.18]
