## Chapter 18. Neural Networks

An artificial neural network (or neural network for short) is a predictive model motivated
by the way the brain operates. Think of the brain as a collection of neurons wired together.
Each neuron looks at the outputs of the other neurons that feed into it, does a calculation,
and then either fires (if the calculation exceeds some threshold) or doesn’t (if it doesn’t).
Accordingly, artificial neural networks consist of artificial neurons, which perform similar
calculations over their inputs. Neural networks can solve a wide variety of problems like
handwriting recognition and face detection, and they are used heavily in deep learning,
one of the trendiest subfields of data science. However, most neural networks are “black
boxes” — inspecting their details doesn’t give you much understanding of how they’re
solving a problem. And large neural networks can be difficult to train. For most problems
you’ll encounter as a budding data scientist, they’re probably not the right choice.
Someday, when you’re trying to build an artificial intelligence to bring about the
Singularity, they very well might be.

### Perceptrons
Pretty much the simplest neural network is the perceptron, which approximates a single
neuron with n binary inputs. It computes a weighted sum of its inputs and “fires” if that
weighted sum is zero or greater:
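
A minimal sketch of that computation. The `bias` term and the AND-gate weights are assumptions added for illustration; the text above only specifies the weighted sum and the zero-or-greater firing rule.

```python
def step_function(x):
    """Fire (1) when the value is zero or greater, otherwise stay silent (0)."""
    return 1 if x >= 0 else 0

def perceptron_output(weights, bias, x):
    """Weighted sum of the binary inputs, plus an assumed bias, passed through the step."""
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(weights, x)) + bias
    return step_function(weighted_sum)

# hypothetical weights for an AND gate: fires only when both inputs are 1
and_weights, and_bias = [2, 2], -3

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(and_weights, and_bias, x))
```

With these weights only `(1, 1)` gives a nonnegative sum (2 + 2 - 3 = 1), so it is the only input that fires.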


In [1]:
import math

def entropy(class_probs):
    """Given a list of class probabilities, compute the entropy"""
    return sum(-p*math.log(p,2) for p in class_probs if p) # ignore p = 0
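
A quick sanity check of `entropy` on a few hand-picked distributions. The probabilities below are illustrative, not taken from any data set in these notes.

```python
import math

def entropy(class_probs):
    """Given a list of class probabilities, compute the entropy"""
    return sum(-p * math.log(p, 2) for p in class_probs if p)  # ignore p = 0

print(entropy([1.0]))        # a single certain class: no uncertainty
print(entropy([0.5, 0.5]))   # two equally likely classes: about one bit
print(entropy([0.25] * 4))   # four equally likely classes: about two bits
print(entropy([0.9, 0.1]))   # lopsided classes: well under one bit
```

Entropy is lowest when one class dominates and highest when the classes are evenly spread.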

Our data consists of pairs `(input, label)`, which means we’ll need to compute the class probabilities ourselves. *We don’t actually care which label is associated w/ each probability, only what the probabilities are:*


In [2]:
from collections import Counter
# e.g. Counter(['blue','blue','blue','red','red','green'])
#   -> Counter({'blue': 3, 'red': 2, 'green': 1})

def class_probs(labels):
    total_count = len(labels)
    # for each count of a label, find the fraction of that label count
    # over total DP count
    return [count / total_count for count in Counter(labels).values()]

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    probs = class_probs(labels)
    return entropy(probs)
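
A small usage sketch of the three functions on made-up labeled data (3 blue, 2 red, 1 green). The inputs are placeholders (`None`) since only the labels matter here.

```python
import math
from collections import Counter

def entropy(class_probs):
    return sum(-p * math.log(p, 2) for p in class_probs if p)

def class_probs(labels):
    total_count = len(labels)
    return [count / total_count for count in Counter(labels).values()]

def data_entropy(labeled_data):
    labels = [label for _, label in labeled_data]
    return entropy(class_probs(labels))

# hypothetical labeled data: (input, label) pairs w/ placeholder inputs
labeled = [(None, 'blue')] * 3 + [(None, 'red')] * 2 + [(None, 'green')]

print(class_probs([label for _, label in labeled]))  # the fractions 3/6, 2/6, 1/6
print(data_entropy(labeled))                         # roughly 1.46 bits
```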

### The Entropy of a Partition

We've computed entropy (“uncertainty”) of a single set of labeled data. Each stage of a DT involves asking a question in which the answer partitions data into 1 or (hopefully) more subsets. 
* Ex: “does it have > 5 legs?” partitions animals into those w/ > 5 legs + those that don’t

Correspondingly, want some notion of the entropy that will result from partitioning a set of data in a certain way. **Want a partition to:**
* **have low entropy if it splits data into subsets that *themselves* have low entropy (i.e., are highly certain)**
* **have high entropy if it contains subsets that (are large +) have high entropy (i.e., are highly uncertain)**

Ex: “does it have > 5 legs?” question = pretty dumb, as it partitioned the remaining animals at that point into `S1={echidna}` and `S2={everything else}`, where `S2` = both large + high-entropy. (`S1` has no entropy but represents a small fraction of the remaining “classes.”)

Mathematically, if we partition data `S` into subsets `{S1, ..., Sm}` , containing proportions `{q1, ..., qm}` of the data, then we compute **entropy of the partition** as a weighted sum:
* `H = q1*H(S1) + ... + qm*H(Sm)`

which we can implement as:

In [3]:
def partition_entropy(subsets):
    """Find entropy from a partition of data into subsets, where
    subsets = a list of lists of labeled data"""
    total_count = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)
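
To see the weighted sum in action, here is a sketch comparing two ways of partitioning the same made-up data (4 `True` + 4 `False` labels, placeholder inputs): a clean split vs. an echidna-style split that peels off a single item.

```python
import math
from collections import Counter

def entropy(class_probs):
    return sum(-p * math.log(p, 2) for p in class_probs if p)

def data_entropy(labeled_data):
    counts = Counter(label for _, label in labeled_data)
    return entropy([c / len(labeled_data) for c in counts.values()])

def partition_entropy(subsets):
    total_count = sum(len(subset) for subset in subsets)
    return sum(data_entropy(subset) * len(subset) / total_count
               for subset in subsets)

trues = [(None, True)] * 4    # hypothetical data, inputs are placeholders
falses = [(None, False)] * 4

# clean split: each subset is pure, so every term of the sum is 0
clean = partition_entropy([trues, falses])

# lopsided split: one tiny pure subset + one large, still-mixed subset
lopsided = partition_entropy([trues[:1], trues[1:] + falses])

print(clean, lopsided)
```

The lopsided split stays close to the unsplit entropy of 1 bit because the large mixed subset carries 7/8 of the weight.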

***NOTE***: 1 problem w/ this approach = partitioning by an attribute w/ many different values yields very low entropy + invites overfitting.
* Ex: trying to build a DT to predict which customers of a bank = likely to default on mortgages, using some historical data as a training set.
* Imagine further that the data set contains each customer’s SSN, + partitioning on SSN will produce 1-person subsets, each of which necessarily has 0 entropy. 
* But a model that relies on SSN is *CERTAIN* not to generalize beyond training
* For this reason, probably avoid (or bucket, if appropriate) attributes w/ large #'s of possible values when creating DT's
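
A sketch of the SSN problem on made-up data. The `ssn` values, the `income` attribute, and the labels are all invented for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(class_probs):
    return sum(-p * math.log(p, 2) for p in class_probs if p)

def data_entropy(labeled_data):
    counts = Counter(label for _, label in labeled_data)
    return entropy([c / len(labeled_data) for c in counts.values()])

def partition_entropy_by(inputs, attribute):
    groups = defaultdict(list)
    for item in inputs:
        groups[item[0][attribute]].append(item)
    return sum(data_entropy(g) * len(g) / len(inputs) for g in groups.values())

# hypothetical customers: label = True if they defaulted
customers = [
    ({'ssn': 'id-1', 'income': 'low'}, True),
    ({'ssn': 'id-2', 'income': 'low'}, True),
    ({'ssn': 'id-3', 'income': 'low'}, False),
    ({'ssn': 'id-4', 'income': 'high'}, False),
    ({'ssn': 'id-5', 'income': 'high'}, False),
    ({'ssn': 'id-6', 'income': 'high'}, True),
]

print('ssn', partition_entropy_by(customers, 'ssn'))        # 1-person subsets
print('income', partition_entropy_by(customers, 'income'))  # mixed subsets
```

The `ssn` split looks perfect (every subset has one member, hence zero entropy), yet it says nothing about a customer whose SSN was not in training.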

### Creating a Decision Tree

VP provides you w/ interviewee data, consisting of (per your specification) pairs `(input, label)`, where each `input` = a `dict` of candidate attributes, + each `label` =  either `True` (candidate interviewed well) or `False` (candidate interviewed poorly).

In particular, you're provided w/ each candidate’s `level`, preferred language (`lang`), whether active on Twitter (`tweets`), + whether has a PhD (`phd`):

In [4]:
inputs = [
    ({'level':'Senior', 'lang':'Java', 'tweets':'no', 'phd':'no'}, False),
    ({'level':'Senior', 'lang':'Java', 'tweets':'no', 'phd':'yes'}, False),
    ({'level':'Mid', 'lang':'Python', 'tweets':'no', 'phd':'no'}, True),
    ({'level':'Junior', 'lang':'Python', 'tweets':'no', 'phd':'no'}, True),
    ({'level':'Junior', 'lang':'R', 'tweets':'yes', 'phd':'no'}, True),
    ({'level':'Junior', 'lang':'R', 'tweets':'yes', 'phd':'yes'}, False),
    ({'level':'Mid', 'lang':'R', 'tweets':'yes', 'phd':'yes'}, True),
    ({'level':'Senior', 'lang':'Python', 'tweets':'no', 'phd':'no'}, False),
    ({'level':'Senior', 'lang':'R', 'tweets':'yes', 'phd':'no'}, True),
    ({'level':'Junior', 'lang':'Python', 'tweets':'yes', 'phd':'no'}, True),
    ({'level':'Senior', 'lang':'Python', 'tweets':'yes', 'phd':'yes'}, True),
    ({'level':'Mid', 'lang':'Python', 'tweets':'no', 'phd':'yes'}, True),
    ({'level':'Mid', 'lang':'Java', 'tweets':'yes', 'phd':'no'}, True),
    ({'level':'Junior', 'lang':'Python', 'tweets':'no', 'phd':'yes'}, False)
]

Our DT consists of decision **nodes** (ask a question + direct us differently depending on answer) + **leaf** nodes (give a prediction). We will build it using the relatively simple **ID3 algorithm**, which operates in the following manner. 
* Given some labeled data + a list of attributes to consider branching on.
* If the data all have the same label, create a leaf node that predicts that label + stop.
* If the list of attributes = empty (i.e., no more possible questions to ask), create a leaf node that predicts the most common label + stop.
* Otherwise, try partitioning the data by each of the attributes
* Choose the partition w/ lowest partition entropy
* Add a decision node based on the chosen attribute
* Recur on each partitioned subset using remaining attributes

This = a **“greedy” algorithm** b/c, at each step, it chooses the **most immediately best option**. Given a data set, there may be a better tree w/ a worse-looking first move. If so, this algorithm won’t find it. Nonetheless, it is relatively easy to understand + implement, which makes it a good place to begin exploring DTs.

Interviewee data set has both `True` + `False` labels w/ 4 attributes to split on. 1st step = find the partition w/ least entropy via a function that does the partitioning:

In [5]:
## 1st step = find partition w/ least entropy
from collections import defaultdict

# do the partitioning
def partition_by(inputs,attribute):
    """Each input = pair (attribute_dict, label)
    This returns a dict: attribute_value -> inputs"""
    groups = defaultdict(list)
    for input in inputs:
        key=input[0][attribute] # get value of specified attribute
        groups[key].append(input)
    return groups

# compute entropy
def partition_entropy_by(inputs, attribute):
    """Computes entropy corresponding to the partition on the given attribute"""
    partitions = partition_by(inputs, attribute)
    return partition_entropy(partitions.values())

Then find the minimum-entropy partition for the whole data set:

In [6]:
for key in ['level','lang','tweets','phd']:
    print(key, partition_entropy_by(inputs,key))

level 0.6935361388961919
lang 0.8601317128547441
tweets 0.7884504573082896
phd 0.8921589282623617


Lowest entropy = splitting on `level` ==> make a subtree for each possible `level` value. Every `Mid` candidate = labeled `True`, which means that `Mid` subtree = leaf node predicting `True`. For `Senior` candidates, we have a mix of `True`s and `False`s, so split again:

In [7]:
senior_inputs = [(input,label)
                for input,label in inputs if input["level"] == "Senior"]

for key in ['lang','tweets','phd']:
    print(key,partition_entropy_by(senior_inputs,key))

lang 0.4
tweets 0.0
phd 0.9509775004326938


Next split = on `tweets` = a 0-entropy partition. For these Senior-level candidates, “yes” tweets always = `True` while “no” tweets always = `False`.

Finally, if we do the same thing for the `Junior` candidates, we end up splitting on `phd`, after which we find that no PhD always results in `True` and PhD always = `False`.
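
The `Junior` step isn't shown as a cell, so here is a self-contained sketch that repeats the calculation on the 5 `Junior` rows copied from the data set. The helper functions are compact restatements of the ones defined earlier.

```python
import math
from collections import Counter, defaultdict

def entropy(class_probs):
    return sum(-p * math.log(p, 2) for p in class_probs if p)

def data_entropy(labeled_data):
    counts = Counter(label for _, label in labeled_data)
    return entropy([c / len(labeled_data) for c in counts.values()])

def partition_entropy_by(inputs, attribute):
    groups = defaultdict(list)
    for item in inputs:
        groups[item[0][attribute]].append(item)
    return sum(data_entropy(g) * len(g) / len(inputs) for g in groups.values())

# the Junior rows from the interviewee data
junior_inputs = [
    ({'level': 'Junior', 'lang': 'Python', 'tweets': 'no', 'phd': 'no'}, True),
    ({'level': 'Junior', 'lang': 'R', 'tweets': 'yes', 'phd': 'no'}, True),
    ({'level': 'Junior', 'lang': 'R', 'tweets': 'yes', 'phd': 'yes'}, False),
    ({'level': 'Junior', 'lang': 'Python', 'tweets': 'yes', 'phd': 'no'}, True),
    ({'level': 'Junior', 'lang': 'Python', 'tweets': 'no', 'phd': 'yes'}, False),
]

for key in ['lang', 'tweets', 'phd']:
    print(key, partition_entropy_by(junior_inputs, key))
```

`phd` comes out at zero entropy, while `lang` and `tweets` both land around 0.95, so `phd` is the split ID3 picks.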

### Putting It All Together

Now that we’ve seen how the algorithm works, implement it more generally, which means we need to decide how we want to represent trees. We’ll use pretty much the most lightweight representation possible + define a tree to be one of the following:
* `True`
* `False`
* a tuple `(attribute, subtree_dict)`

Here `True` = leaf node that returns `True` for any input, `False` = leaf node that returns `False` for any input, + a `tuple` represents a decision node that, for any input, finds its attribute value + classifies the input using the corresponding subtree. W/ this representation, our hiring tree would look like:

In [8]:
('level',
 # if their level = junior, check phd in a new subtree
 {'Junior': ('phd',
             {'no': True, 'yes': False}),
  # if their level = mid, predict True
  'Mid': True,
  # if their level = senior, check tweets in a new subtree
  'Senior': ('tweets',
             {'no': False, 'yes': True})})

('level',
 {'Junior': ('phd', {'no': True, 'yes': False}),
  'Mid': True,
  'Senior': ('tweets', {'no': False, 'yes': True})})

There’s still the question of what to do if we encounter an *unexpected* (or missing) attribute value, e.g., a candidate whose `level` = “Intern”. We’ll handle this case by adding a `None` key that just predicts the most common label (a bad idea if `None` is actually a value that appears in the data). Given such a representation, we can classify an input with:

In [9]:
def classify_dt(tree,input):
    """Classify an input using the given decision tree"""
    # if on a leaf node, return its value (i.e. level = Mid)
    if tree in [True,False]:
        return tree
    
    # otherwise, tree consists of an attribute to split on and a 
    #  dict whose keys = values of that attribute and 
    #  values = subtrees to consider next
    attribute, subtree_dict = tree
    
    subtree_key = input.get(attribute) # returns None if input is missing attribute
    
    # if no subtree for key, use the None subtree
    if subtree_key not in subtree_dict:
        subtree_key = None
        
    # choose appropriate subtree + use it to classify input
    subtree = subtree_dict[subtree_key]
    return classify_dt(subtree,input)
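
A usage sketch of `classify_dt` on a hand-written version of the hiring tree. The `None` entries are added by hand here, each set to the majority label of the matching subset of the interviewee data.

```python
def classify_dt(tree, input):
    """Classify an input using the given decision tree"""
    if tree in [True, False]:          # leaf node
        return tree
    attribute, subtree_dict = tree
    subtree_key = input.get(attribute)  # None if the attribute is missing
    if subtree_key not in subtree_dict:
        subtree_key = None              # unexpected value: fall back to default
    return classify_dt(subtree_dict[subtree_key], input)

# hand-written tree w/ None defaults
hiring_tree = ('level',
               {'Junior': ('phd', {'no': True, 'yes': False, None: True}),
                'Mid': True,
                'Senior': ('tweets', {'no': False, 'yes': True, None: False}),
                None: True})

print(classify_dt(hiring_tree, {'level': 'Mid'}))                     # leaf
print(classify_dt(hiring_tree, {'level': 'Senior', 'tweets': 'no'}))  # 2 hops
print(classify_dt(hiring_tree, {'level': 'Intern'}))                  # unexpected value
print(classify_dt(hiring_tree, {'level': 'Junior'}))                  # missing phd
```

Without the `None` keys, the last two calls would raise a `KeyError` instead of falling back to a default prediction.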

Now build the tree representation from our training data:

In [10]:
from functools import partial

def build_tree_id3(inputs,split_candidates=None):
    ## If on 1st pass, all keys of 1st input = split candidates
    if split_candidates is None:
        split_candidates = inputs[0][0].keys() # ['level','lang','tweets','phd']
    
    ## count Trues + Falses for inputs
    num_inputs = len(inputs)
    # item = dictionary, label = single boolean label
    num_trues = len([label for item, label in inputs if label])
    num_false = num_inputs - num_trues
    
    ## if we have no trues/falses, return a false/true leaf
    if num_trues == 0:
        return False
    if num_false == 0:
        return True
    
    ## if we have no split candidates left, return majority leaf
    if not split_candidates:
        return num_trues >= num_false
    
    ## if still have split candidates, split on 'best' attribute
    best_attribute = min(split_candidates,
                        key = partial(partition_entropy_by,inputs))
    partitions = partition_by(inputs,best_attribute)
    new_candidates = [a for a in split_candidates if a != best_attribute]
    
    ## recursively build subtrees
    subtrees = {attribute_value: build_tree_id3(subset,new_candidates)
               for attribute_value, subset in partitions.items()}
    subtrees[None] = num_trues >= num_false # default case
    
    return (best_attribute,subtrees)

In this tree, every leaf consisted entirely of `True` inputs or entirely of `False` inputs = means the tree predicts perfectly on training, but we can also apply it to new data that wasn’t in training and data w/ missing or unexpected values:

In [11]:
tree = build_tree_id3(inputs)

print(classify_dt(tree,
            { "level" : "Junior",
             "lang" : "Java",
             "tweets" : "yes",
             "phd" : "no"}))
print(classify_dt(tree,
            { "level" : "Junior",
             "lang" : "Java",
             "tweets" : "yes",
             "phd" : "yes"}))
# new/unexpected values
print(classify_dt(tree,
            {"level" : "Intern"}))
print(classify_dt(tree,
            {"level" : "Senior"}))

True
False
True
False


***NOTE***: Since our goal was mainly to demonstrate how to build a tree, we built it using the entire data set. As always, if we were really trying to create a good model for something, we would've (collected more data +) split the data into train/validation/test subsets.
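
One way such a split might look, as a rough sketch: `split_data`, the 60/20/20 proportions, and the fixed seed are all assumptions for illustration, not part of the notes above.

```python
import random

def split_data(data, prob, seed=0):
    """Shuffle a copy of data + split it into fractions [prob, 1 - prob]"""
    rng = random.Random(seed)   # fixed seed so the split is repeatable
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * prob)
    return shuffled[:cut], shuffled[cut:]

data = list(range(14))          # stand-in for the 14 labeled candidates

train, rest = split_data(data, 0.6)
validation, test = split_data(rest, 0.5)

print(len(train), len(validation), len(test))   # 8 3 3
```

W/ only 14 rows the validation + test sets are tiny, which is one reason the note above mentions collecting more data first.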

### Random Forests

Given how closely DTs can fit themselves to training data, it’s not surprising they have a tendency to overfit. 1 way of avoiding this = **random forests** = build multiple DTs + let them "vote" on how to classify inputs:

In [12]:
def rf_classify(trees,input):
    votes = [classify_dt(tree,input) for tree in trees]
    
    vote_counts = Counter(votes)
    # most_common(1) gives [(vote, count)]; return just the winning vote
    return vote_counts.most_common(1)[0][0]
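
A small voting sketch. The three "trees" below are hand-written stand-ins (two bare leaves + one single-split tree), not trees built from the data.

```python
from collections import Counter

def classify_dt(tree, input):
    if tree in [True, False]:
        return tree
    attribute, subtree_dict = tree
    subtree_key = input.get(attribute)
    if subtree_key not in subtree_dict:
        subtree_key = None
    return classify_dt(subtree_dict[subtree_key], input)

def rf_classify(trees, input):
    votes = [classify_dt(tree, input) for tree in trees]
    vote_counts = Counter(votes)
    return vote_counts.most_common(1)[0][0]

# hypothetical forest: always-True, always-False, + a tree that splits on level
trees = [True, False, ('level', {'Senior': False, None: True})]

print(rf_classify(trees, {'level': 'Senior'}))  # votes: True, False, False
print(rf_classify(trees, {'level': 'Mid'}))     # votes: True, False, True
```

Using an odd number of trees avoids ties when the labels are binary.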

Tree-building process was **deterministic**, so how do we get **random** trees?
* 1) **bootstrapping** data = Rather than training each tree on all inputs in training, train each tree on a result of `bootstrap_sample(inputs)`. 
    * Since each tree = built using different data, each tree will be different from every other tree
    * Side benefit = it’s totally fair to use the nonsampled data to test each tree, which means you can get away w/ using all data as training if you're clever in how you measure performance
    * This technique = **bootstrap aggregating** or **bagging**.
* 2) Changing the way we chose `best_attribute` to split on.
    * Rather than looking @ all remaining attributes, 1st choose a random subset of them + then split on whichever of those is best:

In [None]:
import random

# NOTE: fragment, not a standalone cell; it replaces the best_attribute
# step in build_tree_id3 + assumes an enclosing class w/ a
# num_split_candidates attribute

# if already have few-enough split candidates, look @ all of them
if len(split_candidates) <= self.num_split_candidates:
    sampled_split_candidates = split_candidates
# otherwise pick a random sample (random.sample needs a sequence)
else:
    sampled_split_candidates = random.sample(list(split_candidates),
                                             self.num_split_candidates)

# now choose best attribute only from those candidates
best_attribute = min(sampled_split_candidates,
                     key=partial(partition_entropy_by, inputs))
partitions = partition_by(inputs, best_attribute)
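
Both sources of randomness can be combined into a runnable sketch. `bootstrap_sample` is given an assumed definition here (sampling w/ replacement), and `build_random_tree` + `build_forest` are hypothetical names; `num_split_candidates` is passed as a plain argument rather than via `self`.

```python
import math
import random
from collections import Counter, defaultdict

def data_entropy(labeled_data):
    counts = Counter(label for _, label in labeled_data)
    probs = [c / len(labeled_data) for c in counts.values()]
    return sum(-p * math.log(p, 2) for p in probs if p)

def partition_by(inputs, attribute):
    groups = defaultdict(list)
    for item in inputs:
        groups[item[0][attribute]].append(item)
    return groups

def partition_entropy_by(inputs, attribute):
    subsets = partition_by(inputs, attribute).values()
    return sum(data_entropy(s) * len(s) / len(inputs) for s in subsets)

def bootstrap_sample(data, rng):
    """Assumed definition: draw len(data) items w/ replacement"""
    return [rng.choice(data) for _ in data]

def build_random_tree(inputs, split_candidates, num_split_candidates, rng):
    num_trues = sum(1 for _, label in inputs if label)
    num_falses = len(inputs) - num_trues
    if num_trues == 0:
        return False
    if num_falses == 0:
        return True
    if not split_candidates:
        return num_trues >= num_falses
    # only consider a random subset of the remaining attributes
    if len(split_candidates) <= num_split_candidates:
        sampled = split_candidates
    else:
        sampled = rng.sample(split_candidates, num_split_candidates)
    best = min(sampled, key=lambda a: partition_entropy_by(inputs, a))
    remaining = [a for a in split_candidates if a != best]
    subtrees = {value: build_random_tree(subset, remaining,
                                         num_split_candidates, rng)
                for value, subset in partition_by(inputs, best).items()}
    subtrees[None] = num_trues >= num_falses   # default case
    return (best, subtrees)

def build_forest(inputs, num_trees, num_split_candidates, seed=0):
    rng = random.Random(seed)                  # seeded for repeatability
    attributes = list(inputs[0][0].keys())
    return [build_random_tree(bootstrap_sample(inputs, rng), attributes,
                              num_split_candidates, rng)
            for _ in range(num_trees)]

# a few rows copied from the interviewee data
sample = [
    ({'level': 'Senior', 'lang': 'Java', 'tweets': 'no', 'phd': 'no'}, False),
    ({'level': 'Mid', 'lang': 'Python', 'tweets': 'no', 'phd': 'no'}, True),
    ({'level': 'Junior', 'lang': 'Python', 'tweets': 'no', 'phd': 'no'}, True),
    ({'level': 'Junior', 'lang': 'R', 'tweets': 'yes', 'phd': 'yes'}, False),
    ({'level': 'Senior', 'lang': 'R', 'tweets': 'yes', 'phd': 'no'}, True),
    ({'level': 'Junior', 'lang': 'Python', 'tweets': 'no', 'phd': 'yes'}, False),
]

forest = build_forest(sample, num_trees=5, num_split_candidates=2)
print(len(forest))
```

The resulting list of trees can be passed directly to a voting function such as `rf_classify`.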

This = an example of a broader technique called **ensemble learning** = combine several weak learners (typically high-bias, low-variance models) in order to produce an overall strong model.

Random forests = 1 of the most popular and versatile models around.

### For Further Exploration
* scikit-learn has many DT models + an ensemble module that includes a RandomForestClassifier as well as other ensemble methods.