# MSDS688 -- Artificial Intelligence

## Week 7 - Machine Learning in AI

![Wolfgang von Kempelen designed a speaking machine](../images/kempelen-speaking-machine.png)

About a decade later, a Hungarian engineer named Wolfgang von Kempelen designed a speaking machine using an ivory glottis, bellows for lungs, a leather vocal tract with a hinged tongue, a rubber oral cavity and mouth, and a nose with two little pipes as nostrils. Its pronouncements were more whimsical than those of Mical’s talking heads: “my wife is my friend”, for example, and “come with me to Paris”.

Cite: Riskin, J. (n.d.). Frolicsome Engines: The Long Prehistory of Artificial Intelligence. Retrieved April 10, 2018, from [https://publicdomainreview.org/2016/05/04/frolicsome-engines-the-long-prehistory-of-artificial-intelligence/](https://publicdomainreview.org/2016/05/04/frolicsome-engines-the-long-prehistory-of-artificial-intelligence/)

# Review - Concepts and techniques

# Quiz / Exercise

# Lecture

_Note: Start with a promise_ 

## Learning Objectives

1. Explain how computational learning theory eliminates poorly performing models, leaving those that are probably approximately correct.

1. Construct a decision tree by recursively selecting the most important feature.

1. Compare common approaches to improving the performance of decision trees.

1. Evaluate model performance and how it can be improved through cross-validation and regularization.

1. Understand the trade-off between bias, variance and model complexity.


## Types of Learning

**Goal: To teach the agent a function that maps from an input image to one of a fixed set of label strings, such as "cat" or "dog"**

1. Unsupervised learning

1. Reinforcement learning

1. Supervised learning -- Our focus

### Unsupervised learning

In Unsupervised Learning the agent learns patterns in the input even though no explicit feedback is supplied. The most common type is clustering: detecting potentially useful clusters of input examples.

_Example: A taxi agent would develop a concept of good traffic days and bad traffic days without ever being given labeled examples._


### Reinforcement learning

In Reinforcement Learning the agent learns from a series of reinforcements—rewards or punishments.

_Example: Take a game-playing agent and reward it for good play and penalize it for bad play. Eventually, the agent will learn which actions are best in a particular circumstance._

### Supervised learning

In Supervised Learning the agent observes some example input-output pairs and learns a function that maps from input to output.

_Example: An agent must label an image as a cat or a dog. To teach this agent, we give it many input-output pairs such as {cat image-"cat"} and {dog image-"dog"}._

**Cite** "Learning.ipynb." Aima-Python, GitHub, May 2018, [github.com/aimacode/aima-python](https://github.com/aimacode/aima-python). Python implementation of algorithms from Russell and Norvig's "Artificial Intelligence - A Modern Approach".

## Supervised Learning

### Bias and Variance

* Bias
    + Informal: A systematic and consistent error in a model's results.
    + Formal: The amount by which the expected value of the results differs from the true value.

* Variance
    + Informal: Inaccurate predictions resulting from an overtrained model, one that is overly sensitive to the particular training data it saw.
    + Formal: The expected value of the squared deviation of the results from the mean of the results.
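
The two formal definitions can be computed directly from a set of results. The sketch below is only an illustration: the function names `bias` and `variance`, the bullseye at `0.0`, and the shot positions for each archer are all made up for this example.

```python
from statistics import fmean, pvariance


def bias(results, true_value):
    """Expected value of the results minus the true value."""
    return fmean(results) - true_value


def variance(results):
    """Expected squared deviation of the results from their own mean."""
    return pvariance(results)


# Hypothetical horizontal landing positions of four arrows per archer.
# The bullseye sits at 0.0; every number here is invented for illustration.
bullseye = 0.0
archers = {
    "A": [-0.1, 0.0, 0.1, 0.0],   # low bias, low variance
    "B": [-1.5, 4.5, -0.5, 3.5],  # medium bias, high variance
    "C": [2.9, 3.0, 3.1, 3.0],    # high bias, low variance
    "D": [-3.0, 3.0, -2.0, 2.0],  # low bias, high variance
}

for name, shots in archers.items():
    print(f"{name}: bias={bias(shots, bullseye):+.2f}, "
          f"variance={variance(shots):.3f}")
```

Archers B and D scatter their arrows equally widely, so their variance is the same; only the center of the scatter (the bias) differs.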

![Bias and variance concept](../images/bias-and-variance-concept.png)

![Bias and variance explanation](../images/bias-and-variance-explanation.png)

![Bias and variance questions](../images/bias-and-variance-questions.png)

![Bias equation](../images/bias-equation.png)

![Variance equation](../images/variance-equation.png)

![Bias and variance target](../images/bias-and-variance-target.png)

### Exercise 

Find a partner and draw archery results illustrating the following four scenarios:

* A) Low bias, low variance
* B) Medium bias, high variance
* C) High bias, low variance
* D) Low bias, high variance

Who’s the best archer?

![Figure 18.1 ](../images/Figure-S18-1-overfitting.png)

* Which of these models do you like best?  Why? 

* Overfitting results in poor predictive performance

* Apply Occam's razor and choose the simplest model that models the data well
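
A small experiment can make the overfitting point concrete. The sketch below is not the polynomial comparison from the figure. It swaps in a simpler stand-in: a straight-line fit versus a "memorizer" that returns the stored target of the nearest training point. The data, the alternating +1/-1 "noise", and all function names are assumptions chosen so the run is repeatable.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_memorizer(xs, ys):
    """Returns the stored y of the nearest training x (1-nearest-neighbour)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a model on a set of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical data: the true relationship is y = 2x. Training targets carry
# an alternating +1 / -1 disturbance; test targets are the noise-free truth.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(train_x)]
test_x = [x + 0.25 for x in train_x]
test_y = [2 * x for x in test_x]

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)
for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 3),
          "test MSE:", round(mse(model, test_x, test_y), 3))
```

The memorizer reproduces the training set perfectly, yet it does worse than the simple line on unseen points. That is the pattern Occam's razor guards against.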

## Decision Trees 

### A Visual Introduction
![Decision Trees](../images/Decision_Trees_web.png)

Cite: Albon, Chris. “Machine Learning Flashcards.” Machine Learning Flashcards, 2018, machinelearningflashcards.com.

### An Interactive Demo

Take a few minutes and work through these outstanding dynamic visualizations at: [www.r2d3.us/](http://www.r2d3.us/)

### Pros and Cons

Pros: Computationally cheap to use, easy for humans to understand learned results, missing values OK, can deal with irrelevant features

Cons: Prone to overfitting

Works with: Numeric values, nominal values

Cite: “3.1 Tree Construction.” Machine Learning in Action, by Peter Harrington, Manning Publications Co., 2012, pp. 39–60.

### Example: Picking a Restaurant

* What factors are important to you when you decide where to eat?

![Dining examples](../images/Figure-S18-3-dining-examples.png)

* Which feature should we use as a label?

### Example Dataset

* How would you go about deciding where to eat?

![Dining decision tree](../images/Figure-S18-2-dining-decision-tree.png)

### Feature Importance

* What feature should we split on?  Why?

![Dining best splits](../images/Figure-S18-4-dining-best-splits.png)

### Measuring Importance


$$
\large
\text{Entropy:} \;
H(V) = \sum_{k} P(v_k) \log_{2} \left( \frac{1}{P(v_k)} \right) = - \sum_{k} P(v_k) \log_{2} P(v_k)\\
\begin{align}
&V \; \text{random variable}\\
&v_k \; \text{kth possible value of } V\\
&P(v_k) \; \text{probability that } V = v_k
\end{align}
$$
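
As a quick check on the entropy formula, here is a minimal sketch in Python. The function names `entropy` and `entropy_of_labels` and the example distributions are chosen for illustration only.

```python
from collections import Counter
from math import log2


def entropy(probabilities):
    """H(V) = sum over k of P(v_k) * log2(1 / P(v_k)), measured in bits."""
    # Values with probability 0 are skipped: they contribute nothing.
    return sum(p * log2(1 / p) for p in probabilities if p > 0)


def entropy_of_labels(labels):
    """Entropy of the empirical label distribution found in a node."""
    total = len(labels)
    return entropy(count / total for count in Counter(labels).values())


fair_coin = entropy([0.5, 0.5])        # maximally uncertain two-way outcome
certain = entropy([1.0])               # no uncertainty at all
loaded_coin = entropy([0.99, 0.01])    # almost no uncertainty
mixed_node = entropy_of_labels(["yes"] * 6 + ["no"] * 6)

print(fair_coin, certain, round(loaded_coin, 3), mixed_node)
```

A fair coin and an evenly mixed node both carry one full bit of uncertainty, while a node whose examples all share one label has entropy zero.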

$$
\large
\text{Information Gain:} \; 
IG(S, C) = H(S) - \sum_{i} \frac{\aleph\left({C_i}\right)}{\aleph\left({S}\right)} H(C_i) \\
\begin{align}
&S \; \text{parent node}\\
&C_i \; \text{ith child node}\\
&IG(S, C) \; \text{information gain (reduction in entropy) from the split}\\
&H(S) \; \text{entropy of }S\\
&H(C_i) \; \text{entropy of } C_i\\
&\aleph\left({C_i}\right) \; \text{number of elements in } C_i\\
&\aleph\left({S}\right) \; \text{number of elements in } S
\end{align}
$$
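
Information gain is the parent's entropy minus the child entropies, each weighted by that child's share of the parent's elements. The sketch below computes it from per-class counts. The function names are illustrative, and the counts are assumed to match the textbook restaurant data (12 examples, 6 "will wait" and 6 "won't wait"), split once on Patrons and once on Type.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a node described by its per-class counts."""
    total = sum(counts)
    return sum((n / total) * log2(total / n) for n in counts if n > 0)


def information_gain(parent_counts, children_counts):
    """H(S) minus the size-weighted sum of the child entropies H(C_i)."""
    parent_size = sum(parent_counts)
    remainder = sum(
        (sum(child) / parent_size) * entropy(child)
        for child in children_counts
    )
    return entropy(parent_counts) - remainder


# Counts are (yes, no). These numbers are an assumption about the dataset.
parent = (6, 6)
patrons_children = [(0, 2), (4, 0), (2, 4)]        # None, Some, Full
type_children = [(1, 1), (1, 1), (2, 2), (2, 2)]   # French, Italian, Thai, Burger

gain_patrons = information_gain(parent, patrons_children)
gain_type = information_gain(parent, type_children)
print("Patrons:", round(gain_patrons, 3), "Type:", round(abs(gain_type), 3))
```

Under these assumed counts, splitting on Patrons gains about 0.541 bits, while splitting on Type gains nothing because every child is as mixed as the parent.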

### Algorithm

1. Iterate over available features

1. Select the feature with the greatest information gain with respect to the goal/target/label

1. Create a node representing that feature

1. Create an edge for all possible feature values

1. Remove feature from further consideration

1. Continue until a stopping criterion is met

    * All features have been consumed
    
    * Each node contains examples that have the same goal/target/label value
    
    * Information gain is zero or smaller than a cutoff threshold value
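
The steps above can be sketched as a short recursive function. This is a minimal illustration and not a reference implementation. The toy dataset, the feature names, the `min_gain` cutoff, and the choice to fall back to the majority label for unseen feature values are all assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def information_gain(rows, feature, target):
    """Parent entropy minus the size-weighted entropy of the children."""
    children = {}
    for row in rows:
        children.setdefault(row[feature], []).append(row[target])
    remainder = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return entropy([row[target] for row in rows]) - remainder


def build_tree(rows, features, target, min_gain=1e-9):
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: the node is pure, or all features are consumed.
    if len(set(labels)) == 1 or not features:
        return majority
    # Steps 1-2: score every remaining feature and keep the best one.
    best = max(features, key=lambda f: information_gain(rows, f, target))
    # Stopping criterion: the best split gains less than the cutoff.
    if information_gain(rows, best, target) < min_gain:
        return majority
    # Steps 3-5: one node per feature, one edge per observed value, and
    # the chosen feature is removed before recursing into each child.
    remaining = [f for f in features if f != best]
    node = {"feature": best, "default": majority, "children": {}}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        node["children"][value] = build_tree(subset, remaining, target, min_gain)
    return node


def predict(tree, row):
    """Follow edges until a leaf; unseen values fall back to the majority."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["feature"]], tree["default"])
    return tree


# Hypothetical toy data, loosely modeled on the restaurant example.
rows = [
    {"patrons": "none", "hungry": "no", "wait": "no"},
    {"patrons": "some", "hungry": "yes", "wait": "yes"},
    {"patrons": "some", "hungry": "no", "wait": "yes"},
    {"patrons": "full", "hungry": "yes", "wait": "yes"},
    {"patrons": "full", "hungry": "no", "wait": "no"},
    {"patrons": "none", "hungry": "yes", "wait": "no"},
]
tree = build_tree(rows, ["patrons", "hungry"], "wait")
print(tree["feature"], predict(tree, {"patrons": "full", "hungry": "yes"}))
```

On this toy data the root splits on `patrons`, and only the mixed `full` branch needs a second split on `hungry`.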

### Calculating the Best Split -- An Illustrated Example

![Calculating the best split 1](../images/decision-tree-splits-1.png)

![Calculating the best split 2](../images/decision-tree-splits-2.png)

![Calculating the best split 3](../images/decision-tree-splits-3.png)

![Calculating the best split 4](../images/decision-tree-splits-4.png)

![Calculating the best split 5](../images/decision-tree-splits-5.png)

![Calculating the best split 6](../images/decision-tree-splits-6.png)

![Calculating the best split 7](../images/decision-tree-splits-7.png)

![Calculating the best split 8](../images/decision-tree-splits-8.png)

# Break 


# Demonstration

# Exercise

_Note: End with humor_