# Predictive Modeling

Predictive modeling is one of the main topics of data mining and can range from correlation to supervised segmentation.

## Supervised Segmentation

In [364]:
import python_libraries
import plotly
import plotly.tools as tls
from IPython.display import display
import cufflinks as cf
import pandas as pd
import numpy as np
import math
import plotly.plotly as py
from plotly.graph_objs import *

# Calculates the entropy of the given data set for the target attribute.
def entropy(data, target_attr):
 
    val_freq = {}
    data_entropy = 0.0
 
    # Calculate the frequency of each of the values in the target attr
    for record in data:
        
        if (record[target_attr] in val_freq):
            val_freq[record[target_attr]] += 1.0
        else:
            val_freq[record[target_attr]]  = 1.0
 
    # Calculate the entropy of the data for the target attribute
    for freq in val_freq.values():
        data_entropy += (-freq/len(data)) * math.log(freq/len(data), 2) 
 
    return round(data_entropy,2)


def calc_entropy(criteria,target):    
    print ("Entropy (Randomness): " +  str(entropy(movie_history[criteria].to_dict('records'),target)))
    display (movie_history[criteria])
    return entropy(movie_history[criteria].to_dict('records'),target)


py.sign_in('cloaked', 'wiwoxbjrvr')
#plotly.offline.init_notebook_mode() # run at the start of every notebook
movie_history = pd.read_csv('https://docs.google.com/spreadsheets/d/1fJ0DKxFkMr2XjzyspZFLv6yOlHJvl639Qpap0O4U78E/pub?gid=0&single=true&output=csv')
movie_history.head(len(movie_history))

Unnamed: 0,Title,Year,Genre,Length,Rating,Watched
0,Raiders of the Lost Ark,1981,"Action, Adventure",1h 55min,8.5,Yes
1,The Lord of the Rings 3,2003,"Adventure, Drama",3h 21min,8.9,Yes
2,Fight Club,1999,Drama,2h 19min,8.9,Yes
3,Braveheart,1995,"Biography, Drama",2h 58min,8.4,Yes
4,Inception,2010,"Action, Mystery, Sci-Fi",2h 28 min,8.8,No
5,The Godfather,1972,"Crime, Drama",2h 55min,9.2,No
6,Aliens,1986,"Action, Horror, Sci-Fi",2h 17min,8.4,No
7,Once upon a time in America,1984,"Crime, Drama",3h 49min,8.4,No


(SS.10) The first concept within predictive modeling is supervised segmentation. Consider a movie subscription service, such as Netflix, whose database records whether a person watched each movie. Based on this history, we would like to predict which movies, out of a list of unwatched movies, the person is likely to watch in the future.

(SS.20) The target variable that we are interested in from the history is called *Watched*, a Yes or No value that represents whether the person watched a movie. Finding a rule, or in other words a way to ***segment*** the history dataset so that the resulting groups mostly contain *Watched = Yes*, is an example of supervised segmentation.

(SS.30) Additionally, in this case we are talking about predicting a Yes or No value for the target variable *Watched*, which means we are doing a ***classification***. If we were predicting a numeric value such as the likelihood of watching a movie, instead of a Yes or No value, then we would be doing a ***regression***. In other words, a regression is to predict a numeric value for a target variable of interest.

(SS.40) The first part of supervised segmentation is to select the important informative attributes. The attributes we have in this dataset are *Title*, *Year*, *Genre*, *Length* and *Rating*. A single row of attribute values represents what is called a feature vector. As an example, a feature vector here is: [ Title: The Godfather, Year: 1972, Genre: Crime/Drama, Length: 2h 55min, Rating: 9.2 ]. Each value in the feature vector is called a feature value. The attribute *Watched* is the target variable.

(SS.50) So what is an informative attribute? An informative attribute is one that reduces uncertainty about the target variable. Let's assume for a second that we perfectly understand the person's preferences: the person prefers the drama genre over others, but the length of the movie doesn't matter to them. In this case the *Genre* attribute is more informative, because it reduces the uncertainty in predicting whether the person will watch a movie. The *Length* attribute is not informative because, in this case, it doesn't contribute to reducing that uncertainty.
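The feature vector and target variable from (SS.40) can be written out as plain Python data. This is only a minimal sketch: the dictionary layout and the names `row`, `TARGET`, `feature_vector` and `target_value` are assumptions made for illustration, not part of the notebook's code.

```python
# One row of the movie history, written out as a plain dictionary.
row = {
    "Title": "The Godfather",
    "Year": 1972,
    "Genre": "Crime, Drama",
    "Length": "2h 55min",
    "Rating": 9.2,
    "Watched": "No",
}

# Name of the target variable (the column we want to predict).
TARGET = "Watched"

# The feature vector is every attribute except the target variable.
feature_vector = {name: value for name, value in row.items() if name != TARGET}
target_value = row[TARGET]

print(feature_vector)
print(target_value)
```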

## Models, induction and deduction

(MID.10) We need a systematic mechanism, or *model*, in order to figure out which attributes are informative. The process of creating models from data is called **induction**, and the procedure that creates a model from the data is called a *learner* or an *induction algorithm*. Induction essentially refers to generalizing from specific cases to general rules. We use induction to create models from training data, which in this case is the movie dataset, and then use *deduction* to apply the model and predict target values for other feature vectors, which in this case come from the list of unwatched movies.
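To make the split between induction and deduction concrete, here is a deliberately tiny sketch. The learner `induce`, the helper `majority`, and the two unwatched genre strings are hypothetical and chosen only for illustration; the training pairs are copied from the history table above.

```python
from collections import Counter

# Training data: (Genre, Watched) pairs copied from the movie history.
training = [
    ("Action, Adventure", "Yes"),
    ("Adventure, Drama", "Yes"),
    ("Drama", "Yes"),
    ("Biography, Drama", "Yes"),
    ("Action, Mystery, Sci-Fi", "No"),
    ("Crime, Drama", "No"),
    ("Action, Horror, Sci-Fi", "No"),
    ("Crime, Drama", "No"),
]


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def induce(training, keyword):
    """Induction: generalize from the training rows to a one-keyword rule."""
    with_kw = [watched for genre, watched in training if keyword in genre]
    without_kw = [watched for genre, watched in training if keyword not in genre]
    label_with, label_without = majority(with_kw), majority(without_kw)

    def model(genre):
        # Deduction: apply the learned rule to a new feature value.
        return label_with if keyword in genre else label_without

    return model


model = induce(training, "Crime")

# Hypothetical unwatched movies, described only by their genre.
print(model("Crime, Thriller"))
print(model("Comedy"))
```

The learner here looks at a single keyword, so it is far weaker than a real induction algorithm; it only shows where the model comes from and how it is then applied.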

### Full list with no attributes

In [358]:
no_filter = calc_entropy(criteria=movie_history.index >= 0,target='Watched')

Entropy (Randomness): 1.0


Unnamed: 0,Title,Year,Genre,Length,Rating,Watched
0,Raiders of the Lost Ark,1981,"Action, Adventure",1h 55min,8.5,Yes
1,The Lord of the Rings 3,2003,"Adventure, Drama",3h 21min,8.9,Yes
2,Fight Club,1999,Drama,2h 19min,8.9,Yes
3,Braveheart,1995,"Biography, Drama",2h 58min,8.4,Yes
4,Inception,2010,"Action, Mystery, Sci-Fi",2h 28 min,8.8,No
5,The Godfather,1972,"Crime, Drama",2h 55min,9.2,No
6,Aliens,1986,"Action, Horror, Sci-Fi",2h 17min,8.4,No
7,Once upon a time in America,1984,"Crime, Drama",3h 49min,8.4,No


(MID.20) Now let's try to create a model to figure out which attributes are more informative than others. Let's start with *Genre*. When we filter the history by *Genre = Drama* we get 3 watched movies and 2 unwatched movies, which is a highly random result. If we had gotten 5 watched movies or 5 unwatched movies, then the result would have been non-random, or **pure**.

In [359]:
genre_filter = calc_entropy(criteria=movie_history.Genre.str.contains('Drama'),target='Watched')

Entropy (Randomness): 0.97


Unnamed: 0,Title,Year,Genre,Length,Rating,Watched
1,The Lord of the Rings 3,2003,"Adventure, Drama",3h 21min,8.9,Yes
2,Fight Club,1999,Drama,2h 19min,8.9,Yes
3,Braveheart,1995,"Biography, Drama",2h 58min,8.4,Yes
5,The Godfather,1972,"Crime, Drama",2h 55min,9.2,No
7,Once upon a time in America,1984,"Crime, Drama",3h 49min,8.4,No


(MID.30) To quantify the randomness of the result we use a concept called entropy. In this example the value of entropy is 0.97. Entropy is defined by the following formula, where the logarithm is taken to base 2:
$$entropy = - p_1 \log_2 (p_1) - p_2 \log_2 (p_2) - \cdots$$

(MID.40) Where $p_1$, in this example, is the probability of the target variable *Watched = Yes* and $p_2$ is the probability of the target variable *Watched = No* in the result set. Looking at this formula, if all resulting values had *Watched = Yes*, then $p_1 = 1$ and $p_2 = 0$. Therefore $\log(p_1) = 0$, and the $p_2$ term is taken to be 0 by convention, which makes the entropy equal to 0. Hence an entropy of 0 denotes the least randomness, while an entropy of 1 (for a two-valued target) means the most randomness. Let's see a couple more examples.

In [360]:
rating_filter = calc_entropy(criteria=movie_history.Rating > 8.8,target='Watched')

Entropy (Randomness): 0.92


Unnamed: 0,Title,Year,Genre,Length,Rating,Watched
1,The Lord of the Rings 3,2003,"Adventure, Drama",3h 21min,8.9,Yes
2,Fight Club,1999,Drama,2h 19min,8.9,Yes
5,The Godfather,1972,"Crime, Drama",2h 55min,9.2,No


(MID.40) Here we pick the attribute *Rating* with the condition *Rating > 8.8* and we get a result of 2 watched movies and 1 unwatched movie, and an entropy of 0.92. That is less than what we saw in the previous example, but it is still quite a random segmentation. Let's see another example.

In [363]:
genre_and_year_filter = calc_entropy(criteria=movie_history.Genre.str.contains('Crime'),target='Watched')

Entropy (Randomness): 0.0


Unnamed: 0,Title,Year,Genre,Length,Rating,Watched
5,The Godfather,1972,"Crime, Drama",2h 55min,9.2,No
7,Once upon a time in America,1984,"Crime, Drama",3h 49min,8.4,No


(MID.50) Here we pick the attribute *Genre* with the condition *Genre = Crime* and we get a result of 0 watched movies and 2 unwatched movies, and an entropy of 0, which is the least random segmentation we have seen so far.
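The entropy values reported for the segments so far can be reproduced from the Yes/No counts alone. The helper below is a minimal sketch that assumes base-2 logarithms, as in the notebook's `entropy` function; the name `entropy_from_counts` is hypothetical and not part of the original code.

```python
import math


def entropy_from_counts(counts):
    """Entropy (base 2) of a segment, given how many records fall in each class."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count == 0:
            continue  # a class with probability 0 contributes nothing
        p = count / total
        result -= p * math.log2(p)
    return result


# [watched, unwatched] counts taken from the filtered tables above.
print(round(entropy_from_counts([4, 4]), 2))  # full list           -> 1.0
print(round(entropy_from_counts([3, 2]), 2))  # Genre contains Drama -> 0.97
print(round(entropy_from_counts([2, 1]), 2))  # Rating > 8.8         -> 0.92
print(round(entropy_from_counts([0, 2]), 2))  # Genre contains Crime -> 0.0
```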

In [370]:
# Calculates the information gain (reduction in entropy) 
#that would result by splitting the data on the chosen attribute (attr).
def gain(data, attr, target_attr):
 
    val_freq = {}
    subset_entropy = 0.0
 
    # Calculate the frequency of each of the values of the chosen attribute (attr)
    for record in data:
        if (record[attr] in val_freq):
            val_freq[record[attr]] += 1.0
        else:
            val_freq[record[attr]]  = 1.0
 
    # Calculate the sum of the entropy for each subset of records, weighted by
    # their probability of occurring in the training set.
    for val in val_freq.keys():
        val_prob = val_freq[val] / sum(val_freq.values())
        data_subset = [record for record in data if record[attr] == val]
        subset_entropy += val_prob * entropy(data_subset, target_attr)
        print (subset_entropy)
 
    # Subtract the entropy of the chosen attribute from the entropy of the whole
    #data set with respect to the target attribute (and return it)
    return round((entropy(data, target_attr) - subset_entropy),2)
 

def calc_information_gain_entropy(criteria,attr,attr_name,target):    
    print ("Information gain with attribute " + attr_name + " is : " + str(gain(movie_history[criteria].to_dict('records'),attr, target)))    
    return gain(movie_history[criteria].to_dict('records'),attr, target)

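# Note: the second assignment overwrites the first, so the gain below is computed
# on the Crime subset even though the label passed in says 'Genre = Drama'.
criteria=movie_history.Genre.str.contains('Drama')
criteria=movie_history.Genre.str.contains('Crime')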
calc_information_gain_entropy(criteria,'Genre','Genre = Drama','Watched')


0.0
Information gain with attribute Genre = Drama is : 0.0
0.0


0.0

(MID.60) With these examples we have seen that measuring the entropy helps us determine the degree of randomness of the result. In order to measure how much an attribute improves (decreases) entropy over the whole segmentation it creates, we use a concept called information gain. Information gain measures the change in entropy due to any amount of new information being added. 
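As a rough illustration of information gain, the sketch below computes it for a single yes/no split of the whole history, using only class counts. The function names and the choice of a two-way split on "Genre contains Crime" are assumptions made for this example; the notebook's own `gain` function instead splits on every distinct value of an attribute.

```python
import math


def entropy(counts):
    """Entropy (base 2) of a segment, given the record count in each class."""
    total = sum(counts)
    result = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            result -= p * math.log2(p)
    return result


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted entropy of the child segments."""
    parent_total = sum(parent_counts)
    weighted = sum(
        (sum(child) / parent_total) * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted


# Whole history: 4 watched, 4 unwatched.
# Split on "Genre contains Crime": Crime -> [0, 2], everything else -> [4, 2].
crime_gain = information_gain([4, 4], [[0, 2], [4, 2]])
print(round(crime_gain, 2))  # 0.31

# Split on "Genre contains Drama": Drama -> [3, 2], everything else -> [1, 2].
drama_gain = information_gain([4, 4], [[3, 2], [1, 2]])
print(round(drama_gain, 2))  # 0.05
```

Under these assumptions the Crime split removes much more uncertainty than the Drama split, which matches the intuition from the filtered tables.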

In [362]:
py.iplot({
"data": [
    Scatter(x=[1, 2, 3, 4], y=[no_filter, genre_filter, rating_filter, genre_and_year_filter])
],
"layout": Layout(
    title="Entropy"
)
})

(MID.20) Now let's try to create a model to figure out which attributes are more informative than others. Let's start with *Genre*. When we filter the history by *Genre = Drama* we get 3 watched movies and 2 unwatched movies. So there is close to a 50-50 chance of getting the expected result when we pick *Genre = Drama*, which is a highly random prediction. Let's try another attribute, *Rating*. When we filter the history by *Rating > 8.8* we get 2 watched movies and 1 unwatched movie, which is a less random result. Let's try a combination of *Genre = Drama AND Year > 1995*. In this case we get 2 watched movies and no unwatched movies, which is the least random result of the attributes that we tested. So, by measuring the randomness of the target values within each filtered subset, we can determine how informative an attribute is, based on how much it reduces that randomness.

(MID.30) We can measure randomness using the concept of entropy. Entropy is defined as the following function of the class probabilities:
$$entropy = - p_1 \log_2 (p_1) - p_2 \log_2 (p_2) - \cdots$$
Where $p_1$, in this example, is the probability of the target variable *Watched = Yes* and $p_2$ is the probability of the target variable *Watched = No* in the set of target values produced across the dataset. In the case where we used the condition *Genre = Drama AND Year > 1995*, the result was 2 watched movies and 0 unwatched movies. This means $p_1 = 1$ and $p_2 = 0$; hence $entropy = -1 \cdot \log_2 (1) = 0$, with the $p_2$ term taken as 0 by convention.




(MID.30) For a Yes/No target, an entropy of 1 denotes the most randomness, while an entropy of 0 means the least. So this formula can help us determine how much an attribute decreases the entropy of the segmentation it creates. This decrease is called *information gain*, which measures the change in entropy. Hence, we want attributes that yield the highest information gain when we are trying to predict a target value.
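Written as a formula, under the usual definition, information gain is the entropy of the full set minus the size-weighted entropy of the segments an attribute creates. The symbols $S$ (the full set of records) and $S_i$ (the $i$-th segment) are notation introduced here for illustration:

$$information\ gain = entropy(S) - \sum_i \frac{|S_i|}{|S|} \, entropy(S_i)$$

For instance, assuming the history is split into movies whose genre contains Crime (0 watched, 2 unwatched) and all others (4 watched, 2 unwatched), this gives roughly $1 - \left(\frac{2}{8} \cdot 0 + \frac{6}{8} \cdot 0.92\right) \approx 0.31$.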

## Decision Trees

Decision trees are an important concept due to their ability to learn to classify individual records in a dataset. They are used in many areas, from typical marketing scenarios such as targeting to airplane autopilots and medical diagnoses. 

A decision tree is essentially a series of if-then statements which, when applied to a record in a dataset, results in the classification of that record. So in our example, the program that you create from a decision tree will be able to predict, from the history of movies, whether the person is likely to watch a given movie.
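As a sketch of what "a series of if-then statements" looks like in code, the function below hard-codes a small tree for the movie example. The rules are hand-written to be consistent with the eight movies in the history table; they are an illustrative assumption, not the output of a tree-learning algorithm, and `predict_watched` is a hypothetical name.

```python
def predict_watched(movie):
    """Classify a movie with hand-written if-then rules (illustrative only)."""
    genres = [g.strip() for g in movie["Genre"].split(",")]
    if "Crime" in genres:
        return "No"
    if "Sci-Fi" in genres:
        return "No"
    return "Yes"


print(predict_watched({"Title": "Braveheart", "Genre": "Biography, Drama"}))
print(predict_watched({"Title": "Aliens", "Genre": "Action, Horror, Sci-Fi"}))
print(predict_watched({"Title": "The Godfather", "Genre": "Crime, Drama"}))
```

Each `if` corresponds to one internal node of the tree, and each `return` to a leaf holding the predicted class.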

At every step of the decision tree, the next most informative attribute is selected to divide the dataset according to some predefined criteria. One of the most popular approaches uses the concept of entropy to calculate which attribute is best to use for dividing the data into subgroups.
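The selection step described above can be sketched as follows: score each candidate condition by its information gain and keep the best one. This is a simplified illustration, assuming only yes/no conditions and only the first split; the `history` list copies the relevant columns of the table, and the candidate conditions and function names are assumptions chosen for the example.

```python
import math
from collections import Counter

# Relevant columns of the movie history, copied from the table above.
history = [
    {"Genre": "Action, Adventure", "Rating": 8.5, "Watched": "Yes"},
    {"Genre": "Adventure, Drama", "Rating": 8.9, "Watched": "Yes"},
    {"Genre": "Drama", "Rating": 8.9, "Watched": "Yes"},
    {"Genre": "Biography, Drama", "Rating": 8.4, "Watched": "Yes"},
    {"Genre": "Action, Mystery, Sci-Fi", "Rating": 8.8, "Watched": "No"},
    {"Genre": "Crime, Drama", "Rating": 9.2, "Watched": "No"},
    {"Genre": "Action, Horror, Sci-Fi", "Rating": 8.4, "Watched": "No"},
    {"Genre": "Crime, Drama", "Rating": 8.4, "Watched": "No"},
]


def entropy(records):
    """Entropy (base 2) of the Watched column for a non-empty list of records."""
    counts = Counter(record["Watched"] for record in records)
    total = len(records)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def information_gain(records, condition):
    """Entropy reduction from splitting records into matching / non-matching."""
    matching = [r for r in records if condition(r)]
    rest = [r for r in records if not condition(r)]
    weighted = sum(
        len(part) / len(records) * entropy(part) for part in (matching, rest) if part
    )
    return entropy(records) - weighted


# Hypothetical candidate conditions for the first split of the tree.
candidates = {
    "Genre contains Drama": lambda r: "Drama" in r["Genre"],
    "Genre contains Crime": lambda r: "Crime" in r["Genre"],
    "Rating > 8.8": lambda r: r["Rating"] > 8.8,
}

gains = {name: information_gain(history, cond) for name, cond in candidates.items()}
best = max(gains, key=gains.get)

for name, value in gains.items():
    print(f"{name}: {value:.2f}")
print("Best first split:", best)
```

A full tree learner would repeat this selection recursively on each resulting subgroup until the subgroups are pure or no useful split remains.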

Christopher Roach: Building Decision Trees in Python http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html