# Predictive Modeling

Predictive modeling is one of the main topics of data mining, and its techniques range from correlation to supervised segmentation.

## Supervised Segmentation

In [263]:
import plotly
import plotly.tools as tls
from IPython.display import display
import cufflinks as cf
import pandas as pd
import numpy as np
import math
import plotly.plotly as py
import plotly.graph_objs as go

# Calculates the entropy of the given data set for the target attribute.
def entropy(data, target_attr):
 
    val_freq = {}
    data_entropy = 0.0
 
    # Calculate the frequency of each of the values in the target attr
    for record in data:
        
        if (record[target_attr] in val_freq):
            val_freq[record[target_attr]] += 1.0
        else:
            val_freq[record[target_attr]]  = 1.0
 
    # Calculate the entropy of the data for the target attribute
    for freq in val_freq.values():
        data_entropy += (-freq/len(data)) * math.log(freq/len(data), 2) 
 
    return data_entropy


def calc_entropy(criteria,target):    
    print ("Entropy (Randomness): " +  str(entropy(movie_history[criteria].to_dict('records'),target)))
    display (movie_history[criteria])
    return entropy(movie_history[criteria].to_dict('records'),target)

# Calculates the information gain (reduction in entropy) 
#that would result by splitting the data on the chosen attribute (attr).
def gain(data, attr, target_attr):
 
    val_freq = {}
    subset_entropy = 0.0
 
    # Calculate the frequency of each of the values of the splitting attribute (attr)
    for record in data:
        if (record[attr] in val_freq):
            val_freq[record[attr]] += 1.0
        else:
            val_freq[record[attr]]  = 1.0
 
    # Calculate the sum of the entropy for each subset of records weighted by 
    # their probability of occurring in the training set.
    for val in val_freq.keys():
        val_prob = val_freq[val] / sum(val_freq.values())
        data_subset = [record for record in data if record[attr] == val]
        subset_entropy += val_prob * entropy(data_subset, target_attr)       
 
    # Subtract the entropy of the chosen attribute from the entropy of the whole
    #data set with respect to the target attribute (and return it)
    return round((entropy(data, target_attr) - subset_entropy),2)
 
def calc_information_gain_entropy(data,attr,attr_name,target):    
    # Compute the gain once from the DataFrame passed in, then report and return it.
    info_gain = gain(data.to_dict('records'),attr, target)
    print ("Information gain with attribute " + attr_name + " is : " + str(info_gain))
    return info_gain

py.sign_in('cloaked', 'wiwoxbjrvr')

movie_history = pd.read_csv('https://docs.google.com/spreadsheets/d/1fJ0DKxFkMr2XjzyspZFLv6yOlHJvl639Qpap0O4U78E/pub?gid=0&single=true&output=csv')
display(movie_history)

Unnamed: 0,Title,Year,Genre,Length,Rating,Watched
0,Raiders of the Lost Ark,80s,Action,Average,Average,Yes
1,The Lord of the Rings 3,00s,Drama,Long,High,Yes
2,Fight Club,90s,Drama,Average,High,Yes
3,Braveheart,90s,Drama,Long,Average,Yes
4,Inception,10s,Action,Average,High,No
5,The Godfather,70s,Drama,Average,High,No
6,Aliens,80s,Horror,Average,Average,No
7,Once upon a time in America,80s,Crime,Long,Average,No


(SS.10) The first concept within predictive modeling is supervised segmentation. Consider a movie subscription service, like Netflix, with a database that records whether a person has watched each movie. Based on this history, we would like to predict which movies, out of a list of unwatched movies, the person is likely to watch in the future.

(SS.20) The target variable we are interested in is called *Watched*, a Yes or No value that records whether the person watched a movie. Finding a rule that ***segments*** the history dataset into groups that differ in the value of the target, here *Watched = Yes* versus *Watched = No*, is an example of supervised segmentation.

(SS.30) Because we are predicting a Yes or No value for the target variable *Watched*, we are doing ***classification***. If we were instead predicting a numeric value, such as the likelihood of watching a movie, we would be doing ***regression***. In other words, regression predicts a numeric value for the target variable of interest.

(SS.40) The first part of supervised segmentation is to select the informative attributes. The attributes in this dataset are *Title*, *Year*, *Genre*, *Length* and *Rating*. A single row of attribute values is called a feature vector. For example, one feature vector here is: [ Title: The Godfather, Year: 70s, Genre: Drama, Length: Average, Rating: High ]. Each value in the feature vector is called a feature value. The attribute *Watched* is the target variable.

(SS.50) So what is an informative attribute? An informative attribute is one that reduces uncertainty about the target variable. Suppose for a moment that we understand the person's preferences perfectly: the person prefers the drama genre over others, but the length of a movie doesn't matter to them. In this case the *Genre* attribute is informative, because knowing it reduces the uncertainty about whether the person will watch a movie. The *Length* attribute is not informative, because it does not reduce that uncertainty.
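As a minimal sketch, one row of the dataset can be separated into the feature vector and the target value described in this section. The dictionary layout and the helper name `split_record` are assumptions made for illustration; they are not part of the notebook's own code.

```python
# One row of the movie history, written as a plain dictionary.
record = {
    "Title": "The Godfather",
    "Year": "70s",
    "Genre": "Drama",
    "Length": "Average",
    "Rating": "High",
    "Watched": "No",
}

def split_record(record, target="Watched"):
    """Return (feature_vector, target_value) for one row (hypothetical helper)."""
    features = {name: value for name, value in record.items() if name != target}
    return features, record[target]

features, label = split_record(record)
print(features)  # the feature vector: every attribute except the target
print(label)     # the value of the target variable for this row
```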

## Models, induction and deduction

(MID.10) We need a systematic mechanism, or *model*, to figure out which attributes are informative. The process of creating models from data is called **induction**, and the procedure that creates a model from the data is called a *learner* or an *induction algorithm*. Induction refers to generalizing from specific cases to general rules. We use induction to create a model from training data, which here is the movie dataset, and then use *deduction* to apply the model and predict target values for other feature vectors, which here are the unwatched movies.
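The split between induction and deduction can be sketched with a deliberately simple learner. This is an illustration only: the function names `induce` and `deduce`, the choice of *Genre* as the single attribute, and the fallback label for unseen genres are all assumptions, not the method used later in this notebook.

```python
from collections import Counter, defaultdict

# Training data: (Genre, Watched) pairs taken from the movie history table.
training = [
    ("Action", "Yes"), ("Drama", "Yes"), ("Drama", "Yes"), ("Drama", "Yes"),
    ("Action", "No"), ("Drama", "No"), ("Horror", "No"), ("Crime", "No"),
]

def induce(examples):
    """Induction: generalize from examples to a rule table (genre -> majority label)."""
    counts = defaultdict(Counter)
    for value, label in examples:
        counts[value][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def deduce(model, value, default="No"):
    """Deduction: apply the learned rules to a new instance.

    The default for genres never seen in training is an arbitrary assumption.
    """
    return model.get(value, default)

model = induce(training)
print(deduce(model, "Drama"))   # prediction for an unwatched drama
print(deduce(model, "Horror"))  # prediction for an unwatched horror movie
```

The model here is just a lookup table, but the two roles are the same as with a real learner: `induce` sees only the training data, and `deduce` sees only the model and a new feature value.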

In [264]:
no_filter = calc_entropy(criteria=movie_history.index >= 0,target='Watched')

Entropy (Randomness): 1.0


Unnamed: 0,Title,Year,Genre,Length,Rating,Watched
0,Raiders of the Lost Ark,80s,Action,Average,Average,Yes
1,The Lord of the Rings 3,00s,Drama,Long,High,Yes
2,Fight Club,90s,Drama,Average,High,Yes
3,Braveheart,90s,Drama,Long,Average,Yes
4,Inception,10s,Action,Average,High,No
5,The Godfather,70s,Drama,Average,High,No
6,Aliens,80s,Horror,Average,Average,No
7,Once upon a time in America,80s,Crime,Long,Average,No


(MID.20) Now let's try to create a model to figure out which attributes are more informative than others. The first step is to quantify the randomness of the dataset with respect to the target variable, using a concept called entropy. Here the entropy of the dataset with respect to the target variable *Watched* is 1.0.

(MID.30) Entropy is a measure of disorder, computed from the probability (relative frequency) of each unique value of the target variable in a dataset. It is defined by the following formula, where the logarithm is taken in base 2 to match the `entropy` function above:
$$entropy = - p_1 \log_2 (p_1) - p_2 \log_2 (p_2) - \cdots$$

(MID.40) Here $p_1$ is the probability of *Watched = Yes* and $p_2$ is the probability of *Watched = No* in the result set. If all records had *Watched = Yes*, then $p_1 = 1$ and $p_2 = 0$. Since $\log (p_1) = 0$, and the term $p_2 \log (p_2)$ is taken to be 0 when $p_2 = 0$, the entropy equals 0. A segment with an entropy of 0 is called a **pure** segment.
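As a worked example, the full movie history has 4 rows with *Watched = Yes* and 4 rows with *Watched = No*, so $p_1 = p_2 = 0.5$. Assuming base-2 logarithms, as in the `entropy` function defined earlier:

$$entropy = -0.5 \log_2 (0.5) - 0.5 \log_2 (0.5) = 0.5 + 0.5 = 1.0$$

For a less balanced segment, such as the four drama movies (3 watched, 1 not watched), $p_1 = 0.75$ and $p_2 = 0.25$:

$$entropy = -0.75 \log_2 (0.75) - 0.25 \log_2 (0.25) \approx 0.311 + 0.5 = 0.811$$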

In [265]:
data = [{'Watched': 'No'},
 {'Watched': 'No'},
 {'Watched': 'No'},
 {'Watched': 'No'},
 {'Watched': 'No'},
 {'Watched': 'No'},
 {'Watched': 'No'},
 {'Watched': 'No'}]


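def target_values(d):
    # Build a comma-separated label of the Watched values in d,
    # with a line break inserted at the midpoint.
    vals = ""
    for i in range(0,len(d)):
        if i != 0:
            vals = vals + ","
        if i == len(d)//2:
            vals = vals + "<br>"
        vals = vals + d[i]['Watched']
    return vals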

last_index = len(data) 
entropies = pd.DataFrame(columns=('X','Y','label'))
   
for i in range(0,last_index):        
    entropies.loc[i]= [i,entropy(data,"Watched"),target_values(data)]
    data[i]['Watched'] = 'Yes' 
entropies.loc[i+1]= [i+1,entropy(data,"Watched"),target_values(data)]

trace1 = go.Scatter(
    x=entropies.X,
    y=entropies.Y,
    mode='lines+markers+text',
    name='Lines, Markers and Text',
    text= entropies.label
)

display_data = [trace1]
layout = go.Layout(
    title = 
    'Entropy variation with the set of target values of <i>Watched</i> changing from all <b>No</b> to all <b>Yes</b> within the dataset',
    showlegend=False, font=dict(family='Source Sans Pro')
)
fig = go.Figure(data=display_data, layout=layout)

py.iplot(fig,filename='entropy_concept')

(MID.50) Take the movie dataset of 8 rows and plot the entropy as the target values change from all movies unwatched to all movies watched. Both extremes have an entropy of 0. The entropy is highest at the midpoint, where 50% of the movies are watched and 50% are unwatched, which is the point of maximum randomness. Entropy therefore measures the general disorder of a segment, and we can use it to quantify how much a particular attribute increases or decreases the disorder of the segmentation it produces. To increase the predictability of the segmentation, we want to identify attributes that decrease its randomness.
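The shape of that curve can be reproduced without plotting. The following is a minimal sketch; the helper name `binary_entropy` is an assumption, and it uses base-2 logarithms like the `entropy` function defined earlier.

```python
import math

def binary_entropy(yes, total):
    """Entropy (base 2) of a segment with `yes` positives out of `total` records."""
    result = 0.0
    for count in (yes, total - yes):
        if count:  # an empty class contributes nothing (0 * log 0 is treated as 0)
            p = count / total
            result -= p * math.log2(p)
    return result

total = 8
curve = [binary_entropy(k, total) for k in range(total + 1)]
for k, h in enumerate(curve):
    print(f"{k} of {total} watched: entropy = {h:.3f}")
```

The values rise from 0 to a maximum of 1.0 at 4 of 8 watched and fall back to 0 symmetrically.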

(MID.60) To quantify how much an attribute decreases randomness over the whole segmentation it creates, we use a concept called information gain: the entropy of the original set minus the weighted average entropy of the segments produced by splitting on that attribute. Here is a plot of the information gain of each attribute in the movie history dataset.
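Before looking at the notebook's own implementation, here is a small self-contained sketch of information gain for two attributes. The helper names `entropy_of` and `information_gain` are assumptions, and only the *Genre*, *Length* and *Watched* columns of the table are copied in. The numbers it prints are for all eight rows and may differ from output computed on a filtered subset of the data.

```python
import math
from collections import Counter

# Genre, Length and Watched for the eight rows of the movie history table.
rows = [
    {"Genre": "Action", "Length": "Average", "Watched": "Yes"},
    {"Genre": "Drama",  "Length": "Long",    "Watched": "Yes"},
    {"Genre": "Drama",  "Length": "Average", "Watched": "Yes"},
    {"Genre": "Drama",  "Length": "Long",    "Watched": "Yes"},
    {"Genre": "Action", "Length": "Average", "Watched": "No"},
    {"Genre": "Drama",  "Length": "Average", "Watched": "No"},
    {"Genre": "Horror", "Length": "Average", "Watched": "No"},
    {"Genre": "Crime",  "Length": "Long",    "Watched": "No"},
]

def entropy_of(labels):
    """Base-2 entropy of a list of target values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target="Watched"):
    """Parent entropy minus the size-weighted entropy of each segment."""
    parent = entropy_of([r[target] for r in rows])
    weighted_children = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        weighted_children += len(subset) / len(rows) * entropy_of(subset)
    return parent - weighted_children

for attr in ("Genre", "Length"):
    print(attr, round(information_gain(rows, attr), 3))
```

On these rows *Genre* yields a larger gain than *Length*, because splitting on genre produces two pure segments (Horror and Crime) while both length segments stay mixed.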

In [266]:
def calc_information_gain(criteria,attr,target):        
    return gain(movie_history[criteria].to_dict('records'),attr, target)



In [267]:
target_variable = 'Watched'
gains = pd.DataFrame(columns=('Attribute','Gain'))
for i in range (0,len(movie_history.columns)):
    gains.loc[i]= [movie_history.columns.values[i],
                   calc_information_gain(movie_history.index >= 0,
                     movie_history.columns.values[i],target_variable)]

#Exclude target variable
gains = (gains[gains.Attribute.str.find(target_variable) == -1])  
gains = (gains[gains.Attribute.str.find('Title') == -1])    
trace1 = go.Bar(
    x=gains.Attribute,
    y=gains.Gain,
    text=gains.Gain    
)

display_data = [trace1]
layout = go.Layout(
    title = 
    'Information gain by attribute',
    showlegend=False, font=dict(family='Source Sans Pro')
)

fig = go.Figure(data=display_data, layout=layout)

py.iplot(fig,filename='information_gain_concept')



In [268]:
target_variable = 'Watched'
gains = pd.DataFrame(columns=('Attribute','Gain'))
for value in movie_history.columns.values:
    if value != target_variable:
        print (value + " " + str((calc_information_gain(movie_history.index > 0,value,target_variable))))
        

Title 0.99
Year 0.99
Genre 0.52
Length 0.13
Rating 0.02


## Decision Trees

In [270]:
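# Import the graph object classes used below.
from plotly.graph_objs import (Scatter, Line, Marker, Data, Layout, Annotations,
                               Annotation, Font, Margin, XAxis, YAxis, Figure)
trace1 = Scatter(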
    x=[2.6644980675659595, 1.2509801194371117, None, 2.6644980675659595, 4.086390232868387, None],
    y=[-4.6076777515108125, -4.559748930034137, None, -4.6076777515108125, -4.645195358685002, None],
    hoverinfo='none',
    line=Line(
        color='rgb(210,210,210)',
        width=2
    ),
    mode='lines',
    name='Trace 0, y'
)
trace2 = Scatter(
    x=[2.6644980675659595, 1.2509801194371117, 4.086390232868387, 0.7158689493718534, 0.4215468767348411, 4.910287453614753],
    y=[-4.6076777515108125, -4.559748930034137, -4.645195358685002, -3.3266007351115277, -5.703800295665652, -5.675463958802986],
    hoverinfo='text',
    marker=Marker(
        color='#6175c1',
        line=Line(
            color='rgb(50,50,50)',
            width=2
        ),
        size=20,
        symbol='dot'
    ),
    mode='markers',
    name='',
    opacity=0.8,
    text=['0', '1', '2', '3', '4', '5']
)
data = Data([trace1, trace2])
layout = Layout(
    annotations=Annotations([
        Annotation(
            x=2.6644980675659595,
            y=-4.6076777515108125,
            font=Font(
                color='rgb(250,250,250)',
                size=10
            ),
            showarrow=False,
            text='0',
            xref='x1',
            yref='y1'
        ),
        Annotation(
            x=1.2509801194371117,
            y=-4.559748930034137,
            font=Font(
                color='rgb(250,250,250)',
                size=10
            ),
            showarrow=False,
            text='1',
            xref='x1',
            yref='y1'
        ),
        Annotation(
            x=4.086390232868387,
            y=-4.645195358685002,
            font=Font(
                color='rgb(250,250,250)',
                size=10
            ),
            showarrow=False,
            text='2',
            xref='x1',
            yref='y1'
        ),
        Annotation(
            x=0.7158689493718534,
            y=-3.3266007351115277,
            font=Font(
                color='rgb(250,250,250)',
                size=10
            ),
            showarrow=False,
            text='3',
            xref='x1',
            yref='y1'
        ),
        Annotation(
            x=0.4215468767348411,
            y=-5.703800295665652,
            font=Font(
                color='rgb(250,250,250)',
                size=10
            ),
            showarrow=False,
            text='4',
            xref='x1',
            yref='y1'
        ),
        Annotation(
            x=4.910287453614753,
            y=-5.675463958802986,
            font=Font(
                color='rgb(250,250,250)',
                size=10
            ),
            showarrow=False,
            text='5',
            xref='x1',
            yref='y1'
        )
    ]),
    autosize=False,
    font=Font(
        size=14
    ),
    height=700,
    hovermode='closest',
    margin=Margin(
        r=40,
        t=100,
        b=85,
        l=40
    ),
    plot_bgcolor='#EFECEA',
    showlegend=False,
    title='Tree',
    width=700,
    xaxis=XAxis(
        showgrid=False,
        showline=False,
        showticklabels=False,
        title='',
        zeroline=False
    ),
    yaxis=YAxis(
        showgrid=False,
        showline=False,
        showticklabels=False,
        title='',
        zeroline=False
    )
)
fig = Figure(data=data, layout=layout)
py.iplot(fig)




Decision trees are an important concept due to their ability to learn to classify individual records in a dataset. They are used in many areas, from typical marketing scenarios such as targeting to airplane autopilots and medical diagnoses. 

A decision tree is essentially a series of if-then statements which, when applied to a record in a dataset, result in the classification of that record. In our example, the program you create from a decision tree will be able to predict, from the history of movies, whether the person is likely to watch a movie.
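To make the "series of if-then statements" concrete, here is a hand-written sketch. The rules and the function name `predict_watched` are hypothetical: they were chosen by hand to agree with the eight rows of the movie history and are not the output of any learning algorithm.

```python
def predict_watched(movie):
    """Classify one movie with nested if-then rules (hand-written, hypothetical)."""
    if movie["Genre"] == "Drama":
        if movie["Year"] == "70s":
            return "No"
        return "Yes"
    if movie["Genre"] == "Action" and movie["Year"] == "80s":
        return "Yes"
    return "No"

print(predict_watched({"Genre": "Drama", "Year": "90s"}))   # follows the Drama branch
print(predict_watched({"Genre": "Horror", "Year": "80s"}))  # falls through to the default
```

Each path from the first `if` to a `return` corresponds to one path from the root of a tree to a leaf.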

At every step of building the decision tree, the most informative remaining attribute is selected to divide the dataset according to some predefined criteria. One of the most popular approaches uses the concept of entropy to calculate which attribute is best to use for dividing the data into subgroups.
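The greedy selection described here can be sketched as a short recursive function in the spirit of ID3. This is a minimal illustration under stated assumptions: the names `entropy_of`, `gain`, `build_tree` and `classify` are hypothetical, only *Genre*, *Length* and *Rating* are used as candidate attributes, ties are resolved arbitrarily, and unseen attribute values fall back to a default label.

```python
import math
from collections import Counter

def entropy_of(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain(rows, attr, target):
    parent = entropy_of([r[target] for r in rows])
    children = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        children += len(subset) / len(rows) * entropy_of(subset)
    return parent - children

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the segment is pure or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority
    # Greedy step: split on the attribute with the highest information gain.
    best = max(attrs, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}

def classify(tree, record, default="No"):
    # Walk from the root to a leaf; the default for unseen values is an assumption.
    while isinstance(tree, dict):
        attr, branches = next(iter(tree.items()))
        if record[attr] not in branches:
            return default
        tree = branches[record[attr]]
    return tree

columns = ("Genre", "Length", "Rating", "Watched")
table = [
    ("Action", "Average", "Average", "Yes"), ("Drama", "Long", "High", "Yes"),
    ("Drama", "Average", "High", "Yes"), ("Drama", "Long", "Average", "Yes"),
    ("Action", "Average", "High", "No"), ("Drama", "Average", "High", "No"),
    ("Horror", "Average", "Average", "No"), ("Crime", "Long", "Average", "No"),
]
rows = [dict(zip(columns, values)) for values in table]

tree = build_tree(rows, ["Genre", "Length", "Rating"], "Watched")
print(tree)
print(classify(tree, {"Genre": "Drama", "Length": "Long", "Rating": "High"}))
```

On this data the root split is *Genre*. Two of the drama rows share the same feature values but different labels, so that leaf is decided by a tie-break and should not be read as a meaningful prediction.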

Christopher Roach: Building Decision Trees in Python http://www.onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html