# Concept Learning

## Motivation

In this tutorial we will implement a concept learning algorithm, specifically Find-S.

This tutorial focuses on a very basic machine learning algorithm. Concept learning algorithms are not the most accurate (in fact they often induce heavy bias), but they are among the easier ones to implement: they are conceptually simple and computationally cheaper than machine learning algorithms that must loop repeatedly over the data to minimize an error. Find-S takes the idea that a hypothesis can be represented as a general case over specific variables and implements it in code.

## Find-S

The Find-S algorithm essentially boils down to finding the most specific hypothesis that still covers every positive example in a dataset. Negative examples are not allowed to influence the final hypothesis. A hypothesis here approximates the exact relationship between the predictor variables and the response variable (or label). For this project we only consider binary data (each attribute takes one of two values, such as yes or no), with each example labeled either positive or negative. Note that the approach can also work with other categorical data and, in some instances, with ranges of numerical values.

Our data will have the following attributes. 

• Gender (Male, Female)

• Age (Young, Old)

• Student? (Yes, No)

• Previously Declined? (Yes, No)

• Hair Length (Long, Short)

• Employed? (Yes, No)

• Type of Collateral (House, Car)

• First Loan (Yes, No)

• Life Insurance (Yes, No)

Each example also has a "Risk" value of high or low, which acts as the label.




## Importing data

We must first import the data and decide how to organize it.

In this example we simply import the training file as an array of arrays. Each line of the file represents one example in the dataset. Within each inner array the elements alternate: the first, third, fifth, ... elements are attribute names, and each one is immediately followed by that attribute's value.

We could instead keep only the attribute values, or store each example as a dictionary, but this format makes debugging easier and is very simple to import: no other processing is needed.
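As a small illustration of this layout, the sketch below splits one line into the alternating list and also shows the dictionary alternative mentioned above. The sample line is a shortened, made-up line in the same style as the training file, and the helper name `row_to_dict` is hypothetical; it is not part of the tutorial's code.

```python
# One line in the "attribute value attribute value ..." layout (shortened sample)
line = "Gender Male Age Young Student? No Risk high"

# The list format used in this tutorial: split on whitespace
row = line.split()
print(row)

def row_to_dict(row):
    """Hypothetical helper: pair each attribute name with the value that follows it."""
    # row[0::2] are the attribute names, row[1::2] are their values
    return dict(zip(row[0::2], row[1::2]))

print(row_to_dict(row)["Age"])
```

The dictionary form makes lookups by attribute name easier, while the list form keeps the original column order visible when printed.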

In [9]:
def import_data(train, dev, classify):
    '''
    Imports data from the training, dev, and classify files for use in the Find-S algorithm

    Inputs:
        train: file path to training file
        dev: file path to development file
        classify: file path to classify file
    Outputs:
        tuple: training, dev, and classify data, each as a list of lists
    '''

    def read_rows(path):
        # Each line becomes one list of whitespace-separated tokens:
        # [attribute, value, attribute, value, ...]
        # The file is opened read-only and closed when the with-block ends.
        with open(path, 'r') as f:
            return [line.split() for line in f]

    return (read_rows(train), read_rows(dev), read_rows(classify))

datasets = import_data("9Cat-Train.labeled","9Cat-Dev.labeled", "9Cat-class.labeled")
print(datasets)

([['Gender', 'Male', 'Age', 'Young', 'Student?', 'No', 'PreviouslyDeclined?', 'No', 'HairLength', 'Short', 'Employed?', 'Yes', 'TypeOfColateral', 'Car', 'FirstLoan', 'No', 'LifeInsurance', 'Yes', 'Risk', 'high'], ['Gender', 'Male', 'Age', 'Young', 'Student?', 'No', 'PreviouslyDeclined?', 'No', 'HairLength', 'Short', 'Employed?', 'Yes', 'TypeOfColateral', 'Car', 'FirstLoan', 'Yes', 'LifeInsurance', 'No', 'Risk', 'high'], ['Gender', 'Male', 'Age', 'Young', 'Student?', 'No', 'PreviouslyDeclined?', 'No', 'HairLength', 'Short', 'Employed?', 'Yes', 'TypeOfColateral', 'Car', 'FirstLoan', 'Yes', 'LifeInsurance', 'Yes', 'Risk', 'high'], ['Gender', 'Male', 'Age', 'Young', 'Student?', 'No', 'PreviouslyDeclined?', 'No', 'HairLength', 'Long', 'Employed?', 'No', 'TypeOfColateral', 'House', 'FirstLoan', 'No', 'LifeInsurance', 'No', 'Risk', 'low'], ['Gender', 'Male', 'Age', 'Young', 'Student?', 'No', 'PreviouslyDeclined?', 'No', 'HairLength', 'Long', 'Employed?', 'No', 'TypeOfColateral', 'House', 'Fir

## Training the hypothesis

We define our hypothesis to be a conjunction of constraints on the 9 attributes that can express the dataset. Each attribute in the hypothesis takes one of three kinds of value: "NULL", a specific value (written here as 0 or 1, since every attribute is binary), or "?". "NULL" indicates that no hypothesis has been made yet, so the hypothesis returns false on every input. A specific value means that every positive example seen so far has that value for the attribute. "?" indicates that the positive examples seen so far include both values for the attribute. The hypothesis will be applied to our development and test datasets to determine its accuracy.
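To make the three kinds of value concrete, here is a minimal sketch of how a hypothesis accepts or rejects an example. It assumes a dictionary representation and a function named `covers`, both chosen for readability here; the tutorial's own code stores the hypothesis as an alternating list instead.

```python
NULL = "NULL"   # no value accepted yet
ANY = "?"       # either value accepted

def covers(hypothesis, example):
    """Return True if every attribute constraint accepts the example's value."""
    for attribute, constraint in hypothesis.items():
        if constraint == NULL:
            return False            # "NULL" rejects every example
        if constraint != ANY and constraint != example[attribute]:
            return False            # a specific value that does not match
    return True

hypothesis = {"Gender": "Male", "Age": ANY}
print(covers(hypothesis, {"Gender": "Male", "Age": "Old"}))     # True
print(covers(hypothesis, {"Gender": "Female", "Age": "Old"}))   # False
print(covers({"Gender": NULL, "Age": NULL}, {"Gender": "Male", "Age": "Old"}))  # False
```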

Two terms, input space and hypothesis space, help clarify what exactly we are training. The size of the input space is the number of possible unique inputs. Here the data is binary with 9 attributes, so we have 2 * 2 * 2 * ... = 2^9 = 512.

The size of the hypothesis space is the number of distinct hypotheses that can be represented. Each attribute in a hypothesis has 3 possible values: "?", 0, or 1. The hypothesis as a whole can also be entirely NULL if no hypothesis exists yet (the empty hypothesis), so we have 3 * 3 * 3 * ... + 1 = 3^9 + 1 = 19684.

We will feed a training dataset to the algorithm to train our hypothesis. Note that we train only on the positive examples in this dataset, so we ignore the examples where Risk is "low".

We store the hypothesis as an array that contains both the attributes and their values. As with the imported data, attribute names and values alternate: each attribute name is followed by its value.
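The two counts can be checked with a few lines of arithmetic. The variable names below are illustrative only.

```python
n_attributes = 9          # number of binary attributes in the loan data
values_per_attribute = 2  # every attribute is binary

# Every attribute independently takes one of 2 values
input_space = values_per_attribute ** n_attributes

# Every attribute takes one of its 2 values or "?", plus one all-NULL hypothesis
hypothesis_space = (values_per_attribute + 1) ** n_attributes + 1

print(input_space)        # 512
print(hypothesis_space)   # 19684
```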

We initially set all values to "NULL" to indicate that, without training, this hypothesis returns false on every test example. As we iterate through the training data, at each step we decide whether the example may update the hypothesis by checking whether it is positive or negative. If it is positive we proceed.

We take the values of the current example and compare them to the current hypothesis. If the hypothesis is all "NULL", we replace it with the current example's values: at this point the example is the most specific description of the positive examples seen so far. If an attribute in the hypothesis has a specific value and we see the opposite value in an example, we change that attribute to "?" to indicate that it can take both values.

Every 30 examples we print the hypothesis to keep track of it. If the hypothesis is ever all "?", we can break out of the for loop early, which saves some time.
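The update rule for a single positive example can be sketched on its own. This version assumes the hypothesis and the example are plain lists of values (no attribute names), which is a simplification of the alternating format used in this tutorial; the function name `generalize` is hypothetical.

```python
def generalize(hypothesis, example):
    """Return the hypothesis minimally generalized to cover one positive example."""
    updated = []
    for h_val, x_val in zip(hypothesis, example):
        if h_val == "NULL":
            updated.append(x_val)      # first positive example: copy its value
        elif h_val == "?" or h_val == x_val:
            updated.append(h_val)      # already general, or already agrees
        else:
            updated.append("?")        # conflicting values: accept both
    return updated

h = ["NULL", "NULL", "NULL"]
h = generalize(h, ["Male", "Young", "No"])
print(h)   # ['Male', 'Young', 'No']
h = generalize(h, ["Male", "Old", "No"])
print(h)   # ['Male', '?', 'No']
```

Returning a new list rather than mutating in place is a design choice made here so each step can be printed and compared; the tutorial's code updates the hypothesis in place.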




In [6]:
def train(data):
    '''
    Trains the hypothesis based on training data

    Inputs:
        data: (list of lists) Training dataset processed from import_data
    Outputs:
        list: hypothesis as a list alternating feature name and value
    '''
    false_str = "NULL"  # marks an attribute that has not been set yet
    train_data = data
    # Each row holds (attribute, value) pairs plus the final (Risk, label) pair
    n = len(train_data[0]) // 2 - 1
    dummy_r = train_data[0]

    # Sizes of the input space (2^n) and the hypothesis space (3^n + 1)
    input_space = 2 ** n
    print("Input Space: " + str(input_space))

    hypothesis_space = 3 ** n + 1
    print("Hypothesis Space: " + str(hypothesis_space))

    # Construct the initial hypothesis: every attribute name paired with "NULL"
    hyp = []
    for i in range(n):
        hyp.append(dummy_r[i * 2])
        hyp.append(false_str)
    print(hyp)

    count = 0

    # Iterate through the training data
    for row in train_data:
        risk = row[-1]  # label: "high" or "low"

        # Print the hypothesis every 30 examples
        count += 1
        if count == 30:
            count = 0
            print(hyp)

        # Stop early once every attribute is "?": nothing is left to generalize
        if all(hyp[k * 2 + 1] == "?" for k in range(n)):
            break

        # Only positive ("high") examples update the hypothesis
        if risk == "high":
            for j in range(n):
                hypVal = hyp[1 + j * 2]
                trainVal = row[1 + j * 2]
                if hypVal == false_str:
                    # No value yet: copy the example's value
                    hyp[1 + j * 2] = trainVal
                elif hypVal != "?" and hypVal != trainVal:
                    # Conflicting values: generalize to "?"
                    hyp[1 + j * 2] = "?"
    return hyp

hyp = train(datasets[0])
print("final hypothesis: " + str(hyp))

Input Space: 512
Hypothesis Space: 19684
['Gender', 'NULL', 'Age', 'NULL', 'Student?', 'NULL', 'PreviouslyDeclined?', 'NULL', 'HairLength', 'NULL', 'Employed?', 'NULL', 'TypeOfColateral', 'NULL', 'FirstLoan', 'NULL', 'LifeInsurance', 'NULL']
['Gender', 'Male', 'Age', 'Young', 'Student?', 'No', 'PreviouslyDeclined?', '?', 'HairLength', '?', 'Employed?', '?', 'TypeOfColateral', 'Car', 'FirstLoan', '?', 'LifeInsurance', '?']
['Gender', 'Male', 'Age', 'Young', 'Student?', '?', 'PreviouslyDeclined?', '?', 'HairLength', '?', 'Employed?', '?', 'TypeOfColateral', 'Car', 'FirstLoan', '?', 'LifeInsurance', '?']
['Gender', 'Male', 'Age', 'Young', 'Student?', '?', 'PreviouslyDeclined?', '?', 'HairLength', '?', 'Employed?', '?', 'TypeOfColateral', 'Car', 'FirstLoan', '?', 'LifeInsurance', '?']
['Gender', 'Male', 'Age', 'Young', 'Student?', '?', 'PreviouslyDeclined?', '?', 'HairLength', '?', 'Employed?', '?', 'TypeOfColateral', 'Car', 'FirstLoan', '?', 'LifeInsurance', '?']
['Gender', 'Male', 'Age', '?

## Evaluating Performance of Hypothesis

After training the hypothesis, we can use it to evaluate other test data. With this ML algorithm we expect a high success rate because of how the hypothesis was trained: it is the most specific hypothesis that still covers every positive example in the training set, so if the training data was sufficient, it should also cover most positive examples in any other dataset.

We have a new dataset in the same format as the training dataset, which we can use for testing. We iterate through the rows of the dataset. We create a function that checks whether the hypothesis covers the current row, returning true if it does and false otherwise. Combined with the row's label, this tells us whether the hypothesis successfully classified the example.

If the current row is covered by the hypothesis and its label is positive, we increase our count of successful classifications (since our hypothesis was built only from positive examples). If the row is not covered and its label is negative, we also count it as a successful classification. Finally we divide the number of successful classifications by the total number of examples.
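The scoring rule described above can be sketched on a toy dataset. The hypothesis and the four examples below are made up for illustration, and examples are assumed to be `(values, label)` pairs rather than the alternating lists used in this tutorial.

```python
def matches(hypothesis, values):
    """True if every constraint is "?" or equals the example's value."""
    return all(h == "?" or h == v for h, v in zip(hypothesis, values))

def accuracy(hypothesis, examples):
    """Fraction of (values, label) pairs whose predicted label equals the true label."""
    correct = 0
    for values, label in examples:
        predicted = "high" if matches(hypothesis, values) else "low"
        if predicted == label:
            correct += 1
    return correct / float(len(examples))

hypothesis = ["Male", "?", "No"]
examples = [
    (["Male", "Young", "No"], "high"),     # covered, labeled high: correct
    (["Male", "Old", "No"], "low"),        # covered, labeled low: wrong
    (["Female", "Old", "No"], "low"),      # not covered, labeled low: correct
    (["Female", "Young", "Yes"], "high"),  # not covered, labeled high: wrong
]
print(accuracy(hypothesis, examples))   # 0.5
```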

In [8]:
def test(hyp, data):
    '''
    Determines the quality of a hypothesis on labeled test data

    Inputs:
        hyp: (list) a hypothesis to use
        data: (list of lists) labeled test data set
    Outputs:
        float: fraction of examples the hypothesis classifies correctly
    '''
    n = len(data[0]) // 2 - 1  # number of attributes, excluding the Risk label
    correct = 0.0
    gen = "?"  # general case: matches any value

    def check(row):
        # True if every non-"?" hypothesis value equals the row's value
        for j in range(n):
            if hyp[j*2+1] != gen and hyp[j*2+1] != row[j*2+1]:
                return False
        return True

    # A row is classified correctly when it matches the hypothesis and is labeled
    # "high", or does not match and is labeled "low"
    for row in data:
        last = row[-1]  # label
        if check(row):
            if last == "high":
                correct += 1
        else:
            if last == "low":
                correct += 1

    return correct / len(data)

print(test(hyp, datasets[1]))

0.85


## Classify

After testing the hypothesis, we will now classify instances of data without labels. We use our hypothesis to determine whether a set of attributes is a positive or negative example.

As before, we iterate through the rows and determine each example's label according to our hypothesis (this is very similar to the test method). If every attribute in the hypothesis is '?' or matches the corresponding attribute of the current row, we classify the row as a positive example; otherwise we classify it as negative. As above, we define a function that does this for one row and call it on each row in turn.
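If the labels are needed later (for example to write them to a file), it can be convenient to return them instead of printing them. The sketch below works directly on the alternating attribute/value format; the function name `predict`, the hypothesis, and the two rows are hypothetical and shortened to three attributes.

```python
def predict(hyp, row):
    """Label one unlabeled row in the alternating [attribute, value, ...] format."""
    hyp_values = hyp[1::2]   # values sit right after each attribute name
    row_values = row[1::2]
    covered = all(h == "?" or h == v for h, v in zip(hyp_values, row_values))
    return "high" if covered else "low"

hyp = ["Gender", "?", "Age", "Young", "Student?", "?"]
rows = [
    ["Gender", "Female", "Age", "Young", "Student?", "Yes"],
    ["Gender", "Male", "Age", "Old", "Student?", "No"],
]
labels = [predict(hyp, row) for row in rows]
print(labels)   # ['high', 'low']
```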


In [13]:
def classify(hyp, data):
    '''
    Classifies unlabeled data

    Inputs:
        hyp: (list) a hypothesis to use
        data: (list of lists) classification data set
    Outputs:
        None: Prints each row, whether the hypothesis matches it, and its classification
    '''
    n = len(data[0]) // 2  # number of attributes; these rows carry no Risk label
    gen = "?"  # general case: matches any value

    def check(row):
        # True if every non-"?" hypothesis value equals the row's value
        for j in range(n):
            if hyp[j*2+1] != gen and hyp[j*2+1] != row[j*2+1]:
                return False
        return True

    # Classify a row as high if it matches the hypothesis, else low
    for row in data:
        same = check(row)
        print(str(row) + " " + str(same))  # the row and whether it matched
        if same:
            print("high")
        else:
            print("low")

classify(hyp, datasets[2])

['Gender', 'Female', 'Age', 'Young', 'Student?', 'Yes', 'PreviouslyDeclined?', 'Yes', 'HairLength', 'Short', 'Employed?', 'No', 'TypeOfColateral', 'Car', 'FirstLoan', 'No', 'LifeInsurance', 'No'] True
high
['Gender', 'Female', 'Age', 'Young', 'Student?', 'Yes', 'PreviouslyDeclined?', 'Yes', 'HairLength', 'Short', 'Employed?', 'No', 'TypeOfColateral', 'Car', 'FirstLoan', 'No', 'LifeInsurance', 'Yes'] True
high
['Gender', 'Female', 'Age', 'Young', 'Student?', 'Yes', 'PreviouslyDeclined?', 'Yes', 'HairLength', 'Short', 'Employed?', 'No', 'TypeOfColateral', 'Car', 'FirstLoan', 'Yes', 'LifeInsurance', 'No'] True
high
['Gender', 'Male', 'Age', 'Young', 'Student?', 'No', 'PreviouslyDeclined?', 'No', 'HairLength', 'Short', 'Employed?', 'No', 'TypeOfColateral', 'House', 'FirstLoan', 'No', 'LifeInsurance', 'No'] False
low
['Gender', 'Male', 'Age', 'Young', 'Student?', 'No', 'PreviouslyDeclined?', 'No', 'HairLength', 'Short', 'Employed?', 'No', 'TypeOfColateral', 'House', 'FirstLoan', 'No', 'Life

## Further Resources

For further resources: 

Graphical interpretation of Find-S: http://stackoverflow.com/questions/5757233/find-s-algorithm-simple-question

Some slides about Find-S: http://ml.informatik.uni-freiburg.de/_media/documents/teaching/ss11/ml/02_versionspace.printer.pdf

Other algorithms to read about: https://en.wikipedia.org/wiki/List_of_machine_learning_concepts