## Machine Learning - CS60050
### Roll No: 19EC30055
### Name: Ujwal Nitin Nayak
### Assignment No: 1

#### Importing Required Libraries

In [None]:
import numpy as np #For mathematical functions
import pandas as pd #For handling imported data

#### Importing Training Data and Storing in Pandas DataFrame 

In [None]:
attribute_names=['price','maint','doors','persons','lug_boot','safety','class']
df=pd.read_csv('project1.data',header=None,names=attribute_names)
df.head(10)

#### Defining the Entropy Function

At every level of the tree, we need to find the attribute that provides the maximum information gain. Computing the information gain requires the Shannon entropy, which the get_entropy function calculates. It takes the classification column of the entire dataset (at the root) or of a subset (at lower levels) and returns the entropy of the data remaining at that level.
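
For reference, the quantity computed by get_entropy is the standard Shannon entropy in bits. Writing $S$ for the examples at a node and $p_c$ for the fraction of them carrying class label $c$ (this notation is introduced here for illustration and is not part of the original assignment text):

$$H(S) = -\sum_{c} p_c \log_2 p_c$$

A node whose examples all share one label has $H(S)=0$, while a node with two equally frequent labels has $H(S)=1$ bit.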

In [None]:
def get_entropy(target):
    #Finding the counts of all unique elements in the target column
    _,counts=np.unique(target,return_counts=True)
    tot_count=np.sum(counts)
    entropy=0.0
    for i in range(len(counts)):
        #Finding probability corresponding to each count
        p=counts[i]/tot_count
        #Plugging in the probabilities into the entropy formula
        entropy+=((-p)*np.log2(p))
    return entropy
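
As a quick sanity check of the entropy calculation, the loop above can also be written in vectorised form. This is a minimal sketch, not a replacement for get_entropy; the name entropy_vectorised and the toy label lists are assumptions made for illustration.

```python
import numpy as np

def entropy_vectorised(labels):
    """Shannon entropy (in bits) of a 1-D sequence of class labels.

    Hypothetical helper for illustration; it mirrors get_entropy without the loop.
    """
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # Every label returned by np.unique occurs at least once, so p > 0 and log2 is finite.
    # Subtracting from 0.0 avoids returning a negative zero for a pure node.
    return 0.0 - float((p * np.log2(p)).sum())

# Toy label columns (made up): two equally frequent classes, then a pure node
balanced = entropy_vectorised(['acc', 'acc', 'unacc', 'unacc'])
pure = entropy_vectorised(['acc', 'acc', 'acc'])
print(balanced, pure)
```

The balanced column should give one bit of entropy and the pure column zero, which are the two extremes for a two-class problem.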

#### Defining Information Gain Function

Now, let us calculate the information gain for a given attribute using the get_info_gain function. It takes three arguments: the training data (the entire dataset or a subset of it), the attribute over which the dataset is being split, and the target attribute whose values the gain is measured against. For this dataset, the target is the 'class' attribute.
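
In symbols, with $S$ the data at the current node, $A$ the split attribute, $S_v$ the subset of $S$ in which $A$ takes the value $v$, and $H(\cdot)$ the Shannon entropy of the class labels (notation introduced here for illustration):

$$\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$

The sum is the entropy that remains after the split, weighted by how many examples fall into each branch.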

In [None]:
def get_info_gain(data,attribute,target='class'):
    #Calculating entropy of entire dataset
    dataset_entropy = get_entropy(data[target])
    #Calculating probabilistic entropy for the split attribute
    values,counts=np.unique(data[attribute],return_counts=True)
    tot_count=np.sum(counts)
    probabilistic_entropy=0.0
    for i in range(len(counts)):
        #Finding the probability of the particular attribute value
        p=counts[i]/tot_count
        #Selecting the rows in which the attribute takes this value
        subset=data[data[attribute]==values[i]]
        #Finding entropy of the reduced dataset
        entropy_attr=get_entropy(subset[target])
        #Adding probability weighted entropies
        probabilistic_entropy+=p*entropy_attr
    #Subtracting the weighted entropy from the dataset (or subset based on the level of tree) entropy
    info_gain=dataset_entropy-probabilistic_entropy
    return info_gain
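
To see the information gain behave as expected, here is a small worked example on a made-up four-row table. It is a sketch under stated assumptions: entropy_bits and info_gain_groupby are hypothetical stand-ins that follow the same definition as the functions above but use pandas groupby for the split, and the toy data is invented.

```python
import numpy as np
import pandas as pd

def entropy_bits(labels):
    """Shannon entropy (bits) of a label column; illustrative stand-in for get_entropy."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 0.0 - float((p * np.log2(p)).sum())

def info_gain_groupby(data, attribute, target='class'):
    """Information gain of splitting `data` on `attribute` (hypothetical helper)."""
    total = entropy_bits(data[target])
    # Weight each branch's entropy by the fraction of rows that fall into it
    weighted = sum(len(group) / len(data) * entropy_bits(group[target])
                   for _, group in data.groupby(attribute))
    return total - weighted

# Invented toy data: 'safety' determines the class, 'doors' is unrelated to it
toy = pd.DataFrame({
    'safety': ['low', 'low', 'high', 'high'],
    'doors':  ['2', '4', '2', '4'],
    'class':  ['unacc', 'unacc', 'acc', 'acc'],
})
gain_safety = info_gain_groupby(toy, 'safety')
gain_doors = info_gain_groupby(toy, 'doors')
print(gain_safety, gain_doors)
```

Here 'safety' removes all uncertainty about the class (gain of one bit), while 'doors' removes none (gain of zero), so ID3 would split on 'safety' first.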

#### Implementing the Decision Tree Algorithm and Fitting the Training Data

Finally, let us define the decision_tree function to build the Decision Tree using the ID3 algorithm. It takes 5 arguments.
- fullset - The complete dataset
- data - The subset of the data that is available at a given level
- attributes - Set of all the attributes of the dataset present at a given level
- target - The target attribute
- parent_class - The most frequent class label at the parent node, used as a fallback when no attributes remain

In [None]:
def decision_tree(fullset,data,attributes,target='class',parent_class=None):
    #3 base cases to terminate the recursion
    #The dataset at this level may be empty, in which case the most frequent class label in
    #the original dataset becomes the prediction. This check comes first because indexing
    #the unique target values of an empty dataset would raise an IndexError
    if len(data)==0:
        _,counts=np.unique(fullset[target],return_counts=True)
        return np.unique(fullset[target])[np.argmax(counts)]
    #If the target values are all the same, the branching must stop and the example is
    #classified as this target value
    elif len(np.unique(data[target]))<=1:
        return np.unique(data[target])[0]
    #If all the attributes are exhausted, the most frequent class label of the parent is
    #the answer
    elif len(attributes)==0:
        return parent_class
    else:
        #Finding the best attribute to choose using information gain concept
        ig_arr=[]
        for attr in attributes:
            ig=get_info_gain(data,attr,target)
            ig_arr.append(ig)
        chosen_attr=attributes[np.argmax(ig_arr)]
        
        #Setting default value of the current node to the most frequent class label
        _,counts=np.unique(data[target],return_counts=True)
        parent_class=np.unique(data[target])[np.argmax(counts)]
        
        #Adding attribute to the tree 
        tree={chosen_attr:{}}
        
        #Removing chosen_attr from the attributes available to the subtrees
        #(compared by value with !=, since identity comparison of strings is unreliable)
        attributes=[attr for attr in attributes if attr!=chosen_attr]
        
        #Adding branches to the tree below the chosen attribute
        values=np.unique(data[chosen_attr])
        #Generating subtree below each of the possible attribute values
        for val in values:
            #Segmenting the data to contain only those examples with attribute value fixed to val
            subset=data[data[chosen_attr]==val]
            #Making a recursive call to work on subtrees
            subtree=decision_tree(fullset,subset,attributes,target,parent_class)
            #Assigning the subtree to the corresponding value of the chosen attribute
            tree[chosen_attr][val]=subtree
            
        return tree

The Decision Tree Classifier is trained to fit the given training data and is stored in the dictionary variable tree. The arguments passed to the function are:
- fullset = complete training set df
- data = complete training set df, since in level 0 (root), the entire dataset is under consideration
- attributes = all column headers except the class column, df.columns[:-1]
- target = since the 'class' column is the classification column, the default value is taken
- parent_class = since the root node has no parent and hence no parent class, default value None is taken

In [None]:
#Training the decision tree on the dataframe of training examples
tree=decision_tree(df,df,df.columns[:-1])
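
The value returned by decision_tree is a nested dictionary: each internal node is a one-key dictionary mapping an attribute name to a dictionary of its values, and each leaf is a class-label string. The sketch below shows that layout on a hand-written toy tree; the tree contents and the helpers tree_depth and count_leaves are hypothetical and only illustrate how the structure can be traversed.

```python
# Hypothetical hand-written tree in the same nested-dict layout as decision_tree's output
toy_tree = {
    'safety': {
        'low': 'unacc',
        'high': {'persons': {'2': 'unacc', '4': 'acc'}},
    }
}

def tree_depth(node):
    """Number of attribute tests on the longest root-to-leaf path."""
    if not isinstance(node, dict):
        return 0  # a leaf (class label) involves no further tests
    attribute = next(iter(node))  # internal nodes hold exactly one attribute key
    return 1 + max(tree_depth(child) for child in node[attribute].values())

def count_leaves(node):
    """Number of leaves (class-label verdicts) in the tree."""
    if not isinstance(node, dict):
        return 1
    attribute = next(iter(node))
    return sum(count_leaves(child) for child in node[attribute].values())

print(tree_depth(toy_tree), count_leaves(toy_tree))
```

The same two helpers could be applied to the trained tree to get a rough sense of its size before printing it.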

#### Defining a Function to Print the Decision Tree and Printing it in the Specified Format

Now that the tree is trained to fit the training examples, we need a function that prints it in a form that is easy to read and interpret. The following function does this. It takes two arguments:
- tree = dictionary of tree elements (attribute nodes are mapped to their values and further to their subtrees)
- indent = the level of the tree which is used to decide the indentation

In [None]:
#Print function
def dtprint(tree,indent=0):
    #Base case - if the tree specified is a string then the string is printed. Since the dictionary consists of node 
    #values followed by the subtree, this occurs when the tree has been traversed completely until the leaf attribute
    #and the mapping contains only the node value and no subsequent tree
    if isinstance(tree,str):
        print(': ',tree)
    else:
        #List of keys contains only one element at every stage which is the attribute at that level of the tree
        key=list(tree.keys())[0]
        #Iterating over the possible values that the attribute can take
        for val in list(tree[key]):
            #Setting indentations based on the example given in the project pdf
            for i in range(indent):
                if i==0:
                    continue
                print('\t',end="")
            if indent>=1:
                print('| ',end="")
            #Printing the attribute name 
            print(key,' = ',end="")
            #Printing the attribute value
            print(val,end="")
            #Finding if the next value is a subtree (dict type) or a verdict (string type)
            if not isinstance(tree[key][val],str):
                print()
            dtprint(tree[key][val],indent+1)

Let us print the tree using the above function. The call passes only the tree; indent is 0 by default.

In [None]:
dtprint(tree)

#### Importing Test Data and Making Predictions Using the Trained Decision Tree Classifier

In [None]:
attribute_names=['price','maint','doors','persons','lug_boot','safety','class']
df_test=pd.read_csv('project1_test.data',header=None,names=attribute_names)
print(df_test)

In [None]:
#Cleaning error in data - change 56 to 6 in doors attribute
df_test['doors']=df_test['doors'].replace({56:6})
print(df_test)

To make the predictions, the find_class function is defined, which takes the following parameters:
- entry = one example from the test set, converted to a records-style dictionary
- tree = trained decision tree

In [None]:
def find_class(entry,tree):
    #The entry dictionary consists of attribute-value pairs corresponding to each row of the test dataframe
    #Iterating over these keys and finding if the subtree (full tree in first run) at this stage consists 
    #of this key
    for key in list(entry.keys()):
        if key in list(tree.keys()):
            #If the key exists in the list of subtree keys, the subtree for the next recursion will become that branch 
            #of the subtree (having the entry attribute as the parent) which has entry attribute=entry value. This will
            #occur in the same order as that in the tree 
            
            #Example: In the first run, the only attribute in the list of tree keys is 'safety'. When entry keys are 
            #read, no operation will be performed until the 'safety' attribute is under consideration. If, suppose, the
            #attribute value is 'high', the subtree for the next recursion will be that under safety=high
            result=tree[key][entry[key]]
            #If the subtree (result) selected for the next operation is not a dictionary but a string value, a leaf  
            #node is reached and the result is returned. If it is a dictionary, recursion occurs and the for loop is
            #run again to find the value of the root attribute of the subtree in the entry
            
            #Example: Suppose after safety=high, the next attribute is 'price' in the subtree. So the for loop will
            #run until the price attribute of entry is found. Let the value of this attribute in entry be 'high'.
            #Then the subtree with price=high will be selected as shown below (... represents subtree)
            #safety='high'
            #|price='vhigh'
            #...
            #|price='high'  <- selected for next recursion step
            #...
            #|price='med'
            #...
            #|price='low'
            #...
            #safety='mid'
            #...
            #safety='low'
            #...
            
            #If the subtree is a single string value, the leaf node
            #is reached and result (prediction) is returned. Otherwise the process repeats. 
            if isinstance(result,dict):
                return find_class(entry,result)
            else:
                return result
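
One limitation of find_class as written is that the lookup tree[key][entry[key]] raises a KeyError when a test example carries an attribute value that never appeared in that branch of the training data. A minimal sketch of an iterative variant with a fallback label follows; the name find_class_safe, the default parameter, and the toy tree are assumptions for illustration, not part of the original notebook.

```python
def find_class_safe(entry, tree, default=None):
    """Walk the nested-dict tree for one example; return `default` on an unseen value.

    Hypothetical variant of find_class, shown only as a sketch.
    """
    node = tree
    while isinstance(node, dict):
        attribute = next(iter(node))   # each internal node holds exactly one attribute
        branches = node[attribute]
        value = entry.get(attribute)
        if value not in branches:
            return default             # value not seen during training at this node
        node = branches[value]
    return node                        # a leaf, i.e. the predicted class label

# Hand-written toy tree and examples (invented for this sketch)
toy_tree = {'safety': {'low': 'unacc',
                       'high': {'persons': {'2': 'unacc', '4': 'acc'}}}}
seen = find_class_safe({'safety': 'high', 'persons': '4'}, toy_tree, default='unacc')
unseen = find_class_safe({'safety': 'med', 'persons': '4'}, toy_tree, default='unacc')
print(seen, unseen)
```

A reasonable choice for the default would be the most frequent class in the training set, but that is a design decision rather than something the original code specifies.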

The test function is used to manipulate the test data and store it in a form suitable for find_class function. This function also prints the predicted class next to the test examples and calculates the accuracy. The function takes in two arguments:
- data = test examples
- tree = trained decision tree

In [None]:
def test(data,tree):
    #Taking all columns from test data except the final column
    data_no_class=data.iloc[:,:-1]
    #Converting the data into a list of dictionaries, one per row, mapping attribute names to values
    entries=data_no_class.to_dict(orient='records')
    #Creating a dataframe for the predicted class values
    predicted_class=pd.DataFrame(columns=['predicted_class'])
    #Finding class for each entry and adding the result to predicted_class dataframe
    for i in range(len(data)):
        predicted_class.loc[i,'predicted_class']=find_class(entries[i],tree)
    #Concatenating data and predicted_class dataframes
    data_with_preds=pd.concat([data,predicted_class],axis=1)
    print(data_with_preds)
    #Calculating accuracy by calculating the percentage of entries in which the given class and predicted class
    #have matched
    accuracy=(np.sum(predicted_class['predicted_class']==data['class'])/len(data))*100
    print('Accuracy = ',accuracy,'%')
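
The prediction loop and accuracy calculation can also be expressed with DataFrame.apply and a boolean mean. The following is a minimal sketch on invented data; predict_row, toy_tree and toy_test are hypothetical names, and the toy tree is hand-written rather than trained.

```python
import pandas as pd

# Hand-written one-level tree in the notebook's nested-dict layout (invented)
toy_tree = {'safety': {'low': 'unacc', 'high': 'acc'}}

def predict_row(row, tree):
    """Follow the tree for one DataFrame row (hypothetical helper)."""
    node = tree
    while isinstance(node, dict):
        attribute = next(iter(node))
        node = node[attribute][row[attribute]]
    return node

# Invented test set in which the third row is deliberately misclassified by toy_tree
toy_test = pd.DataFrame({
    'safety': ['low', 'high', 'high', 'low'],
    'class':  ['unacc', 'acc', 'unacc', 'unacc'],
})

# apply with axis=1 passes each row as a Series; args forwards the tree
predictions = toy_test.apply(predict_row, axis=1, args=(toy_tree,))
# The mean of a boolean Series is the fraction of matches
accuracy = (predictions == toy_test['class']).mean() * 100
print('Accuracy = ', accuracy, '%')
```

Because the predictions keep the index of the test DataFrame, the comparison with the 'class' column lines up row by row even if the index is not 0, 1, 2, and so on.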

In [None]:
test(df_test,tree)

Thus, we can see that the decision tree classifier trained on the training examples classifies every test example correctly, giving 100% accuracy.

## --------END--------