Group nr:

Name 1 and CID: name surname (CID)

Name 2 and CID: name surname (CID)

In [22]:
import numpy as np
import copy
import random
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from mining_world import Environment
from IPython.display import Image


# Mining world 

<img src="imgs/poster.png" width="800"/>

## Scenario


Humanity has now reached a point where it needs to extract and refine more Copium, a precious and highly valuable resource. The only problem is that Copium can only be found on certain uninhabitable planets, so automated robots are sent instead.

Copium is naturally very unstable and exists only briefly before it decays. Very specific geological activity and circumstances are needed for Copium to form. Its life cycle is as follows. First, a hot stream of liquid magma flows to the surface, creating a hotspot that looks like a small crater. At the surface, if the conditions are right, Copium can form during the cool-down period. But, as stated previously, Copium is unstable in its natural environment and decays into other materials shortly after.

The formation of these crater deposits is highly random, but their heat can easily be detected by satellite. There is, however, no way of knowing from satellite data alone whether a newly formed deposit contains Copium; therefore, a robot rover on the ground collects further measurements with its sensors. The rover's job is to move to the hotspots and identify whether there could be Copium there or not. The rover has many sensors that can measure the properties of the ground below it, but Copium cannot be detected directly with these types of sensors. This is where machine learning comes in: it takes all those measurements and tries to classify whether the deposit contains Copium or not.


<img src="imgs/overView.png" width="500"/>


# The environment

The environment can be initialized as shown below. For each step, a direction (North, South, East, West) is specified for the rover.

## Actions 

<img src="imgs/actions.png" width="300"/>


In [23]:
env = Environment(map_type=1, fps=5, resolution=(1000, 1000))
actions = env.get_action_space()  
print('Possible actions', actions)
for i in range(20):
    env.step(random.choice(actions))  # random action
    env.render() 

env.exit()


Possible actions ['N', 'S', 'W', 'E']


# Navigation - Tree search

This section shows how the navigation is done. Understanding it is not part of the assignment, but the navigation will be used later.

##  Breadth first

The method used is a breadth-first search algorithm. It is one of the simplest tree search algorithms: it tries every sequence of actions up to a fixed number of steps and chooses the best one.

In [24]:
class Node():
    def __init__(self, actor):
        self.actor = actor
        self.total_score = 0
    
    def update(self, action, inherited_score):
        score = self.actor.step(action)
        self.total_score = 1.05*inherited_score + score
        return self.total_score
    
    def get_score(self):
        return self.total_score

In [25]:
def breadth_first_search(actor, max_depth, action_space):
    node = Node(copy.deepcopy(actor)) 
    queue_keys = ['0'] # queue to keep track of nodes that have not yet been expanded.
    visited = {queue_keys[0]: node} # saves visited nodes so the entire path is not recalculated for each step.
    
    max_score = -np.inf
    best_action = None

    while True:
        key = queue_keys.pop(0)
        if len(key) > max_depth: # stop at a set depth 
            break    
        node = visited[key]
        
        for action in action_space: # expand all children nodes
            child_node = copy.deepcopy(node)  # copy current node
            score = child_node.update(action=action, inherited_score=node.get_score()) # update node with action
            child_key = key + action # create child node key
            
            if score > max_score: # save best path 
                max_score = score
                best_action = child_key[1]
                
            visited[child_key] = child_node  # add child node to visited nodes.
            queue_keys.append(child_key)  # add child node to the queue of non-expanded nodes.
            
    return best_action
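
To make the scoring rule in `Node.update` concrete, here is a minimal sketch that performs the same kind of exhaustive look-ahead on a toy problem. `ToyActor`, its grid and its goal position are hypothetical stand-ins for the rover and are not part of `mining_world`; only the `1.05*inherited_score + score` update is taken from the code above.

```python
import copy
from itertools import product


class ToyActor:
    """Hypothetical stand-in for the rover: moves on a grid, scores 1 on the goal cell."""
    MOVES = {'N': (0, 1), 'S': (0, -1), 'W': (-1, 0), 'E': (1, 0)}

    def __init__(self, pos=(0, 0), goal=(2, 0)):
        self.pos = pos
        self.goal = goal

    def step(self, action):
        dx, dy = self.MOVES[action]
        self.pos = (self.pos[0] + dx, self.pos[1] + dy)
        return 1.0 if self.pos == self.goal else 0.0


def best_first_action(actor, max_depth, action_space):
    """Try every action sequence up to max_depth; return the first action of the best one."""
    best_score, best_action = float('-inf'), None
    for depth in range(1, max_depth + 1):
        for sequence in product(action_space, repeat=depth):
            simulated = copy.deepcopy(actor)  # never move the real actor
            total = 0.0
            for action in sequence:
                # same update rule as Node.update above
                total = 1.05 * total + simulated.step(action)
            if total > best_score:
                best_score, best_action = total, sequence[0]
    return best_action


print(best_first_action(ToyActor(), max_depth=3, action_space=['N', 'S', 'W', 'E']))
```

Because the goal lies two cells to the east, every best-scoring sequence starts with `'E'`. Only that first action is executed before the search is repeated from the new position, which is how the notebook's loop uses `breadth_first_search`.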


In [26]:
env = Environment(map_type=1, fps=10, resolution=(1000, 1000))

for i in range(100):
    action = breadth_first_search(actor=env.get_actor(), max_depth=3, action_space=env.get_action_space())
    env.step(action)
    env.render()

env.exit()

# Exercise 1: Collect data
The first step is to collect some data that will be used for training and validation. The available feature types can be seen with env.get_sensor_properties(), and the actual measurements can be retrieved with env.get_sensor_readings(). It returns a dictionary with the same keys as in env.get_sensor_properties(), containing a value for each feature. If the robot is not currently over a deposit, it returns None. The label can be extracted with env.get_ground_truth(), which returns 1 if there is copium in the deposit and 0 if not.
  

In [27]:
sensor_properties = env.get_sensor_properties()
print('Sensor properties', sensor_properties)


Sensor properties ['ground_density', 'moist', 'reflectivity', 'silicon_rate', 'oxygen_rate', 'iron_rate', 'aluminium_rate', 'magnesium_rate', 'undetectable']


The exercise is then to append the sensor readings to the corresponding feature in the data dictionary, along with whether there is copium or not.
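
The append pattern can be tried without the simulator. The sketch below assumes made-up readings for two of the features; in the notebook the readings come from env.get_sensor_readings() and the label from env.get_ground_truth(), and `fake_samples` is purely illustrative.

```python
# Hypothetical readings standing in for env.get_sensor_readings() / env.get_ground_truth().
sensor_properties = ['ground_density', 'moist']
fake_samples = [
    ({'ground_density': 1.8, 'moist': 0.09}, 0),
    (None, None),  # the rover is not over a deposit: nothing to store
    ({'ground_density': 0.4, 'moist': 0.04}, 1),
]

# One list per feature, plus one list for the label.
data = {'copium': []}
for key in sensor_properties:
    data[key] = []

for readings, copium in fake_samples:
    if readings is None:
        continue
    for key, value in readings.items():
        data[key].append(value)
    data['copium'].append(copium)

print(data)
```

Every list ends up with the same length, which is what lets the dictionary be turned into a table later on.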

In [28]:
env = Environment(map_type=1, fps=500, resolution=(1000, 1000))
sensor_properties = env.get_sensor_properties()

# We can initialize the dictionary the following way.
data = dict()
data['copium'] = [] 
for key in sensor_properties:
    data[key] = []
    
for i in range(5000):
    action = breadth_first_search(actor=env.get_actor(), max_depth=3, action_space=env.get_action_space())
    env.step(action)
    # if we are over a deposit. 
    if env.get_sensor_readings() is not None:
        sensor_readings = env.get_sensor_readings()
        copium = env.get_ground_truth()
        
        # TODO: Append the sensor readings and copium to the data dictionary
        for s in sensor_readings:
            data[s].append(sensor_readings[s])
        data['copium'].append(copium)
        
        
    env.render()

env.exit()


# Exercise 2: Data structure

## a) Pandas data frame 
In this assignment we will work with a pandas data frame for storing the collected data. First, create a pandas data frame from the dictionary. The documentation can be found at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html; only the data field needs to be filled in with the created dictionary. Call this data frame df. 


In [29]:
# TODO: create pandas dataframe
df = pd.DataFrame(data)
print(df)

      copium  ground_density     moist  reflectivity  silicon_rate  \
0          0        1.793619  0.094003      0.123946      0.129768   
1          1        0.400000  0.043079      0.593544      0.125845   
2          0        0.883474  0.215157      0.477182      0.254534   
3          0        0.863021  0.033451      0.487063      0.202344   
4          0        2.402137  0.268513      0.079751      0.198611   
...      ...             ...       ...           ...           ...   
1581       0        1.360968  0.166711      0.422346      0.195678   
1582       0        1.595746  0.154722      0.158401      0.041010   
1583       0        1.297176  0.099321      0.178333      0.092233   
1584       0        1.357793  0.120754      0.010710      0.145720   
1585       0        1.342187  0.175615      0.377811      0.306602   

      oxygen_rate  iron_rate  aluminium_rate  magnesium_rate  undetectable  
0        0.013928   0.465310        0.259880        0.087914      0.043200  
1    

In [30]:
# From the data frame you can access all data for a key with for example:
print("All data for a feature \n", df["copium"])
print()

# You can access a single sample with:
print("Single sample from index \n", df.iloc[1])
print()

# You can access all features but one with:
all_features_without_copium = df.drop(columns='copium')
print("All features without copium \n", all_features_without_copium)

All data for a feature 
 0       0
1       1
2       0
3       0
4       0
       ..
1581    0
1582    0
1583    0
1584    0
1585    0
Name: copium, Length: 1586, dtype: int64

Single sample from index 
 copium            1.000000
ground_density    0.400000
moist             0.043079
reflectivity      0.593544
silicon_rate      0.125845
oxygen_rate       0.102825
iron_rate         0.136515
aluminium_rate    0.294268
magnesium_rate    0.255484
undetectable      0.085063
Name: 1, dtype: float64

All features without copium 
       ground_density     moist  reflectivity  silicon_rate  oxygen_rate  \
0           1.793619  0.094003      0.123946      0.129768     0.013928   
1           0.400000  0.043079      0.593544      0.125845     0.102825   
2           0.883474  0.215157      0.477182      0.254534     0.089217   
3           0.863021  0.033451      0.487063      0.202344     0.006961   
4           2.402137  0.268513      0.079751      0.198611     0.064476   
...              ... 

## b) Part 1: Analyse data balance

The number of occurrences of each value can be retrieved with .value_counts() from a pandas data frame. Get the occurrence counts of copium in the samples. Is the dataset balanced?

Answer:

In [31]:
# TODO: Get number of samples with copium and the number of samples without copium.
with_copium = len(df[df['copium'] > 0])
no_copium = len(df[df['copium'] == 0])

print(with_copium)
print(no_copium)



284
1302
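
The same counts can also be obtained with the `.value_counts()` method mentioned in the exercise text. The sketch below uses a small made-up data frame (`toy`) rather than the collected data.

```python
import pandas as pd

# Toy labels: six samples without copium, two with.
toy = pd.DataFrame({'copium': [0, 0, 0, 1, 0, 1, 0, 0]})

counts = toy['copium'].value_counts()                  # absolute counts per label
fractions = toy['copium'].value_counts(normalize=True)  # share of each label

print(counts)
print(fractions)
```

Passing `normalize=True` makes the imbalance easy to read as a proportion rather than as raw counts.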


## b) Part 2: Balance data, do this exercise later! 

We have seen what happens with unbalanced data; now try to balance the data set. You will also need to change ex c) so that it uses the balanced data. We show how it can be done by downsampling the more common class; in a similar way, your job is to create an upsampled balanced data set instead. You only need to use the upsampled data set for the remaining part 2 exercises.
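
As a rough illustration of the two options, the sketch below balances a made-up data frame both ways with `sklearn.utils.resample`. The `toy` frame and the fixed `random_state` are assumptions for the example only.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: two samples with copium, eight without.
toy = pd.DataFrame({'feature': range(10), 'copium': [1, 1] + [0] * 8})
minority = toy[toy['copium'] == 1]
majority = toy[toy['copium'] == 0]

# Upsampling: draw from the minority class with replacement until it matches the majority.
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced_up = pd.concat([majority, minority_up])

# Downsampling: keep only a random subset of the majority class, without replacement.
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=0)
balanced_down = pd.concat([minority, majority_down])

print(balanced_up['copium'].value_counts())
print(balanced_down['copium'].value_counts())
```

Upsampling keeps all 16 rows' worth of data but repeats the same two minority samples, while downsampling discards six of the eight majority samples.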

What could be the reason for choosing one of these over the other?

Answer:

In [33]:
# TODO: balance data set.
# step 1: separate the data into one part that contains copium and one that doesn't,
# can for example be done with df[df["copium"]==0] etc.
df_zero = df[df["copium"]==0]
df_one = df[df["copium"]==1]

# downsample majority (without replacement, so no majority sample is duplicated)
df_zero_downsampled = resample(df_zero,
                               replace=False,
                               n_samples=df_one.shape[0])

df_balanced_downsampled = pd.concat([df_one, df_zero_downsampled])


# TODO: upsample minority (with replacement, since more samples are drawn than exist)
df_one_upsampled = resample(df_one,
                            replace=True,
                            n_samples=df_zero.shape[0])

df_balanced_upsampled = pd.concat([df_zero, df_one_upsampled])

## c ) Split data

Here we will divide the data into a training set and a test set. A good rule of thumb is to use 80% of the data for the training set and 20% for the test set. The two data sets should be randomly sampled (shuffled). This is done with train_test_split() from sklearn, https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. The syntax looks like 

train, test = train_test_split(dataframe, test_size=ratio_test_set, shuffle=True)

Why is it important that the data is shuffled when it is split? What could happen otherwise?

Answer:



In [34]:
# TODO: divide the data into train and test set. 
train, test = train_test_split(df, test_size=0.2, shuffle=True)
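
One way to see the effect of shuffling is to split a data set that happens to be ordered by label. The sketch below uses a made-up frame (`toy`) and a fixed `random_state` for the shuffled case; both are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data ordered by label: all zeros first, then all ones.
toy = pd.DataFrame({'feature': range(10), 'copium': [0] * 8 + [1] * 2})

# Without shuffling, the test set is simply the last 20% of the rows.
train_ns, test_ns = train_test_split(toy, test_size=0.2, shuffle=False)
print(train_ns['copium'].tolist())  # only zeros
print(test_ns['copium'].tolist())   # only ones

# With shuffling, rows are assigned to the two sets at random.
train_s, test_s = train_test_split(toy, test_size=0.2, shuffle=True, random_state=0)
print(sorted(test_s.index.tolist()))
```

In the unshuffled case a classifier would never see a positive example during training and would be tested only on positives.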


# Exercise 3: Performance evaluation

Here we will define a class that will later be used for evaluating the performance of the classification models. More information about precision and recall can be found at https://en.wikipedia.org/wiki/Precision_and_recall. 

Explain why the different metrics are useful. Why is accuracy not always enough?

Answer:


In [35]:
class Classification_eval(object):
    def __init__(self):
        # counters 
        self.TP = 0 # correctly identified positive 
        self.FP = 0 # falsely identified positive 
        self.TN = 0 # correctly identified negative 
        self.FN = 0 # falsely identified negative 
    
    def update(self, pred, label):
        """
        pred - is the prediction will be either a 1 or 0. 
        label - is the correct answer, will be either a 1 or 0.
        """
        # TODO: add to one of the counters each time this function is called. 
        if pred == label:
            if pred == 1:
                self.TP += 1
            else:
                self.TN += 1
        else:
            if pred == 1:
                self.FP += 1
            else: 
                self.FN += 1

    
    def accuracy(self): 
        # returns the accuracy 
        if (self.TP + self.TN) == 0:
            return 0
        # TODO: calculate the accuracy.
        accuracy = (self.TP + self.TN)/(self.TP + self.TN + self.FP + self.FN)
        return np.round(accuracy, 4)
    
    def precision(self): # percentage of the estimated positive that actually is positive
        if self.TP == 0:
            return 0
        # TODO: calculate the precision.
        precision = (self.TP/(self.TP + self.FP))
        return np.round(precision, 4)
    
    def recall(self): # percentage of correctly identified positive of the total positive
        if self.TP == 0:
            return 0
        # TODO: calculate the recall.
        recall = (self.TP/(self.TP + self.FN))
        return np.round(recall, 4)
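
A small worked example shows how accuracy alone can mislead on unbalanced data. It uses the class counts printed in Exercise 2 b) (284 samples with copium, 1302 without) and a hypothetical classifier that always answers "no copium".

```python
# Class counts from the data collected in this notebook run.
with_copium, no_copium = 284, 1302

# A classifier that always predicts 0 ("no copium"):
TP, FP = 0, 0                      # it never predicts positive
TN, FN = no_copium, with_copium    # all negatives right, all positives missed

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)

print(round(accuracy, 4), recall)  # 0.8209 0.0
```

The accuracy looks respectable even though not a single deposit with copium is found, which is what recall exposes.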

# Exercise 4: K- nearest neighbours

## a) Normalize

Here we will code our K-NN classifier; method 2.1 on page 21 in the book has the pseudocode for K-NN. We will start with the data normalization, i.e. we will normalize the input data so that each feature has the same range in terms of max/min values. The min value can be found with data.min(), and similarly for the max value. 

Why is it important that the data is normalized for the K-NN algorithm?

Answer:


In [36]:
class Normalize(object):
    def __init__(self):
        self.min = None
        self.max = None
    
    def normalize(self, data):
        # normalize the data and return it. 
        return (data-self.min)/(self.max-self.min)
    
    def update_normalization(self, data):
        # Save the min and max values for each feature. This function is only used for the training data.
        self.min = data.min()
        self.max = data.max()
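
To illustrate why the scaling matters for a distance-based method, the sketch below compares the nearest neighbour before and after min-max normalization. The two training points and the query are made-up numbers with deliberately different feature scales.

```python
import numpy as np

# Two training points; the second feature has a much larger scale than the first.
train_points = np.array([[1.0, 100.0],
                         [3.0, 110.0]])
query = np.array([1.1, 106.0])


def nearest(points, q):
    """Index of the point with the smallest Euclidean distance to q."""
    return int(np.argmin(np.linalg.norm(points - q, axis=1)))


raw_nearest = nearest(train_points, query)

# Min-max normalization using the training data only, as in the Normalize class above.
lo, hi = train_points.min(axis=0), train_points.max(axis=0)
norm_nearest = nearest((train_points - lo) / (hi - lo), (query - lo) / (hi - lo))

print(raw_nearest, norm_nearest)  # 1 0
```

On the raw values the large-scale feature dominates the distance; after normalization both features contribute comparably and the nearest neighbour changes.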
    

## b) K-NN
Let's make the K-NN algorithm; fill in the TODO.

In [37]:

class KNN(object):
    def __init__(self, k):
        self.features = None # normalized features from training data 
        self.labels = None # the corresponding labels (if there is copium)
        self.normalize = Normalize() # class instance for normalization
        self.k = k # the k value in k-nn algorithm.
        
    def fit(self, features, labels):
        # This is where we save the training data. 
        # TODO: update the normalize filter, normalize the features (save to self.features)
        # and save the labels to self.labels
        self.normalize.update_normalization(features)
        
        self.features = self.normalize.normalize(features)
        self.labels = labels
    
    def predict(self, features):
        # here we get one sample to make a prediction 
        # TODO: normalize the input features. 
        features_norm = self.normalize.normalize(features)
        
        # TODO: loop through all points in the previously saved features and save 
        # the labels for the k points with the smallest distance to the normalized 
        # input features in a list, call this list list_prediction. It could for example be initialized with
        # list_prediction = [0]*self.k 
        
        # arrays to hold the distances and labels of the k nearest points
        list_dist = np.ones(self.k)*np.inf
        list_prediction = np.zeros(self.k)

        # find k nearest points; iterate over the stored training features
        # (not the global `train`) so the class works with any training set
        for i in range(self.features.shape[0]):
            dist = np.linalg.norm(self.features.iloc[i] - features_norm)
            if dist < np.max(list_dist):
                # replace the farthest of the current k candidates
                idx = np.argmax(list_dist)
                list_dist[idx] = dist
                list_prediction[idx] = self.labels.iloc[i]


        return self.majority_vote(list_prediction)
    
    def majority_vote(self, pred_list):
        # Here is a function that will return the majority vote from a list. 
        keys = list(Counter(pred_list).keys())
        occurance = list(Counter(pred_list).values())
        idx = np.argmax(occurance)
        return keys[idx]
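
For comparison, the same prediction step can be written without an explicit Python loop. This is only a sketch of an alternative, assuming the features are already normalized NumPy arrays and the labels are 0/1 integers; `knn_predict` is a hypothetical helper, not part of the class above.

```python
import numpy as np


def knn_predict(train_features, train_labels, query, k):
    """Majority label among the k training points closest to query (ties go to label 0)."""
    distances = np.linalg.norm(train_features - query, axis=1)
    nearest = np.argsort(distances)[:k]               # indices of the k smallest distances
    votes = np.bincount(train_labels[nearest], minlength=2)
    return int(np.argmax(votes))


train_features = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0], [0.8, 1.0]])
train_labels = np.array([0, 0, 1, 1, 1])

print(knn_predict(train_features, train_labels, np.array([0.85, 0.9]), k=3))
print(knn_predict(train_features, train_labels, np.array([0.05, 0.0]), k=3))
```

Computing all distances at once tends to be much faster than calling `.iloc` per row, at the cost of sorting every distance even though only the k smallest are needed.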
        


## c) part 1: Evaluate the K-NN 
Evaluate the K-NN and choose a suitable k value. 

In [38]:
train_labels = train['copium']
train_features = train.drop(columns='copium')

y = test['copium']
x = test.drop(columns='copium')

# TODO: try different values of k. 
knn = KNN(k=50)
knn.fit(train_features, train_labels)

log = Classification_eval()
for i in range(x.shape[0]):
    pred = knn.predict(x.iloc[i])
    log.update(pred, y.iloc[i])

print('Accuracy', log.accuracy())
print('Precision', log.precision())
print('Recall', log.recall())




Try some different values of k. Just looking at these results, would the classifier work well for all k? 

Answer:

| k | Accuracy | Precision | Recall |
| --- | --- | --- | --- |
| 1 | 0.8531 | 0.5294 | 0.3673 |
| 5 | 0.8656 | 0.8 | 0.1633 |
| 20 | 0.8531 | 1.0 | 0.0408 |
| 50 | 0.8469 | 0 | 0 |

## c) part 2, do later!
Now that we have balanced data, try the same k values as in part 1. Have the results changed since ex 4 c) part 1? Would this classifier work better? 

Answer:

| k | Accuracy | Precision | Recall | 
| --- | --- | --- | --- |
| 1 |  | |  |  
| 5 |  |  |  |   
| 20 |  |  | |   
| 50 |  |  | | 

# Exercise 5: Learn tree based classifier

Here we will code our tree-based classifier. We will start by coding a function that finds the best (according to gini) splitting point for a given data set, and then define a recursive class for the nodes that make up our tree. 

## a) Find split point

The first step is to define a function that finds the splitting criterion with the highest gini value. 

The gini value can be described as follows.

If $\Gamma$ is the set of all labels, then $\Gamma(x_1 < 1)$ is the set of all labels whose samples satisfy the criterion $x_1 < 1$. More generally, we write $\Gamma(x_i < c)$, where $i$ is the index of one of the features and $c$ is the threshold. Then we can define:

$v_1 = mean(\Gamma(x_i < c))$

$v_2 = mean(\Gamma(x_i \geq c))$

$s_1 = v_1^2 + (1-v_1)^2$

$s_2 = v_2^2 + (1-v_2)^2$

We define len(x) to give the number of elements of x; the weighted gini value is then:

$s = \frac{len(\Gamma(x_i < c))}{len(\Gamma)} \cdot s_1 + \frac{len(\Gamma(x_i \geq c))}{len(\Gamma)} \cdot s_2$

The goal is to split the data so that we maximize $s$. Since $s_1$ and $s_2$ each equal one minus the Gini impurity of their part, maximizing $s$ is the same as minimizing the weighted impurity. There is one $s$ for every combination of $x_i$ and $c$. Here we use a $c$ value that is the average of two neighbouring data points in the sorted data. 
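
As a quick numerical check of the formulas above, the sketch below evaluates $s$ for three candidate splits of a made-up, already sorted label list. `weighted_gini_score` is a hypothetical helper used only for this illustration.

```python
import numpy as np


def weighted_gini_score(labels_head, labels_tail):
    """The weighted value s from the formulas above for one candidate split."""
    n = len(labels_head) + len(labels_tail)
    v1, v2 = np.mean(labels_head), np.mean(labels_tail)
    s1 = v1**2 + (1 - v1)**2
    s2 = v2**2 + (1 - v2)**2
    return len(labels_head) / n * s1 + len(labels_tail) / n * s2


# Labels ordered by some feature x_i.
labels = [0, 0, 0, 1, 1, 1]

for split_index in (1, 2, 3):
    s = weighted_gini_score(labels[:split_index], labels[split_index:])
    print(split_index, round(s, 4))
```

The split after the third sample separates the two classes perfectly and gives $s = 1$, the largest possible value; the other two candidates give 0.6 and 0.75.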

In [65]:
def find_split_point(data, label, parameter):
    """
    data - all the data we want to split, our (gamma)
    label - the parameter we want to classify. 
    parameter - the parameter we want to check for, our x_i
    -----------
    return:
    split_value - the splitting value, our c (None if no valid split exists). 
    gini_value - the gini value for the best c.
    df_head - the data frame belonging to x_i < c
    df_tail - the data frame belonging to x_i >= c
    """
    # begin by sorting the data by the parameter. 
    sorted_data = data.sort_values(by=parameter)
    values = sorted_data[parameter].to_numpy()
    labels = sorted_data[label].to_numpy()
    n = len(sorted_data)
    
    # TODO: loop through all the split points in the sorted data and find 
    # the best gini_value (s) and split_value (c).  
    gini_value = 0         
    split_value = None
    df_head = None
    df_tail = None
    
    # start at 1 so that the head is never empty
    for i in range(1, n):
        # equal neighbouring values cannot be separated by a threshold
        if values[i - 1] == values[i]:
            continue
        
        v1 = labels[:i].mean()
        v2 = labels[i:].mean()
        s1 = v1**2 + (1 - v1)**2
        s2 = v2**2 + (1 - v2)**2
        s = i / n * s1 + (n - i) / n * s2
        
        if s > gini_value:
            gini_value = s
            # c is the average of the two neighbouring sorted data points
            split_value = (values[i - 1] + values[i]) / 2
            # the two data frames that correspond to the split data
            df_head = sorted_data.iloc[:i]
            df_tail = sorted_data.iloc[i:]
    
    return split_value, gini_value, df_head, df_tail


## b) Tree Node



In [70]:
class TreeNode():
    def __init__(self, classification=None):
        self.split_value = None # the splitting value (c)
        self.split_parameter = None # which feature was used for the split (x_i)
        self.child_nodes = [] # list that contains two child nodes, if not leaf_node
        self.leaf_node = 0 # is this a leaf node (0 = no, 1 = yes)
        self.classification = classification # classification made in this node.
        
    def predict(self, data):
        # TODO: we need to traverse the tree recursively down to a leaf node.
        # step 1: check if this is a leaf node, if it is then return classification otherwise continue with step 2.
        if self.leaf_node == 1:
            return self.classification
        
        # step 2: check the input data for the splitting criteria, i.e. data[x_i] < c ...
        # (data[x_i] < c corresponds to child_nodes[0] and data[x_i] >= c to child_nodes[1])
        # step 3: call the predict function in the corresponding child node and return the prediction. 
        if data[self.split_parameter] < self.split_value:
            return self.child_nodes[0].predict(data)
        else:
            return self.child_nodes[1].predict(data)
            
    def learn(self, data, label, min_node_size):
        """
        data - the training data
        label - the parameter we want to classify
        min_node_size - number of data points in a node for it to become a leaf node. 
        """ 
        # TODO: write the learning function. 
        # Step 1: check if the data fulfils the min_node_size criteria, if so make this node a leaf node and return.
        if len(data) <= min_node_size:
            self.leaf_node = 1
            self.classification = self.majority_vote(data[label])
            return
        
        # Step 1.5: Check if the data is homogeneous, i.e. only contains one type of label. If that's
        # the case then make this node a leaf node and return.
        if data[label].nunique() == 1:
            self.leaf_node = 1
            self.classification = data[label].iloc[0]
            return
    
        # Step 2: Loop over all features and get the best gini and split_value for each feature. 
        # Use the columns of the data passed in, not a global data frame.
        features = list(data.columns)
        features.remove(label) # we do not want to split on the label
        
        best_split_feature = None # the splitting feature
        best_split_value = None
        best_gini_value = 0  # the gini value for the best feature
        best_df_head = None  # the data frame belonging to x_i < c
        best_df_tail = None  # the data frame belonging to x_i >= c
        
        for feature in features:
            split_value, gini_value, df_head, df_tail = find_split_point(data, label, feature)
            
            if split_value is not None and gini_value > best_gini_value:
                best_split_feature = feature 
                best_split_value = split_value
                best_gini_value = gini_value
                best_df_head = df_head  
                best_df_tail = df_tail  
        
        # If no feature offers a split, fall back to a leaf node.
        if best_split_feature is None:
            self.leaf_node = 1
            self.classification = self.majority_vote(data[label])
            return
        
        # Step 2.5: Save the best split value and split_parameter and the two new data frames 
        # corresponding to the split [df_head, df_tail] (these will be data frames for the child nodes).  
        # predict() relies on both attributes; leaving them unset gives "KeyError: None".
        self.split_parameter = best_split_feature
        self.split_value = best_split_value
        
        # Step 3: Calculate the majority vote classification of the two data frames. 
        best_df_head_class = self.majority_vote(best_df_head[label])
        best_df_tail_class = self.majority_vote(best_df_tail[label])
        
        # Step 3.5: Create two child nodes (e.g. child_0 = TreeNode(classification=1)) and call the 
        # learn function with the corresponding data frame. 
        child_0 = TreeNode(classification=best_df_head_class)
        child_0.learn(best_df_head, label, min_node_size)
        child_1 = TreeNode(classification=best_df_tail_class)
        child_1.learn(best_df_tail, label, min_node_size)
        
        # Step 4: append the child nodes to self.child_nodes. They should be in the order 
        # corresponding to [df_head, df_tail].
        self.child_nodes.append(child_0)
        self.child_nodes.append(child_1)
        
        return
    
    def majority_vote(self, pred_list):
        # Here is a function that will return the majority vote from a list. 
        keys = list(Counter(pred_list).keys())
        occurance = list(Counter(pred_list).values())
        idx = np.argmax(occurance)
        return keys[idx]

##  Train the Tree

In [71]:
tree = TreeNode() # create root node
# learn the tree structure
tree.learn(train, "copium", min_node_size=5)



## Test

In [72]:
y = test['copium']
x = test.drop(columns='copium')
log = Classification_eval()

for i in range(x.shape[0]):
    pred = tree.predict(x.iloc[i])
    log.update(pred, y.iloc[i])
        
print('accuracy', log.accuracy())
print('precision', log.precision())
print('recall', log.recall())


## c) part 1

Try some different values of min_node_size. How do the results differ from those of K-NN?

Answer:

| min_node_size | Accuracy | Precision | Recall | 
| --- | --- | --- | --- |
| 1  |  |  |  |  
| 10 |  |  |  |   
| 20 |  |  |  |   
| 50 |  |  |  |   

## c) part 2, Try with balanced data, do this later!

Try some different values of min_node_size.

Answer:

| min_node_size | Accuracy | Precision | Recall | 
| --- | --- | --- | --- |
| 1  |  | |  |  
| 10 |  | |  |   
| 20 |  | |  |   
| 50 |  | |  | 

# Exercise 6: Deployment

Here we will try the learned classifiers on a larger map. Make sure that the last-run versions of K-NN and the tree have good parameters, i.e. k and min_node_size values. 


In [None]:
env = Environment(map_type=2, fps=5, resolution=(1000, 1000))

sensor_properties = env.get_sensor_properties()
sensor_sample = dict()
for key in sensor_properties:
    sensor_sample[key] = [0]

log_knn = Classification_eval()
log_tree = Classification_eval()

    
for i in range(500):
    action = breadth_first_search(actor=env.get_actor(), max_depth=3, action_space=env.get_action_space())
    env.step(action)
    if env.get_sensor_readings() is not None:
        sensor_readings = env.get_sensor_readings()
        for key in sensor_readings:
            sensor_sample[key][0] = sensor_readings[key]
        sensor_sample_df = pd.DataFrame(sensor_sample)
        log_knn.update(knn.predict(sensor_sample_df.iloc[0]), env.get_ground_truth())
        log_tree.update(tree.predict(sensor_sample_df.iloc[0]), env.get_ground_truth())
        env.plt_acc.update_acc(log_tree.accuracy(), log_knn.accuracy())
    env.render()

env.exit()

print("K-NN accuracy ", log_knn.accuracy(), "Tree accuracy", log_tree.accuracy())
print("Number of copium deposits found, K-NN:", log_knn.TP, " Tree:", log_tree.TP)




# Exercise 7: Balance data

Go to 2 b) part 2 and balance the data, then do part 2 of the exercises that have one. Lastly, run exercise 6 with the classifiers trained on balanced data. What is the major difference?

Answer:

| Balanced | Accuracy k-nn| nr found copium k-nn | Accuracy tree | nr found copium tree|  
| --- | --- | --- | --- | --- |
| NO |  |  |  |  | 
| YES | |  |  |  | 