# Applications of Entropy in Machine Learning

## Classes and Objects

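An object bundles variables and functions into a single entity. Objects get their variables and functions from classes. A class is essentially a template for creating objects.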

A very basic class looks like this:

In [1]:
class MyClass:
    variable = "blah"

    def function(self):
        print("This is a message inside the class.")

We will explain later why `self` must be included as a parameter. First, to create an object from the class (template) above, do the following:

In [2]:
class MyClass:
    variable = "blah"

    def function(self):
        print("This is a message inside the class.")

myobjectx = MyClass()

The variable `myobjectx` now holds an object of the class `MyClass`, which contains the variables and functions defined in that class.

### Accessing object variables
To access a variable inside the newly created object `myobjectx`, do the following:

In [3]:
class MyClass:
    variable = "blah"

    def function(self):
        print("This is a message inside the class.")

myobjectx = MyClass()

myobjectx.variable

'blah'

For example, the following prints the string "blah":

In [4]:
class MyClass:
    variable = "blah"

    def function(self):
        print("This is a message inside the class.")

myobjectx = MyClass()

print(myobjectx.variable)

blah


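You can create multiple distinct objects of the same class (with the same variables and functions defined).
Each object can still hold its own value for a variable: `variable` starts out as a class attribute shared by all objects, and assigning to it through one object creates an attribute on that object only, leaving the others unchanged.
For example, if we define another object with the `MyClass` class and then change the string in the variable above: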

In [5]:
class MyClass:
    variable = "blah"

    def function(self):
        print("This is a message inside the class.")

myobjectx = MyClass()
myobjecty = MyClass()

myobjecty.variable = "yackity"

# Then print out both values
print(myobjectx.variable)
print(myobjecty.variable)

blah
yackity
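The difference between a class attribute and an instance attribute can be made visible with a small experiment. This is a minimal sketch that reuses the `MyClass` name from above; the value `"changed"` and the helper variable `shared_before` are introduced here purely for illustration.

```python
class MyClass:
    variable = "blah"  # class attribute, stored once on the class itself


x = MyClass()
y = MyClass()

# Before any assignment, both objects read the very same class attribute.
shared_before = x.variable is y.variable

# Assigning through an instance creates an instance attribute on y only.
y.variable = "yackity"

# Rebinding the class attribute is seen by every object without its own override.
MyClass.variable = "changed"

print(shared_before)  # True
print(x.variable)     # changed
print(y.variable)     # yackity
print("variable" in vars(x), "variable" in vars(y))  # False True
```

`vars(obj)` returns the object's own attribute dictionary, which is why only `y` lists `variable` there.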


### Accessing object functions
To access a function inside an object, use notation similar to accessing a variable:

In [6]:
class MyClass:
    variable = "blah"

    def function(self):
        print("This is a message inside the class.")

myobjectx = MyClass()

myobjectx.function()

This is a message inside the class.


The above prints the message "This is a message inside the class."

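### `__init__()`
The `__init__()` function is a special function that is called when an object of the class is created. It is used to assign initial values to the object's attributes.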

In [7]:
class NumberHolder:

    def __init__(self, number):
        self.number = number
        

In [8]:
class Student:
    variable = "blah"
    
    def __init__(self, name, age):
        self.student_name = name
        self.student_age = age
        self.print_information()
        

    def print_information(self):
        print("name:", self.student_name)
        print("age:", self.student_age)

a_student = Student("John", 15)
print(a_student.student_name, a_student.student_age)


name: John
age: 15
John 15
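As for why `self` appears in every method signature: when a method is called on an object, Python passes that object in as the first argument. The sketch below illustrates this with a trimmed-down `Student` class; the `describe` method is a hypothetical addition used only for this demonstration.

```python
class Student:
    def __init__(self, name, age):
        # self is the object being initialized; attributes set here belong to it
        self.student_name = name
        self.student_age = age

    def describe(self):  # hypothetical helper, not part of the class above
        return f"{self.student_name} ({self.student_age})"


a_student = Student("John", 15)

# These two calls are equivalent: the instance becomes the `self` argument.
via_instance = a_student.describe()
via_class = Student.describe(a_student)

print(via_instance)               # John (15)
print(via_instance == via_class)  # True
```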


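## Dictionary
A dictionary is a data type similar to an array, but it works with keys and values instead of indexes.
Each value stored in a dictionary is accessed by its key rather than by an index. A key can be any hashable object (a string, a number, a tuple, etc.); mutable objects such as lists cannot be used as keys.

For example, a database of phone numbers could be stored using a dictionary like this: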

In [9]:
phonebook = {}
phonebook["John"] = 938477566
phonebook["Jack"] = 938377264
phonebook["Jill"] = 947662781
print(phonebook)

{'John': 938477566, 'Jack': 938377264, 'Jill': 947662781}


Alternatively, a dictionary can be initialized with the same values using the following notation:

In [10]:
phonebook = {
    "John" : 938477566,
    "Jack" : 938377264,
    "Jill" : 947662781
}
print(phonebook)

{'John': 938477566, 'Jack': 938377264, 'Jill': 947662781}
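One caveat about keys: they must be hashable, so tuples and numbers work while lists do not. A short sketch, using a made-up `grid` dictionary that is not part of the examples above:

```python
grid = {}
grid[(0, 1)] = "tuple keys are fine"  # tuples of hashable items are hashable
grid[42] = "so are numbers"

try:
    grid[[0, 1]] = "lists are not"    # lists are mutable, hence unhashable
except TypeError as err:
    error_message = str(err)

print(len(grid))      # 2
print(error_message)  # exact wording depends on the Python version
```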


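### Iterating over dictionaries
Dictionaries can be iterated over, just like lists. Unlike a list, a dictionary has no positional indexes; since Python 3.7 it does preserve the order in which keys were inserted (earlier versions made no ordering guarantee). To iterate over key-value pairs, use the following syntax: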

In [11]:
phonebook = {"John" : 938477566,"Jack" : 938377264,"Jill" : 947662781}
for name, number in phonebook.items():
    print("Phone number of %s is %d" % (name, number))

Phone number of John is 938477566
Phone number of Jack is 938377264
Phone number of Jill is 947662781
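If a specific order is needed, it is safer to ask for it explicitly than to rely on how the dictionary was filled. A minimal sketch, assuming Python 3.7 or later for the insertion-order behavior; the variable names `insertion_order` and `sorted_names` are illustrative:

```python
phonebook = {"John": 938477566, "Jack": 938377264, "Jill": 947662781}

# Iterating a dict yields its keys in insertion order (Python 3.7+).
insertion_order = list(phonebook)

# sorted() returns the keys in alphabetical order instead.
sorted_names = sorted(phonebook)

for name in sorted_names:
    print(f"Phone number of {name} is {phonebook[name]}")

print(insertion_order)  # ['John', 'Jack', 'Jill']
print(sorted_names)     # ['Jack', 'Jill', 'John']
```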


### Removing a value
To remove the entry for a specified key, use either of the following notations:

In [12]:
phonebook = {
   "John" : 938477566,
   "Jack" : 938377264,
   "Jill" : 947662781
}
del phonebook["John"]
print(phonebook)

{'Jack': 938377264, 'Jill': 947662781}


Or

In [13]:
phonebook = {
   "John" : 938477566,
   "Jack" : 938377264,
   "Jill" : 947662781
}
phonebook.pop("John")
print(phonebook)

{'Jack': 938377264, 'Jill': 947662781}
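The two notations differ in one practical way: `pop` returns the removed value and accepts a default for missing keys, whereas `del` raises `KeyError` when the key is absent. A brief sketch; the name "Alice" is a made-up key that is deliberately not in the phonebook:

```python
phonebook = {"John": 938477566, "Jack": 938377264}

removed = phonebook.pop("John")         # returns the value that was removed
missing = phonebook.pop("Alice", None)  # default given, so no KeyError

try:
    del phonebook["Alice"]              # del has no default and raises
    del_failed = False
except KeyError:
    del_failed = True

print(removed)     # 938477566
print(missing)     # None
print(del_failed)  # True
print(phonebook)   # {'Jack': 938377264}
```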


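## Numpy arrays

Numpy arrays are great alternatives to Python lists.
Some key advantages of Numpy arrays are that they are fast, easy to work with, and let users perform calculations across entire arrays.

In the following example, you will first create two Python lists. Then, you will import the numpy package and create numpy arrays out of the newly created lists.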

In [14]:
# Import the numpy package as np
import numpy as np

# Create 2 new lists height and weight
height = [1.87,  1.87, 1.82, 1.91, 1.90, 1.85]
weight = [81.65, 97.52, 95.25, 92.98, 86.18, 88.45]

# Create 2 numpy arrays from height and weight
np_height = np.array(height)
np_weight = np.array(weight)

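### Element-wise calculations
Now we can perform element-wise calculations on height and weight.
For example, you could take all 6 of the height and weight observations above and calculate the BMI for each observation with a single equation.
These operations are very fast and computationally efficient. They are particularly helpful when your data has thousands of observations.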

In [15]:
# Calculate bmi
bmi = np_weight / np_height ** 2

# Print the result
print(bmi)

[23.34925219 27.88755755 28.75558507 25.48723993 23.87257618 25.84368152]
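To see what the array version saves, compare it with plain lists, where the same expression is not defined and an explicit loop is needed. This sketch repeats the height and weight data so it runs on its own; `bmi_list`, `bmi_array`, and `list_failed` are names chosen for the illustration.

```python
import numpy as np

height = [1.87, 1.87, 1.82, 1.91, 1.90, 1.85]
weight = [81.65, 97.52, 95.25, 92.98, 86.18, 88.45]

# Plain lists do not support element-wise arithmetic.
try:
    weight / height ** 2
    list_failed = False
except TypeError:
    list_failed = True

# With lists, each observation has to be handled explicitly.
bmi_list = [w / h ** 2 for w, h in zip(weight, height)]

# With arrays, one expression covers all observations.
bmi_array = np.array(weight) / np.array(height) ** 2

print(list_failed)                       # True
print(np.allclose(bmi_list, bmi_array))  # True
```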


### Subsetting
Another great feature of Numpy arrays is the ability to subset.
For example, to find out which observations in our BMI array are above 23, we can quickly subset it.

In [16]:
# For a boolean response
bmi > 23

# Print only those observations above 23
bmi[bmi > 23]

array([23.34925219, 27.88755755, 28.75558507, 25.48723993, 23.87257618,
       25.84368152])
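Every BMI value here is above 23, so the subset above returns the whole array. A higher threshold shows the filtering more clearly. The sketch below uses 25 as an arbitrary cut-off chosen for illustration and recomputes `bmi` so that it runs independently.

```python
import numpy as np

height = np.array([1.87, 1.87, 1.82, 1.91, 1.90, 1.85])
weight = np.array([81.65, 97.52, 95.25, 92.98, 86.18, 88.45])
bmi = weight / height ** 2

mask = bmi > 25               # boolean array, one entry per observation
above = bmi[mask]             # keeps only the entries where mask is True
positions = np.flatnonzero(mask)  # indexes of the selected observations

print(mask)                    # [False  True  True  True False  True]
print(np.count_nonzero(mask))  # 4
print(positions)               # [1 2 3 5]
```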

For a detailed Numpy tutorial, see `https://numpy.org/doc/stable/user/absolute_beginners.html`

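## Exercise

### Decision tree feature selection

A decision tree is a basic method for classification and regression; this lesson focuses on decision trees for classification. A decision tree has a tree structure and can be viewed as a collection of if-then rules.

Building a decision tree usually involves 3 steps: feature selection, tree generation, and tree pruning. In this lab session we will complete the feature selection part.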


In [17]:
dataset = [
    {"age": 19, "male": False, "single": False, "visit_library_in_Sunday": False},
    {"age": 19, "male": False, "single": False, "visit_library_in_Sunday": False},
    {"age": 19, "male": True,  "single": False, "visit_library_in_Sunday": True},
    {"age": 19, "male": True,  "single": True,  "visit_library_in_Sunday": True},
    {"age": 19, "male": False, "single": False, "visit_library_in_Sunday": False},
    
    {"age": 20, "male": False, "single": False, "visit_library_in_Sunday": False},
    {"age": 20, "male": False, "single": False, "visit_library_in_Sunday": False},
    {"age": 20, "male": True,  "single": True,  "visit_library_in_Sunday": True},
    {"age": 20, "male": False, "single": True,  "visit_library_in_Sunday": True},
    {"age": 20, "male": False, "single": True,  "visit_library_in_Sunday": True},
    
    {"age": 21, "male": False, "single": True,  "visit_library_in_Sunday": True},
    {"age": 21, "male": False, "single": True,  "visit_library_in_Sunday": True},
    {"age": 21, "male": True,  "single": False, "visit_library_in_Sunday": True},
    {"age": 21, "male": True,  "single": False, "visit_library_in_Sunday": True},
    {"age": 21, "male": False, "single": False, "visit_library_in_Sunday": False}
]

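We want to build a function that predicts whether a student goes to the library on Sunday from the student's age, sex, and relationship status (single or not).
We can build this function from data collected in advance; this data is called the training dataset.

If we want to decide using only one feature, which one should we choose?
For example, suppose we use age and predict that students older than 19 go to the library on Sunday. That is: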

In [18]:
def tree_example(data):
    if data["age"]>19:
        return True
    else:
        return False
    
true_classification = 0
    
for data in dataset:
    if tree_example(data) == data["visit_library_in_Sunday"]:
        true_classification += 1
    
print("Classification accuracy:", true_classification/len(dataset))

Classification accuracy: 0.6666666666666666


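On this dataset we get an accuracy of about 0.667 (66.7%). How, then, should we choose the feature to get the highest accuracy?
For a detailed tutorial, see the attachment `exercise_2_supp.pdf`.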
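The attachment is not reproduced here. As a reminder, the standard definitions of empirical entropy, conditional entropy, and mutual information are sketched below, on the assumption that they match the attachment's notation. Here $Y$ denotes the output (`visit_library_in_Sunday`), $X$ a candidate feature, and $p(\cdot)$ the relative frequencies in the training dataset.

```latex
\begin{align}
H(Y) &= -\sum_{y} p(y)\,\log_2 p(y) \\
H(Y \mid X) &= \sum_{x} p(x)\, H(Y \mid X = x) \\
I(X; Y) &= H(Y) - H(Y \mid X)
\end{align}
% Worked value for the dataset above: 9 of the 15 students visit the library, so
% H(Y) = -\tfrac{9}{15}\log_2\tfrac{9}{15} - \tfrac{6}{15}\log_2\tfrac{6}{15} \approx 0.971
```

Since $H(Y)$ is the same for every feature, the feature with the largest mutual information is the one with the smallest conditional entropy.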

Complete the following functions to select the best feature.

In [3]:
import numpy as np

class decision_tree:
    def __init__(self, features, output, dataset):
        self.features = features
        self.output = output
        self.dataset = dataset
        
    def log(self, x):
        return np.log2(x)
    
    # you can use this function to calculate the empirical probability of a random variable under a dataset
    def get_prob(self, array):
        (unique, counts) = np.unique(array, return_counts=True)
        return counts/len(array)
        
    # you can use this function to calculate the empirical entropy of a random variable under a dataset
    def entropy(self, array):
        p = self.get_prob(array)
        return -np.sum(p*np.log2(p))
    
    def output_entropy(self):
        # calculate the empirical entropy of the output
        return -1
    
    def conditional_entropy(self, feature):
        # calculate the empirical conditional entropy of the output relative to the "feature"
        return -1
            
    def feature_selection(self):
        # select the feature that has maximum mutual information with the output
        return "a"
        
        
        
dataset = [
    {"age": 19, "male": False, "single": False, "visit_library_in_Sunday": False},
    {"age": 19, "male": False, "single": False, "visit_library_in_Sunday": False},
    {"age": 19, "male": True,  "single": False, "visit_library_in_Sunday": True},
    {"age": 19, "male": True,  "single": True,  "visit_library_in_Sunday": True},
    {"age": 19, "male": False, "single": False, "visit_library_in_Sunday": False},
    
    {"age": 20, "male": False, "single": False, "visit_library_in_Sunday": False},
    {"age": 20, "male": False, "single": False, "visit_library_in_Sunday": False},
    {"age": 20, "male": True,  "single": True,  "visit_library_in_Sunday": True},
    {"age": 20, "male": False, "single": True,  "visit_library_in_Sunday": True},
    {"age": 20, "male": False, "single": True,  "visit_library_in_Sunday": True},
    
    {"age": 21, "male": False, "single": True,  "visit_library_in_Sunday": True},
    {"age": 21, "male": False, "single": True,  "visit_library_in_Sunday": True},
    {"age": 21, "male": True,  "single": False, "visit_library_in_Sunday": True},
    {"age": 21, "male": True,  "single": False, "visit_library_in_Sunday": True},
    {"age": 21, "male": False, "single": False, "visit_library_in_Sunday": False}
]   
    
        
my_tree = decision_tree(\
    ["age", "male", "single"], "visit_library_in_Sunday", dataset)

# Test 1:
print(my_tree.output_entropy(), "should be around 0.971")

# Test 2:
print(my_tree.conditional_entropy("age"), "should be around 0.888")

# Test 3:
print(my_tree.feature_selection(), "should be single")


-1 should be around 0.971
-1 should be around 0.888
a should be single


In [5]:
def tree_example(data):
    if data["single"]:
        return True
    else:
        return False
    
true_classification = 0
    
for data in dataset:
    if tree_example(data) == data["visit_library_in_Sunday"]:
        true_classification += 1
    
print("Classification accuracy:", true_classification/len(dataset))

Classification accuracy: 0.8
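For readers who want to check their own answers, here is one possible way to compute the three quantities tested above. It is a sketch written as standalone functions rather than as the methods of `decision_tree`, and the names `rows`, `conditional_entropy`, and `select_feature` are choices made for this illustration, not the intended reference solution. The dataset is the same 15 records, written compactly as tuples.

```python
import numpy as np

# Same 15 records as above: (age, male, single, visit_library_in_Sunday)
rows = [
    (19, False, False, False), (19, False, False, False), (19, True, False, True),
    (19, True, True, True), (19, False, False, False),
    (20, False, False, False), (20, False, False, False), (20, True, True, True),
    (20, False, True, True), (20, False, True, True),
    (21, False, True, True), (21, False, True, True), (21, True, False, True),
    (21, True, False, True), (21, False, False, False),
]
names = ["age", "male", "single", "visit_library_in_Sunday"]
dataset = [dict(zip(names, row)) for row in rows]


def entropy(values):
    """Empirical entropy (in bits) of a sequence of labels."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / len(values)
    return float(-np.sum(p * np.log2(p)))


def conditional_entropy(dataset, feature, output):
    """Empirical H(output | feature): group entropies weighted by group size."""
    x = np.array([d[feature] for d in dataset])
    y = np.array([d[output] for d in dataset])
    total = 0.0
    for value in np.unique(x):
        mask = x == value
        total += mask.mean() * entropy(y[mask])
    return total


def select_feature(dataset, features, output):
    """Return the feature with the largest mutual information, plus all gains."""
    h_y = entropy([d[output] for d in dataset])
    gains = {f: h_y - conditional_entropy(dataset, f, output) for f in features}
    return max(gains, key=gains.get), gains


output = "visit_library_in_Sunday"
h_output = entropy([d[output] for d in dataset])
h_given_age = conditional_entropy(dataset, "age", output)
best, gains = select_feature(dataset, ["age", "male", "single"], output)

print(round(h_output, 3))     # 0.971
print(round(h_given_age, 3))  # 0.888
print(best)                   # single
```

Choosing `single` agrees with the 0.8 accuracy obtained by the one-feature rule above.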
