Commit 1bf91b1

add: K-means
1 parent 1db8b94 commit 1bf91b1

2 files changed: +140 -4 lines changed

README.md

Lines changed: 37 additions & 4 deletions
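@@ -33,7 +33,7 @@ e, f
 Using the classic Apriori algorithm, scan the transaction records pass by pass: compute the *candidate k-itemsets Ck*, then discard the itemsets whose **support (sup)** is too low to obtain the *frequent k-itemsets Lk*. Only itemsets up to the *frequent 3-itemsets* are computed; finally, compute the confidence of the association rules.
 > The k-th candidate set is built only by joining itemsets from the frequent (k-1)-itemsets; the records are then scanned to obtain the support of each itemset in Ck.

-Algorithm output
+#### 3. Output
 <center>
 <img alt="Algorithm output" src="https://i.loli.net/2019/06/16/5d05ad0e8f2e762317.png" width="80%" />
 </center>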
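The Apriori pass described in the context lines of this hunk can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the names `apriori` and `confidence`, the use of an absolute count for `min_support`, and the toy transactions are all assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support, max_k=3):
    """Return {itemset: support count} for frequent itemsets up to size max_k.

    Hypothetical helper: min_support is an absolute count (an assumption).
    """
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # One scan of the records: count the transactions containing each candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # C1: every single item is a candidate 1-itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates and k <= max_k:
        counts = count(candidates)
        # Lk: keep only the candidates whose support reaches the threshold.
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: unions of two frequent k-itemsets that have size k + 1.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k))}
        k += 1
    return frequent

def confidence(frequent, antecedent, consequent):
    """conf(A -> B) = sup(A ∪ B) / sup(A), using the counts from apriori()."""
    a, b = frozenset(antecedent), frozenset(consequent)
    return frequent[a | b] / frequent[a]

# Toy data, made up for illustration only.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(transactions, min_support=3)
```

Stopping at `max_k=3` mirrors the description's choice to compute only up to the frequent 3-itemsets; rule confidence is then a ratio of two support counts that the scan has already produced.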
@@ -77,15 +77,48 @@ clf.fit(x_train, y_train)
 with open("tree.dot", 'w') as f:
     f = tree.export_graphviz(clf, out_file=f)
 ```
-Algorithm output
+#### 3. Output
 <center>
 <img alt="Algorithm output" src="https://i.loli.net/2019/06/16/5d05b41f3cca371767.png" width="80%" />
 </center>
-Decision tree
+
 <center>
+<span>Decision tree</span>
+<br>
 <img alt="Decision tree" src="https://i.loli.net/2019/06/16/5d05b41f6850332395.png" width="80%" />
 </center>

 <hr>

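-## Data clustering: the K-means algorithm
+## Data clustering: the K-means algorithm
+#### 1. Dataset
+The dataset is the well-known iris point set that ships with a Python library (scikit-learn).
+```python
+from sklearn import datasets
+
+iris = datasets.load_iris()
+X, y = iris.data, iris.target
+```
+A sample of the data:
+```bash
+[1.5 0.2]
+[3.2 0.2]
+[3.1 0.2]
+[4.6 0.2]
+...
+```
+
+#### 2. [Implementation](数据聚类(K-means)/k-means.py)
+K-means requires the number of clusters k to be chosen in advance; the samples have only attributes, no class labels.
+Rough steps:
+1. Randomly select k samples from dataset X as the initial representative points of the clusters; each representative point stands for one class.
+2. For every sample in the dataset, compute its distance to each of the k representative points (the distance d can be Euclidean) and assign it to the nearest cluster. This completes one clustering pass.
+3. Once the data is partitioned, compute the mean of each cluster and use it as that cluster's new representative point, which gives k new representative points.
+4. As in step 2, recompute the distance from every point to the representative points and assign each point to the nearest cluster.
+5. Repeat steps 3 and 4 until the clusters no longer change (the assignment of samples is fixed), i.e. the sum-of-squared-errors criterion has reached a (local) optimum.
+
+#### 3. Output
+<center>
+<img alt="K-means clustering output" src="https://i.loli.net/2019/06/16/5d05bb1a54a9561636.png" width="80%" />
+</center>
+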
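The five steps listed in the README hunk above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the code committed in `k-means.py`: the signature `kmeans(points, k, init=None, seed=None, max_iter=100)`, the optional fixed `init` (used here only to make the example reproducible), and the toy points are all hypothetical.

```python
import math
import random

def kmeans(points, k, init=None, seed=None, max_iter=100):
    """Cluster points (tuples of floats); return (centroids, labels, sse)."""
    if init is None:
        # Step 1: pick k distinct samples as the initial representative points.
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(init)  # fixed start, assumed here for reproducibility
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every sample to its nearest representative point.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # Step 5: the assignment is stable, so stop.
        labels = new_labels
        # Step 3: the mean of each cluster becomes its new representative point.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Sum of squared distances from each sample to its cluster's centroid.
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse

# Two made-up, well-separated squares of points.
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0),
          (10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)]
centroids, labels, sse = kmeans(points, k=2, init=[(0.0, 0.0), (12.0, 12.0)])
```

Leaving `init` unset gives the random start of step 1. Because the result then depends on the draw, and K-means only reaches a local optimum, it is common to run it several times and keep the run with the smallest `sse`.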

数据聚类(K-means)/k-means.py

Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

iris = datasets.load_iris()
X, y = iris.data, iris.target

# keep only two dimensions (columns 1 and 3) to make visualization easy
data = X[:, [1, 3]]

print(data)

plt.scatter(data[:, 0], data[:, 1])

ck = 3
'''
Randomly pick k points as the initial cluster representatives, i.e. the centroids
'''
def rand_center(data, k):
    """Generate k centers within the range of the data set."""
    n = data.shape[1]  # number of features
    centroids = np.zeros((k, n))  # init with (0,0)....
    for i in range(n):
        dmin, dmax = np.min(data[:, i]), np.max(data[:, i])
        centroids[:, i] = dmin + (dmax - dmin) * np.random.rand(k)
    return centroids

# initial list of centroids
centroids = rand_center(data, ck)
print(centroids)

def kmeans(data, k=2):
    """Cluster data into k groups; return (centroids, labels, SSE)."""

    def _distance(p1, p2):
        """
        Return the Euclidean distance between two points.
        p1 = np.array([0,0]), p2 = np.array([1,1]) => 1.414
        """
        tmp = np.sum((p1 - p2) ** 2)
        return np.sqrt(tmp)

    def _rand_center(data, k):
        """Generate k centers within the range of the data set."""
        n = data.shape[1]  # number of features
        centroids = np.zeros((k, n))  # init with (0,0)....
        for i in range(n):
            dmin, dmax = np.min(data[:, i]), np.max(data[:, i])
            centroids[:, i] = dmin + (dmax - dmin) * np.random.rand(k)
        return centroids

    def _converged(centroids1, centroids2):
        # if the centroids did not change, we say 'converged'
        set1 = set([tuple(c) for c in centroids1])
        set2 = set([tuple(c) for c in centroids2])
        return (set1 == set2)

    n = data.shape[0]  # number of entries
    centroids = _rand_center(data, k)
    label = np.zeros(n, dtype=int)  # nearest centroid per point (np.int was removed in NumPy 1.24)
    assement = np.zeros(n)  # squared distance of each point to its centroid
    converged = False

    while not converged:
        old_centroids = np.copy(centroids)
        for i in range(n):
            # determine the nearest centroid and track it with label
            min_dist, min_index = np.inf, -1
            for j in range(k):
                dist = _distance(data[i], centroids[j])
                if dist < min_dist:
                    min_dist, min_index = dist, j
            label[i] = min_index
            assement[i] = min_dist ** 2

        # update centroids; an empty cluster keeps its old centroid, because
        # np.mean of an empty slice is NaN and would prevent convergence
        for m in range(k):
            members = data[label == m]
            if len(members) > 0:
                centroids[m] = np.mean(members, axis=0)
        converged = _converged(old_centroids, centroids)
    return centroids, label, np.sum(assement)


# run several times and keep the best result
best_assement = np.inf
best_centroids = None
best_label = None

for i in range(10):
    centroids, label, assement = kmeans(data, ck)
    if assement < best_assement:
        best_assement = assement
        best_centroids = centroids
        best_label = label

# plot: raw data on the left, one colour per cluster on the right
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.scatter(data[:, 0], data[:, 1], c='c', s=30, marker='o')
for m, colour in zip(range(ck), ['r', 'c', 'g']):
    cluster = data[best_label == m]
    ax2.scatter(cluster[:, 0], cluster[:, 1], c=colour)
# show the centroids of the best run, not those of the last run
ax2.scatter(best_centroids[:, 0], best_centroids[:, 1], c='b', s=120, marker='o')
plt.show()
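For reference, the third value returned by `kmeans` and compared across the ten runs is the sum of squared errors. It can be written as below; the symbols are my own notation, not the source's: $C_j$ is the set of samples assigned to cluster $j$, and $\mu_j$ is that cluster's centroid.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Each run only reaches a local minimum of $J$ that depends on the random initial centroids, which is why the script keeps the run with the smallest value.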
