epoll: limit paths

commit 28d82dc upstream.

The current epoll code can be tickled to run basically indefinitely in
both the loop detection path check (on ep_insert()) and in the wakeup
paths.  The programs that tickle this behavior set up deeply linked
networks of epoll file descriptors that cause the epoll algorithms to
traverse them indefinitely.  A couple of sample programs have previously
been posted in this thread: https://lkml.org/lkml/2011/2/25/297.
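
The sketch below is a rough userspace illustration of the kind of densely
linked epoll network involved (it is not one of the programs posted in the
thread above, and the layer/width constants are arbitrary): every epoll fd
in one layer watches every epoll fd in the next, so the old loop-detection
pass re-walks the same shared subtrees over and over.

/*
 * Rough userspace illustration (not one of the posted test programs):
 * LAYERS layers of WIDTH epoll fds, where every fd in one layer watches
 * every fd in the next.  Built bottom-up, so each new link is added on
 * top of an already dense sub-network.  With this patch applied, some of
 * the later adds are expected to fail with -ELOOP or -EINVAL.
 */
#include <sys/epoll.h>

#define LAYERS 5
#define WIDTH  5

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int eps[LAYERS][WIDTH];
	int i, j, k;

	for (i = 0; i < LAYERS; i++)
		for (j = 0; j < WIDTH; j++)
			eps[i][j] = epoll_create1(0);

	for (i = LAYERS - 2; i >= 0; i--)
		for (j = 0; j < WIDTH; j++)
			for (k = 0; k < WIDTH; k++) {
				ev.data.fd = eps[i + 1][k];
				epoll_ctl(eps[i][j], EPOLL_CTL_ADD,
					  eps[i + 1][k], &ev);
			}
	return 0;
}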

To fix the loop detection path check algorithm, I simply keep track of
the epoll nodes that have already been visited.  Thus, the loop detection
becomes proportional to the number of epoll file descriptors and links.
This dramatically decreases the run-time of the loop check algorithm.  In
one diabolical case I tried, it reduced the run-time from 15 minutes (all
in kernel time) to 0.3 seconds.

Fixing the wakeup paths could be done at wakeup time in a similar manner,
by keeping track of nodes that have already been visited, but the
complexity is greater there, since there can be multiple wakeups on
different CPUs.  Thus, I've opted to limit the number of possible wakeup
paths when the paths are created.

This is accomplished by noting that the end file descriptors found during
the loop detection pass (starting from the newly added link) are actually
the sources of wakeup events.  I keep a list of these file descriptors and
limit the number and length of the paths that emanate from these 'source
file descriptors'.  In the current implementation I allow 1000 paths of
length 1, 500 of length 2, 100 of length 3, 50 of length 4 and 10 of
length 5.  Note that it is sufficient to check only the 'source file
descriptors' reachable from the newly added link, since no other 'source
file descriptors' will have newly added links.  This allows us to check
only the wakeup paths that may have become too long, and not re-check all
possible wakeup paths on the system.
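
As a hedged illustration of what an 'emanating path' means in practice
(userspace sketch only, not part of the patch; the fd counts are
arbitrary): fanning a single pipe out through many intermediate epoll
instances that are all watched by one top-level instance gives the pipe
one wakeup path of length 2 per intermediate instance, so under these
limits the EPOLL_CTL_ADD that would create the 501st such path should be
rejected with -EINVAL.

/*
 * Illustrative userspace sketch (not part of this patch): fan a single
 * pipe out through many intermediate epoll instances that are all
 * watched by one top-level instance.  Each intermediate instance gives
 * the pipe another wakeup path of length 2; exceeding 500 such paths is
 * expected to make EPOLL_CTL_ADD fail with -EINVAL.
 */
#include <sys/epoll.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

#define NMID 600	/* more than the 500 allowed paths of length 2 */

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int top = epoll_create1(0);
	int pfd[2], mid, i;

	if (top < 0 || pipe(pfd) < 0)
		return 1;
	for (i = 0; i < NMID; i++) {
		mid = epoll_create1(0);
		if (mid < 0)
			return 1;
		ev.data.fd = mid;
		if (epoll_ctl(top, EPOLL_CTL_ADD, mid, &ev) < 0)
			return 1;
		ev.data.fd = pfd[0];
		/* each iteration adds another length-2 path from the pipe */
		if (epoll_ctl(mid, EPOLL_CTL_ADD, pfd[0], &ev) < 0) {
			printf("add %d rejected: %s\n", i, strerror(errno));
			break;
		}
	}
	return 0;
}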

In terms of the path limit selection, it's first worth noting that the
most common case for epoll is probably the model where you have one epoll
file descriptor that is monitoring n 'source file descriptors'.  In this
case, each 'source file descriptor' has exactly 1 path of length 1.  Thus,
I believe the limits I'm proposing are quite reasonable, and may in fact
be too generous, so I'm hoping they will not cause any workload that
currently works to start failing.
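
For comparison, the common case described above looks roughly like the
following userspace sketch (illustrative only, not part of the patch; the
source count is arbitrary): one epoll instance watching n independent pipe
read ends, so each 'source file descriptor' contributes exactly one path
of length 1 and stays far below the limits.

/*
 * Illustrative userspace sketch (not part of this patch): the common
 * 1-epoll-fd / n-source-fds topology.  Each pipe read end has exactly
 * one wakeup path of length 1, well under the 1000-path limit.
 */
#include <sys/epoll.h>
#include <unistd.h>

#define NSOURCES 64

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int epfd = epoll_create1(0);
	int pfd[2], i;

	if (epfd < 0)
		return 1;
	for (i = 0; i < NSOURCES; i++) {
		if (pipe(pfd) < 0)
			return 1;
		ev.data.fd = pfd[0];
		/* exactly one path of length 1 from each source fd to epfd */
		if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) < 0)
			return 1;
	}
	return 0;
}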

In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations.  Currently it's only used in a subset
of the add paths.  I need to hold the epmutex so that we can traverse a
coherent graph when checking the number of paths.  I believe this
additional locking is probably ok, since it's in the setup/teardown paths
and doesn't affect the running paths, but it certainly is going to add
some extra overhead.  Also worth noting is that the epmutex was recently
added to the epoll_ctl add operations in the initial path loop detection
code, using the argument that it is not on a critical path.

Another thing to note here is the allowed length of epoll chains.
Currently, eventpoll.c defines:

/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4

This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1).  However, this limit is currently only enforced by the loop check
detection code, and only when the epoll file descriptors are added in a
certain order; it is therefore easily bypassed.  The newly added check for
wakeup paths strictly limits the wakeup paths to a length of 5, regardless
of the order in which eps are linked together.  Thus, a side effect of the
new code is a more consistent enforcement of the graph depth.
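
A userspace sketch of what that enforcement looks like (illustrative only,
not part of the patch; the depth constant is arbitrary): a pipe sits at
the bottom of a chain of nested epoll instances, and each additional
nesting level lengthens the pipe's wakeup path by one, so the add that
would push the path past length 5 is expected to fail (the wakeup-path
check returns -EINVAL), regardless of construction order.

/*
 * Illustrative userspace sketch (not part of this patch): a pipe at the
 * bottom of a chain of nested epoll instances.  Each extra nesting level
 * lengthens the pipe's wakeup path by one; the add that would exceed the
 * depth limit is expected to fail (-EINVAL from the wakeup-path check).
 */
#include <sys/epoll.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

#define DEPTH 8	/* deeper than the allowed path length of 5 */

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int eps[DEPTH], pfd[2], i;

	if (pipe(pfd) < 0)
		return 1;
	for (i = 0; i < DEPTH; i++)
		if ((eps[i] = epoll_create1(0)) < 0)
			return 1;

	/* innermost epoll instance watches the pipe: one path of length 1 */
	ev.data.fd = pfd[0];
	epoll_ctl(eps[DEPTH - 1], EPOLL_CTL_ADD, pfd[0], &ev);

	/* nest outward: eps[i] watches eps[i + 1] */
	for (i = DEPTH - 2; i >= 0; i--) {
		ev.data.fd = eps[i + 1];
		if (epoll_ctl(eps[i], EPOLL_CTL_ADD, eps[i + 1], &ev) < 0) {
			printf("nesting to depth %d rejected: %s\n",
			       DEPTH - i, strerror(errno));
			break;
		}
	}
	return 0;
}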

Thus far, I've tested this using the sample programs previously
mentioned, which now either return quickly or return -EINVAL.  I've also
tested using the piptest.c epoll tester, which showed no difference in
performance.  I've also created a number of different epoll networks and
tested that they behave as expected.

I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.

Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: Nelson Elhage <nelhage@ksplice.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
commit 203aa5260edca2ab1872ad8b08386d874f7132f3 (parent: e6aa5c0)
Authored by jibaron on January 12, 2012; committed by gregkh on February 29, 2012.
fs/eventpoll.c (234 changed lines)

@@ -197,6 +197,12 @@ struct eventpoll {
 
 	/* The user that created the eventpoll descriptor */
 	struct user_struct *user;
+
+	struct file *file;
+
+	/* used to optimize loop detection check */
+	int visited;
+	struct list_head visited_list_link;
 };
 
 /* Wait structure used by the poll hooks */

@@ -255,6 +261,15 @@ static struct kmem_cache *epi_cache __read_mostly;
 /* Slab cache used to allocate "struct eppoll_entry" */
 static struct kmem_cache *pwq_cache __read_mostly;
 
+/* Visited nodes during ep_loop_check(), so we can unset them when we finish */
+static LIST_HEAD(visited_list);
+
+/*
+ * List of files with newly added links, where we may need to limit the number
+ * of emanating paths. Protected by the epmutex.
+ */
+static LIST_HEAD(tfile_check_list);
+
 #ifdef CONFIG_SYSCTL
 
 #include <linux/sysctl.h>

@@ -276,6 +291,12 @@ ctl_table epoll_table[] = {
 };
 #endif /* CONFIG_SYSCTL */
 
+static const struct file_operations eventpoll_fops;
+
+static inline int is_file_epoll(struct file *f)
+{
+	return f->f_op == &eventpoll_fops;
+}
 
 /* Setup the structure that is used as key for the RB tree */
 static inline void ep_set_ffd(struct epoll_filefd *ffd,

@@ -728,12 +749,6 @@ static const struct file_operations eventpoll_fops = {
 	.llseek		= noop_llseek,
 };
 
-/* Fast test to see if the file is an eventpoll file */
-static inline int is_file_epoll(struct file *f)
-{
-	return f->f_op == &eventpoll_fops;
-}
-
 /*
  * This is called from eventpoll_release() to unlink files from the eventpoll
  * interface. We need to have this facility to cleanup correctly files that are

@@ -954,6 +969,99 @@ static void ep_rbtree_insert(struct eventpoll *ep, struct epitem *epi)
 	rb_insert_color(&epi->rbn, &ep->rbr);
 }
 
+
+
+#define PATH_ARR_SIZE 5
+/*
+ * These are the number paths of length 1 to 5, that we are allowing to emanate
+ * from a single file of interest. For example, we allow 1000 paths of length
+ * 1, to emanate from each file of interest. This essentially represents the
+ * potential wakeup paths, which need to be limited in order to avoid massive
+ * uncontrolled wakeup storms. The common use case should be a single ep which
+ * is connected to n file sources. In this case each file source has 1 path
+ * of length 1. Thus, the numbers below should be more than sufficient. These
+ * path limits are enforced during an EPOLL_CTL_ADD operation, since a modify
+ * and delete can't add additional paths. Protected by the epmutex.
+ */
+static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
+static int path_count[PATH_ARR_SIZE];
+
+static int path_count_inc(int nests)
+{
+	if (++path_count[nests] > path_limits[nests])
+		return -1;
+	return 0;
+}
+
+static void path_count_init(void)
+{
+	int i;
+
+	for (i = 0; i < PATH_ARR_SIZE; i++)
+		path_count[i] = 0;
+}
+
+static int reverse_path_check_proc(void *priv, void *cookie, int call_nests)
+{
+	int error = 0;
+	struct file *file = priv;
+	struct file *child_file;
+	struct epitem *epi;
+
+	list_for_each_entry(epi, &file->f_ep_links, fllink) {
+		child_file = epi->ep->file;
+		if (is_file_epoll(child_file)) {
+			if (list_empty(&child_file->f_ep_links)) {
+				if (path_count_inc(call_nests)) {
+					error = -1;
+					break;
+				}
+			} else {
+				error = ep_call_nested(&poll_loop_ncalls,
+							EP_MAX_NESTS,
+							reverse_path_check_proc,
+							child_file, child_file,
+							current);
+			}
+			if (error != 0)
+				break;
+		} else {
+			printk(KERN_ERR "reverse_path_check_proc: "
+				"file is not an ep!\n");
+		}
+	}
+	return error;
+}
+
+/**
+ * reverse_path_check - The tfile_check_list is list of file *, which have
+ *                      links that are proposed to be newly added. We need to
+ *                      make sure that those added links don't add too many
+ *                      paths such that we will spend all our time waking up
+ *                      eventpoll objects.
+ *
+ * Returns: Returns zero if the proposed links don't create too many paths,
+ *	    -1 otherwise.
+ */
+static int reverse_path_check(void)
+{
+	int length = 0;
+	int error = 0;
+	struct file *current_file;
+
+	/* let's call this for all tfiles */
+	list_for_each_entry(current_file, &tfile_check_list, f_tfile_llink) {
+		length++;
+		path_count_init();
+		error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
+					reverse_path_check_proc, current_file,
+					current_file, current);
+		if (error)
+			break;
+	}
+	return error;
+}
+
 /*
  * Must be called with "mtx" held.
  */

@@ -1015,6 +1123,11 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 	 */
 	ep_rbtree_insert(ep, epi);
 
+	/* now check if we've created too many backpaths */
+	error = -EINVAL;
+	if (reverse_path_check())
+		goto error_remove_epi;
+
 	/* We have to drop the new item inside our item list to keep track of it */
 	spin_lock_irqsave(&ep->lock, flags);
 

@@ -1039,6 +1152,14 @@ static int ep_insert(struct eventpoll *ep, struct epoll_event *event,
 
 	return 0;
 
+error_remove_epi:
+	spin_lock(&tfile->f_lock);
+	if (ep_is_linked(&epi->fllink))
+		list_del_init(&epi->fllink);
+	spin_unlock(&tfile->f_lock);
+
+	rb_erase(&epi->rbn, &ep->rbr);
+
 error_unregister:
 	ep_unregister_pollwait(ep, epi);
 

@@ -1303,18 +1424,36 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
 	int error = 0;
 	struct file *file = priv;
 	struct eventpoll *ep = file->private_data;
+	struct eventpoll *ep_tovisit;
 	struct rb_node *rbp;
 	struct epitem *epi;
 
 	mutex_lock_nested(&ep->mtx, call_nests + 1);
+	ep->visited = 1;
+	list_add(&ep->visited_list_link, &visited_list);
 	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
 		epi = rb_entry(rbp, struct epitem, rbn);
 		if (unlikely(is_file_epoll(epi->ffd.file))) {
+			ep_tovisit = epi->ffd.file->private_data;
+			if (ep_tovisit->visited)
+				continue;
 			error = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
-					       ep_loop_check_proc, epi->ffd.file,
-					       epi->ffd.file->private_data, current);
+					ep_loop_check_proc, epi->ffd.file,
+					ep_tovisit, current);
 			if (error != 0)
 				break;
+		} else {
+			/*
+			 * If we've reached a file that is not associated with
+			 * an ep, then we need to check if the newly added
+			 * links are going to add too many wakeup paths. We do
+			 * this by adding it to the tfile_check_list, if it's
+			 * not already there, and calling reverse_path_check()
+			 * during ep_insert().
+			 */
+			if (list_empty(&epi->ffd.file->f_tfile_llink))
+				list_add(&epi->ffd.file->f_tfile_llink,
+					 &tfile_check_list);
 		}
 	}
 	mutex_unlock(&ep->mtx);

@@ -1335,8 +1474,31 @@ static int ep_loop_check_proc(void *priv, void *cookie, int call_nests)
  */
 static int ep_loop_check(struct eventpoll *ep, struct file *file)
 {
-	return ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
+	int ret;
+	struct eventpoll *ep_cur, *ep_next;
+
+	ret = ep_call_nested(&poll_loop_ncalls, EP_MAX_NESTS,
 			      ep_loop_check_proc, file, ep, current);
+	/* clear visited list */
+	list_for_each_entry_safe(ep_cur, ep_next, &visited_list,
+							visited_list_link) {
+		ep_cur->visited = 0;
+		list_del(&ep_cur->visited_list_link);
+	}
+	return ret;
+}
+
+static void clear_tfile_check_list(void)
+{
+	struct file *file;
+
+	/* first clear the tfile_check_list */
+	while (!list_empty(&tfile_check_list)) {
+		file = list_first_entry(&tfile_check_list, struct file,
+					f_tfile_llink);
+		list_del_init(&file->f_tfile_llink);
+	}
+	INIT_LIST_HEAD(&tfile_check_list);
 }
 
 /*

@@ -1344,8 +1506,9 @@ static int ep_loop_check(struct eventpoll *ep, struct file *file)
  */
 SYSCALL_DEFINE1(epoll_create1, int, flags)
 {
-	int error;
+	int error, fd;
 	struct eventpoll *ep = NULL;
+	struct file *file;
 
 	/* Check the EPOLL_* constant for consistency.  */
 	BUILD_BUG_ON(EPOLL_CLOEXEC != O_CLOEXEC);

@@ -1362,11 +1525,25 @@ SYSCALL_DEFINE1(epoll_create1, int, flags)
 	 * Creates all the items needed to setup an eventpoll file. That is,
 	 * a file structure and a free file descriptor.
 	 */
-	error = anon_inode_getfd("[eventpoll]", &eventpoll_fops, ep,
+	fd = get_unused_fd_flags(O_RDWR | (flags & O_CLOEXEC));
+	if (fd < 0) {
+		error = fd;
+		goto out_free_ep;
+	}
+	file = anon_inode_getfile("[eventpoll]", &eventpoll_fops, ep,
 				 O_RDWR | (flags & O_CLOEXEC));
-	if (error < 0)
-		ep_free(ep);
-
+	if (IS_ERR(file)) {
+		error = PTR_ERR(file);
+		goto out_free_fd;
+	}
+	fd_install(fd, file);
+	ep->file = file;
+	return fd;
+
+out_free_fd:
+	put_unused_fd(fd);
+out_free_ep:
+	ep_free(ep);
 	return error;
 }
 

@@ -1432,21 +1609,27 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	/*
 	 * When we insert an epoll file descriptor, inside another epoll file
 	 * descriptor, there is the change of creating closed loops, which are
-	 * better be handled here, than in more critical paths.
+	 * better be handled here, than in more critical paths. While we are
+	 * checking for loops we also determine the list of files reachable
+	 * and hang them on the tfile_check_list, so we can check that we
+	 * haven't created too many possible wakeup paths.
 	 *
-	 * We hold epmutex across the loop check and the insert in this case, in
-	 * order to prevent two separate inserts from racing and each doing the
-	 * insert "at the same time" such that ep_loop_check passes on both
-	 * before either one does the insert, thereby creating a cycle.
+	 * We need to hold the epmutex across both ep_insert and ep_remove
+	 * b/c we want to make sure we are looking at a coherent view of
+	 * epoll network.
	 */
-	if (unlikely(is_file_epoll(tfile) && op == EPOLL_CTL_ADD)) {
+	if (op == EPOLL_CTL_ADD || op == EPOLL_CTL_DEL) {
 		mutex_lock(&epmutex);
 		did_lock_epmutex = 1;
-		error = -ELOOP;
-		if (ep_loop_check(ep, tfile) != 0)
-			goto error_tgt_fput;
 	}
-
+	if (op == EPOLL_CTL_ADD) {
+		if (is_file_epoll(tfile)) {
+			error = -ELOOP;
+			if (ep_loop_check(ep, tfile) != 0)
+				goto error_tgt_fput;
+		} else
+			list_add(&tfile->f_tfile_llink, &tfile_check_list);
+	}
 
 	mutex_lock_nested(&ep->mtx, 0);
 

@@ -1465,6 +1648,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 			error = ep_insert(ep, &epds, tfile, fd);
 		} else
 			error = -EEXIST;
+		clear_tfile_check_list();
 		break;
 	case EPOLL_CTL_DEL:
 		if (epi)

@@ -1483,7 +1667,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
 	mutex_unlock(&ep->mtx);
 
 error_tgt_fput:
-	if (unlikely(did_lock_epmutex))
+	if (did_lock_epmutex)
 		mutex_unlock(&epmutex);
 
 	fput(tfile);

include/linux/eventpoll.h (1 changed line)

@@ -61,6 +61,7 @@ struct file;
 static inline void eventpoll_init_file(struct file *file)
 {
 	INIT_LIST_HEAD(&file->f_ep_links);
+	INIT_LIST_HEAD(&file->f_tfile_llink);
 }
 
 

include/linux/fs.h (1 changed line)

@@ -1001,6 +1001,7 @@ struct file {
 #ifdef CONFIG_EPOLL
 	/* Used by fs/eventpoll.c to link all the hooks to this file */
 	struct list_head	f_ep_links;
+	struct list_head	f_tfile_llink;
 #endif /* #ifdef CONFIG_EPOLL */
 	struct address_space	*f_mapping;
 #ifdef CONFIG_DEBUG_WRITECOUNT
