Commits on Oct 31, 2011
  1. kernel: Map most files to use export.h instead of module.h

    Paul Gortmaker authored
    The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else.  Revector them
    onto the isolated export header for faster compile times.
    
    Nothing to see here but a whole lot of instances of:
    
      -#include <linux/module.h>
      +#include <linux/export.h>
    
    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.
    
    Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Commits on Jul 20, 2011
  1. make sure that nsproxy_cache is initialized early enough

    Al Viro authored
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Commits on May 27, 2011
  1. cgroup: remove the ns_cgroup

    dlezcano authored committed
    The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
    leads to some problems:
    
      * cgroup creation is out-of-control
      * cgroup name can conflict when pids are looping
      * it is not possible to have a single process handling a lot of
        namespaces without falling in an exponential creation time
      * we may want to create a namespace without creating a cgroup
    
      The ns_cgroup was replaced by a compatibility flag 'clone_children',
      where a newly created cgroup will copy the parent cgroup values.
      The userspace has to manually create a cgroup and add a task to
      the 'tasks' file.
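
    The manual userspace step described above can be sketched as follows.
    This is only an illustration: the helper name add_task_to_cgroup() and
    the directory passed to it are hypothetical, and it assumes a cgroup
    hierarchy is already mounted and exposes a 'tasks' file per cgroup.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create the cgroup directory `cgroup_dir` (assumed
 * to live on a mounted cgroup hierarchy) and move `pid` into it by
 * writing the pid to the cgroup's 'tasks' file.
 * Returns 0 on success, -1 on failure with errno set.
 */
int add_task_to_cgroup(const char *cgroup_dir, pid_t pid)
{
    char path[4096];
    char buf[32];
    int fd, len, saved_errno;
    ssize_t written;

    /* An already existing cgroup is fine; any other error is fatal. */
    if (mkdir(cgroup_dir, 0755) == -1 && errno != EEXIST)
        return -1;

    len = snprintf(path, sizeof(path), "%s/tasks", cgroup_dir);
    if (len < 0 || (size_t)len >= sizeof(path)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    /* On a cgroup filesystem the kernel provides 'tasks'; do not create it. */
    fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    len = snprintf(buf, sizeof(buf), "%ld\n", (long)pid);
    written = write(fd, buf, (size_t)len);
    saved_errno = errno;
    close(fd);
    errno = saved_errno;

    return written == (ssize_t)len ? 0 : -1;
}
```

    A caller would pass the path of the new cgroup below the mount point
    and the pid of the task to move, for example the result of getpid().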
    
    This patch removes the ns_cgroup as suggested in the following thread:
    
    https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html
    
    The 'cgroup_clone' function is removed because it is no longer used.
    
    This is a userspace-visible change.  Commit 4553175 ("cgroup: notify
    ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
    printk warning users that the feature is planned for removal.  Since that
    time we have heard from XXX users who were affected by this.
    
    Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
    Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Jamal Hadi Salim <hadi@cyberus.ca>
    Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
    Acked-by: Paul Menage <menage@google.com>
    Acked-by: Matt Helsley <matthltc@us.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on May 10, 2011
  1. ns: Introduce the setns syscall

    ebiederm authored
    With the networking stack today there is demand to handle
    multiple network stacks at a time.  Not in the context
    of containers but in the context of people doing interesting
    things with routing.
    
    There is also demand in the context of containers to have
    an efficient way to execute some code in the container itself.
    If nothing else it is very useful as a debugging technique.
    
    Both problems can be solved by starting some form of login
    daemon in the namespaces people want access to, or you
    can play games by ptracing a process and getting the
    traced process to do things you want it to do. However
    it turns out that a login daemon or a ptrace puppet
    controller are more code, they are more prone to
    failure, and generally they are less efficient than
    simply changing the namespace of a process to a
    specified one.
    
    Pieces of this puzzle can also be solved by, instead of
    coming up with a general purpose system call, coming up
    with targeted system calls, perhaps socketat, that solve
    a subset of the larger problem.  Overall that appears
    to be more work for less reward.
    
    int setns(int fd, int nstype);
    
    The fd argument is a file descriptor referring to a proc
    file of the namespace you want to switch the process to.
    
    In the setns system call the nstype is either 0 or the
    clone flag of the namespace you intend to change, which
    prevents changing a namespace unintentionally.
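
    From userspace, a call could look like the sketch below.  It assumes
    a C library that provides a setns() wrapper declared in <sched.h>
    under _GNU_SOURCE, and a namespace file such as /proc/<pid>/ns/net;
    the helper name join_namespace() is hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/*
 * Hypothetical helper: switch the calling process into the namespace
 * referred to by `ns_path`.  `nstype` is 0 (accept any namespace type)
 * or a CLONE_NEW* flag naming the type the caller expects.
 * Returns 0 on success, -1 on failure with errno set.
 */
int join_namespace(const char *ns_path, int nstype)
{
    int fd, ret, saved_errno;

    fd = open(ns_path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* Fails if the file does not refer to a namespace of type nstype. */
    ret = setns(fd, nstype);

    /* The descriptor is only needed for the call itself. */
    saved_errno = errno;
    close(fd);
    errno = saved_errno;

    return ret;
}
```

    Passing CLONE_NEWNET as nstype asks for the call to be refused when
    the file turns out to refer to some other kind of namespace.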
    
    v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
    v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
    v4: Moved wiring up of the system call to another patch
    v5: Cleaned up the system call arguments
        - Changed the order.
        - Modified nstype to take the standard clone flags.
    v6: Added missing error handling as pointed out by Matt Helsley <matthltc@us.ibm.com>
    
    Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Commits on Mar 24, 2011
  1. userns: user namespaces: convert several capable() calls

    hallyn authored committed
    CAP_IPC_OWNER and CAP_IPC_LOCK can be checked against current_user_ns(),
    because the resource comes from current's own ipc namespace.
    
    setuid/setgid are to uids in own namespace, so again checks can be against
    current_user_ns().
    
    Changelog:
    	Jan 11: Use task_ns_capable() in place of sched_capable().
    	Jan 11: Use nsown_capable() as suggested by Bastian Blank.
    	Jan 11: Clarify (hopefully) some logic in futex and sched.c
    	Feb 15: use ns_capable for ipc, not nsown_capable
    	Feb 23: let copy_ipcs handle setting ipc_ns->user_ns
    	Feb 23: pass ns down rather than taking it from current
    
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
    Acked-by: David Howells <dhowells@redhat.com>
    Cc: James Morris <jmorris@namei.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. userns: add a user namespace owner of ipc ns

    hallyn authored committed
    Changelog:
    	Feb 15: Don't set new ipc->user_ns if we didn't create a new
    		ipc_ns.
    	Feb 23: Move extern declaration to ipc_namespace.h, and group
    		fwd declarations at top.
    
    Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
    Acked-by: David Howells <dhowells@redhat.com>
    Cc: James Morris <jmorris@namei.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. userns: allow sethostname in a container

    hallyn authored committed
    Changelog:
    	Feb 23: let clone_uts_ns() handle setting uts->user_ns
    		To do so we need to pass in the task_struct who'll
    		get the utsname, so we can get its user_ns.
    	Feb 23: As per Oleg's comment, just pass in tsk, instead of two
    		of its members.
    
    Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
    Acked-by: David Howells <dhowells@redhat.com>
    Cc: James Morris <jmorris@namei.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. userns: add a user_namespace as creator/owner of uts_namespace

    hallyn authored committed
    The expected course of development for user namespaces targeted
    capabilities is laid out at https://wiki.ubuntu.com/UserNamespace.
    
    Goals:
    
    - Make it safe for an unprivileged user to unshare namespaces.  They
      will be privileged with respect to the new namespace, but this should
      only include resources which the unprivileged user already owns.
    
    - Provide separate limits and accounting for userids in different
      namespaces.
    
    Status:
    
      Currently (as of 2.6.38) you can clone with the CLONE_NEWUSER flag to
      get a new user namespace if you have the CAP_SYS_ADMIN, CAP_SETUID, and
      CAP_SETGID capabilities.  What this gets you is a whole new set of
      userids, meaning that user 500 will have a different 'struct user' in
      your namespace than in other namespaces.  So any accounting information
      stored in struct user will be unique to your namespace.
    
      However, throughout the kernel there are checks which
    
      - simply check for a capability.  Since root in a child namespace
        has all capabilities, this means that a child namespace is not
        constrained.
    
      - simply compare uid1 == uid2.  Since these are the integer uids,
        uid 500 in namespace 1 will be said to be equal to uid 500 in
        namespace 2.
    
      As a result, the lxc implementation at lxc.sf.net does not use user
      namespaces.  This is actually helpful because it leaves us free to
      develop user namespaces in such a way that, for some time, user
      namespaces may not be useful.
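
    The second kind of check can be illustrated with a toy model.  The
    types and function names below are hypothetical stand-ins and not
    kernel structures; the sketch only contrasts comparing bare integer
    uids with comparing a (namespace, uid) pair.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only, not kernel types. */
struct toy_user_ns {
    int id;
};

struct toy_identity {
    const struct toy_user_ns *ns; /* namespace the uid belongs to */
    unsigned int uid;             /* integer uid inside that namespace */
};

/* The problematic style of check: only the integer uids are compared. */
bool same_user_naive(struct toy_identity a, struct toy_identity b)
{
    return a.uid == b.uid;
}

/* A namespace-aware check: both the namespace and the uid must match. */
bool same_user_ns_aware(struct toy_identity a, struct toy_identity b)
{
    return a.ns == b.ns && a.uid == b.uid;
}
```

    With uid 500 in two different namespaces, the naive check reports
    the same user while the namespace-aware one does not.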
    
    Bugs aside, this patchset is supposed to not at all affect systems which
    are not actively using user namespaces, and only restrict what tasks in
    child user namespace can do.  They begin to limit privilege to a user
    namespace, so that root in a container cannot kill or ptrace tasks in the
    parent user namespace, and can only get world access rights to files.
    Since all files currently belong to the initial user namespace, that means
    that child user namespaces can only get world access rights to *all*
    files.  While this temporarily makes user namespaces bad for system
    containers, it starts to get useful for some sandboxing.
    
    I've run the 'runltplite.sh' with and without this patchset and found no
    difference.
    
    This patch:
    
    copy_process() handles CLONE_NEWUSER before the rest of the namespaces.
    So in the case of clone(CLONE_NEWUSER|CLONE_NEWUTS) the new uts namespace
    will have the new user namespace as its owner.  That is what we want,
    since we want root in that new userns to be able to have privilege over
    it.
    
    Changelog:
    	Feb 15: don't set uts_ns->user_ns if we didn't create
    		a new uts_ns.
    	Feb 23: Move extern init_user_ns declaration from
    		init/version.c to utsname.h.
    
    Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
    Acked-by: David Howells <dhowells@redhat.com>
    Cc: James Morris <jmorris@namei.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Mar 30, 2010
  1. include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h

    Tejun Heo authored
    
    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files.  percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.
    
    percpu.h -> slab.h dependency is about to be removed.  Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability.  As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.
    
      http://userweb.kernel.org/~tj/misc/slabh-sweep.py
    
    The script does the following.
    
    * Scan files for gfp and slab usages and update includes such that
      only the necessary includes are there.  ie. if only gfp is used,
      gfp.h, if slab is used, slab.h.
    
    * When the script inserts a new include, it looks at the include
      blocks and tries to put the new include such that its order conforms
      to its surrounding.  It's put in the include block which contains
      core kernel includes, in the same order that the rest are ordered -
      alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
      doesn't seem to be any matching order.
    
    * If the script can't find a place to put a new include (mostly
      because the file doesn't have fitting include block), it prints out
      an error message indicating which .h file needs to be added to the
      file.
    
    The conversion was done in the following steps.
    
    1. The initial automatic conversion of all .c files updated slightly
       over 4000 files, deleting around 700 includes and adding ~480 gfp.h
       and ~3000 slab.h inclusions.  The script emitted errors for ~400
       files.
    
    2. Each error was manually checked.  Some didn't need the inclusion,
       some needed manual addition while adding it to implementation .h or
       embedding .c file was more appropriate for others.  This step added
       inclusions to around 150 files.
    
    3. The script was run again and the output was compared to the edits
       from #2 to make sure no file was left behind.
    
    4. Several build tests were done and a couple of problems were fixed.
       e.g. lib/decompress_*.c used malloc/free() wrappers around slab
       APIs requiring slab.h to be added manually.
    
    5. The script was run on all .h files but without automatically
       editing them as sprinkling gfp.h and slab.h inclusions around .h
       files could easily lead to inclusion dependency hell.  Most gfp.h
       inclusion directives were ignored as stuff from gfp.h was usually
       widely available and often used in preprocessor macros.  Each
       slab.h inclusion directive was examined and added manually as
       necessary.
    
    6. percpu.h was updated not to include slab.h.
    
    7. Build tests were done on the following configurations and failures
       were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
       distributed build env didn't work with gcov compiles) and a few
       more options had to be turned off depending on archs to make things
       build (like ipr on powerpc/64 which failed due to missing writeq).
    
       * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
       * powerpc and powerpc64 SMP allmodconfig
       * sparc and sparc64 SMP allmodconfig
       * ia64 SMP allmodconfig
       * s390 SMP allmodconfig
       * alpha SMP allmodconfig
       * um on x86_64 SMP allmodconfig
    
    8. percpu.h modifications were reverted so that it could be applied as
       a separate patch and serve as bisection point.
    
    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.
    
    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Commits on Mar 12, 2010
  1. nsproxy: remove INIT_NSPROXY()

    Alexey Dobriyan authored committed
    Remove INIT_NSPROXY(), use C99 initializer.
    Remove INIT_IPC_NS(), INIT_NET_NS() while I'm at it.
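
    As a rough userspace illustration of the change in style, the sketch
    below defines one object through an initializer macro and one with a
    C99 designated initializer written at the definition.  All toy_*
    names are hypothetical and the structure is greatly simplified.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct toy_uts_ns {
    const char *nodename;
};

struct toy_nsproxy {
    int count;
    struct toy_uts_ns *uts_ns;
    void *ipc_ns;
    void *net_ns;
};

static struct toy_uts_ns toy_init_uts_ns = { .nodename = "(none)" };

/* Macro style: the initializer is hidden behind a macro in a header. */
#define TOY_INIT_NSPROXY(nsproxy) { .count = 1, .uts_ns = &toy_init_uts_ns }
struct toy_nsproxy toy_via_macro = TOY_INIT_NSPROXY(toy_via_macro);

/* C99 style: the designated initializer sits at the definition itself. */
struct toy_nsproxy toy_init_nsproxy = {
    .count  = 1,
    .uts_ns = &toy_init_uts_ns,
    /* members that are not named are zero-initialized */
};
```

    Both objects end up with the same contents; the second form keeps
    the values next to the definition instead of in a separate macro.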
    
    Note: headers trim will be done later, now it's quite pointless because
    results will be invalidated by merge window.
    
    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Acked-by: Serge Hallyn <serue@us.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Jun 18, 2009
  1. nsproxy: extract create_nsproxy()

    Alexey Dobriyan authored committed
    clone_nsproxy() does useless copying of old nsproxy -- every pointer will
    be rewritten to new ns or to old ns.  Remove copying, rename
    clone_nsproxy(), create_nsproxy() will be used by C/R code to create fresh
    nsproxy on restart.
    
    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Acked-by: Serge Hallyn <serue@us.ibm.com>
    Cc: Pavel Emelyanov <xemul@openvz.org>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Nov 24, 2008
  1. User namespaces: set of cleanups (v2)

    Serge Hallyn authored
    The user_ns is moved from nsproxy to user_struct, so that a struct
    cred by itself is sufficient to determine access (which it otherwise
    would not be).  Corresponding ecryptfs fixes (by David Howells) are
    here as well.
    
    Fix refcounting.  The following rules now apply:
            1. The task pins the user struct.
            2. The user struct pins its user namespace.
            3. The user namespace pins the struct user which created it.
    
    User namespaces are cloned during copy_creds().  Unsharing a new user_ns
    is no longer possible.  (We could re-add that, but it'll cause code
    duplication and doesn't seem useful if PAM doesn't need to clone user
    namespaces).
    
    When a user namespace is created, its first user (uid 0) gets empty
    keyrings and a clean group_info.
    
    This incorporates a previous patch by David Howells.  Here
    is his original patch description:
    
    >I suggest adding the attached incremental patch.  It makes the following
    >changes:
    >
    > (1) Provides a current_user_ns() macro to wrap accesses to current's user
    >     namespace.
    >
    > (2) Fixes eCryptFS.
    >
    > (3) Renames create_new_userns() to create_user_ns() to be more consistent
    >     with the other associated functions and because the 'new' in the name is
    >     superfluous.
    >
    > (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
    >     beginning of do_fork() so that they're done prior to making any attempts
    >     at allocation.
    >
    > (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
    >     to fill in rather than have it return the new root user.  I don't imagine
    >     the new root user being used for anything other than filling in a cred
    >     struct.
    >
    >     This also permits me to get rid of a get_uid() and a free_uid(), as the
    >     reference the creds were holding on the old user_struct can just be
    >     transferred to the new namespace's creator pointer.
    >
    > (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
    >     preparation rather than doing it in copy_creds().
    >
    >David
    
    >Signed-off-by: David Howells <dhowells@redhat.com>
    
    Changelog:
    	Oct 20: integrate dhowells comments
    		1. leave thread_keyring alone
    		2. use current_user_ns() in set_user()
    
    Signed-off-by: Serge Hallyn <serue@us.ibm.com>
Commits on Aug 23, 2008
  1. removed unused #include <linux/version.h>'s

    Adrian Bunk authored committed
    This patch lets the files using linux/version.h match the files that
    #include it.
    
    Signed-off-by: Adrian Bunk <bunk@kernel.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Jul 25, 2008
  1. cgroup_clone: use pid of newly created task for new cgroup

    Serge E. Hallyn authored committed
    cgroup_clone creates a new cgroup with the pid of the task.  This works
    correctly for unshare, but for clone cgroup_clone is called from
    copy_namespaces inside copy_process, which happens before the new pid is
    created.  As a result, the new cgroup was created with current's pid.
    This patch:
    
    	1. Moves the call inside copy_process to after the new pid
    	   is created
    	2. Passes the struct pid into ns_cgroup_clone (as it is not
    	   yet attached to the task)
    	3. Passes a name from ns_cgroup_clone() into cgroup_clone()
    	   so as to keep cgroup_clone() itself simpler
    	4. Uses pid_vnr() to get the process id value, so that the
    	   pid used to name the new cgroup is always the pid as it
    	   would be known to the task which did the cloning or
    	   unsharing.  I think that is the most intuitive thing to
    	   do.  This way, task t1 does clone(CLONE_NEWPID) to get
    	   t2, which does clone(CLONE_NEWPID) to get t3, then the
    	   cgroup for t3 will be named for the pid by which t2 knows
    	   t3.
    
    (Thanks to Dan Smith for finding the main bug)
    
    Changelog:
    	June 11: Incorporate Paul Menage's feedback:  don't pass
    	         NULL to ns_cgroup_clone from unshare, and reduce
    		 patch size by using 'nodename' in cgroup_clone.
    	June 10: Original version
    
    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Serge Hallyn <serge@us.ibm.com>
    Acked-by: Paul Menage <menage@google.com>
    Tested-by: Dan Smith <danms@us.ibm.com>
    Cc: Balbir Singh <balbir@in.ibm.com>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Apr 29, 2008
  1. ipc: sysvsem: refuse clone(CLONE_SYSVSEM|CLONE_NEWIPC)

    Serge E. Hallyn authored committed
    CLONE_NEWIPC|CLONE_SYSVSEM interaction isn't handled properly.  This can cause
    a kernel memory corruption.  CLONE_NEWIPC must detach from the existing undo
    lists.
    
    Fix, part 3: refuse clone(CLONE_SYSVSEM|CLONE_NEWIPC).
    
    With unshare, specifying CLONE_SYSVSEM means unshare the sysvsem.  So it seems
    reasonable that CLONE_NEWIPC without CLONE_SYSVSEM would just imply
    CLONE_SYSVSEM.
    
    However with clone, specifying CLONE_SYSVSEM means *share* the sysvsem.  So
    calling clone(CLONE_SYSVSEM|CLONE_NEWIPC) is explicitly asking for something
    we can't allow.  So return -EINVAL in that case.
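
    The rule for clone can be sketched as a standalone check.  The
    function name check_ipc_clone_flags() is hypothetical; the flag
    constants are the ones exposed by <sched.h> under _GNU_SOURCE.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/*
 * Hypothetical helper mirroring the rule described above: a new IPC
 * namespace cannot be combined with sharing the SysV semaphore undo
 * list, so that combination is rejected.
 * Returns 0 if the flags are acceptable, -EINVAL otherwise.
 */
int check_ipc_clone_flags(unsigned long flags)
{
    if ((flags & CLONE_NEWIPC) && (flags & CLONE_SYSVSEM))
        return -EINVAL;
    return 0;
}
```

    Either flag on its own passes the check; only the combination is
    refused.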
    
    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
    Cc: Manfred Spraul <manfred@colorfullife.com>
    Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Pavel Emelyanov <xemul@openvz.org>
    Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
    Cc: Pierre Peiffer <peifferp@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Feb 8, 2008
  1. namespaces: move the IPC namespace under IPC_NS option

    xemul authored Linus Torvalds committed
    Currently the IPC namespace management code is spread over the ipc/*.c files.
    I moved this code into ipc/namespace.c file which is compiled out when needed.
    
    The linux/ipc_namespace.h file is used to store the prototypes of the
    functions in namespace.c and the stubs for NAMESPACES=n case.  This is done
    so, because the stub for copy_ipc_namespace requires the knowledge of the
    CLONE_NEWIPC flag, which is in sched.h.  But the linux/ipc.h file itself is
    included into many many .c files via the sys.h->sem.h sequence so adding the
    sched.h into it will make all these .c depend on sched.h which is not that
    good.  On the other hand the knowledge about the namespaces stuff is required
    in 4 .c files only.
    
    Besides, this patch compiles out some auxiliary functions from ipc/sem.c,
    msg.c and shm.c files.  It turned out that moving these functions into
    namespaces.c is not that easy because they use many other calls and macros
    from the original file.  Moving them would make this patch complicated.  On
    the other hand all these functions can be consolidated, so I will send a
    separate patch doing this a bit later.
    
    Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
    Acked-by: Serge Hallyn <serue@us.ibm.com>
    Cc: Cedric Le Goater <clg@fr.ibm.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Oct 19, 2007
  1. pid namespaces: allow cloning of new namespace

    xemul authored Linus Torvalds committed
    When clone() is invoked with CLONE_NEWPID, create a new pid namespace and then
    create a new struct pid for the new process.  Allocate pid_t's for the new
    process in the new pid namespace and all ancestor pid namespaces.  Make the
    newly cloned process the session and process group leader.
    
    Since the active pid namespace is special and expected to be the first entry
    in pid->upid_list, preserve the order of pid namespaces.
    
    The size of 'struct pid' is dependent on the number of pid namespaces the
    process exists in, so we use multiple 'pid-caches'.  Only one pid cache is
    created during system startup and this is used by processes that exist only in
    init_pid_ns.
    
    When a process clones its pid namespace, we create additional pid caches as
    necessary and use the pid cache to allocate 'struct pids' for that depth.
    
    Note, that with this patch the newly created namespace won't work, since the
    rest of the kernel still uses global pids, but this is to be fixed soon.  Init
    pid namespace still works.
    
    [oleg@tv-sign.ru: merge fix]
    Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
    Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
    Cc: Paul Menage <menage@google.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Oleg Nesterov <oleg@tv-sign.ru>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. Make access to task's nsproxy lighter

    xemul authored Linus Torvalds committed
    When someone wants to deal with some other task's namespaces it has to lock
    the task and then to get the desired namespace if one exists.  This is
    slow on read-only paths and may be impossible in some cases.
    
    E.g.  Oleg recently noticed a race between unshare() and the (sent for
    review in cgroups) pid namespaces - when the task notifies the parent it
    has to know the parent's namespace, but taking the task_lock() is
    impossible there - the code is under write locked tasklist lock.
    
    On the other hand switching the namespace on task (daemonize) and releasing
    the namespace (after the last task exit) is a rather rare operation and we
    can sacrifice its speed to solve the issues above.
    
    The access to other task namespaces is proposed to be performed
    like this:
    
         rcu_read_lock();
         nsproxy = task_nsproxy(tsk);
         if (nsproxy != NULL) {
                 /*
                  * work with the namespaces here
                  * e.g. get the reference on one of them
                  */
         } /*
            * NULL task_nsproxy() means that this task is
            * almost dead (zombie)
            */
         rcu_read_unlock();
    
    This patch has passed the review by Eric and Oleg :) and,
    of course, tested.
    
    [clg@fr.ibm.com: fix unshare()]
    [ebiederm@xmission.com: Update get_net_ns_by_pid]
    Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Oleg Nesterov <oleg@tv-sign.ru>
    Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Serge Hallyn <serue@us.ibm.com>
    Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. pid namespaces: define and use task_active_pid_ns() wrapper

    Sukadev Bhattiprolu authored Linus Torvalds committed
    With multiple pid namespaces, a process is known by some pid_t in every
    ancestor pid namespace.  Every time the process forks, the child process also
    gets a pid_t in every ancestor pid namespace.
    
    While a process is visible in >=1 pid namespaces, it can see pid_t's in only
    one pid namespace.  We call this pid namespace its "active pid namespace",
    and it is always the youngest pid namespace in which the process is known.
    
    This patch defines and uses a wrapper to find the active pid namespace of a
    process.  The implementation of the wrapper will be changed when support
    for multiple pid namespaces is added.
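
    The idea of one pid_t per ancestor namespace, with the youngest
    namespace being the active one, can be modelled in a few lines.  This
    is a toy sketch with hypothetical toy_* names, not the kernel's data
    structures or the wrapper this patch adds.

```c
#include <stddef.h>

#define TOY_MAX_LEVEL 4

/* Hypothetical model: level 0 is the initial pid namespace. */
struct toy_pid_ns {
    int level;
};

/* One (number, namespace) pair per namespace the process is known in. */
struct toy_upid {
    int nr;
    const struct toy_pid_ns *ns;
};

struct toy_pid {
    int level; /* depth of the youngest namespace */
    struct toy_upid numbers[TOY_MAX_LEVEL];
};

/* The active pid namespace is the youngest one the process is known in. */
const struct toy_pid_ns *toy_active_pid_ns(const struct toy_pid *pid)
{
    return pid->numbers[pid->level].ns;
}

/* The number by which `ns` knows the process, or 0 if it is not visible. */
int toy_pid_nr_ns(const struct toy_pid *pid, const struct toy_pid_ns *ns)
{
    if (ns->level >= 0 && ns->level <= pid->level &&
        pid->numbers[ns->level].ns == ns)
        return pid->numbers[ns->level].nr;
    return 0;
}
```

    A process created in a child namespace then carries two numbers, one
    seen from the initial namespace and one seen from its own.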
    
    Changelog:
    	2.6.22-rc4-mm2-pidns1:
    	- [Pavel Emelianov, Alexey Dobriyan] Back out the change to use
    	  task_active_pid_ns() in child_reaper() since task->nsproxy
    	  can be NULL during task exit (so child_reaper() continues to
    	  use init_pid_ns).
    
    	2.6.21-rc6-mm1:
    	- Rename task_pid_ns() to task_active_pid_ns() to reflect that a
    	  process can have multiple pid namespaces.
    
    Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
    Acked-by: Pavel Emelianov <xemul@openvz.org>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Cedric Le Goater <clg@fr.ibm.com>
    Cc: Dave Hansen <haveblue@us.ibm.com>
    Cc: Serge Hallyn <serue@us.ibm.com>
    Cc: Herbert Poetzel <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. cgroups: implement namespace tracking subsystem

    Serge E. Hallyn authored Linus Torvalds committed
    When a task enters a new namespace via a clone() or unshare(), a new cgroup
    is created and the task moves into it.
    
    This version names cgroups which are automatically created using
    cgroup_clone() as "node_<pid>" where pid is the pid of the unsharing or
    cloned process.  (Thanks Pavel for the idea) This is safe because if the
    process unshares again, it will create
    
    	/cgroups/(...)/node_<pid>/node_<pid>
    
    The only possibilities (AFAICT) for a -EEXIST on unshare are
    
    	1. pid wraparound
    	2. a process fails an unshare, then tries again.
    
    Case 1 is unlikely enough that I ignore it (at least for now).  In case 2, the
    node_<pid> will be empty and can be rmdir'ed to make the subsequent unshare()
    succeed.
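
    The naming scheme itself is simple to sketch.  The helper name
    format_node_name() is hypothetical and only shows how a
    "node_<pid>" name would be built from a pid value.

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: write the cgroup name "node_<pid>" into `buf`.
 * Returns 0 on success, -1 if the name does not fit into `size` bytes.
 */
int format_node_name(char *buf, size_t size, pid_t pid)
{
    int len = snprintf(buf, size, "node_%ld", (long)pid);

    if (len < 0 || (size_t)len >= size)
        return -1;
    return 0;
}
```

    Because the name depends only on the pid, a second process that is
    assigned the same pid after wraparound would produce the same name,
    which is the first -EEXIST case listed above.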
    
    Changelog:
    	Name cloned cgroups as "node_<pid>".
    
    [clg@fr.ibm.com: fix order of cgroup subsystems in init/Kconfig]
    Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
    Cc: Paul Menage <menage@google.com>
    Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Oct 17, 2007
  1. Use KMEM_CACHE macro to create the nsproxy cache

    xemul authored Linus Torvalds committed
    The blessed way for standard caches is to use it.  Besides, this may give
    this cache a better alignment.
    
    Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
    Acked-by: Cedric Le Goater <clg@fr.ibm.com>
    Acked-by: Serge Hallyn <serue@us.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Oct 10, 2007
  1. [NET]: Add network namespace clone & unshare support.

    ebiederm authored David S. Miller committed
    This patch allows you to create a new network namespace
    using sys_clone, or sys_unshare.
    
    As the network namespace is still experimental and under development
    clone and unshare support is only made available when CONFIG_NET_NS is
    selected at compile time.
    
    As this patch introduces network namespace support into code paths
    that exist when the CONFIG_NET is not selected there are a few
    additions made to net_namespace.h to allow a few more functions
    to be used when the networking stack is not compiled in.
    
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
Commits on Jul 20, 2007
  1. @pmundt

    mm: Remove slab destructors from kmem_cache_create().

    pmundt authored
    Slab destructors were no longer supported after Christoph's c59def9
    change. They've been BUGs for both slab and slub, and slob never
    supported them either.
    
    This rips out support for the dtor pointer from kmem_cache_create()
    completely and fixes up every single callsite in the kernel (there were
    about 224, not including the slab allocator definitions themselves,
    or the documentation references).
    
    Signed-off-by: Paul Mundt <lethal@linux-sh.org>
Commits on Jul 16, 2007
  1. @ebiederm

    namespace: ensure clone_flags are always stored in an unsigned long

    ebiederm authored Linus Torvalds committed
    While working on unshare support for the network namespace I noticed we
    were putting clone flags in an int.  Which is weird because the syscall
    uses unsigned long and we at least need an unsigned to properly hold all of
    the unshare flags.
    
    So to make the code consistent, this patch updates the code to use
    unsigned long instead of int for the clone flags in those places
    where we get it wrong today.
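
A small userspace sketch of why the type matters. `EXAMPLE_FLAG_BIT31` is a hypothetical flag chosen to occupy bit 31; it is not one of the kernel's clone flags. Converting an out-of-range value to int is implementation-defined in C; on common compilers it wraps to a negative number, which then sign-extends when widened on 64-bit systems.

```c
#include <assert.h>

/* Hypothetical flag in bit 31, the bit a 32-bit signed int cannot hold
 * as a positive value. */
#define EXAMPLE_FLAG_BIT31 0x80000000UL

/* Round-trip the flags through an int, as the old code paths did. */
unsigned long through_int(unsigned long flags)
{
    int narrowed = (int)flags;      /* implementation-defined if out of range */
    return (unsigned long)narrowed; /* sign-extends where long is wider */
}

/* Keep the flags in an unsigned long throughout. */
unsigned long through_ulong(unsigned long flags)
{
    unsigned long kept = flags;
    return kept;
}
```

Flags below bit 31 survive either path, which is presumably why the mismatch went unnoticed.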
    
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Acked-by: Cedric Le Goater <clg@fr.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. @legoater

    add a kmem_cache for nsproxy objects

    legoater authored Linus Torvalds committed
    It should improve performance in scenarios where many of these
    nsproxy objects are created by unsharing namespaces. This is
    typical of virtual servers that are being created or entered.
    
    This is also a good tool to find leaks and gather statistics on
    namespace usage.
    
    Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. @legoater

    fix create_new_namespaces() return value

    legoater authored Linus Torvalds committed
    dup_mnt_ns() and clone_uts_ns() return NULL on failure.  This is wrong:
    create_new_namespaces() uses ERR_PTR() to catch an error.  This means that the
    subsequent create_new_namespaces() will hit BUG_ON() in copy_mnt_ns() or
    copy_utsname().
    
    Modify create_new_namespaces() to also use the errors returned by the
    copy_*_ns routines and not to systematically return ENOMEM.
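
The mismatch can be sketched in userspace C. The lowercase `err_ptr()`, `ptr_err()` and `is_err()` helpers below are modeled on the kernel's ERR_PTR family but are reimplemented here for illustration; `copy_ns_null()` and `copy_ns_errptr()` are hypothetical copy routines, not the kernel's.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno in a pointer value, and decode it again. */
void *err_ptr(long error) { return (void *)(intptr_t)error; }
long ptr_err(const void *ptr) { return (long)(intptr_t)ptr; }

/* Only the top MAX_ERRNO addresses count as errors; NULL does not. */
int is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct ns { int id; };

/* Hypothetical copy routine that reports failure as NULL. */
struct ns *copy_ns_null(int fail)
{
    static struct ns ok = { 1 };
    return fail ? NULL : &ok;
}

/* Hypothetical copy routine that reports failure as an encoded errno. */
struct ns *copy_ns_errptr(int fail)
{
    static struct ns ok = { 1 };
    return fail ? err_ptr(-ENOMEM) : &ok;
}
```

A caller that checks only `is_err()` treats the NULL from `copy_ns_null(1)` as success, which is the situation the commit describes; with `copy_ns_errptr(1)` the caller can also recover the specific error via `ptr_err()`.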
    
    [oleg@tv-sign.ru: better changelog]
    Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
    Cc: Serge E. Hallyn <serue@us.ibm.com>
    Cc: Badari Pulavarty <pbadari@us.ibm.com>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Oleg Nesterov <oleg@tv-sign.ru>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. user namespace: add unshare

    Serge E. Hallyn authored Linus Torvalds committed
    This patch enables the unshare of user namespaces.
    
    It adds a new clone flag CLONE_NEWUSER and implements copy_user_ns() which
    resets the current user_struct and adds a new root user (uid == 0).
    
    For now, unsharing the user namespace allows a process to reset its
    user_struct accounting, and uid 0 in the new user namespace should be
    contained using appropriate means, for instance SELinux.
    
    The plan, when the full support is complete (all uid checks covered), is to
    keep the original user's rights in the original namespace, and let a process
    become uid 0 in the new namespace, with full capabilities to the new
    namespace.
    
    Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
    Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
    Acked-by: Pavel Emelianov <xemul@openvz.org>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Chris Wright <chrisw@sous-sol.org>
    Cc: Stephen Smalley <sds@tycho.nsa.gov>
    Cc: James Morris <jmorris@namei.org>
    Cc: Andrew Morgan <agm@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. @legoater

    user namespace: add the framework

    legoater authored Linus Torvalds committed
    Basically, it will allow a process to unshare its user_struct table,
    resetting at the same time its own user_struct and all the associated
    accounting.
    
    A new root user (uid == 0) is added to the user namespace upon creation.
    Such root users have full privileges and it seems that these privileges
    should be controlled through some means (process capabilities?)
    
    The unshare is not included in this patch.
    
    Changes since [try #4]:
    	- Updated get_user_ns and put_user_ns to accept NULL, and
    	  get_user_ns to return the namespace.
    
    Changes since [try #3]:
    	- moved struct user_namespace to files user_namespace.{c,h}
    
    Changes since [try #2]:
    	- removed struct user_namespace* argument from find_user()
    
    Changes since [try #1]:
    	- removed struct user_namespace* argument from find_user()
    	- added a root_user per user namespace
    
    Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
    Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
    Acked-by: Pavel Emelianov <xemul@openvz.org>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Chris Wright <chrisw@sous-sol.org>
    Cc: Stephen Smalley <sds@tycho.nsa.gov>
    Cc: James Morris <jmorris@namei.org>
    Cc: Andrew Morgan <agm@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. @legoater

    remove CONFIG_UTS_NS and CONFIG_IPC_NS

    legoater authored Linus Torvalds committed
    CONFIG_UTS_NS and CONFIG_IPC_NS have very little value as they only
    deactivate the unshare of the uts and ipc namespaces and do not improve
    performance.
    
    Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
    Acked-by: "Serge E. Hallyn" <serue@us.ibm.com>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Jun 24, 2007
  1. @legoater

    fix refcounting of nsproxy object when unshared

    legoater authored Linus Torvalds committed
    When a namespace is unshared, a refcount on the previous nsproxy is
    wrongly taken, leading to a memory leak of nsproxy objects.
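
A simplified userspace model of this kind of leak. It is not the kernel's code path: `struct nsproxy_model`, `ns_alloc()`, `ns_get()`, `ns_put()` and `switch_nsproxy()` are all hypothetical names, and the `extra_get` parameter exists only to switch the modeled bug on and off.

```c
#include <assert.h>
#include <stdlib.h>

struct nsproxy_model { int count; };

/* Allocate an object holding one reference for its first owner. */
struct nsproxy_model *ns_alloc(void)
{
    struct nsproxy_model *p = malloc(sizeof(*p));
    assert(p != NULL);
    p->count = 1;
    return p;
}

void ns_get(struct nsproxy_model *p) { p->count++; }

/* Drop one reference; free on the last one. Returns 1 if freed. */
int ns_put(struct nsproxy_model *p)
{
    if (--p->count == 0) {
        free(p);
        return 1;
    }
    return 0;
}

/* Point the task at new_ns and drop its reference on the old object.
 * With extra_get set, an unneeded reference is taken first, so the
 * final put can never reach zero. Returns 1 if the old object was freed. */
int switch_nsproxy(struct nsproxy_model **task_ns,
                   struct nsproxy_model *new_ns, int extra_get)
{
    struct nsproxy_model *old = *task_ns;

    if (extra_get)
        ns_get(old);
    *task_ns = new_ns;
    return ns_put(old);
}
```

With the extra reference, the old object stays allocated with a count of 1 and no remaining owner, which is a leak.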
    
    Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
    Cc: Badari Pulavarty <pbadari@us.ibm.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Oleg Nesterov <oleg@tv-sign.ru>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on May 8, 2007
  1. Merge sys_clone()/sys_unshare() nsproxy and namespace handling

    Badari Pulavarty authored Linus Torvalds committed
    sys_clone() and sys_unshare() both make copies of nsproxy and its associated
    namespaces, but they use different code paths.
    
    This patch merges all the nsproxy and its associated namespace copy/clone
    handling (as much as possible).  Posted on container list earlier for
    feedback.
    
    - Create a new nsproxy and its associated namespaces and pass it back to
      caller to attach it to right process.
    
    - Changed all copy_*_ns() routines to return a new copy of namespace
      instead of attaching it to task->nsproxy.
    
    - Moved the CAP_SYS_ADMIN checks out of copy_*_ns() routines.
    
    - Removed unnecessary !ns checks from copy_*_ns() and added BUG_ON()
      just in case.
    
    - Get rid of all individual unshare_*_ns() routines and make use of
      copy_*_ns() instead.
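
The contract described in the list above can be sketched in userspace C, under the assumption that a copy routine either shares the old namespace or returns a private copy, and leaves attaching it to the caller. `copy_uts_model()`, `struct uts_model` and `MODEL_NEWUTS` are hypothetical names and values, not the kernel's.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flag value, not the kernel's CLONE_NEWUTS. */
#define MODEL_NEWUTS 0x1UL

struct uts_model {
    int count;
    char name[16];
};

/* Return the namespace the caller should attach: the old one with an
 * extra reference when the flag is clear, or a fresh private copy when
 * it is set. Nothing is attached to a task here. Returns NULL if the
 * allocation fails. */
struct uts_model *copy_uts_model(unsigned long flags, struct uts_model *old)
{
    struct uts_model *new_ns;

    assert(old != NULL); /* plays the role of the BUG_ON() check */

    if (!(flags & MODEL_NEWUTS)) {
        old->count++;
        return old;
    }

    new_ns = malloc(sizeof(*new_ns));
    if (new_ns == NULL)
        return NULL;
    *new_ns = *old;
    new_ns->count = 1;
    return new_ns;
}
```

Because the routine only returns a namespace, the same function can serve both a clone-style and an unshare-style caller, each of which decides where the result is attached.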
    
    [akpm@osdl.org: cleanups, warning fix]
    [clg@fr.ibm.com: remove dup_namespaces() declaration]
    [serue@us.ibm.com: fix CONFIG_IPC_NS=n, clone(CLONE_NEWIPC) retval]
    [akpm@linux-foundation.org: fix build with CONFIG_SYSVIPC=n]
    Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
    Signed-off-by: Serge Hallyn <serue@us.ibm.com>
    Cc: Cedric Le Goater <clg@fr.ibm.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: <containers@lists.osdl.org>
    Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
    Cc: Oleg Nesterov <oleg@tv-sign.ru>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Jan 30, 2007
  1. Revert "[PATCH] namespaces: fix exit race by splitting exit"

    Linus Torvalds authored
    This reverts commit 7a238fc in
    preparation for a better and simpler fix proposed by Eric Biederman
    (and fixed up by Serge Hallyn)
    
    Acked-by: Serge E. Hallyn <serue@us.ibm.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. [PATCH] namespaces: fix exit race by splitting exit

    Serge E. Hallyn authored Linus Torvalds committed
    Fix exit race by splitting the nsproxy putting into two pieces.  First
    piece reduces the nsproxy refcount.  If we dropped the last reference, then
    it puts the mnt_ns, and returns the nsproxy as a hint to the caller.  Else
    it returns NULL.  The second piece of exiting task namespaces sets
    tsk->nsproxy to NULL, and drops the references to other namespaces and
    frees the nsproxy only if an nsproxy was passed in.
    
    A little awkward and should probably be reworked, but hopefully it fixes
    the NFS oops.
    
    Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Oleg Nesterov <oleg@tv-sign.ru>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Cedric Le Goater <clg@fr.ibm.com>
    Cc: Daniel Hokka Zakrisson <daniel@hozac.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commits on Dec 13, 2006
  1. @ebiederm

    [PATCH] Revert "[PATCH] identifier to nsproxy"

    ebiederm authored Linus Torvalds committed
    This reverts commit 373beb3.
    
    No one is using this identifier yet.  The purpose of this identifier is to
    export nsproxy to user space which is wrong.  nsproxy is an internal
    implementation optimization, which should keep our fork times from getting
    slower as we increase the number of global namespaces you don't have to
    share.
    
    Adding a global identifier like this is inappropriate because it makes
    namespaces inherently non-recursive, greatly limiting what we can do with
    them in the future.
    
    Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Cedric Le Goater <clg@fr.ibm.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Commits on Dec 8, 2006
  1. @legoater

    [PATCH] to nsproxy

    legoater authored Linus Torvalds committed
    Add the pid namespace framework to the nsproxy object.  The copy of the pid
    namespace only increases the refcount on the global pid namespace,
    init_pid_ns, and unshare is not implemented.
    
    There is no configuration option to activate or deactivate this feature
    because this is not relevant for the moment.
    
    Signed-off-by: Cedric Le Goater <clg@fr.ibm.com>
    Cc: Kirill Korotaev <dev@openvz.org>
    Cc: Eric W. Biederman <ebiederm@xmission.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>