Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge secure-root.yml into server.yml #274

Merged
merged 1 commit into from Jul 26, 2015

Conversation

@fullyint
Copy link
Member

commented Jul 14, 2015

PermitRootLogin is a config, not a separate event. PermitRootLogin no seems like a config to set and forget, like other configs in server.yml. This PR moves it into server.yml, then drops secure-root.yml.

User experience. For users who don't want to secure root, this changes nothing in how they setup trellis or run its commands.

Users who do want to secure root just toggle sshd_permit_root_login to "no" in group_vars/all. They no longer have to edit server.yml or remember run a separate playbook for each site they want to secure. They won't even need to use --ask-sudo-pass on the first run of server.yml (runs as root).

Users can toggle sshd_permit_root_login back and forth between "yes" or "no" at any time. Previously, trellis offered no way of undoing the effects of secure-root.yml. Users had to change it manually, via ssh.

sshd role should be standard. This PR moves the sshd role into server.yml and dev.yml.

sudoers role merged into users role. Parts of the sudoers role were redundant with the users role. This PR moves the two non-redundant tasks into the users role and drops the sudoers role.

Remote user based on gathered fact. It is routine for Ansible tasks to use previously gathered facts and registered variables.

This PR implements this concept in an uncommon manner at one point. A variable registered in the first play of server.yml serves as the remote_user in the second play. The first play executes the new remote-user role, a single task determining whether remote_user in the second play should be root or the new admin_user variable. Edit: I revised the PR. Now the first play in server.yml determines -- and sets a fact for -- the proper ansible_ssh_user for each host to use in the second play.

The second play is what was already in server.yml.

New admin_user variable. This PR adds a admin_user variable. It is used in the users list in group_vars/all and to set the ansible_ssh_user for the second play in server.yml (when root can't connect). Some people may want to set admin_user to ubuntu (see AWS discussion).

fail2ban. If someone disables root login then quickly and repeatedly runs server.yml, fail2ban could ban them due to failed test connections for root (remote-user role). Users could adjust fail2ban defaults.

History. An earlier iteration of this PR is at #246

@louim

This comment has been minimized.

Copy link
Contributor

commented Jul 14, 2015

👏

As a side note I think well need to have some kind of server spec testing soon because it's hard to "evaluate" a PR like this without testing it on a real environment.

@austinpray

This comment has been minimized.

Copy link
Member

commented Jul 14, 2015

@louim yep that's in the pipe

groups: [sudo]
admin_user: admin
sshd_permit_root_login: "yes" # If "no", admin_user must be in 'users' above (with sudo group) and in sudoer_passwords
sshd_password_authentication: "no"

This comment has been minimized.

Copy link
@swalkinshaw

swalkinshaw Jul 14, 2015

Member

no need for quotes on these values?

This comment has been minimized.

Copy link
@fullyint

fullyint Jul 14, 2015

Author Member

Without quotes, my Ansible 1.9.2 templating treats yes/no values as boolean, so sshd_config would have PermitRootLogin True, which caused my ansible playbook connections to fail when I tested it. Using quotes, ansible gives us the needed PermitRootLogin yes

@swalkinshaw

View changes

roles/sshd/defaults/main.yml Outdated
@@ -11,7 +11,7 @@ sshd_server_key_bits: "768"
sshd_syslog_facility: "AUTH"
sshd_log_level: "INFO"
sshd_login_grace_time: "120"
sshd_permit_root_login: "no"
sshd_permit_root_login: "yes"

This comment has been minimized.

Copy link
@swalkinshaw

swalkinshaw Jul 14, 2015

Member

might as well get rid of the unneeded quotes in this file

This comment has been minimized.

Copy link
@fullyint

fullyint Jul 14, 2015

Author Member

We could remove quotes (except for yes/no values). Or we could leave them, figuring this is kind of an external role: nickjj/ansible-sshd

This comment has been minimized.

Copy link
@swalkinshaw

swalkinshaw Jul 14, 2015

Member

Rather remove them to be consistent with the rest of our codebase since it's in our repo.

@fullyint fullyint force-pushed the fullyint:merge-secure-root branch Jul 14, 2015

@fullyint

View changes

roles/remote-user/tasks/main.yml Outdated
- name: Determine whether to connect as root or admin_user
local_action: shell ssh -o PasswordAuthentication=no root@{{ groups.web.0 }} "echo root" || echo {{ admin_user }}
changed_when: false
register: remote_user

This comment has been minimized.

Copy link
@fullyint

fullyint Jul 14, 2015

Author Member

The goal of this task is to determine the appropriate remote_user for server.yml's main play. That play will run on hosts: web, so this task tests whether root can connect to the first host in the web group. The hosts in this group are accessible in the ansible variable groups.web.

I will replace root@{{ groups.web.0 }} with root@{{ groups.web | first }} to switch from a hard-coded feel to something that may communicate the concept more intuitively. Suggestions welcome.

Aside: There's a rare potential that the web group could consist of multiple hosts, added at different times, some permitting root login while others forbid it. I commented on this scenario in #246 ("Conditional remote_user" section).

This comment has been minimized.

Copy link
@fullyint

fullyint Jul 15, 2015

Author Member

I've adjusted this to check all hosts instead of just the first. It sets the ansible_ssh_user fact for each host. Now the playbook can handle variability in whether or not hosts permit root login.

@fullyint fullyint force-pushed the fullyint:merge-secure-root branch 2 times, most recently Jul 14, 2015

@fullyint

This comment has been minimized.

Copy link
Member Author

commented Jul 15, 2015

Improved handling of remote_user fact. With prompting from @swalkinshaw, I revised the PR to no longer use a registered variable across the two plays. Now the remote-user role in the first play sets the ansible_ssh_user fact for each host to use in the second play. Here are the benefits:

  • playbook can now handle a group of hosts who differ on whether or not they PermitRootLogin

  • the remote-user task can switch from the questionable-looking root@{{ groups.web.0 }} to the natural-looking root@{{ inventory_hostname }}

  • the main play in server.yml can drop the remote_user parameter altogether, which is good news because my recent iteration of this PR had it looking a little ugly, like this:

    remote_user: "{{ hostvars.localhost.remote_user.stdout }}"

Why two plays instead of one? The remote-user role must be in its own separate play with the parameter gather_facts: false. It can't connect to the remote hosts to "gather facts" without potentially failing because it doesn't know the proper user for the connection. Yet, the main play needs gather_facts: true to access some of the variables it uses. Examples:

  • ansible_distribution (xdebug role)
  • ansible_os_family (ntp role)
  • ansible_pkg_mgr (logrotate role)

Remote user variable precedence. The ansible_ssh_user fact set in the remote-user role will override:

  • remote_user parameter for the play (if set)
  • ansible_ssh_user defined for each host in the inventory
  • --user argument passed via CLI

I suppose users wanting finer grain control could just comment out the remote-user role and set their users in their inventory, etc.

Here is some discussion over at Ansible about dynamically setting ansible_ssh_useror remote_user.

@austinpray

This comment has been minimized.

Copy link
Member

commented Jul 15, 2015

Testing:

# create do droplet then set ip to variable
export $DO_IP=xx.xx.xx.xxx
git checkout -b fullyint-merge-secure-root master
git pull https://github.com/fullyint/trellis.git merge-secure-root
sed -i 's/sshd_permit_root_login: "yes"/sshd_permit_root_login: "no"/g' group_vars/all
sed -i "s/192.168.50.5/$DO_IP/g" hosts/staging
ansible-playbook -i hosts/staging server.yml
# success
./deploy.sh staging example.com
# success
ssh root@$DO_IP
# Permission denied (publickey).
ssh web@$DO_IP
# Welcome to Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-52-generic x86_64)

Cool beans

@fullyint fullyint force-pushed the fullyint:merge-secure-root branch Jul 15, 2015

@fullyint

This comment has been minimized.

Copy link
Member Author

commented Jul 15, 2015

(Edit: This whole comment is inapplicable. I reverted these changes below.)

My initial comment said...

fail2ban. If someone disables root login then quickly and repeatedly runs server.yml, fail2ban could ban them due to failed test connections for root (remote-user role). Users could adjust fail2ban defaults.

I've modified the PR to give users the option to set admin_user.try_root: false, if they prefer. That way, a dev can have sshd_permit_root_login: "no", yet still avoid failed root connections and potential fail2ban jail time while running and re-running the playbook (e.g., during initial setup or repeated testing). The Security wiki can mention this option when addressing the option for sshd_permit_root_login: "no".

Here are the changes that I applied:

# group_vars/all
- admin_user: admin
+ admin_user:
+   name: admin
+   try_root: true # If true, server.yml will first attempt to connect as root
# roles/remote_user/tasks/main.yml

  - name: Determine whether to connect as root or admin_user
-   local_action: shell ssh -o PasswordAuthentication=no root@{{ inventory_hostname }} "echo root" || echo {{ admin_user }}
+   local_action: shell ssh -o PasswordAuthentication=no root@{{ inventory_hostname }} "echo root" || echo {{ admin_user.name }}
    changed_when: false
+   when: admin_user.try_root
    register: remote_user

  - name: Set remote user for each host
    set_fact:
-     ansible_ssh_user: "{{ remote_user.stdout }}"
+     ansible_ssh_user: "{{ admin_user.try_root | ternary(remote_user.stdout, admin_user.name) }}"

At one point I had hoped to skip the root connection test when users specify --user <name> on the CLI (ansible-playbook command). I gave up on that because although we can access the ansible_ssh_user variable, I don't see how we can know whether it was set via the CLI. In any case, the admin_user.try_root: false approach has the advantage that the user wouldn't have to add the --user arg on the CLI every time.

@fullyint fullyint force-pushed the fullyint:merge-secure-root branch Jul 15, 2015

@austinpray

This comment has been minimized.

Copy link
Member

commented Jul 16, 2015

@fullyint maybe I am misunderstanding, but if root is disabled shouldn't it never try to connect as root ever?

@fullyint

This comment has been minimized.

Copy link
Member Author

commented Jul 16, 2015

root must be used on the first run of server.yml, given that root is the only user than can connect on new DigitalOcean droplets. The soonest the PermitRootLogin no could be in effect is after that first run.

That was the most challenging issue to handle in this PR, and if there is anything people will resist in this PR, it is the attempt to dynamically determine the remote user. This attempt is not uncommon, however (example), and I think this PR's implementation is pretty decent. All this is why so many of my comments in this thread talk about remote_user, to give rationale.

The "Determine Remote User" play in server.yml dynamically determines the user so that users don't have to take a manual action related to user. Before, users had to run the separate secure-root.yml playbook (as remote_user: root), and then edit the remote_user in server.yml to be admin. Now, all the user stuff just works for the majority of use cases.

For the less common case of someone setting PermitRootLogin no then pounding on server.yml till they wind up with fail2ban jail time, they can set admin_user.try_root: false to tell server.yml from now on, "never try to connect as root ever." Or, temporarily while testing, they could set sshd_permit_root_login: "yes", or adjust fail2ban settings, but then they'd have to remember to set these back to the more secure setting. Setting admin_user.try_root: false seems easiest.

As I described in my comment opening the PR, I think this PR offers better user experience for all users. All that said, I'd love it if anyone would suggest a better implementation.

@fullyint fullyint force-pushed the fullyint:merge-secure-root branch Jul 18, 2015

@fullyint

This comment has been minimized.

Copy link
Member Author

commented Jul 18, 2015

Managing fail2ban via ignoreip. I revised the PR to dynamically append the control machine's external public IP to fail2ban's ignoreip list. This seems better than admin_user.try_root, which I've reverted.

IP lookup. The IP is retrieved using a dig dns lookup with the Ansible pipe lookup plugin. The Ansible dig lookup plugin would require users to install an extra python package: dnspython. For a one-time simple lookup, I didn't want to add a dependency.

force_handlers. The fail2ban role runs before the sshd role, so the control machine's IP should make it into fail2ban's ignoreip list before PermitRootLogin no takes effect. I've added force_handlers = True in ansible.cfg to ensure the "restart fail2ban" handler will run and ignoreip will take effect, even if the playbook fails before completion. We could have added a force_handlers: True parameter in server.yml but that doesn't seem better necessarily, and it would cause Ansible < 1.9.1 to fail. Older versions don't fail if it is specified in ansible.cfg.

Bumping Ansible requirement. force_handlers is only available in Ansible >= 1.9.1. I bumped the Ansible requirement in the README to >=1.9.2 (from 1.9), skipping 1.9.1 because of this bug. If you prefer to leave it at 1.9 that would work, but the force_handlers wouldn't be in effect for people on < 1.9.1.

"Remote server setup" section in README. Many support requests boil down to users not having configured SSH or SSH agent forwarding. I added explicit mention of each to the "Remote server setup" section in the README. You can review my wording.

@louim

This comment has been minimized.

Copy link
Contributor

commented Jul 18, 2015

@fullyint will the DNS resolver work from inside the VM when run from windows user?

Also just a friendly tip, please don't rebase your pull request until it is fully ready, it makes it harder to see the incremental changes you talk about. :)

@louim

This comment has been minimized.

Copy link
Contributor

commented Jul 18, 2015

Also, is the point about SSH forwarding really needed in the readme? I never had problems with it since the ansible.cfg already enable agent forwarding. I find the doc hard to understand for people who don't have a lot of knowledge with SSH.

Would it be enough to just mention that users shouldn't forget to load their SSH key(s) in the ssh agent with ssh-add? (And maybe add the link to the wiki)

@swalkinshaw

This comment has been minimized.

Copy link
Member

commented Jul 19, 2015

@louim I just assumed you still needed forwarding enabled on the host machine on top of the Ansible config setting. Is that not the case?

@louim

This comment has been minimized.

Copy link
Contributor

commented Jul 19, 2015

@swalkinshaw if you want to have SSH forwarding enabled when manually ssh-ing to the destination server, then yes you need to have ssh-config file specifying ForwardAgent yes or you can do it manually each connection by adding -o ForwardAgent=yes to ssh command.

But all SSH connexion handled by Ansible (including all playbook run and the deploy process) do not need additional configuration other than what's already in ansible.cfg. You still need to load your SSH key in the ssh-agent if you haven't already done so. This could be automated / validated in the deploy.sh script.

For windows users, i don't know if your key is forwarded when using vagrant ssh to run the ansible commands.

@swalkinshaw

This comment has been minimized.

Copy link
Member

commented Jul 19, 2015

@louim good to know.

I agree we can remove that then and add a mention somewhere about ssh-add.

@austinpray

This comment has been minimized.

Copy link
Member

commented Jul 19, 2015

FWIW some of the only problems I've had setting people up with ansible deploys has been improperly configured SSH evironments. People not having their key added to their agent etc. Not trellis' fault, but still good to have that stuff documented.

@fullyint

This comment has been minimized.

Copy link
Member Author

commented Jul 20, 2015

@louim Thanks for tip about only adjusting the PR via new commits, keeping incremental changes clear.

SSH agent forwarding. Also, thanks for pointing out that ForwardAgent yes is not required in ssh-config. This led me to review discourse threads regarding problems with SSH agent forwarding. People didn't always report a specific fix, but when they did, it was mac users needing to add their ssh key password to Keychain.

The SSH keys wiki mentions the Keychain issue, but not prominently. We can make it more prominent there in the wiki. Personally, I'm still inclined to make it explicit in the README so that users isolate and fix as many non-Trellis problems as possible before running their first Ansible command.

Would the following feel acceptable, or do others feel it would clutter up the README?

3. Configure SSH
  * Ensure you can SSH to your server, e.g., by running `ssh root@server_ip_or_domain`
- * Ensure you can clone a private repo from your remote server using [SSH agent forwarding](https://developer.github.com/guides/using-ssh-agent-forwarding/).
+ * Mac OS X users must have added their SSH key password to Keychain by running `/usr/bin/ssh-add -K /path/to/private/key`.
  * Specify public SSH keys for `users` in `group_vars/all`. See the [SSH Keys wiki](https://github.com/roots/trellis/wiki/SSH-Keys).
  * Consider setting `sshd_permit_root_login: "no"` in `group_vars/all`. See the [Security wiki](https://github.com/roots/trellis/wiki/Security).

IP lookup on Windows.

will the DNS resolver work from inside the VM when run from windows user?

I'm afraid it'll be a few days before I can set up tests for this. It would be super helpful if one of our Windows users could check this. After you run server.yml from this PR's branch, does your /etc/fail2ban/jail.local include your IP at the end of the ignoreip parameter?

@louim

This comment has been minimized.

Copy link
Contributor

commented Jul 20, 2015

@fullyint

SSH agent forwarding. I think that the part:

Mac OS X users must have added their SSH key password to Keychain by running /usr/bin/ssh-add -K /path/to/private/key

is still oddly specific. What if you don't use a key with a password? Why Mac OS X users only (and not Linux)? Why /usr/bin/ssh-add instead of ssh-add? In fact, ssh-add -K would suffice unless using a custom named SSH key.

I think that we could also prevent most of the problem with custom error messages. Like we could use git ls-remote to validate we are able to connect to the specified remote, and if not the case, print a helpful error message and bail out.

Something like:

"We were unable to connect to the specified git repo (git repo url here) from your server. Double check that the SSH key you use normally is correctly loaded in the ssh agent by using the command ssh-add -L. If you don't see your key, add it by running ssh-add. For more info, check the Trellis wiki at: https://github.com/roots/trellis/wiki/SSH-Keys"

IP lookup on Windows.: I'll try to check out on my windows machine tonight.

@swalkinshaw

This comment has been minimized.

Copy link
Member

commented Jul 20, 2015

@louim good idea about checking git ls-remote with a custom error message. There's probably a few other places we can do that as well but out of scope for this PR.

@fullyint

This comment has been minimized.

Copy link
Member Author

commented Jul 21, 2015

@louim Thanks for helping me recognize that the note about Mac users adding their SSH key password to Keychain would be too specific for the general users' README. It struck me as common, but stepping back, I see that it is "common" only when looking at specific users in a specific situation: when trellis fails to clone a repo. So, I agree it seems better to address it in an error message specific to that situation.

I've added a commit trimming back the changes I made to the README. I also removed the README section about secure-root.yml, which I had forgotten to remove initially. Future PRs can address how to help users with SSH problems.

@louim

This comment has been minimized.

Copy link
Contributor

commented Jul 21, 2015

@fullyint Tested it on Windows and the DNS resolved correctly 👍

@fullyint

This comment has been minimized.

Copy link
Member Author

commented Jul 21, 2015

Thank you for all your great help @louim !

@swalkinshaw

This comment has been minimized.

Copy link
Member

commented Jul 25, 2015

This should be good to go. Squash commits 👍

@fullyint fullyint force-pushed the fullyint:merge-secure-root branch to de49c28 Jul 26, 2015

@fullyint

This comment has been minimized.

Copy link
Member Author

commented Jul 26, 2015

squashed

swalkinshaw added a commit that referenced this pull request Jul 26, 2015

Merge pull request #274 from fullyint/merge-secure-root
Merge secure-root.yml into server.yml

@swalkinshaw swalkinshaw merged commit db63a89 into roots:master Jul 26, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@swalkinshaw

This comment has been minimized.

Copy link
Member

commented Jul 26, 2015

Great work, thanks @fullyint

@franzliedke

This comment has been minimized.

Copy link

commented Oct 6, 2017

Sorry to re-raise a dead thread, but I traced my issue back to this change...

I recently spent multiple days trying to find out why I could not re-provision a new server with the command from the documentation, after I had successfully provisioned another server before. (This was when we were experimenting with moving our existing WordPress site to Trellis & Bedrock.)

The SSH authentication for the admin user would fail. It took me a long time to find out that I could rerun the command with the --user=root argument, which would then work.

I have no clue about how Ansible works, but apparently it remembers "facts" across playbook runs (?). I assume that once it had set up the admin user for the original server, it remembered the fact that it could now use admin for logging in. For the new server, that user was not yet available, though.

Is there anything that can be done about this? IMO, these facts should not be retained when provisioning a new server (with a new IP). At the very least, this should be mentioned in the documentation...

@fullyint

This comment has been minimized.

Copy link
Member Author

commented Oct 6, 2017

@franzliedke Good job finding the --user=root workaround and digging in to what must have been a frustrating issue. As it turns out, I don't think it's a bug. So, could we continue any discussion at https://discourse.roots.io so our messages don't ping subscribers to the bug tracker?

Trellis and Ansible do not remember the user across runs. You'll notice this PR introduces a task Determine whether to connect as root or admin_user. If root can connect, the rest of the playbook tries to run as root. If root fails, it tries admin_user instead.

If the playbook suddenly seemed to only try admin instead of root for new servers, and you hadn't changed anything with Trellis, naturally you'd figure that Trellis was remembering the admin user from a previous server (e.g., where root login had been disabled). However, I suspect what happened was you updated your Ansible version, e.g., to 2.3.

There have been a couple Ansible updates that have "broken" the Trellis feature of choosing root vs. admin_user based on a connection test.

  • #631 (Aug 6, 2016) fixed a problem in response to the release of Ansible 2.1.1.0
  • #813 (Apr 15, 2017) fixed a problem in response to the release of Ansible 2.3

I suggest you update your Trellis version and see if Trellis then is able to correctly choose the root user. Note that the latest Trellis requires the recently released Ansible 2.4 in order to accommodate revised code required to not cause deprecation warnings in 2.4. If you prefer to stay on Ansible 2.3, only update your Trellis to 8a4677c

Please don't respond here. If you have any trouble, post at https://discourse.roots.io and someone will try to offer support. Post your Ansible version (ansible-playbook --version), your Trellis version (e.g., post the topmost entry in your Trellis CHANGELOG.md), and your verbose log output from running ansible-playbook server.yml -e env=production -vvvv. If it turns out to be a bug, we'll gladly try to fix it here in the GitHub bug tracker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.