
[Enhancement] Increase number of btrfs quotas rescan #1624

Open
MFlyer opened this issue Jan 23, 2017 · 6 comments

@MFlyer
Member

MFlyer commented Jan 23, 2017

While checking share usage I found that we perform a quota rescan only when creating snapshots. So if users don't have scheduled snapshots (obviously we hope they have them!) or massively delete them, we can end up with wrong reporting on shares (and on quota limits in the future: 2015/* qgroups easily reaching their quota limit).

Option A:
add quota rescans when deleting snapshots too

Option B:
make it a supervised task every x (5-10?) mins

@schakrava & @phillxnet ?
While testing share usage and deleting some snapshots (no scheduled snapshots), my 2015/* qgroup kept growing when it wasn't expected to: when deleting snapshots with exclusive sizes, the 2015/* qgroup, being their "father", takes over those orphans, and it seems at least 2 rescans are required to come back to the real values.
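Option A could look something like the sketch below ('btrfs quota rescan -w' and 'btrfs subvolume delete' are real btrfs-progs commands; the helper names and the call site are hypothetical, just to illustrate rescanning right after deletion):

```python
import subprocess


def rescan_cmd(mnt_pt):
    # 'btrfs quota rescan -w' starts a quota rescan and waits for it to
    # finish, so usage read afterwards reflects the deleted snapshot.
    return ['btrfs', 'quota', 'rescan', '-w', mnt_pt]


def delete_snapshot(mnt_pt, snap_name):
    # Hypothetical call site: the existing deletion logic, followed by
    # an immediate quota rescan (Option A).
    subprocess.run(['btrfs', 'subvolume', 'delete',
                    '%s/%s' % (mnt_pt, snap_name)], check=True)
    subprocess.run(rescan_cmd(mnt_pt), check=True)
```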

@phillxnet
Member

@MFlyer Another nice find on your part.

My concern with solutions akin to Option B (periodic rescans) is that they will break drive power-down features. If it turns out we can't avoid this in order to maintain recent info, then maybe we could use the existing drive power state to guide these updates; i.e. akin to what smartmontools does, where it will not query a drive that is in standby but will only 'be nice' for a set number of attempts. Thereafter it will go ahead and wake the drive in the interest of ensuring a recent read of the drive's status.
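That smartmontools-style 'be nice' behaviour could be sketched as below (all names are hypothetical; how the standby state is actually read, e.g. via hdparm -C, is left abstract):

```python
# Skip the rescan for at most MAX_NICE_SKIPS consecutive periods while a
# drive reports standby; after that, rescan anyway (waking the drive) so
# the usage info cannot go stale indefinitely. All names hypothetical.
MAX_NICE_SKIPS = 3


def should_rescan(drive_in_standby, skips_so_far):
    """Return (do_rescan, new_skip_count) for one periodic attempt."""
    if drive_in_standby and skips_so_far < MAX_NICE_SKIPS:
        # Be nice: let the drive sleep and try again next period.
        return False, skips_so_far + 1
    # Drive is awake, or we've been nice for long enough: go ahead.
    return True, 0
```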

Also I don't see that these options are mutually exclusive and Option A seems like a good idea anyway.

Do we have to account for quota rescans potentially taking a long time (i.e. when there are many snapshots)?

Apologies if I have missed the point there.

@MFlyer
Member Author

MFlyer commented Feb 1, 2017

@phillxnet from btrfs changelog https://btrfs.wiki.kernel.org/index.php/Changelog#btrfs-progs_v4.9.1_.28Jan_2017.29

btrfs-progs v4.9.1 (Jan 2017)

check:

  • use correct inode number for lost+found files
  • lowmem mode: fix false alert on dropped leaf

size reports: negative numbers might appear in size reports during device deletes (previously in EiB units)
mkfs: print device being trimmed
defrag: v1 ioctl support dropped
quota: print message before starting to wait for rescan
qgroup show: new option to sync before printing the stats
other:

  • corrupt-block enhancements
  • backtrace and co. cleanups
  • doc fixes

Migrating to >=4.9 we can avoid an additional rescan task and have btrfs rescans while updating share usage (once every minute).
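With btrfs-progs >= 4.9 the existing per-minute share refresh could simply ask qgroup show to sync first ('btrfs qgroup show --sync' is the new option from the changelog above; the helper itself is hypothetical):

```python
def qgroup_show_cmd(mnt_pt, sync=True):
    # '--sync' (btrfs-progs >= 4.9) syncs the filesystem before printing,
    # so referenced/exclusive sizes are current without a separate rescan.
    cmd = ['btrfs', 'qgroup', 'show']
    if sync:
        cmd.append('--sync')
    return cmd + [mnt_pt]
```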

@MFlyer
Member Author

MFlyer commented Feb 1, 2017

Ref to @schakrava too; hands up for updating to the latest btrfs (please remember my tests with 4.9 worked fine).

Mirko

@phillxnet
Member

@MFlyer Yes I saw that one come up and meant to pop it in here for context. We still have the suspend issue for drives of course. Where is the 'every min' element enforced? Maybe we can have this configurable (if on our side) with a link to drive power down if relevant.

Thumbs up for the btrfs-progs update on my part, as it's the only way to go really, especially given your recent findings re issues with size reporting on our current version. Should we not also have our kernel updated to at least version 4.9 (elrepo ml now has 4.9.6-1)? My understanding is that it is best to keep the kernel version and btrfs-progs as close as we can.

@MFlyer
Member Author

MFlyer commented Feb 1, 2017

> We still have the suspend issue for drives of course. Where is the 'every min' element enforced? Maybe we can have this configurable (if on our side) with a link to drive power down if relevant.

The every-minute task is under data_collector, in update_storage_state (which hits refresh-share-state):

    def update_storage_state(self):
        # update storage state once a minute as long as
        # there is a client connected.
        while self.start:
            resources = [{'url': 'disks/scan',
                          'success': 'Disk state updated successfully',
                          'error': 'Failed to update disk state.'},
                         {'url': 'commands/refresh-pool-state',
                          'success': 'Pool state updated successfully',
                          'error': 'Failed to update pool state.'},
                         {'url': 'commands/refresh-share-state',
                          'success': 'Share state updated successfully',
                          'error': 'Failed to update share state.'},
                         {'url': 'commands/refresh-snapshot-state',
                          'success': 'Snapshot state updated successfully',
                          'error': 'Failed to update snapshot state.'}, ]
            for r in resources:
                try:
                    self.aw.api_call(r['url'], data=None, calltype='post',
                                     save_error=False)
                except Exception as e:
                    logger.error('%s. exception: %s'
                                 % (r['error'], e.__str__()))
            gevent.sleep(60)

We can link refresh-share-state to drive power-down (e.g. run every min with a conditional sync every x - 10? 20? 30? - mins).
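That conditional sync could sit on a simple cycle counter inside the existing 60-second loop (SYNC_EVERY and the helper are hypothetical; 20 is just one of the 10/20/30 candidates):

```python
SYNC_EVERY = 20  # full syncs every 20 of the 60-second cycles


def next_cycle(cycle):
    """Return (do_full_sync, next_counter) for one 60-second iteration."""
    # Cheap refresh runs every cycle; the expensive, drive-waking sync
    # only when the counter hits a multiple of SYNC_EVERY.
    do_sync = (cycle % SYNC_EVERY == 0)
    return do_sync, cycle + 1
```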

Totally agree with you on kernel & btrfs tools working together on 4.9, and the same for future releases.

Mirko

@MFlyer
Member Author

MFlyer commented Feb 1, 2017

Hi @phillxnet, amending my last one: I didn't think about the data collector's nature. Check this code:

Note: every RockstorIO obj under data_collector is a namespace attached to Rockstor's socket.io implementation, so on SysinfoNamespace (the obj handling shares and pools status too) we perform btrfs operations only with a client connected to the Rockstor WebUI (while start is True) and stop them as soon as clients disconnect, this granting a btrfs rescan only when someone is checking via the WebUI. Can we accept this? :)

Mirko

class SysinfoNamespace(RockstorIO):

    start = False
    supported_kernel = settings.SUPPORTED_KERNEL_VERSION

    # This function is run once on every connection
    def on_connect(self, sid, environ):

        self.aw = APIWrapper()
        self.emit('connected',
                  {
                      'key': 'sysinfo:connected',
                      'data': 'connected'
                  })
        self.start = True
        self.spawn(self.update_storage_state, sid)
        self.spawn(self.update_check, sid)
        self.spawn(self.update_rockons, sid)
        self.spawn(self.send_kernel_info, sid)
        self.spawn(self.prune_logs, sid)
        self.spawn(self.send_localtime, sid)
        self.spawn(self.send_uptime, sid)

    # Run on every disconnect
    def on_disconnect(self, sid):

        self.cleanup(sid)
        self.start = False

    def send_uptime(self):
        # Seems redundant
        while self.start:
            self.emit('uptime', {'key': 'sysinfo:uptime', 'data': uptime()})
            gevent.sleep(60)

    def send_localtime(self):

        while self.start:

            self.emit('localtime',
                      {
                          'key': 'sysinfo:localtime',
                          'data': time.strftime('%H:%M (%z %Z)')
                      })
            gevent.sleep(40)

    def send_kernel_info(self):

        try:
            self.emit('kernel_info',
                      {
                          'key': 'sysinfo:kernel_info',
                          'data': kernel_info(self.supported_kernel)
                      })
        except Exception as e:
            logger.error('Exception while gathering kernel info: %s' %
                         e.__str__())
            # Emit an event to the front end to capture error report
            self.emit('kernel_error', {
                'key': 'sysinfo:kernel_error', 'data': str(e)})
            self.error('unsupported_kernel', str(e))

    def update_rockons(self):

        try:
            self.aw.api_call('rockons/update', data=None, calltype='post',
                             save_error=False)
        except Exception as e:
            logger.error('failed to update Rock-on metadata. low-level '
                         'exception: %s' % e.__str__())

    def update_storage_state(self):
        # update storage state once a minute as long as
        # there is a client connected.
        while self.start:
            resources = [{'url': 'disks/scan',
                          'success': 'Disk state updated successfully',
                          'error': 'Failed to update disk state.'},
                         {'url': 'commands/refresh-pool-state',
                          'success': 'Pool state updated successfully',
                          'error': 'Failed to update pool state.'},
                         {'url': 'commands/refresh-share-state',
                          'success': 'Share state updated successfully',
                          'error': 'Failed to update share state.'},
                         {'url': 'commands/refresh-snapshot-state',
                          'success': 'Snapshot state updated successfully',
                          'error': 'Failed to update snapshot state.'}, ]
            for r in resources:
                try:
                    self.aw.api_call(r['url'], data=None, calltype='post',
                                     save_error=False)
                except Exception as e:
                    logger.error('%s. exception: %s'
                                 % (r['error'], e.__str__()))
            gevent.sleep(60)

    def update_check(self):

        uinfo = update_check()
        self.emit('software_update',
                  {
                      'key': 'sysinfo:software_update',
                      'data': uinfo
                  })

    def prune_logs(self):

        while self.start:
            self.aw.api_call('sm/tasks/log/prune', data=None, calltype='post',
                             save_error=False)
            gevent.sleep(3600)

Alternative/enhancement: while users are connected, run update_storage_state with the current 60-sec sleep; while users are disconnected, perform it anyway (currently we don't do that!) but every 30/60/120 mins, to guarantee fs status updates.
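That alternative could be as small as varying the sleep by connection state (intervals and names hypothetical; self.start is the existing connected flag, and the loop condition would become unconditional):

```python
CONNECTED_SLEEP = 60            # seconds: current behaviour with a client
DISCONNECTED_SLEEP = 60 * 60    # 30/60/120 mins? one hour picked here


def sleep_interval(client_connected):
    # The refresh loop would run regardless of clients; only the sleep
    # length depends on whether a WebUI client is connected.
    return CONNECTED_SLEEP if client_connected else DISCONNECTED_SLEEP
```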

@schakrava schakrava added this to the Point Bonita milestone Mar 24, 2017
@schakrava schakrava modified the milestones: Point Bonita, After Six Nov 7, 2017
@phillxnet phillxnet removed this from the After Six milestone Jan 23, 2021