Git Reporter Hook #330

Closed
nille02 opened this issue Nov 30, 2018 · 11 comments

nille02 commented Nov 30, 2018

I'm not really happy with the email notifications or with how changes are shown.

Is there a way to trigger a simple external script with some parameters, like the name of the urls.yaml, the job name, and the URL? My intention is to pipe everything into a text/HTML/XML file and commit it to a personal git/svn/hg repository. That way I have a nice history and can easily see the changes on different devices.

I would use the name of the urls.yaml as the top-level folder name (I want to use this to group jobs), the job name as the filename, and the URL in the header of the text/HTML/XML file so I can quickly open the site again. Maybe even the complete entry from the urls.yaml.

Something like this might be helpful for #53 as well.

Contributor

cfbao commented Dec 1, 2018

It sounds like a custom reporter subclass in hooks.py is exactly what you need.

The submit method of every subclass of reporters.ReporterBase is triggered after changes. Anything you would do in an external script, you can write in submit instead.
self.job_states and self.config should have most (if not all) of the info you need.

There's a simple example in the share folder.
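
If calling an external script is still the goal, a minimal sketch of such a helper could look like the following. `run_hook_script` and its argument order are hypothetical and not part of urlwatch; the helper only uses the standard library, and the "script" in the demo is a Python one-liner standing in for a real one.

```python
import subprocess
import sys


def run_hook_script(command, name, url, timeout=30):
    """Run an external command with the job name and URL appended as arguments.

    `command` is a list such as ['/path/to/script.sh'] (hypothetical path).
    Returns the script's standard output as text.
    """
    # Passing a list (no shell=True) avoids quoting problems with odd names or URLs.
    result = subprocess.run(
        list(command) + [name, url],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the script exits with an error
    )
    return result.stdout


if __name__ == '__main__':
    # Stand-in for a real script: a Python one-liner that echoes its arguments.
    echo = [sys.executable, '-c', 'import sys; print("|".join(sys.argv[1:]))']
    print(run_hook_script(echo, 'Example job', 'https://example.com/'))
```

Inside submit, a call like this would presumably run once per job state, with the name and URL taken from job_state.job.pretty_name() and job_state.job.get_location().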

Author

nille02 commented Dec 1, 2018

Thanks, I will give it a try. My Python Skills just crit to -10.

26tajeen commented Mar 9, 2019

Did you make any headway with this? This is exactly what I'd like to achieve. Output to a script action!

Author

nille02 commented Mar 9, 2019

Nothing so far. I didn't find time to dig in.

Author

nille02 commented Apr 17, 2019

class GitReport(reporters.ReporterBase):
    __kind__ = 'gitreport'

    def submit(self):
        for job_state in self.report.get_filtered_job_states(self.job_states):
            pass

@lancelon with this code snippet you can probably achieve it. At the pass statement you only get jobs that are changed or new. html2txt.py has an example of how to execute a shell command and read its output, in case you want the diff of the changes.

Information about job_state is in handler.py. job_state.job comes from jobs.py.

I stumble through it by trial and error whenever I find a bit of time. But my goal has changed a bit. The initial idea was just to call a script or execute a command, but now I am trying it with gitpython, without using the shell.

Once I have a working version, I will probably make a pull request for the hooks example, or just post it in this ticket if it's not good enough.

Author

nille02 commented Apr 17, 2019

So, I have a running version of my gitreport done. To enable it, add

report:
  gitreport:
    enabled: true
    path: '/path/to/repository'  # Optional. If not set, the cache directory is used.

to your urlwatch config. It requires gitpython.

It creates a subfolder in the repository for each domain and, for each job, a text file with the new content. The file is named after the job: job.name + '.' + job.guid + '.txt'.
The commit message is based on the job.name, but I will probably add the domain too.

All kinds of feedback are welcome. (But please remember: "My Python Skills just crit to -10")

In the future I plan to add remote support, fix any bugs I haven't found yet, and maybe add whatever gets suggested.
EDIT: Remote support is now included, but you have to add the remote origin yourself. It fetches and pulls all changes before any files are modified and then pushes them online. I have also tested it on Linux and Windows now, and it worked on both.
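
The remote handling described here (pull first, commit, then push) can be sketched in isolation. This is only an illustration under assumptions: `commit_with_remote_sync` is a hypothetical helper name, and `repo` is assumed to behave like a GitPython `Repo` object. It looks the remote up by name rather than relying on `repo.remotes.origin`, so a repository whose only remote has a different name does not raise an error.

```python
def commit_with_remote_sync(repo, paths, message):
    """Pull from 'origin' (if configured), commit `paths`, then push.

    `repo` is expected to behave like a GitPython `git.Repo` object.
    Returns True if a remote named 'origin' was used, otherwise False.
    """
    # Look up the remote by name instead of assuming repo.remotes.origin exists.
    origin = next((r for r in repo.remotes if r.name == 'origin'), None)

    if origin is not None:
        # git pull already includes a fetch, so no separate fetch() call here.
        origin.pull()

    repo.index.add(list(paths))
    repo.index.commit(message)

    if origin is not None:
        origin.push()
    return origin is not None
```

Doing the pull before any file is written keeps the local changes on top of whatever another device has already pushed.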

import logging
import os
import unicodedata
import string
from urllib.parse import urlparse

from appdirs import AppDirs

import urlwatch
from urlwatch import filters
from urlwatch import reporters

logger = logging.getLogger(__name__)

# Custom Git Reporter

class GitReport(reporters.ReporterBase):
    """Create a File for each Job and Commit it to a Git Repository"""
    __kind__ = 'gitreport'


    def submit(self):
        if self.config.get('enabled', False) is False:
            return

        from git import Repo

        # Use the Git path from the config, or fall back to the cache directory
        urlwatch_cache_dir = AppDirs(urlwatch.pkgname).user_cache_dir
        fallback = os.path.join(urlwatch_cache_dir, 'git')
        git_path = self.config.get('path', fallback)
        if git_path == '':
            logger.info('Git path is empty. Using: ' + os.path.abspath(fallback))
            git_path = fallback

        # Create the folder (including missing parents) if it is not present
        if not os.path.exists(git_path):
            logger.debug('Create Folder: ' + git_path)
            os.makedirs(git_path)
            # Because it is a new folder, create a new repository
            repo = Repo.init(os.path.abspath(git_path))
        else:
            repo = Repo(os.path.abspath(git_path))

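        # Abort if there are untracked files (an assert would be skipped under python -O)
        if repo.untracked_files:
            raise RuntimeError('Git repository contains untracked files: ' + git_path)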

        # Check if we have a remote repository and fetch changes before adding or changing files.
        if repo.remotes != []:
            remote = True
            repo.remotes.origin.fetch()
            repo.remotes.origin.pull()
        else:
            remote = False

        #Write all Changes.
        for job_state in self.report.get_filtered_job_states(self.job_states):

            # We use the Domain as Subdirectory
            parsed_uri = urlparse(job_state.job.get_location())
            result = '{uri.netloc}'.format(uri=parsed_uri)

            #Check if the job_path exist and if not create it
            job_path = os.path.join(git_path, result)
            if not os.path.exists(job_path):
                os.mkdir(job_path)


            # Generate a safe filename
            filename = self.clean_filename(job_state.job.pretty_name())
            filename = filename + '.' + job_state.job.get_guid() + '.txt'

            # Create the file or overwrite the old file
            with open(os.path.join(job_path, filename), 'w+', encoding='utf-8') as writer:
                writer.write(job_state.new_data)

            repo.index.add([os.path.join(job_path, filename)])
            repo.index.commit(job_state.job.pretty_name() + ' \n' + result + ' \n' + job_state.job.get_location())


        #Check if we have a remote Repository and push the changes.
        if remote:
            repo.remotes.origin.push()

    #This Function is from https://gist.github.com/wassname/1393c4a57cfcbf03641dbc31886123b8
    @staticmethod
    def clean_filename(filename, replace=' '):
        whitelist = "-_.() %s%s" % (string.ascii_letters, string.digits)
        char_limit = 210 # I add a Sha-1 Hash and the file extension

        # replace spaces
        for r in replace:
            filename = filename.replace(r, '_')

        # keep only valid ascii chars
        cleaned_filename = unicodedata.normalize('NFKD', filename).encode('ASCII', 'ignore').decode()

        # keep only whitelisted chars
        cleaned_filename = ''.join(c for c in cleaned_filename if c in whitelist)
        if len(cleaned_filename) > char_limit:
            logger.info("Warning, filename truncated because it was over {}. Filenames may no longer be unique".format(char_limit))
        return cleaned_filename[:char_limit]


@nille02 nille02 changed the title Trigger external Scripts after Changes Trigger external Scripts after Changes | Git Reporter Hook Apr 17, 2019
@nille02 nille02 closed this as completed Apr 17, 2019
Author

nille02 commented Apr 18, 2019

@cfbao or @thp, is there a way to get the filters in the reporter hook?

Contributor

cfbao commented Apr 18, 2019

You mean job_state.job.filter?

Author

nille02 commented Apr 18, 2019

Thank you. I guess they are generated in JobBase.__init__(), so I didn't find them, and I didn't understand that code.

And is there an unfiltered version of the request? I tried job_state.job.retrieve(job_state), which works quite well, but it would double the requests and the load on the server.

Contributor

cfbao commented Apr 19, 2019

The unfiltered data is not saved in the current implementation.

@nille02 nille02 changed the title Trigger external Scripts after Changes | Git Reporter Hook Git Reporter Hook Apr 28, 2019
Author

nille02 commented Apr 28, 2019

I changed it a bit. I moved all changes into a single commit and added a pseudo filter to provide a job-based subfolder in the repository.
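
How the pseudo filter's parameter gets read can be shown on its own. The sketch below assumes that job_state.job.filter is a comma-separated string of `name` or `name:parameter` entries, as in the code in this comment; other urlwatch versions may represent filters differently. `parse_filter_string` is a hypothetical helper name, and unlike the reporter code it also strips whitespace around each entry.

```python
def parse_filter_string(filter_spec):
    """Map 'a,b:x,c:y' to {'b': 'x', 'c': 'y'}.

    Entries without a ':' carry no parameter and are skipped.
    """
    parsed = {}
    if not filter_spec:
        return parsed
    for entry in filter_spec.split(','):
        # partition splits on the first ':' only, so parameters may contain ':'
        name, sep, parameter = entry.strip().partition(':')
        if sep:
            parsed[name] = parameter
    return parsed


if __name__ == '__main__':
    print(parse_filter_string('html2text,git-path:news'))
```

With a job filter of `html2text,git-path:news`, the reporter would then place the job's file in the `news` subfolder instead of a folder named after the domain.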

On my to-do list is the ability to clone an existing repository using a URL from the urlwatch config.

import logging
import os
import unicodedata
import string
from urllib.parse import urlparse

from appdirs import AppDirs
import lxml.html

import urlwatch
from urlwatch import filters
from urlwatch import reporters

logger = logging.getLogger(__name__)

class GitSubPath(filters.FilterBase):
    """This is a dummy filter for git-report.
    Its only purpose is to provide a subfilter string as the path for the git reporter.
    """
    __kind__ = 'git-path'

    def filter(self, data, subfilter=None):
        if subfilter is None:
            raise ValueError('git-path needs a name for a Subfolder in the Git Repository')
        return data


class bUnicodeDummy(filters.FilterBase):
    """This is a dummy filter for git-report.
    If you use non-ASCII characters in your job names, it switches the filename whitelist to a blacklist.
    """
    __kind__ = 'bUnicode'

    def filter(self, data, subfilter=None):
        if subfilter is None:
            subfilter = True
        return data


# Custom Git Reporter


class GitReport(reporters.ReporterBase):
    """Create a File for each Job and Commit it to a Git Repository"""
    __kind__ = 'gitreport'

    def submit(self):
        if self.config.get('enabled', False) is False:
            return

        from git import Repo

        # Use the Git path from the config, or fall back to the cache directory
        urlwatch_cache_dir = AppDirs(urlwatch.pkgname).user_cache_dir
        fallback = os.path.join(urlwatch_cache_dir, 'git')
        git_path = self.config.get('path', fallback)
        if git_path == '':
            logger.info('Git path is empty. Using: ' + os.path.abspath(fallback))
            git_path = fallback

        # Create the folder (including missing parents) if it is not present
        if not os.path.exists(git_path):
            logger.debug('Create Folder: ' + git_path)
            os.makedirs(git_path)
            # Because it is a new folder, create a new repository
            repo = Repo.init(os.path.abspath(git_path))
        else:
            repo = Repo(os.path.abspath(git_path))

        # Check if we have a remote repository and fetch changes before adding or changing files.
        if repo.remotes != []:
            print("Fetch and Pull from Git Repository")
            remote = True
            repo.remotes.origin.fetch()  # These two steps need some time.
            repo.remotes.origin.pull()
        else:
            remote = False

        commit_message = ""

        # Write all Changes.
        for job_state in self.report.get_filtered_job_states(self.job_states):

            # Unchanged or Error states are nothing we can do with
            if (job_state.verb == "unchanged" or job_state.verb == "error"):
                continue
            # Build a dict of the filters that carry a parameter ("name:parameter"),
            # so the parameter of the git-path filter can be read below.
            # It is named job_filters to avoid shadowing the imported filters module.
            job_filters = {}
            if job_state.job.filter is not None:
                for entry in job_state.job.filter.split(','):
                    parts = entry.split(':', 1)
                    if len(parts) == 2:
                        job_filters[parts[0]] = parts[1]

            parsed_uri = urlparse(job_state.job.get_location())
            result = '{uri.netloc}'.format(uri=parsed_uri)

            # Use the git-path parameter as subfolder, or the domain as fallback,
            # and create the folder if it does not exist
            job_path = os.path.join(git_path, job_filters.get('git-path', result))
            if not os.path.exists(job_path):
                os.makedirs(job_path)

            # Generate a safe filename
            # bUnicode is a dummy filter; it does nothing except provide a boolean
            if job_filters.get('bUnicode', False):
                filename = self.clean_filename2(job_state.job.pretty_name())
            else:
                filename = self.clean_filename(job_state.job.pretty_name())
            filename = filename + '.' + job_state.job.get_guid() + '.txt'

            # Create the file or overwrite the old file
            with open(os.path.join(job_path, filename), 'w+', encoding='utf-8') as writer:
                writer.write(job_state.new_data)

            repo.index.add([os.path.join(job_path, filename)])
            message = "%s\n%s \n%s\n\n" % (job_state.job.pretty_name(), result, job_state.job.get_location())
            commit_message += message

        # Add all changes in one commit, but only if at least one file was written
        if commit_message:
            repo.index.commit(commit_message)

        # Check if we have a remote Repository and push the changes.
        if remote:
            print("Push Changes to the Repository ...")
            repo.remotes.origin.push()
            print("Done.")

    # This Function is from https://gist.github.com/wassname/1393c4a57cfcbf03641dbc31886123b8
    @staticmethod
    def clean_filename(filename, replace=' '):
        whitelist = "-_.() %s%s" % (string.ascii_letters, string.digits)
        char_limit = 210  # I add a Sha-1 Hash and the file extension

        # replace spaces
        for r in replace:
            filename = filename.replace(r, '_')

        # keep only valid ascii chars
        cleaned_filename = unicodedata.normalize('NFKD', filename).encode('ASCII', 'ignore').decode()

        # keep only whitelisted chars
        cleaned_filename = ''.join(c for c in cleaned_filename if c in whitelist)
        if len(cleaned_filename) > char_limit:
            logger.info("Warning, filename truncated because it was over {}. Filenames may no longer be unique".format(char_limit))
        return cleaned_filename[:char_limit]

    # This Function is from https://gist.github.com/wassname/1393c4a57cfcbf03641dbc31886123b8
    # I changed this to a blacklist to fit my needs with asian Filenames
    @staticmethod
    def clean_filename2(filename, replace=' '):
        blacklist = "|*/\\%&$§!?=<>:\""
        char_limit = 210  # I add a Sha-1 Hash and the file extension

        # replace spaces
        for r in replace:
            filename = filename.replace(r, '_')

        # normalize the Unicode form; non-ASCII characters are kept
        cleaned_filename = unicodedata.normalize('NFKD', filename)

        # remove blacklisted chars
        cleaned_filename = ''.join(c for c in cleaned_filename if c not in blacklist)
        if len(cleaned_filename) > char_limit:
            logger.info("Warning, filename truncated because it was over {}. Filenames may no longer be unique".format(char_limit))
        return cleaned_filename[:char_limit]

EDIT: Fixed an error that occurred if no filter is set.
EDIT: Added another version of clean_filename that uses a blacklist of illegal characters instead of a whitelist. You can trigger it with the bUnicode filter. It's useful if you watch a page whose name contains non-ASCII letters.
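
To make the difference between the two approaches concrete, here is a small standalone comparison. It is a simplified sketch of the two methods above, not a replacement: the function names `clean_whitelist` and `clean_blacklist` are made up for this example, and the length limit is left out.

```python
import string
import unicodedata

WHITELIST = "-_.() " + string.ascii_letters + string.digits
BLACKLIST = '|*/\\%&$§!?=<>:"'


def clean_whitelist(name):
    """Keep only ASCII letters, digits and a few safe characters."""
    name = name.replace(' ', '_')
    # NFKD splits accented letters into base letter + mark; the mark is then dropped
    ascii_only = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode()
    return ''.join(c for c in ascii_only if c in WHITELIST)


def clean_blacklist(name):
    """Keep everything except characters that are unsafe in file names."""
    name = name.replace(' ', '_')
    normalized = unicodedata.normalize('NFKD', name)
    return ''.join(c for c in normalized if c not in BLACKLIST)


if __name__ == '__main__':
    title = '東京 News?'
    print(clean_whitelist(title))
    print(clean_blacklist(title))
```

For a title like `東京 News?`, the whitelist version drops the non-ASCII characters entirely, while the blacklist version keeps them and only removes the `?`.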

Repository owner deleted a comment from azitafomo Feb 15, 2020