Start of urlchecker python module! #2

Merged
merged 10 commits into from
Mar 20, 2020
13 changes: 13 additions & 0 deletions .github/workflows/main.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
name: CI

on:
pull_request:
branches_ignore: []

jobs:
testing-docker:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Build container image
run: docker build -t urlstechie .
4 changes: 2 additions & 2 deletions .travis.yml
@@ -14,7 +14,7 @@ before_script:
- sudo apt-get install python3-pip
- sudo apt-get install python3-setuptools
- pip3 install --upgrade setuptools
- pip3 install -r requirements.txt
- pip3 install -e .
- pip3 install pytest-cov pytest
- pip3 install codecov
- pip3 install coveralls
@@ -23,7 +23,7 @@ before_script:

# command to run tests
script:
- pytest -v --cov=./
- pytest -vs -x --cov=./ tests/test_check.py tests/test_fileproc.py tests/test_files tests/test_urlproc.py

after_success:
- codecov
17 changes: 17 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,17 @@
# CHANGELOG

This is a manually generated log to track changes to the repository for each release.
Each section should include general headers such as **Implemented enhancements**
and **Merged pull requests**. Critical items to know are:

- renamed commands
- deprecated / removed commands
- changed defaults or behavior
- backward incompatible changes

Versions referenced in headers are tagged on GitHub; versions in parentheses are PyPI releases.

## [vxx](https://github.com/urlstechie/urlschecker-python/tree/master) (master)
- first release of urlchecker module with container, tests, and brief documentation (0.0.1)
- dummy release for pypi (0.0.0)

20 changes: 20 additions & 0 deletions Dockerfile
@@ -0,0 +1,20 @@
FROM bitnami/minideb:stretch
# docker build -t urlchecker .
WORKDIR /code
ENV PATH /opt/conda/bin:${PATH}
ENV LANG C.UTF-8
ENV SHELL /bin/bash
RUN /bin/bash -c "install_packages wget bzip2 ca-certificates git && \
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
rm Miniconda3-latest-Linux-x86_64.sh && \
conda create --name urlchecker && \
conda clean --all -y"
COPY . /code
RUN /bin/bash -c "source activate urlchecker && \
which python && \
pip install ."
RUN echo "source activate urlchecker" > ~/.bashrc
ENV PATH /opt/conda/envs/urlchecker/bin:${PATH}
ENTRYPOINT ["urlchecker"]
CMD ["check", "--help"]
181 changes: 179 additions & 2 deletions README.md
@@ -1,14 +1,191 @@
# urlchecker python

[![License](https://img.shields.io/badge/license-MIT-brightgreen)](https://github.com/urlstechie/urlchecker-python/blob/master/LICENSE)

# urlchecker python
![urlstechie](docs/urlstechie.png)

This is a Python module to collect URLs from static files (code and documentation)
and then test for and report broken links.
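The collection step can be pictured with a small sketch. This is a simplified stand-in, not the module's actual implementation; the regex and function name are illustrative:

```python
import re

# A minimal sketch of URL collection: scan a text file and extract
# anything that looks like an http(s) link.
URL_PATTERN = re.compile(r"https?://[^\s'\"<>)\]]+")

def collect_urls(file_path):
    """Return all URL-like strings found in a text file."""
    with open(file_path, "r", encoding="utf-8", errors="ignore") as handle:
        return URL_PATTERN.findall(handle.read())
```

Each collected URL would then be requested and its response status checked.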

## Code documentation
## Module Documentation

Detailed documentation of the code is available at [urls-checker.readthedocs.io](https://urls-checker.readthedocs.io/en/latest/)

## Usage

### Install

You can install urlchecker from [PyPI](https://pypi.org/project/urlchecker):

```bash
pip install urlchecker
```

or install from the repository directly:

```bash
git clone https://github.com/urlstechie/urlchecker-python.git
cd urlchecker-python
python setup.py install
```

Installation will place a binary, `urlchecker`, in your Python path.

```bash
$ which urlchecker
/home/vanessa/anaconda3/bin/urlchecker
```


### Check Local Folder

Your most likely use case will be to check a local directory with static files (documentation or code)
for broken links. In this case, you can use `urlchecker check`:

```bash
$ urlchecker check --help
usage: urlchecker check [-h] [-b BRANCH] [--subfolder SUBFOLDER] [--cleanup]
[--force-pass] [--no-print] [--file-types FILE_TYPES]
[--white-listed-urls WHITE_LISTED_URLS]
[--white-listed-patterns WHITE_LISTED_PATTERNS]
[--white-listed-files WHITE_LISTED_FILES]
[--retry-count RETRY_COUNT] [--timeout TIMEOUT]
path

positional arguments:
path the local path or GitHub repository to clone and check

optional arguments:
-h, --help show this help message and exit
-b BRANCH, --branch BRANCH
if cloning, specify a branch to use (defaults to
master)
--subfolder SUBFOLDER
relative subfolder path within path (if not specified,
we use root)
--cleanup remove root folder after checking (defaults to False,
no cleanup)
--force-pass force successful pass (return code 0) regardless of
result
--no-print Skip printing results to the screen (defaults to
printing to console).
--file-types FILE_TYPES
comma separated list of file extensions to check
(defaults to .md,.py)
--white-listed-urls WHITE_LISTED_URLS
comma separated list of white listed urls (no spaces)
--white-listed-patterns WHITE_LISTED_PATTERNS
comma separated list of white listed patterns for urls
(no spaces)
--white-listed-files WHITE_LISTED_FILES
comma separated list of white listed files and
patterns for files (no spaces)
--retry-count RETRY_COUNT
retry count upon failure (defaults to 2, one retry).
--timeout TIMEOUT timeout (seconds) to provide to the requests library
(defaults to 5)
```
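Per the help text above, the `--retry-count` and `--timeout` values are handed to the requests library. As a rough, stdlib-only sketch of that retry-and-timeout idea (the function name and error handling here are illustrative, not the module's actual code):

```python
from urllib.request import urlopen
from urllib.error import URLError

def url_is_alive(url, retry_count=2, timeout=5):
    """Illustrative sketch: try a URL up to retry_count times,
    treating an HTTP 200 response as a pass. The real urlchecker
    uses the requests library and richer status handling."""
    for _ in range(retry_count):
        try:
            with urlopen(url, timeout=timeout) as response:
                if response.getcode() == 200:
                    return True
        except (URLError, ValueError, OSError):
            pass  # retry on network errors or malformed URLs
    return False
```

A URL that fails every attempt would be reported as broken.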

You have a lot of flexibility to define patterns of URLs or files to skip,
along with the number of retries and the timeout (in seconds). The most basic
usage checks an entire directory. Let's clone and check the repository of one
of the maintainers:

```bash
git clone https://github.com/SuperKogito/SuperKogito.github.io.git
cd SuperKogito.github.io

$ urlchecker check .
original path: .
final path: /tmp/SuperKogito.github.io
subfolder: None
branch: master
cleanup: False
file types: ['.md', '.py']
print all: True
url whitelist: []
url patterns: []
file patterns: []
force pass: False
retry count: 2
timeout: 5

/tmp/SuperKogito.github.io/README.md
------------------------------------
https://travis-ci.com/SuperKogito/SuperKogito.github.io
https://www.python.org/download/releases/3.0/
https://superkogito.github.io/blog/diabetesML2.html
https://superkogito.github.io/blog/Cryptography.html
http://www.sphinx-doc.org/en/master/
https://github.com/
https://superkogito.github.io/blog/SignalFraming.html
https://superkogito.github.io/blog/VoiceBasedGenderRecognition.html
https://travis-ci.com/SuperKogito/SuperKogito.github.io.svg?branch=master
https://superkogito.github.io/blog/SpectralLeakageWindowing.html
https://superkogito.github.io/blog/Intro.html
https://github.com/SuperKogito/SuperKogito.github.io/workflows/Check%20URLs/badge.svg
https://superkogito.github.io/blog/diabetesML1.html
https://superkogito.github.io/blog/AuthenticatedEncryption.html
https://superKogito.github.io/blog/ffmpegpipe.html
https://superkogito.github.io/blog/Encryption.html
https://superkogito.github.io/blog/NaiveVad.html

/tmp/SuperKogito.github.io/_project/src/postprocessing.py
---------------------------------------------------------
No urls found.
...

https://github.com/marsbroshok/VAD-python/blob/d74033aa08fbbbcdbd491f6e52a1dfdbbb388eea/vad.py#L64
https://github.com/fgnt/pb_chime5
https://ai.facebook.com/blog/wav2vec-state-of-the-art-speech-recognition-through-self-supervision/
https://corplinguistics.wordpress.com/tag/mandarin/
http://www.cs.tut.fi/~tuomasv/papers/ijcnn_paper_valenti_extended.pdf
http://shachi.org/resources
https://conference.scipy.org/proceedings/scipy2015/pdfs/brian_mcfee.pdf
https://www.dlology.com/blog/simple-speech-keyword-detecting-with-depthwise-separable-convolutions/
https://stackoverflow.com/questions/49197916/how-to-profile-cpu-usage-of-a-python-script


Done. All URLS passed.
```

But wouldn't it be easier to not have to clone the repository first?
Of course! We can specify a GitHub URL instead, and add `--cleanup`
if we want to remove the cloned folder afterward.

```bash
urlchecker check https://github.com/SuperKogito/SuperKogito.github.io.git
```

If you specify any arguments for a white list (or any kind of expected list),
make sure that you provide a comma-separated list *without any spaces*:

```bash
urlchecker check --white-listed-files=README.md,_config.yml
```
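A hypothetical sketch of how such name-and-pattern filtering could work, using `fnmatch` (the module's actual matching rules may differ):

```python
from fnmatch import fnmatch

def exclude_white_listed(paths, white_listed):
    """Drop any path that matches a whitelisted name or glob pattern."""
    return [p for p in paths
            if not any(fnmatch(p, pattern) for pattern in white_listed)]

# Split the comma-separated CLI value (no spaces) into patterns
white_listed_files = "README.md,_config.yml".split(",")
files = ["README.md", "_config.yml", "docs/index.md"]
print(exclude_white_listed(files, white_listed_files))  # ['docs/index.md']
```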

If you have any questions, please don't hesitate to [open an issue](https://github.com/urlstechie/urlchecker-python).


### Docker

A Docker container is provided if you want a base container with urlchecker
already installed, meaning that you don't need to install it on your host.
You can build the container as follows:

```bash
docker build -t urlchecker .
```

The entrypoint then exposes the urlchecker executable:

```bash
docker run -it urlchecker
```

## Development

### Organization
7 changes: 4 additions & 3 deletions tests/test_check.py
@@ -8,6 +8,7 @@
from urlchecker.core.fileproc import get_file_paths
from urlchecker.main.github import clone_repo, delete_repo, get_branch
from urlchecker.core.check import check_files
from urlchecker.logger import print_failure

@pytest.mark.parametrize('git_path', ["https://github.com/SuperKogito/SuperKogito.github.io"])
def test_clone_and_del_repo(git_path):
@@ -89,8 +90,8 @@ def test_script(config_fname, cleanup, print_all, force_pass, rcount, timeout):
# Generate command
cmd = ["urlchecker", "check", "--subfolder", "_project", "--file-types", file_types,
"--white-listed-files", "conf.py", "--white-listed-urls", white_listed_urls,
"--white-listed_patterns", white_listed_patterns, "--retry-count", retry_count,
"--timeout", timeout]
"--white-listed_patterns", white_listed_patterns, "--retry-count", str(rcount),
"--timeout", str(timeout)]

# Add boolean arguments
if cleanup:
@@ -174,7 +175,7 @@ def test_check_generally(retry_count):
elif not force_pass and check_results['failed']:
print("\n\nDone. The following URLS did not pass:")
for failed_url in check_results['failed']:
print_failre(failed_url)
print_failure(failed_url)
if retry_count == 1:
return True
else:
11 changes: 10 additions & 1 deletion tests/test_fileproc.py
@@ -2,7 +2,7 @@
# -*- coding: utf-8 -*-
import os
import pytest
from urlchecker.core.fileproc import check_file_type, get_file_paths, collect_links_from_file, include_file
from urlchecker.core.fileproc import check_file_type, get_file_paths, collect_links_from_file, include_file, remove_empty


@pytest.mark.parametrize('file_path', ["tests/test_files/sample_test_file.md",
@@ -74,3 +74,12 @@ def collect_links_from_file(file_path):
# read file content
urls = collect_links_from_file(file_path)
assert len(urls) == 3


def test_remove_empty():
"""
test that empty urls are removed
"""
urls = ["notempty", "notempty", "", None]
assert len(remove_empty(urls)) == 2
14 changes: 3 additions & 11 deletions tests/test_urlproc.py
@@ -2,7 +2,7 @@
# -*- coding: utf-8 -*-
import pytest
from urlchecker.core.fileproc import collect_links_from_file
from urlchecker.core.urlproc import check_response_status_code, check_urls, remove_empty
from urlchecker.core.urlproc import check_response_status_code, check_urls


@pytest.mark.parametrize('file', ["tests/test_files/sample_test_file.md",
@@ -11,14 +11,6 @@ def test_check_urls(file):
"""
test check urls check function.
"""
check_results = {"failed": [], "passed": []}
urls = collect_links_from_file(file)
check_urls(file, urls, check_results=[[],[]])


def test_remove_empty():
"""
test that empty urls are removed
"""
urls = ["notempty", "notempty", "", None]
if len(remove_empty(urls)) != 2:
raise AssertionError
check_urls(file, urls, check_results=check_results)
12 changes: 5 additions & 7 deletions urlchecker/client/check.py
@@ -26,7 +26,6 @@ def main(args, extra):
- args: the argparse ArgParser with parsed args
- extra: extra arguments not handled by the parser
"""

path = args.path

# Case 1: specify present working directory
@@ -49,9 +48,9 @@

# Parse file types, and white listed urls and files (includes absolute and patterns)
file_types = args.file_types.split(",")
white_listed_urls = remove_empty(args.white_listed_urls).split(",")
white_listed_patterns = remove_empty(args.white_listed_patterns).split(",")
white_listed_files = remove_empty(args.white_listed_files).split(",")
white_listed_urls = remove_empty(args.white_listed_urls.split(","))
white_listed_patterns = remove_empty(args.white_listed_patterns.split(","))
white_listed_files = remove_empty(args.white_listed_files.split(","))

# Alert user about settings
print(" original path: %s" % args.path)
@@ -60,7 +59,7 @@
print(" branch: %s" % args.branch)
print(" cleanup: %s" % args.cleanup)
print(" file types: %s" % file_types)
print(" print all: %s" % args.print_all)
print(" print all: %s" % (not args.no_print))
print(" url whitelist: %s" % white_listed_urls)
print(" url patterns: %s" % white_listed_patterns)
print(" file patterns: %s" % white_listed_files)
@@ -75,7 +74,7 @@
white_listed_files=white_listed_files,
white_listed_urls=white_listed_urls,
white_listed_patterns=white_listed_patterns,
print_all=args.print_all,
print_all=not args.no_print,
retry_count=args.retry_count,
timeout=args.timeout)

@@ -89,7 +88,6 @@
print("\n\nDone. No urls were collected.")
sys.exit(0)

# TODO write a function to print failure, print success
# Case 2: We had errors, but force pass is True
elif args.force_pass and check_results['failed']:
print("\n\nDone. The following urls did not pass:")
Empty file added urlchecker/core/__init__.py
Empty file.
4 changes: 2 additions & 2 deletions urlchecker/main/github.py
Expand Up @@ -10,8 +10,8 @@
import os
import sys
import subprocess
from core import urlproc
from core import fileproc
from urlchecker.core import urlproc
from urlchecker.core import fileproc


def clone_repo(git_path, branch="master"):