KeywordGrabber

A Python web scraper that counts the words and phrases on a page or set of pages. Results are written to CSV files.

Usage:

In urls.conf, specify the URLs to crawl, one per line.
In the same file, optionally specify the HTML selectors to use on the line below each URL, separated by commas. Prefix that line with 'Selectors:'. See urls.conf for examples. All selectors are applied together when searching: for example, 'div, .content' matches divs with the class .content and ignores any other element with that class.
URLs may contain wildcards. Wildcards allowed: , <0-9>. Each wildcard is replaced by every value in its range. Example: example.com/page-<89-102> expands to example.com/page-89, example.com/page-90, ... example.com/page-102, and each resulting URL is loaded and parsed.
Words and phrases listed in ignored.conf are dropped from the CSV output. List one word or phrase per line.
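
The numeric wildcard expansion described above could be sketched roughly as follows. This is not the code in parser.py; the function name expand_url and the regular expression are assumptions made for illustration, and only the <start-end> numeric form from the example is handled.

```python
import re
from itertools import product

# Assumed syntax: a numeric range wildcard such as <89-102>.
RANGE_PATTERN = re.compile(r"<(\d+)-(\d+)>")


def expand_url(url):
    """Return every URL produced by substituting each <start-end> range."""
    # With two capture groups, split() yields:
    # literal, start, end, literal, start, end, ..., literal
    parts = RANGE_PATTERN.split(url)
    literals = parts[0::3]
    ranges = [
        range(int(start), int(end) + 1)  # the end value is inclusive
        for start, end in zip(parts[1::3], parts[2::3])
    ]
    urls = []
    # product() walks every combination when a URL has several wildcards.
    for combo in product(*ranges):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(str(value))
            pieces.append(literal)
        urls.append("".join(pieces))
    return urls


expanded = expand_url("example.com/page-<89-102>")
```

A URL without wildcards comes back unchanged as a one-element list, so the same function can be applied to every line in urls.conf.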

Output:
Output is written to two CSV files, phrases.csv and words.csv.
They contain the phrases and words, respectively, in descending order of occurrence, along with their counts.
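
As a minimal sketch of how such a file might be produced, the example below counts items, drops ignored ones, and writes rows in descending order of count. The function name write_counts, the two-column row layout, and the case-sensitive matching are assumptions, not confirmed details of parser.py.

```python
import csv
import io
from collections import Counter


def write_counts(items, ignored, out_file):
    """Count items, skip ignored ones, and write 'item,count' rows."""
    counts = Counter(item for item in items if item not in ignored)
    writer = csv.writer(out_file)
    # most_common() returns entries sorted by count, highest first.
    for item, count in counts.most_common():
        writer.writerow([item, count])
    return counts


words = "the cat saw the other cat and the dog".split()
buffer = io.StringIO()  # stands in for an open words.csv file
write_counts(words, {"the", "and"}, buffer)
csv_text = buffer.getvalue()
```

To write a real file, the buffer could be replaced with open("words.csv", "w", newline="").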

Future Enhancements:
Make the phrase counts more useful: split text on designated separators such as '-' and count the individual pieces rather than the whole string.
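
One way this enhancement might look is sketched below. The separator set and the requirement that a separator be surrounded by whitespace (so hyphenated words stay intact) are assumptions, as is the function name count_phrase_pieces.

```python
import re
from collections import Counter

# Hypothetical separator set: '-' or '|' with whitespace on both sides,
# so that hyphenated words such as "well-known" are not split.
SEPARATORS = re.compile(r"\s+[-|]\s+")


def count_phrase_pieces(phrases):
    """Split each phrase on the separators and count the pieces."""
    counts = Counter()
    for phrase in phrases:
        for piece in SEPARATORS.split(phrase):
            piece = piece.strip()
            if piece:  # skip empty pieces left by leading or trailing separators
                counts[piece] += 1
    return counts


piece_counts = count_phrase_pieces(
    ["Home - Products", "Products - Widgets", "well-known brand"]
)
```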
