GitHub - shubham-MLwiz/Advanced-Web-Crawler: A Python Implementation of Web crawler and indexer.


Repository contents (14 commits in history):
    Models
    README.txt
    counter.py
    crawler.py
    downloader.py
    indexer.py
    linkanalyser.py
    mathutils.py
    parser.py
    porter.py
    retriever.py
    search.py
    stopwords.txt
    tokenizedocuments.py


Required libraries: BeautifulSoup (for HTML parsing).
                    On Ubuntu it can be installed with: sudo easy_install BeautifulSoup
                      
Run 'python crawler.py' in a terminal to start the crawler.

A prompt asks for the starting URL.

Enter 'www.iitr.ac.in' and the crawler starts crawling, following only URLs that are in the same domain.
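
The same-domain restriction described above could be checked roughly as follows. This is a minimal sketch using only the Python 3 standard library, not the code in crawler.py; the function names normalize and same_domain are hypothetical, and it assumes that "same domain" means an exact host match and that a bare host such as 'www.iitr.ac.in' should be treated as an http URL.

```python
from urllib.parse import urljoin, urlparse


def normalize(url):
    """Add a scheme when the user types a bare host such as 'www.iitr.ac.in'.

    Assumption: plain http is used when no scheme is given.
    """
    return url if "://" in url else "http://" + url


def same_domain(start_url, link):
    """Return True when link resolves to the same host as start_url.

    Relative links are resolved against the start URL first, so
    '/departments/' counts as being in the same domain.
    """
    start = normalize(start_url)
    absolute = urljoin(start, link)
    return urlparse(absolute).netloc.lower() == urlparse(start).netloc.lower()
```

With this sketch, same_domain('www.iitr.ac.in', '/departments/') is accepted, while a link to another host or a mailto: link is rejected. A real crawler might prefer to compare registered domains instead, so that subdomains are also followed.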

The crawler also keeps a log of all discarded, crawled, and already-visited links in log.txt.
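
One way the three log categories could be produced is sketched below. This is an illustration only: the function names record_link and write_log, the "status: url" line format, and the exact-host rule for discarding links are all assumptions, not the behaviour of crawler.py.

```python
from urllib.parse import urlparse


def record_link(url, allowed_host, visited, log_lines):
    """Classify url as discarded, already visited, or crawled, and log it.

    Assumptions: links on a different host are discarded, and the log
    line format 'status: url' is invented for this sketch.
    """
    host = urlparse(url).netloc.lower()
    if host != allowed_host.lower():
        status = "discarded"
    elif url in visited:
        status = "already visited"
    else:
        visited.add(url)
        status = "crawled"
    log_lines.append(f"{status}: {url}")
    return status


def write_log(log_lines, path="log.txt"):
    """Write the collected log lines to a file, one entry per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(log_lines) + "\n")
```

A crawl loop would call record_link for every link it extracts, fetch the page only when the status is "crawled", and call write_log once at the end.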

The crawler still needs a lot of improvement.

