# Web scraping for an NLP task

The goal of this notebook is to build a tool that scrapes text from a given list of websites, so that the text can later be used to cluster the sites.

The task requires text from each landing page, as well as text from the pages that the landing page links to.
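
Collecting the links of a landing page could look like the sketch below. It uses only the standard library. `LinkCollector` and `internal_links` are hypothetical names, and keeping only links on the same domain is an assumption of this sketch, not something the task states.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html, base_url):
    """Return the unique same-domain links found in `html`, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    base_host = urlparse(base_url).netloc
    seen, links = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parsed = urlparse(absolute)
        # Assumption: only http(s) links on the landing page's own domain are kept
        if parsed.scheme in ("http", "https") and parsed.netloc == base_host:
            if absolute not in seen:
                seen.add(absolute)
                links.append(absolute)
    return links
```

Calling `internal_links(page_html, 'https://example.com/')` on the HTML of a landing page gives the list of pages to request next.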

Because this requires many requests, the scraper needs some mechanism to avoid being blocked, such as rotating user agents or routing requests through proxies.
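
A minimal sketch of these two ideas, using only `urllib` from the standard library, is shown below. The user-agent strings are illustrative examples, and `build_request`, `make_opener` and `fetch` are hypothetical helpers, not necessarily what `scraptools` does.

```python
import random
import urllib.request

# Illustrative browser-like user agents; a real pool would be larger
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_request(url):
    """Return a Request carrying a randomly chosen User-Agent header."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


def make_opener(proxy_url=None):
    """Return an opener that routes traffic through `proxy_url` when given."""
    handlers = []
    if proxy_url:
        proxies = {"http": proxy_url, "https": proxy_url}
        handlers.append(urllib.request.ProxyHandler(proxies))
    return urllib.request.build_opener(*handlers)


def fetch(url, proxy_url=None, timeout=10):
    """Download `url` and return its body decoded as text."""
    opener = make_opener(proxy_url)
    with opener.open(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Choosing a new user agent per request keeps the traffic from presenting a single client signature. The proxy address, if any, would have to be supplied by the user.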

Because each page contains many links, the requests can be processed in parallel to speed up the scraping.
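
Since the work is dominated by waiting on the network, a thread pool is one reasonable way to do this. The sketch below assumes a `fetch` callable that takes a URL and returns its text. `fetch_all` and `max_workers=8` are hypothetical choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=8):
    """Apply `fetch` to every URL concurrently.

    Returns two dicts: url -> result for the calls that succeeded,
    and url -> error description for the calls that raised.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so failures can be attributed
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failing page should not abort the whole run
                errors[url] = repr(exc)
    return results, errors
```

Keeping failures in a separate dictionary makes it straightforward to report successes and failures separately at the end of a run.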

The final product should take a list of websites and write text files with the contents of each site.
Additional parameters could be included to control, for instance, the parallel processing or some further filtering of the contents.
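
As a rough sketch of the output step, the text of each site could be written to one file named after the site. The naming scheme and the helpers `site_filename` and `save_site_text` are assumptions made for illustration. The layout actually used by `scraptools` may differ.

```python
import re
from pathlib import Path


def site_filename(site):
    """Turn a site URL into a file name that is safe on common filesystems."""
    name = re.sub(r"^https?://", "", site.strip().lower())
    # Replace anything other than letters, digits, dots and hyphens
    name = re.sub(r"[^a-z0-9.-]+", "_", name).strip("_")
    return f"{name}.txt"


def save_site_text(site, pages, out_dir):
    """Write the texts in `pages` to a single file for `site`; return its path."""
    out_path = Path(out_dir) / site_filename(site)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text("\n\n".join(pages), encoding="utf-8")
    return out_path
```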

In [None]:
import scraptools as sct

In [None]:
# Read the list of sites from the CSV file, one site per line
with open('./site_lists/01_websites.csv', 'r', newline='') as f:
    # Strip whitespace and skip blank lines so that empty entries are not requested
    SITES_LIST = [line.strip() for line in f if line.strip()]

print(f'The list contains {len(SITES_LIST)} sites.')

# Call the custom scraping function from scraptools
report = sct.scrape_full(SITES_LIST)

# Print a summary of the scraping run
print('=' * 20 + '\n Scraping Summary \n' + '=' * 20 + '\n')
print(f'{len(report["sites"])} sites requested. \n'
      + f'Scraping took {report["time_s"]/60:.2f} min ({report["time_s"]:.2f} s) \n'
      + f'{len(report["succesful"])} SUCCESSFUL. \n'
      + f'{len(report["failed"])} FAILURES. \n\n'
      + f'Contents in: {report["contents"]} \n'
      + f'Logs in: {report["logs"]} \n'
      + f'Full report: {report["report_name"]}'
     )