# Selector Crawler

###### A simple crawler that tests a given CSS selector against a list of pages (URLs).

To use it, you need two things: the CSS selector you want to test, and the list of pages/URLs (an XML file) to test that selector against.

## Getting the pages/URLs 

To get the XML file containing the list of pages to test the CSS selector on, use an online tool called "XML Sitemap Generator".

To do so, follow these steps:
1. Go to the "XML Sitemap Generator" website: https://www.xml-sitemaps.com/
2. In the text field at the top of the page, enter the base URL the site list should start from, so the tool pulls all the pages that start with that URL. E.g. https://attitude.co.uk/
3. Click on the "Start" button.
4. As soon as it is done, click on the "View Site Map" button.
5. Click on "Download your XML sitemap file" to download the XML file containing the list of pages to check.
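
Once the file is downloaded, the page URLs sit in `<loc>` elements under the sitemap namespace. The following is a minimal sketch of reading them with the standard library, assuming the file follows the sitemaps.org 0.9 schema. The helper name `extract_urls` and the shortened inline sample are illustrative and not part of the crawler itself.

```python
import xml.etree.ElementTree as ET

# Namespace declared by sitemaps.org 0.9 files such as the generated one.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(xml_text):
    """Return the text of every <loc> element found under <url> entries."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Shortened stand-in for the downloaded sitemap (assumption for illustration).
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://attitude.co.uk/</loc></url>
  <url><loc>https://attitude.co.uk/category/news/</loc></url>
</urlset>"""

print(extract_urls(sample))
```

Looking up `<loc>` by name rather than by position keeps the sketch working even if a `<url>` entry lists its children in a different order.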


## Filtering the XML file to get only the URLs wanted

Now that you have downloaded the XML file, we have to clean it up so the crawler only tests the intended pages.

We can do this by specifying a base URL to filter on. For example, https://attitude.co.uk/article/ will only test pages whose URL starts with that base.

For example, let's say you pulled a site list using "XML Sitemap Generator" on https://attitude.co.uk/ and got the following XML file:
<img src="img_xml1.png">

Now, let's say you only want to test article pages. As the example above shows, the file contains all types of pages, not only articles (for example: https://attitude.co.uk/category/news/). To narrow it down, you just have to specify a base URL to filter on.

For example, if we use the base URL https://attitude.co.uk/article/ on the above XML, the crawler will test the CSS selector only on the pages below.
<img src="img_xml2.png">
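
The filtering step can be sketched as a plain prefix check. This is only an illustration under the assumption that "starts with the base URL" is the whole rule. The function name `filter_by_base_url` is hypothetical, and the URLs are taken from the example sitemap.

```python
def filter_by_base_url(urls, base_url):
    """Keep only the URLs that start with base_url (literal prefix match)."""
    return [url for url in urls if url.startswith(base_url)]


urls = [
    "https://attitude.co.uk/",
    "https://attitude.co.uk/category/news/",
    "https://attitude.co.uk/article/ms-menswear-launch-two-new-innovative-active-ranges-1/16964/",
]

article_urls = filter_by_base_url(urls, "https://attitude.co.uk/article/")
print(article_urls)
```

A literal prefix match avoids treating characters such as `.` in the base URL as regular-expression wildcards.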

In [3]:
import xml.etree.ElementTree as T
import re
import lxml.html
from lxml.cssselect import CSSSelector
from IPython.display import clear_output
import requests
import csv
from ipynb.fs.full.progress_bar import log_progress


#Getting and Parsing the xml code
while True:
    try:
#         xml_str = input('1.- Paste XML code: ')
        xml_str = '''<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<!-- created with Free Online Sitemap Generator www.xml-sitemaps.com -->


<url>
  <loc>https://attitude.co.uk/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>1.00</priority>
</url>
<url>
  <loc>https://attitude.co.uk/category/news/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.80</priority>
</url>
<url>
  <loc>https://attitude.co.uk/category/entertainment/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.80</priority>
</url>
<url>
  <loc>https://attitude.co.uk/category/community/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.80</priority>
</url>
<url>
  <loc>https://attitude.co.uk/category/boys/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.80</priority>
</url>
<url>
  <loc>https://attitude.co.uk/category/opinion/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.80</priority>
</url>
<url>
  <loc>https://attitude.co.uk/category/style/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.80</priority>
</url>

<url>
  <loc>https://attitude.co.uk/article/dancing-on-ice-star-max-evans-opens-up-about-that-full-frontal-shoot-with-brother-thom-1/17429/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/86-year-old-attitude-reader-hugh-gets-naked-to-prove-youre-never-too-old-to-celebrate-your-body/17189/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/austin-armacosts-divorce-diet-reality-star-loses-two-stone-in-just-12-weeks-and-sculpts-his-dream-body-to-celebrate-his-new-life/17148/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/the-gay-star-of-sas-who-dares-wins-is-sticking-two-fingers-up-to-his-childhood-bullies/17011/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/geordie-shores-nathan-henry-how-i-lost-32lbs-and-sculpted-the-superman-body-of-my-dreams-in-just-12-weeks/16962/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/ms-menswear-launch-two-new-innovative-active-ranges-1/16964/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/real-bodies-life-doesnt-end-when-you-get-grey-hair-1/16949/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/new-vieve-sports-nutrition-gives-active-men-protein-without-the-artificial-ingredients/16916/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/max-emerson-strips-off-to-talk-health-and-fitness-with-attitude/16585/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/real-bodies-i-try-not-to-be-naked-even-in-front-of-my-husband-1/16577/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/attitudes-deputy-editor-embarks-on-three-month-body-transformation-part-one-1/16439/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/strictly-come-dancings-gorka-marquez-strips-off-and-talks-fitness-with-attitude-1/16364/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/showbiz-journalist-james-ingham-strips-off-for-attitudes-active-shoot-1/16379/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/gareth-gates-strips-down-and-shows-off-his-body-to-talk-fitness-with-attitude-1/16175/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/category/active/?p=2</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/win-tickets-to-see-oceans-8-at-a-special-screening-and-two-tickets-to-the-premiere/18102/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/win-180-worth-of-absolute-collagen-goodies-1/17728/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/win-tickets-to-bfi-flare-presents-love-simon-terms-and-conditions/17380/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>
<url>
  <loc>https://attitude.co.uk/article/give-him-the-ride-of-his-life-win-a-pair-of-eurostar-tickets-to-amsterdam-1/17239/</loc>
  <lastmod>2018-06-13T15:04:51+00:00</lastmod>
  <priority>0.64</priority>
</url>



</urlset>'''
        clear_output()
        e = T.fromstring(xml_str)
        break
    except T.ParseError:
        print("\x1b[31mInvalid XML, please use valid XML code.\x1b[0m")


#REGEX to filter out the sites being tested
# base_url = input('2.- Enter base URL: ')
base_url = 'https://attitude.co.uk/article/'
clear_output()
# Escape the base URL so characters such as '.' are matched literally
regex = re.compile(re.escape(base_url))


# construct a CSS Selector
# css_selector = input('2.- Enter CSS Selector: ')
css_selector = 'div.article-contents'
clear_output()
sel = CSSSelector(css_selector)
# sel = CSSSelector('div.article-content')




for loc in log_progress(list(e)):  # iterate over every <url> element (getchildren() was removed in Python 3.9)
    if(re.match(regex, loc[0].text)): #this will give me all the /article URLs in the XML
        r = requests.get(loc[0].text)
        # build the DOM Tree
        tree = lxml.html.fromstring(r.text)
        if(sel(tree)):
#             filewriter.writerow(['YES', loc.text])
            print('FOUND' + ' ' + loc[0].text)
        else:
#             filewriter.writerow(['NO', loc.text])
            print('NOT FOUND' + ' ' + loc[0].text)

        
# with open('Selector_Crawler.csv', 'w') as csvfile:
#     filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
#     filewriter.writerow(['Matched Selector?', 'Exact Page URL'])

#     for loc in e.iter('{http://www.sitemaps.org/schemas/sitemap/0.9}loc'):
#         if(re.match(regex, loc.text)): #this will give me all the /article URLs in the XML
#             r = requests.get(loc.text)
#             # build the DOM Tree
#             tree = lxml.html.fromstring(r.text)
#             if(sel(tree)):
#                 filewriter.writerow(['YES', loc.text])
#                 print('FOUND' + ' ' + loc.text)
#             else:
#                 filewriter.writerow(['NO', loc.text])
#                 print('NOT FOUND' + ' ' + loc.text)
                







NOT FOUND https://attitude.co.uk/article/dancing-on-ice-star-max-evans-opens-up-about-that-full-frontal-shoot-with-brother-thom-1/17429/
NOT FOUND https://attitude.co.uk/article/86-year-old-attitude-reader-hugh-gets-naked-to-prove-youre-never-too-old-to-celebrate-your-body/17189/
NOT FOUND https://attitude.co.uk/article/austin-armacosts-divorce-diet-reality-star-loses-two-stone-in-just-12-weeks-and-sculpts-his-dream-body-to-celebrate-his-new-life/17148/
NOT FOUND https://attitude.co.uk/article/the-gay-star-of-sas-who-dares-wins-is-sticking-two-fingers-up-to-his-childhood-bullies/17011/
NOT FOUND https://attitude.co.uk/article/geordie-shores-nathan-henry-how-i-lost-32lbs-and-sculpted-the-superman-body-of-my-dreams-in-just-12-weeks/16962/
NOT FOUND https://attitude.co.uk/article/ms-menswear-launch-two-new-innovative-active-ranges-1/16964/
NOT FOUND https://attitude.co.uk/article/real-bodies-life-doesnt-end-when-you-get-grey-hair-1/16949/
NOT FOUND https://attitude.co.uk/article/new-vieve
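
The commented-out part of the cell above hints at saving the results to a CSV file with the columns `Matched Selector?` and `Exact Page URL`. A minimal sketch of that report step follows, assuming the crawl has already produced `(url, matched)` pairs. The function `write_report` and the two placeholder URLs are hypothetical and stand in for real crawl results.

```python
import csv
import io


def write_report(results, fileobj):
    """Write (url, matched) pairs as CSV rows with a YES/NO flag."""
    writer = csv.writer(fileobj)
    writer.writerow(["Matched Selector?", "Exact Page URL"])
    for url, matched in results:
        writer.writerow(["YES" if matched else "NO", url])


# Placeholder results (assumption); real ones would come from the crawl loop.
results = [
    ("https://attitude.co.uk/article/example-a/1/", True),
    ("https://attitude.co.uk/article/example-b/2/", False),
]

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_report(results, buffer)
print(buffer.getvalue())
```

To write to disk instead, the same function can be given a file opened with `open('Selector_Crawler.csv', 'w', newline='')`, which is the mode the `csv` module documentation recommends.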

In [7]:
from IPython.display import display, Markdown, Latex
# Raw strings keep the backslash in \phi from being read as an escape sequence
display(Markdown(r'# hey $\phi$'))
# If you particularly want to display maths, this is more direct:
# display(Latex(r'\phi'))

# hey $\phi$

In [10]:
from ipywidgets import Layout, Button, Box

items_layout = Layout( width='auto')     # override the default width of the button to 'auto' to let the button grow

box_layout = Layout(display='flex',
                    flex_flow='column',
                    align_items='stretch',
                    border='solid',
                    width='50%')

words = ['correct', 'horse', 'battery', 'staple']
items = [Button(description=word, layout=items_layout, button_style='danger') for word in words]
box = Box(children=items, layout=box_layout)
box

In [14]:
from ipywidgets import Layout, Button, Box, VBox

# Items flex proportionally to the weight and the left over space around the text
items_auto = [
    Button(description='weight=1; auto', layout=Layout(flex='1 1 auto', width='auto'), button_style='danger'),
    Button(description='weight=3; auto', layout=Layout(flex='3 1 auto', width='auto'), button_style='danger'),
    Button(description='weight=1; auto', layout=Layout(flex='1 1 auto', width='auto'), button_style='danger'),
 ]

# Items flex proportionally to the weight
items_0 = [
    Button(description='weight=1; 0%', layout=Layout(flex='1 1 0%', width='auto'), button_style='danger'),
    Button(description='weight=3; 0%', layout=Layout(flex='3 1 0%', width='auto'), button_style='danger'),
    Button(description='weight=1; 0%', layout=Layout(flex='1 1 0%', width='auto'), button_style='danger'),
 ]
box_layout = Layout(display='flex',
                    flex_flow='row',
                    align_items='stretch',
                    width='70%')
box_auto = Box(children=items_auto, layout=box_layout)
box_0 = Box(children=items_0, layout=box_layout)
VBox([box_auto, box_0])

In [18]:


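from ipywidgets import Button, Layout

# Callback invoked on every click; it receives the clicked Button instance
def on_button_clicked(button):
    print("Button clicked.")

b = Button(description='(50% width, 80px height) button',
           layout=Layout(width='50%', height='80px'))
b.on_click(on_button_clicked)
b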

Button clicked.
Button clicked.
Button clicked.
Button clicked.
Button clicked.
Button clicked.
Button clicked.
Button clicked.
Button clicked.
Button clicked.
Button clicked.
Button clicked.
Button clicked.
Button clicked.
