# Let's build a scraper together

First, go to https://www.trustpilot.com and have a look at the web*site* (site = the whole thing).

Decide on a specific *page* (page = one specific page) to scrape first -- generalization comes later.

In [1]:
import requests
import pandas as pd

from lxml.html import fromstring

In [2]:
url = "https://www.trustpilot.com/review/softwarekeep.com"

r = requests.get(url)
tree = fromstring(r.text)

**Hint**: Make sure you don't re-run the download code every time you play with the parsing below. For example, save the downloaded HTML to disk and re-load it from there.
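
One way to follow this hint is a small caching helper. This is only a minimal sketch: the names `fetch_html` and `load_or_download` and the cache file name are made up for illustration, and it assumes the page can be stored as UTF-8 text.

```python
from pathlib import Path


def fetch_html(url):
    """Download a page and return its HTML as text."""
    import requests  # imported here so the cache logic also works offline

    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail loudly on HTTP errors instead of caching an error page
    return r.text


def load_or_download(url, cache_file, fetch=fetch_html):
    """Return the HTML from cache_file if it exists, else download and save it."""
    path = Path(cache_file)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.write_text(html, encoding="utf-8")
    return html


# Hypothetical usage in the notebook (file name is an assumption):
# tree = fromstring(load_or_download(url, "softwarekeep.html"))
```

Delete the cache file whenever you want a fresh copy of the page.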

## Let's try to get the rating (number of stars)

In [3]:
# let's just print first instead of storing anything, to check whether the XPath works
for e in tree.xpath("//section/div/div/img"):
    print(e.attrib)

{'alt': 'Rated 5 out of 5 stars', 'src': 'https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg'}
{'alt': 'Rated 5 out of 5 stars', 'src': 'https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg'}
{'alt': 'Rated 5 out of 5 stars', 'src': 'https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg'}
{'alt': 'Rated 5 out of 5 stars', 'src': 'https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg'}
{'alt': 'Rated 4 out of 5 stars', 'src': 'https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-4.svg'}
{'alt': 'Rated 5 out of 5 stars', 'src': 'https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg'}
{'alt': 'Rated 5 out of 5 stars', 'src': 'https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg'}
{'alt': 'Rated 5 out of 5 stars', 'src': 'https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg'}
{'alt': 'Rated 5 out of 5 stars', 'src': 'https://cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg'}
{'alt': 'Rated 5 out of 5 st
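
The `alt` text is a string, while for most analyses a number is handier. A minimal sketch, assuming every `alt` value follows the "Rated N out of 5 stars" pattern seen in the output above (`parse_rating` is a hypothetical helper, not part of lxml):

```python
import re

# one digit between "Rated" and "out of 5 stars"
RATING_PATTERN = re.compile(r"Rated (\d) out of 5 stars")


def parse_rating(alt_text):
    """Return the star rating as an int, or None if the text does not match."""
    match = RATING_PATTERN.search(alt_text)
    return int(match.group(1)) if match else None


print(parse_rating("Rated 4 out of 5 stars"))  # 4
```

Returning `None` for unexpected text keeps one value per image, so the list stays the same length as the others collected from the page.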

## The review texts
This seems more complicated; I needed quite a few guesses to find a good XPath...

Note that we use the `.text_content()` method instead of the `.attrib` attribute. You can always just get *one* element (e.g. `e = tree.xpath('somexpath')[0]`) and then use TAB completion to see all its methods and attributes.
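
To see the difference between these attributes and methods, it can help to play with a tiny element first. The snippet below is made up for illustration (the real Trustpilot markup may well look different); it only assumes that a review consists of a title element followed by a body element.

```python
from lxml.html import fromstring

# made-up snippet imitating a review: a title followed by a body paragraph
snippet = '<div data-review-content="true"><h2>Great</h2><p>Very <b>quick</b> help.</p></div>'
e = fromstring(snippet)

print(dict(e.attrib))    # the element's attributes
print(e.text)            # only the text directly inside <div>, before its first child
print(e.text_content())  # the text of the element and all its descendants

# text of each direct child separately, to keep title and body apart
parts = [child.text_content() for child in e]
print(parts)
```

`.text_content()` joins the text of all descendants without any separator, so a title and a body run together into one string. Collecting the children separately, as in `parts`, is one possible way around that.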

In [4]:
for e in tree.xpath('//*[@data-review-content="true"]'):
    print(e.text_content())
    print('\n')

Joe was so helpfulJoe was so helpful today. Very quick and efficient. He knew my problem right away and offered quick solutions. Thank you so much Joe!!


CSR Carl was a great help. Took care of my request very fast and very professionally.CSR Carl was a great help. He took care of my request promptly and very professionally.


Good service always.


I had an issue with downloading…I had an issue with downloading Microsoft to my computer since i used the wrong email and Eric helped me via messages and helped me fix my problem within minutes.


AssistanceMy outlook calendar and contacts were not importing correctly. “HackLtd. Com” was able to assist me and had both issues resolved in no time. Overall, a fantastic experience with a fantastic professional !!!


I had purchased the wrong Microsoft…I had purchased the wrong Microsoft word for my PC. I reached out to customer support via chat and Joe was very helpful and made my experience very stress free. He advised which Microsoft would b

## Let's now bring this together

Note that the lengths *have* to be the same, otherwise your data don't make sense (you see why, right?)

Putting everything in a dataframe is not really needed; you could also do something like `for ra, re, nr in zip(ratings, reviews, nr_reviews_by_author):` to process the data directly, write them to a file, or similar. I just use pandas here because it looks nice and is easy. In a real web scraper, I'd probably throw pandas out at some point, as it is an unnecessary dependency and does not add any functionality that we need.
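
The `zip` alternative could look roughly like this, using only the standard library. It is a sketch, not the one right way: `write_reviews`, the column names, and the output file name are made up, and the sample values merely stand in for the scraped lists.

```python
import csv


def write_reviews(path, ratings, reviews, nr_reviews_by_author):
    """Write one CSV row per review, pairing the three lists by position."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["rating", "review", "experience"])
        for ra, rev, nr in zip(ratings, reviews, nr_reviews_by_author):
            writer.writerow([ra, rev, nr])


# made-up sample values standing in for the scraped lists
write_reviews(
    "reviews_sample.csv",
    ["Rated 5 out of 5 stars", "Rated 4 out of 5 stars"],
    ["Good service always.", "Quick help."],
    ["1", "2"],
)
```

Keep in mind that `zip` silently stops at the shortest list, so checking the lengths beforehand still matters.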

In [5]:
ratings = [e.attrib['alt'] for e in tree.xpath("//section/div/div/img")]
reviews = [e.text_content() for e in tree.xpath('//*[@data-review-content="true"]')]
nr_reviews_by_author = [e.text for e in tree.xpath('//span[@data-consumer-reviews-count-typography="true"]')]
# the three lists are matched by position, so they must be equally long
assert len(ratings) == len(reviews) == len(nr_reviews_by_author)

In [6]:
pd.DataFrame({'ratings':ratings, 'reviews':reviews, 'experience':nr_reviews_by_author})

| | ratings | reviews | experience |
|---|---|---|---|
| 0 | Rated 5 out of 5 stars | Joe was so helpfulJoe was so helpful today. Ve... | 1 |
| 1 | Rated 5 out of 5 stars | CSR Carl was a great help. Took care of my req... | 1 |
| 2 | Rated 5 out of 5 stars | Good service always. | 1 |
| 3 | Rated 5 out of 5 stars | I had an issue with downloading…I had an issue... | 1 |
| 4 | Rated 4 out of 5 stars | AssistanceMy outlook calendar and contacts wer... | 1 |
| 5 | Rated 5 out of 5 stars | I had purchased the wrong Microsoft…I had purc... | 1 |
| 6 | Rated 5 out of 5 stars | AwesomeOk, the customer service representative... | 1 |
| 7 | Rated 5 out of 5 stars | Joe did an excellent job helping me get…Joe di... | 2 |
| 8 | Rated 5 out of 5 stars | Eric took over my computer and did a…Eric took... | 1 |
| 9 | Rated 5 out of 5 stars | Great Technical AssistanceHad a major problem ... | 1 |
