# Web Scraping for People Analytics

In this notebook, we demonstrate how typical web scraping can be done with the BeautifulSoup package. Web scraping is a technique we use to extract specific pieces of information and/or data from websites.

We will scrape content from Blind, an anonymous employee review website.

## Step 1: Import

For this task, we will need two libraries:

- `requests` - the de facto package for making HTTP requests in Python
- `BeautifulSoup` - a package for parsing HTML and extracting information from its tags

We also import `pandas` and `numpy`, which will help us create the DataFrame later.

In [1]:
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

## Step 2: Connect to Website

Before we can scrape a website, we need to send it a request. The server then gathers the necessary information and sends it back to us as a response.

We first define the URL that we want to make a request to, in this case http://teamblind.com/company/Grab/posts. We then use the `requests` module to make a `GET` request to the website. `GET` is the most common kind of request you can make to a website: it simply asks the server to send information back to you. Another common kind is `POST`, where you send information to a server, typically through web forms such as Google Forms.
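To make the difference between the two request types concrete, the sketch below builds one `GET` and one `POST` request with `requests.Request(...).prepare()`, which constructs a request without sending anything over the network. The `page` query parameter, the `https://example.com/form` URL, and the `comment` form field are hypothetical and chosen only for illustration.

```python
import requests

# Build (but do not send) a GET request; `page` is a hypothetical query parameter.
get_request = requests.Request(
    "GET",
    "http://teamblind.com/company/Grab/posts",
    params={"page": 2},
).prepare()

# Build (but do not send) a POST request to a placeholder URL
# with a hypothetical form field.
post_request = requests.Request(
    "POST",
    "https://example.com/form",
    data={"comment": "hello"},
).prepare()

print(get_request.method, get_request.url)     # parameters travel in the URL
print(post_request.method, post_request.body)  # form data travels in the body
```

In a `GET` request, any parameters are appended to the URL, whereas a `POST` request carries its data in the request body.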



In [2]:
url = 'http://teamblind.com/company/Grab/posts'

In [3]:
response = requests.get(url)

Now that we have asked the website for its content, how do we check that the page loaded successfully?

This is where we use **HTTP response codes**. Response codes are three-digit codes that tell us whether the page loaded successfully and, if not, what caused the error.

Common HTTP response codes are:
- 200: OK
- 301: Redirect
- 403: Forbidden (means that you are not authorised to access the content)
- 404: Not found (means that the page you requested could not be found or does not exist)
- 500: Server error
- 502: Bad gateway

The response code we want to get is **200 OK**. The `requests` package exposes the response code through the `status_code` attribute of the response object.

Read more about HTTP response codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
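As a small illustration of how the codes above are organised, the sketch below uses Python's standard `http.HTTPStatus` enum to look up the official phrase for a code and groups codes by their first digit. The helper name `describe_status` is our own and not part of any library.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a label such as '404 Not Found (client error)'."""
    families = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    family = families.get(code // 100, "informational")
    return f"{code} {phrase} ({family})"

for code in (200, 301, 403, 404, 500, 502):
    print(describe_status(code))
```

The first digit is usually enough to tell whether the problem lies with our request (4xx) or with the server (5xx).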

So before we get the content of the page, we want to ensure that we are getting a response code of **200**, like below:

In [4]:
# Only read the body when the request succeeded
if response.status_code == 200:
    page_content = response.content

page_content

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n    <meta charset="utf-8">\n    <meta data-hid="viewport" name="viewport" content="user-scalable=no, initial-scale=1, width=device-width, minimal-ui, maximum-scale=1" />\n\n    <meta http-equiv="X-UA-Compatible" content="ie=edge">\n\n    <title>Your Anonymous Workplace Community - Blind</title>\n    <link data-n-head="true" rel="icon" type="image/x-icon" href="/favicon.ico" />\n\n    <link rel="stylesheet" type="text/css" href="https://s3-us-west-2.amazonaws.com/www.teamblind.com/error-pages/static-blockpage-last/css/etc.min.css\n    ">\n\n</head>\n<body>\n\n<div class="flex_layout">\n    <!-- fnc_ux -->\n    <header>\n        <div class="wrap">\n            <h2 class="h_blind"><!--add_topic -->\n                <a href="#"><i class="blind">blind</i></a>\n            </h2>\n        </div>\n    </header>\n    <div id="container" role="main">\n        <div class="max_wrap">\n\n            <div class="error">\n                <div class="msg_ar

But what happened here?

The HTML we got back is not the page we asked for: its stylesheet path mentions `error-pages/static-blockpage-last` and the body contains a `<div class="error">`. Apparently the website detected that we are a bot attempting to scrape it and blocked us. That is not what we want. We want our request to look as if it comes from a normal human visitor.

To overcome this, we add what is called a **User Agent** to the request. A user agent is like a personal identity. Just as you have a name, an identification number (in Singapore we call it the NRIC), a date of birth, and so on, your user agent string identifies the browser you use and the operating system your machine runs. User agents help websites understand what kind of visitors they are receiving. They also help websites filter out undesirable visitors, such as bots, that may try to act against the website. Find out more about user agents here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent.
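To see what we were sending so far, the sketch below prints the user agent that `requests` uses when we do not set one. It relies on `requests.utils.default_user_agent()` and on the default headers of a new `Session`. That this default is what gave us away to the website is a plausible explanation, not something we have confirmed.

```python
import requests

# The User-Agent that `requests` sends when we do not set one ourselves.
default_agent = requests.utils.default_user_agent()
print(default_agent)  # something like "python-requests/<version>"

# The same value appears in the default headers of a new session.
session = requests.Session()
print(session.headers["User-Agent"])
```

A value starting with `python-requests/` clearly does not look like a browser, which makes it easy for a website to tell scripts and human visitors apart.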

We will now try to access Blind again, this time with a specific user agent. Any common browser user agent should be accepted. We define it as part of the headers sent along with the request.

In [5]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

In [6]:
response = requests.get(url, headers=headers)

if response.status_code == 200:
    page_content = response.content

page_content[6000:7000]

b'p">View in App</a> <button class="btn_clse">close</button></div> <!----> <div id="wrap" class="navux"><!----> <div class="fnc_ux"><div class="wrap"><div class="rts_ad"><a href="https://www.rooftopslushie.com" target="_blank" onclick="ga(\'send\', \'event\', \'Menu\', \'Rooftop-slushie\', \'Top-menu\');"><span class="ico ico_rts_logo"><i class="blind">Rooftop Slushie BI</i></span> <span class="arrow">Employee Referrals to Top Tech Companies</span></a></div></div></div> <header class="add_search"><div class="wrap"><div class="user_part"><h2 class="h_blind"><button type="button" class="btn_menu"><span class="ico_hamburger"></span></button> <a href="/" onclick="ga(\'send\', \'event\', \'Menu\', \'Home\', \'Top-menu-BI\');" title="blind"><i class="blind">blind</i></a></h2> <div class="search keyword"><div class="srch"><a href="javascript:window.history.back()" class="btn_prv"><i class="blind">back</i></a> <div class="input_wrap"><input type="search" id="keyword" name="keyword" placeholder

We can see that, after adding the headers with the user agent, the website now accepts our request and we can get the contents of the page. We have truncated the string for brevity.
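If we plan to request several pages, it may be convenient to set the headers once and reuse them. The sketch below is one possible way to do that with `requests.Session`. The helper names `make_session` and `fetch_html` and the 10-second timeout are our own assumptions, not something the website requires.

```python
import requests

BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
    )
}

def make_session(headers=None):
    """Return a Session that sends the given headers with every request."""
    session = requests.Session()
    session.headers.update(headers or BROWSER_HEADERS)
    return session

def fetch_html(session, url, timeout=10):
    """Fetch `url` and return the raw body, raising on 4xx/5xx responses."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on error codes
    return response.content

scraper_session = make_session()
print(scraper_session.headers["User-Agent"])

# Uncomment to perform the actual request:
# page_content = fetch_html(scraper_session, "http://teamblind.com/company/Grab/posts")
```

Using `raise_for_status()` turns error codes into exceptions, so a failed request cannot silently leave us with stale or missing content.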

## Step 3: Parse the Content through BeautifulSoup

Here is where we really start to talk about `BeautifulSoup`. `BeautifulSoup` is a Python package designed to parse content from websites. It detects the HTML tags and converts them into a tree structure.

In [7]:
soup = BeautifulSoup(page_content, 'html.parser')
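To illustrate what "tree structure" means, the sketch below parses a tiny hand-written HTML snippet (not taken from the website) and walks from a parent tag to its children and back. It uses separate variable names so that it does not overwrite `soup`.

```python
from bs4 import BeautifulSoup, Tag

# A tiny hand-written document, used only to illustrate the tree.
demo_html = (
    "<html><body>"
    "<div class='post'><strong>Title</strong> <span>Body text</span></div>"
    "</body></html>"
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Walk down the tree: <html> -> <body> -> <div>
post = demo_soup.body.div
print(post.name)

# List the child tags of the <div>, skipping plain text nodes.
child_tags = [child.name for child in post.children if isinstance(child, Tag)]
print(child_tags)

# Every tag also knows its parent, so we can walk back up.
print(post.strong.parent is post)
```

Each tag is a node that knows its children and its parent, which is what lets us search inside a specific part of the page later on.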

If you try to view the content as is, you might find yourself lost in a wall of text. To make the source a little easier to read, let's use the `prettify()` method, which indents the markup so that we can clearly see how the page is structured.

In [8]:
print(soup.prettify())

<!DOCTYPE html>
<html data-n-head="%7B%22lang%22:%7B%22ssr%22:%22en%22%7D%7D" data-n-head-ssr="" lang="en">
 <head>
  <title>
   Grab Discussions I Interview, Salaries, and More - Blind
  </title>
  <meta charset="utf-8" data-n-head="ssr"/>
  <meta content="width=device-width, initial-scale=1, maximum-scale=1, user-scalable=0" data-n-head="ssr" name="viewport"/>
  <meta content="no" data-n-head="ssr" name="apple-mobile-web-app-capable"/>
  <meta content="black" data-n-head="ssr" name="apple-mobile-web-app-status-bar-style"/>
  <meta content="telephone=no" data-n-head="ssr" name="format-detection"/>
  <meta content="https://d2u3dcdbebyaiu.cloudfront.net/img/homepage/us/blind_share.png" data-hid="image" data-n-head="ssr" name="image"/>
  <meta content="874414332573445" data-n-head="ssr" property="fb:app_id"/>
  <meta content="Blind" data-n-head="ssr" property="og:site_name"/>
  <meta content="https://d2u3dcdbebyaiu.cloudfront.net/img/home_us/blind_share.png" data-hid="og:image" data-n-he

You may also take advantage of your browser's developer tools (or devtools for short) to see the HTML clearly. In Google Chrome, you can right-click on a piece of text or a portion of the page and choose 'Inspect'.

## Step 4: Extract the Content using BeautifulSoup

Now comes the real work of extracting the content.

`BeautifulSoup` comes with many methods to help us locate the specific tags we want. On the Blind website, we want to search for reviews, which are often located in `<div class='h_tit'>...</div>` elements. So we ask `BeautifulSoup` to find all `<div>` tags with a class of `h_tit`.

In [9]:
reviews = soup.find_all('div', class_='h_tit')
reviews

[<div class="h_tit"><strong><!-- -->
                     grab india
                 </strong> <span>Is grab india good to join? please share thoughts on it.  position: senior software engineer yoe: 5  please guys provide your opinions.</span></div>,
 <div class="h_tit"><strong><!-- -->
                     Layoff Grab
                 </strong> <span>Is anyone got fired recently due to Covid #severance #layoff</span></div>,
 <div class="h_tit"><strong><!-- -->
                     Grab Hiring Freeze
                 </strong> <span>Hi all,  I'm really keen to work for Grab (Singapore) but I've heard they're on hiring freeze.  Does anyone know if they're planning to resume hiring anytime soon? Since Singapore seems to be easing the measures.   #hiring #grab @Grab #singapore  #recruiting</span></div>,
 <div class="h_tit"><strong><!-- -->
                     Grab recruiter not responding after my onsite
                 </strong> <span>I recently had my virtual onsite with Grab, Singap

Let's extract the first item from the list of reviews to check that we have done it correctly:

In [10]:
reviews[0].find('strong').get_text().strip()

'grab india'

In [11]:
reviews[0].find('span').get_text().strip()

'Is grab india good to join? please share thoughts on it.  position: senior software engineer yoe: 5  please guys provide your opinions.'

To explain the functions in detail:

- `find_all()` returns a list of all tags in the document that match the given tag name and attributes
- `find()` returns only the first matching tag inside the element it is called on, or `None` if nothing matches
- `get_text()` is a helper method that extracts the text inside a tag, without the markup
- `strip()` is a Python string method that removes extraneous whitespace before and after the text
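The four functions can be tried in isolation on a small snippet. The HTML below is hand-written to imitate the `h_tit` structure and is not copied from the website.

```python
from bs4 import BeautifulSoup

# A small hand-written snippet that imitates the structure of the page.
sample_html = """
<div class="h_tit"><strong>
    First title
</strong> <span>First post body</span></div>
<div class="h_tit"><strong>Second title</strong> <span>Second post body</span></div>
<div class="other"><strong>Not a post</strong></div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

posts = sample_soup.find_all("div", class_="h_tit")  # all matching <div> tags
first_title = posts[0].find("strong")                # first <strong> in post 0

print(len(posts))                      # 2
print(repr(first_title.get_text()))    # text, still padded with whitespace
print(first_title.get_text().strip())  # First title
print(sample_soup.find("table"))       # None, because there is no <table>
```

Because `find()` returns `None` when nothing matches, calling `get_text()` on its result fails for posts that lack the expected tag, so it can be worth checking for `None` on less regular pages.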

We can expand the reviews further...

In [12]:
for review in reviews:
    print(review.find('strong').get_text().strip())
    print(review.find('span').get_text().strip())

grab india
Is grab india good to join? please share thoughts on it.  position: senior software engineer yoe: 5  please guys provide your opinions.
Layoff Grab
Is anyone got fired recently due to Covid #severance #layoff
Grab Hiring Freeze
Hi all,  I'm really keen to work for Grab (Singapore) but I've heard they're on hiring freeze.  Does anyone know if they're planning to resume hiring anytime soon? Since Singapore seems to be easing the measures.   #hiring #grab @Grab #singapore  #recruiting
Grab recruiter not responding after my onsite
I recently had my virtual onsite with Grab, Singapore in first week of March. I am waiting for the result but they stopped responding to my emails. Had sent around 5-6 emails till now but no response. Same thing they did to me last year in April 2019,  after having my virtual onsites they did not ge
vote
                    does zoominfo has potential to grab lot market
its almost 1/3rd market cap compared to linked-in market cap.  wondering does it ha

In [13]:
names = [t.find('strong').get_text().strip() for t in reviews]
comments = [t.find('span').get_text().strip() for t in reviews]
    
names, comments

(['grab india',
  'Layoff Grab',
  'Grab Hiring Freeze',
  'Grab recruiter not responding after my onsite',
  'vote\n                    does zoominfo has potential to grab lot market',
  'What to expect in EM/HM round for Lead/SSE Grab SGP',
  'Is anyone getting fired for performance reasons during this pandemic ?',
  'Anyone else moving to Tahoe?',
  'TC at Amazon India (Bengaluru) at 3YOE',
  'How easy is it to get tn1 visa as Canadian?',
  'Need referral in Canada, Singapore. Have on-site calls from facebook london(E5) and amazon canada(Sde3)',
  'Looking for referral in Canada, Singapore. Have on-site calls from facebook london(E5) and amazon canada(Sde3)',
  'Singapore tech companies?',
  'team match at facebook',
  'Taxes due not being collected',
  'WFH Equipment',
  'Thank you Blinders!',
  'Layoff Grab',
  'Grab Hiring Freeze',
  'Grab Senior Data Engineer Singapore',
  'grab india',
  'Google vs Grab offer',
  'Grab TC',
  'Epic money grab',
  'vote\n                    Got 
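Some of the titles in the output above contain an embedded newline followed by a long run of spaces (for example the ones starting with `vote`), which `strip()` does not remove because it only trims the ends. A minimal sketch of one way to tidy this up with plain string methods follows; the helper name `normalize_whitespace` is our own.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return " ".join(text.split())

raw_title = "vote\n                    does zoominfo has potential to grab lot market"
print(normalize_whitespace(raw_title))
# vote does zoominfo has potential to grab lot market
```

It could be applied to the whole list with `names = [normalize_whitespace(n) for n in names]`.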

We can also turn the two lists into a pandas DataFrame, which can then be exported to a CSV file for further processing.

In [14]:
review_data = {
    'name': names,
    'comments': comments
}

df = pd.DataFrame(review_data)
df

| | name | comments |
|---|---|---|
| 0 | grab india | Is grab india good to join? please share thoug... |
| 1 | Layoff Grab | Is anyone got fired recently due to Covid #sev... |
| 2 | Grab Hiring Freeze | Hi all, I'm really keen to work for Grab (Sin... |
| 3 | Grab recruiter not responding after my onsite | I recently had my virtual onsite with Grab, Si... |
| 4 | vote\n does zoominfo has po... | its almost 1/3rd market cap compared to linked... |
| 5 | What to expect in EM/HM round for Lead/SSE Gra... | Hi Folks, I have a scheduled EM round (final) ... |
| 6 | Is anyone getting fired for performance reason... | Name and shame them companies I'll grab🍿 |
| 7 | Anyone else moving to Tahoe? | My GF and I got a 6 month lease in Tahoe. We'r... |
| 8 | TC at Amazon India (Bengaluru) at 3YOE | what TC should i be asking for?interviews went... |
| 9 | How easy is it to get tn1 visa as Canadian? | Assuming I have a bachelor's in CS and a job o... |
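The CSV export mentioned above can be sketched as follows. To keep the example self-contained, it uses a two-row stand-in DataFrame (with a shortened first comment) and writes to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Two rows based on the output above, standing in for the full DataFrame.
sample_df = pd.DataFrame({
    "name": ["grab india", "Layoff Grab"],
    "comments": [
        "Is grab india good to join? please share thoughts on it.",
        "Is anyone got fired recently due to Covid #severance #layoff",
    ],
})

# index=False leaves out the 0, 1, 2... row labels.
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the CSV back gives us a DataFrame with the same columns and rows.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.shape)
```

To write an actual file, pass a path instead of the buffer, for example `df.to_csv('reviews.csv', index=False)`, where `reviews.csv` is a hypothetical filename.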


## Potential Use-Cases and Future Work

Web scraping can supply the data for a wide range of analytics tasks. For this example, the most common use case would be to perform sentiment analysis on the reviews, to understand how Grabbers perceive the company.

The example demonstrated above is not perfect. For example, most reviews are truncated on the main page, which means that we would need to switch to a more interactive scraping tool, such as Selenium, to obtain the full reviews.

## Further Reading

- [HTML Basics](https://www.w3schools.com/html/html_basic.asp)
- [Python Requests](https://realpython.com/python-requests/)
- [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [How to scrape websites without getting blocked](https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/)

Or search online for 'web scraping'.