Web Scraping Data from the RateMDs Website

In [53]:
#Import Webscraping libraries
from bs4 import BeautifulSoup
from requests import get

#Import Data structure libraries
import pandas as pd
from pandas import Series, DataFrame

#Import libraries for controlling crawling rate
from time import sleep, time
from random import randint

#Import library for clearing output
from IPython.display import clear_output

Here, the data I want to scrape is the details of doctors whose specialty is 'Family/G.P.s' and who live or practice in 'New York'.

In [3]:
#Enter the url you need to scrape
url='https://www.ratemds.com/best-doctors/ny/new-york/family-gp'

In [4]:
#Use 'get' from the requests library to fetch the HTML of the listing page
response=get(url)
response.content[:1000]

b'\n<!DOCTYPE html>\n<html lang="en" ng-app="RateMds" id="ng-app">\n<head>\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">\n<link rel="apple-touch-icon" sizes="57x57" href="//www.ratemds.com/static/img/favicons/apple-touch-icon-57x57.495c70819b0f.png">\n<link rel="apple-touch-icon" sizes="114x114" href="//www.ratemds.com/static/img/favicons/apple-touch-icon-114x114.abcec11ee43c.png">\n<link rel="apple-touch-icon" sizes="72x72" href="//www.ratemds.com/static/img/favicons/apple-touch-icon-72x72.a21a57e92df8.png">\n<link rel="apple-touch-icon" sizes="144x144" href="//www.ratemds.com/static/img/favicons/apple-touch-icon-144x144.2fcfe644723c.png">\n<link rel="apple-touch-icon" sizes="60x60" href="//www.ratemds.com/static/img/favicons/apple-touch-icon-60x60.5dd1baf2b8ba.png">\n<link rel="apple-touch-icon" sizes="120x120" href="//www.ratemds.com/static/img/favicons/apple-touc

As we can see, the HTML data we extracted is quite messy. Thus, we need to improve the readability of the content.

In [5]:
#Convert the above content into a more readable format
soup=BeautifulSoup(response.content,"lxml")
print(soup.prettify()[:1000]) #prettify() makes the content more readable

<!DOCTYPE html>
<html id="ng-app" lang="en" ng-app="RateMds">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1, user-scalable=no" name="viewport"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-57x57.495c70819b0f.png" rel="apple-touch-icon" sizes="57x57"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-114x114.abcec11ee43c.png" rel="apple-touch-icon" sizes="114x114"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-72x72.a21a57e92df8.png" rel="apple-touch-icon" sizes="72x72"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-144x144.2fcfe644723c.png" rel="apple-touch-icon" sizes="144x144"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-60x60.5dd1baf2b8ba.png" rel="apple-touch-icon" sizes="60x60"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-120x120.e5b8c908f646

Now, we need to locate the part of the page's HTML that we want to scrape, i.e. the doctor details.
To do that, we identify the corresponding tag in the HTML and call the find() function on soup to get its contents.

In [6]:
#Choose the section of interest
summary=soup.find('div',{'id':'doctor-list'})
print(summary.prettify()[:1000])

<div id="doctor-list">
 <div data-react-checksum="817905626" data-reactid=".1fy2aoe824i">
  <h2 class="subtitle heading-withsub" data-reactid=".1fy2aoe824i.0">
   <span data-reactid=".1fy2aoe824i.0.0">
    The
   </span>
   <span data-reactid=".1fy2aoe824i.0.1">
    Best Family Doctors / G.P.s in The World
   </span>
  </h2>
  <a class="heading-sub" data-reactid=".1fy2aoe824i.1" href="/specialties/family-gp/">
   <span data-reactid=".1fy2aoe824i.1.0">
    What is a
   </span>
   <span data-reactid=".1fy2aoe824i.1.1">
    Family Doctor / G.P.
   </span>
   <span data-reactid=".1fy2aoe824i.1.2">
    ?
   </span>
  </a>
  <div class="search-item doctor-profile" data-reactid=".1fy2aoe824i.3:$2139668">
   <img alt="Dr. Anna Becker" class="search-item-image" data-reactid=".1fy2aoe824i.3:$2139668.0" height="100" src="//www.ratemds.com/static/img/doctors/doctor-female.png_thumbs/v1_at_100x100.jpg" width="100"/>
   <h2 class="search-item-doctor-name" data-reactid=".1fy2aoe824i.3:$2139668.1">
  

It's time to pull the details for every doctor on the page. We aim to scrape the following fields:
1. Name of the Doctor
2. Specialty
3. Rating
4. No. of Reviews

1. Name of the Doctor

We need to find the tag which corresponds to the Doctor Name. In this case it's an 'a' tag with class='search-item-doctor-link'.

In [7]:
#Pull the Doctor Names HTML in the form of list
doctor_name=summary.findAll('a',{'class':'search-item-doctor-link'})
doctor_name[:5]

[<a class="search-item-doctor-link" data-reactid=".1fy2aoe824i.3:$2139668.1.0" href="/doctor-ratings/dr-anna-becker-greensboro-nc-us">Dr. Anna Becker</a>,
 <a class="search-item-doctor-link" data-reactid=".1fy2aoe824i.3:$2075115.1.0" href="/doctor-ratings/19756/Dr-Frances+P.-Wong-Jamestown-NC.html">Dr. Frances P. Wong</a>,
 <a class="search-item-doctor-link" data-reactid=".1fy2aoe824i.3:$432370.1.0" href="/doctor-ratings/3554647/Dr-Lewis-Mitchell-Greensboro-NC.html">Dr. Lewis Mitchell</a>,
 <a class="search-item-doctor-link" data-reactid=".1fy2aoe824i.3:$1756511.1.0" href="/doctor-ratings/772997/Dr-Kevin-Little-Greensboro-NC.html">Dr. Kevin Little</a>,
 <a class="search-item-doctor-link" data-reactid=".1fy2aoe824i.3:$1951898.1.0" href="/doctor-ratings/171094/Dr-Mitch-Freeman-YUMA-AZ.html">Dr. Mitch Freeman</a>]

The result above is a list of the Doctor Name tags as HTML. Now, we need to extract only the text of each name and add it to a new list.

In [8]:
#Extract the Doctor Names from the list
dr_name=[]
for name in doctor_name:
    dr_name.append(name.text)
dr_name[:5]

['Dr. Anna Becker',
 'Dr. Frances P. Wong',
 'Dr. Lewis Mitchell',
 'Dr. Kevin Little',
 'Dr. Mitch Freeman']
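
The same extraction can also be written as a list comprehension. The sketch below is only an illustration: it runs on a small hand-written HTML fragment (an assumption standing in for the live page, with placeholder href values) and uses Python's built-in "html.parser" instead of "lxml" so that it has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the structure of the listing page.
# The href values are placeholders, not real RateMDs links.
sample_html = """
<div id="doctor-list">
  <a class="search-item-doctor-link" href="/doctor-ratings/a"> Dr. Anna Becker </a>
  <a class="search-item-doctor-link" href="/doctor-ratings/b">Dr. Kevin Little</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
links = sample_soup.find_all("a", {"class": "search-item-doctor-link"})

# get_text(strip=True) returns the tag's text with surrounding whitespace removed
sample_names = [link.get_text(strip=True) for link in links]
print(sample_names)
```

Stripping the whitespace here saves a clean-up step later if the site pads the names with spaces or newlines.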

2. Specialty

We need to find the tag which holds the Doctor's Specialty, just as we did for the Name of the Doctor.

In [9]:
#Pull the Doctor Specialty HTML in the form of list
doctor_specialty=summary.findAll('div',{'class':'search-item-specialty'})
doctor_specialty[:5]

[<div class="search-item-specialty" data-reactid=".1fy2aoe824i.3:$2139668.3"><a data-reactid=".1fy2aoe824i.3:$2139668.3.0" href="/best-doctors/?specialty=family-gp">Family Doctor / G.P.</a></div>,
 <div class="search-item-specialty" data-reactid=".1fy2aoe824i.3:$2075115.3"><a data-reactid=".1fy2aoe824i.3:$2075115.3.0" href="/best-doctors/?specialty=family-gp">Family Doctor / G.P.</a></div>,
 <div class="search-item-specialty" data-reactid=".1fy2aoe824i.3:$432370.3"><a data-reactid=".1fy2aoe824i.3:$432370.3.0" href="/best-doctors/?specialty=family-gp">Family Doctor / G.P.</a></div>,
 <div class="search-item-specialty" data-reactid=".1fy2aoe824i.3:$1756511.3"><a data-reactid=".1fy2aoe824i.3:$1756511.3.0" href="/best-doctors/?specialty=family-gp">Family Doctor / G.P.</a></div>,
 <div class="search-item-specialty" data-reactid=".1fy2aoe824i.3:$1951898.3"><a data-reactid=".1fy2aoe824i.3:$1951898.3.0" href="/best-doctors/?specialty=family-gp">Family Doctor / G.P.</a></div>]

Now, we'll extract the Doctor Specialty from the above result. Every value will be the same, because we are currently scraping only one Specialty.

In [18]:
#Extract the Doctor Specialty from the list
dr_specialty=[]
for specialty in doctor_specialty:
    dr_specialty.append(specialty.text)
dr_specialty[:5]

['Family Doctor / G.P.',
 'Family Doctor / G.P.',
 'Family Doctor / G.P.',
 'Family Doctor / G.P.',
 'Family Doctor / G.P.']

3. Star Rating

We'll find the tag for Star Rating the same way we did for the previous two fields.

In [23]:
#Extracting Star Rating
doctor_star_rating=summary.findAll('span',{'class':'star-rating'})
doctor_star_rating

[<span class="star-rating" data-reactid=".1fy2aoe824i.3:$2139668.4.0.0" title="4.96"><span class="stars" data-reactid=".1fy2aoe824i.3:$2139668.4.0.0.0"><span class="star selected" data-reactid=".1fy2aoe824i.3:$2139668.4.0.0.0.$5"></span><span class="star selected" data-reactid=".1fy2aoe824i.3:$2139668.4.0.0.0.$4"></span><span class="star selected" data-reactid=".1fy2aoe824i.3:$2139668.4.0.0.0.$3"></span><span class="star selected" data-reactid=".1fy2aoe824i.3:$2139668.4.0.0.0.$2"></span><span class="star selected" data-reactid=".1fy2aoe824i.3:$2139668.4.0.0.0.$1"></span></span></span>,
 <span class="star-rating" data-reactid=".1fy2aoe824i.3:$2075115.4.0.0" title="4.99"><span class="stars" data-reactid=".1fy2aoe824i.3:$2075115.4.0.0.0"><span class="star selected" data-reactid=".1fy2aoe824i.3:$2075115.4.0.0.0.$5"></span><span class="star selected" data-reactid=".1fy2aoe824i.3:$2075115.4.0.0.0.$4"></span><span class="star selected" data-reactid=".1fy2aoe824i.3:$2075115.4.0.0.0.$3"></span>

Here the star rating is not available as the text of a tag. Instead, it's stored in one of the tag's attributes, so we need to extract the value from that attribute. Beautiful Soup lets us treat a tag like a dictionary in which each attribute is a key, so here we read the value of the 'title' attribute.

In [26]:
dr_star_rating=[]
for star_tag in doctor_star_rating:
    dr_star_rating.append(star_tag['title'])
dr_star_rating

['4.96',
 '4.99',
 '4.98',
 '4.95',
 '4.91',
 '4.95',
 '4.94',
 '4.94',
 '4.94',
 '4.94']
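
The dictionary-style access can be tried out on a small hand-written fragment (an assumption, not the live page). The sketch also shows tag.get('title'), which returns None when the attribute is missing, whereas tag['title'] would raise a KeyError. Converting to float is an optional extra step that is not part of the cells above.

```python
from bs4 import BeautifulSoup

# Hand-written fragment: the middle tag deliberately has no title attribute
sample_html = """
<span class="star-rating" title="4.96"></span>
<span class="star-rating"></span>
<span class="star-rating" title="4.91"></span>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_ratings = []
for tag in sample_soup.find_all("span", {"class": "star-rating"}):
    title = tag.get("title")  # None if the attribute is absent
    sample_ratings.append(float(title) if title is not None else None)

print(sample_ratings)
```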

4. No. of Reviews

Finally, we'll extract our last required column, i.e. the number of reviews for each doctor.

In [14]:
#extract Doctor_reviews
doctor_reviews=summary.findAll('div',{'class':'star-rating-count'})
doctor_reviews

[<div class="star-rating-count" data-reactid=".i9tnaeafxq.3:$2139668.4.0.1"><span data-reactid=".i9tnaeafxq.3:$2139668.4.0.1.0">151</span><span data-reactid=".i9tnaeafxq.3:$2139668.4.0.1.1"> reviews</span></div>,
 <div class="star-rating-count" data-reactid=".i9tnaeafxq.3:$2075115.4.0.1"><span data-reactid=".i9tnaeafxq.3:$2075115.4.0.1.0">82</span><span data-reactid=".i9tnaeafxq.3:$2075115.4.0.1.1"> reviews</span></div>,
 <div class="star-rating-count" data-reactid=".i9tnaeafxq.3:$432370.4.0.1"><span data-reactid=".i9tnaeafxq.3:$432370.4.0.1.0">91</span><span data-reactid=".i9tnaeafxq.3:$432370.4.0.1.1"> reviews</span></div>,
 <div class="star-rating-count" data-reactid=".i9tnaeafxq.3:$1756511.4.0.1"><span data-reactid=".i9tnaeafxq.3:$1756511.4.0.1.0">131</span><span data-reactid=".i9tnaeafxq.3:$1756511.4.0.1.1"> reviews</span></div>,
 <div class="star-rating-count" data-reactid=".i9tnaeafxq.3:$1951898.4.0.1"><span data-reactid=".i9tnaeafxq.3:$1951898.4.0.1.0">331</span><span data-reac

Here, we can see that the text in the tag has the form 'count reviews', e.g. '151 reviews'. We only need the number and not the word 'reviews', so we'll keep only the part of the text before the first space.

In [17]:
dr_reviews=[]
for rev in doctor_reviews:
    dr_reviews.append((rev.text)[0:rev.text.find(" ")])
dr_reviews

['151', '82', '91', '131', '331', '130', '147', '146', '152', '141']
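
One caveat with slicing up to text.find(" "): str.find returns -1 when there is no space, so the slice would silently drop the last character. A more defensive variant, sketched here with a hypothetical helper named parse_review_count, uses a regular expression and returns an integer. The comma handling is an assumption; no such value appears in the output above.

```python
import re

def parse_review_count(text):
    """Return the leading integer in strings such as '151 reviews', or None."""
    match = re.match(r"\s*(\d[\d,]*)", text)
    if match is None:
        return None
    # Drop thousands separators (assumed format) before converting
    return int(match.group(1).replace(",", ""))

print(parse_review_count("151 reviews"))
print(parse_review_count("No reviews"))
```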

Now, we will combine all these lists into a single dataframe. There are many ways to do this. Here, I converted the lists into Series first and then joined them into a dataframe. Another way would be to build the dataframe directly from a dictionary of lists.

In [18]:
#Create Series of all the lists
dr_name=Series(dr_name)
dr_specialty=Series(dr_specialty)
dr_star_rating=Series(dr_star_rating)
dr_reviews=Series(dr_reviews)

In [19]:
#Create a Dataframe
family_df=pd.concat([dr_name,dr_specialty,dr_star_rating,dr_reviews],axis=1)
#Name the columns
family_df.columns=['Doctor Name','Specialty','Rating','No. of Reviews']
family_df

    Doctor Name              Specialty             Rating  No. of Reviews
0   Dr. Anna Becker          Family Doctor / G.P.  4.96    151
1   Dr. Frances P. Wong      Family Doctor / G.P.  4.99    82
2   Dr. Lewis Mitchell       Family Doctor / G.P.  4.98    91
3   Dr. Kevin Little         Family Doctor / G.P.  4.95    131
4   Dr. Mitch Freeman        Family Doctor / G.P.  4.91    331
5   Dr. Kimberlee Shaw       Family Doctor / G.P.  4.95    130
6   Dr. Elizabeth S. Barnes  Family Doctor / G.P.  4.94    147
7   Dr. William R. Harris    Family Doctor / G.P.  4.94    146
8   Dr. Sharon A. Wolters    Family Doctor / G.P.  4.94    152
9   Dr. Cynthia White        Family Doctor / G.P.  4.94    141
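
For comparison, here is a minimal sketch of the dictionary approach, using the first three rows of the output above as sample data. Unlike pd.concat on Series, which pads shorter columns with NaN, the DataFrame constructor raises a ValueError when the lists have different lengths, which can be a useful early warning that a field was missed for some doctor.

```python
import pandas as pd

# Sample values copied from the first three rows of the scraped output
names = ["Dr. Anna Becker", "Dr. Frances P. Wong", "Dr. Lewis Mitchell"]
specialties = ["Family Doctor / G.P."] * 3
ratings = ["4.96", "4.99", "4.98"]
reviews = ["151", "82", "91"]

# Each dictionary key becomes a column name, each list a column
sample_df = pd.DataFrame({
    "Doctor Name": names,
    "Specialty": specialties,
    "Rating": ratings,
    "No. of Reviews": reviews,
})
print(sample_df)
```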


Webscraping multiple pages

To scrape multiple pages from a website, we'll build upon whatever we did above for a single page.
First, we need to understand the structure of the URL, i.e. exactly where the page value changes. The code below uses the specialty listing URL, where the page number is passed as a query parameter, e.g. https://www.ratemds.com/best-doctors/?specialty=family-gp&page=2. So we only need to change the value of the page parameter at the end of the URL to extract data from each page. For that, we create a list of the page numbers we want to scrape.

In [54]:
url='https://www.ratemds.com/best-doctors/?specialty=family-gp&page='
#Set pages to be scraped
pages=[str(i) for i in range(1,5)]

Second, we need to create variables that would help us to control and monitor the loop rate. This is done to make sure that we are not bombarding the server with our requests.

In [55]:
#Monitoring the loop
start_time=time()
request=0
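
The pacing logic can also be wrapped in two small helpers, sketched below. The names polite_pause and request_rate are hypothetical, and the defaults of 8 and 15 seconds mirror the pause used in the loop further down. time.monotonic() is used instead of time() because it is not affected by system clock adjustments. The sleeper and now parameters exist only so the helpers can be tried without actually waiting.

```python
from random import uniform
from time import monotonic, sleep

def polite_pause(min_seconds=8, max_seconds=15, sleeper=sleep):
    """Sleep for a random interval and return the delay that was chosen."""
    delay = uniform(min_seconds, max_seconds)
    sleeper(delay)
    return delay

def request_rate(request_count, started_at, now=None):
    """Average requests per second since started_at (a monotonic timestamp)."""
    now = monotonic() if now is None else now
    elapsed = now - started_at
    return request_count / elapsed if elapsed > 0 else 0.0
```

Inside a scraping loop, one would call polite_pause() after each request and print request_rate(request, start_time), where start_time was taken with monotonic().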

Next, we create empty lists to store the details of the doctors as they are scraped from the website.

In [60]:
#Create empty lists for respective values to be extracted
dr_name_all=[]
dr_reviews_all=[]
dr_specialty_all=[]
dr_star_rating_all=[]

The loop below combines the data from all the pages.

We'll pause the loop for a random period of 8 to 15 seconds after every page so that we don't flood the server with requests. Neglecting this might get our IP address banned. Everything else is the same as what we did to extract data from a single page.

In [61]:
for page in pages:
    
    #Make a get request
    response=get(url+ page)
    
    #Pause the loop
    sleep(randint(8,15))
    
    #Monitor the requests
    request+=1
    elapsed_time=time()-start_time
    print('Requests: {}; Frequency: {} requests/sec'.format(request, request/elapsed_time))
#     clear_output(wait=True)
    
    #Parse the response content into a BeautifulSoup object
    soup=BeautifulSoup(response.content,"lxml")
    
    #Select the container having all the doctor details
    summary=soup.find('div',{'id':'doctor-list'})
    
    #Extract doctor names
    doctor_name=summary.findAll('a',{'class':'search-item-doctor-link'})
    for name in doctor_name:
        dr_name_all.append(name.text)
    
    #Extract doctor specialty
    doctor_specialty=summary.findAll('div',{'class':'search-item-specialty'})
    for specialty in doctor_specialty:
        dr_specialty_all.append(specialty.text)
    
    #Extracting star rating
    doctor_star_rating=summary.findAll('span',{'class':'star-rating'})
    for star_tag in doctor_star_rating:
        dr_star_rating_all.append(star_tag['title'])
    
    #Extracting no. of reviews
    doctor_reviews=summary.findAll('div',{'class':'star-rating-count'})
    for reviews in doctor_reviews:
        dr_reviews_all.append((reviews.text)[0:(reviews.text).find(" ")])

Requests: 5; Frequency: 0.012877130601245134 requests/sec
Requests: 6; Frequency: 0.01492068350543763 requests/sec
Requests: 7; Frequency: 0.016905420779806418 requests/sec
Requests: 8; Frequency: 0.018812133237574673 requests/sec
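
One thing to watch with four independent lists is alignment: if a doctor on some page has no rating or review count, the lists end up with different lengths and the rows no longer line up. A possible alternative, sketched here under assumptions, is to walk over each doctor card (the 'search-item doctor-profile' div seen in the HTML earlier) and build one record per doctor. The fragment is hand-written, the second doctor is a made-up placeholder, and parse_cards is a hypothetical helper. find_all is the newer spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; "Dr. Example Unrated" is a made-up placeholder
sample_html = """
<div id="doctor-list">
  <div class="search-item doctor-profile">
    <a class="search-item-doctor-link">Dr. Anna Becker</a>
    <div class="search-item-specialty">Family Doctor / G.P.</div>
    <span class="star-rating" title="4.96"></span>
    <div class="star-rating-count"><span>151</span><span> reviews</span></div>
  </div>
  <div class="search-item doctor-profile">
    <a class="search-item-doctor-link">Dr. Example Unrated</a>
    <div class="search-item-specialty">Family Doctor / G.P.</div>
  </div>
</div>
"""

def parse_cards(html):
    """Return one dict per doctor card; missing fields become None."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": "doctor-list"})
    if container is None:  # find() returns None when nothing matches
        return []
    rows = []
    for card in container.find_all("div", class_="doctor-profile"):
        name = card.find("a", class_="search-item-doctor-link")
        specialty = card.find("div", class_="search-item-specialty")
        rating = card.find("span", class_="star-rating")
        count = card.find("div", class_="star-rating-count")
        count_parts = count.get_text().split() if count is not None else []
        rows.append({
            "Doctor Name": name.get_text(strip=True) if name is not None else None,
            "Specialty": specialty.get_text(strip=True) if specialty is not None else None,
            "Rating": rating.get("title") if rating is not None else None,
            "No. of Reviews": count_parts[0] if count_parts else None,
        })
    return rows

sample_rows = parse_cards(sample_html)
print(sample_rows)
```

A list of such dictionaries can be passed straight to pd.DataFrame, which keeps every doctor's fields on the same row.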


In [68]:
dr_reviews_all[:5]

['151', '82', '91', '131', '331']

In [72]:
#Make series of all lists
dr_name=Series(dr_name_all)
dr_reviews=Series(dr_reviews_all)
dr_specialty=Series(dr_specialty_all)
dr_star_rating=Series(dr_star_rating_all)

In [78]:
#Create a dataframe and save it to a .csv file
family_df=pd.concat([dr_name, dr_specialty, dr_star_rating, dr_reviews],axis=1,ignore_index=True)
family_df.columns=['Doctor Name','Specialty', 'Rating','No. of Reviews']
family_df.to_csv('RateMDs-Family-G.Ps.csv')
family_df.head()

    Doctor Name              Specialty             Rating  No. of Reviews
0   Dr. Anna Becker          Family Doctor / G.P.  4.96    151
1   Dr. Frances P. Wong      Family Doctor / G.P.  4.99    82
2   Dr. Lewis Mitchell       Family Doctor / G.P.  4.98    91
3   Dr. Kevin Little         Family Doctor / G.P.  4.95    131
4   Dr. Mitch Freeman        Family Doctor / G.P.  4.91    331
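
Two optional follow-ups, sketched on a two-row sample taken from the output above. The scraped ratings and review counts are strings, so pd.to_numeric converts them before any sorting or averaging. Also, to_csv writes the row index as an extra unnamed first column by default; passing index=False leaves it out. The sketch writes to an in-memory buffer rather than a file, purely so that it has no side effects.

```python
import io

import pandas as pd

sample_df = pd.DataFrame({
    "Doctor Name": ["Dr. Anna Becker", "Dr. Frances P. Wong"],
    "Rating": ["4.96", "4.99"],
    "No. of Reviews": ["151", "82"],
})

# errors="coerce" turns any unparseable string into NaN instead of raising
sample_df["Rating"] = pd.to_numeric(sample_df["Rating"], errors="coerce")
sample_df["No. of Reviews"] = pd.to_numeric(sample_df["No. of Reviews"], errors="coerce")

# index=False omits the row index from the CSV output
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```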
