# Scraping details of doctors from a doctor rating website

## Motivation:

After finishing my basics (Statistics, SQL and Python), I was excited to jump into the world of Machine Learning. Reading about cool projects on GitHub every day motivated me to get started as soon as possible. So, like many others, I completed a MOOC on Machine Learning and started searching for a dataset to get my hands dirty. Then, one fine day, I came across this [article](https://www.kdnuggets.com/2018/11/get-hired-as-data-scientist.html) on KDnuggets: **To get hired as a Data Scientist, don't follow the herd**. It changed the way I looked at data science projects, and what attracted me most was its last point:

_"Do things that seem crazy. Everyone goes to the UCI repository, or uses some stock dataset (yawn) to build their project. Don’t do that. Learn how to use a web scraping library, or some under-appreciated API to build your own, custom dataset. Data is hard to come by, and companies often need to rely on their engineers to get it for them. Your goal should be to come across as the kind of data science-obsessed lunatic who will build your own goddamn dataset if that’s what it takes to get the job done."_

We cannot apply the concepts of statistics, databases or machine learning if we don’t have the required data. So I started learning about web scraping, and eventually began working on a research project at my graduate school, where my first task was to scrape data from a doctor rating website in the US. Here is the code I used to accomplish that task.

## Overview:

The code scrapes data for all the doctors listed on a popular doctor rating website, [RateMDs](https://www.ratemds.com). The doctors can be filtered by specialty, city, gender, whether they are verified, and whether they are accepting patients. Apply these filters on the website, then pass the resulting URL to the code as input; the code takes care of the rest.

The details obtained for each doctor include:
1. Name of the Doctor
2. Specialty
3. Star-Rating
4. No. of Reviews

I divided the code into two parts. The motive was to make this code as beginner-friendly as possible. The two parts are as follows:
1. Scraping data from a single page (I have tried to explain exactly how each detail of the doctor is scraped)
2. Scraping data from multiple pages (this involves creating a loop over the pages, and monitoring and controlling the loop rate)

### 1st Part: Scraping data from a single page

Importing the required libraries

In [1]:
#Import Webscraping libraries
from bs4 import BeautifulSoup
from requests import get

#Import Data structure libraries
import pandas as pd
from pandas import Series, DataFrame

#Import libraries for controlling crawling rate
from time import sleep, time
from random import randint

#Import library for clearing output
from IPython.display import clear_output

The data that I want to scrape are the details of doctors whose specialty is 'Family/G.P.s' and the city they live/practice in is 'New York'.

In [2]:
#Enter the url you need to scrape
url='https://www.ratemds.com/best-doctors/ny/new-york/family-gp'

In [3]:
#Use 'get' from the requests library to obtain the HTML of the results page
response=get(url)
response.content[:1000]

b'\n<!DOCTYPE html>\n<html lang="en" ng-app="RateMds" id="ng-app">\n<head>\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no">\n<link rel="apple-touch-icon" sizes="57x57" href="//www.ratemds.com/static/img/favicons/apple-touch-icon-57x57.495c70819b0f.png">\n<link rel="apple-touch-icon" sizes="114x114" href="//www.ratemds.com/static/img/favicons/apple-touch-icon-114x114.abcec11ee43c.png">\n<link rel="apple-touch-icon" sizes="72x72" href="//www.ratemds.com/static/img/favicons/apple-touch-icon-72x72.a21a57e92df8.png">\n<link rel="apple-touch-icon" sizes="144x144" href="//www.ratemds.com/static/img/favicons/apple-touch-icon-144x144.2fcfe644723c.png">\n<link rel="apple-touch-icon" sizes="60x60" href="//www.ratemds.com/static/img/favicons/apple-touch-icon-60x60.5dd1baf2b8ba.png">\n<link rel="apple-touch-icon" sizes="120x120" href="//www.ratemds.com/static/img/favicons/apple-touc

As we can see, the raw HTML we extracted is quite messy. Let's improve the readability of the content using BeautifulSoup.

In [4]:
#Convert the above content into a more readable format
soup=BeautifulSoup((response.content),"lxml")

#prettify() makes the content more readable
print(soup.prettify()[:1000]) 

<!DOCTYPE html>
<html id="ng-app" lang="en" ng-app="RateMds">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1, user-scalable=no" name="viewport"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-57x57.495c70819b0f.png" rel="apple-touch-icon" sizes="57x57"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-114x114.abcec11ee43c.png" rel="apple-touch-icon" sizes="114x114"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-72x72.a21a57e92df8.png" rel="apple-touch-icon" sizes="72x72"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-144x144.2fcfe644723c.png" rel="apple-touch-icon" sizes="144x144"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-60x60.5dd1baf2b8ba.png" rel="apple-touch-icon" sizes="60x60"/>
  <link href="//www.ratemds.com/static/img/favicons/apple-touch-icon-120x120.e5b8c908f646

Now, we need to locate the part of the page's HTML that we want to scrape.
For that, we identify the corresponding tag in the HTML and call the find() function on soup to get the required contents.

In [5]:
#Search for the tag under which all the doctor details are present
summary=soup.find('div',{'id':'doctor-list'})
print(summary.prettify()[:2000])

<div id="doctor-list">
 <div data-react-checksum="815957376" data-reactid=".2e42e0x76zs">
  <h2 class="subtitle heading-withsub" data-reactid=".2e42e0x76zs.0">
   <span data-reactid=".2e42e0x76zs.0.0">
    The
   </span>
   <span data-reactid=".2e42e0x76zs.0.1">
    Best Family Doctors / G.P.s in New York City, NY
   </span>
  </h2>
  <a class="heading-sub" data-reactid=".2e42e0x76zs.1" href="/specialties/family-gp/">
   <span data-reactid=".2e42e0x76zs.1.0">
    What is a
   </span>
   <span data-reactid=".2e42e0x76zs.1.1">
    Family Doctor / G.P.
   </span>
   <span data-reactid=".2e42e0x76zs.1.2">
    ?
   </span>
  </a>
  <div class="search-item doctor-profile" data-reactid=".2e42e0x76zs.3:$2074748">
   <img alt="Dr. Natan Schleider" class="search-item-image" data-reactid=".2e42e0x76zs.3:$2074748.0" height="100" src="https://cdn1.ratemds.com/media/doctors/doctor/image/NatanSchleider.jpg_thumbs/v1_at_100x100.jpg" width="100"/>
   <h2 class="search-item-doctor-name" data-reactid=".2
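
One thing to keep in mind: `find()` returns `None` when no tag matches, so calling `.prettify()` or `.findAll()` on the result raises an `AttributeError` if the page layout changes or the request is blocked. Below is a minimal sketch of a guard. It uses a small made-up HTML snippet and the built-in `html.parser` instead of a live request, and the helper name `find_doctor_list` is hypothetical:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page (not real RateMDs markup)
sample_html = """
<html><body>
  <div id="doctor-list">
    <a class="search-item-doctor-link" href="#">Dr. Example One</a>
  </div>
</body></html>
"""

def find_doctor_list(html):
    """Return the 'doctor-list' container, or None if the page lacks it."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", {"id": "doctor-list"})

container = find_doctor_list(sample_html)
if container is None:
    print("No doctor list found; the page layout may have changed.")
else:
    print(container.find("a").text)

# A page without the container yields None instead of a tag
missing = find_doctor_list("<html><body><p>Blocked</p></body></html>")
```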

It's time to pull the details of all the doctors one by one. So below are the details which we aim to scrape:
- Name of the Doctor
- Specialty
- Rating
- No. of Reviews

1. Name of the Doctor 


Let's find the tag which corresponds to the Doctor Name. In this case it's an 'a' tag with class='search-item-doctor-link'.

In [6]:
#Pull the Doctor Names HTML in the form of list
doctor_name=summary.findAll('a',{'class':'search-item-doctor-link'})
doctor_name[:5]

[<a class="search-item-doctor-link" data-reactid=".2e42e0x76zs.3:$2074748.1.0" href="/doctor-ratings/20189/Dr-Natan-Schleider-New+York-NY.html">Dr. Natan Schleider</a>,
 <a class="search-item-doctor-link" data-reactid=".2e42e0x76zs.3:$2046835.1.0" href="/doctor-ratings/53736/Dr-George+P.-Liakeas-New+York-NY.html">Dr. George P. Liakeas</a>,
 <a class="search-item-doctor-link" data-reactid=".2e42e0x76zs.3:$2033902.1.0" href="/doctor-ratings/68700/Dr-Tammy+L.-Leopold-New+York-NY.html">Dr. Tammy L. Leopold</a>,
 <a class="search-item-doctor-link" data-reactid=".2e42e0x76zs.3:$853961.1.0" href="/doctor-ratings/3207299/Dr-Irina-Korneeva-Vladimirsky-New+York-NY.html">Dr. Irina Korneeva-Vladimirsky</a>,
 <a class="search-item-doctor-link" data-reactid=".2e42e0x76zs.3:$2096001.1.0" href="/doctor-ratings/4020537/Dr-Alexander-Blinski-New+York-NY.html">Dr. Alexander Blinski</a>]

The result above is a list of Doctor Names still wrapped in HTML. Now, we need to extract only the names and add them to a new list.

In [7]:
#Extract the Doctor Names from the list
dr_name=[]
for name in doctor_name:
    dr_name.append(name.text)
dr_name[:5]

['Dr. Natan Schleider',
 'Dr. George P. Liakeas',
 'Dr. Tammy L. Leopold',
 'Dr. Irina Korneeva-Vladimirsky',
 'Dr. Alexander Blinski']

2. Specialty

Let's find the tag which provides the doctor's specialty, similar to what we did for the Name of the Doctor.

In [8]:
#Pull the Doctor Specialty HTML in the form of list
doctor_specialty=summary.findAll('div',{'class':'search-item-specialty'})
doctor_specialty[:5]

[<div class="search-item-specialty" data-reactid=".2e42e0x76zs.3:$2074748.3"><a data-reactid=".2e42e0x76zs.3:$2074748.3.0" href="/best-doctors/ny/new-york/family-gp/">Family Doctor / G.P.</a></div>,
 <div class="search-item-specialty" data-reactid=".2e42e0x76zs.3:$2046835.3"><a data-reactid=".2e42e0x76zs.3:$2046835.3.0" href="/best-doctors/ny/new-york/family-gp/">Family Doctor / G.P.</a></div>,
 <div class="search-item-specialty" data-reactid=".2e42e0x76zs.3:$2033902.3"><a data-reactid=".2e42e0x76zs.3:$2033902.3.0" href="/best-doctors/ny/new-york/family-gp/">Family Doctor / G.P.</a></div>,
 <div class="search-item-specialty" data-reactid=".2e42e0x76zs.3:$853961.3"><a data-reactid=".2e42e0x76zs.3:$853961.3.0" href="/best-doctors/ny/new-york/family-gp/">Family Doctor / G.P.</a></div>,
 <div class="search-item-specialty" data-reactid=".2e42e0x76zs.3:$2096001.3"><a data-reactid=".2e42e0x76zs.3:$2096001.3.0" href="/best-doctors/ny/new-york/family-gp/">Family Doctor / G.P.</a></div>]

Now, we'll extract the Doctor Specialty from the above result. All the specialty values will be identical, as we are currently extracting data for only one particular specialty.

In [9]:
#Extract the Doctor Specialty from the list
dr_specialty=[]
for specialty in doctor_specialty:
    dr_specialty.append(specialty.text)
dr_specialty[:5]

['Family Doctor / G.P.',
 'Family Doctor / G.P.',
 'Family Doctor / G.P.',
 'Family Doctor / G.P.',
 'Family Doctor / G.P.']

3. Star Rating

We'll find the tag for the Star Rating, just as we did for the previous two fields.

In [10]:
#Extracting Star Rating
doctor_star_rating=summary.findAll('span',{'class':'star-rating'})
doctor_star_rating

[<span class="star-rating" data-reactid=".2e42e0x76zs.3:$2074748.4.0.0" title="4.91"><span class="stars" data-reactid=".2e42e0x76zs.3:$2074748.4.0.0.0"><span class="star selected" data-reactid=".2e42e0x76zs.3:$2074748.4.0.0.0.$5"></span><span class="star selected" data-reactid=".2e42e0x76zs.3:$2074748.4.0.0.0.$4"></span><span class="star selected" data-reactid=".2e42e0x76zs.3:$2074748.4.0.0.0.$3"></span><span class="star selected" data-reactid=".2e42e0x76zs.3:$2074748.4.0.0.0.$2"></span><span class="star selected" data-reactid=".2e42e0x76zs.3:$2074748.4.0.0.0.$1"></span></span></span>,
 <span class="star-rating" data-reactid=".2e42e0x76zs.3:$2046835.4.0.0" title="4.56"><span class="stars" data-reactid=".2e42e0x76zs.3:$2046835.4.0.0.0"><span class="star half" data-reactid=".2e42e0x76zs.3:$2046835.4.0.0.0.$5"></span><span class="star selected" data-reactid=".2e42e0x76zs.3:$2046835.4.0.0.0.$4"></span><span class="star selected" data-reactid=".2e42e0x76zs.3:$2046835.4.0.0.0.$3"></span><spa

Here, the star rating is not available as text inside the tag. Instead, it's located inside an attribute of the tag, so we need to extract the value from that attribute.

BeautifulSoup allows us to treat a tag like a dictionary in which each attribute is a key. So, here we extract the value of the 'title' attribute.

In [11]:
dr_star_rating=[]
for stars in doctor_star_rating:
    dr_star_rating.append(stars['title'])
dr_star_rating

['4.91',
 '4.56',
 '4.56',
 '4.38',
 '5.00',
 '4.78',
 '5.00',
 '4.14',
 '4.80',
 '4.31']
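
To make the dictionary-style access concrete, here is a small sketch on made-up tags (not fetched from the site). `tag['title']` raises a `KeyError` when the attribute is absent, whereas `tag.get('title')` returns `None`, which may be safer if some entries turn out to have no rating:

```python
from bs4 import BeautifulSoup

# Made-up snippet mimicking the structure of the star-rating spans
html = '<span class="star-rating" title="4.91"></span><span class="star-rating"></span>'
soup = BeautifulSoup(html, "html.parser")
rated, unrated = soup.find_all("span", {"class": "star-rating"})

print(rated["title"])        # attribute read like a dictionary key
print(rated.attrs)           # all attributes of the tag as a dict
print(unrated.get("title"))  # missing attribute: get() returns None
```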

4. No. of reviews

Finally, we'll extract our last required column, i.e. the number of reviews provided for each doctor.

In [12]:
#extract Doctor_reviews
doctor_reviews=summary.findAll('div',{'class':'star-rating-count'})
doctor_reviews

[<div class="star-rating-count" data-reactid=".2e42e0x76zs.3:$2074748.4.0.1"><span data-reactid=".2e42e0x76zs.3:$2074748.4.0.1.0">34</span><span data-reactid=".2e42e0x76zs.3:$2074748.4.0.1.1"> reviews</span></div>,
 <div class="star-rating-count" data-reactid=".2e42e0x76zs.3:$2046835.4.0.1"><span data-reactid=".2e42e0x76zs.3:$2046835.4.0.1.0">33</span><span data-reactid=".2e42e0x76zs.3:$2046835.4.0.1.1"> reviews</span></div>,
 <div class="star-rating-count" data-reactid=".2e42e0x76zs.3:$2033902.4.0.1"><span data-reactid=".2e42e0x76zs.3:$2033902.4.0.1.0">25</span><span data-reactid=".2e42e0x76zs.3:$2033902.4.0.1.1"> reviews</span></div>,
 <div class="star-rating-count" data-reactid=".2e42e0x76zs.3:$853961.4.0.1"><span data-reactid=".2e42e0x76zs.3:$853961.4.0.1.0">28</span><span data-reactid=".2e42e0x76zs.3:$853961.4.0.1.1"> reviews</span></div>,
 <div class="star-rating-count" data-reactid=".2e42e0x76zs.3:$2096001.4.0.1"><span data-reactid=".2e42e0x76zs.3:$2096001.4.0.1.0">7</span><span

Here, we can observe that the text in the tag has the form 'count reviews', e.g. '34 reviews'. We only need the number and not the word 'reviews', so we'll extract only the numeric part.

In [13]:
dr_reviews=[]
for rev in doctor_reviews:
    dr_reviews.append((rev.text)[0:rev.text.find(" ")])
dr_reviews

['34', '33', '25', '28', '7', '12', '6', '33', '14', '14']
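
Slicing up to the first space works for text such as '34 reviews'. As an alternative sketch, a regular expression can pull out the leading digits and convert them to integers, which also tolerates missing or unexpected text. The helper name `parse_review_count` and the sample strings are illustrative assumptions:

```python
import re

def parse_review_count(text):
    """Return the leading integer in strings like '34 reviews', or None."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None

# Sample strings; the last one has no leading number on purpose
samples = ["34 reviews", "7 reviews", "1 review", "no reviews yet"]
counts = [parse_review_count(s) for s in samples]
print(counts)
```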

Now, it's time to create a table out of the extracted data. 

I converted the lists into Series first and then joined them into a DataFrame. Another way would be to build the DataFrame directly from a dictionary of the lists.

In [14]:
#Create Series of all the lists
dr_name=Series(dr_name)
dr_specialty=Series(dr_specialty)
dr_star_rating=Series(dr_star_rating)
dr_reviews=Series(dr_reviews)

In [15]:
#Create a Dataframe
family_df=pd.concat([dr_name,dr_specialty,dr_star_rating,dr_reviews],axis=1)
#Name the columns
family_df.columns=['Doctor Name','Specialty','Rating','No. of Reviews']
family_df

| | Doctor Name | Specialty | Rating | No. of Reviews |
|---|---|---|---|---|
| 0 | Dr. Natan Schleider | Family Doctor / G.P. | 4.91 | 34 |
| 1 | Dr. George P. Liakeas | Family Doctor / G.P. | 4.56 | 33 |
| 2 | Dr. Tammy L. Leopold | Family Doctor / G.P. | 4.56 | 25 |
| 3 | Dr. Irina Korneeva-Vladimirsky | Family Doctor / G.P. | 4.38 | 28 |
| 4 | Dr. Alexander Blinski | Family Doctor / G.P. | 5.0 | 7 |
| 5 | Dr. Pamela Hops | Family Doctor / G.P. | 4.78 | 12 |
| 6 | Dr. Leslie H. Gerstman | Family Doctor / G.P. | 5.0 | 6 |
| 7 | Dr. Valerie Lyon | Family Doctor / G.P. | 4.14 | 33 |
| 8 | Dr. D. Steenkamp | Family Doctor / G.P. | 4.8 | 14 |
| 9 | Dr. Sanford M. Gould | Family Doctor / G.P. | 4.31 | 14 |
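
The dictionary-based alternative mentioned above could look like the following sketch. It uses three of the rows from the output rather than live data; each key becomes a column name and each list becomes a column. The name `family_df_alt` is introduced here only for illustration:

```python
import pandas as pd

# A few values copied from the output above, used as stand-in data
names = ['Dr. Natan Schleider', 'Dr. George P. Liakeas', 'Dr. Tammy L. Leopold']
specialties = ['Family Doctor / G.P.'] * 3
ratings = ['4.91', '4.56', '4.56']
reviews = ['34', '33', '25']

# Keys become column names; lists become the column values
family_df_alt = pd.DataFrame({
    'Doctor Name': names,
    'Specialty': specialties,
    'Rating': ratings,
    'No. of Reviews': reviews,
})
print(family_df_alt)
```

All lists must have the same length for this to work; otherwise pandas raises a `ValueError`.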


### 2nd Part: Scraping data from multiple pages

To scrape multiple pages from a website, we'll build upon what we did above for a single page.
First, we need to understand the structure of the URL, i.e. exactly where the page number changes. Here, the URL for page two has the format https://www.ratemds.com/best-doctors/ny/new-york/family-gp/?page=2. So we need to change the page number at the end of the URL to extract data from each page. For that, we create a list of the page numbers we want to scrape. Here, I'll be scraping the first 5 pages of the results.

In [23]:
url_by_user='https://www.ratemds.com/best-doctors/ny/new-york/family-gp/?page='
#Set pages to be scraped
pages=[str(i) for i in range(1,6)]
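
As a small sketch of the same idea, the full page URLs can also be built in one step. This only constructs strings and makes no requests; `base_url` and `page_urls` are names introduced here for illustration:

```python
base_url = 'https://www.ratemds.com/best-doctors/ny/new-york/family-gp/?page='

# range(1, 6) yields 1..5, so this covers the first five result pages
page_urls = [f'{base_url}{page}' for page in range(1, 6)]

for page_url in page_urls:
    print(page_url)
```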

Second, we need to create variables that would help us to control and monitor the loop rate. This is done to make sure that we are not bombarding the server with our requests.

In [24]:
#Monitoring the loop
start_time=time()
request=0
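
The pause-and-monitor idea can be tried out without touching the network. In this sketch, `fetch` is a stand-in function supplied by the caller (in the real loop it would wrap `get`), and the delay bounds are parameters so that the example runs quickly. The function name `crawl_politely` is hypothetical:

```python
from random import uniform
from time import sleep, time

def crawl_politely(urls, fetch, min_delay, max_delay):
    """Call fetch(url) for each URL, pausing a random time between calls."""
    results = []
    start_time = time()
    for count, url in enumerate(urls, start=1):
        results.append(fetch(url))
        elapsed = time() - start_time
        # Guard against a zero elapsed time on very fast stand-in fetches
        rate = count / elapsed if elapsed > 0 else float('inf')
        print(f'Requests: {count}; Frequency: {rate:.3f} requests/sec')
        # No need to wait after the final request
        if count < len(urls):
            sleep(uniform(min_delay, max_delay))
    return results

# Stand-in fetch that just echoes the URL; tiny delays keep the demo fast
demo_pages = crawl_politely(['page-1', 'page-2', 'page-3'],
                            fetch=lambda url: f'<html>{url}</html>',
                            min_delay=0.01, max_delay=0.02)
```

Against a real server, the delay bounds would be several seconds rather than fractions of a second.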

Now, we create empty lists which will be used to store the details of the respective doctors after scraping them from the website.

In [25]:
#Create empty lists for respective values to be extracted
dr_name_all=[]
dr_reviews_all=[]
dr_specialty_all=[]
dr_star_rating_all=[]

Below is the loop that combines the data from all the pages.

We pause the loop for a random period of 8 to 15 seconds after every page to avoid sending a large number of requests to the server in a short time. Neglecting this might get our IP address banned. Everything else is the same as what we did to extract data from a single page.

In [26]:
for page in pages:
    
    #Make a get request
    response=get(url_by_user+page)
    
    #Pause the loop
    sleep(randint(8,15))
    
    #Monitor the requests
    request+=1
    elapsed_time=time()-start_time
    print('Requests: {}; Frequency: {} requests/sec'.format(request, request/elapsed_time))
    clear_output(wait=True)
    
    #Parse the contents into a usable format
    soup=BeautifulSoup(response.content,"lxml")
    
    #Select the container having all the doctor details
    summary=soup.find('div',{'id':'doctor-list'})
    
    #Extract doctor names
    doctor_name=summary.findAll('a',{'class':'search-item-doctor-link'})
    for name in doctor_name:
        dr_name_all.append(name.text)
    
    #Extract doctor specialty
    doctor_specialty=summary.findAll('div',{'class':'search-item-specialty'})
    for specialty in doctor_specialty:
        dr_specialty_all.append(specialty.text)
    
    #Extract star rating
    doctor_star_rating=summary.findAll('span',{'class':'star-rating'})
    for stars in doctor_star_rating:
        dr_star_rating_all.append(stars['title'])
    
    #Extract no. of reviews
    doctor_reviews=summary.findAll('div',{'class':'star-rating-count'})
    for reviews in doctor_reviews:
        dr_reviews_all.append((reviews.text)[0:(reviews.text).find(" ")])

Requests: 5; Frequency: 0.07837724123369154 requests/sec
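
One caveat with collecting four independent lists: if a doctor has no rating or review count, the lists end up with different lengths and the rows no longer line up. A possible alternative, sketched below on made-up HTML that imitates the class names seen earlier, is to loop over each `search-item doctor-profile` card and read all four fields from it. The helper names `text_or_none` and `parse_cards` are hypothetical, and whether unrated doctors actually appear this way on the site is an assumption:

```python
from bs4 import BeautifulSoup

# Made-up HTML imitating the class names shown in the earlier output
sample_html = """
<div id="doctor-list">
  <div class="search-item doctor-profile">
    <a class="search-item-doctor-link">Dr. A</a>
    <div class="search-item-specialty">Family Doctor / G.P.</div>
    <span class="star-rating" title="4.91"></span>
    <div class="star-rating-count">34 reviews</div>
  </div>
  <div class="search-item doctor-profile">
    <a class="search-item-doctor-link">Dr. B</a>
    <div class="search-item-specialty">Family Doctor / G.P.</div>
  </div>
</div>
"""

def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None

def parse_cards(html):
    """Build one dict per doctor card so missing fields stay aligned."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.find_all("div", class_="doctor-profile"):
        rating = card.find("span", class_="star-rating")
        reviews = text_or_none(card.find("div", class_="star-rating-count"))
        rows.append({
            "Doctor Name": text_or_none(card.find("a", class_="search-item-doctor-link")),
            "Specialty": text_or_none(card.find("div", class_="search-item-specialty")),
            "Rating": rating.get("title") if rating is not None else None,
            "No. of Reviews": reviews.split(" ")[0] if reviews else None,
        })
    return rows

rows = parse_cards(sample_html)
print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`, with missing values kept in their own row.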


In [27]:
#Make series of all the lists
dr_name=Series(dr_name_all)
dr_reviews=Series(dr_reviews_all)
dr_specialty=Series(dr_specialty_all)
dr_star_rating=Series(dr_star_rating_all)

In [28]:
#Create a dataframe and save it to a .csv file
family_df=pd.concat([dr_name, dr_specialty, dr_star_rating, dr_reviews],axis=1,ignore_index=True)
family_df.columns=['Doctor Name','Specialty', 'Rating','No. of Reviews']
family_df.to_csv('RateMDs-Family-G.Ps.csv')

In [29]:
#A glimpse of our dataset
family_df.head(10)

| | Doctor Name | Specialty | Rating | No. of Reviews |
|---|---|---|---|---|
| 0 | Dr. Natan Schleider | Family Doctor / G.P. | 4.91 | 34 |
| 1 | Dr. George P. Liakeas | Family Doctor / G.P. | 4.56 | 33 |
| 2 | Dr. Tammy L. Leopold | Family Doctor / G.P. | 4.56 | 25 |
| 3 | Dr. Irina Korneeva-Vladimirsky | Family Doctor / G.P. | 4.38 | 28 |
| 4 | Dr. Alexander Blinski | Family Doctor / G.P. | 5.0 | 7 |
| 5 | Dr. Pamela Hops | Family Doctor / G.P. | 4.78 | 12 |
| 6 | Dr. Leslie H. Gerstman | Family Doctor / G.P. | 5.0 | 6 |
| 7 | Dr. Valerie Lyon | Family Doctor / G.P. | 4.14 | 33 |
| 8 | Dr. D. Steenkamp | Family Doctor / G.P. | 4.8 | 14 |
| 9 | Dr. Sanford M. Gould | Family Doctor / G.P. | 4.31 | 14 |
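
The ratings and review counts come out of the scraper as strings. Before any analysis they would typically be converted to numbers. Here is a minimal sketch using `pd.to_numeric` on two of the rows above (the variable `sample_df` is introduced here for illustration):

```python
import pandas as pd

# Two rows copied from the output above, with values kept as strings
sample_df = pd.DataFrame({
    'Doctor Name': ['Dr. Natan Schleider', 'Dr. George P. Liakeas'],
    'Rating': ['4.91', '4.56'],
    'No. of Reviews': ['34', '33'],
})

# errors='coerce' turns unparseable values into NaN instead of raising
sample_df['Rating'] = pd.to_numeric(sample_df['Rating'], errors='coerce')
sample_df['No. of Reviews'] = pd.to_numeric(sample_df['No. of Reviews'], errors='coerce')

print(sample_df.dtypes)
print(sample_df['Rating'].mean())
```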
