## My analysis of scraping Google Scholar

Simple scraping with requests/BeautifulSoup will NOT work: Google detects the automated traffic and blocks the IP address, as the CAPTCHA page captured further down shows.

More robust solutions:
    
- [scholarly](https://github.com/scholarly-python-package/scholarly)
    - free, no h-index, but useful for getting scholar_id and list of publications
    - see [`scholarly.ipynb`](https://github.com/wgong/py4kids/blob/master/lesson-11-scrapy/scrap-cs-faculty/scholarly.ipynb)

- [SerpAPI](https://serpapi.com/): a paid API for scraping Google and other search engines.
    - A blog post on scraping Google Scholar: https://plainenglish.io/blog/scrape-google-scholar-with-python-fc6898419305

In [1]:
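from IPython.display import display, Markdown, Latex
import requests
from bs4 import BeautifulSoup
# Project helper module; the cells below use SCHOOL_DICT and BROWSER_HEADERS from it.
from scrap_cs_faculty import *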

In [2]:
URL = SCHOOL_DICT["Google-Scholar"]["url"]  
print(URL)

https://scholar.google.com


In [3]:
def get_url_scholar(author_org, base_url=URL):
    search_str = "+".join(author_org.split())
    return f"{base_url}/scholar?hl=en&as_sdt=0%2C34&q={search_str}"

In [4]:
def get_url_citation(href, base_url=URL):
    return f"{base_url}{href}"
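A side note on the two URL helpers above: joining words with `+` leaves characters such as `&` or `#` unescaped, so they would break the query string. Below is a minimal sketch of the same two helpers using the standard library's `urllib.parse`. The names `BASE_URL`, `build_search_url` and `build_profile_url` are made up for this illustration and are not part of `scrap_cs_faculty`; the parameter values are copied from `get_url_scholar`.

```python
from urllib.parse import urlencode, urljoin

# Assumption: same value the notebook prints for URL.
BASE_URL = "https://scholar.google.com"


def build_search_url(author_org, base_url=BASE_URL):
    """Build a Scholar search URL, percent-encoding the query text."""
    # "0,34" is encoded to the hard-coded "0%2C34" used in get_url_scholar.
    params = {"hl": "en", "as_sdt": "0,34", "q": author_org}
    return f"{base_url}/scholar?{urlencode(params)}"


def build_profile_url(href, base_url=BASE_URL):
    """Resolve a relative href from the results page against the base URL."""
    return urljoin(base_url, href)


print(build_search_url("Deborah Estrin cornell"))
print(build_search_url("AT&T research"))  # the "&" is escaped as %26
print(build_profile_url("/citations?user=3_WYcR4AAAAJ&hl=en&oi=ao"))
```

For plain names the result is identical to `get_url_scholar`; the difference only shows up when the search text contains reserved characters.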

## Search Google Scholar by name/org

In [6]:
author_org = "Deborah Estrin cornell"
url = get_url_scholar(author_org)

In [7]:
page = requests.get(url, headers=BROWSER_HEADERS)

In [8]:
soup = BeautifulSoup(page.content, "html.parser")

In [9]:
citation_node = soup.find("h4", class_="gs_rt2")
if citation_node is not None:
    citation_url = citation_node.find("a")["href"]
    print(get_url_citation(citation_url))

## Get citation

In [40]:
url = get_url_citation(citation_url)
page2 = requests.get(url, headers=BROWSER_HEADERS)
soup2 = BeautifulSoup(page2.content, "html.parser")

In [43]:
hindex = soup2.find("table", id="gsc_rsb_st")

In [44]:
hindex

In [45]:
print(soup2.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="initial-scale=1" name="viewport"/>
  <title>
   https://scholar.google.com/citations?user=3_WYcR4AAAAJ&amp;hl=en&amp;oi=ao
  </title>
 </head>
 <body onload="e=document.getElementById('captcha');if(e){e.focus();} if(solveSimpleChallenge) {solveSimpleChallenge(,);}" style="font-family: arial, sans-serif; background-color: #fff; color: #000; padding:20px; font-size:18px; overscroll-behavior:contain;">
  <div style="max-width:400px;">
   <hr noshade="" size="1" style="color:#ccc; background-color:#ccc;"/>
   <br/>
   <div style="font-size:13px;">
    Our systems have detected unusual traffic from your computer network.  Please try your request again later.
    <a href="#" onclick="document.getElementById('infoDiv0').style.display='block';">
     Why did this happen?
    </a>
    <br/>
    <br/>
    <div id="infoDiv0" style="

The request was blocked: instead of the profile page, Google returned a CAPTCHA page with the message `Our systems have detected unusual traffic from your computer network.`
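Since a blocked response is still ordinary HTML, it is worth checking for it before parsing, otherwise `find` silently returns `None`. A minimal sketch follows, assuming the block page keeps the two strings visible in the captured output above; the name `looks_blocked` and the marker list are illustrative, and Google may change the page at any time.

```python
# Assumption: both strings are taken from the blocked response captured above.
BLOCK_MARKERS = (
    "Our systems have detected unusual traffic",
    "solveSimpleChallenge",
)


def looks_blocked(html):
    """Return True if the HTML text looks like Google's CAPTCHA/block page."""
    return any(marker in html for marker in BLOCK_MARKERS)


blocked_sample = (
    "<div>Our systems have detected unusual traffic from your "
    "computer network.  Please try your request again later.</div>"
)
normal_sample = '<table id="gsc_rsb_st"></table>'

print(looks_blocked(blocked_sample))
print(looks_blocked(normal_sample))
```

In the notebook this could be called as `looks_blocked(page2.text)` right after the request, before building `soup2`.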

If user profiles exist, the search results page contains this heading:

<h3 class="gs_rt"><a href="/citations?view_op=search_authors&amp;mauthors=Deborah+Estrin+cornell&amp;hl=en&amp;oi=ao">User profiles for <b>Deborah Estrin cornell</b></a></h3>


hindex_url = https://scholar.google.com/citations?user=3_WYcR4AAAAJ&hl=en&oi=ao


<h4 class="gs_rt2"><a href="/citations?user=3_WYcR4AAAAJ&amp;hl=en&amp;oi=ao"><b>Deborah Estrin</b></a></h4>

Grab the profile URL (`hindex_url`) from the `href` of the `<a>` inside the element above.
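This step can be tried offline against the saved element, without sending any request. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup so that it runs with no extra installs; `ProfileLinkParser` and `extract_profile` are hypothetical names for this illustration, and the `gs_rt2` class is taken from the element above.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urlsplit


class ProfileLinkParser(HTMLParser):
    """Remember the href of the first <a> inside <h4 class="gs_rt2">."""

    def __init__(self):
        super().__init__()
        self.href = None
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h4" and "gs_rt2" in (attrs.get("class") or "").split():
            self._in_heading = True
        elif tag == "a" and self._in_heading and self.href is None:
            # HTMLParser already unescapes &amp; in attribute values.
            self.href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "h4":
            self._in_heading = False


def extract_profile(html):
    """Return (href, scholar_id), or (None, None) if no profile link is found."""
    parser = ProfileLinkParser()
    parser.feed(html)
    parser.close()
    if parser.href is None:
        return None, None
    user = parse_qs(urlsplit(parser.href).query).get("user", [None])[0]
    return parser.href, user


SAMPLE = (
    '<h4 class="gs_rt2"><a href="/citations?user=3_WYcR4AAAAJ&amp;hl=en&amp;oi=ao">'
    "<b>Deborah Estrin</b></a></h4>"
)
href, scholar_id = extract_profile(SAMPLE)
print(href)
print(scholar_id)
```

The `user` query parameter is the scholar ID, the same identifier that `scholarly` exposes as `scholar_id`.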

The metrics table element (`id="gsc_rsb_st"`) on the profile page:
<table id="gsc_rsb_st"><thead><tr><th class="gsc_rsb_sth"></th><th class="gsc_rsb_sth">All</th><th class="gsc_rsb_sth">Since 2018</th></tr></thead><tbody><tr><td class="gsc_rsb_sc1"><a href="javascript:void(0)" class="gsc_rsb_f gs_ibl" title="This is the number of citations to all publications. The second column has the &quot;recent&quot; version of this metric which is the number of new citations in the last 5 years to all publications.">Citations</a></td><td class="gsc_rsb_std">130042</td><td class="gsc_rsb_std">16516</td></tr><tr><td class="gsc_rsb_sc1"><a href="javascript:void(0)" class="gsc_rsb_f gs_ibl" title="h-index is the largest number h such that h publications have at least h citations. The second column has the &quot;recent&quot; version of this metric which is the largest number h such that h publications have at least h new citations in the last 5 years.">h-index</a></td><td class="gsc_rsb_std">138</td><td class="gsc_rsb_std">60</td></tr><tr><td class="gsc_rsb_sc1"><a href="javascript:void(0)" class="gsc_rsb_f gs_ibl" title="i10-index is the number of publications with at least 10 citations. The second column has the &quot;recent&quot; version of this metric which is the number of publications that have received at least 10 new citations in the last 5 years.">i10-index</a></td><td class="gsc_rsb_std">361</td><td class="gsc_rsb_std">189</td></tr></tbody></table>


The actual h-index is in this row of the table:

<tr><td class="gsc_rsb_sc1"><a href="javascript:void(0)" class="gsc_rsb_f gs_ibl" title="h-index is the largest number h such that h publications have at least h citations. The second column has the &quot;recent&quot; version of this metric which is the largest number h such that h publications have at least h new citations in the last 5 years.">h-index</a></td><td class="gsc_rsb_std">138</td><td class="gsc_rsb_std">60</td></tr>

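To check the parsing logic without hitting Google, the saved table can be parsed offline. This is a rough sketch using the standard library's `html.parser` rather than BeautifulSoup; `MetricsTableParser` and `parse_metrics` are illustrative names, and `SAMPLE_TABLE` is a trimmed copy of the table above with the long `title` attributes removed.

```python
from html.parser import HTMLParser


class MetricsTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # the header row has only <th> cells and stays empty
                self.rows.append(self._row)
            self._row = None


def parse_metrics(table_html):
    """Map each metric name to a tuple (all-time value, recent value)."""
    parser = MetricsTableParser()
    parser.feed(table_html)
    parser.close()
    return {r[0]: (int(r[1]), int(r[2])) for r in parser.rows if len(r) == 3}


SAMPLE_TABLE = (
    '<table id="gsc_rsb_st"><thead><tr><th></th><th>All</th><th>Since 2018</th></tr></thead>'
    '<tbody><tr><td class="gsc_rsb_sc1"><a>Citations</a></td>'
    '<td class="gsc_rsb_std">130042</td><td class="gsc_rsb_std">16516</td></tr>'
    '<tr><td class="gsc_rsb_sc1"><a>h-index</a></td>'
    '<td class="gsc_rsb_std">138</td><td class="gsc_rsb_std">60</td></tr>'
    '<tr><td class="gsc_rsb_sc1"><a>i10-index</a></td>'
    '<td class="gsc_rsb_std">361</td><td class="gsc_rsb_std">189</td></tr></tbody></table>'
)

metrics = parse_metrics(SAMPLE_TABLE)
print(metrics["h-index"])
```

Each value is a pair: the "All" column first, then the "Since 2018" column.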