# Scraping Organic Search Results of Google Pages

Author: Zipporah Cohen, based on the work of Malika Parkhomchuk

**Summary**

This notebook defines and calls functions that scrape the organic search results from locally saved Google search result pages (HTML files).

**Table of Contents**
1. [Create Functions](#sec1)
2. [Call Functions on HTMLs](#sec2)

**Import all libraries**

In [6]:
import os
import requests
from bs4 import BeautifulSoup as BS
import json
from urllib.parse import urlparse 

<a id="sec1"></a>
### Create functions to extract top search result info

- `desired_results()`: keeps only the organic search results
- `find_top_results()`: selects all webpage results from the HTML
- `create_json()`: dumps the information from the organic search results into a JSON file

This function filters out results that are not organic.

In [4]:
def desired_results(result):
    """
    Determines if the given result is desirable.
    In this use case, a desirable result is one of the 10 organic results from the search page.
    """
    # Reject divs whose direct parent has the id 'bres'
    if result.parent.has_attr('id') and (result.parent['id'][0] == 'bres' or result.parent['id'] == 'bres'):
        return False
    # Reject divs whose direct parent's first class is 'ULSxyf'
    elif result.parent.has_attr('class') and result.parent['class'][0] == 'ULSxyf':
        return False
    else:
        return True
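
To see what the filter keeps, the sketch below runs a condensed restatement of `desired_results()` on a tiny hand-written HTML fragment. The fragment, the `example.org` links, and the `rso` wrapper id are made up for illustration; only the `MjjYud` and `ULSxyf` class names and the `bres` id come from the code above. Real saved pages are far larger and may be structured differently.

```python
from bs4 import BeautifulSoup

def desired_results(result):
    # Condensed restatement of the notebook's filter (same two rejections)
    parent = result.parent
    if parent.get('id') == 'bres':
        return False
    classes = parent.get('class') or []
    if classes and classes[0] == 'ULSxyf':
        return False
    return True

# Hand-written fragment, not real Google markup
html = """
<div id="rso">
  <div class="MjjYud"><a href="https://example.org/a"><h3>Organic A</h3></a></div>
  <div class="ULSxyf"><div class="MjjYud"><a href="https://example.org/b"><h3>Widget B</h3></a></div></div>
</div>
<div id="bres"><div class="MjjYud"><a href="https://example.org/c"><h3>Related C</h3></a></div></div>
"""

soup = BeautifulSoup(html, 'html.parser')
divs = soup.find_all('div', class_='MjjYud')

# Only the div whose parent is neither #bres nor .ULSxyf survives
kept = [d.find('h3').text for d in divs if desired_results(d)]
print(kept)
```

Because the filter inspects only the direct parent of each `MjjYud` div, a result nested more deeply inside a `ULSxyf` or `bres` container would not be rejected by this check.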

This function scrapes all top organic results from the HTML.

In [3]:
def find_top_results(soup):
    """
    Scrapes top results from a given SERP.
    Returns a list of dicts with the keys 'title', 'domain', and 'link'.
    """
    all_top_results = []

    # Candidate result containers on the page
    result_divs = soup.find_all('div', class_='MjjYud')

    # Keep only the organic results
    organic_results = list(filter(desired_results, result_divs))

    for div in organic_results:
        try:
            a = div.find('a')
            link = a.get('href')
            title = a.find('h3').text
            domain = urlparse(link).netloc
            all_top_results.append({'title': title,
                                    'domain': domain,
                                    'link': link})
        except AttributeError:
            # Raised when the div has no <a>, or its first <a> has no <h3>
            print('skipping a div for attribute error')

    return all_top_results
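
`find_top_results()` derives `domain` from `urlparse(link).netloc`, which is empty whenever `link` is relative. The sketch below shows that behavior and one possible fallback. The helper `domain_of` is hypothetical, and the `/url?q=<target>` redirect shape is an assumption about how some saved pages might encode links, not something this notebook's output confirms.

```python
from urllib.parse import urlparse, parse_qs

def domain_of(link):
    """Hypothetical helper: return the domain of a result link.

    Assumption: a redirect-style link looks like '/url?q=<target>&...'.
    """
    parsed = urlparse(link)
    if not parsed.netloc and parsed.path == '/url':
        # Pull the real target out of the 'q' query parameter
        target = parse_qs(parsed.query).get('q', [''])[0]
        parsed = urlparse(target)
    return parsed.netloc

# Absolute link: netloc is the domain
print(domain_of('https://www.example.org/periods/faq?page=2'))
# Redirect-style link: domain recovered from the 'q' parameter
print(domain_of('/url?q=https://www.example.org/periods&sa=U'))
```

If every `domain` value in the output JSON is already non-empty, the plain `urlparse(link).netloc` call is sufficient and no fallback is needed.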

This function writes the top organic results to a JSON file inside each query's folder.

In [2]:
def create_json(ls):
    """
    Creates new json files with all of the organic search results for each pre-existing html file in the given list.
    
    Parameters:
    ls - a list of existing queries for which to scrape the organic results
    """
    for q in ls:  
        # Each query has its own folder: queries/<q>/<q>.html
        folder = os.listdir(f'queries/{q}')
        html = list(filter(lambda el: el == f'{q}.html', folder))[0]

        # Use a context manager so the HTML file is closed after reading
        with open(f"queries/{q}/{html}", 'r') as inFile:
            soup = BS(inFile.read(), 'html.parser')
        topResults = find_top_results(soup)

        # Write the results next to the HTML: queries/<q>/<q>.json
        with open(f'queries/{q}/{q}.json', 'w') as outFile:
            json.dump(topResults, outFile)
        print(f'{q} ---- {len(topResults)} results')
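
As a rough illustration of the file layout and JSON shape that `create_json()` produces, the sketch below writes one made-up result to `queries/<q>/<q>.json` inside a temporary directory and reads it back. The query string, the result values, and the temporary root are placeholders, not data from this project.

```python
import json
import os
import tempfile

# Placeholder result in the same shape find_top_results() returns
results = [{'title': 'Example title',
            'domain': 'www.example.org',
            'link': 'https://www.example.org/page'}]
query = 'example query'

with tempfile.TemporaryDirectory() as root:
    # Mirror the notebook's hierarchy: queries/<q>/<q>.json
    folder = os.path.join(root, 'queries', query)
    os.makedirs(folder)
    path = os.path.join(folder, f'{query}.json')

    with open(path, 'w') as out_file:
        json.dump(results, out_file)

    # Read the file back to confirm the round trip
    with open(path, 'r') as in_file:
        loaded = json.load(in_file)

print(loaded[0]['domain'])
```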

Load the search-phrases JSON dictionary and confirm that one category contains the expected queries

In [23]:
with open('search-phrases.json', 'r') as inFile:
    queries = json.load(inFile)
        
queries['anatomical-terms']

['menstruation and vaginas',
 'breasts hurt during period',
 'tampon isn’t fitting in my vagina',
 'diva cup fit test for vulva',
 'why does menstruation cause cramps and butt pain?',
 'Does menstruation come out of uterus',
 'what is going on in with the uterus and ovaries during menstruation?',
 'what happens in the body during menstruation?',
 'why do boobs hurt during menstruation?',
 'blood coming out of my vagina',
 'my uterus hurts',
 'vagina',
 'uterus']
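
Before looping over every category, it can help to check which queries actually have a saved HTML file: `create_json()` calls `os.listdir()` on the query folder and then indexes `[0]` into the filtered listing, so a missing folder or file stops the whole run. The helper `missing_html` below is a hypothetical pre-flight check, assuming the `queries/<q>/<q>.html` layout used above and a dictionary that maps each category to a list of query strings.

```python
import os

def missing_html(queries, root='queries'):
    """Hypothetical helper: list (category, query) pairs lacking <root>/<q>/<q>.html."""
    missing = []
    for category, phrases in queries.items():
        for q in phrases:
            if not os.path.isfile(os.path.join(root, q, f'{q}.html')):
                missing.append((category, q))
    return missing

# Small stand-in for the dictionary loaded from search-phrases.json
example = {'anatomical-terms': ['vagina', 'uterus']}

# Every query is reported here if the root folder does not exist
print(missing_html(example, root='no-such-folder'))
```

An empty list means every query has its HTML file in place and the loop below should not fail on a missing file.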

<a id="sec2"></a>
### Call the Scraping Function on each Query

In [242]:
for category in queries:
    create_json(queries[category])

Why do people get periods ---- 9 results
why do we menstruate? ---- 10 results
skipping a div for attribute error
advice for people who menstruate ---- 10 results
how to mitigate dysphoria while menstruating ---- 10 results
things that everyone who has a period should know ---- 10 results
should menstruators drink while on their period ---- 9 results
do trans men stop menstruating? ---- 9 results
skipping a div for attribute error
do trans women menstruate? ---- 10 results
do trans women ever want to menstruate? ---- 10 results
how does menstruation affect the trans experience? ---- 9 results
skipping a div for attribute error
what menstrual pads are best? ---- 10 results
average length of menstrual cycle ---- 9 results
how much blood is lost in one period ---- 9 results
skipping a div for attribute error
How do periods work ---- 9 results
what's the point of menstruation?what's the deal with menstruation? ---- 10 results
why do people menstruate? ---- 9 results
why is menstruation so 