# Google Search Scraper Program

* To run this program, either run the code cells below in order with "Run", or go to Cell > Run All.

* This program outputs a list of links for a given Google search query and saves them to the file (search_query)_search_results.txt.

# Other notes:

1. This program asks for a search query and the number of pages of results to fetch (at most 10), or it can run with the default values: 1 page of results (about 10 results) and the search query "mapuche tribe argentina chile".

2. This program sends your query to Google as typed, so you can refine searches with the same operators the Google search engine supports.
   * This includes restricting results to a certain domain (e.g. site:wikipedia.org), performing AND or OR queries, and searching for an exact phrase with quotation marks ("exact phrase here").
   * See this page for reference: https://support.google.com/websearch/answer/2466433?hl=en

3. Google may limit requests after this program has been run a number of times, because it may treat the traffic as coming from a bot. If you run into issues fetching results, you will likely need to change the "user_agent" variable, which is defined in the cell right after the imports.

4. You will need Jupyter Notebook and the requests and BeautifulSoup libraries installed. If you are running with Anaconda, they should already be installed. If not, type the following commands in a terminal (Command Prompt on Windows):
   * `python -m pip install notebook`
   * `python -m pip install requests`
   * `python -m pip install beautifulsoup4`
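
Note 2 above relies on the query reaching Google intact. Operators such as `site:` and quotation marks contain characters that are normally percent-encoded in a URL, so one way to build the page URLs is with the standard library's `urllib.parse.urlencode`. The following is only a sketch: the helper name `build_search_urls` is hypothetical and not part of this notebook, while the `q` and `start` parameters are the ones the notebook's own `get_page_url_list` uses.

```python
from urllib.parse import urlencode

SEARCH_PREFIX = "https://google.com/search?"  # same host the notebook uses


def build_search_urls(query, pages):
    """Return one search URL per results page, with the query percent-encoded."""
    urls = []
    for page in range(pages):
        # urlencode turns spaces into '+' and escapes characters such as ':' and '"'.
        params = urlencode({"q": query, "start": page * 10})
        urls.append(SEARCH_PREFIX + params)
    return urls


for url in build_search_urls('site:wikipedia.org "mapuche tribe"', 2):
    print(url)
```

Encoding the query this way means it no longer needs its spaces replaced with `+` by hand.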


In [3]:
import requests
import re
from bs4 import BeautifulSoup
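
If the imports above fail, one of the packages from note 4 is probably missing. A small sketch for checking this without triggering an `ImportError`, using the standard library's `importlib.util.find_spec`; the helper name `missing_packages` is made up for this example. Note that BeautifulSoup is installed as `beautifulsoup4` but imported as `bs4`.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the import names from module_names that cannot be found."""
    return [name for name in module_names if find_spec(name) is None]


# Import names (not pip names) of the libraries this notebook uses.
missing = missing_packages(["requests", "bs4"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```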

In [31]:
# Note: You may need to change user_agent to your own user agent, given by this page: 
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
# (see this StackOverflow answer: https://stackoverflow.com/a/66866462)
user_agent = None 

# sample user agent
# user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
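
Note 3 above says Google may start limiting requests. A hedged sketch of how that might be detected: it assumes a throttled request comes back with HTTP status 429 (Too Many Requests) or ends up at a URL whose path starts with `/sorry`. That is commonly reported behaviour, but it is an assumption here and not something this notebook checks. The function names `looks_rate_limited` and `backoff_delays` are hypothetical.

```python
from urllib.parse import urlparse


def looks_rate_limited(status_code, final_url):
    """Guess whether a response indicates throttling (assumed signals, see above)."""
    if status_code == 429:  # HTTP 429 Too Many Requests
        return True
    return urlparse(final_url).path.startswith("/sorry")


def backoff_delays(retries, base_seconds=5):
    """Return wait times in seconds that double with every retry."""
    return [base_seconds * 2 ** attempt for attempt in range(retries)]


print(looks_rate_limited(200, "https://google.com/search?q=test"))
print(backoff_delays(3))
```

With requests, the two arguments would be `result.status_code` and `result.url`. Sleeping for each delay in turn with `time.sleep` before retrying is one option, alongside changing `user_agent`.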

In [32]:
def strip_anchor_url(url):
    """Strip Google's '/url?q=' redirect prefix and the trailing '&sa=' parameters."""
    url = url.replace('/url?q=', '')
    url = url.split('&sa=', 1)[0]
    return url


def get_page_url_list(num, query):
    """Build one search URL per results page; the "start" offset advances by 10 per page."""
    search_engine_prefix = "https://google.com/search?q="
    return [search_engine_prefix + query + "&start=" + str(i * 10) for i in range(num)]


def get_results_on_page(url, output_file, counter):
    """Fetch one results page and write a numbered "title,link" row for each result.

    Returns (success, next_counter).
    """
    headers = {"User-Agent": user_agent}
    try:
        # Without a timeout, requests can wait indefinitely for a response.
        result = requests.get(url, headers=headers, timeout=10)
    except (requests.ConnectionError, requests.Timeout):
        print("Issue connecting to website. Check URL and internet connection.")
        return False, counter
    soup = BeautifulSoup(result.text, "html.parser")

    # Result titles are <h3> elements; the loop below keeps those whose parent is an <a> link.
    h3s = soup.find_all('h3')

    output_file.write("url," + url + '\n')
    print("\nurl,", url)
    
    for h in h3s:
        url_title = h.getText()
        try:
            link = h.parent
            if link.name == 'a':
                link_url = link.get('href')
                link_url = strip_anchor_url(link_url)
                output_file.write(str(counter) + "," + url_title + ',' + link_url + '\n')
                print(str(counter), url_title, link_url, sep=',')
                counter += 1
        except AttributeError:
            continue
    return True, counter


def get_inputs():
    default_selection = input("Use default search settings (yes/no)? (type q or quit to quit) ").strip().lower()
    if default_selection in ("yes", "y"):
        text = "mapuche tribe argentina chile"
        num_results = 1
    elif default_selection in ("q", "quit"):
        return None, None
    else:
        text = input("Enter search query: ")
        pages_of_results = input("Enter how many pages of results you would like (1 page = ~10 results): ")
        try:
            # Clamp to the supported range of 1 to 10 pages.
            num_results = max(1, min(int(pages_of_results), 10))
        except ValueError:
            print("Invalid input, defaulting to 1.")
            num_results = 1

    return text, num_results


def main():
    text, pages_of_results = get_inputs()
    while text is not None:
        file_name = text.replace(' ', '_') + "_search_results.txt"
        text = text.replace(' ', '+')
        page_urls = get_page_url_list(pages_of_results, text)
        counter = 1  # result numbering restarts for every query
        # The with-block closes the file even if fetching a page raises an error.
        with open(file_name, "w", encoding='utf-8') as output_file:
            for search_url in page_urls:
                results_exist, counter = get_results_on_page(search_url, output_file, counter)
                if not results_exist:
                    break
        print("Query finished.\n")
        text, pages_of_results = get_inputs()
        
    print("Program finished.")
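
`strip_anchor_url` above trims the `/url?q=` wrapper with string operations, which leaves any percent-encoded characters in the target link as they are. As a possible alternative, the sketch below reads the `q` parameter with `urllib.parse`, which also decodes it. The function name `extract_target_url` and the sample href are invented for illustration, and the sketch assumes result links keep the `/url?q=...&sa=...` shape.

```python
from urllib.parse import urlparse, parse_qs


def extract_target_url(href):
    """Return the decoded 'q' parameter of a '/url?q=...' link, or href unchanged."""
    parsed = urlparse(href)
    if parsed.path != "/url":
        return href
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else href


sample = "/url?q=https://example.com/page%3Fid%3D7&sa=U&ved=abc"
print(extract_target_url(sample))  # prints https://example.com/page?id=7
```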

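Because the output file name is built from the query, operators such as `site:` or quoted phrases put characters like `:` and `"` into it, and Windows does not allow those in file names. One possible workaround, sketched here with a hypothetical helper `safe_file_name` that is not part of the notebook, replaces every run of characters other than letters, digits, `_`, `.` and `-` with a single underscore.

```python
import re


def safe_file_name(query, suffix="_search_results.txt"):
    """Turn a search query into a file name that avoids reserved characters."""
    # Collapse each run of disallowed characters into one underscore.
    cleaned = re.sub(r"[^\w.-]+", "_", query).strip("_")
    # Fall back to a fixed stem if nothing usable is left.
    return (cleaned or "query") + suffix


print(safe_file_name('site:wikipedia.org "mapuche tribe"'))
```

For plain queries such as the default one, this gives the same name as the notebook's own space-to-underscore replacement.
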
In [None]:
main()
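
Each output file contains a `url,<search page URL>` line for every page fetched, followed by numbered `counter,title,link` rows. Titles can contain commas, so the file is not strictly valid CSV. A rough sketch of reading it back, assuming the links themselves contain no commas; `parse_results` is a hypothetical helper and the sample text is invented.

```python
def parse_results(text):
    """Return (counter, title, link) tuples from the notebook's output format."""
    rows = []
    for line in text.splitlines():
        if not line or line.startswith("url,"):
            continue  # skip blank lines and the per-page header lines
        # Split the counter off the front and the link off the end,
        # so commas inside the title are preserved.
        counter, rest = line.split(",", 1)
        title, link = rest.rsplit(",", 1)
        rows.append((int(counter), title, link))
    return rows


sample = (
    "url,https://google.com/search?q=mapuche&start=0\n"
    "1,Mapuche - Wikipedia,https://en.wikipedia.org/wiki/Mapuche\n"
    "2,History, culture, and language,https://example.com/mapuche\n"
)
for row in parse_results(sample):
    print(row)
```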
