## Task Description

Navigate to the list of faculty in the Anesthesiology Department for each institution. Be aware that each website is very different: the links for Upstate and Westchester already point to the Anesthesiology program, but you will still have to navigate to the faculty page, while the New Mexico website has to be filtered first to show only the anesthesiology faculty.


Once you find the appropriate pages with the faculty listed, I would like you to create an Excel file in which each institution has its own sheet. Each sheet should have four headers: First Name, Last Name, Email, Error. Each clinician's first name and last name should be in the appropriate cells, along with their email (if scrapable). Error should remain blank for now. There should not be any other text in the First Name or Last Name columns (no MD, DO, or other titles, just a single name in each).

In [79]:
# Install dependencies

!pip install beautifulsoup4
!pip install pandas
!pip install xlsxwriter
!pip install pytesseract
!pip install pillow


In [101]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import pytesseract
from PIL import Image, ImageOps
from io import BytesIO

In [102]:
# Helper functions

def add_whitespace_border(image, border_size):
    """
    Adds a whitespace border to an image.

    Args:
        image: PIL Image object.
        border_size: Width of the border in pixels.

    Returns:
        PIL Image object with the added border.
    """
    width, height = image.size
    new_width = width + border_size * 2
    new_height = height + border_size * 2

    # Create a new white image with the larger size
    bordered_image = Image.new(image.mode, (new_width, new_height), color='white')

    # Paste the original image onto the new image with the border offset
    bordered_image.paste(image, (border_size, border_size))

    return bordered_image

In [103]:
# Define URLs for each institution
urls = {
    "UNM": "https://hsc.unm.edu/directory/index.json",
    "Upstate": "https://www.upstate.edu/anesthesiology/about-us/faculty.php",
    "Westchester": "https://www.westchestermedicalcenter.org/anesthesiology-residency-program",
}

# Create an empty dictionary to store dataframes for each institution
dfs = {}

# Function to extract faculty information from a given URL
def extract_faculty_data(url, institution_name):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    faculty_data = [] # Initialize an empty list to store data

    if "hsc.unm.edu" in url:
        # Logic for University of New Mexico
        data = response.json()
        
        # Filter for Anesthesiology faculty
        anesthesiology_faculty = [
            faculty
            for faculty in data["faculty"]
            if "SOM - Anesthesiology" in faculty.get("departments", [])
        ]
        
        # Extract and format data
        faculty_data = []
        for faculty in anesthesiology_faculty:
            first_name = faculty["firstName"].strip()
            last_name = faculty["lastName"].strip()
            email = ""  # Email not available in this dataset
            faculty_data.append([first_name, last_name, email, ""])


    elif "upstate.edu" in url:
        # Logic for Upstate Medical University
        faculty_page = soup.find("div", {"id": "faculty_page"})
        
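        for link in faculty_page.find_all('a', href=True):
            # Resolve the link against the listing URL (works for relative and absolute hrefs)
            person_response = requests.get(requests.compat.urljoin(url, link['href']))
            person_page = BeautifulSoup(person_response.content, 'html.parser')
        
            # The email is rendered as an image inside the "block-email" div
            email_block = person_page.find("div", {"class": "block-email"})
            email_image_link = requests.compat.urljoin(url, email_block.find("img")["src"])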
        
            # Run OCR on image to extract email
            email = ""
            try:
                # Download and read the image
                image_data = requests.get(email_image_link).content
                image = Image.open(BytesIO(image_data))
                image = add_whitespace_border(image, border_size=20)  # Add 20px border
                
                # OCR with configuration: --psm 6 treats the image as a single uniform block of text
                email = pytesseract.image_to_string(image, config='--psm 6')
                email = email.strip()  # Remove surrounding whitespace and the trailing newline
            except Exception as e:
                print(f"Error extracting email: {e}")
        
            split_text = link.contents[0].split(",")
            
            if len(split_text) > 1:
                full_name = split_text[0]
                full_name = full_name.split(".")
                if len(full_name) > 1:
                    first_name = full_name[0] + "."
                    last_name = full_name[-1]
                    faculty_data.append([first_name.strip(), last_name.strip(), email.strip(), ""])
                else:
                    full_name = split_text[0].split(" ")
                    first_name = full_name[0]
                    last_name = full_name[-1]
                    faculty_data.append([first_name.strip(), last_name.strip(), email.strip(), ""])

    elif "westchestermedicalcenter.org" in url:
        # Logic for Westchester Medical Center
        faculty_heading = soup.find("h2", string="Faculty")

        if faculty_heading:
            faculty_container = faculty_heading.parent
            #print(f"Debug (faculty_container): {faculty_container}")

            # Extract faculty information
            faculty_data = []
            strong_tags = faculty_container.find_all("strong")  
            for strong_tag in strong_tags:
                split_text = strong_tag.text.split(",")
                if len(split_text) > 1:
                    full_name = split_text[0]
                    full_name = full_name.split(".")
                    if len(full_name) > 1:
                        first_name = full_name[0] + "."
                        last_name = full_name[-1]
                        faculty_data.append([first_name.strip(), last_name.strip(), "", ""])
                    else:
                        full_name = split_text[0].split(" ")
                        first_name = full_name[0]
                        last_name = full_name[-1]
                        faculty_data.append([first_name.strip(), last_name.strip(), "", ""]) 
        else:
            print(f"Error: 'Faculty' heading not found on the {institution_name} webpage.")

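    else:
        # Fall through with no rows so the caller still receives a DataFrame
        print(f"Unsupported website: {url}")

    # Create a pandas DataFrame with the required headers (order matters).
    # Passing the columns here also works when no rows were scraped.
    df = pd.DataFrame(faculty_data, columns=["First Name", "Last Name", "Email", "Error"])

    return df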

# Iterate through URLs and extract faculty data for each institution
for institution_name, url in urls.items():
    dfs[institution_name] = extract_faculty_data(url, institution_name)

file_name = "anesthesiology_faculty.xlsx"
with pd.ExcelWriter(file_name, engine='xlsxwriter') as writer:
    # Write each DataFrame to a separate sheet in the Excel file
    for institution_name, df in dfs.items():
        df.to_excel(writer, sheet_name=institution_name, index=False)
        print(f"Saved {institution_name} data to {file_name}")

Unnamed: 0,First Name,Last Name,Email,Error
0,Christopher,Arndt,,
1,Nichole,Bordegaray,,
2,Elizabeth,Baker,,
3,Emily,Bui,,
4,Niels,Chapman,,
5,Nivine,Doran,,
6,Andrea,Sandoval,,
7,Jim,Savage,,
8,Jacob,Rothfork,,
9,Aliaksandr,Daroshka,,
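
The UNM rows above depend on filtering the directory JSON down to a single department. A minimal standalone sketch of that filter follows, assuming the payload keeps the `faculty`, `departments`, `firstName`, and `lastName` keys used in the notebook. The `filter_by_department` helper and the sample records are hypothetical and only illustrate the shape of the data.

```python
def filter_by_department(records, department):
    """Return (first, last) tuples for records listing `department`.

    Hypothetical helper: missing keys are tolerated via .get() so that an
    incomplete record is skipped or blanked instead of raising KeyError.
    """
    rows = []
    for record in records:
        if department in record.get("departments", []):
            first = record.get("firstName", "").strip()
            last = record.get("lastName", "").strip()
            rows.append((first, last))
    return rows


# Made-up sample payload mirroring the keys the notebook reads
sample = {
    "faculty": [
        {"firstName": " Ada ", "lastName": "Example", "departments": ["SOM - Anesthesiology"]},
        {"firstName": "Ben", "lastName": "Sample", "departments": ["SOM - Pediatrics"]},
        {"firstName": "Cy", "lastName": "Placeholder"},  # no departments key
    ]
}

rows = filter_by_department(sample["faculty"], "SOM - Anesthesiology")
print(rows)
```

Because `departments` is treated as a list here, the `in` test is an exact membership check; if the real payload stored a single string instead, the same expression would become a substring match.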


https://www.upstate.edu/scripts/eimage.php?string=090050057121097109108121081072086119099051082104100071085117090087082049
https://www.upstate.edu/scripts/eimage.php?string=098071108109081072086119099051082104100071085117090087082049
https://www.upstate.edu/scripts/eimage.php?string=089087120112099048066049099072078048089088082108076109086107100081061061
https://www.upstate.edu/scripts/eimage.php?string=090071086116090088074122098071086065100088066122100071070048090083053108090072085061
https://www.upstate.edu/scripts/eimage.php?string=098071057119090088112106081072086119099051082104100071085117090087082049
https://www.upstate.edu/scripts/eimage.php?string=099109086122100071108113081072086119099051082104100071085117090087082049
https://www.upstate.edu/scripts/eimage.php?string=099109057116089087053118090069066049099072078048089088082108076109086107100081061061
https://www.upstate.edu/scripts/eimage.php?string=09905007012110005007012109808506604909907207804808908808210807610908610710008
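
The `string` parameter in the image URLs above appears to be the email itself, lightly obfuscated: each group of three digits is a decimal character code, and the resulting characters form a Base64 string. The sketch below decodes it with the standard library only, which could serve as a cross-check on the OCR result. It assumes the site keeps this encoding, which is inferred solely from the URLs printed above; `decode_eimage_url` is a hypothetical helper name.

```python
import base64
from urllib.parse import parse_qs, urlparse


def decode_eimage_url(image_url):
    """Decode the email hidden in an eimage.php URL (assumed encoding).

    Assumption: the `string` query parameter is a run of 3-digit decimal
    character codes that spell out a Base64-encoded email address.
    """
    digits = parse_qs(urlparse(image_url).query)["string"][0]
    if len(digits) % 3 != 0:
        raise ValueError("digit string length is not a multiple of three")

    # Turn each 3-digit group into one character of the Base64 text
    b64_text = "".join(
        chr(int(digits[i:i + 3])) for i in range(0, len(digits), 3)
    )
    return base64.b64decode(b64_text).decode("ascii")


first_url = (
    "https://www.upstate.edu/scripts/eimage.php?string="
    "090050057121097109108121081072086119099051082104"
    "100071085117090087082049"
)
decoded = decode_eimage_url(first_url)
print(decoded)  # gorjir@upstate.edu
```

The decoded value matches the email in row 0 of the Upstate table. A digit string whose length is not a multiple of three, such as a truncated URL, raises `ValueError` rather than returning a partial address.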

Unnamed: 0,First Name,Last Name,Email,Error
0,Reza,Gorji,gorjir@upstate.edu,
1,Fenghua,Li,lif@upstate.edu,
2,Syed,Ali,alis@upstate.edu,
3,Elizabeth,Lavelle,demersle@upstate.edu,
4,Carlos,Lopez,lopezc@upstate.edu,
5,Joseph,Resti,restij@upstate.edu,
6,David,Romano,romanod@upstate.edu,
7,Muhammad,Sarwar,sarwarm@upstate.edu,
8,Vandana,Sharma,sharmav@upstate.edu,
9,Xiuli,Zhang,zhangx@upstate.edu,


Unnamed: 0,First Name,Last Name,Email,Error
0,Peter J.,Panzica,,
1,A.,Elisabeth Abramowicz,,
2,Sarah,Smith,,
3,Irim,Salik,,
4,Richard,Yeom,,
5,Nitin,Sekhri,,
6,Ashley M.,Kelley,,
7,Iyabo,Muse,,
8,Garret M.,Weber,,
9,Michael,Rahimi,,
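
Rows 0, 1, 6, and 8 above still carry initials, and row 1 puts two names in the Last Name column, which conflicts with the single-name requirement in the task description. One possible way to tighten the parsing is sketched below. `split_name`, the credential and title lists, and the input strings are assumptions for illustration, since the exact text on the Westchester page is not shown here.

```python
# Illustrative lists; extend them for the credentials a given site uses
CREDENTIALS = {"MD", "DO", "PhD", "MBA", "MPH", "MS", "CRNA", "FASA"}
TITLES = {"Dr", "Prof"}


def split_name(raw):
    """Return (first_name, last_name) with titles, degrees, and initials removed."""
    kept = []
    for token in raw.replace(",", " ").split():
        bare = token.replace(".", "")  # "M.D." -> "MD", "J." -> "J"
        if bare in CREDENTIALS or bare in TITLES:
            continue  # drop degrees and honorifics (case-sensitive match)
        if len(bare) <= 1:
            continue  # drop single-letter initials
        kept.append(token)

    if not kept:
        return "", ""
    if len(kept) == 1:
        return kept[0], ""
    # First remaining token is the first name, last remaining is the surname
    return kept[0], kept[-1]


print(split_name("Peter J. Panzica, MD"))
print(split_name("A. Elisabeth Abramowicz, MD"))
```

The credential match is case-sensitive on purpose, so a surname such as "Do" is kept while the degree "DO" is dropped. Multi-word surnames are still reduced to their final token, which may or may not be what the spreadsheet needs.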


Saved UNM data to anesthesiology_faculty.xlsx
Saved Upstate data to anesthesiology_faculty.xlsx
Saved Westchester data to anesthesiology_faculty.xlsx
