## Module 10 Assignment - Scraping a Website
* Author: Brandon Chiazza
* Version 2.0

We will build a web scraper that parses the search results table from the New York State Charities Bureau website. From the website: “All 
charitable organizations operating in New York State are required by law to register and file annual financial reports 
with the Attorney General's Office. This includes any organization that conducts charitable activities, holds property 
that is used for charitable purposes, or solicits financial or other contributions.”

In [21]:
### Load modules
#!pip install webdriver-manager
#!pip install awscli
#!pip install s3fs  # pandas needs s3fs to write to s3:// paths
import selenium
import pandas as pd
import time
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

#### SCRAPE THE WEBSITE ######
### Call the webdriver
s = Service(ChromeDriverManager().install())
browser = webdriver.Chrome(service=s)

# Enter the URL path that needs to be accessed by webdriver
browser.get('https://www.charitiesnys.com/RegistrySearch/search_charities.jsp')

# Locate the EIN input by its XPath and send "0" as the search value
ein_xpath = '//*[@id="header"]/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[2]/td[2]/input[1]'
inputElement = browser.find_element(By.XPATH, ein_xpath)
inputElement.send_keys('0')

# Locate the search button and click it (click() returns None, so there is nothing to store)
search_xpath = '//*[@id="header"]/div[2]/div/table/tbody/tr/td[2]/div/div/font/font/font/font/font/font/table/tbody/tr[4]/td/form/table/tbody/tr[10]/td/input[1]'
browser.find_element(By.XPATH, search_xpath).click()
sleep(4)  # fixed pause to let the results page load

# Identify the table to scrape
table = browser.find_element(By.CSS_SELECTOR, 'table.Bordered')
sleep(1)

##### CREATE DATA FRAME #####
# Create empty list to store data
data = []

# Loop through table rows, keeping only rows that contain <td> cells (rows without any, such as blank rows, are skipped)
for row in table.find_elements(By.CSS_SELECTOR, 'tr'):
    # Extract text from cells in the row
    row_data = [cell.text for cell in row.find_elements(By.CSS_SELECTOR, 'td')]
    # Check if the row contains data
    if row_data:
        data.append(row_data)

# Convert list to DataFrame
df = pd.DataFrame(data, columns=["Organization Name", "NY Reg #", "EIN", "Registrant Type", "City", "State"])

### LOAD THE FILE INTO S3 ####
# Prepare the CSV file name
bucket_name = 'm10assignment'  # specify your S3 bucket name
pathname = f's3://{bucket_name}/'  # specify location of s3://{my-bucket}/
filename = 'charities_bureau_scrape_'  # name of your file
timestamp = time.strftime("%Y%m%d%H%M%S")  # local-time timestamp
filenames3 = f"{pathname}{filename}{timestamp}.csv"  # full S3 path of the csv file

# Load file into S3. pandas writes to s3:// paths through the optional s3fs package (via fsspec),
# so it can push the file directly into an S3 bucket once s3fs is installed and AWS credentials are configured
df.to_csv(filenames3, header=True, index=False)

# Print success message
print("Successfully uploaded file to location:", filenames3)


Successfully uploaded file to location: s3://m10assignment/charities_bureau_scrape_20240420074402.csv


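The row loop in the cell above keeps every row that has at least one `td` cell, then passes the rows to `pd.DataFrame` with six fixed column names. A row with a different number of cells can make that construction fail or leave values misaligned. A minimal sketch of a guard follows; `clean_rows` and the sample rows are made-up placeholders, not data from the site.

```python
EXPECTED_COLUMNS = [
    "Organization Name", "NY Reg #", "EIN", "Registrant Type", "City", "State",
]


def clean_rows(rows, n_columns=len(EXPECTED_COLUMNS)):
    """Return whitespace-stripped rows that have exactly n_columns cells."""
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        # Skip rows with the wrong cell count or with only empty cells
        if len(cells) == n_columns and any(cells):
            cleaned.append(cells)
    return cleaned


# Placeholder rows for illustration only (not real registry data)
sample = [
    [],
    ["Example Org ", "00-00-00", "000000000", "NFP", "Albany", "NY"],
    ["stray footer cell"],
]
print(clean_rows(sample))
```

The cleaned list could then be passed to `pd.DataFrame(..., columns=EXPECTED_COLUMNS)` in place of `data`.
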
In [22]:
### LOAD THE FILE INTO S3 ####
# Prepare the CSV file name
bucket_name = 'm10assignment'  # specify your S3 bucket name
pathname = f's3://{bucket_name}/'  # specify location of s3://{my-bucket}/
filename = 'charities_bureau_scrape_'  # name of your file
timestamp = time.strftime("%Y%m%d%H%M%S")  # local-time timestamp
filenames3 = f"{pathname}{filename}{timestamp}.csv"  # full S3 path of the csv file

# Load file into S3. pandas writes to s3:// paths through the optional s3fs package (via fsspec),
# so it can push the file directly into an S3 bucket once s3fs is installed and AWS credentials are configured
df.to_csv(filenames3, header=True, index=False)

# Print success message
print("Successfully uploaded file to location:", filenames3)

Successfully uploaded file to location: s3://m10assignment/charities_bureau_scrape_20240420074406.csv
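
Both upload cells build the timestamped file path inline, so each run writes a new object to the bucket. The naming rule can be sketched as a small reusable function. The name `build_s3_uri` is hypothetical, and using UTC is an assumption; the cells above use local time via `time.strftime`.

```python
from datetime import datetime, timezone


def build_s3_uri(bucket, prefix, now=None):
    """Return s3://<bucket>/<prefix><YYYYmmddHHMMSS>.csv for the given moment."""
    # Default to the current UTC time when no timestamp is supplied
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d%H%M%S")
    return f"s3://{bucket}/{prefix}{stamp}.csv"


# Example with a fixed moment so the result is reproducible
fixed = datetime(2024, 4, 20, 7, 44, 2, tzinfo=timezone.utc)
print(build_s3_uri("m10assignment", "charities_bureau_scrape_", fixed))
```

Passing the timestamp in as an argument keeps the function deterministic, which makes the path format easy to check without touching S3.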


## References
* https://www.programiz.com/python-programming/working-csv-files
* https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_bucket
* https://realpython.com/python-boto3-aws-s3/
* https://robertorocha.info/setting-up-a-selenium-web-scraper-on-aws-lambda-with-python/ 