
In [20]:
# prompt: extract sample of data from HTML website using BeautifulSoup

import requests
from bs4 import BeautifulSoup

def extract_sample_data(url, sample_size=10):
    """
    Extracts a sample of data from an HTML website.

    Args:
        url: The URL of the website to scrape.
        sample_size: The number of data points to extract.

    Returns:
        A list of strings, where each string is a data point from the website.
        Returns None if there's an error or no data found.
    """
    try:
        response = requests.get(url, timeout=30)  # Avoid hanging on an unresponsive server
        response.raise_for_status()  # Raise an exception for bad status codes

        soup = BeautifulSoup(response.content, 'html.parser')

        # Example: Extract text from all <p> tags
        data_points = [p.get_text(strip=True) for p in soup.find_all('p')]

        # Handle cases where there are no <p> tags or data.
        if not data_points:
            print('No data points found.')
            return None

        # Return a sample
        return data_points[:min(sample_size, len(data_points))]

    except requests.exceptions.RequestException as e:
        print(f"Error fetching the URL: {e}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


# Example usage
url = "https://www.uxmatters.com/mt/archives/2016/11/the-name-of-the-practice-is-information-architecture.php"
sample_data = extract_sample_data(url)

if sample_data:
    print("Sample Data:")
    for item in sample_data:
        print(item)



# Pros: BeautifulSoup, with assistance from Google Gemini, makes it fast and easy to scrape information from publicly accessible websites.
# Cons: The resulting output is quite messy and dense. It requires further structuring to improve readability.

Sample Data:
Today,information architecture(IA) is a recognized term in many technology, product, and Web-design organizations. However, in many other organizations, information architecture is still “the pain with no name.” [1] If you ask senior practitioners of information architecture, they’ll tell you that information architecture is central to the creation of human-computer interfaces. But the fact of the matter is that the popular view of information architecture represents just a very small subset of its total value.
In this column, I’ll first summarize the popular conception of the practice of information architecture, then I’ll highlight the broader scope of the practice that still remains to be realized.
For more than two decades, information-architecture pioneers, passionate thought leaders, and pundits have argued the merits of information architecture. The popular conception of the scope of information architecture has derived from the writings of Peter Morville and Lou Ro
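
The "further structuring" mentioned in the cons could look something like the sketch below. It is a minimal example using only the standard library, not part of the original notebook: the helper name `format_sample` and the `min_chars` and `width` defaults are assumptions chosen for illustration. It collapses stray whitespace, drops very short fragments, and wraps each paragraph under a numbered label.

```python
import re
import textwrap


def format_sample(paragraphs, min_chars=40, width=80):
    """Turn a list of scraped strings into numbered, wrapped paragraphs.

    min_chars and width are illustrative defaults, not tuned values.
    """
    blocks = []
    for text in paragraphs:
        # Collapse runs of whitespace (including newlines) into single spaces
        text = re.sub(r"\s+", " ", text).strip()
        # Skip fragments too short to be a real paragraph (captions, bylines)
        if len(text) < min_chars:
            continue
        label = f"[{len(blocks) + 1}] "
        blocks.append(
            textwrap.fill(
                text,
                width=width,
                initial_indent=label,
                subsequent_indent=" " * len(label),
            )
        )
    return "\n\n".join(blocks)


demo = [
    "  Short.  ",
    "Information   architecture is central to the creation of "
    "human-computer interfaces, according to senior practitioners.",
]
print(format_sample(demo))
```

The same function could be applied to the list returned by `extract_sample_data` before printing it.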

In [21]:
# prompt: extract sample of data from PDF file

!pip install PyPDF2

import PyPDF2

def extract_sample_text_from_pdf(pdf_path, sample_size=10):
    """
    Extracts a sample of text from a PDF file.

    Args:
        pdf_path: The path to the PDF file.
        sample_size: The number of text snippets to extract.

    Returns:
        A list of strings, where each string is a text snippet from the PDF.
        Returns None if there's an error or the file is not found.
    """
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            text_snippets = []

            for page in reader.pages:
                # Guard against pages with no extractable text
                text = page.extract_text() or ""
                # Split the text into rough sentences by periods (adjust as needed)
                # and drop empty pieces so the "no text" check below can trigger
                snippets = [s for s in text.split('.') if s.strip()]
                text_snippets.extend(snippets)

            # Handle cases with no text.
            if not text_snippets:
                print("No text found in the PDF.")
                return None

            # Return a sample, handling cases where there's less than sample_size.
            return text_snippets[:min(sample_size, len(text_snippets))]

    except FileNotFoundError:
        print(f"Error: File not found at {pdf_path}")
        return None
    except Exception as e:
        print(f"An error occurred: {e}")
        return None


# Example usage (the PDF must first be uploaded to the Colab session; replace the path as needed)
pdf_file_path = 'digital-trade-strategy.pdf'
sample_text = extract_sample_text_from_pdf(pdf_file_path)

if sample_text:
    print("Sample Text from PDF:")
    for snippet in sample_text:
        print(snippet.strip())

# Pros: The PyPDF2 package makes it easy to extract text from a PDF file. Testing this method is very useful for my final project.
# Cons: This method was more labor-intensive, as I needed to manually download the PDF file from the DFAT website. I then uploaded it as a local file to Google Colab before extracting the text, with assistance from Gemini AI.


Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Sample Text from PDF:
DIGITAL TRADE STRATEGY  
April 2022
Digital trade strategy  i EXECUTIVE SUMMARY  
Objectives  
Keeping the global economy open and businesses trading is crucial for the economic recovery and ongoing 
prosperity  of Australia – and our region
Digital trade and the technologies that underpin it are fundamental 
to our economic growth and realis ing the G overnment’s vision to be a top 10 digital 
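
Splitting on every period leaves layout line breaks inside the snippets and would also cut decimals such as "3.5" in half. The sketch below is one possible alternative, not the method used in the notebook: the helper name `split_sentences` is hypothetical, and the regular expression is a rough heuristic that will still mis-split abbreviations followed by a capital letter (for example "e.g. The").

```python
import re


def split_sentences(page_text):
    """Split extracted PDF text into rough sentences (heuristic only)."""
    # Collapse line breaks and repeated spaces left over from the PDF layout
    flat = re.sub(r"\s+", " ", page_text).strip()
    # Split after ., ! or ? only when followed by whitespace and then
    # an uppercase letter or a digit, so "3.5" stays intact
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", flat)
    return [p for p in parts if p]


demo_text = (
    "Keeping the global economy open is crucial for the economic recovery and ongoing \n"
    "prosperity  of Australia.  Digital trade grew 3.5 per cent. It matters."
)
for sentence in split_sentences(demo_text):
    print(sentence)
```

Inside `extract_sample_text_from_pdf`, a helper like this could stand in for the `text.split('.')` step.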

In [19]:
pip install --upgrade tableau-api-lib

Collecting tableau-api-lib
  Downloading tableau_api_lib-0.1.50-py3-none-any.whl.metadata (9.2 kB)
Downloading tableau_api_lib-0.1.50-py3-none-any.whl (144 kB)
Installing collected packages: tableau-api-lib
Successfully installed tableau-api-lib-0.1.50


In [62]:
# prompt: Retrieve details of workbooks associated with a Tableau Public username using BeautifulSoup. Return as CSV file.

import requests
from bs4 import BeautifulSoup
import csv

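def get_tableau_public_workbooks(username):
    """Return a list of {'title', 'link'} dicts scraped from a Tableau Public profile page."""
    url = f"https://public.tableau.com/app/profile/{username}/vizzes"
    try:
        response = requests.get(url, timeout=30)
    except requests.exceptions.RequestException as e:
        print(f"Error fetching the URL: {e}")
        return []

    if response.status_code != 200:
        print(f"Failed to retrieve page for {username}. Status code: {response.status_code}")
        return []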

    soup = BeautifulSoup(response.text, 'html.parser')

    workbooks = []
    for workbook in soup.find_all('a', class_='workbook-card__link'):
        href = workbook.get('href')
        if not href:
            continue  # Skip anchors without a link target
        title = workbook.get('title')
        link = "https://public.tableau.com" + href
        workbooks.append({'title': title, 'link': link})

    return workbooks

def save_workbooks_to_csv(workbooks, filename):
    """Write the workbook dicts to a CSV file with 'title' and 'link' columns."""
    with open(filename, mode='w', newline='', encoding='utf-8') as csv_file:
        fieldnames = ['title', 'link']
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

        writer.writeheader()
        for workbook in workbooks:
            writer.writerow(workbook)

# Example usage
username = 'philip.hulcome1964'  # <-- Replace with the Tableau Public username
workbooks = get_tableau_public_workbooks(username)

if workbooks:
    save_workbooks_to_csv(workbooks, f'{username}_workbooks.csv')
    print(f"Saved {len(workbooks)} workbooks to {username}_workbooks.csv")
else:
    print("No workbooks found.")

# Pros: N/A, as this was an unsuccessful attempt.
# Cons: Based on my troubleshooting attempts, the likely cause is that Tableau Public loads workbook content dynamically, so BeautifulSoup couldn't retrieve those elements from the static HTML. I attempted to use Selenium instead, with no luck. This task was clearly beyond my coding ability...


No workbooks found.
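
One way to check the dynamic-loading hypothesis before reaching for Selenium is to inspect the raw HTML that `requests` receives. The sketch below is a standard-library illustration, not part of the original attempt: the names `diagnose_static_html` and `_TagCounter` are hypothetical, and the "likely dynamic" rule (no matching elements but at least one `<script>` tag) is only a heuristic. The two demo strings are made-up HTML, not Tableau Public's actual markup.

```python
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Counts <script> tags and tags matching a given name and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.matches = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts += 1
        # The class attribute may be missing or valueless, hence the "or"
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.matches += 1


def diagnose_static_html(html, tag, css_class):
    """Report whether the target elements exist in the static HTML."""
    counter = _TagCounter(tag, css_class)
    counter.feed(html)
    return {
        "matches": counter.matches,
        "scripts": counter.scripts,
        # Heuristic: nothing to scrape, but scripts that may render it later
        "likely_dynamic": counter.matches == 0 and counter.scripts > 0,
    }


# Made-up examples: an empty app shell versus a server-rendered link
shell_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
static_html = '<a class="workbook-card__link big" href="/x">Example</a>'

print(diagnose_static_html(shell_html, "a", "workbook-card__link"))
print(diagnose_static_html(static_html, "a", "workbook-card__link"))
```

Passing `response.text` from the cell above to `diagnose_static_html` would show whether any `workbook-card__link` anchors are present in what the server actually returned.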


In [63]:
# prompt: Retrieve sample data from CSV file through a website

import pandas as pd

def get_sample_from_csv_url(url, sample_size=5):
    """
    Retrieves a sample of data from a CSV file hosted on a URL.

    Args:
        url: The URL of the CSV file.
        sample_size: The desired sample size.

    Returns:
        A pandas DataFrame containing the sample data, or None if an error occurs.
    """
    try:
        df = pd.read_csv(url)
        return df.sample(n=min(sample_size, len(df)))  # Ensure sample_size doesn't exceed DataFrame length

    except pd.errors.ParserError:
        print(f"Error: Invalid CSV format at {url}.")
        return None
    except Exception as e:
        print(f"An error occurred while fetching data: {e}")
        return None

# Example usage
csv_url = "https://www.usitc.gov/sites/default/files/tata/hts/hts_2025_revision_10_csv.csv" # Replace with a valid URL pointing to your CSV file
sample_df = get_sample_from_csv_url(csv_url)

if sample_df is not None:
    print("Sample data from CSV:")
sample_df

# Pros: Pandas, with Gemini AI assistance, makes it easy to extract sample data from CSV files.
# Cons: Pandas loads the entire file into memory, which is not ideal for extremely large data files.


Sample data from CSV:


Unnamed: 0,HTS Number,Indent,Description,Unit of Quantity,General Rate of Duty,Special Rate of Duty,Column 2 Rate of Duty,Quota Quantity,Additional Duties
3883,,5,In airtight containers:,,,,,,
25177,8429.51.50,3,Other,,Free,,35%,,
1821,0702.00.20.92,4,Other,"[""kg""]",,,,,
21818,,5,Other:,,,,,,
32796,9902.08.06,0,Mixtures of Disperse Red 1042A (5-[2-(2-cyano-...,,Free,No change,No change,,
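
The memory concern noted in the cons can be sidestepped by streaming the file instead of loading it whole. The sketch below swaps pandas for the standard-library `csv` module and uses reservoir sampling, which keeps only `sample_size` rows in memory at any time. It is an illustrative alternative, not the approach used in the notebook: the function name `reservoir_sample_csv` and the `seed` parameter are assumptions, and the demo data is synthetic.

```python
import csv
import io
import random


def reservoir_sample_csv(lines, sample_size=5, seed=None):
    """Draw a uniform random sample of rows from an iterable of CSV lines.

    Only sample_size rows are held in memory at once (reservoir sampling).
    Returns (header, rows).
    """
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader, None)
    sample = []
    for i, row in enumerate(reader):
        if i < sample_size:
            # Fill the reservoir with the first sample_size rows
            sample.append(row)
        else:
            # Replace an existing entry with probability sample_size / (i + 1)
            j = rng.randint(0, i)
            if j < sample_size:
                sample[j] = row
    return header, sample


# Synthetic demo data standing in for a large file
demo_csv = io.StringIO("id,value\n" + "".join(f"{i},{i * i}\n" for i in range(100)))
header, rows = reservoir_sample_csv(demo_csv, sample_size=5, seed=0)
print(header)
for row in rows:
    print(row)
```

Any iterable of text lines works as input, including an open file object, so a downloaded CSV could be sampled without building a full DataFrame first.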
