<a href="https://colab.research.google.com/github/zyferion/datta-able-bootstrap-dashboard/blob/master/notebooks/data_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Extraction Notebook

This notebook explores functions that scrape a website's 'Privacy Policy' or 'Terms and Conditions' page (including 'Terms of Service', 'Terms of Use', etc.) and extract its text section by section.

A key challenge is that different websites format these documents differently:
- some use numbered section headings (either X. or X.Y)
- some have no numbered headings and separate sections with an HTML heading tag (e.g., h2, h3)
- some use layouts that are harder to parse, such as div panels or JavaScript-rendered content
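
The two numbered styles in the list above can be told apart with a pair of regular expressions. The following is a minimal sketch and not part of the original notebook: the function name `classify_heading`, the pattern names, and the returned labels are all hypothetical, and real headings may need looser patterns.

```python
import re

# Hypothetical patterns: "1.3 Title" (X.Y) and "3. Title" (X.) at the start of a heading
NUMBERED_X_Y = re.compile(r"^\s*\d+\.\d+\b")
NUMBERED_X = re.compile(r"^\s*\d+\.(?!\d)")


def classify_heading(heading: str) -> str:
    """Return a rough label for the numbering style of a section heading."""
    if NUMBERED_X_Y.match(heading):
        return "numbered_x_y"
    if NUMBERED_X.match(heading):
        return "numbered_x"
    return "unnumbered"


# Made-up headings, one per style
for example in ["3. Respect", "1.3 Registration", "Contacting Us"]:
    print(example, "->", classify_heading(example))
```

Checking the X.Y pattern first matters, because a heading such as "1.3 Registration" also begins with a digit and a full stop.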


In [12]:
# Import required packages
import csv
import re

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Download the Punkt sentence tokenizer models used by nltk.sent_tokenize()
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [24]:
# Read in raw file from GitHub
url = 'https://raw.githubusercontent.com/zyferion/privacyguardians/main/data/urls_dataset.csv'

df = pd.read_csv(url)  # Dataset is now stored in a pandas DataFrame

In [4]:
df.head(10)

Unnamed: 0,application_name,document_type,url,text_format
0,netflix,privacy policy,https://help.netflix.com/en/legal/privacy,h_sections
1,netflix,terms and conditions,https://help.netflix.com/en/legal/termsofuse,num_sections_2
2,airbnb,privacy policy,https://www.airbnb.com.au/help/article/3175,num_sections_2
3,airbnb,terms and conditions,https://www.airbnb.com.au/help/article/2908,num_sections_2
4,whatsapp,privacy policy,https://www.whatsapp.com/legal/privacy-policy,h_sections
5,whatsapp,terms and conditions,https://www.whatsapp.com/legal/terms-of-service,h_sections
6,temu,privacy policy,https://www.temu.com/au/privacy-and-cookie-pol...,h_sections
7,temu,terms and conditions,https://www.temu.com/au/terms-of-use.html,num_sections_2
8,7plus,terms and conditions,https://support.7plus.com.au/hc/en-au/articles...,num_sections_1
9,7plus,privacy policy,https://www.sevenwestmedia.com.au/privacy-poli...,num_sections_1


In [5]:
# Count of URLs
len(df['url'])

35

**A function for websites that DO NOT have numbered sections and whose sections are identified by an h3 or h2 tag**
* Does not work on every site; the URLs that produce no rows are listed later in the notebook

In [19]:
# Function for splitting text into sentences
def split_text_into_sentences(text):
    # Use nltk.sent_tokenize() to split the text into sentences
    sentences = nltk.sent_tokenize(text)
    return sentences

# Given a URL, return a pandas DataFrame with one row per sentence, labelled with its section heading
# Only works on websites where an h3 or h2 tag separates each section

def scrape_and_store_h3_h2_text(url):
    data = []  # List to store extracted data

    # Send a GET request to the URL (the timeout stops a stalled server from hanging the notebook)
    response = requests.get(url, timeout=30)
    if response.status_code == 200:  # If the request is successful
        soup = BeautifulSoup(response.text, "html.parser")  # Create a BeautifulSoup object 'soup'
        heading_tags = soup.find_all(["h3", "h2"])  # Find all <h3> and <h2> tags
        for h_tag in heading_tags:
            section_text = ""
            next_sibling = h_tag.find_next_sibling()  # Get the next sibling of the <h3> or <h2> tag
            while next_sibling and (next_sibling.name not in ["h3", "h2"]):
                section_text += str(next_sibling)  # Include the entire tag content
                next_sibling = next_sibling.find_next_sibling()  # Move to the next sibling
            if section_text:
                section_text_cleaned = BeautifulSoup(section_text, "html.parser").get_text()  # Clean HTML content

                # Split the cleaned text into sentences
                sentences = split_text_into_sentences(section_text_cleaned)

                # Add each sentence as a new row in the data list
                for sentence in sentences:
                    data.append({"URL": url, "Section": h_tag.get_text(), "Text": sentence})
    else:
        print("Failed to retrieve the page:", url)

    df = pd.DataFrame(data)  # Create a DataFrame from the collected data
    return df

In [21]:
# https://help.netflix.com/en/legal/privacy
# https://www.whatsapp.com/legal/privacy-policy
# https://www.airbnb.com.au/help/article/3175 (numbered)

URL = "https://bumble.com/en-us/terms"

# Example: scrape a single page into its own DataFrame
# (a separate name keeps the URL dataset in `df` intact for the later cells)
example_df = scrape_and_store_h3_h2_text(URL)
pd.set_option("display.max_colwidth", None)  # Display full content of DataFrame columns

# Show result
example_df

Unnamed: 0,URL,Section,Text
0,https://bumble.com/en-us/terms,Bumble Terms and Conditions of Use,Welcome to Bumble’s Terms and Conditions of Use (these “Terms”).
1,https://bumble.com/en-us/terms,Bumble Terms and Conditions of Use,This is a contract between you and the Bumble Group (as defined further below) and we want you to know yours and our rights before you use the Bumble website or application (“Bumble” or the “App”).
2,https://bumble.com/en-us/terms,Bumble Terms and Conditions of Use,"Please take a few moments to read these Terms before enjoying the App, because once you access, view or use the App, you are going to be legally bound by these Terms (so probably best to read them first!"
3,https://bumble.com/en-us/terms,Bumble Terms and Conditions of Use,").Please be aware that if you subscribe to services for a term (the “Initial Term”), then the terms of your subscription will be automatically renewed for additional periods of the same duration as the Initial Term at Bumble’s then-current fee for such services, unless you cancel your subscription in accordance with Section 5 below.You should also note that Section 13 of these Terms contains provisions governing how claims that you and Bumble Group have against each other are resolved."
4,https://bumble.com/en-us/terms,Bumble Terms and Conditions of Use,"In particular, it contains an arbitration agreement that will, with limited exceptions, require disputes between us to be submitted to binding and final arbitration."
...,...,...,...
303,https://bumble.com/en-us/terms,Effective date,"The Terms were last updated on: July 24, 2023."
304,https://bumble.com/en-us/terms,Footer,Bumble on Instagram Bumble on Facebook Bumble on Twitter Bumble on Pinterest FAQ Frequently Asked Questions Events Contact Us Guidelines Careers Investors Modern Slavery Act Statement Terms & Conditions Privacy Policy Impressum Do not sell or share my personal information Seasonal Dating Guides © 2023 Bumble | All Rights Reserved
305,https://bumble.com/en-us/terms,Cookies consent,We use cookies to make our site work better.
306,https://bumble.com/en-us/terms,Cookies consent,This includes analytics cookies and advertising cookies.
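
The last rows of the output above come from the "Footer" and "Cookies consent" headings, which are page chrome rather than policy text. A minimal sketch of dropping such rows follows; the `NON_POLICY_SECTIONS` set and the `drop_non_policy_sections` function are hypothetical, and the set would need tuning for each site.

```python
import pandas as pd

# Hypothetical list of heading texts that mark page chrome rather than policy content
NON_POLICY_SECTIONS = {"footer", "cookies consent"}


def drop_non_policy_sections(scraped: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without rows whose Section heading is in NON_POLICY_SECTIONS."""
    normalised = scraped["Section"].str.strip().str.lower()
    return scraped[~normalised.isin(NON_POLICY_SECTIONS)].reset_index(drop=True)


# Small sample shaped like the scraper output
sample = pd.DataFrame(
    {
        "URL": ["https://bumble.com/en-us/terms"] * 3,
        "Section": ["Effective date", "Footer", "Cookies consent"],
        "Text": [
            "The Terms were last updated on: July 24, 2023.",
            "Bumble on Instagram Bumble on Facebook",
            "We use cookies to make our site work better.",
        ],
    }
)
print(drop_non_policy_sections(sample))
```

Matching on a lower-cased, stripped heading keeps the filter tolerant of stray whitespace and capitalisation differences between sites.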


In [92]:
# Store URLs in a list
url_list = list(df['url'])

# Scrape every URL and collect the per-page DataFrames
scraped_frames = [scrape_and_store_h3_h2_text(url) for url in url_list]

# Combine the results into a single DataFrame
# (DataFrame.append is deprecated and was removed in pandas 2.0; pd.concat replaces it)
result_df = pd.concat(scraped_frames, ignore_index=True)



Failed to retrieve the page: https://support.7plus.com.au/hc/en-au/articles/360060194292-Terms-and-Conditions
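
A failure like the one reported above can have several causes. One common cause, not confirmed for this URL, is a server that rejects requests without a browser-like `User-Agent` header. The sketch below shows a fetch helper that sends such a header, applies a timeout, and returns `None` on any request error. The name `fetch_html`, the header value, and the default timeout are assumptions made for illustration.

```python
from typing import Optional

import requests

# Assumed header value; some servers refuse the default python-requests User-Agent
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; policy-scraper-example)"}


def fetch_html(
    url: str,
    session: Optional[requests.Session] = None,
    timeout: float = 10.0,
) -> Optional[str]:
    """Return the page HTML, or None if the request fails or returns an error status."""
    http = session or requests.Session()
    try:
        response = http.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
        response.raise_for_status()  # Raises requests.HTTPError for 4xx/5xx responses
    except requests.RequestException as error:
        print(f"Failed to retrieve the page: {url} ({error})")
        return None
    return response.text
```

A scraper could call `fetch_html(url)` and skip parsing whenever the result is `None`. Including the error in the message makes it easier to tell a blocked request from a timeout.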



In [93]:
# View results
result_df.head(10)

Unnamed: 0,URL,Section,Text
0,https://help.netflix.com/en/legal/privacy,Contacting Us\r,"If you have general questions about your account or how to contact customer service for assistance, please visit our online help center at help.netflix.com."
1,https://help.netflix.com/en/legal/privacy,Contacting Us\r,"For questions specifically about this Privacy\r Statement, or our use of your personal information, cookies or similar technologies, please contact our Data Protection Officer/Privacy Office by email at privacy@netflix.com."
2,https://help.netflix.com/en/legal/privacy,Contacting Us\r,"The data controller of your personal information is Netflix, Inc."
3,https://help.netflix.com/en/legal/privacy,Contacting Us\r,"Please note that if you contact us to assist you, for your safety and ours we may need to authenticate your identity before\r fulfilling your request."
4,https://help.netflix.com/en/legal/privacy,Collection of Information\r,"We receive and store information about you such as:\r Information you provide to us: We collect information you provide to us which includes: your name, email address, address or postal code, payment method(s), telephone number, and other identifiers you might use (such as an in-game name)."
5,https://help.netflix.com/en/legal/privacy,Collection of Information\r,This will include gender and date of birth if you join an ad supported subscription\r plan.
6,https://help.netflix.com/en/legal/privacy,Collection of Information\r,"We collect this information in a number of ways, including when you enter it while using our service, interact with our customer service, or participate in surveys or marketing promotions;\r information when you choose to provide ratings, taste preferences, account settings (including preferences set in the ""Account"" section of our website), or otherwise provide information to us through our service or\r elsewhere."
7,https://help.netflix.com/en/legal/privacy,Collection of Information\r,"Information we collect automatically: We collect information about you and your use of our service, your interactions with us and our advertising, as well as information regarding your network, network devices, and\r your computer or other Netflix capable devices you might use to access our service (such as gaming systems, smart TVs, mobile devices, set top boxes, and other streaming media devices)."
8,https://help.netflix.com/en/legal/privacy,Collection of Information\r,"This information includes:\r your activity on the Netflix service, such as title selections, shows you have watched, ads viewed (if applicable), search queries, and activity in Netflix games;\r your interactions with our emails and texts, and with our messages through push and online messaging channels;\r details of your interactions with our customer service, such as the date, time and reason for contacting us, transcripts of any chat conversations, and if you call us, your phone number and call recordings;\r device IDs or other unique identifiers, including for your network devices (such as your router), and devices that are Netflix capable on your network; resettable device identifiers (also known as advertising identifiers), such as those on mobile devices, tablets, and streaming media devices that include such identifiers (see the ""Cookies and Internet Advertising""\r section below for more information);\r device and software characteristics (such as type and configuration), connection information including type (wifi, cellular), statistics on page views, referring source (for example, referral URLs), IP address (which can be\r used to tell us your general location, such as your city, state/province, and postal code), browser and standard web server log information;\r information collected via the use of cookies, web beacons and other technologies, including ad information (such as information on the availability and delivery of ads, the site URL, as well as the date and time)."
9,https://help.netflix.com/en/legal/privacy,Collection of Information\r,"(See our\r ""Cookies and Internet Advertising"" section for more details.)"


In [94]:
# Summary info on results
result_df.describe()

Unnamed: 0,URL,Section,Text
count,4083,4083,4083
unique,27,490,3738
top,https://www.airbnb.com.au/help/article/2908,Host Terms,Learn more
freq,492,154,12


In [95]:
# Identify URLs from the original set that are missing from the results
difference_set = set(df['url']) - set(result_df['URL'])

difference_list = list(difference_set)
difference_list

['https://sheingroup.com/privacy-policy/',
 'https://support.7plus.com.au/hc/en-au/articles/360060194292-Terms-and-Conditions',
 'https://sheingroup.com/terms-conditions/',
 'https://help.netflix.com/en/legal/termsofuse',
 'https://web.didiglobal.com/au/legal/privacy-policy/',
 'https://explore.zoom.us/en/terms/',
 'https://web.didiglobal.com/au/legal/passenger-agreement/']
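
One plausible reason a page yields no rows (not verified for the URLs listed above) is that `find_next_sibling()` only considers elements that share the heading's parent. If each heading sits alone inside a wrapper `<div>`, it has no siblings, so the collected section text stays empty. The sketch below uses made-up HTML to show the difference between the sibling search and `find_next()`, which walks forward in document order.

```python
from bs4 import BeautifulSoup

# Made-up HTML: the heading sits alone inside a wrapper <div>,
# so the paragraph that follows is not its sibling
html = """
<div class="panel"><div class="title"><h2>Your Data</h2></div>
<div class="body"><p>We collect your email address.</p></div></div>
"""

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h2")

# The sibling search stays inside <div class="title"> and finds nothing
print(heading.find_next_sibling())

# find_next() walks forward in document order, regardless of nesting
paragraph = heading.find_next("p")
print(paragraph.get_text())
```

Walking in document order recovers the paragraph here, but it also needs an explicit stopping rule (for example, stopping at the next heading) to avoid running into the following section.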

In [109]:
def remove_whitespace_and_carriage_returns(df, column_name):
    """
    Normalise whitespace in a column of a pandas DataFrame.

    Runs of whitespace (spaces, tabs, newlines, and carriage returns) are collapsed into a single space,
    leading and trailing spaces are removed, and backslashes and forward slashes are replaced with a space.

    Args:
        df (pd.DataFrame): The DataFrame containing the column.
        column_name (str): The name of the column to clean.

    Returns:
        pd.DataFrame: The same DataFrame, with the specified column cleaned in place.
    """
    # Check if the column exists in the DataFrame
    if column_name not in df.columns:
        raise ValueError(f"Column '{column_name}' not found in the DataFrame.")

    # Collapse each run of whitespace (including '\r' and '\n') into one space, then trim both ends
    df[column_name] = df[column_name].str.replace(r'\s+', ' ', regex=True).str.strip()

    # Replace literal '\' and '/' characters with a space
    # (regex=False treats the patterns as plain strings and avoids the pandas FutureWarning)
    df[column_name] = df[column_name].str.replace('\\', ' ', regex=False)
    df[column_name] = df[column_name].str.replace('/', ' ', regex=False)

    return df


In [110]:
result_df = remove_whitespace_and_carriage_returns(result_df, 'Text')
result_df.head(10)



Unnamed: 0,URL,Section,Text
0,https://help.netflix.com/en/legal/privacy,Contacting Us\r,"If you have general questions about your account or how to contact customer service for assistance, please visit our online help center at help.netflix.com."
1,https://help.netflix.com/en/legal/privacy,Contacting Us\r,"For questions specifically about this Privacy Statement, or our use of your personal information, cookies or similar technologies, please contact our Data Protection Officer Privacy Office by email at privacy@netflix.com."
2,https://help.netflix.com/en/legal/privacy,Contacting Us\r,"The data controller of your personal information is Netflix, Inc."
3,https://help.netflix.com/en/legal/privacy,Contacting Us\r,"Please note that if you contact us to assist you, for your safety and ours we may need to authenticate your identity before fulfilling your request."
4,https://help.netflix.com/en/legal/privacy,Collection of Information\r,"We receive and store information about you such as: Information you provide to us: We collect information you provide to us which includes: your name, email address, address or postal code, payment method(s), telephone number, and other identifiers you might use (such as an in-game name)."
5,https://help.netflix.com/en/legal/privacy,Collection of Information\r,This will include gender and date of birth if you join an ad supported subscription plan.
6,https://help.netflix.com/en/legal/privacy,Collection of Information\r,"We collect this information in a number of ways, including when you enter it while using our service, interact with our customer service, or participate in surveys or marketing promotions; information when you choose to provide ratings, taste preferences, account settings (including preferences set in the ""Account"" section of our website), or otherwise provide information to us through our service or elsewhere."
7,https://help.netflix.com/en/legal/privacy,Collection of Information\r,"Information we collect automatically: We collect information about you and your use of our service, your interactions with us and our advertising, as well as information regarding your network, network devices, and your computer or other Netflix capable devices you might use to access our service (such as gaming systems, smart TVs, mobile devices, set top boxes, and other streaming media devices)."
8,https://help.netflix.com/en/legal/privacy,Collection of Information\r,"This information includes: your activity on the Netflix service, such as title selections, shows you have watched, ads viewed (if applicable), search queries, and activity in Netflix games; your interactions with our emails and texts, and with our messages through push and online messaging channels; details of your interactions with our customer service, such as the date, time and reason for contacting us, transcripts of any chat conversations, and if you call us, your phone number and call recordings; device IDs or other unique identifiers, including for your network devices (such as your router), and devices that are Netflix capable on your network; resettable device identifiers (also known as advertising identifiers), such as those on mobile devices, tablets, and streaming media devices that include such identifiers (see the ""Cookies and Internet Advertising"" section below for more information); device and software characteristics (such as type and configuration), connection information including type (wifi, cellular), statistics on page views, referring source (for example, referral URLs), IP address (which can be used to tell us your general location, such as your city, state province, and postal code), browser and standard web server log information; information collected via the use of cookies, web beacons and other technologies, including ad information (such as information on the availability and delivery of ads, the site URL, as well as the date and time)."
9,https://help.netflix.com/en/legal/privacy,Collection of Information\r,"(See our ""Cookies and Internet Advertising"" section for more details.)"
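
In the output above the `Section` values still end in a carriage return, because only the `Text` column was cleaned. A minimal sketch of the same whitespace normalisation applied to a `Section` column, using a small made-up DataFrame:

```python
import pandas as pd

# Sample headings with the trailing carriage return seen in the scraped output
sample = pd.DataFrame({"Section": ["Contacting Us\r", "Collection of Information\r"]})

# Collapse whitespace runs (including '\r') to one space, then trim both ends
sample["Section"] = sample["Section"].str.replace(r"\s+", " ", regex=True).str.strip()

print(sample["Section"].tolist())
```

In the notebook itself, calling `remove_whitespace_and_carriage_returns(result_df, 'Section')` would have a similar effect, with the caveat that the function also replaces slashes in headings with spaces.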


In [111]:
# Download the result dataframe
from google.colab import files

# CSV file
#result_df.to_csv('extracted_data.csv', index=False)  # Set index=False to omit writing row numbers as the first column
#files.download('extracted_data.csv')

# XLSX file
result_df.to_excel("extracted_data.xlsx")
files.download('extracted_data.xlsx')



**A function to handle sections separated by numbers**

In [None]:
def scrape_section_to_dataframe(url):
    # Send an HTTP request to the URL and get the HTML content
    response = requests.get(url)
    html_content = response.text

    # Initialize BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')

    # Define the section pattern as 'X.Y' (e.g. '1.1')
    section_pattern = r'\d+\.\d+'

    # Initialize lists to store data
    urls = []
    section_patterns = []
    section_texts = []

    # Find every text node containing an X.Y number
    # (the 'text' argument is deprecated in Beautiful Soup; 'string' is its replacement)
    matching_elements = soup.find_all(string=re.compile(section_pattern))

    for element in matching_elements:
        # Take the text of the element that follows the matching text node
        section_text = element.find_next().text.strip()

        urls.append(url)
        # Note: this records the regular expression itself, not the section number that matched
        section_patterns.append(section_pattern)
        section_texts.append(section_text)

    data = {'URL': urls, 'Section': section_patterns, 'Text': section_texts}
    df = pd.DataFrame(data)
    return df


In [None]:
scrape_section_to_dataframe('https://web.didiglobal.com/au/legal/passenger-agreement/')


Unnamed: 0,URL,Section,Text
0,https://web.didiglobal.com/au/legal/passenger-...,\d+\.\d+,.gatsby-image-wrapper{position:relative;overfl...
1,https://web.didiglobal.com/au/legal/passenger-...,\d+\.\d+,(a) be over 18 years old and legally able to e...
2,https://web.didiglobal.com/au/legal/passenger-...,\d+\.\d+,1.3 You acknowledge and agree that DiDi will c...
3,https://web.didiglobal.com/au/legal/passenger-...,\d+\.\d+,1.4 You warrant that the information provided ...
4,https://web.didiglobal.com/au/legal/passenger-...,\d+\.\d+,1.5 DiDi reserves the right to refuse registra...
...,...,...,...
68,https://web.didiglobal.com/au/legal/passenger-...,\d+\.\d+,2.3 If another passenger suffers loss or prope...
69,https://web.didiglobal.com/au/legal/passenger-...,\d+\.\d+,3. Respect
70,https://web.didiglobal.com/au/legal/passenger-...,\d+\.\d+,(a) be on time for your ride – always try to b...
71,https://web.didiglobal.com/au/legal/passenger-...,\d+\.\d+,3.3 We reserve the right to deactivate your ac...
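
In the output above the `Section` column repeats the regular expression instead of the section number that matched, and the first row picks up stylesheet text. The sketch below is one possible way to record the matched number, assuming each numbered clause starts its own text node. The function name `extract_numbered_sections` and the sample HTML are made up for illustration, and the sketch parses an HTML string rather than fetching a URL.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Anchored at the start of the text node: an X.Y number, whitespace, then the clause text
SECTION_NUMBER = re.compile(r"^\s*(\d+\.\d+)\s+(.*)", re.DOTALL)


def extract_numbered_sections(html: str, url: str) -> pd.DataFrame:
    """Return one row per text node that starts with an X.Y section number."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for node in soup.find_all(string=SECTION_NUMBER):
        match = SECTION_NUMBER.match(node)
        rows.append({"URL": url, "Section": match.group(1), "Text": match.group(2).strip()})
    return pd.DataFrame(rows, columns=["URL", "Section", "Text"])


# Made-up sample page
sample_html = """
<p>1.1 You must be an adult to register.</p>
<p>This paragraph has no number.</p>
<p>1.2 You must provide accurate information.</p>
"""
sections = extract_numbered_sections(sample_html, "https://example.com/terms")
print(sections)
```

Anchoring the pattern at the start of the text node also skips strings that merely contain a decimal number somewhere in the middle, such as CSS values.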


## Text analysis

In [None]:
# Function to count words in a text string
def count_words(text):
    words = text.split()
    return len(words)

# Apply the function to the 'Text' column and calculate the average
word_count = result_df['Text'].apply(count_words)
avg_word_count = word_count.mean()

# Show result
avg_word_count

247.13774597495527
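
The same per-row word count can also be computed with pandas string methods instead of `apply`. A minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

# Made-up sample rows, including an empty string
sample = pd.DataFrame({"Text": ["We use cookies.", "Learn more", ""]})

# Split each row on whitespace and count the resulting tokens
word_counts = sample["Text"].str.split().str.len()

print(word_counts.tolist())
print(word_counts.mean())
```

Both approaches split on whitespace only, so punctuation attached to a word counts as part of that word.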