# Data Management Functionality

## Introduction

This Jupyter notebook showcases and explains the functions defined in the src/data_management module. These functions streamline data handling, with a focus on downloading and processing the .gz-compressed sitemap of the WiWi faculty at Humboldt University of Berlin (HU Berlin).

## Objectives

In this notebook, we will:

1. Introduce the key functions in the src/data_management module.
2. Demonstrate how to use these functions effectively.
3. Provide examples and use cases to illustrate their practical application.
4. Explain the underlying logic and parameters of each function to ensure a clear understanding of their purpose and usage.

## Overview

The functions in src/data_management have been updated to download and process compressed .gz files efficiently. This matters when handling large files, and it keeps data extraction from the WiWi faculty’s sitemap both effective and user-friendly.

We will begin by exploring the individual functions, followed by detailed demonstrations that will guide you through the process of utilizing these tools to manage and analyze data efficiently.

Note that this notebook does not cover uploading the derived files to the databases, as that is outside its scope.

In [1]:
import os
import pandas as pd
import sys

from pinecone import Pinecone
from dotenv import load_dotenv


# Get the current working directory
cwd = os.getcwd()

# Add the '../scripts' and '../src' directories to the system path
sys.path.insert(0, os.path.abspath(os.path.join(cwd, '../scripts')))
sys.path.insert(0, os.path.abspath(os.path.join(cwd, '../src')))

from data_management.multi_destination_file_handler import *
from data_management.xml_sitemap_processor import *
from data_management.html_cleaner_and_processor import *
from data_management.html_content_enricher import *
from data_management.text_embedding_processor import *
from pinecone_func import pinecone_upsert, flatten_values

embed_model = HuggingFaceEmbeddings()

SITEMAP_URL = 'https://www.wiwi.hu-berlin.de/sitemap.xml.gz'
PATTERN = r'''<loc>(https://www\.wiwi\.hu-berlin\.de/en/(?!.*\.jpeg|.*\.pdf|.*\.png|.*\.jpg).*?)(?<!/view)</loc>\s*<lastmod>([^<]+)</lastmod>'''

2024-08-01 14:10:22,582 - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2024-08-01 14:10:26,825 - INFO - Use pytorch device_name: mps
2024-08-01 14:10:26,825 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
2024-08-01 14:10:29,657 - INFO - Use pytorch device_name: mps
2024-08-01 14:10:29,658 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2


## Download and Unpack the Sitemap File

In this step, we will download the sitemap file from the provided web link, which contains links to all pages of the website along with their last update dates. The sitemap file is compressed in .gz format, so we will also unpack it to access the data.

### Step 1. Download and Unzip the Sitemap File

In [2]:
file = download_file(SITEMAP_URL)

content = unzip_file(file).decode('utf-8')

2024-08-01 14:10:33,434 - INFO - Unzipping file: /tmp/sitemap.xml.gz
2024-08-01 14:10:33,438 - INFO - Successfully unzipped file: /tmp/sitemap.xml.gz
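The two helpers used above live in src/data_management and their source is not shown in this notebook. As a rough sketch of the underlying idea, assuming they do little more than fetch the bytes and gunzip them, the same result can be obtained with the standard library alone. The names `fetch_bytes` and `gunzip_bytes` are hypothetical and are not part of the module.

```python
import gzip
import urllib.request


def fetch_bytes(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL and return the raw response body (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def gunzip_bytes(raw: bytes) -> bytes:
    """Decompress gzip-compressed bytes (hypothetical helper)."""
    return gzip.decompress(raw)


# Offline round trip, so the example runs without network access.
sample_xml = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset></urlset>'
compressed = gzip.compress(sample_xml.encode("utf-8"))
restored = gunzip_bytes(compressed).decode("utf-8")
print(restored)
```

With network access, `gunzip_bytes(fetch_bytes(SITEMAP_URL)).decode('utf-8')` would play the role of `content` in this notebook.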


In [3]:
print(f"The preview of the extracted content of the gz file: {content[:500]}")

The preview of the extracted content of the gz file: <?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
  <loc>https://www.wiwi.hu-berlin.de</loc>
  <lastmod>2021-11-09T20:03:57+01:00</lastmod>
  
  
</url>
<url>
  <loc>https://www.wiwi.hu-berlin.de/en/Professorships/vwl/microeconomics/people/wlefez</loc>
  <lastmod>2023-


### Step 2. Filter and Process the Content

Using predefined patterns, we filter the content to extract specific information such as URLs and their last update dates. We then store this extracted information in a dictionary for further analysis or processing.

In [4]:
matches = filter_file_content(content, PATTERN)
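To see what PATTERN keeps and drops, the sketch below applies it to a small hand-written sample with `re.findall`. Whether `filter_file_content` calls `re.findall` internally is an assumption, and the sample URLs other than the first two are made up for illustration.

```python
import re

# Same pattern as defined at the top of the notebook.
PATTERN = r'''<loc>(https://www\.wiwi\.hu-berlin\.de/en/(?!.*\.jpeg|.*\.pdf|.*\.png|.*\.jpg).*?)(?<!/view)</loc>\s*<lastmod>([^<]+)</lastmod>'''

# Hand-written sample; the last two URLs are illustrative only.
sample = """<url>
  <loc>https://www.wiwi.hu-berlin.de</loc>
  <lastmod>2021-11-09T20:03:57+01:00</lastmod>
</url>
<url>
  <loc>https://www.wiwi.hu-berlin.de/en/Professorships/vwl/microeconomics/people/wlefez</loc>
  <lastmod>2023-11-12T14:27:41+01:00</lastmod>
</url>
<url>
  <loc>https://www.wiwi.hu-berlin.de/en/some-flyer.pdf</loc>
  <lastmod>2024-01-01T00:00:00+01:00</lastmod>
</url>
<url>
  <loc>https://www.wiwi.hu-berlin.de/en/news/some-item/view</loc>
  <lastmod>2024-01-02T00:00:00+01:00</lastmod>
</url>
"""

# Each match is a (url, last_modified) tuple, one per capture group.
sample_matches = re.findall(PATTERN, sample)
for url, last_modified in sample_matches:
    print(url, last_modified)
```

Only the second entry survives: the first lacks the `/en/` prefix, the negative lookahead rejects the `.pdf` link, and the negative lookbehind rejects the URL ending in `/view`.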

In [5]:
matches_dict = create_matches_dict(matches)
matches_dict

{'56766e6286a91898d748bec68c322ede': {'url': 'https://www.wiwi.hu-berlin.de/en/Professorships/vwl/microeconomics/people/wlefez',
  'last_updated': '2023-11-12T14:27:41+01:00'},
 '66f155e809a45f493153161960ac6fd4': {'url': 'https://www.wiwi.hu-berlin.de/en/international-office-1/our-team',
  'last_updated': '2024-06-10T09:39:14+02:00'},
 'd86c522bf2b2628d646473d72b589dbc': {'url': 'https://www.wiwi.hu-berlin.de/en/Professorships/bwl/wi/lehre/introduction-to-python-programming-for-machine-learning-ai-1',
  'last_updated': '2023-11-15T11:59:19+01:00'},
 '3862d4124f5513e11c8b6169e1209c5e': {'url': 'https://www.wiwi.hu-berlin.de/en/Professorships/vwl/statistik/news/junior-professorship-in-applied-statistics-and-data-science',
  'last_updated': '2023-12-05T13:02:33+01:00'},
 '8d562908f4e23f3bfa510962680088e3': {'url': 'https://www.wiwi.hu-berlin.de/en/Professorships/vwl/statistik/news/paper-navigating-the-corporate-disclosure-gap-modelling-of-missing-not-at-random-carbon-data-by-olesiewicz-k
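
The keys in the output above are 32-character hexadecimal strings. One plausible way to obtain keys of that shape, shown here purely as an assumption, is to hash each URL with MD5; the actual `create_matches_dict` may derive them differently. `build_matches_dict` is a hypothetical name.

```python
import hashlib


def build_matches_dict(pairs):
    """Map a hash of each URL to its URL and last-update timestamp.

    Assumption: MD5 of the UTF-8 encoded URL is used as the key. This is
    a guess based on the key length, not the module's confirmed behavior.
    """
    result = {}
    for url, last_updated in pairs:
        key = hashlib.md5(url.encode("utf-8")).hexdigest()
        result[key] = {"url": url, "last_updated": last_updated}
    return result


example_pairs = [
    ("https://www.wiwi.hu-berlin.de/en/international-office-1/our-team",
     "2024-06-10T09:39:14+02:00"),
]
example_dict = build_matches_dict(example_pairs)
print(example_dict)
```

Deriving the key from the URL makes it stable across runs, so a page that reappears in a later sitemap maps to the same entry.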

### Step 3. Creating Data Subset
Because this notebook only showcases local usage of the modules, we work with a subset of the first 10 matches. This keeps the following steps fast and saves computational resources.

In [6]:
from itertools import islice

def subset_first_n_items(dictionary, n):
    """Return the first n items of a dictionary, in insertion order."""
    return dict(islice(dictionary.items(), n))

slice_to_test = subset_first_n_items(matches_dict, 10)

### Step 4. Fetching URL content
Using ContentEnricher, we fetch the HTML content of each URL and add it to the corresponding entry in the dictionary of matches.

In [7]:
updated_data = add_html_content_to_dict(slice_to_test)

updated_data

2024-08-01 14:10:33,487 - INFO - Processing URL 1/10: https://www.wiwi.hu-berlin.de/en/Professorships/vwl/microeconomics/people/wlefez
2024-08-01 14:10:34,148 - INFO - Processing URL 2/10: https://www.wiwi.hu-berlin.de/en/international-office-1/our-team
2024-08-01 14:10:34,904 - INFO - Processing URL 3/10: https://www.wiwi.hu-berlin.de/en/Professorships/bwl/wi/lehre/introduction-to-python-programming-for-machine-learning-ai-1
2024-08-01 14:10:35,579 - INFO - Processing URL 4/10: https://www.wiwi.hu-berlin.de/en/Professorships/vwl/statistik/news/junior-professorship-in-applied-statistics-and-data-science
2024-08-01 14:10:36,556 - INFO - Processing URL 5/10: https://www.wiwi.hu-berlin.de/en/Professorships/vwl/statistik/news/paper-navigating-the-corporate-disclosure-gap-modelling-of-missing-not-at-random-carbon-data-by-olesiewicz-kooroshy-and-greven-will-appear-in-the-journal-of-impact-esg-investing
2024-08-01 14:10:37,326 - INFO - Processing URL 6/10: https://www.wiwi.hu-berlin.de/en/Pro

{'56766e6286a91898d748bec68c322ede': {'url': 'https://www.wiwi.hu-berlin.de/en/Professorships/vwl/microeconomics/people/wlefez',
  'last_updated': '2023-11-12T14:27:41+01:00',
  'html_content': '<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">\n  <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\n    <meta data-embetty-server="https://www3.hu-berlin.de/embetty/" />\n    <title>wlefez — School of Business and Economics</title>\n    <link id="favicon_ico" rel="icon" href="/++theme++humboldt.theme/++resource++humboldt.policy/favicon.ico" sizes="any" /><!-- 32x32 -->\n    <link id="favicon_svg" rel="icon" href="/++theme++humboldt.theme/++resource++humboldt.policy/icon.svg" type="image/svg+xml" />\n    <link id="favicon_appletouch" rel="apple-touch-icon" href="/++theme++humboldt.theme/++resource++humboldt.policy/apple-touch-icon.png" /><!-- 180x180 -->\n    <link id="manifest_json" rel="manifest" href="/++theme++humboldt.theme/++
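
The enrichment step can be pictured as a loop that fetches each URL and stores the page source next to the existing fields. The sketch below is an assumption about the general approach, not the module's code: `fetch_html` and `add_html` are hypothetical names, and the fetcher is injectable so the example runs offline with a stub.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Fetch a page and decode it using the charset the server declares."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def add_html(entries, fetch=fetch_html):
    """Return a copy of entries with an 'html_content' field per entry.

    Entries whose URL cannot be fetched get None instead of raising.
    """
    enriched = {}
    for key, entry in entries.items():
        try:
            html = fetch(entry["url"])
        except OSError:  # urllib's URLError and HTTPError subclass OSError
            html = None
        enriched[key] = {**entry, "html_content": html}
    return enriched


# Offline demonstration with a stub fetcher and made-up entries.
def stub_fetch(url):
    if url.endswith("/missing"):
        raise OSError("unreachable")
    return "<p>stub page</p>"


demo_entries = {
    "a": {"url": "https://example.org/page", "last_updated": "2024-01-01"},
    "b": {"url": "https://example.org/missing", "last_updated": "2024-01-02"},
}
demo_result = add_html(demo_entries, fetch=stub_fetch)
print(demo_result)
```

Returning a new dictionary leaves the input untouched, which makes it easy to rerun the step on the same subset.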

### Step 5. Cleaning HTML content

The fetched content still contains HTML markup, but we only need the text, so we apply HTMLCleaner and its `process_data_from_dict` method.

This method:
1. extracts the text from the HTML content
2. removes HTML tags
3. converts the `last_updated` timestamps to dates
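
The three operations listed above can be sketched with the standard library. This is only an illustration under stated assumptions: the real HTMLCleaner may rely on a different parser and different rules, and `TextExtractor`, `html_to_text`, and `to_date` are hypothetical names.

```python
from datetime import datetime
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    """Strip tags and return the remaining text joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def to_date(timestamp):
    """Convert an ISO 8601 timestamp with UTC offset to a date."""
    return datetime.fromisoformat(timestamp).date()


sample_html = (
    "<body><style>p {color: red}</style><h1>Office hours</h1>"
    "<p>Room: 215</p><script>track()</script></body>"
)
clean_text = html_to_text(sample_html)
clean_date = to_date("2023-11-12T14:27:41+01:00")
print(clean_text, clean_date)
```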

In [8]:
data = HTMLCleaner(None, updated_data).process_data_from_dict()

data.head()

| | id | url | last_updated | html_content | text | len |
|---|---|---|---|---|---|---|
| 0 | 56766e6286a91898d748bec68c322ede | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-11-12 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Willy Lefez Ph.D. Room: 215 Spandauer Straße 1... | 780 |
| 1 | 66f155e809a45f493153161960ac6fd4 | https://www.wiwi.hu-berlin.de/en/international... | 2024-06-10 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Dr. Anja Schwerk Head of Study and Internation... | 690 |
| 2 | d86c522bf2b2628d646473d72b589dbc | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-11-15 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Topics & Contents Machine Learning Foundations... | 1286 |
| 3 | 3862d4124f5513e11c8b6169e1209c5e | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-12-05 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Reference number: JP/005/23 Location: School o... | 3787 |
| 4 | 8d562908f4e23f3bfa510962680088e3 | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-10-04 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | The paper "Navigating the corporate disclosure... | 1908 |


In [9]:
# Make sure every entry is a string, then split each text into chunks
data['text'] = data['text'].apply(str)
data['chunk'] = data['text'].apply(chunk_text)
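
`chunk_text` comes from the module, and its splitting rule is not shown in this notebook. A minimal sketch of one common approach, fixed-size word windows with overlap, is given below. The function name `chunk_words` and the parameters `max_words` and `overlap` are hypothetical and chosen only for illustration.

```python
def chunk_words(text, max_words=50, overlap=10):
    """Split text into windows of at most max_words words.

    Consecutive windows share `overlap` words so that a sentence cut at a
    boundary still appears intact in one of the chunks.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current window already reaches the end of the text
    return chunks


sample_text = " ".join(f"w{i}" for i in range(12))
sample_chunks = chunk_words(sample_text, max_words=5, overlap=2)
for chunk in sample_chunks:
    print(chunk)
```

Smaller chunks keep each embedding focused on one topic, at the cost of producing more vectors per page.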


In [10]:
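# Embed every chunk (one row per chunk), then build the list of documents (id plus embedding values)
new_data = expand_dataframe_with_embeddings(data, embed_model)
document = generate_documents(new_data)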

Processing rows: 100%|██████████| 10/10 [00:06<00:00,  1.47it/s]
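
The outputs of this step show one row per chunk, a running `url_count` per page, and a `unique_id` of the form `<id>_<url_count>`. The sketch below reproduces that shape with plain dictionaries and a stand-in embedder. It is an assumption about the structure only: `expand_rows`, `to_documents`, and `toy_embed` are hypothetical names, and the real functions operate on a DataFrame with a HuggingFace model.

```python
def toy_embed(text):
    """Stand-in for an embedding model: returns a tiny numeric vector."""
    return [float(len(text)), float(text.count(" "))]


def expand_rows(rows, embed=toy_embed):
    """Turn each row with a list of chunks into one row per chunk."""
    expanded = []
    for row in rows:
        for count, chunk in enumerate(row["chunk"], start=1):
            expanded.append({
                "id": row["id"],
                "url": row["url"],
                "text": chunk,
                "len": len(chunk),
                "embedding": embed(chunk),
                "url_count": count,
                "unique_id": f"{row['id']}_{count}",
            })
    return expanded


def to_documents(expanded):
    """Keep only the fields a vector store needs: an id and the values."""
    return [{"id": r["unique_id"], "values": r["embedding"]} for r in expanded]


demo_rows = [
    {"id": "abc", "url": "https://example.org/a", "chunk": ["first part", "second"]},
    {"id": "def", "url": "https://example.org/b", "chunk": ["only chunk"]},
]
demo_expanded = expand_rows(demo_rows)
demo_documents = to_documents(demo_expanded)
print(demo_documents)
```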


In [12]:
new_data.head(3)

| | id | url | last_updated | html_content | text | len | embedding | url_count | unique_id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 56766e6286a91898d748bec68c322ede | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-11-12 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Willy Lefez Ph.D. Room: 215 Spandauer Straße 1... | 441 | [[-0.062076274305582047, -0.019467005506157875... | 1 | 56766e6286a91898d748bec68c322ede_1 |
| 1 | 56766e6286a91898d748bec68c322ede | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-11-12 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | M.Sc. in Economics TSE Download CV Research In... | 422 | [[-0.007820739410817623, 0.05816511809825897, ... | 2 | 56766e6286a91898d748bec68c322ede_2 |
| 2 | 56766e6286a91898d748bec68c322ede | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-11-12 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Factor Collusion under Incomplete Information ... | 69 | [[-0.017858611419796944, 0.0878477543592453, 0... | 3 | 56766e6286a91898d748bec68c322ede_3 |


In [13]:
document[:10]

[{'id': '56766e6286a91898d748bec68c322ede_1',
  'values': [[-0.062076274305582047,
    -0.019467005506157875,
    -0.027899939566850662,
    0.027880428358912468,
    0.0073607913218438625,
    -0.024134455248713493,
    0.08830256015062332,
    0.009396318346261978,
    0.009721888229250908,
    0.021395327523350716,
    0.007886795327067375,
    0.03343432396650314,
    0.00860503874719143,
    0.08418939262628555,
    0.0322779081761837,
    -0.012045147828757763,
    0.044548626989126205,
    0.03491812199354172,
    -0.03854002058506012,
    -4.667699977289885e-05,
    0.0005013403715565801,
    -0.043675635010004044,
    0.014115314930677414,
    0.02082625962793827,
    0.04979756847023964,
    0.06026022136211395,
    0.027840809896588326,
    -0.05471426621079445,
    0.018475275486707687,
    -0.007590573281049728,
    0.028185855597257614,
    -0.01621265523135662,
    0.01080153789371252,
    0.01022841315716505,
    2.3254026473296108e-06,
    -0.03408273309469223,
    -0.