# Data Management Functionality

## Introduction

This Jupyter notebook demonstrates the functions defined in the src/data_management module. These functions streamline data handling for the project, in particular downloading and processing .gz files from the sitemap of the WiWi faculty at Humboldt University of Berlin (HU Berlin).

## Objectives

In this notebook, we will:

1. Introduce the key functions in the src/data_management module.
2. Demonstrate how to use these functions effectively.
3. Provide examples and use cases to illustrate their practical application.
4. Explain the underlying logic and parameters of each function to ensure a clear understanding of their purpose and usage.

## Overview

The functions in src/data_management have been updated to download and process compressed .gz files efficiently. This matters when handling large datasets, and it keeps data extraction from the WiWi faculty’s sitemap straightforward.

We will begin by exploring the individual functions, followed by detailed demonstrations that will guide you through the process of utilizing these tools to manage and analyze data efficiently.

Note that this notebook does not cover uploading the derived files to the databases, as that is outside its scope.

In [1]:
import os
import pandas as pd
import sys

# Get the current working directory
cwd = os.getcwd()

# Add the '../scripts' and '../src' directories to the system path so the project modules can be imported
sys.path.insert(0, os.path.abspath(os.path.join(cwd, '../scripts')))
sys.path.insert(0, os.path.abspath(os.path.join(cwd, '../src')))

from data_management.i_scheme_downloader import *
from data_management.ii_unzip_convert_json import *
from data_management.iii_content_downloader import *
from data_management.iv_html_retriever_cleaner import *
from data_management.v_chunker_embedder import *


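# Gzip-compressed sitemap of the WiWi faculty website
SITEMAP_URL = 'https://www.wiwi.hu-berlin.de/sitemap.xml.gz'
# Captures (URL, last-modified) pairs for English pages (/en/), skipping .jpeg/.pdf/.png/.jpg links and URLs ending in /view
PATTERN = r'''<loc>(https://www\.wiwi\.hu-berlin\.de/en/(?!.*\.jpeg|.*\.pdf|.*\.png|.*\.jpg).*?)(?<!/view)</loc>\s*<lastmod>([^<]+)</lastmod>'''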

2024-07-18 14:23:19,068 - INFO - Found credentials in shared credentials file: ~/.aws/credentials
2024-07-18 14:23:21,875 - INFO - Load pretrained SentenceTransformer: BAAI/bge-small-en
2024-07-18 14:23:24,798 - INFO - 2 prompts are loaded, with the keys: ['query', 'text']
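
To see what PATTERN captures, the following sketch applies it to a small hand-written sitemap fragment using Python's `re.findall`. The sample URLs are invented for illustration, and the use of `re.findall` is an assumption: the notebook's own filter_file_content may be implemented differently.

```python
import re

# Same pattern as PATTERN in the setup cell.
PATTERN = r'''<loc>(https://www\.wiwi\.hu-berlin\.de/en/(?!.*\.jpeg|.*\.pdf|.*\.png|.*\.jpg).*?)(?<!/view)</loc>\s*<lastmod>([^<]+)</lastmod>'''

# Hand-written sample; these URLs are invented for illustration only.
sample = """\
<url>
  <loc>https://www.wiwi.hu-berlin.de/en/example-page</loc>
  <lastmod>2024-01-01T10:00:00+01:00</lastmod>
</url>
<url>
  <loc>https://www.wiwi.hu-berlin.de/en/example-file.pdf</loc>
  <lastmod>2024-01-02T10:00:00+01:00</lastmod>
</url>
<url>
  <loc>https://www.wiwi.hu-berlin.de/en/example-folder/view</loc>
  <lastmod>2024-01-03T10:00:00+01:00</lastmod>
</url>
<url>
  <loc>https://www.wiwi.hu-berlin.de/de/beispiel-seite</loc>
  <lastmod>2024-01-04T10:00:00+01:00</lastmod>
</url>
"""

# Each match is a (url, last_modified) tuple, one per capture group.
sample_matches = re.findall(PATTERN, sample)
print(sample_matches)
```

Only the first entry should be returned: the second is rejected by the negative lookahead on file extensions, the third by the negative lookbehind on `/view`, and the fourth because it is not under `/en/`.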


## Download and Unpack the Sitemap File

In this step, we download the sitemap file from SITEMAP_URL. The sitemap lists the pages of the website together with the date each page was last updated. Because the file is compressed in .gz format, we also unpack it to access the data.

In [2]:
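# Download the compressed sitemap to a local file
file = download_file(SITEMAP_URL)

# Decompress the file and decode the resulting bytes as UTF-8 text
content = unzip_local_file(file).decode('utf-8')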

2024-07-18 14:23:24,993 - INFO - Unzipping local file: /tmp/sitemap.xml.gz
2024-07-18 14:23:24,999 - INFO - Successfully unzipped local file: /tmp/sitemap.xml.gz
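
The unpacking step can be reproduced with the standard library alone. The sketch below is not the implementation of unzip_local_file; it is a minimal stand-in, assuming the function reads a local .gz file and returns its decompressed bytes. The helper name `unzip_local_file_sketch` and the sample XML are hypothetical, and no network access is involved.

```python
import gzip
import tempfile
from pathlib import Path


def unzip_local_file_sketch(path):
    """Hypothetical stand-in: return the decompressed bytes of a local .gz file."""
    with gzip.open(path, "rb") as fh:
        return fh.read()


# Tiny made-up sitemap, used only to exercise the round trip.
sample_xml = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset></urlset>\n'

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / "sitemap.xml.gz"
    # Write a compressed copy, as a downloaded sitemap would be stored.
    gz_path.write_bytes(gzip.compress(sample_xml.encode("utf-8")))
    # Decompress and decode, mirroring the cell above.
    content_sketch = unzip_local_file_sketch(gz_path).decode("utf-8")

print(content_sketch[:40])
```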


In [3]:
print(f"The preview of the extracted content of the gz file: {content[:500]}")

The preview of the extracted content of the gz file: <?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
  <loc>https://www.wiwi.hu-berlin.de</loc>
  <lastmod>2021-11-09T20:03:57+01:00</lastmod>
  
  
</url>
<url>
  <loc>https://www.wiwi.hu-berlin.de/en/Professorships/vwl/microeconomics/people/wlefez</loc>
  <lastmod>2023-


In [4]:
# Extract the URL and last-modified entries that PATTERN matches in the sitemap
matches = filter_file_content(content, PATTERN)

In [5]:
# Build a dictionary with one entry per matched page (url and last_updated)
matches_dict = create_matches_dict(matches)
matches_dict

{'56766e6286a91898d748bec68c322ede': {'url': 'https://www.wiwi.hu-berlin.de/en/Professorships/vwl/microeconomics/people/wlefez',
  'last_updated': '2023-11-12T14:27:41+01:00'},
 '66f155e809a45f493153161960ac6fd4': {'url': 'https://www.wiwi.hu-berlin.de/en/international-office-1/our-team',
  'last_updated': '2024-06-10T09:39:14+02:00'},
 'd86c522bf2b2628d646473d72b589dbc': {'url': 'https://www.wiwi.hu-berlin.de/en/Professorships/bwl/wi/lehre/introduction-to-python-programming-for-machine-learning-ai-1',
  'last_updated': '2023-11-15T11:59:19+01:00'},
 '3862d4124f5513e11c8b6169e1209c5e': {'url': 'https://www.wiwi.hu-berlin.de/en/Professorships/vwl/statistik/news/junior-professorship-in-applied-statistics-and-data-science',
  'last_updated': '2023-12-05T13:02:33+01:00'},
 '8d562908f4e23f3bfa510962680088e3': {'url': 'https://www.wiwi.hu-berlin.de/en/Professorships/vwl/statistik/news/paper-navigating-the-corporate-disclosure-gap-modelling-of-missing-not-at-random-carbon-data-by-olesiewicz-k
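
The keys in the output above are 32-character hexadecimal strings. The notebook does not show how create_matches_dict derives them, so the sketch below is only one plausible construction, assuming the key is an MD5 digest of the URL. Both helper names and the sample URL are hypothetical.

```python
import hashlib


def url_to_id_sketch(url):
    """Hypothetical helper: hex MD5 digest of the UTF-8 encoded URL (an assumption)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def build_matches_dict_sketch(pairs):
    """Map each (url, last_updated) pair to an entry keyed by the URL's digest."""
    return {
        url_to_id_sketch(url): {"url": url, "last_updated": last_updated}
        for url, last_updated in pairs
    }


# Invented sample pair, in the same shape as the regex matches.
pairs = [("https://www.wiwi.hu-berlin.de/en/example-page", "2024-01-01T10:00:00+01:00")]
sketch_dict = build_matches_dict_sketch(pairs)
print(sketch_dict)
```

A deterministic key of this kind gives every page a stable identifier across runs, which makes it easy to detect pages whose last_updated value has changed.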

In [6]:
jsonprocess = JSONConverter(matches_dict)

In [7]:
def subset_first_n_items(dictionary, n):
    """Return a new dict with the first n items of `dictionary`, in insertion order."""
    return {k: dictionary[k] for i, k in enumerate(dictionary) if i < n}


In [8]:
# Restrict the demonstration to the first 10 pages, then add each page's HTML content
slice_to_test = subset_first_n_items(matches_dict, 10)
updated_data = jsonprocess.add_html_content_to_dict(slice_to_test)
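
The step above enriches each entry with the page's HTML. The internals of add_html_content_to_dict are not shown in this notebook, so the following is only a rough sketch of the idea, assuming each entry gains an `html_content` field (the field name matches the column in the final output). The function `add_html_content_sketch`, the injectable `fetch` parameter, and the example.org URLs are all hypothetical. The demo uses a fake fetcher so that it runs without network access.

```python
from urllib.request import urlopen


def add_html_content_sketch(entries, fetch=None):
    """Hypothetical sketch: return a copy of `entries` with an 'html_content' field added."""

    def default_fetch(url):
        # Plain HTTP GET via the standard library; decoding errors are replaced.
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    fetch = fetch or default_fetch
    enriched = {}
    for key, entry in entries.items():
        try:
            html = fetch(entry["url"])
        except OSError:
            # urllib's URLError is a subclass of OSError; record the failure as None.
            html = None
        enriched[key] = {**entry, "html_content": html}
    return enriched


def fake_fetch(url):
    """Offline stand-in for a real download, used only in this demo."""
    if "broken" in url:
        raise OSError("simulated network failure")
    return f"<html><body>Page for {url}</body></html>"


entries = {
    "id-1": {"url": "https://example.org/ok", "last_updated": "2024-01-01"},
    "id-2": {"url": "https://example.org/broken", "last_updated": "2024-01-02"},
}
enriched = add_html_content_sketch(entries, fetch=fake_fetch)
print(enriched["id-1"]["html_content"])
```

Passing the fetcher as a parameter keeps the enrichment logic testable offline and leaves the input dictionary unmodified.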

In [9]:
htmlcleaner = HTMLCleaner(None, updated_data)

In [10]:
data = htmlcleaner.process_data_from_dict()

In [11]:
data

| | id | url | last_updated | html_content | extracted_texts | len |
|---|---|---|---|---|---|---|
| 0 | 56766e6286a91898d748bec68c322ede | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-11-12 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Willy Lefez Ph.D. Room: 215 Spandauer Straße 1... | 780 |
| 1 | 66f155e809a45f493153161960ac6fd4 | https://www.wiwi.hu-berlin.de/en/international... | 2024-06-10 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Dr. Anja Schwerk Head of Study and Internation... | 690 |
| 2 | d86c522bf2b2628d646473d72b589dbc | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-11-15 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Topics & Contents Machine Learning Foundations... | 1286 |
| 3 | 3862d4124f5513e11c8b6169e1209c5e | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-12-05 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Reference number: JP/005/23 Location: School o... | 3787 |
| 4 | 8d562908f4e23f3bfa510962680088e3 | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-10-04 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | The paper "Navigating the corporate disclosure... | 1908 |
| 5 | 1979cabfff62786264bc5f1d6985d4db | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-10-04 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | The preprint Principal component analysis in B... | 1708 |
| 6 | 261851c088b6c9fc17b94966abdfc965 | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-10-06 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Email: heyaozh@outlook.com Office hours: by ap... | 859 |
| 7 | ba2d6154371b98b65f30edaf65c790fc | https://www.wiwi.hu-berlin.de/en/Professorship... | 2023-04-01 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Note: Mr. C. Schlauch was a member of the Chai... | 428 |
| 8 | 69106adb78c01a2d60d76310de209586 | https://www.wiwi.hu-berlin.de/en/academic-care... | 2024-06-25 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | In 2024, the Graduate Centre of the School of ... | 6515 |
| 9 | bbc3adaea6ab5f8da879135f3c11ee57 | https://www.wiwi.hu-berlin.de/en/Professorship... | 2024-07-01 | `<!DOCTYPE html>\n<html xmlns="http://www.w3.or...` | Recent Advances in Mutual Fund and Hedge Fund ... | 4010 |
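
The extracted_texts column holds plain text derived from html_content. How HTMLCleaner performs this extraction is not shown here, so the sketch below illustrates the general idea with the standard library's `html.parser`, assuming that script and style content is dropped and whitespace is collapsed. The class `TextExtractorSketch`, the function `extract_text_sketch`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractorSketch(HTMLParser):
    """Hypothetical sketch: collect visible text, skipping <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def extract_text_sketch(html):
    """Return the visible text of `html` with runs of whitespace collapsed."""
    parser = TextExtractorSketch()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


# Made-up page, used only to exercise the extractor.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>Room:  101</p>\n<script>var x = 1;</script>"
    "<p>Office hours</p></body></html>"
)
text = extract_text_sketch(sample_html)
print(text)
```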
