# Load and Chunk 

This notebook focuses on the loading and chunking stages of the project's data ingestion pipeline. The full pipeline flow is:

1) **Load**: Load raw data from README files.
2) **Chunk**: Split the data into chunks at the bullet-point level.
3) **Tokenize**: No tokenization is done for this dataset.
4) **Embed**: Embed the data using a Sentence Transformer model.

In [1]:
import re
import pandas as pd
import os
import hashlib
import json
from tqdm.auto import tqdm

### Load all README files in the data folder

In [2]:
folder = '../data/'
readme_files = [f for f in os.listdir(folder) if f.endswith(".md")]
readme_files

['module_1_3.md',
 'module_2_4.md',
 'module_3_1.md',
 'module_2_1.md',
 'module_1_2.md',
 'module_3_4.md',
 'module_2_2.md',
 'module_3_3.md',
 'module_1_1.md',
 'module_2_3.md',
 'module_3_2.md']

### Process README files
* Chunk the data by bullet point. Each sub-bullet is appended to the chunk of its parent bullet.
* Create a unique document ID for each document.

In [3]:
def remove_bullets(s):
    '''
    Returns the string with its leading bullet (* or -) and any leading/trailing whitespace removed.
    '''
    return re.sub(r'^\s{0,5}[\-\*]', '', s).strip()
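
As a quick illustration of what the regex does, the sketch below repeats `remove_bullets` so it runs on its own and applies it to a few sample lines. The sample strings are made up for this example. The last case shows a caveat: a line that starts with bold markup (`**`) and no bullet loses its first asterisk, so the function should only be called on lines already known to be bullets.

```python
import re

def remove_bullets(s):
    # Same regex as the cell above, repeated so this snippet is self-contained.
    return re.sub(r'^\s{0,5}[\-\*]', '', s).strip()

# Illustrative inputs (not taken from the dataset).
samples = [
    "* Top-level bullet\n",
    "    - Nested bullet\n",
    "Plain text with a - dash\n",
]
cleaned = [remove_bullets(s) for s in samples]
print(cleaned)

# Caveat: bold text without a bullet is damaged, because the first '*' matches.
damaged = remove_bullets("**Note**: no bullet here")
print(damaged)
```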

In [4]:
def new_doc(header, subheader, documents, doc_id):
    '''
    Returns a document as a JSON-serializable dict.
    '''
    doc = { 'doc_id': doc_id,
            'header': header,
            'subheader': subheader,
            'document': documents,
            'doc_text': '\n'.join(documents)}
    return doc
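
To make the record shape concrete, here is a small sketch that repeats `new_doc` and builds one document from a main bullet and two sub-bullets. The header, bullet text, and ID are placeholder values chosen for illustration.

```python
def new_doc(header, subheader, documents, doc_id):
    # Same structure as the cell above, repeated so this snippet is self-contained.
    return {'doc_id': doc_id,
            'header': header,
            'subheader': subheader,
            'document': documents,
            'doc_text': '\n'.join(documents)}

# Placeholder values, not taken from the dataset.
doc = new_doc("Example header", "Example subheader",
              ["Main bullet", "Sub bullet A", "Sub bullet B"],
              "abc1234567_1")
print(doc["doc_text"])
```

The `document` field keeps the bullets as a list, while `doc_text` flattens them into a single newline-separated string.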

In [5]:
def get_doc_id(string):
    '''
    Returns the first 10 hex characters of the MD5 hash of a given string.
    '''
    return hashlib.md5(string.encode()).hexdigest()[:10]
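
The sketch below shows how the truncated hash combines with a sequence number to form per-chunk IDs. It repeats `get_doc_id` so it runs on its own; the input path is only an example. The prefix is deterministic, so the same input string always yields the same IDs, and 10 hex characters give 40 bits, which is ample for a handful of files.

```python
import hashlib

def get_doc_id(string):
    # Same logic as the cell above, repeated so this snippet is self-contained.
    return hashlib.md5(string.encode()).hexdigest()[:10]

# Example input; any string works.
file_id = get_doc_id("../data/module_1_1.md")

# Chunk IDs follow the pattern <file hash>_<sequence number>.
chunk_ids = [f"{file_id}_{seq}" for seq in range(1, 4)]
print(chunk_ids)
```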

In [6]:
def extract_document(filename):
    '''
    Extracts the header (starts with ##), subheader (starts with ###), and documents from a README file.
    Returns a list of documents, where each document is a JSON-serializable dict.
    
    Each dict looks like this: {'doc_id': doc_id,
                                'header': header,
                                'subheader': subheader,
                                'document': list of bullet points,
                                'doc_text': bullet points joined by newlines
                                }
    Each document holds one bullet point plus its sub bullet points, if any.
    A header + subheader pair can have one or more documents.
    
    Parameters
    ----------
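    filename: String
            path to the README file, including the folder.
    '''
    header = ""
    subheader = ""
    docs = []
    documents = []

    # The caller passes the full path (folder + file name), so use it as-is;
    # prepending the folder again would duplicate it in the path.
    doc_id = get_doc_id(filename)
    path = filename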
    
    with open(path) as f:
        seq = 0
        
        for line in f:
            if line.strip(): 
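                # get header
                if line.startswith("## "):
                    # Append previous doc to documents if exists, so it keeps its own header
                    if docs:
                        seq += 1
                        doc = new_doc(header, subheader, docs, f'{doc_id}_{seq}')
                        documents.append(doc)
                    # clear doc and reset subheader for the new section
                    docs = []
                    subheader = ""
                    # strip the leading '#' marks and surrounding whitespace
                    header = line.lstrip("#").strip()

                # get subheader
                elif line.startswith("###"):
                    # Append previous doc to documents if exists
                    if docs:
                        seq += 1
                        doc = new_doc(header, subheader, docs, f'{doc_id}_{seq}')
                        documents.append(doc)
                    # clear doc
                    docs = []
                    subheader = line.lstrip("#").strip()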
                
                # Initialise new doc when it's a row of bullet point   
                elif line.startswith("* ") or line.startswith("- "):
                    # Append previous doc to documents if exists
                    if docs:
                        seq += 1
                        doc = new_doc(header, subheader, docs, f'{doc_id}_{seq}')
                        documents.append(doc)
                    # initialise a new doc and append bullet point to doc
                    docs = []            
                    docs.append(remove_bullets(line))
                    
                # Append sub bullet points to doc
                elif line.startswith("    * ") or line.startswith("    - "):
                    docs.append(remove_bullets(line))
                
                # Bypass image links
                elif line.startswith("!"):
                    pass
                    
                # Append plain text to doc, without its trailing newline
                else:
                    docs.append(line.strip())
        # Append last doc in file to documents, if any
        if docs:
            seq += 1
            doc = new_doc(header, subheader, docs, f'{doc_id}_{seq}')
            documents.append(doc)
        return documents
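
The chunking rules can be tried without touching the data folder. The sketch below is a simplified, in-memory variant under stated assumptions: `chunk_markdown` is a hypothetical helper written for this illustration (it is not the notebook's `extract_document`), it takes a string instead of a file, it omits `doc_id` and `doc_text`, and the sample markdown is invented.

```python
def chunk_markdown(text):
    """Hypothetical in-memory variant of the chunking rules, for illustration only."""
    header, subheader = "", ""
    current, chunks = [], []

    def flush():
        # Emit the chunk collected so far (if any), then start a fresh one.
        if current:
            chunks.append({"header": header,
                           "subheader": subheader,
                           "document": list(current)})
            current.clear()

    for line in text.splitlines():
        if not line.strip():
            continue                          # skip blank lines
        if line.startswith("## "):
            flush()
            header = line.lstrip("#").strip()
        elif line.startswith("###"):
            flush()
            subheader = line.lstrip("#").strip()
        elif line.startswith(("* ", "- ")):
            flush()                           # a top-level bullet starts a new chunk
            current.append(line[2:].strip())
        elif line.startswith(("    * ", "    - ")):
            current.append(line.strip()[2:].strip())  # sub-bullet joins its parent
        elif line.startswith("!"):
            continue                          # bypass image links
        else:
            current.append(line.strip())      # plain text joins the current chunk
    flush()
    return chunks

# Invented sample input.
sample = """## Cloud Concepts
### IaaS
* Customer has maximum control.
* **Scenarios**:
    * Lift-and-shift migration
    * Testing and development
![diagram](img.png)
### PaaS
- Middle ground between IaaS and SaaS.
"""

chunks = chunk_markdown(sample)
for c in chunks:
    print(c["subheader"], c["document"])
```

The second chunk collects the main bullet together with its two sub-bullets, which is the behaviour described in the bullet list above.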
            
            

In [7]:
master_document = []
for r in tqdm(readme_files):
    master_document += extract_document(folder+r)


In [8]:
len(master_document)

385

In [9]:
df = pd.json_normalize(master_document)

In [10]:
df

| | doc_id | header | subheader | document | doc_text |
|---|---|---|---|---|---|
| 0 | 08e49f1028_1 | Cloud Concepts: Describe cloud service types | Infrastructure as a service (IaaS) | `[Customer has maximum control of cloud resourc...` | `Customer has maximum control of cloud resources.` |
| 1 | 08e49f1028_2 | Cloud Concepts: Describe cloud service types | Infrastructure as a service (IaaS) | `[Customer has largest share of responsibility ...` | `Customer has largest share of responsibility i...` |
| 2 | 08e49f1028_3 | Cloud Concepts: Describe cloud service types | Infrastructure as a service (IaaS) | `[Only the physical resources are controlled by...` | `Only the physical resources are controlled by ...` |
| 3 | 08e49f1028_4 | Cloud Concepts: Describe cloud service types | Infrastructure as a service (IaaS) | `[Customer is responsible for installation and ...` | `Customer is responsible for installation and c...` |
| 4 | 08e49f1028_5 | Cloud Concepts: Describe cloud service types | Infrastructure as a service (IaaS) | `[**Scenarios to use IaaS**:, Lift-and-shift mi...` | `**Scenarios to use IaaS**:\nLift-and-shift mig...` |
| ... | ... | ... | ... | ... | ... |
| 380 | a407052ee6_22 | Microsoft Azure Fundamentals: Describe Azure ... | Microsoft Service Trust Portal | `[Contains details about Microsoft's implementa...` | `Contains details about Microsoft's implementat...` |
| 381 | a407052ee6_23 | Microsoft Azure Fundamentals: Describe Azure ... | Microsoft Service Trust Portal | `[To access some of the resources within, you m...` | `To access some of the resources within, you mu...` |
| 382 | a407052ee6_24 | Microsoft Azure Fundamentals: Describe Azure ... | Microsoft Service Trust Portal | `[The [Service Trust Portal](https://servicetru...` | `The [Service Trust Portal](https://servicetrus...` |
| 383 | a407052ee6_25 | Microsoft Azure Fundamentals: Describe Azure ... | Microsoft Service Trust Portal | `[**NOTE**: Service Trust Portal reports and do...` | `**NOTE**: Service Trust Portal reports and doc...` |


In [11]:
df.doc_id.nunique()

385
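
The unique-ID count matches the number of documents, so no two chunks share an ID. If the counts ever diverged, a helper along the following lines could list the offending IDs. This is a standard-library sketch: `find_duplicate_ids` is a hypothetical name and the records are invented. With pandas, `df.doc_id.is_unique` gives the same yes/no answer in one call.

```python
from collections import Counter

def find_duplicate_ids(records):
    # Count how often each doc_id occurs and keep those seen more than once.
    counts = Counter(r["doc_id"] for r in records)
    return sorted(doc_id for doc_id, n in counts.items() if n > 1)

# Invented records: the second and third share an ID on purpose.
records = [
    {"doc_id": "aaaaaaaaaa_1"},
    {"doc_id": "aaaaaaaaaa_2"},
    {"doc_id": "aaaaaaaaaa_2"},
    {"doc_id": "bbbbbbbbbb_1"},
]
dupes = find_duplicate_ids(records)
print(dupes)
```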

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 385 entries, 0 to 384
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   doc_id     385 non-null    object
 1   header     385 non-null    object
 2   subheader  385 non-null    object
 3   document   385 non-null    object
 4   doc_text   385 non-null    object
dtypes: object(5)
memory usage: 15.2+ KB


### Export to JSON file

In [13]:
output_file = "readme_notes_with_ids.json"

In [14]:
with open(f'{folder}{output_file}', 'w') as w:
    json.dump(master_document, w)
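
A quick way to confirm the export is to read the file back and compare it with the in-memory list. The sketch below does this round trip in a temporary directory with one invented record, so it runs without the data folder; in the notebook the same comparison would use `master_document` and the real output path.

```python
import json
import os
import tempfile

# Invented record with the same fields as the exported documents.
records = [{"doc_id": "abc1234567_1",
            "header": "Example header",
            "subheader": "Example subheader",
            "document": ["Main bullet", "Sub bullet"],
            "doc_text": "Main bullet\nSub bullet"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "readme_notes_with_ids.json")
    # Write the records, then load them again from disk.
    with open(path, "w", encoding="utf-8") as w:
        json.dump(records, w)
    with open(path, encoding="utf-8") as r:
        reloaded = json.load(r)

# Lists, dicts, and strings survive the JSON round trip unchanged.
print(reloaded == records)
```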