## Loading Config

In [1]:
from pathlib import Path
import sys 
import os 
import yaml 

# define root path
root = Path(os.getcwd()).parent
sys.path.insert(0, str(root))

# load config 
config_path = root / "configs/config.yaml"
with open(config_path) as f: 
    config = yaml.safe_load(f)

config

{'paths': {'data_dir': 'data/',
  'raw_data_dir': '../data/raw/',
  'filtered_data_dir': 'data/filtered/',
  'chunk_data_dir': 'data/chunks/',
  'model_dir': 'models/'},
 'reddit_scraper': {'credentials': {'client_id': 'XyEwSDvEPKLGTYSfqoLZWg',
   'client_secret': 'DW4m7X7RuuYDGxG51GlEMv_G6Pbq1A',
   'user_agent': 'scraper'},
  'search': {'subreddits': ['changemyview',
    'AskReddit',
    'PoliticalDiscussion'],
   'queries': ['diversity equity inclusion',
    'DEI',
    'inclusion AND leadership'],
   'sort': 'top',
   'time_filter': 'all',
   'limit': 20,
   'rate_limit_delay': 0.1,
   'filters': {'n_comments': 20,
    'min_post_score': 20,
    'min_comment_score': 10,
    'min_char_length': 100}}},
 'semantic_filter': {'topic': 'DEI',
  'n_keywords': 5,
  'similarity_threshold': 0.4,
  'min_token_len': 70},
 'chunker': 'None'}
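
The output above prints the Reddit `client_id` and `client_secret` in plain text. One common way to avoid this is to read the secrets from environment variables and to mask them before displaying the config. The following is a minimal sketch under that assumption; the helper names `apply_env_credentials` and `redact` and the variable names `REDDIT_CLIENT_ID` / `REDDIT_CLIENT_SECRET` are hypothetical and not part of this project.

```python
import copy
import os


def apply_env_credentials(config, env=os.environ):
    """Return a copy of `config` with credentials overridden from `env`.

    The environment variable names are hypothetical placeholders.
    """
    patched = copy.deepcopy(config)
    creds = patched["reddit_scraper"]["credentials"]
    for key, var in (("client_id", "REDDIT_CLIENT_ID"),
                     ("client_secret", "REDDIT_CLIENT_SECRET")):
        if env.get(var):  # only override when the variable is set and non-empty
            creds[key] = env[var]
    return patched


def redact(config):
    """Return a copy of `config` that is safe to display in a notebook."""
    shown = copy.deepcopy(config)
    creds = shown["reddit_scraper"]["credentials"]
    for key in ("client_id", "client_secret"):
        if key in creds:
            creds[key] = "***"
    return shown


# Toy config with the same nesting as the one loaded above.
example = {"reddit_scraper": {"credentials": {
    "client_id": "from-file", "client_secret": "from-file", "user_agent": "scraper"}}}
patched = apply_env_credentials(example, env={"REDDIT_CLIENT_ID": "from-env"})
print(redact(patched))
```

With helpers like these, the last line of the cell could display `redact(config)` instead of `config`, so the notebook output never contains the secrets.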

## Collecting dataset

In [None]:
from src.utils.data_handler import DataHandler
from src.scrapers.reddit_scraper import RedditScraper
from src.kb_builder.semantic_filter import SemanticFilter
from src.kb_builder.chunker import Chunker
from src.RAG.retriever import Retriever
from src.RAG.QA_generator import QAgenerator

In [3]:
# scrape Reddit; results stay in memory because save_results=False
scraper = RedditScraper(
    data_dir=config['paths']['data_dir'],
    scraper_config=config['reddit_scraper']
    )

raw_data = scraper.scrape(save_results=False)

print(f"Number of raw data entries: {len(raw_data)}")
print("\nExample: \n")
print(raw_data[0])

Collecting data...
Removing duplicate posts...
Scraping completed!
Number of raw data entries: 470

Example: 

{'id': 'reddit_otbfy2', 'type': 'post', 'source': 'reddit', 'subreddit': 'changemyview', 'title': 'CMV: Micro-aggressions should take intent into account.', 'text': 'I was required by my company to participate in DEI (Diversity Equity and Inclusion) training that stated, "It does not matter what the aggressor intended, harm was caused." While I understand where this logic comes from, it is a poor principle for the workplace, justice, or society.\n\nLet me use a non-racial analogy: You tripped me.\n\nLet\'s say I am walking past you sitting down. Our legs make contact and I trip and fall. Let\'s look at some possible intents:\n\n1. You didn\'t see me coming and accidentally tripped me, apologized for the accident, and helped me up.\n2. You did see me coming, intended to trip me, laughed as I fell, and did not help me up.\n3. I saw your legs were there, could have slightly alter
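
The `filters` block in the config (`min_post_score`, `min_comment_score`, `min_char_length`) suggests that entries are screened by score and text length. `RedditScraper`'s internals are not shown here, so the following is only a sketch of how such thresholds might be applied. The function name `passes_filters` and the `score` field are assumptions, since the example entry above is truncated before any score would appear.

```python
def passes_filters(entry, filters):
    """Check one entry against length and score thresholds.

    Assumes each entry carries a numeric 'score' field (hypothetical).
    """
    # Reject texts shorter than the configured minimum length.
    if len(entry.get("text", "")) < filters.get("min_char_length", 0):
        return False
    # Posts and comments use different score thresholds.
    key = "min_post_score" if entry.get("type") == "post" else "min_comment_score"
    return entry.get("score", 0) >= filters.get(key, 0)


# Thresholds copied from the config shown earlier.
filters = {"min_post_score": 20, "min_comment_score": 10, "min_char_length": 100}

# Toy entries for illustration only.
toy_entries = [
    {"id": "p1", "type": "post", "text": "x" * 150, "score": 25},
    {"id": "p2", "type": "post", "text": "x" * 150, "score": 5},
    {"id": "c1", "type": "comment", "text": "x" * 50, "score": 40},
    {"id": "c2", "type": "comment", "text": "x" * 120, "score": 10},
]
kept = [e["id"] for e in toy_entries if passes_filters(e, filters)]
print(kept)
```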

In [4]:
# filter raw data on semantic criteria 
semantic_filter = SemanticFilter(
    data_dir=config["paths"]["data_dir"],
    model_dir=config['paths']['model_dir'],
    data=raw_data,
    semantic_filter_config=config['semantic_filter']
)

filtered_data = semantic_filter.filter(save_results=False) 

print(f"Number of filtered data entries: {len(filtered_data)}")
print("\nExample: \n")
print(f"Text keywords: {filtered_data[0]['text_keywords']}")
print(f"Similarity score: {filtered_data[0]['topic_similarity_score']}")

Models successfully loaded.
Semantic filtering completed!
Number of filtered data entries: 150

Example: 

Text keywords: ['minorities', 'ethnicities', 'ethnic', 'racism', 'racial', 'racist']
Similarity score: 0.6384389996528625
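
The `topic_similarity_score` above, together with `similarity_threshold: 0.4` in the config, points to a threshold on an embedding similarity. `SemanticFilter`'s implementation is not shown, so the sketch below only illustrates the general idea with cosine similarity on toy vectors. The function names, the `embedding` field, and the choice of `>=` for the comparison are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def filter_by_similarity(entries, topic_vector, threshold):
    """Keep entries whose embedding is close enough to the topic vector."""
    kept = []
    for entry in entries:
        score = cosine_similarity(entry["embedding"], topic_vector)
        if score >= threshold:
            # Attach the score so it can be inspected later.
            kept.append({**entry, "topic_similarity_score": score})
    return kept


# Toy 2-dimensional embeddings; real sentence embeddings have far more dimensions.
topic_vector = [1.0, 0.0]
toy_entries = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0]},
    {"id": "c", "embedding": [1.0, 1.0]},
]
result = filter_by_similarity(toy_entries, topic_vector, threshold=0.4)
print([(e["id"], round(e["topic_similarity_score"], 3)) for e in result])
```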


In [5]:
# split filtered data into chunks
chunker = Chunker(
    data_dir=config["paths"]["data_dir"],
    data=filtered_data,
)

chunks = chunker.chunk(save_results=False)

print(f"Number of Chunks: {len(chunks)}")
print("\nExample: \n")
print(chunks[0])

Number of Chunks: 150

Example: 

{'id': 'chunk_reddit_1hsaw8u_0', 'source_id': 'reddit_1hsaw8u', 'source_type': 'post', 'subreddit': 'changemyview', 'text': 'Based on my personal experience, what I\'ve been hearing from my relatives, friends and co-workers, and also what I\'ve read online on various forums, blogs, social media posts, I strongly believe that non-white countries are a lot more ignorant toward "minorities" or people who are considered non-white. In modern days most white countries would gladly accept immigrants and politically and socially they have dedicated laws and resources that are meant to help immigrants. Since the majority of white countries have a history of colonizing the world, modern history and social culture focus a lot on the sentiment of accepting people who are different than you, or simply the idea of racial/ethnic diversity and inclusion when it comes to representation and treatment. The school system or general education emphasizes on that, and all th
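
The example chunk's `id` appears to be built as `chunk_<source_id>_<index>`, and it carries the fields `source_id`, `source_type`, `subreddit`, and `text`. The actual splitting strategy of `Chunker` is not shown in this notebook. As an illustration only, the sketch below produces records of the same shape using a fixed word window; `chunk_entry` and `max_words` are hypothetical names, and the real chunker may split differently.

```python
def chunk_entry(entry, max_words=200):
    """Split one entry's text into word windows and build chunk records."""
    words = entry["text"].split()
    chunks = []
    for index, start in enumerate(range(0, len(words), max_words)):
        chunks.append({
            "id": f"chunk_{entry['id']}_{index}",
            "source_id": entry["id"],
            "source_type": entry["type"],
            "subreddit": entry["subreddit"],
            "text": " ".join(words[start:start + max_words]),
        })
    return chunks


# Toy entry for illustration only.
toy_entry = {
    "id": "reddit_abc123",
    "type": "post",
    "subreddit": "changemyview",
    "text": "one two three four five",
}
for chunk in chunk_entry(toy_entry, max_words=2):
    print(chunk["id"], "->", chunk["text"])
```

A text shorter than the window yields a single chunk, which would be consistent with the 150 filtered entries producing 150 chunks above, although that match could have other causes.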

## Synthetic QA text generation

In [None]:
generator = QAgenerator(
    retriever=Retriever, 
    data=chunks,
    model_dir=config["paths"]["model_dir"]
)

Created new collection: temp_collection
Successfully loaded 150 documents into Chroma


In [7]:
topic = "Diversity, Equity, and Inclusion (DEI)"

question = "In your opinion, what is effective in DEI communication and implementation at this company, and what is not effective?"
answers = generator.generate(query=question, topic=topic, n_documents=4)

for a in answers: 
    print(f"Q: {question}\nA: {a}\n\n")

Q: In your opinion, what is effective in DEI communication and implementation at this company, and what is not effective?
A: Effective DEI communication includes clear goals, measurable progress, and accountability. I've seen it work when everyone is committed to understanding and addressing biases, and when there's a culture of respect and inclusion. However, some DEI initiatives at this company have been vague or performative, lacking substance and impact. It's crucial that we focus on real change rather than just checking boxes.


Q: In your opinion, what is effective in DEI communication and implementation at this company, and what is not effective?
A: While I acknowledge the importance of DEI communication and implementation, I'm skeptical about the impact of empty announcements. Instead, I believe in focusing on tangible actions that demonstrate a company's commitment to diversity, equity, and inclusion. It's crucial for companies to walk the talk and not just talk the talk. Howe
