#### News to JSON
Daily News is a WhatsApp group that publishes news in the following format:

[20/06/23, 11:57:01 AM] Raghu Shell Technology: [20/06, 8:13 am] Mohan Gunderao Heroorkar: Today's Headlines from :

*Economic Times*

📝 IndiGo orders record 500 Airbus aircraft, another 500 in shopping cart

📝 Phoenix ARC, Ares SSG bid for Piramal Group's ₹2,600-cr bad loans

📝 SEBI bars IIFL Securities from taking new clients for 2 years

📝 Tata Power lays out a capex of Rs 12000 crore this fiscal

📝 Centre proposes to keep vehicle insurance premium unchanged for most categories

📝 TPG Investments exits Shriram Finance; offloads 99.18 lakh shares for Rs 1,389 crore

📝 Mukka Proteins refiles DRHP to raise funds via IPO

📝 Tardy monsoon pushes back kharif sowing; India 37% rain deficient

📝 May iPhone exports swell to record Rs 10,000 crore

📝 Intel to spend $33 billion in Germany in landmark expansion

📝 MCA to intensify crackdown on shell firms

📝 Baring EQT, ChrysCap buy HDFC’s Credila for Rs 10,350 cr

*Business Standard*

📝 Adani Transmission gets shareholders' nod to raise up to Rs 8,500 crore

📝 Govt starts interministerial discussions on the upcoming IBC Amendment Bill

📝 Sun Life Financial's GCC to hire 1k people in India, Philippines in 2 years

📝 HPCL to open vehicle service centres in India with Saudi Arabia's Petromin

📝 Chemical firm Lubrizol to invest $150 mn to build CPVC resin plant in Guj

#### USAGE:
- export the WhatsApp chat from "Daily News Group"
- unzip the export to get the text file

get_news(fname,[itemwise=bool])
    -  fname is the file path to the text file
    -  optional itemwise=bool (default is False)
    -  if itemwise is True, one record per news item per day per agency
    -  otherwise each record contains the bunched news items per day per agency

    - returns a list of records, each of which is a dict

Sample itemwise records: the "Mint" digest for one day has 10 news items
```
 [{'ids': ['2023-05-08M1'],
  'documents': ['BetterPlace acquires fintech lending startup Bueno Finance'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M2'],
  'documents': ['LIC, MFs plough $2 billion into IT firms in Q4 as shares tumble'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M3'],
  'documents': ['India eyes clean energy sources to tackle tariffs'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M4'],
  'documents': ['India conducts talks with UAE on pharma export pricing challenges'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M5'],
  'documents': ['Blackstone, ADIA are likely bidders for HDFC’s Credila'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M6'],
  'documents': ['NCLT to hear BoB plea in Rel Home case'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M7'],
  'documents': ["Silver ETFs getting investors' traction; asset bases reach ₹1,800 crore"],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M8'],
  'documents': ['Connekkt Media Network signs ₹270 crore deal with AVS Studios for 3 films'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M9'],
  'documents': ['Defence ministry approves posting women officers of Territorial Army along LoC'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M10'],
  'documents': ['Grindwell Norton Q4 Earnings: PAT rise 10% YoY in Q4, net income up 20%, Board declares highest ever dividend.'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}}]
```
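
The ids in the records above combine the date, the initials of the agency name, and a 1-based item number. A minimal sketch of that scheme follows; `make_id` is a hypothetical helper name used only for illustration, and the notebook builds the same string inline.

```python
def make_id(date, agency, item_no=None):
    """Build a record id such as '2023-05-08M1' (hypothetical helper)."""
    # Initial letter of every word in the agency name, e.g. "Economic Times" -> "ET"
    acronym = "".join(word[0] for word in agency.split())
    base = date + acronym
    # Itemwise records append the 1-based item number; bunched records do not
    return base if item_no is None else base + str(item_no)


itemwise_id = make_id("2023-05-08", "Mint", 1)
bunched_id = make_id("2023-05-08", "Economic Times")
print(itemwise_id, bunched_id)
```

Note that two agencies with the same initials on the same day would produce the same id under this scheme.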

The above news items, when bunched, appear as below:
```
{'ids': ['2023-05-08M'],
 'documents': "BetterPlace acquires fintech lending startup Bueno Finance,LIC, MFs plough $2 billion into IT firms in Q4 as shares tumble,India eyes clean energy sources to tackle tariffs,India conducts talks with UAE on pharma export pricing challenges,Blackstone, ADIA are likely bidders for HDFC’s Credila,NCLT to hear BoB plea in Rel Home case,Silver ETFs getting investors' traction; asset bases reach ₹1,800 crore,Connekkt Media Network signs ₹270 crore deal with AVS Studios for 3 films,Defence ministry approves posting women officers of Territorial Army along LoC,Grindwell Norton Q4 Earnings: PAT rise 10% YoY in Q4, net income up 20%, Board declares highest ever dividend.",
 'metadatas': {'source': 'Mint', 'date': '2023-05-08'}}

```

Both forms use the `ids`, `documents` and `metadatas` keys, which makes the data convenient to add to chromadb.
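
As a rough sketch of how such records might be prepared for chromadb, the helper below flattens them into parallel lists. `to_chroma_batch` is a hypothetical name, not part of this notebook, and the final `collection.add(...)` call is left as a comment because it assumes chromadb is installed and that `collection` is a collection you have already created.

```python
def to_chroma_batch(records):
    """Flatten get_news-style records into parallel lists (hypothetical helper)."""
    batch = {"ids": [], "documents": [], "metadatas": []}
    for rec in records:
        docs = rec["documents"]
        # Bunched records carry one string; itemwise records carry a list
        if isinstance(docs, str):
            docs = [docs]
        for rec_id, doc in zip(rec["ids"], docs):
            batch["ids"].append(rec_id)
            batch["documents"].append(doc)
            batch["metadatas"].append(dict(rec["metadatas"]))
    return batch


sample = [
    {"ids": ["2023-05-08M1"],
     "documents": ["BetterPlace acquires fintech lending startup Bueno Finance"],
     "metadatas": {"source": "Mint", "date": "2023-05-08"}},
    {"ids": ["2023-05-08M6"],
     "documents": ["NCLT to hear BoB plea in Rel Home case"],
     "metadatas": {"source": "Mint", "date": "2023-05-08"}},
]
batch = to_chroma_batch(sample)
print(batch["ids"])

# Assumption: `collection` is an existing chromadb collection.
# collection.add(ids=batch["ids"],
#                documents=batch["documents"],
#                metadatas=batch["metadatas"])
```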


In [1]:
import os
import re
from datetime import datetime, timedelta
from tqdm.notebook import tqdm
from time import sleep

In [None]:
## Helper function: parses the "[dd/mm/yy, hh:mm:ss AM|PM]" prefix of a chat line.
## Despite the name, no timezone conversion is applied; callers only use the date part.
def convert_to_utc(timestamp_string):
    pattern = r"\[(\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2}:\d{2} [APap][Mm])\]"
    match = re.search(pattern, timestamp_string)
    
    if match:
        timestamp = match.group(1)
        dt_format = "%d/%m/%y, %I:%M:%S %p"
        # %I combined with %p already yields the 24-hour value,
        # so PM timestamps need no manual 12-hour adjustment
        dt_object = datetime.strptime(timestamp, dt_format)

        # Zero the hour; the result stays on the same calendar day
        utc_dt = dt_object - timedelta(hours=dt_object.hour)
        return utc_dt
    else:
        return None
        
## Helper to assemble news item or bunched news items from an agency for a particular day      
def make_news(ldt,id,itms,agency,**kwargs):
    ITEMWISE = kwargs.get("itemwise",False)
    _news = []
    if ITEMWISE:
        for i,itm in enumerate(itms):
            _tmp = {"ids":[id+str(i+1)],"documents":[itm],"metadatas":{"source":agency,"date":ldt}}
            #print(_tmp)
            _news.append(_tmp)
    else:
        _tmp = {"ids":[id],"documents":",".join(itms),"metadatas":{"source":agency,"date":ldt}}
        _news.append(_tmp)
        #print(_tmp)
    return _news

# Returns a list of dicts containing news in a structure suitable to add to a vector database (chroma in this case)

def get_news(fname,**kwargs):
    ## Some constants for news pre processing
    ITEMWISE=kwargs.get("itemwise",False)
    DATE = "["
    DATE_FILTER ="Raghu"
    NEWS_AGENCY = "*"
    NEWS_ITEM = "📝"
    NEW_LINE = "\n"
    SHOW = False
    
    news=[]
    rejects=0
    _items=[]
    try:
        # The export contains emoji and ₹ symbols, so read it explicitly as UTF-8
        with open(fname,"r",encoding="utf-8") as file:
            for line in file:
                if line[0] in [DATE, NEWS_AGENCY, NEWS_ITEM]:
                    match line[0]:
                        case "[":
                            if "Raghu" in line:
                                if len(_items)!=0:
                                    news.append(make_news(last_date,_id,_items,news_agency,itemwise=ITEMWISE))
                                    _items=[]
                                   
                                last_date = str(convert_to_utc(line.strip()))[0:10]
                                #print("Date",last_date)
                        case "*":
                            if len(_items)!=0:
                                news.append(make_news(last_date,_id,_items,news_agency,itemwise=ITEMWISE))
                                _items=[]       
                            news_agency = line[1:-2]
                            acronym="".join(list(map(lambda a: a[0],news_agency.split())))
                            _id = last_date+acronym
    
                        case "📝":
                            _items.append(line[1:].strip())
                else:
                    rejects +=1
            news.append(make_news(last_date,_id,_items,news_agency,itemwise=ITEMWISE))
                    
    except FileNotFoundError:
        print(f"File '{fname}' not found.")
    except Exception as e:
        print(f"An error occurred: {e}")
    
    print(f"Rejected lines : {rejects}")
    print(f"News Items : {len(news)}")
    news = [item for sublist in news for item in sublist]
    return news
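
A note on the timestamp format used in the export: with `datetime.strptime`, the `%I` (12-hour clock) and `%p` (AM/PM) directives together already produce the correct 24-hour value, so no manual adjustment for PM is needed. The short sketch below illustrates this on the sample timestamp; `parse_export_timestamp` is a hypothetical name, and it assumes an English locale for the AM/PM marker.

```python
from datetime import datetime


def parse_export_timestamp(raw):
    """Parse 'dd/mm/yy, hh:mm:ss AM|PM' into a naive datetime (hypothetical helper)."""
    return datetime.strptime(raw, "%d/%m/%y, %I:%M:%S %p")


morning = parse_export_timestamp("20/06/23, 11:57:01 AM")
evening = parse_export_timestamp("20/06/23, 11:57:01 PM")

print(morning.isoformat())         # 2023-06-20T11:57:01
print(evening.isoformat())         # 2023-06-20T23:57:01
print(evening.date().isoformat())  # 2023-06-20
```

Both values stay on the same calendar day, which is the only part the records use.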
    

In [10]:
news=get_news("./_chat.txt")
news[3]

Rejected lines : 140
News Items : 12


{'ids': ['2023-05-08M'],
 'documents': "BetterPlace acquires fintech lending startup Bueno Finance,LIC, MFs plough $2 billion into IT firms in Q4 as shares tumble,India eyes clean energy sources to tackle tariffs,India conducts talks with UAE on pharma export pricing challenges,Blackstone, ADIA are likely bidders for HDFC’s Credila,NCLT to hear BoB plea in Rel Home case,Silver ETFs getting investors' traction; asset bases reach ₹1,800 crore,Connekkt Media Network signs ₹270 crore deal with AVS Studios for 3 films,Defence ministry approves posting women officers of Territorial Army along LoC,Grindwell Norton Q4 Earnings: PAT rise 10% YoY in Q4, net income up 20%, Board declares highest ever dividend.",
 'metadatas': {'source': 'Mint', 'date': '2023-05-08'}}

In [14]:
news=get_news("./_chat.txt",itemwise=True)

Rejected lines : 140
News Items : 12


In [16]:
news[31:41]

[{'ids': ['2023-05-08M1'],
  'documents': ['BetterPlace acquires fintech lending startup Bueno Finance'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M2'],
  'documents': ['LIC, MFs plough $2 billion into IT firms in Q4 as shares tumble'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M3'],
  'documents': ['India eyes clean energy sources to tackle tariffs'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M4'],
  'documents': ['India conducts talks with UAE on pharma export pricing challenges'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M5'],
  'documents': ['Blackstone, ADIA are likely bidders for HDFC’s Credila'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M6'],
  'documents': ['NCLT to hear BoB plea in Rel Home case'],
  'metadatas': {'source': 'Mint', 'date': '2023-05-08'}},
 {'ids': ['2023-05-08M7'],
  'documen