# The SuperMind Hackathon


## Sagnik Pramanik
##### sagnikpramanik95@gmail.com

Import everything we need: the LangChain components, the Astra DB client, and utilities such as `os`, `dotenv`, and `csv` (for reading the dataset).

In [79]:
import os
import csv
from langchain_openai import OpenAI
from langchain_google_genai import ChatGoogleGenerativeAI
from dotenv import load_dotenv
from astrapy import DataAPIClient
# The langchain.schema and langchain.prompts import paths are deprecated; langchain_core is their current home
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

Load the environment variables and set the dataset path.

In [80]:
load_dotenv()

# Retrieve Astra DB credentials and settings from environment variables
ASTRA_TOKEN = os.getenv("ASTRA_TOKEN")
ASTRA_DB_URL = os.getenv("ASTRA_DB_URL")
COLLECTION_NAME = os.getenv("COLLECTION_NAME")
file_path = "./dataset1.csv"  # path to the CSV dataset
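
`os.getenv` returns `None` for a variable that is not set, so a missing credential would only surface later as a connection error. A minimal sketch of an early check follows; `REQUIRED_VARS`, `missing_env_vars`, and `require_env_vars` are hypothetical helper names, not part of the notebook.

```python
import os

# Names of the variables the notebook reads (taken from the cell above)
REQUIRED_VARS = ("ASTRA_TOKEN", "ASTRA_DB_URL", "COLLECTION_NAME")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


def require_env_vars(names, environ=None):
    """Raise a RuntimeError that lists every missing variable at once."""
    missing = missing_env_vars(names, environ)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `require_env_vars(REQUIRED_VARS)` right after `load_dotenv()` would stop the notebook before any database call is attempted.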

Next, connect to the database.


In [81]:
client = DataAPIClient(ASTRA_TOKEN)
db = client.get_database_by_api_endpoint(ASTRA_DB_URL)

Create the collection if it does not already exist.

In [82]:
if COLLECTION_NAME not in db.list_collection_names():
    db.create_collection(COLLECTION_NAME)
    print("Created Collection")
else:
    print("Collection already exists")

Created Collection


Now read the rows from the dataset and insert them into the collection one at a time.

In [83]:
collection = db.get_collection(COLLECTION_NAME)
try:
    # newline="" lets the csv module handle line endings itself
    with open(file_path, mode="r", newline="") as file:
        csv_reader = csv.DictReader(file)
        for row_number, row in enumerate(csv_reader, start=1):
            # Prepare the document to insert into the collection
            document = {
                "post_id": row["post_id"],
                "day_of_posting": row["day_of_posting"],
                "date_of_posting": row["date_of_posting"],
                "time_of_posting": row["time_of_posting"],
                "post_type": row["post_type"],
                "likes": int(row["likes"]),
                "comments": int(row["comments"]),
                "shares": int(row["shares"]),
                "repost": int(row["repost"]),
                "gender": row["gender"],
                # Split the comma-separated hashtags; an empty cell becomes an empty list
                "hashtags": [tag.strip() for tag in row["hashtags"].split(",") if tag.strip()],
            }

            # Insert into the collection
            collection.insert_one(document)
            print(f"Processed row {row_number}")

        print("Data successfully inserted into the database.")
except FileNotFoundError:
    print(f"Error: File '{file_path}' not found.")
except KeyError as e:
    print(f"Error: Missing required column in the file: {e}")
except ValueError as e:
    print(f"Error: Invalid data format: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

Processed row 1
Processed row 2
Processed row 3
Processed row 4
Processed row 5
Processed row 6
Processed row 7
Processed row 8
Processed row 9
Processed row 10
Data successfully inserted into the database.
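
The row-to-document conversion can also be separated from the database call, which makes it testable without a live connection. The sketch below assumes the column names used in the cell above; `row_to_document`, `read_documents`, and the sample row are hypothetical and only illustrate the idea. Unlike the original cell, it strips whitespace around each hashtag.

```python
import csv
import io

TEXT_FIELDS = ("post_id", "day_of_posting", "date_of_posting",
               "time_of_posting", "post_type", "gender")
INT_FIELDS = ("likes", "comments", "shares", "repost")


def row_to_document(row):
    """Convert one csv.DictReader row into a document dictionary."""
    doc = {field: row[field] for field in TEXT_FIELDS}
    for field in INT_FIELDS:
        doc[field] = int(row[field])  # raises ValueError on non-numeric cells
    # An empty hashtags cell yields an empty list
    doc["hashtags"] = [t.strip() for t in row["hashtags"].split(",") if t.strip()]
    return doc


def read_documents(file_obj):
    """Read every row of an open CSV file into a list of documents."""
    return [row_to_document(row) for row in csv.DictReader(file_obj)]


# Hypothetical sample data, not taken from dataset1.csv
sample_csv = io.StringIO(
    "post_id,day_of_posting,date_of_posting,time_of_posting,post_type,"
    "likes,comments,shares,repost,gender,hashtags\n"
    '0000000000000001,Monday,2024-12-20,09:00:00,reels,10,2,1,0,F,"#A, #B"\n'
)
documents = read_documents(sample_csv)
```

With the documents collected in a list, they could be sent in a single request instead of one call per row, for example through `collection.insert_many(documents)` if the installed astrapy version provides it.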


### Let's check what is inside the DB

In [84]:
try:
    # Materialize the cursor into a list so the documents can be reused later
    data = list(collection.find())
    print(data)
except Exception as e:
    print(f"Error retrieving data from Astra DB: {e}")

[{'_id': '31f8d89a-e627-4f98-b8d8-9ae6275f98a1', 'post_id': '0000000000000009', 'day_of_posting': 'Tuesday', 'date_of_posting': '2024-12-23', 'time_of_posting': '14:00:00', 'post_type': 'video-carousel', 'likes': 225, 'comments': 48, 'shares': 25, 'repost': 13, 'gender': 'M', 'hashtags': ['#Music']}, {'_id': 'beb4915a-9e8f-4de7-b491-5a9e8fbde732', 'post_id': '0000000000000010', 'day_of_posting': 'Wednesday', 'date_of_posting': '2024-12-22', 'time_of_posting': '10:00:00', 'post_type': 'mixed-carousel', 'likes': 140, 'comments': 30, 'shares': 16, 'repost': 8, 'gender': 'F', 'hashtags': ['#Nature']}, {'_id': 'ff234ed9-b6e5-4b3a-a34e-d9b6e53b3aab', 'post_id': '0000000000000007', 'day_of_posting': 'Sunday', 'date_of_posting': '2024-12-25', 'time_of_posting': '13:30:00', 'post_type': 'reels', 'likes': 300, 'comments': 68, 'shares': 39, 'repost': 23, 'gender': 'M', 'hashtags': ['#Food']}, {'_id': '3b5beb3b-1005-4798-9beb-3b1005179868', 'post_id': '0000000000000002', 'day_of_posting': 'Tuesday

# Option 1: OpenAI

In [85]:
llm = OpenAI(model="gpt-3.5-turbo-instruct", openai_api_key=os.getenv("OPENAI_API_KEY"))
# Adjacent string literals are concatenated as-is, so each fragment ends with a space
prompt_template = (
    "Here is the entire dataset: {data}. "
    "Summarize the key trends and patterns only, without providing row-by-row details. "
    "For example, one trend could be 'Posts by females receive 46% more likes compared to posts by males'. Similarly, generate more insights from the data and use numbers. "
    "Finally, the output should be a list of trends/insights. Do not generate anything else, and give no suggestions. "
    "Generate as many trends as possible, using all columns of the data."
)
prompt = PromptTemplate(input_variables=["data"], template=prompt_template)
chain = (
    {"data": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
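
Python joins adjacent string literals with no separator, so a prompt built from several fragments runs its sentences together unless every fragment ends with a space. A minimal sketch of an alternative that joins the instructions explicitly is shown here; `INSTRUCTIONS`, `build_template`, and `template_text` are illustrative names, and the instruction wording is abbreviated.

```python
# Each instruction is a separate list item; no trailing spaces are needed
INSTRUCTIONS = [
    "Here is the entire dataset: {data}.",
    "Summarize the key trends and patterns only, without row-by-row details.",
    "Output a list of trends and insights, backed by numbers.",
]


def build_template(parts):
    """Join prompt fragments with exactly one space between them."""
    return " ".join(part.strip() for part in parts)


template_text = build_template(INSTRUCTIONS)

# The pitfall being avoided: implicit concatenation adds no space
run_together = ("First sentence." "Second sentence.")
```

The resulting `template_text` still contains the `{data}` placeholder, so it can be passed to `PromptTemplate` in the same way as the string in the cell above.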


# Option 2: Google Gemini

In [87]:
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=os.getenv("GEMINI_API_KEY"))
# Adjacent string literals are concatenated as-is, so each fragment ends with a space
prompt_template = (
    "Here is the entire dataset: {data}. "
    "Identify patterns, trends, and insights from this dataset. "
    "Summarize the key trends only, without providing row-by-row details. "
    "For example, one trend could be 'Posts by females receive 46% more likes compared to posts by males'. Similarly, generate more insights from the data and use numbers to further strengthen the trends. "
    "Finally, the output should be a list of trends and insights from the data. Do not generate anything else, and give no suggestions. "
    "Generate as many trends as possible, including all columns of the data except _id and post_id. "
    "One of the trends could show the correlation between the time of posting and the engagement."
)
prompt = PromptTemplate(input_variables=["data"], template=prompt_template)
chain = (
    {"data": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
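
The prompt asks the model to ignore `_id` and `post_id`. One option, sketched below, is to drop those fields before the data reaches the prompt and to serialize the rest as JSON rather than relying on `str()`. `EXCLUDED_FIELDS`, `prepare_payload`, and the sample document are hypothetical; `default=str` is included on the assumption that some stored values may not be JSON-native.

```python
import json

# Fields the prompt tells the model to skip
EXCLUDED_FIELDS = ("_id", "post_id")


def prepare_payload(documents, excluded=EXCLUDED_FIELDS):
    """Drop identifier fields and return the documents as a JSON string."""
    cleaned = [
        {key: value for key, value in doc.items() if key not in excluded}
        for doc in documents
    ]
    # default=str converts any value json cannot encode natively into a string
    return json.dumps(cleaned, default=str)


# Hypothetical document shaped like the ones stored above
sample_docs = [
    {"_id": "abc", "post_id": "0000000000000001", "post_type": "reels",
     "likes": 10, "hashtags": ["#A"]},
]
payload = prepare_payload(sample_docs)
```

The chain could then be invoked with `prepare_payload(data)` in place of `str(data)`, which also shortens the prompt by leaving out two fields per document.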

In [88]:
trends = chain.invoke(str(data))
print(trends)

* Reels and video carousels receive significantly more likes (average 247.5) than other post types (average 116).
* Tuesday and Wednesday show higher average engagement (likes, comments, shares) compared to other days.
* Posts made between 10:00 AM and 14:15 PM tend to receive more engagement than posts made outside this time frame.
* Posts with hashtags receive substantially more engagement (average likes 193) than posts without hashtags (average likes 62.5).
* Male and female posts show similar levels of engagement, with no significant difference in average likes, comments, or shares.
*  There is a strong positive correlation between likes, comments, shares, and reposts; higher likes generally indicate higher comments, shares, and reposts.
* '#Tech' and '#Food' hashtags show higher average engagement than other hashtags.
* 'Mixed-carousel' and 'video-carousel' post types receive slightly higher average shares compared to other post types.
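
Because the figures above come from a language model, it may be worth recomputing a few of them directly from the documents. The sketch below groups documents by one field and averages another; `average_by` is a hypothetical helper, and the sample rows (including the `static-image` post type) are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def average_by(documents, group_field, value_field):
    """Return the mean of value_field for each distinct value of group_field."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc[group_field]].append(doc[value_field])
    return {key: mean(values) for key, values in groups.items()}


# Hypothetical rows with the same field names as the stored documents
sample = [
    {"post_type": "reels", "gender": "M", "likes": 300},
    {"post_type": "reels", "gender": "F", "likes": 200},
    {"post_type": "static-image", "gender": "F", "likes": 100},
]

likes_by_type = average_by(sample, "post_type", "likes")
likes_by_gender = average_by(sample, "gender", "likes")
```

Running `average_by(data, "post_type", "likes")` on the documents fetched earlier would give numbers to compare against the averages the model reports.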


