# The SuperMind Hackathon


## Sagnik Pramanik
##### sagnikpramanik95@gmail.com

Import everything we need: the LangChain components and the Astra DB client, plus the `os`, `dotenv`, and `csv` utilities (the last for reading CSV files).

In [2]:
import os
import csv
from langchain_openai import OpenAI
from langchain_core.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI
from dotenv import load_dotenv
from astrapy import DataAPIClient
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

Load the environment variables and declare the dataset path.

In [10]:
load_dotenv()

# Retrieve Astra DB credentials and settings from environment variables
ASTRA_TOKEN = os.getenv("ASTRA_TOKEN")
ASTRA_DB_URL = os.getenv("ASTRA_DB_URL")
COLLECTION_NAME = os.getenv("COLLECTION_NAME")
file_path = "./dataset2.csv"
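
`os.getenv` returns `None` for a variable that is not set, and the failure then only surfaces later when the client tries to connect. A minimal sketch of an early check, assuming the three variable names used above; the helper `missing_env_vars` is a hypothetical name, not part of any library:

```python
import os

# The variable names read in the cell above.
REQUIRED_VARS = ("ASTRA_TOKEN", "ASTRA_DB_URL", "COLLECTION_NAME")


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping.

    `environ` defaults to os.environ; passing a plain dict makes the
    function easy to try out without touching the real environment.
    """
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

Running this right after `load_dotenv()` points at a misconfigured `.env` file before any network call is made.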

Now, we connect to the DB


In [4]:
client = DataAPIClient(ASTRA_TOKEN)
db = client.get_database_by_api_endpoint(ASTRA_DB_URL)

Create a new collection if it does not already exist.

In [11]:
if COLLECTION_NAME not in db.list_collection_names():
    db.create_collection(COLLECTION_NAME)
    print("Created Collection")
else:
    print("Collection already exists")

Created Collection


Now we read the rows from the dataset and insert them into the database one at a time.

In [12]:
collection = db.get_collection(COLLECTION_NAME)
try:
    with open(file_path, mode="r", newline="") as file:
        csv_reader = csv.DictReader(file)
        for row_number, row in enumerate(csv_reader, start=1):
            # Prepare the document to insert into the collection
            document = {
                "post_id": row["post_id"],
                "day_of_posting": row["day_of_posting"],
                "date_of_posting": row["date_of_posting"],
                "time_of_posting": row["time_of_posting"],
                "post_type": row["post_type"],
                # Engagement counts are stored as integers
                "likes": int(row["likes"]),
                "comments": int(row["comments"]),
                "shares": int(row["shares"]),
                "repost": int(row["repost"]),
                "gender": row["gender"],
                # Comma-separated hashtags become a list; an empty cell becomes []
                "hashtags": row["hashtags"].split(",") if row["hashtags"].strip() else [],
            }

            # Insert the document into the collection
            collection.insert_one(document)
            print(f"Processed row {row_number}")

        print("Data successfully inserted into the database.")
except FileNotFoundError:
    print(f"Error: File '{file_path}' not found.")
except KeyError as e:
    print(f"Error: Missing required column in the file: {e}")
except ValueError as e:
    print(f"Error: Invalid data format: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

Processed row 1
Processed row 2
Processed row 3
Processed row 4
Processed row 5
Processed row 6
Processed row 7
Processed row 8
Processed row 9
Processed row 10
Processed row 11
Processed row 12
Processed row 13
Processed row 14
Processed row 15
Processed row 16
Processed row 17
Processed row 18
Processed row 19
Processed row 20
Processed row 21
Processed row 22
Processed row 23
Processed row 24
Processed row 25
Processed row 26
Processed row 27
Processed row 28
Processed row 29
Processed row 30
Processed row 31
Processed row 32
Processed row 33
Processed row 34
Processed row 35
Processed row 36
Processed row 37
Processed row 38
Processed row 39
Processed row 40
Processed row 41
Processed row 42
Processed row 43
Processed row 44
Processed row 45
Processed row 46
Processed row 47
Processed row 48
Processed row 49
Processed row 50
Processed row 51
Processed row 52
Processed row 53
Processed row 54
Processed row 55
Processed row 56
Processed row 57
Processed row 58
Processed row 59
Proces
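
The loop above makes one `insert_one` call per row. For larger files, a possible alternative is to build the documents first and send them in batches. The sketch below shows only the conversion and batching steps on a small made-up CSV; `row_to_document`, `chunked`, `SAMPLE_CSV`, and the batch size of 50 are illustrative assumptions, and the actual write (astrapy's `Collection.insert_many`) is left as a comment because it needs a live connection. Unlike the cell above, this version also strips whitespace around each hashtag.

```python
import csv
import io
from itertools import islice

TEXT_FIELDS = ("post_id", "day_of_posting", "date_of_posting",
               "time_of_posting", "post_type", "gender")
INT_FIELDS = ("likes", "comments", "shares", "repost")


def row_to_document(row):
    """Convert one csv.DictReader row into a document dict."""
    doc = {field: row[field] for field in TEXT_FIELDS}
    for field in INT_FIELDS:
        doc[field] = int(row[field])
    # Split on commas, strip whitespace, and drop empty entries.
    doc["hashtags"] = [t.strip() for t in row["hashtags"].split(",") if t.strip()]
    return doc


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Made-up sample rows standing in for dataset2.csv.
SAMPLE_CSV = (
    "post_id,day_of_posting,date_of_posting,time_of_posting,post_type,"
    "likes,comments,shares,repost,gender,hashtags\n"
    '1,Monday,2024-01-01,12:00:00,reels,220,50,30,15,M,"#AI, #Tech"\n'
    "2,Tuesday,2024-01-02,13:00:00,carousel,100,10,5,2,F,\n"
)

rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
for batch in chunked((row_to_document(r) for r in rows), size=50):
    # Assumption: with a live connection this would be
    # collection.insert_many(batch)
    print(f"Would insert {len(batch)} documents")
```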

### Let's check what is inside the DB

In [13]:
try:
    # Materialize the cursor into a list so the documents can be reused later
    data = list(collection.find())
    print(data)
except Exception as e:
    print(f"Error retrieving data from Astra DB: {e}")

[{'_id': '9b3aa4c8-5eb5-4a57-baa4-c85eb5ca576c', 'post_id': '0000000000000066', 'day_of_posting': 'Wednesday', 'date_of_posting': '2024-10-27', 'time_of_posting': '12:00:00', 'post_type': 'reels', 'likes': 220, 'comments': 50, 'shares': 30, 'repost': 15, 'gender': 'M', 'hashtags': ['#AI']}, {'_id': 'eb6e4102-3094-4770-ae41-023094e7704f', 'post_id': '0000000000000049', 'day_of_posting': 'Sunday', 'date_of_posting': '2024-11-13', 'time_of_posting': '13:00:00', 'post_type': 'video-carousel', 'likes': 200, 'comments': 45, 'shares': 22, 'repost': 11, 'gender': 'M', 'hashtags': ['#Fitness']}, {'_id': '54df00b0-595d-4129-9f00-b0595d3129db', 'post_id': '0000000000000029', 'day_of_posting': 'Monday', 'date_of_posting': '2024-12-03', 'time_of_posting': '16:30:00', 'post_type': 'video-carousel', 'likes': 200, 'comments': 45, 'shares': 23, 'repost': 11, 'gender': 'M', 'hashtags': ['#Tech']}, {'_id': '753ff794-b0f6-4dd6-bff7-94b0f67dd6fd', 'post_id': '0000000000000013', 'day_of_posting': 'Saturday'

# Option 1: OpenAI

In [8]:
llm = OpenAI(model="gpt-3.5-turbo-instruct", openai_api_key=os.getenv("OPENAI_API_KEY"))
# Each fragment ends with a space so the concatenated sentences do not run together
prompt_template = (
    "Here is the entire dataset: {data}. "
    "Summarize only the key trends and patterns, without providing row-by-row details. "
    "For example, one trend could be 'Posts by females receive 46% more likes compared to posts by males'. "
    "Similarly, generate more insights from the data and use numbers. "
    "Finally, the output should be a list of trends/insights. Do not generate anything else, and give no suggestions. "
    "Generate as many trends as possible, using all columns of the data."
)
prompt = PromptTemplate(input_variables=["data"], template=prompt_template)
chain = (
    {"data": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


# Option 2: Google Gemini

In [14]:
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=os.getenv("GEMINI_API_KEY"))
# Each fragment ends with a space so the concatenated sentences do not run together
prompt_template = (
    "Here is the entire dataset: {data}. "
    "Identify patterns, trends, and insights from this dataset. "
    "Summarize the key trends only, without providing row-by-row details. "
    "For example, one trend could be 'Posts by females receive 46% more likes compared to posts by males'. "
    "Similarly, generate more insights from the data and use numbers to further strengthen the trends. "
    "Finally, the output should be a list of trends and insights from the data. Do not generate anything else, and give no suggestions. "
    "Generate as many trends as possible, including all columns of the data except _id and post_id. "
    "One of the trends could show the correlation between the time of posting and the engagement."
)
prompt = PromptTemplate(input_variables=["data"], template=prompt_template)
chain = (
    {"data": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
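
The Gemini prompt asks the model to ignore `_id` and `post_id`. One alternative is to drop those fields before the data reaches the prompt, which also shortens the text sent to the model. The following is a rough sketch that serializes the documents as compact CSV; `to_compact_csv` and `EXCLUDED_KEYS` are hypothetical names, and the sample document is made up in the same shape as the records printed earlier.

```python
import csv
import io

# Fields the prompt tells the model to ignore.
EXCLUDED_KEYS = ("_id", "post_id")


def to_compact_csv(documents, excluded=EXCLUDED_KEYS):
    """Serialize documents as CSV text, leaving out the excluded fields.

    Column order follows the first document; list-valued hashtags are
    joined with spaces so each record stays on one line.
    """
    if not documents:
        return ""
    fieldnames = [key for key in documents[0] if key not in excluded]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for doc in documents:
        row = {key: doc.get(key, "") for key in fieldnames}
        if isinstance(row.get("hashtags"), list):
            row["hashtags"] = " ".join(row["hashtags"])
        writer.writerow(row)
    return buffer.getvalue()


# Made-up document for illustration.
sample = [{"_id": "x", "post_id": "1", "post_type": "reels",
           "likes": 220, "hashtags": ["#AI", "#Tech"]}]
print(to_compact_csv(sample))
```

With the notebook's variables, the string returned by `to_compact_csv(data)` could be passed to `chain.invoke` in place of the raw list representation.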

In [15]:
trends = chain.invoke(str(data))
print(trends)

* Reels and video-carousel posts receive significantly more likes (average over 200) than static-images and carousel posts (average under 120).

* Female users tend to have a higher average number of likes, comments, shares, and reposts compared to male users.

* Posts with hashtags receive substantially more engagement (likes, comments, shares, and reposts) than posts without hashtags.

* There's a weak positive correlation between the time of posting and engagement; posts made between 10:00 AM and 2:00 PM tend to have higher engagement metrics.

*  The most frequent day of posting is Friday and Sunday.

* "#Fitness" and "#Tech" are the most frequently used hashtags.

* "Video-carousel" is the most frequent post type.

*  There is a positive correlation between likes, comments, shares, and reposts; higher likes generally correspond with higher comments, shares, and reposts.
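
Figures produced by a language model, such as "average over 200", are worth cross-checking against the data itself. A minimal sketch of one such check using only the standard library is given below; `average_by` is a hypothetical helper and the sample documents are made up, so the printed numbers say nothing about the real dataset.

```python
from collections import defaultdict
from statistics import mean


def average_by(documents, group_key, metric):
    """Return the mean of `metric` for each distinct value of `group_key`."""
    groups = defaultdict(list)
    for doc in documents:
        groups[doc[group_key]].append(doc[metric])
    return {key: mean(values) for key, values in groups.items()}


# Made-up documents in the same shape as the records stored in the collection.
sample_docs = [
    {"post_type": "reels", "gender": "M", "likes": 200},
    {"post_type": "reels", "gender": "F", "likes": 220},
    {"post_type": "carousel", "gender": "F", "likes": 100},
]

print(average_by(sample_docs, "post_type", "likes"))
print(average_by(sample_docs, "gender", "likes"))
```

In the notebook, calling `average_by(data, "post_type", "likes")` on the documents fetched from Astra DB would give the per-type averages to compare with the first trend in the list.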

