## Content Analysis Module

#### Objective

This notebook covers the full content analysis pipeline:

- Preprocessing (Step 3.1)
- Vectorization (Step 3.2)
- Topic modeling (Step 3.3)
- Keyword extraction (Step 3.4)
- Sentiment analysis (Step 3.5)
- Categorization (Step 3.6)

**Preprocessing pipeline**

| Substep | Concept                     | Purpose                                 |
| ------- | --------------------------- | --------------------------------------- |
| 3.1     | 🔤 **Text preprocessing**   | Clean and normalize text                |
| 3.2     | 🔢 **TF-IDF vectorization** | Turn text into numbers for ML models    |
| 3.3     | 📚 **Topic modeling**       | Discover dominant themes in content     |
| 3.4     | 🔑 **Keyword extraction**   | Find top keywords                       |
| 3.5     | 😊 **Sentiment analysis**   | Detect positivity/negativity            |
| 3.6     | 🏷️ **Categorization**      | Assign content to categories (optional) |


#### Required Setup

In [53]:
from pathlib import Path
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import pandas as pd
import sqlite3

from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF vectorization
from sklearn.decomposition import NMF  # topic modeling with NMF


In [54]:
# Download the tokenizer model and the stopword list (only needed once)
nltk.download("punkt_tab")  # pre-trained tokenizer model for sentence and word splitting
nltk.download("stopwords")  # common low-information words, e.g. "the", "is", "and", "on"

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/vinothhaldorai/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vinothhaldorai/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [55]:
# Define English stopwords
stop_words = set(stopwords.words("english"))

### Text Preprocessing (Step 3.1)

Raw text typically contains:
- Irrelevant symbols (punctuation, emojis, etc.)
- Stopwords (e.g., "the", "and", "in")
- Inconsistent casing
- Words with the same meaning in different forms ("running", "run"); `clean_text` below does not merge these, which would require stemming or lemmatization

Cleaning improves the quality of:
- TF-IDF vectorization
- Topic modeling
- Sentiment analysis

In [57]:
def clean_text(text):
    if not isinstance(text, str):
        return ""

    # 1. Convert to Lowercase
    text = text.lower()

    # 2. Remove ASCII punctuation (emojis and other Unicode symbols are left in place)
    text = text.translate(str.maketrans("", "", string.punctuation))

    # 3. Tokenize
    tokens = word_tokenize(text)

    # 4. Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # 5. Rejoin as clean string
    return " ".join(tokens)
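
To see what each cleaning step does without installing NLTK, here is a minimal sketch of the same pipeline. It is a simplified stand-in, not the notebook's function: `clean_text_demo` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a tiny hand-picked sample rather than NLTK's English list, and `str.split` replaces `word_tokenize`.

```python
import string

# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
DEMO_STOPWORDS = {"a", "and", "are", "for", "how", "in", "of", "the"}


def clean_text_demo(text):
    """Simplified stand-in for clean_text (whitespace split, tiny stopword set)."""
    if not isinstance(text, str):
        return ""
    # 1. Lowercase
    text = text.lower()
    # 2. Strip ASCII punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 3. Tokenize on whitespace (assumption: stands in for word_tokenize)
    tokens = text.split()
    # 4. Drop stopwords
    tokens = [t for t in tokens if t not in DEMO_STOPWORDS]
    # 5. Rejoin
    return " ".join(tokens)


sample = "AI tools for content marketing : r/marketing"
cleaned = clean_text_demo(sample)
print(cleaned)
```

Note the side effect of removing punctuation before tokenizing: `r/marketing` collapses into the single token `rmarketing`. If that matters, one option is to replace punctuation with a space instead of deleting it.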

#### Connect to the database and load the content table

In [58]:
# Define path to root-level "data" folder
data_dir = Path.cwd().parent / "data"
db_path = data_dir / "content_data.db"

# Connect to the SQLite database
conn = sqlite3.connect(db_path)

# Load the content table into a DataFrame
combined_df = pd.read_sql_query("SELECT * FROM content", conn)

# Close the connection
conn.close()

print("Loaded content data from:", db_path)
combined_df.head()

Loaded content data from: /Users/vinothhaldorai/Documents/Vinoth/PROJECTS/content-marketing-agent/data/content_data.db


|     | title | publishedAt | source | publishedDate |
| --- | ----- | ----------- | ------ | ------------- |
| 0 | AI in content marketing: How creators and mark... | 2025-07-08 23:43:01+00:00 | Google Search | 2025-07-08 |
| 1 | AI in Marketing recent news \| Content Marketin... | 2025-07-08 23:43:01+00:00 | Google Search | 2025-07-08 |
| 2 | A Complete Guide to Adopting AI in Content Mar... | 2025-07-08 23:43:01+00:00 | Google Search | 2025-07-08 |
| 3 | AI tools for content marketing : r/marketing | 2025-07-08 23:43:01+00:00 | Google Search | 2025-07-08 |
| 4 | Artificial Intelligence And The Future Of Cont... | 2025-07-08 23:43:01+00:00 | Google Search | 2025-07-08 |
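
The load step can be tried without the project database. The sketch below uses an in-memory SQLite database with one made-up row; the `content` table name and column names mirror the output above, but the row itself is illustrative. `contextlib.closing` guarantees the connection is closed even if a query fails, which the explicit `conn.close()` call does not.

```python
import sqlite3
from contextlib import closing

# In-memory database, so nothing is written to disk.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE content ("
        "title TEXT, publishedAt TEXT, source TEXT, publishedDate TEXT)"
    )
    # Illustrative row, not taken from the real data set.
    conn.execute(
        "INSERT INTO content VALUES (?, ?, ?, ?)",
        ("Example title", "2025-07-08 23:43:01+00:00", "Google Search", "2025-07-08"),
    )
    conn.commit()

    cursor = conn.execute("SELECT * FROM content")
    columns = [description[0] for description in cursor.description]
    rows = [dict(zip(columns, record)) for record in cursor.fetchall()]

print(columns)
print(rows[0]["source"])
```

With pandas available, the same `with closing(...)` pattern can wrap the `pd.read_sql_query` call used in the cell above.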


In [60]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   title          46 non-null     object
 1   publishedAt    46 non-null     object
 2   source         46 non-null     object
 3   publishedDate  46 non-null     object
dtypes: object(4)
memory usage: 1.6+ KB


In [61]:
# Parse publishedAt into timezone-aware UTC datetimes
combined_df["publishedAt"] = pd.to_datetime(combined_df["publishedAt"], utc=True)
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype              
---  ------         --------------  -----              
 0   title          46 non-null     object             
 1   publishedAt    46 non-null     datetime64[ns, UTC]
 2   source         46 non-null     object             
 3   publishedDate  46 non-null     object             
dtypes: datetime64[ns, UTC](1), object(3)
memory usage: 1.6+ KB
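
As a rough illustration of what normalizing to UTC means, the standard-library sketch below parses two offset-aware timestamps and converts both to UTC. The second timestamp is a made-up value with a `+02:00` offset, chosen so that both strings describe the same instant.

```python
from datetime import datetime, timezone

raw = [
    "2025-07-08 23:43:01+00:00",  # format seen in the publishedAt column
    "2025-07-09 01:43:01+02:00",  # hypothetical: same instant, different offset
]

# Parse each string, then express it in UTC.
parsed = [datetime.fromisoformat(value).astimezone(timezone.utc) for value in raw]

for original, converted in zip(raw, parsed):
    print(original, "->", converted.isoformat())
```

Once every value is in one timezone, sorting and comparing timestamps across sources becomes safe.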


#### Apply Preprocessing

In [62]:
# Load CSV updated in data collection step
# combined_df = pd.read_csv("data/combined_data.csv", parse_dates=["publishedAt"])

# Apply text cleaning
combined_df["clean_title"] = combined_df["title"].apply(clean_text)

# preview result
combined_df[["title", "clean_title"]].head()

|     | title | clean_title |
| --- | ----- | ----------- |
| 0 | AI in content marketing: How creators and mark... | ai content marketing creators marketers using ai |
| 1 | AI in Marketing recent news \| Content Marketin... | ai marketing recent news content marketing ins... |
| 2 | A Complete Guide to Adopting AI in Content Mar... | complete guide adopting ai content marketing s... |
| 3 | AI tools for content marketing : r/marketing | ai tools content marketing rmarketing |
| 4 | Artificial Intelligence And The Future Of Cont... | artificial intelligence future content marketing |


In [63]:
# print(clean_text("This is a sample text for text preprocessing. this is a bunch of random words"))

### TF-IDF Vectorization (Step 3.2)

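TF (term frequency): how often a term appears in a document.

IDF (inverse document frequency): down-weights terms that appear in many documents.

A term's TF-IDF score is the product of the two, so it is high when the term is frequent in one title but rare across the rest.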
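
A small worked example may make the weighting concrete. The sketch below computes TF-IDF by hand for three made-up documents, using the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches the documented defaults of scikit-learn's `TfidfVectorizer` (`smooth_idf=True`, `norm="l2"`). The corpus and helper names are illustrative only.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration).
docs = [
    "ai content marketing",
    "ai tools content marketing",
    "sql guide",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def smoothed_idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf(tokens):
    """Raw term counts times IDF, scaled so the vector has unit L2 norm."""
    counts = Counter(tokens)
    raw = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


weights = tfidf(tokenized[1])
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the second document, `tools` receives the largest weight because it occurs in only one document, while `ai`, `content`, and `marketing` each occur in two and are weighted equally.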

#### Build TF-IDF Model

In [64]:
# Create vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=100, ngram_range=(1, 2))

# Fit and transform the cleaned titles
tfidf_matrix = tfidf_vectorizer.fit_transform(combined_df["clean_title"])

# Convert to DataFrame for inspection
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Add back original titles for context
tfidf_df["title"] = combined_df["title"]

# Preview result
tfidf_df.head()

Unnamed: 0,10,10 ai,11,11 ai,2022,2023,2024,2025,2025 annual,affiliate,...,review,sql,stingray,stingray group,tools,top,top 10,update,use,title
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AI in content marketing: How creators and mark...
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,AI in Marketing recent news | Content Marketin...
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,A Complete Guide to Adopting AI in Content Mar...
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.377672,0.0,0.0,0.0,0.0,AI tools for content marketing : r/marketing
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Artificial Intelligence And The Future Of Cont...


In [65]:
tfidf_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46 entries, 0 to 45
Columns: 101 entries, 10 to title
dtypes: float64(100), object(1)
memory usage: 36.4+ KB
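
Columns such as `10 ai` and `top 10` appear because `ngram_range=(1, 2)` asks the vectorizer for both single words and adjacent word pairs. The sketch below reproduces that idea in plain Python; `ngrams` is a hypothetical helper and skips the vectorizer's own tokenization rules.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, shortest first."""
    return [
        " ".join(tokens[start:start + n])
        for n in range(n_min, n_max + 1)
        for start in range(len(tokens) - n + 1)
    ]


tokens = "top 10 ai tools".split()
features = ngrams(tokens)
print(features)
```

With `max_features=100`, `TfidfVectorizer` then keeps only the 100 n-grams with the highest term frequency across the corpus, which is why the DataFrame has exactly 100 numeric columns.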


### Topic Modeling (NMF) (Step 3.3)

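NMF (Non-negative Matrix Factorization) approximates the non-negative TF-IDF matrix as the product of a document-topic matrix and a topic-term matrix (`components_`).

It tends to work well on short texts such as titles.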
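
To make the factorization concrete, here is a minimal pure-Python sketch that factorizes a toy document-term matrix `V` into `W` (document-topic) and `H` (topic-term) using multiplicative updates. This is an assumption for illustration: scikit-learn's `NMF` uses a coordinate descent solver by default, so its numbers will differ. The matrix, term list, and helper names are all made up.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, k, n_iter=200, seed=42, eps=1e-9):
    """Factorize v (n x m) into w (n x k) and h (k x m), all non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() for _ in range(k)] for _ in range(n)]
    h = [[rng.random() for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


def reconstruction_error(v, w, h):
    """Squared Frobenius distance between v and w @ h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for row_v, row_a in zip(v, approx) for a, b in zip(row_v, row_a))


terms = ["ai", "tools", "sql", "guide"]
# Rows are documents, columns are term weights (toy values).
V = [
    [1.0, 1.0, 0.0, 0.0],
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 3.0, 3.0],
]

w, h = nmf(V, k=2)
for topic_idx, topic in enumerate(h):
    ranked = sorted(range(len(terms)), key=lambda i: topic[i], reverse=True)
    print(f"Topic #{topic_idx + 1}:", ", ".join(terms[i] for i in ranked[:2]))
print("reconstruction error:", reconstruction_error(V, w, h))
```

Each row of `H` plays the role of one row of `nmf_model.components_`: the largest entries in a row are the terms that define that topic.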

In [66]:
# Fit topic model
nmf_model = NMF(n_components=5, random_state=42)
nmf_model.fit(tfidf_matrix)

# Get top words per topic
feature_names = tfidf_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(nmf_model.components_):
    top_words = [feature_names[i] for i in topic.argsort()[:-6:-1]]
    print(f"🔹 Topic #{topic_idx + 1}: {', '.join(top_words)}")


🔹 Topic #1: content marketing, content, marketing, ai, ai content
🔹 Topic #2: ai tools, 10, top 10, 10 ai, tools
🔹 Topic #3: ai marketing, marketing tools, marketing, tools, ai
🔹 Topic #4: new, directors, director, charest new, announces
🔹 Topic #5: sql, 2022, affiliate, need, guide
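
The slice `[:-6:-1]` in the loop above is a compact way to take the five largest entries in descending order: `argsort` returns indices in ascending order of weight, and the slice walks backwards from the end for five steps. The plain-Python sketch below mimics this with made-up weights.

```python
# Made-up topic weights, one per feature index.
weights = [0.1, 0.9, 0.3, 0.7, 0.0, 0.5, 0.2]

# Equivalent of argsort(): indices ordered from smallest to largest weight.
ascending = sorted(range(len(weights)), key=weights.__getitem__)

# Walk backwards from the last index, stopping before position -6.
top5 = ascending[:-6:-1]

print(ascending)
print(top5)
```

Here `top5` holds the indices of the five heaviest features, heaviest first.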


### Keyword Extraction (Step 3.4)

In [67]:
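# Get top TF-IDF keywords per title
def extract_keywords_from_vector(vector, feature_names, top_n=3):
    # Indices of the top_n highest TF-IDF scores, highest first
    sorted_indices = vector.argsort()[::-1][:top_n]
    # Skip zero-score terms so short titles do not receive arbitrary keywords
    return [feature_names[i] for i in sorted_indices if vector[i] > 0]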
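
The ranking idea can be checked on hand-written numbers without NumPy. The sketch below uses a hypothetical helper, `top_keywords`, on made-up weights; it also drops zero weights, so a title with fewer than `top_n` matching features yields a shorter list instead of unrelated terms.

```python
def top_keywords(weights, feature_names, top_n=3):
    """Return up to top_n feature names ranked by weight, skipping zero weights."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [feature_names[i] for i in ranked[:top_n] if weights[i] > 0]


# Illustrative vocabulary and TF-IDF rows (not taken from the real matrix).
feature_names = ["ai", "ai tools", "content", "content marketing", "tools"]
row = [0.0, 0.62, 0.35, 0.41, 0.38]
short_row = [0.0, 0.0, 1.0, 0.0, 0.0]

keywords = top_keywords(row, feature_names)
short_keywords = top_keywords(short_row, feature_names)
print(keywords)
print(short_keywords)
```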


In [68]:
combined_df["top_keywords"] = [
    extract_keywords_from_vector(row, tfidf_vectorizer.get_feature_names_out())
    for row in tfidf_matrix.toarray()
]

# Preview with keywords
combined_df[["title", "top_keywords"]].head()

|     | title | top_keywords |
| --- | ----- | ------------ |
| 0 | AI in content marketing: How creators and mark... | [ai, ai content, content marketing] |
| 1 | AI in Marketing recent news \| Content Marketin... | [marketing, content marketing, ai marketing] |
| 2 | A Complete Guide to Adopting AI in Content Mar... | [complete, guide, ai content] |
| 3 | AI tools for content marketing : r/marketing | [ai tools, content marketing, content] |
| 4 | Artificial Intelligence And The Future Of Cont... | [future, content marketing, content] |
