# 05_content_categorization.ipynb
**Author: Hoang Ngoc Anh**

This notebook classifies each social media post into one of four content pillars by keyword matching (posts that match no keyword are labeled `Others`):
- **Storytelling**: Brand stories, emotional sharing
- **Promotion**: Discounts, promotions, mini-games
- **UGC & Testimonial**: User-generated content, customer feedback
- **Cultural Relevance**: Holidays, cultural or local events

## 📦 Setup
This notebook requires `pandas`, `matplotlib`, and `seaborn`, and expects the crawled Facebook data at `data/facebook_data.csv`.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load crawled data (adjust path if needed)
df = pd.read_csv('data/facebook_data.csv')

# Define content categories and keywords
categories = {
    "Storytelling": ["hành trình", "câu chuyện", "cảm xúc", "truyền cảm hứng", "kỷ niệm"],
    "Promotion": ["giảm giá", "khuyến mãi", "ưu đãi", "giveaway", "minigame", "trúng thưởng"],
    "UGC & Testimonial": ["khách hàng nói", "review", "đánh giá", "cảm nhận", "người dùng"],
    "Cultural Relevance": ["tết", "giáng sinh", "valentine", "trung thu", "ngày lễ", "quốc tế"]
}

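# Classification function: returns the first category (in dict order) with a matching keyword
def classify_content(text):
    # Missing or non-string values (e.g. NaN) cannot match any keyword
    if not isinstance(text, str):
        return "Others"
    text_lower = text.lower()
    for category, keywords in categories.items():
        if any(keyword.lower() in text_lower for keyword in keywords):
            return category
    return "Others"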

# Apply classification
df['content_category'] = df['text'].apply(classify_content)

# Save the result
df.to_csv('data/facebook_categorized.csv', index=False)
df[['brand', 'time', 'text', 'content_category']].head()
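
Substring matching, as used in `classify_content`, also fires inside longer words: `review` matches `preview`, for example. One possible refinement, sketched here under the assumption that whole-word matching is wanted, compiles one case-insensitive regex per category. `build_patterns`, `classify_whole_words`, the reduced keyword dict, and the sample posts are hypothetical and not part of the original notebook.

```python
import re
import pandas as pd

# Hypothetical subset of the keyword dict, for illustration only
sample_categories = {
    "Promotion": ["giảm giá", "giveaway"],
    "UGC & Testimonial": ["review", "đánh giá"],
}

def build_patterns(categories):
    """Compile one case-insensitive regex per category that only matches whole keywords."""
    return {
        category: re.compile(
            r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keywords) + r")(?!\w)",
            re.IGNORECASE,
        )
        for category, keywords in categories.items()
    }

def classify_whole_words(text, patterns):
    """Return the first category whose pattern matches, or "Others"."""
    if not isinstance(text, str):
        return "Others"
    for category, pattern in patterns.items():
        if pattern.search(text):
            return category
    return "Others"

patterns = build_patterns(sample_categories)
sample_posts = pd.Series([
    "Xem preview sản phẩm mới",   # contains "review" only inside "preview"
    "Review từ khách hàng",
    "GIVEAWAY tuần này",
    None,
])
sample_labels = sample_posts.apply(lambda t: classify_whole_words(t, patterns)).tolist()
print(sample_labels)
```

As in the original function, the first matching category in dict order wins, so the ordering of the keyword dict still determines how posts that match several pillars are labeled.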

## 📊 Content Categorization Summary
The classification adds a `content_category` column to the DataFrame. The next step is to examine how the content categories are distributed across brands.

### Expected Output:
Each post is labeled **Storytelling**, **Promotion**, **UGC & Testimonial**, or **Cultural Relevance** according to the first keyword list it matches, or `Others` when no keyword matches.

In [2]:
# Plot the distribution of content categories by brand
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='content_category', hue='brand')
plt.title('Distribution of Content Categories by Brand')
plt.xticks(rotation=45)
plt.show()
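
A count plot makes brands with many posts look dominant in every category. As a complementary view, the sketch below tabulates both raw counts and per-brand shares with `pd.crosstab`. The `sample` DataFrame is made-up data that only mirrors the `brand` and `content_category` columns used above.

```python
import pandas as pd

# Made-up rows that mimic the columns produced by the classification step
sample = pd.DataFrame({
    "brand": ["A", "A", "A", "B", "B"],
    "content_category": ["Promotion", "Promotion", "Others", "Storytelling", "Promotion"],
})

# Raw number of posts per brand and category
counts = pd.crosstab(sample["brand"], sample["content_category"])

# Share of each category within a brand (each row sums to roughly 1 after rounding)
shares = pd.crosstab(
    sample["brand"], sample["content_category"], normalize="index"
).round(2)

print(counts)
print(shares)
```

Comparing shares rather than counts makes the content mix of brands with very different posting volumes directly comparable.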

## 📈 Engagement Analysis by Content Category
Next, we compute the average engagement per post (likes + comments + shares) for each content category and brand. This shows which categories drive the most interaction for each brand.
Note that this is a raw interaction count per post, not an engagement rate normalized by followers or reach.

In [3]:
# Calculate total engagement (a missing count is treated as 0 instead of turning the total into NaN)
df['total_engagement'] = df[['likes', 'comments', 'shares']].sum(axis=1)

# Calculate average engagement by content category and brand
engagement_by_category = df.groupby(['content_category', 'brand'])['total_engagement'].mean().reset_index()

# Plot the engagement by content category
plt.figure(figsize=(12, 6))
sns.barplot(data=engagement_by_category, x='content_category', y='total_engagement', hue='brand')
plt.title('Average Engagement by Content Category and Brand')
plt.xticks(rotation=45)
plt.show()
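
A mean can be dominated by a single viral post, especially in categories with few posts. The sketch below, using made-up numbers, reports the post count and median alongside the mean so that such cases are visible. The `sample` data and the output column names are assumptions for illustration.

```python
import pandas as pd

# Made-up data: one outlier post inflates the Promotion mean
sample = pd.DataFrame({
    "brand": ["A", "A", "A", "A"],
    "content_category": ["Promotion", "Promotion", "Promotion", "Storytelling"],
    "total_engagement": [10, 20, 900, 50],
})

# Named aggregation: number of posts, mean, and median per category and brand
summary = (
    sample.groupby(["content_category", "brand"])["total_engagement"]
    .agg(posts="count", mean_engagement="mean", median_engagement="median")
    .reset_index()
)
print(summary)
```

A large gap between mean and median, or a very small `posts` value, suggests that the bar for that category should be read with caution.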

## 📝 Conclusion and Insights
The engagement analysis points to ways of improving each brand's content strategy. For example, if **UGC & Testimonial** posts generate the most engagement, a brand may want to encourage more user-generated content.

Likewise, if **Cultural Relevance** posts perform well, brands should continue aligning content with local events and holidays.

Overall, this categorization can guide the next steps in content strategy by identifying areas for improvement and highlighting content types worth repeating.
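
One way to turn the chart into a per-brand takeaway is to pick the category with the highest average engagement for each brand. The sketch below assumes a table shaped like `engagement_by_category`. The numbers are made up, and `top_by_brand` is a hypothetical name.

```python
import pandas as pd

# Made-up averages in the same shape as engagement_by_category
sample_engagement = pd.DataFrame({
    "content_category": ["Promotion", "Storytelling", "Promotion", "UGC & Testimonial"],
    "brand": ["A", "A", "B", "B"],
    "total_engagement": [120.0, 80.0, 40.0, 95.0],
})

# idxmax returns, per brand, the row label of the highest average engagement
top_idx = sample_engagement.groupby("brand")["total_engagement"].idxmax()
top_by_brand = (
    sample_engagement.loc[top_idx]
    .set_index("brand")["content_category"]
    .to_dict()
)
print(top_by_brand)
```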

In [4]:
# Save again so the file also includes the total_engagement column
df.to_csv('data/facebook_categorized.csv', index=False)
print('✅ Content categorization complete. Data saved to data/facebook_categorized.csv')