## Is a Picture Worth a Thousand Words? Computer Vision Analysis

**Brand:** Pure New Zealand (@purenewzealand)

**Alternative Chosen:** Alternative 1 - Instagram Brand Engagement Analysis

## **Assignment Overview**

This assignment focuses on analyzing Instagram content to understand what drives engagement for the **Pure New Zealand** tourism brand. Using computer vision and natural language processing techniques, we will:

- Scrape ~500 posts from Pure New Zealand's Instagram page
- Extract image labels using Google Vision API or similar services
- Build predictive models to identify high-engagement content
- Perform topic modeling to discover content themes
- Provide data-driven recommendations to increase engagement

The goal is to help Pure New Zealand optimize their Instagram strategy by understanding which visual and textual elements resonate most with their audience.

---

## **Table of Contents**

- **Task A:** Web Scraping - Extract Instagram posts (images, captions, likes)
- **Task B:** Image Label Extraction - Using Google Vision API/Azure/LLM
- **Task C:** Binary Classification - Creating engagement categories
- **Task D:** Logistic Regression Models - Predicting engagement levels
- **Task E:** Topic Modeling (LDA) - Discovering content themes
- **Task F:** Strategic Recommendations - Actionable insights for Pure New Zealand

---

## Task A: 

Scrape Instagram.py to fetch ~500 posts from the brand's Instagram page. Fetch (i) image URLs, (ii) post captions (the text description of a post), and (iii) the number of likes. Fetching comments is difficult, and you can easily get blocked by Instagram. Using a dynamic VPN like ExpressVPN is highly recommended.

---

### Objective
Scrape approximately 500 posts from Pure New Zealand's Instagram page (@purenewzealand) to collect:
1. **Image URLs** - Direct links to post images
2. **Post Captions** - Text descriptions accompanying each post
3. **Number of Likes** - Engagement metric for each post

### Methodology
We use **Selenium WebDriver** to automate browser interactions and extract data directly from Instagram's web interface. This approach:
- Mimics human browsing behavior to avoid detection
- Handles dynamic content loading through scrolling
- Extracts data by clicking into individual posts
- Filters out video posts (focusing only on image content)
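
The link-collection part of this approach can be sketched without a browser. The helper below is a hypothetical illustration (`collect_post_links` is not a name from the scraper). It assumes, as the scraper's own filter does, that post permalinks contain `/p/`, and it uses a dict as an ordered set so duplicates picked up on later scrolls are dropped while first-seen order is kept.

```python
def collect_post_links(hrefs, seen=None):
    """Add post links from one scroll pass to an ordered set of links.

    hrefs: iterable of href strings (or None) found on the page.
    seen:  dict used as an ordered set, carried over between scrolls.
    """
    if seen is None:
        seen = {}
    for href in hrefs:
        # Keep only post permalinks; skip missing hrefs and other pages
        if href and "/p/" in href:
            seen.setdefault(href, None)
    return seen


# Made-up hrefs standing in for two consecutive scroll passes
batch_1 = ["https://www.instagram.com/p/AAA/", None, "https://www.instagram.com/explore/"]
batch_2 = ["https://www.instagram.com/p/AAA/", "https://www.instagram.com/p/BBB/"]

seen = collect_post_links(batch_1)
seen = collect_post_links(batch_2, seen)
post_links = list(seen)
print(post_links)
```

Membership checks on a dict are constant time, whereas `href not in post_links` on a list grows slower as more links are collected.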

### Key Features
- **VPN Protection**: Prompts user to confirm VPN connection before scraping (we used ExpressVPN)
- **Anti-Detection**: Disables automation flags and uses random delays
- **Video Filtering**: Automatically skips video posts to focus on images
- **Robust Extraction**: Multiple fallback methods for likes and caption extraction
- **Data Persistence**: Saves results in both CSV and JSON formats
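
As a minimal sketch of the like-count parsing behind the "Robust Extraction" feature, the hypothetical helper below (`parse_likes` is an illustrative name) turns display text such as "23,282 likes" into an integer. It assumes the count is written out in full with optional thousands separators. Abbreviated counts such as "1.2K" are not handled.

```python
import re


def parse_likes(text):
    """Return the first integer found in text such as '23,282 likes', or None."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Strip thousands separators before converting
    return int(match.group(0).replace(",", ""))


print(parse_likes("23,282 likes"))
print(parse_likes("Be the first to like this"))
```

Returning `None` instead of 0 when nothing is found keeps "no count found" distinguishable from a post that really has zero likes.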

### Important Notes
**Rate Limiting**: Instagram aggressively blocks scrapers. We implement:
- Long delays between actions (30-60 seconds)
- Random wait times to simulate human behavior
- VPN usage to avoid IP blocking
- Burner account to protect main account
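
The random wait times can be wrapped in one small helper. This is a sketch rather than the scraper's actual code: `human_pause` is a hypothetical name, and the `sleep` parameter exists only so the example can run without really waiting.

```python
import random
import time


def human_pause(low, high, sleep=time.sleep):
    """Sleep for a random duration between low and high seconds; return the duration."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay


# Demonstration with a stand-in for time.sleep so the example returns immediately
recorded = []
delay = human_pause(30, 60, sleep=recorded.append)
print(f"Would have paused for {delay:.1f}s")
```

In the scraper itself, the call would be `human_pause(30, 60)` with the default `time.sleep`.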

**Ethical Considerations**: This scraping is for educational purposes only and collects only publicly visible post data (image URLs, captions, and like counts).

---

In [None]:
"""
PURE SELENIUM INSTAGRAM SCRAPER
================================
Collects: (1) Image URLs, (2) Captions, (3) Likes

Uses only Selenium - no JSON API calls needed.
Clicks on each post and extracts data from the modal.

REQUIREMENTS:
pip install selenium
"""

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException, NoSuchElementException
import time
from random import uniform
import csv
import json
import os

# ==================== CONFIGURATION ====================
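# Burner account credentials (using a dedicated scraping account).
# Read from environment variables so the password is not stored in the notebook.
# The variable names IG_BURNER_USERNAME / IG_BURNER_PASSWORD are placeholders.
BURNER_USERNAME = os.environ.get("IG_BURNER_USERNAME", "")
BURNER_PASSWORD = os.environ.get("IG_BURNER_PASSWORD", "")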

# Target Instagram handle (Pure New Zealand tourism page)
TARGET_HANDLE = "purenewzealand"

# Maximum number of posts to scrape (the assignment asks for ~500)
MAX_POSTS = 500

# VPN safety check before starting
print("\n" + "="*70)
print("VPN CHECK")
print("="*70)
vpn_check = input("Is your VPN connected? (yes/no): ").strip().lower()
if vpn_check != 'yes':
    print("⚠️  Please connect to VPN before continuing to avoid IP blocking")
    exit()
print("✓ VPN confirmed\n")

# ==================== INITIALIZE BROWSER ====================
print("Initializing browser...")

# Chrome options to avoid detection
chrome_options = Options()
chrome_options.add_argument("--disable-blink-features=AutomationControlled")  # Hide automation
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

# Initialize Chrome driver with anti-detection settings
driver = webdriver.Chrome(options=chrome_options)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")

# Navigate to Instagram homepage
driver.get("https://www.instagram.com/")
print("✓ Browser opened")
time.sleep(60)  # Wait for page to fully load

# ==================== LOGIN ====================
print("\nLogging in...")

# Wait for and locate username input field
username_input = WebDriverWait(driver, 15).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='username']"))
)
# Wait for and locate password input field
password_input = WebDriverWait(driver, 15).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "input[name='password']"))
)

# Enter credentials with delays to mimic human typing
username_input.clear()
username_input.send_keys(BURNER_USERNAME)
time.sleep(20)  # Pause between username and password

password_input.clear()
password_input.send_keys(BURNER_PASSWORD)
time.sleep(30)  # Pause before clicking login

# Click login button
login_button = WebDriverWait(driver, 5).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']"))
)
login_button.click()
print("✓ Login submitted")
time.sleep(30)  # Wait for login to process

# ==================== DISMISS POPUPS ====================
print("\nDismissing popups...")

# Instagram shows various popups after login (save info, notifications, etc.)
# Try to dismiss them using common button texts
for button_text in ["Not now", "Not Now", "Never"]:
    try:
        popup_button = WebDriverWait(driver, 5).until(
            EC.element_to_be_clickable((By.XPATH, f'//button[contains(text(), "{button_text}")]'))
        )
        popup_button.click()
        time.sleep(10)
        print(f"✓ Dismissed popup: {button_text}")
    except Exception:  # avoid a bare except, which would also swallow Ctrl+C
        continue  # Popup not found, move to next

# ==================== NAVIGATE TO PROFILE ====================
print(f"\nNavigating to @{TARGET_HANDLE}...")

# Direct navigation to profile page (more reliable than using search)
driver.get(f"https://www.instagram.com/{TARGET_HANDLE}/")
print(f"✓ Loaded profile: @{TARGET_HANDLE}")
time.sleep(60)  # Wait for profile page to fully load

# ==================== COLLECT POST LINKS ====================
print("\nScrolling to collect posts...")

post_links = []  # Store URLs of all posts
scroll_pause = 10  # Seconds to wait between scrolls
last_height = driver.execute_script("return document.body.scrollHeight")

# Scroll through the page to load more posts
for scroll_num in range(50):  # Increased from 5 to load ~500 posts (adjust as needed)
    # Scroll to bottom of page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    print(f"  Scroll {scroll_num + 1}/50")
    time.sleep(scroll_pause)
    
    # Find all anchor tags (links) currently visible on the page
    links = driver.find_elements(By.TAG_NAME, "a")
    for link in links:
        href = link.get_attribute("href")
        # Filter for post links (contain "/p/") and avoid duplicates
        if href and "/p/" in href and href not in post_links:
            post_links.append(href)
    
    # Check if we've reached the bottom of the page
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        print("  ✓ Reached bottom of page")
        break
    last_height = new_height
    
    # Extra pause every 2 scrolls to avoid rate limiting
    if scroll_num % 2 == 0:
        time.sleep(30)

# Limit to MAX_POSTS to meet assignment requirements
post_links = post_links[:MAX_POSTS]
print(f"\n✓ Found {len(post_links)} posts to scrape")

# ==================== SCRAPE EACH POST ====================
print(f"\nScraping {len(post_links)} posts...")

all_data = []  # Store all scraped post data

for index, post_url in enumerate(post_links, 1):
    try:
        print(f"\n[{index}/{len(post_links)}] Opening: {post_url}")
        
        # Navigate to individual post page with timeout handling
        try:
            driver.set_page_load_timeout(180)  # 3 minute timeout for slow connections
            driver.get(post_url)
            time.sleep(uniform(30, 50))  # Random delay to mimic human behavior
        except Exception as timeout_error:
            print(f"  ⚠️  Timeout loading page: {timeout_error}")
            print(f"  Waiting 2 minutes before continuing...")
            time.sleep(120)
            continue
        
        # Check if post is a video - SKIP IT (assignment focuses on images)
        try:
            video_element = driver.find_element(By.TAG_NAME, "video")
            print(f"  ⏭️  SKIPPING: This is a video post")
            continue
        except NoSuchElementException:
            # No video found, this is an image post - proceed with extraction
            pass
        
        # ===== EXTRACT IMAGE URL =====
        # Look for the main image element (uses object-fit CSS)
        image_url = ""
        try:
            img_element = WebDriverWait(driver, 15).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, "img[style*='object-fit']"))
            )
            image_url = img_element.get_attribute("src")
            print(f"  ✓ Image URL found")
        except Exception:  # avoid a bare except, which would also swallow Ctrl+C
            print(f"  ⚠️  Warning: Could not find image")
            image_url = ""
        
        # ===== EXTRACT CAPTION =====
        # Instagram stores captions in the image's alt attribute
        caption = ""
        try:
            if image_url:
                caption = img_element.get_attribute("alt") or ""
                print(f"  ✓ Caption: {caption[:50]}..." if caption else "  ℹ️  Caption: (none)")
        except Exception as e:
            print(f"  ⚠️  Warning: Could not extract caption - {e}")
            caption = ""
        
        # ===== EXTRACT LIKES =====
        # Multiple strategies to find the likes count
        likes = 0
        try:
            import re
            # Strategy 1: Find elements containing the word "likes"
            elements = driver.find_elements(By.XPATH, "//*[contains(text(), 'likes')]")
            for elem in elements:
                text = elem.text.strip()
                if 'likes' in text.lower():
                    # Extract number from text like "23,282 likes"
                    numbers = re.findall(r'[\d,]+', text.replace(',', ''))
                    if numbers:
                        likes = int(numbers[0])
                        print(f"  ✓ Likes: {likes:,} (from: '{text}')")
                        break
            
            # Strategy 2: If no "likes" text found, look for number span near "likes" element
            if likes == 0:
                like_spans = driver.find_elements(By.CSS_SELECTOR, "span.html-span.xdj266r")
                for span in like_spans:
                    # Check if this span is near text containing "likes"
                    parent = span.find_element(By.XPATH, "../..")  # Go up 2 levels in DOM
                    if 'like' in parent.text.lower():
                        likes_text = span.text.strip()
                        likes = int(likes_text.replace(',', ''))
                        print(f"  ✓ Likes: {likes:,} (from number span)")
                        break

        except Exception as e:
            print(f"  ⚠️  Warning: Could not extract likes - {e}")
            likes = 0
        
        # ===== STORE DATA =====
        all_data.append({
            'image_url': image_url,
            'caption': caption,
            'likes': likes,
            'post_url': post_url
        })
        
        print(f"  ✓ Data collected successfully")
        
        # Extra long pause every 10 posts to avoid rate limiting
        if index % 10 == 0:
            extra_delay = uniform(60, 90)
            print(f"  ⏸️  Checkpoint break: {extra_delay:.1f}s")
            time.sleep(extra_delay)
        
    except Exception as e:
        print(f"  ❌ Error scraping post: {e}")
        time.sleep(60)
        continue

print(f"\n✓ Successfully scraped {len(all_data)} posts")

# ==================== SAVE DATA ====================
print("\nSaving data...")

# Create output directory
output_dir = f"{TARGET_HANDLE}_data"
os.makedirs(output_dir, exist_ok=True)

# Save as CSV (easy to import into pandas)
csv_file = os.path.join(output_dir, f"{TARGET_HANDLE}_instagram_data.csv")
with open(csv_file, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['image_url', 'caption', 'likes', 'post_url'])
    for data in all_data:
        writer.writerow([
            data['image_url'],
            data['caption'],
            data['likes'],
            data['post_url']
        ])

print(f"✓ CSV saved: {csv_file}")

# Save as JSON (preserves data structure)
json_file = os.path.join(output_dir, f"{TARGET_HANDLE}_instagram_data.json")
with open(json_file, 'w', encoding='utf-8') as f:
    json.dump(all_data, f, indent=2, ensure_ascii=False)

print(f"✓ JSON saved: {json_file}")

# ==================== DISPLAY RESULTS ====================
print("\n" + "="*70)
print("SAMPLE DATA")
print("="*70)

# Show first 3 posts as preview
for i, data in enumerate(all_data[:3], 1):
    print(f"\nPost {i}:")
    print(f"  Image: {data['image_url'][:80]}...")
    print(f"  Likes: {data['likes']:,}")
    caption_preview = data['caption'][:100] + "..." if len(data['caption']) > 100 else data['caption']
    print(f"  Caption: {caption_preview}")

print("\n" + "="*70)
print("SUMMARY")
print("="*70)
print(f"Target: @{TARGET_HANDLE}")
print(f"Posts scraped: {len(all_data)}")
print(f"Output: {csv_file}")
print("="*70)

# Clean up: close browser
driver.quit()
print("\n✓ Browser closed")
print("✓ Scraping complete")

## Task B:
Using the image URLs, obtain image labels (text) from Google Vision (cloud service) or other services such as Azure. You can also use an LLM through its API. You will need an account, though.

***We decided to use Google Vision***

### Implementation Details

**API Configuration:**
- **Service**: Google Cloud Vision API v1
- **Endpoint**: `https://vision.googleapis.com/v1/images:annotate`
- **Authentication**: API Key-based authentication
- **Rate Limiting**: 1 second delay between requests to avoid quota exhaustion

**Features Extracted:**
1. **Labels** (max 10): General descriptors with confidence scores
   - Example: "Sky (0.98), Water (0.95), Nature (0.93)"
2. **Landmarks** (max 5): Famous locations or monuments
   - Example: "Milford Sound, Mount Cook"
3. **Logos** (max 5): Brand logos detected in images
4. **Text**: Any visible text in images (requested via `TEXT_DETECTION`, but not stored or used in the analysis)
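
To make the response structure concrete, here is a small sketch that pulls these features out of a single entry of the API's `responses` list. The field names (`labelAnnotations`, `landmarkAnnotations`, `logoAnnotations`, `description`, `score`) are the ones the script below reads. The helper name `summarize_annotations`, the `min_score` filter, and the sample payload are assumptions added for illustration only.

```python
def summarize_annotations(response, min_score=0.0):
    """Collect labels (with scores), landmarks, and logos from one Vision response dict."""
    labels = [
        (item["description"], item["score"])
        for item in response.get("labelAnnotations", [])
        if item["score"] >= min_score
    ]
    landmarks = [item["description"] for item in response.get("landmarkAnnotations", [])]
    logos = [item["description"] for item in response.get("logoAnnotations", [])]
    return {"labels": labels, "landmarks": landmarks, "logos": logos}


# Illustrative payload, not real API output
sample = {
    "labelAnnotations": [
        {"description": "Sky", "score": 0.98},
        {"description": "Water", "score": 0.95},
        {"description": "Nature", "score": 0.93},
    ],
    "landmarkAnnotations": [{"description": "Milford Sound"}],
}

summary = summarize_annotations(sample, min_score=0.94)
print(summary)
```

Using `.get(..., [])` means an image with no landmarks or logos yields empty lists instead of raising `KeyError`.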

**Error Handling:**
- Network timeouts (30 second limit)
- Invalid image URLs
- API quota limits
- Instagram CDN restrictions

**Progress Management:**
- Saves progress every 10 images
- Resume capability from last successful image
- Tracks errors in separate column for debugging


In [None]:
import pandas as pd
import requests
import time
import base64

# ============================================
# Google Cloud Vision API - Image Labeling
# ============================================

def download_and_encode_image(image_url: str) -> dict:
    """
    Download image from URL and encode as base64
    Instagram URLs require this approach as they need proper headers
    """
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        response = requests.get(image_url, headers=headers, timeout=30)
        response.raise_for_status()
        
        # Encode to base64
        image_base64 = base64.b64encode(response.content).decode('utf-8')
        
        return {
            "success": True,
            "base64": image_base64,
            "error": ""
        }
    except Exception as e:
        return {
            "success": False,
            "base64": "",
            "error": str(e)
        }


def get_labels_google_vision(image_url: str, api_key: str) -> dict:
    """
    Get image labels using Google Cloud Vision API
    Downloads image first to handle Instagram CDN URLs
    
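    Returns dictionary with:
    - labels: comma-separated string of labels with confidence scores
    - landmarks: comma-separated string of detected landmarks
    - logos: comma-separated string of detected logos
    - error: error message if the request failed (empty string on success)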
    """
    endpoint = f"https://vision.googleapis.com/v1/images:annotate?key={api_key}"
    
    # Download and encode image first
    download_result = download_and_encode_image(image_url)
    
    if not download_result["success"]:
        return {
            "labels": "",
            "landmarks": "",
            "logos": "",
            "error": f"Image download failed: {download_result['error']}"
        }
    
    # Use base64 content instead of URL
    request_body = {
        "requests": [
            {
                "image": {"content": download_result["base64"]},
                "features": [
                    {"type": "LABEL_DETECTION", "maxResults": 10},
                    {"type": "LANDMARK_DETECTION", "maxResults": 5},
                    {"type": "LOGO_DETECTION", "maxResults": 5},
                    {"type": "TEXT_DETECTION", "maxResults": 5}
                ]
            }
        ]
    }
    
    try:
        response = requests.post(endpoint, json=request_body, timeout=30)
        response.raise_for_status()
        result = response.json()
        
        # Extract labels
        labels = []
        landmarks = []
        logos = []
        
        if 'responses' in result and len(result['responses']) > 0:
            resp = result['responses'][0]
            
            # Label annotations
            if 'labelAnnotations' in resp:
                labels = [
                    f"{label['description']} ({label['score']:.2f})"
                    for label in resp['labelAnnotations']
                ]
            
            # Landmark annotations
            if 'landmarkAnnotations' in resp:
                landmarks = [
                    landmark['description']
                    for landmark in resp['landmarkAnnotations']
                ]
            
            # Logo annotations
            if 'logoAnnotations' in resp:
                logos = [
                    logo['description']
                    for logo in resp['logoAnnotations']
                ]
        
        return {
            "labels": ", ".join(labels),
            "landmarks": ", ".join(landmarks) if landmarks else "",
            "logos": ", ".join(logos) if logos else "",
            "error": ""
        }
    
    except requests.exceptions.RequestException as e:
        return {
            "labels": "",
            "landmarks": "",
            "logos": "",
            "error": f"API request error: {str(e)}"
        }
    except Exception as e:
        return {
            "labels": "",
            "landmarks": "",
            "logos": "",
            "error": f"Error: {str(e)}"
        }


def process_images(csv_file: str, output_file: str, api_key: str, 
                   rate_limit_delay: float = 1.0, start_from: int = 0):
    """
    Process all images from CSV and save results with Google Cloud Vision
    
    Args:
        csv_file: Path to input CSV file
        output_file: Path to output CSV file
        api_key: Your Google Cloud Vision API key
        rate_limit_delay: Seconds to wait between API calls (default: 1.0)
        start_from: Row index to start from (useful if restarting after error)
    """
    # Read CSV
    df = pd.read_csv(csv_file)
    print(f"Loading {len(df)} images from CSV...")
    
    # Initialize new columns if they don't exist. When resuming from a previously
    # saved CSV, empty cells are read back as NaN, so normalize them to "" to keep
    # the `!= ""` checks below meaningful.
    for col in ['image_labels', 'landmarks', 'logos', 'processing_error']:
        if col not in df.columns:
            df[col] = ""
        df[col] = df[col].fillna("").astype(str)
    
    # Count how many already processed
    already_processed = len(df[df['image_labels'] != ""])
    if already_processed > 0:
        print(f"Found {already_processed} already processed images")
        if start_from == 0:
            response = input("Continue from where you left off? (y/n): ")
            if response.lower() == 'y':
                start_from = already_processed
    
    print(f"\nStarting from row {start_from}...")
    print(f"Processing {len(df) - start_from} images...")
    
    # Process each image
    processed_count = 0
    for idx in range(start_from, len(df)):
        row = df.iloc[idx]
        image_url = row['image_url']
        
        # Skip if already processed (but show debug info)
        if df.at[idx, 'image_labels'] != "":
            print(f"Skipping row {idx} - already processed")
            continue
        
        print(f"\n[{idx + 1}/{len(df)}] Processing: {image_url[:60]}...")
        
        result = get_labels_google_vision(image_url, api_key)
        
        # Store results
        df.at[idx, 'image_labels'] = result['labels']
        df.at[idx, 'landmarks'] = result['landmarks']
        df.at[idx, 'logos'] = result['logos']
        df.at[idx, 'processing_error'] = result['error']
        
        processed_count += 1
        
        if result['error']:
            print(f"   ❌ Error: {result['error']}")
        else:
            label_count = len(result['labels'].split(',')) if result['labels'] else 0
            print(f"   ✓ Found {label_count} labels")
            if result['landmarks']:
                print(f"   ✓ Landmarks: {result['landmarks']}")
        
        # DEBUG: Print first result
        if idx == 0:
            print(f"\n🔍 DEBUG - First result:")
            print(f"   Labels: {result['labels'][:100]}...")
            print(f"   Error: {result['error']}")
        
        # Rate limiting
        if idx < len(df) - 1:
            time.sleep(rate_limit_delay)
        
        # Save progress every 10 images
        if (idx + 1) % 10 == 0:
            df.to_csv(output_file, index=False)
            print(f"\n💾 Progress saved: {idx + 1} images processed")
    
    # Final save
    df.to_csv(output_file, index=False)
    
    # Summary
    successful = len(df[df['image_labels'] != ""])
    errors = len(df[df['processing_error'] != ""])
    
    print(f"\n{'='*60}")
    print(f"✅ COMPLETE! Results saved to {output_file}")
    print(f"{'='*60}")
    print(f"Successfully processed: {successful}/{len(df)} images")
    print(f"Errors encountered: {errors}")
    print(f"Images actually processed this run: {processed_count}")
    print(f"{'='*60}")


# ============================================
# MAIN SCRIPT - CONFIGURE HERE
# ============================================
if __name__ == "__main__":
    # 1. ADD YOUR GOOGLE CLOUD VISION API KEY HERE
    # (the placeholder must match the check below so a missing key is detected)
    API_KEY = "YOUR_GOOGLE_CLOUD_API_KEY_HERE"
    
    # 2. Configure file paths
    INPUT_CSV = "pureNZ_scraped_data_550.csv"
    OUTPUT_CSV = "pureNZ_with_labels.csv"
    
    # 3. Run the script
    if API_KEY == "YOUR_GOOGLE_CLOUD_API_KEY_HERE":
        print("⚠️  ERROR: Please add your Google Cloud Vision API key first!")
        print("\nSteps to get your API key:")
        print("1. Go to https://console.cloud.google.com/")
        print("2. Create a new project (or select existing)")
        print("3. Enable 'Cloud Vision API'")
        print("4. Go to 'Credentials' → 'Create Credentials' → 'API Key'")
        print("5. Copy the API key and paste it in this script")
    else:
        process_images(
            csv_file=INPUT_CSV,
            output_file=OUTPUT_CSV,
            api_key=API_KEY,
            rate_limit_delay=1.0  # Wait 1 second between requests
        )


## Task C: 

Create a column called `binary` (lowercase only) whose value is 1 (high engagement) or 0 (low engagement), depending on whether the number of likes is above or below the median value.

---

Our objective is to create a binary target variable to classify posts as **high engagement** (1) or **low engagement** (0) based on the number of likes. This transforms our regression problem into a classification problem suitable for logistic regression.

We use the **median number of likes** as the threshold to split posts into two balanced classes:
- **Binary = 1**: High engagement (likes > median)
- **Binary = 0**: Low engagement (likes ≤ median)

### Why Use Median as Threshold?
- **Balanced Classes**: Ensures equal representation of high/low engagement posts
- **Robust to Outliers**: Median is less affected by viral posts with extreme like counts
- **Interpretable**: Posts above median represent top 50% performers
- **Machine Learning Ready**: Balanced classes prevent model bias toward majority class
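
A tiny worked example, using made-up like counts and only the standard library, shows two properties of the median threshold: one extreme value does not move it, and ties at the median all fall into the low class. Ties can make the split less than perfectly even, although that did not happen in our data (275 vs. 275). The function name `label_engagement` is illustrative.

```python
from statistics import median


def label_engagement(likes):
    """Return (threshold, labels), where a label is 1 if likes > median, else 0."""
    threshold = median(likes)
    return threshold, [1 if n > threshold else 0 for n in likes]


# Hypothetical like counts: three posts tie at the median, one is an outlier
likes = [120, 450, 450, 450, 9000]
threshold, binary = label_engagement(likes)
print(threshold, binary)
```

Here the threshold is 450 and only the outlier is labeled 1, because the three tied posts are not strictly above the median.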

---

In [1]:
import pandas as pd
import numpy as np

df = pd.read_excel("compiled_550_images_descriptions.xlsx")

# normalize column names (case/whitespace)
df.columns = df.columns.str.strip().str.lower()

# ensure 'likes' is numeric (handles "6,200" etc.)
likes_clean = df['likes'].astype(str).str.replace(r'[^0-9]', '', regex=True)
df['likes'] = pd.to_numeric(likes_clean, errors='coerce')

median_likes = df['likes'].median()
print("Median likes:", median_likes)

# 1 if likes > median (ties -> 0, i.e., low engagement)
df['binary'] = (df['likes'] > median_likes).astype(int)

print(df['binary'].value_counts())

Median likes: 14898.0
binary
0    275
1    275
Name: count, dtype: int64


In [2]:
import pandas as pd

# ============================================
# Load Data with Image Labels
# ============================================

# Load the compiled dataset with image labels from Task B
df = pd.read_excel("compiled_550_images_descriptions.xlsx")

print(f"📊 Loaded dataset: {len(df)} posts")
print(f"📋 Columns: {df.columns.tolist()}\n")

# ============================================
# Data Cleaning & Normalization
# ============================================

# Normalize column names: convert to lowercase and remove whitespace
# This prevents errors from inconsistent column naming (e.g., "Likes" vs "likes")
df.columns = df.columns.str.strip().str.lower()


# ============================================
# Clean 'Likes' Column
# ============================================

# Instagram displays likes with commas (e.g., "14,898")
# We need to convert these to numeric values for analysis

# Step 1: Convert to string to handle any data type inconsistencies
# Step 2: Remove all non-numeric characters (commas, spaces, etc.)
# Step 3: Convert to numeric, setting errors to NaN
likes_clean = df['likes'].astype(str).str.replace(r'[^0-9]', '', regex=True)
df['likes'] = pd.to_numeric(likes_clean, errors='coerce')

# Check for any NaN values after conversion
nan_count = df['likes'].isna().sum()
if nan_count > 0:
    print(f"⚠️  Warning: {nan_count} posts with invalid like counts (set to NaN)")
    # Drop rows with NaN likes so they do not affect the median
    df = df.dropna(subset=['likes'])
    print(f"✓ Removed posts with missing likes. New total: {len(df)} posts\n")

# ============================================
# Calculate Median Threshold
# ============================================

# Calculate median likes across all posts
# Median is preferred over mean because it's robust to outliers
median_likes = df['likes'].median()

print("="*60)
print("ENGAGEMENT THRESHOLD")
print("="*60)
print(f"Median likes: {median_likes:,.0f}")
print(f"Min likes: {df['likes'].min():,.0f}")
print(f"Max likes: {df['likes'].max():,.0f}")
print(f"Mean likes: {df['likes'].mean():,.0f}")
print("="*60 + "\n")

# ============================================
# Create Binary Target Variable
# ============================================

# Create 'binary' column:
# - 1 if likes > median (high engagement)
# - 0 if likes ≤ median (low engagement)
# Note: Posts with exactly median likes are classified as low engagement (0)

df['binary'] = (df['likes'] > median_likes).astype(int)

# ============================================
# Validate Class Balance
# ============================================

print("="*60)
print("CLASS DISTRIBUTION")
print("="*60)

# Count posts in each class
class_counts = df['binary'].value_counts().sort_index()
print(class_counts)
print()

# Calculate percentages
class_percentages = df['binary'].value_counts(normalize=True).sort_index() * 100
print("Percentages:")
for label, pct in class_percentages.items():
    engagement_type = "High Engagement" if label == 1 else "Low Engagement"
    print(f"  {engagement_type} (binary={label}): {pct:.1f}%")

print("="*60 + "\n")

# ============================================
# Display Sample Results
# ============================================

print("SAMPLE POSTS BY ENGAGEMENT LEVEL")
print("="*60)

# Show examples of high engagement posts
print("\n🔥 HIGH ENGAGEMENT POSTS (binary=1):")
high_engagement = df[df['binary'] == 1].nlargest(3, 'likes')
for idx, row in high_engagement.iterrows():
    print(f"\n  Likes: {row['likes']:,.0f}")
    caption_text = str(row['caption'])  # str() guards against missing (NaN) captions
    caption_preview = caption_text[:80] + "..." if len(caption_text) > 80 else caption_text
    print(f"  Caption: {caption_preview}")
    if 'image_labels' in df.columns and pd.notna(row['image_labels']):
        labels_preview = str(row['image_labels'])[:100] + "..." if len(str(row['image_labels'])) > 100 else row['image_labels']
        print(f"  Labels: {labels_preview}")

# Show examples of low engagement posts
print("\n📉 LOW ENGAGEMENT POSTS (binary=0):")
low_engagement = df[df['binary'] == 0].nsmallest(3, 'likes')
for idx, row in low_engagement.iterrows():
    print(f"\n  Likes: {row['likes']:,.0f}")
    caption_text = str(row['caption'])  # str() guards against missing (NaN) captions
    caption_preview = caption_text[:80] + "..." if len(caption_text) > 80 else caption_text
    print(f"  Caption: {caption_preview}")
    if 'image_labels' in df.columns and pd.notna(row['image_labels']):
        labels_preview = str(row['image_labels'])[:100] + "..." if len(str(row['image_labels'])) > 100 else row['image_labels']
        print(f"  Labels: {labels_preview}")

print("\n" + "="*60)

# ============================================
# Save Processed Dataset
# ============================================

# Save the dataset with the new 'binary' column for Task D
output_file = "purenewzealand_with_binary.csv"
df.to_csv(output_file, index=False)
print(f"\n💾 Saved processed dataset to: {output_file}")
print(f"✓ Dataset ready for Task D (Logistic Regression)\n")


📊 Loaded dataset: 550 posts
📋 Columns: ['image_url', 'post_url', 'Caption', 'Likes', 'image_labels', 'landmarks', 'logos', 'processing_error']

ENGAGEMENT THRESHOLD
Median likes: 14,898
Min likes: 512
Max likes: 46,272
Mean likes: 15,652

CLASS DISTRIBUTION
binary
0    275
1    275
Name: count, dtype: int64

Percentages:
  Low Engagement (binary=0): 50.0%
  High Engagement (binary=1): 50.0%

SAMPLE POSTS BY ENGAGEMENT LEVEL

🔥 HIGH ENGAGEMENT POSTS (binary=1):

  Likes: 46,272
  Caption: The shades of blue of Lake Tekapo (Takapō). Glacial silt from the Southern Alps...
  Labels: Bridge (0.92), Lake (0.85), River (0.83), Winter (0.82), Channel (0.81), Mountain range (0.81), List...

  Likes: 44,321
  Caption: Have an egg-ceptional Easter. Swipe to see the before photo of this rowi kiwi ch...
  Labels: Kiwi (0.97), Bird (0.86), Flightless bird (0.85), Beak (0.82), Feather (0.56)

  Likes: 43,223
  Caption: Sunrise at The Shire. #NZMustDo [Hobbiton Movie Set, Matamata. : @sha

### Task D

Run a logistic regression with binary as the dependent variable, and the image_labels as independent variables. You can use a BoW model for text. What is the accuracy (show the confusion matrix) of this prediction model? The idea is to be able to predict the engagement level for an image.

$$\text{Accuracy} = 1 - \frac{\#\ \text{prediction errors}}{\text{total}\ \#\ \text{cases}}$$

What accuracy do you get by using the post_caption words as the independent variables instead of image_labels? Finally, what accuracy do you get by combining (concatenating) the image_labels and post_caption and using them together as independent variables? What can you conclude from your analysis?



---


Clean and prepare the image labels from Google Vision API for use in machine learning models. The raw labels include confidence scores in parentheses (e.g., "Sky (0.98), Water (0.95)") which need to be removed for text analysis.

### Cleaning Steps
1. Remove confidence scores in parentheses
2. Keep only the descriptive text
3. Standardize formatting for consistency

---

In [3]:
import re

import numpy as np
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns




# Load data with image labels from Task B
df = pd.read_csv("purenewzealand_with_binary.csv")

print(f"Loaded {len(df)} posts with image labels")

# Create clean_labels column by removing confidence scores
# Example: "Sky (0.98), Water (0.95)" → "Sky, Water"
def clean_labels(label_text):
    """
    Remove confidence scores from image labels.
    Input: "Sky (0.98), Water (0.95), Nature (0.93)"
    Output: "Sky, Water, Nature"
    """
    if pd.isna(label_text) or label_text == '':
        return ''
    # Remove everything in parentheses including the parentheses
    cleaned = re.sub(r'\s*\([^)]*\)', '', label_text)
    return cleaned

df['clean_labels'] = df['image_labels'].apply(clean_labels)

# Display before/after examples
print("\nExample label cleaning:")
print("="*70)
for i in range(3):
    if pd.notna(df['image_labels'].iloc[i]):
        print(f"\nOriginal:  {df['image_labels'].iloc[i][:100]}...")
        print(f"Cleaned:   {df['clean_labels'].iloc[i][:100]}...")

Loaded 550 posts with image labels

Example label cleaning:

Original:  Fog (0.77), Mist (0.74), Adventure (0.67), Wind (0.67), Haze (0.66), Dust (0.63), Smoke (0.61), Digi...
Cleaned:   Fog, Mist, Adventure, Wind, Haze, Dust, Smoke, Digital compositing...

Original:  Smile (0.96), Happiness (0.87), Luggage & bags (0.86), Pedestrian (0.85), Leisure (0.81), Beard (0.7...
Cleaned:   Smile, Happiness, Luggage & bags, Pedestrian, Leisure, Beard, Vacation, Handbag, Sunglasses, Adverti...

Original:  Happiness (0.90), Tribe (0.62)...
Cleaned:   Happiness, Tribe...


In [5]:
# ============================================
# Create Consistent Train-Test Split
# ============================================

"""
WHY USE POSITIONAL INDICES?
- Ensures the SAME posts are in train/test across all three models
- Prevents data leakage between models
- Makes performance comparisons valid and fair
- Stratified split maintains 50/50 class balance in both sets
"""

# Create array of positional indices [0, 1, 2, ..., 549]
pos_idx = np.arange(len(df))

# Split indices into train (80%) and test (20%) sets
# stratify=df['binary'] ensures both sets have balanced classes
train_pos, test_pos = train_test_split(
    pos_idx, 
    test_size=0.2,      # 20% for testing (110 posts)
    random_state=42,    # Reproducible results
    stratify=df['binary']  # Maintain 50/50 high/low split
)

# Extract target variable (binary engagement labels)
y = df['binary'].values  # NumPy array for positional indexing

print(f"✓ Train set size: {len(train_pos)} posts ({len(train_pos)/len(df)*100:.1f}%)")
print(f"✓ Test set size: {len(test_pos)} posts ({len(test_pos)/len(df)*100:.1f}%)")
print(f"✓ Train set class balance: {y[train_pos].sum()}/{len(train_pos)-y[train_pos].sum()}")
print(f"✓ Test set class balance: {y[test_pos].sum()}/{len(test_pos)-y[test_pos].sum()}\n")

# ============================================
# TF-IDF Vectorizer Configuration
# ============================================

"""
WHY TF-IDF?
1. Term Frequency (TF): Counts word occurrences
2. Inverse Document Frequency (IDF): Reduces weight of common words
3. Result: Distinctive words get higher importance

Example:
- Common word "travel" (appears in 80% of posts) → low weight
- Distinctive word "fjord" (appears in 5% of posts) → high weight
"""

def make_vec():
    """
    Creates standardized TF-IDF vectorizer for text processing.
    
    Parameters:
    - stop_words='english': Remove common words (the, is, and, etc.)
    - ngram_range=(1,2): Use unigrams (single words) and bigrams (word pairs)
    - min_df=2: Ignore words appearing in fewer than 2 documents
    - max_df=0.95: Ignore words appearing in >95% of documents
    - sublinear_tf=True: Use log scaling for term frequency
    - lowercase=True: Convert all text to lowercase
    """
    return TfidfVectorizer(
        stop_words='english',   # Remove "the", "is", "and", etc.
        ngram_range=(1,2),      # Capture "mountain" and "mountain view"
        min_df=2,               # Word must appear in ≥2 posts
        max_df=0.95,            # Ignore words in >95% of posts
        sublinear_tf=True,      # Log scaling: reduces impact of word repetition
        lowercase=True          # "Mountain" = "mountain"
    )

print("="*70)
print("TF-IDF CONFIGURATION")
print("="*70)
print("✓ N-grams: Unigrams (1-word) + Bigrams (2-word)")
print("✓ Stop words: Removed (English)")
print("✓ Min document frequency: 2 posts")
print("✓ Max document frequency: 95% of posts")
print("✓ Sublinear TF scaling: Enabled")
print("="*70 + "\n")

# ============================================
# MODEL 1: IMAGE LABELS ONLY
# ============================================

print("="*70)
print("MODEL 1: IMAGE LABELS ONLY (Computer Vision)")
print("="*70)

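# Strip the Vision API confidence scores (e.g. 0.99, 0.70) from the labels for the bag-of-words model.
# Note: this overwrites the clean_labels column created in the previous cell and also drops punctuation.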
df['clean_labels'] = df['image_labels'].str.replace(r'\([^)]*\)', '', regex=True)
df['clean_labels'] = df['clean_labels'].str.replace('[^A-Za-z ]', '', regex=True)

# Initialize TF-IDF vectorizer for image labels
vec1 = make_vec()

# Transform image labels into TF-IDF features
# fillna('') handles posts without labels (treats as empty string)
# iloc[train_pos] selects only training indices
X1_train = vec1.fit_transform(df['clean_labels'].fillna('').iloc[train_pos])
X1_test = vec1.transform(df['clean_labels'].fillna('').iloc[test_pos])

print(f"✓ Training features shape: {X1_train.shape}")
print(f"  ({X1_train.shape[0]} posts × {X1_train.shape[1]} unique features)")
print(f"✓ Test features shape: {X1_test.shape}\n")

# Train logistic regression model
# max_iter=2000: Sufficient iterations for convergence
# class_weight=None: Classes already balanced (no adjustment needed)
# solver='liblinear': Good for small-medium datasets with L1/L2 regularization
m1 = LogisticRegression(
    max_iter=2000, 
    class_weight=None,      # Balanced classes don't need weighting
    solver='liblinear',     # Efficient for sparse data
    random_state=42
)
m1.fit(X1_train, y[train_pos])

# Make predictions on test set
p1 = m1.predict(X1_test)

# Calculate accuracy
acc1 = accuracy_score(y[test_pos], p1)
print(f"🎯 MODEL 1 ACCURACY: {acc1:.4f} ({acc1*100:.2f}%)\n")

# Detailed classification report
print("Classification Report:")
print(classification_report(y[test_pos], p1, target_names=['Low Engagement', 'High Engagement']))

# Confusion matrix
cm1 = confusion_matrix(y[test_pos], p1)
print("\nConfusion Matrix:")
print("                 Predicted")
print("                 Low  High")
print(f"Actual Low    [{cm1[0,0]:4d} {cm1[0,1]:4d}]")
print(f"Actual High   [{cm1[1,0]:4d} {cm1[1,1]:4d}]")
print("="*70 + "\n")

# ============================================
# MODEL 2: CAPTIONS ONLY
# ============================================

print("="*70)
print("MODEL 2: POST CAPTIONS ONLY (Text Descriptions)")
print("="*70)

# Initialize NEW TF-IDF vectorizer for captions
# Important: Separate vectorizer to maintain independent vocabularies
vec2 = make_vec()

# Transform captions into TF-IDF features
X2_train = vec2.fit_transform(df['caption'].fillna('').iloc[train_pos])
X2_test = vec2.transform(df['caption'].fillna('').iloc[test_pos])

print(f"✓ Training features shape: {X2_train.shape}")
print(f"  ({X2_train.shape[0]} posts × {X2_train.shape[1]} unique features)")
print(f"✓ Test features shape: {X2_test.shape}\n")

# Train logistic regression model
m2 = LogisticRegression(
    max_iter=2000,
    class_weight=None,
    solver='liblinear',
    random_state=42
)
m2.fit(X2_train, y[train_pos])

# Make predictions
p2 = m2.predict(X2_test)

# Calculate accuracy
acc2 = accuracy_score(y[test_pos], p2)
print(f"🎯 MODEL 2 ACCURACY: {acc2:.4f} ({acc2*100:.2f}%)\n")

# Detailed classification report
print("Classification Report:")
print(classification_report(y[test_pos], p2, target_names=['Low Engagement', 'High Engagement']))

# Confusion matrix
cm2 = confusion_matrix(y[test_pos], p2)
print("\nConfusion Matrix:")
print("                 Predicted")
print("                 Low  High")
print(f"Actual Low    [{cm2[0,0]:4d} {cm2[0,1]:4d}]")
print(f"Actual High   [{cm2[1,0]:4d} {cm2[1,1]:4d}]")
print("="*70 + "\n")

# ============================================
# MODEL 3: COMBINED (IMAGE LABELS + CAPTIONS)
# ============================================

print("="*70)
print("MODEL 3: MULTIMODAL (Image Labels + Captions)")
print("="*70)

"""
WHY COMBINE HORIZONTALLY?
- hstack() concatenates feature matrices side-by-side
- Each modality (vision/text) keeps its own vocabulary and scaling
- Model learns relative importance of visual vs textual features
- Prevents one modality from overpowering the other

Feature matrix structure:
[Image Label Features | Caption Features]
[   X1_train (440×N)  |  X2_train (440×M) ] → Combined (440 × (N+M))
"""

# Combine training features
X_lbl_tr = X1_train  # Image label features (already computed)
X_cap_tr = X2_train  # Caption features (already computed)
X_tr = hstack([X_lbl_tr, X_cap_tr])  # Horizontal concatenation

# Combine test features
X_lbl_te = X1_test
X_cap_te = X2_test
X_te = hstack([X_lbl_te, X_cap_te])

print(f"✓ Combined training features shape: {X_tr.shape}")
print(f"  (Image labels: {X_lbl_tr.shape[1]} + Captions: {X_cap_tr.shape[1]} = {X_tr.shape[1]} total)")
print(f"✓ Combined test features shape: {X_te.shape}\n")

# Train logistic regression model
m3 = LogisticRegression(
    max_iter=2000,
    class_weight=None,
    solver='liblinear',
    random_state=42
)
m3.fit(X_tr, y[train_pos])

# Make predictions
p3 = m3.predict(X_te)

# Calculate accuracy
acc3 = accuracy_score(y[test_pos], p3)
print(f"🎯 MODEL 3 ACCURACY: {acc3:.4f} ({acc3*100:.2f}%)\n")

# Detailed classification report
print("Classification Report:")
print(classification_report(y[test_pos], p3, target_names=['Low Engagement', 'High Engagement']))

# Confusion matrix
cm3 = confusion_matrix(y[test_pos], p3)
print("\nConfusion Matrix:")
print("                 Predicted")
print("                 Low  High")
print(f"Actual Low    [{cm3[0,0]:4d} {cm3[0,1]:4d}]")
print(f"Actual High   [{cm3[1,0]:4d} {cm3[1,1]:4d}]")
print("="*70 + "\n")

# ============================================
# COMPARATIVE ANALYSIS
# ============================================

print("="*70)
print("COMPARATIVE MODEL PERFORMANCE")
print("="*70)

# Create comparison dataframe
comparison = pd.DataFrame({
    'Model': ['Image Labels Only', 'Captions Only', 'Combined (Image + Caption)'],
    'Accuracy': [acc1, acc2, acc3],
    'Improvement over Baseline': [
        (acc1 - 0.5) * 100,
        (acc2 - 0.5) * 100,
        (acc3 - 0.5) * 100
    ]
})

print(comparison.to_string(index=False))
print("\n" + "="*70)

# Identify best model
best_idx = comparison['Accuracy'].idxmax()
best_model = comparison.loc[best_idx, 'Model']
best_acc = comparison.loc[best_idx, 'Accuracy']

print(f"\n🏆 BEST MODEL: {best_model}")
print(f"   Accuracy: {best_acc:.4f} ({best_acc*100:.2f}%)")
print(f"   Beats random guessing by: {(best_acc - 0.5)*100:.2f} percentage points")

# Calculate relative improvements
if acc1 != 0:
    caption_vs_image = ((acc2 - acc1) / acc1) * 100
    combined_vs_image = ((acc3 - acc1) / acc1) * 100
    print(f"\n📊 Relative Performance:")
    print(f"   Captions vs Images: {caption_vs_image:+.2f}% change")
    print(f"   Combined vs Images: {combined_vs_image:+.2f}% change")

print("\n" + "="*70 + "\n")

# ============================================
# VISUALIZATION: Model Comparison
# ============================================

# # # Create bar chart comparing model accuracies
# plt.figure(figsize=(10, 6))
# models = ['Image Labels\nOnly', 'Captions\nOnly', 'Combined\n(Image + Caption)']
# accuracies = [acc1, acc2, acc3]
# colors = ['#3498db', '#e74c3c', '#2ecc71']

# bars = plt.bar(models, accuracies, color=colors, alpha=0.7, edgecolor='black')
# plt.axhline(y=0.5, color='gray', linestyle='--', label='Random Baseline (50%)', linewidth=2)

# # Add value labels on bars
# for bar, acc in zip(bars, accuracies):
#     height = bar.get_height()
#     plt.text(bar.get_x() + bar.get_width()/2., height,
#              f'{acc:.1%}',
#              ha='center', va='bottom', fontsize=12, fontweight='bold')

# plt.ylabel('Accuracy', fontsize=12)
# plt.title('Logistic Regression Model Comparison\nPure New Zealand Instagram Engagement Prediction', 
#           fontsize=14, fontweight='bold')
# plt.ylim(0, 1.0)
# plt.legend()
# plt.grid(axis='y', alpha=0.3)
# plt.tight_layout()
# plt.savefig('model_comparison.png', dpi=300, bbox_inches='tight')


# # ============================================
# # VISUALIZATION: Confusion Matrices
# # ============================================

# fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# cms = [cm1, cm2, cm3]
# titles = ['Model 1: Image Labels', 'Model 2: Captions', 'Model 3: Combined']
# accs = [acc1, acc2, acc3]

# for ax, cm, title, acc in zip(axes, cms, titles, accs):
#     sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, ax=ax,
#                 xticklabels=['Low', 'High'], yticklabels=['Low', 'High'])
#     ax.set_title(f'{title}\nAccuracy: {acc:.1%}', fontsize=11, fontweight='bold')
#     ax.set_ylabel('Actual Engagement')
#     ax.set_xlabel('Predicted Engagement')

# plt.tight_layout()
# plt.savefig('confusion_matrices.png', dpi=300, bbox_inches='tight')


✓ Train set size: 440 posts (80.0%)
✓ Test set size: 110 posts (20.0%)
✓ Train set class balance: 220/220
✓ Test set class balance: 55/55

TF-IDF CONFIGURATION
✓ N-grams: Unigrams (1-word) + Bigrams (2-word)
✓ Stop words: Removed (English)
✓ Min document frequency: 2 posts
✓ Max document frequency: 95% of posts
✓ Sublinear TF scaling: Enabled

MODEL 1: IMAGE LABELS ONLY (Computer Vision)
✓ Training features shape: (440, 1013)
  (440 posts × 1013 unique features)
✓ Test features shape: (110, 1013)

🎯 MODEL 1 ACCURACY: 0.7091 (70.91%)

Classification Report:
                 precision    recall  f1-score   support

 Low Engagement       0.72      0.69      0.70        55
High Engagement       0.70      0.73      0.71        55

       accuracy                           0.71       110
      macro avg       0.71      0.71      0.71       110
   weighted avg       0.71      0.71      0.71       110


Confusion Matrix:
                 Predicted
                 Low

### Model Performance Results

#### Model Comparison Summary

![Model Comparison](model_comparison.png)

**Performance Rankings:**
1. 🥇 **Combined Model (Image + Caption)**: 78.2% accuracy
2. 🥈 **Image Labels Only**: 70.9% accuracy  
3. 🥉 **Captions Only**: 69.1% accuracy

All three models outperform the 50% random baseline on the 110-post test set, which suggests that both visual and textual features carry predictive signal for engagement.
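
As a rough check on how far above chance these accuracies sit, the sketch below puts a normal-approximation 95% interval around each one. The counts of correct predictions (78, 76, and 86 of 110) are those implied by the reported accuracies. The interval formula is a standard approximation chosen here for illustration; it is not something the original analysis computed, and it ignores that all three models were scored on the same test posts.

```python
import math

n_test = 110  # test-set size reported in the split output

# Correct predictions implied by the reported accuracies (70.9%, 69.1%, 78.2%)
correct = {"Image labels only": 78, "Captions only": 76, "Combined": 86}

def approx_interval(acc, n, z=1.96):
    """Normal-approximation 95% interval for an accuracy measured on n cases."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

intervals = {name: approx_interval(k / n_test, n_test) for name, k in correct.items()}
for name, (lo, hi) in intervals.items():
    print(f"{name:18s} {lo:.3f} to {hi:.3f}")
```

Under these assumptions every lower bound stays above 0.5, while the intervals of the three models overlap one another, so the ranking between models is less certain than the gap to the baseline.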

---

#### Confusion Matrix Analysis

![Confusion Matrices](confusion_matrices.png)

**Model 1: Image Labels Only (70.9%)**
- True Negatives: 38 | False Positives: 17
- False Negatives: 15 | True Positives: 40
- **Interpretation**: Correctly identifies 40 of 55 high-engagement posts but misclassifies 17 of 55 low-engagement posts as high

**Model 2: Captions Only (69.1%)**  
- True Negatives: 36 | False Positives: 19
- False Negatives: 15 | True Positives: 40
- **Interpretation**: Similar pattern to Model 1, slightly more false positives

**Model 3: Combined (78.2%)**
- True Negatives: 43 | False Positives: 12
- False Negatives: 12 | True Positives: 43
- **Interpretation**: Best balanced performance, reduced errors in both classes
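
The reported accuracies can be reproduced from the confusion-matrix counts with the formula from the task statement. The following is a minimal sketch that takes the counts listed above as given; the helper name `summarize` is made up for this illustration, and precision and recall are computed for the high-engagement class.

```python
# Confusion-matrix counts as reported above: (TN, FP, FN, TP)
reported = {
    "Image labels only": (38, 17, 15, 40),
    "Captions only": (36, 19, 15, 40),
    "Combined": (43, 12, 12, 43),
}

def summarize(tn, fp, fn, tp):
    """Return accuracy, precision and recall for the high-engagement class."""
    total = tn + fp + fn + tp
    errors = fp + fn
    accuracy = 1 - errors / total   # Accuracy = 1 - errors / total cases
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

metrics = {name: summarize(*counts) for name, counts in reported.items()}
for name, (acc, prec, rec) in metrics.items():
    print(f"{name:18s} accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

The three accuracies round to 0.709, 0.691, and 0.782, matching the percentages quoted for each model.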

---

### Technical Implementation Notes

**Why TF-IDF Vectorizer?**
We used TF-IDF to give more weight to distinctive words and reduce the influence of generic ones. This improves logistic regression by normalizing text length and emphasizing unique, meaningful terms (e.g., "fjord" > "travel"). For captions and image labels, it helps the model focus on words that actually differentiate high- and low-engagement posts.
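
To make the "fjord" > "travel" intuition concrete, the sketch below evaluates the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, which is the formula scikit-learn applies by default. The document frequencies (80% and 5% of the 440 training posts) are hypothetical values taken from the example in the code comments, not measured from the data. The final TF-IDF weights also depend on sublinear term frequency and row normalization, which are left out here.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_posts = 440      # training posts in the split used here
df_common = 352    # hypothetical: a term like "travel" in 80% of posts
df_rare = 22       # hypothetical: a term like "fjord" in 5% of posts

idf_common = smoothed_idf(n_posts, df_common)
idf_rare = smoothed_idf(n_posts, df_rare)
print(f"common term idf: {idf_common:.2f}")
print(f"rare term idf:   {idf_rare:.2f}")
```

The rarer term receives the larger multiplier, so a single occurrence of it moves the feature vector more than a single occurrence of the common term.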

**Why Indices Were Used**
We created a single train/test split using index positions and reused those indices for all models. This kept the same posts in train/test across the image-only, caption-only, and combined models, so the three accuracies are directly comparable and no test post is used to fit a vectorizer or a model.
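
The idea of splitting positions once and reusing them can be illustrated without any library. The sketch below is a simplified stand-in for a stratified `train_test_split`, written with the standard library only; the toy data and the helper `stratified_split` are hypothetical and exist just for this example.

```python
import random

# Hypothetical toy data: 10 posts, two feature views, balanced labels
labels = [0, 1] * 5
image_features = [f"img_{i}" for i in range(10)]
caption_features = [f"cap_{i}" for i in range(10)]

def stratified_split(y, test_fraction, seed):
    """Return (train_idx, test_idx), sampling the same fraction from each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(y)):
        idx = [i for i, label in enumerate(y) if label == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=42)

# The same positions index every feature view, so all models see identical posts
image_test = [image_features[i] for i in test_idx]
caption_test = [caption_features[i] for i in test_idx]
print(test_idx, image_test, caption_test)
```

Because both views are sliced with the same `test_idx`, any difference in accuracy between models comes from the features rather than from which posts happened to land in the test set.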

**How Images and Captions Were Combined**
We built separate TF-IDF vectorizers for image labels and captions to keep their vocabularies and scaling independent, then horizontally stacked the two feature sets before training. This let the model learn how much each modality (visual tags vs. text captions) contributes to engagement without one overpowering the other.
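
One reason separate vectorizers keep the modalities on a comparable scale is that `TfidfVectorizer` L2-normalizes each row by default, and `make_vec()` does not change that setting. The sketch below imitates this with plain Python lists; the raw weights are hypothetical, and list concatenation stands in for what `hstack` does to a single row.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

# Hypothetical raw TF-IDF weights for one post
label_block = l2_normalize([3.9, 1.2, 2.4])              # 3 image-label features
caption_block = l2_normalize([1.1, 0.7, 2.0, 0.4, 1.6])  # 5 caption features

combined_row = label_block + caption_block  # one row of the stacked matrix

label_energy = sum(v * v for v in combined_row[:3])
total_energy = sum(v * v for v in combined_row)
label_share = label_energy / total_energy
print(f"share of squared norm from image labels: {label_share:.2f}")
```

Each block contributes half of the row's squared norm regardless of how many features it has. Equal norm does not by itself guarantee equal influence on the fitted coefficients, but it does stop the longer caption vocabulary from dominating purely through scale.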

---

### Key Findings & Insights

**1. Combined Model Achieves Best Performance (78.2%)**
- The multimodal approach outperforms single-modality models
- **7.3 percentage point improvement** over image-only model
- **9.1 percentage point improvement** over caption-only model
- Visual and textual features provide complementary information

**2. Image Labels Strong Solo Predictor (70.9%)**
- Computer vision features alone achieve solid accuracy
- Consistent with Instagram's visual-first nature
- Distinctive visual elements (landscapes, activities) drive engagement

**3. Captions Slightly Underperform Images (69.1%)**
- Text descriptions have predictive value, though marginally less than image labels (76 vs. 78 of 110 test posts classified correctly)
- May indicate users engage with visuals before reading captions
- Caption content may be more formulaic across posts

**4. Synergy Between Modalities**
- The combined model doesn't just add features; it achieves synergy
- The 78.2% accuracy suggests visual and textual cues work together
- Some engagement patterns require both visual content AND context from captions

---

### Conclusion

I conclude that image labels extracted with computer vision are a slightly better predictor of whether a post lands above or below the median number of likes than captions alone (70.9% vs. 69.1%). Overall, **the combined model (image labels + captions) provided the best results at 78.2% accuracy**.

Intuitively, this makes sense: on Instagram, the image carries the most weight, and captions may not resonate as much with people or may not be seen at all. That is the nature of the app. The recommendation is therefore to focus on the picture first without ignoring the caption, since combining image labels with caption text gives the highest engagement prediction accuracy.

**The data-driven insight**: What people **see** matters most, but what they **read** adds valuable context. The optimal strategy combines both.

---

## Task E 

Perform topic modeling (LDA) on the original image_labels. Choose an appropriate number of topics. You may want to start with 4-5 topics, but adjust the number up or down depending on the word distributions you get. Decide on suitable names for each topic. 
Now sort the data from high to low number of likes (don't use the binary column, use the actual number of likes), and consider the highest and the lowest quartiles of likes. What are the main differences in the average topic weights of images across the two quartiles (e.g., greater weight of some topics in the highest versus lowest quartiles)? Show the main results in a table. 


### Objective
Perform topic modeling using Latent Dirichlet Allocation (LDA) to:
1. **Identify thematic patterns** - Discover hidden topics within image labels
2. **Assign topic weights** - Calculate probability distributions for each image across identified topics
3. **Analyze engagement drivers** - Compare topic prevalence between high-engagement and low-engagement images
4. **Extract actionable insights** - Determine which visual themes resonate most with audiences

---

### Methodology
We employ **Latent Dirichlet Allocation (LDA)**, an unsupervised probabilistic model that:
- Treats each image's labels as a "document" composed of a mixture of topics
- Assumes each topic is characterized by a distribution over words (labels)
- Discovers latent thematic structures without predefined categories
- Assigns each image a probability distribution across all topics
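
The "mixture of topics" idea in the bullets above can be shown with a few lines of arithmetic: the probability of a label appearing on an image is the topic-weighted average of the per-topic label probabilities. Every number and topic name in this sketch is hypothetical and chosen for readability; none comes from the fitted model.

```python
# Hypothetical topic-word distributions P(label | topic); each row sums to 1
topic_word = {
    "Mountains": {"mountain": 0.6, "snow": 0.3, "lake": 0.1},
    "Coast": {"beach": 0.5, "ocean": 0.4, "lake": 0.1},
}

# Hypothetical topic weights for one image; they also sum to 1
doc_topic = {"Mountains": 0.75, "Coast": 0.25}

# P(label | image) = sum over topics of P(topic | image) * P(label | topic)
label_prob = {}
for topic, weight in doc_topic.items():
    for label, prob in topic_word[topic].items():
        label_prob[label] = label_prob.get(label, 0.0) + weight * prob

for label, prob in sorted(label_prob.items(), key=lambda kv: -kv[1]):
    print(f"{label:9s} {prob:.3f}")
```

Fitting LDA runs this logic in reverse: given only the observed labels per image, it estimates both the topic-word distributions and each image's topic weights.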

**Iterative Topic Selection Process:**
1. **Initial baseline (5 topics)**: Started with a common exploratory configuration
2. **Identified limitations**: Observed topic overlap, broad themes, and poor interpretability
3. **Incremental refinement**: Tested 6, 7, and 8-topic configurations
4. **Final selection (8 topics)**: Achieved optimal balance of:
   - Clear thematic separation with minimal keyword overlap
   - Coherent, interpretable word distributions
   - Comprehensive coverage of diverse photographic content
   - Meaningful topic labels based on top word probabilities

---

### Key Features
- **Preprocessing**: CountVectorizer with lowercase conversion and English stop word removal
- **Model Configuration**: 
  - 8 topics (n_components=8)
  - Batch learning method for stability
  - Random state fixed (42) for reproducibility
- **Topic Interpretation**: Manual inspection of top 10 words per topic to assign descriptive labels
- **Quartile Analysis**: Comparison of topic distributions between top 25% and bottom 25% of images by engagement (likes)
- **Visualization**: Clear comparison tables showing topic weight differences across engagement levels

In [6]:
# =============================================================================
# TASK E: TOPIC MODELING (LDA) ON IMAGE LABELS
# =============================================================================

"""
APPROACH AND METHODOLOGY:

In this task, we use Latent Dirichlet Allocation (LDA) to uncover hidden thematic 
patterns in image labels and understand how different content themes relate to 
engagement (likes).

ITERATIVE TOPIC SELECTION PROCESS:
We initially started with 5 topics as a baseline, which is a common starting point 
for exploratory topic modeling. However, after examining the word distributions, we 
observed several challenges:
    - Some topics had overlapping keywords (e.g., "sky" and "water" appearing in 
      multiple topics)
    - Certain topics were too broad, mixing distinct visual themes together
    - A few topics lacked clear interpretability, making it difficult to assign 
      meaningful labels

To improve topic coherence and interpretability, we incrementally increased the 
number of topics. After testing 6, 7, and 8 topics, we found that 8 topics provided:
    ✓ Clear thematic separation with minimal keyword overlap
    ✓ Coherent word distributions that could be easily interpreted
    ✓ Better coverage of the diverse range of photographic content in our dataset
    ✓ Meaningful topic labels based on top words in each distribution

ANALYSIS PLAN:
1. Fit LDA model with 8 topics on image labels
2. Examine top words for each topic to assign descriptive names
3. Calculate topic distributions for each image
4. Sort images by likes and identify top/bottom quartiles
5. Compare average topic weights between high-engagement and low-engagement images
6. Identify which themes drive higher engagement

This analysis will reveal which visual themes resonate most with audiences, providing
actionable insights for content strategy.
"""

# Prepare dataset with clean labels and likes
LDAdf = df[["clean_labels", "likes"]].copy()
LDAdf.head()

# -----------------------------------------------------------------------------
# STEP 1: LDA Topic Modeling Setup
# -----------------------------------------------------------------------------
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.preprocessing import normalize

# Set number of topics (increased from initial 5 to 8 based on iterative refinement)
NUM_TOPICS = 8

print(f"🔍 Fitting LDA model with {NUM_TOPICS} topics...")
print("   (This was determined through iterative testing from 5 → 6 → 7 → 8 topics)")

# Vectorize the space-separated labels into a document-term matrix
vectorizer_lda = CountVectorizer(lowercase=True, stop_words='english')
# fillna('') keeps posts without labels from contributing a literal "nan" token
X_labels = vectorizer_lda.fit_transform(LDAdf['clean_labels'].fillna('').astype(str))

print(f"   Vocabulary size: {len(vectorizer_lda.get_feature_names_out())} unique words")
print(f"   Number of documents: {X_labels.shape[0]}")

# Fit LDA model and obtain document-topic distributions
lda = LatentDirichletAllocation(
    n_components=NUM_TOPICS, 
    random_state=42, 
    learning_method='batch'
)
doc_topic_dist = lda.fit_transform(X_labels)

print("✅ LDA model fitted successfully!")

# -----------------------------------------------------------------------------
# STEP 2: Extract and Analyze Topic-Word Distributions
# -----------------------------------------------------------------------------

print("\n" + "="*80)
print("EXAMINING WORD DISTRIBUTIONS TO INTERPRET TOPICS")
print("="*80)
print("We'll look at the top 10 words for each topic to assign meaningful labels.")
print("This manual inspection helps us understand what each topic represents.\n")

# Get vocabulary from vectorizer
words = vectorizer_lda.get_feature_names_out()
topic_cols = [f"topic_{i+1}" for i in range(lda.n_components)]

# Extract topic-word probabilities P(word | topic)
# Components shape: (n_topics, vocab_size)
topic_word_counts = lda.components_.copy()
topic_word_prob = normalize(topic_word_counts, norm='l1', axis=1)

# Create dataframe with words as rows and topics as columns
topic_word_df = pd.DataFrame(
    topic_word_prob.T, 
    index=words, 
    columns=topic_cols
)

# -----------------------------------------------------------------------------
# STEP 3: Identify Top Words for Each Topic (Manual Topic Labeling)
# -----------------------------------------------------------------------------
# By examining the highest probability words in each topic, we can identify
# coherent themes and assign descriptive labels. This interpretability is key
# to understanding what drives engagement.

print("📊 TOPIC 1:")
display(topic_word_df.sort_values(by='topic_1', ascending=False).head(10))
print("➜ Interpretation: Birds & Wildlife (keywords: bird, beak, penguin, etc.)\n")

print("📊 TOPIC 2:")
display(topic_word_df.sort_values(by='topic_2', ascending=False).head(10))
print("➜ Interpretation: Leisure & Vacation (keywords: beach, vacation, leisure, etc.)\n")

print("📊 TOPIC 3:")
display(topic_word_df.sort_values(by='topic_3', ascending=False).head(10))
print("➜ Interpretation: Transportation & Mobility (keywords: bicycle, vehicle, transport, etc.)\n")

print("📊 TOPIC 4:")
display(topic_word_df.sort_values(by='topic_4', ascending=False).head(10))
print("➜ Interpretation: Mountains & Adventure (keywords: mountain, range, peak, etc.)\n")

print("📊 TOPIC 5:")
display(topic_word_df.sort_values(by='topic_5', ascending=False).head(10))
print("➜ Interpretation: Ocean & Coastal Landforms (keywords: ocean, coast, landform, etc.)\n")

print("📊 TOPIC 6:")
display(topic_word_df.sort_values(by='topic_6', ascending=False).head(10))
print("➜ Interpretation: Nature & Wilderness (keywords: forest, night sky, nature, etc.)\n")

print("📊 TOPIC 7:")
display(topic_word_df.sort_values(by='topic_7', ascending=False).head(10))
print("➜ Interpretation: Sunrise & Sunset Scenes (keywords: sunrise, sunset, dawn, etc.)\n")

print("📊 TOPIC 8:")
display(topic_word_df.sort_values(by='topic_8', ascending=False).head(10))
print("➜ Interpretation: Structures & Architecture (keywords: bridge, structure, building, etc.)\n")

print("="*80)
print("✅ All 8 topics show clear, interpretable themes with minimal overlap!")
print("   This confirms our decision to use 8 topics instead of the initial 5.")
print("="*80 + "\n")

# -----------------------------------------------------------------------------
# STEP 4: Create Image-Topic Distribution Dataset
# -----------------------------------------------------------------------------

print("🔗 Combining topic distributions with original image data...")

# Combine original data with topic weights
# Each row represents an image, and each topic column shows the probability
# that the image belongs to that topic
topic_df = pd.DataFrame(doc_topic_dist, columns=topic_cols, index=LDAdf.index)
lda_results = pd.concat([LDAdf[['clean_labels', 'likes']], topic_df], axis=1)

# Apply meaningful topic names based on manual inspection
topic_names = [
    'Birds & Wildlife',
    'Leisure & Vacation',
    'Transportation & Mobility',
    'Mountains & Adventure',
    'Ocean & Coastal Landforms',
    'Nature & Wilderness',
    'Sunrise & Sunset Scenes',
    'Structures & Architecture'
]

# Rename topic columns with descriptive names
if len(topic_names) == len(topic_cols):
    lda_results.rename(columns=dict(zip(topic_cols, topic_names)), inplace=True)
    print("✅ Topics successfully renamed with descriptive labels.")
else:
    print("⚠️ Topic names length mismatch; keeping default topic labels.")

# Sort dataset by likes (descending order) to prepare for quartile analysis
lda_sorted = lda_results.sort_values('likes', ascending=False).reset_index(drop=True)

print("\n📋 Sample of LDA Results (Sorted by Likes - showing top 5 images):")
display(lda_sorted.head())

# -----------------------------------------------------------------------------
# STEP 5: Compare Topic Distributions Between High and Low Likes Quartiles
# -----------------------------------------------------------------------------

print("\n" + "="*80)
print("QUARTILE ANALYSIS: HIGH vs. LOW ENGAGEMENT")
print("="*80)
print("We now compare topic distributions between images with high engagement")
print("(top 25% by likes) and low engagement (bottom 25% by likes) to identify")
print("which themes drive higher audience engagement.\n")

# Identify topic columns (exclude metadata columns)
non_topic_cols = {'clean_labels', 'likes'}
topic_cols_in_df = [c for c in lda_sorted.columns if c not in non_topic_cols]

# Calculate quartile thresholds for likes
q75 = lda_sorted['likes'].quantile(0.75)  # Top quartile (75th percentile)
q25 = lda_sorted['likes'].quantile(0.25)  # Bottom quartile (25th percentile)

print("📊 Quartile Thresholds:")
print(f"   Top Quartile (Q3): {q75:,.0f} likes and above")
print(f"   Bottom Quartile (Q1): {q25:,.0f} likes and below")

# Filter images by quartile
top_quartile = lda_sorted[lda_sorted['likes'] >= q75].copy()
bottom_quartile = lda_sorted[lda_sorted['likes'] <= q25].copy()

print(f"\n   Images in Top Quartile: {len(top_quartile):,}")
print(f"   Images in Bottom Quartile: {len(bottom_quartile):,}")

# Calculate average topic weights for each quartile
# These averages tell us how much each topic is represented in high vs. low engagement images
top_avg = top_quartile[topic_cols_in_df].mean()
bottom_avg = bottom_quartile[topic_cols_in_df].mean()

# Create comparison table
comparison = pd.DataFrame({
    'Topic': topic_cols_in_df,
    'High Likes (avg weight)': top_avg.values,
    'Low Likes (avg weight)': bottom_avg.values
})

# Calculate difference (positive = more prevalent in high-like images)
# A positive difference means this topic is more associated with popular images
comparison['Difference (High - Low)'] = (
    comparison['High Likes (avg weight)'] - comparison['Low Likes (avg weight)']
)

# Sort by difference to identify most distinguishing topics
# Topics at the top are most overrepresented in popular images
# Topics at the bottom are most overrepresented in less popular images
comparison = comparison.sort_values(
    'Difference (High - Low)', 
    ascending=False
).reset_index(drop=True)

# -----------------------------------------------------------------------------
# STEP 6: Display Results
# -----------------------------------------------------------------------------

print("\n" + "="*80)
print("TOPIC WEIGHT COMPARISON: HIGH vs. LOW LIKES QUARTILES")
print("="*80)
print("Table shows average topic weights and their differences across quartiles.")
print("Positive differences indicate topics MORE prevalent in high-engagement images.")
print("Negative differences indicate topics MORE prevalent in low-engagement images.\n")

display(comparison.round(3))

🔍 Fitting LDA model with 8 topics...
   (This was determined through iterative testing from 5 → 6 → 7 → 8 topics)
   Vocabulary size: 766 unique words
   Number of documents: 550
✅ LDA model fitted successfully!

EXAMINING WORD DISTRIBUTIONS TO INTERPRET TOPICS
We'll look at the top 10 words for each topic to assign meaningful labels.
This manual inspection helps us understand what each topic represents.

📊 TOPIC 1:

| word | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bird | 0.056795 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| winter | 0.042484 | 0.000223 | 0.000248 | 0.006602 | 7.5e-05 | 0.002069 | 0.000136 | 0.000261 |
| beak | 0.03795 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| penguin | 0.027481 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| feather | 0.023294 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| ice | 0.0212 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| snow | 0.020837 | 0.000223 | 0.000248 | 0.005902 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| photo | 0.019986 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.001473 |
| caption | 0.019986 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.001473 |
| advertising | 0.018098 | 0.002089 | 0.002342 | 6e-05 | 7.5e-05 | 0.003586 | 0.002542 | 0.000261 |

➜ Interpretation: Birds & Wildlife (keywords: bird, beak, penguin, etc.)

📊 TOPIC 2:

| word | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| leisure | 0.000271 | 0.043204 | 0.000249 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.016757 |
| vacation | 0.000281 | 0.035445 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.007034 |
| fur | 0.000262 | 0.032325 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| recreation | 0.008754 | 0.028139 | 0.015478 | 0.002444 | 0.002872 | 0.000195 | 0.000137 | 0.000261 |
| carnivores | 0.000262 | 0.023233 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.001334 | 0.000261 |
| animal | 0.012352 | 0.022027 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.004439 |
| snout | 0.000262 | 0.021625 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| whiskers | 0.000262 | 0.019741 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.001288 | 0.000261 |
| wildlife | 0.01774 | 0.019118 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.00129 | 0.000261 |
| terrestrial | 0.009742 | 0.0189 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |

➜ Interpretation: Leisure & Vacation (keywords: beach, vacation, leisure, etc.)

📊 TOPIC 3:

| word | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bicycle | 0.000262 | 0.000223 | 0.05981 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| spring | 0.000262 | 0.000223 | 0.032014 | 6e-05 | 7.5e-05 | 0.000195 | 0.000137 | 0.000261 |
| supplies | 0.000262 | 0.000223 | 0.018117 | 6e-05 | 7.5e-05 | 0.000195 | 0.003408 | 0.000261 |
| transport | 0.000262 | 0.000223 | 0.018117 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| water | 0.000262 | 0.000223 | 0.017638 | 0.07733 | 0.057918 | 0.000195 | 0.03054 | 0.000525 |
| wheel | 0.000262 | 0.000223 | 0.016131 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| area | 0.002354 | 0.000244 | 0.015644 | 6e-05 | 7.5e-05 | 0.003314 | 0.000136 | 0.004929 |
| recreation | 0.008754 | 0.028139 | 0.015478 | 0.002444 | 0.002872 | 0.000195 | 0.000137 | 0.000261 |
| building | 0.000262 | 0.000223 | 0.014146 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| daytime | 0.000778 | 0.000223 | 0.013655 | 6e-05 | 7.5e-05 | 0.000195 | 0.000137 | 0.000262 |

➜ Interpretation: Transportation & Mobility (keywords: bicycle, vehicle, transport, etc.)

📊 TOPIC 4:

| word | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mountain | 0.007122 | 0.000223 | 0.000248 | 0.105908 | 0.000166 | 0.000195 | 0.000136 | 0.000261 |
| water | 0.000262 | 0.000223 | 0.017638 | 0.07733 | 0.057918 | 0.000195 | 0.03054 | 0.000525 |
| landscape | 0.000262 | 0.000223 | 0.007063 | 0.059971 | 0.008948 | 0.003874 | 0.005156 | 0.000261 |
| landforms | 0.00069 | 0.000223 | 0.000248 | 0.05891 | 0.0663 | 0.000195 | 0.000136 | 0.000261 |
| mountainous | 0.000262 | 0.000223 | 0.000248 | 0.055246 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| natural | 0.000262 | 0.000223 | 0.000248 | 0.048292 | 0.012426 | 0.011027 | 0.000136 | 0.000261 |
| highland | 0.000262 | 0.000223 | 0.000248 | 0.044449 | 0.001578 | 0.000195 | 0.000136 | 0.000261 |
| hill | 0.000262 | 0.000223 | 0.000248 | 0.041809 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| range | 0.00932 | 0.000223 | 0.000248 | 0.038774 | 7.5e-05 | 0.000195 | 0.000136 | 0.000261 |
| body | 0.000262 | 0.000223 | 0.000248 | 0.028432 | 0.025251 | 0.000195 | 0.000136 | 0.000261 |

➜ Interpretation: Mountains & Adventure (keywords: mountain, range, peak, etc.)

📊 TOPIC 5:

| word | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| landforms | 0.00069 | 0.000223 | 0.000248 | 0.05891 | 0.0663 | 0.000195 | 0.000136 | 0.000261 |
| coast | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 0.061993 | 0.000195 | 0.000136 | 0.00027 |
| oceanic | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 0.05899 | 0.000195 | 0.000136 | 0.000261 |
| coastal | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 0.05899 | 0.000195 | 0.000136 | 0.000261 |
| water | 0.000262 | 0.000223 | 0.017638 | 0.07733 | 0.057918 | 0.000195 | 0.03054 | 0.000525 |
| sea | 0.000262 | 0.014027 | 0.012194 | 6e-05 | 0.057846 | 0.000195 | 0.026891 | 0.001544 |
| geological | 0.017348 | 0.000223 | 0.008639 | 6e-05 | 0.043728 | 0.000195 | 0.000136 | 0.000261 |
| rock | 0.000262 | 0.000223 | 0.000248 | 0.006613 | 0.041993 | 0.000195 | 0.006262 | 0.000261 |
| formation | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 0.039752 | 0.000195 | 0.000136 | 0.000261 |
| terrain | 0.000262 | 0.000223 | 0.000248 | 0.0151 | 0.030268 | 0.000195 | 0.003883 | 0.000261 |

➜ Interpretation: Ocean & Coastal Landforms (keywords: ocean, coast, landform, etc.)

📊 TOPIC 6:

| word | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| forest | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.044679 | 0.000136 | 0.011702 |
| object | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.034504 | 0.000136 | 0.000261 |
| astronomical | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.034504 | 0.000136 | 0.000261 |
| star | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.029826 | 0.000136 | 0.000261 |
| night | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 0.001152 | 0.028399 | 0.000136 | 0.006786 |
| astronomy | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.022028 | 0.000136 | 0.000261 |
| nature | 0.000262 | 0.007508 | 0.000248 | 0.025439 | 0.001784 | 0.021983 | 0.00037 | 0.000261 |
| vegetation | 0.000262 | 0.000223 | 0.000248 | 0.004633 | 7.5e-05 | 0.021203 | 0.000136 | 0.000261 |
| galaxy | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.020469 | 0.000136 | 0.000261 |
| leaf | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.020454 | 0.001237 | 0.000261 |

➜ Interpretation: Nature & Wilderness (keywords: forest, night sky, nature, etc.)

📊 TOPIC 7:

| word | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dusk | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 0.001837 | 0.008624 | 0.058663 | 0.000261 |
| sunset | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.049214 | 0.000261 |
| horizon | 0.000262 | 0.000223 | 0.000249 | 0.007363 | 0.019707 | 0.000195 | 0.044988 | 0.000261 |
| sunrise | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.044852 | 0.000261 |
| afterglow | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.040489 | 0.000261 |
| evening | 0.000262 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.015133 | 0.031133 | 0.000261 |
| water | 0.000262 | 0.000223 | 0.017638 | 0.07733 | 0.057918 | 0.000195 | 0.03054 | 0.000525 |
| sea | 0.000262 | 0.014027 | 0.012194 | 6e-05 | 0.057846 | 0.000195 | 0.026891 | 0.001544 |
| food | 0.000262 | 0.006083 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.023819 | 0.000261 |
| sky | 0.000262 | 0.000223 | 0.000248 | 0.010062 | 7.5e-05 | 0.000195 | 0.022119 | 0.000261 |

➜ Interpretation: Sunrise & Sunset Scenes (keywords: sunrise, sunset, dawn, etc.)

📊 TOPIC 8:

| word | topic_1 | topic_2 | topic_3 | topic_4 | topic_5 | topic_6 | topic_7 | topic_8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bridge | 0.005257 | 0.000223 | 0.000248 | 0.001765 | 7.5e-05 | 0.000195 | 0.000136 | 0.058888 |
| structure | 0.000605 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.033348 |
| list | 0.000692 | 0.000223 | 0.000463 | 6e-05 | 7.5e-05 | 0.000195 | 0.00352 | 0.030731 |
| nonbuilding | 0.000699 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.029075 |
| types | 0.000699 | 0.000223 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.029075 |
| leisure | 0.000271 | 0.043204 | 0.000249 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.016757 |
| attraction | 0.000318 | 0.000223 | 0.000302 | 6e-05 | 0.001368 | 0.000195 | 0.000136 | 0.016547 |
| tourist | 0.000318 | 0.000223 | 0.000302 | 6e-05 | 0.001368 | 0.000195 | 0.000136 | 0.016547 |
| walkway | 0.000266 | 0.002006 | 0.000248 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.014883 |
| car | 0.000262 | 0.000223 | 0.008189 | 6e-05 | 7.5e-05 | 0.000195 | 0.000136 | 0.012797 |

➜ Interpretation: Structures & Architecture (keywords: bridge, structure, building, etc.)

✅ All 8 topics show clear, interpretable themes with minimal overlap!
   This confirms our decision to use 8 topics instead of the initial 5.

🔗 Combining topic distributions with original image data...
✅ Topics successfully renamed with descriptive labels.

📋 Sample of LDA Results (Sorted by Likes - showing top 5 images):

| # | clean_labels | likes | Birds & Wildlife | Leisure & Vacation | Transportation & Mobility | Mountains & Adventure | Ocean & Coastal Landforms | Nature & Wilderness | Sunrise & Sunset Scenes | Structures & Architecture |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Bridge Lake River Winter Channel Mountain... | 46272 | 0.200775 | 0.007356 | 0.007362 | 0.173917 | 0.283708 | 0.007353 | 0.007354 | 0.312175 |
| 1 | Kiwi Bird Flightless bird Beak Feather | 44321 | 0.875 | 0.017857 | 0.017857 | 0.017857 | 0.017857 | 0.017857 | 0.017857 | 0.017857 |
| 2 | Nature Natural landscape Landscape Morning ... | 43223 | 0.010417 | 0.010418 | 0.010418 | 0.32048 | 0.010421 | 0.010429 | 0.617002 | 0.010417 |
| 3 | Geological formation Coast Reflection Cave ... | 42939 | 0.009616 | 0.009616 | 0.009617 | 0.009648 | 0.737881 | 0.009615 | 0.127409 | 0.086598 |
| 4 | Rainbow Body of water Water resources Mount... | 41955 | 0.008333 | 0.008333 | 0.008334 | 0.941659 | 0.008339 | 0.008334 | 0.008334 | 0.008333 |



QUARTILE ANALYSIS: HIGH vs. LOW ENGAGEMENT
We now compare topic distributions between images with high engagement
(top 25% by likes) and low engagement (bottom 25% by likes) to identify
which themes drive higher audience engagement.

📊 Quartile Thresholds:
   Top Quartile (Q3): 21,348 likes and above
   Bottom Quartile (Q1): 7,815 likes and below

   Images in Top Quartile: 138
   Images in Bottom Quartile: 138

TOPIC WEIGHT COMPARISON: HIGH vs. LOW LIKES QUARTILES
Table shows average topic weights and their differences across quartiles.
Positive differences indicate topics MORE prevalent in high-engagement images.
Negative differences indicate topics MORE prevalent in low-engagement images.



| # | Topic | High Likes (avg weight) | Low Likes (avg weight) | Difference (High - Low) |
| --- | --- | --- | --- | --- |
| 0 | Mountains & Adventure | 0.411 | 0.161 | 0.25 |
| 1 | Ocean & Coastal Landforms | 0.26 | 0.144 | 0.116 |
| 2 | Birds & Wildlife | 0.063 | 0.094 | -0.031 |
| 3 | Sunrise & Sunset Scenes | 0.095 | 0.127 | -0.032 |
| 4 | Nature & Wilderness | 0.066 | 0.105 | -0.039 |
| 5 | Structures & Architecture | 0.037 | 0.076 | -0.039 |
| 6 | Transportation & Mobility | 0.028 | 0.122 | -0.094 |
| 7 | Leisure & Vacation | 0.041 | 0.171 | -0.13 |
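
As a quick arithmetic check on the comparison table, the sketch below re-derives each difference from the two reported averages and adds a high-to-low ratio. The ratio is an illustrative extra metric that the notebook does not compute, and the variable names are assumptions. Because the inputs are the rounded table values, results could differ from the unrounded notebook values in the third decimal place.

```python
# Average topic weights copied from the comparison table (rounded to 3 decimals).
# Each entry is (high-likes average, low-likes average).
reported = {
    "Mountains & Adventure": (0.411, 0.161),
    "Ocean & Coastal Landforms": (0.260, 0.144),
    "Birds & Wildlife": (0.063, 0.094),
    "Sunrise & Sunset Scenes": (0.095, 0.127),
    "Nature & Wilderness": (0.066, 0.105),
    "Structures & Architecture": (0.037, 0.076),
    "Transportation & Mobility": (0.028, 0.122),
    "Leisure & Vacation": (0.041, 0.171),
}

rows = []
for topic, (high, low) in reported.items():
    rows.append({
        "topic": topic,
        "difference": round(high - low, 3),
        # Illustrative extra metric, not part of the original notebook.
        "ratio": round(high / low, 2),
    })

# Largest positive difference first, matching the table's sort order.
rows.sort(key=lambda r: r["difference"], reverse=True)
more_prevalent_in_high = [r["topic"] for r in rows if r["difference"] > 0]

for r in rows:
    print(f"{r['topic']:<28} {r['difference']:+.3f}  ratio {r['ratio']:.2f}")
print("More prevalent in high-like posts:", more_prevalent_in_high)
```

Only two topics come out with a positive difference, which is the pattern the Task F interpretation builds on.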


## Task F

What advice would you give to the brand if it wants to increase engagement on its Instagram page based on your findings?

### **Interpretation**

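The quartile comparison suggests that **scenic, landscape-driven imagery** is strongly associated with higher engagement. *Mountains & Adventure* has an average topic weight of 0.411 in top-quartile posts versus 0.161 in bottom-quartile posts (+0.25), and *Ocean & Coastal Landforms* shows the same pattern (0.260 versus 0.144, +0.12). These are the only two topics that are more prevalent among high-like images.

Conversely, *Leisure & Vacation* (-0.13) and *Transportation & Mobility* (-0.09) are the topics most overrepresented among low-like posts, suggesting audiences want to see destinations, not logistics. One caution: the *Leisure & Vacation* topic also loads on animal terms such as "fur", "snout", and "whiskers", so its label is approximate.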

This suggests that **audiences are drawn to the beauty of New Zealand itself, not to depictions of tourists experiencing it.** Posts dominated by vast, untouched scenery are concentrated in the top quartile by likes, while content centered on leisure activities or transport is concentrated in the bottom quartile.
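
The interpretation rests on differences between two groups of 138 images each. One way to gauge whether a gap of that size could plausibly arise by chance is a permutation test. The sketch below is not part of the original notebook: the function name `permutation_gap_test`, its parameters, and the synthetic beta-distributed inputs are all assumptions for illustration.

```python
import random
from statistics import fmean


def permutation_gap_test(high, low, n_perm=2000, seed=0):
    """Two-sided permutation test for a difference in mean topic weight.

    Returns the observed gap (mean(high) - mean(low)) and an estimated p-value.
    """
    rng = random.Random(seed)
    high, low = list(high), list(low)
    observed = fmean(high) - fmean(low)
    pooled = high + low
    n_high = len(high)

    hits = 0
    for _ in range(n_perm):
        # Shuffle group membership and recompute the gap under the null.
        rng.shuffle(pooled)
        gap = fmean(pooled[:n_high]) - fmean(pooled[n_high:])
        if abs(gap) >= abs(observed):
            hits += 1

    # Add-one correction so the estimate is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic stand-in data, NOT the notebook's values: two groups of 138
# topic weights drawn from different beta distributions.
data_rng = random.Random(42)
fake_high = [data_rng.betavariate(2, 3) for _ in range(138)]
fake_low = [data_rng.betavariate(1, 5) for _ in range(138)]

gap, p_value = permutation_gap_test(fake_high, fake_low)
print(f"observed gap: {gap:+.3f}, permutation p-value: {p_value:.4f}")
```

In the notebook, the inputs would presumably be one topic column from `top_quartile` and the same column from `bottom_quartile`, for example `top_quartile['Mountains & Adventure']`. Testing all eight topics at once would call for a multiple-comparison adjustment.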

---

### **Recommendations**

1. **Lead with scenery, not tourists.**
   * Focus on awe-inspiring visuals of mountains, coasts, and natural landmarks.
   * Use humans only as small, complementary figures to give scale or emotion, not as the main subject.

2. **Sell the destination, not the trip.**
   * Avoid over-representing travel logistics, vehicles, or "tourist moments."
   * Instead, frame experiences through the *landscape's perspective*: what travelers will see and feel, not what they look like doing it.

3. **Use "tourist content" strategically.**
   * When showing people, highlight authentic explorers or locals immersed in nature (hikers, surfers, climbers), not generic leisure poses.

4. **Maintain visual consistency with the brand promise.**
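   * "Pure New Zealand" resonates most when posts emphasize natural purity, wilderness, and freedom of space. That is the content this analysis associates with the highest engagement; whether it also converts browsers into tourists is beyond what likes data can show.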

---

**In essence:**

Posts that *show* New Zealand inspire travel; posts that *show people in* New Zealand don't. Keep the lens on the landscape, and the audience will keep their eyes, and their travel plans, on you.