1.0 Data Pipeline Execution and Initial EDA

This notebook executes the data scraping, cleaning, and initial sentiment scoring, preparing the dataset for detailed analysis.

First, we import the required libraries and the custom modules from our src folder.

In [None]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Import our custom modules
from src.data_scraper import run_data_pipeline, save_data
from src.sentiment_analyzer import analyze_sentiment, save_processed_data

# Set display options
pd.set_option('display.max_columns', None)
sns.set_style("whitegrid")

2. Data Acquisition

We run the core scraping function. Note: this run uses mock data generated in src/data_scraper.py, because actual scraping requires a real YouTube API key.

In [None]:
# --- PARAMETERS ---
CONTROVERSY_NAME = "The Tarte Brand Trip Scandal"
RAW_FILE_NAME = "tarte_raw_comments.csv"

# Run the data pipeline (uses mock data generated in src/data_scraper.py)
df_raw = run_data_pipeline(CONTROVERSY_NAME)
print(f"Total raw comments collected: {len(df_raw)}")

# Save the raw data
save_data(df_raw, RAW_FILE_NAME)
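For readers without access to src/data_scraper.py, the mock data generation mentioned above could look roughly like the following. This is a minimal sketch, not the project's actual generator: the function name `generate_mock_comments`, the templates, the `comment_id` and `published_at` fields, and the arbitrary start date are all assumptions made for illustration. Only the `comment_text` and `comment_likes` column names come from this notebook.

```python
import random
from datetime import datetime, timedelta

# Hypothetical templates; the real generator in src/data_scraper.py may differ.
MOCK_TEMPLATES = [
    "I can't believe {brand} did this.",
    "Honestly, I still love {brand}.",
    "Not sure how I feel about the {brand} trip.",
]


def generate_mock_comments(brand, n=100, seed=42):
    """Return n fake comment records as a list of dicts."""
    rng = random.Random(seed)  # local RNG so repeated calls give the same rows
    start = datetime(2023, 1, 1)  # arbitrary placeholder start date
    rows = []
    for i in range(n):
        rows.append({
            "comment_id": f"mock_{i:05d}",
            "comment_text": rng.choice(MOCK_TEMPLATES).format(brand=brand),
            # Exponential draw: mostly small like counts with a long tail
            "comment_likes": int(rng.expovariate(1 / 20)),
            # Random offset of up to 30 days from the start date
            "published_at": start + timedelta(minutes=rng.randint(0, 60 * 24 * 30)),
        })
    return rows


mock_rows = generate_mock_comments("Tarte", n=5)
for row in mock_rows:
    print(row["comment_id"], row["comment_likes"], row["comment_text"])
```

A list of dicts like this can be passed straight to `pd.DataFrame(mock_rows)` to get a frame shaped like `df_raw`. Fixing the seed keeps the mock dataset identical between runs, which makes downstream plots comparable.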

3. Sentiment Analysis Execution

We now run the full sentiment analysis, applying both VADER and TextBlob to all comments.

In [None]:
# --- PARAMETERS ---
PROCESSED_FILE_NAME = "tarte_processed_sentiment.csv"

# Run the full sentiment analysis and cleaning pipeline
# (pass a copy so df_raw stays unmodified)
df_processed = analyze_sentiment(df_raw.copy())

# Save the processed data for subsequent notebooks
save_processed_data(df_processed, PROCESSED_FILE_NAME)

print("\n--- Processed Data Snapshot ---")
df_processed[['comment_text', 'vader_compound', 'textblob_polarity', 'textblob_subjectivity']].head()
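The `vader_compound` values in the snapshot above are continuous scores between -1.0 and +1.0. A common way to read them is to bucket them into coarse labels using the ±0.05 cutoffs suggested in the VADER documentation. The sketch below shows that bucketing; the helper name `label_compound` and the example scores are made up for illustration and are not part of src/sentiment_analyzer.py.

```python
def label_compound(score, threshold=0.05):
    """Map a VADER compound score to 'positive', 'neutral', or 'negative'."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Made-up scores, chosen to include values at and near the cutoffs
example_scores = [0.62, -0.34, 0.0, 0.05, -0.049]
labels = [label_compound(s) for s in example_scores]
print(list(zip(example_scores, labels)))
```

On the processed frame, the same helper could be applied with `df_processed['vader_compound'].apply(label_compound)`. The threshold is kept as a parameter because the appropriate cutoff for short, informal comments may differ from the default.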

4. Initial Exploratory Data Analysis (EDA)

We take a quick look at the distribution of engagement (comment likes) and at the preliminary VADER compound scores.

In [None]:
# Distribution of Comment Likes (Engagement)
plt.figure(figsize=(10, 5))
sns.histplot(df_processed['comment_likes'], bins=50, kde=True)
plt.title('Distribution of Comment Likes (Engagement)')
plt.xlabel('Comment Likes')
plt.ylabel('Count')
plt.show()
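Like counts are often heavily right-skewed, with most comments near zero and a few very large values; whether that holds for this dataset is an assumption to check against the histogram above. If it does, a `log(1 + x)` transform can make the shape easier to read while keeping zero-like comments defined. The helper name `log1p_likes` and the sample values below are illustrative only.

```python
import math


def log1p_likes(likes):
    """Return log(1 + x) for each like count; zero stays at zero."""
    return [math.log1p(x) for x in likes]


# Made-up like counts spanning several orders of magnitude
sample_likes = [0, 1, 10, 100, 1000]
transformed = log1p_likes(sample_likes)
for raw, value in zip(sample_likes, transformed):
    print(f"{raw:>5} -> {value:.3f}")
```

With NumPy available (it is not imported in this notebook), the equivalent on the processed frame would be `np.log1p(df_processed['comment_likes'])`, which can then be passed to `sns.histplot` in place of the raw column.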

In [None]:
# Preliminary VADER Compound Score Distribution
plt.figure(figsize=(10, 5))
sns.histplot(df_processed['vader_compound'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of VADER Compound Scores')
plt.xlabel('VADER Compound Score (-1.0 to +1.0)')
plt.ylabel('Count')
plt.show()

# Final status update
print(f"Data pipeline complete. Processed data ready in data/processed/{PROCESSED_FILE_NAME}")
