1.0 Data Pipeline Execution and Initial EDA

This notebook executes the data scraping, cleaning, and initial sentiment scoring, preparing the dataset for detailed analysis.

First, we import the required libraries and the custom modules from our src folder.

In [1]:
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns

# --- Add the project root to the import path ---
# This assumes the kernel's working directory is 'notebooks/', one level below the project root.
import sys

# os.getcwd() returns the kernel's working directory; it equals the notebook's
# directory only when Jupyter was started from (or configured for) that folder.
current_dir = os.getcwd()

# The project root is the directory containing 'src' and 'notebooks'
project_root_path = os.path.dirname(current_dir)

# Put the project root first on the path so that our 'src' package takes
# precedence over any other package named 'src'; skip it if already present.
if project_root_path not in sys.path:
    sys.path.insert(0, project_root_path)

# Verification: print the path being used
print(f"Project root added to system path: {project_root_path}")
# ----------------------------------------------------------------

# Import our custom modules
from src.data_scraper import run_data_pipeline, save_data
from src.sentiment_analyzer import analyze_sentiment, save_processed_data

# Set display options
pd.set_option('display.max_columns', None)
sns.set_style("whitegrid")

Project root added to system path: D:\Projects\Sentiment-Analysis-Influencer-Controversies


ModuleNotFoundError: No module named 'src.data_scraper'
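
The error above names 'src.data_scraper' rather than 'src', which means Python did find something called src but could not find a data_scraper module inside it. The sketch below is one possible way to check both halves of that: whether the expected file exists under the project root, and which location the src package actually resolves to. The helper name describe_module_location and its parameters are hypothetical and are not part of this project.

```python
import importlib.util
from pathlib import Path


def describe_module_location(project_root, package="src", module="data_scraper"):
    """Report whether <project_root>/<package>/<module>.py exists and where
    the top-level package currently resolves on the import path.

    Hypothetical diagnostic helper, not part of the project's src folder.
    """
    expected_file = Path(project_root) / package / f"{module}.py"

    # find_spec returns None when no top-level package with this name is importable.
    spec = importlib.util.find_spec(package)
    if spec is None:
        resolved = []
    else:
        # Packages expose their search locations; a plain module only has an origin.
        resolved = list(spec.submodule_search_locations or [spec.origin])

    return {
        "expected_file": str(expected_file),
        "expected_file_exists": expected_file.is_file(),
        "package_resolves_to": resolved,
    }
```

Calling describe_module_location(project_root_path) in the notebook would show whether data_scraper.py is missing from src, or whether src resolves to a directory other than the one under the project root.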

2. Data Acquisition

We run the core scraping function. Note: this run uses mock data generated in src/data_scraper.py, because real scraping requires a YouTube API key.

In [2]:
# --- PARAMETERS ---
CONTROVERSY_NAME = "The Tarte Brand Trip Scandal"
RAW_FILE_NAME = "tarte_raw_comments.csv"

# Run the data pipeline (uses mock data generated in src/data_scraper.py)
df_raw = run_data_pipeline(CONTROVERSY_NAME)
print(f"Total raw comments collected: {len(df_raw)}")

# Save the raw data
save_data(df_raw, RAW_FILE_NAME)

NameError: name 'run_data_pipeline' is not defined
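
Since the pipeline relies on mock data, it may help to see what such a generator could look like. The sketch below is only an illustration under stated assumptions: the function name make_mock_comments, the comment_id field, the sample texts, and the Pareto-distributed like counts are all hypothetical, and it is not the generator in src/data_scraper.py. Only the comment_text and comment_likes column names come from the later cells of this notebook.

```python
import random


def make_mock_comments(n_comments, seed=0):
    """Build a list of hypothetical mock comment rows.

    Illustrative only; this is not the project's actual mock generator.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    sample_texts = [
        "This trip was so out of touch.",
        "Honestly I loved the content.",
        "Not sure how I feel about this.",
    ]
    rows = []
    for i in range(n_comments):
        rows.append({
            "comment_id": f"mock_{i}",  # hypothetical identifier field
            "comment_text": rng.choice(sample_texts),
            # paretovariate returns values >= 1, so this yields a
            # non-negative integer with a long right tail.
            "comment_likes": int(rng.paretovariate(1.5)) - 1,
        })
    return rows
```

Passing the result to pd.DataFrame(make_mock_comments(100)) would give a frame with the comment_text and comment_likes columns that the later cells expect.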

3. Sentiment Analysis Execution

We now run the full sentiment analysis, applying both VADER and TextBlob to all comments.

In [3]:
# Run the full sentiment analysis and cleaning pipeline
df_processed = analyze_sentiment(df_raw.copy())
PROCESSED_FILE_NAME = "tarte_processed_sentiment.csv"

# Save the processed data for subsequent notebooks
save_processed_data(df_processed, PROCESSED_FILE_NAME)

print("\n--- Processed Data Snapshot ---")
df_processed[['comment_text', 'vader_compound', 'textblob_polarity', 'textblob_subjectivity']].head()

NameError: name 'analyze_sentiment' is not defined
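
The vader_compound column holds a score between -1.0 and +1.0. For later analysis it is often convenient to turn that score into a categorical label. The sketch below uses the thresholds commonly suggested in the VADER documentation (positive at 0.05 or above, negative at -0.05 or below, neutral otherwise). The function name label_from_compound is hypothetical, and whether analyze_sentiment already produces such labels is not known from this notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'.

    Hypothetical helper; the 0.05 default follows the convention
    suggested in the VADER documentation.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

Once df_processed is available, one way to apply it would be df_processed['vader_compound'].apply(label_from_compound).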

4. Initial Exploratory Data Analysis (EDA)

We take a quick look at the distribution of comment engagement (likes) and of the preliminary VADER compound scores.

In [4]:
# Distribution of Comment Likes (Engagement)
plt.figure(figsize=(10, 5))
sns.histplot(df_processed['comment_likes'], bins=50, kde=True)
plt.title('Distribution of Comment Likes (Engagement)')
plt.xlabel('Comment Likes')
plt.ylabel('Count')
plt.show()

NameError: name 'df_processed' is not defined

<Figure size 1000x500 with 0 Axes>

In [5]:
# Preliminary VADER Compound Score Distribution
plt.figure(figsize=(10, 5))
sns.histplot(df_processed['vader_compound'], bins=30, kde=True, color='skyblue')
plt.title('Distribution of VADER Compound Scores')
plt.xlabel('VADER Compound Score (-1.0 to +1.0)')
plt.ylabel('Count')
plt.show()

# Final status update
print(f"Data pipeline complete. Processed data ready in data/processed/{PROCESSED_FILE_NAME}")


NameError: name 'df_processed' is not defined

<Figure size 1000x500 with 0 Axes>