# Social Media Sentiment and NFL Player Performance

**Objective:** Analyze sentiment in Reddit threads from the NFL Draft to explore its predictive power in evaluating future player performance. Use NLP and sentiment models to extract insights and assess their value for sports analytics and decision‑making.

**Key Files (edit paths below if yours differ):**
- `NFL_reddit_data_2021.csv` — Reddit comments / posts
- `combined_player_data.csv` — Player roster + stats/grades

---

### 1) Setup & Imports

In [None]:
# If you're running in a fresh environment, uncomment and run this once.
# %pip install -q pandas numpy scipy matplotlib scikit-learn vaderSentiment rapidfuzz python-dateutil

In [None]:
import os
import re
import json
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from dateutil import tz
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from rapidfuzz import process, fuzz

pd.set_option("display.max_colwidth", 200)

# ---- Paths (EDIT THESE if your filenames differ) ----
NFL_REDDIT_PATH = "NFL_reddit_data_2021.csv"
PLAYER_DATA_PATH = "combined_player_data.csv"

# Where to save derived artifacts
ARTIFACT_DIR = "artifacts"
os.makedirs(ARTIFACT_DIR, exist_ok=True)

### 2) Load the data

In [None]:
# --- Load Data ---
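def try_read_csv(path):
    """Read a CSV, falling back to a more permissive encoding if decoding fails."""
    last_err = None
    # utf-8-sig reads plain UTF-8 as well as files with a BOM; latin-1 accepts any byte sequence
    for enc in ["utf-8-sig", "latin-1"]:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as e:
            # Only encoding problems justify a retry; other errors (e.g. a missing file) propagate
            last_err = e
    raise last_err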

reddit_df = try_read_csv(NFL_REDDIT_PATH)
players_df = try_read_csv(PLAYER_DATA_PATH)

print("Reddit shape:", reddit_df.shape)
print("Players shape:", players_df.shape)
print("Reddit columns:", reddit_df.columns.tolist()[:25])
print("Players columns:", players_df.columns.tolist()[:25])

### 3) Preprocess Reddit text & timestamps

In [None]:
# --- Data Preprocessing & Cleaning ---
# Heuristically identify text and time columns
def find_text_column(df):
    candidates = ["body","text","comment","content","selftext","title"]
    for c in candidates:
        if c in df.columns: 
            return c
    # fallback: pick the first object/string column with average length > 20
    for c in df.select_dtypes("object").columns:
        if df[c].dropna().astype(str).map(len).mean() > 20:
            return c
    raise ValueError("Could not find a text column. Please rename one to 'body'.")

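def parse_datetime(s):
    """Parse a column to UTC datetimes.

    Mostly numeric columns are assumed to hold Unix epoch seconds (as Reddit's
    created_utc does); without unit="s" pandas would read them as nanoseconds.
    """
    num = pd.to_numeric(s, errors="coerce")
    if num.notna().mean() > 0.5:
        return pd.to_datetime(num, unit="s", errors="coerce", utc=True)
    return pd.to_datetime(s, errors="coerce", utc=True)

def find_datetime_column(df):
    candidates = ["created_utc","created_at","created","timestamp","date","time"]
    for c in candidates:
        if c in df.columns and parse_datetime(df[c]).notna().mean() > 0.2:
            return c
    # none found
    return None

TEXT_COL = find_text_column(reddit_df)
TIME_COL = find_datetime_column(reddit_df)

# Basic cleaning: drop duplicates, normalize text
reddit_df = reddit_df.drop_duplicates()
# fillna first so missing values become "" instead of the literal string "nan"
reddit_df[TEXT_COL] = reddit_df[TEXT_COL].fillna("").astype(str).str.strip()

# Remove empties and Reddit's placeholders for deleted/removed comments
reddit_df = reddit_df[~reddit_df[TEXT_COL].isin(["", "[deleted]", "[removed]"])].copy()

# Parse time if available
if TIME_COL:
    reddit_df["created_dt"] = parse_datetime(reddit_df[TIME_COL])
else:
    reddit_df["created_dt"] = pd.NaT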

# Simple text normalization helpful for downstream matching
url_pat = re.compile(r"https?://\S+|www\.\S+")
reddit_df["text_clean"] = (
    reddit_df[TEXT_COL]
    .str.replace(url_pat, " ", regex=True)
    .str.replace(r"[\n\r\t]", " ", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

reddit_df.to_csv(f"{ARTIFACT_DIR}/reddit_clean.csv", index=False)
reddit_df.head(3)

### 4) Prepare player roster and name lexicon

In [None]:
# --- Prepare Player Roster (canonical names) ---
# Try to infer a canonical full name column 'player' from the player file
name_cols = [c for c in players_df.columns if c.lower() in {"player","player_name","name","full_name"}]
if len(name_cols) == 0:
    # Try first/last name combination
    first = next((c for c in players_df.columns if c.lower() in {"first","first_name"}), None)
    last = next((c for c in players_df.columns if c.lower() in {"last","last_name","surname"}), None)
    if first and last:
        players_df["player"] = (players_df[first].astype(str).str.strip() + " " + players_df[last].astype(str).str.strip()).str.replace(r"\s+", " ", regex=True)
    else:
        raise ValueError("Couldn't infer player name column. Ensure one of ['player','player_name','name'] exists or both first/last name columns exist.")
else:
    players_df["player"] = players_df[name_cols[0]].astype(str).str.strip()

# Create helper columns for matching
players_df["player_lower"] = players_df["player"].str.lower()

# Build last-name and full-name lexicons
def last_name(full_name):
    toks = [t for t in str(full_name).split() if t.strip()]
    return toks[-1] if toks else ""

players_df["last_name"] = players_df["player"].map(last_name).str.lower()

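# Only use last names that are unique to avoid false positives.
# Count over distinct players so a player with several rows (e.g. one per season) still qualifies.
last_counts = players_df.drop_duplicates("player")["last_name"].value_counts()
unique_last_names = set(last_counts[last_counts == 1].index)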

# Canonical dictionary: variant -> canonical full name
name_map = {}

# Include full names always
for nm in players_df["player"].unique():
    name_map[nm.lower()] = nm

# Include unique last names
for _, r in players_df.iterrows():
    if r["last_name"] in unique_last_names and r["last_name"]:
        name_map[r["last_name"]] = r["player"]

# Build a compiled regex for fast matching
# Order names by length desc to prefer full names first
alternatives = sorted(list(name_map.keys()), key=lambda s: len(s), reverse=True)
# Escape regex metachars and join
alts_escaped = [re.escape(a) for a in alternatives if a and isinstance(a, str)]
if len(alts_escaped) == 0:
    raise ValueError("No names found to match against. Check your player file.")

pattern = re.compile(r"\b(" + "|".join(alts_escaped) + r")\b", flags=re.IGNORECASE)

len(name_map), list(name_map.items())[:5]

### 5) Match Reddit comments to players

In [None]:
# --- Link Reddit comments to players via name matching ---
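def find_matches(text):
    if not isinstance(text, str) or not text:
        return []
    hits = (name_map.get(m.group(0).lower(), m.group(0)) for m in pattern.finditer(text))
    # De-duplicate (keeping order) so a player named twice in one comment counts once
    return list(dict.fromkeys(hits))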

reddit_df["matched_players"] = reddit_df["text_clean"].map(find_matches)

# Explode to one row per (comment x player)
mentions = reddit_df.explode("matched_players").rename(columns={"matched_players":"player"})
mentions = mentions[mentions["player"].notna() & (mentions["player"].astype(str).str.len() > 0)].copy()

print("Mentions shape:", mentions.shape)
mentions.head(5)
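
To see how the lexicon-and-regex matching behaves, here is a small self-contained sketch on made-up data. The roster, the comment strings, and the helper names `build_matcher` and `find_players` are illustrative assumptions, not part of the project files; the logic mirrors the cells above (full names always, last names only when unique, longest alternatives first).

```python
import re

def build_matcher(players):
    """Return (compiled pattern, variant -> canonical name map) for a list of full names."""
    name_map = {p.lower(): p for p in players}
    last_names = [p.split()[-1].lower() for p in players]
    for p, ln in zip(players, last_names):
        if last_names.count(ln) == 1:  # skip surnames shared by several players
            name_map[ln] = p
    # Longest first so "trevor lawrence" is tried before the bare "lawrence"
    alts = sorted(name_map, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, alts)) + r")\b", re.IGNORECASE)
    return pattern, name_map

def find_players(text, pattern, name_map):
    """Canonical names mentioned in text, one entry per player, in order of appearance."""
    hits = (name_map[m.group(0).lower()] for m in pattern.finditer(text))
    return list(dict.fromkeys(hits))

# Hypothetical roster and comments, for illustration only
roster = ["Trevor Lawrence", "Mac Jones", "Zaven Collins", "Nico Collins"]
pattern, name_map = build_matcher(roster)

print(find_players("Trevor Lawrence is a lock. Lawrence goes first.", pattern, name_map))
print(find_players("Collins could slide, but jones fits the scheme", pattern, name_map))
```

The bare "Collins" is ignored because two roster entries share it. Note that case-insensitive last-name matching will also fire on ordinary words: a surname such as "Love" or "Fields" would match everyday uses of those words, so it may be worth excluding such names from the last-name lexicon.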

### 6) Sentiment analysis with VADER + aggregation by player

In [None]:
# --- Sentiment Analysis (VADER) ---
analyzer = SentimentIntensityAnalyzer()

def vader_compound(text):
    if not isinstance(text, str) or text.strip() == "":
        return np.nan
    return analyzer.polarity_scores(text)["compound"]

mentions["compound"] = mentions["text_clean"].map(vader_compound)

def label_from_compound(c):
    if pd.isna(c):
        return "unknown"
    if c >= 0.05:
        return "positive"
    if c <= -0.05:
        return "negative"
    return "neutral"

mentions["sentiment_label"] = mentions["compound"].map(label_from_compound)

# Aggregate by player
agg = (mentions
       .groupby("player")
       .agg(mentions=("compound","count"),
            mean_compound=("compound","mean"),
            median_compound=("compound","median"),
            # Named aggregation requires (column, function) tuples
            pos_rate=("compound", lambda x: (x >= 0.05).mean()),
            neg_rate=("compound", lambda x: (x <= -0.05).mean())
       )
       .reset_index()
      )

agg = agg.sort_values("mentions", ascending=False)
agg.to_csv(f"{ARTIFACT_DIR}/sentiment_by_player.csv", index=False)
agg.head(10)

### 7) Build a player performance score and join with sentiment

In [None]:
# --- Player Performance Scoring ---
# Try to auto-detect a single performance column; otherwise build a composite score.

def pick_performance_column(df):
    candidates = ["overall_grade","overall","pff_grade","rating","fantasy_points","approximate_value","av","war","epa"]
    lc = {c.lower(): c for c in df.columns}
    for cand in candidates:
        if cand in lc:
            return lc[cand]
    return None

# Standardize utility
def zscore(s):
    return (s - s.mean()) / (s.std(ddof=0) + 1e-9)

perf_col = pick_performance_column(players_df)

# Prepare a canonical 'player' column already present
perf_df = players_df.copy()

if perf_col is not None and pd.api.types.is_numeric_dtype(perf_df[perf_col]):
    perf_df["PerformanceScore"] = zscore(perf_df[perf_col])
    used_cols = [perf_col]
else:
    # Build a composite score from numeric columns that look like performance
    num_cols = perf_df.select_dtypes(include=[np.number]).columns.tolist()
    drop_keywords = ["id","year","season","age","height","weight","draft","round","pick"]
    negative_keywords = ["fumble","interception","int","penalty","penalties","drop","sack_allowed","sacks_allowed","missed_tackle"]

    def has_keyword(col, keywords):
        # Match whole name tokens (optionally plural) rather than substrings,
        # so "int" does not flag "fantasy_points" and "age" does not flag "percentage".
        name = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", str(col)).lower()  # PlayerID -> player_id
        return any(re.search(rf"(?<![a-z]){re.escape(k)}s?(?![a-z])", name) for k in keywords)

    perf_cols = [c for c in num_cols if not has_keyword(c, drop_keywords)]

    if len(perf_cols) == 0:
        raise ValueError("No numeric performance-like columns found. Please add/select a performance column.")

    Z = pd.DataFrame({c: zscore(perf_df[c].astype(float)) for c in perf_cols})
    # Flip the sign of "lower is better" stats before averaging
    signs = np.array([-1.0 if has_keyword(c, negative_keywords) else 1.0 for c in perf_cols])
    perf_df["PerformanceScore"] = (Z.values * signs).mean(axis=1)
    used_cols = perf_cols[:15]  # show a subset in the report to keep it readable

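# Keep only what we need; average over rows so each player appears once
# (the player file may hold several rows per player, e.g. one per season)
perf_df_small = perf_df.groupby("player", as_index=False)["PerformanceScore"].mean()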

# Join sentiment with performance
player_scores = perf_df_small.merge(agg, on="player", how="left")
player_scores.to_csv(f"{ARTIFACT_DIR}/player_scores.csv", index=False)
player_scores.head(10), perf_col, used_cols[:8]

### 8) Evaluation & Visualizations

In [None]:
# --- Evaluation & Visualizations ---
from scipy.stats import pearsonr, spearmanr

valid = player_scores.dropna(subset=["PerformanceScore","mean_compound"]).copy()

pearson_r = np.nan
spearman_r = np.nan
if len(valid) >= 3:
    pearson_r = pearsonr(valid["mean_compound"], valid["PerformanceScore"])[0]
    spearman_r = spearmanr(valid["mean_compound"], valid["PerformanceScore"]).correlation

print(f"Pearson r (sentiment vs performance): {pearson_r:.3f}" if not math.isnan(pearson_r) else "Pearson r: N/A")
print(f"Spearman r (rank correlation): {spearman_r:.3f}" if not math.isnan(spearman_r) else "Spearman r: N/A")

# Precision@k for "top performers" predicted by sentiment
def precision_at_k(df, k=10, sentiment_col="mean_compound"):
    df = df.dropna(subset=["PerformanceScore", sentiment_col]).copy()
    # With k or fewer players, both top-k sets contain everyone and the overlap is uninformative
    if k <= 0 or len(df) <= k:
        return np.nan
    top_sent = set(df.sort_values(sentiment_col, ascending=False).head(k)["player"])
    top_perf = set(df.sort_values("PerformanceScore", ascending=False).head(k)["player"])
    return len(top_sent & top_perf) / float(k)

for k in [5,10,20]:
    p_at_k = precision_at_k(valid, k=k, sentiment_col="mean_compound")
    print(f"precision@{k}: {p_at_k:.3f}" if not pd.isna(p_at_k) else f"precision@{k}: N/A")

# --- Plots ---
# 1) Sentiment vs Performance scatter
plt.figure(figsize=(8,6))
plt.scatter(valid["mean_compound"], valid["PerformanceScore"])
plt.axvline(0, linestyle="--", linewidth=1)
plt.title("Sentiment (mean compound) vs Player Performance")
plt.xlabel("Mean VADER Compound Sentiment")
plt.ylabel("PerformanceScore (standardized)")
plt.tight_layout()
plt.savefig(f"{ARTIFACT_DIR}/sentiment_vs_performance.png", dpi=160)
plt.show()

# 2) Top mentioned players bar chart
topN = 20
top_mentions = player_scores.sort_values("mentions", ascending=False).head(topN)
plt.figure(figsize=(10,6))
plt.barh(top_mentions["player"][::-1], top_mentions["mentions"][::-1])
plt.title(f"Top {topN} Players by Reddit Mentions")
plt.xlabel("Mentions")
plt.tight_layout()
plt.savefig(f"{ARTIFACT_DIR}/top_mentions.png", dpi=160)
plt.show()

# 3) Distribution of compound sentiment
plt.figure(figsize=(8,6))
plt.hist(mentions["compound"].dropna(), bins=50, alpha=0.8, density=False)
plt.title("Distribution of VADER Compound Scores for Player Mentions")
plt.xlabel("Compound")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig(f"{ARTIFACT_DIR}/compound_distribution.png", dpi=160)
plt.show()
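
The correlations printed above come without any measure of uncertainty, which matters when only a few dozen players have both scores. `pearsonr` already returns a parametric p-value as its second element; as an assumption-light alternative, the sketch below runs a permutation test in plain Python. The function names and the toy numbers are illustrative assumptions, not results from the data.

```python
import random

def pearson(x, y):
    """Plain-Python Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def permutation_pvalue(x, y, trials=5000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(trials):
        rng.shuffle(y_perm)  # break any real pairing between x and y
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one correction avoids reporting p = 0

# Toy inputs for illustration only
sentiment = [1, 2, 3, 4, 5, 6, 7, 8]
performance = [3, 1, 4, 1, 5, 9, 2, 6]

print("r =", round(pearson(sentiment, performance), 3))
print("permutation p =", round(permutation_pvalue(sentiment, performance), 3))
```

In the notebook, the same function could be called on `valid["mean_compound"].tolist()` and `valid["PerformanceScore"].tolist()`.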

### 9) Weekly sentiment trends for top‑mentioned players

In [None]:
# --- Sentiment Trend Over Time (Weekly) ---
if "created_dt" in mentions.columns and mentions["created_dt"].notna().any():
    # focus on players with enough volume
    vol = agg[agg["mentions"] >= max(5, agg["mentions"].quantile(0.75))]["player"].tolist()
    m2 = mentions[mentions["player"].isin(vol)].copy()
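    # Drop the timezone before converting to periods (Period values carry no tz and pandas would warn)
    m2["week"] = m2["created_dt"].dt.tz_localize(None).dt.to_period("W").dt.start_time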
    trend = m2.groupby(["player","week"])["compound"].mean().reset_index()

    # plot per top player
    for p in vol[:6]:  # cap to 6 plots to keep manageable
        sub = trend[trend["player"] == p]
        plt.figure(figsize=(8,4))
        plt.plot(sub["week"], sub["compound"], marker="o")
        plt.axhline(0, linestyle="--", linewidth=1)
        plt.title(f"Weekly Sentiment Trend — {p}")
        plt.xlabel("Week")
        plt.ylabel("Mean Compound")
        plt.xticks(rotation=45, ha="right")
        plt.tight_layout()
        fname = f"{ARTIFACT_DIR}/trend_{re.sub('[^a-zA-Z0-9]+','_',p)}.png"
        plt.savefig(fname, dpi=140)
        plt.show()

## Interpretation Guide (fill in with your observed results)

- **Correlation:** A positive Pearson/Spearman correlation indicates that players with more positive draft‑week sentiment tended to perform better on average. If your `r` is close to zero, sentiment may be weakly related to outcomes or confounded by hype/market size.
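- **precision@k:** The share of the top *k* players by sentiment who are also in the top *k* by performance. If the two rankings were unrelated, the expected value would be *k*/*N*, where *N* is the number of players with both scores (for example, 0.10 at k=10 with N=100). Only values clearly above that baseline suggest that sentiment has some predictive power.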
- **Trends:** Inspect weekly sentiment plots for star players. Spikes may align with news (injuries, depth chart changes, preseason buzz).

> **Caveats:** Name‑matching noise, position differences, and small sample sizes can dilute signal. Consider filtering by position or enriching with features like draft slot and team.
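
To make the precision@k baseline concrete, the sketch below compares the analytic value *k*/*N* with a Monte Carlo estimate for two unrelated rankings. The sizes (`n_players = 100`, `k = 10`) and the function names are illustrative assumptions; substitute the number of rows in `valid` when interpreting your own result.

```python
import random

def precision_at_k(scores_a, scores_b, k):
    """Share of the top-k items by scores_a that are also top-k by scores_b."""
    n = len(scores_a)
    top_a = set(sorted(range(n), key=lambda i: scores_a[i], reverse=True)[:k])
    top_b = set(sorted(range(n), key=lambda i: scores_b[i], reverse=True)[:k])
    return len(top_a & top_b) / k

def random_baseline(n, k, trials=2000, seed=0):
    """Monte Carlo estimate of precision@k when the two rankings are unrelated."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        a = [rng.random() for _ in range(n)]
        b = [rng.random() for _ in range(n)]
        total += precision_at_k(a, b, k)
    return total / trials

n_players, k = 100, 10  # illustrative sizes, not taken from the data
print("analytic baseline k/N:", k / n_players)
print("simulated baseline:   ", round(random_baseline(n_players, k), 3))
```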

## Suggestions for Real‑World Improvements
1. **Better entity linking:** Replace heuristic name matching with a trained NER model and team/position priors; track unique player IDs.
2. **De‑duplication & bot filtering:** Remove cross‑posts and obvious bots; use account age/karma thresholds.
3. **Richer sentiment:** Combine VADER with transformer sentence‑embedding regressors (fine‑tuned on sports sentiment).
4. **Position‑aware performance:** Compute position‑specific performance indices and evaluate per‑position precision@k.
5. **Causal checks:** Control for draft position (expected value) to see whether sentiment adds incremental lift.
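
One simple way to approach the draft-position check in item 5 is a partial correlation: regress both sentiment and performance on draft pick, then correlate the residuals. The sketch below does this in plain Python on made-up numbers; the variable names (`draft_pick`, `sentiment`, `performance`) and values are assumptions for illustration, and the result is a linear adjustment rather than a causal estimate.

```python
def pearson(x, y):
    """Plain-Python Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """Residuals of a least-squares fit of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

def partial_corr(sentiment, performance, control):
    """Correlation of sentiment and performance after removing the linear effect of control from both."""
    return pearson(residuals(sentiment, control), residuals(performance, control))

# Made-up numbers: earlier picks attract more positive buzz and also perform better
draft_pick  = [1, 5, 12, 20, 33, 48, 64, 90]
sentiment   = [0.60, 0.52, 0.41, 0.45, 0.20, 0.25, 0.05, 0.10]
performance = [1.8, 1.1, 0.9, 0.2, 0.1, -0.4, -0.9, -1.2]

print("raw r:    ", round(pearson(sentiment, performance), 3))
print("partial r:", round(partial_corr(sentiment, performance, draft_pick), 3))
```

If the partial correlation is much smaller than the raw one, most of the apparent signal in sentiment is already explained by where the player was drafted.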

## Conclusion (Write‑Up Template)

- **What we did:** Preprocessed Reddit data, scored sentiment with VADER, linked comments to players, and compared aggregated sentiment to actual performance.
- **What we found:** _(Summarize your correlation/precision values and 2–3 notable players.)_
- **So what:** _(State whether sentiment provides actionable signal for scouting/fantasy and any limitations.)_