In [11]:
import gzip
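def readCSV(path):
  # Naive reader: skips the header and splits on every comma, so quoted titles
  # that contain commas are split incorrectly. pandas handles the real load below.
  with gzip.open(path, 'rt') as f:
    f.readline()
    for l in f:
      yield l.strip().split(',')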

In [12]:
import pandas as pd
import numpy as np
import html
import re
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

df = pd.read_csv(
    "/Users/andrewchen/CSE158/Assignment_2/redditSubmissions.csv.gz",
    compression="gzip",
    on_bad_lines="skip",
    engine="python"
)

df

Unnamed: 0,#image_id,unixtime,rawtime,title,total_votes,reddit_id,number_of_upvotes,subreddit,number_of_downvotes,localtime,score,number_of_comments,username
0,0,1.333172e+09,2012-03-31T12:40:39.590113-07:00,And here's a downvote.,63470.0,rmqjs,32657.0,funny,30813.0,1.333198e+09,1844.0,622.0,Animates_Everything
1,0,1.333178e+09,2012-03-31T14:16:01.093638-07:00,Expectation,35.0,rmun4,29.0,GifSound,6.0,1.333203e+09,23.0,3.0,Gangsta_Raper
2,0,1.333200e+09,2012-03-31T20:18:33.192906-07:00,Downvote,41.0,rna86,32.0,GifSound,9.0,1.333225e+09,23.0,0.0,Gangsta_Raper
3,0,1.333252e+09,2012-04-01T10:52:10-07:00,Every time I downvote something,10.0,ro7e4,6.0,GifSound,4.0,1.333278e+09,2.0,0.0,Gangsta_Raper
4,0,1.333273e+09,2012-04-01T16:35:54.393381-07:00,Downvote &quot;Dies Irae&quot;,65.0,rooof,57.0,GifSound,8.0,1.333298e+09,49.0,0.0,Gangsta_Raper
...,...,...,...,...,...,...,...,...,...,...,...,...,...
132298,9998,1.344760e+09,2012-08-12T15:24:06-07:00,OM NOM NOM,34.0,y41wv,25.0,funny,9.0,1.344785e+09,16.0,0.0,vaggietales
132299,9998,1.345270e+09,2012-08-18T13:09:38-07:00,Don't feed the animals...,19.0,yfw66,14.0,funny,5.0,1.345295e+09,9.0,2.0,Deydria
132300,9998,1.345954e+09,2012-08-26T04:06:02+00:00,WTF worthy.,49.0,yu838,26.0,WTF,23.0,1.345954e+09,3.0,6.0,beatlesrock
132301,9998,1.346626e+09,2012-09-02T22:45:06+00:00,"Just a camel eating a kids head, welcome to th...",123.0,z91ah,65.0,WTF,58.0,1.346626e+09,7.0,12.0,v7o


In [13]:
# 21k rows of negative scores
df[df['score'] < 0].head()

Unnamed: 0,#image_id,unixtime,rawtime,title,total_votes,reddit_id,number_of_upvotes,subreddit,number_of_downvotes,localtime,score,number_of_comments,username
5,0,1333761000.0,2012-04-07T08:11:00-07:00,"Demolished, every time you downvote someone",40.0,rxwjg,17.0,gifs,23.0,1333786000.0,-6.0,3.0,Hellothereawesome
7,0,1339160000.0,2012-06-08T19:54:35.421944-07:00,getting that first downvote on a new post,13.0,usmxn,5.0,funny,8.0,1339185000.0,-3.0,0.0,
8,0,1339408000.0,2012-06-11T16:44:39.947798-07:00,How reddit seems to reacts whenever I share a ...,14.0,uwzrd,6.0,funny,8.0,1339433000.0,-2.0,0.0,
9,0,1339425000.0,2012-06-11T21:34:51.692933-07:00,Every LastAirBender post with a NSFW tag,20.0,uxf5q,9.0,pics,11.0,1339450000.0,-2.0,0.0,HadManySons
10,0,1340008000.0,2012-06-18T15:28:35.800140-07:00,How I felt when i forgot to put &quot;spoiler&...,21.0,v8vl7,10.0,gifs,11.0,1340033000.0,-1.0,0.0,TraumaticASH


## 1. Data Validation

Before building any models, we first validate that the core numeric fields in the dataset are internally consistent and that there are no obvious data quality issues.

**Checks performed:**

1. **Score consistency**
   We verify that the Reddit `score` field matches the definition:
      `score` = `number_of_upvotes` - `number_of_downvotes`

2. **Non-negative votes**

   We confirm that `number_of_upvotes` and `number_of_downvotes` are never negative.  

3. **Duplicated posts**

   Using `reddit_id` as the unique identifier of a submission, we check for duplicated `reddit_id`s.  

4. **Missing values**

   Finally, we count the missing values in each column:
      - The `username` column has 20,260 missing values, but it is not used in our model.
      - `#image_id` has none, and every other column has exactly 1 missing value, all on the same row.

In [14]:
# Score = upvotes - downvotes
score_diff = (df['number_of_upvotes'] - df['number_of_downvotes'] - df['score']).abs().sum()
print("Score consistency check (should be 0):", score_diff)

# Non-negative votes

neg_votes = ((df['number_of_upvotes'] < 0) | (df['number_of_downvotes'] < 0)).sum()
print("Number of rows with negative votes (should be 0):", neg_votes)

# Duplicated posts
dup_count = df['reddit_id'].duplicated().sum()
print("Number of duplicated reddit_id values:", dup_count)

# Missing Values
df_nan = df.isna().sum()
print(df_nan)



Score consistency check (should be 0): 0.0
Number of rows with negative votes (should be 0): 0
Number of duplicated reddit_id values: 93
#image_id                  0
unixtime                   1
rawtime                    1
title                      1
total_votes                1
reddit_id                  1
number_of_upvotes          1
subreddit                  1
number_of_downvotes        1
localtime                  1
score                      1
number_of_comments         1
username               20260
dtype: int64


## 2. Type Casting and Data Cleaning

After validating the core relationships between the voting fields, we standardize column types and handle missing or duplicated data.

**2.1 Type casting**

To avoid subtle bugs during modeling, we explicitly cast numeric columns to appropriate types:

- `unixtime` → numeric (for conversion to timestamps)
- `total_votes`, `number_of_upvotes`, `number_of_downvotes`, `score` → `float`


**2.2 Handling duplicates**

We treat each `reddit_id` as a unique Reddit submission. If the same `reddit_id` appears multiple times, it is most likely a duplicate introduced during data collection.

- Drop duplicate rows based on `reddit_id`, keeping the first occurrence of each post.

**2.3 Handling missing values**

After type casting, we re-check for missing values.

- Since `username` has 20,260 missing values and is not used in our model, we drop the column.
- Because only a very small number of rows contain missing data, we drop any remaining rows with a `NaN` via `dropna()`.

**2.4 Title Text Cleaning**

The raw `title` field contains HTML entities (e.g., `&quot;`) and inconsistent whitespace from the SNAP export. Since titles are used for TF-IDF features and sentiment analysis, we apply light normalization to improve text quality while preserving meaning.

Cleaning steps:
- Convert to string to avoid type issues  
- Unescape HTML entities (e.g., `&quot;` → `"`)  
- Strip leading/trailing whitespace  
- Collapse repeated spaces/newlines into a single space  
- Lowercase for consistent text modeling  


In [16]:
df['unixtime'] = pd.to_numeric(df['unixtime'], errors='coerce')

In [17]:
df.isna().sum()

#image_id                  0
unixtime                   1
rawtime                    1
title                      1
total_votes                1
reddit_id                  1
number_of_upvotes          1
subreddit                  1
number_of_downvotes        1
localtime                  1
score                      1
number_of_comments         1
username               20260
dtype: int64

In [8]:

# Casting 
df['unixtime'] = pd.to_numeric(df['unixtime'], errors='coerce')
df['total_votes'] = df['total_votes'].astype(float)
df['number_of_upvotes'] = df['number_of_upvotes'].astype(float)
df['number_of_downvotes'] = df['number_of_downvotes'].astype(float)
df['score'] = df['score'].astype(float)

# Duplicate Handling 
df = df.drop_duplicates(subset=['reddit_id'])
dup_reddit_ids = df['reddit_id'].duplicated().sum()
print("Number of duplicated reddit_id values:", dup_reddit_ids)

# Missing Values Handling
df = df.drop(columns=['username'])
df = df.dropna()
print(df.isna().sum())


def clean_title(text):
    # 1. Ensure string
    s = str(text)
    # 2. Unescape HTML entities: &quot; &amp; &lt; &gt; etc.
    s = html.unescape(s)
    # 3. Strip leading/trailing whitespace
    s = s.strip()
    # 4. Collapse multiple spaces/newlines into a single space
    s = re.sub(r'\s+', ' ', s)
    # 5. (Optional) lowercase for modeling
    s = s.lower()
    return s
# overwrite the raw title with its cleaned version
df['title'] = df['title'].apply(clean_title)

Number of duplicated reddit_id values: 0
#image_id              0
unixtime               0
rawtime                0
title                  0
total_votes            0
reddit_id              0
number_of_upvotes      0
subreddit              0
number_of_downvotes    0
localtime              0
score                  0
number_of_comments     0
dtype: int64
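
As a quick check of the cleaning steps listed in section 2.4, the sketch below applies the same four operations to one made-up raw title. `clean_title_demo` is a hypothetical stand-in that mirrors `clean_title`; it uses only the standard library, so it runs without the dataset.

```python
import html
import re

def clean_title_demo(text):
    """Illustrative mirror of the notebook's cleaning steps (hypothetical helper)."""
    s = html.unescape(str(text))   # &quot; becomes "
    s = s.strip()                  # trim leading/trailing whitespace
    s = re.sub(r"\s+", " ", s)     # collapse runs of spaces/newlines
    return s.lower()               # lowercase for text modeling

# Made-up raw title with an HTML entity and irregular whitespace
raw = "  Downvote   &quot;Dies Irae&quot;\n"
cleaned = clean_title_demo(raw)
print(repr(cleaned))
```

The result is the lowercase, unescaped form that later appears in the `title` column.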


## 3. Feature Engineering

With a clean base dataset, we create additional features that capture temporal patterns and simple properties of the post titles. These engineered features will be used as inputs to our predictive models.

**3.1 Time-based features**

The dataset provides `unixtime`, which is the submission time in seconds since the Unix epoch. We convert this into a pandas timestamp with `pd.to_datetime(..., unit='s')`, which treats the values as UTC and applies no local-time adjustment, and then derive several interpretable time features:

- `datetime`: full timestamp converted from `unixtime`
- `hour`: hour of day the post was submitted (0–23)
- `dayofweek`: day of week (0 = Monday, …, 6 = Sunday)
- `year`: calendar year

**3.2 Title-based features**

We also construct simple structural features from the post title:

- `title_length`: number of characters in the title
- `word_count`: number of whitespace-separated tokens in the title

**3.3 Sentiment-based features**

We also add sentiment scores computed from the cleaned title text with VADER's `SentimentIntensityAnalyzer`: negative, neutral, positive, and compound.

Together, the time-based, title-based, and sentiment features give our models additional signal beyond subreddit identity, while remaining interpretable and easy to reason about.
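
To make the time and title features concrete, here is a minimal standard-library sketch of the same derivations for a single post. The helper names `time_features` and `title_features` are hypothetical, and the timestamp is an assumed value chosen to be close to the first row's `unixtime`; the notebook itself uses pandas for this.

```python
from datetime import datetime, timezone

def time_features(unixtime):
    """Hour, day of week (0 = Monday) and year from seconds since the epoch, in UTC."""
    dt = datetime.fromtimestamp(unixtime, tz=timezone.utc)
    return {"hour": dt.hour, "dayofweek": dt.weekday(), "year": dt.year}

def title_features(title):
    """Character count and whitespace-separated token count of a title."""
    return {"title_length": len(title), "word_count": len(title.split())}

# Assumed example values, not read from the dataset
t_feats = time_features(1333172439)
s_feats = title_features("and here's a downvote.")
print(t_feats)
print(s_feats)
```

The timestamp falls on a Saturday in UTC, so `dayofweek` comes out as 5 under the Monday = 0 convention.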


In [58]:
df['datetime'] = pd.to_datetime(df['unixtime'], unit='s')
df['hour'] = df['datetime'].dt.hour
df['dayofweek'] = df['datetime'].dt.dayofweek
df['year'] = df['datetime'].dt.year
df['title'] = df['title'].astype(str).str.strip()
df['title_length'] = df['title'].str.len()
df['word_count'] = df['title'].str.split().apply(len)


sia = SentimentIntensityAnalyzer()

def add_title_sentiment(df):
    titles = df['title'].astype(str)

    scores = titles.apply(sia.polarity_scores)
    # build the score columns in one step, aligned to df's index
    scores_df = pd.DataFrame(scores.tolist(), index=df.index)

    df['title_sent_neg'] = scores_df['neg']
    df['title_sent_neu'] = scores_df['neu']
    df['title_sent_pos'] = scores_df['pos']
    df['title_sent_compound'] = scores_df['compound']
    return df
# add the four sentiment columns
df = add_title_sentiment(df)


print("\nFinal dtypes:")
print(df.dtypes)



Final dtypes:
#image_id                       int64
unixtime                      float64
rawtime                        object
title                          object
total_votes                   float64
reddit_id                      object
number_of_upvotes             float64
subreddit                      object
number_of_downvotes           float64
localtime                     float64
score                         float64
number_of_comments            float64
datetime               datetime64[ns]
hour                            int32
dayofweek                       int32
year                            int32
title_length                    int64
word_count                      int64
title_sent_neg                float64
title_sent_neu                float64
title_sent_pos                float64
title_sent_compound           float64
dtype: object


In [61]:
df.columns

Index(['#image_id', 'unixtime', 'rawtime', 'title', 'total_votes', 'reddit_id',
       'number_of_upvotes', 'subreddit', 'number_of_downvotes', 'localtime',
       'score', 'number_of_comments', 'datetime', 'hour', 'dayofweek', 'year',
       'title_length', 'word_count', 'title_sent_neg', 'title_sent_neu',
       'title_sent_pos', 'title_sent_compound'],
      dtype='object')

In [64]:
cols_to_use = [
    'unixtime', 'title', 'subreddit', 'score',
    'number_of_upvotes', 'number_of_downvotes', 'total_votes',
    'datetime', 'hour', 'dayofweek', 'year',
    'title_length', 'word_count', 'title_sent_neg', 'title_sent_neu', 'title_sent_pos', 'title_sent_compound'
]

df = df[cols_to_use]
df.head()


Unnamed: 0,unixtime,title,subreddit,score,number_of_upvotes,number_of_downvotes,total_votes,datetime,hour,dayofweek,year,title_length,word_count,title_sent_neg,title_sent_neu,title_sent_pos,title_sent_compound
0,1333172000.0,and here's a downvote.,funny,1844.0,32657.0,30813.0,63470.0,2012-03-31 05:40:39,5,5,2012,22,4,0.0,1.0,0.0,0.0
1,1333178000.0,expectation,GifSound,23.0,29.0,6.0,35.0,2012-03-31 07:16:01,7,5,2012,11,1,0.0,1.0,0.0,0.0
2,1333200000.0,downvote,GifSound,23.0,32.0,9.0,41.0,2012-03-31 13:18:33,13,5,2012,8,1,0.0,1.0,0.0,0.0
3,1333252000.0,every time i downvote something,GifSound,2.0,6.0,4.0,10.0,2012-04-01 03:52:10,3,6,2012,31,5,0.0,1.0,0.0,0.0
4,1333273000.0,"downvote ""dies irae""",GifSound,49.0,57.0,8.0,65.0,2012-04-01 09:35:54,9,6,2012,20,3,0.0,1.0,0.0,0.0


# MODELING 

In [154]:
# MODELING

# make sure to drop total_votes, number_of_upvotes, number_of_downvotes, number_of_comments
# (score is upvotes minus downvotes, so these would leak the target)
# let's drop all subreddits with < 20 posts

## Helper Functions

In [63]:
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# 1. Split helper (so you always use the same split)
def train_val_test_split(X, y, test_size=0.2, val_size=0.1, random_state=158):
    """
    Splits X, y into train/val/test.
    val_size is the fraction of the *original* data.
    """
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    val_frac_of_temp = val_size / (1 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=val_frac_of_temp, random_state=random_state
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# 2. Evaluation helper
def eval_regression(y_true, y_pred, name="Model"):
    """
    Evaluates a regression model using MAE (primary metric).
    """
    mae = mean_absolute_error(y_true, y_pred)
    print(f"{name}")
    print(f"  MAE : {mae:.4f}")
    return {"model": name, "mae": mae}


# 3. Convenience wrapper: fit model + evaluate on a given split
def fit_and_evaluate(model, X_train, y_train, X_val, y_val, name=None):
    """
    Fits the model on (X_train, y_train) and evaluates on (X_val, y_val).
    """
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    model_name = name or model.__class__.__name__
    scores = eval_regression(y_val, y_pred, name=model_name)
    return model, scores
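
Since `val_size` is a fraction of the original data while the second `train_test_split` call works on the remainder, it can help to check the arithmetic. The sketch below recomputes the resulting fractions for the default arguments; `helper_split_fractions` is a hypothetical name, and real row counts can differ by a row because of rounding.

```python
def helper_split_fractions(test_size=0.2, val_size=0.1):
    """Fractions of the original data produced by the two-stage split."""
    # Validation share relative to what is left after removing the test set
    val_frac_of_temp = val_size / (1 - test_size)
    train = (1 - test_size) * (1 - val_frac_of_temp)
    val = (1 - test_size) * val_frac_of_temp
    return train, val, test_size

train_frac, val_frac, test_frac = helper_split_fractions()
print(round(train_frac, 3), round(val_frac, 3), round(test_frac, 3))
```

With the defaults this works out to roughly 70% train, 10% validation, and 20% test.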


## Baseline 1

Predict Global Mean

MAE : 320.6778

## Baseline 2

Predict the subreddit mean; if the subreddit was not seen in training, fall back to the global mean

MAE : 307.3176

In [65]:
feature_cols = [
    'title', 'subreddit',
    'datetime', 'hour', 'dayofweek', 'year',
    'title_length', 'word_count'
]

X = df[feature_cols]
y = df['score']   # or df['score_log'] if you later transform it

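# Note: the helper defaults give a 70/10/20 split, while the model cells below
# split 80/10/10 by hand, so the validation rows here differ from theirs.
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(X, y)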

# ---- Baseline 1: Global mean predictor ----
global_mean = y_train.mean()
print(f"Global mean score (train): {global_mean:.4f}")

# constant prediction = global mean
y_val_pred_global = np.full_like(y_val, fill_value=global_mean, dtype=float)
eval_regression(y_val, y_val_pred_global, name="Baseline: Global Mean")


# ---- Baseline 2: Subreddit mean (with fallback to global mean) ----

# compute mean score per subreddit on TRAIN ONLY
train_df_for_means = pd.DataFrame({
    'subreddit': X_train['subreddit'].values,
    'score': y_train.values
})
subreddit_means = train_df_for_means.groupby('subreddit')['score'].mean()

def predict_subreddit_mean(X, subreddit_means, global_mean):
    """
    For each row in X, predict the mean score of its subreddit,
    falling back to the global mean if the subreddit was unseen in training.
    """
    return X['subreddit'].map(subreddit_means).fillna(global_mean).values

# predictions on validation set
y_val_pred_sub = predict_subreddit_mean(X_val, subreddit_means, global_mean)
eval_regression(y_val, y_val_pred_sub, name="Baseline: Subreddit Mean + Global Fallback")


Global mean score (train): 233.1191
Baseline: Global Mean
  MAE : 320.6778
Baseline: Subreddit Mean + Global Fallback
  MAE : 307.3176


{'model': 'Baseline: Subreddit Mean + Global Fallback',
 'mae': 307.31759431661175}
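
The fallback logic in Baseline 2 can be traced by hand on a toy example. The rows below are made up for illustration; the sketch uses plain Python rather than the pandas `map`/`fillna` calls in the notebook, but follows the same idea of computing means on training rows only.

```python
from collections import defaultdict

# Hypothetical (subreddit, score) rows; "gifs" never appears in train
train_rows = [("funny", 10.0), ("funny", 30.0), ("pics", 5.0)]
val_rows = [("funny", 25.0), ("gifs", 8.0)]

# Global mean over training scores
global_mean = sum(score for _, score in train_rows) / len(train_rows)

# Per-subreddit mean over training scores
scores_by_sub = defaultdict(list)
for sub, score in train_rows:
    scores_by_sub[sub].append(score)
sub_means = {sub: sum(v) / len(v) for sub, v in scores_by_sub.items()}

# Predict the subreddit mean, falling back to the global mean when unseen
preds = [sub_means.get(sub, global_mean) for sub, _ in val_rows]

# Mean absolute error on the toy validation rows
mae = sum(abs(p - y) for p, (_, y) in zip(preds, val_rows)) / len(val_rows)
print(preds, mae)
```

Here the unseen subreddit receives the global mean of 15.0, and the toy MAE is 6.0.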

## MODEL 1

Random Forest 

MAE : 329.3844

In [66]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# ---------------------------
# 1. Select the features Random Forest can use
# ---------------------------
rf_feature_cols = [
    'subreddit',
    'hour', 'dayofweek', 'year',
    'title_length', 'word_count',
    'title_sent_neg', 'title_sent_neu', 'title_sent_pos', 'title_sent_compound'
]

X_rf = df[rf_feature_cols]
y = df['score']   # or df['score_log']


# ---------------------------
# 2. Train/val/test split (80/10/10)
# ---------------------------
X_train, X_temp, y_train, y_temp = train_test_split(
    X_rf, y, test_size=0.20, random_state=158
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=158
)


# ---------------------------
# 3. Preprocess categorical + numeric
# ---------------------------
categorical_cols = ['subreddit']
numeric_cols = [
    'hour', 'dayofweek', 'year',
    'title_length', 'word_count',
    'title_sent_neg', 'title_sent_neu', 'title_sent_pos', 'title_sent_compound'
]

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        ('num', 'passthrough', numeric_cols)
    ]
)


# ---------------------------
# 4. Define Random Forest model
# ---------------------------
rf_model = RandomForestRegressor(
    n_estimators=300,
    max_depth=None,      # let the model grow
    min_samples_split=2,
    random_state=158,
    n_jobs=-1
)


# ---------------------------
# 5. Create the modeling pipeline
# ---------------------------
rf_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', rf_model)
])


# ---------------------------
# 6. Fit & Evaluate on validation set
# ---------------------------
rf_pipeline.fit(X_train, y_train)
y_val_pred = rf_pipeline.predict(X_val)

eval_regression(y_val, y_val_pred, name="Random Forest (Metadata + Sentiment)")


Random Forest (Metadata + Sentiment)
  MAE : 329.3844


{'model': 'Random Forest (Metadata + Sentiment)', 'mae': 329.38440545877967}

## MODEL 2 

Ridge (TF-IDF + Metadata + Sentiment)

MAE : 311.7044

In [67]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# ---------------------------
# 1. Define X and y
# ---------------------------

ridge_feature_cols = [
    'title', 'subreddit',
    'hour', 'dayofweek', 'year',
    'title_length', 'word_count',
    'title_sent_neg', 'title_sent_neu', 'title_sent_pos', 'title_sent_compound'
]

X = df[ridge_feature_cols]
y = df['score']          # or df['score_log'] if you transform


# ---------------------------
# 2. Train / val / test split (80/10/10)
# ---------------------------

X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.20, random_state=158
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=158
)


# ---------------------------
# 3. ColumnTransformer for preprocessing
# ---------------------------

text_col = 'title'
cat_cols = ['subreddit']
num_cols = [
    'hour', 'dayofweek', 'year',
    'title_length', 'word_count',
    'title_sent_neg', 'title_sent_neu', 'title_sent_pos', 'title_sent_compound'
]

preprocessor = ColumnTransformer(
    transformers=[
        # TF-IDF on title text
        ('text', TfidfVectorizer(
            max_features=20000,
            ngram_range=(1, 2),   # unigrams + bigrams
            min_df=5
        ), text_col),

        # One-hot encode subreddit
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),

        # Scale numeric + sentiment features
        ('num', StandardScaler(), num_cols),
    ]
)


# ---------------------------
# 4. Ridge regression model
# ---------------------------

ridge = Ridge(alpha=1.0, random_state=158)


# ---------------------------
# 5. Full pipeline: preprocess -> model
# ---------------------------

ridge_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', ridge),
])


# ---------------------------
# 6. Fit & evaluate on validation set
# ---------------------------

ridge_pipeline.fit(X_train, y_train)
y_val_pred = ridge_pipeline.predict(X_val)

eval_regression(y_val, y_val_pred, name="Ridge (TF-IDF + Metadata + Sentiment)")


Ridge (TF-IDF + Metadata + Sentiment)
  MAE : 311.7044


{'model': 'Ridge (TF-IDF + Metadata + Sentiment)', 'mae': 311.7044281224904}

## MODEL 3 

Ridge (TF-IDF, tuned alpha, log-shift target)

MAE : 253.7520


In [70]:
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# ------------- 1. Define X and log-transformed y (with shift) -------------

ridge_feature_cols = [
    'title', 'subreddit',
    'hour', 'dayofweek', 'year',
    'title_length', 'word_count',
    'title_sent_neg', 'title_sent_neu', 'title_sent_pos', 'title_sent_compound'
]

X = df[ridge_feature_cols]

# raw target
y_raw = df['score']

# handle negative scores by shifting before log
# (note: the minimum is taken over the full dataset, not only the training split)
min_score = y_raw.min()
print("Minimum score in data:", min_score)

# shift so that the minimum becomes 1, then log-transform
y_shifted = y_raw - min_score + 1
y_log = np.log(y_shifted)

# quick sanity check
print("Any NaNs in y_log?", np.isnan(y_log).sum())


# ------------- 2. Train / val / test split (keep raw + log aligned) -------------

X_train, X_temp, y_train_log, y_temp_log, y_train_raw, y_temp_raw = train_test_split(
    X, y_log, y_raw, test_size=0.20, random_state=158
)

X_val, X_test, y_val_log, y_test_log, y_val_raw, y_test_raw = train_test_split(
    X_temp, y_temp_log, y_temp_raw, test_size=0.50, random_state=158
)


# ------------- 3. Preprocessing (TF-IDF + OHE + scaling) -------------

text_col = 'title'
cat_cols = ['subreddit']
num_cols = [
    'hour', 'dayofweek', 'year',
    'title_length', 'word_count',
    'title_sent_neg', 'title_sent_neu', 'title_sent_pos', 'title_sent_compound'
]

preprocessor = ColumnTransformer(
    transformers=[
        ('text', TfidfVectorizer(
            max_features=20000,
            ngram_range=(1, 2),
            min_df=5
        ), text_col),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(), num_cols),
    ]
)


# ------------- 4. Base Ridge pipeline (we'll tune alpha) -------------

base_ridge = Ridge(random_state=158)

ridge_pipeline = Pipeline(steps=[
    ('preprocess', preprocessor),
    ('model', base_ridge),
])


# ------------- 5. Hyperparameter tuning for alpha (on log target) -------------

param_grid = {
    'model__alpha': [0.01, 0.1, 1, 3, 10, 30, 100]
}

grid = GridSearchCV(
    ridge_pipeline,
    param_grid=param_grid,
    scoring='neg_root_mean_squared_error',  # still fine to tune on RMSE in log space
    cv=3,
    n_jobs=-1,
    verbose=1
)

grid.fit(X_train, y_train_log)

print("Best params:", grid.best_params_)
print("Best CV RMSE (log space):", -grid.best_score_)


# ------------- 6. Evaluate best model on validation set (in RAW score space) -------------

best_ridge = grid.best_estimator_

# predict in log(shifted score) space
y_val_pred_log = best_ridge.predict(X_val)

# invert: shifted = exp(log_pred), then unshift back to original score scale
y_val_pred_shifted = np.exp(y_val_pred_log)
y_val_pred_raw = y_val_pred_shifted + min_score - 1

eval_regression(y_val_raw, y_val_pred_raw, name="Ridge (TF-IDF, tuned alpha, log-shift target)")


Minimum score in data: -264.0
Any NaNs in y_log? 0
Fitting 3 folds for each of 7 candidates, totalling 21 fits
Best params: {'model__alpha': 30}
Best CV RMSE (log space): 0.5831789902426164
Ridge (TF-IDF, tuned alpha, log-shift target)
  MAE : 253.7520


{'model': 'Ridge (TF-IDF, tuned alpha, log-shift target)',
 'mae': 253.75196347566893}