### Model Optimization: Hyperparameter Search for Random Forest

The goals of this section are:

	•	Improve on the Random Forest used in the baseline model

	•	Achieve a lower MAE and a higher R²

	•	Tune hyperparameters (the RandomizedSearchCV → GridSearchCV pair)

Columns we will use:

	•	Numeric columns (budget, revenue, popularity…)

	•	genre_count and keyword_count, added through feature engineering

	•	The target variable, kept separate from the features: vote_average

The JSON columns are left out.
We work only with the raw numeric columns and the numeric engineered columns.
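The notebook lists its feature columns by hand further down. As a rough alternative sketch, the same "numeric only, no JSON columns" rule could be expressed with `select_dtypes`. The toy frame below is a hypothetical stand-in for `movies_fe.csv`, with values copied from the first rows shown in this section only for illustration.

```python
import pandas as pd

# Hypothetical miniature of movies_fe.csv; not the real dataset.
toy = pd.DataFrame({
    "budget": [237000000, 300000000],
    "genres": ["[{'id': 28, 'name': 'Action'}]", "[{'id': 12, 'name': 'Adventure'}]"],
    "popularity": [150.437577, 139.082615],
    "genre_count": [4, 3],
    "vote_average": [7.2, 6.9],
})

# Keep numeric dtypes only; JSON-like string columns such as "genres" drop out.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# The target must not end up in the feature matrix.
feature_cols = [c for c in numeric_cols if c != "vote_average"]
print(feature_cols)
```

An explicit list, as used in this notebook, is easier to audit; the dtype filter is mainly useful for checking that no string column slipped in.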

In [11]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("/content/data/processed/movies_fe.csv")

df.head()

Unnamed: 0,budget,genres,id,keywords,popularity,revenue,runtime,title,vote_average,vote_count,...,genre_count,keyword_count,budget_log,revenue_log,popularity_log,vote_count_log,runtime_bin,movie_age,budget_per_minute,popularity_per_vote
0,237000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",19995,"[{'id': 1463, 'name': 'culture clash'}, {'id':...",150.437577,2787965087,162.0,Avatar,7.2,11800,...,4,21,19.283571,21.748578,5.020174,9.37594,very_long,16.0,1462963.0,0.012748
1,300000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",285,"[{'id': 270, 'name': 'ocean'}, {'id': 726, 'na...",139.082615,961000000,169.0,Pirates of the Caribbean: At World's End,6.9,4500,...,3,16,19.519293,20.683485,4.942232,8.412055,very_long,18.0,1775148.0,0.0309
2,245000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",206647,"[{'id': 470, 'name': 'spy'}, {'id': 818, 'name...",107.376788,880674609,148.0,Spectre,6.3,4466,...,3,7,19.316769,20.596199,4.685614,8.404472,long,10.0,1655405.0,0.024038
3,250000000,"[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",49026,"[{'id': 849, 'name': 'dc comics'}, {'id': 853,...",112.31295,1084939099,165.0,The Dark Knight Rises,7.6,9106,...,4,21,19.336971,20.80479,4.730153,9.116799,very_long,13.0,1515152.0,0.012333
4,260000000,"[{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...",49529,"[{'id': 818, 'name': 'based on novel'}, {'id':...",43.926995,284139100,132.0,John Carter,6.1,2124,...,3,16,19.376192,19.464974,3.805039,7.661527,long,13.0,1969697.0,0.020672


## Separating features and target

In [12]:
features = [
    "budget",
    "revenue",
    "runtime",
    "popularity",
    "vote_count",
    "genre_count",
    "keyword_count",
    "release_year"
]

X = df[features]
y = df["vote_average"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## Re-running the baseline model: the reference for optimization

We re-run the baseline model so that the result obtained after optimization can be compared against it.

In [13]:
model_baseline = RandomForestRegressor(random_state=42)
model_baseline.fit(X_train, y_train)

preds = model_baseline.predict(X_test)

mae_base = mean_absolute_error(y_test, preds)
r2_base = r2_score(y_test, preds)

mae_base, r2_base

(0.5396618106139438, 0.6263328311689711)

## RandomizedSearchCV: Fast Optimization


In [14]:
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

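param_dist = {
    "n_estimators": [100, 200, 300, 400, 500],
    "max_depth": [None, 10, 20, 30, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    # "auto" was removed in scikit-learn 1.3; for regressors it meant
    # "use all features", which is spelled 1.0 in current versions.
    "max_features": [1.0, "sqrt", "log2"]
}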

rf = RandomForestRegressor(random_state=42)

random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=25,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train)
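Beyond the single best setting, the fitted search object keeps the score of every sampled candidate in `cv_results_`. The sketch below shows one way to rank them by cross-validated MAE. It assumes synthetic data from `make_regression` and a deliberately small search space so that it runs quickly on its own; `X_demo`, `y_demo` and `search` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train / y_train (assumption, not the movie data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": [20, 40],
        "max_depth": [None, 5],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=4,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate; the scorer is negated MAE, so flip the sign.
results = pd.DataFrame(search.cv_results_)
results["mean_mae"] = -results["mean_test_score"]

ranked = results.sort_values("rank_test_score")
print(ranked[["params", "mean_mae", "std_test_score"]])

# Cross-validated MAE of the winning candidate.
best_cv_mae = -search.best_score_
```

Looking at the spread between the top few candidates gives a sense of whether the winner is clearly better or only marginally ahead of its neighbours.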

## Best parameters

In [15]:
random_search.best_params_

{'n_estimators': 400,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_features': 'log2',
 'max_depth': 40}
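The goals of this section name a RandomizedSearchCV → GridSearchCV pair, but only the randomized stage is run here. A second stage could, for instance, search a narrow grid around the values reported above. The following is a minimal sketch under stated assumptions: the data is synthetic, the grid values are hypothetical choices centred on the reported winner, and `n_estimators` is kept small purely so the demo finishes quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption, not the movie data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

# Hypothetical narrow grid around the randomized-search result
# (max_depth=40, min_samples_split=5, max_features="log2").
param_grid = {
    "n_estimators": [50],
    "max_depth": [30, 40, None],
    "min_samples_split": [4, 5, 6],
    "max_features": ["log2"],
}

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)
grid_search.fit(X_demo, y_demo)

print(grid_search.best_params_)
print(-grid_search.best_score_)  # cross-validated MAE of the best grid point
```

Unlike the randomized search, the grid evaluates every combination (here 3 × 3 = 9 candidates), so it is best reserved for a small region that the first stage has already identified.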

In [16]:
best_params = random_search.best_params_

model_opt = RandomForestRegressor(
    **best_params,
    random_state=42
)

model_opt.fit(X_train, y_train)

preds_opt = model_opt.predict(X_test)

mae_opt = mean_absolute_error(y_test, preds_opt)
r2_opt = r2_score(y_test, preds_opt)

mae_opt, r2_opt

(0.5347413050756474, 0.6415031099906201)
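To put the two result tuples side by side, the short sketch below recomputes the differences from the numbers printed above. The variable names mirror the notebook's; the values are copied from its outputs rather than recomputed from the data.

```python
# Metrics copied from the baseline and optimized outputs above.
mae_base, r2_base = 0.5396618106139438, 0.6263328311689711
mae_opt, r2_opt = 0.5347413050756474, 0.6415031099906201

mae_drop = mae_base - mae_opt            # absolute reduction in MAE
mae_drop_pct = 100 * mae_drop / mae_base  # reduction relative to the baseline
r2_gain = r2_opt - r2_base               # absolute increase in R²

print(f"MAE: {mae_base:.4f} -> {mae_opt:.4f} ({mae_drop_pct:.2f}% lower)")
print(f"R2:  {r2_base:.4f} -> {r2_opt:.4f} (+{r2_gain:.4f})")
```

The gain is modest: MAE falls by about 0.005 (roughly 0.9%) and R² rises by about 0.015. Both figures come from a single train/test split, so part of the difference may be split-dependent.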