# 수행목표
- 랜덤 포레스트 회귀 모델로 학습해서 평점을 예측하고 사용자별 평점과 비교한다.

# 수행단계
KMRD 데이터에 대해 랜덤 포레스트로 학습하고 정답이 있는 데이터를 예측해서 결과를 비교한다.

- 이전 문제에서 만들었던 영화-장르-배우 원 핫 인코딩 행렬처럼 KMRD에 대해서 사용자-영화-장르 정보로 원 핫 인코딩 행렬을 만든다.
    - 시간 영향을 없애기 위해 평점 데이터에서 time을 삭제한다.
    - 데이터가 많으면 시간이 많이 걸리므로 출연진은 빼고 처리한다. (추가해서 처리도 해보고 결과가 좋아지는지 확인)
- `sklearn.ensemble.RandomForestRegressor` 를 사용한다.
- 평점이 있는 정보에 대해 `predict()`로 평점을 예측하고 정답과 비교한다.
- 특징의 중요도를 확인하기 위해 `model.feature_importances_`를 활용한다.
- 다른 라이브러리를 사용해보려고 한다면 `xgboost`나 `lightgbm`을 사용해 본다. 기본적인 사용법은 거의 같다.

# Library 설치

In [1]:
# Library

import os
import sys
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

mpl.rcParams['font.family'] = 'AppleGothic'
mpl.rcParams['axes.unicode_minus'] = False

# DataLoader

In [2]:
class MovieDataLoader:
    def __init__(self, file_path):
        self.file_path = file_path
        movie_path = os.path.join(self.file_path, 'movies.txt')
        rate_path = os.path.join(self.file_path, 'rates.csv')
        gerne_path = os.path.join(self.file_path, 'genres.csv')
        casting_path = os.path.join(self.file_path, 'castings.csv')
        self.movies = pd.read_csv(movie_path, sep='\t')
        self.rates = pd.read_csv(rate_path)
        self.gernes = pd.read_csv(gerne_path)
        self.castings = pd.read_csv(casting_path)

    def load(self):
        self._preprocess()

        return self.movies, self.rates, self.gernes, self.castings

    def _preprocess(self):
        self.movies.dropna(subset=['title_eng'], inplace=True)

        if self.movies['year'].isnull().sum() > 0:
            non_year = self.movies[self.movies['year'].isnull()]
            for row in non_year.iterrows():
                movie = row[1]['movie']
                title_eng = row[1]['title_eng'].split(' , ')[:-1]
                title_eng = ' , '.join(title_eng)
                year = row[1]['title_eng'].split(' , ')[-1]
                self.movies.loc[self.movies['movie'] == movie, 'title_eng'] = title_eng
                self.movies.loc[self.movies['movie'] == movie, 'year'] = year

        if self.movies['grade'].isnull().sum() > 0:
            self.movies['grade'] = self.movies['grade'].fillna('NR grade')


In [3]:
movies_df, rates_df, genres_df, castings_df = MovieDataLoader('../data/kmrd/').load()

movies_df.info()
rates_df.info()
genres_df.info()
castings_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 991 entries, 0 to 998
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   movie      991 non-null    int64 
 1   title      991 non-null    object
 2   title_eng  991 non-null    object
 3   year       991 non-null    object
 4   grade      991 non-null    object
dtypes: int64(1), object(4)
memory usage: 46.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140710 entries, 0 to 140709
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   user    140710 non-null  int64
 1   movie   140710 non-null  int64
 2   rate    140710 non-null  int64
 3   time    140710 non-null  int64
dtypes: int64(4)
memory usage: 4.3 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2025 entries, 0 to 2024
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   movie   2025 non-null   int64 

  self.movies.loc[self.movies['movie'] == movie, 'year'] = year


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
import math

In [5]:
# remove time

rates_df.drop(columns=['time'], inplace=True)

In [6]:
df = pd.merge(rates_df, movies_df, on='movie')
df

Unnamed: 0,user,movie,rate,title,title_eng,year,grade
0,0,10003,7,빽 투 더 퓨쳐 2,"Back To The Future Part 2 , 1989",2015.0,12세 관람가
1,0,10004,7,빽 투 더 퓨쳐 3,"Back To The Future Part III , 1990",1990.0,전체 관람가
2,0,10018,9,이티,"The Extra-Terrestrial , E.T. , 1982",2011.0,전체 관람가
3,0,10021,9,록키,"Rocky , 1976",2017.0,12세 관람가
4,0,10022,7,록키 2,"Rocky II , 1979",1980.0,12세 관람가
...,...,...,...,...,...,...,...
139768,52023,10998,10,폭주 기관차,"Runaway Train , 1985",1989.0,15세 관람가
139769,52024,10998,10,폭주 기관차,"Runaway Train , 1985",1989.0,15세 관람가
139770,52025,10998,7,폭주 기관차,"Runaway Train , 1985",1989.0,15세 관람가
139771,52026,10998,9,폭주 기관차,"Runaway Train , 1985",1989.0,15세 관람가


In [7]:
df = pd.merge(df, genres_df, on='movie')
df

Unnamed: 0,user,movie,rate,title,title_eng,year,grade,genre
0,0,10003,7,빽 투 더 퓨쳐 2,"Back To The Future Part 2 , 1989",2015.0,12세 관람가,SF
1,0,10003,7,빽 투 더 퓨쳐 2,"Back To The Future Part 2 , 1989",2015.0,12세 관람가,코미디
2,0,10004,7,빽 투 더 퓨쳐 3,"Back To The Future Part III , 1990",1990.0,전체 관람가,서부
3,0,10004,7,빽 투 더 퓨쳐 3,"Back To The Future Part III , 1990",1990.0,전체 관람가,SF
4,0,10004,7,빽 투 더 퓨쳐 3,"Back To The Future Part III , 1990",1990.0,전체 관람가,판타지
...,...,...,...,...,...,...,...,...
349409,52026,10998,9,폭주 기관차,"Runaway Train , 1985",1989.0,15세 관람가,스릴러
349410,52027,10998,10,폭주 기관차,"Runaway Train , 1985",1989.0,15세 관람가,드라마
349411,52027,10998,10,폭주 기관차,"Runaway Train , 1985",1989.0,15세 관람가,액션
349412,52027,10998,10,폭주 기관차,"Runaway Train , 1985",1989.0,15세 관람가,모험


In [8]:
# year 통계를 내서 의미있는 구간으로 나눠 더미변수로 만들어보자

df['year'] = df['year'].astype(int)
df['year'].describe()


count    349414.000000
mean       2000.309533
std          15.710646
min        1946.000000
25%        1989.000000
50%        1997.000000
75%        2016.000000
max        2020.000000
Name: year, dtype: float64

In [9]:
def year_to_category(year):
    if year < 1980:
        return 0
    elif year < 1990:
        return 1
    elif year < 2010:
        return 2
    else:
        return 3

df['year'] = df['year'].apply(year_to_category)

In [10]:
df.head()

Unnamed: 0,user,movie,rate,title,title_eng,year,grade,genre
0,0,10003,7,빽 투 더 퓨쳐 2,"Back To The Future Part 2 , 1989",3,12세 관람가,SF
1,0,10003,7,빽 투 더 퓨쳐 2,"Back To The Future Part 2 , 1989",3,12세 관람가,코미디
2,0,10004,7,빽 투 더 퓨쳐 3,"Back To The Future Part III , 1990",2,전체 관람가,서부
3,0,10004,7,빽 투 더 퓨쳐 3,"Back To The Future Part III , 1990",2,전체 관람가,SF
4,0,10004,7,빽 투 더 퓨쳐 3,"Back To The Future Part III , 1990",2,전체 관람가,판타지


---

In [11]:
class Analyzer:
    def __init__(self, y_true, y_pred, y_pred_class):
        self.y_true = y_true
        self.y_pred = y_pred
        self.y_pred_class = y_pred_class

    def analyze_error(self):
        self._analyze_mae()
        self._analyze_mse()
        self._analyze_rmse()
        self._analyze_mape()

    def analyze_confusion_matrix(self):
        self._make_confusion_matrix()
        self._analyze_accuracy()
        self._analyze_precision()
        self._analyze_recall()
        self._analyze_f1()

    def _analyze_mae(self):
        self.mae = np.abs(self.y_true - self.y_pred).mean()
        print(f'MAE: {self.mae}')

    def _analyze_mse(self):
        self.mse = ((self.y_true - self.y_pred) ** 2).mean()
        print(f'MSE: {self.mse}')

    def _analyze_rmse(self):
        self.rmse = np.sqrt(self.mse)
        print(f'RMSE: {self.rmse}')

    def _analyze_mape(self):
        self.mape = (np.abs(self.y_true - self.y_pred) / self.y_true).mean()
        print(f'MAPE: {self.mape}')

    def _make_confusion_matrix(self):
        unique_labels = np.unique(np.concatenate((self.y_true, self.y_pred_class)))
        self.confusion_matrix = np.zeros((len(unique_labels), len(unique_labels)), dtype=int)

        label_to_index = {label: index for index, label in enumerate(unique_labels)}
        for true, pred in zip(self.y_true, self.y_pred_class):
            self.confusion_matrix[label_to_index[true]][label_to_index[pred]] += 1

        print('Confusion Matrix :')
        print(self.confusion_matrix)

    def _analyze_accuracy(self):
        self.accuracy = np.diag(self.confusion_matrix).sum() / self.confusion_matrix.sum()
        print(f'Accuracy: {self.accuracy}')

    def _analyze_precision(self):
        with np.errstate(divide='ignore', invalid='ignore'):
            try:
                self.precision = np.diag(self.confusion_matrix) / self.confusion_matrix.sum(axis=0)
            except:
                self.precision = np.nan
        print(f'Precision: {self.precision}')

    def _analyze_recall(self):
        with np.errstate(divide='ignore', invalid='ignore'):
            try:
                self.recall = np.diag(self.confusion_matrix) / self.confusion_matrix.sum(axis=1)
            except:
                self.recall = np.nan
        print(f'Recall: {self.recall}')

    def _analyze_f1(self):
        self.f1 = 2 * (self.precision * self.recall) / (self.precision + self.recall)
        print(f'F1: {self.f1}')


In [12]:
df.drop(columns=['title', 'title_eng'], inplace=True)

In [13]:
user_avg_rate = df.groupby('user')['rate'].mean().rename('user_mean')
user_rate_count = df.groupby('user')['rate'].count().rename('user_count')
df = pd.merge(df, user_avg_rate, on='user')
df = pd.merge(df, user_rate_count, on='user')
df

Unnamed: 0,user,movie,rate,year,grade,genre,user_mean,user_count
0,0,10003,7,3,12세 관람가,SF,7.440415,193
1,0,10003,7,3,12세 관람가,코미디,7.440415,193
2,0,10004,7,2,전체 관람가,서부,7.440415,193
3,0,10004,7,2,전체 관람가,SF,7.440415,193
4,0,10004,7,2,전체 관람가,판타지,7.440415,193
...,...,...,...,...,...,...,...,...
349409,52026,10998,9,1,15세 관람가,스릴러,9.000000,4
349410,52027,10998,10,1,15세 관람가,드라마,10.000000,4
349411,52027,10998,10,1,15세 관람가,액션,10.000000,4
349412,52027,10998,10,1,15세 관람가,모험,10.000000,4


In [14]:
df = pd.get_dummies(df, columns=['year', 'grade', 'genre'])
df

Unnamed: 0,user,movie,rate,user_mean,user_count,year_0,year_1,year_2,year_3,grade_12세 관람가,...,genre_범죄,genre_서부,genre_서사,genre_스릴러,genre_애니메이션,genre_액션,genre_에로,genre_전쟁,genre_코미디,genre_판타지
0,0,10003,7,7.440415,193,False,False,False,True,True,...,False,False,False,False,False,False,False,False,False,False
1,0,10003,7,7.440415,193,False,False,False,True,True,...,False,False,False,False,False,False,False,False,True,False
2,0,10004,7,7.440415,193,False,False,True,False,False,...,False,True,False,False,False,False,False,False,False,False
3,0,10004,7,7.440415,193,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,False
4,0,10004,7,7.440415,193,False,False,True,False,False,...,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
349409,52026,10998,9,9.000000,4,False,True,False,False,False,...,False,False,False,True,False,False,False,False,False,False
349410,52027,10998,10,10.000000,4,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
349411,52027,10998,10,10.000000,4,False,True,False,False,False,...,False,False,False,False,False,True,False,False,False,False
349412,52027,10998,10,10.000000,4,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [15]:
y = df['rate'].values
X = df.drop(columns=['user', 'rate']).values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

In [16]:
y_pred = model.predict(X_test)

In [17]:
feature_names = df.drop(columns=['rate']).columns
importances = model.feature_importances_

idx_sorted = np.argsort(importances)[::-1]  # 내림차순 인덱스
for i in range(10):
    idx = idx_sorted[i]
    print(f"{feature_names[idx]} : importance={importances[idx]:.4f}")


movie : importance=0.6794
user_mean : importance=0.1169
user : importance=0.1024
grade_12세 관람가 : importance=0.0080
grade_R : importance=0.0076
grade_전체 관람가 : importance=0.0075
year_3 : importance=0.0071
year_1 : importance=0.0062
year_2 : importance=0.0062
genre_다큐멘터리 : importance=0.0061


In [18]:
y_pred_class = np.round(y_pred)
analyzer = Analyzer(y_test, y_pred, y_pred_class)
analyzer.analyze_error()

MAE: 0.44224423053831163
MSE: 1.1035827733473706
RMSE: 1.0505154798228205
MAPE: 0.12498986751549335


In [19]:
analyzer.analyze_confusion_matrix()

Confusion Matrix :
[[ 1239   287   254   198   158   126    76    73   104    65]
 [    1   107    70    76    65    32    16    15    20     6]
 [    1     3   113    89    61    60    33    17    20     5]
 [    0     4    16   151   142   106    44    21    16    11]
 [    0     6     3    35   327   298   176    79    35    32]
 [    0     1    11    17    71   743   578   188    98    51]
 [    1     1     5    17    58   233  1704   889   260    87]
 [    0     2     7    12    26   105   530  3444  1199   276]
 [    0     0    12     3    26    56   174   918  4924   937]
 [    6    23    81    58    76   186   400  1129  5257 40111]]
Accuracy: 0.756450066539788
Precision: [0.99278846 0.24654378 0.19755245 0.23018293 0.32376238 0.38200514
 0.45671402 0.50848959 0.41263722 0.96464731]
Recall: [0.48023256 0.2622549  0.28109453 0.29549902 0.32996973 0.42263936
 0.5235023  0.6148902  0.69843972 0.84752889]
F1: [0.64733542 0.25415677 0.23203285 0.2587832  0.32683658 0.40129625
 0.487

---