<a href="https://colab.research.google.com/github/yjyg1215/Project_DeepLearning/blob/main/genre_labeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

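# Automatic Movie Genre Labeling System

## Hypothesis: `A movie's genre can be predicted from its plot summary.`

### 1. Data Preprocessing

Dataset: Kaggle - Wikipedia Movie Plot
* Why this dataset
  * I considered collecting the data by crawling, but that would have required dynamic crawling, and there was not enough time for it.
  * I looked into using an API, but receiving the data would have taken several days.
  * In the end, I decided to look for the most suitable dataset on Kaggle.

  ***→ Few datasets include plot summaries. Among those that do, this one has the most rows, and its release years appear to run from 1901 to 2017, so it covers a wide range of films. That is why I chose it.***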



In [1]:
import pandas as pd

movie=pd.read_csv("/content/drive/MyDrive/wiki_movie_plots_deduped.csv")
movie

Unnamed: 0,Release Year,Title,Origin/Ethnicity,Director,Cast,Genre,Wiki Page,Plot
0,1901,Kansas Saloon Smashers,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Kansas_Saloon_Sm...,"A bartender is working at a saloon, serving dr..."
1,1901,Love by the Light of the Moon,American,Unknown,,unknown,https://en.wikipedia.org/wiki/Love_by_the_Ligh...,"The moon, painted with a smiling face hangs ov..."
2,1901,The Martyred Presidents,American,Unknown,,unknown,https://en.wikipedia.org/wiki/The_Martyred_Pre...,"The film, just over a minute long, is composed..."
3,1901,"Terrible Teddy, the Grizzly King",American,Unknown,,unknown,"https://en.wikipedia.org/wiki/Terrible_Teddy,_...",Lasting just 61 seconds and consisting of two ...
4,1902,Jack and the Beanstalk,American,"George S. Fleming, Edwin S. Porter",,unknown,https://en.wikipedia.org/wiki/Jack_and_the_Bea...,The earliest known adaptation of the classic f...
...,...,...,...,...,...,...,...,...
34881,2014,The Water Diviner,Turkish,Director: Russell Crowe,Director: Russell Crowe\r\nCast: Russell Crowe...,unknown,https://en.wikipedia.org/wiki/The_Water_Diviner,"The film begins in 1919, just after World War ..."
34882,2017,Çalgı Çengi İkimiz,Turkish,Selçuk Aydemir,"Ahmet Kural, Murat Cemcir",comedy,https://en.wikipedia.org/wiki/%C3%87alg%C4%B1_...,"Two musicians, Salih and Gürkan, described the..."
34883,2017,Olanlar Oldu,Turkish,Hakan Algül,"Ata Demirer, Tuvana Türkay, Ülkü Duru",comedy,https://en.wikipedia.org/wiki/Olanlar_Oldu,"Zafer, a sailor living with his mother Döndü i..."
34884,2017,Non-Transferable,Turkish,Brendan Bradley,"YouTubers Shanna Malcolm, Shira Lazar, Sara Fl...",romantic comedy,https://en.wikipedia.org/wiki/Non-Transferable...,The film centres around a young woman named Am...


* Only 'Genre' and 'Plot' are needed, so extract just those two columns.

In [2]:
df=movie[['Genre','Plot']]
df

Unnamed: 0,Genre,Plot
0,unknown,"A bartender is working at a saloon, serving dr..."
1,unknown,"The moon, painted with a smiling face hangs ov..."
2,unknown,"The film, just over a minute long, is composed..."
3,unknown,Lasting just 61 seconds and consisting of two ...
4,unknown,The earliest known adaptation of the classic f...
...,...,...
34881,unknown,"The film begins in 1919, just after World War ..."
34882,comedy,"Two musicians, Salih and Gürkan, described the..."
34883,comedy,"Zafer, a sailor living with his mother Döndü i..."
34884,romantic comedy,The film centres around a young woman named Am...


* Remove missing values (rows whose genre is recorded as 'unknown')

In [3]:
len(df[df['Genre']=='unknown'])

6083

In [4]:
drop_index=df[df['Genre']=='unknown'].index

In [5]:
df=df.drop(drop_index).reset_index(drop=True)
df

Unnamed: 0,Genre,Plot
0,western,The film opens with two bandits breaking into ...
1,comedy,The film is about a family who move to the sub...
2,short,The Rarebit Fiend gorges on Welsh rarebit at a...
3,short action/crime western,The film features a train traveling through th...
4,short film,Irish villager Kathleen is a tenant of Captain...
...,...,...
28798,drama film,"Zeynep lost her job at weaving factory, and he..."
28799,comedy,"Two musicians, Salih and Gürkan, described the..."
28800,comedy,"Zafer, a sailor living with his mother Döndü i..."
28801,romantic comedy,The film centres around a young woman named Am...
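
Missing genres in this dataset are stored as the literal string 'unknown' rather than as NaN, so `isna()` would not flag them. The following is a minimal sketch on a hypothetical toy frame (the `toy` data is made up for illustration) that shows this and applies the same kind of filter with a boolean mask:

```python
import pandas as pd

# Hypothetical toy data standing in for the real Genre/Plot columns.
toy = pd.DataFrame({
    "Genre": ["unknown", "comedy", "unknown", "drama"],
    "Plot": ["p0", "p1", "p2", "p3"],
})

# The placeholder is an ordinary string, so pandas does not treat it as missing.
n_nan = int(toy["Genre"].isna().sum())

# Keep only rows whose genre is a real label, then renumber the index.
toy_known = toy[toy["Genre"] != "unknown"].reset_index(drop=True)
```

Here `n_nan` is 0 even though two rows carry no usable label, which is why the filter has to compare against the string itself.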


* Remove duplicate rows

In [6]:
df.duplicated().sum()

151

In [7]:
df=df.drop_duplicates().reset_index(drop=True)
df

Unnamed: 0,Genre,Plot
0,western,The film opens with two bandits breaking into ...
1,comedy,The film is about a family who move to the sub...
2,short,The Rarebit Fiend gorges on Welsh rarebit at a...
3,short action/crime western,The film features a train traveling through th...
4,short film,Irish villager Kathleen is a tenant of Captain...
...,...,...
28647,drama film,"Zeynep lost her job at weaving factory, and he..."
28648,comedy,"Two musicians, Salih and Gürkan, described the..."
28649,comedy,"Zafer, a sailor living with his mother Döndü i..."
28650,romantic comedy,The film centres around a young woman named Am...
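
As a small illustration of what counts as a duplicate here, the sketch below uses a hypothetical toy frame. It assumes the default behavior of `duplicated()` and `drop_duplicates()`, which compare all columns and keep the first occurrence:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 match on both columns, row 2 shares only the plot.
toy = pd.DataFrame({
    "Genre": ["comedy", "comedy", "drama", "comedy"],
    "Plot": ["same plot", "same plot", "same plot", "other plot"],
})

# duplicated() marks every repeat after the first occurrence of a full-row match.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each full-row match.
toy_unique = toy.drop_duplicates().reset_index(drop=True)
```

Note that a plot repeated under a different genre is not treated as a duplicate, so such rows would remain in the data.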


#### Genre preprocessing

In [8]:
df['Genre'].nunique()

2264

In [9]:
pd.DataFrame(df['Genre'].unique())

Unnamed: 0,0
0,western
1,comedy
2,short
3,short action/crime western
4,short film
...,...
2259,sport film
2260,"animation, produced by glukoza production"
2261,"adventure, romance, fantasy film"
2262,ero


→ The cardinality is too high; the number of distinct genres needs to be reduced.

* Check the frequency of each genre

In [10]:
df['Genre'].value_counts().head(50)

drama               5926
comedy              4364
horror              1152
action              1088
thriller             964
romance              921
western              857
crime                565
adventure            521
musical              465
romantic comedy      461
crime drama          460
science fiction      414
film noir            341
mystery              310
war                  272
animation            264
comedy, drama        235
sci-fi               218
family               217
fantasy              203
animated             195
musical comedy       154
comedy-drama         137
biography            136
anime                109
suspense             104
comedy drama         103
romantic drama       102
animated short        91
drama, romance        86
social                81
historical            77
action thriller       73
documentary           73
serial                70
world war ii          70
family drama          66
war drama             65
drama, crime          64


In [11]:
df['Genre'].value_counts().tail(50)

summer camp comedy                                     1
crime drama based on a true story                      1
drama based on the novel by stephen vizinczey          1
august 25                                              1
animation-drama                                        1
urban thriller                                         1
adventure fantasy                                      1
comedy sci-fi                                          1
coming-of-age                                          1
period comedy                                          1
tragic comedy                                          1
family comedy-drama                                    1
superhero/action                                       1
music/drama/romance                                    1
fantasy/thriller                                       1
martial arts/action/thriller                           1
comedy/family                                          1
horror/adventure               

→ Replace crime drama with crime, romantic comedy with romance, sci-fi with science fiction, animated with animation, anime with animation, and "comedy, drama" with comedy.

In [12]:
# Merge near-duplicate genre labels into one canonical label.
# Assigning the result back avoids relying on an in-place update of a single column.
df["Genre"]=df["Genre"].replace({"crime drama":"crime","romantic comedy": "romance","sci-fi": "science fiction","animated":"animation","anime":"animation","comedy, drama":"comedy"})
df=df.reset_index(drop=True)
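
To make the effect of a dictionary-based `replace` concrete, here is a minimal sketch on a hypothetical toy Series. The `genres` values and the shortened `mapping` are illustrative only:

```python
import pandas as pd

# Hypothetical sample of raw genre labels.
genres = pd.Series(["sci-fi", "anime", "crime drama", "comedy-drama", "drama"])

# Each key is matched against the whole cell value, not against substrings.
mapping = {"sci-fi": "science fiction", "anime": "animation", "crime drama": "crime"}
merged = genres.replace(mapping)
```

Labels that are not keys of the mapping, such as "comedy-drama" in this toy example, pass through unchanged, so any spelling variant that should be merged needs its own entry.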

* Keep only genres that occur more than 200 times

In [13]:
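# Filter the counts directly; this does not depend on the column names reset_index() produces, which vary across pandas versions.
genre_counts = df["Genre"].value_counts()
list_200 = genre_counts[genre_counts > 200].index.tolist()
list_200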

['drama',
 'comedy',
 'romance',
 'horror',
 'action',
 'crime',
 'thriller',
 'western',
 'science fiction',
 'animation',
 'adventure',
 'musical',
 'film noir',
 'mystery',
 'war',
 'family',
 'fantasy']

In [14]:
df = df[df["Genre"].isin(list_200)].reset_index(drop=True)
df

Unnamed: 0,Genre,Plot
0,western,The film opens with two bandits breaking into ...
1,comedy,The film is about a family who move to the sub...
2,comedy,Before heading out to a baseball game at a nea...
3,comedy,The plot is that of a black woman going to the...
4,drama,On a beautiful summer day a father and mother ...
...,...,...
20517,drama,"Through the night, three cars carry a small gr..."
20518,drama,The film opens with a Senegalese boy named Kha...
20519,comedy,"Two musicians, Salih and Gürkan, described the..."
20520,comedy,"Zafer, a sailor living with his mother Döndü i..."


In [15]:
df['Genre'].nunique()

17

→ The resulting dataset has 20522 rows across 17 genres.

* Label encoding

In [16]:
from sklearn.preprocessing import LabelEncoder

label_encoder=LabelEncoder()
df['Genre_encoded']=label_encoder.fit_transform(df['Genre'].tolist())
df

Unnamed: 0,Genre,Plot,Genre_encoded
0,western,The film opens with two bandits breaking into ...,16
1,comedy,The film is about a family who move to the sub...,3
2,comedy,Before heading out to a baseball game at a nea...,3
3,comedy,The plot is that of a black woman going to the...,3
4,drama,On a beautiful summer day a father and mother ...,5
...,...,...,...
20517,drama,"Through the night, three cars carry a small gr...",5
20518,drama,The film opens with a Senegalese boy named Kha...,5
20519,comedy,"Two musicians, Salih and Gürkan, described the...",3
20520,comedy,"Zafer, a sailor living with his mother Döndü i...",3
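
`LabelEncoder` assigns integer codes in the sorted order of the class names, which is why comedy, drama, and western map to 3, 5, and 16 among the 17 genres above. A minimal sketch on a hypothetical three-genre list shows how to recover the code-to-genre mapping and decode predictions back to names:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the notebook fits the encoder on df['Genre'] instead.
genres = ["western", "comedy", "drama", "comedy"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genres)

# classes_ is sorted, and the position of a class in it is its integer code.
code_to_genre = dict(enumerate(encoder.classes_))

# inverse_transform turns integer codes back into genre names.
decoded = encoder.inverse_transform(codes)
```

Keeping the fitted encoder around is what lets model outputs be turned back into readable genre names later.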


#### Plot text cleaning

In [17]:
df['Plot'][0]

"The film opens with two bandits breaking into a railroad telegraph office, where they force the operator at gunpoint to have a train stopped and to transmit orders for the engineer to fill the locomotive's tender at the station's water tank. They then knock the operator out and tie him up. As the train stops it is boarded by the bandits\u200d—\u200cnow four. Two bandits enter an express car, kill a messenger and open a box of valuables with dynamite; the others kill the fireman and force the engineer to halt the train and disconnect the locomotive. The bandits then force the passengers off the train and rifle them for their belongings. One passenger tries to escape but is instantly shot down. Carrying their loot, the bandits escape in the locomotive, later stopping in a valley where their horses had been left.\r\nMeanwhile, back in the telegraph office, the bound operator awakens, but he collapses again. His daughter arrives bringing him his meal and cuts him free, and restores him to

* Lowercase the text and remove 's, \r\n, special characters, and stopwords

In [18]:
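import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
# Store the word list under its own name so the nltk `stopwords` module is not shadowed;
# a set makes the membership test below fast.
stop_words=set(stopwords.words('english'))

def clean_plot(plot):
  # Convert to lowercase
  plot=plot.lower()
  # Remove possessive 's
  plot=re.sub(r"'s",'',plot)
  # Replace \r\n line breaks with a space so that neighboring words are not fused together
  plot=re.sub(r"\r\n",' ',plot)
  # Remove parenthesized text, then any remaining special characters
  plot=re.sub(r"\(.*?\)",'',plot)
  plot=re.sub(r"[^\w\s]",'',plot)
  # Remove stopwords
  tokens=[w for w in plot.split() if w not in stop_words]
  cleaned_plot=(" ".join(tokens)).strip()

  return cleaned_plot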

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:
# Apply clean_plot to every plot; the column name 'cleand_plot' is kept as-is because later cells refer to it.
df['cleand_plot']=df['Plot'].apply(clean_plot)
df

Unnamed: 0,Genre,Plot,Genre_encoded,cleand_plot
0,western,The film opens with two bandits breaking into ...,16,film opens two bandits breaking railroad teleg...
1,comedy,The film is about a family who move to the sub...,3,film family move suburbs hoping quiet life thi...
2,comedy,Before heading out to a baseball game at a nea...,3,heading baseball game nearby ballpark sports f...
3,comedy,The plot is that of a black woman going to the...,3,plot black woman going dentist toothache given...
4,drama,On a beautiful summer day a father and mother ...,5,beautiful summer day father mother take daught...
...,...,...,...,...
20517,drama,"Through the night, three cars carry a small gr...",5,night three cars carry small group men police ...
20518,drama,The film opens with a Senegalese boy named Kha...,5,film opens senegalese boy named khadim told li...
20519,comedy,"Two musicians, Salih and Gürkan, described the...",3,two musicians salih gürkan described adventure...
20520,comedy,"Zafer, a sailor living with his mother Döndü i...",3,zafer sailor living mother döndü coastal villa...
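
To see what this kind of cleaning does to a single sentence, here is a simplified, self-contained sketch. The tiny `STOP_WORDS` set and the `sample` text are hypothetical stand-ins (the notebook uses the NLTK English stopword list), and this sketch replaces line breaks with a space so that neighboring words stay separate:

```python
import re

# Hypothetical, tiny stopword list so the example runs without downloading NLTK data.
STOP_WORDS = {"the", "a", "of", "and", "is", "to"}

def clean_text(text):
    text = text.lower()
    text = re.sub(r"'s", "", text)        # drop possessive 's
    text = re.sub(r"\r\n", " ", text)     # line breaks become spaces
    text = re.sub(r"\(.*?\)", "", text)   # drop parenthesized asides
    text = re.sub(r"[^\w\s]", "", text)   # drop remaining punctuation
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

sample = "The station's operator (bound) awakens.\r\nMeanwhile, the bandits escape!"
cleaned = clean_text(sample)
# cleaned == "station operator awakens meanwhile bandits escape"
```

The order of the steps matters: the parenthesized text has to be removed before the general punctuation pass, otherwise the parentheses are stripped first and their contents stay in the text.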


* Stemming

In [20]:
from nltk.stem.porter import PorterStemmer

stemmer=PorterStemmer()
df['stemmed_plot']=df['cleand_plot'].str.split().apply(lambda x: ' '.join([stemmer.stem(w) for w in x]))
df

Unnamed: 0,Genre,Plot,Genre_encoded,cleand_plot,stemmed_plot
0,western,The film opens with two bandits breaking into ...,16,film opens two bandits breaking railroad teleg...,film open two bandit break railroad telegraph ...
1,comedy,The film is about a family who move to the sub...,3,film family move suburbs hoping quiet life thi...,film famili move suburb hope quiet life thing ...
2,comedy,Before heading out to a baseball game at a nea...,3,heading baseball game nearby ballpark sports f...,head basebal game nearbi ballpark sport fan mr...
3,comedy,The plot is that of a black woman going to the...,3,plot black woman going dentist toothache given...,plot black woman go dentist toothach given lau...
4,drama,On a beautiful summer day a father and mother ...,5,beautiful summer day father mother take daught...,beauti summer day father mother take daughter ...
...,...,...,...,...,...
20517,drama,"Through the night, three cars carry a small gr...",5,night three cars carry small group men police ...,night three car carri small group men polic of...
20518,drama,The film opens with a Senegalese boy named Kha...,5,film opens senegalese boy named khadim told li...,film open senegales boy name khadim told littl...
20519,comedy,"Two musicians, Salih and Gürkan, described the...",3,two musicians salih gürkan described adventure...,two musician salih gürkan describ adventur cousin
20520,comedy,"Zafer, a sailor living with his mother Döndü i...",3,zafer sailor living mother döndü coastal villa...,zafer sailor live mother döndü coastal villag ...
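
The Porter stemmer strips suffixes by rule, so its output is not always a dictionary word. A minimal sketch using a few words taken from the first rows of the table above:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Words taken from the cleaned plots shown above.
words = ["opens", "bandits", "breaking", "family"]
stems = [stemmer.stem(w) for w in words]
# stems == ["open", "bandit", "break", "famili"]
```

Stems such as "famili" are expected: the goal is to map inflected forms of a word to one shared token, not to produce readable text.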


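### 2. Modeling

Proceed with BERT
* Why BERT
→ ***For plot summaries, I think understanding context matters most, and BERT's bidirectional training should work in its favor, so I chose it.***

Chance level (uniform random guess over 17 genres) = 1/17 ≈ 0.0588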
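
Because the genres are imbalanced, the accuracy of always predicting the most frequent genre is a useful second reference point next to the uniform chance level. The sketch below computes both for a hypothetical toy label list; `chance_and_majority` is an illustrative helper, not part of the notebook:

```python
from collections import Counter

def chance_and_majority(labels):
    """Return (uniform chance level, majority-class accuracy) for a label list."""
    counts = Counter(labels)
    chance = 1 / len(counts)                        # guess uniformly among the classes
    majority = max(counts.values()) / len(labels)   # always predict the top class
    return chance, majority

# Hypothetical imbalanced labels.
labels = ["drama"] * 6 + ["comedy"] * 3 + ["war"] * 1
chance, majority = chance_and_majority(labels)
```

With the counts shown earlier (drama has 5926 of the 20522 rows), the majority-class accuracy for this dataset would be roughly 0.29, which is a stricter bar than 0.0588.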

* train, val, test set split

In [38]:
from sklearn.model_selection import train_test_split

X_train=df['stemmed_plot'].to_list()
y_train=df['Genre_encoded'].to_list()

X_train,X_test,y_train,y_test=train_test_split(X_train,y_train,test_size=0.2,random_state=42,stratify=y_train)
X_train,X_val,y_train,y_val=train_test_split(X_train,y_train,test_size=0.2,random_state=42,stratify=y_train)
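
Applying `test_size=0.2` twice gives roughly a 64% / 16% / 20% train / validation / test split, not 60 / 20 / 20. The sketch below checks this on 100 hypothetical samples with two balanced classes:

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, two balanced classes.
X = list(range(100))
y = [i % 2 for i in X]

# First split: hold out 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Second split: hold out 20% of the remaining 80% as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42, stratify=y_rest)

sizes = (len(X_train), len(X_val), len(X_test))
# sizes == (64, 16, 20)
```

Passing `stratify` in both calls keeps the class proportions the same in all three subsets, which matters when some genres are much rarer than others.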

* Tokenization and padding

In [25]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


In [39]:
from transformers import BertTokenizer

tokenizer=BertTokenizer.from_pretrained('bert-base-cased')

X_train_encoded=tokenizer(X_train,truncation = True,padding=True,)
X_val_encoded=tokenizer(X_val,truncation = True,padding=True,)
X_test_encoded=tokenizer(X_test,truncation = True,padding=True,)
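
Conceptually, `truncation=True` cuts each sequence at the model's maximum length (512 tokens for `bert-base-cased`), and `padding=True` pads the shorter sequences up to the longest one. The pure-Python sketch below imitates that idea on lists of token ids. It is not the library's implementation: `pad_and_truncate` is a hypothetical helper, and it ignores the special tokens a real tokenizer adds.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each id list to max_length, then right-pad to the longest remaining length."""
    truncated = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (target - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Hypothetical token ids; max_length=4 stands in for the model limit.
ids, mask = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
# ids  == [[5, 6, 7, 8], [5, 6, 0, 0]]
# mask == [[1, 1, 1, 1], [1, 1, 0, 0]]
```

For long plot summaries this means everything past the length limit is discarded, so only the opening part of a long plot reaches the model.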


* Model training

In [40]:
import tensorflow as tf

X_train_dataset=tf.data.Dataset.from_tensor_slices((
    dict(X_train_encoded),
    y_train
))
X_val_dataset=tf.data.Dataset.from_tensor_slices((
    dict(X_val_encoded),
    y_val
))
X_test_datset=tf.data.Dataset.from_tensor_slices((
    dict(X_test_encoded),
    y_test
))

In [43]:
from transformers import TFBertForSequenceClassification

model=TFBertForSequenceClassification.from_pretrained('bert-base-cased',num_labels=17)

optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer,loss=model.hf_compute_loss,metrics=['accuracy'])

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from tensorflow.keras.callbacks import EarlyStopping

earlystopping=EarlyStopping(
    monitor="val_accuracy",
    min_delta=0.001,
    patience=2
)

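# The batch size is set on the tf.data pipeline via .batch(10), so fit() needs no batch_size argument.
# Validation data is only evaluated, so it does not need to be shuffled.
model.fit(X_train_dataset.shuffle(1000).batch(10),epochs=5,
          validation_data=X_val_dataset.batch(10),
          callbacks=[earlystopping]
)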

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
  68/1314 [>.............................] - ETA: 24:41 - loss: 1.0513 - accuracy: 0.6456
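
The early-stopping settings used here (`monitor="val_accuracy"`, `min_delta=0.001`, `patience=2`) can be read as: stop once validation accuracy has failed to improve by more than 0.001 for two epochs in a row. The sketch below is a simplified pure-Python model of that rule, not Keras's own code, and the accuracy values are hypothetical:

```python
def early_stop_epoch(val_accuracies, min_delta=0.001, patience=2):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracies):
        if acc - min_delta > best:
            # Improvement larger than min_delta: remember it and reset the counter.
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies per epoch.
stopped_at = early_stop_epoch([0.60, 0.65, 0.6505, 0.6508, 0.70])
# stopped_at == 3
```

In this toy run the gains after the second epoch are smaller than `min_delta`, so training stops before the later jump to 0.70 is ever seen; a larger `patience` trades extra training time for a chance to catch such late improvements.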