# Project 3


# Movie Genre Classification

Predict the genres of a movie from its plot.

<img src="moviegenre.png"
     style="float: left; margin-right: 10px;" />




https://www.kaggle.com/c/miia4201-202019-p3-moviegenreclassification/overview

### Data

Input:
- movie plot

Output:
- Probability of the movie belonging to each genre


### Evaluation

- 50% Report with all the details of the solution, the analysis and the conclusions. The report cannot exceed 10 pages, must be sent in PDF format and must be self-contained.
- 50% Performance in the Kaggle competition. Each group's grade is proportional to its rank in the competition: the first-place group receives 5 points, and 0.25 points are subtracted for each position below it (first place: 5 points, second place: 4.75 points, third place: 4.50 points ... eleventh place: 2.50 points, twelfth place: 2.25 points).
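
As a quick check of the ranking rule, the arithmetic can be sketched as a small function. The function name `competition_grade` is hypothetical, and flooring the grade at 0 is an assumption: the rule above only lists ranks 1 through 12.

```python
def competition_grade(rank: int) -> float:
    """Grade for the Kaggle half of the evaluation, given a 1-based rank."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    # First place earns 5 points; each position below it loses 0.25 points.
    # Assumption: the grade never goes below 0 (not stated in the rules).
    return max(0.0, 5.0 - 0.25 * (rank - 1))


for rank in (1, 2, 3, 11, 12):
    print(rank, competition_grade(rank))
```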


### Deadline
- The project must be carried out in the groups assigned.
- Use clear and rigorous procedures.
- The project is due on August 1st, 2021, at 11:59 pm, through Bloque Neón.
- No projects will be accepted after the deadline or through any channel other than the one established.




### Acknowledgements

We thank Professor Fabio Gonzalez, Ph.D. and his student John Arevalo for providing this dataset.

See https://arxiv.org/abs/1702.01992

## Sample Submission

In [38]:
import pandas as pd
import os
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split

In [3]:
dataTraining = pd.read_csv('https://github.com/albahnsen/AdvancedMethodsDataAnalysisClass/raw/master/datasets/dataTraining.zip', encoding='UTF-8', index_col=0)
dataTesting = pd.read_csv('https://github.com/albahnsen/AdvancedMethodsDataAnalysisClass/raw/master/datasets/dataTesting.zip', encoding='UTF-8', index_col=0)

In [4]:
dataTraining.head()

Unnamed: 0,year,title,plot,genres,rating
3107,2003,Most,most is the story of a single father who takes...,"['Short', 'Drama']",8.0
900,2008,How to Be a Serial Killer,a serial killer decides to teach the secrets o...,"['Comedy', 'Crime', 'Horror']",5.6
6724,1941,A Woman's Face,"in sweden , a female blackmailer with a disfi...","['Drama', 'Film-Noir', 'Thriller']",7.2
4704,1954,Executive Suite,"in a friday afternoon in new york , the presi...",['Drama'],7.4
2582,1990,Narrow Margin,"in los angeles , the editor of a publishing h...","['Action', 'Crime', 'Thriller']",6.6


In [5]:
dataTesting.head()

Unnamed: 0,year,title,plot
1,1999,Message in a Bottle,"who meets by fate , shall be sealed by fate ...."
4,1978,Midnight Express,"the true story of billy hayes , an american c..."
5,1996,Primal Fear,martin vail left the chicago da ' s office to ...
6,1950,Crisis,husband and wife americans dr . eugene and mr...
7,1959,The Tingler,the coroner and scientist dr . warren chapin ...


### Create count vectorizer


In [6]:
vect = CountVectorizer(max_features=1000)
X_dtm = vect.fit_transform(dataTraining['plot'])
X_dtm.shape

(7895, 1000)
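
To make the shape above concrete, here is a minimal sketch of what `CountVectorizer` produces on a toy corpus. The three sentences and the `toy_` names are made up for illustration; only the call pattern mirrors the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for dataTraining['plot'].
toy_plots = [
    "a serial killer decides to teach",
    "a single father takes his son to work",
    "the killer meets the father",
]

# Keep only the 5 most frequent tokens, as max_features=1000 does above.
toy_vect = CountVectorizer(max_features=5)
toy_dtm = toy_vect.fit_transform(toy_plots)

# One row per document, one column per retained token.
print(toy_dtm.shape)
# The retained vocabulary, in column order.
print(sorted(toy_vect.vocabulary_))
# Dense view of the token counts (fine for a toy matrix only).
print(toy_dtm.toarray())
```

The result is a sparse document-term matrix, so the real `X_dtm` stores only the non-zero counts of its 7895 x 1000 entries.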

In [7]:
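# Note: get_feature_names() was removed in scikit-learn 1.2; on scikit-learn >= 1.0 use vect.get_feature_names_out().
print(vect.get_feature_names()[:50])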

['able', 'about', 'accepts', 'accident', 'accidentally', 'across', 'act', 'action', 'actor', 'actress', 'actually', 'adam', 'adult', 'adventure', 'affair', 'after', 'again', 'against', 'age', 'agent', 'agents', 'ago', 'agrees', 'air', 'alan', 'alex', 'alice', 'alien', 'alive', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'america', 'american', 'among', 'an', 'and', 'angeles', 'ann', 'anna', 'another', 'any', 'anyone', 'anything', 'apartment']


### Create y

In [8]:
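import ast

# Parse each genre list from its string form; literal_eval only accepts Python literals, unlike eval.
dataTraining['genres'] = dataTraining['genres'].map(ast.literal_eval)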

le = MultiLabelBinarizer()
y_genres = le.fit_transform(dataTraining['genres'])

In [9]:
y_genres

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 1, 0, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0],
       [0, 1, 1, ..., 0, 0, 0]])
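
The binarized target is easier to read on a small example. The sketch below uses three genre strings copied from the `dataTraining.head()` output; the names `raw`, `parsed`, `mlb` and `y_toy` are hypothetical.

```python
import ast

from sklearn.preprocessing import MultiLabelBinarizer

# Genre lists as they appear in the CSV: strings, not Python lists.
raw = ["['Short', 'Drama']", "['Comedy', 'Crime', 'Horror']", "['Drama']"]
parsed = [ast.literal_eval(s) for s in raw]

mlb = MultiLabelBinarizer()
y_toy = mlb.fit_transform(parsed)

# Columns follow the sorted order of the labels seen during fit.
print(mlb.classes_)
# Each row has a 1 in the column of every genre the movie belongs to.
print(y_toy)
```

Each column of the matrix is one binary problem, which is what `OneVsRestClassifier` later fits one estimator per column for.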

In [53]:
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X_dtm, y_genres, test_size=0.33, random_state=42)

### Train multi-class multi-label model

In [11]:
clf = OneVsRestClassifier(RandomForestClassifier(n_jobs=-1, n_estimators=100, max_depth=10, random_state=42))

In [12]:
clf.fit(X_train, y_train_genres)

OneVsRestClassifier(estimator=RandomForestClassifier(max_depth=10, n_jobs=-1,
                                                     random_state=42))

In [14]:
y_pred_genres = clf.predict_proba(X_test)

In [15]:
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7812262183677007
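
With `average='macro'`, the score is the unweighted mean of the per-genre AUCs. A minimal sketch with made-up labels and scores (two genres, four movies) shows the equivalence:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical ground truth and predicted probabilities for two labels.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.3, 0.8], [0.6, 0.4], [0.1, 0.5]])

# AUC computed separately for each label (column).
per_label = [
    roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(y_true.shape[1])
]

# Macro average as computed by scikit-learn on the full matrix.
macro = roc_auc_score(y_true, y_score, average="macro")

print(per_label)
print(macro, np.mean(per_label))
```

Because every genre weighs the same, rare genres influence the macro score as much as frequent ones such as Drama.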

### Predict the testing dataset

In [16]:
X_test_dtm = vect.transform(dataTesting['plot'])

cols = ['p_Action', 'p_Adventure', 'p_Animation', 'p_Biography', 'p_Comedy', 'p_Crime', 'p_Documentary', 'p_Drama', 'p_Family',
        'p_Fantasy', 'p_Film-Noir', 'p_History', 'p_Horror', 'p_Music', 'p_Musical', 'p_Mystery', 'p_News', 'p_Romance',
        'p_Sci-Fi', 'p_Short', 'p_Sport', 'p_Thriller', 'p_War', 'p_Western']

y_pred_test_genres = clf.predict_proba(X_test_dtm)


In [17]:
res = pd.DataFrame(y_pred_test_genres, index=dataTesting.index, columns=cols)
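
The hard-coded `cols` list has to match the column order of `predict_proba`, which follows the binarizer's `classes_`. As a hedged alternative, the names could be derived from the fitted binarizer instead. The sketch below uses a toy binarizer and placeholder probabilities; all `toy_` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for the binarizer fitted on dataTraining['genres'].
toy_genres = [["Drama", "Short"], ["Action", "Sci-Fi"], ["Drama"]]
toy_le = MultiLabelBinarizer().fit(toy_genres)

# Build the submission column names in the same order as the label columns.
toy_cols = ["p_" + genre for genre in toy_le.classes_]

# Placeholder probabilities with one column per label.
toy_proba = np.zeros((2, len(toy_cols)))
toy_res = pd.DataFrame(toy_proba, index=[1, 4], columns=toy_cols)
print(toy_res.columns.tolist())
```

In the notebook this would amount to `cols = ['p_' + g for g in le.classes_]`, assuming the training data contains every genre expected in the submission file.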

In [18]:
res.head()

Unnamed: 0,p_Action,p_Adventure,p_Animation,p_Biography,p_Comedy,p_Crime,p_Documentary,p_Drama,p_Family,p_Fantasy,...,p_Musical,p_Mystery,p_News,p_Romance,p_Sci-Fi,p_Short,p_Sport,p_Thriller,p_War,p_Western
1,0.14303,0.10196,0.024454,0.029938,0.354552,0.13883,0.030787,0.49014,0.073159,0.101339,...,0.025069,0.063208,0.0,0.362818,0.056648,0.00897,0.017522,0.202605,0.033989,0.018117
4,0.122624,0.085786,0.024213,0.084795,0.370949,0.216657,0.080359,0.515684,0.062976,0.067019,...,0.024734,0.060935,0.000477,0.149703,0.05819,0.014248,0.020099,0.204794,0.030438,0.018506
5,0.151364,0.110284,0.013762,0.075334,0.304837,0.448736,0.02101,0.611544,0.081741,0.169121,...,0.044538,0.261372,0.0,0.335987,0.128505,0.001016,0.048658,0.423242,0.052693,0.025351
6,0.154448,0.125772,0.020991,0.064124,0.340779,0.140892,0.009133,0.632038,0.068287,0.063631,...,0.131074,0.088418,0.0,0.197224,0.132208,0.001432,0.039743,0.269385,0.077607,0.017862
7,0.175143,0.210069,0.035476,0.032505,0.31385,0.24315,0.021793,0.427885,0.079781,0.143879,...,0.023859,0.090359,4.8e-05,0.205117,0.241663,0.002634,0.018403,0.259465,0.021569,0.017585


In [19]:
res.to_csv('pred_genres_text_RF.csv', index_label='ID')

In [21]:
X_test.shape

(2606, 1000)

### First baseline model: LightGBM

In [20]:
from lightgbm import LGBMClassifier

In [35]:
clf = OneVsRestClassifier( LGBMClassifier( random_seed=42 ) )
clf.fit(X_train.astype('float64'), y_train_genres.astype('float64'))

OneVsRestClassifier(estimator=LGBMClassifier(random_seed=42))

In [45]:
y_pred_genres = clf.predict_proba(X_test.astype('float64'))
roc_auc_score(y_test_genres, y_pred_genres, average='macro')

0.7978773764704655

In [55]:
y_pred_train_genres = clf.predict_proba(X_train.astype('float64'))
roc_auc_score(y_train_genres, y_pred_train_genres, average='macro')

0.9959521880166783

# Use a pipeline to avoid data leakage

In [56]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

In [79]:
X = dataTraining.drop(columns=['genres','rating']).reset_index(drop=True)
y = dataTraining['genres']

In [80]:
X_train, X_test, y_train_genres, y_test_genres = train_test_split(X, y, test_size=0.2, random_state=42)

In [81]:
le = MultiLabelBinarizer()
y_train_genres_mlb = le.fit_transform(y_train_genres)
y_test_genres_mlb = le.transform(y_test_genres)

In [82]:
text_transformer = Pipeline(steps=[('vectorizer', CountVectorizer())])

preprocessor = ColumnTransformer(
    transformers=[('txt_title', text_transformer, 'title'),
                 ('txt_plot', text_transformer, 'plot')], remainder='passthrough', sparse_threshold = 0 )

clf = Pipeline([('preprocessor', preprocessor ),
               ('classifier', OneVsRestClassifier( LGBMClassifier( random_seed=42 )))])

In [84]:
clf.fit(X_train,y_train_genres_mlb)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(remainder='passthrough', sparse_threshold=0,
                                   transformers=[('txt_title',
                                                  Pipeline(steps=[('vectorizer',
                                                                   CountVectorizer())]),
                                                  'title'),
                                                 ('txt_plot',
                                                  Pipeline(steps=[('vectorizer',
                                                                   CountVectorizer())]),
                                                  'plot')])),
                ('classifier',
                 OneVsRestClassifier(estimator=LGBMClassifier(random_seed=42)))])

In [86]:
y_pred_genres = clf.predict_proba(X_test)
roc_auc_score(y_test_genres_mlb, y_pred_genres, average='macro')

0.8695510005598357
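
The point of the pipeline is that the vectorizer learns its vocabulary from the training split only. A minimal sketch on toy text illustrates this; `LogisticRegression` is swapped in here purely to keep the example light, and all texts and names are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training and held-out texts.
train_text = ["killer hunts victim", "family road trip", "killer returns", "family dinner"]
train_y = [1, 0, 1, 0]
test_text = ["spaceship killer"]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression()),
])

# fit() calls fit_transform on the vectorizer with the training text only.
pipe.fit(train_text, train_y)
vocab = pipe.named_steps["vectorizer"].vocabulary_

# Words that appear only in the held-out text never enter the vocabulary.
print("spaceship" in vocab)
# predict_proba() only calls transform on the vectorizer, never fit.
print(pipe.predict_proba(test_text).shape)
```

The same holds inside cross-validation: each fold refits the whole pipeline, so the vocabulary is rebuilt from that fold's training portion.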

# Hyperparameter tuning to improve the score

In [94]:
from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
import pickle

In [95]:
param_grid = {
    "classifier__estimator__num_leaves": [20, 31, 50, 100],
    "classifier__estimator__max_depth": [-1, 3, 6, 15, 20, 25],
    "classifier__estimator__learning_rate": [0.05, 0.1, 0.2, 0.3, 0.6],
    "classifier__estimator__n_estimators": [50, 100, 1000, 1500],
    "classifier__estimator__subsample_for_bin": [100000, 200000, 250000, 300000],
    "classifier__estimator__min_split_gain": [0.0, 0.3, 0.4, 0.1, 0.5, 0.7, 0.9],
    "classifier__estimator__min_child_samples": [10, 20, 30, 50],
    "classifier__estimator__subsample": [0.7, 0.8, 0.9, 1],
    "classifier__estimator__reg_alpha": [0.0, 0.01, 0.05, 0.1, 0.2, 0.3, 0.5],
    "classifier__estimator__reg_lambda": [0.0, 0.01, 0.03, 0.05, 0.1, 0.2, 0.3]
}

search = HalvingRandomSearchCV(clf,
                               param_grid,
                               n_candidates='exhaust',
                               factor=4,
                               scoring='roc_auc',
                               n_jobs=1,
                               random_state=0).fit(X_train, y_train_genres_mlb)

Traceback (most recent call last):
  File "/Users/jmartinez/opt/anaconda3/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 674, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/Users/jmartinez/opt/anaconda3/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 199, in __call__
    return self._score(partial(_cached_call, None), estimator, X, y_true,
  File "/Users/jmartinez/opt/anaconda3/lib/python3.8/site-packages/sklearn/metrics/_scorer.py", line 362, in _score
    return self._sign * self._score_func(y, y_pred, **self._kwargs)
  File "/Users/jmartinez/opt/anaconda3/lib/python3.8/site-packages/sklearn/utils/validation.py", line 63, in inner_f
    return f(*args, **kwargs)
  File "/Users/jmartinez/opt/anaconda3/lib/python3.8/site-packages/sklearn/metrics/_ranking.py", line 547, in roc_auc_score
    return _average_binary_score(partial(_binary_roc_auc_score,
  File "/Users/jmartinez/opt/anaconda3/lib/python3.8/site-packages/sklearn/metr




KeyboardInterrupt: 
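
The traceback above is truncated, so the actual error message is not visible. One possible cause, stated here as an assumption, is that a small resampled fold contains a genre with only positive or only negative examples, in which case `roc_auc_score` is undefined for that label. Under that assumption, a custom scorer could skip such labels. The sketch below uses the plain callable signature `scorer(estimator, X, y)` that scikit-learn accepts for `scoring`; the function name and the stub estimator are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def macro_auc_skip_constant(estimator, X, y_true):
    """Macro AUC over the labels that have both classes present in y_true."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(estimator.predict_proba(X))
    aucs = []
    for j in range(y_true.shape[1]):
        column = y_true[:, j]
        # AUC is undefined when a label is all 0s or all 1s in this fold.
        if column.min() == column.max():
            continue
        aucs.append(roc_auc_score(column, y_score[:, j]))
    # If no label is scorable, report NaN instead of raising.
    return float(np.mean(aucs)) if aucs else float("nan")


class FixedScores:
    """Stub estimator that returns precomputed scores (for the demo only)."""

    def __init__(self, scores):
        self.scores = scores

    def predict_proba(self, X):
        return self.scores


# Second label is constant (all zeros), so only the first label is scored.
demo_y = np.array([[1, 0], [0, 0], [1, 0], [0, 0]])
demo_scores = np.array([[0.9, 0.1], [0.2, 0.3], [0.7, 0.2], [0.4, 0.1]])
demo_result = macro_auc_skip_constant(FixedScores(demo_scores), None, demo_y)
print(demo_result)
```

It would be passed to the search as `scoring=macro_auc_skip_constant` in place of `scoring='roc_auc'`. Note that skipping labels makes scores from different folds less comparable, since each fold may average over a different set of genres.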

In [None]:
# Persist the search object so it can be reloaded later without rerunning the search.
with open("LGBMClassifierSearch.pkl", "wb") as f:
    pickle.dump(search, f)