# Introduction

Is it possible to infer a movie's category/genre, or a set of categories, from its synopsis/overview? In this notebook, I analyze over 4.5k movies, exploring their categories and overviews. Let's try to predict a movie's categories from text: its overview!

Dataset: [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv): Metadata on ~5,000 movies from TMDb.

# Loading the data

- Importing main libraries
- Listing the files of the dataset
- Loading data into a `pandas.DataFrame`

Importing the main libraries.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt # visualization

In [None]:
# from: https://www.kaggle.com/sohier/getting-imdb-kernels-working-with-tmdb-data/
import json

def loadMovies(path):
    df = pd.read_csv(path)
    df['release_date'] = pd.to_datetime(df['release_date']).apply(lambda x: x.date())
    json_columns = ['genres', 'keywords', 'production_countries', 'production_companies', 'spoken_languages']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df


def loadCredits(path):
    df = pd.read_csv(path)
    json_columns = ['cast', 'crew']
    for column in json_columns:
        df[column] = df[column].apply(json.loads)
    return df

Listing the files.

In [None]:
import os

# Listing files
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Loading the movies' data.

In [None]:
# Loading movies
dfMovies = loadMovies("../input/tmdb-movie-metadata/tmdb_5000_movies.csv")

# Filtering columns
columnsOfInterest = ['id', 'title', 'genres', 'overview']
dfMovies = dfMovies[columnsOfInterest]
# Removing movies with a null overview
dfMovies = dfMovies[dfMovies['overview'].notnull()].reset_index(drop=True)

# Print
display(dfMovies.head(3))

---
# Movies' categories

Which movie categories do we have in this dataset? How many movies per category do we have?

In [None]:
# Counting the number of movies per category
categoriesCount = {}

for index, row in dfMovies.iterrows():
    for category in row['genres']:
        catName = category['name']
        categoriesCount[catName] = categoriesCount.get(catName, 0) + 1
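
As a side note, the same counting can be sketched with `collections.Counter`. The snippet below is only an illustration on toy data: `toy_genres` is a hypothetical stand-in for the `genres` column, not part of the dataset.

```python
from collections import Counter

# Toy stand-in for dfMovies['genres']: one list of {'id', 'name'} dicts per movie.
toy_genres = [
    [{'id': 18, 'name': 'Drama'}, {'id': 35, 'name': 'Comedy'}],
    [{'id': 18, 'name': 'Drama'}],
    [],  # a movie without genres contributes nothing
]

# Count every genre name once per movie it is attached to.
genre_counts = Counter(
    genre['name'] for movie_genres in toy_genres for genre in movie_genres
)
print(genre_counts.most_common())  # [('Drama', 2), ('Comedy', 1)]
```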

In [None]:
print('number of categories:', len(categoriesCount.keys()))

As we can see, a few categories appear in a large number of movies (_e.g.,_ Drama), while others are rare (_e.g.,_ TV Movie). 
At this point, we have to decide how important these poorly represented categories are for our project.
For study purposes, I will keep all of the categories. However, note that this choice can hurt the predictions later on.

In [None]:
# Plotting
keys = categoriesCount.keys()
values = categoriesCount.values()

plt.bar(keys, values)
plt.xticks(rotation='vertical')
plt.show()

---
# NLP (Natural Language Processing)

Extract attributes from the movies' overview to create vectors of characteristics describing the movies and their categories.

## Data Pre-processing

The pre-processing performs the following steps:

- **Tokenization** - split the sentences into words/tokens. Lowercase the words and remove punctuation.
    - Words with 3 or fewer characters are removed.
    - All [stopwords](https://en.wikipedia.org/wiki/Stop_word) are removed.
- **Lemmatization** - words in third person are changed to first person, and verbs in past and future tenses are changed into present.
- **Stemming** - words are reduced to their root form.
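
To make the tokenization and filtering steps concrete, here is a minimal sketch in plain Python. It is not the gensim implementation: `TOY_STOPWORDS` and `toy_tokenize` are hypothetical names, the stopword list is deliberately tiny, and lemmatization and stemming are left out. The `min_len=4` default mirrors the `len(token) > 3` check used in this notebook's code.

```python
import re

# Hypothetical, deliberately tiny stopword list; gensim's STOPWORDS is far larger.
TOY_STOPWORDS = {'the', 'and', 'with', 'from', 'their', 'into'}


def toy_tokenize(text, min_len=4):
    """Lowercase, keep alphabetic runs only, drop short tokens and stopwords."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return [t for t in tokens if len(t) >= min_len and t not in TOY_STOPWORDS]


print(toy_tokenize("The toys escape from the house, and Woody leads them!"))
# ['toys', 'escape', 'house', 'woody', 'leads', 'them']
```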

In [None]:
import gensim # topic modeling toolkit
import nltk # natural language toolkit

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer

stemmer = SnowballStemmer('english')
nltk.download('wordnet') # WordNet data used by WordNetLemmatizer

Pre-processing code.

In [None]:
# Create the lemmatizer once instead of once per token
lemmatizer = WordNetLemmatizer()


# Lemmatization process (every token is treated as a verb)
def lemmatize(text):
    return lemmatizer.lemmatize(text, pos='v')


# Stemming process
def stemming(text):
    return stemmer.stem(text)


# Tokenization, filtering, lemmatization and stemming
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        # Drop stopwords and tokens with 3 or fewer characters
        if token not in STOPWORDS and len(token) > 3:
            lemmatizedToken = lemmatize(token)
            result.append(stemming(lemmatizedToken))
    return result

Testing the pre-processing.  
Selecting a sample movie (index 255).

In [None]:
movie = dfMovies.loc[255]
display(movie)

In [None]:
overview = movie['overview']
print(overview)

In [None]:
print(preprocess(overview))

Pre-process the overviews of all movies.

In [None]:
processedMovies = dfMovies['overview'].map(preprocess)
display(processedMovies)

## Bag of Words (BOW)

The [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) model is a representation used in natural language processing to transform a document into a numeric vector. In this model, a document is represented as the bag of its words, disregarding grammar and even word order but keeping multiplicity (how many times each word occurs).
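
A minimal sketch of the idea, using only the standard library (`toy_bow` is a hypothetical helper, not a gensim function): two documents with the same words in a different order get the same bag.

```python
from collections import Counter


def toy_bow(tokens):
    """Map each distinct token to the number of times it occurs."""
    return Counter(tokens)


doc_a = ['woody', 'save', 'toy', 'toy']
doc_b = ['toy', 'save', 'toy', 'woody']  # same words, different order

print(toy_bow(doc_a)['toy'])             # 2 (multiplicity is kept)
print(toy_bow(doc_a) == toy_bow(doc_b))  # True (word order is ignored)
```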

In [None]:
dictionary = gensim.corpora.Dictionary(processedMovies)

Filter out tokens that appear in:

- Fewer than 10 movies (absolute number); or
- More than 50% of the movies (fraction of total corpus size).

After that, keep only the 100,000 most frequent tokens.
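
The filtering rules above can be sketched in plain Python. This is an illustration of the logic as I understand it, not gensim's code: `toy_filter_extremes` is a hypothetical helper, and "most frequent" is assumed to mean the highest document frequency.

```python
from collections import Counter


def toy_filter_extremes(docs, no_below, no_above, keep_n):
    """Return the tokens that survive document-frequency filtering."""
    # Document frequency: in how many documents each token occurs at least once.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep the keep_n tokens with the highest document frequency (ties: alphabetical).
    kept.sort(key=lambda t: (-doc_freq[t], t))
    return kept[:keep_n]


docs = [
    ['toy', 'cowboy', 'friend'],
    ['toy', 'space', 'friend'],
    ['toy', 'war', 'soldier'],
    ['toy', 'war', 'love'],
]
# 'toy' is in 100% of the documents (too common); 'cowboy', 'space',
# 'soldier' and 'love' are in a single document (too rare).
print(toy_filter_extremes(docs, no_below=2, no_above=0.5, keep_n=10))
# ['friend', 'war']
```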

In [None]:
dictionary.filter_extremes(no_below=10, no_above=0.5, keep_n=100000)

Gensim `doc2bow`

- Creating the movies' attributes/characteristics.
    - Using the dictionary to convert each movie into a BOW vector, a list of `(token_id, count)` pairs.

In [None]:
bowCorpus = [dictionary.doc2bow(doc) for doc in processedMovies]

Preview a vector of characteristics.

In [None]:
display(bowCorpus[255])

## TF-IDF

[Term frequency–inverse document frequency](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (TF-IDF) is another way to represent the characteristics of the movies, similar to bag-of-words.
TF-IDF is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus: a word's weight grows with its count in the document and shrinks as more documents contain it.
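
As a worked illustration, the sketch below computes TF-IDF by hand on toy documents. It assumes the weighting `count * log2(N / document_frequency)` followed by normalisation to unit length, which to my understanding matches gensim's defaults; `toy_tfidf` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter


def toy_tfidf(docs):
    """Return one {token: weight} dict per document, normalised to unit length."""
    n_docs = len(docs)
    doc_freq = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term count times log2(N / document frequency).
        weights = {t: c * math.log2(n_docs / doc_freq[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Tokens with zero weight (present in every document) are dropped.
        vectors.append({t: w / norm for t, w in weights.items() if w > 0} if norm else {})
    return vectors


docs = [
    ['toy', 'cowboy', 'friend'],
    ['toy', 'space', 'friend'],
    ['toy', 'war', 'war'],
]
for vector in toy_tfidf(docs):
    print({token: round(weight, 3) for token, weight in vector.items()})
```

In this toy corpus, `toy` appears in every document and receives no weight, while `cowboy` (one document) outweighs `friend` (two documents).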

In [None]:
from gensim import corpora, models

tfidf = models.TfidfModel(bowCorpus)
tfidfCorpus = tfidf[bowCorpus]

Preview a vector of characteristics.

In [None]:
display(tfidfCorpus[255])

At this point, we have two ways to represent the movies' attributes: bag-of-words and TF-IDF.   
**What can we do?** Classify categories, identify common topics across the movies, and so on.

# 1. Infer Categories

Note that a movie can belong to a set of categories. Thus, for an unseen movie we have to identify $n$ categories, and we do not know $n$ in advance. 
For this reason, we use a [soft classification](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3233196/) approach: a classifier that estimates a probability for every class, from which a set of classes can be selected for a record.
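
One simple way to turn class probabilities into a set of categories is to keep every class above a cut-off. The sketch below assumes this thresholding rule; `pick_categories`, the `0.2` threshold and the probabilities are all hypothetical and would need tuning on real data.

```python
def pick_categories(probabilities, threshold=0.2):
    """Return the categories whose probability reaches the threshold, best first."""
    ranked = sorted(probabilities.items(), key=lambda item: -item[1])
    return [name for name, prob in ranked if prob >= threshold]


# Hypothetical class probabilities for one movie (they sum to 1).
probs = {'Drama': 0.45, 'Romance': 0.30, 'Comedy': 0.15, 'Horror': 0.10}
print(pick_categories(probs))  # ['Drama', 'Romance']
```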

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from gensim.matutils import corpus2dense

## Attributes

Creating a dense matrix, that is, transforming the words of each movie into a row of a matrix of attributes.

In [None]:
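# One column per dictionary token instead of a fixed 100,000 (avoids unused all-zero columns)
tfidfDense = corpus2dense(tfidfCorpus, num_terms=len(dictionary), num_docs=len(tfidfCorpus))
tfidfDense = tfidfDense.T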

In [None]:
print('movies, attributes:', tfidfDense.shape)

Getting the attributes and the category labels for the classifier:

- For each movie
    - For each category
        - Get its attributes and its category

In [None]:
denseMatrix, yCategory = [], []
for index, row in dfMovies.iterrows():
    for category in row['genres']:
        denseMatrix.append(tfidfDense[index])
        yCategory.append(category['name'])
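
The effect of this expansion is easier to see on toy data. In the sketch below, `features` and `genres` are hypothetical stand-ins for the dense TF-IDF rows and the `genres` column: a movie with two genres yields two training rows that share the same vector but carry different labels.

```python
# Toy stand-ins: one feature vector and one genre list per movie.
features = [[0.9, 0.1], [0.2, 0.8]]
genres = [['Drama', 'Romance'], ['Action']]

X, y = [], []
for vector, movie_genres in zip(features, genres):
    for genre in movie_genres:
        X.append(vector)  # the same vector is repeated for every genre
        y.append(genre)

print(len(X), y)  # 3 ['Drama', 'Romance', 'Action']
```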

Memory cleanup: reducing the numeric precision to `float16` and deleting unused variables.

In [None]:
denseMatrix = [tup.astype(np.float16) for tup in denseMatrix]

In [None]:
del tfidfDense
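
To give a rough idea of the trade-off, the sketch below converts a toy `float32` array to `float16`. The array is made up for illustration; the point is that memory halves while a small rounding error is introduced.

```python
import numpy as np

# Toy float32 data in [0, 1], similar in range to normalised TF-IDF weights.
dense32 = np.linspace(0.0, 1.0, 1000, dtype=np.float32)
dense16 = dense32.astype(np.float16)

print(dense32.nbytes, dense16.nbytes)  # 4000 2000

# Largest rounding error introduced by the lower precision.
max_error = np.abs(dense32 - dense16.astype(np.float32)).max()
print(max_error < 1e-3)  # True
```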

## Training

Split the data into train and test datasets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(denseMatrix, yCategory, test_size=0.1, random_state=43)

print('train size:', len(y_train))
print('test size :', len(y_test))

Let's train our classifier. 

In [None]:
def getClassesProbabilities(model, record):
    # predict_proba returns one row per record; take the row of our single record
    probs = model.predict_proba([record])[0]

    # One line per category, most probable first
    output = pd.DataFrame({'category': model.classes_, 'probability': probs})
    return output.sort_values(by='probability', ascending=False)

In [None]:
# Train model
clf = LogisticRegression(random_state=43, max_iter=150).fit(X_train, y_train)

In [None]:
from joblib import dump, load

# Save model
dump(clf, 'logisticRegression.model')

In [None]:
# Load trained model
clf = load('logisticRegression.model') 

## Detecting categories

Predict a category, or a set of categories, for a movie.

In [None]:
clf.predict([X_test[1]])

See the probability for other categories.

In [None]:
getClassesProbabilities(clf, X_test[1])

---
# 2. Topic Modeling

Topic modeling is a type of statistical modeling for discovering the abstract "topics" that occur in a set of documents. 
[Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) is a model that describes each document as a mixture of topics, so it can be used to assign a document to its most likely topics. 
A topic is represented by the set of its most representative words (common words) across a collection of documents.
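
To illustrate what "most representative words" means, the sketch below ranks the words of two made-up topics by probability. The numbers are hypothetical and not learned from the dataset, and they do not sum to 1 because the rest of the vocabulary is omitted; `top_words` is an illustrative helper.

```python
# Hypothetical word probabilities for two topics (not learned from the dataset).
topics = {
    0: {'love': 0.30, 'family': 0.25, 'war': 0.05, 'alien': 0.02},
    1: {'alien': 0.35, 'war': 0.30, 'love': 0.03, 'family': 0.01},
}


def top_words(word_probs, n):
    """Return the n most probable words of a topic, most probable first."""
    return sorted(word_probs, key=word_probs.get, reverse=True)[:n]


for topic_id, word_probs in topics.items():
    print(topic_id, top_words(word_probs, 2))
# 0 ['love', 'family']
# 1 ['alien', 'war']
```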

References:

- [A Beginner’s Guide to Latent Dirichlet Allocation (LDA)](https://towardsdatascience.com/latent-dirichlet-allocation-lda-9d1cd064ffa2)
- [Topic Modeling and Latent Dirichlet Allocation (LDA) in Python](https://towardsdatascience.com/topic-modeling-and-latent-dirichlet-allocation-in-python-9bf156893c24)

## Using bag-of-words (BOW)

In [None]:
ldaBow = gensim.models.LdaMulticore(bowCorpus, num_topics=10, id2word=dictionary, passes=2, workers=2)

See the 5 most representative words for each abstract topic.

In [None]:
for idx, topic in ldaBow.print_topics(num_words=5):
    print('Topic: {} Words: {}'.format(idx, topic))

## Using TF-IDF

In [None]:
ldaTfidf = gensim.models.LdaMulticore(tfidfCorpus, num_topics=10, id2word=dictionary, passes=2, workers=4)

See the 5 most representative words for each abstract topic.

In [None]:
for idx, topic in ldaTfidf.print_topics(num_words=5):
    print('Topic: {} Words: {}'.format(idx, topic))

### Predicting Topic using LDA TF-IDF model

Check which topic a movie belongs to.

- Using the sample movie at index 255.

In [None]:
for index, score in sorted(ldaTfidf[tfidfCorpus[255]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t Topic {}: {}".format(score, index, ldaTfidf.print_topic(index, 5)))

### Unseen movie using LDA BOW model

Check which topic a new movie belongs to.

- Topic identification for "Toy Story 4".

In [None]:
newMovie = "Woody attempts to make Forky, a toy, suffering from existential crisis, realise his importance in the life of Bonnie, their owner. However, things become difficult when Gabby Gabby enters their lives."
print(newMovie)

In [None]:
bowVector = dictionary.doc2bow(preprocess(newMovie))

for index, score in sorted(ldaBow[bowVector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic {}: {}".format(score, index, ldaBow.print_topic(index, 5)))