***This notebook is based on the "Feature Engineering for NLP in Python for ML" course at DataCamp***

The purpose of this code is to recommend 10 films similar to a chosen film, based only on an analysis of the plots available in the "wikipedia-movie-plots" dataset.

Steps for implementation:
1. Import the dataset;

2. Check that every movie has plot information;

3. Apply `nlp()` to the Plot column to split each plot into tokens;

4. Pre-process each token;

5. Calculate the importance of each word relative to the whole corpus with TF-IDF;

6. Calculate the cosine similarity between each pair of plots;

7. Create a function that lists the films whose cosine similarity to the chosen film is highest.


For processing reasons, it was necessary to cut the chosen dataset in half.

**Import libraries**

In [None]:
import numpy as np
import pandas as pd
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.metrics.pairwise import cosine_similarity 

**Import the dataset**

In [None]:
md = pd.read_csv('../input/wikipedia-movie-plots/wiki_movie_plots_deduped.csv')

**Check what the dataset contains**

In [None]:
md.head()

In [None]:
md.describe()

Select the 'Plot' column

In [None]:
md_plot = md['Plot']

In [None]:
md_plot.head()

Check whether there are any missing values in the 'Plot' column

In [None]:
md_nan = md_plot.isna()
md_nan.sum()

**Pre-processing of the text**

Pre-processing using the spaCy library

In [None]:
nlp = spacy.load('en_core_web_sm') 

In [None]:
doc = nlp(md_plot[0]) 
print(doc) 

In [None]:
lemmas = [token.lemma_ for token in doc] 
print(lemmas)

In [None]:
# Keep only lemmas that are alphabetic AND are not stop words
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in STOP_WORDS] 

print(a_lemmas)

In [None]:
print(' '.join(a_lemmas))

Create a function for this pre-processing so it can be applied to all the other plots in the dataset

In [None]:
def preprocess(text):
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc]
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in STOP_WORDS]
    
    return ' '.join(a_lemmas)
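
The filter inside `preprocess` joins its two conditions with `and`. The sketch below illustrates on a toy list why `or` would not work here: every alphabetic stop word passes the `isalpha()` test, and every punctuation mark or number passes the "not a stop word" test, so nothing is removed. The stop-word set `TOY_STOP_WORDS`, the helper names, and the sample lemmas are made up for this illustration; the notebook itself relies on spaCy's `STOP_WORDS`.

```python
# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook itself uses spaCy's STOP_WORDS.
TOY_STOP_WORDS = {"the", "a", "be", "and", "of"}


def keep_with_or(lemma):
    # Keeps a lemma if it is alphabetic OR is not a stop word.
    return lemma.isalpha() or lemma not in TOY_STOP_WORDS


def keep_with_and(lemma):
    # Keeps a lemma only if it is alphabetic AND is not a stop word.
    return lemma.isalpha() and lemma not in TOY_STOP_WORDS


# Made-up lemmas standing in for the output of token.lemma_
lemmas = ["the", "bartender", "be", "work", ",", "1903", "and", "saloon", "."]

filtered_or = [lemma for lemma in lemmas if keep_with_or(lemma)]
filtered_and = [lemma for lemma in lemmas if keep_with_and(lemma)]

print(filtered_or)   # nothing is filtered out
print(filtered_and)  # only the content words remain
```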

In [None]:
print(preprocess(md_plot[0]))  # check the result of the function
print(md_plot[0])  # original plot, for comparison

Test the function for simple cases

In [None]:
md_plot_test = [[md_plot[0]], [md_plot[1]], [md_plot[2]]]  # select only a few rows
md_plot_test = pd.DataFrame(md_plot_test, columns = ['Plot']) 
    
md_plot_test['test'] = md_plot_test['Plot'].apply(preprocess)  # creates a new column

md_plot_test

Apply the function to the dataset.
Note: it was necessary to reduce the size of the dataset here.

In [None]:
md_half = md[:len(md)//2].copy()  # copy so that adding a column later does not trigger SettingWithCopyWarning

In [None]:
md_half['Plot_lemma'] = md_half['Plot'].apply(lambda x: preprocess(x)) 

In [None]:
md_half  # check the new column

In [None]:
md_half.head()

Save the result

In [None]:
# Save the DataFrame, including the new 'Plot_lemma' column, as a CSV file
md_half.to_csv('csv_to_submit.csv', index = False)

**Apply TFIDF**

In [None]:
vectorizer = TfidfVectorizer()

Take a quick look at the lemmatized plots

In [None]:
md_half_plot_lemma = md_half['Plot_lemma'] 
md_half_plot_lemma.head()
md_half_plot_lemma.shape

Verify the TFIDF matrix on the test dataset

In [None]:
tfidf_matrix_teste = vectorizer.fit_transform(md_plot_test['test'])
print(tfidf_matrix_teste) 

Apply TF-IDF to the halved dataset

In [None]:
tfidf_matrix_half = vectorizer.fit_transform(md_half['Plot_lemma'])  # create the TF-IDF matrix

In [None]:
md_half_plot_lemma.shape

In [None]:
print(tfidf_matrix_half) 
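
To make the numbers in the printed matrix easier to read, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies under its default settings, as far as I understand them: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper `tfidf_rows` and the three toy plots are hypothetical, and the sketch splits on whitespace instead of using scikit-learn's tokenizer, so it is an illustration rather than a replacement.

```python
import math
from collections import Counter


def tfidf_rows(documents):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    # Simplified tokenization: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocabulary, rows


# Made-up, already pre-processed plots
toy_plots = [
    "detective hunt killer city",
    "detective chase thief city",
    "farmer grow corn field",
]

vocabulary, rows = tfidf_rows(toy_plots)

# Show the non-zero weights of the first plot
for term, weight in zip(vocabulary, rows[0]):
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```

In the first toy plot, "hunt" and "killer" appear in no other plot and therefore receive a higher weight than "detective" and "city", which are shared with the second plot.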

**Apply cosine similarity**

First, try it on the test dataset

In [None]:
cosine_sim_test = cosine_similarity(tfidf_matrix_teste, tfidf_matrix_teste)

Now apply it to the halved dataset

In [None]:
cosine_sim_half = cosine_similarity(tfidf_matrix_half, tfidf_matrix_half)

In [None]:
print(cosine_sim_half)

In [None]:
cosine_sim_half.shape
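
Each entry of this square matrix is the cosine of the angle between two plot vectors: the dot product divided by the product of the two vector lengths. The following sketch computes it by hand for three made-up vectors; the function name `cosine` and the vectors are illustrative assumptions, not part of the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # An all-zero vector has no direction; treat its similarity as 0
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for three rows of a TF-IDF matrix
plot_a = [0.6, 0.8, 0.0]
plot_b = [0.8, 0.6, 0.0]
plot_c = [0.0, 0.0, 1.0]

print(cosine(plot_a, plot_a))  # a plot compared with itself
print(cosine(plot_a, plot_b))  # plots that share vocabulary
print(cosine(plot_a, plot_c))  # plots with no terms in common
```

Because `TfidfVectorizer` L2-normalizes its rows by default, both lengths are 1 and the cosine reduces to a plain dot product. For that reason `sklearn.metrics.pairwise.linear_kernel` should give the same matrix with less work, which may help on a dataset of this size.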

**Make the recommendations**

In [None]:
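indices_half = pd.Series(md_half.index, index=md_half['Title'])  # map each film title to its row index
indices_half = indices_half[~indices_half.index.duplicated(keep='first')]  # for repeated titles, keep the first film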
indices_half

In [None]:

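def get_recommendations(title, cosine_sim, indices):
    # Row position of the chosen film
    idx = indices[title]
    # Pair every film's position with its similarity to the chosen film
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the film itself) and keep the next 10
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return md_half['Title'].iloc[movie_indices]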

In [None]:
print(get_recommendations('The Godfather', cosine_sim_half, indices_half))
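
`get_recommendations` skips the first sorted entry on the assumption that it is the chosen film itself. That holds when no other film reaches the same score, but two entries with identical plots would tie at the top. A possible variant, sketched here in plain Python with made-up titles and scores, removes the chosen film by its position instead. The helper `top_k_similar` is hypothetical and not part of the notebook.

```python
def top_k_similar(sim_row, query_index, k=10):
    """Positions of the k most similar items, never including the query itself."""
    # Drop the query by position rather than by assuming it sorts first
    candidates = [(i, score) for i, score in enumerate(sim_row) if i != query_index]
    # Sort from most to least similar
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:k]]


# Made-up data: similarities of "Film A" (position 0) to every film.
# "Film C" has an identical plot, so it ties with the film itself.
toy_titles = ["Film A", "Film B", "Film C", "Film D"]
toy_row = [1.0, 0.2, 1.0, 0.7]

recommended = top_k_similar(toy_row, query_index=0, k=2)
print([toy_titles[i] for i in recommended])
```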