# NMF for Movies: Preprocessing

Reference: https://towardsdatascience.com/topic-modeling-quora-questions-with-lda-nmf-aff8dce5e1dd


Dataset source: https://www.kaggle.com/jrobischon/wikipedia-movie-plots

The dataset contains descriptions of 34,886 movies from around the world. 

Column descriptions are listed below:

* Release Year - Year in which the movie was released
* Title - Movie title
* Origin/Ethnicity - Origin of the movie (e.g. American, Bollywood, Tamil)
* Director - Director(s)
* Cast - Main actors and actresses
* Genre - Movie genre(s)
* Wiki Page - URL of the Wikipedia page from which the plot description was scraped
* Plot - Long-form description of the movie plot (WARNING: may contain spoilers!)

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [2]:
cd /gdrive/MyDrive/AA-Sem6/CS360-DSANDS/NMF-Project/

/gdrive/MyDrive/AA-Sem6/CS360-DSANDS/NMF-Project


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Main dataset of plots

In [4]:
data=pd.read_csv('wiki_movie_plots_deduped.csv')

In [5]:
data.head(5)

| Unnamed: 0 | Release Year | Title | Origin/Ethnicity | Director | Cast | Genre | Wiki Page | Plot |
|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Kansas Saloon Smashers | American | Unknown | | unknown | https://en.wikipedia.org/wiki/Kansas_Saloon_Sm... | A bartender is working at a saloon, serving dr... |
| 1 | 1901 | Love by the Light of the Moon | American | Unknown | | unknown | https://en.wikipedia.org/wiki/Love_by_the_Ligh... | The moon, painted with a smiling face hangs ov... |
| 2 | 1901 | The Martyred Presidents | American | Unknown | | unknown | https://en.wikipedia.org/wiki/The_Martyred_Pre... | The film, just over a minute long, is composed... |
| 3 | 1901 | Terrible Teddy, the Grizzly King | American | Unknown | | unknown | https://en.wikipedia.org/wiki/Terrible_Teddy,_... | Lasting just 61 seconds and consisting of two ... |
| 4 | 1902 | Jack and the Beanstalk | American | George S. Fleming, Edwin S. Porter | | unknown | https://en.wikipedia.org/wiki/Jack_and_the_Bea... | The earliest known adaptation of the classic f... |


## Preprocessing
The following preprocessing steps were used:

* The text was converted to lowercase.
* Words containing numbers were removed.
* Text in square brackets and punctuation were removed.
* The text was lemmatised (stemming is a possible alternative).
* Stop words and names were dropped, since they do not add to the theme of a plot. The names dataset used to collect the names to drop is taken from here.
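The last step in the list, dropping stop words and names, can be sketched in plain Python. This is only an illustration: the `STOP_WORDS` and `NAMES` sets and the `drop_words` helper are hypothetical stand-ins, not the word lists or code used in this project.

```python
# Hypothetical, deliberately tiny word lists for illustration only.
# A real run would use a full stop-word list and the names dataset.
STOP_WORDS = {"a", "an", "the", "and", "to", "of", "in", "is", "at"}
NAMES = {"jack", "carrie", "teddy"}


def drop_words(text, words_to_drop):
    """Return text with every whitespace-separated token in words_to_drop removed."""
    kept = [token for token in text.split() if token not in words_to_drop]
    return " ".join(kept)


# Input is assumed to be already lowercased and lemmatised.
plot = "jack climb the beanstalk and steal a harp"
themed = drop_words(plot, STOP_WORDS | NAMES)
print(themed)  # climb beanstalk steal harp
```

Using sets keeps each membership test cheap, which matters when the same filter runs over tens of thousands of plots.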


In [None]:
pip install -U spacy

      Successfully uninstalled catalogue-1.0.0
  Found existing installation: srsly 1.0.5
    Uninstalling srsly-1.0.5:
      Successfully uninstalled srsly-1.0.5
  Found existing installation: thinc 7.4.0
    Uninstalling thinc-7.4.0:
      Successfully uninstalled thinc-7.4.0
  Found existing installation: spacy 2.2.4
    Uninstalling spacy-2.2.4:
      Successfully uninstalled spacy-2.2.4
Successfully installed catalogue-2.0.4 click-7.1.2 pathy-0.5.2 pydantic-1.7.4 smart-open-3.0.0 spacy-3.0.6 spacy-legacy-3.0.5 srsly-2.4.1 thinc-8.0.3 typer-0.3.2


In [None]:
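# spaCy 3.x deprecates the 'en' shortcut; use the full package name, which matches the model loaded below.
!python -m spacy download en_core_web_sm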

In [None]:
import re 
import spacy
import string

In [None]:
def clean_text(text):
    """Lowercase text and strip bracketed spans, punctuation, and words containing digits."""
    text = text.lower()
    # Drop anything in square brackets, e.g. citation markers like [1].
    text = re.sub(r'\[.*?\]', '', text)
    # Remove every ASCII punctuation character.
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    # Remove any word that contains at least one digit.
    text = re.sub(r'\w*\d\w*', '', text)
    return text
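To see what `clean_text` does to a single string, here is a small self-contained check. The sample sentence is made up for illustration, and the final whitespace-collapsing step is an optional extra that is not part of the original function.

```python
import re
import string


def clean_text(text):
    # Same four steps as the function above.
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text


# Made-up sample sentence with a bracketed marker, punctuation and a year.
sample = "In 1901, Carrie Nation [1] smashes the saloon's fixtures!"
cleaned = clean_text(sample)

# Removed spans leave runs of spaces behind; collapsing them is optional.
normalized = " ".join(cleaned.split())
print(normalized)  # in carrie nation smashes the saloons fixtures
```

Note that removing punctuation also removes apostrophes, so "saloon's" becomes "saloons".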


In [None]:
df_clean = pd.DataFrame(data['Plot'].apply(clean_text))
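`clean_text` calls `text.lower()`, so it assumes every `Plot` value is a string. If any plots were missing, the `apply` call would raise an error. A minimal sketch of a guard, using a hypothetical three-row frame in place of the real dataset:

```python
import pandas as pd

# Hypothetical three-row frame standing in for the real dataset.
sample = pd.DataFrame(
    {"Plot": ["A bartender works at a saloon.", None, "The moon smiles."]}
)

# Count missing plots before cleaning.
n_missing = int(sample["Plot"].isna().sum())

# Replace missing plots with empty strings so string methods are safe.
plots = sample["Plot"].fillna("").astype(str)
lowered = plots.apply(str.lower)
print(n_missing, lowered.tolist())
```

Whether to fill missing plots with empty strings or drop those rows is a judgment call; filling keeps the row index aligned with the original data.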

In [None]:
df_clean.columns

Index(['Plot'], dtype='object')

In [None]:
nlp = spacy.load('en_core_web_sm')

# Lemmatise one plot: replace each token with its lemma and rejoin with spaces.
def lemmatizer(text):
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

df_clean["Plot_lemmatize"] = df_clean["Plot"].apply(lemmatizer)
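Calling `nlp` once per row can be slow on roughly 35,000 plots. spaCy's `nlp.pipe` processes texts in batches, and a batched variant might look like the sketch below. The helper name `lemmatize_texts` and the `batch_size` default are assumptions for illustration, not part of the original notebook.

```python
from typing import Iterable, List


def lemmatize_texts(texts: Iterable[str], nlp, batch_size: int = 64) -> List[str]:
    """Lemmatise many texts in one streaming pass through nlp.pipe.

    `nlp` is expected to be a loaded spaCy pipeline, for example the
    `en_core_web_sm` model loaded earlier in the notebook.
    """
    lemmatised = []
    for doc in nlp.pipe(texts, batch_size=batch_size):
        # Join the lemma of every token back into one string per text.
        lemmatised.append(" ".join(token.lemma_ for token in doc))
    return lemmatised
```

With the model loaded above, a call could look like `df_clean["Plot_lemmatize"] = lemmatize_texts(df_clean["Plot"], nlp)`. Loading the model with `disable=['parser', 'ner']` may speed this up further, since lemmatisation does not need those components.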


In [None]:
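# spaCy 2.x uses '-PRON-' as the lemma of pronouns; spaCy 3.x does not, so this is a no-op there.
df_clean['Plot_lemmatize_clean'] = df_clean['Plot_lemmatize'].str.replace('-PRON-', '', regex=False)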


In [None]:
df_clean.to_csv('Plots_preprocessed.csv')


-------------------------------------