<a href="https://colab.research.google.com/github/vinaykumargummadi/CineSense/blob/main/notebooks/pre_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import re

import matplotlib.pyplot as plt
import seaborn as sns

from tqdm.notebook import tqdm_notebook
tqdm_notebook.pandas()

import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Data Preprocessing**

In [None]:
import re
from string import punctuation
import spacy
import nltk

from nltk.stem import WordNetLemmatizer
lemmatization_obj = WordNetLemmatizer()

from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
nlp=spacy.load('en_core_web_sm')

In [None]:
punctuation += '--'
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~--'
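
Because `punctuation` is a plain string, a later test such as `token not in punctuation` is a substring check rather than a set-membership check. The sketch below illustrates the difference; `extended` and `PUNCT_SET` are hypothetical names introduced only for this example.

```python
from string import punctuation

extended = punctuation + "--"

# Substring semantics: any run of characters that occurs in the string counts as "in".
print("--" in extended)    # True, thanks to the appended '--'
print("!\"" in extended)   # True, because '!' and '"' happen to be adjacent
print("'s" in extended)    # False, so a token like "'s" would be kept

# Set semantics: only exact tokens match.
PUNCT_SET = set(punctuation) | {"--"}
print("!\"" in PUNCT_SET)  # False
print("--" in PUNCT_SET)   # True
```

Either approach filters single-character punctuation tokens; the set form is simply more explicit about what counts as punctuation.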

In [None]:
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
path = r"/content/drive/MyDrive/01. DSML_ML Algorithms/CineSense/data"
# df.to_csv(path+"/processed_data.csv")

In [None]:
df=pd.read_csv(path+"/processed_data.csv")
df.drop('Unnamed: 0',inplace=True,axis=1)

In [None]:
df=df[~df.movie_info.isnull()]
df.reset_index(drop=True, inplace=True)

**PRE_PROCESS_TEXT:** This function applies the necessary cleaning steps: it removes unwanted parenthesised tags (such as cast names), stop words, and punctuation, then reduces each word to its root form using `spacy` lemmatization.

In [None]:
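def pre_process_text(text):
  # Non-string values (e.g. NaN) are mapped to an empty string.
  if not isinstance(text, str):
    return ""

  # Drop parenthesised spans such as "(Catherine Keener)".
  regex_cast = r"\((.*?)\)"
  cast_removed = re.sub(regex_cast, '', text, count=0, flags=re.MULTILINE)

  # Lowercase tokens and drop punctuation and NLTK stop words.
  # Note: `punctuation` is a string, so `in` is a substring test.
  doc = nlp(cast_removed)
  tokens = [word.text.lower() for word in doc]
  punct_stop_removed = " ".join(t for t in tokens if t not in punctuation and t not in stop_words)

  # Lemmatize with spaCy, then collapse repeated whitespace.
  word_lemma = " ".join(word.lemma_ for word in nlp(punct_stop_removed))
  return re.sub(r'\s+', ' ', word_lemma)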
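
To see what the first step of `pre_process_text` does on its own, here is a minimal sketch of the parenthesis-stripping regex applied to an illustrative sentence. The helper name `strip_parenthetical` and the sample string are hypothetical and not part of the notebook.

```python
import re

# Same pattern as in pre_process_text: a non-greedy match of anything in parentheses.
CAST_PATTERN = r"\((.*?)\)"

def strip_parenthetical(text):
    """Remove parenthesised spans such as cast names (hypothetical helper)."""
    without_cast = re.sub(CAST_PATTERN, "", text)
    # Removing a span leaves doubled spaces behind, so collapse whitespace too.
    return re.sub(r"\s+", " ", without_cast).strip()

print(strip_parenthetical("Kate (Catherine Keener) and her husband Alex run a shop."))
# Kate and her husband Alex run a shop.
```

Because the match is non-greedy, nested parentheses are only partly removed: `"a (b (c) d) e"` loses `"(b (c)"` and keeps the trailing `" d)"`.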

Check that the predefined function `pre_process_text` works correctly on a sample text.

In [None]:
sample="""The film stars Joseph Cotten as Holly Martins, a writer of pulp westerns who arrives penniless as a guest of his childhood chum, Harry Lime. However, Martins discovers that Lime is dead and develops a conspiracy theory. As he learns more about the circumstances of Lime's death, he becomes convinced that a "third man" was present at the time. Martins finds himself running interference with British officer Major Calloway and falls head over heels for Lime's grief-stricken lover, Anna."""
sample

'The film stars Joseph Cotten as Holly Martins, a writer of pulp westerns who arrives penniless as a guest of his childhood chum, Harry Lime. However, Martins discovers that Lime is dead and develops a conspiracy theory. As he learns more about the circumstances of Lime\'s death, he becomes convinced that a "third man" was present at the time. Martins finds himself running interference with British officer Major Calloway and falls head over heels for Lime\'s grief-stricken lover, Anna.'

In [None]:
words = sample.split()
final=" ".join([word.lower() for word in words if word.lower() not in stop_words])
final

'film stars joseph cotten holly martins, writer pulp westerns arrives penniless guest childhood chum, harry lime. however, martins discovers lime dead develops conspiracy theory. learns circumstances lime\'s death, becomes convinced "third man" present time. martins finds running interference british officer major calloway falls head heels lime\'s grief-stricken lover, anna.'

In [None]:
doc = nlp(sample)
final=" ".join([token.text.lower() for token in doc if not token.is_stop])
final

'film stars joseph cotten holly martins , writer pulp westerns arrives penniless guest childhood chum , harry lime . , martins discovers lime dead develops conspiracy theory . learns circumstances lime death , convinced " man " present time . martins finds running interference british officer major calloway falls head heels lime grief - stricken lover , anna .'
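
The two results differ for two reasons: spaCy's stop-word list is larger (it drops `however`, `third`, and `becomes`, which the NLTK run kept), and `str.split` leaves punctuation attached to words, so a stop word followed by a comma is never matched. The sketch below isolates the second effect using only the standard library; the tiny stop-word set and the regex tokenizer are assumptions made for illustration and are not equivalent to NLTK's list or spaCy's tokenizer.

```python
import re

# Tiny illustrative stop-word set; NOT the NLTK or spaCy list.
STOP_WORDS = {"is", "that", "not", "the"}
text = "Lime is dead; that, however, is not the end."

# Whitespace splitting keeps punctuation glued to words, so "that," slips through.
split_tokens = [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]

# A crude regex tokenizer (a stand-in for spaCy's tokenizer) separates punctuation first.
regex_tokens = [t for t in re.findall(r"\w+|[^\w\s]", text.lower()) if t not in STOP_WORDS]

print(split_tokens)  # ['lime', 'dead;', 'that,', 'however,', 'end.']
print(regex_tokens)  # ['lime', 'dead', ';', ',', 'however', ',', 'end', '.']
```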

In [None]:
test_sample=df.movie_info.sample(1).values[0]
processed_text = pre_process_text(test_sample)
print(test_sample,'\n',processed_text)

After the fall of communism in Romania, unwanted children in state orphanages escaped into the streets. These children are the subject of this documentary, which follows them into a subterranean world of hierarchy, hunger and drug use. The film focuses on five diverse children, including 12-year-old Mihai, who ran away from home due to his father's beatings, and Cristina, who passes for a boy. The filmmakers follow and record their daily lives, and the strained possibilities for reintegration. 
 fall communism romania unwanted child state orphanage escape street child subject documentary follow subterranean world hierarchy hunger drug use film focus five diverse child include 12 year old mihai run away home due father 's beating cristina pass boy filmmaker follow record daily live strained possibility reintegration


In [None]:
# apply the pre-processing steps to the entire movie_info column
df['processed_movie_info'] = df.movie_info.progress_apply(pre_process_text)

In [None]:
df.head()

Unnamed: 0,rotten_tomatoes_link,movie_title,movie_info,content_rating,genres,directors,authors,actors,original_release_date,streaming_release_date,...,tomatometer_status,tomatometer_rating,tomatometer_count,audience_status,audience_rating,audience_count,tomatometer_top_critics_count,tomatometer_fresh_critics_count,tomatometer_rotten_critics_count,processed_movie_info
0,m/0814255,Percy Jackson & the Olympians: The Lightning T...,"Always trouble-prone, the life of teenager Per...",PG,"Action & Adventure, Comedy, Drama, Science Fic...",Chris Columbus,"Craig Titley, Chris Columbus, Rick Riordan","Logan Lerman, Brandon T. Jackson, Alexandra Da...",2010-02-12,2015-11-25,...,Rotten,49.0,149.0,Spilled,53.0,254421.0,43,73,76,always trouble prone life teenager percy jacks...
1,m/0878835,Please Give,Kate (Catherine Keener) and her husband Alex (...,R,Comedy,Nicole Holofcener,Nicole Holofcener,"Catherine Keener, Amanda Peet, Oliver Platt, R...",2010-04-30,2012-09-04,...,Certified-Fresh,87.0,142.0,Upright,64.0,11574.0,44,123,19,kate husband alex wealthy new yorkers prowl es...
2,m/10,10,"A successful, middle-aged Hollywood songwriter...",R,"Comedy, Romance",Blake Edwards,Blake Edwards,"Dudley Moore, Bo Derek, Julie Andrews, Robert ...",1979-10-05,2014-07-24,...,Fresh,67.0,24.0,Spilled,53.0,14684.0,2,16,8,successful middle aged hollywood songwriter fa...
3,m/1000013-12_angry_men,12 Angry Men (Twelve Angry Men),Following the closing arguments in a murder tr...,NR,"Classics, Drama",Sidney Lumet,Reginald Rose,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",1957-04-13,2017-01-13,...,Certified-Fresh,100.0,54.0,Upright,97.0,105386.0,6,54,0,follow closing argument murder trial 12 member...
4,m/1000079-20000_leagues_under_the_sea,"20,000 Leagues Under The Sea","In 1866, Professor Pierre M. Aronnax (Paul Luk...",G,"Action & Adventure, Drama, Kids & Family",Richard Fleischer,Earl Felton,"James Mason, Kirk Douglas, Paul Lukas, Peter L...",1954-01-01,2016-06-10,...,Fresh,89.0,27.0,Upright,74.0,68918.0,5,24,3,1866 professor pierre m. aronnax assistant con...


**SimilarityMatcher**

In [None]:
similarity_df = df[['movie_title','movie_info','processed_movie_info']]

In [None]:
# similarity_df.reset_index(drop=True, inplace=True)

In [None]:
similarity_df.head()

Unnamed: 0,movie_title,movie_info,processed_movie_info
0,Percy Jackson & the Olympians: The Lightning T...,"Always trouble-prone, the life of teenager Per...",always trouble prone life teenager percy jacks...
1,Please Give,Kate (Catherine Keener) and her husband Alex (...,kate husband alex wealthy new yorkers prowl es...
2,10,"A successful, middle-aged Hollywood songwriter...",successful middle aged hollywood songwriter fa...
3,12 Angry Men (Twelve Angry Men),Following the closing arguments in a murder tr...,follow closing argument murder trial 12 member...
4,"20,000 Leagues Under The Sea","In 1866, Professor Pierre M. Aronnax (Paul Luk...",1866 professor pierre m. aronnax assistant con...


In [None]:
similarity_df[['movie_title','processed_movie_info']].sample(2).values

array([['Teenage Cocktail',
        'two young woman plan run away try webcam modeling make enough money survive first money come roll girl quickly learn consequence action blindside'],
       ['The Big Sleep',
        'private eye philip marlowe fall porno blackmail murder case 1970 london']],
      dtype=object)

[['The Third Man',
        "The film stars Joseph Cotten as Holly Martins, a writer of pulp westerns who arrives penniless as a guest of his childhood chum, Harry Lime. However, Martins discovers that Lime is dead and develops a conspiracy theory. As he learns more about the circumstances of Lime's death, he becomes convinced that a "third man" was present at the time. Martins finds himself running interference with British officer Major Calloway and falls head over heels for Lime's grief-stricken lover, Anna."],        
['They Drive by Night',
'Joe and Paul, who work as delivery truck drivers. They are pushing hard to try to run their business successfully. One night, due to fatigue, Paul falls asleep behind the wheel and demolishes the truck, losing his arm in the process. Joe is offered a job by the truck company owner, whose wife, Lana, falls for Joe. Lana kills her husband, and Joe refuses her advances, leading Lana to frame him for the murder. Meanwhile, Joe's love interest, Cassie, becomes entangled in the unfolding drama.']]

In [None]:
user_input_text = """The film stars Joseph Cotten as Holly Martins, a writer of pulp westerns who arrives penniless as a guest of his childhood chum, Harry Lime. However, Martins discovers that Lime is dead and develops a conspiracy theory. As he learns more about the circumstances of Lime's death, he becomes convinced that a "third man" was present at the time. Martins finds himself running interference with British officer Major Calloway and falls head over heels for Lime's grief-stricken lover, Anna."""
user_input_text=pre_process_text(user_input_text)

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
vector = TfidfVectorizer(max_df=0.95,max_features=3000)

In [None]:
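def get_similar_movies(top_n, user_input_text):
  # Clean the query the same way as the movie descriptions.
  user_input_text = pre_process_text(user_input_text)
  # Fit TF-IDF on all descriptions plus the query; the query is the last row.
  tfidf_matrix = vector.fit_transform(list(similarity_df['processed_movie_info']) + [user_input_text])
  user_tfidf_data = tfidf_matrix[-1]
  movie_tfidf_data = tfidf_matrix[:-1]
  # Cosine similarity of the query against every movie, flattened to 1-D.
  cosine_similarities = cosine_similarity(user_tfidf_data, movie_tfidf_data).flatten()
  # Indices sorted from most to least similar.
  similar_cosine_indices = cosine_similarities.argsort()[::-1]
  for index in similar_cosine_indices[:top_n]:
    # argsort yields positions, so use positional lookup (iloc) rather than labels.
    title = similarity_df['movie_title'].iloc[index]
    print(f"Similarity Score: {cosine_similarities[index]}\t Movie Title: {title}")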

In [None]:
get_similar_movies(10,"""In 'The Secret Garden,' a young orphan girl named Mary discovers a hidden, magical garden on her uncle's estate. With the help of her newfound friends, she unlocks the garden's mysteries and learns the power of friendship, love, and the beauty of nature.""")

Similarity Score: 0.5463597996630795	 Movie Title: The Secret Garden
Similarity Score: 0.5106749540248043	 Movie Title: The Secret Garden
Similarity Score: 0.48573529198833415	 Movie Title: Gnomeo and Juliet
Similarity Score: 0.4367035255327119	 Movie Title: Shut Up and Play the Hits
Similarity Score: 0.38040012360117703	 Movie Title: The Treasure
Similarity Score: 0.3474718492507245	 Movie Title: Sherlock Gnomes
Similarity Score: 0.29889874523429627	 Movie Title: A Man Named Pearl
Similarity Score: 0.2971446168997393	 Movie Title: The Garden
Similarity Score: 0.262877244952219	 Movie Title: Mirai
Similarity Score: 0.2571878137654771	 Movie Title: The Garden of Earthly Delights


In [None]:
similarity_df.shape

(17391, 3)

In [None]:
tfidf_matrix=vector.fit_transform(list(similarity_df['processed_movie_info']) + [user_input_text])

In [None]:
tfidf_matrix.toarray().shape

(17392, 3000)

In [None]:
tfidf_matrix.shape

(17392, 3000)
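
Both cells report the same shape, but `toarray()` first materializes a dense copy of the sparse matrix. A rough back-of-the-envelope estimate of that copy's size is sketched below, assuming 8-byte `float64` values (the default `dtype` of `TfidfVectorizer`).

```python
rows, cols = 17392, 3000   # shape reported above
bytes_per_value = 8        # float64, assumed default dtype

dense_bytes = rows * cols * bytes_per_value
print(f"{dense_bytes / 1024**2:.0f} MiB")  # 398 MiB
```

Calling `.shape` directly on the sparse matrix avoids that allocation.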

In [None]:
user_tfidf_data=tfidf_matrix[-1]
movie_tfidf_data=tfidf_matrix[:-1]

In [None]:
cosine_similarities = cosine_similarity(user_tfidf_data, movie_tfidf_data)

In [None]:
cosine_similarities.shape,cosine_similarities.flatten().shape

((1, 17391), (17391,))

In [None]:
cosine_similarities=cosine_similarities.flatten()

In [None]:
cosine_similarities

array([0.02677294, 0.        , 0.01489644, ..., 0.06878839, 0.0924994 ,
       0.        ])
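
Each entry in this array is the cosine of the angle between the query vector and one movie vector. As a hand-rolled illustration (not scikit-learn's implementation), the sketch below computes the same quantity for toy sparse vectors stored as dictionaries; the vectors and the `cosine` helper are made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector shares no terms with anything
    return dot / (norm_u * norm_v)

query = {"garden": 1.0, "orphan": 1.0}
movie_a = {"garden": 1.0, "orphan": 1.0, "uncle": 1.0}
movie_b = {"truck": 1.0, "driver": 1.0}

print(round(cosine(query, movie_a), 3))  # 0.816
print(cosine(query, movie_b))            # 0.0
```

A score of exactly zero therefore means the query and the movie share no terms in the retained vocabulary.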

In [None]:
similar_cosine_indices=cosine_similarities.argsort()[::-1]

In [None]:
similar_cosine_indices

array([15624, 11895, 15442, ...,  8922,  8920,  8695])
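
Sorting all 17,391 scores works, but only the first few indices are used afterwards. As an alternative sketch, `heapq.nlargest` from the standard library selects the top entries directly; the `scores` list here is toy data, not the notebook's array.

```python
import heapq

scores = [0.02, 0.0, 0.83, 0.34, 0.0, 0.31]
top_n = 3

# Pair each score with its position, then keep the top_n pairs by score.
top = heapq.nlargest(top_n, enumerate(scores), key=lambda pair: pair[1])
top_indices = [index for index, _ in top]

print(top_indices)  # [2, 3, 5]
```

With NumPy arrays, `np.argpartition` can serve a similar purpose, though at this data size the full `argsort` is unlikely to be a bottleneck.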

In [None]:
cosine_similarities[similar_cosine_indices[0]]

0.8323048425054459

In [None]:
top_n = 10
for index in similar_cosine_indices[:top_n]:
  print(f"Similarity Score: {cosine_similarities[index]}, Movie Title: {similarity_df.loc[index,'movie_title']}")

Similarity Score: 0.8323048425054459, Movie Title: The Third Man
Similarity Score: 0.3403230061486163, Movie Title: Proof
Similarity Score: 0.33895396992702564, Movie Title: The Return of Martin Guerre (Le Retour de Martin Guerre)
Similarity Score: 0.32507659828098884, Movie Title: Tortilla Soup
Similarity Score: 0.31240983370169256, Movie Title: Sneakers
Similarity Score: 0.2878716767293235, Movie Title: Closed Circuit
Similarity Score: 0.2850964505222563, Movie Title: Great World of Sound
Similarity Score: 0.28105838128219396, Movie Title: Grosse Pointe Blank
Similarity Score: 0.27614935405073354, Movie Title: A Song for Martin
Similarity Score: 0.2694335871969458, Movie Title: Flesh & Blood (Flesh+Blood) (The Rose and the Sword)


In [None]:
def get_years(text):
  # Each pattern captures the 4-digit year that follows the keyword.
  early_match = re.search(r"early (\d{4})", text)
  late_match = re.search(r"late (\d{4})", text)
  between_match = re.search(r"between (\d{4}) and (\d{4})", text)

  # Convert to int only when the phrase was found; int(None) would raise TypeError.
  user_years = {
      'early_year': int(early_match.group(1)) if early_match else None,
      'late_year': int(late_match.group(1)) if late_match else None,
      'between_start_year': int(between_match.group(1)) if between_match else None,
      'between_end_year': int(between_match.group(2)) if between_match else None
  }
  return user_years

In [None]:
get_years("I watched a movie in the early 2000s and it was again released in late 2010s but the movie between 1990 and 2000s are gold")

{'early_year': 2000,
 'late_year': 2010,
 'between_start_year': 1990,
 'between_end_year': 2000}
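
The patterns used in `get_years` match `early` or `late` anywhere, including inside longer words such as `nearly`. A possible tightening, sketched below with a hypothetical helper `find_year`, anchors the keyword with `\b` and returns `None` when the phrase is absent.

```python
import re

def find_year(keyword, text):
    """Return the 4-digit year following `keyword` as an int, or None (hypothetical helper)."""
    match = re.search(rf"\b{re.escape(keyword)} (\d{{4}})", text)
    return int(match.group(1)) if match else None

print(find_year("early", "a film from the early 2000s"))   # 2000
print(find_year("early", "nearly 2000 people attended"))   # None
print(find_year("late", "a film from the early 2000s"))    # None
```

Without the `\b` anchor, the second call would report 2000 because `early` occurs inside `nearly`.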

In [None]:


text = "early 2000s and late 1900s and in between 2010 and 2012"

early_match = re.search(r"early (\d{4})", text)
late_match = re.search(r"late (\d{4})", text)
between_match = re.search(r"between (\d{4}) and (\d{4})", text)

early_year = early_match.group(1) if early_match else None
late_year = late_match.group(1) if late_match else None
between_start_year = between_match.group(1) if between_match else None
between_end_year = between_match.group(2) if between_match else None

print("Early:", early_year)
print("Late:", late_year)
print("Between:", between_start_year, "and", between_end_year)
