<a href="https://colab.research.google.com/github/shalakagangadhare/IMDB-semantic-similarity/blob/main/Imdb_semantic_similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP Project: Sentiment Analysis on IMDb Movie Reviews

Introduction

This project aims to build a model that can classify movie reviews from the IMDb dataset as either positive or negative. We will use Natural Language Processing (NLP) techniques to preprocess the text data, convert it into numerical features using TF-IDF, and then train a Logistic Regression model for classification.

Dataset: The dataset contains 50,000 movie reviews, labeled as positive (1) or negative (0).
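To make the TF-IDF step concrete, here is a minimal, dependency-free sketch of the weighting scheme. It assumes the smoothed IDF and L2 normalization that scikit-learn's `TfidfVectorizer` applies by default, and it uses plain whitespace tokenization as a simplification. The toy corpus and the helper name `tfidf_vectors` are illustrative only and not part of the notebook.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Hypothetical toy reviews
docs = ["great movie great cast", "terrible movie", "great fun"]
vocab, rows = tfidf_vectors(docs)
for term, weight in zip(vocab, rows[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the second toy review, the rarer word "terrible" receives a larger weight than "movie", which appears in two of the three documents.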

🧪 Results

BERT-based embeddings significantly outperformed traditional TF-IDF and Word2Vec on semantic similarity.

Models were able to identify similarity even when the vocabulary differed but sentiment matched.
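The point about differing vocabulary can be illustrated with plain word counts: cosine similarity over bag-of-words vectors drops to zero when two reviews share no words, which is the gap that dense embeddings are meant to close. The sentences and helper names below are made up for illustration and are not taken from the dataset.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def bow(text):
    """Bag-of-words counts using simple whitespace tokenization."""
    return Counter(text.lower().split())

# Overlapping wording versus similar sentiment with no shared words
same_words = cosine(bow("a wonderful film"), bow("a wonderful film indeed"))
no_overlap = cosine(bow("wonderful film"), bow("superb movie"))
print(round(same_words, 3), no_overlap)
```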

📌 Applications

Duplicate review detection

Sentiment alignment

Chatbot and FAQ matching

Plagiarism detection in review systems

In [None]:
!pip install tensorflow-text



In [None]:
!pip install unstructured
!pip install langchain

Collecting packaging<25,>=23.2 (from langchain-core<1.0.0,>=0.3.58->langchain)
  Downloading packaging-24.2-py3-none-any.whl.metadata (3.2 kB)
Downloading packaging-24.2-py3-none-any.whl (65 kB)
Installing collected packages: packaging
  Attempting uninstall: packaging
    Found existing installation: packaging 25.0
    Uninstalling packaging-25.0:
      Successfully uninstalled packaging-25.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dopamine-rl 4.1.2 requires jax>=0.1.72, which is not installed.
dopamine-rl 4.1.2 requires jaxlib>=0.1.51, which is not installed.
tensorflow-decision-forests 1.11.0 requires tensorflow==2.18.0, but you have tensorflow 2.14.0 which is incompatible.
tensorflow-decision-forests 1.11.0 requires tf-ker

In [None]:
from torchvision import models
import torch.nn as nn
import torch.optim as optim
import gc
import torch
import pandas as pd
import numpy as np
import re
import nltk
import sys
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
eng_words = set(nltk.corpus.words.words())
stop = stopwords.words('english')

In [None]:
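df_imdb = pd.read_csv('/content/Copy of IMDB Dataset.csv', on_bad_lines='skip', quoting=3)
# on_bad_lines='skip' drops rows that fail to parse instead of raising an error
# quoting=3 is csv.QUOTE_NONE: quote characters are treated as ordinary text, so commas
# inside quoted reviews are read as field separators and those rows can be skipped as bad lines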

df_imdb = df_imdb.sample(frac=1).reset_index(drop=True)
# Apply lower() only to string values
df_imdb['review'] = df_imdb['review'].apply(lambda x: x.lower() if isinstance(x, str) else x)
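The effect of `quoting=3` is easier to see on a tiny example. The sketch below uses Python's standard `csv` module, whose `QUOTE_NONE` constant has the value 3, on a made-up two-line file. pandas accepts the same quoting constants, but which rows it actually skips on the real dataset is not shown here.

```python
import csv
import io

# Hypothetical sample: the review text contains a comma inside double quotes
sample = 'review,sentiment\n"good, fun film",positive\n'

default_rows = list(csv.reader(io.StringIO(sample)))
no_quote_rows = list(csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONE))

print(csv.QUOTE_NONE)         # the integer passed as quoting=3
print(len(default_rows[1]))   # number of fields when quotes are honored
print(len(no_quote_rows[1]))  # number of fields when quotes are plain text
```

With quotes ignored, the data row has more fields than the two-column header, which is the kind of row that `on_bad_lines='skip'` discards.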

In [None]:
# Helper: strip HTML tags and keep only the text
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

# Helper: remove anything enclosed in square brackets (raw string avoids invalid-escape warnings)
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

# Main cleaning step: strip HTML, then remove bracketed text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text

# Lemmatize words
def my_lemmatizer(x):
    lemmatizer= WordNetLemmatizer()
    return ' '.join(list(map(lemmatizer.lemmatize, x.split())))

### Non English words removal
def non_eng_word_removal(x):
    return " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in eng_words or not w.isalpha())
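As a quick check of the cleaning logic, the sketch below applies the two regular expressions used in this notebook (bracket removal and punctuation replacement) to a made-up string, using only the standard `re` module. The helper name `replace_punctuation` and the sample text are assumptions for illustration.

```python
import re

def remove_between_square_brackets(text):
    # Drop any [bracketed] span, including the brackets themselves
    return re.sub(r'\[[^]]*\]', '', text)

def replace_punctuation(text):
    # Replace every character that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', ' ', text)

raw = "loved it [spoiler ahead]... 10/10, would_watch again!"
step1 = remove_between_square_brackets(raw)
# Underscores count as word characters, so they are replaced separately
step2 = replace_punctuation(step1).replace("_", " ")
cleaned = " ".join(step2.split())  # collapse repeated whitespace
print(cleaned)
```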

In [None]:
df_imdb['review'] = df_imdb['review'].astype(str).apply(denoise_text)
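# Removing stopwords
df_imdb['review_stopwords_removed'] = df_imdb['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

# Removing punctuation and underscores. regex=True is needed in pandas 2.x, where
# str.replace treats the pattern as a literal string by default.
# Note: this column is built from 'review', so stopwords are still present in it.
df_imdb["review_stoppunc_removed"] = df_imdb['review'].str.replace(r'[^\w\s]', ' ', regex=True)
df_imdb["review_stoppunc_removed"] = df_imdb['review_stoppunc_removed'].apply(lambda x: x.replace("_", " "))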


### Lemmatizing and removing non eng words
df_imdb['review_stoppunc_removed'] = df_imdb['review_stoppunc_removed'].apply(lambda x: my_lemmatizer(x))
df_imdb['review_stoppunc_removed'] = df_imdb['review_stoppunc_removed'].apply(lambda x: non_eng_word_removal(x))

In [None]:
len(df_imdb['review_stoppunc_removed'].tolist())

45999

In [None]:
df_imdb.columns
df_imdb.drop(columns=['review', 'review_stopwords_removed'], inplace = True)

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df_imdb['review_stoppunc_removed'].tolist())

In [None]:
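# Iterating over vocabulary_ yields the term strings, so key=x[1] orders them by their second character
sorted(vectorizer.vocabulary_, key = lambda x: x[1])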

['10',
 '90',
 '20',
 '100',
 '000',
 '30',
 '80',
 '2002',
 '10rated',
 '70',
 '81',
 '11',
 '01',
 '13th',
 '13',
 '14',
 'm4tv',
 '45',
 '06',
 '1980s',
 '1966',
 '1982',
 '1990',
 '1950s',
 'nan',
 'ha',
 'watching',
 'can',
 'safely',
 'waste',
 'cab',
 'fall',
 'fantasy',
 'eagerly',
 'man',
 'lay',
 'baker',
 'half',
 'same',
 'canal',
 'van',
 'happy',
 'bad',
 'part',
 'make',
 'eager',
 'day',
 'fact',
 'may',
 'have',
 'major',
 'sad',
 'landscape',
 'hardly',
 'many',
 'sadly',
 'maybe',
 'fan',
 'cannot',
 'wa',
 'saw',
 'making',
 'mani',
 'way',
 'want',
 'rating',
 'say',
 'cavalcade',
 'watchable',
 'game',
 'havoc',
 'various',
 'watch',
 'racism',
 'garbage',
 'satisfy',
 'save',
 'lack',
 'fascinating',
 'pantheon',
 'saint',
 'had',
 'family',
 'rather',
 'massive',
 'tang',
 'made',
 'matter',
 'saying',
 'fairly',
 'palm',
 'dandy',
 'jane',
 'take',
 'camera',
 'male',
 'mantis',
 'past',
 'pain',
 'taint',
 'raccoon',
 'rain',
 'map',
 'call',
 'each',
 'painfu
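The listing above is ordered by each term's second character, because iterating over `vocabulary_` yields the term strings and `x[1]` indexes into each string. If the intent was to order terms by their column index, a sketch of that could look like the following. It uses a small hand-written dictionary as a stand-in for the fitted `vocabulary_`.

```python
# Hypothetical stand-in for vectorizer.vocabulary_ (term -> column index)
vocabulary = {"movie": 2, "bad": 0, "great": 1, "watch": 3}

# Sorting the dict directly iterates over keys; x[1] is the second character
by_second_char = sorted(vocabulary, key=lambda x: x[1])

# Sorting items by value orders terms by their column index
by_index = [term for term, idx in sorted(vocabulary.items(), key=lambda kv: kv[1])]

print(by_second_char)
print(by_index)
```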

In [None]:
pip install -U tensorflow==2.14.0 tensorflow-text==2.14.0 tensorflow-hub==0.15.0


Collecting tensorflow-hub==0.15.0
  Using cached tensorflow_hub-0.15.0-py2.py3-none-any.whl.metadata (1.3 kB)
Using cached tensorflow_hub-0.15.0-py2.py3-none-any.whl (85 kB)
Installing collected packages: tensorflow-hub
  Attempting uninstall: tensorflow-hub
    Found existing installation: tensorflow-hub 0.16.1
    Uninstalling tensorflow-hub-0.16.1:
      Successfully uninstalled tensorflow-hub-0.16.1
Successfully installed tensorflow-hub-0.15.0


In [None]:
# Quote the version specifier; unquoted, the shell treats `<2.0` as an input redirection
!pip install "numpy<2.0" --force-reinstall


In [None]:
# Quote the version specifier; unquoted, the shell treats `<2.0` as an input redirection
!pip install "numpy<2.0" tensorflow==2.14 tensorflow_hub tensorflow_text
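An unquoted `<` in a shell command is read as an input redirection, so a specifier such as `numpy<2.0` has to be quoted. A minimal sketch of building such a command safely from Python with the standard `shlex` module follows. The command is only constructed and printed here, not executed.

```python
import shlex

# Version specifier containing a shell metacharacter
args = ["pip", "install", "numpy<2.0", "--force-reinstall"]

# shlex.join quotes any argument the shell would otherwise interpret
command = shlex.join(args)
print(command)
```

In a notebook cell, the equivalent is to write the specifier in quotes, for example `!pip install "numpy<2.0" --force-reinstall`.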


In [None]:
!pip install -U tensorflow==2.14.0 tensorflow-text==2.14.0 tensorflow-hub==0.15.0 numpy==1.23.5 --force-reinstall

Collecting tensorflow==2.14.0
  Using cached tensorflow-2.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting tensorflow-text==2.14.0
  Using cached tensorflow_text-2.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Collecting tensorflow-hub==0.15.0
  Using cached tensorflow_hub-0.15.0-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting numpy==1.23.5
  Using cached numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Collecting absl-py>=1.0.0 (from tensorflow==2.14.0)
  Using cached absl_py-2.2.2-py3-none-any.whl.metadata (2.6 kB)
Collecting astunparse>=1.6.0 (from tensorflow==2.14.0)
  Using cached astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting flatbuffers>=23.5.26 (from tensorflow==2.14.0)
  Using cached flatbuffers-25.2.10-py2.py3-none-any.whl.metadata (875 bytes)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow==2.14.0)
  Using cached gast-

In [None]:
!pip install -U numpy==1.23.5 --force-reinstall
!pip install -U tensorflow==2.14.0 tensorflow-text==2.14.0 tensorflow-hub==0.15.0 --force-reinstall

^C
^C


In [None]:
!pip install -U jaxlib jax==0.4.14 numpy==1.23.5 --force-reinstall
!pip install -U tensorflow==2.14.0 tensorflow-text==2.14.0 tensorflow-hub==0.15.0 --force-reinstall

Collecting jaxlib
  Using cached jaxlib-0.6.0-cp311-cp311-manylinux2014_x86_64.whl.metadata (1.2 kB)
Collecting jax==0.4.14
  Using cached jax-0.4.14-py3-none-any.whl
Collecting numpy==1.23.5
  Using cached numpy-1.23.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.3 kB)
Collecting ml_dtypes>=0.2.0 (from jax==0.4.14)
  Using cached ml_dtypes-0.5.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (21 kB)
Collecting opt_einsum (from jax==0.4.14)
  Using cached opt_einsum-3.4.0-py3-none-any.whl.metadata (6.3 kB)
Collecting scipy>=1.7 (from jax==0.4.14)
  Using cached scipy-1.15.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
INFO: pip is looking at multiple versions of jaxlib to determine which version is compatible with other requirements. This could take a while.
Collecting jaxlib
  Using cached jaxlib-0.5.3-cp311-cp311-manylinux2014_x86_64.whl.metadata (1.2 kB)
  Using cached jaxlib-0.5.1-cp311-cp311-manylinux201

Collecting tensorflow==2.14.0
  Using cached tensorflow-2.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting tensorflow-text==2.14.0
  Using cached tensorflow_text-2.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.9 kB)
Collecting tensorflow-hub==0.15.0
  Using cached tensorflow_hub-0.15.0-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting absl-py>=1.0.0 (from tensorflow==2.14.0)
  Using cached absl_py-2.2.2-py3-none-any.whl.metadata (2.6 kB)
Collecting astunparse>=1.6.0 (from tensorflow==2.14.0)
  Using cached astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecting flatbuffers>=23.5.26 (from tensorflow==2.14.0)
  Using cached flatbuffers-25.2.10-py2.py3-none-any.whl.metadata (875 bytes)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow==2.14.0)
  Using cached gast-0.6.0-py3-none-any.whl.metadata (1.3 kB)
Collecting google-pasta>=0.1.1 (from tensorflow==2.14.0)
  Using cached google_pasta-0.2.

In [None]:
!pip uninstall -y jaxlib jax numpy
!pip install -U numpy==1.24.3  # TensorFlow 2.14 officially supports NumPy 1.24.3
!pip install -U tensorflow==2.14.0 tensorflow-text==2.14.0 tensorflow-hub==0.15.0

Found existing installation: jaxlib 0.4.30
Uninstalling jaxlib-0.4.30:
  Successfully uninstalled jaxlib-0.4.30
Found existing installation: jax 0.4.14
Uninstalling jax-0.4.14:
  Successfully uninstalled jax-0.4.14
Found existing installation: numpy 1.23.5
Uninstalling numpy-1.23.5:
  Successfully uninstalled numpy-1.23.5
Collecting numpy==1.24.3
  Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chex 0.1.89 requires jax>=0.4.27, which is not installed.
chex 0.1.89 requires jaxlib>=0.4.27, which is not installed.
flax 0.10.6 requires jax>=0.5.1, which is not installed.
optax 0.2.4 requires jax>=0.4.27, which is not installed.


Collecting ml-dtypes==0.2.0 (from tensorflow==2.14.0)
  Using cached ml_dtypes-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Using cached ml_dtypes-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Installing collected packages: ml-dtypes
  Attempting uninstall: ml-dtypes
    Found existing installation: ml_dtypes 0.5.1
    Uninstalling ml_dtypes-0.5.1:
      Successfully uninstalled ml_dtypes-0.5.1
ERROR: Operation cancelled by user
^C


In [None]:
pip install numpy==1.24.3 --force-reinstall


Collecting numpy==1.24.3
  Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.3
    Uninstalling numpy-1.24.3:
      Successfully uninstalled numpy-1.24.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
chex 0.1.89 requires jax>=0.4.27, which is not installed.
chex 0.1.89 requires jaxlib>=0.4.27, which is not installed.
flax 0.10.6 requires jax>=0.5.1, which is not installed.
optax 0.2.4 requires jax>=0.4.27, which is not installed.
optax 0.2.4 requires jaxlib>=0.4.27, which is not installed.
orbax-checkpoint 0.11.13 requires jax>=0.5.0, which is not installed.
dopamine-rl 4.1.2 requires jax>=0.1.72, wh

In [None]:
!pip install tensorflow==2.14.0 tensorflow-hub tensorflow-text numpy==1.24.3 ml-dtypes==0.2.0 protobuf==3.20.* --force-reinstall

Collecting tensorflow==2.14.0
  Using cached tensorflow-2.14.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.1 kB)
Collecting tensorflow-hub
  Using cached tensorflow_hub-0.16.1-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting tensorflow-text
  Using cached tensorflow_text-2.19.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Collecting numpy==1.24.3
  Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting ml-dtypes==0.2.0
  Using cached ml_dtypes-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Collecting protobuf==3.20.*
  Using cached protobuf-3.20.3-py2.py3-none-any.whl.metadata (720 bytes)
Collecting absl-py>=1.0.0 (from tensorflow==2.14.0)
  Using cached absl_py-2.2.2-py3-none-any.whl.metadata (2.6 kB)
Collecting astunparse>=1.6.0 (from tensorflow==2.14.0)
  Using cached astunparse-1.6.3-py2.py3-none-any.whl.metadata (4.4 kB)
Collecti

In [None]:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text

bert_preprocess_model = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
bert_model =  'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'

def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(bert_preprocess_model, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer(bert_model, trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  #net = tf.keras.layers.Dropout(0.1)(net)
  #net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
  return tf.keras.Model(text_input, net)

bert_embedding_model= build_classifier_model()

In [None]:
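import sys
bert_embedding_model.summary()
# sys.getsizeof reports only the shallow size of the Python wrapper object, not the weights;
# count_params() gives the number of model parameters instead.
print('bert wrapper object size (bytes)', sys.getsizeof(bert_embedding_model))
print('bert parameter count', bert_embedding_model.count_params())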

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 text (InputLayer)           [(None,)]                    0         []                            
                                                                                                  
 preprocessing (KerasLayer)  {'input_mask': (None, 128)   0         ['text[0][0]']                
                             , 'input_type_ids': (None,                                           
                              128),                                                               
                              'input_word_ids': (None,                                            
                             128)}                                                                
                                                                                              

In [None]:
texts= tf.constant(["This is my name", "I am called by this name"])
type(bert_embedding_model.predict(texts))



numpy.ndarray

In [None]:
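# concat_vec = np.hstack(X, bert_embedding_model(tf.constant(df_imdb['review_stoppunc_removed'].tolist())))
# The call above fails: np.hstack takes a single sequence of arrays, e.g. np.hstack((a, b)), and X is a
# scipy sparse matrix that needs converting to dense (or stacking with scipy.sparse.hstack) first.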
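The operation being attempted is a row-wise concatenation of two feature blocks that have the same number of rows. The dependency-free sketch below shows that shape logic with made-up numbers. With dense NumPy arrays `a` and `b`, `np.hstack((a, b))` does the same thing. The helper name `hstack_rows` and the values are hypothetical.

```python
def hstack_rows(left, right):
    """Concatenate two row-aligned feature blocks column-wise."""
    if len(left) != len(right):
        raise ValueError("both blocks need the same number of rows")
    return [list(l_row) + list(r_row) for l_row, r_row in zip(left, right)]

# Hypothetical features for two documents
tfidf_block = [[0.0, 0.7, 0.7], [1.0, 0.0, 0.0]]  # 2 rows x 3 TF-IDF columns
bert_block = [[0.12, -0.35], [0.48, 0.93]]         # 2 rows x 2 embedding columns

combined = hstack_rows(tfidf_block, bert_block)
print(len(combined), len(combined[0]))
```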

In [None]:
def choose_records(x_np, start,end):
    return x_np[start:end][:]

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer # import the TfidfVectorizer class

vectorizer = TfidfVectorizer()

In [None]:
import pandas as pd

df_imdb = pd.read_csv('/content/Copy of IMDB Dataset.csv')


In [None]:
!pip install tensorflow_text
import tensorflow_hub as hub
import tensorflow_text as text
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet')

eng_words = set(nltk.corpus.words.words())
stop = stopwords.words('english')

# Load the IMDB dataset
df_imdb = pd.read_csv('/content/Copy of IMDB Dataset.csv', on_bad_lines='skip', quoting=3)

# Apply lower() only to string values
df_imdb['review'] = df_imdb['review'].apply(lambda x: x.lower() if isinstance(x, str) else x)


# Removing the html strips Helper for main 1
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

# Removing the square brackets Helper for main 1
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

# Removing the noisy text - Main 1
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text

# Lemmatize words
def my_lemmatizer(x):
    lemmatizer = WordNetLemmatizer()
    return ' '.join(list(map(lemmatizer.lemmatize, x.split())))

### Non English words removal
def non_eng_word_removal(x):
    return " ".join(w for w in nltk.wordpunct_tokenize(x) if w.lower() in eng_words or not w.isalpha())


df_imdb['review'] = df_imdb['review'].astype(str).apply(denoise_text)
# Removing stopwords
df_imdb['review_stopwords_removed'] = df_imdb['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

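# Removing punctuation and underscores. regex=True is needed in pandas 2.x, where
# str.replace treats the pattern as a literal string by default.
df_imdb["review_stoppunc_removed"] = df_imdb['review'].str.replace(r'[^\w\s]', ' ', regex=True)
df_imdb["review_stoppunc_removed"] = df_imdb['review_stoppunc_removed'].apply(lambda x: x.replace("_", " "))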

### Lemmatizing and removing non eng words
df_imdb['review_stoppunc_removed'] = df_imdb['review_stoppunc_removed'].apply(lambda x: my_lemmatizer(x))
df_imdb['review_stoppunc_removed'] = df_imdb['review_stoppunc_removed'].apply(lambda x: non_eng_word_removal(x))

# Load a pre-trained BERT model for embedding
bert_embedding_model = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")

# Load the preprocessor associated with the BERT model
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")

# Initialize the TF-IDF vectorizer (if not already done)
vectorizer = TfidfVectorizer() # Example, adjust as needed

# Fit the vectorizer to your training data
# Assuming 'df_imdb' is your DataFrame and 'review_stoppunc_removed' is the text column
vectorizer.fit(df_imdb['review_stoppunc_removed'].tolist())  # Fit before using transform

batch_size = 10
start = 0
end = batch_size
user_query = "Hey how are you"

def vectorize_user_query(user_query):
    user_query = [user_query]

    # 1. Preprocess the user query using the preprocessor
    encoder_inputs = preprocessor(user_query)

    # 2. Get BERT embeddings using the preprocessed input
    bert_vec = bert_embedding_model(encoder_inputs)['pooled_output'].numpy()

    # 3. Get TF-IDF vector
    tfidf_vec = vectorizer.transform(user_query).toarray()

    # 4. Horizontally stack the vectors
    return np.hstack((tfidf_vec, bert_vec))


# Example usage
vectorized_query = vectorize_user_query(user_query)
print(vectorized_query)



[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


[[ 0.          0.          0.         ... -0.35199001 -0.66522896
   0.93488419]]



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import tensorflow_hub as hub
import tensorflow_text as text
import tensorflow as tf
import numpy as np
import gc
from sklearn.metrics.pairwise import cosine_similarity

# Assuming df_imdb and other necessary variables are already defined

# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit the vectorizer and transform the text data
# Assuming 'df_imdb' is your DataFrame and 'review_stoppunc_removed' is the text column
X = vectorizer.fit_transform(df_imdb['review_stoppunc_removed'].tolist())  # Calculate TF-IDF vectors

# Load the preprocessor associated with the BERT model
preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")


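# Vectorize the query once instead of once per batch
query_vec = vectorize_user_query(user_query)

n_rows = X.shape[0]
similarities = np.zeros(n_rows, dtype=np.float32)
# Step through the data in batches; the last batch is clipped to the data length
for start in range(0, n_rows, batch_size):
    end = min(start + batch_size, n_rows)
    # Preprocess the text data before passing it to the BERT model
    batch_texts = df_imdb['review_stoppunc_removed'].iloc[start:end].tolist()
    encoder_inputs = preprocessor(batch_texts)
    bert_embeddings = bert_embedding_model(encoder_inputs)['pooled_output'].numpy()

    # Densify only the current batch of TF-IDF rows, not the whole matrix
    concat_vec = np.hstack((X[start:end].toarray(), bert_embeddings))
    similarities[start:end] = cosine_similarity(concat_vec, query_vec).ravel()
    del concat_vec
    gc.collect()

# Assign the whole column in one step; per-row chained assignment such as
# df['col'].iloc[j] = value triggers pandas' chained-assignment warning
df_imdb['similarity'] = similarities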

<class 'numpy.ndarray'>


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df_imdb['similarity'].iloc[j] = dist[i]
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_imdb['similarity'].i
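One detail worth checking in a batched loop like this is the final batch. Below is a minimal sketch, with hypothetical sizes, of computing batch boundaries so that the last batch is clipped to the data length and no empty batch is produced. The helper name `batch_bounds` is an assumption for illustration.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, end) index pairs covering range(n_rows) without overrun."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# 25 rows in batches of 10: the last batch holds the remaining 5 rows
partial = list(batch_bounds(25, 10))
print(partial)

# An exact multiple produces no trailing empty batch
exact = list(batch_bounds(20, 10))
print(exact)
```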


<class 'numpy.ndarray'>


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df_imdb['similarity'].iloc[j] = dist[i]


<class 'numpy.ndarray'>


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df_imdb['similarity'].iloc[j] = dist[i]


<class 'numpy.ndarray'>


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df_imdb['similarity'].iloc[j] = dist[i]


<class 'numpy.ndarray'>


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df_imdb['similarity'].iloc[j] = dist[i]


<class 'numpy.ndarray'>


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df_imdb['similarity'].iloc[j] = dist[i]


<class 'numpy.ndarray'>


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df_imdb['similarity'].iloc[j] = dist[i]


<class 'numpy.ndarray'>


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df_imdb['similarity'].iloc[j] = dist[i]


<class 'numpy.ndarray'>


You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df_imdb['similarity'].iloc[j] = dist[i]


KeyboardInterrupt: 
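
The warning above is raised by `df_imdb['similarity'].iloc[j] = dist[i]`, which selects the column first and then assigns into the intermediate object. A minimal sketch of the single-step assignment the warning recommends is shown below. The frame `df_demo` and the scores in `dist_demo` are made-up stand-ins, not the notebook's data, and the loop index is a simplification of the original `i`/`j` logic.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for df_imdb and dist (illustrative values only)
df_demo = pd.DataFrame({"review": ["first", "second", "third"], "similarity": 0.0})
dist_demo = np.array([0.12, 0.87, 0.45])

# Resolve the column position once, then assign row and column in one step
sim_col = df_demo.columns.get_loc("similarity")
for j, score in enumerate(dist_demo):
    df_demo.iloc[j, sim_col] = score  # no chained indexing, so no warning

# When every row gets a score, a single vectorized assignment avoids the loop
df_demo["similarity"] = dist_demo
```

Writing the whole column at once also removes the per-row Python loop, which may matter on a dataset of this size given that the cell above had to be interrupted.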

In [None]:
df_imdb['similarity']

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,similarity
"""One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right",as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence,which set in right from the word GO. Trust me,this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs,sex or violence. Its is hardcore,in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City,an experimental section of the prison where all the cells have glass fronts and face inwards,so privacy is not high on the agenda. Em City is home to many..Aryans,Muslims,gangstas,Latinos,Christians,Italians,Irish and more....so scuffles,death stares,dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fact that it goes where other shows wouldn't dare. Forget pretty pictures painted for mainstream audiences,forget charm,forget romance...OZ doesn't mess around. The first episode I ever saw struck me as so nasty it was surreal,I couldn't say I was ready for it,but as I watched more,I developed a taste for Oz,and got accustomed to the high levels of graphic violence. Not just violence,but injustice (crooked guards who'll be sold out for a nickel,inmates who'll kill on order and get away with it,well mannered,middle class inmates being turned into prison bitches due to their lack of street skills or prison experience) Watching Oz,0.0
"""A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting",and sometimes discomforting,"sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only """"has got all the polari"""" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries",not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which,rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses,"particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.""",positive,,,,,,,,,,,,,,,,,,,,0.0
"""I thought this was a wonderful way to spend time on a too hot summer weekend",sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic,but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction,I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson,"in this she managed to tone down her """"sexy"""" image and jumped right into a average",but spirited young woman.<br /><br />This may not be the crown jewel of his career,"but it was wittier than """"Devil Wears Prada"""" and more interesting than """"Superman"""" a great comedy to go see with friends.""",positive,,,,,,,,,,,,,,,,,,,0.0
"""Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly",Jake decides to become Rambo and kill the zombie.<br /><br />OK,first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie,"and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them.""",negative,,,,,,,,,,,,,,,,,,,,,,0.0
"""Petter Mattei's """"Love in the Time of Money"""" is a visually stunning film to watch. Mr. Mattei offers us a vivid portrait about human relations. This is a movie that seems to be telling us what money",power and success do to people in the different situations we encounter. <br /><br />This being a variation on the Arthur Schnitzler's play about the same theme,the director transfers the action to the present time New York where all these different characters meet and connect. Each one is connected in one way,or another to the next person,but no one seems to know the previous point of contact. Stylishly,the film has a sophisticated luxurious look. We are taken to see how these people live and the world they live in their own habitat.<br /><br />The only thing one gets out of all these souls in the picture is the different stages of loneliness each one inhabits. A big city is not exactly the best place in which human relations find sincere fulfillment,as one discerns is the case with most of the people we encounter.<br /><br />The acting is good under Mr. Mattei's direction. Steve Buscemi,Rosario Dawson,Carol Kane,Michael Imperioli,Adrian Grenier,and the rest of the talented cast,"make these characters come alive.<br /><br />We wish Mr. Mattei good luck and await anxiously for his next work.""",positive,,,,,,,,,,,,,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"""I thought this movie did a down right good job. It wasn't as creative or original as the first",but who was expecting it to be. It was a whole lotta fun. the more i think about it the more i like it,and when it comes out on DVD I'm going to pay the money for it very proudly,every last cent. Sharon Stone is great,she always is,even if her movie is horrible(Catwoman),but this movie isn't,this is one of those movies that will be underrated for its lifetime,and it will probably become a classic in like 20 yrs. Don't wait for it to be a classic,watch it now and enjoy it. Don't expect a masterpiece,or something thats gripping and soul touching,just allow yourself to get out of your life and get yourself involved in theirs.<br /><br />All in all,this movie is entertaining and i recommend people who haven't seen it see it,because what the critics and box office say doesn't always count,see it for yourself,you never know,"you might just enjoy it. I tip my hat to this movie<br /><br />8/10""",positive,,,,,,,,,0.0
"""Bad plot",bad dialogue,bad acting,idiotic directing,the annoying porn groove soundtrack that ran continually over the overacted script,and a crappy copy of the VHS cannot be redeemed by consuming liquor. Trust me,because I stuck this turkey out to the end. It was so pathetically bad all over that I had to figure it was a fourth-rate spoof of Springtime for Hitler.<br /><br />The girl who played Janis Joplin was the only faint spark of interest,and that was only because she could sing better than the original.<br /><br />If you want to watch something similar but a thousand times better,"then watch Beyond The Valley of The Dolls.""",negative,,,,,,,,,,,,,,,,,0.0
"""I am a Catholic taught in parochial elementary schools by nuns","taught by Jesuit priests in high school & college. I am still a practicing Catholic but would not be considered a """"good Catholic"""" in the church's eyes because I don't believe certain things or act certain ways just because the church tells me to.<br /><br />So back to the movie...its bad because two people are killed by this nun who is supposed to be a satire as the embodiment of a female religious figurehead. There is no comedy in that and the satire is not done well by the over acting of Diane Keaton. I never saw the play but if it was very different from this movies then it may be good.<br /><br />At first I thought the gun might be a fake and the first shooting all a plan by the female lead of the four former students as an attempt to demonstrate Sister Mary's emotional and intellectual bigotry of faith. But it turns out the bullets were real and the story has tragedy...the tragedy of loss of life (besides the two former students...the lives of the aborted babies",the life of the student's mom),the tragedy of dogmatic authority over love of people,the tragedy of organized religion replacing true faith in God. This is what is wrong with today's Islam,"and yesterday's Judaism and Christianity.""",negative,,,,,,,,,,,,,,,,,,,,0.0
"""I'm going to have to disagree with the previous comment and side with Maltin on this one. This is a second rate","excessively vicious Western that creaks and groans trying to put across its central theme of the Wild West being tamed and kicked aside by the steady march of time. It would like to be in the tradition of """"Butch Cassidy and the Sundance Kid""""",but lacks that film's poignancy and charm. Andrew McLaglen's direction is limp,and the final 30 minutes or so are a real botch,with some incomprehensible strategy on the part of heroes Charlton Heston and Chris Mitchum. (Someone give me a holler if you can explain to me why they set that hillside on fire.) There was something callous about the whole treatment of the rape scene,and the woman's reaction afterwards certainly did not ring true. Coburn is plenty nasty as the half breed escaped convict out for revenge,but all of his fellow escapees are underdeveloped (they're like bowling pins to be knocked down one by one as the story lurches forward). Michael Parks gives one of his typically shifty,lethargic,mumbling performances,"but in this case it was appropriate as his modern style sheriff symbolizes the complacency that technological progress can bring about.""",negative,,,,,,,,,,,,,,,,0.0


In [None]:
max(df_imdb['similarity'])

0.9765613282392966
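
The built-in `max` returns only the score. As a hedged sketch, pandas' own `Series.max`, `Series.idxmax`, and `Series.nlargest` also report which rows hold the highest scores. The Series `sim_demo` below uses invented values in place of `df_imdb['similarity']`.

```python
import pandas as pd

# Hypothetical similarity scores (illustrative values only)
sim_demo = pd.Series([0.0, 0.31, 0.98, 0.42], name="similarity")

top_score = sim_demo.max()      # highest score in the column
top_label = sim_demo.idxmax()   # index label of the row holding that score
top3 = sim_demo.nlargest(3)     # three highest scores with their index labels

print(top_score, top_label)
print(top3)
```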