<a href="https://colab.research.google.com/github/subhradipXD/Sentiment-Analysis/blob/main/MRSA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **A Sentiment Analysis Case Study**

## **Fetching Data**

In [3]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/DataSet/IMDB Dataset.csv')

In [None]:
df.head(20)

In [None]:
df['review'][0]

## **Text Cleaning**

In [6]:
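# Encode labels as integers; assign the result back rather than calling inplace on a column selection
df['sentiment'] = df['sentiment'].replace({'positive': 1, 'negative': 0})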

In [None]:
df.head()

In [None]:
import re
clean = re.compile('<.*?>')
re.sub(clean, '' ,df.iloc[2].review)

In [9]:
# remove html tags

def clean_text(text):
  clean = re.compile('<.*?>')
  return re.sub(clean, '' ,text)

In [10]:
df['review'] = df['review'].apply(clean_text)
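
To see what the pattern does, here is a minimal sketch on a made-up review string (the sample text is an assumption, not a row from the dataset). It compares the non-greedy `<.*?>` used above with a greedy `<.*>`, which would swallow everything between the first `<` and the last `>`.

```python
import re

# Hypothetical sample text; it only imitates the <br /> tags found in the reviews
sample = "A great film.<br /><br />I <b>loved</b> it."

tag_pattern = re.compile(r'<.*?>')     # non-greedy: each match stops at the nearest '>'
greedy_pattern = re.compile(r'<.*>')   # greedy: one match runs to the last '>'

cleaned = tag_pattern.sub('', sample)
over_stripped = greedy_pattern.sub('', sample)

print(cleaned)        # tags removed, text kept
print(over_stripped)  # text between the tags is lost as well
```

Replacing a tag with an empty string can fuse neighbouring words (`film.I` here); in this notebook the later special-character step turns the `.` into a space, so the two words are separated again.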

In [11]:
# convert into lower case

def convert_lower_text(text):
  return text.lower()

In [12]:
df['review'] = df['review'].apply(convert_lower_text)

In [None]:
df['review'][0]

In [14]:
# remove special characters

def remove_special_char(text):
  # Replace every non-alphanumeric character with a space
  return ''.join(ch if ch.isalnum() else ' ' for ch in text)

In [15]:
df['review'] = df['review'].apply(remove_special_char)
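
A short sketch of what this step does to punctuation, using a made-up sentence (an assumption, not dataset text). The function is redefined here only so the snippet runs on its own.

```python
def remove_special_char(text):
    # Same rule as above: keep alphanumeric characters, turn everything else into a space
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)

# Hypothetical sample sentence
sample = "it's a 10/10 film, truly!"
result = remove_special_char(sample)

print(repr(result))    # punctuation becomes spaces, so some gaps are doubled
print(result.split())  # split() collapses the repeated spaces
```

Note that contractions are broken apart (`it's` becomes `it` and `s`), and the doubled spaces are harmless because the later steps tokenize with `split()`.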

In [None]:
import nltk
nltk.download('stopwords')

In [17]:
from nltk.corpus import stopwords

In [18]:
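# Build the stopword set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
  # Return the list of tokens that are not English stopwords
  return [word for word in text.split() if word not in stop_words]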

In [None]:
stopwords.words('english')

In [20]:
df['review'] = df['review'].apply(remove_stopwords)
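
The filtering logic can be sketched without NLTK. The tiny stopword list below is a hypothetical stand-in for `stopwords.words('english')`, and the `stop_words` parameter is added only for illustration. Converting the list to a `set` once means each membership test is a hash lookup instead of a scan through the list, which matters when the function is applied to tens of thousands of reviews.

```python
# Hypothetical miniature stopword list standing in for stopwords.words('english')
STOP_WORDS_LIST = ['the', 'a', 'is', 'of', 'and', 'this']
STOP_WORDS_SET = set(STOP_WORDS_LIST)  # built once, reused for every review

def remove_stopwords(text, stop_words=STOP_WORDS_SET):
    # Keep only the tokens that are not in the stopword set
    return [word for word in text.split() if word not in stop_words]

tokens = remove_stopwords("this is the story of a brave and clever dog")
print(tokens)
```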

In [None]:
df['review']

In [22]:
# Stemming
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [23]:
def stem_words(tokens):
  # Reduce each token to its Porter stem; a local list avoids sharing state through a global
  return [ps.stem(token) for token in tokens]

In [24]:
df['review'] = df['review'].apply(stem_words)

In [25]:
def join_back(text):
  return " ".join(text)

In [26]:
df['review'] = df['review'].apply(join_back)

In [None]:
df['review']

## **Tf-Idf Vectorizer**

**TF-IDF consists of two main components:** Term Frequency (TF) and Inverse Document Frequency (IDF), which are combined to assign a weight to each term in a document. Let's break down how TF-IDF works:

**1. Term Frequency (TF):**

Term Frequency measures how often a term (word) appears in a specific document.
It is calculated for each term in a document using the following formula:

$$\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of terms in document } d}$$

The result is a value between 0 and 1; the closer it is to 1, the larger the share of the document that the term accounts for.
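
As a minimal sketch of this formula, the snippet below computes TF for a made-up four-word document. The helper name `term_frequency` and the sample text are assumptions for illustration.

```python
from collections import Counter

def term_frequency(tokens):
    # Count each term, then divide by the total number of terms in the document
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical document
doc = "good movie good plot".split()
tf = term_frequency(doc)
print(tf)
```

Here `good` appears twice in four tokens, so its TF is 0.5, while `movie` and `plot` each get 0.25.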

**2. Inverse Document Frequency (IDF):**

Inverse Document Frequency measures the importance of a term across the entire corpus.
It is calculated for each term in the corpus using the following formula:

$$\mathrm{IDF}(t) = \ln\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing term } t}\right)$$

The IDF value is higher for terms that appear in fewer documents across the corpus, indicating their importance.
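
The IDF formula can be sketched the same way on a made-up three-document corpus. The helper name `inverse_document_frequency` and the corpus are assumptions for illustration.

```python
import math

def inverse_document_frequency(corpus):
    # Count in how many documents each term appears (once per document)
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # Natural log of (number of documents / documents containing the term)
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# Hypothetical corpus of tokenized documents
corpus = [["good", "movie"], ["bad", "movie"], ["good", "plot", "movie"]]
idf = inverse_document_frequency(corpus)
for term in sorted(idf):
    print(term, round(idf[term], 3))
```

A term that occurs in every document (`movie` here) gets an IDF of exactly 0 under this formula, so it contributes nothing to the final score.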

**3. TF-IDF Score:**

The TF-IDF score for a term in a document is obtained by multiplying its TF and IDF values:

$$\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

The TF-IDF score reflects the importance of a term within a specific document while considering its rarity across the entire corpus. Here are some key points to understand about TF-IDF:

**High TF-IDF Score:** A high TF-IDF score means that the term is frequent in the document and rare across the rest of the corpus. Such terms are considered important for that specific document.

**Low TF-IDF Score:** A low TF-IDF score indicates that the term is either common in the entire corpus or not prevalent in the document. These terms are considered less important for distinguishing the document's content.

**Normalization:** To prevent bias towards longer documents, it's common to normalize TF by dividing it by the total number of terms in the document. This is called TF normalization.

**Logarithmic Scaling:** The IDF is computed on a logarithmic scale so that extremely rare terms do not receive disproportionately large weights.

**Vector Representation:** After calculating TF-IDF scores for all terms in all documents, you can represent each document as a TF-IDF vector, where each dimension corresponds to a term, and the value in each dimension is the TF-IDF score for that term.
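
Putting the pieces together, the following sketch builds TF-IDF vectors by hand for a made-up corpus, following the formulas in this section. All names and data here are assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical corpus of tokenized documents
corpus = [
    "good movie good plot".split(),
    "bad movie".split(),
    "good acting".split(),
]

# One vector dimension per distinct term, in a fixed (sorted) order
vocabulary = sorted({term for doc in corpus for term in doc})
n_docs = len(corpus)
doc_freq = Counter(term for doc in corpus for term in set(doc))
idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}

def tfidf_vector(tokens):
    # TF (count / document length) times IDF, for every term in the vocabulary
    counts = Counter(tokens)
    return [counts[term] / len(tokens) * idf[term] for term in vocabulary]

vectors = [tfidf_vector(doc) for doc in corpus]
print(vocabulary)
for vec in vectors:
    print([round(v, 3) for v in vec])
```

The numbers will not match scikit-learn's `TfidfVectorizer` exactly: with its default settings it uses raw term counts for TF, a smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1`, and L2-normalizes each document vector.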

In [28]:
#tf-Idf vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
x = vectorizer.fit_transform(df['review'])
y =  df['sentiment']

In [29]:
#train test split
from sklearn.model_selection import train_test_split
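# random_state fixes the shuffle so the split (and the reported scores) are reproducible
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=42)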

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100)
classifier.fit(x_train,y_train)

In [None]:
#making predictions
y_pred = classifier.predict(x_test)
#model accuracy (fraction of correct predictions, shown as a percentage)
print("Model Accuracy : {:.2f}%".format((y_pred == y_test).mean() * 100))
#confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test,y_pred))
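
The printed matrix can be read as follows: with the default label order `[0, 1]`, rows are true labels and columns are predicted labels, so the layout is `[[TN, FP], [FN, TP]]`. The sketch below derives the usual metrics from such a matrix. The counts are hypothetical and are not results from this notebook.

```python
# Hypothetical counts in scikit-learn's default layout: [[TN, FP], [FN, TP]]
cm = [[40, 10],
      [5, 45]]

(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tn + fp + fn + tp)   # share of all predictions that are correct
precision = tp / (tp + fp)                   # share of predicted positives that are truly positive
recall = tp / (tp + fn)                      # share of true positives that were found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy : {:.3f}".format(accuracy))
print("precision: {:.3f}".format(precision))
print("recall   : {:.3f}".format(recall))
print("f1       : {:.3f}".format(f1))
```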

In [None]:
import itertools
import numpy as np  # needed by plot_confusion_matrix (np.newaxis, np.arange)
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
print(confusion_matrix(y_test, y_pred, labels=[1,0]))

In [None]:
# Compute confusion matrix
import numpy as np
cnf_matrix = confusion_matrix(y_test, y_pred, labels=[1,0])
np.set_printoptions(precision=2)


# Plot non-normalized confusion matrix
plt.figure(figsize=(13,5))
plot_confusion_matrix(cnf_matrix, classes=['positive=1','negative=0'],normalize= False,  title='Confusion matrix')
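
The `normalize=True` option of `plot_confusion_matrix` divides every row by its row sum. A minimal sketch of that step in plain Python, using hypothetical counts (not results from this notebook) with rows ordered positive, negative as in `labels=[1,0]`:

```python
# Hypothetical counts; rows are true labels (positive, negative), columns are predictions
cm = [[45, 5],
      [10, 40]]

# Divide each cell by the total of its row, so every row sums to 1
normalized = [[cell / sum(row) for cell in row] for row in cm]

for row in normalized:
    print(["{:.2f}".format(value) for value in row])
```

After row normalization the diagonal holds the recall of each class. To draw this view for the model above, the same function can be called with `normalize=True`, for example `plot_confusion_matrix(cnf_matrix, classes=['positive=1','negative=0'], normalize=True, title='Normalized confusion matrix')`.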