<a href="https://colab.research.google.com/github/sai-bharghav/Text-Classifier/blob/main/Text_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import modules

In [None]:
import nltk
import numpy as np
import pandas as pd

In [None]:
nltk.download('stopwords')  # Only the stop-word list is used in this notebook

# Import the data

The dataset we are working with is hosted on [GitHub](https://github.com/futurexskill/ml-model-deployment/blob/main/Restaurant_Reviews.tsv.txt); the raw file is available [here](https://raw.githubusercontent.com/futurexskill/ml-model-deployment/main/Restaurant_Reviews.tsv.txt).

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/futurexskill/ml-model-deployment/main/Restaurant_Reviews.tsv.txt',delimiter='\t', quoting = 3)

df.head()

| | Review | Liked |
|---|---|---|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In NLP, words such as "a", "an", and "the" tell us nothing about whether a sentence is positive or negative, yet they still take up space in the vocabulary. These words are called **stop words**.

What are stop words (definition)?
>Stop words are a set of commonly used words in a language. Examples of stop words in English are “a,” “the,” “is,” “are,” etc. Stop words are commonly used in Text Mining and Natural Language Processing (NLP) to eliminate words that are so widely used that they carry very little useful information.

In [None]:
from nltk.corpus import stopwords
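
To make the idea concrete, here is a minimal sketch of stop-word filtering. The set `toy_stop_words` and the helper `remove_stop_words` are assumptions made up for illustration; the notebook itself relies on NLTK's much longer English list.

```python
# Hypothetical, hand-picked stop words for illustration only;
# NLTK's English list is far longer.
toy_stop_words = {"a", "an", "the", "this", "is", "are", "was", "and"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word set."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = "the crust is not good".split()
filtered = remove_stop_words(tokens, toy_stop_words)
print(filtered)  # ['crust', 'not', 'good']
```

Note that NLTK's English list also includes negations such as "not", so a review like "Crust is not good." loses its negation once the real list is applied.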

We also have words such as 'running' and 'runs', which are different forms of the same root word 'run'. In NLP, **stemming** reduces such words to a common stem, which shrinks the vocabulary and the complexity of the model's input.

What is stemming?
> Stemming is a text preprocessing technique used in natural language processing (NLP) to reduce words to their root or base form. The goal of stemming is to simplify and standardize words, which helps improve the performance of information retrieval, text classification, and other NLP tasks.

In [None]:
from nltk.stem.porter import PorterStemmer

To use the stemmer in code, we have to instantiate the `PorterStemmer` class.

In [None]:
ps = PorterStemmer()
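
As a quick check of what the stemmer does, the sketch below stems a few sample words. The word list is made up for illustration; only `PorterStemmer.stem` comes from NLTK.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Inflected forms collapse to a shared stem.
words = ["running", "runs", "loved"]
stems = [ps.stem(w) for w in words]
print(stems)  # ['run', 'run', 'love']
```

The stemmer works by stripping suffixes with fixed rules, so it needs no downloaded corpus data.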

Let us get an idea of the dataframe and the columns we are dealing with.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  1000 non-null   object
 1   Liked   1000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 15.8+ KB


We are dealing with 1000 rows and two columns: `Review` with dtype `object` and `Liked` with dtype `int64`. Neither column has missing values.

# Create the loop to build our corpus of words

Our plan is to loop through all 1000 rows, remove the stop words, and apply stemming. This produces a corpus of clean text.

In [None]:
import re

corpus = []
# Build the stop-word set once instead of rebuilding it for every word
stop_words = set(stopwords.words('english'))

for i in range(len(df)):
  # Replace everything except letters (punctuation, digits, symbols) with a space
  customer_review = re.sub('[^a-zA-Z]', ' ', df['Review'][i])
  customer_review = customer_review.lower()  # Convert to lower case
  customer_review = customer_review.split()  # Split into a list of words on whitespace

  # Drop the stop words and stem the remaining words
  clean_review = [ps.stem(word) for word in customer_review if word not in stop_words]

  # Join the list of stems back into a single string
  clean_review = ' '.join(clean_review)
  # Append the clean review to the corpus
  corpus.append(clean_review)

Once the loop has completed, `corpus` holds one cleaned string per review.

Let us check the first entry in the corpus

In [None]:
corpus[0]

'wow love place'

In [None]:
df['Review'][0]

'Wow... Loved this place.'

Comparing the two cells above, we can see that every letter has been lowercased, the punctuation is gone, the stop word 'this' has been dropped, and 'Loved' has been stemmed to 'love'.
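
The regex and lowercasing steps from the loop can be tried on a single review in isolation. This is a minimal sketch; the helper name `tokenize` is an assumption for illustration and does not appear in the notebook.

```python
import re

def tokenize(review):
    """Replace non-letters with spaces, lowercase, and split on whitespace."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", review)
    return letters_only.lower().split()

tokens = tokenize("Wow... Loved this place.")
print(tokens)  # ['wow', 'loved', 'this', 'place']
```

Stop-word removal and stemming then turn this token list into the corpus entry `'wow love place'`.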

# Convert text into numerical array
Let us convert the text into a numeric array. For this we will use `sklearn.feature_extraction.text.TfidfVectorizer`.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features = 1500, min_df= 3, max_df=0.6)

X = vectorizer.fit_transform(corpus).toarray()

X

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

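* `max_features`: This parameter caps the vocabulary at the most frequent terms across the corpus. With `max_features=1500`, at most 1500 terms are kept.

* `min_df`: This parameter specifies the minimum number of documents in which a term (word) must appear to be included in the vocabulary. Here `min_df=3`, so a term must appear in at least three reviews to be considered. This helps exclude very rare terms that may not carry much meaningful information.

* `max_df`: This parameter specifies the maximum number of documents in which a term can appear and still be included in the vocabulary. A float between 0.0 and 1.0 is read as a proportion of documents, so `max_df=0.6` drops terms that appear in more than 60% of the reviews. An integer (e.g., `max_df=5`) is read as an absolute count.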
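
The effect of the two document-frequency thresholds is easier to see on a tiny corpus. The sketch below uses a made-up `toy_corpus` and the values `min_df=2`, `max_df=0.6`, which are assumptions for illustration rather than the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical five-document corpus for illustration.
toy_corpus = [
    "good food",
    "good service",
    "good place slow service",
    "good food nice place",
    "terrible",
]

# min_df=2: keep terms found in at least 2 documents.
# max_df=0.6: drop terms found in more than 60% of the documents.
toy_vectorizer = TfidfVectorizer(min_df=2, max_df=0.6)
toy_X = toy_vectorizer.fit_transform(toy_corpus)

kept_terms = sorted(toy_vectorizer.vocabulary_)
print(kept_terms)   # ['food', 'place', 'service']
print(toy_X.shape)  # (5, 3)
```

"good" appears in four of five documents and is removed by `max_df`, while "slow", "nice", and "terrible" appear only once and are removed by `min_df`.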

In [None]:
# Shape of X
X.shape

(1000, 467)

In [None]:
# Let us check a sample record and the shape of it
X[0],X[0].shape


(array([0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.  

In [None]:
# Get the target variable and convert it into an array
y = df.iloc[:,1].values

In [None]:
y

array([1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1,

# Split the data

Before fitting a model we have to split the data into training and test sets. For this we will use `train_test_split` from `sklearn.model_selection`.

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,
                                                 y,
                                                 test_size=0.2,
                                                 random_state=0)
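
To see what `test_size` and `random_state` do, here is a small sketch on made-up arrays (`demo_X` and `demo_y` are illustrative names, not the notebook's data). With `test_size=0.2`, one fifth of the rows go to the test set, which is how 1000 reviews become 800 training and 200 test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
demo_X = np.arange(20).reshape(10, 2)
demo_y = np.array([0, 1] * 5)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=0
)
print(demo_X_train.shape, demo_X_test.shape)  # (8, 2) (2, 2)
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell yields the same split.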

We can build a custom neural network model for the classification problem above.

In [None]:
# Import the required modules for PyTorch
import torch
import torch.nn as nn
from torch.nn import functional as F

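The TF-IDF features are `float64` NumPy arrays, while PyTorch layers use `float32` by default, so we convert the features with `.float()`. The labels stay as `int64`, which is what PyTorch's classification losses expect for class indices.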

In [None]:
Xtrain_ = torch.from_numpy(X_train).float()
Xtest_ = torch.from_numpy(X_test).float()

ytrain_ =torch.from_numpy(y_train)
ytest_= torch.from_numpy(y_test)
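
The dtype behaviour can be checked on small made-up arrays. In this sketch `features` and `labels` are illustrative names, not the notebook's variables.

```python
import numpy as np
import torch

features = np.zeros((2, 3))                  # NumPy floats default to float64
labels = np.array([1, 0], dtype=np.int64)    # class indices as 64-bit integers

features_t = torch.from_numpy(features)      # keeps the NumPy dtype
labels_t = torch.from_numpy(labels)

print(features_t.dtype)          # torch.float64
print(features_t.float().dtype)  # torch.float32
print(labels_t.dtype)            # torch.int64
```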

In [None]:
# Check the shape of all the tensors

Xtrain_.shape,Xtest_.shape

(torch.Size([800, 467]), torch.Size([200, 467]))

In [None]:
ytrain_.shape,ytest_.shape

(torch.Size([800]), torch.Size([200]))

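From the shapes of the tensors and the two label values we can infer the parameters for the neural network:
* `input_size`: 467 (one input per TF-IDF feature)
* `output_size`: 2 (one output per class, liked or not liked)
* `hidden_size`: 500 (this is our own choice)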

# Build the model

In [None]:
class Net(nn.Module):
  def __init__(self,
               input_size:int,
               output_size:int,
               hidden_size:int):
    super().__init__()
    self.fc1 = nn.Linear(input_size,hidden_size)
    self.fc2 = nn.Linear(hidden_size,hidden_size)
    self.fc3 = nn.Linear(hidden_size,output_size)

  def forward(self, X):
    X = torch.relu(self.fc1(X))
    X = torch.relu(self.fc2(X))
    # Map the hidden features to one score per class
    X = self.fc3(X)

    # Log-probabilities over the classes, as expected by nn.NLLLoss
    return F.log_softmax(X,dim=1)

model = Net(input_size=467,output_size=2,hidden_size=500)

In [None]:
# Setup the loss function and optimizer
optimizer = torch.optim.Adam(model.parameters(),lr=0.01)
loss_fn = nn.NLLLoss()
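
`nn.NLLLoss` expects log-probabilities, which is why the model ends in `log_softmax`. The sketch below, on made-up `logits` and `targets`, checks that this pairing matches `nn.CrossEntropyLoss` applied to raw scores, and that a model with no preference between two classes starts at a loss of ln 2 (about 0.693).

```python
import math

import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
logits = torch.randn(4, 2)             # hypothetical raw scores: 4 samples, 2 classes
targets = torch.tensor([1, 0, 0, 1])   # hypothetical class indices

# log_softmax followed by NLLLoss ...
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
# ... gives the same value as CrossEntropyLoss on the raw scores.
ce = nn.CrossEntropyLoss()(logits, targets)
print(torch.allclose(nll, ce))  # True

# Equal scores for both classes give a uniform prediction, so the loss is ln(2).
uniform_loss = nn.NLLLoss()(F.log_softmax(torch.zeros(4, 2), dim=1), targets)
print(uniform_loss.item(), math.log(2))
```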

In [None]:
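epochs = 100

model.train()
for epoch in range(epochs):
  y_pred = model(Xtrain_)          # Forward pass on the full training set
  loss = loss_fn(y_pred,ytrain_)   # Negative log-likelihood of the true labels
  optimizer.zero_grad()            # Clear gradients from the previous step
  loss.backward()                  # Backpropagate
  optimizer.step()                 # Update the weights

  print(f'Epoch : {epoch} | loss : {loss.item()}')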

Epoch : 0 | loss : 6.18903923034668
Epoch : 1 | loss : 5.952858924865723
Epoch : 2 | loss : 5.662379741668701
Epoch : 3 | loss : 5.251699924468994
Epoch : 4 | loss : 4.698814868927002
Epoch : 5 | loss : 3.9946401119232178
Epoch : 6 | loss : 3.1487953662872314
Epoch : 7 | loss : 2.229799270629883
Epoch : 8 | loss : 1.4206492900848389
Epoch : 9 | loss : 0.9075080156326294
Epoch : 10 | loss : 0.6682791113853455
Epoch : 11 | loss : 0.5627195239067078
Epoch : 12 | loss : 0.5028825402259827
Epoch : 13 | loss : 0.45663169026374817
Epoch : 14 | loss : 0.41241705417633057
Epoch : 15 | loss : 0.3703853189945221
Epoch : 16 | loss : 0.33464598655700684
Epoch : 17 | loss : 0.30373966693878174
Epoch : 18 | loss : 0.27550116181373596
Epoch : 19 | loss : 0.25167638063430786
Epoch : 20 | loss : 0.23228786885738373
Epoch : 21 | loss : 0.21463662385940552
Epoch : 22 | loss : 0.19800984859466553
Epoch : 23 | loss : 0.18350453674793243
Epoch : 24 | loss : 0.1708034873008728
Epoch : 25 | loss : 0.1588782817

# Predict on the test values

In [None]:
# Let us predict on the Xtest_ values

model.eval()
with torch.inference_mode():
  test_preds = model(Xtest_)

  test_loss = loss_fn(test_preds,ytest_)

  # Report the loss computed on the test set
  print(f'Test Loss | {test_loss}')

Test Loss | 0.034299105405807495
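
A loss value alone is hard to interpret, so one possible follow-up is to turn the log-probabilities into class predictions and measure accuracy. The sketch below uses made-up tensors (`toy_log_probs`, `toy_labels`) rather than the notebook's results.

```python
import torch

# Hypothetical log-probabilities for 4 reviews and 2 classes.
toy_log_probs = torch.log(torch.tensor([[0.9, 0.1],
                                        [0.2, 0.8],
                                        [0.6, 0.4],
                                        [0.3, 0.7]]))
toy_labels = torch.tensor([0, 1, 1, 1])

# The predicted class is the index with the highest log-probability.
predicted = toy_log_probs.argmax(dim=1)
accuracy = (predicted == toy_labels).float().mean().item()
print(predicted.tolist(), accuracy)  # [0, 1, 0, 1] 0.75
```

The same two lines could be applied to `test_preds` and `ytest_` inside the evaluation cell.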
