This presentation explores an unsupervised machine learning approach to perform aspect-based sentiment analysis on product reviews from Flipkart. Our goal is to automatically identify specific aspects of products that customers discuss in their reviews and determine the sentiment (positive, negative, or neutral) expressed towards those aspects.

Aspect-based sentiment analysis (ABSA) is a text analysis technique that categorizes data by aspect and identifies the sentiment attributed to each one. We will use it to analyze customer feedback on products for aspects such as price, service, and quality.

### Import required Libraries
* nltk (Natural Language Toolkit), a suite of text-processing libraries for tokenization, parsing, classification, stemming, tagging, and semantic reasoning.
* CountVectorizer from scikit-learn for converting text data into a matrix of token counts, facilitating the process of feature extraction from text data.


In [1]:
import pandas as pd
import torch
import numpy as np
import sqlite3
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


<h1  style="text-align: center" class="list-group-item list-group-item-action active">1. Data Preparation </h1><a id = "2" ></a>

In our case, the data comes from Flipkart's web pages and was obtained through online open data resources.

In [3]:
import sqlite3

# Read sqlite query results into a pandas DataFrame
con = sqlite3.connect("/content/drive/MyDrive/Colab Notebooks/Data/flipkart_products.db")
items = pd.read_sql_query("SELECT * from items", con)
con.close()

The data, fetched from the Flipkart India web pages, contains 83 tables: one table holds information about the other 82, and each of those 82 tables contains the reviews of one mobile phone.

The table that describes the mobile phones has the following attributes.

 - product_id    
 - product_name  
 - price         
 - category      
 - sub_category  
 - specifications
 - ratings       
 - discount      
 - moreinfo      

Each of the 82 review tables has the following attributes.

 - product_id  
 - review_id   
 - title       
 - review      
 - likes       
 - dislikes    
 - ratings     
 - reviewer    



In [4]:
items.head()

Unnamed: 0,product_id,product_name,price,category,sub_category,specifications,ratings,discount,moreinfo
0,ECMB000001,"Redmi 9A (SeaBlue, 32 GB)","₹7,413",Electronics,Mobile,2 GB RAM | 32 GB ROM16.59 cm (6.53 inch) Full ...,4.3,3.0,/redmi-9a-seablue-32-gb/p/itmeabd39a0cd669?pid...
1,ECMB000002,"Redmi 9A (Midnight Black, 32 GB)","₹7,421",Electronics,Mobile,2 GB RAM | 32 GB ROM16.59 cm (6.53 inch) Full ...,4.3,3.0,/redmi-9a-midnight-black-32-gb/p/itmeabd39a0cd...
2,ECMB000003,"Redmi 9A (Nature Green, 32 GB)","₹7,384",Electronics,Mobile,2 GB RAM | 32 GB ROM16.59 cm (6.53 inch) Full ...,4.3,4.0,/redmi-9a-nature-green-32-gb/p/itmeabd39a0cd66...
3,ECMB000004,"Redmi 9 (Carbon Black, 64 GB)","₹10,745",Electronics,Mobile,4 GB RAM | 64 GB ROM16.59 cm (6.53 inch) HD+ D...,4.2,,/redmi-9-carbon-black-64-gb/p/itm4fb151383983b...
4,ECMB000005,"Redmi 9 (Sky Blue, 64 GB)","₹10,489",Electronics,Mobile,4 GB RAM | 64 GB ROM16.59 cm (6.53 inch) HD+ D...,4.2,,/redmi-9-sky-blue-64-gb/p/itm4fb151383983b?pid...


In [5]:
items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82 entries, 0 to 81
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   product_id      82 non-null     object
 1   product_name    82 non-null     object
 2   price           82 non-null     object
 3   category        82 non-null     object
 4   sub_category    82 non-null     object
 5   specifications  82 non-null     object
 6   ratings         82 non-null     object
 7   discount        34 non-null     object
 8   moreinfo        82 non-null     object
dtypes: object(9)
memory usage: 5.9+ KB


The items table works like the index of a book: each row identifies one product, and the product ID is also the name of that product's review table.

In [6]:
con = sqlite3.connect("/content/drive/MyDrive/Colab Notebooks/Data/flipkart_products.db")

df = pd.read_sql_query("SELECT * from ECMB000001", con)

for i in range(2, len(items) + 1):

    df_temp = pd.read_sql_query("SELECT * from ECMB{:06d}".format(i), con)
    df = pd.concat([df, df_temp])
con.close()
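The loop above assumes that the review tables are named `ECMB000001` through `ECMB000082`. As a minimal sketch, the table names could instead be read from SQLite's built-in `sqlite_master` catalog. The in-memory database and its toy tables below are made up for illustration and only stand in for `flipkart_products.db`.

```python
import sqlite3

# Hypothetical in-memory database standing in for flipkart_products.db
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (product_id TEXT)")
con.execute("CREATE TABLE ECMB000001 (review TEXT)")
con.execute("CREATE TABLE ECMB000002 (review TEXT)")

# sqlite_master lists every table stored in the database
rows = con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()

# Keep only the review tables, which share the ECMB prefix
review_tables = [name for (name,) in rows if name.startswith("ECMB")]
con.close()

print(review_tables)
```

Each name in `review_tables` could then be passed to `pd.read_sql_query`, collecting the frames in a list and concatenating them once at the end.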


In [7]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 53493 entries, 0 to 289
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   product_id  53493 non-null  object
 1   review_id   53493 non-null  object
 2   title       53493 non-null  object
 3   review      53493 non-null  object
 4   likes       53493 non-null  object
 5   dislikes    53493 non-null  object
 6   ratings     48488 non-null  object
 7   reviewer    53493 non-null  object
dtypes: object(8)
memory usage: 3.7+ MB


In [8]:
df.head()

Unnamed: 0,product_id,review_id,title,review,likes,dislikes,ratings,reviewer
0,ECMB000001,ECMB0000010000001,Excellent,Wow superb I love it❤️👍 battery backup so nice 👍👍,740,160,5,Abhishek Saini
1,ECMB000001,ECMB0000010000002,Worth the money,Mobile So Good In Range Redmi 9a Has Miui 12 L...,355,104,4,Dinesh Kumar Sahni
2,ECMB000001,ECMB0000010000003,Just wow!,Wonderful device and smart phone best camera b...,125,47,5,Flipkart Customer
3,ECMB000001,ECMB0000010000004,Simply awesome,Very good mobile. Value for money. Battery bac...,0,0,5,Amit Sen
4,ECMB000001,ECMB0000010000005,Highly recommended,Really great.... value for money...,90,15,5,Sudeshna pakira


<h2  style="text-align: center" class="list-group-item list-group-item-success"> 2 Data Cleaning </h2><a id = "2.3" ></a>



### Missing data manipulation
Missing values can be handled in many ways, from simple statistical imputation to machine learning models. In this dataset, however, the missing values are all in the ratings column, which we will later use as the label column, and imputing it would distort the original label distribution. I therefore drop the rows whose rating is missing. Dropping rows is generally not recommended, but it is the safer choice in this scenario.

In [9]:
df.isna().sum()

product_id       0
review_id        0
title            0
review           0
likes            0
dislikes         0
ratings       5005
reviewer         0
dtype: int64

In [10]:
# Dropping the rows with missing values

df.dropna(inplace=True, axis=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48488 entries, 0 to 287
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   product_id  48488 non-null  object
 1   review_id   48488 non-null  object
 2   title       48488 non-null  object
 3   review      48488 non-null  object
 4   likes       48488 non-null  object
 5   dislikes    48488 non-null  object
 6   ratings     48488 non-null  object
 7   reviewer    48488 non-null  object
dtypes: object(8)
memory usage: 3.3+ MB


# Data Preparation


In [11]:
!pip install demoji



In [12]:
nltk.download('stopwords')


In [13]:
import re
import unicodedata as uni
import demoji
from nltk.corpus import stopwords
import spacy
sp = spacy.load("en_core_web_sm")
nltk.download('stopwords')
en_stopwords = set(stopwords.words('english'))

def remove_url(text):
    text = re.sub(r"http\S+", "", text)
    return text

def handle_emoji(string):
    emojis = demoji.findall(string)

    for emoji in emojis:
        string = string.replace(emoji, " " + emojis[emoji].split(":")[0])

    return string

def word_tokenizer(text):
    text = text.lower()
    text = text.split()

    return text

def remove_stopwords(text):
    text = [word for word in text if word not in en_stopwords]
    return text

def lemmatization(text):

    text = " ".join(text)
    token = sp(text)

    text = [word.lemma_ for word in token]
    return text

def label(y):
    if y == '5':
        return 1
    elif y == '4':
        return 1
    else:
        return 0

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [31]:
def preprocessing(text):

    text = remove_url(text)
    text = uni.normalize('NFKD', text)
    text = handle_emoji(text)
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = word_tokenizer(text)
    text = lemmatization(text)
    text = remove_stopwords(text)
    text = " ".join(text)

    return text
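Removing punctuation outright glues together words that were separated only by a full stop, which is where tokens such as `operatinggood` in the cleaned reviews come from. A minimal sketch of the difference, assuming a hypothetical helper `strip_punctuation` that is not part of the notebook:

```python
import re

def strip_punctuation(text, replacement=""):
    # Replace every character that is neither a word character nor whitespace
    return re.sub(r"[^\w\s]", replacement, text)

sample = "holding and operating.Good modern look"

# Deleting the full stop merges the two neighbouring words
glued = strip_punctuation(sample).split()
# Replacing it with a space keeps them apart
spaced = strip_punctuation(sample, " ").split()

print(glued)
print(spaced)
```

Substituting a space instead of an empty string is one possible way to keep such words separate before tokenization.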

### Data Enhancement and Augmentation
We do not have many negative reviews, so we use data augmentation to create more. We take the reviews with a 5-star rating and keep those of moderate length (the code below keeps reviews between 100 and 500 characters long).

From these positive reviews we then generate negative reviews with antonym augmentation.
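To illustrate the idea only, here is a toy sketch of antonym substitution. The `ANTONYMS` table and the `flip_sentiment` helper are made up for this example; nlpaug's `AntonymAug` looks antonyms up in WordNet and replaces only a random subset of the words.

```python
# Toy antonym table, made up for illustration
ANTONYMS = {"good": "bad", "best": "worst", "perfect": "imperfect", "love": "hate"}

def flip_sentiment(review, antonyms=ANTONYMS):
    # Swap each word that has a known antonym, leave the rest untouched
    words = review.split()
    return " ".join(antonyms.get(word.lower(), word) for word in words)

positive_review = "Best phone in this price range and the camera is perfect"
negative_review = flip_sentiment(positive_review)
print(negative_review)
```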

In [15]:
!pip install nlpaug



In [16]:
df_label = df.copy()
from tqdm import tqdm
tqdm.pandas()
df_label['y'] = df_label.ratings.progress_map(label)
df_label = df_label[['review', 'y', 'ratings']]
df_label.y.value_counts()

100%|██████████| 48488/48488 [00:00<00:00, 757900.16it/s]


1    44751
0     3737
Name: y, dtype: int64

In [32]:
df_five = df_label[(df_label['ratings'] == '5')]
positive = list(df_five[(df_five['review'].str.len() > 100) & (df_five['review'].str.len() < 500)]['review'])

import nlpaug.augmenter.word as naw
negative_aug = naw.AntonymAug(name='Antonym_Aug', aug_min=1, aug_max=10, aug_p=0.3, lang='eng', stopwords=en_stopwords, tokenizer=None,
                     reverse_tokenizer=None, stopwords_regex=None, verbose=0)

aug_negative = negative_aug.augment(positive)

In [33]:
df_negative = pd.DataFrame({"review" : aug_negative, 'y' : [0]*len(aug_negative)})
df_positive = pd.DataFrame({"review" : positive, 'y' : [1]*len(positive)})
df_temp = pd.concat([df_negative, df_positive]).sample(frac = 1, random_state = 11).reset_index(drop=True)
df = df_temp

In [35]:
from tqdm import tqdm

tqdm.pandas()

df['clean_review'] = df['review'].progress_map(preprocessing)

df.head()

100%|██████████| 12326/12326 [03:46<00:00, 54.34it/s]


Unnamed: 0,review,y,clean_review
0,Slim and steady in holding and operating.Good ...,1,slim steady hold operatinggood modern day look...
1,Overall bad phone in this budget. Simply away ...,0,overall bad phone budget simply away first one...
2,Must buy. Best phone in this price range. Nice...,1,must buy good phone price range nice look supe...
3,super phone in this prize every thing is imper...,0,super phone prize every thing imperfect camera...
4,This is the best phone in this price range whi...,1,good phone price range highquality camera proc...


In [36]:
reviews = df.clean_review.values.tolist()

In [37]:
df['clean_review2'] = df['clean_review'].progress_map(word_tokenizer)

df.head()

100%|██████████| 12326/12326 [00:00<00:00, 142993.44it/s]


Unnamed: 0,review,y,clean_review,clean_review2
0,Slim and steady in holding and operating.Good ...,1,slim steady hold operatinggood modern day look...,"[slim, steady, hold, operatinggood, modern, da..."
1,Overall bad phone in this budget. Simply away ...,0,overall bad phone budget simply away first one...,"[overall, bad, phone, budget, simply, away, fi..."
2,Must buy. Best phone in this price range. Nice...,1,must buy good phone price range nice look supe...,"[must, buy, good, phone, price, range, nice, l..."
3,super phone in this prize every thing is imper...,0,super phone prize every thing imperfect camera...,"[super, phone, prize, every, thing, imperfect,..."
4,This is the best phone in this price range whi...,1,good phone price range highquality camera proc...,"[good, phone, price, range, highquality, camer..."


# Latent Dirichlet Allocation (LDA)
In LDA, each topic is a probability distribution over keywords, and each document is a probability distribution over topics. The topic model itself only provides a list of weighted keywords for each topic; deciding what a topic represents and how it should be labelled is left to human interpretation.
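The two kinds of distribution can be sketched with toy numbers. The topics, words, and weights below are invented for illustration and are not output of the model trained in this notebook.

```python
# Two toy topics: each is a probability distribution over words
topics = {
    "camera_topic": {"camera": 0.6, "photo": 0.3, "battery": 0.1},
    "battery_topic": {"battery": 0.7, "charge": 0.2, "camera": 0.1},
}

# One toy document: a probability distribution over topics
doc_topic_mix = {"camera_topic": 0.25, "battery_topic": 0.75}

def word_probability(word):
    # p(word | doc) = sum over topics of p(word | topic) * p(topic | doc)
    return sum(
        weight * topics[topic].get(word, 0.0)
        for topic, weight in doc_topic_mix.items()
    )

print(word_probability("battery"))  # 0.25 * 0.1 + 0.75 * 0.7
```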

In [38]:
data_words = df['clean_review2'].values.tolist()
len(data_words)

12326

In [39]:
import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_words)
# Create Corpus
texts = data_words
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
# View
print(corpus[:1][0][:30])


[(0, 1), (1, 1), (2, 1), (3, 1), (4, 2), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1)]
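Each pair in the output above is a token id followed by the number of times that token occurs in the document. A plain-Python sketch of the same bag-of-words idea, using two made-up documents (the ids gensim assigns may differ from the ones produced here):

```python
from collections import Counter

docs = [["good", "phone", "good", "camera"], ["bad", "battery", "phone"]]

# Assign an integer id to every distinct token, in order of first appearance
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Count tokens, then report (token_id, count) pairs sorted by id
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

print(to_bow(docs[0]))
print(to_bow(docs[1]))
```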


In [40]:
from gensim.models import LdaMulticore
from gensim.models import LdaModel
from pprint import pprint

# number of topics
num_topics = 10
# Build LDA model
lda_model = LdaMulticore(corpus=corpus, id2word=id2word,
                     num_topics=num_topics, iterations=400)
# Print the Keyword in the 10 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]




[(0,
  '0.051*"good" + 0.041*"i" + 0.029*"camera" + 0.028*"phone" + 0.017*"mobile" '
  '+ 0.017*"redmi" + 0.016*"battery" + 0.013*"price" + 0.012*"quality" + '
  '0.012*"use"'),
 (1,
  '0.035*"product" + 0.035*"i" + 0.017*"phone" + 0.017*"flipkart" + '
  '0.016*"camera" + 0.016*"day" + 0.014*"good" + 0.014*"price" + 0.012*"bad" + '
  '0.012*"use"'),
 (2,
  '0.032*"i" + 0.030*"phone" + 0.019*"camera" + 0.017*"good" + 0.015*"battery" '
  '+ 0.014*"bad" + 0.013*"evil" + 0.012*"quality" + 0.011*"use" + '
  '0.011*"performance"'),
 (3,
  '0.041*"face" + 0.037*"i" + 0.034*"smile" + 0.020*"mobile" + 0.018*"evil" + '
  '0.018*"camera" + 0.017*"flipkart" + 0.017*"bad" + 0.016*"thank" + '
  '0.016*"product"'),
 (4,
  '0.039*"good" + 0.028*"phone" + 0.021*"product" + 0.014*"mobile" + '
  '0.014*"camera" + 0.014*"battery" + 0.012*"performance" + 0.012*"gb" + '
  '0.012*"price" + 0.011*"i"'),
 (5,
  '0.062*"phone" + 0.025*"camera" + 0.022*"i" + 0.022*"good" + 0.016*"awesome" '
  '+ 0.015*"pubg" + 0

Since we do not have labeled data, we need to make a few assumptions when choosing aspects from the reviews. From the LDA analysis above, we interpret the aspects listed below as the important ones mentioned in most of the reviews.
- Phone
- Camera
- Battery
- Delivery
- Processor

So we use LDA to discover the thematic structure, much like a guided feature engineering process. Next, we build a similarity matrix by leveraging semantic word relationships with FastText, which can provide enhanced word representations and contextual similarity.


In [41]:
%%time
from gensim.models import FastText
fasttext_model = FastText(data_words, vector_size= 100, window=5, min_count=5, workers=4,sg=1)
# fasttext_model = FastText.load_fasttext_format("../input/fast100/cc.en.100.bin")

CPU times: user 25.3 s, sys: 252 ms, total: 25.6 s
Wall time: 15.9 s


In [42]:
fasttext_model.save("FastText-Model-For-ABSA.bin")

In [43]:
aspects = ["phone", "camera", "battery", "delivery", "processor"]

def get_similarity(text, aspect):
    try:
        text = " ".join(text)
        return fasttext_model.wv.n_similarity(text, aspect)
    except:
        return 0
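gensim's `n_similarity` computes the cosine similarity between two sets of keys, so it is normally given two lists of tokens; a plain string is iterated character by character. The sketch below reproduces a mean-then-cosine computation with made-up two-dimensional vectors. The `toy_vectors` table and the `mean_vector` and `mean_cosine` helpers are hypothetical and only approximate what the library does.

```python
import math

# Made-up 2-d word vectors; a trained FastText model would supply real ones
toy_vectors = {
    "camera": [1.0, 0.0],
    "photo": [0.8, 0.6],
    "delivery": [0.0, 1.0],
}

def mean_vector(tokens):
    # Average the vectors of all tokens, dimension by dimension
    vectors = [toy_vectors[token] for token in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def mean_cosine(tokens_a, tokens_b):
    # Cosine similarity between the mean vectors of two token lists
    a, b = mean_vector(tokens_a), mean_vector(tokens_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

review_tokens = ["camera", "photo"]
camera_score = mean_cosine(review_tokens, ["camera"])      # aspect as a one-token list
delivery_score = mean_cosine(review_tokens, ["delivery"])
print(camera_score, delivery_score)
```

With the notebook's model, a token-level call would look like `fasttext_model.wv.n_similarity(tokens, [aspect])`, where `tokens` is the list of words rather than a joined string. Scores computed that way would differ from the ones stored in the CSV.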


In [44]:
from tqdm import tqdm
tqdm.pandas()
for aspect in aspects:
    df[aspect] = df['clean_review2'].progress_map(lambda text: get_similarity(text, aspect))

100%|██████████| 12326/12326 [00:20<00:00, 593.89it/s]
100%|██████████| 12326/12326 [00:19<00:00, 631.61it/s]
100%|██████████| 12326/12326 [00:19<00:00, 619.81it/s]
100%|██████████| 12326/12326 [00:20<00:00, 598.41it/s]
100%|██████████| 12326/12326 [00:19<00:00, 623.25it/s]


In [45]:
df.to_csv("Clean_Flipkart_Product.csv", index = False)

We have now scored each review against the chosen aspects. The next step is to create a PyTorch-based model that can predict aspect-based sentiment.

# Model Training

In [46]:
import torch
from torch import nn
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
from torch.utils.data import RandomSampler
import warnings

<h2  style="text-align: center" class="list-group-item list-group-item-success"> 6.1 Configurations </h2><a id = "6.1" ></a>

In [47]:
class config:
    warnings.filterwarnings("ignore", category = UserWarning)
    IMG_SIZE = (224,224)
    DEVICE = ("cuda" if torch.cuda.is_available() else "cpu")
    FOLDS = 5
    SHUFFLE = True
    BATCH_SIZE = 32
    LR = 0.01
    EPOCHS = 30
    EMB_DIM = 100
    MAX_LEN = 20
    MODEL_PATH = "./Models/MyModel.pt"

In [48]:
df = pd.read_csv("/content/Clean_Flipkart_Product.csv")
df.head()

Unnamed: 0,review,y,clean_review,clean_review2,phone,camera,battery,delivery,processor
0,Slim and steady in holding and operating.Good ...,1,slim steady hold operatinggood modern day look...,"['slim', 'steady', 'hold', 'operatinggood', 'm...",0.881814,0.869249,0.867634,0.952238,0.881585
1,Overall bad phone in this budget. Simply away ...,0,overall bad phone budget simply away first one...,"['overall', 'bad', 'phone', 'budget', 'simply'...",0.890131,0.868406,0.879134,0.960879,0.871773
2,Must buy. Best phone in this price range. Nice...,1,must buy good phone price range nice look supe...,"['must', 'buy', 'good', 'phone', 'price', 'ran...",0.913131,0.874783,0.881336,0.940665,0.886667
3,super phone in this prize every thing is imper...,0,super phone prize every thing imperfect camera...,"['super', 'phone', 'prize', 'every', 'thing', ...",0.909267,0.872951,0.888725,0.959379,0.880897
4,This is the best phone in this price range whi...,1,good phone price range highquality camera proc...,"['good', 'phone', 'price', 'range', 'highquali...",0.930133,0.901578,0.861556,0.930592,0.896418


<h2  style="text-align: center" class="list-group-item list-group-item-success"> 6.2 Dataset Generator </h2><a id = "6.2" ></a>

<h3  style="text-align: center" class="list-group-item list-group-item-warning"> 6.2.1 Creation of the Vocabulary </h3><a id = "6.2.1" ></a>

The `Vocabulary` class converts text into a numerical form that models can understand, while managing vocabulary size and handling unknown words gracefully.

In [49]:
class Vocabulary:

    '''
    __init__ method is called by default as soon as an object of this class is initiated
    we use this method to initiate our vocab dictionaries
    '''
    def __init__(self, freq_threshold, max_size):
        '''
        freq_threshold : the minimum times a word must occur in corpus to be treated in vocab
        max_size : max source vocab size. Eg. if set to 10,000, we pick the top 10,000 most frequent words and discard others
        '''
        #initiate the index to token dict
        ## <PAD> -> padding, used for padding the shorter sentences in a batch to match the length of longest sentence in the batch
        ## <SOS> -> start token, added in front of each sentence to signify the start of sentence
        ## <EOS> -> End of sentence token, added to the end of each sentence to signify the end of sentence
        ## <UNK> -> words which are not found in the vocab are replace by this token
        self.itos = {0: '<PAD>', 1:'<SOS>', 2:'<EOS>', 3: '<UNK>'}
        #initiate the token to index dict
        self.stoi = {k:j for j,k in self.itos.items()}

        self.freq_threshold = freq_threshold
        self.max_size = max_size

    '''
    __len__ is used by dataloader later to create batches
    '''
    def __len__(self):
        return len(self.itos)

    '''
    a simple tokenizer to split on space and converts the sentence to list of words
    '''
    @staticmethod
    def tokenizer(text):
        return [tok.lower().strip() for tok in text.split(' ')]

    '''
    build the vocab: create a dictionary mapping of index to string (itos) and string to index (stoi)
    output ex. for stoi -> {'the':5, 'a':6, 'an':7}
    '''
    def build_vocabulary(self, sentence_list):
        #calculate the frequencies of each word first to remove the words with freq < freq_threshold
        frequencies = {}  #init the freq dict
        idx = 4 #index from which we want our dict to start. We already used 4 indexes for pad, start, end, unk

        #calculate freq of words
        for sentence in sentence_list:
            for word in self.tokenizer(sentence):
                if word not in frequencies.keys():
                    frequencies[word]=1
                else:
                    frequencies[word]+=1


        #limit vocab by removing low freq words
        frequencies = {k:v for k,v in frequencies.items() if v>self.freq_threshold}

        #limit vocab to the max_size specified
        frequencies = dict(sorted(frequencies.items(), key = lambda x: -x[1])[:self.max_size-idx]) # idx =4 for pad, start, end , unk

        #create vocab
        for word in frequencies.keys():
            self.stoi[word] = idx
            self.itos[idx] = word
            idx+=1


    '''
    convert the list of words to a list of corresponding indexes
    '''
    def numericalize(self, text):
        #tokenize text
        tokenized_text = self.tokenizer(text)
        numericalized_text = []
        for token in tokenized_text:
            if token in self.stoi.keys():
                numericalized_text.append(self.stoi[token])
            else: #out-of-vocab words are represented by UNK token index
                numericalized_text.append(self.stoi['<UNK>'])

        return numericalized_text
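The same steps (count, filter by frequency, cap the size, map unknown words to `<UNK>`) can be sketched more compactly with `collections.Counter`. The `build_vocab` and `numericalize` functions below are illustrative stand-ins, not the class used in the rest of the notebook, and the sentences are made up.

```python
from collections import Counter

def build_vocab(sentences, freq_threshold=1, max_size=10):
    # Reserve the first four indexes for the special tokens
    stoi = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}
    counts = Counter(word for s in sentences for word in s.lower().split())
    # Keep words seen more than freq_threshold times, most frequent first
    kept = [w for w, c in counts.most_common() if c > freq_threshold]
    for word in kept[: max_size - len(stoi)]:
        stoi[word] = len(stoi)
    return stoi

def numericalize(text, stoi):
    # Unknown words fall back to the <UNK> index
    return [stoi.get(word, stoi["<UNK>"]) for word in text.lower().split()]

stoi = build_vocab(["good phone", "good camera", "good phone battery"])
print(stoi)
print(numericalize("good phone screen", stoi))
```

Here `camera` and `battery` occur only once, so they are filtered out and would be encoded as `<UNK>`, just like the unseen word `screen`.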


The dataset class handles the text data, integrates vocabulary construction, and prepares the data for model training.

In [50]:
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    '''
    Initiating Variables
    df: the training dataframe
    source_column : the name of source text column in the dataframe
    transform : If we want to add any augmentation
    freq_threshold : the minimum times a word must occur in corpus to be treated in vocab
    source_vocab_max_size : max source vocab size
    '''

    def __init__(self, df, source_column,freq_threshold = 3,
                source_vocab_max_size = 10000 , transform=None):

        self.df = df
        self.transform = transform

        #get source and target texts
        self.source_texts = self.df[source_column]


        ##VOCAB class has been created above
        #Initialize source vocab object and build vocabulary
        self.source_vocab = Vocabulary(freq_threshold, source_vocab_max_size)
        self.source_vocab.build_vocabulary(self.source_texts.tolist())


    def __len__(self):
        return len(self.df)

    '''
    __getitem__ runs on 1 example at a time. Here, we get an example at index and return its numericalize source and
    target values using the vocabulary objects we created in __init__
    '''
    def __getitem__(self, index):
        source_text = self.source_texts[index]

        if self.transform is not None:
            source_text = self.transform(source_text)

        #numericalize texts ['<SOS>','cat', 'in', 'a', 'bag','<EOS>'] -> [1,12,2,9,24,2]
        numerialized_source = [self.source_vocab.stoi["<SOS>"]]
        numerialized_source += self.source_vocab.numericalize(source_text)
        numerialized_source.append(self.source_vocab.stoi["<EOS>"])

        #convert the list to tensor and return
        return torch.tensor(numerialized_source), torch.tensor(self.df.y[index])

In [51]:
df.head()

Unnamed: 0,review,y,clean_review,clean_review2,phone,camera,battery,delivery,processor
0,Slim and steady in holding and operating.Good ...,1,slim steady hold operatinggood modern day look...,"['slim', 'steady', 'hold', 'operatinggood', 'm...",0.881814,0.869249,0.867634,0.952238,0.881585
1,Overall bad phone in this budget. Simply away ...,0,overall bad phone budget simply away first one...,"['overall', 'bad', 'phone', 'budget', 'simply'...",0.890131,0.868406,0.879134,0.960879,0.871773
2,Must buy. Best phone in this price range. Nice...,1,must buy good phone price range nice look supe...,"['must', 'buy', 'good', 'phone', 'price', 'ran...",0.913131,0.874783,0.881336,0.940665,0.886667
3,super phone in this prize every thing is imper...,0,super phone prize every thing imperfect camera...,"['super', 'phone', 'prize', 'every', 'thing', ...",0.909267,0.872951,0.888725,0.959379,0.880897
4,This is the best phone in this price range whi...,1,good phone price range highquality camera proc...,"['good', 'phone', 'price', 'range', 'highquali...",0.930133,0.901578,0.861556,0.930592,0.896418


In [52]:
dataset = CustomDataset(df, "clean_review")

In [53]:
len(dataset.source_vocab.stoi)

4808

Saving the pytorch custom dataset

In [54]:
import pickle

with open('dataset-new', 'wb') as dataset_file:
    # Serialize the dataset (including its vocabulary) to disk
    pickle.dump(dataset, dataset_file, pickle.HIGHEST_PROTOCOL)

# To load the dataset back later:
# with open('dataset-new', 'rb') as dataset_file:
#     dataset = pickle.load(dataset_file)

<h2  style="text-align: center" class="list-group-item list-group-item-success"> 6.3 Word Embeddings </h2><a id = "6.3" ></a>

We create an embedding layer for the PyTorch model and initialize it with the word embeddings from the FastText model trained earlier. This approach is particularly useful when we want to leverage the semantic information captured by these pretrained embeddings.

In [55]:
def get_emb_layer_with_weights(target_vocab, emb_model, trainable = False):

    weights_matrix = np.zeros((len(target_vocab), config.EMB_DIM))
    words_found = 0

    for i, word in enumerate(target_vocab):
        weights_matrix[i] = np.concatenate([emb_model.wv[word]])
        words_found += 1

    print(f"Words found are : {words_found}")

    weights_matrix = torch.tensor(weights_matrix, dtype = torch.float32).reshape(len(target_vocab), config.EMB_DIM)
    emb_layer = nn.Embedding.from_pretrained(weights_matrix)
    print(emb_layer)
    if trainable:
        emb_layer.weight.requires_grad = True
    else:
        emb_layer.weight.requires_grad = False

    return emb_layer

The collate class takes care of padding sequences to a uniform length and preparing batches for processing by a neural network.

In [56]:
class MyCollate:
    def __init__(self, pad_idx, maxlen):
        self.pad_idx = pad_idx
        self.maxlen = maxlen


    #__call__: a default method
    ##   First the obj is created using MyCollate(pad_idx) in data loader
    ##   Then if obj(batch) is called -> __call__ runs by default
    def __call__(self, batch):
        #get all source indexed sentences of the batch
        source = [item[0] for item in batch]
        #pad them using pad_sequence method from pytorch.
#         source = pad_sequence(source, batch_first=False, padding_value = self.pad_idx)

        padded_sequence = torch.zeros((self.maxlen, len(batch)), dtype = torch.int)

        for idx, text in enumerate(source):

            if len(text) > self.maxlen:
                padded_sequence[:, idx] = source[idx][: self.maxlen]
            else:
                padded_sequence[:len(source[idx]), idx] = padded_sequence[:len(source[idx]), idx] + source[idx]


        #get all target indexed sentences of the batch
        target = [item[1] for item in batch]

        target = torch.tensor(target, dtype = torch.float32).reshape(-1)
        return padded_sequence, target
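The padding and truncation logic can be sketched with plain lists. The helper `pad_or_truncate` and the toy batch are hypothetical; the collate class above does the same thing on tensors and stores each sequence as a column, giving the shape `(maxlen, batch_size)`.

```python
def pad_or_truncate(sequence, max_len, pad_idx=0):
    # Cut sequences that are too long, right-pad the ones that are too short
    clipped = sequence[:max_len]
    return clipped + [pad_idx] * (max_len - len(clipped))

batch = [[1, 7, 9, 2], [1, 5, 6, 8, 4, 3, 2]]
padded = [pad_or_truncate(seq, max_len=5) for seq in batch]

# Transpose so that each sequence becomes a column, as in the collate class
columns = [list(col) for col in zip(*padded)]

print(padded)
print(columns)
```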


<h2  style="text-align: center" class="list-group-item list-group-item-success"> 6.4 Initializing the Model </h2><a id = "6.4" ></a>

In [57]:
class Model(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, embedding_layer):
        super().__init__()
#         self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.hidden_dim = hidden_dim
        self.embedding = embedding_layer
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional = True)
        self.fc1 = nn.Linear(2*hidden_dim, 128)
        self.fc2 = nn.Linear(128, output_dim)
        self.dropout = nn.Dropout(0.3)
        self.sigmoid = nn.Sigmoid()



    def forward(self, text):

        max_len, N = text.shape
        hidden = torch.zeros((2, N , self.hidden_dim),
                          dtype=torch.float)
        memory = torch.zeros((2, N , self.hidden_dim),
                          dtype=torch.float)
        hidden = hidden.to(config.DEVICE)
        memory = memory.to(config.DEVICE)
        embedded = self.embedding(text)
        output, hidden = self.lstm(embedded, (hidden, memory))
#         assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        y_pred = output[-1,:,:]
        y_pred = self.fc1(y_pred)
        y_pred = self.fc2(y_pred)
        y_pred = self.sigmoid(y_pred)

        return y_pred

 <h2  style="text-align: center" class="list-group-item list-group-item-success"> 6.5 Training and K-fold Cross Validation </h2><a id = "6.5" ></a>

In [58]:
def train_epochs(dataloader,model, loss_fn, optimizer):
    train_correct = 0
    train_loss = 0

    model.train()

    for review, label in tqdm(dataloader):

        review, label = review.to(config.DEVICE), label.to(config.DEVICE)
        optimizer.zero_grad()
        output = model(review)
        output = output.reshape(-1)
        loss = loss_fn(output, label)
        loss.backward()
        optimizer.step()

        train_loss += loss.item()*review.size(1)
        prediction = (output > 0.5).float()
        train_correct += (prediction == label).float().sum()

    return train_loss, train_correct
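The training function accumulates the batch-mean loss multiplied by the batch size, plus the number of correct predictions, so that the caller can divide by the number of samples afterwards. A small sketch of that bookkeeping with made-up numbers (no model involved):

```python
# (batch-mean loss, predicted probabilities, true labels) for two toy batches
batches = [
    (0.70, [0.9, 0.2, 0.6], [1, 0, 0]),
    (0.50, [0.4, 0.8], [0, 1]),
]

total_loss, total_correct, n_samples = 0.0, 0, 0
for mean_loss, probs, labels in batches:
    batch_size = len(labels)
    # Undo the per-batch averaging so batches of different size weigh correctly
    total_loss += mean_loss * batch_size
    # Threshold the sigmoid output at 0.5 to get a class prediction
    preds = [1 if p > 0.5 else 0 for p in probs]
    total_correct += sum(int(p == y) for p, y in zip(preds, labels))
    n_samples += batch_size

epoch_loss = total_loss / n_samples
epoch_acc = 100 * total_correct / n_samples
print(epoch_loss, epoch_acc)
```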



The validation function calculates the total validation loss and the number of correctly predicted instances.

In [59]:
def val_epochs(dataloader, model, loss_fn):
    val_correct = 0
    val_loss = 0

    model.eval()

    # No gradients are needed during validation
    with torch.no_grad():
        for review, label in dataloader:

            review, label = review.to(config.DEVICE), label.to(config.DEVICE)

            output = model(review)
            output = output.reshape(-1)

            loss = loss_fn(output, label)

            # Weight the batch-mean loss by the batch size (dimension 1)
            val_loss += loss.item()*review.size(1)
            prediction = (output > 0.5).float()
            val_correct += (prediction == label).float().sum()

    return val_loss, val_correct



In [60]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from torch.utils.data import SubsetRandomSampler
from torch.optim import Adam
from tqdm import tqdm
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt


# sfk = StratifiedKFold(n_splits = config.FOLDS)
kfold = KFold(n_splits = config.FOLDS)
model_state_dicts = {}

for fold, (train_idx, val_idx) in enumerate(kfold.split(np.arange(len(dataset)))):

    train_sampler = SubsetRandomSampler(train_idx)
    val_sampler = SubsetRandomSampler(val_idx)

    train_loader = DataLoader(dataset, batch_size = config.BATCH_SIZE, sampler = train_sampler, collate_fn = MyCollate(0, config.MAX_LEN))
    val_loader = DataLoader(dataset, batch_size = config.BATCH_SIZE, sampler = val_sampler, collate_fn = MyCollate(0, config.MAX_LEN))

    VOCAB_SIZE = len(dataset.source_vocab)
    HIDDEN_DIM = 128
    OUTPUT_DIM = 1
    VOCAB = list(dataset.source_vocab.stoi)

    embedding_layer = get_emb_layer_with_weights(target_vocab = VOCAB, emb_model = fasttext_model, trainable = False)

    model = Model(VOCAB_SIZE, config.EMB_DIM, HIDDEN_DIM, OUTPUT_DIM, embedding_layer)
    model = model.to(config.DEVICE)

#     model
#     model = Model(2, len(dataset.source_vocab), 128, 100, 1 ).to(config.DEVICE)
#     hidden = model.init_hidden(config.BATCH_SIZE)
#     model.hidden = hidden

    loss_fn = nn.BCELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr = 0.1)

    train_losses = []
    val_losses = []
    train_accs = []
    val_accs = []

    print(f"-----------------------------------------------------------{fold}-fold of the model-----------------------------------------------------------")
    for epoch in range(config.EPOCHS):
        train_loss, train_correct = train_epochs(train_loader, model, loss_fn, optimizer)
        val_loss, val_correct = val_epochs(val_loader, model, loss_fn)

        train_loss = train_loss/len(train_loader.sampler)
        val_loss = val_loss/len(val_loader.sampler)
        train_acc = (train_correct/len(train_loader.sampler))*100
        val_acc = (val_correct/len(val_loader.sampler))*100

        train_losses.append(train_loss)
        val_losses.append(val_loss)
        train_accs.append(train_acc.cpu().numpy().tolist())
        val_accs.append(val_acc.cpu().numpy().tolist())

        print(f"| Train Loss : {train_loss} |", end = " ")
        print(f" Val Loss : {val_loss} |", end = " ")
        print(f"Train Acc : {train_acc} |", end = " ")
        print(f"Val Acc : {val_acc} |")


    # Saving the state dicts for the model
    model_state_dicts.update({f"LSTM-Model-for-{fold}" : model.state_dict(),
                             f"Model-Optimizer-for-{fold}" : optimizer.state_dict()})

#     # summarize history for accuracy
#     plt.plot(train_accs)
#     plt.plot(val_accs)
#     plt.title('Model Accuracy')
#     plt.ylabel('Accuracy')
#     plt.xlabel('Epoch')
#     plt.legend(['Train', 'Test'], loc='upper left')
#     plt.show()
#     # summarize history for loss
#     plt.plot(train_losses)
#     plt.plot(val_losses)
#     plt.title('Model Loss')
#     plt.ylabel('Loss')
#     plt.xlabel('Epoch')
#     plt.legend(['Train', 'Test'], loc='upper left')
#     plt.show()
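
`KFold` is constructed above without `shuffle=True`, in which case scikit-learn assigns consecutive blocks of indices to the validation folds. The pure-Python sketch below mimics that partitioning to make it visible; `contiguous_folds` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train_idx, val_idx) lists using consecutive validation blocks."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(contiguous_folds(n_samples=10, n_splits=3))
for fold, (train_idx, val_idx) in enumerate(folds):
    print(fold, val_idx)
```

If the reviews are stored in a meaningful order (for example grouped by product or by label), passing `shuffle=True` with a fixed `random_state` to `KFold` may give more representative folds.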

Words found are : 4808
Embedding(4808, 100)
-----------------------------------------------------------0-fold of the model-----------------------------------------------------------


100%|██████████| 309/309 [00:10<00:00, 28.14it/s]


| Train Loss : 0.6902937014252855 |  Val Loss : 0.6886428800039145 | Train Acc : 53.34685516357422 | Val Acc : 52.51419448852539 |


| Train Loss : 0.6600127167440573 |  Val Loss : 1.379133265577287 | Train Acc : 64.6551742553711 | Val Acc : 51.540958404541016 |


| Train Loss : 0.40072536495104283 |  Val Loss : 0.6925887399249606 | Train Acc : 82.45436096191406 | Val Acc : 68.45093536376953 |


| Train Loss : 0.2555114975807391 |  Val Loss : 1.4620614171704913 | Train Acc : 90.42596435546875 | Val Acc : 54.86618423461914 |


| Train Loss : 0.22551323404992812 |  Val Loss : 0.1857909042137431 | Train Acc : 91.41987609863281 | Val Acc : 92.86293029785156 |


| Train Loss : 0.1987286410394595 |  Val Loss : 0.20012954011595743 | Train Acc : 92.57606506347656 | Val Acc : 91.88969421386719 |


| Train Loss : 0.1893771380514783 |  Val Loss : 0.176860555850267 | Train Acc : 92.69776916503906 | Val Acc : 92.6601791381836 |


| Train Loss : 0.1829629059253793 |  Val Loss : 1.656160006299615 | Train Acc : 92.98174285888672 | Val Acc : 49.188968658447266 |


| Train Loss : 0.21640522349965863 |  Val Loss : 0.18247985199323438 | Train Acc : 91.72413635253906 | Val Acc : 93.02513885498047 |


| Train Loss : 0.1764809346489578 |  Val Loss : 0.15707339179631857 | Train Acc : 93.36714172363281 | Val Acc : 94.03892517089844 |


| Train Loss : 0.16456105601473223 |  Val Loss : 0.25684855966575726 | Train Acc : 93.71196746826172 | Val Acc : 89.78102111816406 |


| Train Loss : 0.1619799759820799 |  Val Loss : 0.1711862979574614 | Train Acc : 93.8843765258789 | Val Acc : 93.06568908691406 |


| Train Loss : 0.1567795250799061 |  Val Loss : 0.17529800456147174 | Train Acc : 94.14807891845703 | Val Acc : 92.61962890625 |


| Train Loss : 0.157955246779909 |  Val Loss : 0.16785521158723452 | Train Acc : 93.73224639892578 | Val Acc : 93.10624694824219 |


| Train Loss : 0.1508719839212347 |  Val Loss : 0.15365002184066942 | Train Acc : 94.1277847290039 | Val Acc : 93.75506591796875 |


| Train Loss : 0.15008999194806294 |  Val Loss : 0.14706877687309974 | Train Acc : 94.27992248535156 | Val Acc : 94.40389251708984 |


| Train Loss : 0.15014542838139175 |  Val Loss : 0.19712775205250677 | Train Acc : 94.1277847290039 | Val Acc : 93.10624694824219 |


| Train Loss : 0.14753471057990503 |  Val Loss : 0.7536616816524558 | Train Acc : 94.3914794921875 | Val Acc : 68.04541778564453 |


| Train Loss : 0.1492779294146002 |  Val Loss : 0.18829934833161927 | Train Acc : 94.27992248535156 | Val Acc : 91.9708023071289 |


| Train Loss : 0.14189650766045217 |  Val Loss : 0.1466174875291643 | Train Acc : 94.71602630615234 | Val Acc : 94.80941009521484 |


| Train Loss : 0.14340005223818772 |  Val Loss : 0.17150678235886616 | Train Acc : 94.66531372070312 | Val Acc : 93.71451568603516 |


| Train Loss : 0.14164599000682213 |  Val Loss : 0.3665696828870068 | Train Acc : 94.4928970336914 | Val Acc : 81.1435546875 |


| Train Loss : 0.14151054331182708 |  Val Loss : 0.15222376846218902 | Train Acc : 94.69573974609375 | Val Acc : 93.83617401123047 |


| Train Loss : 0.14096414417512035 |  Val Loss : 0.4148240021440531 | Train Acc : 94.68559265136719 | Val Acc : 86.00973510742188 |


| Train Loss : 0.138461077120344 |  Val Loss : 0.13810750274085845 | Train Acc : 94.71602630615234 | Val Acc : 94.84996032714844 |


| Train Loss : 0.13782120158113767 |  Val Loss : 0.27383988828602984 | Train Acc : 94.64502716064453 | Val Acc : 90.55149841308594 |


| Train Loss : 0.13714252563975524 |  Val Loss : 0.13797415508937758 | Train Acc : 94.62474822998047 | Val Acc : 94.52554321289062 |


| Train Loss : 0.13080622261242741 |  Val Loss : 0.13896016809796283 | Train Acc : 95.1217041015625 | Val Acc : 94.32279205322266 |


| Train Loss : 0.1312109871247961 |  Val Loss : 0.1406718589915363 | Train Acc : 94.79715728759766 | Val Acc : 94.68775177001953 |


| Train Loss : 0.13081599768401364 |  Val Loss : 0.14377954493634423 | Train Acc : 95.09127807617188 | Val Acc : 94.32279205322266 |
Words found are : 4808
Embedding(4808, 100)
-----------------------------------------------------------1-fold of the model-----------------------------------------------------------


| Train Loss : 0.6906839687784839 |  Val Loss : 0.6884977766031184 | Train Acc : 54.50765609741211 | Val Acc : 51.07505416870117 |


| Train Loss : 0.6580715276774644 |  Val Loss : 0.40084199750641053 | Train Acc : 63.24916458129883 | Val Acc : 85.67951202392578 |


| Train Loss : 0.3985542569280384 |  Val Loss : 2.5025202940655946 | Train Acc : 83.3079833984375 | Val Acc : 49.290061950683594 |


| Train Loss : 0.2762204254994877 |  Val Loss : 0.18218529646891277 | Train Acc : 89.35199737548828 | Val Acc : 94.03651428222656 |


| Train Loss : 0.2190932809865302 |  Val Loss : 0.19193816998705238 | Train Acc : 91.87709045410156 | Val Acc : 92.21095275878906 |


| Train Loss : 0.19970099782181516 |  Val Loss : 0.1782316559136035 | Train Acc : 92.04948425292969 | Val Acc : 93.79310607910156 |


| Train Loss : 0.19957867419505268 |  Val Loss : 0.17029969258192346 | Train Acc : 92.06977081298828 | Val Acc : 93.91481018066406 |


| Train Loss : 0.17953762217741107 |  Val Loss : 0.24292302082120829 | Train Acc : 92.93175506591797 | Val Acc : 89.7363052368164 |


| Train Loss : 0.17516971978451198 |  Val Loss : 0.5535795777612961 | Train Acc : 93.0838623046875 | Val Acc : 80.64909362792969 |


| Train Loss : 0.1738971303644099 |  Val Loss : 0.1686436870918868 | Train Acc : 93.2359848022461 | Val Acc : 93.3874282836914 |


| Train Loss : 0.1629030267043016 |  Val Loss : 0.1974560148394374 | Train Acc : 93.49964141845703 | Val Acc : 92.69776916503906 |


| Train Loss : 0.15616522560604637 |  Val Loss : 0.4897181541241449 | Train Acc : 93.8647232055664 | Val Acc : 84.17850494384766 |


| Train Loss : 0.15435530655573376 |  Val Loss : 0.6677531620544244 | Train Acc : 93.95598602294922 | Val Acc : 63.56998062133789 |


| Train Loss : 0.15909243632784578 |  Val Loss : 0.17797978352641963 | Train Acc : 93.6517562866211 | Val Acc : 93.63082885742188 |


| Train Loss : 0.15182241270236083 |  Val Loss : 0.16811160588005092 | Train Acc : 94.12837982177734 | Val Acc : 93.87423706054688 |


| Train Loss : 0.1530293493687762 |  Val Loss : 0.1786948250549511 | Train Acc : 94.0573959350586 | Val Acc : 93.79310607910156 |


| Train Loss : 0.15102358997741183 |  Val Loss : 0.16873608330617948 | Train Acc : 94.25007629394531 | Val Acc : 93.95537567138672 |


| Train Loss : 0.14645582858664596 |  Val Loss : 0.1618616258706633 | Train Acc : 94.42247009277344 | Val Acc : 94.15821075439453 |


| Train Loss : 0.14325396529669066 |  Val Loss : 0.17366319122709123 | Train Acc : 94.26021575927734 | Val Acc : 93.26571655273438 |


| Train Loss : 0.141430798964098 |  Val Loss : 0.16384475102787932 | Train Acc : 94.2703628540039 | Val Acc : 93.79310607910156 |


| Train Loss : 0.14251954488259047 |  Val Loss : 0.17651888860326306 | Train Acc : 94.42247009277344 | Val Acc : 93.18458557128906 |


| Train Loss : 0.1390497654959017 |  Val Loss : 0.1927534969640673 | Train Acc : 94.58472442626953 | Val Acc : 92.69776916503906 |


| Train Loss : 0.13579598196734288 |  Val Loss : 0.16362317839652038 | Train Acc : 94.64556884765625 | Val Acc : 93.99594116210938 |


| Train Loss : 0.13468766904254434 |  Val Loss : 0.16442746458335214 | Train Acc : 94.71656036376953 | Val Acc : 93.71196746826172 |


| Train Loss : 0.13545085426887288 |  Val Loss : 0.15551825998158292 | Train Acc : 94.64556884765625 | Val Acc : 93.99594116210938 |


| Train Loss : 0.13207516133709313 |  Val Loss : 0.16278956053480054 | Train Acc : 94.84839630126953 | Val Acc : 93.75253295898438 |


| Train Loss : 0.13018823339862715 |  Val Loss : 0.1634668929438139 | Train Acc : 95.02079010009766 | Val Acc : 94.19878387451172 |


| Train Loss : 0.13177598665142842 |  Val Loss : 0.20928070978478777 | Train Acc : 94.94979858398438 | Val Acc : 91.72413635253906 |


| Train Loss : 0.1299779840881316 |  Val Loss : 0.15687255780184245 | Train Acc : 95.02079010009766 | Val Acc : 94.03651428222656 |


| Train Loss : 0.12757927838109615 |  Val Loss : 0.18759353240955673 | Train Acc : 95.06134796142578 | Val Acc : 93.3874282836914 |
Words found are : 4808
Embedding(4808, 100)
-----------------------------------------------------------2-fold of the model-----------------------------------------------------------


| Train Loss : 0.6912897562668756 |  Val Loss : 0.687416344578803 | Train Acc : 53.57468795776367 | Val Acc : 61.582149505615234 |


| Train Loss : 0.6737027679455835 |  Val Loss : 0.6024465887590064 | Train Acc : 62.245208740234375 | Val Acc : 73.3874282836914 |


| Train Loss : 0.4448193136159504 |  Val Loss : 0.2065066059027434 | Train Acc : 80.51921844482422 | Val Acc : 94.19878387451172 |


| Train Loss : 0.27489526803424774 |  Val Loss : 0.17172313037783693 | Train Acc : 89.43312072753906 | Val Acc : 93.83367156982422 |


| Train Loss : 0.22858397502007488 |  Val Loss : 0.200372120145799 | Train Acc : 91.25849151611328 | Val Acc : 93.18458557128906 |


| Train Loss : 0.20251044875201438 |  Val Loss : 0.2158388325170015 | Train Acc : 91.97850036621094 | Val Acc : 92.4137954711914 |


| Train Loss : 0.18782763969050786 |  Val Loss : 0.3482068332732026 | Train Acc : 92.63766479492188 | Val Acc : 82.0689697265625 |


| Train Loss : 0.18440826035164853 |  Val Loss : 0.321880593382079 | Train Acc : 92.67822265625 | Val Acc : 84.30020141601562 |


| Train Loss : 0.17020148117593525 |  Val Loss : 0.15800447426585226 | Train Acc : 93.25626373291016 | Val Acc : 94.11764526367188 |


| Train Loss : 0.17433692049981223 |  Val Loss : 0.15368146864927806 | Train Acc : 93.1954116821289 | Val Acc : 93.87423706054688 |


| Train Loss : 0.20026471445048125 |  Val Loss : 0.6645007811501834 | Train Acc : 91.86695098876953 | Val Acc : 66.45030212402344 |


| Train Loss : 0.2970139073107296 |  Val Loss : 0.17888677297027794 | Train Acc : 88.45958709716797 | Val Acc : 93.46855926513672 |


| Train Loss : 0.18842249976911368 |  Val Loss : 0.17574034578659956 | Train Acc : 92.8303451538086 | Val Acc : 92.98174285888672 |


| Train Loss : 0.17372390614260938 |  Val Loss : 0.19093724715223057 | Train Acc : 93.56049346923828 | Val Acc : 92.9006118774414 |


| Train Loss : 0.16996951315779799 |  Val Loss : 0.17112456479092822 | Train Acc : 93.41851806640625 | Val Acc : 93.42799377441406 |


| Train Loss : 0.16173166082846127 |  Val Loss : 0.2991999169641588 | Train Acc : 93.77344512939453 | Val Acc : 87.78904724121094 |


| Train Loss : 0.15983018443863334 |  Val Loss : 0.1965790280593374 | Train Acc : 93.96612548828125 | Val Acc : 92.77890014648438 |


| Train Loss : 0.15248911271861354 |  Val Loss : 0.14387269393780833 | Train Acc : 94.1892318725586 | Val Acc : 94.56389617919922 |


| Train Loss : 0.1506320760001603 |  Val Loss : 0.23976346994677 | Train Acc : 94.14866638183594 | Val Acc : 90.8316421508789 |


| Train Loss : 0.14611399553972318 |  Val Loss : 0.21137986989152519 | Train Acc : 94.39205169677734 | Val Acc : 93.06288146972656 |


| Train Loss : 0.14706860652542686 |  Val Loss : 0.142249110588921 | Train Acc : 94.4326171875 | Val Acc : 94.40161895751953 |


| Train Loss : 0.14295772157515774 |  Val Loss : 0.14271201712228584 | Train Acc : 94.56444549560547 | Val Acc : 94.64502716064453 |


| Train Loss : 0.1423099654696556 |  Val Loss : 0.139930882043362 | Train Acc : 94.38191223144531 | Val Acc : 94.32048797607422 |


| Train Loss : 0.14063006630111644 |  Val Loss : 0.14106977812900504 | Train Acc : 94.53401947021484 | Val Acc : 94.60446166992188 |


| Train Loss : 0.1375973012225434 |  Val Loss : 0.13957274165646782 | Train Acc : 94.75711822509766 | Val Acc : 94.52333068847656 |


| Train Loss : 0.13619264036012027 |  Val Loss : 0.156220709452024 | Train Acc : 94.83824920654297 | Val Acc : 94.48275756835938 |


| Train Loss : 0.13510393814852725 |  Val Loss : 0.14226705989311356 | Train Acc : 94.90924072265625 | Val Acc : 94.44219207763672 |


| Train Loss : 0.13540465313349637 |  Val Loss : 0.1403487017336651 | Train Acc : 94.77740478515625 | Val Acc : 94.56389617919922 |


| Train Loss : 0.13245802240848734 |  Val Loss : 0.14687930396314083 | Train Acc : 94.97008514404297 | Val Acc : 94.56389617919922 |


| Train Loss : 0.1319042666333646 |  Val Loss : 0.2133701152945576 | Train Acc : 94.97008514404297 | Val Acc : 91.39958953857422 |
Words found are : 4808
Embedding(4808, 100)
-----------------------------------------------------------3-fold of the model-----------------------------------------------------------


| Train Loss : 0.6907832524244786 |  Val Loss : 0.6886452274438576 | Train Acc : 53.58483123779297 | Val Acc : 49.046653747558594 |


| Train Loss : 0.6693988858644044 |  Val Loss : 0.5539235061370819 | Train Acc : 61.89027404785156 | Val Acc : 76.14604187011719 |


| Train Loss : 0.440214093951736 |  Val Loss : 0.6978164449273694 | Train Acc : 81.0262680053711 | Val Acc : 67.13996124267578 |


| Train Loss : 0.2703387759548089 |  Val Loss : 0.846586086783883 | Train Acc : 89.6663589477539 | Val Acc : 67.26166534423828 |


| Train Loss : 0.22850460250861093 |  Val Loss : 0.15758117413333414 | Train Acc : 91.05567169189453 | Val Acc : 94.27992248535156 |


| Train Loss : 0.20687535049063405 |  Val Loss : 0.15584202335389902 | Train Acc : 92.10018920898438 | Val Acc : 94.64502716064453 |


| Train Loss : 0.18932738217592524 |  Val Loss : 0.14784030203403373 | Train Acc : 92.68836975097656 | Val Acc : 94.19878387451172 |


| Train Loss : 0.181336500039616 |  Val Loss : 0.16004009648522788 | Train Acc : 92.91146850585938 | Val Acc : 94.36105346679688 |


| Train Loss : 0.177938923703622 |  Val Loss : 0.14792337986068238 | Train Acc : 93.16499328613281 | Val Acc : 94.19878387451172 |


| Train Loss : 0.17318209553935104 |  Val Loss : 0.191206679799883 | Train Acc : 93.36781311035156 | Val Acc : 93.1440200805664 |


| Train Loss : 0.17343436981305888 |  Val Loss : 0.14473706046176005 | Train Acc : 92.96217346191406 | Val Acc : 94.40161895751953 |


| Train Loss : 0.16763250816501235 |  Val Loss : 0.14459440655152467 | Train Acc : 93.61119079589844 | Val Acc : 94.03651428222656 |


| Train Loss : 0.15799397940375043 |  Val Loss : 0.18873392883725992 | Train Acc : 94.0573959350586 | Val Acc : 91.76470947265625 |


| Train Loss : 0.16047736465018222 |  Val Loss : 0.18313159306682872 | Train Acc : 93.83429718017578 | Val Acc : 93.06288146972656 |


| Train Loss : 0.17304687125967405 |  Val Loss : 1.3465761662496756 | Train Acc : 93.41851806640625 | Val Acc : 54.52332305908203 |


| Train Loss : 0.21459557910276686 |  Val Loss : 0.17298546667242873 | Train Acc : 91.80610656738281 | Val Acc : 92.94117736816406 |


| Train Loss : 0.1662851325905404 |  Val Loss : 0.2536099744788522 | Train Acc : 93.621337890625 | Val Acc : 89.57403564453125 |


| Train Loss : 0.15632010883933214 |  Val Loss : 0.13815538480190245 | Train Acc : 94.11824035644531 | Val Acc : 94.68559265136719 |


| Train Loss : 0.15622890798126432 |  Val Loss : 0.14061558391997103 | Train Acc : 94.12837982177734 | Val Acc : 94.96957397460938 |


| Train Loss : 0.15200376738194893 |  Val Loss : 0.7197469457014394 | Train Acc : 94.07768249511719 | Val Acc : 65.11155700683594 |


| Train Loss : 0.1525041725104581 |  Val Loss : 0.16332432115658302 | Train Acc : 93.81401824951172 | Val Acc : 94.44219207763672 |


| Train Loss : 0.14707430792790543 |  Val Loss : 0.1866123473324622 | Train Acc : 94.28050231933594 | Val Acc : 92.73833465576172 |


| Train Loss : 0.14690513467379712 |  Val Loss : 0.14090266247670985 | Train Acc : 94.25007629394531 | Val Acc : 93.87423706054688 |


| Train Loss : 0.14070821603222353 |  Val Loss : 0.15795615685132708 | Train Acc : 94.62529754638672 | Val Acc : 93.63082885742188 |


| Train Loss : 0.14249360124878915 |  Val Loss : 0.14538573957920645 | Train Acc : 94.61515045166016 | Val Acc : 93.99594116210938 |


| Train Loss : 0.1423770824666198 |  Val Loss : 0.1508363047718594 | Train Acc : 94.58472442626953 | Val Acc : 93.67140197753906 |


| Train Loss : 0.13907667793897935 |  Val Loss : 0.1375778676271481 | Train Acc : 94.69627380371094 | Val Acc : 94.96957397460938 |


| Train Loss : 0.138496315214228 |  Val Loss : 0.13115763731950678 | Train Acc : 94.5745849609375 | Val Acc : 94.84786987304688 |


| Train Loss : 0.13801188019291585 |  Val Loss : 0.13142205188967077 | Train Acc : 94.8686752319336 | Val Acc : 94.80730438232422 |


| Train Loss : 0.1363220982226729 |  Val Loss : 0.1313079434539556 | Train Acc : 94.80783081054688 | Val Acc : 94.96957397460938 |
Words found are : 4808
Embedding(4808, 100)
-----------------------------------------------------------4-fold of the model-----------------------------------------------------------


| Train Loss : 0.6911548577001982 |  Val Loss : 0.6876109206168695 | Train Acc : 52.651859283447266 | Val Acc : 52.77890396118164 |


| Train Loss : 0.6743600447017587 |  Val Loss : 0.6265811671833479 | Train Acc : 61.22097396850586 | Val Acc : 70.30426025390625 |


| Train Loss : 0.4282818848735456 |  Val Loss : 0.684238060879659 | Train Acc : 81.10739135742188 | Val Acc : 68.11358642578125 |


| Train Loss : 0.2622801862716004 |  Val Loss : 0.23032262229623224 | Train Acc : 89.6866455078125 | Val Acc : 91.03448486328125 |


| Train Loss : 0.23170697569677862 |  Val Loss : 0.4342333251395042 | Train Acc : 91.05567169189453 | Val Acc : 82.67748260498047 |


| Train Loss : 0.20028705463522092 |  Val Loss : 1.076980977706445 | Train Acc : 92.09004974365234 | Val Acc : 62.51521301269531 |


| Train Loss : 0.18528808286303428 |  Val Loss : 0.8639120592064136 | Train Acc : 93.01287841796875 | Val Acc : 68.51927185058594 |


| Train Loss : 0.17916686155167183 |  Val Loss : 0.5957103418216533 | Train Acc : 93.2359848022461 | Val Acc : 75.90263366699219 |


| Train Loss : 0.17905510014838977 |  Val Loss : 0.2534836248258548 | Train Acc : 93.00273895263672 | Val Acc : 89.53346252441406 |


| Train Loss : 0.16948485468971283 |  Val Loss : 0.2200969202477961 | Train Acc : 93.51992797851562 | Val Acc : 92.45436096191406 |


| Train Loss : 0.16412440290899913 |  Val Loss : 0.16161877959437715 | Train Acc : 93.74302673339844 | Val Acc : 93.87423706054688 |


| Train Loss : 0.16017915118754292 |  Val Loss : 0.18472651249982236 | Train Acc : 93.90528106689453 | Val Acc : 93.54969787597656 |


| Train Loss : 0.1600240820314444 |  Val Loss : 0.5560698172576257 | Train Acc : 94.0066909790039 | Val Acc : 79.51318359375 |


| Train Loss : 0.15520595832641054 |  Val Loss : 0.15929726342808523 | Train Acc : 93.93570709228516 | Val Acc : 94.19878387451172 |


| Train Loss : 0.15101060567745003 |  Val Loss : 0.22002412496667362 | Train Acc : 94.09796142578125 | Val Acc : 90.8316421508789 |


| Train Loss : 0.15733804724418815 |  Val Loss : 0.15901480540935456 | Train Acc : 94.0066909790039 | Val Acc : 93.83367156982422 |


| Train Loss : 0.14577061365490848 |  Val Loss : 0.19470908126130732 | Train Acc : 94.39205169677734 | Val Acc : 93.22515106201172 |


| Train Loss : 0.14260936773548066 |  Val Loss : 0.16489973151055118 | Train Acc : 94.39205169677734 | Val Acc : 93.99594116210938 |


| Train Loss : 0.14128540641709522 |  Val Loss : 0.15275998859671955 | Train Acc : 94.61515045166016 | Val Acc : 94.36105346679688 |


| Train Loss : 0.14273306184237028 |  Val Loss : 0.1508485665677104 | Train Acc : 94.33120727539062 | Val Acc : 94.52333068847656 |


| Train Loss : 0.1391789112374188 |  Val Loss : 0.15952124865253364 | Train Acc : 94.55430603027344 | Val Acc : 93.67140197753906 |


| Train Loss : 0.13787860021217274 |  Val Loss : 0.1851873290442633 | Train Acc : 94.5745849609375 | Val Acc : 93.71196746826172 |


| Train Loss : 0.13917063714439784 |  Val Loss : 0.16102914053032413 | Train Acc : 94.68614196777344 | Val Acc : 94.11764526367188 |


| Train Loss : 0.13555778364234686 |  Val Loss : 0.160154372260766 | Train Acc : 94.74698638916016 | Val Acc : 94.40161895751953 |


| Train Loss : 0.13427625576274269 |  Val Loss : 0.16251293641168427 | Train Acc : 94.78754425048828 | Val Acc : 93.54969787597656 |


| Train Loss : 0.13265408858964398 |  Val Loss : 0.15611886511468018 | Train Acc : 95.03092956542969 | Val Acc : 94.19878387451172 |


| Train Loss : 0.13294751582359907 |  Val Loss : 0.26559978433361403 | Train Acc : 94.85853576660156 | Val Acc : 87.54563903808594 |


| Train Loss : 0.1342687845753706 |  Val Loss : 0.16233687356586543 | Train Acc : 94.72669982910156 | Val Acc : 94.03651428222656 |


| Train Loss : 0.1289015904736862 |  Val Loss : 0.19515375549702868 | Train Acc : 94.93965911865234 | Val Acc : 94.03651428222656 |


| Train Loss : 0.13065833290371145 |  Val Loss : 0.16567381830049138 | Train Acc : 94.94979858398438 | Val Acc : 93.75253295898438 |
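
The validation metrics in the log fluctuate from epoch to epoch (fold 0, for instance, drops to about 49% validation accuracy in one epoch before recovering), so the weights after the final epoch are not necessarily the best ones. One common option is to keep a copy of the state with the lowest validation loss seen so far. The sketch below uses plain dictionaries in place of `model.state_dict()`; `BestCheckpoint` is a hypothetical helper and the loss values are illustrative only.

```python
import copy


class BestCheckpoint:
    """Remember the state that achieved the lowest validation loss so far."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_epoch = None
        self.best_state = None

    def update(self, epoch, val_loss, state):
        """Store a deep copy of `state` if `val_loss` improves on the best."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.best_state = copy.deepcopy(state)
            return True
        return False


tracker = BestCheckpoint()
# Made-up history; in the notebook `state` would be model.state_dict().
history = [
    (0, 0.69, {"w": 0.1}),
    (1, 0.18, {"w": 0.4}),
    (2, 1.65, {"w": 0.9}),
    (3, 0.16, {"w": 0.5}),
]
for epoch, val_loss, state in history:
    tracker.update(epoch, val_loss, state)

print(tracker.best_epoch, tracker.best_loss)
```

The deep copy matters with PyTorch because `state_dict()` returns references to the live parameter tensors, which later training steps keep updating in place.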


In [61]:
torch.save(model_state_dicts, "My-Model.pt")

<h1  style="text-align: center" class="list-group-item list-group-item-action active">7. Inference</h1><a id = "7" ></a>

In [62]:
def numericalize(text):
    """Convert text to vocabulary indices wrapped in <SOS> ... <EOS>."""
    numericalized_source = [dataset.source_vocab.stoi["<SOS>"]]
    numericalized_source += dataset.source_vocab.numericalize(text)
    numericalized_source.append(dataset.source_vocab.stoi["<EOS>"])

    return numericalized_source

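def padding(source):
    """Right-pad (or truncate) a list of indices to shape (config.MAX_LEN, 1)."""
    # Zeros act as padding; MyCollate is also constructed with 0 during training.
    padded_sequence = torch.zeros(config.MAX_LEN, 1, dtype = torch.int)
    source = torch.tensor(source, dtype = torch.int)

    if len(source) > config.MAX_LEN:
        # Note: truncation also drops the trailing <EOS> token.
        padded_sequence[:, 0] = source[: config.MAX_LEN]
    else:
        padded_sequence[:len(source), 0] = source

    return padded_sequence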
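
`padding` truncates sequences longer than `config.MAX_LEN` by slicing, which also cuts off the trailing `<EOS>` token. Whether that matters depends on how the model was trained, so the following is only a sketch of one alternative that keeps the end marker. The function name `pad_or_truncate` and the token ids are assumptions made for illustration.

```python
def pad_or_truncate(ids, max_len, pad_id=0, eos_id=None):
    """Return a list of exactly `max_len` ids.

    Short sequences are right-padded with `pad_id`. Long sequences are cut to
    `max_len`; if `eos_id` is given, the last kept position is overwritten
    with it so the sequence still ends with an end-of-sequence marker.
    """
    if len(ids) > max_len:
        out = list(ids[:max_len])
        if eos_id is not None:
            out[-1] = eos_id
        return out
    return list(ids) + [pad_id] * (max_len - len(ids))


# Hypothetical ids: 1 = <SOS>, 2 = <EOS>, 0 = <PAD>.
short = pad_or_truncate([1, 7, 8, 2], max_len=6)
long = pad_or_truncate([1, 7, 8, 9, 10, 11, 2], max_len=5, eos_id=2)
print(short)
print(long)
```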

In [63]:
def infer_processing(text):
    """Clean a raw review, map it to vocabulary indices, and pad it for the model."""
    text = preprocessing(text)
    text = numericalize(text)
    text = padding(text)
    return text

In [64]:
aspects = ["phone", "camera", "battery", "neutral", "processor"]

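def get_similarity(text, aspect):
    """Cosine similarity between the review's words and a single aspect word."""
    # n_similarity expects two lists of words; a bare string would be
    # iterated character by character.
    words = text.split() if isinstance(text, str) else list(text)
    try:
        return fasttext_model.wv.n_similarity(words, [aspect])
    except (KeyError, ZeroDivisionError):
        # Unknown words or an empty review: treat as no similarity
        return 0

def best_aspect(text, aspects):
    """Return the aspect whose embedding is most similar to the review."""
    scores = [get_similarity(text, aspect) for aspect in aspects]
    return aspects[np.argmax(scores)]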
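
gensim's `n_similarity` compares two collections of words by taking the cosine similarity of their mean vectors, so a bare string such as `"camera"` is iterated character by character rather than treated as one word. The toy example below illustrates the difference with made-up 2-dimensional vectors and a hypothetical re-implementation named `n_similarity`; it is a sketch of the idea, not gensim's code.

```python
import math

# Made-up embeddings; a real model would supply these vectors.
TOY_VECTORS = {
    "battery": (1.0, 0.0),
    "backup": (0.9, 0.1),
    "camera": (0.0, 1.0),
    # Single characters, as seen when a string is iterated.
    "c": (0.7, 0.7), "a": (0.7, 0.7), "m": (0.7, 0.7),
    "e": (0.7, 0.7), "r": (0.7, 0.7),
}


def mean_vector(words):
    """Average the toy vectors of all items in `words`."""
    vecs = [TOY_VECTORS[w] for w in words]
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def n_similarity(words_a, words_b):
    """Cosine similarity between the mean vectors of two word collections."""
    return cosine(mean_vector(words_a), mean_vector(words_b))


review = ["battery", "backup"]
as_word = n_similarity(review, ["camera"])  # aspect treated as one word
as_chars = n_similarity(review, "camera")   # aspect iterated as 'c', 'a', 'm', ...
print(as_word, as_chars)
```

With these toy vectors the two calls give very different scores, which is why the aspect is best passed as a one-element list.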


In [65]:
sample = "I am really impressed with the phone's great battery backup."

ba = best_aspect(preprocessing(sample), aspects)

a = infer_processing(sample).to(config.DEVICE)

In [66]:
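# `model` still refers to the network trained in the last fold
model.eval()
with torch.no_grad():
    score = model(a).cpu().numpy().reshape(-1)[0]

# Scores above 0.5 are read as positive, everything else as negative
sentiment = 'Positively' if score > 0.5 else 'Negatively'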

In [67]:
print(f"The reviewer is talking {sentiment} about the {ba} of the phone in his/her comment")

The reviewer is talking Positively about the processor of the phone in his/her comment
