# Module - 3 (Topic analysis and topic (attribute) wise sentiment analysis)

## Download Data

In [1]:
!wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Amazon_Instant_Video_5.json.gz

--2021-04-04 11:44:40--  http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Amazon_Instant_Video_5.json.gz
Resolving snap.stanford.edu (snap.stanford.edu)... 171.64.75.80
Connecting to snap.stanford.edu (snap.stanford.edu)|171.64.75.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9517526 (9.1M) [application/x-gzip]
Saving to: ‘reviews_Amazon_Instant_Video_5.json.gz.2’


2021-04-04 11:44:48 (1.15 MB/s) - ‘reviews_Amazon_Instant_Video_5.json.gz.2’ saved [9517526/9517526]



In [2]:
#https://drive.google.com/file/d/1Y7fUWhnyjgwYy-lEhm6c155G_x8OQnQy/view?usp=sharing
!gdown https://drive.google.com/uc?id=1Y7fUWhnyjgwYy-lEhm6c155G_x8OQnQy
!unzip Models.zip

Downloading...
From: https://drive.google.com/uc?id=1Y7fUWhnyjgwYy-lEhm6c155G_x8OQnQy
To: /content/Models.zip
9.25MB [00:00, 19.9MB/s]
Archive:  Models.zip
   creating: Models/
  inflating: Models/tfidfmodel.joblib  
  inflating: Models/model.expElogbeta.npy  
  inflating: Models/lrmodel.joblib   
  inflating: Models/model.id2word    
  inflating: Models/model            
  inflating: Models/model.state      


## Importing required libraries

In [3]:
import re
import os
import json
import gzip
import numpy as np
import pandas as pd
from tqdm import tqdm
import matplotlib.pyplot as plt

import nltk
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer 

import gensim
from gensim.utils import simple_preprocess

from joblib import dump, load

%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

!pip install ipython-autotime
%load_ext autotime

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
#Snltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

time: 98 ms (started: 2021-04-04 11:44:56 +00:00)


## Loading Data

In [4]:
# load data
data = []
with gzip.open('reviews_Amazon_Instant_Video_5.json.gz') as file:
    for line in file:
        data.append(json.loads(line.strip()))

# convert list into pandas dataframe
df = pd.DataFrame.from_dict(data)

print('Total number of reviews are: ',len(data))

# Viewing first 5 reviews
df.head(5)

Total number of reviews are:  37126


Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A11N155CW1UV02,B000H00VBQ,AdrianaM,"[0, 0]",I had big expectations because I love English ...,2.0,A little bit boring for me,1399075200,"05 3, 2014"
1,A3BC8O2KCL29V2,B000H00VBQ,Carol T,"[0, 0]",I highly recommend this series. It is a must f...,5.0,Excellent Grown Up TV,1346630400,"09 3, 2012"
2,A60D5HQFOTSOM,B000H00VBQ,"Daniel Cooper ""dancoopermedia""","[0, 1]",This one is a real snoozer. Don't believe anyt...,1.0,Way too boring for me,1381881600,"10 16, 2013"
3,A1RJPIGRSNX4PW,B000H00VBQ,"J. Kaplan ""JJ""","[0, 0]",Mysteries are interesting. The tension betwee...,4.0,Robson Green is mesmerizing,1383091200,"10 30, 2013"
4,A16XRPF40679KG,B000H00VBQ,Michael Dobey,"[1, 1]","This show always is excellent, as far as briti...",5.0,Robson green and great writing,1234310400,"02 11, 2009"


time: 1.11 s (started: 2021-04-04 11:44:56 +00:00)


**Attribute Information:**

1. reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B

2. asin - ID of the product, e.g. 0000013714

3. reviewerName - name of the reviewer

4. helpful - helpful votes of the review

5. reviewText - text of the review

6. overall - rating of the product

7. summary - summary of the review

8. unixReviewTime - time of the review (unix time)

9. reviewTime - time of the review (raw)


In [5]:
df.columns #verifying for columns

Index(['reviewerID', 'asin', 'reviewerName', 'helpful', 'reviewText',
       'overall', 'summary', 'unixReviewTime', 'reviewTime'],
      dtype='object')

time: 5.52 ms (started: 2021-04-04 11:44:57 +00:00)


## Modifying the overall to our requirements

In [6]:
def modify_overall(overall):
  ''' 
  Function to modify overall:
    overall greater than 3 is changed to 1(positive sentiment) 
    overall less than or equal to 3 is changed to 0(negative sentiment)
  Input: overall
  Output: Modified overall
  '''
  if overall <= 3:
      return 0
  return 1

time: 3.02 ms (started: 2021-04-04 11:44:57 +00:00)


In [7]:
actualScore = df['overall']
new_score = actualScore.map(modify_overall) 
df['overall'] = new_score
print("Number of data points in our data", df.shape)
df.head()

Number of data points in our data (37126, 9)


Unnamed: 0,reviewerID,asin,reviewerName,helpful,reviewText,overall,summary,unixReviewTime,reviewTime
0,A11N155CW1UV02,B000H00VBQ,AdrianaM,"[0, 0]",I had big expectations because I love English ...,0,A little bit boring for me,1399075200,"05 3, 2014"
1,A3BC8O2KCL29V2,B000H00VBQ,Carol T,"[0, 0]",I highly recommend this series. It is a must f...,1,Excellent Grown Up TV,1346630400,"09 3, 2012"
2,A60D5HQFOTSOM,B000H00VBQ,"Daniel Cooper ""dancoopermedia""","[0, 1]",This one is a real snoozer. Don't believe anyt...,0,Way too boring for me,1381881600,"10 16, 2013"
3,A1RJPIGRSNX4PW,B000H00VBQ,"J. Kaplan ""JJ""","[0, 0]",Mysteries are interesting. The tension betwee...,1,Robson Green is mesmerizing,1383091200,"10 30, 2013"
4,A16XRPF40679KG,B000H00VBQ,Michael Dobey,"[1, 1]","This show always is excellent, as far as briti...",1,Robson green and great writing,1234310400,"02 11, 2009"


time: 73.4 ms (started: 2021-04-04 11:44:57 +00:00)


## Cleaning Data

In [8]:
''' Remove the duplicates which have same reviewerID, asin, unixReviewTime, and reviewText '''
new = df.drop_duplicates(subset={"reviewerID","asin","unixReviewTime","reviewText"}, keep='first', inplace=False)
new.shape

(37126, 9)

time: 125 ms (started: 2021-04-04 11:44:57 +00:00)


In [9]:
''' Why we add extra stopwords to extract the topics'''
stopwords= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'should', "should've",'different','especially','common','anything','unfunny', 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ma','amazon','also','keep',"go","bad","better","back","excellent","year","cast" ,"two","know","new", "take","still" ,"come","best" ,"every","ever" ,"many",
            "pilot" ,"line" ,"next" ,"wait"  ,"watched" ,"going"])

time: 14 ms (started: 2021-04-04 11:44:57 +00:00)


In [10]:
stopwords1= set(['br', 'the', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
            "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
            'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
            'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
            'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
            'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
            'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
            'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
            'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
            'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
            's', 't', 'can', 'will', 'just', 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
            've', 'y', 'ma'])

time: 8.46 ms (started: 2021-04-04 11:44:57 +00:00)


### Required functions

In [11]:
def decontracted(phrase):
    ''' 
      Function used to decontact the words in the phrase
      Input: phrase
      Output: decontracted phrase
    '''
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

time: 9.4 ms (started: 2021-04-04 11:44:57 +00:00)


In [12]:
def pipeline(review): 
  '''
     Input: list of review
     Output: processed reviews 
  '''
  processed_reviews = []
  for sentence in tqdm(review):
    sentence = sentence.lower() #converting all letters in the sentence to lowercase
    sentence = decontracted(sentence) #decontact the sentence in the review
    sentence = re.sub('[^A-Za-z]+', ' ', sentence) #retaining only alphabets in the sentence 

    #Word tokenization
    word_tokens = word_tokenize(sentence) 

    #Stop word removal
    filtered_sentence = [w for w in word_tokens if  w not in stopwords]
    
    #Lemmatization
    wnl = WordNetLemmatizer()
    filtered_sentence = [wnl.lemmatize(w) for w in filtered_sentence ]
    processed_reviews.append(filtered_sentence)
    
  return processed_reviews


time: 12.2 ms (started: 2021-04-04 11:44:57 +00:00)


In [13]:
def pipeline_classification(review): 
  '''
     Input: list of review
     Output: processed reviews 
  '''
  processed_reviews = []
  for sentence in tqdm(review):
    sentence = sentence.lower() #converting all letters in the sentence to lowercase
    sentence = decontracted(sentence) #decontact the sentence in the review
    sentence = re.sub('[^A-Za-z]+', ' ', sentence) #retaining only alphabets in the sentence 

    #Word tokenization
    word_tokens = word_tokenize(sentence) 

    #Stop word removal
    filtered_sentence = [w for w in word_tokens if  w not in stopwords1]
    
    #Lemmatization
    wnl = WordNetLemmatizer()
    filtered_sentence = [wnl.lemmatize(w) for w in filtered_sentence ]

    final = ' '.join(filtered_sentence)
    processed_reviews.append(final)
  return processed_reviews


time: 12.7 ms (started: 2021-04-04 11:44:57 +00:00)


## Creating required variables

In [14]:
reviews = df['reviewText'].values
score = df['overall'].values

time: 2.06 ms (started: 2021-04-04 11:44:57 +00:00)


In [15]:
#Saving Orignal Reviews
unprocessed_review = reviews

#filtering review
reviews = pipeline(reviews)

100%|██████████| 37126/37126 [00:31<00:00, 1183.99it/s]

time: 31.4 s (started: 2021-04-04 11:44:57 +00:00)





# 1. Extract the topics from the reviews using any topic extraction technique of your choice.

In [16]:
dictionary = gensim.corpora.Dictionary(reviews)

time: 2.5 s (started: 2021-04-04 11:45:31 +00:00)


In [17]:
# Printing first10 values of dictionary
count = 0
for i in dictionary:
    print(i, dictionary[i])
    count += 1
    if count > 10:
        break

0 appeal
1 big
2 boring
3 detective
4 english
5 expectation
6 guy
7 investigative
8 love
9 not
10 particular
time: 24.5 ms (started: 2021-04-04 11:45:33 +00:00)


In [18]:
#Remove very rare and very common words:
#words appearing less than 15 times
#words appearing in more than 10% of all documents

dictionary.filter_extremes(no_below=15, no_above=0.1, keep_n= 100000)

time: 135 ms (started: 2021-04-04 11:45:33 +00:00)


In [19]:
'''
Create the Bag-of-words model for each review i.e for each review we create a dictionary reporting how many
words and how many times those words appear. Save this to 'bow_corpus'
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in reviews]

time: 1.85 s (started: 2021-04-04 11:45:34 +00:00)


In [20]:
'''
Preview BOW for our sample preprocessed document
'''
review_index = 50
bow_review_index = bow_corpus[review_index]

for i in range(len(bow_review_index)):
    print("Word {} (\"{}\") appears {} time.".format(bow_review_index[i][0],  dictionary[bow_review_index[i][0]],bow_review_index[i][1]))

Word 14 ("highly") appears 1 time.
Word 19 ("recommend") appears 1 time.
Word 119 ("bit") appears 1 time.
Word 192 ("little") appears 1 time.
Word 426 ("week") appears 1 time.
Word 712 ("wish") appears 1 time.
Word 782 ("available") appears 1 time.
Word 783 ("began") appears 1 time.
Word 784 ("edge") appears 1 time.
Word 785 ("happy") appears 1 time.
Word 786 ("reality") appears 1 time.
Word 787 ("seat") appears 1 time.
Word 788 ("stream") appears 1 time.
Word 789 ("suspend") appears 1 time.
time: 12.7 ms (started: 2021-04-04 11:45:35 +00:00)


In [None]:
#Running LDA using Bag of Words
lda_model =  gensim.models.LdaMulticore(bow_corpus,num_topics = 8,id2word = dictionary,passes = 10,workers = 8)

time: 4min 5s (started: 2021-04-04 08:14:41 +00:00)


In [26]:
# Saving Model

lda_model.save('model1')

time: 16.9 ms (started: 2021-04-04 10:30:50 +00:00)


In [21]:
lda_model = gensim.models.LdaMulticore.load('model')

time: 17.7 ms (started: 2021-04-04 11:45:35 +00:00)


In [22]:
'''
For each topic, we will explore the words occuring in that topic and its relative weight
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx+1, topic ))
    print("\n")

Topic: 1 
Words: 0.006*"doctor" + 0.004*"science" + 0.004*"fan" + 0.004*"dvd" + 0.004*"tv" + 0.004*"end" + 0.003*"set" + 0.003*"special" + 0.003*"u" + 0.003*"find"


Topic: 2 
Words: 0.008*"family" + 0.007*"crime" + 0.005*"murder" + 0.005*"life" + 0.005*"case" + 0.004*"detective" + 0.004*"mystery" + 0.004*"man" + 0.004*"police" + 0.004*"plot"


Topic: 3 
Words: 0.018*"film" + 0.007*"life" + 0.004*"woman" + 0.004*"man" + 0.003*"find" + 0.003*"world" + 0.003*"scene" + 0.003*"work" + 0.003*"never" + 0.003*"look"


Topic: 4 
Words: 0.016*"funny" + 0.010*"tv" + 0.010*"comedy" + 0.006*"laugh" + 0.006*"fun" + 0.005*"actor" + 0.005*"humor" + 0.004*"hope" + 0.004*"writing" + 0.004*"family"


Topic: 5 
Words: 0.015*"film" + 0.009*"acting" + 0.008*"plot" + 0.006*"horror" + 0.006*"scene" + 0.006*"action" + 0.005*"end" + 0.005*"lot" + 0.004*"pretty" + 0.004*"little"


Topic: 6 
Words: 0.009*"film" + 0.004*"role" + 0.004*"scene" + 0.004*"actor" + 0.003*"play" + 0.003*"city" + 0.003*"life" + 0.003*"r

In [51]:
Topics = [ 'Science related tv  show','Crime or detective film/tv show','Mystery Drama film','Comedy tv show','Horror/Action Film','Drama film',
'General topic','Play (theatre) or Musical-Drama film/tv show']

time: 919 µs (started: 2021-04-04 12:21:44 +00:00)


# 2. Report sentences under each topic.
# 3. Analyse whether the topics extracted make sense. Justify your claim with some examples.

In [24]:
topic_list = []
for i in tqdm(reviews):
  t = lda_model[dictionary.doc2bow(i)]
  max_prob = t[0][1]
  topic = t[0][0]
  for j in t:
    if j[1]>max_prob:
      topic = j[0]
  topic_list.append(topic+1)


100%|██████████| 37126/37126 [00:27<00:00, 1329.34it/s]

time: 27.9 s (started: 2021-04-04 11:46:24 +00:00)





In [25]:
#Count of total reviews under each topic

from collections import Counter
count = Counter(topic_list)
print(count)

Counter({8: 13526, 7: 8153, 1: 3519, 5: 3182, 4: 2932, 2: 2251, 3: 1891, 6: 1672})
time: 5.82 ms (started: 2021-04-04 11:46:52 +00:00)


## Sentences Under Topic 1

In [26]:
#Sentences under topic 1

print('Sentences Under Topic 1')

itr = 0
ct =0
while(True):
  if(topic_list[itr] == 1):
    print('-'*100)
    print(ct+1,end='')
    print('. ',end=' ')
    print(unprocessed_review [itr])
    print('-'*100)
    ct+=1
  itr+=1
  if(ct ==10):
    break
    

  

Sentences Under Topic 1
----------------------------------------------------------------------------------------------------
1.  I love the variety of comics.  Great for dinner TV entertainment because of length of each episode.  Many of the featured comics have gone on to even bigger TV specials so it's great to see some of their earlier material.
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
2.  This show is like a runaway train ride. Every episode has poor Jack in some death defying crisis. Really enjoyed the ride.
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
3.  For sheer intensity and plot topsy-turvies, no other show comes close to 24. You'd think the premise (every hour o

## Sentences Under Topic 2

In [27]:
#Sentences under topic 2

print('Sentences Under Topic 2')

itr = 0
ct =0
while(True):
  if(topic_list[itr] == 2):
    print('-'*100)
    print(ct+1,end='')
    print('. ',end=' ')
    print(unprocessed_review [itr])
    print('-'*100)
    ct+=1
  itr+=1
  if(ct ==10):
    break
    

  

Sentences Under Topic 2
----------------------------------------------------------------------------------------------------
1.  I discovered this series quite by accident. Having watched and appreciated Masterpiece Contemporary: Place of Execution, I was keen to read the novel (which inspired the TV adaptation) by Val McDermid. The novel was very well-written, and a nail-biting suspense thriller. Then I discovered that Val McDermid wrote other novels as well, and a couple of them inspired the TV crime drama Wire in the Blood.I finished watching all of Season 1 and have become a fan of this gritty crime drama that follows the investigations led by DI Carol Jordan (Hermione Norris). She is assisted by clinical psychologist Dr. Tony Hill (Robson Green), a rather eccentric figure who delves deeply into the minds of serial killers, studies patterns of criminal behavior and profiles criminals. His methods may seem strange at times, but he always manages to get results. Both Jordan and Hill 

## Sentences Under Topic 3

In [28]:
#Sentences under topic 3

print('Sentences Under Topic 3')

itr = 0
ct =0
while(True):
  if(topic_list[itr] == 3):
    print('-'*100)
    print(ct+1,end='')
    print('. ',end=' ')
    print(unprocessed_review [itr])
    print('-'*100)
    ct+=1
  itr+=1
  if(ct ==10):
    break
    

  

Sentences Under Topic 3
----------------------------------------------------------------------------------------------------
1.  Season three opens, with Gibbs going after Kate's killer, Ari. And to try to throw alot at us, NCIS gets a new director, Jenny Shepard played by Lauren Holly. Now Shepard and Gibbs do have some history, since both were partners at NCIS at one time, also. To proclaim Ari's innocence, is Massade officer Ziva David. And I'm sorry, I didn't like her then. And I don't liker her now, She and pretty much all of Massad is built up like they are the only agents that are extremely tough, they're unstoppable. So Ziva and Jenny try to convince Gibbs that he's wrong, and Gibbs even makes the only logical argument against Ari's innocence. That if he were innocent, he would turn himself in. So, after much arguing, he gets Ziva to follow him. And Ziva kills Ari in Gibb's basement, And we get each persons view of how they saw Kate in this episode. Now, I would have inserted P

## Sentences Under Topic 4

In [29]:
#Sentences under topic 4

print('Sentences Under Topic 4')

itr = 0
ct =0
while(True):
  if(topic_list[itr] == 4):
    print('-'*100)
    print(ct+1,end='')
    print('. ',end=' ')
    print(unprocessed_review [itr])
    print('-'*100)
    ct+=1
  itr+=1
  if(ct ==10):
    break   

  

Sentences Under Topic 4
----------------------------------------------------------------------------------------------------
1.  This is the best of the best comedy Stand-up. The fact that I was able to just watch continuously one comedian after another was great. I had the best laughter I have had in a long time.
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
2.  Watched it for Kevin Hart and only Kevin Hart!  He makes me laugh.  The best comedy comes from pain and Kevin does his comedy with a huge heart.
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
3.  some comedians are very good, some not so good, and some are just not funny.  I don't know the names to chose from and watched 

## Sentences Under Topic 5

In [30]:
#Sentences under topic 5

print('Sentences Under Topic 5')

itr = 0
ct =0
while(True):
  if(topic_list[itr] == 5):
    print('-'*100)
    print(ct+1,end='')
    print('. ',end=' ')
    print(unprocessed_review [itr])
    print('-'*100)
    ct+=1
  itr+=1
  if(ct ==10):
    break
    

  

Sentences Under Topic 5
----------------------------------------------------------------------------------------------------
1.  There were some good entertainers, and some are just dumb.  I kept a paper and pen and wrote down names of the ones I wanted to see more.  They keep them in fairly short bits.  If you are looking to laugh continuous this may not be for you.  But each person has a different sense of humor so give it a try, but don't expect greatness.
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
2.  Non stop action with edge of your seat thrilling plots with a new turn every minute.  Cast of great stars keep all events seeming like weeks but actually are in 24 hours
----------------------------------------------------------------------------------------------------
-----------------------------------------------------------

## Sentences Under Topic 6

In [31]:
#Sentences under topic 6

print('Sentences Under Topic 6')

itr = 0
ct =0
while(True):
  if(topic_list[itr] == 6):
    print('-'*100)
    print(ct+1,end='')
    print('. ',end=' ')
    print(unprocessed_review [itr])
    print('-'*100)
    ct+=1
  itr+=1
  if(ct ==10):
    break
    

  

Sentences Under Topic 6
----------------------------------------------------------------------------------------------------
1.  This is a banner series for this extraordinarily conceived and executed television series.  From its explosive opening through 24 hours of plot twists, exciting action, and devastating loss, this season is absolutely superb.  Be warned though---there's lots of unexpected killings in this season and several favorite characters are disposed of.  You just don't know what to expect from this Emmy-winning series.Kiefer Sutherland won his much deserved Best Actor emmy as Jack finds himself confronted with one impossible task after another.  The supporting cast is superb:  James Morrison as CTU director Bill Buchanan; Mary Lynn Rajskub as the neurotic but brilliant Chloe; Kim Raver as Jack's former love Audrey; William Devane as irasible Secretary of Defense; Louis Lombardi as the plump and human Edgar; Jayne Atkinson as the Homeland Security Director; Sean Astin as

## Sentences Under Topic 7

In [32]:
#Sentences under topic 7

print('Sentences Under Topic 7')

itr = 0
ct =0
while(True):
  if(topic_list[itr] == 7):
    print('-'*100)
    print(ct+1,end='')
    print('. ',end=' ')
    print(unprocessed_review [itr])
    print('-'*100)
    ct+=1
  itr+=1
  if(ct ==10):
    break
    

  

Sentences Under Topic 7
----------------------------------------------------------------------------------------------------
1.  I highly recommend this series. It is a must for anyone who is yearning to watch "grown up" television. Complex characters and plots to keep one totally involved. Thank you Amazin Prime.
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
2.  There are many episodes in this series, so I pretty-much just skip through them to try to find a description of something I think I would like.  It's kind of a crap shoot as to whether you'll be entertained or not, but hey, if you're just sitting around trying to kill 20-30 minutes while you're waiting for something else to do, it's worth a shot.
----------------------------------------------------------------------------------------------------
----------------------------

## Sentences Under Topic 8

In [33]:
#Sentences under topic 8

print('Sentences Under Topic 8')

itr = 0
ct =0
while(True):
  if(topic_list[itr] == 8):
    print('-'*100)
    print(ct+1,end='')
    print('. ',end=' ')
    print(unprocessed_review [itr])
    print('-'*100)
    ct+=1
  itr+=1
  if(ct ==10):
    break
    

  

Sentences Under Topic 8
----------------------------------------------------------------------------------------------------
1.  I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn't appeal to me at all.
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
2.  This one is a real snoozer. Don't believe anything you read or hear, it's awful. I had no idea what the title means. Neither will you.
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
3.  Mysteries are interesting.  The tension between Robson and the tall blond is good but not always believable.  She often seemed uncomfortable.
----------------------

# 4. Report topic wise sentiment distribution for the whole repository. Explain the method that you used. Give complete reference of any paper that you use for the purpose.

In [52]:

data = pipeline_classification(unprocessed_review)
tf_idf_vect = load('tfidfmodel.joblib')
tf_idf_vect = tf_idf_vect.transform(data)
model = load('lrmodel.joblib')
sentiment = model.predict(tf_idf_vect)


100%|██████████| 37126/37126 [00:28<00:00, 1297.23it/s]


time: 36.1 s (started: 2021-04-04 12:21:49 +00:00)


In [53]:
distribution = [[0,0] for i in range(8)]
for i in tqdm(range(len(data))):
  distribution[topic_list[i]-1][sentiment[i]]+=1

100%|██████████| 37126/37126 [00:00<00:00, 865444.68it/s]

time: 51 ms (started: 2021-04-04 12:22:25 +00:00)





In [54]:
for i in range(len(distribution)):
  print("Total positive and Negative reviws under topic {} are {} , {} .".format(Topics[i],distribution[i][1],distribution[i][0]))

Total positive and Negative reviws under topic Science related tv  show are 3225 , 294 .
Total positive and Negative reviws under topic Crime or detective film/tv show are 2088 , 163 .
Total positive and Negative reviws under topic Mystery Drama film are 1526 , 365 .
Total positive and Negative reviws under topic Comedy tv show are 2594 , 338 .
Total positive and Negative reviws under topic Horror/Action Film are 2149 , 1033 .
Total positive and Negative reviws under topic Drama film are 1429 , 243 .
Total positive and Negative reviws under topic General topic are 6898 , 1255 .
Total positive and Negative reviws under topic Play (theatre) or Musical-Drama film/tv show are 12199 , 1327 .
time: 6.47 ms (started: 2021-04-04 12:22:25 +00:00)


In [55]:
df = pd.DataFrame(
{"Topics" : Topics,
"Positive Sentiments" : [distribution[i][1] for i in range(8)],
"Negative Sentiments" : [distribution[i][0] for i in range(8)]},
)


time: 6.35 ms (started: 2021-04-04 12:22:25 +00:00)


In [56]:
df

Unnamed: 0,Topics,Positive Sentiments,Negative Sentiments
0,Science related tv show,3225,294
1,Crime or detective film/tv show,2088,163
2,Mystery Drama film,1526,365
3,Comedy tv show,2594,338
4,Horror/Action Film,2149,1033
5,Drama film,1429,243
6,General topic,6898,1255
7,Play (theatre) or Musical-Drama film/tv show,12199,1327


time: 21.7 ms (started: 2021-04-04 12:22:25 +00:00)
