<a href="https://colab.research.google.com/github/santhoshjinna15/INFO5731/blob/main/In_class_exercise_05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The fifth In-class-exercise (2/23/2021, 20 points in total)

In exercise-03, I asked you to collected 500 textual data based on your own information needs (If you didn't collect the textual data, you should recollect for this exercise). Now we need to think about how to represent the textual data for text classification. In this exercise, you are required to select 10 types of features (10 types of features but absolutely more than 10 features) in the followings feature list, then represent the 500 texts with these features. The output should be in the following format:
![image.png](attachment:image.png)

The feature list:

* (1) tf-idf features
* (2) POS-tag features: number of adjective, adverb, auxiliary, punctuation, complementizer, coordinating conjunction, subordinating conjunction, determiner, interjection, noun, possessor, preposition, pronoun, quantifier, verb, and other. (select some of them if you use pos-tag features)
* (3) Linguistic features:
  * number of right-branching nodes across all constituent types
  * number of right-branching nodes for NPs only
  * number of left-branching nodes across all constituent types
  * number of left-branching nodes for NPs only
  * number of premodifiers across all constituent types
  * number of premodifiers within NPs only
  * number of postmodifiers across all constituent types
  * number of postmodifiers within NPs only
  * branching index across all constituent types, i.e. the number of right-branching nodes minus number of left-branching nodes
  * branching index for NPs only
  * branching weight index: number of tokens covered by right-branching nodes minus number of tokens covered by left-branching nodes across all categories
  * branching weight index for NPs only 
  * modification index, i.e. the number of premodifiers minus the number of postmodifiers across all categories
  * modification index for NPs only
  * modification weight index: length in tokens of all premodifiers minus length in tokens of all postmodifiers across all categories
  * modification weight index for NPs only
  * coordination balance, i.e. the maximal length difference in coordinated constituents
  
  * density (density can be calculated using the ratio of folowing function words to content words) of determiners/quantifiers
  * density of pronouns
  * density of prepositions
  * density of punctuation marks, specifically commas and semicolons
  * density of auxiliary verbs
  * density of conjunctions
  * density of different pronoun types: Wh, 1st, 2nd, and 3rd person pronouns
  
  * maximal and average NP length
  * maximal and average AJP length
  * maximal and average PP length
  * maximal and average AVP length
  * sentence length

* Other features in your mind (ie., pre-defined patterns)


Extracting Data

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
main_text = [] # List to store Review headings
sub_text =[] #List to store reviews
sub_rating=[]#List to store ratings
for number in range(60):
  link = "https://www.flipkart.com/boat-aavante-bar-1150-60-w-bluetooth-soundbar/product-reviews/itm1fe41ecae973b?pid=ACCFH425PNGFUHAU&lid=LSTACCFH425PNGFUHAUROM019&marketplace=FLIPKART&page=" + str(number) # Generating link dynamically
 # print(link)
  page = requests.get(link) # Accessing the webpage
  soup = BeautifulSoup(page.text, 'html.parser')
  main_reviews = soup.find_all(class_='_2-N8zT') # Getting the Review Heading by using the class name
  text_reviews = soup.find_all(class_='t-ZTKy') # Getting the full reviews by using the class name
  rating       = soup.find_all(class_='_3LWZlK _1BLPMq')
  for ele, sub_ele,sub_in_ele in zip(main_reviews, text_reviews,rating) : # Iterating through the list
      main_text.append(ele.text) #Appending to empty list
      sub_text.append(sub_ele.text)
      sub_rating.append(sub_in_ele.text)
df = pd.DataFrame(list(zip(main_text, sub_text,sub_rating)), columns =['Glimpse of Review', 'Full Review','Rating'])  # Creating Dataframe
print("Length of data frame is {0}".format(len(df)))
df

Length of data frame is 536


Unnamed: 0,Glimpse of Review,Full Review,Rating
0,Wonderful,"sound is really good , even without sub woofer...",4
1,Worth every penny,"excellent product, nice sound quality, preset ...",5
2,Awesome,"Best Brand, Built quality is very good, No mor...",5
3,Master Blaster,Using it since last 6 days :PROS :1) Stylish ...,4
4,Very Good,Writing this review after using it 7 days ..Ov...,4
...,...,...,...
531,Great product,"best speaker for small room,the audio is so cl...",5
532,Super!,Worthy product and stylish designREAD MORE,5
533,Fabulous!,Overall This Product AmazingREAD MORE,5
534,Mind-blowing purchase,Value for money and satisfactionREAD MORE,5



Preprocessing Data

Converting to lower case

In [2]:

df['After Preprocessing'] = df['Full Review'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df

Unnamed: 0,Glimpse of Review,Full Review,Rating,After Preprocessing
0,Wonderful,"sound is really good , even without sub woofer...",4,"sound is really good , even without sub woofer..."
1,Worth every penny,"excellent product, nice sound quality, preset ...",5,"excellent product, nice sound quality, preset ..."
2,Awesome,"Best Brand, Built quality is very good, No mor...",5,"best brand, built quality is very good, no mor..."
3,Master Blaster,Using it since last 6 days :PROS :1) Stylish ...,4,using it since last 6 days :pros :1) stylish l...
4,Very Good,Writing this review after using it 7 days ..Ov...,4,writing this review after using it 7 days ..ov...
...,...,...,...,...
531,Great product,"best speaker for small room,the audio is so cl...",5,"best speaker for small room,the audio is so cl..."
532,Super!,Worthy product and stylish designREAD MORE,5,worthy product and stylish designread more
533,Fabulous!,Overall This Product AmazingREAD MORE,5,overall this product amazingread more
534,Mind-blowing purchase,Value for money and satisfactionREAD MORE,5,value for money and satisfactionread more



Removing Punctuation


In [3]:
df['After Preprocessing'] = df['After Preprocessing'].str.replace('[^\w\s]','')
df

Unnamed: 0,Glimpse of Review,Full Review,Rating,After Preprocessing
0,Wonderful,"sound is really good , even without sub woofer...",4,sound is really good even without sub woofer ...
1,Worth every penny,"excellent product, nice sound quality, preset ...",5,excellent product nice sound quality preset so...
2,Awesome,"Best Brand, Built quality is very good, No mor...",5,best brand built quality is very good no more ...
3,Master Blaster,Using it since last 6 days :PROS :1) Stylish ...,4,using it since last 6 days pros 1 stylish look...
4,Very Good,Writing this review after using it 7 days ..Ov...,4,writing this review after using it 7 days over...
...,...,...,...,...
531,Great product,"best speaker for small room,the audio is so cl...",5,best speaker for small roomthe audio is so cle...
532,Super!,Worthy product and stylish designREAD MORE,5,worthy product and stylish designread more
533,Fabulous!,Overall This Product AmazingREAD MORE,5,overall this product amazingread more
534,Mind-blowing purchase,Value for money and satisfactionREAD MORE,5,value for money and satisfactionread more



Removing Numerics


In [9]:
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
df

Unnamed: 0,Glimpse of Review,Full Review,Rating,After Preprocessing
0,Wonderful,"sound is really good , even without sub woofer...",4,sound is really good even without sub woofer ...
1,Worth every penny,"excellent product, nice sound quality, preset ...",5,excellent product nice sound quality preset so...
2,Awesome,"Best Brand, Built quality is very good, No mor...",5,best brand built quality is very good no more ...
3,Master Blaster,Using it since last 6 days :PROS :1) Stylish ...,4,using it since last days pros stylish look a...
4,Very Good,Writing this review after using it 7 days ..Ov...,4,writing this review after using it days overa...
...,...,...,...,...
531,Great product,"best speaker for small room,the audio is so cl...",5,best speaker for small roomthe audio is so cle...
532,Super!,Worthy product and stylish designREAD MORE,5,worthy product and stylish designread more
533,Fabulous!,Overall This Product AmazingREAD MORE,5,overall this product amazingread more
534,Mind-blowing purchase,Value for money and satisfactionREAD MORE,5,value for money and satisfactionread more



Removing Special Characters

In [10]:

import re
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: ''.join(re.sub(r"[^a-zA-Z0-9]+", ' ', charctr) for charctr in x ))
df

Unnamed: 0,Glimpse of Review,Full Review,Rating,After Preprocessing
0,Wonderful,"sound is really good , even without sub woofer...",4,sound is really good even without sub woofer ...
1,Worth every penny,"excellent product, nice sound quality, preset ...",5,excellent product nice sound quality preset so...
2,Awesome,"Best Brand, Built quality is very good, No mor...",5,best brand built quality is very good no more ...
3,Master Blaster,Using it since last 6 days :PROS :1) Stylish ...,4,using it since last days pros stylish look a...
4,Very Good,Writing this review after using it 7 days ..Ov...,4,writing this review after using it days overa...
...,...,...,...,...
531,Great product,"best speaker for small room,the audio is so cl...",5,best speaker for small roomthe audio is so cle...
532,Super!,Worthy product and stylish designREAD MORE,5,worthy product and stylish designread more
533,Fabulous!,Overall This Product AmazingREAD MORE,5,overall this product amazingread more
534,Mind-blowing purchase,Value for money and satisfactionREAD MORE,5,value for money and satisfactionread more


In [11]:
import nltk
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> all
    Downloading collection 'all'
       | 
       | Downloading package abc to /root/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package alpino to /root/nltk_data...
       |   Unzipping corpora/alpino.zip.
       | Downloading package biocreative_ppi to /root/nltk_data...
       |   Unzipping corpora/biocreative_ppi.zip.
       | Downloading package brown to /root/nltk_data...
       |   Unzipping corpora/brown.zip.
       | Downloading package brown_tei to /root/nltk_data...
       |   Unzipping corpora/brown_tei.zip.
       | Downloading package cess_cat to /root/nltk_data...
       |   Unzipping corpora/cess_cat.zip.
       | Downloading package

True


Removing Stop Words


In [12]:

from nltk.corpus import stopwords
stop = stopwords.words('english')
df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df

Unnamed: 0,Glimpse of Review,Full Review,Rating,After Preprocessing
0,Wonderful,"sound is really good , even without sub woofer...",4,sound really good even without sub woofer bass...
1,Worth every penny,"excellent product, nice sound quality, preset ...",5,excellent product nice sound quality preset so...
2,Awesome,"Best Brand, Built quality is very good, No mor...",5,best brand built quality good distortion sound...
3,Master Blaster,Using it since last 6 days :PROS :1) Stylish ...,4,using since last days pros stylish look awesom...
4,Very Good,Writing this review after using it 7 days ..Ov...,4,writing review using days overall sound astoni...
...,...,...,...,...
531,Great product,"best speaker for small room,the audio is so cl...",5,best speaker small roomthe audio clear could h...
532,Super!,Worthy product and stylish designREAD MORE,5,worthy product stylish designread
533,Fabulous!,Overall This Product AmazingREAD MORE,5,overall product amazingread
534,Mind-blowing purchase,Value for money and satisfactionREAD MORE,5,value money satisfactionread



Spelling Correction

In [13]:
from textblob import TextBlob
df['After Preprocessing'].apply(lambda x: str(TextBlob(x).correct()))

0      sound really good even without sub offer bass ...
1      excellent product nice sound quality present s...
2      best brand built quality good distortion sound...
3      using since last days pro stylish look awesome...
4      writing review using days overall sound astoni...
                             ...                        
531    best speaker small soothe audit clear could he...
532                      worthy product stylish designed
533                          overall product amazingread
534                         value money satisfactionread
535                         good sound qualitythanksread
Name: After Preprocessing, Length: 536, dtype: object


Stemming

In [14]:
from nltk.stem import PorterStemmer
st = PorterStemmer()
df['After Preprocessing'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))

0      sound realli good even without sub woofer bass...
1      excel product nice sound qualiti preset sound ...
2      best brand built qualiti good distort sound co...
3      use sinc last day pro stylish look awesom soun...
4      write review use day overal sound astonish cla...
                             ...                        
531    best speaker small roomth audio clear could he...
532                    worthi product stylish designread
533                           overal product amazingread
534                          valu money satisfactionread
535                         good sound qualitythanksread
Name: After Preprocessing, Length: 536, dtype: object


Lemmatization

In [15]:

from textblob import Word
import nltk
nltk.download('wordnet')

df['After Preprocessing'] = df['After Preprocessing'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Glimpse of Review,Full Review,Rating,After Preprocessing
0,Wonderful,"sound is really good , even without sub woofer...",4,sound really good even without sub woofer bass...
1,Worth every penny,"excellent product, nice sound quality, preset ...",5,excellent product nice sound quality preset so...
2,Awesome,"Best Brand, Built quality is very good, No mor...",5,best brand built quality good distortion sound...
3,Master Blaster,Using it since last 6 days :PROS :1) Stylish ...,4,using since last day pro stylish look awesome ...
4,Very Good,Writing this review after using it 7 days ..Ov...,4,writing review using day overall sound astonis...
...,...,...,...,...
531,Great product,"best speaker for small room,the audio is so cl...",5,best speaker small roomthe audio clear could h...
532,Super!,Worthy product and stylish designREAD MORE,5,worthy product stylish designread
533,Fabulous!,Overall This Product AmazingREAD MORE,5,overall product amazingread
534,Mind-blowing purchase,Value for money and satisfactionREAD MORE,5,value money satisfactionread


Parts of Speech Tagging and Features

In [16]:
from nltk.tokenize import word_tokenize
pos = []
for sentence in df['After Preprocessing']:
  text = word_tokenize(sentence)
  pos.append(nltk.pos_tag(text))
pos

[[('sound', 'NN'),
  ('really', 'RB'),
  ('good', 'JJ'),
  ('even', 'RB'),
  ('without', 'IN'),
  ('sub', 'NN'),
  ('woofer', 'NN'),
  ('bass', 'NN'),
  ('good', 'JJ'),
  ('enough', 'RB'),
  ('fine', 'JJ'),
  ('sound', 'NN'),
  ('quality', 'NN'),
  ('really', 'RB'),
  ('good', 'JJ'),
  ('didnt', 'NN'),
  ('support', 'NN'),
  ('hdmi', 'VBD'),
  ('digital', 'JJ'),
  ('audio', 'RB'),
  ('even', 'RB'),
  ('co', 'VBP'),
  ('axial', 'JJ'),
  ('really', 'RB'),
  ('sad', 'JJ'),
  ('thatread', 'NN')],
 [('excellent', 'JJ'),
  ('product', 'NN'),
  ('nice', 'JJ'),
  ('sound', 'JJ'),
  ('quality', 'NN'),
  ('preset', 'VBN'),
  ('sound', 'JJ'),
  ('mode', 'NN'),
  ('nice', 'JJ'),
  ('bass', 'NN'),
  ('price', 'NN'),
  ('segment', 'NN'),
  ('good', 'JJ'),
  ('installation', 'NN'),
  ('service', 'NN'),
  ('good', 'JJ'),
  ('connectivity', 'NN'),
  ('lag', 'VBP'),
  ('tv', 'NN'),
  ('connectionread', 'NN')],
 [('best', 'JJS'),
  ('brand', 'NN'),
  ('built', 'VBN'),
  ('quality', 'NN'),
  ('good', 'JJ'

In [17]:
Adjective = []
Adverb = []
CordinatingConjunction = []
SubordinatingConjuction = []
Interjection = []
Noun = []
Verb = []
PersonalPronoun = []
predeterminer = []
Determiner = []

In [18]:
for value in pos:
  AdjectiveCount = 0
  AdverbCount = 0
  CordinatingConjunctionCount = 0
  SubordinatingConjuctionCount = 0
  InterjectionCount = 0
  NounCount = 0
  VerbCount = 0
  PersonalPronounCount = 0
  predeterminerCount = 0
  DeterminerCount = 0
  for word,tag in value:
    if tag == 'JJ':
      AdjectiveCount = AdjectiveCount + 1
    elif tag == 'RB':
      AdverbCount = AdverbCount + 1
    elif tag == 'CC':
      CordinatingConjunctionCount = CordinatingConjunctionCount + 1
    elif tag == 'UH':
      InterjectionCount = InterjectionCount + 1
    elif tag == 'NN':
      NounCount = NounCount + 1
    elif tag == 'VR':
      VerbCount = VerbCount + 1
    elif tag == 'PRP':
      PersonalPronounCount = PersonalPronounCount + 1
    elif tag == 'PDT':
      predeterminerCount = predeterminerCount + 1
    elif tag == 'DT':
      DeterminerCount = DeterminerCount + 1
    elif tag == 'IN':
       SubordinatingConjuctionCount = SubordinatingConjuctionCount + 1
  Adjective.append(AdjectiveCount)
  Adverb.append(AdverbCount)
  CordinatingConjunction.append(CordinatingConjunctionCount)
  Interjection.append(InterjectionCount)
  Noun.append(NounCount)
  Verb.append(VerbCount)
  PersonalPronoun.append(PersonalPronounCount)
  predeterminer.append(predeterminerCount)
  Determiner.append(DeterminerCount)
  SubordinatingConjuction.append(SubordinatingConjuctionCount)

In [19]:
df['Number of Adjectives'] = Adjective
df['Number of Adverbs'] = Adverb
df['Number of Cordinating Conjunctions'] = CordinatingConjunction
df['Number of Interjections'] = Interjection
df['Number of Nouns'] = Noun
df['Number of Verbs'] = Verb
df['Number of Personal Pronouns'] = PersonalPronoun
df['Number of Predeterminers'] = predeterminer
df['Number of Determiners'] = Determiner
df['Number of Subordinating Conjuctions'] = SubordinatingConjuction
df

Unnamed: 0,Glimpse of Review,Full Review,Rating,After Preprocessing,Number of Adjectives,Number of Adverbs,Number of Cordinating Conjunctions,Number of Interjections,Number of Nouns,Number of Verbs,Number of Personal Pronouns,Number of Predeterminers,Number of Determiners,Number of Subordinating Conjuctions
0,Wonderful,"sound is really good , even without sub woofer...",4,sound really good even without sub woofer bass...,7,7,0,0,9,0,0,0,0,1
1,Worth every penny,"excellent product, nice sound quality, preset ...",5,excellent product nice sound quality preset so...,7,0,0,0,11,0,0,0,0,0
2,Awesome,"Best Brand, Built quality is very good, No mor...",5,best brand built quality good distortion sound...,7,1,0,0,14,0,0,0,0,0
3,Master Blaster,Using it since last 6 days :PROS :1) Stylish ...,4,using since last day pro stylish look awesome ...,13,3,0,0,22,0,0,0,1,1
4,Very Good,Writing this review after using it 7 days ..Ov...,4,writing review using day overall sound astonis...,6,0,0,0,25,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
531,Great product,"best speaker for small room,the audio is so cl...",5,best speaker small roomthe audio clear could h...,2,1,0,0,8,0,0,0,0,0
532,Super!,Worthy product and stylish designREAD MORE,5,worthy product stylish designread,2,0,0,0,2,0,0,0,0,0
533,Fabulous!,Overall This Product AmazingREAD MORE,5,overall product amazingread,1,0,0,0,2,0,0,0,0,0
534,Mind-blowing purchase,Value for money and satisfactionREAD MORE,5,value money satisfactionread,0,0,0,0,3,0,0,0,0,0
