# COMP47670 - Assignment 2: Text Classification

## RAYNA VARGHESE

## S19200265

### <u>Task 1:</u> Web Scraping

#### <font color=orange>__*Category Selection:*__</font>

The three categories selected are *Restaurants*, *Fashion* and *Gym*.

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
from requests import get
import numpy as np
import pandas as pd

In [2]:
def getData(url):
    data = get(url)
    # creating object for BeautifulSoup that parses html and xml code
    soup = BeautifulSoup(data.text, 'html.parser') 
    
    data = pd.DataFrame()
    Ratings = []
    Reviews = []
    data['Class_label'] = " "

    # finding all the <a> tags, which contain the links to the review page of each business in the category
    business = soup.find_all('a')

    
    for link in business[1:]:
        # extracting the links of each business and appending it to the main link
        response = urlopen("http://mlg.ucd.ie/modules/yalp/"+ link.get('href')).read()
        html_soup = BeautifulSoup(response, 'html.parser')
        
        # finding all the <div> tags that have the class 'review'
        list1 = html_soup.find_all('div', class_='review')

        for r in list1:
            # taking the first <p> tag with class 'rating'; the alt text of its <img> tag holds the rating (number of stars)
            Ratings.append(r.find_all('p', class_='rating')[0].img["alt"])
            # extracting all the text within <p> tags with class 'review-text' 
            Reviews.append(r.find_all('p', class_='review-text')[0].text)

    data['Ratings'] = Ratings
    data['Reviews'] = Reviews
    # creating class label for reviews i.e if the ratings are 4 or 5 it is positive(1) else it is negative(0)
    data['Class_label'] = np.where((data['Ratings']=='4-star') | (data['Ratings']=='5-star'), 1, 0) #positive-1, negative-0
    
    return data
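The labelling rule used in `getData` can be checked in isolation. The following is a minimal sketch, assuming the rating strings have the form shown in the outputs below (`'1-star'` to `'5-star'`); the helper name `label_from_rating` is hypothetical and is not part of the notebook.

```python
def label_from_rating(alt_text):
    """Return 1 (positive) for '4-star' or '5-star', otherwise 0 (negative)."""
    return 1 if alt_text in ("4-star", "5-star") else 0

# Rating strings in the same format as the scraped 'alt' attributes
ratings = ["2-star", "4-star", "5-star", "5-star", "1-star"]
labels = [label_from_rating(r) for r in ratings]
print(labels)  # [0, 1, 1, 1, 0]
```

A 3-star review falls into the negative class under this rule, which is a design choice rather than a property of the data.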

In [3]:
Restaurants = getData('http://mlg.ucd.ie/modules/yalp/restaurants_list.html')
Restaurants[0:5]

| | Class_label | Ratings | Reviews |
|---|---|---|---|
| 0 | 0 | 2-star | My husband and I had a rare afternoon off so w... |
| 1 | 1 | 4-star | For years I thought this was only a wine store... |
| 2 | 1 | 5-star | This place is so charming! I went with my husb... |
| 3 | 1 | 5-star | We have been wanting to try this place for a c... |
| 4 | 0 | 1-star | Decor looks ok but layout is too busy. Difficu... |


In [4]:
Fashion = getData('http://mlg.ucd.ie/modules/yalp/fashion_list.html')
Fashion[0:5]

| | Class_label | Ratings | Reviews |
|---|---|---|---|
| 0 | 1 | 5-star | Looking for the best tactical supplies? Look n... |
| 1 | 0 | 1-star | Stood in line like an idiot for 5 minutes to p... |
| 2 | 1 | 4-star | Another great store with quality Equipment. Th... |
| 3 | 1 | 5-star | The Problem with this store is not that they h... |
| 4 | 1 | 5-star | Great place! We went in at almost closing time... |


In [5]:
Gym = getData('http://mlg.ucd.ie/modules/yalp/gym_list.html')
Gym[0:5]

| | Class_label | Ratings | Reviews |
|---|---|---|---|
| 0 | 1 | 5-star | If you're looking for boxing in the East Valle... |
| 1 | 0 | 1-star | I was really excited to try a fun workout rout... |
| 2 | 0 | 2-star | I was interested in taking a boxing bootcamp c... |
| 3 | 1 | 4-star | I worked out at 1 on 1 boxing for a bout 6 mon... |
| 4 | 1 | 4-star | This place literally KICKED my butt every. sin... |


***

### <u>Task 2:</u> Classification Model

A Naive Bayes classifier is used here. The multinomial variant is well suited to word-count features, and of the variations tried, Naive Bayes was the model that gave fairly accurate predictions.
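To make the idea concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-one (Laplace) smoothing on word counts. It is an illustration only and not scikit-learn's `MultinomialNB` implementation; the function names `train_nb` and `predict_nb` and the toy reviews are made up for this example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab, alpha

def predict_nb(model, doc):
    """Return the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for word in doc.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            smoothed = (word_counts[label][word] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data: 1 = positive, 0 = negative
docs = ["great food great service", "lovely staff",
        "terrible food", "rude staff terrible service"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

print(predict_nb(model, "great staff"))       # 1
print(predict_nb(model, "terrible service"))  # 0
```

The smoothing term keeps a word that never occurred in one class from forcing that class's probability to zero.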

In [6]:
import nltk
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
#import warnings
#warnings.filterwarnings("ignore", category=FutureWarning)

In [7]:
# converting the text into token counts for further processing
vectorizer = CountVectorizer() 

#### <font color=purple>__i] Category 1: Restaurants:__</font>

##### <font color=orange>__*a] Preprocessing data:*__</font>

In [8]:
print("Samples per class: {}".format(np.bincount(Restaurants['Class_label'])))

Samples per class: [ 838 1162]


Counting the Class_label column gives 838 negative reviews (label 0) and 1162 positive reviews (label 1).
<p> Next, we split the data so that 70% is training data and 30% is test data. </p>

In [9]:
Rx_train, Rx_test, Ry_train, Ry_test = train_test_split(Restaurants['Reviews'], Restaurants['Class_label'], test_size=0.3)
print(Rx_train.shape,Rx_test.shape,Ry_train.shape,Ry_test.shape)

(1400,) (600,) (1400,) (600,)
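The split above is random and no seed is set, so the exact counts in the following cells can change from run to run. As a rough illustration of what a reproducible 70/30 shuffle split involves, here is a pure-Python sketch; the function `split_70_30`, its `seed` parameter and the placeholder data are assumptions for this example, and it does not reproduce scikit-learn's exact shuffling.

```python
import random

def split_70_30(items, labels, test_size=0.3, seed=0):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [items[i] for i in train_idx]
    x_test = [items[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Placeholder data with the same size as one scraped category
reviews = ["review {}".format(i) for i in range(2000)]
labels = [i % 2 for i in range(2000)]

x_train, x_test, y_train, y_test = split_70_30(reviews, labels)
print(len(x_train), len(x_test))  # 1400 600
```

Calling the function twice with the same seed returns the same split, which makes scores comparable between runs.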


In [10]:
print("Samples per class: {}".format(np.bincount(Ry_train))) # [negative, positive]
print("Samples per class: {}".format(np.bincount(Ry_test)))

Samples per class: [583 817]
Samples per class: [255 345]


Tokenizing the training and test datasets
<p>We fit and transform the training set and then only transform the test set. The vocabulary is therefore built from the training data alone and then applied to the test data.</p>

In [11]:
Rx_train = vectorizer.fit_transform(Rx_train)
Rx_test = vectorizer.transform(Rx_test)
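The difference between fitting and transforming can be seen with a small bag-of-words sketch in plain Python. This is a simplified stand-in for the idea, not `CountVectorizer` itself (it only lowercases and splits on whitespace); the helper names `fit_vocabulary` and `transform` and the sample sentences are hypothetical.

```python
def fit_vocabulary(docs):
    """Map each distinct lowercase token in docs to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document; tokens outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for tok in doc.lower().split():
            if tok in vocabulary:
                row[vocabulary[tok]] += 1
        rows.append(row)
    return rows

train_docs = ["great food", "bad food bad service"]
test_docs = ["great service", "awful parking"]

vocab = fit_vocabulary(train_docs)   # built from the training data only
print(vocab)                         # {'bad': 0, 'food': 1, 'great': 2, 'service': 3}
print(transform(train_docs, vocab))  # [[0, 1, 1, 0], [2, 1, 0, 1]]
print(transform(test_docs, vocab))   # [[0, 0, 1, 1], [0, 0, 0, 0]]
```

The second test sentence maps to an all-zero row because none of its words were seen during fitting, so the classifier gets no evidence from it.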

##### <font color=orange>__*b] Classification Model (Naive Bayes):*__</font>

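Since this is a binary classification problem, we use multinomial Naive Bayes and evaluate it with 5-fold cross validation (cv=5): the training data is split into five folds, each fold is held out once for validation while the model is trained on the other four, and we report the mean accuracy over the five runs.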
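As a rough picture of how cross validation partitions the data, the sketch below generates five folds over 1400 training samples in plain Python. The generator `k_fold_indices` is a hypothetical helper; scikit-learn additionally stratifies the folds by class for classifiers, which this sketch omits.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

folds = list(k_fold_indices(1400, k=5))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))  # 1120 280 for every fold
```

Each sample is used for validation exactly once, so the mean of the five accuracies uses all of the training data without ever scoring a model on samples it was fitted on.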


In [12]:
R_cv_scores = cross_val_score(MultinomialNB(), Rx_train, Ry_train, cv=5)
print("Mean cross validation accuracy: {:.2f}".format(np.mean(R_cv_scores)))

Mean cross validation accuracy: 0.86


In [13]:
# naive bayes model
nb = MultinomialNB()
nb.fit(Rx_train, Ry_train)
print("Training set score: {:.3f}".format(nb.score(Rx_train, Ry_train)))
print("Test set score: {:.3f}".format(nb.score(Rx_test, Ry_test)))
#test score is consistent with cross validation score

Training set score: 0.961
Test set score: 0.848


Here we can see that the test score is 0.848, which is consistent with the mean cross validation score of 0.86.

##### <font color=orange>__*c] Prediction:*__</font>

In [14]:
R_pred_nb = nb.predict(Rx_test)
R_confusion = confusion_matrix(Ry_test, R_pred_nb)
print("Confusion matrix: \n{}".format(R_confusion))

Confusion matrix: 
[[205  50]
 [ 41 304]]
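The four cells of this matrix can be unpacked directly. In the layout printed by `confusion_matrix`, rows are the true labels and columns the predicted labels, with class 0 (negative) first. The short sketch below copies the values from the output above and recomputes the accuracy from them.

```python
# Values copied from the Restaurants confusion matrix above
confusion = [[205, 50],
             [41, 304]]

(tn, fp), (fn, tp) = confusion  # row 0: true negatives, row 1: true positives
total = tn + fp + fn + tp
accuracy = (tn + tp) / total

print(fp, fn)              # 50 41
print(fp + fn, total)      # 91 600
print(round(accuracy, 3))  # 0.848
```

The recomputed accuracy matches the test set score reported for this model.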


Here, there are 50 false positives and 41 false negatives, so 91 of the 600 test reviews are misclassified.
<p>The same procedure is then applied to the other two categories.</p>

#### <font color=purple>__ii] Category 2: Fashion__</font>

##### <font color=orange>__*a] Preprocessing data:*__</font>

In [15]:
print("Samples per class: {}".format(np.bincount(Fashion['Class_label'])))

Samples per class: [ 795 1205]


In [16]:
Fx_train, Fx_test, Fy_train, Fy_test = train_test_split(Fashion['Reviews'], Fashion['Class_label'], test_size=0.3)
print(Fx_train.shape,Fx_test.shape,Fy_train.shape,Fy_test.shape)

(1400,) (600,) (1400,) (600,)


In [17]:
print("Samples per class: {}".format(np.bincount(Fy_train))) # [negative, positive]
print("Samples per class: {}".format(np.bincount(Fy_test)))

Samples per class: [557 843]
Samples per class: [238 362]


In [18]:
Fx_train = vectorizer.fit_transform(Fx_train)
Fx_test = vectorizer.transform(Fx_test)

##### <font color=orange>__*b] Classification Model (Naive Bayes):*__</font>

In [19]:
F_cv_scores = cross_val_score(MultinomialNB(), Fx_train, Fy_train, cv=5)
print("Mean cross validation accuracy: {:.2f}".format(np.mean(F_cv_scores)))

Mean cross validation accuracy: 0.88


In [20]:
# naive bayes model
nb = MultinomialNB()
nb.fit(Fx_train, Fy_train)
print("Training set score: {:.3f}".format(nb.score(Fx_train, Fy_train)))
print("Test set score: {:.3f}".format(nb.score(Fx_test, Fy_test)))
#test score is consistent with cross validation score

Training set score: 0.968
Test set score: 0.888


For this dataset the test score (0.888) is again close to the mean cross validation score (0.88).

##### <font color=orange>__*c] Prediction:*__</font>

In [21]:
F_pred_nb = nb.predict(Fx_test)
F_confusion = confusion_matrix(Fy_test, F_pred_nb)
print("Confusion matrix: \n{}".format(F_confusion))

Confusion matrix: 
[[196  42]
 [ 25 337]]


The number of misclassified reviews (42 false positives and 25 false negatives, 67 in total) is lower than for the Restaurants dataset.

#### <font color=purple>__iii] Category 3: Gym__</font>

##### <font color=orange>__*a] Preprocessing data:*__</font>

In [22]:
# counting the Class_label column gives 701 negative reviews (label 0) and 1299 positive reviews (label 1)
print("Samples per class: {}".format(np.bincount(Gym['Class_label'])))

Samples per class: [ 701 1299]


In [23]:
Gx_train, Gx_test, Gy_train, Gy_test = train_test_split(Gym['Reviews'], Gym['Class_label'], test_size=0.3)
print(Gx_train.shape,Gx_test.shape,Gy_train.shape,Gy_test.shape)

(1400,) (600,) (1400,) (600,)


In [24]:
print("Samples per class: {}".format(np.bincount(Gy_train))) # [negative, positive]
print("Samples per class: {}".format(np.bincount(Gy_test)))

Samples per class: [489 911]
Samples per class: [212 388]


In [25]:
Gx_train = vectorizer.fit_transform(Gx_train)
Gx_test = vectorizer.transform(Gx_test)

##### <font color=orange>__*b] Classification Model (Naive Bayes):*__</font>

In [26]:
G_cv_scores = cross_val_score(MultinomialNB(), Gx_train, Gy_train, cv=5)
print("Mean cross validation accuracy: {:.2f}".format(np.mean(G_cv_scores)))

Mean cross validation accuracy: 0.91


In [27]:
# naive bayes model
nb = MultinomialNB()
nb.fit(Gx_train, Gy_train)
print("Training set score: {:.3f}".format(nb.score(Gx_train, Gy_train)))
print("Test set score: {:.3f}".format(nb.score(Gx_test, Gy_test)))
#test score is consistent with cross validation score

Training set score: 0.969
Test set score: 0.903


##### <font color=orange>__*c] Prediction:*__</font>

In [28]:
G_pred_nb = nb.predict(Gx_test)
G_confusion = confusion_matrix(Gy_test, G_pred_nb)
print("Confusion matrix: \n{}".format(G_confusion))

Confusion matrix: 
[[177  35]
 [ 23 365]]


The false positives (35) outnumber the false negatives (23) by roughly one and a half times, so the model more often labels a negative review as positive than the reverse.
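The imbalance between the two error types can be quantified with precision and recall for the positive class. The sketch below uses the values from the Gym confusion matrix above; these metrics are not computed in the notebook itself and are added here only as an illustration.

```python
# Values copied from the Gym confusion matrix above
confusion = [[177, 35],
             [23, 365]]

(tn, fp), (fn, tp) = confusion
precision = tp / (tp + fp)  # share of predicted positives that are truly positive
recall = tp / (tp + fn)     # share of true positives that were found
ratio = fp / fn             # false positives per false negative

print("precision: {:.2f}".format(precision))  # precision: 0.91
print("recall: {:.2f}".format(recall))        # recall: 0.94
print("FP/FN ratio: {:.2f}".format(ratio))    # FP/FN ratio: 1.52
```

Recall being higher than precision is consistent with the model leaning towards the majority (positive) class.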

### <u>Task 3:</u> Performance Evaluation

Here, a model trained on one category is evaluated on the test data of the other two categories.

#### <font color=purple>__a] Testing classification model of 'Restaurants' on 'Fashion' and 'Gym'__</font>

##### <font color=orange>__*i] Test Fashion Dataset*__</font>

In [29]:
Rx_train, Rx_test, Ry_train, Ry_test = train_test_split(Restaurants['Reviews'], Restaurants['Class_label'], test_size=0.3)
Fx_train, Fx_test, Fy_train, Fy_test = train_test_split(Fashion['Reviews'], Fashion['Class_label'], test_size=0.3)

Rx_train = vectorizer.fit_transform(Rx_train)
Fx_test = vectorizer.transform(Fx_test)

nb = MultinomialNB()
nb.fit(Rx_train, Ry_train)
print("Training set score: {:.3f}".format(nb.score(Rx_train, Ry_train)))
print("Test set score: {:.3f}".format(nb.score(Fx_test, Fy_test)))

Training set score: 0.955
Test set score: 0.740


##### <font color=orange>__*ii] Test Gym Dataset*__</font>

In [30]:
Rx_train, Rx_test, Ry_train, Ry_test = train_test_split(Restaurants['Reviews'], Restaurants['Class_label'], test_size=0.3)
Gx_train, Gx_test, Gy_train, Gy_test = train_test_split(Gym['Reviews'], Gym['Class_label'], test_size=0.3)

Rx_train = vectorizer.fit_transform(Rx_train)
Gx_test = vectorizer.transform(Gx_test)

nb = MultinomialNB()
nb.fit(Rx_train, Ry_train)
print("Training set score: {:.3f}".format(nb.score(Rx_train, Ry_train)))
print("Test set score: {:.3f}".format(nb.score(Gx_test, Gy_test)))

Training set score: 0.959
Test set score: 0.808


#### <font color=purple>__b] Testing classification model of 'Fashion' on 'Restaurants' and 'Gym'__</font>

##### <font color=orange>__*i] Test Restaurants Dataset*__</font>

In [31]:
Fx_train, Fx_test, Fy_train, Fy_test = train_test_split(Fashion['Reviews'], Fashion['Class_label'], test_size=0.3)
Rx_train, Rx_test, Ry_train, Ry_test = train_test_split(Restaurants['Reviews'], Restaurants['Class_label'], test_size=0.3)

Fx_train = vectorizer.fit_transform(Fx_train)
Rx_test = vectorizer.transform(Rx_test)

# naive bayes model
nb = MultinomialNB()
nb.fit(Fx_train, Fy_train)
print("Training set score: {:.3f}".format(nb.score(Fx_train, Fy_train)))
print("Test set score: {:.3f}".format(nb.score(Rx_test, Ry_test)))

Training set score: 0.961
Test set score: 0.838


##### <font color=orange>__*ii] Test Gym Dataset*__</font>

In [32]:
Fx_train, Fx_test, Fy_train, Fy_test = train_test_split(Fashion['Reviews'], Fashion['Class_label'], test_size=0.3)
Gx_train, Gx_test, Gy_train, Gy_test = train_test_split(Gym['Reviews'], Gym['Class_label'], test_size=0.3)

Fx_train = vectorizer.fit_transform(Fx_train)
Gx_test = vectorizer.transform(Gx_test)

nb = MultinomialNB()
nb.fit(Fx_train, Fy_train)
print("Training set score: {:.3f}".format(nb.score(Fx_train, Fy_train)))
print("Test set score: {:.3f}".format(nb.score(Gx_test, Gy_test)))

Training set score: 0.974
Test set score: 0.872


#### <font color=purple>__c] Testing classification model of 'Gym' on 'Restaurants' and 'Fashion'__</font>

##### <font color=orange>__*i] Test Restaurants Dataset*__</font>

In [33]:
Gx_train, Gx_test, Gy_train, Gy_test = train_test_split(Gym['Reviews'], Gym['Class_label'], test_size=0.3)
Rx_train, Rx_test, Ry_train, Ry_test = train_test_split(Restaurants['Reviews'], Restaurants['Class_label'], test_size=0.3)

Gx_train = vectorizer.fit_transform(Gx_train)
Rx_test = vectorizer.transform(Rx_test)

nb = MultinomialNB()
nb.fit(Gx_train, Gy_train)
print("Training set score: {:.3f}".format(nb.score(Gx_train, Gy_train)))
print("Test set score: {:.3f}".format(nb.score(Rx_test, Ry_test)))


Training set score: 0.966
Test set score: 0.825


##### <font color=orange>__*ii] Test Fashion Dataset*__</font>

In [34]:
Gx_train, Gx_test, Gy_train, Gy_test = train_test_split(Gym['Reviews'], Gym['Class_label'], test_size=0.3)
Fx_train, Fx_test, Fy_train, Fy_test = train_test_split(Fashion['Reviews'], Fashion['Class_label'], test_size=0.3)

Gx_train = vectorizer.fit_transform(Gx_train)
Fx_test = vectorizer.transform(Fx_test)

nb = MultinomialNB()
nb.fit(Gx_train, Gy_train)
print("Training set score: {:.3f}".format(nb.score(Gx_train, Gy_train)))
print("Test set score: {:.3f}".format(nb.score(Fx_test, Fy_test)))

Training set score: 0.966
Test set score: 0.803


There is a marked drop from training score to test score in every case, which shows how important the choice of training data is. The 'Fashion' model shows the smallest drop of the three, but the difference still can't be ignored. Hence it would not be wise to rely on a classifier trained on one category to classify reviews from another.
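The comparison can be made explicit by computing the gap between training and test score for each pairing. The sketch below simply copies the six pairs of scores printed in Task 3; because the splits are random, a rerun would give slightly different numbers.

```python
# (training category, test category): (training score, test score), from Task 3
results = {
    ("Restaurants", "Fashion"): (0.955, 0.740),
    ("Restaurants", "Gym"): (0.959, 0.808),
    ("Fashion", "Restaurants"): (0.961, 0.838),
    ("Fashion", "Gym"): (0.974, 0.872),
    ("Gym", "Restaurants"): (0.966, 0.825),
    ("Gym", "Fashion"): (0.966, 0.803),
}

# Gap between training and test score for each pairing
gaps = {pair: train - test for pair, (train, test) in results.items()}

# Average gap per training category
gaps_by_source = {}
for (source, _), gap in gaps.items():
    gaps_by_source.setdefault(source, []).append(gap)
mean_gap = {source: sum(g) / len(g) for source, g in gaps_by_source.items()}

for source, gap in mean_gap.items():
    print("{}: mean gap {:.2f}".format(source, gap))

smallest = min(mean_gap, key=mean_gap.get)
print(smallest)  # Fashion
```

With a single random split per pairing, small differences between these gaps should be read with caution.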