# COMP41680 Assignment 2

### Task 1. Data Collection

We use the provided web link to extract data. 

In [1]:
##Import all the packages required
#pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
import seaborn as sns

In [2]:
URLBase = "http://mlg.ucd.ie/modules/COMP41680/assignment2/"  ##The baseUrl remains constant for all the pages.
URL = URLBase + "index.html"
page = requests.get(URL)

soup = BeautifulSoup(page.content, 'html.parser')

In [3]:
## Opened the page on web browser and with the help of inspect element tool found the id of the body which contains the page data.
results = soup.find(id='all') 

In [4]:
##To understand the structure of the web pages containing the required data we print it using prettify() which formats the html for better understanding.
print(results.prettify())

<div id="all">
 <div class="list-group">
  <a class="list-group-item list-group-item-action" href="month-jan-001.html">
   January 2020  [1522]
  </a>
  <a class="list-group-item list-group-item-action" href="month-feb-001.html">
   February 2020  [1492]
  </a>
  <a class="list-group-item list-group-item-action" href="month-mar-001.html">
   March 2020  [1395]
  </a>
  <a class="list-group-item list-group-item-action" href="month-apr-001.html">
   April 2020  [1209]
  </a>
  <a class="list-group-item list-group-item-action" href="month-may-001.html">
   May 2020  [1315]
  </a>
  <a class="list-group-item list-group-item-action" href="month-jun-001.html">
   June 2020  [1376]
  </a>
  <a class="list-group-item list-group-item-action" href="month-jul-001.html">
   July 2020  [1358]
  </a>
  <a class="list-group-item list-group-item-action" href="month-aug-001.html">
   August 2020  [1254]
  </a>
  <a class="list-group-item list-group-item-action" href="month-sep-001.html">
   September 2

In [5]:
#Above we can see the hrefs that are used to navigate to the page for each month.
#So we scrape the hrefs for all the months and save them in a list, which we can append to the base URL to navigate to those pages.

hrefList = []
href_elements = soup.find_all('a', {'class':'list-group-item list-group-item-action'})
for href_element in href_elements:
    text = href_element['href'] 
    hrefList.append(text)

print(hrefList)

['month-jan-001.html', 'month-feb-001.html', 'month-mar-001.html', 'month-apr-001.html', 'month-may-001.html', 'month-jun-001.html', 'month-jul-001.html', 'month-aug-001.html', 'month-sep-001.html', 'month-oct-001.html', 'month-nov-001.html', 'month-dec-001.html']


In [6]:
#We create a dataframe with 3 columns to save the title, story snippet and category of the news articles that we scrape from the URLs.
columns = ["title","snippet","category"]
df = pd.DataFrame(columns=columns)

In [7]:
i = 0
for href in hrefList:
    page = requests.get(URLBase+href) #the base URL plus the href for a month gives the URL of that month's first page
    soup = BeautifulSoup(page.content, 'html.parser')
    pages = soup.find_all('h4', {'class':'results'})

    #Every month has roughly 1200 to 1500 news records, which are split across several pages. So we scrape the number of pages
    #present in every month to get the data from all pages.
    pageNo = pages[0].get_text() 
    pageNo = pageNo[-2:]

    pages = [href]

    #We loop through all the page numbers, as every page has its own URL.
    x = '001'
    for z in range(int(pageNo)-1):
        x = str(int(x) + 1).zfill(len(x))
        if len(x) == 3:
            x = x[1:]
        pages.append(href[0:11] + x + href[13:18])
    
    #Once we have gathered the URLs of all the pages for a month, we iterate through the pages to collect the data.
    for page in pages:
        page = requests.get(URLBase+page)
        soup = BeautifulSoup(page.content, 'html.parser')
        articles = soup.find_all('div', {'class':'article'})

        for ar in articles:
            title = ar.find("h5").get_text() #Gets the title of the news
            title = title.lstrip('0123456789.- ') #Used to strip the number in front of the news title.
            target = ar.find_all('p', {'class':'metadata'})
            category = target[1].get_text()
            category = category[9:]
            category = category.strip() #Strips surrounding whitespace to leave the category of the news article
            if category == 'Music' or category == 'Books' or category == 'Film': #Condition to only save the articles belonging to either music, books or film category.
                df.loc[i] = [title, ar.find('p',{'class':'snippet'}).get_text(), category]
                i += 1
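
The page hrefs for a month can also be generated directly from the page count. The sketch below is a minimal alternative to the slicing used above, assuming every href follows the `month-xxx-NNN.html` pattern seen on the index page, with a zero-padded page number. The helper name `build_page_hrefs` is our own and is not part of any library.

```python
import re

def build_page_hrefs(first_href: str, n_pages: int) -> list:
    """Return the hrefs of pages 1..n_pages for one month.

    Assumes the first href ends in a zero-padded page number followed by
    '.html', e.g. 'month-jan-001.html' (pattern taken from the index page).
    """
    match = re.fullmatch(r"(.*-)(\d+)(\.html)", first_href)
    if match is None:
        raise ValueError("unexpected href format: " + first_href)
    prefix, digits, suffix = match.groups()
    width = len(digits)  # keep the same zero padding as the first page
    return [prefix + str(n).zfill(width) + suffix for n in range(1, n_pages + 1)]

print(build_page_hrefs("month-jan-001.html", 3))
```

Building the names from the matched prefix and suffix avoids hard-coded slice positions, so the helper would still work if the month part of the href had a different length.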
    

In [8]:
#We have collected all the required data and stored it in a dataframe as it is easy to carry out operations on it.
df 

Unnamed: 0,title,snippet,category
0,Be honest. You're not going to read all those...,"Every year, about this time, my Instagram feed...",Books
1,Mariah Carey's Twitter account hacked on New ...,Mariah Carey’s Twitter account appeared to hav...,Music
2,Providence Lost by Paul Lay review – the rise...,The only public execution of a British head of...,Books
3,"War epics, airmen and young Sopranos: essenti...",1917 An epic of Lean-ian proportions is delive...,Film
4,'I'm on the hunt for humour and hope': what w...,Matt Haig I have been very dark and gloomy wit...,Books
...,...,...,...
5384,Banging toons: why bands such as Bis are maki...,"An architect cranks a lever, and suddenly Mr S...",Music
5385,Little Scratch by Rebecca Watson review - a d...,Rebecca Watson’s debut novel started life as a...,Books
5386,'All that mattered was survival': the songs t...,Isaac Hayes – Going in Circles When it came to...,Music
5387,‘It took its toll’: the terrible legacy of Ma...,"Summary: As a child in 1960s east Harlem, docu...",Film


### Initial observations
* The dataset contains an approximately equal share of each class (see the counts below). This means our dataset is balanced, so we won't apply any undersampling or oversampling.

In [9]:
df['category'].value_counts()

Books    1821
Music    1797
Film     1771
Name: category, dtype: int64

In [10]:
df.loc[2]['snippet']

'The only public execution of a British head of state occurred 371 years ago outside the Banqueting House in Whitehall on 30 January 1649. It was a rad …'

## 1. Text cleaning and preparation
### 1.1. Special character cleaning
We can see the following special characters:

* \r and \n (carriage returns and line breaks)
* runs of spaces
* ' in possessives and contractions (e.g. You're)
* " and ' when quoting text

In [11]:
df['snippet'] = df['snippet'].str.replace("\r", " ")
df['snippet'] = df['snippet'].str.replace("\n", " ")
df['snippet'] = df['snippet'].str.replace("    ", " ")
# "/' when quoting text
df['snippet'] = df['snippet'].str.replace('"', '')
df['snippet'] = df['snippet'].str.replace("'", "")


df['title'] = df['title'].str.replace("\r", " ")
df['title'] = df['title'].str.replace("\n", " ")
df['title'] = df['title'].str.replace("    ", " ")
# "/' when quoting text
df['title'] = df['title'].str.replace('"', '')
df['title'] = df['title'].str.replace("'", "")

### 1.2. Upcase/downcase
We'll downcase the texts because we want, for example, Whitehall and whitehall to be the same word.

In [12]:
# Lower-casing the text
df['title'] = df['title'].str.lower()

df['snippet'] = df['snippet'].str.lower()

### 1.3. Punctuation signs
Punctuation marks carry little predictive power for the category, so we'll just get rid of them.

In [13]:
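punctuation_signs = list("?:!.,;")

# regex=False treats each sign as a literal character rather than a regex pattern
for punct_sign in punctuation_signs:
    df['title'] = df['title'].str.replace(punct_sign, '', regex=False)
    df['snippet'] = df['snippet'].str.replace(punct_sign, '', regex=False)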
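
Steps 1.1 to 1.3 can be sketched as a single function on a plain string, which makes the effect of the cleaning easy to check on one example. This is a minimal sketch rather than the notebook's exact code: the name `clean_text` is our own, and it collapses any run of spaces, whereas the cells above only replace runs of four spaces.

```python
import re

PUNCTUATION = "?:!.,;"  # same signs as in the cell above

def clean_text(text: str) -> str:
    # 1.1 special characters: line breaks become spaces, quotes are dropped
    text = text.replace("\r", " ").replace("\n", " ")
    text = text.replace('"', "").replace("'", "")
    # 1.2 lower-casing
    text = text.lower()
    # 1.3 punctuation signs
    for sign in PUNCTUATION:
        text = text.replace(sign, "")
    # Assumption: collapse every run of spaces into one space
    return re.sub(r" {2,}", " ", text).strip()

print(clean_text('Be honest. You\'re not going\r\nto "read"!'))
```

Such a function could be applied to a column with `df['snippet'].apply(clean_text)`.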

### 1.4. Stemming and Lemmatization
Since stemming can produce output words that don't exist, we'll only use a lemmatization process. Lemmatization takes into consideration the morphological analysis of the words and returns words that do exist, so it will be more useful for us.

In [14]:
# Saving the lemmatizer into an object
wordnet_lemmatizer = WordNetLemmatizer()

In [15]:
#In order to lemmatize, we have to iterate through every word
nrows = len(df)
lemmatized_text_list = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row]['snippet']
    text_words = text.split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list.append(lemmatized_text)
    
nrows = len(df)
lemmatized_text_list_title = []

for row in range(0, nrows):
    
    # Create an empty list containing lemmatized words
    lemmatized_list = []
    
    # Save the text and its words into an object
    text = df.loc[row]['title']
    text_words = text.split(" ")

    # Iterate through every word to lemmatize
    for word in text_words:
        lemmatized_list.append(wordnet_lemmatizer.lemmatize(word, pos="v"))
        
    # Join the list
    lemmatized_text = " ".join(lemmatized_list)
    
    # Append to the list containing the texts
    lemmatized_text_list_title.append(lemmatized_text)
    
   

In [16]:
df['snippet'] = lemmatized_text_list
df['title'] = lemmatized_text_list_title



Although lemmatization doesn't work perfectly in all cases (because we treat every word as a verb, the noun "feed" becomes "fee" in the first snippet), it can be useful.
<br>
### 1.5. Stop words

In [17]:
# Downloading the stop words list
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\raksh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
# Loading the stop words in english
stop_words = list(stopwords.words('english'))

In [19]:
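for stop_word in stop_words:

    # Match the stop word only as a whole word; regex=True is required because
    # recent pandas versions treat the pattern as a literal string by default
    regex_stopword = r"\b" + stop_word + r"\b"
    df['snippet'] = df['snippet'].str.replace(regex_stopword, '', regex=True)
    df['title'] = df['title'].str.replace(regex_stopword, '', regex=True)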
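
The word-boundary matching used above can be illustrated on a plain string. The sketch below is only an illustration under stated assumptions: it uses a short hypothetical stop word list instead of the NLTK list, combines all stop words into one pattern instead of looping, and also collapses the spaces left behind. The name `remove_stop_words` is our own.

```python
import re

def remove_stop_words(text: str, stop_words) -> str:
    # \b ensures that only whole words match, so "this" is removed
    # but "thistle" is left alone
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in stop_words) + r")\b"
    without = re.sub(pattern, "", text)
    # Removing a word leaves extra spaces behind; collapse them
    return re.sub(r"\s+", " ", without).strip()

sample_stop_words = ["about", "this", "my"]  # hypothetical subset of a stop word list
print(remove_stop_words("every year about this time my instagram feed", sample_stop_words))
```

Compiling one alternation for all stop words means the text is scanned once rather than once per stop word.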

In [20]:
df.head(1)

Unnamed: 0,title,snippet,category
0,honest youre go read book holiday,every year time instagram fee fill pictur...,Books


### Task 2. Binary Text Classification

For binary classification, we classify each document into one of two groups, which are usually represented as 0 and 1 in the data.

In [21]:
#We create one new dataframe per pair of categories for binary classification of the news articles.
BCcolumns = ["document","category"]
BFdf = pd.DataFrame(columns=BCcolumns)
MFdf = pd.DataFrame(columns=BCcolumns)
BMdf = pd.DataFrame(columns=BCcolumns)
BFdf

Unnamed: 0,document,category


In [22]:
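#From the main dataframe we select each pair of categories in turn, starting with Books and Film.
#Note: title and snippet are joined without a separator, so the last word of the title fuses with the first word of the snippet (e.g. "2020isaac" in the output below).
r = 0
for index, row in df.iterrows():
    if row['category'] == 'Books' or row['category'] == 'Film':
        BFdf.loc[r] = [row['title'] + row['snippet'] , row['category']] #saved the title and snippet as one document.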
        r += 1
BFdf

r = 0
for index, row in df.iterrows():
    if row['category'] == 'Music' or row['category'] == 'Film':
        MFdf.loc[r] = [row['title'] + row['snippet'] , row['category']] #saved the title and snippet as one document.
        r += 1
MFdf

r = 0
for index, row in df.iterrows():
    if row['category'] == 'Music' or row['category'] == 'Books':
        BMdf.loc[r] = [row['title'] + row['snippet'] , row['category']] #saved the title and snippet as one document.
        r += 1
BMdf

Unnamed: 0,document,category
0,honest youre go read book holiday eve...,Books
1,mariah careys twitter account hack new years...,Music
2,providence lose paul lay review – rise fal...,Books
3,im hunt humour hope author read 2020m...,Books
4,diary murderer kim young-ha review – dark ...,Books
...,...,...
3613,humaning nice idea ridiculous corporate buz...,Books
3614,banging toons band bis make soundtracks ...,Music
3615,little scratch rebecca watson review - dari...,Books
3616,matter survival songs get us 2020isaac ...,Music


### Label Encoding

In [23]:
#We have created dictionaries with category names which will be grouped together for binary classification.
BF_category_codes = {
    'Books': 0,
    'Film': 1
}
MF_category_codes = {
    'Music': 0,
    'Film': 1
}
BM_category_codes = {
    'Books': 0,
    'Music': 1
}
# Category mapping
BFdf['Category_Code'] = BFdf['category']
BFdf = BFdf.replace({'Category_Code':BF_category_codes})

MFdf['Category_Code'] = MFdf['category']
MFdf = MFdf.replace({'Category_Code':MF_category_codes})

BMdf['Category_Code'] = BMdf['category']
BMdf = BMdf.replace({'Category_Code':BM_category_codes})

### Train-Test Split 
We'll set apart a test set to evaluate the quality of our models. We'll choose a test set size of 15% of the full dataset.
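
To make the 15% split concrete, the sketch below computes the train and test sizes and performs a simple shuffled split in plain Python. It assumes the test count is rounded up, which is how we understand `train_test_split` to behave when only `test_size` is given; the shuffle itself will not reproduce scikit-learn's exact ordering. The names `train_test_sizes` and `shuffle_split` are our own.

```python
import math
import random

def train_test_sizes(n_samples: int, test_size: float) -> tuple:
    # Assumption: the test share is rounded up, the rest goes to training
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

def shuffle_split(items, test_size=0.15, seed=8):
    # Shuffle the positions with a fixed seed so the split is repeatable
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_train, _ = train_test_sizes(len(items), test_size)
    train = [items[i] for i in indices[:n_train]]
    test = [items[i] for i in indices[n_train:]]
    return train, test

# 1821 Books + 1771 Film documents = 3592 documents
print(train_test_sizes(3592, 0.15))
```

For the 3592 Books and Film documents this gives 3053 training and 539 test documents, which agrees with the feature matrix shapes printed further down.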

In [24]:
#Creating train-test split for Books and Film category from the dataset.
X_train, X_test, y_train, y_test = train_test_split(BFdf['document'], 
                                                    BFdf['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

#### We'll use TF-IDF Vectors as features. 

We have to define the following parameters:

ngram_range: We want to consider both unigrams and bigrams.<br>
max_df: When building the dataset ignore terms that have a document frequency strictly higher than the given threshold.<br>
min_df: When building the dataset ignore terms that have a document frequency strictly lower than the given threshold.<br>
max_features: If not None, build a dataset that only considers the top max_features ordered by term frequency across the corpus.
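
The effect of these parameters on the vocabulary can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: it handles unigrams only, treats `min_df` as an absolute count and `max_df` as a proportion (as in our parameter choice), and breaks ties alphabetically, which is our own assumption. The name `select_vocabulary` is hypothetical.

```python
from collections import Counter

def select_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # Term frequency: how often does each term occur across the corpus?
    term_freq = Counter(t for tokens in tokenized for t in tokens)
    max_count = max_df * n_docs
    kept = [t for t in doc_freq if min_df <= doc_freq[t] <= max_count]
    # Keep only the most frequent terms if max_features is set
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

docs = ["book review", "film review", "book film review", "album"]
print(select_vocabulary(docs, min_df=2, max_df=0.5))
```

Here "review" is dropped for appearing in more than half of the documents and "album" for appearing in fewer than two.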


In [25]:
# Parameter selection
ngram_range = (1,2)
min_df = 10
max_df = 1.
max_features = 300

#### To transform the word sequences into numerical features we're using TF-IDF vectorization.
TF-IDF stands for Term Frequency-Inverse Document Frequency, a combination of two metrics - term frequency(TF) and inverse document frequency(IDF), and the idea is to weigh down the frequent terms while scaling up the rare or less frequent ones.
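
A small worked example shows how the weighting behaves. The sketch below is our own plain-Python illustration, assuming the smoothed IDF formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 that scikit-learn documents as its default, a sublinear term frequency of 1 + ln(tf), and L2 normalisation of each document vector. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n = len(docs)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed IDF: terms found in many documents get a lower weight
    idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sublinear TF: 1 + ln(tf) for terms that occur, 0 otherwise
        weights = [(1 + math.log(counts[t])) * idf[t] if counts[t] else 0.0
                   for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm if norm else 0.0 for w in weights])
    return vocab, idf, vectors

vocab, idf, vectors = tfidf_vectors(["book review book", "film review"])
print(vocab)
print(idf)
```

In this toy corpus "review" occurs in both documents and therefore receives a lower IDF than "book" or "film", which each occur in only one.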

In [26]:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(3053, 300)
(539, 300)


#### Different combinations of parameter selection result in different accuracy of models.
We have fitted and then transformed the training set, but we have only transformed the test set, so that no information from the test set leaks into the vocabulary or the IDF weights.
<br>
<br>
We'll try multiple machine learning classification models in order to find which one performs best on our data. We will try with the following models:

* K Nearest Neighbors<br>
* Multinomial Naive Bayes<br>
* Support Vector Machine<br>

In [27]:
knnc_0 =KNeighborsClassifier(n_neighbors=3)

print('Parameters:\n')
print(knnc_0.get_params())

Parameters:

{'algorithm': 'auto', 'leaf_size': 30, 'metric': 'minkowski', 'metric_params': None, 'n_jobs': None, 'n_neighbors': 3, 'p': 2, 'weights': 'uniform'}


#### Fit the model to the training data and then get predictions.

In [28]:
knnc_0.fit(features_train, labels_train)

knnc_pred = knnc_0.predict(features_test)

#### Evaluation
We will use the below metrics for the evaluation of our models.

* Accuracy: the accuracy metric measures the ratio of correct predictions over the total number of predictions.
* Precision: the fraction of documents predicted as positive that actually belong to the positive class.
* Recall: the fraction of documents in the positive class that are correctly predicted as positive.
* F1-Score: this metric represents the harmonic mean between recall and precision values.
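
The four metrics can be computed by hand from the counts of true and false positives and negatives. The following is a minimal sketch on made-up labels (the lists `y_true` and `y_pred` are illustrative, not taken from our data), and the function name `binary_metrics` is our own.

```python
def binary_metrics(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # illustrative labels
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]  # illustrative predictions
print(binary_metrics(y_true, y_pred))
```

With 3 true positives, 1 false positive, 1 false negative and 3 true negatives, all four metrics come out at 0.75 in this example.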

In [29]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Training accuracy
print("The training accuracy is: ")
print(accuracy_score(labels_train, knnc_0.predict(features_train)))

# Test accuracy
print("The test accuracy is: ")
print(accuracy_score(labels_test, knnc_pred))

The training accuracy is: 
0.872911889944317
The test accuracy is: 
0.7217068645640075


#### Classification report

In [30]:
# Classification report
print("Classification report")
print(classification_report(labels_test,knnc_pred))

Classification report
              precision    recall  f1-score   support

           0       0.79      0.60      0.68       269
           1       0.68      0.84      0.75       270

    accuracy                           0.72       539
   macro avg       0.73      0.72      0.72       539
weighted avg       0.73      0.72      0.72       539



aux_df = BFdf[['category', 'Category_Code']].drop_duplicates().sort_values('Category_Code')
conf_matrix = confusion_matrix(labels_test, knnc_pred)
plt.figure(figsize=(12.8,6))
sns.heatmap(conf_matrix, 
            annot=True,
            xticklabels=aux_df['category'].values, 
            yticklabels=aux_df['category'].values,
            cmap="Blues")
plt.ylabel('Actual')     # confusion_matrix puts the true labels on the rows
plt.xlabel('Predicted')  # and the predicted labels on the columns
plt.title('Confusion matrix')
plt.show()
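
When reading the heatmap it helps to remember that scikit-learn's `confusion_matrix` puts the actual classes on the rows and the predicted classes on the columns. The sketch below rebuilds such a matrix in plain Python on made-up labels to show that layout; the name `confusion_counts` is our own.

```python
def confusion_counts(y_true, y_pred, labels):
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        # Row = actual class, column = predicted class
        matrix[index[actual]][index[predicted]] += 1
    return matrix

y_true = [0, 0, 0, 1, 1]  # illustrative labels
y_pred = [0, 1, 0, 1, 1]  # illustrative predictions
matrix = confusion_counts(y_true, y_pred, labels=[0, 1])
for row in matrix:
    print(row)
```

The first row describes the documents whose actual class is 0: two were predicted correctly and one was predicted as class 1.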

In [31]:
mnbc = MultinomialNB()
mnbc

MultinomialNB()

In [32]:
mnbc.fit(features_train, labels_train)

MultinomialNB()

In [33]:
mnbc_pred = mnbc.predict(features_test)

In [34]:
# Training accuracy
print("The training accuracy is: ")
print(accuracy_score(labels_train, mnbc.predict(features_train)))

# Test accuracy
print("The test accuracy is: ")
print(accuracy_score(labels_test, mnbc_pred))

The training accuracy is: 
0.8978054372748117
The test accuracy is: 
0.8664192949907236


In [35]:
# Classification report
print("Classification report")
print(classification_report(labels_test,mnbc_pred))

Classification report
              precision    recall  f1-score   support

           0       0.85      0.90      0.87       269
           1       0.89      0.84      0.86       270

    accuracy                           0.87       539
   macro avg       0.87      0.87      0.87       539
weighted avg       0.87      0.87      0.87       539



aux_df = BFdf[['category', 'Category_Code']].drop_duplicates().sort_values('Category_Code')
conf_matrix = confusion_matrix(labels_test, mnbc_pred)
plt.figure(figsize=(12.8,6))
sns.heatmap(conf_matrix, 
            annot=True,
            xticklabels=aux_df['category'].values, 
            yticklabels=aux_df['category'].values,
            cmap="Blues")
plt.ylabel('Actual')     # confusion_matrix puts the true labels on the rows
plt.xlabel('Predicted')  # and the predicted labels on the columns
plt.title('Confusion matrix')
plt.show()

In [36]:

svc_0 =SVC(random_state=8)

print('Parameters currently in use:\n')
print(svc_0.get_params())

Parameters currently in use:

{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': 8, 'shrinking': True, 'tol': 0.001, 'verbose': False}


In [37]:
svc_0.fit(features_train, labels_train)
svc_pred = svc_0.predict(features_test)

In [38]:
# Training accuracy
print("The training accuracy is: ")
print(accuracy_score(labels_train, svc_0.predict(features_train)))


# Test accuracy
print("The test accuracy is: ")
print(accuracy_score(labels_test, svc_pred))

The training accuracy is: 
0.980019652800524
The test accuracy is: 
0.8719851576994434


In [39]:
# Classification report
print("Classification report")
print(classification_report(labels_test,svc_pred))

Classification report
              precision    recall  f1-score   support

           0       0.85      0.91      0.88       269
           1       0.90      0.84      0.87       270

    accuracy                           0.87       539
   macro avg       0.87      0.87      0.87       539
weighted avg       0.87      0.87      0.87       539



### Cross validation testing

Above we used hold-out testing; below we use cross-validation to test the performance of our model.
Cross-validation gave a higher accuracy. With 15 folds, each model trains on about 93% of the data instead of 85%, which helps it learn more classification patterns. Note, however, that the pipeline below also builds its features from the full vocabulary rather than the 300 features kept above, so the two scores are not directly comparable.
<br>
<br>
We have used a pipeline to carry out cross-validation testing. The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters.
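
The idea behind k-fold cross-validation can be sketched without any library: the data is cut into k folds, and each fold serves once as the test set while the remaining folds form the training set. The sketch below uses plain contiguous folds, whereas `cross_val_score` with a classifier and an integer `cv` uses stratified folds, so this is an illustration of the principle only. The names `k_fold_indices` and `cross_val_mean` are our own.

```python
def k_fold_indices(n_samples: int, k: int):
    # The first (n_samples % k) folds get one extra sample
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds

def cross_val_mean(score_fn, n_samples: int, k: int) -> float:
    # score_fn(train_idx, test_idx) should fit on train and score on test
    scores = [score_fn(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

for train_idx, test_idx in k_fold_indices(10, 3):
    print(len(train_idx), test_idx)
```

With k = 15, each model is trained on 14 of the 15 folds, that is, roughly 93% of the documents.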

In [40]:
pipeline2 = Pipeline([
 ('vec', CountVectorizer(stop_words="english")),
 ('tfidf', TfidfTransformer(use_idf=True)),
 ('clf', SVC())
])

#y = BFdf['category'].map({'Books': 1, 'Film': 0}).astype(int)
acc_scores = cross_val_score(pipeline2, BFdf['document'], BFdf['Category_Code'], cv=15, scoring="accuracy")

acc_scores.mean()

0.914251510925151

In this project, we just want documents to be correctly predicted. The costs of false positives and false negatives are the same to us. So it does not matter whether our classifier is more specific or more sensitive, as long as it correctly classifies as many documents as possible. Therefore, we will consider the accuracy when comparing models.

#### From the above test results we will choose SVM for further predictions as it gave the highest accuracy score compared to KNN and MultinomialNB

We will now create binary classifier models for the remaining 2 pairs of categories

In [41]:
#Create train and test splits for Book and Music category.

X_train, X_test, y_train, y_test = train_test_split(BMdf['document'], 
                                                    BMdf['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

In [42]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(3075, 300)
(543, 300)


In [43]:
svc_0 =SVC(random_state=8)

print('Parameters currently in use:\n')
print(svc_0.get_params())
svc_0.fit(features_train, labels_train)
svc_pred = svc_0.predict(features_test)

Parameters currently in use:

{'C': 1.0, 'break_ties': False, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'scale', 'kernel': 'rbf', 'max_iter': -1, 'probability': False, 'random_state': 8, 'shrinking': True, 'tol': 0.001, 'verbose': False}


In [44]:
# Training accuracy
print("The training accuracy is: ")
print(accuracy_score(labels_train, svc_0.predict(features_train)))


# Test accuracy
print("The test accuracy is: ")
print(accuracy_score(labels_test, svc_pred))

The training accuracy is: 
0.9785365853658536
The test accuracy is: 
0.8784530386740331


In [45]:
# Classification report
print("Classification report")
print(classification_report(labels_test,svc_pred))

Classification report
              precision    recall  f1-score   support

           0       0.88      0.87      0.88       266
           1       0.88      0.88      0.88       277

    accuracy                           0.88       543
   macro avg       0.88      0.88      0.88       543
weighted avg       0.88      0.88      0.88       543



In [46]:
pipeline2 = Pipeline([
 ('vec', CountVectorizer(stop_words="english")),
 ('tfidf', TfidfTransformer(use_idf=True)),
 ('clf', SVC())
])

y = BMdf['category'].map({'Books': 1, 'Music': 0}).astype(int)
acc_scores = cross_val_score(pipeline2, BMdf['document'], y, cv=15, scoring="accuracy")

acc_scores.mean()

0.9286798578009443

In [47]:
#Create train and test splits for Music and Film category.

X_train, X_test, y_train, y_test = train_test_split(MFdf['document'], 
                                                    MFdf['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

In [48]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(3032, 300)
(536, 300)


In [49]:
svc_0.fit(features_train, labels_train)
svc_pred = svc_0.predict(features_test)

In [50]:
# Training accuracy
print("The training accuracy is: ")
print(accuracy_score(labels_train, svc_0.predict(features_train)))


# Test accuracy
print("The test accuracy is: ")
print(accuracy_score(labels_test, svc_pred))

The training accuracy is: 
0.9812005277044855
The test accuracy is: 
0.8656716417910447


In [51]:
# Classification report of Music and Film binary classifier
print("Classification report")
print(classification_report(labels_test,svc_pred))

Classification report
              precision    recall  f1-score   support

           0       0.88      0.88      0.88       298
           1       0.85      0.84      0.85       238

    accuracy                           0.87       536
   macro avg       0.86      0.86      0.86       536
weighted avg       0.87      0.87      0.87       536



In [52]:
pipeline2 = Pipeline([
 ('vec', CountVectorizer(stop_words="english")),
 ('tfidf', TfidfTransformer(use_idf=True)),
 ('clf', SVC())
])

y = MFdf['category'].map({'Music': 1, 'Film': 0}).astype(int)
acc_scores = cross_val_score(pipeline2, MFdf['document'], y, cv=15, scoring="accuracy")

acc_scores.mean()


0.9316101123993903

In all three scenarios above, cross-validation gave a higher accuracy than hold-out testing, although the cross-validated pipeline also uses a different feature set (the full vocabulary rather than 300 TF-IDF features).

### Task 3. Multi-class Text Classification

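Some binary classifiers do not support multi-class classification directly. scikit-learn's SVC does: internally it always trains one binary classifier per pair of classes (one-vs-one). The input parameter "decision_function_shape" only controls the shape of the decision function that is returned ('ovo' or 'ovr'), so setting it to 'ovo' below should not change the predictions.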
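
A one-vs-one scheme trains one binary classifier for every pair of classes and lets them vote on the final label. For our three categories this gives three pairs, the same pairs we modelled in Task 2. The sketch below illustrates only the voting step; the `pairwise_winner` function and the per-class scores are hypothetical stand-ins for trained binary classifiers, and ties are broken by class order as a simplifying assumption.

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(classes, pairwise_winner, sample):
    votes = Counter()
    # One "classifier" per unordered pair of classes: k * (k - 1) / 2 in total
    for a, b in combinations(classes, 2):
        votes[pairwise_winner(a, b, sample)] += 1
    # Assumption: ties go to the class that comes first in `classes`
    return max(classes, key=lambda c: votes[c])

def pairwise_winner(a, b, sample):
    # Hypothetical binary classifier: picks the class with the higher score
    return a if sample[a] >= sample[b] else b

classes = ["Books", "Music", "Film"]
sample = {"Books": 0.2, "Music": 0.7, "Film": 0.5}  # made-up scores
print(one_vs_one_predict(classes, pairwise_winner, sample))
```

In this example "Music" wins two of the three pairwise contests and is therefore the predicted class.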

In [53]:
#Created a dataframe for Multiclass text classification
MCcolumns = ["document","category"]
MCdf = pd.DataFrame(columns=MCcolumns)
MCdf

Unnamed: 0,document,category


In [54]:
#Added rows belonging to all categories to the dataframe
r = 0
for index, row in df.iterrows():
    MCdf.loc[r] = [row['title'] + row['snippet'] , row['category']]
    r += 1
MCdf

Unnamed: 0,document,category
0,honest youre go read book holiday eve...,Books
1,mariah careys twitter account hack new years...,Music
2,providence lose paul lay review – rise fal...,Books
3,war epics airmen young sopranos essential fi...,Film
4,im hunt humour hope author read 2020m...,Books
...,...,...
5384,banging toons band bis make soundtracks ...,Music
5385,little scratch rebecca watson review - dari...,Books
5386,matter survival songs get us 2020isaac ...,Music
5387,‘ take toll’ terrible legacy martin luther...,Film


In [55]:
category_codes = {
    'Books': 0,
    'Music': 1,
    'Film': 2
}
# Category mapping
MCdf['Category_Code'] = MCdf['category']
MCdf = MCdf.replace({'Category_Code':category_codes})

In [56]:
#Create train and test splits for all three categories.

X_train, X_test, y_train, y_test = train_test_split(MCdf['document'], 
                                                    MCdf['Category_Code'], 
                                                    test_size=0.15, 
                                                    random_state=8)

In [57]:
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)
                        
features_train = tfidf.fit_transform(X_train).toarray()
labels_train = y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = y_test
print(features_test.shape)

(4580, 300)
(809, 300)


In [58]:
svc_0 =SVC(random_state=8,decision_function_shape='ovo')

svc_0.fit(features_train, labels_train)
svc_pred = svc_0.predict(features_test)

In [59]:
# Training accuracy
print("The training accuracy is: ")
print(accuracy_score(labels_train, svc_0.predict(features_train)))


# Test accuracy
print("The test accuracy is: ")
print(accuracy_score(labels_test, svc_pred))

The training accuracy is: 
0.9543668122270742
The test accuracy is: 
0.7911001236093943


In [60]:
# Classification report of the multi-class classifier
print("Classification report")
print(classification_report(labels_test,svc_pred))

Classification report
              precision    recall  f1-score   support

           0       0.74      0.82      0.78       276
           1       0.78      0.81      0.79       257
           2       0.87      0.75      0.80       276

    accuracy                           0.79       809
   macro avg       0.80      0.79      0.79       809
weighted avg       0.80      0.79      0.79       809



In [62]:
#A larger number of folds (the value of cv) lets each model train on a larger share of the data, which tends to raise the mean accuracy.

pipeline2 = Pipeline([
 ('vec', CountVectorizer(stop_words="english")),
 ('tfidf', TfidfTransformer(use_idf=True)),
 ('clf', SVC(decision_function_shape='ovo'))
])

acc_scores = cross_val_score(pipeline2, MCdf['document'], MCdf['category'], cv=5, scoring="accuracy")

acc_scores.mean()

0.8734426867733672

### Multi-class vs Binary class classifier results evaluation
The more classes a model has to separate, the harder the problem becomes: with a fixed amount of data, there are more ways to misclassify each document. Hence, the accuracy of our multi-class classifier (0.79 hold-out, 0.87 cross-validated) is lower than that of the binary classifiers (0.87 to 0.88 hold-out, 0.91 to 0.93 cross-validated).

### Conclusion

We began by exploring the simplest form of classification - binary. This helped us to model data where our response could take one of two states.

We then moved further into multi-class classification, when the response variable can take more than 2 states. Here we have 3 states.

We fitted and evaluated the models with training and test sets and with cross-validation. As further work, we could refine the model fit across the various algorithms by tuning their input parameters.