## Student details:
Student name: **Siddharth Prince**  
Student ID: **23052058**

### Housekeeping code

In [1]:
from textblob.classifiers import NaiveBayesClassifier
import pandas as pd
import numpy as np
import time

def convertDfToList(df):
    return df.to_records(index=False).tolist()

# Task 4

In [2]:
# Method for Naive Bayes text classification using the TextBlob library
def naiveBayesClassifier(trainingSet, testingSet, verbose=False):
    nbClassifier = NaiveBayesClassifier(trainingSet)
    correctPreds = 0
    for testDoc in testingSet:
        predicted = nbClassifier.classify(testDoc[0])
        if predicted == testDoc[1]:
            correctPreds += 1
        if verbose:
            print(f'document: {testDoc}\t\tpredicted class: {predicted}')
    accuracyScore = correctPreds/len(testingSet)
    if verbose:
        print(f'Accuracy: {accuracyScore}')
    return accuracyScore

In [3]:
trainingSet = [('London is the Capital of GB','GB'),
               ('Oxford is a city in GB','GB'),
               ('Dublin is the capital of Ireland','IE'),
               ('Limerick is a city in Ireland','IE')]
testSet = [('University of Limerick','IE'),
           ('University College Dublin','IE'),
           ('Imperial College London','GB'),
           ('University of Oxford','GB'),
           ('Ireland & GB','GB')]
accuracyScore = naiveBayesClassifier(trainingSet, testSet, verbose=True)

document: ('University of Limerick', 'IE')		predicted class: IE
document: ('University College Dublin', 'IE')		predicted class: IE
document: ('Imperial College London', 'GB')		predicted class: GB
document: ('University of Oxford', 'GB')		predicted class: GB
document: ('Ireland & GB', 'GB')		predicted class: IE
Accuracy: 0.8
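
The accuracy reported above is the fraction of test documents whose predicted label matches the true label. A minimal, library-free sketch of that calculation follows. The `accuracy` helper is a hypothetical name introduced here, and the predictions are copied from the output above rather than recomputed with TextBlob.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of positions where the two label lists agree."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty test set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# True labels from testSet and the predictions printed in the output above
true_labels = ['IE', 'IE', 'GB', 'GB', 'GB']
predicted_labels = ['IE', 'IE', 'GB', 'GB', 'IE']

task4_accuracy = accuracy(true_labels, predicted_labels)
print(task4_accuracy)  # 4 of 5 correct -> 0.8
```

Only the last document ('Ireland & GB') is misclassified, which accounts for the 0.8 score.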


# Task 5

In [4]:
# Downloading the text classification data set
!wget https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv

# Loading the CSV data into a pandas dataframe
df = pd.read_csv('./bbc-text.csv')
df.head()

--2023-12-04 00:24:55--  https://storage.googleapis.com/dataset-uploader/bbc/bbc-text.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.193.207, 172.253.116.207, 209.85.202.207, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.193.207|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5057493 (4.8M) [text/csv]
Saving to: ‘bbc-text.csv.1’


2023-12-04 00:24:56 (11.1 MB/s) - ‘bbc-text.csv.1’ saved [5057493/5057493]



Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [5]:
# The columns are in (category, text) order, but the classifier method expects (text, label) pairs, so we reorder them here
df = df.reindex(columns=['text', 'category'])

In [6]:
print(f'Number of documents: {len(df.index)}\n')
    
print(f'Documents per category: {df.groupby("category").size()}')

Number of documents: 2225

Documents per category: category
business         510
entertainment    386
politics         417
sport            511
tech             401
dtype: int64


## K-Fold cross-validation

Finding the distribution of data according to each class

In [7]:
classFreqs = df['category'].value_counts()
classFreqs

category
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

Splitting the data frame into one data frame per class. This allows every fold to receive the same share of each class (stratification) when the k folds are created.

In [8]:
# To create a "stratified" k-fold dataset, we will aim to have an even distribution of all classes in all folds.
classWiseDfs = []
# Splitting the df into separate dfs for each class
for dataClass in classFreqs.index:
    classDf = df[df['category'] == dataClass]
    classWiseDfs.append(classDf)
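
The same per-class split can be illustrated without pandas. The sketch below is only an illustration of the idea, not the notebook's method: `group_by_label` is a hypothetical helper, and the toy documents are reused from Task 4.

```python
from collections import defaultdict

def group_by_label(documents):
    """Group (text, label) pairs into a dict mapping label -> list of texts."""
    groups = defaultdict(list)
    for text, label in documents:
        groups[label].append(text)
    return dict(groups)

# Toy documents reused from the Task 4 training set
documents = [
    ('London is the Capital of GB', 'GB'),
    ('Oxford is a city in GB', 'GB'),
    ('Dublin is the capital of Ireland', 'IE'),
    ('Limerick is a city in Ireland', 'IE'),
]

groups = group_by_label(documents)
for label, texts in groups.items():
    print(label, len(texts))
```

Once the data is grouped like this, each class can be shuffled and sliced independently, which is what keeps the class proportions the same in every fold.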

Computing how many samples each class contributes to a single test fold in stratified k-fold cross-validation. Floor division is used, so any remainder is left out of the test folds.

In [13]:
k = 10 # Number of folds
n_samples_per_fold_per_class = np.floor_divide(classFreqs, k) 
print(n_samples_per_fold_per_class)

category
sport            51
business         51
politics         41
tech             40
entertainment    38
Name: count, dtype: int64
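
As a quick sanity check on these numbers, the arithmetic can be reproduced in plain Python. The class counts below are copied from the `value_counts()` output above; the variable names are illustrative only.

```python
k = 10  # Number of folds

# Class counts copied from the value_counts() output above
class_counts = {
    'sport': 511,
    'business': 510,
    'politics': 417,
    'tech': 401,
    'entertainment': 386,
}

# Floor division: each class contributes count // k samples to every test fold
per_fold = {label: count // k for label, count in class_counts.items()}

total = sum(class_counts.values())
test_fold_size = sum(per_fold.values())       # samples in one test fold
train_fold_size = total - test_fold_size      # everything else is training data
never_tested = total - k * test_fold_size     # remainder dropped by floor division

print(per_fold)
print(test_fold_size, train_fold_size, never_tested)
```

With these counts, each test fold holds 221 documents and each training set 2004. The 15 leftover documents never land in a test fold, although they are still present in every training set.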


For each class, shuffle its rows, slice out one contiguous block of row positions per fold as that fold's test split, and append the remaining rows of the class to that fold's training split.

In [14]:
# Initialize the fold indices
foldIndices = [(np.array([], dtype=int), np.array([], dtype=int)) for _ in range(k)]

# Iterate over the classes
for i, classDf in enumerate(classWiseDfs):
    # Using the pandas' inbuilt sample function to randomise data in each class df
    classDf = classDf.sample(frac=1)

    # Split the class data frame into folds
    for j in range(k):
        # Use .iloc for positional access, since the Series is indexed by category label
        start = j * n_samples_per_fold_per_class.iloc[i]
        end = (j + 1) * n_samples_per_fold_per_class.iloc[i]
        indices = np.arange(start, end) # Row positions of this fold's test samples within the shuffled class df
        testIdx = classDf.index[indices]
        # Every other row of this class goes into the fold's training split
        trainIdx = classDf.index[~classDf.index.isin(testIdx)]
        foldIndices[j] = (
            np.concatenate([foldIndices[j][0], trainIdx]),
            np.concatenate([foldIndices[j][1], testIdx])
        )

print(foldIndices)

[(array([2157, 1019, 1338, ..., 1871, 1041, 2211]), array([ 773, 2096, 1909, 1798, 1424, 1867, 1745,  743,  546, 1582,  722,
       1523, 1601, 1264,  594,  335, 1137, 1932, 1217, 1399,  533, 1983,
        327, 1302,  846,  629,  762,  146, 1281,  828, 1663, 1587, 1496,
       2119,  217, 2146,  386,  265,  215, 2142,  127,   91,  112,  867,
       1489, 1065, 1386, 1026, 1632,  182,  587,  853, 1515, 1773, 2135,
       1959,  746, 1842,  316,  415, 2154, 1242, 2037,  311, 1429, 1256,
        818, 1180,  133, 1123,  411,  763,  681,  855,  191,  489, 2148,
       1952, 1230,  499, 1626,  631, 2051, 1375,  224,  178, 1961,  242,
       1100,  382, 1282,  676,  447, 1544,  179, 1392, 2067,  955, 1321,
        290,  248,  134,  260, 1782, 1014,  567,  516,  399, 1529,  287,
        493,  197, 2206,  174,  907,  173,  951, 1259, 1012,   94, 1404,
       1288,  946, 1347,   55,  310,  934,  384,  702, 1040,  697,  275,
       2062, 1658, 1295,  771, 1990,  343, 1791, 2054,  979, 1516,  321,
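
The index construction above can be hard to follow in array form. Below is a small pure-Python sketch of the same idea on toy data. It is a simplified stand-in for the pandas/NumPy version in this notebook, not a copy of it: `stratified_k_fold` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random

def stratified_k_fold(labels, k, seed=0):
    """Return k (train_indices, test_indices) pairs with per-class balance.

    Each class contributes len(class) // k samples to every test fold; any
    remainder stays on the training side of every fold.
    """
    rng = random.Random(seed)

    # Collect the positions of the samples belonging to each class
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    folds = [([], []) for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)  # randomise the order within the class
        per_fold = len(indices) // k
        for j in range(k):
            test = indices[j * per_fold:(j + 1) * per_fold]
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            folds[j][0].extend(train)
            folds[j][1].extend(test)
    return folds

# Toy data: 7 documents of class 'a' and 5 of class 'b', split into 2 folds
labels = ['a'] * 7 + ['b'] * 5
folds = stratified_k_fold(labels, k=2)
for train, test in folds:
    print(len(train), len(test))  # 7 training and 5 test samples per fold
```

Each test fold here contains three 'a' samples and two 'b' samples, so the class ratio is the same in both folds even though the classes have different sizes.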


In [10]:
# Just checking if there is any overlap between the training and testing sets in a fold.
t = [i for i in foldIndices[0][0] if i in foldIndices[0][1]]
print(t)
print(len(foldIndices[1][1]))
# len(df)

[]
221
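
The check above inspects a single fold. A sketch of how the same check could be extended to every fold is given below, using sets so that membership tests stay cheap. `folds_are_valid` is a hypothetical helper, and the two small fold lists are made-up examples rather than data from this notebook.

```python
def folds_are_valid(folds):
    """Check that train and test are disjoint in every fold and that no
    sample appears in more than one test fold."""
    seen_in_test = set()
    for train, test in folds:
        train_set, test_set = set(train), set(test)
        if not train_set.isdisjoint(test_set):
            return False  # a sample is in both splits of the same fold
        if not seen_in_test.isdisjoint(test_set):
            return False  # a sample is tested in more than one fold
        seen_in_test |= test_set
    return True

# Made-up examples: the second list leaks sample 1 into both splits of fold 0
good_folds = [([2, 3, 4], [0, 1]), ([0, 1, 4], [2, 3])]
bad_folds = [([1, 2, 3], [0, 1]), ([0, 1], [2, 3])]

print(folds_are_valid(good_folds))  # True
print(folds_are_valid(bad_folds))   # False
```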


Now that we have indices of our 10-fold stratified dataset, let's iterate through training and testing each fold.

In [11]:
accuracyScores = []
for trainIndices, testIndices in foldIndices:
    foldNumber = len(accuracyScores) + 1
    print(f'Running for fold #{foldNumber}')
    print('-' * 80)
    trainDf, testDf = df.iloc[trainIndices], df.iloc[testIndices]
    trainDf, testDf = trainDf.sample(frac=1).reset_index(drop=True), testDf.sample(frac=1).reset_index(drop=True)
    print('Training set samples:')
    display(trainDf)
    print('Testing set samples:')
    display(testDf)

    # Classifying the test data based on the training data

    # First, we need to convert the df to a list of tuples so that the data structures are compatible with the defined function
    trainingList, testingList = convertDfToList(trainDf), convertDfToList(testDf)
    now = time.time()
    # Passing the train and test data to the naiveBayesClassifier function that we defined in task 4 above. Code reusability is great! :D
    accuracyScore = naiveBayesClassifier(trainingList, testingList)
    accuracyScores.append(accuracyScore)
    print(f'Accuracy score for fold #{foldNumber}: {accuracyScore}')
    print(f'Time taken for fold #{foldNumber}: {time.time() - now} seconds')
    print('-' * 80)

Running for fold #1
--------------------------------------------------------------------------------
Training set samples:


Unnamed: 0,text,category
0,souness backs smith for scotland graeme sounes...,sport
1,mobiles rack up 20 years of use mobile phones ...,tech
2,gadget market to grow in 2005 the explosion ...,tech
3,india s reliance family feud heats up the ongo...,business
4,nuclear strike key terror risk the uk and us...,politics
...,...,...
1999,us blogger fired by her airline a us airline a...,tech
2000,go-ahead for new internet names the internet c...,tech
2001,itunes user sues apple over ipod a user of app...,tech
2002,wales coach elated with win mike ruddock paid ...,sport


Testing set samples:


Unnamed: 0,text,category
0,tv s future down the phone line internet tv ha...,tech
1,career honour for actor dicaprio actor leonard...,entertainment
2,budget to set scene for election gordon brown ...,politics
3,howard and blair tax pledge clash tony blair h...,politics
4,hitler row over welsh arts cash an artist cri...,politics
...,...,...
216,keanu reeves given hollywood star actor keanu ...,entertainment
217,india opens skies to competition india will al...,business
218,owen set for skipper role wales number eight m...,sport
219,diageo to buy us wine firm diageo the world s...,business


Accuracy score for fold #1: 0.9683257918552036
Time taken for fold #1: 162.5450794696808 seconds
--------------------------------------------------------------------------------
Running for fold #2
--------------------------------------------------------------------------------
Training set samples:


Unnamed: 0,text,category
0,india widens access to telecoms india has rais...,business
1,blair backs pre-election budget tony blair h...,politics
2,fed warns of more us rate rises the us looks s...,business
3,bt program to beat dialler scams bt is introdu...,tech
4,edwards tips idowu for euro gold world outdoor...,sport
...,...,...
1999,snicket tops us box office chart the film adap...,entertainment
2000,shark tale dvd is us best-seller oscar-nominat...,entertainment
2001,seamen sail into biometric future the luxury c...,tech
2002,bank set to leave rates on hold uk interest ra...,business


Testing set samples:


Unnamed: 0,text,category
0,bombardier chief to leave company shares in tr...,business
1,wilkinson to miss ireland match england will h...,sport
2,act on detention ruling uk urged the governme...,politics
3,johnny cash manager holiff dies the former man...,entertainment
4,argentina closes $102.6bn debt swap argentina ...,business
...,...,...
216,benitez delight after crucial win liverpool ma...,sport
217,news corp eyes video games market news corp t...,business
218,millions buy mp3 players in us one in 10 adult...,tech
219,branson show flops on us screens entrepreneur ...,entertainment


Accuracy score for fold #2: 0.9683257918552036
Time taken for fold #2: 164.50197291374207 seconds
--------------------------------------------------------------------------------
Running for fold #3
--------------------------------------------------------------------------------
Training set samples:


Unnamed: 0,text,category
0,mirza makes indian tennis history teenager san...,sport
1,martinez sees off vinci challenge veteran span...,sport
2,security warning over fbi virus the us feder...,tech
3,ajax refuse to rule out jol move ajax have ref...,sport
4,collins banned in landmark case sprinter miche...,sport
...,...,...
1999,berlin cheers for anti-nazi film a german movi...,entertainment
2000,blair congratulates bush on win tony blair has...,politics
2001,howard dismisses tory tax fears michael howard...,politics
2002,kilroy names election seat target ex-chat show...,politics


Testing set samples:


Unnamed: 0,text,category
0,kennedy criticises unfair taxes gordon brown...,politics
1,abbas will not tolerate attacks palestinian ...,politics
2,ericsson sees earnings improve telecoms equipm...,business
3,mobile games come of age the bbc news website ...,tech
4,home phones face unclear future the fixed line...,tech
...,...,...
216,preview: ireland v england (sun) lansdowne roa...,sport
217,dvd review: spider-man 2 it s a universal rule...,entertainment
218,lord scarman 93 dies peacefully distinguishe...,politics
219,sport betting rules in spotlight a group of mp...,politics


Accuracy score for fold #3: 0.9638009049773756
Time taken for fold #3: 133.18538641929626 seconds
--------------------------------------------------------------------------------
Running for fold #4
--------------------------------------------------------------------------------
Training set samples:


Unnamed: 0,text,category
0,open source leaders slam patents the war of wo...,tech
1,labour s core support takes stock tony blair h...,politics
2,mourinho expects fight to finish chelsea manag...,sport
3,yukos unit fetches $9bn at auction a little-kn...,business
4,gb quartet get cross country call four british...,sport
...,...,...
1999,warning over windows word files writing a micr...,tech
2000,text message record smashed again uk mobile ow...,tech
2001,french consumer spending rising french consume...,business
2002,guantanamo four free in weeks all four britons...,politics


Testing set samples:


Unnamed: 0,text,category
0,napster offers rented music to go music downlo...,tech
1,german business confidence slides german busin...,business
2,the future in your pocket if you are a geek or...,tech
3,wenger signs new deal arsenal manager arsene w...,sport
4,candidate resigns over bnp link a prospective ...,politics
...,...,...
216,hewitt fights back to reach final lleyton hewi...,sport
217,eu fraud clampdown urged eu member states are ...,politics
218,us interest rates increased to 2% us interest ...,business
219,uefa approves fake grass uefa says it will all...,sport


Accuracy score for fold #4: 0.9502262443438914
Time taken for fold #4: 131.55906128883362 seconds
--------------------------------------------------------------------------------
Running for fold #5
--------------------------------------------------------------------------------
Training set samples:


Unnamed: 0,text,category
0,green reports shun supply chain nearly 20% mor...,business
1,french consumer spending rising french consume...,business
2,howard and blair tax pledge clash tony blair h...,politics
3,mcconnell in drunk remark row scotland s fir...,politics
4,vickery upbeat about arm injury england prop p...,sport
...,...,...
1999,millions to miss out on the net by 2025 40% o...,tech
2000,jones happy with henson heroics wales fly-half...,sport
2001,housing plans criticised by mps irreversible ...,politics
2002,fuming robinson blasts officials england coach...,sport


Testing set samples:


Unnamed: 0,text,category
0,russian film wins bbc world prize russian dram...,entertainment
1,bush website blocked outside us surfers outsid...,tech
2,bellamy fined after row newcastle have fined t...,sport
3,gizmondo gadget hits the shelves the gizmondo ...,tech
4,army chiefs in regiments decision military chi...,politics
...,...,...
216,how the academy awards flourished the 77th ann...,entertainment
217,blunkett tells of love and pain david blunkett...,politics
218,roundabout continues nostalgia trip the new bi...,entertainment
219,thanou desperate to make return greek sprinter...,sport


Accuracy score for fold #5: 0.9683257918552036
Time taken for fold #5: 129.39380049705505 seconds
--------------------------------------------------------------------------------
Running for fold #6
--------------------------------------------------------------------------------
Training set samples:


Unnamed: 0,text,category
0,weak dollar hits reuters revenues at media gro...,business
1,renault boss hails great year strong sales o...,business
2,howard attacks pay later budget tory leader ...,politics
3,uk athletics agrees new kit deal uk athletics ...,sport
4,france starts digital terrestrial france has b...,tech
...,...,...
1999,glasgow hosts tsunami benefit gig the top name...,entertainment
2000,euronext poised to make lse bid pan-european...,business
2001,wenger steps up row arsene wenger has stepped ...,sport
2002,yukos heading back to us courts russian oil an...,business


Testing set samples:


Unnamed: 0,text,category
0,musicians to tackle us red tape musicians gro...,entertainment
1,arsenal through on penalties arsenal win 4-2 o...,sport
2,bat spit drug firm goes to market a german fir...,business
3,bortolami predicts dour contest italy skipper ...,sport
4,europe backs digital tv lifestyle how people r...,tech
...,...,...
216,councils prepare to set tax rises council tax ...,politics
217,crisis ahead in social sciences a national b...,politics
218,game firm holds cast auditions video game fi...,tech
219,ex-boeing director gets jail term an ex-chief ...,business


Accuracy score for fold #6: 0.9457013574660633
Time taken for fold #6: 129.62819457054138 seconds
--------------------------------------------------------------------------------
Running for fold #7
--------------------------------------------------------------------------------
Training set samples:


Unnamed: 0,text,category
0,box office blow for alexander director oliver ...,entertainment
1,u2 s desire to be number one u2 who have won ...,entertainment
2,s korea spending boost to economy south korea ...,business
3,chinese wine tempts italy s illva italy s illv...,business
4,china continues breakneck growth china s econo...,business
...,...,...
1999,standard life cuts policy bonuses standard lif...,business
2000,wenger handed summer war chest arsenal boss ar...,sport
2001,woolf murder sentence rethink plans to give mu...,politics
2002,jungle tv show ratings drop by 4m the finale o...,entertainment


Testing set samples:


Unnamed: 0,text,category
0,houllier praises benitez regime former liverpo...,sport
1,singapore growth at 8.1% in 2004 singapore s e...,business
2,microsoft releases bumper patches microsoft ha...,tech
3,ray charles studio becomes museum a museum ded...,entertainment
4,straw backs ending china embargo uk foreign se...,politics
...,...,...
216,collins appeals against drugs ban sprinter mic...,sport
217,critics back aviator for oscars martin scorses...,entertainment
218,ireland 21-19 argentina an injury-time dropped...,sport
219,the sound of music is coming home the original...,entertainment


Accuracy score for fold #7: 0.9773755656108597
Time taken for fold #7: 132.35676264762878 seconds
--------------------------------------------------------------------------------
Running for fold #8
--------------------------------------------------------------------------------
Training set samples:


Unnamed: 0,text,category
0,eu referendum question unveiled the question t...,politics
1,jowell rejects las vegas jibe the secretary ...,politics
2,asia shares defy post-quake gloom thailand has...,business
3,indonesia declines debt freeze indonesia no ...,business
4,fa probes crowd trouble the fa is to take acti...,sport
...,...,...
1999,prop jones ready for hard graft adam jones say...,sport
2000,wall street cheers bush victory the us stock m...,business
2001,conservative mp defects to labour a conservati...,politics
2002,lions blow to world cup winners british and ir...,sport


Testing set samples:


Unnamed: 0,text,category
0,mobile gig aims to rock 3g forget about going ...,tech
1,broadband steams ahead in the us more and more...,tech
2,mobiles rack up 20 years of use mobile phones ...,tech
3,downing injury mars uefa victory middlesbrough...,sport
4,honda wins china copyright ruling japan s hond...,business
...,...,...
216,berlin celebrates european cinema organisers s...,entertainment
217,tough schedule delays elliot show preview perf...,entertainment
218,europe asks asia for euro help european leader...,business
219,uk pioneers digital film network the world s f...,tech


Accuracy score for fold #8: 0.9638009049773756
Time taken for fold #8: 131.37616515159607 seconds
--------------------------------------------------------------------------------
Running for fold #9
--------------------------------------------------------------------------------
Training set samples:


Unnamed: 0,text,category
0,labour s cunningham to stand down veteran labo...,politics
1,ferdinand casts doubt over glazer rio ferdinan...,sport
2,lloyd s of london head chides fsa the head of ...,business
3,da vinci film to star tom hanks actor tom hank...,entertainment
4,blair returns from peace mission prime ministe...,politics
...,...,...
1999,koubek suspended after drugs test stefan koube...,sport
2000,blunkett hints at election call ex-home secret...,politics
2001,worldcom boss left books alone former worldc...,business
2002,blair moves to woo jewish voters tony blair ha...,politics


Testing set samples:


Unnamed: 0,text,category
0,newcastle 27-27 gloucester newcastle centre ma...,sport
1,a question of trust and technology a major gov...,tech
2,mayor will not retract nazi jibe london mayor ...,politics
3,mps demand budget leak answers ministers hav...,politics
4,post-christmas lull in lending uk mortgage le...,business
...,...,...
216,india and iran in gas export deal india has si...,business
217,europe backs digital tv lifestyle how people r...,tech
218,hantuchova in dubai last eight daniela hantuch...,sport
219,bomb threat at bernabeu stadium spectators wer...,sport


Accuracy score for fold #9: 0.9411764705882353
Time taken for fold #9: 133.2353584766388 seconds
--------------------------------------------------------------------------------
Running for fold #10
--------------------------------------------------------------------------------
Training set samples:


Unnamed: 0,text,category
0,worcester v sale (fri) sixways friday 25 feb...,sport
1,parmalat founder offers apology the founder an...,business
2,arsenal may seek full share listing arsenal ...,business
3,us top of supercomputing charts the us has pus...,tech
4,brown proud of economy record gordon brown h...,politics
...,...,...
1999,dallaglio eyeing lions tour place former engla...,sport
2000,bat spit drug firm goes to market a german fir...,business
2001,last star wars not for children the sixth an...,entertainment
2002,millions to miss out on the net by 2025 40% o...,tech


Testing set samples:


Unnamed: 0,text,category
0,jobs go at oracle after takeover oracle has an...,business
1,howard s unfinished business he s not finishe...,politics
2,ec calls truce in deficit battle the european ...,business
3,virgin radio offers 3g broadcast uk broadcaste...,tech
4,blackburn v burnley ewood park tuesday 1 mar...,sport
...,...,...
216,millions buy mp3 players in us one in 10 adult...,tech
217,us prepares for hybrid onslaught sales of hybr...,business
218,japanese mogul arrested for fraud one of japan...,business
219,senior fannie mae bosses resign the two most s...,business


Accuracy score for fold #10: 0.9728506787330317
Time taken for fold #10: 132.00436401367188 seconds
--------------------------------------------------------------------------------


## Results
Checking the final list of accuracy scores for each fold and the average accuracy score across all folds.

In [12]:
print(f'Accuracies for different folds: {accuracyScores}')
print(f'Average accuracy across all 10 folds: {sum(accuracyScores)/len(accuracyScores)}')

Accuracies for different folds: [0.9683257918552036, 0.9683257918552036, 0.9638009049773756, 0.9502262443438914, 0.9683257918552036, 0.9457013574660633, 0.9773755656108597, 0.9638009049773756, 0.9411764705882353, 0.9728506787330317]
Average accuracy across all 10 folds: 0.9619909502262443
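
Alongside the mean, the spread of the per-fold scores gives a sense of how stable the classifier is across folds. The sketch below uses the standard library's `statistics` module on the scores printed above; reporting the sample standard deviation is a suggestion added here, not something computed in the original run.

```python
import statistics

# Per-fold accuracies copied from the output above
accuracy_scores = [
    0.9683257918552036, 0.9683257918552036, 0.9638009049773756,
    0.9502262443438914, 0.9683257918552036, 0.9457013574660633,
    0.9773755656108597, 0.9638009049773756, 0.9411764705882353,
    0.9728506787330317,
]

mean_accuracy = statistics.mean(accuracy_scores)
std_accuracy = statistics.stdev(accuracy_scores)  # sample standard deviation

print(f'Mean accuracy: {mean_accuracy:.4f}')
print(f'Sample standard deviation: {std_accuracy:.4f}')
print(f'Lowest / highest fold: {min(accuracy_scores):.4f} / {max(accuracy_scores):.4f}')
```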
