# Preparing Labeled Training Documents
In this section we transform the document vectors into a form that NLTK can use to train a classifier.
Generally speaking, a classification algorithm takes in a set of training documents, each associated with
its correct label. In NLTK, such labeled training data is represented as a list in which each element is a
tuple with two elements. A tuple is a Python data structure that holds a fixed sequence of elements, often of
different types, and is written with round brackets. For example, ('A', 1) is a tuple with two elements, a
string and a number, and ([1, 3, 6], 'ABC') is also a tuple with two elements, a list and a string. To
represent a labeled training document in NLTK, we use a tuple that consists of a dict object together with
a label.

For example, the following tuple represents a labeled training document:

({'police': 1, 'lawyer': 1, 'court': 1}, 'Crime')

Here the second element of this tuple is the string 'Crime', which is the label of this training document.
The first element of the tuple is the dict object {'police': 1, 'lawyer': 1, 'court': 1}. This dict object
means that the value of the feature 'police' is 1, the value of the feature 'lawyer' is 1, and the value of
the feature 'court' is also 1.
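
As a small illustration of this structure, the sketch below builds the same labeled document in plain Python and reads its parts back. The variable names are chosen for illustration only.

```python
# A labeled training document: (feature dict, label)
labeled_doc = ({'police': 1, 'lawyer': 1, 'court': 1}, 'Crime')

# Tuple unpacking separates the feature dict from the label
features, label = labeled_doc

print(label)               # Crime
print(features['police'])  # 1
print(sorted(features))    # ['court', 'lawyer', 'police']
```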

A training set to be used by NLTK is a list of such tuples. For example, the following list is a training
data set that can be passed to a classification algorithm in NLTK:

[({'police': 1, 'lawyer': 1, 'court': 1}, 'Crime'), ({'coach': 1, 'game': 1}, 'Sports')]

This training data set consists of two training documents, one labeled as Crime and the other labeled as
Sports.

Note that the features do not have to be strings. If we have mapped the word police to 1, the word lawyer
to 2, and so on, then the training set above can also be represented as follows:

[({1: 1, 2: 1, 3: 1}, 'Crime'), ({4: 1, 5: 1}, 'Sports')]
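
The mapping from words to integers can be sketched in a few lines of plain Python. This is only an illustration: the name `word_to_id` and the choice to number words in order of first appearance, starting at 1, are assumptions made to reproduce the example above.

```python
# Training set with string features, as in the earlier example
training_set = [
    ({'police': 1, 'lawyer': 1, 'court': 1}, 'Crime'),
    ({'coach': 1, 'game': 1}, 'Sports'),
]

# Assign each word the next unused integer, starting from 1
word_to_id = {}
for features, _label in training_set:
    for word in features:
        word_to_id.setdefault(word, len(word_to_id) + 1)

# Re-key every feature dict by integer ID instead of by word
encoded = [
    ({word_to_id[word]: value for word, value in features.items()}, label)
    for features, label in training_set
]

print(encoded)
# [({1: 1, 2: 1, 3: 1}, 'Crime'), ({4: 1, 5: 1}, 'Sports')]
```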

Our goal now is to create such a list. We take each term-frequency (TF) vector in turn, convert it to a
dict object, associate it with a label, and add it to the final list of training data.
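
These steps can be sketched on toy data as follows. The helper `label_vectors` and the toy vectors are hypothetical. The sketch assumes each vector is a list of (term ID, term frequency) pairs and that every feature value is set to 1, as in the examples above.

```python
def label_vectors(tf_vectors, labels):
    """Pair each TF vector, converted to a feature dict, with its label."""
    labeled = []
    for vec, label in zip(tf_vectors, labels):
        # Keep only the term IDs; every term that occurs gets the value 1
        features = {term_id: 1 for term_id, _tf in vec}
        labeled.append((features, label))
    return labeled

# Two made-up vectors of (term ID, term frequency) pairs
toy_vectors = [[(0, 2), (1, 1)], [(2, 1), (3, 3)]]
toy_labels = ['Crime', 'Sports']

print(label_vectors(toy_vectors, toy_labels))
# [({0: 1, 1: 1}, 'Crime'), ({2: 1, 3: 1}, 'Sports')]
```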

In [10]:
import nltk
import os
print (os.getcwd())
from nltk.corpus import PlaintextCorpusReader
newcorpus = PlaintextCorpusReader('/gpfs/global_fs01/sym_shared/YPProdSpark/user/sfbc-20c2d955c74628-3c618564d05f/notebook/work', '.*')
newcorpus.fileids()

/gpfs/global_fs01/sym_shared/YPProdSpark/user/sfbc-20c2d955c74628-3c618564d05f/notebook/work


['classificationcorpus/1.txt',
 'classificationcorpus/2.txt',
 'classificationcorpus/3.txt',
 'classificationcorpus/4.txt',
 'newcorpus/1.txt',
 'newcorpus/2.txt',
 'pard_agg_all_data.csv',
 'pard_agg_avg_data.csv']

In [13]:
import gensim
from gensim import corpora
from gensim import models
from gensim import similarities
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


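def tolower(docs):
    # Lowercase every token in every document
    return [[w.lower() for w in doc] for doc in docs]

def fetchdictionary(docs):
    # Build a gensim Dictionary that maps each distinct token to an integer ID
    return corpora.Dictionary(docs)

def removestop(docs):
    # Drop English stop words (a set makes the membership test fast)
    stop_list = set(stopwords.words('english'))
    return [[w for w in doc if w not in stop_list] for doc in docs]

def stemwords(docs):
    # Reduce every token to its Porter stem
    return [[stemmer.stem(w) for w in doc] for doc in docs]

def convertToVec(docs, dictionary):
    # Convert each document to a bag-of-words list of (term ID, term frequency) pairs
    return [dictionary.doc2bow(doc) for doc in docs]

def buildindex(docs):
    # The second argument is num_features, hard-coded to 110 here;
    # it should be at least the number of distinct terms in the dictionary
    return similarities.SparseMatrixSimilarity(docs, 110)

def createtdif(docs):
    # Fit a TF-IDF model on the bag-of-words vectors
    return models.TfidfModel(docs)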
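
To make the bag-of-words step concrete, here is a plain-Python sketch of what building a dictionary and converting documents to vectors amounts to conceptually. It is not gensim's implementation. The helpers `build_token_ids` and `to_bow` are hypothetical, and the IDs gensim assigns to tokens may differ from the ones produced here.

```python
from collections import Counter

def build_token_ids(docs):
    """Assign each distinct token an integer ID in order of first appearance."""
    token_to_id = {}
    for doc in docs:
        for token in doc:
            token_to_id.setdefault(token, len(token_to_id))
    return token_to_id

def to_bow(doc, token_to_id):
    """Return sorted (term ID, term frequency) pairs for one document."""
    counts = Counter(token_to_id[t] for t in doc if t in token_to_id)
    return sorted(counts.items())

# Two tiny, already lowercased and stemmed documents (made-up data)
toy_docs = [['polic', 'court', 'polic'], ['coach', 'game']]
toy_ids = build_token_ids(toy_docs)
toy_vecs = [to_bow(doc, toy_ids) for doc in toy_docs]

print(toy_ids)   # {'polic': 0, 'court': 1, 'coach': 2, 'game': 3}
print(toy_vecs)  # [[(0, 2), (1, 1)], [(2, 1), (3, 1)]]
```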

In [9]:
import sys
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
crime1="""The vice syndicate that lawyer Spencer Gwee Hak Theng was charged in relation to was operated by Seng Swee Meng, 42.
Together with his Vietnamese wife, Ngo Tien, 32, and two of her family members, they managed women from Vietnam who were flown into Singapore to work as prostitutes.
During a raid last August, police officers rounded up 30 Vietnamese hostesses and prostitutes working in various pubs in Joo Chiat and Geylang.
Half of them were managed by Seng and his wife. And four out of the 15 managed by the couple were virgins aged 16 and 17.
Seng faced four charges of harbouring prostitutes, four counts of receiving them at Changi Airport, four counts of living in part on earnings of prostitution, two counts of abetting to obtain commercial sex with minors and one count of managing a place where the sex workers were assigned.
Seng was sentenced to five years' jail on April 11. The next day, his sister-in-law, Ngo Ngoc, 27, was jailed for 18 months.
His wife is on the run after leaving Singapore for Vietnam last July. The whereabouts of her older brother, known only as Ba, are unknown.
Besides Gwee, 10 other men were also charged for having paid sex with an underage girl. Two of them have been convicted and sentenced to jail.
Taxi driver Chong Heng Kow, 52, who paid $100 twice for sex with a 16-year-old Vietnamese prostitute on July 3 last year, was jailed for three months on Monday.
Odd-job worker Tan Wah Eng, 60, was jailed for four months on June 15 for having sex with a then 17-year-old Vietnamese prostitute on June 19 last year.
Cases involving the rest of the men are at the pre-trial conference stage.
"""
crime2="""He was a deputy public prosecutor for seven years. Then, he set up his own law practice in the 1980s.
On Friday, the lawyer of almost 30 years found himself on the other side of the legal system.
Spencer Gwee Hak Theng, 59, was charged in court with having sex with an underage girl, who was 16 at the time of the offence.
He allegedly paid $300 for sexual services from a Vietnamese girl on July 19 last year at Four Chain View Hotel at 757, Geylang Road Lorong 39.
Gwee is out on $10,000 bail and his passport has been impounded. He will be back in court again on July 27.
If convicted, he can be jailed up to seven years or fined or both.
He is the last of a total of 11 men who have been charged in relation to a vice syndicate operated by a couple in Geylang.
The couple managed women from Vietnam who were flown into Singapore to work as prostitutes. (See report below.)
Gwee is the sole practitioner at his law firm, Spencer Gwee and Co., at Beach Road.
In an interview on Friday, he said that having to turn up in court to be charged was an embarrassing experience.
"It was an unpleasant experience you do not wish upon your greatest enemy. It gave me a few sleepless nights," he said.
Gwee said that he does not intend to plead guilty and will wait for advice from his lawyer, Mr Lawrence Ang Boon Kong.
He said that when he found out last week that he would be charged, it caught him by surprise.
"Last year, I was called up once or twice (by the police) and I gave a short statement. Then it fizzled out. I thought nothing would happen.
"Then suddenly, pong!" he said of finding out he was getting charged while gesturing that the news hit him hard.
He hired a lawyer, who asked for a deferment of the charge but was turned down.
Gwee is concerned that the charge will affect his business.
His passport has been impounded, but he said he needs it to travel to Hong Kong for business.
As he did not have the actual dates of his trip, he could not apply for permission from the court to travel overseas, he said.
In view of this, he said he would get a power of attorney to handle his business in Hong Kong.
He added: "I was a bit distressed because I have professional duties to discharge."
Gwee said that he wished he could have had "some time to do all these things so that it doesn't affect the discharge of my duties to my clients, which is my primary concern".
He added: "I want my clients to retain the prerogative to decide whether they want me to continue. I would never force them."
But amid his legal woes, the father of a 24-year-old daughter managed to find a silver lining.
Gwee had been living on his own since he and his wife got divorced in the mid-90s.
Despite their break-up, his ex-wife, a retired conveyancing lawyer who is 11 years younger than him, bailed him out on Friday.
He said: "We are still on pretty good terms with each other."
He hinted that disagreements and quarrels had led to their estrangement.
"When you are married, you are young, you quarrel over things... You may be a bit strong-headed, you just want your side of your argument so you clash," Gwee said.
"I didn't remarry because I think I had a very good wife. I couldn't find anybody as intelligent. She's a unique person."
When asked how his daughter is taking the news of his charge, he said: "The only thing that pains me from this whole episode is the amount of suffering and heartache I'm going to cause to people who matter to me.
"Sometimes you think about your daughter, you think about your wife. I just hope they can bear with the pain."
His defence lawyer, Mr Ang, was a former colleague whom he described as "very intelligent and very able".
Gwee said: "As you know, when you are in trouble, you always look for friends. I don't have many, but I think he is the character that's always eager to help."
They had lost touch, but "what a way to get in touch again", he remarked.
Gwee said that some lawyers slipped him "little notes" when he was in the Subordinate Courts on Friday.
He said: "All kinds of notes, all very supportive. You know you've got friends.
"It's fairly touching and you are in this happy state that despite your own difficulties, people are voicing support."
With a smile, he said: "Life goes on."
"""
sports1="""The "Big Brother" of Singapore football will be back, but not immediately, and not for long. In an exclusive interview with The New Paper, Persib Bandung striker Noh Alam Shah said he has agreed to sign a short-term deal with former club Tampines Rovers until the end of the season.
But the 31-year-old said: "Beyond that, I feel my future is still in Indonesia.
"I feel really appreciated here. Four Indo clubs already made me offers for the next season, which starts next January."
The move to Singapore still hinges on whether Tampines can secure his medical documents and International Transfer Certificate from the Indonesia FA before the transfer window closes today, although the Stags are optimistic.
If there are no surprises, Alam Shah will return to Singapore after July 11, after Persib play their final Indonesia Super League (ISL) match against champions Sriwijaya.
Said the striker: "Tampines have always been very close to my heart and I'm thankful to 'Boss' (Tampines chairman Teo Hock Seng).
"It will be great if we can win another title together.
"But I feel it is only right for me to finish the last three games with Persib because they have been very good to me.
"Then I have to go back to Malang for my personal belongings. I should be a back a few days after July 11."
This means Singapore fans will probably catch their first glimpse of the powerful forward on July 17, when Tampines visit Woodlands Wellington.
He will probably line-up alongside another new striker, Serb Sead Hadzibulic, with ageless stalwart Aleksandar Duric also in the mix.
Potentially, it is a fearsome combination, and Tampines coach Steven Tan quipped: "If we want to defend our title, more power better than no power."
In an eventful seven years with Tampines from 2003 to 2009, Alam Shah thrilled fans by thumping in more than 100 goals to help the eastern giants win two S-League titles and two Singapore Cups.
However, he also embroiled in a few controversial incidents because of his volatile temperament.
One of the gravest was his shocking attack on national teammate Daniel Bennett in the 2007 Singapore Cup final against SAFFC, who won 4-3.
Alam Shah, known for his straight-talking nature, had criticised the Beep Test before he left the S-League in 2009 to captain Arema Malang to ISL glory.
The mere mention of the compulsory fitness test yesterday - he has to pass it to be able to play for Tampines - was enough to make him bristle.
He said: "I still hold the same views.
"I will have to pass the Beep Test to play in the S-League again.
"If I can't pass, I can go back to Indonesia. But the sad thing is, I see so many good footballers who can't play because they fail this Beep Test."
The timing of his return coincides with this year's Suzuki Cup, which will be co-hosted by Malaysia and Thailand from Nov 14 to Dec 8.
Three-rime champions Singapore have targeted a place in the final and could do with a proven striker.
The former national skipper, who has 35 goals in 80 internationals, is the all-time top scorer in the regional competition with 17 goals.
Said Alam Shah: "Of course every player wants to play for his national team, but I will be 32 by then.
"If they call me up, it shows they have a lack of young strikers who can fill the gap.
"I've already thought of retiring from international football, but if I get a call-up, I will accept and share my experience with the younger players."
"""

sports2="""HE insists he is not biased.
Cristiano Ronaldo, according to Peter Schmeichel, has proven himself the best player in the world at Euro 2012.
Ronaldo has "stepped up" when it has mattered for Portugal and brought his team along with him, said Schmeichel, speaking at a press interview arranged by Carlsberg for Asian media in Poland.
"What about Leo Messi?" I challenged the former Manchester United great.
The difference is small but "Messi has limitations", said the Great Dane.
"Cristiano can play in any team and make them better," he said.
Unlike Messi, said Schmeichel, who needs to play alongside players of a high ability to be successful.
"When Messi passes the ball and moves, he needs a player who is intelligent enough to give him back the ball at another place," said the former goalkeeper.
Surely you are biased because Ronaldo used to play for the Red Devils, like you, I asked Schmeichel. "No, I don't think so," came his considered reply.
"I have exact knowledge of what Cristiano can do on the football pitch. I saw him do it for five years at United.
"But he also went on and did the same at Real Madrid and I'm not a fan of Real. I just think that he is a player like none other.
Gifted
"He is big and strong, and he is technically gifted too. Christ, he can defend too. There are no limitations to his play."
Schmeichel added that it was a pity that Ronaldo could not take his team into the final, but does not see it as a failing of the player.
"One team hit the post and the other did not, those are the margins between success and failure," he said.
On the ongoing tournament, the outspoken Dane also hit out at critics of Spain, arguing that they have played flawless "technical football".
"It's no surprise to me that they are in the final," he added.
"""
corpus = [crime1,crime2,sports1,sports2]
corpusdir = 'classificationcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)
    
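# Write each document to its own numbered file (1.txt, 2.txt, ...)
for filename, text in enumerate(corpus, start=1):
    with open(corpusdir+str(filename)+'.txt','w') as fout:
        print (text)  # echo the document being written
        fout.write(text)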

The vice syndicate that lawyer Spencer Gwee Hak Theng was charged in relation to was operated by Seng Swee Meng, 42.
Together with his Vietnamese wife, Ngo Tien, 32, and two of her family members, they managed women from Vietnam who were flown into Singapore to work as prostitutes.
During a raid last August, police officers rounded up 30 Vietnamese hostesses and prostitutes working in various pubs in Joo Chiat and Geylang.
Half of them were managed by Seng and his wife. And four out of the 15 managed by the couple were virgins aged 16 and 17.
Seng faced four charges of harbouring prostitutes, four counts of receiving them at Changi Airport, four counts of living in part on earnings of prostitution, two counts of abetting to obtain commercial sex with minors and one count of managing a place where the sex workers were assigned.
Seng was sentenced to five years' jail on April 11. The next day, his sister-in-law, Ngo Ngoc, 27, was jailed for 18 months.
His wife is on the run after leaving

In [14]:
newcorpus = PlaintextCorpusReader('/gpfs/global_fs01/sym_shared/YPProdSpark/user/sfbc-20c2d955c74628-3c618564d05f/notebook/work/classificationcorpus', '.*')
fids= (newcorpus.fileids())
docs=[newcorpus.words(f) for f in fids]

# Change words to lowercase
docs=tolower(docs)
# Remove stop words
docs=removestop(docs)
# Perform stemming
docs=stemwords(docs)

# Create the dictionary that maps each token to an integer ID
dictionary=fetchdictionary(docs)
token_to_id=dictionary.token2id
# Convert each document to a vector of (term ID, term frequency) pairs
print (type(docs))
vecs=convertToVec(docs,dictionary)
print (vecs)
# Build index for finding similarity
index=buildindex(vecs)

# Fit a TF-IDF model on the vectors
tdif=createtdif(vecs)
print (tdif)

<class 'list'>
[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1), (9, 4), (10, 1), (11, 4), (12, 1), (13, 1), (14, 1), (15, 21), (16, 1), (17, 1), (18, 4), (19, 2), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 2), (27, 1), (28, 1), (29, 1), (30, 2), (31, 1), (32, 1), (33, 4), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1), (51, 1), (52, 1), (53, 1), (54, 2), (55, 1), (56, 2), (57, 2), (58, 1), (59, 1), (60, 1), (61, 1), (62, 6), (63, 1), (64, 1), (65, 1), (66, 2), (67, 1), (68, 1), (69, 1), (70, 1), (71, 2), (72, 1), (73, 1), (74, 5), (75, 8), (76, 1), (77, 4), (78, 5), (79, 1), (80, 1), (81, 1), (82, 1), (83, 2), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 2), (90, 3), (91, 1), (92, 1), (93, 1), (94, 1), (95, 2), (96, 1), (97, 1), (98, 1), (99, 2), (100, 1), (101, 1), (102, 5), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 5), (1

In [33]:
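# Convert each TF vector to a dict that maps term ID -> 1 (binary feature)
all_data_as_dict=[{id:1 for (id,tf_value) in vec} for vec in vecs]
print (type(all_data_as_dict))
# The first two documents are the crime articles, the remaining two the sports articles
crime_data=[(d,'Crime') for d in all_data_as_dict[0:2]]
sports_data=[(d,'Sports') for d in all_data_as_dict[2:]]
all_labeled_data=crime_data+sports_data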

<class 'list'>


The code {id:1 for (id,tf_value) in vec} goes through every element in vec (each element is a term ID together with its TF value) and creates a mapping from id to 1. All these mappings together form a dict object. Note that we simply use 1 as the value of each feature in the dict object rather than the raw term frequency. This is because NLTK implements a version of Naive Bayes that does not consider the term frequency of words in documents: it treats each feature value as a discrete category rather than as a count.
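
To see the difference between binary features and raw term frequencies, the sketch below builds both kinds of dict from one made-up vector. The values in `vec` are illustrative and are not taken from the corpus.

```python
# One toy bag-of-words vector: (term ID, term frequency) pairs
vec = [(0, 1), (1, 2), (15, 21)]

# Binary features: every term that occurs maps to 1
binary_features = {term_id: 1 for (term_id, tf_value) in vec}

# Raw term frequencies kept as feature values, for comparison
raw_features = {term_id: tf_value for (term_id, tf_value) in vec}

print(binary_features)  # {0: 1, 1: 1, 15: 1}
print(raw_features)     # {0: 1, 1: 2, 15: 21}
```

With the raw values, a classifier that treats feature values as categories would see 21 as just another distinct value of feature 15, not as stronger evidence than a value of 2. The binary form avoids this.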

## Classification with Naive Bayes Classifier

In [38]:
from nltk.classify import NaiveBayesClassifier

# Train Naive Bayes classifier using training data
classifier=NaiveBayesClassifier.train(all_labeled_data)

# Classify one of the training documents (the second crime article)
test_doc=all_data_as_dict[1]
print (classifier.classify(test_doc))

# Show the accuracy of the classifier, measured on the training data itself
nltk.classify.accuracy(classifier,all_labeled_data)

Crime


1.0
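
The accuracy of 1.0 above is measured on the same four documents the classifier was trained on, so it says little about how the classifier handles unseen text. A minimal sketch of holding some documents out for testing follows. The helper `split_labeled_data`, its parameters, and the toy data are assumptions for illustration, and with only four documents any such estimate would be very rough.

```python
import random

def split_labeled_data(labeled_data, test_fraction=0.25, seed=0):
    """Shuffle a copy of the labeled data and split it into (train, test)."""
    shuffled = list(labeled_data)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy labeled data in the same (feature dict, label) shape as all_labeled_data
toy_labeled = [({1: 1}, 'Crime'), ({2: 1}, 'Crime'),
               ({3: 1}, 'Sports'), ({4: 1}, 'Sports')]

train_set, test_set = split_labeled_data(toy_labeled)
print(len(train_set), len(test_set))  # 3 1

# With NLTK available, the two calls used in the notebook would then become:
# classifier = NaiveBayesClassifier.train(train_set)
# nltk.classify.accuracy(classifier, test_set)
```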