#### Importing Libraries

In [None]:
import pandas as pd
import string
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

import random
random.seed(10)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


#### Loading the Dataset

In [None]:
df = pd.read_csv('all_documents.csv')
df.head()

Unnamed: 0,Sport,health,Politic
0,Claxton hunting first major medal\n\nBritish h...,Alabama Medicaid Agency (AMA) proposes to crea...,Labour plans maternity pay rise\n\nMaternity p...
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,Alabama's Medicaid Health Home SPA targets ind...,Watchdog probes e-mail deletions\n\nThe inform...
2,Greene sets sights on world title\n\nMaurice G...,The State of Alabama Medicaid Agency (AMA) is ...,Hewitt decries 'career sexism'\n\nPlans to ext...
3,IAAF launches fight against drugs\n\nThe IAAF ...,Alabama Medicaid Agency (AMA) proposes to crea...,Labour chooses Manchester\n\nThe Labour Party ...
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",Alabama's Medicaid Health Home SPA targets ind...,Brown ally rejects Budget spree\n\nChancellor ...


In [None]:
# checking the shape of the dataset

df.shape

(108, 3)

The dataset has 3 document categories (one per column), with 108 documents in each.

In [None]:
health = df.health.tolist()
health[:2]

['Alabama Medicaid Agency (AMA) proposes to create funding pools under the Demonstration that support the development, transition and maintenance of a coordinated care delivery system through the regional care organizations (RCOs), and to provide a mechanism for investments in delivery system reform. The funding pools will have three distinct components for which federal financial participation is requested: (1) funding for designated state health programs (DSHP), (2) transition payments to RCOs, hospitals, and other eligible providers to cover costs associated with transitioning to the RCO model, and (3) a delivery system reform incentive payment (DSRIP) program for RCOs, hospitals, and other eligible providers that will better align provider payment with the value of care.',
 'Alabama\'s Medicaid Health Home SPA targets individuals with a single behavioral health issue, two chronic conditions; or one chronic condition and the risk of developing another from the following list of cond

In [None]:
politics = df.Politic.tolist()
politics[0]

'Labour plans maternity pay rise\n\nMaternity pay for new mothers is to rise by £1,400 as part of new proposals announced by the Trade and Industry Secretary Patricia Hewitt.\n\nIt would mean paid leave would be increased to nine months by 2007, Ms Hewitt told GMTV\'s Sunday programme. Other plans include letting maternity pay be given to fathers and extending rights to parents of older children. The Tories dismissed the maternity pay plan as "desperate", while the Liberal Democrats said it was misdirected.\n\nMs Hewitt said: "We have already doubled the length of maternity pay, it was 13 weeks when we were elected, we have already taken it up to 26 weeks. "We are going to extend the pay to nine months by 2007 and the aim is to get it right up to the full 12 months by the end of the next Parliament." She said new mothers were already entitled to 12 months leave, but that many women could not take it as only six of those months were paid. "We have made a firm commitment. We will definit

In [None]:
sport = df.Sport.tolist()
sport[0]

'Claxton hunting first major medal\n\nBritish hurdler Sarah Claxton is confident she can win her first major medal at next month\'s European Indoor Championships in Madrid.\n\nThe 25-year-old has already smashed the British record over 60m hurdles twice this season, setting a new mark of 7.96 seconds to win the AAAs title. "I am quite confident," said Claxton. "But I take each race as it comes. "As long as I keep up my training but not do too much I think there is a chance of a medal." Claxton has won the national 60m hurdles title for the past three years but has struggled to translate her domestic success to the international stage. Now, the Scotland-born athlete owns the equal fifth-fastest time in the world this year. And at last week\'s Birmingham Grand Prix, Claxton left European medal favourite Russian Irina Shevchenko trailing in sixth spot.\n\nFor the first time, Claxton has only been preparing for a campaign over the hurdles - which could explain her leap in form. In previous

In [None]:
# combining all the documents into one list

combined_doc = health + sport + politics

# Shuffling the documents
random.shuffle(combined_doc)

combined_doc[:3]

['Nat Insurance to rise, say Tories\n\nNational Insurance will be raised if Labour wins the next election, Tory leader Michael Howard has claimed.\n\nTony Blair has said he does not want higher tax rates for top earners but on Wednesday said other tax promises would be left to Labour\'s manifesto. Prime minister\'s questions also saw Mr Blair predict that new plans would probably cut net immigration. He attacked Tory plans to process asylum claims abroad - but Mr Howard said Labour had proposed the idea too.\n\nThe Commons questions session again saw the leaders of the two biggest parties shape up for the forthcoming election campaign. The Tories have promised £4bn in tax cuts but have yet to say where they will fall. Mr Howard pointed to the Institute for Fiscal Studies\' predictions that Labour will need to increase taxes to cover an £11bn gap in its spending plans. He accused ministers of wasting money on unsuccessful attempts to curb bad behaviour and truancy in schools and on slow

In [None]:
# Removing punctuation, numbers and line breaks from the documents

cleaned_docs = []
for doc in combined_doc:
  # remove punctuation from each document
  remove_punc = re.sub(r'[^\w\s]', '', doc)
  # remove numbers from each document
  remove_numbers = re.sub(r'[0-9]+', '', remove_punc)
  # replace line breaks with spaces
  line_split = remove_numbers.replace('\n', ' ')
  cleaned_docs.append(line_split)
cleaned_docs[:2]

['Nat Insurance to rise say Tories  National Insurance will be raised if Labour wins the next election Tory leader Michael Howard has claimed  Tony Blair has said he does not want higher tax rates for top earners but on Wednesday said other tax promises would be left to Labours manifesto Prime ministers questions also saw Mr Blair predict that new plans would probably cut net immigration He attacked Tory plans to process asylum claims abroad  but Mr Howard said Labour had proposed the idea too  The Commons questions session again saw the leaders of the two biggest parties shape up for the forthcoming election campaign The Tories have promised bn in tax cuts but have yet to say where they will fall Mr Howard pointed to the Institute for Fiscal Studies predictions that Labour will need to increase taxes to cover an bn gap in its spending plans He accused ministers of wasting money on unsuccessful attempts to curb bad behaviour and truancy in schools and on slow asylum processing It was n
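To make the effect of the three cleaning steps easier to see, here is a minimal sketch that applies them to one short string. The helper name `clean_text` and the sample sentence are made up for illustration; the regular expressions are the ones used in the cell above. Note the side effects that are also visible in the real output: possessives are merged (`Labour's` becomes `Labours`), amounts lose their digits and currency sign (`£4bn` becomes `bn`), and each removed line break leaves a space behind.

```python
import re

def clean_text(doc):
    """Apply the same three steps as the cleaning loop to a single string."""
    no_punct = re.sub(r'[^\w\s]', '', doc)       # drop punctuation and symbols
    no_digits = re.sub(r'[0-9]+', '', no_punct)  # drop runs of digits
    return no_digits.replace('\n', ' ')          # line breaks become spaces

# Illustrative sample, not taken from the dataset
sample = "Labour's plan\n\nTories promise £4bn"
print(repr(clean_text(sample)))
```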

#### Stopwords.

In [None]:
stpwords = stopwords.words('english')
print(stpwords)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### Tokenizing, Stopword Removal and Stemming

In [None]:
ps = PorterStemmer()
filtered_docs = []
for doc in cleaned_docs:
  tokens = word_tokenize(doc)
  tmp = ''
  for word in tokens:
    if word not in stpwords:
      tmp += ps.stem(word) + ' '
  filtered_docs.append(tmp)

filtered_docs[:2]


['nat insur rise say tori nation insur rais labour win next elect tori leader michael howard claim toni blair said want higher tax rate top earner wednesday said tax promis would left labour manifesto prime minist question also saw mr blair predict new plan would probabl cut net immigr he attack tori plan process asylum claim abroad mr howard said labour propos idea the common question session saw leader two biggest parti shape forthcom elect campaign the tori promis bn tax cut yet say fall mr howard point institut fiscal studi predict labour need increas tax cover bn gap spend plan he accus minist wast money unsuccess attempt curb bad behaviour truanci school slow asylum process it good mr blair claim tax pledg left manifesto given one mp tuesday top rate incom tax argu mr howard point nation insur ad everyon know tax go labour isnt clear tax would mr blair instead hail labour achiev use strong economi invest public servic when money go extra teacher nurs equip school hospit money was
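The output above still contains `he`, `the` and `it`. The NLTK stopword list is all lowercase, so capitalized tokens such as `He` or `The` pass the `word not in stpwords` check and are only lowercased afterwards by the stemmer. The sketch below illustrates the difference with a tiny stand-in stopword set (an assumption for illustration, not the NLTK list) and no stemming; comparing `word.lower()` against the list is one possible way to catch these cases.

```python
# Tiny stand-in for stopwords.words('english'); illustrative only
stop = {"he", "the", "it", "to"}

tokens = "He attacked the plan to process claims".split()

# Case-sensitive check, as in the cell above: 'He' survives
case_sensitive = [w for w in tokens if w not in stop]

# Lowercasing before the check removes it as well
case_insensitive = [w for w in tokens if w.lower() not in stop]

print(case_sensitive)
print(case_insensitive)
```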

In [None]:
len(max(filtered_docs))

586
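One caveat about the cell above: `max` on a list of strings compares them lexicographically, so 586 is the length of the alphabetically last document, which is not necessarily the longest one. A small sketch with made-up strings shows the difference; passing `key=len` is the usual way to get the longest string.

```python
# Made-up documents for illustration
docs = ["zebra", "a much longer document", "mid size"]

lexicographic_last = max(docs)   # compares strings alphabetically
longest = max(docs, key=len)     # compares strings by length

print(len(lexicographic_last), len(longest))
```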

#### Vector Space

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vect = vectorizer.fit_transform(filtered_docs)
print(vect.todense())

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [None]:
vect.shape

(324, 6189)
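Each of the 324 rows is one document and each of the 6189 columns is one term. As a rough sketch of what the values mean, the code below computes TF-IDF by hand using the formula that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function name `tfidf` and the toy documents are assumptions for illustration, and the whitespace split is simpler than scikit-learn's tokenizer, so this is not a drop-in replacement.

```python
import math
from collections import Counter

def tfidf(docs):
    """Hand-rolled TF-IDF with smoothed idf and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows

# Toy stemmed documents, not taken from the dataset
vocab, rows = tfidf(["tax labour tax", "medal hurdl", "labour medal"])
print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

Terms that occur in fewer documents get a larger idf, so in the first toy document `tax` outweighs `labour` both because it appears twice and because it is rarer across the collection.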

#### Clustering

In [None]:
from sklearn.cluster import KMeans
K = 3
model = KMeans(n_clusters=K, random_state=42)
model.fit(vect)

print(model.labels_)
print(model.cluster_centers_)

[1 1 0 1 2 0 2 0 2 1 2 0 1 1 0 1 0 2 1 1 0 0 0 1 2 1 0 1 0 1 2 0 1 0 1 0 2
 1 2 2 2 1 2 1 1 1 0 1 0 0 0 1 0 2 1 2 0 2 0 1 2 0 2 0 0 1 2 1 2 0 0 0 0 1
 1 2 1 0 1 2 1 2 2 0 2 0 1 1 1 2 2 2 1 1 0 0 2 2 2 0 1 0 1 1 2 1 1 2 1 1 2
 2 2 1 1 0 0 1 0 2 2 1 2 0 0 0 2 1 0 2 1 1 0 1 2 1 0 1 1 1 0 0 2 0 0 0 1 0
 0 1 0 1 2 2 1 2 2 0 0 1 2 1 2 0 2 1 1 0 1 0 2 0 1 1 1 0 0 1 0 2 2 1 2 2 1
 2 1 1 2 0 0 0 2 1 2 1 1 1 1 2 1 0 1 2 0 2 2 1 0 2 0 1 0 0 1 0 0 0 2 2 0 1
 2 0 0 2 1 1 2 0 1 1 2 0 0 2 1 0 1 0 2 2 1 0 2 0 1 0 1 2 1 1 0 1 1 0 1 0 0
 1 2 2 0 0 0 2 1 1 0 1 1 2 0 1 2 0 1 1 1 2 1 2 1 1 0 1 2 0 1 2 1 2 1 0 2 2
 0 1 2 2 1 2 2 0 1 0 2 2 0 2 1 1 0 0 2 1 1 0 0 1 1 1 0 1]
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.00032966 ... 0.00051804 0.         0.        ]
 [0.01256634 0.00102528 0.         ... 0.         0.0005832  0.00138695]]
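The raw `cluster_centers_` array is hard to read, since each row holds one weight per vocabulary term. One way to interpret a cluster is to list the terms with the largest weights in its centre. The sketch below does this in plain Python on invented numbers; `top_terms`, the term list and the centre values are all illustrative assumptions. With the fitted objects from this notebook, the centres would come from `model.cluster_centers_` and the terms from `vectorizer.get_feature_names_out()` (available in recent scikit-learn versions).

```python
def top_terms(centers, terms, k=3):
    """Return the k highest-weighted terms for each cluster centre."""
    result = []
    for centre in centers:
        ranked = sorted(range(len(terms)), key=lambda i: centre[i], reverse=True)
        result.append([terms[i] for i in ranked[:k]])
    return result

# Invented vocabulary and centre weights, for illustration only
terms = ["elect", "labour", "medal", "medicaid", "race", "tax"]
centers = [
    [0.00, 0.01, 0.00, 0.40, 0.00, 0.02],
    [0.30, 0.35, 0.00, 0.00, 0.01, 0.25],
    [0.00, 0.00, 0.33, 0.00, 0.28, 0.00],
]

for cluster_id, words in enumerate(top_terms(centers, terms, k=2)):
    print(cluster_id, words)
```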


#### Function that cleans input texts.

In [None]:
def cleaner(texts):
  new_docs = []
  for doc in texts:
    # remove punctuation from each document
    remove_punc = re.sub(r'[^\w\s]', '', doc)
    # remove numbers from each document
    remove_numbers = re.sub(r'[0-9]+', '', remove_punc)
    # replace line breaks with spaces
    line_split = remove_numbers.replace('\n', ' ')
    new_docs.append(line_split)
  filtered_docs = []
  for doc in new_docs:
    tokens = word_tokenize(doc)
    tmp = ''
    for word in tokens:
      if word not in stpwords:
        tmp += ps.stem(word) + ' '
    filtered_docs.append(tmp)
  return filtered_docs


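Note:

- The vectorizer expects a list of documents (strings), not a single string.
- New texts must be passed through the already fitted `vectorizer` using `transform`. Fitting a new vectorizer on them would produce a different vocabulary size, and `model.predict` would then raise an error, because the model was trained on 6189 features (see `vect.shape` above).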


In [None]:
# prediction.
all_docs = health + sport + politics
docs = cleaner(all_docs)

# the first 108 docs are all about health, so they should mostly get the same prediction
vect = vectorizer.transform(docs)
prediction = model.predict(vect[:108])
print(prediction)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Only 2 of the 108 predictions are 1; the rest are 0. This means health is tagged with label 0.

Trying the same for the sport documents:

In [None]:
# Predicting the Sport documents

prediction = model.predict(vect[108:216])
print(prediction)

[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


Most sport documents are labelled 2; the last 14 fall into cluster 1 instead.

In [None]:
# Predicting the Politics documents

prediction = model.predict(vect[216:324])
print(prediction)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


Every politics document is assigned label 1. Hence politics is labelled as one.
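The manual reading of the three prediction arrays can also be done programmatically. The sketch below is a plain-Python illustration: it rebuilds the label lists from the counts read off the printed outputs above (106 zeros and 2 ones for health, 94 twos and 14 ones for sport, 108 ones for politics), picks the majority label per category, and counts how many documents agree with it. The variable names are hypothetical; in the notebook the lists would come directly from `model.predict`.

```python
from collections import Counter

# Label counts read off the printed predictions above
predictions = {
    "health":   [0] * 106 + [1] * 2,
    "sport":    [2] * 94 + [1] * 14,
    "politics": [1] * 108,
}

# Majority cluster label for each category
majority = {
    name: Counter(labels).most_common(1)[0][0]
    for name, labels in predictions.items()
}

# Number of documents that fall in their category's majority cluster
agree = sum(labels.count(majority[name]) for name, labels in predictions.items())
total = sum(len(labels) for labels in predictions.values())

print(majority)
print(agree, total, round(agree / total, 3))
```

Because these are the same documents the model was fitted on, the agreement figure describes how cleanly the clusters separate the categories rather than how the model would behave on unseen text.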

#### Conclusion

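- The clustering model performed well: its cluster assignments match the document category for about 95% of the documents (308 of 324).
- The vectorizer remains a limitation: its vocabulary is fixed when it is fitted, so words that did not occur in the training documents are ignored in new texts.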