# Text classification

## Problem description
We use the 20 newsgroups dataset from sklearn. The methods are the same as before:
the text is represented by simple word-count features such as tf or tf-idf, and the
classification method is still naïve Bayes. The dataset is a collection of messages
from 20 newsgroups, including comp.graphics, rec.sport.hockey, sci.space and
talk.religion.misc (course material).

## Implementation

We will use naive Bayes to estimate the probability that a new article belongs to each category.
The Naive Bayes text classification algorithm is a type of probabilistic model used in machine learning. According to the cited source, Harry R. Felson and Robert M. Maxwell designed the first text classification method, which classifies text documents by authorship or genre using zero or more words from the document (https://www.turing.com/).
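
To make the idea concrete, here is a minimal sketch of multinomial naive Bayes written by hand. It is an illustration only, not the code used in this notebook: the toy documents, the function names `train_multinomial_nb` and `predict`, and the use of raw word counts with add-one (Laplace) smoothing are all assumptions made for the example. The cells below rely on sklearn's MultinomialNB instead and feed it tf or tf-idf weights rather than raw counts.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenised documents."""
    vocab = sorted({word for doc in docs for word in doc})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    # log P(class): fraction of training documents in each class
    log_prior = {c: math.log(n / len(docs)) for c, n in docs_per_class.items()}

    # log P(word | class) with add-alpha smoothing over the vocabulary
    log_likelihood = {}
    for c in docs_per_class:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Toy corpus, made up for illustration (not taken from the newsgroups data)
docs = [
    ["puck", "goal", "ice"],
    ["goal", "team", "ice"],
    ["orbit", "rocket", "moon"],
    ["rocket", "launch", "orbit"],
]
labels = ["hockey", "hockey", "space", "space"]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict(["rocket", "moon"], log_prior, log_likelihood))  # space
print(predict(["goal", "ice"], log_prior, log_likelihood))     # hockey
```

The "naive" part is the sum inside `predict`: each word contributes independently to the score of a class, given that class.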


In [7]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
# Get the training dataset for the specified categories
categories = ['rec.sport.hockey', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
training_data = fetch_20newsgroups(subset='train', categories=categories)

In [9]:
# Create the tf-idf transformer
tfidf = TfidfVectorizer(use_idf=True)
# tfidf = TfidfVectorizer(use_idf=False)
training_tfidf = tfidf.fit_transform(training_data.data)
print(training_tfidf.shape)

# Train a Multinomial Naive Bayes classifier
classifier = MultinomialNB().fit(training_tfidf, training_data.target)


(2154, 35956)


In [10]:
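# Re-create the vectorizer with idf weighting switched off (normalised term frequency only).
# Note: this overwrites `tfidf` and `classifier` from the previous cell.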
tfidf = TfidfVectorizer(use_idf=False)
training_tfidf = tfidf.fit_transform(training_data.data)
print(training_tfidf.shape)

# Train a Multinomial Naive Bayes classifier
classifier = MultinomialNB().fit(training_tfidf, training_data.target)

(2154, 35956)
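
The two cells above print the same shape because `use_idf` changes only the weights, not the vocabulary. The following sketch shows this on a three-sentence toy corpus. The corpus and the variable names are made up for illustration. Only `TfidfVectorizer`, its `use_idf` argument and its `vocabulary_` attribute come from sklearn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration (not taken from the newsgroups data)
toy_docs = [
    "the puck hit the goal",
    "the rocket left the launch pad",
    "the team scored a goal",
]

tf_vectorizer = TfidfVectorizer(use_idf=False)
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
tf_only = tf_vectorizer.fit_transform(toy_docs)
tf_idf = tfidf_vectorizer.fit_transform(toy_docs)

# Same documents and same vocabulary, so the matrix shapes are identical
print(tf_only.shape, tf_idf.shape)

# Compare a word found in every document ("the") with a rarer one ("rocket")
the_col = tf_vectorizer.vocabulary_["the"]
rocket_col = tf_vectorizer.vocabulary_["rocket"]
print("tf only:", tf_only[1, the_col], tf_only[1, rocket_col])
print("tf-idf :", tf_idf[1, the_col], tf_idf[1, rocket_col])
```

With `use_idf=False` the weight of "the" in the second document is twice that of "rocket", because it occurs twice. With `use_idf=True` the gap narrows, since "the" appears in every document and therefore gets the lowest idf.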


In [11]:
from sklearn import metrics

testing_data = fetch_20newsgroups(subset='test', categories=categories)
testing_tfidf = tfidf.transform(testing_data.data)
predictions = classifier.predict(testing_tfidf)
# target_names must follow the dataset's own label order (alphabetical),
# not the order of the `categories` list, otherwise the rows are mislabelled
print(metrics.classification_report(testing_data.target, predictions, target_names=testing_data.target_names))


                    precision    recall  f1-score   support

     comp.graphics       0.96      0.88      0.92       389
  rec.sport.hockey       0.88      0.99      0.93       399
         sci.space       0.65      0.97      0.78       394
talk.religion.misc       1.00      0.12      0.22       251

          accuracy                           0.81      1433
         macro avg       0.87      0.74      0.71      1433
      weighted avg       0.86      0.81      0.76      1433
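
The last class row of the report combines a precision of 1.00 with a recall of 0.12. The sketch below shows how such a combination can arise: a class that is rarely predicted, but always correctly when it is, gets perfect precision and low recall. The helper `precision_recall` and the toy labels are hypothetical and are used only for illustration. The report above is produced by `metrics.classification_report`.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from parallel label lists."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    n_predicted = sum(1 for p in y_pred if p == label)
    n_actual = sum(1 for t in y_true if t == label)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_actual if n_actual else 0.0
    return precision, recall


# Toy labels (hypothetical): class "b" is predicted only once, and correctly
y_true = ["a", "a", "a", "a", "b", "b", "b", "b"]
y_pred = ["a", "a", "a", "a", "b", "a", "a", "a"]

print(precision_recall(y_true, y_pred, "b"))  # (1.0, 0.25)
print(precision_recall(y_true, y_pred, "a"))  # precision 4/7, recall 1.0
```

The missed "b" items are all absorbed by class "a", which lowers the precision of "a" while its recall stays at 1.0.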



In [5]:
# Indices of the test posts whose predicted label differs from the true label
errors = [i for i in range(len(predictions)) if predictions[i] != testing_data.target[i]]

# Print the first five misclassified posts as "true category --> predicted category"
for post_id in errors[:5]:
    print("------------------------------------------------------------------")
    print("%s --> %s\n" % (testing_data.target_names[testing_data.target[post_id]],
                           testing_data.target_names[predictions[post_id]]))
    print(testing_data.data[post_id])


------------------------------------------------------------------
comp.graphics --> sci.space

From: robert@slipknot.rain.com (Robert Reed)
Subject: Re: ACM SIGGRAPH (and ACM in general)
Reply-To: Robert Reed <robert@slipknot.rain.com>
Organization: Home Animation Ltd.
Lines: 50

In article <1993Apr29.023508.11556@koko.csustan.edu> rsc@altair.csustan.edu (Steve Cunningham) writes:
|
|And no, SIGGRAPH 93 has not skipped town -- we're preparing the best
|SIGGRAPH conference yet!

Speaking of SIGGRAPH, I just went through the ordeal of my annual registration
for SIGGRAPH and re-upping of membership in the ACM last night, and was I ever
grossed out!  The new prices for membership are almost highway robbery!

For example:

	SIGGRAPH basic fee went from $26 last year to $59 this year for the same
	thing, a 127% increase.  Those facile enough to arrange a trip to the
	annual conference could reduce this to $27 by selecting SIGGRAPH Lite,
	which means SIGGRAPH is charging an additional $32 (o

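## Discussion
The accuracy of the trained classifier on the test set is 0.81 (see the classification report above). The classifier evaluated there is the one from cell 10, trained with use_idf=False.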

* The use_idf argument of TfidfVectorizer switches idf weighting on or off: TfidfVectorizer(use_idf=True) produces tf-idf features, while TfidfVectorizer(use_idf=False) produces normalised term frequencies only.
* In information retrieval, tf–idf (also TF*IDF, TFIDF, TF–IDF, or Tf–idf), short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general (Wikipedia).


* Every test article we supplied is assigned a category by the MultinomialNB() classifier.

* Some articles are misclassified. The first few of these errors are printed in the output above, in the form true category --> predicted category.
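
The tf–idf definition quoted above can be worked through by hand. The sketch below uses the textbook form tf × log(N / df) on a toy corpus. The function name `tf_idf` and the corpus are made up for illustration. sklearn's TfidfVectorizer uses a smoothed idf and normalises each document vector by default, so its numbers differ from these even though the idea is the same.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy corpus, made up for illustration (not taken from the newsgroups data)
docs = [
    "the puck hit the goal".split(),
    "the rocket left the pad".split(),
    "the team scored a goal".split(),
]
weights = tf_idf(docs)

print(weights[1]["the"])     # 0.0: "the" occurs in every document
print(weights[1]["rocket"])  # log(3): "rocket" occurs in one document only
print(weights[0]["goal"])    # log(3/2): "goal" occurs in two documents
```

A word that appears in every document carries no weight at all under this definition, however often it is repeated, which is the "offset" the quoted definition describes.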