# SVM Model for Text Classification
##### Using an SVM model to classify text documents into subject categories

### Import the 20 Newsgroups data set from the scikit-learn library
* The data comprises posts from Usenet newsgroups (email-style messages with headers such as From and Subject)
* Each document belongs to one of 20 newsgroup categories
* We use a classification model to predict the category from the document text

In [10]:
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [11]:
twenty_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

#### View the first document in our data set

In [12]:
print(twenty_train.data[0]) 

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







#### View all the categories

In [13]:
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

#### The target is represented by numbers (each one is an index into target_names)

In [14]:
twenty_train.target

array([7, 4, 4, ..., 3, 1, 8])

#### Create a bag of words from our document list

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

#### View the word counts for the first document

In [16]:
print(X_train_counts[0])

  (0, 86580)	1
  (0, 128420)	1
  (0, 35983)	1
  (0, 35187)	1
  (0, 66098)	1
  (0, 114428)	1
  (0, 78955)	1
  (0, 94362)	1
  (0, 76722)	1
  (0, 57308)	1
  (0, 62221)	1
  (0, 128402)	2
  (0, 67156)	1
  (0, 123989)	1
  (0, 90252)	1
  (0, 63363)	1
  (0, 78784)	1
  (0, 96144)	1
  (0, 128026)	1
  (0, 109271)	1
  (0, 51730)	1
  (0, 86001)	1
  (0, 83256)	1
  (0, 113986)	1
  (0, 37565)	1
  :	:
  (0, 4605)	1
  (0, 76032)	1
  (0, 92081)	1
  (0, 40998)	1
  (0, 79666)	1
  (0, 89362)	3
  (0, 118983)	1
  (0, 90379)	1
  (0, 98949)	1
  (0, 64095)	1
  (0, 95162)	1
  (0, 87620)	1
  (0, 114731)	5
  (0, 68532)	3
  (0, 37780)	5
  (0, 123984)	1
  (0, 111322)	1
  (0, 114688)	1
  (0, 85354)	1
  (0, 124031)	2
  (0, 50527)	2
  (0, 118280)	2
  (0, 123162)	2
  (0, 75358)	2
  (0, 56979)	3
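
The column indices in the output above are positions in the vocabulary that CountVectorizer learned. As a minimal sketch on a made-up three-sentence corpus (`toy_docs` and the other variable names are illustrative assumptions, not part of the data set), the fitted `vocabulary_` dictionary can be inverted to see which word each column stands for:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "the car is red",
    "the car is fast and the car is loud",
    "hockey is fast",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each word to its column index; invert it for lookups
index_to_word = {idx: word for word, idx in toy_vect.vocabulary_.items()}

# Word counts for the second document, keyed by word instead of column index
row = toy_counts[1]
doc_counts = {index_to_word[j]: int(c) for j, c in zip(row.indices, row.data)}
print(doc_counts)
```

The same lookup should work on count_vect and X_train_counts, although the vocabulary there has over 130,000 entries.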


#### Get TF-IDF Weights using TfidfTransformer
TfidfTransformer is different from TfidfVectorizer:
* TfidfVectorizer takes a list of raw documents as input and produces a TF-IDF weighted bag of words
* TfidfTransformer takes an existing bag of words (word counts) and converts it into a TF-IDF weighted bag of words
* TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer
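
The relationship between the three classes can be checked on a small example. The sketch below uses a made-up corpus (`toy_docs` is an assumption for illustration) and default parameters for all three classes; with non-default settings the two routes would only match if the same options were passed to both.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "the car is red",
    "the car is fast and the car is loud",
    "hockey is fast",
]

# Route 1: build the word counts first, then apply TF-IDF weighting
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# Route 2: do both steps in a single object
one_step = TfidfVectorizer().fit_transform(toy_docs)

# Both routes should give the same weighted matrix
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```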

In [17]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)

#### View the TF-IDF weights for the first document

In [18]:
print(X_train_tfidf[0])

  (0, 56979)	0.0574701540749
  (0, 75358)	0.353835013497
  (0, 123162)	0.259709024574
  (0, 118280)	0.211868072083
  (0, 50527)	0.0546142865886
  (0, 124031)	0.107987951542
  (0, 85354)	0.0369697850882
  (0, 114688)	0.0621407098631
  (0, 111322)	0.019156718025
  (0, 123984)	0.0368542926346
  (0, 37780)	0.381338912595
  (0, 68532)	0.0732581234213
  (0, 114731)	0.144472755128
  (0, 87620)	0.0356718631408
  (0, 95162)	0.0344713840933
  (0, 64095)	0.0354209242713
  (0, 98949)	0.160686060554
  (0, 90379)	0.0199288599566
  (0, 118983)	0.0370859780506
  (0, 89362)	0.065211743063
  (0, 79666)	0.109364012524
  (0, 40998)	0.0780136819692
  (0, 92081)	0.0991327449391
  (0, 76032)	0.0192194630522
  (0, 4605)	0.0633260395248
  :	:
  (0, 37565)	0.0343176044248
  (0, 113986)	0.176917506749
  (0, 83256)	0.0884438249646
  (0, 86001)	0.0700041144584
  (0, 51730)	0.0971474405798
  (0, 109271)	0.108447248221
  (0, 128026)	0.0606220958898
  (0, 96144)	0.108269044907
  (0, 78784)	0.0633940918806
  (0, 63363

#### Create a Linear Support Vector Classifier
* penalty specifies whether to regularize with the L1 norm or the L2 norm
    * As with Lasso and Ridge, this chooses between penalizing the sum of absolute values (L1) or the sum of squares (L2) of the coefficients
* dual specifies whether to solve the primal or the dual optimization problem
    * A primal optimization problem (e.g. increase revenue) can have an equivalent dual problem (e.g. reduce costs). This is a gross oversimplification; a proper explanation needs a lot of math
    * In our example, the primal problem could be to maximize the distance between the decision boundary and the nearest points on either side of it. This has a corresponding dual optimization problem
    * scikit-learn recommends dual=False when there are more samples than features. Note that our data has more features (130,107) than samples (11,314), so that guideline does not strictly apply here; dual=False is still a valid setting with the L2 penalty and is what this notebook uses
* tol is the tolerance the algorithm uses as a stopping criterion when maximizing or minimizing an objective function
    * If the model is within the tolerance of the maximum or minimum, it is not refined further
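
To make the effect of penalty more concrete, here is a minimal sketch on synthetic data. The data comes from make_classification, and the values of C and random_state are assumptions chosen for illustration, not settings used in this notebook. The L1 penalty typically drives many coefficients to exactly zero, while the L2 penalty only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic binary problem: 50 features, only 5 of them informative
X, y = make_classification(
    n_samples=200, n_features=50, n_informative=5, random_state=0
)

# Same settings as the notebook's classifier, apart from a smaller C
# (stronger regularization) so that the difference is easier to see
svc_l2 = LinearSVC(penalty="l2", dual=False, tol=1e-3, C=0.05, random_state=0)
svc_l1 = LinearSVC(penalty="l1", dual=False, tol=1e-3, C=0.05, random_state=0)
svc_l2.fit(X, y)
svc_l1.fit(X, y)

# Count coefficients that are exactly zero under each penalty
zeros_l2 = int(np.sum(svc_l2.coef_ == 0))
zeros_l1 = int(np.sum(svc_l1.coef_ == 0))
print("zero coefficients with L2:", zeros_l2)
print("zero coefficients with L1:", zeros_l1)
```

Note that penalty="l1" is only supported together with dual=False in LinearSVC.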

In [19]:
from sklearn.svm import LinearSVC

clf_svc = LinearSVC(penalty="l2", dual=False, tol=1e-3)
clf_svc.fit(X_train_tfidf, twenty_train.target)

LinearSVC(C=1.0, class_weight=None, dual=False, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.001,
     verbose=0)

#### Alternatively, a scikit-learn Pipeline can be used
* A Pipeline is a sequence of transformation steps with an estimator as the final step
* The output of one transformation is passed as input to the next transformation
* The pipeline itself behaves like a model of the final estimator's type, so it exposes fit() and predict()
* When the pipeline's fit() method is called, the data is passed through each transformation step before the final estimator is fitted on the result


In [20]:
from sklearn.pipeline import Pipeline

clf_svc_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf',LinearSVC(penalty="l2", dual=False, tol=0.001))
])

In our example:
* We pass the document corpus and the labels to the pipeline classifier
* The CountVectorizer takes the corpus and creates a bag of words
* The TfidfTransformer takes the bag of words and produces a TF-IDF weighted bag
* The LinearSVC model is fitted on the TF-IDF weighted bag and the labels
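
The steps above can be sketched side by side with the manual approach. The tiny corpus, labels and variable names below are made up for illustration; the point is only that the pipeline chains the same fit and transform calls we would otherwise write by hand, so both routes should give the same predictions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: label 0 = cars, label 1 = hockey
toy_docs = [
    "the engine of the car is loud",
    "a fast car with a new engine",
    "the goalie made a save in the hockey game",
    "the hockey team won the game",
]
toy_labels = [0, 0, 1, 1]
new_docs = ["my car needs a new engine", "who won the hockey game"]

# Manual route: fit each step in turn, then reuse the fitted steps to predict
vect = CountVectorizer()
tfidf = TfidfTransformer()
svc = LinearSVC(penalty="l2", dual=False, tol=1e-3)
svc.fit(tfidf.fit_transform(vect.fit_transform(toy_docs)), toy_labels)
manual_pred = svc.predict(tfidf.transform(vect.transform(new_docs)))

# Pipeline route: the same three steps behind a single fit/predict interface
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", LinearSVC(penalty="l2", dual=False, tol=1e-3)),
])
pipe.fit(toy_docs, toy_labels)
pipe_pred = pipe.predict(new_docs)

print(manual_pred, pipe_pred)
```

A practical benefit of the pipeline is that new documents only ever go through transform, never fit_transform, which is an easy mistake to make in the manual route.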

In [21]:
clf_svc_pipeline.fit(twenty_train.data, twenty_train.target)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.001,
     verbose=0))])

#### Obtain the test data which we will use to make predictions

In [22]:
twenty_test = fetch_20newsgroups(subset='test', shuffle=True)

#### Make the predictions using our classifier

In [23]:
predicted = clf_svc_pipeline.predict(twenty_test.data)

#### Compute the accuracy of the model
Remember, there are 20 categories, so random guessing would give an accuracy of about 0.05 (1 in 20)
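
As a rough check of that baseline, the sketch below scores a uniform random guesser on synthetic labels. The labels are randomly generated stand-ins, not the newsgroup targets, and the sample size and seeds are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels for 20 roughly balanced classes
rng = np.random.default_rng(0)
n_classes = 20
y_true = rng.integers(0, n_classes, size=20000)

# DummyClassifier ignores the features, so a placeholder column is enough
X_dummy = np.zeros((len(y_true), 1))

# strategy="uniform" picks one of the known classes at random for each sample
guesser = DummyClassifier(strategy="uniform", random_state=0)
guesser.fit(X_dummy, y_true)
chance_acc = accuracy_score(y_true, guesser.predict(X_dummy))
print(chance_acc)
```

Any real model should score well above this value to be considered useful.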

In [24]:
from sklearn.metrics import accuracy_score

acc_svm = accuracy_score(twenty_test.target, predicted)

In [25]:
acc_svm

0.85315985130111527

#### How good is our model if we use the raw word counts without transforming them to TF-IDF weights?

In [29]:
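# Same pipeline as before, but without the TF-IDF step
# (note that this reassigns clf_svc_pipeline)
clf_svc_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LinearSVC(penalty="l2", dual=False, tol=0.001))
])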

In [30]:
clf_svc_pipeline.fit(twenty_train.data, twenty_train.target)
predicted = clf_svc_pipeline.predict(twenty_test.data)

acc_svm = accuracy_score(twenty_test.target, predicted)
acc_svm

0.79845990440785974