
**Aim:** To implement text classification using a Support Vector Machine (SVM).

Short notes: 

A **Support Vector Machine model** is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. In addition to performing linear classification, SVMs can efficiently perform non-linear classification by implicitly mapping their inputs into high-dimensional feature spaces (the kernel trick).

A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples.
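
To make the idea of a "gap that is as wide as possible" concrete, here is a minimal sketch on made-up 2-D points (the data, the variable names and the large `C` value are assumptions for illustration, not part of this experiment). It fits a linear SVM with scikit-learn and reads off the width of the gap, which for a linear SVM is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters in 2-D
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)

# New examples are categorized by the side of the hyperplane they fall on
new_points = np.array([[0.0, 0.0], [6.0, 6.0]])
predictions = clf.predict(new_points)
print("predictions:", predictions)
```

The support vectors printed here are the training points that lie on the edge of the gap; the other points could be removed without moving the boundary.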

**Concept of hyperplane:**

A hyperplane is the decision boundary that separates the two classes in an SVM. Data points falling on opposite sides of the hyperplane are assigned to different classes. The dimension of the hyperplane depends on the number of input features in the dataset.

In two dimensions, the separating hyperplane is a line. Similarly, in three dimensions, a two-dimensional plane divides the 3D space into two parts and thus acts as the hyperplane. In general, a space of *n* dimensions is separated into two parts by a hyperplane of *n* - 1 dimensions.
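
As a small illustration, a hyperplane can be written in the standard form w · x + b = 0, and the sign of w · x + b tells us on which side a point lies. The values of `w`, `b` and the points below are invented for the example, and `side` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical hyperplane in 3-D space: w . x + b = 0 (a 2-D plane)
w = np.array([1.0, -2.0, 0.5])
b = -1.0

def side(x):
    """Return +1 or -1 for the two sides of the hyperplane, 0 if x lies on it."""
    return int(np.sign(np.dot(w, x) + b))

points = np.array([[3.0, 0.0, 0.0],   # 3 - 1 = 2   -> positive side
                   [0.0, 2.0, 0.0],   # -4 - 1 = -5 -> negative side
                   [1.0, 0.0, 0.0]])  # 1 - 1 = 0   -> on the hyperplane
sides = [side(p) for p in points]
print(sides)
```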

**Text classification using SVM:**

Text classification is the automated process of assigning text to predefined categories. For example, we can classify emails as spam or non-spam, or news articles into categories such as Politics, Stock Market, and Sports. An SVM operates on vectors, whatever kind of data those vectors encode. This means that, in order to use an SVM for text classification, texts first have to be transformed into vectors.

Vectors are (sometimes huge) lists of numbers which represent a set of coordinates in some space. When an SVM determines the decision boundary described above, it decides where to draw the best “line” (or the best hyperplane) that divides the space into two subspaces: one for the vectors which belong to the given category and one for the vectors which do not belong to it. So, provided we can find vector representations that encode as much information from our texts as possible, we can apply the SVM algorithm to text classification problems and expect good results.
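
One simple way to turn texts into vectors is to count how often each vocabulary word occurs in each text. The sketch below uses scikit-learn's `CountVectorizer` on three made-up sentences; it only illustrates the idea and is not the representation used later in this experiment (which is TF-IDF).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus
docs = ["free prize inside", "meeting at noon", "free free prize"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Vocabulary words ordered by their column index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
vectors = X.toarray().tolist()

print(vocab)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Each text becomes one row with as many coordinates as there are distinct words, which is why these vectors can get very long on a real corpus.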

**Algorithm:** 

1. Add the Required Libraries
2. Set random seed [This makes the results reproducible: as long as the script is unchanged, every run produces the same result, whereas without a fixed seed each run would produce different results. The seed can be set to any number.]
3. Add the Corpus [The data set can be loaded as a pandas DataFrame with the ‘read_csv’ function. I have set the encoding to ‘latin-1’ as the text had many special characters.]
4. Data pre-processing [This basically involves transforming raw data into an understandable format for NLP models.]
  
  (a) Remove Blank rows in Data, if any

  (b) Change all the text to lower case

  (c) Word Tokenization

  (d) Remove Stop words

  (e) Remove Non-alpha text

  (f) Word Lemmatization

5. Prepare Train and Test Data sets [This can be done with the *train_test_split* function from the sklearn library. The Training Data will have 70% of the corpus and the Test Data the remaining 30%, as we have set the parameter test_size=0.3.]
6. Encoding [Label-encode the target variable. This transforms categorical data of string type in the data set into numerical values which the model can understand.]
7. Word Vectorization [This is the general process of turning a collection of text documents into numerical feature vectors. The most popular method is TF-IDF (“Term Frequency - Inverse Document Frequency”), named after the two components of the score assigned to each word.]

  (a) Term Frequency: This summarizes how often a given word appears within a document.

  (b) Inverse Document Frequency: This downscales words that appear in many documents.

8. Use the SVM to Predict the outcome and display the accuracy of prediction.
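
The steps above can be sketched end to end on a tiny invented corpus. This is only an illustration under simplifying assumptions: the six sentences and the `pos`/`neg` labels are made up, and stop-word removal and lemmatization are skipped because `TfidfVectorizer` already lower-cases and tokenizes the text.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus (not the Amazon review data)
train_texts = [
    "great product love it",
    "love this great phone",
    "great quality love it",
    "terrible product hate it",
    "hate this terrible phone",
    "terrible quality hate it",
]
train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

# Vectorize with TF-IDF, then classify with a linear-kernel SVM
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1.0))
model.fit(train_texts, train_labels)

# Classify texts that were not part of the training data
new_texts = ["love great", "hate terrible"]
predicted = list(model.predict(new_texts))
print(predicted)
```

Wrapping the vectorizer and the classifier in one pipeline means the same vocabulary and weights are applied to training and new texts automatically.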



**Data set**: Amazon Review Data set, which has 10,000 rows of text data classified into “Label 1” and “Label 2”. The data set has two columns, “text” and “label”. The data can be found at https://github.com/Gunjitbedi/Text-Classification/blob/master/corpus.csv.

In [None]:
import pandas as pd
import numpy as np
import nltk
from collections import defaultdict
from nltk import pos_tag
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn import model_selection, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# Download the NLTK resources used for tokenization, POS tagging, lemmatization and stop words
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\user\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
np.random.seed(500)
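
As a quick check of what the seed does, the sketch below (with a made-up list standing in for the corpus) splits the same data twice after resetting the seed. It relies on `train_test_split` drawing from NumPy's global random state when no `random_state` is passed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = list(range(10))  # hypothetical stand-in for the corpus rows

np.random.seed(500)
_, test_a = train_test_split(data, test_size=0.3)

np.random.seed(500)
_, test_b = train_test_split(data, test_size=0.3)

# Same seed, same split
same_split = (test_a == test_b)
print(same_split)
```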

In [None]:
Corpus = pd.read_csv(r"corpus.csv",encoding='latin-1')

In [None]:
from nltk.corpus import stopwords

In [None]:
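# Step - a : Remove blank rows if any.
# Call dropna on the DataFrame (not on the column) so the rows are actually removed,
# then reset the index so it stays 0..n-1 for the later steps
Corpus = Corpus.dropna(subset=['text']).reset_index(drop=True)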

In [None]:
# Step - b : Change all the text to lower case. This is required as python interprets 'dog' and 'DOG' differently
Corpus['text'] = [entry.lower() for entry in Corpus['text']]

In [None]:
# Step - c : Tokenization : Each entry in the corpus is broken into a list of words
Corpus['text']= [word_tokenize(entry) for entry in Corpus['text']]

In [None]:
# Step - d : Remove stop words and non-alphabetic tokens, then perform word lemmatization.
# WordNetLemmatizer needs POS tags to know whether a word is a noun, verb, adjective etc. By default it is set to noun
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
# Build the stop-word set and the lemmatizer once instead of on every iteration
stop_words = set(stopwords.words('english'))
word_Lemmatized = WordNetLemmatizer()
text_final = []
for entry in Corpus['text']:
    # Empty list to store the words of this entry that pass the checks below
    Final_words = []
    # pos_tag provides the 'tag', i.e. whether the word is a noun (N), a verb (V) or something else
    for word, tag in pos_tag(entry):
        # Keep only words that are not stop words and consist of letters only
        if word not in stop_words and word.isalpha():
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # Collect the processed words of this entry as a string
    text_final.append(str(Final_words))
# Store the processed text of every entry in the 'text_final' column
Corpus['text_final'] = text_final

In [None]:
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'],Corpus['label'],test_size=0.3)

In [None]:
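Encoder = LabelEncoder()
# Learn the label-to-integer mapping from the training labels only
Train_Y = Encoder.fit_transform(Train_Y)
# Reuse the same mapping for the test labels (transform, not fit_transform)
Test_Y = Encoder.transform(Test_Y)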
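
A small sketch of how `LabelEncoder` maps string labels to integers, and of how the mapping can change if an encoder is fitted again on a different set of labels. The label strings below are placeholders, not the actual values in the corpus.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels: the test set happens to contain only one class
train_labels = ["label_1", "label_2", "label_1"]
test_labels = ["label_2", "label_2"]

enc = LabelEncoder()
train_encoded = list(enc.fit_transform(train_labels))  # mapping learned here
test_encoded = list(enc.transform(test_labels))        # same mapping reused

# Fitting a fresh encoder on the test labels alone gives a different mapping
refit_encoded = list(LabelEncoder().fit_transform(test_labels))

print(train_encoded, test_encoded, refit_encoded)
```

With the shared mapping `label_2` is encoded as 1 in both sets, while the encoder fitted only on the test labels encodes it as 0.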

In [None]:
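# Keep at most the 5000 most frequent terms as features
Tfidf_vect = TfidfVectorizer(max_features=5000)
# Learn the vocabulary and IDF weights from the whole corpus
# (fitting on Train_X only would keep test-set information out of the features)
Tfidf_vect.fit(Corpus['text_final'])
# Turn the train and test texts into sparse TF-IDF matrices
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)

# Mapping from each term to its column index in the TF-IDF matrices
print(Tfidf_vect.vocabulary_)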
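
To see where the TF-IDF scores come from, the sketch below recomputes them by hand for three invented documents and compares the result with `TfidfVectorizer`. It assumes the vectorizer's default settings, namely a smoothed IDF of ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row; other settings would need a different formula.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus
docs = ["good phone", "good good battery", "bad battery"]

# Term frequency: raw counts of each word in each document
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)

# Inverse document frequency (smoothed variant, assumed default)
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)            # number of documents containing each word
idf = np.log((1 + n_docs) / (1 + df)) + 1

# TF-IDF = tf * idf, then scale every document vector to unit length
tfidf_manual = counts * idf
tfidf_manual /= np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

tfidf_sklearn = TfidfVectorizer().fit_transform(docs).toarray()
matches = bool(np.allclose(tfidf_manual, tfidf_sklearn))
print(matches)
```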



In [None]:
print(Train_X_Tfidf)

  (0, 4505)	0.37634188677099956
  (0, 4504)	0.1502086671688917
  (0, 3977)	0.35870975205557054
  (0, 3893)	0.25152943577361386
  (0, 3863)	0.2690840463105974
  (0, 3749)	0.3469774999759746
  (0, 3659)	0.28971770688512954
  (0, 3562)	0.29440491517773787
  (0, 2931)	0.22969709983777647
  (0, 1942)	0.13398240399394393
  (0, 1532)	0.17762585383071805
  (0, 519)	0.3210759641783664
  (0, 490)	0.1230432680090133
  (0, 240)	0.24487094004433968
  (1, 4694)	0.36974013511943044
  (1, 4073)	0.6167222431544791
  (1, 3434)	0.367922932130556
  (1, 2581)	0.3755181501193181
  (1, 1247)	0.3587203442870721
  (1, 592)	0.27907786873623097
  (2, 4740)	0.18369761701331289
  (2, 4627)	0.1499525624356807
  (2, 4465)	0.10285168719742008
  (2, 4206)	0.11682860303877614
  (2, 3855)	0.23042292688427712
  :	:
  (6998, 2508)	0.11515961575144278
  (6998, 2128)	0.13654425766350872
  (6998, 1977)	0.07125207506420554
  (6998, 1785)	0.22020146972839663
  (6998, 1749)	0.19941178219513117
  (6998, 1713)	0.13513147046115875

In [None]:
# Classifier - Algorithm - SVM
# Fit the classifier on the training data set
# (degree and gamma are ignored by the linear kernel)
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)
# Predict the labels of the test data set
predictions_SVM = SVM.predict(Test_X_Tfidf)
# accuracy_score expects (y_true, y_pred); multiply by 100 to get a percentage
print("SVM Accuracy Score -> ",accuracy_score(Test_Y, predictions_SVM)*100)

SVM Accuracy Score ->  84.6
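
For reference, the accuracy score is simply the percentage of test examples whose predicted label equals the true label. The sketch below recomputes it by hand on made-up label arrays (not the results of this experiment) and also prints a confusion matrix, which shows which class the errors come from.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy by hand: fraction of matching labels, as a percentage
manual_accuracy = float(np.mean(y_true == y_pred) * 100)
sklearn_accuracy = float(accuracy_score(y_true, y_pred) * 100)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred).tolist()

print(manual_accuracy, sklearn_accuracy)
print(cm)
```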
