# Text Classification

### In this notebook we perform text classification experiments using Random Forest, SVM and a Stochastic Gradient Descent classifier.

The input dataset contains over 100,000 documents, and the labels consist of Technology, Business,
Health and Entertainment. In these experiments we first train our classifiers on the TF-IDF vectors and evaluate them.
Then we apply SVD to reduce the dimensionality of the vectors and evaluate the classifiers again on the reduced set.
Before each evaluation, we use grid search to find the best hyper-parameters of each model. Furthermore, since our
dataset is large, during vectorization we use the hashing trick to compute the term frequencies and then
use the TF-IDF transformer to calculate the TF-IDF values.
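
To make the TF-IDF step concrete, the sketch below computes smoothed IDF weights from raw term counts in plain Python. It assumes the formula scikit-learn documents as the `TfidfTransformer` default (`smooth_idf=True`), idf(t) = ln((1 + n) / (1 + df(t))) + 1, and leaves out the final L2 normalisation. The toy count matrix and the helper name `smoothed_idf` are illustrative only.

```python
import math

def smoothed_idf(counts):
    """Return one IDF weight per column of a document-term count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    idf = []
    for t in range(n_terms):
        # document frequency: number of documents in which term t occurs
        df = sum(1 for row in counts if row[t] != 0)
        idf.append(math.log((1 + n_docs) / (1 + df)) + 1)
    return idf

# three toy documents, three terms (made-up counts)
counts = [[2, 0, 1],
          [1, 1, 0],
          [3, 0, 0]]

idf = smoothed_idf(counts)
# weight every count by the IDF of its term
tfidf = [[c * w for c, w in zip(row, idf)] for row in counts]
print(idf)
```

A term that occurs in every document (the first column here) gets the minimum weight of 1, while rarer terms are scaled up.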

To sum up, the whole procedure consists of:  
    1. Loading Dataset.
    2. Text pre-processing.
    3. TF-IDF vectorization using the hashing trick.
    4. Grid Search for optimal hyper-parameter detection.
    5. Evaluation of the classifiers with the TF-IDF vectors using 5-Fold Cross Validation. 
    6. Dimension reduction using SVD.
    7. Grid search again for the new set.
    8. Evaluation with the new set.
---
### Imports

In [1]:
import pandas as pd
import re
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn import preprocessing, svm, metrics
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.linear_model import SGDClassifier
import time

### Method to implement K-Fold Cross Validation

In [2]:
def evaluation(clf, clf_name, X, y, k=5):
    starting_tm = time.time()
    clf_precision = 0
    clf_recall = 0
    clf_f1 = 0
    clf_accuracy = 0
    
    skf = StratifiedKFold(n_splits=k)
    for train_index, test_index in skf.split(X, y):
        
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        
        clf_precision += metrics.precision_score(y_test, predictions, average='micro')
        clf_recall += metrics.recall_score(y_test, predictions, average='micro')
        clf_f1 += metrics.f1_score(y_test, predictions, average='micro')
        clf_accuracy += metrics.accuracy_score(y_test, predictions)
    
    # compute the average of each metric over the k folds
    precision_score = clf_precision/k
    recall_score = clf_recall/k
    f1_score = clf_f1/k
    accuracy_score = clf_accuracy/k
    
    print(clf_name + "\nPrecision: " + str(precision_score)
          + "\nRecall: " + str(recall_score)
          + "\nF1-Measure: " + str(f1_score) 
          + "\nAccuracy: " + str(accuracy_score)
          + "\nExecution time: " + str(time.time() - starting_tm))
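
The four numbers reported by `evaluation` are expected to coincide. In single-label multiclass classification every wrong prediction counts as one false positive (for the predicted class) and one false negative (for the true class), so micro-averaged precision, recall and F1 all reduce to accuracy. A minimal plain-Python sketch with made-up labels, where `micro_scores` is a hypothetical helper:

```python
def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, F1 and accuracy for single-label data."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    # pool true positives, false positives and false negatives over all classes
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1
            elif p == label and t != label:
                fp += 1
            elif t == label and p != label:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return precision, recall, f1, accuracy

y_true = [0, 1, 2, 3, 0, 1, 2, 3]
y_pred = [0, 1, 2, 3, 1, 1, 0, 3]  # two of the eight predictions are wrong
precision, recall, f1, accuracy = micro_scores(y_true, y_pred)
print(precision, recall, f1, accuracy)
```

Passing `average='macro'` to the scikit-learn metric functions would instead average the per-class scores, which can differ from accuracy.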

### Grid Search for hyper-parameter tuning
Given an input dataset, find the optimal hyper-parameters using grid search.

In [3]:
def grid_evaluattion(clf, msg, X, y, tuned_parameters, scores):
    print(msg)
    
    # Split the dataset in two equal parts
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)    
    for score in scores:
        print("# Tuning hyper-parameters for %s\n" % score)
    
        # use a separate name so the base estimator is not overwritten between scores
        grid = GridSearchCV(clf, tuned_parameters, scoring='%s_micro' % score)
        grid.fit(X_train, y_train)
    
        print("Best parameters set found on development set: ")
        print(grid.best_params_)
        print("\nGrid scores on development set:")
        means = grid.cv_results_['mean_test_score']
        stds = grid.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, grid.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                  % (mean, std * 2, params))
        print()
    
        print("Detailed classification report:\n")
        print("The model is trained on the full development set.")
        print("The scores are computed on the full evaluation set.\n")
        
        y_true, y_pred = y_test, grid.predict(X_test)
        print(classification_report(y_true, y_pred))
        print() 

# test all the classifiers
def hyper_parameter_tuning(train_set, labels):
    starting_tm = time.time()
    rf = RandomForestClassifier(n_jobs=12)
    msg = "Using Random Forest Classifier"
    tuned_parameters = [{'n_estimators': [i*10 for i in range(1, 21)]}]
    scores = ['recall']
    grid_evaluattion(rf, msg, train_set, labels, tuned_parameters, scores)
    
    svm_clf = svm.SVC()
    msg = "Using SVM Classifier"
    tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1, 1e-1, 1e-2,], 'C': [10, 100, 1000]}]
    scores = ['recall']
    grid_evaluattion(svm_clf, msg, train_set, labels, tuned_parameters, scores)
    
    sgd_clf = SGDClassifier(n_jobs=12)
    msg = "Using Stochastic Gradient Descent"
    tuned_parameters = [{'loss':['hinge', 'modified_huber', 'log', 'squared_hinge'], 'max_iter':[3000]}]
    scores = ['recall']
    grid_evaluattion(sgd_clf, msg, train_set, labels, tuned_parameters, scores)
    print("Execution time: " + str(time.time() - starting_tm))
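
Grid search is exhaustive: every combination of the listed values is fitted and cross-validated. The sketch below expands the SVM grid from the cell above with `itertools.product` to show how many candidates that is. `expand_grid` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
from itertools import product

def expand_grid(grid):
    """Return every parameter combination of a {name: [values]} grid."""
    keys = sorted(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]

# the SVM grid used in hyper_parameter_tuning
svm_grid = {'kernel': ['rbf'], 'gamma': [1, 1e-1, 1e-2], 'C': [10, 100, 1000]}

candidates = expand_grid(svm_grid)
print(len(candidates))  # 1 kernel * 3 gammas * 3 values of C = 9
```

Each candidate is fitted once per cross-validation fold, so the cost grows quickly with the size of the grid and of the data.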

### Loading Dataset and encoding its labels.

In [4]:
train = pd.read_csv("files/data/train.csv")
X = (train['Title']+ " ")*5 + train['Content']

le = preprocessing.LabelEncoder()
y = le.fit_transform(train['Label'])

print(train.shape)
train.head()

(111795, 4)


| | Id | Title | Content | Label |
|---|---|---|---|---|
| 0 | 227464 | Netflix is coming to cable boxes, and Amazon i... | if you subscribe to one of three rinky-dink (... | Entertainment |
| 1 | 244074 | Pharrell, Iranian President React to Tehran 'H... | pharrell, iranian president react to tehran '... | Entertainment |
| 2 | 60707 | Wildlife service seeks comments | the u.s. fish and wildlife service has reopen... | Technology |
| 3 | 27883 | Facebook teams up with Storyful to launch 'FB ... | the very nature of social media means it is o... | Technology |
| 4 | 169596 | Caesars plans US$880 mln New York casino | caesars plans us$880 mln new york casino jul ... | Business |
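
Two things happen in the loading cell. Repeating the title five times before appending the content makes title terms count more in the term frequencies, and `LabelEncoder` replaces each label string with its index in the sorted list of class names. The plain-Python sketch below mimics both steps on a made-up row; it is an illustration of the idea, not the pandas or scikit-learn code path.

```python
# made-up example row (not taken from the dataset)
title = "Stocks rally"
content = "markets closed higher today"

# same expression as in the notebook, applied to plain strings
text = (title + " ") * 5 + content
print(text.count("Stocks"))  # 5

# label encoding: integer index into the sorted class names
labels = ['Entertainment', 'Technology', 'Business', 'Health', 'Technology']
classes = sorted(set(labels))
encoded = [classes.index(label) for label in labels]
print(classes)   # ['Business', 'Entertainment', 'Health', 'Technology']
print(encoded)   # [1, 3, 0, 2, 3]
```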


### Pre-processing using Stem Tokenizer
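For each document, remove all non-alphabetic characters, convert the text to lower case, drop tokens shorter than
three characters and stem the rest. The stop words are stemmed as well and are passed to the vectorizer, which removes them.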

In [5]:
ps = PorterStemmer()
my_stopwords = set([ps.stem(w) for w in 
                  ENGLISH_STOP_WORDS
                  .union(stopwords.words('english'))
                  .union(['include', 'way', 'work', 'look', 'add', 'time', 'year', 'month', 'day', 'help', 'think', 'tell', 'new', 'said', 'say','need', 'come', 'good', 'set', 'want', 'people', 'use', 'day', 'week', 'know'])])

def stem_tokenizer(doc):
    clean_document = re.sub('[^a-zA-Z]+', ' ', doc.lower())
    tokens = [ps.stem(token) for token in word_tokenize(clean_document) if len(token) > 2]
    return tokens
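
The cleaning part of `stem_tokenizer` can be tried without NLTK. The sketch below keeps the same regular expression and length filter but replaces `word_tokenize` with `str.split` and leaves stemming out, so its output only approximates what the real tokenizer returns. `simple_tokenizer` is a hypothetical name.

```python
import re

def simple_tokenizer(doc):
    # replace every run of non-letters with a single space, after lower-casing
    clean_document = re.sub('[^a-zA-Z]+', ' ', doc.lower())
    # keep only tokens longer than two characters (no stemming in this sketch)
    return [token for token in clean_document.split() if len(token) > 2]

tokens = simple_tokenizer("Caesars plans US$880 mln New York casino")
print(tokens)  # ['caesars', 'plans', 'mln', 'new', 'york', 'casino']
```

The digits and the dollar sign disappear, and the two-letter remainder `us` is dropped by the length filter.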

### TF-IDF Vectorization
Since the corpus is large, we apply the hashing trick using the HashingVectorizer, which avoids building a vocabulary.
Afterwards, we calculate the TF-IDF values.

In [6]:
# pass the stop words as a list, which is accepted across scikit-learn versions
vectorizer = HashingVectorizer(tokenizer=stem_tokenizer, stop_words=list(my_stopwords), n_features=100000)

starting_tm = time.time()
vectors = vectorizer.fit_transform(X)
vectors = TfidfTransformer().fit_transform(vectors)
print("Vectorization time: " + str((time.time() - starting_tm)))

vectors.shape


Vectorization time: 668.3160946369171


(111795, 100000)
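
The hashing trick replaces the vocabulary lookup with a hash function: each token is hashed and the result, modulo `n_features`, is its column index, so no dictionary has to be built or kept in memory. The sketch below illustrates the idea with `hashlib.md5` so that the result is reproducible. `HashingVectorizer` itself uses a different hash function (MurmurHash3) and, by default, an alternating sign, both of which are left out here; `hashed_counts` is a hypothetical helper.

```python
import hashlib

def hashed_counts(tokens, n_features=16):
    """Count tokens into a fixed-size vector indexed by hash(token) % n_features."""
    vector = [0] * n_features
    for token in tokens:
        digest = hashlib.md5(token.encode('utf-8')).hexdigest()
        vector[int(digest, 16) % n_features] += 1
    return vector

doc = ['casino', 'plan', 'casino', 'york']
vector = hashed_counts(doc)
print(len(vector), sum(vector))  # 16 4
```

Different tokens can land in the same column (a collision); a larger `n_features`, such as the 100000 used in the notebook, makes that less likely.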

### Classifier Evaluation with BoW
We examine the performance of the classifiers using the TF-IDF vectors as input. To find the optimal
hyper-parameters, we apply grid search. However, since the dataset is large, we run it on a subset of the vectors only.

In [9]:
hyper_parameter_tuning(vectors[:20000], y[:20000])

Using Random Forest Classifier
# Tuning hyper-parameters for recall

Best parameters set found on development set: 
{'n_estimators': 140}

Grid scores on development set:
0.826 (+/-0.012) for {'n_estimators': 10}
0.860 (+/-0.010) for {'n_estimators': 20}
0.869 (+/-0.015) for {'n_estimators': 30}
0.879 (+/-0.017) for {'n_estimators': 40}
0.880 (+/-0.011) for {'n_estimators': 50}
0.884 (+/-0.015) for {'n_estimators': 60}
0.890 (+/-0.008) for {'n_estimators': 70}
0.887 (+/-0.012) for {'n_estimators': 80}
0.889 (+/-0.015) for {'n_estimators': 90}
0.890 (+/-0.004) for {'n_estimators': 100}
0.891 (+/-0.006) for {'n_estimators': 110}
0.893 (+/-0.009) for {'n_estimators': 120}
0.892 (+/-0.008) for {'n_estimators': 130}
0.894 (+/-0.004) for {'n_estimators': 140}
0.893 (+/-0.010) for {'n_estimators': 150}
0.892 (+/-0.008) for {'n_estimators': 160}
0.893 (+/-0.012) for {'n_estimators': 170}
0.893 (+/-0.010) for {'n_estimators': 180}
0.894 (+/-0.012) for {'n_estimators': 190}
0.894 (+/-0.011) for 



In [7]:
print("Classifiers Performance using " + str(vectors.shape[0]) + " documents.\n")

rf = RandomForestClassifier(n_estimators=140, n_jobs=12)
svm_clf =svm.SVC(gamma=0.1, C=10, kernel='rbf')
sgd_clf = SGDClassifier(loss='hinge', max_iter=3000, n_jobs=12)

clfs = [(rf, "Random Forest Classifier"), (sgd_clf, "Stochastic Gradient Descent"), (svm_clf, "SVM Classifier")]
for clf, clf_name in clfs:
    evaluation(clf, clf_name, vectors, y)
    print("\n\n")

Classifiers Performance using 111795 documents.

Random Forest Classifier
Precision: 0.9356500737957869
Recall: 0.9356500737957869
F1-Measure: 0.9356500737957869
Accuracy: 0.9356500737957869
Execution time: 621.035596370697



Stochastic Gradient Descent
Precision: 0.9641397200232568
Recall: 0.9641397200232568
F1-Measure: 0.9641397200232568
Accuracy: 0.9641397200232568
Execution time: 4.2480950355529785



SVM Classifier
Precision: 0.975481908851022
Recall: 0.975481908851022
F1-Measure: 0.975481908851022
Accuracy: 0.975481908851022
Execution time: 14560.414593935013





### Dimension Reduction with SVD

In [8]:
components = 200
lsi_model = TruncatedSVD(n_components=components)
starting_tm = time.time()
lsi_X = lsi_model.fit_transform(vectors, y)
print("Produced dataset's shape is " + str(lsi_X.shape))
print("SVD time: " + str((time.time() - starting_tm)))

Produced dataset's shape is (111795, 200)
SVD time: 31.66509222984314
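
Truncated SVD keeps only the top singular directions of the TF-IDF matrix and projects every document onto them. The sketch below finds the single strongest direction of a tiny dense matrix by power iteration in plain Python. It is an illustrative stand-in under simplifying assumptions (dense input, one component), not the solver that `TruncatedSVD` uses, and `top_singular_direction` is a hypothetical helper.

```python
import math

def top_singular_direction(A, n_iter=100):
    """Return (sigma, v): the largest singular value of A and its right singular vector."""
    n_rows, n_cols = len(A), len(A[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    for _ in range(n_iter):
        # one power-iteration step: w = A^T (A v), then normalise
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]
        w = [sum(A[i][j] * Av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    sigma = math.sqrt(sum(x * x for x in Av))
    return sigma, v

# toy matrix: 3 "documents", 2 "features"
A = [[3.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]

sigma, v = top_singular_direction(A)
# project each row onto the direction: the reduced data has one column
reduced = [[sum(a * x for a, x in zip(row, v))] for row in A]
print(round(sigma, 6), reduced)
```

With `n_components=200`, as in the notebook, the reduced matrix has 200 such columns per document instead of one.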


### Classifier Evaluation after SVD

In [13]:
hyper_parameter_tuning(lsi_X[:20000], y[:20000])

Using Random Forest Classifier
# Tuning hyper-parameters for recall

Best parameters set found on development set: 
{'n_estimators': 200}

Grid scores on development set:
0.911 (+/-0.007) for {'n_estimators': 10}
0.921 (+/-0.007) for {'n_estimators': 20}
0.922 (+/-0.008) for {'n_estimators': 30}
0.924 (+/-0.008) for {'n_estimators': 40}
0.924 (+/-0.008) for {'n_estimators': 50}
0.925 (+/-0.008) for {'n_estimators': 60}
0.926 (+/-0.008) for {'n_estimators': 70}
0.926 (+/-0.004) for {'n_estimators': 80}
0.926 (+/-0.009) for {'n_estimators': 90}
0.927 (+/-0.011) for {'n_estimators': 100}
0.925 (+/-0.006) for {'n_estimators': 110}
0.927 (+/-0.007) for {'n_estimators': 120}
0.926 (+/-0.011) for {'n_estimators': 130}
0.926 (+/-0.011) for {'n_estimators': 140}
0.927 (+/-0.007) for {'n_estimators': 150}
0.928 (+/-0.008) for {'n_estimators': 160}
0.928 (+/-0.008) for {'n_estimators': 170}
0.927 (+/-0.007) for {'n_estimators': 180}
0.927 (+/-0.006) for {'n_estimators': 190}
0.928 (+/-0.008) for 



In [9]:
print("Classifiers Performance using " + str(lsi_X.shape[0]) + " documents.\n")

rf = RandomForestClassifier(n_estimators=200, n_jobs=12)
svm_clf =svm.SVC(gamma=1, C=10, kernel='rbf')
sgd_clf = SGDClassifier(loss='modified_huber', max_iter=3000, n_jobs=12)

clfs = [(rf, "Random Forest Classifier"), (svm_clf, "SVM Classifier"), (sgd_clf, "Stochastic Gradient Descent")]
for clf, clf_name in clfs:
    evaluation(clf, clf_name, lsi_X, y)
    print("\n\n")

Classifiers Performance using 111795 documents.

Random Forest Classifier
Precision: 0.95464018963281
Recall: 0.95464018963281
F1-Measure: 0.95464018963281
Accuracy: 0.95464018963281
Execution time: 255.4513065814972



SVM Classifier
Precision: 0.9590858267364373
Recall: 0.9590858267364373
F1-Measure: 0.9590858267364373
Accuracy: 0.9590858267364373
Execution time: 1595.4284834861755



Stochastic Gradient Descent
Precision: 0.92914709960195
Recall: 0.92914709960195
F1-Measure: 0.92914709960195
Accuracy: 0.92914709960195
Execution time: 3.627742052078247



