# NLP Topic Modeling (Using HashingVectorizer + SVC)

## Problem statement

The problem statement is from the Analytics Vidhya JanataHack hackathon.

Researchers have access to large online archives of scientific articles. As a consequence, finding relevant articles has become more difficult. Tagging or topic modelling gives each research article identifying labels, which facilitates recommendation and search.

Given the abstract and title for a set of research articles, predict the topics for each article included in the test set. 

Note that a research article can have more than one topic. The abstracts and titles are sourced from the following six topics:

1. Computer Science

2. Physics

3. Mathematics

4. Statistics

5. Quantitative Biology

6. Quantitative Finance

## Dataset

The dataset consists of three files `train.csv`, `test.csv` and `sample_submission.csv`.

|Field|Description|
|-----|-----------|
|ID|Unique ID for each article|
|TITLE|Title of the research article|
|ABSTRACT|Abstract of the research article|
|Computer Science|Whether the article belongs to the topic Computer Science (1/0)|
|Physics|Whether the article belongs to the topic Physics (1/0)|
|Mathematics|Whether the article belongs to the topic Mathematics (1/0)|
|Statistics|Whether the article belongs to the topic Statistics (1/0)|
|Quantitative Biology|Whether the article belongs to the topic Quantitative Biology (1/0)|
|Quantitative Finance|Whether the article belongs to the topic Quantitative Finance (1/0)|

## Approach

This notebook follows three steps:
1. Treat each binary topic column as a separate target variable.
2. Use `HashingVectorizer` for feature extraction.
3. Build a model for each target variable and combine the results.
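
The three steps above can be sketched on a toy corpus. This is only an illustration under stated assumptions: the four documents, the two label vectors, and `n_features=2**10` are made up for the example and are not the notebook's data or settings.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.svm import LinearSVC

# Toy corpus and labels (hypothetical, for illustration only)
docs = [
    "neural networks for image classification",
    "quantum field theory on curved spacetime",
    "deep learning models of particle collisions",
    "graph algorithms and complexity",
]
labels = {
    "Computer Science": np.array([1, 0, 1, 1]),
    "Physics": np.array([0, 1, 1, 0]),
}

# Step 2: hash the text into a fixed-width sparse matrix
X = HashingVectorizer(n_features=2**10).transform(docs)

# Steps 1 and 3: fit one binary model per label, then stack the predictions
predictions = {}
for name, y in labels.items():
    clf = LinearSVC().fit(X, y)
    predictions[name] = clf.predict(X)

combined = np.column_stack([predictions[name] for name in labels])
print(combined.shape)  # (4, 2): one row per document, one column per label
```

Because every label gets its own classifier, an article can receive several topics at once, which matches the multi-label nature of the task.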


## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
    
# NLTK modules
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer

import re

from gensim.parsing.preprocessing import preprocess_string, strip_tags, strip_punctuation, remove_stopwords, strip_numeric, stem_text
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, HashingVectorizer
from sklearn.preprocessing import LabelEncoder, StandardScaler, RobustScaler, Normalizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC, SVC

from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## <span style="color:blue">Loading Dataset</span>

In [None]:

train_df = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/train.csv')
test_df = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/test.csv')

submission_df = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/sample_submission_UVKGLZE.csv')

Checking for missing values and data columns.

In [None]:
print(train_df.isnull().sum())
print(train_df.columns)

## <span style="color:blue">Explore Data</span>

In [None]:
# Select the six binary target columns
target_cols = ['Computer Science', 'Physics', 'Mathematics','Statistics', 'Quantitative Biology', 'Quantitative Finance']
y_data = train_df[target_cols]

# Plot category data
plt.figure(figsize=(10,6))
y_data.sum(axis=0).plot.bar()
plt.show()


## Data Preparation

In [None]:
# Stemmer object
porter = PorterStemmer()
wnl = WordNetLemmatizer()

class DataPreprocess:
    
    def __init__(self):
        self.filters = [strip_tags,
                       strip_numeric,
                       strip_punctuation,
                       lambda x: x.lower(),
                       # Replace isolated single-character tokens with a space so neighbouring words are not merged
                       lambda x: re.sub(r'\s+\w{1}\s+', ' ', x),
                       remove_stopwords]
    def __call__(self, doc):
        clean_words = self.__apply_filter(doc)
        return clean_words
    
    def __apply_filter(self, doc):
        try:
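            # Note: set() keeps each token once and discards word order, so the
            # joined output is a bag of unique tokens rather than the original sentence
            cleanse_words = set(preprocess_string(doc, self.filters))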
#             filtered_words = set(wnl.lemmatize(w) if w.endswith('e') else porter.stem(w) for w in cleanse_words)
            filtered_words = set(wnl.lemmatize(word, 'v') for word in cleanse_words)
            return ' '.join(filtered_words)
        except TypeError as te:
            # Re-raise with context, keeping the original exception chained
            raise TypeError("Not valid data: {}".format(te)) from te

## Combine Train and Test Data

Here we combine the train and test data into a single DataFrame, so the preprocessing steps can be applied to both corpora at once.

In [None]:
train_df['train_or_test'] = 0
test_df['train_or_test'] = 1

feature_col = ['ID', 'TITLE', 'ABSTRACT', 'train_or_test']

# Concat train and test data
combined_set = pd.concat([train_df[feature_col], test_df[feature_col]])

# Combine the title and abstract, separated by a space so the last title word
# does not fuse with the first abstract word
combined_set['TEXT'] = combined_set['TITLE'] + ' ' + combined_set['ABSTRACT']

# Drop unwanted columns
combined_set = combined_set.drop(['TITLE', 'ABSTRACT'], axis=1)

**Pre-process the text data** 

Because the train and test datasets were combined beforehand, the preprocessing pipeline runs only once over the entire dataset instead of separately for each file.
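
The combine, process once, split back pattern can be sketched on toy data. The frames, the `is_test` flag name, and the lowercase step standing in for the real cleaning are all assumptions made for this example.

```python
import pandas as pd

# Toy frames standing in for the real train and test files (hypothetical data)
train = pd.DataFrame({"ID": [1, 2], "TEXT": ["Graph Theory", "Dark Matter"]})
test = pd.DataFrame({"ID": [3], "TEXT": ["Option Pricing"]})

# Flag each row with its origin, then stack the frames
train["is_test"] = 0
test["is_test"] = 1
combined = pd.concat([train, test], ignore_index=True)

# Apply the (placeholder) cleaning step once to every row
combined["clean"] = combined["TEXT"].str.lower()

# Split back using the flag and drop it afterwards
train_clean = combined[combined["is_test"] == 0].drop(columns="is_test")
test_clean = combined[combined["is_test"] == 1].drop(columns="is_test")
print(len(train_clean), len(test_clean))  # 2 1
```

This is safe here because the cleaning step is stateless; nothing is learned from the test rows.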

In [None]:
# Invoke data preprocess operation on the text data
combined_set['Processed'] = combined_set['TEXT'].apply(DataPreprocess())

In [None]:
train_set = combined_set.loc[combined_set['train_or_test'] == 0]
test_set = combined_set.loc[combined_set['train_or_test'] == 1]

# Drop key reference column
train_set = train_set.drop('train_or_test', axis=1)
test_set = test_set.drop('train_or_test', axis=1)
# View the first two rows
train_set[0:2].values

## Feature Extraction

**HashingVectorizer**
<pre>
sklearn.feature_extraction.text.HashingVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True)
</pre>
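
As a rough illustration of the signature above: `HashingVectorizer` is stateless, so `transform` can be called without fitting, and the output width is fixed by `n_features`. The two documents and `n_features=2**8` are assumptions chosen for the example, not the notebook's settings.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents (hypothetical)
docs = ["topic modeling of research articles", "hashing maps tokens to columns"]

# No vocabulary is learned: each unigram and bigram is hashed to one of 256 columns
hasher = HashingVectorizer(n_features=2**8, ngram_range=(1, 2))
X = hasher.transform(docs)

print(X.shape)  # (2, 256)

# With the default norm='l2', each non-empty row is scaled to unit length
row_norms = np.sqrt(X.multiply(X).sum(axis=1))
print(row_norms)
```

A smaller `n_features` saves memory but makes it more likely that distinct terms share a column.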

In [None]:
def lsa_reduction(X_train, X_test, n_comp=120):
    svd = TruncatedSVD(n_components=n_comp)
    normalizer = Normalizer()
    
    lsa_pipe = Pipeline([('svd', svd),
                        ('normalize', normalizer)]).fit(X_train)
    
    train_reduced = lsa_pipe.transform(X_train)
    test_reduced = lsa_pipe.transform(X_test)
    return train_reduced, test_reduced

def vectorize(vector, X_train, X_test):
    vector_fit = vector.fit(X_train)
    
    X_train_vec = vector_fit.transform(X_train)
    X_test_vec = vector_fit.transform(X_test)
    
    print("Vectorization is completed.")
    return X_train_vec, X_test_vec

In [None]:
# HashingVectorizer maps each term to a column index with a hash function, so no vocabulary
# is stored; distinct terms can occasionally collide in the same column
def hash_vectorizer(X_train, X_test):
    hasher = HashingVectorizer(ngram_range=(1,2), n_features=25000)
    tfidf_transformer = TfidfTransformer(use_idf=True)
    feature_extractor = Pipeline([('hash', hasher),
                             ('tfidf', tfidf_transformer)]).fit(X_train)
    
    x_train_tf = feature_extractor.transform(X_train)
    x_test_tf = feature_extractor.transform(X_test)
    
    return x_train_tf, x_test_tf


# Hashing followed by TF-IDF weighting performed better here than the plain TF-IDF vectorizer
# (the commented-out alternative below)
X_train_hashed, X_test_hashed = hash_vectorizer(train_set['Processed'], test_set['Processed'])

# X_train_hashed, X_test_hashed = vectorize(tfidf_vector, train_set['Processed'], test_set['Processed'])

# Dimension reduction
# --------------------------------------------
# Result is not very good after feature reduction
# ---------------------------------------------
# x_train_svd, x_test_svd = lsa_reduction(X_train_hashed, X_test_hashed, 500)

In [None]:
print(X_train_hashed.shape)

## Build a Model

In [None]:
# lr = LogisticRegression(C=1.0,class_weight='balanced', 
#                         l1_ratio=0.9, 
#                         solver='saga', 
#                         penalty='l1')
svc = LinearSVC()

# One-vs-rest classifier wrapping the linear SVC
orc_clf = OneVsRestClassifier(estimator=svc)

In [None]:
for target in target_cols:
    y = train_df[target]
#     print(y)
    
    # Hold out 20% of the training data for validation
    X_train, X_valid, y_train, y_valid = train_test_split(X_train_hashed, y, test_size=0.2, shuffle=True, random_state=0)
    
    orc_clf.fit(X_train, y_train)
    
    y_pred = orc_clf.predict(X_valid)
    
    # Each target is binary, so score the positive class; micro-averaging over both
    # classes would only repeat the accuracy for precision, recall and F1
    print("Label: %s \n Accuracy: %1.3f \tPrecision: %1.3f \tRecall: %1.3f \tF1-Score: %1.3f\n" % (target, 
                                                                                    accuracy_score(y_valid, y_pred),
                                                                                     precision_score(y_valid, y_pred, average='binary'),
                                                                                     recall_score(y_valid, y_pred, average='binary'),
                                                                                     f1_score(y_valid, y_pred, average='binary')))
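
When a single binary label is scored at a time, the choice of `average` matters. Micro-averaging over both classes makes precision, recall and F1 equal to accuracy, while `average='binary'` scores only the positive class. A small check on made-up labels (the two vectors below are assumptions for illustration):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground truth and predictions for one binary topic column
y_true = [1, 0, 0, 1, 0, 0]
y_hat = [1, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_hat)
micro_f1 = f1_score(y_true, y_hat, average='micro')

# Positive-class scores: 1 true positive, 1 false positive, 1 false negative
bin_precision = precision_score(y_true, y_hat, average='binary')
bin_recall = recall_score(y_true, y_hat, average='binary')
bin_f1 = f1_score(y_true, y_hat, average='binary')

print(acc, micro_f1)                      # both equal 4/6
print(bin_precision, bin_recall, bin_f1)  # 0.5 0.5 0.5
```

For rare topics, the positive-class scores are usually the more informative ones, since accuracy stays high even if the model never predicts the topic.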

## Predict the Test data

In [None]:
# Copy of submission dataframe
output_df = submission_df.copy()

# Iterate over the target variables
for target in target_cols:
    y = train_df[target]
    
    orc_clf.fit(X_train_hashed, y)
    
    # Predict the values for test data
    y_pred = orc_clf.predict(X_test_hashed)
    # Assign the predicted vector to each column
    output_df[target] = y_pred


# Submission dataframe
output_df

## Final Output

In [None]:
# Submission file.
output_df.to_csv("ovr_svc_hash_tfidf_07.csv", index=False)
# output_df.to_csv("ovr_lr_hash_tfidf_06.csv", index=False)