In [1]:
import pandas as pd             ## To work with the dataset
import numpy as np              ## Linear Algebra and other stuff

import matplotlib.pyplot as plt ## Visualization

import warnings                 ## Don't like these
warnings.filterwarnings('ignore')

In [2]:
## Import the Dataset
df = pd.read_csv("Data.csv")

In [3]:
## Sneak Peek into the data
df.head()

,TITLE,ABSTRACT,Computer Science,Physics,Mathematics,Statistics,Quantitative Biology,Quantitative Finance,labels
0,ChemGAN challenge for drug discovery: can AI r...,Generating molecules with desired chemical p...,1,0,0,1,0,0,"Computer Science,Statistics"
1,Hybrid graphene tunneling photoconductor with ...,Hybrid graphene photoconductor/phototransist...,0,1,0,0,0,0,Physics
2,Temperature Dependence of Magnetic Excitations...,When an ordered spin system of a given dimen...,0,1,0,0,0,0,Physics
3,A Las Vegas algorithm to solve the elliptic cu...,"In this paper, we describe a new Las Vegas a...",1,0,1,0,0,0,"Computer Science,Mathematics"
4,Comparing simulations and test data of a radia...,The VIS instrument on board the Euclid missi...,0,1,0,0,0,0,Physics


In [4]:
df.shape

(17000, 9)

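- The dataset is small: 17,000 rows and 9 columns.
- We have 6 different labels: Computer Science, Physics, Mathematics, Statistics, Quantitative Biology and Quantitative Finance.
- One entry (a paper) can have more than one label. The final labels are recorded in the 'labels' column.
- We have to predict the labels from the given title and abstract.
- This is a multilabel classification problem.
- It is also an NLP problem.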
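
To make the link between the 'labels' string and the six 0/1 columns concrete, here is a minimal sketch on toy data. The three label strings are copied from the preview above; using `MultiLabelBinarizer` is an assumption for illustration, since the dataset already ships with the indicator columns.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

## Toy stand-in for the notebook's 'labels' column.
labels = pd.Series([
    "Computer Science,Statistics",
    "Physics",
    "Computer Science,Mathematics",
])

## Split each comma-separated string into a list of label names.
label_lists = labels.str.split(",")

## One 0/1 column per label; classes_ holds the column order (sorted).
mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(label_lists)

label_matrix = pd.DataFrame(indicator, columns=mlb.classes_)
print(label_matrix)
```

Each row can contain several 1s, which is what separates a multilabel problem from an ordinary multi-class one.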


In [5]:
## Take a look into the kind of texts that we'll be encountering.
df['ABSTRACT'][0]

'  Generating molecules with desired chemical properties is important for drug\ndiscovery. The use of generative neural networks is promising for this task.\nHowever, from visual inspection, it often appears that generated samples lack\ndiversity. In this paper, we quantify this internal chemical diversity, and we\nraise the following challenge: can a nontrivial AI model reproduce natural\nchemical diversity for desired molecules? To illustrate this question, we\nconsider two generative models: a Reinforcement Learning model and the recently\nintroduced ORGAN. Both fail at this challenge. We hope this challenge will\nstimulate research in this direction.\n'

#### Preprocessing:
- Convert all text to lowercase. (Uppercase would work as well; consistency is what matters.)
- Word tokenization.
- Removing stopwords. (Words that occur often in the corpus but carry little meaning.)
- Keeping only alphabetic words. (No special characters or numeric values.)
- Word lemmatization. (work, worked, works -> work)
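
As a dependency-free preview of the first four steps, the sketch below runs them on one sentence. It is only an approximation: the regex is a crude stand-in for NLTK's `word_tokenize`, `STOPWORDS` is a hypothetical mini list rather than NLTK's English list, and lemmatization is left out because it needs the WordNet data.

```python
import re

## Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"the", "is", "for", "of", "a", "and", "with"}

def preprocess(text):
    lowered = text.lower()                          ## 1. lowercase
    tokens = re.findall(r"[a-z0-9]+", lowered)      ## 2. crude tokenization
    kept = [t for t in tokens if t not in STOPWORDS]  ## 3. drop stopwords
    return [t for t in kept if t.isalpha()]         ## 4. alphabetic words only

sample = "Generating molecules with desired chemical properties is important for drug discovery."
tokens = preprocess(sample)
print(tokens)
```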

In [6]:
## Convert text to lowercase
df['TITLE'] = df['TITLE'].str.lower()
df['ABSTRACT'] = df['ABSTRACT'].str.lower()

In [7]:
## Word tokenization
from nltk.tokenize import word_tokenize
df['TITLE'] = df['TITLE'].apply(word_tokenize)
df['ABSTRACT'] = df['ABSTRACT'].apply(word_tokenize)

In [8]:
## Removing stopwords
from nltk.corpus import stopwords 

StopWords = set(stopwords.words('english'))

def clean_words(x):
    words = []
    for i in x:
        if i.isalnum() and i not in StopWords:
            words.append(i)
    return words

df.loc[:,'TITLE'] = df['TITLE'].apply(clean_words)
df.loc[:,'ABSTRACT'] = df['ABSTRACT'].apply(clean_words)

In [9]:
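## Lemmatization
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemma(l):
    ## Note: returning a set drops repeated words and word order,
    ## so each word counts at most once per document in the tf-idf step.
    ## lemmatize() defaults to pos='n', so verb forms such as 'worked' stay unchanged.
    ans = set()
    for word in l:
        ans.add(lemmatizer.lemmatize(word))
    return ans
df['TITLE'] = df['TITLE'].apply(lemma)
df['ABSTRACT'] = df['ABSTRACT'].apply(lemma)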

In [10]:
## Filtering alphabet-words

def get_alpha(x):
    l = []
    for i in x:
        if i.isalpha():
            l.append(i)
    return l

df['ABSTRACT'] = df['ABSTRACT'].apply(get_alpha)

In [11]:
## Join the tokens back into one string per abstract for the tf-idf representation
df['ABS_TF'] = df['ABSTRACT'].apply(" ".join)

### Approach:
- Option 1: build 6 separate binary classifiers, one per label, and combine their predictions.
- Option 2: treat each distinct label combination as one class of a multi-class problem. (Need to check how many combinations actually occur: 6 labels allow up to 2^6 - 1 = 63 non-empty combinations.)
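
The notebook goes on to use the multi-class route. For comparison, the one-classifier-per-label route could look roughly like the sketch below. It assumes scikit-learn's `OneVsRestClassifier` wrapped around `LogisticRegression`, and the texts and the three label columns are invented toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

## Toy documents (invented for illustration).
texts = [
    "neural network training deep learning",
    "quantum spin lattice magnetic field",
    "bayesian inference neural network prior",
    "magnetic excitation spin temperature",
    "deep learning bayesian posterior",
    "lattice field quantum temperature",
]
## Columns: Computer Science, Physics, Statistics (one 0/1 flag per label).
Y = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])

X = TfidfVectorizer().fit_transform(texts)

## With a 2-D indicator target, one binary classifier is fitted per column.
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)
pred = clf.predict(X)
print(pred)
```

Unlike the multi-class route, this setup can predict label combinations that never appear in the training data.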

In [12]:
## Number of classes if treated as multi-class problem
df['labels'].nunique()

24

- There are 24 classes in total, which is manageable.
- Let's also check whether the classes are imbalanced.

In [13]:
## Checking Class Imbalance scenario
df['labels'].value_counts()

Physics                                             4151
Computer Science                                    3975
Mathematics                                         2925
Computer Science,Statistics                         1881
Statistics                                          1319
Mathematics,Statistics                               674
Computer Science,Mathematics                         533
Quantitative Biology                                 358
Computer Science,Physics                             357
Physics,Mathematics                                  238
Quantitative Finance                                 171
Computer Science,Mathematics,Statistics              136
Statistics,Quantitative Biology                       82
Physics,Statistics                                    81
Computer Science,Physics,Statistics                   31
Computer Science,Quantitative Biology                 26
Statistics,Quantitative Finance                       21
Computer Science,Physics,Mathem

- The classes are heavily imbalanced: among the classes shown, counts range from 4151 (Physics) down to 21 (Statistics,Quantitative Finance).
- Oversampling can be used to address this.
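
Random oversampling simply draws extra copies of minority-class rows, with replacement, until every class matches the largest one. Here is a pandas-only sketch of that idea on toy data. It is not imbalanced-learn's actual implementation, and the toy rows are invented.

```python
import pandas as pd

## Toy imbalanced dataset: 3 / 2 / 1 rows per class.
toy = pd.DataFrame({
    "ABS_TF": ["a", "b", "c", "d", "e", "f"],
    "labels": ["Physics"] * 3 + ["Statistics"] * 2 + ["Quantitative Finance"],
})

## Every class is grown to the size of the largest class.
target = toy["labels"].value_counts().max()

parts = []
for _, group in toy.groupby("labels"):
    parts.append(group)                      ## keep all original rows
    deficit = target - len(group)
    if deficit > 0:
        ## Draw the missing rows from the same class, with replacement.
        parts.append(group.sample(n=deficit, replace=True, random_state=42))

balanced = pd.concat(parts, ignore_index=True)
print(balanced["labels"].value_counts())
```

No new text is created: the smallest class here ends up as three copies of the same row.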

## Random Oversampling

In [14]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)

X = df[['ABS_TF']]
y = df['labels']
x_, y_ = ros.fit_resample(X, y)

## x_ = x_['ABS_TF']

In [15]:
y_.value_counts()

Computer Science,Statistics,Quantitative Finance    4151
Computer Science,Mathematics,Statistics             4151
Computer Science,Statistics                         4151
Computer Science,Physics                            4151
Physics,Statistics                                  4151
Mathematics                                         4151
Computer Science,Statistics,Quantitative Biology    4151
Statistics,Quantitative Biology                     4151
Computer Science,Physics,Statistics                 4151
Quantitative Biology,Quantitative Finance           4151
Physics                                             4151
Computer Science,Physics,Mathematics                4151
Mathematics,Statistics,Quantitative Finance         4151
Physics,Mathematics                                 4151
Quantitative Biology                                4151
Computer Science                                    4151
Computer Science,Quantitative Biology               4151
Quantitative Finance           

- All classes now have the same number of records (4151).
- The smaller a class was originally, the more of its records are duplicates.
- For now, I'll manually resample the dataset once more.

In [16]:
x_['labels'] = y_

In [17]:
## Classes that had most records before oversampling
t = x_[((x_['labels']=='Physics' ) | (x_['labels']=='Computer Science'))]

In [18]:
## Manually oversample the two originally largest classes by appending them again
x_ = pd.concat([x_, t])

In [19]:
## Random sampling
## x_ = x_.sample(frac=1)

In [20]:
x_, y_ = x_['ABS_TF'], x_['labels']

- We'll use a tf-idf vector representation.
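
To show what that representation looks like, here is a minimal sketch on three invented documents, using the same `TfidfVectorizer` settings as the next cell. Each document becomes one row, each vocabulary word one column, and words shared by many documents receive lower weights.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

## Toy documents (invented for illustration).
docs = [
    "graphene photoconductor tunneling",
    "spin lattice magnetic excitation",
    "graphene spin transport",
]

vec = TfidfVectorizer(stop_words='english', ngram_range=(1, 1))
matrix = vec.fit_transform(docs)     ## sparse matrix: documents x vocabulary

vocab = vec.vocabulary_              ## word -> column index
print(matrix.shape)
print(sorted(vocab))

## 'graphene' occurs in two documents, 'photoconductor' in one,
## so within document 0 'graphene' gets the smaller weight.
w_common = matrix[0, vocab["graphene"]]
w_rare = matrix[0, vocab["photoconductor"]]
print(w_common < w_rare)

## Rows are L2-normalised by default.
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))
```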

In [21]:
from sklearn.feature_extraction.text  import TfidfVectorizer
tf = TfidfVectorizer(stop_words = 'english', ngram_range = (1,1))
tf.fit(x_)

## Train-Dev Set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_, y_, test_size = 0.2, stratify = y_, random_state = 1)

## Tf-IDF Transformation
tf_x_train = tf.transform(x_train)
tf_x_test = tf.transform(x_test)

## Model evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
performance = {'Model' : [],
              'Accuracy Score' : [],
              'Precision Score' : [],
              'Recall Score' : [],
              'f1 Score' : []}

### We'll try several classification techniques and choose the one that performs best
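
Each model cell that follows repeats the same five `append` calls. A small helper could collect them in one place. `record_scores` is a hypothetical name, not part of the notebook, and the toy labels only demonstrate the arithmetic. With `average='macro'`, each metric is the unweighted mean of the per-class scores, so every class counts equally regardless of its size.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def record_scores(performance, name, y_true, y_pred):
    ## Append one row of macro-averaged metrics to the results dict.
    performance['Model'].append(name)
    performance['Accuracy Score'].append(accuracy_score(y_true, y_pred))
    performance['Precision Score'].append(precision_score(y_true, y_pred, average='macro'))
    performance['Recall Score'].append(recall_score(y_true, y_pred, average='macro'))
    performance['f1 Score'].append(f1_score(y_true, y_pred, average='macro'))

## Toy check with two classes.
demo = {'Model': [], 'Accuracy Score': [], 'Precision Score': [],
        'Recall Score': [], 'f1 Score': []}
record_scores(demo, 'toy',
              ['Physics', 'Statistics', 'Physics', 'Statistics'],
              ['Physics', 'Statistics', 'Statistics', 'Statistics'])
print(demo)
```

In the notebook, a call such as `record_scores(performance, 'SVC', y_test, pred)` would replace the five lines in each model cell.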

In [22]:
## Logistic Regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(tf_x_train, y_train)
pred = lr.predict(tf_x_test)

performance['Model'].append('LogisticRegression')
performance['Accuracy Score'].append(accuracy_score(y_test, pred))
performance['Precision Score'].append(precision_score(y_test, pred, average='macro'))
performance['Recall Score'].append(recall_score(y_test, pred, average='macro'))
performance['f1 Score'].append(f1_score(y_test, pred, average='macro'))

In [23]:
## Support Vector Classifier
from sklearn.svm import SVC

svc = SVC()
svc.fit(tf_x_train, y_train)
pred = svc.predict(tf_x_test)

performance['Model'].append('SVC')
performance['Accuracy Score'].append(accuracy_score(y_test, pred))
performance['Precision Score'].append(precision_score(y_test, pred, average='macro'))
performance['Recall Score'].append(recall_score(y_test, pred, average='macro'))
performance['f1 Score'].append(f1_score(y_test, pred, average='macro'))

In [24]:
## Ensemble learning: Random Forest with default parameters
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(tf_x_train, y_train)
pred = rfc.predict(tf_x_test)

performance['Model'].append('Random Forest')
performance['Accuracy Score'].append(accuracy_score(y_test, pred))
performance['Precision Score'].append(precision_score(y_test, pred, average='macro'))
performance['Recall Score'].append(recall_score(y_test, pred, average='macro'))
performance['f1 Score'].append(f1_score(y_test, pred, average='macro'))

In [25]:
pd.DataFrame(performance)

,Model,Accuracy Score,Precision Score,Recall Score,f1 Score
0,LogisticRegression,0.949134,0.953254,0.953823,0.9529
1,SVC,0.985407,0.98904,0.986146,0.987475
2,Random Forest,0.984064,0.989464,0.98469,0.986759
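
One caveat when reading these scores: the train/test split was made after oversampling, so copies of the same abstract can sit in both the training and the test set, which tends to inflate the metrics. Below is a sketch of the alternative order (split first, then oversample and fit tf-idf on the training part only). The toy data is invented, and the pandas-based oversampling is an illustrative stand-in for `RandomOverSampler`.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

## Toy imbalanced data (invented for illustration).
toy = pd.DataFrame({
    "ABS_TF": [f"physics text {i}" for i in range(8)]
              + [f"finance text {i}" for i in range(4)],
    "labels": ["Physics"] * 8 + ["Quantitative Finance"] * 4,
})

## 1. Split first, so the test rows are never duplicated into training.
train, test = train_test_split(toy, test_size=0.25,
                               stratify=toy["labels"], random_state=1)

## 2. Oversample the training part only.
target = train["labels"].value_counts().max()
parts = []
for _, group in train.groupby("labels"):
    parts.append(group)
    deficit = target - len(group)
    if deficit > 0:
        parts.append(group.sample(n=deficit, replace=True, random_state=42))
train_bal = pd.concat(parts, ignore_index=True)

## 3. Fit tf-idf on the training texts, then transform the test texts.
tf = TfidfVectorizer()
tf_x_train = tf.fit_transform(train_bal["ABS_TF"])
tf_x_test = tf.transform(test["ABS_TF"])
print(tf_x_train.shape, tf_x_test.shape)
```

Scores measured this way are usually lower, but they reflect performance on abstracts the model has not seen.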
