# Janatahack: Independence Day 2020 ML Hackathon
## Topic Modeling for Research Articles

Researchers have access to large online archives of scientific articles. As a consequence, finding relevant articles has become more difficult. Tagging, or topic modelling, gives each research article an identifying label, which makes recommendation and search easier.

Given the abstract and title for a set of research articles, predict the topics for each article included in the test set.

Note that a research article can have more than one topic. The abstracts and titles are drawn from the following six topics:

1. Computer Science
2. Physics
3. Mathematics
4. Statistics
5. Quantitative Biology
6. Quantitative Finance

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/train.csv')
test = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/test.csv')
sample_submission = pd.read_csv('/kaggle/input/janatahack-independence-day-2020-ml-hackathon/sample_submission_UVKGLZE.csv')
df

In [None]:
sample_submission

Are there any rows which have multiple topics?

In [None]:
columns = df.columns
columns

In [None]:
df['No. of topics'] = (df['Computer Science'] + df['Physics'] + df['Mathematics'] + 
                       df['Statistics'] + df['Quantitative Biology'] + df['Quantitative Finance'])

df.head(3)

In [None]:
df['No. of topics'].value_counts()

Many research articles have multiple topics, which makes this a multi-label classification problem. This is the first time I am handling this type of problem.

NLP and multi-label classification. This will be fun!
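The per-row topic count computed above can also be obtained with a row-wise sum over the label columns. The snippet below is a minimal sketch on a made-up three-article frame; the `toy` data and its values are illustrative assumptions, not taken from the competition data.

```python
import pandas as pd

# Hypothetical toy data: three articles, three of the six topic columns.
toy = pd.DataFrame({
    'Computer Science': [1, 0, 1],
    'Mathematics':      [0, 1, 1],
    'Statistics':       [1, 0, 1],
})

# A row-wise sum of the 0/1 indicator columns gives the number of topics per article.
topics_per_article = toy.sum(axis=1)
print(topics_per_article.tolist())

# Share of articles that carry more than one label.
multi_label_share = (topics_per_article > 1).mean()
print(multi_label_share)
```

On the real frame the same idea would be `df[label_columns].sum(axis=1)`, where `label_columns` is the list of the six topic names.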

# Data cleaning functions

In [None]:
from collections import Counter
results = Counter()
df['ABSTRACT'].str.lower().str.split().apply(results.update)

In [None]:
# Words that occur exactly once in the corpus.
one_timers = [word for word, count in results.items() if count == 1]

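The counting step above can be tried in isolation. This is a small sketch on an invented three-sentence corpus (the `abstracts` list is a stand-in for `df['ABSTRACT']`, not real data) showing how a `Counter` accumulates whitespace-split tokens and how the words seen only once are picked out.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for df['ABSTRACT'].
abstracts = [
    "We study graph neural networks",
    "Graph kernels and neural networks",
    "A note on prime numbers",
]

# Accumulate lower-cased, whitespace-split tokens across all abstracts.
word_counts = Counter()
for abstract in abstracts:
    word_counts.update(abstract.lower().split())

# Words that appear exactly once in the whole corpus.
one_timers = sorted(word for word, count in word_counts.items() if count == 1)
print(one_timers)
```

One thing to keep in mind: a plain whitespace split leaves punctuation attached to words (for example `networks.`), whereas NLTK's `word_tokenize` separates it, so some rare tokens found this way may never match the tokens produced later during cleaning.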
In [None]:
from wordcloud import WordCloud, STOPWORDS
# Domain-specific stop words added to WordCloud's default set.
new_stopwords = ['based', 'paper', 'we', 'the', 'model', 'using', 'show', 'that', 'used',
                 'use', '!', '$', '%', '&', ',', '.', 'method', 'problem', 'models']
STOPWORDS.update(new_stopwords)
STOPWORDS.update(one_timers)

In [None]:
def remove_stopwords(text):
    """Tokenize `text` and drop every token found in STOPWORDS."""
    from nltk.tokenize import word_tokenize

    text_tokens = word_tokenize(text)

    # STOPWORDS entries are lower-case, so compare case-insensitively.
    tokens_without_sw = [word for word in text_tokens if word.lower() not in STOPWORDS]

    return " ".join(tokens_without_sw)

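For quick experiments without NLTK and its tokenizer data, the same filtering idea can be sketched with a regular expression. Everything here is an assumption made for illustration: the `remove_stopwords_simple` helper, the tiny `STOPWORDS_DEMO` set, and the choice of an alphanumeric regex as the tokenizer, which is cruder than `word_tokenize`.

```python
import re

# Hypothetical small stop-word set; the notebook uses WordCloud's STOPWORDS instead.
STOPWORDS_DEMO = {'we', 'the', 'paper', 'a'}

def remove_stopwords_simple(text, stopwords=STOPWORDS_DEMO):
    """Keep alphanumeric tokens whose lower-cased form is not a stop word."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return " ".join(token for token in tokens if token.lower() not in stopwords)

cleaned = remove_stopwords_simple("We present a paper on the Ising model.")
print(cleaned)
```

Because the regex keeps only letters and digits, punctuation is dropped as a side effect, and formulas or hyphenated terms are split differently than NLTK would split them.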
In [None]:
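def data_clean(df):
    # Work on a copy so slices of the original frame are not modified in place.
    df = df.copy()
    # Join title and abstract with a space so their boundary words do not fuse.
    df['text'] = df['TITLE'] + ' ' + df['ABSTRACT']
    df['text'] = df['text'].apply(remove_stopwords)

    return df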

# Some visualisations

In [None]:
cleaned_data = data_clean(df)

In [None]:
from collections import Counter
results = Counter()
cleaned_data['text'].str.lower().str.split().apply(results.update)
# One row per word, with its overall frequency in the 'Total' column.
counter_df = pd.DataFrame.from_dict(results, orient='index', columns=['Total'])
counter_df

In [None]:
labels = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
          'Quantitative Biology', 'Quantitative Finance']

for label in labels:
    # Count word frequencies over the articles tagged with this label.
    results = Counter()
    cleaned_data.loc[cleaned_data[label] == 1, 'text'].str.lower().str.split().apply(results.update)

    # Add the per-label counts as a new column, aligned on the word index.
    counter_df = counter_df.join(pd.Series(results, name=label), how='outer')

counter_df.sort_values(by='Total', axis=0, ascending=False).head(15)

In [None]:
counter_df

In [None]:
word_string = " ".join(cleaned_data['text'].str.lower())
wordcloud = WordCloud(stopwords=STOPWORDS).generate(word_string)

plt.figure(figsize=(15, 15))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

In [None]:
labels = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
          'Quantitative Biology', 'Quantitative Finance']

for label in labels:
    print(label)
    word_string = " ".join(cleaned_data[cleaned_data[label] == 1]['text'].str.lower())
    wordcloud = WordCloud(stopwords=STOPWORDS).generate(word_string)

    plt.figure(figsize=(15, 15))
    plt.imshow(wordcloud)
    # Set the title after drawing; calling plt.clf() after plt.title() would erase it.
    plt.title(label)
    plt.axis('off')
    plt.show()


# Six single-label classifications

My initial approach is to treat this problem as six independent single-label (binary) classifications, one per topic. Will it work? There is only one way to find out.
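Training one binary classifier per label is often called binary relevance. As a rough sketch of the idea, scikit-learn's `OneVsRestClassifier` fits one copy of the base estimator per label column when it is given a 0/1 indicator matrix, so the per-topic models can live in a single pipeline. The texts and labels below are invented for illustration and cover only two of the six topics.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; columns of Y are [Computer Science, Physics].
texts = [
    "neural network training on image data",
    "deep learning for image classification",
    "quantum field theory and particle scattering",
    "particle physics at high energy colliders",
    "neural network models of particle scattering",
    "learning quantum states with neural networks",
]
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]])

# One LinearSVC is fitted per label column behind a shared TF-IDF step.
clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', OneVsRestClassifier(LinearSVC(random_state=0))),
])
clf.fit(texts, Y)

# The prediction is again a 0/1 indicator matrix, one column per label.
pred = clf.predict(["image classification with neural networks"])
print(pred)
```

Writing the loop by hand, as done in the cells that follow, gives the same kind of model while making it easy to print a separate report for each topic.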

In [None]:
df.head(3)

I will combine the title and abstract into one text column.
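A minimal sketch of that combination step on a made-up frame is shown here. It assumes a space separator is wanted and that missing values should be treated as empty strings; whether the competition data contains missing titles or abstracts is not something this notebook checks.

```python
import pandas as pd

# Hypothetical two-row frame; the second abstract is deliberately missing.
toy = pd.DataFrame({
    'TITLE': ['Graph kernels', 'Prime gaps'],
    'ABSTRACT': ['We study kernels.', None],
})

# Join with a space so the last title word does not fuse with the first abstract word.
toy['text'] = toy['TITLE'].fillna('') + ' ' + toy['ABSTRACT'].fillna('')
toy['text'] = toy['text'].str.strip()
print(toy['text'].tolist())
```

Without the `fillna('')` calls, adding a string column to a missing value would yield a missing `text` entry for that row.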

In [None]:
# Features: the ID plus the two text fields; targets: the six topic indicators.
X = df[['ID', 'TITLE', 'ABSTRACT']]
y = df[['Computer Science', 'Physics', 'Mathematics', 'Statistics',
        'Quantitative Biology', 'Quantitative Finance']]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.33)

# Collects the hold-out predictions, one column per topic.
submission = pd.DataFrame(X_test['ID'])

X_train = data_clean(X_train)['text']
X_test = data_clean(X_test)['text']



In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn import metrics

labels = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
          'Quantitative Biology', 'Quantitative Finance']

for label in labels:
    # Class balance of this label in the hold-out set.
    print(y_test[label].value_counts())

    # One independent TF-IDF + linear SVM pipeline per topic.
    text_clf = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english')),
        ('clf', LinearSVC(random_state=0)),
    ])
    text_clf.fit(X_train, y_train[label])

    predictions = text_clf.predict(X_test)
    submission[label] = predictions

    print()
    print(metrics.confusion_matrix(y_test[label], predictions))
    print()
    print(metrics.classification_report(y_test[label], predictions))
    print('\n' * 3)

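The reports above score each topic separately. One common way to summarise multi-label predictions in a single number is a micro-averaged F1, which pools true positives, false positives and false negatives over all labels before computing precision and recall. The sketch below works through it on invented arrays; it is not necessarily the metric used to score this competition.

```python
import numpy as np

# Hypothetical hold-out labels and predictions: 4 articles x 3 topics.
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]])

# Pool the counts over every (article, topic) cell.
tp = int(((y_true == 1) & (y_pred == 1)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())

micro_precision = tp / (tp + fp)
micro_recall = tp / (tp + fn)
micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall)
print(tp, fp, fn, round(micro_f1, 4))
```

With the notebook's own variables, the equivalent inputs would be `y_test[labels].values` and `submission[labels].values`; `sklearn.metrics.f1_score(..., average='micro')` computes the same quantity.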
In [None]:
submission

In [None]:
test

In [None]:
# Retrain on the full training set and predict the competition test set.
X = df[['ID', 'TITLE', 'ABSTRACT']]
y = df[['Computer Science', 'Physics', 'Mathematics', 'Statistics',
        'Quantitative Biology', 'Quantitative Finance']]

submission = pd.DataFrame(test['ID'])

X = data_clean(X)['text']
# Use a new name so the raw `test` frame survives and this cell can be re-run.
test_text = data_clean(test)['text']

labels = ['Computer Science', 'Physics', 'Mathematics', 'Statistics',
          'Quantitative Biology', 'Quantitative Finance']

for label in labels:
    text_clf = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english')),
        ('clf', LinearSVC(random_state=0)),
    ])
    text_clf.fit(X, y[label])

    predictions = text_clf.predict(test_text)
    submission[label] = predictions

In [None]:
submission

In [None]:
filename = 'submission.csv'
submission.to_csv(filename,index=False)
print('Saved file: ' + filename)

I know there is a lot I can improve on, but it is time to move on for now.