INTRODUCTION:

The Coleridge Initiative is a not-for-profit that has been established to use data for social good. One way in which the organization does this is by furthering science through publicly available research.

Goal:
In this competition, we need to develop an algorithm that uses natural language processing (NLP) to automate the discovery of how scientific data is referenced in publications.
We have to identify the datasets that publication authors used in their work. For this, we have the full text of scientific publications from numerous research areas. The model will eventually enable government agencies and researchers to quickly find the information they need. The approach will be used to develop data usage scorecards that help agencies show how their data are used, bringing down a critical barrier to the access and use of public data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.max_columns', None)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
    #for filename in filenames:
        #print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
train_files_path = '../input/coleridgeinitiative-show-us-the-data/train/'
test_files_path = '../input/coleridgeinitiative-show-us-the-data/test/'

os.listdir('/kaggle/input/coleridgeinitiative-show-us-the-data/')




We are provided with 4 main pieces of data:

1. train.csv: The CSV file containing the metadata of the publications, such as their title and the dataset they use.
2. train: The directory containing the actual publications referenced in train.csv, in JSON format.
3. test: The directory containing the publications that will be used for testing (thus, with no ground truth CSV file available).
4. sample_submission.csv: The CSV file containing all the publication IDs in the test set, for which we'll have to populate the prediction column.
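
To make the layout of a publication file concrete, the sketch below writes a tiny stand-in JSON file and reads it back. The `text` field is the one this notebook reads later on; the `section_title` field, the file name `example-id.json` and the sample sentences are assumptions made for illustration, not real competition data.

```python
import json
import os
import tempfile

# Hypothetical publication: a list of sections, each holding a piece of text.
# 'section_title' is an assumed field name; 'text' is the field read later in this notebook.
sample_publication = [
    {"section_title": "Abstract", "text": "We study reading scores."},
    {"section_title": "Data", "text": "Scores come from a national survey."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example-id.json")

    # Write the stand-in file, then read it back the way a real file would be read.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_publication, f)
    with open(path, encoding="utf-8") as f:
        sections = json.load(f)

# Join the section texts into one string per publication.
full_text = " ".join(section["text"] for section in sections)
print(len(sections), "sections")
print(full_text)
```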

In [None]:
#load libraries
import pandas as pd
import numpy as np
import json


# Data Visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("darkgrid")

import cv2
from wordcloud import WordCloud, STOPWORDS


import nltk
nltk.download(['punkt', 'wordnet'])
nltk.download('stopwords')
  
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from sqlalchemy import create_engine

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support

from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.model_selection import train_test_split, GridSearchCV


In [None]:
train = pd.read_csv('../input/coleridgeinitiative-show-us-the-data/train.csv')
train.head()

In [None]:
train.columns

In [None]:
#checking for missing values
train.info()

The 'train' dataset is free of any missing values.

In [None]:
#number of rows and columns in train dataset
train.shape

In [None]:
# count the unique values in each column
for col in train.columns:
    print(col + ":" + str(len(train[col].unique())))

Inference
1. The training dataset has 19,661 rows but only 14,316 unique Ids. This means that some publications reference more than one dataset (or the same dataset under more than one label).
2. The unique count of pub_title is also lower than the unique count of Id. This points to several cases where two separate publications, each with its own Id, share exactly the same title.
3. There are 45 unique dataset_title values and 130 unique dataset_label values. This means that a single dataset can appear under multiple labels across different publications.
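
The reasoning behind these counts can be reproduced on a toy table. The sketch below uses made-up rows (the values `a1`, `Survey X`, `SX` and so on are hypothetical) with the same column names as train.csv, and shows how a row count above the number of unique Ids, and a label count above the number of dataset titles, can arise.

```python
import pandas as pd

# Made-up rows with the same column names as train.csv.
toy = pd.DataFrame({
    "Id": ["a1", "a1", "b2", "c3"],
    "pub_title": ["Title one", "Title one", "Title two", "Title two"],
    "dataset_title": ["Survey X", "Survey Y", "Survey X", "Survey X"],
    "dataset_label": ["Survey X", "Survey Y", "SX", "Survey X"],
})

n_rows = len(toy)
unique_counts = toy.nunique()  # unique values per column

# Number of distinct labels under which each dataset title appears.
labels_per_title = toy.groupby("dataset_title")["dataset_label"].nunique()

print(n_rows)
print(unique_counts)
print(labels_per_title)
```

Here publication `a1` occupies two rows because it is linked to two datasets, `b2` and `c3` share one title, and `Survey X` appears under two different labels.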

In [None]:
#Publication titles shared by more than one unique Id
group_pub_title = train.drop_duplicates("Id").groupby('pub_title').count()
group_pub_title[group_pub_title['Id'] >1][['Id']].head(10)

In [None]:
#let's take a look at 'pub_title' == 'A quantitative examination of lightning as a predictor of peak winds in tropical cyclones':
train[train['pub_title'] == "A quantitative examination of lightning as a predictor of peak winds in tropical cyclones"]

In [None]:
#A single Id that appears with several dataset titles:
id_df = train[train['Id'] == '170113f9-399c-489e-ab53-2faf5c64c5bc'].drop_duplicates('dataset_title')
id_df[['Id', 'dataset_title']]

Inference: As we can see, the Id "170113f9-399c-489e-ab53-2faf5c64c5bc" mentions multiple datasets. So, for each Id in the test set, we'll need to predict all possible datasets used.
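
Because one Id can map to several datasets, it can help to collect the labels per publication. The following is a minimal sketch on made-up rows (the Ids and labels are hypothetical); it only illustrates the grouping step and is not a prediction method.

```python
import pandas as pd

# Made-up rows: publication 'a1' is linked to two distinct labels, one of them twice.
toy = pd.DataFrame({
    "Id": ["a1", "a1", "a1", "b2"],
    "dataset_label": ["survey y", "survey x", "survey x", "survey x"],
})

# Map each Id to the sorted list of distinct labels attached to it.
labels_by_id = {
    pub_id: sorted(set(labels))
    for pub_id, labels in toy.groupby("Id")["dataset_label"]
}

print(labels_by_id)
```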

In [None]:
#there are 130 unique labels for 45 unique datasets; let's look at how these labels are distributed across datasets:
dataset_label_distribution = train.drop_duplicates('dataset_label').groupby('dataset_title').count()[['dataset_label']].sort_values(by = 'dataset_label', ascending = False).reset_index()

In [None]:
dataset_label_distribution.head()

In [None]:
#let's visualize the dataset-label distribution:

plt.figure(figsize=(10, 7))
sns.barplot(y='dataset_title', x='dataset_label', data=dataset_label_distribution.head(20))
plt.title('Dataset-labels distribution', fontsize = 20)
plt.yticks(fontsize = 8)
plt.ylabel('dataset_title', fontsize = 12)
plt.xlabel('unique_dataset_label_count', fontsize = 12);

# Publication titles EDA:


In [None]:
# let's take a look at the unique publication titles
pd.DataFrame(train['pub_title'].unique())

Wordcloud

Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud. Word clouds are widely used for analyzing data from social network websites.
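
As a rough illustration of what drives word size, the sketch below tallies word frequencies by hand. The titles and the small stop word set are made up for the example, and the WordCloud library applies its own tokenization and stop word list, so its counts may differ from this simplified tally.

```python
import re
from collections import Counter

# Made-up titles, used only to illustrate the counting step.
titles = [
    "Education and health in the United States",
    "Health outcomes of rural education",
    "A survey of health data",
]

# A tiny, hypothetical stop word set; real lists are much longer.
stop_words = {"a", "and", "in", "of", "the"}

words = [
    word
    for title in titles
    for word in re.findall(r"[a-z]+", title.lower())
    if word not in stop_words
]
word_counts = Counter(words)

# The most frequent words are the ones a word cloud would draw largest.
print(word_counts.most_common(2))
```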

We will write a simple and intuitive function plot_wordcloud that will help us plot wordclouds with ease.

In [None]:
def plot_wordcloud(column, title):
    
    """
    Function to Plot Wordcloud of given dataframe column.
    
    params: column(string): The Column of the DataFrame for plotting.
            title(string) : The Title of the Wordcloud.
    """
    # Define stopwords
    stopwords = set(STOPWORDS) 
    
    # Define the Wordcloud    
    wordcloud = WordCloud(width = 800, 
                          height = 800,
                          background_color ='black',
                          min_font_size = 10,
                          stopwords = stopwords).generate(' '.join(train[column])) 

    # Plot the WordCloud image                        
    plt.figure(figsize = (8, 8), facecolor = None) 
    plt.imshow(wordcloud) 
    plt.axis("off") 
    plt.tight_layout(pad = 0) 
    plt.title('Wordcloud: ' + title, fontsize = 20)

    plt.show() 

In [None]:
#Most frequent words in 'pub_title'
plot_wordcloud(column = 'pub_title', title = 'Publication Title')

# Text EDA:
Let's add a text column that holds the full text of the publication in each row. We will use a lambda function to read the text from each publication's JSON file and store it in a new 'text' column of the 'train' dataframe.

In [None]:
from tqdm.autonotebook import tqdm
#tqdm displays a progress bar while long-running code executes.

In [None]:
#add text to 'train' columns:
tqdm.pandas()
train['text'] = train.progress_apply(lambda x : pd.read_json(train_files_path + x['Id'] + ".json")['text'].str.cat(sep=' '), axis = 1)
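
Since the same Id appears in several rows, the row-wise read above parses some files more than once. One possible alternative, sketched here on stand-in files, reads each unique Id once and maps the result back onto the rows. The helper name `read_publication_text`, the Ids and the sample texts are hypothetical.

```python
import json
import os
import tempfile

import pandas as pd


def read_publication_text(json_path):
    """Join the 'text' field of every section in one publication file."""
    with open(json_path, encoding="utf-8") as f:
        sections = json.load(f)
    return " ".join(section["text"] for section in sections)


with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two stand-in publication files (made-up content).
    for pub_id, body in [("a1", "First paper."), ("b2", "Second paper.")]:
        with open(os.path.join(tmp_dir, pub_id + ".json"), "w", encoding="utf-8") as f:
            json.dump([{"text": body}, {"text": "Conclusion."}], f)

    # 'a1' appears twice, as an Id does when a publication is linked to two labels.
    toy = pd.DataFrame({"Id": ["a1", "a1", "b2"]})

    # Read each unique Id once, then map the texts back onto every row.
    text_by_id = {
        pub_id: read_publication_text(os.path.join(tmp_dir, pub_id + ".json"))
        for pub_id in toy["Id"].unique()
    }
    toy["text"] = toy["Id"].map(text_by_id)

print(toy)
```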

In [None]:
train.head()

In [None]:
train.describe()

In [None]:
#WordCloud of most frequent words in the texts
text = ' '.join(train['text'].sample(frac=0.3))
wordcloud = WordCloud(background_color='black', stopwords=STOPWORDS, width=800, height=800).generate(text)

# plt.subplots returns a (figure, axes) pair
fig, ax = plt.subplots(figsize=(8, 8), facecolor='w')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Create a function to preprocess the text with a basic NLP filter (lowercasing and removal of non-alphanumeric characters):

In [None]:
# A text cleaning function
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()
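
To see what this filter does, here is the same function applied to a made-up sentence (the input string is hypothetical). Lowercasing happens first; every run of characters other than letters and digits then collapses into a single space.

```python
import re


def clean_text(txt):
    # Lowercase, replace each run of non-alphanumeric characters with one space, trim the ends.
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()


sample = "The ADNI (Alzheimer's) data-set, 2019!"
cleaned = clean_text(sample)
print(cleaned)  # the adni alzheimer s data set 2019
```

Note that punctuation inside a word also becomes a space, so "data-set" turns into two tokens and the apostrophe in "Alzheimer's" leaves a stray "s".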

In [None]:
tqdm.pandas()
train['text'] = train['text'].progress_apply(clean_text)

In [None]:
#WordCloud of most frequent words in the texts after cleaning : 
text = ' '.join(train['text'].sample(frac=0.3))
wordcloud = WordCloud(background_color='black', stopwords=STOPWORDS, width=800, height=800).generate(text)

# plt.subplots returns a (figure, axes) pair
fig, ax = plt.subplots(figsize=(8, 8), facecolor='w')
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.tight_layout(pad=0)
plt.title('WordCloud: text')
plt.show()

# credits:
1. https://www.kaggle.com/pashupatigupta/starter-competition-data-eda-and-modelling
2. https://www.kaggle.com/ishandutta/coleridge-complete-eda-in-one-notebook
3. https://www.kaggle.com/anthokalel/coleridge-complete-eda
4. https://www.kaggle.com/harshsharma511/start-to-end-easy-understanding-eda-model
5. https://www.kaggle.com/ahmedewida/coleridge-model-using-gensim