<h2> Notebook Contents </h2>

In this notebook I do some Exploratory Data Analysis (EDA). I plan to add new sections over the coming weeks.

The sections are:


<div id="toc_container" style="background: #f9f9f9; border: 1px solid #aaa; display: table; font-size: 95%;
                               margin-bottom: 1em; padding: 20px; width: auto;">
<p class="toc_title" style="font-weight: 700; text-align: center">Notebook Contents</p>
<ul class="toc_list">
    <li><a href="#files">0. File Structure explained</a>
        <br>
    Here I draw the relationships between the different data files and print our file tree.<br>
    <ul>
    <li><a href="#visual">0.1. Visual Explanation of our data</a></li>
  </ul>
    </li>
  <li><a href="#train_csv">1. train.csv</a>
      <br>
    Here I compute some statistics for Datasets and Publications. 
      <br>
      <ul>
    <li><a href="#rules">1.1 Association Rules over datasets</a></li>
  </ul>
</li>
</ul>
</div>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.options.display.max_columns = 50
pd.options.display.max_colwidth  = 200
import os

! pip install --index-url https://test.pypi.org/simple/ PyARMViz
from dataclasses import dataclass
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import colorama
from colorama import Fore, Back, Style
import folium
import json
import geopandas as gpd
from PyARMViz import PyARMViz
from PyARMViz.Rule import generate_rule_from_dict

import re
import pyproj
from pyproj import Proj, transform

from shapely.ops import cascaded_union
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams.update({'figure.max_open_warning': 0})
plt.style.use('fivethirtyeight')
import seaborn as sns # visualization
import warnings # Suppress warnings
warnings.filterwarnings('ignore')

import plotly.graph_objs as go
from PIL import Image

from tqdm import tqdm

y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA
c_ = Fore.CYAN
sr_ = Style.RESET_ALL

def get_df_basic_information(df, color, df_name): 
    
    n_rows, n_columns = df.shape
    
    mb_size = round(df.memory_usage(deep=True).sum()/1000000., 3)
    
    print("""{0}{1}\n
          N rows: {2}\tN columns: {3}\n
          Memory Usage: {4} MB\n\n\n""".format(color, df_name,
                                           n_rows, n_columns, mb_size))

root_path = '/kaggle/input/coleridgeinitiative-show-us-the-data'
train_path = os.path.join(root_path, 'train')
test_path = os.path.join(root_path, 'test')


def pretty_files_root(root_path, max_enum = 5):
    
    print(c_ + str(root_path))
    train_path = os.path.join(root_path, 'train')
    test_path = os.path.join(root_path, 'test')
    ident = 1
    
    print("\t"*ident, y_ + str(os.path.join(root_path, 'train.csv')))
    print("\n")
    print("\t"*ident, y_ + str(os.path.join(root_path, 'sample_submission.csv')))
    print("\n")
    print("\t"*ident, y_ + train_path)
    files_list = os.listdir(train_path)
    
    for i, file in enumerate(files_list):
        new_ident = ident+1
        if i < max_enum:
            print('\t'*new_ident, b_ + str(file))
            
    print("\n")
    print("\t"*ident, y_ + test_path)
    files_list = os.listdir(test_path)
    
    for i, file in enumerate(files_list):
        new_ident = ident+1
        if i < max_enum:
            print('\t'*new_ident, b_ + str(file))
            

root_path = "/kaggle/input/coleridgeinitiative-show-us-the-data/"
train_path = os.path.join(root_path, 'train')
test_path = os.path.join(root_path, 'test')

<a id = "files"></a>
<h3> File Structure </h3>

The objective of the competition is to identify the mention of datasets within scientific publications. Your predictions will be short excerpts from the publications that appear to note a dataset. Predictions that more accurately match the precise words used to identify the dataset within the publication will score higher. Predictions should be cleaned using the clean_text function from the Evaluation page to ensure proper matching.

Publications are provided in JSON format, broken up into sections with section titles.

The goal in this competition is not just to match known dataset strings but to generalize to datasets that have never been seen before using NLP and statistical techniques. Not all datasets have been identified in train, but you have been provided enough information to generalize.

Note that the hidden test set has roughly 8000 publications, many times the size of the public test set. Plan your compute time accordingly.

**Files**

**train** - the full text of the training set's publications in JSON format, broken into sections with section **titles** <br>
**test** - the full text of the test set's publications in JSON format, broken into sections with section titles<br>
**train.csv** - labels and metadata for the training set<br>
**sample_submission.csv** - a sample submission file in the correct format<br>
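
As a rough illustration of the publication format, the sketch below writes a tiny made-up publication to a temporary file and reads it back. It assumes each JSON file holds a list of sections with `section_title` and `text` keys; these key names, the sample content, and the helpers `load_publication` and `publication_text` are assumptions for illustration, so check them against a real file in the train folder.

```python
import json
import os
import tempfile

def load_publication(path):
    """Read one publication file and return its sections as a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def publication_text(sections):
    """Join section titles and bodies into one string (hypothetical helper)."""
    return "\n".join(
        "{}\n{}".format(s.get("section_title", ""), s.get("text", "")) for s in sections
    )

# Made-up example; real files live under the train/ and test/ folders.
sample_sections = [
    {"section_title": "Introduction", "text": "We use survey data."},
    {"section_title": "Methods", "text": "Models were fit per cohort."},
]

# Write the sample to a temporary file so the example runs anywhere
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "example_publication.json")
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump(sample_sections, f)

sections = load_publication(sample_path)
full_text = publication_text(sections)
print(len(sections), "sections")
print(full_text)
```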

**Columns**

*id* - publication id - note that there are multiple rows for some training documents, indicating multiple mentioned datasets<br>
*pub_title* - title of the publication (a small number of publications have the same title)<br>
*dataset_title* - the title of the dataset that is mentioned within the publication<br>
*dataset_label* - a portion of the text that indicates the dataset<br>
*cleaned_label* - the dataset_label, as passed through the clean_text function from the Evaluation page
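
The `cleaned_label` column can be reproduced approximately with a function like the one below. This is a sketch from memory of the `clean_text` function on the Evaluation page (lowercase the text, then replace every run of non-alphanumeric characters with a single space); take the authoritative definition from the competition page. The sample label is only an illustrative input.

```python
import re

def clean_text(txt):
    """Lowercase the text and replace each run of non-alphanumeric characters with one space.

    Assumed to mirror the competition's clean_text; verify against the Evaluation page.
    """
    return re.sub(r"[^A-Za-z0-9]+", " ", str(txt).lower())

raw_label = "National Education Longitudinal Study (NELS:88)"  # illustrative input
cleaned = clean_text(raw_label)
print(repr(cleaned))
```

Note that this version does not strip leading or trailing spaces, so a label that ends in punctuation keeps a trailing space.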

<a id = "visual"></a>

<h5> Visual explanation of our Data </h5>

Our working directory tree (I limited the listing to the first 5 JSON files per folder; there are many more, of course):

In [None]:
pretty_files_root(root_path, 5)

<img src = "https://i.imgur.com/cNF2qCA.png" width="65%" align="left">
<img src = "https://i.imgur.com/OeVAdz7.png" width="35%" align="right">


<a id = "train_csv"></a>

<h5> Train.csv </h5>

In [None]:
train = pd.read_csv(os.path.join(root_path, 'train.csv'))
get_df_basic_information(train, y_, 'train.csv')

In [None]:
print("{}Some sample rows from train.csv".format(c_))
display(train.sample(3))

assert train.Id.nunique() == len(os.listdir(train_path)), "Number of Publications files does not coincide with number of ids"

print("{}There are {} unique dataset labels (cleaned_label) in train.csv".format(c_, train.cleaned_label.nunique()))

In [None]:
cmap_plot = plt.get_cmap('jet_r')

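# Count how many publications mention 1, 2, 3, ... distinct datasets.
# rename_axis + reset_index(name=...) yields the same column names on old and new pandas versions.
explo_df_1 = (train.groupby('Id')['cleaned_label'].nunique()
              .value_counts()
              .rename_axis('Datasets_in_Publication')
              .reset_index(name = 'Number_of_Publications'))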

print("{}Number of Datasets per Publication".format(c_))
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(18, 6),gridspec_kw={'width_ratios': [2.5, 1]})

explo_df_1.set_index('Datasets_in_Publication')['Number_of_Publications'].plot(kind = 'bar', 
                                                                               ax = ax0, 
                                                                               title = 'Number of Datasets per Publication Distribution',
                                                                               width = 0.7)

ax0.set_xlim([0, 10])
ax0.set_ylabel('Number_of_Publications')
font_size=14
bbox=[-0.2, 0, 1.2, 0.7]
ax1.axis('off')
ccolors = plt.cm.BuPu(np.full(len(explo_df_1.columns), 0.1))
mpl_table = ax1.table(cellText = explo_df_1.values, bbox=bbox, colLabels=explo_df_1.columns, colColours=ccolors)
mpl_table.auto_set_font_size(False)
mpl_table.set_fontsize(font_size)

In [None]:
print("{}Number of publications per Dataset".format(c_))

explo_df_2 = (train.groupby('cleaned_label')['Id'].nunique().rename('Number_of_Publications')
              .reset_index()
              .sort_values('Number_of_Publications', ascending = False, ignore_index = True))

fig, (ax0, ax1, ax2) = plt.subplots(1, 3, figsize=(30, 8),gridspec_kw={'width_ratios': [1.5, 0.6, 1]})

ax0.set_ylabel('Number_of_Datasets')
ax0.set_xlim(0, 800)
ax0.title.set_text('Number of Publications per Dataset distribution')

sns.histplot(explo_df_2['Number_of_Publications'], color = 'red', ax = ax0)

sns.violinplot(y = explo_df_2['Number_of_Publications'], ax=ax1, color = 'blue', orient = 'v', saturation = 0.4)
ax1.title.set_text('Violin Plot')
ax1.title.set_size(12)

bbox=[-0.2, 0, 1.2, 0.9]
ax2.axis('off')
ax2.title.set_text('Top 10 datasets per number of Publications')
ax2.title.set_size(12)

top_datasets = explo_df_2.sort_values('Number_of_Publications', ascending = False).head(10).rename(columns = {'Number_of_Publications': 'Num_publications'})

ccolors = plt.cm.BuPu(np.full(len(top_datasets.columns), 0.1))
mpl_table = ax2.table(cellText = top_datasets.values, bbox=bbox, colLabels=top_datasets.columns, colColours=ccolors)
mpl_table.auto_set_font_size(False)
mpl_table.auto_set_column_width(col=list(range(len(top_datasets.columns))))
mpl_table.set_fontsize(14)

In [None]:
print("{}Distribution of Number of publications per Dataset".format(c_))
display(explo_df_2.Number_of_Publications.quantile(np.linspace(0.05, 1, 20)).to_frame().transpose().rename_axis('Quantile'))

<a id = "rules"></a>
<h6> Association Rules Mining </h6>

See [this tutorial](https://stackabuse.com/association-rule-mining-via-apriori-algorithm-in-python/) for reference. 
There's nothing complex here: I'm just looking for datasets that appear together in publications. 

I'll treat each publication as a transaction whose items are the datasets it mentions. Hopefully, any patterns found now will help improve models later. 
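
To make the transaction view concrete, here is a minimal pure-Python sketch on made-up rows: it groups (publication id, dataset) pairs into one set of datasets per publication and counts how often each pair of datasets appears together. The ids, dataset names, and variable names are hypothetical; the cells below use `mlxtend` on the real data instead.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Made-up (publication id, dataset) rows standing in for train.csv
rows = [
    ("pub1", "adni"),
    ("pub1", "nels"),
    ("pub2", "adni"),
    ("pub3", "adni"),
    ("pub3", "nels"),
    ("pub3", "ecls"),
    ("pub3", "adni"),  # repeated mention, ignored because each basket is a set
]

# One "basket" (set of datasets) per publication
baskets = defaultdict(set)
for pub_id, dataset in rows:
    baskets[pub_id].add(dataset)

# Count unordered dataset pairs that share a publication
pair_counts = Counter()
for datasets in baskets.values():
    for pair in combinations(sorted(datasets), 2):
        pair_counts[pair] += 1

for pair, count in pair_counts.most_common():
    print(pair, count)
```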


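Given an item $A$, we define the **Support of $A$**, $S(A)$, as:

$S(A) = \frac{\text{number of transactions containing } A}{\text{number of transactions}}$

Here we can substitute *transactions* with *publications* and *item* with *dataset*. 

Given $2$ items $A$ and $B$, we define the **Confidence of $A \rightarrow B$**, $C(A \rightarrow B)$, as:

$C(A \rightarrow B) = \frac{\text{number of transactions containing } A \text{ and } B}{\text{number of transactions containing } A}$

Note that confidence is not symmetric: I may buy eggs many times without ever buying bacon, yet buy eggs every time I buy bacon. In that case $C(\text{bacon} \rightarrow \text{eggs})$ is high while $C(\text{eggs} \rightarrow \text{bacon})$ is low.

Given $2$ items $A$ and $B$, we define the **Lift of $A \rightarrow B$**, $L(A \rightarrow B)$, as:

$L(A \rightarrow B) = \frac{C(A \rightarrow B)}{S(B)}$

A lift above $1$ means that $A$ and $B$ appear together more often than would be expected if they were independent.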
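
The three definitions can be checked by hand on a toy example. The sketch below uses four made-up transactions and hypothetical helper names (`support`, `confidence`, `lift`); it only illustrates the formulas and is not the `mlxtend` implementation.

```python
# Four made-up transactions (think publications; the items play the role of datasets)
transactions = [
    {"eggs", "bacon"},
    {"eggs"},
    {"eggs", "milk"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """Share of the transactions containing `a` that also contain `b`."""
    return support({a, b}) / support({a})

def lift(a, b):
    """Confidence of a -> b relative to the baseline support of `b`."""
    return confidence(a, b) / support({b})

print("S(eggs)          =", support({"eggs"}))
print("C(bacon -> eggs) =", confidence("bacon", "eggs"))
print("C(eggs -> bacon) =", confidence("eggs", "bacon"))
print("L(bacon -> eggs) =", lift("bacon", "eggs"))
```

In this toy data every bacon purchase comes with eggs, so the confidence of bacon -> eggs is 1, while only one in three egg purchases comes with bacon.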



In [None]:
print("{}Market Basket Analysis: datasets occurring together".format(c_))
def to_list(x):
    return [x]

pd.options.display.max_colwidth = 300
train['cleaned_label_list'] = train['cleaned_label'].apply(to_list)

transactions = train.groupby('Id').cleaned_label_list.sum().rename('cleaned_label_list').reset_index()
te = TransactionEncoder()
te_ary = te.fit(transactions.cleaned_label_list).transform(transactions.cleaned_label_list)

df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = (apriori(df, min_support=0.001, use_colnames=True))
rules = (association_rules(frequent_itemsets, metric='support', min_threshold=0.005))
rules = rules.rename(columns = {'antecedents': 'dataset1', 'consequents': 'dataset2', 
                                'antecedent support': 'dataset1_support', 
                                'consequent support': 'dataset2_support'}).drop(['leverage', 'conviction'], axis = 1)

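# regex=True is required on recent pandas versions, where str.replace treats the pattern literally by default
rules['dataset1'] = rules['dataset1'].astype(str).str.replace(r"frozenset|\{|\}|\(|\)|'", "", regex=True)
rules['dataset2'] = rules['dataset2'].astype(str).str.replace(r"frozenset|\{|\}|\(|\)|'", "", regex=True)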

rules = (rules.merge(explo_df_2.rename(columns = {'cleaned_label': 'dataset1', 'Number_of_Publications': 'count1'}),
                    on = 'dataset1')
              .merge(explo_df_2.rename(columns = {'cleaned_label': 'dataset2', 'Number_of_Publications': 'count2'}),
                    on = 'dataset2'))

print("{} Top 5 Confidence".format(c_))
display(rules.sort_values('confidence', ascending = False, ignore_index = True).head(5))
print("{} Top 5 Support".format(c_))
display(rules.sort_values('support', ascending = False, ignore_index = True).head(5))
print("{} Top 5 Lift".format(c_))
display(rules.sort_values('lift', ascending = False, ignore_index = True).head(5))

Some rules have a confidence of 1.0, which is notable: it means that whenever dataset1 occurs in a publication, dataset2 occurs too, at least in our training data. 
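
A confidence of 1.0 says nothing about how many publications back the rule, so it helps to look at the raw counts as well. The sketch below uses a hypothetical helper (`rule_counts`) on made-up data to show a rule that holds in every case but rests on a single publication.

```python
def rule_counts(transactions, a, b):
    """Return (n_with_a, n_with_a_and_b, confidence) for the rule a -> b."""
    n_a = sum(a in t for t in transactions)
    n_ab = sum(a in t and b in t for t in transactions)
    conf = n_ab / n_a if n_a else float("nan")
    return n_a, n_ab, conf

# Made-up publications, each a set of dataset labels
publications = [
    {"rare dataset", "common dataset"},
    {"common dataset"},
    {"common dataset", "other dataset"},
    {"common dataset"},
]

n_a, n_ab, conf = rule_counts(publications, "rare dataset", "common dataset")
print("publications with dataset1:", n_a)
print("publications with both:    ", n_ab)
print("confidence:                ", conf)
```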

In [None]:
table_confidence = (rules[['dataset1', 'dataset2', 'confidence']]
        .sort_values('confidence', ascending = False, ignore_index = True).head(10))

table_confidence['confidence'] = table_confidence['confidence'].round(3)

bbox=[-0, 0, 1, 0.95]

fig, ax = plt.subplots(1, 1, figsize = (13, 6)) 
ax.axis('off')
ax.title.set_text('Top 10 dataset1->dataset2 rules per Confidence')
ax.title.set_size(12)
ccolors = plt.cm.BuPu(np.full(len(table_confidence.columns), 0.2))

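# Highlight the confidence cell of the top 3 rows; sized to the actual number of rows so the table never mismatches
n_rows = len(table_confidence)
cell_colours = [['w', 'w', 'r']]*min(3, n_rows) + [['w', 'w', 'w']]*max(n_rows - 3, 0)
mpl_table = ax.table(cellText = table_confidence.values, bbox=bbox, colLabels=table_confidence.columns, colColours=ccolors,
                     cellColours=cell_colours)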
mpl_table.auto_set_font_size(False)
mpl_table.auto_set_column_width(col=list(range(len(table_confidence.columns))))
mpl_table.set_fontsize(12)

In the coming days I hope to start on some proper text analysis.