# String matching V1
A first attempt at matching strings in publication text against known dataset labels and formatting the matches for submission ([Score](https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data/leaderboard): 0.47)

Workflow:
- check sentences for matching dataset labels and titles only

In [4]:
import glob
import re
import pandas as pd

In [5]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())
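
To make the effect of `clean_text` concrete, the sketch below applies it to two labels taken from the training sample in the Data section. One detail worth noticing: because the function does not strip its result, a label ending in a non-alphanumeric character (such as a closing parenthesis) comes back with a trailing space.

```python
import re

def clean_text(txt):
    # Same definition as above: lowercase, then collapse each run of
    # non-alphanumeric characters into a single space.
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

hyphenated = clean_text('genome sequence of SARS-CoV-2')
with_parens = clean_text("Alzheimer's Disease Neuroimaging Initiative (ADNI)")

print(repr(hyphenated))   # 'genome sequence of sars cov 2'
print(repr(with_parens))  # 'alzheimer s disease neuroimaging initiative adni ' (note the trailing space)
```

Calling `.strip()` on the result would remove that trailing space if it turns out to matter for scoring.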

## Data

Dataframe of training `pub_title`, `dataset_title`, `dataset_label`

In [6]:
df_train = pd.read_csv('../data/train.csv')
df_train.sample(n=5)

Unnamed: 0,Id,pub_title,dataset_title,dataset_label,cleaned_label
14073,69063cf9-0429-460b-8ded-c50a1c6dc971,Adult Competencies and Employment Outcomes amo...,Program for the International Assessment of Ad...,Program for the International Assessment of Ad...,program for the international assessment of ad...
4404,7e875ded-00ec-4570-8c5d-c39478bdf85d,Modeling and prediction of clinical symptom tr...,Alzheimer's Disease Neuroimaging Initiative (A...,ADNI,adni
19441,df14b3b5-ab24-46c4-8e05-9aa4e6d182a5,Exploring Diversity of COVID‑19 Based on Subst...,SARS-CoV-2 genome sequence,genome sequence of SARS-CoV-2,genome sequence of sars cov 2
9332,baa59b52-3793-4775-a539-190a0e059182,Chronic divalproex sodium use and brain atroph...,Alzheimer's Disease Neuroimaging Initiative (A...,Alzheimer's Disease Neuroimaging Initiative (A...,alzheimer s disease neuroimaging initiative adni
14325,6fa237ed-3dcf-4a21-b01b-be2f05ef5526,NATIONAL HURRICANE CENTER ANNUAL SUMMARY,NOAA Tide Gauge,NOAA Tide Gauge,noaa tide gauge


Dataframe of training json publications

In [7]:
train_files = glob.glob('../data/train/*.json')

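# Collect one dataframe per publication, then concatenate once
# (avoids re-copying the accumulated frame on every iteration).
train_frames = []
for train_file in train_files:
    file_data = pd.read_json(train_file)
    # Publication Id is the file name without its extension
    file_data.insert(0, 'Id', train_file.split('/')[-1].split('.')[0])
    train_frames.append(file_data)
df_train_pubs = pd.concat(train_frames)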

df_train_pubs['clean_text'] = df_train_pubs['text'].apply(clean_text)
df_train_pubs.head()

Unnamed: 0,Id,section_title,text,clean_text
0,07cbcedc-9f95-42e3-8340-468a866916b9,Abstract,"In this study, we highlight the importance of ...",in this study we highlight the importance of s...
1,07cbcedc-9f95-42e3-8340-468a866916b9,,consequences of these early work experiences a...,consequences of these early work experiences a...
2,07cbcedc-9f95-42e3-8340-468a866916b9,Socioeconomic Disadvantage and Early Work Expe...,Some scholars have suggested that long work ho...,some scholars have suggested that long work ho...
3,07cbcedc-9f95-42e3-8340-468a866916b9,The Youth Development Study,"To address these issues, we draw on data from ...",to address these issues we draw on data from t...
4,07cbcedc-9f95-42e3-8340-468a866916b9,Teenage Work and the Process of Socioeconomic ...,Precursors of Teenage Work. We first distingui...,precursors of teenage work we first distinguis...
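
The loading step can also be wrapped in a reusable helper. The following is a minimal sketch, not the notebook's own code: `load_publications` is a hypothetical name, the two toy JSON files stand in for the real `../data/train/*.json` files, and it assumes each file holds a list of `{section_title, text}` records as the output above suggests. Unlike the cell above, it resets the row index with `ignore_index=True`.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def load_publications(folder):
    """Load every *.json file in `folder` into one dataframe, one row per section."""
    frames = []
    for path in sorted(Path(folder).glob('*.json')):
        file_data = pd.read_json(path)
        file_data.insert(0, 'Id', path.stem)  # file name without the .json suffix
        frames.append(file_data)
    return pd.concat(frames, ignore_index=True)

# Hypothetical toy data standing in for the competition files
with tempfile.TemporaryDirectory() as tmp:
    sections = [{'section_title': 'Abstract', 'text': 'We use ADNI data.'},
                {'section_title': 'Methods', 'text': 'Details follow.'}]
    Path(tmp, 'pub-a.json').write_text(json.dumps(sections))
    Path(tmp, 'pub-b.json').write_text(json.dumps(sections[:1]))
    df_toy = load_publications(tmp)

print(df_toy)
```

The same helper could serve both the train and test folders, which would remove the duplicated loop.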


Dataframe of testing json publications

In [14]:
test_files = glob.glob('../data/test/*.json')

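# Collect one dataframe per publication, then concatenate once
test_frames = []
for test_file in test_files:
    file_data = pd.read_json(test_file)
    # Publication Id is the file name without its extension
    file_data.insert(0, 'Id', test_file.split('/')[-1].split('.')[0])
    test_frames.append(file_data)
df_test_pubs = pd.concat(test_frames)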

df_test_pubs['clean_text'] = df_test_pubs['text'].apply(clean_text)
df_test_pubs.head()

Unnamed: 0,Id,section_title,text,clean_text
0,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Introduction,A significant body of research has been conduc...,a significant body of research has been conduc...
1,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Literature review,We reviewed the literature that explored retai...,we reviewed the literature that explored retai...
2,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Food shopping patterns: where do people shop?,Diversification in the food retail sector offe...,diversification in the food retail sector offe...
3,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,Food shopping patterns: what do people buy?,"Many factors, including income, participation ...",many factors including income participation in...
4,8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,2,Anne Palmer et al. shopping at the same store ...,anne palmer et al shopping at the same store h...


Submission template

In [11]:
submission_df = pd.read_csv('../data/sample_submission.csv', index_col=0)
submission_df

Id,PredictionString
2100032a-7c33-4bff-97ef-690822c43466,
2f392438-e215-4169-bebf-21ac4ff253e1,
3f316b38-1a24-45a9-8d8c-4e05a42257c6,
8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,


Union of the unique cleaned dataset titles and the unique dataset labels, all lowercased

In [12]:
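# Titles are passed through clean_text (punctuation removed);
# labels keep their punctuation and are only lowercased.
clean_titles = set(df_train['dataset_title'].apply(clean_text).unique())
raw_labels = set(df_train['dataset_label'].unique())
dataset_titles = [x.lower() for x in clean_titles.union(raw_labels)]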
len(dataset_titles)

175
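
A toy version of this union helps show what ends up in the candidate list. The sketch below uses a hypothetical two-row stand-in for `df_train` (rows borrowed from the sample above) and the same cleaning rule. It illustrates two consequences of the construction: cleaned titles and raw labels can differ only by punctuation, and lowercasing after the set union can reintroduce duplicates, so the reported count may include repeated strings.

```python
import re

import pandas as pd

def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

# Hypothetical two-row stand-in for df_train
df_toy = pd.DataFrame({
    'dataset_title': ['SARS-CoV-2 genome sequence', 'NOAA Tide Gauge'],
    'dataset_label': ['genome sequence of SARS-CoV-2', 'NOAA Tide Gauge'],
})

clean_titles = set(df_toy['dataset_title'].apply(clean_text))
raw_labels = set(df_toy['dataset_label'])

# Lowercasing happens after the union, so 'NOAA Tide Gauge' (raw label)
# and 'noaa tide gauge' (cleaned title) both survive as separate entries.
candidates = sorted(x.lower() for x in clean_titles | raw_labels)
print(candidates)
```

Wrapping the final list in `set(...)` would be one way to drop such repeats, at the cost of changing the count.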

For each test publication, collect every dataset title or label that appears in its text and join the cleaned matches with `|`

In [15]:
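labels = []
for index in submission_df.index:
    # Concatenate all sections of this publication into one lowercased string
    publication_text = df_test_pubs[df_test_pubs['Id'] == index].text.str.cat(sep='\n').lower()
    label = []
    for dataset_title in dataset_titles:
        # Plain substring match against the raw (uncleaned) lowercased text
        if dataset_title in publication_text:
            label.append(clean_text(dataset_title))
    # Predictions are '|'-separated cleaned strings
    labels.append('|'.join(label))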

submission_df['PredictionString'] = labels
submission_df

Id,PredictionString
2100032a-7c33-4bff-97ef-690822c43466,adni|alzheimer s disease neuroimaging initiati...
2f392438-e215-4169-bebf-21ac4ff253e1,trends in international mathematics and scienc...
3f316b38-1a24-45a9-8d8c-4e05a42257c6,slosh model|noaa storm surge inundation
8e6996b4-ca08-4c0b-bed2-aaf07a4c6a60,rural urban continuum codes
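
The matching step above is a plain substring test, so a short label such as `adni` would also match inside a longer word, and two candidates that clean to the same string would appear twice in the prediction. The sketch below is one possible variant, not the method used in this notebook and not guaranteed to reproduce its score: the hypothetical helper `find_labels` cleans both sides, matches on whole words only, and deduplicates the result.

```python
import re

def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower())

def find_labels(publication_text, candidates):
    """Return cleaned candidates found as whole words, deduplicated and sorted."""
    # Pad with spaces so every word, including the first and last, is space-delimited
    haystack = ' ' + clean_text(publication_text).strip() + ' '
    found = set()
    for candidate in candidates:
        needle = clean_text(candidate).strip()
        if needle and ' ' + needle + ' ' in haystack:
            found.add(needle)
    return '|'.join(sorted(found))

# Hypothetical candidates and text for illustration
candidates = ['adni', "alzheimer's disease neuroimaging initiative (adni)", 'noaa tide gauge']
text = "Data came from the Alzheimer's Disease Neuroimaging Initiative (ADNI)."

prediction = find_labels(text, candidates)
print(prediction)
```

Because both the text and the candidates go through `clean_text`, this variant matches regardless of punctuation differences between a label and the way the publication writes it.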


In [None]:
# submission_df.to_csv("../results/submission.csv",index=True)