# Show Us The Data

This competition challenges data scientists to show how publicly funded data are used to serve science and society. Evidence from data and science is critical if government is to address the many threats and challenges facing society, including pandemics, climate change, Alzheimer's disease, child hunger, increasing food production, and maintaining biodiversity. Yet much of the information about the data needed to inform evidence and science is locked inside publications.

Can natural language processing find the hidden-in-plain-sight data citations? Can machine learning find the link between the words used in research articles and the data referenced in the article?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline

from collections import Counter
import re
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords

In [None]:
file_train = '../input/coleridgeinitiative-show-us-the-data/train.csv'

df = pd.read_csv(file_train)
df.head()

In [None]:
file_sub = '../input/coleridgeinitiative-show-us-the-data/sample_submission.csv'
fs = pd.read_csv(file_sub)
fs.head()

# Start with EDA to understand the data better

## Exploring Information in Titles of Datasets

In [None]:
df.info()

## Information about the train dataset

In [None]:
print('===========Train Dataset has the following=========== ')
print(f"Unique Publication Titles: {df['pub_title'].nunique()}")
print(f"Unique Dataset Titles: {df['dataset_title'].nunique()}")
print(f"Unique Dataset Labels: {df['dataset_label'].nunique()}")
print(f"Unique Cleaned Labels: {df['cleaned_label'].nunique()}")


In [None]:
# Get titles with corresponding frequency in literature
df['dataset_title'].value_counts()


In [None]:
# Bar Plot showing the title with corresponding value counts

df['dataset_title'].value_counts().plot(kind='bar', figsize=(22, 8))
plt.title('Frequency of Dataset Usage in Literature', fontsize=24)
plt.show()

In [None]:
def clean_text(txt):
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()
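
To make the effect of `clean_text` concrete, here is a small standalone sketch. The sample strings are made up for illustration and are not taken from `train.csv`. It also shows one thing to watch for: because of the `str(txt)` call, a missing value is turned into the literal word `none` (and a pandas `NaN` would become `nan`) rather than an empty string.

```python
import re

def clean_text(txt):
    # Same rule as the notebook cell: lowercase, replace every run of
    # non-alphanumeric characters with one space, trim the ends.
    return re.sub('[^A-Za-z0-9]+', ' ', str(txt).lower()).strip()

# Hypothetical inputs, chosen only to exercise punctuation, padding and None.
samples = [
    "National Education Longitudinal Study (NELS:88)",
    "  COVID-19 Open Research Dataset  ",
    None,
]

for s in samples:
    print(repr(s), "->", repr(clean_text(s)))
```

If missing titles are possible, it may be worth dropping or filling them before applying the function so that `none` or `nan` does not show up as a word in later counts.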


In [None]:
title_cleaned = df['dataset_title'].apply(clean_text)
title_cleaned

In [None]:
words = title_cleaned.str.split().values.tolist()
title_corpus = [word for i in words for word in i]

title_counter = Counter(title_corpus)
title_most = title_counter.most_common()

stop = set(stopwords.words('english'))

title_top_words, title_top_words_count = [], []
for word, count in title_most[:100]:
    if word not in stop:
        title_top_words.append(word)
        title_top_words_count.append(count)
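
Note that the cell above slices the 100 most common words first and filters stopwords afterwards, so the resulting lists can hold fewer than 100 entries. A possible variant, sketched here under assumptions, filters stopwords while counting so that the requested number of words is returned whenever enough distinct words exist. The helper name `top_words`, the tiny `STOP` set and the sample titles are all hypothetical; in the notebook the stopword set would be the nltk one.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption for this sketch);
# the notebook itself uses nltk's English stopwords.
STOP = {"of", "and", "the", "in", "for", "on"}

def top_words(titles, n=25, stop=STOP):
    """Count words across already-cleaned titles, skipping stopwords,
    and return the n most common (word, count) pairs."""
    counter = Counter(
        word
        for title in titles
        for word in title.split()
        if word not in stop
    )
    return counter.most_common(n)

# Made-up cleaned titles, only to show the call.
sample_titles = [
    "survey of earned doctorates",
    "national education longitudinal study",
    "education longitudinal study",
]

for word, count in top_words(sample_titles, n=3):
    print(word, count)
```

Filtering before truncating keeps the final list length predictable, which matters when the plot title states a fixed number of words.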

In [None]:
len(title_top_words)

In [None]:
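top_n = 25  # number of words to plot; the title uses the same value so they stay in sync
sns.set(rc={'figure.figsize': (20, 16)}, font_scale=2)
sns.set_palette("tab10")  # color_palette() only returns a palette; set_palette() applies it
plt.title(f'Top-{top_n} title words', color='black', size=25)
sns.barplot(y=title_top_words[:top_n], x=title_top_words_count[:top_n])
plt.show()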

# Can we create labels for the field the data is related to?

## Example: [Disease, Alzheimer, Covid, ...] --> Medical field
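
One simple way to explore this idea is a keyword lookup: assign each dataset title to the field whose keywords it mentions most often. The sketch below is only a starting point under stated assumptions. The `FIELD_KEYWORDS` mapping, the field names and the `guess_field` helper are hypothetical and hand-picked for illustration; they are not derived from the competition data, and the example titles are illustrative too.

```python
import re

# Hypothetical keyword-to-field mapping; keywords and field names are
# illustrative assumptions and would need tuning against real titles.
FIELD_KEYWORDS = {
    "medical": {"disease", "alzheimer", "covid", "genome"},
    "education": {"education", "school", "students"},
    "agriculture": {"agricultural", "crop", "farm"},
}

def guess_field(title, default="other"):
    """Return the field whose keywords overlap most with the title's words.

    Falls back to `default` when no keyword matches.
    """
    words = set(re.findall(r"[a-z0-9]+", str(title).lower()))
    best_field, best_hits = default, 0
    for field, keywords in FIELD_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_field, best_hits = field, hits
    return best_field

# Illustrative titles only.
examples = [
    "Alzheimer's Disease Neuroimaging Initiative (ADNI)",
    "National Education Longitudinal Study",
    "Early Childhood Longitudinal Study",
]
for title in examples:
    print(title, "->", guess_field(title))
```

In the notebook this could be applied with something like `df['dataset_title'].apply(guess_field)` to get a rough field column. Titles with no matching keyword fall into `other`, which gives a quick view of how much of the data the keyword list fails to cover.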