# Capstone Project: UN Sustainable Development Goals Text Classification

The United Nations created 17 goals for development, called the Sustainable Development Goals (SDGs). They consist of:

1. No Poverty
2. Zero Hunger
3. Good Health and Well-Being
4. Quality Education
5. Gender Equality
6. Clean Water and Sanitation
7. Affordable and Clean Energy
8. Decent Work and Economic Growth
9. Industry, Innovation, and Infrastructure
10. Reduced Inequalities
11. Sustainable Cities and Communities
12. Responsible Consumption and Production
13. Climate Action
14. Life Below Water
15. Life on Land
16. Peace, Justice and Strong Institutions
17. Partnerships for the Goals


Each SDG has several “targets”, or social outcomes that the UN hopes to achieve by 2030. Each of these targets is measured using a set of indicators. These indicators represent the quantitative measurements that will be used to judge whether each SDG target has been achieved or not by 2030.

I will classify text content by its relevance to the measurable indicators of the United Nations’ Sustainable Development Goal 3, Good Health and Well-Being (referred to as SDG 3). The indicator list below covers 13 targets (3.1 through 3.9 and 3.a through 3.d) and 27 indicators.

##### The SDG 3 Indicators

There are 27 possible SDG 3 indicators. In the dataset, each indicator is referred to by its code, e.g. "3.1.1":

- 3.1.1 - Maternal mortality ratio
- 3.1.2 - Proportion of births attended by skilled health personnel
- 3.2.1 - Under-5 mortality rate
- 3.2.2 - Neonatal mortality rate
- 3.3.1 - Number of new HIV infections per 1 000 uninfected population, by sex, age and key populations
- 3.3.2 - Tuberculosis incidence per 100 000 population
- 3.3.3 - Malaria incidence per 1 000 population
- 3.3.4 - Hepatitis B incidence per 100 000 population
- 3.3.5 - Number of people requiring interventions against neglected tropical diseases
- 3.4.1 - Mortality rate attributed to cardiovascular disease, cancer, diabetes or chronic respiratory disease
- 3.4.2 - Suicide mortality rate
- 3.5.1 - Coverage of treatment interventions (pharmacological, psychosocial and rehabilitation and aftercare services) for substance use disorders
- 3.5.2 - Harmful use of alcohol, defined according to the national context as alcohol per capita consumption (aged 15 years and older) within a calendar year in litres of pure alcohol
- 3.6.1 - Death rate due to road traffic injuries
- 3.7.1 - Proportion of women of reproductive age (aged 15–49 years) who have their need for family planning satisfied with modern methods
- 3.7.2 - Adolescent birth rate (aged 10–14 years; aged 15–19 years) per 1 000 women in that age group
- 3.8.1 - Coverage of essential health services (defined as the average coverage of essential services based on tracer interventions that include reproductive, maternal, newborn and child health, infectious diseases, non-communicable diseases and service capacity and access, among the general and the most disadvantaged population)
- 3.8.2 - Proportion of population with large household expenditures on health as a share of total household expenditure or income
- 3.9.1 - Mortality rate attributed to household and ambient air pollution
- 3.9.2 - Mortality rate attributed to unsafe water, unsafe sanitation and lack of hygiene (exposure to unsafe Water, Sanitation and Hygiene for All (WASH) services)
- 3.9.3 - Mortality rate attributed to unintentional poisoning
- 3.a.1 - Age-standardized prevalence of current tobacco use among persons aged 15 years and older
- 3.b.1 - Proportion of the target population covered by all vaccines included in their national programme
- 3.b.2 - Total net official development assistance to medical research and basic health sector
- 3.b.3 - Proportion of health facilities that have a core set of relevant essential medicines available and affordable on a sustainable basis
- 3.c.1 - Health worker density and distribution
- 3.d.1 - International Health Regulations (IHR) capacity and health emergency preparedness

### The Problem

Many government bodies, donors and developers struggle to identify which tenders, programs or news articles relate to which of the 27 SDG 3 indicators, and whether they are worth investment or partnership.

The aim of this project is to create a classifier that labels text content with the SDG 3 indicators, out of the 27, that are most “relevant.”

The model that classifies text content by relevant SDG indicators would be used to help algorithmically identify development projects, organizations, and documents that relate to the same SDG targets and indicators.

This would be useful for governments, development donors, and development implementers seeking to align resources, find partners, or perform outcomes-focused research.

---
# The Training Data

The training data includes approximately 3,000 web-scraped texts from tenders, programs, and documents, news articles about international development and humanitarian aid, and text descriptions of organizations involved in those sectors.

This dataset includes the "outcome", which in this competition is the label(s), i.e. which of the 27 indicators are relevant to the given text. It has the following columns:
- Unique ID: Unique ID of the text to be classified
- Type: The type or source of the text
- Text: The text to be classified
- Label 1 through Label 12

Label 1 through Label 12: These columns are populated starting at Label 1 and increasing incrementally until all relevant classifications are populated, up to a maximum of 12 labels. The remaining label columns are left blank.

Source: https://zindi.africa/competitions/sustainable-development-goals-sdgs-text-classification-challenge
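
Because the label columns are filled from left to right, the number of non-empty label cells in a row tells us how many indicators a text carries. The following is a minimal sketch on a made-up three-row frame (the `toy` data and its label strings are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy frame mimicking the column layout described above (values are made up).
toy = pd.DataFrame({
    "Unique ID": [1, 2, 3],
    "Label 1": ["3.b.2 - example", "3.3.2 - example", "3.d.1 - example"],
    "Label 2": ["3.c.1 - example", None, "3.8.1 - example"],
    "Label 3": [None, None, "3.8.2 - example"],
})

label_cols = [c for c in toy.columns if c.startswith("Label")]

# Labels are filled left to right, so counting the non-null cells
# in each row gives the number of indicators attached to that text.
labels_per_text = toy[label_cols].notna().sum(axis=1)
print(labels_per_text.tolist())
```

Run on the real training frame, the same count would show how many texts have one, two, or more labels.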

In [89]:
import nltk

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

import re  # the standard-library module supports every pattern used below

In [90]:
train = pd.read_csv('../data/Devex_train.csv')

In [91]:
train

Unnamed: 0,Unique ID,Type,Text,Label 1,Label 2,Label 3,Label 4,Label 5,Label 6,Label 7,Label 8,Label 9,Label 10,Label 11,Label 12
0,12555,Grant,Centers of Biomedical Research Excellence (COB...,3.b.2 - Total net official development assista...,3.c.1 - Health worker density and distribution,,,,,,,,,,
1,14108,Grant,Research on Regenerative Medicine <h2><strong>...,3.b.2 - Total net official development assista...,,,,,,,,,,,
2,23168,Organization,Catholic Health Association of India (CHAI): <...,3.d.1 - International Health Regulations (IHR)...,3.8.1 - Coverage of essential health services ...,3.8.2 - Proportion of population with large ho...,3.b.3 - Proportion of health facilities that h...,,,,,,,,
3,219512,Contract,Quality Improvement Initiatives for Diabetes,3.4.1 - Mortality rate attributed to cardiovas...,,,,,,,,,,,
4,274093,Tender,Provision of Thalassemia Drugs and Disposables...,3.3.5 - Number of people requiring interventio...,3.4.1 - Mortality rate attributed to cardiovas...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2990,92153,News,How rats could help reduce the global TB burden:,"3.3.2 - Tuberculosis incidence per 100,000 pop...",,,,,,,,,,,
2991,1209,Open Opp,Exploratory Analyses of Adherence Strategies a...,3.b.2 - Total net official development assista...,,,,,,,,,,,
2992,14342,Grant,Study on Vaccines for Diarrhoeal Diseases or L...,3.b.1 - Proportion of the target population co...,3.b.2 - Total net official development assista...,3.3.5 - Number of people requiring interventio...,,,,,,,,,
2993,12353,Grant,Regional Engagement Stimulation Fund on Human ...,"3.3.1 - Number of new HIV infections per 1,000...",,,,,,,,,,,


In [92]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2995 entries, 0 to 2994
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Unique ID  2995 non-null   int64  
 1   Type       2995 non-null   object 
 2   Text       2995 non-null   object 
 3   Label 1    2995 non-null   object 
 4   Label 2    1635 non-null   object 
 5   Label 3    738 non-null    object 
 6   Label 4    312 non-null    object 
 7   Label 5    142 non-null    object 
 8   Label 6    59 non-null     object 
 9   Label 7    21 non-null     object 
 10  Label 8    10 non-null     object 
 11  Label 9    4 non-null      object 
 12  Label 10   2 non-null      object 
 13  Label 11   0 non-null      float64
 14  Label 12   0 non-null      float64
dtypes: float64(2), int64(1), object(12)
memory usage: 351.1+ KB


In [93]:
train.fillna(0.0,inplace=True)

In [94]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2995 entries, 0 to 2994
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Unique ID  2995 non-null   int64  
 1   Type       2995 non-null   object 
 2   Text       2995 non-null   object 
 3   Label 1    2995 non-null   object 
 4   Label 2    2995 non-null   object 
 5   Label 3    2995 non-null   object 
 6   Label 4    2995 non-null   object 
 7   Label 5    2995 non-null   object 
 8   Label 6    2995 non-null   object 
 9   Label 7    2995 non-null   object 
 10  Label 8    2995 non-null   object 
 11  Label 9    2995 non-null   object 
 12  Label 10   2995 non-null   object 
 13  Label 11   2995 non-null   float64
 14  Label 12   2995 non-null   float64
dtypes: float64(2), int64(1), object(12)
memory usage: 351.1+ KB


In [95]:
sample = pd.read_csv('../data/Devex_submission_format.csv')

In [96]:
sample

Unnamed: 0,ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,11437,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
1,11474,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
2,11475,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0
3,11476,1.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,...,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0
4,11486,1.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,11375,,,,,,,,,,...,,,,,,,,,,
994,11390,,,,,,,,,,...,,,,,,,,,,
995,11406,,,,,,,,,,...,,,,,,,,,,
996,11428,,,,,,,,,,...,,,,,,,,,,


---
# Data Cleaning and Preprocessing

There are 2,995 texts to work with. I will explore what each text looks like and create a cleaning function to remove irrelevant content from the text.

Here are some sample texts:

In [97]:
train['Text']

0       Centers of Biomedical Research Excellence (COB...
1       Research on Regenerative Medicine <h2><strong>...
2       Catholic Health Association of India (CHAI): <...
3            Quality Improvement Initiatives for Diabetes
4       Provision of Thalassemia Drugs and Disposables...
                              ...                        
2990     How rats could help reduce the global TB burden:
2991    Exploratory Analyses of Adherence Strategies a...
2992    Study on Vaccines for Diarrhoeal Diseases or L...
2993    Regional Engagement Stimulation Fund on Human ...
2994    Graphic Design Services Consultancy ; ; <p><st...
Name: Text, Length: 2995, dtype: object

In [98]:
# example of one of the texts
train.Text[2]

'Catholic Health Association of India (CHAI): <p>The Catholic Health Association of India celebrates it 73 years of service. The organization has grown in terms of its membership, services and expanded the scope for encompassing and achieving the mission for which it was established in 1943. The organization has been shaped and nurtured by the visionaries who directed it and by the impact of national and international happenings. There have been paradigm shifts to meet the needs and to fulfill the vision and mission of reaching the poor and marginalized.</p>    <p>VISION&nbsp;</p>    <p>The Catholic Health Association of India upholds its commitment to bring &lsquo;health for all&rsquo;.</p>    <p>It views health as a state of complete physical, mental, social and spiritual well-being, and not merely the absence of sickness. Accordingly, CHAI envisions an INDIA wherein people,</p>    <p>are assured of clean air, water and environment;<br />do not suffer from any preventable disease;<br

### Normalization

The example shows that the texts contain a lot of HTML markup, `&nbsp` entities and non-letter characters. To make the texts consistent, I will normalize them.

This will be done by:
- Removing HTML tags
- Removing `&nbsp` entities
- Removing non-letters (symbols and numbers)
- Lowercasing everything
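
The four steps can also be sketched as one small function using only the standard library. This is an alternative to the step-by-step cells that follow, not the code used in this notebook: it decodes every HTML entity with `html.unescape` instead of removing `&nbsp` by name, and it replaces tags with a space so that words on either side of a tag cannot merge. The name `normalize` and the sample string are my own.

```python
import html
import re

def normalize(raw_text):
    """Strip tags, decode entities, keep letters only, lowercase."""
    no_tags = re.sub(r"<.*?>", " ", raw_text)          # drop HTML tags
    decoded = html.unescape(no_tags)                    # turn entities such as &nbsp; into characters
    letters_only = re.sub(r"[^a-zA-Z]", " ", decoded)   # replace every non-letter with a space
    return " ".join(letters_only.lower().split())      # lowercase and collapse whitespace

sample_text = "<p>VISION&nbsp;</p><p>bring &lsquo;health for all&rsquo; by 2030.</p>"
print(normalize(sample_text))  # vision bring health for all by
```

Decoding entities first means that names such as `lsquo` never end up as stray tokens in the cleaned text.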

In [99]:
test = train.Text[0]
test

'Centers of Biomedical Research Excellence (COBRE) Phase III - Transitional Centers     <p><strong>Funding Opportunity Description</strong></p>    <p><a name="_Toc258873267"></a>The Institutional Development Award (IDeA) Program endeavors to stimulate research at institutions in states that have not traditionally received significant levels of research funding from the NIH. Created through congressional mandate, the IDeA Program broadens the geographic distribution of NIH funding for competitive biomedical and behavioral research by enhancing the research capabilities of institutions in eligible states. The IDeA Program aims to achieve this goal through two major initiatives: (1) the IDeA Networks of Biomedical Research Excellence (INBRE), and (2) the Centers of Biomedical Research Excellence (COBRE).</p>    <p>The COBRE initiative seeks to develop unique, innovative, multidisciplinary, and collaborative state-of-the-art biomedical and behavioral research centers focused on a scientifi

In [100]:
#deleting the html marking
html = re.sub(r"<.*?>","",test)
html

'Centers of Biomedical Research Excellence (COBRE) Phase III - Transitional Centers     Funding Opportunity Description    The Institutional Development Award (IDeA) Program endeavors to stimulate research at institutions in states that have not traditionally received significant levels of research funding from the NIH. Created through congressional mandate, the IDeA Program broadens the geographic distribution of NIH funding for competitive biomedical and behavioral research by enhancing the research capabilities of institutions in eligible states. The IDeA Program aims to achieve this goal through two major initiatives: (1) the IDeA Networks of Biomedical Research Excellence (INBRE), and (2) the Centers of Biomedical Research Excellence (COBRE).    The COBRE initiative seeks to develop unique, innovative, multidisciplinary, and collaborative state-of-the-art biomedical and behavioral research centers focused on a scientific theme that is nascent or only minimally developed at applica

In [101]:
#removing &nbsp
nbsp = re.sub(r"&nbsp","",html)
nbsp

'Centers of Biomedical Research Excellence (COBRE) Phase III - Transitional Centers     Funding Opportunity Description    The Institutional Development Award (IDeA) Program endeavors to stimulate research at institutions in states that have not traditionally received significant levels of research funding from the NIH. Created through congressional mandate, the IDeA Program broadens the geographic distribution of NIH funding for competitive biomedical and behavioral research by enhancing the research capabilities of institutions in eligible states. The IDeA Program aims to achieve this goal through two major initiatives: (1) the IDeA Networks of Biomedical Research Excellence (INBRE), and (2) the Centers of Biomedical Research Excellence (COBRE).    The COBRE initiative seeks to develop unique, innovative, multidisciplinary, and collaborative state-of-the-art biomedical and behavioral research centers focused on a scientific theme that is nascent or only minimally developed at applica

In [102]:
#removing all numbers, symbols and non letters
letters_only = re.sub("[^a-zA-Z]", " ", nbsp)
letters_only

'Centers of Biomedical Research Excellence  COBRE  Phase III   Transitional Centers     Funding Opportunity Description    The Institutional Development Award  IDeA  Program endeavors to stimulate research at institutions in states that have not traditionally received significant levels of research funding from the NIH  Created through congressional mandate  the IDeA Program broadens the geographic distribution of NIH funding for competitive biomedical and behavioral research by enhancing the research capabilities of institutions in eligible states  The IDeA Program aims to achieve this goal through two major initiatives      the IDeA Networks of Biomedical Research Excellence  INBRE   and     the Centers of Biomedical Research Excellence  COBRE      The COBRE initiative seeks to develop unique  innovative  multidisciplinary  and collaborative state of the art biomedical and behavioral research centers focused on a scientific theme that is nascent or only minimally developed at applica

In [103]:
# lowercasing everything and splitting into a list of words
lowercase = letters_only.lower().split()
lowercase

['centers',
 'of',
 'biomedical',
 'research',
 'excellence',
 'cobre',
 'phase',
 'iii',
 'transitional',
 'centers',
 'funding',
 'opportunity',
 'description',
 'the',
 'institutional',
 'development',
 'award',
 'idea',
 'program',
 'endeavors',
 'to',
 'stimulate',
 'research',
 'at',
 'institutions',
 'in',
 'states',
 'that',
 'have',
 'not',
 'traditionally',
 'received',
 'significant',
 'levels',
 'of',
 'research',
 'funding',
 'from',
 'the',
 'nih',
 'created',
 'through',
 'congressional',
 'mandate',
 'the',
 'idea',
 'program',
 'broadens',
 'the',
 'geographic',
 'distribution',
 'of',
 'nih',
 'funding',
 'for',
 'competitive',
 'biomedical',
 'and',
 'behavioral',
 'research',
 'by',
 'enhancing',
 'the',
 'research',
 'capabilities',
 'of',
 'institutions',
 'in',
 'eligible',
 'states',
 'the',
 'idea',
 'program',
 'aims',
 'to',
 'achieve',
 'this',
 'goal',
 'through',
 'two',
 'major',
 'initiatives',
 'the',
 'idea',
 'networks',
 'of',
 'biomedical',
 'resear

### Stop Words

Some words in the English language, while necessary, don't contribute much to the meaning of a phrase. These words, such as "when", "had", "those" or "before", are called stop words and should be filtered out. 
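
As a minimal illustration of the filtering step, here is a sketch with a tiny hand-picked stop word set. The set `toy_stop_words` is an assumption made for this example; the cells below use NLTK's much longer English list.

```python
# A tiny hand-picked stop word set, for illustration only.
toy_stop_words = {"the", "of", "to", "at", "in", "that", "have", "not", "from"}

tokens = ["the", "program", "endeavors", "to", "stimulate",
          "research", "at", "institutions"]

# Keep a token only if it is not a stop word.
# Membership tests on a set take constant time on average.
kept = [t for t in tokens if t not in toy_stop_words]
print(kept)  # ['program', 'endeavors', 'stimulate', 'research', 'institutions']
```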

In [104]:
stop_words = nltk.corpus.stopwords.words('english')

In [105]:
# removing stop words
words = [w for w in lowercase if w not in stop_words]
words

['centers',
 'biomedical',
 'research',
 'excellence',
 'cobre',
 'phase',
 'iii',
 'transitional',
 'centers',
 'funding',
 'opportunity',
 'description',
 'institutional',
 'development',
 'award',
 'idea',
 'program',
 'endeavors',
 'stimulate',
 'research',
 'institutions',
 'states',
 'traditionally',
 'received',
 'significant',
 'levels',
 'research',
 'funding',
 'nih',
 'created',
 'congressional',
 'mandate',
 'idea',
 'program',
 'broadens',
 'geographic',
 'distribution',
 'nih',
 'funding',
 'competitive',
 'biomedical',
 'behavioral',
 'research',
 'enhancing',
 'research',
 'capabilities',
 'institutions',
 'eligible',
 'states',
 'idea',
 'program',
 'aims',
 'achieve',
 'goal',
 'two',
 'major',
 'initiatives',
 'idea',
 'networks',
 'biomedical',
 'research',
 'excellence',
 'inbre',
 'centers',
 'biomedical',
 'research',
 'excellence',
 'cobre',
 'cobre',
 'initiative',
 'seeks',
 'develop',
 'unique',
 'innovative',
 'multidisciplinary',
 'collaborative',
 'state',

### Stemming, Tokenizing and Lemmatizing

These three methods split text into individual words (tokenizing) and reduce words to their stem (stemming) or dictionary form (lemmatizing). I will explore what each process does to the sample text.
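
To see what a `\w+` tokenizer does with raw markup, here is a standard-library sketch. I am assuming that `re.findall(r"\w+", ...)` is a fair stand-in for `RegexpTokenizer(r'\w+')`; the sample string is made up.

```python
import re

raw = "<p><strong>Funding Opportunity</strong></p>"

# \w+ matches runs of letters, digits and underscores, so tag names
# such as 'p' and 'strong' come back as tokens too.
raw_tokens = re.findall(r"\w+", raw)
print(raw_tokens)    # ['p', 'strong', 'Funding', 'Opportunity', 'strong', 'p']

# Stripping the tags first leaves only the visible words.
clean_tokens = re.findall(r"\w+", re.sub(r"<.*?>", " ", raw))
print(clean_tokens)  # ['Funding', 'Opportunity']
```

This is why tokenizing should come after the HTML has been removed.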

In [106]:
porter = PorterStemmer()
ps = [porter.stem(i) for i in words]
ps

['center',
 'biomed',
 'research',
 'excel',
 'cobr',
 'phase',
 'iii',
 'transit',
 'center',
 'fund',
 'opportun',
 'descript',
 'institut',
 'develop',
 'award',
 'idea',
 'program',
 'endeavor',
 'stimul',
 'research',
 'institut',
 'state',
 'tradit',
 'receiv',
 'signific',
 'level',
 'research',
 'fund',
 'nih',
 'creat',
 'congression',
 'mandat',
 'idea',
 'program',
 'broaden',
 'geograph',
 'distribut',
 'nih',
 'fund',
 'competit',
 'biomed',
 'behavior',
 'research',
 'enhanc',
 'research',
 'capabl',
 'institut',
 'elig',
 'state',
 'idea',
 'program',
 'aim',
 'achiev',
 'goal',
 'two',
 'major',
 'initi',
 'idea',
 'network',
 'biomed',
 'research',
 'excel',
 'inbr',
 'center',
 'biomed',
 'research',
 'excel',
 'cobr',
 'cobr',
 'initi',
 'seek',
 'develop',
 'uniqu',
 'innov',
 'multidisciplinari',
 'collabor',
 'state',
 'art',
 'biomed',
 'behavior',
 'research',
 'center',
 'focus',
 'scientif',
 'theme',
 'nascent',
 'minim',
 'develop',
 'applic',
 'institut',
 

In [107]:
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(test)
tokens

['Centers',
 'of',
 'Biomedical',
 'Research',
 'Excellence',
 'COBRE',
 'Phase',
 'III',
 'Transitional',
 'Centers',
 'p',
 'strong',
 'Funding',
 'Opportunity',
 'Description',
 'strong',
 'p',
 'p',
 'a',
 'name',
 '_Toc258873267',
 'a',
 'The',
 'Institutional',
 'Development',
 'Award',
 'IDeA',
 'Program',
 'endeavors',
 'to',
 'stimulate',
 'research',
 'at',
 'institutions',
 'in',
 'states',
 'that',
 'have',
 'not',
 'traditionally',
 'received',
 'significant',
 'levels',
 'of',
 'research',
 'funding',
 'from',
 'the',
 'NIH',
 'Created',
 'through',
 'congressional',
 'mandate',
 'the',
 'IDeA',
 'Program',
 'broadens',
 'the',
 'geographic',
 'distribution',
 'of',
 'NIH',
 'funding',
 'for',
 'competitive',
 'biomedical',
 'and',
 'behavioral',
 'research',
 'by',
 'enhancing',
 'the',
 'research',
 'capabilities',
 'of',
 'institutions',
 'in',
 'eligible',
 'states',
 'The',
 'IDeA',
 'Program',
 'aims',
 'to',
 'achieve',
 'this',
 'goal',
 'through',
 'two',
 'major

In [108]:
lemmatizer = WordNetLemmatizer()
lem = [lemmatizer.lemmatize(i) for i in words]
lem

['center',
 'biomedical',
 'research',
 'excellence',
 'cobre',
 'phase',
 'iii',
 'transitional',
 'center',
 'funding',
 'opportunity',
 'description',
 'institutional',
 'development',
 'award',
 'idea',
 'program',
 'endeavor',
 'stimulate',
 'research',
 'institution',
 'state',
 'traditionally',
 'received',
 'significant',
 'level',
 'research',
 'funding',
 'nih',
 'created',
 'congressional',
 'mandate',
 'idea',
 'program',
 'broadens',
 'geographic',
 'distribution',
 'nih',
 'funding',
 'competitive',
 'biomedical',
 'behavioral',
 'research',
 'enhancing',
 'research',
 'capability',
 'institution',
 'eligible',
 'state',
 'idea',
 'program',
 'aim',
 'achieve',
 'goal',
 'two',
 'major',
 'initiative',
 'idea',
 'network',
 'biomedical',
 'research',
 'excellence',
 'inbre',
 'center',
 'biomedical',
 'research',
 'excellence',
 'cobre',
 'cobre',
 'initiative',
 'seek',
 'develop',
 'unique',
 'innovative',
 'multidisciplinary',
 'collaborative',
 'state',
 'art',
 'biom

### Clean Text Function

I combined the preprocessing steps explored above into a single function that cleans a text sample. It lemmatizes rather than stems, so the output keeps readable words.

In [109]:
def text_to_words(raw_text):
    # 1. Keep letters only
    html = re.sub(r"<.*?>", "", raw_text)  # remove HTML tags
    nbsp = re.sub(r"&nbsp", "", html)  # remove &nbsp entities
    rsquo = re.sub(r"&rsquo", "", nbsp)  # remove &rsquo entities
    letters_only = re.sub("[^a-zA-Z]", " ", rsquo)  # replace all non-letters with spaces

    # 2. Convert to lower case and split into individual words
    lowercase = letters_only.lower().split()

    # 3. Remove stop words (a set makes the membership test fast)
    stop_words = set(nltk.corpus.stopwords.words('english'))
    # Domain words that appear in almost every text.
    # Note: 'clinical trial' is two words, so it never matches a single token.
    new_stops = ['health','program','service','country','support','research','project','development','clinical trial']
    stop_words.update(new_stops)

    words = [w for w in lowercase if w not in stop_words]

    # 4. Lemmatize each remaining word
    lemmatizer = WordNetLemmatizer()
    lem = [lemmatizer.lemmatize(i) for i in words]

    # 5. Join the words back into one string separated by spaces
    return " ".join(lem)

In [110]:
testing = '<p><strong>Funding Opportunity Description</strong></p>    <p><a name="_Toc258873267"></a>The Institutional Development Award (IDeA) health program with service support'
print('original text:',testing)
print('clean text:',text_to_words(testing))

original text: <p><strong>Funding Opportunity Description</strong></p>    <p><a name="_Toc258873267"></a>The Institutional Development Award (IDeA) health program with service support
clean text: funding opportunity description institutional award idea


### Cleaning the Text

Now we will use the function on the provided text.

In [111]:
clean_text=[]

for text in train.Text:
    clean_text.append(text_to_words(text))

In [112]:
clean_text

['center biomedical excellence cobre phase iii transitional center funding opportunity description institutional award idea endeavor stimulate institution state traditionally received significant level funding nih created congressional mandate idea broadens geographic distribution nih funding competitive biomedical behavioral enhancing capability institution eligible state idea aim achieve goal two major initiative idea network biomedical excellence inbre center biomedical excellence cobre cobre initiative seek develop unique innovative multidisciplinary collaborative state art biomedical behavioral center focused scientific theme nascent minimally developed applicant institution accomplished nurturing expanding critical mass competitive biomedical investigator intensive career advising emerging faculty aggressive recruitment seasoned investigator enhancing environment infrastructure establishment critical core resource cobre consists three sequential five year phase phase focus requis

In [113]:
train['clean_text'] = clean_text
train

Unnamed: 0,Unique ID,Type,Text,Label 1,Label 2,Label 3,Label 4,Label 5,Label 6,Label 7,Label 8,Label 9,Label 10,Label 11,Label 12,clean_text
0,12555,Grant,Centers of Biomedical Research Excellence (COB...,3.b.2 - Total net official development assista...,3.c.1 - Health worker density and distribution,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,center biomedical excellence cobre phase iii t...
1,14108,Grant,Research on Regenerative Medicine <h2><strong>...,3.b.2 - Total net official development assista...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,regenerative medicine introduction support tra...
2,23168,Organization,Catholic Health Association of India (CHAI): <...,3.d.1 - International Health Regulations (IHR)...,3.8.1 - Coverage of essential health services ...,3.8.2 - Proportion of population with large ho...,3.b.3 - Proportion of health facilities that h...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,catholic association india chai catholic assoc...
3,219512,Contract,Quality Improvement Initiatives for Diabetes,3.4.1 - Mortality rate attributed to cardiovas...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,quality improvement initiative diabetes
4,274093,Tender,Provision of Thalassemia Drugs and Disposables...,3.3.5 - Number of people requiring interventio...,3.4.1 - Mortality rate attributed to cardiovas...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,provision thalassemia drug disposable backgrou...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2990,92153,News,How rats could help reduce the global TB burden:,"3.3.2 - Tuberculosis incidence per 100,000 pop...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,rat could help reduce global tb burden
2991,1209,Open Opp,Exploratory Analyses of Adherence Strategies a...,3.b.2 - Total net official development assista...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,exploratory analysis adherence strategy data s...
2992,14342,Grant,Study on Vaccines for Diarrhoeal Diseases or L...,3.b.1 - Proportion of the target population co...,3.b.2 - Total net official development assista...,3.3.5 - Number of people requiring interventio...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,study vaccine diarrhoeal disease lower respira...
2993,12353,Grant,Regional Engagement Stimulation Fund on Human ...,"3.3.1 - Number of new HIV infections per 1,000...",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,regional engagement stimulation fund human imm...


---
# Feature Engineering

### Generating Labels

The labels combine the indicator code and the indicator name, so we will remove the names and keep only the indicator code (e.g. 3.1.1).

Then we will encode the labels as one column per indicator, using 1 (related) and 0 (not related) to show which text relates to which indicator.
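
The encoding can be sketched without an explicit row loop. This is a hedged alternative on a made-up two-row frame, not the code used below: the `toy` data and the three-code list `codes` are assumptions standing in for the 12 label columns and the 27 indicator columns.

```python
import pandas as pd

# Made-up label columns in the "code - name" format described above.
toy = pd.DataFrame({
    "Label 1": ["3.b.2 - example name", "3.3.2 - example name"],
    "Label 2": ["3.c.1 - example name", None],
})
codes = ["3.3.2", "3.b.2", "3.c.1"]  # hypothetical subset of the 27 indicator codes

# Keep only the code in front of " - "; missing labels stay missing.
code_only = toy.apply(lambda col: col.str.split(" - ", n=1).str[0])

# One column per indicator: 1 if the code appears in any label column of the row.
encoded = pd.DataFrame(
    {code: code_only.eq(code).any(axis=1).astype(int) for code in codes}
)
print(encoded)
```

Each row of `encoded` is the multi-label target for one text, in the same wide format as the sample submission.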

In [114]:
#labels = train[['Label 1', 'Label 2', 'Label 3','Label 4', 'Label 5', 'Label 6','Label 7','Label 8', 'Label 9','Label 10', 'Label 11', 'Label 12']].applymap(str).applymap(lambda x: x.split('-')[0].strip())
#labels

In [115]:
clean_train_nolabel = train[['Unique ID','Type','clean_text']]
clean_train_nolabel

Unnamed: 0,Unique ID,Type,clean_text
0,12555,Grant,center biomedical excellence cobre phase iii t...
1,14108,Grant,regenerative medicine introduction support tra...
2,23168,Organization,catholic association india chai catholic assoc...
3,219512,Contract,quality improvement initiative diabetes
4,274093,Tender,provision thalassemia drug disposable backgrou...
...,...,...,...
2990,92153,News,rat could help reduce global tb burden
2991,1209,Open Opp,exploratory analysis adherence strategy data s...
2992,14342,Grant,study vaccine diarrhoeal disease lower respira...
2993,12353,Grant,regional engagement stimulation fund human imm...


In [116]:
# creating empty indicator columns to match the sample submission format
label_col = pd.DataFrame(columns=sample.columns[1:])
label_col

Unnamed: 0,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,3.4.1,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1


In [117]:
clean_train = pd.concat([clean_train_nolabel,label_col],axis=1)
clean_train.fillna(0,inplace=True)
clean_train

Unnamed: 0,Unique ID,Type,clean_text,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,12555,Grant,center biomedical excellence cobre phase iii t...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,14108,Grant,regenerative medicine introduction support tra...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,23168,Organization,catholic association india chai catholic assoc...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,219512,Contract,quality improvement initiative diabetes,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,274093,Tender,provision thalassemia drug disposable backgrou...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2990,92153,News,rat could help reduce global tb burden,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2991,1209,Open Opp,exploratory analysis adherence strategy data s...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2992,14342,Grant,study vaccine diarrhoeal disease lower respira...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2993,12353,Grant,regional engagement stimulation fund human imm...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [118]:
# copying labels from the train table into the clean_train indicator columns
for i in range(len(train)):
    for col in range(3, 15):  # column positions of Label 1 through Label 12
        if train.iloc[i, col] != 0:  # 0 marks an empty label after fillna
            labeling = train.iloc[i, col][0:5]  # first five characters are the indicator code, e.g. "3.b.2"
            clean_train.at[i, labeling] = 1

In [119]:
clean_train

Unnamed: 0,Unique ID,Type,clean_text,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,12555,Grant,center biomedical excellence cobre phase iii t...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
1,14108,Grant,regenerative medicine introduction support tra...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,23168,Organization,catholic association india chai catholic assoc...,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,1
3,219512,Contract,quality improvement initiative diabetes,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,274093,Tender,provision thalassemia drug disposable backgrou...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2990,92153,News,rat could help reduce global tb burden,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
2991,1209,Open Opp,exploratory analysis adherence strategy data s...,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2992,14342,Grant,study vaccine diarrhoeal disease lower respira...,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,0
2993,12353,Grant,regional engagement stimulation fund human imm...,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [120]:
clean_train.columns

Index(['Unique ID', 'Type', 'clean_text', '3.1.1', '3.1.2', '3.2.1', '3.2.2',
       '3.3.1', '3.3.2', '3.3.3', '3.3.4', '3.3.5', '3.4.1', '3.4.2', '3.5.1',
       '3.5.2', '3.6.1', '3.7.1', '3.7.2', '3.8.1', '3.8.2', '3.9.1', '3.9.2',
       '3.9.3', '3.a.1', '3.b.1', '3.b.2', '3.b.3', '3.c.1', '3.d.1'],
      dtype='object')

In [121]:
clean_train.to_csv('../data/clean_train.csv', index=False)

### Word Count and Dummy Variables

In [122]:
clean_train['word_count']=train['Text'].apply(lambda x: len(x.split()))

In [123]:
clean_train['unique_word'] = train['Text'].apply(lambda x: len(set(x.split())))

In [124]:
type_dummy = pd.get_dummies(train.Type)
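
As a small illustration of what `pd.get_dummies` produces for a categorical column, here is a sketch on a toy `Type` series. The `toy_*` names are hypothetical. Passing `dtype=int` is an assumption made here so the result shows 0/1 values like the tables in this notebook; some pandas versions return booleans by default.

```python
import pandas as pd

# Toy stand-in for train.Type (values taken from the Type categories seen above).
toy_types = pd.Series(["Grant", "News", "Grant", "Tender"], name="Type")

# One indicator column per distinct category, in sorted order.
toy_dummy = pd.get_dummies(toy_types, dtype=int)

print(toy_dummy)
```

Each row has exactly one column set to 1, so the dummy columns can be concatenated next to the other features without losing the original category.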

In [125]:
feature_train = pd.concat([clean_train,type_dummy],axis=1)
feature_train = pd.concat([feature_train,train.Text],axis=1)
feature_train

Unnamed: 0,Unique ID,Type,clean_text,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,...,unique_word,Contract,Funding Info,Grant,News,Open Opp,Organization,Program,Tender,Text
0,12555,Grant,center biomedical excellence cobre phase iii t...,0,0,0,0,0,0,0,...,221,0,0,1,0,0,0,0,0,Centers of Biomedical Research Excellence (COB...
1,14108,Grant,regenerative medicine introduction support tra...,0,0,0,0,0,0,0,...,170,0,0,1,0,0,0,0,0,Research on Regenerative Medicine <h2><strong>...
2,23168,Organization,catholic association india chai catholic assoc...,0,0,0,0,0,0,0,...,192,0,0,0,0,0,1,0,0,Catholic Health Association of India (CHAI): <...
3,219512,Contract,quality improvement initiative diabetes,0,0,0,0,0,0,0,...,5,1,0,0,0,0,0,0,0,Quality Improvement Initiatives for Diabetes
4,274093,Tender,provision thalassemia drug disposable backgrou...,0,0,0,0,0,0,0,...,291,0,0,0,0,0,0,0,1,Provision of Thalassemia Drugs and Disposables...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2990,92153,News,rat could help reduce global tb burden,0,0,0,0,0,1,0,...,9,0,0,0,1,0,0,0,0,How rats could help reduce the global TB burden:
2991,1209,Open Opp,exploratory analysis adherence strategy data s...,0,0,0,0,0,0,0,...,966,0,0,0,0,1,0,0,0,Exploratory Analyses of Adherence Strategies a...
2992,14342,Grant,study vaccine diarrhoeal disease lower respira...,0,0,0,0,0,0,0,...,69,0,0,1,0,0,0,0,0,Study on Vaccines for Diarrhoeal Diseases or L...
2993,12353,Grant,regional engagement stimulation fund human imm...,0,0,0,0,1,0,0,...,319,0,0,1,0,0,0,0,0,Regional Engagement Stimulation Fund on Human ...


In [126]:
feature_train.to_csv('../data/feature_train.csv', index=False)

---
# Test Data

The test dataset contains the columns `Unique ID`, `Type`, and `Text`. Unlike the training data, it has no label columns.

In [127]:
test = pd.read_csv('../data/Devex_test_questions.csv')

In [128]:
test

Unnamed: 0,Unique ID,Type,Text
0,49848,Organization,4th Sector Health: <p>4th Sector Health is a U...
1,52348,Organization,Action for Global Health: <p>Action for Global...
2,103541,Organization,Scottish Association for Mental Health (SAMH):...
3,52382,Organization,Singapore Immunology Network: <p>The Singapore...
4,47212,Organization,Coastal Conservation and Education Foundation ...
...,...,...,...
993,38108,Program,\r Ulaanbaatar Air Quality Improvement Program...
994,30360,Program,Supporting National Urban Health Mission<p>Thr...
995,33883,Program,WHO Contingency Fund - 2015<p>The project aims...
996,36296,Program,Lebanon Health Resilience Project<p>The projec...


### Clean Test Data

In [129]:
# Clean the test text with the same text_to_words function used for the training data
clean_test_text = [text_to_words(text) for text in test.Text]

In [130]:
clean_test_text

['th sector th sector usaid initiative form alliance exchange advance latin america caribbean mobilize business community expert bring sustainable solution public issue affect equitable maternal child reproductive disease hiv aid solving problem together th sector seek cultivate locally driven solution challenge latin america caribbean use two approach mobilize various sector region use share resource towards maximizing impact alliance develop public private alliance private company ngo government foundation develop fund implement activity invest usaid fund leverage partner contribution foster greater private sector involvement latin america caribbean alliance also help build capacity partner work across sector improve region exchange exchange across country latin america caribbean enable south south cross fertilization expertise best practice lesson learned program team th sector usaid initiative led abt associate inc collaboration rti international forum one communication service con

In [131]:
test['clean_text'] = clean_test_text
test

Unnamed: 0,Unique ID,Type,Text,clean_text
0,49848,Organization,4th Sector Health: <p>4th Sector Health is a U...,th sector th sector usaid initiative form alli...
1,52348,Organization,Action for Global Health: <p>Action for Global...,action global action global afgh broad europea...
2,103541,Organization,Scottish Association for Mental Health (SAMH):...,scottish association mental samh around since ...
3,52382,Organization,Singapore Immunology Network: <p>The Singapore...,singapore immunology network singapore immunol...
4,47212,Organization,Coastal Conservation and Education Foundation ...,coastal conservation education foundation ccef...
...,...,...,...,...
993,38108,Program,\r Ulaanbaatar Air Quality Improvement Program...,ulaanbaatar air quality improvement government...
994,30360,Program,Supporting National Urban Health Mission<p>Thr...,supporting national urban missionthrough adb r...
995,33883,Program,WHO Contingency Fund - 2015<p>The project aims...,contingency fund aim contingency fund better p...
996,36296,Program,Lebanon Health Resilience Project<p>The projec...,lebanon resilience projectthe objective pdo in...


In [132]:
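# .copy() makes clean_test an independent frame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning
clean_test = test[['Unique ID', 'Type', 'clean_text']].copy()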
clean_test

Unnamed: 0,Unique ID,Type,clean_text
0,49848,Organization,th sector th sector usaid initiative form alli...
1,52348,Organization,action global action global afgh broad europea...
2,103541,Organization,scottish association mental samh around since ...
3,52382,Organization,singapore immunology network singapore immunol...
4,47212,Organization,coastal conservation education foundation ccef...
...,...,...,...
993,38108,Program,ulaanbaatar air quality improvement government...
994,30360,Program,supporting national urban missionthrough adb r...
995,33883,Program,contingency fund aim contingency fund better p...
996,36296,Program,lebanon resilience projectthe objective pdo in...


In [133]:
#feature engineering
clean_test['word_count']=test['Text'].apply(lambda x: len(x.split()))

In [134]:
clean_test['unique_word'] = test['Text'].apply(lambda x: len(set(x.split())))
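
Because the same two count features are computed for both the train and test sets, one option is to wrap them in a small helper. The sketch below is only an illustration: the function name `add_count_features` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def add_count_features(df, text_col="Text"):
    """Return a copy of df with word_count and unique_word columns added."""
    out = df.copy()
    tokens = out[text_col].apply(lambda x: x.split())  # whitespace tokenization
    out["word_count"] = tokens.apply(len)
    out["unique_word"] = tokens.apply(lambda words: len(set(words)))
    return out

# Toy frame standing in for the raw text column.
toy = pd.DataFrame({"Text": ["Quality Improvement Initiatives for Diabetes",
                             "tb tb burden"]})
toy_features = add_count_features(toy)

print(toy_features[["word_count", "unique_word"]])
```

The helper works on a copy, so the input frame is left unchanged and the same function can be applied to the train and test frames in turn.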

In [135]:
clean_test

Unnamed: 0,Unique ID,Type,clean_text,word_count,unique_word
0,49848,Organization,th sector th sector usaid initiative form alli...,470,250
1,52348,Organization,action global action global afgh broad europea...,844,438
2,103541,Organization,scottish association mental samh around since ...,62,55
3,52382,Organization,singapore immunology network singapore immunol...,133,101
4,47212,Organization,coastal conservation education foundation ccef...,341,209
...,...,...,...,...,...
993,38108,Program,ulaanbaatar air quality improvement government...,127,94
994,30360,Program,supporting national urban missionthrough adb r...,139,89
995,33883,Program,contingency fund aim contingency fund better p...,77,65
996,36296,Program,lebanon resilience projectthe objective pdo in...,159,98


In [136]:
clean_test.to_csv('../data/clean_test.csv',index=False)