### Title: Simple Text Analysis


### 1. Introduction:
This project is a short demo of applying natural language processing (NLP) to summarize a text document and extract its key terms and entities. <br>
<br><strong>Goal: </strong>We'll use the gensim library to produce a quick summary of a short body of text.
We want to:
* summarize the document so that the final output keeps about 20% of the original sentences (`ratio=0.2`)
* extract the most important keywords, phrases, and named entities



In [145]:
#imports

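# *** NLP tools
import textacy  # used below for n-grams, key terms, and entities
# NOTE: gensim.summarization was removed in gensim 4.0, so this import needs gensim < 4.0
from gensim.summarization.summarizer import summarize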


# *** text resources

# *** Visualization tools
import matplotlib.pyplot as plt
import matplotlib.style as style

# *** Misc tools
from collections import Counter
import string 
import pandas as pd

### 2. Get data
<strong> Data Description:</strong>
* The data used in this project comes from the UN Environment website and is an editorial on the international cooperation needed to face global environmental challenges
* source:  https://www.unenvironment.org/news-and-stories/editorial/facing-our-global-environmental-challenges-requires-efficient

In [124]:
# load data
with open('UNEnvironment_gloabl_challenges.txt','r') as inf:
    raw_text = inf.readlines()
    
    
# preview what's in the first few lines of the file

raw_text[:5]

['## Source: https://www.unenvironment.org/news-and-stories/editorial/facing-our-global-environmental-challenges-requires-efficient\n',
 '## title: Facing our global environmental challenges requires efficient international cooperation\n',
 '\n',
 'The changes we need are huge—time is short\n',
 '\n']

### 3. Preprocessing
* <strong>Goal:</strong> Prepare the data for subsequent analysis. We remove any parts of the data that aren't helpful for our goals.

In [125]:
# drop the first three lines: source URL, title, and the blank line that follows

raw_text = raw_text[3:]
raw_text = ''.join(raw_text).replace('\n', '')  # remove newline characters
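
Replacing `'\n'` with an empty string joins the last word of one line directly to the first word of the next. That is probably why fused text such as `resources.In` and `United Nations systemWe` appears in the outputs further down. A minimal sketch of an alternative that keeps a space between lines is shown here; the helper name `join_lines` and the sample lines are illustrative and not part of the original notebook.

```python
import re

def join_lines(lines):
    """Join lines into one string, separated by single spaces."""
    text = ' '.join(line.strip() for line in lines)
    # collapse the double spaces left behind by blank lines
    return re.sub(r'\s+', ' ', text).strip()

# illustrative input shaped like the output of readlines()
sample_lines = ['The changes we need are huge.\n', '\n', 'Time is short.\n']
print(join_lines(sample_lines))
```

Switching to this approach would change the word counts and summary reported below, so those outputs would need to be regenerated.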

### 4. Text summary

In [126]:
# word count of the original document
print('The original document is', len(raw_text.split(' ')), 'words long')

The original document is 1067 words long


In [28]:
# Summarize the text; ratio=0.2 asks gensim to keep about 20% of the sentences
summary = summarize(raw_text, ratio=0.2, split=False)
print(summary)

Our planet and humankind face three unprecedented, mutually reinforcing challenges: climate change, the loss of biodiversity and the overuse of critical natural resources.In the past year, we have seen intense heatwaves and raging wildfires in Europe, the United States and in Japan cause huge economical loss and damage.
State budgeting is, in our view, a very powerful tool to implement the 2030 Agenda.Finland also has strong cross-sectoral coordination mechanisms between different ministries as well as other environmental interest groups.No country, however, can successfully meet these global challenges on their own.
Reform efforts have focused specifically on 1) the governance, financing, and functioning of UN Environment, and 2) enhancement of synergies among multilateral environmental agreements.The most significant reform so far has been the transformation of UN Environment’s 58-member governing council into the universal UN Environment Assembly, bringing political leaders of all U

In [19]:
print('The summarized document is',len(summary.split(' ')),'words long' )

The summarized document is 378 words long
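
The summary is longer than 20% of the original word count. One likely reason is that gensim's `ratio` parameter selects a proportion of sentences, not words, and the selected sentences can be longer than average. The quick check below uses only the two word counts printed above; the variable names are illustrative.

```python
original_words = 1067  # word count reported for the full text above
summary_words = 378    # word count reported for the summary above

word_ratio = summary_words / original_words
print(f'The summary keeps {word_ratio:.1%} of the original words')
```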


### 5. References to places, people, etc. in the data

In [40]:
# NOTE: `doc` is the previously created spaCy document for raw_text
entities = [(ent.text, ent.label_)
            for ent in textacy.extract.entities(doc, include_types=['PERSON', 'ORG', 'GPE', 'LOC'])]
entity_df = pd.DataFrame(entities,columns=['entity','type'])
entity_df.head()

| | entity | type |
|---|---|---|
| 0 | Erik Lundberg | PERSON |
| 1 | Finland | GPE |
| 2 | Kenya | GPE |
| 3 | Somalia | GPE |
| 4 | Uganda | GPE |


In [94]:
# Quick analysis on the references to people, places etc

group = entity_df.groupby(by='entity').count().sort_values(by='type',ascending=False)

In [103]:
import seaborn as sns

cm = sns.light_palette("green", as_cmap=True)

s = group.head(10).style.background_gradient(cmap=cm)
s

| entity | type (count) |
|---|---|
| United Nations | 13 |
| UN Environment Assembly | 11 |
| Finland | 6 |
| UN Environment | 5 |
| UN Environment’s | 4 |
| UN Environment Programme | 2 |
| Somalia | 2 |
| Africa | 1 |
| Rio+20 | 1 |
| United Nations systemWe | 1 |


### 6. Most important keywords and phrases

In [160]:
from textacy.keyterms import key_terms_from_semantic_network
key_terms = key_terms_from_semantic_network(doc,join_key_words=True)
keyterms_df=pd.DataFrame(key_terms,columns=['term','rank'])
keyterms_df.index = keyterms_df.term

# table styling: larger font in the cells, colour gradient on the rank column
td_props = [('font-size', '14px')]
styles = [dict(selector='td', props=td_props)]
(keyterms_df.style
 .background_gradient(cmap=cm)
 .hide_index()
 .set_table_styles(styles))

| term | rank |
|---|---|
| un environment assembly | 0.0591341 |
| un environment programme | 0.0494191 |
| un environment | 0.0432702 |
| different multilateral environmental agreement | 0.0426883 |
| united nations member states | 0.0397549 |
| effective international environmental governance | 0.0396399 |
| united nations body | 0.0376542 |
| international environmental governance system | 0.0375486 |
| united nations reform | 0.0358983 |
| multilateral environmental agreement | 0.0350259 |


The list above shows keywords and phrases extracted from a <a href='https://en.wikipedia.org/wiki/Semantic_network'>semantic network</a> of the document's terms. The top-ranking phrases are selected based on how frequently the words in them co-occur with other words in the text.

In short, these phrases rank highly because they occur very frequently in the analysed text and appear to be central to the contexts in which they occur.
On reading the whole text, it is no surprise that these phrases are indeed central to its subject matter.
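
To make the idea of ranking terms in a co-occurrence network more concrete, here is a simplified sketch in plain Python. It is not textacy's implementation: it assumes a fixed sliding window for co-occurrence and a weighted PageRank for the ranking, and the function names and sample tokens are illustrative.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=3):
    """Undirected graph; edge weight = number of times two words share a window."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph

def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank computed by power iteration."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(graph[n].values()) for n in nodes}
    for _ in range(iterations):
        new_rank = {}
        for n in nodes:
            # each neighbour passes on rank in proportion to the edge weight
            incoming = sum(rank[m] * graph[m][n] / out_weight[m] for m in graph[n])
            new_rank[n] = (1 - damping) / len(nodes) + damping * incoming
        rank = new_rank
    return rank

tokens = 'un environment assembly un environment programme un environment reform'.split()
ranks = pagerank(cooccurrence_graph(tokens))
for word, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f'{word:12s} {score:.4f}')
```

Words that co-occur with many other words, and do so often, end up with the highest scores. That matches the pattern in the table above, where phrases built around "un environment" dominate.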

In [161]:
# Q2: What are the most common multi-word phrases?
# We use textacy from this point on

In [137]:
# NOTE: we use the previously created doc from the text statistics section

common_phrases = list(textacy.extract.ngrams(doc,n=2,filter_stops=True,filter_punct=True,min_freq=2))
common_phrases = [phrase.text.lower() for phrase in common_phrases]
commonPhrases_counts = Counter(common_phrases)
commonPhrases_counts

Counter({'un environment': 24,
         'environment programme': 2,
         'climate change': 2,
         'natural resources': 4,
         'united nations': 15,
         'we need': 4,
         '2030 agenda': 3,
         'the un': 2,
         'environment assembly': 12,
         'international environmental': 4,
         'environmental governance': 4,
         'reform efforts': 2,
         'multilateral environmental': 8,
         'environmental agreements': 7,
         'environment’s': 4,
         'member states': 2,
         'environmental agenda': 3,
         'nations system': 2,
         'review process': 2,
         'governing bodies': 3,
         '’s role': 2,
         'nations agencies': 2,
         'nations reform': 2})
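
For readers without textacy installed, the same kind of bigram count can be approximated with the standard library alone. This is a rough sketch, not a replacement for `textacy.extract.ngrams`: the stop-word list is a tiny illustrative one, tokenization is a simple regular expression, and bigrams are allowed to span sentence boundaries.

```python
from collections import Counter
import re

STOP_WORDS = {'the', 'a', 'of', 'and', 'to', 'in', 'is', 'we'}  # tiny illustrative list

def common_bigrams(text, min_freq=2):
    """Count two-word phrases that contain no stop word and occur at least min_freq times."""
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    bigrams = [
        f'{first} {second}'
        for first, second in zip(tokens, tokens[1:])
        if first not in STOP_WORDS and second not in STOP_WORDS
    ]
    counts = Counter(bigrams)
    return Counter({phrase: n for phrase, n in counts.items() if n >= min_freq})

sample = ('Climate change is a global challenge. The United Nations leads on '
          'climate change, and the United Nations needs reform.')
print(common_bigrams(sample))
```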

In [22]:
# Q3: What are the main references to people, places, and organizations (also called entities) in the text?

In [139]:
# keep only people, organizations, locations, and geopolitical entities
entities = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ['PERSON', 'ORG', 'LOC', 'GPE']]
entities

[('Erik Lundberg', 'PERSON'),
 ('Finland', 'GPE'),
 ('Kenya', 'GPE'),
 ('Somalia', 'GPE'),
 ('Uganda', 'GPE'),
 ('Seychelles and Permanent Representative', 'ORG'),
 ('the UN Environment Programme', 'ORG'),
 ('UN-Habitat', 'ORG'),
 ('Europe', 'LOC'),
 ('the United States', 'GPE'),
 ('Japan', 'GPE'),
 ('Africa', 'LOC'),
 ('Asia', 'LOC'),
 ('East Africa', 'GPE'),
 ('Somalia', 'GPE'),
 ('the United Nations', 'ORG'),
 ('Time', 'ORG'),
 ('Finland', 'GPE'),
 ('Finland', 'GPE'),
 ('The UN Environment Assembly', 'ORG'),
 ('the UN Environment Programme', 'ORG'),
 ('Rio+20', 'GPE'),
 ('UN Environment', 'ORG'),
 ('UN Environment’s', 'ORG'),
 ('UN Environment Assembly', 'ORG'),
 ('United Nations', 'ORG'),
 ('the UN Environment Assembly', 'ORG'),
 ('United Nations', 'ORG'),
 ('United Nations', 'ORG'),
 ('the UN Environment Assembly', 'ORG'),
 ('UN Environment', 'ORG'),
 ('the United Nations', 'ORG'),
 ('the UN Environment Assembly', 'ORG'),
 ('the Fifth UN Environment Assembly', 'ORG'),
 ('the Fourt

### Observe...
that a few of the extracted entities were labeled incorrectly (for example, 'Rio+20' is tagged as GPE and 'Time' as ORG), but most of the remaining entities are correct. This is good enough for our purpose, and the wrong entries can be removed manually.

In [140]:
description_map = {'GPE':'Geopolitical Entity eg country',
                   'LOC':'Location,place etc',
                   'ORG':'Organization',
                   'PERSON':'Any person as the word suggests'}
entity_df=pd.DataFrame(entities,columns=['entity','type'])
entity_df['description'] = entity_df['type'].map(description_map)
entity_df.head()

| | entity | type | description |
|---|---|---|---|
| 0 | Erik Lundberg | PERSON | Any person as the word suggests |
| 1 | Finland | GPE | Geopolitical Entity eg country |
| 2 | Kenya | GPE | Geopolitical Entity eg country |
| 3 | Somalia | GPE | Geopolitical Entity eg country |
| 4 | Uganda | GPE | Geopolitical Entity eg country |


In [163]:
entity_df.groupby(by='entity').count().sort_values(by='type',ascending=False)

| entity | type (count) | description (count) |
|---|---|---|
| United Nations | 9 | 9 |
| the UN Environment Assembly | 7 | 7 |
| Finland | 6 | 6 |
| UN Environment | 5 | 5 |
| UN Environment’s | 4 | 4 |
| the United Nations | 4 | 4 |
| Somalia | 2 | 2 |
| The UN Environment Assembly | 2 | 2 |
| UN Environment Assembly | 2 | 2 |
| the UN Environment Programme | 2 | 2 |
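
The counts above split the same entity across several spellings, such as "United Nations" and "the United Nations". One possible clean-up, sketched below under the assumption that a leading "the" and a trailing possessive can safely be dropped, is to normalize the names before counting. The helper `normalize_entity` and the sample list are illustrative and not part of the original notebook.

```python
from collections import Counter
import re

def normalize_entity(name):
    """Strip a leading 'the' and a trailing possessive so variants are counted together."""
    name = name.strip()
    name = re.sub(r'^the\s+', '', name, flags=re.IGNORECASE)
    name = re.sub(r"[’']s$", '', name)
    return name

# a few spellings taken from the table above
sample_entities = [
    'United Nations', 'the United Nations',
    'The UN Environment Assembly', 'the UN Environment Assembly',
    'UN Environment’s', 'UN Environment',
]
entity_counts = Counter(normalize_entity(e) for e in sample_entities)
print(entity_counts.most_common())
```

With pandas, the same helper could be applied to the `entity` column before the `groupby`.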



### Conclusion

In this project we extracted insights from a body of raw text by applying a series of techniques: document summarization, keyword extraction, and entity extraction. The same approach can be applied to a larger body of text to summarize it and extract the most important pieces of information.