<a href="https://colab.research.google.com/github/sydneymcolumbia/COPanalysis/blob/main/COPAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Creating data frame for 2021**

We import the COP 2021 Executive Secretary's speech from GitHub.

In [None]:
import pandas as pd
url = 'https://raw.githubusercontent.com/gigisp/COPanalysis/main/ExecSecGen/COP2021ExecutiveSec.txt'
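# sep="~" keeps each line of the speech in one column;
# on_bad_lines='skip' drops malformed lines (requires pandas 1.3 or later)
cop2021 = pd.read_csv(url, header=None, sep="~", on_bad_lines='skip')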
cop2021[0]

0                                     Honourable guests
1                              Distinguished delegates,
2                                  Ladies and gentlemen
3                                Our long wait is over.
4     It is with joy and enthusiasm that I officiall...
                            ...                        
68     I encourage you to keep the big picture in mind.
69    I encourage you to look beyond your specific a...
70    I encourage you to consider the choices that w...
71    Let us rise to the enormous challenge of our t...
72                                           Thank you.
Name: 0, Length: 73, dtype: object
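
The `sep="~"` argument is what makes `read_csv` treat each line of the speech as a single cell, presumably because the tilde does not occur in the text. A minimal sketch of that behaviour on an in-memory sample follows; the three-line `sample` string and the name `lines` are illustrative stand-ins, not the actual speech file.

```python
import io
import pandas as pd

# Hypothetical three-line sample standing in for the speech file.
sample = "Honourable guests\nDistinguished delegates,\nOur long wait is over.\n"

# With a separator that never occurs in the text, every line becomes
# one row in a single column (column 0), commas included.
lines = pd.read_csv(io.StringIO(sample), header=None, sep="~")
print(lines[0].tolist())
```

Had the default comma separator been used, the second line would have been split at its trailing comma.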

Merge all lines into a single text blob

In [None]:
blob = cop2021[0].str.cat(sep=' ')
blob


"Honourable guests Distinguished delegates, Ladies and gentlemen Our long wait is over. It is with joy and enthusiasm that I officially welcome you to COP26. I thank the outgoing COP Presidency of Chile — especially Minister Schmidt — for their leadership in the last two challenging years. I also officially welcome the incoming UK Presidency and Minister Sharma. I thank you for your collaborative efforts and as we work to make COP26 a success. To all of you I say this: congratulations. Congratulations to those in this room, those watching online and to everyone involved in this process. Think back on the last two years since we last met in Madrid: the early confusion — the pandemic and what it could mean for our process. Think about the decisions we made together, the skills we developed together to take advantage of communication technologies! Think also of those we lost to COVID-19 — our hearts are with those who continue to suffer. But let us also acknowledge what we’ve accomplished

See the 20 most common noun chunks of 2021

In [None]:
import spacy 
from collections import Counter

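# small English model; assumes en_core_web_sm is installed (spaCy 3 dropped the "en" shortcut)
nlp = spacy.load("en_core_web_sm")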

doc = nlp(blob)

# collect the noun chunks and lowercase their text
nc = list(doc.noun_chunks)
lowernc = [n.text.lower() for n in nc]

# count the chunks and show the 20 most frequent
lowernc_freq = Counter(lowernc)
common_lowernc = lowernc_freq.most_common(20)
print(common_lowernc)

[('we', 43), ('it', 17), ('i', 17), ('you', 13), ('cop26', 8), ('us', 6), ('success', 6), ('they', 6), ('what', 5), ('humanity', 5), ('the paris agreement', 5), ('an era', 4), ('history', 3), ('emissions', 3), ('parties', 3), ('solutions', 3), ('climate change', 3), ('ours', 3), ('congratulations', 2), ('this process', 2)]
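
Most of the top entries in the output above are pronouns ('we', 'it', 'i', 'you'), which say little about the topics of the speech. One possible refinement, sketched here on a few counts copied from that output, is to drop pronoun chunks before ranking. The `PRONOUNS` set is a hand-picked assumption, not something the notebook defines.

```python
from collections import Counter

# A few counts copied from the output above.
chunk_freq = Counter({'we': 43, 'it': 17, 'i': 17, 'you': 13, 'cop26': 8,
                      'us': 6, 'success': 6, 'they': 6, 'what': 5,
                      'humanity': 5, 'the paris agreement': 5})

# Hypothetical stop list of pronoun chunks (an assumption for illustration).
PRONOUNS = {'we', 'it', 'i', 'you', 'us', 'they', 'what', 'ours'}

# Keep only the chunks that are not bare pronouns, then rank them.
content_freq = Counter({chunk: count for chunk, count in chunk_freq.items()
                        if chunk not in PRONOUNS})
print(content_freq.most_common(3))
```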


Make a data frame of the noun chunks

In [None]:
type(lowernc_freq)
nc21df = pd.DataFrame.from_dict(lowernc_freq, orient='index').reset_index()
nc21df = nc21df.rename(columns={'index':'word', 0:'word_freq'})
nc21df['word_type'] = "noun_chunk"
nc21df['year'] = 2021
nc21df = nc21df[['year', 'word_type', 'word', 'word_freq']]
nc21df

Unnamed: 0,year,word_type,word,word_freq
0,2021,noun_chunk,honourable guests distinguished delegates,1
1,2021,noun_chunk,ladies,1
2,2021,noun_chunk,gentlemen,1
3,2021,noun_chunk,our long wait,1
4,2021,noun_chunk,it,17
...,...,...,...,...
232,2021,noun_chunk,billions,1
233,2021,noun_chunk,the enormous challenge,1
234,2021,noun_chunk,our times,1
235,2021,noun_chunk,not just our present generation,1
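
The same steps (build the frame, rename the columns, tag the word type and year, reorder) recur below for nouns and verbs. They could be wrapped in a small helper. `freq_to_df` is a hypothetical name, not part of the notebook, and the two-word `Counter` is a toy input.

```python
from collections import Counter
import pandas as pd

def freq_to_df(freq, word_type, year):
    """Turn a Counter of words into a frame with one row per word."""
    df = pd.DataFrame(list(freq.items()), columns=['word', 'word_freq'])
    df['word_type'] = word_type
    df['year'] = year
    return df[['year', 'word_type', 'word', 'word_freq']]

# Toy input: two chunks with their counts.
demo = freq_to_df(Counter({'it': 17, 'ladies': 1}), 'noun_chunk', 2021)
print(demo)
```

Building the frame from `freq.items()` names the columns directly, so no `reset_index` or `rename` step is needed.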


Repeat with nouns and verbs

In [None]:
#words = [token.text.lower()
#         for token in doc
#         if not token.is_stop and not token.is_punct]

#word_freq = Counter(words)
#common_words = word_freq.most_common(20)

nouns = [token.text.lower()
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "NOUN")]


noun_freq = Counter(nouns)
common_nouns = noun_freq.most_common(20)

n21df = pd.DataFrame.from_dict(noun_freq, orient='index').reset_index()
n21df = n21df.rename(columns={'index':'word', 0:'word_freq'})
n21df['word_type'] = "noun"
n21df['year'] = 2021
n21df = n21df[['year', 'word_type', 'word', 'word_freq']]
n21df

Unnamed: 0,year,word_type,word,word_freq
0,2021,noun,guests,1
1,2021,noun,delegates,1
2,2021,noun,ladies,1
3,2021,noun,gentlemen,1
4,2021,noun,wait,1
...,...,...,...,...
175,2021,noun,trust,1
176,2021,noun,billions,1
177,2021,noun,challenge,1
178,2021,noun,times,1


...and verbs

In [None]:
verbs = [token.text.lower()
         for token in doc
         if (not token.is_stop and
             not token.is_punct and
             token.pos_ == "VERB")]

verbs_freq = Counter(verbs)

v21df = pd.DataFrame.from_dict(verbs_freq, orient='index').reset_index()
v21df = v21df.rename(columns={'index':'word', 0:'word_freq'})
v21df['word_type'] = "verb"
v21df['year'] = 2021
v21df = v21df[['year', 'word_type', 'word', 'word_freq']]
v21df

Unnamed: 0,year,word_type,word,word_freq
0,2021,verb,distinguished,1
1,2021,verb,welcome,2
2,2021,verb,thank,4
3,2021,verb,challenging,1
4,2021,verb,work,1
...,...,...,...,...
99,2021,verb,getting,1
100,2021,verb,depends,5
101,2021,verb,trying,1
102,2021,verb,consider,1


Merge all three data frames

In [None]:
# stack the three frames (pd.concat replaces DataFrame.append, removed in pandas 2.0)
df21 = pd.concat([nc21df, n21df, v21df])
df21

Unnamed: 0,year,word_type,word,word_freq
0,2021,noun_chunk,honourable guests distinguished delegates,1
1,2021,noun_chunk,ladies,1
2,2021,noun_chunk,gentlemen,1
3,2021,noun_chunk,our long wait,1
4,2021,noun_chunk,it,17
...,...,...,...,...
99,2021,verb,getting,1
100,2021,verb,depends,5
101,2021,verb,trying,1
102,2021,verb,consider,1
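
Stacking the frames keeps each frame's own row labels, which is why the merged output above starts at 0 but ends at 102 rather than at the total row count. A small sketch of the difference `ignore_index=True` makes, using two made-up toy frames `a` and `b`:

```python
import pandas as pd

# Toy frames standing in for the per-word-type frames.
a = pd.DataFrame({'word': ['wait', 'trust'], 'word_freq': [1, 1]})
b = pd.DataFrame({'word': ['thank'], 'word_freq': [4]})

stacked = pd.concat([a, b])                        # keeps labels 0, 1, 0
renumbered = pd.concat([a, b], ignore_index=True)  # relabels 0, 1, 2

print(list(stacked.index), list(renumbered.index))
```

Duplicate labels are harmless for sorting and export, but they make label-based lookups such as `.loc[0]` return several rows.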


**Creating loop that creates dataset for all years**

In [None]:
frames = []  # one data frame per year, combined after the loop
years = list(range(2006, 2020))
years.append(2021)  # the list skips 2020 and ends with 2021

for year in years:

  print(year)

  l1 = 'https://raw.githubusercontent.com/gigisp/COPanalysis/main/ExecSecGen/COP'
  l2 = 'ExecutiveSec.txt'
  link = l1 + str(year) + l2

  cop = pd.read_csv(link, header=None, sep="~", on_bad_lines='skip')
  blob = cop[0].str.cat(sep=' ')
  doc = nlp(blob)

  # Extract each type of word
  # noun chunks
  nounchunks = doc.noun_chunks
  nc = list(nounchunks)
  lowernc = [n.text.lower() for n in nc]
  lowernc_freq = Counter(lowernc)
  # nouns
  nouns = [token.text.lower()
          for token in doc
          if (not token.is_stop and
              not token.is_punct and
              token.pos_ == "NOUN")]
  noun_freq = Counter(nouns)
  # verbs
  verbs = [token.text.lower()
          for token in doc
          if (not token.is_stop and
              not token.is_punct and
              token.pos_ == "VERB")]
  verbs_freq = Counter(verbs)

  # One frame per word type
  nc = pd.DataFrame.from_dict(lowernc_freq, orient='index').reset_index()
  nc = nc.rename(columns={'index':'word', 0:'word_freq'})
  nc['word_type'] = "noun chunk"
  n = pd.DataFrame.from_dict(noun_freq, orient='index').reset_index()
  n = n.rename(columns={'index':'word', 0:'word_freq'})
  n['word_type'] = "noun"
  v = pd.DataFrame.from_dict(verbs_freq, orient='index').reset_index()
  v = v.rename(columns={'index':'word', 0:'word_freq'})  # rename the verb frame itself, not a copy of n
  v['word_type'] = "verb"

  df = pd.concat([nc, n, v])
  df['year'] = year
  df = df[['year', 'word_type', 'word', 'word_freq']]

  frames.append(df)

DF = pd.concat(frames)
DF

2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2021


Unnamed: 0,year,word_type,word,word_freq
0,2006,noun chunk,the post-independence years,1
1,2006,noun chunk,kenya,3
2,2006,noun chunk,a big development challenge,1
3,2006,noun chunk,the voices,2
4,2006,noun chunk,this experience,1
...,...,...,...,...
99,2021,verb,getting,1
100,2021,verb,depends,5
101,2021,verb,trying,1
102,2021,verb,consider,1
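
A quick way to sanity-check a combined table like this one is to count rows per year and word type: every year should show all three word types, and the noun and verb counts would normally differ. The sketch below runs on a made-up toy frame (`toy` and its values are illustrative, not taken from the speeches).

```python
import pandas as pd

# Toy frame with the same four columns as the combined table.
toy = pd.DataFrame({
    'year': [2006, 2006, 2006, 2021, 2021],
    'word_type': ['noun', 'noun', 'verb', 'noun', 'verb'],
    'word': ['kenya', 'voices', 'thank', 'trust', 'depends'],
    'word_freq': [3, 2, 1, 1, 5],
})

# Rows per (year, word_type), laid out as a year-by-type grid.
counts = toy.groupby(['year', 'word_type']).size().unstack(fill_value=0)
print(counts)
```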


Sorting the merged dataset and saving it as a CSV file

In [None]:
DF = DF.sort_values(['year', 'word_type', 'word_freq'], 
               ascending = [False, False, False])

DF.to_csv("NounChunksCOP.csv")
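
By default `to_csv` writes the row index as a nameless first column, which is where a header such as `Unnamed: 0` comes from when the file is read back. If that column is not wanted, passing `index=False` leaves it out. A minimal sketch using in-memory buffers instead of a file (the one-row `toy` frame is made up):

```python
import io
import pandas as pd

toy = pd.DataFrame({'year': [2021], 'word': ['cop26'], 'word_freq': [8]})

# Default: the index is written as a first column without a name.
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_with)
print(cols_without)
```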