## Mapping Countries

This section was adapted from code and instructions provided by Shivangi Patel in ["A Complete Guide to an Interactive Geographical Map using Python"](https://towardsdatascience.com/a-complete-guide-to-an-interactive-geographical-map-using-python-f4c5197e23e0). Any cells marked with "Patel" come from this guide.

We will first import a shapefile, downloaded from Natural Earth, with boundaries for the world's 258 countries. (Note: Natural Earth identifies 258 countries but only 208 sovereign states.)

*To understand how this code works, it is recommended you try tweaking different lines of code (to see how the output differs), print / output some variables to see what kind of data is contained within, and break code cells into smaller parts and run those smaller chunks one at a time, inspecting what each does.*  

## I. Import Necessary Packages

In [None]:
import pandas as pd       #for working with dataframes
import geopandas as gpd   #... with geodataframes
import spacy              # NLP
import collections        # freq lists
import country_converter as coco   #for standardizing country names
import json
from spacy.lang.en.examples import sentences
from spacy import displacy

#in addition to matplotlib, seaborn, and plotly, bokeh is a 4th powerful visualization library for Python
from bokeh.io import output_notebook, show, output_file #bokeh - viz package
from bokeh.plotting import figure
from bokeh.models import GeoJSONDataSource, LinearColorMapper, ColorBar
from bokeh.palettes import brewer  #ColorBrewer color palettes
from bokeh.models import ContinuousColorMapper, EqHistColorMapper, LogColorMapper

nlp = spacy.load("en_core_web_sm")

## II. Import and Set Up a Dataframe of Countries

1. Import a shapefile using **Geopandas** as a geo-dataframe, then subset it, keeping only the columns we want.

In [None]:
#[Patel]

shapefile = "countries_110m/ne_110m_admin_0_countries.shp"

#Read shapefile using Geopandas
gdf = gpd.read_file(shapefile)[['ADMIN', 'ADM0_A3', 'geometry']]

#Rename columns.
gdf.columns = ['country', 'country_code', 'geometry']
gdf.head()

[Patel]: 3. We can drop the row for ‘Antarctica’ as it unnecessarily occupies a large space in our map and is not required in our current analysis.

In [None]:
print(gdf[gdf['country'] == 'Antarctica'])
#Drop the row corresponding to 'Antarctica' (matched by name rather than by a hard-coded index position)
gdf = gdf.drop(gdf[gdf['country'] == 'Antarctica'].index)

In [None]:
type(gdf)

## II. Standardize & Count Countries Mentioned in SOTU

4. To standardize country names, we can use the [**country_converter**](https://notebook.community/konstantinstadler/country_converter/doc/country_converter_examples) package. Let's try it with some examples below.

In [None]:

some_names = ['United Rep. of Tanzania', 'DE', 'Cape Verde', '788', 'Burma', 'COG',
              'Iran (Islamic Republic of)', 'Korea, Republic of',
              "Dem. People's Rep. of Korea"]
standard_names = coco.convert(names = some_names, to = 'name_short')
print(standard_names)

4b. We can also use a three-letter ("ISO3") code that serves as a unique identifier for each country.

In [None]:
countries_list = list(gdf['country'])
print(countries_list[:10])
countries_iso3 = coco.convert(names = countries_list, to = 'ISO3')
countries_iso3[:10]

<div class="alert alert-success" role="alert"><p style="color:green">5. Try applying a list of countries of your choosing through the coco.convert function and examine the results.</p></div>

6. To extract countries from the Biden 2023 SOTU address, we will want to focus on the GPE (geopolitical entity) named entities. The following code extracts GPEs only and then creates a frequency list.

In [None]:
sotudf = pd.read_csv("sotudf.tsv", encoding="utf-8", sep="\t", index_col=0)
sotudf = sotudf.sort_values(by = ['year'])
biden23text = sotudf[sotudf['year'] == 2023]['fulltext'].item()
doc = nlp(biden23text)
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
sotu_gpes = [ent[0] for ent in ents if ent[1] == 'GPE']
sotu_gpes_freqs = collections.Counter(sotu_gpes)
sotu_gpes_freqs
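
Before standardizing, it can help to see why raw GPE strings fragment the counts. The sketch below uses a short, made-up list of entity strings (not actual output from the speech) to show that `collections.Counter` treats each spelling as a separate key.

```python
import collections

# Hypothetical GPE strings, standing in for the entities spaCy would return
sample_gpes = ["America", "Ukraine", "the United States", "America", "China", "United States"]

sample_freqs = collections.Counter(sample_gpes)

# most_common() sorts by count, highest first
top_entity, top_count = sample_freqs.most_common(1)[0]

# Three different strings all refer to the same country
us_variants = ["America", "the United States", "United States"]
us_total = sum(sample_freqs[v] for v in us_variants)

print(sample_freqs)
print("US mentions across variants:", us_total)
```

Each variant is counted on its own, which is why the next steps map every name onto a single standard identifier before counting again.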

7. We can use list comprehensions to standardize our country names before re-compiling our frequency list. We want to do two quick tasks:
+ replace "America" with "United States" as the country_converter does not have "America" listed as one of its aliases for the USA.
+ convert all remaining countries in our list of geopolitical entities (GPEs) extracted from the Biden speech into their ISO3 codes.

In [None]:

print("original list:", sotu_gpes)
sotu_countries = ["United States" if country == "America" else country for country in sotu_gpes]
print("with 'America' replaced: ", sotu_countries)
sotu_countries_std = [coco.convert(country, to = "ISO3") for country in sotu_countries]
print("with iso3 codes:", sotu_countries_std)
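
The conditional list comprehension above replaces only strings that are exactly "America". As a small illustration with made-up entity strings, the sketch below contrasts that exact-match approach with a substring replacement, which would also rewrite names that merely contain "America".

```python
# Hypothetical GPE strings for illustration
sample_gpes = ["America", "South America", "Ukraine"]

# Exact match: only the standalone string "America" is replaced
exact = ["United States" if g == "America" else g for g in sample_gpes]

# Substring replacement: every occurrence of "America" inside a string is rewritten
substring = [g.replace("America", "United States") for g in sample_gpes]

print(exact)
print(substring)
```

The exact-match version leaves "South America" alone, while the substring version turns it into "South United States".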


8. Now, let's see how our frequency list has changed since Step 6 above:

In [None]:
collections.Counter(sotu_countries_std)

In [None]:
#we can then save this freq list into memory
sotu_countries_std_freqs = collections.Counter(sotu_countries_std)

## III. Merge country frequency list (actually a dictionary) with geoPandas dataframe created from country shapefile

9. Convert country frequency dictionary into a dataframe

In [None]:
sotu_countries_freqs_df = pd.DataFrame.from_dict(sotu_countries_std_freqs, orient = 'index', columns = ['freq'])
sotu_countries_freqs_df.index.name = "iso3"
sotu_countries_freqs_df

9. We can remove the "not found" countries using the **.drop()** method.

In [None]:
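#errors = 'ignore' keeps this cell from raising a KeyError if there is no 'not found' row (e.g. when the cell is re-run)
sotu_countries_freqs_df.drop(['not found'], inplace = True, errors = 'ignore')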

10. We now want to merge our new dataframe of countries listed in the Biden 23 address with our geoPandas dataframe. First let's review our countries geo-dataframe:

In [None]:
gdf.head()

In [None]:
gdf_merged = gdf.merge(sotu_countries_freqs_df, left_on = "country_code", right_index = True, how = "outer")
# I used "outer" above to include all countries, not just those that appeared in the SOTU address.

In [None]:
gdf_merged


11. It often helps to maintain a consistent data type in our columns. In this case, we want to replace the "NaN"s in the freq column with 0s and then convert that column to integers.

In [None]:
gdf_merged['freq'] = gdf_merged['freq'].fillna(0)
gdf_merged['freq'] = gdf_merged['freq'].astype(int)
gdf_merged.head()
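
As a rough plain-Python analogue of the merge and fill steps above (not how pandas implements them), the sketch below looks up each country code in a frequency table and defaults to 0 when the code is missing. The rows and counts are made up for illustration.

```python
# Hypothetical stand-ins for the shapefile rows and the frequency table
shape_rows = [
    {"country": "France", "country_code": "FRA"},
    {"country": "Ukraine", "country_code": "UKR"},
    {"country": "Japan", "country_code": "JPN"},
]
freq_by_iso3 = {"UKR": 7, "CHN": 4}

merged_rows = []
for row in shape_rows:
    # A missing code plays the role of NaN after the merge; default it to 0
    freq = freq_by_iso3.get(row["country_code"], 0)
    merged_rows.append({**row, "freq": int(freq)})

# Codes that have a frequency but no shape; an outer merge keeps these as rows without geometry
unmatched = sorted(set(freq_by_iso3) - {r["country_code"] for r in shape_rows})

print(merged_rows)
print("codes without a shape:", unmatched)
```

Countries never mentioned end up with a frequency of 0, and any code absent from the shapefile shows up in `unmatched`.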

11. [Patel] The merged file is a GeoDataframe object that can be rendered using geopandas module. However, since we want to incorporate data visualization interactivity, we will use Bokeh library. Bokeh consumes GeoJSON format which represents geographical features with JSON. GeoJSON describes points, lines and polygons (called Patches in Bokeh) as a collection of features. We therefore convert the merged file to GeoJSON format.

In [None]:
import json
#Read the GeoJSON string produced by to_json() into a Python dictionary.
gdf_json = json.loads(gdf_merged.to_json())
#Convert the dictionary back into a string-like object for Bokeh.
json_data = json.dumps(gdf_json)
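
To make the GeoJSON format more concrete, here is a minimal, hand-written example of a feature collection with a single made-up square "country" (the name, code, and coordinates are all hypothetical). Each row of the geo-dataframe becomes one feature, with its columns stored under `properties` and its shape under `geometry`.

```python
import json

# Hand-written FeatureCollection containing one made-up polygon
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {"country": "Squareland", "country_code": "SQL", "freq": 3},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
            },
        }
    ],
}

geojson_text = json.dumps(feature_collection)   # dictionary -> string
round_trip = json.loads(geojson_text)           # string -> dictionary

print(type(geojson_text), type(round_trip))
print(round_trip["features"][0]["properties"])
```

`json.loads` returns a dictionary and `json.dumps` returns a string, which mirrors the two conversions in the cell above.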

In [None]:
json_data

12. We are now ready to render our choropleth map using Bokeh. Import the required modules. The code is described inline. [Patel]

In [None]:
brewer['YlGnBu']

In [None]:
#Input GeoJSON source that contains features for plotting.
geosource = GeoJSONDataSource(geojson = json_data)
#Define a sequential multi-hue color palette.
palette = brewer['YlGnBu'][8]
#Reverse color order so that dark blue marks the highest frequency.
palette = palette[::-1]
#Instantiate LinearColorMapper that linearly maps numbers in a range into a sequence of colors.
#Values below low (countries with 0 mentions) are drawn in white.
color_mapper = LinearColorMapper(palette = palette, low = 1, low_color = "white")
#Create color bar.
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=8, width = 500, height = 20, border_line_color=None,location = (0,0), orientation = 'horizontal')
#Create figure object.
#Note: Bokeh 3.x uses height / width; older Bokeh versions called these plot_height / plot_width.
p = figure(title = "Countries mentioned in Biden's 2023 SOTU address", height = 600 , width = 950, toolbar_location = None)
#Remove grid lines.
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None
#Add patch renderer to figure.
p.patches('xs','ys', source = geosource,fill_color = {'field' :'freq', 'transform' : color_mapper},
          line_color = 'black', line_width = 0.25, fill_alpha = 1)
#Specify figure layout.
p.add_layout(color_bar, 'below')
#Display figure inline in Jupyter Notebook.
output_notebook()
#Display figure.
show(p)

## IV. Map the entire SOTU Corpus

13. Given what you have learned, what steps will we need to complete in order to create a map of all countries listed across the entire corpus? Write your answer down in the empty markdown cell(s) below.



14. Review the code below. Identify what each of the following steps does.

In [None]:
sotudf.iloc[0]

In [None]:
def extractGPES(text):
    if pd.isnull(text):
        gpes = []
    else:
        doc = nlp(text)
        gpes = [e.text for e in doc.ents if e.label_ == "GPE"]
    return gpes

In [None]:
extractGPES(sotudf.iloc[1]['fulltext'])

In [None]:
sotudf.head()

15. We can now apply this function across our dataframe of SOTU addresses. **NOTE: this will take several minutes to run (nearly 5 minutes on my fast computer). This might be a good time to do something else (take a break, discuss your projects, etc.) while waiting for it to complete.**

In [None]:
sotudf['gpes'] = sotudf['fulltext'].apply(extractGPES)

In [None]:
sotudf.head()

16. The cell below combines the GPE lists for each individual SOTU address in the 'gpes' column into one large list. There are certainly other ways to do this.

In [None]:
all_gpes = [a for b in sotudf.gpes.tolist() for a in b]
all_gpes[-30:]
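
As one example of another way to flatten a list of lists, the sketch below compares the nested list comprehension with `itertools.chain.from_iterable` from the standard library. The per-address lists are made up and stand in for the 'gpes' column.

```python
import itertools

# Hypothetical per-address GPE lists, standing in for sotudf.gpes.tolist()
gpes_per_address = [["France", "Spain"], [], ["Mexico"], ["France"]]

# Nested list comprehension, as in the cell above
nested_loop = [gpe for address in gpes_per_address for gpe in address]

# Equivalent result using itertools
chained = list(itertools.chain.from_iterable(gpes_per_address))

print(nested_loop)
print(chained == nested_loop)
```

Both approaches keep the original order and skip empty lists.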

17. We want to replace all instances of 'America' in this GPE list with 'United States':

In [None]:
#replace only exact matches, so that strings such as 'South America' are left untouched
all_gpes = ['United States' if x == 'America' else x for x in all_gpes]
all_gpes[-30:]

18. We can use the **collections** package to identify the most frequent GPEs mentioned in the SOTU addresses.

In [None]:
all_gpes_freqs = collections.Counter(all_gpes)
print(type(all_gpes_freqs))
all_gpes_freqs.most_common()


In [None]:
all_gpes_freqs.items()

19. As you can see above, we have multiple different aliases referring to the United States. The code below uses the **country_converter** package (imported as "coco") to standardize country references as three-letter ISO3 codes. We will then calculate a cumulative sum for each country. For example, in the above we have: 

```
('the United States', 3133),
('United States', 2126),
('the United\nStates', 366)
```

Identifying all of these under the ISO3 code "USA" and summing their totals would return: 

```
(USA, 5625)
```

The actual total for "USA" is probably higher, as there are undoubtedly other variations within the SOTU addresses. But you get the idea.

In [None]:
all_gpes_freqs_iso3 = {}
for (k, v) in all_gpes_freqs.items():
    new_k = coco.convert(k, to = "ISO3")
    #sometimes country_converter returns several candidates in a list, e.g. "United States of Colombia" --> ['COL', 'USA']
    #the if statement below just keeps the first candidate from the list
    if isinstance(new_k, list):
        new_k = new_k[0]
    if new_k in all_gpes_freqs_iso3:    #if the iso3 code is already in our new dictionary, add this frequency to its running total
        all_gpes_freqs_iso3[new_k] += v
    else:                               #otherwise this is the first appearance of the iso3 code, so just take its frequency
        all_gpes_freqs_iso3[new_k] = v
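
The cumulative-sum idea can be sketched without country_converter. In the example below, a small hand-written alias table (an assumption for illustration, not part of the package) stands in for `coco.convert`, and the three United States counts quoted above are summed under one code. The Mexico count is made up.

```python
import collections

# Counts quoted above, plus one made-up entry for contrast
raw_freqs = {
    "the United States": 3133,
    "United States": 2126,
    "the United\nStates": 366,
    "Mexico": 10,  # hypothetical count
}

# Hypothetical alias table standing in for coco.convert(..., to = "ISO3")
alias_to_iso3 = {
    "the united states": "USA",
    "united states": "USA",
    "mexico": "MEX",
}

def to_iso3(name):
    """Collapse whitespace (including newlines), lowercase, then look up the alias."""
    key = " ".join(name.split()).lower()
    return alias_to_iso3.get(key, "not found")

# A Counter starts every key at 0, so no explicit if/else is needed
iso3_freqs = collections.Counter()
for name, count in raw_freqs.items():
    iso3_freqs[to_iso3(name)] += count

print(iso3_freqs)
```

Using a `Counter` for the running totals is one alternative to checking whether the key already exists in a plain dictionary.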

20. We can sort the dictionary of countries and their frequencies using the following:

In [None]:
dict(sorted(all_gpes_freqs_iso3.items(), key = lambda item: item[1], reverse = True))

21. Let's place this information in a new dataframe.

In [None]:
allsotu_countries_freqs_df = pd.DataFrame.from_dict(all_gpes_freqs_iso3, orient = 'index', columns = ['freq'])
allsotu_countries_freqs_df.index.name = "iso3"
allsotu_countries_freqs_df

22. We can't map the "not founds" so let's drop them.

In [None]:
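#errors = 'ignore' keeps this cell from raising a KeyError if there is no 'not found' row (e.g. when the cell is re-run)
allsotu_countries_freqs_df.drop(['not found'], inplace = True, errors = 'ignore')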

23. We can then merge the new dataframe with country frequencies ("allsotu_countries_freqs_df") with our original Geopandas dataframe of countries.

In [None]:
#merge dataframes
gdf_merged_all = gdf.merge(allsotu_countries_freqs_df, left_on = "country_code", right_index = True, how = "outer")

#replace NaNs with 0s for the frequency column
gdf_merged_all['freq'] = gdf_merged_all['freq'].fillna(0)

#with NaNs replaced with 0s we can now convert all values in this column to integers
gdf_merged_all['freq'] = gdf_merged_all['freq'].astype(int)

#and let's remove any countries we do not have geographic location info ("geometry") for.
gdf_merged_all = gdf_merged_all[gdf_merged_all['geometry'].notna()]

In [None]:
#we can view this new dataframe, sorted by frequency:
gdf_merged_all.sort_values(by = ['freq'], ascending = False).head(20)

[Patel]
24. Convert the merged geodataframe into a json file.

In [None]:
gdf_json_all = json.loads(gdf_merged_all.to_json()) #Read the GeoJSON string into a dictionary.
print(type(gdf_json_all))
json_data_all = json.dumps(gdf_json_all) #Convert the dictionary back to a string-like object.
print(type(json_data_all))
json_data_all

[Patel]

25. We can now create a choropleth map using Bokeh.

In [None]:
##the modules below were moved to the top of the notebook
#from bokeh.io import output_notebook, show, output_file
#from bokeh.plotting import figure
#from bokeh.models import GeoJSONDataSource, LinearColorMapper, ColorBar
#from bokeh.palettes import brewer  #Input GeoJSON source that contains features for plotting.

geosource = GeoJSONDataSource(geojson = json_data_all)  
#Define a sequential multi-hue color palette.
palette = brewer['YlGnBu'][8]  
#Reverse color order so that dark blue marks the highest frequency.
palette = palette[::-1]  
#Instantiate LogColorMapper that maps numbers in a range, on a logarithmic scale, into a sequence of colors.
color_mapper = LogColorMapper(palette = palette, low = 1, low_color = "white")   
#Define custom tick labels for color bar.
tick_labels = {'0': '0', '10':'10', '100':'100', '1000':'1000', '10000':'10000'}  
#Create color bar. 
color_bar = ColorBar(color_mapper=color_mapper, label_standoff=8, width = 500, height = 20, 
                    border_line_color=None,location = (0,0), orientation = 'horizontal', major_label_overrides = tick_labels)
#Create figure object.
#Note: Bokeh 3.x uses height / width; older Bokeh versions called these plot_height / plot_width.
p = figure(title = "Countries mentioned in all SOTU addresses (1791 - 2023)", height = 600 , width = 950, 
    toolbar_location = None)
#remove grid lines
p.xgrid.grid_line_color = None
p.ygrid.grid_line_color = None  
#Add patch renderer to figure. 
p.patches('xs','ys', source = geosource,fill_color = {'field' :'freq', 'transform' : color_mapper},
          line_color = 'black', line_width = 0.25, fill_alpha = 1)  
#Specify figure layout.
p.add_layout(color_bar, 'below')  
#Display figure inline in Jupyter Notebook.
output_notebook()  
#Display figure.
show(p)
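
The map above uses a `LogColorMapper` rather than the `LinearColorMapper` from the Biden map. The sketch below is a simplified, plain-Python approximation of how a linear and a logarithmic scale assign values to an 8-color palette; it is not Bokeh's actual implementation, and the frequencies are made up.

```python
import math

def linear_bin(value, low, high, n_colors):
    """Palette index for value on a linear scale (simplified)."""
    if value <= low:
        return 0
    if value >= high:
        return n_colors - 1
    frac = (value - low) / (high - low)
    return min(int(frac * n_colors), n_colors - 1)

def log_bin(value, low, high, n_colors):
    """Palette index for value on a logarithmic scale (simplified)."""
    if value <= low:
        return 0
    if value >= high:
        return n_colors - 1
    frac = (math.log(value) - math.log(low)) / (math.log(high) - math.log(low))
    return min(int(frac * n_colors), n_colors - 1)

# Hypothetical frequencies: a few very large counts and many small ones
freqs = [1, 10, 100, 1000, 5000]
low, high, n_colors = 1, 5000, 8

linear_bins = [linear_bin(f, low, high, n_colors) for f in freqs]
log_bins = [log_bin(f, low, high, n_colors) for f in freqs]

print("linear:", linear_bins)
print("log:   ", log_bins)
```

On the linear scale, counts of 1, 10, and 100 all land in the lightest color because one very large value stretches the range. The logarithmic scale spreads them across the palette, which matters when one country is mentioned far more often than the rest.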

<div class="alert alert-success" role="alert"><p style="color:green">26. Examine the map above including the colors and scale used.</p>
<ul> 
    <li style="color:green">What conclusions could you draw from this map?</li>
    <li style="color:green">How might this map be deceiving?</li>
    <li style="color:green">What further questions do you have? How could you go about answering those questions?</li>
</ul>
</div>


**Possible options to fulfill weekly assignment for this week (talk to instructor):**
1. Apply an NLP technique we learned about (place / person names using NER; part-of-speech [POS] tagging) to a different corpus of your choice.
2. Apply an NER technique to the SOTU corpus (but using a different technique than we learned about or applying it in a different way).