# parse_country

#### PURPOSE: Parse and geocode countries mentioned within a text to then geolocate that text using GIS software.

##### Libraries Used:
- Pandas
- Country_Converter (coco)
- Logging
- Typing

##### READ ME
This function was designed to parse many text entries stored in a pandas DataFrame. A helper function (vals_to_df) is also provided to quickly convert a *list* to a DataFrame that parse_country can use.

##### Description:
Use it to parse country names found in a string of text stored in a DataFrame cell. Countries are matched with country_converter's regular expressions and mapped to their ISO3 codes. The function returns an exploded or unexploded DataFrame that contains the ISO3 codes of all countries found in the text.
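
As a rough illustration of the word-by-word matching described above, here is a minimal standard-library sketch. The three-entry `PATTERNS` table and the `iso3_codes` helper are hypothetical stand-ins; the real function delegates matching to country_converter, whose regex table is far larger.

```python
import re

# Hypothetical three-entry lookup used only for illustration;
# country_converter ships its own, much larger regex table.
PATTERNS = {
    "UKR": re.compile(r"ukrain", re.IGNORECASE),
    "RUS": re.compile(r"russia", re.IGNORECASE),
    "SGP": re.compile(r"singapore", re.IGNORECASE),
}


def iso3_codes(text):
    """Return the sorted, de-duplicated ISO3 codes matched word by word."""
    found = set()
    for word in text.split():
        for code, pattern in PATTERNS.items():
            if pattern.search(word):
                found.add(code)
    return sorted(found)


print(iso3_codes("Ukraine must urgently be given the frozen Russian assets"))
# ['RUS', 'UKR']
```

Adjectives such as "Russian" match because the pattern is searched inside each word, which is consistent with the RUS result in the examples further down.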

##### Output:
An exploded or unexploded DataFrame with the text and corresponding ISO3 country code.
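
To make the difference between the two output shapes concrete, the sketch below applies pandas' `DataFrame.explode` to a small hand-built frame. The sample rows are invented for illustration and are not produced by parse_country.

```python
import pandas as pd

# Invented sample data: one list of ISO3 codes per row (the unexploded shape).
df = pd.DataFrame({
    "values": ["text about two countries", "text about one country"],
    "countries": [["RUS", "UKR"], ["ISR"]],
})

# explode() creates one row per list element and repeats the original index.
exploded = df.explode("countries")

print(exploded["countries"].tolist())  # ['RUS', 'UKR', 'ISR']
print(exploded.index.tolist())         # [0, 0, 1]
```

The repeated index values are what make it possible to trace each exploded row back to its original text entry.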

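##### NOTE: This function may not catch all countries within a text.

The text is split on spaces and each word is matched on its own, so country names that span several words may be missed.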

In [1]:
def parse_country(df, col_to_be_parsed, new_match_column, explode=True, log="CRITICAL"):
    """
    Parse country names found in the text of a DataFrame column and match them to ISO3 codes.

    Args:
        df: pandas DataFrame.
        col_to_be_parsed: str. Name of the column that the country(s) will be extracted from.
        new_match_column: str. Name of the new column that will hold the matched ISO3 codes.
        explode: bool. Default True.
            True - will create a new row for each country within the list.
            False - will keep all original rows; the matched countries will remain in list form within each cell.
        log: "CRITICAL" or "INFO". Denotes whether the 'not found' regex messages are displayed.
            CRITICAL - 'not found' messages will not be displayed.
            INFO - 'not found' messages will be displayed.

    Returns:
        An exploded or unexploded DataFrame that contains the ISO3 codes of all countries found in the text.
    """
    import pandas as pd
    import country_converter as coco
    import logging
    from typing import Any
    
    # Check that 'explode' param is a bool
    if not isinstance(explode, bool):
        raise TypeError('explode param does not have bool value.')

    # Check that the 'log' param is 'CRITICAL' or 'INFO'
    if log not in ("CRITICAL", "INFO"):
        raise ValueError(f"log param cannot equal '{log}'...log must equal 'CRITICAL' or 'INFO'")

    # coco.logging is the standard logging module, so this sets the root logger's level
    coco_logger = coco.logging.getLogger()
    coco_logger.setLevel(getattr(logging, log))
    
    # Translator to remove/correct extraneous characters
    replacements = str.maketrans({
        '–': '', 
        ',': '', 
        ';': '', 
        '.': '', 
        '(': '', 
        ')': '', 
        '-': ' '
    })
    
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    df[col_to_be_parsed] = df[col_to_be_parsed].str.translate(replacements)
    
    # Add "TempText" to start of text as the first word to not match the coco regex for sorting purposes
    df['col_with_id'] = "TempText " + df[col_to_be_parsed]  
    # Fill newly created column with 'null' values as placeholder
    df[new_match_column] = None

    # Temporary column for parsing
    col_parsed_filled = df['col_with_id'].fillna("N/A")
    
    # country_converter
    cc = coco.CountryConverter()
    
    for i in range(len(col_parsed_filled)):
    
        # Split text at every " "...then check the regex to find any countries matching in coco and match using the ISO3 Code.
        # Any word not matching a country is labeled "AAAA" for sorting and final data cleaning
        # mc is the 'new' matching column
        mc = pd.Series(cc.convert(col_parsed_filled[i].split(), src="regex", to="ISO3", not_found='AAAA'))
        
        # Split text at every " " for comparison to coco matched column "mc"
        # og is the original column, split the same way to create a one-to-one match when concatenating "mc" and "og"
        og = pd.Series(col_parsed_filled[i].split())
    
        # Combine "mc" and "og" for comparison
        comb = pd.concat([og, mc], axis=1)
    
        # Access the 2nd column (mc) and drop duplicated countries and all words labeled as "AAAA" to leave only a single "AAAA" and countries
        comb_drop_d = comb[1].drop_duplicates()

        # Sort values alphabetically...remove "comb_drop_d[0]" (i.e. "AAAA")...reset index...remove old index column...rename ISO3 Coded column as "new_match_column"
        new_df = pd.DataFrame(comb_drop_d)[1].sort_values().drop(0).reset_index().drop(columns='index').rename(columns={1: new_match_column})
        
        # Turn "new_match_column" into list to insert values into the "df"
        new_df_list = new_df[new_match_column].to_list()

        # Replace empty df["new_match_column"] values with ISO3 Codes from "new_df_list" based on matching index number
        df.at[i, new_match_column] = new_df_list

    # Explode values by default. With explode=False, keep the matched ISO3 codes in list form within the cell.
    if explode:
        return df.explode(new_match_column).drop(columns='col_with_id')
    else:
        return df.drop(columns='col_with_id')
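
The loop body above can be hard to follow, so here is a hedged walk-through of one row. It reuses the same translation table and the same pandas calls, but `fake_convert` is a hypothetical stand-in for `cc.convert(..., src="regex", to="ISO3", not_found='AAAA')` so that the sketch runs without country_converter.

```python
import pandas as pd

# Same translation table as in parse_country.
replacements = str.maketrans({
    "–": "", ",": "", ";": "", ".": "", "(": "", ")": "", "-": " ",
})


def fake_convert(words):
    """Hypothetical stand-in for cc.convert; unmatched words become 'AAAA'."""
    lookup = {"US": "USA", "China": "CHN"}
    return [lookup.get(word, "AAAA") for word in words]


raw = "The U.S. is deterring China, and the U.S. said so"
# Punctuation is stripped ("U.S." becomes "US"), then the sentinel word is prepended.
text = "TempText " + raw.translate(replacements)

mc = pd.Series(fake_convert(text.split()))

# drop_duplicates keeps the first occurrence of each value. Because "TempText"
# never matches, the surviving "AAAA" always sits at index label 0, so drop(0)
# removes it and leaves only country codes.
codes = mc.drop_duplicates().sort_values().drop(0).to_list()
print(codes)  # ['CHN', 'USA']
```

Without the sentinel, a text whose first word is a country would lose that country to `drop(0)`, which appears to be the reason the function prepends "TempText".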

## vals_to_df

#### PURPOSE: Convert *list* to pandas DataFrame

In [2]:
def vals_to_df(entries):
    """
    Description:
        entries: list of text to be in each row of the DataFrame

    Returns a DataFrame with a single column named "values".
    """
    if not isinstance(entries, list):
        raise TypeError('entries param must be a list')
    
    import pandas as pd
    df = {"values": entries}
    df = pd.DataFrame(df)
    return df

## Examples

Take a list of entries and convert it to a pandas DataFrame named my_df

In [3]:
entries = [
    "Ukraine must urgently be given the €300bn of frozen Russian assets | Phillip Inman", 
    "At an international forum in Singapore, Defense Secretary Pete Hegseth said the U.S. is refocusing its strength and policies on deterring China, and coaxed China's neighbors and U.S. allies to help.",
    "Israel blocks Arab ministers from occupied West Bank visit",
    "The sale of disposable vapes will be banned in the UK starting Sunday, as the country becomes the latest to tackle the “environmental nightmare” of the single-use devices",
    'Hungarian Prime Minister Viktor Orbán has been called "Trump before there was a Trump." Heres why his reshaping of Hungarys political institutions inspires U.S. conservatives.'
]

my_df = vals_to_df(entries)

NOTE: 'explode' and 'log' are set to their default values

In [4]:
parse_country(my_df, 'values', 'countries', explode=True, log="CRITICAL")

| | values | countries |
|---|---|---|
| 0 | Ukraine must urgently be given the €300bn of f... | RUS |
| 0 | Ukraine must urgently be given the €300bn of f... | UKR |
| 1 | At an international forum in Singapore Defense... | CHN |
| 1 | At an international forum in Singapore Defense... | SGP |
| 1 | At an international forum in Singapore Defense... | USA |
| 2 | Israel blocks Arab ministers from occupied Wes... | ISR |
| 3 | The sale of disposable vapes will be banned in... | GBR |
| 4 | Hungarian Prime Minister Viktor Orbán has been... | HUN |
| 4 | Hungarian Prime Minister Viktor Orbán has been... | USA |


NOTE: 'explode' is set to its non-default value, False

In [5]:
parse_country(my_df, 'values', 'countries', explode=False, log="CRITICAL")

| | values | countries |
|---|---|---|
| 0 | Ukraine must urgently be given the €300bn of f... | [RUS, UKR] |
| 1 | At an international forum in Singapore Defense... | [CHN, SGP, USA] |
| 2 | Israel blocks Arab ministers from occupied Wes... | [ISR] |
| 3 | The sale of disposable vapes will be banned in... | [GBR] |
| 4 | Hungarian Prime Minister Viktor Orbán has been... | [HUN, USA] |


NOTE: 'log' is set to its non-default value, "INFO"

In [6]:
parse_country(my_df, 'values', 'countries', explode=True, log="INFO")

TempText not found in regex
must not found in regex
urgently not found in regex
be not found in regex
given not found in regex
the not found in regex
€300bn not found in regex
of not found in regex
frozen not found in regex
assets not found in regex
| not found in regex
Phillip not found in regex
Inman not found in regex
TempText not found in regex
At not found in regex
an not found in regex
international not found in regex
forum not found in regex
in not found in regex
Defense not found in regex
Secretary not found in regex
Pete not found in regex
Hegseth not found in regex
said not found in regex
the not found in regex
is not found in regex
refocusing not found in regex
its not found in regex
strength not found in regex
and not found in regex
policies not found in regex
on not found in regex
deterring not found in regex
and not found in regex
coaxed not found in regex
neighbors not found in regex
and not found in regex
allies not found in regex
to not found in regex
help not found in

| | values | countries |
|---|---|---|
| 0 | Ukraine must urgently be given the €300bn of f... | RUS |
| 0 | Ukraine must urgently be given the €300bn of f... | UKR |
| 1 | At an international forum in Singapore Defense... | CHN |
| 1 | At an international forum in Singapore Defense... | SGP |
| 1 | At an international forum in Singapore Defense... | USA |
| 2 | Israel blocks Arab ministers from occupied Wes... | ISR |
| 3 | The sale of disposable vapes will be banned in... | GBR |
| 4 | Hungarian Prime Minister Viktor Orbán has been... | HUN |
| 4 | Hungarian Prime Minister Viktor Orbán has been... | USA |
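
The effect of the `log` parameter comes down to ordinary logger levels. The standard-library sketch below assumes the 'not found' messages are emitted at WARNING level; the logger name `demo_converter`, the `ListHandler` class, and the `warnings_seen` helper are hypothetical and exist only for this illustration.

```python
import logging


class ListHandler(logging.Handler):
    """Collect formatted log messages in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def warnings_seen(level):
    """Emit one warning at the given logger level and return what got through."""
    logger = logging.getLogger("demo_converter")  # hypothetical logger name
    logger.propagate = False  # keep the demo away from the root handlers
    handler = ListHandler()
    logger.addHandler(handler)
    logger.setLevel(level)
    try:
        logger.warning("%s not found in regex", "urgently")
    finally:
        logger.removeHandler(handler)
    return handler.messages


print(warnings_seen(logging.INFO))      # ['urgently not found in regex']
print(warnings_seen(logging.CRITICAL))  # []
```

A WARNING passes a logger set to INFO but is filtered by one set to CRITICAL, which matches the behaviour seen with log="INFO" and log="CRITICAL" in the examples.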
