## Step 3: COUNTRY-BASED LOCATION INFERENCE<a class="anchor" id="localization"></a>



<div class="alert alert-block alert-success">
To infer the location, the `pycountry` library is used: it scans the string passed as input and extracts the country name it contains.

Affiliations whose AffiliationName does not include the location as a country name cannot be classified (they account for a small percentage of the rows).
</div>



<div class="alert alert-block alert-warning">
    <b>IMPROVING COUNTRY AFFILIATION</b>
    <br>
    
- Using the `pycountry` library
- Importing a university dataset and pre-processing the data before using `pycountry`

</div> 



#### Substitute some country names to normalize the data

In [89]:
def sustitute_country_names(df, col):
    
    # Patterns are anchored on word boundaries so that an abbreviation such as
    # 'US' is not rewritten when it appears inside a longer token (e.g. 'USTC').
    replacements = {
        r'\bUK\b': 'United Kingdom',
        r'\bUSA\b': 'United States',
        r'\bUS\b': 'United States',
        r'\bNorthern Ireland\b': 'United Kingdom',
        r'\bN\.Ireland\b': 'United Kingdom',
        r'\bSouth Korea\b': 'Korea',
    }
    for pattern, name in replacements.items():
        df[col] = df[col].str.replace(pattern, name, regex=True)
    
    return df
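
Short abbreviations such as `US` or `UK` can also appear inside longer tokens, where a plain substring replacement would rewrite them as well. The sketch below uses only the standard `re` module and a hypothetical helper, `normalize_country_names`, that is not part of the notebook. It shows how word boundaries limit the substitution to standalone abbreviations; the affiliation strings are illustrative.

```python
import re

# Illustrative subset of the abbreviations handled in this step
REPLACEMENTS = {
    r'\bUK\b': 'United Kingdom',
    r'\bUSA\b': 'United States',
    r'\bUS\b': 'United States',
}

def normalize_country_names(text):
    """Replace standalone country abbreviations in a single string."""
    for pattern, name in REPLACEMENTS.items():
        text = re.sub(pattern, name, text)
    return text

# Plain substring replacement also rewrites 'US' inside 'USTC'
plain = 'USTC, Hefei, China'.replace('US', 'United States')
print(plain)    # United StatesTC, Hefei, China

# With word boundaries the token is left untouched
print(normalize_country_names('USTC, Hefei, China'))    # USTC, Hefei, China
print(normalize_country_names('MIT, Cambridge, USA'))   # MIT, Cambridge, United States
```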

### 3.1 (Pre-processing) Import University-Country dataset

Some affiliations do not include the country. If we first replace every university name found in the imported dataset with its corresponding country, fewer affiliations are left without a country afterwards.

In [90]:
route = ''  # path to the directory containing the university dataset

df_universities = pd.read_csv(route + '/university_dataset4.csv')
df_universities = df_universities[['institution', 'country']]

In [91]:
df_universities[15:20]

Unnamed: 0,institution,country
15,Swiss Federal Institute of Technology in Zurich,Switzerland
16,Kyoto University,Japan
17,Weizmann Institute of Science,Israel
18,"University of California, Los Angeles",USA
19,"University of California, San Diego",USA


#### Transform the universities dataset into a dictionary to match it against the affiliation dataset

In [92]:
def create_dict_from_df(df, cols):
    
    df = sustitute_country_names(df, cols[1])
    dict_ = dict(zip(df[cols[0]], df[cols[1]]))
    
    return dict_

In [93]:
universities_dict = create_dict_from_df(df_universities, ['institution', 'country'])

#### Substitute university names in the affiliation dataset with the corresponding country

In [94]:
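def sustitute_university_by_country(df, dict_, col):
    
    for university, country in dict_.items():
        # regex=False: university names are literal text and may contain
        # characters such as '(' or '.' that are special in regular expressions
        df[col] = df[col].str.replace(str(university), str(country), regex=False)
        
    return df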
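
To make the effect of this substitution concrete, here is a minimal sketch on plain strings rather than on a DataFrame column. The helper `replace_universities` is hypothetical and the two dictionary entries are taken from the dataset preview above. When the affiliation already ends with the country, the result contains the country twice, which is why values such as "Israel, Israel" show up in the output further down.

```python
# Two entries taken from the university dataset preview
universities = {
    'Kyoto University': 'Japan',
    'Weizmann Institute of Science': 'Israel',
}

def replace_universities(affiliation, mapping):
    """Replace every known university name in one affiliation string."""
    for university, country in mapping.items():
        affiliation = affiliation.replace(university, country)
    return affiliation

# The affiliation already names the country, so it now appears twice
print(replace_universities('Kyoto University, Japan', universities))        # Japan, Japan

# The affiliation had no country, so the substitution supplies one
print(replace_universities('Weizmann Institute of Science', universities))  # Israel
```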

In [95]:
def create_CountryAffiliation_column(df, orig_col, aff_col, countryaff_col):
    
    df[orig_col] = df[aff_col]
    df = sustitute_country_names(df, aff_col)
    #df = sustitute_country_names(df, orig_col)
    df = sustitute_university_by_country(df, universities_dict, aff_col)

    df[countryaff_col] = df[aff_col]
    
    return df

<div class="alert alert-block alert-info">
<b> DBLP Conferences </b>
</div>

In [111]:
df_complete = create_CountryAffiliation_column(df_complete, 'Affiliation_orig', 'Affiliation', 'CountryAffilitation')
df_complete[['Url','Conference','Affiliation','Gender','Affiliation_orig','CountryAffilitation']].head()

Unnamed: 0,Url,Conference,Affiliation,Gender,Affiliation_orig,CountryAffilitation
0,https://dblp.org/pid/08/4162.html,UAI 2019,"Israel, Israel",Male,"Hebrew University of Jerusalem, Israel","Israel, Israel"
1,https://dblp.org/pid/42/2642.html,UAI 2019,,Male,,
2,https://dblp.org/pid/163/3862.html,UAI 2019,,Male,,
3,https://dblp.org/pid/f/BoiFaltings.html,UAI 2019,"Switzerland, Switzerland",Male,Swiss Federal Institute of Technology in Lausa...,"Switzerland, Switzerland"
4,https://dblp.org/pid/181/2834.html,UAI 2019,,Male,,


#### If we have a list of Affiliations we can also use this method to infer country

In [None]:
# route = '' # route of the csv with the affiliation information
# df_complete = pd.read_csv(route)
# df_complete = create_CountryAffiliation_column(df_complete, 'Affiliation_orig', 'Affiliation', 'CountryAffilitation')

### 3.3 Implementing library pycountry


When an affiliation contains more than one country name, the last one matched while iterating over `pycountry.countries` is kept.

In [112]:
import pycountry
import numpy as np

def country_inference(df):

    # Map each affiliation string to the last pycountry name it contains
    countries = {}
    for affiliation in df.Affiliation:
        # Skip missing values (None or NaN)
        if not isinstance(affiliation, str):
            continue

        for country in pycountry.countries:
            if country.name in affiliation:
                countries[affiliation] = country.name

    # Replace the matched values with the inferred country;
    # values without a match are left unchanged
    df.CountryAffilitation = df.CountryAffilitation.replace(countries)

    return df
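
The "last one is taken" rule can be illustrated without the library. The sketch below is a simplified stand-in, not the notebook's code: `COUNTRY_NAMES` is a hypothetical short list that plays the role of the names yielded by `pycountry.countries`, and `last_matching_country` is a hypothetical helper.

```python
# Hypothetical stand-in for the names yielded by pycountry.countries
COUNTRY_NAMES = ['Finland', 'Georgia', 'United States']

def last_matching_country(affiliation, names=COUNTRY_NAMES):
    """Return the last name in `names` found in `affiliation`, or None."""
    match = None
    for name in names:
        if name in affiliation:
            match = name  # a later match overwrites an earlier one
    return match

print(last_matching_country('University of Georgia, Athens, United States'))  # United States
print(last_matching_country('University of Helsinki, Finland'))               # Finland
print(last_matching_country('Carnegie Mellon University'))                    # None
```

Note that "last" refers to the order in which the country names are iterated, not to the position of the name inside the affiliation string.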
    

In [113]:
def create_final_df(df, col, col_to_substitute):
    
    df = country_inference(df)
    df = df.drop([col], axis=1)
    df = df.rename(columns={col_to_substitute: col})
    
    return df

<div class="alert alert-block alert-warning">
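The pycountry library does not match Taiwan and Korea, because it lists them under their ISO 3166 names ("Taiwan, Province of China" and "Korea, Republic of") rather than under the short names used in affiliations. These two countries are therefore inferred manually.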
</div> 
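
The following sketch shows why a substring check misses these two countries. It assumes that pycountry stores them under the ISO 3166 style names listed in `OFFICIAL_NAMES`; the helper `find_name` and the affiliation string are illustrative and not part of the notebook.

```python
# Assumed pycountry names for these two entries (ISO 3166 style)
OFFICIAL_NAMES = ['Taiwan, Province of China', 'Korea, Republic of']
# Short names as they usually appear in affiliation strings
SHORT_NAMES = ['Taiwan', 'Korea']

def find_name(affiliation, names):
    """Return the first name in `names` contained in `affiliation`, or None."""
    for name in names:
        if name in affiliation:
            return name
    return None

affiliation = 'National Taiwan University, Taipei, Taiwan'
print(find_name(affiliation, OFFICIAL_NAMES))  # None
print(find_name(affiliation, SHORT_NAMES))     # Taiwan
```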




In [114]:
def taiwan_korea_inference(df):
    # Taiwan is processed first, then Korea, as in a row-by-row scan.
    # na=False skips rows whose CountryAffilitation is missing.
    for name in ['Taiwan', 'Korea']:
        mask = df['CountryAffilitation'].str.contains(name, na=False, regex=False)
        df.loc[mask, 'CountryAffilitation'] = name
            
    return df

<div class="alert alert-block alert-info">
<b> DBLP Conferences </b>
</div>

In [122]:
df_complete = create_final_df(df_complete, 'Affiliation', 'Affiliation_orig')
df_complete[['Url','Conference','Gender', 'Affiliation', 'CountryAffilitation']][54:59]

Unnamed: 0,Url,Conference,Gender,Affiliation,CountryAffilitation
54,https://dblp.org/pid/96/7521.html,UAI 2019,Male,"University of California, Los Angeles, Compute...",United States
55,https://dblp.org/pid/165/4473.html,UAI 2019,Male,,
56,https://dblp.org/pid/96/3115-1.html,UAI 2019,Male,"Carnegie Mellon University, Department of Phil...",United States
57,https://dblp.org/pid/56/3623.html,UAI 2019,Male,"University of Helsinki, Finland",Finland
58,https://dblp.org/pid/58/1568-3.html,UAI 2019,Male,"University of Amsterdam, The Netherlands",Netherlands


#### Rows with no AffiliationCountry value

In [123]:
def country_affiliation_missing(df):
    
    # A row has no inferred country when the value is missing or was left
    # equal to the raw affiliation text
    unresolved = df.Affiliation == df.CountryAffilitation
    no_countryaffiliation = df[unresolved | df.CountryAffilitation.isna()]
    print('There are', len(no_countryaffiliation),'rows out of', len(df),'with missing AffiliationCountry value')
    
    # Mark the unresolved rows as missing (label-based, so it also works
    # when the index is not a default RangeIndex)
    df.loc[unresolved, 'CountryAffilitation'] = np.nan
    
    return df, no_countryaffiliation

<div class="alert alert-block alert-info">
<b> DBLP Conferences </b>
</div>

In [125]:
df_complete, df_complete_nocountry = country_affiliation_missing(df_complete)

There are 311 rows out of 414 with missing AffiliationCountry value


In [130]:
df_complete[['Url','Conference','Gender','CountryAffilitation', 'Affiliation']][51:65]

Unnamed: 0,Url,Conference,Gender,CountryAffilitation,Affiliation
51,https://dblp.org/pid/26/2122.html,UAI 2019,Male,Canada,"University of Waterloo, ON, Canada"
52,https://dblp.org/pid/126/4298.html,UAI 2019,Male,,
53,https://dblp.org/pid/27/10464.html,UAI 2019,Male,,
54,https://dblp.org/pid/96/7521.html,UAI 2019,Male,United States,"University of California, Los Angeles, Compute..."
55,https://dblp.org/pid/165/4473.html,UAI 2019,Male,,
56,https://dblp.org/pid/96/3115-1.html,UAI 2019,Male,United States,"Carnegie Mellon University, Department of Phil..."
57,https://dblp.org/pid/56/3623.html,UAI 2019,Male,Finland,"University of Helsinki, Finland"
58,https://dblp.org/pid/58/1568-3.html,UAI 2019,Male,Netherlands,"University of Amsterdam, The Netherlands"
59,https://dblp.org/pid/92/5526.html,UAI 2019,Male,,
60,https://dblp.org/pid/44/9886.html,UAI 2019,Male,,
