## Step 3: COUNTRY-BASED LOCATION INFERENCE<a class="anchor\" id="localization"></a>



<div class="alert alert-block alert-success">
In order to infer the location the library pycountry will be used. It reads the string passed as input and extracts the country. 
    
Those affiliations not providing the the location in form of CountryName in the AffiliationName cannot be classified (they are a few percentage).
</div>



<div class="alert alert-block alert-warning">
    <b>IMPROVING COUNTRY AFFILIATION</b>
    <br>
    
- Using PYCOUNTRY library
- Importing a university dataset and pre-process the data before using PYCOUNTRY

</div> 



<div class="alert alert-block alert-info">
<b> Step necessary for ACII datasets, as ground truth is being used to check accuracy </b>
</div>

In [40]:
def import_df(route, csv, col_drop, col_modify, cols_test):
    
    df = pd.read_csv(route + csv)
    df = df.drop(columns = [col_drop])
    df[col_modify] = df[col_modify].str.replace('United States of America','United States')
    df_test = df[cols_test]
    
    return df, df_test

In [41]:
ACII = True

if ACII == True:
    route = '' #specify where the ACII conference datasets are saved
    col_to_drop = 'Unnamed: 0'
    col_to_modify ='CountryAffilitation'
    cols_for_testing = ['Name', 'Affiliation']


    df_2019, df_2019_test = import_df(route,'/all_2019.csv', col_to_drop, col_to_modify, cols_for_testing)
    df_2017, df_2017_test = import_df(route,'/all_2017.csv', col_to_drop, col_to_modify, cols_for_testing)
    df_2015, df_2015_test = import_df(route,'/all_2015.csv', col_to_drop, col_to_modify, cols_for_testing)
    df_2013, df_2013_test = import_df(route,'/all_2013.csv', col_to_drop, col_to_modify, cols_for_testing)
    df_2011, df_2011_test = import_df(route,'/all_2011.csv', col_to_drop, col_to_modify, cols_for_testing)
    df_2009, df_2009_test = import_df(route,'/all_2009.csv', col_to_drop, col_to_modify, cols_for_testing)
    df_2007, df_2007_test = import_df(route,'/all_2007.csv', col_to_drop, col_to_modify, cols_for_testing)
    df_2005, df_2005_test = import_df(route,'/all_2005.csv', col_to_drop, col_to_modify, cols_for_testing)

In [45]:
df_2015[['Affiliation', 'CountryAffilitation', 'Code']].head() # Name column is not shown due to anonymity issues

Unnamed: 0,Affiliation,CountryAffilitation,Code
0,"Computer Laboratory, University of Cambridge, ...",United Kingdom,Academia
1,"Computer Laboratory, University of Cambridge,...",United Kingdom,Academia
2,"audEERING UG, Gilching, Germany",Germany,Industry
3,"audEERING UG, Gilching, Germany",Germany,Industry
4,"audEERING UG, Gilching, Germany",Germany,Industry


#### Substitute some countries to normalize the data

In [46]:
def sustitute_country_names(df, col):
    
    df[col] = df[col].str.replace('UK','United Kingdom')
    df[col] = df[col].str.replace('USA','United States')
    df[col] = df[col].str.replace('US','United States')
    df[col] = df[col].str.replace('Northern Ireland','United Kingdom')
    df[col] = df[col].str.replace('N.Ireland','United Kingdom')
    df[col] = df[col].str.replace('South Korea', 'Korea')
    
    return df

### 3.1(Pre-processing) Import University-Country dataset

As some Affiliations do not include the country, if we first substitute all universities in the imported dataframe by the corresponding Country, we will have left values with missing countries afterwards

In [47]:
route = '' #route of univerisity datasets

df_universities = pd.read_csv(route + '/university_dataset4.csv')
df_universities = df_universities[['institution', 'country']]

In [51]:
df_universities[15:20]

Unnamed: 0,institution,country
15,Swiss Federal Institute of Technology in Zurich,Switzerland
16,Kyoto University,Japan
17,Weizmann Institute of Science,Israel
18,"University of California, Los Angeles",USA
19,"University of California, San Diego",USA


#### Transform Universities dataset to dictionary in order to do the match with Affiliation Dataset

In [52]:
def create_dict_from_df(df, cols):
    
    df = sustitute_country_names(df, cols[1])
    dict_ = dict(zip(df[cols[0]], df[cols[1]]))
    
    return dict_

In [53]:
universities_dict = create_dict_from_df(df_universities, ['institution', 'country'])

#### Substitute universities values in Affiliation Dataset by the corresponding country

In [54]:
def sustitute_university_by_country(df, dict_, col):
    
    for university, country in dict_.items():
        df[col] = df[col].str.replace(str(university), str(country))
        
    return df

In [55]:
def create_CountryAffiliation_column(df, orig_col, aff_col, countryaff_col):
    
    df[orig_col] = df[aff_col]
    df = sustitute_country_names(df, aff_col)
    #df = sustitute_country_names(df, orig_col)
    df = sustitute_university_by_country(df, universities_dict, aff_col)

    df[countryaff_col] = df[aff_col]
    
    return df

<div class="alert alert-block alert-info">
<b> ACII </b>
</div>

In [56]:
if ACII == True:
    df_2019_test = create_CountryAffiliation_column(df_2019_test, 'Affiliation_orig', 'Affiliation', 'CountryAffilitation')
    df_2017_test = create_CountryAffiliation_column(df_2017_test, 'Affiliation_orig', 'Affiliation', 'CountryAffilitation')
    df_2015_test = create_CountryAffiliation_column(df_2015_test, 'Affiliation_orig', 'Affiliation', 'CountryAffilitation')
    df_2013_test = create_CountryAffiliation_column(df_2013_test, 'Affiliation_orig', 'Affiliation', 'CountryAffilitation')
    df_2011_test = create_CountryAffiliation_column(df_2011_test, 'Affiliation_orig', 'Affiliation', 'CountryAffilitation')
    df_2009_test = create_CountryAffiliation_column(df_2009_test, 'Affiliation_orig', 'Affiliation', 'CountryAffilitation')
    df_2007_test = create_CountryAffiliation_column(df_2007_test, 'Affiliation_orig', 'Affiliation', 'CountryAffilitation')
    df_2005_test = create_CountryAffiliation_column(df_2005_test, 'Affiliation_orig', 'Affiliation', 'CountryAffilitation')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[orig_col] = df[aff_col]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = df[col].str.replace('USA','United States')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[col] = df[col].str.replace('US','United States')
A value is trying to be set on a copy of a slice from a DataFrame.
Try us

In [58]:
df_2015_test[['Affiliation', 'Affiliation_orig', 'CountryAffilitation']].head()

Unnamed: 0,Affiliation,Affiliation_orig,CountryAffilitation
0,"Computer Laboratory, United Kingdom, Cambridge...","Computer Laboratory, University of Cambridge, ...","Computer Laboratory, United Kingdom, Cambridge..."
1,"Computer Laboratory, United Kingdom, Cambridg...","Computer Laboratory, University of Cambridge,...","Computer Laboratory, United Kingdom, Cambridg..."
2,"audEERING UG, Gilching, Germany","audEERING UG, Gilching, Germany","audEERING UG, Gilching, Germany"
3,"audEERING UG, Gilching, Germany","audEERING UG, Gilching, Germany","audEERING UG, Gilching, Germany"
4,"audEERING UG, Gilching, Germany","audEERING UG, Gilching, Germany","audEERING UG, Gilching, Germany"


The previous dataset shows:
- **Affiliation** : takes the universities in the universities dictionary and substitute them by the corresponding country. For instance, in the first case University of Cambrigde substituted by United Kingdom.
- **Affiliation_orig** : original affiliation column
- **CountryAffiliation** : for now it does not mean anything, but it will contain the corresponding country to the affiliations.

### 3.3 Implementing library pycountry


When there are two possible countries the last one is taken.

In [59]:
import pycountry
import numpy as np

def country_inference(df):

    countries = {}
    for affiliation in df.Affiliation:
        if affiliation!= None:

            #im also saving the values where there is no country inferred
            for country in pycountry.countries:
                text = affiliation

                # to avoid null values
                if text == text:

                    if country.name in text:
                        name = country.name
                        countries[affiliation] = name
                        
                        
    # Replace original values with country
    df.CountryAffilitation = df.CountryAffilitation.replace(countries)

    return df
    

In [60]:
def create_final_df(df, col, col_to_substitute):
    
    df = country_inference(df)
    df = df.drop([col], axis=1)
    df = df.rename(columns={col_to_substitute: col})
    
    return df

<div class="alert alert-block alert-warning">
Pycountry library has a problem when reading Taiwan and Korea, manually infer this country.
</div> 




In [61]:
def taiwan_korea_inference(df):
    for i in df.index:
        if 'Taiwan' in df['CountryAffilitation'].iloc[i]:
            df.at[i, 'CountryAffilitation'] = 'Taiwan'
            
    for i in df.index:
        if 'Korea' in df['CountryAffilitation'].iloc[i]:
            df.at[i, 'CountryAffilitation'] = 'Korea'
            
    return df

<div class="alert alert-block alert-info">
<b> ACII </b>
</div>

In [62]:
if ACII == True:
    df_2019_test = create_final_df(df_2019_test, 'Affiliation', 'Affiliation_orig')
    df_2017_test = create_final_df(df_2017_test, 'Affiliation', 'Affiliation_orig')
    df_2015_test = create_final_df(df_2015_test, 'Affiliation', 'Affiliation_orig')
    df_2013_test = create_final_df(df_2013_test, 'Affiliation', 'Affiliation_orig')
    df_2011_test = create_final_df(df_2011_test, 'Affiliation', 'Affiliation_orig')
    df_2009_test = create_final_df(df_2009_test, 'Affiliation', 'Affiliation_orig')
    df_2007_test = create_final_df(df_2007_test, 'Affiliation', 'Affiliation_orig')
    df_2005_test = create_final_df(df_2005_test, 'Affiliation', 'Affiliation_orig')

    df_2019_test.head()

In [63]:
if ACII == True:
    df_2019_test = taiwan_korea_inference(df_2019_test)
    df_2017_test = taiwan_korea_inference(df_2017_test)
    df_2015_test = taiwan_korea_inference(df_2015_test)
    df_2013_test = taiwan_korea_inference(df_2013_test)
    df_2011_test = taiwan_korea_inference(df_2011_test)
    df_2009_test = taiwan_korea_inference(df_2009_test)
    df_2007_test = taiwan_korea_inference(df_2007_test)
    df_2005_test = taiwan_korea_inference(df_2005_test)

#### Rows with no AffiliationCountry value

In [66]:
def country_affiliation_missing(df):
    
    no_countryaffiliation = df[(df.Affiliation == df.CountryAffilitation) | (df.CountryAffilitation.isna())] 
    print('There are', len(no_countryaffiliation),'rows out of', len(df),'with missing AffiliationCountry value')
    
    for i in df.index:
        if df.Affiliation.iloc[i] == df.CountryAffilitation.iloc[i]:
            df.at[i, 'CountryAffilitation'] = np.nan
    
    return df, no_countryaffiliation

<div class="alert alert-block alert-info">
<b> ACII </b>
</div>

In [67]:
if ACII == True:
    df_2019_test, df_2019_test_nocountry = country_affiliation_missing(df_2019_test)
    df_2017_test, df_2017_test_nocountry = country_affiliation_missing(df_2017_test)
    df_2015_test, df_2015_test_nocountry = country_affiliation_missing(df_2015_test)
    df_2013_test, df_2013_test_nocountry = country_affiliation_missing(df_2013_test)
    df_2011_test, df_2011_test_nocountry = country_affiliation_missing(df_2011_test)
    df_2009_test, df_2009_test_nocountry = country_affiliation_missing(df_2009_test)
    df_2007_test, df_2007_test_nocountry = country_affiliation_missing(df_2007_test)
    df_2005_test, df_2005_test_nocountry = country_affiliation_missing(df_2005_test)

There are 24 rows out of 472 with missing AffiliationCountry value
There are 21 rows out of 347 with missing AffiliationCountry value
There are 27 rows out of 620 with missing AffiliationCountry value
There are 4 rows out of 556 with missing AffiliationCountry value
There are 134 rows out of 431 with missing AffiliationCountry value
There are 44 rows out of 510 with missing AffiliationCountry value
There are 105 rows out of 281 with missing AffiliationCountry value
There are 70 rows out of 304 with missing AffiliationCountry value


In [71]:
# Examples of affiliations where the country could not be inferred
df_2017_test_nocountry[0:len(df_2015_test_nocountry)][0:10]['Affiliation']

54     Army Research Laboratory, Los Angeles, California
96     School of Computer and Information, Hefei Univ...
97      School of Computer and Information, Hefei Uni...
99      School of Computer and Information, Hefei Uni...
131                    audEERING GmbH, Gilching, Germnay
133    Department of Computer Science, State Universi...
134     Department of Computer Science, State Univers...
162                     University of Bari �?Aldo Moro�?
163                     University of Bari �?Aldo Moro�?
164                     University of Bari �?Aldo Moro�?
Name: Affiliation, dtype: object

<div class="alert alert-block alert-warning">
    <b>TEST : manually inferring location in order to see how my implementation works</b>
    <br>
    
The ACII Conference was already labelled and revised manually, therefore the accuracy of the new method develped was checked using this as ground truth.
</div> 

In [72]:
def wrong_country(df):
    for i in df.index:
        if df.CountryAffilitation.iloc[i] != df.CountryAffiliation_infered.iloc[i]:
            df.at[i, 'WrongCountry'] = 'Yes'
        else:
            df.at[i,'WrongCountry'] = 'No'
    return df

<div class="alert alert-block alert-info">
<b> ACII </b>
</div>

In [73]:
if ACII == True:
    df_2019['CountryAffiliation_infered']= df_2019_test.CountryAffilitation
    df_2017['CountryAffiliation_infered']= df_2017_test.CountryAffilitation
    df_2015['CountryAffiliation_infered']= df_2015_test.CountryAffilitation
    df_2013['CountryAffiliation_infered']= df_2013_test.CountryAffilitation
    df_2011['CountryAffiliation_infered']= df_2011_test.CountryAffilitation
    df_2009['CountryAffiliation_infered']= df_2009_test.CountryAffilitation
    df_2007['CountryAffiliation_infered']= df_2007_test.CountryAffilitation
    df_2005['CountryAffiliation_infered']= df_2005_test.CountryAffilitation

In [74]:
 if ACII == True:
    count_2019 = (len(df_2019[df_2019.CountryAffilitation != df_2019.CountryAffiliation_infered])/len(df_2019))*100
    count_2017 = (len(df_2017[df_2017.CountryAffilitation != df_2017.CountryAffiliation_infered])/len(df_2017))*100
    count_2015 = (len(df_2015[df_2015.CountryAffilitation != df_2015.CountryAffiliation_infered])/len(df_2015))*100
    count_2013 = (len(df_2013[df_2013.CountryAffilitation != df_2013.CountryAffiliation_infered])/len(df_2013))*100
    count_2011 = (len(df_2011[df_2011.CountryAffilitation != df_2011.CountryAffiliation_infered])/len(df_2011))*100
    count_2009 = (len(df_2009[df_2009.CountryAffilitation != df_2009.CountryAffiliation_infered])/len(df_2009))*100
    count_2007 = (len(df_2007[df_2007.CountryAffilitation != df_2007.CountryAffiliation_infered])/len(df_2007))*100
    count_2005 = (len(df_2005[df_2005.CountryAffilitation != df_2005.CountryAffiliation_infered])/len(df_2005))*100
    
    count_not_2019 = (len(df_2019_test_nocountry)/len(df_2019_test))*100
    count_not_2017 = (len(df_2017_test_nocountry)/len(df_2017_test))*100
    count_not_2015 = (len(df_2015_test_nocountry)/len(df_2015_test))*100
    count_not_2013 = (len(df_2013_test_nocountry)/len(df_2013_test))*100
    count_not_2011 = (len(df_2011_test_nocountry)/len(df_2011_test))*100
    count_not_2009 = (len(df_2009_test_nocountry)/len(df_2009_test))*100
    count_not_2007 = (len(df_2007_test_nocountry)/len(df_2007_test))*100
    count_not_2005 = (len(df_2005_test_nocountry)/len(df_2005_test))*100
    
    acc_2019 = (1- (count_2019/100))*100
    acc_2017 = (1- (count_2017/100))*100
    acc_2015 = (1- (count_2015/100))*100
    acc_2013 = (1- (count_2013/100))*100
    acc_2011 = (1- (count_2011/100))*100
    acc_2009 = (1- (count_2009/100))*100
    acc_2007 = (1- (count_2007/100))*100
    acc_2005 = (1- (count_2005/100))*100
    acc = [acc_2019, acc_2017, acc_2015, acc_2013, acc_2011, acc_2009, acc_2007, acc_2005]

In [75]:
if ACII == True:
    data = {'2019':[count_2019, count_not_2019, acc_2019],
            '2017':[count_2017, count_not_2017, acc_2017],
            '2015':[count_2015, count_not_2015, acc_2015],
            '2013':[count_2013, count_not_2013, acc_2013],
            '2011':[count_2011, count_not_2011, acc_2011],
            '2009':[count_2009, count_not_2009, acc_2009],
            '2007':[count_2007, count_not_2007, acc_2007],
            '2005':[count_2005, count_not_2005, acc_2005]}
    
    df_results = pd.DataFrame(data, index =[
                                    '% of countries NOT coinciding with ground truth',
        '% of countries NOT inferred ',
                                    'ACCURACY'])
    
    df_results = df_results.round(2)
    
    for i in df_results.columns:
        df_results[i] =  df_results[i].astype(str) + ' %'
    
    mean_acc = np.mean(acc).round(3)
        
    print('MEAN ACCURACY OF COUNTRY INFERENCE:', mean_acc, '%')
df_results.head()

MEAN ACCURACY OF COUNTRY INFERENCE: 85.222 %


Unnamed: 0,2019,2017,2015,2013,2011,2009,2007,2005
% of countries NOT coinciding with ground truth,5.08 %,6.92 %,4.35 %,0.72 %,31.09 %,8.63 %,38.08 %,23.36 %
% of countries NOT inferred,5.08 %,6.05 %,4.35 %,0.72 %,31.09 %,8.63 %,37.37 %,23.03 %
ACCURACY,94.92 %,93.08 %,95.65 %,99.28 %,68.91 %,91.37 %,61.92 %,76.64 %


In [76]:
if ACII == True:
    df_2019 = wrong_country(df_2019)
    df_2017 = wrong_country(df_2017)
    df_2015 = wrong_country(df_2015)
    df_2013 = wrong_country(df_2013)
    df_2011 = wrong_country(df_2011)
    df_2009 = wrong_country(df_2009)
    df_2007 = wrong_country(df_2007)
    df_2005 = wrong_country(df_2005)

In [84]:
df_2007[['Affiliation', 'CountryAffilitation', 'CountryAffiliation_infered', 'WrongCountry']][0:10]

Unnamed: 0,Affiliation,CountryAffilitation,CountryAffiliation_infered,WrongCountry
0,National Laboratory of Pattern Recognition (NL...,China,China,No
1,National Laboratory of Pattern Recognition (NL...,China,China,No
2,National Laboratory of Pattern Recognition (NL...,China,China,No
3,National Laboratory of Pattern Recognition (NL...,China,China,No
4,"IUT de Monreuil, Universit?? Paris 8",France,,Yes
5,"IUT de Monreuil, Universit?? Paris 8",France,,Yes
6,"Department of Computer Science and Technology,...",China,China,No
7,Department of Systems Engineering and Engineer...,Hong Kong,Hong Kong,No
8,Department of Systems Engineering and Engineer...,Hong Kong,Hong Kong,No
9,"Department of Computer Science and Technology,...",China,China,No
