# CSV Analysis: Country

In [1]:
import pandas as pd
pd.options.display.max_colwidth = 400 # Avoid "..." in large strings

We will analyze the content of
[this CSV file](https://drive.google.com/file/d/1XmBh6YlfPkB5WfYSolAMP1EA5e02jHQO/view?usp=sharing)
created [here](https://github.com/scieloorg/normalizations-experiments/blob/master/experiments_2018-06-04.ipynb)
with [SciELO's Clea](https://github.com/scieloorg/clea)
from a selection of SciELO's articles corpora (XML documents).

In [2]:
dataset = pd.read_csv("inner_join_2018-06-04.csv",
                      dtype=str,
                      keep_default_na=False) \
            .drop_duplicates()

In [3]:
dataset.describe()[[k for k in dataset.columns if k.startswith("addr")]]

| | addr_city | addr_country | addr_country_code | addr_postal_code | addr_state |
|---|---|---|---|---|---|
| count | 89451 | 89451 | 89451 | 89451 | 89451 |
| unique | 2729 | 267 | 125 | 426 | 672 |
| top | | Brazil | BR | | |
| freq | 9529 | 48429 | 56596 | 87918 | 20906 |


The address information is incomplete in most entries:
$87918$ of the $89451$ entries don't have a postal code.
For the remaining columns, though,
it seems that most rows have some information regarding the address.
The data is unbalanced and its fields don't always match:

- Brazil appears far more often than other countries;
- The counts for `Brazil` ($48429$) and `BR` ($56596$) differ.
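
The shares behind these remarks can be checked by hand. The following is a minimal sketch, assuming the figures are copied manually from the `describe()` output above (the variable names are only illustrative and are not part of the notebook):

```python
# Figures copied by hand from the describe() output above.
total_rows = 89451
empty_freq = {  # columns whose most frequent value is the empty string
    "addr_city": 9529,
    "addr_postal_code": 87918,
    "addr_state": 20906,
}

# Share of rows whose value is empty, per column.
empty_share = {col: freq / total_rows for col, freq in empty_freq.items()}
for col, share in empty_share.items():
    print(f"{col}: {share:.1%} empty")

# The most common name and the most common code should refer to the same
# country, yet their frequencies differ.
brazil_rows, br_rows = 48429, 56596
gap = br_rows - brazil_rows
print("difference between the BR and Brazil counts:", gap)  # 8167
```

Only the postal code is empty in the vast majority of rows; the city and state columns are empty in a minority of them.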

There are $266$ distinct country names and $124$ distinct country codes in this dataset
($+1$ of each if we count the empty entries):

In [4]:
print(", ".join(dataset["addr_country"].unique()))

Brazil, Portugal, Brasil, Uruguay, Spain, Italy, Argentina, Venezuela, Germany, USA, United Kingdom, República Argentina, France, England, Itália, França, Dinamarca, Colombia, Switzerland, Canada, Pakistan, Cuba, México, Mexico, Costa Rica, China, Saudi Arabia, BraZil, Greece, Espanha, , Alemanha, Estados Unidos, Australia, Malaysia, Jamaica, India, United States, Unites States, Paraguay, España, Chile, Sweden, Panama, UAE, Egypt, Qatar, Nigeria, Sri Lanka, Belgium, Denmark, U.K, Turkey, Iran, Russia, Italia, Poland, Dominican Republic, Índia, Austrália, Tailândia, People's Republic of China, Tunisia, United Arab Emirates, Azerbaijan, Romania, Bulgaria, Thailand, Japan, PR China, Republic of Korea, Slovakia, Taiwan, Algeria, Korea, Kingdom of Bahrain, Austria, BR, Netherlands, South Korea, Bangladesh, Serbia, Hong Kong, South Africa, Peru, Canadá, Colômbia, Equador, Slovenia, China., The Netherlands, Brazil., UK, Bolivia, BO, Ireland, sofialopezmdp@gmail.com, EUA, Guiana, Portuga, Braz

In [5]:
len(dataset["addr_country"].unique())

267

In [6]:
dataset["addr_country_code"].unique()

array(['BR', 'PT', '', 'UY', 'ES', 'IT', 'AR', 'VE', 'DE', 'US', 'GB',
       'FR', 'DK', 'CO', 'CH', 'CA', 'PK', 'CU', 'MX', 'CR', 'CN', 'SA',
       'GR', 'AT', 'MY', 'JM', 'IN', 'SE', 'AU', 'CL', 'PA', 'AE', 'EG',
       'QA', 'NG', 'LK', 'BE', 'TR', 'IR', 'RU', 'PL', 'TN', 'RO', 'BG',
       'TH', 'JP', 'KP', 'SK', 'TW', 'DZ', 'KR', 'NL', 'BD', 'RS', 'HK',
       'ZA', 'PE', 'EC', 'SI', 'BO', 'IE', 'GY', 'BR; BR', 'CY', 'HU',
       'IQ', 'PY', 'UK', 'HR', 'UA', 'FI', 'IL', 'MD', 'NI', 'NO', 'JO',
       'NZ', 'GD', 'SW', 'BV', 'AG', 'UG', 'CZ', 'CS', 'HN', 'BJ', 'MG',
       'ME', 'TA', 'TZ', 'ZM', 'MZ', 'AO', 'LU', 'PR', 'GH', 'ID', 'BF',
       'PF', 'BM', 'MK', 'EE', 'VN', 'PO', 'SN', 'MU', 'CM', 'MA', 'BA',
       'TK', 'SZ', 'GE', 'OM', 'AL', 'IS', 'LY', 'SD', 'LB', 'KZ', 'SV',
       'GT', 'DO', 'SB', 'LT', 'BY'], dtype=object)

In [7]:
len(dataset["addr_country_code"].unique())

125

But more than 20 thousand rows ($21727$) don't include the country code.
That's quite a lot! Also, this data is clearly unbalanced:

In [8]:
pd.DataFrame(dataset.groupby("addr_country_code")
                    .size()
                    .sort_values(ascending=False)
                    .head(10),
             columns=["count"])

| addr_country_code | count |
|---|---|
| BR | 56596 |
| | 21727 |
| CN | 1421 |
| PT | 1021 |
| US | 989 |
| TR | 851 |
| AR | 690 |
| ES | 599 |
| CO | 540 |
| MX | 533 |
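
To put the imbalance into numbers, here is a small sketch that assumes the ten counts above are typed in by hand together with the total of $89451$ rows (the names `top_codes` and `shares` are illustrative):

```python
# Counts copied by hand from the top-10 table above; "" is the empty code.
total_rows = 89451
top_codes = {
    "BR": 56596, "": 21727, "CN": 1421, "PT": 1021, "US": 989,
    "TR": 851, "AR": 690, "ES": 599, "CO": 540, "MX": 533,
}

# Fraction of all rows taken by each of the ten most common codes.
shares = {code: n / total_rows for code, n in top_codes.items()}
print(f"BR: {shares['BR']:.1%}, empty: {shares['']:.1%}")

# Rows left for every other country code in the dataset.
top10_rows = sum(top_codes.values())
other_rows = total_rows - top10_rows
print("rows outside the top 10:", other_rows)  # 4484
```

Roughly 63% of the rows are `BR` and about 24% have no code at all, so every other country code shares the remaining rows.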


As we've already seen, the top `addr_country_code` ($56596$ `BR` rows)
and the top `addr_country` ($48429$ `Brazil` rows) should have had the same count,
but they differ:
the country names and codes aren't always filled together,
the same country is spelled in more than one way,
and there are spurious country names in some rows:

In [9]:
dataset.groupby(["addr_country", "addr_country_code"]).size().sort_values(ascending=False).head(30)

addr_country   addr_country_code
Brazil         BR                   38521
Brasil         BR                   17187
Brazil                               9895
Brasil                               5883
                                     2829
China          CN                    1296
Portugal       PT                    1007
Turkey         TR                     844
Argentina      AR                     683
               BR                     638
USA            US                     627
Colombia       CO                     507
Spain          ES                     425
Portugal                              420
Iran           IR                     413
India          IN                     398
Chile          CL                     369
China                                 363
Turkey                                334
Mexico         MX                     310
USA                                   237
México         MX                     223
France         FR                     221
I

There are even some country codes filled in as country names:

In [10]:
import re
from unidecode import unidecode

In [11]:
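def pre_normalize(name):
    # Transliterate to ASCII, lower-case, keep only letters and spaces,
    # then collapse runs of whitespace
    return " ".join(re.sub(r"[^a-z ]", "", unidecode(name).lower()).split())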

In [12]:
data_countries = dataset["addr_country"].apply(pre_normalize)
pd.DataFrame({"count": data_countries[data_countries.apply(len) == 2].value_counts()})

| | count |
|---|---|
| br | 154 |
| uk | 81 |
| sp | 10 |
| am | 2 |
| ru | 2 |
| be | 2 |
| rs | 1 |
| us | 1 |
| al | 1 |
| fr | 1 |
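
The effect of this pre-normalization can be tried on a few names taken from the list of unique values above. The sketch below is an approximation, not the notebook's function: it swaps `unidecode` for the standard library's `unicodedata`, which only strips combining accents and does not transliterate other scripts. The name `pre_normalize_stdlib` is hypothetical.

```python
import re
import unicodedata


def pre_normalize_stdlib(name):
    """Approximate pre_normalize without the unidecode dependency."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_like = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lower-case, keep only a-z and spaces, collapse whitespace runs.
    return " ".join(re.sub(r"[^a-z ]", "", ascii_like.lower()).split())


samples = ["Brazil.", "BraZil", "México", "U.K", "People's Republic of China", "BR"]
normalized = {s: pre_normalize_stdlib(s) for s in samples}

# Names that end up with exactly two letters look like country codes.
two_letter = sorted(v for v in normalized.values() if len(v) == 2)
print(normalized)
print(two_letter)  # ['br', 'uk']
```

Punctuation, case and accents no longer separate `Brazil.` from `BraZil`, and `U.K` collapses into the two-letter `uk`, which is why it shows up in the table above.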


No contributor references multiple countries at once in this dataset;
there's just one document referencing Brazil twice:

In [13]:
dataset[dataset["addr_country_code"].str.contains(";")]["addr_country_code"].unique()

array(['BR; BR'], dtype=object)

In [14]:
dataset[dataset["addr_country_code"] == "BR; BR"].T

| | 6353 |
|---|---|
| addr_city | Goiás |
| addr_country | Brazil; Brazil |
| addr_country_code | BR; BR |
| addr_postal_code | |
| addr_state | GO |
| aff_email | |
| aff_id | aff1 |
| aff_text | Federal University at Goiás (UFG) Universidade Federal de Goiás Federal University at Goiás Brazil Goiás GO Brazil |
| article_doi | 10.1590/1982-02592018v21n1p09 |
| article_publisher_id | |
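
A multi-valued field such as `BR; BR` can be reduced to its distinct values to check whether it really references more than one country. This is a minimal sketch, assuming `;` is the only separator in use (that is all the output above shows); the helper name `distinct_values` is hypothetical:

```python
def distinct_values(field, sep=";"):
    """Split a multi-valued field and return its distinct non-empty parts."""
    return {part.strip() for part in field.split(sep) if part.strip()}


row_codes = distinct_values("BR; BR")
row_names = distinct_values("Brazil; Brazil")

# A single distinct value means the row repeats one country
# rather than listing several.
print(row_codes, row_names)  # {'BR'} {'Brazil'}
```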


Our goal is to find what can be said about the country data of this dataset.

Looking only at the country name (`addr_country` column)
and at the country code (`addr_country_code` column),
let's see what we can find.
Some possible questions are:

- Is the country pair inconsistent?
- Which inconsistencies are more common?

We won't be able to fill in the correct data without some extra information
about the document being analyzed,
unless we use some *prior*,
which would just be a bias towards the unbalanced data we have
(e.g. both `Brasil, CL` and `Chile, BR` would yield `Brasil, BR`,
 because that's the most common pair).
On the other hand,
this *most common* approach is meaningful
when applied to a single country name value,
taken out of its row context:
the name `Chile` is assigned to `CL`,
as `CL` is the code that most `Chile` entries have.
What we can't say is whether the name or the code is wrong
in a given inconsistent row.
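
The per-name rule can be sketched as a majority vote over the codes seen with each name. This is only an illustration under stated assumptions: the pair counts are copied by hand from outputs in this notebook, and `majority_code` is a hypothetical helper, not something the notebook defines.

```python
from collections import Counter, defaultdict

# (name, code) -> number of rows, copied by hand from outputs in this notebook.
pair_counts = {
    ("Chile", "CL"): 369,
    ("Chile", "BR"): 3,
    ("Brasil", "BR"): 17187,
    ("Brazil", "BR"): 38521,
    ("Brazil", ""): 9895,  # unfilled code, ignored below
}


def majority_code(pair_counts):
    """Map each country name to the code it is most often paired with."""
    by_name = defaultdict(Counter)
    for (name, code), n in pair_counts.items():
        if name and code:  # skip rows where either field is empty
            by_name[name][code] += n
    return {name: counts.most_common(1)[0][0] for name, counts in by_name.items()}


best = majority_code(pair_counts)
print(best["Chile"])  # CL
```

The vote is taken per name, so the three `Chile, BR` rows don't pull `Chile` towards `BR`, even though `BR` dominates the dataset as a whole.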

So let's find out whether there are inconsistencies, and which ones they are.

In [15]:
pairs = dataset.groupby(["addr_country", "addr_country_code"]).size()
non_empty = pd.DataFrame(
    pairs[(pairs.index.get_level_values("addr_country").str.strip() != "") &
          (pairs.index.get_level_values("addr_country_code").str.strip() != "")],
    columns=["count"]
)
non_empty.head(14)

| addr_country | addr_country_code | count |
|---|---|---|
| AL | AL | 1 |
| Alemanha | DE | 27 |
| Algeria | DZ | 31 |
| Algerie | DZ | 3 |
| Algérie | DZ | 2 |
| Angola | AO | 2 |
| Angola | AU | 1 |
| Angola | BR | 1 |
| Antigua and Barbuda | AG | 1 |
| Argentin | AR | 2 |


These are the first entries in alphabetical order, not the worst conflicts,
but there are clear conflicts and multiple languages in there.
Some highlights:

- **AL** is *Albania*, but is that correct in that single entry?
Sometimes we can't trust either column's value;

- *Algérie* in French, *Argéria* in Portuguese and *Algeria* in English
are a single country with multiple names
due to multiple document languages in this dataset.
At least the country code **DZ** is correct;

- In the ISO 3166-1 alpha-2 code,
*Angola* is **AO**,
*Argentina* is **AR**,
*Australia* is **AU**,
*Austria* is **AT** and
*Brazil* is **BR**.
But there are clearly mixed rows, and they're not just a few
($10$ out of $693$ *Argentina* entries have **BR** as their code).
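
One way to quantify such mixed rows is the share of a name's entries whose code disagrees with that name's most common code. The sketch below assumes only the pairs visible in the outputs of this notebook, typed in by hand; `minority_share` is a hypothetical helper.

```python
from collections import Counter

# Codes seen with each name, copied by hand from outputs in this notebook.
name_codes = {
    "Angola": Counter({"AO": 2, "AU": 1, "BR": 1}),
    "Argentina": Counter({"AR": 683, "BR": 10}),
    "Algeria": Counter({"DZ": 31}),
}


def minority_share(counts):
    """Fraction of entries that disagree with the most common code."""
    total = sum(counts.values())
    majority = counts.most_common(1)[0][1]
    return (total - majority) / total


# Only names paired with more than one code are conflicting.
conflicts = {
    name: minority_share(counts)
    for name, counts in name_codes.items()
    if len(counts) > 1
}
for name, share in conflicts.items():
    print(f"{name}: {share:.1%} of entries disagree with the majority code")
```

With these figures, half of the *Angola* entries disagree with **AO**, while for *Argentina* the disagreement is about 1.4%.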

We can select some country codes to see what's going on:

In [16]:
codes = ["BR", "BV", "CH", "CN", "ES", "FR", "MX", "PT", "TR", "US"]

The country names connected to these codes,
grouped by the number of entries (*count*), are:

In [17]:
df_pairs = pairs.reset_index(name="count")
df_pairs_code = df_pairs[df_pairs["addr_country_code"].isin(codes)]

In [18]:
pd.DataFrame(df_pairs_code.groupby(["addr_country_code", "count"])
                          .apply(lambda grp: sorted(grp["addr_country"])),
             columns=["List of country names"])

| addr_country_code | count | List of country names |
|---|---|---|
| BR | 1 | [Angola, Braszil, Brazl, China, Colômbia, FR, Germany, India, Italy, Japan, KSA, Mexico, Moçambique, Nigeria, PA, RS, Slovenia, São Paulo, Venezuela, br] |
| BR | 2 | [Brasil., França, Not Normalized, Spain] |
| BR | 3 | [BRASIL, BRAZIL, Chile, Malaysia, People's Republic of China, Uruguay] |
| BR | 4 | [Brasill, Portugal, SP, Turkey, USA] |
| BR | 5 | [BraZil, Espanha, México] |
| BR | 8 | [Bra] |
| BR | 10 | [Argentina] |
| BR | 16 | [Brazil.] |
| BR | 56 | [Br] |
| BR | 79 | [BR] |


The country name has several valid versions we can understand,
but there's too much information in that table:
we don't need to see the names that are always connected to a single code.

The following names are connected to the previously selected country codes,
but each of them is connected to more than one of these codes:

In [19]:
addr_country_counts = df_pairs_code["addr_country"].value_counts()
mixed_names = addr_country_counts[addr_country_counts > 1].index.tolist()
mixed_names

['',
 'Brazil',
 'Brasil',
 'USA',
 'China',
 'BR',
 'França',
 'Portugal',
 'Spain',
 'Turkey',
 'México',
 'Mexico',
 'Espanha',
 'Chile',
 "People's Republic of China"]

Let's see all the pairs regarding these names and the previously selected codes:

In [20]:
df_pairs_code[df_pairs_code["addr_country"].isin(mixed_names)] \
     .sort_values(["addr_country_code", "addr_country"], ascending=True) \
     .set_index(["addr_country_code", "addr_country"])

| addr_country_code | addr_country | count |
|---|---|---|
| BR | | 638 |
| BR | BR | 79 |
| BR | Brasil | 17187 |
| BR | Brazil | 38521 |
| BR | Chile | 3 |
| BR | China | 1 |
| BR | Espanha | 5 |
| BR | França | 2 |
| BR | Mexico | 1 |
| BR | México | 5 |


There are many unfilled and inconsistent pairs in this subset of our data.

Nevertheless, the whole dataset can be seen as a bipartite graph,
whose two parts are the *names* and the *codes*,
and where each $(name, code)$ pair is an edge.
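
To make the graph view concrete, here is a minimal sketch that builds the adjacency from a handful of pairs seen in this notebook and groups them into connected components with a breadth-first search. The edge list is a hand-picked subset, and `connected_components` is a hypothetical helper written with the standard library only. Nodes are tagged as `("name", ...)` or `("code", ...)` so that a string such as `BR`, which appears in both columns, stays in the right part of the graph.

```python
from collections import defaultdict, deque

# Hand-picked (name, code) pairs that appear in the outputs of this notebook.
edges = [
    ("Brazil", "BR"), ("Brasil", "BR"),
    ("Chile", "BR"), ("Chile", "CL"),
    ("China", "BR"), ("China", "CN"),
    ("Portugal", "PT"),
    ("Algeria", "DZ"), ("Algérie", "DZ"),
]


def connected_components(edges):
    """Group names and codes that are linked by any chain of pairs."""
    adjacency = defaultdict(set)
    for name, code in edges:
        adjacency[("name", name)].add(("code", code))
        adjacency[("code", code)].add(("name", name))

    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        queue, component = deque([start]), set()
        seen.add(start)
        while queue:  # breadth-first search from this node
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        components.append(component)
    return components


components = connected_components(edges)
print(sorted(len(c) for c in components))  # [2, 3, 7]
```

In this toy graph, the inconsistent `Chile, BR` and `China, BR` edges are enough to merge Brazil, Chile and China into a single component, whereas consistent data would give one component per country.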