# A2: Bias in Data

The goal of this assignment is to understand the concept of bias by exploring data on English Wikipedia articles. We look in particular at articles on political figures from various countries.

## 1. Data Acquisition

In [176]:
import pandas as pd
import numpy as np
import requests

### 1.1 Wikipedia page data
The Wikipedia article data is available at https://figshare.com/articles/Untitled_Item/5513449. It lists articles on political figures by country in a file named page_data.csv, released under the CC-BY 4.0 license. A copy is stored at ./raw/page_data.csv.

Citation: Keyes, Os (2017): Politicians by Country from the English-language Wikipedia. figshare. Dataset.

Reading the data into a dataframe

In [177]:
page_data = pd.read_csv('./raw/page_data.csv')
page_data.head(5)

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


### 1.2 Population data

The population data contains mid-2018 populations for 207 countries. The file was provided as part of the assignment and is taken from the World Population Data Sheet published by the Population Reference Bureau.

In [179]:
population_data = pd.read_csv('./raw/WPDS_2018_data.csv')
population_data.head()

Unnamed: 0,Geography,Population mid-2018 (millions)
0,AFRICA,1284.0
1,Algeria,42.7
2,Egypt,97.0
3,Libya,6.5
4,Morocco,35.2


### 1.3 Fetch article scores using ORES

The documentation for the ORES API can be found at https://www.mediawiki.org/wiki/ORES.

This API is used to fetch an article score for each Wikipedia article.

In [14]:
API_ENDPOINT = 'https://ores.wikimedia.org/v3/scores/{project}/?models={model}&revids={revids}'
HEADERS = {'User-Agent' : 'https://github.com/tharunsikhinam', 'From' : 'tharun@uw.edu'}
API_PARAMS = {
                "project": "enwiki",
                "model": "wp10",
                "revids": ""
}

In [180]:
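def call_api(revision_ids, headers):
    """
    Call the ORES API for one batch of revision ids and return the parsed JSON response.
    """
    # Build the parameters locally instead of mutating the global API_PARAMS
    params = dict(API_PARAMS, revids='|'.join(str(x) for x in revision_ids))
    # Pass the headers so the request identifies its sender
    api_call = requests.get(API_ENDPOINT.format(**params), headers=headers)
    response = api_call.json()
    return response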

In [181]:
def get_ores_data(rev_ids, batch_size=100):
    """
    Group rev_ids into batches of `batch_size` and call the ORES API once per batch
    to fetch article ratings.
    """
    rows = []
    # range() also covers the final, possibly shorter batch, which the earlier
    # flag-based loop skipped when len(rev_ids) was not a multiple of 100
    for start in range(0, len(rev_ids), batch_size):
        response_ores = call_api(rev_ids[start:start + batch_size], HEADERS)
        scores = response_ores['enwiki']['scores']
        for revid in scores:
            try:
                rating = scores[revid]['wp10']['score']['prediction']
            except KeyError:
                # No rating found for article
                rating = np.nan
            rows.append({'rev_ids': revid, 'ratings': rating})
    # Build the dataframe once; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=['rev_ids', 'ratings'])

In [182]:
rev_ids = list(page_data['rev_id'])
ratings = get_ores_data(rev_ids)
ratings.to_csv("./raw/ores_data.csv")
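
The batching can be checked offline, without calling ORES. The following is a minimal sketch: `chunked` and `build_ores_url` are hypothetical helper names that do not appear in the notebook, and the URL template is copied from `API_ENDPOINT` above.

```python
def chunked(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_ores_url(revision_ids, project="enwiki", model="wp10"):
    """Format the ORES v3 scores URL for one batch of revision ids."""
    template = ("https://ores.wikimedia.org/v3/scores/{project}/"
                "?models={model}&revids={revids}")
    return template.format(project=project, model=model,
                           revids="|".join(str(r) for r in revision_ids))


# 250 placeholder revision ids should split into batches of 100, 100 and 50
batches = list(chunked(list(range(250)), size=100))
batch_sizes = [len(batch) for batch in batches]

# Two revision ids taken from the page_data preview above
example_url = build_ores_url([355319463, 393276188])
```

Checking that the batch sizes add up to the number of revision ids is a cheap way to confirm that no articles are dropped before the API is ever called.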

## 2. Data Processing

### 2.1 Combining ratings and Wikipedia articles dataset

1. Remove all pages with the title starting with "Template"
2. Convert rev_ids to integers
3. Join Wikipedia ratings data and ores_data on rev_id


In [198]:
# 1. Remove all pages whose title starts with "Template"
page_data = page_data[~page_data["page"].str.startswith("Template")]
page_data.head(5)

Unnamed: 0,page,country,rev_id
1,Bir I of Kanem,Chad,355319463
10,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188
12,Yos Por,Cambodia,393822005
23,Julius Gregr,Czech Republic,395521877
24,Edvard Gregr,Czech Republic,395526568


In [206]:
# 2. Convert revision ids to integer
ratings['rev_ids'] = ratings['rev_ids'].astype(int)

# 3. Merge page_data and the ORES ratings on revision id
wiki_ratings = page_data.merge(ratings, left_on='rev_id', right_on='rev_ids', how='inner')
page_ratings = wiki_ratings[~pd.isnull(wiki_ratings['ratings'])]

In [207]:
page_ratings.head(5)

Unnamed: 0,page,country,rev_id,rev_ids,ratings
0,Bir I of Kanem,Chad,355319463,355319463,Stub
1,Information Minister of the Palestinian Nation...,Palestinian Territory,393276188,393276188,Stub
2,Yos Por,Cambodia,393822005,393822005,Stub
3,Julius Gregr,Czech Republic,395521877,395521877,Stub
4,Edvard Gregr,Czech Republic,395526568,395526568,Stub


In [208]:
# 4. Dump articles without any ratings to file
page_no_ratings = wiki_ratings[pd.isnull(wiki_ratings['ratings'])]
page_no_ratings.to_csv("./clean/articles_no_ratings.csv")
page_no_ratings.head(5)

Unnamed: 0,page,country,rev_id,rev_ids,ratings
14,List of politicians in Poland,Poland,516633096,516633096,
21,Tingtingru,Vanuatu,550682925,550682925,
51,Daud Arsala,Afghanistan,627547024,627547024,
204,Bharat Saud,Nepal,671484594,671484594,
301,Robert Sych,Poland,684023803,684023803,


### 2.2 Combining wiki ratings data and country data

1. Convert country string to lower case
2. Convert Geography string to lower case
3. Join wiki_ratings and population data on country name
4. Remove all articles that don't have a country associated and dump to file

In [212]:
# 1. country to lower case (work on a copy to avoid pandas' SettingWithCopyWarning)
page_ratings = page_ratings.copy()
page_ratings['country'] = page_ratings['country'].str.lower()

# 2. Geography to lower case
population_data['Geography'] = population_data['Geography'].str.lower()

# 3. Join on country/Geography
final_data = page_ratings.merge(population_data, left_on ='country', right_on = 'Geography', how='outer')
cleaned_data = final_data[~pd.isnull(final_data["Geography"])]
cleaned_data.head(5)

Unnamed: 0,page,country,rev_id,rev_ids,ratings,Geography,Population mid-2018 (millions)
0,Bir I of Kanem,chad,355319463.0,355319463.0,Stub,chad,15.4
1,Abdullah II of Kanem,chad,498683267.0,498683267.0,Stub,chad,15.4
2,Salmama II of Kanem,chad,565745353.0,565745353.0,Stub,chad,15.4
3,Kuri I of Kanem,chad,565745365.0,565745365.0,Stub,chad,15.4
4,Mohammed I of Kanem,chad,565745375.0,565745375.0,Stub,chad,15.4


In [215]:
# 4. Dump all rows without population/country
final_data_no_data = final_data[pd.isnull(final_data["Geography"])]
final_data_no_data.to_csv("./clean/wp_wpds_countries-no_match.csv")
final_data_no_data.head(5)

Unnamed: 0,page,country,rev_id,rev_ids,ratings,Geography,Population mid-2018 (millions)
97,Information Minister of the Palestinian Nation...,palestinian territory,393276188.0,393276188.0,Stub,,
98,Finance Minister of the Palestinian National A...,palestinian territory,596181202.0,596181202.0,Start,,
99,Planning Minister of the Palestinian National ...,palestinian territory,633612729.0,633612729.0,Start,,
100,Hossam Arafat (politician),palestinian territory,680933208.0,680933208.0,Stub,,
101,Tawfik Tirawi,palestinian territory,701106976.0,701106976.0,Start,,
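
An outer join produces two kinds of unmatched rows: articles whose country has no population entry, and population entries (including regions such as AFRICA) that have no articles. The sketch below shows one way to tell them apart with the `indicator` option of `DataFrame.merge`. The frames are toy stand-ins for `page_ratings` and `population_data`, and the column name `population_millions` is invented for the example.

```python
import pandas as pd

# Toy stand-ins for page_ratings and population_data
articles = pd.DataFrame({
    "page": ["A", "B", "C"],
    "country": ["chad", "palestinian territory", "chad"],
})
population = pd.DataFrame({
    "Geography": ["chad", "africa"],
    "population_millions": [15.4, 1284.0],
})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
merged = articles.merge(population, left_on="country", right_on="Geography",
                        how="outer", indicator=True)

matched = merged[merged["_merge"] == "both"]
articles_without_population = merged[merged["_merge"] == "left_only"]
population_without_articles = merged[merged["_merge"] == "right_only"]
```

Filtering on `_merge` makes the intent explicit, whereas filtering on a null `Geography` column only identifies the first kind of unmatched row.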


### 2.3 Cleaning up final data

1. Drop rows with missing values
2. Drop duplicate columns
3. Rename columns
4. Convert population from millions to integer head counts
5. Dump the final cleaned data to file

In [214]:
cleaned_data = cleaned_data.dropna()
cleaned_data = cleaned_data.drop(["rev_ids","Geography"],axis=1)
cleaned_data = cleaned_data.rename(index=str, 
                             columns={"page": "article_name", 
                                      "rev_id": "revision_id", 
                                      "ratings": "article_quality",
                                      "Population mid-2018 (millions)": "population"   })
cleaned_data['population'] = cleaned_data['population'].apply(lambda x:x.replace(',',''))
cleaned_data['population'] = cleaned_data['population'].astype('float')
# Convert from millions to absolute head counts
cleaned_data['population'] = cleaned_data['population'].apply(lambda x:x*1000000)
# Cast back to int
cleaned_data['population'] = cleaned_data['population'].astype(int)
cleaned_data.to_csv("./clean/wp_wpds_politicians_by_country.csv")
cleaned_data.head(5)

Unnamed: 0,article_name,country,revision_id,article_quality,population
0,Bir I of Kanem,chad,355319463.0,Stub,15400000
1,Abdullah II of Kanem,chad,498683267.0,Stub,15400000
2,Salmama II of Kanem,chad,565745353.0,Stub,15400000
3,Kuri I of Kanem,chad,565745365.0,Stub,15400000
4,Mohammed I of Kanem,chad,565745375.0,Stub,15400000
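
Casting a float to `int` truncates, so the conversion above relies on the multiplication not landing just below the intended integer. A slightly more defensive sketch rounds before casting. The helper name `millions_to_count` is hypothetical, and the comma-separated input format is an assumption inferred from the `replace(',', '')` call in the cell above.

```python
def millions_to_count(value):
    """Convert a population given in millions (e.g. '1,284' or 15.4) to an integer head count."""
    as_float = float(str(value).replace(",", ""))
    # round() guards against a product that falls just below the exact integer
    # and would otherwise lose one unit when truncated by int()
    return int(round(as_float * 1_000_000))


chad = millions_to_count("15.4")
africa = millions_to_count("1,284")
```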


## 3. Data Analysis

In [216]:
# Read data from file
final_data = pd.read_csv("./clean/wp_wpds_politicians_by_country.csv")

### 3.1 Top 10 countries by coverage: 10 highest-ranked countries in terms of number of politician articles as a proportion of country population

1. Group by country and population, and count the number of politician articles
2. Create a coverage column: article count divided by population, expressed as a percentage
3. Rename columns and display in descending order

In [227]:
# 1. Group by
articles_by_country = final_data.groupby(['country','population'])['revision_id'].count().to_frame()
articles_by_country = articles_by_country.reset_index()

# 2. Calculate coverage
articles_by_country['coverage'] = (articles_by_country['revision_id']/articles_by_country['population']) * 100

# 3. rename columns
articles_by_country = articles_by_country.rename(columns = {'revision_id':'total_articles'})

In [269]:
# 3. Display in descending order
articles_by_country.sort_values(by='coverage', ascending = False).head(10)[['country', 'population','total_articles', 'coverage']]

Unnamed: 0,country,population,total_articles,coverage
166,tuvalu,10000,54,0.54
115,nauru,10000,52,0.52
135,san marino,30000,81,0.27
108,monaco,40000,40,0.1
93,liechtenstein,40000,28,0.07
161,tonga,100000,63,0.063
103,marshall islands,60000,37,0.061667
68,iceland,400000,201,0.05025
3,andorra,80000,34,0.0425
61,grenada,100000,36,0.036
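
As a sanity check, the coverage figures can be recomputed by hand. The short sketch below uses the Tuvalu and Iceland rows from the table above; coverage is the number of articles per inhabitant, expressed as a percentage.

```python
# Population and article counts copied from the Tuvalu and Iceland rows above
rows = {
    "tuvalu": {"population": 10000, "total_articles": 54},
    "iceland": {"population": 400000, "total_articles": 201},
}

# coverage = articles / population * 100
coverage = {
    country: values["total_articles"] / values["population"] * 100
    for country, values in rows.items()
}
```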


### 3.2 Bottom 10 countries by coverage: 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population

1. Display the above dataset in ascending order

In [267]:
articles_by_country.sort_values(by='coverage', ascending=True).head(10)[['country','population','total_articles','coverage']]

Unnamed: 0,country,population,total_articles,coverage
69,india,1371300000,978,7.1e-05
70,indonesia,265200000,209,7.9e-05
34,china,1393800000,1126,8.1e-05
173,uzbekistan,32900000,28,8.5e-05
51,ethiopia,107500000,101,9.4e-05
82,"korea, north",25600000,35,0.000137
178,zambia,17700000,25,0.000141
159,thailand,66200000,112,0.000169
112,mozambique,30500000,58,0.00019
13,bangladesh,166400000,319,0.000192


### 3.3 Top 10 countries by relative quality: 10 highest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

1. Create copies of final data
2. Filter out only high-quality articles
3. Count the number of such articles and store as a dataframe
4. Use articles_by_country dataset to get total number of articles
5. Only keep columns needed
6. Join these two datasets
7. Compute relative quality
8. Rename columns

In [265]:
# 1. Create copies of final data
from copy import deepcopy  # needed by the deepcopy calls in this cell
hq_articles = deepcopy(final_data)

# 2. Filter out high-quality articles
hq_articles = hq_articles[hq_articles['article_quality'].isin(['FA', 'GA'])]

# 3. Groupby country and count number of articles to get the count of high quality articles by country
hq_articles = hq_articles.groupby(['country'])['revision_id'].count().to_frame()
hq_articles = hq_articles.rename(columns = {'revision_id':'high_quality_articles'})
hq_articles.reset_index(inplace=True)

# 4. Use articles_by_country dataset to get total counts
total_articles = deepcopy(articles_by_country)

# 5. Keep the columns required
total_articles = total_articles[['country','total_articles']]
hq_articles = hq_articles[['country','high_quality_articles']]

# 6. Join on country 
relative_quality = total_articles.merge(hq_articles, left_on = 'country', right_on = 'country', how = 'left')
relative_quality = relative_quality.fillna(0)
relative_quality['high_quality_articles'] = relative_quality['high_quality_articles'].astype(int)

#Calculating the proportion
relative_quality['ratio'] = (relative_quality['high_quality_articles']/relative_quality['total_articles'])*100
relative_quality.sort_values(by='ratio', ascending = False).head(10)[['country', 'total_articles', 'high_quality_articles','ratio']]
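
The same ratio can be computed in one pass by flagging each article and aggregating the flag per country. This is an alternative sketch on made-up toy data, not the method used in the cell above; the column `is_high_quality` is introduced only for the example.

```python
import pandas as pd

# Toy stand-in for final_data; countries and ratings are made up
toy = pd.DataFrame({
    "country": ["x", "x", "x", "x", "y", "y"],
    "article_quality": ["FA", "Stub", "GA", "Start", "Stub", "C"],
})

# Flag each article, then count all articles and sum the flags per country
toy["is_high_quality"] = toy["article_quality"].isin(["FA", "GA"])
summary = toy.groupby("country")["is_high_quality"].agg(
    total_articles="count",
    high_quality_articles="sum",
)
summary["ratio"] = summary["high_quality_articles"] / summary["total_articles"] * 100
```

Because every country stays in the grouped frame, countries with no GA or FA articles get a ratio of 0 without a separate left join and `fillna(0)` step.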

### 3.4 Bottom 10 countries by relative quality: 10 lowest-ranked countries in terms of the relative proportion of politician articles that are of GA and FA-quality

Ignoring countries with zero high quality articles

In [245]:
# Ignoring zeroes
relative_quality[relative_quality['ratio']!=0].sort_values(by='ratio', ascending = True).head(10)[['country', 'total_articles', 'high_quality_articles','ratio']]

Unnamed: 0,country,total_articles,high_quality_articles,ratio
16,belgium,519,1,0.192678
154,switzerland,402,1,0.248756
158,tanzania,401,1,0.249377
116,nepal,357,1,0.280112
127,peru,350,1,0.285714
121,nigeria,676,2,0.295858
35,colombia,284,1,0.352113
94,lithuania,244,1,0.409836
95,luxembourg,178,1,0.561798
10,azerbaijan,178,1,0.561798


In [264]:
# Not ignoring zeroes
relative_quality.sort_values(by='ratio', ascending = True).head(10)[['country', 'total_articles', 'high_quality_articles','ratio']]

Unnamed: 0,country,total_articles,high_quality_articles,ratio
143,slovakia,116,0,0.0
30,cape verde,37,0,0.0
112,mozambique,58,0,0.0
38,costa rica,147,0,0.0
108,monaco,40,0,0.0
43,djibouti,37,0,0.0
107,moldova,423,0,0.0
167,uganda,184,0,0.0
49,eritrea,16,0,0.0
50,estonia,149,0,0.0


### 3.5 Geographic regions by coverage: Ranking of geographic regions (in descending order) in terms of the total count of politician articles from countries in each region as a proportion of total regional population

1. Manually tag continents to countries by going into WPDS_2018_data dataset and tagging all the nations that belong to the same continent
2. Rename columns
3. Parse population data
4. Join the relative quality dataset with geography data
5. Drop unwanted columns
6. Compute total_articles by geography
7. Compute population by geography
8. Compute ratio
9. Sort by descending order

In [249]:
# 1. Reading in manually tagged country, continents file
gro = pd.read_csv('./raw/WPDS_2018_data_continents.csv')
gro['Geography'] = gro['Geography'].apply(lambda x:x.lower())

# 2.Rename columns
gro = gro.rename(index=str, columns = {'Population mid-2018 (millions)': 'population'})

# 3. Parse population data
gro['population'] = gro['population'].apply(lambda x:x.replace(',',''))
gro['population'] = gro['population'].astype('float')
gro['population'] = gro['population'].apply(lambda x:x*1000000)
gro['population'] = gro['population'].astype(int)
gro['population'] = gro['population'].apply(lambda x: float(x))

# 4. Join the relative quality dataset (relative_quality from section 3.3) with geography data
geo_quality = relative_quality.merge(gro,left_on="country",right_on="Geography")

# 5. Drop unwanted columns
geo_quality = geo_quality.drop(['country','Geography'],axis=1)

In [250]:
# 6. Compute total articles
articles_by_geo = geo_quality.groupby(['continent'])['total_articles'].sum().to_frame()

# 7. Compute total population
pop_by_geo = geo_quality.groupby(['continent'])['population'].sum().to_frame()

# 8. Join datasets
final_geo_1 = articles_by_geo.merge(pop_by_geo,left_on='continent',right_on='continent')

# 9. Compute coverage
final_geo_1["coverage"] = final_geo_1["total_articles"]/final_geo_1["population"]

In [262]:
final_geo_1.sort_values(by='coverage', ascending = False)

continent,total_articles,population,coverage
OCEANIA,3119,39780000.0,7.8e-05
EUROPE,15829,734590000.0,2.2e-05
LATIN AMERICA AND THE CARIBBEAN,5166,628270000.0,8e-06
AFRICA,6839,1172400000.0,6e-06
NORTHERN AMERICA,1913,365200000.0,5e-06
ASIA,11506,4513100000.0,3e-06


### 3.6 Geographic regions by relative quality: Ranking of geographic regions (in descending order) in terms of the relative proportion of politician articles from countries in each region that are of GA and FA-quality

1. Compute total_articles by geography
2. Compute total high quality articles by geography
3. Compute ratio
4. Sort by descending order

In [257]:
# 1. Compute total high quality articles by geography
high_articles_by_geo = geo_quality.groupby(['continent'])['high_quality_articles'].sum().to_frame()

# 2. Compute total_articles by geography
articles_by_geo = geo_quality.groupby(['continent'])['total_articles'].sum().to_frame()

# 3. Compute ratio
final_geo_2 = high_articles_by_geo.merge(articles_by_geo,left_on='continent',right_on='continent')
final_geo_2["ratio"] = final_geo_2["high_quality_articles"]/final_geo_2["total_articles"]

In [260]:
# 4. Sort by descending order
final_geo_2.sort_values(by='ratio', ascending = False)

continent,high_quality_articles,total_articles,ratio
NORTHERN AMERICA,99,1913,0.051751
ASIA,305,11506,0.026508
OCEANIA,64,3119,0.020519
EUROPE,316,15829,0.019963
AFRICA,124,6839,0.018131
LATIN AMERICA AND THE CARIBBEAN,69,5166,0.013357
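
The regional ratio above is computed from summed counts, so each country contributes in proportion to its number of articles. Averaging the per-country ratios instead would weight every country equally and can give a very different figure. The toy sketch below illustrates the difference; the region name and all counts are invented.

```python
import pandas as pd

# One invented region with a large and a small country
toy = pd.DataFrame({
    "continent": ["R", "R"],
    "total_articles": [1000, 10],
    "high_quality_articles": [10, 5],
})

# Pooled ratio: sum the counts first, then divide (the approach used above)
sums = toy.groupby("continent")[["high_quality_articles", "total_articles"]].sum()
pooled_ratio = sums["high_quality_articles"] / sums["total_articles"]

# Mean of ratios: divide per country first, then average within the region
per_country = toy["high_quality_articles"] / toy["total_articles"]
mean_of_ratios = per_country.groupby(toy["continent"]).mean()
```

Here the pooled ratio is 15/1010, while the mean of the two country ratios (0.01 and 0.5) is 0.255, so the choice of aggregation matters when article counts differ widely between countries.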


## 4. Reflections

#### 1. What biases did you expect to find in the data, and why?

Because the analysis covers only English Wikipedia, I expected a language bias. Articles about politicians from English-speaking countries are likely to be of higher quality, since English is the first language there. For non-English-speaking countries, the articles about their politicians may well be of higher quality in the native-language edition.

Looking at the data, some of the titles are not politicians' names but offices or lists, such as "Information Minister of the Palestinian Nation", "Finance Minister of", and "List of politicians in Poland". This can affect downstream analysis.

I also expected to find more English Wikipedia articles about English-speaking countries than about the rest of the world. Since articles in other languages are not part of our analysis, we cannot judge the quality of a country's articles from the English Wikipedia subset alone. If those languages are not supported by Wikipedia/ORES, those countries would appear to have poorer-quality articles regardless of other factors.

#### 2. What potential sources of bias did you discover in the course of your data processing and analysis?

Looking at the top 10 countries by relative quality, we observe North Korea and Saudi Arabia at the top of the list. This result is quite suspect, since both countries have a poor reputation in the media and their governments are generally considered oppressive. It is also not surprising that the countries with the smallest populations have the highest coverage, since dividing by a very small population inflates the proportion.

This leads me to question whether the coverage metric was the right one. Population might not be a good basis for measuring coverage: doubling a country's population does not imply twice as many politicians or twice as many English Wikipedia articles. Also, the scale of populations (millions and billions) is not comparable to the number of high-quality articles (hundreds and thousands).

According to the documentation for the ORES API, the service rates articles not on their English or grammar but on the structure of the page. This might not be the best indicator of quality for politician articles. Quality should also be gauged on other dimensions to decide whether an article is truly high-quality.

#### 3. How might a researcher supplement or transform this dataset to potentially correct for the limitations/biases you created?

The data in the analysis could be enriched by adding more context to each article, such as who the author is and where the author resides. It might be helpful to know whether the person writing the article is a citizen of the country. Additionally, I am not entirely confident that page_data.csv contains all the politician-related articles on Wikipedia. More articles are added every day, and it would be useful to know which data collection methods were used to build this dataset.