# Hw2: Bias in data
The goal of this analysis is to explore the definition of 'bias' through data on Wikipedia articles about politicians from different countries.

I will analyze how the coverage and quality of articles about politicians differ between countries. My analysis consists of the following:

 1. the countries with the greatest and least coverage of politicians on Wikipedia compared to their population.
 
 2. the countries with the highest and lowest proportion of high quality articles about politicians.
 
I will also write a short reflection on this project.

In [314]:
import pandas as pd 
import numpy as np
import requests
import csv
import matplotlib.pyplot as plt

### 1. Getting the article and population data
***page_data.csv***: The Wikipedia dataset can be found on [Figshare](https://figshare.com/articles/Untitled_Item/5513449).

***Population Mid-2015.csv***: The population data is on the [Population Reference Bureau website](http://www.prb.org/DataFinder/Topic/Rankings.aspx?ind=14).

(1) Use pandas.read_csv to read Population Mid-2015.csv as a data frame.

In [134]:
df_population = pd.read_csv('Population Mid-2015.csv')

In [354]:
df_population.head()

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Population Mid-2015
Location,Location Type,TimeFrame,Data Type,Data,Footnotes
Afghanistan,Country,Mid-2015,Number,32247000,
Albania,Country,Mid-2015,Number,2892000,
Algeria,Country,Mid-2015,Number,39948000,
Andorra,Country,Mid-2015,Number,78000,


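However, this data frame is not parsed correctly: the file begins with a title line (`Population Mid-2015`) followed by a blank line, so pandas does not pick up the real column names from the third line. I therefore read the file with the csv module instead.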

In [355]:
#read data
csv_population=[]
with open ('Population Mid-2015.csv') as csvfile:
    read = csv.reader(csvfile)
    for row in read :
        csv_population.append(row)
csv_population[:5]

[['Population Mid-2015'],
 [],
 ['Location', 'Location Type', 'TimeFrame', 'Data Type', 'Data', 'Footnotes'],
 ['Afghanistan', 'Country', 'Mid-2015', 'Number', '32,247,000', ''],
 ['Albania', 'Country', 'Mid-2015', 'Number', '2,892,000', '']]
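As an alternative to the csv module, pandas can also skip the two leading lines directly. The following is a minimal sketch, assuming the file layout matches the rows printed above; the `sample` string and the name `df_pop_sketch` are made up for illustration and stand in for the real file.

```python
import io
import pandas as pd

# Hypothetical sample mirroring the first lines printed above.
sample = (
    'Population Mid-2015\n'
    '\n'
    'Location,Location Type,TimeFrame,Data Type,Data,Footnotes\n'
    'Afghanistan,Country,Mid-2015,Number,"32,247,000",\n'
    'Albania,Country,Mid-2015,Number,"2,892,000",\n'
)

# skiprows=2 drops the title line and the blank line, so the third line
# becomes the header; thousands=',' parses "32,247,000" as an integer.
df_pop_sketch = pd.read_csv(io.StringIO(sample), skiprows=2, thousands=',')
print(df_pop_sketch[['Location', 'Data']])
```

With the real file, the `io.StringIO(sample)` argument would be replaced by the file name.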

(2) Use pandas.read_csv to load page_data.csv as a data frame.

In [356]:
df_page = pd.read_csv('country/data/page_data.csv')
df_page.head()

Unnamed: 0,page,country,rev_id
0,Template:ZambiaProvincialMinisters,Zambia,235107991
1,Bir I of Kanem,Chad,355319463
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046
3,Template:Uganda-politician-stub,Uganda,391862070
4,Template:Namibia-politician-stub,Namibia,391862409


### 2. Getting article quality predictions
The next step is to get the quality score for each article. For this step, I am using an API endpoint called [ORES](https://ores.wikimedia.org/v3/#!/scoring/get_v3_scores_context_revid_model) ("Objective Revision Evaluation Service"). ORES estimates the quality of an article (at a particular point in time) and assigns a series of probabilities that the article is in one of 6 quality categories. The options are, from best to worst:

 1. FA - Featured article
 2. GA - Good article
 3. B - B-class article
 4. C - C-class article
 5. Start - Start-class article
 6. Stub - Stub-class article
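Because the six classes are ordered, they can be stored as an ordered categorical in pandas, which makes comparisons such as "GA or better" possible. This is only a sketch with made-up predictions; the names `quality_order`, `predictions`, and `is_high_quality` are hypothetical and not part of the analysis below.

```python
import pandas as pd

# Quality classes from worst to best, following the list above.
quality_order = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']

# Hypothetical predictions; None stands for an article without a score.
predictions = pd.Series(['Stub', 'GA', 'C', None, 'FA'])
quality = predictions.astype(
    pd.CategoricalDtype(categories=quality_order, ordered=True))

# With an ordered categorical, "GA or better" is a single comparison.
# Missing values compare as False.
is_high_quality = quality >= 'GA'
print(is_high_quality.tolist())
```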

Here, I set up the endpoint with its parameters and the headers for calling the API. I then query the API with the rev_id column, passing 100 rev_ids per request, and collect the predicted quality class for each article in a list.

In [181]:
#set up endpoint and headers
endpoint = 'https://ores.wikimedia.org/v3/scores/{context}?model={model}&revids={revids}'
headers = {'User-Agent' : 'https://github.com/runlaizeng', 'From' : 'runlaiz@uw.edu'}
mylist = []
rev_ids = df_page['rev_id'].tolist()
batch_size = 100
for start in range(0, len(rev_ids), batch_size):
    #pass at most 100 rev_ids per request
    batch = rev_ids[start:start + batch_size]
    params = {'context' : 'enwiki',
              'model' : 'wp10',
              'revids' : '|'.join(str(x) for x in batch)}
    #send the identifying headers along with the request
    api_call = requests.get(endpoint.format(**params), headers=headers)
    response = api_call.json()
    scores = response['enwiki']['scores']
    #look up each rev_id in request order so mylist stays aligned with df_page
    for rev_id in batch:
        result = scores.get(str(rev_id), {}).get('wp10', {})
        #some articles don't have a score, so store None for them
        if 'score' in result:
            mylist.append(result['score']['prediction'])
        else:
            mylist.append(None)

In [190]:
len(mylist)

47197
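The batching and parsing logic can be checked without calling the API. The sketch below uses a hand-written response that only imitates the fields read in the cell above; the `error` entry and the helper names `chunked` and `extract_predictions` are assumptions for illustration, not part of ORES or of this notebook.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_predictions(response, rev_ids, context='enwiki', model='wp10'):
    """Return one prediction (or None) per rev_id, in the order of `rev_ids`."""
    scores = response[context]['scores']
    predictions = []
    for rev_id in rev_ids:
        result = scores.get(str(rev_id), {}).get(model, {})
        if 'score' in result:
            predictions.append(result['score']['prediction'])
        else:
            predictions.append(None)
    return predictions


# Hypothetical response, shaped like the fields the loop above reads.
fake_response = {'enwiki': {'scores': {
    '355319463': {'wp10': {'score': {'prediction': 'Stub'}}},
    '235107991': {'wp10': {'error': {'message': 'made-up error text'}}},
}}}
batch = [235107991, 355319463]
print(extract_predictions(fake_response, batch))
print([len(chunk) for chunk in chunked(list(range(250)), 100)])
```

Looking scores up by rev_id, rather than relying on the order of keys in the response, keeps the result list aligned with the rows of the page data.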

### 3. Combining the datasets

The next step is to merge the Wikipedia data with the population data. I use an inner join on the country column, so only countries that appear in both datasets are kept.


In [357]:
#first append the article_quality to page data
df_page=df_page.assign(article_quality=mylist)

In [358]:
df_page[:5]

Unnamed: 0,page,country,rev_id,article_quality
0,Template:ZambiaProvincialMinisters,Zambia,235107991,Stub
1,Bir I of Kanem,Chad,355319463,Stub
2,Template:Zimbabwe-politician-stub,Zimbabwe,391862046,Stub
3,Template:Uganda-politician-stub,Uganda,391862070,Stub
4,Template:Namibia-politician-stub,Namibia,391862409,Stub


In [360]:
#extract country from population data set
country = []
for i in range(3,len(csv_population)-1):
    country.append(csv_population[i][0])
country[:5]

['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola']

In [362]:
#extract population from population data set
population=[]
for i in range(3,len(csv_population)-1):
    population.append(int(csv_population[i][4].replace(',','')))
population[:5]

[32247000, 2892000, 39948000, 78000, 25000000]

In [363]:
#combine the two lists into a data frame
df_country_population = pd.DataFrame({'country' : country,
                                     'population' : population})
df_country_population.head()

Unnamed: 0,country,population
0,Afghanistan,32247000
1,Albania,2892000
2,Algeria,39948000
3,Andorra,78000
4,Angola,25000000


In [365]:
#merge the two datasets together with an inner join
df=pd.merge(df_page,df_country_population,on="country", how='inner')
df.head()

Unnamed: 0,page,country,rev_id,article_quality,population
0,Template:ZambiaProvincialMinisters,Zambia,235107991,Stub,15473900
1,Gladys Lundwe,Zambia,757566606,Stub,15473900
2,Mwamba Luchembe,Zambia,764848643,Stub,15473900
3,Thandiwe Banda,Zambia,768166426,Start,15473900
4,Sylvester Chisembele,Zambia,776082926,C,15473900
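An inner join silently drops rows whose country name has no match on the other side. One way to see what is dropped is an outer join with `indicator=True`. The sketch below uses toy frames with made-up population values and a fictional country name, so it only illustrates the technique and says nothing about the real data.

```python
import pandas as pd

# Toy data: 'Atlantis' has no population row, 'Andorra' has no articles.
pages = pd.DataFrame({'page': ['A', 'B', 'C'],
                      'country': ['Zambia', 'Chad', 'Atlantis']})
populations = pd.DataFrame({'country': ['Zambia', 'Chad', 'Andorra'],
                            'population': [1000, 2000, 3000]})  # made-up values

# The _merge column says whether a row came from the left, right, or both frames.
merged = pd.merge(pages, populations, on='country', how='outer', indicator=True)
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['country', '_merge']])

# The inner join keeps only the countries found in both frames.
inner = pd.merge(pages, populations, on='country', how='inner')
print(len(inner))
```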


In [366]:
#rename the columns and reorder the columns
df.columns=["article_name","country","revision_id","article_quality","population"]
df=df[['country','article_name','revision_id','article_quality','population']]
df.head()

Unnamed: 0,country,article_name,revision_id,article_quality,population
0,Zambia,Template:ZambiaProvincialMinisters,235107991,Stub,15473900
1,Zambia,Gladys Lundwe,757566606,Stub,15473900
2,Zambia,Mwamba Luchembe,764848643,Stub,15473900
3,Zambia,Thandiwe Banda,768166426,Start,15473900
4,Zambia,Sylvester Chisembele,776082926,C,15473900


In [367]:
#save it as csv file
df.to_csv('final.csv', sep=',')

### 4. Analysis
I will calculate, for each country, the number of politician articles as a percentage of the population and the rate of high-quality articles.
#### (1) Articles per population
Articles per population for a country is the number of politician articles divided by the country's population, expressed as a percentage.

For example, if a country has a population of 10,000 people and there are 100 articles about its politicians, then the percentage of articles per population is 1%.
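The worked example can be reproduced on a toy data frame. This is a minimal sketch with a made-up country `X`; the column names follow the merged data frame, and everything else is hypothetical.

```python
import pandas as pd

# Toy data: 100 articles for a country with 10,000 inhabitants.
toy = pd.DataFrame({
    'country': ['X'] * 100,
    'article_name': ['article %d' % i for i in range(100)],
    'population': [10000] * 100,
})

article_count = toy.groupby('country')['article_name'].count()
# The population is repeated on every row, so the first value per country is enough.
population = toy.groupby('country')['population'].first()

articles_per_population = 100 * article_count / population
print(articles_per_population)
```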


In [370]:
#first count the number of articles grouped by country
article_count_by_country = df.groupby(['country'])['article_name'].count()
#then get the population grouped by country
population_by_contry = df.groupby(['country'])['population'].mean()
#then divide count by population and scale to a percentage
article_per_population = article_count_by_country/population_by_contry*100
article_per_population.head()

country
Afghanistan    0.001014
Albania        0.015906
Algeria        0.000298
Andorra        0.043590
Angola         0.000440
dtype: float64

In [371]:
#keep only the FA and GA articles and count them by country
df_high_quality = df[(df['article_quality']=='FA')|(df['article_quality']=='GA')]
high_quality_count_per_country = df_high_quality.groupby(['country'])['article_name'].count()
#note: this divides by population, so the result is high-quality articles per person
hight_quality_per_country = high_quality_count_per_country/population_by_contry
#countries without any FA or GA article come out as NaN, so set them to 0
hight_quality_per_country = hight_quality_per_country.fillna(0)
hight_quality_per_country.head()

country
Afghanistan    5.892021e-07
Albania        1.728907e-06
Algeria        7.509763e-08
Andorra        0.000000e+00
Angola         8.000000e-08
dtype: float64
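The cell above divides the high-quality count by population, whereas the tables in the next section are described as a proportion of all politician articles from each country. A sketch of that second definition, on toy data with made-up countries `X` and `Y`, could look like this; it is an illustration and was not used to produce the numbers in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['X', 'X', 'X', 'X', 'Y', 'Y'],
    'article_quality': ['FA', 'Stub', 'GA', 'C', 'Stub', 'Start'],
})

# True for FA/GA rows; the mean of a boolean column is the share of True values.
is_high = toy['article_quality'].isin(['FA', 'GA'])
share_high_quality = 100 * is_high.groupby(toy['country']).mean()
print(share_high_quality)
```

Countries without any FA or GA article get 0 directly, so no NaN handling is needed with this formulation.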

### 5. Tables

I will generate four tables for analyzing the bias. The four tables are below:

 1. 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
 2. 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
 3. 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country
 4. 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

#### (1) 10 highest-ranked countries in terms of number of politician articles as a proportion of country population
The table below lists the 10 highest-ranked countries by articles per population.

In [350]:
ten_highest_country = article_per_population.sort_values(ascending=False)[:10]
df_10_highest_rank_country = pd.DataFrame({"country" : ten_highest_country.index,
                                            "percentage" :ten_highest_country.values})
df_10_highest_rank_country

Unnamed: 0,country,percentage
0,Nauru,0.488029
1,Tuvalu,0.466102
2,San Marino,0.248485
3,Monaco,0.10502
4,Liechtenstein,0.077189
5,Marshall Islands,0.067273
6,Iceland,0.062268
7,Tonga,0.060987
8,Andorra,0.04359
9,Federated States of Micronesia,0.036893


#### (2) 10 lowest-ranked countries in terms of number of politician articles as a proportion of country population
The table below lists the 10 lowest-ranked countries by articles per population.

In [349]:
ten_lowest_article_country = article_per_population.sort_values()[:10]
df_10_lowest_rank_country = pd.DataFrame({"country" : ten_lowest_article_country.index,
                                        "percentage" :ten_lowest_article_country.values})
df_10_lowest_rank_country

Unnamed: 0,country,percentage
0,India,7.5e-05
1,China,8.3e-05
2,Indonesia,8.4e-05
3,Uzbekistan,9.3e-05
4,Ethiopia,0.000107
5,"Korea, North",0.000156
6,Zambia,0.000168
7,Thailand,0.000172
8,"Congo, Dem. Rep. of",0.000194
9,Bangladesh,0.000202
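The two ranking tables above sort the whole series and slice it. An equivalent sketch with `nlargest` and `nsmallest` is shown here on a made-up series; `toy_rate`, `top`, and `bottom` are hypothetical names.

```python
import pandas as pd

# Toy per-country rates (made-up values).
toy_rate = pd.Series({'A': 0.5, 'B': 0.1, 'C': 0.3, 'D': 0.2})

# nlargest/nsmallest return the n extreme values already sorted;
# reset_index turns the result into a two-column table.
top = toy_rate.nlargest(2).rename_axis('country').reset_index(name='percentage')
bottom = toy_rate.nsmallest(2).rename_axis('country').reset_index(name='percentage')
print(top)
print(bottom)
```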


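#### (3) 10 highest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

The values in tables (3) and (4) are the number of GA and FA articles divided by the country's population, as computed in section 4, so they are fractions of the population rather than percentages of all articles.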

In [348]:
ten_highest_quality_country = hight_quality_per_country.sort_values(ascending=False)[:10]
df_10_highest_quality_country = pd.DataFrame({"country" : ten_highest_quality_country.index,
                                                                    "percentage" :ten_highest_quality_country.values})
df_10_highest_quality_country

Unnamed: 0,country,percentage
0,Tuvalu,8.5e-05
1,Vanuatu,1.1e-05
2,Iceland,9e-06
3,Grenada,9e-06
4,Ireland,7e-06
5,Maldives,6e-06
6,Bhutan,4e-06
7,Gabon,3e-06
8,Montenegro,3e-06
9,Palestinian Territory,3e-06


#### (4) 10 lowest-ranked countries in terms of number of GA and FA-quality articles as a proportion of all articles about politicians from that country

In [346]:
ten_lowest_quality_article_country = hight_quality_per_country.sort_values()[:10]
df_10_lowest_quality_country = pd.DataFrame({"country" : ten_lowest_quality_article_country.index,
                                                                    "percentage" :ten_lowest_quality_article_country.values})
df_10_lowest_quality_country 

Unnamed: 0,country,percentage
0,Guyana,0.0
1,Mozambique,0.0
2,Kazakhstan,0.0
3,Kiribati,0.0
4,Seychelles,0.0
5,Burundi,0.0
6,Monaco,0.0
7,Dominica,0.0
8,Nauru,0.0
9,Djibouti,0.0
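All ten rows above have the value 0.0, so which ten countries are shown depends on how the sort orders ties. A sketch that lists every country at zero and ranks the nonzero countries separately is given below, on a made-up series; the names are hypothetical.

```python
import pandas as pd

# Toy rates with several ties at zero (made-up values).
toy_rate = pd.Series({'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': 2e-06, 'E': 5e-06})

# Every country tied at zero, rather than an arbitrary subset of them.
zero_countries = toy_rate[toy_rate == 0].index.tolist()
# The lowest-ranked countries among those with at least one high-quality article.
lowest_nonzero = toy_rate[toy_rate > 0].nsmallest(2)
print(zero_countries)
print(lowest_nonzero)
```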


### 6. Writeup

It is surprising to learn that large countries with huge populations, such as China and India, have a low number of politician articles per population, while small countries with low populations have a high yield of politician articles. The same pattern appears in the high-quality article rate. Iceland is not only in the top 10 countries by politician articles per population but also in the top 10 countries by high-quality article rate.
There are several possible sources of bias in this result:
1. The article data come from Wikipedia only. This biases the analysis against countries where Wikipedia is not widely used, even if those countries have large populations. I think the data should be normalized by how many people use Wikipedia rather than by the population of the country.
2. The data cover only English Wikipedia, so the results may be biased against countries whose first language is not English. For example, in a large country like China the first language is Chinese, so there may be more politician articles in Chinese than in English.
3. Some countries do not have freedom of speech about politics, North Korea for example. This is another possible source of bias in the result.