# A2 Bias In Data

## TL;DR   
The purpose of this exploration is to identify potential sources of bias in a corpus of human-annotated data, and describe some implications of those biases.

_Data Source:  https://figshare.com/projects/Wikipedia_Talk/16731_  
_Overview of the project:  https://meta.wikimedia.org/wiki/Research:Detox_  
_Overview of the data: https://meta.wikimedia.org/wiki/Research:Detox/Data_Release_

## Data Prep

Download all the data, using the `%%capture` cell magic to suppress the output.

In [None]:
%%capture

# make sure the target directory exists before downloading
!mkdir -p Raw_Data

# download toxicity data
!wget https://ndownloader.figshare.com/files/7394539 -O Raw_Data/toxicity_annotations.tsv
!wget https://ndownloader.figshare.com/files/7394542 -O Raw_Data/toxicity_annotated_comments.tsv
!wget https://ndownloader.figshare.com/files/7640581 -O Raw_Data/toxicity_worker_demographics.tsv

# download aggression data
!wget https://ndownloader.figshare.com/files/7394506 -O Raw_Data/aggression_annotations.tsv
!wget https://ndownloader.figshare.com/files/7038038 -O Raw_Data/aggression_annotated_comments.tsv
!wget https://ndownloader.figshare.com/files/7640644 -O Raw_Data/aggression_worker_demographics.tsv
     
# download personal attack data
!wget https://ndownloader.figshare.com/files/7554637 -O Raw_Data/attack_annotations.tsv
!wget https://ndownloader.figshare.com/files/7554634 -O Raw_Data/attack_annotated_comments.tsv
!wget https://ndownloader.figshare.com/files/7640752 -O Raw_Data/attack_worker_demographics.tsv


Read each downloaded file into a pandas DataFrame.

In [1]:
import pandas as pd

# toxicity data
toxicity_annotations = pd.read_csv("Raw_Data/toxicity_annotations.tsv", delimiter="\t")
toxicity_annotated_comments = pd.read_csv("Raw_Data/toxicity_annotated_comments.tsv", delimiter="\t")
toxicity_worker_demographics = pd.read_csv("Raw_Data/toxicity_worker_demographics.tsv", delimiter="\t")

# aggression data
aggression_annotations = pd.read_csv("Raw_Data/aggression_annotations.tsv", delimiter="\t")
aggression_annotated_comments = pd.read_csv("Raw_Data/aggression_annotated_comments.tsv", delimiter="\t")
aggression_worker_demographics = pd.read_csv("Raw_Data/aggression_worker_demographics.tsv", delimiter="\t")

# personal attack data
attack_annotations = pd.read_csv("Raw_Data/attack_annotations.tsv", delimiter="\t")
attack_annotated_comments = pd.read_csv("Raw_Data/attack_annotated_comments.tsv", delimiter="\t")
attack_worker_demographics = pd.read_csv("Raw_Data/attack_worker_demographics.tsv", delimiter="\t")

Join each annotation table to its worker demographics table so that annotations and annotator attributes can be analyzed together.

In [2]:
# left-join demographics onto annotations; note that join(on=...) looks up
# worker_id in the right frame's index, not in its worker_id column
joined_toxicity_annotations = toxicity_annotations.join(toxicity_worker_demographics, on="worker_id", rsuffix="_r")
joined_aggression_annotations = aggression_annotations.join(aggression_worker_demographics, on="worker_id", rsuffix="_r")
joined_attack_annotations = attack_annotations.join(attack_worker_demographics, on="worker_id", rsuffix="_r")

Preview the results.

In [3]:
joined_toxicity_annotations.head()

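| | rev_id | worker_id | toxicity | toxicity_score | worker_id_r | gender | english_first_language | age_group | education |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2232.0 | 723 | 0 | 0.0 | 1789.0 | male | 1.0 | 30-45 | bachelors |
| 1 | 2232.0 | 4000 | 0 | 0.0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 2232.0 | 3989 | 0 | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 2232.0 | 3341 | 0 | 0.0 | 3974.0 | male | 0.0 | 18-30 | hs |
| 4 | 2232.0 | 1574 | 0 | 1.0 | 3863.0 | female | 0.0 | 18-30 | professional |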
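
In the preview, `worker_id` 723 is paired with `worker_id_r` 1789, which suggests the rows were matched by position rather than by worker. `DataFrame.join(other, on="worker_id")` looks up the left frame's `worker_id` values in the index of `other`, and a frame loaded with `pd.read_csv` has a default 0..n-1 index. The sketch below illustrates the difference on small made-up frames (the values are illustrative, not taken from the dataset) and shows `merge` as one possible way to match on the `worker_id` columns themselves.

```python
import pandas as pd

# Toy stand-ins for the annotation and demographics tables (made-up values).
annotations = pd.DataFrame({"rev_id": [10, 10, 11], "worker_id": [2, 0, 1]})
demographics = pd.DataFrame({"worker_id": [1, 2], "gender": ["female", "male"]})

# join(on=...) compares annotations.worker_id with demographics' row index (0, 1).
by_position = annotations.join(demographics, on="worker_id", rsuffix="_r")

# merge(on=...) compares the worker_id columns of both frames.
by_worker = annotations.merge(demographics, on="worker_id", how="left")

print(by_position[["worker_id", "worker_id_r", "gender"]])
print(by_worker[["worker_id", "gender"]])
```

In `by_position`, the annotation from worker 0 picks up the demographics stored in row 0 (which belong to worker 1), while `by_worker` leaves worker 0 unmatched because no demographics row carries that id.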


## Analysis / Visualization

Calculate each worker's average `toxicity_score`, which serves here as that worker's "toxicity bias". Does it differ across age groups?

In [4]:
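# mean toxicity_score per worker; the result is a Series indexed by worker_id
avg_worker_toxicity = joined_toxicity_annotations.groupby("worker_id")["toxicity_score"].mean()
# note: join() without `on` aligns on the row index of toxicity_worker_demographics,
# not on its worker_id column
toxicity_worker_demographics = toxicity_worker_demographics.join(avg_worker_toxicity)
toxicity_worker_demographics.head()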

| | worker_id | gender | english_first_language | age_group | education | toxicity_score |
|---|---|---|---|---|---|---|
| 0 | 85 | female | 0 | 18-30 | bachelors | -0.208768 |
| 1 | 1617 | female | 0 | 45-60 | bachelors | 0.0 |
| 2 | 1394 | female | 0 | NaN | bachelors | 0.258403 |
| 3 | 311 | male | 0 | 30-45 | bachelors | 0.689373 |
| 4 | 1980 | male | 0 | 45-60 | masters | -0.105263 |
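
The grouped mean is indexed by `worker_id`, whereas the demographics table keeps its default row index, so a plain `join` attaches to row *n* the average of the worker whose id is *n*. A minimal sketch, on made-up values, of attaching each average to its own worker by passing `on="worker_id"`. The column name `avg_toxicity_score` is an assumption introduced here for readability and does not appear in the notebook.

```python
import pandas as pd

# Toy annotations and demographics (made-up values).
annotations = pd.DataFrame({
    "worker_id": [5, 5, 7, 7],
    "toxicity_score": [1.0, -1.0, 2.0, 1.0],
})
demographics = pd.DataFrame({"worker_id": [7, 5], "age_group": ["18-30", "30-45"]})

# Per-worker mean, renamed so the new column is clearly an average.
avg = (
    annotations.groupby("worker_id")["toxicity_score"]
    .mean()
    .rename("avg_toxicity_score")
)

# on="worker_id" looks up each demographics.worker_id in avg's index.
aligned = demographics.join(avg, on="worker_id")
print(aligned)
```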



Now compute the mean of these per-worker averages for each age group.

In [5]:
toxicity_worker_demographics.groupby("age_group").toxicity_score.mean()

age_group
18-30       0.219225
30-45       0.204754
45-60       0.190176
Over 60     0.121644
Under 18    0.252275
Name: toxicity_score, dtype: float64
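
Group means on their own do not show how many workers stand behind each number, and a group with few workers can shift noticeably with one or two annotators. A minimal sketch, on made-up values, of reporting the spread and the group size next to the mean with `groupby(...).agg`:

```python
import pandas as pd

# Toy per-worker averages with their age group (made-up values).
workers = pd.DataFrame({
    "age_group": ["18-30", "18-30", "18-30", "Over 60"],
    "toxicity_score": [0.5, 0.0, 1.0, 0.1],
})

# Mean, sample standard deviation, and number of workers per age group.
summary = workers.groupby("age_group")["toxicity_score"].agg(["mean", "std", "count"])
print(summary)
```

A group with a single worker has an undefined sample standard deviation (`NaN`), which is a useful signal that its mean should be read with caution.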

## Implications


---

In [None]:
# requires plotly:
#   conda install -c anaconda plotly
import plotly.express as px

# also needs ipywidgets (shell command, so prefix with "!"; -y skips the prompt)
!conda install -y jupyterlab "ipywidgets=7.5"

# JupyterLab renderer support
!jupyter labextension install jupyterlab-plotly@4.11.0
!jupyter labextension install @jupyter-widgets/jupyterlab-manager plotlywidget@4.11.0

In [None]:
import plotly.graph_objects as go
fig = go.FigureWidget(data=go.Bar(y=[2, 3, 1]))
fig

In [None]:
import plotly.graph_objects as go
fig = go.Figure(data=go.Bar(y=[2, 3, 1]))
fig.write_html('second_figure.html', auto_open=True)

In [None]:
import plotly.graph_objects as go
fig = go.Figure(data=go.Bar(y=[2, 3, 1]))
fig.show()

In [None]:
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt