# Sysrev Label Types and Counts

Many sysrevs extract boolean or categorical information from medical abstracts, articles, or other PDFs. This is done by creating "labels". You can create your own project and labels easily at sysrev.com. We even wrote a getting started post at [blog.sysrev.com/posts/SysrevGettingStarted](https://blog.sysrev.com/posts/SysrevGettingStarted) to help.

## Part 1 Getting Some Data
The public sysrev [EBTC - Effects on the liver as observed in experimental animals after dosing of 10 specified compounds](https://sysrev.com/p/100) is a good example project for this.  Below we download user answers from 5818 articles.  This can also be done by visiting the project export page https://sysrev.com/p/100.

This project involved screening literature for compound liver toxicity mechanisms. The users extracted information about the referenced compounds and mechanisms, but were primarily interested in marking articles as "Include" or "Exclude".

In [8]:
import pandas as pd

url = "https://sysrev.com/api/export-answers-csv/100/Sysrev_Answers_100_20181107.csv"
df  = pd.read_csv(url)
df.head(3)

|  | Article ID | User Name | Resolve? | Include | Primary research | Species | Compound | Mechanistic | User Note | Title | Journal | Authors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37334 | ktsaiou1 | True | False | False |  |  |  |  | Ciproflaxin (sip ro floks' a sin) (Cipro by Mi... | Hospital Pharmacy |  |
| 1 | 37334 | gunn.vist |  | False |  |  |  |  |  | Ciproflaxin (sip ro floks' a sin) (Cipro by Mi... | Hospital Pharmacy |  |
| 2 | 37334 | dwikoff |  | True | False |  | Cipro (ciprofloxacin) |  | Cipro-specific review - may contain useful inf... | Ciproflaxin (sip ro floks' a sin) (Cipro by Mi... | Hospital Pharmacy |  |


The above table of user answers has document identity data, user identity data, document descriptive data, and then user created columns.  Each row corresponds to the review of an article by a user:

1. **Document Identity**: The *Article ID* column provides a unique identifier for each reviewed article.  
2. **User Identity**: The *User Name* column provides a unique name for the reviewing user.
3. **Document Descriptions**: *Title*, *Journal*, *Authors* (frequently NaN) are provided for each article.  

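The remaining columns (*Include*, *Primary research*, *Species*, *Compound*, *Mechanistic*) are the user-created labels for this project, and *User Note* holds free-text comments. Finally, the *Resolve?* column identifies articles where reviewers disagreed and an administrator resolved the conflict.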
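The disagreement that *Resolve?* records can be illustrated on the three sample rows shown above. This is only a minimal sketch, not Sysrev's own conflict logic: it assumes that a conflict means reviewers of the same article gave different *Include* answers, and the `sample` frame is typed in by hand from the displayed rows.

```python
import pandas as pd

# Hand-copied from the three rows displayed above (illustrative only).
sample = pd.DataFrame({
    "Article ID": [37334, 37334, 37334],
    "User Name": ["ktsaiou1", "gunn.vist", "dwikoff"],
    "Resolve?": [True, None, None],
    "Include": [False, False, True],
})

# Number of distinct Include answers per article;
# more than one distinct answer means the reviewers disagreed.
distinct_answers = sample.groupby("Article ID")["Include"].nunique()
conflicted = distinct_answers[distinct_answers > 1].index.tolist()

# Rows that are flagged as an administrator's resolution.
resolved = sample[sample["Resolve?"].eq(True)]

print(conflicted)
print(resolved[["Article ID", "User Name", "Include"]])
```

On the full `df` the same two expressions should work unchanged, assuming the export keeps these column names.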

## Part 2 Counting User Reviews
One fun thing we can do is count the number of articles reviewed by each user:

In [15]:
df.groupby(['User Name']).size().reset_index(name='counts').sort_values('counts')

|  | User Name | counts |
|---|---|---|
| 3 | gouedraogo | 2 |
| 1 | berube | 5 |
| 7 | maja | 11 |
| 10 | oana | 100 |
| 9 | noffisat.oki | 101 |
| 12 | rwrigh32 | 145 |
| 8 | nicole.kleinstreuer | 174 |
| 0 | amccorm3 | 226 |
| 2 | dwikoff | 663 |
| 5 | hubert.dirven | 2438 |


Wow, it looks like ktsaiou1 (whose row is not shown in the output excerpt above) reviewed 2753 articles; that's a lot of work. gouedraogo needs to step it up a bit! You can also track user progress on the overview page of your review [sysrev.com/p/100](https://sysrev.com/p/100).
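Note that `groupby(...).size()` counts rows, not articles. If a user could ever have more than one row for the same article (an assumption we have not checked for this export), counting distinct *Article ID* values is the safer measure. The sketch below compares both on a small hypothetical table that only borrows the export's column names.

```python
import pandas as pd

# Toy answers table (hypothetical values, same column names as the export).
toy = pd.DataFrame({
    "User Name": ["a", "a", "a", "b", "b"],
    "Article ID": [1, 2, 2, 1, 3],
})

# Rows per user: a shorthand for the groupby(...).size() call above.
rows_per_user = toy["User Name"].value_counts(ascending=True)

# Distinct articles per user: repeated rows for one article count once.
articles_per_user = (
    toy.groupby("User Name")["Article ID"].nunique().sort_values()
)

print(rows_per_user)
print(articles_per_user)
```

Here user `a` has three rows but only two distinct articles, so the two measures differ.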

We can also count the number of times different values occurred for the `Species` label created in this review. But first we need to make a separate row for each value in this column. For example, a value of `'monkey','rat'` should be split over two otherwise identical rows:

In [44]:
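# Taken from Stack Overflow: https://goo.gl/x311Jm
# This function "explodes" a column that holds several
# sep-separated answers into one row per answer.
# Rows where the column is NaN are dropped; with keep=True the
# original unsplit value is kept as an extra row.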
def tidy_split(df, column, sep=',', keep=False):
    indexes = list()
    new_values = list()
    df = df.dropna(subset=[column])
    for i, presplit in enumerate(df[column].astype(str)):
        values = presplit.split(sep)
        if keep and len(values) > 1:
            indexes.append(i)
            new_values.append(presplit)
        for value in values:
            indexes.append(i)
            new_values.append(value)
    new_df = df.iloc[indexes, :].copy()
    new_df[column] = new_values
    return new_df

speciesDF = tidy_split(df,'Species')
speciesDF['Species'] = speciesDF['Species'].str.strip() #remove white space
speciesDF.head(3) #Species now has one answer per row. 

|  | Article ID | User Name | Resolve? | Include | Primary research | Species | Compound | Mechanistic | User Note | Title | Journal | Authors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 37351 | gunn.vist |  | False | True | human | Rezulin (troglitazone) |  |  | The Diabetes Prevention Program. Design and me... | Diabetes Care |  |
| 23 | 37379 | ktsaiou1 | True | False |  | human | Avandia (rosiglitazone) |  | review - KT resolved | Two new oral antidiabetics: Both poorly assessed | Prescrire International |  |
| 77 | 37455 | ktsaiou1 | True | False |  | human |  |  | our drugs not mentioned, but maybe in full text? | Dabigatran: Continue to use heparin, a better-... | Prescrire International |  |
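pandas 0.25 and later include `DataFrame.explode`, which can do the same reshaping as `tidy_split` without a helper function. The following is a minimal sketch on a small hypothetical frame (the values are made up; only the column names match the export), so treat it as an alternative to try rather than the method used for the output above.

```python
import pandas as pd

# Toy frame with a multi-valued Species column (hypothetical values).
toy = pd.DataFrame({
    "Article ID": [1, 2, 3],
    "Species": ["monkey, rat", "human", None],
})

species_long = (
    toy.dropna(subset=["Species"])  # drop rows with no Species answer
    .assign(Species=lambda d: d["Species"].str.split(","))  # string -> list
    .explode("Species")  # one row per list element
    .reset_index(drop=True)  # give each new row its own index value
)
# Remove the whitespace left over from ", " separators.
species_long["Species"] = species_long["Species"].str.strip()

print(species_long)
```

Unlike `tidy_split`, this version resets the index, so the original row numbers are not preserved; drop the `reset_index` call if you need them.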


Finally, we can count the number of times different species have been labeled:

In [48]:
# Parentheses let the method chain span several lines.
(
    speciesDF.groupby(['Species'])
    .size()
    .reset_index(name='counts')
    .sort_values('counts')
)

|  | Species | counts |
|---|---|---|
| 3 | non-human primate | 15 |
| 0 | dog | 21 |
| 4 | other | 62 |
| 2 | mouse | 211 |
| 5 | rat | 294 |
| 1 | human | 1215 |


So we see that most of the species labels assigned to articles reviewed by Johns Hopkins at https://sysrev.com/p/100 are human, and only a few are non-human primates.
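To put a number on "most", the counts can be turned into shares. The sketch below retypes the counts from the output above into a `Series`. Keep in mind that these are label rows rather than unique articles, so an article reviewed by several users contributes more than once.

```python
import pandas as pd

# Counts copied by hand from the output table above.
counts = pd.Series(
    {
        "non-human primate": 15,
        "dog": 21,
        "other": 62,
        "mouse": 211,
        "rat": 294,
        "human": 1215,
    },
    name="counts",
)

# Each species as a fraction of all species labels.
share = (counts / counts.sum()).round(3)

print(share.sort_values(ascending=False))
```

With these figures, human labels make up roughly two thirds of the 1818 species labels, and non-human primates less than one percent.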