# File mapping and entry points

This notebook creates the tables below from a mix of hand-entered data (based on recordings) and data generated computationally by a survey tool.

**Important**: This notebook assumes the previous data preparation steps have been run, and won't work if the file loaded below isn't present. See the readme for more info.
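
One way to make that dependency fail loudly is a small guard around the load step. This is a minimal sketch, not part of the original notebook: `require_file` is a hypothetical helper name, and the error message wording is an assumption.

```python
from pathlib import Path


def require_file(path):
    """Return `path` as a Path, or raise a clear error if it is missing.

    Hypothetical helper for illustration; not defined anywhere in the notebook.
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(
            f"{p} not found; run the data preparation steps in the readme first."
        )
    return p
```

It could then wrap the load in the setup cell, for example `pd.read_csv(require_file("data/file_mapping_responses.csv"))`.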

In [1]:
# Setup - import libraries and load data from local file
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

fm = pd.read_csv("data/file_mapping_responses.csv")
fm = fm.iloc[:,1:11] #subset the data to just strings and remove the notes columns

## Spread of cards matched to term across different terms

This function takes the processed data from the file above and counts how many times each card was matched to each term. A `NaN` result means there were no matches and is equivalent to "0".

GO Term was reasonably consistent; everything else was much more varied.

In [2]:
# Split each multi-choice answer on "|" and return a dict mapping
# each term to the number of times it appears in the column.
def count_unique_answers( a_series ):
    # variable to store the final response in as we build it
    tokenise = list()
    split_answer = pd.DataFrame(a_series.str.split("|", expand=True))
    for i in split_answer:   
        tokenise.extend(split_answer[i])
        
    response = {}
    for i in tokenise:
        if not pd.isnull(i):  
            if i in response:
                response[i] = response[i]+1
            else:
                response[i] = 1
    return response

result_counts = {}

for i in fm: 
    if fm[i].dtype == object:  # np.object was removed in NumPy 1.24; use the builtin
        result_counts[fm[i].name] = count_unique_answers(fm[i])
        
results = pd.DataFrame.from_dict(result_counts)
results


Unnamed: 0,homo_sapiens.gff|100287102,homo_sapiens.gff|HGNC:37102,homo_sapiens.gff|DDX11L1,flybase_d_melanogaster.gaf|FBgn0043467,flybase_d_melanogaster.gaf|GO:0048149,ncbi_homo_sapiens.gene_info|9606,ncbi_homo_sapiens.gene_info|1,ncbi_homo_sapiens.gene_info|A1BG,ncbi_homo_sapiens.gene_info|MIM:138670,ncbi_homo_sapiens.gene_info|HGNC:HGNC:5
identifier,15.0,9.0,5.0,13.0,2.0,3.0,9.0,2.0,6.0,4.0
gene,14.0,6.0,7.0,5.0,,1.0,6.0,6.0,2.0,3.0
accession,2.0,3.0,1.0,4.0,1.0,,,1.0,2.0,1.0
symbol,1.0,4.0,8.0,3.0,,2.0,1.0,15.0,1.0,1.0
dataset/database,1.0,3.0,1.0,4.0,1.0,,,,5.0,1.0
h. sapiens,1.0,2.0,,,,3.0,,1.0,,
organism,1.0,1.0,1.0,1.0,,8.0,1.0,1.0,1.0,1.0
name,1.0,2.0,15.0,1.0,,1.0,1.0,7.0,1.0,2.0
go: protein binding,,1.0,,,2.0,,,,,
location,,1.0,,,,,,,,
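
Because `NaN` here just means zero matches, the table can be made easier to scan by filling the gaps and casting to integers. The following is a minimal sketch using a three-row, three-card excerpt of the table above; the names `excerpt` and `filled` are illustrative and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Excerpt of the counts table above: three terms, three cards.
excerpt = pd.DataFrame(
    {
        "homo_sapiens.gff|100287102": [15.0, 14.0, np.nan],
        "homo_sapiens.gff|HGNC:37102": [9.0, 6.0, 1.0],
        "homo_sapiens.gff|DDX11L1": [5.0, 7.0, np.nan],
    },
    index=["identifier", "gene", "location"],
)

# NaN means "no matches", so replace it with 0 and drop the float formatting.
filled = excerpt.fillna(0).astype(int)
print(filled)
```

The same two calls should apply to the full `results` frame, assuming every column holds only counts.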


Same view as above, but with the axes flipped. 

In [4]:
results = results.transpose()
results.head(10)

Unnamed: 0,identifier,gene,accession,symbol,dataset/database,h. sapiens,organism,name,go: protein binding,location,...,database,Chromosome location,pathway,accesssion,taxon id,human,chromosome,BRCA1_HUMAN,links to other dbs,disease
homo_sapiens.gff|100287102,15.0,14.0,2.0,1.0,1.0,1.0,1.0,1.0,,,...,,,,,,,,,,
homo_sapiens.gff|HGNC:37102,9.0,6.0,3.0,4.0,3.0,2.0,1.0,2.0,1.0,1.0,...,,,,,,,,,,
homo_sapiens.gff|DDX11L1,5.0,7.0,1.0,8.0,1.0,,1.0,15.0,,,...,,,,,,,,,,
flybase_d_melanogaster.gaf|FBgn0043467,13.0,5.0,4.0,3.0,4.0,,1.0,1.0,,,...,1.0,,,,,,,,,
flybase_d_melanogaster.gaf|GO:0048149,2.0,,1.0,,1.0,,,,2.0,,...,,1.0,1.0,1.0,,,,,,
ncbi_homo_sapiens.gene_info|9606,3.0,1.0,,2.0,,3.0,8.0,1.0,,,...,,,,,3.0,5.0,,,,
ncbi_homo_sapiens.gene_info|1,9.0,6.0,,1.0,,,1.0,1.0,,,...,,,,,,,3.0,,,
ncbi_homo_sapiens.gene_info|A1BG,2.0,6.0,1.0,15.0,,1.0,1.0,7.0,,,...,,,,,,,,1.0,,
ncbi_homo_sapiens.gene_info|MIM:138670,6.0,2.0,2.0,1.0,5.0,,1.0,1.0,,,...,,,,,,,2.0,,1.0,1.0
ncbi_homo_sapiens.gene_info|HGNC:HGNC:5,4.0,3.0,1.0,1.0,1.0,,1.0,2.0,,,...,1.0,,,,,,2.0,,,


In [5]:
# GO Term is quite consistent in the matching
results["GO Term"]

homo_sapiens.gff|100287102                  NaN
homo_sapiens.gff|HGNC:37102                 NaN
homo_sapiens.gff|DDX11L1                    NaN
flybase_d_melanogaster.gaf|FBgn0043467      1.0
flybase_d_melanogaster.gaf|GO:0048149      18.0
ncbi_homo_sapiens.gene_info|9606            NaN
ncbi_homo_sapiens.gene_info|1               NaN
ncbi_homo_sapiens.gene_info|A1BG            NaN
ncbi_homo_sapiens.gene_info|MIM:138670      NaN
ncbi_homo_sapiens.gene_info|HGNC:HGNC:5     NaN
Name: GO Term, dtype: float64

In [6]:
# Symbol is the next best and is already a mess - it's used in a lot of different places.
results["symbol"]

homo_sapiens.gff|100287102                  1.0
homo_sapiens.gff|HGNC:37102                 4.0
homo_sapiens.gff|DDX11L1                    8.0
flybase_d_melanogaster.gaf|FBgn0043467      3.0
flybase_d_melanogaster.gaf|GO:0048149       NaN
ncbi_homo_sapiens.gene_info|9606            2.0
ncbi_homo_sapiens.gene_info|1               1.0
ncbi_homo_sapiens.gene_info|A1BG           15.0
ncbi_homo_sapiens.gene_info|MIM:138670      1.0
ncbi_homo_sapiens.gene_info|HGNC:HGNC:5     1.0
Name: symbol, dtype: float64

## Variation within a given term
When a term is matched to cards, how many different options are given in total, and per person?
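
As a rough illustration of the "in total" and "per person" views, the sketch below works on a single made-up column. The answers in `answers` are hypothetical, not survey data, and treating combinations as order-insensitive is an assumption: the notebook does not say whether the order in which terms were listed is meaningful.

```python
import numpy as np
import pandas as pd

# Hypothetical answers for one card; NaN marks a participant who skipped it.
answers = pd.Series(["gene|identifier", "identifier|gene", "symbol", np.nan])

# Drop skipped answers, then split each remaining answer into a list of terms.
split = answers.dropna().str.split("|")

# Per person: how many terms each respondent matched to the card.
terms_per_person = split.map(len)

# In total: distinct terms across all respondents.
distinct_terms = set(split.explode())

# Distinct combinations, ignoring the order the terms were listed in.
distinct_combos = {frozenset(terms) for terms in split}

print(terms_per_person.tolist(), len(distinct_terms), len(distinct_combos))
```

Using `frozenset` means `gene|identifier` and `identifier|gene` collapse into one combination; comparing plain lists would keep them apart.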

In [6]:
# Collect the distinct combinations of terms that respondents gave for one card.
# Each "|"-separated answer is one combination; the same terms in a different order count separately.
def count_variation( a_series ):
    # split each answer into a list of terms; sorting keeps the output order stable
    split_answer = a_series.str.split("|")
    split_answer = split_answer.sort_values()
    
    term_combinations = list()
    for terms_list in split_answer:
        if isinstance(terms_list, list):
            #Record unique combinations of terms
            if terms_list not in term_combinations:
                term_combinations.append(terms_list)
    return term_combinations

# Collect every distinct term that respondents gave for one card,
# regardless of which combination it appeared in.
def count_unique_terms( a_series ):
    # split each answer into a list of terms; sorting keeps the output order stable
    split_answer = a_series.str.split("|")
    split_answer = split_answer.sort_values()
    
    unique_terms = list()
    for terms_list in split_answer:
        if isinstance(terms_list, list):
            #Record unique terms regardless of combo
            for individual_term in terms_list:
                if individual_term not in unique_terms:
                    unique_terms.append(individual_term)            

    return unique_terms

term_combinations = {}
unique_results_per_term = {}
for i in fm: 
    if fm[i].dtype == object:  # np.object was removed in NumPy 1.24; use the builtin
        
        #column name
        col = fm[i].name
        
        #calculate number of different term combos per matched field
        term_combos = count_variation(fm[i])
        term_combinations[col] = {"terms": term_combos, "count": len(term_combos)}
        
        #calculate number of different unique terms per field
        unique_terms = count_unique_terms(fm[i])
        unique_results_per_term[col] = {"terms": unique_terms, "count": len(unique_terms)}


print(pd.DataFrame(unique_results_per_term).transpose())
print(pd.DataFrame(term_combinations).transpose())

                                                                                     terms  \
homo_sapiens.gff|100287102               [accession, identifier, gene, dataset/database...   
homo_sapiens.gff|HGNC:37102              [dataset/database, h. sapiens, gene, accession...   
homo_sapiens.gff|DDX11L1                 [Q9H4C3_HUMAN, gene, name, symbol, identifier,...   
flybase_d_melanogaster.gaf|FBgn0043467   [GO Term, accession, gene, identifier, name, s...   
flybase_d_melanogaster.gaf|GO:0048149    [Chromosome location, GO Term, accesssion, dat...   
ncbi_homo_sapiens.gene_info|9606         [h. sapiens, human, identifier, taxon id, orga...   
ncbi_homo_sapiens.gene_info|1            [chromosome, gene, identifier, protein, name, ...   
ncbi_homo_sapiens.gene_info|A1BG         [BRCA1_HUMAN, gene, name, symbol, organism, id...   
ncbi_homo_sapiens.gene_info|MIM:138670   [accession, chromosome, dataset/database, iden...   
ncbi_homo_sapiens.gene_info|HGNC:HGNC:5  [chromosome, databa

## Entry points
We asked participants which card they would consider the entry point to this set of items if they were planning research or an experiment themselves.

In [7]:
#Data prep to load the data and subset for usage
entry_points = pd.read_csv("data/entry_point.csv")
#subset the data to just the columns that answer this question
entry_terms = entry_points.iloc[:,1:2]
#let's take a peek at the data
entry_terms.head(5)

Unnamed: 0,term_selected
0,homologue
1,accession|identifier|DOI|pubmed ID
2,disease|flymine|ensembl|uniprot
3,gene
4,identifier|dataset|database|publication


In [9]:
#"Split out responses to the entry point question and return a count of how many
# times each term is used"
def count_entry_point_unique_answers( a_series ):
    # variable to store the final response in as we build it
    tokenise = list()
    split_answer = pd.DataFrame(a_series.str.split("|", expand=True))
    for i in split_answer:   
        tokenise.extend(split_answer[i])
        
    response = {}
    for i in tokenise:
        if not pd.isnull(i):  
            if i in response:
                response[i] = response[i]+1
            else:
                response[i] = 1
    return response


#we'll store unique counts in this var
result_counts = {}

# split up the terms and count them
for i in entry_terms: 
    if entry_terms[i].dtype == object:  # np.object was removed in NumPy 1.24; use the builtin
        result_counts[entry_terms[i].name] = count_entry_point_unique_answers(entry_terms[i])

entry_df = pd.DataFrame(result_counts)

#make sure it's in order from big to small
entry_df.sort_values(by=['term_selected'], inplace=True, ascending=False)

#show the results to the world as a nice lil table
print(entry_df)

                 term_selected
gene                         7
identifier                   5
publication                  4
database                     3
protein                      2
disease                      2
organism                     2
accession                    2
pubmed ID                    1
pathway                      1
name                         1
DOI                          1
homologue                    1
flymine                      1
ensembl                      1
dataset                      1
d. melanogaster              1
chromosome                   1
uniprot                      1
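
The same kind of count can also be produced with pandas string and reshaping methods instead of explicit loops. This is a sketch of an alternative, not the notebook's method; it uses only the five responses shown by `entry_terms.head(5)` above, so its totals are smaller than the full table.

```python
import pandas as pd

# The five responses shown by entry_terms.head(5) above.
term_selected = pd.Series(
    [
        "homologue",
        "accession|identifier|DOI|pubmed ID",
        "disease|flymine|ensembl|uniprot",
        "gene",
        "identifier|dataset|database|publication",
    ],
    name="term_selected",
)

# Split each answer into a list, give every term its own row, then count.
counts = term_selected.str.split("|").explode().value_counts()
print(counts.head())
```

`value_counts` sorts from most to least frequent, so no separate sort step is needed. In this five-row excerpt only `identifier` appears more than once.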


## Entry point types
Counts how many responses fall into each entry point type (most are information-centric or bio-centric).

In [10]:
#subset the data 
entry_point_type = entry_points[["participant_id","bio_or_info"]]
#we'll store unique counts in this var
entry_point_type_result_counts = {}

# split up the terms and count them
for i in entry_point_type: 
    if entry_point_type[i].dtype == object:  # np.object was removed in NumPy 1.24; use the builtin
        entry_point_type_result_counts[entry_point_type[i].name] = count_entry_point_unique_answers(entry_point_type[i])

In [11]:
# Let's print each type as a nice table, sorted big to small. 
result_types = pd.DataFrame(entry_point_type_result_counts["bio_or_info"], index=[0])
result_types.rename(index = {0:'bio_or_info'}, inplace = True)
result_types.sort_values(by=["bio_or_info"], inplace=True, axis=1, ascending=False)
result_types


Unnamed: 0,info,bio,both,unclear,money,circuits
bio_or_info,12,5,2,1,1,1
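
To put the raw counts in proportion, they can be converted to percentages. This is a minimal sketch that copies the counts from the table above by hand; `type_counts` and `shares` are illustrative names, not variables from the notebook.

```python
import pandas as pd

# Counts copied from the table above.
type_counts = pd.Series(
    {"info": 12, "bio": 5, "both": 2, "unclear": 1, "money": 1, "circuits": 1},
    name="bio_or_info",
)

# Convert raw counts to percentages of all categorised responses.
shares = (type_counts / type_counts.sum() * 100).round(1)
print(shares)
```

In the notebook itself, the same calculation could start from `result_types.loc["bio_or_info"]` rather than hand-copied numbers.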
