# Spiroplasma vs Fly Sampling Locations

This notebook analyzes the Spiroplasma assay results, testing whether infection is biased by fly sex or by sampling location.

__<font color="red">ATTENTION:</font> Click here to skip directly to the [Results](#Final-Results)__:
- [Methods Overview](#Overview-of-what-was-done:)
- [Sex Bias Results](#Sex-Bias-Results:)
    - [Conclusions](#Sex-Bias-Conclusions:)
- [Geographical Bias Results](#Geographical-Bias-Results:)
    - [Conclusions](#Geographical-Bias-Conclusions:)
    
 ----

In [44]:
%matplotlib inline

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # moved to io in Python 3

import requests

import seaborn as sns
import pandas as pd
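# use the fully qualified option names; the short forms can be ambiguous in newer pandas
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)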

import patsy
import numpy as np
import scipy as sp
import statsmodels as smd
import statsmodels.api as sm
from statsmodels.formula.api import logit, glm

from scipy.stats import fisher_exact
from scipy.stats import f_oneway

from spartan.utils import spandas as spd

# Load spreadsheets

In [31]:
# get openrefined google spreadsheet data
table_gsh = pd.read_csv("/home/gus/Documents/YalePostDoc/project_stuff/Spiroplasma/MF_Spiro_gsh.csv")

In [32]:
# Get openrefined excel file data
table_xls = pd.read_csv("/home/gus/Documents/YalePostDoc/project_stuff/Spiroplasma/MF_Spiro_xls.csv")

## Run script to load main database info into this notebook

In [4]:
%run /home/gus/Documents/YalePostDoc/project_stuff/g_f_fucipes_uganda/scripts/gff_pandas_database.py

Main dataframe is dfp: and is only Gff.


In [5]:
def recode_dfp(df):
    df['Collection Year'] = pd.DatetimeIndex(df.Date).year
    df = df.rename(columns={"Village": "Location Code", "Fly_Number": "Fly Number"})
    return df[["Location Code","Collection Year","Fly Number","Sex"]]

In [6]:
d = recode_dfp(dfp.copy())
d.head()

Unnamed: 0,Location Code,Collection Year,Fly Number,Sex
0,UWA,2014,1,F
1,UWA,2014,2,F
2,UWA,2014,3,M
3,UWA,2014,4,M
4,UWA,2014,5,M


## Recover sex data into `table_gsh` by crossref with `dfp`

In [7]:
# recover sex data into table_gsh by joining with dfp
table_gsh = pd.merge(left=table_gsh.copy(), right=recode_dfp(dfp.copy()), 
                     how='left', 
                     on=["Location Code","Collection Year","Fly Number"], 
                     left_on=None, right_on=None, 
                     left_index=False, right_index=False, 
                     sort=False, suffixes=('_x', '_y'), copy=True)

In [8]:
table_gsh.head()

Unnamed: 0,Location Code,Collection Year,Fly Number,Spiroplasma,Sex
0,CHU,2014,4,True,F
1,CHU,2014,6,False,F
2,CHU,2014,7,False,M
3,CHU,2014,10,False,M
4,CHU,2014,31,False,M


In [9]:
table_all = pd.concat([table_gsh.dropna(),
                       table_xls[["Location Code","Collection Year","Fly Number","Sex","Spiroplasma"]].dropna()])
table_all.head()

Unnamed: 0,Collection Year,Fly Number,Location Code,Sex,Spiroplasma
0,2014,4,CHU,F,True
1,2014,6,CHU,F,False
2,2014,7,CHU,M,False
3,2014,10,CHU,M,False
4,2014,31,CHU,M,False


-----------------

# Count flies in each combination of Spiroplasma status (+/-) and Sex (M/F)

In [10]:
table_all_pivot = table_all.pivot_table(values="Fly Number", 
                                        index=["Location Code"], 
                                        columns=["Spiroplasma","Sex"], 
                                        aggfunc=[len], 
                                        fill_value=0, margins=False, dropna=True)

table_all_pivot.columns = table_all_pivot.columns.droplevel() # removes useless 'len' top multilevel index
table_all_pivot

| Location Code | Spiroplasma=False, Sex=F | Spiroplasma=False, Sex=M | Spiroplasma=True, Sex=F | Spiroplasma=True, Sex=M |
|---|---|---|---|---|
| AMI | 10 | 10 | 0 | 1 |
| BOL | 6 | 6 | 3 | 1 |
| CHU | 7 | 8 | 3 | 0 |
| DUK | 3 | 6 | 6 | 4 |
| GAN | 11 | 3 | 8 | 3 |
| KIL | 12 | 3 | 2 | 0 |
| NGO | 12 | 5 | 0 | 0 |
| OCU | 6 | 5 | 1 | 1 |
| ORB | 22 | 6 | 19 | 7 |
| TUM | 9 | 7 | 0 | 0 |
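
The same kind of count table can also be built with `pd.crosstab`, which does not add the extra `len` column level. The sketch below runs on a small made-up table: the rows and the location codes `AAA`/`BBB` are hypothetical, not study data, and only the column names are taken from `table_all`.

```python
import pandas as pd

# Hypothetical toy rows that mimic the layout of `table_all`; not real study data.
toy = pd.DataFrame({
    "Location Code": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Sex":           ["F",   "M",   "F",   "F",   "M",   "M"],
    "Spiroplasma":   [True,  False, False, True,  True,  False],
})

# One row per location, one column per (Spiroplasma, Sex) combination.
# Combinations missing at a location are filled with 0.
counts = pd.crosstab(index=toy["Location Code"],
                     columns=[toy["Spiroplasma"], toy["Sex"]])
print(counts)
```

Note that `pd.crosstab` only creates columns for combinations that occur somewhere in the data, so a combination absent from every location would need to be added back before reshaping rows into 2x2 tables.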


# Hypothesis tests by location and multiple testing correction

In [11]:
def do_tests(df):
    
    locations = df.index.values
    
    tests = {}
    
    for loc in locations:
        
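        locdf = df.loc[loc,:]
        # rows: Spiroplasma False/True; columns: Sex F/M
        # (reshape the underlying array: Series.reshape is not available in current pandas)
        contingency_table = locdf.values.reshape((2,2))
        
        odds_ratio, p_val = fisher_exact(contingency_table)
        
        tests[loc] = (odds_ratio, p_val)
        
    # pooled table: column sums over all locations
    contingency_table_all = df.sum().values.reshape((2,2))
    odds_ratio_all, p_val_all = fisher_exact(contingency_table_all)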
    
    tests['all'] = (odds_ratio_all, p_val_all)
    
    testsdf = pd.DataFrame(data=tests, index=["Odds ratio","pvals"], columns=tests.keys(), dtype=None, copy=False)
    return testsdf.T

def add_fdr(df):
    multitests =smd.stats.multitest.multipletests 
    
    # do the fdr correction
    reject_or_not,corrected_pval = multitests(pvals=df.pvals, alpha=0.05, method='fdr_bh')[:2]
    
    # add results to dataframe
    df["Reject the null?"],df["adjusted pvals"] = reject_or_not,corrected_pval
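
The per-location test performed in `do_tests` can be traced by hand for a single row. The sketch below takes the DUK counts from the pivot table (3, 6, 6, 4) and reshapes them into the 2x2 layout that `fisher_exact` expects; the variable names are chosen here for illustration only.

```python
import numpy as np
from scipy.stats import fisher_exact

# DUK row of the pivot table, in column order:
# (Spiroplasma False, F), (False, M), (True, F), (True, M)
duk_counts = np.array([3, 6, 6, 4])

# Rows: Spiroplasma negative / positive; columns: female / male
contingency_table = duk_counts.reshape((2, 2))

# Two-sided test by default; the odds ratio is (3 * 4) / (6 * 6)
odds_ratio, p_val = fisher_exact(contingency_table)
print(odds_ratio, p_val)
```

With this layout an odds ratio below 1 means the positive flies at that location skew female relative to the negative flies, and a ratio above 1 means they skew male. Locations with a zero cell give ratios of 0, `inf`, or `NaN`, which is why several rows of the results table carry those values.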

In [12]:
results = do_tests(table_all_pivot)
add_fdr(results)
results.sort_index()

Unnamed: 0,Odds ratio,pvals,Reject the null?,adjusted pvals
AMI,inf,1.0,False,1
BOL,0.333333,0.584615,False,1
CHU,0.0,0.215686,False,1
DUK,0.333333,0.36985,False,1
GAN,1.375,1.0,False,1
KIL,0.0,1.0,False,1
NGO,,1.0,False,1
OCU,1.2,1.0,False,1
ORB,1.350877,0.754009,False,1
TUM,,1.0,False,1
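
Every adjusted p-value above is 1 because the smallest raw p-value (0.215686) is already large: Benjamini-Hochberg multiplies the i-th smallest p-value by m/i and caps the result at 1. A minimal NumPy sketch of that adjustment follows, written out by hand instead of calling `multipletests`; `bh_adjust` is a hypothetical helper name, and only the ten per-location p-values shown in the table are used.

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (hypothetical helper, for illustration)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)                          # indices that sort p ascending
    scaled = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # Enforce monotonicity: each value becomes the minimum over itself and all larger ranks
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)    # cap at 1 and restore input order
    return adjusted

# Per-location p-values from the table above (AMI ... TUM)
pvals = [1.0, 0.584615, 0.215686, 0.36985, 1.0, 1.0, 1.0, 1.0, 0.754009, 1.0]
adjusted = bh_adjust(pvals)
print(adjusted)
```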


# Is there a significant difference of Spiroplasma prevalence between populations?

- will NOT use one-way ANOVA because the data are categorical
- will probably use logistic (logit) regression
- may run both the `scipy` and `statsmodels` versions out of curiosity and to double-check results

## To make the results easier to interpret, pull in the GPS coordinates of the locations

- this allows the comparison to be run against location name (categorical) and against location coordinates (numerical)

In [33]:
location_gps = pd.read_csv('/home/gus/Dropbox/uganda_data/data_repos/field_data/locations/gps/villages/uganda_villages_gps.csv',
                           sep='\t'
                          )

In [34]:
location_gps.head()

Unnamed: 0,Location,Latitude,Longitude
0,ACA,2.27008,32.52053
1,OD,2.44749,32.66024
2,OCL,2.46757,32.56832
3,LIB,3.28078,32.854265
4,PAW,3.61199,32.68167


In [39]:
table_all_gps = pd.merge(left=table_all, 
                         right=location_gps, 
                         how='left', 
                         on=None, 
                         left_on="Location Code", right_on="Location", 
                         left_index=False, right_index=False, 
                         sort=False, suffixes=('_x', '_y'), copy=True).drop(labels=["Location"],axis=1)
table_all_gps.head()

Unnamed: 0,Collection Year,Fly Number,Location Code,Sex,Spiroplasma,Latitude,Longitude
0,2014,4,CHU,F,True,2.606845,32.93758
1,2014,6,CHU,F,False,2.606845,32.93758
2,2014,7,CHU,M,False,2.606845,32.93758
3,2014,10,CHU,M,False,2.606845,32.93758
4,2014,31,CHU,M,False,2.606845,32.93758
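
A left merge leaves `Latitude` and `Longitude` as NaN for any location code that has no match in the GPS table, and the `statsmodels` formula interface drops rows with missing values by default. The regression summary in this notebook reports 182 observations, whereas the contingency table sums to 216 flies, which would be consistent with some rows lacking coordinates (the notebook does not confirm the cause). A check along the following lines may help; it is a sketch on made-up data, with hypothetical location codes and coordinates, using the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Hypothetical fly records and GPS table; "ZZZ" has no coordinates on purpose.
flies = pd.DataFrame({"Location Code": ["AAA", "AAA", "ZZZ"],
                      "Infected": [1, 0, 0]})
gps = pd.DataFrame({"Location": ["AAA", "BBB"],
                    "Latitude": [2.5, 3.0],
                    "Longitude": [32.5, 33.0]})

merged = pd.merge(flies, gps, how="left",
                  left_on="Location Code", right_on="Location",
                  indicator=True)  # adds a "_merge" column

# Rows marked "left_only" found no match in the GPS table.
unmatched = merged.loc[merged["_merge"] == "left_only", "Location Code"].unique()
print(unmatched)
```

Running the same check on `table_all` and `location_gps` would list any location codes that never received coordinates.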


In [41]:
# recode Spiroplasma as 0/1 vs True/False
table_all_gps["Infected"] = table_all_gps.Spiroplasma.map({True:1,False:0})
table_all_gps.head()

Unnamed: 0,Collection Year,Fly Number,Location Code,Sex,Spiroplasma,Latitude,Longitude,Infected
0,2014,4,CHU,F,True,2.606845,32.93758,1
1,2014,6,CHU,F,False,2.606845,32.93758,0
2,2014,7,CHU,M,False,2.606845,32.93758,0
3,2014,10,CHU,M,False,2.606845,32.93758,0
4,2014,31,CHU,M,False,2.606845,32.93758,0


## Run the logistic regression model

In [70]:
logit_gps = logit('Infected ~ Longitude + Latitude',
                data=table_all_gps,
               ).fit()
logit_gps.summary()

Optimization terminated successfully.
         Current function value: 0.547521
         Iterations 6


| | | | |
|---|---|---|---|
| Dep. Variable: | Infected | No. Observations: | 182 |
| Model: | Logit | Df Residuals: | 179 |
| Method: | MLE | Df Model: | 2 |
| Date: | Mon, 24 Aug 2015 | Pseudo R-squ.: | 0.113 |
| Time: | 17:02:08 | Log-Likelihood: | -99.649 |
| converged: | True | LL-Null: | -112.34 |
| | | LLR p-value: | 3.085e-06 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 34.9301 | 9.268 | 3.769 | 0.000 | 16.765 53.095 |
| Longitude | -1.1576 | 0.271 | -4.279 | 0.000 | -1.688 -0.627 |
| Latitude | 0.3796 | 0.570 | 0.666 | 0.505 | -0.737 1.496 |


-------------------

# Final Results

### Overview of what was done:

#### Sex Bias
1. Tables were cleaned using [OpenRefine](http://openrefine.org/) to standardize inconsistent spellings of the same value, such as
    - "positive"
    - "Positive"
    - "yes"
    - "Yes"

2. Tables were read into this notebook, and the table without Sex information was cross-referenced with the original database entries to recover any available sex information.
3. Tables were further cleaned to remove any columns not needed to unambiguously identify each fly or to represent the Sex/Spiroplasma data.
    - Flies that had missing data or needed to be re-run were removed.
4. Tables were combined into a single table, and the flies were counted for each combination of __Spiroplasma results__ (pos/neg) vs __Sex__ (M/F).
    - This table holds the contingency table for each location.
5. The contingency table for each location, as well as the summed contingency table for all data combined, was tested with [Fisher's exact test of independence](http://www.biostathandbook.com/fishers.html) to yield odds ratios and initial p-values for all subsets of data examined.
6. Multiple testing correction (Benjamini-Hochberg) was applied, and the adjusted p-values, along with the reject/do-not-reject decision for each null hypothesis, were appended to the table.

#### Geographical Location Bias

1. Started with the same data as used above, but brought in the representative latitude and longitude coordinates of each location to use as the independent variables.
2. Logistic regression was run with the model `Infected ~ Longitude + Latitude`:
    - _independent variables:_ `Longitude`, `Latitude`
    - _dependent variable:_ `Infected`

---------------------

# Sex Bias Results:

## Contingency Data:

In [14]:
table_all_pivot

| Location Code | Spiroplasma=False, Sex=F | Spiroplasma=False, Sex=M | Spiroplasma=True, Sex=F | Spiroplasma=True, Sex=M |
|---|---|---|---|---|
| AMI | 10 | 10 | 0 | 1 |
| BOL | 6 | 6 | 3 | 1 |
| CHU | 7 | 8 | 3 | 0 |
| DUK | 3 | 6 | 6 | 4 |
| GAN | 11 | 3 | 8 | 3 |
| KIL | 12 | 3 | 2 | 0 |
| NGO | 12 | 5 | 0 | 0 |
| OCU | 6 | 5 | 1 | 1 |
| ORB | 22 | 6 | 19 | 7 |
| TUM | 9 | 7 | 0 | 0 |


## Significance tests:

In [15]:
results.sort_index()

Unnamed: 0,Odds ratio,pvals,Reject the null?,adjusted pvals
AMI,inf,1.0,False,1
BOL,0.333333,0.584615,False,1
CHU,0.0,0.215686,False,1
DUK,0.333333,0.36985,False,1
GAN,1.375,1.0,False,1
KIL,0.0,1.0,False,1
NGO,,1.0,False,1
OCU,1.2,1.0,False,1
ORB,1.350877,0.754009,False,1
TUM,,1.0,False,1


# Sex Bias Conclusions:

There is <b><font color="red">no sex bias detected</font></b> in the probability of being infected with Spiroplasma based on these data, neither at the location level nor overall.

 ----

# Geographical Bias Results:

In [72]:
logit_gps.summary2()

| | | | |
|---|---|---|---|
| Model: | Logit | Pseudo R-squared: | 0.113 |
| Dependent Variable: | Infected | AIC: | 205.2978 |
| Date: | 2015-08-24 17:21 | BIC: | 214.9098 |
| No. Observations: | 182 | Log-Likelihood: | -99.649 |
| Df Model: | 2 | LL-Null: | -112.34 |
| Df Residuals: | 179 | LLR p-value: | 3.0845e-06 |
| Converged: | 1.0000 | Scale: | 1.0 |
| No. Iterations: | 6.0000 | | |

| | Coef. | Std.Err. | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 34.9301 | 9.2678 | 3.7690 | 0.0002 | 16.7655 | 53.0947 |
| Longitude | -1.1576 | 0.2705 | -4.2786 | 0.0000 | -1.6878 | -0.6273 |
| Latitude | 0.3796 | 0.5696 | 0.6664 | 0.5051 | -0.7368 | 1.4960 |


# Geographical Bias Conclusions:

There <b><font color="green">is geographical bias detected</font></b> for probability of being infected with Spiroplasma based on these data.

- The likelihood-ratio test for the overall model gives $p_{_{LLR}} = 3.0845\times10^{-6}$, well below the conventional 0.05 threshold.
- In particular, the coefficient for __Longitude__ (-1.1576, $p < 10^{-4}$) suggests that as __Longitude__ increases, the probability of infection DE-creases.
- The coefficient for __Latitude__ (0.3796, $p = 0.505$) points to a less pronounced IN-crease of infection probability as __Latitude__ increases.  _However, this effect was __not__ significant in the current data._
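
To put the coefficients on a more familiar scale, they can be exponentiated into odds ratios and plugged into the inverse-logit function. The sketch below copies the fitted values from the summary table above; `predicted_probability` is a hypothetical helper name, and the result is only as reliable as the model, which has a pseudo R-squared of 0.113.

```python
import numpy as np

# Coefficients copied from the regression summary above
intercept, b_long, b_lat = 34.9301, -1.1576, 0.3796

# Multiplicative change in the odds of infection per +1 degree
odds_ratio_long = np.exp(b_long)   # roughly 0.31: odds fall by about two thirds per degree east
odds_ratio_lat = np.exp(b_lat)     # above 1, but this coefficient was not significant

def predicted_probability(latitude, longitude):
    """Inverse-logit of the fitted linear predictor (hypothetical helper)."""
    linear = intercept + b_long * longitude + b_lat * latitude
    return 1.0 / (1.0 + np.exp(-linear))

# CHU coordinates as shown in `table_all_gps`
p_chu = predicted_probability(latitude=2.606845, longitude=32.93758)
print(odds_ratio_long, odds_ratio_lat, p_chu)
```

Because the sampled locations span only a narrow range of longitudes, a "per degree" odds ratio extrapolates beyond most pairwise differences between sites and is best read as a direction and rough magnitude.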

 ----