# Spiroplasma vs Fly Sampling Locations

These are the analyses and results for the Spiroplasma assays, testing whether infection is biased by fly sex or by sampling location.

__<font color="red">ATTENTION:</font> Click here to skip directly to the [Results](#Final-Results)__:
- [Methods Overview](#Overview-of-what-was-done:)
- [Sex Bias Results](#Sex-Bias-Results:)
    - [Conclusions](#Sex-Bias-Conclusions:)
- [Geographical Bias Results](#Geographical-Bias-Results:)
    - [Conclusions](#Geographical-Bias-Conclusions:)
    
 ----

In [1]:
%matplotlib inline

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # moved to io in Python 3

import requests

import seaborn as sns
import pandas as pd
# Use the fully qualified option names; the short forms can be ambiguous.
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

import patsy
import numpy as np
import scipy as sp
import statsmodels as smd
import statsmodels.api as sm
from statsmodels.formula.api import logit, glm

from scipy.stats import fisher_exact
from scipy.stats import f_oneway

from spartan.utils import spandas as spd

# Load spreadsheets

In [10]:
# spreadsheet data
table_all = pd.read_csv("/home/gus/MEGAsync/zim/main/Yale/Projects/Spiroplasma/related_files/2015_11_04__Spiroplasma_tested_samples_REFINED.csv")

## Run script to load main database info into this notebook

In [3]:
# %run /home/gus/Documents/YalePostDoc/project_stuff/g_f_fucipes_uganda/scripts/gff_pandas_database.py

In [4]:
# def recode_dfp(df):
#     df['Collection Year'] = pd.DatetimeIndex(df.Date).year
#     df = df.rename(columns={"Village": "Location Code", "Fly_Number": "Fly Number"})
#     return df[["Location Code","Collection Year","Fly Number","Sex"]]

In [5]:
# d = recode_dfp(dfp.copy())
# d.head()

## Recover sex data into `table_gsh` by crossref with `dfp`

In [6]:
# # recover sex data into table_gsh by joining with dfp
# table_gsh = pd.merge(left=table_gsh.copy(), right=recode_dfp(dfp.copy()), 
#                      how='left', 
#                      on=["Location Code","Collection Year","Fly Number"], 
#                      left_on=None, right_on=None, 
#                      left_index=False, right_index=False, 
#                      sort=False, suffixes=('_x', '_y'), copy=True)

In [7]:
# table_gsh.head()

In [8]:
# table_all = pd.concat([table_gsh.dropna(),
#                        table_xls[["Location Code","Collection Year","Fly Number","Sex","Spiroplasma"]].dropna()])
# table_all.head()

-----------------

# Group data by number of flies belonging to any combination of +/- vs M/F

In [11]:
table_all.head()

,Box,Location Code,Numbers on Vial,Fly Number,Month,Sex,DNA made,Wolbachia,Spiroplasma
0,RP4,DUK,AR14016 T20,16,41812,F,Yrp,False,False
1,RP4,DUK,AR14001 T20,1,41812,M,Yrp,False,False
2,RP4,DUK,AR14002 T20,2,41812,M,Yrp,False,False
3,RP4,DUK,AR14004 T20,4,41812,F,Yrp,False,False
4,RP4,DUK,AR14007 T20,7,41812,M,Yrp,False,False


In [14]:
table_all_pivot = table_all.pivot_table(values="Fly Number", 
                                        index=["Location Code"], 
                                        columns=["Spiroplasma","Sex"], 
                                        aggfunc=[len], 
                                        fill_value=0, margins=False, dropna=True)

table_all_pivot.columns = table_all_pivot.columns.droplevel() # removes useless 'len' top multilevel index
table_all_pivot

Spiroplasma,False,False,True,True
Sex,F,M,F,M
Location Code,,,,
AMI,12,10,0,1
BOL,9,6,2,1
CHU,9,8,1,0
DUK,5,7,5,3
GAN,16,4,9,3
KIL,12,3,2,0
NGO,12,5,0,0
OCU,7,6,1,1
ORB,26,8,19,7
TUM,9,7,0,0


# Hypothesis tests per location and multiple testing correction

In [15]:
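def do_tests(df):
    """Run Fisher's exact test for each location and for the pooled counts."""
    locations = df.index.values

    tests = {}

    for loc in locations:
        # Columns are ordered (neg F, neg M, pos F, pos M), so the 2x2 table
        # has Spiroplasma status as rows and Sex as columns.
        locdf = df.loc[loc, :]
        # Series.reshape was removed from pandas; reshape the underlying array.
        contingency_table = locdf.values.reshape((2, 2))

        odds_ratio, p_val = fisher_exact(contingency_table)

        tests[loc] = (odds_ratio, p_val)

    # Pool all locations into a single 2x2 table.
    contingency_table_all = df.sum().values.reshape((2, 2))
    odds_ratio_all, p_val_all = fisher_exact(contingency_table_all)

    tests['all'] = (odds_ratio_all, p_val_all)

    testsdf = pd.DataFrame(data=tests, index=["Odds ratio", "pvals"], columns=list(tests.keys()))
    return testsdf.T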

def add_fdr(df):
    """Append Benjamini-Hochberg results to `df` in place."""
    multitests = smd.stats.multitest.multipletests

    # do the FDR correction (Benjamini-Hochberg) at alpha = 0.05
    reject_or_not, corrected_pval = multitests(pvals=df.pvals, alpha=0.05, method='fdr_bh')[:2]

    # add results to dataframe
    df["Reject the null?"], df["adjusted pvals"] = reject_or_not, corrected_pval

In [16]:
results = do_tests(table_all_pivot)
add_fdr(results)
results.sort_index()  # DataFrame.sort() is deprecated; sort rows by location



,Odds ratio,pvals,Reject the null?,adjusted pvals
AMI,inf,0.478261,False,1
BOL,0.75,1.0,False,1
CHU,0.0,1.0,False,1
DUK,0.428571,0.649917,False,1
GAN,1.333333,1.0,False,1
KIL,0.0,1.0,False,1
NGO,,1.0,False,1
OCU,1.166667,1.0,False,1
ORB,1.197368,0.772985,False,1
TUM,,1.0,False,1
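
To make clear what each p-value in this table measures, here is a minimal dependency-free sketch of a two-sided Fisher's exact test for a 2x2 table. The notebook itself uses `scipy.stats.fisher_exact`; the function `fisher_exact_2x2` below is a hypothetical stand-in written only for illustration, using the usual convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_2x2(table):
    """Two-sided Fisher's exact test for [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the hypergeometric
    probabilities of every table with the same margins that is no more
    probable than the observed table.
    """
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k):
        # Probability that the top-left cell equals k with all margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so ties are not lost to floating-point noise.
    p_value = sum(prob(k) for k in range(lowest, highest + 1)
                  if prob(k) <= p_obs * (1 + 1e-7))

    if b * c == 0:
        odds_ratio = float("nan") if a * d == 0 else float("inf")
    else:
        odds_ratio = (a * d) / (b * c)
    return odds_ratio, min(p_value, 1.0)


# Rows are Spiroplasma negative/positive, columns are F/M (from the pivot table).
or_ami, p_ami = fisher_exact_2x2([[12, 10], [0, 1]])
or_duk, p_duk = fisher_exact_2x2([[5, 7], [5, 3]])
or_orb, p_orb = fisher_exact_2x2([[26, 8], [19, 7]])
print(or_ami, round(p_ami, 6))
print(round(or_duk, 6), round(p_duk, 6))
```

For AMI the only positive fly is male, so the p-value reduces to 11/23, which agrees with the 0.478261 in the table. Rows with an empty cell explain the `inf`, `0.0`, and blank odds ratios.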


# Is there a significant difference in Spiroplasma prevalence between populations?

- A one-way ANOVA will NOT be used because the outcome (infected or not) is categorical.
- Logistic regression will probably be used instead.
- Both the `scipy` and `statsmodels` versions may be run, out of curiosity and to double-check the results.
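
Before any modelling, the raw prevalence per location can be read directly off the pivot table. The sketch below only re-enters those counts by hand and computes the fraction of positive flies; it is a descriptive summary, not a significance test, and the variable names are illustrative.

```python
# Counts copied from the pivot table: (neg F, neg M, pos F, pos M) per location.
counts = {
    "AMI": (12, 10, 0, 1), "BOL": (9, 6, 2, 1), "CHU": (9, 8, 1, 0),
    "DUK": (5, 7, 5, 3), "GAN": (16, 4, 9, 3), "KIL": (12, 3, 2, 0),
    "NGO": (12, 5, 0, 0), "OCU": (7, 6, 1, 1), "ORB": (26, 8, 19, 7),
    "TUM": (9, 7, 0, 0),
}

prevalence = {}
for loc, (neg_f, neg_m, pos_f, pos_m) in counts.items():
    positives = pos_f + pos_m
    prevalence[loc] = positives / (neg_f + neg_m + positives)

# Print locations from highest to lowest observed prevalence.
for loc, frac in sorted(prevalence.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{loc}: {frac:.2f}")
```

Prevalence ranges from zero (NGO, TUM) to roughly 40% (ORB, DUK, GAN), which is the spread the regression below tries to explain.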

## To make the results easier to interpret, pull in the GPS coordinates of the locations

- This allows the comparison to be run against location name (categorical) as well as location coordinates (numerical).

In [19]:
location_gps = pd.read_csv('/home/gus/Dropbox/uganda_data/data_repos/field_data/locations/gps/villages/uganda_villages_gps.csv',
                           sep=','
                          )

In [20]:
location_gps.head()

,Location,Latitude,Longitude
0,ABO,2.466775,32.56499
1,ACA,2.27008,32.52053
2,AG,2.413985,32.59915
3,AIN,3.304225,31.11941
4,AKA,2.37258,32.67495


In [21]:
table_all_gps = pd.merge(left=table_all, 
                         right=location_gps, 
                         how='left', 
                         on=None, 
                         left_on="Location Code", right_on="Location", 
                         left_index=False, right_index=False, 
                         sort=False, suffixes=('_x', '_y'), copy=True).drop(labels=["Location"],axis=1)
table_all_gps.head()

,Box,Location Code,Numbers on Vial,Fly Number,Month,Sex,DNA made,Wolbachia,Spiroplasma,Latitude,Longitude
0,RP4,DUK,AR14016 T20,16,41812,F,Yrp,False,False,3.2668,31.134205
1,RP4,DUK,AR14001 T20,1,41812,M,Yrp,False,False,3.2668,31.134205
2,RP4,DUK,AR14002 T20,2,41812,M,Yrp,False,False,3.2668,31.134205
3,RP4,DUK,AR14004 T20,4,41812,F,Yrp,False,False,3.2668,31.134205
4,RP4,DUK,AR14007 T20,7,41812,M,Yrp,False,False,3.2668,31.134205


In [22]:
# recode Spiroplasma as 0/1 vs True/False
table_all_gps["Spiroplasma"] = table_all_gps.Spiroplasma.map({True:1,False:0})
table_all_gps.head()

,Box,Location Code,Numbers on Vial,Fly Number,Month,Sex,DNA made,Wolbachia,Spiroplasma,Latitude,Longitude
0,RP4,DUK,AR14016 T20,16,41812,F,Yrp,False,0,3.2668,31.134205
1,RP4,DUK,AR14001 T20,1,41812,M,Yrp,False,0,3.2668,31.134205
2,RP4,DUK,AR14002 T20,2,41812,M,Yrp,False,0,3.2668,31.134205
3,RP4,DUK,AR14004 T20,4,41812,F,Yrp,False,0,3.2668,31.134205
4,RP4,DUK,AR14007 T20,7,41812,M,Yrp,False,0,3.2668,31.134205


## Run the logistic regression model

In [24]:
logit_gps = logit('Spiroplasma ~ Longitude + Latitude',
                data=table_all_gps,
               ).fit()
logit_gps.summary()

Optimization terminated successfully.
         Current function value: 0.485212
         Iterations 6


Dep. Variable:,Spiroplasma,No. Observations:,252.0
Model:,Logit,Df Residuals:,249.0
Method:,MLE,Df Model:,2.0
Date:,"Wed, 04 Nov 2015",Pseudo R-squ.:,0.1006
Time:,15:13:44,Log-Likelihood:,-122.27
converged:,True,LL-Null:,-135.95
,,LLR p-value:,1.154e-06

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,33.2038,8.907,3.728,0.000,15.746 50.661
Longitude,-1.0600,0.258,-4.107,0.000,-1.566 -0.554
Latitude,-0.1084,0.312,-0.347,0.729,-0.721 0.504
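
The header statistics of this summary can be cross-checked from the two log-likelihoods alone. The sketch below assumes the printed (rounded) values and uses the fact that a chi-square survival function with exactly 2 degrees of freedom has the closed form exp(-x/2); for any other `Df Model` a proper chi-square routine would be needed.

```python
from math import exp

# Values as printed in the summary (rounded to two decimals by statsmodels).
log_lik = -122.27       # Log-Likelihood of the fitted model
log_lik_null = -135.95  # LL-Null, the intercept-only model
df_model = 2            # Longitude and Latitude

# McFadden's pseudo R-squared.
pseudo_r2 = 1 - log_lik / log_lik_null

# Likelihood-ratio statistic comparing the fitted and null models.
llr = 2 * (log_lik - log_lik_null)

# Chi-square survival function; this closed form holds only for df == 2.
llr_p = exp(-llr / 2)

print(f"pseudo R-squared = {pseudo_r2:.4f}")
print(f"LLR statistic    = {llr:.2f}")
print(f"LLR p-value      = {llr_p:.3e}")
```

The pseudo R-squared reproduces the printed 0.1006. The p-value lands close to, but not exactly on, the printed 1.154e-06 because the log-likelihoods used here are rounded.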


In [25]:
logit_gps.get_margeff().summary()

Dep. Variable:,Spiroplasma
Method:,dydx
At:,overall

,dy/dx,std err,z,P>|z|,[95.0% Conf. Int.]
Longitude,-0.167,0.036,-4.645,0.0,-0.238 -0.097
Latitude,-0.0171,0.049,-0.348,0.728,-0.113 0.079
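
The marginal effects relate to the coefficients in a simple way: for a logit model, the average marginal effect of a predictor equals its coefficient times the sample mean of p(1 - p). The sketch below uses only the rounded values printed in the two summaries to check that both predictors imply the same scaling factor; it is a consistency check, not a re-estimation.

```python
# Values as printed in the two summaries above (rounded by statsmodels).
coef = {"Longitude": -1.0600, "Latitude": -0.1084}
marginal_effect = {"Longitude": -0.167, "Latitude": -0.0171}

# Average marginal effect = coef * mean(p * (1 - p)), so the ratio recovers
# the shared scaling factor mean(p * (1 - p)).
scale = {name: marginal_effect[name] / coef[name] for name in coef}

for name, value in scale.items():
    print(f"{name}: implied mean p(1-p) = {value:.3f}")
```

Both ratios come out near 0.158, below the theoretical maximum of 0.25, so the marginal effects are the coefficients rescaled to the probability scale.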


-------------------

# Final Results

### Overview of what was done:

#### Sex Bias
1. Tables were cleaned using [OpenRefine](http://openrefine.org/) to standardize things like
    - "positive"
    - "Positive"
    - "yes"
    - "Yes"

2. Tables were read into this notebook and the table without Sex information was cross-referenced with the original database entries to recover any sex information available.
3. Tables were further cleaned to remove any columns that were not needed to unambiguously identify each fly or represent the Sex/Spiroplasma data.
    - Flies that had missing data or needed to be re-run were removed.
4. Tables were combined to a single table and the data were grouped by the number of flies belonging to any combination of __Spiroplasma results__ (pos/neg) vs __Sex__ (M/F).
    - This table represents contingency table information for each location.
5. The contingency table for each location, as well as the summed contingency table for all data combined, was analyzed with [Fisher's exact test of independence](http://www.biostathandbook.com/fishers.html) to yield odds ratios and initial p-values for every subset of data examined.
6. Multiple testing correction (Benjamini-Hochberg) was applied, and the adjusted p-values, along with the recommendation on whether to reject each null hypothesis, were appended to the table.
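
Step 6 can be sketched in plain Python. The notebook relies on `multipletests(..., method='fdr_bh')` from statsmodels; the function `benjamini_hochberg` below is a hypothetical, simplified re-implementation of the same adjustment, applied to the raw per-location p-values reported in the results table.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Raw Fisher p-values per location, as printed in the results table.
raw = {"AMI": 0.478261, "BOL": 1.0, "CHU": 1.0, "DUK": 0.649917, "GAN": 1.0,
       "KIL": 1.0, "NGO": 1.0, "OCU": 1.0, "ORB": 0.772985, "TUM": 1.0}

adjusted = dict(zip(raw, benjamini_hochberg(list(raw.values()))))
reject = {loc: p <= 0.05 for loc, p in adjusted.items()}
print(adjusted)
```

Because seven of the ten raw p-values are already 1.0, every adjusted p-value is pulled up to 1, matching the `adjusted pvals` column, and no null hypothesis is rejected at alpha = 0.05.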

#### Geographical Location Bias

1. Started with the same data as used above but brought in the representative latitude and longitude coordinates of each location to use as the independent variables.
2. Logistic regression was run with the model `Spiroplasma ~ Longitude + Latitude`:
    - _independent variables:_ `Longitude`, `Latitude`
    - _dependent variable:_ `Spiroplasma` (infection status, coded 0/1)
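
As an illustration of how such a model turns coordinates into an infection probability, the sketch below plugs the coefficients printed in the model summary into the logistic function. The coefficients are rounded as printed, and the helper `predicted_probability` is a hypothetical name used only here; the DUK coordinates are those shown in the merged table.

```python
from math import exp

# Coefficients as printed in the logit summary (rounded).
coef = {"Intercept": 33.2038, "Longitude": -1.0600, "Latitude": -0.1084}


def predicted_probability(longitude, latitude):
    """Logistic function applied to the fitted linear predictor."""
    linear = (coef["Intercept"]
              + coef["Longitude"] * longitude
              + coef["Latitude"] * latitude)
    return 1.0 / (1.0 + exp(-linear))


# Coordinates of DUK as shown in the merged table.
p_duk = predicted_probability(longitude=31.134205, latitude=3.2668)

# Same latitude, one degree further east: the predicted probability drops.
p_east = predicted_probability(longitude=32.134205, latitude=3.2668)

print(f"DUK: {p_duk:.3f}, one degree east: {p_east:.3f}")
```

Because the intercept and the Longitude term are large and nearly cancel, predictions from rounded coefficients are approximate; the fitted model object should be used for anything beyond illustration.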

---------------------

# Sex Bias Results:

## Contingency Data:

In [26]:
table_all_pivot

Spiroplasma,False,False,True,True
Sex,F,M,F,M
Location Code,,,,
AMI,12,10,0,1
BOL,9,6,2,1
CHU,9,8,1,0
DUK,5,7,5,3
GAN,16,4,9,3
KIL,12,3,2,0
NGO,12,5,0,0
OCU,7,6,1,1
ORB,26,8,19,7
TUM,9,7,0,0


## Significance tests:

In [27]:
results.sort_index()  # DataFrame.sort() is deprecated; sort rows by location



,Odds ratio,pvals,Reject the null?,adjusted pvals
AMI,inf,0.478261,False,1
BOL,0.75,1.0,False,1
CHU,0.0,1.0,False,1
DUK,0.428571,0.649917,False,1
GAN,1.333333,1.0,False,1
KIL,0.0,1.0,False,1
NGO,,1.0,False,1
OCU,1.166667,1.0,False,1
ORB,1.197368,0.772985,False,1
TUM,,1.0,False,1


# Sex Bias Conclusions:

There is <b><font color="red">no sex bias detected</font></b> in the probability of being infected with Spiroplasma based on these data, either at the location level or overall.

 ----

# Geographical Bias Results:

In [28]:
logit_gps.summary()

Dep. Variable:,Spiroplasma,No. Observations:,252.0
Model:,Logit,Df Residuals:,249.0
Method:,MLE,Df Model:,2.0
Date:,"Wed, 04 Nov 2015",Pseudo R-squ.:,0.1006
Time:,15:16:17,Log-Likelihood:,-122.27
converged:,True,LL-Null:,-135.95
,,LLR p-value:,1.154e-06

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,33.2038,8.907,3.728,0.000,15.746 50.661
Longitude,-1.0600,0.258,-4.107,0.000,-1.566 -0.554
Latitude,-0.1084,0.312,-0.347,0.729,-0.721 0.504


In [29]:
logit_gps.get_margeff().summary()

Dep. Variable:,Spiroplasma
Method:,dydx
At:,overall

,dy/dx,std err,z,P>|z|,[95.0% Conf. Int.]
Longitude,-0.167,0.036,-4.645,0.0,-0.238 -0.097
Latitude,-0.0171,0.049,-0.348,0.728,-0.113 0.079


# Geographical Bias Conclusions:

There <b><font color="green">is geographical bias detected</font></b> for probability of being infected with Spiroplasma based on these data.

- The overall significance of the model (likelihood-ratio test against the intercept-only model) is $p_{_{LLR}} = 1.154\times10^{-06}$, which is highly significant.
- In particular, the marginal effect for __Longitude__ (-0.1670, $p < 10^{-4}$) suggests that as __Longitude__ increases, the probability of infection __DE__-creases.
- The estimated marginal effect of __Latitude__ (-0.0171) is also negative, so an increase in __Latitude__ is predicted to give a much less pronounced __DE__-crease of infection probability.  _However, this prediction was __not__ significant in the current data._
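
The Longitude coefficient reported above is on the log-odds scale, and exponentiating it gives an odds ratio that can be easier to communicate. The sketch below uses the printed coefficient and its 95% confidence bounds; the values are rounded as in the summary, so the result is approximate.

```python
from math import exp

# Longitude coefficient and 95% confidence bounds from the logit summary.
coef_longitude = -1.0600
ci_low, ci_high = -1.566, -0.554

# Exponentiating a log-odds coefficient gives the odds ratio per unit increase.
odds_ratio = exp(coef_longitude)
odds_ratio_ci = (exp(ci_low), exp(ci_high))

# Percent change in the odds of infection per additional degree of longitude.
pct_change = (odds_ratio - 1) * 100

print(f"odds ratio per degree east: {odds_ratio:.3f}")
print(f"95% CI: {odds_ratio_ci[0]:.3f} to {odds_ratio_ci[1]:.3f}")
print(f"change in odds: {pct_change:.0f}%")
```

The odds ratio is roughly 0.35, meaning the odds of infection fall by about 65% for each additional degree of longitude, and the whole confidence interval lies below 1.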

## Comparison with results from last time:

- The marginal effect of Longitude decreased slightly in magnitude from last time:
    - `[then]` -0.2139 ($p < 10^{-4}$)
    - `[now]` -0.1670 ($p < 10^{-4}$)

- Likewise, the non-statistically-significant marginal effect of Latitude decreased in magnitude (AND changed direction) from last time:
    - `[then]` 0.0701 ($p = 0.505$)
    - `[now]` -0.0171 ($p = 0.728$)

- The switch in direction of the Latitude effect is likely because it is noise, as the p-values in both analyses suggest.

----