## Loading explanations and comparing groups

One hundred randomly selected observations from the validation data were given "explanations" using the LIME algorithm. These explanations are contained in a dictionary which I load here, along with two files containing the indices and the predicted values. I also load another file that contains the names and descriptions of each variable.

In [None]:
import pickle
import pandas as pd
import string
import numpy as np
from statsmodels.stats.proportion import proportions_ztest
import matplotlib.pyplot as plt
from collections import defaultdict, Counter

%matplotlib inline
plt.rcParams["figure.figsize"] = (9,6)
#%config InlineBackend.figure_format = 'retina' # Uncomment if using a retina display
plt.rc('pdf', fonttype=42)
plt.rcParams['ps.useafm'] = True
plt.rcParams['pdf.use14corefonts'] = True
#plt.rcParams['text.usetex'] = True # Uncomment if LaTeX installed to render plots in LaTeX
#plt.rcParams['font.serif'] = 'Times'
plt.rcParams['font.family'] = 'serif'
plt.rcParams.update({'figure.autolayout': False})
plt.rcParams["figure.figsize"] = (12,9)

In [None]:
with open('../../output/lime_explanations_dict.p', 'rb') as f:
    exp = pickle.load(f)

In [None]:
exp

Now that the data have been loaded I first do some basic analysis to see the range of variables that are present in the explanations.

First I convert the explanations to a dictionary, which is an easier format to process than that returned by LIME. I then convert them to a pandas dataframe.

In [None]:
explanations = {}
for k,v in exp.items():
    user_exp = {}
    for x in v:
        user_exp[x[0]] = x[1]
    explanations[k] = user_exp

We can inspect a given element of the dictionary to see the explanation for a particular observation. For example, the sub-dictionary for observation 15 contains the explanation for that observation. The keys in this dictionary are a combination of variables and values. For example, the first key `f3d3a_5_1.0 <= 0.00` denotes the variable `f3d3a_5`, corresponding to the question posed to the father of the child in year 3 of the survey: "Who could you trust: child's sibling?". The second part of the key denotes that the response category `1.0` was less than or equal to `0`. Looking this up in the [survey documentation](https://fragilefamilies.princeton.edu/sites/fragilefamilies/files/ff_dad_cb3.txt) shows that `1.0` corresponds to an answer of `Yes` to the question. While this syntax is somewhat confusing, it indicates that this particular dummy variable had a value of 0 for this respondent, so the child's father did not answer yes to this particular question. The value of this element of the dictionary is a local coefficient generated by LIME that indicates the weight that this variable contributed to the local prediction. In this case the coefficient was positive.
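
To make the key format concrete, here is a minimal sketch that splits a key of this form into its parts. The weight is a made-up number and `split_rule` is a hypothetical helper, not part of LIME. It only handles the three-token `<dummy> <op> <threshold>` form described above.

```python
# Hypothetical explanation for one observation; the weight is illustrative only.
explanation = {"f3d3a_5_1.0 <= 0.00": 0.012}


def split_rule(rule):
    """Split '<variable>_<category> <op> <threshold>' into its four parts."""
    dummy, op, threshold = rule.split()
    # The response category is whatever follows the last underscore.
    variable, category = dummy.rsplit("_", 1)
    return variable, category, op, float(threshold)


for rule, weight in explanation.items():
    variable, category, op, threshold = split_rule(rule)
    sign = "positive" if weight > 0 else "non-positive"
    print(f"{variable}: dummy for category {category} {op} {threshold}, "
          f"{sign} local weight ({weight})")
```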



I now convert this dictionary into a pandas dataframe.

In [None]:
df = pd.DataFrame.from_dict(explanations, orient='index')

In [None]:
df.shape

In [None]:
df.head()

Now, to get a sense of the important variables I can simplify the columns by extracting the variable names and creating a new dataframe. Apologies for the rather ugly code.

In [None]:
def extract_variable_name(s):
    """
    This function parses the column names in the explanations to extract the variable name
    from the FF survey.
    
    I have left in the comments to illustrate how the algorithm is working."""
    components = s.split()
    print(s)
    try: 
        float(components[0]) # if the first component can be cast to a float, the var name is the third token
        print('First component is a float')
        var = components[2]
        print('Name is in ', var)
    except ValueError:
        var = components[0]
        print('Name is in ', var)
        
    if '_' in var:
        subcomponents = var.split('_')
        if var.count('_') == 1:
            # if substring after the _ can't be cast to float then it is part of the name
            try:
                float(subcomponents[1])
                varname = subcomponents[0]
            except ValueError:
                varname = var
        elif var.count('_') > 1:
            print("More than one underscore in ", var)
            varname = subcomponents[0]+'_'+subcomponents[1]
            print("Variable name is ", varname)
            
    else:
        varname = var
    print(varname)
    return varname 

explanations_2 = {}
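for k, user_items in explanations.items():
    user_exp = {}
    # Use distinct names so the inner loop does not shadow the outer loop variable
    for x, weight in user_items.items():
        var = extract_variable_name(x)
        user_exp[var] = weight
    explanations_2[k] = user_exp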
    
df_names = pd.DataFrame.from_dict(explanations_2, orient='index')
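
As a cross-check on the parsing rules, here is a quieter sketch of the same logic without the print statements. The function name `variable_name_from_key` and the example keys built on `varA` and `varB` are hypothetical. Only `f3d3a_5_1.0 <= 0.00` comes from the explanations discussed earlier.

```python
def _is_number(token):
    """Return True if the token can be cast to a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def variable_name_from_key(key):
    tokens = key.split()
    # Range rules look like '0.00 < var <= 1.00', so the name is the third token.
    var = tokens[2] if _is_number(tokens[0]) else tokens[0]
    parts = var.split('_')
    if len(parts) == 2:
        # 'name_1.0' -> 'name'; a non-numeric suffix is part of the name.
        return parts[0] if _is_number(parts[1]) else var
    if len(parts) > 2:
        # Keep the first two pieces, e.g. 'f3d3a_5_1.0' -> 'f3d3a_5'.
        return parts[0] + '_' + parts[1]
    return var


for key in ['f3d3a_5_1.0 <= 0.00', '0.00 < varA <= 1.00', 'varB_2.0 > 0.00']:
    print(key, '->', variable_name_from_key(key))
```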

In [None]:
df_names.shape

In [None]:
names = list(df_names.columns)

In [None]:
df_names.head()

Also defining a new dataframe and a function to count the number of observations each variable occurs in:

In [None]:
df_names_counts = df_names.notnull()*1

def count_occurrences(var):
    return df_names_counts[var].sum()
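
The same per-variable counts can be sanity-checked without pandas. This is a minimal sketch using `collections.Counter` on a small hypothetical stand-in for `explanations_2`. The observation ids, variable names and weights are made up.

```python
from collections import Counter

# Hypothetical stand-in for explanations_2: observation id -> {variable: weight}
toy_explanations = {
    15: {'varA': 0.01, 'varB': -0.02},
    42: {'varA': 0.03},
    77: {'varC': 0.005, 'varA': -0.01},
}

# Each variable appears at most once per observation, so counting keys
# gives the number of observations in which the variable occurs.
occurrences = Counter(var for user_exp in toy_explanations.values() for var in user_exp)
print(occurrences.most_common())
```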

In [None]:
df_names_counts.head()

# Getting variable metadata

To get metadata for these variables there are a number of different steps. During the challenge, participant Connor Gilroy created a meta-data csv file that contains some information on each variable; since the challenge the Fragile Families team have built an API to programmatically get metadata. I mostly rely on the API below but use Gilroy's csv when metadata is not available.


Loading Gilroy's file from Github:

In [None]:
url = 'https://raw.githubusercontent.com/fragilefamilieschallenge/variables-metadata/master/ffc_variable_types.csv'
meta = pd.read_csv(url)
meta.index = meta['variable']
del meta['variable']

In [None]:
meta.head()

Copying over [code](https://github.com/fragilefamilieschallenge/ffmetadata-py/blob/master/ff.py) from the challenge GitHub repository.

In [None]:
import json
import urllib.parse  # `search` uses urllib.parse.quote, so import the submodule explicitly
import requests

BASE_URL = 'http://api.metadata.fragilefamilies.princeton.edu'


def select(var_name, attr_name=None):
    """
    Return attribute(s) of a variable given the variable name and an optional field name, or list of attribute name(s)
    :param var_name: Name of the variable we're interested in.
    :param attr_name: A string representing the name of the attribute whose value we want to fetch. This can also be
        a list of strings in case of multiple attributes. If None, all attributes of the variable are returned.
    :return: A dictionary of attribute => value mappings if multiple attributes were requested (i.e. attr_name is a
        list), or a string value if a single attribute name was requested (i.e. attr_name is a string)
    """
    single = isinstance(attr_name, str)
    if attr_name is not None:
        if single:
            params = {attr_name: attr_name}
        else:
            params = dict([(f, f) for f in attr_name])
    else:
        params = None

    endpoint = 'variable/%s' % var_name
    data = _get(endpoint, params)

    return data[attr_name] if single else data


def search(filters=None):
    """
    Search for variable names given a list or dictionary of 'filter(s)'.
    A 'filter' is defined as a dictionary with keys 'name','op','val' representing the attribute name, a comparison
    operator, and the value for comparison.
    If multiple filters are specified as a list, they're combined using the AND operator.
    Filters can also be specified as a dictionary, keyed with 'and' or 'or', and the values being a list of individual
    'filters'.
    See examples of usage in this module. Note that this function doesn't do any advanced processing whatsoever, but
    passes on the filters 'as-is' to the server.
    :param filters: A list of filters, or a dictionary with key 'and' or 'or', and the values as a list of filters.
    :return: A list of variable names corresponding to the search criteria.
    """
    filters = filters or []
    query_string_dict = {'filters': filters}
    query_string = urllib.parse.quote(json.dumps(query_string_dict))
    return _get('variable?q=%s' % query_string)


def _get(endpoint, params=None):
    """Return a dictionary of attribute => value mapping for JSON results
    obtained at a specified endpoint, with optional query parameters.
    Raises SystemError on 5xx responses or RuntimeError on 4xx responses
    """
    url = '%s/%s' % (BASE_URL, endpoint)
    url = requests.Request('GET', url, params=params).prepare().url
    response = requests.get(url)

    if 500 <= response.status_code < 600:
        raise SystemError("Internal Error on Server")

    d = response.json()
    if 400 <= response.status_code < 500:
        raise RuntimeError(d['message'])
    return d
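
To see what `search` sends to the server, the sketch below builds the query string offline, without making a request. The filter itself is an assumption: the attribute name `wave`, the operator `eq` and the value `Year 3` are illustrative guesses at what the API accepts, based only on the `name`/`op`/`val` structure described in the docstring.

```python
import json
import urllib.parse

BASE_URL = 'http://api.metadata.fragilefamilies.princeton.edu'

# Assumed filter; the attribute, operator and value are illustrative guesses.
filters = [{'name': 'wave', 'op': 'eq', 'val': 'Year 3'}]

# Same encoding steps as in `search`: JSON-encode, then percent-encode.
query_string = urllib.parse.quote(json.dumps({'filters': filters}))
url = '%s/variable?q=%s' % (BASE_URL, query_string)
print(url)

# Decoding recovers the original filter structure.
decoded = json.loads(urllib.parse.unquote(query_string))
```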

The names of some of the variables used in the challenge have changed so they cannot be found in the API. This section creates a dictionary mapping the old names to the new names. Note that some variables from the challenge are not in the metadata so may still fail to be found.

In [None]:
# Getting the raw metadata file and creating a name conversion dictionary
url = "http://metadata.fragilefamilies.princeton.edu/get_metadata"
df = pd.read_csv(url, encoding="latin1")
# Map each old name to its new name (later rows overwrite earlier duplicates)
old_name_to_new_name = dict(zip(df['old_name'], df['new_name']))

It turns out that some of the variables have been renamed multiple times and cannot easily be found in the metadata, either in the new API or in Connor Gilroy's file. After discussion on Github it appears that almost all of these come from the in-house survey. The following function can be used to convert these to get names that can be used to get metadata from the new API.

In [None]:
def convertToNew(var):
    """Takes an old variable name from the in house survey and converts it to a new one."""
    chars = [x for x in string.ascii_lowercase]
    if not var.startswith('hv'):
        print("This variable does not start with hv")
        return
    else:
        var = var[2:] # Remove hv prefix
        if var[1] in chars[:14]: #if [a-n]
            return 'p'+var
        elif var[1] in chars[14:22]: #if [o-v]
            if var[2] in chars: # if next element is another character
                return 'ch'+var 
            else: # if not assign o prefix
                return 'o'+var
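
The prefix rules are easier to follow with a few inputs. The sketch below restates them in a hypothetical helper, `convert_hv_name`. The example names are invented to exercise each branch and are not taken from the survey.

```python
import string

FIRST_GROUP = string.ascii_lowercase[:14]     # 'a' to 'n'
SECOND_GROUP = string.ascii_lowercase[14:22]  # 'o' to 'v'


def convert_hv_name(var):
    """Sketch of the prefix rules; returns None when no rule applies."""
    if not var.startswith('hv'):
        return None
    stem = var[2:]  # drop the 'hv' prefix
    if stem[1] in FIRST_GROUP:
        return 'p' + stem
    if stem[1] in SECOND_GROUP:
        # A letter in the next position gives 'ch', anything else gives 'o'.
        prefix = 'ch' if stem[2] in string.ascii_lowercase else 'o'
        return prefix + stem
    return None


# Invented names, one per branch.
for name in ['hv3a2', 'hv4r10', 'hv4ra1', 'xy1']:
    print(name, '->', convert_hv_name(name))
```

Like the original function, this sketch assumes the stem is long enough to index. A very short name would raise an `IndexError`.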

Now I can finally iterate through the names and get as much metadata as possible. The code below first queries the API with the raw variable name. If this fails, it uses the dictionary to map the old name to the new name and queries the API again. If this still fails, it either uses the function above to derive the new name (if the variable prefix is 'hv') or falls back to Gilroy's metadata. If that also fails, it sets the metadata to None.

In [None]:
error = 0
meta_data = {}
count_by_new_name = {} # A dictionary mapping the new name to the number of observations var occurs in
for i in names:
    try: # Try to directly query metadata
        m = select(i)
        meta_data[i] = m
        count_by_new_name[i] = count_occurrences(i)
        print("Obtained metadata for ", i)
    except Exception: # If that doesn't work try using the new name
        try:
            n = old_name_to_new_name[i]
            m = select(n)
            meta_data[n] = m
            count_by_new_name[n] = count_occurrences(i)
            print("Obtained metadata for ", i, " using new name ", n)
        except Exception: # If this fails

            try:
                if i.startswith('hv'):
                    n = convertToNew(i)
                    m = select(n)
                    meta_data[n] = m
                    count_by_new_name[n] = count_occurrences(i)
                else:
                    print("Getting information from original metadata for ",i)
                    meta_data[i] = meta.loc[i]
                    count_by_new_name[i] = count_occurrences(i)
            except Exception:
                print("Unable to obtain metadata for ",i)
                meta_data[i] = None
                count_by_new_name[i] = count_occurrences(i)
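
The nested try/except blocks implement an ordered chain of fallbacks. A flatter way to express the same idea is sketched below with stand-in lookups instead of the real API, so it runs offline. `resolve_metadata` and the `fake_*` dictionaries are hypothetical and their contents are made up.

```python
def resolve_metadata(name, resolvers):
    """Try each (label, function) pair in order; return the first result that does not raise."""
    for label, resolver in resolvers:
        try:
            return label, resolver(name)
        except Exception:
            continue  # move on to the next fallback
    return None, None


# Stand-ins for the API, the rename dictionary and Gilroy's csv.
fake_api = {'new_var': {'label': 'A renamed variable'}}
fake_renames = {'old_var': 'new_var'}
fake_csv = {'csv_only_var': {'label': 'Only in the csv'}}

resolvers = [
    ('api', lambda n: fake_api[n]),
    ('api via new name', lambda n: fake_api[fake_renames[n]]),
    ('csv', lambda n: fake_csv[n]),
]

for name in ['new_var', 'old_var', 'csv_only_var', 'missing_var']:
    print(name, resolve_metadata(name, resolvers))
```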

In [None]:
[x for x in meta_data.keys() if meta_data[x] is None]

After running this procedure the metadata has been obtained for all but 8 of these variables. These are ignored in the analysis below.

Now to consider analysis to summarize the findings:

- Top K most frequently occurring values
- Histogram of relevant waves
- Histogram of respondents
- Histogram of topic / umbrella topic

# Top 25 most frequent

In [None]:
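for i, j in sorted(count_by_new_name.items(), key=lambda x: x[1], reverse=True)[:25]:
    m = meta_data[i]
    # Fall back to the variable name when no label is available (e.g. metadata is None)
    print(m['label'] if m is not None and 'label' in m else i, j)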

## Creating a table summarizing all of the variables

In [None]:
table_data = {}
for k, v in meta_data.items():
    try:
        topics = [t.strip() for t in v['topics'].split(';')]
        row = {'Count in LIME exp.': count_by_new_name[k], 'Wave': v['wave'],
               'Respondent': v['respondent'], 'topic 1': topics[0]}
        if len(topics) > 1:
            row['topic 2'] = topics[1]
        table_data[k] = row
    except Exception:
        # Variables whose metadata lack these fields (e.g. None) are skipped
        pass

In [None]:
results_and_metadata = pd.DataFrame.from_dict(table_data, orient='index')

In [None]:
results_and_metadata = results_and_metadata.fillna('N/A')

In [None]:
results_and_metadata.head(10)

In [None]:
results_and_metadata.to_csv('../../output/lime_results_and_metadata.csv')

# Waves

In [None]:
wave_count = defaultdict(int)
for _, r in results_and_metadata.iterrows():
    wave_count[r['Wave']] +=1

In [None]:
sorted(wave_count.items(), key=lambda x: x[1], reverse=True)

In [None]:
data = list(sorted(wave_count.items(), key=lambda x: x[0]))
word, freq = zip(*data)
indices = np.arange(len(data))
plt.bar(indices, [x/500 for x in freq], color='b')
plt.xticks(indices, word, rotation='vertical')
plt.title('LIME identified variables by survey wave',size=18)
plt.ylabel('Proportion of variables',size=16)
plt.xlabel('Wave', size=16)
plt.xticks(size=14, rotation=360)
plt.yticks(size=14)
plt.tight_layout()
plt.show()

# Respondents

In [None]:
respondent_count = defaultdict(int)
for _, r in results_and_metadata.iterrows():
    respondent_count[r['Respondent']] +=1

In [None]:
sorted(respondent_count.items(), key=lambda x: x[1], reverse=True)

In [None]:
data = list(sorted(respondent_count.items(), key=lambda x: x[1], reverse=True))
word, freq = zip(*data)
indices = np.arange(len(data))
plt.bar(indices, [x/500 for x in freq], color='b')
plt.xticks(indices, word, rotation='vertical', size=14)
plt.title('LIME identified variables by survey respondent',size=18)
plt.ylabel('Proportion of variables',size=16)
plt.xlabel('Respondent', size=16)
plt.yticks(size=14)
plt.tight_layout()
plt.show()

# Topic

In [None]:
topic_count = defaultdict(int)
umbrella_count = defaultdict(int)
for k,v in meta_data.items():
    try:
        for t in v['topics']:
            topic_count[t['topic']] += count_by_new_name[k]
            if t['umbrella'] == 'Parenting':
                print(v['label'])
            umbrella_count[t['umbrella']] += count_by_new_name[k]
    except:
        pass

In [None]:
topic_count = defaultdict(int)
for _, r in results_and_metadata.iterrows():
    try:
        topic_count[r['topic 2']] +=1
        topic_count[r['topic 1']] +=1
    except:
        topic_count[r['topic 1']] +=1

In [None]:
sorted(topic_count.items(), key=lambda x: x[1], reverse=True)

In [None]:
topic_count.pop('N/A', None) # Remove missing observations as they simply indicate no second topic

In [None]:
data = list(sorted(topic_count.items(), key=lambda x: x[1], reverse=True))[:25]
word, freq = zip(*data)
indices = np.arange(len(data))
plt.rcParams["figure.figsize"] = (12,9)
plt.bar(indices, [x/500 for x in freq], color='b')
plt.xticks(indices, word, rotation='vertical',size=14)
plt.title('LIME identified variables by topic',size=18)
plt.ylabel('Proportion of variables',size=16)
plt.xlabel('Topic', size=16)
plt.yticks(size=14)
plt.savefig('../../figures/topic_lime_proportions.pdf')
plt.show()

## Differences in proportions and ratios

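Finding the proportion of questions in each wave across the entire survey. Note that `df` here is the full metadata file loaded earlier, not the dataframe of explanations.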

In [None]:
df = df[df['wave'] != 'Year 15'] # We do not want to include Year 15 waves as they were not used in the Challenge

In [None]:
df = df[df['new_name'] != 'idnum']

In [None]:
wave_count_full = dict(Counter(list(df['wave'])))

In [None]:
wave_count_full

In [None]:
data = list(sorted(wave_count_full.items(), key=lambda x: x[0]))
word, freq = zip(*data)
indices = np.arange(len(data))
plt.bar(indices, [x/df.shape[0] for x in freq], color='blue')
plt.xticks(indices, word, rotation='vertical',size=14)
plt.title('Proportion of variables by wave in entire survey',size=18)
plt.ylabel('Proportion of variables',size=16)
plt.xlabel('Wave', size=16)
plt.yticks(size=14)
plt.show()

Now to take the difference between the proportion in my results and the proportion in the survey overall.

In [None]:
wave_counts_mod = {}
for k,v in wave_count.items():
    prop_observed = v/500
    prop_in_survey = wave_count_full[k]/df.shape[0]
    wave_counts_mod[k] = prop_observed-prop_in_survey # Diff in proportion
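
The difference and the ratio of proportions answer slightly different questions, which a toy example makes clearer. All counts and wave labels below are hypothetical.

```python
# Hypothetical counts: variables per wave among the LIME results and in the full survey
lime_counts = {'Wave A': 60, 'Wave B': 40}
survey_counts = {'Wave A': 1000, 'Wave B': 1000}
n_lime = sum(lime_counts.values())
n_survey = sum(survey_counts.values())

differences, ratios = {}, {}
for wave in lime_counts:
    p_lime = lime_counts[wave] / n_lime
    p_survey = survey_counts[wave] / n_survey
    differences[wave] = p_lime - p_survey  # > 0 means over-represented, in absolute terms
    ratios[wave] = p_lime / p_survey       # > 1 means over-represented, in relative terms

print(differences)
print(ratios)
```

A small category can show a negligible difference but a large ratio, which is one reason to look at both.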

In [None]:
wave_counts_mod

In [None]:
wsignif = {}
for k,v in wave_count.items():
    count = np.array([v, wave_count_full[k]])
    nobs = np.array([500, df.shape[0]])
    stat, pval = proportions_ztest(count, nobs)
    if pval > 0.05:
        wsignif[k] = ''
    elif pval <= 0.05 and pval > 0.01:
        wsignif[k] = '*'
    elif pval <= 0.01 and pval > 0.001:
        wsignif[k] = '**'
    elif pval <= 0.001:
        wsignif[k] = '***'
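
For reference, the test applied here can be written out by hand. The sketch below is a pooled two-sided z-test for two proportions using only the standard library. To my understanding this matches the default behaviour of `proportions_ztest` for two samples, but treat that as an assumption rather than a guarantee. The counts are hypothetical, and `significance_stars` is a hypothetical helper that mirrors the thresholds used in the cell above.

```python
import math


def two_proportion_ztest(count1, n1, count2, n2):
    """Pooled two-sided z-test for the difference between two proportions."""
    p1, p2 = count1 / n1, count2 / n2
    pooled = (count1 + count2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


def significance_stars(p_value):
    """Map a p-value to the star labels used in the plots."""
    if p_value <= 0.001:
        return '***'
    if p_value <= 0.01:
        return '**'
    if p_value <= 0.05:
        return '*'
    return ''


# Hypothetical counts: 60 of 500 LIME variables vs 1000 of 10000 survey variables
z, p = two_proportion_ztest(60, 500, 1000, 10000)
print(round(z, 3), round(p, 3), repr(significance_stars(p)))
```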

In [None]:
data = list(sorted(wave_counts_mod.items(), key=lambda x: x[0]))
word, freq = zip(*data)
indices = np.arange(len(data))
plt.bar(indices, freq, color='blue')
plt.xticks(indices, [wsignif[x] + ' ' + x for x in word], rotation='vertical',size=14)
plt.title('Proportion of LIME identified variables by wave relative to survey',size=18)
plt.ylabel('Difference in proportions',size=16)
plt.xlabel('Wave', size=16)
plt.ylim(-0.15, 0.032)
plt.yticks(size=14)
plt.show()

In [None]:
wave_counts_mod = {}
for k,v in wave_count.items():
    prop_observed = v/500
    prop_in_survey = wave_count_full[k]/df.shape[0]
    wave_counts_mod[k] = prop_observed/prop_in_survey # Ratio

In [None]:
data = list(sorted(wave_counts_mod.items(), key=lambda x: x[0]))
word, freq = zip(*data)
indices = np.arange(len(data))
plt.bar(indices, freq, color='blue')
plt.xticks(indices, [wsignif[x] + ' ' + x for x in word], rotation='vertical',size=14)
plt.title('Ratio LIME identified variables to overall survey by wave',size=18)
plt.ylabel('Ratio',size=16)
plt.xlabel('Wave', size=16)
#plt.ylim(-0.10, 0.032)
plt.yticks(size=14)
plt.show()

Now to do the same for respondents:

In [None]:
resp_count_full = dict(Counter(list(df['respondent'])))

In [None]:
data = list(sorted(resp_count_full.items(), key=lambda x: x[1]))

In [None]:
no_resp = data[0][1] # Replace nan category with N/A
data = data[1:]

In [None]:
data.insert(0, ('N/A', no_resp))

In [None]:
data

In [None]:
resp_count_full = dict(data)

In [None]:
word, freq = zip(*data)
indices = np.arange(len(data))
plt.bar(indices, [x/df.shape[0] for x in freq], color='blue')
plt.xticks(indices, word, rotation='vertical',size=14)
plt.title('Proportion of variables by respondent',size=18)
plt.ylabel('Proportion of variables',size=16)
plt.xlabel('Respondent', size=16)
plt.yticks(size=14)
plt.show()

In [None]:
respondent_count

In [None]:
resp_counts_mod = {}
for k,v in respondent_count.items():
    prop_observed = v/500
    prop_in_survey = resp_count_full[k]/df.shape[0]
    resp_counts_mod[k] = prop_observed-prop_in_survey

In [None]:
resp_counts_mod

In [None]:
rsignif = {}
for k,v in respondent_count.items():
    count = np.array([v, resp_count_full[k]])
    nobs = np.array([500, df.shape[0]])
    stat, pval = proportions_ztest(count, nobs)
    if pval > 0.05:
        rsignif[k] = ''
    elif pval <= 0.05 and pval > 0.01:
        rsignif[k] = '*'
    elif pval <= 0.01 and pval > 0.001:
        rsignif[k] = '**'
    elif pval <= 0.001:
        rsignif[k] = '***'

In [None]:
data = list(sorted(resp_counts_mod.items(), key=lambda x: x[1]))
word, freq = zip(*data)
indices = np.arange(len(data))
plt.bar(indices, freq, color='blue')
plt.xticks(indices, [rsignif[x]+' '+ x for x in word], rotation='vertical',size=14)
plt.title('Proportion of LIME identified variables \n by respondent relative to entire survey',size=18)
plt.ylabel('Difference in proportions',size=16)
plt.xlabel('Respondent', size=16)
plt.yticks(size=14)
plt.ylim(-0.1,0.05)
plt.show()

In [None]:
resp_counts_mod = {}
for k,v in respondent_count.items():
    prop_observed = v/500
    prop_in_survey = resp_count_full[k]/df.shape[0]
    resp_counts_mod[k] = prop_observed/prop_in_survey

In [None]:
data = list(sorted(resp_counts_mod.items(), key=lambda x: x[1]))
word, freq = zip(*data)
indices = np.arange(len(data))
plt.bar(indices, freq, color='blue')
plt.xticks(indices, [rsignif[x]+' '+ x for x in word], rotation='vertical',size=14)
plt.title('Ratio of LIME identified variables \n by respondent relative to entire survey',size=18)
plt.ylabel('Ratio',size=16)
plt.xlabel('Respondent', size=16)
plt.yticks(size=14)
#plt.ylim(-0.1,0.15)
plt.show()

Now getting the same for topic:

In [None]:
topic_count_full = defaultdict(int)
topics = list(df['topics'])
for t in topics:
    x = t.split(';')
    for i in x:
        topic_count_full[i.strip()] +=1
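
The loop above assumes every entry in the `topics` column is a string. If the column contains missing values, `split` would fail on them, so a guarded variant may be useful. The sketch below uses a hypothetical list in place of the column, and the topic names are made up.

```python
from collections import Counter

# Hypothetical 'topics' column: semicolon-separated strings, with one missing value
topics_column = ['Health; Education', 'Health', float('nan'), 'Education ; Housing']

topic_totals = Counter()
for entry in topics_column:
    if not isinstance(entry, str):
        continue  # skip missing values, which pandas represents as float NaN
    # Strip whitespace so 'Education ' and 'Education' are counted together
    topic_totals.update(part.strip() for part in entry.split(';'))

print(topic_totals)
```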

In [None]:
topic_counts_mod = {}
for k,v in topic_count.items():
    prop_observed = v/500
    prop_in_survey = topic_count_full[k]/df.shape[0]
    topic_counts_mod[k] = prop_observed-prop_in_survey

In [None]:
tsignif = {}
for k,v in topic_count.items():
    count = np.array([v, topic_count_full[k]])
    nobs = np.array([500, df.shape[0]])
    stat, pval = proportions_ztest(count, nobs)
    if pval > 0.05:
        tsignif[k] = ''
    elif pval <= 0.05 and pval > 0.01:
        tsignif[k] = '*'
    elif pval <= 0.01 and pval > 0.001:
        tsignif[k] = '**'
    elif pval <= 0.001:
        tsignif[k] = '***'

In [None]:
data = list(sorted(topic_counts_mod.items(), key=lambda x: x[1], reverse=True))
word, freq = zip(*data)
indices = np.arange(len(data))
plt.bar(indices, freq, color='blue')
plt.xticks(indices, [tsignif[x] + ' ' + x for x in word], rotation='vertical',size=14)
plt.title('LIME identified variables by topic compared to survey',size=18)
plt.ylabel('Difference in proportions',size=16)
plt.xlabel('Topic', size=16)
plt.yticks(size=14)
plt.ylim(-0.13,0.13)
plt.savefig('../../figures/relative_topic_proportions.pdf')
plt.show()

In [None]:
topic_counts_mod_ratio = {}
for k,v in topic_count.items():
    prop_observed = v/500
    prop_in_survey = topic_count_full[k]/df.shape[0]
    topic_counts_mod_ratio[k] = prop_observed/prop_in_survey

In [None]:
data = list(sorted(topic_counts_mod_ratio.items(), key=lambda x: x[1],reverse=True))
word, freq = zip(*data)
indices = np.arange(len(data))
plt.bar(indices, freq, color='blue')
plt.xticks(indices, [tsignif[x]+' '+x for x in word], rotation='vertical',size=14)
plt.title('Ratio of umbrella topics in LIME identified variables \n relative to entire survey',size=18)
plt.ylabel('Ratio',size=16)
plt.xlabel('Umbrella topic', size=16)
plt.yticks(size=14)
plt.tight_layout()
plt.savefig('../../figures/relative_topic_ratio.pdf')
plt.show()