# Exploring the GitHub API

<br />

A demonstration of core Python features used to compute results on data from the GitHub API.

<br />

### Computed results:
1. Total stargazers count per language
2. Average Stargazers-count per Language
3. Language-wise ratio of forks to watchers and individual averages of each
4. 'Repo bigness' and its effect on the percentage contribution to the total watchers of its respective language

<br />

### Key highlights:
1. Approach
2. Documentation
3. Attention to forward compatibility
4. One simple solution to the **_Windowing_** problem.

In [1]:
import requests
from pprint import pprint
from itertools import groupby
from collections import defaultdict


URL = 'https://api.github.com/orgs/zendesk/repos'

r = requests.get(URL)

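if r.status_code == 200:
    print ('Response received')
else:
    print ('Failure')

# Content-Type may be absent, so default to an empty string before the substring test
if 'json' in r.headers.get('Content-Type', ''):
    data = r.json()
    print ('Data loaded successfully')
else:
    print('Response content is not in JSON format.')
    data = [] # an empty list keeps the later cells iterable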

Response received
Data loaded successfully


## Total stargazers count per language

<br />

The cell below narrows the data to a single language ('Ruby') and ranks its repositories by stargazer count, printing the top three. The solution makes use of:

- filter
- sorted

In [2]:

# top repositories by stargazers_count for one language
keys_of_interest = {'name', 'language', 'stargazers_count'}

# Ensure all keys exist - Discard rows that don't have required values
removeInvalids = filter(lambda dct: keys_of_interest.issubset(set(dct.keys())), data) # lazy evaluation

# project dict to tuple with required columns/keys for easy groupby operation
#---NOTE: Sorted keys_of_interest to ensure that the column order is fixed: (language, name, stargazers_count)
sorted_KOI = sorted(keys_of_interest) # pre-computed to avoid calls for all rows
projectedData = (tuple(dct[k] for k in sorted_KOI) for dct in removeInvalids) #generator


# Keep rows for the requested language and drop the language column, leaving (name, stargazers_count)
def filter_language(inpData, lang):
    return [(r[1], r[2]) for r in (filter(lambda tp: tp[0] == lang, inpData))] # tp[0] = language

# Projection and filtering stay lazy until filter_language builds its list; sorted then ranks by stargazers_count
res = sorted(filter_language(projectedData, 'Ruby'), key=lambda tp: tp[1], reverse=True)

# Clean printing
print ('Top 3 repos in \'RUBY\'')
print ('--------')
for r in res[:3]:
    print (r)


Top 3 repos in 'RUBY'
--------
('dropbox-api', 357)
('zendesk_api_client_rb', 340)
('arturo', 164)
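
The cell above ranks repositories within one language. A per-language total, as the section title suggests, could be sketched as follows. This is a minimal sketch on hypothetical sample rows (`sample_repos` and `total_stars_per_language` are illustrative names, not part of the notebook), and it assumes rows without a language should be skipped.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like the repo payload (not real API data)
sample_repos = [
    {'name': 'repo_a', 'language': 'Ruby', 'stargazers_count': 10},
    {'name': 'repo_b', 'language': 'Ruby', 'stargazers_count': 5},
    {'name': 'repo_c', 'language': 'Java', 'stargazers_count': 7},
    {'name': 'repo_d', 'language': None, 'stargazers_count': 3},  # no language set
]

def total_stars_per_language(repos):
    """Sum stargazers_count per language, skipping rows without a language."""
    totals = defaultdict(int)
    for repo in repos:
        language = repo.get('language')
        if language is None:
            continue
        totals[language] += repo.get('stargazers_count', 0)
    # Largest total first
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

print(total_stars_per_language(sample_repos))
# [('Ruby', 15), ('Java', 7)]
```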


## Average Stargazers-count per Language

A basic calculation of the average stargazer count per language, printed once for each language in the data. The formula is given below:

> average stargazers per language = (total stargazers for language) / (repo count of language)

<br />

### Key areas:

1. `keys_of_interest` is kept as a set so that `issubset` can be used to discard invalid rows (for demonstration only).
2. Data is projected from dict to tuple to remove unnecessary fields.
3. `column_wise_sum()` takes the columns to sum as an argument instead of hardcoding them. It also keeps the logic in one place for exception handling and supports any number of columns.
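
The transpose-and-sum idea behind `column_wise_sum()` can be seen in isolation in the sketch below. The rows are hypothetical sample tuples laid out as (language, name, stargazers_count, counter); they are an assumption for illustration, not API data.

```python
# Hypothetical rows: (language, name, stargazers_count, counter)
rows = [
    ('Ruby', 'repo_a', 10, 1),
    ('Ruby', 'repo_b', 5, 1),
]
column_ids = [2, 3]  # stargazers_count and the counter column

# Keep only the numeric columns of interest
subset = [tuple(row[i] for i in column_ids) for row in rows]

# zip(*subset) transposes rows into columns
columns = list(zip(*subset))

# Sum each column: (total stargazers, repo count)
totals = tuple(sum(col) for col in columns)
average = totals[0] / totals[1]

print(subset)   # [(10, 1), (5, 1)]
print(columns)  # [(10, 5), (1, 1)]
print(totals)   # (15, 2)
print(average)  # 7.5
```

Because the counter column is summed like any other column, the repo count comes out of the same pass as the stargazer total.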

In [3]:
# Same projection as the previous cell, with an extra counter column for COUNT

keys_of_interest = {'name', 'language', 'stargazers_count'}
# Ensure all keys exist - Discard rows that don't have required values
removeInvalids = filter(lambda dct: keys_of_interest.issubset(set(dct.keys())), data) # lazy evaluation

sorted_KOI = sorted(keys_of_interest) # pre-computed to avoid calls for all rows
# project dict to tuple with required columns/keys for easy groupby operation
#---NOTE: Sorted keys_of_interest to ensure that the column order is fixed: (language, name, stargazers_count)
#------ Added extra value of '1' to tuple for using as COUNTER
projectedData = (tuple(dct[k] for k in sorted_KOI) + (1,) for dct in removeInvalids)

# sort descending by language and stargazer-count
sortedData = sorted(projectedData, key=lambda k: (k[0], k[2]), reverse=True) # Groupby pre-requisite


def column_wise_sum (grpList, column_ids):
    """
    Input: iterable of tuples (e.g. a groupby group) and list of id's of columns to sum.
    Logic: Expand and zip the list of tuples to convert it into column-wise list. Transpose. e.g.,
        [(2, 6), (7, 9)] => [(2, 7), (6, 9)]
    Then use SUM on each column to compute total.
    """
    inpList = list(grpList) # groupby object conversion
    tuple_length = len(inpList[0]) # pre-compute to save cost
    # subset data, i.e., extract numeric columns to apply custom logic on.
    subset = [tuple(tp[i] for i in column_ids if i < tuple_length) for tp in inpList]
    return tuple(sum(z) for z in zip(*subset))
    
# Use custom function to compute sum of each individual column, here: total-stargazers and repo-count
# Result = Language-wise total stargazers and repository count
groupedTotal = [ (k,) + (column_wise_sum(grp, [2, 3])) \
                        for k, grp in groupby(sortedData, key=lambda t: t[0])]


# Computing average stargazers per language
for tp in groupedTotal:
    print ("Average stargazers for language \'{}\' are: {}".format(tp[0], round(tp[1] / tp[2], 2)))

Average stargazers for language 'Ruby' are: 47.69
Average stargazers for language 'JavaScript' are: 1.0
Average stargazers for language 'Java' are: 13.0
Average stargazers for language 'C#' are: 8.0
Average stargazers for language 'C' are: 4.0
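
The sort marked as a "Groupby pre-requisite" in the cell above matters because `itertools.groupby` only merges consecutive rows that share a key. The sketch below shows the difference on hypothetical sample rows (`rows` and `first` are illustrative names).

```python
from itertools import groupby

# Hypothetical (language, stargazers_count) rows, deliberately not sorted by language
rows = [('Ruby', 3), ('Java', 1), ('Ruby', 4)]

def first(row):
    return row[0]

# Without sorting, 'Ruby' is split into two separate groups
unsorted_groups = [(k, [v for _, v in grp]) for k, grp in groupby(rows, key=first)]

# Sorting by the same key first brings equal keys together
sorted_groups = [(k, [v for _, v in grp])
                 for k, grp in groupby(sorted(rows, key=first), key=first)]

print(unsorted_groups)  # [('Ruby', [3]), ('Java', [1]), ('Ruby', [4])]
print(sorted_groups)    # [('Java', [1]), ('Ruby', [3, 4])]
```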


## Language-wise ratio of forks to watchers and individual averages of each

The ratio of forks to watchers is referred to here as 'conversion'.

<br />

This is really to demonstrate the reusability of the approach I used in the earlier example. Only the data projection step and the final formula differ, since we are calculating on multiple columns; the rest stays as it is because it is forward compatible.

This is usually my preferred approach to coding: keep things atomic so that changes are centralized, and organize functions into three tiers, where Tier 1 (T1) functions are atomic, T2 functions use T1 functions, and T3 functions are mostly pipelines that group multiple operations together.

> The column_wise_sum() function comes in handy here, as we will be calculating the sum of three columns at once (forks, watchers, and the repo counter).
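
The tiering described above might look like the sketch below. All function names and the sample rows are hypothetical and chosen only for illustration; the notebook itself does not define them. Returning `None` for a zero denominator is also an assumption.

```python
# Tier 1: atomic helpers with no knowledge of the overall task
def project(row, keys):
    """Pick the given keys from a dict, in order, as a tuple."""
    return tuple(row[k] for k in keys)

def scaled_ratio(numerator, denominator, scale=1.0):
    """Rounded ratio, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return round(scale * numerator / denominator, 2)

# Tier 2: built from Tier 1 pieces
def totals_by_language(rows):
    """Accumulate (forks, watchers, repo count) per language."""
    totals = {}
    for language, forks, watchers in rows:
        f, w, n = totals.get(language, (0, 0, 0))
        totals[language] = (f + forks, w + watchers, n + 1)
    return totals

# Tier 3: a pipeline that strings the steps together
def conversion_report(repos):
    rows = (project(r, ('language', 'forks', 'watchers')) for r in repos)
    return {
        language: {
            'conversion_pct': scaled_ratio(f, w, scale=100),
            'avg_forks': scaled_ratio(f, n),
            'avg_watchers': scaled_ratio(w, n),
        }
        for language, (f, w, n) in totals_by_language(rows).items()
    }

# Hypothetical sample rows (not real API data)
sample_repos = [
    {'name': 'a', 'language': 'Ruby', 'forks': 2, 'watchers': 10},
    {'name': 'b', 'language': 'Ruby', 'forks': 4, 'watchers': 14},
    {'name': 'c', 'language': 'Java', 'forks': 1, 'watchers': 0},
]
report = conversion_report(sample_repos)
print(report['Ruby'])
# {'conversion_pct': 25.0, 'avg_forks': 3.0, 'avg_watchers': 12.0}
```

A change to how ratios are rounded or guarded then touches only `scaled_ratio`, which is the kind of centralization the tiers are meant to provide.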


In [4]:
# Same approach as the previous cell; only the projected columns and the final formula change

keys_of_interest = {'name', 'language', 'forks', 'watchers'}
# Ensure all keys exist - Discard rows that don't have required values
removeInvalids = filter(lambda dct: keys_of_interest.issubset(set(dct.keys())), data) # lazy evaluation

sorted_KOI = ['name', 'language', 'forks', 'watchers'] # desired order of columns for our computation
# project dict to tuple with required columns/keys for easy groupby operation
#---NOTE: Column order is listed explicitly (not sorted) so the indices used below stay fixed
#------ Added extra value of '1' to tuple for using as COUNTER
projectedData = (tuple(dct[k] for k in sorted_KOI) + (1,) for dct in removeInvalids)

# sort descending by language (column 1) and watchers (column 3)
sortedData = sorted(projectedData, key=lambda k: (k[1], k[3]), reverse=True) # Groupby pre-requisite

def column_wise_sum (grpList, column_ids):
    """
    Input: iterable of tuples (e.g. a groupby group) and list of id's of columns to sum.
    Logic: Expand and zip the list of tuples to convert it into column-wise list. Transpose. e.g.,
        [(2, 6), (7, 9)] => [(2, 7), (6, 9)]
    Then use SUM on each column to compute total.
    """
    # list() accepts a plain list as well as a groupby group, so no type check is needed
    inpList = list(grpList)
    
    tuple_length = len(inpList[0]) # pre-compute to save cost
    # subset data, i.e., extract numeric columns to apply custom logic on.
    subset = [tuple(tp[i] for i in column_ids if i < tuple_length) for tp in inpList]
    return tuple(sum(z) for z in zip(*subset))
    
# Use custom function to compute sum of each individual column, here: total-forks, total-watchers and repo-count
# Result = Language-wise total forks, total watchers and repository count
groupedTotal = [ (k,) + (column_wise_sum(grp, [2, 3, 4])) \
                        for k, grp in groupby(sortedData, key=lambda t: t[1])] # groupby LANGUAGE


# Computing Language-wise ratio of forks to watchers and individual averages of each
# conversion = ratio of forks to watchers.
for tp in groupedTotal:
    print ("For \'{}\' language -- Repo count = {} ====> conversion: {}% ~~~ Average Forks: {} and Average Watchers: {}" \
        .format(tp[0], tp[3], round((tp[1] / tp[2])*100, 2), round(tp[1] / tp[3], 2), round(tp[2] / tp[3], 2)))

For 'Ruby' language -- Repo count = 26 ====> conversion: 26.77% ~~~ Average Forks: 12.77 and Average Watchers: 47.69
For 'JavaScript' language -- Repo count = 1 ====> conversion: 0.0% ~~~ Average Forks: 0.0 and Average Watchers: 1.0
For 'Java' language -- Repo count = 1 ====> conversion: 53.85% ~~~ Average Forks: 7.0 and Average Watchers: 13.0
For 'C#' language -- Repo count = 1 ====> conversion: 12.5% ~~~ Average Forks: 1.0 and Average Watchers: 8.0
For 'C' language -- Repo count = 1 ====> conversion: 50.0% ~~~ Average Forks: 2.0 and Average Watchers: 4.0


## Display 'bigness' in descending order of its percentage contribution to the total watchers of its respective language

<br />

An arbitrary column, 'bigness', is computed for demonstration: repos are categorized as 'small', 'medium', or 'large' on the basis of their 'size' attribute. The idea is to compute two aggregates with different grain over the data in an efficient way, i.e., one grouping over 'bigness' and the other calculating the total over 'language'.

The suggested method computes the aggregates in O(N) over a generator and stores them in dictionaries, which make the output easy to display.

<br />

Operation looks like:


```
e.g.,
Last 2 columns are computed

bigness      | watchers  | language  | language_watchers  | bigness_contribution
................................................................................
medium       | 200       | Python    | 500                | 40%
small        | 300       | Python    | 500                | 60%
medium       | 150       | Ruby      | 200                | 75%
large        | 50        | Ruby      | 200                | 25%


RESULT:
bigness    | watchers  | language  | language_watchers  | bigness_contribution
..............................................................................
medium     | 150       | Ruby      | 200                | 75%
small      | 300       | Python    | 500                | 60%
medium     | 200       | Python    | 500                | 40%
large      | 50        | Ruby      | 200                | 25%
```
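
The example table above can be reproduced with the sketch below. It is a variant that assumes a flat `(bigness, language)` key instead of nested dictionaries; the rows are the sample values from the table, and the variable names are illustrative. It also assumes every language total is non-zero, since a zero total would make the division fail.

```python
from collections import defaultdict

# Sample rows from the table above: (bigness, language, watchers)
rows = [
    ('medium', 'Python', 200),
    ('small', 'Python', 300),
    ('medium', 'Ruby', 150),
    ('large', 'Ruby', 50),
]

group_totals = defaultdict(int)   # (bigness, language) -> watchers
window_totals = defaultdict(int)  # language -> watchers (the "window")

# A single pass fills both aggregates
for bigness, language, watchers in rows:
    group_totals[(bigness, language)] += watchers
    window_totals[language] += watchers

# Join the two aggregates on language and compute the contribution
result = sorted(
    (
        (bigness, watchers, language, window_totals[language],
         round(100 * watchers / window_totals[language], 2))
        for (bigness, language), watchers in group_totals.items()
    ),
    key=lambda r: r[4],
    reverse=True,
)

for r in result:
    print(r)
# ('medium', 150, 'Ruby', 200, 75.0)
# ('small', 300, 'Python', 500, 60.0)
# ('medium', 200, 'Python', 500, 40.0)
# ('large', 50, 'Ruby', 200, 25.0)
```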

In [5]:
from collections import defaultdict # to avoid initializations of dictionary

# keys_of_interest = {'name', 'language', 'watchers'}
# removeInvalids = filter(lambda dct: keys_of_interest.issubset(set(dct.keys())), data)
# Copy each row so that adding the computed column does not mutate 'data'
dataNew = [dict(dct) for dct in data]
# Adding a new computed column
for dct in dataNew:
    if dct['size'] <= 100:
        dct['bigness'] = 'small'
    elif dct['size'] <= 500:
        dct['bigness'] = 'medium'
    else:
        dct['bigness'] = 'large'

sorted_KOI = ['bigness', 'language', 'watchers'] # desired order of columns for our computation
# project dict to tuple with required columns/keys for easy groupby operation
#------ Added extra value of '1' to tuple for using as COUNTER
projectedData = (tuple(dct[k] for k in sorted_KOI) + (1,) for dct in dataNew)

# Avoiding the need to manually initialize 'expected keys' in advance
aggResult = defaultdict(lambda: defaultdict(int))

# leverage power of generator and simplistic approach using dictionary for aggregates
for row in projectedData:
    groupByKey = row[0] # main aggregation over 'bigness' column
    windowKey = row[1] # windowing over language
    aggResult[groupByKey][windowKey] += row[2] # add watchers to respective language

# Expanding nested and repeated format. Calculate window aggregate at same time. 
P4resultSet = []
aggWindow = defaultdict(int) # hold totals for each language
for k,v in aggResult.items():
    for ik in v.keys():
        P4resultSet.append({'bigness' : k, 'watchers' : v[ik], 'language' : ik}) # modifying key
        aggWindow[ik] += v[ik]


"""
Apply final transformation to JOIN two aggregates over 'language' column and compute 
    contribution percentage
"""
for dct in P4resultSet:
    lang_total = aggWindow[dct['language']]
    bigness_contrib = round((dct['watchers'] / lang_total) * 100, 2)
    dct.update({'language_watchers' : lang_total, 'bigness_contribution' : bigness_contrib})
    
print ("\'bigness\' in descending order of the %age contribution to total watchers of its respective language")
print ('-----------')
finalResP4 = sorted(P4resultSet, key=lambda d: d['bigness_contribution'], reverse=True)
for r in finalResP4: # clean printing
    print (r)
    
    
    


'bigness' in descending order of the %age contribution to total watchers of its respective language
-----------
{'bigness': 'medium', 'watchers': 13, 'language': 'Java', 'language_watchers': 13, 'bigness_contribution': 100.0}
{'bigness': 'small', 'watchers': 8, 'language': 'C#', 'language_watchers': 8, 'bigness_contribution': 100.0}
{'bigness': 'large', 'watchers': 1, 'language': 'JavaScript', 'language_watchers': 1, 'bigness_contribution': 100.0}
{'bigness': 'large', 'watchers': 4, 'language': 'C', 'language_watchers': 4, 'bigness_contribution': 100.0}
{'bigness': 'medium', 'watchers': 650, 'language': 'Ruby', 'language_watchers': 1240, 'bigness_contribution': 52.42}
{'bigness': 'large', 'watchers': 524, 'language': 'Ruby', 'language_watchers': 1240, 'bigness_contribution': 42.26}
{'bigness': 'small', 'watchers': 66, 'language': 'Ruby', 'language_watchers': 1240, 'bigness_contribution': 5.32}
