# Interactive geographic visualisation of Swiss Github users

One of the most interesting visualisations we can do if to situate swiss github users with respect to their geographic features. Our goal here is to see if we can visualize some interesting patterns that might have arisen intuitively while thinking about the community of Swiss users. 

One example of this intuition would be that users are concentrated around universities, most notably EPFL and ETHZ. Other interesting geographic divisions to study could be the Rostigraben or differences between cantons.

## Geodata pre-processing

The first step is to process our data. This includes connecting to our database of course, but also extracting relevant features and statistics, geocoding users.

In [887]:
# Include ALL the things

# Pretty plots
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_context('notebook')
%matplotlib inline

# Connecting to DB
from utils import get_mongo_db

# Requesting stuff
import requests

# Data handling
import pandas as pd
import itertools
import pickle
import numpy as np
import re

# Map drawing
import folium

### Collecting the users

Connecting to the database, then fetching our dataset of users.

In [888]:
db = get_mongo_db()

Connecting to MongoDB at localhost:27017...


In [889]:
# Get users from DB
res = db.users.find({ 'in_ch': True, 'repositories': { '$ne': None } })

users = []

# For each user, find his repositories
for user in res:
    repos = db.repositories.find(
        { '_id': { '$in': user['repositories'] } }
    )
    
    geo = user.get('geocode', {})
    canton = geo.get('state', '')
    lat = geo.get('lat', '')
    lng = geo.get('lng', '')
    
    users.append({
        '_id': user['_id'],
        'login': user['login'],
        'name': user['name'],
        'location': user['location'],
        'repositories_docs': list(repos),
        'canton' : canton,
        'lat' : lat,
        'lng' : lng
    })
    
print("Our dataset includes {} users.".format(len(users)))

Our dataset includes 5976 users.


We now have **5976** users, with the following data fields :

* *_id* : a uid
* *login* : username
* *name* : the user's name
* *location* : the user's location
* *repositories_docs* : a list of the user's repositories

### Collecting statistics

For each user let us collect a few interesting statistics :

* *repo_count* : The number of repositories for each user
* *star_count* : The number of stars on all the user's repos
* *watchers_count* : The number of watchers on all the user's repos
* *forks_count* : The number of forks on all the user's repos


In [890]:
# Define the helper functions

def count_repos(user):
    return len(user['repositories_docs'])

def count_stat(user, key):
    count = 0
    for repo in user['repositories_docs']:
        count = count + repo[key]
    return count

def count_stars(user):
    return count_stat(user, 'stargazers_count')

def count_watchers(user):
    return count_stat(user, 'watchers_count')

def count_forks(user):
    return count_stat(user, 'forks_count')


In [891]:
users_data = [{ 'id': user['_id'],
                'location': user['location'],
                'name' : user['name'],
                'username' : user['login'],
                'canton' : user['canton'],
                'lat' : user['lat'],
                'lng' : user['lng'],
                'repo_count' : count_repos(user),
                'star_count' : count_stars(user),
                'watchers_count' : count_watchers(user),
                'forks_count' : count_forks(user),
                'users_count' : 1
              } 
              for user in users]

users_df = pd.DataFrame(users_data)

In [892]:
users_df.sample(10)

Unnamed: 0,canton,forks_count,id,lat,lng,location,name,repo_count,star_count,username,users_count,watchers_count
1411,,0,8177607,46.8182,8.22751,Schweiz,Cordula Braun,0,0,cobra3,1,0
5861,BE,0,699130,47.1368,7.24679,biel-bienne,alaric,6,0,wullerot,1,0
3647,Basel-Stadt,0,9884884,47.5581,7.58783,"Basel, Switzerland",James Wong,10,0,jameswtc,1,0
1578,ZH,2,1353587,47.3769,8.54169,"Zürich, CH",Daniel Keller,17,8,danielkeller,1,8
5386,BS,0,2838237,47.5596,7.58858,Basel,Stephan,4,0,StephanSST,1,0
4147,Vaud,0,8378319,46.5093,6.49832,"Morges, Switzerland",Marco Maccio,11,3,marcusmaccio,1,3
3643,Zürich,1,1539369,47.3686,8.54044,"Zürich, Switzerland",Woodsie,3,1,james-a-woods,1,1
2095,ZH,1,1897296,47.3769,8.54169,Zürich,Hirzel,4,0,ahirzel,1,0
1572,TG,0,5507223,47.6038,9.05574,"High Tech Center 1, Taegerwilen - Thurgau, CH-...",Vaclav Cechticky,0,0,cechticky,1,0
1333,,0,138842,46.8182,8.22751,Schweiz,Marc van Nuffel,2,0,marcvannuffel,1,0


In [893]:
grouped = users_df.groupby(['canton']).sum().reset_index()
grouped = grouped.drop('id', axis=1)

In [894]:
missing_cantons = [canton for canton in cantons if canton not in grouped['canton'].values]

with_all_cantons = grouped.copy()

for canton in missing_cantons:
    data = {
        'canton': [canton],
        'star_count': [0],
        'repo_count' : [0],
        'watchers_count' : [0],
        'forks_count' : [0],
        'users_count' : [0]
    }
    df = pd.DataFrame.from_dict(data, orient='columns')
    
    with_all_cantons = with_all_cantons.append(df, ignore_index=True)

with_all_cantons = with_all_cantons[with_all_cantons['canton'].isin(cantons)].reset_index()
with_all_cantons

Unnamed: 0,index,canton,forks_count,repo_count,star_count,users_count,watchers_count
0,3,AG,538,603,3014,55,3014
1,4,AR,0,8,0,1,0
2,9,BE,3419,4951,10502,265,10502
3,10,BL,83,169,302,16,302
4,11,BS,2056,3125,5838,154,5838
5,33,FR,1422,961,8872,47,8872
6,36,GE,4722,6408,15706,308,15706
7,37,GL,11,129,34,1,34
8,39,GR,64,169,272,13,272
9,53,JU,2,78,12,5,12


In [895]:
pickle.dump(with_all_cantons[['canton', 'star_count']], open('stars_by_cantons_alt.p','wb'))
pickle.dump(with_all_cantons[['canton', 'repo_count']], open('repos_by_cantons_alt.p','wb'))
pickle.dump(with_all_cantons[['canton', 'users_count']], open('users_by_cantons_alt.p','wb'))

In [896]:
pickle.dump(users_df[['username', 'lng', 'lat']], open('users_locations_alt.p','wb'))

In [897]:
pickle.dump(users_df, open('users_data.p', 'wb'))

### Maps

We'll re-use the topojson overlay that was given to us in HW03 to build a map over Swiss cantons. For this we will use a similar procedure to HW03 where we use folium to draw over the topojson.

In [898]:
# Map overlay
canton_overlay  = 'ch-cantons.topojson.json'

# Statistics
stars_by_canton = 'stars_by_cantons_alt.p'
repos_by_canton = 'repos_by_cantons_alt.p'
users_by_canton = 'users_by_cantons_alt.p'
users_locations = 'users_locations_alt.p'

In [899]:
# Initialize the map to ~ the center of Switzerland
ch_center_loc = [46.92287,8.3829913] # Empirical "center" of Switzerland
map_ch = folium.Map(location=ch_center_loc, zoom_start=8)

# overlay the cantons onto the map
folium.TopoJson(open(canton_overlay),
                'objects.cantons',
                name='topojson'
               ).add_to(map_ch)

<folium.features.TopoJson at 0x1925c84e0>

In [900]:
map_ch.save('1_map_ch.html')
map_ch

### You can view the map [here](1_map_ch.html)

### [Map 1](2_map_ch_choro_repos.html) : Number of repositories per canton

In [901]:
# Load the data
repos_by_cantons_data = pickle.load(open(repos_by_canton,'rb')).reset_index()

# Plot a Choropleth map
cols = ['canton', 'repo_count'] # Columns of interest
color_map = 'YlOrRd'                 # Color Map used, Yellow for low values, Red for high
legend_str = 'Number of repositories'   # Legend title

map_ch.choropleth(
    geo_path=canton_overlay, 
    data=repos_by_cantons_data,
    columns=cols,
    topojson='objects.cantons',
    key_on='feature.id',
    fill_color=color_map,
    fill_opacity=0.7, 
    line_opacity=0.5,
    legend_name=legend_str,
    reset=True
)

map_ch.save('2_map_ch_choro_repos.html')
map_ch



### You can view the map [here](2_map_ch_choro_repos.html)

### [Map 2](3_map_ch_choro_users.html) : Localization of Github users in Switzerland

Let's add the individual users as markers, and underneath have a Choropleth with the number of users per canton.

In [902]:
from folium.plugins import MarkerCluster

user_data = pickle.load(open('users_data.p', 'rb')).dropna()
locations = list(zip(user_data['lat'], user_data['lng']))

ch_center_loc = [46.92287,8.3829913] # Empirical "center" of Switzerland
map_ch2 = folium.Map(location=ch_center_loc, zoom_start=8, tiles='OpenStreetMap')

# overlay the cantons onto the map
folium.TopoJson(open(canton_overlay),
                'objects.cantons',
                name='topojson'
               ).add_to(map_ch2)

# Plot a Choropleth map
cols = ['canton', 'users_count'] # Columns of interest
color_map = 'YlOrRd'             # Color Map used, Yellow for low values, Red for high
legend_str = 'Number of users'   # Legend title
canton_data = pickle.load(open(users_by_canton, 'rb'))

map_ch2.choropleth(
    geo_path=canton_overlay, 
    data=canton_data,
    columns=cols,
    topojson='objects.cantons',
    key_on='feature.id',
    fill_color=color_map,
    fill_opacity=0.7, 
    line_opacity=0.5,
    legend_name=legend_str)

map_ch2.add_child(MarkerCluster(locations=locations))

map_ch2.save("3_map_ch_choro_users.html")
map_ch2



### You can view the map [here](3_map_ch_choro_users.html)

### [Map 3](4_map_ch_heat_users.html) User Heatmap

We can also visualise our users on a heatmap, which assigns colors to the density of users in certain areas. Our original intuitions are confirmed, where we have major activity centers near Zurich, around Neuchâtel, and along the Lemanic arc. 

In [926]:
from folium.plugins import HeatMap

heat_users = pickle.load(open('users_locations_alt.p', 'rb'))
heat_locations = list(zip(heat_users['lat'], heat_users['lng']))

ch_center_loc = [46.92287,8.3829913] # Empirical "center" of Switzerland
map_ch3 = folium.Map(location=ch_center_loc, zoom_start=8, tiles='stamentoner')

HeatMap(heat_locations).add_to(map_ch3)

<folium.plugins.heat_map.HeatMap at 0x1e25fd668>

In [929]:
map_ch3.save('4_map_ch_heat_users.html')
map_ch3

### You can view the map [here](4_map_ch_heat_users.html)

### Language usage

Let's take a look at the most used languages in Github Switzerland.

In [905]:
# Get users from DB
res = db.users.find({ 'in_ch': True, 'repositories': { '$ne': None } })

localized_repos = []

# For each user, find his repositories
for user in res:
    repos = db.repositories.find(
        { '_id': { '$in': user['repositories'] } }
    )
    
    geo = user.get('geocode', {})
    canton = geo.get('state', '')
    lat = geo.get('lat', None)
    lng = geo.get('lng', None)
    
    for repo in repos:
        localized_repos.append({
            'created_by' : user['login'],
            'project_name' : repo['full_name'],
            'url' : repo['clone_url'],
            'language' : repo['language'],
            'canton' : canton,
            'star_count' : repo['stargazers_count'],
            'lat' : lat,
            'lng' : lng
        })
    
print("Our dataset includes {} repos.".format(len(localized_repos)))

Our dataset includes 98862 repos.


In [906]:
localized_repos_df = pd.DataFrame(localized_repos)
localized_repos_df = localized_repos_df[localized_repos_df['canton'].isin(cantons)].reset_index()
localized_repos_df.sample(10)

Unnamed: 0,index,canton,created_by,language,lat,lng,project_name,star_count,url
27284,42357,BE,adoweb,JavaScript,46.947974,7.447447,adoweb/file-uploader,0,https://github.com/adoweb/file-uploader.git
2467,3351,BE,chrisglass,Python,46.947974,7.447447,chrisglass/wsgiservice,2,https://github.com/chrisglass/wsgiservice.git
37944,64277,VD,horiaradu,Ruby,46.519653,6.632273,horiaradu/ruby-git,0,https://github.com/horiaradu/ruby-git.git
36868,60700,ZH,frne,PHP,47.376887,8.541694,frne/Neo4jUserBundle,1,https://github.com/frne/Neo4jUserBundle.git
32696,53225,BL,cs2ag,PHP,47.464394,7.810661,cs2ag/jpfaq,0,https://github.com/cs2ag/jpfaq.git
10604,15435,VD,yageek,,46.519653,6.632273,yageek/iCache,0,https://github.com/yageek/iCache.git
10037,14749,BE,hairmare,PHP,46.947974,7.447447,hairmare/SmsSender-Swisscom-Provider,0,https://github.com/hairmare/SmsSender-Swisscom...
23736,37733,VD,yvaucher,Python,46.519653,6.632273,yvaucher/maintainer-quality-tools,0,https://github.com/yvaucher/maintainer-quality...
24491,38519,VS,183amir,Python,46.10498,7.075533,183amir/bob.db.nist_sre12-feedstock,0,https://github.com/183amir/bob.db.nist_sre12-f...
41629,89519,BE,st4ple,CSS,46.947974,7.447447,st4ple/st4ple.github.io,0,https://github.com/st4ple/st4ple.github.io.git


In [907]:
def count_lang_occurences(g):
    count_lang = {}
    for idx, row in g.iterrows():
        if row['canton'] in count_lang:
            if row['language'] in count_lang[row['canton']]:
                count_lang[row['canton']][row['language']] += 1
            elif row['language'] is not None: 
                count_lang[row['canton']][row['language']] = 1
        else:
            count_lang[row['canton']] = {row['language'] : 1}
    return count_lang
            
count_lang = count_lang_occurences(localized_repos_df)

In [908]:
count_lang

{'AG': {'AGS Script': 1,
  'Arduino': 1,
  'Assembly': 2,
  'C': 17,
  'C#': 23,
  'C++': 14,
  'CSS': 18,
  'Clojure': 2,
  'CoffeeScript': 2,
  'Cuda': 1,
  'Elm': 1,
  'Emacs Lisp': 1,
  'Go': 22,
  'HTML': 11,
  'Haskell': 2,
  'Java': 100,
  'JavaScript': 155,
  'Julia': 3,
  'Lua': 1,
  'Objective-C': 11,
  'PHP': 61,
  'Perl': 9,
  'Puppet': 4,
  'Python': 11,
  'R': 5,
  'Ruby': 24,
  'Scala': 2,
  'Scheme': 1,
  'Shell': 19,
  'Swift': 5,
  'VimL': 2,
  'Visual Basic': 2,
  'XSLT': 1},
 'AR': {'HTML': 1, 'JavaScript': 3, 'PHP': 2, 'Shell': 1},
 'BE': {'ASP': 1,
  'ApacheConf': 4,
  'AppleScript': 1,
  'Arduino': 18,
  'Assembly': 5,
  'Batchfile': 7,
  'C': 95,
  'C#': 112,
  'C++': 122,
  'CMake': 1,
  'CSS': 142,
  'CartoCSS': 1,
  'Clojure': 7,
  'CoffeeScript': 20,
  'D': 1,
  'Dart': 1,
  'Elixir': 11,
  'Elm': 1,
  'Emacs Lisp': 52,
  'Erlang': 3,
  'F#': 2,
  'GDScript': 1,
  'GLSL': 1,
  'Go': 37,
  'Groff': 1,
  'Groovy': 8,
  'HTML': 122,
  'Haskell': 4,
  'Java': 46

In [909]:
count_lang_df = pd.DataFrame(count_lang).fillna(0)

In [910]:
count_lang_df = count_lang_df.applymap(np.int64)
count_lang_df

Unnamed: 0,AG,AR,BE,BL,BS,FR,GE,GL,GR,JU,...,SG,SH,SO,SZ,TG,TI,VD,VS,ZG,ZH
AGS Script,1,0,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
Arduino,1,0,18,0,5,3,8,0,0,0,...,6,0,0,0,0,2,27,0,1,24
Assembly,2,0,5,0,0,0,3,0,0,0,...,1,0,0,0,1,0,5,0,0,7
C,17,0,95,5,70,28,296,0,10,7,...,11,0,7,2,9,20,358,11,44,623
C#,23,0,112,3,27,16,82,0,4,0,...,25,0,3,0,3,3,137,4,2,285
C++,14,0,122,2,115,82,556,0,7,5,...,11,0,2,2,1,53,372,45,21,706
CSS,18,0,142,3,97,22,175,0,3,1,...,30,1,1,2,6,23,215,10,13,450
CartoCSS,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
Clojure,2,0,7,0,1,0,120,0,0,0,...,0,0,0,0,0,0,60,0,1,53
CoffeeScript,2,0,20,2,15,1,27,0,4,0,...,8,0,0,0,0,4,33,1,1,122


In [911]:
def get_most_popular_langs(df):
    pop_langs = {}
    for column in df:
        pop_langs[column] = df[column].idxmax()
    return pop_langs

pop_langs = get_most_popular_langs(count_lang_df)

In [912]:
pop_langs_df = pd.DataFrame(pop_langs, index={'Language'}).transpose()
pop_langs_df

Unnamed: 0,Language
AG,JavaScript
AR,JavaScript
BE,JavaScript
BL,PHP
BS,JavaScript
FR,Python
GE,Python
GL,Ruby
GR,Python
JU,JavaScript


In [913]:
mapping = {'JavaScript' : 1, 'Ruby' : 2, 'Python' : 3, 'PHP' : 4, 'Perl' : 5, 'Java' : 6}
pop_langs_df = pop_langs_df.replace({'Language': mapping}).reset_index().rename(columns={'index': 'Cantons'})
missing_cantons = [canton for canton in cantons if canton not in pop_langs_df['Cantons'].values]

pop_langs_df_complete = pop_langs_df.copy()

for canton in missing_cantons:
    data = {
        'Cantons': [canton],
        'Language' : [0]
    }
    df = pd.DataFrame.from_dict(data, orient='columns')
    
    pop_langs_df_complete = pop_langs_df_complete.append(df, ignore_index=True)

pop_langs_df_complete = pop_langs_df_complete[pop_langs_df_complete['Cantons'].isin(cantons)]

In [930]:
# Initialize the map to ~ the center of Switzerland
ch_center_loc = [46.92287, 8.3829913] # Empirical "center" of Switzerland
map_ch4 = folium.Map(location=ch_center_loc, zoom_start=8)

# overlay the cantons onto the map
folium.TopoJson(open(canton_overlay),
                'objects.cantons',
                name='topojson').add_to(map_ch4)

# Plot a Choropleth map
color_map = 'Spectral'

map_ch4.choropleth(
    geo_path=canton_overlay, 
    data=pop_langs_df_complete,
    columns = ['Cantons','Language'],
    topojson='objects.cantons',
    key_on='feature.id',
    fill_color=color_map,
    fill_opacity=0.7, 
    line_opacity=0.5)

map_ch4.save('5_map_ch_choro_lang.html')
map_ch4



> Legend :

* **0** : no data
* **1** : JavaScript 
* **2** : Ruby
* **3** : Python
* **4** : PHP
* **5** : Perl 
* **6** : Java

### You can view the map [here](5_map_ch_choro_lang.html)

## Most popular Swiss repositories explorer


In [915]:
# Top 100 repositories in terms of stars
top_100_repos = localized_repos_df.sort_values(by='star_count', ascending=False)[0:100].reset_index()
top_100_repos.head(10)

Unnamed: 0,level_0,index,canton,created_by,language,lat,lng,project_name,star_count,url
0,487,650,ZH,jwagner,JavaScript,47.376887,8.541694,jwagner/smartcrop.js,9477,https://github.com/jwagner/smartcrop.js.git
1,1467,1737,ZH,gionkunz,JavaScript,47.376887,8.541694,gionkunz/chartist-js,8999,https://github.com/gionkunz/chartist-js.git
2,10,115,ZH,Seldaek,PHP,47.376887,8.541694,Seldaek/monolog,5601,https://github.com/Seldaek/monolog.git
3,2244,3111,ZH,adrai,JavaScript,47.376887,8.541694,adrai/flowchart.js,3535,https://github.com/adrai/flowchart.js.git
4,189,352,FR,0xced,Objective-C,46.806477,7.161972,0xced/iOS-Artwork-Extractor,2635,https://github.com/0xced/iOS-Artwork-Extractor...
5,9580,14088,AG,garnele007,Swift,47.387666,8.255429,garnele007/SwiftOCR,2133,https://github.com/garnele007/SwiftOCR.git
6,252,415,FR,0xced,Objective-C,46.806477,7.161972,0xced/XCDYouTubeKit,1952,https://github.com/0xced/XCDYouTubeKit.git
7,1189,1455,GE,tobie,Perl,46.204391,6.143158,tobie/ua-parser,1775,https://github.com/tobie/ua-parser.git
8,776,939,ZH,sustrik,C,47.376887,8.541694,sustrik/libmill,1714,https://github.com/sustrik/libmill.git
9,2975,4400,ZH,The-Compiler,Python,47.49882,8.723689,The-Compiler/qutebrowser,1668,https://github.com/The-Compiler/qutebrowser.git


In [916]:
def gh_popup(repo, rank):
    stars = repo['star_count']
    name = repo['project_name']
    url = repo['url']
    return "#" + str(rank) + " : " + name + " (" + str(stars) + " stars)"

In [931]:
import random

# Create the map
ch_center_loc = [46.92287, 8.3829913] # Empirical "center" of Switzerland
map_ch5 = folium.Map(location=ch_center_loc, zoom_start=8)


for r in range(100):
    lat_delta = random.uniform(-0.05, 0.05)
    lng_delta = random.uniform(-0.05, 0.05)
    
    folium.Marker(
        location=[top_100_repos.iloc[r]['lat']+lat_delta, top_100_repos.iloc[r]['lng']+lng_delta],
        popup=gh_popup(top_100_repos.iloc[r], r),
        icon=folium.Icon(icon='star'),
    ).add_to(map_ch5)

map_ch5.save('6_map_ch_top100.html')
map_ch5

### You can view the map [here](6_map_ch_top100.html)