# Contributor analysis in enterprise-driven OSS projects

This notebook analyzes the influence of contributors who are not affiliated with an enterprise on enterprise-driven OSS projects, in the context of the [2022 World of Code Hackathon](https://github.com/woc-hack/hackathon-pittsburgh-2022). In particular, we investigate the following project characteristics from the contributors' perspective.

- commit frequency
- contribution size (LOC, files)
- activity distribution

This work is based on the following datasets.

- [A Dataset of Enterprise-Driven Open Source Software](https://arxiv.org/abs/2002.03927), MSR 2020
- [World Of Code: Complete, Curated, Cross-referenced, and Current Collection of Open Source Version Control Data](https://worldofcode.org/)

The project team consists of the following members
from the [Athens University of Economics and Business](https://www.aueb.gr/en/international), Greece.

- [Angeliki Papadopoulou](https://www.balab.aueb.gr/angeliki-papadopolou.html), undergraduate student
- [George Liargkovas](https://www.balab.aueb.gr/george-liargkovas.html), undergraduate student
- [Zoe Kotti](https://zkotti.github.io/), PhD student


In [73]:
import pandas as pd

## Data loading & curation

In [74]:
with open('../data/column_names.txt', 'r') as f:
    indoss_columns = f.read().splitlines()
indoss_columns

['url',
 'project_id',
 'sdtc',
 'mcpc',
 'mcve',
 'star_number',
 'commit_count',
 'files',
 'lines',
 'pull_requests',
 'github_repo_creation',
 'earliest_commit',
 'most_recent_commit',
 'committer_count',
 'author_count',
 'dominant_domain',
 'dominant_domain_committer_commits',
 'dominant_domain_author_commits',
 'dominant_domain_committers',
 'dominant_domain_authors',
 'cik',
 'fg500',
 'sec10k',
 'sec20f',
 'project_name',
 'owner_login',
 'company_name',
 'owner_company',
 'license']

In [75]:
indoss = pd.read_csv('../data/enterprise_projects.txt', delimiter='\t', header=None, names=indoss_columns)
indoss

Unnamed: 0,url,project_id,sdtc,mcpc,mcve,star_number,commit_count,files,lines,pull_requests,...,dominant_domain_authors,cik,fg500,sec10k,sec20f,project_name,owner_login,company_name,owner_company,license
0,https://github.com/aligent/CacheObserver,149215,t,t,t,67,117,28.0,3776.0,20,...,6,,,,,CacheObserver,aligent,,,
1,https://github.com/moguno/mikutter-windows,8303484,t,f,f,30,118,24.0,921.0,4,...,3,,,,,mikutter-windows,moguno,,,
2,https://github.com/moguno/mikutter-subparts-image,7633353,t,f,f,30,206,7.0,551.0,9,...,3,,,,,mikutter-subparts-image,moguno,,,
3,https://github.com/10up/theme-scaffold,94836776,f,t,t,79,341,76.0,19577.0,76,...,5,,,,,theme-scaffold,10up,,,MIT
4,https://github.com/10up/plugin-scaffold,95361839,f,t,t,48,152,48.0,16075.0,57,...,6,,,,,plugin-scaffold,10up,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17259,https://github.com/AppCanOpenSource/appcan-ios,14381676,t,f,f,73,864,838.0,90591.0,94,...,3,,,,,appcan-ios,AppCanOpenSource,,zymobi,LGPL-3.0
17260,https://github.com/zynxhealth/z-mon,184224,f,t,f,35,99,29.0,3076.0,3,...,1,,,,,z-mon,zynxhealth,,,
17261,https://github.com/zalando/riptide,20895628,f,f,t,111,1710,533.0,33740.0,247,...,7,,,,,riptide,zalando,,,MIT
17262,https://github.com/zalando-incubator/remora,62870276,f,f,t,147,141,,,41,...,0,,,,,remora,Zalando-Incubator,,,MIT


In [84]:
# Adapt URL to WoC project name convention
splitted_urls = indoss['url'].str.split("/", expand=True)
indoss_projects = splitted_urls.iloc[:,-2:]
indoss_projects.columns = ['dom', 'proj']
indoss_projects

Unnamed: 0,dom,proj
0,aligent,CacheObserver
1,moguno,mikutter-windows
2,moguno,mikutter-subparts-image
3,10up,theme-scaffold
4,10up,plugin-scaffold
...,...,...
17259,AppCanOpenSource,appcan-ios
17260,zynxhealth,z-mon
17261,zalando,riptide
17262,zalando-incubator,remora
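
The URL-to-name conversion above can also be sketched for a single URL in plain Python. This is a minimal illustration, assuming every URL has the form `https://github.com/<owner>/<repo>`; `woc_project_name` is a hypothetical helper, not part of the notebook or of World of Code.

```python
from urllib.parse import urlparse


def woc_project_name(url: str) -> str:
    """Return 'owner_repo' for a GitHub URL (hypothetical helper)."""
    # The URL path looks like '/owner/repo'; keep its first two components
    owner, repo = urlparse(url).path.strip("/").split("/")[:2]
    return f"{owner}_{repo}"


example_urls = [
    "https://github.com/aligent/CacheObserver",
    "https://github.com/zalando/riptide",
]
names = [woc_project_name(u) for u in example_urls]
print(names)
```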


In [92]:
concat_indoss_projects = indoss_projects['dom'] + '_' + indoss_projects['proj']
final_projects = pd.DataFrame(concat_indoss_projects, columns = ['project'])
final_projects = final_projects.join(indoss['dominant_domain'])
final_projects

Unnamed: 0,project,dominant_domain
0,aligent_CacheObserver,aligent.com.au
1,moguno_mikutter-windows,0kn.sakura.ne.jp
2,moguno_mikutter-subparts-image,0kn.sakura.ne.jp
3,10up_theme-scaffold,10up.com
4,10up_plugin-scaffold,10up.com
...,...,...
17259,AppCanOpenSource_appcan-ios,zymobi.com
17260,zynxhealth_z-mon,zynx.com
17261,zalando_riptide,zalando.de
17262,zalando-incubator_remora,zalando.de


In [98]:
final_projects.to_csv('../data/indoss_projects.csv', index=False, header=False, sep=";")

## Project deduplication

We remove forked projects from the analysis because, even if a fork is now enterprise-driven, its initial commits belong to the parent project. We only want to include projects that have been enterprise-driven from the beginning.

`$> cut -d";" -f1 ../data/indoss_projects.csv | ~/lookup/getValues p2P | awk 'BEGIN { FS=OFS=";" } $1==$2' | cut -d";" -f1 >../data/dedup_indoss_projects.txt`
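
The `awk` filter keeps only the rows whose first and second fields are equal, that is, projects that map to themselves. A minimal Python sketch of the same filter follows, assuming the lookup returns `project;root_project` pairs; the sample rows and the helper name `keep_root_projects` are made up for illustration.

```python
import csv
import io

# Hypothetical lookup output: project;root project it resolves to
sample_output = """aligent_CacheObserver;aligent_CacheObserver
someorg_forked-tool;upstream_tool
zalando_riptide;zalando_riptide
"""


def keep_root_projects(lines):
    """Keep projects whose root project is the project itself."""
    reader = csv.reader(lines, delimiter=";")
    return [row[0] for row in reader if len(row) >= 2 and row[0] == row[1]]


roots = keep_root_projects(io.StringIO(sample_output))
print(roots)
```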

## p2a

`$> cat ../data/dedup_indoss_projects.txt | ~/lookup/getValues -f p2a >../data/p2a.csv`

## Join with dominant domains

`$> join -j1 -t';' <(sort -t';' -k1,1 ../data/p2a.csv) <(sort -t';' -k1,1 ../data/indoss_projects.csv) >../data/p2a_domains.csv`
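
The join attaches each project's dominant domain to every `project;author` row and drops rows without a matching project. A rough Python equivalent is sketched below, assuming the two inputs have the layouts `project;author` and `project;dominant_domain`; all sample rows are invented.

```python
# Hypothetical rows mirroring p2a.csv (project, author)
p2a_rows = [
    ("10up_theme-scaffold", "Jane Doe <jane@example.org>"),
    ("10up_theme-scaffold", "John Roe <john@10up.com>"),
    ("unlisted_project", "Someone <someone@example.org>"),
]

# Hypothetical rows mirroring indoss_projects.csv (project -> dominant domain)
project_domains = {
    "10up_theme-scaffold": "10up.com",
    "zalando_riptide": "zalando.de",
}

# Inner join on the project name: unmatched projects are dropped
joined = [
    (project, author, project_domains[project])
    for project, author in p2a_rows
    if project in project_domains
]
for row in joined:
    print(";".join(row))
```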

## Further cleaning

In [103]:
# Remove bots
!grep -iv '[^a-z]bot[^a-z]' ../data/p2a_domains.csv >../data/p2a_domains_clean.csv
!grep -i '[^a-z]bot[^a-z]' ../data/p2a_domains.csv | wc -l

8025
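
The bot filter treats a row as a bot when the string `bot` appears with a non-letter character on both sides, ignoring case. The same heuristic can be sketched in Python as follows; the sample rows are invented, and the pattern is the one used in the `grep` commands.

```python
import re

# 'bot' surrounded by non-letters, case-insensitive (same pattern as the grep)
BOT_PATTERN = re.compile(r"[^a-z]bot[^a-z]", re.IGNORECASE)

# Hypothetical rows in the project;author;dominant_domain layout
rows = [
    "proj_a;dependabot[bot] <support@example.com>;example.com",
    "proj_a;Abbott Smith <abbott@example.com>;example.com",
    "proj_b;Jane Doe <jane@example.com>;example.com",
]

clean = [r for r in rows if not BOT_PATTERN.search(r)]
bots = [r for r in rows if BOT_PATTERN.search(r)]
print(len(clean), "kept,", len(bots), "removed")
```

Because the pattern requires a non-letter on each side, a name such as `Abbott` is kept, while `[bot]` suffixes are removed.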


In [109]:
# Count the number of authors per dominant domain
!cut -d\; -f3 ../data/p2a_domains_clean.csv | sort | uniq -c | sort -rn

  64492 microsoft.com
  49067 google.com
  34577 udacity.com
  27974 redhat.com
  26471 enki.com
  20265 hashicorp.com
  15696 fb.com
  14674 amazon.com
  12616 pivotal.io
  11484 mozilla.com
   7376 cern.ch
   7005 datadoghq.com
   6621 makersacademy.com
   6537 intel.com
   6346 npmjs.com
   5942 odoo.com
   5865 elementary.io
   5441 twitter.com
   5076 influxdb.com
   4845 baidu.com
   4657 sap.com
   4578 travis-ci.com
   4469 jetbrains.com
   4323 gsa.gov
   4223 alibaba-inc.com
   3739 wix.com
   3453 wso2.com
   3432 netflix.com
   3425 xamarin.com
   3307 rapid7.com
   3270 segment.com
   3266 confluent.io
   3243 oracle.com
   3212 mesosphere.com
   3133 lightbend.com
   3119 mariadb.com
   3076 stripe.com
   3051 uber.com
   3040 kitware.com
   2919 zalando.de
   2897 ele.me
   2881 mapbox.com
   2693 apple.com
   2605 rabbitmq.com
   2564 arm.com
   2553 algolia.com
   2509 tencent.com
   2481 smartthings.com
   2476 llnl.gov

In [112]:
# Keep the top 10 dominant domains by number of authors
!cut -d\; -f3 ../data/p2a_domains_clean.csv | sort | uniq -c | sort -rn | head | sed 's/^ *[0-9]* *//' >../data/top-10-domains.txt

sort: write failed: 'standard output': Broken pipe
sort: write error
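
Selecting the most frequent domains can also be sketched with `collections.Counter`, which avoids piping into `head` and keeps domain names intact, including any digits. The sample list below is invented, and `top_n` is set to 2 only to keep the example small.

```python
from collections import Counter

# Hypothetical dominant-domain column, one entry per project-author row
domains = [
    "microsoft.com", "google.com", "microsoft.com",
    "10up.com", "google.com", "microsoft.com",
]

top_n = 2  # the notebook keeps 10
counts = Counter(domains)
top_domains = [domain for domain, _ in counts.most_common(top_n)]
print(top_domains)
```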


## Find enterprise-associated and non-associated authors

`$> cat ../data/top-10-domains.txt | grep -i - <(cut -d\; -f2 ../data/p2a_domains_clean.csv) >../data/indoss-authors.txt`

`$> cat ../data/top-10-domains.txt | grep -iv - <(cut -d\; -f2 ../data/p2a_domains_clean.csv) >../data/non-indoss-authors.txt`

In [120]:
# Count indoss authors
!cat ../data/top-10-domains.txt | while read i ; do echo -n $i, ; grep -c $i ../data/indoss-authors.txt ; done | sort -t, -k2,2 -rn

microsoft.com,4032
google.com,942
redhat.com,380
amazon.com,206
pivotal.io,172
mozilla.com,65
fb.com,58
hashicorp.com,13
udacity.com,10
enki.com,0


In [119]:
# Count non-indoss authors
!cat ../data/top-10-domains.txt | while read i ; do echo -n $i, ; grep -c $i ../data/non-indoss-authors.txt ; done | sort -t, -k2,2 -rn

microsoft.com,19299
google.com,12967
redhat.com,9686
fb.com,4951
pivotal.io,3703
amazon.com,3165
mozilla.com,1779
hashicorp.com,663
udacity.com,139
enki.com,25


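## Ratio of enterprise-associated authors

Each percentage is the count of enterprise-associated authors divided by the count of non-associated authors for that domain (for example, 4032 / 19299 ≈ 20.9% for microsoft.com).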

microsoft.com, 20.9% <br />
google.com, 7.3% <br />
udacity.com, 7.2% <br />
amazon.com, 6.5% <br />
pivotal.io, 4.6% <br />
redhat.com, 3.9% <br />
mozilla.com, 3.7% <br />
hashicorp.com, 2.0% <br />
fb.com, 1.2% <br />
enki.com, 0.0%
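
The percentages can be reproduced from the two per-domain counts reported earlier in this notebook. The sketch below assumes each ratio is the enterprise-associated count divided by the non-associated count, which matches the listed values; the dictionary names are chosen here for illustration.

```python
# Counts copied from the two per-domain listings in this notebook
indoss_counts = {
    "microsoft.com": 4032, "google.com": 942, "redhat.com": 380,
    "amazon.com": 206, "pivotal.io": 172, "mozilla.com": 65,
    "fb.com": 58, "hashicorp.com": 13, "udacity.com": 10, "enki.com": 0,
}
non_indoss_counts = {
    "microsoft.com": 19299, "google.com": 12967, "redhat.com": 9686,
    "fb.com": 4951, "pivotal.io": 3703, "amazon.com": 3165,
    "mozilla.com": 1779, "hashicorp.com": 663, "udacity.com": 139,
    "enki.com": 25,
}

# Ratio of associated to non-associated authors, as a percentage
ratios = {
    domain: round(100 * indoss_counts[domain] / non_indoss_counts[domain], 1)
    for domain in indoss_counts
}
for domain, pct in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{domain}, {pct}%")
```

Dividing by the sum of both counts instead would give the share of all authors, for example 4032 / (4032 + 19299) ≈ 17.3% for microsoft.com.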

## Future work

### p2c

`$> cat ../data/dedup_indoss_projects.txt | ~/lookup/getValues -f p2c >../data/p2c.csv`

### c2dat

`$> cut -d";" -f2 ../data/p2c.csv | ~/lookup/getValues c2dat >../data/c2dat.csv`