![Ethereum Wallet loading screen](https://i.imgur.com/BDs8lGK.png)

# ETHPrize dev interviews analysis

Using manually tagged data

In [1]:
import pandas as pd

## Loading data

In [2]:
df = pd.read_csv('./ETHPrize Tagged Data - 20180923.csv')

In [3]:
# View a snippet of the data
df.head(10)

Unnamed: 0,Name,Who are you and what are you working on?,Topics,Projects,What are your biggest frustrations?,Topics.1,Projects.1,What tools don’t exist at the moment?,Topics.2,Projects.2,...,Projects.10,What are you most excited about in the short term?,Topics.11,Projects.11,Who are the other people you think we should talk to?,Topics.12,Projects.12,Are there any other questions we should be asking?,Topics.13,Projects.13
0,﻿Fabio Berger + Remco Bloemen,0x - Decentralized exchange protocol. It is a ...,"exchange, protocol, standards, transaction, ne...",0x,Getting a simple experimental environment up i...,"experimentation, tracing, profiling, code cove...","Solidity, Remix",,,,...,,,,,,,,,,
1,﻿Leo Logvinov,"Started in blockchain 2 years ago in Berlin, w...","usability, errors","IntelliJ, 0x","Event watching - unreliable, no support for ba...","events, ABI, contracts, debugger, standards, c...","Truffle, Solidity, IntelliJ, Ganache, LLL",Prettier type plugin for solidity. I don’t hav...,,Solidity,...,"Ganache, Solidity, EthereumJS-blockstream",,,,,,,,,
2,﻿Axel Ericsson,I have built 1Protocol\nIt lets smart contract...,"contracts, stake signing, signatures, tokens, ...","MEW, Raiden, 1Protocol",,,,There is no tooling or anything related to sta...,state channels,MEW,...,,Raiden is the golden egg in the space. The exp...,,"Raiden, Ethereum, Counterfactual",,,,,,
3,﻿Mike Goldin,Software developer at Consensys\nKnown for Tok...,"Token Curated Registries, TCRs, design, money,...","Consensys, AdChain",Truffle’s debugger is a bit disappointing. Wor...,"debugger, contracts, proxy, data, production, ...","Truffle, EthPM",Fuzz Testing and formal verification desired.\n,"fuzz testing, formal verification",,...,,Excited for Casper\nApplications implemented i...,state channels,"Casper, Plasma","Infura team, client development\nSpankchain - ...",,"Infura, Spankchain",,,
4,﻿Oleksii,Started working with smart contracts in early ...,"contracts, scalability, tokens, logic, deploym...","Ambisafe, Truffle",Our original vision was to do everything – tes...,"snapshot, deployment, contracts, deployment, t...","built_our_own, Truffle, Testrpc, Remix, Securify",,,,...,"Ethercamp, 4byte.io",,,,,,,,,
5,﻿Brett Sun,Working on Aragon entirely.\nThe end goal is t...,"netowkr, organisation, protocol, money, permis...","Aragon, Ethereum",My biggest general frustration comes with the ...,"security, productivity, ecosystem, best practi...",Ethereum,A nice debugger! Please…\nMore infrastructure ...,"debugger, infrastructure, caching, dapps, events",,...,,,,,,,,,,
6,﻿Jorge Izquierdo,Aragon - Decentralized Governance platform\nWe...,"governance, language, natspec, dapps, Ruby, up...",Aragon,,,,,,,...,"Augur, Solidity, EthereumJS-blockstream",,,,,,,,,
7,﻿Jack Peterson and Sparkle,,,,Lack of a debugger - by far the biggest issue....,"debugger, transactions",,Setting breakpoints in tests!\nSalesforce Deve...,"bounties, documentation, gas, yellow paper","Salesforce, Geth, React, IPFS, EVM, Kauri, Eth...",...,"Augur, Raiden, Airbit",Raiden is coming shortly and will be very cool...,,"Raiden, Proof of Steak, uPort, Augur",,,,,,
8,﻿Joey Krug,Co-Chief Investment Officer at Pantera Capital...,,"Pantera Capital, Augur","At the end of the day, it all comes down to th...","scalability, payments, state channels","Ethereum, L4, EOS","My answers changed over a time, 1 year ago it ...",static analysis,,...,"Ethereum, Bitcoin, Geth, parity","In the short term, he’s most excited about Mak...","stability, scalability, sharding",MakerDAO,,,,,,
9,﻿Mark Beylin,Creator of the Bounties Network. Bounties on a...,"bounties, websockets, transactions","Bounties.network, Ethereum, IPFS, Solidity, Me...",Not being able to upgrade my contracts easily....,"upgradability, data, gas, gas limit, deploymen...",IPFS,Better querying possibilities on the state of ...,"state, contracts, community",,...,,Sharding. That’s all I want.\nuPort is getting...,sharding,uPort,Joseph Vander (learning the be a solidity deve...,,"Gitcoin, Solidity",Curious to know about developer incentives. As...,"incentives, money",Augur


## Data cleaning - Reformatting the columns for multi-labelling

The topics and projects can be applied to all questions, so instead of question-specific models we can train
two common models: one for topics and one for projects.

We will also ignore the questions with no answers.

We need to keep the original (line, name, question) so that we can map between the "natural" representation and the one
suited to automated multi-labelling.

In summary, we will convert the representation as follows:

```
   Name  | Q0   T0  P0  Q1    T1  P1        Obs  OldRow Name  Question Answer Topics Projects
0  Alice | foo  a,b  x   oof   b   y  ==>     0    0    Alice  Q0      foo     a,b      x
1  Bob   | bar   b  x,y  rab   c   z          1    0    Alice  Q1      oof       b      y
2  Eve   | baz   a   y   zab   a   x          2    1    Bob    Q0      bar       b    x,y
                                              3    1    Bob    Q1      rab       c      z
                                              4    2    Eve    Q0      baz       a      y
                                              5    2    Eve    Q1      zab       a      x
```
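As a rough sketch of that conversion, here is the same toy table (Alice, Bob, Eve) reshaped with plain pandas. The column names `Q0`, `T0`, `P0`, ... and the loop-and-concat approach are only illustrative assumptions; the notebook itself uses a MultiIndex and `unstack` further down.

```python
import pandas as pd

# Toy "wide" table: one row per person, three columns per question
wide = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Eve'],
    'Q0': ['foo', 'bar', 'baz'], 'T0': ['a,b', 'b', 'a'], 'P0': ['x', 'x,y', 'y'],
    'Q1': ['oof', 'rab', 'zab'], 'T1': ['b', 'c', 'a'], 'P1': ['y', 'z', 'x'],
})

frames = []
for q, t, p in [('Q0', 'T0', 'P0'), ('Q1', 'T1', 'P1')]:
    # Take the (answer, topics, projects) triple of one question ...
    part = wide[['Name', q, t, p]].copy()
    part.columns = ['Name', 'Answer', 'Topics', 'Projects']
    # ... and remember where each row came from
    part.insert(0, 'OldRow', wide.index)
    part.insert(2, 'Question', q)
    frames.append(part)

# Stack the per-question frames: one row per (person, question) pair
long = (pd.concat(frames)
          .sort_values(['OldRow', 'Question'])
          .reset_index(drop=True))
print(long)
```

Each original row yields one row per question, so 3 people and 2 questions give 6 observations.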

We have 43 columns, the first one being the Name.

### First, besides the name, we need to group the columns
Check that we can build the groups we want.

Let's get an "empty frame" first

In [4]:
def getMultiCol(columns):
    length = len(columns)
    qtp = [('Name', 'Name')] # questions, topics, projects
    for col_idx in range(1, length, 3):
        qtp.append((columns[col_idx], 'Answer'))
        qtp.append((columns[col_idx], 'Topics'))
        qtp.append((columns[col_idx], 'Projects'))
    return pd.MultiIndex.from_tuples(qtp, names=('Questions', 'id'))
        

In [5]:

pd.DataFrame(df.drop('Name', axis = 1), columns = getMultiCol(df.columns))

Questions,Name,Who are you and what are you working on?,Who are you and what are you working on?,Who are you and what are you working on?,What are your biggest frustrations?,What are your biggest frustrations?,What are your biggest frustrations?,What tools don’t exist at the moment?,What tools don’t exist at the moment?,What tools don’t exist at the moment?,...,Other domain specific questions?,What are you most excited about in the short term?,What are you most excited about in the short term?,What are you most excited about in the short term?,Who are the other people you think we should talk to?,Who are the other people you think we should talk to?,Who are the other people you think we should talk to?,Are there any other questions we should be asking?,Are there any other questions we should be asking?,Are there any other questions we should be asking?
id,Name,Answer,Topics,Projects,Answer,Topics,Projects,Answer,Topics,Projects,...,Projects,Answer,Topics,Projects,Answer,Topics,Projects,Answer,Topics,Projects
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,,,,,,,,,,,...,,,,,,,,,,
9,,,,,,,,,,,...,,,,,,,,,,


Looks good to me. Let's do the reshaping for real now.

In [6]:
# There is no "set_column" so we transpose back and forth
df = df.T.set_index(getMultiCol(df.columns)).T
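As a side note, pandas also accepts a new index assigned directly to `df.columns`, which may be a simpler alternative to the double transpose. The sketch below uses a hypothetical one-row frame rather than the interview data.

```python
import pandas as pd

# Hypothetical one-row frame with flat column names
toy = pd.DataFrame([['Alice', 'foo', 'a,b', 'x']],
                   columns=['Name', 'Q0', 'Topics', 'Projects'])

# Replace the flat columns with a two-level (Questions, id) index in place
toy.columns = pd.MultiIndex.from_tuples(
    [('Name', 'Name'), ('Q0', 'Answer'), ('Q0', 'Topics'), ('Q0', 'Projects')],
    names=('Questions', 'id'))

print(toy[('Q0', 'Answer')])
```

Unlike the transpose round trip, this leaves the column dtypes untouched.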

In [7]:
# Check
df.head(10)

Questions,Name,Who are you and what are you working on?,Who are you and what are you working on?,Who are you and what are you working on?,What are your biggest frustrations?,What are your biggest frustrations?,What are your biggest frustrations?,What tools don’t exist at the moment?,What tools don’t exist at the moment?,What tools don’t exist at the moment?,...,Other domain specific questions?,What are you most excited about in the short term?,What are you most excited about in the short term?,What are you most excited about in the short term?,Who are the other people you think we should talk to?,Who are the other people you think we should talk to?,Who are the other people you think we should talk to?,Are there any other questions we should be asking?,Are there any other questions we should be asking?,Are there any other questions we should be asking?
id,Name,Answer,Topics,Projects,Answer,Topics,Projects,Answer,Topics,Projects,...,Projects,Answer,Topics,Projects,Answer,Topics,Projects,Answer,Topics,Projects
0,﻿Fabio Berger + Remco Bloemen,0x - Decentralized exchange protocol. It is a ...,"exchange, protocol, standards, transaction, ne...",0x,Getting a simple experimental environment up i...,"experimentation, tracing, profiling, code cove...","Solidity, Remix",,,,...,,,,,,,,,,
1,﻿Leo Logvinov,"Started in blockchain 2 years ago in Berlin, w...","usability, errors","IntelliJ, 0x","Event watching - unreliable, no support for ba...","events, ABI, contracts, debugger, standards, c...","Truffle, Solidity, IntelliJ, Ganache, LLL",Prettier type plugin for solidity. I don’t hav...,,Solidity,...,"Ganache, Solidity, EthereumJS-blockstream",,,,,,,,,
2,﻿Axel Ericsson,I have built 1Protocol\nIt lets smart contract...,"contracts, stake signing, signatures, tokens, ...","MEW, Raiden, 1Protocol",,,,There is no tooling or anything related to sta...,state channels,MEW,...,,Raiden is the golden egg in the space. The exp...,,"Raiden, Ethereum, Counterfactual",,,,,,
3,﻿Mike Goldin,Software developer at Consensys\nKnown for Tok...,"Token Curated Registries, TCRs, design, money,...","Consensys, AdChain",Truffle’s debugger is a bit disappointing. Wor...,"debugger, contracts, proxy, data, production, ...","Truffle, EthPM",Fuzz Testing and formal verification desired.\n,"fuzz testing, formal verification",,...,,Excited for Casper\nApplications implemented i...,state channels,"Casper, Plasma","Infura team, client development\nSpankchain - ...",,"Infura, Spankchain",,,
4,﻿Oleksii,Started working with smart contracts in early ...,"contracts, scalability, tokens, logic, deploym...","Ambisafe, Truffle",Our original vision was to do everything – tes...,"snapshot, deployment, contracts, deployment, t...","built_our_own, Truffle, Testrpc, Remix, Securify",,,,...,"Ethercamp, 4byte.io",,,,,,,,,
5,﻿Brett Sun,Working on Aragon entirely.\nThe end goal is t...,"netowkr, organisation, protocol, money, permis...","Aragon, Ethereum",My biggest general frustration comes with the ...,"security, productivity, ecosystem, best practi...",Ethereum,A nice debugger! Please…\nMore infrastructure ...,"debugger, infrastructure, caching, dapps, events",,...,,,,,,,,,,
6,﻿Jorge Izquierdo,Aragon - Decentralized Governance platform\nWe...,"governance, language, natspec, dapps, Ruby, up...",Aragon,,,,,,,...,"Augur, Solidity, EthereumJS-blockstream",,,,,,,,,
7,﻿Jack Peterson and Sparkle,,,,Lack of a debugger - by far the biggest issue....,"debugger, transactions",,Setting breakpoints in tests!\nSalesforce Deve...,"bounties, documentation, gas, yellow paper","Salesforce, Geth, React, IPFS, EVM, Kauri, Eth...",...,"Augur, Raiden, Airbit",Raiden is coming shortly and will be very cool...,,"Raiden, Proof of Steak, uPort, Augur",,,,,,
8,﻿Joey Krug,Co-Chief Investment Officer at Pantera Capital...,,"Pantera Capital, Augur","At the end of the day, it all comes down to th...","scalability, payments, state channels","Ethereum, L4, EOS","My answers changed over a time, 1 year ago it ...",static analysis,,...,"Ethereum, Bitcoin, Geth, parity","In the short term, he’s most excited about Mak...","stability, scalability, sharding",MakerDAO,,,,,,
9,﻿Mark Beylin,Creator of the Bounties Network. Bounties on a...,"bounties, websockets, transactions","Bounties.network, Ethereum, IPFS, Solidity, Me...",Not being able to upgrade my contracts easily....,"upgradability, data, gas, gas limit, deploymen...",IPFS,Better querying possibilities on the state of ...,"state, contracts, community",,...,,Sharding. That’s all I want.\nuPort is getting...,sharding,uPort,Joseph Vander (learning the be a solidity deve...,,"Gitcoin, Solidity",Curious to know about developer incentives. As...,"incentives, money",Augur


### Second, unpivot/melt the dataframe into our desired structure

In [8]:
cleaned = df.set_index('Name').unstack().unstack(level=1).reset_index()
cleaned

id,Questions,Name,Answer,Projects,Topics
0,Are there any other questions we should be ask...,"(﻿Fabio Berger + Remco Bloemen ,)",,,
1,Are there any other questions we should be ask...,"(﻿Leo Logvinov,)",,,
2,Are there any other questions we should be ask...,"(﻿Axel Ericsson,)",,,
3,Are there any other questions we should be ask...,"(﻿Mike Goldin,)",,,
4,Are there any other questions we should be ask...,"(﻿Oleksii,)",,,
5,Are there any other questions we should be ask...,"(﻿Brett Sun,)",,,
6,Are there any other questions we should be ask...,"(﻿Jorge Izquierdo,)",,,
7,Are there any other questions we should be ask...,"(﻿Jack Peterson and Sparkle,)",,,
8,Are there any other questions we should be ask...,"(﻿Joey Krug,)",,,
9,Are there any other questions we should be ask...,"(﻿Mark Beylin,)",Curious to know about developer incentives. As...,Augur,"incentives, money"


The "Name" column still holds one-element tuples left over from the multi-level index. Let's fix that and sort the dataframe as well.

In [9]:
cleaned['Name'] = cleaned['Name'].str.get(0)
cleaned = cleaned.reindex(columns = ['Name', 'Questions', 'Answer', 'Topics', 'Projects'])
cleaned = cleaned.sort_values(['Name', 'Questions']).reset_index(drop=True)
cleaned

id,Name,Questions,Answer,Topics,Projects
0,Christopher Brown,Are there any other questions we should be ask...,,,
1,Christopher Brown,How do you handle smart contract verification ...,,,
2,Christopher Brown,How do you handle testing?,Just Truffle for tests\nMocha for unit and fun...,"unit tests, testing, functional tests, contact...","Truffle, Mocha, Mthril"
3,Christopher Brown,Other bounties?,,,
4,Christopher Brown,Other domain specific questions?,,,
5,Christopher Brown,Was anything easier than expected?,,,
6,Christopher Brown,What are the best educational resources?,,,
7,Christopher Brown,What are the tools/libraries/frameworks you use?,"Truffle for building, testing and compiling\nC...","testing, compile, deployment, integration, con...","Open Zeppelin, Truffle, Ganache, Ethereum, not..."
8,Christopher Brown,What are you most excited about in the short t...,Proof of Stake overlays will be really interes...,,"Proof of Stake, Casper, eWASM"
9,Christopher Brown,What are your biggest frustrations?,,,


Perfect, we're finished with basic data loading.

## Preparing our train, validation and test datasets

Now that we have a single dataframe we need to split it. We use the following standard terminology:

- train set: dataset used for training our machine learning models.
- validation set: dataset used to check that the models generalize to unseen data.
  We have the answers, the machine doesn't; we ask it to predict them and then check whether the predictions are good enough.
- test set: the data we actually want to label once the model is trained. Neither we nor the machine have the answers, but we assume that if the model generalizes well on the unseen validation set, it will generalize well on this one too.

First, we can only work on questions with an actual answer.

In [10]:
answered = cleaned[~pd.isnull(cleaned['Answer'])]

In [11]:
answered

id,Name,Questions,Answer,Topics,Projects
2,Christopher Brown,How do you handle testing?,Just Truffle for tests\nMocha for unit and fun...,"unit tests, testing, functional tests, contact...","Truffle, Mocha, Mthril"
7,Christopher Brown,What are the tools/libraries/frameworks you use?,"Truffle for building, testing and compiling\nC...","testing, compile, deployment, integration, con...","Open Zeppelin, Truffle, Ganache, Ethereum, not..."
8,Christopher Brown,What are you most excited about in the short t...,Proof of Stake overlays will be really interes...,,"Proof of Stake, Casper, eWASM"
10,Christopher Brown,What tools don’t exist at the moment?,The community is doing a good job and a lot of...,"community, ETHGlobal, visualisation, logging,","ETHGlobal, Solcoverage, parity"
11,Christopher Brown,What was the hardest part about learning to de...,Having a sequential getting started stuff on e...,"readthedocs, websockets, transaction","Solidity, Ethereum, Geth, GitHub, reddit"
13,Christopher Brown,Who are you and what are you working on?,"Full stack web dev, working in finance and som...","tokens, tokens, open source documentation, ERC...","Ethereum, Modular.network, AWS, Blossom, Statu..."
15,﻿ANDREY PETROV,How do you handle smart contract verification ...,Write a lot of tests myself. Get other people ...,human,
21,﻿ANDREY PETROV,What are the tools/libraries/frameworks you use?,"Truffle - not his favourite, but best thing ou...","IDE, Go","Truffle, VIM, Etherscan, built_our_own"
22,﻿ANDREY PETROV,What are you most excited about in the short t...,What dev tools on near horizon that would chan...,"wallets, dapps, market","Status, web3.js, SNARKs, MetaMask, Cipher"
23,﻿ANDREY PETROV,What are your biggest frustrations?,"Dapps: web3js stuff sucks. In the doc, it’s in...","documentation, UX, transaction, light client, ...","not_web3.js, ethers.js, MetaMask, Geth, Vipnod..."


Now we will create the test sets for topics and projects

In [12]:
topics_test = answered[pd.isnull(answered['Topics'])]
projects_test = answered[pd.isnull(answered['Projects'])]

topics_df = answered[~pd.isnull(answered['Topics'])]
projects_df = answered[~pd.isnull(answered['Projects'])]

# For ML we can drop 'Name' and the unused columns; we will train our models only on Questions + Answer.
# drop() without inplace returns a new dataframe, which avoids pandas' SettingWithCopyWarning
# ("A value is trying to be set on a copy of a slice from a DataFrame").
topics_df = topics_df.drop(['Name', 'Projects'], axis = 1)
projects_df = projects_df.drop(['Name', 'Topics'], axis = 1)


In [13]:
topics_df

id,Questions,Answer,Topics
2,How do you handle testing?,Just Truffle for tests\nMocha for unit and fun...,"unit tests, testing, functional tests, contact..."
7,What are the tools/libraries/frameworks you use?,"Truffle for building, testing and compiling\nC...","testing, compile, deployment, integration, con..."
10,What tools don’t exist at the moment?,The community is doing a good job and a lot of...,"community, ETHGlobal, visualisation, logging,"
11,What was the hardest part about learning to de...,Having a sequential getting started stuff on e...,"readthedocs, websockets, transaction"
13,Who are you and what are you working on?,"Full stack web dev, working in finance and som...","tokens, tokens, open source documentation, ERC..."
15,How do you handle smart contract verification ...,Write a lot of tests myself. Get other people ...,human
21,What are the tools/libraries/frameworks you use?,"Truffle - not his favourite, but best thing ou...","IDE, Go"
22,What are you most excited about in the short t...,What dev tools on near horizon that would chan...,"wallets, dapps, market"
23,What are your biggest frustrations?,"Dapps: web3js stuff sucks. In the doc, it’s in...","documentation, UX, transaction, light client, ..."
24,What tools don’t exist at the moment?,Things I want improved with truffle: it has a ...,"ecosystem, local dummy client, continuous inte..."


In [14]:
projects_df

id,Questions,Answer,Projects
2,How do you handle testing?,Just Truffle for tests\nMocha for unit and fun...,"Truffle, Mocha, Mthril"
7,What are the tools/libraries/frameworks you use?,"Truffle for building, testing and compiling\nC...","Open Zeppelin, Truffle, Ganache, Ethereum, not..."
8,What are you most excited about in the short t...,Proof of Stake overlays will be really interes...,"Proof of Stake, Casper, eWASM"
10,What tools don’t exist at the moment?,The community is doing a good job and a lot of...,"ETHGlobal, Solcoverage, parity"
11,What was the hardest part about learning to de...,Having a sequential getting started stuff on e...,"Solidity, Ethereum, Geth, GitHub, reddit"
13,Who are you and what are you working on?,"Full stack web dev, working in finance and som...","Ethereum, Modular.network, AWS, Blossom, Statu..."
21,What are the tools/libraries/frameworks you use?,"Truffle - not his favourite, but best thing ou...","Truffle, VIM, Etherscan, built_our_own"
22,What are you most excited about in the short t...,What dev tools on near horizon that would chan...,"Status, web3.js, SNARKs, MetaMask, Cipher"
23,What are your biggest frustrations?,"Dapps: web3js stuff sucks. In the doc, it’s in...","not_web3.js, ethers.js, MetaMask, Geth, Vipnod..."
24,What tools don’t exist at the moment?,Things I want improved with truffle: it has a ...,Truffle


591 rows for topics
593 rows for projects

A bit on the smallish side but good enough for starters.

Note that this is really small for neural networks, for example, so we won't get fancy there.

-----

We won't split train and validation right away. We will use cross-validation, which splits the data for us
in several different ways so that we don't optimize for one specific train/validation split.
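To make the idea concrete, here is a minimal pure-Python sketch of how a k-fold split partitions row indices. The helper name `k_fold_indices` is hypothetical, and the actual splitting in this notebook is left to scikit-learn.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; every sample lands in a validation fold exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, val
        start += size

folds = list(k_fold_indices(10, 5))
for train, val in folds:
    print("train:", train, "validation:", val)
```

The model is trained and scored once per fold, and the scores are then averaged.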

So now cleanup time

In [15]:
del df
del cleaned

# Choosing an evaluation metric for the model

For us humans it's easy to evaluate whether an answer matches a topic or project, but machine learning models are not on/off. Simple models will give us a probability of 20%, 50% or 80%, and it's up to us to determine the threshold (or we could use a fancy neural net model that learns the thresholds).

We could also get fancy and penalize false positives (wrong topic added) more than false negatives (topic not added), but let's keep it simple for now.

We will set our threshold to 50%: if the model outputs less than 50% probability we consider it a negative, and a positive otherwise.
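The thresholding rule can be sketched in a few lines. The function name `apply_threshold` and the example probabilities are purely illustrative.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Turn per-label probabilities into 0/1 decisions: below the threshold is negative."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical probabilities for three labels on one answer
decisions = apply_threshold([0.2, 0.5, 0.8])
print(decisions)
```

Raising the threshold would trade false positives for false negatives, which is the knob mentioned above.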

# Data preparation, Feature extraction & engineering on Topics

We will use a simple Latent Semantic Analysis technique ([Wikipedia](https://en.wikipedia.org/wiki/Latent_semantic_analysis), [Stanford NLP](https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html)) for feature engineering.

In short, we transform the text into a term-document frequency (and inverse document frequency) matrix, i.e. how often each word appears in a document, weighted by how rare it is across all documents. We project that into a lower-dimensional vector space (say 2, 3, or 100 dimensions); this will be the lexical/semantic field of our documents. Then the model is trained to associate those semantic fields with our desired labels.
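A minimal sketch of the two LSA steps on a handful of made-up sentences (they are not taken from the interviews), assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Invented toy documents, for illustration only
docs = [
    "Truffle debugger is slow",
    "We need a better debugger for contracts",
    "Sharding and state channels will help scalability",
    "Scalability needs sharding",
]

# Step 1: sparse TF-IDF matrix, one row per document, one column per term
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)

# Step 2: project the term space onto 2 latent dimensions
svd = TruncatedSVD(n_components=2, random_state=0)
Z = svd.fit_transform(X)

print(X.shape, '->', Z.shape)
```

Documents that share vocabulary should end up close together in the reduced space, which is what the classifier later works with.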

## Note on Pipelines

For simplicity we will use Scikit-Learn pipelines to represent our series of transformations, including the final classifier. They have the following caveats:

  - Inefficient: during "cross-validation" (validation on unseen data) they ensure that unseen data never leaks into the train dataset (when computing a "mean" for example). However, this means redoing the intermediate computations for each "fold" (a train + validation dataset pair), even though many intermediate computations do not leak.
  - Does not support early stopping: early stopping lets us tune the complexity of tree models without manual guesswork on the ideal number of trees, i.e. you keep adding trees until it stops helping.
  - Other advanced use limitations: no caching support, no out-of-fold predictions.

## Imports

In [16]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import cross_val_score

## Preparing target column for Multilabeling
First let's see what it looks like

In [17]:
topics_df['Topics']

2       unit tests, testing, functional tests, contact...
7       testing, compile, deployment, integration, con...
10         community, ETHGlobal, visualisation, logging, 
11                   readthedocs, websockets, transaction
13      tokens, tokens, open source documentation, ERC...
15                                                  human
21                                                IDE, Go
22                                 wallets, dapps, market
23      documentation, UX, transaction, light client, ...
24      ecosystem, local dummy client, continuous inte...
25                           ecosystem, security, browser
31                              bounties, ERC, volatility
37      usability, signing, ICO, bugs, education, boun...
41      UI, wallet, keys, signing, gas, contracts, tra...
44                visualisation, fuzz testing, unit tests
48                 readthedocs, stack, opcodes, community
49      modularity, updatability, opcodes, open source...
52         cac

We need to feed that to "MultiLabelBinarizer".

Since it expects a list we will use split on commas (with or without whitespace before/after) to transform those into a list.
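As a small illustration of what the binarizer produces, here is a sketch on three invented label strings (including one with a trailing comma), assuming scikit-learn is installed:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Invented label strings in the same comma-separated format as the Topics column
raw = ["debugger, gas", "gas", "community, debugger, "]

# Split on commas, trim whitespace and drop empty labels
labels = [[x.strip() for x in s.split(',') if x.strip()] for s in raw]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

print(mlb.classes_)  # one column per distinct label, in sorted order
print(y)             # one row per answer, 1 where the label applies
```

Each row of `y` is the multi-hot encoding of one answer's labels.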

In [18]:
topics_mlb = MultiLabelBinarizer()


In [19]:
# We split on comma and only keep non-empty labels.
# The third row (index 10), for example, has a trailing comma followed by a space
topics_df['Topics'].apply(lambda labels: [x.strip() for x in labels.split(',') if x.strip()])

2       [unit tests, testing, functional tests, contac...
7       [testing, compile, deployment, integration, co...
10         [community, ETHGlobal, visualisation, logging]
11                 [readthedocs, websockets, transaction]
13      [tokens, tokens, open source documentation, ER...
15                                                [human]
21                                              [IDE, Go]
22                               [wallets, dapps, market]
23      [documentation, UX, transaction, light client,...
24      [ecosystem, local dummy client, continuous int...
25                         [ecosystem, security, browser]
31                            [bounties, ERC, volatility]
37      [usability, signing, ICO, bugs, education, bou...
41      [UI, wallet, keys, signing, gas, contracts, tr...
44              [visualisation, fuzz testing, unit tests]
48               [readthedocs, stack, opcodes, community]
49      [modularity, updatability, opcodes, open sourc...
52       [cach

In [20]:
# Now assign the transformed labels
y_topics = topics_mlb.fit_transform(
    topics_df['Topics'].apply(lambda labels: [x.strip() for x in labels.split(',') if x.strip()])
)
topics_mlb.classes_

array(['0x', 'ABI', 'ABI Encoding', 'ABIEncoderV2', 'ABIEncoding', 'AI',
       'AST', 'BigNumber', 'C', 'C++', 'DAO', 'DHT', 'DSL', 'EIP', 'ERC',
       'ETHGlobal', 'EuroToken', 'Go', 'Haskell', 'Human', 'ICO', 'IDE',
       'IDEA', 'IOT', 'Java', 'Jelly', 'LLL', 'MiniMe', 'NFT', 'Natspec',
       'Optimise', 'Proof of Stake', 'RLP', 'RPC', 'Radspec', 'Ruby',
       'Rust', 'SNARKs', 'STARKs', 'Schnorr signatures', 'TCRs',
       'Token Curated Registries', 'UI', 'UX', 'Vitalik', 'abigen',
       'adoption', 'analysis', 'analytics', 'arbitration', 'architecture',
       'art', 'artifacts', 'assembly', 'assets', 'attack', 'auction',
       'audit', 'audits', 'automatic', 'automatically', 'beige paper',
       'best practices', 'block explorer', 'blockchain explorer',
       'bloom filters', 'boilerplate', 'bootstrap', 'bounties', 'browser',
       'bug bounties', 'bugs', 'bunties', 'business logic', 'bytecode',
       'caching', 'chairty', 'code coverage', 'code review',
       'code 

Seems like we have some tagging problems: "debugger", "debuggers", "securiy", "unti tests", "ecosystem", "ecosystems", "open souce"...

We will continue like this for now.
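Should we want to clean the tags later, one possible way to spot near-duplicates is a string-similarity pass with the standard library. This is only a sketch: the helper `near_duplicates` and the 0.85 cutoff are assumptions, not something used in the rest of this notebook.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(labels, cutoff=0.85):
    """Return pairs of distinct labels whose similarity ratio reaches the cutoff."""
    pairs = []
    for a, b in combinations(sorted(set(labels)), 2):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff:
            pairs.append((a, b))
    return pairs

# A few tags of the kind seen above, plus their intended spellings
tags = ["debugger", "debuggers", "security", "securiy",
        "unit tests", "unti tests", "gas"]
suspects = near_duplicates(tags)
print(suspects)
```

The flagged pairs would still need a manual review before merging, since similar spellings do not always mean the same topic.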



In [21]:
print(y_topics.shape)

(591, 361)


Wow, we have 361 possible topics!

## Feature pipeline

In [22]:
mapper = DataFrameMapper([
    ('Questions', [TfidfVectorizer(max_features=2**16,
                                   min_df=1, stop_words='english',
                                   use_idf=True), # Create term-document frequencies and inverse frequencies
                   TruncatedSVD(20)]),            # Project on 20-dimension space (the text is very short)
    ('Answer', [TfidfVectorizer(max_features=2**16,
                                 min_df=1, stop_words='english',
                                 use_idf=True), # Create term-document frequencies and inverse frequencies
                 TruncatedSVD(100)])            # Project on 100-dimension space
])
pipeline = Pipeline([
    ('mapper_step', mapper),
    # Most models don't support multi-label classification
    # We will train one simple LogisticRegression classifier per target label with the OneVsRestClassifier wrapper
    # LogisticRegression is simple but fast, which is needed since we have 361 topics at the moment
    ('OvR_logreg', OneVsRestClassifier(
        LogisticRegression(random_state = 1337, n_jobs = 1),
        n_jobs = -1 # Launch as many classifiers as we have cores
    ))
])

## Measuring model performance

In [23]:
def crossval(pipe, X_train, y_train, n_folds):
    # We don't test in parallel as the classifier is already parallel
    cv = cross_val_score(pipe, X_train, y_train, cv=n_folds, n_jobs=1)
    print("Cross Validation Scores are: ", cv.round(4))
    print("Mean CrossVal score is: ", round(cv.mean(),4))
    print("Std Dev CrossVal score is: ", round(cv.std(),4))
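Note that `cross_val_score` is called here without a `scoring` argument, so it falls back on the classifier's own `score` method. For multi-label targets that is, to the best of my knowledge, subset accuracy: a prediction only counts if every label of the row is right. A pure-Python sketch of that metric, with a hypothetical helper name and invented labels:

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label vector matches the truth exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)

# Invented multi-hot labels: 3 answers, 3 possible topics
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]  # second row has one extra topic

score = subset_accuracy(y_true, y_pred)
print(round(score, 4))
```

With hundreds of possible topics this is a strict metric, so low scores should be expected even when most individual labels are predicted correctly.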

In [24]:
X_topics = topics_df.drop('Topics', axis = 1)
X_topics.head(10)

id,Questions,Answer
2,How do you handle testing?,Just Truffle for tests\nMocha for unit and fun...
7,What are the tools/libraries/frameworks you use?,"Truffle for building, testing and compiling\nC..."
10,What tools don’t exist at the moment?,The community is doing a good job and a lot of...
11,What was the hardest part about learning to de...,Having a sequential getting started stuff on e...
13,Who are you and what are you working on?,"Full stack web dev, working in finance and som..."
15,How do you handle smart contract verification ...,Write a lot of tests myself. Get other people ...
21,What are the tools/libraries/frameworks you use?,"Truffle - not his favourite, but best thing ou..."
22,What are you most excited about in the short t...,What dev tools on near horizon that would chan...
23,What are your biggest frustrations?,"Dapps: web3js stuff sucks. In the doc, it’s in..."
24,What tools don’t exist at the moment?,Things I want improved with truffle: it has a ...


In [25]:
crossval(pipeline, X_topics, y_topics, 5) # We test our model with 5 different splits of train/validation set

Cross Validation Scores are:  [0.     0.     0.0085 0.     0.    ]
Mean CrossVal score is:  0.0017
Std Dev CrossVal score is:  0.0034


Ugh, something seems wrong: the score is essentially zero on every split.
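One possible reason for scores this close to zero, assuming `crossval` relies on scikit-learn's default accuracy scoring: for multilabel targets that score is subset accuracy, which only counts an answer as correct when every one of its labels matches exactly. The sketch below uses made-up toy arrays (not the interview data) and hand-written helper functions to compare that strict metric with micro-averaged F1, which gives partial credit.

```python
import numpy as np

# Toy multilabel data, NOT the interview dataset.
# Rows are answers, columns are labels (1 = label applies).
y_true = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
])
y_pred = np.array([
    [1, 0, 0, 0],  # one of two labels found
    [0, 1, 0, 0],  # exact match
    [1, 0, 0, 0],  # one of two labels found
    [0, 0, 0, 0],  # nothing predicted
])


def subset_accuracy(y_true, y_pred):
    """Share of rows whose full label set is predicted exactly."""
    return float(np.mean(np.all(y_true == y_pred, axis=1)))


def micro_f1(y_true, y_pred):
    """F1 computed over all (row, label) cells pooled together."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


subset = subset_accuracy(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
print("Subset accuracy:", round(subset, 4))
print("Micro F1:", round(f1, 4))
```

With hundreds of possible labels, an exact match on the whole label set is rare, so a partial-credit metric may describe the model more fairly. In scikit-learn the same quantities are available as `accuracy_score` and `f1_score(..., average='micro')`.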

Let's fit the model on the full dataset and look at what it predicts for that same input data.

In [26]:
pipeline.fit(X_topics, y_topics)

Pipeline(memory=None,
     steps=[('mapper_step', DataFrameMapper(default=False, df_out=False,
        features=[('Questions', [TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=65536, m...1337, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=-1))])

In [29]:
topics_mlb.inverse_transform(pipeline.predict(X_topics))

[(),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 ('audit',),
 ('contracts',),
 (),
 (),
 (),
 (),
 (),
 ('audit',),
 (),
 (),
 (),
 ('audit',),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 ('audit',),
 (),
 (),
 (),
 (),
 (),
 ('contracts',),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 ('javascript', 'unit tests'),
 (),
 ('contracts',),
 (),
 (),
 ('contracts',),
 ('audit',),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 ('contracts',),
 ('contracts',),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 ('audit',),
 (),
 (),
 ('contracts',),
 (),
 ('audit',),
 (),
 (),
 (),
 (),
 (),
 ('contracts',),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 ('contracts',),
 (),
 (),
 ('audit',),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (

Yep, that's super wrong: even on the data it was trained on, the model predicts no topic at all for most answers. We probably need more data, and also a more complex model.
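To get a feel for how little data each label has, one option is to count how many answers carry each tag before training. This is a minimal sketch on a hypothetical handful of tag strings (the `tagged` list and the `min_support` threshold are assumptions for illustration, not values from the dataset); it reuses the same comma-splitting logic as the binarizer cells.

```python
from collections import Counter

# Hypothetical stand-in for the comma-separated tag column.
tagged = [
    "contracts, audit",
    "audit, unit tests",
    "contracts",
    "state channels",
    "javascript, unit tests",
]


def split_labels(cell):
    """Split a comma-separated tag cell into clean, non-empty labels."""
    return [x.strip() for x in cell.split(',') if x.strip()]


# Number of answers in which each label appears.
counts = Counter(label for cell in tagged for label in split_labels(cell))

# Labels seen fewer than `min_support` times give a classifier
# almost nothing to learn from (threshold chosen arbitrarily here).
min_support = 2
frequent = {label for label, n in counts.items() if n >= min_support}
rare = sorted(set(counts) - frequent)

print("Frequent labels:", sorted(frequent))
print("Rare labels:", rare)
```

If most labels turn out to be rare, restricting the target to the frequent ones (or merging rare ones into broader categories) may be worth trying before switching models.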

# Projects


In [30]:
projects_df['Projects']

2                                  Truffle, Mocha, Mthril
7       Open Zeppelin, Truffle, Ganache, Ethereum, not...
8                           Proof of Stake, Casper, eWASM
10                         ETHGlobal, Solcoverage, parity
11               Solidity, Ethereum, Geth, GitHub, reddit
13      Ethereum, Modular.network, AWS, Blossom, Statu...
21                 Truffle, VIM, Etherscan, built_our_own
22              Status, web3.js, SNARKs, MetaMask, Cipher
23      not_web3.js, ethers.js, MetaMask, Geth, Vipnod...
24                                                Truffle
25                                Truffle, MetaMask, Mist
27                                          ThousandEther
31                                      Gitcoin, MakerDAO
37      MakerDAO, RChain, MetaMask, Ethereum, Whymarrh...
41                 MetaMask, Qhymarrh, EthereumJS, eth.js
44                                         Mythril, Mocha
48        Gitter, GitHub, Slack, Stack overflow, Ethereum
49      Truffl

In [31]:
projects_mlb = MultiLabelBinarizer()

In [33]:
y_projects = projects_mlb.fit_transform(
    projects_df['Projects'].apply(lambda labels: [x.strip() for x in labels.split(',') if x.strip()])
)
projects_mlb.classes_

array(['0x', '1Protocol', '4byte.io', 'ARES', 'AWS', 'AdChain', 'Adtoken',
       'Aion', 'Airbit', 'Alethio', 'Ambisafe', 'Ansible', 'AppDynamics',
       'Apple', 'Aragon', 'Argus', 'ArtDAO', 'Atom', 'Augur', 'Ava',
       'AwesomeList', 'Bamboo', 'Bancor', 'Berkeley', 'Biddler',
       'BitFury', 'Bitcoin', 'Blockgeeks', 'Blockseer', 'Blossom',
       'Bounties.network', 'Braid', 'Brave', 'Bunz', 'Capture The Ether',
       'Capture the Ether', 'Cardano', 'Casper', 'Chai', 'Chainshot',
       'Chrome', 'Chronologic', 'Cipher', 'Circle', 'Circles', 'Coinbase',
       'Colony', 'Consensys', 'Cosmos', 'Counterfactual', 'Coursera',
       'CryptoKitties', 'CryptoNYC', 'CryptoZombies', 'CryptoZombiew',
       'Cryptomechanics.info', 'Cure52', 'DARPA', 'DAT Protocol',
       'Dagger', 'Dapphub', 'Dapple', 'Dappnode', 'Decentraland',
       'Deja Vu', 'Dfinity', 'Dharma', 'Digital Ocean', 'District0x',
       'Django', 'Docker', 'Drizzle', 'EMACS', 'ENS', 'EOS', 'ETHFiddle',
       'ETHGas

In [35]:
print(y_projects.shape)

(593, 383)


383 projects as well :/
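Part of that count may come from near-duplicate labels: the `classes_` output above lists both 'Capture The Ether' and 'Capture the Ether', and both 'CryptoZombies' and 'CryptoZombiew'. The sketch below flags such pairs with the standard library's `difflib`; the 0.9 similarity cutoff is an arbitrary assumption, and any flagged pair would still need a manual check before merging.

```python
from difflib import SequenceMatcher
from itertools import combinations

# A few labels copied from the classes_ output above.
labels = [
    'Capture The Ether',
    'Capture the Ether',
    'CryptoZombies',
    'CryptoZombiew',
    'Truffle',
]


def similarity(a, b):
    """Case-insensitive similarity ratio between two labels (0.0 to 1.0)."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


# Pairs that look like the same project written two ways.
suspects = [
    (a, b)
    for a, b in combinations(labels, 2)
    if similarity(a, b) >= 0.9  # arbitrary cutoff, tune by inspection
]

for a, b in suspects:
    print(f"{a!r} <-> {b!r}")
```

Comparing every pair is quadratic in the number of labels, which is still cheap for a few hundred project names.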

In [38]:
X_projects = projects_df.drop('Projects', axis = 1)
X_projects.head(10)

id,Questions,Answer
2,How do you handle testing?,Just Truffle for tests\nMocha for unit and fun...
7,What are the tools/libraries/frameworks you use?,"Truffle for building, testing and compiling\nC..."
8,What are you most excited about in the short t...,Proof of Stake overlays will be really interes...
10,What tools don’t exist at the moment?,The community is doing a good job and a lot of...
11,What was the hardest part about learning to de...,Having a sequential getting started stuff on e...
13,Who are you and what are you working on?,"Full stack web dev, working in finance and som..."
21,What are the tools/libraries/frameworks you use?,"Truffle - not his favourite, but best thing ou..."
22,What are you most excited about in the short t...,What dev tools on near horizon that would chan...
23,What are your biggest frustrations?,"Dapps: web3js stuff sucks. In the doc, it’s in..."
24,What tools don’t exist at the moment?,Things I want improved with truffle: it has a ...


In [39]:
crossval(pipeline, X_projects, y_projects, 5) # We can reuse the previous pipeline


Cross Validation Scores are:  [0.     0.0336 0.0084 0.     0.0254]
Mean CrossVal score is:  0.0135
Std Dev CrossVal score is:  0.0137


In [40]:
pipeline.fit(X_projects, y_projects)

Pipeline(memory=None,
     steps=[('mapper_step', DataFrameMapper(default=False, df_out=False,
        features=[('Questions', [TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=65536, m...1337, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
          n_jobs=-1))])

In [41]:
projects_mlb.inverse_transform(pipeline.predict(X_projects))

[(),
 ('Truffle',),
 (),
 (),
 (),
 (),
 ('Remix', 'Truffle'),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 ('Solidity',),
 ('Ethereum',),
 (),
 (),
 ('Ethereum',),
 (),
 (),
 ('Ethereum',),
 (),
 ('Ethereum',),
 (),
 (),
 (),
 (),
 ('Truffle',),
 (),
 ('Parity', 'Truffle'),
 (),
 ('Solidity',),
 (),
 (),
 (),
 (),
 (),
 (),
 ('Ethereum',),
 (),
 (),
 (),
 (),
 (),
 ('Ethereum',),
 (),
 (),
 (),
 (),
 ('Truffle',),
 (),
 (),
 (),
 (),
 (),
 ('Ethereum',),
 ('Ethereum',),
 ('Ethereum',),
 ('Ethereum',),
 ('Truffle',),
 (),
 (),
 (),
 (),
 ('Solidity',),
 (),
 (),
 ('Solidity', 'Truffle'),
 ('Ethereum',),
 ('Consensys',),
 (),
 (),
 (),
 (),
 ('Ethereum',),
 ('Truffle',),
 (),
 ('Remix', 'Truffle'),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 ('Ethereum',),
 (),
 (),
 (),
 ('Truffle',),
 ('Solidity',),
 (),
 ('Ethereum',),
 (),
 (),
 ('Remix', 'Solidity', 'Truffle'),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 (),
 ('Ethereum',),
 (),
 (),
 (),
 ('Ethereum',

Seems like our logistic regression model has an easier time with projects. It's still not perfect though: most answers still get no project at all.
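A rough way to put a number on "easier time" is to measure how many predictions come back empty and which labels the model actually emits. The sketch below runs on the first seven tuples of the output above, copied by hand; on the real data, `predicted` would be the full list returned by `projects_mlb.inverse_transform(...)`.

```python
from collections import Counter

# First seven predictions copied from the output above.
predicted = [
    (),
    ('Truffle',),
    (),
    (),
    (),
    (),
    ('Remix', 'Truffle'),
]

# Share of answers for which the model predicted no label at all.
empty_share = sum(1 for labels in predicted if not labels) / len(predicted)

# How often each label is predicted across all answers.
label_counts = Counter(label for labels in predicted for label in labels)

print(f"Empty predictions: {empty_share:.0%}")
print("Most predicted labels:", label_counts.most_common(3))
```

Running the same two measurements on the topics predictions would make the comparison between the two targets explicit.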