## Create classes

This notebook is the first one to run in the TechRank pipeline. Here we load the data and create the classes for companies and technologies.

Then we save the results as two dictionaries (company_name:class_company and tech_name:class_tech), which contain all the information needed for the next steps.

### Table of contents:

* [Download data from CSV](#down)
* [Data cleaning](#cleaning)
* [Select companies in cybersecurity](#cyber)
* [Create graph and dictionaries](#create_graph)
* [Save graph and dictionaries](#save)
* [Quick loop](#loop0)

In [1]:
flag_cybersecurity = True

In [2]:
import math
import arrow
import ipynb 
import os.path

import json
import pickle
import sys
import random
import operator

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as ss
import numpy as np

from dotenv import load_dotenv
from networkx.algorithms.bipartite.matrix import biadjacency_matrix
from networkx.algorithms import bipartite
from importlib import reload
from typing import List


In [3]:
# import functions from py file 

import functions.fun
reload(functions.fun)
from functions.fun import CB_data_cleaning, df_from_api_CB, extract_nodes, extract_data_from_column, field_extraction
from functions.fun import nx_dip_graph_from_pandas, plot_bipartite_graph, filter_dict, check_desc
from functions.fun import extract_classes_company_tech, degree_bip, insert_data_classes

In [4]:
# import functions from py file 

import functions.fun_meth_reflections
reload(functions.fun_meth_reflections)
from functions.fun_meth_reflections import zero_order_score, Gct_beta, Gtc_alpha, make_G_hat, next_order_score, generator_order_w
from functions.fun_meth_reflections import M_test_triangular, w_stream, find_convergence, rank_df_class, w_star_analytic

In [5]:
# import classes 

import classes
reload(classes)

<module 'classes' from '/home/anita.mezzetti/bipartite_network/classes.py'>

## Download data from CSV <a class="anchor" id="down"></a>

In [6]:
df_start = pd.read_csv("data/data_cb/organizations.csv")

df_start.head()

Unnamed: 0,uuid,name,type,permalink,cb_url,rank,created_at,updated_at,legal_name,roles,...,phone,facebook_url,linkedin_url,twitter_url,logo_url,alias1,alias2,alias3,primary_role,num_exits
0,e1393508-30ea-8a36-3f96-dd3226033abd,Wetpaint,organization,wetpaint,https://www.crunchbase.com/organization/wetpaint,158955.0,2007-05-25 13:51:27,2019-06-24 22:19:25,,company,...,206-859-6300,https://www.facebook.com/Wetpaint,https://www.linkedin.com/company/wetpaint,https://twitter.com/wetpainttv,https://res.cloudinary.com/crunchbase-producti...,,,,company,
1,bf4d7b0e-b34d-2fd8-d292-6049c4f7efc7,Zoho,organization,zoho,https://www.crunchbase.com/organization/zoho,6686.0,2007-05-26 02:30:28,2018-10-27 00:29:49,,"investor,company",...,,http://www.facebook.com/zoho,http://www.linkedin.com/company/zoho-corporati...,http://twitter.com/zoho,https://res.cloudinary.com/crunchbase-producti...,,,,company,1.0
2,5f2b40b8-d1b3-d323-d81a-b7a8e89553d0,Digg,organization,digg,https://www.crunchbase.com/organization/digg,7793.0,2007-05-26 03:03:23,2018-12-10 10:09:14,"Digg Holdings, LLC",company,...,877-342-7222,http://www.facebook.com/digg,http://www.linkedin.com/company/digg,http://twitter.com/digg,https://res.cloudinary.com/crunchbase-producti...,,,,company,
3,f4d5ab44-058b-298b-ea81-380e6e9a8eec,Omidyar Network,organization,omidyar-network,https://www.crunchbase.com/organization/omidya...,136861.0,2007-05-26 03:21:34,2019-06-19 12:17:48,,investor,...,650.482.2500,http://www.facebook.com/OmidyarNetwork,http://www.linkedin.com/company/22806,http://twitter.com/OmidyarNetwork,https://res.cloudinary.com/crunchbase-producti...,,,,investor,38.0
4,df662812-7f97-0b43-9d3e-12f64f504fbb,Facebook,organization,facebook,https://www.crunchbase.com/organization/facebook,47.0,2007-05-26 04:22:15,2021-04-14 23:52:25,"Facebook, Inc.","investor,company",...,,https://www.facebook.com/facebook/,http://www.linkedin.com/company/facebook,https://twitter.com/facebook,https://res.cloudinary.com/crunchbase-producti...,,,,company,


In [7]:
df_start.columns

Index(['uuid', 'name', 'type', 'permalink', 'cb_url', 'rank', 'created_at',
       'updated_at', 'legal_name', 'roles', 'domain', 'homepage_url',
       'country_code', 'state_code', 'region', 'city', 'address',
       'postal_code', 'status', 'short_description', 'category_list',
       'category_groups_list', 'num_funding_rounds', 'total_funding_usd',
       'total_funding', 'total_funding_currency_code', 'founded_on',
       'last_funding_on', 'closed_on', 'employee_count', 'email', 'phone',
       'facebook_url', 'linkedin_url', 'twitter_url', 'logo_url', 'alias1',
       'alias2', 'alias3', 'primary_role', 'num_exits'],
      dtype='object')

## Data Cleaning <a class="anchor" id="cleaning"></a>

We use the company name as the key. In the future, it would be better to use the `uuid`, since names may not be unique.

- `df_start`: dataset before cleaning
- `df`: dataset after cleaning
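
To illustrate why the key matters, here is a minimal sketch on toy data (the rows below are invented for illustration and are not taken from the Crunchbase export). It shows that a dictionary keyed by `name` silently overwrites entries when two organizations share a name, while a dictionary keyed by `uuid` keeps all of them.

```python
import pandas as pd

# Toy data (hypothetical): two distinct organizations share the same name.
toy = pd.DataFrame(
    {
        "uuid": ["a1", "b2", "c3"],
        "name": ["Acme", "Acme", "Globex"],
        "rank": [10.0, 250.0, 42.0],
    }
)

# Keying by name keeps only the last row for each duplicated name.
by_name = {row["name"]: row["rank"] for _, row in toy.iterrows()}

# Keying by uuid keeps every organization.
by_uuid = {row["uuid"]: row["rank"] for _, row in toy.iterrows()}

# Number of rows whose name already appeared earlier in the frame.
n_duplicated_names = int(toy["name"].duplicated().sum())

print(len(by_name), len(by_uuid), n_duplicated_names)
```

Running `df["name"].duplicated().sum()` on the cleaned dataset would be a quick way to check how many companies are affected.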

In [8]:
# we create the lists needed as input in the function to clean the data

to_drop = [
    'type',
    'permalink',
    'cb_url',   
    'created_at',
    'domain',
    'address',
    'state_code',
    'updated_at',
    'legal_name',
    'roles',
    'postal_code',
    'homepage_url',
    'num_funding_rounds',
    'total_funding_currency_code',
    'phone',
    'email',
    'num_exits',
    'alias2',
    'alias3',
    'logo_url',
    'alias1',
    'last_funding_on',
    'twitter_url',
    'facebook_url'
]

# to_rename = { 'category_groups_list': 'category_groups' }
to_rename = { 'category_list': 'category_groups' }

drop_if_nan = [
    'category_groups',
    'rank',
    'short_description'
]

to_check_double = {}

sort_by = "rank"

In [9]:
# clean data: from df_start to df
df = CB_data_cleaning(df_start, to_drop, to_rename, to_check_double, drop_if_nan, sort_by)

In [10]:
# show cleaned dataset
df.head()

Unnamed: 0,uuid,name,rank,country_code,region,city,status,short_description,category_groups,category_groups_list,total_funding_usd,total_funding,founded_on,closed_on,employee_count,linkedin_url,primary_role
1178,74a20af3-f4dd-6188-de60-c4ee6cd0ca4a,Ant Group,1.0,CHN,Zhejiang,Hangzhou,operating,Ant Group strives to enable all consumers and ...,"Banking,Financial Services,FinTech,Payments","Financial Services,Lending and Investments,Pay...",22000000000.0,22000000000.0,2014-10-01,,5001-10000,https://www.linkedin.com/company/antgroup/,company
4042,022417b5-4980-6c54-0f3c-6736bbbb1a5e,Spotify,2.0,SWE,Stockholms Lan,Stockholm,ipo,Spotify is a commercial music streaming servic...,"Audio,Cloud Computing,Music,Music Streaming,Vi...","Content and Publishing,Internet Services,Media...",2085425000.0,2085425000.0,2006-04-23,,5001-10000,http://www.linkedin.com/company/spotify,company
349,468bef9f-2f50-590e-6e78-62e3adb05aa1,Citi,3.0,USA,New York,New York,ipo,Citigroup is a diversified financial services ...,"Banking,Credit Cards,Financial Services,Wealth...","Financial Services,Lending and Investments,Pay...",8700000000.0,8700000000.0,1998-10-08,,10000+,https://www.linkedin.com/company/citi,investor
211260,a40d0a1f-f32c-a1e9-1bbd-a10bb0eca2e7,Deliveroo,4.0,GBR,England,London,ipo,Deliveroo is an online food delivery service t...,"Delivery,Food and Beverage,Food Delivery,Same ...","Administrative Services,Food and Beverage,Tran...",1712683000.0,1712683000.0,2012-01-01,,5001-10000,https://www.linkedin.com/company/deliveroo,company
621119,00daca16-8311-454b-84e0-24a40d16be9c,Antler,5.0,SGP,Central Region,Singapore,operating,Antler is a global early-stage venture capital...,Venture Capital,"Financial Services,Lending and Investments",78000000.0,78000000.0,2017-01-01,,101-250,https://www.linkedin.com/company/antlerglobal/,investor


In [11]:
df.columns

Index(['uuid', 'name', 'rank', 'country_code', 'region', 'city', 'status',
       'short_description', 'category_groups', 'category_groups_list',
       'total_funding_usd', 'total_funding', 'founded_on', 'closed_on',
       'employee_count', 'linkedin_url', 'primary_role'],
      dtype='object')

In [12]:
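# convert category_groups from a comma-separated string to a list

def convert_to_list(string):
    """Split a comma-separated string into a list of category names."""
    return string.split(",")

# only convert if the column has not been converted already
if not isinstance(df["category_groups"][df.index[0]], list):
    df["category_groups"] = [convert_to_list(x) for x in df["category_groups"]]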

In [13]:
len(df)

1254665

### Select companies in cybersecurity <a class="anchor" id="cyber"></a>


We select only companies that work in the cybersecurity field. The algorithm extends easily to any other field: we only have to change the _field_words_ list.

Please note that if we also want to select a sub-sample, we have to cut the dataset at this stage (as is done in the quick loop at the end of this notebook).
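
The implementation of `field_extraction` lives in `functions/fun.py` and is not shown here. As a rough sketch of the idea only, a keyword-based filter could look like the following; the helper name `select_by_field_words`, the toy rows, and the choice of matching on `short_description` are all assumptions made for illustration.

```python
import pandas as pd


def select_by_field_words(frame, field_words, columns=("short_description",)):
    """Keep rows where any listed column mentions any field word (case-insensitive)."""
    words = [w.lower() for w in field_words]
    mask = pd.Series(False, index=frame.index)
    for col in columns:
        text = frame[col].fillna("").astype(str).str.lower()
        for word in words:
            # Plain substring match; regex=False avoids special-character surprises.
            mask |= text.str.contains(word, regex=False)
    return frame[mask]


# Toy data (hypothetical), not taken from the Crunchbase export.
toy = pd.DataFrame(
    {
        "name": ["SecureCo", "FarmCo", "CryptoLock"],
        "short_description": [
            "Cloud security platform",
            "Organic farming supplies",
            "Encryption for mobile apps",
        ],
    }
)

field_words = ["security", "encryption"]
selected = select_by_field_words(toy, field_words)
print(list(selected["name"]))
```

Switching to another field would then only require a different `field_words` list, which is the property described above.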


In [14]:
flag_cybersecurity

True

In [15]:
if flag_cybersecurity:
    df, flag_cybersecurity = field_extraction('cybersecurity', df)
else:
    df, flag_cybersecurity = field_extraction('medicine', df)

In [16]:
df.head()

Unnamed: 0,uuid,name,rank,country_code,region,city,status,short_description,category_groups,category_groups_list,total_funding_usd,total_funding,founded_on,closed_on,employee_count,linkedin_url,primary_role
397698,b99936db-3b9f-4397-e03b-7ddb70c2dc00,OneTrust,107.0,USA,Georgia,Atlanta,operating,"OneTrust helps companies manage privacy, secur...","[Compliance, Privacy, Risk Management, Software]","Privacy and Security,Professional Services,Sof...",920000000.0,920000000.0,2016-01-01,,1001-5000,https://www.linkedin.com/company/onetrust,company
362482,655ff5a2-33d2-dfe5-af13-20866a58a5c0,BigID,152.0,USA,New York,New York,operating,BigID is a data intelligence company developin...,"[Artificial Intelligence, Big Data, Cyber Secu...","Artificial Intelligence,Data and Analytics,Inf...",246099999.0,246099999.0,2016-02-01,,101-250,https://www.linkedin.com/company/bigid,company
75057,3e9d5f7e-7301-b645-66af-eb756892af3a,Zscaler,170.0,USA,California,San Jose,ipo,Zscaler is a global cloud-based information se...,"[Cloud Security, Cyber Security, Enterprise So...","Information Technology,Privacy and Security,So...",148000000.0,148000000.0,2008-01-01,,1001-5000,https://www.linkedin.com/company/zscaler/,company
740558,7ca00768-e6a1-4a58-9aff-871a46bf8971,AppOmni,280.0,USA,California,San Francisco,operating,AppOmni is a SaaS security management solution...,"[Cloud Management, Cloud Security, Cyber Secur...","Information Technology,Internet Services,Priva...",53000000.0,53000000.0,2018-01-01,,51-100,https://www.linkedin.com/company/appomni,company
21622,8e3a72ba-b0af-f535-615e-477ce5ba119e,SimpliSafe,323.0,USA,Massachusetts,Boston,acquired,SimpliSafe is a home security solutions provid...,"[Consumer Electronics, Security, Sensor, Smart...","Consumer Electronics,Hardware,Privacy and Secu...",187000000.0,187000000.0,2006-01-01,,501-1000,http://www.linkedin.com/company/simplisafe,company


In [17]:
df['short_description'].values

array(['OneTrust helps companies manage privacy, security and governance requirements in a regulatory environment that constantly changes.',
       'BigID is a data intelligence company developing software that helps companies secure customer data and satisfy privacy regulations.',
       'Zscaler is a global cloud-based information security company that enabling secure digital transformation for a mobile and cloud first world.',
       ...,
       'OPrestador develops OPrestLink, an application where users concentrate their service, contact, availability period and other information.',
       'The Taxperts is to provide their clients quality service with professionalism, communication, and integrity.',
       'Wayne Alarm strives to provide the best security and life safety systems available.'],
      dtype=object)

### Create Companies and Technologies classes

#### Ranking

I personally appreciate the ranking that you provide for each company. However, I did not quite understand what's the magic behind it. Is there any chance to get some more insight/details, also considering that we do have an NDA in place?

- Crunchbase rank uses Crunchbase’s intelligent algorithms to score and rank entities (e.g. Company, People, Investors, etc.).
- The algorithms take into account many different variables; ranging from funding events, the entity’s strength of relationships with other entities in the Crunchbase ecosystem, the level of engagement from our website, news articles, and acquisitions.

    - A company’s Rank is fluid and subject to rising and decaying over time with time-sensitive events. Events such as product launches, funding events, leadership changes, and news affect a company’s Crunchbase Rank.


- The Crunchbase rank shows where an entity falls in the Crunchbase database relative to all other entities in that entity type (i.e. if searching for companies, you will see where a specific company ranks relative to all other companies). An entity with a Crunchbase Rank of 1 has the highest rank relative to all other entities of that type.

I would also suggest leveraging our Trend Score - 7 Day, 30 Day, 90 Day (e.g. Company, People, Investors, etc.)

- While Rank shows context, Crunchbase Trend Score demonstrates activity. A company’s rank will change based on activity (fundraising, news, etc.) and Trend Score is an indicator of how much their rank is changing at any given time.
- Crunchbase Trend Score tracks the fluctuations in Rank. As a company’s rank changes, so does its Trend Score.
- Trend Score measures the rate of a company’s activity on a 20-point (+10 <-> -10) scale. Scores closer to +10 mean the company is moving up in rank much faster than its peers. Scores closer to -10 mean it is moving down.
- For example, a company that announces its first funding round will likely experience a jump in its Rank, pushing its Trend Score up as its page views, article counts, funding amount, team members, etc., begin to increase.


## Create graph and dictionaries <a class="anchor" id="create_graph"></a>

In [None]:
# Extract the dictionaries of Companies and Technologies from the dataset and create the network
df_limited = df  # use df[:n] instead to work on a sub-sample of the first n rows
[dict_companies, dict_tech, B] = extract_classes_company_tech(df_limited)

In [None]:
print(f"We have {len(dict_companies)} companies and {len(dict_tech)} technologies")
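
`extract_classes_company_tech` is defined in `functions/fun.py` and is not reproduced in this notebook. As a minimal sketch of the kind of structure it returns, the snippet below builds a small bipartite company-technology graph with NetworkX. The toy companies and the use of a `bipartite` node attribute are assumptions for illustration, not the project's actual implementation.

```python
import networkx as nx

# Toy input (hypothetical): company name -> list of technology categories.
toy_companies = {
    "SecureCo": ["Cyber Security", "Cloud Security"],
    "LockCo": ["Cyber Security"],
}

B_toy = nx.Graph()
for company, technologies in toy_companies.items():
    # bipartite=0 marks company nodes, bipartite=1 marks technology nodes.
    B_toy.add_node(company, bipartite=0)
    for tech in technologies:
        B_toy.add_node(tech, bipartite=1)
        B_toy.add_edge(company, tech)

company_nodes = [n for n, d in B_toy.nodes(data=True) if d["bipartite"] == 0]
tech_nodes = [n for n, d in B_toy.nodes(data=True) if d["bipartite"] == 1]

print(f"{len(company_nodes)} companies, {len(tech_nodes)} technologies")
print("Degree of 'Cyber Security':", B_toy.degree("Cyber Security"))
```

Each edge links a company to one of its categories, so the degree of a technology node counts the companies that list it.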

## Save dictionaries and network <a class="anchor" id="save"></a>

In [None]:
# Save dictionaries in pickle files

if flag_cybersecurity==False: # all fields
    name_file_com = "savings/classes/dict_companies_" + str(len(dict_companies)) + ".pickle"
    name_file_tech = "savings/classes/dict_tech_" + str(len(dict_tech)) + ".pickle"
else: # only companies in cybersecurity
    name_file_com = "savings/classes/dict_companies_cybersecurity_" + str(len(dict_companies)) + ".pickle"
    name_file_tech = "savings/classes/dict_tech_cybersecurity_" + str(len(dict_tech)) + ".pickle"

# companies
with open(name_file_com, "wb") as f:
    pickle.dump(dict_companies, f)

#technologies
with open(name_file_tech, "wb") as f:
    pickle.dump(dict_tech, f)

In [None]:
# Save the bipartite graph as gpickle:

if flag_cybersecurity==False: # all fields
    name_file_graph = 'savings/networks/comp_' + str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'                                     
else: # only companies in cybersecurity
    name_file_graph = 'savings/networks/cybersecurity_comp_'+ str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'
                                                       
nx.write_gpickle(B, name_file_graph)
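
Note that `nx.write_gpickle` was deprecated in NetworkX 2.6 and removed in NetworkX 3.0, so the cell above only runs on older versions. A minimal sketch of an equivalent save and load with the standard `pickle` module follows; the toy graph and the temporary path are placeholders for `B` and `name_file_graph`.

```python
import os
import pickle
import tempfile

import networkx as nx

# Toy graph standing in for the bipartite graph B.
G = nx.Graph()
G.add_edge("SecureCo", "Cyber Security")

# Temporary location standing in for name_file_graph.
path = os.path.join(tempfile.mkdtemp(), "toy_graph.gpickle")

# Save: same on-disk idea as write_gpickle, using pickle directly.
with open(path, "wb") as f:
    pickle.dump(G, f, pickle.HIGHEST_PROTOCOL)

# Load the graph back.
with open(path, "rb") as f:
    G_loaded = pickle.load(f)

print(G_loaded.number_of_nodes(), G_loaded.number_of_edges())
```

As with any pickle file, only load graphs from sources you trust.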

## Quick loop  <a class="anchor" id="loop0"></a>

By quick loop, we mean performing all the steps of the previous sections in a single loop, so that the dictionaries are updated for every sample size.

In this part, you won't see many comments because everything has already been explained above :)
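
Since the file names follow one pattern for all three outputs, the naming logic can be factored into a small helper. The sketch below is one possible way to do it; `build_output_paths` is a hypothetical helper that is not part of the project's code, and the value of 500 technologies is made up for the example.

```python
from pathlib import Path


def build_output_paths(n_companies, n_tech, cybersecurity, root="savings"):
    """Return the output paths for the two dictionaries and the graph."""
    tag = "cybersecurity_" if cybersecurity else ""
    root = Path(root)
    return {
        "companies": root / "classes" / f"dict_companies_{tag}{n_companies}.pickle",
        "tech": root / "classes" / f"dict_tech_{tag}{n_tech}.pickle",
        "graph": root / "networks" / f"{tag}comp_{n_companies}_tech_{n_tech}.gpickle",
    }


# Example call; 500 technologies is an invented value.
paths = build_output_paths(2443, 500, cybersecurity=True)
for key, value in paths.items():
    print(key, value.as_posix())
```

The generated names match the string concatenations used in the cells of this notebook, for both values of the flag.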

In [18]:
limits = [2443]
flag_cybersecurity = True

In [19]:
for i in limits:
    df_limited = df[:i] # set limits
    [dict_companies, dict_tech, B] = extract_classes_company_tech(df_limited)
    print(f"We have {len(dict_companies)} companies and {len(dict_tech)} technologies")
    
    # Save dictionaries in pickle files

    if flag_cybersecurity==False: # all fields
        name_file_com = "savings/classes/dict_companies_" + str(len(dict_companies)) + ".pickle"
        name_file_tech = "savings/classes/dict_tech_" + str(len(dict_tech)) + ".pickle"
    else: # only companies in cybersecurity
        name_file_com = "savings/classes/dict_companies_cybersecurity_" + str(len(dict_companies)) + ".pickle"
        name_file_tech = "savings/classes/dict_tech_cybersecurity_" + str(len(dict_tech)) + ".pickle"

    # companies
    with open(name_file_com, "wb") as f:
        pickle.dump(dict_companies, f)

    #technologies
    with open(name_file_tech, "wb") as f:
        pickle.dump(dict_tech, f)
        
    if flag_cybersecurity==False: # all fields
        name_file_graph = 'savings/networks/comp_' + str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'                                     
    else: # only companies in cybersecurity
        name_file_graph = 'savings/networks/cybersecurity_comp_'+ str(len(dict_companies)) + '_tech_' + str(len(dict_tech)) + '.gpickle'

    nx.write_gpickle(B, name_file_graph)

Petah Tiqva, HaMerkaz is not a good address
Petah Tiqva, HaMerkaz is not a good address
Ra'anana, HaMerkaz is not a good address
Reykjavík, Gullbringusysla is not a good address
Lod, HaMerkaz is not a good address
Yehud, HaMerkaz is not a good address
Ra'anana, HaMerkaz is not a good address
Bedok, East Region is not a good address
Tirat Karmel, Hefa is not a good address
Qiryat Ono, Tel Aviv is not a good address
Rosh Ha'ayin, HaMerkaz is not a good address
Seoul, Seoul-t'ukpyolsi is not a good address
Oguchi, Kagoshima is not a good address
Netanya, HaMerkaz is not a good address
Bene Beraq, Tel Aviv is not a good address
Zur Moshe, HaMerkaz is not a good address
Petah Tiqva, HaMerkaz is not a good address
Seoul, Seoul-t'ukpyolsi is not a good address
Elstree, Hertford is not a good address
Lysaker, Akershus is not a good address
Seoul, Seoul-t'ukpyolsi is not a good address
Reykjavík, Gullbringusysla is not a good address
Haifa, Hefa is not a good address
Hemel Hempstead, Hertford i