# US Companies: Content-Based Recommendations (work areas)
### Martín Gonella and Maximiliano Armesto

We work with a dataset, used in the previous lab, that contains information on a large number of US companies. It has the following columns: **company_name_id**, **company_name**, and **company_category**. The goal is to recommend companies that work in similar areas.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('Dataset/us_companies_filtered.csv')

In [3]:
df[:20]

Unnamed: 0,company_name_id,company_name,company_category
0,3-round-stones-inc,"3 Round Stones, Inc.",Data/Technology
1,48-factoring-inc,48 Factoring Inc.,Finance & Investment
2,5psolutions,5PSolutions,Data/Technology
3,abt-associates,Abt Associates,Research & Consulting
4,accela,Accela,Governance
5,accuweather,AccuWeather,Environment & Weather
6,acxiom,Acxiom,Data/Technology
7,adaptive,Adaptive,Business & Legal Services
8,adobe-digital-government,Adobe Digital Government,Data/Technology
9,aidin,Aidin,Healthcare


In [4]:
df.isnull().sum()

company_name_id     0
company_name        0
company_category    0
dtype: int64

We will build a content-based recommendation engine that computes the similarity between companies from their work areas or categories. Given a particular company, it suggests the companies most similar to it by category.

In [5]:
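# Treat '&' like '/' so that both act as category separators
df['company_category'] = df['company_category'].str.replace("&","/")
df['company_category'] = df['company_category'].str.split('/')
# Convert each list to its string representation so the vectorizer can tokenize it
df['company_category'] = df['company_category'].fillna("").astype('str')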
df[:5]

Unnamed: 0,company_name_id,company_name,company_category
0,3-round-stones-inc,"3 Round Stones, Inc.","['Data', 'Technology']"
1,48-factoring-inc,48 Factoring Inc.,"['Finance ', ' Investment']"
2,5psolutions,5PSolutions,"['Data', 'Technology']"
3,abt-associates,Abt Associates,"['Research ', ' Consulting']"
4,accela,Accela,['Governance']
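The split above leaves stray whitespace in some tokens (for example `'Finance '` and `' Investment'`). This is harmless here because the default tokenizer of `TfidfVectorizer` only extracts word characters, but cleaner tokens can be useful elsewhere. As a minimal sketch, the hypothetical helper `split_categories` (not part of the original notebook) splits on both separators and strips whitespace:

```python
import re

def split_categories(raw):
    """Split a category string on '/' or '&' and strip whitespace (hypothetical helper)."""
    if not isinstance(raw, str):
        # Missing values (e.g. NaN) yield an empty category list
        return []
    return [part.strip() for part in re.split(r"[/&]", raw) if part.strip()]

print(split_categories("Finance & Investment"))
print(split_categories("Data/Technology"))
```

It could be applied with `df['company_category'].apply(split_categories)` in place of the replace/split steps, assuming the column still holds the raw strings.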


We have no quantitative metric for judging the quality of the recommendations, so they will be assessed qualitatively. To build them, we use scikit-learn's **TfidfVectorizer** class, which turns text into feature vectors that can be fed to an estimator.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
# min_df=1 (the default) keeps every term; recent scikit-learn versions reject the integer 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df['company_category'])
tfidf_matrix.shape

(530, 47)
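To make the feature space concrete, the following sketch fits the same kind of vectorizer on three made-up category strings (the `docs` list is illustrative, not taken from the dataset). With `ngram_range=(1, 2)` each document contributes its single words plus its pairs of adjacent words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative category strings in the same format as the notebook's column
docs = ["['Data', 'Technology']", "['Finance ', ' Investment']", "['Education']"]

vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), stop_words="english")
matrix = vec.fit_transform(docs)

# vocabulary_ maps each unigram or bigram to its column index
features = sorted(vec.vocabulary_)
print(features)
print(matrix.shape)
```

Brackets, quotes, and whitespace are dropped by the default token pattern, so only the category words and their bigrams (such as `data technology`) become features.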

In [7]:
from sklearn.metrics.pairwise import linear_kernel
# TF-IDF rows are L2-normalized by default, so the dot product equals the cosine similarity
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim[:4, :4]

array([[1., 0., 1., 0.],
       [0., 1., 0., 0.],
       [1., 0., 1., 0.],
       [0., 0., 0., 1.]])
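The matrix is computed with `linear_kernel` (a plain dot product) rather than `cosine_similarity`. The two agree because `TfidfVectorizer` normalizes each row to unit length by default (`norm="l2"`). A small check on made-up documents, as a sketch:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Illustrative documents; rows 0 and 2 are identical, row 1 shares no terms with them
docs = ["data technology", "finance investment", "data technology", "technology electronic"]
tfidf = TfidfVectorizer().fit_transform(docs)

# Each row has unit Euclidean length under the default norm="l2"
row_norms = np.sqrt(tfidf.multiply(tfidf).sum(axis=1))
dot = linear_kernel(tfidf, tfidf)
cos = cosine_similarity(tfidf, tfidf)

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot, cos))
```

Identical documents get a similarity of 1 and documents with no shared terms get 0, which matches the pattern in the 4×4 slice shown above.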

We now have a **pairwise cosine** similarity matrix for all companies in the dataset. The next step is to write a function that returns the 10 most similar companies based on their categories.

In [8]:
indices = pd.Series(df.index, index=df['company_name_id'])
cosine_sim


array([[1.        , 0.        , 1.        , ..., 0.1641325 , 0.27264264,
        0.27264264],
       [0.        , 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [1.        , 0.        , 1.        , ..., 0.1641325 , 0.27264264,
        0.27264264],
       ...,
       [0.1641325 , 0.        , 0.1641325 , ..., 1.        , 0.40408444,
        0.40408444],
       [0.27264264, 0.        , 0.27264264, ..., 0.40408444, 1.        ,
        1.        ],
       [0.27264264, 0.        , 0.27264264, ..., 0.40408444, 1.        ,
        1.        ]])

In [9]:
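titles = df['company_name_id']
# Map each company_name_id to its row position in df
indices = pd.Series(df.index, index=df['company_name_id'])

def company_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Exclude the query company explicitly: with tied scores it does not always sort first
    sim_scores = [s for s in sim_scores if s[0] != idx]
    # Sort by similarity, highest first, and keep the top 10
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[:10]
    df_indices = [i[0] for i in sim_scores]
    return titles.iloc[df_indices]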
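When several companies share identical categories, their similarity to each other is 1.0, so the query company is not guaranteed to be the first entry after sorting. The following sketch, on a made-up four-company matrix (the names `sim`, `top_by_position`, and `top_by_exclusion` are illustrative), compares dropping the first sorted entry with excluding the query index explicitly:

```python
import numpy as np

# Toy similarity matrix (made up): companies 0, 1 and 2 have identical categories
sim = np.array([
    [1.0, 1.0, 1.0, 0.2],
    [1.0, 1.0, 1.0, 0.2],
    [1.0, 1.0, 1.0, 0.2],
    [0.2, 0.2, 0.2, 1.0],
])

def top_by_position(idx, n=2):
    """Sort by score and drop the first entry, assuming it is the query itself."""
    scores = sorted(enumerate(sim[idx]), key=lambda x: x[1], reverse=True)
    return [i for i, _ in scores[1:n + 1]]

def top_by_exclusion(idx, n=2):
    """Remove the query index explicitly, then sort by score."""
    scores = [(i, s) for i, s in enumerate(sim[idx]) if i != idx]
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    return [i for i, _ in scores[:n]]

print(top_by_position(2))   # includes company 2 itself
print(top_by_exclusion(2))  # never includes company 2
```

Because Python's sort is stable, tied entries keep their original order, so the positional approach only works when the query happens to be the lowest-indexed company among its ties.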

Now let's get the top recommendations for a few companies and see how good they are.

In [10]:
pd.DataFrame(company_recommendations('alltuition').head(10))

Unnamed: 0,company_name_id
44,betterlesson
67,cappex
103,college-board
109,connectedu
152,escholar-llc
204,greatschools
219,hows-my-offer
250,junyo
255,kidadmit-inc
359,petersons


Now let's look at the categories of the recommended companies and of the original company.

In [11]:
df.loc[[14,44,67,103,109,152,204,219,250,255,359]]

Unnamed: 0,company_name_id,company_name,company_category
14,alltuition,Alltuition,['Education']
44,betterlesson,BetterLesson,['Education']
67,cappex,Cappex,['Education']
103,college-board,College Board,['Education']
109,connectedu,ConnectEDU,['Education']
152,escholar-llc,eScholar LLC.,['Education']
204,greatschools,GreatSchools,['Education']
219,hows-my-offer,How's My Offer?,['Education']
250,junyo,Junyo,['Education']
255,kidadmit-inc,"KidAdmit, Inc.",['Education']


In [12]:
pd.DataFrame(company_recommendations('48-factoring-inc').head(10))

Unnamed: 0,company_name_id
11,allianz
28,asset4
34,avalara
42,berkery-noyes-mandasoft
43,berkshire-hathaway
45,billguard
49,blackrock
50,bloomberg
54,bridgewater
55,brightscope


In [13]:
df.loc[[1,11,28,34,42,43,45,49,50,54,55]]

Unnamed: 0,company_name_id,company_name,company_category
1,48-factoring-inc,48 Factoring Inc.,"['Finance ', ' Investment']"
11,allianz,Allianz,"['Finance ', ' Investment']"
28,asset4,Asset4,"['Finance ', ' Investment']"
34,avalara,Avalara,"['Finance ', ' Investment']"
42,berkery-noyes-mandasoft,Berkery Noyes MandASoft,"['Finance ', ' Investment']"
43,berkshire-hathaway,Berkshire Hathaway,"['Finance ', ' Investment']"
45,billguard,BillGuard,"['Finance ', ' Investment']"
49,blackrock,BlackRock,"['Finance ', ' Investment']"
50,bloomberg,Bloomberg,"['Finance ', ' Investment']"
54,bridgewater,Bridgewater,"['Finance ', ' Investment']"


In [14]:
pd.DataFrame(company_recommendations('3-round-stones-inc').head(10))

Unnamed: 0,company_name_id
2,5psolutions
6,acxiom
8,adobe-digital-government
15,altova
16,amazon-web-services
19,analytica
20,apextech-llc
21,appallicious
24,areavibes-inc
36,ayasdi


In [15]:
df.loc[[0,2,6,8,15,16,19,20,21,24,36]]

Unnamed: 0,company_name_id,company_name,company_category
0,3-round-stones-inc,"3 Round Stones, Inc.","['Data', 'Technology']"
2,5psolutions,5PSolutions,"['Data', 'Technology']"
6,acxiom,Acxiom,"['Data', 'Technology']"
8,adobe-digital-government,Adobe Digital Government,"['Data', 'Technology']"
15,altova,Altova,"['Data', 'Technology']"
16,amazon-web-services,Amazon Web Services,"['Data', 'Technology']"
19,analytica,Analytica,"['Data', 'Technology']"
20,apextech-llc,Apextech LLC,"['Data', 'Technology']"
21,appallicious,Appallicious,"['Data', 'Technology']"
24,areavibes-inc,AreaVibes Inc.,"['Data', 'Technology']"


In [16]:
pd.DataFrame(company_recommendations('google').head(10))

Unnamed: 0,company_name_id
529,apple
527,inphi
0,3-round-stones-inc
2,5psolutions
6,acxiom
8,adobe-digital-government
15,altova
16,amazon-web-services
19,analytica
20,apextech-llc


In [17]:
df.loc[[528,529,527,0,2,6,8,15,16,19,20]]

Unnamed: 0,company_name_id,company_name,company_category
528,google,Google,"['Technology', 'Data', 'Electronic']"
529,apple,Apple,"['Technology', 'Data', 'Electronic']"
527,inphi,Inphi,"['Technology', 'Electronic']"
0,3-round-stones-inc,"3 Round Stones, Inc.","['Data', 'Technology']"
2,5psolutions,5PSolutions,"['Data', 'Technology']"
6,acxiom,Acxiom,"['Data', 'Technology']"
8,adobe-digital-government,Adobe Digital Government,"['Data', 'Technology']"
15,altova,Altova,"['Data', 'Technology']"
16,amazon-web-services,Amazon Web Services,"['Data', 'Technology']"
19,analytica,Analytica,"['Data', 'Technology']"


As the examples show, the system behaves as expected: for each input company, it recommends companies whose categories match or overlap with those of the input.