### Notebook description

This is the most interesting part of the project. I am still working on it and may update this notebook at a later date.

We used a content recommender to approximate a search engine: given a search term, it recommends the text blob that is most similar to that term.

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.feature_extraction.text import TfidfVectorizer
import string
import regex as re

### Crunchbase

I used Crunchbase data to test the approach because it is already cleaned and organized.

In [None]:
crunch = pd.read_csv('./organizations.csv')

In [None]:
crunch.shape

In [None]:
# NOTE: `union` is not defined in this notebook; its rows must line up with
# `crunch` for the row lookup a few cells down to be meaningful.
tfidf = TfidfVectorizer(stop_words="english")
tf_matrix = tfidf.fit_transform(union['get_text'].values.astype(str))

In [None]:
y = ['SAMPLE RESEARCH TERMS']
search_vector = tfidf.transform(y)

In [None]:
distances = pairwise_distances(tf_matrix, search_vector, metric = 'cosine')
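As a reminder of what the `cosine` metric measures, here is a minimal pure-Python sketch. It is an illustration only: `cosine_distance` and the toy vectors are hypothetical, and scikit-learn operates on sparse matrices rather than lists. Cosine distance is one minus the cosine of the angle between two vectors, so for non-negative TF-IDF vectors 0 means "same direction" and 1 means "no terms in common".

```python
import math

def cosine_distance(u, v):
    # 1 - (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy vectors (made up for illustration, not project data)
query = [1.0, 1.0, 0.0]
same_direction = [2.0, 2.0, 0.0]   # same terms, different length
no_overlap = [0.0, 0.0, 3.0]       # shares no terms with the query

d_same = cosine_distance(query, same_direction)
d_none = cosine_distance(query, no_overlap)
```

Because the metric ignores vector length, a long description and a short one that use the same vocabulary score as equally close to the search term.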

In [None]:
np.sort(distances, axis=0)  # distances has shape (n_docs, 1), so sort down the rows

In [None]:
crunch.iloc[distances.argmin()]  # argmin returns a row position, so index by position

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
matrix_df = pd.DataFrame(tf_matrix.toarray(), columns=tfidf.get_feature_names_out())

### Startup

#### Startup data: merge with scraped text and tweets

In [None]:
client = pd.read_csv('./git_projects/capstone/XXX.csv')
scrape = pd.read_pickle('./git_projects/capstone/FINAL_pickle_soup')

#### Clean the soup

Clean up the tags and other non-text in the scraped homepages.

In [None]:
from bs4 import BeautifulSoup 
from bs4.element import Comment

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'lxml')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

In [None]:
# Reuse text_from_html instead of repeating the parsing logic for every row
scrape['homepage'] = scrape['soup'].apply(text_from_html)

In [None]:
scrape = scrape.drop(['soup'], axis=1)

In [None]:
client = pd.merge(client, scrape, on='co_id', how='outer')
client.shape

#### Feature Engineering

In [None]:
# replace all NaNs with empty strings, so they don't clutter the text
client = client.fillna('')

In [None]:
# drop columns that I will not use for TF-IDF
client = client.drop(['co_slug', 'created', 'updated',
                      'website_x', 'website_y', 'url', 'email'],  # ... more columns elided
                     axis=1)

##### Correcting some fields (e.g., people entered both city and state in the city field)

In [None]:
# Everything after the first comma in `city` is treated as the state
client['state'] = client['city'].map(lambda x: ' '.join(part.strip() for part in x.split(',')[1:]))

In [None]:
# Keep only the part before the first comma as the city
client['city'] = client['city'].map(lambda x: x.split(',')[0].strip())
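The same city/state correction can be checked on a few strings in isolation. This is a small sketch using `str.partition`; `split_city_state` is a hypothetical helper name and the sample values are made up. Unlike the cells above, it keeps everything after the first comma together as the state.

```python
def split_city_state(raw):
    # Split on the first comma only; partition returns ('', '', '') parts safely
    city, _, state = raw.partition(',')
    return city.strip(), state.strip()

# Made-up examples of what people type into a "city" field
both = split_city_state("Boston, MA")
city_only = split_city_state("Austin")
empty = split_city_state("")
```

Applied to the DataFrame, this would look like `client['city'].map(split_city_state)`, which yields one `(city, state)` tuple per row.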

In [None]:
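def drop_hyphens(column):
    # Replace hyphens with spaces so hyphenated words are tokenized separately
    client[column] = [i.replace('-', ' ') for i in client[column]]

columns = ['XXX', 'YYY']  # etc.; placeholder column names
for col in columns:
    drop_hyphens(col)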

In [None]:
#saving at this stage
client.to_pickle("client_ready")

In [None]:
client = pd.read_pickle("./client_ready")

#### Get all the text in one column

I created several combinations to see what works best with the recommendation engine.

In [None]:
client['sum_description'] = (client['first feature'].astype(str) + " " + client['second feature'].astype(str)
                     + " " + client['third feature'].astype(str))

In [None]:
client['text_no_soup'] = (client['first feature'].astype(str) + " " + client['second feature'].astype(str)
                     + " " + client['fourth feature'].astype(str) + " " + client['fifth feature'].astype(str)
                     + " " + client['second feature'].astype(str)
                     + " " + client['seventh feature'].astype(str) + " " + client['first feature'].astype(str)
                     + " " + client['twirteenth feature'].astype(str)
                     + " " + client['tenth feature'].astype(str))

#### Client data TFIDF

Here we use a recommender to approximate a search engine. TF-IDF gives more weight to words that are infrequent overall and specific to a given description, and it penalizes words that appear everywhere.

We fit the vectorizer on the text from all descriptions and then use the search term to find the descriptions that are most similar to it.
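To make the weighting concrete, here is a from-scratch sketch of TF-IDF on three made-up descriptions. It is a simplification, not what `TfidfVectorizer` does in full: there is no stop-word removal or length normalization, and the names `idf` and `tfidf_weights` are hypothetical. The smoothed idf formula is the one scikit-learn uses by default.

```python
import math
from collections import Counter

# Hypothetical toy descriptions; not taken from the project data
docs = [
    "platform for solar energy analytics",
    "platform for online payments",
    "platform for pet food delivery",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many descriptions does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf_weights(tokens):
    # Raw term count times idf; no normalization in this sketch
    counts = Counter(tokens)
    return {word: count * idf(word) for word, count in counts.items()}

weights = tfidf_weights(tokenized[0])
```

In the first description, "platform" appears in every document and gets the minimum weight, while "solar" appears only once in the corpus and is weighted higher. That is why a search for "solar" pulls up this description rather than the other two.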

In [None]:
### Also used lemmatizing

In [None]:
#building a content-based recommender

In [None]:
#fitting tfidf on our company descriptions

tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
tf_matrix = tfidf.fit_transform(client['sum_description'].values.astype(str))

In [None]:
tf_matrix.shape

In [None]:
#transforming our search term
y = ["Sample search example"]
search_vector = tfidf.transform(y)

In [None]:
#finding which descriptions are most similar to the search term
distances = pairwise_distances(tf_matrix, search_vector, metric = 'cosine')

In [None]:
np.sort(distances, axis=0)[:5] #five best scores/pairwise distances
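`np.sort` returns the five smallest distances but not the rows they belong to. A minimal sketch of recovering the row positions with `np.argsort` follows; the distance values are made up, and `k`, `flat`, `top_k_rows`, and `top_k_scores` are hypothetical names.

```python
import numpy as np

# Hypothetical distances for six documents, shaped (n_docs, 1) like the
# output of pairwise_distances against a single search vector
distances = np.array([[0.91], [0.12], [1.00], [0.47], [0.05], [0.88]])

k = 3
flat = distances.ravel()             # shape (n_docs,)
top_k_rows = np.argsort(flat)[:k]    # row positions, closest first
top_k_scores = flat[top_k_rows]      # the matching distances
```

If the DataFrame is aligned row for row with the TF-IDF matrix, `client.iloc[top_k_rows]` would return the best-matching rows in ranked order.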

In [None]:
# Map each company id to its cosine distance from the search term.
# distances has shape (n_rows, 1), so ravel() gives one score per row of client.
results = dict(zip(client['co_id'], distances.ravel()))

In [None]:
s = pd.Series(results, name='cosine_score')
s.index.name = 'co_id'
s = s.reset_index()
s.sort_values('cosine_score', ascending = True).head()

In [None]:
# Print a df that shows the clients that best match the search terms, ordered from the best match to the worst.
# We can see both the cosine distance to the search terms (lower means more similar) and the actual description.

merged_df = pd.merge(client, s, on='co_id', how='outer')
pd.set_option('display.max_colwidth', 150)
merged_df[["co_name","sum_description", 'cosine_score']].sort_values('cosine_score', ascending = True)