# Extract technical terms from SBIR data
This notebook extracts entities from the SBIR dataset, which is distributed as a CSV file. We run spaCy methods to lemmatize the text and extract entities from the abstract field, then filter those entities down to technical terms by passing them through a previously trained binary classification model.

In [6]:
import import_ipynb
import spacy as sp
import json
import pandas as pd
import joblib
import requests
import io

In [2]:
import spacy_helper_methods as sph

importing Jupyter notebook from spacy_helper_methods.ipynb


## Load input data

In [7]:
%%time
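# Read the SBIR award data directly from the web URL
url = "https://data.www.sbir.gov/awarddatapublic/award_data.csv"
response = requests.get(url)
response.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
sbir_df = pd.read_csv(io.StringIO(response.content.decode('utf-8')))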



CPU times: user 10.6 s, sys: 7.77 s, total: 18.4 s
Wall time: 1min 12s


In [14]:
!unzip -o ../model/trained_tech_classifier_model.joblib.zip -d ../model/

Archive:  ../model/trained_tech_classifier_model.joblib.zip
  inflating: ../model/trained_tech_classifier_model.joblib  


## Extract entities and classify

In [15]:
model = joblib.load('../model/trained_tech_classifier_model.joblib')
nlp = sp.load('en_core_sci_lg')

In [16]:
%%time
sbir_df['abstract_entities'] = sph.extract_tech_entities(nlp, model, sbir_df['Abstract'])

CPU times: user 9h 13min 46s, sys: 41min 3s, total: 9h 54min 49s
Wall time: 11h 12min 49s
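
The helper `sph.extract_tech_entities` lives in `spacy_helper_methods.ipynb` and is not shown in this notebook. As a rough illustration of the extract-then-filter idea described at the top, here is a minimal sketch. It is not the actual helper: the function name `extract_tech_entities_sketch` and both callables are hypothetical stand-ins, with `extract_entities` playing the role of the spaCy pipeline and `is_technical` playing the role of the trained classifier.

```python
from typing import Callable, Iterable, List, Optional


def extract_tech_entities_sketch(
    abstracts: Iterable[Optional[str]],
    extract_entities: Callable[[str], List[str]],
    is_technical: Callable[[str], bool],
) -> List[List[str]]:
    """For each abstract, extract candidate entities and keep the technical ones.

    Hypothetical sketch only; the real helper may differ in every detail.
    """
    results = []
    for text in abstracts:
        # Missing abstracts (None or NaN after reading the CSV) yield an empty list.
        if not isinstance(text, str) or not text.strip():
            results.append([])
            continue
        seen = set()
        kept = []
        for entity in extract_entities(text):
            key = entity.lower()
            if key in seen:
                continue  # skip case-insensitive duplicates within one abstract
            seen.add(key)
            if is_technical(entity):
                kept.append(entity)
        results.append(kept)
    return results


# Toy stand-ins, for illustration only (assumptions, not the notebook's models).
def toy_extract(text: str) -> List[str]:
    return [part.strip() for part in text.split(",")]


def toy_is_technical(entity: str) -> bool:
    return entity.lower() in {"laser diode", "photonic sensor"}


result = extract_tech_entities_sketch(
    ["laser diode, market growth, Laser Diode", None],
    toy_extract,
    toy_is_technical,
)
print(result)  # [['laser diode'], []]
```

Passing the extractor and classifier in as callables keeps the filtering logic easy to test on a handful of rows before committing to a multi-hour run over the full dataset.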


In [17]:
sbir_df.to_csv('../preprocessed_files/sbir_entities.csv')

In [20]:
# Drop the Abstract column now that its entities have been extracted
sbir_df = sbir_df.drop(columns=['Abstract'])

In [21]:
sbir_df.to_csv('../preprocessed_files/sbir_entities1.csv')

In [23]:
sbir_df.columns

Index(['Company', 'Award Title', 'Agency', 'Branch', 'Phase', 'Program',
       'Agency Tracking Number', 'Contract', 'Proposal Award Date',
       'Contract End Date', 'Solicitation Number', 'Solicitation Year',
       'Solicitation Close Date', 'Proposal Receipt Date',
       'Date of Notification', 'Topic Code', 'Award Year', 'Award Amount',
       'Duns', 'HUBZone Owned', 'Socially and Economically Disadvantaged',
       'Women Owned', 'Number Employees', 'Company Website', 'Address1',
       'Address2', 'City', 'State', 'Zip', 'Contact Name', 'Contact Title',
       'Contact Phone', 'Contact Email', 'PI Name', 'PI Title', 'PI Phone',
       'PI Email', 'RI Name', 'RI POC Name', 'RI POC Phone',
       'abstract_entities'],
      dtype='object')

## Create small output files
Since the DataFrame is large, we need to break it down into smaller chunks for upload to GitHub.

In [27]:
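chunksize = 22000  # number of rows per chunk
# Ceiling division: avoids an empty trailing chunk when len(sbir_df) is an exact multiple of chunksize
num_chunks = -(-len(sbir_df) // chunksize)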

In [30]:
output_directory = '../preprocessed_files/'
base_filename = "sbir_entities_"

# Write the DataFrame into multiple small files
for i in range(num_chunks):
    start_idx = i * chunksize
    end_idx = (i + 1) * chunksize
    chunk_dataframe = sbir_df.iloc[start_idx:end_idx]
    
    # Formulate the output filename for each chunk
    output_filename = f"{output_directory}{base_filename}{i + 1}.csv"
    
    # Write the chunk to a CSV file
    chunk_dataframe.to_csv(output_filename, index=False)
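
Before writing files, it can help to sanity-check the chunk boundaries on their own. The sketch below uses a hypothetical helper, `chunk_bounds`, which is not part of this notebook. It computes the `(start, end)` row ranges with ceiling division, so a row count that is an exact multiple of the chunk size does not produce a trailing empty chunk.

```python
from typing import List, Tuple


def chunk_bounds(n_rows: int, chunksize: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) row ranges that cover n_rows in order."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    num_chunks = -(-n_rows // chunksize)  # ceiling division
    return [
        (i * chunksize, min((i + 1) * chunksize, n_rows))
        for i in range(num_chunks)
    ]


# Example with made-up row counts (not the real size of the SBIR data).
print(chunk_bounds(45000, 22000))  # [(0, 22000), (22000, 44000), (44000, 45000)]
print(chunk_bounds(44000, 22000))  # [(0, 22000), (22000, 44000)]
```

Each pair can be passed to `sbir_df.iloc[start:end]`, and the sum of `end - start` over all pairs should equal `len(sbir_df)`.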