# Extract technical terms from SBIR data
This notebook extracts entities from the SBIR award dataset, which is distributed as a CSV file. We use spaCy to lemmatize the `Abstract` field and extract entities from it. We then narrow the entities down to technical terms by passing them through a previously trained binary classification model.
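The filtering step described above can be sketched as follows. This is a minimal illustration, not the code in `spacy_helper_methods`: the function `filter_technical_terms`, the `BinaryClassifier` protocol, and the `KeywordStub` stand-in are all hypothetical names. It assumes the classifier follows the scikit-learn convention of a `predict` method that returns 1 for a technical term and 0 otherwise. How the real model expects its input to be encoded is not shown in this notebook.

```python
from typing import Iterable, List, Protocol, Sequence


class BinaryClassifier(Protocol):
    """Anything exposing a scikit-learn style predict() method (assumed interface)."""

    def predict(self, X: Sequence[str]) -> Sequence[int]: ...


def filter_technical_terms(entities: Iterable[str], classifier: BinaryClassifier) -> List[str]:
    """Keep only the entities that the classifier labels 1 (technical)."""
    # Normalize, drop blanks, and de-duplicate while preserving first-seen order.
    candidates = list(dict.fromkeys(e.strip().lower() for e in entities if e.strip()))
    if not candidates:
        return []
    labels = classifier.predict(candidates)
    return [term for term, label in zip(candidates, labels) if label == 1]


class KeywordStub:
    """Hypothetical stand-in for the trained model, for illustration only."""

    TECH_WORDS = {"laser", "sensor", "algorithm"}

    def predict(self, X):
        return [1 if any(word in x for word in self.TECH_WORDS) else 0 for x in X]


terms = filter_technical_terms(
    ["Fiber Laser", "Phase I", "fiber laser", "", "infrared sensor"], KeywordStub()
)
print(terms)
```

Normalizing and de-duplicating before calling `predict` keeps the classifier from scoring the same term repeatedly, which may matter when abstracts repeat terminology.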

In [1]:
import import_ipynb
import spacy as sp
import json
import pandas as pd
import joblib
import requests
import io

In [2]:
import spacy_helper_methods as sph

importing Jupyter notebook from spacy_helper_methods.ipynb


## Load input data

In [3]:
%%time
# Read the SBIR award data directly from the web URL
url = "https://data.www.sbir.gov/awarddatapublic/award_data.csv"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
sbir_df = pd.read_csv(io.StringIO(response.content.decode('utf-8')), low_memory=False)

CPU times: user 10.5 s, sys: 8.01 s, total: 18.5 s
Wall time: 43.1 s


In [4]:
!unzip -o ../model/trained_tech_classifier_model.joblib.zip -d ../model/

Archive:  ../model/trained_tech_classifier_model.joblib.zip
  inflating: ../model/trained_tech_classifier_model.joblib  


## Extract entities and classify

In [5]:
model = joblib.load('../model/trained_tech_classifier_model.joblib')
nlp = sp.load('en_core_sci_lg')

  deserializers["tokenizer"] = lambda p: self.tokenizer.from_disk(  # type: ignore[union-attr]


In [6]:
#sbir_df.info()

In [7]:
# Replace null abstracts with an empty string so entity extraction can run without dropping rows.
# Only the 'Abstract' column is modified; the other columns of those rows are left untouched.
sbir_df['Abstract'] = sbir_df['Abstract'].fillna('').astype('string')
sbir_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201856 entries, 0 to 201855
Data columns (total 41 columns):
 #   Column                                   Non-Null Count   Dtype 
---  ------                                   --------------   ----- 
 0   Company                                  201853 non-null  object
 1   Award Title                              201232 non-null  object
 2   Agency                                   201855 non-null  object
 3   Branch                                   146776 non-null  object
 4   Phase                                    201856 non-null  object
 5   Program                                  201856 non-null  object
 6   Agency Tracking Number                   201544 non-null  object
 7   Contract                                 168149 non-null  object
 8   Proposal Award Date                      123889 non-null  object
 9   Contract End Date                        118873 non-null  object
 10  Solicitation Number                      125

In [8]:
#sbir_df[sbir_df['Abstract'].isna()]

In [9]:
%%time
# This runs on the whole dataset and can take a long time. Uncomment the lines below to process all rows.
#sbir_df['abstract_entities'] = sph.extract_tech_entities(nlp, model, sbir_df['Abstract'])
#sbir_df.to_csv('../preprocessed_files/sbir_entities.csv')

CPU times: user 5 µs, sys: 1e+03 ns, total: 6 µs
Wall time: 11 µs


In [21]:
%%time
sbir_sample = sbir_df.sample(1000)
sbir_sample = sbir_sample[~sbir_sample['Abstract'].isna()]
sbir_sample = sbir_sample[sbir_sample['Abstract'] != '']
sbir_sample['abstract_entities'] = sph.extract_tech_entities(nlp, model, sbir_sample['Abstract'])


CPU times: user 3min 55s, sys: 33.6 s, total: 4min 28s
Wall time: 10min 39s


In [24]:
sbir_sample = sbir_sample.rename_axis('id')
sbir_sample
sbir_sample.to_csv('../preprocessed_files/sbir_1k_sample.csv')

## Create small output files
Run this section only after uncommenting and running cell 9 above. Because the DataFrame is large, we break it into smaller chunks so the files can be uploaded to GitHub.

In [None]:
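chunksize = 22000  # number of rows per chunk
# Ceiling division: avoids an empty trailing chunk when the row count is an exact multiple of chunksize
num_chunks = (len(sbir_df) + chunksize - 1) // chunksize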

In [None]:
output_directory = '../preprocessed_files/'
base_filename = "sbir_entities_"

# Write the DataFrame into multiple small files
for i in range(num_chunks):
    start_idx = i * chunksize
    end_idx = (i + 1) * chunksize
    chunk_dataframe = sbir_df.iloc[start_idx:end_idx]
    
    # Formulate the output filename for each chunk
    output_filename = f"{output_directory}{base_filename}{i + 1}.csv"
    
    # Write the chunk to a CSV file
    chunk_dataframe.to_csv(output_filename, index=False)
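The row ranges that the loop above writes can be checked without loading the DataFrame. The sketch below is illustrative only: `chunk_bounds` is a hypothetical helper, not part of this notebook. It uses ceiling division so that a row count that is an exact multiple of the chunk size does not yield an empty trailing file, and it clamps the last range to the actual number of rows.

```python
import math
from typing import List, Tuple


def chunk_bounds(n_rows: int, chunksize: int) -> List[Tuple[int, int]]:
    """Return (start, end) row offsets, end exclusive, that together cover n_rows."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    num_chunks = math.ceil(n_rows / chunksize)
    return [
        (i * chunksize, min((i + 1) * chunksize, n_rows))
        for i in range(num_chunks)
    ]


# 201856 is the row count reported by sbir_df.info() earlier in this notebook.
bounds = chunk_bounds(201856, 22000)
print(len(bounds), bounds[0], bounds[-1])
```

Each `(start, end)` pair can be passed straight to `sbir_df.iloc[start:end]`, so the same bounds serve both for writing the chunks and for verifying that no rows were lost.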