# Step 1: Data Preprocessing

Now that we have our data, we can preprocess it using the spaCy library. The preprocessing consists of the following steps:

1. Remove non-English tokens
2. Remove noise from our data which includes things like:
    - Stop words
    - Punctuation
    - Entities like locations and currencies
    - Numbers
    - URLs
    - Emails
3. Lemmatize our data
4. Remove descriptions that are too short or too long. The range I chose was 15-60 tokens per description
5. Remove duplicate keywords from our industry dataset
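
As a rough illustration of step 2, the sketch below filters a single sentence using spaCy's lexical token attributes. It uses `spacy.blank("en")`, a tokenizer-only pipeline, so it runs without downloading a model. The sample sentence and the `remove_noise` helper are made up for this example and are not part of the notebook.

```python
import spacy

# Tokenizer-only English pipeline: enough for lexical flags such as
# is_stop or like_url, but it cannot lemmatize or tag parts of speech.
nlp = spacy.blank("en")


def remove_noise(text: str) -> str:
    """Hypothetical helper: drop stop words, punctuation, numbers, URLs and emails."""
    doc = nlp(text)
    kept = [
        token.text.lower()
        for token in doc
        if not (
            token.is_stop
            or token.is_punct
            or token.is_space
            or token.like_num
            or token.is_currency
            or token.like_url
            or token.like_email
        )
    ]
    return " ".join(kept)


# Invented sample text, chosen to contain several kinds of noise.
sample = "Visit https://example.com or email info@example.com for 25 dollars!"
cleaned = remove_noise(sample)
print(cleaned)
```

Entity removal and lemmatization need a trained pipeline such as `en_core_web_sm`, which is why the notebook loads that model instead of a blank one.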



### Cell 1 - Imports

In [3]:
# # # # IMPORTS # # # #

from tqdm import tqdm
import spacy
from spacy import displacy
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt


### Cell 2 - Load Data

In [None]:

with open(r'C:\Users\imran\DataspellProjects\WalidCase\data\raw\30k_startups_raw.csv', 'r', encoding='utf-8', errors='ignore') as f:
    raw_startups = pd.read_csv(f)

mask = raw_startups['cb_description'].apply(lambda x: len(x.split()) >= 30)
raw_startups = raw_startups.loc[mask]
raw_startups.rename(columns={'cb_short_description': 'cb_description'}, inplace=True)

raw_industries = pd.read_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed\industry_dataset_clean.csv', sep='\t')


In [14]:
# error_bad_lines is deprecated (removed in pandas 2.0); on_bad_lines='skip' is the replacement
raw_startups = pd.read_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\raw\30k_startups_raw.csv', encoding='utf-8', on_bad_lines='skip')
raw_industries = pd.read_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed\industry_dataset_clean.csv', sep='\t')






In [15]:
# drop rows with missing values
raw_startups.dropna(inplace=True)
# remove descriptions with fewer than 30 words
mask = raw_startups['cb_description'].apply(lambda x: len(x.split()) >= 30)
raw_startups = raw_startups.loc[mask]
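
The same word-count filter can also be written without `apply`, using pandas string methods. This is only a sketch on a made-up three-row dataframe; the rows and the threshold of 3 words are illustrative (the notebook uses 30).

```python
import pandas as pd

# Toy stand-in for raw_startups; the rows are invented for illustration.
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "cb_description": ["too short", "this one has five words", None],
    }
)

min_words = 3  # illustrative threshold; the notebook uses 30

# Drop rows without a description, then count whitespace-separated words.
toy = toy.dropna(subset=["cb_description"])
word_counts = toy["cb_description"].str.split().str.len()
filtered = toy.loc[word_counts >= min_words]
print(filtered)
```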




In [6]:
raw_startups.shape

(17202, 3)

In [16]:
raw_startups.drop_duplicates(inplace=True)
raw_startups.shape

(17202, 3)

In [21]:
# pick one description by position (labels may have gaps after filtering)
desc = raw_startups['cb_description'].iloc[1]
# render its dependency parse
nlp = spacy.load("en_core_web_sm")
doc = nlp(desc)
displacy.render(doc, style='dep', jupyter=True)

### Cell 3

This is the `TextProcessing` class that handles all of our preprocessing. It takes a dataframe and a `data_type` string (`'startup'` or `'industry'`) that specifies which dataset the dataframe holds. It also has a method, `length_range`, that removes descriptions that are too short or too long. The range I chose was 15-60 tokens per description; note that the method's default is `(30, 150)`, so a different range has to be passed explicitly. To make this a little cleaner, the columns `cb_description` and `keywords` on the startup and industry dataframes could be renamed to `description` to allow for more uniform handling of the dataframes. I chose not to do this because it would require changing the column names in the other notebooks.

This class is stored under `src/data/preprocessing.py`, and can be imported as `from src.data.preprocessing import TextProcessing`
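
The renaming mentioned above could look like the following sketch. The toy dataframes, the `industry` column, and the `to_common_schema` helper are hypothetical and only illustrate the idea; the notebook itself keeps the original column names.

```python
import pandas as pd

# Map each dataset type to the name of its text column.
TEXT_COLUMNS = {"startup": "cb_description", "industry": "keywords"}


def to_common_schema(df: pd.DataFrame, data_type: str) -> pd.DataFrame:
    """Hypothetical helper: return a copy whose text column is called 'description'."""
    if data_type not in TEXT_COLUMNS:
        raise ValueError("data_type must be either 'startup' or 'industry'")
    return df.rename(columns={TEXT_COLUMNS[data_type]: "description"})


# Invented rows, only to show the column names before and after.
startups = pd.DataFrame({"name": ["Acme"], "cb_description": ["Acme builds rockets"]})
industries = pd.DataFrame({"industry": ["Aerospace"], "keywords": ["rocket launch orbit"]})

startups = to_common_schema(startups, "startup")
industries = to_common_schema(industries, "industry")
print(startups.columns.tolist(), industries.columns.tolist())
```

With a shared column name, the class would no longer need to pick a target column based on `data_type`.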

In [18]:
class TextProcessing:
    def __init__(self, df: pd.DataFrame, data_type='startup'):
        # validate first, so a bad argument fails before the spaCy model is loaded
        if data_type not in ('startup', 'industry'):
            raise ValueError("data_type must be either 'startup' or 'industry'")
        self.nlp = spacy.load("en_core_web_sm")
        # work on a copy with a fresh 0..n-1 index so the caller's dataframe is untouched
        self.data = df.copy()
        self.data.reset_index(drop=True, inplace=True)
        self.data_type = data_type

    def _iterate_rows(self, column_name):
        for index, row in tqdm(self.data.iterrows()):
            text = row[column_name]
            yield index, text

    def separate_first_sentence(self):
        self.data['next_sentences'] = ""

        for i, row in tqdm(self.data.iterrows()):
            text = row['cb_description']
            doc = self.nlp(text)
            sentences = [sent.text for sent in doc.sents]
            if len(sentences) > 1:
                self.data.loc[i, 'cb_description'] = sentences[0]
                self.data.loc[i, 'next_sentences'] = ' '.join(sentences[1:])
        return self.data

    def length_range(self, length_range=(30, 150)):
        # keep only descriptions whose word count lies within the inclusive range
        self.data.dropna(inplace=True)
        lengths = self.data['cb_description'].str.split().str.len()
        self.data = self.data[lengths.between(length_range[0], length_range[1])].copy()

        return self.data

    def preprocess_text(self, remove_non_english=True, remove_noise=True):
        target_column = 'keywords' if self.data_type == 'industry' else 'cb_description'

        for index, text in self._iterate_rows(target_column):

            # keep only nouns and adjectives
            doc = self.nlp(text)

            tokens = [token.text for token in doc if token.pos_ in ('NOUN', 'ADJ')]
            text = " ".join(tokens)

            # remove any named entities that survive the part-of-speech filter
            doc = self.nlp(text)
            for ent in doc.ents:
                text = text.replace(ent.text, "")

            # drop function words; spaCy tags coordinating conjunctions as 'CCONJ'
            doc = self.nlp(text)
            unimportant_pos = {'DET', 'CCONJ', 'ADP', 'AUX', 'PUNCT', 'PART', 'PRON', 'SCONJ'}
            tokens = [token.text for token in doc if token.pos_ not in unimportant_pos]
            text = " ".join(tokens)
            doc = self.nlp(text)


            if remove_non_english:
                tokens = [token.text for token in doc if token.lang_ == 'en' and token.is_alpha]
                text = " ".join(tokens)
                # re-parse so the noise filter below works on the filtered text
                doc = self.nlp(text)

            if remove_noise:
                tokens = [token.text.lower() for token in doc if
                          not token.is_stop
                          and not token.is_punct
                          and not token.is_space
                          and not token.like_num
                          and not token.is_digit
                          and not token.is_currency
                          and not token.is_bracket
                          and not token.is_quote
                          and not token.is_left_punct
                          and not token.is_right_punct
                          and not token.like_url
                          and not token.like_email]
                text = " ".join(tokens)

            self.data.at[index, target_column] = text if text else np.nan

        self.data.dropna(inplace=True)
        return self.data

    def lemma(self):
        target_column = 'keywords' if self.data_type == 'industry' else 'cb_description'

        for index, text in self._iterate_rows(target_column):
            doc = self.nlp(text)
            tokens = [token.lemma_ for token in doc]
            lemmatized_text = " ".join(tokens)
            self.data.at[index, target_column] = lemmatized_text if lemmatized_text else np.nan

        self.data.dropna(inplace=True)
        return self.data

    @staticmethod
    def make_keywords_unique(df):
        # keep only the first occurrence of each keyword across the whole dataframe
        seen = set()
        for index, row in df.iterrows():
            keywords = []
            for keyword in row['keywords'].split():
                if keyword not in seen:
                    seen.add(keyword)
                    keywords.append(keyword)
            df.at[index, 'keywords'] = ' '.join(keywords)
        return df


In [9]:
raw_startups.head()

      id              name                                     cb_description
0   4081      InterResolve  InterResolve is a radically new approach to de...
1   2785         GladCloud  GladCloud is a trade marketing infrastructure ...
2  23680          13th-Lab  13th Lab is developing the next generation com...
3   2932      Hilson-Moran  Hilson Moran provides consultancy in building ...
4  22003  1928-Diagnostics  1928 Diagnostics is a digital health company, ...


### Cell 4

This is the start of the preprocessing journey. First, initialize the class with the dataframe you want to preprocess and explicitly set `data_type` to `'startup'` or `'industry'`. This is necessary because the two dataframes use different column names. Then call the methods you want to use. The preprocessed dataframe is stored on the instance as `self.data`, and each method also returns it, so capturing the return value is optional. The one exception is `make_keywords_unique`, a static method that takes the dataframe as an argument.

In [10]:
print(raw_startups.head(10))

       id              name                                     cb_description
0    4081      InterResolve  InterResolve is a radically new approach to de...
1    2785         GladCloud  GladCloud is a trade marketing infrastructure ...
2   23680          13th-Lab  13th Lab is developing the next generation com...
3    2932      Hilson-Moran  Hilson Moran provides consultancy in building ...
4   22003  1928-Diagnostics  1928 Diagnostics is a digital health company, ...
5   21617        1939-Games  1939 Games is a indie game development studio ...
12  23558           2021.AI  2021.AI serves the growing business need for f...
13  22310            20nine  20nine has helped transform regional, national...
14  21774           21GRAMS  21 Grams offers postal management to corporate...
15   9229            Geltor  Geltor is the conscious biodesign company crea...


In [19]:
preprocess = TextProcessing(raw_startups.head(1000), data_type='startup')


In [None]:
#x = preprocess.separate_first_sentence()


In [None]:
x

### Cell 5

This cell runs the full preprocessing and returns a dataframe with the cleaned text. A progress bar from the `tqdm` library shows the progress of each pass.
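
`DataFrame.iterrows()` is a generator with no length, which is why the bars in this notebook show only a running count such as `1000it`. Passing `total=len(df)` lets `tqdm` display a percentage and a time estimate as well. A minimal sketch on an invented dataframe:

```python
import pandas as pd
from tqdm import tqdm

# Invented data, only to have something to iterate over.
df = pd.DataFrame({"cb_description": ["first text", "second text", "third text"]})

lengths = []
# iterrows() has no len(), so tell tqdm how many rows to expect.
for index, row in tqdm(df.iterrows(), total=len(df)):
    lengths.append(len(row["cb_description"].split()))

print(lengths)
```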

In [20]:
preprocess.preprocess_text()
df = preprocess.lemma()

1000it [01:06, 15.06it/s]
1000it [00:14, 67.15it/s]


In [None]:
df.head(10)

In [22]:
df.to_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed/spacy_engineered/1k_nouns_adjectives.csv', index=False)

