# Step 1 Data Preprocessing

Now that we have our data we can preprocess it. We will be using the spacy library to preprocess our data. We will be using the following steps to preprocess our data:

1. Remove non-english tokens
2. Remove noise from our data which includes things like:
    - Stop words
    - Punctuation
    - Entities like locations and currencies
    - Numbers
    - URLs
    - Emails
3. Lemmatize our data
4. Remove tokens that are too short or too long. The range i chose was 15-60 tokens per description
5. Remove duplicate keywords from our industry dataset



### Cell 1 - Imports

In [7]:
# # # # IMPORTS # # # #

from tqdm import tqdm_notebook as tqdm
import spacy
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt


### Cell 2 - Load Data

In [3]:

with open(r'C:\Users\imran\DataspellProjects\WalidCase\data\raw\startup_dataset.csv', 'r', encoding='utf-8',
          errors='ignore') as f:
    raw_startups = pd.read_csv(f)

raw_industries = pd.read_csv(r'C:\Users\imran\DataspellProjects\WalidCase\data\processed\industry_dataset_clean.csv',
                             sep='\t')


### Cell 3

This is the `TextPreprocessing` class that will handle all of our preprocessing. It will take in a dataframe and a boolean value to specify if the data is for startups or industries. It will also have a method to remove tokens that are too short or too long. The range i chose was 15-60 tokens per description. To make this a little cleaner, the dataframe columns `cb_description` and `keywords` on the startups and industry dataframes can be changed to `description` to allow for more uniform handling of the df. I chose not to do this because it would require changing the names of the columns in the other notebooks.

This class is stored under `src/data/preprocessing.py`, and can be imported as `from src.data.preprocessing import TextProcessing`

In [12]:


class TextProcessing:
    def __init__(self, df: pd.DataFrame = None, industry=False, startup=False):
        self.nlp = spacy.load("en_core_web_sm")
        self.startups = pd.DataFrame([])
        self.industries = pd.DataFrame([])
        if startup:
            self.startups = df.copy()
        elif industry:
            self.industries = df.copy()
        else:
            raise ValueError("Please specify if the data is for startups or industries")

    def __iterate_rows(self):
        df = self.startups if not self.startups.empty else self.industries
        for index, row in tqdm(df.iterrows()):
            self.index = index
            if not self.industries.empty:
                self.about_us = row["keywords"]
            else:
                self.about_us = row["cb_description"]
            yield self

    def length_range(self, data, length_range=(30, 150)):

        self.startups = data.copy()
        self.startups.dropna(inplace=True)
        for i, row in self.startups.iterrows():
            length = len(row['cb_description'].split())
            if length < 15 or length > 60:
                self.startups.drop(i, inplace=True)

        return self.startups

    def remove_non_english_tokens(self, data=None):
        if data is not None:
            if not self.industries.empty:
                self.industries = data.copy()
            else:
                self.startups = data.copy()
        english_tokens = []
        for description in self.__iterate_rows():
            doc = self.nlp(self.about_us)
            tokens = [token.text for token in doc if token.lang_ == 'en' and token.is_alpha]
            self.about_us = " ".join(tokens)
            english_tokens.append(self.about_us)

        if not self.startups.empty:
            self.startups['cb_description'].replace(to_replace=self.startups['cb_description'].unique(),
                                                    value=english_tokens, inplace=True)
            return self.startups

        else:
            self.industries['keywords'].replace(to_replace=self.industries['keywords'].unique(), value=english_tokens,
                                                inplace=True)
            return self.industries

    def remove_noisy_tokens(self, data=None):
        if data is not None:
            if not self.industries.empty:
                self.industries = data.copy()
            else:
                self.startups = data.copy()

        cleaned_about_us = []
        for item in self.__iterate_rows():
            doc = self.nlp(self.about_us)
            for ent in doc.ents:
                if ent.label_:
                    self.about_us = self.about_us.replace(ent.text, "")
            cleaned_doc = self.nlp(self.about_us)

            tokens = [token.text.lower() for token in cleaned_doc if
                      not token.is_stop
                      and not token.is_punct
                      and not token.is_space
                      and not token.like_num
                      and not token.is_digit
                      and not token.is_currency
                      and not token.is_bracket
                      and not token.is_quote
                      and not token.is_left_punct
                      and not token.is_right_punct
                      and not token.like_url
                      and not token.like_email]

            self.about_us = " ".join(tokens)
            cleaned_about_us.append(self.about_us)
        if not self.industries.empty:
            self.industries['keywords'].replace(to_replace=self.industries['keywords'].unique(), value=cleaned_about_us,
                                                inplace=True)
            return self.industries
        else:
            self.startups['cb_description'].replace(to_replace=self.startups['cb_description'].unique(),
                                                    value=cleaned_about_us, inplace=True)
            return self.startups

    def lemma(self, data=None):
        if data is not None:
            if not self.industries.empty:
                self.industries = data.copy()
            else:
                self.startups = data.copy()
        lemmatized_about_us = []
        for description in self.__iterate_rows():
            doc = self.nlp(self.about_us)
            tokens = [token.lemma_ for token in doc]
            self.about_us = " ".join(tokens)
            lemmatized_about_us.append(" ".join(tokens))
        if not self.industries.empty:
            self.industries['keywords'].replace(to_replace=self.industries['keywords'].unique(),
                                                value=lemmatized_about_us, inplace=True)
            return self.industries
        else:
            self.startups['cb_description'].replace(to_replace=self.startups['cb_description'].unique(),
                                                    value=lemmatized_about_us, inplace=True)
            return self.startups

    @staticmethod
    def make_keywords_unique(df): # very ugly function but its okay
        unique_keywords = set()
        for index, row in df.iterrows():
            keywords = row['keywords'].split()
            unique_keywords.update(set(keywords))
        appended_keywords = []
        for index, row in df.iterrows():
            keywords = [keyword for keyword in row['keywords'].split() if keyword in unique_keywords and keyword not in appended_keywords]
            appended_keywords.extend(keywords)
            new_keys = ' '.join(keywords)
            df.at[index, 'keywords'] = new_keys
        return df




### Cell 4

This is the start of the preprocessing journey. First you initialize the class with the dataframe you want to preprocess, and explicitly set whether or not it is a startup df or not. This is necessary because of the difference in the column names. Then you can call the methods you want to use. The preprocessed dataframe will be saved as a class attribute, so there is no need to return anything. It is possible to pass a dataframe to the methods, but if you don't, the class attribute will be used.

In [None]:
preprocess = TextProcessing(raw_startups, startup=True)


### Cell 5

This cell fully preprocesses the data and returns a dataframe with the cleaned text. There will be a progress bar indicating the progress of the preprocessing. This is achieved through the ```tqdm``` library.

In [16]:
preprocess.remove_non_english_tokens()
preprocess.remove_noisy_tokens()
preprocess.lemma()
df = preprocess.length_range(length_range=range(15, 60))


Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for index, row in tqdm(df.iterrows()):


0it [00:00, ?it/s]

0it [00:00, ?it/s]

0it [00:00, ?it/s]

Unnamed: 0,id,name,cb_description
0,1820,0xKYC,modular knowledge system identity credential m...
1,1536,100ms,live video infrastructure platform provide sub...
2,3640,10X-Genomics,create revolutionary dna sequence technology h...
3,9594,111Skin,commit positive luxury skincare push boundary ...
4,4697,1715Labs,company establish commercialise technology
...,...,...,...
3995,6882,Rosaly,give ability manage advance payment request au...
3996,4394,Roslin-Technologies,mission improve protein production disruptive ...
3997,1036,Rossum,solve key step document base process receive d...
3998,8697,Rotaready,develop hospitality leisure retail stop shop s...


In [36]:
min_words = df['nr_words'].min()
max_words = df['nr_words'].max()
mean_words = df['nr_words'].mean()
std_dev_words = df['nr_words'].std()

print("Min words per row:", min_words)
print("Max words per row:", max_words)
print("Mean words per row:", mean_words)
print("Standard deviation of words per row:", std_dev_words)

# or just call df.describe()



Min words per row: 10
Max words per row: 17
Mean words per row: 12.426229508196721
Standard deviation of words per row: 1.3782131969735862
