In [29]:
import nltk
import pandas as pd

import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.insert(0, module_path)

from innoprod.sheet_tools import get_sheet_dfs
from innoprod.wrangling.wrangling_tools import is_non_empty
from innoprod.wrangling.msyh_data_sharing import wrangle_roadmaps

In [2]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [3]:
data = get_sheet_dfs()
roadmaps_df = wrangle_roadmaps(data['Roadmaps'])

# Pre-processing

In [4]:
col = 'What are the internal barriers to growth? How do you intend to finance future growth? Are there sufficient leadership and management skills in the business to achieve your growth? What opportunities do you have to expand into new markets?'
new_col = col + ' [Pre-processed]'
roadmaps_df[col]

0      As with many business one of the main barriers...
1      As with many businesses, one of the main barri...
2      [REDACTED] company has sought to implement dig...
3                                                    nan
4      [REDACTED] barrier to increased adoption of di...
                             ...                        
215    As with many business one of the main barriers...
216    CRM/ERP(part), and automation â€“ automatic we...
217    [REDACTED] company has invested in implementin...
218    [REDACTED] has been very little investment in ...
219    As with many businesses, one of the main barri...
Name: What are the internal barriers to growth? How do you intend to finance future growth? Are there sufficient leadership and management skills in the business to achieve your growth? What opportunities do you have to expand into new markets?, Length: 220, dtype: str

In [5]:
from nltk.tokenize import RegexpTokenizer
phrases = [
    # Multi-word phrases to keep as a single token; [Ii] matches either case of the first letter.
    "[Ii]ndustry 4\\.0",
]
regexpr = "|".join(phrases) + "|[\\w']+"
tokenizer = RegexpTokenizer(regexpr)
# regexpr
# from nltk.tokenize import word_tokenize
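
The phrase patterns are placed before the generic word pattern so that a phrase such as "Industry 4.0" is kept as one token. The sketch below illustrates this with `re.findall` from the standard library as a stand-in for `RegexpTokenizer` (an assumption: with default settings and no capturing groups, the tokenizer should return the same matches). The sample sentence is invented, and the character class is written as `[Ii]`.

```python
import re

# Phrase patterns come first in the alternation, so the regex engine
# tries them before falling back to the generic word pattern.
phrases = [r"[Ii]ndustry 4\.0"]
pattern = "|".join(phrases) + r"|[\w']+"

sample = "Investment in Industry 4.0 hasn't been a priority."  # made-up text

tokens = re.findall(pattern, sample)
words_only = re.findall(r"[\w']+", sample)  # no phrase pattern, for comparison

print(tokens)
print(words_only)
```

With the word pattern alone, "Industry 4.0" is split into three tokens because the space and the full stop are not word characters.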


In [6]:
from nltk.corpus import stopwords
stops = set(stopwords.words('english'))

In [7]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [8]:
from nltk import pos_tag

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return 'a'
    elif tag.startswith('V'):
        return 'v'
    elif tag.startswith('N'):
        return 'n'
    elif tag.startswith('R'):
        return 'r'
    else:
        return 'n'
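
The function above maps Penn Treebank tags, as produced by `pos_tag`, onto the single-letter part-of-speech codes that `WordNetLemmatizer.lemmatize` accepts. An equivalent table-driven sketch is shown below; the name `wordnet_pos_from_tag` is hypothetical and is not used elsewhere in this notebook.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code.
_TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}


def wordnet_pos_from_tag(tag, default="n"):
    """Hypothetical helper: return the WordNet POS code for a Treebank tag.

    Tags that are not adjectives, verbs, nouns or adverbs fall back to
    `default` (noun), matching the behaviour of get_wordnet_pos.
    """
    return _TAG_PREFIX_TO_WORDNET.get(tag[:1], default)


for tag in ["JJ", "VBD", "NNS", "RB", "DT"]:
    print(tag, "->", wordnet_pos_from_tag(tag))
```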

In [9]:
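def preprocess(row):
    """Tokenise a response, drop stop words and redaction markers, then lemmatise."""
    tokens = tokenizer.tokenize(row)
    # tokens = word_tokenize(row)
    # '[REDACTED]' is tokenised as 'REDACTED' because the brackets are not word characters.
    filtered_tokens = [word for word in tokens if word != 'REDACTED' and word.lower() not in stops]
    # tagged_tokens = pos_tag(filtered_tokens)
    # result = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in tagged_tokens]
    # Without a POS argument, lemmatize() treats every word as a noun.
    result = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    return result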

In [10]:
roadmaps_df[new_col] = roadmaps_df[col].apply(preprocess)
roadmaps_df[new_col]

0      [many, business, one, main, barrier, investmen...
1      [many, business, one, main, barrier, investmen...
2      [company, sought, implement, digital, technolo...
3                                                  [nan]
4      [barrier, increased, adoption, digital, techno...
                             ...                        
215    [many, business, one, main, barrier, investmen...
216    [CRM, ERP, part, automation, â, automatic, wei...
217    [company, invested, implementing, digital, tec...
218    [little, investment, industry 4.0, due, combin...
219    [many, business, one, main, barrier, investmen...
Name: What are the internal barriers to growth? How do you intend to finance future growth? Are there sufficient leadership and management skills in the business to achieve your growth? What opportunities do you have to expand into new markets? [Pre-processed], Length: 220, dtype: object

# Simple analysis

In [11]:
all_lemmas = roadmaps_df[new_col].explode()

In [12]:
all_lemmas.value_counts()[0:30]

What are the internal barriers to growth? How do you intend to finance future growth? Are there sufficient leadership and management skills in the business to achieve your growth? What opportunities do you have to expand into new markets? [Pre-processed]
investment       411
business         359
barrier          316
technology       275
support          168
main             167
digital          142
many             140
cost             122
equipment        121
need             116
return           102
system            97
manufacturing     96
balance           95
see               92
internal          87
company           86
one               85
finance           84
management        84
productivity      82
accessing         81
also              81
capital           79
come              74
automation        72
investing         70
development       69
seen              69
Name: count, dtype: int64
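
To make clear what `explode()` followed by `value_counts()` computes, the sketch below builds the same kind of frequency table with `collections.Counter`. The token lists are invented for illustration and are not taken from the roadmap data.

```python
from collections import Counter
from itertools import chain

# Invented stand-in for the pre-processed column: one token list per response.
token_lists = [
    ["many", "business", "barrier", "investment"],
    ["barrier", "investment", "technology"],
    ["investment"],
]

# Flatten the lists (the equivalent of explode) and count each token.
counts = Counter(chain.from_iterable(token_lists))
top_two = counts.most_common(2)
print(top_two)
```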

In [13]:
lemma_df = all_lemmas.value_counts().to_frame()
lemma_df

What are the internal barriers to growth? How do you intend to finance future growth? Are there sufficient leadership and management skills in the business to achieve your growth? What opportunities do you have to expand into new markets? [Pre-processed],count
investment,411
business,359
barrier,316
technology,275
support,168
...,...
ingredient,1
shopfloor,1
non,1
connected,1


# Repeated phrases
We begin by printing the responses to the internal-barriers question for only the 8 firms that have a value for 'Increased TO/Employee', to get a flavour of the responses.

In [5]:
col = 'What are the internal barriers to growth? How do you intend to finance future growth? Are there sufficient leadership and management skills in the business to achieve your growth? What opportunities do you have to expand into new markets?'
mask = roadmaps_df['Increased TO/Employee'].notna()
for row in roadmaps_df[mask][col]:
    print(row)
    print('')

As with many business one of the main barriers to investment in [REDACTED] has been accessing the finance to support investment, specifically in respect to capital equipment which comes at substantive cost. [REDACTED] have incorporated limited automation in their manufacturing process to support operations to date, most processes and stock movement is manual, they are open to investment but have to balance these against day to day business liquidity requirements. .[REDACTED] management are progressive and forward thinking and see technology as a means of enhancing productivity but need to balance investment with demonstrable business returns, the critical barrier at present is that the business doesnt know what it doesnt know - the development of a clear digital strategy should be seen as crucial

As with many business one of the main barriers to investment in [REDACTED] has been accessing the finance to support investment, specifically in respect to capital equipment which comes at su

Manual inspection of these responses revealed that some phrases are used repeatedly, and that this pattern is not restricted to these 8 firms:

In [6]:
main_barriers_phrase = 'As with many business one of the main barriers to investment in [REDACTED] has been accessing the finance to support investment, specifically in respect to capital equipment which comes at substantive cost.'
col = 'What are the internal barriers to growth? How do you intend to finance future growth? Are there sufficient leadership and management skills in the business to achieve your growth? What opportunities do you have to expand into new markets?'
main_barriers_mask = roadmaps_df[col].str.contains(main_barriers_phrase, regex=False)
sum(main_barriers_mask)

55
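
Passing `regex=False` matters here because the phrase contains `[REDACTED]`, which a regular-expression engine would read as a character class rather than as literal text. A minimal illustration using only the standard library, with shortened, invented strings:

```python
import re

phrase = "investment in [REDACTED] has been"
text = "barriers to investment in [REDACTED] has been accessing the finance"

# Plain substring test: the analogue of a literal (non-regex) match.
literal_hit = phrase in text

# Treated as a regex, "[REDACTED]" matches a single character from the set
# R, E, D, A, C, T, so the pattern does not match the bracketed marker.
regex_hit = re.search(phrase, text) is not None

# Escaping the phrase restores the literal meaning.
escaped_hit = re.search(re.escape(phrase), text) is not None

print(literal_hit, regex_hit, escaped_hit)
```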

In [7]:
progressive_phrase = '[REDACTED] management are progressive and forward thinking and see technology as a means of enhancing productivity but need to balance investment with demonstrable business returns'
col = 'What are the internal barriers to growth? How do you intend to finance future growth? Are there sufficient leadership and management skills in the business to achieve your growth? What opportunities do you have to expand into new markets?'
progressive_mask = roadmaps_df[col].str.contains(progressive_phrase, regex=False)
sum(progressive_mask)

50

In [8]:
sum(main_barriers_mask & progressive_mask)

50
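
Because the intersection is the same size as `progressive_mask` on its own, every response containing the 'progressive' phrase also contains the 'main barriers' phrase. The sketch below shows the same subset check on invented boolean lists standing in for the two masks.

```python
# Invented stand-ins for the two boolean masks (one entry per response).
main_barriers = [True, True, True, False, False]
progressive = [True, True, False, False, False]

# Element-wise AND, the list equivalent of `main_barriers_mask & progressive_mask`.
both = [a and b for a, b in zip(main_barriers, progressive)]

# If the intersection is as large as one of the sets, that set is a subset of the other.
progressive_is_subset = sum(both) == sum(progressive)
print(sum(main_barriers), sum(progressive), sum(both), progressive_is_subset)
```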

We now check whether any other sentences are used repeatedly. Note that this initial approach counts whole sentences only, so it won't fully capture phrases that sometimes form only part of a sentence, such as the '[REDACTED] management are progressive and forward thinking...' phrase above.

In [34]:
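from nltk.tokenize import PunktSentenceTokenizer
# Use a separate name so the RegexpTokenizer used by preprocess() is not overwritten.
sentence_tokenizer = PunktSentenceTokenizer()

mask = is_non_empty(roadmaps_df[col])

# Copy the selection so that adding a column does not trigger a SettingWithCopyWarning.
sentences_df = roadmaps_df.loc[mask, ['Client ID', col]].copy()
sentences_df['sentences'] = sentences_df[col].apply(sentence_tokenizer.tokenize)
# One row per sentence, then count how often each distinct sentence occurs.
sentences_df = sentences_df[['Client ID', 'sentences']].explode('sentences').value_counts('sentences').to_frame()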
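
Counting exact sentences can undercount a shared phrase when it sometimes appears as a whole sentence and sometimes inside a longer one. The sketch below illustrates the difference on invented responses. It uses a naive regular-expression splitter as a stand-in for `PunktSentenceTokenizer` (an assumption made to keep the example dependency-free), and `naive_sentences` is a hypothetical helper.

```python
import re
from collections import Counter

phrase = "Management are progressive and need to balance investment"

# Invented responses: the phrase is a full sentence in two and a clause in one.
responses = [
    phrase + ". Cash flow is tight.",
    phrase + ", the critical barrier is a lack of strategy.",
    "Time is a barrier. " + phrase + ".",
]


def naive_sentences(text):
    """Hypothetical splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


sentence_counts = Counter(s for r in responses for s in naive_sentences(r))

exact = sentence_counts[phrase + "."]            # counted as a whole sentence
containing = sum(phrase in r for r in responses)  # counted as a substring
print(exact, containing)
```

The substring count is higher because the second response embeds the phrase in a longer sentence, which the sentence-level count treats as a different item.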

In [35]:
sentences_df[0:20]

sentences,count
"As with many business one of the main barriers to investment in [REDACTED] has been accessing the finance to support investment, specifically in respect to capital equipment which comes at substantive cost.",55
[REDACTED] management are progressive and forward thinking and see technology as a means of enhancing productivity but need to balance investment with demonstrable business returns,34
"As with many businesses, one of the main barriers to investment in [REDACTED] has been accessing the finance to support investment.",26
[REDACTED] main barrier is knowledge but also investing in the right technology to see a return on investment has been seen as a barrier in the past.,19
[REDACTED] main barrier affecting current investment is the lack of a structured digitisation strategy.,18
[REDACTED] is needed to inform the areas of highest opportunity to produce a tangible return from investing in digitisation and as such create the investment justifications to underpin and inform an internal procurement process.,18
"[REDACTED] management are progressive and forward thinking and see technology as a means of enhancing productivity but need to balance investment with demonstrable business returns, the critical barrier at present is that the business doesnt know what it doesnt know - the development of a clear digital strategy should be seen as crucial",13
As with many businesses however internally they have to balance the requirements of maintaining a stable financial position and as such any investment in digital technology needs to be able to demonstrate a tangible positive impact on both productivity and profitability.,13
[REDACTED] company has invested in implementing digital technologies where they are seen as being value add to the company's operating model.,12
[REDACTED] time can be a significant barrier.,11
