# Business questions
In iAuditor, customers __create or apply__ public *templates* (checklists), __use__ those *templates* for their inspections, __create actions__ from issues found during inspections, and __report incidents/accidents__ whenever necessary. Our customers come from a __variety of industries__, with a large proportion in food & hospitality, construction, and manufacturing. With the provided data:
1. How can we increase the number of customers using the standard public checklist templates for their inspections (e.g. how can we recommend suitable templates to customers)?
2. Based on the findings above, what potential features or products could be built to improve our customers' experience?
3. How could the data from the checklists be correlated with potential risks relevant to the respective inspections?


My view: 
- Different types of Templates
- Various industries
- Public and custom templates

# Data Analysis

In [222]:
import json
import pandas as pd
from collections import defaultdict

In [225]:
INPUT_FILE = "/Users/silvia/Downloads/sample_pl_data.json"
with open(INPUT_FILE, "r") as fin:
    data = json.load(fin)
    
print(f"Data loaded: {len(data)} items")

Data loaded: 170 items


In [25]:
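# Count how many templates contain each top-level key.
counts = defaultdict(int)
for template in data:
    for key in template:
        counts[key] += 1

for key, count in counts.items():
    print(key, count)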


revision_key 170
_rev 93
created_at 170
type 170
export_profiles 63
template_data 170
deleted 170
permissions 170
action_item_profiles 27
autoshares 4
name 156
header 170
template_id 170
modified_at 170
items 170
trashed 170
temp_rev 61
meta 73
assets 9
migrated_at 19
_id 2
revision_id 1
libraryId 21
server_revision_key 16
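
The raw counts above are easier to compare as a share of all records, which makes optional fields stand out. The sketch below assumes a small hypothetical list `records` in place of `data`; the names `records` and `coverage` are illustrative and not part of the notebook.

```python
from collections import Counter

# Hypothetical stand-in records; the notebook itself iterates over `data`.
records = [
    {"template_id": "t1", "name": "Kitchen audit", "items": []},
    {"template_id": "t2", "items": []},
    {"template_id": "t3", "name": "Site walk", "items": [], "meta": {}},
]

# Count in how many records each top-level key occurs.
key_counts = Counter(key for record in records for key in record)

# Express each count as a share of all records, most common first.
coverage = {key: count / len(records) for key, count in key_counts.most_common()}

for key, share in coverage.items():
    print(f"{key:<12} {share:.0%}")
```

Keys with a share well below 100% are candidates for optional or legacy fields and may need a default before they are used as features.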


In [29]:
df_counts = pd.DataFrame(counts.items())
df_counts.sort_values(by=1, ascending=False)

Unnamed: 0,0,1
0,revision_key,170
7,permissions,170
15,trashed,170
14,items,170
13,modified_at,170
11,header,170
12,template_id,170
6,deleted,170
5,template_data,170
3,type,170


In [32]:
revision_keys = defaultdict(int)
for template in data:
    revision_keys[template["revision_key"]]+=1
revision_keys

defaultdict(int,
            {'36b7d0d5-d599-437f-97ca-27d31db910b6': 1,
             'efcef146-0022-4966-b15c-9df6e79f88ea': 1,
             '1c6267d8-8f3d-46f5-a021-81a4e77630dd': 1,
             'f23143dd-d0f5-40d6-90ee-c4dcb91fcd51': 1,
             '63948F9D-9710-4DF0-888A-4C046BEED772': 1,
             '0e36b73a-1540-484a-995e-ac62cfeb0c50': 1,
             '17b8e032-c957-4d63-b689-1f4c191d9bd1': 1,
             '1eb1aaf7-8609-485d-b576-2d40ccd989e4': 1,
             '4a42e55c-7acc-4132-9cc6-5f6c790edf2a': 1,
             '4432dbfc-519d-490f-bba7-8e1182b277cb': 1,
             '53c3427f-991c-4fe6-ab8f-e20225f1dc63': 1,
             '64383f2e-3f4f-406a-8dbb-7c28a093b288': 1,
             '5ba8228a-29bd-4e32-821b-e81522447690': 1,
             '35cc9da4-7b93-49fe-bcee-a32560218aa6': 1,
             '7f80b96b-7b39-4cb4-9f2a-88965ab76dae': 1,
             'd8e0aea9-d91c-4295-9603-bf77a573fddf': 1,
             '284c0f60-4355-4c5f-b3b6-38d5a0d43ad3': 1,
             'b3752269-0e31-4e2

In [33]:
with open(INPUT_FILE, "r") as fin:
    data1 = pd.read_json(fin)
data1

Unnamed: 0,revision_key,_rev,created_at,type,export_profiles,template_data,deleted,permissions,action_item_profiles,autoshares,...,items,trashed,temp_rev,meta,assets,migrated_at,_id,revision_id,libraryId,server_revision_key
0,36b7d0d5-d599-437f-97ca-27d31db910b6,6-96455a285bbe4176b2d3e5433645de76,2018-04-27 01:18:26.079000+00:00,template,{},{'metadata': {'audit_title_rule': ['f3245d40-e...,False,{'owner': 'user_80a0569c75c211e49ed3001b1118ce...,{},{'user_de146dd7a04011e4b27f001b1118ce11': {'vi...,...,[{'item_id': '95594086-34f0-4680-b43b-78fa330c...,False,,,,NaT,,,,
1,efcef146-0022-4966-b15c-9df6e79f88ea,,2018-06-29 05:55:32.658000+00:00,template,{},{'metadata': {'audit_title_rule': ['f3245d40-e...,False,{'owner': 'user_de146dd7a04011e4b27f001b1118ce...,{},,...,[{'item_id': 'e6384211-e707-484a-96fb-98664b17...,False,1-c8c995170f7b422a90c3c9ac48605f08,{'rev': '1821-153c8bdc61f000000000000000000000...,,NaT,,,,
2,1c6267d8-8f3d-46f5-a021-81a4e77630dd,,2018-08-02 04:14:53.311000+00:00,template,,{'metadata': {'audit_title_rule': ['f3245d40-e...,False,{'owner': 'user_7d7e66c7db3e4c8387e31cbcc81323...,{},,...,[{'item_id': '6e1cb72b-389e-40e3-9779-f47e8574...,False,1-148ecc06624b4ee8ad11bb9f3ab29aef,{'rev': '16925-1546f61cdb540000000000000000000...,[],NaT,,,,
3,f23143dd-d0f5-40d6-90ee-c4dcb91fcd51,,2018-04-18 01:14:14.893000+00:00,template,,{'metadata': {'audit_title_rule': ['f3245d42-e...,False,{'owner': 'user_1d8fa6fdca154b42a0d8b1bcf9b720...,,,...,[{'item_id': '9eb6457e-8f93-4158-808a-11e36612...,True,,{'rev': '1721-15266f12dd4100000000000000000000...,,NaT,,,,
4,63948F9D-9710-4DF0-888A-4C046BEED772,,2018-07-23 04:28:54.832000+00:00,template,,"{'metadata': {'image': '', 'doc_no': '[number]...",False,{'owner': 'user_935cec3a55a211e39f35001b1118ce...,{},,...,[{'item_id': '190E2CBC-B49B-4961-B13A-DF4162E4...,False,1-56631843d31848e49ba04f8f2ec61a7f,"{'rev': '52-1543e50b8c4300000000000000000000',...",,NaT,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165,AE67EDD1-60A0-4CCD-8C3F-B458FC28BC3A,1-ebf362848f8641d19bd011b2edc72025,2019-04-15 00:47:17.061000+00:00,template,,{'metadata': {'audit_title_rule': ['f3245d40-e...,False,{'owner': 'user_13dc7171247047129a4e00936c9c5d...,,,...,[{'item_id': '5e93bab0-d8e0-11e2-9b70-a5eba55b...,False,,,,NaT,,,,
166,c0425d82-a021-4b24-8881-d17ced63a172,10-a4c261064df24449a1e3d4f5c1295030,2019-05-09 04:54:17.808000+00:00,template,{},{'metadata': {'audit_title_rule': ['f3245d42-e...,False,{'owner': 'user_2f6a79ed4dcd444d93fefe5cf375c9...,,,...,[{'item_id': 'e6352487-f07e-4af9-8390-cf59d29b...,False,,,,NaT,,,,
167,e510b37f-a02e-48dc-9c31-2539233d10f8,,2018-07-31 23:41:49.239000+00:00,template,,{'metadata': {'audit_title_rule': ['f3245d40-e...,False,{'owner': 'user_6fd834119a0811e3bfe7001b1118ce...,,,...,[{'item_id': 'b07a6cd0-29b9-11e5-892f-a3a8eaef...,False,1-54ea5178a9c244f8a1ef2a9acb504fee,{'rev': '16485-1546989c77080000000000000000000...,,NaT,,,,
168,C844C1FD-C9B1-48EF-8BE1-8E06EFD8BFFD,,2018-08-04 12:35:37.235000+00:00,template,{'716D6753-28D7-4873-A305-4DD0397DEB37': {'tit...,{'metadata': {'audit_title_rule': ['f3245d40-e...,False,{'owner': 'user_af65f89e8a4211e2844f75356627f5...,,,...,[{'item_id': '31CCC382-7242-4C27-8ACD-4E5F2323...,False,1-72e400763f914c46a2ce88cb4cdfe563,{'rev': '16484-1547ae951e240000000000000000000...,,NaT,,,,


In [50]:
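item_frames = [pd.json_normalize(template["items"]) for template in data]  # pd.json_normalize: pandas >= 1.0
# ignore_index must be the boolean True; sort=True keeps the alphabetical column order shown below
df_items = pd.concat(item_frames, ignore_index=True, sort=True)
df_items.set_index("item_id")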



Unnamed: 0_level_0,action_item_profile_id,inactive,label,options,options.condition,options.drawing_base_image,options.element,options.enable_date,options.enable_signature_timestamp,options.enable_time,...,options.visible_in_report,options.weighting,parent_id,reference_item_profile_ids,responses.datetime,responses.name,responses.response,responses.text,responses.value,type
item_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
95594086-34f0-4680-b43b-78fa330c70a0,,,Audit,,,,,,,,...,,1.0,,,,,,,,section
06e129d0-05dc-496b-9c19-ce583fd8ea8e,,,,,,,,,,,...,,1.0,95594086-34f0-4680-b43b-78fa330c70a0,,,,,,,question
9616bec9-80ed-4589-8182-2ae443dc7bc9,,,,,,,,,,,...,,1.0,95594086-34f0-4680-b43b-78fa330c70a0,,,,,,,category
2a39efa2-2d5e-4f2b-9342-a8860ea84260,,,,,,,,,,,...,,1.0,9616bec9-80ed-4589-8182-2ae443dc7bc9,,,,,,,address
7395020e-aa56-48db-a401-ac361ce54c9e,,,,,,,,,,,...,,1.0,9616bec9-80ed-4589-8182-2ae443dc7bc9,,,,,,,scanner
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77F8D8E3-8F5B-45F8-BB25-88A9731BD667,,,Ground Anchor Position in housing (Gunnebo pods),,,,,,,,...,,,8952FD67-A221-43A9-96B8-4B5B8FD2236E,,,,,,,media
AB416A72-6F5D-48B2-85AF-290BDB1BB5F2,,,Distance shot showing ATM location within store,,,,,,,,...,,0.0,BD21E1C0-CE9E-4102-A172-2DF34B62204F,,,,,,,question
9CD481AF-60B9-4BA5-9230-E3FE8D431554,,,Distance shot showing ATM location within store,,,,,,,,...,,,BD21E1C0-CE9E-4102-A172-2DF34B62204F,,,,,,,media
21A3D435-FB28-46B0-B8D1-5EC830BB2A3F,,,Comments - please use comments box to note any...,,,,,,,,...,,,BD21E1C0-CE9E-4102-A172-2DF34B62204F,,,,,,,text


In [60]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df_items[["label", "type"]])

                                                   label          type
0                                                  Audit       section
1                                                             question
2                                                             category
3                                                              address
4                                                              scanner
5                                                             checkbox
6                                                             datetime
7                                                              drawing
8                                                          information
9                                                                 list
10                                                               media
11                                                           signature
12                                                              slider
13    

In [65]:
metadata_frames = [pd.json_normalize(template["template_data"]["metadata"]) for template in data]
# ignore_index must be the boolean True; sort=True keeps columns in alphabetical order
df_template_data = pd.concat(metadata_frames, ignore_index=True, sort=True)



In [68]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):  # more options can be specified also
    print(df_template_data[["industry","name","subindustry"]])

     industry                                               name  subindustry
0          -2                              All items - duplicate           -2
1          -2                                Restest - duplicate           -2
2          -2               bda04557-fed0-4f60-ad3e-ad1d590b79b1           -2
3          -2                      Question and List - duplicate           -2
4          -2                            Advanced Items Template           -2
5          -2                                            Dyn exp           -2
6          -2                                                 Xj           -2
7          -2                                              Zhzsv           -2
8          -2       asdasdsa - duplicate - duplicate - duplicate           -2
9          -2                                Restest - duplicate           -2
10         -2                                                 Sj           -2
11         -2                   asdasdsa - duplicate - duplicate

In [73]:
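metrics_frames = [pd.json_normalize(template["template_data"]["metrics"]) for template in data]
# ignore_index must be the boolean True; sort=True keeps the alphabetical column order shown below
df_template_items = pd.concat(metrics_frames, ignore_index=True, sort=True)
df_template_items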



Unnamed: 0,avg_duration,date_last_used,duration_count,est_duration,rating,use_count
0,0,1.555284e+09,,0.0,-1,12
1,0,1.505284e+09,,0.0,-1,4
2,-1,1.475626e+09,,-1.0,-1,1
3,0,-1.000000e+00,,0.0,-1,0
4,-1,-1.000000e+00,,-1.0,-1,0
...,...,...,...,...,...,...
165,-1,-1.000000e+00,,-1.0,-1,0
166,0,1.557378e+09,0.0,,-1,3
167,0,0.000000e+00,,-1.0,0,0
168,-1,-1.000000e+00,,-1.0,-1,0


In [74]:
df_template_items.describe()

Unnamed: 0,avg_duration,date_last_used,duration_count,est_duration,rating,use_count
count,170.0,170.0,84.0,120.0,170.0,170.0
mean,-0.523529,291199600.0,0.0,-0.866667,-0.670588,0.394118
std,0.500922,585156700.0,0.0,0.34136,0.471388,1.755118
min,-1.0,-1.0,0.0,-1.0,-1.0,0.0
25%,-1.0,-1.0,0.0,-1.0,-1.0,0.0
50%,-1.0,0.0,0.0,-1.0,-1.0,0.0
75%,0.0,0.0,0.0,-1.0,0.0,0.0
max,0.0,1557721000.0,0.0,0.0,0.0,13.0
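
The summary statistics above are dominated by values such as -1, which look like placeholders rather than measurements. As a sketch only, and assuming (the data does not document this) that negative values mean "not recorded", the valid values can be separated before averaging. The `metrics` rows and the helper `valid_values` are hypothetical.

```python
from statistics import mean

# Hypothetical use_count / rating pairs modelled on the rows shown above.
metrics = [
    {"use_count": 12, "rating": -1},
    {"use_count": 4, "rating": -1},
    {"use_count": 1, "rating": -1},
    {"use_count": 0, "rating": 0},
]


def valid_values(rows, field):
    """Return the values of `field` that are present and non-negative.

    Treating negative numbers as 'not recorded' is an assumption.
    """
    return [row[field] for row in rows if row.get(field) is not None and row[field] >= 0]


# Templates that were actually used at least once.
used = [value for value in valid_values(metrics, "use_count") if value > 0]

print(f"templates used at least once: {len(used)} of {len(metrics)}")
print(f"mean use_count among used templates: {mean(used):.2f}")
print(f"templates with a recorded rating: {len(valid_values(metrics, 'rating'))}")
```

Whether 0 is also a placeholder (for example in `rating`) cannot be told from the data shown, so it is kept as a valid value here.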


In [76]:
from collections import Counter
print(Counter(df_template_data["subindustry"]))   

Counter({0: 43, -1: 38, 1: 31, -2: 20, 2: 15, 3: 8, 5: 6, 4: 4, 13: 4, 7: 1})


# Data Preparation

In [None]:
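item_frames = [pd.json_normalize(template["items"]) for template in data]  # pd.json_normalize: pandas >= 1.0
# ignore_index must be the boolean True; sort=True keeps columns in alphabetical order
df_items = pd.concat(item_frames, ignore_index=True, sort=True)
df_items.set_index("item_id")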

In [79]:
df_items["action_item_profile_id"].dropna()

Series([], Name: action_item_profile_id, dtype: object)

In [None]:
templates = {}
for template in data:
    template_text = []
    for item in template["items"]:
        if ("label" in item) and (len(item["label"])>0):
            template_text.append((item["label"], item["type"]))
    templates[template["template_id"]] = template_text
           
templates

Filter out templates with eight or fewer labelled items; these were most likely created for testing.

In [121]:
templates = [(k,v) for k,v in templates.items() if len(v)>8]

## Tokenize

In [138]:
import re

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the NLTK stop-word corpus and tokenizer models (fetch once with nltk.download).
stop_words = set(stopwords.words('english'))

In [None]:
token_pattern = re.compile(r"[a-z]\w+")  # starts with a letter, at least two characters
templates_tokenized = {}
for key, texts in templates:
    words = [word.lower() for text, _ in texts for word in word_tokenize(text)]
    templates_tokenized[key] = [w for w in words if token_pattern.match(w) and w not in stop_words]
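
To make the token filter concrete, here is a dependency-free approximation. It swaps NLTK's `word_tokenize` and stop-word list for a regular expression and a tiny hypothetical `STOP_WORDS` set, so its output will not match NLTK exactly.

```python
import re

# Small illustrative stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "of", "is", "and", "to", "in", "any", "please", "use"}

# A word boundary, a letter, then at least one more word character.
TOKEN_RE = re.compile(r"\b[a-z]\w+")


def tokenize(label):
    """Lower-case a label and keep letter-leading tokens that are not stop words."""
    return [tok for tok in TOKEN_RE.findall(label.lower()) if tok not in STOP_WORDS]


print(tokenize("Distance shot showing ATM location within store"))
print(tokenize("Use the 2nd exit"))
```

Tokens that start with a digit (such as `2nd`) and single-character tokens are dropped, which mirrors the intent of the pattern used in the notebook cell.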

In [163]:
df_tokens = pd.DataFrame(templates_tokenized.items(), columns = ["Id", "Tokens"])
df_tokens.set_index("Id", inplace=True)

In [164]:
df_tokens

Unnamed: 0_level_0,Tokens
Id,Unnamed: 1_level_1
template_0D81EB72BFBD4D39ABC14BBB7735691F,"[sections, categories, sections, sections, cat..."
template_23C46B7E2CD34B4EA329A3EAF48590F8,"[identification, identification, insurer, make..."
template_6996751795CD497A8D681FBEB78C7957,"[answer, questions, basic, yes, question, tap,..."
template_51a6f94a9c9c4d1c92e896e7cde0ff74,"[modems, engines, aftershave, throaty, runny, ..."
template_D41194CF40B14DA5956C0D819A35902D,"[health, surveillance, record, health, surveil..."
...,...
template_17ded2e6b05245188bb5c47644e151e0,"[audit, modified, another, something]"
template_eb57fbf86e6a42c29a9fbd2bb62d270d,"[store, details, store, details, division, nor..."
template_CB3D831B3C25481896BEC6B2C0ED8D57,"[expertise, verzekerde, hoedanigheid, toedrach..."
template_6E2570F305DB4109A32863669C0018D0,"[recomendation, lettter, loss, control, recomm..."


## Labels

In [150]:
metadata_frames = [pd.json_normalize(template["template_data"]["metadata"]) for template in data]
# ignore_index must be the boolean True; sort=True keeps columns in alphabetical order
df_template_data = pd.concat(metadata_frames, ignore_index=True, sort=True)
df_template_data[["industry","name","subindustry"]]



Unnamed: 0,industry,name,subindustry
0,-2,All items - duplicate,-2
1,-2,Restest - duplicate,-2
2,-2,bda04557-fed0-4f60-ad3e-ad1d590b79b1,-2
3,-2,Question and List - duplicate,-2
4,-2,Advanced Items Template,-2
...,...,...,...
165,7,Rapport van expertise - WB,3
166,7,Grs test 2,3
167,7,The new DR!,3
168,7,FFVA Loss Control Recommendation Letter [EM] -...,3


In [156]:
rows = []
for template in data:
    metadata = template["template_data"]["metadata"]
    rows.append((template["template_id"], metadata["industry"], metadata["subindustry"]))
df_labels = pd.DataFrame(rows, columns=["Id", "Industry", "Subindustry"])
df_labels.set_index("Id", inplace=True)

In [157]:
df_labels

Unnamed: 0_level_0,Industry,Subindustry
Id,Unnamed: 1_level_1,Unnamed: 2_level_1
template_65db0caaac874d08ae7e7d15d05b5c7c,-2,-2
template_dcb88e21ea304c9baa842daa2e5abafc,-2,-2
template_58aa02d963444fd9b94c76ea03537d9f,-2,-2
template_c0ed72130e4f401eb281c5d7e725ec64,-2,-2
template_0D81EB72BFBD4D39ABC14BBB7735691F,-2,-2
...,...,...
template_CB3D831B3C25481896BEC6B2C0ED8D57,7,3
template_4fcce8994e9f453c8f87ed7eccc41591,7,3
template_c814e26f64944ee1bde8917a1f3587e2,7,3
template_6E2570F305DB4109A32863669C0018D0,7,3


In [167]:
df_data = df_tokens.join(df_labels, how="inner")

In [171]:
df_data["Text"] = df_data["Tokens"].apply(lambda tokens: ' '.join(tokens))
df_data

Unnamed: 0_level_0,Tokens,Industry,Subindustry,Text
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
template_0D81EB72BFBD4D39ABC14BBB7735691F,"[sections, categories, sections, sections, cat...",-2,-2,sections categories sections sections categori...
template_23C46B7E2CD34B4EA329A3EAF48590F8,"[identification, identification, insurer, make...",-2,-2,identification identification insurer make pla...
template_6996751795CD497A8D681FBEB78C7957,"[answer, questions, basic, yes, question, tap,...",-2,-2,answer questions basic yes question tap yes ma...
template_51a6f94a9c9c4d1c92e896e7cde0ff74,"[modems, engines, aftershave, throaty, runny, ...",-1,-1,modems engines aftershave throaty runny vaults...
template_D41194CF40B14DA5956C0D819A35902D,"[health, surveillance, record, health, surveil...",-1,-1,health surveillance record health surveillance...
...,...,...,...,...
template_17ded2e6b05245188bb5c47644e151e0,"[audit, modified, another, something]",6,2,audit modified another something
template_eb57fbf86e6a42c29a9fbd2bb62d270d,"[store, details, store, details, division, nor...",7,-1,store details store details division north reg...
template_CB3D831B3C25481896BEC6B2C0ED8D57,"[expertise, verzekerde, hoedanigheid, toedrach...",7,3,expertise verzekerde hoedanigheid toedracht fo...
template_6E2570F305DB4109A32863669C0018D0,"[recomendation, lettter, loss, control, recomm...",7,3,recomendation lettter loss control recommendat...


In [168]:
df_data.groupby("Industry").count()

Unnamed: 0_level_0,Tokens,Subindustry
Industry,Unnamed: 1_level_1,Unnamed: 2_level_1
-2,3,3
-1,5,5
0,13,13
1,13,13
2,13,13
3,18,18
4,10,10
5,9,9
6,12,12
7,4,4


In [175]:
df_data['Industry']

Id
template_0D81EB72BFBD4D39ABC14BBB7735691F   -2
template_23C46B7E2CD34B4EA329A3EAF48590F8   -2
template_6996751795CD497A8D681FBEB78C7957   -2
template_51a6f94a9c9c4d1c92e896e7cde0ff74   -1
template_D41194CF40B14DA5956C0D819A35902D   -1
                                            ..
template_17ded2e6b05245188bb5c47644e151e0    6
template_eb57fbf86e6a42c29a9fbd2bb62d270d    7
template_CB3D831B3C25481896BEC6B2C0ED8D57    7
template_6E2570F305DB4109A32863669C0018D0    7
template_C35D4AC035F841E188FA195D4242D6A2    7
Name: Industry, Length: 100, dtype: int64

In [187]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Stratify so that each industry code is represented in both splits.
train_x, valid_x, train_y, valid_y = train_test_split(
    df_data['Text'], df_data['Industry'], stratify=df_data['Industry']
)

# Optional: label-encode the target (fit on the training labels only).
# encoder = LabelEncoder()
# train_y = encoder.fit_transform(train_y)
# valid_y = encoder.transform(valid_y)

### Tf-Idf 

In [191]:
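from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level TF-IDF. Note: the vectorizer is fitted on the full corpus, so validation
# vocabulary and IDF statistics leak into the features; fit on train_x for a stricter test.
tfidf_vect = TfidfVectorizer(analyzer='word')
tfidf_vect.fit(df_data['Text'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)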

In [197]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

classifier = MultinomialNB()
# fit the classifier on the training set
classifier.fit(xtrain_tfidf, train_y)

# predict the labels of the validation set
predictions = classifier.predict(xvalid_tfidf)

# accuracy_score expects (y_true, y_pred)
accuracy_score(valid_y, predictions)

0.56
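
An accuracy of 0.56 is easier to judge against a trivial baseline. The sketch below takes the per-industry template counts from the `groupby` output earlier in this section and computes the accuracy of always predicting the most frequent industry. It is computed on all 100 templates rather than on the validation split, so treat it as a rough reference point.

```python
# Template counts per industry code, copied from the groupby output above.
industry_counts = {-2: 3, -1: 5, 0: 13, 1: 13, 2: 13, 3: 18, 4: 10, 5: 9, 6: 12, 7: 4}

total = sum(industry_counts.values())
majority_label, majority_count = max(industry_counts.items(), key=lambda kv: kv[1])

# Accuracy of a classifier that always predicts the most frequent industry.
baseline_accuracy = majority_count / total
print(f"majority class {majority_label}: baseline accuracy {baseline_accuracy:.2f}")
```

With these counts the baseline is 0.18, so the classifier does pick up signal from the template text, although a single split of 100 samples gives a noisy estimate.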

In [202]:
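# TF-IDF weights of the first training document, highest first.
first_doc_tfidf = xtrain_tfidf[0].toarray().ravel()
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
feature_names = tfidf_vect.get_feature_names_out()
keywords = [(word, value) for value, word in zip(first_doc_tfidf, feature_names) if value > 0]
pd.DataFrame(keywords).sort_values(1, ascending=False)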

Unnamed: 0,0,1
84,floor,0.286107
197,surfaces,0.238422
1,appearance,0.224406
205,toilet,0.213733
123,light,0.212087
...,...,...
221,working,0.011897
223,yes,0.011201
176,safety,0.010734
54,date,0.008192
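
The same TF-IDF representation could also support the template recommendation question: templates whose vectors are close are candidates to suggest to each other's users. A minimal pure-Python sketch, assuming three hypothetical token lists and cosine similarity as the closeness measure (neither is taken from the notebook):

```python
import math
from collections import Counter

# Hypothetical token lists standing in for rows of df_data["Tokens"].
templates = {
    "kitchen_hygiene": ["floor", "surfaces", "clean", "toilet", "light"],
    "site_safety": ["scaffold", "helmet", "fall", "electrical", "floor"],
    "restroom_check": ["toilet", "clean", "soap", "floor", "light"],
}


def tfidf_vectors(docs):
    """Build L2-normalised TF-IDF vectors (smoothed IDF) as {doc_id: {term: weight}}."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors[doc_id] = {term: w / norm for term, w in weights.items()}
    return vectors


def recommend(query_id, vectors, top_n=2):
    """Rank the other templates by cosine similarity to `query_id`."""
    query = vectors[query_id]
    scores = {
        other: sum(w * vec.get(term, 0.0) for term, w in query.items())
        for other, vec in vectors.items()
        if other != query_id
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


vectors = tfidf_vectors(templates)
print(recommend("kitchen_hygiene", vectors))
```

Because the vectors are unit length, the dot product is the cosine similarity. In practice the query could be a customer's own custom template, with the public templates as the candidate set.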


### LDA

In [212]:
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
# create a count vectorizer object 
count_vect = CountVectorizer(analyzer='word')
count_vect.fit(df_data['Text'])

# transform the training and validation data using count vectorizer object
xtrain_count =  count_vect.transform(train_x)
xvalid_count =  count_vect.transform(valid_x)

# train an LDA model
lda_model = LatentDirichletAllocation(n_components=10, max_iter=20)
X_topics = lda_model.fit_transform(xtrain_count)
topic_word = lda_model.components_
print(topic_word.shape)
# get_feature_names_out() replaces get_feature_names() in scikit-learn >= 1.0
vocab = count_vect.get_feature_names_out()
print(len(vocab))

# summarise each topic by its n_top_words highest-weighted words
n_top_words = 10
topic_summaries = []
for i, topic_dist in enumerate(topic_word):
    # argsort is ascending, so the reversed tail holds the largest weights first
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))

(10, 6732)
6732
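
The slicing expression in the loop above is compact but hard to read: it sorts the word indices by weight and takes the last `n_top_words` in reverse order. An equivalent, more explicit selection can be sketched with `heapq.nlargest`; the `vocab` and `topic_word` values here are made up for illustration.

```python
import heapq

# Hypothetical vocabulary and topic-word weights (rows = topics, columns = words).
vocab = ["floor", "clean", "photo", "date", "signature"]
topic_word = [
    [5.1, 4.2, 0.1, 0.1, 0.1],
    [0.1, 0.1, 3.3, 2.7, 1.9],
]


def top_words(weights, vocabulary, n):
    """Return the n vocabulary entries with the largest weights, highest first."""
    best = heapq.nlargest(n, range(len(weights)), key=weights.__getitem__)
    return [vocabulary[i] for i in best]


topic_summaries = [" ".join(top_words(row, vocab, 2)) for row in topic_word]
print(topic_summaries)
```

For a 6732-word vocabulary and ten words per topic, either approach is fast; the explicit version mainly helps when reading the code later.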


In [210]:
topic_word

array([[ 0.10000096,  0.10000096,  0.1       , ...,  0.1       ,
         0.1       ,  0.1       ],
       [ 1.09999904,  1.09999904,  0.1       , ...,  0.1       ,
         0.1       ,  0.1       ],
       [ 0.1       ,  0.1       ,  0.1       , ...,  1.1       ,
         0.1       ,  0.1       ],
       ...,
       [ 0.1       ,  0.1       ,  0.1       , ...,  0.1       ,
         0.1       ,  0.1       ],
       [ 0.1       ,  0.1       ,  0.1       , ...,  0.1       ,
         0.1       ,  0.1       ],
       [ 0.1       ,  0.1       ,  0.1       , ...,  0.1       ,
        14.1       ,  0.1       ]])

In [209]:
topic_summaries

['argos care must comments card answer customer offer colleague information',
 'include children information centre action met cause photo corrective reference',
 'photo notes floor surfaces clean light location overall appearance inspected',
 'date field employee mvr audit question smart category information signature',
 'defect photos atm add needed water unit locker thermostat within',
 'text media number name selection notifications section choice multiple switch',
 'ensure one text failed sf selected question multi site blank',
 'lack equipment inadequate data del unsafe documento damaged di electrical',
 'nc rectification minor description location photos date enter severity id',
 'usage kullanımı kaldırma work storage working fall düşme machine durumu']

### Random Forest

In [221]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier()
# fit the classifier on the count vectors of the training set
classifier.fit(xtrain_count, train_y)

# predict the labels of the validation set
predictions = classifier.predict(xvalid_count)

# accuracy_score expects (y_true, y_pred)
accuracy_score(valid_y, predictions)

# # RF on Count Vectors
# accuracy = train_model(classifier, xtrain_count, train_y, xvalid_count)
# print("RF, Count Vectors: ", accuracy)

# # RF on Word Level TF IDF Vectors
# accuracy = train_model(classifier, xtrain_tfidf, train_y, xvalid_tfidf)
# print("RF, WordLevel TF-IDF: ", accuracy)

0.56