# Yu-Ting Shen

# RiskGenius Challenge Project


https://www.irmi.com/glossary

https://scrapy.org/

The IRMI link points to a site with definitions of insurance terms.
The Scrapy link is to a library which can extract data from websites.

The project has three parts:

1. Scrape the IRMI glossary and store it in some data format (SQLite, JSON, or similar). Be sure to keep at least the definition label and the definition text; other data may be unnecessary.

2. Build a classifier (you can choose the model) and optimize hyperparameters to predict the definition label from the definition text.

3. Predict the words that will appear in the definition label, instead of the label itself. One option is to predict the count vector of the definition label.

This could have a real application in RiskGenius, as a step toward automatically generating definition labels by predicting the words that would be used in them. You are likely to find that, in many cases, words in the definition label cannot be found in the definition text, so keep that in mind.

***

# Load Data

In [36]:
import pandas as pd

df_insurance_terms = pd.read_csv('terms.csv')
# df_insurance_terms.head()
df = df_insurance_terms[['term', 'text']]
df.head()

| | term | text |
|---|---|---|
| 0 | automatic premium loan | An optional provision in life insurance that a... |
| 1 | Household Goods Transportation Act of 1980 | Provided a nonjudicial dispute settlement prog... |
| 2 | hydrocarbons | A class of organic compounds composed only of ... |
| 3 | hydraulic fracturing (fracking) | A process in which fractures in hard-to-reach ... |
| 4 | hybrid plans | Risk financing techniques that are a combinati... |
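
The notebook reads `terms.csv` but does not show how that file was written. The following is a minimal sketch of the storage step only, assuming the scraper yields one dict per glossary entry with `term` and `text` keys; the helper names `save_terms` and `load_terms` are hypothetical and not part of the original project.

```python
import csv

def save_terms(records, path):
    """Write records (dicts with 'term' and 'text' keys) to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["term", "text"])
        writer.writeheader()
        writer.writerows(records)

def load_terms(path):
    """Read the CSV back as a list of dicts, one per glossary entry."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

The `csv` module quotes fields that contain commas or line breaks, which matters here because definition texts are full sentences. A file written this way has a `term` and a `text` column, matching the two columns the notebook selects after `pd.read_csv`.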


# Normalize

In [37]:
df_cnt = df.copy()

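import re

# Requires the WordNet data: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer lemmatizes (it does not stem); the name `stemmer` is kept as-is.
stemmer = WordNetLemmatizer()

def normalize_string(string):
    '''
    Normalize a string:
    1. replace punctuation with spaces
    2. convert to lower case
    3. lemmatize each word
    '''
    # Replace punctuation with spaces. Note: inside the character class,
    # `*->` is a range from '*' to '>', so digits and '+', '/', ':', ';', '='
    # are replaced too (this is why "1980" is missing from norm_term below).
    string = re.sub('[<(!?).,*->]', ' ', string)
    # Convert to lower case
    string = string.lower()
    # Lemmatize each whitespace-separated word
    words = [stemmer.lemmatize(word) for word in string.split()]
    # Join the words back into a single string
    return ' '.join(words)

df_cnt['norm_term'] = df_cnt['term'].apply(normalize_string)
df_cnt['norm_text'] = df_cnt['text'].apply(normalize_string)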

df_cnt.head()

| | term | text | norm_term | norm_text |
|---|---|---|---|---|
| 0 | automatic premium loan | An optional provision in life insurance that a... | automatic premium loan | an optional provision in life insurance that a... |
| 1 | Household Goods Transportation Act of 1980 | Provided a nonjudicial dispute settlement prog... | household good transportation act of | provided a nonjudicial dispute settlement prog... |
| 2 | hydrocarbons | A class of organic compounds composed only of ... | hydrocarbon | a class of organic compound composed only of c... |
| 3 | hydraulic fracturing (fracking) | A process in which fractures in hard-to-reach ... | hydraulic fracturing fracking | a process in which fracture in hard to reach s... |
| 4 | hybrid plans | Risk financing techniques that are a combinati... | hybrid plan | risk financing technique that are a combinatio... |
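
In the output above, row 1 loses "1980" from `norm_term`. That follows from the regular expression in `normalize_string`: inside a character class, `*->` is read as a range from `*` to `>`, which covers the digits. The sketch below compares the original pattern with an explicit alternative in which the hyphen is escaped. `EXPLICIT_CLASS` and `strip_chars` are illustrative names, and whether digits should be kept is a design choice the notebook does not state.

```python
import re

ORIGINAL_CLASS = '[<(!?).,*->]'     # pattern used in normalize_string
EXPLICIT_CLASS = r'[<>()!?.,*\-]'   # assumed alternative: hyphen escaped, no range

def strip_chars(pattern, s):
    """Replace every character matched by `pattern` with a space, then collapse whitespace."""
    return ' '.join(re.sub(pattern, ' ', s).split())

print(strip_chars(ORIGINAL_CLASS, 'Transportation Act of 1980'))
# Transportation Act of
print(strip_chars(EXPLICIT_CLASS, 'Transportation Act of 1980'))
# Transportation Act of 1980
print(strip_chars(EXPLICIT_CLASS, 'hard-to-reach shale (fracking)'))
# hard to reach shale fracking
```

Switching to the explicit class would change the fitted vocabularies, so the matrix shapes reported later in the notebook would no longer apply.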


# CountVectorize

In [38]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

cnt_vect_text = CountVectorizer(stop_words=stopwords.words('english'))
cnt_vect_term = CountVectorizer(stop_words=stopwords.words('english'))

X = cnt_vect_text.fit_transform(df_cnt['norm_text'])
y = cnt_vect_term.fit_transform(df_cnt['norm_term'])

In [39]:
X_dense = X.todense()
y_dense = y.todense()

X_arr = X.toarray()
y_arr = y.toarray()

In [40]:
print(X_dense.shape, y_dense.shape, X_arr.shape, y_arr.shape)

(3261, 7400) (3261, 2716) (3261, 7400) (3261, 2716)
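
The shapes say that there are 3261 definitions, 7400 distinct tokens in the definition texts, and 2716 distinct tokens in the labels. As a rough illustration of what one of these count matrices holds, here is a plain-Python bag-of-words sketch. It is not how `CountVectorizer` is implemented: it only mimics the default tokenization (lowercased tokens of two or more word characters, vocabulary in sorted order) and ignores stop words. The function name `count_matrix` is hypothetical.

```python
import re
from collections import Counter

def count_matrix(docs):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocab])
    return vocab, rows

vocab, rows = count_matrix(['hybrid plan', 'hybrid captive', 'premium loan premium'])
print(vocab)  # ['captive', 'hybrid', 'loan', 'plan', 'premium']
print(rows)   # [[0, 1, 0, 1, 0], [1, 1, 0, 0, 0], [0, 0, 1, 0, 2]]
```

Each row is one document and each column one vocabulary word, which is the layout of `X_arr` and `y_arr` in the notebook.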


In [41]:
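# get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use get_feature_names_out() instead.
text_features = cnt_vect_text.get_feature_names()
term_features = cnt_vect_term.get_feature_names()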

print('text features: \n', text_features)
print('term features: \n', term_features)

text features: 
term features: 
 ['aai', 'aais', 'aam', 'ab', 'abandonment', 'abatement', 'ability', 'absolute', 'absorbed', 'abuse', 'acc', 'acceleration', 'accept', 'acceptance', 'accepted', 'access', 'accident', 'accidental', 'accommodation', 'accordance', 'account', 'accountability', 'accountant', 'accounting', 'accreditation', 'accredited', 'accumulation', 'achievable', 'acii', 'acla', 'acls', 'acm', 'acquired', 'acquisition', 'act', 'action', 'activity', 'actual', 'actuarial', 'actuary', 'acute', 'acv', 'ad', 'ada', 'adaaa', 'add', 'added', 'addition', 'additional', 'adea', 'adequacy', 'adhesion', 'adjustable', 'adjusted', 'adjuster', 'adjustment', 'administration', 'administrative', 'administrator', 'admiralty', 'admitted', 'adr', 'advance', 'advanced', 'advancement', 'adverse', 'advertisement', 'advertising', 'advice', 'advised', 'adviser', 'advisory', 'aerobic', 'affiant', 'affidavit', 'affiliated', 'affinity', 'affirmative', 'affordable', 'afis', 'afsb', 'afterburner', 'after

# Try Neural Network model

In [42]:
from sklearn.neural_network import MLPRegressor

nn = MLPRegressor()
nn.fit(X_arr, y_arr)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [43]:
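def words_in_label(input_string, threshold=0.05,
                   regressor=nn,
                   def_vectorizer=cnt_vect_text,
                   label_vectorizer=cnt_vect_term):
    '''
    Return {label word: predicted score} for every label word whose
    predicted score exceeds `threshold`. The scores are raw regression
    outputs (predicted counts), not calibrated probabilities.
    '''
    # Vectorize the definition text with the vocabulary fitted on norm_text.
    # Note: input_string is not passed through normalize_string here,
    # so it is lowercased by the vectorizer but not lemmatized.
    vector = def_vectorizer.transform([input_string])
    # One predicted value per word in the label vocabulary
    preds = regressor.predict(vector)

    probabilities = preds[0]
    labels = label_vectorizer.get_feature_names()

    possible_labels = {}
    for label, prob in zip(labels, probabilities):
        if prob > threshold:
            possible_labels[label] = prob
    return possible_labels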

Use the following to test:
* term = automatic premium loan
* text = An optional provision in life insurance that authorizes the insurer to pay from the cash value any premium due at the end of the grace period  This provision is useful in preventing inadvertent lapse of the policy


In [46]:
data = 'An optional provision in life insurance that authorizes the insurer to pay from the cash value any premium due at the end of the grace period  This provision is useful in preventing inadvertent lapse of the policy'

possible_labels = words_in_label(data, 0.05)
print(possible_labels)

# Sort candidate words by predicted score, highest first
sorted_labels = sorted(possible_labels.items(), key=lambda x: x[1], reverse=True)
print(sorted_labels)
# (word, score) pair with the highest score
highest_probability_term = sorted_labels[0]
print(highest_probability_term[0])

{'loss': 0.08360077327074153, 'premium': 0.10694684298810071, 'risk': 0.07756092935629316, 'value': 0.05562000722777047}
[('premium', 0.10694684298810071), ('loss', 0.08360077327074153), ('risk', 0.07756092935629316), ('value', 0.05562000722777047)]
premium
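
Taking element `[0]` of the sorted list raises an `IndexError` when no word exceeds the threshold. A small sketch of a safer lookup follows. The helper name `top_label` is hypothetical, and the scores are the values printed above, rounded to four decimals.

```python
def top_label(possible_labels):
    """Return the (word, score) pair with the highest score, or None if the dict is empty."""
    if not possible_labels:
        return None
    return max(possible_labels.items(), key=lambda item: item[1])

scores = {'loss': 0.0836, 'premium': 0.1069, 'risk': 0.0776, 'value': 0.0556}
print(top_label(scores))  # ('premium', 0.1069)
print(top_label({}))      # None
```

Using `max` also avoids sorting the whole dictionary when only the best word is needed.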


# Get the correct term

In [47]:
all_terms = [term for term in df['term']]
print(all_terms)

['automatic premium loan', 'Household Goods Transportation Act of 1980', 'hydrocarbons', 'hydraulic fracturing (fracking)', 'hybrid plans', 'hybrid captive', 'hurdle rate', 'hybrid', 'job safety analysis (JSA)', 'improvements and betterments', 'implied warranty', 'implied authority', 'implead', 'impairment capital', 'incorporation doctrine', 'incontestable clause', 'incident reporting provision', 'management advisory services', 'managed care organization (MCO)', 'manufacturers penalty insurance', 'managed care liability insurance', 'managed care coverage endorsement', 'managed care', 'malware', 'manufacturers output policy (MOP)', 'market value clause', 'matching deductible', 'master service agreement (MSA)', 'master policy', 'masonry noncombustible construction (ISO)', 'medical malpractice insurance', 'market value', 'maximum possible loss (MPL)', 'medical malpractice  caps ', 'Migrant and Seasonal Agricultural Worker Protection Act (MSAWPA) of 1983', 'Medical Injury Compensation Refo

In [48]:
# Print every term that contains the highest-scoring word as a whole token
top_word = highest_probability_term[0]
for term in all_terms:
    if top_word in term.split():
        print(term)

automatic premium loan
extra premium
premium tax
premium reserve
premium prepayment
premium payment plan
premium notice
premium discount
premium loan
premium capacity
premium audit
premium  advance
premium
exportability of premium risk
excess loss premium (ELP) factor
modified premium
minimum premium
fractional premium
nonsubject premium
estimated premium
direct written premium
original gross premium (OGP)
maximum premium
net level premium
net written premium
initial premium
guaranteed cost premium
deposit premium
gross written premium (GWP)
gross premium
manual premium
earned reinsurance premium
basic premium
earned premium
basic premium factor
base premium
standard premium
written premium
return premium
return of premium (or cash value)
single premium insurance
waiver of premium
unearned premium reserve (UPR)
unearned reinsurance premium
reinsurance premium
vanishing premium
reinstatement premium


This returns too many candidate terms. The next step is to find a way to narrow down the results and improve precision.
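
One possible way to narrow the list, sketched here as an assumption rather than as the notebook's method, is to score each candidate term by the sum of the predicted scores of all its words instead of matching only the top word. The function name `rank_terms` is hypothetical. The scores are the rounded values printed earlier, the candidates are a handful of terms from the output above, and no lemmatization is applied in this sketch.

```python
import re

def rank_terms(candidate_terms, word_scores):
    """Rank terms by the summed score of their distinct lowercase words, highest first."""
    ranked = []
    for term in candidate_terms:
        tokens = set(re.findall(r'[a-z]+', term.lower()))
        score = sum(word_scores.get(tok, 0.0) for tok in tokens)
        ranked.append((term, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

word_scores = {'loss': 0.0836, 'premium': 0.1069, 'risk': 0.0776, 'value': 0.0556}
candidates = [
    'automatic premium loan',
    'premium',
    'exportability of premium risk',
    'excess loss premium (ELP) factor',
    'return of premium (or cash value)',
]
for term, score in rank_terms(candidates, word_scores):
    print('{0:.4f}  {1}'.format(score, term))
```

On this example the true label, `automatic premium loan`, still ranks no higher than the bare `premium`, because neither `automatic` nor `loan` is predicted above the threshold. Summing scores separates candidates that share several predicted words, but it does not by itself recover the correct term.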