# Yu-Ting Shen

# RiskGenius Challenge Project


https://www.irmi.com/glossary

https://scrapy.org/

The IRMI link points to a site with definitions of insurance terms.
The Scrapy link is to a library which can extract data from websites.

The project has three parts:

1. Scrape the IRMI glossary and store it in a structured format (for example SQLite or JSON).  Keep at least the definition label and the definition text; other fields are probably unnecessary.

2. Build a classifier (you can choose the model) and optimize hyperparameters to predict the definition label from the definition text.

3. Predict the words that will appear in the definition label, instead of the label itself, for example by predicting the count vector of the definition label.
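
For step 1, here is a minimal sketch of the storage side only. It assumes the scraper yields one record per glossary entry with `term` and `text` fields, the same column names the notebook reads from `terms.csv` further down. The two records are hypothetical placeholders, not IRMI content, and `write_terms_csv` is an illustrative helper, not part of Scrapy.

```python
import csv
import io

# Hypothetical placeholder records; real ones would come from the scraper.
records = [
    {"term": "example term", "text": "Example definition, with a comma."},
    {"term": "another term", "text": "Another example definition."},
]

def write_terms_csv(rows, handle):
    """Write records with 'term' and 'text' columns to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=["term", "text"])
    writer.writeheader()
    writer.writerows(rows)

# Write to an in-memory buffer; pass open("terms.csv", "w", newline="")
# instead to write a file.
buffer = io.StringIO()
write_terms_csv(records, buffer)

# Read the rows back to check that the comma inside the text survived quoting.
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip[0]["text"])
```

CSV is used here only because the rest of the notebook loads a CSV file; SQLite or JSON would serve equally well.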

This could have a real application at RiskGenius as a step toward automatically generating definition labels by predicting the words they would contain.  Keep in mind that, in many cases, the words in a definition label do not appear in the definition text.
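
One way to quantify that last point is to measure, for each entry, what fraction of the label's words also occur in its definition. The sketch below is an assumption about how such a check might look; `label_word_coverage` is a hypothetical helper, and the two definitions are the test texts used later in this notebook, lowercased.

```python
def label_word_coverage(label, definition):
    """Return the fraction of distinct label words that occur in the definition."""
    label_words = set(label.lower().split())
    definition_words = set(definition.lower().split())
    if not label_words:
        return 0.0
    return len(label_words & definition_words) / len(label_words)

loan_text = ('an optional provision in life insurance that authorizes the insurer '
             'to pay from the cash value any premium due at the end of the grace '
             'period this provision is useful in preventing inadvertent lapse of '
             'the policy')
hydro_text = ('a class of organic compounds composed only of carbon and hydrogen '
              'common hydrocarbons include natural gas crude oil and coal')

# Only 'premium' appears in the definition; 'automatic' and 'loan' do not.
loan_coverage = label_word_coverage('automatic premium loan', loan_text)
# The single label word appears verbatim in the definition.
hydro_coverage = label_word_coverage('hydrocarbons', hydro_text)

print(loan_coverage, hydro_coverage)
```

Averaging this score over the whole glossary would give a rough upper bound on how many label words could be recovered by copying words from the definition alone.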

***

# Load Data

In [72]:
import pandas as pd

df_insurance_terms = pd.read_csv('terms.csv')
df = df_insurance_terms[['term', 'text']]
df.head()

| | term | text |
|---|---|---|
| 0 | automatic premium loan | An optional provision in life insurance that a... |
| 1 | Household Goods Transportation Act of 1980 | Provided a nonjudicial dispute settlement prog... |
| 2 | hydrocarbons | A class of organic compounds composed only of ... |
| 3 | hydraulic fracturing (fracking) | A process in which fractures in hard-to-reach ... |
| 4 | hybrid plans | Risk financing techniques that are a combinati... |


# Normalize

In [73]:
import re

from nltk.stem import WordNetLemmatizer

df_tfidf = df.copy()

# WordNetLemmatizer needs the WordNet corpus: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def normalize_string(string):
    '''
    Normalize a string:
    1. replace punctuation with spaces
    2. convert to lower case
    3. lemmatize each word
    '''
    # Replace punctuation with spaces. Inside the character class, `*->` is a
    # range from '*' to '>', so digits and '+', '-', '/', ':', ';', '=' are
    # replaced as well.
    string = re.sub('[<(!?).,*->]', ' ', string)
    # Convert to lower case
    string = string.lower()
    # Lemmatize each whitespace-separated token (as a noun, the default)
    words = [lemmatizer.lemmatize(word) for word in string.split()]
    # Join the tokens back into a single string
    return ' '.join(words)

df_tfidf['norm_term'] = df_tfidf['term'].apply(normalize_string)
df_tfidf['norm_text'] = df_tfidf['text'].apply(normalize_string)

df_tfidf = df_tfidf[['norm_term', 'norm_text']]
df_tfidf.columns = ['label', 'definition']
df_tfidf.head()

| | label | definition |
|---|---|---|
| 0 | automatic premium loan | an optional provision in life insurance that a... |
| 1 | household good transportation act of | provided a nonjudicial dispute settlement prog... |
| 2 | hydrocarbon | a class of organic compound composed only of c... |
| 3 | hydraulic fracturing fracking | a process in which fracture in hard to reach s... |
| 4 | hybrid plan | risk financing technique that are a combinatio... |
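
Row 1 above has lost its year ("act of" rather than "act of 1980"). That follows from the pattern in `normalize_string`: inside a character class, `*->` is read as a range from `*` to `>`, which covers the digits. The sketch below reproduces the effect and compares it with an escaped hyphen. The `ESCAPED` pattern is an assumption about the intended behavior, not something the notebook states.

```python
import re

# Pattern copied from normalize_string: '*->' is parsed as the range '*'..'>'.
ORIGINAL = '[<(!?).,*->]'
# Assumed alternative: escaping the hyphen makes it a literal character.
ESCAPED = r'[<(!?).,*\->]'

sample = 'Household Goods Transportation Act of 1980'

original_tokens = re.sub(ORIGINAL, ' ', sample).lower().split()
escaped_tokens = re.sub(ESCAPED, ' ', sample).lower().split()

print(original_tokens)  # the year is removed
print(escaped_tokens)   # the year is kept

# Every character covered by the accidental range
range_chars = ''.join(chr(code) for code in range(ord('*'), ord('>') + 1))
print(range_chars)
```

Whether digits should be kept is a modeling choice; stripping them makes "Act of 1980" and "Act of 1990" indistinguishable in the label vocabulary.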


# TfidfVectorizer

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords

tfidf_vectorizer_definition = TfidfVectorizer(stop_words=stopwords.words('english'))
tfidf_vectorizer_label = TfidfVectorizer(stop_words=stopwords.words('english'))

X = tfidf_vectorizer_definition.fit_transform(df_tfidf['definition'])
y = tfidf_vectorizer_label.fit_transform(df_tfidf['label'])

X_dense = X.todense()
y_dense = y.todense()

X_arr = X.toarray()
y_arr = y.toarray()

print(X_dense.shape, y_dense.shape, X_arr.shape, y_arr.shape)

(3261, 7400) (3261, 2716) (3261, 7400) (3261, 2716)


In [75]:
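# get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use get_feature_names_out() instead.
definition_features = tfidf_vectorizer_definition.get_feature_names()
label_features = tfidf_vectorizer_label.get_feature_names()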

In [76]:
print('text features: \n', definition_features)

text features: 


In [77]:
print('term features: \n', label_features)

term features: 
 ['aai', 'aais', 'aam', 'ab', 'abandonment', 'abatement', 'ability', 'absolute', 'absorbed', 'abuse', 'acc', 'acceleration', 'accept', 'acceptance', 'accepted', 'access', 'accident', 'accidental', 'accommodation', 'accordance', 'account', 'accountability', 'accountant', 'accounting', 'accreditation', 'accredited', 'accumulation', 'achievable', 'acii', 'acla', 'acls', 'acm', 'acquired', 'acquisition', 'act', 'action', 'activity', 'actual', 'actuarial', 'actuary', 'acute', 'acv', 'ad', 'ada', 'adaaa', 'add', 'added', 'addition', 'additional', 'adea', 'adequacy', 'adhesion', 'adjustable', 'adjusted', 'adjuster', 'adjustment', 'administration', 'administrative', 'administrator', 'admiralty', 'admitted', 'adr', 'advance', 'advanced', 'advancement', 'adverse', 'advertisement', 'advertising', 'advice', 'advised', 'adviser', 'advisory', 'aerobic', 'affiant', 'affidavit', 'affiliated', 'affinity', 'affirmative', 'affordable', 'afis', 'afsb', 'afterburner', 'aftermarket', 'agc', 

# Try a neural network model

In [78]:
from sklearn.neural_network import MLPRegressor

nn = MLPRegressor()
nn.fit(X_arr, y_arr)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [79]:
def words_in_label(input_string, threshold=0.05,
                   regressor=nn,
                   def_vectorizer=tfidf_vectorizer_definition,
                   label_vectorizer=tfidf_vectorizer_label):
    '''
    Return {label word: predicted score} for every label word whose
    predicted score exceeds the threshold.
    '''
    # Vectorize the definition with the already-fitted vectorizer.
    # Note: input_string is not passed through normalize_string here.
    vector = def_vectorizer.transform([input_string])
    # One row of predicted TF-IDF weights, one column per label word
    preds = regressor.predict(vector)

    # These are regression outputs, not calibrated probabilities
    scores = preds[0]
    # On scikit-learn 1.2 or later, use get_feature_names_out() instead
    labels = label_vectorizer.get_feature_names()

    possible_labels = {}
    for label, score in zip(labels, scores):
        if score > threshold:
            possible_labels[label] = score
    return possible_labels

Use the following to test:
* term = automatic premium loan
* text = An optional provision in life insurance that authorizes the insurer to pay from the cash value any premium due at the end of the grace period  This provision is useful in preventing inadvertent lapse of the policy



In [80]:
data = 'An optional provision in life insurance that authorizes the insurer to pay from the cash value any premium due at the end of the grace period  This provision is useful in preventing inadvertent lapse of the policy'

possible_labels = words_in_label(data, 0.01)
print(possible_labels)

# Sort the candidate words by predicted score, highest first
highest_probability_term = sorted(possible_labels.items(), key=lambda x: x[1], reverse=True)
print(highest_probability_term[0])

{'agency': 0.011998725837760429, 'agreement': 0.014518807289797268, 'automatic': 0.011246896779330454, 'carrier': 0.010388421155649055, 'clause': 0.012605013388289784, 'coverage': 0.019038283675747725, 'credit': 0.01088226325647249, 'experience': 0.010002623924193495, 'fee': 0.013024901515306622, 'insurance': 0.020807278356672846, 'liability': 0.013050809491247867, 'loan': 0.0227204020797463, 'motion': 0.010795589020052149, 'paid': 0.01206612134322216, 'portfolio': 0.011315529134145696, 'premium': 0.05072785370268747, 'rating': 0.013055549073059586, 'reinsurance': 0.02383954701707023, 'standard': 0.012870102502777215, 'surplus': 0.010592257087736215, 'written': 0.01146883250006993}
('premium', 0.05072785370268747)
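
Looking only at the single best word hides how the other label words rank. As a small sketch, the hypothetical helper `top_words` below returns the `k` best-scoring words; the scores are a subset of the output above, rounded to four decimals.

```python
def top_words(scores, k=3):
    """Return the k highest-scoring (word, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]

# Subset of the predicted scores printed above, rounded to four decimals
scores = {
    'automatic': 0.0112,
    'insurance': 0.0208,
    'loan': 0.0227,
    'premium': 0.0507,
    'reinsurance': 0.0238,
}

best = top_words(scores, k=3)
print(best)
```

In this example two of the three true label words ("premium" and "loan") reach the top three, while "reinsurance" ranks above "loan" even though it is not part of the label.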


# Get the correct term

In [81]:
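# Raw (un-normalized) terms: the matching below is therefore case-sensitive
# and sees plural forms such as 'hydrocarbons', not the lemmatized labels.
all_terms = df['term'].tolist()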

In [82]:
keys = possible_labels.keys()
print(keys)

possible_term = []
for term in all_terms:
    tokens = term.split()
    # Keep the term only if every one of its tokens is a predicted word
    if set(tokens) <= set(keys):
        possible_term.append(term)

print(possible_term)

dict_keys(['agency', 'agreement', 'automatic', 'carrier', 'clause', 'coverage', 'credit', 'experience', 'fee', 'insurance', 'liability', 'loan', 'motion', 'paid', 'portfolio', 'premium', 'rating', 'reinsurance', 'standard', 'surplus', 'written'])
['automatic premium loan', 'credit insurance', 'coverage', 'premium loan', 'premium', 'experience rating', 'experience', 'portfolio reinsurance', 'motion', 'portfolio', 'insurance', 'clause', 'liability insurance', 'liability', 'carrier', 'agency', 'agency agreement', 'surplus reinsurance', 'surplus', 'standard premium', 'written premium', 'reinsurance credit', 'reinsurance premium', 'reinsurance agreement', 'rating experience', 'rating']
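
The subset test above returns 26 candidates without saying which one is best. One possible way to choose, shown here purely as an assumption and not as the notebook's method, is to rank the candidates by the sum of their words' predicted scores. `candidate_terms` and `rank_terms` are hypothetical helpers, the scores are rounded values from the earlier output, and the term list is a small subset plus one non-matching term.

```python
def candidate_terms(terms, scores):
    """Keep the terms whose tokens are all among the predicted words."""
    words = set(scores)
    return [term for term in terms if set(term.lower().split()) <= words]

def rank_terms(terms, scores):
    """Order terms by the sum of their word scores, best first."""
    def total(term):
        return sum(scores[word] for word in term.lower().split())
    return sorted(terms, key=total, reverse=True)

# Rounded scores taken from the words_in_label output above
scores = {
    'automatic': 0.0112,
    'loan': 0.0227,
    'premium': 0.0507,
    'reinsurance': 0.0238,
    'standard': 0.0129,
}
terms = [
    'automatic premium loan',
    'premium loan',
    'premium',
    'reinsurance premium',
    'standard premium',
    'hybrid plans',
]

candidates = candidate_terms(terms, scores)
ranked = rank_terms(candidates, scores)
print(ranked)
```

Summing favors longer terms, which happens to help here; averaging the scores instead would favor the one-word term "premium". Lowercasing inside the helpers also avoids the case sensitivity of matching against raw terms.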


### Use data and possible_term to train again

In [83]:
X_data = tfidf_vectorizer_definition.fit_transform([data])
y_label = tfidf_vectorizer_label.fit_transform(possible_term)

X_data_arr = X_data.toarray()
y_label_arr = y_label.toarray()

print(X_data_arr.shape, y_label_arr.shape)

(1, 19) (26, 19)


In [84]:
nn2 = MLPRegressor()
nn2.fit(X_data_arr, y_label_arr)

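ValueError: Found input variables with inconsistent numbers of samples: [1, 26]

The fit fails because `X_data_arr` has one row (the single definition) while `y_label_arr` has 26 rows (one per candidate term); `fit` needs one target row for each input row.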

Use the following to test:
* term = hydrocarbons
* text = A class of organic compounds composed only of carbon and hydrogen  Common hydrocarbons include natural gas  crude oil  and coal  Hydrocarbons are the primary source of the world s electric energy and heat sources due to the power created when they are burned

In [85]:
data = 'A class of organic compounds composed only of carbon and hydrogen Common hydrocarbons include natural gas crude oil and coal Hydrocarbons are the primary source of the world s electric energy and heat sources due to the power created when they are burned'

possible_labels = words_in_label(data, 0.01)
print(possible_labels)

highest_probability_term = sorted(possible_labels.items(), key=lambda x: x[1], reverse=True)
print(highest_probability_term)

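ValueError: dimension mismatch

This error follows from In [83]: calling `fit_transform` there re-fitted `tfidf_vectorizer_definition` on a single definition, so it now produces 19 features, while `nn` was trained on 7400. The output shown for the next cell, In [68], has a lower execution number and presumably comes from an earlier run, before the vectorizer was re-fitted.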
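
A minimal sketch of the difference between `fit_transform` and `transform`, using a hypothetical three-document corpus in place of the glossary. It assumes scikit-learn is installed. The point is that new text should go through `transform` on the already-fitted vectorizer, and that any experiment needing a new vocabulary should use a separate vectorizer object.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the glossary definitions
corpus = [
    'a loan secured by the policy cash value',
    'a class of organic compounds',
    'premium due at the end of the grace period',
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)  # learns the vocabulary once

new_doc = ['premium loan']

# transform() reuses the learned vocabulary, so the column count matches
# whatever a model trained on X_train expects.
X_new = vectorizer.transform(new_doc)

# A separate vectorizer fitted on the new text alone learns only its two
# words; fitting the original vectorizer this way would overwrite its vocabulary.
X_refit = TfidfVectorizer().fit_transform(new_doc)

print(X_train.shape, X_new.shape, X_refit.shape)
```

Because `words_in_label` uses the same vectorizer objects as its defaults, re-fitting them anywhere in the notebook changes what the function produces afterwards.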

In [68]:
keys = possible_labels.keys()
print(keys)

possible_term = []
for term in all_terms:
    tokens = term.split()
    # Keep the term only if every one of its tokens is a predicted word
    if set(tokens) <= set(keys):
        possible_term.append(term)

print(possible_term)

dict_keys(['agreement', 'captive', 'claim', 'class', 'disability', 'dividend', 'exception', 'experience', 'exposure', 'hydrocarbon', 'insurance', 'insured', 'line', 'loss', 'market', 'period', 'pure', 'retention', 'risk'])
['pure captive', 'pure risk', 'exposure', 'experience', 'loss', 'insured', 'insurance line', 'line', 'disability', 'insurance', 'class', 'market risk', 'claim', 'captive', 'risk retention', 'risk', 'retention']
