# Yu-Ting Shen

# RiskGenius Challenge Project


https://www.irmi.com/glossary

https://scrapy.org/

The IRMI link points to a glossary of insurance terms and their definitions.
The Scrapy link points to a Python library for extracting data from websites.

The project has three parts:

1. Scrape the IRMI glossary and store it in a structured format (for example SQLite or JSON). Be sure to keep at least the definition label and the definition text. Other data might be unnecessary.

2. Build a classifier (you can choose the model) and optimize hyperparameters to predict the definition label from the definition text.

3. Predict the words that will appear in the definition label, instead of the label itself. One option in this case is to predict the count vector of the definition label.
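
As a rough illustration of step 1, the sketch below stores (label, text) pairs in SQLite using only the standard library. The table name `terms`, the helper `save_terms`, and the placeholder records are assumptions made for this example; they are not taken from the project, which loads its data from `terms.csv`.

```python
import sqlite3

# Hypothetical scraped records: (definition label, definition text).
# The texts are placeholders, not quotes from the IRMI glossary.
records = [
    ("hybrid plans", "placeholder definition text for hybrid plans"),
    ("hydrocarbons", "placeholder definition text for hydrocarbons"),
]

def save_terms(conn, rows):
    """Create the table if needed and insert (term, text) pairs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS terms "
        "(term TEXT PRIMARY KEY, text TEXT NOT NULL)"
    )
    # INSERT OR REPLACE keeps a re-run of the scraper from duplicating rows.
    conn.executemany(
        "INSERT OR REPLACE INTO terms (term, text) VALUES (?, ?)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # pass a file path instead to persist the data
save_terms(conn, records)
save_terms(conn, records)  # a second run does not add duplicates
stored = conn.execute("SELECT term, text FROM terms ORDER BY term").fetchall()
print(stored)
```

Using the label as the primary key is one possible design choice; it assumes that each glossary label occurs only once.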

This could have a real application in RiskGenius, as a step toward automatically generating definition labels by predicting the words that would be used in them. In many cases, words in the definition label are likely not to appear in the definition text, so keep that in mind.
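
One way to check how often label words are missing from the definition text is to measure, per entry, the fraction of label words that also occur in the text. The sketch below is only an illustration: the helper `label_word_coverage` and the toy pairs are invented for this example and use plain whitespace tokenization.

```python
def label_word_coverage(term, text):
    """Fraction of distinct label words that also occur in the definition text."""
    term_words = set(term.lower().split())
    text_words = set(text.lower().split())
    if not term_words:
        return 0.0
    return len(term_words & text_words) / len(term_words)

# Toy pairs, invented for illustration only.
pairs = [
    ("hybrid plan", "a plan that combines retention and transfer"),
    ("hydrocarbon", "a class of organic compound"),
]
coverage = [label_word_coverage(term, text) for term, text in pairs]
print(coverage)  # [0.5, 0.0]
```

A coverage of 0.0 means that no word of the label can be read off the text directly, which is the hard case mentioned above.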

***

# Load Data

In [4]:
import pandas as pd

df_insurance_terms = pd.read_csv('terms.csv')
# df_insurance_terms.head()
df = df_insurance_terms[['term', 'text']]
df.head()

|   | term | text |
|---|------|------|
| 0 | automatic premium loan | An optional provision in life insurance that a... |
| 1 | Household Goods Transportation Act of 1980 | Provided a nonjudicial dispute settlement prog... |
| 2 | hydrocarbons | A class of organic compounds composed only of ... |
| 3 | hydraulic fracturing (fracking) | A process in which fractures in hard-to-reach ... |
| 4 | hybrid plans | Risk financing techniques that are a combinati... |


# Normalize

In [24]:
import re

from nltk.stem import WordNetLemmatizer

df_cnt = df.copy()

# Requires the WordNet corpus: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()

def normalize_string(string):
    '''
    Normalize the original string:
    1. remove punctuation
    2. use lower case
    3. lemmatize
    '''
    # Replace punctuation with spaces. Inside the character class, `*->` is a
    # range from '*' to '>', so digits and + / : ; = are replaced as well.
    string = re.sub(r'[<(!?).,*->]', ' ', string)
    # Convert to lower case
    string = string.lower()
    # Lemmatize each whitespace-separated token (as a noun, the default)
    words = [lemmatizer.lemmatize(word) for word in string.split()]
    # Join the tokens back into one string
    return ' '.join(words)

df_cnt['norm_term'] = df_cnt['term'].apply(normalize_string)
df_cnt['norm_text'] = df_cnt['text'].apply(normalize_string)

df_cnt.head()

|   | term | text | norm_term | norm_text |
|---|------|------|-----------|-----------|
| 0 | automatic premium loan | An optional provision in life insurance that a... | automatic premium loan | an optional provision in life insurance that a... |
| 1 | Household Goods Transportation Act of 1980 | Provided a nonjudicial dispute settlement prog... | household good transportation act of | provided a nonjudicial dispute settlement prog... |
| 2 | hydrocarbons | A class of organic compounds composed only of ... | hydrocarbon | a class of organic compound composed only of c... |
| 3 | hydraulic fracturing (fracking) | A process in which fractures in hard-to-reach ... | hydraulic fracturing fracking | a process in which fracture in hard to reach s... |
| 4 | hybrid plans | Risk financing techniques that are a combinati... | hybrid plan | risk financing technique that are a combinatio... |
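
The normalized term in row 1 has lost "1980". This follows from the punctuation pattern: inside the character class, `*->` is read as a range from `*` to `>`, which covers the digits. The sketch below compares the pattern with a variant in which the hyphen is escaped. The names `original` and `explicit` are chosen for this example only, and whether digits should be kept is a modelling decision that the notebook does not settle.

```python
import re

original = re.compile(r'[<(!?).,*->]')   # `*->` is a range from '*' to '>'
explicit = re.compile(r'[<>()!?.,*\-]')  # hyphen escaped: only the listed characters

sample = "Household Goods Transportation Act of 1980"
stripped_by_range = original.sub(' ', sample)
stripped_explicit = explicit.sub(' ', sample)

print(repr(stripped_by_range))  # the digits are replaced by spaces
print(repr(stripped_explicit))  # the sample is unchanged
```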


# CountVectorize

In [26]:
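from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# Requires the stop word list: nltk.download('stopwords')
stop_words = stopwords.words('english')

# Use one vectorizer per column: calling fit_transform again on the same
# vectorizer would overwrite the vocabulary learned from the texts.
text_vect = CountVectorizer(stop_words=stop_words)
term_vect = CountVectorizer(stop_words=stop_words)

X = text_vect.fit_transform(df_cnt['norm_text']).toarray()
y = term_vect.fit_transform(df_cnt['norm_term']).toarray()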
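
Since `fit_transform` learns a new vocabulary on every call, keeping a separate vectorizer for the texts and for the labels makes it possible to map the columns of `y` back to label words later. The following is a minimal sketch with invented toy strings; the names `text_vect`, `term_vect`, `X_demo`, and `y_demo` are assumptions made for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, invented for illustration.
texts = ["risk financing technique", "organic compound of carbon"]
terms = ["hybrid plan", "hydrocarbon"]

text_vect = CountVectorizer()  # vocabulary of the definition texts
term_vect = CountVectorizer()  # vocabulary of the definition labels

X_demo = text_vect.fit_transform(texts)
y_demo = term_vect.fit_transform(terms)

# Each vectorizer keeps its own vocabulary, so the columns of y_demo can be
# mapped back to label words after prediction.
label_words = sorted(term_vect.vocabulary_)
print(X_demo.shape, y_demo.shape, label_words)
```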

# Training and Testing Sets

In [32]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(2282, 7400) (979, 7400) (2282, 2716) (979, 2716)


# Try a Logistic Regression Classifier

In [31]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1.0, penalty='l2')

clf.fit(X_train, y_train)



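The fit fails because `LogisticRegression` expects a 1-D array of class labels, while `y_train` is a 2-D matrix of label-word counts:

ValueError: bad input shape (2282, 2716)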
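
One possible way around this shape error is to treat every label word as its own binary target and train one classifier per column, for instance with scikit-learn's `OneVsRestClassifier`. The sketch below is an assumption about how that could look, not the approach taken in this notebook. It uses invented toy arrays, and it binarizes the counts, so it predicts whether a word occurs in the label rather than how often.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy count features (4 samples, 3 text words) and label-word counts
# (4 samples, 2 label words); the numbers are invented for illustration.
X_toy = np.array([[2, 0, 0], [3, 0, 1], [0, 2, 0], [0, 3, 1]])
y_counts = np.array([[1, 0], [1, 0], [0, 1], [0, 2]])

# Binarize: does label word j appear at all? This drops the counts.
y_toy = (y_counts > 0).astype(int)

# One independent logistic regression per label word.
clf = OneVsRestClassifier(LogisticRegression(C=1.0))
clf.fit(X_toy, y_toy)

pred = clf.predict(X_toy)
print(pred.shape)  # (4, 2): one 0/1 prediction per sample and label word
```

On the real data this would mean 2716 separate binary problems, many of them with very few positive examples, so the result would need careful evaluation.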