# Yu-Ting Shen

# RiskGenius Challenge Project


https://www.irmi.com/glossary

https://scrapy.org/

The IRMI link points to a site with definitions of insurance terms.
The Scrapy link is to a library which can extract data from websites.

The project has three parts:

1. Scrape and store the IRMI glossary into some data format (maybe SQLite, or .json or something).  Be sure to have at least the definition label and definition text.  Other data might be unnecessary.

2. Build a classifier (you can choose the model) and optimize hyperparameters to predict the definition label from the definition text.

3. Predict the word that will be in the definition label, instead of the label itself.  Possibly predict the count vector of the definition label in this case.
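
To make the "count vector of the definition label" in part 3 concrete, here is a minimal sketch using only the standard library. The helper names `build_vocabulary` and `count_vector` are hypothetical, and whitespace tokenization is an assumption; the three labels are terms that appear in the glossary data.

```python
from collections import Counter

def build_vocabulary(labels):
    """Map each distinct lowercase label word to a column index."""
    words = sorted({w for label in labels for w in label.lower().split()})
    return {word: i for i, word in enumerate(words)}

def count_vector(label, vocabulary):
    """Encode one label as word counts over the vocabulary columns."""
    counts = Counter(label.lower().split())
    return [counts.get(word, 0) for word in vocabulary]

labels = ['hybrid plans', 'policy reserve', 'primary insurer']
vocab = build_vocabulary(labels)
vectors = [count_vector(label, vocab) for label in labels]
```

Each label becomes a fixed-length vector, so a model can predict one value per vocabulary word instead of one class per label.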

This could have a real application in RiskGenius, as a step toward automatically generating definition labels by predicting the words that would be used in them.  In many cases, words in the definition label do not appear in the definition text, so keep that in mind.
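
One way to quantify how often label words are missing from the definition text is to measure, per entry, the fraction of label words that also occur in the text. The sketch below assumes a simple lowercase alphanumeric tokenization; `label_word_coverage` is a hypothetical helper, and the two example pairs are made up for illustration rather than taken from the glossary.

```python
import re

def label_word_coverage(label, text):
    """Return the fraction of label words that also occur in the text."""
    label_words = re.findall(r"[a-z0-9]+", label.lower())
    text_words = set(re.findall(r"[a-z0-9]+", text.lower()))
    if not label_words:
        return 0.0
    found = sum(word in text_words for word in label_words)
    return found / len(label_words)

# Made-up (term, text) pairs for illustration only; not glossary content.
examples = [
    ('soft market', 'A period in which premiums are low and coverage is easy to buy.'),
    ('stop loss', 'Reinsurance that caps the total loss an insurer retains.'),
]
coverages = [label_word_coverage(term, text) for term, text in examples]
```

A low average coverage over the real data would suggest that copying words from the text is not enough and that the label words need to be predicted.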

***
***
***

## Load data

In [5]:
import pandas as pd

df_insurance_terms = pd.read_csv('terms.csv')
df_insurance_terms.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [6]:
df_insurance_terms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3261 entries, 0 to 3260
Data columns (total 3 columns):
term       3261 non-null object
text       3261 non-null object
synonym    18 non-null object
dtypes: object(3)
memory usage: 76.5+ KB


Most **synonym** values are **NaN**, but 18 records have a value.

List these 18 records.

In [7]:
df_insurance_terms[df_insurance_terms['synonym'].notnull()]

Unnamed: 0,term,text,synonym
215,cost of hire endorsement,A contractors equipment coverage endorsement t...,Rental cost reimbursement endorsement
273,product disparagement,A standard peril covered under a media profess...,Trade libel
314,primary insurer,A transaction in which one party the reinsur...,Reinsurance
323,preservation of property,An ocean and inland marine insurance provision...,Sue and labor clause
578,excess of loss ratio reinsurance,A form of reinsurance also known as aggregate...,Stop loss
584,fronted captive,A special-purpose insurer that operates only o...,Reinsurance captive
593,policy reserve,That portion of the policy premium that has no...,Unearned premium
769,interrelated claims provisions,Provisions within professional liability insur...,Related claims provisions
796,buyers market,One side of the market cycle that is character...,Soft market
866,inter-insurance exchange,An unincorporated group of individuals or orga...,Reciprocal company


## Convert into SQL

* Using SQLite

In [8]:
from sqlalchemy import create_engine

engine = create_engine('sqlite:///insurance_terms.sqlite', echo=False)
df_insurance_terms.to_sql('insurance_terms', con=engine)

* Load the SQLite file back to check

In [9]:
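from sqlalchemy import inspect, text

engine2 = create_engine('sqlite:///insurance_terms.sqlite')
# Engine.table_names() is deprecated since SQLAlchemy 1.4; use the inspector instead
table_names = inspect(engine2).get_table_names()
print(table_names)

sql_command = 'SELECT * FROM ' + table_names[0]
print(sql_command)

con = engine2.connect()
# Wrap the raw SQL string in text() before executing it
rs = con.execute(text(sql_command))
df_test_sql = pd.DataFrame(rs.fetchall())
df_test_sql.head()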

['insurance_terms']
SELECT * FROM insurance_terms


Unnamed: 0,0,1,2,3
0,0,automatic premium loan,An optional provision in life insurance that a...,
1,1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,2,hydrocarbons,A class of organic compounds composed only of ...,
3,3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,4,hybrid plans,Risk financing techniques that are a combinati...,


In [10]:
df_test_sql.columns=['index', 'term', 'text', 'synonym']
df_test_sql_2 = df_test_sql.drop(['index'], axis=1)
df_test_sql_2.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [11]:
# df_test_sql.drop(['index'], axis=1).equals(df_insurance_terms)
df_test_sql_2.equals(df_insurance_terms)

True
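
The same write-then-read check can be sketched without pandas or SQLAlchemy, using the standard-library `sqlite3` module and an in-memory database. This is a swapped-in technique for illustration, not the approach used in the cells above, and the definition texts are placeholders.

```python
import sqlite3

# Placeholder rows; the definition texts are stand-ins, not glossary content.
rows = [
    ('hybrid plans', 'placeholder definition 1'),
    ('hydrocarbons', 'placeholder definition 2'),
]

con = sqlite3.connect(':memory:')  # in-memory database, nothing is written to disk
con.execute('CREATE TABLE insurance_terms (term TEXT, text TEXT)')
con.executemany('INSERT INTO insurance_terms VALUES (?, ?)', rows)
con.commit()

# Read everything back in insertion order and compare with the input.
read_back = con.execute(
    'SELECT term, text FROM insurance_terms ORDER BY rowid'
).fetchall()
con.close()

round_trip_ok = read_back == rows
```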

## Convert into JSON format

* From CSV to JSON

In [32]:
import json

# Note: a plain split(',') only works when no field contains a comma;
# lines that do not split into exactly three fields are skipped.
with open('terms.csv', 'r') as fin, open('insurance_terms.json', 'w') as fout:
    for line in fin:
        split_line = line.rstrip().split(',')
        if len(split_line) != 3:
            continue

        d = {
            'term': split_line[0],
            'text': split_line[1],
            'synonym': split_line[2],
        }

        # Skip the CSV header row.
        if d['term'] == 'term':
            continue

        # Write one JSON object per line.
        fout.write(json.dumps(d))
        fout.write('\n')
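
Splitting each line on `,` relies on no field containing a comma. As a hedged alternative, the standard-library `csv` module handles quoted fields. The sketch below uses a made-up, in-memory CSV sample rather than `terms.csv` and shows the difference between the two approaches.

```python
import csv
import io
import json

# Hypothetical CSV content with a quoted field that contains a comma.
sample_csv = 'term,text,synonym\n"soft market","Low prices, broad coverage",\n'

# A naive split breaks the quoted text field into two pieces.
naive_fields = sample_csv.splitlines()[1].split(',')

# csv.DictReader uses the header row for keys and respects the quotes.
records = [json.dumps(row) for row in csv.DictReader(io.StringIO(sample_csv))]
```

With `csv.DictReader` the header row is consumed automatically, so no explicit check for `'term'` is needed.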

* Load JSON to check

In [33]:
df_test_json = pd.read_json('insurance_terms.json', orient='columns', lines=True)
df_test_json.head()

Unnamed: 0,synonym,term,text
0,,automatic premium loan,An optional provision in life insurance that a...
1,,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...
2,,hydrocarbons,A class of organic compounds composed only of ...
3,,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...
4,,hybrid plans,Risk financing techniques that are a combinati...


In [34]:
len(df_test_json)

3261

In [35]:
cols = ['term', 'text', 'synonym']
df_test_json2 = df_test_json[cols].reset_index(drop=True)
df_test_json2.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [36]:
len(df_test_json2)

3261

In [37]:
df_test_json2.equals(df_insurance_terms)

False

The result is False because missing synonyms are NaN in `df_insurance_terms` but empty strings in `df_test_json2`. Let's compare the term and text columns only.
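
The mismatch can be reproduced in plain Python: a NaN never compares equal to an empty string, so the two representations of "no synonym" only match after they are normalized to one marker. `normalize_missing` is a hypothetical helper, and mapping everything to `None` is just one possible convention.

```python
import math

def normalize_missing(value):
    """Map None, NaN and the empty string to a single missing marker (None)."""
    if value is None or value == '':
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    return value

from_csv = ['Trade libel', float('nan')]  # pandas reads an empty CSV cell as NaN
from_json = ['Trade libel', '']           # the line-by-line export stores ''

raw_equal = from_csv == from_json
normalized_equal = (
    [normalize_missing(v) for v in from_csv]
    == [normalize_missing(v) for v in from_json]
)
```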

In [38]:
df_test_json2.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [39]:
df_insurance_terms.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [40]:
df_test_json3 = df_test_json2[['term', 'text']]
df_test_json3.equals(df_insurance_terms[['term', 'text']])

True

In [43]:
df_test_json3.head()

Unnamed: 0,term,text
0,automatic premium loan,An optional provision in life insurance that a...
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...
2,hydrocarbons,A class of organic compounds composed only of ...
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...
4,hybrid plans,Risk financing techniques that are a combinati...


In [42]:
len(df_test_json3), len(df_insurance_terms)

(3261, 3261)

* From pandas dataframe to JSON

In [44]:
df_insurance_terms.to_json(path_or_buf='insurance_terms_2.json', orient='index')

In [45]:
df_test_json4 = pd.read_json('insurance_terms_2.json', orient='index')
df_test_json4.head()

Unnamed: 0,synonym,term,text
0,,automatic premium loan,An optional provision in life insurance that a...
1,,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...
2,,hydrocarbons,A class of organic compounds composed only of ...
3,,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...
4,,hybrid plans,Risk financing techniques that are a combinati...


In [46]:
df_test_json4 = df_test_json4[cols]
df_test_json4.head()

Unnamed: 0,term,text,synonym
0,automatic premium loan,An optional provision in life insurance that a...,
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...,
2,hydrocarbons,A class of organic compounds composed only of ...,
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...,
4,hybrid plans,Risk financing techniques that are a combinati...,


In [47]:
df_test_json4.equals(df_insurance_terms)

True
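
For reference, `orient='index'` stores one JSON object per row, keyed by the row index as a string, and writes missing values as `null`. A minimal standard-library sketch of that layout, using two terms from the data with only the `term` and `synonym` fields:

```python
import json

records = [
    {'term': 'hybrid plans', 'synonym': None},
    {'term': 'buyers market', 'synonym': 'Soft market'},
]

# Key each record by its row position, mirroring the orient='index' layout.
by_index = {str(i): record for i, record in enumerate(records)}
encoded = json.dumps(by_index)

# Decode and restore the original row order by sorting the keys numerically.
decoded = json.loads(encoded)
restored = [decoded[key] for key in sorted(decoded, key=int)]
```

Because missing values survive as `null`, this route round-trips cleanly, unlike the line-by-line export, which stores empty strings.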

## Only keep term and text for the analysis

In [48]:
df_insurance_terms_2 = df_insurance_terms[['term', 'text']]
df_insurance_terms_2.head()

Unnamed: 0,term,text
0,automatic premium loan,An optional provision in life insurance that a...
1,Household Goods Transportation Act of 1980,Provided a nonjudicial dispute settlement prog...
2,hydrocarbons,A class of organic compounds composed only of ...
3,hydraulic fracturing (fracking),A process in which fractures in hard-to-reach ...
4,hybrid plans,Risk financing techniques that are a combinati...


In [49]:
def dataframe_to_sql(df, table_name):
    from sqlalchemy import create_engine

    sqlite = 'sqlite:///' + table_name + '.sqlite'
    engine = create_engine(sqlite, echo=False)
    df.to_sql(table_name, con=engine)

In [50]:
dataframe_to_sql(df_insurance_terms_2, 'insurance_terms_2')

In [51]:
def csv_to_json(csv_file, json_file):
    """Write the term and text columns of csv_file as one JSON object per line."""
    import json

    with open(csv_file, 'r') as fin, open(json_file, 'w') as fout:
        for line in fin:
            split_line = line.rstrip().split(',')
            # Assumes no field contains a comma; other lines are skipped.
            if len(split_line) != 3:
                continue
            # Skip the CSV header row.
            if split_line[0] == 'term':
                continue
            d = {'term': split_line[0], 'text': split_line[1]}
            fout.write(json.dumps(d))
            fout.write('\n')

In [52]:
csv_to_json('terms.csv', 'insurance_terms_3.json')

In [53]:
def dataframe_to_json(df, json_file):
    df.to_json(path_or_buf=json_file, orient='index')

In [54]:
dataframe_to_json(df_insurance_terms_2, 'insurance_terms_4.json')

***
***
***