#  Professional and Occupational Licensing

This notebook continues the exploratory analysis conducted on the [Disciplinary Actions for Professional and Occupational Licensees dataset for the state of Delaware](https://github.com/seyeadekanye/Disciplinary-Action-ODD/blob/master/Disciplinary%20Action.ipynb). The dataset description reads: "This dataset contains information about individuals who have applied for, currently hold or previously held a professional or occupational license issued by the State of Delaware" (for certain industries).

### We hope to answer the following question from this dataset:
* For a given year and given profession, what is the ratio of licenses issued to fines given?

We will be using the pandas module for the data manipulation, while using the matplotlib and seaborn packages for our visualizations.

### Necessary data set(s)
[Disciplinary Actions for Professional and Occupational Licensees](https://data.delaware.gov/Licenses-and-Certifications/Disciplinary-Actions-for-Professional-and-Occupati/dz6p-akeq)

[Professional and Occupational Licensing](https://data.delaware.gov/Licenses-and-Certifications/Professional-and-Occupational-Licensing/pjnv-eaih)





*Written on January 29, 2018 by Seye Adekanye*

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib notebook

In [2]:
url_licenses = "https://data.delaware.gov/resource/dhqa-h9is.csv?$limit=300000"
data = pd.read_csv(url_licenses, low_memory=False)

In [3]:
url_fines = "https://data.delaware.gov/resource/wqvn-hw3m.csv?$limit=6000"
data2 = pd.read_csv(url_fines, low_memory=False)

In [4]:
data.head()

Unnamed: 0,city,combined_name,count,country,disciplinary_action,expiration_date,first_name,issue_date,last_name,license_no,license_status,license_type,licensee_url,licensee_url_description,profession_id,state,zip_code
0,Philadelphia,"Paulson, Melyssa M.",1,United States,N,2003-03-31T00:00:00.000,Melyssa,2001-08-09T14:49:00.000,Paulson,C7-0002313,Expired-Must Reapply,ACGME Training,https://dpronline.delaware.gov/mylicense%20web...,,Medical Practice,PA,19107
1,Wilmington,"Dholakia, Madhun",1,United States,N,2005-03-31T00:00:00.000,Madhun,2004-07-26T00:00:00.000,Dholakia,C7-0003059,Expired-Must Reapply,ACGME Training,https://dpronline.delaware.gov/mylicense%20web...,,Medical Practice,DE,19899
2,Collegedale,"Sims, Naomi Reneau",1,United States,N,2001-09-30T00:00:00.000,Naomi,2001-01-17T00:00:00.000,Sims,L1-0029470,Lapsed-Must Reinstate,Registered Nurse,https://dpronline.delaware.gov/mylicense%20web...,,Nursing,TN,37315
3,Newark,"Macklin, Kevin, Sr.",1,United States,N,2010-03-31T00:00:00.000,Kevin,2005-10-14T00:00:00.000,Macklin,D1-0001215,Null and Void,Barber,https://dpronline.delaware.gov/mylicense%20web...,,Cosmetology and Barbering,DE,19711
4,Pasadena,"Miller, Ronald J",1,United States,N,2006-06-30T00:00:00.000,Ronald,2005-03-10T00:00:00.000,Miller,T1-0005253,Expired-Must Reapply,Master Electrician,https://dpronline.delaware.gov/mylicense%20web...,,Electrical Examiners,MD,21122


In [5]:
data.shape

(251665, 17)

In [6]:
data['profession_id'].unique()

array(['Medical Practice', 'Nursing', 'Cosmetology and Barbering',
       'Electrical Examiners', 'Real Estate', 'Charitable Gaming',
       'Real Estate Appraisers', 'Controlled Substances', 'Geologists',
       'Accountancy', 'Architecture', 'Physical Therapy/Athletic Trg',
       'Social Work Examiners', 'Pharmacy', 'Landscape Architecture',
       'Massage Bodywork', 'Speech and Hearing',
       'Nursing Home Administrators', 'Veterinary Medicine', 'Psychology',
       'Chiropractic', 'Plumbing/HVACR', 'Funeral Services', 'Dentistry',
       'Combative Sports', 'Occupational Therapy', 'Optometry',
       'Mental Health', 'Dietitians/Nutritionists', 'Home Inspectors',
       'Deadly Weapons Dealers', 'Land Surveyors', 'Pilots',
       'Manufactured Home Installation', 'Podiatry', '<any>',
       'Adult Entertainment'], dtype=object)

In [7]:
data2.head()

Unnamed: 0,combined_name,count,disp_end,disp_start,first_name,item_text,last_name,license_id_l,license_type,profession_id
0,"Jones, Matthew T.",1,,2017-06-13T00:00:00.000,Matthew,Remedial Education,Jones,N1-0002718,Veterinarian,Veterinary Medicine
1,"Jones, Matthew T.",1,,2017-06-13T00:00:00.000,Matthew,Probation,Jones,N1-0002718,Veterinarian,Veterinary Medicine
2,"Fortner, Elizabeth Buckley",1,,2011-06-03T00:00:00.000,Elizabeth,Letter of Reprimand,Fortner,G2-0002068,Dental Hygienist,Dentistry
3,"Hynson, Lauren Marie",1,2017-07-14T00:00:00.000,2017-04-12T00:00:00.000,Lauren,Remedial Education,Hynson,L1-0041565,Registered Nurse,Nursing
4,"Hensley, Leslie A Doughty",1,2017-07-16T00:00:00.000,2013-07-16T00:00:00.000,Leslie,Probation,Hensley,L6-0A00209,Certified Registered Nurse Anesthetist,Nursing


In [8]:
data2.shape

(5078, 10)

Which professions are common to both datasets?

In [9]:
A = data['profession_id'].unique()
B = data2['profession_id'].unique()
list(set(A).intersection(set(B)))

['Psychology',
 'Cosmetology and Barbering',
 'Controlled Substances',
 'Land Surveyors',
 'Massage Bodywork',
 'Pharmacy',
 'Architecture',
 'Speech and Hearing',
 'Medical Practice',
 'Home Inspectors',
 'Plumbing/HVACR',
 'Optometry',
 'Landscape Architecture',
 'Veterinary Medicine',
 'Occupational Therapy',
 'Real Estate',
 'Social Work Examiners',
 'Nursing Home Administrators',
 'Accountancy',
 'Combative Sports',
 'Physical Therapy/Athletic Trg',
 'Chiropractic',
 'Electrical Examiners',
 'Pilots',
 'Charitable Gaming',
 'Adult Entertainment',
 'Real Estate Appraisers',
 'Mental Health',
 'Podiatry',
 'Geologists',
 'Manufactured Home Installation',
 'Nursing',
 'Deadly Weapons Dealers',
 'Funeral Services',
 'Dentistry']

Drop all rows from the `data` dataset whose `profession_id` does not appear in the `data2` dataset.

In [10]:
list(set(A).difference(set(B)))

['<any>', 'Dietitians/Nutritionists']

In [11]:
data = data[data.profession_id != '<any>']
data = data[data.profession_id != 'Dietitians/Nutritionists']
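
The two inequality filters above can also be written as a single membership test, which avoids listing the unmatched professions by hand. A minimal sketch on toy frames (the rows are made up for illustration; only the column name `profession_id` comes from the notebook, and `licenses`/`actions` are stand-ins for `data`/`data2`):

```python
import pandas as pd

# Toy stand-ins for `data` (licenses) and `data2` (disciplinary actions)
licenses = pd.DataFrame(
    {"profession_id": ["Nursing", "<any>", "Pharmacy", "Dietitians/Nutritionists"]}
)
actions = pd.DataFrame({"profession_id": ["Nursing", "Pharmacy", "Nursing"]})

# Keep only license rows whose profession also appears in the actions data
common = licenses["profession_id"].isin(actions["profession_id"].unique())
filtered = licenses[common].reset_index(drop=True)
print(filtered["profession_id"].tolist())
```

Because the mask is derived from the second frame, the filter would keep working if the set of unmatched professions changed.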

In [12]:
# reset_index returns a new DataFrame, so assign the result back
data = data.reset_index(drop=True)

In [13]:
data.shape

(251072, 17)

In [14]:
len(data[data['profession_id']== 'Medical Practice'])

22703

In [15]:
count = {}
for p in data['profession_id']:
    try:
        count[p] += 1
    except KeyError:
        count[p] = 1

In [16]:
count

{'Accountancy': 14328,
 'Adult Entertainment': 5,
 'Architecture': 4079,
 'Charitable Gaming': 24958,
 'Chiropractic': 746,
 'Combative Sports': 385,
 'Controlled Substances': 10966,
 'Cosmetology and Barbering': 19035,
 'Deadly Weapons Dealers': 366,
 'Dentistry': 2455,
 'Electrical Examiners': 18067,
 'Funeral Services': 560,
 'Geologists': 949,
 'Home Inspectors': 168,
 'Land Surveyors': 557,
 'Landscape Architecture': 596,
 'Manufactured Home Installation': 143,
 'Massage Bodywork': 4715,
 'Medical Practice': 22703,
 'Mental Health': 1087,
 'Nursing': 65272,
 'Nursing Home Administrators': 658,
 'Occupational Therapy': 1870,
 'Optometry': 424,
 'Pharmacy': 9259,
 'Physical Therapy/Athletic Trg': 5941,
 'Pilots': 180,
 'Plumbing/HVACR': 3262,
 'Podiatry': 202,
 'Psychology': 1189,
 'Real Estate': 26507,
 'Real Estate Appraisers': 3310,
 'Social Work Examiners': 1501,
 'Speech and Hearing': 2304,
 'Veterinary Medicine': 2325}
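
The manual tally above is equivalent to pandas' built-in `value_counts`. A small sketch on a toy Series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for data['profession_id']
professions = pd.Series(
    ["Nursing", "Pharmacy", "Nursing", "Dentistry", "Nursing"],
    name="profession_id",
)

# Count rows per profession, then sort alphabetically like the dict shown above
counts = professions.value_counts().sort_index()
count_dict = counts.to_dict()
print(count_dict)
```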

Next we deal with the date columns, which arrive as timestamp strings such as `2017-06-13T00:00:00.000`. We will convert these columns to datetime objects, keeping only the date part.

In [17]:
from datetime import datetime

# Parse the first 10 characters (YYYY-MM-DD) of each disciplinary start date
disciplinary_start_date = []
for date in data2['disp_start']:
    try:
        date = date[0:10]
        disciplinary_start_date.append(datetime.strptime(str(date),'%Y-%m-%d'))
    except (TypeError, ValueError):
        # Missing values (NaN) or malformed strings are kept unchanged
        disciplinary_start_date.append(date)
data2['disp_start'] = disciplinary_start_date

# Same treatment for the license issue dates
license_issue_date = []
for date in data['issue_date']:
    try:
        date = date[0:10]
        license_issue_date.append(datetime.strptime(str(date),'%Y-%m-%d'))
    except (TypeError, ValueError):
        license_issue_date.append(date)
data['issue_date'] = license_issue_date
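
The same conversion can be sketched without an explicit loop using `pd.to_datetime`. This is an alternative under one stated difference: with `errors="coerce"`, values that cannot be parsed become `NaT` instead of being kept unchanged. The strings below are toy values in the format shown in the outputs above:

```python
import pandas as pd

# Toy stand-in for a raw date column such as data2['disp_start']
raw = pd.Series(
    ["2017-06-13T00:00:00.000", None, "2011-06-03T00:00:00.000", "not a date"]
)

# Keep the YYYY-MM-DD prefix, then parse; unparseable entries become NaT
parsed = pd.to_datetime(raw.str[:10], format="%Y-%m-%d", errors="coerce")
print(parsed)
```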

In [18]:
license_year = []
err = []
for date in data['issue_date']:
    try:
        license_year.append(date.year)
    except AttributeError as e:
        err.append(e)
        license_year.append(date)
data['licence_year'] = license_year

In [19]:
disciplinary_year = []
err = []
for date in data2['disp_start']:
    try:
        disciplinary_year.append(date.year)
    except AttributeError as e:
        err.append(e)
        disciplinary_year.append(date)
data2['disciplinary_year'] = disciplinary_year
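
Once a column holds datetimes, the year can also be read through the `.dt` accessor. A minimal sketch on toy dates; the cast to the nullable `Int64` dtype is an optional choice (not something the notebook does) that keeps years as integers even when some dates are missing:

```python
import pandas as pd

# Toy datetime column with one missing value
dates = pd.Series(pd.to_datetime(["2017-06-13", None, "2011-06-03"]))

# .dt.year yields NaN for missing dates; Int64 stores them as <NA> and keeps ints
years = dates.dt.year.astype("Int64")
print(years)
```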

A function that reports, for a given year, the ratio of fines received to the total number of licenses issued, expressed as a percentage.

In [44]:
def fine_to_license_ratio(license_data, fine_data, column_name1=None, column_name2=None,year=None):
    """Get ratio of fines to licenses issued in a given year
    
    Parameters:
    -----------
    license_data: DataFrame
        Any subset of the Professional and Occupational Licensing dataframe   
    fine_data: DataFrame
        Any subset of the Disciplinary Actions dataframe 
    year: int
        Year to use to subset your data
    column_name1: str
        Name of the column containing years in license_data
    column_name2: str
        Name of the column containing years in fine_data
    
    Returns:
    --------
    tuple
        A tuple with year as the first entry and ratio as the second
        (year, ratio)
    """
    
    # Fail early if no usable year was passed; column names are used as given
    year = int(year)
    
    if year not in license_data[column_name1].unique() or year not in fine_data[column_name2].unique():
        raise Exception(str(year) + " not a valid year for this dataset" 
                        + "\n----------------------------------------")
    else:
        license_data = license_data[license_data[column_name1]==year]
        fine_data = fine_data[fine_data[column_name2]==year]
    try:
        ratio = len(fine_data)/len(license_data) * 100
        print("In the year " + str(year) + ", " + str(ratio)[:4] 
              +"% of license recipients were issued disciplinary action(s)" 
              + "\n" +str(len(license_data)) + " licenses issued, " + str(len(fine_data)) + " fines issued"
            + "\n----------------------------------------")
        return year, ratio
    except ZeroDivisionError:
        print("Hmmm...It looks like there are no licenses yet for the year " + str(year))
    

In [25]:
fine_to_license_ratio(license_data=data, fine_data=data2, column_name1='licence_year', 
                      column_name2='disciplinary_year', year=2017)

In the year 2017, 5.80% of license recipients were issued disciplinary action(s)
11925 licenses issued, 692 fines issued
----------------------------------------


(2017, 5.80293501048218)
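
The percentage printed above is simply 692 / 11925 × 100. The sketch below checks that arithmetic and then shows one way to get the same percentage for every year at once by dividing per-year counts. The two year Series are toy data standing in for `data['licence_year']` and `data2['disciplinary_year']`:

```python
import pandas as pd

# Check of the 2017 figure reported above
ratio_2017 = 692 / 11925 * 100

# Toy year columns (illustrative values only)
licence_year = pd.Series([2016, 2016, 2016, 2016, 2017, 2017])
disciplinary_year = pd.Series([2016, 2017, 2015])

licenses_per_year = licence_year.value_counts()
fines_per_year = disciplinary_year.value_counts()

# Division aligns on the year index; years missing from either side give NaN
pct = (fines_per_year / licenses_per_year * 100).dropna().sort_index()
print(pct.to_dict())
```

Dropping the `NaN` rows mirrors the function's rule that a year must be present in both datasets.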

In [26]:
data[data['licence_year']==2177]

Unnamed: 0,city,combined_name,count,country,disciplinary_action,expiration_date,first_name,issue_date,last_name,license_no,license_status,license_type,licensee_url,licensee_url_description,profession_id,state,zip_code,licence_year
241959,Mount Royal,"Whiting, Toni Lyn",1,United States,N,2017-10-10T00:00:00.000,Toni,2177-08-01,Whiting,GN-0005269,Null and Void,Temporary Permit - GN,https://dpronline.delaware.gov/mylicense%20web...,,Nursing,NJ,8061,2177.0
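
The row above has an issue date in 2177 but an expiration date in 2017, which suggests a data-entry error, although the source does not confirm this. A minimal sketch for flagging such rows, assuming any issue year later than the current year is implausible (the cutoff and the toy rows are assumptions for illustration):

```python
import pandas as pd

# Toy rows; license numbers are made up
toy = pd.DataFrame({
    "license_no": ["A-1", "A-2", "A-3"],
    "licence_year": [2005.0, 2177.0, 2017.0],
})

# Assumed rule: a license cannot have been issued after the current year
cutoff = pd.Timestamp.today().year
suspect = toy[toy["licence_year"] > cutoff]
print(suspect)
```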


Create a function that groups the data by `profession_id` and passes the rows for one profession to `fine_to_license_ratio`, which computes the ratio of fines to licenses for that profession.

In [41]:
def fine_to_license_ratio_profession(license_data, fine_data, profession, 
                                     profession_column_license, profession_column_fine, 
                                     column_name1=None, column_name2=None,year=None):
    """Get ratio of fines to licenses issued in a given year for one profession
    
    Parameters:
    -----------
    license_data: DataFrame
        The Professional and Occupational Licensing dataframe (or any subset of it)
    fine_data: DataFrame
        The Disciplinary Actions dataframe (or any subset of it)
    profession: str
        Profession to select from the grouped data
    profession_column_license: str
        Column name of professions in the Occupational Licensing dataframe
    profession_column_fine: str
        Column name of professions in the Disciplinary Actions dataframe
    year: int
        Year to use to subset your data
    column_name1: str
        Name of the column containing years in license_data
    column_name2: str
        Name of the column containing years in fine_data
    
    Returns:
    --------
    tuple
        A tuple with year as the first entry and ratio as the second
        (year, ratio)
    """
    
    # Fail early if no usable year was passed; other arguments are used as given
    year = int(year)
    
    grouped_license_data = license_data.groupby(profession_column_license)
    grouped_fine_data = fine_data.groupby(profession_column_fine)
    license_profession_df = grouped_license_data.get_group(profession)
    fine_profession_df = grouped_fine_data.get_group(profession)
    try: 
        print("For the " + profession + " Profession...")
        return fine_to_license_ratio(license_data=license_profession_df, fine_data=fine_profession_df, 
                                 column_name1=column_name1, column_name2=column_name2,year=year)
    except Exception as e:
        print(e)

In [45]:
fine_to_license_ratio_profession(license_data=data, fine_data=data2, 
                                 profession='Pharmacy', profession_column_license='profession_id', 
                                 profession_column_fine='profession_id', column_name1='licence_year', 
                                 column_name2='disciplinary_year', year=2001)

For the Pharmacy Profession...
2001 not a valid year for this dataset
----------------------------------------
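
`fine_to_license_ratio` rejects a year unless it appears in both year columns, so a profession with licenses but no disciplinary actions in a given year (or the reverse) produces the message above. The sketch below lists the years that would pass that check for one profession. The helper `valid_years` is hypothetical and the rows are toy data; only the column names come from the notebook:

```python
import pandas as pd

licenses = pd.DataFrame({
    "profession_id": ["Pharmacy", "Pharmacy", "Pharmacy", "Nursing"],
    "licence_year": [2001, 2015, 2016, 2001],
})
fines = pd.DataFrame({
    "profession_id": ["Pharmacy", "Pharmacy", "Nursing"],
    "disciplinary_year": [2015, 2016, 2001],
})

def valid_years(license_df, fine_df, profession):
    """Years present in both frames for one profession (hypothetical helper)."""
    lic = license_df.loc[license_df["profession_id"] == profession, "licence_year"].dropna()
    fin = fine_df.loc[fine_df["profession_id"] == profession, "disciplinary_year"].dropna()
    # Intersect the two sets of years and return them in ascending order
    return sorted(set(lic.astype(int)) & set(fin.astype(int)))

pharmacy_years = valid_years(licenses, fines, "Pharmacy")
print(pharmacy_years)
```

In this toy data, 2001 is excluded for Pharmacy because it has a license that year but no disciplinary action.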


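For a single profession, compare the number of licenses issued with the number of fines issued, year by year. The cell below runs the comparison for Occupational Therapy; pass `'Nursing'` as the profession to examine nursing instead.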

In [57]:
import numpy as np
for date in np.arange(2000,2018):
    fine_to_license_ratio_profession(license_data=data, fine_data=data2, 
                                     profession='Occupational Therapy', profession_column_license='profession_id', 
                                     profession_column_fine='profession_id', column_name1='licence_year', 
                                     column_name2='disciplinary_year', year=date)

For the Occupational Therapy Profession...
2000 not a valid year for this dataset
----------------------------------------
For the Occupational Therapy Profession...
2001 not a valid year for this dataset
----------------------------------------
For the Occupational Therapy Profession...
2002 not a valid year for this dataset
----------------------------------------
For the Occupational Therapy Profession...
2003 not a valid year for this dataset
----------------------------------------
For the Occupational Therapy Profession...
2004 not a valid year for this dataset
----------------------------------------
For the Occupational Therapy Profession...
2005 not a valid year for this dataset
----------------------------------------
For the Occupational Therapy Profession...
2006 not a valid year for this dataset
----------------------------------------
For the Occupational Therapy Profession...
2007 not a valid year for this dataset
----------------------------------------
For the Occupati

In [53]:
data['profession_id'].unique()

array(['Medical Practice', 'Nursing', 'Cosmetology and Barbering',
       'Electrical Examiners', 'Real Estate', 'Charitable Gaming',
       'Real Estate Appraisers', 'Controlled Substances', 'Geologists',
       'Accountancy', 'Architecture', 'Physical Therapy/Athletic Trg',
       'Social Work Examiners', 'Pharmacy', 'Landscape Architecture',
       'Massage Bodywork', 'Speech and Hearing',
       'Nursing Home Administrators', 'Veterinary Medicine', 'Psychology',
       'Chiropractic', 'Plumbing/HVACR', 'Funeral Services', 'Dentistry',
       'Combative Sports', 'Occupational Therapy', 'Optometry',
       'Mental Health', 'Home Inspectors', 'Deadly Weapons Dealers',
       'Land Surveyors', 'Pilots', 'Manufactured Home Installation',
       'Podiatry', 'Adult Entertainment'], dtype=object)