# Classifying doctor violations by hand, then by machine

You're on the lookout for one particular type of doctor violation, whether that's keeping poor records, being addicted to drugs, or anything else. **You decide.**

**You're going to see how often doctors lose their license for that violation.** There are about 7000 records, though, and you ain't going to read all of them!

Steps:

1. **Classify some violations by hand**
1. Vectorize the **hand-classified violations**
1. Train a classifier on the **hand-classified violations**
1. **Test the classifier**. If it's good, next step! If not, go back to training.
1. Vectorize the **unclassified violations**
1. Use the classifier to **predict the labels of the unclassified violations**
1. What actions were taken against those doctors?

It'll be magic!
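Before touching the real data, here is a minimal sketch of the whole flow on six made-up descriptions. The `toy` dataframe, its text, and its labels are invented for illustration, and the test/train split is skipped because four labeled rows are too few to split.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Made-up violation descriptions; only the first four are "hand-classified"
toy = pd.DataFrame({
    "misconduct": [
        "prescribing controlled substances for her own use",
        "failing to maintain accurate patient records",
        "habitual use of drugs and alcohol",
        "ordering excessive tests not warranted",
        "addicted to controlled substances",
        "failing to maintain records",
    ],
    "category": ["YES", "NO", "YES", "NO", None, None],
})

labeled = toy[toy.category.notnull()]
unlabeled = toy[toy.category.isnull()].copy()

# Vectorize the hand-classified rows (fit + transform)
vec = CountVectorizer(stop_words="english")
X_labeled = vec.fit_transform(labeled.misconduct)
y_labeled = (labeled.category == "YES").astype(int)

# Train on the hand-classified rows only
clf = BernoulliNB()
clf.fit(X_labeled, y_labeled)

# Vectorize the rest with the SAME vectorizer (transform only), then predict
X_unlabeled = vec.transform(unlabeled.misconduct)
unlabeled["is_yes"] = clf.predict(X_unlabeled)
```

Everything below is the same idea, just with real records and a proper test step in the middle.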

In [18]:
import pandas as pd
import numpy as np

In [19]:
df = pd.read_csv("physicians-ny-violations.csv")
df.head(2)

Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth
0,Revocation of certificate of incorporation.,09/29/2010,09/29/2010,P.C.,563 Grand Medical,196275,,,The corporation admitted guilt to the charge o...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,
1,Revocation of certificate of incorporation. P...,12/01/2010,12/08/2010,P.C.,AR Medical Art,207165,,,The corporation admitted to the charge of havi...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,


## Step 1: Classify some by hand

If you had a CSV with some sort of key in common, you'd be able to just do a join. But we don't! So **I'm going to help you out**.

I wrote this little script to help you **classify content by hand**. It will print the violation, then ask whether it's what you're looking for. If you type "y" or "Y" before hitting enter, that means YES. Anything else means NO. Once it's done it'll add the results to the dataframe in a column called `category`.

In [22]:
number_to_classify_by_hand = 30

In [None]:
#checking for drugs

In [23]:
def is_what_you_want(row):
    # Show the violation text and ask whether it belongs in your category
    prompt = "\n------------\n\n{desc}\n\n\nIS THIS WHAT YOU'RE LOOKING FOR? y for YES ".format(desc=row.misconduct)
    response = input(prompt)
    # Only "y" or "Y" counts as YES; anything else (including just Enter) is NO
    if response == "y" or response == "Y":
        print("\n** Classified as YES **")
        return "YES"
    else:
        print("\n** Classified as NO **")
        return "NO"

# Reset category column
df['category'] = np.nan
df['category'] = df[:number_to_classify_by_hand].apply(is_what_you_want, axis=1)

df.category.value_counts()


------------

The corporation admitted guilt to the charge of ordering excessive tests, treatment, or use of treatment facilities not warranted by the condition of a patient.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

The corporation admitted to the charge of having been convicted in New York Supreme Court, Kings County of a scheme to defraud in the first degree; falsifying business records; insurance fraud and failing to comply with the requirements of the New York State Business Corporation Law Section 1503(a).


IS THIS WHAT YOU'RE LOOKING FOR? y for YES y

** Classified as YES **

------------

This action modifies the penalty previously imposed  by Order# 93-40 on March 31, 1993, where the Hearing Committee sustained the charge that the physician was disciplined by the Utah State Medical Board, and ordered that if he intends to engage in practice in NY State, a two-year period of probation shall be imposed.


IS THIS WHAT YOU'RE LOOKING 


** Classified as NO **

------------

The Corporation was rendered in violation of New York State Business Corporation Law Section 1503(a) and (b) and 1504(a) due to the surrender of the sole shareholder's medical license.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES n

** Classified as NO **

------------

The Hearing Committee sustained the charge finding the physician guilty of having been disciplined by the Illinois State Department of Professional Regulation for filing insurance claims for services which were not rendered to patients.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES y

** Classified as YES **

------------

The physician assistant did not contest the charge of fraudulent practice due to prescribing controlled substances for her own use.


IS THIS WHAT YOU'RE LOOKING FOR? y for YES y

** Classified as YES **


NO     20
YES    10
Name: category, dtype: int64
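If you want to see how the labels land in the dataframe without typing anything, here is a rough non-interactive sketch. The `canned` dictionary is a hypothetical stand-in for what you would type at the prompt, and the four descriptions are made up.

```python
import pandas as pd

toy = pd.DataFrame({"misconduct": [
    "prescribing controlled substances for her own use",
    "ordering excessive tests",
    "habitual drug use",
    "failing to maintain records",
]})

# Hypothetical answers keyed by row index, replacing input()
canned = {0: "y", 1: "n"}

def label_row(row):
    response = canned[row.name]
    return "YES" if response in ("y", "Y") else "NO"

rows_to_label = 2
# Only the first rows get a label; pandas aligns on the index,
# so every other row ends up as NaN in the new column
toy["category"] = toy[:rows_to_label].apply(label_row, axis=1)
```

The rows left as NaN are exactly the ones the classifier will have to label later.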

In [24]:
df.head(5)

Unnamed: 0,action,date_updated,eff_date,first,last,lic_num,lic_type,middle,misconduct,order_pdf,restrictions,url,year_of_birth,category
0,Revocation of certificate of incorporation.,09/29/2010,09/29/2010,P.C.,563 Grand Medical,196275,,,The corporation admitted guilt to the charge o...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,NO
1,Revocation of certificate of incorporation. P...,12/01/2010,12/08/2010,P.C.,AR Medical Art,207165,,,The corporation admitted to the charge of havi...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,,YES
2,License Surrender,,01/13/1999,Joseph,Aaron,72800,MD,,This action modifies the penalty previously im...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1927.0,NO
3,License limited until the physician's North Ca...,12/06/2005,12/13/2005,Mark,Aarons,161530,MD,Gold,The physician did not contest the charge of ha...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1958.0,NO
4,License surrender.,08/07/2013,08/14/2013,Jamsheed,Abadi,136045,MD,S,The physician did not contest the charge of fa...,https://apps.health.ny.gov/pubdoh/professional...,,https://apps.health.ny.gov/pubdoh/professional...,1939.0,NO


In [117]:
categorized = df[df.category.notnull()]

## Step 2: Vectorize the violation descriptions

You want to **ONLY DO THIS WITH THE ONES YOU CLASSIFIED.**

In [57]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(stop_words = 'english', max_features=500)
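# Strip digits before counting words; recent pandas needs regex=True for the pattern to work
matrix = vec.fit_transform(df[df.category.notnull()]['misconduct'].str.replace(r"\d", "", regex=True))
# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn releases removed
features_df = pd.DataFrame(matrix.toarray(), columns=vec.get_feature_names_out())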
features_df.head()

Unnamed: 0,abetting,abusing,accepted,accurate,action,addiction,adequate,administration,admitted,aiding,...,user,utah,verbally,violation,warranted,willfully,written,year,years,york
0,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,2
2,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Step 3: Create a classifier and train a model using the violation descriptions

You want to **ONLY DO THIS WITH THE ONES YOU CLASSIFIED.** You'll also need to make the `category` column a number, probably.

And remember your test/train split!
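With only 30 labeled rows, a random split can easily put almost no YES rows in the test set. One option, sketched here on stand-in data, is to pass `stratify` so both halves keep the same YES/NO mix, and `random_state` so the split is repeatable. The arrays `X` and `y` are placeholders, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels: 10 YES (1) and 20 NO (0)
X = np.arange(60).reshape(30, 2)
y = np.array([1] * 10 + [0] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,     # 6 of the 30 rows are held out
    stratify=y,        # keep the 1-in-3 YES ratio in both halves
    random_state=0)    # same split every time you run it
```

This is a suggestion rather than what the cell below does: it uses a plain random split.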

In [59]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import train_test_split

df['is_yes'] = (df[df.category.notnull()]['category'] == 'YES').astype(int)

clf = BernoulliNB()

X_train, X_test, y_train, y_test = train_test_split(
    features_df.values, 
    df[df.category.notnull()]['is_yes'],
    test_size=0.2) 

clf.fit(X_train,y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

## Step 4: Test the classifier

How does it look? Remember, we're only using the classified ones so far!

**If you don't like its predicting ability**, go back up and play around with your vectorizer, and even with your classifier. There are a lot of options!

In [60]:
clf.score(X_test, y_test)

0.66666666666666663
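A score like that deserves a sanity check. When two thirds of the labels are NO, a classifier that always says NO also scores about 0.67. The sketch below compares accuracy to that baseline and prints a confusion matrix. The predictions in it are invented for illustration, not the ones from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_test = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])   # hypothetical predictions

accuracy = accuracy_score(y_test, y_pred)

# Accuracy you would get by always guessing the most common class
baseline = max(np.mean(y_test), 1 - np.mean(y_test))

# Rows are the true labels, columns are the predicted labels
cm = confusion_matrix(y_test, y_pred)
```

If `accuracy` is no better than `baseline`, the classifier hasn't learned much yet, and classifying more rows by hand is probably the best fix.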

## Step 5: Vectorize the unclassified violations

Now we need to vectorize the violations we didn't classify by hand.

You **DO NOT MAKE A NEW VECTORIZER**. You just use the one we already have! Also, you **DON'T FIT IT AGAIN!** You just transform. I hope you read this line, but I'll give you some code anyway.

In [112]:
not_categorized = df[df.category.isnull() & df.misconduct.notnull()]

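# Strip digits here too, so the new text is cleaned the same way as the training text
features_df = vec.transform(not_categorized.misconduct.str.replace(r"\d", "", regex=True))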

## Step 6: Use the classifier to predict the labels of the unclassified violations

You **DON'T NEED A NEW CLASSIFIER**, use the one you have! You'll use `clf.predict`, and feed it... what? What does it need to predict the labels?

In [115]:
not_categorized['is_yes'] = clf.predict(features_df)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
df_new = pd.concat([categorized, not_categorized])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
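That warning shows up because `not_categorized` was made by filtering `df`, and pandas can't tell whether you meant to change the original. One common way around it, sketched on made-up data, is to call `.copy()` when you filter. The `predictions` array is a stand-in for `clf.predict(...)`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "misconduct": ["a", "b", None, "d"],
    "category": ["YES", None, None, None],
})

# .copy() makes the filtered frame its own independent dataframe
not_categorized = toy[toy.category.isnull() & toy.misconduct.notnull()].copy()

predictions = np.array([0, 1])   # stand-in for clf.predict(...)
not_categorized["is_yes"] = predictions
```

The original `toy` is untouched, and the new column lives only on the copy.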


### Step 6.2: Those labels are ugly

If you used a `LabelEncoder` to create your categories, you can feed the numbers to `le.inverse_transform` to get actual text back.
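A minimal sketch of that round trip, assuming a `LabelEncoder` named `le` (this notebook used a plain 0/1 column instead, so the names here are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are sorted alphabetically, so NO becomes 0 and YES becomes 1
encoded = le.fit_transform(["NO", "YES", "NO", "YES"])

predicted = np.array([1, 0, 0])            # what clf.predict might return
labels = le.inverse_transform(predicted)   # back to text
```

Here `labels` comes back as `YES`, `NO`, `NO`.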

### Step 6.3: Put the category labels back into the original dataframe

In [None]:

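not_categorized.loc[not_categorized['is_yes']==0,'category']="NO"
not_categorized.loc[not_categorized['is_yes']==1,'category']="YES"
# Combine again so df_new includes the predicted labels we just filled in
df_new = pd.concat([categorized, not_categorized])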

## Step 7: What actions were taken against those doctors?

In [None]:
df_new[(df_new.category == 'YES') & (df_new.restrictions.notnull())].restrictions.value_counts()
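Since the original question was how often these doctors lose their license, it may also be worth tallying the `action` column. Here is a rough sketch on made-up rows. The keyword pattern for "lost the license" is an assumption you would want to check against the real action text.

```python
import pandas as pd

# Made-up rows; only the column names match the real data
toy = pd.DataFrame({
    "category": ["YES", "YES", "NO", "YES", "NO"],
    "action": ["License surrender.", "Censure and reprimand.",
               "License surrender.", "Revocation of license.", "Probation."],
})

# What happened to the doctors in your category?
yes_actions = toy[toy.category == "YES"].action.value_counts()

# Rough flag for losing the license (assumed keywords, not an official list)
lost_license = toy.action.str.contains("surrender|revocation", case=False, regex=True)

# Share of each category that lost the license
share_lost = lost_license.groupby(toy.category).mean()
```

Comparing the YES share to the NO share tells you whether your violation is punished more harshly than everything else.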