## Simple example kernel to get started with the data
* I have already cleaned up the possible targets, though you may wish to remove the "classes" that have fewer than 10 cases.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from scipy import sparse
# Warning: label encoding imposes an arbitrary numeric order on the categories ("silly" range features).
# One-hot encoding (sklearn's OneHotEncoder or pandas' get_dummies()) avoids that.
from sklearn.preprocessing import LabelEncoder as LE
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
path = '../input/'
filename = 'FederalAirMarshalMisconduct.csv'

# infer_datetime_format is deprecated in recent pandas versions; parse_dates alone is enough
df = pd.read_csv(path + filename, parse_dates=["Date Case Opened"])

In [None]:
df.head()

In [None]:
df.columns


Whoops, it looks like some of the column names have trailing spaces :( !  
Let's fix that!

* We could also remove the spaces inside the column names to make them easier to work with, but we can skip that. It's not as critical as 'deceptive' column names with hidden spaces ;)
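
A more general way to catch every such column at once is to strip the whitespace from all the names. A minimal sketch on a small made-up frame (the cell values are placeholders, not rows from the dataset):

```python
import pandas as pd

# Hypothetical frame mimicking the trailing-space problem in the column names
demo = pd.DataFrame({"Allegation ": ["a"], "Field Office ": ["b"], "Final Disposition": ["c"]})

# Strip leading/trailing whitespace from every column name in one go
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```

This leaves the spaces inside names such as `Field Office` untouched, so it would need an explicit rename afterwards to get `FieldOffice`.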


In [None]:
df.rename(columns={"Allegation ":"Allegation", "Field Office ":'FieldOffice'},inplace=True)
print(df.columns)

Let's look at the charges (_Allegation_) against the marshals, and the trial outcomes (_Final Disposition_ / _target_).

In [None]:
df["Allegation"].value_counts()

In [None]:
df['Final Disposition'].value_counts()

After lowercasing and removing every " - X day[s]" suffix, we have far fewer possible outcomes:
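
The cleaning step itself is not part of this kernel. As a rough sketch of what such a normalization might look like, assuming the suffix always has the form " - N day(s)" (the sample strings and the regex below are assumptions, not the code used to build `target`):

```python
import pandas as pd

# Made-up disposition strings; the real values in the CSV may be formatted differently
raw = pd.Series(["Suspension - 3 Days", "Suspension - 1 Day", "Letter of Reprimand"])

# Lowercase, then drop a trailing " - X day[s]" suffix
cleaned = (
    raw.str.lower()
       .str.replace(r"\s*-\s*\d+\s*days?\s*$", "", regex=True)
       .str.strip()
)
print(cleaned.value_counts())
```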

In [None]:
df['target'].value_counts()

In [None]:
majority_count = df['target'].value_counts().iloc[0]  # size of the most frequent class
print("A naive majority classifier would get: %.4f Accuracy" % (majority_count / df.shape[0]))

Classes with just 1 case aren't relevant. We'll remove the cases whose target/outcome has fewer than 30 occurrences.
 * This leaves us with about 8 classes. Letter of counsel and verbal counsel might be similar, but I'm unsure, so we'll leave them as separate classes.
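
The cell below drops the least frequent classes by position. An alternative is to filter on the count threshold directly, which keeps working if the class counts change. A small sketch on toy data (the labels, the counts and the `min_count` name are placeholders):

```python
import pandas as pd

# Toy target column; the labels and counts are placeholders
toy = pd.DataFrame({"target": ["a"] * 5 + ["b"] * 3 + ["c"] * 1})

min_count = 3  # hypothetical threshold; the text above uses 30 on the real data
counts = toy["target"].value_counts()
frequent = counts[counts >= min_count].index

# Keep only the rows whose class is frequent enough
filtered = toy.loc[toy["target"].isin(frequent)]
print(filtered["target"].value_counts())
```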

In [None]:
least_frequent_classes = df['target'].value_counts().tail(8).index

In [None]:
print(df.shape)
df = df.loc[~df.target.isin(least_frequent_classes)]
df.shape[0]

In [None]:
print("Null counts per column (including target):")
print(df.isnull().sum())
# df.dropna(subset=["target"], inplace=True) should also work. The earlier attempt most likely
# failed because it passed axis=1, which makes pandas look for "target" among the row labels.
df = df.loc[df.target.notnull()]
print("After cleaning:", df.isnull().sum())

I already analyzed the data externally: there are strong features based on the data, notably time ranges and years, as they relate to some of the final dispositions (notably retirement). 
* Here, let's get just some simple datetime features.

In [None]:
df["Year"] = df['Date Case Opened'].dt.year
df["Month"] = df['Date Case Opened'].dt.month

In [None]:
df.head()

## Let's drop the columns we don't want, and keep just the subset for predictive model building: 

In [None]:
df = df[[ 'FieldOffice', 'Allegation', 'target', 'Year', 'Month']]

### Encode the categorical feature of Office
* We can one-hot encode (OHE) it with pandas' get_dummies, or label-encode it (which saves on space/columns but can give silly features involving range).
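
To make the trade-off concrete, here is a small sketch of both encodings on placeholder office names (not taken from the dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Placeholder office names, not taken from the dataset
offices = pd.Series(["Boston", "Atlanta", "Chicago", "Atlanta"], name="FieldOffice")

# Label encoding: a single integer column; codes follow the sorted order of the labels
codes = LabelEncoder().fit_transform(offices)
print(codes)

# One-hot encoding: one indicator column per office, with no implied order
dummies = pd.get_dummies(offices, prefix="FieldOffice")
print(dummies.columns.tolist())
```

With label encoding a tree can split on something like "code < 2", which groups offices purely by alphabetical position. That is the "silly range feature" mentioned above.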

In [None]:
# One-hot encode the FieldOffice column:
df = pd.get_dummies(df, columns=["FieldOffice"])

# Alternative: label encoding (LE is the class, so it must be instantiated first)
# df["FieldOffice"] = LE().fit_transform(df["FieldOffice"])

In [None]:
df.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop("target",axis=1), df.target, random_state=42)

In [None]:
# Bag-of-words (raw count) features on the allegation text
vectorizer = CountVectorizer(stop_words='english', max_features=200, min_df=3, ngram_range=(1, 2))
tr_sparse = vectorizer.fit_transform(X_train["Allegation"])
te_sparse = vectorizer.transform(X_test["Allegation"])

In [None]:
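# Cast the remaining (numeric and dummy) columns to float so they stack cleanly with the sparse text features
X_train = sparse.hstack([sparse.csr_matrix(X_train.drop("Allegation", axis=1).astype(float).values), tr_sparse]).tocsr()
X_test = sparse.hstack([sparse.csr_matrix(X_test.drop("Allegation", axis=1).astype(float).values), te_sparse]).tocsr()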

In [None]:
fmodel = RandomForestClassifier(
    n_estimators=400,
    random_state=42,
    max_depth=9,
    max_features=30,
    class_weight="balanced",
).fit(X_train, y_train)
prediction = fmodel.predict(X_test)

In [None]:
# The data is highly imbalanced, so accuracy alone is not very meaningful.
# We could look at the AUC instead, but it's trickier to define for multiclass, so we'll leave it for now:
# score = roc_auc_score(y_test, prediction)
# print("AUC on test set: %.2f" % score)

acc_score = accuracy_score(y_test, prediction)
print("Accuracy on test set: %.2f%%" % (100. * acc_score))

In [None]:
print(classification_report(y_test, prediction))
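
If you do want a multiclass AUC, one common option is a one-vs-rest macro average computed from predicted probabilities rather than hard labels. A minimal sketch on synthetic data (the dataset and model settings here are stand-ins, not the kernel's actual features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the real features
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Multiclass AUC needs per-class probabilities, not hard predictions
proba = clf.predict_proba(X_te)
auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print("One-vs-rest macro AUC: %.3f" % auc)
```

On the real data this only works if every class the model was trained on also appears in `y_test`, so a stratified split would probably be needed for the rare classes.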

### Further work:
* word cloud and top features per class