# TSA Claims Classification (Approve / Settle / Deny)  
## (Part 1)

## Overview
In this kernel, we will work toward building a model to predict whether a TSA claim is approved, settled, or denied. This is mostly a practice exercise, but it could have some very neat real-world uses!   
  
For example, a travel insurance company could use it to help predict losses or adjust reimbursement. The TSA could also use a predictive model to help triage incoming claims, depending on the predicted class and the confidence of the prediction. 

Note: In all of these use cases, it would be ideal if we could predict the result using only information that is known before the claim is filed. Certain fields in the data set (like Disposition and Close Amount) are generated after the fact, so we will exclude them from the modeling analysis. 

## Workflow
We can do this task in 3 main phases: data cleaning, feature engineering, and modeling.
* Data Cleaning: Our data is quite messy! We need to take out the outliers, account for missing values, and standardize formats. 
* Feature engineering: Our data (after cleaning) is an array of text columns! We need to do some engineering to turn the text into useful values for modeling. 
* Modeling: Finally, we can run some different models and see how accurate we can get our predictions.

The workflow is rather lengthy, so I'll break it into two parts. Data cleaning is covered in this notebook, and feature engineering / modeling are covered [>here<](https://www.kaggle.com/perrychu/tsa-claims-classification-part-2?).

# Phase One: Data Cleaning
Let's load up some necessary packages to start.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

## Import Data
First, let's load the data file. 

I'm renaming a few columns so we can preserve the original data while we do our feature engineering.

We have ~200k rows to start.

In [None]:
#Read Kaggle file
df = pd.read_csv("../input/tsa_claims.csv",low_memory=False)

#Format columns nicely for dataframe index
df.columns = [s.strip().replace(" ","_") for s in df.columns]

#Rename date columns
df["Date_Received_String"] = df.Date_Received
df["Incident_Date_String"] = df.Incident_Date
df.drop(["Date_Received","Incident_Date"], axis=1, inplace=True)

print("Rows:", len(df))

## Check Nulls
Let's look at nulls by row and column.

### Rows:
About half our rows have at least one null value. A few nulls can be ok (depending on which columns they are in). However, about 2k rows have 6+ nulls. That is more indicative of a problem and might be something we want to fix.

In [None]:
# Check distribution of nulls per row
temp = df.isnull().sum(axis=1).value_counts().sort_index()
print("Nulls    Rows     Cum. Rows")
for i in range(len(temp)):
    # Positional access (.iloc): the index holds null counts, which may not be contiguous
    print("{:2d}: {:10d} {:10d}".format(temp.index[i], temp.iloc[i], temp.iloc[i:].sum()))

### Columns:
Disposition and Close Amount have a ton of nulls. That's ok since they are after-the-fact data so we are going to exclude them from the analysis anyway. There is a minor issue here... after digging into the source data, it appears that these values are being used as proxies for "Status" and "Claim Amount" in the later years (~2013-2015) when TSA stops providing "Status" and "Claim Amount". There isn't an easy workaround, and there's relatively less data in those years... so we're going to ignore this for now.

Next, Airline, Airport, Claim Type, Claim Site, and Item have quite a few nulls. However, these are categorical values, so we can treat the nulls as "Other" or "Missing". Basically we will treat the nulls as being in their own category - knowing that data is missing can also be useful information.

Next, Claim Amount, Incident Date, and Date Received are numerical values that have some nulls. We most likely need to throw these values away since there isn't a straightforward way to assign them a value.

Finally, Status has 5 nulls. We have to discard those rows, since that's the variable we are trying to predict. 

In [None]:
# Check distribution of nulls per column
df.isnull().sum().sort_values(ascending=False)

In [None]:
#Drop rows with too many nulls (thresh=6 keeps only rows with at least 6 non-null values)
df.dropna(thresh=6, inplace=True)

#Fill NA for categorical columns
fill_columns = ["Airline_Name","Airport_Name","Airport_Code","Claim_Type","Claim_Site","Item"]
df[fill_columns] = df[fill_columns].fillna("-")

#Set NA Claim Amount to 0. Zeros are dropped later in the code.
df["Claim_Amount"] = df.Claim_Amount.fillna("$0.00")

#Dropping these nulls later on:
#  Incident Date / Date Received
#  Status

print(len(df))
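
The `thresh` argument of `dropna` is easy to misread: it counts non-null values to keep, not nulls to drop. Here is a minimal sketch on a made-up four-column frame (the column names and values are invented for illustration) showing that `thresh=2` is equivalent to allowing at most two nulls per row.

```python
import numpy as np
import pandas as pd

# Toy frame with 4 columns; values are made up for illustration.
toy = pd.DataFrame({
    "a": [1, np.nan, np.nan],
    "b": [2, 2, np.nan],
    "c": [3, np.nan, np.nan],
    "d": [4, 4, 4],
})

# thresh=2 keeps rows that have at least 2 non-null values.
kept = toy.dropna(thresh=2)

# Equivalent explicit form: count nulls per row and compare to a maximum.
max_nulls = toy.shape[1] - 2
kept_explicit = toy[toy.isnull().sum(axis=1) <= max_nulls]

print(kept.index.tolist())
```

The explicit null-count form can be easier to reason about when the rule is stated in terms of "too many nulls".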

## Dependent (Target) Variable

Now, let's look at the target variable we will try to predict: Claim Status.

There is quite a range of values. Some are inconsistent spellings of the cases we want to predict (Denied vs. Deny), while others are unsettled claims (-, Insufficient, In litigation).

In [None]:
df.Status.str.split(";").map(lambda x: "Null" if type(x)==float else x[0]).value_counts()

Let's collapse the inconsistent spellings and remove the non-final statuses.

In [None]:
valid_targets = ['Denied','Approved','Deny','Settled','Approve in Full', 'Settle']

#Keep final statuses only; copy so later column assignments don't act on a slice of the original
df = df[df.Status.isin(valid_targets)].copy()

#Collapse inconsistent spellings (assign back rather than replacing in place on a column)
df["Status"] = df.Status.replace({"Approve in Full":"Approved","Deny":"Denied","Settle":"Settled"})

print(df.Status.value_counts())
print(len(df))

## Independent (Feature) Variables

### Date Received
Date Received is formatted consistently, but some of the dates are entered incorrectly. We know from the data source that we should only have records from 2002 to 2014. Let's drop the ones that don't fall in this range.

In [None]:
#Drop nulls
df.dropna(subset=["Date_Received_String"], inplace=True)

#Format datetime
df["Date_Received"] = pd.to_datetime(df.Date_Received_String,format="%d-%b-%y")

#Check year range
df = df[df.Date_Received.dt.year.isin(range(2002,2014+1))]

print(df.Date_Received.dt.year.value_counts().sort_index())
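
As a small illustration of the parse-then-filter pattern, here is a sketch on made-up date strings. The `errors="coerce"` argument is an optional variation, not something the cell above uses: it turns unparseable strings into `NaT` instead of raising. Note that `%y` maps two-digit years 69-99 to the 1900s and 00-68 to the 2000s.

```python
import pandas as pd

# Made-up example strings in the same day-month-year layout.
raw = pd.Series(["5-Jan-04", "15-Mar-14", "20-Feb-94", "not a date"])

# Parse with an explicit format; unparseable entries become NaT.
parsed = pd.to_datetime(raw, format="%d-%b-%y", errors="coerce")

# Keep only dates whose year falls in the expected 2002-2014 range.
in_range = parsed.dt.year.between(2002, 2014)
clean = parsed[in_range]

print(clean)
```

The year check also drops the `NaT` row, since a missing year never falls inside the range.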

### Incident Date
There are also inconsistencies with Incident Date formats. Some are formatted as 20/jan/09 while others are 01/20/09. Let's make a function to standardize the format so we can convert the strings to DateTime objects.

In [None]:
month_dict = {"jan":1,"feb":2,"mar":3,"apr":4,"may":5,"jun":6,"jul":7,"aug":8,"sep":9,"oct":10,"nov":11,"dec":12}

def format_dates(regex, date_string):
    '''
    Formats the date string from 2014 entries to be consistent with the rest of the doc 
    Inputs: 
        regex - compiled re with three groups corresponding to {day}/{month (abbrev.)}/{Year}
        date_string - string to be formatted matching the regex
    Outputs: 
        If regex match, return formatted string of form {Month}/{Day}/{Year}; else return original string
    '''
    m = regex.match(date_string)
    if(m):
        day, month, year = m.group(1,2,3)
        return "{}/{}/{}".format(month_dict[month],day,"20"+year)
    else:
        return date_string
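
To see what this kind of standardization does, here is a self-contained sketch on a few made-up strings. The helper name `standardize` and the sample inputs are hypothetical; the pattern mirrors the day/month-abbreviation/two-digit-year shape described in the docstring.

```python
import re

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Three groups: day, month abbreviation, two-digit year (11-14).
pattern = re.compile(r"(\d+)/([a-z]{3})/(1[1-4])$")

def standardize(date_string):
    """Return month/day/year if the string matches the pattern, else the input unchanged."""
    m = pattern.match(date_string)
    if m:
        day, month, year = m.group(1, 2, 3)
        return "{}/{}/20{}".format(MONTHS[month], day, year)
    return date_string

examples = ["20/jan/14", "01/20/2009", "3/feb/12"]
converted = [standardize(s) for s in examples]
print(converted)  # ['1/20/2014', '01/20/2009', '2/3/2012']
```

Strings that are already in month/day/year form do not match the pattern, so they pass through untouched.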
        

Incident Date includes both date and time. We'll separate out the time before formatting the dates. Unfortunately, many rows are missing the time component, so we end up not being able to use that piece. There are a few other inconsistencies we should fix. Then, we can finally apply the formatting function we wrote above.

In [None]:
#Drop nulls
df.dropna(subset=["Incident_Date_String"], inplace=True)

#Error correction for one value in Kaggle data set (looked up in original TSA data)
#Assign back rather than replacing in place on a column
df["Incident_Date_String"] = df.Incident_Date_String.replace("6/30/10","06/30/2010 16:30")

#String formatting for consistency
df["Incident_Date_String"] = df.Incident_Date_String.str.replace("-","/")
df["Incident_Date_String"] = df.Incident_Date_String.str.lower()

#Splitting up time (if exists otherwise will be date) and date components
df["Incident_Time"] = df.Incident_Date_String.str.split(" ").map(lambda x: x[-1])
df["Incident_Date"] = df.Incident_Date_String.str.split(" ").map(lambda x: x[0])

#Could not find a reasonable translation for these entries... most look like "02##"
regex = re.compile(r"/[a-z]{3}/[0-9]{4}")
df = df[df.Incident_Date.map(lambda x: not bool(regex.search(x)))].sort_values(["Date_Received"])

#These are entries received in 2014. Formatting is different from other years but internally consistent.
regex = re.compile(r"(\d*)/([a-z]{3})/(1[1-4])$")
df["Incident_Date"] = df.Incident_Date.map(lambda x: format_dates(regex,x) )
#df[df.Incident_Date.map(lambda x: bool(regex.search(x)))].sort_values(["Date_Received"])

#Format datetime, check year range, create year and month
df["Incident_Date"] = pd.to_datetime(df.Incident_Date,format="%m/%d/%Y")
df = df[df.Incident_Date.dt.year.isin(range(2002,2014+1))]

print(df.Incident_Date.dt.year.value_counts().sort_index())
print(len(df))

### Airport Code / Name
First, we notice some airport codes have multiple distinct airport names. It turns out this is just from excess whitespace, which is an easy fix.

In [None]:
#Check multiple Airport Names assigned to one Airport Code
temp = df.groupby("Airport_Code").Airport_Name.nunique().sort_values(ascending=False)
print(df[df.Airport_Code.isin(temp[temp>1].index)].groupby("Airport_Code").Airport_Name.unique().head())
print("\n---\n")

#Duplicates are from excess spaces
df["Airport_Code"] = df.Airport_Code.str.strip()
df["Airport_Name"] = df.Airport_Name.str.strip()

#Check multiple Airport Names assigned to one Airport Code
temp = df.groupby("Airport_Code").Airport_Name.nunique().sort_values(ascending=False)
print(df[df.Airport_Code.isin(temp[temp>1].index)].groupby("Airport_Code").Airport_Name.unique().head())
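
The effect of stray whitespace on a uniqueness check can be reproduced on a tiny made-up frame. The rows below are invented for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy rows: the same airport name appears with and without trailing spaces.
toy = pd.DataFrame({
    "Airport_Code": ["LAX", "LAX", "JFK"],
    "Airport_Name": ["Los Angeles International",
                     "Los Angeles International  ",
                     "John F. Kennedy International"],
})

# Before stripping, LAX appears to have two distinct names.
before = toy.groupby("Airport_Code").Airport_Name.nunique()

# After stripping, each code maps to exactly one name.
toy["Airport_Name"] = toy.Airport_Name.str.strip()
after = toy.groupby("Airport_Code").Airport_Name.nunique()

print(before)
print(after)
```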


Next, let's consolidate the airports that show up very rarely in the claims. This reduces the dimensionality of our data.

From looking at the distribution, I'm grouping airports that have fewer than 200 claims (roughly 0.1% of the data or less each).

In [None]:
#Look at tail distribution of claims by airport
temp = df.Airport_Code.value_counts()
print("Total: {} airports, {} complaints".format(temp.count(),temp.sum()))
for num in range(1000,1,-100):
    print("Under {}: {} airports, {} complaints".format(num, temp[temp<num].count(),temp[temp<num].sum()))

level = 200
#plot distribution below level
#temp[temp<level].count(), temp[temp<level].sum()
#temp[temp<level].plot.bar()

#Set airport and code to "Other" under level
#Vectorized: build a boolean mask once instead of applying a function to every row
keep_set = set(temp[temp>=level].index)
keep_mask = df.Airport_Code.isin(keep_set)
df["Airport_Code_Group"] = df.Airport_Code.where(keep_mask, "Other")
df["Airport_Name_Group"] = df.Airport_Name.where(keep_mask, "Other")
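
To make the grouping rule concrete, here is a minimal sketch on toy data using `value_counts`, `isin`, and `where`. The codes and the threshold of 2 are made up for illustration; the notebook's threshold is 200.

```python
import pandas as pd

# Toy airport codes; "XYZ" appears only once.
codes = pd.Series(["LAX", "LAX", "LAX", "JFK", "JFK", "XYZ"])
min_count = 2  # hypothetical threshold for this toy example

# Count claims per code and keep the codes at or above the threshold.
counts = codes.value_counts()
keep = counts[counts >= min_count].index

# Keep the original code where it is frequent enough, else label it "Other".
grouped = codes.where(codes.isin(keep), "Other")

print(grouped.tolist())  # ['LAX', 'LAX', 'LAX', 'JFK', 'JFK', 'Other']
```

Folding the long tail into a single category keeps the number of dummy columns manageable when the feature is later encoded.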

### Airline Name
For airlines, let's consolidate names that seem to be subsidiary/parent or have inconsistent naming.

In [None]:
df["Airline_Name"] = df.Airline_Name.str.strip().str.replace(" ","")

#Map subsidiary / inconsistently named airlines to a single name
airline_map = {
    "AmericanEagle":"AmericanAirlines",
    "AmericanWest":"AmericaWest",
    "AirTranAirlines(donotuse)":"AirTranAirlines",
    "AeroflotRussianInternational":"AeroFlot",
    "ContinentalExpressInc":"ContinentalAirlines",
    "Delta(Song)":"DeltaAirLines",
    "FrontierAviationInc":"FrontierAirlines",
    "NorthwestInternationalAirwaysLtd":"NorthwestAirlines",
    "SkywestAirlinesAustralia":"SkywestAirlinesIncUSA",
}
#Assign back rather than replacing in place on a column
df["Airline_Name"] = df.Airline_Name.replace(airline_map)

#Print explicitly; a bare expression is only displayed when it is the last line of a cell
print(df.Airline_Name.value_counts().head(10))
print(len(df))

### Claim Type, Claim Site
Only a few values each here. We can leave them as-is.

In [None]:
print(df.Claim_Type.value_counts())
print(df.Claim_Site.value_counts())

### Item
Item is a tricky field - it is actually a comma and semi-colon separated list of categories but the naming schemes vary over time. For this analysis, we're going to ignore this data. It would be an interesting future exercise to extract a feature from this column.

In [None]:
#Isolating broadest item categories
#Items column is a text list of all item categories. Sub categories are inconsistent across years.
df_item = df.Item.str.split("-").map(lambda x: "" if type(x) == float else x[0])
df_item = df_item.str.split(r" \(").map(lambda x: x[0])
df_item = df_item.str.split(r" &").map(lambda x: x[0])
df_item = df_item.str.split(r"; ").map(lambda x: x[0])
df_item = df_item.str.strip()

categories = df_item.value_counts()

#categories[[not bool(re.compile(";").search(x)) for x in categories.index]][0:]

categories[categories > 100]
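
As a starting point for that future exercise, here is a rough sketch of one possible approach: turning a delimited list into indicator columns with `Series.str.get_dummies`. The item strings below are made up, and the sketch assumes a clean semicolon delimiter, which the real column does not guarantee.

```python
import pandas as pd

# Made-up item lists separated by semicolons.
items = pd.Series(["Clothing; Jewelry", "Cameras", "Jewelry"])

# Normalize the delimiter (and surrounding spaces) to a single "|" character.
normalized = items.str.replace(r"\s*;\s*", "|", regex=True)

# One indicator column per distinct category.
indicators = normalized.str.get_dummies(sep="|")

print(indicators)
```

On the real data, the category names would first need to be reconciled across years before indicators like these become useful.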

### Claim Amount
Claim amount is right skewed: there is a lot of data at the lower end, then fewer and fewer claims at higher values... all the way up to a single claim at \$12 billion! However, there is also an anomaly: a bunch of values at \$0...

You might remember that we set some of those zero values when looking at nulls. The others appear to come from the dataset using "Close Amount" as a proxy for "Claim Amount" for years where "Claim Amount" is missing. When a claim is denied, the "Close Amount" is zero. Unfortunately, there isn't any way to infer what the actual claim amount was, so we will have to drop those rows.

In [None]:
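df["Claim_Amount"] = df.Claim_Amount.str.strip()
#regex=False treats each pattern as a literal string ("$" is otherwise a regex end-of-string anchor)
df["Claim_Amount"] = df.Claim_Amount.str.replace(";","",regex=False).str.replace("$","",regex=False).str.replace("-","0",regex=False)
df["Claim_Value"] = df.Claim_Amount.astype(float)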

df_copy = df.copy()

print(df.Claim_Value.describe())
print(df.Status.value_counts())
print(len(df))

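#histplot replaces the deprecated distplot; stat="density" with kde=True keeps a similar look
sns.histplot(df.Claim_Value[(df.Claim_Value>0)&(df.Claim_Value<500)], stat="density", kde=True)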

df.Status[(df.Claim_Value>0)&(df.Claim_Value<1000)].value_counts()
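
The currency clean-up can be checked on a few made-up strings. This sketch assumes, as the replacements in the cell above suggest, that `;` acts as a thousands separator and that a lone `-` stands for zero. Passing `regex=False` makes each pattern a literal string, which matters for `$` because it is a special character in regular expressions.

```python
import pandas as pd

# Made-up raw amounts in the same style as the Claim Amount column.
raw = pd.Series(["$1;250.00 ", "$75.50", "-", "$0.00"])

cleaned = (
    raw.str.strip()
       .str.replace(";", "", regex=False)   # drop the separator
       .str.replace("$", "", regex=False)   # drop the literal dollar sign
       .str.replace("-", "0", regex=False)  # treat a lone dash as zero
)
values = cleaned.astype(float)

print(values.tolist())  # [1250.0, 75.5, 0.0, 0.0]
```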

One interesting way to visualize the data is to look at the distribution of claim value for each target class (approve / settle / deny). We can plot this as a histogram. Since the data is skewed, we will use a log scale for the x-axis bins. 

Under these conditions, it turns out that each target class (roughly) has the shape of a normal distribution, centered at a different mean. However, since our x-axis bins are log-scale, these aren't actual normal distributions.

We can also see the cluster of claims with claim value = 0.

In [None]:
bins = [round(10**x) for x in (list(np.arange(0,4.1,.4))+[10])]

bottom = -1

data = []

for x,top in enumerate(bins):
    counts = df.Status[(df.Claim_Value>bottom)&(df.Claim_Value<=top)].value_counts()
    #Iterate label/count pairs directly instead of indexing a label-indexed Series by position
    for label,count in counts.items():
        data.append({"bin":(str(x)+":"+str(top)),"label":label,"count":count})
    bottom = top

counts_df = pd.DataFrame(data)

#catplot replaces the older factorplot; height replaces the older size argument
sns.catplot(x="bin",y="count",hue="label",data=counts_df,kind="bar",height=10)
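
The same kind of log-scale binning can be sketched with `pd.cut` and `pd.crosstab`. This is an alternative formulation on made-up values, not the notebook's exact bins: the edges here are plain powers of ten, with a leading -1 so that zero-valued claims fall in the first bin.

```python
import pandas as pd

# Made-up claim values and statuses for illustration.
values = pd.Series([0, 5, 40, 300, 2500, 90000])
status = pd.Series(["Denied", "Approved", "Approved", "Settled", "Denied", "Denied"])

# Bin edges: -1, 1, 10, 100, 1000, 10000, 100000.
edges = [-1] + [10**k for k in range(0, 6)]

# Intervals are right-inclusive by default: (-1, 1], (1, 10], ...
binned = pd.cut(values, bins=edges)

# Count statuses per bin.
table = pd.crosstab(binned, status)

print(table)
```

Each bin spans one order of magnitude, so equal-width bars on the plot correspond to multiplicative, not additive, steps in claim value.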

OK, now let's actually remove the zero values. 

In [None]:
df = df[df.Claim_Value != 0]

print(df.Claim_Value.describe())
print(df.Status.value_counts())
print(len(df))

### Close Amount
Note: I also looked at the relationship between Close Amount and Claim Amount. I wasn't able to link the two in a way that made sense, but I'll leave these plots here in case they are interesting. Remember that we aren't using Close Amount because it is only known after the claim is settled, and we would prefer to make our prediction before the claim is submitted.

In [None]:
df["Close_Amount"] = df.Close_Amount.str.strip()
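#regex=False treats each pattern as a literal string ("$" is otherwise a regex end-of-string anchor)
df["Close_Amount"] = df.Close_Amount.str.replace(";","",regex=False).str.replace("$","",regex=False)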
df["Close_Value"] = df.Close_Amount.astype(float)
df.Close_Value.describe()

In [None]:
plot_df = df[(df.Claim_Value < 200000) & (df.Close_Value <= 500000)]

plt.scatter(plot_df.Claim_Value,plot_df.Close_Value,alpha=.2)
plt.title("Combined")
plt.xlabel("Claim value")
plt.ylabel("Close value")
plt.show()

fig,ax = plt.subplots(1,3)
fig.set_size_inches(16,4)

for i,s in enumerate(plot_df.Status.unique()):
    ax[i].scatter(plot_df[plot_df.Status==s].Claim_Value,plot_df[plot_df.Status==s].Close_Value,alpha=.2)
    ax[i].set_title(s)


## Cleaning Complete!
Now we have our clean data, so we can dig into modeling. Let's write the data out to a file for future use.

In [None]:
output_df = df.drop(["Close_Amount", "Claim_Amount", "Disposition",
                     "Date_Received_String","Incident_Date_String","Incident_Time",
                     "Airport_Code","Airport_Name"],axis=1)

output_df.to_csv("tsa_claims_clean.csv",index=False)

output_df.head(5)