# Finding surveillance planes using random forests

**The stories:**

- https://www.buzzfeednews.com/article/peteraldhous/spies-in-the-skies
- https://www.buzzfeednews.com/article/peteraldhous/hidden-spy-planes
    
These stories, by Peter Aldhous at BuzzFeed News, involved training a machine learning algorithm to recognize government surveillance planes based on what their flight patterns look like.

**Datasets**

* **feds.csv:** Transponder codes of planes operated by the federal government
* **planes_features.csv:** various features describing each plane's flight patterns
* **train.csv:** a labeled dataset of transponder codes and whether each plane is a surveillance plane or not
    - The `label` column was originally `class`, but I renamed it because pandas freaks out a bit with a column named `class`
    - BuzzFeed created this using `feds.csv`
* **data dictionary:** You can find the data dictionary published with their analysis [here](https://buzzfeednews.github.io/2016-04-federal-surveillance-planes/analysis.html)
* **a few other files**

## What's the goal?

The FBI and Department of Homeland Security operate many planes that are not directly labeled as belonging to the government. If we can uncover these planes, we have a better idea of the surveillance activities they are undertaking.

## Imports

Let's also set a large number of maximum columns.

In [None]:
import pandas as pd

pd.set_option("display.max_columns", 100)

# Read in our data

Almost all classification problems start with a set of labeled features. In this case, the features are in one CSV file and the labels are in another.

**Read in both `planes_features.csv` and `train.csv` and merge them on `adshex`, the transponder code.**

### No wait, merge them again!

We have **features** for about 20,000 planes and **labels** for about 600 planes. This is because we don't know whether many of the planes are surveillance planes or not. By default, a merge only keeps rows you have **both features and labels for**, so all of the unlabeled planes get thrown away.

We want to keep the unlabeled planes in the dataframe so we can play detective with them later, and try to find surveillance planes using the features. When you merge, you should use `how='left'` or `how='right'` to keep unmatched rows from the left (or right) dataframe.

Save this merged dataframe as `df`.

Confirm you have 19,799 rows and 34 columns.
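If the difference between the default merge and `how='left'` feels abstract, here's a minimal sketch on two tiny made-up dataframes. The values and the `speed` column are invented for illustration, they are not from the real CSV files.

```python
import pandas as pd

# Toy stand-ins for planes_features.csv and train.csv (made-up values)
toy_features = pd.DataFrame({
    "adshex": ["A00000", "A00001", "A00002", "A00003"],
    "speed": [120, 95, 210, 180],
})
toy_labels = pd.DataFrame({
    "adshex": ["A00001", "A00003"],
    "label": ["surveil", "other"],
})

# The default merge is an inner join: only planes with both features and labels survive
inner = toy_features.merge(toy_labels, on="adshex")

# how='left' keeps every row from the left dataframe; missing labels become NaN
merged = toy_features.merge(toy_labels, on="adshex", how="left")

print(inner.shape)   # (2, 3)
print(merged.shape)  # (4, 3)
```

The unlabeled planes stick around in `merged` with `NaN` in the `label` column, which is what lets us find them again later.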

In [None]:
df.shape

# Cleaning up our data

## Number-izing our labels

Each row is a plane, and it's marked as either a surveillance plane or not. How many do we have in each category?

How do you feel about that split? Is it balanced enough?

**Let's prepare this column for machine learning.** `"surveil"` and `"other"` won't work with sklearn because they're strings, not numbers. Adjust the `label` column to be something that we can use with sklearn.
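One possible approach, sketched on a made-up column: count the categories with `.value_counts()`, then swap the strings for numbers with `.map()`. Using `1` for surveillance and `0` for everything else is an assumption here, but it matches how the rest of the notebook talks about a prediction of `1`.

```python
import pandas as pd

# Toy label column (made-up); None stands in for an unlabeled plane
toy = pd.DataFrame({"label": ["surveil", "other", "other", None]})

# How many planes are in each category? Unlabeled rows are skipped by default
print(toy.label.value_counts())

# Replace each string with a number; anything not in the dictionary stays missing
toy["label"] = toy.label.map({"surveil": 1, "other": 0})
print(toy)
```

Because of the missing value, pandas stores the new column as floats (`1.0` and `0.0`) instead of integers. That's fine for sklearn once the unlabeled rows are filtered out.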

## Categorical variables

Do we have any variables that count as categories? Yes, we do, `type` of plane! **How many different categories does it have?**

These are **text**, not numbers, so we can't automatically use them in our classifier. But it's a little different than when we were working with actual documents - running this through a TF-IDF vectorizer doesn't seem to make much sense.

Instead, we'll just **make each plane type a number.** For example, `unknown` might be `0` and `C172` might be `1` and `SR22` might be `2`. 

If you want to convert a list of categories into numbers, an easy way is to use the `Categorical` data type.

In [None]:
df.type = df.type.astype('category')
df.type.head()

It looks like a normal bunch of strings, but pandas is secretly using a number for each one! You can find the number with `.cat.codes`.

**We can use `df.type.cat.codes` to make a new column called `type_code`.** 

In [None]:
df['type_code'] = df.type.cat.codes
df[['type', 'type_code']].head(10)

We'll use `type_code` for machine learning since sklearn needs a number, and `type` for reading since we like text.

# Building our classifier

To build a classifier, we need an `X` and a `y`. If we're working with text, we usually just use the `words_df` for the features and whatever our label column is for the `y`.

```python
X = words_df
y = df.label_column
```

This time it's going to be a little bit different, and take a few more steps!

First, since we have **labeled** and **unlabeled data** in the same dataframe, we need to get rid of all of the rows that don't have a `label`. Save this labeled dataset as `train_df`.

Confirm your `train_df` has 597 rows. **And while we're at it, let's look at the first five rows.**
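Here's a rough sketch of the filtering step on a toy dataframe, assuming unlabeled planes show up as `NaN` in the `label` column (which is what a left merge gives you). The rows are made up.

```python
import numpy as np
import pandas as pd

# Toy merged dataframe (made-up): two labeled planes, two unlabeled
toy = pd.DataFrame({
    "adshex": ["A00000", "A00001", "A00002", "A00003"],
    "speed": [120, 95, 210, 180],
    "label": [1, np.nan, 0, np.nan],
})

# Keep only the rows that have a label
toy_train = toy.dropna(subset=["label"])

# Another way to say the same thing
toy_train_alt = toy[toy.label.notna()]

print(toy_train.shape)
toy_train.head()
```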

Before we make our `X` - our set of features that predicts the label - note that we also have a few extra columns that we aren't using to train our classifier:

1. The `adshex` transponder code
2. The text version of the plane type

We'll need to get rid of these in a second!

### Create your `X` and `y`.

Creating your `X` and `y` for non-text-analysis projects looks different than for text-based analysis.

In these non-text situations, we usually have a single dataframe that we want certain columns out of.

* `X` are the columns that we use to make the predictions, our features. Usually we want every single column of data, which means we do `X = train_df.drop(columns=['label_column'])`. This makes sure we aren't including the label in our features.
* `y` is the column that is the label that we are predicting

In this case, though, we want to get rid of more than just the label column. Adjust the code below to also remove the transponder code and plane type.

In [None]:
X = train_df.drop(columns=['label'])
y = train_df.label

Triple-check that `X` is a list of numeric features and `y` is a numeric label.

### Split into train and test datasets

We could be nice and lazy and use all our data for training, but it just isn't right! Taking a test using the exact same questions you studied is just cheating. Split your data into test and train.
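A minimal sketch of the split using sklearn's `train_test_split`, on made-up features. The `toy_` names are placeholders so this doesn't clobber your real `X` and `y`, and `random_state=42` is just an arbitrary choice to make the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and labels (made-up), 20 rows each
toy_X = pd.DataFrame({"speed": range(20), "altitude": range(100, 120)})
toy_y = pd.Series([0, 1] * 10)

# By default, 25% of the rows are held back for testing
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=42
)

print(len(toy_X_train), len(toy_X_test))  # 15 5
```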

# Classify using a logistic classifier

## Train your classifier

Build a `LogisticRegression` and fit it to your data, making sure you're training using only `X_train` and `y_train`.

You can build a Logistic Regression classifier like this:

```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
```
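Fitting follows the same pattern for every sklearn classifier. The sketch below uses synthetic data from `make_classification` instead of the planes, so every variable name in it is a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 5 numeric features, 2 classes
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=0
)

# Same settings as above, then fit on the training portion only
toy_clf = LogisticRegression(C=1e9, solver='lbfgs', max_iter=4000)
toy_clf.fit(toy_X_train, toy_y_train)

# Accuracy on rows the classifier has never seen
print(toy_clf.score(toy_X_test, toy_y_test))
```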

## Explaining our classifier

Let's use eli5 to explain our classifier. What are the important features for detecting a surveillance plane?

```python
import eli5

feature_names = list(X.columns)

eli5.show_weights(clf, feature_names=feature_names)
```

Use the code above.

## How well does our classifier perform?

Let's take a look at the confusion matrix to see how well this classifier finds surveillance planes. Make sure you're using `y_test` and `X_test`, not the full dataset.
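As a reminder of how to read one, here's a sketch of a confusion matrix built from hand-written lists of true and predicted labels. With your real classifier, the two inputs would be something like `y_test` and `clf.predict(X_test)`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = surveillance, 0 = not)
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

matrix = confusion_matrix(y_true, y_pred)

# Wrap it in a dataframe so the rows and columns are labeled
label_names = ['not surveillance', 'surveillance']
matrix_df = pd.DataFrame(
    matrix,
    index=[f"Is {name}" for name in label_names],
    columns=[f"Predicted {name}" for name in label_names],
)
print(matrix_df)
```

Rows are the truth and columns are the predictions, so the top-right cell counts ordinary planes that were wrongly flagged as surveillance and the bottom-left cell counts surveillance planes that were missed.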

# Classify using a decision tree

Now we'll use a decision tree. This is how you make one:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
```

If we use `max_depth=` to limit the depth of the tree, it will help us visualize it. For example, `max_depth=5` means no path through the tree can make more than five decisions in a row.

Make a decision tree and fit it to your data. Use a `max_depth=` of something between 2 and 5.
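A small sketch of what `max_depth` does, again on synthetic stand-in data and not the planes. `random_state=0` is only there so the result is repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 rows, 5 numeric features
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

# Limit the tree to at most three decisions along any path
toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
toy_tree.fit(toy_X, toy_y)

# The fitted tree never goes deeper than the limit
print(toy_tree.get_depth())
```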

## What are the important features?

We'll use slightly different code for a decision tree, as it likes to draw big pictures if we don't stop it. The code looks like this:

```python
import eli5

feature_names=list(X.columns)
eli5.show_weights(clf, feature_names=feature_names, show=['description', 'feature_importances'])
```

### Understanding the output

We'll do this in class.

## How well does the tree perform?

Display another confusion matrix with your new classifier.

## Visualize the tree

You can use `eli5` to visualize the decision tree itself! It usually takes up too much space, but since it's a special occasion we'll let it go. You... might need to install something for this to work? I'm not 100% sure, though.

In [None]:
feature_names=list(X.columns)

label_names = ['not surveillance', 'surveillance']
eli5.show_weights(clf, feature_names=feature_names, target_names=label_names, show=['decision_tree'])

If you'd like your graph to have colors, or to not use eli5, you can do it the old-fashioned way. You might need to `brew install graphviz` and `pip install graphviz`. Windows users will need to download and install from [the graphviz website](https://graphviz.org/download/), and potentially add graphviz to their path.

```python
from sklearn import tree
import graphviz

label_names = ['not surveillance', 'surveillance']
feature_names = list(X.columns)

dot_data = tree.export_graphviz(clf,
                    feature_names=feature_names,  
                    filled=True,
                    class_names=label_names)  
graph = graphviz.Source(dot_data)  
graph
```

* **Tip:** You'll probably need to scroll sideways a bit

# One more classifier: Random forest

## Build and train your classifier

We can build a random forest classifier like this:

```python
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
```

But you're in charge of fitting it to your training data!

* **Tip:** You can also set `max_depth` here if you want.
* **Tip:** Set `n_estimators` to 100 to make a better classifier. Older versions of scikit-learn defaulted to 10 trees; since version 0.22 the default is already 100.
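For reference, fitting a forest looks just like fitting the other classifiers. This sketch uses synthetic stand-in data, and also peeks at the raw `feature_importances_` attribute that sklearn stores on a fitted forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 rows, 5 numeric features
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

# 100 trees; random_state only makes the result repeatable
toy_forest = RandomForestClassifier(n_estimators=100, random_state=0)
toy_forest.fit(toy_X, toy_y)

# One importance score per feature, and they add up to 1
print(toy_forest.feature_importances_)
```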

## What are the important features?

Use eli5 to obtain the feature importances. The best method for a random forest is below.

```python
feature_names = list(X.columns)
eli5.show_weights(clf, feature_names=feature_names, show=['description', 'feature_importances'])
```

### Understanding the output

Random forests are a collection of decision trees. We'll talk about how to understand the feature importance in class!

## How well does it perform?

Use a confusion matrix.

## Which model are you most confident in?

# Actually finding spy planes

Now let's try to actually find our spy planes.

## Retrain our model

When we did test/train split, we trained our model with only a subset of our data, so we could test with the rest. Now that we're working in the "real world" we want to re-train it using not just `_train` and `_test` data, but instead **everything we have labels for.**

That's the whole `X` and `y`.

## Filter for planes we want to predict

We have a dataframe of features that includes three types of planes:

* Those that are labeled as surveillance planes
* Those that are labeled as not surveillance
* Those that aren't labeled

Which do we want to make predictions for? **Filter to create a new dataframe that's just unlabeled planes.** We'll call it `unknown_df`.

How many planes do you have in that list? **Confirm it's about 19,200.**

## Predicting 

Build your `X_unknown` - remember you need to drop a few columns! We only want **numeric features** here.

Now use that to make a prediction for each plane, and **assign the prediction into the `predicted` column of `unknown_df`**.

* **Tip:** Scroll up to see where you created your features for training, it's similar
* **Tip:** pandas will yell at us about setting values on copies of a slice but it's fine
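Here's a sketch of the whole retrain-filter-predict flow on a tiny made-up dataframe. The `speed` and `altitude` columns and all of the values are invented, and the `toy_` names are placeholders. Adding `.copy()` when filtering is one optional way to keep pandas from complaining about setting values on a slice.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy dataframe (made-up): four labeled planes, four unlabeled
toy = pd.DataFrame({
    "adshex": ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8"],
    "type": ["C172", "B737", "C172", "B737", "C172", "B737", "C172", "B737"],
    "speed": [100, 300, 110, 320, 105, 310, 95, 330],
    "altitude": [3000, 30000, 3500, 32000, 3200, 31000, 2800, 33000],
    "label": [1, 0, 1, 0, np.nan, np.nan, np.nan, np.nan],
})
non_feature_cols = ["label", "adshex", "type"]

# Retrain on everything we have labels for
toy_labeled = toy.dropna(subset=["label"])
toy_clf = RandomForestClassifier(n_estimators=50, random_state=0)
toy_clf.fit(toy_labeled.drop(columns=non_feature_cols), toy_labeled.label)

# Filter for the unlabeled planes; .copy() gives us our own dataframe to edit
toy_unknown = toy[toy.label.isna()].copy()

# Predict using the same feature columns we trained on
toy_X_unknown = toy_unknown.drop(columns=non_feature_cols)
toy_unknown["predicted"] = toy_clf.predict(toy_X_unknown)
print(toy_unknown)
```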

## How many planes did it predict to be surveillance planes?

It should be roughly around 70-80 planes.

## But.. what about those other ones? The ones that are just below the threshold?

The cutoff for a prediction of `1` is 50%, but since we have a lot of time we're interested in investigating the top 200.

To get the probability for each row, you will use `clf.predict_proba` instead of `clf.predict`. Also, to get the predicted probability for the `1` category, you'll need to add `[:,1]`, which gives us _something like_ this:

```python
clf.predict_proba(unknown_df.drop(columns=['label', 'adshex', 'type']))[:,1]
```

**Create a new column called `predicted_prob` that is the chance that the plane is a surveillance plane.**

* **Tip:** You dropped three columns when using `clf.predict`, but if you drop the same three (e.g. you cut and pasted the code above) you'll get an error now. There's now an extra column that you'll need to drop! What is it?

### Get the top 200 predictions

Take a look at what the probabilities look like, showing the top 200 planes that are **most likely to be surveillance planes.**

Then save them to a file for later research.

Save these top 200 to a CSV named `planes-to-research.csv`.
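Ranking and saving could look something like the sketch below. Everything in it is made up: the dataframes, the `speed` and `altitude` columns, a top 3 in place of a top 200, and a temporary folder so the example doesn't leave files lying around. Picking the feature columns by name is one way to make sure only the training features reach the classifier.

```python
import os
import tempfile

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy labeled and unlabeled planes (made-up values)
toy_labeled = pd.DataFrame({
    "speed": [100, 300, 110, 320, 120, 310],
    "altitude": [3000, 30000, 3500, 32000, 2900, 31000],
    "label": [1, 0, 1, 0, 1, 0],
})
toy_unknown = pd.DataFrame({
    "adshex": ["a1", "a2", "a3", "a4", "a5"],
    "speed": [105, 310, 95, 200, 330],
    "altitude": [3200, 31000, 2800, 15000, 33000],
})
feature_cols = ["speed", "altitude"]

toy_clf = RandomForestClassifier(n_estimators=50, random_state=0)
toy_clf.fit(toy_labeled[feature_cols], toy_labeled.label)

# Column 1 of predict_proba is the probability of the `1` category
toy_unknown["predicted_prob"] = toy_clf.predict_proba(toy_unknown[feature_cols])[:, 1]

# Most likely surveillance planes first, then keep the top few
top = toy_unknown.sort_values("predicted_prob", ascending=False).head(3)

# Save without the index, then read it back to check
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy-planes-to-research.csv")
    top.to_csv(out_path, index=False)
    saved = pd.read_csv(out_path)

print(saved)
```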

# Questions

Using words and not column names, describe what the machine learning algorithm found to be important when identifying surveillance planes.

Why did we use test/train split when it would have been more effective to give our model all of the data from the start?

Why did we use a random forest instead of a decision tree or logistic regression?

Why did we use probability instead of just looking for planes with a predicted value of 1? It seems like we should have just trusted the algorithm, right?

The government could claim that we're threatening national security by publishing this paper as well as publishing this code - now anyone could look for planes that are surveilling them. What do you think?

We're using data from the past, but you can get real-time flight data from many services. Can you think of any uses for this algorithm using real-time instead of historical data?