In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

### Read in the data and take a peek

In [None]:
data = pd.read_csv('../input/austin_waste_and_diversion.csv', parse_dates=[2, 5])
data.head()

Dropping the `report_date` column because in some cases it lags behind the `load_time` by one or more days. Perhaps this is because loads after 5 PM don't actually get reported until the next business day? What about the ones that lag by many days...weekends maybe? Special events? Are some types of loads staged until there's enough to deliver?

In any case it's not a useful field so I'm dropping it for now.
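
Before dropping the column, the size of the lag can be measured directly. This is a minimal sketch, assuming both columns parse as datetimes; the `report_lag_days` helper and the toy frame are hypothetical stand-ins for `data`:

```python
import pandas as pd

def report_lag_days(df):
    """Whole days between the load date and the report date (hypothetical helper)."""
    # Normalize load_time to midnight so only calendar days are compared
    return (df['report_date'] - df['load_time'].dt.normalize()).dt.days

# Toy stand-in for `data`; the dates are made up for illustration
toy = pd.DataFrame({
    'report_date': pd.to_datetime(['2016-01-04', '2016-01-05', '2016-01-11']),
    'load_time': pd.to_datetime(['2016-01-04 09:30', '2016-01-04 18:10', '2016-01-08 14:00']),
})
lag = report_lag_days(toy)

# How many loads fall into each lag bucket
print(lag.value_counts().sort_index())
```

Running the same helper on `data` and grouping the result by `load_type` or by weekday might hint at which of the guesses above holds.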

In [None]:
data.drop('report_date', axis=1, inplace=True)

## Let's look at the number of each type of collection for the dataset
*keep in mind we are only looking at the count for now, not the actual weight*

In [None]:
plt.figure(figsize=(12,6))
load_type_counts = data['load_type'].value_counts()
ax1 = load_type_counts.plot(kind='bar')
# Label each bar with its load type and count
labels = ['{} - ({})'.format(label, count) for label, count in load_type_counts.items()]
ax1.set_xticklabels(labels);

### Because I live in Austin, I know a little bit about this:
* Garbage is picked up every week
* Recycling is picked up every two weeks
* Residential streets are swept six times per year
* Major streets are swept twelve times per year
* Yard trimmings are collected weekly
* Brush and Bulk are collected twice per year 
    * There are 10 residential areas (I think) that rotate year-round
* Not sure about the paper and comingle recycling
    * Maybe this is like bulk pickup for industry?
    * Could there have been a change in the collection process? 
        * *foreshadowing intensifies*
 
The rest are probably not on a schedule (as needed)

### Comment on `load_type` vs `route_type`
It seems that all load types receive waste from several different route types. Also, waste from a given route type does not go exclusively to any particular load type. Just thought this was interesting, not sure if anything will come of it.

To look at the full list of load type contributions, run this code block:  

```python
load_by_route_type = data.groupby('load_type')['route_type'].unique()
for i in load_by_route_type.index:
    print('\n{}: '.format(i))
    for j in load_by_route_type[i]:
        print('\t{}'.format(j))
```
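
A contingency table is a more compact way to see the same relationship with counts attached. This is a sketch on a toy frame (the rows are invented, not taken from the dataset); swapping `toy` for `data` should give the full picture:

```python
import pandas as pd

# Toy stand-in for `data`; the values are illustrative only
toy = pd.DataFrame({
    'load_type': ['GARBAGE', 'GARBAGE', 'BRUSH', 'BULK', 'BRUSH'],
    'route_type': ['GARBAGE', 'BULK', 'BRUSH', 'BULK', 'BULK'],
})

# Rows are load types, columns are route types, cells are load counts
counts = pd.crosstab(toy['load_type'], toy['route_type'])
print(counts)
```

Any row with more than one non-zero cell is a load type fed by several route types.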

## Now let's look at the waste destinations

In [None]:
plt.figure(figsize=(12, 6))
dropoff_site_counts = data['dropoff_site'].value_counts()
ax2 = dropoff_site_counts.plot(kind='bar')
# Label each bar with its dropoff site and count
labels = ['{} - ({})'.format(label, count) for label, count in dropoff_site_counts.items()]
ax2.set_xticklabels(labels);

### About some of the waste destinations
TDS - Texas Disposal Systems  
MRF - Materials Recovery Facility  
* TDS Landfill is the primary city dump
* I think MRF and TDS MRF are the same recycling facility
* Hornsby Bend is a bird observatory/water treatment plant
    * This is where Austin's "Dillo Dirt" is made
* Onion Creek is the Austin Community Landfill
* Steiner Landfill is the Creedmore Sanitary Landfill (no organic waste)
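
One way to probe the guess that MRF and TDS MRF are the same facility is to compare when each site was active. Non-overlapping date ranges would be consistent with a rename, though they would not prove it. A sketch on a toy frame with made-up dates:

```python
import pandas as pd

# Toy stand-in for `data`; dates are invented for illustration
toy = pd.DataFrame({
    'dropoff_site': ['MRF', 'MRF', 'TDS MRF', 'TDS MRF'],
    'load_time': pd.to_datetime(['2005-03-01', '2008-09-30', '2008-10-01', '2016-06-15']),
})

# First and last load seen at each site
active = toy.groupby('dropoff_site')['load_time'].agg(['min', 'max'])
print(active)
```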

### Each one of these takes different streams of waste
To see the full list of waste streams to each site run this:  
```python
dropoff_site_load_types = data.groupby('dropoff_site')['load_type'].unique()
for i in dropoff_site_load_types.index:
    print('\n{}: '.format(i))
    for j in dropoff_site_load_types[i]:
        print('\t{}'.format(j))
```

## Check on missing load weights

In [None]:
missing = data[data['load_weight'].isnull()]
missing_perc = len(missing) / len(data) * 100
print('Missing Total: {}\nMissing Percentage: {:.2f}%'.format(len(missing), missing_perc))

So more than 10% of the full dataset is missing the weight. Let's dig a little deeper.

In [None]:
missing['load_type'].value_counts()

That makes me feel a bit better, most of the missing data is from street sweepers...maybe they don't have to fill out paperwork in some cases? Earlier we saw that there were 72,377 data points for street sweeping, ~82% of which are missing.  
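
The per-type missing rate can be computed in one step rather than by comparing two sets of counts by hand. A sketch on toy data (the `SWEEPING` label and the weights are placeholders, not values from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`
toy = pd.DataFrame({
    'load_type': ['SWEEPING', 'SWEEPING', 'SWEEPING', 'SWEEPING', 'GARBAGE', 'GARBAGE'],
    'load_weight': [np.nan, np.nan, np.nan, 1200.0, 9800.0, 10400.0],
})

# The mean of a boolean mask is the fraction of True values, i.e. the missing rate
missing_rate = toy['load_weight'].isnull().groupby(toy['load_type']).mean() * 100
print(missing_rate.round(1))
```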

Luckily, only a very small portion of the rest of the data is missing. For now we'll keep these rows, but if we want to make predictions in the future we will need to either fill them in with the mean/median or remove them.
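
If filling turns out to be the way to go, a per-load-type median is probably a safer default than a global one, since typical weights differ a lot between load types. A minimal sketch on toy data, not a decision about what the final pipeline should do:

```python
import numpy as np
import pandas as pd

# Toy stand-in for `data`; weights are invented
toy = pd.DataFrame({
    'load_type': ['GARBAGE', 'GARBAGE', 'GARBAGE', 'BRUSH', 'BRUSH'],
    'load_weight': [9000.0, np.nan, 11000.0, 4000.0, np.nan],
})

# Median weight of each load type, broadcast back to every row
group_median = toy.groupby('load_type')['load_weight'].transform('median')

# Only the missing entries are replaced; observed weights are untouched
filled = toy['load_weight'].fillna(group_median)
print(filled)
```

For a load type that is mostly missing, like street sweeping here, the median rests on a small share of the rows, so dropping those rows may be the more honest choice.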

## Let's see the summary of load weights by load type
We'll do this by looking at the description and boxplot

In [None]:
ax3 = data.boxplot(by='load_type', column='load_weight', figsize=(12, 6))
ax3.set_ylim(-1000,40000);
ax3.set_xticklabels(ax3.get_xticklabels(), rotation=90);
data.groupby('load_type')['load_weight'].describe()

Two irrelevant but interesting things to note about this dataset:
1. There is a single negative value in single-stream recycling
    * Was there a pickup of nonrecyclable items from the recycling plant?
    * Was this just a typo?
2. Imagine a single truck filled with 14,540 lbs of dead animals...YUCK!
    * I'm assuming this is in lbs as opposed to kg
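
Rows like these are easy to pull out for a closer look. A sketch on toy data: the 14,540 figure comes from the summary above, while the negative weight and the exact load type labels are placeholders.

```python
import pandas as pd

# Toy stand-in for `data`
toy = pd.DataFrame({
    'load_type': ['RECYCLING - SINGLE STREAM', 'RECYCLING - SINGLE STREAM', 'DEAD ANIMAL'],
    'load_weight': [7200.0, -4500.0, 14540.0],
})

# Any negative weight is suspect and worth inspecting by hand
negative = toy[toy['load_weight'] < 0]
print(negative)

# Per-type maxima make outliers like the dead-animal load easy to spot
print(toy.groupby('load_type')['load_weight'].max())
```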

## How do the route numbers relate to the route type and load type?

In [None]:
routes = data.groupby('route_number')
routes_by_route_type = routes['route_type'].nunique()
routes_by_load_type = routes['load_type'].nunique()
print('Redundant Route Types: {}'.format((routes_by_route_type > 1).sum()))
print('Redundant Load Types: {}'.format((routes_by_load_type > 1).sum()))

So this tells us that each route number is associated with only one route type but can be associated with more than one load type. It also means that the first few characters of a route number could potentially tell us what kind of route type it is. For example:

In [None]:
print('BULK: \n{}\n'.format(data[data['route_type'] == 'BULK']['route_number'].unique()))
print('DEAD ANIMAL: \n{}\n'.format(data[data['route_type'] == 'DEAD ANIMAL']['route_number'].unique()))

All bulk route numbers start with "BU" while all dead animal route numbers start with "DA". Nothing groundbreaking here, just a little more good-to-know information in case the dataset grows.
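
The same check can be run across every route type at once. This sketch assumes the code is always the first two characters of the route number, which has only been confirmed for the two types above; the route numbers in the toy frame are invented:

```python
import pandas as pd

# Toy stand-in for `data`
toy = pd.DataFrame({
    'route_number': ['BU01', 'BU13', 'DA01', 'DA02', 'BU02'],
    'route_type': ['BULK', 'BULK', 'DEAD ANIMAL', 'DEAD ANIMAL', 'BULK'],
})

# Assumption: the first two characters carry the route-type code
prefix = toy['route_number'].str[:2]

# Distinct route types seen under each prefix; 1 means the prefix is unambiguous
types_per_prefix = toy.groupby(prefix)['route_type'].nunique()
print(types_per_prefix)
```

Any prefix with a count above 1 would break the pattern and deserve a second look.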

## Now let's start looking at the dataset as a time-series

In [None]:
# Sort chronologically and use the load time as the index
data_ts = data.sort_values('load_time').set_index('load_time')
data_ts.head()

In [None]:
load_types = data_ts['load_type'].unique()
fig = plt.figure(figsize=(12,12))
for i, lt in enumerate(load_types):
    # Resample the data to get monthly totals
    tmp = data_ts[data_ts['load_type'] == lt]['load_weight'].resample('M').sum()
    ax = fig.add_subplot(6, 3, i+1)
    ax.plot(tmp.index, tmp.values)
    ax.set_title(lt)
    ax.set_xlim(data_ts.index.min(), data_ts.index.max())
fig.tight_layout()

### Initial observations that need investigation:
* Plots for bagged litter, xmas trees, mulch, recycling plastic bags, and mattress are useless
* Single stream recycling started around 2009
    * This is when comingle and paper recycling stopped
    * Garbage collection dipped once single stream recycling started and has held steady since
* Upward trend in single stream recycling?
* Downward trend in street sweeping, litter, and dead animals?
* Seems to be some seasonality in yard trimmings
    * maybe brush, street sweeping, and ~~tires~~ too?
    * fft for peak frequencies
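
The FFT idea in the last bullet could look something like this. It is a sketch on a synthetic monthly series, not on the real totals; `dominant_period` is a hypothetical helper, and it assumes evenly spaced samples with no gaps:

```python
import numpy as np

def dominant_period(values):
    """Return the period, in samples, of the strongest non-constant frequency."""
    values = np.asarray(values, dtype=float)
    # Remove the mean so the zero-frequency term does not dominate
    centered = values - values.mean()
    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(centered), d=1.0)  # cycles per sample
    # Skip index 0 (the constant term) when searching for the peak
    peak = np.argmax(spectrum[1:]) + 1
    return 1.0 / freqs[peak]

# Synthetic monthly series with a yearly cycle, standing in for the resampled totals
months = np.arange(120)
synthetic = 5000 + 1000 * np.sin(2 * np.pi * months / 12)
print(dominant_period(synthetic))
```

On monthly totals, a result near 12 would point to a yearly cycle. A strong trend in the real series would likely need to be removed first, or it may swamp the seasonal peak.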


In [None]:
yt = data_ts[data_ts['load_type'] == 'YARD TRIMMING']
# Keep only the numeric column so sum() is not applied to the text columns
yt = yt[['load_weight']].resample('M').sum()
yt['load_weight'].plot()

### There is definitely some seasonality here, so let's get the monthly averages to see what the yearly cycle looks like

In [None]:
# Average the monthly totals by calendar month (1 = January, 12 = December)
plt.plot(yt.groupby(yt.index.month)['load_weight'].mean())

### So this makes sense: lots of yard trimmings in the spring, slowly dwindling as the brutal Texas summer sets in, a smaller peak in fall, and another dip in winter.
I think it's pretty telling of our weather that there are more yards being mowed in the dead of winter than in the dead of summer.

In [None]:
da = data_ts[data_ts['load_type'] == 'DEAD ANIMAL']
# Select the weight column before resampling so only numeric data is summed
da['load_weight'].resample('M').sum().plot()

### There is a lot of information to be gleaned from this plot:
* Peak in late 2016 when Austin no-kill animal shelters became overcrowded and stopped taking rescues
    * Intense flooding in 2016