# Model Selection and Pipelines

In this lab, we'll explore ways to evaluate the quality of trained models. We'll also begin setting up a basic pipeline using the tools that Python's `sklearn` library provides. Pipelines are useful when we want to tune parameters and compare different models and parameter settings on the same dataset.
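
As a rough preview of what a pipeline looks like, here is a minimal sketch. It is not the pipeline built later in this lab: the synthetic data, the `StandardScaler` step, and the step names `"scale"` and `"model"` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up for illustration): y = 3*x0 - 2*x1 + 5 plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

# Hold out a quarter of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A pipeline chains preprocessing and a model into a single estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

# For regressors, score() returns the coefficient of determination (R^2).
r2 = pipe.score(X_test, y_test)
print(f"held-out R^2: {r2:.3f}")
```

Because the scaler is fitted inside `pipe.fit`, it only ever sees the training rows, which keeps information from the held-out rows from leaking into preprocessing.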

In [1]:
# load libraries

import numpy as np
import pandas as pd

from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

## Load Data

This portion of the notebook is an abbreviated version of the lab in which we processed the Chicago Divvy and Weather datasets to produce a dataframe of features.

### Weather Data

In [2]:
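# Read the weather observations and parse DATE as a datetime.
wdf = pd.read_csv("../../data/chicago-weather.csv.gz", compression='gzip')
wdf['DATE'] = pd.to_datetime(wdf['DATE'], format='%Y/%m/%d')

# Keep the date and min/max temperature columns for the Midway station only.
midway_temps = wdf[wdf['STATION']=='USC00111577'].loc[:,['DATE','TMIN','TMAX']]
# Drop observations dated 2019 or later.
midway_temps_2018 = midway_temps[midway_temps['DATE'] < '2019-01-01']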

### Divvy Data

In [3]:
ddf = pd.read_csv("../../data/Divvy_Trips_2018.csv.gz", compression='gzip')
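# The format includes an AM/PM marker (%p), so parse the hour with %I (12-hour clock);
# paired with %H, the %p marker does not shift the hour.
ddf['START TIME'] = pd.to_datetime(ddf['START TIME'], format='%m/%d/%Y %I:%M:%S %p')

# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning.
divvy_2018 = ddf[ddf['START TIME'] >= '2018-01-01'].copy()
divvy_2018['DATE'] = divvy_2018['START TIME'].dt.date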

# Number of rides per day.
divvy_2018_rides_by_date = (
    divvy_2018.groupby(['DATE'])['DATE']
    .count()
    .reset_index(name='count')
    .sort_values(['DATE'], ascending=True)
)

# Total trip duration per day.
divvy_2018_duration_by_date = (
    divvy_2018.groupby(['DATE'])['TRIP DURATION']
    .sum()
    .reset_index(name='duration')
    .sort_values(['DATE'], ascending=True)
)

divvy_2018_by_date = divvy_2018_duration_by_date.merge(divvy_2018_rides_by_date,
                                                       on='DATE',
                                                       how='left')
divvy_2018_by_date['DATE'] = pd.to_datetime(divvy_2018_by_date['DATE'])
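
The two group-bys and the merge in the cell above could also be written as a single named aggregation. The following is only a sketch on a toy table: the three rows are made up, and using `.dt.normalize()` to keep `DATE` as a datetime column is a choice made here, not what the cell above does.

```python
import pandas as pd

# Toy stand-in for the Divvy trips table (values are made up for illustration).
trips = pd.DataFrame({
    "START TIME": pd.to_datetime([
        "2018-01-01 08:00:00",
        "2018-01-01 17:30:00",
        "2018-01-02 09:15:00",
    ]),
    "TRIP DURATION": [600, 900, 300],
})

# Normalize timestamps to midnight so DATE stays a datetime64 column
# and needs no later pd.to_datetime conversion.
trips["DATE"] = trips["START TIME"].dt.normalize()

# One pass over the groups: total duration and number of rides per day.
by_date = (
    trips.groupby("DATE")
    .agg(duration=("TRIP DURATION", "sum"), count=("TRIP DURATION", "size"))
    .reset_index()
)
print(by_date)
```

Note that `"size"` counts rows per group, including rows whose duration is missing, which matches counting the `DATE` key itself.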


### Merge Data into a Single Dataframe

In [4]:
# merge() defaults to an inner join, so only dates present in both frames are kept.
rides_temps = midway_temps_2018.merge(divvy_2018_by_date, on='DATE')
rides_temps.to_csv('../../data/rides_temps.csv')
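
Since `merge` performs an inner join by default, dates that appear in only one of the two frames are silently dropped. One way to check for that is sketched below on toy frames; the values are made up, and the `validate` and `indicator` arguments are optional extras that the cell above does not use.

```python
import pandas as pd

# Toy frames standing in for the weather and ride tables (values are made up).
temps = pd.DataFrame({
    "DATE": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
    "TMIN": [10, 12, 15],
    "TMAX": [25, 30, 33],
})
rides = pd.DataFrame({
    "DATE": pd.to_datetime(["2018-01-01", "2018-01-03"]),
    "duration": [1500, 300],
    "count": [2, 1],
})

# validate raises MergeError if either side has duplicate DATE values.
inner = temps.merge(rides, on="DATE", validate="one_to_one")

# An outer join with indicator=True adds a _merge column showing where each row came from.
audit = temps.merge(rides, on="DATE", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(len(inner), "rows kept by the inner join")
print(unmatched[["DATE", "_merge"]])
```

If `unmatched` is empty on the real data, the inner join loses nothing; otherwise it lists exactly which dates lack a counterpart in the other table.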