As an aside, the core ISO 8601 format has no designation for quarters, but the Extended Date/Time Format (EDTF) extensions do (see: https://en.wikipedia.org/wiki/ISO_8601#Standardised_extensions and https://www.loc.gov/standards/datetime/). Unfortunately, Python's datetime format codes don't recognize EDTF, so instead I'll replace each quarter designation with the number of the first month in that quarter (e.g. Q1 becomes 01 for January, Q2 becomes 04 for April, etc.).
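
As a rough illustration of that mapping, the first month of quarter q is 3 * (q - 1) + 1. The helper below is hypothetical (it is not part of this notebook) and assumes labels laid out as 'YYYY Qn':

```python
import pandas as pd

def quarter_label_to_timestamp(label: str) -> pd.Timestamp:
    """Convert a label such as '2015 Q3' to the first day of that quarter.

    Hypothetical helper for illustration; assumes the 'YYYY Qn' layout.
    """
    year_text, quarter_text = label.split()
    quarter = int(quarter_text.lstrip('Q'))
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"unexpected quarter in {label!r}")
    first_month = 3 * (quarter - 1) + 1  # Q1 -> 1, Q2 -> 4, Q3 -> 7, Q4 -> 10
    return pd.Timestamp(year=int(year_text), month=first_month, day=1)

labels = pd.Series(['2015 Q1', '2015 Q2', '2015 Q3', '2015 Q4'])
dates = labels.map(quarter_label_to_timestamp)
print(dates.dt.month.tolist())
```

Raising on an unexpected quarter makes malformed labels fail loudly instead of silently producing a wrong month.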

In [9]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

bee_colony_data = pd.read_csv('bees_post_eda.csv')
bee_colony_data.head(5)

Unnamed: 0,State,year,quarter,Starting colonies,Maximum colonies,Lost colonies,Percent lost,Added colonies,Varroa mites,Other pests,Diseases,Pesticides,Other,Unknown,year_and_quarter,region,subregion
0,Alabama,2015,Q1,7000.0,7000.0,1800.0,26.0,2800.0,10.0,5.4,,2.2,9.1,9.4,2015 Q1,south,east south central
1,Alabama,2015,Q2,7500.0,7500.0,860.0,12.0,1900.0,16.7,42.5,,2.3,3.2,4.1,2015 Q2,south,east south central
2,Alabama,2015,Q3,8500.0,9000.0,1400.0,16.0,160.0,63.1,70.6,,2.6,2.2,17.7,2015 Q3,south,east south central
3,Alabama,2015,Q4,8000.0,8000.0,610.0,8.0,80.0,3.1,6.4,0.2,0.2,2.8,1.9,2015 Q4,south,east south central
4,Alabama,2016,Q1,7500.0,7500.0,1700.0,23.0,1200.0,24.2,22.0,4.3,8.1,2.4,11.3,2016 Q1,south,east south central


Convert year and quarter to datetime format

In [10]:
bee_colony_data['year_and_quarter'] = bee_colony_data['year_and_quarter'].str.replace(' Q1', '-01')
bee_colony_data['year_and_quarter'] = bee_colony_data['year_and_quarter'].str.replace(' Q2', '-04')
bee_colony_data['year_and_quarter'] = bee_colony_data['year_and_quarter'].str.replace(' Q3', '-07')
bee_colony_data['year_and_quarter'] = bee_colony_data['year_and_quarter'].str.replace(' Q4', '-10')

bee_colony_data['year_and_quarter'] = pd.to_datetime(bee_colony_data['year_and_quarter'], format='%Y-%m')
bee_colony_data.dtypes

State                        object
year                          int64
quarter                      object
Starting colonies           float64
Maximum colonies            float64
Lost colonies               float64
Percent lost                float64
Added colonies              float64
Varroa mites                float64
Other pests                 float64
Diseases                    float64
Pesticides                  float64
Other                       float64
Unknown                     float64
year_and_quarter     datetime64[ns]
region                       object
subregion                    object
dtype: object

Create dummy variables for states, regions, and subregions

In [11]:
bee_data_encoded = pd.get_dummies(bee_colony_data, columns=['State', 'region', 'subregion'])
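
To make the encoding step concrete, here is a small sketch of what pd.get_dummies does to categorical columns. The two-row frame is a toy example with made-up values, not data from the bee dataset:

```python
import pandas as pd

# Toy frame: two categorical columns and one numeric column
toy = pd.DataFrame({
    'State': ['Alabama', 'Arizona'],
    'region': ['south', 'west'],
    'Lost colonies': [100.0, 200.0],
})

encoded = pd.get_dummies(toy, columns=['State', 'region'])

# Each listed column is replaced by one indicator column per category,
# named '<column>_<category>'; other columns are left untouched
print(list(encoded.columns))
```

Depending on the pandas version, the indicator columns hold 0/1 integers (as in this notebook's output) or booleans.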

Apply StandardScaler to the numeric predictor columns. The colony-count columns needed to compute the target are left unscaled.

In [12]:
scaled_col_names = ['Added colonies', 'Varroa mites', 'Other pests', 'Diseases', 'Pesticides', 
                    'Other', 'Unknown']

scaled_features = bee_data_encoded[scaled_col_names]
scaler = StandardScaler()
scaled_bee_data = scaler.fit_transform(scaled_features.values)

bee_data_encoded[scaled_col_names] = scaled_bee_data
bee_data_encoded.head(5)

Unnamed: 0,year,quarter,Starting colonies,Maximum colonies,Lost colonies,Percent lost,Added colonies,Varroa mites,Other pests,Diseases,...,region_west,subregion_east north central,subregion_east south central,subregion_mid atlantic,subregion_mountain,subregion_new england,subregion_pacific,subregion_south atlantic,subregion_west north central,subregion_west south central
0,2015,Q1,7000.0,7000.0,1800.0,26.0,-0.269072,-0.99801,-0.438503,,...,0,0,1,0,0,0,0,0,0,0
1,2015,Q2,7500.0,7500.0,860.0,12.0,-0.30182,-0.649452,2.348027,,...,0,0,1,0,0,0,0,0,0,0
2,2015,Q3,8500.0,9000.0,1400.0,16.0,-0.365133,1.764442,4.45858,,...,0,0,1,0,0,0,0,0,0,0
3,2015,Q4,8000.0,8000.0,610.0,8.0,-0.368044,-1.356972,-0.363395,-0.528829,...,0,0,1,0,0,0,0,0,0,0
4,2016,Q1,7500.0,7500.0,1700.0,23.0,-0.327291,-0.259275,0.8083,-0.020045,...,0,0,1,0,0,0,0,0,0,0
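
The NaN entries in the Diseases column survive scaling because StandardScaler disregards missing values when fitting and keeps them in place when transforming. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single column with one missing value
values = np.array([[1.0], [2.0], [np.nan], [3.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# Mean and variance are computed from the non-missing entries only
print(scaler.mean_)
# The missing entry is still NaN after the transform
print(scaled.ravel())
```

This is why a separate imputation step is still needed after scaling.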


It turns out a few of the 'Lost colonies' values are NaN. I've decided to set those to 0, on the assumption that it's entirely possible no colonies were lost in a quarter. This could be especially true in Q4/Q1, when hives are not checked during the colder months: any losses would be unknown at the time and counted in subsequent quarters. Some of the 'Percent lost' values are also NaN, so I'm dropping that column entirely, since 'Log - final' (computed below) serves the same purpose.

Create a calculated column, 'Log - final': the log of the fraction of colonies remaining, log((Starting colonies - Lost colonies) / Starting colonies)

In [13]:
bee_data_encoded['Lost colonies'] = bee_data_encoded['Lost colonies'].fillna(0)
bee_data_encoded.drop('Percent lost', axis=1, inplace=True)

bee_data_encoded['Log - final'] = np.log((bee_data_encoded['Starting colonies'] - bee_data_encoded['Lost colonies']) / bee_data_encoded['Starting colonies'])

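# A small number of 'Log - final' values are NaN (just 5 rows).
# Caution: dropna() without `subset` removes every row that has a NaN in any column
# (rows 0-2 are gone in the output below), so the imputer in the next step has nothing left to fill.
# To drop only the rows with a NaN target, use dropna(subset=['Log - final'], inplace=True).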
bee_data_encoded.dropna(inplace=True)

  result = getattr(ufunc, method)(*inputs, **kwargs)
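
For reference, here is a small sketch of how the log ratio behaves at the edges. The first row reuses the Alabama 2015 Q4 values shown earlier; the other rows are made up to trigger the edge cases:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Starting colonies': [8000.0, 5000.0, 1000.0, np.nan],
    'Lost colonies':     [610.0,  5000.0, 1200.0, 100.0],
})

remaining_fraction = (
    (toy['Starting colonies'] - toy['Lost colonies']) / toy['Starting colonies']
)

# Silence the RuntimeWarnings that np.log raises for 0 and negative inputs
with np.errstate(divide='ignore', invalid='ignore'):
    toy['Log - final'] = np.log(remaining_fraction)

# Row 0: ordinary negative log value
# Row 1: all colonies lost, fraction 0, log is -inf
# Row 2: more lost than started, negative fraction, log is NaN
# Row 3: missing starting count, NaN propagates
print(toy)
```

Note that dropna() removes the NaN rows but not the -inf row, so a quarter in which every colony was lost would need separate handling.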


Impute missing data using sklearn SimpleImputer

In [14]:
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
imputed_cols = imp_median.fit_transform(bee_data_encoded[scaled_col_names])

bee_data_imputed = bee_data_encoded.copy()
bee_data_imputed[scaled_col_names] = imputed_cols
bee_data_imputed.head(5)

Unnamed: 0,year,quarter,Starting colonies,Maximum colonies,Lost colonies,Added colonies,Varroa mites,Other pests,Diseases,Pesticides,...,subregion_east north central,subregion_east south central,subregion_mid atlantic,subregion_mountain,subregion_new england,subregion_pacific,subregion_south atlantic,subregion_west north central,subregion_west south central,Log - final
3,2015,Q4,8000.0,8000.0,610.0,-0.368044,-1.356972,-0.363395,-0.528829,-0.739222,...,0,1,0,0,0,0,0,0,0,-0.079314
4,2016,Q1,7500.0,7500.0,1700.0,-0.327291,-0.259275,0.8083,-0.020045,0.107578,...,0,1,0,0,0,0,0,0,0,-0.257045
6,2015,Q2,33000.0,33000.0,5500.0,0.338588,-1.081247,1.566898,-0.491601,1.393856,...,0,0,0,0,0,0,0,0,0,-0.182322
7,2015,Q3,40000.0,40000.0,6000.0,-0.214492,1.265016,1.018604,0.116458,1.093724,...,0,0,0,0,0,0,0,0,0,-0.162519
8,2015,Q4,36000.0,39000.0,12000.0,-0.261795,1.088136,-0.716405,-0.417145,-0.385496,...,0,0,0,0,0,0,0,0,0,-0.405465
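
As a minimal sketch of median imputation, each NaN is replaced by the median of the non-missing values in its own column. The frame below uses made-up numbers under the notebook's column names:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    'Diseases':   [0.2, np.nan, 4.3, 1.0],
    'Pesticides': [2.2, 2.3, np.nan, 8.1],
})

imputer = SimpleImputer(missing_values=np.nan, strategy='median')

# fit_transform returns a NumPy array, so rebuild the DataFrame to keep labels
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns, index=toy.index)

# Per-column medians learned during fit
print(imputer.statistics_)
print(filled)
```

The median is less sensitive to outliers than the mean, which is one reason to prefer it for skewed columns.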


Remove the year_and_quarter column, since the regression cannot handle a datetime column, and replace the Q1-Q4 designations with the numbers 1-4

In [15]:
bee_data_imputed.drop(['year_and_quarter'], axis=1, inplace=True)
# Strip the 'Q' prefix and cast to int so the column is numeric rather than a string
bee_data_imputed['quarter'] = bee_data_imputed['quarter'].str.replace('Q', '', regex=False).astype(int)

Create a train and test split

In [16]:
X = bee_data_imputed.drop(['Log - final'], axis=1)
y = bee_data_imputed['Log - final']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)
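
A quick sketch of what these arguments do, using made-up arrays rather than the bee data: test_size=0.30 holds out 30% of the rows, and a fixed random_state makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 made-up samples, 2 features
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42
)

# Same random_state, same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42
)

print(len(X_tr), len(X_te))
print(np.array_equal(X_te, X_te2))
```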

In [17]:
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)
X.to_csv('X.csv', index=False)
y.to_csv('y.csv', index=False)
bee_data_encoded.to_csv('bee_data_after_preprocessing.csv', index=False)