### Data Definition
1. **Year**: Year of observation; only 1917 in this subset of the data.
2. **Month**: Month of observation.
3. **Stn_Name**: Name of station where observation was recorded.
4. **Prov**: Province in which the station is located.
5. **Lat**: Latitude coordinate of the station.
6. **Long**: Longitude coordinate of the station.
7. **Tm**: Recorded monthly average temperature (°C).
8. **S**: Recorded monthly average snowfall (cm).
9. **P**: Recorded monthly average precipitation (mm).

In [1]:
#import necessary libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import classification_report

In [2]:
#read csv into df
df = pd.read_csv('./data/1917_full.csv', encoding = 'utf-16')
df.head()

Unnamed: 0,Year,Month,Stn_Name,Prov,Lat,Long,Tm,S,P
0,1917,1,ALIX,AB,52.383,-113.167,-15.4,40.200001,42.0
1,1917,1,ALLIANCE,AB,52.433,-111.783,-17.299999,0.0,0.0
2,1917,1,ATHABASCA LANDING,AB,54.717,-113.283,-20.700001,26.4,26.4
3,1917,1,BANFF,AB,51.183,-115.567,-11.4,18.299999,18.299999
4,1917,1,BASHAW,AB,52.683,-112.867,0.0,22.9,35.599998


In [3]:
# count missing values (NaNs) per column
df.isnull().sum()

Year         0
Month        0
Stn_Name     0
Prov         0
Lat         12
Long        12
Tm           0
S            0
P            0
dtype: int64

In [4]:
df[df.Lat.isnull()]

Unnamed: 0,Year,Month,Stn_Name,Prov,Lat,Long,Tm,S,P
226,1917,1,POINT LEPREAU,NB,,,0.0,48.200001,107.800003
717,1917,2,POINT LEPREAU,NB,,,0.0,40.599998,66.5
1203,1917,3,POINT LEPREAU,NB,,,0.0,25.299999,107.900002
1693,1917,4,POINT LEPREAU,NB,,,0.0,22.799999,59.900002
2190,1917,5,POINT LEPREAU,NB,,,0.0,0.0,82.800003
2695,1917,6,POINT LEPREAU,NB,,,0.0,0.0,146.600006
3196,1917,7,POINT LEPREAU,NB,,,0.0,0.0,44.700001
3696,1917,8,POINT LEPREAU,NB,,,0.0,0.0,94.599998
4205,1917,9,POINT LEPREAU,NB,,,0.0,0.0,37.299999
4712,1917,10,POINT LEPREAU,NB,,,0.0,0.0,146.399994


In [5]:
# drop the rows with missing coordinates, then confirm no NaNs remain
df = df[df['Lat'].notna()]
df.isnull().values.any()

np.False_

In [6]:
#choose target - going with province
X = df.drop(columns=['Prov'])
y = df['Prov']

In [7]:
X

Unnamed: 0,Year,Month,Stn_Name,Lat,Long,Tm,S,P
0,1917,1,ALIX,52.383,-113.167,-15.400000,40.200001,42.000000
1,1917,1,ALLIANCE,52.433,-111.783,-17.299999,0.000000,0.000000
2,1917,1,ATHABASCA LANDING,54.717,-113.283,-20.700001,26.400000,26.400000
3,1917,1,BANFF,51.183,-115.567,-11.400000,18.299999,18.299999
4,1917,1,BASHAW,52.683,-112.867,0.000000,22.900000,35.599998
...,...,...,...,...,...,...,...,...
5974,1917,12,YELLOW GRASS,49.817,-104.183,-19.500000,0.000000,0.000000
5975,1917,12,YORKTON,51.183,-102.517,-23.700001,0.000000,0.000000
5976,1917,12,CARCROSS,60.183,-134.700,-34.900002,9.100000,9.100000
5977,1917,12,DAWSON,64.050,-139.433,-46.299999,5.400000,5.400000


In [8]:
# one-hot encoding - get_dummies encodes the non-numeric columns (here only Stn_Name) and leaves numeric columns as-is
X = pd.get_dummies(X, drop_first=True)
X.head()

Unnamed: 0,Year,Month,Lat,Long,Tm,S,P,Stn_Name_AGASSIZ CDA,Stn_Name_ALBERNI BEAVER CREEK,Stn_Name_ALERT BAY,...,Stn_Name_WILMER,Stn_Name_WINDSOR KINGS COLLEGE,Stn_Name_WINDSOR RIVERSIDE,Stn_Name_WINNIPEG ST JOHNS COLL,Stn_Name_WOLFVILLE,Stn_Name_WOODS LAKE,Stn_Name_WOODSTOCK,Stn_Name_YARMOUTH,Stn_Name_YELLOW GRASS,Stn_Name_YORKTON
0,1917,1,52.383,-113.167,-15.4,40.200001,42.0,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,1917,1,52.433,-111.783,-17.299999,0.0,0.0,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,1917,1,54.717,-113.283,-20.700001,26.4,26.4,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,1917,1,51.183,-115.567,-11.4,18.299999,18.299999,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,1917,1,52.683,-112.867,0.0,22.9,35.599998,False,False,False,...,False,False,False,False,False,False,False,False,False,False
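
To make the effect of `get_dummies` with `drop_first=True` concrete, here is a minimal sketch on a toy frame. The three rows reuse station names and latitudes from the table above; the frame name `toy` is made up for illustration and is not part of the notebook.

```python
import pandas as pd

# Toy frame (hypothetical name) built from three rows shown earlier
toy = pd.DataFrame({
    "Month": [1, 1, 1],
    "Stn_Name": ["ALIX", "ALLIANCE", "BANFF"],
    "Lat": [52.383, 52.433, 51.183],
})

# Numeric columns pass through; Stn_Name becomes one column per station,
# minus the first category (ALIX), which drop_first=True removes
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
# ['Month', 'Lat', 'Stn_Name_ALLIANCE', 'Stn_Name_BANFF']
```

On the full data this produces one indicator column for nearly every station, which is why the encoded frame is so wide.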


In [9]:
#split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
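
A random row-wise split like this one can place different months of the same station in both the training and test sets. As a hedged alternative, not what this notebook does, the sketch below holds out whole stations with scikit-learn's `GroupShuffleSplit`. The toy data, the station names `A` to `D`, and the temperature values are all invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical toy data: 4 made-up stations, 3 months each
toy = pd.DataFrame({
    "Stn_Name": ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
    "Month": [1, 2, 3] * 4,
    "Tm": [-15.0, -12.0, -5.0, -17.0, -14.0, -6.0,
           2.0, 4.0, 7.0, 1.0, 3.0, 6.0],
    "Prov": ["AB"] * 6 + ["BC"] * 6,
})

# Keep the station name out of the features; use it only as the group label
X_toy = toy.drop(columns=["Prov", "Stn_Name"])
y_toy = toy["Prov"]
groups = toy["Stn_Name"]

# Hold out 25% of the stations (1 of 4), not 25% of the rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X_toy, y_toy, groups=groups))

train_stations = set(groups.iloc[train_idx])
test_stations = set(groups.iloc[test_idx])
print(train_stations & test_stations)  # set() -> no station in both sets
```

With a split like this, the test score reflects how well the model handles stations it has never seen. Latitude and longitude would still carry most of the information about the province.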

In [10]:
# Create a pipeline with a standard scaler and logistic regression
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('logreg', LogisticRegression())
])

In [11]:
# Train the model
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

          AB       1.00      1.00      1.00       245
          BC       1.00      1.00      1.00       439
          MB       1.00      1.00      1.00       119
          NB       1.00      1.00      1.00        33
          NL       1.00      1.00      1.00        28
          NS       1.00      1.00      1.00        75
          NT       1.00      1.00      1.00        22
          ON       1.00      1.00      1.00       448
          PE       1.00      1.00      1.00         4
          QC       1.00      1.00      1.00       210
          SK       1.00      0.99      1.00       160
          YT       1.00      1.00      1.00         8

    accuracy                           1.00      1791
   macro avg       1.00      1.00      1.00      1791
weighted avg       1.00      1.00      1.00      1791



## Summary
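As expected, the results are almost perfect: every class scores 1.00 on precision and recall except SK, whose recall is 0.99. It's a bit of a trivial problem, though. Each station is linked to one and only one province, and the features include the one-hot-encoded station name along with the station's latitude and longitude. Because the train/test split is random across rows, most stations likely appear in both sets, so the model mostly has to recognize stations it has already seen.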
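
The one-station-one-province claim can be checked directly. The sketch below is a minimal version on a toy frame; the station and province pairs are copied from the tables above, and the names `toy` and `provinces_per_station` are made up for illustration. On the real data the same `groupby` would run on `df`.

```python
import pandas as pd

# Toy rows (hypothetical frame) using station/province pairs from the tables above
toy = pd.DataFrame({
    "Stn_Name": ["ALIX", "ALIX", "BANFF", "POINT LEPREAU", "POINT LEPREAU"],
    "Prov": ["AB", "AB", "AB", "NB", "NB"],
})

# Count the distinct provinces recorded for each station name
provinces_per_station = toy.groupby("Stn_Name")["Prov"].nunique()

# A maximum of 1 means no station name maps to more than one province
print(provinces_per_station.max())  # 1
```

If the maximum came back greater than 1 on the full frame, some station name would be shared across provinces and the task would be slightly less trivial than described.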