## Solar

Project Solar is an attempt to acquire information about the surrounding environment in a living space, analyze it, and build predictions in order to answer:

> Determine the amount of artificial light needed to counteract the deficit of natural light

+ Given the time of day, predict the level of light
+ Given the time of day (and the day of the week), predict whether the light should be on

In [693]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split

In [694]:
from sklearn.externals import joblib
df = joblib.load('data/sensing_numeric.sav')
df.columns

Index(['dot_week', 'light_level', 'light_log_mms', 'light_log_sss',
       'location_black', 'location_blue', 'location_green', 'location_orange',
       'location_purple', 'motion', 'present', 'sound_log_mms',
       'sound_log_sss', 'sun_cat'],
      dtype='object')

Calculate the 25th percentile of light at each location.

Add the percentile to the data set.
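
The notebook does not show the code for this step. A minimal sketch of one way to do it with pandas follows, assuming a long-form `location` column (hypothetical; the real data set stores locations as one-hot `location_*` columns) and using the illustrative column name `light_p25`.

```python
import pandas as pd

# Toy readings; `location` is a hypothetical long-form label standing in
# for the one-hot location_* columns of the real data set.
readings = pd.DataFrame({
    "location": ["blue", "blue", "blue", "blue",
                 "green", "green", "green", "green"],
    "light_level": [0, 2, 4, 6, 10, 10, 20, 40],
})

# 25th percentile of light per location, broadcast back onto every row
# so it can be used as a feature (column name is illustrative).
readings["light_p25"] = (
    readings.groupby("location")["light_level"]
    .transform(lambda s: s.quantile(0.25))
)

print(readings)
```

Using `transform` keeps the result aligned with the original rows, so no separate merge is needed.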

Hourly motion deviation per day of the week at each location.

Add it as an additional feature.
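
This step is also described without code. The sketch below shows one possible reading of it: the standard deviation of `motion` within each (location, day of week, hour) group. The `timestamp` and `location` columns, the `motion_std` name, and the Monday=0 day encoding from pandas are assumptions, not taken from the project.

```python
import pandas as pd

# Toy motion events; column layout is assumed for illustration only.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2018-05-01 01:09", "2018-05-01 01:30", "2018-05-01 01:45",
        "2018-05-01 02:10", "2018-05-01 02:40",
    ]),
    "location": ["blue"] * 5,
    "motion": [0, 1, 1, 0, 0],
})

# Derive the grouping keys from the timestamp.
events["hour"] = events["timestamp"].dt.hour
events["dot_week"] = events["timestamp"].dt.dayofweek  # Monday=0 (pandas convention)

# Sample standard deviation of motion per location, day of week, and hour,
# broadcast back to each row as a new feature.
events["motion_std"] = (
    events.groupby(["location", "dot_week", "hour"])["motion"].transform("std")
)

print(events[["hour", "dot_week", "motion", "motion_std"]])
```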

In [664]:
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 565223 entries, (1, 2018-05-01 01:09:00) to (576227, 2018-05-19 23:38:00)
Data columns (total 13 columns):
dot_week           565223 non-null int64
light_level        565223 non-null int64
light_log_mms      565223 non-null float64
light_log_sss      565223 non-null float64
location_black     565223 non-null uint8
location_blue      565223 non-null uint8
location_green     565223 non-null uint8
location_orange    565223 non-null uint8
location_purple    565223 non-null uint8
motion             565223 non-null int64
sound_log_mms      565223 non-null float64
sound_log_sss      565223 non-null float64
sun_cat            565223 non-null int8
dtypes: float64(4), int64(3), int8(1), uint8(5)
memory usage: 41.2 MB


#### Feature selection (including sensing)

In [695]:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
df['light_level_sss'] = ss.fit_transform(df['light_level'].values.reshape(-1,1))



In [696]:
y = df['light_level']
X = df[df.columns.difference(['light_level_sss','light_level','light_log_mms','light_log_sss','sound_log_mms'])]
X.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 589227 entries, (1, 2018-05-01 01:09:00) to (600232, 2018-05-20 22:58:00)
Data columns (total 10 columns):
dot_week           589227 non-null int64
location_black     589227 non-null uint8
location_blue      589227 non-null uint8
location_green     589227 non-null uint8
location_orange    589227 non-null uint8
location_purple    589227 non-null uint8
motion             589227 non-null int64
present            589227 non-null int64
sound_log_sss      589227 non-null float64
sun_cat            589227 non-null int8
dtypes: float64(1), int64(3), int8(1), uint8(5)
memory usage: 29.4 MB


#### Train split

In [697]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
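
The split above uses the defaults, so it is neither reproducible nor guaranteed to preserve class proportions. As a hedged alternative, the sketch below uses the `stratify` and `random_state` parameters of `train_test_split` on synthetic data (the `*_demo` names and the 90/10 class mix are made up for illustration).

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic, imbalanced toy data: 900 samples of class 0, 100 of class 1.
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 900 + [1] * 100)

# stratify keeps the class ratio equal in both parts;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test:", y_te.mean())
```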

#### Logistic Regression

In [674]:
from sklearn.linear_model import LogisticRegression
from helpers import print_predict_scores
from helpers import fit_predict
logReg = LogisticRegression()

In [675]:
(model, prediction) = fit_predict(X_train, y_train, logReg)

p-values:
[  1.31179820e-145   3.31730796e-029   4.09001793e-001   1.33012136e-003
   1.12112347e-023   2.01227720e-006   0.00000000e+000   3.54114052e-302
   0.00000000e+000]

coefficients:
[[  3.50683455e-02   1.13768239e-02  -2.89545318e-02  -6.26249105e-02
   -3.03357683e-02   5.68234876e-02  -9.22049459e-01  -7.57268863e-02
    4.39801409e-01]
 [  5.46408033e-04  -6.76396316e-01  -7.04910386e-01  -7.00831525e-01
   -6.82782339e-01  -7.06768909e-01   3.11995025e-01  -4.57857959e-02
   -1.85766951e-01]
 [ -4.58595727e-04  -6.99970410e-01  -6.51525617e-01  -7.23846976e-01
   -6.46457347e-01  -6.89113923e-01   1.16469330e-01  -4.86220158e-02
   -2.33103193e-01]
 [  1.53710051e-02  -6.04928623e-01  -6.01977411e-01  -6.67019594e-01
   -5.98342261e-01  -6.52181102e-01   2.80325874e-01  -7.46766781e-02
   -2.18745102e-01]
 [ -1.66067626e-03  -5.94304384e-01  -5.42883880e-01  -5.71907685e-01
   -5.24951410e-01  -6.32383804e-01   4.97102529e-01  -6.61407410e-02
   -2.28667839e-01]
 [ -4.103

In [676]:
from sklearn.metrics import mean_squared_error, classification_report, precision_score, recall_score, confusion_matrix

In [677]:
print(mean_squared_error(y_train, prediction))
print(classification_report(y_train, prediction))

7.05597558013
             precision    recall  f1-score   support

          0       0.68      0.97      0.80    284117
          1       0.00      0.00      0.00      4907
          2       0.00      0.00      0.00      4824
          3       0.00      0.00      0.00      7351
          4       0.00      0.00      0.00      9536
          5       0.54      0.10      0.16    113182

avg / total       0.60      0.68      0.58    423917





In [678]:
print('Test accuracy:', logReg.score(X_test,y_test))

Test accuracy: 0.677756075467
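
For context, `score` on a classifier returns mean accuracy, and the classification report above is dominated by class 0. A quick sanity check, using only the support counts printed in that report, is to compute the accuracy of always predicting the majority class:

```python
# Support counts copied from the classification report above.
support = {0: 284117, 1: 4907, 2: 4824, 3: 7351, 4: 9536, 5: 113182}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy obtained by always predicting the most frequent class.
majority_share = support[majority_class] / total

print("majority class:", majority_class)
print("majority-class baseline accuracy: {:.3f}".format(majority_share))
```

If the baseline is close to the reported score of roughly 0.68, the model is adding little beyond predicting class 0.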


#### Stochastic Gradient Descent classifier

In [679]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier()

In [680]:
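# Note: logReg is passed here instead of sgd, so the output below repeats the logistic regression results.
(model, prediction) = fit_predict(X_train, y_train, logReg)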

p-values:
[  1.31179820e-145   3.31730796e-029   4.09001793e-001   1.33012136e-003
   1.12112347e-023   2.01227720e-006   0.00000000e+000   3.54114052e-302
   0.00000000e+000]

coefficients:
[[  3.50683455e-02   1.13768239e-02  -2.89545318e-02  -6.26249105e-02
   -3.03357683e-02   5.68234876e-02  -9.22049459e-01  -7.57268863e-02
    4.39801409e-01]
 [  5.46408033e-04  -6.76396316e-01  -7.04910386e-01  -7.00831525e-01
   -6.82782339e-01  -7.06768909e-01   3.11995025e-01  -4.57857959e-02
   -1.85766951e-01]
 [ -4.58595727e-04  -6.99970410e-01  -6.51525617e-01  -7.23846976e-01
   -6.46457347e-01  -6.89113923e-01   1.16469330e-01  -4.86220158e-02
   -2.33103193e-01]
 [  1.53710051e-02  -6.04928623e-01  -6.01977411e-01  -6.67019594e-01
   -5.98342261e-01  -6.52181102e-01   2.80325874e-01  -7.46766781e-02
   -2.18745102e-01]
 [ -1.66067626e-03  -5.94304384e-01  -5.42883880e-01  -5.71907685e-01
   -5.24951410e-01  -6.32383804e-01   4.97102529e-01  -6.61407410e-02
   -2.28667839e-01]
 [ -4.103

In [681]:
print('Test score:', model.score(X_test,y_test))

Test score: 0.677756075467


#### Parameter optimization - cross-validation

In [586]:
from sklearn.model_selection import GridSearchCV, KFold

In [587]:
param_grid = [
    {'C': [10**-i for i in range(-5, 5)], 'class_weight': [None, 'balanced']}
]
grid = GridSearchCV(
    estimator=logReg,
    param_grid=param_grid,
    cv=7,
    scoring = 'neg_mean_squared_error'
)

In [588]:
grid.fit(X_train, y_train)
grid.best_estimator_

LogisticRegression(C=100000, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [589]:
(model, prediction) = fit_predict(X_train, y_train, grid.best_estimator_)

p-values:
[  0.00000000e+000   7.81591889e-001   8.22329104e-006   1.09508329e-120
   3.25909785e-014   5.69222709e-006   5.89836426e-007   0.00000000e+000
   0.00000000e+000]

coefficients:
[[-1.14637545 -1.11456695  0.00260649 -0.19394196  0.09117005 -0.18489674
   0.17982027  0.40133189  0.35172401]
 [ 0.8405488   1.41654391 -0.92625755 -0.80815717 -1.02096276 -0.77981337
  -1.09993522 -0.31155181 -0.24025492]
 [ 0.65481086  0.77906726 -0.93237243 -0.87283014 -0.98662389 -0.71375082
  -0.96555748 -0.27671844 -0.24430037]
 [ 0.50300216  0.40165074 -0.88936004 -0.77870008 -1.0002758  -0.83414316
  -0.98458923 -0.30175319 -0.19627103]
 [ 0.22944994  1.32072953 -1.0159605  -0.81311794 -1.09116195 -0.73075688
  -0.96110659 -0.28753594 -0.23598012]
 [ 0.2398079   1.27264136 -0.88878623 -0.69139848 -0.94399001 -0.61837409
  -0.8532268  -0.30064654 -0.23054582]
 [ 0.31047884  0.61925708 -0.72036772 -0.60853525 -0.80208916 -0.63795111
  -0.76516407 -0.29335613 -0.22036897]
 [ 0.68295054  0.5

In [590]:
print(classification_report(y_train, prediction))

             precision    recall  f1-score   support

          0       0.74      0.94      0.83    150704
          1       0.00      0.00      0.00       472
          2       0.00      0.00      0.00       494
          4       0.00      0.00      0.00       457
          5       0.00      0.00      0.00       463
          6       0.00      0.00      0.00       949
          7       0.00      0.00      0.00      1433
          8       0.00      0.00      0.00      2336
          9       0.48      0.17      0.25     51830

avg / total       0.66      0.72      0.66    209138





In [591]:
print('Test accuracy:', model.score(X_test,y_test))

Test accuracy: 0.723968269907
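
The grid search above selects parameters by `neg_mean_squared_error`, which treats the class labels as numbers, and the report shows the minority classes are never predicted. One alternative worth trying, sketched here on synthetic data rather than the project's sensor data, is to select by macro-averaged F1 (`scoring='f1_macro'`), which weights every class equally. The parameter grid and the `*_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.RandomState(0)

# Synthetic, imbalanced three-class toy data (not the sensor data).
X_demo = rng.normal(size=(600, 4))
y_demo = np.where(X_demo[:, 0] > 1.0, 2,
                  np.where(X_demo[:, 1] > 1.0, 1, 0))

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 1.0, 100.0],
                "class_weight": [None, "balanced"]},
    # Stratified folds keep the rare classes present in every fold.
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    # Macro F1 averages per-class F1, so rare classes count as much as class 0.
    scoring="f1_macro",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("best macro F1:", round(search.best_score_, 3))
```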


#### Testing Lasso, Ridge, ElasticNet

In [592]:
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_squared_error

In [593]:
params = [
    {'alpha': [0.1, 0.2, 0.3, 0.5]}
]
lasso = Lasso()
grid_search_lasso = GridSearchCV(lasso, params, cv=5, scoring='neg_mean_squared_error')
grid_search_lasso.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'alpha': [0.1, 0.2, 0.3, 0.5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [594]:
grid_search_lasso.best_estimator_
lasso_model = grid_search_lasso.best_estimator_
lasso_model.fit(X_train, y_train)

Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [595]:
prediction2 = lasso_model.predict(X_train)
mse2 = mean_squared_error(y_train, prediction2)
print('root mean squared error:\n{}\n'.format(np.sqrt(mse2)))

root mean squared error:
3.718970833212442



In [596]:
print('Train score r2:', lasso_model.score(X_train,y_train))
print('Test score r2:', lasso_model.score(X_test,y_test))

Train score r2: 0.109126820172
Test score r2: 0.112861989046


In [597]:
ridge = Ridge()
grid_seaerch_ridge = GridSearchCV(ridge, params, cv = 3, scoring='neg_mean_squared_error')
grid_seaerch_ridge.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'alpha': [0.1, 0.2, 0.3, 0.5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [598]:
grid_seaerch_ridge.best_score_
ridge_model = grid_seaerch_ridge.best_estimator_
ridge_model.fit(X_train, y_train)

Ridge(alpha=0.5, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [599]:
prediction3 = ridge_model.predict(X_train)
mse3 = mean_squared_error(y_train, prediction3)
print('root mean squared error:\n{}\n'.format(np.sqrt(mse3)))

root mean squared error:
3.692066535689094



In [600]:
print('Train score r2:', ridge_model.score(X_train,y_train))
print('Test score r2:', ridge_model.score(X_test,y_test))

Train score r2: 0.121969953762
Test score r2: 0.127602546279


In [601]:
en = ElasticNet()
grid_seaerch_en = GridSearchCV(en, params, cv = 3, scoring='neg_mean_squared_error')
grid_seaerch_en.fit(X_train, y_train)

GridSearchCV(cv=3, error_score='raise',
       estimator=ElasticNet(alpha=1.0, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'alpha': [0.1, 0.2, 0.3, 0.5]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [602]:
grid_seaerch_en.best_score_
en_model = grid_seaerch_en.best_estimator_
en_model.fit(X_train, y_train)

ElasticNet(alpha=0.1, copy_X=True, fit_intercept=True, l1_ratio=0.5,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

In [603]:
prediction4 = en_model.predict(X_train)
mse4 = mean_squared_error(y_train, prediction4)
print('root mean squared error:\n{}\n'.format(np.sqrt(mse4)))

root mean squared error:
3.70986240452818



In [604]:
print('Train score r2:', en_model.score(X_train,y_train))
print('Test score r2:', en_model.score(X_test,y_test))

Train score r2: 0.113485293689
Test score r2: 0.117569538757


#### KNeighbors

In [698]:
df.columns

Index(['dot_week', 'light_level', 'light_log_mms', 'light_log_sss',
       'location_black', 'location_blue', 'location_green', 'location_orange',
       'location_purple', 'motion', 'present', 'sound_log_mms',
       'sound_log_sss', 'sun_cat', 'light_level_sss'],
      dtype='object')

In [699]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
knn = KNeighborsClassifier(n_neighbors=5)

In [700]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [701]:
prediction5 = knn.predict(X_train)

In [702]:
print(confusion_matrix(y_train, prediction5))

[[277295    142    109    226    392  17077]
 [  3166    497     82     64     64   1226]
 [  2992     82    514    119     63   1252]
 [  4337     92    135    974    156   1911]
 [  5297     57     92    179   1432   2660]
 [ 35982    217    301    442    968  81326]]


In [703]:
print(classification_report(y_train, prediction5))

             precision    recall  f1-score   support

          0       0.84      0.94      0.89    295241
          1       0.46      0.10      0.16      5099
          2       0.42      0.10      0.16      5022
          3       0.49      0.13      0.20      7605
          4       0.47      0.15      0.22      9717
          5       0.77      0.68      0.72    119236

avg / total       0.80      0.82      0.80    441920



In [704]:
print('Train accuracy:', knn.score(X_train,y_train))
print('Test accuracy:', knn.score(X_test,y_test))

Train accuracy: 0.819238776249
Test accuracy: 0.77003808373


In [706]:
k = [i for i in range(2, 10, 2)]
grid_search_knn_params = [{'n_neighbors': k}]


grid_seaerch_en = GridSearchCV(knn, grid_search_knn_params, cv = 5, scoring='neg_mean_squared_error')
grid_seaerch_en.fit(X_train, y_train)

GridSearchCV(cv=5, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'n_neighbors': [2, 4, 6, 8]}], pre_dispatch='2*n_jobs',
       refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=0)

In [707]:
grid_seaerch_en.best_score_

-4.4258485698769006

In [709]:
knn_model = grid_seaerch_en.best_estimator_
knn_model.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=8, p=2,
           weights='uniform')

In [710]:
prediction5 = knn_model.predict(X_train)

In [711]:
print(classification_report(y_train, prediction5))

             precision    recall  f1-score   support

          0       0.83      0.95      0.88    295241
          1       0.43      0.05      0.09      5099
          2       0.39      0.06      0.10      5022
          3       0.47      0.07      0.12      7605
          4       0.48      0.09      0.15      9717
          5       0.76      0.64      0.69    119236

avg / total       0.79      0.81      0.79    441920



In [712]:
print('Train accuracy:', knn_model.score(X_train,y_train))
print('Test accuracy:', knn_model.score(X_test,y_test))

Train accuracy: 0.809071777697
Test accuracy: 0.779182252031


In [690]:
from sklearn.model_selection import cross_val_predict

In [691]:
y_pred = cross_val_predict(knn, X_train, y_train)

In [692]:
conf_mx = confusion_matrix(y_train, y_pred)
conf_mx

array([[254121,    230,    184,    350,    766,  28466],
       [  3326,     72,     34,     46,     58,   1371],
       [  3186,     29,     72,     82,     75,   1380],
       [  4761,     48,     94,    205,    165,   2078],
       [  5944,     44,     65,    128,    312,   3043],
       [ 51375,    202,    235,    467,    848,  60055]])
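
Raw counts are hard to compare across classes of very different sizes. A common follow-up, sketched here with the matrix values copied from the output above, is to divide each row by its total so the diagonal reads as per-class recall (`norm_conf_mx` and `recall_per_class` are illustrative names).

```python
import numpy as np

# Cross-validated confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
conf_mx = np.array([
    [254121,    230,    184,    350,    766,  28466],
    [  3326,     72,     34,     46,     58,   1371],
    [  3186,     29,     72,     82,     75,   1380],
    [  4761,     48,     94,    205,    165,   2078],
    [  5944,     44,     65,    128,    312,   3043],
    [ 51375,    202,    235,    467,    848,  60055],
])

# Divide each row by the number of true samples in that class.
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

# The diagonal of the row-normalized matrix is the recall of each class.
recall_per_class = np.diag(norm_conf_mx)
print(np.round(recall_per_class, 3))
```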

Regression models:
    - linear support vector regressor
    - linear vs non-linear kernel
    - random forest
    - XGBoost model

In [640]:
from sklearn.externals import joblib
filename = 'data/sensing_numeric_full.sav'
joblib.dump(df, filename)

['data/sensing_numeric_full.sav']
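
Note that `sklearn.externals.joblib` was deprecated and later removed from scikit-learn; on newer versions the standalone `joblib` package offers the same `dump` and `load` functions. A minimal round-trip sketch, using a throwaway frame and a temporary path rather than the project's files:

```python
import os
import tempfile

import joblib  # standalone package, replaces sklearn.externals.joblib
import pandas as pd

# Small stand-in frame; the real notebook saves the full sensing data set.
frame = pd.DataFrame({"light_level": [0, 5], "motion": [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, written to a temporary directory.
    path = os.path.join(tmp_dir, "sensing_demo.sav")
    joblib.dump(frame, path)
    restored = joblib.load(path)

print("round trip preserved the data:", restored.equals(frame))
```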