### Split the dataset into training and validation sets in an 80/20 ratio

Load the data and fill in the missing values (with zeros and means). Build a separate linear regression model for each hour of the day, using air temperature (air_temperature), humidity (dew_temperature, the dew point), atmospheric pressure (sea_level_pressure), wind speed (wind_speed), and cloud cover (cloud_coverage).

Evaluate the quality of the fitted model on the validation data. Use the following data:

http://video.ittensive.com/machine-learning/ashrae/building_metadata.csv.gz

http://video.ittensive.com/machine-learning/ashrae/weather_train.csv.gz

http://video.ittensive.com/machine-learning/ashrae/train.0.0.csv.gz

### Questions for this assignment

What is the quality of the hourly linear regression model, rounded to one decimal place? For example, 0.0

### Importing libraries

In [14]:
%matplotlib inline
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

### Loading the data

In [15]:
buildings = pd.read_csv("http://video.ittensive.com/machine-learning/ashrae/building_metadata.csv.gz")
weather = pd.read_csv("http://video.ittensive.com/machine-learning/ashrae/weather_train.csv.gz")
energy_0 = pd.read_csv("http://video.ittensive.com/machine-learning/ashrae/train.0.0.csv.gz")

### Merging and filtering the data

In [16]:
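# Attach building metadata (this adds site_id) to each meter reading.
energy_0 = pd.merge(left=energy_0, right=buildings, how="left",
                   left_on="building_id", right_on="building_id")
# Join the weather on (timestamp, site_id) through a shared index.
energy_0.set_index(["timestamp", "site_id"], inplace=True)
weather.set_index(["timestamp", "site_id"], inplace=True)
energy_0 = pd.merge(left=energy_0, right=weather, how="left",
                   left_index=True, right_index=True)
energy_0.reset_index(inplace=True)
# Keep only positive readings; .copy() avoids SettingWithCopyWarning on later assignments.
energy_0 = energy_0[energy_0["meter_reading"] > 0].copy()
# Parse the timestamp and derive the hour of day used by the per-hour models.
energy_0["timestamp"] = pd.to_datetime(energy_0["timestamp"])
energy_0["hour"] = energy_0["timestamp"].dt.hour
print(energy_0.head())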

               timestamp  site_id  building_id  meter  meter_reading  \
704  2016-01-30 08:00:00        0            0      0        43.6839   
725  2016-01-31 05:00:00        0            0      0        37.5408   
737  2016-01-31 17:00:00        0            0      0        52.5571   
2366 2016-04-08 14:00:00        0            0      0        59.3827   
2923 2016-05-01 19:00:00        0            0      0       448.0000   

     primary_use  square_feet  year_built  floor_count  air_temperature  \
704    Education         7432      2008.0          NaN              8.3   
725    Education         7432      2008.0          NaN             12.8   
737    Education         7432      2008.0          NaN             20.6   
2366   Education         7432      2008.0          NaN             21.7   
2923   Education         7432      2008.0          NaN             31.1   

      cloud_coverage  dew_temperature  precip_depth_1_hr  sea_level_pressure  \
704              NaN              6.
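
The two-step join above can be tried on a small scale first. The sketch below is only an illustration: the frames `readings`, `meta`, and `weather_toy` and all of their values are made up, and only the column names follow the dataset. It joins on columns rather than on an index, which should give the same rows.

```python
import pandas as pd

# Toy stand-ins for the real tables; every value here is invented.
readings = pd.DataFrame({
    "building_id": [0, 0, 0],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00",
                  "2016-01-01 02:00:00"],
    "meter_reading": [0.0, 43.7, 37.5],
})
meta = pd.DataFrame({"building_id": [0], "site_id": [0], "square_feet": [7432]})
weather_toy = pd.DataFrame({
    "site_id": [0, 0],
    "timestamp": ["2016-01-01 00:00:00", "2016-01-01 01:00:00"],
    "air_temperature": [25.0, 24.4],
})

# Step 1: attach building metadata, which brings in site_id.
merged = readings.merge(meta, how="left", on="building_id")
# Step 2: attach weather by (timestamp, site_id); a left join keeps every reading.
merged = merged.merge(weather_toy, how="left", on=["timestamp", "site_id"])

# Keep only positive readings, then derive the hour of day.
merged = merged[merged["meter_reading"] > 0].copy()
merged["timestamp"] = pd.to_datetime(merged["timestamp"])
merged["hour"] = merged["timestamp"].dt.hour
print(merged)
```

The reading at 02:00 has no matching weather row, so its air_temperature is NaN after the left join. Gaps of this kind are what the cleaning step has to fill.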

### Cleaning the data

In [17]:
# Fill gaps in these weather columns with zeros (assignment instead of inplace on a column).
for column in ["air_temperature", "dew_temperature", "cloud_coverage", "wind_speed"]:
    energy_0[column] = energy_0[column].fillna(0)
# Treat missing or non-positive precipitation as zero.
energy_0["precip_depth_1_hr"] = energy_0["precip_depth_1_hr"].apply(lambda x: x if x > 0 else 0)
# Fill missing pressure with the column mean, since zero is not a plausible pressure.
energy_0_sea_level_pressure_mean = energy_0["sea_level_pressure"].mean()
energy_0["sea_level_pressure"] = energy_0["sea_level_pressure"].fillna(energy_0_sea_level_pressure_mean)
print(energy_0.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5411 entries, 704 to 8783
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   timestamp           5411 non-null   datetime64[ns]
 1   site_id             5411 non-null   int64         
 2   building_id         5411 non-null   int64         
 3   meter               5411 non-null   int64         
 4   meter_reading       5411 non-null   float64       
 5   primary_use         5411 non-null   object        
 6   square_feet         5411 non-null   int64         
 7   year_built          5411 non-null   float64       
 8   floor_count         0 non-null      float64       
 9   air_temperature     5411 non-null   float64       
 10  cloud_coverage      5411 non-null   float64       
 11  dew_temperature     5411 non-null   float64       
 12  precip_depth_1_hr   5411 non-null   float64       
 13  sea_level_pressure  5411 non-null   float64   
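
To see what each fill rule does, here is a minimal sketch on a made-up three-row frame (the name `toy` and its values are assumptions, not part of the dataset). It uses `clip` plus `fillna` for precipitation, which should be equivalent to the lambda above.

```python
import numpy as np
import pandas as pd

# Invented weather values with gaps; column names follow the dataset.
toy = pd.DataFrame({
    "air_temperature": [20.0, np.nan, 22.0],
    "precip_depth_1_hr": [-1.0, np.nan, 3.0],
    "sea_level_pressure": [1010.0, np.nan, 1020.0],
})

# Zero-fill a column where 0 is used as the placeholder.
toy["air_temperature"] = toy["air_temperature"].fillna(0)
# Negative values are raised to 0 by clip; NaN survives clip, so fill it afterwards.
toy["precip_depth_1_hr"] = toy["precip_depth_1_hr"].clip(lower=0).fillna(0)
# mean() skips NaN by default, so the fill value is the mean of the known readings.
toy["sea_level_pressure"] = toy["sea_level_pressure"].fillna(
    toy["sea_level_pressure"].mean())
print(toy)
```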

### Splitting the data
Into training and validation sets

In [19]:
energy_0_train, energy_0_test = train_test_split(energy_0, test_size=0.2)
print (energy_0_train.head())

               timestamp  site_id  building_id  meter  meter_reading  \
7421 2016-11-05 05:00:00        0            0      0       221.1490   
5648 2016-08-23 08:00:00        0            0      0       307.8350   
4253 2016-06-26 05:00:00        0            0      0       240.9440   
8214 2016-12-08 06:00:00        0            0      0        91.4631   
8339 2016-12-13 11:00:00        0            0      0        72.3514   

     primary_use  square_feet  year_built  floor_count  air_temperature  \
7421   Education         7432      2008.0          NaN             20.6   
5648   Education         7432      2008.0          NaN             26.1   
4253   Education         7432      2008.0          NaN             26.7   
8214   Education         7432      2008.0          NaN             16.7   
8339   Education         7432      2008.0          NaN             17.2   

      cloud_coverage  dew_temperature  precip_depth_1_hr  sea_level_pressure  \
7421             4.0             18.
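
The split above is random, so the rows that land in each set (and the final score) change from run to run. If a repeatable result is wanted, one option is to pass `random_state`. The sketch below assumes a made-up 100-row frame named `frame`, and the seed 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame standing in for energy_0.
frame = pd.DataFrame({
    "meter_reading": list(range(100)),
    "hour": [i % 24 for i in range(100)],
})

# random_state is an addition made here; the original call leaves it unset.
train, test = train_test_split(frame, test_size=0.2, random_state=42)
print(len(train), len(test))  # 80 20
```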

### Hourly linear regression
The model uses air_temperature, dew_temperature, sea_level_pressure, wind_speed, and cloud_coverage

In [20]:
hours = range(0, 24)
# Keep only the target, the five weather features, and the hour.
energy_0_train_lr = pd.DataFrame(energy_0_train,
        columns=["meter_reading", "air_temperature", "dew_temperature",
                "sea_level_pressure", "wind_speed", "cloud_coverage", "hour"])

# One entry per hour: five coefficients followed by the intercept.
energy_0_lr = [None] * len(hours)
for hour in hours:
    energy_0_lr_hourly = energy_0_train_lr[energy_0_train_lr["hour"] == hour]
    y = energy_0_lr_hourly["meter_reading"]
    x = energy_0_lr_hourly.drop(labels=["meter_reading", "hour"], axis=1)

    model = LinearRegression().fit(x, y)
    energy_0_lr[hour] = np.append(model.coef_, model.intercept_)
print(energy_0_lr)

[array([ 4.30969439e+00,  2.90952941e+00, -1.22557441e+00, -3.13118265e+00,
       -2.62251864e+00,  1.34234413e+03]), array([ 7.10389915e+00,  1.21286690e+00, -3.94196921e-01, -2.27104585e+00,
       -1.20103255e+00,  4.51490341e+02]), array([ 6.08512208e+00,  1.98470874e+00, -2.19631662e-01, -2.72996449e+00,
       -1.66213496e+00,  2.89846698e+02]), array([ 8.08113691e+00, -2.66048347e-02, -4.16490869e-01, -2.68394992e+00,
        5.74026119e+00,  4.77634440e+02]), array([ 9.82748767e+00, -2.66582966e+00, -7.91258463e-01, -3.41054638e+00,
        5.12416559e+00,  8.76142198e+02]), array([ 1.08629992e+01, -3.33445070e+00, -8.46668739e-01, -3.48392003e+00,
        1.58866405e+00,  9.25806718e+02]), array([  8.09162866,   0.40515371,   0.07531936,  -0.3136902 ,
        -0.98610532, -24.28873855]), array([   4.63162811,    1.83831538,    0.54358031,   -0.60939751,
          0.5397949 , -448.65552705]), array([  10.79328061,   -2.78842465,    0.22978871,   -4.01493125,
          2.443222
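
An alternative to the explicit loop is to group by hour and keep the fitted estimators themselves, so that `predict` can be used later. This is a sketch, not the notebook's method: the frame `synthetic` is generated from a known linear rule so the fit can be checked, and the names `features` and `models` are introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["air_temperature", "dew_temperature", "sea_level_pressure",
            "wind_speed", "cloud_coverage"]

# Synthetic data: 30 rows per hour, target built from a known noise-free rule.
rng = np.random.default_rng(0)
n = 24 * 30
synthetic = pd.DataFrame(rng.normal(size=(n, len(features))), columns=features)
synthetic["hour"] = np.arange(n) % 24
synthetic["meter_reading"] = (100 + 5 * synthetic["air_temperature"]
                              + synthetic["hour"])

# One fitted model per hour, keyed by the hour value.
models = {
    hour: LinearRegression().fit(group[features], group["meter_reading"])
    for hour, group in synthetic.groupby("hour")
}
print(len(models), models[7].coef_.round(3), round(float(models[7].intercept_), 3))
```

Because the synthetic target has no noise, each hour's model should recover a weight of about 5 on air_temperature and an intercept of about 100 plus the hour.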

### Prediction and model evaluation
We check the model on the validation data using the required metric, RMSLE (root mean squared logarithmic error).
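
Written out, the quantity computed in this section appears to be the following, where $y_i$ is the measured meter_reading, $\hat{y}_i$ is the linear prediction for that row's hour, and $n$ is the number of validation rows (these symbols are introduced here for illustration):

$$
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\ln(y_i + 1) - \ln(\hat{y}_i + 1)\bigr)^2}
$$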

In [22]:
def calculate_model(x):
    # Coefficients for this row's hour: five weights followed by the intercept.
    model = energy_0_lr[x.hour]
    meter_reading_log = np.log(x.meter_reading + 1)
    # Log of (1 + linear prediction); this is NaN if the prediction falls below -1.
    meter_reading_lr = np.log(1 + x.air_temperature * model[0] +
        x.dew_temperature * model[1] + x.sea_level_pressure * model[2] +
        x.wind_speed * model[3] + x.cloud_coverage * model[4] + model[5])
    # Squared log error for this row.
    x["meter_reading_lr_q"] = (meter_reading_log - meter_reading_lr)**2
    return x

energy_0_test = energy_0_test.apply(calculate_model,
                                   axis=1, result_type="expand")
# RMSLE: square root of the mean squared log error over the validation rows.
energy_0_test_lr_rmsle = np.sqrt(energy_0_test["meter_reading_lr_q"].sum() /
                                len(energy_0_test))
print("Quality of hourly linear regression, 5 parameters:",
      energy_0_test_lr_rmsle, round(energy_0_test_lr_rmsle, 2))

Quality of hourly linear regression, 5 parameters: 0.20469548094316103 0.2
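
The row-by-row `apply` works, but the same metric can be computed on whole arrays. The helper below is a sketch with a hypothetical name, `rmsle`. Unlike the cell above, it clips negative predictions to zero, which is an assumption made here so that the logarithm is always defined.

```python
import numpy as np

def rmsle(actual, predicted):
    """Root mean squared logarithmic error; negative predictions are clipped to 0."""
    actual = np.asarray(actual, dtype=float)
    # Assumption: clip at 0 so that log1p never receives a value below -1.
    predicted = np.clip(np.asarray(predicted, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(actual) - np.log1p(predicted)) ** 2)))

# Identical inputs give zero error.
print(rmsle([10.0, 100.0], [10.0, 100.0]))  # 0.0
```

With the per-hour coefficients, `predicted` would be the linear prediction for each validation row and `actual` the meter_reading column. Whether clipping changes the score depends on how many predictions are negative.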
