# **Activity 9.01 – Data splitting, scaling, and modeling**

1. For this activity, all you will need is the Pandas library, the modules from
sklearn, and numpy. Load them in the first cell of the notebook.
2. Use the power_plant.csv dataset – 'Datasets\\power_plant.csv'. Read
the data into a Pandas DataFrame, print out the shape, and list the first five rows.
The independent variables are as follows:
   - AT – ambient temperature
   - V – exhaust vacuum level
   - AP – ambient pressure
   - RH – relative humidity

   The dependent variable is EP – electrical power produced.
3. Split the data into a train, val, and test set with fractions of 0.8, 0.1, and 0.1
respectively, using Python and Pandas but not sklearn methods. You will use 0.8 for the train split because there is a large number of rows, so the validation and
test splits will still have enough rows.
4. Repeat the split in step 3 but use train_test_split. Call it once to split the
train data, and then call it again to split what remains into val and test.
5. Ensure that the row counts are correct in all cases.
6. Fit .StandardScaler() to the train data from step 3, and then transform train,
validation, and test X. Do not transform the EP column, as it is the target.
7. Fit a .LinearRegression() model to the scaled train data, using the X
variables to predict y (the EP column).
8. Print the R2 score and the RMSE of the model on the train, validation, and
test datasets.


In [1]:
!wget https://raw.githubusercontent.com/PacktWorkshops/The-Pandas-Workshop/master/Chapter09/Datasets/power_plant.csv

--2023-07-10 00:01:39--  https://raw.githubusercontent.com/PacktWorkshops/The-Pandas-Workshop/master/Chapter09/Datasets/power_plant.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 308694 (301K) [text/plain]
Saving to: ‘power_plant.csv’


2023-07-10 00:01:39 (11.4 MB/s) - ‘power_plant.csv’ saved [308694/308694]



1. For this activity, all you will need is the Pandas library, the modules from sklearn, and numpy. Load them in the first cell of the notebook.

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression as OLS
from sklearn.metrics import mean_squared_error

2. Use the power_plant.csv dataset – 'Datasets\\power_plant.csv'. Read the data into a Pandas DataFrame, print out the shape, and list the first five rows. The independent variables are as follows:
   - AT – ambient temperature
   - V – exhaust vacuum level
   - AP – ambient pressure
   - RH – relative humidity

   The dependent variable is EP – electrical power produced.

In [3]:
df = pd.read_csv('power_plant.csv')
df.shape

(9568, 5)

In [4]:
df.head()

      AT      V       AP     RH      EP
0   8.34  40.77  1010.84  90.01  480.48
1  23.64  58.49  1011.40  74.20  445.75
2  29.74  56.90  1007.15  41.91  438.76
3  19.07  49.69  1007.22  76.79  453.09
4  11.80  40.66  1017.13  97.20  464.43


3. Split the data into a train, val, and test set with fractions of 0.8, 0.1, and 0.1 respectively, using Python and Pandas but not sklearn methods. You will use 0.8 for the train split because there is a large number of rows, so the validation and test splits will still have enough rows.

In [5]:
np.random.seed(42)

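# Draw 80% of the index labels for training, without replacement.
train_rows = pd.Series(np.random.choice(
    list(df.index),
    int(0.8 * df.shape[0]),
    replace=False
))

# Draw 10% of the original row count from the labels not used for training.
val_rows = pd.Series(np.random.choice(
    list(df.drop(train_rows, axis=0).index),
    int(0.1 * df.shape[0]),
    replace=False
))

# Everything left over becomes the test set.
test_rows = pd.Series(df.drop(pd.concat([train_rows, val_rows]), axis=0).index)

# The sampled values are index labels, so select with .loc rather than .iloc.
train_data = df.loc[train_rows, :]
val_data = df.loc[val_rows, :]
test_data = df.loc[test_rows, :]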

print('Train shape = ', train_data.shape, '\n'
      'Val shape = ', val_data.shape, '\n'
      'Test shape = ', test_data.shape)


Train shape =  (7654, 5) 
Val shape =  (956, 5) 
Test shape =  (958, 5)
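
The same kind of split can also be produced by shuffling the rows once and slicing by position, which avoids the repeated `drop` calls. The following is a minimal sketch of that alternative, not the activity's own solution; `demo_df` is a synthetic stand-in for the power plant data, and the `_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the power plant data (values are random, not real).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(1000, 5)),
                       columns=['AT', 'V', 'AP', 'RH', 'EP'])

# Shuffle every row once, then cut the shuffled frame by position.
shuffled = demo_df.sample(frac=1, random_state=42)
n_train = int(0.8 * len(shuffled))
n_val = int(0.1 * len(shuffled))

train_demo = shuffled.iloc[:n_train]
val_demo = shuffled.iloc[n_train:n_train + n_val]
test_demo = shuffled.iloc[n_train + n_val:]  # whatever remains

print(train_demo.shape, val_demo.shape, test_demo.shape)
# (800, 5) (100, 5) (100, 5)
```

Because the three slices come from one shuffled frame, they cannot overlap, and the test slice absorbs any rounding remainder.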


4. Repeat the split in step 3 but use train_test_split. Call it once to split the train data, and then call it again to split what remains into val and test.

In [6]:
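# First split off 80% for training; the remaining 20% is held in rest_data_2.
train_data_2, rest_data_2 = \
  train_test_split(df, train_size=0.8, random_state=42)
# Then split the remainder in half to get the validation and test sets.
val_data_2, test_data_2 = \
  train_test_split(rest_data_2, test_size=0.5, random_state=42)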

print('Train shape = ', train_data_2.shape, '\n'
      'Val shape = ', val_data_2.shape, '\n'
      'Test shape = ', test_data_2.shape)


Train shape =  (7654, 5) 
Val shape =  (957, 5) 
Test shape =  (957, 5)
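
In the second call, `test_size` is a fraction of the remainder, not of the full dataset. With 0.1 and 0.1 left over, that fraction is 0.5, but for unequal fractions it has to be computed. The sketch below shows one way to do this; the 0.6/0.3/0.1 fractions and the `demo_df` frame are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic frame; the values are placeholders.
demo_df = pd.DataFrame(np.arange(2000).reshape(1000, 2), columns=['x', 'y'])

# Hypothetical target fractions of the full dataset.
train_frac, val_frac, test_frac = 0.6, 0.3, 0.1

# First call: carve off the training rows.
train_part, rest = train_test_split(demo_df, train_size=train_frac,
                                    random_state=0)

# Second call: test_size is relative to `rest`, so rescale the fraction.
test_share = test_frac / (val_frac + test_frac)
val_part, test_part = train_test_split(rest, test_size=test_share,
                                       random_state=0)

print(len(train_part), len(val_part), len(test_part))
```

The split sizes may differ from the exact targets by a row because of rounding, but they always sum to the original row count.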


5. Ensure that the row counts are correct in all cases.

In [7]:
# Add up the row counts of each split and compare with the full dataset.
print(sum(len(part) for part in [train_data, val_data, test_data]))
print(sum(len(part) for part in [train_data_2, val_data_2, test_data_2]))
print(df.shape[0])

9568
9568
9568
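
Matching row counts do not rule out one row landing in two splits while another is dropped. A stricter check confirms that the splits cover every index label exactly once. The helper below, `is_partition`, is a hypothetical name introduced for this sketch, and the ten-row frame is synthetic.

```python
import pandas as pd

def is_partition(full_df, parts):
    """Return True if the parts together contain each row of full_df exactly once."""
    combined = pd.concat(parts).index
    return bool(
        len(combined) == len(full_df)
        and not combined.duplicated().any()
        and set(combined) == set(full_df.index)
    )

demo_df = pd.DataFrame({'x': range(10)})

# A correct split: every row appears in exactly one part.
good = [demo_df.iloc[:6], demo_df.iloc[6:8], demo_df.iloc[8:]]

# A faulty split with the right total (6 + 3 + 1 = 10 rows),
# but row 5 appears twice and row 8 is missing.
bad = [demo_df.iloc[:6], demo_df.iloc[5:8], demo_df.iloc[9:]]

print(is_partition(demo_df, good))  # True
print(is_partition(demo_df, bad))   # False
```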


6. Fit .StandardScaler() to the train data from step 3, and then transform train, validation, and test X. Do not transform the EP column, as it is the target.

In [8]:
scaler = StandardScaler()
scaler.fit(train_data.iloc[:, :-1])

train_X = scaler.transform(train_data.iloc[:, :-1])
train_y = train_data['EP']

val_X = scaler.transform(val_data.iloc[:, :-1])
val_y = val_data['EP']

test_X = scaler.transform(test_data.iloc[:, :-1])
test_y = test_data['EP']
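
Fitting the scaler on the train data only means the validation and test features are shifted and scaled by the train mean and standard deviation, so no information from those splits leaks into preprocessing. The sketch below reproduces the transform by hand on synthetic arrays to make that explicit; the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrices standing in for the train and validation X.
rng = np.random.default_rng(1)
X_train_demo = rng.normal(loc=50, scale=10, size=(200, 3))
X_val_demo = rng.normal(loc=50, scale=10, size=(50, 3))

scaler_demo = StandardScaler().fit(X_train_demo)

# StandardScaler uses the per-column mean and the population
# standard deviation (ddof=0) of the data it was fitted on.
mu = X_train_demo.mean(axis=0)
sigma = X_train_demo.std(axis=0)
manual_val = (X_val_demo - mu) / sigma

print(np.allclose(manual_val, scaler_demo.transform(X_val_demo)))  # True
```

The scaled train columns have mean 0 and standard deviation 1; the scaled validation columns will only be close to those values, which is expected.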

7. Fit a .LinearRegression() model to the scaled train data, using the X variables to predict y (the EP column).

In [9]:
linear_model = OLS()
linear_model.fit(train_X, train_y)
linear_model

8. Print the R2 score and the RMSE of the model on the train, validation, and test datasets.

In [10]:
print('train score: ', linear_model.score(train_X, train_y),
      '\nvalidation score: ', linear_model.score(val_X, val_y),
      '\ntest score: ', linear_model.score(test_X, test_y))

# mean_squared_error returns the MSE, so take the square root to get the RMSE.
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

print('train RMSE: ',
      round(rmse(train_y, linear_model.predict(train_X)), 4),
      '\nvalidation RMSE: ',
      round(rmse(val_y, linear_model.predict(val_X)), 4),
      '\ntest RMSE: ',
      round(rmse(test_y, linear_model.predict(test_X)), 4))

train score:  0.9287072840354756 
validation score:  0.9238845251967255 
test score:  0.9333918854821254
train RMSE:  4.5533 
validation RMSE:  4.7771 
test RMSE:  4.3616
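
For reference, the RMSE is the square root of the mean squared error, which puts it in the same units as the target, and R2 compares the model's squared error with that of always predicting the mean. The sketch below computes both by hand with numpy on a tiny made-up example; the arrays are illustrative and unrelated to the power plant data.

```python
import numpy as np

# Made-up targets and predictions for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Mean squared error and its square root.
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# R2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, round(rmse, 4), round(r2, 4))
# 0.5 0.7071 0.9
```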
