## **WIND-TURBINE POWER PREDICTION (MACHINE LEARNING - ADVANCED)**

**Submitted By:**

**SNEHA KULKARNI**

**INSAID- GCD, July 18th 2021 Cohort**


---
# **Table of Contents**
---

1. [**Introduction**](#Section1)<br>
2. [**Problem Statement**](#Section2)<br>
3. [**Installing & Importing Libraries**](#Section3)<br>
  3.1 [**Installing Libraries**](#Section31)<br>
  3.2 [**Importing Libraries**](#Section32)<br>
4. [**Data Acquisition & Description**](#Section4)<br>
5. [**Data Pre-Profiling**](#Section5)<br>
6. [**Data Pre-Processing**](#Section6)<br>
7. [**Data Post-Profiling**](#Section7)<br>
8. [**Exploratory Data Analysis**](#Section8)<br>
9. [**Splitting data into training and test**](#Section9)<br>
10. [**Scaling**](#Section10)<br>
11. [**Model Building**](#Section11)<br>
12. [**Predicting Unseen Data**](#Section12)<br>
13. [**Preparing Submission File**](#Section13)<br>
14. [**Summary**](#Section14)<br>


---
<a name = Section1></a>
# **1. Introduction**

---

 Company Introduction - Energy Limited

 Your client for this project is a renewable energy institution.

  1) They want to estimate the amount of power generated by a wind turbine (in kW/h) using real-time data.

  2) Factors such as temperature, wind direction, turbine status, weather, and blade length influence the amount of power generated.

  3) We have to select the most important features, which help us generate more power efficiently.

### Current Scenario
 -  The company has rolled out this service to several areas and will monitor which features increase the power generated by the turbines. Using this, they can map those areas for future investments.


---
<a name = Section2></a>
# **2. Problem Statement**

---

Moving from traditional energy plans powered by fossil fuels to unlimited renewable energy subscriptions allows for instant access to clean energy without heavy investment in infrastructure like Wind Turbines.

### The current process suffers from the following problems:
- 1) One issue is that **spinning turbine** blades can pose a threat to flying wildlife like birds and bats.
- 2) Wind energy can have adverse environmental impacts, including the potential to reduce, fragment, or degrade habitat for wildlife, fish, and plants.
- 3) The company wants to figure out how they can manage these challenges to produce wind energy in an efficient manner.

The energy department has hired you as data science consultants.

### Your Role
- You are given datasets of wind turbines and the power generated by them.
- Your task is to build a regression model using the datasets.
- Because there was no machine learning model for this problem in the company, you don’t have a quantifiable win condition. You need to build the best possible model.

### Project Deliverable
   - Deliverable: **Predict the power generated (in kW/h) based on the features provided in the dataset.**
   - Machine Learning Task: **Regression**
   - Target Variable: **windmill_generated_power(kW/h)**
   - Win Condition: **N/A (best possible model)**

### Evaluation Metric
   - The model evaluation will be based on the **r2** Score.
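
For reference, R2 compares the model's squared error with that of a baseline that always predicts the mean of the target. The sketch below computes it by hand and checks the result against `sklearn.metrics.r2_score`. The arrays are made-up illustrative values, not taken from the wind-turbine dataset.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only; not taken from the wind-turbine dataset.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1 is a perfect fit, 0 matches the mean baseline, and a negative score means the model is worse than always predicting the mean.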



---
<a name = Section3></a>
# **3. Installing and Importing Libraries**

---



---
<a name = Section31></a>
## **3.1 Installing Libraries**

---

In [None]:
!pip install -q datascience                   # Package that is required by pandas profiling
!pip install -q pandas-profiling              # Library to generate basic statistics about data
!pip install -q --upgrade pandas-profiling

In [None]:
pip install MarkupSafe==2.0.1

---
<a name = Section32></a>
## **3.2 Importing Libraries**

---

In [5]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.5f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing package numpy (for Numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
import plotly.graph_objs as go                                      # Importing plotly for interactive visualizations
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler                   # Importing to scale the features in the dataset
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split                # To properly split the dataset into train and test sets
from sklearn.linear_model import LogisticRegression                 # Logistic regression (a classifier, not a linear regression model)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier                 # Random forest classifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB                          # Gaussian naive Bayes classifier
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics                                         # Importing to evaluate the model used for regression
from sklearn.decomposition import PCA                               # Importing to create an instance of PCA model
#-------------------------------------------------------------------------------------------------------------------------------
from random import randint                                          # Importing to generate random integers
#-------------------------------------------------------------------------------------------------------------------------------
import time                                                         # For time functionality
import warnings                                                     # Importing warning to disable runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings
#-------------------------------------------------------------------------------------------------------------------------------
!pip install imblearn
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score



---
<a name = Section4></a>
# **4 Data acquisition and Description**


|**ID**|**Feature**|**Description**|
|:--|:--|:--|
|01| tracking_id       | Represents a unique identification number of a windmill.|
|02| datetime  | Represents the date and time of a record.|  
|03| wind_speed(m/s)       | Represents the speed of wind (in meter per second).| 
|04| atmospheric_temperature(°C)    | Represents the temperature (in degree Celsius) of a town or village that the windmill is present in.|   
|05| shaft_temperature(°C)    | Represents the temperature of the shaft (in degree Celsius). |
|06| blades_angle(°)       | Represents the angle of the blades of a wind turbine (in degrees).|
|07| gearbox_temperature(°C)     | Represents the temperature of a gearbox (in degree Celsius).|
|08| engine_temperature(°C) | Represents the temperature of an engine (in degree Celsius).|
|09| motor_torque(N-m) | Represents the torque of a motor (in Newton meter).|
|10| generator_temperature(°C) | Represents the temperature of a generator (in degree Celsius).|
|11| atmospheric_pressure(Pascal) | Represents the atmospheric pressure (in Pascals) in that area.|
|12| area_temperature(°C)      | Represents the temperature (in degree Celsius) of the area within a 100 m radius of the windmill.|
|13| windmill_body_temperature(°C) | Represents the temperature of the body of a windmill (in degree Celsius).|
|14| wind_direction(°) | Represents the direction of the wind (in degrees).|
|15| resistance(ohm) | Represents the resistance against the wind.|
|16| rotor_torque(N-m) | Represents the torque of a rotor (in Newton meter).|
|17| turbine_status | Represents the status of a turbine (a categorical feature).|
|18| cloud_level | Represents the following levels of the cloud in the sky on a particular day: Extremely low, Low, Medium.|
|19| blade_length(m) | Represents the length of the blades of a windmill (in meters).|
|20| blade_breadth(m) | Represents the breadth of the blades of a windmill (in meters).|
|21| windmill_height(m) | Represents the height of a windmill (in meters).|
|22| windmill_generated_power(kW/h)| Represents the power generated (in kilowatts per hour).|


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [6]:
df_windTurbine = pd.read_csv("/content/drive/MyDrive/Windturbine_train.csv")

In [None]:
df_windTurbine.head()

In [None]:
df_windTurbine.shape

In [None]:
df_windTurbine.info()

In [None]:
df_windTurbine.describe()

---
<a name = Section5></a>
# **5 Data Pre-Profiling**

---

In [None]:
# pandas profiling

from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis) 
windTurbine_profile=ProfileReport(df_windTurbine)
windTurbine_profile

---
<a name = Section6></a>
# **6 Data Preprocessing**

---

## Checking for Duplicates

In [11]:
# Function to check for duplicates in the dataset and drop them

def duplicate_values(df):
    print("Duplicate check...")
    n_duplicates = df.duplicated().sum()          # count fully duplicated rows
    print("There are", n_duplicates, "duplicate observations in the dataset.")

    if n_duplicates > 0:
        # Keep the first occurrence of each duplicated row; this modifies df in place
        df.drop_duplicates(keep = 'first', inplace = True)
        print(n_duplicates, "duplicates were dropped!")
    else:
        print("There are no duplicates")


In [None]:
duplicate_values(df_windTurbine)

## Checking for Missing Values

In [None]:
df_windTurbine.isnull().sum().sum()

In [13]:
# Function to return the number and percentage of missing values per column

def missing_values(df):
    missing_number = df.isnull().sum().sort_values(ascending = False)
    missing_percent = (100 * df.isnull().mean()).sort_values(ascending = False)   # share of rows, in percent
    summary = pd.concat([missing_number, missing_percent], axis = 1, keys = ['Missing_Number', 'Missing_Percent'])
    return summary[summary['Missing_Number'] > 0]

In [None]:
missing_values(df_windTurbine)

In [15]:
# Plot the percentage of missing data per column

def plot_missing_data(df):

  missing_percent = 100 * df.isnull().sum() / len(df)
  missing_percent = missing_percent[missing_percent > 0].sort_values()

  missing_percent.plot.bar()
  plt.ylabel('Missing values (%)')
  plt.show()



In [None]:
plot_missing_data(df_windTurbine)

In [17]:
# Print value counts for every object (categorical) column that has at least one null value,
# together with the number and percentage of nulls in that column
def object_vcs_and_nulls(df):
  for i in df:
    if df[i].dtype == 'O':
      if df[i].isnull().sum() > 0:
        print(df[i].value_counts())
        print("Number of Null Values: " + str(df[i].isnull().sum()))
        print("Percentage of Nulls = " + str(np.round(100 * df[i].isnull().sum() / len(df), 2)) + "%")
        print("\n")

In [None]:
object_vcs_and_nulls(df_windTurbine)

In [None]:
#Check ALL the NUMERICAL COLUMNS for -99 values and Replace/Substitute them with NaN values

(df_windTurbine == -99 ).sum(axis = 0)

In [20]:
df_windTurbine.replace(-99, np.nan, inplace=True)

In [None]:
#check for missing values in total after replacing the -99's with NAN's

missing_values(df_windTurbine)

In [None]:
# Count the zero values in each column
(df_windTurbine == 0 ).sum(axis = 0)

## Data Cleaning

### Imputing Missing Values

Imputing missing values in numerical columns

In [26]:
# Near-symmetric columns are filled with the mean and strongly skewed columns with the median.
# The skewness of each column is noted beside its name.
mean_cols = [
    'wind_speed(m/s)',                    # Skewness is -0.052
    'blades_angle(°)',                    # Skewness is -0.653
    'motor_torque(N-m)',                  # Skewness is 0.030
    'generator_temperature(°C)',          # Skewness is -0.200
    'atmospheric_pressure(Pascal)',       # Skewness is 0.074
    'wind_direction(°)',                  # Skewness is 0.170
    'resistance(ohm)',                    # Skewness is -0.682
    'blade_breadth(m)',                   # Skewness is -0.183
    'windmill_height(m)',                 # Skewness is -0.067
    'windmill_generated_power(kW/h)',     # Skewness is 0.683
]
median_cols = [
    'atmospheric_temperature(°C)',        # Skewness is -1.68
    'shaft_temperature(°C)',              # Skewness is -2.54
    'gearbox_temperature(°C)',            # Skewness is 1.30
    'engine_temperature(°C)',             # Skewness is -3.99
    'windmill_body_temperature(°C)',      # Skewness is -2.21
    'rotor_torque(N-m)',                  # Skewness is -1.04
    'blade_length(m)',                    # Skewness is -8.76
]
# Assign the result back instead of calling fillna(inplace=True) on a column selection
df_windTurbine[mean_cols] = df_windTurbine[mean_cols].fillna(df_windTurbine[mean_cols].mean())
df_windTurbine[median_cols] = df_windTurbine[median_cols].fillna(df_windTurbine[median_cols].median())

In [None]:
missing_values(df_windTurbine)

In [None]:
df_windTurbine.isna().sum().sum()
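
The mean-or-median choice above follows each column's skewness. A minimal sketch of automating that choice is shown below. The helper name `impute_by_skewness`, the threshold of 1.0, and the toy columns are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def impute_by_skewness(df, threshold=1.0):
    """Fill numeric NaNs with the mean if |skew| < threshold, else with the median.

    Hypothetical helper; the threshold of 1.0 is an assumption.
    """
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        skew = out[col].skew()
        fill = out[col].mean() if abs(skew) < threshold else out[col].median()
        out[col] = out[col].fillna(fill)
    return out

# Toy data: one symmetric column and one column with a large outlier.
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "skewed": [1.0, 1.0, 1.0, 1.0, 100.0, np.nan],
})
filled = impute_by_skewness(toy)
print(filled.tail(1))
```

The median is less sensitive to outliers, which is why it is the safer fill value for the strongly skewed columns.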

## Understanding Correlations

In [None]:
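plt.figure(figsize = (12, 12))
# Correlate numeric columns only; text columns such as 'datetime' cannot be correlated
corr = df_windTurbine.select_dtypes(include = 'number').corr()
sns.heatmap(corr, annot = True, cbar = False, linewidths = 2, vmax = 1, vmin = 0, square = True, cmap = 'Blues').set_title('HeatMap')

plt.show()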

**Observations**:
The feature "generator_temperature(°C)" has a high correlation of 0.93 with the feature "motor_torque(N-m)".

It also has a correlation greater than 0.5 with two other features.

Therefore, dropping the feature "generator_temperature(°C)" is an option to avoid multicollinearity. The drop is left commented out in the cells below, so the feature stays in the dataset.
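
Reading strongly correlated pairs off a large heatmap is error-prone, so the same check can be sketched in code. The helper name `high_corr_pairs`, the 0.9 threshold, and the synthetic columns below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.9):
    """Return (feature_a, feature_b, corr) for pairs with |corr| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep the strict upper triangle so each pair appears only once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, round(v, 3)) for (a, b), v in pairs.items()]

# Synthetic data: "b" is nearly collinear with "a", "c" is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
result = high_corr_pairs(toy)
print(result)
```

Applied to the turbine data, a call such as `high_corr_pairs(df_windTurbine)` would list candidate pairs like the one described in the observation.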

## Feature Engineering

In [30]:
# Convert the object column to a datetime column
df_windTurbine['datetime'] = pd.to_datetime(df_windTurbine['datetime'])

In [None]:
df_windTurbine.info()

In [32]:
# Extract month and hour from the datetime column
df_windTurbine['month'] = pd.DatetimeIndex(df_windTurbine['datetime']).month
df_windTurbine['hour'] = pd.DatetimeIndex(df_windTurbine['datetime']).hour

In [None]:
df_windTurbine.info()

In [None]:
df_windTurbine.shape

In [35]:
#drop the id and datetime columns

df_windTurbine = df_windTurbine.drop(['tracking_id','datetime'],axis =1)

In [None]:
# drop "generator_temperature(°C)" to avoid multicollinearity
#df_windTurbine = df_windTurbine.drop('generator_temperature(°C)',axis =1)

In [None]:
df_windTurbine.info()

## Imputing Missing Categorical Values with KNN Imputer

In [37]:
# Imputing the missing values in the categorical columns "turbine_status" and "cloud_level"
# with the mode would inflate the most frequent class and therefore bias the statistical analysis after imputation.

In [None]:
#Categorical Columns
df_objs = df_windTurbine.select_dtypes(include='object')
df_objs.head(5)

In [None]:
df_objs.shape

In [None]:
df_objs.isna().sum()

In [41]:
# First transform the categorical features into numeric ones while preserving the
# NaN values (a LabelEncoder that keeps missing values as NaN),
# then use the KNNImputer with only the nearest neighbour as the replacement.
# https://stackoverflow.com/questions/64900801/implementing-knn-imputation-on-categorical-variables-in-an-sklearn-pipeline

# https://stackoverflow.com/questions/54444260/labelencoder-that-keeps-missing-values-as-nan
# pd.Series(
#     LabelEncoder().fit_transform(series[series.notnull()]),
#     index=series[series.notnull()].index
# )
# series[series.notnull()] drops the NaN values, then feeds the rest to fit_transform.

# As the label encoder returns a numpy.array and discards the index,
# index=series[series.notnull()].index restores it so the result concatenates correctly.
# Without the index, values shift from their correct positions, and an IndexError may even occur.

# Single encoder for all columns:
# in that case, stack the dataframe, fit the encoder, then unstack it:


# series_stack = df_objs.stack().astype(str)
# label_encoder = LabelEncoder()
# df_objs = pd.Series(
#     label_encoder.fit_transform(series_stack),
#     index=series_stack.index
# ).unstack()
# # print(df_objs)


# As series_stack is a pd.Series containing NaNs,
# all values in the DataFrame are floats, so you may prefer to convert them.
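
The notes above describe encoding the categories as numbers while keeping the NaNs, then imputing from the nearest neighbour. A minimal sketch of that idea using pandas category codes and scikit-learn's `KNNImputer` follows. This swaps in `KNNImputer(n_neighbors=1)` rather than the `fancyimpute` package used later in this notebook, and the toy labels and the `wind_speed` column are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy data; the category labels and the numeric column are made up.
toy = pd.DataFrame({
    "turbine_status": ["A", "B", None, "A", "B"],
    "cloud_level": ["Low", "Medium", "Low", None, "Medium"],
    "wind_speed": [10.0, 50.0, 11.0, 12.0, 49.0],
})

encoded = toy.copy()
categories = {}
for col in ["turbine_status", "cloud_level"]:
    cat = encoded[col].astype("category")
    categories[col] = np.asarray(cat.cat.categories)   # remember labels for decoding
    # cat.codes uses -1 for missing values; map it back to NaN.
    encoded[col] = cat.cat.codes.astype(float).replace(-1.0, np.nan)

imputer = KNNImputer(n_neighbors=1)                    # copy the value of the single nearest row
imputed = pd.DataFrame(
    imputer.fit_transform(encoded), columns=encoded.columns, index=encoded.index
)

# Decode the integer codes back to the original labels.
for col, labels in categories.items():
    codes = imputed[col].round().astype(int).to_numpy()
    imputed[col] = labels[codes]
print(imputed)
```

Because the distance is computed on unscaled values, the numeric column dominates which neighbour is picked. Scaling the features first may be preferable on real data.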

In [42]:
# use the KNNImputer using only the nearest neighbour as replacement

# from sklearn.impute import KNNImputer

# imputer = KNNImputer()
# # df_encoded = imputer.fit_transform(df_objs)

# df_encoded = pd.DataFrame(np.round(imputer.fit_transform(df_objs)),columns = df_objs.columns)

In [43]:
# df_encoded.isna().sum()      #check if imputation is done correctly or not

In [44]:
# df_encoded.shape

In [None]:
! pip install fancyimpute

In [46]:
from fancyimpute import KNN

# Instantiate the encoder and the imputer
encoder = LabelEncoder()
imputer = KNN()
# Create a list of categorical columns to iterate over
cat_cols = ['turbine_status','cloud_level']

def encode(data):
    '''Function to encode non-null data and replace it in the original data'''
    # Retain only the non-null values
    nonulls = np.array(data.dropna())
    # Encode the data (LabelEncoder expects a 1-D array, so no reshape is needed)
    impute_label = encoder.fit_transform(nonulls)
    # Assign the encoded values back to the non-null positions
    data.loc[data.notnull()] = impute_label
    return data


In [None]:
df_objs.shape

In [48]:
# Loop over each categorical column and store the encoded values back in df_objs
for columns in cat_cols:
    df_objs[columns] = encode(df_objs[columns])

In [None]:
df_objs.shape

In [None]:
# Impute the data, round to the nearest class code, and keep the original row index
df_encoded = pd.DataFrame(np.round(imputer.fit_transform(df_objs)), columns = df_objs.columns, index = df_objs.index)

In [None]:
df_encoded.shape

In [52]:
# Drop the original "turbine_status" and "cloud_level" columns from the dataset,
# then concatenate the encoded and KNN-imputed categorical columns.

In [53]:
#drop the original columns
df_windTurbine = df_windTurbine.drop(['turbine_status', 'cloud_level'],axis =1)

In [None]:
df_windTurbine.columns

In [55]:
# concatenate the encoded categorical columns
df = pd.concat([df_windTurbine,df_encoded],axis=1)

In [None]:
df.info()

In [None]:
missing_values(df)

---
<a name = Section7></a>
# **7 Data Post-Profiling**

---

In [None]:
# pandas profiling

from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis) 
windTurbine_post_profile = ProfileReport(df)
windTurbine_post_profile

---
<a name = Section9></a>
# **9 Splitting Data Into Train and Test**

---

In [58]:
# Separate out the data into X features and y labels
# "Target" is the dependent variable

X = df.drop('windmill_generated_power(kW/h)',axis=1)
y = df['windmill_generated_power(kW/h)']

In [59]:
# Split up X and y into a training set and test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
print('X_train shape: ', X_train.shape)
print('y_train shape: ', y_train.shape)
print('X_test shape: ', X_test.shape)
print('y_test shape: ', y_test.shape)

---
<a name = Section10></a>
# **10 Scaling / Standardizing**

---

In [61]:
# scale the X features

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

In [62]:
scaled_X_train = scaler.fit_transform(X_train)
scaled_X_test = scaler.transform(X_test)
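
The scaler above is fitted on the training split only and then reused on the test split, which keeps test-set statistics out of training. The sketch below shows the same idea wrapped in a scikit-learn pipeline. The synthetic data and the `_demo` names are assumptions standing in for `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X and y.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

# The pipeline fits the scaler on the training split only, then reuses
# the learned mean and standard deviation when scoring the test split.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
demo_score = pipe.score(X_te, y_te)   # R2 on the held-out split
print(demo_score)
```

A pipeline also makes it harder to forget the scaling step when the model is later applied to unseen data.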

---
<a name = Section11></a>
# **11 Model Building**

---

In [63]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor 
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.ensemble import GradientBoostingRegressor
from xgboost.sklearn import XGBRegressor

## Building Multiple Base Model

In [64]:
clfs = [LinearRegression(fit_intercept = True), DecisionTreeRegressor(random_state = 42),
        RandomForestRegressor(n_estimators = 100, random_state = 42, n_jobs = -1), KNeighborsRegressor(n_neighbors = 6),
        GaussianProcessRegressor(), GradientBoostingRegressor(random_state = 42),
        XGBRegressor(n_estimators = 64, random_state = 42)]

In [None]:
for clf in clfs:

  # Extracting model name
  model_name = type(clf).__name__

  # Calculate start time
  start_time = time.time()

  # Train the model
  clf.fit(scaled_X_train, y_train)

  # Make predictions on the trained model
  predictions = clf.predict(scaled_X_test)


  RMSE = np.sqrt(metrics.mean_squared_error(y_test, predictions))
  R_squared = metrics.r2_score(y_test, predictions)


  # Calculate elapsed time (covers training, prediction, and scoring)
  elapsed_time = (time.time() - start_time)

  # Display the metrics and the time taken to develop the model
  print('Performance Metrics of', model_name, ':')  
  print('[Processing Time]:', elapsed_time, 'seconds') 

  print('[RMSE Value]:', RMSE)
  print('[R2 Score]:',  R_squared)
  print('----------------------------------------\n')

**Observation**:

**Multiple Baseline Model Performance** with default parameters:

| Model Name |R2-Score|
|--|--|
|**Linear Regression :**|46.01|
|**Decision Tree Regressor :**|89.03|
|**Random Forest Regressor :**|94.53|
|**K Neighbors Regressor :**|57.85|
|**Gaussian Process Regressor :**|-1.82|
|**Gradient Boosting Regressor :**|93.49|
|**XGBRegressor :**|92.98|


**Clearly, the Random Forest Regressor performs better than the other regressors.**
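
Copying scores from printed output into a table by hand invites transcription slips. A minimal sketch of collecting the metrics in a DataFrame inside the loop is shown below. It uses synthetic data from `make_regression` and only two models, all of which are stand-ins for the notebook's scaled features and full model list.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled features and target.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.20, random_state=42)

rows = []
for model in [LinearRegression(), RandomForestRegressor(n_estimators=50, random_state=42)]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "model": type(model).__name__,
        "rmse": np.sqrt(metrics.mean_squared_error(y_te, pred)),
        "r2": metrics.r2_score(y_te, pred),
    })

# Best model first.
results = pd.DataFrame(rows).sort_values("r2", ascending=False).reset_index(drop=True)
print(results)
```

Taking the model name from `type(model).__name__` keeps the table labels in step with the estimators that were actually fitted.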

## Hyperparameter Tuning 

In [66]:
# Random Forest Regressor
 
rf_model = RandomForestRegressor(n_estimators=500, n_jobs= -1, random_state=42)
rf_model.fit(scaled_X_train, y_train)

y_pred_train_rf = rf_model.predict(scaled_X_train)

y_pred_test_rf = rf_model.predict(scaled_X_test)

RMSE_train_rf = np.sqrt(metrics.mean_squared_error(y_train, y_pred_train_rf))
r2_train_rf = metrics.r2_score(y_train, y_pred_train_rf)

RMSE_test_rf = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test_rf))
r2_test_rf = metrics.r2_score(y_test, y_pred_test_rf)

print('Random Forest Regressor:')
print('RMSE for training set is {}'.format(RMSE_train_rf))
print('R2 for training set is {}'.format(r2_train_rf))

print('RMSE for testing set is {}'.format(RMSE_test_rf))
print('R2 for testing set is {}'.format(r2_test_rf))


Random Forest Regressor:
RMSE for training set is 0.22690856023656067
R2 for training set is 0.9927789844443913
RMSE for testing set is 0.5955505960893106
R2 for testing set is 0.9498974938824949


In [None]:
# Gradient Boosting Regressor

gbr=GradientBoostingRegressor(n_estimators=500, random_state=42)
gbr.fit(scaled_X_train,y_train)

y_pred_train_gbr =gbr.predict(scaled_X_train) 
 
y_pred_test_gbr = gbr.predict(scaled_X_test)

RMSE_train_gbr = np.sqrt(metrics.mean_squared_error(y_train, y_pred_train_gbr))
r2_train_gbr = metrics.r2_score(y_train, y_pred_train_gbr)

RMSE_test_gbr = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test_gbr))
r2_test_gbr = metrics.r2_score(y_test, y_pred_test_gbr)

print('Gradient Boosting Regressor:')
print('RMSE for training set is {}'.format(RMSE_train_gbr))
print('R2 for training set is {}'.format(r2_train_gbr))

print('RMSE for testing set is {}'.format(RMSE_test_gbr))
print('R2 for testing set is {}'.format(r2_test_gbr))

Gradient Boosting Regressor:
RMSE for training set is 0.4604474896545634
R2 for training set is 0.9702657670645427
RMSE for testing set is 0.6123777664794735
R2 for testing set is 0.947026221500934


In [None]:
# XGBoost Regressor

xgb_model = XGBRegressor(n_estimators=500, n_jobs= -1, random_state=42)
xgb_model.fit(scaled_X_train, y_train)

y_pred_train_eb = xgb_model.predict(scaled_X_train)

y_pred_test_eb = xgb_model.predict(scaled_X_test)

RMSE_train_eb = np.sqrt(metrics.mean_squared_error(y_train, y_pred_train_eb))
r2_train_eb = metrics.r2_score(y_train, y_pred_train_eb)

print('RMSE for training set in XGB is {}'.format(RMSE_train_eb))
print('R2 for training set in XGB is {}'.format(r2_train_eb))

RMSE_test_eb = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test_eb))
r2_test_eb = metrics.r2_score(y_test, y_pred_test_eb)

print('RMSE for testing set in XGB is {}'.format(RMSE_test_eb))
print('R2 for testing set in XGB is {}'.format(r2_test_eb))

| Model Name |R2-Score - train set|R2-Score - test set|
|--|--|--|
|**Random Forest Regressor :**|99.27|94.98|
|**Gradient Boosting Regressor :**|97.02|94.70|
|**XGBRegressor :**|96.77|94.67|

### Hyperparameter Tuning - Gradient Boosting Regressor Model - 1

### Hypertuned Model -2

In [None]:
from sklearn.model_selection import GridSearchCV

model=GradientBoostingRegressor()
params={'n_estimators':range(1,200)}
grid=GridSearchCV(estimator=model,cv=2,param_grid=params,scoring='r2')
grid.fit(scaled_X_train,y_train)
print("The best estimator returned by GridSearch CV is:",grid.best_estimator_)



In [None]:
y_pred_train_gbr =grid.predict(scaled_X_train) 

RMSE_train_gbr = np.sqrt(metrics.mean_squared_error(y_train, y_pred_train_gbr))
r2_train_gbr = metrics.r2_score(y_train, y_pred_train_gbr)

print('Gradient Boosting Regressor:')
print('RMSE for training set is {}'.format(RMSE_train_gbr))
print('R2 for training set is {}'.format(r2_train_gbr))

In [None]:
#The best estimator returned by GridSearch CV is:  GradientBoostingRegressor(n_estimators=199)
GB=grid.best_estimator_
y_pred_test_gbr = GB.predict(scaled_X_test)

RMSE_test_gbr = np.sqrt(metrics.mean_squared_error(y_test, y_pred_test_gbr))
r2_test_gbr = metrics.r2_score(y_test, y_pred_test_gbr)

print('RMSE for testing set is {}'.format(RMSE_test_gbr))
print('R2 for testing set is {}'.format(r2_test_gbr))


RMSE for testing set is 0.6282342841698788
R2 for testing set is 0.9442571431317365
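
The search above tries 199 values of a single parameter, which is slow and tunes only the number of trees. A coarser grid over two parameters can be sketched as follows. The grid values, `cv=3`, and the synthetic data are assumptions for illustration, not tuned settings for the turbine data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)

param_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]}  # assumed values
search = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X_demo, y_demo)

print(search.best_params_)   # winning combination from the grid
print(search.best_score_)    # mean cross-validated R2 of that combination
```

`best_score_` is the cross-validated score, so it is a fairer figure to compare across settings than the R2 on the training set.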


## Model Selection

The best estimator is:
RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=42)

The train R2 Score is: 99.27

The test R2 Score is: 94.98

## Model Evaluation
Let's evaluate the unseen data with the best estimator:
RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=42)
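
Before predicting, the unseen features have to be in the same column order as the training features and pass through the scaler that was fitted on the training data. A minimal sketch of that flow is shown below. The toy columns, the `_demo` objects, the ids, and the two-column submission layout are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Toy stand-ins; column names and values are made up for illustration.
rng = np.random.default_rng(42)
train = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
target = 2.0 * train["f1"] - train["f3"] + rng.normal(scale=0.1, size=100)

scaler_demo = StandardScaler().fit(train)
model_demo = RandomForestRegressor(n_estimators=50, random_state=42)
model_demo.fit(scaler_demo.transform(train), target)

unseen = pd.DataFrame({
    "tracking_id": ["WM_1", "WM_2"],   # hypothetical ids
    "f3": [0.1, -0.2],                 # columns arrive in a different order
    "f1": [1.0, 0.5],
    "f2": [0.0, 0.3],
})

# Reorder the unseen features to the training order, then reuse the fitted scaler.
unseen_scaled = scaler_demo.transform(unseen[train.columns])
submission = pd.DataFrame({
    "tracking_id": unseen["tracking_id"],
    "prediction": model_demo.predict(unseen_scaled),
})
print(submission)
```

Calling `transform` rather than `fit_transform` on the unseen data keeps the scaling identical to what the model saw during training.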

---
<a name = Section12></a>
# **12 Predicting Unseen Data**

---

## Unseen Data Fetching

In [67]:
df_unseen = pd.read_csv("/content/drive/MyDrive/Windturbine_test.csv")

In [None]:
df_unseen.head(5)

In [None]:
df_unseen.info()

In [None]:
df_unseen.describe()

## Unseen Data Preprocessing

In [71]:

# Feature engineering - extract the month and hour columns from datetime (the id and datetime columns are dropped further below)

# Convert the object column to a datetime column
df_unseen['datetime'] = pd.to_datetime(df_unseen['datetime'])

# Extract month and hour from the datetime column
df_unseen['month'] = pd.DatetimeIndex(df_unseen['datetime']).month
df_unseen['hour'] = pd.DatetimeIndex(df_unseen['datetime']).hour

In [72]:
# drop "generator_temperature(°C)" to avoid multicollinearity
# df_unseen = df_unseen.drop('generator_temperature(°C)',axis =1)

In [None]:
data_submission = df_unseen['tracking_id']
data_submission.head()

In [74]:
# Drop the id and datetime columns
df_unseen = df_unseen.drop(['tracking_id','datetime'],axis =1)

In [None]:
df_unseen.info()

In [None]:
# check missing  values
missing_values(df_unseen)

In [None]:
#Check ALL the NUMERICAL COLUMNS for -99 values and Replace/Substitute them with NaN values

(df_unseen == -99 ).sum(axis = 0)

In [78]:
df_unseen.replace(-99, np.nan, inplace=True)

In [None]:
# Count the zero values in each column

(df_unseen == 0 ).sum(axis = 0)

In [None]:
df_unseen.isna().sum()

In [None]:
# check duplicates 
duplicate_values(df_unseen)

In [82]:
# impute numerical missing values 

In [83]:
# Columns with mild skewness (|skew| < 1) are filled with the mean
mean_cols = [
    'wind_speed(m/s)',               # Skewness is -0.052
    'blades_angle(°)',               # Skewness is -0.653
    'motor_torque(N-m)',             # Skewness is 0.030
    'generator_temperature(°C)',     # Skewness is -0.200
    'atmospheric_pressure(Pascal)',  # Skewness is 0.074
    'wind_direction(°)',             # Skewness is 0.170
    'resistance(ohm)',               # Skewness is -0.682
    'blade_breadth(m)',              # Skewness is -0.183
    'windmill_height(m)',            # Skewness is -0.067
]
# Columns with strong skewness are filled with the median
median_cols = [
    'atmospheric_temperature(°C)',    # Skewness is -1.68
    'shaft_temperature(°C)',          # Skewness is -2.54
    'gearbox_temperature(°C)',        # Skewness is 1.30
    'engine_temperature(°C)',         # Skewness is -3.99
    'windmill_body_temperature(°C)',  # Skewness is -2.21
    'rotor_torque(N-m)',              # Skewness is -1.04
    'blade_length(m)',                # Skewness is -8.76
]

# Assign the result back instead of calling fillna(inplace=True) on a column selection
for col in mean_cols:
    df_unseen[col] = df_unseen[col].fillna(df_unseen[col].mean())
for col in median_cols:
    df_unseen[col] = df_unseen[col].fillna(df_unseen[col].median())
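
The mean-or-median choice above follows the skewness noted beside each column. A minimal sketch of automating that rule is shown below. The helper `impute_by_skew` and its `threshold=1.0` cut-off are hypothetical, inferred from the skewness comments rather than taken from the notebook, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def impute_by_skew(df, threshold=1.0):
    """Fill numeric NaNs with the mean if |skew| < threshold, else with the median."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        skew = out[col].skew()
        # Roughly symmetric columns use the mean; skewed ones use the median
        fill = out[col].mean() if abs(skew) < threshold else out[col].median()
        out[col] = out[col].fillna(fill)
    return out


# Toy data (made up): one symmetric column and one with a large outlier
toy = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
    "skewed": [1.0, 1.0, 1.0, 1.0, 100.0, np.nan],
})
filled = impute_by_skew(toy)
print(filled)
```

The median is less sensitive to outliers, which is why it is the safer fill value for the heavily skewed columns.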


In [None]:
missing_values(df_unseen)

In [None]:
# Select the categorical columns of the unseen/test data for imputation
# (.copy() so that later edits do not act on a view of df_unseen)

df_objs_test = df_unseen.select_dtypes(include='object').copy()
df_objs_test.head(5)

In [None]:
df_objs_test.isna().sum()

In [87]:
# Instantiate the encoder and the imputer
encoder = LabelEncoder()
imputer = KNN()
# Categorical columns to iterate over
cat_cols = ['turbine_status', 'cloud_level']

def encode(data):
    '''Label-encode the non-null values of a Series and return it; nulls stay NaN.'''
    data = data.copy()
    # Retain only the non-null values (LabelEncoder expects a 1-D array)
    nonulls = np.array(data.dropna())
    # Encode the data
    impute_label = encoder.fit_transform(nonulls)
    # Assign the encoded values back to the non-null positions
    data.loc[data.notnull()] = impute_label
    return data

# Encode each categorical column, leaving missing values in place for the imputer
for column in cat_cols:
    df_objs_test[column] = encode(df_objs_test[column])

In [None]:
df_objs_test.isna().sum()

In [None]:
# Impute the remaining missing values and round them to valid integer category codes
df_encoded_test = pd.DataFrame(np.round(imputer.fit_transform(df_objs_test)), columns=df_objs_test.columns)
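
The `KNN()` imputer used above is instantiated outside this section. As a hedged illustration of the same idea, the sketch below swaps in scikit-learn's `KNNImputer`, which is an assumption and not necessarily the notebook's imputer. The encoded values are toy data, and `n_neighbors=2` is chosen only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Label-encoded categorical columns with gaps (toy values, not from the dataset)
toy = pd.DataFrame({
    "turbine_status": [0.0, 1.0, np.nan, 1.0, 0.0, 1.0],
    "cloud_level": [2.0, 0.0, 0.0, np.nan, 2.0, 1.0],
})

knn = KNNImputer(n_neighbors=2)
# Round so that imputed values fall back onto whole-number category codes
imputed = pd.DataFrame(np.round(knn.fit_transform(toy)), columns=toy.columns)
print(imputed)
```

Rounding matters here because averaging neighbours can produce fractional values that do not correspond to any category.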

In [None]:
df_encoded_test.isna().sum()

In [91]:
#drop the original columns
df_unseen = df_unseen.drop(['turbine_status', 'cloud_level'],axis =1)

In [92]:
# concatenate the encoded categorical columns
df_final = pd.concat([df_unseen,df_encoded_test],axis=1)

In [None]:
missing_values(df_final)

## Unseen Data Prediction

In [None]:
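# Scale the data with the scaler already fitted on the training data.
# Refitting on the unseen data would scale it differently from the data the model was trained on.
scaled_arr_final = scaler.transform(df_final)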

# Inputting our transformed data in a dataframe
data_final_model = pd.DataFrame(data=scaled_arr_final, columns=df_final.columns)

# Getting a glimpse of transformed data
data_final_model.head()
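
When scaling unseen data, the usual practice is to reuse the statistics learned from the training set. The sketch below illustrates this under the assumption that the scaler is a `StandardScaler`. The name `demo_scaler` and the toy frames are hypothetical, and the column names are borrowed from the dataset only for readability.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training and unseen frames with identical columns in the same order
train = pd.DataFrame({"wind_speed(m/s)": [10.0, 20.0, 30.0], "hour": [0.0, 12.0, 18.0]})
unseen = pd.DataFrame({"wind_speed(m/s)": [20.0, 40.0], "hour": [6.0, 12.0]})

demo_scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
train_scaled = demo_scaler.fit_transform(train)

# Apply the training statistics to the unseen data without refitting
unseen_scaled = pd.DataFrame(demo_scaler.transform(unseen), columns=unseen.columns)
print(unseen_scaled)
```

A wind speed of 20 maps to 0 because it equals the training mean. Had the scaler been refitted on the unseen frame, the same value would have been mapped to -1.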

In [None]:
data_final_model.info()

In [96]:
#predict the unseen data
y_pred_test_rfc = rf_model.predict(data_final_model)


In [97]:
## Convert the array into a DataFrame

y_pred_test_rfc = pd.DataFrame(y_pred_test_rfc)

In [None]:
y_pred_test_rfc

In [None]:
y_pred_test_rfc.nunique()

---
<a name = Section13></a>
# **13. Preparing Submission File**

---

In [100]:
submission_file = pd.concat([data_submission,y_pred_test_rfc], axis = 1)

In [None]:
submission_file.head(20)

### Write the final data to a .csv submission file without header and index

In [None]:
# Install the s3fs package using pip
!pip3 install s3fs

In [104]:
submission_file.to_csv('D://Wind_Turbine_Prediction_Submission.csv', header=False, index=False)
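
A quick way to sanity-check a headerless, index-free submission file is to write it and read it straight back. The sketch below does this with made-up IDs and predictions and a temporary directory. The names `ids`, `preds`, `submission` and `out_path` are illustrative and not the notebook's variables.

```python
import os
import tempfile

import pandas as pd

# Made-up tracking IDs and predictions standing in for the real ones
ids = pd.Series(["WM_1", "WM_2", "WM_3"], name="tracking_id")
preds = pd.DataFrame([2.5, 3.1, 4.7])

# Place the ID and the prediction side by side, as in the cell above
submission = pd.concat([ids, preds], axis=1)

# Write without header and index to a temporary location
out_path = os.path.join(tempfile.mkdtemp(), "submission_demo.csv")
submission.to_csv(out_path, header=False, index=False)

# Read the file back to confirm it holds one ID and one prediction per row
check = pd.read_csv(out_path, header=None)
print(check.shape)
print(check.head())
```

Checking that the row count matches the number of tracking IDs catches index misalignment in the `pd.concat` step before the file is submitted.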

---
<a name = Section14></a>
# **14. Summary**

---