#### This notebook contains my solution to the windmill power prediction problem from the HackerEarth Competition 2021.
> **A Fine Windy Day: HackerEarth Machine Learning challenge**<br>
Link : https://www.hackerearth.com/challenges/competitive/hackerearth-machine-learning-challenge-predict-windmill-power/problems/

#### Steps Involved :
##### 1. Importing dataset and libraries
##### 2. Analysing train and test data
##### 3. Imputation 
##### 4. Modeling

## Importing libraries and dataset

In [None]:
#data analysis
import pandas as pd
import numpy as np
#visualization
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

#modeling
from xgboost import XGBRegressor
from xgboost import plot_importance

#warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
#importing datasets
train=pd.read_csv('../input/a-fine-windy-day-hackerearth-ml-challenge/train_data.csv')
test=pd.read_csv('../input/a-fine-windy-day-hackerearth-ml-challenge/test_data.csv')

#storing training target variable
train_target=train.iloc[:,-1]

#combining train and test dataset
dataset = pd.concat([train,test],axis=0)
dataset.info()

> ##### Tracking Id can be dropped, as it has no effect on windmill power.
> ##### Many columns contain null values, which need to be imputed.

#### The **Datetime** column is stored as object dtype. It should be converted to datetime format.

In [None]:
dataset['datetime']=pd.to_datetime(dataset['datetime'], format='%Y/%m/%d %H:%M:%S')
dataset['year']=dataset['datetime'].dt.year
dataset['month']=dataset['datetime'].dt.month
dataset['day']=dataset['datetime'].dt.day
dataset['hour']=dataset['datetime'].dt.hour
dataset['minute']=dataset['datetime'].dt.minute

## Visualization

In [None]:
# missing bargraph in training dataset
msno.bar(train, figsize=(12, 6), fontsize=12, color='steelblue')

In [None]:
# missing bargraph in testing dataset
msno.bar(test, figsize=(12, 6), fontsize=12, color='steelblue')

> #### In both the training and testing datasets, every column is more than 70% filled.
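The fill level read off the bar charts can also be checked numerically. The following is a minimal sketch on a made-up toy frame; `fill_ratio` and the `full_column` column are hypothetical names used only for illustration, and the values are not taken from the competition data.

```python
import numpy as np
import pandas as pd

def fill_ratio(df):
    """Return the fraction of non-null values per column, lowest first."""
    return df.notna().mean().sort_values()

# Toy frame standing in for the competition data (values are made up)
toy = pd.DataFrame({
    'wind_speed(m/s)': [94.8, np.nan, 16.2, 45.0],
    'cloud_level': ['Low', 'Medium', None, 'Low'],
    'full_column': [1.0, 2.0, 3.0, 4.0],  # hypothetical column with no gaps
})
ratios = fill_ratio(toy)
print(ratios)
```

Calling `fill_ratio(train)` and `fill_ratio(test)` on the real frames should give the same information as the two `msno.bar` plots, as numbers rather than bars.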

#### Correlation graph

In [None]:
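# correlation is only defined for numeric columns, so select them explicitly
corr = train.select_dtypes(include='number').corr()
plt.figure(figsize=(12,10))
# np.bool was removed in NumPy 1.24; the built-in bool is equivalent
mask = np.zeros_like(corr,dtype=bool)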
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr,mask=mask,annot=True,cbar=False)
plt.show()
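A heatmap with many columns can be hard to read, so it may help to list the features by their absolute correlation with the target. This is a sketch on synthetic data; `rank_by_target_corr` and the toy column names are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def rank_by_target_corr(df, target):
    """Absolute Pearson correlation of each numeric column with the target."""
    numeric = df.select_dtypes(include='number')
    return numeric.corr()[target].drop(target).abs().sort_values(ascending=False)

# Synthetic data: one feature drives the target, one is pure noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'strong_feature': x,
    'noise_feature': rng.normal(size=200),
    'label_text': ['a'] * 200,  # non-numeric, ignored by select_dtypes
    'target': 2 * x + rng.normal(scale=0.1, size=200),
})
ranking = rank_by_target_corr(toy, 'target')
print(ranking)
```

On the real data the call would presumably be `rank_by_target_corr(train, 'windmill_generated_power(kW/h)')`.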

#### Wind Speed

In [None]:
plt.figure(figsize=(12,5))
plt.subplot(121)
sns.boxplot(y=train['wind_speed(m/s)'])
plt.title('Train Dataset',fontsize=18)
plt.ylabel('wind speed',fontsize=15)
plt.subplot(122)
sns.boxplot(y=test['wind_speed(m/s)'])
plt.title('Test Dataset',fontsize=18)
plt.ylabel('')
plt.show()

In [None]:
plt.figure(figsize=(10,6))
# the first train.shape[0] rows of the combined dataset are the training rows
sns.scatterplot(x='wind_speed(m/s)',y='windmill_generated_power(kW/h)',data=dataset.iloc[:train.shape[0],:].astype({'year':str}),hue='year')
plt.xlabel('Wind Speed',fontsize=13)
plt.ylabel('Windmill Power',fontsize=13)
plt.title('Windmill Power Vs Wind Speed',fontsize=16)

> #### The maximum power was generated in 2019.

#### Maximum power generated month-wise

In [None]:
plt.figure(figsize=(12,5))
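# group on the numeric month so the x-axis runs 1..12 instead of sorting as text ('1','10','11',...)
month_power=dataset.iloc[:train.shape[0],:].groupby('month')['windmill_generated_power(kW/h)'].max()
plt.plot(month_power.index.astype(str),month_power,'go-')
plt.xlabel('Month',fontsize=15)
plt.ylabel('Maximum Power',fontsize=15)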

> ##### The maximum power was generated in January.

#### Turbine Status

In [None]:
fig = plt.figure(figsize=(12,8))
plt.subplot(121)
sns.countplot(data=train,x='turbine_status')
plt.title('Train Dataset',fontsize=15)
plt.subplot(122)
sns.countplot(data=test,x='turbine_status')
plt.title('Test Dataset',fontsize=15)

#### Cloud Level

In [None]:
fig = plt.figure(figsize=(12,5))
plt.subplot(121)
sns.countplot(data=train,x='cloud_level')
plt.title('Train Dataset',fontsize=15)
plt.subplot(122)
sns.countplot(data=test,x='cloud_level')
plt.title('Test Dataset',fontsize=15)

> The distribution of cloud level is quite similar in the training and testing data.<br>
**Low** is the most frequent value in both datasets.<br>
Hence, missing values can be imputed with the mode.
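The visual comparison of the two count plots can be backed by relative frequencies. The sketch below uses made-up toy series; `category_shares` is a hypothetical helper, and the category `Extremely Low` is invented for illustration.

```python
import pandas as pd

def category_shares(train_col, test_col):
    """Side-by-side relative frequencies of each category (NaN excluded)."""
    return pd.concat(
        [train_col.value_counts(normalize=True).rename('train'),
         test_col.value_counts(normalize=True).rename('test')],
        axis=1,
    ).fillna(0.0)

# Toy series standing in for train['cloud_level'] and test['cloud_level']
toy_train = pd.Series(['Low', 'Low', 'Medium', 'Low', None, 'Extremely Low'])
toy_test = pd.Series(['Low', 'Medium', 'Low', None])
shares = category_shares(toy_train, toy_test)
print(shares)
print('mode used for imputation:', toy_train.mode()[0])
```

If the two share columns are close and one category clearly dominates, filling the gaps with the mode changes the distribution only slightly.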

#### Atmospheric Temperature Vs Atmospheric Pressure

In [None]:
fig=plt.figure(figsize=(15,8))
plt.subplot(121)
sns.scatterplot(x='atmospheric_temperature(°C)',y='atmospheric_pressure(Pascal)',data=train,hue='cloud_level')
plt.title('Train Dataset',fontsize=18)
plt.xlabel('Atmospheric Temperature',fontsize=15)
plt.ylabel('Atmospheric Pressure',fontsize=15)
plt.subplot(122)
sns.scatterplot(x='atmospheric_temperature(°C)',y='atmospheric_pressure(Pascal)',data=test,hue='cloud_level')
plt.title('Test Dataset',fontsize=18)
plt.xlabel('Atmospheric Temperature',fontsize=15)
plt.ylabel('Atmospheric Pressure',fontsize=15)

#### Gearbox Temperature

In [None]:
fig=plt.figure(figsize=(12,6))
sns.scatterplot(x='gearbox_temperature(°C)',y='windmill_generated_power(kW/h)',data=train,hue='cloud_level')
plt.title('Train Dataset',fontsize=18)
plt.xlabel('Gearbox Temperature',fontsize=15)
plt.ylabel('Windmill Power',fontsize=15)

#### Windmill Body Temperature

In [None]:
fig=plt.figure(figsize=(12,6))
sns.scatterplot(x='windmill_body_temperature(°C)',y='windmill_generated_power(kW/h)',data=train,hue='cloud_level')
plt.title('Train Dataset',fontsize=18)
plt.xlabel('Windmill Temperature',fontsize=15)
plt.ylabel('Windmill Power',fontsize=15)

## Imputation

> #### I used the Boruta feature selection technique; the rejected attributes are:
> ##### gearbox_temperature(°C), windmill_body_temperature(°C), blade_length(m), windmill_height(m), year
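The Boruta run itself is not shown in this notebook. As a rough illustration of the idea behind it, the sketch below performs a single round of the shadow-feature comparison with a scikit-learn random forest. It is a simplification, not the Boruta algorithm (which repeats this many times and applies a statistical test), and `shadow_feature_check` and the toy columns are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def shadow_feature_check(X, y, random_state=0):
    """One round of the shadow-feature comparison that Boruta repeats many times."""
    rng = np.random.default_rng(random_state)
    # Shadow features: each column shuffled on its own, which breaks any link to y
    shadows = pd.DataFrame(
        {'shadow_' + c: rng.permutation(X[c].to_numpy()) for c in X.columns},
        index=X.index,
    )
    combined = pd.concat([X, shadows], axis=1)

    forest = RandomForestRegressor(n_estimators=100, random_state=random_state)
    forest.fit(combined, y)
    importance = pd.Series(forest.feature_importances_, index=combined.columns)

    # A real feature is kept only if it beats the best shuffled feature
    threshold = importance[shadows.columns].max()
    return importance[X.columns] > threshold

# Synthetic data: 'signal' drives the target, the other two columns are noise
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise_a': rng.normal(size=300),
    'noise_b': rng.normal(size=300),
})
y_toy = 3 * X_toy['signal'] + rng.normal(scale=0.1, size=300)
keep = shadow_feature_check(X_toy, y_toy)
print(keep)
```

A single round can keep a noise column by chance, which is why Boruta iterates before rejecting or confirming an attribute.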

In [None]:
# drop tracking_id, datetime and the attributes rejected by Boruta
dataset.drop(columns=['tracking_id','datetime','gearbox_temperature(°C)','windmill_body_temperature(°C)','blade_length(m)','windmill_height(m)','year'],inplace=True)

### Numerical Imputation using Mean Values
# [:-1] skips the last float column, which is the target
columns=dataset.select_dtypes(include='float64').columns[:-1]
for col in columns:
    # assign the result back; inplace fillna on a selected column is unreliable in recent pandas
    dataset[col]=dataset[col].fillna(dataset[col].mean())
    
### Categorical Imputation using Mode
dataset['turbine_status']=dataset['turbine_status'].fillna(dataset['turbine_status'].mode()[0])
dataset['cloud_level']=dataset['cloud_level'].fillna(dataset['cloud_level'].mode()[0])

### Encoding on Categorical values
turbine_dummies = pd.get_dummies(dataset['turbine_status'],prefix='t')
cloud_dummies = pd.get_dummies(dataset['cloud_level'],prefix='c')
dataset = pd.concat([dataset,turbine_dummies,cloud_dummies],axis=1)
dataset.drop(columns=['turbine_status','cloud_level'],axis=1,inplace=True)

## Modeling

In [None]:
### splitting the combined dataset back into train and test rows
train_sample = dataset.iloc[:train.shape[0],:] #rows from the original train dataset
test_sample = dataset.iloc[train.shape[0]:,:] #rows from the original test dataset (target is all NaN)

#train rows with a known target are used for fitting; train rows with a missing target get predicted
train_1 = train_sample[train_sample['windmill_generated_power(kW/h)'].notna()].reset_index(drop=True)
test_1 = train_sample[train_sample['windmill_generated_power(kW/h)'].isna()].reset_index(drop=True)

X_train = train_1.drop(columns='windmill_generated_power(kW/h)',axis=1).reset_index(drop=True)
Y_train = train_1['windmill_generated_power(kW/h)'].reset_index(drop=True)
X_test = test_1.drop(columns='windmill_generated_power(kW/h)',axis=1).reset_index(drop=True)


print("X_train shape",X_train.shape)
print("Y_train shape",Y_train.shape)
print("X_test shape",X_test.shape)

In [None]:
# Model Creation for filling target values of train dataset
xgb = XGBRegressor(n_estimators=1000,max_depth=8,booster='gbtree',learning_rate=0.1,objective='reg:squarederror')

#Model fitting and prediction
xgb.fit(X_train,Y_train)
Y_test = xgb.predict(X_test)

#Converting into Series
Y_test=pd.Series(Y_test,name='windmill_generated_power(kW/h)')

In [None]:
# Combining X_test and Y_test: train rows whose missing target is now filled with predictions
test_final = pd.concat([X_test,Y_test],axis=1)

# Combining train_1 and test_final to form final train dataset
train_final = pd.concat([train_1,test_final],axis=0)

#dropping target from test_sample (reassigned instead of inplace, since test_sample is a slice of dataset)
test_sample = test_sample.drop(columns='windmill_generated_power(kW/h)')
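The last three cells fill the missing training targets with model predictions and then rebuild the training set. The same flow can be sketched compactly on toy data. Here `LinearRegression` is only a lightweight stand-in for the `XGBRegressor` used in the notebook, `fill_missing_target` is a hypothetical helper, and the sketch assumes at least one row has a missing target.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fill_missing_target(df, target, model):
    """Fit on rows with a known target, predict the rest, return the completed frame."""
    known = df[df[target].notna()]
    unknown = df[df[target].isna()]
    features = [c for c in df.columns if c != target]

    model.fit(known[features], known[target])
    filled = unknown.copy()
    filled[target] = model.predict(unknown[features])
    return pd.concat([known, filled], axis=0).reset_index(drop=True)

# Toy frame: 'power' is exactly 2 * x, with two targets missing
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'power': [2.0, 4.0, np.nan, 8.0, np.nan],
})
completed = fill_missing_target(toy, 'power', LinearRegression())
print(completed)
```

One caveat of this design is that the predicted targets carry the model's own errors into the final fit, so the gain depends on how many rows are filled this way.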

In [None]:
# Predicting values for original test dataset
xgb.fit(train_final.drop(columns=['windmill_generated_power(kW/h)'],axis=1),train_final['windmill_generated_power(kW/h)'])
final_ans = xgb.predict(test_sample)
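The notebook goes straight from fitting to the submission file without a local accuracy check. A hold-out split is one simple way to get an estimate before submitting. The sketch below uses synthetic data, with scikit-learn's `GradientBoostingRegressor` as a stand-in for `XGBRegressor`. The choice of R² as the metric and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem (not the competition data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(500, 3))
y_demo = 2.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.2, size=500)

# Keep 20% of the rows aside; the model never sees them during fitting
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
model = GradientBoostingRegressor(random_state=0)
model.fit(X_fit, y_fit)

val_r2 = r2_score(y_val, model.predict(X_val))
print(f'hold-out R^2: {val_r2:.3f}')
```

In the notebook, the analogous split would be taken from `train_1` (rows with a genuine target) rather than `train_final`, so the score is not measured against predicted labels.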

#### Feature Importance Graph

In [None]:
#feature importance 
plt.rcParams["figure.figsize"] = (18,10)
plot_importance(xgb)
plt.show()

In [None]:
power=pd.Series(final_ans,name='windmill_generated_power(kW/h)')
file=pd.concat([test[['tracking_id','datetime']],power],axis=1)
file.to_csv('XGB_ans.csv',index=False)