## <center> Productivity Prediction of Garment Employees</center>

This dataset includes important attributes of the garment manufacturing process and the productivity of the employees. The data was collected manually and validated by industry experts.

<a href="https://www.kaggle.com/ishadss/productivity-prediction-of-garment-employees"> Datasource here.</a>


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import matplotlib.lines as lines
from IPython.display import HTML
import dateutil.parser as dt_parse
from scipy import stats
from scipy.special import inv_boxcox

from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from xgboost import XGBRegressor

In [None]:
#Visualization settings
sns.set_style(style='white')
sns.set(rc={
    'figure.figsize': (12,7),
    'axes.facecolor': 'white',
    'axes.grid': True,
    'grid.color': '.9',
    'axes.linewidth': 1.0,
    'grid.linestyle': u'-'},
    font_scale=1.5)
custom_colors=["#3498db", "#95a5a6","#34495e", "#2ecc71", "#e74c3c"]
sns.set_palette(custom_colors)
background_color='#fbfbfb'

In [None]:
# Load input data file
df_input = pd.read_csv('../input/productivity-prediction-of-garment-employees/garments_worker_productivity.csv')

In [None]:
print(f"Shape of dataframe : {df_input.shape}\n")
print("Sample data frame:\n")
display(df_input.head())
print("Dataset summary \n")
# info() prints its report directly and returns None, so it is not wrapped in display()
df_input.info()

<h4 style="background-color:#fbfbfb;font-family:serif;font-size:160%;">
    Float features   : 6 <br>
    Integer features : 5 <br>
    String features  : 4 <br>
    </h4>

In [None]:
# Missing values
missing_val= df_input.isnull().sum()
missing_val.sort_values(inplace=True, ascending=False)
print ("Missing value counts:\n")
display (missing_val)
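
Raw null counts are easier to judge as a share of all rows. The sketch below is a minimal illustration on a toy frame; the helper name `missing_share` and the sample values are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of null values per column, largest first."""
    return (frame.isnull().mean() * 100).sort_values(ascending=False)

# Toy frame with hypothetical values (not taken from the dataset)
toy = pd.DataFrame({
    "wip": [1108.0, np.nan, 968.0, np.nan],
    "smv": [26.16, 3.94, 11.41, 3.94],
})
share = missing_share(toy)
print(share)
```

Calling `missing_share(df_input)` would give the same view for the real data.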

## Feature Analysis

In [None]:
#Discrete feature analysis
def analyze_discrete_feature(fld,display_graph=True):
    print ("Sample data:\n")
    display(fld.head())
    df=pd.DataFrame({"Value": fld.value_counts().index,
                 "Count":fld.value_counts().values})
    print ("\nNull value count : ", fld.isnull().sum())
    unique_list=fld.unique().tolist()
    print ("\nUnique values: ", unique_list)
    print ("\n Unique values count: ", len(unique_list))
    print ("\nValue counts:\n",    df)
    if display_graph==True:
        plt.subplots(figsize=(25,10),facecolor=background_color)
        plt.subplot(2,2,1)
        plt.pie(fld.value_counts(),labels=fld.value_counts().index,autopct=lambda x: f'{x: .2f}%');
        plt.xticks(rotation=90)

        plt.subplot(2,2,2)   
        sns.barplot(data=df, x="Value",y="Count").set_facecolor(background_color);
        plt.xticks(rotation=90);
        plt.suptitle(fld.name + " -distribution");

        plt.show()
        plt.close()
    display(HTML("<h4 style='background-color:#fbfbfb;font-family:serif;font-size:160%'>Discrete variable</h4>"))


In [None]:
#Continuous feature analysis
def analyze_continuous_feature(fld):
    print ("Sample data:\n",fld.head())
    print ("\nNull value count : ", fld.isnull().sum())
    print ("\n", fld.describe())
    print (f"\n Skewness : {fld.skew()} \n")
    plt.subplots(figsize=(25,10))
    plt.subplot(2,2,1)
    # Drop nulls first: the histogram range cannot be computed when NaN is present
    plt.hist(fld.dropna())
    plt.subplot(2,2,2)
    sns.boxplot(x=fld)
    plt.suptitle(fld.name + "-distribution")
    plt.show()
    plt.close()
    display(HTML("<h4 style='background-color:#fbfbfb;font-family:serif;font-size:160%'>Continuous variable</h4>"))

### 1. Date
Date in MM-DD-YYYY

In [None]:
df_input.date

### 2. Day
Day of the week

In [None]:
analyze_discrete_feature(df_input.day)

### 3. quarter 
A portion of the month. A month was divided into four quarters

In [None]:
analyze_discrete_feature(df_input.quarter)

### 4. department
Associated department with the instance

In [None]:
# Found that there are whitespaces in the department column.
# Trimming the white spaces
df_input['department'] = df_input['department'].apply(str.strip)
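
Why the trimming matters can be seen on a small example: a trailing space makes the same label count as a separate category. This is a sketch with hypothetical labels; the vectorized `.str.strip()` used here is an alternative to `apply(str.strip)` and, unlike it, passes missing values through.

```python
import pandas as pd

# Hypothetical labels: the second value carries a trailing space
raw = pd.Series(["finishing", "finishing ", "sewing"], name="department")
cleaned = raw.str.strip()

n_before = raw.nunique()      # distinct labels before trimming
n_after = cleaned.nunique()   # distinct labels after trimming
print(n_before, n_after)
```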

In [None]:
analyze_discrete_feature(df_input.department)

### 5. teamno 
Associated team number with the instance 

In [None]:
analyze_discrete_feature(df_input.team)

### 6. noofworkers 
Number of workers in each team

In [None]:
analyze_continuous_feature(df_input.no_of_workers)

### 7. noofstylechange
Number of changes in the style of a particular product

In [None]:
analyze_continuous_feature(df_input.no_of_style_change)

### 8. targetedproductivity 
Targeted productivity set by the Authority for each team for each day. 


In [None]:
analyze_continuous_feature(df_input.targeted_productivity)

### 9. smv
Standard Minute Value, it is the allocated time for a task 

In [None]:
analyze_continuous_feature(df_input.smv)

### 10. wip 
Work in progress. Includes the number of unfinished items for products 

In [None]:
analyze_continuous_feature(df_input.wip)

### 11. overtime 
Represents the amount of overtime by each team in minutes

In [None]:
analyze_continuous_feature(df_input.over_time)

### 12. incentive 
Represents the amount of financial incentive (in BDT) that enables or motivates a particular course of action.

In [None]:
analyze_continuous_feature(df_input.incentive)

### 13. idletime 
The amount of time when the production was interrupted due to several reasons 

In [None]:
analyze_continuous_feature(df_input.idle_time)

### 14. idlemen
The number of workers who were idle due to production interruption

In [None]:
analyze_continuous_feature(df_input.idle_men)

### 15. actual_productivity 
The actual % of productivity that was delivered by the workers. It ranges from 0-1.

In [None]:
analyze_continuous_feature(df_input.actual_productivity)

## <center>Feature Relationships</center>

### 1. How many years of data do we have?

In [None]:
#Change the string formatted date column to datetime object
df_input['date_dt'] = df_input['date'].apply(lambda x : dt_parse.parse(x))

df_input['date'] = df_input.date_dt.apply(lambda x : x.day)
df_input['month'] = df_input.date_dt.apply(lambda x : x.month)
df_input['year'] = df_input.date_dt.apply(lambda x : x.year)

# Remove the temporary datetime column ('date' now holds the day of the month)
df_input.drop('date_dt',axis=1, inplace=True)

print (f'years:{df_input.year.unique()}')
print (f'months:{df_input.month.unique()}')

<h4 style="background-color:#fbfbfb;font-family:serif;font-size:160%;">
    The dataset contains data for the year 2015, covering January, February and March.
    </h4>
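
The same day, month and year split can also be done without row-wise `apply`. The sketch below assumes the dates are strings in a month/day/year layout (the sample values and the explicit `format` string are assumptions) and uses `pd.to_datetime` with the `.dt` accessor.

```python
import pandas as pd

# Hypothetical sample in a month/day/year layout
dates = pd.Series(["1/1/2015", "2/15/2015", "3/11/2015"])
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

# Extract the calendar parts column-wise
parts = pd.DataFrame({
    "date": parsed.dt.day,
    "month": parsed.dt.month,
    "year": parsed.dt.year,
})
print(parts)
```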

### 2. Which department has the higher total productivity in the whole dataset?

In [None]:
df_input

In [None]:
# Select the target column before aggregating so that string columns are not summed
summary=df_input.groupby('department')['actual_productivity'].sum()

In [None]:
#print (summary)
#summary.to_frame().plot.bar()

In [None]:
summary = summary.to_frame()
display (summary)
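
A summed productivity grows with the number of records in each department, so it can help to look at the mean and the row count next to it. The sketch below uses hypothetical toy values to show how the two views can disagree.

```python
import pandas as pd

# Toy data (hypothetical): three sewing records, one finishing record
toy = pd.DataFrame({
    "department": ["sewing", "sewing", "sewing", "finishing"],
    "actual_productivity": [0.7, 0.7, 0.7, 0.9],
})
stats_by_dept = (
    toy.groupby("department")["actual_productivity"]
       .agg(["sum", "mean", "count"])
)
print(stats_by_dept)
```

Here the department with the larger total is not the one with the larger average, simply because it has more rows.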

In [None]:
sns.reset_defaults()

#Visualization
fig=plt.figure(figsize=(10,5));

ax0=fig.add_subplot(1,2,1)
ax1=fig.add_subplot(1,2,2)
ax1.grid(False)
ax1.set_xticklabels([])
ax1.set_yticklabels([])

fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax0.spines["bottom"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax1.spines["bottom"].set_visible(False)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.tick_params(left=False,bottom=False)

#Plot the barplot
sns.barplot(data=summary,x=summary.index,y=summary.actual_productivity,ax=ax0)

#rotating the ticklabels in x axis
for tick in ax0.get_xticklabels():
    tick.set_rotation(90)
    
#Draw line in the middle    
l1= lines.Line2D([0.52,0.52],[0.1, 0.9],color='black',lw=0.2,transform=fig.transFigure)
fig.lines.extend([l1])

#heading content
fig.text(x=0.5,
        y=0.6,
        fontweight='bold',
        fontfamily='serif',
        fontsize=17,
        color='grey',
        s='''
        Total productivity per department
        ''')
#text content
fig.text(x=0.5,
        y=0.4,
        fontweight='light',
        fontfamily='serif',
        fontsize=16,
        color='grey',
        s='''
        The sewing department has the higher total productivity
        ''')
plt.show()

### 3. Do incentives improve productivity?

In [None]:
sns.reset_defaults()

#Visualization
fig=plt.figure(figsize=(10,5));

ax0=fig.add_subplot(1,2,1)
ax1=fig.add_subplot(1,2,2)
ax1.grid(False)
ax1.set_xticklabels([])
ax1.set_yticklabels([])

fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax0.spines["bottom"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax1.spines["bottom"].set_visible(False)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.tick_params(left=False,bottom=False)

#Plot the line plot
sns.lineplot(data=df_input, x = 'incentive', y='actual_productivity',estimator=None,ax=ax0)

#rotating the ticklabels in x axis
for tick in ax0.get_xticklabels():
    tick.set_rotation(90)
    
#Draw line in the middle    
l1= lines.Line2D([0.52,0.52],[0.1, 0.9],color='black',lw=0.2,transform=fig.transFigure)
fig.lines.extend([l1])

#heading content
fig.text(x=0.5,
        y=0.6,
        fontweight='bold',
        fontfamily='serif',
        fontsize=17,
        color='grey',
        s='''
        Actual productivity and Incentives
        ''')
#text content
fig.text(x=0.5,
        y=0.4,
        fontweight='light',
        fontfamily='serif',
        fontsize=16,
        color='grey',
        s='''
        There is no clear relationship between actual productivity
        and the amount of incentive workers received.
        ''')
plt.show()
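
A visual impression like this can be backed by a correlation coefficient. The sketch below runs on hypothetical toy values, not on the dataset; Spearman is included next to Pearson because rank correlation is less sensitive to a few very large incentive values.

```python
import pandas as pd

# Toy values (hypothetical), only to show the calls
toy = pd.DataFrame({
    "incentive": [0, 0, 50, 63, 98, 113],
    "actual_productivity": [0.62, 0.80, 0.80, 0.75, 0.94, 0.70],
})

# Linear (Pearson) and rank-based (Spearman) correlation
pearson = toy.corr(method="pearson").loc["incentive", "actual_productivity"]
spearman = toy.corr(method="spearman").loc["incentive", "actual_productivity"]
print(f"Pearson: {pearson:.3f}  Spearman: {spearman:.3f}")
```

On the real data the same two calls would apply to `df_input[['incentive', 'actual_productivity']]`.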

### 4. Is there a relation between Incentives and Overtime?

In [None]:
sns.reset_defaults()

#Visualization
fig=plt.figure(figsize=(10,5));

ax0=fig.add_subplot(1,2,1)
ax1=fig.add_subplot(1,2,2)
ax1.grid(False)
ax1.set_xticklabels([])
ax1.set_yticklabels([])

fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax0.spines["bottom"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax1.spines["bottom"].set_visible(False)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.tick_params(left=False,bottom=False)

#Plot the line plot
sns.lineplot(data=df_input,x = 'incentive', y='over_time',estimator=None,ax=ax0)

#rotating the ticklabels in x axis
for tick in ax0.get_xticklabels():
    tick.set_rotation(90)
    
#Draw line in the middle    
l1= lines.Line2D([0.52,0.52],[0.1, 0.9],color='black',lw=0.2,transform=fig.transFigure)
fig.lines.extend([l1])

#heading content
fig.text(x=0.5,
        y=0.6,
        fontweight='bold',
        fontfamily='serif',
        fontsize=17,
        color='grey',
        s='''
        Overtime and Incentives
        ''')
#text content
fig.text(x=0.5,
        y=0.4,
        fontweight='light',
        fontfamily='serif',
        fontsize=16,
        color='grey',
        s='''
        No clear relation is visible between overtime
        and the amount of incentive workers received.
        ''')
plt.show()

### 5. Targeted productivity vs actual productivity per month

In [None]:
df_input

In [None]:
df_input.month.unique()

In [None]:
df = df_input.groupby('month').agg({'targeted_productivity':'mean','actual_productivity':'mean'})

In [None]:
df.reset_index(inplace=True)

In [None]:
df

In [None]:
sns.reset_defaults()

#Visualization
fig=plt.figure(figsize=(10,5));

ax0=fig.add_subplot(1,2,1)
ax1=fig.add_subplot(1,2,2)
ax1.grid(False)
ax1.set_xticklabels([])
ax1.set_yticklabels([])

fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax0.spines["bottom"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax1.spines["bottom"].set_visible(False)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.tick_params(left=False,bottom=False)

#Plot 
sns.lineplot(data=df,x='month',y='targeted_productivity',ax=ax0,color='grey')
sns.lineplot(data=df,x='month',y='actual_productivity',ax=ax0,color='green')
ax0.set_ylabel('Productivity')

#rotating the ticklabels in x axis
for tick in ax0.get_xticklabels():
    tick.set_rotation(90)
    
#Draw line in the middle    
l1= lines.Line2D([0.52,0.52],[0.1, 0.9],color='black',lw=0.2,transform=fig.transFigure)
fig.lines.extend([l1])

#heading content
fig.text(x=0.5,
        y=0.6,
        fontweight='bold',
        fontfamily='serif',
        fontsize=17,
        color='grey',
        s='''
        Actual vs Targeted productivity
        ''')
#text content
fig.text(x=0.5,
        y=0.2,
        fontweight='light',
        fontfamily='serif',
        fontsize=16,
        color='grey',
        s='''
        grey - targeted productivity
        green - actual productivity
        
        Actual and targeted productivity
        converge in the third month.
        
        ''')
plt.show()

### 6. Overall productivity of team per month

In [None]:
result=df_input.groupby(['month','team']).agg({'actual_productivity':'sum'})

In [None]:
type(result)

In [None]:
result.reset_index(inplace=True)

In [None]:
sns.reset_defaults()

#Visualization
fig=plt.figure(figsize=(10,5));

ax0=fig.add_subplot(1,2,1)
ax1=fig.add_subplot(1,2,2)
ax1.grid(False)
ax1.set_xticklabels([])
ax1.set_yticklabels([])

fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax0.spines["bottom"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax1.spines["bottom"].set_visible(False)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.tick_params(left=False,bottom=False)

#Plot 
sns.barplot(data=result,x='month',y='actual_productivity',hue='team',ax=ax0)
ax0.set_ylabel('Productivity')

#rotating the ticklabels in x axis
for tick in ax0.get_xticklabels():
    tick.set_rotation(90)
    
#Draw line in the middle    
l1= lines.Line2D([0.52,0.52],[0.1, 0.9],color='black',lw=0.2,transform=fig.transFigure)
fig.lines.extend([l1])

#heading content
fig.text(x=0.5,
        y=0.6,
        fontweight='bold',
        fontfamily='serif',
        fontsize=17,
        color='grey',
        s='''
        Actual productivity of the teams per month
        ''')
#text content
fig.text(x=0.5,
        y=0.2,
        fontweight='light',
        fontfamily='serif',
        fontsize=16,
        color='grey',
        s='''
                
        ''')
plt.show()

### 7. Which day has the highest average productivity?

In [None]:
result=df_input.groupby('day').agg({'actual_productivity':'mean'})
result.reset_index(inplace=True)

In [None]:
sns.reset_defaults()

#Visualization
fig=plt.figure(figsize=(10,5));

ax0=fig.add_subplot(1,2,1)
ax1=fig.add_subplot(1,2,2)
ax1.grid(False)
ax1.set_xticklabels([])
ax1.set_yticklabels([])

fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax0.spines["bottom"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax1.spines["bottom"].set_visible(False)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.tick_params(left=False,bottom=False)

#Plot 
sns.barplot(data=result,x='day',y='actual_productivity',ax=ax0)
ax0.set_ylabel('Productivity')

#rotating the ticklabels in x axis
for tick in ax0.get_xticklabels():
    tick.set_rotation(90)
    
#Draw line in the middle    
l1= lines.Line2D([0.52,0.52],[0.1, 0.9],color='black',lw=0.2,transform=fig.transFigure)
fig.lines.extend([l1])

#heading content
fig.text(x=0.5,
        y=0.6,
        fontweight='bold',
        fontfamily='serif',
        fontsize=17,
        color='grey',
        s='''
        Actual productivity per day
        ''')
#text content
fig.text(x=0.5,
        y=0.5,
        fontweight='light',
        fontfamily='serif',
        fontsize=16,
        color='grey',
        s='''
        Productivity is not affected by the day  
        ''')
plt.show()

### 8. Which team has the most idle men?

In [None]:
result=df_input.groupby('team').aggregate({'idle_men':'sum'})

In [None]:
sns.reset_defaults()

#Visualization
fig=plt.figure(figsize=(10,5));

ax0=fig.add_subplot(1,2,1)
ax1=fig.add_subplot(1,2,2)
ax1.grid(False)
ax1.set_xticklabels([])
ax1.set_yticklabels([])

fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax0.spines["bottom"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax1.spines["bottom"].set_visible(False)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.tick_params(left=False,bottom=False)

#Plot 
sns.barplot(data=result,x=result.index,y='idle_men',ax=ax0)
ax0.set_ylabel('Idle men')

#rotating the ticklabels in x axis
for tick in ax0.get_xticklabels():
    tick.set_rotation(90)
    
#Draw line in the middle    
l1= lines.Line2D([0.52,0.52],[0.1, 0.9],color='black',lw=0.2,transform=fig.transFigure)
fig.lines.extend([l1])

#heading content
fig.text(x=0.5,
        y=0.6,
        fontweight='bold',
        fontfamily='serif',
        fontsize=17,
        color='grey',
        s='''
        Idle men per team
        ''')
#text content
fig.text(x=0.5,
        y=0.5,
        fontweight='light',
        fontfamily='serif',
        fontsize=16,
        color='grey',
        s='''
        Team 7 appears to have the most idle men
        ''')
plt.show()

## Data Wrangling
Clean the data for better model training.

In [None]:
df = df_input.copy()

In [None]:
df

In [None]:
df.quarter.unique()

In [None]:
# The year value 2015 is common across the dataset. Hence removing year.
df.drop('year',axis=1,inplace=True)

In [None]:
# Feature 'quarter' is redundant because it can be derived from the date and month features, so remove it.
df.drop('quarter',axis=1,inplace=True)

In [None]:
# A large share of the 'wip' values is null (check with df['wip'].isnull().mean()), so remove this feature.
df.drop('wip',axis=1,inplace=True)

In [None]:
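# actual_productivity is the target variable, with skewness -0.807
# Try a few transformations to reduce the skewness of the target
log_target = np.log(df.actual_productivity)
sqrt_target = np.sqrt(df.actual_productivity)
# Use df (the working copy) for consistency with the other transforms
boxcox_target = stats.boxcox(df.actual_productivity)

# Store the fitted Box-Cox lambda so predictions can be mapped back to the original scale
box_cox_param = boxcox_target[1]

# Keep the original index so the later assignment to df aligns row by row
boxcox_target = pd.Series(boxcox_target[0], index=df.index)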
print (f"Log skewness: {log_target.skew()}\nSquare root target skewness : {sqrt_target.skew()}\nBoxcox target skewness: {boxcox_target.skew()}")
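
The stored lambda is what makes the transform reversible. The sketch below, on hypothetical positive values, shows the round trip: `stats.boxcox` returns the transformed data together with the fitted lambda, and `inv_boxcox` with the same lambda recovers the original values.

```python
import numpy as np
from scipy import stats
from scipy.special import inv_boxcox

# Hypothetical positive values standing in for the target
values = np.array([0.23, 0.45, 0.60, 0.75, 0.80, 0.85, 0.95, 1.05])

# Fit lambda and transform in one call
transformed, lam = stats.boxcox(values)

# Invert with the same lambda to get back to the original scale
restored = inv_boxcox(transformed, lam)
print(np.allclose(restored, values))
```

Box-Cox requires strictly positive input, so a target containing zeros or negatives would need a shift first.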


In [None]:
sns.reset_defaults()

#Visualization
fig=plt.figure(figsize=(10,12));

ax0=fig.add_subplot(2,2,1)
# ax1=fig.add_subplot(2,2,2)
ax2=fig.add_subplot(2,2,3)
ax3=fig.add_subplot(2,2,4)

#ax1.grid(False)
#ax1.set_xticklabels([])
#ax1.set_yticklabels([])

fig.patch.set_facecolor(background_color)

ax0.set_facecolor(background_color)
#ax1.set_facecolor(background_color)
ax2.set_facecolor(background_color)
ax3.set_facecolor(background_color)

for side in ["bottom","top","right","left"]:
    ax0.spines[side].set_visible(False)
#    ax1.spines[side].set_visible(False)
    ax2.spines[side].set_visible(False)
    ax3.spines[side].set_visible(False)

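#Plot
# sns.distplot is deprecated; histplot with a KDE overlay on a density scale is the closest equivalent
sns.histplot(log_target,kde=True,stat='density',ax=ax0)
ax0.set_title("Log of target")

sns.histplot(sqrt_target,kde=True,stat='density',ax=ax2)
ax2.set_title("Square root of target")

sns.histplot(boxcox_target,kde=True,stat='density',ax=ax3)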
ax3.set_title("Boxcox target")

#rotating the ticklabels in x axis
for tick in ax0.get_xticklabels():
    tick.set_rotation(90)
    
#Draw line in the middle    
l1= lines.Line2D([0.5,0.5],[0.1, 0.9],color='black',lw=0.2,transform=fig.transFigure)
fig.lines.extend([l1])

#heading content
fig.text(x=0.5,
        y=0.75,
        fontweight='bold',
        fontfamily='serif',
        fontsize=17,
        color='grey',
        s='''
        Candidate transformations of the
        actual_productivity distribution
        ''')
#text content
fig.text(x=0.5,
        y=0.6,
        fontweight='light',
        fontfamily='serif',
        fontsize=16,
        color='grey',
        s='''
        Log skewness: -1.5736961147100366
        Square root target skewness : -1.1734572168685393
        Boxcox target skewness: -0.13866501140801019
        
        The Box-Cox transform gives the skewness
        closest to zero of the three
        ''')
plt.show()

In [None]:
# Replacing target feature with Box-Cox transformed target
df['actual_productivity']=boxcox_target

In [None]:
# Since department is categorical feature, applying onehot encoding to 'department'
df=pd.concat([df,pd.get_dummies(df['department'])],axis=1)
df.drop('department',axis=1,inplace=True)

In [None]:
# Feature 'team' also categorical feature
df=pd.concat([df,pd.get_dummies(df['team'],prefix='team')],axis=1)
df.drop('team',axis=1,inplace=True)

In [None]:
# Feature 'day' is categorical in nature
df=pd.concat([df,pd.get_dummies(df['day'])],axis=1)
df.drop('day',axis=1,inplace=True)
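
The three encoding cells above can also be written as one call. The sketch below uses a hypothetical toy frame; `pd.get_dummies` with the `columns` argument encodes the listed columns, prefixes each new column with its source column name, and drops the originals.

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "department": ["sewing", "finishing", "sewing"],
    "team": [1, 2, 1],
    "day": ["Monday", "Tuesday", "Monday"],
    "smv": [26.16, 3.94, 11.41],
})

# Encode all three categorical columns at once; dtype=int gives 0/1 columns
encoded = pd.get_dummies(toy, columns=["department", "team", "day"], dtype=int)
print(encoded.columns.tolist())
```

Note that this variant names the columns `department_...` and `day_...`, whereas the cells above leave the department and day dummies unprefixed.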

In [None]:
# Dropping the feature 'date'
df.drop('date',axis=1,inplace=True)

In [None]:
print (f"Feature names after data wrangling :\n\n {df.columns}")

In [None]:
# Let's check the correlation of features with target feature.
df_corr = df.copy()
correlation_matrix = df_corr.corr()
# Interested only in the relation with target feature 'actual_productivity'
correlation_matrix=correlation_matrix['actual_productivity']
correlation_matrix=correlation_matrix.to_frame()
correlation_matrix.sort_values(by='actual_productivity',ascending=False,inplace=True)
# display(correlation_matrix)

In [None]:
sns.reset_defaults()

#Visualization
fig=plt.figure(figsize=(10,7));

ax0=fig.add_subplot(1,2,1)
ax1=fig.add_subplot(1,2,2)
ax1.grid(False)
ax1.set_xticklabels([])
ax1.set_yticklabels([])

fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)
ax0.spines["bottom"].set_visible(False)
ax0.spines["top"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax1.spines["bottom"].set_visible(False)
ax1.spines["top"].set_visible(False)
ax1.spines["right"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.tick_params(left=False,bottom=False)

#Plot 
sns.heatmap(correlation_matrix,cmap='inferno',annot=True,cbar=False,ax=ax0)
ax0.set_ylabel('Feature')

   
#Draw line in the middle    
l1= lines.Line2D([0.52,0.52],[0.1, 0.9],color='black',lw=0.2,transform=fig.transFigure)
fig.lines.extend([l1])

#heading content
fig.text(x=0.5,
        y=0.6,
        fontweight='bold',
        fontfamily='serif',
        fontsize=17,
        color='grey',
        s='''
        Correlation of each feature
        with the target feature
        ''')
#text content
fig.text(x=0.5,
        y=0.5,
        fontweight='light',
        fontfamily='serif',
        fontsize=16,
        color='grey',
        s='''
        Almost none of the features has
        a strong relation with the target feature.
        ''')
plt.show()

<i>With only weak relations to the target feature, model training will be difficult. However, we can still try to find a model that fits the given data.</i>

## Model training
Find the best model for the data and train the model with available data.

In [None]:
# Function for providing generalized results for regression model
def evaluate_model(model,x_train,y_train,x_test,y_test):
    '''
    Inputs
    1. regression model, e.g. LinearRegression(), Lasso() etc.
    2. training x data
    3. training y data
    4. testing x data
    5. testing y data
    '''
    model.fit(x_train,y_train)
    model_name=model.__class__.__name__
    
    train_score=model.score(x_train,y_train)
    test_score=model.score(x_test,y_test)
    print (f"Training score: {train_score}\nTesting score: {test_score}")
    
    y_pred=model.predict(x_test)
    print("Prediction completed.")
    df=pd.DataFrame({"Actual": y_test,
                     "Predicted":y_pred})
    
    #Apply inverse box cox to retrieve original target results
    df=inv_boxcox(df,box_cox_param)
    
    #Finding the difference between original and predicted
    df["difference"]=df.Predicted-df.Actual
    df.reset_index(inplace=True)
    
    #Plot actual vs predicted
    plt.figure(figsize=(10,5));
    # Legend labels are plain strings, not lists
    sns.scatterplot(data=df,x="index",y="Actual",color='grey',label="Actual").set_facecolor(background_color);
    sns.lineplot(data=df,x="index",y="Predicted",color='salmon',label="Predicted");
    plt.legend(loc="right",bbox_to_anchor=(1.1,1));
    plt.title(model_name+" -Actual vs Predicted");
    plt.show()
    
    print ("Sample comparison file for actual and predicted target feature:")
    display(df.head())
    
    # Return the model for re-use if required.
    return model

In [None]:
#General data frame and function for storing and comparing model results.
#Note: "TrainScore" holds the mean cross-validated score computed on the data passed in.
df_model_results=pd.DataFrame(columns=["ModelName","TrainScore"])

def store_model_results(modl_name,train_score):
    global df_model_results
    row_loc=df_model_results.shape[0]+1
    df_model_results.loc[row_loc,["ModelName","TrainScore"]]=[modl_name,train_score]

In [None]:
#Run each model and show the combined results.
def show_model_scores(x,y):
    global df_model_results
    df_model_results=df_model_results.iloc[0:0] #reset display dataframe
    for model in  [LinearRegression(),
                   Lasso(),
                   Ridge(),
                   ElasticNet(),
                   XGBRegressor()]:
        store_model_results(model.__class__.__name__, cross_val_score(model,x,y,cv=3).mean())
    df_model_results.sort_values("TrainScore",ascending=False,inplace=True)
    display(df_model_results)
    display(HTML('Selected model : <b>' + df_model_results.head(1)['ModelName'].values[0] + '</b>'))
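
The comparison above rests on `cross_val_score`, which for regressors reports R² per fold by default. The sketch below shows the same idea on synthetic data with scikit-learn estimators only; the data, the seed, and the shuffled `KFold` splitter are assumptions for illustration and not what the notebook uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

# Shuffled 3-fold split with a fixed seed so the scores are repeatable
cv = KFold(n_splits=3, shuffle=True, random_state=0)
scores = {
    type(m).__name__: cross_val_score(m, X, y, cv=cv, scoring="r2").mean()
    for m in (LinearRegression(), Ridge(alpha=1.0))
}
best = max(scores, key=scores.get)
print(scores, best)
```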
    

In [None]:
# Setting dependent and independent variables
y = df.actual_productivity
x = df.drop('actual_productivity',axis=1)

In [None]:
# Set training and testing dataset
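# random_state fixes the split so results are repeatable (42 is an arbitrary choice)
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=42)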

In [None]:
# Compare the candidate models with 3-fold cross-validation on the training data
show_model_scores(x_train,y_train)

In [None]:
evaluate_model(XGBRegressor(),x_train,y_train,x_test,y_test)

<i>
    The selected model here is XGBoost (extreme gradient boosting), although its score is not particularly high.
    Note that the training score is higher than the testing score, which suggests the model is overfitted.
    </i>
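
The gap between training and testing score is a simple overfitting indicator. The sketch below illustrates it on synthetic data with decision trees as a stand-in (not XGBoost, and not the notebook's data): an unrestricted tree memorizes the training set, while a depth-limited one leaves a smaller gap.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (hypothetical): only the first feature carries signal
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
y = X[:, 0] + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def score_gap(model):
    """Fit the model and return train R^2 minus test R^2."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr) - model.score(X_te, y_te)

deep_gap = score_gap(DecisionTreeRegressor(random_state=0))
shallow_gap = score_gap(DecisionTreeRegressor(max_depth=2, random_state=0))
print(f"unrestricted gap: {deep_gap:.2f}  depth-2 gap: {shallow_gap:.2f}")
```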

## Model tuning

In [None]:
# Possible parameter values
param_tuning={
    'learning_rate' : [0.01,0.1,.11,.2],
    'max_depth' : [1,2,3,5],
    'min_child_weight' : [3,5,7,9],
    'subsample' : [0.5,0.7,0.9],
    'colsample_bytree' : [0.3,0.5,0.7,0.9],
    'n_estimators' : [25,50,100],
    'objective' : ['reg:squarederror']
}

In [None]:
# Use GridSearchCV to find the model with the best parameter settings
gsearch = GridSearchCV (estimator = XGBRegressor(),
                       param_grid = param_tuning,
                       cv = 2,
                       n_jobs= -1,
                       verbose = 1)

In [None]:
# Note that the step below may be time-consuming.
# gsearch.fit(x,y)
# print (gsearch.best_params_)

# Below are the best parameters for the model, obtained from gsearch.best_params_
best_params = {'colsample_bytree': 0.7, 'learning_rate': 0.11, 'max_depth': 3, 'min_child_weight': 7, 'n_estimators': 50, 'objective': 'reg:squarederror', 'subsample': 0.7}
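
For readers who want to see the search run end to end quickly, the sketch below applies `GridSearchCV` to synthetic data with a small Ridge grid. The estimator, grid, and data are assumptions chosen to keep it fast; it fits the search on the training split only, so the held-out split stays unseen until the final score.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.2, size=150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Try every alpha in the grid with 3-fold cross-validation
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 1.0, 100.0]}, cv=3)
search.fit(X_tr, y_tr)

best_alpha = search.best_params_["alpha"]
# score() uses the best estimator refitted on the whole training split
test_r2 = search.score(X_te, y_te)
print(best_alpha, round(test_r2, 3))
```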

In [None]:
# Save the model with best parameters
# selected_model = gsearch.best_estimator_

# Reuse the best_params dictionary defined above instead of repeating each value
selected_model = XGBRegressor(**best_params)

In [None]:
# Find how the model is performing now.
evaluate_model(selected_model,x_train,y_train,x_test,y_test)

This time, the training and testing scores are almost the same, which suggests the model is no longer overfitted.
The model depends less on the particular training data, so it should generalize better when predicting similar data.