The techniques used in this notebook include:

* Visualization of important insights using the Matplotlib library
* Data wrangling using Pandas
* Bagging and boosting techniques to reduce bias and variance
* Cross-validation and learning curves for model evaluation
* Stacking: improvement in bias and variance by 10%.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
data=pd.read_csv('/kaggle/input/onlinenewspopularity/OnlineNewsPopularity.csv')

In [None]:
data.head()

# 1. Exploratory Data Analysis (EDA)

In [None]:
data['url'][2000]

In [None]:
import nltk
import re
import string
from datetime import datetime

In [None]:
# Pull the first YYYY/MM/DD fragment out of each URL. Rows without a match become NaN,
# so the result stays aligned with the rows of `data`.
date_original=data['url'].str.extract(r'([0-9]{4}/[0-9]{2}/[0-9]{2})',expand=False)

In [None]:
data['date']=date_original

In [None]:
data['date']= pd.to_datetime(data['date'])

In [None]:
fig=plt.figure(figsize=(10,10))
ax=fig.gca()
plt.plot(data['date'],data[' shares'])
plt.show()

In [None]:
plt.scatter(data[' timedelta'],data[' shares'],c='r')


### Observations:

1) The plot shows time delta (the number of days between article publication and dataset compilation) versus the number of shares.

2) There are outliers: for example, beyond a time delta of 400 days, more articles have been shared over 20k times.

3) These outliers obscure the underlying distribution of the data and skew the results.
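
As an illustration of trimming such outliers, here is a minimal sketch that derives the cutoff from a percentile rather than a hand-picked threshold. The toy `shares` values and the 80th-percentile cutoff are assumptions made for this example only; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the ' shares' column.
toy = pd.DataFrame(
    {"shares": [500, 800, 1200, 1500, 2000, 2600, 3100, 4000, 90000, 250000]}
)

# Keep rows at or below a chosen percentile (0.80 is an assumed value).
cutoff = toy["shares"].quantile(0.80)
trimmed = toy[toy["shares"] <= cutoff].reset_index(drop=True)

print(f"cutoff={cutoff:.0f}, kept {len(trimmed)} of {len(toy)} rows")
```

A percentile adapts to the data, whereas a fixed number has to be re-tuned whenever the distribution changes.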

In [None]:
data.shape[0]

In [None]:
# Drop articles with more than 28,000 shares in one vectorized step
data.drop(index=data.index[data[' shares'] > 28000],inplace=True)
        

In [None]:
data=data.reset_index(drop=True)

In [None]:
plt.scatter(data[' timedelta'],data[' shares'],c='r')

In [None]:
plt.plot(data[' num_imgs'],data[' shares'],'ro',label='Images')
plt.plot(data[' num_videos'],data[' shares'],'b^',label='Videos')
plt.legend()

### Relation between the number of images and videos in an article and its shares

1) If the number of videos and images exceeds 80, shares fall to roughly 0-5k.

2) Most articles with 0-40 images and videos have been shared 0-20k times.

In [None]:
sns.scatterplot(x=data[' n_tokens_title'],y=data[' shares'])

In [None]:
sns.scatterplot(x=data[' n_tokens_content'],y=data[' shares'])

In [None]:
plt.hist(data[' n_tokens_content'],alpha=0.5,color='b',label='n_tokens_content')
plt.hist(data[' shares'],alpha=0.5,color='g',label='shares')
plt.legend()

### Observations:

1) Very short and very long titles get a weaker response. Titles of 5-18 words do well.

2) Articles with 0-20k total words get the higher response. Articles above 20k words have not been shared more than 500 times.

3) Both "shares" and "n_tokens_content" are right-skewed, meaning the data is concentrated in the lower range.

In [None]:
from scipy.stats import norm
fig= plt.figure(figsize=(10,10))
ax=fig.gca()
ax.set_title("The 'Sharing' distribution of whole dataset")
sns.distplot(data[' shares'],ax=ax, fit=norm)

In [None]:
print("Skew:",data[' shares'].skew())

A positive skew means the data is right-skewed; it can be reduced with a log or square-root transform.

#### 1. The skew is positive, so most of the data lies in the lower range.
#### 2. We should avoid a squared-error metric: it would be dominated by errors on the few very large values, which would make the results less interpretable.
#### 3. We can power-transform the target variable or leave it as it is.
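
To make the transform option concrete, here is a minimal sketch of a log transform on a right-skewed target. The data is synthetic (a log-normal sample, an assumption for illustration) and the names `shares_demo`, `log_shares` and `recovered` are hypothetical; the notebook itself leaves the target untransformed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values; NOT the notebook's data.
shares_demo = pd.Series(rng.lognormal(mean=7.0, sigma=1.0, size=5000))

# log1p computes log(1 + x), which stays defined when x is 0.
log_shares = np.log1p(shares_demo)

print("skew before:", shares_demo.skew())
print("skew after :", log_shares.skew())

# Predictions made on the log scale are mapped back with the inverse transform.
recovered = np.expm1(log_shares)
```

If a model is trained on the transformed target, its errors are measured on the log scale, so predictions need `np.expm1` before computing MAE in units of shares.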

In [None]:
lifestyle_articles=data[data[' data_channel_is_lifestyle'] == 1][' shares'].sum()
entertainment_articles=data[data[' data_channel_is_entertainment'] == 1][' shares'].sum()
business_articles=data[data[' data_channel_is_bus'] == 1][' shares'].sum()
socialmedia_articles=data[data[' data_channel_is_socmed'] == 1][' shares'].sum()
technical_articles=data[data[' data_channel_is_tech'] == 1][' shares'].sum()
world_articles=data[data[' data_channel_is_world'] == 1][' shares'].sum()

In [None]:
articles_types=np.array([lifestyle_articles,entertainment_articles,business_articles,socialmedia_articles,technical_articles,world_articles],dtype=np.int64)
fig= plt.figure(figsize=(10,10))
ax=fig.gca()
ax.set_title('TOTAL SHARES OF EACH GENRE')
ax.set_ylabel('Total shares')
# One color per bar, passed as a list rather than a multi-letter string
plt.bar(x=['lifestyle','entertainment','business','socialmedia','technical','world'],height=articles_types,color=list('rgbkym'))


In [None]:
articles_types

### Observations:

1) Technical articles have the highest total shares.

2) Lifestyle articles are shared the least.

In [None]:
monday_articles=data[data[' weekday_is_monday'] == 1][' shares'].sum()
tuesday_articles=data[data[' weekday_is_tuesday'] == 1][' shares'].sum()
wednesday_articles=data[data[' weekday_is_wednesday'] == 1][' shares'].sum()
thursday_articles=data[data[' weekday_is_thursday'] == 1][' shares'].sum()
friday_articles=data[data[' weekday_is_friday'] == 1][' shares'].sum()
saturday_articles=data[data[' weekday_is_saturday'] == 1][' shares'].sum()
sunday_articles=data[data[' weekday_is_sunday'] == 1][' shares'].sum()
weekend_articles=data[data[' is_weekend'] == 1][' shares'].sum()

In [None]:
articles_publishing_days= np.array([monday_articles,tuesday_articles,wednesday_articles,thursday_articles,friday_articles,
                                    saturday_articles,sunday_articles,weekend_articles])
fig= plt.figure(figsize=(10,10))
ax=fig.gca()
ax.set_title('Total shares of articles by publishing day')
ax.set_ylabel('Total shares')
# Colors are passed as a list; they repeat when there are more bars than colors
plt.bar(x=['monday','tuesday','wednesday','thursday','friday','saturday','sunday','weekend'],height=articles_publishing_days
        ,color=list('rgbkymc'))


In [None]:
result=[]
days=[' weekday_is_monday',' weekday_is_tuesday',' weekday_is_wednesday',' weekday_is_thursday',' weekday_is_friday',
     ' weekday_is_saturday',' weekday_is_sunday',' is_weekend']
genre=[' data_channel_is_lifestyle',' data_channel_is_entertainment',' data_channel_is_bus',' data_channel_is_socmed',
       ' data_channel_is_tech',' data_channel_is_world']
for i in days:
    list1=[]
    for j in genre:
        # Total shares of articles published on day i in channel j
        list1.append(data.loc[(data[i] == 1) & (data[j] == 1),' shares'].sum())
    print('Best channel on {} has {} total shares and channel is {}'.format(i,max(list1),genre[list1.index(max(list1))]))

In [None]:
Worst_min_shares=pd.DataFrame(data.groupby([' kw_min_min'],sort=True)[' shares'].sum())
Worst_max_shares=pd.DataFrame(data.groupby([' kw_max_min'],sort=True)[' shares'].sum())
Worst_avg_shares=pd.DataFrame(data.groupby([' kw_avg_min'],sort=True)[' shares'].sum())
Best_min_shares=pd.DataFrame(data.groupby([' kw_min_max'],sort=True)[' shares'].sum())
Best_max_shares=pd.DataFrame(data.groupby([' kw_max_max'],sort=True)[' shares'].sum())
Best_avg_shares=pd.DataFrame(data.groupby([' kw_avg_max'],sort=True)[' shares'].sum())
Normal_min_shares=pd.DataFrame(data.groupby([' kw_min_avg'],sort=True)[' shares'].sum())
Normal_max_shares=pd.DataFrame(data.groupby([' kw_max_avg'],sort=True)[' shares'].sum())
Normal_avg_shares=pd.DataFrame(data.groupby([' kw_avg_avg'],sort=True)[' shares'].sum())

In [None]:
Worst_min_shares.plot()

In [None]:
Lda_00=pd.DataFrame(data.groupby(by=[' LDA_00'])[' shares'].sum().sort_values(ascending=False)).reset_index()
Lda_01=pd.DataFrame(data.groupby(by=[' LDA_01'])[' shares'].sum().sort_values(ascending=False)).reset_index()
Lda_02=pd.DataFrame(data.groupby(by=[' LDA_02'])[' shares'].sum().sort_values(ascending=False)).reset_index()
Lda_03=pd.DataFrame(data.groupby(by=[' LDA_03'])[' shares'].sum().sort_values(ascending=False)).reset_index()
Lda_04=pd.DataFrame(data.groupby(by=[' LDA_04'])[' shares'].sum().sort_values(ascending=False)).reset_index()



In [None]:
## mean of each LDA topic weight over groups with more than 50 total shares
mean_lda_00=Lda_00.loc[Lda_00[' shares'] > 50,' LDA_00'].mean()
mean_lda_01=Lda_01.loc[Lda_01[' shares'] > 50,' LDA_01'].mean()
mean_lda_02=Lda_02.loc[Lda_02[' shares'] > 50,' LDA_02'].mean()
mean_lda_03=Lda_03.loc[Lda_03[' shares'] > 50,' LDA_03'].mean()
mean_lda_04=Lda_04.loc[Lda_04[' shares'] > 50,' LDA_04'].mean()

In [None]:
fig=plt.figure(figsize=(8,8))
ax=fig.gca()
plt.bar(x=['mean_lda_00','mean_lda_01','mean_lda_02','mean_lda_03','mean_lda_04'],
        height=[mean_lda_00,mean_lda_01,mean_lda_02,mean_lda_03,mean_lda_04])

In [None]:
sns.scatterplot(x=data[' global_subjectivity'],y=data[' shares']) # Subjectivity from 0.0-1.0

In [None]:
columns_group_3=[' global_sentiment_polarity', ' global_rate_positive_words',
       ' global_rate_negative_words', ' rate_positive_words',
       ' rate_negative_words', ' avg_positive_polarity',
       ' min_positive_polarity', ' max_positive_polarity',
       ' avg_negative_polarity', ' min_negative_polarity',
       ' max_negative_polarity', ' title_subjectivity',
       ' title_sentiment_polarity', ' abs_title_subjectivity',
       ' abs_title_sentiment_polarity', ' shares']

In [None]:
fig, ax = plt.subplots(figsize=(20,20))

sns.heatmap(data[columns_group_3].corr(),linewidth=1.0,ax=ax,square=True,annot=True)

# 2. Principal Component Analysis

In [None]:
from sklearn.decomposition import PCA

In [None]:
y= data[' shares']

In [None]:
pca_data=data.drop(labels=['url',' shares','date'],axis=1)

In [None]:
pca_data.head()

In [None]:
from sklearn.preprocessing import StandardScaler
scaler= StandardScaler()
data_transformed= scaler.fit_transform(pca_data)

In [None]:
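pca=PCA()
# Fit PCA on the standardized features so that large-scale columns do not dominate the components
principal_comp=pd.DataFrame(pca.fit_transform(data_transformed))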

In [None]:
principal_comp.head()

In [None]:
pca.explained_variance_ratio_
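
One way to read the explained-variance ratios is to accumulate them and count how many components are needed to reach a target share of the variance. The sketch below uses hypothetical ratios and an assumed 85% target; neither comes from the notebook's output.

```python
import numpy as np

# Hypothetical ratios, standing in for pca.explained_variance_ratio_.
ratios = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])

cumulative = np.cumsum(ratios)
target = 0.85  # assumed threshold for this example

# searchsorted returns the first index whose running total reaches the target;
# adding 1 converts that index into a number of components.
n_components = int(np.searchsorted(cumulative, target) + 1)

print("components needed:", n_components)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, which selects the number of components by explained variance in the same spirit.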

# 3. Model Selection:

# 3(a). Random Forest Regressor

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV

In [None]:
test_size=0.2
X_train, X_test, y_train, y_test = train_test_split(pca_data, y,  
    test_size=test_size,random_state=23)

In [None]:

param_grid= {'n_estimators':[20,40],
            'max_depth':[10,20],
             # 1.0 uses all features, which is what 'auto' meant for regressors
             'max_features':[1.0,10,20],
             'bootstrap':[True,False],             
            }

## The initial result gave both extreme values as the best parameters, so run the search again with wider limits

In [None]:
random_search= RandomizedSearchCV(RandomForestRegressor(),param_distributions=param_grid,
                                  cv=5,scoring='neg_mean_absolute_error',
                         verbose=1,n_jobs=-1)
randomsearch_result=random_search.fit(X_train,y_train)
best_paramters= randomsearch_result.best_params_

In [None]:
pd.DataFrame(randomsearch_result.cv_results_).sort_values('mean_test_score',ascending=False)

In [None]:
best_paramters
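
Rather than retyping the winning values by hand, the best parameters can be unpacked straight into a new estimator. This is a minimal sketch on synthetic data; `X_demo`, `y_demo`, `demo_grid` and `tuned_rf` are hypothetical names, and the small grid is an assumption chosen so the example runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem; NOT the notebook's data.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

demo_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=demo_grid,
    n_iter=4,  # the grid has only 4 combinations
    cv=3,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X_demo, y_demo)

# Unpack the winning combination instead of hard-coding it.
tuned_rf = RandomForestRegressor(random_state=0, **search.best_params_)
tuned_rf.fit(X_demo, y_demo)
```

With the default `refit=True`, `search.best_estimator_` holds an already fitted model with the same parameters, so the explicit refit is only needed when a separate estimator object is wanted.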

In [None]:
from sklearn.model_selection import cross_val_score
rf=RandomForestRegressor(n_estimators=40,max_depth=10,max_features=10)
scores=cross_val_score(rf,X_train,y_train,scoring='neg_mean_absolute_error',cv=5)

In [None]:
absolute_scores= -scores.mean()

In [None]:
## def display_scores(score):
   ## print("Mean:", score.mean())
   ## print("Standard deviation:", score.std())

In [None]:
rf.fit(X_train,y_train)

In [None]:
from sklearn.metrics import mean_absolute_error
y_pred=rf.predict(X_test)
test_score=mean_absolute_error(y_test,y_pred)

In [None]:
test_score

In [None]:
pd.DataFrame({'actual_train_mae_score':absolute_scores,
             'actual_test_mae_score':test_score},index=['Mean'])

### Feature importances from the random forest

In [None]:
df=pd.DataFrame(rf.feature_importances_,pca_data.columns).reset_index()

In [None]:
df.columns=['variables','score']

In [None]:
sorted_df=df.sort_values('score',ascending=False)

In [None]:
# Top 15 features by importance (starting at position 0 so the top feature is included)
important_variables=sorted_df.iloc[0:15,:]

In [None]:
fig=plt.figure(figsize=(15,15))
ax=fig.gca()
plt.bar(x=important_variables.variables,height=important_variables.score,color='r')
plt.xticks(rotation=90)
plt.show()

# 3(b). Random Forest with PCA

In [None]:
pca_final_data=principal_comp[[0,1]]

In [None]:
X_train_pca,X_test_pca,y_train_pca,y_test_pca=train_test_split(pca_final_data,y,test_size=0.2,random_state=23)

In [None]:
param_grid_pca= {'n_estimators':[20,40],
            'max_depth':[10,20],
             'bootstrap':[True,False],             
            }

In [None]:
random_search_pca= RandomizedSearchCV(RandomForestRegressor(),param_distributions=param_grid_pca,
                                  cv=5,scoring='neg_mean_absolute_error',
                                  verbose=1,n_jobs=-1)
randomsearch_result_pca=random_search_pca.fit(X_train_pca,y_train_pca)
best_paramters_pca= randomsearch_result_pca.best_params_

In [None]:
pd.DataFrame(randomsearch_result_pca.cv_results_).sort_values('mean_test_score',ascending=False)

In [None]:
best_paramters_pca

In [None]:
rf_pca=RandomForestRegressor(n_estimators=20,max_depth=10)

In [None]:
scores_1=cross_val_score(rf_pca,X_train_pca,y_train_pca,scoring='neg_mean_absolute_error',cv=10)

In [None]:
absolute_scores_1=-scores_1.mean()

In [None]:

rf_pca.fit(X_train_pca,y_train_pca)

In [None]:
y_predict_pca= rf_pca.predict(X_test_pca)
test_score_pca= mean_absolute_error(y_test_pca,y_predict_pca)

In [None]:
# Both scores are mean absolute errors
pd.DataFrame({'train_mae_score':[absolute_scores_1],
             'test_mae_score':[test_score_pca]},index=['Mean'])

__Learning Curves__

In [None]:
X_train.shape[0]

In [None]:
train_sizes=[500,800,1000,1250,2500,5000,10000,12000,16000,18000,20000]

In [None]:
from sklearn.model_selection import learning_curve
train_sizes,train_scores,validation_scores= learning_curve(rf,X=X_train,y=y_train,train_sizes=train_sizes,
                                             cv=3,scoring='neg_mean_absolute_error')

In [None]:
train_scores_mean= -train_scores.mean(axis=1)
validation_scores_mean=-validation_scores.mean(axis=1)

In [None]:
plt.plot(train_sizes, train_scores_mean, label = 'Training error')
plt.plot(train_sizes, validation_scores_mean, label = 'Validation error')
plt.ylabel('MAE', fontsize = 14)
plt.xlabel('Training set size', fontsize = 14)
plt.title('Learning curves for a random forest regression model', fontsize = 18, y = 1.03)
plt.legend()

In [None]:
train_sizes,train_scores_pca,validation_scores_pca= learning_curve(rf_pca,X=X_train_pca,y=y_train_pca,train_sizes=train_sizes,
                                             cv=3,scoring='neg_mean_absolute_error')

In [None]:
train_scores_mean_pca= -train_scores_pca.mean(axis=1)
validation_scores_mean_pca=-validation_scores_pca.mean(axis=1)

In [None]:
plt.plot(train_sizes, train_scores_mean_pca, label = 'Training error PCA')
plt.plot(train_sizes, validation_scores_mean_pca, label = 'Validation error PCA')
plt.ylabel('MAE', fontsize = 14)
plt.xlabel('Training set size', fontsize = 14)
plt.title('Learning curves for a random forest regression model', fontsize = 18, y = 1.03)
plt.legend()

One solution at this point is to change to a more complex learning algorithm. This should decrease the bias and increase the variance. A mistake would be to try to increase the number of training instances. Generally, these other two fixes also work when dealing with a high bias and low variance problem:

__1. Training the current learning algorithm on more features (to avoid collecting new data, you can easily generate polynomial features). This should lower the bias by increasing the model’s complexity.__

__2. Decreasing the regularization of the current learning algorithm, if it uses any. In a nutshell, regularization prevents the algorithm from fitting the training data too well. If we decrease regularization, the model will fit the training data better and, as a consequence, the variance will increase and the bias will decrease.__
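
The first fix can be sketched with scikit-learn's `PolynomialFeatures`, which expands existing columns into their degree-2 terms. The two-column array `X_demo` is a hypothetical toy input, not part of the notebook's data.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy input with two features, x0 and x1.
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

# degree=2 adds squares and the pairwise product; include_bias=False omits the constant column.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X_demo)

# Resulting columns: x0, x1, x0^2, x0*x1, x1^2
print(X_poly)
```

The number of generated columns grows quickly with the number of input features, so on a wide dataset it is usually applied to a selected subset of columns.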

### Comparison:

1. The PCA model has less variance but more bias. Solution: train on more features to increase the model's complexity, and decrease regularization, that is, allow the model to fit the training data more closely.

2. The normal model has less bias than the PCA model (though more than a random forest should have) and more variance.

#### Since random forest is a bagging method, it has less variance. Now we can use boosting to get less bias.

# 3(c). Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
param_gradboost={'n_estimators':[100,150],
                'max_depth':[5,10],
                'learning_rate':[0.1,0.2]}

In [None]:
# Keep the first 15 principal components
pca_data_gbr= principal_comp[list(range(15))]
X_train_gbr,X_test_gbr,y_train_gbr,y_test_gbr=train_test_split(pca_data_gbr,y,test_size=0.2,random_state=23)

In [None]:
grad_randomsearch= RandomizedSearchCV(GradientBoostingRegressor(),param_distributions=param_gradboost,cv=3,
                                      scoring='neg_mean_absolute_error',n_jobs=-1,verbose=1)
grad_fit=grad_randomsearch.fit(X_train_gbr,y_train_gbr)
best_param_grad= grad_fit.best_params_

In [None]:
pd.DataFrame(grad_fit.cv_results_)

In [None]:
best_param_grad

In [None]:
gbr= GradientBoostingRegressor(n_estimators=50,max_depth=5,learning_rate=0.1)

In [None]:
grad_result= gbr.fit(X_train_gbr,y_train_gbr)

In [None]:
scores_boosting= cross_val_score(gbr,X_train_gbr,y_train_gbr,scoring='neg_mean_absolute_error',cv=5)

In [None]:
absolute_scores_boosting= - scores_boosting.mean()

In [None]:
y_pred_gbr=gbr.predict(X_test_gbr)
test_score_gbr= mean_absolute_error(y_test_gbr,y_pred_gbr)

In [None]:
pd.DataFrame({'train_mae_score':[absolute_scores_boosting],
             'test_mae_score':[test_score_gbr]},index=['Mean'])

In [None]:
# Use the same 15-component split the gradient boosting model was trained on
train_sizes,train_scores_gbr,validation_scores_gbr= learning_curve(gbr,X=X_train_gbr,y=y_train_gbr,train_sizes=train_sizes,
                                             cv=5,scoring='neg_mean_absolute_error')

In [None]:
train_scores_mean_gbr= -train_scores_gbr.mean(axis=1)
validation_scores_mean_gbr=-validation_scores_gbr.mean(axis=1)

In [None]:
plt.plot(train_sizes, train_scores_mean_gbr, label = 'Training error PCA')
plt.plot(train_sizes, validation_scores_mean_gbr, label = 'Validation error PCA')
plt.ylabel('MAE', fontsize = 14)
plt.xlabel('Training set size', fontsize = 14)
plt.title('Learning curves for a gradient boosting regression model', fontsize = 18, y = 1.03)
plt.legend()

#### This model is much better than the two above in both bias and variance.

# 3(d). Gradient Boosting with feature selection

In [None]:
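from sklearn.base import clone
# Fit a fresh copy on the original features so that `gbr` keeps its PCA-based fit
gbr_original=clone(gbr).fit(X_train,y_train)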

In [None]:
df_1=pd.DataFrame(X_train.columns,gbr_original.feature_importances_).reset_index()
df_1.columns=['score','variables']
select_columns=df_1.sort_values('score',ascending=False)['variables']

In [None]:
feature_importance_df= pd.concat(objs=[df,df_1],axis=1)
feature_importance_df

In [None]:
feature_importance_df.columns=['Variables_rf','Score_rf','score_gb','Variables_gb']
feature_importance_df=feature_importance_df.sort_values('Score_rf',ascending=False).reset_index(drop=True)

In [None]:
np.sum(feature_importance_df['score_gb'][0:25])

In [None]:
select_columns=df_1.sort_values('score',ascending=False)['variables'][0:25]

In [None]:
X_train_select,X_test_select,y_train_select,y_test_select=train_test_split(data[select_columns.reset_index()['variables']],
                                                                           y,test_size=0.2,random_state=23)

In [None]:
grad_randomsearch_select= RandomizedSearchCV(GradientBoostingRegressor(),param_distributions=param_gradboost,cv=3,
                                      scoring='neg_mean_absolute_error',n_jobs=-1,verbose=1)
grad_fit_select=grad_randomsearch_select.fit(X_train_select,y_train_select)
best_param_grad_select= grad_fit_select.best_params_

In [None]:
pd.DataFrame(grad_fit_select.cv_results_)

In [None]:
best_param_grad_select

In [None]:
gbr_select= GradientBoostingRegressor(n_estimators=100,max_depth=5,learning_rate=0.1)

In [None]:
grad_result_select= gbr_select.fit(X_train_select,y_train_select)

In [None]:
scores_boosting_select= cross_val_score(gbr_select,X_train_select,y_train_select,scoring='neg_mean_absolute_error',cv=5)

In [None]:
absolute_scores_boosting_select= - scores_boosting_select.mean()

In [None]:
y_pred_gbr_select=gbr_select.predict(X_test_select)
test_score_gbr_select= mean_absolute_error(y_test_select,y_pred_gbr_select)

In [None]:
pd.DataFrame({'train_mae_score':[absolute_scores_boosting_select],
             'test_mae_score':[test_score_gbr_select]},index=['Mean'])

In [None]:
train_sizes,train_scores_select,validation_scores_select= learning_curve(gbr_select,X=X_train_select,
                                                                         y=y_train_select,train_sizes=train_sizes,
                                                                           cv=5,scoring='neg_mean_absolute_error')

In [None]:
train_scores_mean_select= -train_scores_select.mean(axis=1)
validation_scores_mean_select=-validation_scores_select.mean(axis=1)

In [None]:
plt.plot(train_sizes, train_scores_mean_select, label = 'Training error')
plt.plot(train_sizes, validation_scores_mean_select, label = 'Validation error')
plt.ylabel('MAE', fontsize = 14)
plt.xlabel('Training set size', fontsize = 14)
plt.title('Learning curves for gradient boosting with selected features', fontsize = 18, y = 1.03)
plt.legend()

### Comparison of all Scores and Models:

In [None]:
Comparison_df= pd.DataFrame({'Training_Scores':[absolute_scores,absolute_scores_1,
                                                absolute_scores_boosting,absolute_scores_boosting_select],
                            'Test_Scores':[test_score,test_score_pca,test_score_gbr,test_score_gbr_select]},
                            index=['Rf','Rf_PCA','gbr','gbr_select'])

In [None]:
Comparison_df['Variance']=np.subtract(Comparison_df['Training_Scores'],Comparison_df['Test_Scores'])

In [None]:
Comparison_df=Comparison_df.sort_values('Training_Scores')

In [None]:
Comparison_df

# 3(e). Stacking

From the above comparison, we are going to create a new dataset from the predictions of the four models.

In [None]:
data.head()

In [None]:
X1= rf.predict(pca_data)
X2=rf_pca.predict(pca_final_data)
X3=gbr.predict(pca_data_gbr)
X4=gbr_select.predict(data[select_columns.reset_index()['variables']])

### __To train the final model on a large dataset, we had to use the whole dataset for prediction here__
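
For comparison, a common alternative is to build the stacking features from out-of-fold predictions, so that each row's meta-feature comes from a model that never saw that row. The sketch below is not what this notebook does; it runs on synthetic data, and `base_models`, `meta_train`, `meta_test` and `meta_model` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic regression problem; NOT the notebook's data.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

base_models = [
    RandomForestRegressor(n_estimators=20, random_state=0),
    GradientBoostingRegressor(n_estimators=50, random_state=0),
]

# Out-of-fold predictions: one column of meta-features per base model.
meta_train = np.column_stack(
    [cross_val_predict(m, X_tr, y_tr, cv=5) for m in base_models]
)

# Refit each base model on the full training split to predict the held-out test rows.
meta_test = np.column_stack(
    [m.fit(X_tr, y_tr).predict(X_te) for m in base_models]
)

# The meta-model learns how to weight the base models' predictions.
meta_model = LinearRegression().fit(meta_train, y_tr)
stacked_pred = meta_model.predict(meta_test)
```

scikit-learn also ships `StackingRegressor` in `sklearn.ensemble`, which wraps this cross-validated pattern in a single estimator.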

In [None]:
data_stacking= pd.DataFrame({'Random_Forest':X1,
                            'Random_Forest_PCA':X2,
                            'GBR':X3,
                            "GBR_select":X4,
                            "Target":y})
data_stacking.head()

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
X_train_stack,X_test_stack,y_train_stack,y_test_stack= train_test_split(
    data_stacking[['Random_Forest','Random_Forest_PCA','GBR','GBR_select']],y,test_size=0.2)

In [None]:
lr=LinearRegression()
lr.fit(X_train_stack,y_train_stack)

In [None]:
training_score_lr= cross_val_score(lr,X_train_stack,y_train_stack,scoring='neg_mean_absolute_error',cv=20)
absolute_training_lr= -training_score_lr.mean()

In [None]:
y_predict_lr= lr.predict(X_test_stack)
test_score_lr= mean_absolute_error(y_test_stack,y_predict_lr)

In [None]:
train_sizes,train_scores_lr,validation_scores_lr= learning_curve(lr,X=X_train_stack,
                                                                         y=y_train_stack,train_sizes=train_sizes,
                                                                           cv=10,scoring='neg_mean_absolute_error')

In [None]:
train_scores_mean_lr= -train_scores_lr.mean(axis=1)
validation_scores_mean_lr=-validation_scores_lr.mean(axis=1)

In [None]:
plt.plot(train_sizes, train_scores_mean_lr, label = 'Training error')
plt.plot(train_sizes, validation_scores_mean_lr, label = 'Validation error')
plt.ylabel('MAE', fontsize = 14)
plt.xlabel('Training set size', fontsize = 14)
plt.title('Learning curves for a linear regression model', fontsize = 18, y = 1.03)
plt.legend()

# Summary:

1. First, the data was explored using Pandas for data manipulation and Matplotlib for visualization.

2. After that, since there are close to 60 features, Principal Component Analysis (PCA) was done for dimensionality reduction.

3. The first algorithm used was the Random Forest Regressor from the scikit-learn library. Its feature importances were also collected.

4. Next, the same algorithm was used on the PCA data, which resulted in more bias but less variance.

5. Gradient Boosting Regressor was then used to decrease bias. The PCA dataset was used here, as it had less variance to begin with.

6. Then the 25 most important features, ranked by gradient boosting feature importance, were fed into Gradient Boosting, which proved to be the best of the four models.

7. For all models, parameters were chosen using randomized search with cross-validation, and training scores were gathered using cross-validation.

8. Finally, all four models were stacked. The top-level algorithm was linear regression, which proved to be the most effective model.