# What are Gradient Boosting (GB) and XGBoost (XGB)?
* XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. It has a wide range of applications and can be used to solve both regression and classification problems.



* XGBoost is an implementation of gradient-boosted decision trees designed for speed and performance.



* Boosting is an ensemble technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made.



* Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.



* This approach supports both regression and classification predictive modeling problems.
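
The residual-fitting idea described above can be sketched with one-split trees (stumps) and squared-error loss. This is a minimal illustration on made-up data, not XGBoost's implementation; the helper names `fit_stump` and `boost` and the learning rate of 0.5 are assumptions chosen for the example.

```python
# Toy gradient boosting for squared error: each new stump is fitted to the
# residuals of the current ensemble. Data and names are illustrative only.

def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:          # candidate split points
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value

def boost(x, y, n_rounds=20, learning_rate=0.5):
    base = sum(y) / len(y)                         # initial model: the mean
    preds = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)            # new model predicts the errors
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y))
    return base, stumps, history

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]
base, stumps, mse_history = boost(x, y)
print(mse_history[0], mse_history[-1])             # training MSE, first vs last round
```

The final prediction for a row is `base` plus the sum of `learning_rate` times each stump's output, which is the "added together" step in the description.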


## Decision tree, Bagging, Random forest, Boosting, Gradient Boosting, XGBoost
![](xgbt.png)


**Why does XGBoost perform so well?**
* XGBoost and Gradient Boosting Machines (GBMs) are both ensemble tree methods that apply the principle of boosting weak learners using the gradient descent architecture. However, XGBoost improves upon the base GBM framework through systems optimization and algorithmic enhancements.


#### 1.Regularization: 
* Regularization is considered one of the dominant factors behind the algorithm. It penalizes model complexity (L1 and L2 penalties on the leaf weights, plus a minimum loss reduction per split) in order to reduce overfitting.
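
How the penalty acts can be sketched with the regularised leaf weight and split gain formulas from the XGBoost documentation ("Introduction to Boosted Trees"). The toy targets below are made up, and squared-error loss with a current prediction of 0 is assumed so that the gradient is `pred - y` and the hessian is 1.

```python
# G and H are the sums of first and second derivatives of the loss over the
# rows that fall into a node. Function names here are illustrative.

def leaf_weight(G, H, reg_lambda):
    # Optimal leaf value under an L2 penalty on the leaf weights.
    return -G / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma):
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G_left + G_right, H_left + H_right)) - gamma

y_left, y_right = [1.0, 1.2], [3.0, 3.4]          # made-up targets per child
G_left, H_left = -sum(y_left), len(y_left)        # gradient = 0 - y, hessian = 1
G_right, H_right = -sum(y_right), len(y_right)

for reg_lambda in (0.0, 1.0, 10.0):
    weight = leaf_weight(G_left, H_left, reg_lambda)
    gain = split_gain(G_left, H_left, G_right, H_right, reg_lambda, gamma=0.0)
    print(reg_lambda, round(weight, 3), round(gain, 3))
```

In this example a larger `reg_lambda` shrinks the leaf weight toward zero and lowers the gain of the split, and `gamma` is subtracted from every gain, so weak splits stop being worthwhile.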

#### 2.Cross-Validation: 
* With scikit-learn, cross-validation is done by importing a helper function, whereas XGBoost comes with its own built-in cross-validation function (`xgboost.cv`).

#### 3.Missing Value:  
* XGBoost is designed to handle missing values natively. At each split it learns from the training data which branch the rows with a missing value should follow, so imputation is not strictly required.
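
One way to picture the handling of missing values is the "default direction" idea: at a candidate split, rows with a missing feature value are tried on the left and on the right, and the better option is kept. The sketch below is a simplified illustration with made-up numbers, assuming squared-error loss, a current prediction of 0 and an L2 penalty of 1; it is not XGBoost's actual code.

```python
def score(G, H, reg_lambda=1.0):
    return G * G / (H + reg_lambda)

def gain(left, right, reg_lambda=1.0):
    G_l, H_l = -sum(left), len(left)              # gradient = 0 - y, hessian = 1
    G_r, H_r = -sum(right), len(right)
    return 0.5 * (score(G_l, H_l, reg_lambda) + score(G_r, H_r, reg_lambda)
                  - score(G_l + G_r, H_l + H_r, reg_lambda))

def best_default_direction(values, targets, threshold):
    left = [t for v, t in zip(values, targets) if v is not None and v < threshold]
    right = [t for v, t in zip(values, targets) if v is not None and v >= threshold]
    missing = [t for v, t in zip(values, targets) if v is None]
    gain_left = gain(left + missing, right)       # missing rows sent left
    gain_right = gain(left, right + missing)      # missing rows sent right
    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right

sqft = [600, 650, None, 1500, 1600, None]         # None marks a missing value
price = [30.0, 32.0, 31.0, 80.0, 85.0, 29.0]
direction, split_gain = best_default_direction(sqft, price, threshold=1000)
print(direction, round(split_gain, 2))
```

Here the rows with missing `sqft` have prices close to the small houses, so sending them left gives the higher gain.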

#### 4.Flexibility:
* XGBoost supports custom objective functions, which define the loss that the model minimizes during training. It can also handle user-defined evaluation metrics for validation.



## System Optimization

#### Parallelization:
* XGBoost approaches the process of sequential tree building using a parallelized implementation. Boosting itself stays sequential, because each tree depends on the previous ones; the parallelism is within the construction of a single tree. It is possible because the two nested loops used for building base learners are interchangeable: the outer loop enumerates the leaf nodes of a tree, and the inner loop computes over the features. In that order the nesting limits parallelization, because the outer loop cannot move on until the inner loop (the more computationally demanding of the two) has finished. Therefore, to improve run time, the order of the loops is interchanged: all instances are scanned once at initialization and sorted using parallel threads. This switch improves performance by offsetting the parallelization overhead.
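
The point that each feature can be scanned independently can be sketched as follows. Every column is sorted once and then scanned linearly with running gradient sums, and the per-feature scans are handed to a thread pool. This is an illustration only: XGBoost does this in C++ on presorted column blocks, and the Python thread pool here merely shows that the scans do not depend on each other. The data, the function name `best_split_sorted` and the L2 penalty of 1 are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def best_split_sorted(values, grads, hessians, reg_lambda=1.0):
    order = sorted(range(len(values)), key=lambda i: values[i])   # sort once
    G_total, H_total = sum(grads), sum(hessians)
    G_left = H_left = 0.0
    best_gain, best_threshold = 0.0, None
    for pos in range(len(order) - 1):
        i, j = order[pos], order[pos + 1]
        G_left += grads[i]                        # running sums for the left child
        H_left += hessians[i]
        if values[i] == values[j]:
            continue                              # no threshold between equal values
        G_right, H_right = G_total - G_left, H_total - H_left
        gain = 0.5 * (G_left ** 2 / (H_left + reg_lambda)
                      + G_right ** 2 / (H_right + reg_lambda)
                      - G_total ** 2 / (H_total + reg_lambda))
        if gain > best_gain:
            best_gain, best_threshold = gain, (values[i] + values[j]) / 2
    return best_threshold, best_gain

features = {"sqft": [600, 650, 1500, 1600], "floor": [1, 3, 2, 4]}
price = [30.0, 32.0, 80.0, 85.0]
mean_price = sum(price) / len(price)
grads = [mean_price - p for p in price]           # squared error: pred - y
hessians = [1.0] * len(price)

with ThreadPoolExecutor() as pool:                # one scan per feature
    scans = pool.map(lambda col: best_split_sorted(col, grads, hessians),
                     features.values())
    results = dict(zip(features, scans))

best_feature = max(results, key=lambda name: results[name][1])
print(best_feature, results[best_feature])
```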

#### Tree Pruning: 
* The stopping criterion for tree splitting within the GBM framework is greedy in nature and depends on the negative loss criterion at the point of the split. XGBoost instead grows each tree up to the specified `max_depth` first and then prunes backward, removing splits that do not bring enough gain. This 'depth-first' approach improves computational performance significantly.
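
Backward pruning can be sketched on a tree stored as nested dictionaries. The tree, its gains and the `prune` helper are made up for illustration; the rule assumed here is that a split is collapsed into a leaf when both of its children are leaves and its gain is below `gamma`.

```python
def prune(node, gamma):
    if "leaf" in node:
        return node
    # Prune the children first, so that the check runs from the bottom up.
    node = dict(node,
                left=prune(node["left"], gamma),
                right=prune(node["right"], gamma))
    both_leaves = "leaf" in node["left"] and "leaf" in node["right"]
    if both_leaves and node["gain"] < gamma:
        return {"leaf": node["weight"]}           # collapse the weak split
    return node

# A fully grown depth-2 tree: the root split is weak, its left child is strong.
tree = {
    "gain": 0.2, "weight": 0.0,
    "left": {"gain": 5.0, "weight": -1.0,
             "left": {"leaf": -2.0}, "right": {"leaf": 0.5}},
    "right": {"gain": 0.1, "weight": 1.0,
              "left": {"leaf": 0.9}, "right": {"leaf": 1.1}},
}

pruned = prune(tree, gamma=1.0)
print(pruned)
```

The weak right-hand split is removed, while the weak root split survives because it leads to a strong split below it. A purely greedy stopping rule would have stopped at the root and never found that deeper split.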


#### Hardware Optimization:
* This algorithm has been designed to make efficient use of hardware resources. Cache awareness is achieved by allocating internal buffers in each thread to store gradient statistics. Further enhancements such as 'out-of-core' computing optimize available disk space while handling large data frames that do not fit into memory.









**When to Use XGBoost?**

1. When you have a large number of observations in the training data.

2. When the number of features is smaller than the number of observations in the training data.

3. When the data has a mixture of numerical and categorical features, or only numeric features.

4. When model performance metrics are the main consideration.

## Business Case: With the given features we need to predict the price of the house.

In [None]:
## importing the libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

In [None]:
##loading the dataset
data=pd.read_csv('hp_data.csv')

# Basic checks

In [None]:
data.head()#first five rows

In [None]:
data.tail()#last five rows

In [None]:
data.info()#to check null values and datatype

In [None]:
data.describe()##used to view some basic statistical details like percentile, mean, std etc. 

# Exploratory Data Analysis (EDA) --------------- TASK

### Univariate Analysis

### Bivariate Analysis

### Check the distribution for each column

# Data preprocessing

### Checking for null values

In [None]:
data.isnull().sum()

### Conversion of categorical columns into numerical columns
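
The relabelling of `place` in this section is done with one `.loc` assignment per place name. As a compact alternative, the same grouping can be sketched with a single dictionary and `Series.map`. The place names and phase groups are the ones used in this notebook; the small `demo` frame is made up and stands in for the real data.

```python
import pandas as pd

# Place name -> phase number, following the grouping used in this notebook.
phase_by_place = {
    **dict.fromkeys(["Devarabeesana Halli", "KR Puram", "BTM Layout",
                     "Abbaiah Reddy Layout", "Electronics City Phase 1"], 1),
    **dict.fromkeys(["Ambalipura", "Yelahanka", "Whitefield",
                     "Subramanyapura", "Yelachenahalli"], 2),
    **dict.fromkeys(["Sarakki Nagar", "Malleshwaram", "Gunjur",
                     "Frazer Town", "Rajaji Nagar"], 3),
}

# Made-up rows standing in for the real dataset.
demo = pd.DataFrame({"place": ["KR Puram", "Whitefield", "Gunjur", "BTM Layout"]})
demo["place"] = demo["place"].map(phase_by_place)   # unlisted names would become NaN
print(demo["place"].tolist())
```

Because `map` turns any name missing from the dictionary into `NaN`, a quick `isnull().sum()` afterwards shows whether a place was left out.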

# 1.place

In [None]:
## Data preprocessing
## reducing the labels present in the place
data.place.value_counts()#checking counts for each labels

In [None]:
#area wise classification of place columns using loc function
# for Phase 1


# fetching the records that has place as Devarabeesana Halli and for those records, 
# we need to change the value for 'place' as Phase1
data.loc[data['place']=='Devarabeesana Halli','place']='Phase1'

# do the same for other places that belong to Phase1
data.loc[data['place']=='KR Puram','place']='Phase1'
data.loc[data['place']=='BTM Layout','place']='Phase1'
data.loc[data['place']=='Abbaiah Reddy Layout','place']='Phase1'
data.loc[data['place']=='Electronics City Phase 1','place']='Phase1'

In [None]:
data.place.value_counts()

In [None]:
#for phase 2
data.loc[data['place']=='Ambalipura','place']='Phase2'
data.loc[data['place']=='Yelahanka','place']='Phase2'
data.loc[data['place']=='Whitefield','place']='Phase2'
data.loc[data['place']=='Subramanyapura','place']='Phase2'
data.loc[data['place']=='Yelachenahalli','place']='Phase2'

In [None]:
#for phase 3
data.loc[data['place']=='Sarakki Nagar','place']='Phase3'
data.loc[data['place']=='Malleshwaram','place']='Phase3'
data.loc[data['place']=='Gunjur','place']='Phase3'
data.loc[data['place']=='Frazer Town','place']='Phase3'
data.loc[data['place']=='Rajaji Nagar','place']='Phase3'

In [None]:
data.place.value_counts() #finally  15 labels classified into 3 labels

In [None]:
## conversion into numerical 
# on the basis of count
data.loc[data['place']=='Phase1','place']=1
data.loc[data['place']=='Phase2','place']=2
data.loc[data['place']=='Phase3','place']=3

In [None]:
data.place.value_counts()#checking for conversion

# 2.built

In [None]:
## preprocessing built
#checking count for each label
data.built.value_counts()

In [None]:
#conversion on the basis of counts 
# whichever label has the higher count gets the higher weight
data.loc[data['built']=='Super built-up  Area','built']=1
data.loc[data['built']=='Built-up  Area','built']=0

In [None]:
data.built.value_counts()#checking count for each label

# 3.sale

In [None]:
data.sale.value_counts()

In [None]:
data.drop(['sale'], axis=1, inplace=True)

In [None]:
data.info()

In [None]:
data['place'] = data['place'].astype('int64')
data['built'] = data['built'].astype('int64')

In [None]:
data.info()

# Checking for outliers

In [None]:
# let's see how data is distributed for every column
plt.figure(figsize=(20,25), facecolor='white')
plotnumber = 1 #counter

for column in data:#iterating over the columns of data
    if plotnumber<=8 :#the 4x2 grid below has room for 8 plots
        ax = plt.subplot(4,2,plotnumber)#next cell of the 4x2 grid
        sns.boxplot(x=data[column])#plotting a boxplot to spot outliers
        plt.xlabel(column,fontsize=20)
        
    plotnumber+=1
plt.show()

# Feature Selection


In [None]:
## Checking correlation

plt.figure(figsize=(10,10))#canvas size
sns.heatmap(data.corr(), annot=True, cmap="RdYlGn", annot_kws={"size":15})#plotting heat map to check correlation

# Model creation

In [None]:
## creating X and y
X=data.loc[:,['place', 'built', 'sqft','yearsOld', 'floor', 'totalFloor',            #independent variable 
       'bhk']]
y=data.price#dependent variable or target 

In [None]:
y

In [None]:
## creating training and testing data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=3)

# What is Gradient Boosting?
* Gradient boosting is a type of boosting in machine learning. It relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. The key idea is to set the target outcomes for this next model so as to minimize that error.
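
What "setting the target outcomes for the next model" means can be sketched for two common losses: the next model is trained on the negative gradient of the loss with respect to the current predictions. The function names and numbers below are illustrative assumptions.

```python
def negative_gradient_squared_error(y_true, y_pred):
    # loss = 0.5 * (y - pred)^2  ->  -d(loss)/d(pred) = y - pred (the residual)
    return [y - p for y, p in zip(y_true, y_pred)]

def negative_gradient_absolute_error(y_true, y_pred):
    # loss = |y - pred|  ->  -d(loss)/d(pred) = sign(y - pred)
    return [(y > p) - (y < p) for y, p in zip(y_true, y_pred)]

y_true = [30.0, 32.0, 80.0]                       # made-up prices
y_pred = [40.0, 32.0, 60.0]                       # predictions of the current ensemble

print(negative_gradient_squared_error(y_true, y_pred))
print(negative_gradient_absolute_error(y_true, y_pred))
```

For squared error the targets are the plain residuals, which is why gradient boosting is often described as fitting the errors of the previous models.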

In [None]:
## importing the model library
from sklearn.ensemble import GradientBoostingRegressor
gbm=GradientBoostingRegressor() ## object creation
gbm.fit(X_train,y_train) ## fitting the data
y_gbm=gbm.predict(X_test)#predicting the price


In [None]:
## evaluating the model
from sklearn.metrics import r2_score# to check model performance
r2_score(y_test,y_gbm)#checking r2score

In [None]:
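r2_gbm=r2_score(y_test,y_gbm)#r2 score on the test set
n,p=X_test.shape#number of test rows and number of features
adj_r2_score=1-(1-r2_gbm)*(n-1)/(n-p-1)#adjusted r2 score
adj_r2_score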

In [None]:
X_test.shape#rows and columns

In [None]:
X_train.shape

In [None]:
## Installing XGB library

!pip install xgboost

In [None]:
X_train.info()

In [None]:
## model creation
#importing the model library
from xgboost import XGBRegressor

#xgb_r = XGBRegressor(objective ='reg:squarederror', n_estimators = 10, random_state = 123)


xgb_r= XGBRegressor() ## object creation
xgb_r.fit(X_train,y_train)# fitting the data
y_hat=xgb_r.predict(X_test)#predicting the price

In [None]:
r2_score(y_test,y_hat)#R2 score

# Hyperparameter tuning in XGBoost

In [None]:
from sklearn.model_selection import RandomizedSearchCV


n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]#number of trees: 10 evenly spaced values from 200 to 2000
max_depth = [3,4,5,6]#The maximum depth of a tree
learning_rate=[0.1,0.2,0.3] #step size shrinkage. Typical final values to be used: 0.01-0.2
gamma=[0, 1, 2, 3, 4] # Gamma specifies the minimum loss reduction required to make a split. It controls overfitting. Ranges from 0 to ∞.
subsample=[0.5,0.7,1]#fraction of rows sampled for each tree. Lower values make the algorithm more conservative and prevent
                     #overfitting, but too small values might lead to under-fitting. Typical values: 0.5-1. Range: (0,1]
colsample_bytree=[0.5,0.7,1]#Denotes the fraction of columns to be randomly sampled for each tree. Range: (0,1]

params={
    'max_depth':max_depth,'learning_rate':learning_rate,'n_estimators':n_estimators,
     'gamma':gamma, 'subsample':subsample, 'colsample_bytree':colsample_bytree
}

XGB=XGBRegressor(random_state=42)

rcv= RandomizedSearchCV(XGB, scoring='r2',param_distributions=params, n_iter=100, cv=3, 
                                random_state=42, n_jobs=-1)
                              
#estimator--->the model to tune (here the XGBRegressor object)
#scoring--->performance metric used to compare the parameter settings
#param_distributions--->hyperparameters to sample from (the dictionary we created)
#n_iter--->number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution. default=10
#cv--->number of cross-validation folds
#verbose--->controls the verbosity: the higher, the more messages (not set here)
#n_jobs--->number of jobs to run in parallel; -1 means using all processors
rcv.fit(X_train, y_train) ##running the randomized search on the training data
cv_best_params = rcv.best_params_ ##it will give you best parameters 
print(f"Best parameters: {cv_best_params}")  ##printing  best parameters

In [None]:
cv_best_params

In [None]:

import xgboost as xgb
clf = xgb.XGBClassifier(max_depth=7, n_estimators=1000)#classifier counterpart; not used in this regression task

In [None]:
XGB2=XGBRegressor(subsample= 0.5,
 n_estimators= 2000,
 max_depth= 3,
 learning_rate= 0.3, gamma=2,
 colsample_bytree= 0.5)

XGB2.fit(X_train, y_train)#training 



In [None]:
y_predict=XGB2.predict(X_test)#testing

In [None]:
r2_score2=r2_score(y_test,y_predict)#checking performance
r2_score2

## XGBoost
### Pros
1. Less feature engineering required (no need to scale or normalize the data; missing values are handled well)
2. Feature importance is available (the model outputs the importance of each feature, which can be used for feature selection)
3. Fast to interpret at a global level (via feature importance)
4. Outliers in the input features have minimal impact.
5. Handles large datasets well.
6. Good execution speed
7. Good model performance (a frequent choice in winning Kaggle solutions)
8. Less prone to overfitting than a plain GBM, thanks to built-in regularization

### Cons
1. Difficult to interpret in detail and hard to visualize (the model is an ensemble of many trees)
2. Overfitting is possible if the parameters are not tuned properly


![](def2.png)
