<h1 style='text-align:center'> Boosting for the Win: Will it rain in Seattle?</h1>
    
<p> About the dataset </p>
<ul>
    <li>DATE: This field specifies the date corresponding to the row. </li>
    <li>PRCP: This column specifies the amount of precipitation </li>
    <li>TMAX: This column specifies the Maximum temperature of the day </li>
    <li>TMIN: This column specifies the Minimum temperature of the day </li>
    <li>RAIN: This column is the categorical target variable: True if it rained that day, False if it did not.</li>
</ul>

<h1 style='text-align:center'> Introduction to Ensemble Learning </h1>

<h3> What is ensemble learning?</h3>
<p>Ensemble learning is a machine learning technique that combines several base models to produce a single, stronger predictive model. </p>
<p> Let us take a decision tree as an example. A decision tree arrives at a prediction through a series of questions and conditions. For instance, the simple decision tree below decides whether an individual should play outside. The tree takes several weather factors into account and, for each factor, either makes a decision or asks another question. For example, if it is raining, we must ask whether it is windy. If it is windy, we will not play. But given no wind, tie those shoelaces tight because we're going outside to play.</p>

<img src='https://miro.medium.com/max/750/1*ML5ABmp7pxnZuhIokXlPCw.png'>
<br>
<p> When building decision trees, there are several questions we must consider: On which features do we base our decisions? What threshold turns each question into a yes or no answer? In the first decision tree, what if we also wanted to ask whether we had friends to play with? If we have friends, we will play every time. If not, we might continue to ask questions about the weather. By adding a question, we hope to separate the Yes and No classes more cleanly.</p>
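<p> The toy tree described above can be written out by hand as nested conditions. This is only a sketch: the function name <code>should_play</code> and its arguments are made up for illustration, and the dry-weather branch is an assumption because the text does not spell it out. </p>

```python
def should_play(has_friends: bool, raining: bool, windy: bool) -> bool:
    """Hand-written version of the toy decision tree described in the text."""
    # New first question: with friends around, we always play.
    if has_friends:
        return True
    # Otherwise fall back to the weather questions.
    if raining:
        # Raining and windy -> stay in; raining but calm -> play.
        return not windy
    # Assumption: the text does not cover the dry-weather branch; we play.
    return True


print(should_play(has_friends=False, raining=True, windy=True))
print(should_play(has_friends=False, raining=True, windy=False))
print(should_play(has_friends=True, raining=True, windy=True))
```

<p> A learned decision tree has the same shape, except that the questions and thresholds are chosen from the data instead of being written by hand. </p>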

<h1> Why Ensemble Learning?</h1>

<p>This is where Ensemble Methods come in handy! Rather than just relying on one Decision Tree and hoping we made the right decision at each split, Ensemble Methods allow us to take a sample of Decision Trees into account, calculate which features to use or questions to ask at each split, and make a final predictor based on the aggregated results of the sampled Decision Trees.</p>
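<p> One simple way to aggregate several trees is a majority vote. The sketch below uses made-up predictions from three hypothetical trees; real ensemble libraries do this aggregation internally and often weight the votes. </p>

```python
import numpy as np

# Hypothetical predictions from three base trees for five days (1 = rain, 0 = no rain).
tree_preds = np.array([
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 0],
])

# Hard voting: a day is predicted as rain when more than half of the trees say so.
votes_for_rain = tree_preds.sum(axis=0)
ensemble_pred = (votes_for_rain > tree_preds.shape[0] / 2).astype(int)
print(ensemble_pred)
```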

<h1> Types of Ensemble Learning</h1>
<p>There are many ensemble techniques, but we will focus on the two most widely used:</p>
<ul>
    <li> Bagging </li>
    <li> Boosting</li>
</ul>
<p> To read more about how they differ, see this article <a href="https://towardsdatascience.com/types-of-ensemble-methods-in-machine-learning-4ddaf73879db">here</a>.</p>
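<p> As a rough illustration of the two families, the sketch below fits a bagging model and a boosting model from scikit-learn on synthetic data. The data, the variable names and the settings are assumptions chosen for the demo; this is not the Seattle dataset or the XGBoost model used later. </p>

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration (not the Seattle dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Bagging: many trees trained independently on bootstrap samples, then combined.
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Boosting: trees added one after another, each trying to correct earlier errors.
boosting = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

print("bagging accuracy :", bagging.score(X_te, y_te))
print("boosting accuracy:", boosting.score(X_te, y_te))
```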
 

**Enough with the theory. Let's get our hands dirty: load the data, take a look at it, and visualize it.**

In [None]:
import seaborn as sns
import numpy as np # linear algebra
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

import xgboost as xgb

In [None]:

df=pd.read_csv('/kaggle/input/did-it-rain-in-seattle-19482017/seattleWeather_1948-2017.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

<h2> I am personally not a huge fan of Boolean data type. So let's convert it to 1's and 0's </h2>

In [None]:
df['rain'] = df['RAIN'].map({True:1 ,False:0}) 
del df['RAIN']
df.head()

<h3> Now let's do some data visualization to understand the data more. </h3>

In [None]:
sns.countplot(x=df['rain'].values)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(14,6))
df['TMAX'].plot()

<p> The box plot below shows that TMAX differs between days when it rains and days when it doesn't. </p>

In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='rain',y='TMAX',data=df,palette='winter')

<h2> LET'S HANDLE MISSING VALUES NOW.</h2>

In [None]:
df.isna().sum()

<h4> We can either drop the rows with missing data or fill them with the mean, median, or mode. Only 3 rows have a missing value in the PRCP column, so dropping them would be fine. However, I prefer not to. Instead, we will impute the data, i.e. fill it with the mean, median, or mode. Let's figure out which one to use.</h4>

<h4> Let's visualize PRCP </h4>

In [None]:
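import matplotlib.pyplot as plt

# histplot replaces the deprecated distplot; let seaborn pick the bin count
# (bins=1 would collapse the whole distribution into a single bar)
sns.histplot(df['PRCP'], kde=True, color='blue')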

<h3> Clearly, the data has quite a few outliers. Since the median is less sensitive to outliers than the mean, let's impute the missing values in this column with the median. </h3>
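<p> To see how a single outlier affects the two statistics, here is a small sketch on a made-up series. The values and the names <code>prcp_demo</code> and <code>filled_demo</code> are hypothetical and not taken from the dataset. </p>

```python
import pandas as pd

# Made-up precipitation values: mostly dry days, one extreme day, one missing value.
prcp_demo = pd.Series([0.0, 0.0, 0.0, 0.1, 0.2, 5.0, None])

# The single extreme value pulls the mean up, while the median stays near typical days.
print("mean  :", prcp_demo.mean())
print("median:", prcp_demo.median())

# Fill the missing entry with the median, as done for PRCP in this notebook.
filled_demo = prcp_demo.fillna(prcp_demo.median())
print(filled_demo.tolist())
```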

In [None]:
df['PRCP']=df['PRCP'].fillna(df['PRCP'].median())

<h3> Now let's fill the nulls in the target variable </h3>

In [None]:
df['rain']=df['rain'].fillna(df['rain'].value_counts().index[0])

In [None]:
df.isnull().sum()

<h2> Now we have a column named DATE.</h2>
<ul>
<li>We will first convert it into date Format</li>
 <li>Then we will extract the month field from date</li>
 <li>We will then try to see if there is a relation between the month and rainfall</li>
 <li>If we see there is a relation, we shall keep the month feature in our model</li>
</ul>
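<p> The steps listed above can be sketched on a tiny made-up frame. The <code>demo</code> frame and its values are assumptions for illustration; the <code>.dt</code> accessor is one of several ways to pull the month out of a datetime column. </p>

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up).
demo = pd.DataFrame({
    "DATE": ["2017-01-01", "2017-01-02", "2017-07-01", "2017-07-02"],
    "rain": [1, 1, 0, 1],
})

demo["DATE"] = pd.to_datetime(demo["DATE"])
demo["month"] = demo["DATE"].dt.month  # .dt accessor works on datetime columns

# Share of rainy days per month: the mean of a 0/1 column is a proportion.
rain_share = demo.groupby("month")["rain"].mean()
print(rain_share)
```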

In [None]:
df['DATE'] = pd.to_datetime(df['DATE'])

In [None]:
df['month']=pd.DatetimeIndex(df['DATE']).month
del df['DATE']

In [None]:
a=df.pivot_table(index=['rain'], columns='month', aggfunc='size', fill_value=0)
key=a.loc[1.0].index
value=a.loc[1.0].values

import seaborn as sns
ax = sns.barplot(x=key, y=value)


<h3> Clearly, some months have far more rainy days than others, so we will keep this feature.</h3>
<p> We shall treat it as a categorical variable and one-hot encode it. </p>
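<p> For readers new to one-hot encoding, here is a minimal sketch of what <code>pd.get_dummies</code> produces on a made-up month column. The frame <code>months_demo</code> is hypothetical. </p>

```python
import pandas as pd

# Three made-up rows with different months.
months_demo = pd.DataFrame({"month": [1, 2, 12]})

# Each distinct month becomes its own indicator column named month_<value>.
encoded_demo = pd.get_dummies(months_demo, columns=["month"])
print(encoded_demo.columns.tolist())
print(encoded_demo)
```

<p> Every row has exactly one indicator switched on, so the model sees the month as a category and not as an ordered number. </p>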

In [None]:
df = pd.get_dummies(df, columns = ['month'])

In [None]:
# Features are every column except the target; selecting by name keeps all month dummies
X = df.drop(columns=['rain'])
y = df['rain']

<h2> Let's begin with model building. I am running XGBoost here. First, let's do a train-test split. </h2>

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.15, random_state = 42)

In [None]:
xgb_m = xgb.XGBClassifier()
# Hyperparameter grid to search over
params = {
    'learning_rate': [.1, .4, .45, .5, .55, .6],
    'colsample_bytree': [.6, .7, .8, .9, 1],
    'booster': ["gbtree"],
    'min_child_weight': [0.001, 0.003, 0.01],
}
xgb_cv = GridSearchCV(xgb_m, params, scoring="accuracy", verbose=0, cv=5)
xgb_cv.fit(X_train, y_train)
best_params = xgb_cv.best_params_
print(f"Best parameters: {best_params}")
# Refit on the training set using the best parameters found by the search
xgb_m = xgb.XGBClassifier(**best_params)
xgb_m.fit(X_train, y_train)

In [None]:
pred = xgb_m.predict(X_test)

<p> Now let's look at the model's F1 score and confusion matrix. </p>
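<p> As a reminder of how the two are related, the sketch below computes the F1 score by hand from a confusion matrix on made-up labels and checks it against scikit-learn. The labels <code>y_true_demo</code> and <code>y_pred_demo</code> are hypothetical and unrelated to the model's output. </p>

```python
from sklearn.metrics import confusion_matrix, f1_score

# Made-up true labels and predictions (1 = rain, 0 = no rain).
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)  # of the days predicted as rain, how many were rainy
recall = tp / (tp + fn)     # of the rainy days, how many were caught
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true_demo, y_pred_demo))
```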

In [None]:
from sklearn.metrics import f1_score, confusion_matrix

print(round(f1_score(y_test, pred,average='binary'), 5))
confusion_matrix(y_test, pred)

<h1> PERFECTO. PLEASE UPVOTE IF YOU ENJOYED THE NOTEBOOK </h1> 