# Rain Prediction with Logistic Regression

***

Rain is an essential part of our lives. Clouds give the gift of rain to humans, and the weather department tries to forecast when it will rain. So, I try to predict whether it will rain in Australia tomorrow or not.

In this kernel, I implement Logistic Regression with Python and Scikit-Learn and train a binary classifier to predict whether or not it will rain tomorrow in Australia. I use the Rain in Australia dataset for this project.

## 1. Import Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing

# import libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

## 2. Import Dataset

The next step is to import the data.

In [None]:
df=pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')

## 3. Exploratory data analysis

- we have imported the data.
- now, it's time to explore the data to gain insights about it.

### preview the dataset

In [None]:
df.head()

### view dimensions of dataset

In [None]:
df.shape

We can see that there are 145460 rows and 23 columns in the dataset.

### view column names

In [None]:
col_names=df.columns
col_names

#### Drop RISK_MM variable

We can see that some columns are less important than others, such as RISK_MM, so we should drop it.
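A minimal sketch of the drop, assuming the column is named `RISK_MM` as written above. The toy frame stands in for `df` and its values are made up; `errors='ignore'` makes the call a no-op when the column is not present in the loaded file.

```python
import pandas as pd

# Toy stand-in for the weather data (values are made up for illustration).
toy = pd.DataFrame({
    "Rainfall": [0.0, 3.2],
    "RISK_MM": [3.2, 0.0],
    "RainTomorrow": ["Yes", "No"],
})

# Drop the column if it exists; do nothing otherwise.
toy = toy.drop(columns=["RISK_MM"], errors="ignore")
print(toy.columns.tolist())
```

The same call on `df` would be `df = df.drop(columns=['RISK_MM'], errors='ignore')`.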

### Checking For datatypes of the attributes

In [None]:
df.info()

comment
- we can see that the dataset contains a mixture of categorical and numerical variables.
- categorical variables have data type: object
- numerical variables have data type: float64
- also, there are missing values in the dataset; we will explore them later

### View statistical properties of dataset

In [None]:
df.describe()

## 4. univariate Analysis

##### Explore Rain Tomorrow Target variable

#### check for missing values 

In [None]:
df['RainTomorrow'].isnull().sum()

We can see that there are 3267 missing values in RainTomorrow.

#### check for the unique values

In [None]:
df['RainTomorrow'].unique()

#### view the frequency of values 

In [None]:
df['RainTomorrow'].value_counts()

important points to note
- there are 3267 missing values
- there are 31877 observations where it rained the next day (Yes)
- there are 110316 observations where it did not rain the next day (No)

### view percentage of frequency values

to show the percentages, we divide each count by the length of the dataset

In [None]:
RainTomorrow={"Yes":31877,
             'No':110316,
             'Missing values':3267}

In [None]:
print('The percentage is:')
for key, value in RainTomorrow.items():
    # multiply by 100 so that the printed value is a percentage, not a fraction
    print(key, ':', round(value / len(df) * 100, 2))

Hence
- the RainTomorrow value No appears in about 75.8% of the rows
- the RainTomorrow value Yes appears in about 21.9% of the rows
- the RainTomorrow value is missing in about 2.2% of the rows
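The same shares can be computed directly from the column instead of typing the counts by hand. The sketch below uses a small made-up Series in place of `df['RainTomorrow']`; `value_counts(normalize=True, dropna=False)` returns fractions and counts missing values as their own category.

```python
import numpy as np
import pandas as pd

# Made-up labels standing in for df['RainTomorrow'].
labels = pd.Series(["No", "No", "No", "Yes", np.nan])

# Fractions of each value, with NaN kept as a separate category.
shares = labels.value_counts(normalize=True, dropna=False)
print((shares * 100).round(1))
```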

### Visualize frequency distribution of RainTomorrow variable

In [None]:
fig, ax = plt.subplots(figsize=(6,8))
sns.countplot(x='RainTomorrow', data=df, palette="Set3", ax=ax)
#plt.xticks(rotation=90)
ax.grid()
ax.set_title('Frequency values');

### Explore Categorical Variables

We first select the categorical variables from the dataset.

In [None]:
categorical= df.select_dtypes(include=['object'])
categorical.head()

 summary of categorical variables
 - There is a date variable, given by the Date column.
 - There are 6 categorical variables: Location, WindGustDir, WindDir9am, WindDir3pm, RainToday and RainTomorrow.
 - There are two binary categorical variables: RainToday and RainTomorrow.
 - RainTomorrow is the target variable.

### Missing values in Categorical Variables

In [None]:
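# count missing values per categorical column (the name avoids shadowing the built-in dict)
null_counts = {}
for col in categorical.columns:
    null_counts[col] = df[col].isnull().sum()
pd.DataFrame(null_counts, index=['number of null values']).transpose()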

#### Number of labels: cardinality

>The number of labels within a categorical variable is known as cardinality. A high number of labels within a variable is known as high cardinality. High cardinality may pose some serious problems in the machine learning model. So, I will check for high cardinality.

In [None]:
# check for cardinality in categorical variables

for var in categorical:
    
    print(var, ' contains ', len(df[var].unique()), ' labels')

We can see that there is a Date variable which needs to be preprocessed. I will do preprocessing in the following section.

All the other variables contain a relatively small number of labels.

#### Feature Engineering of Date Variable

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

We convert the Date variable to the datetime type so that we can process it.


In [None]:
df['Year'] = df['Date'].dt.year

df['Month'] = df['Date'].dt.month

df['Day'] = df['Date'].dt.day

 We have separated the Date variable into three variables. Now, we should drop the Date variable.

In [None]:
df.drop('Date',inplace= True,axis=1)

In [None]:
df.head()

Now, we can see that the Date variable is removed from the dataset, so we should explore the categorical data again.

#### Explore Categorical Variables one by one 

In [None]:
new_categorical= df.select_dtypes(include=['object'])
new_categorical.head()

#### check missing values in categorical variables.

In [None]:
new_categorical.isnull().sum()

We can see that WindGustDir, WindDir9am, WindDir3pm, RainToday variables contain missing values. I will explore these variables one by one

****

for every variable:
- I will check and show its labels
- I will convert the categorical variable into dummy/indicator variables with the function get_dummies()
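To make the two `get_dummies()` options used in this section concrete, here is a small sketch on made-up wind directions. `drop_first=True` removes the first level to avoid a redundant column, and `dummy_na=True` adds an indicator column for missing values.

```python
import numpy as np
import pandas as pd

# Made-up wind directions with one missing value.
wind = pd.Series(["N", "S", np.nan, "N"], name="WindGustDir")

# drop_first=True drops the first level ('N'); dummy_na=True adds a NaN column.
dummies = pd.get_dummies(wind, drop_first=True, dummy_na=True)
print(dummies)
```

A row of all zeros therefore means the dropped level, here `N`.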

#### Explore Location variable

In [None]:
new_categorical['Location'].unique()

In [None]:
pd.get_dummies(df.Location, drop_first=True).head()

#### Explore WindGustDir variable

In [None]:
new_categorical['WindGustDir'].unique()

In [None]:
pd.get_dummies(df.WindGustDir,drop_first=True,dummy_na=True).head()

#### Explore WindDir9am variable

In [None]:
new_categorical['WindDir9am'].unique()

In [None]:
pd.get_dummies(df.WindDir9am,drop_first=True,dummy_na=True).head()

#### Explore WindDir3pm variable

In [None]:
new_categorical['WindDir3pm'].unique()

In [None]:
pd.get_dummies(df.WindDir3pm,drop_first=True,dummy_na=True).head()

#### Explore RainToday variable

In [None]:
new_categorical['RainToday'].unique()

In [None]:
df.RainToday.value_counts()

In [None]:
pd.get_dummies(df.RainToday,drop_first=True,dummy_na=True).head()

### Explore Numerical Variables 

In [None]:
# 'number' selects every numeric dtype, including the integer Year, Month and Day columns
Numerical= df.select_dtypes(include='number')
Numerical.head()

In [None]:
Numerical.columns

### Missing values in numerical variables

In [None]:
Numerical.isnull().sum()

### Check summary statistics

In [None]:
Numerical.describe()

On closer inspection, we can see that the Rainfall, Evaporation, WindSpeed9am and WindSpeed3pm columns may contain outliers.

I will draw boxplots to visualise outliers in the above variables.

In [None]:
plt.figure(figsize=(15,10))


plt.subplot(2, 2, 1)
fig = df.boxplot(column='Rainfall')
fig.set_title('')
fig.set_ylabel('Rainfall')


plt.subplot(2, 2, 2)
fig = df.boxplot(column='Evaporation')
fig.set_title('')
fig.set_ylabel('Evaporation')


plt.subplot(2, 2, 3)
fig = df.boxplot(column='WindSpeed9am')
fig.set_title('')
fig.set_ylabel('WindSpeed9am')


plt.subplot(2, 2, 4)
fig = df.boxplot(column='WindSpeed3pm')
fig.set_title('')
fig.set_ylabel('WindSpeed3pm')

### Find all  outliers

In [None]:
# find outliers for Rainfall variable
IQR = df.Rainfall.quantile(0.75) - df.Rainfall.quantile(0.25)
Lower_fence = df.Rainfall.quantile(0.25) - (IQR * 1.5)
Upper_fence = df.Rainfall.quantile(0.75) + (IQR * 1.5)
print('Rainfall outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_fence, upperboundary=Upper_fence))

# find outliers for Evaporation variable

IQR = df.Evaporation.quantile(0.75) - df.Evaporation.quantile(0.25)
Lower_fence = df.Evaporation.quantile(0.25) - (IQR * 1.5)
Upper_fence = df.Evaporation.quantile(0.75) + (IQR * 1.5)
print('Evaporation outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_fence, upperboundary=Upper_fence))

# find outliers for WindSpeed9am variable

IQR = df.WindSpeed9am.quantile(0.75) - df.WindSpeed9am.quantile(0.25)
Lower_fence = df.WindSpeed9am.quantile(0.25) - (IQR * 1.5)
Upper_fence = df.WindSpeed9am.quantile(0.75) + (IQR * 1.5)
print('WindSpeed9am outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_fence, upperboundary=Upper_fence))

# find outliers for WindSpeed3pm variable

IQR = df.WindSpeed3pm.quantile(0.75) - df.WindSpeed3pm.quantile(0.25)
Lower_fence = df.WindSpeed3pm.quantile(0.25) - (IQR * 1.5)
Upper_fence = df.WindSpeed3pm.quantile(0.75) + (IQR * 1.5)
print('WindSpeed3pm outliers are values < {lowerboundary} or > {upperboundary}'.format(lowerboundary=Lower_fence, upperboundary=Upper_fence))



important points to note:
- For Rainfall, the values range from 0.0 to 371.0 and the upper fence is 2.0. So, the outliers are values > 2.0
- For Evaporation, the values range from 0.0 to 145.0 and the upper fence is 14.6. So, the outliers are values > 14.6
- For WindSpeed9am, the values range from 0.0 to 130.0 and the upper fence is 37. So, the outliers are values > 37
- For WindSpeed3pm, the values range from 0.0 to 87.0 and the upper fence is 40.5. So, the outliers are values > 40.5
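The four fence calculations repeat the same steps, so they can be wrapped in a small helper. This is only a sketch: `iqr_fences` is a hypothetical name, and the data is made up to keep the example self-contained.

```python
import pandas as pd

def iqr_fences(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up values; 100.0 is an obvious outlier.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
lower, upper = iqr_fences(values)
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers.tolist())
```

On the real data the call would look like `iqr_fences(df['Rainfall'])`.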

************

## 5.Multivariate Analysis 

> An important step in EDA is to discover patterns and relationships between variables in the dataset.

>I will use heat map and pair plot to discover the patterns and relationships in the dataset.

>First of all, I will draw a heat map.

In [None]:
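# correlations are only defined for numeric columns, so select them first
correlation = df.select_dtypes(include='number').corr()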

In [None]:
plt.figure(figsize=(16,12))
plt.title('Correlation Heatmap of Rain in Australia Dataset')
ax = sns.heatmap(correlation, square=True, annot=True, fmt='.2f', linecolor='white')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=30)           
plt.show()

Interpretation

From the above correlation heat map, we can conclude that:

- MinTemp and MaxTemp variables are highly positively correlated (correlation coefficient = 0.74).
- MinTemp and Temp3pm variables are also highly positively correlated (correlation coefficient = 0.71).
- MinTemp and Temp9am variables are strongly positively correlated (correlation coefficient = 0.90).
- MaxTemp and Temp9am variables are strongly positively correlated (correlation coefficient = 0.89).
- MaxTemp and Temp3pm variables are also strongly positively correlated (correlation coefficient = 0.98).
- WindGustSpeed and WindSpeed3pm variables are highly positively correlated (correlation coefficient = 0.69).
- Pressure9am and Pressure3pm variables are strongly positively correlated (correlation coefficient = 0.96).
- Temp9am and Temp3pm variables are strongly positively correlated (correlation coefficient = 0.86).
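Pairs like these can also be pulled out of the correlation matrix programmatically rather than read off the heat map. The sketch below uses a made-up frame and an assumed threshold of 0.7; it keeps only the upper triangle so that each pair is listed once.

```python
import numpy as np
import pandas as pd

# Made-up numeric data: b is an exact multiple of a, c is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once, then filter by threshold.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
strong = pairs[pairs.abs() > 0.7]
print(strong)
```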

# Data Modeling

***

### 6. Declare feature vector and target variable

In [None]:
X = df.drop(['RainTomorrow'], axis=1)

y = df['RainTomorrow']

### 7.Split data into separate training and test set

In [None]:
# split X and y into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
X_train.shape, X_test.shape

### 8. Feature Engineering

**Feature Engineering** is the process of transforming raw data into useful features that help us to understand our model better and increase its predictive power. I will carry out feature engineering on different types of variables.

First, I will display the categorical and numerical variables again separately.

In [None]:
# display categorical variables

categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']

categorical

In [None]:
# display numerical variables

numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']

numerical

### missing values in numerical variables

In [None]:
# check missing values in numerical variables in X_train

X_train[numerical].isnull().sum()

In [None]:
# check missing values in numerical variables in X_test

X_test[numerical].isnull().sum()

In [None]:
# impute missing values in X_train and X_test with respective column median in X_train

for df1 in [X_train, X_test]:
    for col in numerical:
        col_median=X_train[col].median()
        # assign the result back instead of calling fillna(inplace=True) on a column selection
        df1[col] = df1[col].fillna(col_median)

In [None]:
# check again missing values in numerical variables in X_train

X_train[numerical].isnull().sum()
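The median imputation above can also be expressed with scikit-learn's `SimpleImputer`, which learns the medians on the training data and reuses them on the test data. This is an alternative sketch on made-up frames, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up train/test frames with missing values.
train = pd.DataFrame({"Rainfall": [0.0, 1.0, np.nan, 5.0]})
test = pd.DataFrame({"Rainfall": [np.nan, 2.0]})

# Learn the median on the training data only, then apply it to both sets.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)
print(test_filled)
```

Fitting on the training set only keeps information from the test set out of the preprocessing step.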

In [None]:
# impute missing categorical variables with most frequent value

for df2 in [X_train, X_test]:
    for col in ['WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']:
        df2[col] = df2[col].fillna(X_train[col].mode()[0])

# the target also has missing values; each split is filled with its own most frequent label
y_train = y_train.fillna(y_train.mode()[0])
y_test = y_test.fillna(y_test.mode()[0])

In [None]:
# check missing values in categorical variables in X_train

X_train[categorical].isnull().sum()

In [None]:
X_train.isnull().sum()

#### Engineering outliers in numerical variables 

We have seen that the Rainfall, Evaporation, WindSpeed9am and WindSpeed3pm columns contain outliers. I will use the top-coding approach to cap the maximum values at the upper fences found above, which limits the influence of the outliers.

In [None]:
import numpy as np

def max_value(df3, variable, top):
    # cap every value above `top` at `top` (top-coding)
    return np.where(df3[variable]>top, top, df3[variable])

# cap the four columns in both the training and the test set
for df3 in [X_train, X_test]:
    df3['Rainfall'] = max_value(df3, 'Rainfall', 2)
    df3['Evaporation'] = max_value(df3, 'Evaporation', 14.6)
    df3['WindSpeed9am'] = max_value(df3, 'WindSpeed9am', 37)
    df3['WindSpeed3pm'] = max_value(df3, 'WindSpeed3pm', 40.5)

#### Encode categorical variables

In [None]:
X_train[categorical].head()

In [None]:
import category_encoders as ce

encoder = ce.BinaryEncoder(cols=['RainToday'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [None]:
X_train.head()

We can see that two additional variables RainToday_0 and RainToday_1 are created from RainToday variable.

Now, I will create the X_train training set.

In [None]:
X_train = pd.concat([X_train[numerical], X_train[['RainToday_0', 'RainToday_1']],
                     pd.get_dummies(X_train.Location), 
                     pd.get_dummies(X_train.WindGustDir),
                     pd.get_dummies(X_train.WindDir9am),
                     pd.get_dummies(X_train.WindDir3pm)], axis=1)

In [None]:
X_train.head()

Similarly, I will create the X_test testing set.

In [None]:
X_test = pd.concat([X_test[numerical], X_test[['RainToday_0', 'RainToday_1']],
                     pd.get_dummies(X_test.Location), 
                     pd.get_dummies(X_test.WindGustDir),
                     pd.get_dummies(X_test.WindDir9am),
                     pd.get_dummies(X_test.WindDir3pm)], axis=1)
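Because `get_dummies()` is called separately on the training and test sets, the two results only share the same columns when every label occurs in both splits. A hedged sketch of one way to guard against a mismatch, using made-up locations, is to reindex the test dummies to the training columns:

```python
import pandas as pd

# Made-up splits: 'Darwin' appears only in the training data.
train_loc = pd.Series(["Sydney", "Darwin", "Perth"])
test_loc = pd.Series(["Perth", "Sydney"])

train_dummies = pd.get_dummies(train_loc)
test_dummies = pd.get_dummies(test_loc)

# Give the test matrix exactly the training columns, filling absent ones with 0.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(test_dummies.columns.tolist())
```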

We now have training and testing set ready for model building. Before that, we should map all the feature variables onto the same scale. It is called feature scaling. I will do it as follows.

In [None]:
X_train.describe()

## 9.Feature Scaling 

In [None]:
cols = X_train.columns
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)

In [None]:
# pass the column index itself; wrapping it in a list would create a MultiIndex
X_train = pd.DataFrame(X_train, columns=cols)

X_test = pd.DataFrame(X_test, columns=cols)

In [None]:
X_train.describe()
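As a small worked example of what `MinMaxScaler` computes, the sketch below scales one made-up feature and reproduces the test values by hand with (x - min) / (max - min), where min and max come from the training data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single feature.
x_train = np.array([[10.0], [20.0], [30.0]])
x_test = np.array([[25.0], [40.0]])

scaler = MinMaxScaler()
scaled_train = scaler.fit_transform(x_train)
scaled_test = scaler.transform(x_test)

# Same result by hand, using the training min and max.
manual_test = (x_test - x_train.min()) / (x_train.max() - x_train.min())
print(scaled_test.ravel(), manual_test.ravel())
```

Test values that lie outside the training range end up outside [0, 1], as the second test value does here.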

We now have X_train dataset ready to be fed into the Logistic Regression classifier. I will do it as follows.

## 10.Model training

In [None]:
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression


# instantiate the model
logreg = LogisticRegression(solver='liblinear', random_state=0)


# fit the model
logreg.fit(X_train, y_train)

## 11.Predict results

In [None]:
y_pred_test = logreg.predict(X_test)

y_pred_test

**predict_proba method**

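The **predict_proba** method gives the predicted probability of each class of the target variable, in array form. The columns follow the order of logreg.classes_.

Here the classes are No and Yes, so column 0 holds the probability of no rain and column 1 holds the probability of rain.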

In [None]:
# probability of getting output as 0 - no rain

logreg.predict_proba(X_test)[:,0]

In [None]:
 #probability of getting output as 1 - rain

logreg.predict_proba(X_test)[:,1]
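Which column belongs to which class can be checked through the `classes_` attribute of the fitted model. The sketch below fits a logistic regression on made-up one-feature data with string labels and looks up the rain column by name instead of assuming its position.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data with string labels.
X_toy = np.array([[0.0], [0.2], [0.8], [1.0]])
y_toy = np.array(["No", "No", "Yes", "Yes"])

clf = LogisticRegression(solver="liblinear", random_state=0).fit(X_toy, y_toy)

print(clf.classes_)            # column order of predict_proba
proba = clf.predict_proba(X_toy)
rain_col = list(clf.classes_).index("Yes")
print(proba[:, rain_col])      # probability of rain for each row
```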

## 12.Check accuracy score 

In [None]:
from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred_test)))

### RandomForest Model

In [None]:
from sklearn.model_selection import GridSearchCV

from sklearn.ensemble import RandomForestClassifier

#rf_params = {'n_estimators':[100,150,200],'criterion':['gini','entropy'],}
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
# use a separate name so that y_pred_test keeps the logistic regression predictions
y_pred_rf=rf.predict(X_test)
print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred_rf)))

### AdaBoost Model

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier()
ada.fit(X_train, y_train)
# use a separate name so that y_pred_test keeps the logistic regression predictions
y_pred_ada=ada.predict(X_test)
print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred_ada)))

**Accuracy of the models:**
1. **LogisticRegression**: 84.76%
2. **RandomForestClassifier**: 85.39%
3. **AdaBoostClassifier**: 84.24%

## 13.Confusion matrix 

A confusion matrix is a tool for summarizing the performance of a classification algorithm. A confusion matrix will give us a clear picture of classification model performance and the types of errors produced by the model. It gives us a summary of correct and incorrect predictions broken down by each category. The summary is represented in a tabular form.

Four types of outcomes are possible while evaluating a classification model performance. These four outcomes are described below:-

**True Positives (TP)** – True Positives occur when we predict an observation belongs to a certain class and the observation actually belongs to that class.

**True Negatives (TN)** – True Negatives occur when we predict an observation does not belong to a certain class and the observation actually does not belong to that class.

**False Positives (FP)** – False Positives occur when we predict an observation belongs to a certain class but the observation actually does not belong to that class. This type of error is called **Type I error**.

**False Negatives (FN)** – False Negatives occur when we predict an observation does not belong to a certain class but the observation actually belongs to that class. This is a very serious error and it is called **Type II error**.
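To see where these four outcomes sit in scikit-learn's output, here is a small sketch on made-up labels, treating rain (`Yes`) as the positive class. `confusion_matrix` puts actual classes in rows and predicted classes in columns, in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels; 'Yes' (rain) is treated as the positive class.
y_true = ["No", "No", "No", "Yes", "Yes", "No"]
y_hat = ["No", "Yes", "No", "Yes", "No", "No"]

# Rows are actual classes, columns are predicted classes, in the order of `labels`.
cm_toy = confusion_matrix(y_true, y_hat, labels=["No", "Yes"])
tn, fp, fn, tp = cm_toy.ravel()
print(cm_toy)
print(tn, fp, fn, tp)
```

With this ordering, the top-left cell counts true negatives and the bottom-right cell counts true positives.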

These four outcomes are summarized in a confusion matrix given below.

In [None]:
from sklearn.metrics import confusion_matrix

cm=confusion_matrix(y_test, y_pred_test)

print('Confusion matrix\n\n', cm)

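# rows are actual classes and columns are predicted classes, ordered ['No', 'Yes'];
# rain ('Yes') is the positive class
print('\nTrue Positives(TP) = ', cm[1,1])

print('\nTrue Negatives(TN) = ', cm[0,0])

print('\nFalse Positives(FP) = ', cm[0,1])

print('\nFalse Negatives(FN) = ', cm[1,0])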

The confusion matrix shows 21542 + 3115 = 24657 correct predictions and 1184 + 3251 = 4435 incorrect predictions.

In this case, taking rain (Yes) as the positive class, we have

- True Positives (Actual Positive:1 and Predict Positive:1) - 3115

- True Negatives (Actual Negative:0 and Predict Negative:0) - 21542

- False Positives (Actual Negative:0 but Predict Positive:1) - 1184 (Type I error)

- False Negatives (Actual Positive:1 but Predict Negative:0) - 3251 (Type II error)
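The four counts are enough to derive accuracy, precision and recall by hand. The sketch below assumes that rain (`Yes`) is the positive class and that the larger diagonal count, 21542, belongs to the `No` class, which is consistent with roughly 22% of the labels being `Yes`.

```python
# Counts reported above, with 'Yes' (rain) as the positive class (an assumption).
correct_no = 21542   # actual No, predicted No
correct_yes = 3115   # actual Yes, predicted Yes
no_as_yes = 1184     # actual No, predicted Yes
yes_as_no = 3251     # actual Yes, predicted No

total = correct_no + correct_yes + no_as_yes + yes_as_no
accuracy = (correct_no + correct_yes) / total
precision = correct_yes / (correct_yes + no_as_yes)
recall = correct_yes / (correct_yes + yes_as_no)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy agrees with the 84.76% reported for logistic regression, while the recall shows that about half of the rainy days are missed.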

In [None]:
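# scikit-learn puts actual classes in rows and predicted classes in columns, with No (0) before Yes (1)
cm_matrix = pd.DataFrame(data=cm, columns=['Predict Negative:0', 'Predict Positive:1'], 
                                 index=['Actual Negative:0', 'Actual Positive:1'])
cm_matrix.head()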

In [None]:
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu');