# 1. Importing Packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import datetime
import warnings
warnings.filterwarnings(action="ignore")
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import accuracy_score, confusion_matrix,classification_report,roc_curve, roc_auc_score

# 2. Exploratory Data Analysis

### Reading and analyzing the data

In [None]:
df = pd.read_csv("../input/weather-dataset-rattle-package/weatherAUS.csv")

print(f"The data has {int(df.shape[0])} rows and {int(df.shape[1])} columns")
print("\n")

# creating a function to print the summary of any dataframe
# Datatype selects which rows are printed: "all", "categorical" or "continuous"

def print_summary(df,Datatype):
    keys = []
    types = []
    unique_number = []
    all_unique_values = []
    missing_rows_number = []
    missing_rows_percentage = []

    for i in df:
        keys.append(i)
        # var_type avoids shadowing the built-in type()
        var_type = "categorical variable" if str(df[i].dtype) == "object" else "continuous variable"
        types.append(var_type)
        missing_rows_number.append(df[i].isna().sum())
        missing_rows_percentage.append(((df[i].isna().sum()/df.shape[0])*100).round(2))

        if var_type == "categorical variable":
            unique_values = df[i].nunique()

            # listing the actual values only for two-valued (boolean-like) columns
            if unique_values == 2:
                all_unique_values.append(df[i].unique())
            else:
                all_unique_values.append("Not a Boolean Value")

            unique_number.append(unique_values)

        else:
            unique_number.append("NA")
            all_unique_values.append("NA")

    summary_df = pd.DataFrame(data = {"columns":keys,
                                      "DataType":types,
                                      "missingRows":missing_rows_number,
                                      "PercentageMissing":missing_rows_percentage,
                                      "uniqueValuesNumber":unique_number,
                                      "uniqueValues":all_unique_values})
    
    if Datatype == "all":
        print(summary_df)
    
    if Datatype == "categorical":
        print(summary_df[summary_df["DataType"] == "categorical variable"])
        
    # the old misspelling "continious" is still accepted
    if Datatype in ("continuous", "continious"):
        print(summary_df[summary_df["DataType"] == "continuous variable"])


# printing the summary of the original dataframe
print_summary(df = df,Datatype = "all" )


### Let's explore the target variable RainTomorrow

In [None]:
print("Number of unique values in the target variable RainTomorrow is %d" %(df["RainTomorrow"].nunique()))
print("Unique values in the target variable RainTomorrow are %s" %(df["RainTomorrow"].unique()))
print("Number of missing values in the target variable RainTomorrow is %d" %(df["RainTomorrow"].isna().sum()))

It looks like the target variable has 3267 missing values.
Do we need to impute the missing values or delete those rows?
My approach is to delete them, because imputing the target would mean building another model just to predict it.
That model would have to be 100% accurate to avoid training on wrong labels, which is quite unlikely.

In [None]:
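nonNullRows = df["RainTomorrow"].notna()
# .copy() makes df an independent dataframe, so later column assignments do not act on a slice
df = df[nonNullRows].copy()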

All the rows which have a null value for the target variable RainTomorrow are removed.

Let's further analyze the target variable after removing the null values

In [None]:
print("Individual count of unique value of Target variable \n",df['RainTomorrow'].value_counts())
print("\n\n")
print("Percentage composition of unique value of Target variable \n",df['RainTomorrow'].value_counts()/len(df))

**77.5% of the rows in the dataset are No**

So if we predict No for every row in the test data, our model will still have an overall accuracy of about 77.5%, which may look good to a layman's eyes but will not help them in their purpose.

That's why classification metrics such as **precision** and **recall** come in handy to evaluate our classification model.
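
To make the baseline concrete, here is a minimal sketch with made-up labels (775 "No" and 225 "Yes", chosen only to mirror the 77.5% share; they are not the real rows). A "model" that always answers "No" reaches the majority-class accuracy but never finds a rainy day:

```python
# Illustrative labels only: 775 "No" and 225 "Yes" to mirror the 77.5% share
y_true = ["No"] * 775 + ["Yes"] * 225

# A "model" that always predicts the majority class
y_pred = ["No"] * len(y_true)

# Accuracy: share of predictions that match the true label
baseline_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the "Yes" class: share of actual rainy days predicted as rainy
actual_yes = sum(t == "Yes" for t in y_true)
found_yes = sum(t == "Yes" and p == "Yes" for t, p in zip(y_true, y_pred))
baseline_recall_yes = found_yes / actual_yes

print(f"Accuracy: {baseline_accuracy:.3f}")
print(f"Recall for Yes: {baseline_recall_yes:.3f}")
```

The accuracy looks respectable while the recall for "Yes" is zero, which is exactly the gap that precision and recall expose.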

### Exploring all the categorical variables

In [None]:
print_summary(df = df,Datatype = "categorical" )

There are 7 categorical variables, of which **RainTomorrow** is the target variable. So we have 6 categorical variables of interest.


**Location**

This weather data covers 49 different locations around Australia. Location has no missing values in this dataset.
Being a label rather than a number, Location cannot be used directly for prediction unless we group the data by location and predict the rain for each individual location.
If we want to do prediction for the whole dataset, then we need to create dummy variables for the Location variable.
Nevertheless, we can explore the Location variable to discover its usefulness, if any.
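
As a rough illustration of what dummy variables look like, the sketch below one-hot encodes a toy frame with `pd.get_dummies`. The three location names, the rainfall values and the `Loc` prefix are illustrative assumptions, not taken from the dataset summary above:

```python
import pandas as pd

# Toy frame; the location names and rainfall values are illustrative only
toy = pd.DataFrame({"Location": ["Albury", "Sydney", "Perth", "Sydney"],
                    "Rainfall": [0.6, 0.0, 2.4, 1.0]})

# One 0/1 column per location; dtype=int gives integers instead of booleans
toy_dummies = pd.get_dummies(toy, columns=["Location"], prefix="Loc", dtype=int)
print(toy_dummies)
```

With 49 locations this would add 49 columns (or 48 with `drop_first=True`), which is the price of using Location across the whole dataset.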

**Date**

Date is classified as a categorical variable.

It also has 3436 unique values, which poses a risk of **high cardinality** (**cardinality** is the number of unique values of a categorical variable). We need to do some preprocessing to extract something useful from it.
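
The effect of such preprocessing on cardinality can be sketched with the standard library alone. The 730 consecutive days below are a made-up stand-in for the Date column, assuming the same "YYYY-MM-DD" string layout:

```python
from datetime import date, timedelta

# Illustrative stand-in for the Date column: 730 consecutive days as "YYYY-MM-DD" strings
start = date(2015, 1, 1)
dates = [(start + timedelta(days=n)).isoformat() for n in range(730)]

# As raw strings, every day is its own category
date_cardinality = len(set(dates))

# Keeping only the month part collapses the column to at most 12 categories
months = [int(d.split("-")[1]) for d in dates]
month_cardinality = len(set(months))

print(date_cardinality, month_cardinality)
```

A column with one category per day carries almost no repeatable signal, whereas the month can capture seasonality.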


**RainToday** is a boolean value indicating whether it rained today.

The rest of the categorical variables indicate the direction of the wind. We need to explore them further to find out how useful they are.

WindGustDir, WindDir9am and WindDir3pm have 16 unique values each.




In [None]:
for col in ["Location","RainToday","WindGustDir","WindDir9am","WindDir3pm"]:
    print(col)
    print("The number of unique values are " + str(df[col].nunique()))
    print("Number of missing values are " + str(df[col].isnull().sum()))
    print("\n")

In [None]:
print(df.describe().T)

### Imputing the missing values

*Filling the categorical values with the mode and the continuous values with the mean*

There are other ways to fill the missing values, such as using a regression model.
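
The notebook itself uses the mode and the mean, but as a minimal sketch of the regression alternative, the example below fills a missing value from a straight line fitted on the complete rows. The numbers are toy values, and using Temp3pm to predict MaxTemp is an assumption made only for illustration:

```python
import numpy as np

# Toy values (illustrative only): MaxTemp is missing in two rows
temp3pm = np.array([20.0, 22.0, 25.0, 30.0, 18.0])
max_temp = np.array([21.5, 23.5, np.nan, 31.5, np.nan])

# Fit MaxTemp = slope * Temp3pm + intercept on the rows where MaxTemp is known
known = ~np.isnan(max_temp)
slope, intercept = np.polyfit(temp3pm[known], max_temp[known], deg=1)

# Fill only the missing rows with the fitted line
max_temp_filled = max_temp.copy()
max_temp_filled[~known] = slope * temp3pm[~known] + intercept

print(max_temp_filled)
```

Unlike the mean, the filled values follow the related variable, at the cost of fitting one extra model per imputed column.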

In [None]:
# Filling the categorical values with the mode and the continuous values with the mean
# Column types are read directly from df, because summary_df only exists inside print_summary

for column_name in df.columns:

    if str(df[column_name].dtype) == "object":
        df[column_name]= df[column_name].fillna(df[column_name].mode()[0])
    
    else:
        df[column_name]= df[column_name].fillna(df[column_name].mean())
    


# 3. Feature Engineering

### Investigating categorical variable with too many categories

In [None]:
# summary_df is local to print_summary, so the summary is rebuilt through the function
print_summary(df = df,Datatype = "categorical" )

* Breaking down the Date variable, because some of the information in the date could be correlated with the target variable.
* Replacing Boolean values Yes,No with 1,0
* Encoding the categorical variables
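
Before encoding, it helps to see what label encoding does to wind directions. The pure-Python sketch below mimics the mapping that `LabelEncoder` produces (integer codes assigned to the sorted unique labels); the `encode_labels` helper and the sample values are hypothetical and only for illustration:

```python
def encode_labels(values):
    """Map each label to its position among the sorted unique labels."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# Illustrative wind directions only
wind = ["N", "NE", "E", "S", "W", "N"]
codes, mapping = encode_labels(wind)

print(mapping)
print(codes)
```

The codes follow alphabetical order, not compass order, so a model that treats them as numbers sees an ordering that has no physical meaning.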

In [None]:
# Working on Date Variable: parsing the "YYYY-MM-DD" strings once for the whole column
df['date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

# Month of the year can be correlated with rain, so let's also have it as one of our predictors
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year

# Changing to Boolean values for columns RainToday and  RainTomorrow
df['RainToday'] = df['RainToday'].replace(to_replace={"Yes":1,"No":0})
df['RainTomorrow'] = df['RainTomorrow'].replace(to_replace={"Yes":1,"No":0})

# Creating a duplicate variable for location column to plot
df["Location1"] = df["Location"]

# Encoding the other variables which have many categorical labels
le = LabelEncoder()
df["Location"] = le.fit_transform(df["Location"])
df["WindDir9am"]= le.fit_transform(df["WindDir9am"])
df["WindDir3pm"]= le.fit_transform(df["WindDir3pm"])
df["WindGustDir"] = le.fit_transform(df["WindGustDir"])


### Plotting some interesting variables

In [None]:
fig = plt.figure(figsize=(20,20))

axis_0 = fig.add_subplot(6,2,1,title='Rain Today')
# value_counts() keeps every label paired with its own count
counts = df["RainToday"].value_counts()
names = [str(i) for i in counts.index]
values = list(counts)
axis_0.bar(names,values,color=['red','green'])


axis_1 = fig.add_subplot(6,2,2,title='Rain Tomorrow')
counts1 = df["RainTomorrow"].value_counts()
names1 = [str(i) for i in counts1.index]
values1 = list(counts1)
axis_1.bar(names1,values1,color=['red','green'])

axis_2 = fig.add_subplot(6,2,(3,4),title='Daily Max Temperature')
temp = df["MaxTemp"]
bins = 5
axis_2.hist(temp,bins = bins ,color = 'orange')

axis_3 = fig.add_subplot(6,2,(5,6),title='Daily Min Temperature')
temp2 = df["MinTemp"]
bins2 = 5
# plotting MinTemp here (the MaxTemp series was being plotted a second time)
axis_3.hist(temp2,bins = bins2 ,color = 'blue')


df_month_mean = df[['month','Rainfall','MaxTemp','MinTemp']].groupby('month').mean().reset_index()
df_location_mean = df[['Location1','Rainfall']].groupby('Location1').mean().reset_index()

axis_4 = fig.add_subplot(6,2,(7,8),title='Average Rainfall by Month')
axis_4.bar(df_month_mean['month'],df_month_mean['Rainfall'],color='pink')


axis_5 = fig.add_subplot(6,2,(9,10),title='Average Max Temp by Month')
axis_5.bar(df_month_mean['month'],df_month_mean['MaxTemp'],color='brown')


axis_6 = fig.add_subplot(6,2,(11,12),title='Average Rainfall by Location')
axis_6.bar(df_location_mean['Location1'],df_location_mean['Rainfall'],color='violet')
axis_6.tick_params(labelrotation=90)

# 4. Correlation Analysis

In [None]:
plt.figure(figsize=(20,20))
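# Date and Location1 are still text columns, so only the numeric columns are correlated
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True)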
plt.xticks(rotation=90)
plt.show()

In [None]:
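# Only selecting the variables whose absolute correlation with RainTomorrow is greater than 0.1
corr_table = abs(df.select_dtypes(include="number").corr()['RainTomorrow'])
corr_table = corr_table[corr_table > 0.1]
# Keeping the target as the last column, because later cells separate it with iloc[:,:-1]
predictor_variable = [col for col in corr_table.keys() if col != 'RainTomorrow'] + ['RainTomorrow']
df_ML = df[predictor_variable]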

plt.figure(figsize=(20,20))
sns.heatmap(df_ML.corr(), annot=True)
plt.xticks(rotation=90)
plt.show()

**Findings**
* It looks like some of the independent variables are correlated with each other, which could cause a multicollinearity issue.
* MaxTemp and Temp3pm have a correlation coefficient of 0.97
* Pressure9am and Pressure3pm have a correlation coefficient of 0.96
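
Reading such pairs off a large heatmap is error-prone, so they can also be listed programmatically. The sketch below runs on synthetic data; the `high_corr_pairs` helper and the 0.9 threshold are hypothetical choices for illustration, not part of the notebook:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, r) for column pairs whose |r| exceeds the threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    return pairs

# Synthetic data (illustrative only): "b" is nearly a copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=300)
toy = pd.DataFrame({"a": a,
                    "b": a + 0.1 * rng.normal(size=300),
                    "c": rng.normal(size=300)})

pairs = high_corr_pairs(toy, threshold=0.9)
print(pairs)
```

Calling the same helper on the predictor columns of `df_ML` would list pairs such as MaxTemp and Temp3pm without scanning the plot.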

### Checking Variance Inflation Factor to investigate multicollinearity

In [None]:
def calculate_VIF(df):
    VIF = pd.DataFrame()
    all_VIF = []
    VIF["predictors"] = df.keys()
    for i in range(df.shape[1]):
        var_VIF = variance_inflation_factor(df.values,i)
        all_VIF.append(var_VIF)
    VIF["VIF"] = all_VIF
    return VIF

df_ML_VIF = calculate_VIF(df_ML.iloc[:,:-1])
df_ML_VIF["Target_correlation"] = abs(df_ML.corr()["RainTomorrow"]).reset_index(drop=True)
df_ML_VIF.sort_values(by="Target_correlation",ascending=False)

**Findings**
* Humidity is highly correlated with the target, but it is also correlated with other independent variables
* MaxTemp and Temp3pm also have a high correlation with each other
* So in order to strike a balance, we need to eliminate those variables that have a high VIF and a low correlation with the target.
* Then check for VIF again
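
For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on synthetic data. The `vif_with_intercept` helper is hypothetical, and it adds an intercept to each regression, so its numbers are not expected to match the cell above, where the columns are passed to `variance_inflation_factor` without a constant:

```python
import numpy as np

def vif_with_intercept(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])  # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic predictors (illustrative only): c is almost a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)

vif_a, vif_b, vif_c = vif_with_intercept(np.column_stack([a, b, c]))
print(f"a: {vif_a:.1f}, b: {vif_b:.1f}, c: {vif_c:.1f}")
```

The two near-duplicate columns get a very large VIF while the independent one stays close to 1, which is the pattern to look for among the weather predictors.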

In [None]:
df_ML_VIF["survivors"] = np.where((df_ML_VIF["Target_correlation"] < 0.3) & (df_ML_VIF["VIF"] > 20),0,1)
final_predictor = df_ML_VIF["predictors"][df_ML_VIF["survivors"]==1]
df_ML_2 = df_ML[list(final_predictor) + ['RainTomorrow']]

In [None]:
plt.figure(figsize=(20,20))
sns.heatmap(df_ML_2.corr(), annot=True)
plt.xticks(rotation=90)
plt.show()

**Now it looks like Sunshine is highly correlated with other predictors. Let's check the VIF.**

In [None]:
calculate_VIF(df_ML_2.iloc[:,:-1])

**Findings**
* It looks like the VIF of Sunshine is not too high; usually it is predictors with a VIF of more than 15 that should be removed.
* Sunshine is highly correlated with the target variable
* Therefore we are not removing it from the list of predictors

# 5. Model Training

### Splitting the training and test dataset

In [None]:
x = df_ML_2.iloc[:,:-1]
# Selecting the target as a 1-D Series, which is the shape scikit-learn expects for y
y = df_ML_2.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

### Train a logistic regression model on the training set

In [None]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear', random_state=0)
logreg.fit(X_train, y_train)

# 6. Prediction and Model evaluation

### Predicting with the trained logistic regression model

In [None]:
y_pred_test = logreg.predict(X_test)
predicted = pd.DataFrame(y_pred_test,columns=["prediction"])

### Classification Report

In [None]:
report = classification_report(y_test, y_pred_test)
print(report)
print(f"Accuracy of the Logistic Regression Model is: {accuracy_score(y_test,y_pred_test)*100}")


### Decoding the Classification report

**Recall**

> Recall = True Positives / (True Positives + False Negatives)

Recall is the share of actual positive cases that our model predicted correctly.

This means that if the audience is only interested in our model's prediction of rainy days, then we have to check the model's recall for rain prediction.

Even if we predict all the days as non-rainy days, we will have a decent overall accuracy, but that doesn't help the audience because most of the days are non-rainy days anyway.

**Recall only looks at the actual positive cases**

Having said that,

* Recall for rainy-day prediction is low with our model at 45%. Out of 6420 rainy days we correctly predicted only 2859
* Recall for non-rainy-day prediction is high with our model at 95%. Out of 22672 non-rainy days we correctly predicted 21443 (but this is not relevant to the audience)
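
The two recall figures can be reproduced from the counts quoted above. This is only the arithmetic, with the rainy day treated as the positive class; the variable names are illustrative:

```python
# Counts quoted in the text, with "rainy day" as the positive class
true_positives = 2859              # rainy days predicted as rainy
false_negatives = 6420 - 2859      # rainy days predicted as non-rainy
true_negatives = 21443             # non-rainy days predicted as non-rainy
false_positives = 22672 - 21443    # non-rainy days predicted as rainy

# Recall for each class: correct predictions divided by actual members of that class
recall_rainy = true_positives / (true_positives + false_negatives)
recall_non_rainy = true_negatives / (true_negatives + false_positives)

print(f"Recall (rainy): {recall_rainy:.2f}")
print(f"Recall (non-rainy): {recall_non_rainy:.2f}")
```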

**Precision**

> Precision = True Positives / (True Positives + False Positives)

Precision measures how often a positive prediction is correct.
If our model predicts that it is going to be a rainy day, how often is that true?

Precision for Rainy Day prediction = 2859/(2859 + 1229) = 0.70

Precision for Non-Rainy Day prediction = 21443/(21443 + 3561) = 0.86 (This is not relevant to the audience)

Precision for rain prediction (0.70) is lower than for non-rainy days (0.86) with our model.

**F1-score**

> F1-score = 2 * {(recall * precision)/(recall + precision)}

F1-score is the harmonic mean of precision and recall, so it gives equal weight to both.
A good classification model should have a high F1-score.

In our case, the F1-score for rainy-day prediction is only 0.54, which is not so great, even though the overall accuracy is 0.84 (84%).
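
As a quick check, the F1-score and the overall accuracy can be recomputed from the four confusion-matrix counts used in this section. The short variable names are illustrative:

```python
# Confusion-matrix counts from this section, with "rainy day" as the positive class
tp, fp, fn, tn = 2859, 1229, 3561, 21443

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# Harmonic mean of precision and recall
f1 = 2 * (recall * precision) / (recall + precision)

# Overall accuracy counts correct predictions for both classes
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"F1 (rainy): {f1:.2f}")
print(f"Accuracy: {accuracy:.2f}")
```

The accuracy is dominated by the many correctly predicted non-rainy days, while the F1-score reflects only the rainy-day class.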

### Plotting the confusion matrix

In [None]:
plt.figure(figsize=(10,10))
cm = confusion_matrix(y_test, predicted)
cm = pd.DataFrame(cm)
cm.columns = ["Predicted Non-Rainy Days","Predicted Rainy Days"]
cm.index = ["Actual Non-Rainy Days", "Actual Rainy Days"]
# cm = cm.pivot("Actual","Predicted")
sns.heatmap(cm, annot=True,fmt='d',cmap="Blues")
plt.title("Confusion Matrix for Logistic Regression Model")
plt.show()

# More to come in the next upload