# HOW MUCH DID IT RAIN?

For agriculture, it is extremely important to know how much it rained on a particular field. However, rainfall is variable in space and time and it is impossible to have rain gauges everywhere. Therefore, remote sensing instruments such as radar are used to provide wide spatial coverage.

Rainfall estimates drawn from remotely sensed observations will never exactly match the measurements that are carried out using rain gauges, due to the inherent characteristics of both sensors. Currently, radar observations are "corrected" using nearby gauges and a single estimate of rainfall is provided to users who need to know how much it rained.

The challenge is to solve this in a probabilistic manner. Knowing the full probabilistic spread of rainfall amounts can be far more useful for driving hydrological and agronomic models than a single estimate of rainfall.

<img src="dual_pol2.jpg">


Unlike a conventional Doppler radar, a polarimetric radar transmits radio wave pulses that have both horizontal and vertical orientations. Because rain drops become flatter as they increase in size and because ice crystals tend to be elongated vertically, whereas liquid droplets tend to be flattened, it is possible to infer the size of rain drops and the type of hydrometeor from the differential reflectivity of the two orientations.

We are given polarimetric radar values and derived quantities at a location over a period of one hour, and we need to produce a probability distribution of the hourly rain gauge total.

#### ABOUT POLARIMETRIC RADAR MEASUREMENTS

Polarimetric radar promises better inference of drop sizes, and thus better rainfall estimates (smaller drops evaporate more), as well as the ability to distinguish echoes due to bioscatter from echoes due to weather. The US National Weather Service's weather radar network (called NEXRAD) was recently upgraded to polarimetry, and it is the polarimetric radar data collected after the upgrade that you are provided.


This is a Kaggle competition, and you can download the dataset from the link given below:
https://www.kaggle.com/c/how-much-did-it-rain/data


## LET'S START OUR JOURNEY
Before exploring the dataset, I will import the Python libraries that will be used later.

In [None]:
# importing the basic libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# importing other libraries which will be used
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

In [None]:
train=pd.read_csv("../input/train_2013.csv")
test=pd.read_csv("../input/test_2014.csv")

print("Training Size : (%d,%d)"%train.shape)
print("Test Size : (%d,%d)"%test.shape)

In [None]:
train.head(10)

In [None]:
train.columns

In [None]:
test.columns

## UNDERSTANDING THE DATA 
There are 19 provided features, three of which are rain rates predicted by
three current algorithms. These three features, RR1, RR2, and RR3, come
respectively from the ‘HCA-based’, ‘Zdr-based’, and ‘Kdp-based’ algorithms.
The other 16 features are given as time series of numerical data. An example
data point could have its ‘TimeToEnd’ feature as ‘58.0 55.0 52.0 49.0 41.0’,
indicating radar information taken at 58, 55, . . . , 41 minutes from the end
of the hour. For this same row, the features ‘Reflectivity’ as
‘0.0, 0.0, 1.2, 4.5, 0.0’ and ‘RR1’ as ‘0.0, 0.0, 2.2, 0.3, 0.0’ are the
measurements taken at the time points in the ‘TimeToEnd’ series.
The label (expected value) for each row is a single float: the amount of rain,
in mm, collected by the gauge during that hour.



### DESCRIPTION OF COLUMNS
The columns in the datasets are:

    TimeToEnd:  How many minutes before the end of the hour was this radar observation?

    DistanceToRadar:  Distance between radar and gauge.  This value is scaled and rounded to prevent reverse engineering gauge location

    Composite:  Maximum reflectivity in vertical volume above gauge

    HybridScan: Reflectivity in elevation scan closest to ground

    HydrometeorType:  One of nine categories in NSSL HCA. See presentation for details.

    Kdp:  Differential phase

    RR1:  Rain rate from HCA-based algorithm

    RR2:  Rain rate from Zdr-based algorithm

    RR3:  Rain rate from Kdp-based algorithm

    RadarQualityIndex:  A value from 0 (bad data) to 1 (good data)

    Reflectivity:  In dBZ

    ReflectivityQC:  Quality-controlled reflectivity

    RhoHV:  Correlation coefficient

    Velocity:  (aliased) Doppler velocity

    Zdr:  Differential reflectivity in dB

    LogWaterVolume:  How much of radar pixel is filled with water droplets?

    MassWeightedMean:  Mean drop size in mm

    MassWeightedSD:  Standard deviation of drop size

    Expected: the actual amount of rain reported by the rain gauge for that hour.

#### Hydrometeor types:

    0-no echo

    1-moderate rain

    2-moderate rain

    3-heavy rain

    4-rain/hail

    5-big drops

    6-AP

    7-Birds

    8-unknown

    9-no echo

    10-dry snow

    11-wet snow

    12-ice crystals

    13-graupel

    14-graupel

In [None]:
train['TimeToEnd'][6]

In [None]:
train.info()

In [None]:
sample=pd.read_csv("sampleSubmission.csv")

sample.head(10)

In [None]:
sample.columns

## Submission
Submissions are predictions of the probability distribution of
the hourly rain total. Each row of the submission is a list of values P(y ≤ Y)
for the integer thresholds Y = 0, 1, 2, ..., 69, where y is the rainfall total in mm.
Because these are cumulative probabilities, for every k

                                 P(y ≤ k) ≤ P(y ≤ k+1)

For instance, a perfect prediction
for the true label of "2.5" would be the row 0, 0, 0, 1, 1, ..., 1, corresponding
to P(y ≤ 0) = P(y ≤ 1) = P(y ≤ 2) = 0 and P(y ≤ 3) = ... = P(y ≤ 69) = 1.
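As a small illustration of the submission format, here is a sketch that builds the perfect row for a known label and checks that a row is a valid cumulative distribution. The function names `perfect_cdf_row` and `is_valid_cdf_row` are my own and are not part of the competition material.

```python
import numpy as np

def perfect_cdf_row(label, n_thresholds=70):
    """Step-function CDF: 1 where the threshold k >= label, else 0 (k = 0..69)."""
    thresholds = np.arange(n_thresholds)
    return (thresholds >= label).astype(float)

def is_valid_cdf_row(row):
    """True if all values lie in [0, 1] and never decrease from left to right."""
    row = np.asarray(row, dtype=float)
    in_range = np.all((row >= 0) & (row <= 1))
    non_decreasing = np.all(np.diff(row) >= 0)
    return bool(in_range and non_decreasing)

row = perfect_cdf_row(2.5)
print(row[:5])                 # [0. 0. 0. 1. 1.]
print(is_valid_cdf_row(row))   # True
```

A check like `is_valid_cdf_row` can be run on every row of a submission before writing it to disk.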
                                        
## MISSING DATA

There are five types of missing data:

-99900: echo below signal-to-noise threshold of radar.  In other words, the true value could be anywhere between -14 and -inf, but we don't know.

-99901: range folded data

-99903: data not collected such as due to beam blockage or beyond data range

nan: derived quantity could not be computed because some input was one of the above codes

999.0: RadarQualityIndex could not be computed because pixel was at edge of echo
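To make the missing-data codes concrete, here is a minimal sketch that parses one space-separated cell into numbers and marks the codes listed above as NaN. The helper `parse_series` and the constant `MISSING_CODES` are hypothetical names of mine. Note that treating 999.0 as missing only makes sense for RadarQualityIndex, so the set of codes is passed as a parameter. The cells later in this notebook take a different route and replace the codes with zero.

```python
import numpy as np

# Codes from the list above; 999.0 applies to RadarQualityIndex only (assumption: pass it explicitly).
MISSING_CODES = (-99900.0, -99901.0, -99903.0)

def parse_series(cell, missing_codes=MISSING_CODES):
    """Turn a string such as '0.0 -99900.0 1.2' into a float array with missing codes as NaN."""
    values = np.array(cell.split(), dtype=float)
    values[np.isin(values, missing_codes)] = np.nan
    return values

parsed = parse_series("0.0 -99900.0 1.2 nan 4.5")
print(parsed)
print(np.nanmean(parsed))      # mean over the valid readings only

rqi = parse_series("0.8 999.0 0.6", missing_codes=(999.0,))
print(rqi)
```

With NaN in place of the codes, `np.nanmean` averages only the valid readings instead of pulling the mean toward zero.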

A frequency plot of the first 100 expected values, so that we can see how the expected values are distributed.

In [None]:
plt.subplots(figsize=(20,20))
# histplot with kde=True replaces the deprecated sns.distplot
sns.histplot(train['Expected'].head(100),kde=True)

## BASIC MODELS

These models do not involve any feature engineering. They rely only on simple statistics and some common sense.

They are clearly approximations, but they provide a benchmark for the feature-based models that follow.

There are 987398 points (87.6367 % of the train data) with an actual rainfall of 0, meaning that there was no rain at all.

So for the test data I will also predict no rain: the probability that the rainfall is less than or equal to any threshold is 1.


In [None]:
ans=pd.DataFrame(columns=sample.columns)
ans['Id']=test['Id']
ans.head(10)

In [None]:
cols=list(sample.columns)

In [None]:
cols.remove('Id')

In [None]:
ans[cols]=1                  # making each probability as 1.
ans.head(10)

In [None]:
ans.to_csv("No_rain.csv",index=False)   # got a score of private:0.01025920 and public:0.01017651
print("Done")

With this I got a score of private: 0.01025920 and public: 0.01017651. On the leaderboard, this score lands at a rank of about 204-210 (not bad!).

This is another model which also doesn't involve any feature engineering. Here I computed, for each threshold k, the proportion of training examples whose expected value is at most k, and assigned these proportions to every row of the test set irrespective of the features.
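The idea of this baseline can be sketched on toy data. The function `empirical_cdf` is a hypothetical helper of mine, and the five toy values are made up purely for illustration.

```python
import numpy as np

def empirical_cdf(values, n_thresholds=70):
    """Fraction of values that are <= k, for each threshold k = 0..n_thresholds-1."""
    values = np.asarray(values, dtype=float)
    thresholds = np.arange(n_thresholds)
    # compare every value with every threshold, then average over the values
    return (values[:, None] <= thresholds[None, :]).mean(axis=0)

toy = [0.0, 0.0, 0.0, 0.5, 2.3]   # made-up rainfall totals in mm
cdf = empirical_cdf(toy)
print(cdf[:4])
```

Every test row then receives the same 70 numbers, which is why this model needs no features at all.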


In [None]:
length=train.shape[0]
print(length)

In [None]:
# column names look like "Predicted0" ... "Predicted69"; the threshold k is the numeric suffix
for i in cols:
    k=int(i[9:])              # drop the 9-character "Predicted" prefix (handles one or two digits)
    # fraction of training rows with Expected <= k
    ans[i]=(train['Expected']<=k).sum()/length

print("Done")

In [None]:
ans.head(10)

In [None]:
ans.to_csv("only_train_per.csv",index=False)           
print("Done")

With this model I got a score of private: 0.00978634 and public: 0.00971225. This score gives a rank of 188 on the public and 191 on the private leaderboard.

## EXPLORATION

Those were our basic models. Now it's time to explore the data and try to build some real models.

Before going further we have to choose how to model the problem. Here I will transform it into a classification problem.


From the sample submission we can see that for each Id we have 70 columns which represent the probabilities. The first column represents the probability of <= 0, the second column the probability of <= 1, and so on.

So it is similar to a classification problem with 70 classes, where class 0 means no rain and each class i (for i >= 1) covers the values that are greater than i-1 and at most i.

The classes will be like this :

    if expected value is 0 then it is class 0

    if expected value is greater than 0 and at most 1 then it is class 1

    if expected value is greater than 1 and at most 2 then it is class 2

    if expected value is greater than 2 and at most 3 then it is class 3

    ......

    ......

    if expected value is greater than 68 and at most 69 then it is class 69

So in total we will have 70 classes.

Then P(y<=Y) is the sum of the probabilities of the classes 0, (0,1], (1,2], (2,3], (3,4] and so on up to (Y-1,Y]. Each such sum fills one column of the sample submission.
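The class definition and the summation above can be sketched as follows. The helper names `to_class` and `class_probs_to_cdf` are hypothetical, and the three-class probability row is a made-up example. `np.ceil` reproduces the class boundaries listed above for non-negative values.

```python
import numpy as np

def to_class(expected):
    """Class 0 for exactly 0 mm, class k for rainfall greater than k-1 and at most k."""
    return np.ceil(np.asarray(expected, dtype=float)).astype(int)

def class_probs_to_cdf(probs):
    """Running sum over classes gives P(y <= k); clip guards against rounding above 1."""
    cdf = np.cumsum(probs, axis=1)
    return np.clip(cdf, 0.0, 1.0)

labels = to_class([0.0, 0.3, 1.0, 2.5, 69.0])
print(labels)                        # [ 0  1  1  3 69]

probs = np.array([[0.7, 0.2, 0.1]])  # toy example with only three classes
cdf = class_probs_to_cdf(probs)
print(cdf)
```

In the real problem `probs` would have 70 columns, one per class, and each row of the running sum is one row of the submission.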

Before this I have to decide what to do with the training examples whose expected value is greater than 69, because in this problem we only predict up to 69 mm. So let's see how many of the train examples have an expected value greater than 69.

In [None]:
len(train.loc[train['Expected']>69])

So there are 5582 train examples in total with more than 69 mm. Now we have to decide whether to keep these values as a separate class or to remove them, as they may be outliers.
I am removing these values from the training set.

In [None]:
# removing the examples which have greater than 69mm rainfall
train.drop((train.loc[train['Expected']>69]).index,inplace=True)
train.shape[0]

In [None]:
# converting the Expected values to classes.
train.loc[train['Expected']==0.0,'Expected']=0

for i in range(69):              # max value will go to 68
    train.loc[(train['Expected']>i) & (train['Expected']<=(i+1)),'Expected']=(i+1)
    
train['Expected']=train['Expected'].astype(int)

There is no training example with label 68. This can easily be seen from the graph below.

In [None]:
train.loc[(train['Expected']==68),'Expected']

In [None]:
plt.subplots(figsize=(15,9))
sns.countplot(x=train['Expected'])          # pass the column explicitly as x
plt.xticks(rotation=90)                     # rotate the tick labels by 90 degrees

RR1: Rain rate from the HCA-based algorithm. I will take the mean of the values at all the recorded times, and this will be our first feature.

## RR1

I have chosen RR1 first because the train data shows that when RR1 is non-zero the rainfall is usually non-zero, and when RR1 contains missing-data codes such as -99900, -99901 or -99903 the expected rainfall is usually zero.

So I will use a single feature, RR1. The results for this submission are written below.

In [None]:
# k=list(map(float,train['RR1'][6].split()))
l=[]                                                    # empty list 
for i in train.index:
    k=list(map(float,train['RR1'][i].split()))
    k=[0 if (x==-99900.0 or x==-99901.0 or x==-99903.0) else x for x in k]
    mean=sum(k)/len(k)
    l.append(mean)

In [None]:
rr1=np.array(l)
rr1.shape=(train.shape[0],1)
print(rr1.shape)

In [None]:
plt.subplots(figsize=(15,9))
plt.scatter(rr1[0:500,:],train['Expected'].head(500),color='blue')
plt.xlabel("RR1")
plt.ylabel("Expected Rainfall")
plt.show()

In [None]:
l=[]
for i in test.index:
    k=list(map(float,test['RR1'][i].split()))
    k=[0 if (x==-99900.0 or x==-99901.0 or x==-99903.0) else x for x in k]
    mean=sum(k)/len(k)
    l.append(mean)

In [None]:
rr1_test=np.array(l)
rr1_test.shape=(test.shape[0],1)
print(rr1_test.shape)

I removed the code for submission.

With the above model I got a public score of 0.00842523 and a private score of 0.00842799, which is not bad and is better than the benchmark models.
With this score the submission ranked 74 on the public and 72 on the private leaderboard, so we are now inside the top 100.
I also tried other learning rates:

0.1 : The results were decent.

0.3 : The score got worse; the error was greater than with a learning rate of 0.1 (got a score of 0.13).
Even the basic models were better than this.

0.9 : The results were not good.

## RR2

As the above model performed better than the benchmark models, I included another feature in the training matrix, RR2: Rain rate from the Zdr-based algorithm. The results improved, but only by a small margin. The scores of this model are given below.

In [None]:
# converting rr2 into mean values in train data
j=[]                                                    # empty list 
for i in train.index:
    k=list(map(float,train['RR2'][i].split()))
    k=[0 if (x==-99900.0 or x==-99901.0 or x==-99903.0) else x for x in k]
    mean=sum(k)/len(k)
    j.append(mean)

In [None]:
rr2=np.array(j)
rr2.shape=(train.shape[0],1)
print(rr2.shape)

In [None]:
plt.subplots(figsize=(15,9))
plt.scatter(rr2[0:500,:],train['Expected'].head(500),color='blue')
plt.xlabel("RR2")
plt.ylabel("Expected Rainfall")
plt.show()

In [None]:
# converting rr2 values into mean values in test data 
j=[]                                                    # empty list 
for i in test.index:
    k=list(map(float,test['RR2'][i].split()))
    k=[0 if (x==-99900.0 or x==-99901.0 or x==-99903.0) else x for x in k]
    mean=sum(k)/len(k)
    j.append(mean)

In [None]:
rr2_test=np.array(j)
rr2_test.shape=(test.shape[0],1)
print(rr2_test.shape)

I removed the code for submission.

With RR2 as a feature I got a public score of 0.00839472 and a private score of 0.00840271. This score gives a rank of 70 on the private leaderboard.

## RR3


I included another feature in the training matrix, RR3: Rain rate from the Kdp-based algorithm. This model performed worse than the one above. The scores of this model are given below.

In [None]:
j=[]                                                    # empty list 
for i in train.index:
    k=list(map(float,train['RR3'][i].split()))
    k=[0 if (x==-99900.0 or x==-99901.0 or x==-99903.0) else x for x in k]
    mean=sum(k)/len(k)
    j.append(mean)

In [None]:
rr3=np.array(j)
rr3.shape=(train.shape[0],1)
print(rr3.shape)

In [None]:
plt.subplots(figsize=(15,9))
plt.scatter(rr3[0:500,:],train['Expected'].head(500),color='blue')
plt.xlabel("RR3")
plt.ylabel("Expected Rainfall")
plt.show()

In [None]:
j=[]                                                    # empty list 
for i in test.index:
    k=list(map(float,test['RR3'][i].split()))
    k=[0 if (x==-99900.0 or x==-99901.0 or x==-99903.0) else x for x in k]
    mean=sum(k)/len(k)
    j.append(mean)

In [None]:
rr3_test=np.array(j)
rr3_test.shape=(test.shape[0],1)
print(rr3_test.shape)

I removed the code for submission.

With RR3 as another feature I got a public score of 0.00840888 and a private score of 0.00841514. The model performed worse than the one above.

## ADDING MORE FEATURES 

As we can easily see, the score has not changed much from RR1 to RR3. So now it's time to move on to other features and explore more.
Since the error increased when we added RR3, I am removing it as a feature for now.
 
## Radar Quality Index

I will add the Radar Quality Index as another feature. A value of zero means bad data and a value of 1 means good data. This should be useful in predicting the rainfall because it tells us how far the radar readings can be trusted.

The missing value is 999.0, which according to the description means that the radar quality index could not be computed.
If an observation contains many 999.0 values then it is of course bad data. So I will take the mean of all the values, replacing each 999.0 with zero, as zero corresponds to bad data.

In [None]:
p=[]
for i in train.index:
    k=list(map(float,train['RadarQualityIndex'][i].split()))
    k=[0.0 if x==999.0 else x for x in k]
    m=sum(k)/float(len(k))
    p.append(m)

In [None]:
RQi=np.array(p)
RQi.shape=(train.shape[0],1)
print(RQi.shape)

In [None]:
plt.subplots(figsize=(15,9))
plt.scatter(RQi[0:500,:],train['Expected'].head(500),color='blue')
plt.xlabel("Radar Quality Index")
plt.ylabel("Expected Rainfall")
plt.show()

In [None]:
p=[]
for i in test.index:
    k=list(map(float,test['RadarQualityIndex'][i].split()))
    k=[0.0 if x==999.0 else x for x in k]
    m=sum(k)/float(len(k))
    p.append(m)

In [None]:
RQi_test=np.array(p)
RQi_test.shape=(test.shape[0],1)
print(RQi_test.shape)

I removed the code for submission

With this model I got a public score of 0.00835402 and a private score of 0.00836117, which gives a rank of 65 on both the private and the public leaderboard.

## NUMBER OF RADAR SCANS

My thinking is that the more radar scans there are, the more accurate the rainfall prediction may be.
So I am going to include this feature. I have read an article on this problem, and the article suggests that there is a good correlation between the number of scans and the rainfall.

The number of scans equals the number of spaces in the TimeToEnd column plus one.
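The same count can also be obtained without an explicit loop. This is a minimal sketch on a made-up three-row frame; it uses the pandas string accessor to split each cell on whitespace and count the pieces.

```python
import pandas as pd

# toy stand-in for the TimeToEnd column (made-up values)
toy = pd.DataFrame({"TimeToEnd": ["58.0 55.0 52.0 49.0 41.0", "30.0", "12.0 6.0"]})

# split each cell on whitespace and count the resulting readings
n_scans = toy["TimeToEnd"].str.split().str.len()
print(n_scans.tolist())   # [5, 1, 2]
```

Counting the split pieces rather than the spaces also stays correct if a cell happens to contain a double space.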

In [None]:
train['TimeToEnd'][0].count(" ")+1

In [None]:
numberofscans=[]
for i in train.index:
    number=train['TimeToEnd'][i].count(" ")+1
    numberofscans.append(number)

numberofscans=np.array(numberofscans).reshape(train.shape[0],1)
print(numberofscans.shape)

In [None]:
plt.subplots(figsize=(15,9))
plt.scatter(numberofscans,train['Expected'])
plt.xlabel("NUMBER OF RADAR SCANS")
plt.ylabel("RAINFALL")

In [None]:
numberofscans_test=[]
for i in test.index:
    number=test['TimeToEnd'][i].count(" ")+1
    numberofscans_test.append(number)

numberofscans_test=np.array(numberofscans_test).reshape(test.shape[0],1)
print(numberofscans_test.shape)

I removed the code for submission.

With this model I got a public score of 0.00816744 and a private score of 0.00816330, which gives a rank of 52 on the private leaderboard.

## REFLECTIVITY QC

As you can see from the train data, the reflectivity is zero in most cases if we replace -99900 with zero,
and when the reflectivity is zero the rainfall is also zero.

In [None]:
r=[]
for i in train.index:
    k=list(map(float,train['ReflectivityQC'][i].split()))
    k=[0 if (x==-99900.0 or x==-99901.0 or x==-99903.0) else x for x in k]
    m=sum(k)/len(k)
    r.append(m)

reflectivity=np.array(r).reshape(train.shape[0],1)
print(reflectivity.shape)

In [None]:
r=[]
for i in test.index:
    k=list(map(float,test['ReflectivityQC'][i].split()))
    k=[0 if (x==-99900.0 or x==-99901.0 or x==-99903.0) else x for x in k]
    m=sum(k)/len(k)
    r.append(m)
    
reflectivity_test=np.array(r).reshape(test.shape[0],1)
print(reflectivity_test.shape)

I removed the code for submission.

With this model (XGBoost) I got a public score of 0.00787731 and a private score of 0.00786504. This score gives a rank of 32 on the leaderboard.


## HYBRID SCAN

HybridScan (reflectivity in the elevation scan closest to the ground) is also related to reflectivity, so my plan is to include this feature as well.

In [None]:
r=[]
for i in train.index:
    k=list(map(float,train['HybridScan'][i].split()))
    k=[0 if (x==-99900.0 or x==-99901.0 or x==-99903.0) else x for x in k]
    m=sum(k)/len(k)
    r.append(m)

hybrid=np.array(r).reshape(train.shape[0],1)
print(hybrid.shape)

In [None]:
r=[]
for i in test.index:
    k=list(map(float,test['HybridScan'][i].split()))
    k=[0 if (x==-99900.0 or x==-99901.0 or x==-99903.0) else x for x in k]
    m=sum(k)/len(k)
    r.append(m)
    
hybrid_test=np.array(r).reshape(test.shape[0],1)
print(hybrid_test.shape)

## TRAINING MATRICES

In [None]:
X=np.hstack((rr1,rr2,RQi,numberofscans,reflectivity,hybrid))
y=train['Expected'].values          # class labels built earlier from the Expected column
print(X.shape)
print(y.shape)

In [None]:
X_test=np.hstack((rr1_test,rr2_test,RQi_test,numberofscans_test,reflectivity_test,hybrid_test))
print(X_test.shape)

## TRAINING

This time we will use two models, random forest and XGBoost. We will look at their individual scores, and after that I plan to ensemble them with weighted voting.

I am commenting out the code because it takes a long time to run.

### XGB

In [None]:
# xgb=XGBClassifier()
# xgb.fit(X,y)

In [None]:
# xgb_predict=xgb.predict_proba(X_test)
# print(xgb_predict.shape)

In [None]:
# temp=xgb_predict[:,68].reshape(test.shape[0],1)
# xgb_predict[:,68]=0.0
# xgb_predict=np.hstack((xgb_predict,temp))
# print(xgb_predict.shape)

In [None]:
# xgb_predict=np.cumsum(xgb_predict,axis=1)
# print(xgb_predict.shape)

In [None]:
# hybrid_data=pd.DataFrame(xgb_predict,columns=cols)
# hybrid_data.head(10)

In [None]:
# hybrid_data=pd.concat([test['Id'],hybrid_data],axis=1)
# hybrid_data.to_csv("hybrid.csv",index=False)
# print("Done")

With this I got a public score of 0.00783509 and a private score of 0.00782772, and a rank of 31 on the private leaderboard.
So far this is the best model; it outperformed all the others.
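The commented-out cells above shift column 68 to the end because no training example has label 68, so the classifier only knows 69 classes and its last output column actually belongs to class 69. A more general way to do the same thing is sketched below, under the assumption that the fitted classifier exposes the labels it saw (scikit-learn classifiers do so through their `classes_` attribute). The helper `expand_to_all_classes` is a hypothetical name, and the uniform probabilities are made up.

```python
import numpy as np

def expand_to_all_classes(proba, classes, n_classes=70):
    """Place each predicted column at the index of its class label; unseen classes stay 0."""
    full = np.zeros((proba.shape[0], n_classes))
    full[:, np.asarray(classes, dtype=int)] = proba
    return full

# stand-in for clf.classes_: every label from 0 to 69 except the missing 68
classes = np.array([c for c in range(70) if c != 68])
# stand-in for clf.predict_proba(X_test): two rows of uniform probabilities
proba = np.full((2, 69), 1.0 / 69)

full = expand_to_all_classes(proba, classes)
cdf = np.cumsum(full, axis=1)
print(full.shape, full[0, 68], cdf[0, -1])
```

Written this way, the alignment keeps working if a different set of labels is missing from the training data.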

I also tried ensembling with the random forest algorithm, but the results didn't improve.

## RANDOM FOREST 

In [None]:
# rf=RandomForestClassifier(n_estimators=10,random_state=42)
# rf.fit(X,y)

In [None]:
# f=rf.predict_proba(X_test)
# f.shape

In [None]:
# temp=f[:,68].reshape(test.shape[0],1)
# f[:,68]=0
# f=np.hstack((f,temp))
# f=np.cumsum(f,axis=1)
# print(f.shape)

In [None]:
# hybrid_rf=pd.DataFrame(f,columns=cols)
# hybrid_rf=pd.concat([test['Id'],hybrid_rf],axis=1)
# hybrid_rf.head(10)

In [None]:
# hybrid_rf[hybrid_rf[cols]>1]=1

In [None]:
# hybrid_rf.to_csv("hybrid_rf.csv",index=False)

## ENSEMBLING

Ensembling is a general term for combining many classifiers by averaging or voting. Generally, ensembles of classifiers perform better than single classifiers, and the averaging process allows for more granularity of choice in the bias-variance tradeoff.
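Here is a minimal sketch of weighted averaging for two sets of cumulative predictions. The function `blend_cdfs` is a hypothetical helper and the inputs are toy rows with three thresholds. The clipping and the running maximum are extra safeguards of mine to keep every row a valid cumulative distribution, not steps taken from the cells in this notebook.

```python
import numpy as np

def blend_cdfs(cdf_a, cdf_b, weight_a=0.8):
    """Weighted average of two CDF matrices, kept inside [0, 1] and non-decreasing per row."""
    blended = weight_a * np.asarray(cdf_a) + (1.0 - weight_a) * np.asarray(cdf_b)
    blended = np.clip(blended, 0.0, 1.0)
    # running maximum along each row removes any small decreases
    return np.maximum.accumulate(blended, axis=1)

a = np.array([[0.5, 0.9, 1.0]])   # toy predictions from a first model
b = np.array([[1.0, 1.0, 1.0]])   # toy predictions from a second model
blended = blend_cdfs(a, b)
print(blended)
```

A weighted average of two valid cumulative distributions is already valid, so the safeguards only matter when one of the inputs has rounding problems, such as values slightly above 1.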

In [None]:
# print(hybrid_data.shape,hybrid_rf.shape)

## AVERAGING

In [None]:
# ensemble=hybrid_data[cols]*0.8+hybrid_rf[cols]*0.2
# ensemble.head(10)

In [None]:
# ensemble=pd.concat([test['Id'],ensemble],axis=1)
# ensemble.to_csv("random_forest+Xgboost.csv",index=False)
# print("Done")

This ends our journey of predicting the rainfall.


In Chinese mythology, the one who can predict the rain is known as the messenger who can talk to "the ruler of the ocean, the dragon king" ! :) :)

Thanks for reading !!