# Introduction  

## Binary-Prediction-with-a-Rainfall-Dataset (Kaggle competition)

The dataset for this competition (both train and test) was generated from a deep learning model trained on the Rainfall Prediction using Machine Learning dataset. Feature distributions are close to, but not exactly the same as, those of the original. Feel free to use the original dataset as part of this competition, both to explore the differences and to see whether incorporating the original data in training improves model performance.

The dataset contains various meteorological attributes recorded over time. Below is a brief description of the columns, with ranges taken from the training data:

    ID: A unique identifier for each record.
    Day: Represents the day of the year (1 to 365).
    Pressure: Ranges from 999 to 1034.6 hPa.
    Temperature Features: Max temp (10.4°C - 36.0°C), Min temp (4.0°C - 29.8°C), and Average temp (7.4°C - 31.5°C).
    Dew Point: Ranges from -0.3°C to 26.7°C.
    Humidity: Ranges from 39% to 98%.
    Cloud Cover: Ranges from 2% to 100%.
    Sunshine Duration: Ranges from 0 to 12.1 hours.
    Wind Direction and Wind Speed: Stored in the winddirection and windspeed columns.
    Rainfall: The binary target (0 or 1), present only in the training data.

This dataset provides essential meteorological variables, which can be useful for weather prediction models, climate analysis, and other predictive analytics.
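The ranges listed above can be checked directly with pandas. The following is a minimal sketch, not part of the original notebook: `sample_df` is a hypothetical name holding only the first five training rows shown later in this notebook, so its minima and maxima are narrower than the full-data ranges. Swapping in the full `train_df` should reproduce the ranges in the list.

```python
import pandas as pd

# Hypothetical sample: the first five rows of train.csv as displayed by
# train_df.head(5) in this notebook. Replace with the full train_df to
# check the complete ranges.
sample_df = pd.DataFrame({
    "pressure": [1017.4, 1019.5, 1024.1, 1013.4, 1021.8],
    "maxtemp": [21.2, 16.2, 19.4, 18.1, 21.3],
    "humidity": [87.0, 95.0, 75.0, 95.0, 52.0],
    "sunshine": [1.1, 0.0, 8.3, 0.0, 3.6],
})

# One row per column, with the observed minimum and maximum side by side.
ranges = sample_df.agg(["min", "max"]).T
print(ranges)
```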

# Importing Dependencies

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd

# Loading Dataset

In [None]:
# loading the training data
train_path = './train.csv'
train_df = pd.read_csv(train_path)

# loading the testing data
test_path = './test.csv'
test_df = pd.read_csv(test_path)

## Train data

Checking the head and tail of the train data.

In [7]:
train_df.head(5)

Unnamed: 0,id,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,sunshine,winddirection,windspeed,rainfall
0,0,1,1017.4,21.2,20.6,19.9,19.4,87.0,88.0,1.1,60.0,17.2,1
1,1,2,1019.5,16.2,16.9,15.8,15.4,95.0,91.0,0.0,50.0,21.9,1
2,2,3,1024.1,19.4,16.1,14.6,9.3,75.0,47.0,8.3,70.0,18.1,1
3,3,4,1013.4,18.1,17.8,16.9,16.8,95.0,95.0,0.0,60.0,35.6,1
4,4,5,1021.8,21.3,18.4,15.2,9.6,52.0,45.0,3.6,40.0,24.8,0


In [8]:
train_df.tail(5)

Unnamed: 0,id,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,sunshine,winddirection,windspeed,rainfall
2185,2185,361,1014.6,23.2,20.6,19.1,19.9,97.0,88.0,0.1,40.0,22.1,1
2186,2186,362,1012.4,17.2,17.3,16.3,15.3,91.0,88.0,0.0,50.0,35.3,1
2187,2187,363,1013.3,19.0,16.3,14.3,12.6,79.0,79.0,5.0,40.0,32.9,1
2188,2188,364,1022.3,16.4,15.2,13.8,14.7,92.0,93.0,0.1,40.0,18.0,1
2189,2189,365,1013.8,21.2,19.1,18.0,18.0,89.0,88.0,1.0,70.0,48.0,1


In [10]:
# display the columns and shape of the train data
train_df.columns, train_df.shape

(Index(['id', 'day', 'pressure', 'maxtemp', 'temparature', 'mintemp',
        'dewpoint', 'humidity', 'cloud', 'sunshine', 'winddirection',
        'windspeed', 'rainfall'],
       dtype='object'),
 (2190, 13))

In [22]:
# Displaying the info and description of the train data
print('Train data information:- \n')
train_df.info()

print('\nTrain data description:- ')
train_df.describe().T

Train data information:- 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2190 entries, 0 to 2189
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             2190 non-null   int64  
 1   day            2190 non-null   int64  
 2   pressure       2190 non-null   float64
 3   maxtemp        2190 non-null   float64
 4   temparature    2190 non-null   float64
 5   mintemp        2190 non-null   float64
 6   dewpoint       2190 non-null   float64
 7   humidity       2190 non-null   float64
 8   cloud          2190 non-null   float64
 9   sunshine       2190 non-null   float64
 10  winddirection  2190 non-null   float64
 11  windspeed      2190 non-null   float64
 12  rainfall       2190 non-null   int64  
dtypes: float64(10), int64(3)
memory usage: 222.6 KB

Train data description:- 


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,2190.0,1094.5,632.342866,0.0,547.25,1094.5,1641.75,2189.0
day,2190.0,179.948402,105.203592,1.0,89.0,178.5,270.0,365.0
pressure,2190.0,1013.602146,5.655366,999.0,1008.6,1013.0,1017.775,1034.6
maxtemp,2190.0,26.365799,5.65433,10.4,21.3,27.8,31.2,36.0
temparature,2190.0,23.953059,5.22241,7.4,19.3,25.5,28.4,31.5
mintemp,2190.0,22.170091,5.05912,4.0,17.7,23.85,26.4,29.8
dewpoint,2190.0,20.454566,5.288406,-0.3,16.8,22.15,25.0,26.7
humidity,2190.0,82.03653,7.800654,39.0,77.0,82.0,88.0,98.0
cloud,2190.0,75.721918,18.026498,2.0,69.0,83.0,88.0,100.0
sunshine,2190.0,3.744429,3.626327,0.0,0.4,2.4,6.8,12.1


In [17]:
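# checking for missing values in the train data
# (result not displayed: a notebook cell shows only its last expression;
# info() above already reports 2190 non-null values in every column)
train_df.isnull().sum()

# checking for zeros in the train data (this is the output shown below)
train_df.isin([0]).sum()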


id                 1
day                0
pressure           0
maxtemp            0
temparature        0
mintemp            0
dewpoint           0
humidity           0
cloud              0
sunshine         337
winddirection      0
windspeed          0
rainfall         540
dtype: int64
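Because a notebook cell displays only its last expression, the missing-value counts from the cell above are not shown. One way to see both checks at once is to collect them into a single table. This is a sketch under stated assumptions: `missing_and_zero_summary` is a hypothetical helper, not part of the original notebook, and `sample_df` holds only three columns from the first five training rows shown earlier.

```python
import pandas as pd


def missing_and_zero_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column counts of missing values and of exact zeros."""
    return pd.DataFrame({
        "missing": df.isnull().sum(),
        "zeros": df.eq(0).sum(),
    })


# Hypothetical sample: three columns from the first five rows of train.csv.
sample_df = pd.DataFrame({
    "id": [0, 1, 2, 3, 4],
    "sunshine": [1.1, 0.0, 8.3, 0.0, 3.6],
    "rainfall": [1, 1, 1, 1, 0],
})

summary = missing_and_zero_summary(sample_df)
print(summary)
```

Calling `missing_and_zero_summary(train_df)` on the full data should give the zero counts listed above alongside the missing-value counts. Note that zeros in `sunshine` and `rainfall` are legitimate values (no sunshine, no rain) rather than placeholders for missing data, and the single zero in `id` is simply the first record.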

In [33]:
# checking for duplicates in the train data
train_df.duplicated().sum()

np.int64(0)

## Test data

Checking the head and tail of the test data.

In [None]:
print('test data head:- \n')
test_df.head(5)

test data head:- 



In [25]:
print('test data tail:- \n')
test_df.tail(5)

test data tail:- 



Unnamed: 0,id,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,sunshine,winddirection,windspeed
725,2915,361,1020.8,18.2,17.6,16.1,13.7,96.0,95.0,0.0,20.0,34.3
726,2916,362,1011.7,23.2,18.1,16.0,16.0,78.0,80.0,1.6,40.0,25.2
727,2917,363,1022.7,21.0,18.5,17.0,15.5,92.0,96.0,0.0,50.0,21.9
728,2918,364,1014.4,21.0,20.0,19.7,19.8,94.0,93.0,0.0,50.0,39.5
729,2919,365,1020.9,22.2,18.8,17.0,13.3,79.0,89.0,0.2,60.0,50.6


In [26]:
# checking the columns and shape of the test data
test_df.columns, test_df.shape

(Index(['id', 'day', 'pressure', 'maxtemp', 'temparature', 'mintemp',
        'dewpoint', 'humidity', 'cloud', 'sunshine', 'winddirection',
        'windspeed'],
       dtype='object'),
 (730, 12))

In [27]:
# checking the info and description of the test data
print('Test data information:- \n')
test_df.info()

print('\nTest data description:- ')
test_df.describe().T

Test data information:- 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 730 entries, 0 to 729
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             730 non-null    int64  
 1   day            730 non-null    int64  
 2   pressure       730 non-null    float64
 3   maxtemp        730 non-null    float64
 4   temparature    730 non-null    float64
 5   mintemp        730 non-null    float64
 6   dewpoint       730 non-null    float64
 7   humidity       730 non-null    float64
 8   cloud          730 non-null    float64
 9   sunshine       730 non-null    float64
 10  winddirection  729 non-null    float64
 11  windspeed      730 non-null    float64
dtypes: float64(10), int64(2)
memory usage: 68.6 KB

Test data description:- 


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,730.0,2554.5,210.877136,2190.0,2372.25,2554.5,2736.75,2919.0
day,730.0,183.0,105.438271,1.0,92.0,183.0,274.0,365.0
pressure,730.0,1013.503014,5.505871,1000.0,1008.725,1012.7,1017.6,1032.2
maxtemp,730.0,26.372466,5.672521,7.4,21.6,27.8,31.0,35.8
temparature,730.0,23.963288,5.278098,5.9,19.825,25.65,28.375,31.8
mintemp,730.0,22.110274,5.170744,4.2,17.825,23.9,26.4,29.1
dewpoint,730.0,20.460137,5.391169,-0.0,16.8,22.3,25.0,26.7
humidity,730.0,82.669863,7.818714,39.0,77.25,82.0,89.0,98.0
cloud,730.0,76.360274,17.934121,0.0,69.0,83.0,88.0,100.0
sunshine,730.0,3.664384,3.639272,0.0,0.325,2.2,6.675,11.8


In [28]:
# checking for missing values in the test data
test_df.isnull().sum()


id               0
day              0
pressure         0
maxtemp          0
temparature      0
mintemp          0
dewpoint         0
humidity         0
cloud            0
sunshine         0
winddirection    1
windspeed        0
dtype: int64

In [29]:
# checking for zeros in the test data
test_df.isin([0]).sum()

id                 0
day                0
pressure           0
maxtemp            0
temparature        0
mintemp            0
dewpoint           1
humidity           0
cloud              1
sunshine         122
winddirection      0
windspeed          0
dtype: int64

In [30]:
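# filling the single missing winddirection value in the test data with the column mean
# (assigning the result back avoids chained in-place modification, which newer pandas versions warn about)
test_df['winddirection'] = test_df['winddirection'].fillna(test_df['winddirection'].mean())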
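The cell above fills the gap with the arithmetic mean. If `winddirection` is an angle in degrees (an assumption; the notebook does not state the unit), a circular mean may be a more faithful fill value, since 350° and 30° average to a direction near 10° rather than 190°. The sketch below shows this alternative technique, not the method used in this notebook; `circular_mean_deg` and `directions` are hypothetical names, and the values are toy inputs chosen only to illustrate the wrap-around.

```python
import numpy as np
import pandas as pd


def circular_mean_deg(angles: pd.Series) -> float:
    """Mean direction in degrees within [0, 360), ignoring missing values."""
    radians = np.deg2rad(angles.dropna().to_numpy(dtype=float))
    # Average the unit vectors, then convert the resulting vector back to an angle.
    mean_sin = np.sin(radians).mean()
    mean_cos = np.cos(radians).mean()
    return float(np.rad2deg(np.arctan2(mean_sin, mean_cos)) % 360.0)


# Toy input (not taken from the dataset): two directions either side of north
# plus one missing value.
directions = pd.Series([350.0, 30.0, np.nan])

fill_value = circular_mean_deg(directions)
filled = directions.fillna(fill_value)
print(fill_value, directions.mean())
```

With only one missing value out of 730 rows, the choice of fill value is unlikely to matter much here; the sketch mainly documents the caveat.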

In [32]:
# checking for duplicates in the test data
test_df.duplicated().sum()

np.int64(0)

## Data overview

Train data shape = (2190, 13)

Test data shape = (730, 12) [the target column, rainfall, is absent from the test data]
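The difference between the two shapes can also be confirmed programmatically. The sketch below compares the column lists printed earlier in this notebook; the variable names are hypothetical, and with the data loaded the same check could be run on `train_df.columns` and `test_df.columns`.

```python
# Column names copied from the train_df.columns and test_df.columns
# outputs shown earlier in this notebook.
train_columns = ['id', 'day', 'pressure', 'maxtemp', 'temparature', 'mintemp',
                 'dewpoint', 'humidity', 'cloud', 'sunshine', 'winddirection',
                 'windspeed', 'rainfall']
test_columns = ['id', 'day', 'pressure', 'maxtemp', 'temparature', 'mintemp',
                'dewpoint', 'humidity', 'cloud', 'sunshine', 'winddirection',
                'windspeed']

# Columns present in train but not in test, and the reverse.
train_only = [c for c in train_columns if c not in test_columns]
test_only = [c for c in test_columns if c not in train_columns]
print(train_only, test_only)
```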