# Predict Droughts using Weather & Soil Data


By: Wihar Paladugula <br/>

ID: RQ47971

Data Source: https://www.kaggle.com/datasets/cdminix/us-drought-meteorological-data

## Importing Necessary Packages

In [12]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import datetime as dt

## Reading the CSV file

In [3]:
df = pd.read_csv("train_timeseries.csv")

## Exploring the data frame

In [4]:
print(f"The data has {df.shape[0]} rows and {df.shape[1]} columns")

The data has 19300680 rows and 21 columns


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19300680 entries, 0 to 19300679
Data columns (total 21 columns):
 #   Column       Dtype  
---  ------       -----  
 0   fips         int64  
 1   date         object 
 2   PRECTOT      float64
 3   PS           float64
 4   QV2M         float64
 5   T2M          float64
 6   T2MDEW       float64
 7   T2MWET       float64
 8   T2M_MAX      float64
 9   T2M_MIN      float64
 10  T2M_RANGE    float64
 11  TS           float64
 12  WS10M        float64
 13  WS10M_MAX    float64
 14  WS10M_MIN    float64
 15  WS10M_RANGE  float64
 16  WS50M        float64
 17  WS50M_MAX    float64
 18  WS50M_MIN    float64
 19  WS50M_RANGE  float64
 20  score        float64
dtypes: float64(19), int64(1), object(1)
memory usage: 3.0+ GB
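The `df.info()` summary above reports 19 `float64` columns and more than 3.0 GB of memory. One possible way to shrink that footprint is to store those columns as `float32`. The sketch below is illustrative only: `downcast_floats` is a hypothetical helper, the tiny `demo` frame stands in for the real data, and whether `float32` precision is adequate for these measurements is an assumption that would need checking.

```python
import numpy as np
import pandas as pd

def downcast_floats(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every float64 column is stored as float32."""
    float_cols = frame.select_dtypes(include="float64").columns
    return frame.astype({col: "float32" for col in float_cols})

# Small stand-in frame using three of the column names from df.info().
demo = pd.DataFrame({
    "fips": np.array([1001, 1001, 1003], dtype="int64"),
    "PRECTOT": [0.22, 0.2, 3.65],
    "PS": [100.51, 100.55, 100.15],
})

small = downcast_floats(demo)
bytes_before = demo.memory_usage(deep=True).sum()
bytes_after = small.memory_usage(deep=True).sum()
print(f"{bytes_before} bytes -> {bytes_after} bytes")
```

Integer columns such as `fips` are left untouched here; only the float columns are converted.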


In [7]:
df.columns

Index(['fips', 'date', 'PRECTOT', 'PS', 'QV2M', 'T2M', 'T2MDEW', 'T2MWET',
       'T2M_MAX', 'T2M_MIN', 'T2M_RANGE', 'TS', 'WS10M', 'WS10M_MAX',
       'WS10M_MIN', 'WS10M_RANGE', 'WS50M', 'WS50M_MAX', 'WS50M_MIN',
       'WS50M_RANGE', 'score'],
      dtype='object')

In [10]:
df.head()

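|   | fips | date | PRECTOT | PS | QV2M | T2M | T2MDEW | T2MWET | T2M_MAX | T2M_MIN | ... | TS | WS10M | WS10M_MAX | WS10M_MIN | WS10M_RANGE | WS50M | WS50M_MAX | WS50M_MIN | WS50M_RANGE | score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 2000-01-01 | 0.22 | 100.51 | 9.65 | 14.74 | 13.51 | 13.51 | 20.96 | 11.46 | ... | 14.65 | 2.2 | 2.94 | 1.49 | 1.46 | 4.85 | 6.04 | 3.23 | 2.81 | NaN |
| 1 | 1001 | 2000-01-02 | 0.2 | 100.55 | 10.42 | 16.69 | 14.71 | 14.71 | 22.8 | 12.61 | ... | 16.6 | 2.52 | 3.43 | 1.83 | 1.6 | 5.33 | 6.13 | 3.72 | 2.41 | NaN |
| 2 | 1001 | 2000-01-03 | 3.65 | 100.15 | 11.76 | 18.49 | 16.52 | 16.52 | 22.73 | 15.32 | ... | 18.41 | 4.03 | 5.33 | 2.66 | 2.67 | 7.53 | 9.52 | 5.87 | 3.66 | NaN |
| 3 | 1001 | 2000-01-04 | 15.95 | 100.29 | 6.42 | 11.4 | 6.09 | 6.1 | 18.09 | 2.16 | ... | 11.31 | 3.84 | 5.67 | 2.08 | 3.59 | 6.73 | 9.31 | 3.74 | 5.58 | 1.0 |
| 4 | 1001 | 2000-01-05 | 0.0 | 101.15 | 2.95 | 3.86 | -3.29 | -3.2 | 10.82 | -2.66 | ... | 2.65 | 1.6 | 2.5 | 0.52 | 1.98 | 2.94 | 4.85 | 0.65 | 4.19 | NaN |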


## Data Cleaning

In [14]:
df['date'] = pd.to_datetime(df['date'])

In [16]:
df['year'] = df['date'].dt.year
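The conversion above lets pandas infer the date format. Since the `date` values shown by `df.head()` follow the `YYYY-MM-DD` pattern, a variant with an explicit format and a check for unparseable entries might look like the sketch below. The `raw` frame and its deliberately malformed third value are made up for illustration; nothing in the notebook indicates that the real data contains bad dates.

```python
import pandas as pd

# Hypothetical sample, including one malformed entry to show the behaviour.
raw = pd.DataFrame({"date": ["2000-01-01", "2010-06-15", "not a date"]})

# An explicit format skips format inference; errors="coerce" turns
# unparseable strings into NaT instead of raising an exception.
raw["date"] = pd.to_datetime(raw["date"], format="%Y-%m-%d", errors="coerce")

# Count the rows that failed to parse before deriving the year.
bad_rows = int(raw["date"].isna().sum())
raw["year"] = raw["date"].dt.year
print(f"{bad_rows} unparseable date(s)")
```

Rows that fail to parse end up with a missing `year`, so they can be inspected or dropped before any year-based filtering.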

In [17]:
min_year = df['year'].min()
max_year = df['year'].max()

print(f"Minimum Year: {min_year}")
print(f"Maximum Year: {max_year}")

Minimum Year: 2000
Maximum Year: 2016


> To keep memory usage manageable, only the records from 2010 to 2016 are retained.

In [18]:
df = df[(df['year'] >= 2010) & (df['year'] <= 2016)]
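The filter above runs after the whole file has been loaded. If memory is the constraint, an alternative is to apply the same year range while reading the CSV in chunks, so the full table never has to sit in memory at once. This is a sketch under assumptions: `read_years` is a hypothetical helper, the in-memory `sample` text stands in for `train_timeseries.csv`, and the default chunk size is arbitrary.

```python
import io
import pandas as pd

def read_years(source, first_year, last_year, chunksize=1_000_000):
    """Read a CSV in chunks, keeping only rows whose year is in range."""
    kept = []
    for chunk in pd.read_csv(source, parse_dates=["date"], chunksize=chunksize):
        years = chunk["date"].dt.year
        # between() is inclusive on both ends, matching >= and <= above.
        kept.append(chunk[years.between(first_year, last_year)])
    return pd.concat(kept, ignore_index=True)

# Tiny stand-in for the real file, with one row outside the range.
sample = io.StringIO(
    "fips,date,PRECTOT\n"
    "1001,2009-12-31,0.0\n"
    "1001,2010-01-01,1.5\n"
    "1001,2016-12-31,2.5\n"
)
recent = read_years(sample, 2010, 2016, chunksize=2)
print(len(recent), "rows kept")
```

Unlike the boolean filter in the notebook, this version resets the index (`ignore_index=True`), so the result starts at 0 rather than preserving the original row labels.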

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7947156 entries, 3653 to 19300679
Data columns (total 22 columns):
 #   Column       Dtype         
---  ------       -----         
 0   fips         int64         
 1   date         datetime64[ns]
 2   PRECTOT      float64       
 3   PS           float64       
 4   QV2M         float64       
 5   T2M          float64       
 6   T2MDEW       float64       
 7   T2MWET       float64       
 8   T2M_MAX      float64       
 9   T2M_MIN      float64       
 10  T2M_RANGE    float64       
 11  TS           float64       
 12  WS10M        float64       
 13  WS10M_MAX    float64       
 14  WS10M_MIN    float64       
 15  WS10M_RANGE  float64       
 16  WS50M        float64       
 17  WS50M_MAX    float64       
 18  WS50M_MIN    float64       
 19  WS50M_RANGE  float64       
 20  score        float64       
 21  year         int64         
dtypes: datetime64[ns](1), float64(19), int64(2)
memory usage: 1.4 GB


In [19]:
print(f"The data has {df.shape[0]} rows and {df.shape[1]} columns")

The data has 7947156 rows and 22 columns


In [21]:
min_year = df['year'].min()
max_year = df['year'].max()

print(f"Minimum Year: {min_year}")
print(f"Maximum Year: {max_year}")

Minimum Year: 2010
Maximum Year: 2016


In [22]:
df.score.unique()

array([   nan, 0.    , 0.0507, ..., 1.6143, 0.7399, 0.606 ])
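The `unique()` output includes `nan`, and in the `df.head()` preview only one of the five rows carries a `score`. A quick way to quantify how many rows lack a score, and to separate out the labelled ones, could look like the sketch below. The `scores` frame is a made-up three-row sample, and whether unlabelled rows should be dropped or kept as input features is a modelling decision this sketch does not settle.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the pattern seen in df.head():
# most days have no score, and an occasional day does.
scores = pd.DataFrame({
    "date": pd.to_datetime(["2000-01-03", "2000-01-04", "2000-01-05"]),
    "score": [np.nan, 1.0, np.nan],
})

# Fraction of rows without a score, and the subset that has one.
missing_share = scores["score"].isna().mean()
labelled = scores.dropna(subset=["score"])
print(f"{missing_share:.0%} of rows have no score; {len(labelled)} labelled row(s) remain")
```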