# Objective: To scrape weather data from "http://weather-prediction.surge.sh/", clean it and save as a '.csv' file.

Data feature descriptions:

1. station - used weather station number: 1 to 25
2. Date - Present day
3. Present_Tmax - Maximum air temperature between 0 and 21 h on the present day (°C): 20 to 37.6
4. Present_Tmin - Minimum air temperature between 0 and 21 h on the present day (°C): 11.3 to 29.9
5. LDAPS_RHmin - LDAPS model forecast of next-day minimum relative humidity (%): 19.8 to 98.5
6. LDAPS_RHmax - LDAPS model forecast of next-day maximum relative humidity (%): 58.9 to 100
7. LDAPS_Tmax_lapse - LDAPS model forecast of next-day maximum air temperature applied lapse rate (°C): 17.6 to 38.5
8. LDAPS_Tmin_lapse - LDAPS model forecast of next-day minimum air temperature applied lapse rate (°C): 14.3 to 29.6
9. LDAPS_WS - LDAPS model forecast of next-day average wind speed (m/s): 2.9 to 21.9
10. LDAPS_LH - LDAPS model forecast of next-day average latent heat flux (W/m2): -13.6 to 213.4
11. LDAPS_CC1 - LDAPS model forecast of next-day 1st 6-hour split average cloud cover (0-5 h) (%): 0 to 0.97
12. LDAPS_CC2 - LDAPS model forecast of next-day 2nd 6-hour split average cloud cover (6-11 h) (%): 0 to 0.97
13. LDAPS_CC3 - LDAPS model forecast of next-day 3rd 6-hour split average cloud cover (12-17 h) (%): 0 to 0.98
14. LDAPS_CC4 - LDAPS model forecast of next-day 4th 6-hour split average cloud cover (18-23 h) (%): 0 to 0.97
15. LDAPS_PPT1 - LDAPS model forecast of next-day 1st 6-hour split average precipitation (0-5 h) (%): 0 to 23.7
16. LDAPS_PPT2 - LDAPS model forecast of next-day 2nd 6-hour split average precipitation (6-11 h) (%): 0 to 21.6
17. LDAPS_PPT3 - LDAPS model forecast of next-day 3rd 6-hour split average precipitation (12-17 h) (%): 0 to 15.8
18. LDAPS_PPT4 - LDAPS model forecast of next-day 4th 6-hour split average precipitation (18-23 h) (%): 0 to 16.7
19. lat - Latitude (°): 37.456 to 37.645
20. lon - Longitude (°): 126.826 to 127.135
21. DEM - Elevation (m): 12.4 to 212.3
22. Slope - Slope (°): 0.1 to 5.2
23. Solar_radiation - Daily incoming solar radiation (Wh/m2): 4329.5 to 5992.9
24. Next_Tmax - The next-day maximum air temperature (°C): 17.4 to 38.9
25. Next_Tmin - The next-day minimum air temperature (°C): 11.3 to 29.8

## Import necessary libraries

In [1]:
#import statements
import pandas as pd
import numpy as np

from bs4 import BeautifulSoup
from requests import get

## Scrape data and store in a dataframe

In [2]:
url = "http://weather-prediction.surge.sh/"
html_content = get(url).content

soup = BeautifulSoup(html_content, 'html.parser')
table = soup.find('table', attrs={'class': 'new-table table-striped'})
headings = table.thead.tr
body = table.tbody.find_all('tr')

In [3]:
def return_text(element):
    ''' Returns the stripped text of a tag, or None if the tag is missing or its text is empty. '''
    if not element:
        return None
    text = element.text.strip()
    return text if text else None
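
To see what the helper returns without fetching the page, the sketch below calls an equivalent function on stand-in objects. `SimpleNamespace` is only a hypothetical substitute for a BeautifulSoup tag (the helper reads nothing but the `.text` attribute), and `cell_text` is an illustrative name, not part of the notebook.

```python
from types import SimpleNamespace

def cell_text(element):
    """Return the stripped text of an element, or None if it is missing or blank."""
    if not element:
        return None
    text = element.text.strip()
    return text or None

# Stand-ins for parsed <td> tags (assumption: only the .text attribute matters here)
filled = SimpleNamespace(text="  28.7 ")
blank = SimpleNamespace(text="   ")

print(cell_text(filled))  # 28.7
print(cell_text(blank))   # None
print(cell_text(None))    # None
```

Mapping blank cells to `None` is what lets pandas treat them as missing values in the later cleaning steps.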

In [4]:
col_names = []
for th in headings.find_all('th'):
    col_names.append(return_text(th))
print(f'Column names: {col_names}\nCount: {len(col_names)}')

data = {col:[] for col in col_names}

for tr in body:
    tds = tr.find_all('td')
    for idx, col in enumerate(col_names):
        data[col].append(return_text(tds[idx]))

df_data = pd.DataFrame(data)
df_data.head(10)

Column names: ['station', 'Date', 'Present_Tmax', 'Present_Tmin', 'LDAPS_RHmin', 'LDAPS_RHmax', 'LDAPS_Tmax_lapse', 'LDAPS_Tmin_lapse', 'LDAPS_WS', 'LDAPS_LH', 'LDAPS_CC1', 'LDAPS_CC2', 'LDAPS_CC3', 'LDAPS_CC4', 'LDAPS_PPT1', 'LDAPS_PPT2', 'LDAPS_PPT3', 'LDAPS_PPT4', 'lat', 'lon', 'DEM', 'Slope', 'Solar_radiation', 'Next_Tmax', 'Next_Tmin']
Count: 25


Unnamed: 0,station,Date,Present_Tmax,Present_Tmin,LDAPS_RHmin,LDAPS_RHmax,LDAPS_Tmax_lapse,LDAPS_Tmin_lapse,LDAPS_WS,LDAPS_LH,...,LDAPS_PPT2,LDAPS_PPT3,LDAPS_PPT4,lat,lon,DEM,Slope,Solar_radiation,Next_Tmax,Next_Tmin
0,1,30-06-13,28.7,21.4,58.25568771,91.11636353,28.07410146,23.00693617,6.818886966,69.45180527,...,0,0.0,0,37.6046,126.991,212.335,2.785,5992.895996,29.1,21.2
1,2,30-06-13,31.9,21.6,52.26339722,90.60472107,29.85068856,24.03500932,5.691889932,51.93744783,...,0,0.0,0,37.6046,127.032,44.7624,0.5141,5869.3125,30.5,22.5
2,3,Thirty-Six-2013,31.6,23.3,48.69047928,83.97358704,30.09129171,24.56563342,6.138223678,20.57304966,...,0,0.0,0,37.5776,127.058,33.3068,0.2661,5863.555664,31.1,23.9
3,4,30-06-13,32.0,23.4,58.23978806,96.48368835,29.7046289,23.32617729,5.650050263,65.72714393,...,0,,0,37.645,127.022,45.716,2.5348,5856.964844,31.7,24.3
4,5,30-06-13,31.4,21.9,56.17409515,90.15512848,29.11393432,23.48647993,5.735004306,107.9655353,...,0,0.0,0,37.5507,127.135,35.038,0.5055,5859.552246,31.2,22.5
5,6,30-06-13,31.9,23.5,52.43712616,85.30725098,29.21934227,23.8226129,6.182295263,50.23138913,...,0,0.0,0,37.5102,127.042,54.6384,0.1457,5873.780762,31.5,24.0
6,7,30-06-13,31.4,24.4,56.28718948,81.01976013,28.55185865,24.23846712,,125.110007,...,0,0.0,0,37.5776,126.838,12.37,0.0985,5849.233398,30.9,23.4
7,8,30-06-13,32.1,,52.32621765,78.00453949,28.85198158,23.81905389,6.104417304,42.01154665,...,0,0.0,0,37.4697,126.91,52.518,1.5629,5863.992188,31.1,22.9
8,9,30-06-13,31.4,,55.33879089,80.78460693,28.4269752,23.33237339,6.017135074,85.11097145,...,0,0.0,0,37.4967,126.826,50.9312,0.4125,5876.901367,31.3,21.6
9,10,30-06-13,31.6,,56.65120316,86.84963226,27.57670489,22.52701832,6.518841068,63.00607544,...,0,0.0,0,37.4562,126.955,208.507,5.1782,5893.608398,30.5,21.0
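
The loop above indexes `tds[idx]` for every column, so it would raise `IndexError` if a row ever had fewer cells than the header. A minimal sketch of a more tolerant version, assuming rows arrive as plain lists of strings (the function name `rows_to_columns` and the sample rows are hypothetical):

```python
def rows_to_columns(col_names, rows):
    """Build a column-oriented dict, padding short rows with None."""
    data = {col: [] for col in col_names}
    for row in rows:
        for idx, col in enumerate(col_names):
            # Use None when the row has no cell at this position
            data[col].append(row[idx] if idx < len(row) else None)
    return data

# Hypothetical sample: the second row is missing its last cell
sample = rows_to_columns(
    ["station", "Date", "Present_Tmax"],
    [["1", "30-06-13", "28.7"], ["2", "30-06-13"]],
)
print(sample["Present_Tmax"])  # ['28.7', None]
```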


In [5]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 25 columns):
station             1200 non-null object
Date                1195 non-null object
Present_Tmax        1187 non-null object
Present_Tmin        1065 non-null object
LDAPS_RHmin         1019 non-null object
LDAPS_RHmax         1175 non-null object
LDAPS_Tmax_lapse    1175 non-null object
LDAPS_Tmin_lapse    1175 non-null object
LDAPS_WS            1109 non-null object
LDAPS_LH            1175 non-null object
LDAPS_CC1           1175 non-null object
LDAPS_CC2           1175 non-null object
LDAPS_CC3           1175 non-null object
LDAPS_CC4           995 non-null object
LDAPS_PPT1          1175 non-null object
LDAPS_PPT2          1175 non-null object
LDAPS_PPT3          1037 non-null object
LDAPS_PPT4          1175 non-null object
lat                 1200 non-null object
lon                 1200 non-null object
DEM                 1200 non-null object
Slope               1200 non-null

## Type-casting

In [6]:
df_data.station = df_data.station.astype(int)
df_data.Date = df_data.Date.astype(str).replace('None', np.nan)
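# Assign by column label so the float dtype replaces object (an .iloc assignment can keep the old dtype in recent pandas versions)
num_cols = df_data.columns[2:]
df_data[num_cols] = df_data[num_cols].astype(float)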
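
`astype(float)` turns `None` into `NaN` but raises an error on any cell that is not numeric. If the scraped table could contain stray text, a tolerant converter is one option. The sketch below is plain Python and the helper name `to_float` is hypothetical; within pandas, `pd.to_numeric(..., errors='coerce')` plays a similar role.

```python
import math

def to_float(value):
    """Convert scraped text to float; missing or non-numeric values become NaN."""
    if value is None:
        return math.nan
    try:
        return float(value)
    except ValueError:
        return math.nan

print(to_float("28.7"))  # 28.7
print(to_float(None))    # nan
print(to_float("n/a"))   # nan
```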

## Data cleaning

In [7]:
df_data.Date.unique()

array(['30-06-13', 'Thirty-Six-2013', nan, '1/7/2013', '2/7/2013',
       '3/7/2013', '4/7/2013', '5/7/2013', '6/7/2013', '7/7/2013',
       '8/7/2013', '9/7/2013', '10/7/2013', '11/7/2013', '12/7/2013',
       '13-07-13', '14-07-13', '15-07-13', '15-Seven-2013', '16-07-13',
       '17-07-13', '18-07-13', '19-07-13', '20-07-13', 'Twenty-07-2013',
       '21-07-13', '22-07-13', '23-07-13', '24-07-13', '25-07-13',
       '26-07-13', '27-07-13', '28-07-13', '29-07-13', '30-07-13',
       '31-07-13', '1/8/2013', '2/8/2013', '3/8/2013', '4/8/2013',
       '5/8/2013', '6/8/2013', '7/8/2013', '07-08-Thirteen', '8/8/2013',
       '9/8/2013', 'Nine-08-2013', '10/8/2013', '11/8/2013', '12/8/2013',
       '13-08-13', '14-08-13', '15-08-13', '16-08-13'], dtype=object)

In [8]:
def clean_date(strDate):
    ''' Replaces spelled-out dd, mm and yy parts of a 'dd-mm-yy' date with digits and returns 'dd/mm/yy'. Other values are returned unchanged. '''
    
    words_num = {'one':'01', 'two':'02', 'three':'03', 'four':'04', 'five':'05', 'six':'06', 'seven':'07', 'eight':'08', 'nine':'09', 'ten':'10', 'eleven':'11', 'twelve':'12',
            'thirteen':'13', 'fourteen':'14', 'fifteen':'15', 'sixteen':'16', 'seventeen':'17','eighteen':'18', 'nineteen':'19', 'twenty':'20', 'twentyone': '21', 
            'twentytwo':'22', 'twentythree':'23', 'twentyfour':'24', 'twentyfive':'25', 'twentysix':'26', 'twentyseven':'27', 'twentyeight':'28', 'twentynine':'29',
            'thirty':'30', 'thirtyone':'31'
            }
    
    # NaN and strings that do not split into three parts on '-' (e.g. '1/7/2013') are left as they are
    if not isinstance(strDate, str):
        return strDate
    ltWords = strDate.split('-')
    if len(ltWords) != 3:
        return strDate
    
    # look up each part as a number word; keep it as written when it is not one
    dd, mm, yy = [words_num.get(word.lower().strip(), word) for word in ltWords]
    
    # keep only the last two digits of the year ('2013' -> '13')
    return dd + '/' + mm + '/' + yy[-2:]

In [9]:
df_data['Date'] = df_data['Date'].apply(clean_date)
df_data.Date.unique()

array(['30/06/13', nan, '1/7/2013', '2/7/2013', '3/7/2013', '4/7/2013',
       '5/7/2013', '6/7/2013', '7/7/2013', '8/7/2013', '9/7/2013',
       '10/7/2013', '11/7/2013', '12/7/2013', '13/07/13', '14/07/13',
       '15/07/13', '16/07/13', '17/07/13', '18/07/13', '19/07/13',
       '20/07/13', '21/07/13', '22/07/13', '23/07/13', '24/07/13',
       '25/07/13', '26/07/13', '27/07/13', '28/07/13', '29/07/13',
       '30/07/13', '31/07/13', '1/8/2013', '2/8/2013', '3/8/2013',
       '4/8/2013', '5/8/2013', '6/8/2013', '7/8/2013', '07/08/13',
       '8/8/2013', '9/8/2013', '09/08/13', '10/8/2013', '11/8/2013',
       '12/8/2013', '13/08/13', '14/08/13', '15/08/13', '16/08/13'],
      dtype=object)
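
The word-to-digit lookup can be checked on the irregular values seen in the first `unique()` output. The sketch below is a compact variant with a trimmed lookup table covering only those samples; `WORDS_NUM` and `normalise_date` are illustrative names, not the notebook's function.

```python
# Trimmed lookup: only the spelled-out numbers that occur in the samples below
WORDS_NUM = {"six": "06", "seven": "07", "nine": "09",
             "thirteen": "13", "twenty": "20", "thirty": "30"}

def normalise_date(raw):
    """Turn a 'dd-mm-yy' string with spelled-out parts into 'dd/mm/yy'."""
    parts = raw.split("-")
    if len(parts) != 3:
        return raw  # e.g. '1/7/2013' is left untouched
    dd, mm, yy = (WORDS_NUM.get(p.strip().lower(), p) for p in parts)
    return f"{dd}/{mm}/{yy[-2:]}"

for raw in ["Thirty-Six-2013", "15-Seven-2013", "Twenty-07-2013",
            "07-08-Thirteen", "Nine-08-2013", "1/7/2013"]:
    print(raw, "->", normalise_date(raw))
```

Each hyphenated sample comes out as `dd/mm/yy`, while the slash-separated date passes through unchanged, which is why two layouts remain in the cleaned column.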

### Handle null values in 'Date' column

In [10]:
df_data[df_data['Date'].isna()]

Unnamed: 0,station,Date,Present_Tmax,Present_Tmin,LDAPS_RHmin,LDAPS_RHmax,LDAPS_Tmax_lapse,LDAPS_Tmin_lapse,LDAPS_WS,LDAPS_LH,...,LDAPS_PPT2,LDAPS_PPT3,LDAPS_PPT4,lat,lon,DEM,Slope,Solar_radiation,Next_Tmax,Next_Tmin
13,14,,31.3,23.8,50.745735,74.49881,29.498526,24.459427,6.319478,16.835611,...,0.0,0.0,0.0,37.4967,126.927,30.968,0.618,5857.949707,31.7,22.9
79,5,,30.5,20.9,70.627464,90.730034,29.871706,23.106501,9.116771,83.304561,...,0.0,0.0,0.041595,37.5507,127.135,35.038,0.5055,5841.578125,28.7,23.3
688,14,,30.8,22.6,71.90918,92.612534,28.532291,24.167366,8.193516,50.247033,...,2.193649,0.0,0.173879,37.4967,126.927,30.968,0.618,5503.832031,27.2,23.4
978,4,,33.1,27.6,68.092934,96.064369,31.863802,27.169974,8.155877,81.423075,...,0.009569,0.0,0.0,37.645,127.022,45.716,2.5348,5205.58252,32.7,28.3
1187,13,,32.4,26.9,45.7365,84.29541,32.752636,27.122157,9.124786,34.823897,...,0.0,0.0,0.0,37.5776,127.083,59.8324,2.6865,4951.04834,33.2,27.5


Fill in each missing date from a neighbouring row, using the station number to check that the neighbour belongs to the same day.

In [11]:
na_date_indices = df_data[df_data['Date'].isna()].index

# Handle missing values in 'Date' column.
# This assumes rows are ordered by date and, within each date, by station number,
# so a neighbouring row with an adjacent station number belongs to the same day.
for index in na_date_indices:
    station = df_data.loc[index, 'station']
    # Prefer the previous row when it holds the preceding station
    if index > 0 and df_data.loc[index-1, 'station'] == station - 1:
        df_data.loc[index, 'Date'] = df_data.loc[index-1, 'Date']
    # Otherwise (e.g. in the first row) use the next row when it holds the following station
    elif index + 1 < len(df_data) and df_data.loc[index+1, 'station'] == station + 1:
        df_data.loc[index, 'Date'] = df_data.loc[index+1, 'Date']
        
df_data['Date'].isna().sum()

0
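
The neighbour rule can be tried on a small example. This is a plain-Python sketch on hypothetical rows (a list of dicts rather than the DataFrame), and it includes a bounds check for the last row; `fill_dates_from_neighbours` is an illustrative name.

```python
def fill_dates_from_neighbours(rows):
    """Fill a missing 'Date' from an adjacent row whose station number is adjacent."""
    for i, row in enumerate(rows):
        if row["Date"] is not None:
            continue
        prev_row = rows[i - 1] if i > 0 else None
        next_row = rows[i + 1] if i + 1 < len(rows) else None
        if prev_row and prev_row["station"] == row["station"] - 1:
            row["Date"] = prev_row["Date"]
        elif next_row and next_row["station"] == row["station"] + 1:
            row["Date"] = next_row["Date"]
    return rows

# Hypothetical rows: station 25 closes one day, station 1 opens the next
rows = [
    {"station": 24, "Date": "30/06/13"},
    {"station": 25, "Date": None},
    {"station": 1, "Date": None},
    {"station": 2, "Date": "1/7/2013"},
]
fill_dates_from_neighbours(rows)
print([r["Date"] for r in rows])
# ['30/06/13', '30/06/13', '1/7/2013', '1/7/2013']
```

The station check is what stops station 1 from inheriting the previous day's date from station 25.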

### Convert 'Date' from string to datetime type

In [12]:
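# Dates are day-first ('1/7/2013' is 1 July), so say so explicitly rather than let the parser guess.
# pandas 2.x additionally needs format='mixed' because two layouts are present.
df_data['Date'] = pd.to_datetime(df_data['Date'], dayfirst=True)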
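
Both layouts left after cleaning are day-first: '1/7/2013' is 1 July, since it follows '30/06/13' in the data. A parser that guesses month-first would read it as 7 January. One way to make the convention explicit is to try the two known layouts in turn. This is a standard-library sketch rather than a pandas call, and `parse_day_first` is a hypothetical helper:

```python
from datetime import date, datetime

def parse_day_first(text):
    """Parse 'd/m/yyyy' or 'dd/mm/yy' as a day-first date."""
    for fmt in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")

print(parse_day_first("1/7/2013"))  # 2013-07-01
print(parse_day_first("30/06/13"))  # 2013-06-30
```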

### Handle null values in other columns

In [13]:
for col in df_data.columns:
    na_sum = df_data[col].isna().sum()
    if  na_sum > 0:
        print(f'Null values in {col} = {na_sum} ({100 * na_sum / df_data.shape[0]:.1f}%)')

Null values in Present_Tmax = 13 (1.1%)
Null values in Present_Tmin = 135 (11.2%)
Null values in LDAPS_RHmin = 181 (15.1%)
Null values in LDAPS_RHmax = 25 (2.1%)
Null values in LDAPS_Tmax_lapse = 25 (2.1%)
Null values in LDAPS_Tmin_lapse = 25 (2.1%)
Null values in LDAPS_WS = 91 (7.6%)
Null values in LDAPS_LH = 25 (2.1%)
Null values in LDAPS_CC1 = 25 (2.1%)
Null values in LDAPS_CC2 = 25 (2.1%)
Null values in LDAPS_CC3 = 25 (2.1%)
Null values in LDAPS_CC4 = 205 (17.1%)
Null values in LDAPS_PPT1 = 25 (2.1%)
Null values in LDAPS_PPT2 = 25 (2.1%)
Null values in LDAPS_PPT3 = 163 (13.6%)
Null values in LDAPS_PPT4 = 25 (2.1%)
Null values in Next_Tmax = 26 (2.2%)
Null values in Next_Tmin = 35 (2.9%)


No column is missing more than 205 of its 1,200 values (about 17%), so the null values are imputed rather than dropping the affected rows.

### Imputing null values using mean

In [14]:
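# Compute means over numeric columns only, so the datetime 'Date' column is left out
df_data = df_data.fillna(df_data.mean(numeric_only=True))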
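
The `fillna` call replaces each missing value with the mean of the observed values in the same column. A plain-Python sketch of that rule for a single column, using made-up numbers (`impute_mean` is a hypothetical helper):

```python
import math
from statistics import fmean

def impute_mean(values):
    """Replace NaN entries with the mean of the non-NaN entries."""
    observed = [v for v in values if not math.isnan(v)]
    mean = fmean(observed)
    return [mean if math.isnan(v) else v for v in values]

# Made-up column with two gaps
column = [21.0, math.nan, 23.0, math.nan, 25.0]
print(impute_mean(column))  # [21.0, 23.0, 23.0, 23.0, 25.0]
```

Mean imputation leaves the column mean unchanged but reduces its variance, which matters most for the columns with many gaps.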

In [15]:
df_data.isna().sum()

station             0
Date                0
Present_Tmax        0
Present_Tmin        0
LDAPS_RHmin         0
LDAPS_RHmax         0
LDAPS_Tmax_lapse    0
LDAPS_Tmin_lapse    0
LDAPS_WS            0
LDAPS_LH            0
LDAPS_CC1           0
LDAPS_CC2           0
LDAPS_CC3           0
LDAPS_CC4           0
LDAPS_PPT1          0
LDAPS_PPT2          0
LDAPS_PPT3          0
LDAPS_PPT4          0
lat                 0
lon                 0
DEM                 0
Slope               0
Solar_radiation     0
Next_Tmax           0
Next_Tmin           0
dtype: int64

## Cleaned Data

In [16]:
df_data.describe()

Unnamed: 0,station,Present_Tmax,Present_Tmin,LDAPS_RHmin,LDAPS_RHmax,LDAPS_Tmax_lapse,LDAPS_Tmin_lapse,LDAPS_WS,LDAPS_LH,LDAPS_CC1,...,LDAPS_PPT2,LDAPS_PPT3,LDAPS_PPT4,lat,lon,DEM,Slope,Solar_radiation,Next_Tmax,Next_Tmin
count,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,...,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0,1200.0
mean,13.0,29.373041,24.019624,64.037715,91.909603,29.683377,24.60682,8.209924,55.510203,0.486746,...,0.99904,0.149295,0.30266,37.54472,126.9914,61.854944,1.256692,5532.67856,29.832538,23.873906
std,7.214109,2.577703,1.59202,12.466989,4.699431,2.309906,1.616469,2.083082,29.07299,0.260664,...,2.956984,0.621049,1.340105,0.050353,0.079434,54.276072,1.370316,272.804884,2.712461,1.710373
min,1.0,22.4,17.9,36.725609,72.999237,23.356801,19.547129,3.881777,5.404379,0.010504,...,0.0,0.0,0.0,37.4562,126.826,12.37,0.0985,4915.099121,21.8,17.8
25%,7.0,27.4,23.1,54.645017,88.98303,27.96712,23.454546,6.686705,34.249035,0.247092,...,0.0,0.0,0.0,37.5102,126.937,28.7,0.2713,5314.874389,27.6,22.8
50%,13.0,29.5,24.019624,64.037715,92.859123,29.802532,24.581414,8.209924,49.505085,0.486746,...,0.0,0.0,0.0,37.5507,126.995,45.716,0.618,5589.764892,30.0,23.8
75%,19.0,31.3,24.9,72.685385,95.320145,31.36509,25.613003,9.220296,73.372072,0.711969,...,0.326271,0.101849,0.0,37.5776,127.042,59.8324,1.7678,5776.479492,31.9,24.9
max,25.0,35.5,28.3,95.818939,99.985825,35.572714,28.190031,18.04369,153.779546,0.951899,...,21.621661,6.652697,12.216177,37.645,127.135,212.335,5.1782,5992.895996,36.6,28.3


In [17]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1200 entries, 0 to 1199
Data columns (total 25 columns):
station             1200 non-null int32
Date                1200 non-null datetime64[ns]
Present_Tmax        1200 non-null float64
Present_Tmin        1200 non-null float64
LDAPS_RHmin         1200 non-null float64
LDAPS_RHmax         1200 non-null float64
LDAPS_Tmax_lapse    1200 non-null float64
LDAPS_Tmin_lapse    1200 non-null float64
LDAPS_WS            1200 non-null float64
LDAPS_LH            1200 non-null float64
LDAPS_CC1           1200 non-null float64
LDAPS_CC2           1200 non-null float64
LDAPS_CC3           1200 non-null float64
LDAPS_CC4           1200 non-null float64
LDAPS_PPT1          1200 non-null float64
LDAPS_PPT2          1200 non-null float64
LDAPS_PPT3          1200 non-null float64
LDAPS_PPT4          1200 non-null float64
lat                 1200 non-null float64
lon                 1200 non-null float64
DEM                 1200 non-null float64
Slope 

### Save as '.csv' file

In [18]:
df_data.to_csv('weather_cleaned_data.csv', index=False)