# Introduction
This project analyses wind measurements recorded in Dublin over the last 30 years and their possible relationship to climate change, using hourly data from the synoptic station at Dublin Airport, Co. Dublin.
Source: https://data.gov.ie/dataset/dublin-airport-hourly-data

### 1.1 Preparing the environment
First, we prepare the working environment for the analysis. This introductory section is divided as follows:
- Import the libraries
- Load the dataset
- Check the DataFrame info

Before we can explore and understand the dataset, we need to import the libraries the analysis relies on, such as pandas and datetime.

In [15]:

# pandas for DataFrames
import pandas as pd

# datetime for dates and times
import datetime as dt

# numpy for numerical arrays
import numpy as np

# os for file-system paths
import os

Next, we load the dataset and check that it was read correctly.

In [16]:
# Load the dataset
# Skip the first 23 lines which contain metadata, then read the CSV
df = pd.read_csv("wind.csv", skiprows=23)
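
The load step can be made a little more defensive. The sketch below is only an illustration: it reads a small in-memory sample instead of `wind.csv`, and the two metadata lines, the sample values, and the use of `na_values=[" "]` and `low_memory=False` are assumptions made for the example, not details confirmed by the source file.

```python
import io
import pandas as pd

# Hypothetical miniature of the file: 2 metadata lines, then header and data.
# A blank field (" ") stands in for a missing reading (assumption).
sample = io.StringIO(
    "Station Name: EXAMPLE\n"
    "Station Height: 0 M\n"
    "date,rain,temp,wdsp,wddir\n"
    "01-jan-1945 00:00,0.0,4.9,0,0\n"
    "01-jan-1945 01:00,0.0,5.1, , \n"
    "01-jan-1945 02:00,0.2,5.1,7,250\n"
)

sample_df = pd.read_csv(
    sample,
    skiprows=2,        # skip the metadata lines above the header
    na_values=[" "],   # treat blank fields as missing values
    low_memory=False,  # infer each column's dtype from the whole file
)

print(sample_df.dtypes)
```

If blank fields are indeed the cause of the mixed types, reading them as missing values lets pandas keep the affected columns numeric instead of falling back to `object`.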


In [17]:
# Display the first rows
df.head()

Unnamed: 0,date,ind,rain,ind.1,temp,ind.2,wetb,dewpt,vappr,rhum,...,ind.3,wdsp,ind.4,wddir,ww,w,sun,vis,clht,clamt
0,01-jan-1945 00:00,2,0.0,0,4.9,0,4.6,4.4,8.2,95,...,1,0,1,0,50,4,0.0,200,2,8
1,01-jan-1945 01:00,3,0.0,0,5.1,0,4.9,4.4,8.5,97,...,1,0,1,0,45,4,0.0,200,2,8
2,01-jan-1945 02:00,2,0.0,0,5.1,0,4.8,4.4,8.5,97,...,1,0,1,0,50,4,0.0,4800,4,8
3,01-jan-1945 03:00,0,0.2,0,5.2,0,5.0,4.4,8.5,97,...,1,0,1,0,50,4,0.0,6000,4,8
4,01-jan-1945 04:00,2,0.0,0,5.6,0,5.4,5.0,8.8,97,...,1,7,1,250,50,5,0.0,6000,4,8


To learn more about the dataset and decide which columns are useful for the analysis, we can use:
- df.shape: the number of rows and columns
- df.info(): a summary of the DataFrame

In [18]:
# number of rows and columns
df.shape


(709297, 21)

In [19]:
# summary of the DataFrame (info is a method, so it needs parentheses)
df.info()

We can use the 'describe' method to summarise the central tendency, dispersion, and shape of each numeric column's distribution, excluding NaN values.

In [20]:
# describe the data frame
df.describe()

Unnamed: 0,ind,rain,ind.1,temp,ind.2,wetb,dewpt,msl,ind.3,wdsp,ind.4,ww,w,sun
count,709297.0,709297.0,709297.0,709297.0,709297.0,709297.0,709297.0,709297.0,709297.0,709297.0,709297.0,709297.0,709297.0,709297.0
mean,0.632851,0.086648,0.014224,9.664489,0.029308,8.224054,6.603725,1013.584091,1.353533,10.114264,1.353712,15.596149,17.181295,0.167409
std,1.101539,0.418524,0.118687,4.893245,0.257702,4.409004,4.592909,12.347392,0.808965,5.68015,0.809007,22.553069,24.214145,0.326679
min,0.0,0.0,0.0,-11.5,0.0,-11.5,-17.7,944.1,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,6.1,0.0,5.1,3.3,1006.2,1.0,6.0,1.0,2.0,2.0,0.0
50%,0.0,0.0,0.0,9.7,0.0,8.4,6.9,1014.8,2.0,9.0,2.0,2.0,11.0,0.0
75%,2.0,0.0,0.0,13.2,0.0,11.6,10.0,1022.2,2.0,14.0,2.0,21.0,11.0,0.1
max,6.0,26.5,2.0,29.1,6.0,22.2,20.5,1048.7,6.0,46.0,7.0,97.0,99.0,1.0


We can rename the columns we need so that they have understandable names.

In [21]:
# rename the columns

df = df.rename(columns={
    'date': 'Date and Time', 
    'temp': 'Air Temperature', 
    'wdsp': 'Mean Wind Speed', 
    'wddir': 'Wind Direction', 
    'ww': 'Present weather', 
    'w': 'Past Weather'})
df.head()

Unnamed: 0,Date and Time,ind,rain,ind.1,Air Temperature,ind.2,wetb,dewpt,vappr,rhum,...,ind.3,Mean Wind Speed,ind.4,Wind Direction,Present weather,Past Weather,sun,vis,clht,clamt
0,01-jan-1945 00:00,2,0.0,0,4.9,0,4.6,4.4,8.2,95,...,1,0,1,0,50,4,0.0,200,2,8
1,01-jan-1945 01:00,3,0.0,0,5.1,0,4.9,4.4,8.5,97,...,1,0,1,0,45,4,0.0,200,2,8
2,01-jan-1945 02:00,2,0.0,0,5.1,0,4.8,4.4,8.5,97,...,1,0,1,0,50,4,0.0,4800,4,8
3,01-jan-1945 03:00,0,0.2,0,5.2,0,5.0,4.4,8.5,97,...,1,0,1,0,50,4,0.0,6000,4,8
4,01-jan-1945 04:00,2,0.0,0,5.6,0,5.4,5.0,8.8,97,...,1,7,1,250,50,5,0.0,6000,4,8


Check df.dtypes to see whether the date column needs to be converted to a datetime type for the analysis.

In [22]:
df.dtypes

Date and Time       object
ind                  int64
rain               float64
ind.1                int64
Air Temperature    float64
ind.2                int64
wetb               float64
dewpt              float64
vappr               object
rhum                object
msl                float64
ind.3                int64
Mean Wind Speed      int64
ind.4                int64
Wind Direction      object
Present weather      int64
Past Weather         int64
sun                float64
vis                 object
clht                object
clamt               object
dtype: object
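
Several measurement columns above (for example 'Wind Direction' and 'vis') are reported as `object` rather than a numeric type. One possible way to handle this is sketched below on a small made-up sample; the blank string used as a missing-value marker is an assumption about the raw file, not something the output above confirms.

```python
import pandas as pd

# Hypothetical sample: numeric readings stored as text, with a blank
# string standing in for a missing value (assumption about the raw file).
sample = pd.DataFrame({
    "Wind Direction": ["0", "250", " "],
    "vis": ["200", "4800", "6000"],
})

# errors="coerce" turns anything that cannot be parsed into NaN
for col in ["Wind Direction", "vis"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

print(sample.dtypes)
```

Coercing silently hides unparseable values, so it is worth counting the resulting NaN entries per column before relying on them.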

The 'Date and Time' column is stored as object (text) rather than as a datetime, so we convert it explicitly before using it as the index.

In [23]:
df['Date and Time'] = df['Date and Time'].astype('datetime64[ns]')

# Set the 'Date and Time' column as the index
df.set_index('Date and Time', inplace=True)
df = df.sort_index()
df.head()


Date and Time,ind,rain,ind.1,Air Temperature,ind.2,wetb,dewpt,vappr,rhum,msl,ind.3,Mean Wind Speed,ind.4,Wind Direction,Present weather,Past Weather,sun,vis,clht,clamt
1945-01-01 00:00:00,2,0.0,0,4.9,0,4.6,4.4,8.2,95,1035.8,1,0,1,0,50,4,0.0,200,2,8
1945-01-01 01:00:00,3,0.0,0,5.1,0,4.9,4.4,8.5,97,1035.8,1,0,1,0,45,4,0.0,200,2,8
1945-01-01 02:00:00,2,0.0,0,5.1,0,4.8,4.4,8.5,97,1035.8,1,0,1,0,50,4,0.0,4800,4,8
1945-01-01 03:00:00,0,0.2,0,5.2,0,5.0,4.4,8.5,97,1036.1,1,0,1,0,50,4,0.0,6000,4,8
1945-01-01 04:00:00,2,0.0,0,5.6,0,5.4,5.0,8.8,97,1036.2,1,7,1,250,50,5,0.0,6000,4,8
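
An alternative to `astype` for this conversion is `pd.to_datetime` with an explicit format, which avoids having pandas guess the layout of each string. The sketch below uses a few timestamps copied from the output above; the format string is an assumption based on how those values look, and `%b` depends on English month abbreviations being available in the current locale.

```python
import pandas as pd

# A few timestamps in the same layout as the raw 'date' column
stamps = pd.Series([
    "01-jan-1945 00:00",
    "01-jan-1945 01:00",
    "30-nov-2025 23:00",
])

# Day, abbreviated month name, four-digit year, then hour and minute
parsed = pd.to_datetime(stamps, format="%d-%b-%Y %H:%M")

print(parsed)
```

With an explicit format, a string that does not match raises an error instead of being parsed in an unexpected way.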


In [24]:
# verify the columns
df.columns

Index(['ind', 'rain', 'ind.1', 'Air Temperature', 'ind.2', 'wetb', 'dewpt',
       'vappr', 'rhum', 'msl', 'ind.3', 'Mean Wind Speed', 'ind.4',
       'Wind Direction', 'Present weather', 'Past Weather', 'sun', 'vis',
       'clht', 'clamt'],
      dtype='object')

In [25]:
# verify the index
df.index


DatetimeIndex(['1945-01-01 00:00:00', '1945-01-01 01:00:00',
               '1945-01-01 02:00:00', '1945-01-01 03:00:00',
               '1945-01-01 04:00:00', '1945-01-01 05:00:00',
               '1945-01-01 06:00:00', '1945-01-01 07:00:00',
               '1945-01-01 08:00:00', '1945-01-01 09:00:00',
               ...
               '2025-11-30 15:00:00', '2025-11-30 16:00:00',
               '2025-11-30 17:00:00', '2025-11-30 18:00:00',
               '2025-11-30 19:00:00', '2025-11-30 20:00:00',
               '2025-11-30 21:00:00', '2025-11-30 22:00:00',
               '2025-11-30 23:00:00', '2025-12-01 00:00:00'],
              dtype='datetime64[ns]', name='Date and Time', length=709297, freq=None)

As the index shows, the data is hourly. Given the volume of data, it is easier to work with if we add year, month, and hour columns that we can group by.

In [None]:
# 'Date and Time' is now the index, so read the date parts from df.index
df['year'] = df.index.year
df['month'] = df.index.month
df['hour'] = df.index.hour
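
Once the timestamps have been moved into the index, `df['Date and Time']` no longer exists as a column, which is why looking it up raises a KeyError. The sketch below shows the index-based approach and a simple grouping on a small made-up sample; the wind speed values are hypothetical.

```python
import pandas as pd

# Hypothetical hourly sample indexed by timestamp (values are made up)
idx = pd.to_datetime([
    "2024-12-31 22:00", "2024-12-31 23:00",
    "2025-01-01 00:00", "2025-01-01 01:00",
])
sample = pd.DataFrame({"Mean Wind Speed": [10, 14, 6, 8]}, index=idx)

# The timestamps live in the index, so use the index attributes directly
# instead of the .dt accessor on a column.
sample["year"] = sample.index.year
sample["month"] = sample.index.month
sample["hour"] = sample.index.hour

# Mean wind speed per calendar year
yearly_mean = sample.groupby("year")["Mean Wind Speed"].mean()
print(yearly_mean)
```

The same pattern works for `month` or `hour`, or for several keys at once by passing a list such as `["year", "month"]` to `groupby`.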

The data starts in 1945, but the analysis only needs the last 30 years, from 1995 to 2025, so we can drop the earlier years.

df_30y = df[df["year"] >= df["year"].max() - 30]

df_30y["year"].min(), df_30y["year"].max()


In [None]:
# Filter the data for the years 1995 to 2025
df = df[df['year'].isin(range(1995, 2026))]

# Display the first few rows of the filtered data
print(df.head())
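
Because the DataFrame is indexed by timestamp and sorted, the same period can also be selected by slicing the index with year strings, without a helper 'year' column. This is a minimal sketch on made-up data, offered as an alternative rather than as the method used in this notebook.

```python
import pandas as pd

# Hypothetical sample spanning the cut-off (values are made up)
idx = pd.to_datetime([
    "1994-12-31 23:00",
    "1995-01-01 00:00",
    "2010-06-15 12:00",
    "2025-12-01 00:00",
])
sample = pd.DataFrame({"Mean Wind Speed": [5, 7, 9, 11]}, index=idx).sort_index()

# Partial-string slicing on a sorted DatetimeIndex includes both end years
recent = sample.loc["1995":"2025"]

print(recent.index.min(), recent.index.max())
```

Note that 1995 to 2025 inclusive covers 31 calendar years, the last of which is only partly present in this dataset.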
