# CA1 – Data Cleaning and Preparation Using Python

## 1 – Describe and Rename Columns (10 marks)

• Explore the dataset and describe what kind of data each column contains.

• Some column names are abbreviations (e.g., wdspd). Rename columns to meaningful names that describe the data clearly.

• You may research the dataset on Kaggle or other online sources to understand what each column represents.

• The goal is to make the dataset self-explanatory and easy to interpret.

• Explore the dataset and describe what kind of data each column contains.

In [1]:
import pandas as pd
# Loads a sample of hourly Irish weather data.
weather_data = pd.read_csv('hrly_Irish_weather.csv')

# Show the shape (rows, columns) of the loaded sample
print(f"Shape: {weather_data.shape}")

# List the column names to confirm the headers and spot any naming issues
print("Columns:", weather_data.columns.tolist())

# Preview the first 5 rows to inspect values and spot missing or non-sense entries
print("First 5 rows:")
print(weather_data.head())


Shape: (4660423, 18)
Columns: ['county', 'station', 'latitude', 'longitude', 'date', 'rain', 'temp', 'wetb', 'dewpt', 'vappr', 'rhum', 'msl', 'wdsp', 'wddir', 'sun', 'vis', 'clht', 'clamt']
First 5 rows:
   county  station  latitude  longitude               date rain  temp  wetb  \
0  Galway  ATHENRY    53.289     -8.786  26-jun-2011 01:00  0.0  15.3  14.5   
1  Galway  ATHENRY    53.289     -8.786  26-jun-2011 02:00  0.0  14.7  13.7   
2  Galway  ATHENRY    53.289     -8.786  26-jun-2011 03:00  0.0  14.3  13.4   
3  Galway  ATHENRY    53.289     -8.786  26-jun-2011 04:00  0.0  14.4  13.6   
4  Galway  ATHENRY    53.289     -8.786  26-jun-2011 05:00  0.0  14.4  13.5   

  dewpt vappr rhum     msl wdsp wddir  sun  vis clht clamt  
0  13.9  15.8   90  1016.0    8   190  NaN  NaN  NaN   NaN  
1  12.9  14.9   89  1015.8    7   190  NaN  NaN  NaN   NaN  
2  12.6  14.6   89  1015.5    6   190  NaN  NaN  NaN   NaN  
3  12.8  14.8   90  1015.3    7   180  NaN  NaN  NaN   NaN  
4  12.7  14.7

• Some column names are abbreviations (e.g., wdspd). Rename columns to meaningful names that describe the data clearly.


In [2]:
# Create a comprehensive mapping based on your actual columns
# reference:
# https://www.met.ie/cms/assets/uploads/2018/05/KeyHourly.txt
# https://stackoverflow.com/questions/11346283/renaming-column-names-in-pandas

column_mapping = {
    'wetb': 'Wet Bulb Air Temperature C',
    'dewpt': 'Dew Point Air Temperature C',
    'vappr': 'Vapour Pressure',
    'rhum': 'Relative Humidity',
    'msl': 'Mean Sea Level Pressure',
    'wdsp': 'Mean Hourly Wind Speed',
    'wddir': 'Predominant Hourly wind Direction',
    'vis': 'Visibility',
    'clht': 'Cloud Ceiling Height',
    'clamt': 'Cloud Amount'
}

# Rename the columns
weather_data = weather_data.rename(columns=column_mapping)

# Verify the changes
print("New column names:")
print(weather_data.columns.tolist())


New column names:
['county', 'station', 'latitude', 'longitude', 'date', 'rain', 'temp', 'Wet Bulb Air Temperature C', 'Dew Point Air Temperature C', 'Vapour Pressure', 'Relative Humidity', 'Mean Sea Level Pressure', 'Mean Hourly Wind Speed', 'Predominant Hourly wind Direction', 'sun', 'Visibility', 'Cloud Ceiling Height', 'Cloud Amount']
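
The mapping above leaves `rain`, `temp`, and `sun` unrenamed and does not record units. A minimal sketch of a fuller mapping follows. The units are my reading of the Met Éireann hourly key referenced above and should be checked against that file; the snake_case names and the tiny `sample` frame are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Assumed descriptions and units; verify against the Met Éireann hourly key.
full_mapping = {
    "rain": "rainfall_mm",
    "temp": "air_temperature_c",
    "wetb": "wet_bulb_temperature_c",
    "dewpt": "dew_point_temperature_c",
    "vappr": "vapour_pressure_hpa",
    "rhum": "relative_humidity_pct",
    "msl": "mean_sea_level_pressure_hpa",
    "wdsp": "mean_wind_speed_knots",
    "wddir": "wind_direction_deg",
    "sun": "sunshine_duration_hours",
    "vis": "visibility_m",
    "clht": "cloud_ceiling_height_100ft",
    "clamt": "cloud_amount_okta",
}

# Tiny stand-in for the real data, using a few of the original headers.
sample = pd.DataFrame({"county": ["Galway"], "rain": [0.0], "temp": [15.3], "wdsp": [8]})

renamed = sample.rename(columns=full_mapping)

# rename() silently ignores keys that match no column, so check explicitly.
unmatched = sorted(set(full_mapping) - set(sample.columns))
print(renamed.columns.tolist())
print("Mapping keys not found in this frame:", unmatched)
```

Checking for unmatched keys is useful here because a typo in an abbreviation (for example `wdspd` instead of `wdsp`) would otherwise pass unnoticed.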


## 2 – Identify Missing and Non-sense Values (10 marks)
• Investigate the dataset to find all missing, null, and non-sense values.

• Non-sense values include entries like "?", "error", "missing", "NaN", or other inconsistent symbols.

• Report which columns contain such values and how many appear in each column.

• Summarise your findings clearly using printed outputs and a markdown explanation.

In [3]:
missing_data = weather_data.isnull().sum()
print("Missing values per column:")
print(missing_data)

Missing values per column:
county                                     0
station                                    0
latitude                                   0
longitude                                  0
date                                       0
rain                                       0
temp                                       0
Wet Bulb Air Temperature C                 0
Dew Point Air Temperature C                0
Vapour Pressure                            0
Relative Humidity                          0
Mean Sea Level Pressure                    0
Mean Hourly Wind Speed                229032
Predominant Hourly wind Direction     229032
sun                                  2585167
Visibility                           2585167
Cloud Ceiling Height                 2585167
Cloud Amount                         2585167
dtype: int64


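Six columns contain null values, as reported by isnull(). These counts cover true nulls (NaN) only; placeholder strings such as " " or "?" would not be counted here.

• Mean Hourly Wind Speed: 229032 (about 4.9% of 4660423 rows)

• Predominant Hourly wind Direction: 229032 (about 4.9%)

• sun: 2585167 (about 55.5%)

• Visibility: 2585167 (about 55.5%)

• Cloud Ceiling Height: 2585167 (about 55.5%)

• Cloud Amount: 2585167 (about 55.5%)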
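
The task also asks for non-sense values, which `isnull()` does not report. The sketch below counts placeholder-like strings in each non-numeric column. The token list and the `count_nonsense` helper are hypothetical and would need adjusting after inspecting the real data; the `sample` frame is synthetic.

```python
import pandas as pd

# Hypothetical list of placeholder tokens; extend it after inspecting the data.
NONSENSE_TOKENS = {"", "?", "error", "missing", "nan", "n/a", "na"}

def count_nonsense(df: pd.DataFrame) -> pd.Series:
    """Count placeholder-like strings in each non-numeric column."""
    counts = {}
    for col in df.select_dtypes(exclude="number"):
        # Normalise whitespace and case; real nulls become <NA> and never match.
        text = df[col].astype("string").str.strip().str.lower()
        counts[col] = int(text.isin(NONSENSE_TOKENS).sum())
    return pd.Series(counts, dtype="int64")

sample = pd.DataFrame({
    "rain": ["0.0", " ", "?", "1.2"],
    "temp": ["15.3", "error", "14.7", None],
    "latitude": [53.289, 53.289, 53.289, 53.289],
})
nonsense_counts = count_nonsense(sample)
print(nonsense_counts)
```

Reporting these counts alongside the `isnull()` totals gives one figure per column for true nulls and one for placeholder strings.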

## 3 – Develop and Apply a Cleaning Strategy (15 marks)

• Develop a clear and well-structured strategy to clean missing and non-sense data.

• A missing value can be a blank cell, while non-sense values may include symbols or strings that do not represent real data.

• Apply your cleaning strategy systematically using Pandas, and clearly justify the methods you choose (for example, why you replaced, removed, or imputed specific
values).

• Explain your cleaning steps using markdown cells.

In [4]:
# Remove any row with missing values
weather_data_clean = weather_data.dropna()

print(f"Removed {len(weather_data) - len(weather_data_clean)} rows with missing values")

Removed 2585167 rows with missing values
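
Dropping every row with any null removes 2585167 of 4660423 rows, more than half of the data, and the count matches the four sparse columns (sun, Visibility, Cloud Ceiling Height, Cloud Amount). One alternative, sketched here under assumptions, is to coerce the measurement columns to numbers and drop rows only when a column that the later analysis needs is missing. The `clean_measurements` helper, the choice of required columns, and the synthetic `sample` frame are all illustrative.

```python
import pandas as pd

def clean_measurements(df, measure_cols, required_cols):
    """Coerce measurements to numbers, then drop rows missing a required value."""
    out = df.copy()
    for col in measure_cols:
        # errors="coerce" turns anything unparseable (" ", "?", "error") into NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Only rows lacking a required measurement are removed; sparse columns stay.
    return out.dropna(subset=required_cols)

sample = pd.DataFrame({
    "rain": ["0.0", " ", "0.4", "1.2"],
    "temp": ["15.3", "14.7", "?", "13.9"],
    "sun": [None, None, None, "0.5"],
})
cleaned = clean_measurements(sample, ["rain", "temp", "sun"], ["rain", "temp"])
print(f"Kept {len(cleaned)} of {len(sample)} rows")
print(cleaned.dtypes)
```

Whether to drop, impute, or keep nulls remains a judgement call for each column; the point of the sketch is only that the decision can be made per column rather than for the whole row.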


## 4 – Detect Outliers (10 marks)

• Examine the dataset for possible outliers using an appropriate statistical method.

• You may use descriptive statistical tests (e.g., IQR or z-score).

• Report which columns contain outliers and describe how you identified them. Explain
it using markdown cells.

In [5]:
import numpy as np

print(" Outlier detection")
# Select only numeric columns from the dataset
numeric_cols = weather_data.select_dtypes(include=[np.number]).columns

print(numeric_cols)

# Iterate through each numeric column to detect outliers
for col in numeric_cols:
    # Remove NaN values from the column before analysis
    data = weather_data[col].dropna()
    if len(data) == 0:  # Skip if column is empty after removing NaN
        continue
        
    # Calculate quartiles and IQR
    Q1 = data.quantile(0.25)  # First quartile
    Q3 = data.quantile(0.75)  # Third quartile
    IQR = Q3 - Q1            # Interquartile range
    
    # Detect outliers using 1.5 * IQR rule
    # Values < (Q1 - 1.5*IQR) or > (Q3 + 1.5*IQR) are considered outliers
    outliers = data[(data < Q1 - 1.5 * IQR) | (data > Q3 + 1.5 * IQR)].count()
    print(f"{col}: {outliers} outliers")


 Outlier detection
Index(['latitude', 'longitude'], dtype='object')
latitude: 0 outliers
longitude: 0 outliers
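
The output above lists only latitude and longitude as numeric, which suggests the measurement columns (rain, temp, and so on) were read as text and were therefore never checked. The sketch below applies the same 1.5 * IQR rule after coercing each column to numbers. The `iqr_outlier_counts` helper and the synthetic `sample` values are assumptions for illustration.

```python
import pandas as pd

def iqr_outlier_counts(df, cols):
    """Return the number of values outside the 1.5 * IQR fences for each column."""
    counts = {}
    for col in cols:
        # Coerce text to numbers; unparseable entries become NaN and are dropped.
        values = pd.to_numeric(df[col], errors="coerce").dropna()
        if values.empty:
            continue
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        outside = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
        counts[col] = int(outside.sum())
    return counts

sample = pd.DataFrame({
    "temp": ["14.0", "15.0", "14.5", "15.5", "14.8", "60.0", " "],
    "rain": ["0.0", "0.0", "0.1", "0.0", "0.2", "0.1", "0.0"],
})
outlier_counts = iqr_outlier_counts(sample, ["temp", "rain"])
print(outlier_counts)
```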


### References

https://www.geeksforgeeks.org/python/how-to-use-pandas-filter-with-iqr/

https://www.geeksforgeeks.org/data-science/detect-and-remove-the-outliers-using-python/

https://www.geeksforgeeks.org/dsa/interquartile-range-iqr/

## 5 – Handle Outliers (10 marks)

• Choose and apply suitable methods to handle detected outliers.

• Document your reasoning and the steps you take to address them (e.g., removing,
capping, or transforming values).

• Demonstrate the effect of your outlier-handling process on the dataset. For example,
you can compare the mean value of a column that has outliers before and after handling
it.

In [6]:
# Cap outliers at the 1.5 * IQR fences instead of removing rows
df_clean = weather_data.copy()

for col in df_clean.select_dtypes(include='number'):
    # Quartiles and fences for this column
    Q1, Q3 = df_clean[col].quantile([0.25, 0.75])
    lower = Q1 - 1.5 * (Q3 - Q1)
    upper = Q3 + 1.5 * (Q3 - Q1)
    # Values beyond a fence are set to the fence value
    df_clean[col] = df_clean[col].clip(lower, upper)

print("Outliers capped!")

Outliers capped!
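
The task asks for the effect of the handling to be demonstrated. A minimal sketch of a before-and-after comparison on a single column follows; the `cap_with_iqr` helper and the synthetic temperatures are assumptions, and the real comparison would use the coerced measurement columns.

```python
import pandas as pd

def cap_with_iqr(series):
    """Clip a numeric series to the 1.5 * IQR fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# Synthetic values with one obvious outlier (60.0).
temps = pd.Series([14.0, 15.0, 14.5, 15.5, 14.8, 60.0], name="temp")
capped = cap_with_iqr(temps)

# Side-by-side summary so the shift in the mean and maximum is visible.
summary = pd.DataFrame({
    "before": temps.agg(["mean", "min", "max"]),
    "after": capped.agg(["mean", "min", "max"]),
})
print(summary.round(2))
```

Capping keeps every row, so the row count is unchanged while the mean and maximum move towards the bulk of the data.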


## 6 – Check and Sort by Date (10 marks)

• Examine whether the dataset is properly sorted by its date or time column.

• If it is not sorted, reorder it chronologically.

• Confirm that sorting was successful.

In [7]:
# First, ensure the date column is in proper datetime format
date_col = 'date'

# Convert to datetime only if not already done (using pandas as shown in notes)
if not pd.api.types.is_datetime64_any_dtype(df_clean[date_col]):
    print("Converting date column to datetime format.")
    df_clean[date_col] = pd.to_datetime(df_clean[date_col])
    print("Date conversion completed.")

print("Checking date sorting.")

# Check if the dates are monotonically increasing (properly sorted)
if not df_clean[date_col].is_monotonic_increasing:
    print("Dataset is not sorted by date. Sorting chronologically.")
    
    # Sort by date column
    df_clean = df_clean.sort_values(date_col)
    
    # Reset index after sorting to maintain proper order
    df_clean = df_clean.reset_index(drop=True)
    
    print("Sorting completed successfully!")
    
    # Display some basic info about the sorted dates
    print(f"First date after sorting: {df_clean[date_col].iloc[0]}")
    print(f"Last date after sorting: {df_clean[date_col].iloc[-1]}")
else:
    print("Dataset is already properly sorted by date.")

# Display the date range (using datetime formatting from notes)
print(f"Date range: {df_clean[date_col].min().strftime('%Y-%m-%d')} to {df_clean[date_col].max().strftime('%Y-%m-%d')}")

Converting date column to datetime format.
Date conversion completed.
Checking date sorting.
Dataset is not sorted by date. Sorting chronologically.
Sorting completed successfully!
First date after sorting: 1990-01-01 00:00:00
Last date after sorting: 2020-06-01 00:00:00
Date range: 1990-01-01 to 2020-06-01
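
The preview in Task 1 shows dates such as `26-jun-2011 01:00`. Passing an explicit format to `pd.to_datetime` avoids per-value format guessing. The format string below is an assumption based on that preview, and the three dates are synthetic.

```python
import pandas as pd

raw = pd.Series(["26-jun-2011 02:00", "26-jun-2011 01:00", "01-jan-1990 00:00"])

# Assumed layout: day, abbreviated month name, four-digit year, 24-hour time.
parsed = pd.to_datetime(raw, format="%d-%b-%Y %H:%M")

# Check the order first, then sort and confirm the result.
print("Sorted before:", parsed.is_monotonic_increasing)
sorted_dates = parsed.sort_values().reset_index(drop=True)
print("Sorted after:", sorted_dates.is_monotonic_increasing)
```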


### References

lecture7_dates.pdf

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values

https://pandas.pydata.org/docs/reference/api/pandas.Series.is_monotonic_increasing.html


## 7 – Date-based Slicing (10 marks)

• Use Pandas slicing to extract and analyse data based on specific date ranges. Select
data for the month of your birth in a year of your choice and select columns “rain” and
“temp”

• Calculate the average rainfall and average temperature for that month.

• Present your results clearly with code and markdown explanation.

First, I converted the date column to datetime and the rain and temp columns to numeric values. The data runs from 1990-01-01 00:00:00 to 2020-06-01 00:00:00, so I checked that my birth month in the chosen year falls within that range and then calculated the average rainfall (mm) and average temperature (°C) for that month.
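
The steps described above can be sketched as follows. The month `2015-07` is a hypothetical placeholder for the birth month and year, and the four rows are synthetic stand-ins for the hourly data.

```python
import pandas as pd

# Small synthetic stand-in for the hourly data; values are illustrative only.
sample = pd.DataFrame({
    "date": pd.to_datetime(["2015-06-30 23:00", "2015-07-01 00:00",
                            "2015-07-15 12:00", "2015-08-01 00:00"]),
    "rain": ["0.2", "0.0", "1.0", "0.4"],
    "temp": ["12.0", "14.0", "18.0", "16.0"],
})

# Measurements may arrive as text, so coerce them before averaging.
for col in ["rain", "temp"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

# A sorted DatetimeIndex allows partial-string slicing by year and month.
indexed = sample.set_index("date").sort_index()
month = indexed.loc["2015-07", ["rain", "temp"]]  # hypothetical month and year

averages = month.mean()
print(f"Average rainfall: {averages['rain']:.2f} mm")
print(f"Average temperature: {averages['temp']:.2f} °C")
```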

## 8 – Location-based Slicing (15 marks)

• Select your favourite Irish county or weather station and perform an analysis of
weather patterns there.

• Identify which months or seasons are best for visiting based on temperature and
rainfall data. For this task you need to compare your calculated statistics with the
ideal conditions and select the days of the months in each year that meet those conditions. For
example, good days to visit Dublin are those when the temperature is above 20 °C and
the chance of rainfall is low. Create similar conditional statements using suitable Pandas
functions and display the range of suitable days in each year.

• Summarise your insights with calculations and a short explanation.
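
A minimal sketch of one way to approach this task follows. It filters one station, collapses the hourly readings to daily values, and applies a condition for a good day. The station, the thresholds (daily maximum above 20 °C and under 1 mm of rain), and all values are assumptions chosen for illustration.

```python
import pandas as pd

# Synthetic hourly rows; station names and values are illustrative only.
sample = pd.DataFrame({
    "station": ["ATHENRY"] * 4 + ["OTHER"],
    "date": pd.to_datetime(["2011-06-26 12:00", "2011-06-26 15:00",
                            "2011-06-27 12:00", "2012-07-01 14:00",
                            "2011-06-26 12:00"]),
    "rain": [0.0, 0.0, 2.5, 0.1, 0.0],
    "temp": [21.0, 22.5, 19.0, 23.0, 25.0],
})

station = sample[sample["station"] == "ATHENRY"].set_index("date").sort_index()

# Collapse hourly readings to one row per day; drop days with no readings.
daily = station.resample("D").agg({"temp": "max", "rain": "sum"}).dropna(subset=["temp"])

# Assumed definition of a good day: daily maximum above 20 °C and under 1 mm of rain.
good_days = daily[(daily["temp"] > 20) & (daily["rain"] < 1.0)]

# First and last good day in each year, plus how many there were.
dates = good_days.index.to_series()
per_year = dates.groupby(dates.dt.year).agg(["min", "max", "count"])
print(per_year)
```

The same daily frame can be grouped by calendar month instead of year to compare months or seasons.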


## 9 – Remove Empty or Irrelevant Columns (5 marks)

• Identify any columns that are empty or do not have any useful information.

• Remove such columns and display the shape of the final cleaned dataset

In [8]:

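# Reload the raw CSV, treating common placeholder strings as missing.
# Note: this starts again from the file, so the renaming, capping, and sorting held in df_clean are not carried over.
weather_data = pd.read_csv('hrly_Irish_weather.csv', na_values=["N/A", "NaN", "NA", " "])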

print("Original dataset shape:", weather_data.shape)

# Remove empty columns (all values are NaN)
weather_data_cleaned = weather_data.dropna(axis='columns', how='all')

print("Final cleaned dataset shape:", weather_data_cleaned.shape)

Original dataset shape: (4660423, 18)
Final cleaned dataset shape: (4660423, 18)
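
Using `how='all'` removes a column only when every value is missing, which is why the shape is unchanged even though four columns were about 55% null in the earlier count. The sketch below flags columns that are mostly missing or hold a single repeated value. The `drop_uninformative` helper, the 50% threshold, and the synthetic `sample` frame are assumptions; whether such columns count as irrelevant is a judgement call.

```python
import pandas as pd

def drop_uninformative(df, max_missing=0.5):
    """Drop columns that are mostly missing or hold a single repeated value."""
    missing_share = df.isna().mean()
    mostly_empty = missing_share[missing_share > max_missing].index.tolist()
    constant = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
    to_drop = sorted(set(mostly_empty) | set(constant))
    return df.drop(columns=to_drop), to_drop

sample = pd.DataFrame({
    "temp": [15.3, 14.7, 14.3, 14.4],
    "sun": [None, None, None, 0.5],
    "source": ["met", "met", "met", "met"],
})
trimmed, dropped = drop_uninformative(sample)
print("Dropped:", dropped)
print("Final shape:", trimmed.shape)
```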


## 10 – Save and Submit (5 marks)
• Save your cleaned dataset as: cleaned_hrly_Irish_weather.csv
• Ensure that all data cleaning and processing steps are completed within the same
notebook.

• Submit both your cleaned dataset and your Jupyter Notebook.

In [9]:
# Save the cleaned dataset (the frame produced in the previous step, not the raw reload)
weather_data_cleaned.to_csv('cleaned_hrly_Irish_weather.csv', index=False)

print("Dataset saved successfully as: cleaned_hrly_Irish_weather.csv")
print(f"File contains {len(weather_data_cleaned)} rows and {len(weather_data_cleaned.columns)} columns")

Dataset saved successfully as: cleaned_hrly_Irish_weather.csv
File contains 4660423 rows and 18 columns
