**DATA CLEANING**

To identify missing data, the isnull().sum() method was employed. Rows containing missing values in critical columns, namely 'City,' 'Date,' 'PM2.5,' and 'AQI,' were subsequently removed. To address missing values in other columns, mean imputation was applied using the fillna() method. Duplicate records were detected using the duplicated() method and eliminated to ensure data integrity.

In [2]:
import csv
import pandas as pd
import os
import sys


def cf(filename):
  """
  Author: Sharon Karamba
  Concatenates a filename with a directory path and returns the absolute path.
  Useful for file-handling iterations.
  Preconditions:
  - filename: a string representing the name of the file to be concatenated
  with the directory path.
  Postconditions:
  - Returns a string representing the absolute path of the file after
  concatenation.
  Args:
  filename (str): The name of the file to be concatenated with the
  directory path.
  Returns:
  str: The absolute path of the file after concatenation.
  """
  if 'google.colab' in sys.modules:
    #If running in colab, return the Colab file path
    if not os.path.exists('/content/drive'):
      #Mount Google Drive if it's not already mounted
      from google.colab import drive
      drive.mount('/content/drive')
    else:
      print("Google Drive is already mounted.")
      print()
    PATH = '/content/drive/MyDrive' # root
    data_dir = os.path.join(PATH, filename.lstrip('/'))
  else:
    #If not running in colab, return the local file path
    data_dir = os.path.join(os.getcwd(), filename)
  return data_dir
  print()


In [1]:

#Load the dataset (adjust the path as necessary)
filename = 'air_quality_kolkata.csv'
filename = cf(filename)

try:
  df = pd.read_csv(filename)
except FileNotFoundError:
  print("Error: File not found. Please check the path and filename.")
  exit()  #Terminate execution if file is not found

#Display the initial dataset information
print("Initial Dataset Info:")
print(df.info())
print("First few rows of data:")
print(df.head())

#1. Handling Missing Values
#Check for missing values
print("Missing Values Per Column:")
print(df.isnull().sum())

#Drop rows with missing values in critical columns or fill them
df = df.dropna(subset=['City', 'Date', 'PM2.5', 'AQI'])  #Drop if critical

#Impute other missing values with the column mean for numeric columns only
#Select numeric columns
numeric_cols = df.select_dtypes(include=['number']).columns

#Fill NaN in numeric columns with their respective means
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

#2. Removing Duplicates
#Check for duplicates
print("Number of duplicate rows:", df.duplicated().sum())

#Drop duplicates
df = df.drop_duplicates()

#Display cleaned dataset information
print("Cleaned Dataset Info:")
print(df.info())


#Save cleaned data (use error handling)
try:
  df.to_csv('cleaned_air_quality_kolkata.csv', index=False)
  print("Data cleaning complete. Cleaned dataset saved to cleaned_air_quality_kolkata.csv")
except PermissionError:
  print("Error: You may not have permission to write to the specified location.")


Error: File not found. Please check the path and filename.
Initial Dataset Info:


NameError: name 'df' is not defined