In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt



In [None]:
df = pd.read_csv('/content/Airline_Delay_Cause.csv')

# Task
Identify categorical columns in the DataFrame `df`, fill their missing values with 'NA', and then save the cleaned DataFrame to a new CSV file named 'cleaned_data.csv'.

## Identify Categorical Columns

### Subtask:
Identify all categorical columns in the DataFrame `df`.


**Reasoning**:
To identify categorical columns, I need to inspect the data types of all columns in the DataFrame. The `info()` method provides a summary of the DataFrame, including the data type of each column.



In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32881 entries, 0 to 32880
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   year                 32881 non-null  int64  
 1   month                32881 non-null  int64  
 2   carrier              32881 non-null  object 
 3   carrier_name         32881 non-null  object 
 4   airport              32881 non-null  object 
 5   airport_name         32881 non-null  object 
 6   arr_flights          32816 non-null  float64
 7   arr_del15            32805 non-null  float64
 8   carrier_ct           32816 non-null  float64
 9   weather_ct           32814 non-null  float64
 10  nas_ct               32816 non-null  float64
 11  security_ct          32816 non-null  float64
 12  late_aircraft_ct     32816 non-null  float64
 13  arr_cancelled        32816 non-null  float64
 14  arr_diverted         32814 non-null  float64
 15  arr_delay            32816 non-null 

**Reasoning**:
Based on the `df.info()` output, columns with 'object' dtype are typically categorical. I will identify these columns and store them in a list.



In [None]:
categorical_cols = df.select_dtypes(include='object').columns.tolist()
print("Identified categorical columns:")
print(categorical_cols)

Identified categorical columns:
['carrier', 'carrier_name', 'airport', 'airport_name']
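Before filling anything, it can help to check how many values are actually missing in each categorical column. The sketch below uses a small made-up frame named `toy` (a hypothetical stand-in for `df`, not the airline data) and counts missing entries per `object` column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; the values are invented.
toy = pd.DataFrame({
    "carrier": pd.Series(["9E", None, "AA"], dtype=object),
    "airport": pd.Series(["ABE", "ABY", None], dtype=object),
    "arr_flights": [136.0, np.nan, 19.0],
})

# Same selection rule as in the notebook: object dtype is treated as categorical.
toy_categorical = toy.select_dtypes(include="object").columns.tolist()

# isna().sum() gives the number of missing entries in each selected column.
missing_per_col = toy[toy_categorical].isna().sum()
print(missing_per_col)
```

Running the same two lines against `df[categorical_cols]` would show whether the fill step has anything to do.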


## Fill Missing Values

### Subtask:
For each identified categorical column, fill any missing values (NaN) with the string 'NA'.


**Reasoning**:
I will iterate through the list of identified categorical columns and fill any missing values (NaN) with the string 'NA' using the `fillna()` method.



In [None]:
for col in categorical_cols:
    df[col].fillna('NA', inplace=True)
print("Missing values in categorical columns filled with 'NA'.")

Missing values in categorical columns filled with 'NA'.


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna('NA', inplace=True)


**Reasoning**:
The previous code generated a `FutureWarning` because `df[col].fillna('NA', inplace=True)` calls an in-place method on the intermediate object returned by `df[col]`. That call is not guaranteed to modify `df` itself and, according to the warning, will never do so from pandas 3.0 onward. To fill the missing values reliably and avoid the warning, I will explicitly assign the result of `fillna()` back to the column.



In [None]:
for col in categorical_cols:
    df[col] = df[col].fillna('NA')
print("Missing values in categorical columns filled with 'NA'.")

Missing values in categorical columns filled with 'NA'.
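The warning text also suggests a dictionary form, `df.method({col: value})`. As a minimal sketch of that alternative, assuming a made-up frame `toy` and a hypothetical list `toy_categorical`, the loop can be replaced by a single `DataFrame.fillna` call that maps each column name to its fill value.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "carrier": pd.Series(["9E", None], dtype=object),
    "airport": pd.Series([None, "ABY"], dtype=object),
    "arr_flights": [136.0, None],
})
toy_categorical = ["carrier", "airport"]

# A dict passed to DataFrame.fillna applies each value only to the named column,
# so the numerical column keeps its NaN. A new frame is returned; toy is unchanged.
filled = toy.fillna({col: "NA" for col in toy_categorical})
print(filled)
```

Both forms should give the same result; the dictionary form avoids the explicit loop and never touches an intermediate column object.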


## Save Cleaned Data to CSV

### Subtask:
Save the cleaned DataFrame to a new CSV file named 'cleaned_data.csv'.


**Reasoning**:
Now that the missing values in categorical columns have been filled, I will save the cleaned DataFrame `df` to a new CSV file named 'cleaned_data.csv' as instructed, ensuring that the index is not written to the file.



In [None]:
df.to_csv('cleaned_data.csv', index=False)
print("Cleaned DataFrame saved to 'cleaned_data.csv'.")

Cleaned DataFrame saved to 'cleaned_data.csv'.
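One caveat with the string 'NA' as a placeholder: `pd.read_csv` treats `NA` as a missing-value marker by default, so the placeholder turns back into `NaN` when the file is reloaded. The sketch below demonstrates this with an in-memory buffer and a made-up two-row frame (both are assumptions for illustration, not part of the notebook).

```python
import io

import pandas as pd

# Hypothetical frame that already contains the 'NA' placeholder.
toy = pd.DataFrame({"carrier": ["9E", "NA"], "arr_flights": [136.0, 91.0]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing: the text NA is interpreted as a missing value.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False disables the built-in list of missing-value markers.
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read["carrier"].isna().sum())
print(literal_read["carrier"].tolist())
```

If empty fields should still be read as missing, `na_values=['']` can be passed alongside `keep_default_na=False`. Another option is a placeholder that is not on the default list, such as 'Unknown'.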


## Final Task

### Subtask:
Confirm that the data cleaning and saving process has been completed successfully.


## Summary:

### Data Analysis Key Findings
*   **Categorical Column Identification:** The analysis successfully identified four categorical columns in the DataFrame `df`: `['carrier', 'carrier_name', 'airport', 'airport_name']`.
*   **Missing Value Handling:** `fillna('NA')` was applied to the identified categorical columns. According to the `df.info()` output, all four columns already had 32881 non-null values out of 32881 rows, so this step left the data unchanged. A `FutureWarning` related to `inplace=True` was encountered and resolved by explicitly reassigning the result of `fillna()`.
*   **Data Saving:** The cleaned DataFrame was successfully saved to a new CSV file named 'cleaned_data.csv' without including the DataFrame index.

### Insights or Next Steps
*   The `cleaned_data.csv` file has no missing values in its categorical columns. The numerical columns (e.g., `arr_flights`, `arr_del15`, `weather_ct`) still contain the missing values reported by `df.info()` and need to be handled before model training.
*   Consider exploring the distribution of 'NA' values in the filled columns to understand the extent of missing data and its potential impact on subsequent analysis.
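To gauge how much of each filled column consists of the placeholder, one option is to compare the columns against the string 'NA' and take the mean of the resulting booleans. This is a sketch on a made-up frame `toy`, not a result from the airline data.

```python
import pandas as pd

# Hypothetical frame after filling; the values are invented.
toy = pd.DataFrame({
    "carrier": ["9E", "NA", "AA", "NA"],
    "airport": ["ABE", "ABY", "NA", "ACK"],
})

# Comparing to 'NA' yields booleans; their mean is the share of placeholders.
na_share = (toy == "NA").mean()
print(na_share)
```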


# Task
Load the data from the CSV file "/content/Airline_Delay_Cause.csv" into a pandas DataFrame, identify numerical and categorical columns, fill missing numerical values with their respective column means and missing categorical values with 'NA', then save the cleaned DataFrame to a new CSV file named 'cleaned_data_with_imputed_values.csv' without including the index, and finally confirm the completion of the process.

## Load Data

### Subtask:
Load the data from the CSV file into a pandas DataFrame.


**Reasoning**:
To load the data from the CSV file into a pandas DataFrame, I will use the `pd.read_csv()` function.



In [None]:
df = pd.read_csv('/content/Airline_Delay_Cause.csv')
print("Data loaded successfully into DataFrame 'df'.")
df.head()

Data loaded successfully into DataFrame 'df'.


,year,month,carrier,carrier_name,airport,airport_name,arr_flights,arr_del15,carrier_ct,weather_ct,...,security_ct,late_aircraft_ct,arr_cancelled,arr_diverted,arr_delay,carrier_delay,weather_delay,nas_delay,security_delay,late_aircraft_delay
0,2022,5,9E,Endeavor Air Inc.,ABE,"Allentown/Bethlehem/Easton, PA: Lehigh Valley ...",136.0,7.0,5.95,0.0,...,0.0,1.0,0.0,0.0,255.0,222.0,0.0,4.0,0.0,29.0
1,2022,5,9E,Endeavor Air Inc.,ABY,"Albany, GA: Southwest Georgia Regional",91.0,16.0,7.38,0.0,...,0.0,6.09,0.0,0.0,884.0,351.0,0.0,81.0,0.0,452.0
2,2022,5,9E,Endeavor Air Inc.,ACK,"Nantucket, MA: Nantucket Memorial",19.0,2.0,0.13,0.0,...,0.0,0.88,1.0,0.0,138.0,4.0,0.0,106.0,0.0,28.0
3,2022,5,9E,Endeavor Air Inc.,AEX,"Alexandria, LA: Alexandria International",88.0,14.0,7.26,0.76,...,0.0,1.64,0.0,0.0,947.0,585.0,35.0,125.0,0.0,202.0
4,2022,5,9E,Endeavor Air Inc.,AGS,"Augusta, GA: Augusta Regional at Bush Field",181.0,19.0,13.84,0.0,...,0.0,2.09,0.0,0.0,808.0,662.0,0.0,87.0,0.0,59.0


## Identify Column Types

### Subtask:
Identify and separate columns into numerical and categorical types based on their data types.


**Reasoning**:
Based on the `df.info()` output, I will programmatically identify columns with 'object' dtype as categorical and columns with 'int64' or 'float64' dtypes as numerical. I will then store their names in respective lists and print them for confirmation.



In [None]:
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = df.select_dtypes(include='object').columns.tolist()

print("Identified numerical columns:")
print(numerical_cols)
print("\nIdentified categorical columns:")
print(categorical_cols)

Identified numerical columns:
['year', 'month', 'arr_flights', 'arr_del15', 'carrier_ct', 'weather_ct', 'nas_ct', 'security_ct', 'late_aircraft_ct', 'arr_cancelled', 'arr_diverted', 'arr_delay', 'carrier_delay', 'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay']

Identified categorical columns:
['carrier', 'carrier_name', 'airport', 'airport_name']
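Listing `'int64'` and `'float64'` explicitly works for this file, but it would skip numerical columns stored with other widths, such as `int32`. A more general selector is `include='number'`. The sketch below compares the two on a made-up frame `toy`, which is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an int32 column to show the difference.
toy = pd.DataFrame({
    "year": np.array([2022, 2022], dtype="int32"),
    "arr_flights": [136.0, 91.0],
    "carrier": pd.Series(["9E", "AA"], dtype=object),
})

# Explicit dtype list: matches only int64 and float64 columns.
explicit = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Generic selector: matches every numerical dtype.
generic = toy.select_dtypes(include="number").columns.tolist()

print(explicit)
print(generic)
```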


## Fill Missing Numerical Values

### Subtask:
For all identified numerical columns, calculate the mean of each column and fill any missing values (NaN) with that mean.


**Reasoning**:
To fill missing numerical values with the mean, I will iterate through each numerical column, calculate its mean, and then use the `fillna()` method to replace NaN values, explicitly reassigning the column to the result.



In [None]:
for col in numerical_cols:
    mean_value = df[col].mean()
    df[col] = df[col].fillna(mean_value)
print("Missing values in numerical columns filled with their respective means.")

Missing values in numerical columns filled with their respective means.
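The same imputation can be written without a loop, because `DataFrame.fillna` accepts a Series of per-column fill values and aligns it by column name. The sketch below uses a made-up frame `toy` and a hypothetical list `toy_numerical`; it is an equivalent formulation, not the code that was run above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the values are invented so the means are easy to check.
toy = pd.DataFrame({
    "arr_flights": [100.0, np.nan, 200.0],
    "arr_delay": [10.0, 30.0, np.nan],
    "carrier": ["9E", "AA", "DL"],
})
toy_numerical = ["arr_flights", "arr_delay"]

# mean() skips NaN by default and returns one value per column;
# fillna() then matches each mean to its column by name.
column_means = toy[toy_numerical].mean()
toy[toy_numerical] = toy[toy_numerical].fillna(column_means)
print(toy)
```

Note that a mean is usually fractional, so count-like columns can end up with non-integer values after this step.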


## Fill Missing Categorical Values

### Subtask:
For all identified categorical columns, fill any missing values (NaN) with the string 'NA'.


**Reasoning**:
To address the subtask of filling missing categorical values, I will iterate through the `categorical_cols` list and use the `fillna()` method to replace NaN values with the string 'NA', ensuring the results are reassigned to the DataFrame.



In [None]:
for col in categorical_cols:
    df[col] = df[col].fillna('NA')
print("Missing values in categorical columns filled with 'NA'.")

Missing values in categorical columns filled with 'NA'.


**Reasoning**:
Now that both numerical and categorical missing values have been handled, I will save the cleaned DataFrame to a new CSV file named 'cleaned_data_with_imputed_values.csv' as per the main task's requirements.



In [None]:
df.to_csv('cleaned_data_with_imputed_values.csv', index=False)
print("Cleaned DataFrame saved to 'cleaned_data_with_imputed_values.csv'.")

Cleaned DataFrame saved to 'cleaned_data_with_imputed_values.csv'.


## Final Task

### Subtask:
Confirm that the data cleaning and saving process has been completed successfully.
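One way to make this confirmation concrete is to count the remaining missing values and to reload the saved file to compare its shape and columns. The sketch below does this on a made-up frame `cleaned` written to a temporary directory; the frame, the file name `cleaned_example.csv`, and the variable names are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame; the values are invented.
cleaned = pd.DataFrame({"carrier": ["9E", "NA"], "arr_flights": [136.0, 113.5]})

# Total number of missing cells left in memory.
remaining_missing = int(cleaned.isna().sum().sum())

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_example.csv")
    cleaned.to_csv(path, index=False)
    # keep_default_na=False keeps the literal 'NA' placeholder as text.
    reloaded = pd.read_csv(path, keep_default_na=False)

same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
print(remaining_missing, same_shape, same_columns)
```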


## Summary:

### Q&A
Yes, the data cleaning and saving process has been completed successfully. Missing numerical values were filled with column means, missing categorical values with 'NA', and the cleaned DataFrame was saved to a new CSV file.

### Data Analysis Key Findings
*   Data from `/content/Airline_Delay_Cause.csv` was successfully loaded into a pandas DataFrame containing 32881 rows and 21 columns.
*   17 numerical columns were identified, including `year`, `month`, `arr_flights`, `arr_del15`, and various delay-related counts and durations (e.g., `carrier_ct`, `weather_delay`).
*   4 categorical columns were identified: `carrier`, `carrier_name`, `airport`, and `airport_name`.
*   Missing values in all identified numerical columns were successfully imputed with their respective column means.
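*   `fillna('NA')` was applied to all identified categorical columns; the earlier `df.info()` output showed no missing values in these columns, so the step acts as a safeguard rather than a change to the data.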
*   The final cleaned DataFrame, with all missing values addressed, was saved to a new CSV file named `cleaned_data_with_imputed_values.csv`, without including the DataFrame index.

### Insights or Next Steps
*   The cleaned dataset is now ready for further exploratory data analysis or machine learning model training, as all missing values have been handled.
*   Future steps could involve analyzing the distributions of the imputed numerical values to ensure they do not significantly skew the original data patterns, and examining the impact of 'NA' imputation on categorical variable analysis.
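A quick way to check the effect of mean imputation on a column is to compare summary statistics before and after. The sketch below uses a made-up Series `values` (an assumption, not airline data) and shows the general pattern: the mean is preserved, while the standard deviation shrinks because the imputed points sit exactly at the mean.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
values = pd.Series([10.0, 20.0, np.nan, 30.0, np.nan])

# Mean imputation as used in the notebook.
imputed = values.fillna(values.mean())

# Side-by-side summary; describe() ignores NaN in the original.
summary = pd.DataFrame({
    "original": values.describe(),
    "imputed": imputed.describe(),
})
print(summary)
```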
