# Week 2 - Preprocessing, part 2

# 1. Lesson: None

# 2. Weekly graph question

The Storytelling With Data book mentions planning on a "Who, What, and How" for your data story.  Write down a possible Who, What, and How for your data, using the ideas in the book.

*DATASET*
Flight delay dataset

*WHO* 
Airline Operations Leaders who want to improve the flight experience for customers and reduce operational costs.

*WHAT*
They want to know which factors contribute to flight delays in order to optimize the deployment of resources and communication to customers.

*HOW*
The Flight Delay dataset can help identify delay factors by route, weather, temperature, airline, etc.

# 3. Homework - work with your own data

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

This week, you will do the same types of exercises as last week, but you should use your chosen datasets that someone in your class found last semester. (They likely will not be the particular datasets that you found yourself.)

### Here are some types of analysis you can do. Use Google, documentation, and ChatGPT to help you:

- Summarize the datasets using info() and describe()

- Are there any duplicate rows?

- Are there any duplicate values in a given column (when this would be inappropriate)?

- What are the mean, median, and mode of each column?

- Are there any missing or null values?

    - Do you want to fill in the missing value with a mean value?  A value of your choice?  Remove that row?

- Identify any other inconsistent data (e.g. someone seems to be taking an action before they are born.)

- Encode any categorical variables (e.g. with one-hot encoding.)
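As a rough illustration of the missing-value options listed above, the sketch below compares filling with the mean, filling with a chosen value, and dropping the row. The toy DataFrame and its column names (`carrier`, `delay_minutes`) are made up for the example and do not come from any of the class datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "carrier": ["AA", "DL", "UA", "AA"],
    "delay_minutes": [10.0, np.nan, 30.0, 20.0],
})

# Option 1: fill the gap with the column mean (mean() skips NaN by default)
filled_mean = toy["delay_minutes"].fillna(toy["delay_minutes"].mean())

# Option 2: fill the gap with a value of your choice
filled_zero = toy["delay_minutes"].fillna(0)

# Option 3: drop any row that is missing a value in that column
dropped = toy.dropna(subset=["delay_minutes"])

print(filled_mean.tolist())
print(filled_zero.tolist())
print(len(dropped))
```

Which option is appropriate depends on the column: a mean fill keeps the row count but flattens variation, while dropping rows is safer when only a few values are missing.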

### Conclusions:

- Are the data usable?  If not, find some new data!

- Do you need to modify or correct the data in some way?

- Is there any class imbalance?  (Categories that have many more items than other categories).
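One possible way to check for class imbalance is to look at the share of rows in each category. The sketch below uses a made-up `status` label (hypothetical, not a column from the datasets in this notebook) and `value_counts(normalize=True)`.

```python
import pandas as pd

# Hypothetical labels; in practice this would be a categorical column from your dataset
labels = pd.Series(["on_time"] * 8 + ["delayed"] * 2, name="status")

# Share of rows in each category (the shares sum to 1.0)
shares = labels.value_counts(normalize=True)
print(shares)

# Ratio of the largest class to the smallest; a ratio far above 1 suggests imbalance
imbalance_ratio = shares.max() / shares.min()
print(f"Largest-to-smallest class ratio: {imbalance_ratio:.1f}")
```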

# Flight Delay Dataset Examination

In [None]:
df = pd.read_csv("DelayData.csv", delimiter=",")

# Summarize the dataset using info() and describe()
print("DataFrame Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nDataFrame Description:")
print(df.describe(include='all'))

# Check for duplicate rows
duplicate_rows = df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicate_rows}")

# Check for duplicate values in a given column (example: 'tailnum')
duplicate_values_tailnum = df['tailnum'].duplicated().sum()
print(f"\nNumber of duplicate values in 'tailnum': {duplicate_values_tailnum}")

# Calculate mean, median, and mode of each column
# (mean and median are restricted to numeric columns; mode works for any dtype)
mean_values = df.mean(numeric_only=True)
median_values = df.median(numeric_only=True)
mode_values = df.mode().iloc[0]

print("\nMean values:")
print(mean_values)
print("\nMedian values:")
print(median_values)
print("\nMode values:")
print(mode_values)

# Check for missing or null values
missing_values = df.isnull().sum()
print("\nMissing values:")
print(missing_values)

# Optionally fill missing values with the column mean (example: weather columns).
# Assigning the result back avoids calling fillna(..., inplace=True) on a single column.
# for col in ['temperature', 'windspeed', 'windspeedsquare', 'windgustspeed']:
#     df[col] = df[col].fillna(df[col].mean())

# Identify inconsistent data (example: 'year' should be reasonable)
inconsistent_data = df[df['year'] < 1900]
print("\nInconsistent data (year < 1900):")
print(inconsistent_data)

# Encode categorical variables (example: 'origin', 'dest', 'uniquecarrier')
df_encoded = pd.get_dummies(df, columns=['origin', 'dest', 'uniquecarrier', 'origincityname', 'originstate'])

print("\nDataFrame with encoded categorical variables:")
print(df_encoded.head())



# Priceline Flight Dataset Examination

In [19]:
df = pd.read_csv("flight.csv", delimiter=",")
#df = df.head(100)



# USDOT On Time Flight Reporting Dataset Examination

In [None]:
df = pd.read_csv("T_ONTIME_REPORTING.csv", delimiter=",")
df = df.head(100)

# Priceline Flight Dataset Examination

## Observations
### Data Quality
- There are several empty columns. The CSV appears to be corrupt. We will remove the bad columns.
- Date and Time columns are text.
- Price column is an object, not numeric.
- Some rows contain bad data, such as a dollar sign in the price column, Number of Stops = "Express Deal", or Travel Time = "Save $39". These rows will be removed.
### INFO() on the base dataset
- 15 columns, 2,459 rows
- 2 rows missing an airline name
- 296 rows missing a departure airport
- 3 rows missing a ticket price
- 299 rows missing an arrival airport (maybe they never left?)
- Several rows don't have 2nd or 3rd stoppage values. These are most likely flights with fewer stops.
### DESCRIBE() on the clean dataset
- 57 unique airlines
- Travel time ranges from ~2.67 hours (160 minutes) to ~82 hours (4,930 minutes)
- First-stop wait times range from 39 minutes to 1,440 minutes
- Ticket prices range from $135 to $7,867
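
The bad rows noted under Data Quality (text where a number is expected) can also be flagged in one pass with `pd.to_numeric(..., errors="coerce")`. This is a minimal sketch on a made-up sample; the `price` column and its values are hypothetical stand-ins for the real ticket price column.

```python
import pandas as pd

# Hypothetical sample mimicking the kinds of bad values described above
sample = pd.DataFrame({
    "price": ["135", "$412", "Save $39", "980"],
})

# Values that cannot be parsed as numbers become NaN instead of raising an error
sample["price_num"] = pd.to_numeric(sample["price"], errors="coerce")

# Rows where the original had a value but parsing failed are the suspect ones
bad_rows = sample[sample["price_num"].isna() & sample["price"].notna()]
print(bad_rows)

# Keep only the rows with a usable numeric price
clean = sample.dropna(subset=["price_num"])
print(clean["price_num"].tolist())
```

Inspecting `bad_rows` before dropping them makes it easier to decide whether a value is junk or just needs reformatting.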



In [81]:
# Load the Flight Dataset
df_src = pd.read_csv("flight.csv", delimiter=",", low_memory=False)

def flight_data_cleaner(df):
    # Remove the junk columns
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    # Clean up the column names
    df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('[^a-zA-Z0-9_]', '', regex=True).str.replace('__','_')
    return df

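def convert_to_minutes(time_str):
    # Convert strings such as '2h 40m', '2h', or '45m' to total minutes.
    # Missing values, non-strings, and unparseable text all become NaN.
    if not isinstance(time_str, str):
        return np.nan
    try:
        total_minutes = 0
        if 'h' in time_str:
            hours, minutes = time_str.split('h')
            total_minutes += int(hours) * 60
            minutes = minutes.strip().replace('m', '')
            if minutes:  # allow a bare '2h' with no minutes part
                total_minutes += int(minutes)
        else:
            total_minutes += int(time_str.replace('m', ''))
        return total_minutes
    except ValueError:
        return np.nan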

df_src = flight_data_cleaner(df_src)
df_clean = df_src.copy()
df_clean['travel_time_minutes'] = df_clean['travel_time'].apply(convert_to_minutes)
df_clean['1st_stop_wait_minutes'] = df_clean['1st_stoppage_waiting_hour'].apply(convert_to_minutes)
df_clean['2nd_stop_wait_minutes'] = df_clean['2nd_stoppagewaiting_time'].apply(convert_to_minutes)
df_clean['3rd_stop_wait_minutes'] = df_clean['3rd_stoppage_waiting_time'].apply(convert_to_minutes)


# Map the stops column
stops_mapping = {
    'Nonstop': 0,
    '1 Stop': 1,
    '2 Stops': 2,
    '3 Stops': 3
}
df_clean['stops'] = df_clean['number_of_stoppage'].map(stops_mapping)

# Replace stray values (anything containing a dollar sign or the text 'Alaska') with None
df_clean['ticket_prizedoller'] = df_clean['ticket_prizedoller'].apply(lambda x: None if '$' in str(x) or 'Alaska' in str(x) else x)
# Convert the entire column to float
df_clean['ticket_price_usd'] = df_clean['ticket_prizedoller'].astype(float)


def clean_and_combine_datetime(df, date_col, time_col):
    def parse_datetime(date_str, time_str):
        try:
            # Strip leading/trailing spaces and prefixes
            if isinstance(date_str, str):
                date_str = date_str.strip()
                if 'Arrives:' in date_str:
                    date_str = date_str.split('Arrives: ')[-1]
                date_str = date_str.split(', ')[-1]
            else:
                return None
            
            if isinstance(time_str, str):
                time_str = time_str.strip().upper().replace('A', 'AM').replace('P', 'PM')
            else:
                return None
            
            # Add current year to date string
            current_year = datetime.now().year
            date_str = f"{date_str} {current_year}"
            
            # Combine date and time strings
            datetime_str = f"{date_str} {time_str}"
            
            # Parse combined datetime string
            return datetime.strptime(datetime_str, '%b %d %Y %I:%M%p')
        except ValueError:
            return None
    
    # Apply the parsing function to the DataFrame
    df['arrival_datetime'] = df.apply(lambda row: parse_datetime(row[date_col], row[time_col]), axis=1)
    
    return df

# Clean and combine datetime (pass df_clean so the columns derived above are kept)
df_clean = clean_and_combine_datetime(df_clean, 'arrival_date', 'arrival_time')


def convert_to_24_hour(time_str):
    try:
        # Check if the value is a string
        if isinstance(time_str, str):
            # Normalize lowercase 'a'/'p' to 'AM'/'PM'
            time_str = time_str.replace('a', 'AM').replace('p', 'PM')
            # Convert to 24-hour format
            return datetime.strptime(time_str, '%I:%M%p').strftime('%H:%M')
        else:
            return None
    except (ValueError, TypeError):
        # Return None for invalid or missing values
        return None

# Apply the conversion function to the 'depreture_time' column
df_clean['departure_time_24hr'] = df_clean['depreture_time'].apply(convert_to_24_hour)


# Remove the raw source columns now that cleaned versions exist
junk_cols = [
    'travel_time', 'number_of_stoppage', 'depreture_time', 'ticket_prizedoller',
    '1st_stoppage_waiting_hour', '2nd_stoppagewaiting_time', '3rd_stoppage_waiting_time',
    'arrival_date', 'arrival_time',
]

df_clean = df_clean.drop(columns=junk_cols)


In [None]:
df_src.info()  # info() prints directly; wrapping it in print() adds a stray "None"

In [86]:
df_clean.describe()

| | travel_time_minutes | 1st_stop_wait_minutes | 2nd_stop_wait_minutes | 3rd_stop_wait_minutes | stops | ticket_price_usd | arrival_datetime |
|---|---|---|---|---|---|---|---|
| count | 2459.0 | 2401.0 | 654.0 | 17.0 | 2459.0 | 2455.0 | 2295 |
| mean | 1551.06344 | 510.922116 | 426.108563 | 356.764706 | 1.294429 | 1316.544603 | 2025-04-08 19:46:51.320261376 |
| min | 160.0 | 39.0 | 50.0 | 176.0 | 0.0 | 135.0 | 2025-04-08 00:20:00 |
| 25% | 985.0 | 180.0 | 170.0 | 176.0 | 1.0 | 771.0 | 2025-04-08 11:30:00 |
| 50% | 1430.0 | 424.0 | 260.0 | 200.0 | 1.0 | 1128.0 | 2025-04-08 18:03:00 |
| 75% | 2000.0 | 770.0 | 543.5 | 346.0 | 2.0 | 1607.0 | 2025-04-09 00:05:00 |
| max | 4930.0 | 1440.0 | 1435.0 | 1385.0 | 3.0 | 7867.0 | 2025-04-10 11:00:00 |
| std | 728.401941 | 376.460505 | 351.522249 | 391.952569 | 0.517722 | 885.885657 | |





# 4. Storytelling With Data graph

Just like last week: choose any graph in the Introduction of Storytelling With Data. Use matplotlib to reproduce it in a rough way. I don't expect you to spend an enormous amount of time on this; I understand that you likely will not have time to re-create every feature of the graph. However, if you're excited about learning to use matplotlib, this is a good way to do that. You don't have to duplicate the exact values on the graph; just the same rough shape will be enough.  If you don't feel comfortable using matplotlib yet, do the best you can and write down what you tried or what Google searches you did to find the answers.

In [None]:
import matplotlib.pyplot as plt

