### Data cleaning and processing with pandas



#### Always check your files!



If you execute the following code and the file is missing, you will get a lengthy error message telling you that the file does not exist.



In [1]:
import pandas as pd  # import the pandas library
df: pd.DataFrame = pd.read_csv("example_file.csv")  # the pandas dataframe

However, depending on your circumstances, you may also get rather cryptic error messages, and a simple typo in the filename can cost you a lot of time trying to figure out what is going on. It is thus much better to test explicitly whether a file exists. This is easily done with a few lines of code.



In [1]:
import pathlib as pl 

fn: str = "example_file.csv"  # file name
cwd: pl.Path = pl.Path.cwd()  # get the current working directory
fqfn: pl.Path = cwd / fn  # build the fully qualified file name
if not fqfn.exists():  # check if the file is actually there
    raise FileNotFoundError(f"Cannot find file {fqfn}")

I keep a file full of such handy little snippets, so I can copy and paste them whenever I do file operations. The `pathlib` library provides many useful methods to modify the path, the filename, or its extension, but the above will do for this course. In the coming exercises, we will often import external data with pandas, so you will want to create a second code template that combines the above with the pandas import statement (you can pass the `fqfn` pathlib object to pandas instead of the filename string).
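
One way such a combined template might look is sketched below. The helper name `read_csv_checked` is hypothetical (it is not part of pandas or `pathlib`), and the sketch assumes the file lives in the current working directory.

```python
import pathlib as pl

import pandas as pd


def read_csv_checked(fn: str) -> pd.DataFrame:
    """Read a CSV file from the working directory, failing early if it is missing."""
    fqfn: pl.Path = pl.Path.cwd() / fn  # build the fully qualified file name
    if not fqfn.exists():  # check if the file is actually there
        raise FileNotFoundError(f"Cannot find file {fqfn}")
    return pd.read_csv(fqfn)  # pandas accepts pathlib objects directly


# Usage (assumes example_file.csv exists in the working directory):
# df = read_csv_checked("example_file.csv")
```

Wrapping the check in a function keeps the error message consistent across notebooks and means a typo in the filename is reported with the full path that was actually searched.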



#### Dropping Rows with Missing Values (`NaN`)




When working with real-world data, it is common to encounter missing information (i.e., empty cells). When reading a data file, pandas marks each empty cell with `NaN`, which stands for "Not a Number". Pandas provides the `dropna()` method, which allows us to filter out rows or columns with missing (`NaN`) values:



In [1]:
import pandas as pd

# Create a sample DataFrame
data = {
    "A": [1, 2, None, 4],
    "B": [None, 2, 3, 4],
    "C": [1, None, 3, 4]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

In [1]:
df_cleaned = df.dropna()

print("\nDataFrame after dropping rows with NaN:")
print(df_cleaned)
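
The example above drops every row that contains at least one `NaN`. As a small sketch of the other options mentioned in the text, the `axis` and `subset` parameters of `dropna()` let you drop columns instead, or only consider selected columns. The sample data below (including the complete column `"D"`) is made up for illustration.

```python
import pandas as pd

# Sample data: column "D" has no missing values
df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [None, 2, 3, 4],
    "D": [1, 2, 3, 4],
})

# Drop every column that contains at least one NaN
df_complete_cols = df.dropna(axis="columns")

# Drop only the rows where column "A" is missing
df_a_present = df.dropna(subset=["A"])

print(df_complete_cols)
print(df_a_present)
```

Note that `dropna()` returns a new DataFrame; the original `df` is left unchanged.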

#### Sorting




Sorting data by a given column is often the first step in data processing. It is useful for identifying trends or ranking entries by importance. Pandas provides the `sort_values()` method, which you can use to sort a DataFrame by any column in ascending or descending order. Here's a generic example:



In [1]:
import pandas as pd

# Create a sample DataFrame
data = {
    "Name": ["A", "B", "C", "D"],
    "Value": [10, 30, 20, 40]
}

df = pd.DataFrame(data)

# Sort the DataFrame by the "Value" column in descending order
df_sorted = df.sort_values(by="Value", ascending=False)

print("DataFrame sorted by 'Value' in descending order:")
print(df_sorted)
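
`sort_values()` also accepts a list of columns. The following sketch adds a hypothetical `"Group"` column to the sample data and sorts by group first, then by value within each group.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Group": ["x", "y", "x", "y"],  # illustrative column, not in the example above
    "Value": [10, 30, 20, 40],
})

# Sort by "Group" ascending, then by "Value" descending within each group
df_sorted = df.sort_values(by=["Group", "Value"], ascending=[True, False])

# Sorting keeps the old row labels; reset_index(drop=True) renumbers them 0..n-1
df_sorted = df_sorted.reset_index(drop=True)

print(df_sorted)
```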

#### Grouping and subtotals




Grouping is a powerful feature in pandas that allows you to split a DataFrame into groups based on the values in one or more columns. After grouping, you can apply aggregate functions (e.g., sum, mean, or count) to calculate useful summary statistics for each group. This is similar to the pivot table feature in Excel.



In [1]:
# Grouping data and calculating totals:
import pandas as pd

# Create a sample DataFrame
data = {
    "Category": ["A", "B", "A", "C", "B", "A"],
    "Value": [10, 20, 30, 40, 50, 60]
}

df = pd.DataFrame(data)

# Group by the "Category" column and calculate the total (sum) for each group
grouped_totals = df.groupby("Category")["Value"].sum()
print(grouped_totals)
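
If you need more than one statistic per group, the `agg()` method can compute several at once. A minimal sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "B", "A", "C", "B", "A"],
    "Value": [10, 20, 30, 40, 50, 60],
})

# One row per category, one column per aggregate function
summary = df.groupby("Category")["Value"].agg(["sum", "mean", "count"])

print(summary)
```

The result is a DataFrame indexed by `Category`, so a single number can be looked up with, for example, `summary.loc["A", "sum"]`.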

#### Creating a data pipeline




In the above examples, we used a variety of steps to pre-process our data. Depending on your data, this can get quite messy. Sometimes it is thus useful to define explicit functions for the data processing steps, and then join them together in a clean and readable way. Whether this step is worth it depends on the complexity of your task.



In [1]:
import pandas as pd

# Sample data
data = {
    'Category': ['A', 'B', 'A', 'B', 'C', 'C', 'A', 'B'],
    'Values': [10, 20, None, 40, 30, None, 50, 60]
}

df = pd.DataFrame(data)

# Define functions for each step
def drop_missing(df):
    return df.dropna()

def group_and_sum(df):
    return df.groupby('Category', as_index=False).sum()

def sort_by_values(df):
    return df.sort_values(by='Values', ascending=False)

# Use the pipe method to transform the data
cleaned_df = (df
              .pipe(drop_missing)
              .pipe(group_and_sum)
              .pipe(sort_by_values))

print(cleaned_df)
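
For comparison, here is a sketch of the same three steps written as a plain chain of built-in methods, without helper functions. For a pipeline this short, the chain is arguably just as readable; named functions start to pay off once the individual steps grow beyond a single method call.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "C", "C", "A", "B"],
    "Values": [10, 20, None, 40, 30, None, 50, 60],
})

# Same three steps as the pipe() version: drop NaN, sum per group, sort
chained_df = (
    df.dropna()
    .groupby("Category", as_index=False)
    .sum()
    .sort_values(by="Values", ascending=False)
)

print(chained_df)
```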

#### Pandas and Date-Time Data



Sometimes we need to work with date-time data, like `2023-01-01 12:00:00`, but many data processing tools work better with numbers than with dates. A common way to represent dates as numbers is to use **Unix timestamps**.



##### What is a Unix Timestamp?



A Unix timestamp is a numeric value that represents the number of **seconds** since January 1, 1970, 00:00:00 UTC (commonly called the "epoch time"). Example:

    - 1970-01-01 00:00:00 UTC → 0
    - 1970-01-01 00:00:01 UTC → 1
    - 2023-01-01 12:00:00 UTC → 1672574400

By converting date-time data into Unix timestamps, we can efficiently store, compare, and work with time in numeric form.
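
The examples above can be checked with Python's standard `datetime` module. This is a small sketch, independent of pandas; the variable names are arbitrary.

```python
from datetime import datetime, timezone

# Build a timezone-aware datetime in UTC
moment = datetime(2023, 1, 1, 12, 0, 0, tzinfo=timezone.utc)

# Seconds since 1970-01-01 00:00:00 UTC
ts = int(moment.timestamp())
print(ts)  # 1672574400

# Convert the number back into a datetime
restored = datetime.fromtimestamp(ts, tz=timezone.utc)
print(restored)  # 2023-01-01 12:00:00+00:00
```

Passing `tzinfo=timezone.utc` matters here: without it, Python interprets the datetime in your computer's local time zone, and the resulting number would differ from machine to machine.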



##### How to Import and Convert Date-Time Data



Assume you have a CSV file (`data.csv`) that looks like this:

    | datetime_column       | value |
    |-----------------------|-------|
    | 2023-01-01 12:00:00  |   100 |
    | 2023-01-02 13:30:00  |   200 |
    | 2023-01-03 15:45:00  |   300 |

Here’s how to:

1.  Read in the file.
2.  Convert the date-time information into Unix timestamps.



In [1]:
import pandas as pd

# Step 1: Load the CSV file into a pandas DataFrame
df = pd.read_csv('data.csv')

# Step 2: Convert the date-time column into pandas' datetime format
df['datetime_column'] = pd.to_datetime(df['datetime_column'])

# Step 3: Convert the datetime column to Unix timestamps in seconds
df['timestamp'] = df['datetime_column'].astype('int64') // 10**9

# Step 4: Print the updated DataFrame to see the result
print(df)

##### How the Code Works



The `pd.to_datetime()` function transforms the dates and times in `datetime_column` into a special format that pandas understands. This makes it easy to manipulate and extract time-related information.

The `.astype('int64')` converts the datetime column into a numeric value representing the number of **nanoseconds** since 1970-01-01.  Why nanoseconds? Pandas stores dates with very high precision!  

To get seconds, we divide by `10**9` (10<sup>9</sup>), because there are 1 billion nanoseconds in one second. Note the use of `//` instead of `/`. The `//` operator performs an integer (floor) division, which rounds down to the nearest integer, so the result stays an integer. Writing the divisor as `10**9` rather than `1e9` matters too: `1e9` is a float, and `//` with a float returns a float.

After running the code, the DataFrame will look like this:

    | datetime_column       | value | timestamp  |
    |-----------------------|-------|------------|
    | 2023-01-01 12:00:00  |   100 | 1672574400 |
    | 2023-01-02 13:30:00  |   200 | 1672666200 |
    | 2023-01-03 15:45:00  |   300 | 1672760700 |

-   The `datetime_column` shows the original date-time.
-   The `value` column contains your other data.
-   The `timestamp` column now contains the numeric representation of the datetime in **seconds** since 1970-01-01.
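
The `astype('int64')` approach relies on the column being stored with nanosecond resolution. Recent pandas versions can also store datetimes at coarser resolutions (seconds, milliseconds, microseconds), in which case the integer would no longer count nanoseconds. A sketch of an alternative that does not depend on the stored resolution is to subtract the epoch and divide by a one-second `Timedelta`. To keep the example self-contained, an in-memory string stands in for `data.csv`; that substitution is an assumption of this sketch.

```python
import io

import pandas as pd

# Stand-in for data.csv (an assumption so the example runs on its own)
csv_text = """datetime_column,value
2023-01-01 12:00:00,100
2023-01-02 13:30:00,200
2023-01-03 15:45:00,300
"""

# parse_dates converts the named column to datetimes while reading
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime_column"])

# Time elapsed since the epoch, expressed in whole seconds
epoch = pd.Timestamp("1970-01-01")
df["timestamp"] = (df["datetime_column"] - epoch) // pd.Timedelta("1s")

print(df)
```

Both approaches treat the values as UTC, because the datetimes in the file carry no time zone information.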



#### Categorizing Date-Time Data



To categorize your date-time data in a pandas DataFrame into `day` and `night`, you can use the `dt` accessor to extract the hour from the `datetime` column. You can then define a function or use a lambda function with conditions to label each row as `day` or `night` based on the hour. Here's how you can do this:



In [1]:
import pandas as pd

# Example data (datetime data for several days)
data = {
    'datetime': [
        '2023-10-01 01:00:00',
        '2023-10-01 13:00:00',
        '2023-10-01 23:00:00',
        '2023-10-02 09:00:00',
        '2023-10-02 20:00:00',
    ],
    'value': [10, 20, 15, 25, 30]
}

# Create a pandas DataFrame
df = pd.DataFrame(data)

# Convert the 'datetime' column to pandas datetime format
df['datetime'] = pd.to_datetime(df['datetime'])

# Define the time categorization
# Assume 'night' is from 20:00 (8 PM) to 5:59 (5:59 AM), and the rest is 'day'.
def categorize_time(hour):
    if 6 <= hour < 20:  # 6 AM to 7:59 PM are day
        return 'day'
    else:               # 8 PM to 5:59 AM are night
        return 'night'

# Apply the categorization function to the DataFrame
df['time_category'] = df['datetime'].dt.hour.apply(categorize_time)

# Output the resulting DataFrame
print(df)
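
For large DataFrames, calling a Python function once per row with `apply()` can be slow. A possible vectorized variant, sketched below, uses `Series.between()` together with `numpy.where()`; it assumes the same day/night boundaries as above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2023-10-01 01:00:00",
        "2023-10-01 13:00:00",
        "2023-10-01 23:00:00",
        "2023-10-02 09:00:00",
        "2023-10-02 20:00:00",
    ]),
    "value": [10, 20, 15, 25, 30],
})

hours = df["datetime"].dt.hour

# between() includes both ends by default, so 6..19 matches 6 <= hour < 20
df["time_category"] = np.where(hours.between(6, 19), "day", "night")

print(df)
```

The result is the same as with `categorize_time`: the rows at 13:00 and 09:00 are labeled `day`, the others `night`.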