
![Data Processing Funnel](d_1.png)

**Purpose:** This notebook focuses on loading the necessary data files for the coral reef analysis project and performing initial data cleaning and preprocessing steps. These steps aim to prepare the data for subsequent analyses in other notebooks.

**Inputs:**

* `CREMP_Pcover_2023_StonyCoralSpecies.csv` 
* `CREMP_Pcover_2023_TaxaGroups.csv` 
* `CREMP_Temperatures_2023.csv` 
* `CREMP_Stations_2023.csv`

**Outputs:**

* Cleaned and preprocessed DataFrames: `df_stony_coral`, `df_taxa_groups`, `df_temperature`, `df_locations`.


**Preprocessing Steps:**

1.  **Load Data:** Load the data files into Pandas DataFrames.
2.  **Inspect Data:** Display the first few rows and column information of each DataFrame to understand its structure and data types.
3.  **Handle Missing Values:** Identify and address missing values appropriately (e.g., drop rows/columns, fill with a value).
4.  **Standardize Column Names:** Ensure consistency in column names across DataFrames.
5.  **Extract Year:** Extract the year from any date or datetime columns in the temperature data.
6.  **Merge Location Data (Optional):** If needed, merge location data with percent cover data for spatial analysis.
7.  **Save Processed Data (Optional):** Save the cleaned and preprocessed DataFrames for use in other notebooks.

**Notes:**

* This notebook assumes the data files are located in the directory `C:\Users\vijai\Desktop\Florida`. Adjust the file paths as necessary.
* The specific preprocessing steps may vary depending on the format and content of the original data files.

In [1]:
import pandas as pd

import os

# Define data directory (adjust if necessary)
# A raw string keeps the Windows backslashes literal and avoids the invalid "\F" escape
data_dir = r"C:\Users\vijai\Desktop\Florida"

# Define file paths
stony_coral_file = os.path.join(data_dir, "CREMP_Pcover_2023_StonyCoralSpecies.csv")  # Replace with actual file name/extension
taxa_groups_file = os.path.join(data_dir, "CREMP_Pcover_2023_TaxaGroups.csv")  # Replace with actual file name/extension
temperature_file = os.path.join(data_dir, "CREMP_Temperatures_2023.csv")  # Replace with actual file name/extension
locations_file = os.path.join(data_dir, "CREMP_Stations_2023.csv")  # Replace with actual file name/extension

![Data Processing Funnel](d_2.png)
## This block defines the paths to the data files. Adjust the directory and the file names
## so that they match the actual location and names of your data files.
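
Before loading, it can help to confirm that every expected file is actually present. The following is a minimal sketch, not part of the original notebook: the helper name `find_missing_files` is hypothetical, and the demonstration uses a temporary directory with a placeholder file so that it runs on any machine.

```python
from pathlib import Path
import tempfile


def find_missing_files(data_dir, file_names):
    """Return the names in file_names that do not exist inside data_dir."""
    base = Path(data_dir)
    return [name for name in file_names if not (base / name).is_file()]


# Demonstration in a throwaway directory so the sketch runs anywhere.
expected = ["CREMP_Stations_2023.csv", "CREMP_Temperatures_2023.csv"]
with tempfile.TemporaryDirectory() as tmp:
    # Create only one of the two expected files (placeholder content).
    (Path(tmp) / "CREMP_Stations_2023.csv").write_text("placeholder\n")
    missing = find_missing_files(tmp, expected)

print("Missing files:", missing)
```

In the notebook itself, the same helper could be called with `data_dir` and the four CREMP file names before any `pd.read_csv` call.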

In [2]:
# Load data
try:
    df_stony_coral = pd.read_csv(stony_coral_file)
    df_taxa_groups = pd.read_csv(taxa_groups_file)
    df_temperature = pd.read_csv(temperature_file)
    df_locations = pd.read_csv(locations_file)
    print("Data files loaded successfully!")

except FileNotFoundError as e:
    print(f"Error: One or more data files not found: {e}")
except Exception as e:
    print(f"An error occurred while loading data: {e}")

Data files loaded successfully!


![Data Processing Funnel](d_3.png)
## This block inspects the loaded DataFrames by printing the first few rows
## (using df.head()) and displaying column information, data types, and
## non-null counts (using df.info()). This helps to understand the structure
## of the data.
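
When a DataFrame has dozens of columns, a one-row-per-column summary can be easier to scan than the printed `df.info()` text. The sketch below assumes a hypothetical helper named `summarize_columns` and uses a small toy DataFrame in place of the real CREMP data.

```python
import pandas as pd


def summarize_columns(df):
    """Return one row per column: dtype, non-null count, and distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Toy data standing in for a loaded CREMP file (values are illustrative only).
toy = pd.DataFrame({
    "SiteID": [10, 10, 11],
    "Site_name": ["Rattlesnake", "Rattlesnake", "El Radabob"],
    "Date": ["7/25/1996", None, "7/22/1996"],
})

summary = summarize_columns(toy)
print(summary)
```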

In [3]:
# Inspect data
# Note: df.info() prints its summary itself and returns None,
# so it is called directly rather than wrapped in print().
print("\n--- Stony Coral Data ---")
print(df_stony_coral.head())
df_stony_coral.info()

print("\n--- Taxa Groups Data ---")
print(df_taxa_groups.head())
df_taxa_groups.info()

print("\n--- Temperature Data ---")
print(df_temperature.head())
df_temperature.info()

print("\n--- Locations Data ---")
print(df_locations.head())
df_locations.info()


--- Stony Coral Data ---
   OID_  Year               Date Subregion Habitat  SiteID    Site_name  \
0     1  1996  7/25/1996 0:00:00        UK      HB      10  Rattlesnake   
1     2  1996  7/25/1996 0:00:00        UK      HB      10  Rattlesnake   
2     3  1996  7/25/1996 0:00:00        UK      HB      10  Rattlesnake   
3     4  1996  7/25/1996 0:00:00        UK      HB      10  Rattlesnake   
4     5  1996  7/22/1996 0:00:00        UK      HB      11   El Radabob   

   StationID Surveyed_all_years  points  ...  Porites_porites_complex  \
0        101                  N     479  ...                      0.0   
1        102                  N     525  ...                      0.0   
2        103                  N     558  ...                      0.0   
3        104                  N     446  ...                      0.0   
4        111                  N     450  ...                      0.0   

   Pseudodiploria_clivosa  Pseudodiploria_strigosa  Scleractinia  Scolymia_sp  \
0  

![Data Processing Funnel](d_4.png)
## This block deals with missing values in the DataFrames. It first prints the number of missing values in each column, then shows two example strategies: dropping rows that contain any missing value (stony coral data) and filling gaps in a temperature column with the column mean. Adapt the strategy to the nature of your data and the analysis you intend to perform.
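
A single global mean can blur differences between sites. As an alternative, the sketch below fills each gap with the mean of its own site. It is an illustration under stated assumptions, not the notebook's method: the toy values are made up, and the column names `SiteID` and `TempC` are borrowed from the temperature output shown later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy temperature readings with gaps (values are illustrative only).
toy = pd.DataFrame({
    "SiteID": [10, 10, 10, 11, 11],
    "TempC": [29.0, np.nan, 31.0, 25.0, np.nan],
})

# Share of missing values per column (0.0 to 1.0).
missing_share = toy.isna().mean()

# Mean of each site, broadcast back to the original row order.
site_mean = toy.groupby("SiteID")["TempC"].transform("mean")

# Fill each gap with the mean of the site it belongs to.
toy["TempC_filled"] = toy["TempC"].fillna(site_mean)

print(missing_share)
print(toy)
```

Writing the result to a new column keeps the original readings available for comparison.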

In [4]:
# Handle missing values

print("\n--- Missing Values Before Handling ---")
print("Stony Coral:\n", df_stony_coral.isnull().sum())
print("\nTaxa Groups:\n", df_taxa_groups.isnull().sum())
print("\nTemperature:\n", df_temperature.isnull().sum())
print("\nLocations:\n", df_locations.isnull().sum())

# Example: Drop rows with any missing values in stony coral data
df_stony_coral = df_stony_coral.dropna()

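# Example: Fill missing temperature values with the column mean.
# The temperature file stores its readings in 'TempC' and 'TempF' (see the
# year-extraction output further down); 'Temperature' is kept as a fallback name.
for temp_col in ('Temperature', 'TempC', 'TempF'):
    if temp_col in df_temperature.columns:  # Skip names that are not present
        df_temperature[temp_col] = df_temperature[temp_col].fillna(df_temperature[temp_col].mean())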

print("\n--- Missing Values After Handling ---")
print("Stony Coral:\n", df_stony_coral.isnull().sum())
print("Taxa Groups:\n", df_taxa_groups.isnull().sum())
print("Temperature:\n", df_temperature.isnull().sum())
print("Locations:\n", df_locations.isnull().sum())


--- Missing Values Before Handling ---
Stony Coral:
 OID_                                 0
Year                                 0
Date                                 6
Subregion                            0
Habitat                              0
SiteID                               0
Site_name                            0
StationID                            0
Surveyed_all_years                   0
points                               0
Acropora_cervicornis                 0
Acropora_palmata                     0
Agaricia_fragilis                    0
Agaricia_lamarcki                    0
Cladocora_arbuscula                  0
Colpophyllia_natans                  0
Dendrogyra_cylindrus                 0
Dichocoenia_stokesii                 0
Diploria_labyrinthiformis            0
Eusmilia_fastigiata                  0
Favia_fragum                         0
Helioseris_cucullata                 0
Isophyllia_rigida                    0
Isophyllia_sinuosa                   0
Madracis_a

![Standardizing Column Name](d_5.png)
## This block standardizes column names across the DataFrames, because consistent column names are essential for merging and comparing data. The example code renames 'Site_name' to 'SiteName' in the stony coral and taxa groups data, and 'Year' to 'year' in the stony coral, taxa groups, and temperature data. Adjust the renames to your specific requirements.
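
One way to keep the renames identical across every DataFrame is to define the mapping once and apply it to all of them. This is a minimal sketch with toy data; the names `RENAME_MAP` and `standardize_columns` are hypothetical. It relies on `DataFrame.rename` ignoring mapping keys that are absent from a DataFrame, which is its default behavior, so no per-column `if` checks are needed.

```python
import pandas as pd

# Shared mapping from original names to standardized names.
RENAME_MAP = {"Site_name": "SiteName", "Year": "year"}


def standardize_columns(df, rename_map=RENAME_MAP):
    """Apply the shared rename map; labels missing from df are ignored."""
    return df.rename(columns=rename_map)


# Toy stand-ins for two of the loaded DataFrames.
coral = pd.DataFrame({"Year": [1996], "Site_name": ["Rattlesnake"]})
stations = pd.DataFrame({"Site_name": ["Rattlesnake"], "StationID": [101]})

frames = {"coral": coral, "stations": stations}
frames = {name: standardize_columns(df) for name, df in frames.items()}

for name, df in frames.items():
    print(name, list(df.columns))
```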

In [5]:
# Standardize column names (example; adjust based on your needs)

print("\n--- Column Names Before Standardization ---")
print("Stony Coral:", df_stony_coral.columns)
print("\nTaxa Groups:", df_taxa_groups.columns)
print("\nTemperature:", df_temperature.columns)
print("\nLocations:", df_locations.columns)

if 'Site_name' in df_stony_coral.columns:
    df_stony_coral = df_stony_coral.rename(columns={'Site_name': 'SiteName'})
if 'Site_name' in df_taxa_groups.columns:
    df_taxa_groups = df_taxa_groups.rename(columns={'Site_name': 'SiteName'})
if 'Year' in df_temperature.columns:
    df_temperature = df_temperature.rename(columns={'Year': 'year'})
if 'Year' in df_stony_coral.columns:
    df_stony_coral = df_stony_coral.rename(columns={'Year': 'year'})
if 'Year' in df_taxa_groups.columns:
    df_taxa_groups = df_taxa_groups.rename(columns={'Year': 'year'})

print("\n--- Column Names After Standardization ---")
print("Stony Coral:", df_stony_coral.columns)
print("\nTaxa Groups:", df_taxa_groups.columns)
print("\nTemperature:", df_temperature.columns)
print("\nLocations:", df_locations.columns)


--- Column Names Before Standardization ---
Stony Coral: Index(['OID_', 'Year', 'Date', 'Subregion', 'Habitat', 'SiteID', 'Site_name',
       'StationID', 'Surveyed_all_years', 'points', 'Acropora_cervicornis',
       'Acropora_palmata', 'Agaricia_fragilis', 'Agaricia_lamarcki',
       'Cladocora_arbuscula', 'Colpophyllia_natans', 'Dendrogyra_cylindrus',
       'Dichocoenia_stokesii', 'Diploria_labyrinthiformis',
       'Eusmilia_fastigiata', 'Favia_fragum', 'Helioseris_cucullata',
       'Isophyllia_rigida', 'Isophyllia_sinuosa', 'Madracis_aurentenra',
       'Madracis_decactis_complex', 'Manicina_areolata',
       'Meandrina_meandrites', 'Millepora_alcicornis', 'Millepora_complanata',
       'Montastraea_cavernosa', 'Mussa_angulosa', 'Mycetophyllia_aliciae',
       'Mycetophyllia_ferox', 'Mycetophyllia_lamarckiana_complex',
       'Oculina_diffusa', 'Oculina_robusta', 'Orbicella_annularis_complex',
       'Phyllangia_americana', 'Porites_astreoides', 'Porites_porites_complex',
     

# Extract Year from Temperature Data
## This block extracts the year from a 'Date' or 'DateTime' column in the temperature data, which is often necessary for time-based analysis. The code checks for both column names and uses pd.to_datetime() to convert the column to datetime objects before extracting the year. If neither column is present, as in the output below where the data already carries separate year, Month, and Day columns, the DataFrame is left unchanged.
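
When a table stores the date as separate year, month, and day columns, a single datetime column can be assembled from them instead. The sketch below uses toy rows and assumes the column names `year`, `Month`, and `Day`; `pd.to_datetime` accepts a DataFrame whose columns are named `year`, `month`, and `day`, so the columns are renamed to lowercase first.

```python
import pandas as pd

# Toy rows with the date split across columns (values are illustrative only).
toy = pd.DataFrame({
    "year": [2020, 2020],
    "Month": [6, 6],
    "Day": [12, 13],
    "TempC": [29.59, 29.76],
})

# Select the date parts and give them the lowercase names to_datetime expects.
parts = toy[["year", "Month", "Day"]].rename(columns={"Month": "month", "Day": "day"})

# Assemble one datetime per row from the year/month/day parts.
toy["Date"] = pd.to_datetime(parts)

# The year can then be recovered from the datetime column if needed.
toy["year_check"] = toy["Date"].dt.year

print(toy)
```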

In [6]:
# Extract year from temperature data (if applicable)

print("\n--- Temperature Data Before Year Extraction ---")
print(df_temperature.head())
print(df_temperature.dtypes)

# The year column was renamed to 'year' during standardization, so write to that name
if 'Date' in df_temperature.columns:  # Check if 'Date' column exists
    df_temperature['year'] = pd.to_datetime(df_temperature['Date']).dt.year
elif 'DateTime' in df_temperature.columns:  # Check if 'DateTime' column exists
    df_temperature['year'] = pd.to_datetime(df_temperature['DateTime']).dt.year

print("\n--- Temperature Data After Year Extraction ---")
print(df_temperature.head())
print(df_temperature.dtypes)


--- Temperature Data Before Year Extraction ---
   OID_  SiteID    Site_name  year  Month  Day  Time  TempC  TempF
0     1      10  Rattlesnake  2020      6   12  11.0  29.59  85.26
1     2      10  Rattlesnake  2020      6   12  12.0  29.76  85.57
2     3      10  Rattlesnake  2020      6   12  13.0  29.81  85.66
3     4      10  Rattlesnake  2020      6   12  14.0  30.19  86.34
4     5      10  Rattlesnake  2020      6   12  15.0  30.34  86.61
OID_           int64
SiteID         int64
Site_name     object
year           int64
Month          int64
Day            int64
Time         float64
TempC        float64
TempF        float64
dtype: object

--- Temperature Data After Year Extraction ---
   OID_  SiteID    Site_name  year  Month  Day  Time  TempC  TempF
0     1      10  Rattlesnake  2020      6   12  11.0  29.59  85.26
1     2      10  Rattlesnake  2020      6   12  12.0  29.76  85.57
2     3      10  Rattlesnake  2020      6   12  13.0  29.81  85.66
3     4      10  Rattlesnake  

# Merge Location Data 
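## This block demonstrates how to merge location data with the stony coral data using the 'SiteName' column. Merging might be necessary if you want to include location attributes in your coral analysis. The 'how' parameter in pd.merge() determines the type of join (left, right, inner, outer). The merge runs only if both DataFrames contain a 'SiteName' column; the standardization step renames 'Site_name' only in the stony coral and taxa groups data, so rename it in the location data as well if that file uses 'Site_name'.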
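
A left join can silently duplicate rows when the right-hand table has more than one row per key, and it can leave rows without a match. The sketch below shows how the `validate` and `indicator` parameters of `pd.merge` can surface both problems. The two toy DataFrames are assumptions for illustration, including the 'Unknown Reef' row, and do not reflect the real contents of the stations file.

```python
import pandas as pd

# Toy coral rows; "Unknown Reef" has no counterpart in the site table.
coral = pd.DataFrame({
    "SiteName": ["Rattlesnake", "Rattlesnake", "El Radabob", "Unknown Reef"],
    "StationID": [101, 102, 111, 999],
})

# Toy site table with exactly one row per site.
sites = pd.DataFrame({
    "SiteName": ["Rattlesnake", "El Radabob"],
    "Habitat": ["HB", "HB"],
})

merged = pd.merge(
    coral,
    sites,
    on="SiteName",
    how="left",
    validate="many_to_one",  # raises MergeError if sites repeats a SiteName
    indicator=True,          # adds a "_merge" column describing each row's match
)

# Rows from the left table that found no match on the right.
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print("Unmatched sites:", unmatched["SiteName"].tolist())
```

If the location table holds several rows per site, for example one per station, `validate="many_to_one"` will raise an error; in that case a more specific key or a de-duplicated site table is needed.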

In [7]:
# Merge location data (optional; if you need to combine location information)

print("\n--- DataFrames Before Merge (Example) ---")
print("Stony Coral Columns:", df_stony_coral.columns)
print("\nLocations Columns:", df_locations.columns)

if 'SiteName' in df_stony_coral.columns and 'SiteName' in df_locations.columns:
    df_stony_coral = pd.merge(df_stony_coral, df_locations, on='SiteName', how='left')
    print("\n--- Stony Coral Data After Merge ---")
    print(df_stony_coral.head())
    df_stony_coral.info()  # info() prints its summary itself and returns None

print("\n--- DataFrames After Optional Merge ---")
print("Stony Coral Columns:", df_stony_coral.columns)
print("\nTaxa Groups Columns:", df_taxa_groups.columns)
print("\nTemperature Columns:", df_temperature.columns)
print("\nLocations Columns:", df_locations.columns)


--- DataFrames Before Merge (Example) ---
Stony Coral Columns: Index(['OID_', 'year', 'Date', 'Subregion', 'Habitat', 'SiteID', 'SiteName',
       'StationID', 'Surveyed_all_years', 'points', 'Acropora_cervicornis',
       'Acropora_palmata', 'Agaricia_fragilis', 'Agaricia_lamarcki',
       'Cladocora_arbuscula', 'Colpophyllia_natans', 'Dendrogyra_cylindrus',
       'Dichocoenia_stokesii', 'Diploria_labyrinthiformis',
       'Eusmilia_fastigiata', 'Favia_fragum', 'Helioseris_cucullata',
       'Isophyllia_rigida', 'Isophyllia_sinuosa', 'Madracis_aurentenra',
       'Madracis_decactis_complex', 'Manicina_areolata',
       'Meandrina_meandrites', 'Millepora_alcicornis', 'Millepora_complanata',
       'Montastraea_cavernosa', 'Mussa_angulosa', 'Mycetophyllia_aliciae',
       'Mycetophyllia_ferox', 'Mycetophyllia_lamarckiana_complex',
       'Oculina_diffusa', 'Oculina_robusta', 'Orbicella_annularis_complex',
       'Phyllangia_americana', 'Porites_astreoides', 'Porites_porites_complex',


# Save Processed Data 
## This block saves the processed DataFrames to CSV files so that other notebooks can load them without repeating the preprocessing steps. An optional Parquet variant is included but commented out; uncomment those lines if you also want Parquet output.
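
After saving, a quick round-trip check can confirm that a file reloads with the same shape and column names. This is a minimal sketch using a toy DataFrame and a temporary directory; the file name `processed_example.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy DataFrame standing in for a processed table (values are illustrative only).
df = pd.DataFrame({
    "SiteID": [10, 11],
    "SiteName": ["Rattlesnake", "El Radabob"],
    "TempC": [29.59, 30.34],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "processed_example.csv"
    df.to_csv(path, index=False, encoding="utf-8")  # same options as the save step
    reloaded = pd.read_csv(path)                    # read back before cleanup

# Compare the structure of the saved and reloaded data.
same_shape = reloaded.shape == df.shape
same_columns = list(reloaded.columns) == list(df.columns)

print("Shape preserved:", same_shape)
print("Columns preserved:", same_columns)
```

CSV does not store dtypes, so columns such as dates come back as plain strings and need to be converted again after loading.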

In [10]:
# Save processed data 

# Define the output directory
output_dir = "C:\\temp"  # Or your desired output directory (e.g., "C:/temp")

# --- Robust Data Saving ---

# 1. Ensure Output Directory Exists
import os
os.makedirs(output_dir, exist_ok=True)  # Create directory if it doesn't exist

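# 2. Rebuild DataFrames from Values
# Note: df.values returns a single array, so frames with mixed types come back
# with object-dtype columns; step 3 below restores numeric dtypes where possible.
def rebuild_dataframe(df):
    """Rebuilds a Pandas DataFrame from its raw values and resets the index."""
    new_df = pd.DataFrame(df.values, columns=df.columns)
    return new_df.reset_index(drop=True)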

df_stony_coral = rebuild_dataframe(df_stony_coral)
df_taxa_groups = rebuild_dataframe(df_taxa_groups)
df_temperature = rebuild_dataframe(df_temperature)
df_locations = rebuild_dataframe(df_locations)

# 3. Handle Mixed Data Types (Convert to String if Necessary)
def handle_mixed_types(df):
    """Converts columns to numeric where possible, else to string."""
    for col in df.columns:
        try:
            df[col] = pd.to_numeric(df[col])
        except (ValueError, TypeError):  # to_numeric can raise either for unparseable values
            df[col] = df[col].astype(str)  # Note: missing values become the string 'nan'
    return df

df_stony_coral = handle_mixed_types(df_stony_coral)
df_taxa_groups = handle_mixed_types(df_taxa_groups)
df_temperature = handle_mixed_types(df_temperature)
df_locations = handle_mixed_types(df_locations)

# 4. Save to CSV (Most Reliable Format)
def save_to_csv(df, filename):
    """Saves a DataFrame to CSV in the output directory."""
    filepath = os.path.join(output_dir, filename)
    df.to_csv(filepath, index=False, encoding='utf-8')  # Explicit encoding
    print(f"Saved to: {filepath}")

save_to_csv(df_stony_coral, "processed_stony_coral.csv")
save_to_csv(df_taxa_groups, "processed_taxa_groups.csv")
save_to_csv(df_temperature, "processed_temperature.csv")
save_to_csv(df_locations, "processed_locations.csv")

# Example: Save to Parquet (Optional, if needed for large files)
# def save_to_parquet(df, filename):
#     """Saves a DataFrame to Parquet in the output directory."""
#     filepath = os.path.join(output_dir, filename)
#     df.to_parquet(filepath, index=False)
#     print(f"Saved to: {filepath}")

# save_to_parquet(df_stony_coral, "processed_stony_coral.parquet")
# save_to_parquet(df_taxa_groups, "processed_taxa_groups.parquet")
# save_to_parquet(df_temperature, "processed_temperature.parquet")
# save_to_parquet(df_locations, "processed_locations.parquet")

print("\n--- Data Saving Complete ---")
print(f"Processed data saved to: {output_dir}")

Saved to: C:\temp\processed_stony_coral.csv
Saved to: C:\temp\processed_taxa_groups.csv
Saved to: C:\temp\processed_temperature.csv
Saved to: C:\temp\processed_locations.csv

--- Data Saving Complete ---
Processed data saved to: C:\temp
