# Predicting Antibiotic Resistance in *Salmonella enterica* Using Isolate Data

## Introduction

Antibiotic resistance is a pressing public health issue with widespread economic consequences. As bacterial pathogens such as *Salmonella enterica* develop resistance to common antibiotics, infections become harder to treat, which raises healthcare costs and increases risk for affected populations. *Salmonella enterica* is a significant cause of foodborne illness globally, and its growing resistance complicates both treatment and control efforts.

This project aims to develop a predictive model that forecasts antibiotic resistance trends in *Salmonella enterica* from isolate data. Using machine learning techniques, the model will combine antibiotic susceptibility testing (AST) results with specimen metadata, such as the year of collection and geographic region, to identify emerging resistance patterns. These predictions are intended to inform public health strategies, optimize resource allocation, and ultimately improve food safety measures.

## Project Objectives

- **Forecast Antibiotic Resistance**: Develop a model that predicts resistance patterns in *Salmonella enterica* using AST results and metadata, enabling timely interventions.
- **Support Public Health and Food Safety**: Provide actionable insights to healthcare providers and policymakers that can guide decisions on resource allocation and risk management.
- **Explore Data-Driven Solutions**: Apply data science techniques, including data wrangling, exploratory data analysis, and machine learning, to address a real-world issue with far-reaching implications.

## Data Source

The dataset used in this project is sourced from the CDC National Antimicrobial Resistance Monitoring System (NARMS), which provides comprehensive surveillance data on antimicrobial resistance in foodborne pathogens. This dataset includes detailed information on antibiotic susceptibility testing and resistance determinants, making it well-suited for predictive modeling of resistance trends.

## Workflow Overview

1. **Data Wrangling**: Load and clean the dataset, handling missing values, duplicates, and outliers as necessary. Feature selection and transformation will prepare the data for analysis.
2. **Exploratory Data Analysis (EDA)**: Explore the data to identify trends and relationships that could inform the modeling process.
3. **Model Development**: Use machine learning algorithms (e.g., random forest, gradient boosting) to build a predictive model. The model will be evaluated based on metrics such as accuracy, precision, recall, and F1-score.
4. **Model Deployment and Interpretation**: Deploy the model with comprehensive documentation and interpret the results, highlighting potential applications in public health and food safety.
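
The evaluation metrics named in step 3 can be illustrated with a minimal sketch that computes them from confusion counts. The label vectors below are invented for illustration (1 = resistant, 0 = susceptible) and are not project results.

```python
# Hypothetical true and predicted labels (1 = resistant, 0 = susceptible)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

pairs = list(zip(y_true, y_pred))

# Confusion counts
tp = sum(t == 1 and p == 1 for t, p in pairs)  # resistant, predicted resistant
tn = sum(t == 0 and p == 0 for t, p in pairs)  # susceptible, predicted susceptible
fp = sum(t == 0 and p == 1 for t, p in pairs)  # susceptible, predicted resistant
fn = sum(t == 1 and p == 0 for t, p in pairs)  # resistant, predicted susceptible

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)   # share of predicted-resistant calls that are correct
recall = tp / (tp + fn)      # share of truly resistant isolates that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

When resistant isolates are rare, accuracy alone can look high even if few resistant isolates are detected, which is one reason to report precision, recall, and F1 alongside it.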

## Data Wrangling

In [1]:
# Import necessary libraries
import pandas as pd

In [2]:
# Load the Excel file
file_path = 'Salmonella_IsolateData.xlsx'  # Replace with your file path
salmonella_isolate_df = pd.read_excel(file_path)

In [3]:
# Display the first few rows of the dataset to understand its structure
salmonella_isolate_df.head()

Unnamed: 0,Specimen ID,NCBI Accession Number,WGS ID,AST Approved,WGS Approved,Genus,Species,Serotype,Data Year,Region Name,...,TEL Concl,TEL ConclPred,TET Equiv,TET Rslt,TET Concl,TET ConclPred,TIO Equiv,TIO Rslt,TIO Concl,TIO ConclPred
0,AM39779,,,yes,no,Salmonella,enterica,Typhi,2008,Region 6,...,,,>,32,R,,=,0.5,S,
1,AM42221,,,yes,no,Salmonella,enterica,Typhi,2009,Region 9,...,,,<=,4,S,,=,1.0,S,
2,AM40656,,,yes,no,Salmonella,enterica,Typhi,2009,Region 2,...,,,>,32,R,,=,1.0,S,
3,AM18807,,,yes,no,Salmonella,enterica,Typhi,2003,Region 2,...,,,>,32,R,,=,1.0,S,
4,AM07505,,,yes,no,Salmonella,enterica,Typhi,2000,Region 2,...,,,<=,4,S,,<=,0.5,S,


In [4]:
# Check for duplicates in the 'Specimen ID' column
duplicate_specimen_ids = salmonella_isolate_df[salmonella_isolate_df.duplicated(subset='Specimen ID', keep=False)]

# Count of duplicate entries
duplicate_count = duplicate_specimen_ids.shape[0]

duplicate_count

0
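
To make the behaviour of the duplicate check concrete, the sketch below runs it on a small, made-up frame. With `keep=False`, every row in a duplicated group is flagged, whereas the default `keep='first'` flags only the repeats. The specimen IDs here are hypothetical.

```python
import pandas as pd

# Hypothetical data: 'A2' appears twice
toy = pd.DataFrame({
    "Specimen ID": ["A1", "A2", "A2", "A3"],
    "Serotype": ["Typhi", "Typhi", "Typhi", "Enteritidis"],
})

# keep=False marks every member of a duplicated group
all_dupes = toy[toy.duplicated(subset="Specimen ID", keep=False)]

# The default keep='first' marks only occurrences after the first
later_dupes = toy[toy.duplicated(subset="Specimen ID")]

# A one-line check that the identifier is unique
is_unique = toy["Specimen ID"].is_unique

print(len(all_dupes), len(later_dupes), is_unique)
```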

In [5]:
# Identify columns with all missing values (100% NaN) before removing them
# This step is crucial to eliminate any columns that do not contribute to the analysis
columns_all_missing = salmonella_isolate_df.columns[salmonella_isolate_df.isnull().all()]
columns_all_missing

Index(['CLI Equiv', 'CLI Rslt', 'CLI Concl', 'ERY Equiv', 'ERY Rslt',
       'ERY Concl', 'FFN Equiv', 'FFN Rslt', 'FFN Concl', 'TEL Equiv',
       'TEL Rslt', 'TEL Concl'],
      dtype='object')

In [6]:
# Define the list of columns to be removed 
columns_to_remove = [
    'NCBI Accession Number', 'WGS ID', 'AST Approved', 'WGS Approved', 'Resistance Pattern',
    'Lost resistance on retest', 'CLI Equiv', 'CLI Rslt', 'CLI Concl', 'ERY Equiv', 'ERY Rslt',
    'ERY Concl', 'FFN Equiv', 'FFN Rslt', 'FFN Concl', 'TEL Equiv',
    'TEL Rslt', 'TEL Concl'
]

# Remove the specified columns from the dataset
salmonella_isolate_df_cleaned = salmonella_isolate_df.drop(columns=columns_to_remove)

# Display the shape of dataset before and after removing columns
print(salmonella_isolate_df.shape)
print(salmonella_isolate_df_cleaned.shape)

(9900, 148)
(9900, 130)


In [7]:
# Identify and remove predicted conclusion columns that are not needed for analysis
# Removing redundant information prevents confusion and maintains dataset clarity
columns_concl_pred_no_underscore = [col for col in salmonella_isolate_df_cleaned.columns if col.endswith('ConclPred')]
salmonella_isolate_df_cleaned = salmonella_isolate_df_cleaned.drop(columns=columns_concl_pred_no_underscore)

# Display the shape of the dataset after removing these predicted conclusion columns
salmonella_isolate_df_cleaned.shape

(9900, 97)

In [8]:
# Calculate the threshold for 90% missing values
# Identifying columns with excessive missing data allows for informed decisions on retention or removal
threshold = 0.9 * len(salmonella_isolate_df_cleaned)

columns_to_drop_90_percent_missing = salmonella_isolate_df_cleaned.columns[salmonella_isolate_df_cleaned.isnull().sum() > threshold]

# Drop columns with more than 90% missing values
salmonella_isolate_df_cleaned = salmonella_isolate_df_cleaned.drop(columns=columns_to_drop_90_percent_missing)

# Display the shape of the dataset after dropping columns with excessive missing values
salmonella_isolate_df_cleaned.shape

(9900, 70)
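
An equivalent way to express the same filter is to work with the fraction of missing values per column rather than an absolute count. The sketch below uses a made-up frame with hypothetical column names; note that the comparison is strict, so a column missing in exactly 90% of rows would be kept.

```python
import numpy as np
import pandas as pd

n_rows = 20
toy = pd.DataFrame({
    "Specimen ID": [f"S{i}" for i in range(n_rows)],
    "half_missing": [1.0, np.nan] * (n_rows // 2),      # 50% missing
    "mostly_missing": [1.0] + [np.nan] * (n_rows - 1),  # 95% missing
})

# isnull().mean() gives the fraction of missing values in each column
missing_fraction = toy.isnull().mean()

# Same rule as the cell above: drop columns missing in MORE than 90% of rows
to_drop = missing_fraction[missing_fraction > 0.9].index.tolist()
trimmed = toy.drop(columns=to_drop)

print(missing_fraction.round(2).to_dict())
print(to_drop)
```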

In [9]:
# Create the 'MAR_Level' column by counting the number of resistances (R) across all selected antibiotics
# This new feature allows for analysis of multi-antibiotic resistance levels
antibiotic_columns = [col for col in salmonella_isolate_df_cleaned.columns if col.endswith('Concl')]

salmonella_isolate_df_cleaned['MAR_Level'] = salmonella_isolate_df_cleaned[antibiotic_columns].apply(lambda row: row.str.count('R').sum(), axis=1)

# Sort MAR_Level counts in ascending order and display
mar_level_counts = salmonella_isolate_df_cleaned['MAR_Level'].value_counts().sort_index()
mar_level_counts

MAR_Level
0.0    2975
1.0    5104
2.0     577
3.0      97
4.0      61
5.0     231
6.0     538
7.0     251
8.0      66
Name: count, dtype: int64
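
As a cross-check on the row-wise `apply`, the same count can be computed with a vectorized comparison. This is an alternative sketch on invented data, not the method used above; `eq('R')` treats missing conclusions as not resistant and returns integer counts.

```python
import numpy as np
import pandas as pd

# Hypothetical conclusions for three isolates; NaN means no conclusion recorded
toy = pd.DataFrame({
    "AMP Concl": ["R", "S", "R"],
    "TET Concl": ["R", "S", np.nan],
    "CIP Concl": ["I", "S", "R"],
})
concl_cols = [c for c in toy.columns if c.endswith("Concl")]

# Row-wise version, as in the cell above
row_wise = toy[concl_cols].apply(lambda row: row.str.count("R").sum(), axis=1)

# Vectorized version: compare every cell to 'R', then sum across each row
vectorized = toy[concl_cols].eq("R").sum(axis=1)

print(vectorized.tolist())
```

On a frame with thousands of rows, the vectorized form avoids calling a Python function once per row.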

In [10]:
# Display all columns with 'object' data type
# This helps to identify potential categorical variables for further processing
object_columns = salmonella_isolate_df_cleaned.select_dtypes(include='object').columns

# Print the names of all object-type columns
print("Object-type columns:", object_columns.tolist())

Object-type columns: ['Specimen ID', 'Genus', 'Species', 'Serotype', 'Data Year', 'Region Name', 'Age Group', 'Specimen Source', 'Resistance Determinants', 'Predictive Resistance Pattern', 'AMI Equiv', 'AMI Concl', 'AMP Equiv', 'AMP Concl', 'AUG Equiv', 'AUG Concl', 'AXO Equiv', 'AXO Concl', 'AZM Equiv', 'AZM Concl', 'CEP Equiv', 'CEP Concl', 'CHL Equiv', 'CHL Concl', 'CIP Equiv', 'CIP Concl', 'COL Equiv', 'COL Concl', 'COT Equiv', 'COT Concl', 'FIS Equiv', 'FIS Concl', 'FOX Equiv', 'FOX Concl', 'GEN Equiv', 'GEN Concl', 'KAN Equiv', 'KAN Concl', 'MER Equiv', 'MER Concl', 'NAL Equiv', 'NAL Concl', 'SMX Equiv', 'SMX Concl', 'STR Equiv', 'STR Concl', 'TET Equiv', 'TET Concl', 'TIO Equiv', 'TIO Concl']


In [11]:
# Calculate the number of null values in each specified column
# Assessing null values in critical columns helps inform imputation strategies
columns_of_interest = ['Specimen ID', 'Genus', 'Species', 'Serotype', 'Data Year', 'Region Name', 'Age Group', 'Specimen Source']
null_values = salmonella_isolate_df_cleaned[columns_of_interest].isnull().sum()

# Display the null values for each column
null_values

Specimen ID          0
Genus                0
Species              0
Serotype             0
Data Year            0
Region Name          0
Age Group          308
Specimen Source     22
dtype: int64

In [12]:
# Remove any asterisks from the 'Data Year' column
# This cleans up the data, ensuring that year entries are properly formatted for analysis
salmonella_isolate_df_cleaned['Data Year'] = salmonella_isolate_df_cleaned['Data Year'].str.replace('*', '', regex=False)

# Convert the cleaned 'Data Year' column to numeric
salmonella_isolate_df_cleaned['Data Year'] = pd.to_numeric(salmonella_isolate_df_cleaned['Data Year'], errors='coerce')

In [13]:
# Check the data type of the 'Data Year' column to confirm it is in the correct numeric format
data_year_dtype = salmonella_isolate_df_cleaned['Data Year'].dtype

# Display the data type
data_year_dtype

dtype('int64')

In [14]:
# Confirm all values in 'Data Year' are numeric and correctly formatted
# Count the entries whose string form consists only of digits
numeric_value_count = salmonella_isolate_df_cleaned['Data Year'].astype(str).str.isnumeric().sum()
total_values = len(salmonella_isolate_df_cleaned['Data Year'])

# True only if every entry is numeric
all_numeric = (numeric_value_count == total_values)

# Display whether all values are numeric
all_numeric


True
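
Because `errors='coerce'` silently converts unparseable entries to `NaN`, it can help to list any values that failed to parse. The sketch below shows one way to do that on invented year strings; the `'n/a'` entry is a hypothetical bad value, not something observed in the dataset.

```python
import pandas as pd

# Hypothetical raw values: one flagged with an asterisk, one unparseable
raw_years = pd.Series(["2008", "2009*", "2003", "n/a"], name="Data Year")

# Same two steps as above: strip asterisks, then convert
stripped = raw_years.str.replace("*", "", regex=False)
years = pd.to_numeric(stripped, errors="coerce")

# Entries that became NaN during conversion, shown in their original form
unparsed = raw_years[years.isna()]

print(years.tolist())
print(unparsed.tolist())
```

An empty `unparsed` series confirms that every entry converted cleanly, which is consistent with the `int64` dtype reported above.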

In [15]:
# Impute missing values in 'Age Group' and 'Specimen Source' with the mode
# This ensures that missing categorical data is handled appropriately, maintaining the integrity of the dataset
salmonella_isolate_df_cleaned['Age Group'] = salmonella_isolate_df_cleaned['Age Group'].fillna(salmonella_isolate_df_cleaned['Age Group'].mode()[0])
salmonella_isolate_df_cleaned['Specimen Source'] = salmonella_isolate_df_cleaned['Specimen Source'].fillna(salmonella_isolate_df_cleaned['Specimen Source'].mode()[0])
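
Mode imputation assigns every missing entry to the most frequent category, which inflates that category's count. The sketch below contrasts it with an explicit placeholder category on invented data; the `'Unknown'` label is an assumption for illustration and is not used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical age groups with two missing entries
age = pd.Series(["30-39", "30-39", "10-19", np.nan, np.nan], name="Age Group")

# Approach used above: fill with the most common category
mode_filled = age.fillna(age.mode()[0])

# Alternative: keep missingness visible as its own category
flagged = age.fillna("Unknown")

mode_counts = mode_filled.value_counts()
flagged_counts = flagged.value_counts()

print(mode_counts["30-39"], flagged_counts["30-39"], flagged_counts["Unknown"])
```

With only 308 and 22 missing entries out of 9,900 rows, the distortion from mode imputation here is small, but the comparison is worth keeping in mind for sparser columns.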

In [16]:
# Calculate and display the number of null values in selected columns after imputation
columns_of_interest = ['Specimen ID', 'Genus', 'Species', 'Serotype', 'Data Year', 'Region Name', 'Age Group', 'Specimen Source', 'Resistance Determinants', 'Predictive Resistance Pattern']
null_values = salmonella_isolate_df_cleaned[columns_of_interest].isnull().sum()

# Display the null values for each column
null_values

Specimen ID                      0
Genus                            0
Species                          0
Serotype                         0
Data Year                        0
Region Name                      0
Age Group                        0
Specimen Source                  0
Resistance Determinants          0
Predictive Resistance Pattern    0
dtype: int64

In [17]:
# Display the data types of the specified columns to confirm their formats
data_types = salmonella_isolate_df_cleaned[columns_of_interest].dtypes

# Display the data types for each column
data_types

Specimen ID                      object
Genus                            object
Species                          object
Serotype                         object
Data Year                         int64
Region Name                      object
Age Group                        object
Specimen Source                  object
Resistance Determinants          object
Predictive Resistance Pattern    object
dtype: object

In [18]:
# Replace all null values in columns ending with 'Rslt' with 0
# Here 0 is a placeholder for a missing test result, not a measured value; it keeps the columns fully numeric
rslt_columns = [col for col in salmonella_isolate_df_cleaned.columns if col.endswith('Rslt')]

# Fill null values with 0 in these columns
salmonella_isolate_df_cleaned[rslt_columns] = salmonella_isolate_df_cleaned[rslt_columns].fillna(0)

# Display the first few rows to confirm changes
salmonella_isolate_df_cleaned[rslt_columns].head()


Unnamed: 0,AMI Rslt,AMP Rslt,AUG Rslt,AXO Rslt,AZM Rslt,CEP Rslt,CHL Rslt,CIP Rslt,COL Rslt,COT Rslt,FIS Rslt,FOX Rslt,GEN Rslt,KAN Rslt,MER Rslt,NAL Rslt,SMX Rslt,STR Rslt,TET Rslt,TIO Rslt
0,1.0,32,4.0,0.25,0.0,0.0,32,0.25,0.0,4.0,256.0,4.0,0.25,8.0,0.0,32.0,0.0,64.0,32,0.5
1,2.0,1,1.0,0.25,0.0,0.0,8,0.015,0.0,0.12,64.0,2.0,0.5,8.0,0.0,4.0,0.0,32.0,4,1.0
2,2.0,32,16.0,0.25,0.0,0.0,8,4.0,0.0,0.12,256.0,8.0,0.5,8.0,0.0,32.0,0.0,32.0,32,1.0
3,1.0,32,8.0,0.25,0.0,16.0,32,0.015,0.0,4.0,0.0,8.0,0.25,8.0,0.0,2.0,512.0,64.0,32,1.0
4,4.0,2,0.5,0.25,0.0,4.0,4,0.015,0.0,0.12,0.0,4.0,0.5,16.0,0.0,4.0,128.0,32.0,4,0.5
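
One consequence of the 0 placeholder is that summary statistics on `Rslt` columns include the filled rows. The sketch below, on invented values, shows how the mean shifts and how a mask of originally recorded results could be kept for later filtering; the numbers are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical results for one antibiotic; NaN means no result recorded
rslt = pd.Series([0.25, 0.5, np.nan, np.nan], name="AZM Rslt")

# Mask of rows that had a recorded result, captured before filling
recorded_mask = rslt.notna()

zero_filled = rslt.fillna(0)

mean_with_placeholders = zero_filled.mean()           # placeholders pull the mean down
mean_recorded_only = zero_filled[recorded_mask].mean()

print(mean_with_placeholders, mean_recorded_only)
```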


In [19]:
# Convert all 'Rslt' columns to float type
# Standardizing data types ensures compatibility with subsequent analyses and models
salmonella_isolate_df_cleaned[rslt_columns] = salmonella_isolate_df_cleaned[rslt_columns].astype('float64')

# Display the data types of the 'Rslt' columns to confirm the conversion
rslt_data_types = salmonella_isolate_df_cleaned[rslt_columns].dtypes

rslt_data_types

AMI Rslt    float64
AMP Rslt    float64
AUG Rslt    float64
AXO Rslt    float64
AZM Rslt    float64
CEP Rslt    float64
CHL Rslt    float64
CIP Rslt    float64
COL Rslt    float64
COT Rslt    float64
FIS Rslt    float64
FOX Rslt    float64
GEN Rslt    float64
KAN Rslt    float64
MER Rslt    float64
NAL Rslt    float64
SMX Rslt    float64
STR Rslt    float64
TET Rslt    float64
TIO Rslt    float64
dtype: object

In [20]:
# Select the 'Equiv' columns
equiv_columns = [col for col in salmonella_isolate_df_cleaned.columns if 'Equiv' in col]

# Count the null values in each 'Equiv' column
# (kept for inspection; a notebook cell renders only its last expression, so the counts are not shown below)
null_values_equiv = salmonella_isolate_df_cleaned[equiv_columns].isnull().sum()

# Display the data types of the 'Equiv' columns
salmonella_isolate_df_cleaned[equiv_columns].dtypes

AMI Equiv    object
AMP Equiv    object
AUG Equiv    object
AXO Equiv    object
AZM Equiv    object
CEP Equiv    object
CHL Equiv    object
CIP Equiv    object
COL Equiv    object
COT Equiv    object
FIS Equiv    object
FOX Equiv    object
GEN Equiv    object
KAN Equiv    object
MER Equiv    object
NAL Equiv    object
SMX Equiv    object
STR Equiv    object
TET Equiv    object
TIO Equiv    object
dtype: object

In [21]:
# Select the 'Concl' columns
concl_columns = [col for col in salmonella_isolate_df_cleaned.columns if 'Concl' in col]

# Count the null values in each 'Concl' column
# (kept for inspection; a notebook cell renders only its last expression, so the counts are not shown below)
null_values_concl = salmonella_isolate_df_cleaned[concl_columns].isnull().sum()

# Display the data types of the 'Concl' columns
salmonella_isolate_df_cleaned[concl_columns].dtypes

AMI Concl    object
AMP Concl    object
AUG Concl    object
AXO Concl    object
AZM Concl    object
CEP Concl    object
CHL Concl    object
CIP Concl    object
COL Concl    object
COT Concl    object
FIS Concl    object
FOX Concl    object
GEN Concl    object
KAN Concl    object
MER Concl    object
NAL Concl    object
SMX Concl    object
STR Concl    object
TET Concl    object
TIO Concl    object
dtype: object
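
The cell above selects columns with `'Concl' in col`, while other cells use `col.endswith('Concl')`. Both give the same result here only because the `ConclPred` columns were already dropped. The short sketch below, on a hypothetical column list, shows where the two tests differ.

```python
# Hypothetical column names, including a predicted-conclusion column
columns = ["TET Equiv", "TET Rslt", "TET Concl", "TET ConclPred"]

# Substring test: also matches 'TET ConclPred'
substring_match = [c for c in columns if "Concl" in c]

# Suffix test: matches only the observed-conclusion column
suffix_match = [c for c in columns if c.endswith("Concl")]

print(substring_match)
print(suffix_match)
```

Using the suffix test throughout would make the selection independent of the order in which columns are dropped.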

In [22]:
# Fill missing values in 'Equiv' columns with '=' to align with 'Rslt' = 0
# This maintains logical consistency in the dataset
for col in [col for col in salmonella_isolate_df_cleaned.columns if col.endswith('Equiv')]:
    salmonella_isolate_df_cleaned[col] = salmonella_isolate_df_cleaned[col].fillna('=')

In [23]:
# Impute missing values in 'Concl' columns with 'X' to indicate missing data
# This signifies that breakpoints or epidemiological cutoff values are not available for certain tests
for col in [col for col in salmonella_isolate_df_cleaned.columns if col.endswith('Concl')]:
    salmonella_isolate_df_cleaned[col] = salmonella_isolate_df_cleaned[col].fillna('X')

In [24]:
# Check the entire DataFrame for any remaining null values
# This will return the count of null values for each column in the DataFrame
null_values_df = salmonella_isolate_df_cleaned.isnull().sum()

# Display columns that still have null values, if any
null_values_df[null_values_df > 0]

Series([], dtype: int64)

In [25]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)  # Allow all columns to be shown

# Select numeric columns from the DataFrame
numeric_columns = salmonella_isolate_df_cleaned.select_dtypes(include=['float64', 'int64']).columns

# Get summary statistics for numeric columns in the DataFrame
numeric_summary = salmonella_isolate_df_cleaned[numeric_columns].describe()

# Display the summary statistics
print(numeric_summary)

         Data Year     AMI Rslt     AMP Rslt     AUG Rslt     AXO Rslt  \
count  9900.000000  9900.000000  9900.000000  9900.000000  9900.000000   
mean   2012.299293     0.548384     4.840909     1.921010     1.353283   
std       6.528499     0.976557     9.808060     2.494488     8.307389   
min    1999.000000     0.000000     1.000000     0.500000     0.250000   
25%    2007.000000     0.000000     1.000000     1.000000     0.250000   
50%    2012.000000     0.000000     1.000000     1.000000     0.250000   
75%    2018.000000     1.000000     2.000000     1.000000     0.250000   
max    2024.000000     8.000000    32.000000    32.000000    64.000000   

          AZM Rslt    CEP Rslt     CHL Rslt     CIP Rslt     COL Rslt  \
count  9900.000000  9900.00000  9900.000000  9900.000000  9900.000000   
mean      2.615152     0.36798     8.481414     0.408747     0.039874   
std       3.039666     1.61986     8.603403     0.800991     0.125916   
min       0.000000     0.00000     2.0000

In [26]:
# Display the first few rows of the cleaned DataFrame
salmonella_isolate_df_cleaned.head()

Unnamed: 0,Specimen ID,Genus,Species,Serotype,Data Year,Region Name,Age Group,Specimen Source,Resistance Determinants,Predictive Resistance Pattern,AMI Equiv,AMI Rslt,AMI Concl,AMP Equiv,AMP Rslt,AMP Concl,AUG Equiv,AUG Rslt,AUG Concl,AXO Equiv,AXO Rslt,AXO Concl,AZM Equiv,AZM Rslt,AZM Concl,CEP Equiv,CEP Rslt,CEP Concl,CHL Equiv,CHL Rslt,CHL Concl,CIP Equiv,CIP Rslt,CIP Concl,COL Equiv,COL Rslt,COL Concl,COT Equiv,COT Rslt,COT Concl,FIS Equiv,FIS Rslt,FIS Concl,FOX Equiv,FOX Rslt,FOX Concl,GEN Equiv,GEN Rslt,GEN Concl,KAN Equiv,KAN Rslt,KAN Concl,MER Equiv,MER Rslt,MER Concl,NAL Equiv,NAL Rslt,NAL Concl,SMX Equiv,SMX Rslt,SMX Concl,STR Equiv,STR Rslt,STR Concl,TET Equiv,TET Rslt,TET Concl,TIO Equiv,TIO Rslt,TIO Concl,MAR_Level
0,AM39779,Salmonella,enterica,Typhi,2008,Region 6,10-19,Blood,Not sequenced,Not sequenced,=,1.0,S,>,32.0,R,=,4.0,S,<=,0.25,S,=,0.0,X,=,0.0,X,>,32.0,R,=,0.25,I,=,0.0,X,>,4.0,R,>,256.0,R,=,4.0,S,<=,0.25,S,<=,8.0,S,=,0.0,X,>,32.0,R,=,0.0,X,>,64.0,R,>,32.0,R,=,0.5,S,7.0
1,AM42221,Salmonella,enterica,Typhi,2009,Region 9,30-39,Stool,Not sequenced,Not sequenced,=,2.0,S,<=,1.0,S,<=,1.0,S,<=,0.25,S,=,0.0,X,=,0.0,X,=,8.0,S,<=,0.015,S,=,0.0,X,<=,0.12,S,=,64.0,S,=,2.0,S,=,0.5,S,<=,8.0,S,=,0.0,X,=,4.0,S,=,0.0,X,<=,32.0,S,<=,4.0,S,=,1.0,S,0.0
2,AM40656,Salmonella,enterica,Typhi,2009,Region 2,30-39,Blood,Not sequenced,Not sequenced,=,2.0,S,>,32.0,R,=,16.0,I,<=,0.25,S,=,0.0,X,=,0.0,X,=,8.0,S,=,4.0,R,=,0.0,X,<=,0.12,S,>,256.0,R,=,8.0,S,=,0.5,S,<=,8.0,S,=,0.0,X,>,32.0,R,=,0.0,X,<=,32.0,S,>,32.0,R,=,1.0,S,5.0
3,AM18807,Salmonella,enterica,Typhi,2003,Region 2,10-19,Stool,Not sequenced,Not sequenced,=,1.0,S,>,32.0,R,=,8.0,S,<=,0.25,S,=,0.0,X,=,16.0,I,>,32.0,R,<=,0.015,S,=,0.0,X,>,4.0,R,=,0.0,X,=,8.0,S,<=,0.25,S,<=,8.0,S,=,0.0,X,=,2.0,S,>,512.0,R,>,64.0,R,>,32.0,R,=,1.0,S,6.0
4,AM07505,Salmonella,enterica,Typhi,2000,Region 2,20-29,Blood,Not sequenced,Not sequenced,<=,4.0,S,<=,2.0,S,<=,0.5,S,<=,0.25,S,=,0.0,X,=,4.0,S,<=,4.0,S,<=,0.015,S,=,0.0,X,<=,0.12,S,=,0.0,X,<=,4.0,S,=,0.5,S,<=,16.0,S,=,0.0,X,<=,4.0,S,<=,128.0,S,<=,32.0,S,<=,4.0,S,<=,0.5,S,0.0


In [27]:
# Save the cleaned DataFrame to a CSV file
file_path = 'salmonella_isolate_df_cleaned.csv'  # Specify the desired file path and name
salmonella_isolate_df_cleaned.to_csv(file_path, index=False)

# Output file path for confirmation
file_path

'salmonella_isolate_df_cleaned.csv'
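
The `index=False` argument matters when the file is read back. The sketch below round-trips a small, made-up frame through an in-memory buffer instead of a file and shows that writing the index adds an extra `Unnamed: 0` column on reload.

```python
import io

import pandas as pd

# Hypothetical one-row frame with a few of the cleaned columns
toy = pd.DataFrame({
    "Specimen ID": ["AM39779"],
    "Data Year": [2008],
    "AMP Equiv": [">"],
    "AMP Rslt": [32.0],
})

# Written without the index, the reloaded frame has the same columns
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Written with the default index=True, the index comes back as a column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

print(list(without_index.columns))
print(list(with_index.columns))
```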

## Summary of Data Wrangling and Cleaning Process

In this notebook, we performed a comprehensive data wrangling and cleaning process on the *Salmonella enterica* isolate dataset to prepare it for analysis and modeling. The following key steps were undertaken:

1. **Data Loading**: The dataset, an Excel export of isolate data from the CDC National Antimicrobial Resistance Monitoring System (NARMS), was loaded into a pandas DataFrame.

2. **Checking for Duplicates**: We confirmed that `Specimen ID`, our unique identifier, contains no duplicate entries.

3. **Initial Exploration**: We displayed the first few rows of the dataset to gain insights into its structure, including the types of variables and potential issues such as missing values.

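4. **Removal of Redundant Columns**: Columns with more than 90% missing values (including those that were entirely empty), the predicted-conclusion (`ConclPred`) columns, and other columns deemed unnecessary for analysis were identified and removed. This step reduced the dataset from 148 to 70 columns and keeps the focus on relevant features.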

5. **Handling Missing Values**: 
   - Missing values in critical categorical columns (`Age Group` and `Specimen Source`) were filled with the most common category (mode) to maintain data integrity.
   - For the `Concl` columns, missing entries were imputed with 'X', indicating that breakpoints or epidemiological cutoff values are unavailable.
   - `Equiv` columns with missing data were filled with '=', aligning them logically with their corresponding `Rslt` values filled with 0.

6. **Data Type Verification**: We checked the data types of the metadata and test-result columns and converted them where needed: `Data Year` was converted to an integer after stripping asterisks, and the `Rslt` columns were converted to `float64`.

7. **Creation of New Features**: A new feature, `MAR_Level`, was derived by counting, for each isolate, the antibiotics with a resistant (`R`) conclusion, providing a quantitative measure of multi-antibiotic resistance.

8. **Final Validation**: A final check of the entire dataset found no remaining null values.

9. **Summary Statistics**: We generated summary statistics to understand the distribution of numeric data.

10. **Data Saving**: The cleaned dataset was saved to a CSV file so that subsequent analyses can start from the cleaned data without rerunning the preprocessing.

By following these steps, we have prepared a clean, structured dataset that is ready for exploratory data analysis and predictive modeling, which will support the overall objectives of this project in understanding antibiotic resistance patterns in *Salmonella enterica*.
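
The steps summarized above can be condensed into a single function for reuse. The sketch below is a hypothetical helper (`clean_isolates` is not defined in this notebook) run on a small invented frame. It assumes the same column-suffix conventions, uses a vectorized `eq('R')` count in place of the row-wise `apply`, and omits the manual list of dropped columns.

```python
import numpy as np
import pandas as pd


def clean_isolates(df: pd.DataFrame, max_missing: float = 0.9) -> pd.DataFrame:
    """Hypothetical helper condensing the cleaning steps of this notebook."""
    out = df.copy()

    # Drop predicted-conclusion columns, then columns missing in more than max_missing of rows
    out = out.drop(columns=[c for c in out.columns if c.endswith("ConclPred")])
    too_sparse = out.columns[out.isnull().mean() > max_missing]
    out = out.drop(columns=too_sparse)

    concl = [c for c in out.columns if c.endswith("Concl")]
    rslt = [c for c in out.columns if c.endswith("Rslt")]
    equiv = [c for c in out.columns if c.endswith("Equiv")]

    # Count resistant ('R') conclusions per isolate before placeholders are added
    out["MAR_Level"] = out[concl].eq("R").sum(axis=1)

    # Strip asterisks from the year and convert it to a number
    year_text = out["Data Year"].astype(str).str.replace("*", "", regex=False)
    out["Data Year"] = pd.to_numeric(year_text, errors="coerce")

    # Mode-impute the two categorical metadata columns
    for col in ["Age Group", "Specimen Source"]:
        out[col] = out[col].fillna(out[col].mode()[0])

    # Placeholders for missing test data: 0 / '=' / 'X'
    out[rslt] = out[rslt].fillna(0).astype("float64")
    out[equiv] = out[equiv].fillna("=")
    out[concl] = out[concl].fillna("X")
    return out


# Invented three-row example
toy = pd.DataFrame({
    "Specimen ID": ["S1", "S2", "S3"],
    "Data Year": ["2008", "2009*", "2003"],
    "Age Group": ["30-39", None, "30-39"],
    "Specimen Source": ["Blood", "Blood", None],
    "AMP Equiv": [">", "<=", None],
    "AMP Rslt": [32.0, 1.0, np.nan],
    "AMP Concl": ["R", "S", None],
    "AMP ConclPred": [None, None, None],
    "TET Equiv": [">", "<=", ">"],
    "TET Rslt": [32.0, 4.0, 32.0],
    "TET Concl": ["R", "S", "R"],
})

cleaned = clean_isolates(toy)
print(cleaned[["Specimen ID", "Data Year", "MAR_Level"]])
```

Wrapping the steps in one function makes it easier to rerun the cleaning on a refreshed NARMS export and to test each rule on small inputs.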