# Notebook 1: Data Ingestion and Cleaning

This notebook handles the initial phase of the project: loading the raw data, performing essential cleaning operations, and preparing it for exploratory data analysis.

## Objectives
- Load the dataset from the source file.
- Inspect the data for missing values, duplicates, and inconsistencies.
- Remove irrelevant columns that do not contribute to the predictive model.
- Save the cleaned data to a new file for subsequent analysis.



## 1. Import Libraries
Import the necessary Python libraries for data manipulation and analysis.


In [None]:
import pandas as pd
import numpy as np


## 2. Load the Data
Load the raw dataset from the `data` directory.


In [None]:
data = pd.read_csv('data/data.csv')
data.head()
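
Beyond `data.head()`, a quick structural check can cover the inspection objective (missing values, duplicates, inconsistencies). The sketch below is a minimal example under stated assumptions: it uses a small made-up frame in place of `data`, the column names and values are illustrative only, and `summarize` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Stand-in for `data`; columns and values are made up for illustration.
sample = pd.DataFrame({
    "id": [101, 102, 102, 104],
    "diagnosis": ["M", "B", "B", None],
    "radius_mean": [17.99, 13.54, 13.54, 12.45],
})

def summarize(df: pd.DataFrame) -> dict:
    """Collect basic structural facts about a DataFrame (hypothetical helper)."""
    return {
        "shape": df.shape,                              # (rows, columns)
        "duplicate_rows": int(df.duplicated().sum()),   # rows identical to an earlier row
        "missing_cells": int(df.isnull().sum().sum()),  # total missing cells
        "dtypes": df.dtypes.astype(str).to_dict(),      # column name -> dtype name
    }

summary = summarize(sample)
print(summary)
```

Calling `summarize(data)` on the loaded dataset would report the same fields for the real file.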


## 3. Data Cleaning
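Clean the data by dropping columns that are not useful for modeling, then check for missing values. The `id` column is a unique identifier for each patient and carries no information for the classification task. The `Unnamed: 32` column, when present, is dropped as well; pandas assigns names of the form `Unnamed: <position>` to header fields that have no name.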


In [None]:
# Drop columns that are not needed for analysis, if they exist:
# 'id' is a patient identifier; 'Unnamed: 32' is a nameless column created during parsing
data = data.drop(columns=['id', 'Unnamed: 32'], errors='ignore')

# Count missing values per column
print(data.isnull().sum())
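
To illustrate where a column such as `Unnamed: 32` can come from, the sketch below parses a tiny made-up CSV in which every line ends with a trailing comma. The CSV text is an assumption for illustration; whether `data/data.csv` has this exact cause is not confirmed here. Dropping all-empty columns with `dropna(axis=1, how="all")` is shown as an alternative to naming the column explicitly, not as the approach used in the cell above.

```python
import io

import pandas as pd

# Made-up CSV text; every line ends with a trailing comma.
raw = io.StringIO(
    "id,diagnosis,radius_mean,\n"
    "101,M,17.99,\n"
    "102,B,13.54,\n"
)
demo = pd.read_csv(raw)
print(list(demo.columns))  # the fourth, nameless field appears as 'Unnamed: 3'

# Drop the identifier, then drop any column that is entirely empty.
cleaned = demo.drop(columns=["id"], errors="ignore").dropna(axis=1, how="all")
print(list(cleaned.columns))

# Count the remaining missing values per column.
print(cleaned.isnull().sum())
```

Dropping by emptiness also removes any other fully empty column, so naming the column explicitly is the safer choice when only one known column should go.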


## 4. Save the Cleaned Data
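Save the cleaned DataFrame to a new CSV file, which the subsequent notebooks use for analysis and modeling. Passing `index=False` keeps the row index out of the file, so it does not come back as an extra unnamed column when the file is reloaded.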


In [None]:
data.to_csv('data/clean-data.csv', index=False)
print("Cleaned data saved to data/clean-data.csv")
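
A round-trip read is one way to confirm that the file was written as intended. This is a minimal sketch under stated assumptions: it writes a small made-up frame to a temporary directory rather than to `data/clean-data.csv`, and the column names and values are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up stand-in for the cleaned DataFrame.
cleaned = pd.DataFrame({
    "diagnosis": ["M", "B"],
    "radius_mean": [17.99, 13.54],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clean-data.csv"
    cleaned.to_csv(out_path, index=False)  # index=False: no extra index column
    reloaded = pd.read_csv(out_path)

# The reloaded frame should match the original in shape and column names.
same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
print(same_shape, same_columns)
```

The same comparison can be run against the real output by reading `data/clean-data.csv` back and comparing it with `data`.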

