![Digital Futures](https://github.com/digital-futures-academy/DataScienceMasterResources/blob/main/Resources/datascience-notebook-header.png?raw=true)

## Learner Stories

```txt
As a DATA PROFESSIONAL,  
I want to be able to use bulk data import techniques,  
so that I can efficiently load large datasets into my database
```

# What is Bulk Data Importing?

Bulk data importing loads a large dataset into a database in a small number of large operations rather than one row at a time. It is a routine task for data professionals, and doing it well saves load time and database resources.
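
To make the contrast concrete, here is a minimal sketch that loads the same rows twice: once row by row with a commit per row, and once in bulk inside a single transaction. It uses the standard-library `sqlite3` module and a hypothetical two-column `customers` table, both chosen for illustration only.

```python
import sqlite3

rows = [(i, f"Customer {i}") for i in range(1, 1001)]

def make_db():
    """Create a fresh in-memory database with an empty customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
    return conn

# Row by row: one statement and one commit per row
slow = make_db()
for row in rows:
    slow.execute("INSERT INTO customers VALUES (?, ?)", row)
    slow.commit()

# Bulk: all rows passed in one call, inside one transaction
fast = make_db()
with fast:  # commits on success, rolls back on error
    fast.executemany("INSERT INTO customers VALUES (?, ?)", rows)

slow_count = slow.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
fast_count = fast.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(slow_count, fast_count)
```

Both approaches end with the same table contents. The difference is in how many statements and commits the database has to process along the way.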

## What are the benefits of Bulk Data Importing?


1. **Efficiency**:
    - Loading many rows per statement cuts the per-row overhead of network round trips, statement parsing, and commits.
2. **Scalability**:
    - The same batched approach keeps working as datasets grow, because the data is processed in chunks rather than sent all at once.
3. **Ease of use**:
    - Most databases and libraries provide a built-in bulk loader (for example, `to_sql()` in pandas), so a load often needs only a few lines of code.
4. **Flexibility**:
    - Bulk loaders accept a range of source formats, such as CSV, Parquet, and Avro.
5. **Reliability**:
    - When the load runs inside a transaction, a failure can be rolled back so that the table is not left partially loaded.
6. **Cost-effective**:
    - Shorter load times use less compute and less engineering time.
7. **Performance**:
    - Fewer, larger writes usually finish much sooner than row-by-row inserts, especially when index maintenance is deferred until after the load.
8. **Security**:
    - Validating data before the load and running it through a single, controlled import path reduces the chance of malformed or unexpected data reaching the database.
9. **Compatibility**:
    - Most relational databases provide a bulk-load path, so the technique carries over from one database system to another.
10. **Automation**:
    - A bulk import is easy to script and schedule, so recurring loads can run without manual intervention.

---

## What are the Best Practices for Bulk Data Importing?

1. ***Use Efficient Data Formats***
    - **CSV**: Common and easy to use, but can be inefficient for large datasets.
    - **Parquet/ORC**: Columnar storage formats that are highly efficient for large-scale data processing.
    - **Avro**: A row-based storage format that is compact and efficient for serialization.
2. ***Batch Processing***
    - **Batch Size**: Import data in batches to avoid overwhelming the database and to manage memory usage.
    - **Parallel Processing**: Use parallel processing to speed up the import process.
3. ***Index Management***
    - **Disable Indexes**: Temporarily disable indexes during bulk import to speed up the process.
    - **Rebuild Indexes**: Rebuild indexes after the import is complete to ensure data integrity and query performance.
4. ***Data Validation and Cleansing***
    - **Pre-Validation**: Validate data before import to catch errors early.
    - **Post-Validation**: Validate data after import to ensure data integrity.
5. ***Error Handling and Logging***
    - **Error Logging**: Log errors during import to identify and fix issues.
    - **Retry Mechanism**: Implement a retry mechanism for transient errors.
6. ***Transaction Management***
    - **Transactions**: Use transactions to ensure data consistency and to roll back in case of errors.
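
Several of these practices (batching, error logging, retries, and transactions) can be combined in one loop. The following is a minimal sketch, not a production loader. It assumes the standard-library `sqlite3` module, a hypothetical two-column `customers` table, and that `sqlite3.OperationalError` (for example, "database is locked") is the only error worth retrying. The helper names `batches` and `import_in_batches` are illustrative.

```python
import logging
import sqlite3
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("bulk_import")

def batches(rows, size):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def import_in_batches(conn, rows, batch_size=1000, max_retries=3):
    """Insert rows batch by batch; each batch is its own transaction."""
    imported = 0
    for number, batch in enumerate(batches(rows, batch_size), start=1):
        for attempt in range(1, max_retries + 1):
            try:
                with conn:  # commits on success, rolls back on error
                    conn.executemany(
                        "INSERT INTO customers (customer_id, name) VALUES (?, ?)",
                        batch,
                    )
                imported += len(batch)
                break
            except sqlite3.OperationalError as exc:
                # Assumed to be transient; log it and retry after a short pause
                log.warning("Batch %d attempt %d failed: %s", number, attempt, exc)
                if attempt == max_retries:
                    raise
                time.sleep(0.1 * attempt)
    return imported

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
rows = [(i, f"Customer {i}") for i in range(1, 2501)]
total = import_in_batches(conn, rows)
log.info("Imported %d rows", total)
```

Committing per batch means a failure loses at most one batch of work, at the cost of leaving earlier batches in place. Wrapping the whole load in a single transaction is the alternative when an all-or-nothing result matters more.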

---

## Demonstration: Bulk Data Importing with Python

### Set Up the Environment and Dataset

In [7]:
import pandas as pd
from sqlalchemy import create_engine, text
import sqlite3

In [8]:
data = {
    'customer_id': range(1, 10001),  # 10,000 rows
    'name': [f'Customer {i}' for i in range(1, 10001)],
    'email': [f'customer{i}@example.com' for i in range(1, 10001)],
    'city': ['City A' if i % 2 == 0 else 'City B' for i in range(1, 10001)]
}

In [9]:
# Put the data into a DataFrame
df = pd.DataFrame(data)
df.head()

|   | customer_id | name | email | city |
|---|---|---|---|---|
| 0 | 1 | Customer 1 | customer1@example.com | City B |
| 1 | 2 | Customer 2 | customer2@example.com | City A |
| 2 | 3 | Customer 3 | customer3@example.com | City B |
| 3 | 4 | Customer 4 | customer4@example.com | City A |
| 4 | 5 | Customer 5 | customer5@example.com | City B |


In [10]:
# Create an in-memory SQLite database
engine = create_engine('sqlite:///:memory:')

---

## Steps to Import the Data in Bulk

### Step 1: Create a table to import the data into

Remember that this should reflect the structure of the data you are importing.

In [11]:
# Create the customers table
# engine.begin() commits the transaction when the block exits without an error
with engine.begin() as conn:
    conn.execute(text('''
        CREATE TABLE customers (
            customer_id INT PRIMARY KEY,
            name VARCHAR(255),
            email VARCHAR(255),
            city VARCHAR(255)
        );
    '''))

### Step 2: Bulk import the data into the database

In [12]:
# Bulk import data into the database
df.to_sql('customers', engine, if_exists='append', index=False, method='multi', chunksize=1000)

10000

#### What does `df.to_sql()` do?

The `to_sql()` method in pandas allows you to write a DataFrame to a SQL database. This method is a convenient way to import data from a DataFrame into a SQL database.

The parameters to this function are:

- `name`: The name of the table to write to. The table is created if it does not already exist.
- `con`: The database connection object (in this case `engine`).
- `if_exists`: What to do if the table already exists. Options are `'fail'`, `'replace'`, `'append'`.
- `index`: Whether to write the DataFrame index as a column in the table.
- `method`: The method to use for inserting data. Options are `None` (the default, one `INSERT` per row), `'multi'` (several rows in a single `INSERT` statement), or a callable that performs the insert itself.
- `chunksize`: The number of rows to write at a time.
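
The effect of `if_exists` is easiest to see by calling `to_sql()` several times on the same table. This sketch assumes pandas is installed and, to stay self-contained, passes a standard-library `sqlite3` connection as `con` instead of the SQLAlchemy engine used in this notebook. The three-row DataFrame is made up for illustration.

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
df = pd.DataFrame({"customer_id": [1, 2, 3], "city": ["City B", "City A", "City B"]})

def row_count():
    """Return the current number of rows in the customers table."""
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

# The table does not exist yet, so 'append' creates it
df.to_sql("customers", conn, if_exists="append", index=False)
count_after_first = row_count()

# The table exists now, so 'append' adds the rows again
df.to_sql("customers", conn, if_exists="append", index=False)
count_after_append = row_count()

# 'replace' drops the table and recreates it from the DataFrame
df.to_sql("customers", conn, if_exists="replace", index=False)
count_after_replace = row_count()

# 'fail' raises ValueError because the table already exists
try:
    df.to_sql("customers", conn, if_exists="fail", index=False)
    failed = False
except ValueError:
    failed = True

print(count_after_first, count_after_append, count_after_replace, failed)
```

Note that `'replace'` discards the existing table definition along with its rows, so `'append'` is the usual choice when the table was created up front with its own constraints.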

#### Is `chunksize` important?

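Yes. By default, pandas writes all rows at once. With `chunksize` set, it writes the DataFrame in batches of that many rows, which bounds memory use and keeps each write to a manageable size. With `method='multi'`, `chunksize` also limits how many rows are packed into each `INSERT` statement.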

#### What are suggested values for `chunksize`?

The best value for `chunksize` depends on the width of each row, the available memory, and any limits the database places on statement size. It does not need to divide the number of rows evenly: `chunksize` is the maximum number of rows written at a time, and the final chunk simply holds whatever is left over. A value between 1,000 and 10,000 rows is a reasonable starting point; measure the load time and adjust from there.
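
The arithmetic behind a given `chunksize` is simple enough to check by hand. The sketch below uses a hypothetical helper, `chunk_plan`, to work out how many chunks are written, how many rows land in the last chunk, and how many values one multi-row `INSERT` carries. That last figure matters with `method='multi'`, because many databases cap the number of bound parameters per statement.

```python
import math

def chunk_plan(n_rows, n_columns, chunksize):
    """Return (number of chunks, rows in the last chunk, values per full chunk)."""
    n_chunks = math.ceil(n_rows / chunksize)
    last_chunk_rows = n_rows - (n_chunks - 1) * chunksize
    values_per_chunk = chunksize * n_columns
    return n_chunks, last_chunk_rows, values_per_chunk

# The 10,000-row, four-column DataFrame from this notebook
even_plan = chunk_plan(10_000, 4, 1_000)
uneven_plan = chunk_plan(10_000, 4, 3_000)

print(even_plan)    # (10, 1000, 4000)
print(uneven_plan)  # (4, 1000, 12000)
```

With a chunk size of 3,000, the first three chunks are full and the fourth holds the remaining 1,000 rows, so an uneven split causes no problem.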

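#### Is `to_sql()` Transactional?

In the common case, yes. When `to_sql()` is given a SQLAlchemy `Engine`, as in this example, pandas runs the inserts inside a transaction, so if an error occurs partway through, the rows written so far are rolled back. The exact behavior depends on the pandas version, the driver, and the database, so confirm it for your own stack. If you pass an existing `Connection` instead, committing or rolling back is governed by the transaction you manage on that connection.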
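
To see what a rollback looks like in practice, here is a minimal sketch that uses the standard-library `sqlite3` module rather than pandas. The two-column table and the deliberately duplicated `customer_id` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")

# The last row repeats customer_id 1 and violates the primary key
rows = [(1, "Customer 1"), (2, "Customer 2"), (1, "Duplicate")]

try:
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
except sqlite3.IntegrityError as exc:
    print(f"Import failed and was rolled back: {exc}")

remaining = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(remaining)  # 0: the two valid rows were rolled back as well
```

The first two rows are valid, but because all three inserts share one transaction, the failure on the third row leaves the table empty.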

### Step 3: Verify the data has been imported

In [14]:
# Verify the import by querying the database
with engine.connect() as conn:
    result = conn.execute(text('SELECT COUNT(*) FROM customers'))
    print(f'Total rows imported: {result.fetchone()[0]}')

    # Optionally, display a few rows to verify
    result = conn.execute(text('SELECT * FROM customers LIMIT 5'))
    for row in result:
        print(row)


Total rows imported: 10000
(1, 'Customer 1', 'customer1@example.com', 'City B')
(2, 'Customer 2', 'customer2@example.com', 'City A')
(3, 'Customer 3', 'customer3@example.com', 'City B')
(4, 'Customer 4', 'customer4@example.com', 'City A')
(5, 'Customer 5', 'customer5@example.com', 'City B')
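
A row count confirms that something was loaded, but not that it was loaded correctly. As a sketch of slightly stronger post-import validation, the checks below compare the table against the source rows. It assumes the standard-library `sqlite3` module and a reduced two-column `customers` table; the helper name `post_import_checks` is hypothetical.

```python
import sqlite3

source = [(i, f"customer{i}@example.com") for i in range(1, 101)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)")
with conn:
    conn.executemany("INSERT INTO customers VALUES (?, ?)", source)

def post_import_checks(conn, source):
    """Compare simple aggregates in the table with the source rows."""
    count, id_sum, null_emails, distinct_emails = conn.execute(
        """
        SELECT COUNT(*),
               SUM(customer_id),
               SUM(email IS NULL),
               COUNT(DISTINCT email)
        FROM customers
        """
    ).fetchone()
    return {
        "row_count_matches": count == len(source),
        "id_sum_matches": id_sum == sum(row[0] for row in source),
        "no_null_emails": null_emails == 0,
        "emails_unique": distinct_emails == count,
    }

checks = post_import_checks(conn, source)
for name, passed in checks.items():
    print(f"{name}: {passed}")
```

Which checks are worth running depends on the dataset. Sums and distinct counts are cheap and catch truncated or duplicated loads that a bare row count can miss.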


## Summary

- ***Efficient Data Formats***: Use efficient data formats like Parquet, ORC, or Avro for large-scale data processing.
- ***Batch Processing***: Import data in batches and use parallel processing to speed up the import process.
- ***Index Management***: Disable indexes during import and rebuild them afterward to improve performance.
- ***Data Validation and Cleansing***: Validate data before and after import to ensure data integrity.
- ***Error Handling and Logging***: Implement error logging and retry mechanisms to handle errors during import.
- ***Transaction Management***: Use transactions to ensure data consistency and to roll back in case of errors.

---


## Extended Example

In [15]:
import pandas as pd
from sqlalchemy import create_engine, text
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Generate the same 10,000-row dataset as before
data = {
    'customer_id': range(1, 10001),  # 10,000 rows
    'name': [f'Customer {i}' for i in range(1, 10001)],
    'email': [f'customer{i}@example.com' for i in range(1, 10001)],
    'city': ['City A' if i % 2 == 0 else 'City B' for i in range(1, 10001)]
}
df = pd.DataFrame(data)

# Data validation and cleansing
def validate_and_cleanse(df):
    # Example validation: Ensure no missing values
    if df.isnull().values.any():
        raise ValueError("Data contains missing values")
    # Example cleansing: Convert all email addresses to lowercase
    df['email'] = df['email'].str.lower()
    return df

try:
    df = validate_and_cleanse(df)
except ValueError as e:
    logging.error(f"Data validation error: {e}")
    raise

# Create an SQLite in-memory database connection
engine = create_engine('sqlite:///:memory:')

# Create the customers table
# engine.begin() commits the transaction when the block exits without an error
with engine.begin() as conn:
    conn.execute(text('''
        CREATE TABLE customers (
            customer_id INT PRIMARY KEY,
            name VARCHAR(255),
            email VARCHAR(255),
            city VARCHAR(255)
        );
    '''))

    # Disable indexes (if applicable)
    # Note: SQLite does not support disabling indexes, but this is a placeholder for databases that do
    # (the statement below uses SQL Server syntax and assumes an index named idx_customers_email)
    # conn.execute(text('ALTER INDEX idx_customers_email ON customers DISABLE;'))

# Bulk import data into the database
try:
    df.to_sql('customers', engine, if_exists='append', index=False, method='multi', chunksize=1000)
    logging.info("Data imported successfully")
except Exception as e:
    logging.error(f"Error during data import: {e}")
    raise

# Rebuild indexes (if applicable)
with engine.connect() as conn:
    # Note: no index was disabled above, so there is nothing to rebuild in this example.
    # SQLite can rebuild a table's indexes with 'REINDEX customers;'
    # (PostgreSQL uses 'REINDEX TABLE customers;')
    # conn.execute(text('REINDEX customers;'))

    # Verify the import by querying the database
    result = conn.execute(text('SELECT COUNT(*) FROM customers'))
    logging.info(f'Total rows imported: {result.fetchone()[0]}')

    # Optionally, display a few rows to verify
    result = conn.execute(text('SELECT * FROM customers LIMIT 5'))
    for row in result:
        logging.info(row)

2024-12-12 12:06:06,061 - INFO - Data imported successfully
2024-12-12 12:06:06,064 - INFO - Total rows imported: 10000
2024-12-12 12:06:06,065 - INFO - (1, 'Customer 1', 'customer1@example.com', 'City B')
2024-12-12 12:06:06,065 - INFO - (2, 'Customer 2', 'customer2@example.com', 'City A')
2024-12-12 12:06:06,066 - INFO - (3, 'Customer 3', 'customer3@example.com', 'City B')
2024-12-12 12:06:06,066 - INFO - (4, 'Customer 4', 'customer4@example.com', 'City A')
2024-12-12 12:06:06,066 - INFO - (5, 'Customer 5', 'customer5@example.com', 'City B')
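
The index placeholders in the extended example can be made concrete for SQLite, which cannot disable an index but can drop it before the load and recreate it afterward. The sketch below assumes the standard-library `sqlite3` module, a reduced two-column `customers` table, and a secondary index named `idx_customers_email`, matching the name used in the placeholder comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE INDEX idx_customers_email ON customers (email)")

rows = [(i, f"customer{i}@example.com") for i in range(1, 10001)]

# Drop the secondary index so it is not updated for every inserted row
conn.execute("DROP INDEX IF EXISTS idx_customers_email")

# Load all rows in a single transaction
with conn:
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)

# Recreate the index once, over the fully loaded table
conn.execute("CREATE INDEX idx_customers_email ON customers (email)")

# Each row of PRAGMA index_list is (seq, name, unique, origin, partial)
index_names = [row[1] for row in conn.execute("PRAGMA index_list(customers)")]
row_total = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(index_names, row_total)
```

Whether this pays off depends on the table size and the number of indexes. For a small load into a table with one index, the difference may be negligible, so it is worth timing both variants.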
