# Authenticating with S3 Bucket: Reading Data Objects into Pandas DataFrame

## Introduction

This notebook was created by [Jupyter AI](https://github.com/jupyterlab/jupyter-ai) with the following prompt:

> /generate create a notebook that authenticates to an s3 bucket reads the data objects into a dataframe

This notebook shows how to authenticate with an S3 bucket using the AWS SDK for Python (boto3) and environment variables, then read data objects from the bucket into a pandas DataFrame. The steps are: set up AWS credentials, create an S3 client, list the objects in the bucket, select and read a data object, and run basic exploratory data analysis on the resulting DataFrame.

## Set Up AWS Credentials

Never hardcode real AWS access keys or secret keys in a script or notebook. The cells below supply credentials through environment variables, which boto3 reads automatically. The key values shown are placeholders; in practice, set these variables outside the notebook (for example, in the shell that launches Jupyter) so that no secret is saved with the file:

In [None]:
import os
import boto3
from botocore.exceptions import NoCredentialsError

In [None]:
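# Option 1: Set up AWS credentials using environment variables.
# The values below are placeholders. Prefer exporting these variables in the
# shell that launches Jupyter so that no secret is stored in the notebook.
os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_ACCESS_KEY'
os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_SECRET_KEY'
# boto3 reads the credentials from the environment automatically
s3 = boto3.resource('s3')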

In [None]:
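# Test the AWS S3 connection by listing all the buckets in your account
from botocore.exceptions import ClientError

try:
    for bucket in s3.buckets.all():
        print(bucket.name)
except NoCredentialsError:
    print("No AWS credentials found")
except ClientError as e:
    # Raised when credentials are present but invalid or lack permission
    print(f"AWS rejected the request: {e}")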
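
Before creating any AWS objects, it can help to confirm that the credential variables are actually set. The following is a minimal sketch, assuming environment variables are the only credential source in use (boto3 can also resolve credentials from other places, such as the shared credentials file). The helper name `missing_aws_env_vars` is hypothetical and not part of boto3.

```python
import os

# The two variables that boto3 reads for static credentials
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")

def missing_aws_env_vars(environ=None):
    """Return the names of required AWS variables that are unset or empty.

    `environ` defaults to os.environ; passing a dict makes the check easy to test.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_aws_env_vars()
if missing:
    print("Set these variables before starting Jupyter:", ", ".join(missing))
else:
    print("AWS credential variables are present.")
```

This check only reports whether the variables exist. It does not verify that the keys are valid; the bucket listing above is what exercises the credentials against AWS.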

## Create S3 Client

In [None]:
import boto3

# Placeholders only: passing keys explicitly is shown for completeness.
# Do not save real credentials in a notebook.
ACCESS_KEY = 'your-access-key'
SECRET_KEY = 'your-secret-key'

# Create an S3 client object using the provided credentials
s3_client = boto3.client('s3', aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)

Passing the keys explicitly is optional. If the credentials are already available through environment variables or the shared AWS credentials file, `boto3.client('s3')` with no key arguments picks them up automatically, which keeps secrets out of the notebook.

## List Objects in the Bucket

The function below prints the key of every object in a bucket. It catches missing credentials and other errors so that a failure produces a readable message:

In [None]:
# Import necessary libraries
import boto3
from botocore.exceptions import NoCredentialsError

In [None]:
def list_objects_in_bucket(bucket_name):
    # Define S3 resource
    s3 = boto3.resource('s3')

    try:
        bucket = s3.Bucket(bucket_name)
        objects = bucket.objects.all()

        # Print out each object key
        for obj in objects:
            print(obj.key)
    except NoCredentialsError:
        print("No AWS credentials found")
    except Exception as e:
        print(f"An error occurred: {e}")

# Specify your bucket name and call the function
bucket_name = 'your-bucket-name'
list_objects_in_bucket(bucket_name)

The try/except block handles errors raised while listing the objects in the specified S3 bucket. If no AWS credentials are found, the function prints a short message. For any other exception, it prints the exception text, which makes debugging easier.
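
The resource API used above handles paging internally. With the low-level client, a single `list_objects_v2` call returns at most 1,000 keys, so larger buckets need a paginator. The following is a minimal sketch of that approach; the function name `iter_object_keys` and the `prefix` filter are illustrative choices, not part of the original notebook. The client is passed in as an argument so the function does not create its own connection.

```python
def iter_object_keys(s3_client, bucket_name, prefix=""):
    """Yield every object key in the bucket whose key starts with `prefix`.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # A page has no "Contents" entry when it holds no matching objects
        for entry in page.get("Contents", []):
            yield entry["Key"]

# Example usage (requires valid credentials and a real bucket name):
#   import boto3
#   for key in iter_object_keys(boto3.client("s3"), "your-bucket-name", prefix="data/"):
#       print(key)
```

Yielding keys one at a time avoids building the full listing in memory, which matters for buckets with many objects.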

## Select and Read Data Object

In [None]:
import pandas as pd
import boto3
from io import BytesIO

s3 = boto3.client('s3')

# Placeholders: replace with your bucket name and object key
bucket = 'your-bucket'
key = 'path/to/your/object.csv'

# Download the object and read its body as bytes
obj = s3.get_object(Bucket=bucket, Key=key)
data = obj['Body'].read()

# Wrap the bytes in a file-like object so pandas can parse them
data_file = BytesIO(data)

df = pd.read_csv(data_file, delimiter=',')  # adjust parameters as needed
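
The key above is hardcoded. When a bucket holds many objects, one option is to select the key from a listing instead. The following is a rough sketch, assuming the listing entries are dictionaries with `Key` and `LastModified` fields, which is the shape of the `Contents` items returned by `list_objects_v2`. The helper name `latest_key` and the sample entries are hypothetical.

```python
from datetime import datetime, timezone

def latest_key(entries, suffix=".csv"):
    """Return the key of the most recently modified entry ending in `suffix`.

    Returns None when no entry matches.
    """
    candidates = [e for e in entries if e["Key"].endswith(suffix)]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e["LastModified"])["Key"]

# Hypothetical entries, shaped like the "Contents" items of list_objects_v2
entries = [
    {"Key": "reports/2023.csv", "LastModified": datetime(2023, 1, 5, tzinfo=timezone.utc)},
    {"Key": "reports/2024.csv", "LastModified": datetime(2024, 1, 5, tzinfo=timezone.utc)},
    {"Key": "reports/notes.txt", "LastModified": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]
print(latest_key(entries))  # prints reports/2024.csv
```

The returned key can then be passed to `get_object` in place of the hardcoded value.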

## Explore DataFrame

In [None]:
import numpy as np
import pandas as pd

def explore_dataframe(df):
    # Display the DataFrame shape as (rows, columns)
    print("Shape of DataFrame:", df.shape)

    # Display the first few rows of the DataFrame
    print("\nFirst few rows of the DataFrame:")
    display(df.head())

    # Count the missing values in each column
    print("\nMissing values in the DataFrame:")
    display(df.isnull().sum())

    # Summary statistics for numeric columns in the DataFrame
    print("\nSummary statistics for numeric columns:")
    display(df.describe(include=[np.number]))

explore_dataframe(df)