# Week 5, Class 1: Introduction to Pandas: Series and DataFrames

## 1. What is Pandas?
**Pandas** is a fast, powerful, flexible, and easy-to-use open-source data analysis and manipulation tool, built on top of Python. Its two primary data structures, the **Series** and the **DataFrame**, are designed to handle structured data such as tables, spreadsheets, and time series.

* `Series`: A one-dimensional labeled array. Think of it as a single column of a spreadsheet.

* `DataFrame`: A two-dimensional labeled data structure with columns of potentially different types. It's the equivalent of a spreadsheet or an SQL table.

Pandas is built on NumPy, so many of the performance benefits we saw with vectorized operations carry over.
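
As a small illustration of the vectorized behavior mentioned above, the sketch below applies arithmetic and a NumPy function to a whole `Series` at once. The values and the names `celsius`, `fahrenheit`, and `log_celsius` are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical Celsius readings, used only for illustration
celsius = pd.Series([25.5, 26.1, 24.9, 27.0])

# Arithmetic is applied to every element at once; no explicit loop is needed
fahrenheit = celsius * 9 / 5 + 32

# NumPy functions accept a Series and return a Series
log_celsius = np.log(celsius)

print(fahrenheit)
print(type(log_celsius))
```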

In [None]:
import pandas as pd
import numpy as np

## 2. The Pandas Series
A `Series` is a one-dimensional array-like object capable of holding data of any type (e.g., integers, floats, strings, Python objects). It has an **index**, a set of labels (one per element) used to look values up. Labels are usually unique, but Pandas does not require them to be.

### Creating a Series
You can create a `Series` from various Python objects.

#### From a Python List
When creating a `Series` from a list, Pandas automatically creates a default integer index starting from 0.

In [None]:
# Create a Series from a list
temperatures = pd.Series([25.5, 26.1, 24.9, 27.0])
print(temperatures)

# You can access the data and the index separately
print(f"\nSeries data: {temperatures.values}")
print(f"Series index: {temperatures.index}")

# Access a single element by its index label (here, the default integer label 1)
print(temperatures[1])

#### With a Custom Index

You can explicitly provide a custom index, which can be useful for labeling your data points.

In [None]:
# Create a Series of pH values with custom sample IDs as the index
ph_values = pd.Series([6.8, 7.1, 7.0, 6.9], index=["Sample_A", "Sample_B", "Sample_C", "Sample_D"])
print(ph_values)

# Accessing elements by their custom label (index)
print(f"\nValue for Sample_B: {ph_values['Sample_B']}")
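
With a custom index, it helps to be explicit about whether you mean a label or a position. The sketch below uses the `.loc` and `.iloc` accessors, which these notes have not introduced yet; treat it as a preview rather than the method used in this class.

```python
import pandas as pd

ph_values = pd.Series(
    [6.8, 7.1, 7.0, 6.9],
    index=["Sample_A", "Sample_B", "Sample_C", "Sample_D"],
)

# .loc selects by index label
by_label = ph_values.loc["Sample_B"]

# .iloc selects by integer position (0-based), regardless of the labels
by_position = ph_values.iloc[1]

print(by_label, by_position)
```

Both lookups return the same element here because `Sample_B` is the second entry.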

#### From a Dictionary

If you create a `Series` from a dictionary, the keys of the dictionary become the index of the Series, and the values become the data.

In [None]:
# Create a Series from a dictionary of chemical properties
chemical_properties = {
    "Density": 1.15,
    "Boiling_Point": 150.2,
    "Melting_Point": -5.3
}
properties_series = pd.Series(chemical_properties)
print(properties_series)

## 3. The Pandas `DataFrame`

A **`DataFrame`** is the primary workhorse of Pandas. It's a two-dimensional, size-mutable, tabular data structure with labeled axes (rows and columns). It's essentially a dictionary of `Series` objects that share the same index.
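
To make the "dictionary of `Series`" picture concrete, here is a minimal sketch that builds a `DataFrame` from two `Series` sharing one index. The names `weights`, `ph`, and `samples`, and the values, are illustrative assumptions.

```python
import pandas as pd

# Two hypothetical Series that share the same index labels
weights = pd.Series([10.5, 12.1, 9.8], index=["A_1", "A_2", "B_1"])
ph = pd.Series([7.2, 7.5, 6.9], index=["A_1", "A_2", "B_1"])

# Each dictionary key becomes a column; the shared index labels the rows
samples = pd.DataFrame({"Weight_g": weights, "pH_Level": ph})
print(samples)

# Pulling a column back out returns a Series with that same index
weight_column = samples["Weight_g"]
print(type(weight_column))
print(weight_column.index.equals(samples.index))
```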

### Creating a `DataFrame`

The most common way to create a `DataFrame` in scientific applications is from a dictionary, where the keys become the column names and the values are the data for each column.

In [None]:
# Create a dictionary where each key is a column name and the value is a list of data
experiment_data = {
    "Sample_ID": ["A_1", "A_2", "B_1", "B_2"],
    "Weight_g": [10.5, 12.1, 9.8, 11.3],
    "pH_Level": [7.2, 7.5, 6.9, 7.1]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(experiment_data)
print(df)

Note how Pandas automatically created a default integer index (0, 1, 2, 3) for the rows.

### Specifying Column Order

The order of columns in the DataFrame will be based on the order of keys in your dictionary. You can explicitly specify the order if you need to.

In [None]:
# Using the same data as above
df_ordered = pd.DataFrame(experiment_data, columns=["Weight_g", "pH_Level", "Sample_ID"])
print(df_ordered)

### From a List of Dictionaries

Another common way to create a DataFrame is from a list of dictionaries, where each dictionary represents a row.

In [None]:
# Each dictionary represents a row
list_of_rows = [
    {"sample": "X", "conc": 0.5, "temp": 25.0},
    {"sample": "Y", "conc": 0.8, "temp": 24.5},
    {"sample": "Z", "conc": 0.6, "temp": 25.1}
]

df_from_list = pd.DataFrame(list_of_rows)
print(df_from_list)
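
The row dictionaries do not all need the same keys. As a hedged sketch with made-up rows, the example below omits `temp` from one row. Pandas still builds the column and marks the missing entry as `NaN`.

```python
import pandas as pd

# Hypothetical rows; the second one has no "temp" entry
rows = [
    {"sample": "X", "conc": 0.5, "temp": 25.0},
    {"sample": "Y", "conc": 0.8},
    {"sample": "Z", "conc": 0.6, "temp": 25.1},
]

df_rows = pd.DataFrame(rows)
print(df_rows)

# isna() flags the entry that was missing from the input
print(df_rows["temp"].isna())
```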

## 4. DataFrame Attributes

Once you have a DataFrame, you can inspect its structure using various attributes.

In [None]:
print(df)

In [None]:
# Using the `df` from our first example
print(f"DataFrame Index: {df.index}")
print(f"DataFrame Columns: {df.columns}")
print(f"DataFrame Shape (rows, columns): {df.shape}")
print(f"DataFrame Number of Dimensions: {df.ndim}")
print(f"DataFrame Data Types:\n{df.dtypes}")

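The `dtype` of a column is the type of data it holds, similar to the NumPy `dtype` we've seen before. Pandas has traditionally reported string columns as `object`, its general dtype for arbitrary Python objects. Depending on your Pandas version, you may see a dedicated string dtype instead.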
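
As a brief sketch of how column dtypes behave, the example below inspects the dtypes of a small made-up table and converts one column with `astype`. The `readings` table and the use of `astype` are additions for illustration and not part of the examples above.

```python
import pandas as pd

# Hypothetical readings, used only for illustration
readings = pd.DataFrame({
    "run": [1, 2, 3],
    "yield_g": [15.2, 16.1, 15.8],
})

# Integer and float columns get separate numeric dtypes
print(readings.dtypes)

# astype returns a new Series with the requested dtype;
# the original column is left unchanged
run_as_float = readings["run"].astype("float64")
print(run_as_float.dtype)
```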

## 5. First Look at Data Analysis with a `DataFrame`

The power of DataFrames comes from their ability to handle tabular data in a structured way. Here's a preview of some common operations you'll perform, which we'll cover in more detail later.

### 5.1. Selecting Columns

Selecting a single column from a DataFrame returns a Pandas `Series`. Selecting a list of columns returns a new `DataFrame`.

In [None]:
# Create a DataFrame for a quick example
data = {
    'sample_id': ['A01', 'A02', 'B01', 'B02'],
    'temperature_c': [25.5, 26.1, 24.9, 27.0],
    'ph_level': [7.2, 7.5, 6.9, 7.1],
    'pressure_kPa': [101.3, 101.5, 100.9, 101.8]
}
df = pd.DataFrame(data)

# Select a single column using bracket notation (returns a Series)
temperatures_series = df['temperature_c']
print(f"Type of selected column: {type(temperatures_series)}")
print(f"\nTemperature column:\n{temperatures_series}")

# Select multiple columns (returns a DataFrame)
subset_df = df[['sample_id', 'ph_level']]
print(f"\nType of selected columns: {type(subset_df)}")
print(f"\nSubset DataFrame:\n{subset_df}")

### 5.2. Basic Descriptive Statistics

You can quickly get descriptive statistics for your numerical columns using simple methods.

In [None]:
# Get a summary of statistics for all numerical columns
print("Descriptive statistics for numerical columns:")
print(df.describe())

# Get the average of a specific column
average_temp = df['temperature_c'].mean()
print(f"\nAverage temperature: {average_temp:.2f}°C")

# Find the maximum value in a column
max_pressure = df['pressure_kPa'].max()
print(f"Maximum pressure: {max_pressure:.1f} kPa")

## Summary and Key Takeaways

* **Pandas** is a core library for data analysis, providing powerful, labeled data structures.
* A **`Series`** is a one-dimensional, labeled array.
* A **`DataFrame`** is a two-dimensional table-like structure, similar to a spreadsheet.
* The most common way to create a `DataFrame` is from a dictionary of lists, where the keys become column names.
* Key attributes like `.index`, `.columns`, `.shape`, and `.dtypes` are useful for quickly inspecting your data.
* You can easily select one or more columns and perform quick statistical analysis on them.

## Exercises

Complete the following exercises in a new Python script or a new Jupyter Notebook.

1.  **Create and Inspect a `Series`:**
    * Create a list of 5 molecular weights: `molecular_weights = [18.015, 36.46, 98.08, 58.44, 44.01]`.
    * Create a `pd.Series` from this list.
    * Print the `Series`.
    * Print the values and index of the `Series` separately.

2.  **Create a `DataFrame` from an Experiment:**
    * You have the following experimental data:
        * `Run_Number`: `[1, 2, 3, 4, 5]`
        * `Yield_g`: `[15.2, 16.1, 15.8, 17.0, 16.5]`
        * `Catalyst_Type`: `["A", "B", "A", "C", "B"]`
    * Create a dictionary from this data.
    * Create a `pd.DataFrame` named `yield_df` from this dictionary.
    * Print the `yield_df`.

3.  **Inspect Your `DataFrame`:**
    * Using the `yield_df` you created in Exercise 2:
    * Print the number of rows and columns using the `.shape` attribute.
    * Print the data type of each column using the `.dtypes` attribute.
    * Print a list of all column names using the `.columns` attribute.

4.  **Create a DataFrame with a Custom Index:**
    * Using the `experiment_data` dictionary from the class notes, create a new `DataFrame` called `df_indexed`.
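    * This time, set the `Sample_ID` column as the index for the DataFrame. You can do this by using the `index` parameter during DataFrame creation, or by calling `.set_index('Sample_ID')` on a DataFrame built from `experiment_data`. Note that `set_index` returns a new DataFrame, so assign the result to `df_indexed`.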
    * Print the new `df_indexed` to see the effect. What is different about the output?

5.  **Calculate and Print Statistics:**
    * Using the `yield_df` from Exercise 2:
    * Select and print the `Yield_g` column.
    * Calculate and print the average (`mean`) and standard deviation (`std`) of the `Yield_g` column.
    * Find and print the maximum value in the `Yield_g` column.