# 1. Introduction
**Pandas** stands for **Pan**el **da**ta, which is an economics term for multi-dimensional datasets. Pandas is built on top of NumPy and has two primary data structures:
   1. Series (1D data structure)
   2. DataFrame (2D data structure)

If you want higher dimensional data structures you should use the [xarray package](https://pypi.org/project/xarray/).

Pandas Series and DataFrames are typically easier to work with than NumPy arrays because they can use labels as index values. That is, you can give a row or a column a heading label and then use the label to obtain the data value, whereas in NumPy you need to use the integer positions. A DataFrame can also hold a different data type in each column, while a NumPy array has a single data type for all of its elements.

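To make the contrast concrete, here is a minimal sketch that stores the same three values in a NumPy array and in a Series. The element symbols and masses are illustrative values chosen for this example, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# NumPy: values can only be reached by integer position
masses_arr = np.array([1.008, 4.0026, 6.94])
print(masses_arr[1])        # second element, by position

# pandas: the same values can also be reached by label
masses = pd.Series([1.008, 4.0026, 6.94], index=["H", "He", "Li"])
print(masses["He"])         # by label
print(masses.iloc[1])       # by position still works

# A DataFrame can hold a different data type in each column
elements = pd.DataFrame({"Symbol": ["H", "He", "Li"], "Mass": [1.008, 4.0026, 6.94]})
print(elements.dtypes)
```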
## Pandas DataFrames vs. Numpy Arrays
<br>

| Feature | **Pandas DataFrame** | **NumPy Array** |
|---------|----------------------|-----------------|
| Type | 2D table with labeled rows & columns | N-dimensional homogeneous array |
| Supports multiple data types? |Yes (e.g., strings in one column and numbers in another) |No (all elements must be the same type) |
| Supports heterogeneous data? |Yes (each column can have different types) |No (all elements must be the same `dtype`) |
| Row & column labels |Yes (uses `Index` and column names) |No (only positional indexing) |
| Access by labels? |Yes (`df.loc["row_label", "column_label"]`) |No (only by index position) |
| Access by index position? |Yes (`df.iloc[1, 2]`) |Yes (`arr[1, 2]`) |
| Supports missing values? |Yes (`NaN`) |Limited (`np.nan` in float arrays, or masked arrays) |
| Built-in functions for missing data? |Yes (`df.fillna()`, `df.dropna()`) |No (requires workarounds) |
| Supports row/column operations? |Yes (`df.mean(axis=0)`, `df.sum(axis=1)`) |Yes (`arr.mean(axis=0)`, `arr.sum(axis=1)`) |
| Supports grouping & aggregation? |Yes (`df.groupby()`, `df.agg()`) |No |
| Supports reshaping? |Yes (`df.pivot()`, `df.melt()`) |Yes (`arr.reshape()`) |
| Supports merging/joining? |Yes (`df.merge()`, `df.join()`) |No |
| Mutable |Yes (you can modify/add/delete columns, rows or elements)|Partially (you can modify elements but not change shape)|
| Read/Write CSV? |Yes (`df.to_csv()`, `pd.read_csv()`) |Limited (`np.savetxt()`, `np.loadtxt()`) |
| Read/Write JSON? |Yes (`df.to_json()`, `pd.read_json()`) |No |
| Read/Write Excel? |Yes (`df.to_excel()`, `pd.read_excel()`) |No |
| Save/Load Binary Format? |Yes (`df.to_pickle()`) |Yes (`np.save()`, `np.load()`) |

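A few rows of the table can be checked directly. The following is a minimal sketch with made-up measurement data (the column names `sample` and `mass_g` are assumptions for illustration) that shows missing-value handling and grouping in pandas next to the NumPy workaround.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "sample": ["A", "A", "B", "B"],
    "mass_g": [1.0, np.nan, 3.0, 5.0],
})

# pandas skips NaN by default and has dedicated helpers
print(df["mass_g"].mean())          # NaN is ignored
filled = df["mass_g"].fillna(0.0)   # replace NaN with 0.0
dropped = df.dropna()               # drop rows that contain NaN

# Grouping and aggregation in one call
group_means = df.groupby("sample")["mass_g"].mean()
print(group_means)

# NumPy: a plain mean propagates NaN; np.nanmean is the workaround
arr = df["mass_g"].to_numpy()
print(np.mean(arr), np.nanmean(arr))
```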
If you have not installed pandas, activate your conda virtual environment and install it from your terminal:
```bash
conda activate NameOfEnvironment
conda install -c conda-forge pandas
```
Now check which version is installed:

In [None]:
import pandas as pd
print(pd.__version__)

# 2. Pandas Object Overview
## 2.1 Functions
### Series Functions (series only)
| Function | Syntax | Parameters | Description |
|----------|--------|------------|-------------|
| `str` | `series.str.method()` | String-specific method | Enables vectorized string operations. |
| `dt` | `series.dt.method()` | Datetime-specific method | Enables date/time manipulations. |
| `map()` | `series.map(func)` | `func`: Function to apply | Maps function to each value in the Series. |
| `unique()` | `series.unique()` | No parameters | Returns unique values in the Series. |
| `nunique()` | `series.nunique()` | No parameters | Returns the number of unique values. |
| `value_counts()` | `series.value_counts()` | `normalize`, `sort`, `ascending`, `bins` | Returns frequency counts of unique values. |
| `idxmax()` | `series.idxmax()` | No parameters | Returns index of the maximum value. |
| `idxmin()` | `series.idxmin()` | No parameters | Returns index of the minimum value. |
| `cumsum()` | `series.cumsum()` | No parameters | Computes the cumulative sum. |
| `cumprod()` | `series.cumprod()` | No parameters | Computes the cumulative product. |
| `cummax()` | `series.cummax()` | No parameters | Computes the cumulative max. |
| `cummin()` | `series.cummin()` | No parameters | Computes the cumulative min. |
| `shift()` | `series.shift(periods=1)` | `periods` (int, default=1) | Shifts values up or down. |
| `diff()` | `series.diff(periods=1)` | `periods` (int, default=1) | Computes the difference between elements. |
| `rank()` | `series.rank(method='average')` | `method`: Ranking method | Computes ranks of elements. |

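Several of these methods are easiest to understand on a tiny Series. The sketch below uses arbitrary integers and labels chosen only for illustration.

```python
import pandas as pd

s = pd.Series([3, 1, 3, 2], index=["a", "b", "c", "d"])

print(s.unique())         # distinct values in order of appearance
print(s.nunique())        # number of distinct values
print(s.value_counts())   # frequency of each value
print(s.idxmax())         # label of the first maximum
print(s.cumsum())         # running total
print(s.shift(1))         # values moved down one position; the first becomes NaN
print(s.diff())           # difference from the previous element
print(s.rank())           # ties share the average rank by default

# The .str accessor applies a string method to every element
symbols = pd.Series(["na", "mg", "al"])
print(symbols.str.capitalize())
```

Note that `idxmax()` returns an index label (here a string), not an integer position.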

### Dataframe functions (df only)
| Function | Syntax | Parameters | Description |
|----------|--------|------------|-------------|
| `groupby()` | `df.groupby(by)` | `by`: Column(s) to group by | Groups data and applies aggregation. |
| `pivot()` | `df.pivot(index, columns, values)` | `index`, `columns`, `values` | Reshapes data by pivoting columns. |
| `pivot_table()` | `df.pivot_table(values, index, columns)` | `values`, `index`, `columns`, `aggfunc` | Similar to `pivot()`, but allows aggregation. |
| `melt()` | `df.melt(id_vars, value_vars)` | `id_vars`, `value_vars` | Converts wide format to long format. |
| `merge()` | `df.merge(df2, on, how)` | `on`: Key column, `how`: Join type | Merges two DataFrames on a key. |
| `join()` | `df.join(df2, on)` | `on`: Key column | Joins two DataFrames based on index. |
| `concat()` | `pd.concat([df1, df2], axis=0)` | `axis`: 0 for rows, 1 for columns | Concatenates multiple DataFrames. |
| `stack()` | `df.stack()` | No parameters | Converts columns into rows (long format). |
| `unstack()` | `df.unstack()` | No parameters | Converts rows into columns (wide format). |
| `explode()` | `df.explode(column)` | `column`: Column to explode | Expands list-like column values into rows. |
| `T` | `df.T` | No parameters | Transposes rows and columns. |
| `corr()` | `df.corr()` | No parameters | Computes correlation between columns. |
| `cov()` | `df.cov()` | No parameters | Computes covariance between columns. |
| `duplicated()` | `df.duplicated(subset, keep='first')` | `subset`: Columns to check | Identifies duplicate rows. |
| `drop_duplicates()` | `df.drop_duplicates(subset, keep='first')` | `subset`: Columns to check | Removes duplicate rows. |
| `sample()` | `df.sample(n=5, frac=None)` | `n`: Number of samples, `frac`: Fraction | Returns a random sample. |
| `nlargest()` | `df.nlargest(n, columns)` | `n`: Number of rows, `columns`: Sort column | Returns top `n` largest values in a column. |
| `nsmallest()` | `df.nsmallest(n, columns)` | `n`: Number of rows, `columns`: Sort column | Returns top `n` smallest values in a column. |
| `fillna()` | `df.fillna(value)` | `value`: Value to fill missing data | Fills missing values. |
| `dropna()` | `df.dropna(axis=0)` | `axis`: 0 (rows), 1 (columns) | Drops missing values. |
| `replace()` | `df.replace(to_replace, value)` | `to_replace`, `value` | Replaces specific values. |

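As a rough illustration of how `merge()`, `groupby()`, `melt()`, and `pivot()` fit together, the sketch below combines two small tables and reshapes the result. The table contents and the `Family` column are assumptions made for this example.

```python
import pandas as pd

families = pd.DataFrame({
    "Element": ["Li", "Na", "F", "Cl"],
    "Family": ["alkali", "alkali", "halogen", "halogen"],
})
properties = pd.DataFrame({
    "Element": ["Li", "Na", "F", "Cl"],
    "Z": [3, 11, 9, 17],
    "Mass": [6.94, 22.99, 19.00, 35.45],
})

# merge: combine the two tables on their shared key column
merged = families.merge(properties, on="Element", how="inner")

# groupby: mean mass per family
family_mean = merged.groupby("Family")["Mass"].mean()

# melt: wide -> long (one row per element/property pair)
long = merged.melt(id_vars=["Element"], value_vars=["Z", "Mass"])

# pivot: long -> wide again, with elements as the row index
wide = long.pivot(index="Element", columns="variable", values="value")

print(merged, family_mean, long, wide, sep="\n\n")
```

`pivot()` raises an error if an index/column pair occurs more than once; `pivot_table()` is the variant that aggregates such duplicates.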

### Series and dataframe functions
| Function | Syntax | Parameters | Description |
|----------|--------|------------|-------------|
| `head()` | `df.head(n)` | `n` (int, default=5): Number of rows to return | Returns the first `n` rows. |
| `tail()` | `df.tail(n)` | `n` (int, default=5): Number of rows to return | Returns the last `n` rows. |
| `describe()` | `df.describe()` | `percentiles`, `include`, `exclude` | Generates summary statistics for numerical data. |
| `count()` | `df.count()` | `axis=0` (per column, default) or `axis=1` (per row) | Counts non-null values. |
| `sum()` | `df.sum(axis=0)` | `axis=0` (column-wise) or `axis=1` (row-wise) | Computes sum of values. |
| `mean()` | `df.mean(axis=0)` | `axis=0` (default, column-wise) | Computes mean of values. |
| `min()` | `df.min(axis=0)` | `axis=0` (default, column-wise) | Returns the minimum value. |
| `max()` | `df.max(axis=0)` | `axis=0` (default, column-wise) | Returns the maximum value. |
| `std()` | `df.std(axis=0)` | `axis=0` (default, column-wise) | Computes standard deviation. |
| `var()` | `df.var(axis=0)` | `axis=0` (default, column-wise) | Computes variance. |
| `median()` | `df.median(axis=0)` | `axis=0` (default, column-wise) | Computes the median value. |
| `mode()` | `df.mode()` | No parameters | Returns the most frequent value(s). |
| `abs()` | `df.abs()` | No parameters | Returns absolute values. |
| `clip()` | `df.clip(lower, upper)` | `lower`, `upper`: Clip values to limits | Clips values to a range. |
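| `apply()` | `df.apply(func, axis=0)` | `func`: Function to apply, `axis` | Applies a function to each column (`axis=0`) or row (`axis=1`) of a DataFrame, or to each element of a Series. |
| `map()` | `df.map(func)` | `func`: Function to apply | Applies a function to each element (for DataFrames this requires pandas 2.1 or later; older versions use `applymap()`). |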
| `astype()` | `df.astype(dtype)` | `dtype`: Target data type | Converts data type of elements. |

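The `axis` argument and the difference between `apply()` and `map()` are common sources of confusion. The sketch below uses a small made-up DataFrame to show what each call receives.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default) aggregates down each column; axis=1 aggregates across each row
print(df.sum(axis=0))
print(df.sum(axis=1))

# DataFrame.apply passes a whole column (or row) to the function
col_range = df.apply(lambda col: col.max() - col.min())

# Series.map calls the function once per element
squared = df["a"].map(lambda x: x ** 2)

# clip limits values to a range; astype converts the dtype
clipped = df["b"].clip(lower=15, upper=25)
as_float = df["a"].astype(float)

print(col_range, squared, clipped, as_float, sep="\n\n")
```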

## 2.2 Attributes

### Series Attributes
| Attribute | Syntax | Description |
|-----------|--------|-------------|
| `name` | `series.name` | Returns or sets the name of the Series. |
| `dtype` | `series.dtype` | Returns the data type of the Series. |
| `nbytes` | `series.nbytes` | Returns the total memory usage of the Series (in bytes). |
| `T` | `series.T` | Returns the Series itself (useful for compatibility with DataFrames). |
| `hasnans` | `series.hasnans` | Returns `True` if the Series contains `NaN` values. |
| `is_unique` | `series.is_unique` | Returns `True` if all values in the Series are unique. |


### DataFrame Attributes
| Attribute | Syntax | Description |
|-----------|--------|-------------|
| `columns` | `df.columns` | Returns column labels of the DataFrame. |
| `axes` | `df.axes` | Returns a list of row and column index labels. |
| `T` | `df.T` | Returns the transposed DataFrame (swaps rows and columns). |
| `info()` | `df.info()` | Method (not an attribute): prints a summary of the DataFrame (index, dtypes, memory usage). |
| `memory_usage()` | `df.memory_usage()` | Method: returns the memory usage of the index and each column in bytes. |
| `select_dtypes()` | `df.select_dtypes(include=[...])` | Method: selects columns of the given dtype(s). |
| total bytes | `df.memory_usage().sum()` | Returns the total memory usage in bytes (`nbytes` is available on Series and Index objects, not on DataFrames). |


### Series and DataFrame Attributes
These attributes work on both objects unless noted otherwise.

| Attribute | Syntax | Description |
|-----------|--------|-------------|
| `index` | `df.index` | Returns the row index labels. |
| `columns` | `df.columns` | Returns the column labels (DataFrame only). |
| `shape` | `df.shape` | Returns the dimensions (rows, columns). |
| `size` | `df.size` | Returns the total number of elements (rows × columns). |
| `ndim` | `df.ndim` | Returns the number of dimensions (1D for Series, 2D for DataFrame). |
| `values` | `df.values` | Returns data as a NumPy array. |
| `dtypes` | `df.dtypes` | Returns the data type of each column. |
| `empty` | `df.empty` | Returns `True` if the object is empty. |

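The attributes above are read without parentheses. A short sketch with throwaway data (the names `s` and `df` are placeholders) shows what they return for a Series and for a DataFrame.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, None], name="x")
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

print(s.name, s.dtype, s.hasnans, s.is_unique)
print(s.shape, s.ndim, s.size)       # (3,) 1 3
print(df.shape, df.ndim, df.size)    # (3, 2) 2 6
print(df.columns.tolist())
print(df.dtypes)
print(df.empty)
```

A Series has a one-element `shape` tuple and no `columns` attribute, which is a quick way to tell the two objects apart.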

# 3. Pandas Series
## 3.2 Creating Series 
### pd.Series() constructor
General syntax for the **pd.Series() Constructor:**
```python
pd.Series(data=None, index=None, dtype=None, name=None, copy=False)
```
| Parameter | Description |
|-----------|-------------|
| `data` | The main data (can be list, dict, array, scalar, etc.) |
| `index` | Labels for the Series elements (default: auto-generated integers) |
| `dtype` | The data type (e.g., `float`, `int`, `str`) |
| `name` | Optional name for the Series |
| `copy` | If `True`, makes a copy of the data |

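The less obvious parameters are `dtype`, `index` combined with dictionary data, and `copy`. The following sketch uses illustrative values and assumed variable names to show each one.

```python
import numpy as np
import pandas as pd

# dtype forces the element type; name labels the Series
charges = pd.Series([1, 2, 3], dtype="float64", name="Charge")

# With dict data, `index` selects and orders the keys; missing keys become NaN
masses = pd.Series({"H": 1.008, "He": 4.0026}, index=["He", "H", "Li"])

# copy=True guarantees the Series does not share memory with the input array
arr = np.array([1, 2, 3])
independent = pd.Series(arr, copy=True)
arr[0] = 99     # does not affect `independent`

print(charges, masses, independent, sep="\n\n")
```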
#### From Lists
- Index can be manually assigned or autogenerated (0,1,2,...)
- The list values are copied into a new array; sharing memory with the input (a view) is only possible when `data` is already an array

In [8]:
import pandas as pd

# Creating a Series: Boiling points (in °C) of halogens
boiling_points = pd.Series(
    [-188.1, -34.0, 59.5, 184.4, 336.8],  # Values
    index=["F", "Cl", "Br", "I", "At"],  # Element symbols as labels
    name="Boiling Point (°C)"
    )


# Display the Series
print(boiling_points)

# Example Operations
print("\nHighest Boiling Point:", boiling_points.max())  # Get the highest boiling point
print("\nSorted Boiling Points:\n", boiling_points.sort_values())  # Sort in ascending order
print("\nRanked Boiling Points:\n", boiling_points.rank())  # Rank the elements


F    -188.1
Cl    -34.0
Br     59.5
I     184.4
At    336.8
Name: Boiling Point (°C), dtype: float64

Highest Boiling Point: 336.8

Sorted Boiling Points:
 F    -188.1
Cl    -34.0
Br     59.5
I     184.4
At    336.8
Name: Boiling Point (°C), dtype: float64

Ranked Boiling Points:
 F     1.0
Cl    2.0
Br    3.0
I     4.0
At    5.0
Name: Boiling Point (°C), dtype: float64


#### From Numpy Arrays

In [12]:
import numpy as np
import pandas as pd
arr = np.array([1, 2, 3, 4])
s = pd.Series(arr, index=["H", "He", "Li", "Be"])
print(s)

H     1
He    2
Li    3
Be    4
dtype: int64


#### From Dictionary

In [14]:
# Creating a Series from a dictionary (keys become the index)
data = {"H": 1.008, "He": 4.0026, "Li": 6.94, "Be": 9.0122}
s = pd.Series(data, name="Atomic Mass (g/mol)")
print(s)

H     1.0080
He    4.0026
Li    6.9400
Be    9.0122
Name: Atomic Mass (g/mol), dtype: float64


#### From Scalar Value
- Useful for initializing placeholders

In [15]:
s = pd.Series(1, index=["A", "B", "C", "D"])
print(s)

A    1
B    1
C    1
D    1
dtype: int64


### Indirect Methods
#### From DataFrame Column

In [16]:
import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({"Element": ["H", "He", "Li"], "Atomic Mass": [1.008, 4.0026, 6.94]})

# Extracting a single column as a Series
atomic_mass_series = df["Atomic Mass"]
print(type(atomic_mass_series))  # <class 'pandas.core.series.Series'>
print(atomic_mass_series)

<class 'pandas.core.series.Series'>
0    1.0080
1    4.0026
2    6.9400
Name: Atomic Mass, dtype: float64


#### Using apply() on dataframe column

In [17]:
df["Atomic Mass Squared"] = df["Atomic Mass"].apply(lambda x: x ** 2)
print(type(df["Atomic Mass Squared"]))  # <class 'pandas.core.series.Series'>
print(df["Atomic Mass Squared"])

<class 'pandas.core.series.Series'>
0     1.016064
1    16.020807
2    48.163600
Name: Atomic Mass Squared, dtype: float64


#### Using map() on a DataFrame column

In [22]:
import pandas as pd

# Step 1: Create a DataFrame of electronegativity values for halogens
halogens = pd.DataFrame({
    "Element": ["F", "Cl", "Br", "I", "At"],
    "Electronegativity": [3.98, 3.16, 2.96, 2.66, 2.2]
})

# Step 2: Define a function to classify elements based on electronegativity
def classify_electronegativity(value):
    if value >= 3.5:
        return "Highly Electronegative"
    elif value >= 2.5:
        return "Moderately Electronegative"
    else:
        return "Low Electronegativity"

# Step 3: Use `map()` to create a new Series with classifications
electronegativity_class = halogens["Electronegativity"].map(classify_electronegativity)

# Step 4: Add the new classification Series to the DataFrame
halogens["Electronegativity Category"] = electronegativity_class

# Step 5: Display the Series and DataFrame
print(electronegativity_class)
print(type(electronegativity_class))
print("\n")
print(halogens)
print(type(halogens))
# Step 6 (Optional): Save the DataFrame to a CSV file
halogens.to_csv("halogens_electronegativity.csv", index=False)


0        Highly Electronegative
1    Moderately Electronegative
2    Moderately Electronegative
3    Moderately Electronegative
4         Low Electronegativity
Name: Electronegativity, dtype: object
<class 'pandas.core.series.Series'>


  Element  Electronegativity  Electronegativity Category
0       F               3.98      Highly Electronegative
1      Cl               3.16  Moderately Electronegative
2      Br               2.96  Moderately Electronegative
3       I               2.66  Moderately Electronegative
4      At               2.20       Low Electronegativity
<class 'pandas.core.frame.DataFrame'>


#### Using iloc[] on a DataFrame

In [27]:
import pandas as pd

# Create a sample DataFrame with chemical properties
df = pd.DataFrame({
    "Element": ["Na", "Mg", "Al", "Si", "P"],
    "Atomic Number": [11, 12, 13, 14, 15],
    "Electronegativity": [0.93, 1.31, 1.61, 1.90, 2.19]
})

# Extract the "Electronegativity" column using iloc
electronegativity_series = df.iloc[:, 2]  # Selecting column index 2

print(type(electronegativity_series))  # Output: <class 'pandas.core.series.Series'>
print(electronegativity_series)


<class 'pandas.core.series.Series'>
0    0.93
1    1.31
2    1.61
3    1.90
4    2.19
Name: Electronegativity, dtype: float64


#### Using loc[] on a DataFrame

In [28]:
# Extract "Electronegativity" column using loc
electronegativity_series = df.loc[:, "Electronegativity"]

print(type(electronegativity_series))  # Output: <class 'pandas.core.series.Series'>
print(electronegativity_series)


<class 'pandas.core.series.Series'>
0    0.93
1    1.31
2    1.61
3    1.90
4    2.19
Name: Electronegativity, dtype: float64

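`loc` and `iloc` are indexers used with square brackets, and they behave differently once the row index holds labels instead of default integers. The sketch below rebuilds a smaller version of the table with element symbols as the index (an assumption made for this example).

```python
import pandas as pd

df = pd.DataFrame(
    {"Atomic Number": [11, 12, 13], "Electronegativity": [0.93, 1.31, 1.61]},
    index=["Na", "Mg", "Al"],
)

row_by_label = df.loc["Mg"]                 # one row -> Series
row_by_position = df.iloc[1]                # same row, selected by position
value = df.loc["Al", "Electronegativity"]   # single cell -> scalar

# Label slices include both endpoints; positional slices exclude the stop
by_label = df.loc["Na":"Mg"]     # 2 rows
by_position = df.iloc[0:1]       # 1 row

# A list of column labels returns a DataFrame, not a Series
one_col_frame = df.loc[:, ["Electronegativity"]]

print(row_by_label, value, by_label, by_position, one_col_frame, sep="\n\n")
```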

# 4. Pandas DataFrames

In [3]:
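# Convert temperatures from °C to Kelvin (works for both Series and DataFrame columns).
# This cell relies on `boiling_points` (Section 3) and on `alkali_metals`,
# which is created in cell In [2]; run that cell first.
def to_kelvin(temp):
    return temp + 273.15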

# Apply function to the boiling points Series
boiling_points_K = boiling_points.apply(to_kelvin)
print("\nBoiling Points in Kelvin:\n", boiling_points_K)

# Apply function to the DataFrame column
print("\n", alkali_metals)
alkali_metals["Melting Point (K)"] = alkali_metals["Melting Point (°C)"].apply(to_kelvin)
print("\nUpdated Alkali Metals DataFrame:\n", alkali_metals)



Boiling Points in Kelvin:
 F      85.05
Cl    239.15
Br    332.65
I     457.55
At    609.95
Name: Boiling Point (°C), dtype: float64

     Atomic Number  Atomic Radius (pm)  Density (g/cm³)  Melting Point (°C)
Li              3                 152            0.534               180.5
Na             11                 186            0.970                97.8
K              19                 227            0.860                63.5
Rb             37                 248            1.530                39.3
Cs             55                 265            1.870                28.5

Updated Alkali Metals DataFrame:
     Atomic Number  Atomic Radius (pm)  Density (g/cm³)  Melting Point (°C)  \
Li              3                 152            0.534               180.5   
Na             11                 186            0.970                97.8   
K              19                 227            0.860                63.5   
Rb             37                 248            1.530             

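For simple arithmetic such as a unit conversion, `apply()` is not strictly needed, because operators on a Series act on every element at once. The sketch below is one possible alternative, using a shortened copy of the melting-point data; the `rename()` call is an optional addition so that the Series name matches the new unit.

```python
import pandas as pd

melting_c = pd.Series(
    [180.5, 97.8, 63.5],
    index=["Li", "Na", "K"],
    name="Melting Point (°C)",
)

# Arithmetic on a Series is applied to every element at once
melting_k = (melting_c + 273.15).rename("Melting Point (K)")

# Same numbers as apply(), without a Python-level function call per element
via_apply = melting_c.apply(lambda t: t + 273.15)

print(melting_k)
```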
In [2]:
# Creating a DataFrame with physical properties of alkali metals
alkali_metals = pd.DataFrame({
    "Atomic Number": [3, 11, 19, 37, 55],  # Lithium to Cesium
    "Atomic Radius (pm)": [152, 186, 227, 248, 265],
    "Density (g/cm³)": [0.534, 0.97, 0.86, 1.53, 1.87],
    "Melting Point (°C)": [180.5, 97.8, 63.5, 39.3, 28.5]
}, index=["Li", "Na", "K", "Rb", "Cs"])  # Using symbols as row index

# Display the DataFrame
print(alkali_metals)

# Example Operations
print("\nMean Atomic Radius:", alkali_metals["Atomic Radius (pm)"].mean())  # Average atomic radius
print("\nSorted by Melting Point:\n", alkali_metals.sort_values("Melting Point (°C)"))  # Sort by melting point
print("\nDensity Correlation:\n", alkali_metals.corr())  # Correlation between properties


    Atomic Number  Atomic Radius (pm)  Density (g/cm³)  Melting Point (°C)
Li              3                 152            0.534               180.5
Na             11                 186            0.970                97.8
K              19                 227            0.860                63.5
Rb             37                 248            1.530                39.3
Cs             55                 265            1.870                28.5

Mean Atomic Radius: 215.6

Sorted by Melting Point:
     Atomic Number  Atomic Radius (pm)  Density (g/cm³)  Melting Point (°C)
Cs             55                 265            1.870                28.5
Rb             37                 248            1.530                39.3
K              19                 227            0.860                63.5
Na             11                 186            0.970                97.8
Li              3                 152            0.534               180.5

Density Correlation:
                     Ato