# Pandas - DataFrame and Series

## Introduction

Pandas is a powerful Python library for data manipulation and analysis. It provides two primary data structures:
- **Series**: 1-dimensional labeled array
- **DataFrame**: 2-dimensional labeled data structure (like a table)

These structures make it easy to work with structured data in Python.

In [None]:
import pandas as pd
import numpy as np

---

## 1. Pandas Series

A **Series** is a one-dimensional array with labeled indices. Think of it as a column in a spreadsheet with row labels.

### 1.1 Creating a Series

In [None]:
# Creating a Series from a list
data = [10, 20, 30, 40, 50]
series1 = pd.Series(data)
print("Series from list:")
print(series1)
print("\nType:", type(series1))

In [None]:
# Creating a Series with custom index
series2 = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print("Series with custom index:")
print(series2)

In [None]:
# Creating a Series from a dictionary (the keys become the index labels)
dict_data = {'India': 1400, 'China': 1450, 'USA': 330, 'Indonesia': 275}
series3 = pd.Series(dict_data)
print("Series from dictionary:")
print(series3)

In [None]:
# Creating a Series from a NumPy array
arr = np.array([100, 200, 300, 400])
series4 = pd.Series(arr, index=['Q1', 'Q2', 'Q3', 'Q4'])
print("Series from NumPy array:")
print(series4)

In [None]:
# Creating a Series with a scalar value
series5 = pd.Series(5, index=[0, 1, 2, 3])
print("Series with scalar value:")
print(series5)

### 1.2 Series Attributes

In [None]:
# Series attributes
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])

print("Values:", s.values)      # Get the values as NumPy array
print("Index:", s.index)        # Get the index
print("Data type:", s.dtype)    # Get data type
print("Shape:", s.shape)        # Get shape (rows,)
print("Size:", s.size)          # Get number of elements
print("Name:", s.name)          # Get series name (None if not set)

In [None]:
# Setting name for Series
s.name = "My Series"
s.index.name = "Labels"
print(s)

### 1.3 Indexing and Slicing

In [None]:
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Accessing by label
print("Value at 'c':", s['c'])

# Accessing by position (use iloc; s[2] on a label-indexed Series is deprecated in recent pandas)
print("Value at position 2:", s.iloc[2])

# Accessing multiple elements
print("\nMultiple elements:")
print(s[['a', 'c', 'e']])

In [None]:
# Slicing
print("Slicing by position [1:4]:")
print(s.iloc[1:4])  # Note: End position is EXCLUDED

print("\nSlicing by label ['b':'d']:")
print(s['b':'d'])  # Note: End label is INCLUDED
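
The difference between label and position matters most when the index labels are themselves integers. The sketch below uses a hypothetical Series, `s_int`, whose integer labels deliberately do not match the row positions. It is only an illustration of why being explicit with `loc` and `iloc` helps.

```python
import pandas as pd

# Hypothetical Series: integer labels that do NOT match positions
s_int = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s_int.loc[2]       # label 2 is the first element
by_position = s_int.iloc[2]   # position 2 is the third element

print("loc[2]: ", by_label)
print("iloc[2]:", by_position)
```

With `loc` and `iloc` the intent is unambiguous, whatever the index looks like.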

### 1.4 Series Operations

In [None]:
s = pd.Series([10, 20, 30, 40, 50])

# Arithmetic operations
print("Original Series:")
print(s)

print("\nAdding 5:")
print(s + 5)

print("\nMultiplying by 2:")
print(s * 2)

print("\nSquare root:")
print(np.sqrt(s))

In [None]:
# Statistical operations
s = pd.Series([10, 20, 30, 40, 50])

print("Sum:", s.sum())
print("Mean:", s.mean())
print("Median:", s.median())
print("Standard Deviation:", s.std())
print("Min:", s.min())
print("Max:", s.max())
print("\nDescribe:")
print(s.describe())

In [None]:
# Operations between two Series
s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([5, 15, 25], index=['a', 'b', 'c'])

print("Series 1:")
print(s1)
print("\nSeries 2:")
print(s2)
print("\nAddition:")
print(s1 + s2)
print("\nSubtraction:")
print(s1 - s2)
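
The two Series above share the same index, so the arithmetic lines up element by element. When the indices differ, pandas aligns on labels rather than positions. The sketch below assumes two hypothetical Series, `s_left` and `s_right`, with partly overlapping labels.

```python
import pandas as pd

# Hypothetical Series with partly overlapping labels
s_left = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s_right = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# Labels present on only one side produce NaN
aligned_sum = s_left + s_right
print(aligned_sum)

# add() with fill_value treats a missing label as 0 instead
filled_sum = s_left.add(s_right, fill_value=0)
print(filled_sum)
```

Because NaN is a floating-point value, both results have a float dtype even though the inputs are integers.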

### 1.5 Boolean Indexing

In [None]:
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Filter values greater than 25
print("Values > 25:")
print(s[s > 25])

# Filter values between 20 and 40
print("\nValues between 20 and 40:")
print(s[(s >= 20) & (s <= 40)])
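
Conditions can also be combined with `|` (or) and negated with `~` (not). As a small sketch, assuming a hypothetical Series named `readings` with the same values as above:

```python
import pandas as pd

# Hypothetical Series used only for this illustration
readings = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Values below 20 OR above 40
is_extreme = (readings < 20) | (readings > 40)
extremes = readings[is_extreme]
print(extremes)

# Everything else: negate the mask with ~
typical = readings[~is_extreme]
print(typical)
```

Each condition needs its own parentheses because `&`, `|`, and `~` bind more tightly than comparison operators in Python.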

---

## 2. Pandas DataFrame

A **DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet or SQL table.

### 2.1 Creating a DataFrame

In [None]:
# Creating DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'Paris', 'London', 'Tokyo']
}

df1 = pd.DataFrame(data)
print("DataFrame from dictionary:")
print(df1)

In [None]:
# Creating DataFrame from list of lists
data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Paris'],
    ['Charlie', 35, 'London']
]

df2 = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print("DataFrame from list of lists:")
print(df2)

In [None]:
# Creating DataFrame from list of dictionaries
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Paris'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'London'}
]

df3 = pd.DataFrame(data)
print("DataFrame from list of dictionaries:")
print(df3)

In [None]:
# Creating DataFrame with custom index
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}

df4 = pd.DataFrame(data, index=['emp1', 'emp2', 'emp3'])
print("DataFrame with custom index:")
print(df4)

In [None]:
# Creating DataFrame from NumPy array
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df5 = pd.DataFrame(arr, columns=['A', 'B', 'C'])
print("DataFrame from NumPy array:")
print(df5)

### 2.2 DataFrame Attributes

In [None]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000]
})

print("Columns:", df.columns)
print("Index:", df.index)
print("Shape:", df.shape)          # (rows, columns)
print("Size:", df.size)            # Total elements
print("\nData types:")
print(df.dtypes)

In [None]:
# Info method - shows summary
print("DataFrame Info:")
df.info()

In [None]:
# Describe - statistical summary
print("Statistical Summary:")
print(df.describe())

In [None]:
# Head and Tail
print("First 2 rows:")
print(df.head(2))

print("\nLast 2 rows:")
print(df.tail(2))

### 2.3 Accessing Data

In [None]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000]
})

# Accessing a single column (returns a Series)
print("Name column:")
print(df['Name'])
print("\nType:", type(df['Name']))

In [None]:
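# Accessing column using dot notation
# (works only if the name is a valid Python identifier and does not clash with a DataFrame attribute)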
print("Age column:")
print(df.Age)

In [None]:
# Accessing multiple columns (returns a DataFrame)
print("Multiple columns:")
print(df[['Name', 'Salary']])
print("\nType:", type(df[['Name', 'Salary']]))
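
Dot notation is convenient but does not work for every column name. The sketch below assumes a hypothetical DataFrame, `df_cols`, with one column named like a DataFrame method and one containing a space. Bracket notation handles both.

```python
import pandas as pd

# Hypothetical DataFrame with awkward column names
df_cols = pd.DataFrame({'count': [1, 2], 'unit price': [9.5, 3.0]})

# df_cols.count is the DataFrame.count method, not the column
print("df_cols.count is callable:", callable(df_cols.count))

# Bracket notation always refers to the column
count_col = df_cols['count']
price_col = df_cols['unit price']
print(count_col)
print(price_col)
```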

### 2.4 DataFrame Indexing: loc and iloc

In [None]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000]
}, index=['emp1', 'emp2', 'emp3', 'emp4'])

print("DataFrame:")
print(df)

In [None]:
# loc - Label-based indexing
print("\nAccessing row 'emp2' using loc:")
print(df.loc['emp2'])

print("\nAccessing specific element:")
print(df.loc['emp2', 'Name'])

print("\nAccessing multiple rows and columns:")
print(df.loc[['emp1', 'emp3'], ['Name', 'Salary']])

In [None]:
# iloc - Position-based indexing
print("Accessing row at position 1 using iloc:")
print(df.iloc[1])

print("\nAccessing specific element:")
print(df.iloc[1, 0])  # Row 1, Column 0

print("\nAccessing multiple rows and columns:")
print(df.iloc[[0, 2], [0, 2]])  # Rows 0,2 and Columns 0,2

In [None]:
# Slicing with loc and iloc
print("Slicing with loc (emp1 to emp3):")
print(df.loc['emp1':'emp3'])  # Includes end label

print("\nSlicing with iloc (0 to 3):")
print(df.iloc[0:3])  # Excludes end position

In [None]:
# at and iat - Fast scalar access
print("Using at:")
print(df.at['emp2', 'Age'])

print("\nUsing iat:")
print(df.iat[1, 1])  # Row 1, Column 1

### 2.5 Adding and Deleting Columns

In [None]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

print("Original DataFrame:")
print(df)

# Adding a new column
df['Salary'] = [50000, 60000, 70000]
print("\nAfter adding Salary column:")
print(df)

In [None]:
# Adding a column with calculation
df['Tax'] = df['Salary'] * 0.2
print("After adding Tax column:")
print(df)

In [None]:
# Adding column using insert (at specific position)
df.insert(2, 'Department', ['IT', 'HR', 'Finance'])
print("After inserting Department column at position 2:")
print(df)

In [None]:
# Deleting a column
df_copy = df.copy()
df_copy.drop('Tax', axis=1, inplace=True)
print("After dropping Tax column:")
print(df_copy)

In [None]:
# Deleting using del
df_copy2 = df.copy()
del df_copy2['Department']
print("After deleting Department column:")
print(df_copy2)

### 2.6 Adding and Deleting Rows

In [None]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

print("Original DataFrame:")
print(df)

# Adding a new row using concat
new_row = pd.DataFrame({'Name': ['Charlie'], 'Age': [35]})
df = pd.concat([df, new_row], ignore_index=True)
print("\nAfter adding a row:")
print(df)

In [None]:
# Deleting a row
df_copy = df.copy()
df_copy = df_copy.drop(1)  # Drop the row whose index LABEL is 1
print("After dropping row at index 1:")
print(df_copy)
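
`drop` works on index labels, not positions. The sketch below assumes a hypothetical DataFrame, `staff`, with string labels. It shows dropping by label, dropping by position via `staff.index`, and renumbering the rows afterwards.

```python
import pandas as pd

# Hypothetical DataFrame with a custom string index
staff = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['emp1', 'emp2', 'emp3']
)

# Drop by label
without_emp2 = staff.drop('emp2')

# Drop by position: look up the label at that position first
without_first = staff.drop(staff.index[0])

# Replace the remaining labels with a fresh 0..n-1 index
renumbered = without_emp2.reset_index(drop=True)

print(without_emp2)
print(without_first)
print(renumbered)
```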

---

## 3. DataFrame Operations

### 3.1 Sorting

In [None]:
df = pd.DataFrame({
    'Name': ['Alice', 'David', 'Charlie', 'Bob'],
    'Age': [25, 28, 35, 30],
    'Salary': [50000, 55000, 70000, 60000]
})

print("Original DataFrame:")
print(df)

# Sorting by column
print("\nSorted by Age:")
print(df.sort_values('Age'))

# Sorting in descending order
print("\nSorted by Salary (descending):")
print(df.sort_values('Salary', ascending=False))

In [None]:
# Sorting by multiple columns
df2 = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [50000, 60000, 55000, 58000]
})

print("Sorted by Department, then Salary:")
print(df2.sort_values(['Department', 'Salary']))
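
Each sort column can have its own direction by passing a list to `ascending`. A minimal sketch, using a hypothetical DataFrame `pay` with the same data as above:

```python
import pandas as pd

# Hypothetical DataFrame for this illustration
pay = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'HR'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [50000, 60000, 55000, 58000]
})

# Department A-Z, then highest salary first within each department
ranked = pay.sort_values(['Department', 'Salary'], ascending=[True, False])
print(ranked)
```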

In [None]:
# Sorting by index
df3 = df.set_index('Name')
print("Sorted by index:")
print(df3.sort_index())

### 3.2 Filtering and Querying

In [None]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000]
})

# Boolean filtering
print("Employees with Age > 28:")
print(df[df['Age'] > 28])

# Multiple conditions
print("\nEmployees with Age > 25 AND Salary > 55000:")
print(df[(df['Age'] > 25) & (df['Salary'] > 55000)])

In [None]:
# Using query method
print("Using query method:")
print(df.query('Age > 28 and Salary > 55000'))
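
A `query` string can also refer to Python variables by prefixing them with `@`. The sketch below assumes a hypothetical DataFrame `team` and a hypothetical threshold variable `min_age`.

```python
import pandas as pd

# Hypothetical DataFrame and threshold for this illustration
team = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 28],
    'Salary': [50000, 60000, 70000, 55000]
})
min_age = 28

# @min_age is looked up in the surrounding Python scope
older = team.query('Age > @min_age')
print(older)
```

This keeps the threshold in one place instead of hard-coding it into the query string.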

In [None]:
# Using isin method
print("Employees named Alice or Charlie:")
print(df[df['Name'].isin(['Alice', 'Charlie'])])

### 3.3 Statistical Operations

In [None]:
df = pd.DataFrame({
    'A': [10, 20, 30, 40],
    'B': [15, 25, 35, 45],
    'C': [5, 10, 15, 20]
})

print("DataFrame:")
print(df)

# Column-wise statistics
print("\nSum of each column:")
print(df.sum())

print("\nMean of each column:")
print(df.mean())

print("\nMax of each column:")
print(df.max())

In [None]:
# Row-wise statistics
print("Sum of each row:")
print(df.sum(axis=1))

print("\nMean of each row:")
print(df.mean(axis=1))

In [None]:
# Correlation and covariance
print("Correlation matrix:")
print(df.corr())

print("\nCovariance matrix:")
print(df.cov())

### 3.4 Handling Missing Data

In [None]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 35, 28],
    'Salary': [50000, 60000, None, 55000]
})

print("DataFrame with missing values:")
print(df)

# Check for missing values
print("\nMissing values:")
print(df.isnull())

print("\nCount of missing values per column:")
print(df.isnull().sum())

In [None]:
# Dropping rows with missing values
print("After dropping rows with any missing values:")
print(df.dropna())

In [None]:
# Filling missing values
print("Filling missing values with 0:")
print(df.fillna(0))

print("\nFilling with column mean:")
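df_filled = df.copy()
# Assign the result back: fillna(inplace=True) on a selected column may not update df_filled
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].mean())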
print(df_filled)
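
Different columns often need different fill values. `fillna` accepts a dictionary that maps column names to values. The sketch below assumes a hypothetical DataFrame `df_missing` and a placeholder string `'Unknown'`. The choice of mean for `Age` and median for `Salary` is only an example.

```python
import pandas as pd

# Hypothetical DataFrame with one gap per column
df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 35, 28],
    'Salary': [50000, 60000, None, 55000]
})

# One fill value per column (these choices are illustrative assumptions)
fill_values = {
    'Name': 'Unknown',
    'Age': df_missing['Age'].mean(),
    'Salary': df_missing['Salary'].median()
}

df_clean = df_missing.fillna(fill_values)
print(df_clean)
```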

---

## 4. Relationship Between Series and DataFrame

In [None]:
# DataFrame is a collection of Series
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
})

print("DataFrame:")
print(df)
print("\nType of DataFrame:", type(df))

# Each column is a Series
print("\nName column (Series):")
print(df['Name'])
print("Type:", type(df['Name']))

In [None]:
# Each row is also a Series
print("First row (Series):")
print(df.iloc[0])
print("Type:", type(df.iloc[0]))
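
A column Series and a row Series differ in one practical way: a column holds values of one type, while a row spans columns of different types. The sketch below, using a hypothetical DataFrame `people`, compares the two.

```python
import pandas as pd

# Hypothetical DataFrame with one text column and one integer column
people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

age_column = people['Age']
first_row = people.iloc[0]

print("Column dtype:", age_column.dtype)  # an integer dtype
print("Row dtype:", first_row.dtype)      # object, since the row mixes text and numbers
print("Row index:", list(first_row.index))
```

The row's index is the DataFrame's column labels, which is why `first_row['Name']` works.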

In [None]:
# Creating DataFrame from Series
s1 = pd.Series([1, 2, 3], name='A')
s2 = pd.Series([4, 5, 6], name='B')

df_from_series = pd.concat([s1, s2], axis=1)
print("DataFrame created from Series:")
print(df_from_series)

---

## 5. Practical Examples

### Example 1: Employee Management System

In [None]:
# Create employee dataset
employees = pd.DataFrame({
    'EmployeeID': [101, 102, 103, 104, 105],
    'Name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'David Wilson', 'Eve Davis'],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'Age': [28, 35, 32, 45, 29],
    'Salary': [75000, 65000, 80000, 90000, 68000],
    'JoinYear': [2020, 2018, 2019, 2015, 2021]
})

print("Employee Dataset:")
print(employees)

In [None]:
# Find average salary by department
print("Average Salary by Department:")
print(employees.groupby('Department')['Salary'].mean())
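
Several statistics can be computed per group in one call with `agg`. The sketch below uses a hypothetical, trimmed-down copy of the dataset named `dept_pay` so that it runs on its own.

```python
import pandas as pd

# Hypothetical subset of the employee data (Department and Salary only)
dept_pay = pd.DataFrame({
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'Salary': [75000, 65000, 80000, 90000, 68000]
})

# One row per department, one column per statistic
summary = dept_pay.groupby('Department')['Salary'].agg(['count', 'mean', 'max'])
print(summary)
```

By default the group labels are sorted and become the index of the result.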

In [None]:
# Find employees with salary > 70000
print("Employees with Salary > $70,000:")
high_earners = employees[employees['Salary'] > 70000]
print(high_earners[['Name', 'Department', 'Salary']])

In [None]:
# Add years of experience column
current_year = 2025
employees['YearsOfExperience'] = current_year - employees['JoinYear']
print("\nWith Years of Experience:")
print(employees[['Name', 'JoinYear', 'YearsOfExperience']])

### Example 2: Student Grades Analysis

In [None]:
# Create student grades dataset
grades = pd.DataFrame({
    'StudentID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Emma', 'Michael', 'Sophia', 'William'],
    'Math': [85, 92, 78, 95, 88],
    'Science': [90, 88, 85, 92, 86],
    'English': [88, 95, 80, 90, 92]
})

print("Student Grades:")
print(grades)

In [None]:
# Calculate average grade for each student
subject_columns = ['Math', 'Science', 'English']
grades['Average'] = grades[subject_columns].mean(axis=1)
print("\nWith Average Grades:")
print(grades)

In [None]:
# Find top performer
top_student = grades.loc[grades['Average'].idxmax()]
print("\nTop Performer:")
print(f"Name: {top_student['Name']}")
print(f"Average: {top_student['Average']:.2f}")

In [None]:
# Subject-wise statistics
print("\nSubject-wise Statistics:")
print(grades[subject_columns].describe())

---

## Summary

### Series
- 1-dimensional labeled array
- Can be created from lists, dictionaries, arrays
- Supports indexing, slicing, and boolean filtering
- Provides statistical methods

### DataFrame
- 2-dimensional labeled data structure
- Collection of Series (one per column) that share a common index
- Supports various creation methods
- Provides powerful indexing with `loc` and `iloc`
- Supports adding/deleting columns and rows
- Rich set of operations: sorting, filtering, grouping, statistics
- Can handle missing data

### Key Differences
| Feature | Series | DataFrame |
|---------|--------|----------|
| Dimensions | 1D | 2D |
| Structure | Column | Table |
| Access | Single index | Row and column index |
| Use case | Single variable | Multiple variables |

### Common Methods
- **Inspection**: `head()`, `tail()`, `info()`, `describe()`
- **Selection**: `loc[]`, `iloc[]`, `at[]`, `iat[]`
- **Modification**: `drop()`, `insert()`, `fillna()`
- **Analysis**: `sort_values()`, `groupby()`, `mean()`, `sum()`