# Pandas for AI Engineers: A Q&A Tutorial (2025)

Welcome to this hands-on tutorial on pandas! We'll explore this powerful library through a series of questions and answers, designed to take you from beginner to proficient. Let's get started!

## 1. Getting Started with Pandas

Question: How do you import the pandas library?

In [None]:
import pandas as pd

## 2. Fundamental Data Structures: Series and DataFrames

Question: What is a pandas Series and how do you create one from a list?

In [None]:
ages = pd.Series([25, 30, 35], name='Age')
print(ages)

0    25
1    30
2    35
Name: Age, dtype: int64


Question: How can you create a DataFrame from a Python dictionary?

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


## 3. Inspecting Your Data

Question: How do you view the first few rows of a DataFrame?

In [None]:
df.head(3)

Question: How can you get a summary of the numerical columns in your DataFrame?

In [None]:
df.describe()

Question: How do you select a single column or multiple columns from a DataFrame?

In [None]:
print(df['Name'])  # Single column
print(df[['Name', 'Age']])  # Multiple columns

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


## 4. Handling Missing Data

Question: How can you create a DataFrame with missing values and then identify them?

In [None]:
df_missing = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8]
})
print(df_missing.isnull())

       A      B
0  False  False
1  False   True
2   True  False
3  False  False


Question: What are the common strategies for handling missing data? Show how to drop rows with missing values and how to fill them.

In [None]:
# Drop rows with any missing values
print(df_missing.dropna())

# Fill missing values with 0
print(df_missing.fillna(0))

     A    B
0  1.0  5.0
3  4.0  8.0
     A    B
0  1.0  5.0
1  2.0  0.0
2  0.0  7.0
3  4.0  8.0
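
Filling with a constant such as 0 can distort a column's distribution, so another common strategy is to fill each gap with its column's mean. The following is a minimal sketch of that idea; it rebuilds the same `df_missing` frame so it runs on its own, and the names `missing_counts` and `df_filled` are illustrative only.

```python
import pandas as pd

# Same toy frame as above, rebuilt so this block is self-contained.
df_missing = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8]
})

# Count missing values per column before choosing a strategy.
missing_counts = df_missing.isnull().sum()
print(missing_counts)

# DataFrame.mean() skips NaN by default, so each gap is filled with the
# mean of the observed values in its own column.
df_filled = df_missing.fillna(df_missing.mean())
print(df_filled)
```

Whether the mean, the median, or dropping the row is appropriate depends on the dataset and the model being trained.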


## 5. Combining DataFrames

Question: How do you concatenate two DataFrames vertically?

In [None]:
df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})
print(pd.concat([df1, df2], ignore_index=True))

    A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3
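
To see what `ignore_index=True` changes, it may help to compare it with the default behavior and with column-wise concatenation. This sketch reuses the same two frames; the `_right` suffix is only an illustrative way to keep the column names distinct.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Without ignore_index, each frame keeps its own row labels,
# so the combined index contains duplicates (0, 1, 0, 1).
stacked = pd.concat([df1, df2])
print(stacked)

# axis=1 places the frames side by side, aligning rows on the index.
side_by_side = pd.concat([df1, df2.add_suffix('_right')], axis=1)
print(side_by_side)
```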


Question: How do you merge two DataFrames based on a common key?

In [None]:
left = pd.DataFrame({'key': ['K0', 'K1'], 'A': ['A0', 'A1']})
right = pd.DataFrame({'key': ['K0', 'K1'], 'B': ['B0', 'B1']})
print(pd.merge(left, right, on='key'))

  key   A   B
0  K0  A0  B0
1  K1  A1  B1
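
The example above uses the default inner join, which keeps only keys present in both frames. As a rough sketch of the other join types, the frames below are given one unmatched key each (`K2` and `K3` are made up for this illustration).

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': ['B0', 'B1', 'B3']})

# Default how='inner': only keys found in both frames survive.
inner = pd.merge(left, right, on='key')

# how='left': keep every row of `left`; unmatched rows get NaN in 'B'.
left_join = pd.merge(left, right, on='key', how='left')

# how='outer' keeps all keys from both sides; indicator=True adds a
# '_merge' column recording where each row came from.
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
print(outer)
```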


## 6. Grouping and Aggregation

Question: How can you group a DataFrame by a column and calculate the mean of another column for each group?

In [None]:
df_salary = pd.DataFrame({
    'Department': ['HR', 'Tech', 'HR', 'Tech'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [70000, 80000, 75000, 85000]
})
print(df_salary.groupby('Department')['Salary'].mean())

Department
HR      72500.0
Tech    82500.0
Name: Salary, dtype: float64
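
A single `mean()` is often not enough; `agg()` can compute several statistics per group in one pass. The sketch below uses the same salary data, and the output column names `avg_salary` and `headcount` are chosen purely for illustration.

```python
import pandas as pd

df_salary = pd.DataFrame({
    'Department': ['HR', 'Tech', 'HR', 'Tech'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David'],
    'Salary': [70000, 80000, 75000, 85000]
})

# Several aggregations on one column: the result has one column per statistic.
summary = df_salary.groupby('Department')['Salary'].agg(['mean', 'min', 'max'])
print(summary)

# Named aggregation: each keyword is output_name=(source_column, function).
named = df_salary.groupby('Department').agg(
    avg_salary=('Salary', 'mean'),
    headcount=('Employee', 'count'),
)
print(named)
```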


## 7. Working with Time Series Data

Question: How do you create a range of dates and set it as the index of a DataFrame?

In [None]:
dates = pd.date_range('20230101', periods=6)
df_time = pd.DataFrame({'Value': range(1,7)}, index=dates)
print(df_time)

            Value
2023-01-01      1
2023-01-02      2
2023-01-03      3
2023-01-04      4
2023-01-05      5
2023-01-06      6
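
Once the index is a `DatetimeIndex`, rows can be sliced by date and aggregated over time windows. This is a small sketch on the same six-day frame; the two-day bin width is an arbitrary choice for the example.

```python
import pandas as pd

dates = pd.date_range('2023-01-01', periods=6)
df_time = pd.DataFrame({'Value': range(1, 7)}, index=dates)

# Label-based slicing on dates includes both endpoints.
first_three = df_time.loc['2023-01-01':'2023-01-03']
print(first_three)

# Resample into 2-day bins and sum the values that fall in each bin.
two_day_totals = df_time.resample('2D').sum()
print(two_day_totals)
```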


## 8. Reading and Writing Data

Question: How do you write a DataFrame to a CSV file and then read it back?

In [None]:
# Writing to CSV
df_time.to_csv('sample_time_series.csv')

# Reading from CSV (without index_col, the saved index comes back
# as an ordinary column named 'Unnamed: 0')
df_read = pd.read_csv('sample_time_series.csv')
print(df_read.head())

   Unnamed: 0  Value
0  2023-01-01      1
1  2023-01-02      2
2  2023-01-03      3
3  2023-01-04      4
4  2023-01-05      5
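
To get the dates back as the index rather than as a plain column, `read_csv` accepts `index_col` and `parse_dates`. The sketch below assumes the same `df_time` frame and writes to an in-memory `io.StringIO` buffer instead of a file on disk, purely so the example leaves nothing behind.

```python
import io

import pandas as pd

dates = pd.date_range('2023-01-01', periods=6)
df_time = pd.DataFrame({'Value': range(1, 7)}, index=dates)

# Write the CSV text into an in-memory buffer (stands in for a file path).
buffer = io.StringIO()
df_time.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the first column as the index;
# parse_dates=True converts that index to datetimes.
df_read = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(df_read.head())
print(type(df_read.index))
```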


## 9. Advanced: Applying Functions

Question: How can you apply a custom function to every element in a DataFrame?

In [None]:
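df_apply = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# apply() calls the function once per column, passing the column as a Series.
# Because x * 2 is vectorized, every element in that column is doubled.
print(df_apply.apply(lambda x: x * 2))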

   A   B
0  2   8
1  4  10
2  6  12
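
When the function only makes sense on one scalar at a time, an element-wise method is the better fit. The sketch below uses `DataFrame.map`, which is available in pandas 2.1 and later, and falls back to the older `applymap` name otherwise; the `label` helper and its threshold of 3 are hypothetical.

```python
import pandas as pd

df_apply = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

def label(value):
    """Scalar-only logic: it cannot be applied to a whole column at once."""
    return 'high' if value > 3 else 'low'

# DataFrame.map exists in pandas 2.1+; earlier releases call it applymap.
elementwise = getattr(df_apply, 'map', None) or df_apply.applymap
labels = elementwise(label)
print(labels)

# apply(axis=1) passes each row as a Series instead of each column.
row_totals = df_apply.apply(lambda row: row['A'] + row['B'], axis=1)
print(row_totals)
```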


## 10. Pandas for AI/ML: Feature Engineering

Question: What is one-hot encoding and how can you perform it on a categorical feature?

In [None]:
df_color = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
print(pd.get_dummies(df_color, dtype=bool))

   Color_Blue  Color_Green  Color_Red
0       False        False       True
1        True        False      False
2       False         True      False
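
In an ML pipeline, data seen at inference time may not contain every category from training, so its encoded columns can differ. One way to handle this, sketched here with hypothetical `train` and `new` frames, is to reindex the new frame against the training columns.

```python
import pandas as pd

train = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
new = pd.DataFrame({'Color': ['Blue', 'Blue']})  # no 'Red' or 'Green' rows

# dtype=int yields 0/1 columns instead of booleans.
train_encoded = pd.get_dummies(train, dtype=int)

# Encoding `new` alone would yield only Color_Blue; reindexing adds the
# missing columns (filled with 0) in the same order as the training frame.
new_encoded = pd.get_dummies(new, dtype=int).reindex(
    columns=train_encoded.columns, fill_value=0
)
print(new_encoded)
```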


Question: How can you normalize a numerical feature to a range between 0 and 1?

In [None]:
from sklearn.preprocessing import MinMaxScaler

df_norm = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})
scaler = MinMaxScaler()
df_norm['Normalized'] = scaler.fit_transform(df_norm[['Values']])
print(df_norm)

   Values  Normalized
0      10        0.00
1      20        0.25
2      30        0.50
3      40        0.75
4      50        1.00
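
Min-max scaling is simply `(x - min) / (max - min)`, so the same result can be reproduced in plain pandas without scikit-learn. This is a minimal sketch on the same values; it assumes the column is not constant, since a zero range would divide by zero.

```python
import pandas as pd

df_norm = pd.DataFrame({'Values': [10, 20, 30, 40, 50]})

col = df_norm['Values']
# Shift so the minimum becomes 0, then divide by the range so the maximum becomes 1.
df_norm['Normalized'] = (col - col.min()) / (col.max() - col.min())
print(df_norm)
```

A fitted `MinMaxScaler` remains the more convenient choice when the same minimum and maximum must be reused on later data.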


## Summary and Next Steps

Congratulations! You've worked through the fundamentals of pandas, from data structures to feature engineering for machine learning. 

Key Concepts Covered:
- Importing pandas
- Creating and manipulating Series and DataFrames
- Inspecting and cleaning data
- Combining datasets
- Grouping and aggregating data
- Working with time series
- Reading and writing data
- Preparing data for AI/ML models

Next Steps:
- Explore more advanced indexing with .loc and .iloc.
- Dive into creating pivot tables.
- Practice with larger, real-world datasets from sources like Kaggle.

Happy coding! 🚀

# Pandas Advanced Topics: A Q&A Tutorial

Welcome to the next stage of your pandas journey! This notebook builds on the fundamentals and dives into more advanced topics: sophisticated data selection with .loc and .iloc, and powerful data summarization with pivot tables.

## 1. Advanced Indexing with .loc and .iloc

First, let's set up a sample DataFrame to work with. This DataFrame contains sales data for different products across several regions.

In [None]:
import pandas as pd
import numpy as np

data = {'Region': ['North', 'North', 'West', 'South', 'South', 'West'],
        'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Sales': [250, 200, 300, 450, 350, 150],
        'Quantity': [10, 8, 12, 18, 15, 5] }
df = pd.DataFrame(data, index=['R1', 'R2', 'R3', 'R4', 'R5', 'R6'])
df

### Label-Based Indexing with .loc

Question: How do you use .loc to select data for specific row and column labels? For example, select the 'Sales' and 'Quantity' for rows 'R2' through 'R4'.

In [None]:
# .loc is used for label-based indexing.
# The format is df.loc[row_labels, column_labels]
df.loc['R2':'R4', ['Sales', 'Quantity']]

### Integer-Position Based Indexing with .iloc

Question: How do you use .iloc to select data based on integer positions? For example, select the first three rows and the first two columns.

In [None]:
# .iloc is used for integer-location based indexing.
# The format is df.iloc[row_positions, column_positions]
df.iloc[0:3, 0:2]

Question: What is the key difference between .loc and .iloc when slicing?

Answer: .loc slices by label and includes both endpoints, while .iloc slices by integer position and, like standard Python slicing, excludes the stop position.
- df.loc['R2':'R4'] includes rows with labels 'R2', 'R3', and 'R4'.
- df.iloc[1:4] includes rows at positions 1, 2, and 3 (which correspond to 'R2', 'R3', 'R4'), but excludes the row at position 4.
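
The endpoint behavior described above can be checked directly. This sketch uses a cut-down version of the sales frame (only the 'Sales' column) so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {'Sales': [250, 200, 300, 450, 350, 150]},
    index=['R1', 'R2', 'R3', 'R4', 'R5', 'R6'],
)

# Label slice: both 'R2' and 'R4' are included.
by_label = df.loc['R2':'R4']

# Position slice: position 4 is excluded, so 1:4 covers positions 1, 2, 3.
by_position = df.iloc[1:4]

# Using the position of 'R4' (3) as the stop drops 'R4' from the result.
one_short = df.iloc[1:3]

print(list(by_label.index))
print(list(by_position.index))
print(list(one_short.index))
```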

## 2. Diving into Pivot Tables

Pivot tables summarize data by reshaping it: you choose which column's values become the row labels, which become the column labels, and which values are aggregated in each cell.

Question: How do you create a pivot table to show the average sales for each product in each region?

In [None]:
# We can use the pivot_table() method.
# 'index' specifies the rows, 'columns' specifies the columns, and 'values' are the data to aggregate.
pivot1 = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='mean')
pivot1
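
It can help to think of a pivot table as a group-by followed by a reshape. The sketch below rebuilds the sales frame and shows that `groupby(...).mean().unstack()` gives the same table; the name `via_groupby` is illustrative only.

```python
import pandas as pd

data = {'Region': ['North', 'North', 'West', 'South', 'South', 'West'],
        'Product': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Sales': [250, 200, 300, 450, 350, 150],
        'Quantity': [10, 8, 12, 18, 15, 5]}
df = pd.DataFrame(data, index=['R1', 'R2', 'R3', 'R4', 'R5', 'R6'])

pivot1 = df.pivot_table(values='Sales', index='Region',
                        columns='Product', aggfunc='mean')

# Group by both keys, average, then move 'Product' from the row index
# into the columns.
via_groupby = df.groupby(['Region', 'Product'])['Sales'].mean().unstack('Product')

print(pivot1)
print(via_groupby)
```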

Question: How can you create a more complex pivot table that shows both the total sales and total quantity for each product, grouped by region?

In [None]:
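# Pass several value columns and map each one to its aggregation function with a dict.
# Listing both 'Region' and 'Product' in index gives one row per region-product pair.
# The string 'sum' is used instead of np.sum, which triggers a FutureWarning in recent pandas.
pivot2 = df.pivot_table(values=['Sales', 'Quantity'],
                        index=['Region', 'Product'],
                        aggfunc={'Sales': 'sum', 'Quantity': 'sum'})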
pivot2

## Summary and Further Learning

In this notebook, we've explored two powerful, advanced features in pandas:

- Advanced Indexing: You learned to select data with precision using label-based (.loc) and position-based (.iloc) indexing.
- Pivot Tables: You saw how to reshape and summarize data to gain insights quickly.

What's next?
- Try using multi-level indexing in your DataFrames.
- Explore pd.crosstab(), a specialized pivot table that computes a frequency table of two or more factors.
- Work on a real-world dataset to apply these concepts.

Keep practicing to become a pandas expert! 🚀