# Pandas Basics

**SESSION 1: Introduction to Pandas**

*What is Pandas?*

The Problem Pandas Solves

Real-world data is:
- Large
- Messy
- In tables (rows & columns)
- Sourced from CSV, Excel, SQL, APIs

Pandas is built to:

- Handle tabular data
- Clean, filter, transform data
- Prepare data for analysis & visualization

*Pandas vs Excel vs SQL*

| Tool   | Purpose                                         |
| ------ | ----------------------------------------------- |
| Excel  | Manual analysis, small data                     |
| SQL    | Query data from database                        |
| Pandas | Clean, transform, analyze data programmatically |


*Core Pandas Data Structures*

*Series (1D Data)*

*What is a Series?*

A Series is:

- One-dimensional
- Like a single Excel column
- Has index + values

In [None]:
import pandas as pd

In [None]:
marks = pd.Series([20,30,40,50])
print (marks)

In [None]:
marks = pd.Series([20, 30, 40, 50], index=["maths", "english", "science", "hindi"])
print(marks)

In [None]:

marks = pd.Series(
    [20, 30],
    index=["math", "science"],
    name="marks"
)

print(marks)



In [None]:
salaries = pd.Series([20000, 40000, 50000, 60000, 70000], index=["rahul", "amit", "sam", "jhon", "jack"])
print(salaries)

**Interview Questions (Series)**

1) What is a Pandas Series?
- A Pandas Series is a one-dimensional labeled data structure that can hold any data type.

2) Difference between list and Series?
- A list is a basic Python collection without labels, while a Series has indexed labels and powerful data operations.

3) Can a Series have custom index?
- Yes, a Pandas Series can have user-defined (custom) index labels.

4) Is Series mutable?
- Yes, a Pandas Series is mutable, meaning its values can be changed.

5) What is dtype in Pandas?
- dtype represents the data type of elements stored in a Pandas object.
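
The answers above can be checked with a small sketch. The marks data here is made up for illustration; it shows label-based access (which a plain list lacks), in-place mutation, vectorized arithmetic, and the `dtype` attribute.

```python
import pandas as pd

# Hypothetical marks data, used only for illustration
marks = pd.Series([20, 30, 40], index=["maths", "english", "science"], name="marks")

# Label-based access, which a plain Python list does not offer
english_mark = marks["english"]

# A Series is mutable: assign a new value to an existing label
marks["english"] = 35

# Arithmetic applies to every element at once
scaled = marks * 2

print(marks.dtype)  # data type of the stored values
print(scaled)
```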

*DataFrame (2D Data)*

What is a DataFrame?

A DataFrame is:
- Two-dimensional
- Rows + Columns
- Like an Excel sheet / SQL table

In [None]:
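# Creating a DataFrame
# From a dictionary: each key becomes a column name
# (column names are capitalized to match the cells that follow)
data = {
    'Name': ['rahul', 'amit', 'jhon'],
    'Age': [20, 30, 22],
    'Salary': [20000, 30000, 50000]
}

df = pd.DataFrame(data)

print(df)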


*DataFrame Anatomy*

In [None]:
df.shape      # (rows, columns)
df.columns    # column names
df.index      # row labels
df.dtypes     # data types


*Accessing DataFrame Data*

In [None]:
# Select Column
df['Salary'] 


In [None]:
# Select Multiple Columns
df[['Name', 'Salary']]


In [None]:
# Add New Column
df['Bonus'] = df['Salary'] * 0.1


In [None]:
# Delete Column
df.drop('Bonus', axis=1, inplace=True)


In [None]:
data = {
    'name': ['rahul','amit','sam','jhon','jack'],
    'salary': [20000,4000,50000,60000,70000]

}
df = pd.DataFrame(data)
print(df)
# Add a new column: Tax = Salary * 0.05
df["tax"]= df["salary"]* 0.05
print(df)
# Remove the Tax column
df.drop("tax", axis = 1, inplace = True)
print(df)

**Interview Questions (DataFrame)**

1) What is a DataFrame?
- A DataFrame is a two-dimensional labeled data structure in Pandas used to store tabular data.

2) Difference between Series & DataFrame?
- A Series is one-dimensional, while a DataFrame is two-dimensional with rows and columns.

3) How do you create a DataFrame?
- A DataFrame can be created using dictionaries, lists, arrays, or files like CSV/Excel.

4) How do you add/remove a column?
- Columns are added using df["col"] = value and removed using df.drop("col", axis=1), which returns a new DataFrame unless inplace=True is passed.

5) What does df.shape return?
- df.shape returns the number of rows and columns as a tuple.
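
As a minimal sketch of questions 3 to 5, assuming made-up records: a list of dictionaries is another common input for `pd.DataFrame`, and `drop` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Hypothetical records; each dict becomes one row
records = [
    {"Name": "rahul", "Salary": 20000},
    {"Name": "amit", "Salary": 30000},
]
df_demo = pd.DataFrame(records)

# Add a derived column
df_demo["Bonus"] = df_demo["Salary"] * 0.1

# drop() returns a new DataFrame; df_demo still has the Bonus column
without_bonus = df_demo.drop(columns="Bonus")

print(df_demo.shape)        # (2, 3)
print(without_bonus.shape)  # (2, 2)
```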

**SESSION 2: Reading Real Data & Data Inspection**

What Is “Real Data” in Analytics?

In real jobs, data comes from:

- CSV files (most common)
- Excel files
- Databases (SQL)
- APIs

In [None]:
import pandas as pd


In [None]:
# Read CSV in Pandas
df = pd.read_csv('employees.csv')

*Reading Excel File*

In [None]:
# Read Excel
df = pd.read_excel('sales.xlsx')


In [None]:
# Read Specific Sheet
df = pd.read_excel('sales.xlsx', sheet_name='January')




*Interview Questions (Reading Files)*

1) Difference between CSV and Excel?
- CSV is a plain-text file of comma-separated values with no formatting, while an Excel workbook can hold multiple sheets, formatting, and formulas.

2) How do you read a CSV file in Pandas?
- Use pd.read_csv("file.csv") to read a CSV file.

3) How do you read a specific Excel sheet?
- Use pd.read_excel("file.xlsx", sheet_name="Sheet1") to read a specific sheet.

4) What happens if file path is wrong?
- Pandas raises a FileNotFoundError if the file path is incorrect.
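
A small sketch of questions 2 and 4. To stay self-contained it reads CSV text from memory via `io.StringIO` instead of a real file; the data and the missing file name are made up for illustration.

```python
import io
import pandas as pd

# In-memory CSV text standing in for a real file (hypothetical data)
csv_text = "Name,Salary\nrahul,20000\namit,30000\n"
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)

# A wrong path raises FileNotFoundError, which can be caught explicitly
try:
    pd.read_csv("this_file_does_not_exist.csv")
    file_found = True
except FileNotFoundError:
    file_found = False

print(file_found)
```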

In [None]:
# View First Rows
df.head()
df.head(10)
# View Last Rows
df.tail()
# Shape of Data
df.shape
# Data Types
df.dtypes
# Full Data Info
df.info()
# Statistical Summary
df.describe()



**Interview Questions (Inspection)**

1) Difference between head() and tail()?
- head() shows the first rows of a DataFrame, while tail() shows the last rows.

2) What does df.shape return?
- df.shape returns the number of rows and columns as a tuple.

3) Why is df.info() important?
- df.info() provides summary information about columns, data types, and missing values.

4) What is object dtype?
- object dtype usually represents string or mixed data types in Pandas.

5) What does df.describe() show?
- df.describe() displays statistical summary like count, mean, min, max, and quartiles of numeric columns.

*Practice*

**Interview Questions (Scenario-Based)**

1) You load a dataset. What are the first 5 commands you run?

- df.head(), df.shape, df.columns, df.info(), df.describe() are the first commands to understand the data.

2) Why might describe() not show all columns?

- describe() summarizes only numeric columns by default; pass include="all" to add categorical or object columns.

3) How do you identify missing values quickly?

- Use df.isnull().sum() to quickly count missing values in each column.

4) Why is checking data types important before analysis?

- Because wrong data types can lead to incorrect calculations and analysis results.
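
The sketch below illustrates questions 2 to 4 on a made-up DataFrame in which `Salary` was loaded as text. It counts missing values, shows that `describe()` skips the text column, and converts the column with `pd.to_numeric` so the mean can be computed.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing Age, and Salary stored as text
df_demo = pd.DataFrame({
    "Name": ["rahul", "amit", "sam"],
    "Age": [20, np.nan, 22],
    "Salary": ["20000", "30000", "50000"],
})

# Missing values per column
missing_per_column = df_demo.isnull().sum()

# Salary is text, so the default summary covers only Age
numeric_summary_columns = list(df_demo.describe().columns)

# Convert the text column to numbers before doing arithmetic on it
df_demo["Salary"] = pd.to_numeric(df_demo["Salary"])
mean_salary = df_demo["Salary"].mean()

# include="all" adds non-numeric columns such as Name to the summary
full_summary = df_demo.describe(include="all")

print(missing_per_column)
print(numeric_summary_columns)
print(mean_salary)
```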

**SESSION 3: Selecting, Filtering & Slicing Data**

*Mental Model First*

In [None]:
# Select a Single Column
df['Salary']
# Select Multiple Columns
df[['Name', 'Salary']]
# Select Age column
df['Age']
# Select Name and City together
df[['Name', 'City']]

*Selecting Rows (iloc vs loc)*

*iloc → Position-based*

In [None]:
df.iloc[0]       # First row
df.iloc[0:3]     # First 3 rows
df.iloc[:, 0]    # All rows, first column
df.iloc[:, 1:3]  # All rows, columns 1 to 2


*loc → Label-based*

In [None]:
df.loc[0]                # Row with index label 0
df.loc[0, 'Salary']      # Specific value
df.loc[:, 'Salary']      # All salary values
df.loc[:, ['Name','City']]


*Filtering Rows*

In [None]:
# Basic Filtering
df[df['Salary'] > 50000]
# Multiple Conditions
df[(df['Age'] > 25) & (df['Salary'] > 50000)]


In [None]:
# Filter by String
df[df['City'] == 'Delhi']
# isin() — VERY USEFUL
df[df['City'].isin(['Delhi', 'Mumbai'])]


In [None]:
# fetch first 5 rows using iloc
df.iloc[:5]


In [None]:
# Fetch Salary of row index 2
df.iloc[2]['Salary']

In [None]:
# Slicing Rows & Columns Together
df.loc[0:2, ['Name', 'Salary']]


In [None]:
# Chaining Operations
df[df['Salary'] > 50000][['Name', 'Salary']]


In [None]:
# for large data, prefer

df.loc[df['Salary'] > 50000, ['Name', 'Salary']]


In [None]:
# Show names of employees earning more than average salary
df.loc[df['Salary'] > df['Salary'].mean(), 'Name']


*Practice*

In [None]:
# Employees older than 24
df[df["Age"] > 24]
# Employees from Delhi
df[df["City"] == "Delhi"]

# Employees with Salary > 45,000 and City = Mumbai
df[(df["Salary"] > 45000) & (df["City"] == "Mumbai")]

# Employees from Delhi or Mumbai
df[(df["City"] == "Delhi") | (df["City"] == "Mumbai")]



Interview Answer

- iloc works with integer positions.
- loc works with labels and names.

*Interview Questions (loc & iloc)*

Difference between loc and iloc?
- loc selects data by label-based indexing, while iloc selects data by integer-position indexing.

Which one is faster?
- iloc can be marginally faster because it skips the label lookup, but the difference is usually negligible; choose based on whether you have positions or labels.

Can loc accept slicing?
- Yes, loc supports label-based slicing and includes both start and end labels.
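
The difference is easiest to see when the index labels are not 0, 1, 2. A minimal sketch, assuming a made-up DataFrame with labels 10 to 40:

```python
import pandas as pd

# Hypothetical data with non-default index labels
df_demo = pd.DataFrame(
    {"Name": ["rahul", "amit", "sam", "jack"],
     "Salary": [20000, 40000, 50000, 70000]},
    index=[10, 20, 30, 40],
)

# iloc slices by position: positions 0 and 1, end excluded
by_position = df_demo.iloc[0:2]

# loc slices by label: labels 10, 20 and 30, end included
by_label = df_demo.loc[10:30]

print(len(by_position))  # 2
print(len(by_label))     # 3
```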

*Interview Questions (Filtering)*

Why do we use & instead of and?
- & performs element-wise logical operations on arrays/Series, while and works only with single booleans.

What does boolean indexing mean?
- Boolean indexing means filtering data using a boolean (True/False) condition.

How do you filter multiple conditions?
- By combining conditions using & (AND) or | (OR) with parentheses.
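
A short sketch of these answers on made-up data: `&` and `|` combine conditions element by element, while `and` asks for a single truth value of a whole Series and raises a `ValueError`.

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({"Age": [22, 28, 35], "Salary": [40000, 60000, 45000]})

# Each condition is wrapped in parentheses, then combined element-wise
both = df_demo[(df_demo["Age"] > 25) & (df_demo["Salary"] > 50000)]
either = df_demo[(df_demo["Age"] > 25) | (df_demo["Salary"] > 50000)]

# Using `and` fails because a Series has no single True/False value
try:
    df_demo[(df_demo["Age"] > 25) and (df_demo["Salary"] > 50000)]
    and_failed = False
except ValueError:
    and_failed = True

print(len(both))    # 1
print(len(either))  # 2
print(and_failed)   # True
```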

*Interview Scenario Questions*

How would you filter high-value customers?
- By applying a condition on spending or revenue columns (e.g., amount > threshold).

How do you select specific columns after filtering?
- By chaining column selection after a filter or using loc with row conditions and column names.

When would you prefer loc over iloc?
- When filtering by labels or selecting specific column names clearly and readably.
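
As one possible sketch of the first two scenarios: the `orders` table, its column names, and the threshold of 1000 are all assumptions made for illustration. A single `loc` call applies the row condition and picks the columns together.

```python
import pandas as pd

# Hypothetical order data; column names are assumptions
orders = pd.DataFrame({
    "Customer": ["A", "B", "C", "D"],
    "Amount": [1200, 300, 5000, 800],
})

threshold = 1000  # assumed cut-off for a "high-value" customer

# Row condition and column list in one loc call
high_value = orders.loc[orders["Amount"] > threshold, ["Customer", "Amount"]]

print(high_value)
```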

**SESSION 4: Sorting, Indexing & Resetting Index**

**Sorting Data**

*What is Sorting?*

*Sorting = arranging rows based on column values.*

Common business questions:

- Who earns the highest salary?
- Top 10 customers by revenue?
- Lowest performing products?

In [None]:
# Basic Sorting (Single Column)
df.sort_values('Salary')


In [None]:
# Descending Order
df.sort_values('Salary', ascending=False)


In [None]:
# Sort by Multiple Columns
df.sort_values(['City', 'Salary'], ascending=[True, False], inplace=True)




In [None]:
# Sort employees by Age (youngest first)
df.sort_values(['Age'])

In [None]:
# Sort employees by Salary (highest first)
df.sort_values(['Salary'], ascending=False)

In [None]:
# Sort by City, then Salary
df.sort_values(['City', 'Salary'])



**Understanding Index**

*What is an Index?*

*Index = row labels, not row numbers.*

In [None]:
# Reset Index
df.reset_index()


In [None]:
# Proper Way (USE THIS)
df.reset_index(drop=True)


In [None]:
# Filter data where Salary > 50000 (assign the result so the filter is kept)

df = df[df["Salary"] > 50000]


In [None]:
# Reset index properly
df = df.reset_index(drop=True)


In [None]:
# Print final DataFrame
print(df)


*Interview Questions (Index)*

What is an index in Pandas?
- It is the set of row labels used to identify and align rows in a DataFrame; the labels do not have to be unique.

Why do we reset index after filtering?
- To make the row numbers sequential again after rows are removed.

What does drop=True do?
- It prevents the old index from being added as a column.
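
A minimal sketch of the last two answers, using made-up data: after filtering, the surviving rows keep their old labels, and `reset_index` renumbers them with or without keeping the old labels as a column.

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({
    "Name": ["rahul", "jhon", "jack"],
    "Salary": [20000, 60000, 70000],
})

# Filtering keeps the original labels, here 1 and 2
filtered = df_demo[df_demo["Salary"] > 50000]

# Without drop=True the old labels are kept in a new "index" column
kept = filtered.reset_index()

# With drop=True the old labels are discarded
dropped = filtered.reset_index(drop=True)

print(list(filtered.index))  # [1, 2]
print(list(kept.columns))    # ['index', 'Name', 'Salary']
print(list(dropped.index))   # [0, 1]
```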

*Setting a Column as Index*

In [None]:
df.set_index('Name')


In [None]:
# Set Name as index
df.set_index("Name", inplace=True)



In [None]:
# Reset it back (no drop=True here, otherwise the Name column would be lost)
df = df.reset_index()
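
A short sketch of why a column is set as the index, on made-up data: once `Name` is the index, rows can be looked up by name with `loc`, and a plain `reset_index()` turns the index back into a regular column.

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({"Name": ["rahul", "amit"], "Salary": [20000, 30000]})

# Name becomes the row label
by_name = df_demo.set_index("Name")

# Look up a value by row label and column name
amit_salary = by_name.loc["amit", "Salary"]

# reset_index() without drop=True restores Name as a column
restored = by_name.reset_index()

print(amit_salary)             # 30000
print(list(restored.columns))  # ['Name', 'Salary']
```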

*Interview Questions (set_index)*

*Chaining: Filter → Sort → Reset*

Mini Analyst Task

Show top 3 highest paid employees from Delhi

Steps:

- Filter City = Delhi
- Sort Salary descending
- Take first 3 rows
- Reset index
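
The four steps above can be sketched as one chain. The employee data is made up for illustration; the column names `Name`, `City`, and `Salary` follow the earlier sessions.

```python
import pandas as pd

# Hypothetical employee data
df_demo = pd.DataFrame({
    "Name": ["rahul", "amit", "sam", "jhon", "jack", "neha"],
    "City": ["Delhi", "Mumbai", "Delhi", "Delhi", "Delhi", "Mumbai"],
    "Salary": [20000, 40000, 50000, 60000, 70000, 80000],
})

top_delhi = (
    df_demo[df_demo["City"] == "Delhi"]      # filter City = Delhi
    .sort_values("Salary", ascending=False)  # highest salary first
    .head(3)                                 # take first 3 rows
    .reset_index(drop=True)                  # clean 0, 1, 2 index
)

print(top_delhi)
```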

*Interview Scenario Question*

“You filtered a DataFrame and now the index looks strange.
What will you do and why?”


- I will reset the index using reset_index(drop=True) to make the data clean and readable.

**SESSION 5: Mini Analytics Project (End-to-End)**

In [None]:
# Step 1: Load the Dataset
import pandas as pd

df = pd.read_csv('students.csv')



In [None]:
# Step 2: Inspect the Data
df.head()
df.shape
df.columns
df.info()


In [None]:
# Step 3: Business Question
# Find students who scored more than 80 marks

high_scorers = df[df['Marks'] > 80]


In [None]:
# Step 4: Sort the Results
# Show top scorers first

high_scorers = high_scorers.sort_values('Marks', ascending=False)


In [None]:
# Step 5: Select Relevant Columns
# Show only Name and Marks for reporting

result = high_scorers[['Name', 'Marks']]


In [None]:
# Step 6: Reset Index

result = result.reset_index(drop=True)


In [None]:
# Step 7: Save Output for Stakeholders

result.to_csv('top_students.csv', index=False)


In [None]:
df

*Practice Tasks*

In [None]:
# Task 1
df_task1 = df.loc[
    (df["City"] == "Delhi") & (df["Marks"] > 50) & (df["Subject"] == "Math")
].sort_values(by="Marks", ascending=False)
print(df_task1)





In [None]:
# Task 2
df_task2 = df.loc[
    df["City"] == "Mumbai",
    ["Name", "Subject", "Marks"]
]
print (df_task2)

**Interview Questions**

What do you do immediately after loading a dataset?
- I inspect it using head(), info(), and the shape attribute.

Why sort after filtering, not before?
- Filtering first reduces the number of rows, so the sort has less data to process.

Why do you reset index here?
- To make the output clean, readable, and ready for reporting.

How do you analyze a CSV file?
- Load it with pandas, inspect structure, clean data, then analyze key columns.

How do you filter high-performing records?
- Apply conditional filters on performance metrics (e.g., Marks > threshold).

Why is index resetting important?
- It keeps row numbers clean and meaningful after filtering or sorting.

How do you prepare data for reporting?
- Clean, filter, sort, select relevant columns, and format values.

Can you explain your Pandas pipeline step by step?
- Load → filter → sort → select → reset index → export.
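
The pipeline can be sketched end to end as follows. To stay self-contained it loads made-up CSV text from memory and exports to a string instead of writing `top_students.csv`; the student names and marks are assumptions.

```python
import io
import pandas as pd

# In-memory CSV text standing in for students.csv (hypothetical data)
csv_text = "Name,Marks\nasha,91\nravi,72\nmeera,85\n"

# Load
df_students = pd.read_csv(io.StringIO(csv_text))

report = (
    df_students[df_students["Marks"] > 80]  # filter
    .sort_values("Marks", ascending=False)  # sort
    .loc[:, ["Name", "Marks"]]              # select
    .reset_index(drop=True)                 # reset index
)

# Export: with no path given, to_csv returns the CSV text as a string
csv_output = report.to_csv(index=False)

print(report)
```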