
# Introduction to Pandas
<span style="font-size:18px; font-weight:bold;">

Welcome! In this section, we'll explore Pandas, one of the most widely used Python libraries for working with structured (tabular) data. Think of it as a programmable spreadsheet: more powerful, faster, and more flexible. It is built on top of NumPy and is a staple of data analysis, data wrangling, and machine learning workflows.

**Recommended Learning Flow**

To get the most out of Pandas, follow these topics in order:

1. Series
2. DataFrames  
3. Missing Data
4. GroupBy
5. Merging, Joining, and Concatenating
6. Operations
7. Data Input and Output

---
<span>



# Series: The Building Block of Pandas

## What is a Series?
<span style="font-size:18px; font-weight:bold;">

A Series is a one-dimensional, array-like object in which every value has an index label. It behaves like a NumPy array with one addition: you can assign custom labels to the values.

- Can hold any data type (int, float, string, Python objects)

- Each value has an index label; labels default to integer positions but can be customized

- Works much like a dictionary: each label maps to a value

<span style="font-size:18px; font-weight:bold;">

**Setup**
<span>

In [378]:
import numpy as np
import pandas as pd

<span style="font-size:18px; font-weight:bold;">

## Creating a Series

You can create a Series from:

- Lists

- NumPy arrays

- Dictionaries



### From a List

In [382]:
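labels = ['a', 'b', 'c']
my_list = [10, 20, 30]

pd.Series(my_list)                # default index 0, 1, 2 (not displayed: a cell shows only its last expression)
pd.Series(my_list, index=labels)  # custom labels a, b, c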

a    10
b    20
c    30
dtype: int64

<span style="font-size:18px; font-weight:bold;">

 **Tip: If you don’t provide an index, Pandas will assign default numeric indices starting from 0.**
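As a quick illustration of the tip above, here is a minimal sketch of the default index. The variable names are made up for this example, and it assumes a reasonably recent pandas version, where the default index is a RangeIndex.

```python
import pandas as pd

# No index argument: pandas falls back to integer labels 0, 1, 2.
ser = pd.Series([10, 20, 30])

default_index = ser.index   # keep a reference to the default index
print(default_index)
print(ser[0])               # look up the value stored under label 0

# Labels can be replaced after creation by assigning a new index.
ser.index = ['a', 'b', 'c']
print(ser['a'])
```

Replacing the index only relabels the values; the data itself is unchanged.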

### From a NumPy array

In [386]:
arr = np.array([10, 20, 30])

pd.Series(arr)                         # default index 0, 1, 2 (not displayed)
pd.Series(arr, index=['x', 'y', 'z'])  # custom labels x, y, z


x    10
y    20
z    30
dtype: int64

<span style="font-size:18px; font-weight:bold;">

**Use case: NumPy users will find this familiar — but Series adds label-based indexing on top.**

### From a dictionary

In [390]:
d = {'a': 10, 'b': 20, 'c': 30}
pd.Series(d)

a    10
b    20
c    30
dtype: int64

**Why it's useful: The dictionary keys become index labels. This is great for structured data with named values.**
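A dictionary can also be combined with an explicit index. In the sketch below (variable names are illustrative), the index list decides which keys are kept and in what order. A label with no matching key is filled with NaN, which is why the result is stored as float64.

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# index= selects and orders the dictionary keys.
# 'z' is not a key in d, so its value becomes NaN.
ser = pd.Series(d, index=['c', 'a', 'z'])
print(ser)
```

Note that key 'b' is dropped because it is not listed in the index.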

<span style="font-size:18px; font-weight:bold;">

## What Can a Series Hold?
**A Series can hold values of mixed types. Its dtype is then object, the same generic dtype NumPy uses for arrays of arbitrary Python objects:**

In [394]:
pd.Series(['Hello', 42, 3.14, [1,2], {'key': 'val'}])

0             Hello
1                42
2              3.14
3            [1, 2]
4    {'key': 'val'}
dtype: object

In [396]:
pd.Series([sum, print, len])  # Yes, even functions!

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object
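To make the effect of mixed values concrete, the following sketch compares the dtype of an all-integer Series with one holding mixed values. The variable names are hypothetical. It is only meant to show that an object Series stores ordinary Python objects, which is flexible but gives up the fast numeric storage of a typed Series.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])            # all integers
mixed = pd.Series(['Hello', 42, 3.14])    # mixed Python types

print(numbers.dtype)   # an integer dtype
print(mixed.dtype)     # object

# Each element of the object Series keeps its original Python type.
element_types = [type(v).__name__ for v in mixed]
print(element_types)
```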

<span style="font-size:18px; font-weight:bold;">

## Indexing and Retrieval
**The index in a Series acts like a key in a dictionary.**

In [399]:
ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

<span style="font-size:18px; font-weight:bold;">

**Access by Label**:

In [402]:
ser1['USA']  # Returns 1

1
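Beyond a single label lookup, a few other retrieval patterns follow the same dictionary-like idea. The sketch below rebuilds ser1 so it runs on its own; 'France' is a deliberately missing label chosen for illustration.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])

single = ser1.loc['Japan']              # one label -> a single value
several = ser1.loc[['USA', 'Japan']]    # list of labels -> a smaller Series
by_position = ser1.iloc[0]              # first value, whatever its label is

# ser1['France'] would raise KeyError; .get() returns a default instead.
missing = ser1.get('France', default=0)

# Membership tests look at the labels, as with dictionary keys.
is_present = 'USSR' in ser1

print(single, several.tolist(), by_position, missing, is_present)
```

Using .loc for labels and .iloc for positions keeps the intent explicit, which matters most when the labels themselves are integers.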

<span style="font-size:18px; font-weight:bold;">

## Series Arithmetic (Auto-alignment)
**Operations between Series match values by index, not by position:**

In [405]:
ser1 + ser2  # USSR and Italy become NaN where no match

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64

<span style="font-size:18px; font-weight:bold;">

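**Note: If a label is missing from either Series, the result for that label is NaN. Because NaN is a floating-point value, the result dtype becomes float64 even though both inputs hold integers.**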
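When NaN for unmatched labels is not what you want, one option is the add method with fill_value, which treats a label missing from one side as the given value. This is a sketch of one possible approach, not the only way to handle missing labels.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Plain addition: labels present in only one Series give NaN.
plain = ser1 + ser2

# add() with fill_value=0: a label missing from one Series counts as 0.
filled = ser1.add(ser2, fill_value=0)

print(plain)
print(filled)
```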

# Pandas DataFrame
<span style="font-size:18px; font-weight:bold;">
A DataFrame is like a table — a collection of multiple Series objects aligned by a common index (rows). It’s the core structure used in Pandas and is similar to Excel spreadsheets and SQL tables.
</span>
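The idea of a DataFrame as several Series sharing one index can be sketched as follows. The names marks, ages, and table and the student labels are invented for this example. The two Series list their labels in different orders on purpose, to show that rows are matched by label.

```python
import pandas as pd

marks = pd.Series([67, 89], index=['s1', 's2'])
ages = pd.Series([21, 22], index=['s2', 's1'])   # same labels, different order

# Each Series becomes a column; rows are aligned by index label.
table = pd.DataFrame({'marks': marks, 'age': ages})
print(table)

# Pulling a column back out gives a Series again.
marks_column = table['marks']
print(type(marks_column).__name__)
```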

## Creating a DataFrame Using Lists

In [463]:
columns = ['Student ID', 'Course ID', 'Marks']
student_data = [(103, 201, 67), (103, 203, 67), (103, 204, 89)]
dataframe = pd.DataFrame(student_data, columns=columns)

In [465]:
dataframe

   Student ID  Course ID  Marks
0         103        201     67
1         103        203     67
2         103        204     89


## Creating a DataFrame Using a Dictionary


In [530]:
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': ['m', 't', 'h']}
index = ['A', 'B', 'C']
df = pd.DataFrame(data, index=index)  # dictionary keys become column names


In [532]:
df

   col1  col2 col3
A     1     4    m
B     2     5    t
C     3     6    h


<span style="font-size:18px; font-weight:bold;">

## Selecting Columns

In [535]:
df['col1']   # Returns a Series

A    1
B    2
C    3
Name: col1, dtype: int64

In [537]:
df[['col1', 'col2']]   # Returns a DataFrame

   col1  col2
A     1     4
B     2     5
C     3     6


<span style="font-size:18px; font-weight:bold;">

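### Dot notation like df.col3 is possible but not recommended because:

- It fails if a column name contains spaces or is otherwise not a valid Python identifier

- It is shadowed by DataFrame attributes and methods: if a column is named count, df.count returns the method, not the column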

In [540]:
df.col3

A    m
B    t
C    h
Name: col3, dtype: object
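Both pitfalls of dot notation can be reproduced with a small sketch. The column names 'count' and 'unit price' are hypothetical, chosen only because one collides with a DataFrame method and the other contains a space.

```python
import pandas as pd

df = pd.DataFrame({'count': [1, 2], 'unit price': [9.5, 3.0]})

# df.count is the DataFrame.count method, not the 'count' column.
dot_gives_method = callable(df.count)
print(dot_gives_method)   # True

# Bracket notation always reaches the column.
counts = df['count']
prices = df['unit price']   # "df.unit price" would be a syntax error
print(counts.tolist(), prices.tolist())
```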

<span style="font-size:18px; font-weight:bold;">

## Creating New Columns

In [543]:
df['new_col'] = df['col1'] + df['col2']

In [545]:
df

   col1  col2 col3  new_col
A     1     4    m        5
B     2     5    t        7
C     3     6    h        9


<span style="font-size:18px; font-weight:bold;">

Derived columns are computed row by row from existing columns, much like filling an Excel formula down a column.
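If you would rather leave the original DataFrame untouched, the assign method returns a copy with the new column added. A minimal sketch follows; the column name ratio is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]}, index=['A', 'B', 'C'])

# assign() returns a new DataFrame; df itself is not modified.
with_ratio = df.assign(ratio=df['col2'] / df['col1'])

print(with_ratio)
print(list(df.columns))   # still only col1 and col2
```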



<span style="font-size:18px; font-weight:bold;">

## Removing Columns and Rows

<span style="font-size:18px; font-weight:bold;">

### Remove a column:

In [550]:
new_df = df.drop('new_col', axis=1)  # Returns a new DataFrame and leaves the original unchanged

In [552]:
new_df

   col1  col2 col3
A     1     4    m
B     2     5    t
C     3     6    h


In [554]:
df.drop('new_col', axis=1, inplace=True)  # Modifies in place

In [556]:
df

   col1  col2 col3
A     1     4    m
B     2     5    t
C     3     6    h


<span style="font-size:18px; font-weight:bold;">

### Remove a row

In [563]:
df.drop('A', axis=0)  # Returns a new DataFrame and leaves the original unchanged

   col1  col2 col3
B     2     5    t
C     3     6    h


In [565]:
df.drop('A', axis=0, inplace=True) # Modifies in place

In [567]:
df

   col1  col2 col3
B     2     5    t
C     3     6    h
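As an alternative to the axis argument, drop also accepts columns= and index= keywords, which some readers find easier to follow. The sketch below rebuilds the example DataFrame so it runs on its own; 'no_such_col' is a deliberately nonexistent column used to show errors='ignore'.

```python
import pandas as pd

data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': ['m', 't', 'h']}
df = pd.DataFrame(data, index=['A', 'B', 'C'])

without_col = df.drop(columns=['col2'])      # same as drop('col2', axis=1)
without_rows = df.drop(index=['A', 'C'])     # same as drop(['A', 'C'], axis=0)

# errors='ignore' skips labels that do not exist instead of raising KeyError.
unchanged = df.drop(columns=['no_such_col'], errors='ignore')

print(without_col)
print(without_rows)
```

None of these calls modifies df, since inplace is left at its default of False.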


<span style="font-size:18px; font-weight:bold;">

## Selecting Rows

In [576]:
# Rebuild df, since row A was dropped in place above
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': ['m', 't', 'h']}
index = ['A', 'B', 'C']
df = pd.DataFrame(data, index=index)

<span style="font-size:18px; font-weight:bold;">

### By label

In [578]:
df.loc['A']

col1    1
col2    4
col3    m
Name: A, dtype: object

<span style="font-size:18px; font-weight:bold;">
    
### By integer position:


In [590]:
df.iloc[2]

col1    3
col2    6
col3    h
Name: C, dtype: object

<span style="font-size:18px; font-weight:bold;">


### Subset (rows and columns):


In [594]:
df.loc[['A','B'], ['col1','col2']]


   col1  col2
A     1     4
B     2     5
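The same subset can be taken with slices instead of lists. One detail worth knowing is that label slices with loc include both endpoints, while position slices with iloc exclude the end, as in ordinary Python slicing. The sketch below rebuilds df so it is self-contained; the result variable names are illustrative.

```python
import pandas as pd

data = {'col1': [1, 2, 3], 'col2': [4, 5, 6], 'col3': ['m', 't', 'h']}
df = pd.DataFrame(data, index=['A', 'B', 'C'])

by_label = df.loc['A':'B', 'col1':'col2']   # label slices: both ends included
by_position = df.iloc[0:2, 0:2]             # position slices: end excluded

same_subset = by_label.equals(by_position)
print(same_subset)

# A single row label plus a single column label returns one value.
single_value = df.loc['C', 'col3']
print(single_value)
```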
