# Chapter 20: Getting Started with pandas

Introduction to pandas — Series, DataFrame, indexing, selection, and core operations



### What is pandas? (Slide 24)


<p><strong>pandas</strong> is the standard Python library for manipulating and analyzing tabular data.</p>
<p><strong>Two Core Data Structures:</strong></p>
<ul>
<li><strong>Series</strong> — 1D labeled array (like a column in a spreadsheet)</li>
<li><strong>DataFrame</strong> — 2D labeled table (like a spreadsheet or SQL table)</li>
</ul>
<p><strong>Why pandas?</strong></p>
<ul>
<li>Built on top of NumPy — fast vectorized operations</li>
<li>Handles missing data (<code>NaN</code>) natively</li>
<li>Powerful I/O: read CSV, Excel, JSON, SQL, HTML, HDF5</li>
<li>GroupBy, merge, reshape, pivot — all in one library</li>
<li>Time series support built-in</li>
</ul>


> **Note:** Install with `pip install pandas`


### Series vs DataFrame (Slide 43)


<p><strong>Series</strong> — A single column of data with labels (index).</p>
<pre><code>Index  │ Value
───────┼──────
  a    │  10
  b    │  20
  c    │  30
</code></pre>
<p>Think of it as: a <strong>single column</strong> in an Excel spreadsheet, or a Python dict with ordered keys.</p>
<p><strong>DataFrame</strong> — A table of data with row labels (index) + column labels.</p>
<pre><code>       │ Name     │ Age │ City
───────┼──────────┼─────┼──────────
  0    │ Alice    │  25 │ New York
  1    │ Bob      │  30 │ London
  2    │ Charlie  │  35 │ Tokyo
</code></pre>
<p>Think of it as: an <strong>Excel spreadsheet</strong>, a SQL table, or a dict of Series (each column is a Series sharing the same index).</p>
<p><strong>Relationship:</strong></p>
<ul>
<li>A DataFrame is made up of multiple Series — one per column</li>
<li>Extracting a single column from a DataFrame gives you a Series</li>
<li><code>df['Name']</code> → returns a Series</li>
<li><code>df[['Name', 'Age']]</code> → returns a smaller DataFrame</li>
</ul>


> **Note:** Series = 1 column, DataFrame = table of columns
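
The relationship listed above can be checked directly. The following is a minimal sketch that rebuilds the example table from this section; the variable names `df`, `name_col`, and `subset` are illustrative only.

```python
import pandas as pd

# Rebuild the example table from this section.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "London", "Tokyo"],
})

name_col = df["Name"]          # single brackets: one column, a Series
subset = df[["Name", "Age"]]   # list of names: a smaller DataFrame

print(type(name_col).__name__)  # Series
print(type(subset).__name__)    # DataFrame

# Every column shares the DataFrame's row index.
print(name_col.index.equals(df.index))  # True
```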


### Introduction to Series (Slide 25)


In [1]:
import pandas as pd
import numpy as np

# pd.Series(data, index) — 1D labeled array
# .values  — underlying NumPy array
# .index   — index labels
# .name    — optional name for the Series

obj = pd.Series([4, 7, -5, 3])
print(obj)
# 0    4
# 1    7
# 2   -5
# 3    3
# dtype: int64

print(obj.values)  # [ 4  7 -5  3] — NumPy array
print(obj.index)   # RangeIndex(start=0, stop=4, step=1)

# Custom index labels
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
print(obj2['a'])   # -5 — access by label
print(obj2[['a', 'b', 'c']])  # select multiple


0    4
1    7
2   -5
3    3
dtype: int64
[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)
-5
a   -5
b    7
c    3
dtype: int64


> **Note:** Series = NumPy array + index labels


### Series from Dict & Operations (Slide 26)


In [2]:
# Creating Series from a dict (keys become index)
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj = pd.Series(sdata)
print(obj)

# NumPy-like operations preserve the index-value link
print(obj[obj > 20000])   # Filter by condition
print(obj * 2)            # Scalar multiplication
print(np.exp(obj))        # NumPy ufuncs work (exp overflows to inf for values this large)

# 'in' checks the INDEX, not values
print('Ohio' in obj)   # True
print(35000 in obj)    # False (not checking values!)

# pd.isnull(series) / pd.notnull(series) — detect missing values
# .isnull() / .notnull() — instance methods too
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj2 = pd.Series(sdata, index=states)  # California → NaN
print(pd.isnull(obj2))
# California     True
# Ohio          False
# Oregon        False
# Texas         False


Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
Ohio     35000
Texas    71000
dtype: int64
Ohio       70000
Texas     142000
Oregon     32000
Utah       10000
dtype: int64
Ohio      inf
Texas     inf
Oregon    inf
Utah      inf
dtype: float64
True
False
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool




> **Note:** 'in' operator checks the index, not values
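
Because `in` looks only at the index, a membership test on the values has to be written differently. One possible sketch, reusing the `sdata` dict from the cell above (the result variable names are made up for illustration):

```python
import pandas as pd

sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj = pd.Series(sdata)

label_hit = "Ohio" in obj                 # tests the index labels
value_hit_wrong = 35000 in obj            # still tests the index, so False
value_hit = bool((obj == 35000).any())    # element-wise comparison on values
value_hit_isin = bool(obj.isin([35000, 99]).any())  # several candidates at once

print(label_hit, value_hit_wrong, value_hit, value_hit_isin)
# True False True True
```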


### Introduction to DataFrame (Slide 27)


In [3]:
# pd.DataFrame(data, columns, index) — 2D labeled table
# Equal-length dict of lists is the most common input

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

print(frame)
#    state  year  pop
# 0   Ohio  2000  1.5
# 1   Ohio  2001  1.7
# ...

# .head(n) — first n rows (default 5)
# .tail(n) — last n rows
# .shape   — (rows, cols) tuple
# .columns — column labels
# .dtypes  — data type of each column
print(frame.head())
print(frame.shape)    # (6, 3)
print(frame.columns)  # Index(['state', 'year', 'pop'])
print(frame.dtypes)


    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
(6, 3)
Index(['state', 'year', 'pop'], dtype='str')
state        str
year       int64
pop      float64
dtype: object


> **Note:** DataFrame is like a dict of Series sharing the same index


### DataFrame: Columns & Rows (Slide 28)


In [4]:
data = {'state': ['Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2001, 2002],
        'pop': [1.5, 1.7, 2.4, 2.9]}
frame = pd.DataFrame(data)

# --- Accessing Columns ---
# frame['col']  — returns Series (bracket notation)
# frame.col     — dot notation (only for valid Python names)
print(frame['state'])
print(frame.year)       # Same as frame['year']

# --- Accessing Rows ---
# frame.loc[label]   — by index label
# frame.iloc[pos]    — by integer position
print(frame.loc[1])     # Row with label 1
print(frame.iloc[0])    # First row

# --- Modifying Columns ---
frame['debt'] = 16.5               # Scalar → broadcast to all
frame['debt'] = np.arange(4.)      # Array assignment
frame['eastern'] = frame.state == 'Ohio'  # Computed column

# --- Deleting Columns ---
del frame['eastern']  # Removes column in-place
print(frame.columns)


0      Ohio
1      Ohio
2    Nevada
3    Nevada
Name: state, dtype: str
0    2000
1    2001
2    2001
3    2002
Name: year, dtype: int64
state    Ohio
year     2001
pop       1.7
Name: 1, dtype: object
state    Ohio
year     2000
pop       1.5
Name: 0, dtype: object
Index(['state', 'year', 'pop', 'debt'], dtype='str')


> **Note:** Bracket notation always works; dot notation fails for names with spaces
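
A small sketch of where dot notation breaks down. The column names `unit price` and `count` are invented for this example: the first contains a space, the second collides with an existing DataFrame method.

```python
import pandas as pd

frame = pd.DataFrame({
    "unit price": [1.5, 2.0],   # space: not a valid Python identifier
    "count": [10, 20],          # same name as the DataFrame.count method
})

# Bracket notation works for both columns.
print(frame["unit price"].tolist())  # [1.5, 2.0]
print(frame["count"].tolist())       # [10, 20]

# Attribute access finds the method first, not the column.
print(callable(frame.count))         # True
```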


### DataFrame from Nested Dict (Slide 29)


In [5]:
# Nested dict → outer keys = columns, inner keys = row index
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio':   {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame = pd.DataFrame(pop)
print(frame)
#       Nevada  Ohio
# 2001     2.4   1.7
# 2002     2.9   3.6
# 2000     NaN   1.5    ← Nevada has no 2000, so NaN

# .T — transpose (swap rows and columns)
print(frame.T)

# .index.name  — name for the row index
# .columns.name — name for the column labels
frame.index.name = 'year'
frame.columns.name = 'state'
print(frame)

# .to_numpy() — convert DataFrame to NumPy array
# .values      — same (older syntax)
print(frame.to_numpy())
# [[2.4 1.7]
#  [2.9 3.6]
#  [nan 1.5]]


      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5
        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5
state  Nevada  Ohio
year               
2001      2.4   1.7
2002      2.9   3.6
2000      NaN   1.5
[[2.4 1.7]
 [2.9 3.6]
 [nan 1.5]]


> **Note:** Missing data at non-overlapping keys becomes NaN automatically


### Index Objects (Slide 30)


In [6]:
# Index objects are IMMUTABLE — cannot be modified after creation
# This makes them safe to share between data structures

import pandas as pd
import numpy as np

obj = pd.Series(range(3), index=['a', 'b', 'c'])
idx = obj.index
print(idx)       # Index(['a', 'b', 'c'], dtype='str'); older pandas shows dtype='object'
print(idx[1:])   # Index(['b', 'c']) — sliceable

# idx[1] = 'd'  # TypeError! Index is immutable

# Index works like a fixed-size set (but can have duplicates)
print('a' in idx)   # True
print('z' in idx)   # False

# Useful Index methods:
# .append(other)    — concatenate with another Index
# .difference(other) — set difference
# .intersection(other) — set intersection
# .union(other)     — set union
# .isin(values)     — boolean array of membership
# .unique()         — unique values
# .is_unique        — True if no duplicates


Index(['a', 'b', 'c'], dtype='str')
Index(['b', 'c'], dtype='str')
True
False


> **Note:** Immutability makes Index safe to share between structures
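
The Index methods listed in the cell above are not exercised there. A short sketch of the set-style operations, using two made-up indexes `left` and `right`:

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

print(left.union(right).tolist())         # ['a', 'b', 'c', 'd']
print(left.intersection(right).tolist())  # ['b', 'c']
print(left.difference(right).tolist())    # ['a']
print(left.isin(["a", "z"]).tolist())     # [True, False, False]

# Unlike a Python set, an Index may hold duplicates.
dup = pd.Index(["x", "x", "y"])
print(dup.is_unique)          # False
print(dup.unique().tolist())  # ['x', 'y']

# Item assignment is rejected; build a new Index instead.
try:
    left[1] = "z"
    mutated = True
except TypeError:
    mutated = False
print(mutated)  # False
```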


### Reindexing (Slide 31)


In [7]:
# .reindex(new_index) — create new object aligned to new index
# Missing labels → NaN.  Existing labels → values preserved
# method='ffill' — forward fill (carry last valid value forward)
# method='bfill' — backward fill

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
print(obj2)
# a   -5.3
# b    7.2
# c    3.6
# d    4.5
# e    NaN   ← new label, no data

# Forward fill interpolation
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
print(obj3.reindex(range(6), method='ffill'))
# 0      blue
# 1      blue    ← filled from 0
# 2    purple
# 3    purple    ← filled from 2
# 4    yellow
# 5    yellow    ← filled from 4


a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: str


> **Note:** Reindex creates a new object — does not modify in-place
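
`reindex` also works on a DataFrame, along rows, columns, or both. The sketch below uses an invented 3x3 frame and shows the `fill_value` argument as an alternative to NaN for labels that did not exist before.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# New row label 'b' has no data, so it becomes NaN (ints are upcast to float).
rows = frame.reindex(index=["a", "b", "c", "d"])

# fill_value supplies a value for labels that were not present before.
filled = frame.reindex(index=["a", "b", "c", "d"], fill_value=0)

# Reindexing columns selects, reorders, and adds columns in one step.
cols = frame.reindex(columns=["Texas", "Utah", "California"])

print(rows)
print(filled)
print(cols)
```

The original `frame` keeps its three rows and three columns throughout.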


### Dropping Entries (Slide 32)


In [8]:
# .drop(labels)          — drop rows by label (returns new object)
# .drop(labels, axis=1)  — drop columns
# .drop(labels, inplace=True) — modify in-place (careful!)

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
print(obj.drop('c'))        # Drop row 'c'
print(obj.drop(['c', 'd'])) # Drop multiple

# DataFrame
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Drop rows
print(data.drop(['Colorado', 'Ohio']))

# Drop columns (axis=1 or axis='columns')
print(data.drop('two', axis=1))
print(data.drop(['two', 'four'], axis='columns'))

# In-place (mutates obj itself; the dropped row is gone from it)
# obj.drop('c', inplace=True)  # ⚠️ No undo!


a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
e    4.0
dtype: float64
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14


> **Note:** ⚠️ inplace=True mutates the original object. Keep a copy if you may need the dropped data.


### Selection with loc and iloc (Slide 33)


In [9]:
# .loc[row_label, col_label]  — LABEL-based indexing
# .iloc[row_pos, col_pos]     — INTEGER position-based indexing
# ⚠️ loc slicing is INCLUSIVE of endpoint!
# ⚠️ iloc slicing is EXCLUSIVE (like Python)

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# loc — label-based
print(data.loc['Colorado', ['two', 'three']])
print(data.loc[:'Utah', 'two'])  # Inclusive of 'Utah'!

# iloc — integer-based
print(data.iloc[2])              # 3rd row (Utah)
print(data.iloc[2, [3, 0, 1]])   # Utah: cols 3,0,1
print(data.iloc[[1, 2], [3, 0, 1]])  # Rows 1-2, cols 3,0,1
print(data.iloc[:, :3])          # All rows, first 3 cols

# Boolean selection
print(data.loc[data.three > 5])
print(data.iloc[:, :3][data.three > 5])


two      5
three    6
Name: Colorado, dtype: int64
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int64
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
four    11
one      8
two      9
Name: Utah, dtype: int64
          four  one  two
Colorado     7    4    5
Utah        11    8    9
          one  two  three
Ohio        0    1      2
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three
Colorado    4    5      6
Utah        8    9     10
New York   12   13     14


> **Note:** loc for labels, iloc for integers — loc includes endpoint!
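
The endpoint difference is easiest to see when the labels are themselves integers. A minimal sketch with two made-up Series, `ser` and `shifted`:

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40], index=[0, 1, 2, 3])

by_label = ser.loc[0:2]      # labels 0, 1 and 2: endpoint included
by_position = ser.iloc[0:2]  # positions 0 and 1: endpoint excluded

print(by_label.tolist())     # [10, 20, 30]
print(by_position.tolist())  # [10, 20]

# With a shuffled integer index the two select different rows entirely.
shifted = pd.Series([10, 20, 30], index=[2, 0, 1])
print(shifted.loc[0])   # 20, the row labeled 0
print(shifted.iloc[0])  # 10, the first row
```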


### Arithmetic & Data Alignment (Slide 34)


In [10]:
# When operating on two Series/DataFrames, pandas aligns on index
# Non-overlapping labels → NaN
# .add(other, fill_value=0)  — add with custom fill for missing
# .sub(), .mul(), .div()     — same pattern

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

print(s1 + s2)
# a    5.2
# c    1.1
# d    NaN   ← 'd' only in s1
# e    0.0
# f    NaN   ← 'f' only in s2
# g    NaN   ← 'g' only in s2

# DataFrame alignment (rows AND columns)
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                   columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                   columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(df1 + df2)       # NaN where labels don't overlap
print(df1.add(df2, fill_value=0))  # Missing side treated as 0; NaN remains where both sides are missing


a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN
            b    c     d     e
Colorado  6.0  7.0   8.0   NaN
Ohio      3.0  1.0   6.0   5.0
Oregon    9.0  NaN  10.0  11.0
Texas     9.0  4.0  12.0   8.0
Utah      0.0  NaN   1.0   2.0


> **Note:** Alignment introduces NaN where labels don't overlap
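
The same `fill_value` idea applies to Series. The sketch below reuses `s1` and `s2` from the cell above; `plain` and `filled` are illustrative names.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

plain = s1 + s2                    # NaN for 'd', 'f', 'g'
filled = s1.add(s2, fill_value=0)  # a label missing on one side counts as 0

print(int(plain.isna().sum()))     # 3
print(filled)
```

With `fill_value=0`, the labels `d`, `f`, and `g` keep the value from whichever Series has them instead of becoming NaN.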


### Operations Between DataFrame & Series (Slide 35)


In [11]:
# Broadcasting between DataFrame and Series (like NumPy)
# By default: Series index matches DataFrame COLUMNS
# Use axis='index' to match rows instead

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
print(frame)

# Series from first row
series = frame.iloc[0]  # b=0, d=1, e=2

# Subtract series from each row (broadcasting down)
print(frame - series)
# Each row has series subtracted:
# Utah     0  0  0
# Ohio     3  3  3
# Texas    6  6  6
# Oregon   9  9  9

# Broadcast across columns instead (match on index/rows)
series2 = frame['d']  # Utah=1, Ohio=4, Texas=7, Oregon=10
print(frame.sub(series2, axis='index'))
# Subtracts column 'd' value from every column in each row


          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0


> **Note:** Default broadcast matches columns; use axis='index' for rows
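
A common use of both broadcast directions is centering data. This sketch reuses the `frame` from the cell above; `col_centered` and `row_centered` are names chosen for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.0).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

# frame.mean() is indexed by column label, so plain subtraction lines up.
col_centered = frame - frame.mean()

# frame.mean(axis=1) is indexed by row label, so match on rows explicitly.
row_centered = frame.sub(frame.mean(axis=1), axis="index")

print(col_centered.mean().tolist())       # [0.0, 0.0, 0.0]
print(row_centered.loc["Utah"].tolist())  # [-1.0, 0.0, 1.0]
```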


### Function Application & Mapping (Slide 36)


In [12]:
# np.abs(frame)     — NumPy ufuncs work on DataFrames!
# .apply(func)      — apply function to each COLUMN (or row)
# .apply(func, axis='columns') — apply to each ROW
# frame.map(func)   - apply function to every ELEMENT (named applymap before pandas 2.1)
# series.map(func)  — element-wise on a Series

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# NumPy ufuncs
print(np.abs(frame))

# apply: function over columns (returns one value per column)
f = lambda x: x.max() - x.min()  # Range of each column
print(frame.apply(f))  # One value per column

# apply across rows
print(frame.apply(f, axis='columns'))  # One value per row

# DataFrame.map (formerly applymap): element-wise formatting
fmt = lambda x: f'{x:.2f}'
print(frame.map(fmt))

# Series.map — element-wise
print(frame['e'].map(fmt))


               b         d         e
Utah    0.194380  0.437472  0.835484
Ohio    0.231714  0.496480  0.672926
Texas   0.485528  1.862816  0.343481
Oregon  0.543482  0.406126  0.611766
b    1.02901
d    1.45669
e    1.44725
dtype: float64
Utah      1.029864
Ohio      0.904641
Texas     2.348344
Oregon    1.155248
dtype: float64
            b     d      e
Utah    -0.19  0.44   0.84
Ohio    -0.23  0.50   0.67
Texas   -0.49  1.86   0.34
Oregon   0.54  0.41  -0.61
Utah       0.84
Ohio       0.67
Texas      0.34
Oregon    -0.61
Name: e, dtype: str


> **Note:** apply → rows/cols, DataFrame.map (formerly applymap) → every element, Series.map → one Series
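
Since the element-wise DataFrame method is spelled `map` in pandas 2.1 and later but `applymap` in earlier releases, code that must run on both can pick whichever exists. The following is one possible sketch with fixed data rather than random numbers; the helper names `spread` and `elementwise` are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame(
    [[1.5, -2.25], [-0.5, 4.0]],
    index=["Utah", "Ohio"],
    columns=["b", "d"],
)

def spread(values):
    """Range (max minus min) of a Series."""
    return values.max() - values.min()

per_column = frame.apply(spread)               # one value per column
per_row = frame.apply(spread, axis="columns")  # one value per row

# Use DataFrame.map when available, otherwise fall back to applymap.
elementwise = getattr(frame, "map", None) or frame.applymap
formatted = elementwise(lambda x: f"{x:.2f}")

print(per_column.tolist())         # [2.0, 6.25]
print(per_row.tolist())            # [3.75, 4.5]
print(formatted.loc["Utah", "d"])  # -2.25
```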


### Sorting & Ranking (Slide 37)


In [13]:
# .sort_index()           — sort by index labels
# .sort_index(axis=1)     — sort columns alphabetically
# .sort_values(by='col')  — sort by column values
# .sort_values(by=['a','b']) — sort by multiple columns
# .rank()                  — assign ranks (avg for ties)
# .rank(method='first')    — break ties by order of appearance

obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
print(obj.sort_index())       # a=1, b=2, c=3, d=0
print(obj.sort_index(ascending=False))  # Descending

obj2 = pd.Series([4, np.nan, 7, -3, 2])
print(obj2.sort_values())  # NaN sorted to the end by default (na_position='last')

# DataFrame sorting
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
print(frame.sort_values(by='b'))         # Sort by col 'b'
print(frame.sort_values(by=['a', 'b']))  # Sort by 'a' then 'b'

# Ranking
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
print(obj.rank())  # Ties get average rank
print(obj.rank(method='first'))  # Ties broken by position


a    1
b    2
c    3
d    0
dtype: int64
d    0
c    3
b    2
a    1
dtype: int64
3   -3.0
4    2.0
0    4.0
2    7.0
1    NaN
dtype: float64
   b  a
2 -3  0
3  2  1
0  4  0
1  7  1
   b  a
2 -3  0
0  4  0
3  2  1
1  7  1
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64


> **Note:** NaN values are placed at the end by default when sorting; pass na_position='first' to put them first
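
A brief sketch of the sorting and ranking options that the cell above does not run: `na_position`, descending order, and `rank(method='min')`. The data repeats the Series from that cell; variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, -3, 2])

default_order = obj.sort_values()                 # NaN last
nan_first = obj.sort_values(na_position="first")  # NaN first
descending = obj.sort_values(ascending=False)     # NaN still last

print(default_order.index.tolist())  # [3, 4, 0, 2, 1]
print(nan_first.index.tolist())      # [1, 3, 4, 0, 2]
print(descending.index.tolist())     # [2, 0, 4, 3, 1]

# method='min' gives every tied value the lowest rank in its group.
scores = pd.Series([7, -5, 7, 4, 2, 0, 4])
min_ranks = scores.rank(method="min")
print(min_ranks.tolist())  # [6.0, 1.0, 6.0, 4.0, 3.0, 2.0, 4.0]
```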


### Summary Statistics (Slide 38)


In [14]:
# .describe()  — summary stats for all columns at once
# .sum()       — column totals
# .mean()      — column means (skips NaN by default)
# .median()    — column medians
# .min() / .max() — min/max per column
# .idxmin() / .idxmax() — index LABEL of min/max
# .cumsum()    — cumulative sum
# .pct_change() — percent change between consecutive elements

df = pd.DataFrame({'one': [1.4, 7.1, np.nan, 0.75],
                   'two': [np.nan, -4.5, np.nan, -1.3]},
                  index=['a', 'b', 'c', 'd'])

print(df.sum())           # Column sums (NaN skipped)
print(df.sum(axis=1))     # Row sums
print(df.mean(axis=1, skipna=False))  # NaN if any missing

print(df.idxmax())   # Label of max in each column
print(df.cumsum())   # Running total; NaN positions stay NaN but do not reset the sum

print(df.describe())
# count, mean, std, min, 25%, 50%, 75%, max


one    9.25
two   -5.80
dtype: float64
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
one    b
two    d
dtype: str
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000


> **Note:** Most reductions skip NaN by default; pass skipna=False to make any NaN propagate to the result
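
The cell above lists `.idxmin()`, `.median()`, and `.pct_change()` without running them. A short sketch on the same `df`, plus an invented `prices` Series for the percent-change case:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"one": [1.4, 7.1, np.nan, 0.75], "two": [np.nan, -4.5, np.nan, -1.3]},
    index=["a", "b", "c", "d"],
)

lowest = df.idxmin()   # label of the minimum: one -> 'd', two -> 'b'
middle = df.median()   # NaN skipped: one -> 1.4, two -> -2.9
print(lowest.tolist())
print(middle)

# Percent change relative to the previous element; the first entry is NaN.
prices = pd.Series([100.0, 110.0, 99.0])
changes = prices.pct_change()
print(changes.round(2).tolist())  # [nan, 0.1, -0.1]
```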


### Correlation & Covariance (Slide 39)


In [15]:
# .corr()       — pairwise correlation of all columns
# .cov()        — pairwise covariance of all columns
# .corrwith(other) — correlation with another Series or DataFrame
# series.corr(other) — correlation between two Series
# series.cov(other)  — covariance between two Series

import pandas as pd
import numpy as np

# Example with random data
np.random.seed(42)
df = pd.DataFrame(np.random.randn(100, 4),
                  columns=['A', 'B', 'C', 'D'])

# Correlation matrix (values between -1 and +1)
print(df.corr())
#           A         B         C         D
# A  1.000000 -0.016390  0.017745 -0.019994
# B -0.016390  1.000000 -0.033668  0.136856
# ...

# Covariance matrix
print(df.cov())

# Correlation between two specific columns
print(df['A'].corr(df['B']))
print(df['A'].cov(df['B']))

# Correlation with a specific Series
print(df.corrwith(df['A']))  # Each col's corr with 'A'


          A         B         C         D
A  1.000000 -0.016390  0.017745 -0.019994
B -0.016390  1.000000 -0.033668  0.136856
C  0.017745 -0.033668  1.000000 -0.038070
D -0.019994  0.136856 -0.038070  1.000000
          A         B         C         D
A  0.753536 -0.013548  0.016081 -0.017048
B -0.013548  0.906749 -0.033471  0.128004
C  0.016081 -0.033471  1.089965 -0.039040
D -0.017048  0.128004 -0.039040  0.964795
-0.016389923345609087
-0.013547919243909752
A    1.000000
B   -0.016390
C    0.017745
D   -0.019994
dtype: float64


> **Note:** corr=1 (perfect positive), corr=-1 (perfect negative), corr=0 (no linear relationship)
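
To make the endpoints concrete, the sketch below builds Series with exact and noisy linear relationships and also checks that the default (Pearson) correlation equals the covariance divided by both standard deviations. The seed and the names `x`, `y_pos`, `y_neg`, and `y_noisy` are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
noise = pd.Series(rng.normal(size=200))

y_pos = 2 * x + 1    # exact increasing linear function of x
y_neg = -3 * x       # exact decreasing linear function of x
y_noisy = x + noise  # linear signal plus independent noise

print(round(x.corr(y_pos), 6))  # 1.0
print(round(x.corr(y_neg), 6))  # -1.0
print(x.corr(y_noisy))          # strictly between -1 and 1

# Pearson correlation is covariance scaled by both standard deviations.
manual = x.cov(y_noisy) / (x.std() * y_noisy.std())
print(bool(np.isclose(manual, x.corr(y_noisy))))  # True
```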


### Unique Values & Counts (Slide 40)


In [16]:
# .unique()        — array of unique values (in order of appearance)
# .nunique()       — count of unique values
# .value_counts()  — frequency count (sorted descending)
# .isin(values)    — boolean: is each element in the given list?

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

print(obj.unique())  # c, a, d, b: unique values in order of first appearance
print(obj.nunique()) # 4

print(obj.value_counts())
# c    3
# a    3
# b    2
# d    1

# Filter with isin (great for subsetting)
mask = obj.isin(['b', 'c'])
print(mask)
print(obj[mask])  # Only 'b' and 'c' values

# DataFrame value_counts across columns
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3]})
print(data.apply(lambda x: x.value_counts()).fillna(0))


<StringArray>
['c', 'a', 'd', 'b']
Length: 4, dtype: str
4
c    3
a    3
b    2
d    1
Name: count, dtype: int64
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
0    c
5    b
6    b
7    c
8    c
dtype: str
   Qu1  Qu2
1  1.0  1.0
2  0.0  2.0
3  2.0  2.0
4  2.0  0.0


> **Note:** value_counts is great for quick frequency analysis
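
For frequency analysis, relative shares are often more useful than raw counts. A minimal sketch using the `normalize=True` option on the same Series as above:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

counts = obj.value_counts()
shares = obj.value_counts(normalize=True)  # relative frequencies, sum to 1

print(counts["a"], counts["d"])            # 3 1
print(round(shares["b"], 3))               # 0.222
print(counts.sort_index().index.tolist())  # ['a', 'b', 'c', 'd']
```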


### Handling Missing Data (Slide 41)


In [17]:
# pd.isnull(obj) / .isnull()     — True where NaN
# pd.notnull(obj) / .notnull()   — True where not NaN
# .dropna()                      — drop rows with ANY NaN
# .dropna(how='all')             — drop only if ALL values NaN
# .dropna(axis=1)                — drop columns with NaN
# .dropna(thresh=n)              — keep rows with at least n non-NaN
# .fillna(value)                 — replace NaN with a value
# .ffill()        — forward fill
# .bfill()        — backward fill

data = pd.Series([1, np.nan, 3.5, np.nan, 7])
print(data.dropna())        # [1.0, 3.5, 7.0]
print(data.fillna(0))       # NaN → 0
print(data.ffill())  # NaN → previous value

# DataFrame
df = pd.DataFrame(np.random.randn(4, 3))
df.iloc[:2, 1] = np.nan
df.iloc[:1, 2] = np.nan

print(df.dropna())            # Rows with ANY NaN gone
print(df.dropna(how='all'))   # Only if ALL cols are NaN
print(df.dropna(thresh=2))    # Keep if ≥2 non-NaN values
print(df.fillna({1: 0.5, 2: 0}))  # Per-column fill values


0    1.0
2    3.5
4    7.0
dtype: float64
0    1.0
1    0.0
2    3.5
3    0.0
4    7.0
dtype: float64
0    1.0
1    1.0
2    3.5
3    3.5
4    7.0
dtype: float64
          0         1         2
2 -1.067620 -0.142379  0.120296
3  0.514439  0.711615 -1.124642
          0         1         2
0 -1.594428       NaN       NaN
1  0.046981       NaN  0.622850
2 -1.067620 -0.142379  0.120296
3  0.514439  0.711615 -1.124642
          0         1         2
1  0.046981       NaN  0.622850
2 -1.067620 -0.142379  0.120296
3  0.514439  0.711615 -1.124642
          0         1         2
0 -1.594428  0.500000  0.000000
1  0.046981  0.500000  0.622850
2 -1.067620 -0.142379  0.120296
3  0.514439  0.711615 -1.124642


> **Note:** dropna/fillna return new objects; assign the result back (df = df.dropna()) or pass inplace=True to keep the change
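
The options `dropna(axis=1)` and `.bfill()` are listed above but not run. The sketch below shows them on a small invented frame, along with filling each column by its own mean, which is one common choice and not the only reasonable one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, np.nan, 6.0],
    "z": [7.0, 8.0, 9.0],
})

complete_cols = df.dropna(axis=1)  # keep only columns with no NaN
back_filled = df.bfill()           # pull the next valid value upward
df_clean = df.fillna(df.mean())    # per-column mean as the fill value

print(complete_cols.columns.tolist())  # ['z']
print(back_filled["x"].tolist())       # [1.0, 3.0, 3.0]
print(df_clean["x"].tolist())          # [1.0, 2.0, 3.0]
print(int(df.isna().sum().sum()))      # 3, the original is unchanged
```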


### pandas Best Practices (Slide 42)


<p><strong>✅ Do:</strong></p>
<ul>
<li>Use <code>.loc</code> and <code>.iloc</code> for clear, explicit indexing</li>
<li>Use <code>.copy()</code> when you need an independent subset</li>
<li>Chain methods: <code>df.dropna().sort_values('col').head(10)</code></li>
<li>Use <code>value_counts()</code> for quick data exploration</li>
<li>Check <code>.dtypes</code> and <code>.shape</code> early and often</li>
<li>Use <code>.describe()</code> to quickly understand distributions</li>
</ul>
<p><strong>❌ Don't:</strong></p>
<ul>
<li>Use chained indexing: <code>df['col'][0]</code> — use <code>df.loc[0, 'col']</code> instead</li>
<li>Iterate row-by-row with for loops — use vectorized ops or <code>.apply()</code></li>
<li>Forget that <code>loc</code> slicing is INCLUSIVE (unlike Python/iloc)</li>
<li>Ignore <code>SettingWithCopyWarning</code>: it signals that an assignment may have landed on a temporary copy rather than on the original DataFrame</li>
</ul>
<p><strong>Quick Reference:</strong></p>
<ul>
<li><code>loc</code> = <strong>L</strong>abels | <code>iloc</code> = <strong>I</strong>ntegers</li>
<li><code>apply</code> = rows/cols | <code>DataFrame.map</code> (formerly <code>applymap</code>) = every cell | <code>Series.map</code> = one Series</li>
<li><code>sort_index</code> = by labels | <code>sort_values</code> = by data</li>
</ul>
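
Several of the recommendations above fit into one short sketch: a single `.loc` assignment instead of chained indexing, an explicit `.copy()` for an independent subset, and method chaining. The frame and its column names `col` and `flag` are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col": [3, 1, 2], "flag": [True, False, True]})

# One .loc call with row and column together writes to df itself.
df.loc[0, "col"] = 10

# An explicit copy makes the subset independent of the original.
subset = df[df["flag"]].copy()
subset.loc[:, "col"] = 0

# Method chaining: each step returns a new object.
top = df.sort_values("col", ascending=False).head(2)

print(df["col"].tolist())      # [10, 1, 2]
print(subset["col"].tolist())  # [0, 0]
print(top.index.tolist())      # [0, 2]
```

Changing `subset` leaves `df` untouched because of the `.copy()` call.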
