<a href="https://colab.research.google.com/github/urmilapol/urmilapolprojects/blob/master/pandastutphy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

import pandas as pd
o	Loads the Pandas library into memory
o	pd is the standard alias (community convention)
o	Makes pd.DataFrame() and pd.Series() available for creating DataFrames and Series

import numpy as np
o	Imports NumPy, which Pandas depends on for numerical operations
o	np is the standard alias
o	Used for array operations alongside Pandas

print("Pandas version:", pd.__version__)
o	Shows the installed Pandas version (e.g., "2.1.4")
o	Confirms that the installation and import worked before proceeding


In [1]:
# Install pandas (run once)
# !pip install pandas

# Import pandas
import pandas as pd
import numpy as np
print("Pandas version:", pd.__version__)


Pandas version: 2.2.2


•	pd.Series() converts Python list [10, 20, 30, 40, 50] to Pandas Series
•	Auto-generates index: 0, 1, 2, 3, 4 (starts at 0 by default)
•	Data type: int64 (all integers)
•	Variable s: Now holds the Series object


In [2]:
# From list
s = pd.Series([10, 20, 30, 40, 50])
print("Basic Series:")
print(s)
print("\nData type:", s.dtype)
print("Values:", s.values)
print("Index:", s.index)


Basic Series:
0    10
1    20
2    30
3    40
4    50
dtype: int64

Data type: int64
Values: [10 20 30 40 50]
Index: RangeIndex(start=0, stop=5, step=1)
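The dtype reported above is inferred from the values in the list. As a minimal sketch (the names ints, mixed, and forced are illustrative and not part of the notebook), a single float in the list should switch the whole Series to float64, and the dtype argument can request a type explicitly:

```python
import pandas as pd

# All integers: pandas infers an integer dtype
ints = pd.Series([10, 20, 30])

# One float in the list: every value is stored as float64
mixed = pd.Series([10, 20.5, 30])

# Ask for a dtype explicitly instead of relying on inference
forced = pd.Series([10, 20, 30], dtype="float64")

print(ints.dtype)
print(mixed.dtype)   # float64
print(forced.dtype)  # float64
```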


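Attribute	Returns	Purpose	Notes
shape	(5,)	Dimensions	A Series is 1-D, so this is the row count only
size	5	Element count	Total data points
name	None, then "Calories"	Series label	None until assigned; becomes the column header in a DataFrame
dtype	int64	Data type	Storage format of the values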


In [3]:
# Series attributes
print("Shape:", s.shape)
print("Size:", s.size)
print("Name:", s.name)

# Add name to series
s.name = "Calories"
print("\nNamed Series:")
print(s)


Shape: (5,)
Size: 5
Name: None

Named Series:
0    10
1    20
2    30
3    40
4    50
Name: Calories, dtype: int64


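This code demonstrates the main indexing methods for a Pandas Series: plain brackets, position-based (.iloc), and label-based.

•	s[0]: Gets the element labeled 0, which is also the first item while the default 0 to 4 index is in place
•	s[0:3]: Slice from position 0 to 3 (excludes 3)
•	.iloc: Integer LOCation, always uses positions
•	.iloc[2]: Single position
•	.iloc[[0,2,4]]: Multiple positions (a list inside the brackets returns a Series)
•	s['banana']: Label lookup, available once the custom index is assigned
•	s['grapes':'orange']: Label slice (includes the end label)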


In [4]:
# Basic indexing
print("s[0]:", s[0])
print("s[0:3]:", s[0:3])

# Location-based (iloc)
print("\ns.iloc[2]:", s.iloc[2])
print("s.iloc[[0,2,4]]:", s.iloc[[0,2,4]])

# Custom index
fruits = ['apple', 'banana', 'grapes', 'orange', 'mango']
s.index = fruits
print("\nCustom Index Series:")
print(s)

# Label-based indexing
print("\ns['banana']:", s['banana'])
print("s['grapes':'orange']:", s['grapes':'orange'])  # Note: end inclusive!


s[0]: 10
s[0:3]: 0    10
1    20
2    30
Name: Calories, dtype: int64

s.iloc[2]: 30
s.iloc[[0,2,4]]: 0    10
2    30
4    50
Name: Calories, dtype: int64

Custom Index Series:
apple     10
banana    20
grapes    30
orange    40
mango     50
Name: Calories, dtype: int64

s['banana']: 20
s['grapes':'orange']: grapes    30
orange    40
Name: Calories, dtype: int64
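The difference between positions and labels is easiest to see when the labels are integers that do not match the positions. The sketch below uses two made-up Series (t and u are hypothetical names, not part of the notebook) and the explicit .iloc and .loc accessors, which avoid the ambiguity of plain brackets:

```python
import pandas as pd

# Integer labels deliberately out of order
t = pd.Series([10, 20, 30], index=[2, 0, 1])

first_by_position = t.iloc[0]  # position 0 holds 10
value_at_label_0 = t.loc[0]    # label 0 holds 20

# Slices: .iloc excludes the end position, .loc includes the end label
u = pd.Series([10, 20, 30], index=["apple", "banana", "grapes"])
by_position = u.iloc[0:1]           # apple only
by_label = u.loc["apple":"banana"]  # apple and banana

print(first_by_position, value_at_label_0)  # 10 20
print(len(by_position), len(by_label))      # 1 2
```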


•	Keys: Fruit names (become Series index)
•	Values: Protein amounts (become Series data)
•	Data type: float64 (decimals)

•	pd.Series(fruit_protein): Dict → Series conversion
•	Keys → Index: 'apple', 'banana', etc. become labels
•	Values → Data: 0.3, 1.1, etc. become values
•	name="Protein": Sets the Series name (used as the column header if the Series is placed in a DataFrame)


In [5]:
# Dictionary to Series
fruit_protein = {
    'apple': 0.3,
    'banana': 1.1,
    'grapes': 0.5,
    'orange': 0.9,
    'mango': 0.8
}
s2 = pd.Series(fruit_protein, name="Protein")
print("Series from dict:")
print(s2)


Series from dict:
apple     0.3
banana    1.1
grapes    0.5
orange    0.9
mango     0.8
Name: Protein, dtype: float64
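When a dict is passed together with an index argument, the index decides which keys are kept and in what order. A minimal sketch, assuming a shortened copy of the dict and a made-up 'kiwi' label that has no matching key:

```python
import pandas as pd

fruit_protein = {"apple": 0.3, "banana": 1.1, "grapes": 0.5}

# index picks and orders the keys; 'kiwi' is not in the dict, so it becomes NaN
picked = pd.Series(fruit_protein, index=["banana", "apple", "kiwi"], name="Protein")

print(picked)
print(picked.isna().sum())  # 1
```

Keys that are not listed in index (here 'grapes') are left out of the result.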


In [6]:
# Select values > 0.5
high_protein = s2[s2 > 0.5]
print("Protein > 0.5:")
print(high_protein)

# Multiple conditions (use & |, with parentheses)
medium_protein = s2[(s2 > 0.5) & (s2 < 2)]
print("\n0.5 < Protein < 2:")
print(medium_protein)


Protein > 0.5:
banana    1.1
orange    0.9
mango     0.8
Name: Protein, dtype: float64

0.5 < Protein < 2:
banana    1.1
orange    0.9
mango     0.8
Name: Protein, dtype: float64
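The filters above combine conditions with &. The same pattern works with | (or) and ~ (not); Python's own and/or keywords raise an error on a Series, which is why the element-wise operators and the parentheses are needed. A short sketch that rebuilds the protein Series so it runs on its own (the result names are illustrative):

```python
import pandas as pd

s2 = pd.Series(
    {"apple": 0.3, "banana": 1.1, "grapes": 0.5, "orange": 0.9, "mango": 0.8},
    name="Protein",
)

low_or_high = s2[(s2 < 0.5) | (s2 > 1.0)]  # OR: apple, banana
not_high = s2[~(s2 > 0.5)]                 # NOT: apple, grapes
in_range = s2[s2.between(0.5, 0.9)]        # both ends included by default

print(low_or_high.index.tolist())
print(not_high.index.tolist())
print(in_range.index.tolist())
```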


In [7]:
# Modify values
s2['mango'] = 2.8
print("After modification:")
print(s2)


After modification:
apple     0.3
banana    1.1
grapes    0.5
orange    0.9
mango     2.8
Name: Protein, dtype: float64


In [8]:
# Simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['NYC', 'LA', 'Chicago']
}
df = pd.DataFrame(data)
print("Basic DataFrame:")
print(df)


Basic DataFrame:
      Name  Age     City
0    Alice   25      NYC
1      Bob   30       LA
2  Charlie   35  Chicago


In [9]:
# More complex example
df2 = pd.DataFrame({
    'Fruit': fruits,
    'Calories': [52, 89, 69, 47, 60],
    'Protein': [0.3, 1.1, 0.7, 0.9, 0.8]
})
print("\nFruit DataFrame:")
print(df2)



Fruit DataFrame:
    Fruit  Calories  Protein
0   apple        52      0.3
1  banana        89      1.1
2  grapes        69      0.7
3  orange        47      0.9
4   mango        60      0.8


In [11]:
# Column selection
print("Calories column:", df2['Calories'])

# Multiple columns
print("\nFruit and Protein:")
print(df2[['Fruit', 'Protein']])

# Row selection: boolean mask and iloc
# Boolean indexing selects the row where 'Fruit' is 'banana'
print("\nRow for 'banana':", df2[df2['Fruit'] == 'banana'])
print("iloc[1:3]:")
print(df2.iloc[1:3])


Calories column: 0    52
1    89
2    69
3    47
4    60
Name: Calories, dtype: int64

Fruit and Protein:
    Fruit  Protein
0   apple      0.3
1  banana      1.1
2  grapes      0.7
3  orange      0.9
4   mango      0.8

Row for 'banana':     Fruit  Calories  Protein
1  banana        89      1.1
iloc[1:3]:
    Fruit  Calories  Protein
1  banana        89      1.1
2  grapes        69      0.7
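The cell above selects the banana row with a boolean mask. For comparison, here is a minimal sketch of label-based selection with .loc, assuming the 'Fruit' column is first turned into the index (by_fruit and the other result names are illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({
    "Fruit": ["apple", "banana", "grapes"],
    "Calories": [52, 89, 69],
    "Protein": [0.3, 1.1, 0.7],
})

# With fruit names as the index, .loc can look rows up by label
by_fruit = df2.set_index("Fruit")
banana_row = by_fruit.loc["banana"]              # one row, returned as a Series
banana_cal = by_fruit.loc["banana", "Calories"]  # single cell

# .loc also accepts a boolean mask plus a list of columns
mask_rows = df2.loc[df2["Fruit"] == "banana", ["Fruit", "Calories"]]

print(banana_cal)  # 89
print(mask_rows)
```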


In [12]:
# Create sample with missing data
df_missing = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8],
    'C': [9, 10, 11, 12]
})
print("DataFrame with missing values:")
print(df_missing)

# Check missing
print("\nMissing values:")
print(df_missing.isnull().sum())

# Fill missing
df_filled = df_missing.fillna(0)
print("\nFilled with 0:")
print(df_filled)


DataFrame with missing values:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  7.0  11
3  4.0  8.0  12

Missing values:
A    1
B    1
C    0
dtype: int64

Filled with 0:
     A    B   C
0  1.0  5.0   9
1  2.0  0.0  10
2  0.0  7.0  11
3  4.0  8.0  12
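Filling with 0 is only one option and can distort statistics such as the mean. Two common alternatives, sketched here on the same small frame (df_dropped and df_mean_filled are illustrative names), are dropping incomplete rows and filling each gap with its column mean:

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    "A": [1, 2, np.nan, 4],
    "B": [5, np.nan, 7, 8],
    "C": [9, 10, 11, 12],
})

# Drop every row that contains at least one missing value
df_dropped = df_missing.dropna()

# Fill each column's gaps with that column's mean
df_mean_filled = df_missing.fillna(df_missing.mean())

print(df_dropped)
print(df_mean_filled)
```

Which approach is appropriate depends on the data; neither is a default recommendation.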


In [13]:
# Create duplicates
df_dup = pd.DataFrame({
    'Name': ['A', 'B', 'A', 'C'],
    'Value': [1, 2, 1, 3]
})
print("With duplicates:")
print(df_dup)

# Remove duplicates
print("\nNo duplicates:")
print(df_dup.drop_duplicates())


With duplicates:
  Name  Value
0    A      1
1    B      2
2    A      1
3    C      3

No duplicates:
  Name  Value
0    A      1
1    B      2
3    C      3
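drop_duplicates() compares whole rows by default and keeps the first occurrence. As a small sketch of the related options (mask and last_per_name are illustrative names), duplicated() exposes the underlying boolean mask, and subset/keep control which columns are compared and which copy survives:

```python
import pandas as pd

df_dup = pd.DataFrame({
    "Name": ["A", "B", "A", "C"],
    "Value": [1, 2, 1, 3],
})

# True for rows that repeat an earlier row
mask = df_dup.duplicated()

# Compare only 'Name' and keep the last occurrence of each name
last_per_name = df_dup.drop_duplicates(subset="Name", keep="last")

print(mask.tolist())                 # [False, False, True, False]
print(last_per_name.index.tolist())  # [1, 2, 3]
```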


In [14]:
# Add column
df2['Sugar'] = [10, 12, 16, 9, 14]
print("Added Sugar column:")
print(df2)

# Drop column
df_no_cal = df2.drop('Calories', axis=1)
print("\nDropped Calories:")
print(df_no_cal)


Added Sugar column:
    Fruit  Calories  Protein  Sugar
0   apple        52      0.3     10
1  banana        89      1.1     12
2  grapes        69      0.7     16
3  orange        47      0.9      9
4   mango        60      0.8     14

Dropped Calories:
    Fruit  Protein  Sugar
0   apple      0.3     10
1  banana      1.1     12
2  grapes      0.7     16
3  orange      0.9      9
4   mango      0.8     14
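Assigning to df2['Sugar'] changes df2 in place, while drop() returns a new DataFrame and leaves the original untouched. A minimal sketch of a fully non-mutating variant, assuming a two-row stand-in frame (df, with_sugar, and without_cal are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["apple", "banana"], "Calories": [52, 89]})

# assign() returns a new DataFrame; df itself is not modified
with_sugar = df.assign(Sugar=[10, 12])

# drop(columns=...) is equivalent to drop(..., axis=1)
without_cal = with_sugar.drop(columns=["Calories"])

print(df.columns.tolist())           # ['Fruit', 'Calories']
print(with_sugar.columns.tolist())   # ['Fruit', 'Calories', 'Sugar']
print(without_cal.columns.tolist())  # ['Fruit', 'Sugar']
```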


In [15]:
df_renamed = df2.rename(columns={'Protein': 'Prot_g'})
print("Renamed columns:")
print(df_renamed)


Renamed columns:
    Fruit  Calories  Prot_g  Sugar
0   apple        52     0.3     10
1  banana        89     1.1     12
2  grapes        69     0.7     16
3  orange        47     0.9      9
4   mango        60     0.8     14


In [16]:
print("Original:")
print(df2)

# Scale a column: multiply every value by 1.1
df2['Calories'] = df2['Calories'] * 1.1  # 10% increase
print("\nCalories +10%:")
print(df2)


Original:
    Fruit  Calories  Protein  Sugar
0   apple        52      0.3     10
1  banana        89      1.1     12
2  grapes        69      0.7     16
3  orange        47      0.9      9
4   mango        60      0.8     14

Calories +10%:
    Fruit  Calories  Protein  Sugar
0   apple      57.2      0.3     10
1  banana      97.9      1.1     12
2  grapes      75.9      0.7     16
3  orange      51.7      0.9      9
4   mango      66.0      0.8     14
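Two things happen in the cell above: the Calories column changes from integers to floats, and the original values are overwritten, so running the cell a second time would apply the 10% increase again. One possible way to avoid that, sketched on a two-row stand-in frame with a hypothetical column name Calories_plus10, is to write the result to a new column:

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["apple", "banana"], "Calories": [52, 89]})

# Keep the original column and store the scaled values separately
df["Calories_plus10"] = (df["Calories"] * 1.1).round(1)

print(df)
print(df.dtypes)
```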


describe() = "How does my data behave?"  → Stats
info()    = "What is my data made of?"   → Structure


In [17]:
print("Summary statistics:")
print(df2.describe())
print("\nColumn info:")
df2.info()  # info() prints its report itself and returns None


Summary statistics:
        Calories   Protein      Sugar
count   5.000000  5.000000   5.000000
mean   69.740000  0.760000  12.200000
std    18.218205  0.296648   2.863564
min    51.700000  0.300000   9.000000
25%    57.200000  0.700000  10.000000
50%    66.000000  0.800000  12.000000
75%    75.900000  0.900000  14.000000
max    97.900000  1.100000  16.000000

Column info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Fruit     5 non-null      object 
 1   Calories  5 non-null      float64
 2   Protein   5 non-null      float64
 3   Sugar     5 non-null      int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 292.0+ bytes
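describe() returns a DataFrame that can be stored and reused, whereas info() writes its report straight to the output and returns None, so wrapping it in print() would add a trailing None line. A minimal sketch of both behaviors on a two-row stand-in frame (all variable names are illustrative); the buf parameter of info() is used to capture the report as text:

```python
import io

import pandas as pd

df = pd.DataFrame({
    "Fruit": ["apple", "banana"],
    "Calories": [57.2, 97.9],
    "Sugar": [10, 12],
})

# describe() skips non-numeric columns unless include="all" is passed
stats = df.describe()
all_stats = df.describe(include="all")

# Redirect the info() report into a string buffer
buffer = io.StringIO()
returned = df.info(buf=buffer)
info_text = buffer.getvalue()

print(stats.columns.tolist())      # ['Calories', 'Sugar']
print(all_stats.columns.tolist())  # ['Fruit', 'Calories', 'Sugar']
print(returned)                    # None
```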


In [18]:
# Save to CSV
df2.to_csv('fruits_data.csv', index=False)
print("Saved to CSV")

# Read CSV
df_read = pd.read_csv('fruits_data.csv')
print("\nRead from CSV:")
print(df_read)


Saved to CSV

Read from CSV:
    Fruit  Calories  Protein  Sugar
0   apple      57.2      0.3     10
1  banana      97.9      1.1     12
2  grapes      75.9      0.7     16
3  orange      51.7      0.9      9
4   mango      66.0      0.8     14
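The index=False argument in the save step keeps the row index out of the file. The sketch below shows what happens with and without it; it writes to in-memory io.StringIO buffers instead of a file on disk, and the frame and variable names are illustrative:

```python
import io

import pandas as pd

df = pd.DataFrame({"Fruit": ["apple", "banana"], "Sugar": [10, 12]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
back_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
back_clean = pd.read_csv(without_index)

print(back_with_index.columns.tolist())  # ['Unnamed: 0', 'Fruit', 'Sugar']
print(back_clean.columns.tolist())       # ['Fruit', 'Sugar']
```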
