# Pandas Practice Notebook — Google Colab Ready

This notebook is designed for teaching and practicing **Pandas** in Google Colab. It contains step-by-step notes, runnable code cells, and small exercises using real, freely available datasets.

**What you'll learn:**

- Loading data from CSV and JSON URLs using `pd.read_csv` and `pd.read_json`
- Inspecting data with `.head()`, `.info()`, `.describe()`, `.shape`, `.dtypes`
- Basic cleaning and simple transformations
- Quick EDA snippets and small exercises for students

Each dataset section contains: a brief description, direct URL, load code, inspection, and quick EDA examples.

---

## Setup & Imports

Run this cell first. Nothing needs to be installed in Google Colab, because pandas and NumPy come preinstalled there. The cell imports the required packages and sets Pandas display options.

The notebook runs as-is in Colab. It runs the same way in local Jupyter, provided pandas and NumPy are installed.


In [1]:
import pandas as pd
import numpy as np

# Useful display options
pd.set_option('display.max_columns', 60)
pd.set_option('display.max_rows', 100)

print('pandas version:', pd.__version__)
print('numpy version:', np.__version__)

pandas version: 2.2.2
numpy version: 2.0.2


## Dataset List (URLs)

We'll use the following freely available datasets (hosted as raw GitHub/Gist URLs):

1. **Iris**: small classic dataset for classification (CSV)
   - URL: https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv

2. **Titanic**: passenger data for survival analysis (CSV)
   - URL: https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv

3. **World Population**: population count per country (JSON)
   - URL: https://gist.githubusercontent.com/carmoreira/0e023107ed43177f0f0513649a191c01/raw/population.json

### Helper: Common inspection functions

We'll use a short helper function to print common inspection outputs for any dataframe.

In [13]:
def inspect_df(df, name='DataFrame'):
    print(f"--- {name} ---")
    print('Shape:', df.shape)
    print('\nInfo:')
    # df.info() prints its summary itself and returns None, so it is not wrapped in display()
    df.info()
    print('\nHead:')
    display(df.head(2))
    print('\nTail:')
    display(df.tail(3))
    print('\nDescribe:')
    display(df.describe(include='all'))

# Note: `display` works in Colab/Jupyter to show nicer outputs


## 1) Iris Dataset (CSV)

Classic dataset: 150 rows, each with four measurements (sepal and petal length and width) and the species label for one iris flower.


In [14]:
iris_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv'
iris = pd.read_csv(iris_url)
inspect_df(iris, 'Iris')

# Quick examples
print('\nUnique species:', iris['species'].unique())
print('\nGroup mean of numeric columns by species:')
iris.groupby('species').mean()

--- Iris ---
Shape: (150, 5)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


None


Head:


| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |



Tail:


| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |



Describe:


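| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150 |
| unique | NaN | NaN | NaN | NaN | 3 |
| top | NaN | NaN | NaN | NaN | setosa |
| freq | NaN | NaN | NaN | NaN | 50 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 | NaN |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | NaN |
| min | 4.3 | 2.0 | 1.0 | 0.1 | NaN |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | NaN |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | NaN |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | NaN |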



Unique species: ['setosa' 'versicolor' 'virginica']

Group mean of numeric columns by species:


| species | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| setosa | 5.006 | 3.428 | 1.462 | 0.246 |
| versicolor | 5.936 | 2.77 | 4.26 | 1.326 |
| virginica | 6.588 | 2.974 | 5.552 | 2.026 |


**Exercise:** Find the species with the largest average petal length. Use `.groupby()` and `.sort_values()`.
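The `.groupby()` plus `.sort_values()` pattern named in the exercise can be sketched on a small made-up frame. The column names `kind` and `height` and all values below are hypothetical, chosen only for illustration; this is not the Iris data and not a worked solution.

```python
import pandas as pd

# Hypothetical toy data (not the Iris dataset)
plants = pd.DataFrame({
    "kind": ["a", "a", "b", "b", "c", "c"],
    "height": [1.0, 3.0, 5.0, 7.0, 2.0, 4.0],
})

# Mean height per kind, sorted so the largest mean comes first
mean_height = (
    plants.groupby("kind")["height"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_height)

# After sorting, the first index label is the group with the largest mean
tallest_kind = mean_height.index[0]
print("Largest mean:", tallest_kind)
```

Selecting one column before `.mean()` keeps the result a Series, which is easy to sort and index.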

## 2) Titanic Dataset (CSV)

Passenger data including survival, class, age, fare, etc.


In [15]:
titanic_url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
titanic = pd.read_csv(titanic_url)
inspect_df(titanic, 'Titanic')

# Quick cleaning example: missing ages
print('\nMissing ages:', titanic['Age'].isna().sum())
# Fill missing ages with median as a simple strategy
titanic['Age_filled'] = titanic['Age'].fillna(titanic['Age'].median())
print('Missing ages after fill:', titanic['Age_filled'].isna().sum())

# Quick aggregation
display(titanic.groupby('Pclass')['Survived'].mean())

--- Titanic ---
Shape: (891, 12)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


None


Head:


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |



Tail:


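| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |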



Describe:


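| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891 | 891 | 714.0 | 891.0 | 891.0 | 891 | 891.0 | 204 | 889 |
| unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
| top | NaN | NaN | NaN | Dooley, Mr. Patrick | male | NaN | NaN | NaN | 347082 | NaN | G6 | S |
| freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
| mean | 446.0 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
| std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
| min | 1.0 | 0.0 | 1.0 | NaN | NaN | 0.42 | 0.0 | 0.0 | NaN | 0.0 | NaN | NaN |
| 25% | 223.5 | 0.0 | 2.0 | NaN | NaN | 20.125 | 0.0 | 0.0 | NaN | 7.9104 | NaN | NaN |
| 50% | 446.0 | 0.0 | 3.0 | NaN | NaN | 28.0 | 0.0 | 0.0 | NaN | 14.4542 | NaN | NaN |
| 75% | 668.5 | 1.0 | 3.0 | NaN | NaN | 38.0 | 1.0 | 0.0 | NaN | 31.0 | NaN | NaN |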



Missing ages: 177
Missing ages after fill: 0


| Pclass | Survived |
|---|---|
| 1 | 0.62963 |
| 2 | 0.472826 |
| 3 | 0.242363 |


**Exercise:** Compute survival rate by `Sex` and `Pclass`. Which group had the highest survival?
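Grouping by two columns at once works the same way as grouping by one. The sketch below uses a made-up frame with hypothetical columns `group`, `tier`, and `completed` (a 0/1 flag), so it illustrates the pattern without solving the exercise.

```python
import pandas as pd

# Hypothetical toy data (not the Titanic dataset)
trips = pd.DataFrame({
    "group": ["x", "x", "x", "y", "y", "y", "y"],
    "tier": [1, 1, 2, 1, 1, 2, 2],
    "completed": [1, 0, 1, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the share of ones within each (group, tier) pair
rate = trips.groupby(["group", "tier"])["completed"].mean()
print(rate)

# unstack() pivots the inner key into columns for a compact table
rate_table = rate.unstack("tier")
print(rate_table)

# idxmax() returns the (group, tier) label with the highest rate
best = rate.idxmax()
print("Highest rate:", best)
```

The result of a two-key groupby has a MultiIndex, which is why `idxmax()` returns a tuple here.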

## 3) World Population (JSON)

A JSON file with one population record per country (columns `country` and `population`, 228 rows). This shows loading JSON into a DataFrame.
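To see what `pd.read_json` does with a list of records, here is a minimal sketch that parses an in-memory JSON string instead of a URL. The country names and numbers are invented for illustration, and wrapping the text in `io.StringIO` is one way to pass literal JSON rather than a path or URL.

```python
import io

import pandas as pd

# Hypothetical JSON text: a list of objects, one object per row
json_text = """
[
  {"country": "Aland", "population": 100},
  {"country": "Borduria", "population": 250}
]
"""

# Each object becomes a row and each key becomes a column
df = pd.read_json(io.StringIO(json_text))
print(df)
print(df.dtypes)
```

JSON with nested objects usually needs flattening first, for example with `pd.json_normalize`.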


In [16]:
pop_url = 'https://gist.githubusercontent.com/carmoreira/0e023107ed43177f0f0513649a191c01/raw/population.json'
pop = pd.read_json(pop_url)
inspect_df(pop, 'World Population')

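# Quick filter: population of India in 2010-2020, if the file has country-year columns.
# This file has only `country` and `population`, so the else branch runs.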
if {'Country Name','Year','Value'}.issubset(pop.columns):
    india = pop[(pop['Country Name'].str.contains('India', case=False, na=False)) & (pop['Year'] >= 2010) & (pop['Year'] <= 2020)]
    display(india.sort_values('Year'))
else:
    print('Different schema — show sample rows:')
    display(pop.head())

--- World Population ---
Shape: (228, 2)

Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228 entries, 0 to 227
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   country     228 non-null    object
 1   population  228 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 3.7+ KB


None


Head:


| | country | population |
|---|---|---|
| 0 | Afghanistan | 35530081 |
| 1 | Albania | 2930187 |



Tail:


| | country | population |
|---|---|---|
| 225 | Yugoslavia | 10640000 |
| 226 | Zambia | 17094130 |
| 227 | Zimbabwe | 16529904 |



Describe:


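| | country | population |
|---|---|---|
| count | 228 | 228.0 |
| unique | 228 | NaN |
| top | Afghanistan | NaN |
| freq | 1 | NaN |
| mean | NaN | 32651360.0 |
| std | NaN | 133513000.0 |
| min | NaN | 50.0 |
| 25% | NaN | 434956.2 |
| 50% | NaN | 5485446.0 |
| 75% | NaN | 19314860.0 |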


Different schema — show sample rows:


| | country | population |
|---|---|---|
| 0 | Afghanistan | 35530081 |
| 1 | Albania | 2930187 |
| 2 | Algeria | 41318142 |
| 3 | American Samoa | 55641 |
| 4 | Andorra | 76965 |


**Exercise:** Find the top 10 records by `population`. Which countries appear?
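Two common ways to pull the largest rows are `DataFrame.nlargest` and sorting followed by `.head()`. The sketch below shows both on a made-up frame; the single-letter names and the values are hypothetical, and only the column names mirror the population frame.

```python
import pandas as pd

# Hypothetical toy data reusing the column names `country` and `population`
places = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "population": [500, 1200, 80, 3000, 950],
})

# Option 1: nlargest picks the n rows with the largest values in one column
top3 = places.nlargest(3, "population")
print(top3)

# Option 2: sort descending, then keep the first n rows
same_top3 = places.sort_values("population", ascending=False).head(3)
print(same_top3)
```

Both options keep the original index labels, so the rows can still be traced back to the source frame.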

## Quick Tips & Common Pitfalls

- Always check `df.info()` to see dtypes and missing values.
- Use `pd.to_datetime()` to convert date-like strings to `datetime` dtype.
- When filling missing values, choose the strategy deliberately: median or mean for a quick fix, model-based imputation when the filled values matter for the analysis.
- Use `.astype()` to convert types explicitly when needed.
- For large CSVs, use `pd.read_csv(..., usecols=[...], nrows=1000)` to sample first.
- If URLs change, host copies on your own GitHub for stability.
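Several of the tips above can be tried together on a tiny in-memory CSV. This is a minimal sketch: the CSV text and its column names (`order_id`, `order_date`, `amount`, `qty`, `note`) are invented for illustration, and `io.StringIO` stands in for a file path or URL.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a large downloaded file
csv_text = """order_id,order_date,amount,qty,note
1,2024-01-05,10.5,2,ok
2,2024-01-06,,1,late
3,2024-02-10,7.0,3,ok
4,2024-02-11,12.5,,ok
"""

# Sample first: read only the columns of interest and the first 3 rows
sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["order_id", "order_date", "amount"],
    nrows=3,
)

# Convert date-like strings to a datetime dtype
sample["order_date"] = pd.to_datetime(sample["order_date"])

# Fill the missing amount with the median as a simple strategy
sample["amount"] = sample["amount"].fillna(sample["amount"].median())

# Convert a type explicitly: treat the ID as text, not as a number
sample["order_id"] = sample["order_id"].astype(str)

# Check dtypes and non-null counts after the changes
sample.info()
```

Running `sample.info()` before and after the conversions is a quick way to confirm that each step changed what you expected.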


## Small Project: Exploratory Analysis Task

Pick one dataset above and write a short EDA report (5–8 cells):

1. Load and inspect the data
2. Clean missing values (briefly describe strategy)
3. Produce 3 quick plots (histogram, boxplot, bar counts)
4. Summarize 3 key observations
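The three plot types from step 3 can be sketched with the plotting methods built into Pandas. This assumes matplotlib is available (it is in Colab); the frame and its columns `value` and `category` are hypothetical placeholders for whichever dataset you pick.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; swap in the dataset you picked
df = pd.DataFrame({
    "value": [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 8.0, 2.2],
    "category": ["a", "b", "a", "c", "a", "b", "a", "c"],
})

# One row of three axes, one per plot
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram: distribution of a numeric column
df["value"].plot(kind="hist", bins=5, ax=axes[0], title="Histogram of value")

# Boxplot: median, quartiles, and outliers of the same column
df["value"].plot(kind="box", ax=axes[1], title="Boxplot of value")

# Bar counts: number of rows per category
counts = df["category"].value_counts()
counts.plot(kind="bar", ax=axes[2], title="Rows per category")

plt.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a plain script you would add `plt.show()`.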

If you want, convert this notebook into an assignment for students and ask them to submit a short markdown report.

## Additional Resources

- Pandas docs: https://pandas.pydata.org/
- Pandas cheat sheet: search for *pandas cheat sheet* (lots of printable PDFs)
- Kaggle Datasets: https://www.kaggle.com/datasets

---

End of notebook. Happy coding!