# pandas examples

**DataFrames**

In [7]:
import pandas as pd

# Create a sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['Brussels', 'Antwerp', 'Ghent']
}

df = pd.DataFrame(data)
df.head()



      Name  Age      City
0    Alice   25  Brussels
1      Bob   30   Antwerp
2  Charlie   35     Ghent


**Using df.iterrows()**

In [16]:
# Use iterrows() to loop through each row
for idx, row in df.iterrows():
    # row is a Series, so you access columns by name
    print(f"Index {idx}: {row['Name']} is {row['Age']} years old and lives in {row['City']}")

Index 0: Alice is 25 years old and lives in Brussels
Index 1: Bob is 30 years old and lives in Antwerp
Index 2: Charlie is 35 years old and lives in Ghent


**Using df.itertuples()**

This is generally faster than iterrows() and yields namedtuples whose fields you can access by attribute:

In [18]:
for row in df.itertuples(index=True, name="Row"):
    # row.Index is the original DataFrame index
    # row.Name, row.Age, row.City are the columns
    print(f"Index {row.Index}: {row.Name} is {row.Age} years old and lives in {row.City}")

Index 0: Alice is 25 years old and lives in Brussels
Index 1: Bob is 30 years old and lives in Antwerp
Index 2: Charlie is 35 years old and lives in Ghent


**Notes on performance**
    
- itertuples() is faster and should be your default if you just need to read values.

- iterrows() converts each row to a Series (slower, and dtypes are not preserved across the row) but allows more pandas-style operations on row.
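
The cost of converting each row to a Series can be seen in a small sketch. The frame `numbers` and its column names are made up for illustration. Because `iterrows()` packs a whole row into one Series, the integer column is expected to come back as a float, while `itertuples()` keeps each value as its own column stores it.

```python
import pandas as pd

# Hypothetical frame with one integer and one float column
numbers = pd.DataFrame({"units": [1, 2], "ratio": [0.5, 1.5]})

# iterrows(): the row becomes one Series, so both values share a float dtype
_, first_row = next(numbers.iterrows())
print(first_row.dtype)      # dtype of the row Series
print(first_row["units"])   # the integer now shows up as a float

# itertuples(): each field keeps the value from its own column
first_tuple = next(numbers.itertuples(index=False))
print(first_tuple.units, first_tuple.ratio)

# When no per-row logic is needed, a vectorized expression avoids the loop
weighted = numbers["units"] * numbers["ratio"]
print(weighted.tolist())
```

For purely column-wise arithmetic like the last line, the vectorized form is usually both shorter and faster than either loop.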

**Using enumerate()**

In [26]:
# Example: Using enumerate() with .iterrows()
for i, (index, row) in enumerate(df.iterrows()):
    print(f"Row {i}: {row['Name']} is {row['Age']} years old and lives in {row['City']}")

Row 0: Alice is 25 years old and lives in Brussels
Row 1: Bob is 30 years old and lives in Antwerp
Row 2: Charlie is 35 years old and lives in Ghent


In [30]:
# Example: Using enumerate() with .itertuples()
for i, row in enumerate(df.itertuples(index=False)):
    print(f"Row {i}: {row.Name} is {row.Age} years old and lives in {row.City}")

Row 0: Alice is 25 years old and lives in Brussels
Row 1: Bob is 30 years old and lives in Antwerp
Row 2: Charlie is 35 years old and lives in Ghent


**When to use enumerate():**

- When you want to **count** how many rows you've processed.
- When the DataFrame index isn't meaningful or consistent.
- To add a **serial number** or loop index unrelated to the DataFrame index.
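
A minimal sketch of the second and third points, assuming a filtered copy of the sample data (the name `adults` is only illustrative). After filtering, the index labels no longer start at 0, so `enumerate()` supplies a clean running number.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
})

# Filtering keeps the original index labels (here 1 and 2)
adults = people[people["Age"] >= 30]

# enumerate(start=1) numbers the rows actually processed
lines = [
    f"{n}. {row.Name} (index label {row.Index})"
    for n, row in enumerate(adults.itertuples(), start=1)
]
print("\n".join(lines))
```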

## Series

A Series is:

- A one-dimensional array of data (e.g. numbers, strings, dates, etc.)
- With an associated index labeling each element
- Always ordered

Series are designed for **tabular data analysis**.

In [3]:
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s)

a    10
b    20
c    30
dtype: int64
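
The index and the values of a Series can also be inspected separately. The sketch below reuses the same three-element Series and shows access by label, access by position, and element-wise arithmetic.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

labels = list(s.index)        # the index labels
values = s.tolist()           # the underlying values
by_label = s["b"]             # access by label
by_position = s.iloc[1]       # access by integer position
doubled = s * 2               # element-wise arithmetic keeps the index

print(labels, values)
print(by_label, by_position)
print(doubled.tolist())
```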


**Series vs JSON (dictionary)**

In [12]:
# Python dictionary (the in-memory equivalent of a JSON object)
person = {
    "name": "Alice",
    "age": 30,
    "city": "Brussels"
}

# Access value
print(person["age"])  # → 30


30


In [14]:
# First execute previous cell
print(person)

{'name': 'Alice', 'age': 30, 'city': 'Brussels'}


In [8]:
# equivalent pandas series
import pandas as pd

person_series = pd.Series({
    "name": "Alice",
    "age": 30,
    "city": "Brussels"
})

# Access value
print(person_series["age"])  # → 30


30


In [15]:
# First execute previous cell
print(person_series)

name       Alice
age           30
city    Brussels
dtype: object


**Tabular data analysis**

A Pandas Series is like a typed, indexed version of a JSON object, with powerful tools for analysis, computation, and integration into DataFrames.

In [16]:
# Suppose we have multiple persons
people = pd.DataFrame([
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25}
])

# Get a Series (column)
ages = people["age"]  # Series: index = row number, values = ages

print(ages.mean())  # → 27.5

27.5
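
The link between a dictionary, a Series, and JSON text can be sketched as a round trip. This is an illustration, not the only way to convert; it uses `Series.to_dict()` and `Series.to_json()`.

```python
import json
import pandas as pd

person = {"name": "Alice", "age": 30, "city": "Brussels"}

person_series = pd.Series(person)   # dict keys become the index
as_dict = person_series.to_dict()   # back to a plain dictionary
as_json = person_series.to_json()   # JSON text with the same keys

print(as_dict == person)
print(json.loads(as_json))
```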


**Mean and Mode from a Series**

In [18]:
import pandas as pd

# Sample Series: ages of students
ages = pd.Series([20, 22, 22, 23, 21, 22, 24, 25])

# Mean (average)
mean_value = ages.mean()
print(f"Mean age: {mean_value}")  # Output: 22.375

# Mode (most frequent value)
mode_value = ages.mode()
print(f"Mode age(s): {mode_value.values}")  # Output: [22]


Mean age: 22.375
Mode age(s): [22]


| Function          | Description                     |
| ----------------- | ------------------------------- |
| `.mean()`         | Arithmetic average (if numeric) |
| `.median()`       | Middle value                    |
| `.mode()`         | Most frequent value(s)          |
| `.sum()`          | Total sum of values             |
| `.value_counts()` | Frequency of each unique value  |
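
The remaining functions from the table can be tried on the same ages. A short sketch:

```python
import pandas as pd

ages = pd.Series([20, 22, 22, 23, 21, 22, 24, 25])

median_age = ages.median()      # middle value of the sorted ages
total_age = ages.sum()          # sum of all values
counts = ages.value_counts()    # frequency per unique value, most frequent first

print(median_age)
print(total_age)
print(counts)
```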


**DataFrame: Students per Academy**

In [42]:
import pandas as pd

data = {
    'Academy': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE'],
    'Region': ['NORTH', 'WEST', 'CENTRAL', 'SOUTH', 'EAST'],
    'Students_SP': [120, 80, 150, 90, 80],
    'Type': ['HEI', 'PRO', 'HEI', 'EDU', 'PRO']
}

df = pd.DataFrame(data)
print(df)


  Academy   Region  Students_SP Type
0     AAA    NORTH          120  HEI
1     BBB     WEST           80  PRO
2     CCC  CENTRAL          150  HEI
3     DDD    SOUTH           90  EDU
4     EEE     EAST           80  PRO


In [44]:
# df.info() Gives a quick overview of column types and non-null values.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Academy      5 non-null      object
 1   Region       5 non-null      object
 2   Students_SP  5 non-null      int64 
 3   Type         5 non-null      object
dtypes: int64(1), object(3)
memory usage: 292.0+ bytes


In [46]:
#df.describe() Generates summary statistics for numeric columns.
df.describe()

       Students_SP
count          5.0
mean         104.0
std      30.495901
min           80.0
25%           80.0
50%           90.0
75%          120.0
max          150.0


In [48]:
# df.groupby() Aggregates values by category (e.g. total students per region).
df.groupby('Region')['Students_SP'].sum()

Region
CENTRAL    150
EAST        80
NORTH      120
SOUTH       90
WEST        80
Name: Students_SP, dtype: int64
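
A group can also be summarized with several aggregations at once. The sketch below groups the same academies by `Type` instead of `Region` and passes a list of function names to `agg()`.

```python
import pandas as pd

df = pd.DataFrame({
    "Academy": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "Students_SP": [120, 80, 150, 90, 80],
    "Type": ["HEI", "PRO", "HEI", "EDU", "PRO"],
})

# One row per Type, one column per aggregation
summary = df.groupby("Type")["Students_SP"].agg(["sum", "mean", "count"])
print(summary)
```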

In [50]:
#df.sort_values() Sorts the DataFrame by a column.
df.sort_values(by='Students_SP', ascending=False)

  Academy   Region  Students_SP Type
2     CCC  CENTRAL          150  HEI
0     AAA    NORTH          120  HEI
3     DDD    SOUTH           90  EDU
1     BBB     WEST           80  PRO
4     EEE     EAST           80  PRO


In [52]:
# df.value_counts() Counts unique values (e.g. how many of each type).
df['Type'].value_counts()

Type
HEI    2
PRO    2
EDU    1
Name: count, dtype: int64

In [54]:
# df.loc[] Access rows by index label.
df.loc[0]       # First row by label

Academy          AAA
Region         NORTH
Students_SP      120
Type             HEI
Name: 0, dtype: object

In [56]:
# df.iloc[] Access rows by integer position.
df.iloc[0:2]    # First two rows by position

  Academy Region  Students_SP Type
0     AAA  NORTH          120  HEI
1     BBB   WEST           80  PRO
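
Label and position only coincide while the index is the default 0, 1, 2, ... in its original order. The sketch below uses a shortened, illustrative version of the academy data. It shows the two diverging after a sort, and also that label slices include their end point while position slices do not.

```python
import pandas as pd

df = pd.DataFrame({
    "Academy": ["AAA", "BBB", "CCC"],
    "Students_SP": [120, 80, 150],
})

# After sorting, the index labels appear in the order 2, 0, 1
ranked = df.sort_values(by="Students_SP", ascending=False)

by_label = ranked.loc[0, "Academy"]         # row whose label is 0
by_position = ranked["Academy"].iloc[0]     # first row in the current order
print(by_label, by_position)

# .loc slices are inclusive, .iloc slices exclude the stop position
print(len(df.loc[0:1]), len(df.iloc[0:1]))
```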


In [58]:
# df.mean(), df.mode(), df.median() Statistical summaries.
# A notebook cell displays only the value of its last expression (here the median).
df['Students_SP'].mean()    # Average
df['Students_SP'].mode()    # Most common value(s)
df['Students_SP'].median()  # Middle value


90.0

In [60]:
#df.apply() Apply a function to each row or column.
df['Size_Category'] = df['Students_SP'].apply(lambda x: 'High' if x > 100 else 'Low')
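
For a simple two-way condition like this one, a vectorized alternative is `numpy.where`. The sketch below is not part of the original notebook; it compares both approaches on the same student counts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Students_SP": [120, 80, 150, 90, 80]})

# Row-by-row: the lambda is called once per value
with_apply = df["Students_SP"].apply(lambda x: "High" if x > 100 else "Low")

# Vectorized: the comparison and the selection run on the whole column
with_where = np.where(df["Students_SP"] > 100, "High", "Low")

print(with_apply.tolist())
print(with_where.tolist())
```

Both produce the same labels; `apply()` remains the more flexible choice when the logic does not reduce to a simple condition.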

In [62]:
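# df.isnull(), df.fillna(), df.dropna() Handle missing values.
# Each call returns a new object and leaves df unchanged; only the last result is displayed.
df.isnull().sum()  # Missing values per column
df.fillna(0)       # Copy with missing values replaced by 0
df.dropna()        # Copy without rows that contain missing values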

  Academy   Region  Students_SP Type Size_Category
0     AAA    NORTH          120  HEI          High
1     BBB     WEST           80  PRO           Low
2     CCC  CENTRAL          150  HEI          High
3     DDD    SOUTH           90  EDU           Low
4     EEE     EAST           80  PRO           Low
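
The academy data has no missing values, so the three calls have nothing to change there. A sketch with a hypothetical frame `scores` that contains one gap makes the effect visible:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing student count
scores = pd.DataFrame({
    "Academy": ["AAA", "BBB", "CCC"],
    "Students_SP": [120, np.nan, 150],
})

missing_per_column = scores.isnull().sum()   # count of missing values per column
filled = scores.fillna({"Students_SP": 0})   # new frame, gap replaced by 0
complete_rows = scores.dropna()              # new frame, incomplete rows removed

print(missing_per_column)
print(filled)
print(complete_rows)
```

The original `scores` frame still contains the gap afterwards, because each call returns a new object.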


In [64]:
#df.to_csv(), df.to_json() Export your DataFrame.
df.to_csv("students.csv", index=False)
df.to_json("students.json", orient="records")
# orient="records": Specifies the format of the output JSON — in this case, a list of row-wise dictionaries.

**What does orient="records" mean?**

This tells pandas to convert each row of the DataFrame into a dictionary and write out a list of those dictionaries, a format that suits APIs and is easy for people to read.

Other possible orientations: **"split", "index", "columns", "values", "table"**, each suited to different use cases.

| Use Case                      | Recommended `orient`     |
| ----------------------------- | ------------------------ |
| API data exchange             | `"records"` or `"table"` |
| Re-importing into Pandas      | `"split"` or `"table"`   |
| Index-based processing        | `"index"`                |
| Column-based analytics/export | `"columns"`              |


**Examples of JSON Output**

In [66]:
#orient="records"
[
  {"Name": "Alice", "Age": 25},
  {"Name": "Bob", "Age": 30}
]


[{'Name': 'Alice', 'Age': 25}, {'Name': 'Bob', 'Age': 30}]

In [68]:
#orient="split"
{
  "index": [0, 1],
  "columns": ["Name", "Age"],
  "data": [["Alice", 25], ["Bob", 30]]
}

{'index': [0, 1],
 'columns': ['Name', 'Age'],
 'data': [['Alice', 25], ['Bob', 30]]}

In [70]:
#orient="index"
{
  "0": {"Name": "Alice", "Age": 25},
  "1": {"Name": "Bob", "Age": 30}
}

{'0': {'Name': 'Alice', 'Age': 25}, '1': {'Name': 'Bob', 'Age': 30}}

In [72]:
#orient="columns"
{
  "Name": {"0": "Alice", "1": "Bob"},
  "Age": {"0": 25, "1": 30}
}

{'Name': {'0': 'Alice', '1': 'Bob'}, 'Age': {'0': 25, '1': 30}}

In [74]:
#orient="table"
{
  "schema": {...},
  "data": [
    {"index": 0, "Name": "Alice", "Age": 25},
    {"index": 1, "Name": "Bob", "Age": 30}
  ]
}

{'schema': {Ellipsis},
 'data': [{'index': 0, 'Name': 'Alice', 'Age': 25},
  {'index': 1, 'Name': 'Bob', 'Age': 30}]}
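
The shapes can also be generated rather than typed by hand. The sketch below relies on `to_json()` returning the JSON text as a string when no path is given. It leaves out `"table"` because its schema block is longer.

```python
import json
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# Parse each JSON string back into Python objects for easy comparison
outputs = {
    orient: json.loads(df.to_json(orient=orient))
    for orient in ["records", "split", "index", "columns", "values"]
}

for orient, parsed in outputs.items():
    print(orient, "->", parsed)
```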

**Read a JSON file back into a Pandas DataFrame using pd.read_json()**

Step 1: Save the DataFrame to JSON

In [76]:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

# Save as JSON with orient='records'
df.to_json("students.json", orient="records", lines=False)


Step 2: Load JSON back into a DataFrame

In [78]:
# Load the file back
df_loaded = pd.read_json("students.json", orient="records")

print(df_loaded)


    Name  Age
0  Alice   25
1    Bob   30


**Other orient values**

To load JSON correctly, you must match the orient you used to save it:

In [80]:
# Example: if saved with orient='split'
df.to_json("students_split.json", orient="split")

# Load it back
df_split = pd.read_json("students_split.json", orient="split")
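
A quick way to check that a saved orientation loads back correctly is an in-memory round trip. The sketch below writes to a string instead of a file and wraps it in `StringIO` for reading. Comparing via `to_dict("records")` is a choice made here to keep the check simple.

```python
from io import StringIO
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

round_trips = {}
for orient in ["records", "split", "columns"]:
    text = df.to_json(orient=orient)                       # JSON text, no file
    loaded = pd.read_json(StringIO(text), orient=orient)   # same orient on load
    round_trips[orient] = loaded.to_dict("records") == df.to_dict("records")

print(round_trips)
```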


**Important Notes:**

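With lines=False (the default), orient="records" writes the whole DataFrame as a single JSON array.

If you're exporting newline-delimited JSON with lines=True (one record per line, only valid together with orient="records"), pass lines=True when reading as well: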

In [82]:
df.to_json("students.json", orient="records", lines=True)
df = pd.read_json("students.json", lines=True)
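
What newline-delimited JSON looks like can be sketched in memory, again using a string and `StringIO` in place of a file:

```python
from io import StringIO
import json
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# One JSON object per line instead of one surrounding array
text = df.to_json(orient="records", lines=True)
print(text)

# Every non-empty line is a complete JSON document on its own
records = [json.loads(line) for line in text.splitlines() if line]

loaded = pd.read_json(StringIO(text), lines=True)
print(loaded)
```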


**Load JSON data directly from a URL or API**

Here's a practical example showing how to load JSON data directly from a URL or API into a Pandas DataFrame using pd.read_json().

In [84]:
import pandas as pd

# Load JSON data from an API endpoint
url = "https://jsonplaceholder.typicode.com/users"

df = pd.read_json(url)

print(df.head())


   id              name   username                      email  \
0   1     Leanne Graham       Bret          Sincere@april.biz   
1   2      Ervin Howell  Antonette          Shanna@melissa.tv   
2   3  Clementine Bauch   Samantha         Nathan@yesenia.net   
3   4  Patricia Lebsack   Karianne  Julianne.OConner@kory.org   
4   5  Chelsey Dietrich     Kamren   Lucio_Hettinger@annie.ca   

                                             address                  phone  \
0  {'street': 'Kulas Light', 'suite': 'Apt. 556',...  1-770-736-8031 x56442   
1  {'street': 'Victor Plains', 'suite': 'Suite 87...    010-692-6593 x09125   
2  {'street': 'Douglas Extension', 'suite': 'Suit...         1-463-123-4447   
3  {'street': 'Hoeger Mall', 'suite': 'Apt. 692',...      493-170-9623 x156   
4  {'street': 'Skiles Walks', 'suite': 'Suite 351...          (254)954-1289   

         website                                            company  
0  hildegard.org  {'name': 'Romaguera-Crona', 'catchPhrase': 'Mu

The nested fields like address and company are dictionaries, so pandas stores them in object-dtype columns with one dictionary per cell.

You can **flatten** these if needed (see below).

**Flatten Nested JSON**

To turn nested fields (like "address.city") into separate columns:

In [86]:
# Flatten nested fields using json_normalize
import requests
from pandas import json_normalize

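# url is defined in the previous cell
response = requests.get(url, timeout=10)
response.raise_for_status()  # Stop early on an HTTP error status
data = response.json()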

df_flat = json_normalize(data)

print(df_flat.columns)


Index(['id', 'name', 'username', 'email', 'phone', 'website', 'address.street',
       'address.suite', 'address.city', 'address.zipcode', 'address.geo.lat',
       'address.geo.lng', 'company.name', 'company.catchPhrase', 'company.bs'],
      dtype='object')
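
The same flattening can be tried without a network connection. The records below are hypothetical and only shaped like the API response above. The `sep` argument of `json_normalize` is used to join nested keys with an underscore instead of a dot.

```python
import pandas as pd

# Hypothetical records with two levels of nesting
users = [
    {"id": 1, "name": "Alice", "address": {"city": "Brussels", "geo": {"lat": "50.85"}}},
    {"id": 2, "name": "Bob", "address": {"city": "Ghent", "geo": {"lat": "51.05"}}},
]

# Nested keys are joined into column names such as address_geo_lat
flat = pd.json_normalize(users, sep="_")

print(sorted(flat.columns))
print(flat["address_city"].tolist())
```

Underscore-separated names can be handy with `itertuples()`, since a column name containing a dot is not a valid attribute name.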


**Summary**

| Task                         | Code                                         |
| ---------------------------- | -------------------------------------------- |
| Load simple JSON from URL    | `pd.read_json(url)`                          |
| Load and flatten nested JSON | `requests.get().json()` + `json_normalize()` |
