In [1]:
import pandas as pd
df = pd.read_csv("../Project 1 - Housing in mexico/mexico-real-estate-clean.csv")
df.head(3)

Unnamed: 0,property_type,state,lat,lon,area_m2,price_usd
0,house,Estado de México,19.560181,-99.233528,150,67965.56
1,house,Nuevo León,25.688436,-100.198807,186,63223.78
2,apartment,Guerrero,16.767704,-99.764383,82,84298.37


#### Groupby method

1. **Collector Role**: `groupby()` acts like a collector that takes your entire DataFrame and looks at a chosen parameter, the grouping key (one column or a list of columns).

2. **Sorting into Bins**: It scans through all the rows and decides which group (bin) each row belongs to, based on that parameter.

3. **Splitting into Sub-DataFrames**: Behind the scenes, it’s as if the original DataFrame is being split into multiple smaller DataFrames — each one containing only the rows that share the same value(s) for that parameter.

4. **Holding Groups Together**: These smaller DataFrames aren’t handed to you immediately. Instead, they’re kept together inside a special grouped object, like neatly stacked folders ready to be worked on.

5. **Operation Dispatcher**: Once you’ve got these groups, you can tell pandas to perform some operation (like counting, summing, averaging, or applying custom logic) on each group individually.

6. **Reassembly**: After applying the operation, pandas reassembles the results into a new DataFrame or Series, with one entry per group.

7. **Analytical Lens**: Conceptually, `groupby()` doesn’t just chop up the data — it gives you a new lens to look at your dataset: not as one big table, but as many smaller, related tables that share a defining feature.

---

👉 In essence: `groupby()` is like temporarily turning one DataFrame into a *collection of smaller DataFrames*, grouped by the parameter you care about, so you can analyze each group independently before putting the answers back together.
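The split, apply, and reassemble steps described above can be sketched on a tiny DataFrame. The `toy` table and its values are made up for illustration and are not taken from the real-estate file; only the column names mirror the dataset.

```python
import pandas as pd

# Hypothetical toy data: same column names as the dataset, invented values
toy = pd.DataFrame({
    "state": ["A", "B", "A", "C", "B"],
    "price_usd": [100.0, 200.0, 300.0, 400.0, 600.0],
})

# Split: build the grouped object; no statistics are computed yet
grouped = toy.groupby("state")
print(grouped.ngroups)               # number of distinct states in the toy data

# Apply + combine: one result per group, indexed by the group key
sizes = grouped.size()               # rows per group
means = grouped["price_usd"].mean()  # mean price per group
print(sizes)
print(means)
```

Each result has one entry per group, with the group key as its index, which is the "reassembly" step in action.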





#### Understanding the Transformation

##### Step 1: What `items` contains (tuple structure)
```python
for items in state_group["price_usd"]:
    print(type(items))        # <class 'tuple'>
    print(items[0])           # Group name (e.g., 'Guerrero')
    print(items[1].mean())    # Mean of that group's Series
```

Each `items` is a **tuple** with:
- `items[0]`: Group key/name (string) - e.g., "Guerrero", "Estado de México"
- `items[1]`: Series containing price_usd values for that group

##### Step 2: How `.mean()` transforms tuples to Series

When you call `state_group["price_usd"].mean()`, the result is what you would get from the steps below. This is a conceptual model: pandas itself uses optimized, vectorized code rather than a Python-level loop, but the outcome is the same.

1. **Iterate through each group** (just like the for loop above)
2. **Extract the group key** (`items[0]`)
3. **Calculate the mean** of the group's Series (`items[1].mean()`)
4. **Construct a new Series** where:
   - **Index** = All the group keys collected
   - **Values** = All the calculated means

Visual Representation:

```python
# Conceptual view of what .mean() produces (illustrative 4-row sample, not the full dataset):

# Tuple 1: ('Estado de México', Series([67965.56]))
# → Index: 'Estado de México', Value: 67965.56

# Tuple 2: ('Guerrero', Series([84298.37, 94308.8]))  
# → Index: 'Guerrero', Value: 89303.585

# Tuple 3: ('Nuevo León', Series([63223.78]))
# → Index: 'Nuevo León', Value: 63223.78

# Final Series construction:
mean_price = pd.Series(
    data=[67965.56, 89303.585, 63223.78],
    index=['Estado de México', 'Guerrero', 'Nuevo León'],
    name='price_usd'
)
```

Key Point:
The **aggregation method** (`.mean()`) is what performs this transformation. It:
- Works on the same groups that the for loop iterates over
- Applies the aggregation function to each group's data
- Combines the results into a Series indexed by group key

This is pandas' elegant way of converting grouped data back into a structured format while preserving the relationship between group names and their computed values.
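To check this mental model, the loop-based version can be rebuilt by hand and compared with the built-in aggregation. This is a minimal sketch on a four-row sample (the first price values shown in the outputs of this notebook), not a description of how pandas is implemented; the names `toy`, `manual`, and `builtin` are hypothetical.

```python
import pandas as pd

# Four-row sample for illustration only
toy = pd.DataFrame({
    "state": ["Guerrero", "Nuevo León", "Guerrero", "Estado de México"],
    "price_usd": [84298.37, 63223.78, 94308.80, 67965.56],
})
state_group = toy.groupby("state")

# Manual version: collect each group key and that group's mean
keys, values = [], []
for key, prices in state_group["price_usd"]:
    keys.append(key)
    values.append(prices.mean())

manual = pd.Series(values, index=pd.Index(keys, name="state"), name="price_usd")

# Built-in version: one call does the same job
builtin = state_group["price_usd"].mean()

# Raises an AssertionError if the two Series differ
pd.testing.assert_series_equal(manual, builtin)
print(manual)
```

The built-in call is the one to use in practice; the manual loop only makes the key-to-value pairing visible.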

In [2]:
state_group = df.groupby("state")   # Returns a DataFrameGroupBy object; df itself is unchanged
print(state_group)
print("The state_group is of the type", type(state_group))

# Iterating over a groupby object yields one (group key, sub-DataFrame) tuple per group.
for group_tuple in state_group:
    print(type(group_tuple))   # tuple
    print(len(group_tuple))    # Always 2: the group key and that group's DataFrame
    item1, item2 = group_tuple # Iterable unpacking: a tuple of known length can be unpacked into that many names
    print(f"Item 1 is {item1}") # item1 is the group key (the state name)
    if item1 == "Guerrero":
        guerrero_tuple = group_tuple   # Save this tuple so it can be accessed outside the loop


<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002487A5CC590>
The state_group is of the type <class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<class 'tuple'>
2
Item 1 is Aguascalientes
<class 'tuple'>
2
Item 1 is Baja California
<class 'tuple'>
2
Item 1 is Baja California Sur
<class 'tuple'>
2
Item 1 is Campeche
<class 'tuple'>
2
Item 1 is Chiapas
<class 'tuple'>
2
Item 1 is Chihuahua
<class 'tuple'>
2
Item 1 is Colima
<class 'tuple'>
2
Item 1 is Distrito Federal
<class 'tuple'>
2
Item 1 is Durango
<class 'tuple'>
2
Item 1 is Estado de México
<class 'tuple'>
2
Item 1 is Guanajuato
<class 'tuple'>
2
Item 1 is Guerrero
<class 'tuple'>
2
Item 1 is Hidalgo
<class 'tuple'>
2
Item 1 is Jalisco
<class 'tuple'>
2
Item 1 is Morelos
<class 'tuple'>
2
Item 1 is Nayarit
<class 'tuple'>
2
Item 1 is Nuevo León
<class 'tuple'>
2
Item 1 is Oaxaca
<class 'tuple'>
2
Item 1 is Puebla
<class 'tuple'>
2
Item 1 is Querétaro
<class 'tuple'>
2
Item 1 is Quintana Roo
<class 'tuple'>
2
Item 

In [3]:
print(type(guerrero_tuple)) # This is a tuple of known length (2), so we can unpack it again
item1, item2 = guerrero_tuple
print(type(item1))
print(type(item2))
# We could use get_group(), a method of pandas groupby objects, to fetch this group's DataFrame,
# or simply access the second item of the tuple
print(guerrero_tuple[0])
guerrero_tuple[1].head()


<class 'tuple'>
<class 'str'>
<class 'pandas.core.frame.DataFrame'>
Guerrero


Unnamed: 0,property_type,state,lat,lon,area_m2,price_usd
2,apartment,Guerrero,16.767704,-99.764383,82,84298.37
3,apartment,Guerrero,16.829782,-99.911012,150,94308.8
10,apartment,Guerrero,16.775165,-99.789939,117,157269.15
61,house,Guerrero,16.860338,-99.870399,168,94835.67
63,house,Guerrero,16.771518,-99.772466,180,79029.72


In [4]:
state_group.get_group("Guerrero").head()

Unnamed: 0,property_type,state,lat,lon,area_m2,price_usd
2,apartment,Guerrero,16.767704,-99.764383,82,84298.37
3,apartment,Guerrero,16.829782,-99.911012,150,94308.8
10,apartment,Guerrero,16.775165,-99.789939,117,157269.15
61,house,Guerrero,16.860338,-99.870399,168,94835.67
63,house,Guerrero,16.771518,-99.772466,180,79029.72
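As a cross-check, `get_group()` should return the same rows as an ordinary boolean filter on the original DataFrame, with the original row labels kept. The sketch below uses a hypothetical three-row `toy` table rather than the full dataset.

```python
import pandas as pd

# Hypothetical three-row sample for illustration
toy = pd.DataFrame({
    "state": ["Guerrero", "Oaxaca", "Guerrero"],
    "price_usd": [84298.37, 59681.59, 94308.80],
})
state_group = toy.groupby("state")

via_get_group = state_group.get_group("Guerrero")   # look up one group by its key
via_mask = toy[toy["state"] == "Guerrero"]          # plain boolean filtering

# Compare the two results, including index labels and values
print(via_get_group.equals(via_mask))
print(list(via_get_group.index))
```

`get_group()` is convenient when the grouped object already exists; for a single one-off lookup, the boolean filter avoids building the groups at all.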


##### Applying functions to groups

In [5]:
for items in state_group["price_usd"]:  # Selecting a column yields (group key, Series of that column's values) tuples
    print(items)

('Aguascalientes', 142     136490.35
257      41622.32
518      41105.20
522     142253.50
769      35789.47
820     192105.26
1019    202631.58
1051     41578.95
1603     78502.86
1691    193359.39
Name: price_usd, dtype: float64)
('Baja California', 53       41095.45
238      42149.18
326      62570.46
420     133442.74
568      73761.07
601      71911.21
649      71911.21
684      78947.37
707      36263.16
758      51578.95
839      89421.05
867     102631.58
918      42631.58
983     102631.58
1018     36092.11
1052     34210.53
1064     51578.95
1068     89473.68
1132     41578.95
1176     62585.30
1224     36817.31
1289     33982.78
1349     36817.31
1411     95499.52
1485     61643.18
1530     51632.75
1590     61643.18
1657     52605.29
1659     84313.09
Name: price_usd, dtype: float64)
('Baja California Sur', 95      131747.44
307      50485.62
329      68492.42
358      62696.91
383      33358.45
429     226551.88
737      65789.47
773     184210.53
958      50000.00
1083   

In [8]:
for items in state_group["price_usd"]: # items[1] is the group's price_usd Series, so .mean() gives that group's mean price
    print(type(items))
    print(items[0])
    print(items[1].mean())


<class 'tuple'>
Aguascalientes
110543.88799999999
<class 'tuple'>
Baja California
63152.43172413793
<class 'tuple'>
Baja California Sur
109069.33933333332
<class 'tuple'>
Campeche
121734.63333333335
<class 'tuple'>
Chiapas
104342.31327272726
<class 'tuple'>
Chihuahua
127073.85200000003
<class 'tuple'>
Colima
65786.646
<class 'tuple'>
Distrito Federal
128347.26742574258
<class 'tuple'>
Durango
78034.51142857142
<class 'tuple'>
Estado de México
122723.49050279328
<class 'tuple'>
Guanajuato
133277.9658333333
<class 'tuple'>
Guerrero
119854.27612244901
<class 'tuple'>
Hidalgo
94012.32647058823
<class 'tuple'>
Jalisco
123386.47216666669
<class 'tuple'>
Morelos
112697.295625
<class 'tuple'>
Nayarit
87378.60555555556
<class 'tuple'>
Nuevo León
129221.9856626506
<class 'tuple'>
Oaxaca
59681.585
<class 'tuple'>
Puebla
121732.97400000003
<class 'tuple'>
Querétaro
133955.91328125
<class 'tuple'>
Quintana Roo
128065.41605263157
<class 'tuple'>
San Luis Potosí
92435.54036363638
<class 'tuple'>
Sina

In [7]:
# Finally, calling .mean() directly on the selection returns a Series of mean price_usd values indexed by group name
mean_price = state_group["price_usd"].mean()
mean_price

state
Aguascalientes                     110543.888000
Baja California                     63152.431724
Baja California Sur                109069.339333
Campeche                           121734.633333
Chiapas                            104342.313273
Chihuahua                          127073.852000
Colima                              65786.646000
Distrito Federal                   128347.267426
Durango                             78034.511429
Estado de México                   122723.490503
Guanajuato                         133277.965833
Guerrero                           119854.276122
Hidalgo                             94012.326471
Jalisco                            123386.472167
Morelos                            112697.295625
Nayarit                             87378.605556
Nuevo León                         129221.985663
Oaxaca                              59681.585000
Puebla                             121732.974000
Querétaro                          133955.913281
Quintana Roo  