---
# Grouping and Aggregating Data
Up to this point, we have been building our pandas foundations and focusing mainly on the mechanics (the operations). Now we will use what we have learned to analyze and explore the Stack Overflow survey data.

---

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display

In [None]:
# Helper for printing a horizontal rule (display purposes only)
def printhr(s: str = None, n: int = 40):
    """Print a horizontal rule made of "=" characters.

    Args:
        s (str, optional): Header message placed in the middle of the rule.
            Defaults to None.
        n (int, optional): Number of "=" characters. Defaults to 40.
    """

    if s:
        print("=" * (n // 2), s, "=" * (n // 2))
    else:
        print("=" * n)

In [None]:
# Stack Overflow Developer Survey (2022 results and question schema)
df = pd.read_csv("data/survey_results_public_2022.csv", index_col="ResponseId")
schema_df = pd.read_csv("data/survey_results_schema.csv", index_col="qname")

---
## Aggregating Data
Aggregation is any process that expresses data in summary form (e.g., taking the mean of a column).

---

---
### `describe()` Method
Pandas has a `describe()` method that shows several aggregate statistics for the selected column(s). The **include** parameter determines which columns are summarized; by default, only numeric columns are included.

---

In [None]:
# Applied on df
display(df.describe())

# Applied on a Series (single column)
display(df["ConvertedCompYearly"].describe())
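
The effect of the column-selection parameters can be illustrated on a small, made-up DataFrame. This is a minimal sketch: the `toy` frame and its values are hypothetical stand-ins for the survey data, and only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey DataFrame
toy = pd.DataFrame(
    {
        "ConvertedCompYearly": [50000.0, 65000.0, 80000.0, None],
        "RemoteWork": ["Fully remote", "Hybrid", "Fully remote", "Fully remote"],
    }
)

# Default behaviour: only numeric columns are summarized
numeric_summary = toy.describe()

# Excluding numeric columns leaves the text column, which is summarized
# with count, unique, top (most frequent value), and freq
text_summary = toy.describe(exclude=[np.number])

# include="all" summarizes every column; statistics that do not apply
# to a column are filled with NaN
full_summary = toy.describe(include="all")

print(numeric_summary)
print(text_summary)
print(full_summary)
```

Note that the missing compensation value is not counted: the `count` row reports non-null entries only.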

---
### `value_counts()` Method
This method counts how often each unique value occurs in a Series, or how often each unique row occurs in a DataFrame. When used on a DataFrame, a list of column labels can be passed to count the unique combinations of values in those columns.  
**normalize** is a bool parameter that determines whether the result is shown as raw counts (the default) or as proportions of the total.

---

In [134]:
# Used on a df with combination of columns
x = df.value_counts(["ConvertedCompYearly", "RemoteWork"])
display(x)

ConvertedCompYearly  RemoteWork                          
150000.0             Fully remote                            270
200000.0             Fully remote                            242
120000.0             Fully remote                            208
100000.0             Fully remote                            174
180000.0             Fully remote                            172
                                                            ... 
43428.0              Fully remote                              1
43380.0              Hybrid (some remote, some in-person)      1
43356.0              Hybrid (some remote, some in-person)      1
43346.0              Fully remote                              1
50000000.0           Full in-person                            1
Name: count, Length: 11465, dtype: int64

In [136]:
# Used on a Series and using the normalize parameter
x = df["ConvertedCompYearly"].value_counts(normalize=True)
display(x)

# We can multiply all the values in the Series by 100
# to show the values in their percentage (out of 100%)
display(x*100)

ConvertedCompYearly
150000.0    0.010323
200000.0    0.009509
120000.0    0.008957
63986.0     0.007985
100000.0    0.007328
              ...   
76472.0     0.000026
1368.0      0.000026
104952.0    0.000026
3648.0      0.000026
110245.0    0.000026
Name: proportion, Length: 7909, dtype: float64

ConvertedCompYearly
150000.0    1.032282
200000.0    0.950855
120000.0    0.895695
63986.0     0.798508
100000.0    0.732841
              ...   
76472.0     0.002627
1368.0      0.002627
104952.0    0.002627
3648.0      0.002627
110245.0    0.002627
Name: proportion, Length: 7909, dtype: float64
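
Because the survey output above is long, the same calls may be easier to follow on a tiny, made-up DataFrame where the counts can be checked by eye. This is only a sketch: the `toy` frame is hypothetical, and the `Employment` column and its values are invented for the illustration.

```python
import pandas as pd

# Hypothetical toy data; values are invented for the illustration
toy = pd.DataFrame(
    {
        "RemoteWork": ["Fully remote", "Hybrid", "Fully remote", "Hybrid", "Fully remote"],
        "Employment": ["Full-time", "Full-time", "Part-time", "Full-time", "Full-time"],
    }
)

# Raw counts for a single column, sorted from most to least frequent
counts = toy["RemoteWork"].value_counts()

# Proportions instead of counts; the values sum to 1
shares = toy["RemoteWork"].value_counts(normalize=True)

# Counts of each unique combination of the two columns;
# the result is a Series indexed by a MultiIndex
combo_counts = toy.value_counts(["RemoteWork", "Employment"])

print(counts)
print(shares)
print(combo_counts)
```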

---
### Measures of Central Tendency
Pandas has built-in `mean()`, `median()`, and `mode()` methods. These methods can be applied to both Series and DataFrames.

To restrict the calculation to the numeric columns of a DataFrame, pass **True** to the **numeric_only** parameter.

---

In [None]:
# Taking median of all numeric columns
medians = df.median(numeric_only=True)
display(medians)

# Note that CompTotal is not normalized; currencies differ
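
To see how the three measures differ, they can be compared on a small, made-up DataFrame containing one unusually large value. This is a sketch under stated assumptions: the `toy` frame and all of its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the last compensation value is deliberately large
toy = pd.DataFrame(
    {
        "WorkExp": [2, 5, 5, 10, 28],
        "ConvertedCompYearly": [40000.0, 60000.0, 60000.0, 90000.0, 250000.0],
        "RemoteWork": ["Hybrid", "Fully remote", "Fully remote", "Hybrid", "Fully remote"],
    }
)

# numeric_only=True skips the text column RemoteWork
means = toy.mean(numeric_only=True)
medians = toy.median(numeric_only=True)

# mode() returns a Series because several values can tie for most frequent
comp_mode = toy["ConvertedCompYearly"].mode()

print(means)
print(medians)
print(comp_mode)
```

In this toy data the single large value pulls the mean compensation (100,000) well above the median (60,000), which is one reason the median is often preferred for skewed quantities such as salaries.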

---
We might ask: what is the typical compensation of software developers in 2022?  
If we decide to use the median as a measure:

**Note: all examples from here on are for illustrative purposes only. The methods used might not be the best way to answer the questions.**  

---

In [None]:
# Taking the median compensation
# We will use ConvertedCompYearly as it normalizes everything to USD
median_comp = df["ConvertedCompYearly"].median()
display(median_comp)

---
We can see that the median salary for software developers is $67,845. (This, of course, lacks context and does not take into account other factors such as experience.)

---

---
## Grouping Data
Grouping splits the data into groups based on the values of one or more columns, so that an aggregation (e.g. the mean or median) can be computed separately for each group.

---
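
As a minimal sketch of grouping followed by aggregation, the example below computes a median per group on a small, made-up DataFrame. The `toy` frame and its values are hypothetical; `groupby()` is the pandas method used for the split.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey DataFrame
toy = pd.DataFrame(
    {
        "RemoteWork": ["Fully remote", "Hybrid", "Fully remote", "Hybrid", "Full in-person"],
        "ConvertedCompYearly": [90000.0, 60000.0, 70000.0, 80000.0, 50000.0],
    }
)

# Split the rows by RemoteWork, then take the median compensation of each group
median_by_group = toy.groupby("RemoteWork")["ConvertedCompYearly"].median()

print(median_by_group)
```

The result is a Series indexed by the group labels, with one aggregated value per group.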