# Pivot & Calculating Z-Scores in Pandas


Recall that the Z-score tells you how many standard deviations an individual data point lies above or below the mean of your sample.

When we know the Z-scores for two things that are measured on different scales, we have their "standard" scores, which allows us to compare one against the other!

For example, we have a dataset of test scores for 100 students who have taken both the ACT and SAT.
- The composite ACT had a mean score of 20.9 and SD of 6.
- The composite SAT had a mean score of 1060 and SD of 196.
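
As a quick sanity check of the idea, the sketch below standardizes one student's pair of scores by hand using the summary statistics listed above. The helper name `z_score` and the sample scores (ACT 25, SAT 1190) are assumptions chosen for illustration.

```python
def z_score(value, mean, sd):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / sd

# summary statistics quoted above; the two scores are illustrative
act_z = z_score(25, mean=20.9, sd=6)
sat_z = z_score(1190, mean=1060, sd=196)

print(f"ACT z-score: {act_z:.3f}")
print(f"SAT z-score: {sat_z:.3f}")
print("Better test:", "SAT" if sat_z > act_z else "ACT")
```

Because both results are on the same standard scale, they can be compared directly even though the raw scores cannot.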

**Create a column that tells us which test score is better for each student.**


In [24]:
import pandas as pd
from IPython.display import display, HTML

## show floats with three decimal places and thousands separators
pd.options.display.float_format = '{:,.3f}'.format

## small trick to improve our display
## will allow us to see dataframes side-by-side

css = """
.output {
    flex-direction: row;
}
"""

HTML('<style>{}</style>'.format(css))

In [25]:
## read standardized test score results data
df = pd.read_csv("https://raw.githubusercontent.com/sandeepmj/datasets/main/standardized-test-scores.csv")
df

Unnamed: 0,student_ID,scores,test
0,1,25,ACT
1,2,21,ACT
2,3,26,ACT
3,4,31,ACT
4,5,20,ACT
...,...,...,...
195,96,767,SAT
196,97,1146,SAT
197,98,1139,SAT
198,99,1084,SAT


In [26]:
## let's confirm each student appears twice using query and sort
df.query(" 1 <= student_ID <= 10").sort_values(by="student_ID")

Unnamed: 0,student_ID,scores,test
0,1,25,ACT
100,1,1190,SAT
1,2,21,ACT
101,2,1053,SAT
2,3,26,ACT
102,3,1222,SAT
3,4,31,ACT
103,4,1411,SAT
4,5,20,ACT
104,5,1032,SAT


In [27]:
## let's check the last 10 rows the same way using sort and tail
df.sort_values(by = "student_ID").tail(10).sort_values(by="student_ID")

Unnamed: 0,student_ID,scores,test
195,96,767,SAT
95,96,12,ACT
196,97,1146,SAT
96,97,23,ACT
197,98,1139,SAT
97,98,23,ACT
198,99,1084,SAT
98,99,22,ACT
99,100,20,ACT
199,100,1032,SAT


In [28]:
## confirm our mean and SD for each test, rounded to zero decimal places
round(df.groupby("test")["scores"].agg(["mean", "std"]),0)

test,mean,std
ACT,21.0,6.0
SAT,1060.0,196.0


## Comparison challenge
Why is it difficult to compare the test scores for each student in the df's current shape?


<style>
    table {
        width: 100%;
        table-layout: fixed;
        border-collapse: collapse;
    }
    td {
        width: 50%;
        text-align: center;
        vertical-align: top;
        padding: 0;
    }
    img {
        max-width: 100%;
        height: auto;
        display: block;
        margin: 0;
        padding: 0;
    }
</style>
<table>
    <tr>
        <td><img src='https://sandeepmj.github.io/image-host/test-scores-tidy.png'></td>
        <td><img src='https://sandeepmj.github.io/image-host/test-scores-untidy.png'></td>
    </tr>
</table>

## Pivoting for comparison

We need to pivot our df so we can compare the two types of test side by side for each student.



```
df.pivot(columns = "column you want to pivot",
         index = "what your new index should be",
         values = "what values fill your new columns")
```
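
To make the template concrete, here is a minimal sketch on a tiny made-up frame. The name `toy` and its four rows are assumptions for illustration only; the column names mirror the ones in our dataset.

```python
import pandas as pd

# hypothetical long-format data: one row per student per test
toy = pd.DataFrame({
    "student_ID": [1, 1, 2, 2],
    "test": ["ACT", "SAT", "ACT", "SAT"],
    "scores": [25, 1190, 21, 1053],
})

# reshape to one row per student and one column per test
toy_wide = toy.pivot(index="student_ID", columns="test", values="scores")
print(toy_wide)
```

Note that `pivot` raises a `ValueError` if any index/column pair appears more than once, so it only works when each student has at most one row per test.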

In [29]:
dfp = df.pivot(columns = "test", index ="student_ID", values = "scores")
dfp

student_ID,ACT,SAT
1,25,1190
2,21,1053
3,26,1222
4,31,1411
5,20,1032
...,...,...
96,12,767
97,23,1146
98,23,1139
99,22,1084


In [30]:
## let's confirm both test scores appear for students 1-10
dfp.query(" 1 <= student_ID <= 10")

student_ID,ACT,SAT
1,25,1190
2,21,1053
3,26,1222
4,31,1411
5,20,1032
6,20,1032
7,31,1423
8,26,1248
9,19,981
10,25,1199


In [31]:
## let's confirm both test scores appear for students 90-100
dfp.query(" 90 <= student_ID <= 100")

student_ID,ACT,SAT
90,25,1193
91,22,1103
92,28,1291
93,17,931
94,20,1012
95,19,998
96,12,767
97,23,1146
98,23,1139
99,22,1084


### Z-score or Standard score package

`from scipy.stats import zscore`

We call the `zscore` function on the column that must be standardized:

`zscore(df["target_col"])`
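
One detail worth knowing: `zscore` uses the population standard deviation (`ddof=0`) by default, while the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`). The sketch below compares the two on a small made-up series; the values in `s` are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import zscore

# hypothetical scores, used only for illustration
s = pd.Series([20, 25, 30, 18, 27])

# manual z-scores with the population standard deviation (ddof=0)
manual = (s - s.mean()) / s.std(ddof=0)

# scipy's zscore defaults to ddof=0, so it should agree with `manual`
from_scipy = zscore(s)

# pandas' default std() uses ddof=1, giving slightly smaller magnitudes
sample_based = (s - s.mean()) / s.std()

print(pd.DataFrame({"manual": manual, "scipy": from_scipy, "ddof=1": sample_based}))
```

With 100 students the two conventions differ only slightly, but it explains why z-scores computed by hand from a pandas `std` may not match `zscore` to the third decimal place.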

In [32]:
## import the zscore function from scipy's stats package
from scipy.stats import zscore


In [33]:
# Calculate the SAT zscores and place in new zscores column
dfp["sat_zscore"] = zscore(dfp["SAT"])
dfp

student_ID,ACT,SAT,sat_zscore
1,25,1190,0.667
2,21,1053,-0.036
3,26,1222,0.831
4,31,1411,1.801
5,20,1032,-0.143
...,...,...,...
96,12,767,-1.503
97,23,1146,0.441
98,23,1139,0.405
99,22,1084,0.123


In [34]:
# Calculate the ACT zscores and place in new zscores column
dfp["act_zscore"] = zscore(dfp["ACT"])
dfp

student_ID,ACT,SAT,sat_zscore,act_zscore
1,25,1190,0.667,0.718
2,21,1053,-0.036,0.011
3,26,1222,0.831,0.895
4,31,1411,1.801,1.780
5,20,1032,-0.143,-0.166
...,...,...,...,...
96,12,767,-1.503,-1.582
97,23,1146,0.441,0.365
98,23,1139,0.405,0.365
99,22,1084,0.123,0.188


In [35]:
## create a column that says which test is stronger
dfp['better_score'] =\
dfp.apply(lambda x: 'SAT' if x['sat_zscore'] > x['act_zscore'] else 'ACT', axis=1)
dfp.sample(20)

student_ID,ACT,SAT,sat_zscore,act_zscore,better_score
87,27,1280,1.129,1.072,SAT
59,24,1154,0.482,0.541,ACT
24,13,775,-1.462,-1.405,ACT
82,24,1159,0.508,0.541,ACT
57,16,901,-0.815,-0.874,SAT
16,18,961,-0.508,-0.52,SAT
90,25,1193,0.682,0.718,ACT
62,20,1042,-0.092,-0.166,SAT
50,11,702,-1.836,-1.759,ACT
36,14,819,-1.236,-1.228,ACT
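
The row-by-row `apply` works well at this size. As an alternative, the same label can be built column-wise with `numpy.where`, sketched here on a small made-up frame. The name `demo` and its three rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# hypothetical z-scores for three students
demo = pd.DataFrame(
    {"sat_zscore": [0.667, 1.801, -0.143], "act_zscore": [0.718, 1.780, -0.166]},
    index=pd.Index([1, 4, 5], name="student_ID"),
)

# compare whole columns at once; ties fall to 'ACT', same as the lambda version
demo["better_score"] = np.where(demo["sat_zscore"] > demo["act_zscore"], "SAT", "ACT")
print(demo)
```

Either approach should give the same labels; the column-wise form tends to be faster on large frames.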


## Who were the students with the biggest difference between test scores?

In [36]:
## create a column that lists the numeric difference between z-scores
dfp['z-diff'] = dfp['sat_zscore'] - dfp['act_zscore']
dfp

student_ID,ACT,SAT,sat_zscore,act_zscore,better_score,z-diff
1,25,1190,0.667,0.718,ACT,-0.051
2,21,1053,-0.036,0.011,ACT,-0.046
3,26,1222,0.831,0.895,ACT,-0.064
4,31,1411,1.801,1.780,SAT,0.020
5,20,1032,-0.143,-0.166,SAT,0.023
...,...,...,...,...,...,...
96,12,767,-1.503,-1.582,SAT,0.079
97,23,1146,0.441,0.365,SAT,0.077
98,23,1139,0.405,0.365,SAT,0.041
99,22,1084,0.123,0.188,ACT,-0.064


### Because I calculated `sat_zscore minus act_zscore`:

- If both numbers are positive and sat_zscore is greater than act_zscore, the result is positive: `4 - 3 = 1`, which means the sat_zscore is better.

- If both numbers are positive but act_zscore is bigger, the difference is negative: `3 - 4 = -1`, which means the act_zscore is better.

- If both numbers are negative and sat_zscore is less negative than act_zscore, the result is positive: `-3 - -5 = 2`, which means the sat_zscore is better.

- If both numbers are negative but act_zscore is less negative than sat_zscore, the result is negative: `-6 - -5 = -1`, which means the act_zscore is better.

- If sat_zscore is positive and act_zscore is negative, the result is positive: `10 - -5 = 15`, which means the sat_zscore is better.

- If sat_zscore is negative and act_zscore is positive, the result is negative: `-5 - 7 = -12`, which means the act_zscore is better.
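
All six cases collapse into one rule: a positive difference means the SAT was better, a negative one means the ACT was better. A short sketch that checks the arithmetic for the example pairs listed above (the variable names are illustrative):

```python
# (sat_zscore, act_zscore) pairs taken from the cases listed above
cases = [(4, 3), (3, 4), (-3, -5), (-6, -5), (10, -5), (-5, 7)]

for sat_z, act_z in cases:
    diff = sat_z - act_z
    better = "SAT" if diff > 0 else "ACT"
    print(f"{sat_z} - {act_z} = {diff} -> {better} is better")
```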


### Gap between scores

To calculate the gap between `act_zscore` and `sat_zscore`, we do the same calculation as above but turn the result into an `absolute difference`.

The `absolute difference` gives us the magnitude of the gap, regardless of which score is higher.
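
A minimal sketch of the idea on made-up numbers: `.abs()` drops the sign, and `nlargest` is one way to pull out the biggest gaps. The frame `demo` and its values are assumptions for illustration.

```python
import pandas as pd

# hypothetical signed differences (sat_zscore - act_zscore)
demo = pd.DataFrame(
    {"z-diff": [-0.051, 0.020, 0.097, -0.082]},
    index=pd.Index([1, 4, 18, 85], name="student_ID"),
)

# magnitude of the gap, ignoring which test was stronger
demo["z-gap"] = demo["z-diff"].abs()

# the two largest gaps, comparable to sorting descending and taking head(2)
print(demo.nlargest(2, "z-gap"))
```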




In [37]:
## create a column that lists the absolute gap between z-scores
dfp['z-gap'] = (dfp['sat_zscore'] - dfp['act_zscore']).abs()
dfp

student_ID,ACT,SAT,sat_zscore,act_zscore,better_score,z-diff,z-gap
1,25,1190,0.667,0.718,ACT,-0.051,0.051
2,21,1053,-0.036,0.011,ACT,-0.046,0.046
3,26,1222,0.831,0.895,ACT,-0.064,0.064
4,31,1411,1.801,1.780,SAT,0.020,0.020
5,20,1032,-0.143,-0.166,SAT,0.023,0.023
...,...,...,...,...,...,...,...
96,12,767,-1.503,-1.582,SAT,0.079,0.079
97,23,1146,0.441,0.365,SAT,0.077,0.077
98,23,1139,0.405,0.365,SAT,0.041,0.041
99,22,1084,0.123,0.188,ACT,-0.064,0.064


In [39]:
## sort it to show the 5 biggest differences
dfp.sort_values(by="z-gap", ascending = False).head(5)

student_ID,ACT,SAT,sat_zscore,act_zscore,better_score,z-diff,z-gap
18,23,1150,0.462,0.365,SAT,0.097,0.097
33,21,1079,0.098,0.011,SAT,0.087,0.087
85,17,908,-0.779,-0.697,ACT,-0.082,0.082
7,31,1423,1.862,1.78,SAT,0.082,0.082
70,18,943,-0.6,-0.52,ACT,-0.08,0.08
