# Z score

- In this notebook, let us understand what a Z score is, its applications and how it is used to compare multiple scores which are on different scales.
![word-image.png](attachment:word-image.png)

- We are going to have a deep discussion on **Z score.** But right before that, we need to understand what **a normal distribution and a standard normal distribution** are.

## 1. What is a distribution?

- Okay, for the people new to statistics, I want you to know what exactly a distribution is. Think about it in a simple manner.


- Today is your birthday! And you're 'distributing' sweets in your class. How would you do that? Let me tell you how I would distribute them. I would give three sweets to two of my best friends, two sweets to 10 of my friends and the remaining guys get only one sweet. 

### Well, that's what a distribution is!

#### Let me define what a distribution is, in layman's terms.

- A distribution in statistics is a function that shows the possible values for a variable and how often they occur.


- All these distributions have "probability density functions" or "probability mass functions" depending on whether the variable is continuous or discrete, but for now that's not our concern. We're here to learn the basics of distributions.
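- To make the definition concrete, here is a minimal sketch of the birthday example as a distribution: each possible value (sweets received) and how often it occurs. The class size of 30 is an assumption made only for illustration, since the story does not say how many students there are.

```python
from collections import Counter

# Hypothetical class of 30 students (the class size is an assumption):
# 2 best friends get 3 sweets, 10 friends get 2 sweets, the remaining 18 get 1 sweet.
sweets_per_student = [3] * 2 + [2] * 10 + [1] * 18

counts = Counter(sweets_per_student)   # value -> how often it occurs
total = len(sweets_per_student)

for value in sorted(counts):
    share = counts[value] / total
    print(f"{value} sweet(s): {counts[value]} students, proportion {share:.2f}")
```

- The table of values and their proportions printed here is exactly what a distribution describes.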

### Great! You've learnt what a distribution is. So now let's try to understand the normal distribution.

## 2. What is a normal distribution?

- A **normal distribution** is a distribution which is **symmetric** about the mean (well, the mean is nothing but the average of all the observations). Most of the observations in a normal distribution cluster around the mean. Have a look at the image below, that's what a normal distribution looks like.
![191a8f604b04f7b6e4a80d04db881c12798856f7.svg](attachment:191a8f604b04f7b6e4a80d04db881c12798856f7.svg)
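- If you'd like to check the two properties above (symmetry and clustering around the mean) yourself, here is a small sketch using `NormalDist` from Python's standard `statistics` module. The mean of 50 and standard deviation of 10 are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 50, standard deviation 10.
dist = NormalDist(mu=50, sigma=10)

# Symmetry: the curve is equally high at equal distances on either side of the mean.
print(dist.pdf(40), dist.pdf(60))

# Clustering: share of observations within 1, 2 and 3 standard deviations of the mean.
within = {}
for k in (1, 2, 3):
    within[k] = dist.cdf(50 + k * 10) - dist.cdf(50 - k * 10)
    print(f"within {k} standard deviation(s): {within[k]:.1%}")
```

- Roughly 68%, 95% and 99.7% of the observations fall within one, two and three standard deviations of the mean.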

### Looks interesting! Right?

- I hope you got a brief idea about **distributions and normal distribution.** Let us have a look at **standard normal distribution,** which is pretty simple. It is just a special case of normal distribution.

## 3. What is a standard normal distribution?

- The standard normal distribution is a normal distribution whose **mean and standard deviation** are **0 and 1** respectively.

### Why scale the data so that the mean is zero and standard deviation is 1?
- It is to **compare two values** which are on **different scales.** And you'll get to know how to do that later in this notebook.
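- As a minimal sketch of what this scaling does, the snippet below subtracts the mean from every value and divides by the standard deviation. The data are hypothetical, and the population standard deviation (`pstdev`) is used as an assumption because the list is treated as the whole group.

```python
from statistics import mean, pstdev

# Hypothetical observations, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

avg = mean(data)
sd = pstdev(data)  # population standard deviation

# Shift by the mean, then divide by the standard deviation.
scaled = [(x - avg) / sd for x in data]

print("Scaled values:", scaled)
print("Mean after scaling:", mean(scaled))
print("Standard deviation after scaling:", pstdev(scaled))
```

- Whatever the original units were, the scaled values are centred on zero with a spread of one, which is what makes values from different scales comparable.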

### A question might arise in your mind: why am I discussing the normal distribution when our topic is Z score? I've got you covered!

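- A **Z score** can be calculated for any observation once you know the mean and standard deviation, but reading it as a position on the bell curve (for example, as a percentile) is only reliable when the observations **follow a normal distribution,** at least approximately. Now! Let's start discussing Z score.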

# What is a Z score?

- A Z-score is a numerical measurement that describes a value's relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean.


- Didn't quite get that? Don't worry, let me explain with a simple example.


- Suppose there are three students whose marks in their English examination are 12, 16 and 23. At this stage, I would expect you to calculate the mean of these marks. However, I'll do that for you. The mean is 17.


- Apart from mean I've used another term in the definition of Z score, which is standard deviation.

## 4. What is standard deviation?

- Standard deviation measures how far the observations typically lie from their mean. It is the square root of the average of the squared deviations of the observations from their mean. (Simply averaging the absolute deviations gives a different quantity, the mean absolute deviation.) In the above example, the mean is 17 and the observations are 12, 16 and 23. Since these three marks are the whole group, we divide by the number of observations, which gives the population standard deviation. How to calculate it? I would write a simple code for that!

In [14]:
marks = [12,16,23]
mean = sum(marks)/len(marks)
print('Mean:',mean)

# Here comes the standard deviation
# First let us calculate the squared deviations of observations from their mean
marks_sq_dev = [(x-mean)**2 for x in marks]

# Then take the square root of their average (population standard deviation)
st_dev = (sum(marks_sq_dev)/len(marks_sq_dev))**0.5
print('Standard deviation:',round(st_dev,2))

Mean: 17.0
Standard deviation: 4.55


# How to calculate Z score?

# z = (data point - mean) / standard deviation
![Z-score-formula.jpg](attachment:Z-score-formula.jpg)
## What are we doing here?
- We're just rescaling the data so that the mean becomes zero and the standard deviation becomes one. So let us calculate the Z scores for the marks in the above example so that I can explain what a Z score is, in extreme detail.

In [16]:
# Calculating Z scores (rounded to two decimals for readability)
Z_scores = [round((x-mean)/st_dev,2) for x in marks]
Z_scores

[-1.1, -0.22, 1.32]

### We've got the Z scores for 12, 16 and 23 as -1.1, -0.22 and 1.32 respectively.

- Everything's fine, but what does this Z score actually mean?


- Let us consider the Z score of 23. It is 1.32, which means that 23 lies about 1.32 standard deviations above the mean. Likewise, the negative Z scores of 12 and 16 tell us that they lie below the mean. That's the whole point!

### Why do I calculate Z scores? I mean, we can just compare the scores as they are, right?
- No! You can't do that. Let me give you an example of two students trying to enter a university for M.Tech but through different exams.

- Suresh appeared for the GATE exam and wants to use his score for the admission, whereas Archana didn't appear for GATE but she did well in her PGECET exam.


- Suresh's score was 73 where the average GATE score that year was 87 and the standard deviation was 23. 


- Archana's score was 345 where the average PGECET score was 374 and the standard deviation was 115.

- Can you tell me who did comparatively well just by looking at their scores? Well, I don't have that superpower. So here comes our next question.


- How is Z score used to compare multiple scores on different scales?

## I am eager to know who's gonna make it to the university. Are you too? Let's find out then!

In [19]:
Suresh_score = 73
Archana_score = 345
avg_gate_score = 87
avg_pgecet_score = 374
gate_stdev = 23
pgecet_stdev = 115

Suresh_z_score = (Suresh_score-avg_gate_score)/gate_stdev
print("Suresh's Z score is:",Suresh_z_score)

Archana_z_score = (Archana_score-avg_pgecet_score)/pgecet_stdev
print("Archana's Z score is:",Archana_z_score)

if Suresh_z_score > Archana_z_score:
    print("Suresh made it to the university!")
elif Suresh_z_score == Archana_z_score:
    print("That's some great news! Both of them made it!")
else:
    print("Archana made it to the university!")

Suresh's Z score is: -0.6086956521739131
Archana's Z score is: -0.25217391304347825
Archana made it to the university!
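- To put these two Z scores in more familiar terms, here is a hedged sketch that converts each one into the approximate share of candidates who scored lower. It assumes that the scores of both exams are roughly normally distributed, which the example does not actually tell us, so treat the percentages as an illustration only.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist()

# Same numbers as in the cell above.
z_scores = {
    "Suresh": (73 - 87) / 23,
    "Archana": (345 - 374) / 115,
}

# cdf(z) is the share of a normal population lying below z.
percentiles = {name: standard_normal.cdf(z) for name, z in z_scores.items()}

for name, share_below in percentiles.items():
    print(f"{name}: z = {z_scores[name]:.2f}, scored above roughly {share_below:.0%} of candidates")
```

- Both students are below their exam's average, but Archana is closer to it, which is why she comes out ahead.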


### Cool example, right? You can also play with the data if you understand this well. 
- Alright! We've come to know that Z scores are helpful in comparing data which are not on the same scale. What are the other uses of this?

### Outlier detection

- Yes, Z scores can also be used for outlier detection. A common rule of thumb: if the Z score of an observation is less than -3 or greater than 3, that observation might be considered an outlier.
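- Why 3? Here is a small sketch of the reasoning, assuming the data are normally distributed: it computes the share of observations expected to fall more than three standard deviations away from the mean.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# By symmetry, the share below -3 equals the share above +3.
share_beyond_3 = 2 * standard_normal.cdf(-3)

print(f"Share of observations with |z| > 3: {share_beyond_3:.2%}")
```

- That is roughly 3 observations in 1000, so a value this far from the mean is rare enough to deserve a second look.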

# What is an outlier?
- An outlier is a value which differs significantly from the other values in the data.
![main-qimg-1e46c34e60220d34ba47ed71ff2cad75.png](attachment:main-qimg-1e46c34e60220d34ba47ed71ff2cad75.png)

### Okay! so let's have a look at a problem.
- Here is a sample of 15 observations of the areas of houses in Greater Hyderabad.


- Areas of houses are given in square yards.


- Observations = [200,234,523,1255,623,324,65,123,192,4332,433,235,543,720,239]


- Now let us detect outliers in the data using z score.

In [21]:
from statistics import mean
from statistics import stdev
# Given Observations = [200,234,523,1255,623,324,65,123,192,4332,433,235,543,720,239]
Observations = [200,234,523,1255,623,324,65,123,192,4332,433,235,543,720,239]
avg = mean(Observations)
st_dev = stdev(Observations)

# Now that we've calculated mean and standard deviation, it's time for outlier detection.
outliers = list()
for i in Observations:
    z = (i-avg)/st_dev
    if z <= -3 or z>=3:
        outliers.append(i)
outliers  

[4332]

- So as we can see, 4332 is the only outlier detected in the data. That makes sense, since it is very rare for a house to be built on such a large area.
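- If you want to reuse this check on other data, the same logic can be wrapped in a function. This is only a sketch: the name `find_outliers` and its `threshold` parameter are my own choices for illustration, and it uses the sample standard deviation (`stdev`) just like the cell above.

```python
from statistics import mean, stdev

def find_outliers(values, threshold=3.0):
    """Return (value, z) pairs whose Z score is at least `threshold` away from zero."""
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation, as in the cell above
    results = []
    for v in values:
        z = (v - avg) / sd
        if abs(z) >= threshold:
            results.append((v, z))
    return results

areas = [200, 234, 523, 1255, 623, 324, 65, 123, 192, 4332, 433, 235, 543, 720, 239]

for value, z in find_outliers(areas):
    print(f"{value} square yards has a Z score of {z:.2f}")
```

- Printing the Z score alongside the value shows how far beyond the cutoff the observation actually is.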

----------------------------

# Here's a practice problem for you!

- Here's a story of three farmers from three different villages, each owning 15 acres in their respective villages. All three of them grow paddy in their fields.


- The yield of the crop depends upon many influencing factors such as soil fertility, the farmer's ability to afford fertilizers, etc., which you don't need to worry about.


- All you need to find out is which farmer did well taking into account his village's situation. Data needed is given below.
![government-notifies-extension-of-pm-kisan-scheme-to-all-farmers.webp](attachment:government-notifies-extension-of-pm-kisan-scheme-to-all-farmers.webp)


- Nagesh, Krishna and Saidulu were able to grow 600 bags, 400 bags and 545 bags respectively. The average yields per 15 acres in their villages were 700 bags, 650 bags and 560 bags respectively, and the standard deviations were 35 bags, 64 bags and 56 bags respectively.

# Which farmer did well?
#### Besides finding out who did well, I need you to print "('name of the farmer' did well)" just as we did in the university entrance example. That's some basic coding, right?