# 01: Frequency Distributions and Graphs

In [1]:
import pandas as pd
import numpy as np 
import math 


A frequency distribution organizes observations into classes and records how often each class occurs. This helps us detect any pattern in the data, assuming that a pattern exists.



### Working Example 

Given a list of movie ratings, let's build a frequency distribution.

In [3]:
ratings = [3, 7, 2, 7, 8, 3, 1, 4, 10, 3, 2, 5, 3, 5, 8, 9, 7, 6, 3, 7, 8, 9, 7, 3, 6]

In [4]:
# Transform the list into a Pandas Series for easier manipulation 
obs = pd.Series(ratings)

In [5]:
# Build a frequency distribution
freq = obs.value_counts().sort_index(ascending=False).to_frame('f').rename_axis('rating')

In [6]:
freq               

rating,f
10,1
9,2
8,3
7,5
6,2
5,2
4,1
3,6
2,2
1,1


## Rules for Creating Frequency Distributions 

1. Each observation should be included in one, and only one, class. 
2. List all classes, even those with zero frequencies. 
3. All classes should have equal intervals.
4. All classes should have both an upper boundary and a lower boundary.
5. Select the class interval from convenient numbers, such as 1, 2, 3, ..., 10, particularly 5 and 10 or multiples of 5 and 10.
6. The lower boundary of each class interval should be a multiple of the class interval.
7. Aim for a total of approximately 10 classes.
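
Rules 5 to 7 can be sketched as a small helper. This is a minimal illustration under stated assumptions, not a prescribed method: the `CONVENIENT_WIDTHS` list, the function names, and the extreme values 132 and 218 are all hypothetical.

```python
import math

# Hypothetical list of "convenient" class widths (rule 5); adjust to taste.
CONVENIENT_WIDTHS = [1, 2, 3, 4, 5, 10, 15, 20, 25, 50, 100]

def choose_class_interval(lowest, highest, target_classes=10):
    """Pick the convenient width closest to range / target_classes (rule 7)."""
    raw_width = (highest - lowest) / target_classes
    return min(CONVENIENT_WIDTHS, key=lambda w: abs(w - raw_width))

def class_bounds(lowest, highest, width):
    """Return (lower, upper) so that the lower bound is a multiple of width (rule 6)."""
    lower = math.floor(lowest / width) * width
    # floor + 1 keeps the highest value inside a left-closed class
    # even when it falls exactly on a multiple of the width.
    upper = (math.floor(highest / width) + 1) * width
    return lower, upper

# Hypothetical extremes, for illustration only
width = choose_class_interval(132, 218)
lower, upper = class_bounds(132, 218, width)
num_classes = (upper - lower) // width
print(width, lower, upper, num_classes)
```

With these made-up extremes the raw width is 8.6, so the helper settles on a width of 10 and nine classes, which is close to the target of about 10.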

#### Example Dataset: IQ scores of students in a class

We will be using the following dataset to explain the process of creating a frequency distribution. 

In [7]:
iq_scores = [91, 85, 84, 79, 80,
87, 96, 75, 86, 104,
95, 71, 105, 90, 77,
123, 80, 100, 93, 108,
98, 69, 99, 95, 90,
110, 109, 94, 100, 103,
112, 90, 90, 98, 89]

In [8]:
iq = pd.Series(iq_scores)

In [10]:
iq.head()

0    91
1    85
2    84
3    79
4    80
dtype: int64

### Three-Stage Process of Creating a Frequency Distribution 

#### Stage 1: Creating the bins 

In [11]:
### 1.0.1 Identify the lowest value
lowest = iq.min()

lowest

69

In [12]:
### 1.0.2 Identify the highest value
highest = iq.max()

highest

123

In [13]:
### 1.1 Identify the range 
# Named score_range so the built-in range() is not shadowed
score_range = highest - lowest

score_range

54

In [16]:
### 1.2.1 Select desired number of classes 
num_bins = 10 

### 1.2 Find the class interval required to span the range
interval = score_range / num_bins

interval

5.4

In [19]:
### 1.3 Round off to the nearest convenient interval 
interval = round(interval)

interval

5
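
One caveat about this rounding step is sketched here. Python's built-in `round` rounds exact halves to the nearest even integer, and a small range can round to a width of 0. The numbers are purely illustrative, and the `max(1, ...)` guard is an assumption about how one might protect against a zero width, not part of the procedure above.

```python
# Built-in round() uses round-half-to-even ("banker's rounding")
half_down = round(4.5)   # goes to the even neighbour below
half_up = round(5.5)     # goes to the even neighbour above
print(half_down, half_up)

# Illustrative only: a range of 4 spread over 10 classes rounds to a zero width
raw_interval = 4 / 10
safe_interval = max(1, round(raw_interval))  # hypothetical guard against width 0
print(round(raw_interval), safe_interval)
```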

In [20]:
### 1.4 Determine where the lowest class should begin

# Lower bound (rounded down to nearest convenient interval)

lower_bound = math.floor(lowest/interval)*interval

lower_bound

65

In [21]:
### 1.5 Determine where the highest class should end
upper_bound = math.ceil(highest/interval)*interval

upper_bound

125

In [22]:
### 1.6 Create the bins 
bins = pd.interval_range(start=lower_bound, end=upper_bound, freq=interval, closed='left')

bins

IntervalIndex([[65, 70), [70, 75), [75, 80), [80, 85), [85, 90) ... [100, 105), [105, 110), [110, 115), [115, 120), [120, 125)], dtype='interval[int64, left]')

#### Stage 2: Map values to bins


In [23]:
mapped = pd.cut(iq, bins=bins)

mapped.head()

0    [90, 95)
1    [85, 90)
2    [80, 85)
3    [75, 80)
4    [80, 85)
dtype: category
Categories (12, interval[int64, left]): [[65, 70) < [70, 75) < [75, 80) < [80, 85) ... [105, 110) < [110, 115) < [115, 120) < [120, 125)]
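
Because the bins are closed on the left, a score that sits exactly on a boundary belongs to the class that starts there, which is what keeps each observation in one and only one class. The probe values below are hypothetical and chosen only to show this behaviour, including what happens to a value outside every class.

```python
import pandas as pd

# Same bin layout as above: width 5, from 65 up to 125, closed on the left
bins = pd.interval_range(start=65, end=125, freq=5, closed='left')

# Hypothetical probes: 90 sits on a class boundary, 125 lies outside every class
probe = pd.Series([89, 90, 125])
mapped_probe = pd.cut(probe, bins=bins)

print(mapped_probe)

# Values that fall in no class become NaN, so count them before tallying
print(mapped_probe.isna().sum())
```

A non-zero count of missing values after `pd.cut` is a hint that the bins do not span the data.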

#### Stage 3: Count the frequency of each bin 

In [24]:
## Mimic the SQL style with Group By
# f = mapped.groupby(by=mapped, sort=False).count()

# Using Value Counts 
f = mapped.value_counts(sort=False)

f

[65, 70)      1
[70, 75)      1
[75, 80)      3
[80, 85)      3
[85, 90)      4
[90, 95)      7
[95, 100)     6
[100, 105)    4
[105, 110)    3
[110, 115)    2
[115, 120)    0
[120, 125)    1
dtype: int64

#### (Optional) Stage 4: Supply headings and title 

In [25]:
f = f.to_frame('f').rename_axis('IQ Score')

In [26]:
f

IQ Score,f
"[65, 70)",1
"[70, 75)",1
"[75, 80)",3
"[80, 85)",3
"[85, 90)",4
"[90, 95)",7
"[95, 100)",6
"[100, 105)",4
"[105, 110)",3
"[110, 115)",2
"[115, 120)",0
"[120, 125)",1

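The stages above can be collected into one function. This is a minimal sketch rather than a definitive implementation: the name `frequency_distribution` and its parameters are hypothetical, and it differs from the walkthrough in two small ways (it guards against a zero width, and it uses floor + 1 for the upper bound so a maximum that lands exactly on a multiple of the width still gets a class).

```python
import math
import pandas as pd

def frequency_distribution(values, num_classes=10, name='value'):
    """Hypothetical helper: bin values into equal-width, left-closed classes."""
    s = pd.Series(values)
    lowest, highest = float(s.min()), float(s.max())

    # Stage 1: class width and bounds
    width = max(1, round((highest - lowest) / num_classes))
    lower = math.floor(lowest / width) * width
    upper = (math.floor(highest / width) + 1) * width
    bins = pd.interval_range(start=lower, end=upper, freq=width, closed='left')

    # Stage 2: map values to bins; Stage 3: count, keeping empty classes
    counts = pd.cut(s, bins=bins).value_counts(sort=False).sort_index()

    # Stage 4: supply headings
    return counts.to_frame('f').rename_axis(name)

iq_scores = [91, 85, 84, 79, 80, 87, 96, 75, 86, 104,
             95, 71, 105, 90, 77, 123, 80, 100, 93, 108,
             98, 69, 99, 95, 90, 110, 109, 94, 100, 103,
             112, 90, 90, 98, 89]

dist = frequency_distribution(iq_scores, num_classes=10, name='IQ Score')
print(dist)
```
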

### Handling Outliers 

- An outlier is a very extreme score.
- An outlier may be a regular score that was recorded incorrectly (e.g., a GPA of 0.6 instead of 3.6).
- Check whether the score is legitimate or an error.
- Outliers might be excluded from summaries.
- Outliers might also enhance understanding of the data.
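
One way to find scores worth checking is sketched below. It uses the common 1.5 × IQR rule of thumb, which is an assumed convention and not something these notes prescribe, and the GPA values are hypothetical. The goal is only to flag candidates for inspection, not to delete them automatically.

```python
import statistics

# Hypothetical GPA records; 0.6 stands in for a 3.6 that was mistyped
gpas = [3.6, 3.2, 2.9, 3.8, 3.4, 0.6, 3.1, 3.5, 3.0, 3.3]

# Quartiles from the standard library (default "exclusive" method)
q1, _, q3 = statistics.quantiles(gpas, n=4)
iqr = q3 - q1

# Rule-of-thumb fences: 1.5 * IQR beyond the quartiles (an assumed convention)
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Scores outside the fences are candidates for checking against the source records
flagged = [g for g in gpas if g < low_fence or g > high_fence]
print(flagged)
```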

## Relative Frequency Distributions 

Relative frequency distributions show the frequency of each class as a part or fraction of the total frequency for the entire distribution. 

- To convert a frequency distribution into a relative frequency distribution, divide the frequency for each class by the total frequency for the entire distribution. 

In [27]:
# Express each class frequency as a percentage of the total, rounded to a whole percent
rel_f = (f / f.sum() * 100).round()

rel_f

IQ Score,f
"[65, 70)",3.0
"[70, 75)",3.0
"[75, 80)",9.0
"[80, 85)",9.0
"[85, 90)",11.0
"[90, 95)",20.0
"[95, 100)",17.0
"[100, 105)",11.0
"[105, 110)",9.0
"[110, 115)",6.0
"[115, 120)",0.0
"[120, 125)",3.0

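A quick check on the conversion, sketched with plain Python so the arithmetic is visible: unrounded proportions always sum to 1, but percentages rounded class by class need not sum to exactly 100. The counts are the class frequencies of the IQ distribution above; the variable names are illustrative.

```python
# Class frequencies of the IQ distribution, lowest class first
counts = [1, 1, 3, 3, 4, 7, 6, 4, 3, 2, 0, 1]
total = sum(counts)

# Relative frequency of a class = class frequency / total frequency
proportions = [c / total for c in counts]

# Rounding each class to a whole percent introduces small errors
percentages = [round(p * 100) for p in proportions]

print(total)
print(round(sum(proportions), 10))
print(sum(percentages))
```

Here the rounded percentages add up to 101 rather than 100, which is ordinary rounding error and not a mistake in the counts.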