# 01: Frequency Distributions and Graphs

In [2]:
import pandas as pd
import numpy as np 
import math 


A frequency distribution organizes observations into classes and shows how often each class occurs, which makes it easier to detect a pattern in the data, assuming that a pattern exists.



### Working Example 

Given a list of movie ratings, let's build a frequency distribution.

In [3]:
ratings = [3, 7, 2, 7, 8, 3, 1, 4, 10, 3, 2, 5, 3, 5, 8, 9, 7, 6, 3, 7, 8, 9, 7, 3, 6]

In [4]:
# Transform the list into a Pandas Series for easier manipulation 
obs = pd.Series(ratings)

In [5]:
# Build a frequency distribution
freq = obs.value_counts().sort_index(ascending=False).to_frame('f').rename_axis('rating')

In [6]:
freq               

| rating | f |
|---|---|
| 10 | 1 |
| 9 | 2 |
| 8 | 3 |
| 7 | 5 |
| 6 | 2 |
| 5 | 2 |
| 4 | 1 |
| 3 | 6 |
| 2 | 2 |
| 1 | 1 |


## Rules for Creating Frequency Distributions 

1. Each observation should be included in one, and only one, class.
2. List all classes, even those with zero frequencies.
3. All classes should have equal intervals.
4. All classes should have both an upper boundary and a lower boundary.
5. Select the class interval from convenient numbers, such as 1, 2, 3, ..., 10, particularly 5 and 10 or multiples of 5 and 10.
6. The lower boundary of each class interval should be a multiple of the class interval.
7. Aim for a total of approximately 10 classes.
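
Some of these rules can be checked mechanically. The sketch below is a minimal illustration, assuming classes are given as `(lower, upper)` pairs ordered from lowest to highest, with the lower boundary included and the upper boundary excluded. The helper name `check_classes` and the example boundaries are hypothetical, not part of any library.

```python
def check_classes(classes, interval):
    """Check (lower, upper) class boundaries against rules 1, 3 and 6.

    Hypothetical helper for illustration only.
    """
    # Rule 1: adjacent classes share a boundary, so there are no gaps or overlaps
    no_gaps_or_overlaps = all(
        prev_upper == next_lower
        for (_, prev_upper), (next_lower, _) in zip(classes, classes[1:])
    )
    # Rule 3: every class spans the same interval
    equal_intervals = all(upper - lower == interval for lower, upper in classes)
    # Rule 6: every lower boundary is a multiple of the class interval
    lower_bounds_aligned = all(lower % interval == 0 for lower, _ in classes)
    return {
        "no_gaps_or_overlaps": no_gaps_or_overlaps,
        "equal_intervals": equal_intervals,
        "lower_bounds_aligned": lower_bounds_aligned,
    }


good = check_classes([(65, 70), (70, 75), (75, 80)], interval=5)
bad = check_classes([(68, 73), (73, 78), (80, 85)], interval=5)

print(good)
print(bad)
```

The second set of classes fails two checks: there is a gap between 78 and 80, and the lower boundaries 68 and 73 are not multiples of 5.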

#### Example Dataset: IQ scores of students in a class

We will be using the following dataset to explain the process of creating a frequency distribution. 

In [7]:
iq_scores = [91, 85, 84, 79, 80,
87, 96, 75, 86, 104,
95, 71, 105, 90, 77,
123, 80, 100, 93, 108,
98, 69, 99, 95, 90,
110, 109, 94, 100, 103,
112, 90, 90, 98, 89]

In [8]:
iq = pd.Series(iq_scores)

In [9]:
iq.head()

0    91
1    85
2    84
3    79
4    80
dtype: int64

### 3 Stage Process of Creating a Frequency Distribution 

#### Stage 1: Creating the bins 

In [10]:
### 1.0.1 Identify the lowest value
lowest = iq.min()

lowest

69

In [11]:
### 1.0.2 Identify the highest value
highest = iq.max()

highest

123

In [12]:
### 1.1 Identify the range 
# Named value_range so that Python's built-in range() is not shadowed
value_range = highest - lowest

value_range

54

In [13]:
### 1.2.1 Select desired number of classes 
num_bins = 10 

### 1.2 Find the class interval required to span the range
interval = value_range / num_bins

interval

5.4

In [14]:
### 1.3 Round off to the nearest convenient interval 
interval = round(interval)

interval

5

In [15]:
### 1.4 Determine where the lowest class should begin

# Lower bound (rounded down to nearest convenient interval)

lower_bound = math.floor(lowest/interval)*interval

lower_bound

65

In [16]:
### 1.5 Determine where the highest class should end
# Upper bound (rounded up to nearest convenient interval)
upper_bound = math.ceil(highest/interval)*interval

upper_bound

125

In [17]:
### 1.6 Create the bins 
bins = pd.interval_range(start=lower_bound, end=upper_bound, freq=interval, closed='left')

bins

IntervalIndex([[65, 70), [70, 75), [75, 80), [80, 85), [85, 90) ... [100, 105), [105, 110), [110, 115), [115, 120), [120, 125)], dtype='interval[int64, left]')

#### Stage 2: Map values to bins


In [18]:
mapped = pd.cut(iq, bins=bins)

mapped.head()

0    [90, 95)
1    [85, 90)
2    [80, 85)
3    [75, 80)
4    [80, 85)
dtype: category
Categories (12, interval[int64, left]): [[65, 70) < [70, 75) < [75, 80) < [80, 85) ... [105, 110) < [110, 115) < [115, 120) < [120, 125)]

#### Stage 3: Count the frequency of each bin 

In [19]:
## Mimic the SQL style with Group By
# f = mapped.groupby(by=mapped, sort=False).count()

# Using Value Counts 
f = mapped.value_counts(sort=False)

f

[65, 70)      1
[70, 75)      1
[75, 80)      3
[80, 85)      3
[85, 90)      4
[90, 95)      7
[95, 100)     6
[100, 105)    4
[105, 110)    3
[110, 115)    2
[115, 120)    0
[120, 125)    1
dtype: int64
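
The three stages above can also be bundled into one function. This is only a sketch under stated assumptions: the name `frequency_distribution` and its `num_classes` parameter are hypothetical, the class interval is rounded to a whole number as in Stage 1, and one extra class is added when the highest score falls exactly on the upper bound (left-closed bins would otherwise exclude it).

```python
import math

import pandas as pd


def frequency_distribution(values, num_classes=10):
    """Hypothetical helper that bundles Stages 1 to 3; not a pandas function."""
    obs = pd.Series(values)
    lowest, highest = float(obs.min()), float(obs.max())

    # Stage 1: choose a whole-number class interval and aligned boundaries
    interval = max(1, round((highest - lowest) / num_classes))
    lower_bound = math.floor(lowest / interval) * interval
    upper_bound = math.ceil(highest / interval) * interval
    if upper_bound == highest:
        # Left-closed bins exclude their right edge, so add one more class
        upper_bound += interval
    bins = pd.interval_range(start=lower_bound, end=upper_bound,
                             freq=interval, closed='left')

    # Stage 2: map each value to its bin; Stage 3: count values per bin
    return pd.cut(obs, bins=bins).value_counts(sort=False)


iq_scores = [91, 85, 84, 79, 80, 87, 96, 75, 86, 104,
             95, 71, 105, 90, 77, 123, 80, 100, 93, 108,
             98, 69, 99, 95, 90, 110, 109, 94, 100, 103,
             112, 90, 90, 98, 89]

dist = frequency_distribution(iq_scores)
print(dist)
```

With the IQ scores, this should reproduce the twelve classes from `[65, 70)` to `[120, 125)` and the same counts as the step-by-step version.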

#### (Optional) Stage 4: Supply headings and title 

In [20]:
f = f.to_frame('f').rename_axis('IQ Score')

In [21]:
f

| IQ Score | f |
|---|---|
| [65, 70) | 1 |
| [70, 75) | 1 |
| [75, 80) | 3 |
| [80, 85) | 3 |
| [85, 90) | 4 |
| [90, 95) | 7 |
| [95, 100) | 6 |
| [100, 105) | 4 |
| [105, 110) | 3 |
| [110, 115) | 2 |
| [115, 120) | 0 |
| [120, 125) | 1 |


### Handling Outliers 

- An outlier is a very extreme score.
- An outlier may be a regular score that was recorded incorrectly (e.g., a GPA of 0.6 instead of 3.6).
- Check whether the score is legitimate or an error.
- Outliers might be excluded from summaries.
- Alternatively, they might enhance understanding of the data.
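
To see why a recording error matters, the sketch below compares a summary computed with and without it. The GPA values are hypothetical and chosen only to mirror the 0.6 versus 3.6 example; the mean is used here simply as a familiar summary.

```python
import pandas as pd

# Hypothetical GPAs; the fourth value was mistyped as 0.6 instead of 3.6
recorded = pd.Series([3.2, 3.8, 3.4, 0.6, 3.0])

# Correct the erroneous entry once it has been confirmed as a typo
corrected = recorded.replace(0.6, 3.6)

mean_recorded = recorded.mean()
mean_corrected = corrected.mean()

print(round(mean_recorded, 2), round(mean_corrected, 2))
```

A single mistyped value pulls the mean from about 3.4 down to about 2.8, which is why checking the legitimacy of extreme scores comes before summarizing.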

## Relative Frequency Distributions 

Relative frequency distributions show the frequency of each class as a part or fraction of the total frequency for the entire distribution. 

- To convert a frequency distribution into a relative frequency distribution, divide the frequency for each class by the total frequency for the entire distribution. 
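
As a worked instance of this rule, take the IQ distribution built earlier: the class [90, 95) has a frequency of 7 out of 35 observations in total. The symbols $f_i$ (frequency of class $i$) and $\sum_j f_j$ (total frequency) are notation assumed here for illustration.

$$
\text{relative } f_i = \frac{f_i}{\sum_j f_j} = \frac{7}{35} = 0.20 = 20\%
$$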

In [22]:
# Divide each class frequency by the total, round to two decimals, then express as a percentage
rel_f = round(f / f.sum(),2)*100

rel_f

| IQ Score | f |
|---|---|
| [65, 70) | 3.0 |
| [70, 75) | 3.0 |
| [75, 80) | 9.0 |
| [80, 85) | 9.0 |
| [85, 90) | 11.0 |
| [90, 95) | 20.0 |
| [95, 100) | 17.0 |
| [100, 105) | 11.0 |
| [105, 110) | 9.0 |
| [110, 115) | 6.0 |
| [115, 120) | 0.0 |
| [120, 125) | 3.0 |

## Cumulative Frequency Distributions

Cumulative frequency distributions show the total number of observations in each class and in all lower-ranked classes. 

- Useful when relative standing within a distribution is important (for example, academic test scores).
- Cumulative frequencies are often converted into cumulative percentages, commonly known as percentile ranks.
- The **percentile rank** of a score indicates the percentage of scores in the entire distribution with values equal to or smaller than that score.

- To convert a frequency distribution into a cumulative frequency distribution, add to the frequency of each class the sum of the frequencies of all classes ranked below it. 
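
The percentile rank definition can also be applied directly to raw observations. The sketch below assumes the IQ scores from earlier; the function name `percentile_rank` is hypothetical, and it implements exactly the "equal to or smaller" definition given above.

```python
import pandas as pd


def percentile_rank(observations, score):
    """Percentage of observations with values equal to or smaller than `score`.

    Hypothetical helper for illustration only.
    """
    obs = pd.Series(observations)
    # The mean of a boolean Series is the fraction of True values
    return float((obs <= score).mean() * 100)


iq_scores = [91, 85, 84, 79, 80, 87, 96, 75, 86, 104,
             95, 71, 105, 90, 77, 123, 80, 100, 93, 108,
             98, 69, 99, 95, 90, 110, 109, 94, 100, 103,
             112, 90, 90, 98, 89]

rank_of_100 = percentile_rank(iq_scores, 100)
print(round(rank_of_100, 2))  # 77.14, since 27 of the 35 scores are 100 or lower
```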

#### Example: Frequency distribution of GRE scores

In [64]:
gre_freq = pd.Series({'725–749': 1,
'700–724': 3,
'675–699': 14,
'650–674': 30,
'625–649': 34,
'600–624': 42,
'575–599': 30,
'550–574': 27,
'525–549': 13,
'500–524': 4,
'475–499': 2}, name='GRE')

gre_freq

725–749     1
700–724     3
675–699    14
650–674    30
625–649    34
600–624    42
575–599    30
550–574    27
525–549    13
500–524     4
475–499     2
Name: GRE, dtype: int64

In [65]:
# Calculate the relative frequency
gre_rel_freq = round(gre_freq / gre_freq.sum(),3)

gre_rel_freq

725–749    0.005
700–724    0.015
675–699    0.070
650–674    0.150
625–649    0.170
600–624    0.210
575–599    0.150
550–574    0.135
525–549    0.065
500–524    0.020
475–499    0.010
Name: GRE, dtype: float64

In [67]:
# Calculate cumulative frequencies. cumsum accumulates from top to bottom,
# so the classes must first be sorted from lowest to highest.
gre_cum_freq = gre_freq.sort_index().cumsum()

gre_cum_freq

475–499      2
500–524      6
525–549     19
550–574     46
575–599     76
600–624    118
625–649    152
650–674    182
675–699    196
700–724    199
725–749    200
Name: GRE, dtype: int64
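
One caveat: `sort_index` compares the class labels as strings. That works for the GRE classes because every label starts with a three-digit number, but it can misorder labels of different lengths. The sketch below uses hypothetical classes to show the problem and one possible alternative, which reverses the row order and assumes the classes are already listed from highest to lowest.

```python
import pandas as pd

# Hypothetical classes with labels of different lengths, listed from highest to lowest
freq = pd.Series({'100–104': 2, '95–99': 3, '90–94': 5})

# String sorting puts '100–104' before '90–94', so this accumulates in the wrong order
by_label = freq.sort_index().cumsum()

# Reversing the rows relies only on the original high-to-low listing
by_position = freq.iloc[::-1].cumsum()

print(by_label)
print(by_position)
```

With string sorting, the highest class ends up with a cumulative frequency of 2, whereas the reversed version gives it the expected total of 10.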

In [68]:
# Create a cumulative percent column
gre_percentiles = (gre_cum_freq / gre_cum_freq.max()) * 100

gre_percentiles

475–499      1.0
500–524      3.0
525–549      9.5
550–574     23.0
575–599     38.0
600–624     59.0
625–649     76.0
650–674     91.0
675–699     98.0
700–724     99.5
725–749    100.0
Name: GRE, dtype: float64

In [69]:
# Create a dataframe for storing different frequencies
gre_dist = gre_freq.to_frame('f').rename_axis('GRE')

# Add relative frequency to the dataframe
gre_dist['Relative f'] = gre_rel_freq

# Add cumulative frequency to the dataframe
gre_dist['Cumulative f'] = gre_cum_freq

# Add percentile rank (cumulative percent) to the dataframe
gre_dist['Percentile Rank'] = gre_percentiles

In [70]:
gre_dist

| GRE | f | Relative f | Cumulative f | Percentile Rank |
|---|---|---|---|---|
| 725–749 | 1 | 0.005 | 200 | 100.0 |
| 700–724 | 3 | 0.015 | 199 | 99.5 |
| 675–699 | 14 | 0.07 | 196 | 98.0 |
| 650–674 | 30 | 0.15 | 182 | 91.0 |
| 625–649 | 34 | 0.17 | 152 | 76.0 |
| 600–624 | 42 | 0.21 | 118 | 59.0 |
| 575–599 | 30 | 0.15 | 76 | 38.0 |
| 550–574 | 27 | 0.135 | 46 | 23.0 |
| 525–549 | 13 | 0.065 | 19 | 9.5 |
| 500–524 | 4 | 0.02 | 6 | 3.0 |
| 475–499 | 2 | 0.01 | 2 | 1.0 |


#### Example: Distribution of Parental Ratings of Movies 

In [98]:
ratings = ['PG', 'PG', 'PG', 'PG-13', 'G',
'G', 'PG-13', 'R', 'PG', 'PG',
'R', 'PG', 'R', 'PG', 'R',
'NC-17', 'NC-17', 'PG','G', 'PG-13']

categories = ['NC-17', 'R', 'PG-13', 'PG', 'G']

ratings = pd.Categorical(values=ratings, categories=categories, ordered=True)

ratings = pd.Series(ratings, dtype='category')

ratings.head()

0       PG
1       PG
2       PG
3    PG-13
4        G
dtype: category
Categories (5, object): ['NC-17' < 'R' < 'PG-13' < 'PG' < 'G']

In [99]:
ratings_dist = ratings.value_counts(sort=False).to_frame('f').rename_axis('Rating')

ratings_dist

| Rating | f |
|---|---|
| NC-17 | 2 |
| R | 4 |
| PG-13 | 3 |
| PG | 8 |
| G | 3 |


In [100]:
ratings_dist['Rel f'] = round(ratings_dist['f']/ratings_dist['f'].sum() * 100 ,3)

In [101]:
ratings_dist

| Rating | f | Rel f |
|---|---|---|
| NC-17 | 2 | 10.0 |
| R | 4 | 20.0 |
| PG-13 | 3 | 15.0 |
| PG | 8 | 40.0 |
| G | 3 | 15.0 |


In [102]:
ratings_dist['Cum f'] = ratings_dist['f'].sort_index(ascending=False).cumsum()

In [103]:
ratings_dist

| Rating | f | Rel f | Cum f |
|---|---|---|---|
| NC-17 | 2 | 10.0 | 20 |
| R | 4 | 20.0 | 18 |
| PG-13 | 3 | 15.0 | 14 |
| PG | 8 | 40.0 | 11 |
| G | 3 | 15.0 | 3 |


In [104]:
ratings_dist['Percentile'] = round(ratings_dist['Cum f']/ratings_dist['Cum f'].max() * 100, 3)

In [105]:
ratings_dist

| Rating | f | Rel f | Cum f | Percentile |
|---|---|---|---|---|
| NC-17 | 2 | 10.0 | 20 | 100.0 |
| R | 4 | 20.0 | 18 | 90.0 |
| PG-13 | 3 | 15.0 | 14 | 70.0 |
| PG | 8 | 40.0 | 11 | 55.0 |
| G | 3 | 15.0 | 3 | 15.0 |
