# Zipf's Law

- Zipf’s law: r * freq = A * N
- r = word rank 
- freq = word frequency 
- A = constant; for English text A is typically close to 0.1
- N = total number of words in collection

- Zipf's law is a statistical regularity rather than an exact law: it holds only approximately, on average, for most words.
- Dividing both sides by N gives the equivalent form r * Prob(r) = A, where Prob(r) = freq(r) / N is the probability of the word at rank r.
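
As a quick worked example of the relation above, the sketch below computes the frequency Zipf's law would predict at a few ranks. The helper name `zipf_predicted_frequency`, the value A = 0.1 and the collection size N = 1,000,000 are illustrative assumptions, not values measured from the corpus.

```python
def zipf_predicted_frequency(rank, total_words, constant=0.1):
    """Frequency predicted by r * freq = A * N, i.e. freq = A * N / r."""
    return constant * total_words / rank


N = 1_000_000  # hypothetical collection size, for illustration only

for r in (1, 2, 10, 100):
    freq = zipf_predicted_frequency(r, N)
    # r * Prob(r) recovers the constant A (up to floating-point rounding)
    print(r, freq, r * (freq / N))
```

Under these assumptions the predicted frequency halves from rank 1 to rank 2 and drops by a factor of ten from rank 1 to rank 10, which is the behaviour the rest of the notebook checks against real data.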


In [1]:
import pandas as pd
import glob
import re
from nltk.probability import FreqDist
import matplotlib.pyplot as plt

files_dir = "/usr/local/share/nltk_data/corpora/gutenberg/" # Files in the gutenberg folder that we want to analyse
txt_files = glob.glob(files_dir + '*.txt') # Returns list of .txt files from this folder with their abs path
wordlist = []

- We will use only the first 5 files returned by glob for this analysis and build a word list of our own from them.

In [2]:
def wordsinfile(filename): # Returns the whitespace-separated tokens of a file that contain a letter or apostrophe
    with open(filename, 'r') as file: # The context manager closes the file even if reading fails
        text = file.read()
    # Keep only tokens containing at least one letter or apostrophe.
    # Tokens are dropped or kept whole: punctuation attached to a word (e.g. "cat,") is not stripped.
    return [token for token in text.split() if re.search(r"[A-Za-z']", token)]

for index in range(5): # Restricting to 5 iterations
    wordlist = wordlist + wordsinfile(txt_files[index]) # adding words from each file to word list
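
The filter in this cell keeps or drops whole tokens, so a token such as "cat," is counted separately from "cat". A stricter alternative, sketched here as an assumption and not what the notebook itself does, strips every character that is not a letter or an apostrophe before counting. The helper name `clean_tokens` and the sample sentence are hypothetical.

```python
import re


def clean_tokens(text):
    """Split on whitespace, strip non-letter/non-apostrophe characters, drop empty results."""
    cleaned = (re.sub(r"[^A-Za-z']+", "", token) for token in text.split())
    return [token for token in cleaned if token]


sample = "The cat, the hat -- and 42 don't count!"
print(clean_tokens(sample))
# ['The', 'cat', 'the', 'hat', 'and', "don't", 'count']
```

Switching to a cleaner like this would change the frequencies and therefore the values of A reported below, so the two approaches should not be mixed within one run.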


In [3]:
fdist1 = FreqDist(wordlist) # Creating the frequency distribution of the word list

# Creating a dataframe from fdist1 with the most common words at the top
df = pd.DataFrame(data=fdist1.most_common(len(fdist1)), index=range(1, len(fdist1) + 1), columns=['Word', 'Frequency'])

# Adding a rank column; the dataframe is already sorted by descending frequency
df['Rank'] = range(1, len(fdist1) + 1)

# Let's calculate the Zipf's law constant:
# A = Rank(r) * Freq(r) / N
# Calculating A for each row of this dataframe
df['Constant A'] = df['Rank'] * df['Frequency'] / len(wordlist)

# A is typically quoted as roughly 0.1
# For our dataset, let's calculate the average value of A
print("Mean of all the constant values of A: ", df['Constant A'].mean())

df.to_csv('Frequency Distribution and Ranks.csv') # Storing the dataframe to a csv file.
df[0:20].style

Mean of all the constant values of A:  0.04335904766213589


,Word,Frequency,Rank,Constant A
1,the,74219,1,0.0641559
2,and,48919,2,0.0845725
3,of,44821,3,0.116232
4,to,25248,4,0.0872988
5,in,17461,5,0.0754676
6,that,16081,6,0.0834038
7,a,14522,7,0.0878711
8,I,14091,8,0.0974436
9,And,13172,9,0.102474
10,he,12307,10,0.106383


In [4]:
# To check Zipf's law visually, we plot frequency against rank on log-log axes.
plt.loglog(df['Rank'], df['Frequency']) # Rank on the x axis, frequency on the y axis; both axes are base-10 log scales by default
plt.ylabel('Log Frequency')
plt.xlabel('Log Rank')
plt.title("Log Log plot to check Zipf's law")

plt.text(10**2, 10**4, "Zipf's law makes most errors for\nHighest and Lowest Frequency Words", style='italic',
        bbox={'facecolor':'white', 'alpha':0.5, 'pad':10})

# Reference line from (rank 1, highest frequency) to (highest rank, frequency 1)
plt.plot((df['Rank'].iloc[-1], 1), (1, df['Frequency'].iloc[0]), 'k--')

#plt.show()

plt.savefig("img/LogLogPlot.png")

![Log Log Plot to prove if Zipf's law holds true](img/LogLogPlot.png)

# Conclusion
- Looking only at the most frequent and least frequent words is misleading, because Zipf's law fits these words worst and gives its largest errors there.
- Zipf's law holds approximately for the words in the middle ranks, where the log-log plot is close to a straight line with a slope of about -1, as the law predicts.
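
The slope can also be estimated numerically instead of being read off the plot. The sketch below fits a least-squares line to log10(frequency) against log10(rank) using NumPy's `polyfit`. The helper name `loglog_slope` is hypothetical, and the data is synthetic and perfectly Zipfian (A = 0.1, N = 1,000,000, both assumed for illustration), so it only demonstrates the method rather than reporting a result for the Gutenberg files.

```python
import numpy as np


def loglog_slope(ranks, frequencies):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    slope, _intercept = np.polyfit(np.log10(ranks), np.log10(frequencies), 1)
    return slope


# Synthetic, ideal Zipf data (illustrative only): freq = A * N / r
ranks = np.arange(1, 1001)
frequencies = 0.1 * 1_000_000 / ranks

print(loglog_slope(ranks, frequencies))  # very close to -1 for ideal Zipf data
```

With the notebook's dataframe, the same function could be called on `df['Rank']` and `df['Frequency']`, possibly restricted to the middle ranks, to see how far the measured slope is from -1.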