## Skew and Kurtosis
`Skewness` is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.

`Kurtosis` is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. Data sets with high kurtosis tend to have heavy tails, or outliers; data sets with low kurtosis tend to have light tails, or a lack of outliers. A uniform distribution, which has no tails beyond its bounds, is an extreme case of low kurtosis.
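
To make the two definitions concrete, here is a minimal sketch that computes moment-based (population) skewness and excess kurtosis for two tiny, made-up data sets. The helper name `shape_stats` and the sample values are illustrative assumptions, not part of the analysis that follows.

```python
def shape_stats(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution
    return skew, excess_kurtosis

# Hypothetical data: one mirror-symmetric set, one with a long right tail
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 2, 3, 20]

for name, data in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    skew, kurt = shape_stats(data)
    print(f"{name}: skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

The symmetric set should report a skewness of zero and a negative excess kurtosis, while the single large value in the second set pushes both statistics above zero.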

In [1]:
import csv

i = 0
# List to hold the subsample
incomeData = []
with open('data/usa.csv', newline='') as myFile:
    data = csv.reader(myFile)
    for row in data:
        i += 1
        if row[6] != 'INCTOT': # Skip the header row
            if int(row[6]) > 1: # Keep incomes greater than 1
                if int(row[6]) != 9999999: # Income value of 9999999 is unrealistic
                    incomeData.append(int(row[6]))
        
        # size of our sample
        if i > 30000:
            break
print(incomeData[0:5])

[10000, 38500, 82000, 8700, 18300]
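
The loop above picks the income column by position (`row[6]`). As an alternative sketch, `csv.DictReader` lets the same filter address the column by its header name. The in-memory CSV text, the `YEAR` and `AGE` columns, and the `read_incomes` helper are made up for illustration; only the `INCTOT` column name and the filter rules come from the cell above.

```python
import csv
import io

def read_incomes(file_obj, max_rows=30000):
    """Collect INCTOT values above 1, skipping the 9999999 value excluded above."""
    incomes = []
    # DictReader consumes the header row itself, so no header check is needed
    for row_number, row in enumerate(csv.DictReader(file_obj), start=1):
        if row_number > max_rows:
            break
        income = int(row["INCTOT"])
        if income > 1 and income != 9999999:
            incomes.append(income)
    return incomes

# Hypothetical three-column extract standing in for data/usa.csv
sample_csv = io.StringIO(
    "YEAR,AGE,INCTOT\n"
    "2015,34,10000\n"
    "2015,8,9999999\n"
    "2015,51,0\n"
    "2015,29,38500\n"
)
print(read_incomes(sample_csv))
```

Looking the column up by name keeps the filter working if the column order in the file changes, at the cost of building a dictionary per row.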


In [2]:
# Sample Size
sampleSize = len(incomeData)
print('Sample Size: ', sampleSize)

# Sample Mean
totalSumIncome = 0
for row in incomeData:
    totalSumIncome = row + totalSumIncome
    
mean = totalSumIncome / sampleSize
print('Mean: ', mean)

# Variance
sumOfSquares = 0
s3 = 0
s4 = 0

for row in incomeData:
    deviationScore = row - mean
    sumOfSquares = deviationScore**2 + sumOfSquares
    s3 = deviationScore**3 + s3
    s4 = deviationScore**4 + s4
    
variance = sumOfSquares/(sampleSize - 1)
print('Variance: ', variance)


# Standard Deviation
SD = variance**0.5
print('Standard Deviation: ', SD)



Sample Size:  21883
Mean:  39579.76557144816
Variance:  2504546686.3267283
Standard Deviation:  50045.446209687536
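
The hand-rolled loop can be cross-checked against Python's standard `statistics` module, whose `variance` and `stdev` functions also use the n - 1 denominator. The sketch below uses only the five incomes printed earlier rather than the full sample, so its numbers will differ from the output above.

```python
import statistics

incomes = [10000, 38500, 82000, 8700, 18300]  # first five values printed earlier

# Manual computation, following the same steps as the loop above
n = len(incomes)
mean = sum(incomes) / n
sum_of_squares = sum((x - mean) ** 2 for x in incomes)
variance = sum_of_squares / (n - 1)
sd = variance ** 0.5

# Standard-library equivalents (sample variance, n - 1 denominator)
print('Mean: ', mean, statistics.mean(incomes))
print('Variance: ', variance, statistics.variance(incomes))
print('Standard Deviation: ', sd, statistics.stdev(incomes))
```

Each line prints the manual value next to the library value; the pairs should agree up to floating-point rounding.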


### Kurtosis
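
For reference, the two kurtosis estimators in this section can be written as follows, where $n$ is the sample size, $\bar{x}$ the sample mean, and $s_k = \sum_i (x_i - \bar{x})^k$ (the code's `s2` and `s4`). These are the standard textbook forms of excess kurtosis, stated here as background rather than taken from the notebook itself.

$$
m_k = \frac{s_k}{n}, \qquad g_2 = \frac{m_4}{m_2^2} - 3
$$

$$
G_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \cdot \frac{(n-1)^2\, s_4}{s_2^2} - \frac{3(n-1)^2}{(n-2)(n-3)}
$$

Here $g_2$ is the population (moment) estimate and $G_2$ is the bias-adjusted sample estimate. For large $n$ the two converge.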

In [3]:
n = sampleSize
s2 = sumOfSquares
m2 = s2/n
m4 = s4/n

populationKurtosis = (m4/m2**2) - 3
print('Population Kurtosis: ', populationKurtosis)

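# Bias-adjusted sample excess kurtosis: the subtracted term is 3(n-1)^2 / ((n-2)(n-3))
sampleKurtosis = ((n*(n+1)) / ((n-1)*(n-2)*(n-3))) * ((n-1)**2)*(s4/(s2**2)) - 3*((n-1)**2)/((n-2)*(n-3))
print('Sample Kurtosis: ', round(sampleKurtosis, 6))


Population Kurtosis:  28.869652964520665
Sample Kurtosis:  28.876525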


### Skewness
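
As background, the standard moment-based and bias-adjusted skewness estimators can be written as follows, where $n$ is the sample size, $\bar{x}$ the sample mean, $s_k = \sum_i (x_i - \bar{x})^k$ (the code's `s2` and `s3`), and $s = \sqrt{s_2 / (n-1)}$ is the sample standard deviation (the code's `SD`). These are textbook definitions, not formulas taken from the notebook itself.

$$
g_1 = \frac{s_3 / n}{(s_2 / n)^{3/2}}, \qquad G_1 = \frac{n}{(n-1)(n-2)} \cdot \frac{s_3}{s^3} = \frac{\sqrt{n(n-1)}}{n-2}\, g_1
$$

Here $g_1$ is the population (moment) estimate and $G_1$ is the bias-adjusted sample estimate. The factor $\sqrt{n(n-1)}/(n-2)$ approaches 1 as $n$ grows, so the two agree closely for large samples.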

In [4]:
# Bias-adjusted sample skewness: n / ((n-1)(n-2)) * s3 / SD^3
sampleSkew = (n/((n-1)*(n-2))) * s3/(SD**3)
print('Sample Skewness: ', round(sampleSkew, 6))

Sample Skewness:  4.443365


### Let's check our work

In [5]:
import pandas as pd
df_income = pd.DataFrame({'INCTOT':incomeData})
testKurtosis = df_income.kurtosis()
print('Test Kurtosis: ', testKurtosis)

testSkew = df_income.skew()
print('Test Skew: ', testSkew)

Test Kurtosis:  INCTOT    28.876525
dtype: float64
Test Skew:  INCTOT    4.443365
dtype: float64
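
pandas documents `skew()` and `kurtosis()` as unbiased estimates normalized by N-1, with kurtosis in Fisher's definition (normal = 0), so they are comparable to the sample statistics rather than the population ones. Below is a dependency-free sketch of those bias-adjusted formulas. The function name `adjusted_shape` is hypothetical, and the data are just the first five incomes printed earlier.

```python
def adjusted_shape(values):
    """Return bias-adjusted (skewness, excess kurtosis); needs n >= 4."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 observations")
    mean = sum(values) / n
    # Sums of powers of the deviations from the mean
    s2 = sum((x - mean) ** 2 for x in values)
    s3 = sum((x - mean) ** 3 for x in values)
    s4 = sum((x - mean) ** 4 for x in values)
    sd = (s2 / (n - 1)) ** 0.5  # sample standard deviation
    skew = (n / ((n - 1) * (n - 2))) * s3 / sd ** 3
    kurt = (
        (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))) * (n - 1) ** 2 * s4 / s2 ** 2
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )
    return skew, kurt

toy = [10000, 38500, 82000, 8700, 18300]  # first five incomes printed earlier
print(adjusted_shape(toy))
```

If pandas is available, the result can be compared with `pd.Series(toy).skew()` and `pd.Series(toy).kurtosis()`.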
