# Univariate Statistics

## Exploratory Data Analysis (EDA)
EDA is the process of analyzing datasets to summarize their characteristics through a combination of statistical calculations and data visualizations.

## Univariate
Univariate data describes one variable at a time. Univariate analysis does not deal with causes or with relationships between two or more variables.

In [1]:
import pandas as pd

df = pd.read_csv('C:/Users/skous2/Documents/GitHub/IS-315-Code/data/insurance.csv')

In [2]:
df

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


Explore - Look at the data

In [3]:
df.shape

(1338, 7)

In [4]:
df.describe()

,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


In [7]:
# Calculate the count of non-empty values for each feature
print('age: ' + str(df.age.count()))
print('sex: ' + str(df.sex.count()))
print('bmi: ' + str(df.bmi.count()))
print('children: ' + str(df.children.count()))
print('smoker: ' + str(df.smoker.count()))
print('region: ' + str(df.region.count()))
print('charges: ' + str(df.charges.count()))

# Calculate the number of unique values for each feature
print('age: ' + str(df.age.nunique()))
print('sex: ' + str(df.sex.nunique()))
print('bmi: ' + str(df.bmi.nunique()))
print('children: ' + str(df.children.nunique()))
print('smoker: ' + str(df.smoker.nunique()))
print('region: ' + str(df.region.nunique()))
print('charges: ' + str(df.charges.nunique()))

age: 1338
sex: 1338
bmi: 1338
children: 1338
smoker: 1338
region: 1338
charges: 1338
age: 47
sex: 2
bmi: 548
children: 6
smoker: 2
region: 4
charges: 1337


The same checks can be run on the whole DataFrame at once:

In [8]:
print(df.nunique())
print()
print(df.dtypes)

age           47
sex            2
bmi          548
children       6
smoker         2
region         4
charges     1337
dtype: int64

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object


The data type of a single column is available through the .dtype attribute of a pandas Series (a whole DataFrame uses .dtypes, as above):

In [9]:
print('age: ' + str(df.age.dtype))
print('sex: ' + str(df.sex.dtype))
print('bmi: ' + str(df.bmi.dtype))
print('children: ' + str(df.children.dtype))
print('smoker: ' + str(df.smoker.dtype))
print('region: ' + str(df.region.dtype))
print('charges: ' + str(df.charges.dtype))

age: int64
sex: object
bmi: float64
children: int64
smoker: object
region: object
charges: float64
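The count, unique-value, and data-type checks above can also be gathered into one overview table. This is only a sketch: `sample` is a small made-up DataFrame standing in for the insurance data, and `overview` is an illustrative name, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the insurance data (values invented for illustration)
sample = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, None],   # None is stored as a missing value (NaN)
})

# One row per feature: non-empty count, number of unique values, data type
overview = pd.DataFrame({
    "count": sample.count(),
    "nunique": sample.nunique(),
    "dtype": sample.dtypes.astype(str),
})
print(overview)
```

Because each method returns a Series indexed by column name, pandas aligns them automatically when they are combined.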


## Missing Values

In [10]:
print('age: ' + str(df.age.isna().sum()))
print('sex: ' + str(df.sex.isna().sum()))
print('bmi: ' + str(df.bmi.isna().sum()))
print('children: ' + str(df.children.isna().sum()))
print('smoker: ' + str(df.smoker.isna().sum()))
print('region: ' + str(df.region.isna().sum()))
print('charges: ' + str(df.charges.isna().sum()))

age: 0
sex: 0
bmi: 0
children: 0
smoker: 0
region: 0
charges: 0
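The insurance data has no gaps, so every count above is 0. To show what a non-zero result looks like, here is a minimal sketch on a made-up DataFrame (`sample` and its values are assumptions for illustration), applied to all columns at once rather than one column at a time.

```python
import pandas as pd

# Hypothetical data with a few gaps (None is stored as a missing value)
sample = pd.DataFrame({
    "age": [19, 18, None, 33],
    "smoker": ["yes", None, None, "no"],
    "charges": [16884.92, 1725.55, 4449.46, 21984.47],
})

missing_counts = sample.isna().sum()   # number of missing values per column
missing_share = sample.isna().mean()   # fraction of rows missing per column

print(missing_counts)
print(missing_share)
```

The fraction is often easier to judge than the raw count, because it does not depend on the number of rows.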


## Boundaries and the Middle

In [11]:
myList = [1,2,3,4,5,6,7,8,9,10]
print(max(myList))  # Upper limit
print(min(myList))  # Lower limit

10
1


### Quantiles - Use numpy for lists, pandas for DataFrame

In [12]:
# Using a Python list
import numpy as np

myList = [1,2,3,4,5,6,7,8,9,10]
print(np.quantile(myList, .25))
print(np.quantile(myList, .50))
print(np.quantile(myList, .75))

# Using a Pandas DataFrame column
df_quantiles = pd.DataFrame(data=[1,2,3,4,5,6,7,8,9,10], columns=['myList'])
print("\n")
print(df_quantiles.myList.quantile(.25))
print(df_quantiles.myList.quantile(.50))
print(df_quantiles.myList.quantile(.75))

3.25
5.5
7.75


3.25
5.5
7.75
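The value 3.25 does not appear in the list. By default, both `np.quantile` and the pandas `quantile` method interpolate linearly between the two nearest sorted values. The sketch below reproduces that calculation by hand; `linear_quantile` is a hypothetical helper written for this illustration, not a library function.

```python
import math
import numpy as np

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q   # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

myList = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for q in (0.25, 0.50, 0.75):
    # The hand calculation and NumPy's default should agree
    print(q, linear_quantile(myList, q), np.quantile(myList, q))
```

For q = 0.25 the position is 9 × 0.25 = 2.25, which lies a quarter of the way between the values 3 and 4, giving 3.25.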


### Mean

In [14]:
import statistics as stat  # Standard-library module; intended for plain Python sequences such as lists

myList = [1,2,3,4,5,6,7,8,9,10]
stat.mean(myList)

5.5

In [16]:
import pandas as pd

# First we create a DataFrame to test this on. Notice that we first add 
# the data, then specify a series of column headers and row index names
# (as opposed to index numbers)
fruit = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])

# Now print it to see what it looks like
fruit

,Apple,Orange,Banana,Pear
Basket1,10,20,30,40
Basket2,7,14,21,28
Basket3,55,15,8,12
Basket4,15,14,1,8
Basket5,7,1,1,8
Basket6,5,4,9,2


In [20]:
print(fruit.mean())

Apple     16.500000
Orange    11.333333
Banana    11.666667
Pear      16.333333
dtype: float64


In [24]:
# Mean of each row (computed across the columns); both spellings are equivalent
fruit.mean(axis="columns")
fruit.mean(axis=1)

Basket1    25.00
Basket2    17.50
Basket3    22.50
Basket4     9.50
Basket5     4.25
Basket6     5.00
dtype: float64

### Median

In [25]:
fruit.median()  # Median of each column in the DataFrame

Apple      8.5
Orange    14.0
Banana     8.5
Pear      10.0
dtype: float64

In [27]:
fruit.Apple.median()  # Median of a single column

8.5
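The Apple column has a mean of 16.5 but a median of only 8.5, because the single large value 55 pulls the mean upward while the median ignores how far away it is. A minimal sketch of the median calculation, using a hypothetical helper `median_by_hand`:

```python
def median_by_hand(values):
    """Middle value of the sorted data; average the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

apple = [10, 7, 55, 15, 7, 5]            # the Apple column from the fruit DataFrame
print(median_by_hand(apple))             # sorted: 5, 7, 7, 10, 15, 55 -> (7 + 10) / 2
print(sum(apple) / len(apple))           # the mean, for comparison
```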

### Mode

In [28]:
fruit.mode()

,Apple,Orange,Banana,Pear
0,7,14,1,8


In [29]:
fruit.mode(axis="columns")  # Mode of each row (computed across the columns)

,0,1,2,3
Basket1,10.0,20.0,30.0,40.0
Basket2,7.0,14.0,21.0,28.0
Basket3,8.0,12.0,15.0,55.0
Basket4,1.0,8.0,14.0,15.0
Basket5,1.0,,,
Basket6,2.0,4.0,5.0,9.0
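In the row-wise result, most baskets list all four values because every value occurs exactly once, so each one ties for the mode. Basket5 has a single mode (1 appears twice), and pandas pads the remaining cells with NaN, which is also why the values display as floats. The standard library's `statistics.multimode` shows the same tie behaviour on plain lists; the sketch below uses two rows copied from the fruit DataFrame.

```python
import statistics as stat

basket3 = [55, 15, 8, 12]   # every value appears once, so all four tie
basket5 = [7, 1, 1, 8]      # 1 appears twice, so it is the only mode

print(stat.multimode(basket3))
print(stat.multimode(basket5))
```

Unlike pandas, `multimode` lists tied values in the order they first appear rather than in sorted order.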


## Standard Deviation

Standard deviation tells us how spread out the data is around the mean.

In [30]:
# Using a Python list
import numpy as np
myList = [1,2,3,4,5,6,7,8,9,10]
print(np.std(myList, ddof=1)) # The parameter 'ddof=1' is used to change the default std to sample mode (s)

# Using Pandas DataFrame column
import pandas as pd
df = pd.DataFrame(data=[1,2,3,4,5,6,7,8,9,10], columns=['numbers'])
print(df.numbers.std())       # Assumes a sample std (s) by default

3.0276503540974917
3.0276503540974917
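The `ddof` setting only changes the divisor: the population formula divides the summed squared deviations by n, while the sample formula divides by n - 1. A minimal sketch of both calculations by hand, using only the standard library (variable names are illustrative):

```python
import math

myList = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
n = len(myList)
mean = sum(myList) / n

# Sum of squared distances from the mean
squared_deviations = sum((x - mean) ** 2 for x in myList)

population_std = math.sqrt(squared_deviations / n)        # same as ddof=0
sample_std = math.sqrt(squared_deviations / (n - 1))      # same as ddof=1

print(population_std)
print(sample_std)
```

The sample value matches the 3.0276... printed above; the population value is slightly smaller because it divides by 10 instead of 9.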


## Skewness

Skewness measures how asymmetric a distribution is: a positive value means a longer tail to the right of the bulk of the data, and a negative value means a longer tail to the left.
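To make the sign convention concrete, here is a small sketch with three made-up Series (the values are invented for illustration): one with a long right tail, one with a long left tail, and one that is symmetric.

```python
import pandas as pd

right_tailed = pd.Series([1, 1, 2, 2, 3, 10])    # one large value stretches the right tail
left_tailed = pd.Series([1, 8, 9, 9, 10, 10])    # one small value stretches the left tail
symmetric = pd.Series([1, 2, 3, 4, 5])           # mirror image around the mean

print(right_tailed.skew())   # positive
print(left_tailed.skew())    # negative
print(symmetric.skew())      # zero, up to floating-point rounding
```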

## Kurtosis

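Kurtosis tells us how peaked or how flat a distribution is, and how heavy its tails are. Both scipy and pandas report excess kurtosis by default, so a normal distribution scores 0.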

In [2]:
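# Using a Python list
from scipy.stats import kurtosis, skew
myList = [1,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,7,7,7,7,7,8,8,8,9,9,10]
print(skew(myList, bias=False))      # bias=False applies the sample (bias-corrected) formula
print(kurtosis(myList, bias=False))  # Excess kurtosis: a normal distribution scores 0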

# Using Pandas DataFrame
import pandas as pd
df = pd.DataFrame(data=[1,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,7,7,7,7,7,8,8,8,9,9,10], columns=['numbers'])
print('The skewness is ' + str(df.numbers.skew()))
print('The kurtosis is ' + str(df.numbers.kurt()))

-0.01972922271337009
-0.03905580479600701
The skewness is -0.01972922271337009
The kurtosis is -0.0390558047960079


A conservative rule of thumb is that skewness and kurtosis should both fall within the range of -1 to 1 for a feature to be considered normal enough to use during the Modeling phase.
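This rule of thumb is easy to apply in code. The sketch below is only an illustration: `within_rule_of_thumb` is a hypothetical helper, the `limit` default of 1.0 encodes the range stated above, and `lopsided` is invented data chosen to fail the check.

```python
import pandas as pd

def within_rule_of_thumb(series, limit=1.0):
    """Return True when skewness and excess kurtosis both lie in [-limit, limit]."""
    return bool(abs(series.skew()) <= limit and abs(series.kurt()) <= limit)

# The same numbers used in the skewness and kurtosis example
numbers = pd.Series([1,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,7,7,7,7,7,8,8,8,9,9,10])

# Hypothetical data dominated by a single extreme value
lopsided = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 1, 100])

print(within_rule_of_thumb(numbers))
print(within_rule_of_thumb(lopsided))
```

The first Series passes because its skewness and kurtosis are both close to 0; the second fails because the single value of 100 produces a long right tail.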