# Univariate Data Exploration

Although there are several ways to calculate these statistics, we will use pandas for two reasons:
1. Pandas alone can calculate every statistic needed in univariate data exploration. With other packages, we would have to mix and match.
2. Pandas is data-centric: the dataframe is its main object, and its methods operate on it directly. This is very similar to R!
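
As a rough illustration of the first point, the sketch below uses `describe()` to produce several summary statistics in one call. The `scores` dataframe and its values are hypothetical and serve only as an example.

```python
import pandas as pd

# Hypothetical dataframe, used only for illustration
scores = pd.DataFrame({"age": [18, 20, 19], "gpa": [3.9, 3.7, 3.0]})

# describe() reports count, mean, std, min, quartiles, and max for each numeric column
summary = scores.describe()
print(summary)
```

Each statistic is also available as its own method, which is the approach taken in the sections that follow.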

## Mean

In [14]:
# Example of calculating mean

# Import pandas library
import pandas as pd  # You can choose any alias; "pd" is simply the convention

# Create dataframe
raw_data = {"name": ["A", "B", "C", "D", "E", "G"],
            "gender": ["M", "F", "F", "M", "F", "F"],
            "age": [18, 20, 19, 22, 17, 19],
            "gpa": [3.9, 3.7, 3.0, 4.0, 3.3, 3.7]}

# Convert the dictionary into a pandas dataframe with an explicit column order
data = pd.DataFrame(raw_data, columns = ["name", "gender", "age", "gpa"])

# Look at the data
data


  name gender  age  gpa
0    A      M   18  3.9
1    B      F   20  3.7
2    C      F   19  3.0
3    D      M   22  4.0
4    E      F   17  3.3
5    G      F   19  3.7


In [28]:
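# Calculate the mean of every numeric column.
# axis = 0 aggregates down the rows (one result per column); axis = 1 aggregates across the columns (one result per row).
# numeric_only = True skips the text columns ("name", "gender"); pandas 2.0 and later raise a TypeError without it
mean_all = data.mean(axis = 0, numeric_only = True)
print(mean_all)
print()

# If you only need the mean of one column, select that column first
mean_age = data["age"].mean()
print(mean_age)

# Note: the book describes axis = 0 as the mean ALONG THE ROWS. That is the same idea: pandas moves down the rows and returns one mean per column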

age    19.166667
gpa     3.600000
dtype: float64

19.166666666666668
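
Because the `axis` argument is easy to mix up, here is a minimal sketch that contrasts the two settings. The `numbers` dataframe is hypothetical and contains only numeric columns so that both calls work.

```python
import pandas as pd

# Hypothetical all-numeric dataframe, used only to illustrate the axis argument
numbers = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# axis=0: move down the rows, producing one mean per column
col_means = numbers.mean(axis=0)
print(col_means)

# axis=1: move across the columns, producing one mean per row
row_means = numbers.mean(axis=1)
print(row_means)
```

One way to remember it: the axis you pass is the one that gets collapsed, so `axis=0` collapses the rows and leaves one value per column.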




## Median

We will use the same dataframe for consistency.

In [26]:
# Calculate the median of every numeric column (numeric_only = True skips the text columns)
median_all = data.median(axis = 0, numeric_only = True)
print(median_all)
print()

# Calculate the median of a single column
median_gpa = data["gpa"].median()
print(median_gpa)

age    19.0
gpa     3.7
dtype: float64

3.7




## Mode

We will use the same dataframe for consistency.

In [25]:
# Calculate the mode of every column. A column with tied values lists all of them
# (every name is unique, so all six are modes); shorter columns are padded with NaN
mode_all = data.mode()
print(mode_all)
print()

# Calculate mode of a single column
mode_gender = data["gender"].mode()
print(mode_gender)

  name gender   age  gpa
0    A      F  19.0  3.7
1    B    NaN   NaN  NaN
2    C    NaN   NaN  NaN
3    D    NaN   NaN  NaN
4    E    NaN   NaN  NaN
5    G    NaN   NaN  NaN

0    F
Name: gender, dtype: object
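
`mode()` returns a Series instead of a single value because a column can have more than one mode. The sketch below shows both cases; the `tied` series is hypothetical data chosen to produce a tie.

```python
import pandas as pd

# The age column from the dataframe above: 19 appears twice, so it is the only mode
ages = pd.Series([18, 20, 19, 22, 17, 19])
age_modes = ages.mode()
print(age_modes.tolist())

# Hypothetical data with a tie: 1 and 2 both appear twice, so both are returned
tied = pd.Series([1, 1, 2, 2, 3])
tied_modes = tied.mode()
print(tied_modes.tolist())
```

If you need a single value, you can take the first entry with `.iloc[0]`, keeping in mind that this silently ignores any ties.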


## Max, Min, and Range

We will use the same dataframe for consistency

In [31]:
# Calculate the max/min age
max_age = data["age"].max()
min_age = data["age"].min()

# Calculate the age range (avoid the name "range", which would shadow Python's built-in function)
age_range = max_age - min_age
print(age_range)

5
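
As an alternative sketch, `agg()` can compute the minimum and maximum in a single call. The variable names here are illustrative, and the series repeats the age values from the dataframe above.

```python
import pandas as pd

# The age column from the dataframe above
ages = pd.Series([18, 20, 19, 22, 17, 19])

# agg() with a list of function names returns a Series indexed by those names
extremes = ages.agg(["min", "max"])
print(extremes)

# The range is the difference between the two
age_range = extremes["max"] - extremes["min"]
print(age_range)
```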


## IQR: Interquartile Range

We will use the same dataframe for consistency.

Note that pandas does not have a function that calculates the IQR directly. The workaround is simple: calculate the first quartile (Q1) and the third quartile (Q3) with `quantile()`, then find the IQR by subtracting Q1 from Q3.

In [30]:
# Calculate the quantiles of gpa
q1 = data["gpa"].quantile(0.25)
q3 = data["gpa"].quantile(0.75)

# Calculate the IQR of gpa
iqr = q3 - q1
print(iqr)

0.4500000000000002
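
To see where this value comes from, the sketch below recomputes the quartiles by hand using linear interpolation between sorted values, which is the default method of `quantile()`. The helper `linear_quantile` is a hypothetical function written for illustration, not part of pandas.

```python
import pandas as pd

# The gpa column from the dataframe above
gpa = pd.Series([3.9, 3.7, 3.0, 4.0, 3.3, 3.7])

def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)         # fractional index into the sorted data
    lower = int(position)                     # index just below the position
    upper = min(lower + 1, len(ordered) - 1)  # index just above, capped at the end
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

q1_manual = linear_quantile(gpa, 0.25)
q3_manual = linear_quantile(gpa, 0.75)
print(q1_manual, q3_manual, q3_manual - q1_manual)
```

With six values, Q1 sits at position 1.25 in the sorted data (about 3.4) and Q3 at position 3.75 (about 3.85), so the IQR is about 0.45. The trailing digits in the printed result are ordinary floating-point rounding.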
