# Introduction to Statistics

We will use very basic statistics in this course, so we'll go over some definitions. 

### Sample Dataset
Set up a sample dataset of student grades to demonstrate basic statistics.

In [1]:
# Put the import statements at the top of the file
import random
import pandas as pd
import numpy as np

#### Random list generator

We will randomly generate a list of numbers

In [2]:
# Assign the seed so the randomly generated numbers are the same on every run (pseudo-random)
random.seed(10)

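**Here are the different types of random functions**

> random.sample(seq, k=n)

Generate n unique samples (multiple items) from a sequence without repetition. Here, seq can be a list, tuple, string, or range (sets are no longer accepted as of Python 3.11). This samples without replacement.

> random.choices(seq, k=n)

Generate n samples from a sequence with the possibility of repetition. This samples with replacement. Note that `k` must be passed by keyword, because the second positional argument of `random.choices` is `weights`.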
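To make the difference concrete, here is a minimal sketch that draws from a small, hypothetical population (the integers 1 to 10, which are not part of the grades dataset). The variable names are illustrative only.

```python
import random

random.seed(10)  # fixed seed so the run is reproducible

population = list(range(1, 11))  # hypothetical population: the integers 1..10

# Without replacement: every drawn value is distinct
without_replacement = random.sample(population, k=5)

# With replacement: the same value may be drawn more than once
with_replacement = random.choices(population, k=20)

print(without_replacement)
print(with_replacement)

# sample() can never return duplicates
assert len(set(without_replacement)) == len(without_replacement)
# Drawing 20 values from only 10 candidates must repeat at least one
assert len(set(with_replacement)) < len(with_replacement)
```

Asking `random.sample` for more items than the population holds raises a `ValueError`, whereas `random.choices` accepts any `k`.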

In [3]:
# Randomly generate some lists
# Note: choices() samples with replacement, so duplicate IDs are possible (though unlikely here)
student_id = random.choices(range(10000000, 99999999), k=500)

assignment1 = random.choices(range(0, 100), k=500)

assignment2 = random.choices(range(0, 100), k=500)

assignment3 = random.choices(range(0, 100), k=500)

#### Create the dataframe

In [4]:
grades = pd.DataFrame({'student_id': student_id, 
                       'assignment1': assignment1, 
                       'assignment2': assignment2,
                       'assignment3': assignment3,})

grades.head()

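|   | student_id | assignment1 | assignment2 | assignment3 |
|---|------------|-------------|-------------|-------------|
| 0 | 61426232   | 7           | 15          | 91          |
| 1 | 48600014   | 37          | 53          | 87          |
| 2 | 62028216   | 8           | 82          | 82          |
| 3 | 28548840   | 0           | 14          | 91          |
| 4 | 83198911   | 82          | 72          | 99          |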


### Basic Statistics

**Mean/Average:** The sum of all values divided by the number of values.

In [5]:
grades.assignment1.mean()

50.206

In [6]:
# Numpy Functions
np.mean(grades.assignment1)

50.206
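As a quick check of the definition, the mean can be computed by hand. This sketch uses a small, hypothetical list of grades rather than the dataset above.

```python
from statistics import mean

scores = [70, 85, 90, 55]  # hypothetical grades, not taken from the dataset above

# Mean = sum of all values / number of values
mean_by_hand = sum(scores) / len(scores)

print(mean_by_hand)  # 75.0

# The standard library gives the same answer
assert mean(scores) == mean_by_hand
```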

**Maximum:** Largest value in your dataset

In [7]:
grades.assignment1.max()

99

**Minimum:** Smallest value in your dataset

In [8]:
grades.assignment1.min()

0

**Variance:** A measure of how far the numbers in your dataset are spread from the mean. It is calculated by summing the squared differences between each number and the mean, then dividing by the number of values minus one (this is the sample variance, which pandas computes by default).

**Standard Deviation:** Measure of how spread out the numbers in the dataset are. This takes the square root of the variance so it is back in terms of your original unit.

In [9]:
grades.assignment1.var()

803.4304248496994

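This number is much larger than 100 because the variance is expressed in squared units (grade points squared), so it cannot be compared directly with grades on the 0-100 scale. Taking the square root gives the standard deviation, which is back in the original units.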

In [10]:
grades.assignment1.std()

28.344848294702505

Or if you want to calculate the standard deviation from the variance:

In [11]:
import math

math.sqrt(grades.assignment1.var())

28.344848294702505
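The sketch below works through the variance and standard deviation by hand on a small, hypothetical dataset. It also shows one thing to watch for: pandas divides by n - 1 by default (`ddof=1`), while NumPy's `np.var` and `np.std` divide by n by default (`ddof=0`), so the two libraries can disagree unless `ddof` is set explicitly.

```python
import math

import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5
n = len(values)
mean = sum(values) / n

# Sum of squared differences from the mean
squared_diffs = sum((x - mean) ** 2 for x in values)  # 32.0

sample_variance = squared_diffs / (n - 1)   # divides by n - 1
population_variance = squared_diffs / n     # divides by n

# pandas defaults to the sample variance (ddof=1) ...
assert math.isclose(pd.Series(values).var(), sample_variance)
# ... while NumPy defaults to the population variance (ddof=0)
assert math.isclose(np.var(values), population_variance)
# Passing ddof=1 makes NumPy agree with pandas
assert math.isclose(np.var(values, ddof=1), sample_variance)

# The standard deviation is the square root of the variance
sample_std = math.sqrt(sample_variance)
assert math.isclose(pd.Series(values).std(), sample_std)

print(sample_variance, population_variance, sample_std)
```

For a dataset of 500 rows the two conventions give nearly identical results, but the gap is noticeable on small samples like this one.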

**Mode:** Most common value in the dataset

In [12]:
grades.assignment1.mode()

0    65
dtype: int64

Our mode in this case is 65, the value that appears most often. `mode()` returns a Series rather than a single number because several values can tie for most common. One way to get the distribution of the counts of our data is to use `df.column_name.value_counts()`.

In [13]:
print(grades.assignment1.value_counts())

65    11
84    10
29    10
5     10
55     9
      ..
13     1
93     1
44     1
21     1
52     1
Name: assignment1, Length: 100, dtype: int64


Execute the following line if you want more lines to be displayed:

> pd.options.display.max_rows = 4000
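A dataset can have more than one mode when values tie for the highest count. The following sketch uses a hypothetical list of grades and only the standard library (`collections.Counter` and `statistics.multimode`) to count values and list every mode.

```python
from collections import Counter
from statistics import multimode

scores = [65, 70, 65, 80, 70, 90]  # hypothetical grades; 65 and 70 each appear twice

# Count how often each value occurs, most frequent first
counts = Counter(scores)
print(counts.most_common())

# multimode returns every value that shares the highest count
modes = multimode(scores)
print(modes)  # [65, 70]
```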


**Median:** Value that indicates the "middle" of a dataset
* This can be found by ordering the dataset from smallest to largest and finding the middle value 
* If the length of a dataset is an even number, we take the average of the two middle numbers

In [14]:
grades.assignment1.median()

53.0
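The two cases described above (odd and even length) can be sketched as a small function. `median_by_hand` is a hypothetical helper written for illustration, and it assumes the input is not empty.

```python
from statistics import median


def median_by_hand(values):
    """Return the middle value of the sorted data (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd length: a single middle value exists
        return ordered[middle]
    # Even length: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_scores = [82, 7, 37]       # sorted: 7, 37, 82
even_scores = [82, 7, 37, 8]   # sorted: 7, 8, 37, 82

print(median_by_hand(odd_scores))   # 37
print(median_by_hand(even_scores))  # 22.5

# Both results agree with the standard library
assert median_by_hand(odd_scores) == median(odd_scores)
assert median_by_hand(even_scores) == median(even_scores)
```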

**Percentiles:** A percentile is a number below which a certain percentage of your values fall.
* 25th Percentile: Call this x; then about 25% of your values are less than x

We typically look at the 25th, 50th and 75th percentile values. The median is also known as the 50th percentile.

In [15]:
# 25th percentile
np.percentile(grades.assignment1, 25)

27.0

In [16]:
# 50th percentile
np.percentile(grades.assignment1, 50)

53.0

In [17]:
# 75th percentile
np.percentile(grades.assignment1, 75)

74.0
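pandas offers an equivalent through `Series.quantile`. One difference to keep in mind is the scale of the argument: `np.percentile` takes percentages from 0 to 100, while `Series.quantile` takes fractions from 0 to 1. The sketch below compares the two on a small, hypothetical dataset; both interpolate linearly between data points by default.

```python
import numpy as np
import pandas as pd

scores = [10, 20, 30, 40, 50]  # hypothetical grades, already sorted for readability

# np.percentile expects percentages on a 0-100 scale
q25, q50, q75 = np.percentile(scores, [25, 50, 75])

# Series.quantile expects fractions on a 0-1 scale
series_quartiles = pd.Series(scores).quantile([0.25, 0.5, 0.75])

print(q25, q50, q75)              # 20.0 30.0 40.0
print(series_quartiles.tolist())  # [20.0, 30.0, 40.0]
```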

**Summary Statistics:** This is a compilation of statistics that provides some insight into the data. The five-number summary consists of:
* Minimum
* 25th Percentile
* 50th Percentile (Median)
* 75th Percentile
* Maximum

In [18]:
# Summary statistics
grades.describe()

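|       | student_id | assignment1 | assignment2 | assignment3 |
|-------|------------|-------------|-------------|-------------|
| count | 500.0      | 500.0       | 500.0       | 500.0       |
| mean  | 54967040.0 | 50.206      | 49.518      | 49.96       |
| std   | 25564530.0 | 28.344848   | 28.627132   | 28.783805   |
| min   | 10254740.0 | 0.0         | 0.0         | 0.0         |
| 25%   | 32817050.0 | 27.0        | 26.0        | 26.0        |
| 50%   | 54392470.0 | 53.0        | 48.0        | 49.0        |
| 75%   | 75040990.0 | 74.0        | 75.0        | 76.0        |
| max   | 99985420.0 | 99.0        | 99.0        | 99.0        |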


You can see that these values match the percentiles we calculated above. Lastly, we are interested in understanding the range of our data.

**Range:** The difference between the maximum and minimum of your data

**Interquartile Range:** The difference between your 75th and 25th percentiles.

In [19]:
# Range
grades.assignment1.max() - grades.assignment1.min()

99

In [20]:
# IQR
np.percentile(grades.assignment1, 75) - np.percentile(grades.assignment1, 25)

47.0
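The five-number summary, the range, and the interquartile range can also be read straight from the output of `describe()`, which is indexed by labels such as `"min"` and `"25%"`. This is a minimal sketch on a hypothetical Series, not the grades dataset.

```python
import pandas as pd

scores = pd.Series([10, 20, 30, 40, 50])  # hypothetical grades

# describe() returns a Series indexed by statistic name
summary = scores.describe()

# Pick out the five-number summary by label
five_number = summary[["min", "25%", "50%", "75%", "max"]]

data_range = summary["max"] - summary["min"]
iqr = summary["75%"] - summary["25%"]

print(five_number.tolist())  # [10.0, 20.0, 30.0, 40.0, 50.0]
print(data_range, iqr)       # 40.0 20.0
```

Calling `describe()` on a whole DataFrame returns one such column per numeric column, so the same labels can be used with `.loc`.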

This should give you an introduction to how to create DataFrames (possibly from random numbers) and evaluate basic statistics.