# Basic Statistics

- Population: The entire group of individuals or objects to be studied
    - Parameter: Numerical summary of a population
- Sample: Subset of the population that is actually studied
    - Statistic: Numerical summary of a sample
- Individual: Person or object that is a part of the population being studied
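
The difference between a parameter and a statistic can be sketched in R. This is a minimal illustration on simulated data: the population of 1,000 ages, the seed, and the variable names are all assumptions made for the example, not data from these notes.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

# Hypothetical population: ages of 1,000 individuals (simulated, not real data)
population_age <- round(rnorm(1000, mean = 40, sd = 12))

# Parameter: numerical summary of the whole population
population_mean <- mean(population_age)

# Sample: a subset of 50 individuals drawn without replacement
sample_age <- sample(population_age, 50)

# Statistic: the same summary computed on the sample only
sample_mean <- mean(sample_age)

population_mean
sample_mean
```

The two means are usually close but not equal; the statistic is an estimate of the parameter.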

# Variables

- Qualitative or Categorical variables: 
    - Allow for classification of individuals based on some attribute
    - Arithmetic operations on these values are not meaningful
    - Example: Gender
- Quantitative variables: 
    - Provide numerical measure of individuals
    - Example: Age

### Qualitative or Categorical variables

- Dichotomous: Only 2 values (Present/Absent)
- Nominal: Unordered (Blood type: A, B, AB, O)
- Ordinal: Ordered (Rating: high to low)

### Quantitative or Numerical variables

- Discrete: Only certain values, with gaps between them (Days sick per year)
- Continuous: Any value within a range, with no gaps (Blood glucose)

# Categorical Variables in R with Factors

### Nominal Categorical Factors:

In [1]:
diabetes <- c("Type1", "Type2", "Type1", "Type1")
diabetes

In [2]:
class(diabetes)

#### Create Factors of diabetes

In [3]:
diabetes <- factor(diabetes)
diabetes

In [4]:
class(diabetes)
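
Once a vector is a factor, a few base R functions help inspect it. The following is a small sketch using the same `diabetes` values as above; it only uses base R calls.

```r
diabetes <- factor(c("Type1", "Type2", "Type1", "Type1"))

levels(diabetes)      # distinct categories, sorted alphabetically: "Type1" "Type2"
nlevels(diabetes)     # number of categories: 2
as.integer(diabetes)  # integer codes stored for each element: 1 2 1 1
table(diabetes)       # frequency per category: Type1 3, Type2 1
```

A factor stores each value as an integer code that points into the `levels` vector, which is why `as.integer()` returns `1 2 1 1` here.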

### Ordinal Categorical Factors: default level order is alphabetical

In [5]:
status <- c("Poor", "Improved", "Excellent", "Poor")
status

In [6]:
class(status)

#### Create Factor of status

In [7]:
status <- factor(status, ordered=TRUE)
print(status)

[1] Poor      Improved  Excellent Poor     
Levels: Excellent < Improved < Poor


In [8]:
class(status) #ordered factor

#### The level order can be set explicitly with levels

In [9]:
status <- factor(status, ordered=TRUE, levels=c("Poor", "Improved", "Excellent"))
print(status)

[1] Poor      Improved  Excellent Poor     
Levels: Poor < Improved < Excellent
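
With the levels in a meaningful order, comparisons on the factor become meaningful too. A short sketch, rebuilding `status` from scratch so it runs on its own:

```r
status <- factor(c("Poor", "Improved", "Excellent", "Poor"),
                 levels = c("Poor", "Improved", "Excellent"),
                 ordered = TRUE)

status < "Excellent"  # TRUE TRUE FALSE TRUE
max(status)           # Excellent
sort(status)          # Poor Poor Improved Excellent
```

The same comparison on an unordered factor returns `NA` with a warning, because R has no ordering to compare against.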


## Factors with Data Frames

In [10]:
patientID <- c(1,2,3,4)
age <- c(25,34,28,52)

In [11]:
diabetes <- c("Type1", "Type2", "Type1", "Type1")
diabetes <- factor(diabetes)
print(diabetes)

[1] Type1 Type2 Type1 Type1
Levels: Type1 Type2


In [12]:
status <- c("Poor","Improved","Excellent","Poor")
status <- factor(status, ordered=TRUE, levels=c("Poor", "Improved", "Excellent"))
print(status)

[1] Poor      Improved  Excellent Poor     
Levels: Poor < Improved < Excellent


#### Create Data Frame with given data

In [13]:
patientData <- data.frame(patientID, age, diabetes, status)
patientData

| patientID | age | diabetes | status    |
|-----------|-----|----------|-----------|
| 1         | 25  | Type1    | Poor      |
| 2         | 34  | Type2    | Improved  |
| 3         | 28  | Type1    | Excellent |
| 4         | 52  | Type1    | Poor      |


#### Info about the patientData Data Frame:

In [14]:
str(patientData)

'data.frame':	4 obs. of  4 variables:
 $ patientID: num  1 2 3 4
 $ age      : num  25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status   : Ord.factor w/ 3 levels "Poor"<"Improved"<..: 1 2 3 1


# Sampling 

- If the results of the sample are not representative of the population, the sample is biased
- Three types of bias:
    - Sampling bias
        - The technique used to select individuals for the sample tends to favor one part of the population
    - Non-response bias
        - Individuals selected for the sample do not respond
    - Response bias
        - Answers on the survey do not reflect the true feelings of the respondent
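
Sampling bias can be illustrated with a small simulation. This is a sketch under stated assumptions: the population of ages is simulated, the seed is arbitrary, and the weighting scheme (individuals under 40 are five times as likely to be picked) is a hypothetical selection mechanism chosen only to make the effect visible. It uses `sample()`, which is covered in the next section.

```r
set.seed(42)  # assumed seed for reproducibility

# Hypothetical population: 10,000 ages between 18 and 90 (simulated)
age <- round(runif(10000, min = 18, max = 90))

# Unbiased: simple random sample, every individual equally likely
srs <- sample(age, 200)

# Biased: selection weights favor individuals under 40 (hypothetical mechanism)
w <- ifelse(age < 40, 5, 1)
biased <- sample(age, 200, prob = w)

mean(age)     # population mean (parameter)
mean(srs)     # close to the population mean
mean(biased)  # pulled toward younger ages
```

The biased sample underestimates the mean age no matter how large it is, because the error comes from the selection technique rather than from the sample size.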

### Sampling in R

In [15]:
set.seed(0) # Fix the seed so the same random numbers are generated on every run; seed with the time of day to get different numbers each time

In [16]:
x <- 1:10
x

#### Sample 5 random elements from x

In [17]:
sample(x, 5)

In [18]:
# Equivalent call; the values drawn differ because the random state has advanced
sample(1:10, 5)

In [19]:
# Equivalent call: sample(10, 5) samples from 1:10
sample(10, 5)

#### Sampling with replacement (replace=TRUE): the same number can be drawn more than once

In [20]:
sample(10, 5, replace=TRUE)
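
The effect of the seed and of `replace` can be checked directly. A minimal sketch using only base R; the names `first`, `second`, and `many` are introduced for this example.

```r
set.seed(0)
first <- sample(10, 5)

set.seed(0)               # resetting the seed replays the same draw
second <- sample(10, 5)
identical(first, second)  # TRUE

# Without replacement, every drawn value is distinct
anyDuplicated(first)      # 0

# With replacement, the sample can be larger than the population
many <- sample(3, 10, replace = TRUE)
length(many)              # 10

# Without replacement this would be an error, since only 3 values are available:
# sample(3, 10)
```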