# Statistics Fundamentals


In this series of courses on probability and statistics, we'll learn powerful statistical techniques and metrics like standard deviation, z-scores, confidence intervals, probability estimation, and hypothesis testing (including A/B testing).

In this first course, we begin by discussing how to get data for analysis, and continue by examining how data is structured and measured. We'll then learn techniques for organizing and visualizing relatively large amounts of data, which makes finding patterns considerably easier.

Below is a diagram describing the workflow we'll be focusing on throughout this first course.

<Img src="https://github.com/rhnyewale/DataQuest/blob/master/Step5-Statistics%20Fundamental/Images/Intro.jpg?raw=true">

Our focus will be on the details of getting data for analysis. As usual, we'll work with a real-world data set. Before we dive into the technical details and start working with the data, we begin by getting a sense of what statistics is.

At this stage in our learning journey, a one-sentence definition of statistics would probably sound dull and be difficult to grasp. We'll avoid defining statistics that way and instead discuss what sort of problems can be solved with statistics. Understanding what challenges we can overcome using statistics should give us a good sense of what statistics is.

In statistics, the set of all individuals relevant to a particular statistical question is called a population. For our analyst's question, all the people inside the company were relevant. So the population in this case consisted of all the people in the company.

A smaller group selected from a population is called a sample. The process of selecting a smaller group from a population is called sampling. In our example, the data analyst took a sample of approximately 100 people from a population of over 50,000 people.

<Img src="https://github.com/rhnyewale/DataQuest/blob/master/Step5-Statistics%20Fundamental/Images/Sampling.jpg?raw=true">
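As a minimal sketch of the relationship between a population and a sample, the snippet below draws 100 individuals from a made-up population of 50,000 employee IDs. The exact sizes, the IDs, and the seed are assumptions chosen for illustration; they are not the analyst's actual data.

```python
import random

# Hypothetical population: one ID per employee in a company of 50,000 people.
population = list(range(50_000))

# Draw a sample of 100 distinct employees (sampling without replacement).
rng = random.Random(0)  # fixed seed so the sketch is reproducible
sample = rng.sample(population, k=100)

print(len(population))  # size of the population
print(len(sample))      # size of the sample
```

Every element of the sample also belongs to the population, which is what makes the sample a subset we can study in place of the whole group.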
    
Whether a set of data is a sample or a population depends on the question we're trying to answer. For our analyst's question, the population consisted of all the company members. But if we change the question, the same group of individuals can become a sample.

For instance, if we tried to find out whether people at international companies are satisfied at work, then our group of over 50,000 employees would become a sample. There are a lot of international companies out there, and ours is just one of them. The population (the set of all individuals relevant to this question) is made up of all the people working in all the international companies.
    
<Img src="https://github.com/rhnyewale/DataQuest/blob/master/Step5-Statistics%20Fundamental/Images/Sampling2.jpg?raw=true">
    
Populations do not necessarily consist of people. Behavioral scientists, for instance, often try to answer questions about populations of monkeys, rats or other lab animals. In a similar way, other people try to answer questions about countries, companies, vegetables, soils, pieces of equipment produced in a factory, etc.
    
The individual elements of a population or a sample go by many names. You'll often see the elements of a population referred to as individuals, units, events, or observations. These terms are used interchangeably and refer to the same thing: the individual parts of a population. When we use the term "population individuals", the population is not necessarily composed of people. "Individuals" here is a general term that could refer to people, needles, frogs, stars, etc.

In the case of a sample, you'll often see this terminology used interchangeably: sample unit, sample point, sample individual, or sample observation.
    


For every statistical question we want to answer, we should try to use the population. In practice, that's not always possible because the populations of interest usually vary from large to extremely large. Also, getting data is generally not an easy task, so small populations often pose problems too.

These problems can be solved by sampling from the population that interests us. Although not as good as working with the entire population, working with a sample is the next best thing we can do.

When we sample, the data we get might be more or less similar to the data in the population. For instance, let's say we know that the average salary in our company is $34,500, and the proportion of women is 60%. We take two samples and find these results:

<Img src = "https://github.com/rhnyewale/DataQuest/blob/master/Step5-Statistics%20Fundamental/Images/Sampling3.jpg?raw=true">


As you can see, the metrics of the two samples are different from the metrics of the population. A sample is by definition an incomplete set of data for the question we're trying to answer. For this reason, there's almost always some difference between the metrics of a population and the metrics of a sample. This difference can be seen as an error, and because it's the result of sampling, it's called sampling error.

A metric specific to a population is called a parameter, while one specific to a sample is called a statistic. In our example above, the average salary of all the employees is a parameter because it's a metric that describes the entire population. The average salaries from our two samples are examples of statistics because they only describe the samples.

Another way to think of the concept of the sampling error is as the difference between a parameter and a statistic:

*Sampling error = Parameter - Statistic*
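Before turning to real data, the formula can be tried on made-up numbers. The sketch below builds a synthetic population of salaries centered on $34,500; the distribution, its spread, the sample size, and the seeds are all assumptions for illustration and do not come from the company example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic population of 50,000 salaries (made-up values, not real company data).
population = pd.Series(rng.normal(loc=34_500, scale=5_000, size=50_000))

# A parameter describes the entire population.
parameter = population.mean()

# A statistic describes only the sample it was computed from.
sample = population.sample(100, random_state=1)
statistic = sample.mean()

# Sampling error = Parameter - Statistic
sampling_error = parameter - statistic
print(round(sampling_error, 2))
```

The error can be positive or negative here, depending on whether the sample mean falls below or above the population mean.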

In [1]:
import pandas as pd 

In [3]:
wnba = pd.read_csv('Sampling_NBA Player Stats/WNBA Stats.csv')
wnba.head()

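| Unnamed: 0 | Name | Team | Pos | Height | Weight | BMI | Birth_Place | Birthdate | Age | College | ... | OREB | DREB | REB | AST | STL | BLK | TO | PTS | DD2 | TD3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Aerial Powers | DAL | F | 183 | 71.0 | 21.200991 | US | January 17, 1994 | 23 | Michigan State | ... | 6 | 22 | 28 | 12 | 3 | 6 | 12 | 93 | 0 | 0 |
| 1 | Alana Beard | LA | G/F | 185 | 73.0 | 21.329438 | US | May 14, 1982 | 35 | Duke | ... | 19 | 82 | 101 | 72 | 63 | 13 | 40 | 217 | 0 | 0 |
| 2 | Alex Bentley | CON | G | 170 | 69.0 | 23.875433 | US | October 27, 1990 | 26 | Penn State | ... | 4 | 36 | 40 | 78 | 22 | 3 | 24 | 218 | 0 | 0 |
| 3 | Alex Montgomery | SAN | G/F | 185 | 84.0 | 24.543462 | US | December 11, 1988 | 28 | Georgia Tech | ... | 35 | 134 | 169 | 65 | 20 | 10 | 38 | 188 | 2 | 0 |
| 4 | Alexis Jones | MIN | G | 175 | 78.0 | 25.469388 | US | August 5, 1994 | 23 | Baylor | ... | 3 | 9 | 12 | 12 | 7 | 0 | 14 | 50 | 0 | 0 |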


In [4]:
wnba.shape

(143, 32)

In [5]:
parameter = wnba['Games Played'].max()  # population parameter: the maximum number of games played by any player

In [6]:
parameter

32

Using the Series.sample() method, we randomly sample 30 players from the population and assign the result to a variable named sample. The maximum number of games played in this sample is our statistic.

In [7]:
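sample = wnba['Games Played'].sample(30, random_state=1)  # 30 values drawn at random; fixed seed for reproducibility
statistic = sample.max()  # sample statistic: the maximum number of games played within the sample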

In [8]:
sampling_error = parameter - statistic
print(sampling_error)

2
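The sampling error measured above belongs to one particular sample; a different random sample would generally give a different error. The sketch below illustrates this with a synthetic stand-in for the Games Played column, so it runs without the CSV file. The generated values and the seeds are assumptions, and the printed errors will not match the WNBA result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for a 'Games Played' column: 143 integers between 2 and 32.
games_played = pd.Series(rng.integers(2, 33, size=143))
parameter = games_played.max()

# Repeat the sampling with different seeds and record each sampling error.
errors = []
for seed in range(5):
    statistic = games_played.sample(30, random_state=seed).max()
    errors.append(int(parameter - statistic))

print(errors)
```

Because a sample's maximum can never exceed the population's maximum, the sampling error for this particular metric is never negative.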


When we sample we want to minimize the sampling error as much as possible. We want our sample to mirror the population as closely as possible.

If we sampled to measure the mean height of adults in the US, we'd like our sample statistic (sample mean height) to get as close as possible to the population's parameter (population mean height). For this to happen, we need the individuals in our sample to form a group that is similar in structure to the group forming the population.

The US adult population is diverse, made up of people of various heights. If we sampled 100 individuals from various basketball teams, we'd almost certainly get a sample whose structure is significantly different from that of the population. As a consequence, we should expect a large sampling error: a large discrepancy between our sample's statistic (sample mean height) and the population's parameter (population mean height).

In statistical terms, we want our samples to be **representative** of their corresponding populations. If a sample is representative, then the sampling error is low. The more representative a sample is, the smaller the sampling error. The less representative a sample is, the greater the sampling error.
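The link between representativeness and sampling error can be illustrated with a small simulation. The sketch below uses a made-up population of heights and, as a crude stand-in for sampling only from basketball teams, takes the 100 tallest individuals as the unrepresentative sample. The distribution, sizes, and seed are assumptions for illustration, not measurements of the US population.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population: 100,000 adult heights in cm (made-up distribution).
heights = rng.normal(loc=170, scale=10, size=100_000)
parameter = heights.mean()

# Unrepresentative sample: the 100 tallest individuals, a crude stand-in
# for sampling only from basketball teams.
biased_sample = np.sort(heights)[-100:]

# Random sample: every individual has the same chance of being selected.
random_sample = rng.choice(heights, size=100, replace=False)

biased_error = parameter - biased_sample.mean()
random_error = parameter - random_sample.mean()

# Compare the magnitudes of the two sampling errors.
print(abs(biased_error) > abs(random_error))
```

In this setup the unrepresentative sample misses the population mean by a wide margin, while the random sample lands much closer to it.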

<Img src="https://github.com/rhnyewale/DataQuest/blob/master/Step5-Statistics%20Fundamental/Images/Sampling4.jpg?raw=true">
    
To make our samples representative, we can try to give every individual in the population an equal chance of being selected for our samples. We want a very tall individual to have the same chance of being selected as an individual of medium or short height. To give every individual an equal chance of being picked, we need to sample **randomly**.

One way to perform random sampling is to generate random numbers and use them to select a few sample units from the population. In statistics, this sampling method is called **simple random sampling**, and it's often abbreviated as **SRS**.
    
<Img src="https://github.com/rhnyewale/DataQuest/blob/master/Step5-Statistics%20Fundamental/Images/population.jpg?raw=true">
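Simple random sampling can be sketched with Python's standard library: generate random positions, then use them to pick units from the population. The ten labelled units and the seed below are hypothetical and serve only to show the mechanics.

```python
import random

# Hypothetical population of ten labelled units.
population = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]

rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Generate 3 distinct random positions; each unit is equally likely to be picked.
positions = rng.sample(range(len(population)), k=3)

# Use the random positions to select the sample units.
srs = [population[i] for i in positions]
print(srs)
```

With pandas, Series.sample() and DataFrame.sample() perform the same kind of selection directly on a column or a table.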

    
