

<center><h1>Normal Distribution & Central Limit Theorem</h1></center>



It's understandable that the Central Limit Theorem (CLT) can seem a bit confusing.  The goal of this notebook is to demystify the CLT by having you write an algorithm that uses sampling to build an approximately normal distribution of sample means from a non-normally distributed data set. 
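
For reference, one common formal statement of the theorem is sketched below. This is the classical version, which assumes independent, identically distributed draws $X_1, \dots, X_n$ with mean $\mu$ and finite standard deviation $\sigma$; these symbols are introduced here for illustration and are not used elsewhere in the notebook.

$$
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{d} \mathcal{N}(0, 1) \quad \text{as } n \to \infty
$$

In words: for a large enough sample size $n$, the sample mean is approximately normally distributed with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$, whatever the shape of the original distribution.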

In this notebook you will:

1. Run code to generate a non-normal data set.  
1. Create a function to randomly sample subsets of data.
1. Create a data set of the means of each sample.
1. Visualize the distribution of the means of each sample.  


<center><h3>Creating our Dummy Data</h3></center>

We're going to use numpy to create a non-normal distribution.  The easiest way to do this is just to create a uniform distribution!  

**TASKS:** Run the code below to import numpy and set a random seed, and then use numpy to create a uniform distribution with integer values between 0 and 100.

(Hint: For integer values, random.uniform is not our best choice since it generates floats.  Which numpy method should you use to generate a uniform distribution of random integers?)
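
To make the hint concrete, here is a small sketch contrasting the two methods. It uses a separate `np.random.RandomState` object with an arbitrary seed (an assumption made for this demo) so that it does not disturb the global seed the notebook sets; the names `floats` and `ints` are illustrative only.

```python
import numpy as np

# Local generator with an arbitrary seed, so the notebook's global
# np.random.seed state is left untouched.
rs = np.random.RandomState(0)

# uniform draws continuous floats from the half-open interval [0, 100).
floats = rs.uniform(low=0, high=100, size=5)

# randint draws whole numbers; `high` is exclusive, so 101 is needed
# for the value 100 to be possible.
ints = rs.randint(low=0, high=101, size=5)

print(floats.dtype, floats)
print(ints.dtype, ints)
```

Note the exclusive upper bound: `randint(low=0, high=100)` would never produce 100.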

In [1]:
# Run this cell to import the packages you'll need and set a seed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Please don't change this; otherwise, you'll get different results from everyone else!
np.random.seed(1547)

In [27]:
# Create a uniform distribution of 10000 integers between 0 and 100.
# np.random.randint excludes `high`, so use 101 to include the value 100.
non_normal_data = np.random.randint(low=0, high=101, size=10000)

# Visualize the distribution of our dummy data set.
plt.hist(non_normal_data)

<center><h3>Creating a Sampling Function</h3></center>

Now that we have created our data set, we'll need to sample from it.  To do this, you'll create two functions: `get_sample`, which draws a random sample of size `n` and returns its mean, and `create_sample_distribution`, which uses that helper to build a distribution of `size` sample means. 

Your `get_sample` function should:

1.  Take a keyword argument for sample size (called 'n' for short)
1.  Randomly grab 'n' samples from the uniform distribution with replacement (any samples selected should NOT be removed from the original data set).
1.  Calculate the mean of the sub-sample and return it.
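
The difference between sampling with and without replacement can be sketched as follows. This is a toy illustration rather than the exercise solution: `toy_data` is a hypothetical stand-in dataset, and the local `RandomState` with an arbitrary seed is an assumption made so the demo does not touch the notebook's global seed.

```python
import numpy as np

rs = np.random.RandomState(0)  # local RNG; seed chosen arbitrarily for the demo
toy_data = np.arange(5)        # hypothetical stand-in dataset: 0, 1, 2, 3, 4

# replace=True: each draw is independent, so values may repeat and the
# sample can even be larger than the dataset itself.
with_replacement = rs.choice(toy_data, size=8, replace=True)

# replace=False: each element can be picked at most once, so the sample
# size cannot exceed the dataset size.
without_replacement = rs.choice(toy_data, size=5, replace=False)

print(with_replacement, with_replacement.mean())
print(np.sort(without_replacement))
```

Because the original data set is never modified, every call draws from the same full population.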


Your `create_sample_distribution` function should:

1.  Take a keyword argument for size, which will determine the total size of the sample distribution.
1.  Use the `get_sample` helper function to draw a random sample and calculate its mean.   
1.  Store the sample mean.
1.  Repeat this process until a distribution of `size` sample means exists.  When the data set is complete, return it as a numpy array.  

``` python
def get_sample(dataset, n=30):
    """Grabs a random subsample of size 'n' from dataset.
    Outputs the mean of the subsample."""
    pass

def create_sample_distribution(dataset, size=100):
    """Creates a dataset of subsample means.  The length of the dataset is specified by the 'size' 
    keyword argument. Should return the entire sample distribution as a numpy array.  """
    pass
```



In [41]:
# Complete the two functions below.  
def get_sample(dataset, n=30):
    """Grabs a random subsample of size 'n' from dataset.
    Outputs the mean of the subsample."""
    
    # Sample the values themselves (not just their indices), with replacement.
    rand_subsample = np.random.choice(dataset, size=n, replace=True)
    
    return rand_subsample.mean()

def create_sample_distribution(dataset, size=100):
    """Creates a dataset of subsample means.  The length of the dataset is specified by the 'size' 
    keyword argument. Should return the entire sample distribution as a numpy array.  """
    
    # Collect the means in a list; numpy arrays have no append method.
    subsample_means = []
    
    for _ in range(size):
        subsample_means.append(get_sample(dataset))
    
    # Convert to a numpy array once all of the means have been gathered.
    return np.array(subsample_means)

In [44]:
print(create_sample_distribution(non_normal_data))


<center><h3>Visualizing our Sample Distribution</h3></center>

Now that we have created our sample distribution, let's visualize it to determine if it's a normal distribution.  

**TASK:** Use matplotlib to visualize our sample distribution.
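
One possible way to approach this is sketched below. To keep the block self-contained it regenerates its own data with a local `RandomState` (arbitrary seed) and builds the sample means in a single vectorized step instead of calling the notebook's functions; the names `demo_data` and `demo_means`, the 1000 samples, and the 30 bins are all assumptions chosen for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rs = np.random.RandomState(0)                        # local RNG, arbitrary seed
demo_data = rs.randint(low=0, high=101, size=10000)  # uniform integers in [0, 100]

# Draw 1000 samples of size 30 with replacement (one sample per row),
# then take the mean of each row.
demo_means = rs.choice(demo_data, size=(1000, 30), replace=True).mean(axis=1)

# Histogram of the sample means; it should look roughly bell-shaped
# and be centered near the population mean of about 50.
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(demo_means, bins=30, edgecolor="black")
ax.set_xlabel("Sample mean (n = 30)")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of sample means")
```

In a notebook with `%matplotlib inline`, the figure renders when the cell finishes. Comparing it against a histogram of the raw data makes the contrast between the flat population and the bell-shaped sample means easy to see.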

In [None]:
# Visualize our sample distribution below.
# Remember, we aliased matplotlib.pyplot as plt!


<center><h3>Great Job!</h3></center>

Now that you've seen the Central Limit Theorem in action, you can treat the distribution of sample means from a non-normally distributed dataset as approximately normal.  You can now compute z-scores and probabilities for sample means drawn from these datasets!  
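
A minimal sketch of such a z-score calculation for a sample mean follows. It assumes the population is uniform over the integers 0 to 100 and that samples have size 30; the observed mean of 60 is a hypothetical value picked for the example, and the normal approximation itself is only as good as the CLT is at this sample size.

```python
import math
from statistics import NormalDist

# Assumed population: uniform over the integers 0..100 (101 equally likely values).
k = 101
mu = 50.0                              # population mean
sigma = math.sqrt((k**2 - 1) / 12)     # std of a discrete uniform on k consecutive integers

n = 30                                 # sample size
standard_error = sigma / math.sqrt(n)  # std of the sample mean under the CLT

observed_mean = 60.0                   # hypothetical sample mean to evaluate
z = (observed_mean - mu) / standard_error

# Approximate probability that a sample mean is at least this large.
p_at_least = 1 - NormalDist().cdf(z)
print(f"z = {z:.2f}, P(mean >= {observed_mean}) ~ {p_at_least:.4f}")
```

The key step is dividing by the standard error rather than by the population standard deviation, since the quantity being standardized is a mean of 30 values, not a single observation.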