## Import data and take a look at it

In [1]:
# Import gen_data function
from data_gen import gen_data

# Get the data by calling the gen_data function
data1, data2 = gen_data()

# Print 10 entries from data1 and data2
for i in range(10):
    print("Entry", i + 1, "from data1:", data1[i])
    print("Entry", i + 1, "from data2:", data2[i])

Entry 1 from data1: 17
Entry 1 from data2: 39
Entry 2 from data1: 24
Entry 2 from data2: 4
Entry 3 from data1: 90
Entry 3 from data2: 4
Entry 4 from data1: 24
Entry 4 from data2: 7
Entry 5 from data1: 67
Entry 5 from data2: 8
Entry 6 from data1: 46
Entry 6 from data2: 23
Entry 7 from data1: 69
Entry 7 from data2: 4
Entry 8 from data1: 74
Entry 8 from data2: 4
Entry 9 from data1: 72
Entry 9 from data2: 6
Entry 10 from data1: 40
Entry 10 from data2: 2


## Standardize the data:
1. Calculate the mean of the data, $\mu = \frac{\Sigma(x_i)}{n}$

    $Mean= \frac{Sum\ of\ all\ the\ values}{Total\ number\ of\ values}$


2. Calculate the standard deviation of the data, $\sigma = (\frac{\Sigma(x_i^2)}{n} - \mu^2)^{1/2}$.

    $Standard\ deviation = (\frac{Sum\ of\ squared\ values}{Total\ number\ of\ values} - mean^2)^\frac{1}{2}$

3. For each element perform the following:

    $z_i = \frac{x_i - \mu}{\sigma}$
    
    Step 1: Subtract the mean from the value
    
    Step 2: Divide the resulting value from step 1 by standard deviation
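
The three steps above can be traced by hand on a tiny sample. The sketch below uses a made-up list (not the output of `gen_data`) chosen so that every intermediate value is exact, and it also checks that the "mean of squares minus squared mean" formula agrees with the more familiar "mean squared deviation" form.

```python
# Small made-up sample (an assumption for illustration, not from gen_data)
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

# Step 1 of the list: the mean
mu = sum(values) / n

# Step 2 of the list: sqrt(mean of squared values - squared mean)
mean_of_squares = sum(v ** 2 for v in values) / n
sigma = (mean_of_squares - mu ** 2) ** 0.5

# Equivalent form: sqrt of the mean squared deviation from the mean
sigma_alt = (sum((v - mu) ** 2 for v in values) / n) ** 0.5

# Step 3 of the list: subtract the mean, then divide by the standard deviation
z = [(v - mu) / sigma for v in values]

print(mu, sigma, sigma_alt)  # 5.0 2.0 2.0
print(z)                     # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Both forms of the standard deviation divide by $n$, so they give the population standard deviation rather than the sample one.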

In [2]:
### edTest(test_std) ###

# Create a list with the squared values of the elements of data1
data_sq1 = []
for item in data1:
    data_sq1.append(item**2)


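# Calculate mean and standard deviation. The mean squared deviation used here is
# algebraically equal to the formula in the markdown cell above,
# i.e. (sum(data_sq1) / len(data1) - mean1 ** 2) ** 0.5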
mean1 = sum(data1) / len(data1)
sum_sq_diff = sum((item - mean1) ** 2 for item in data1)
std1 = (sum_sq_diff / len(data1)) ** 0.5

# Standardize the data using a loop and display 10 elements
std_data = []
for item in data1:
    standardized_item = (item - mean1) / std1
    std_data.append(standardized_item)

print(std_data[:10])

[-1.4468290609114096, -1.1804691359612514, 1.3309244421402406, -1.1804691359612514, 0.45574183158972065, -0.3433379432607541, 0.5318446672897659, 0.7221017565398788, 0.6459989208398337, -0.5716464503608897]


### Similarly standardize data2

In [3]:
# Repeat the same process above but this time for the `data2` list
data_sq2 = []
for i in data2:
    data_sq2.append(i*i)

# Calculate mean and standard deviation
mean2 = sum(data2)/len(data2)
std2 = (sum((x - mean2) ** 2 for x in data2) / len(data2)) ** 0.5

# Standardize data and print 10 elements
std_data2 = []

for x in data2:
    standardized_value = (x - mean2) / std2
    std_data2.append(standardized_value)

print("Standardized Data (First 10 elements):", std_data2[:10])


Standardized Data (First 10 elements): [0.8833328743076009, -0.7988397466601356, -0.7988397466601356, -0.6546535220057582, -0.6065914471209657, 0.11433967615092136, -0.7988397466601356, -0.7988397466601356, -0.7027155968905506, -0.8949638964297205]


### ⏸ If you had 1000 such data sets, what would be the most efficient way of standardizing them all?

#### A. Copy-paste the code for each dataset.
#### B. Call the TA and ask him/her to do it.
#### C. Write a function to standardize the data.

In [4]:
### edTest(test_chow1) ###
# Submit an answer choice as a string below (e.g., if you choose option A, put 'A')


answer = 'C'

## Writing a Function
Manually copy-pasting code to process every dataset would be tedious. It would also hurt readability, which increases the chance of small errors.

This is why we will declare a function to do the job for us. Every time we wish to standardize data, all we have to do is call the function.

In [6]:
### edTest(test_func) ###
# Define a function which calculates mean and std of input data, and returns standardized data
def standardize(data):
    # Calculate mean
    mean = sum(data) / len(data)

    # Calculate standard deviation
    variance = sum((x - mean) ** 2 for x in data) / len(data)
    std = variance ** 0.5

    # Standardize the data and store it in a new list
    standardized_data = [(x - mean) / std for x in data]

    return standardized_data
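
One way to sanity-check a function like this is to compare it against Python's built-in `statistics` module. The sketch below assumes the same divide-by-$n$ definition used in this notebook, which corresponds to `statistics.pstdev` (population standard deviation) and not `statistics.stdev` (which divides by $n - 1$). The sample is just the first ten `data1` values printed earlier, so the z-scores will not match the notebook output, which uses the mean and std of the full list.

```python
import statistics

def standardize(data):
    # Same logic as the notebook's function, repeated so this block runs on its own
    mean = sum(data) / len(data)
    std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
    return [(x - mean) / std for x in data]

# First ten data1 values shown in the first output cell
sample = [17, 24, 90, 24, 67, 46, 69, 74, 72, 40]

pop_std = statistics.pstdev(sample)   # divides by n
samp_std = statistics.stdev(sample)   # divides by n - 1

# Rebuild the z-scores from the library's mean and population std
mean = statistics.fmean(sample)
expected = [(x - mean) / pop_std for x in sample]
result = standardize(sample)

matches = all(abs(a - b) < 1e-9 for a, b in zip(result, expected))
print("Matches pstdev-based z-scores:", matches)
print("Sample std is larger than population std:", samp_std > pop_std)
```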

In [7]:
# Call the standardize function on data1 and display 10 elements
data1_std = standardize(data1)
print("Data 1 Standardized (First 10 elements):", data1_std[:10])



Data 1 Standardized (First 10 elements): [-1.4468290609114096, -1.1804691359612514, 1.3309244421402406, -1.1804691359612514, 0.45574183158972065, -0.3433379432607541, 0.5318446672897659, 0.7221017565398788, 0.6459989208398337, -0.5716464503608897]


In [8]:
# Call the standardize function on data2 and display 10 elements
data2_std = standardize(data2)
print("Data 2 Standardized (First 10 elements):", data2_std[:10])


Data 2 Standardized (First 10 elements): [0.8833328743076009, -0.7988397466601356, -0.7988397466601356, -0.6546535220057582, -0.6065914471209657, 0.11433967615092136, -0.7988397466601356, -0.7988397466601356, -0.7027155968905506, -0.8949638964297205]
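
With the logic wrapped in a function, scaling up to many datasets is a loop rather than more copy-pasting. A minimal sketch, assuming the datasets are held in a list of lists (the three short lists below are hypothetical stand-ins, not course data):

```python
def standardize(data):
    # Same logic as the notebook's function, repeated so this block runs on its own
    mean = sum(data) / len(data)
    std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
    return [(x - mean) / std for x in data]

# Hypothetical stand-ins for a larger collection of datasets
datasets = [
    [1, 2, 3, 4, 5],
    [10, 20, 30],
    [3, 3, 6, 6],
]

# One call per dataset, no duplicated code
standardized = [standardize(d) for d in datasets]

for i, z in enumerate(standardized, start=1):
    print("Dataset", i, "standardized:", [round(v, 3) for v in z])
```

Each standardized list should have a mean close to 0 and a population standard deviation close to 1, which is a quick way to check the results.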


## De-standardization function
Often in data science, we perform manipulations on the standardized dataset (because it's usually easier) and then convert the result back to the original scale by de-standardizing.
So let's write a function to retrieve the data by de-standardizing.

## Function to de-standardize
You will require the original `mean` and `std` values in order to de-standardize. Perform the following on each element:

$x_i = z_i \cdot \sigma + \mu$
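
As a quick arithmetic check of this formula, the sketch below applies it to a few made-up z-scores with an assumed mean of 5 and standard deviation of 2 (illustrative values only, not those of `data1` or `data2`).

```python
# Assumed values for illustration only
mu = 5.0
sigma = 2.0
z_scores = [-1.5, 0.0, 2.0]

# x_i = z_i * sigma + mu, applied element by element
recovered = [z * sigma + mu for z in z_scores]

print(recovered)  # [2.0, 5.0, 9.0]
```

A z-score of 0 maps back to the mean, and a z-score of 2 maps to a value two standard deviations above it.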

In [9]:
### edTest(test_de) ###
# Write a function which takes data, mean and std as input 
# and returns de-standardized data
# Make sure you use the correct mean and std for 
# data1 and data2 calculated earlier
def destandardize(mean, std, data):
    # Initialize an empty list to store the de-standardized data
    destandardized_data = []

    # De-standardize each value in the data using the formula
    for standardized_value in data:
        destandardized_value = (standardized_value * std) + mean
        destandardized_data.append(destandardized_value)

    return destandardized_data

In [10]:
### edTest(test_de1) ###
# Use mean and std of data1 calculated earlier and destandardize data1_std
data_de1 = destandardize(mean1, std1, data1_std)

# Display the de-standardized data for data1
print("Data 1 De-standardized:", data_de1)



Data 1 De-standardized: [17.0, 24.0, 90.0, 24.0, 67.0, 46.0, 69.0, 74.0, 72.0, 40.0, 65.0, 39.0, 68.0, 81.0, 59.0, 28.0, 60.0, 20.0, 14.0, 62.0, 42.0, 69.0, 42.0, 10.0, 12.0, 84.0, 53.0, 93.0, 36.0, 41.0, 70.0, 70.0, 56.0, 74.0, 96.0, 43.0, 37.0, 70.0, 13.0, 66.0, 75.0, 56.0, 82.0, 80.0, 85.0, 18.0, 41.0, 81.0, 43.0, 48.0, 85.0, 43.0, 97.0, 86.0, 84.0, 60.0, 97.0, 97.0, 69.0, 25.000000000000004, 38.0, 95.0, 15.0, 99.0, 30.0, 95.0, 85.0, 79.0, 16.0, 50.0, 32.0, 71.0, 35.0, 25.999999999999996, 90.0, 48.0, 90.0, 63.0, 44.0, 68.0, 90.0, 18.0, 78.0, 49.0, 49.0, 77.0, 63.0, 36.0, 90.0, 14.0, 18.0, 69.0, 49.0, 38.0, 13.0, 54.0, 38.0, 62.0, 57.0, 17.0, 66.0, 76.0, 36.0, 61.0, 24.0, 39.0, 33.0, 75.0, 25.000000000000004, 61.0, 91.0, 21.0, 44.0, 50.0, 55.0, 92.0, 89.0, 95.0, 70.0, 69.0, 42.0, 50.0, 97.0, 72.0, 45.0, 74.0, 46.0, 65.0, 76.0, 66.0, 14.0, 44.0, 71.0, 45.0, 95.0, 46.0, 31.0, 10.0, 42.0, 97.0, 53.0, 53.0, 75.0, 45.0, 40.0, 97.0, 31.0, 10.0, 87.0, 94.0, 47.0, 30.0, 36.0, 53.0, 50.0, 49.

In [11]:
### edTest(test_de2) ###
# Use mean and std of data2 calculated earlier and destandardize data2_std
data_de2 = destandardize(mean2, std2, data2_std)

# Display the de-standardized data for data2
print("Data 2 De-standardized:", data_de2)


Data 2 De-standardized: [39.0, 4.0, 4.0, 7.0, 8.0, 23.0, 4.0, 4.0, 6.0, 2.0, 94.0, 14.0, 39.0, 13.0, 11.0, 5.0, 36.0, 34.0, 15.0, 41.0, 20.0, 28.0, 2.0, 4.0, 1.0, 5.0, 10.0, 44.0, 3.0, 10.0, 14.0, 29.0, 28.0, 36.0, 21.0, 5.0, 18.0, 5.0, 30.0, 66.0, 6.0, 22.0, 12.0, 8.0, 8.0, 48.0, 5.0, 63.0, 6.0, 41.0, 13.0, 12.0, 10.0, 75.0, 11.0, 20.0, 60.0, 25.0, 7.0, 22.0, 15.0, 15.0, 22.0, 12.0, 22.0, 14.0, 19.0, 7.0, 8.0, 3.0, 8.0, 1.0, 18.0, 21.0, 14.0, 2.0, 30.0, 11.0, 23.0, 17.0, 16.0, 0.0, 12.0, 7.0, 8.0, 22.0, 33.0, 7.0, 11.0, 50.0, 34.0, 8.0, 4.0, 66.0, 25.0, 26.0, 13.0, 7.0, 4.0, 16.0, 30.0, 25.0, 3.0, 14.0, 22.0, 32.0, 4.0, 12.0, 11.0, 23.0, 39.0, 7.0, 21.0, 17.0, 7.0, 8.0, 70.00000000000001, 23.0, 14.0, 41.0, 11.0, 3.0, 21.0, 28.0, 11.0, 5.0, 23.0, 58.0, 2.0, 10.0, 42.0, 47.0, 17.0, 19.0, 17.0, 5.0, 68.0, 3.0, 12.0, 0.0, 9.0, 7.0, 3.0, 15.0, 17.0, 1.0, 12.0, 13.0, 0.0, 20.0, 17.0, 20.0, 14.0, 33.0, 65.0, 14.0, 15.0, 3.0, 22.0, 8.0, 40.0, 1.0, 12.0, 42.0, 24.0, 21.0, 42.0, 6.0, 19.0, 9.0,


### ⏸ By looking at what data is required for destandardizing, do you observe something out of place?

#### A. No, all looks good.
#### B. `mean` and `std` got over-written when copy-pasting code.
#### C. The function to de-standardize requires extra data (`mean`, `std`) that the standardize function did not return.
#### D. B and C.

In [12]:
### edTest(test_chow2) ###
# Submit an answer choice as a string below (e.g., if you choose option A, put 'A')


answer = 'C'
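
One possible way to address the gap highlighted by this question is to have the standardizing function hand back the statistics it computed. The sketch below is only an illustration: `standardize_with_stats` is a hypothetical name, not part of the exercise, while `destandardize` keeps the `(mean, std, data)` argument order used above.

```python
def standardize_with_stats(data):
    """Hypothetical variant: return the z-scores plus the mean and std used."""
    mean = sum(data) / len(data)
    std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
    return [(x - mean) / std for x in data], mean, std

def destandardize(mean, std, data):
    # Same formula as above: x_i = z_i * std + mean
    return [z * std + mean for z in data]

# First ten data1 values shown in the first output cell
original = [17, 24, 90, 24, 67, 46, 69, 74, 72, 40]

# The caller receives everything needed for the round trip from one call
z, mean, std = standardize_with_stats(original)
restored = destandardize(mean, std, z)

# Compare with a tolerance, since floating-point round trips are not always exact
round_trip_ok = all(abs(a - b) < 1e-9 for a, b in zip(original, restored))
print("Round trip recovers the original values:", round_trip_ok)
```

The tolerance matters: the de-standardized outputs above contain values such as `25.000000000000004`, so an exact `==` comparison with the original integers would fail.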