# **Module 6**
In this module, we will use two open-source scientific computing packages for Python: NumPy and SciPy.

* NumPy is used to create n-dimensional arrays, perform matrix operations, and solve linear equations.

* SciPy is built on top of NumPy. It contains a number of sub-packages important for engineering, scientific, and statistical computing.

* We will use SciPy to generate random variables based on a variety of discrete and continuous distributions.

* We will also use SciPy to conduct statistical analysis of data and detect anomalies.

## **1. Numpy Array Operations**

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import scipy 
from scipy import stats

**1. Let us generate a 3x3 NumPy array and then convert it into a one-dimensional array.**

In [3]:
# Generate 3x3 array (two-dimensional: 3 rows, 3 columns)
array_3x3 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Convert to 1D array (flatten returns a copy in row-major order)
array_1d = array_3x3.flatten()
array_1d

array([1, 2, 3, 4, 5, 6, 7, 8, 9])
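
As a small aside, `flatten` always returns an independent copy, while `reshape(-1)` returns a view whenever the memory layout allows it. The sketch below illustrates the difference; the names `grid`, `flat_copy`, and `flat_view` are illustrative and not part of the notebook.

```python
import numpy as np

# Illustrative 3x3 array with the same values as above
grid = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

flat_copy = grid.flatten()    # always a new, independent array
flat_view = grid.reshape(-1)  # a view on the same data for a contiguous array

grid[0, 0] = 99               # modify the original array

print(flat_copy[0])  # the copy keeps the old value
print(flat_view[0])  # the view reflects the change
```

If later code must not be affected by changes to the source array, the copy is the safer choice; the view avoids duplicating memory.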

In [5]:
# Generate 4x4 random integer array 
array_4x4 = np.random.randint(0, 20, size=(4, 4))
array_4x4

array([[15, 17, 10, 11],
       [ 8, 19,  6,  2],
       [ 7,  1,  3,  0],
       [ 2, 11,  6, 17]])

**2. Let us use NumPy to calculate the mean, median, and standard deviation of the 4x4 array of random integers generated above (values from 0 to 19).**

In [7]:

#Calculate mean, median, and standard deviation
mean = np.mean(array_4x4)
median = np.median(array_4x4)
std_dev = np.std(array_4x4)

print("Mean:",mean)
print("Median:",median)
print("Std_dev:",std_dev)

Mean: 8.4375
Median: 7.5
Std_dev: 5.968340954570206
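
The statistics above summarize the whole array with a single number. As a hedged illustration, the same functions also accept an `axis` argument, and `np.std` accepts `ddof` to switch between the population and sample standard deviation. The array `sample` below is a made-up fixed example so the results are reproducible.

```python
import numpy as np

# Hypothetical fixed array (not from the notebook) so results are reproducible
sample = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8]])

col_means = np.mean(sample, axis=0)   # one mean per column
row_means = np.mean(sample, axis=1)   # one mean per row

pop_std = np.std(sample)              # population std: divides by n (ddof=0, the default)
sample_std = np.std(sample, ddof=1)   # sample std: divides by n - 1

print(col_means)
print(row_means)
print(pop_std, sample_std)
```

Note that `np.std` defaults to the population formula, so pass `ddof=1` when the data is a sample from a larger population.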


**3. Slicing**

* Let us create a 5x5 identity matrix, slice it to obtain a 3x3 sub-matrix, and then display the sub-matrix with its rows in reverse order.

In [20]:
# Create 5x5 identity matrix 
identity_matrix = np.identity(5)

# Slice rows 0-2 and columns 0-2 to obtain 3x3 sub-matrix
sub_matrix = identity_matrix[0:3, 0:3]

print(identity_matrix)
# Display the sub-matrix with its rows in reverse order
sub_matrix[::-1]

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]


array([[0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])
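
One detail worth keeping in mind is that basic slicing returns a view on the original array, not a copy. The following sketch, using illustrative names (`eye`, `block`, `block_copy`), shows that writing to a slice also changes the source array, and how a negative step reverses rows or columns.

```python
import numpy as np

eye = np.identity(5)
block_copy = eye[0:3, 0:3].copy()  # explicit, independent copy
block = eye[0:3, 0:3]              # basic slicing returns a view

block[0, 0] = 7.0                  # writes through to `eye`
print(eye[0, 0])                   # the source array has changed
print(block_copy[0, 0])            # the copy has not

flipped_rows = block_copy[::-1]      # reverse the row order
flipped_cols = block_copy[:, ::-1]   # reverse the column order
print(flipped_rows)
```

Calling `.copy()` on a slice is a simple way to avoid accidental changes to the original matrix.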

## **2. Statistical Properties**

In [21]:
array_5x5 = np.random.randint(0, 100, size=(5, 5))
array_5x5

array([[44, 20,  6, 37, 13],
       [49, 55, 33, 64,  1],
       [52, 73, 42, 36, 26],
       [35, 99, 12, 35, 57],
       [23, 55, 39, 66,  3]])

In [22]:
# 1. Generate a new 5x5 array (replacing the one displayed above) and calculate maximum, minimum, and range
array_5x5 = np.random.randint(0, 100, size=(5, 5))
maximum = np.max(array_5x5)
minimum = np.min(array_5x5)
range_ = np.ptp(array_5x5)
print("maximum:",maximum)
print("minimum:",minimum)
print("range_:",range_)

maximum: 99
minimum: 19
range_: 80
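
`np.ptp` ("peak to peak") is simply the maximum minus the minimum. A minimal sketch with a made-up array `sample` shows this, along with the per-column and per-row ranges obtained through the `axis` argument.

```python
import numpy as np

# Hypothetical fixed array (not from the notebook)
sample = np.array([[3, 9, 1],
                   [7, 2, 8]])

overall_range = np.ptp(sample)             # max - min over all values
range_per_column = np.ptp(sample, axis=0)  # one range per column
range_per_row = np.ptp(sample, axis=1)     # one range per row

print(overall_range)
print(range_per_column)
print(range_per_row)
```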


In [24]:
# 2. Calculate correlation coefficient between two arrays
array_1 = np.random.randint(0, 10, size=10)
array_2 = np.random.randint(0, 10, size=10)
corr_coeff = np.corrcoef(array_1, array_2)[0, 1]
corr_coeff

-0.5242640985197536
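
To make the correlation coefficient less of a black box, the sketch below computes Pearson's r directly from its definition and compares it with `np.corrcoef` and `scipy.stats.pearsonr`. The arrays `x` and `y` are made-up values chosen so the result is easy to verify by hand.

```python
import numpy as np
from scipy import stats

# Hypothetical fixed data (not from the notebook)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Pearson's r: sum of products of deviations, divided by the
# square root of the product of the summed squared deviations
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev ** 2) * np.sum(y_dev ** 2))

r_numpy = np.corrcoef(x, y)[0, 1]        # off-diagonal entry of the 2x2 matrix
r_scipy, p_value = stats.pearsonr(x, y)  # also returns a p-value

print(r_manual, r_numpy, r_scipy)
```

`np.corrcoef` returns the full correlation matrix, which is why the notebook indexes the off-diagonal element.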

In [25]:
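# 3. Generate 5x5 normally distributed array and calculate skewness and kurtosis
norm_array = np.random.normal(loc=0, scale=1, size=(5, 5))
# By default both functions work along axis 0, so each returns one value per column;
# pass axis=None to summarize the whole array instead
skewness = scipy.stats.skew(norm_array)
kurtosis = scipy.stats.kurtosis(norm_array)
norm_array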

array([[ 0.44926246,  0.88577211,  0.10524126,  0.53878686,  0.52298218],
       [ 0.49530549,  0.04542353,  0.68966014,  0.34129135, -0.84805755],
       [-0.52521572,  0.44049401, -0.35532238,  1.2124192 , -1.60569138],
       [ 0.11988962,  0.2158606 , -0.12673018,  0.80101273,  1.16698381],
       [-1.06269458,  0.19836834, -1.74469015,  0.82008012, -0.20629129]])
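
The cell above stores skewness and kurtosis without displaying them. As a hedged sketch of how these functions behave, the example below uses a seeded generator (an assumption made here for reproducibility) and shows the effect of the `axis` and `fisher` arguments.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
values = rng.normal(loc=0, scale=1, size=(5, 5))

skew_per_column = stats.skew(values)          # default axis=0: one value per column
skew_overall = stats.skew(values, axis=None)  # a single value for all 25 numbers

# Fisher's definition (the default) subtracts 3, so a normal distribution gives 0
excess_kurtosis = stats.kurtosis(values, axis=None)
# Pearson's definition does not subtract 3, so a normal distribution gives 3
pearson_kurtosis = stats.kurtosis(values, axis=None, fisher=False)

print(skew_per_column.shape)
print(skew_overall, excess_kurtosis, pearson_kurtosis)
```

With only 25 samples, the estimates can be far from the theoretical values of 0, so they should be read as rough indicators.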

## **3. Distributions**

In [None]:
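# 1. Generate Poisson samples and calculate the empirical PMF, theoretical PMF, and CDF
poisson_dist = np.random.poisson(lam=3, size=1000)
# Empirical PMF: share of samples equal to each count 0..10
pmf = np.bincount(poisson_dist, minlength=11)[:11]/len(poisson_dist)
# Theoretical PMF (a discrete distribution has a PMF rather than a PDF)
pmf_theoretical = scipy.stats.poisson.pmf(np.arange(11), mu=3)
cdf = scipy.stats.poisson.cdf(np.arange(11), mu=3)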
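
As a rough check on how the PMF and CDF relate, the sketch below verifies that the Poisson CDF is the running sum of the PMF and that the empirical frequencies from a large sample approach the theoretical PMF. The sample size, seed, and variable names are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

k = np.arange(11)
theoretical_pmf = stats.poisson.pmf(k, mu=3)
theoretical_cdf = stats.poisson.cdf(k, mu=3)

# For a discrete distribution, cdf(k) is the sum of pmf(0), ..., pmf(k)
cdf_from_pmf = np.cumsum(theoretical_pmf)

# Empirical PMF from a large seeded sample (seed chosen only for reproducibility)
rng = np.random.default_rng(42)
draws = rng.poisson(lam=3, size=100_000)
empirical_pmf = np.bincount(draws, minlength=11)[:11] / draws.size

# Largest absolute difference between empirical and theoretical probabilities
max_gap = np.max(np.abs(empirical_pmf - theoretical_pmf))
print(max_gap)
```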

In [None]:
# 2. Generate normal distribution and calculate PDF and CDF
norm_dist = np.random.normal(loc=5, scale=2, size=1000)
pdf = scipy.stats.norm.pdf(np.arange(0, 11, 0.1), loc=5, scale=2)
cdf = scipy.stats.norm.cdf(np.arange(0, 11, 0.1), loc=5, scale=2)
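
To show how the normal CDF can be used in practice, here is a minimal sketch that "freezes" the same distribution (mean 5, standard deviation 2) and computes a few probabilities. The variable names are illustrative.

```python
from scipy import stats

# Frozen normal distribution with the same parameters as above
dist = stats.norm(loc=5, scale=2)

p_below_mean = dist.cdf(5)                   # probability of a value below the mean
p_within_one_sd = dist.cdf(7) - dist.cdf(3)  # probability of falling within one std of the mean

q95 = dist.ppf(0.95)       # 95th percentile (ppf is the inverse of cdf)
round_trip = dist.cdf(q95) # applying cdf recovers the probability

print(p_below_mean, p_within_one_sd, q95)
```

Freezing the distribution avoids repeating `loc` and `scale` in every call.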

## **4. A/B Testing**

**1. Let us write Python code that generates two arrays of 10 random integers each, with values from 0 to 9.**

In [None]:
# Generate two arrays of random integers
array_1 = np.random.randint(0, 10, size=10)
array_2 = np.random.randint(0, 10, size=10)

**2. Let us use SciPy to conduct a t-test to compare the means of the two arrays.**

In [None]:
t_test = scipy.stats.ttest_ind(array_1, array_2)
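
The result of `ttest_ind` holds a test statistic and a p-value. The sketch below shows one way to read them, using two made-up groups with clearly different means and a conventional significance level of 0.05 (both are assumptions for illustration). It also shows Welch's variant, which does not assume equal variances.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # seeded only so the sketch is reproducible
group_a = rng.normal(loc=5.0, scale=1.0, size=50)  # hypothetical group A
group_b = rng.normal(loc=7.0, scale=1.0, size=50)  # hypothetical group B

result = stats.ttest_ind(group_a, group_b)                  # Student's t-test (equal variances)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

alpha = 0.05  # assumed significance level
reject_null = bool(result.pvalue < alpha)

print("t statistic:", result.statistic)
print("p-value:", result.pvalue)
print("Reject equal means:", reject_null)
```

A p-value below the chosen level suggests the difference in means is unlikely to be due to chance alone; with small random samples such as those in the notebook, the test will often fail to reject.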

**3. Let us use SciPy to conduct a chi-square test on two arrays of 100 random integers each, with values from 0 to 4.**

In [None]:
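# Conduct chi-square test on two arrays
array_1 = np.random.randint(0, 5, size=100)
array_2 = np.random.randint(0, 5, size=100)
# Count how often each value 0-4 occurs in each array to build a 2x5 contingency table
observed = np.array([np.bincount(array_1, minlength=5), np.bincount(array_2, minlength=5)])
# Test whether the two arrays share the same distribution of values
chi_square = scipy.stats.chi2_contingency(observed)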

In [None]:
# Create a 2x2 contingency table of observed counts
observed_data = np.array([[10, 20], [30, 40]])

# Perform chi-square test
chi2, p, dof, expected = scipy.stats.chi2_contingency(observed_data)

# Print the results
print("Chi-square statistic:", chi2)
print("p-value:", p)
print("Degrees of freedom:", dof)
print("Expected frequencies table:")
print(expected)
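
The expected frequencies returned by `chi2_contingency` come from the row and column totals. As a sketch, the code below rebuilds them by hand for the same 2x2 table and recomputes the uncorrected chi-square statistic. For 2x2 tables SciPy applies Yates' continuity correction by default, so the comparison uses `correction=False`.

```python
import numpy as np
from scipy import stats

observed = np.array([[10, 20], [30, 40]])

# Expected count for each cell = (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-square statistic without continuity correction
chi2_manual = np.sum((observed - expected_manual) ** 2 / expected_manual)

chi2, p, dof, expected = stats.chi2_contingency(observed)  # Yates' correction applied (2x2)
chi2_uncorrected = stats.chi2_contingency(observed, correction=False)[0]

print(expected_manual)
print(chi2_manual, chi2_uncorrected, chi2)
```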

## **5. Anomaly Detection**

In [None]:
# Import libraries
import numpy as np
import pandas as pd

1. Let us write Python code that generates a 5x5 array of random integers from 0 to 9 with one value that is significantly larger than the others. We then use NumPy to detect the anomaly and replace it with the mean of the other values in the array.

In [None]:
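# Generate random 5x5 array with one significantly larger value
arr = np.random.randint(0, 10, size=(5, 5))
arr[3, 3] = 100  # set one value to 100 to represent the anomaly
original = arr.copy()  # keep a copy so the original can still be printed

# Detect anomaly: any value more than 3 standard deviations above the mean
mean = np.mean(arr)
std = np.std(arr)
threshold = mean + 3*std
anomaly_mask = arr > threshold

# Replace the anomaly with the mean of the other values
# (arr holds integers, so the mean is truncated on assignment)
if np.any(anomaly_mask):
    arr[anomaly_mask] = np.mean(arr[~anomaly_mask])

# Print original array and array with anomaly replaced
print("Original array:")
print(original)
print("Array with anomaly replaced:")
print(arr)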
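
Because the array above holds integers, any mean assigned into it loses its fractional part. One possible way around this, sketched below, is to work on a float copy. The helper `replace_high_outliers` and its `n_std` parameter are hypothetical names introduced here, not part of NumPy.

```python
import numpy as np

def replace_high_outliers(values, n_std=3.0):
    """Hypothetical helper: replace values above mean + n_std * std
    with the mean of the remaining values. Returns a float copy and the mask."""
    result = np.asarray(values, dtype=float).copy()  # float copy; input is left untouched
    threshold = result.mean() + n_std * result.std()
    mask = result > threshold
    if mask.any():
        result[mask] = result[~mask].mean()
    return result, mask

# Made-up example: every value is 4 except one anomaly of 100
grid = np.full((5, 5), 4)
grid[3, 3] = 100

cleaned, mask = replace_high_outliers(grid)
print(cleaned[3, 3])  # replaced by the mean of the other values
print(grid[3, 3])     # the input array is unchanged
```

Note that a single extreme value inflates both the mean and the standard deviation, so a 3-standard-deviation rule can miss anomalies in very small arrays.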

2. Let us generate a random dataset with anomalies, detect the anomalies using both the z-score and percentile methods, and remove the anomalies from the dataset.

In [None]:
# Generate random data with anomalies
data = np.random.normal(0, 1, 100)
anomalies = np.random.normal(10, 1, 3)
data = np.concatenate((data, anomalies))

In [None]:
# Convert data to pandas dataframe
df = pd.DataFrame(data, columns=['Values'])

In [None]:
# Detect anomalies using z-score method
df['Z-score'] = (df['Values'] - df['Values'].mean()) / df['Values'].std()
anomalies_zscore = df.loc[df['Z-score'].abs() > 3, 'Values']

In [None]:
# Remove anomalies using z-score method
df = df.loc[df['Z-score'].abs() <= 3, :].drop(columns=['Z-score'])
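
SciPy also provides `scipy.stats.zscore`, which can replace the manual calculation. One subtlety: pandas' `.std()` uses the sample formula (`ddof=1`), while `zscore` defaults to `ddof=0`, so the two differ slightly unless `ddof=1` is passed. The sketch below uses seeded, made-up data to illustrate this.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
values = np.concatenate((rng.normal(0, 1, 100), [10.0, 10.5, 11.0]))  # 3 injected anomalies

# Manual z-score with the sample standard deviation (ddof=1, matching pandas' default)
z_manual = (values - values.mean()) / values.std(ddof=1)

z_scipy_default = stats.zscore(values)         # uses ddof=0 by default
z_scipy_sample = stats.zscore(values, ddof=1)  # matches the manual calculation

flagged = values[np.abs(z_scipy_sample) > 3]
print(flagged)
```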

In [None]:
# Detect anomalies using percentile (interquartile range) method,
# applied to the data remaining after the z-score step
q1, q3 = np.percentile(df['Values'], [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
anomalies_percentile = df.loc[(df['Values'] < lower_bound) | (df['Values'] > upper_bound), 'Values']

In [None]:
# Remove anomalies using percentile method
df = df.loc[(df['Values'] >= lower_bound) & (df['Values'] <= upper_bound), :]
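
The percentile rule above can be wrapped in a small function for reuse. The following is a minimal sketch; `iqr_bounds` and its multiplier `k` are hypothetical names, and the sample data is made up so the bounds can be checked by hand.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Hypothetical helper: return (lower, upper) bounds at k * IQR
    below the first quartile and above the third quartile."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up data with one obvious outlier
sample = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

lower, upper = iqr_bounds(sample)
outliers = sample[(sample < lower) | (sample > upper)]
kept = sample[(sample >= lower) & (sample <= upper)]

print(lower, upper)
print(outliers)
```

Unlike the z-score, the quartiles are barely affected by a few extreme values, which is why this rule can still flag points after the z-score step.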

In [None]:
print('Original data:')
print(data)
print('Anomalies detected using Z-score method:')
print(anomalies_zscore.values)
print('Anomalies detected using percentile method:')
print(anomalies_percentile.values)
print('Data after removing anomalies:')
print(df['Values'].values)