# NumPy

<img src="https://numpy.org/images/logo.svg" height="50">

[NumPy](https://numpy.org/) (short for Numerical Python) is a powerful open-source library for numerical computing in Python. It provides support for large multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays. NumPy is a fundamental package for scientific computing and is widely used in data analysis, machine learning, image processing, and many other fields requiring numerical computations.

NumPy was created by [Travis Oliphant](https://en.wikipedia.org/wiki/Travis_Oliphant) in 2005. It was built on top of Numeric, an earlier library developed by [Jim Hugunin](https://en.wikipedia.org/wiki/Jim_Hugunin) in the 1990s. Numeric was the first Python package that provided support for multi-dimensional arrays and operations on them. Over time, it became clear that a more robust and feature-rich version was needed, leading to the creation of NumPy, which integrated features from both Numeric and numarray (another similar package).

NumPy quickly became the cornerstone of the Python data analysis ecosystem. It handles arrays efficiently, performs fast operations on large datasets, and is designed to integrate seamlessly with other scientific libraries like **SciPy**, **Pandas**, **Matplotlib**, and **Scikit-learn**.


## Examples of use cases in Data Analytics with NumPy

### Handling and Manipulating Large Datasets

You have a large dataset (e.g., 1 million rows) and need to perform element-wise arithmetic operations, such as adding 10 to each element of the dataset.


In [None]:
import numpy as np

# Creating a large array of random numbers
data = np.random.rand(1000000)

# Adding 10 to each element in the array
data_plus_10 = data + 10

The same result can be achieved with a list comprehension or an explicit loop, but a Python list has to be processed one element at a time in the interpreter, which is slower because lists are not optimized for numerical computation.

In [None]:
import random
data = [random.random() for _ in range(1000000)]
data_plus_10 = [x + 10 for x in data]


What are the main differences?
- Performance: NumPy provides highly optimized C-based implementations, so operations on large datasets (like adding 10 to every element) are done much faster compared to Python loops. NumPy’s vectorized operations are often orders of magnitude faster than pure Python loops.
- Memory Efficiency: NumPy arrays are stored more efficiently than Python lists and use less memory, especially for large datasets.
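As a rough illustration of the performance point, the sketch below times both approaches with the standard-library `timeit` module. The array size `n` and the repetition count are arbitrary choices made here (smaller than the 1 million rows above so the cell runs quickly), and the measured times will vary by machine.

```python
import random
import timeit

import numpy as np

n = 100_000  # assumption: reduced size so the timing finishes quickly

array_data = np.random.rand(n)
list_data = [random.random() for _ in range(n)]

# Time 10 repetitions of each approach
numpy_time = timeit.timeit(lambda: array_data + 10, number=10)
list_time = timeit.timeit(lambda: [x + 10 for x in list_data], number=10)

print(f"NumPy: {numpy_time:.4f} s, list comprehension: {list_time:.4f} s")

# Check that both approaches give the same values for the same input
from_list = [x + 10 for x in array_data.tolist()]
same_result = np.allclose(array_data + 10, from_list)
print(f"Same result: {same_result}")
```

Running the check on identical input confirms that the speed difference comes from how the work is done, not from what is computed.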


### Matrix Operations Scenario

You need to perform matrix multiplication, which is a common task in machine learning (e.g., calculating the dot product between feature matrices).

In [None]:
A = np.random.rand(3, 3)  # 3x3 matrix
B = np.random.rand(3, 3)  # 3x3 matrix

# Matrix multiplication
result = np.dot(A, B)
result


array([[0.50279847, 0.5632417 , 0.96707076],
       [0.79505735, 0.84528553, 1.50311837],
       [0.87707853, 0.7817518 , 1.45299398]])

The same can be done with plain Python lists. Without NumPy, you have to iterate manually over the rows and columns of the matrices and accumulate each dot product, which is both inefficient and error-prone.


In [None]:
def matrix_multiply(A, B):
  """
  Multiplies two matrices A and B.

  Args:
    A: The first matrix.
    B: The second matrix.

  Returns:
    The resulting matrix after multiplication.
  """
  num_rows_A = len(A)
  num_cols_A = len(A[0])
  num_rows_B = len(B)
  num_cols_B = len(B[0])

  if num_cols_A != num_rows_B:
    raise ValueError("Matrices cannot be multiplied due to incompatible dimensions.")

  result = [[0 for _ in range(num_cols_B)] for _ in range(num_rows_A)]

  for i in range(num_rows_A):
    for j in range(num_cols_B):
      for k in range(num_cols_A):  # or num_rows_B
        result[i][j] += A[i][k] * B[k][j]

  return result


A = [[random.random() for _ in range(3)] for _ in range(3)]
B = [[random.random() for _ in range(3)] for _ in range(3)]

# Manual matrix multiplication (inefficient)
# Using a list comprehension
#result = [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))] for i in range(len(A))]

# Using the function
result = matrix_multiply(A, B)
result


[[1.1111205894333458, 1.5430904682874056, 0.872687420319606],
 [1.3042689579162192, 1.5780279847576144, 0.4708319772078078],
 [0.6674111690199104, 0.9611250318451263, 0.3918035847418381]]

What are the differences?
- Efficiency: NumPy's np.dot is implemented in C and optimized for performance. The manual approach is significantly slower, especially as the matrix sizes increase.
- Code Simplicity: The NumPy solution is much more concise and readable. The Python list approach is not only slower but also more complex and harder to debug.
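A related detail worth a short sketch: for 2D arrays, the `@` operator gives the same result as `np.dot`, while `*` multiplies element by element. The small fixed matrices below are chosen here so the numbers can be checked by hand; they are not part of the scenario above.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

product_dot = np.dot(A, B)  # matrix product
product_at = A @ B          # same matrix product, written with the @ operator
elementwise = A * B         # element-by-element product, NOT a matrix product

print(product_at)   # [[19. 22.] [43. 50.]]
print(elementwise)  # [[ 5. 12.] [21. 32.]]

# Incompatible shapes raise a ValueError, much like the manual function above
mismatch_raised = False
try:
    np.ones((2, 3)) @ np.ones((2, 3))
except ValueError as error:
    mismatch_raised = True
    print(f"Shape mismatch: {error}")
```

Confusing `*` with `@` is a common source of silent errors, because both succeed on square matrices of the same shape.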


### Statistical Analysis
You need to calculate the mean, standard deviation, and correlation of a dataset.




In [None]:
data = np.random.rand(1000)

mean = np.mean(data)
std_dev = np.std(data)

print(f"mean: {mean}, std_dev: {std_dev}")

mean: 0.5008557282612514, std_dev: 0.2850620809156234


Using Python's built-in functions: You could calculate the mean and standard deviation manually using Python loops, but this would be inefficient and require more lines of code.


In [None]:

data = [random.random() for _ in range(1000)]

mean = sum(data) / len(data)
std_dev = (sum((x - mean)**2 for x in data) / len(data))**0.5

print(f"mean: {mean}, std_dev: {std_dev}")

mean: 0.5132583429051921, std_dev: 0.29279051443165854


What is different?
- Performance: The NumPy functions np.mean and np.std are highly optimized and execute much faster than manual implementations of these calculations in Python.
- Code Simplicity: NumPy abstracts away the complex calculations, reducing the lines of code and making it easier to perform statistical analysis on datasets.
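The scenario also mentions correlation, which the cells above do not compute. A minimal sketch with `np.corrcoef` follows. The second variable `y`, the noise level, and the seed are made up here for illustration. The sketch also shows the `ddof` argument of `np.std`, since the default (`ddof=0`) is the population standard deviation, which matches the manual formula above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumption: arbitrary seed for reproducibility

x = rng.random(1000)
# Hypothetical second variable: linearly related to x, plus some noise
y = 2 * x + rng.normal(scale=0.1, size=1000)

# np.corrcoef returns the full correlation matrix; entry [0, 1] is corr(x, y)
corr_matrix = np.corrcoef(x, y)
corr_xy = corr_matrix[0, 1]

population_std = np.std(x)      # divides by N (default, ddof=0)
sample_std = np.std(x, ddof=1)  # divides by N - 1

print(f"correlation: {corr_xy:.3f}")
print(f"population std: {population_std:.4f}, sample std: {sample_std:.4f}")
```

Because `y` is constructed as a nearly linear function of `x`, the correlation should come out close to 1.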


### Reshaping and Slicing Arrays
You have a 1D array and want to reshape it into a 2D array or slice it to work with subsets of the data.

In [None]:
# Create a 1D array of 12 elements
data = np.arange(12)

# Reshape it to a 3x4 2D array
reshaped_data = data.reshape(3, 4)

# Slice the 2D array
subset = reshaped_data[1:, 1:]

print(f" The original data is \n {data}")
print(f" The reshaped data is \n {reshaped_data}")
print(f" The subset data is \n {subset}	")

 The original data is 
 [ 0  1  2  3  4  5  6  7  8  9 10 11]
 The reshaped data is 
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
 The subset data is 
 [[ 5  6  7]
 [ 9 10 11]]	


Using Python lists: Reshaping and multi-dimensional slicing are more complex and less intuitive with Python lists than with NumPy. Lists don't support reshaping directly, so you have to rearrange the data manually.

In [None]:
data = list(range(12))

# Manual reshaping (inefficient and error-prone)
reshaped_data = [data[i:i+4] for i in range(0, 12, 4)]
subset = [row[1:] for row in reshaped_data[1:]]

print(f" The original data is \n {data}")
print(f" The reshaped data is \n {reshaped_data}")
print(f" The subset data is \n {subset}	")

 The original data is 
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
 The reshaped data is 
 [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
 The subset data is 
 [[5, 6, 7], [9, 10, 11]]	


What is different?
- Ease of Use: NumPy's reshape method provides an easy way to change the shape of an array without needing to manually rearrange the data.
- Performance: Reshaping in NumPy is optimized and, for contiguous arrays like this one, returns a view of the same data rather than a copy. The list-based version builds new lists.
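The no-copy behavior can be checked directly. The sketch below uses `np.shares_memory` to show that the reshaped array and the original share one buffer, which also means a write through one is visible through the other. The variable names are chosen here for illustration.

```python
import numpy as np

data = np.arange(12)
reshaped = data.reshape(3, 4)

# The reshaped array is a view: both names refer to the same memory
print(np.shares_memory(data, reshaped))  # True

# Writing through the view changes the original as well
reshaped[0, 0] = 100
print(data[0])  # 100

# An explicit copy is independent of the original
independent = data.reshape(3, 4).copy()
independent[0, 0] = -1
print(data[0])  # still 100

# Passing -1 lets NumPy infer that dimension from the array size
inferred = data.reshape(-1, 6)
print(inferred.shape)  # (2, 6)
```

Call `.copy()` whenever the reshaped or sliced data must be modified without affecting the source array.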


### Element-wise operations
You need to apply a mathematical transformation to a dataset (e.g., squaring each element).

In [None]:
data = np.random.rand(1000)

# Squaring each element
squared_data = np.square(data)


Without NumPy, you would have to loop through each element and apply the transformation manually.


In [None]:
data = [random.random() for _ in range(1000)]

# Manual element-wise operation (inefficient)
squared_data = [x**2 for x in data]

What is different?
- Performance: NumPy’s vectorized operations (like np.square) are much faster than using Python loops, especially for large datasets.
- Code Readability: The NumPy approach is concise and states the intent directly. Loop-based code becomes longer and more error-prone as the transformation grows more complex.
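To show how element-wise operations compose, here is a small sketch on a fixed array whose values were picked here so the results are easy to verify. It squares the data two equivalent ways, undoes the squaring with `np.sqrt`, and standardizes the data in a single expression.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

squared = np.square(data)  # [ 1.  4.  9. 16.]
also_squared = data ** 2   # the ** operator is element-wise too
roots = np.sqrt(squared)   # back to the original values

# Several element-wise steps combined: subtract the mean, divide by the std
scaled = (data - data.mean()) / data.std()

print(squared)
print(roots)
print(scaled)
```

The standardization line would need a separate pass over a list to compute the mean and standard deviation before the transforming loop, which is where the readability gap becomes more noticeable.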



## Summary of Key Differences Between NumPy and Traditional Python Methods
- Performance: NumPy outperforms traditional Python loops in terms of speed due to its implementation in C, optimized for handling large arrays and matrices.
- Memory Efficiency: NumPy stores array elements in contiguous blocks of memory with a fixed data type, which uses less memory than a Python list. A list stores references to separate Python objects, and that overhead slows operations down as the data grows.
- Ease of Use: NumPy’s high-level functions (e.g., np.dot, np.mean, np.reshape) simplify complex operations and reduce the likelihood of errors.
- Scalability: NumPy handles larger datasets much more efficiently than pure Python, which can become slow and cumbersome as the size of the data increases.
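The memory point can be made concrete with a rough measurement. The sketch below compares the bytes held by a `float64` array with an estimate for an equivalent list. The list estimate adds the list object to each float object via `sys.getsizeof`; this is an approximation that assumes CPython, and exact figures differ between Python versions and platforms. The size `n` is an arbitrary choice made here.

```python
import sys

import numpy as np

n = 100_000  # assumption: arbitrary size for the comparison

array_data = np.random.rand(n)  # float64 by default
list_data = array_data.tolist()  # same values as Python float objects

# Array: n elements of a fixed size, stored in one contiguous buffer
print(f"itemsize: {array_data.itemsize} bytes, dtype: {array_data.dtype}")
print(f"array buffer: {array_data.nbytes} bytes")
print(f"C-contiguous: {array_data.flags['C_CONTIGUOUS']}")

# List: the list of references plus every float object it points to
list_bytes = sys.getsizeof(list_data) + sum(sys.getsizeof(x) for x in list_data)
print(f"list estimate: {list_bytes} bytes")
```

On a typical CPython build the list estimate is several times larger than the array buffer, since each float is a full Python object rather than a bare 8-byte value.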

In data analytics, NumPy is often the go-to tool for handling large datasets, performing complex mathematical operations, and achieving both performance and simplicity in your code.