<a target="_blank" href="https://colab.research.google.com/github/lukebarousse/Python_Data_Analytics_Course/blob/main/1_Basics/22_NumPy.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# NumPy

## Notes

* NumPy, which stands for Numerical Python, is foundational for numerical computing in Python.
* It is designed for scientific computing and is widely used in data analysis because it handles large, multi-dimensional arrays and matrices efficiently.
* It serves as the basis for many other Python data science libraries, including Pandas, because of its speed in numerical computations.
* A few NumPy use cases that are especially relevant to data analysis:
    * Array operations
    * Linear algebra
    * Statistical functions
    * Random number generation
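
As a rough preview of those four areas, here is a minimal sketch. The values and variable names are made up for illustration and are not part of the course data.

```python
import numpy as np

# Array operations: arithmetic applies to every element at once
hours = np.array([40, 35, 45])
overtime = hours - 40                 # array([ 0, -5,  5])

# Linear algebra: matrix-vector product with the @ operator
rates = np.array([[1.0, 0.5],
                  [0.0, 2.0]])
weights = np.array([2.0, 4.0])
combined = rates @ weights            # array([4., 8.])

# Statistical functions
average_hours = np.mean(hours)        # 40.0

# Random number generation (seeded so the result is repeatable)
rng = np.random.default_rng(seed=0)
sample = rng.integers(low=1, high=7, size=3)  # three integers from 1 to 6
```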

## Importance 

* Central to data analysis in Python.
* Stores data in a structured format (arrays), making it easier to organize, access, and manipulate.
* The Pandas library is built on top of NumPy.

For more information on NumPy, check out the official documentation [here](https://numpy.org/doc/1.26/).

## Array Operations

### Notes
* Creation: **`import numpy as np; np.array([1, 2, 3])`**
* Basic Operations:  `+`, `-`, `/`, `*` performed element-wise
* Slicing: **`array[1:3]`**
* Boolean Indexing: **`array[array > 0]`**

### Examples

Let's start by creating a fictional array with the number of years of experience required for five different data science job listings.

### Create an Array

In [None]:
# Install package
!conda install numpy

In [2]:
# Import Package
import numpy as np

In [3]:
# Example: An array representing the number of years of experience required for five different data science job listings.
years_of_experience = np.array([1, 2, 3, 4, 5])

### Basic Operations

Apply mathematical operations to the entire array at once.

#### Addition
Adding 1 year to the experience requirements for each job listing.

In [4]:
years_of_experience_plus_one = years_of_experience + 1
years_of_experience_plus_one

array([2, 3, 4, 5, 6])

#### Subtraction

Subtracting 1 year from the experience requirements for each job listing.

In [5]:
years_of_experience_minus_one = years_of_experience - 1
years_of_experience_minus_one

array([0, 1, 2, 3, 4])

#### Division

Dividing the experience requirements by 2 (maybe for a junior role).

In [6]:
years_of_experience_half = years_of_experience / 2
years_of_experience_half

array([0.5, 1. , 1.5, 2. , 2.5])

#### Multiplication

Doubling the experience requirements for each job listing.

In [7]:
years_of_experience_double = years_of_experience * 2
years_of_experience_double

array([ 2,  4,  6,  8, 10])
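
The examples above combine an array with a single number. The same operators also work element-wise between two arrays of the same length. A small sketch, where the `adjustment` values are hypothetical:

```python
import numpy as np

years_of_experience = np.array([1, 2, 3, 4, 5])
# Hypothetical per-listing adjustment (illustrative values only)
adjustment = np.array([0, 1, 0, 2, 1])

# Each element is paired with the element at the same position
adjusted = years_of_experience + adjustment   # array([1, 3, 3, 6, 6])
scaled = years_of_experience * adjustment     # array([0, 2, 0, 8, 5])
```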

### Slicing 

Get a subset of the array. Let's get the experience requirements for the second and third job postings.

In [4]:
# Example: Selecting the experience requirement for the second and third job listings.
second_and_third_jobs_experience = years_of_experience[1:3]
second_and_third_jobs_experience

array([2, 3])
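
Note that the start index is included and the stop index is excluded, which is why `[1:3]` returns the items at positions 1 and 2. A few other slice forms, sketched on the same array:

```python
import numpy as np

years_of_experience = np.array([1, 2, 3, 4, 5])

first_two = years_of_experience[:2]      # omit the start: array([1, 2])
last_two = years_of_experience[-2:]      # negative index counts from the end: array([4, 5])
every_other = years_of_experience[::2]   # step of 2: array([1, 3, 5])
```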

### Boolean Indexing

Get items from the array that meet a specific condition. Here, get only the job postings that require more than 2 years of experience.

In [5]:
# Example: Selecting only those job listings that require more than 2 years of experience.
jobs_with_more_than_two_years_exp = years_of_experience[years_of_experience > 2]
jobs_with_more_than_two_years_exp

array([3, 4, 5])
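
Under the hood, the comparison produces an array of `True`/`False` values (a mask), and the mask selects the matching items. Conditions can be combined with `&` (and) and `|` (or), as long as each condition is wrapped in parentheses. A minimal sketch on the same array:

```python
import numpy as np

years_of_experience = np.array([1, 2, 3, 4, 5])

# The comparison alone returns a boolean mask
mask = years_of_experience > 2   # array([False, False,  True,  True,  True])

# Combine conditions with & and |; the parentheses are required
mid_level = years_of_experience[(years_of_experience >= 2) & (years_of_experience <= 4)]
entry_or_senior = years_of_experience[(years_of_experience < 2) | (years_of_experience > 4)]
```

Here `mid_level` is `array([2, 3, 4])` and `entry_or_senior` is `array([1, 5])`.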

## Math Operations

### Notes

* Aggregate functions:
    * `sum`: sum
    * `prod`: product
    * `cumsum`: cumulative sum
    * `cumprod`: cumulative product
* Mathematical operations (we won't be going into this during the course):
    * `sqrt` 
    * `exp` 
    * `log` 
    * `sin` 
    * `cos`
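
Although the course doesn't cover the mathematical functions, a quick sketch shows that they work element-wise like the arithmetic operators. The input values are arbitrary:

```python
import numpy as np

values = np.array([1.0, 4.0, 9.0])

roots = np.sqrt(values)                # array([1., 2., 3.])
round_trip = np.log(np.exp(values))    # log undoes exp, so this is (approximately) values again

angles = np.array([0.0, np.pi / 2])
sines = np.sin(angles)                 # approximately [0, 1]
cosines = np.cos(angles)               # approximately [1, 0]
```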

### Examples

First, let's create a list of 10 yearly salaries for a Senior Data Analyst job. We use `random.randint` from the `random` library to draw random integers between 100000 and 150000 (both ends included), inside a list comprehension that runs 10 times to give us 10 random values.

In [6]:
import random 

salary = [random.randint(100000, 150000) for num in range(10)]

In [7]:
salary

[111298,
 110003,
 103540,
 116494,
 120441,
 149248,
 107357,
 141818,
 102895,
 139437]

Now we'll convert this list into a NumPy array.

In [8]:
salary_array = np.array(salary)
salary_array

array([111298, 110003, 103540, 116494, 120441, 149248, 107357, 141818,
       102895, 139437])

### Sum 

Calculate the total sum of the elements in the `salary_array`.

In [9]:
total_sum_salaries = np.sum(salary_array)
total_sum_salaries

1202531

### Prod

Calculate the product of the elements in the `salary_array`.

In [10]:
# This is a conceptual example since taking the product of salaries isn't common.
# The result below is negative because it overflows the array's integer type.
product_salaries = np.prod(salary_array)
product_salaries

-2136956928

### Cumsum (Cumulative Sum)

Calculates the cumulative sum of elements of the `salary_array`. It calculates the cumulative sum at each index, meaning each element in the output array is the sum of all preceding elements including the current one from the original array.

For the `salary_array`:

* First element of cumsum is 111298 (just the first element)
* Second element is 111298 + 110003 = 221301
* Third element is 221301 + 103540 = 324841
* And so on...

In [11]:
cumulative_sum_salaries = np.cumsum(salary_array)
cumulative_sum_salaries

array([ 111298,  221301,  324841,  441335,  561776,  711024,  818381,
        960199, 1063094, 1202531])
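
To make the running total concrete, here is a sketch that rebuilds the same result with a plain `for` loop and compares it with `np.cumsum`. The salaries are copied from the output above; the loop variable names are just for illustration.

```python
import numpy as np

salary_array = np.array([111298, 110003, 103540, 116494, 120441,
                         149248, 107357, 141818, 102895, 139437])

# Keep a running total and record it after each salary
running_total = 0
manual_cumsum = []
for salary in salary_array.tolist():
    running_total += salary
    manual_cumsum.append(running_total)

# The loop and np.cumsum give the same values
matches = manual_cumsum == np.cumsum(salary_array).tolist()
```

The last value of the cumulative sum is always the same as `np.sum` of the whole array.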

### Cumprod (Cumulative Product) 

Calculates the cumulative product of elements of the `salary_array`. It calculates the cumulative product at each index, meaning each element in the output array is the product of all preceding elements including the current one from the original array.

For the `salary_array`:

* First element of cumprod is 111298
* Second element is 111298 * 110003
* Third element is (111298 * 110003) * 103540
* And so on...

In [12]:
# Cumulative product of the salaries (conceptual example)
cumulative_prod_salaries = np.cumprod(salary_array)
cumulative_prod_salaries

array([     111298,  -641787994,  1005104952,  -702145264,   828316816,
       -1710493696, -2144980992, -1560616960,   555163648, -2136956928])

**Note**: The cumulative product grows so quickly that it exceeds the largest value the array's fixed-width integer type can hold (32-bit in the output above), so the values wrap around. This integer overflow is why some numbers appear as negative, and it is also why the `np.prod` result earlier is negative. Plain Python integers don't have this limit.
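
If you ever do need a product this large, two possible workarounds are sketched below: `math.prod` on the original Python list (exact, since Python integers grow as needed) or asking NumPy to compute in floating point (approximate, but no wrap-around). This is one option under those assumptions, not the only approach.

```python
import math
import numpy as np

salary = [111298, 110003, 103540, 116494, 120441,
          149248, 107357, 141818, 102895, 139437]

# Exact result using Python's arbitrary-precision integers
exact_product = math.prod(salary)

# Approximate result using 64-bit floats inside NumPy
float_product = np.prod(np.array(salary, dtype=np.float64))

# Cumulative version in floating point: every value stays positive
float_cumprod = np.cumprod(np.array(salary, dtype=np.float64))
```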

## Statistics Operations

### Notes

Many of these are also available in Pandas, because Pandas is built on top of NumPy. Here are a few examples, but we won't be diving deep into them.

* `mean`
* `median`
* `var`: variance
* `std`: standard deviation
* `min`
* `max`

### Mean

Calculate the average salary in the `salary_array`.

In [13]:
average_salary = np.mean(salary_array)
average_salary

120253.1

### Median 

Find the median salary in the `salary_array`.

In [14]:
median_salary = np.median(salary_array)
median_salary

113896.0

### Var 

Calculate the variance of the `salary_array`.

In [15]:
salary_variance = np.var(salary_array, ddof=1)  # ddof=1 for sample variance
salary_variance

291126267.21111107
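
The `ddof` argument controls the divisor: NumPy divides the sum of squared deviations by `n - ddof`. The default `ddof=0` gives the population variance, and `ddof=1` gives the sample variance. A sketch that computes both by hand, using the salaries shown earlier:

```python
import numpy as np

salary_array = np.array([111298, 110003, 103540, 116494, 120441,
                         149248, 107357, 141818, 102895, 139437])

n = salary_array.size
# Squared distance of each salary from the mean
squared_deviations = (salary_array - salary_array.mean()) ** 2

population_variance = squared_deviations.sum() / n        # matches np.var(salary_array)
sample_variance = squared_deviations.sum() / (n - 1)      # matches np.var(salary_array, ddof=1)
```

The same `ddof` argument applies to `np.std`, which is the square root of the variance.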

### Std

Calculate the standard deviation of the `salary_array`.

In [16]:
# Standard deviation of the salaries
salary_std_dev = np.std(salary_array, ddof=1)  # ddof=1 for sample standard deviation
salary_std_dev

17062.42266535181

### Min

Find the minimum element in the `salary_array`.

In [17]:
# Minimum of the salaries
min_salary = np.min(salary_array)
min_salary

102895

### Max

Find the maximum element in the `salary_array`.

In [18]:
# Maximum of the salaries
max_salary = np.max(salary_array)
max_salary

149248

## NaN


### Notes

* Generate NaN values using `np.nan`
* `np.nan` value is used in NumPy (and by extension, Pandas) to represent missing or undefined data
* Helpful because it:
    * Handles missing data.
    * Helps with computations: operations involving missing values return `np.nan` instead of raising errors.
    * Helps filter out or fill in missing data using methods that we'll use often in the `pandas` library, like `dropna()`, `fillna()`, `isna()`, or `notna()`.
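
A small sketch of that behavior, using made-up values: a calculation that touches a NaN returns NaN rather than an error, and because NaN is not equal to anything (including itself), `np.isnan` is the way to test for it.

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0])

total = np.sum(values)               # nan: the missing value propagates
is_missing = np.isnan(values)        # array([False,  True, False])
equal_to_itself = np.nan == np.nan   # False, so don't test for NaN with ==
```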

### Examples

Below are a few examples of what you can do.

#### Insert Missing Values 

To insert missing values into your array intentionally, perhaps to indicate that data is expected but not yet available, use `np.nan`.

In [19]:
salary_with_nan = np.array([123124, np.nan, 145000, 128000, 110000, 149999, np.nan, 135000, 115000, 140000], dtype=float)
salary_with_nan

array([123124.,     nan, 145000., 128000., 110000., 149999.,     nan,
       135000., 115000., 140000.])

#### Replace Values with NaN

To replace existing values with `np.nan`, for example because certain values are considered invalid or outliers, assign it through a boolean index:

In [20]:
salary_with_nan[salary_with_nan < 130000] = np.nan
salary_with_nan

array([    nan,     nan, 145000.,     nan,     nan, 149999.,     nan,
       135000.,     nan, 140000.])
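
Once an array contains NaN values, the regular aggregate functions return NaN. NumPy also provides NaN-aware versions such as `np.nanmean`, and `np.isnan` can be used to filter the missing values out. A sketch using the array from the output above:

```python
import numpy as np

salary_with_nan = np.array([np.nan, np.nan, 145000, np.nan, np.nan,
                            149999, np.nan, 135000, np.nan, 140000])

plain_mean = np.mean(salary_with_nan)            # nan
mean_ignoring_nan = np.nanmean(salary_with_nan)  # 142499.75, average of the 4 remaining salaries

# Keep only the entries that are not NaN (~ flips True and False)
valid_salaries = salary_with_nan[~np.isnan(salary_with_nan)]
```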

## Where

### Notes

* `np.where` checks each element of an array against a condition and assigns one value where the condition is True and another where it is False.
* Syntax: `np.where(condition, x, y)`.
* It's commonly used to conditionally replace array elements.

### Example 

We're going to replace all values in `salary_array` that are less than 120,000 with 120,000 (to apply a minimum salary threshold). We'll use this syntax for it: `np.where(condition, x, y)`. Wherever `condition` is true, the result takes the value from `x`; otherwise it takes the value from `y`.

In [21]:
# Replace values using np.where
salary_array = np.where(salary_array < 120000, 120000, salary_array)
salary_array

array([120000, 120000, 120000, 120000, 120441, 149248, 120000, 141818,
       120000, 139437])
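
Two related forms are worth knowing, sketched here on the original salaries (before the replacement above). For a simple lower bound like this one, `np.maximum` gives the same result, and calling `np.where` with only a condition returns the positions where the condition is true instead of replacing anything.

```python
import numpy as np

salary_array = np.array([111298, 110003, 103540, 116494, 120441,
                         149248, 107357, 141818, 102895, 139437])

# Three-argument form: pick 120000 where the condition holds, else the original value
floored = np.where(salary_array < 120000, 120000, salary_array)

# Same result for this particular case: element-wise maximum with the threshold
same_result = np.maximum(salary_array, 120000)

# One-argument form: returns a tuple of index arrays; [0] takes the positions
below_threshold_positions = np.where(salary_array < 120000)[0]  # array([0, 1, 2, 3, 6, 8])
```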

## Random Sampling

### Notes
* Generate random numbers or samples.
* `np.random.normal` - draws random samples from a normal (Gaussian) distribution. 
    * Arguments:
        * `loc`: The mean (`μ`) of the normal distribution.
        * `scale`: The standard deviation (`σ`) of the normal distribution, representing the dispersion from the mean.
        * `size`: The number of random samples to draw. In the example below it is set to match the number of job postings.
    * Syntax: `np.random.normal(loc=0.0, scale=1.0, size=None)`
* A few other random sampling functions: 
    * `np.random.rand`
    * `np.random.randn`
    * `np.random.randint`
    * `np.random.random`
    * `np.random.uniform`
    * `np.random.binomial`
    * `np.random.poisson`
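
For a feel of what these return, here is a short sketch that calls each one with arbitrary, illustrative arguments. The seed is only there so repeated runs give the same values.

```python
import numpy as np

np.random.seed(0)  # fix the seed so repeated runs give the same values

uniform_01 = np.random.rand(3)                          # 3 floats in [0, 1)
standard_normal = np.random.randn(3)                    # 3 draws with mean 0 and std 1
dice_rolls = np.random.randint(1, 7, size=3)            # integers from 1 to 6 (7 is excluded)
also_uniform_01 = np.random.random(3)                   # floats in [0, 1), like rand
bonus_pct = np.random.uniform(5, 15, size=3)            # floats in [5, 15)
heads_in_10 = np.random.binomial(n=10, p=0.5, size=3)   # successes out of 10 trials
arrivals = np.random.poisson(lam=4, size=3)             # counts with an average rate of 4
```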

### Example

Let's add some random noise to the `salary_array` to simulate salary variations. We can generate random values from a normal distribution with a mean of 0 and a standard deviation of 5000, then add these values to the salaries.

**Why?** This can be used to simulate salary data for job postings if actual salary data isn't available, for instance, in modeling or simulation scenarios.

In [22]:
# Generate numbers based on normal distribution 
noise = np.random.normal(0, 5000, salary_array.size)
# Add these numbers to the salary_array
salary_array_with_noise = salary_array + noise
salary_array_with_noise

array([124219.08195846, 131116.06759323, 119455.49265214, 122182.62053755,
       118818.2917239 , 151255.86681034, 130457.25819974, 143396.35776544,
       120961.92775885, 130976.31025803])
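
Because the noise is random, the numbers above will differ every time the cell runs. If you need repeatable results, one option is NumPy's newer `Generator` interface with a fixed seed. A sketch, where the seed value 42 is an arbitrary choice:

```python
import numpy as np

salary_array = np.array([120000, 120000, 120000, 120000, 120441,
                         149248, 120000, 141818, 120000, 139437])

# A seeded generator produces the same sequence on every run
rng = np.random.default_rng(seed=42)
noise = rng.normal(loc=0, scale=5000, size=salary_array.size)
salary_array_with_noise = salary_array + noise

# A second generator with the same seed reproduces the same noise
rng_again = np.random.default_rng(seed=42)
noise_again = rng_again.normal(loc=0, scale=5000, size=salary_array.size)
```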